***

### Using real datasets (can also be hypothetically constructed by yourself), define the following feature types, and give example values from your dataset. How would you represent these features in a computer program? Numerical, Nominal, Date, Text, Image, Dependent variable

__Numerical__:
- The Abalone dataset has an attribute called "Rings", which is the number of rings counted on an abalone shell. This attribute is numerical. Because it is a count it can be stored as an integer, or as a floating-point variable if it will be scaled along with other features.

__Nominal__:
- The Abalone dataset has a nominal attribute called "Sex", which is either Male, Female, or Infant. Because these categories have no natural order, the attribute can be one-hot encoded, giving one 0/1 integer column per category.
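
As a rough illustration of one-hot encoding, the sketch below encodes a few made-up "Sex" values using only built-in Python. The sample values, the category codes `M`, `F`, `I`, and the helper name `one_hot` are assumptions chosen for the example, not taken from the assignment.

```python
# Hypothetical sample values for the Abalone "Sex" attribute
sex_values = ["M", "F", "I", "M"]

# Fixed category order so every row gets the same column layout
categories = ["M", "F", "I"]


def one_hot(value, categories):
    """Return a list with a 1 at the position of `value` and 0 elsewhere."""
    return [1 if value == category else 0 for category in categories]


encoded = [one_hot(value, categories) for value in sex_values]
print(encoded)  # [[1, 0, 0], [0, 1, 0], [0, 0, 1], [1, 0, 0]]
```

Each row contains exactly one 1, so no ordering between the categories is implied.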

__Date__:
- The Investing for Bitcoin for Oil dataset has an attribute called "date" in year-month-day form. The date can be split into three attributes (year, month, and day), each represented as an integer. The three resulting features can then be standardized or normalized along with the rest of the data in that dataset.
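
A minimal sketch of splitting a year-month-day string into three integer features follows. The date string is a made-up example, and the `%Y-%m-%d` format is an assumption about how the column is written.

```python
from datetime import datetime

# Hypothetical date string in year-month-day form
raw_date = "2020-03-15"

# Parse the string, then pull out each component as an integer
parsed = datetime.strptime(raw_date, "%Y-%m-%d")
year, month, day = parsed.year, parsed.month, parsed.day
print(year, month, day)  # 2020 3 15
```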

__Text__:
- Text data that takes only a few distinct values can be handled like nominal data. Using the same example as above, the Abalone "Sex" attribute is stored as text (Male, Female, or Infant), is represented in a program as a string, and can be one-hot encoded into integers.

__Image__:
- The MNIST handwritten digits dataset uses pixel values as features, as is the case with image data in general. Each pixel can be stored as an integer (MNIST intensities range from 0 to 255), but after standardization or normalization it becomes a floating-point value.
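
The integer-to-float step can be sketched on a tiny made-up grayscale patch. The pixel values are hypothetical, and dividing by 255 is just one common scaling choice that assumes 8-bit intensities.

```python
# Hypothetical 2x3 grayscale patch with 8-bit pixel intensities (0 to 255)
patch = [
    [0, 128, 255],
    [64, 32, 16],
]

# Scale every integer pixel into the range [0, 1] as a float
scaled = [[pixel / 255 for pixel in row] for row in patch]

for row in scaled:
    print([round(value, 3) for value in row])
```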

__Dependent Variable__:
- The Iris dataset has a "target" column that takes one of three classes: setosa, virginica, and versicolor. These values can be either one-hot encoded or represented as integers (1, 2, 3) prior to training.
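
The integer option can be sketched as a simple lookup table. The mapping order and the sample target values below are arbitrary assumptions made for illustration.

```python
# Assumed (arbitrary) ordering of the three Iris classes
class_names = ["setosa", "versicolor", "virginica"]

# Map each class name to an integer label starting at 1
label_to_int = {name: index for index, name in enumerate(class_names, start=1)}

# Hypothetical target column values
targets = ["virginica", "setosa", "versicolor", "setosa"]
encoded_targets = [label_to_int[name] for name in targets]
print(encoded_targets)  # [3, 1, 2, 1]
```

Integer labels imply an order that the classes do not really have, which is one reason to prefer one-hot encoding for models that treat the target as a number.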

***

### Using online resources, research and find other classifier performance metrics that are as common as the accuracy metric. Write down the mathematical equations and the meaning of the metrics that you found.

__Acronyms__:
- True Positive (TP)
- False Positive (FP)
- False Negative (FN)
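
To make the three counts concrete, the sketch below tallies them for a small made-up binary problem. The label lists are hypothetical, and treating `1` as the positive class is an assumption.

```python
# Hypothetical binary labels: 1 = positive class, 0 = negative class
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 1, 1, 0, 0, 0, 1, 0]

pairs = list(zip(y_true, y_pred))

# True positive: actual positive, predicted positive
tp = sum(1 for actual, predicted in pairs if actual == 1 and predicted == 1)
# False positive: actual negative, predicted positive
fp = sum(1 for actual, predicted in pairs if actual == 0 and predicted == 1)
# False negative: actual positive, predicted negative
fn = sum(1 for actual, predicted in pairs if actual == 1 and predicted == 0)

print(tp, fp, fn)  # 3 1 1
```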

__metric__:

- Precision

__equation:__ 

- precision = TP / (TP + FP) 

__meaning__:

- Precision is a common metric in object detection and classification problems. It is the fraction of positive predictions that are actually positive: of everything the model labeled positive, how much was correct. It is not the same as accuracy, which counts correct predictions over all samples.


__metric__:

- Recall

__equation__: 

- Recall = TP / (TP + FN)

__meaning__:

- Recall is also a common metric in object detection and classification. It is the fraction of actual positives that the model correctly identified, so a low recall means many missed detections (false negatives).


__metric__:

- F1 Score

__equation__:

- F1 = 2 × (precision × recall) / (precision + recall)

__meaning__:

- Recall and precision each measure how well one aspect of a model performed, but a more comprehensive score is the F1 score. It combines precision and recall into a single number, their harmonic mean, so it is high only when both are high. This makes it an especially good metric for use cases that need both high precision and high recall.
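
The three equations can be sketched as one small function. The counts passed in are made up, the function name `precision_recall_f1` is hypothetical, and returning 0.0 when a denominator is zero is an assumed convention rather than part of the definitions.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall, and F1 from raw counts."""
    # Assumed convention: report 0.0 when a ratio is undefined
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    if precision + recall == 0:
        f1 = 0.0
    else:
        f1 = 2 * (precision * recall) / (precision + recall)
    return precision, recall, f1


# Hypothetical counts: 3 true positives, 1 false positive, 1 false negative
precision, recall, f1 = precision_recall_f1(tp=3, fp=1, fn=1)
print(precision, recall, f1)  # 0.75 0.75 0.75
```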
 



***

### Implement a correlation program from scratch to look at the correlations between the features of the Admission_Predict.csv dataset file (not provided; you have to download it yourself by following the instructions in the module Jupyter notebook). Display the correlation matrix where each row and column is a feature, which should be an 8 by 8 matrix (should we use 'Serial no'?). You can use pandas DataFrame.corr() to verify the correctness of yours. Remember, you are not allowed to use the numpy methods mean() and stdev(), or other libraries, for mean or standard deviation. Observe that the diagonal of this matrix should have all 1's and explain why. Since the last column can be used as the target (dependent) variable, what do you think about the correlations between all the variables? Which variable should be the most important for prediction of 'Chance of Admit'?

***
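
For reference, the Pearson correlation coefficient between two features can be written as below. This is the standard textbook definition rather than notation taken from the assignment. Here $x_i$ and $y_i$ are assumed to denote the $i$-th values of two columns, $\bar{x}$ and $\bar{y}$ their means, and $n$ the number of rows.

```latex
r_{xy} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}
              {\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}}
```

Setting $y = x$ makes the numerator equal to the denominator, so $r_{xx} = 1$ for any non-constant column, which is why the diagonal of the correlation matrix holds all 1's.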

In [None]:
##imports
import os
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
##get current working directory
cwd = os.getcwd()

##get data path and open as a pandas dataframe
data_path = os.path.join(cwd, 'data', 'Admission_Predict.csv')
df = pd.read_csv(data_path)

In [None]:
##take a look at the data
df.head()
len(df)

In [None]:
class CalcPearsonCorrelation:
    
    """Calculates the Pearson Correlation Matrix"""
    
    def __init__(self, target_col_name: str) -> None:
        
        """
        Initialize the CalcPearsonCorrelation Object
        
        :param target_col_name: Name of the dependent variable 
        """
        
        self.target_col_name = target_col_name  
    
    def calcColumnMean(self, sample_column: list) -> float:
        
        """
        Calculates the mean of a single column of a dataset
        
        :param sample_column: Column values as a list
        """
        
        ##calculate column mean without rounding, since rounding here would
        ##introduce error into every statistic computed from the mean
        col_sum = sum(sample_column)
        col_len = len(sample_column)
        
        return col_sum / col_len
        
    def calcCovariance(self, col_vals: np.ndarray, col_mean: float, targ_vals: np.ndarray, target_mean: float) -> float:
        
        """
        Calculates the unnormalized covariance between a column and the target column
        (the sum of products of deviations, not divided by n)
        
        :param col_vals: Array of a single column of dataset
        :param col_mean: Mean value of column of dataset
        :param targ_vals: Array of target column values
        :param target_mean: Mean value of target column of dataset
        """
        
        ##use the target_mean argument rather than a global variable
        sum_vals = ((col_vals - col_mean) * (targ_vals - target_mean)).sum()
        
        return sum_vals
    
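    def calcStd(self, col_vals: np.ndarray, col_mean: float) -> float:
        """ 
        Calculates the unnormalized standard deviation of an attribute: the square
        root of the sum of squared deviations. The division by n is omitted because
        calcCovariance also omits it, so the factors cancel in the Pearson coefficient.
        
        :param col_vals: Array of column values
        :param col_mean: Mean of column values
        """
        
        ##the square root is required; without it the result is a sum of squares
        std = ((col_vals - col_mean) ** 2).sum() ** 0.5
        
        return std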
    
    def calcPearsonCoef(self, xy_cov: float, x_std: float, y_std: float) -> float:
        
        """ 
        Calculates the Pearson correlation coefficient of two variables
        
        :param xy_cov: Covariance of x and y values
        :param x_std: Standard deviation of x
        :param y_std: standard deviation of y
        """
        
        coef = xy_cov / (x_std * y_std)
        
        return coef   

In [None]:
target_name = 'Chance of Admit '

##instantiate correlation object
calc_obj = CalcPearsonCorrelation(target_name)

##calculate target mean value and difference
targ_mean = calc_obj.calcColumnMean(list(df[target_name]))
targ_vals = np.array(df[target_name])
targ_std = calc_obj.calcStd(targ_vals, targ_mean)

##use every column except the serial-number identifier, which carries no predictive meaning
attribs = [col for col in df.columns if not col.lower().startswith('serial')]

##correlation matrix with one row and one column per attribute
array_size = len(attribs)
coef_array = np.zeros((array_size, array_size))

##iterate through all pairs of attributes (row attribute i, column attribute j)
for i in range(array_size):
    
    x_vals = np.array(df[attribs[i]], dtype=float)
    x_mean = calc_obj.calcColumnMean(list(x_vals))
    x_std = calc_obj.calcStd(x_vals, x_mean)
    
    for j in range(array_size):

        y_vals = np.array(df[attribs[j]], dtype=float)
        y_mean = calc_obj.calcColumnMean(list(y_vals))
        y_std = calc_obj.calcStd(y_vals, y_mean)

        ##calculate the Pearson correlation coefficient and save to the matrix
        xy_cov = calc_obj.calcCovariance(x_vals, x_mean, y_vals, y_mean)
        coef = calc_obj.calcPearsonCoef(xy_cov, x_std, y_std)
        coef_array[i][j] = round(coef, 5)
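
One way to sanity-check a pairwise correlation loop without the dataset at hand is to run the same calculation on a few toy columns where the answer is easy to reason about. The sketch below is independent of the class above: the `pearson` helper and the toy columns are hypothetical, and only built-in Python plus `math.sqrt` are used.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    # Sum of products of deviations (unnormalized covariance)
    cov = sum((a - x_mean) * (b - y_mean) for a, b in zip(x, y))
    # Sums of squared deviations for each sequence
    x_ss = sum((a - x_mean) ** 2 for a in x)
    y_ss = sum((b - y_mean) ** 2 for b in y)
    return cov / math.sqrt(x_ss * y_ss)


# Hypothetical toy columns, not taken from Admission_Predict.csv
columns = {
    "a": [1.0, 2.0, 3.0, 4.0, 5.0],
    "b": [2.0, 1.0, 4.0, 3.0, 5.0],
    "c": [5.0, 4.0, 3.0, 2.0, 1.0],  # "a" reversed, so perfectly anti-correlated
}

names = list(columns)
matrix = [[pearson(columns[r], columns[c]) for c in names] for r in names]

for name, row in zip(names, matrix):
    print(name, [round(value, 5) for value in row])
```

The diagonal comes out as 1 because each column is compared with itself, and the matrix is symmetric because swapping the two columns does not change the formula.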
            