## 1. [20 pts] At a high-level, without entering into mathematical details, compare and contrast the following classifiers:
### Perceptron (textbook's version)
Perceptron classifies based on learned weights. It determines these weights by reading in each training sample, computing a prediction, and then updating the weights depending on whether that prediction was correct.<br>
Perceptron will converge if the classes are linearly separable.<br>
Perceptron works by computing a decision boundary between the classes, and focuses on minimizing the number of misclassified samples.<br>
Perceptron is useful for getting a quick linear boundary on a dataset.<br>
Perceptrons can be trained online, which means the data can be input one sample at a time and the algorithm will update the weights after each one. A standard SVM is trained on the whole dataset at once, so in some cases Perceptron is the better fit from an integrated systems standpoint.<br>
Perceptron is distance based, so the algorithm prefers numerical features.<br>
The textbook's version of Perceptron is a linear classifier with no kernel, so it could be the fastest of these algorithms, depending on the depth of the decision tree it is compared against.
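
The online update described above can be sketched in a few lines. This is a minimal illustration, assuming labels in {-1, +1}, a learning rate of 1, and a tiny made-up dataset; the function name `train_perceptron` is hypothetical, and the textbook's exact formulation may differ in details.

```python
def train_perceptron(samples, labels, epochs=10):
    """Online perceptron: update the weights only when a sample is misclassified."""
    weights = [0.0] * len(samples[0])
    bias = 0.0
    for _ in range(epochs):
        mistakes = 0
        for x, y in zip(samples, labels):  # one sample at a time (online)
            activation = sum(w * xi for w, xi in zip(weights, x)) + bias
            if y * activation <= 0:  # wrong side of (or exactly on) the boundary
                weights = [w + y * xi for w, xi in zip(weights, x)]
                bias += y
                mistakes += 1
        if mistakes == 0:  # a full pass with no errors: converged
            break
    return weights, bias


# Hypothetical, linearly separable toy data (labels are +1 / -1).
X = [[2.0, 1.0], [3.0, 2.0], [-1.0, -1.5], [-2.0, -1.0]]
y = [1, 1, -1, -1]
w, b = train_perceptron(X, y)
predictions = [1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else -1 for x in X]
print(predictions)
```

Because the weights change only on mistakes, training stops as soon as one full pass produces no errors, which is only guaranteed to happen when the classes are linearly separable.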
### SVM
Support vector machines work like Perceptron by creating a decision boundary between classes, but instead of focusing only on minimizing errors, SVM tries to maximize the margin between the boundary and the closest samples.<br>
The reason that maximizing the margin can be better than minimizing errors is there is less of a chance of overfitting.<br>
SVM, combined with good pre-processing, is a useful model that can be deployed with high accuracy metrics.<br>
Like Perceptron, SVM is distance based, so the algorithm prefers numerical features.
<br>
The speed of SVM depends on the kernel it uses, but in comparison to the other classifiers, it will take longer than Perceptron and decision trees. Depending on the kernel and the number of trees, it may be quicker than a random forest.
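
The difference between "minimizing errors" and "maximizing the margin" shows up in the per-sample cost each method charges. The sketch below assumes labels in {-1, +1} and a linear score `w.x + b`; the function names are hypothetical and the scores are made up for illustration.

```python
def perceptron_loss(y, score):
    """Zero as soon as the sample is on the correct side of the boundary."""
    return max(0.0, -y * score)


def hinge_loss(y, score):
    """SVM-style loss: also penalizes correct samples that fall inside the margin."""
    return max(0.0, 1.0 - y * score)


# A positive sample (y = +1) at three hypothetical scores w.x + b.
for score in (-0.5, 0.3, 2.0):
    print(score, perceptron_loss(1, score), hinge_loss(1, score))
```

At a score of 0.3 the sample is classified correctly, so the perceptron cost is zero, but the hinge loss is still positive because the sample sits inside the margin.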
### Decision Tree
Decision Trees make classifications based on a series of binary questions. These binary questions usually result in ranges for each feature; for example, a sample with 1 < petal width < 1.5 and 2 < sepal width < 2.5 will be classified as Setosa.<br>
Decision trees are useful for interpreting the data. They are visual and logical, which can be good for understanding data and helping customers understand what data means.<br>
Decision trees use entropy, so the algorithm prefers nominal features.
<br>
Decision trees are pretty fast and if speed is an issue, the depth can be reduced. This will affect accuracy but as I state below, accuracy is not the greatest strength of decision trees.
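
To make the entropy criterion concrete, here is a small sketch of how a candidate binary question could be scored. It assumes Shannon entropy in bits and information gain as the split measure; the function names and the toy labels are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())


def information_gain(parent, left, right):
    """Entropy reduction achieved by splitting `parent` into `left` and `right`."""
    weighted = (len(left) * entropy(left) + len(right) * entropy(right)) / len(parent)
    return entropy(parent) - weighted


# Hypothetical labels before and after one binary question.
parent = ["setosa", "setosa", "versicolor", "versicolor"]
left, right = ["setosa", "setosa"], ["versicolor", "versicolor"]
print(entropy(parent), information_gain(parent, left, right))
```

A tree builder would evaluate many candidate questions this way and keep the one with the largest gain at each node.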
### Random Forest (you have to research a bit about this classifier)
Random forests are built from decision trees. Random subsets of the dataset are drawn and each one is used to train a separate decision tree. The output classification of the random forest is the majority vote of those decision trees. <br>
Random forests will usually provide better accuracy than a single decision tree, but will take longer to train because they have to build multiple decision trees.<br>
Like decision trees, random forests use entropy, so the algorithm prefers nominal features.
<br>
Both decision trees and random forests are able to classify input quickly once the algorithms are trained.<br>
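
The two ingredients described above, random subsets and a majority vote, can be sketched as follows. This assumes the usual bootstrap approach (sampling rows with replacement); the helper names are hypothetical and the trees themselves are left out.

```python
import random
from collections import Counter


def bootstrap_sample(rows, rng):
    """Draw len(rows) rows with replacement: the random subset given to one tree."""
    return [rng.choice(rows) for _ in rows]


def majority_vote(votes):
    """Return the class predicted by the largest number of trees."""
    return Counter(votes).most_common(1)[0][0]


rng = random.Random(0)  # fixed seed so the sketch is repeatable
rows = list(range(10))  # stand-in for ten training rows
subsets = [bootstrap_sample(rows, rng) for _ in range(3)]

# Hypothetical predictions from three trees for a single input.
print(majority_vote(["versicolor", "setosa", "versicolor"]))
```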
#### Some comparison criterion can be, Does the method solve an optimization problem, if yes what is the cost function? Speed? Strength? Robustness? Feature type that the classifier naturally uses (e.g. based on the comparison measure, such as entropy or distance) Which one will be the first that you would try on your dataset?
Overall, the first algorithm I would try on a dataset depends on the data itself. If the data is numerical, then I would try SVM. Even though it will take longer than Perceptron, it is less prone to overfitting and should give more reliable accuracy metrics. A trustworthy baseline is helpful for determining what preprocessing needs to be done.<br>
On the other hand, if the data is mostly nominal, I would choose a decision tree. Decision trees are pretty fast, and I would be using the tree mainly to learn about the data rather than for the accuracy metrics themselves. There are libraries like dtreeviz that will draw a graph of the decision tree, and from this we can see what the most important features might be.

## 2. [20 pts] Using real datasets (can also be hypothetically constructed by yourself) define the following feature types, and give example values from your dataset. How would you represent these features in a computer program? (e.g. 32-bit integer? Floating point? String?)
### Numerical
Numerical feature types are features whose values are numbers anywhere from -INF to INF. They can be stored as any of the number types, such as integers or floats, but floats are often the better choice because they capture decimal values and don't necessarily take longer to process.
Below in the Numerical and Image section, you can see each of the pixel features are numerical, and they are represented by floats.
### Nominal
Nominal feature types are features with a finite number of categories. The categories can be represented by many data types such as text, numbers, or booleans. In a computer program, these features can either be represented as the category name itself, or an integer representing the category. In output, the category name itself will probably be more useful.
Below in the nominal section, you can see two different ways the nominal data can be represented: the actual label_name, and the corresponding label_num. I also showed how to convert nominal data into integer categories using scikit-learn's LabelEncoder; the result is stored in the One_Hot column, although this is label encoding (one integer per category) rather than a true one-hot encoding.
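
Since integer codes and one-hot vectors are easy to confuse, here is a small plain-Python sketch of both. The function names are hypothetical; assigning codes in sorted category order is an assumption chosen to match how the integer labels come out in the iris table.

```python
def label_encode(values):
    """Map each category to an integer code (codes follow sorted category order)."""
    categories = sorted(set(values))
    codes = {category: code for code, category in enumerate(categories)}
    return [codes[v] for v in values], categories


def one_hot_encode(values):
    """Map each category to a 0/1 vector with a single 1 in that category's position."""
    encoded, categories = label_encode(values)
    return [[1 if code == position else 0 for position in range(len(categories))]
            for code in encoded]


names = ["setosa", "virginica", "versicolor", "setosa"]
print(label_encode(names)[0])
print(one_hot_encode(names))
```

Integer codes imply an ordering between categories, which distance-based classifiers may pick up on; one-hot vectors avoid that at the cost of one column per category.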
### Text
Text data is harder to use in machine learning algorithms. Text data is a sequence of characters and would be input as a string. In order to use it, bag-of-words algorithms first build a vocabulary from these strings. Using the vocabulary, each text ends up being represented by an array with one entry per vocabulary word, holding either a boolean (word present or not) or a count.
Below in the text section, you can see a dataframe of text strings.
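
The bag-of-words conversion can be sketched without any library. This is a simplified illustration that assumes lowercase whitespace tokenization and word counts; the function name and the two documents are made up.

```python
from collections import Counter


def bag_of_words(documents):
    """Return the sorted vocabulary and one word-count vector per document."""
    tokenized = [doc.lower().split() for doc in documents]
    vocabulary = sorted({word for tokens in tokenized for word in tokens})
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append([counts[word] for word in vocabulary])
    return vocabulary, vectors


# Two made-up documents; real text would also need punctuation handling.
docs = ["the car is fast", "the car is the best car"]
vocab, vectors = bag_of_words(docs)
print(vocab)
print(vectors)
```

Replacing each count with `count > 0` would give the boolean variant of the same representation.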
### Image
Image feature types are represented as a matrix of pixels. Images usually get flattened into an array of pixels, which are numerical values.
Below in the Numerical and Image section, you can see images represented as an array of numerical pixels, one column per pixel position.
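
Flattening is a simple row-by-row concatenation. The sketch below uses a made-up 3x3 image to keep the output short; the digits images are 8x8, which would give 64 features named by row and column (pixel_0_0, pixel_0_1, and so on).

```python
def flatten(image):
    """Concatenate the rows of a 2-D pixel matrix into a single feature list."""
    return [pixel for row in image for pixel in row]


# A made-up 3x3 grayscale image.
image = [
    [0.0, 5.0, 0.0],
    [0.0, 13.0, 0.0],
    [0.0, 9.0, 0.0],
]
features = flatten(image)
print(len(features), features)
```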
### Dependent variable
Dependent feature types are features that depend on the independent variables of the data. Dependent variables are the desired classifications or labels. They can be numerical or nominal. If they are nominal, it may be best to convert them to integers.
Below in the section on nominal and dependent variables, you can see that label_name, label_num, and One_Hot are all dependent variables with different types.

In [1]:
import math
import pandas as pd
import numpy as np
from sklearn import datasets
from sklearn.preprocessing import LabelEncoder

## Numerical and Image

In [2]:
images = datasets.load_digits()
imagesDF = pd.DataFrame(images.data, columns=images.feature_names)
imagesDF

,pixel_0_0,pixel_0_1,pixel_0_2,pixel_0_3,pixel_0_4,pixel_0_5,pixel_0_6,pixel_0_7,pixel_1_0,pixel_1_1,...,pixel_6_6,pixel_6_7,pixel_7_0,pixel_7_1,pixel_7_2,pixel_7_3,pixel_7_4,pixel_7_5,pixel_7_6,pixel_7_7
0,0.0,0.0,5.0,13.0,9.0,1.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,6.0,13.0,10.0,0.0,0.0,0.0
1,0.0,0.0,0.0,12.0,13.0,5.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,11.0,16.0,10.0,0.0,0.0
2,0.0,0.0,0.0,4.0,15.0,12.0,0.0,0.0,0.0,0.0,...,5.0,0.0,0.0,0.0,0.0,3.0,11.0,16.0,9.0,0.0
3,0.0,0.0,7.0,15.0,13.0,1.0,0.0,0.0,0.0,8.0,...,9.0,0.0,0.0,0.0,7.0,13.0,13.0,9.0,0.0,0.0
4,0.0,0.0,0.0,1.0,11.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,2.0,16.0,4.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1792,0.0,0.0,4.0,10.0,13.0,6.0,0.0,0.0,0.0,1.0,...,4.0,0.0,0.0,0.0,2.0,14.0,15.0,9.0,0.0,0.0
1793,0.0,0.0,6.0,16.0,13.0,11.0,1.0,0.0,0.0,0.0,...,1.0,0.0,0.0,0.0,6.0,16.0,14.0,6.0,0.0,0.0
1794,0.0,0.0,1.0,11.0,15.0,1.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,2.0,9.0,13.0,6.0,0.0,0.0
1795,0.0,0.0,2.0,10.0,7.0,0.0,0.0,0.0,0.0,0.0,...,2.0,0.0,0.0,0.0,5.0,12.0,16.0,12.0,0.0,0.0


## Nominal and Dependent

In [3]:
iris = datasets.load_iris()
irisDF = pd.DataFrame(iris.data, columns=iris.feature_names) 
irisDF['label_num'] = iris.target
irisDF['label_name'] = [iris.target_names[v] for v in iris.target]
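# LabelEncoder assigns one integer per class (label encoding, not a true one-hot
# encoding), so the 'One_Hot' column ends up identical to 'label_num'.
labelencoder = LabelEncoder()
irisDF['One_Hot'] = labelencoder.fit_transform(irisDF['label_name'])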
irisDF

,sepal length (cm),sepal width (cm),petal length (cm),petal width (cm),label_num,label_name,One_Hot
0,5.1,3.5,1.4,0.2,0,setosa,0
1,4.9,3.0,1.4,0.2,0,setosa,0
2,4.7,3.2,1.3,0.2,0,setosa,0
3,4.6,3.1,1.5,0.2,0,setosa,0
4,5.0,3.6,1.4,0.2,0,setosa,0
...,...,...,...,...,...,...,...
145,6.7,3.0,5.2,2.3,2,virginica,2
146,6.3,2.5,5.0,1.9,2,virginica,2
147,6.5,3.0,5.2,2.0,2,virginica,2
148,6.2,3.4,5.4,2.3,2,virginica,2


## Text

In [4]:
text = datasets.fetch_20newsgroups()
textDF = pd.DataFrame(text.data).head(50)
textDF

,0
0,From: lerxst@wam.umd.edu (where's my thing)\nS...
1,From: guykuo@carson.u.washington.edu (Guy Kuo)...
2,From: twillis@ec.ecn.purdue.edu (Thomas E Will...
3,From: jgreen@amber (Joe Green)\nSubject: Re: W...
4,From: jcm@head-cfa.harvard.edu (Jonathan McDow...
5,From: dfo@vttoulu.tko.vtt.fi (Foxvog Douglas)\...
6,From: bmdelane@quads.uchicago.edu (brian manni...
7,From: bgrubb@dante.nmsu.edu (GRUBB)\nSubject: ...
8,From: holmes7000@iscsvax.uni.edu\nSubject: WIn...
9,From: kerr@ux1.cso.uiuc.edu (Stan Kerr)\nSubje...


## 3. [20 pts] Using online resources, research and find other classifier performance metrics which are also as common as the accuracy metric. Write down the mathematical equations and the meaning of the metrics that you found.
Some important definitions<br>
True Positive: Actually yes and predicted yes<br>
True Negative: Actually no and predicted no<br>
False Positive: Actually no and predicted yes<br>
False Negative: Actually yes and predicted no<br>
### Accuracy
Accuracy = $\frac{\text{True Positive} + \text{True Negative}}{\text{True Positive} + \text{True Negative} + \text{False Positive} + \text{False Negative}}$<br>
This finds the ratio of correct predictions to the total predictions. <br>
Good when data is balanced without skew.
### Precision
Precision = $\frac{\text{True Positive}}{\text{True Positive} + \text{False Positive}}$<br>
This finds the ratio of correct positive predictions to total positive predictions. <br>
### Recall
Recall = $\frac{\text{True Positive}}{\text{True Positive} + \text{False Negative}}$<br>
This finds the ratio of correct positive predictions to actual positives.<br>
Accuracy, precision, and recall are usually used in tandem.
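
The three ratios above can be computed directly from the four counts. The sketch below assumes a binary problem with 1 as the positive class; the function name and the label lists are made up for illustration.

```python
def confusion_counts(actual, predicted, positive=1):
    """Count true/false positives and negatives for a binary problem."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == positive and p == positive)
    tn = sum(1 for a, p in pairs if a != positive and p != positive)
    fp = sum(1 for a, p in pairs if a != positive and p == positive)
    fn = sum(1 for a, p in pairs if a == positive and p != positive)
    return tp, tn, fp, fn


actual = [1, 1, 1, 0, 0, 0, 0, 0]
predicted = [1, 1, 0, 1, 0, 0, 0, 0]
tp, tn, fp, fn = confusion_counts(actual, predicted)

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
print(accuracy, precision, recall)
```

With these labels the accuracy is higher than both precision and recall, which illustrates why accuracy alone can look flattering when negatives dominate.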
### Logarithmic Loss
N samples, M classes <br>
$y_{ij}$ is 1 if sample $i$ belongs to class $j$ and 0 otherwise<br>
$p_{ij}$ is the predicted probability that sample $i$ belongs to class $j$<br>
LogLoss = $-\frac{1}{N}\sum_{i=1}^{N}\sum_{j=1}^{M} y_{ij} \log(p_{ij})$<br>
Logarithmic loss works well for multiple classes because it penalizes confident misclassifications most heavily. It requires the model to output a probability for each class rather than only a label.
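
A direct translation of the formula is sketched below. It assumes the true class of each sample is given as an index (so only one term of the inner sum is nonzero) and clips probabilities to avoid log(0); the function name, the `eps` parameter, and the probability tables are made up.

```python
import math


def log_loss(y_true, probabilities, eps=1e-15):
    """Average negative log-probability assigned to the true class.

    y_true[i] is the index of the true class of sample i, so y_ij is 1 only for
    j == y_true[i] and the inner sum keeps a single term per sample.
    """
    total = 0.0
    for true_class, probs in zip(y_true, probabilities):
        p = min(max(probs[true_class], eps), 1.0 - eps)  # clip to avoid log(0)
        total += math.log(p)
    return -total / len(y_true)


# Made-up predicted probabilities for 3 samples and 3 classes.
y_true = [0, 1, 2]
confident = [[0.9, 0.05, 0.05], [0.1, 0.8, 0.1], [0.05, 0.05, 0.9]]
unsure = [[0.4, 0.3, 0.3], [0.3, 0.4, 0.3], [0.3, 0.3, 0.4]]
print(log_loss(y_true, confident), log_loss(y_true, unsure))
```

Both sets of predictions pick the right class every time, so their accuracy is identical, yet the log loss is lower for the confident one.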
### Confusion Matrix
A confusion matrix tabulates the model's predictions against the actual classes, so it shows the raw counts behind the other metrics rather than a single summary number.<br>
For a binary problem, the table shows True Positives, True Negatives, False Positives, and False Negatives.
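
A confusion matrix is just a table of (actual, predicted) pair counts. The sketch below assumes rows are actual classes and columns are predicted classes, which is one common layout; the function name and labels are made up.

```python
from collections import Counter


def confusion_matrix(actual, predicted, classes):
    """Rows are actual classes, columns are predicted classes."""
    counts = Counter(zip(actual, predicted))
    return [[counts[(a, p)] for p in classes] for a in classes]


classes = ["no", "yes"]
actual = ["yes", "yes", "yes", "no", "no", "no", "no", "no"]
predicted = ["yes", "yes", "no", "yes", "no", "no", "no", "no"]
matrix = confusion_matrix(actual, predicted, classes)
for label, row in zip(classes, matrix):
    print(label, row)
```

The same function works for more than two classes, since it only depends on the list of class names.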
### F1 Score
F1 = $2 \cdot \frac{\text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}$<br>
F1 score is reliant on both precision and recall (it is their harmonic mean), so it gauges the number of correct positive predictions against both total positive predictions and actual positives. The score can also be weighted (the F-beta score) if precision is more important than recall in a problem, or vice versa.
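
The weighted variant mentioned above is usually written as the F-beta score, which reduces to F1 when beta is 1. A small sketch, with a hypothetical function name and made-up precision and recall values:

```python
def f_beta(precision, recall, beta=1.0):
    """F-beta score; beta = 1 gives F1, beta > 1 favors recall, beta < 1 favors precision."""
    if precision == 0 and recall == 0:
        return 0.0  # avoid dividing by zero when both are zero
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Made-up precision and recall values.
precision, recall = 0.5, 1.0
print(f_beta(precision, recall))            # F1
print(f_beta(precision, recall, beta=2.0))  # weights recall more heavily
```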
### Mean Squared Error
MSE = $\frac{1}{N}\sum_{i=1}^{N}(x_i-\hat{x}_i)^2$<br>
MSE is the average of the squared differences between the original values and the predicted values. It applies when the prediction is a number rather than a class label.
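
A minimal sketch of the MSE formula, using a hypothetical function name and made-up actual and predicted values:

```python
def mean_squared_error(actual, predicted):
    """Average of the squared differences between actual and predicted values."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


# Made-up target values and predictions.
actual = [0.92, 0.76, 0.72]
predicted = [0.90, 0.80, 0.70]
print(mean_squared_error(actual, predicted))
```

Squaring means that one large error contributes more than several small ones of the same total size.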

## 4. [40 pts] Implement a correlation program from scratch to look at the correlations between the features of Admission_Predict.csv dataset file (not provided, you have to download it by yourself by following the instructions in the module Jupyter notebook). Display the correlation matrix where each row and column are the features, which should be an 8 by 8 matrix (should we use 'Serial no'?). You can use pandas DataFrame.corr() to verify correctness of yours.
### Remember, you are not allowed to use numpy methods like mean(), etc. Observe that the diagonal of this matrix should have all 1's and explain why? Since the last column can be used as the target (dependent) variable, what do you think about the correlations between all the variables? Which variable should be the most important for prediction of 'Chance of Admit'?
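The diagonal entries will all be 1 because the covariance of a column with itself is just its standard deviation squared (its variance). The covariance and variance formulas are nearly the same, except that covariance multiplies the deviations of x by the deviations of y. If x and y are the same column, that product is just the deviation squared. The correlation divides the covariance by the product of the two standard deviations, which here is also the standard deviation squared, making the correlation 1.<br>
A high correlation between a feature and the dependent variable suggests that the feature would be an effective variable for predicting Chance of Admit. Most features are also strongly correlated with each other (for example, GRE Score and TOEFL Score at 0.84), so they carry overlapping information. CGPA has the highest correlation with Chance of Admit (0.87), so it would be the most important for the prediction. Serial No. is only a row identifier, and its correlations with every other column are close to zero, so it should not be used as a feature.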
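
The argument for the diagonal can be written out as a short derivation. It assumes the sample (n - 1) definitions of covariance and standard deviation; the symbols $\bar{x}$ (mean of x) and $s_x$ (standard deviation of x) are introduced here for illustration.

$$
\operatorname{cov}(x, x) = \frac{1}{n-1}\sum_{i=1}^{n}(x_i-\bar{x})(x_i-\bar{x}) = \frac{1}{n-1}\sum_{i=1}^{n}(x_i-\bar{x})^2 = s_x^2
$$

$$
\operatorname{corr}(x, x) = \frac{\operatorname{cov}(x, x)}{s_x \, s_x} = \frac{s_x^2}{s_x^2} = 1
$$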


In [5]:
admissions = pd.read_csv('Admission_Predict.csv')
admissions

,Serial No.,GRE Score,TOEFL Score,University Rating,SOP,LOR,CGPA,Research,Chance of Admit
0,1,337,118,4,4.5,4.5,9.65,1,0.92
1,2,324,107,4,4.0,4.5,8.87,1,0.76
2,3,316,104,3,3.0,3.5,8.00,1,0.72
3,4,322,110,3,3.5,2.5,8.67,1,0.80
4,5,314,103,2,2.0,3.0,8.21,0,0.65
...,...,...,...,...,...,...,...,...,...
395,396,324,110,3,3.5,3.5,9.04,1,0.82
396,397,325,107,3,3.0,3.5,9.11,1,0.84
397,398,330,116,4,5.0,4.5,9.45,1,0.91
398,399,312,103,3,3.5,4.0,8.78,0,0.67


In [6]:
admissions.corr()

,Serial No.,GRE Score,TOEFL Score,University Rating,SOP,LOR,CGPA,Research,Chance of Admit
Serial No.,1.0,-0.097526,-0.147932,-0.169948,-0.166932,-0.088221,-0.045608,-0.063138,0.042336
GRE Score,-0.097526,1.0,0.835977,0.668976,0.612831,0.557555,0.83306,0.580391,0.80261
TOEFL Score,-0.147932,0.835977,1.0,0.69559,0.657981,0.567721,0.828417,0.489858,0.791594
University Rating,-0.169948,0.668976,0.69559,1.0,0.734523,0.660123,0.746479,0.447783,0.71125
SOP,-0.166932,0.612831,0.657981,0.734523,1.0,0.729593,0.718144,0.444029,0.675732
LOR,-0.088221,0.557555,0.567721,0.660123,0.729593,1.0,0.670211,0.396859,0.669889
CGPA,-0.045608,0.83306,0.828417,0.746479,0.718144,0.670211,1.0,0.521654,0.873289
Research,-0.063138,0.580391,0.489858,0.447783,0.444029,0.396859,0.521654,1.0,0.553202
Chance of Admit,0.042336,0.80261,0.791594,0.71125,0.675732,0.669889,0.873289,0.553202,1.0


In [7]:
def meanSeries(series):
    # Arithmetic mean computed with a plain loop (no numpy/pandas helpers).
    total = 0
    for index, value in series.items():
        total = total + value
    return total / len(series)

def stdevSeries(series):
    # Sample standard deviation (n - 1 denominator, the same default pandas uses).
    mean = meanSeries(series)
    total = 0
    for index, value in series.items():
        total = total + (value - mean) ** 2
    return math.sqrt(total / (len(series) - 1))

def covar(x, y):
    # Sample covariance: cov(X, Y) = sum((x - mean(X)) * (y - mean(Y))) / (n - 1)
    meanX = meanSeries(x)
    meanY = meanSeries(y)
    total = 0
    for i in range(len(x)):
        # .iloc is positional, so this does not depend on the index labels being 0..n-1
        total = total + (x.iloc[i] - meanX) * (y.iloc[i] - meanY)
    return total / (len(x) - 1)

def corr(x, y):
    # Pearson correlation: covariance scaled by both standard deviations.
    return covar(x, y) / (stdevSeries(x) * stdevSeries(y))


In [8]:
columns = admissions.columns
corrPD = pd.DataFrame(0.0, index=columns, columns=columns)
for col in columns:
    # One column of the matrix: correlation of `col` with every feature, rounded for display.
    corrPD[col] = [round(corr(admissions[col], admissions[other]), 6) for other in columns]
corrPD

,Serial No.,GRE Score,TOEFL Score,University Rating,SOP,LOR,CGPA,Research,Chance of Admit
Serial No.,1.0,-0.097526,-0.147932,-0.169948,-0.166932,-0.088221,-0.045608,-0.063138,0.042336
GRE Score,-0.097526,1.0,0.835977,0.668976,0.612831,0.557555,0.83306,0.580391,0.80261
TOEFL Score,-0.147932,0.835977,1.0,0.69559,0.657981,0.567721,0.828417,0.489858,0.791594
University Rating,-0.169948,0.668976,0.69559,1.0,0.734523,0.660123,0.746479,0.447783,0.71125
SOP,-0.166932,0.612831,0.657981,0.734523,1.0,0.729593,0.718144,0.444029,0.675732
LOR,-0.088221,0.557555,0.567721,0.660123,0.729593,1.0,0.670211,0.396859,0.669889
CGPA,-0.045608,0.83306,0.828417,0.746479,0.718144,0.670211,1.0,0.521654,0.873289
Research,-0.063138,0.580391,0.489858,0.447783,0.444029,0.396859,0.521654,1.0,0.553202
Chance of Admit,0.042336,0.80261,0.791594,0.71125,0.675732,0.669889,0.873289,0.553202,1.0
