### Task 1
For this task, I will still be using the data about people's age, gender, salary,
and whether or not they purchased a vehicle. Instead of logistic
regression, a decision tree will be used.

In [4]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn import tree
from sklearn.model_selection import cross_val_score
import time

df = pd.read_csv("car_data.csv")
df

Unnamed: 0,User ID,Gender,Age,AnnualSalary,Purchased
0,385,Male,35,20000,0
1,681,Male,40,43500,0
2,353,Male,49,74000,0
3,895,Male,40,107500,1
4,661,Male,25,79000,0
...,...,...,...,...,...
995,863,Male,38,59000,0
996,800,Female,47,23500,0
997,407,Female,28,138500,1
998,299,Female,48,134000,1


In [6]:
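def metric(pred, act):
    """Return (accuracy, precision, recall) for binary labels in {0, 1}."""
    # flatten so that (N,) and (N, 1) inputs are compared element by element
    pred = np.asarray(pred).ravel()
    act = np.asarray(act).ravel()
    assert pred.shape[0] == act.shape[0]
    # confusion-matrix counts
    tp = np.sum((pred == 1) & (act == 1))
    tn = np.sum((pred == 0) & (act == 0))
    fp = np.sum((pred == 1) & (act == 0))
    fn = np.sum((pred == 0) & (act == 1))
    # assumes at least one predicted positive and one actual positive
    acc = (tp + tn) / (tp + tn + fp + fn)
    prec = tp / (tp + fp)
    rec = tp / (tp + fn)
    return (acc, prec, rec)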

# features: age, gender (1.0 = female, 0.0 = male), annual salary
age = np.asarray(df["Age"], dtype=np.float64)
gen = np.asarray(df["Gender"] == "Female", dtype=np.float64)
salary = np.asarray(df["AnnualSalary"], dtype=np.float64)
X = np.vstack((age, gen, salary)).T

P = np.asarray(df["Purchased"], dtype=np.float64)
N = np.size(P)
P = P.reshape(N, 1)

# fit a decision tree classifier on the full data set
dtree = tree.DecisionTreeClassifier()
dtree = dtree.fit(X, P.flatten())
# predictions on the same data the tree was trained on
P_pred = dtree.predict(X)

# perform 5-fold cross validation
scores = cross_val_score(dtree, X, P.flatten(), n_jobs=-1)

# collect metrics (computed on the training data)
acc, prec, rec = metric(P_pred, P)

# print results
print("accuracy: {}%".format(acc*100))
print("precision: {}%".format(prec*100))
print("recall: {}%".format(rec*100))
print("cv avg score {}".format(np.mean(scores)))

# probe the tree with hand-picked inputs: [age, gender (1 = female), salary]
test_point = [[53, 0, 60000]]
print(f"test_point: {test_point}, result: {dtree.predict(test_point)}")
test_point = [[53, 1, 60000]]
print(f"test_point: {test_point}, result: {dtree.predict(test_point)}")
test_point = [[20, 1, 100000]]
print(f"test_point: {test_point}, result: {dtree.predict(test_point)}")
test_point = [[60, 1, 100000]]
print(f"test_point: {test_point}, result: {dtree.predict(test_point)}")
test_point = [[20, 1, 200000]]
print(f"test_point: {test_point}, result: {dtree.predict(test_point)}")

accuracy: 99.5%
precision: 100.0%
recall: 98.75621890547264%
cv avg score 0.8799999999999999
test_point: [[53, 0, 60000]], result: [1.]
test_point: [[53, 1, 60000]], result: [1.]
test_point: [[20, 1, 100000]], result: [0.]
test_point: [[60, 1, 100000]], result: [1.]
test_point: [[20, 1, 200000]], result: [1.]


In the output above, the two inputs (53, 0, 60000) and (53, 1, 60000) both
result in 1, so both people are predicted to purchase a car. The only difference
between these inputs is that the first one is male and the second one is female,
so for this age and salary the gender feature does not change the prediction.
The decision tree may still contain one or more nodes that distinguish between
male and female, in which case other inputs could be classified differently
depending on gender.

Take the two inputs (20, 1, 100000) and (60, 1, 100000), which differ only in
the age of the person. The 20-year-old does not buy the car, but the 60-year-old
does, so there must be a node in the decision tree that splits on age and sends
the 20-year-old to a result of 0 and the 60-year-old to a result of 1.
However, the input (20, 1, 200000) results in 1. Possibly, higher in the tree,
there is a node that compares salary against a threshold below 200000 and
classifies the person as purchasing the car.
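
To check which features and thresholds the tree actually splits on, the learned rules can be printed as text. The following is a minimal sketch, not the tree from the cell above: it uses made-up stand-in data (`X_demo`, `y_demo`, and the labeling rule are assumptions for illustration) with the same column order as `X`, and a shallow `max_depth` so the printout stays short.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical stand-in data with the same column order as X:
# [age, gender (1 = female), annual salary]. Illustrative only.
rng = np.random.default_rng(0)
age = rng.integers(18, 65, size=200)
gender = rng.integers(0, 2, size=200)
salary = rng.integers(15000, 150000, size=200)
X_demo = np.column_stack((age, gender, salary)).astype(np.float64)

# Made-up labeling rule so that the tree has something to learn.
y_demo = ((age > 45) | (salary > 100000)).astype(int)

demo_tree = DecisionTreeClassifier(max_depth=3, random_state=0)
demo_tree.fit(X_demo, y_demo)

# One line per node: the feature, the threshold, and the class at each leaf.
rules = export_text(demo_tree, feature_names=["Age", "Gender", "AnnualSalary"])
print(rules)
```

Calling `export_text(dtree, feature_names=["Age", "Gender", "AnnualSalary"])` on the fitted tree from this task should show whether a gender split exists and where the age and salary thresholds lie, although an unpruned tree may print many lines.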

### Task 2
I used two bagging methods (bagging with SVMs and random forest) and one boosting method (AdaBoost).

In [7]:
from sklearn.svm import SVC
from sklearn.ensemble import BaggingClassifier

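# use bagging with svm; the base model is passed positionally because the
# keyword was renamed from base_estimator to estimator in newer scikit-learn
svm_bag = BaggingClassifier(SVC(), n_estimators=10, random_state=0)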
svm_bag = svm_bag.fit(X, P.flatten())
P_pred = svm_bag.predict(X)

# perform 5-fold cross validation
scores = cross_val_score(svm_bag, X, P.flatten(), n_jobs=-1)

# print results
acc, prec, rec = metric(P_pred, P)
print("accuracy: {}%".format(acc*100))
print("precision: {}%".format(prec*100))
print("recall: {}%".format(rec*100))
print("cv avg score {}".format(np.mean(scores)))

accuracy: 75.7%
precision: 86.63594470046083%
recall: 46.766169154228855%
cv avg score 0.754


In [8]:
from sklearn.ensemble import RandomForestClassifier

# use random forest, a bagging method
rf = RandomForestClassifier(n_estimators=10)
rf = rf.fit(X, P.flatten())
P_pred = rf.predict(X)

# perform 5-fold cross validation
scores = cross_val_score(rf, X, P.flatten(), n_jobs=-1)

# print results
acc, prec, rec = metric(P_pred, P)
print("accuracy: {}%".format(acc*100))
print("precision: {}%".format(prec*100))
print("recall: {}%".format(rec*100))
print("cv avg score {}".format(np.mean(scores)))

accuracy: 98.6%
precision: 98.2587064676617%
recall: 98.2587064676617%
cv avg score 0.8859999999999999


In [9]:
from sklearn.ensemble import AdaBoostClassifier

# use adaboost, a boosting method
adaboost = AdaBoostClassifier(n_estimators=100)
adaboost =  adaboost.fit(X, P.flatten())
P_pred = adaboost.predict(X)

# perform 5-fold cross validation
scores = cross_val_score(adaboost, X, P.flatten(), n_jobs=-1)

# print results
acc, prec, rec = metric(P_pred, P)
print("accuracy: {}%".format(acc*100))
print("precision: {}%".format(prec*100))
print("recall: {}%".format(rec*100))
print("cv avg score {}".format(np.mean(scores)))

accuracy: 90.60000000000001%
precision: 90.31413612565446%
recall: 85.82089552238806%
cv avg score 0.8870000000000001


Looking at the results of the classifiers, we can see that random forest and
AdaBoost did significantly better than bagging SVMs. Bagging SVMs had the worst
CV score (0.754), while the CV scores for random forest (0.886) and AdaBoost
(0.887) were both reasonably high. I expected the training accuracy of random
forest to be lower than that of AdaBoost because boosting aims to reduce bias,
but the accuracy for the random forest was higher (98.6% vs. 90.6%). Similarly,
I expected the CV score for the random forest to be higher because it is a
bagging method, which aims to reduce variance, but the two CV scores are nearly
identical, with AdaBoost marginally ahead.

### Task 3
From the four classification models used above, the mean cross-validation score
can be used to judge how well each model generalizes. By this measure, AdaBoost
(0.887) and random forest (0.886) generalize best and are essentially tied, the
decision tree is third (0.880), and bagging SVMs are last (0.754). This is
somewhat consistent with what I expected. Mainly, I expected random forest to
generalize well because it is a bagging method, and it did.
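
The comparison between training accuracy and mean CV score can be collected in one loop. This is a rough sketch on synthetic data, since `car_data.csv` is not reproduced here: `make_classification` stands in for the real features, `summary` is a hypothetical helper dictionary, and the `random_state` values are fixed only to make the run repeatable.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import (AdaBoostClassifier, BaggingClassifier,
                              RandomForestClassifier)
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the car data: 3 numeric features, binary label.
X_demo, y_demo = make_classification(
    n_samples=300, n_features=3, n_informative=3, n_redundant=0, random_state=0
)

models = {
    "decision tree": DecisionTreeClassifier(random_state=0),
    "bagging SVM": BaggingClassifier(SVC(), n_estimators=10, random_state=0),
    "random forest": RandomForestClassifier(n_estimators=10, random_state=0),
    "AdaBoost": AdaBoostClassifier(n_estimators=50, random_state=0),
}

summary = {}
for name, model in models.items():
    # Mean accuracy over 5 held-out folds (estimate of generalization).
    cv_mean = np.mean(cross_val_score(model, X_demo, y_demo, cv=5))
    # Accuracy on the same data the model was fit on.
    train_acc = model.fit(X_demo, y_demo).score(X_demo, y_demo)
    summary[name] = (train_acc, cv_mean)

# Sort by mean CV score, best first.
for name, (train_acc, cv_mean) in sorted(
    summary.items(), key=lambda kv: kv[1][1], reverse=True
):
    print(f"{name:14s} train acc: {train_acc:.3f}  cv mean: {cv_mean:.3f}")
```

A large gap between the two columns for a model suggests it fits the training data more closely than it generalizes, which is the pattern the unpruned decision tree shows above (99.5% training accuracy against a CV score of 0.880).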

For each of the models, accuracy, precision, and recall are displayed. When
selling a car, it is not a big deal if I predict someone will buy a car and they
actually do not (false positive). However, if I predict someone will not
purchase a car and they actually would (false negative), I lose a sale. As
a result, I am most interested in the recall metric, which is sensitive to the
number of false negatives ($\frac{TP}{TP + FN}$).
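
As a small worked example of the recall formula, the sketch below counts the confusion-matrix entries for ten made-up labels (`y_true` and `y_pred` are hypothetical, not taken from the car data) and checks the hand computation against `sklearn.metrics`.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Hypothetical labels: 1 = purchased, 0 = did not purchase.
y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
y_pred = np.array([1, 1, 1, 0, 1, 0, 0, 0, 0, 0])

# For binary labels, ravel() yields the counts in the order tn, fp, fn, tp.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

recall = tp / (tp + fn)     # 3 / (3 + 1): one buyer was missed
precision = tp / (tp + fp)  # 3 / (3 + 1): one predicted buyer did not buy

print(f"tp={tp} fp={fp} fn={fn} tn={tn}")
print(f"recall={recall:.2f} precision={precision:.2f}")

# The library functions should agree with the manual formulas.
assert recall == recall_score(y_true, y_pred)
assert precision == precision_score(y_true, y_pred)
```

Here the single missed buyer lowers recall to 0.75, while the single false alarm lowers only precision, which is why recall is the metric that tracks lost sales.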

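Looking at the recalls for each model, the decision tree did the best (98.8%),
random forest second best (98.3%), AdaBoost third best (85.8%), and bagging SVMs
last (46.8%). These recalls were computed on the training data, so the near-perfect
score of a single unpruned decision tree likely reflects some overfitting. Even
though the decision tree did the best, I would still deploy the random forest
model because it is more likely to generalize better.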
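
One way to check this choice more directly would be to cross-validate recall itself, so that it is measured on held-out folds rather than on the training data. The following is a minimal sketch under the assumption that synthetic data from `make_classification` can stand in for the car data; `cv_recall` is a hypothetical helper dictionary.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the car data: 3 numeric features, binary label.
X_demo, y_demo = make_classification(
    n_samples=300, n_features=3, n_informative=3, n_redundant=0, random_state=0
)

candidates = [
    ("decision tree", DecisionTreeClassifier(random_state=0)),
    ("random forest", RandomForestClassifier(n_estimators=10, random_state=0)),
]

cv_recall = {}
for name, model in candidates:
    # scoring="recall" computes TP / (TP + FN) on each held-out fold.
    fold_scores = cross_val_score(model, X_demo, y_demo, cv=5, scoring="recall")
    cv_recall[name] = fold_scores.mean()
    print(f"{name:14s} mean held-out recall: {cv_recall[name]:.3f}")
```

Running the same loop with `X` and `P.flatten()` would give held-out recalls for the real data, which is a fairer basis for the deployment decision than the training recalls reported above.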
