## Q1. Housing Price (SVM/SVC)

#### Load and Explore the Data

*   Think about standardizing the data.

*  How would you replace discrete attributes?


In [557]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler 
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import GridSearchCV
import matplotlib.pyplot as plt


In [558]:
df = pd.read_csv('lab3_data.csv')
df.head()


Unnamed: 0,area,land,year,price,bldtype
0,2607,1200,2010,825000.0,0
1,1950,1783,1899,1685000.0,0
2,2520,1875,1899,1100000.0,0
3,3750,3125,1931,1200000.0,1
4,7812,5021,1908,1900000.0,1


In [559]:
np.mean(df['year'])

1918.031914893617

#### Based on the mean value of year (about 1918), we relabel each house as 'Modern' (built in 1918 or later) or 'Legacy' (built before 1918).

In [560]:
df['year'] = np.where(df['year'] >= 1918, 'Modern', 'Legacy')
df.head()

Unnamed: 0,area,land,year,price,bldtype
0,2607,1200,Modern,825000.0,0
1,1950,1783,Legacy,1685000.0,0
2,2520,1875,Legacy,1100000.0,0
3,3750,3125,Modern,1200000.0,1
4,7812,5021,Legacy,1900000.0,1


In [561]:
ohe = pd.get_dummies(df['year'])
df = df.drop('year', axis=1)
df = df.join(ohe)

In [562]:
df.head()

Unnamed: 0,area,land,price,bldtype,Legacy,Modern
0,2607,1200,825000.0,0,0,1
1,1950,1783,1685000.0,0,1,0
2,2520,1875,1100000.0,0,1,0
3,3750,3125,1200000.0,1,0,1
4,7812,5021,1900000.0,1,1,0


In [563]:
X = df.loc[:, ['area', 'land', 'price', 'Legacy', 'Modern']]
y = df.loc[:, 'bldtype']

#### Train-Test Split 80/20

In [564]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state = 42)



In [565]:
scaler = StandardScaler()
X_subset = scaler.fit_transform(X_train.iloc[:,0:3])
X_last_column = X_train.iloc[:, 3:]
X_trainf = np.concatenate((X_subset, X_last_column), axis=1)

In [566]:
X_subset2 = scaler.transform(X_test.iloc[:,0:3])
X_last_column2 = X_test.iloc[:, 3:]
X_testf = np.concatenate((X_subset2, X_last_column2), axis=1)

Standardizing rescales each continuous feature to zero mean and unit variance, so that features with very different ranges (area versus price) contribute on a comparable scale. The scaler is fit on the training set only and then applied to the test set, so no test-set statistics leak into training.

To replace discrete attributes, we can either bin them into named categories or encode them numerically.


1. One Hot Encoding creates one column per class, with 1 where the class applies and 0 elsewhere. This avoids imposing an order on the classes and gives each class equal weight.

2. We can also apply Label or Binary Encoding. Label Encoding, however, maps the classes to integers, which implies an order that nominal classes do not have. With classes Cat, Dog and Bird, the codes 0, 1 and 2 would suggest that Bird is "greater than" Cat, which can bias the model. (With only two classes, a 0/1 code is equivalent to a single dummy column.)

3. In this dataset, the discrete Year column was binned into the categories 'Modern' and 'Legacy' based on when the houses were built, and then One Hot Encoded with get_dummies.

4. We could instead derive the age of each house from the year built, which is another good way of replacing the discrete year with a numeric feature.

5. We only scale the columns whose values span a large range. Therefore we scale the first three columns (area, land, price) and leave the one hot encoded columns as 0/1.
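
The scale-then-concatenate steps above can also be written with scikit-learn's `ColumnTransformer`. This is a minimal sketch on a tiny made-up frame (the values are illustrative, not rows from `lab3_data.csv`, and the names `toy`, `continuous` and `pre` are hypothetical); it standardizes only the continuous columns and passes the one hot columns through unchanged.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler

# Illustrative values only (assumption): same column names as the lab data.
toy = pd.DataFrame({
    'area':   [2000, 2500, 4000, 8000],
    'land':   [1200, 1800, 3000, 5000],
    'price':  [800000.0, 1100000.0, 1200000.0, 1900000.0],
    'Legacy': [0, 1, 0, 1],
    'Modern': [1, 0, 1, 0],
})

continuous = ['area', 'land', 'price']

# Scale the continuous columns; leave the 0/1 dummy columns untouched.
pre = ColumnTransformer(
    [('scale', StandardScaler(), continuous)],
    remainder='passthrough',
)

# In the lab this would be fit on X_train and then applied to X_test.
toy_scaled = pre.fit_transform(toy)
print(toy_scaled.round(2))
```

The scaled columns come first in the output, followed by the passthrough columns in their original order.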



#### Train a linear model with soft margin



*   Try with initial value of C=1



In [568]:
svc = SVC(C=1,kernel='linear', random_state=42)
svc.fit(X_trainf, y_train)
y_pred = svc.predict(X_testf)
accuracy = accuracy_score(y_test, y_pred)  # signature is (y_true, y_pred)
print('Accuracy is:', accuracy, "or:", accuracy * 100,"%")


Accuracy is: 1.0 or: 100.0 %


#### Use cross validation to find best value of C



*   Can do it manually or use GridSearchCV

*   Divide the training set into train+validation



In [569]:
X_tr, X_val, y_tr, y_val = train_test_split(X_trainf, y_train, test_size=0.25, random_state=42)


In [576]:
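# 1000 evenly spaced candidates for C between 1 and 10
c = {'C':np.linspace(1,10,1000)}
# GridSearchCV runs its own 5-fold cross validation (the default) on the full training set
modelSVC = GridSearchCV(svc, c)
modelSVC.fit(X_trainf, y_train)
print(modelSVC.best_params_)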

{'C': 6.828828828828828}
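
The regularization strength C is often searched on a logarithmic scale, since neighbouring values on a fine linear grid tend to give nearly identical models. The sketch below is one possible setup, not the lab's method: it uses synthetic data from `make_classification` as a stand-in for the housing features (an assumption), and wraps the scaler and the SVC in a `Pipeline` so that scaling is refit inside each cross-validation fold.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the housing features (assumption, not the lab data).
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=42)

pipe = Pipeline([
    ('scale', StandardScaler()),      # refit on each training fold
    ('svc', SVC(kernel='linear')),
])

# Nine log-spaced candidates from 0.01 to 100.
param_grid = {'svc__C': np.logspace(-2, 2, 9)}

search = GridSearchCV(pipe, param_grid, cv=5)
search.fit(X_demo, y_demo)

print(search.best_params_)
print(round(search.best_score_, 3))
```

`best_score_` is the mean cross-validated accuracy of the selected C, so a separate validation split is not required to compare candidates.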


#### Analyse accuracy based on the new values you have computed

In [574]:
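# X_val is a subset of X_trainf, which the grid search was fit on, so this score is not a held-out estimate
y_val_preds = modelSVC.predict(X_val)
print('Accuracy is', accuracy_score(y_val, y_val_preds))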

Accuracy is 0.9473684210526315


In [575]:
y_test_preds = modelSVC.predict(X_testf)
print('Accuracy on test data with updated C is', accuracy_score(y_test, y_test_preds)*100)

Accuracy on test data with updated C is 100.0


## Q2. DT and RF

Consider the Wisconsin Breast Cancer dataset available from
http://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+(Diagnostic)

The dataset has 32 attributes (an ID, the diagnosis, and 30 real-valued features used to predict malignancy). There are a total of 569 data patterns. Use 5-fold cross-validation.


1.   Use Keras or any other framework to construct a decision tree from the training data and obtain the performance on the test data

2.   Construct a random forest (of say, 100 trees) from the training data and use the random forest to obtain the performance on the test data

3. Compare the performance you obtain in 1 and 2


In [657]:
from sklearn.tree import DecisionTreeClassifier as DTC
df = pd.read_csv('dataWisconsin.csv')
df.head(69)

Unnamed: 0,id,diagnosis,radius_mean,texture_mean,perimeter_mean,area_mean,smoothness_mean,compactness_mean,concavity_mean,concave points_mean,...,texture_worst,perimeter_worst,area_worst,smoothness_worst,compactness_worst,concavity_worst,concave points_worst,symmetry_worst,fractal_dimension_worst,Unnamed: 32
0,842302,M,17.990,10.38,122.80,1001.0,0.11840,0.27760,0.30010,0.14710,...,17.33,184.60,2019.0,0.1622,0.66560,0.71190,0.26540,0.4601,0.11890,
1,842517,M,20.570,17.77,132.90,1326.0,0.08474,0.07864,0.08690,0.07017,...,23.41,158.80,1956.0,0.1238,0.18660,0.24160,0.18600,0.2750,0.08902,
2,84300903,M,19.690,21.25,130.00,1203.0,0.10960,0.15990,0.19740,0.12790,...,25.53,152.50,1709.0,0.1444,0.42450,0.45040,0.24300,0.3613,0.08758,
3,84348301,M,11.420,20.38,77.58,386.1,0.14250,0.28390,0.24140,0.10520,...,26.50,98.87,567.7,0.2098,0.86630,0.68690,0.25750,0.6638,0.17300,
4,84358402,M,20.290,14.34,135.10,1297.0,0.10030,0.13280,0.19800,0.10430,...,16.67,152.20,1575.0,0.1374,0.20500,0.40000,0.16250,0.2364,0.07678,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
64,85922302,M,12.680,23.84,82.69,499.0,0.11220,0.12620,0.11280,0.06873,...,33.47,111.80,888.3,0.1851,0.40610,0.40240,0.17160,0.3383,0.10310,
65,859283,M,14.780,23.94,97.40,668.3,0.11720,0.14790,0.12670,0.09029,...,33.39,114.60,925.1,0.1648,0.34160,0.30240,0.16140,0.3321,0.08911,
66,859464,B,9.465,21.01,60.11,269.4,0.10440,0.07773,0.02172,0.01504,...,31.56,67.03,330.7,0.1548,0.16640,0.09412,0.06517,0.2878,0.09211,
67,859465,B,11.310,19.04,71.80,394.1,0.08139,0.04701,0.03709,0.02230,...,23.84,78.00,466.7,0.1290,0.09148,0.14440,0.06961,0.2400,0.06641,


In [647]:
# Drop the empty trailing column and the patient id, which carry no predictive signal
df = df.drop(['Unnamed: 32', 'id'], axis=1)

In [648]:

X, y = df.iloc[:, 1:], df.loc[:, 'diagnosis']

In [649]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 42)

In [650]:
model_DTC = DTC()
max_dep = {'max_depth':np.arange(2,10)}
modelCV = GridSearchCV(model_DTC, max_dep, cv=5)


In [651]:
modelCV.fit(X_train, y_train)
y_pred = modelCV.predict(X_test)
print('Accuracy of classification is', accuracy_score(y_test, y_pred)*100,'%')

Accuracy of classification is 93.85964912280701 %


####  Repeat the exercise but add ±10% noise to 25% of the data (Optional)

In [655]:
X, y = df.iloc[:, 1:].copy(), df.loc[:, 'diagnosis']
# Pick 25% of the rows at random
noise_df = X.sample(frac=0.25)
# np.random.randint(-10, 10) draws a single integer in [-10, 9], so every
# sampled row is scaled by the same percentage
noise_df = noise_df + ((np.random.randint(-10,10)/100)* noise_df)
X.update(noise_df)
X_train2, X_test2, y_train2, y_test2 = train_test_split(X, y, test_size = 0.2, random_state = 42)
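
The cell above draws one percentage and applies it to every sampled row. If the noise is meant to vary per value, one possible alternative (an assumption about the intent of the exercise) is to draw an independent factor in the ±10% range for each cell of the sampled rows. The sketch uses a small random frame with hypothetical column names rather than the Wisconsin data.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

# Toy feature frame (illustrative values, not the Wisconsin data).
X_demo = pd.DataFrame(rng.uniform(1.0, 100.0, size=(8, 3)),
                      columns=['f1', 'f2', 'f3'])
X_noisy = X_demo.copy()

# Pick 25% of the rows, then scale each selected value by its own factor
# drawn from the range 0.9 to 1.1.
rows = X_noisy.sample(frac=0.25, random_state=42).index
factors = 1 + rng.uniform(-0.10, 0.10, size=(len(rows), X_noisy.shape[1]))
X_noisy.loc[rows] = X_noisy.loc[rows] * factors

changed = (X_noisy != X_demo).any(axis=1)
print(changed.sum(), 'of', len(X_demo), 'rows perturbed')
```

Working on a copy keeps the clean features available for the noise-free experiments.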


In [656]:
X_train2

Unnamed: 0,radius_mean,texture_mean,perimeter_mean,area_mean,smoothness_mean,compactness_mean,concavity_mean,concave points_mean,symmetry_mean,fractal_dimension_mean,...,radius_worst,texture_worst,perimeter_worst,area_worst,smoothness_worst,compactness_worst,concavity_worst,concave points_worst,symmetry_worst,fractal_dimension_worst
68,8.30668,15.9436,54.0868,230.46,0.098072,0.129996,0.28796,0.04025,0.194212,0.074023,...,9.4852,20.838,60.26,298.724,0.136344,0.40158,1.15184,0.16100,0.388976,0.10810
181,21.09000,26.5700,142.7000,1311.00,0.114100,0.283200,0.24870,0.14960,0.239500,0.073980,...,26.6800,33.480,176.50,2089.000,0.149100,0.75840,0.67800,0.29030,0.409800,0.12840
63,9.17300,13.8600,59.2000,260.90,0.077210,0.087510,0.05988,0.02180,0.234100,0.069630,...,10.0100,19.230,65.59,310.100,0.098360,0.16780,0.13970,0.05087,0.328200,0.08490
248,10.65000,25.2200,68.0100,347.00,0.096570,0.072340,0.02379,0.01615,0.189700,0.063290,...,12.2500,35.190,77.98,455.700,0.149900,0.13980,0.11250,0.06136,0.340900,0.08147
60,10.17000,14.8800,64.5500,311.90,0.113400,0.080610,0.01084,0.01290,0.274300,0.069600,...,11.0200,17.450,69.86,368.600,0.127500,0.09866,0.02168,0.02579,0.355700,0.08020
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
71,8.88800,14.6400,58.7900,244.00,0.097830,0.153100,0.08606,0.02872,0.190200,0.089800,...,9.7330,15.670,62.56,284.400,0.120700,0.24360,0.14340,0.04786,0.225400,0.10840
106,11.64000,18.3300,75.1700,412.50,0.114200,0.101700,0.07070,0.03485,0.180100,0.065200,...,13.1400,29.260,85.51,521.700,0.168800,0.26600,0.28730,0.12180,0.280600,0.09097
270,14.29000,16.8200,90.3000,632.60,0.064290,0.026750,0.00725,0.00625,0.150800,0.053760,...,14.9100,20.650,94.44,684.600,0.085670,0.05036,0.03866,0.03333,0.245800,0.06120
435,13.98000,19.6200,91.1200,599.50,0.106000,0.113300,0.11260,0.06463,0.166900,0.065440,...,17.0400,30.800,113.90,869.300,0.161300,0.35680,0.40690,0.18270,0.317900,0.10550


In [666]:
# Fits on the original X_train; swap in X_train2, y_train2 (and X_test2, y_test2) to evaluate the noisy data
model_DTC = DTC(random_state=42)
max_dep = {'max_depth':np.arange(2,10)}
modelCV = GridSearchCV(model_DTC, max_dep, cv=5)
modelCV.fit(X_train, y_train)
y_pred = modelCV.predict(X_test)
print('Accuracy of classification is', accuracy_score(y_test, y_pred)*100,'%')

Accuracy of classification is 94.73684210526315 %


In [667]:
from sklearn.ensemble import RandomForestClassifier
rfc1 = RandomForestClassifier(n_estimators=100, max_depth = 4)
rfc1.fit(X_train, y_train)
y_predRF = rfc1.predict(X_test)
print('Accuracy with Random Forest of 100 trees is:', accuracy_score(y_test, y_predRF) * 100)

Accuracy with Random Forest of 100 trees is: 96.49122807017544


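A Random Forest with 100 trees is a bagging approach: each tree is trained on a bootstrap sample of the training data, and the forest combines the predictions of all 100 trees to decide the final class. Averaging over many trees reduces the variance of a single tree, which is consistent with the higher test accuracy here (96.49% versus 94.74% for the single Decision Tree).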
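
A single train/test split gives a fairly noisy comparison, and the task statement asks for 5-fold cross-validation. The sketch below compares the two models that way. It assumes scikit-learn's bundled copy of the dataset (`load_breast_cancer`) in place of `dataWisconsin.csv`, so the scores will not match the numbers above exactly.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Bundled copy of the Wisconsin Diagnostic data (569 rows, 30 features).
X_bc, y_bc = load_breast_cancer(return_X_y=True)

models = {
    'Decision tree': DecisionTreeClassifier(max_depth=4, random_state=42),
    'Random forest': RandomForestClassifier(n_estimators=100, max_depth=4,
                                            random_state=42),
}

cv_scores = {}
for name, model in models.items():
    # Five accuracy values, one per fold.
    cv_scores[name] = cross_val_score(model, X_bc, y_bc, cv=5, scoring='accuracy')
    print(f'{name}: {cv_scores[name].mean():.3f} +/- {cv_scores[name].std():.3f}')
```

Reporting the mean and spread over folds makes it easier to judge whether the gap between the two models is larger than the fold-to-fold variation.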

### Boosting

Implement a boosting classifier algorithm for the same dataset as above (sample without noise)

Feel free to use any boosting algorithm you want

However only run the code for the eventual algorithm you choose and comment out every other algorithm

Briefly explain why you chose a particular algorithm

In [598]:
#pip install xgboost

In [668]:
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import GradientBoostingClassifier
import xgboost as xgb

In [669]:
dtc = DecisionTreeClassifier(max_depth = 3)
abc = AdaBoostClassifier(dtc, n_estimators=100)
abc.fit(X_train, y_train)
y_hat = abc.predict(X_test)
print('Accuracy Score = ', accuracy_score(y_test, y_hat)*100)

Accuracy Score =  97.36842105263158


In [622]:
# gbc = GradientBoostingClassifier(n_estimators=100, max_depth=3)
# gbc.fit(X_train2, y_train2)
# y_hat = gbc.predict(X_test2)
# print('Accuracy Score = ', accuracy_score(y_hat, y_test2)*100)

In [623]:
# y_train_update = np.where(y_train2 == 'B', 0, 1)
# y_test_update = np.where(y_test2 == 'B', 0, 1)


In [624]:
# xgboost = xgb.XGBClassifier(n_estimators=100, max_depth = 3)
# xgboost.fit(X_train, y_train_update)
# y_hat = xgboost.predict(X_test)
# print('Accuracy Score = ', accuracy_score(y_hat, y_test_update)*100)

#### Choice of Boosting Algorithm

I tried several boosting algorithms and went ahead with AdaBoost (Adaptive Boosting), which gave 97.37% test accuracy, the highest of the models reported above for this dataset.

AdaBoost and XGBoost both build an ensemble of decision trees. AdaBoost reweights the training examples after each round, while XGBoost is a gradient boosting method that fits each new tree to the gradient of the loss.

Boosting algorithms combine the predictions of many weak models into one strong model. AdaBoost trains its weak models (here, decision trees of depth 3) sequentially: after each tree, the weights of the misclassified examples are increased so that the next tree pays more attention to them. This focus on hard examples also makes AdaBoost sensitive to label noise and outliers, since mislabeled points keep gaining weight. That is less of a concern here because this part uses the sample without noise.

AdaBoost also copes well with a large number of features, because each decision tree only splits on the features it finds most informative.
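
The reweighting step described above can be traced by hand for one boosting round. This is a sketch of the textbook discrete AdaBoost update with hypothetical labels and weak-learner predictions in {-1, +1}; scikit-learn's `AdaBoostClassifier` may differ in its details.

```python
import numpy as np

# Hypothetical labels and weak-learner predictions in {-1, +1}.
y_true = np.array([ 1,  1, -1, -1,  1, -1,  1, -1])
y_weak = np.array([ 1, -1, -1, -1,  1,  1,  1, -1])   # two mistakes

w = np.full(len(y_true), 1 / len(y_true))   # start with uniform weights
miss = y_true != y_weak

err = w[miss].sum()                          # weighted error of this learner
alpha = 0.5 * np.log((1 - err) / err)        # the learner's vote in the ensemble

# Misclassified points are scaled by exp(+alpha), correct ones by exp(-alpha),
# then the weights are renormalized to sum to 1.
w_new = w * np.exp(-alpha * y_true * y_weak)
w_new /= w_new.sum()

print('weighted error:', err)
print('alpha:', round(alpha, 3))
print('weight on misclassified points after update:', round(w_new[miss].sum(), 3))
```

After the update the misclassified points carry half of the total weight, which is what forces the next weak learner to concentrate on them.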

### Bagging

Implement a bagging classifier on the RF you created above


> from sklearn.ensemble import BaggingClassifier

You will have to pass the DT into the Bagging Classifier

Once you have the y_pred for Bagging and RF, accurately compute the accuracy by computing the numpy sum where pred(bagging) == pred(RF) and divide by len(pred(bagging))

Please provide rationale behind why this is done.



In [662]:
from sklearn.ensemble import BaggingClassifier
from sklearn.ensemble import RandomForestClassifier

In [670]:
dtc2 = DecisionTreeClassifier(max_depth=4)
bg = BaggingClassifier(dtc2, n_estimators=100, max_samples = 0.45)
rfc = RandomForestClassifier(n_estimators = 100, max_depth = 3, random_state=42)
bg.fit(X_train, y_train)
y_hat_bg = bg.predict(X_test)
rfc.fit(X_train, y_train)
y_hat_rfc = rfc.predict(X_test)
print('Accuracy with Bagging Classifier is:', accuracy_score(y_test, y_hat_bg))
print('Accuracy for Random Forest is:', accuracy_score(y_test, y_hat_rfc))

Accuracy with Bagging Classifier is: 0.956140350877193
Accuracy for Random Forest is: 0.9649122807017544


In [664]:
acc_bg = accuracy_score(y_test, y_hat_bg)
acc_rfc = accuracy_score(y_test, y_hat_rfc)

In [671]:
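# Share of test samples on which the two models agree (this compares the
# models with each other, not with y_test)
s = np.sum(y_hat_bg == y_hat_rfc)
final_acc = (s/len(y_hat_bg))*100
print('Final accuracy by combining and comparing both classifiers predictions is : ', final_acc, '%')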


Final accuracy by combining and comparing both classifiers predictions is :  99.12280701754386 %


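The rationale is that this ratio measures how often the two models make the same prediction, i.e. their agreement rate, rather than accuracy against the true labels. A high value (99.12% here) is expected: a random forest is itself a bagged ensemble of decision trees with extra feature randomness, so the two models should behave almost identically. If the predictions disagreed considerably, it would mean that at least one of the models is underperforming. Agreement alone does not prove correctness, since both models can make the same mistake, so it should be read together with each model's accuracy on y_test.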
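
A small sketch with hypothetical labels shows how the agreement rate relates to each model's accuracy; all arrays below are made up for illustration. Two models can agree on a sample and still both be wrong.

```python
import numpy as np

# Hypothetical true labels and predictions from two models.
y_true_demo = np.array(['B', 'B', 'M', 'M', 'B', 'M', 'B', 'M'])
pred_a      = np.array(['B', 'B', 'M', 'B', 'B', 'M', 'B', 'B'])
pred_b      = np.array(['B', 'B', 'M', 'B', 'B', 'M', 'M', 'B'])

# Agreement: the numpy sum where the predictions match, divided by the length.
agreement = np.sum(pred_a == pred_b) / len(pred_a)

# Accuracy of each model against the true labels.
acc_a = np.mean(pred_a == y_true_demo)
acc_b = np.mean(pred_b == y_true_demo)

# Samples that both models get wrong.
both_wrong = np.sum((pred_a != y_true_demo) & (pred_b != y_true_demo))

print('agreement:', agreement)
print('accuracy A:', acc_a, 'accuracy B:', acc_b)
print('samples both models get wrong:', both_wrong)
```

Here the models agree on 7 of 8 samples, yet their accuracies are 0.75 and 0.625, because two of the agreed predictions are shared mistakes.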

Bonus : While you are looking at ensemble models, explore VotingClassifier

In [672]:
from sklearn.ensemble import VotingClassifier
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB

clf1 = modelCV
clf2 = abc
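clf3 = GradientBoostingClassifier(n_estimators=100, max_depth=3)  # gbc is commented out above; VotingClassifier refits its estimators anyway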
clf4 = bg
clf5 = rfc
clf6 = SVC(kernel = 'poly', degree=7,probability=True)
clf7 = GaussianNB()
eclf1 = VotingClassifier(estimators = [('CVDTC',clf1), ('Naive Bayes',clf7), ('RandomForestClassifier', clf5)], voting='soft')
eclf1.fit(X_train, y_train)
voting_pred1 = eclf1.predict(X_test)
eclf2 = VotingClassifier(estimators=[('CVDTC', clf1), ('AdaBoost', clf2), ('GradientBoosting', clf3), ('BaggingClassifier', clf4),
                                    ('RandomForestClassifier', clf5)], voting='soft')
eclf2.fit(X_train, y_train)
voting_pred2 = eclf2.predict(X_test)

eclf3 = VotingClassifier(estimators=[('BaggingClassifier', clf4),
                                    ('RandomForestClassifier', clf5), ('SVM', clf6)], voting='soft')
eclf3.fit(X_train, y_train)
voting_pred3 = eclf3.predict(X_test)
# y_test holds the same labels as y_test2 (same split and random_state), and matches the X_test used for the predictions
print('Accuracy for Voting classifier with 3 estimators is: ', accuracy_score(y_test, voting_pred1)*100,"%")
print('Accuracy for Voting classifier with 5 estimators is: ', accuracy_score(y_test, voting_pred2)*100,"%")
print('Accuracy for Voting classifier with 3 best estimators is: ', accuracy_score(y_test, voting_pred3)*100,"%")


Accuracy for Voting classifier with 3 estimators is:  96.49122807017544 %
Accuracy for Voting classifier with 5 estimators is:  95.6140350877193 %
Accuracy for Voting classifier with 3 best estimators is:  96.49122807017544 %
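
All three ensembles above use `voting='soft'`, which averages the predicted class probabilities, whereas `voting='hard'` takes a majority of the predicted labels. The sketch below reproduces both rules by hand for a binary problem, using hypothetical probabilities from three classifiers, to show that they can disagree.

```python
import numpy as np

# Hypothetical P(class = 1) from three classifiers for four samples.
proba = np.array([
    [0.90, 0.40, 0.40],
    [0.20, 0.30, 0.10],
    [0.60, 0.70, 0.80],
    [0.45, 0.45, 0.95],
])

# Hard voting: each classifier casts one vote, the majority wins.
votes = (proba >= 0.5).astype(int)
hard = (votes.sum(axis=1) >= 2).astype(int)

# Soft voting: average the probabilities, then pick the more likely class.
soft = (proba.mean(axis=1) >= 0.5).astype(int)

print('hard:', hard)
print('soft:', soft)
```

On the first and last samples one very confident classifier outweighs two mildly opposed ones under soft voting, while hard voting follows the two-to-one majority.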
