### <font color='red'> Project 2

Project Description:
- Use same datasets as Project 1.
- Preprocess data: Explore data and apply data scaling.

Classification Task:
- Apply two voting classifiers - one with hard voting and one with soft voting
- Apply any two models with bagging and any two models with pasting.
- Apply any two models with adaboost boosting
- Apply one model with gradient boosting
- Apply PCA on data and then apply all the models in project 1 again on data you get from PCA. Compare your results with results in project 1. You don't need to apply all the models twice. Just copy the result table from project 1, prepare similar table for all the models after PCA and compare both tables. Does PCA help in getting better results?
- Apply deep learning models covered in class

# Classification

## Data Preprocessing

|No|Variables|Description|
|:--|:--------|:-----------|
|1|customerID|Customer ID|
|2|gender|Whether the customer is a male or a female|
|3|SeniorCitizen|senior citizen or not (1, 0)|
|4|Dependents|Whether the customer has a partner or not (Yes, No)|
|5|tenure|Number of months the customer has stayed with the company|
|6|PhoneService| Whether the customer has a phone service or not (Yes, No)|
|7|MultipleLines|Whether the customer has multiple lines or not (Yes, No, No phone service)|
|8| InternetService|Customer’s internet service provider (DSL, Fiber optic, No)|
|9| OnlineSecurity|Whether the customer has online security or not (Yes, No, No internet service)|
|10| OnlineBackup|Whether the customer has online backup or not (Yes, No, No internet service)|
|11| DeviceProtection|Whether the customer has device protection or not (Yes, No, No internet service)|
|12|TechSupport|Whether the customer has tech support or not (Yes, No, No internet service)|
|13|StreamingTV|Whether the customer has streaming TV or not (Yes, No, No internet service)|
|14|StreamingMovies|Whether the customer has streaming movies or not (Yes, No, No internet service)|
|15|Contract|The contract term of the customer (Month-to-month, One year, Two year)|
|16|PaperlessBilling|Whether the customer has paperless billing or not (Yes, No)|
|17|PaymentMethod|The customer’s payment method (Electronic check, Mailed check, Bank transfer (automatic), Credit card (automatic))|
|18|MonthlyCharges|The amount charged to the customer monthly|
|19|TotalCharges|The total amount charged to the customer|
|20|Churn |Whether the customer churned or not (Yes or No)|

In [1]:
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import MinMaxScaler
from sklearn.neighbors import KNeighborsClassifier
import matplotlib.pyplot as plt

In [2]:
telco = pd.read_csv("telco_o.csv", na_values=['?', ' '])
telco.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7043 entries, 0 to 7042
Data columns (total 22 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   Unnamed: 0        7043 non-null   int64  
 1   customerID        7043 non-null   object 
 2   gender            7043 non-null   object 
 3   SeniorCitizen     7043 non-null   int64  
 4   Partner           7043 non-null   object 
 5   Dependents        6845 non-null   object 
 6   tenure            7043 non-null   int64  
 7   PhoneService      7043 non-null   object 
 8   MultipleLines     7043 non-null   object 
 9   InternetService   7043 non-null   object 
 10  OnlineSecurity    7043 non-null   object 
 11  OnlineBackup      7043 non-null   object 
 12  DeviceProtection  6761 non-null   object 
 13  TechSupport       6722 non-null   object 
 14  StreamingTV       7043 non-null   object 
 15  StreamingMovies   7043 non-null   object 
 16  Contract          7043 non-null   object 


In [3]:
telco_na = telco.isnull().sum()
print(telco_na[telco_na>0])

Dependents          198
DeviceProtection    282
TechSupport         321
TotalCharges         11
dtype: int64


In [4]:
telco.drop(['customerID'], axis=1, inplace=True)

grps = telco.groupby(['Contract', 'MultipleLines'])
telco['DeviceProtection'] = grps['DeviceProtection'].transform(lambda grp: grp.fillna(grp.value_counts().index[0]))

grps = telco.groupby(['MultipleLines'])
telco['TechSupport'] = grps['TechSupport'].transform(lambda grp: grp.fillna(grp.value_counts().index[0]))

telco.dropna(inplace=True)

- Convert categorical features to numerical.

In [5]:
telco["Churn"] = telco["Churn"].map({"No":0, "Yes":1}).astype(int)
telco['gender'] = telco['gender'].map({"Female": 0, "Male":1}).astype(int)
telco["Partner"] = telco["Partner"].map({"Yes": 1, "No": 0}).astype(int)
telco['PhoneService'] = telco['PhoneService'].map({"Yes":1, "No":0}).astype(int)
telco["MultipleLines"] = telco["MultipleLines"].map({"No phone service":0, "No":1, "Yes":2}).astype(int)
telco["OnlineSecurity"] = telco["OnlineSecurity"].map({"No internet service":0, "No":1, "Yes":2}).astype(int)
telco["OnlineBackup"] = telco["OnlineBackup"].map({"No internet service":0, "No":1, "Yes":2}).astype(int)
telco["StreamingMovies"] = telco["StreamingMovies"].map({"No internet service":0, "No":1, "Yes":2}).astype(int)
telco["PaperlessBilling"] = telco["PaperlessBilling"].map({"No":0, "Yes":1}).astype(int)
telco['DeviceProtection'] = telco['DeviceProtection'].map({'No internet service':0, "No":1, "Yes":2}).astype(int)
telco["Dependents"] = telco["Dependents"].map({"Yes": 1, "No": 0}, na_action='ignore').astype(int)
telco['TechSupport'] = telco['TechSupport'].map({'No internet service':0, 'No':1, 'Yes':2}).astype(int)

- Add dummy variables for InternetService column.

In [6]:
its_dummy = pd.get_dummies(telco['InternetService'], columns='InternetService', prefix='ITS') 
telco = pd.concat([telco, its_dummy], axis=1)

In [7]:
telco.drop(['InternetService'], axis=1, inplace=True)

In [8]:
stv_dummy = pd.get_dummies(telco['StreamingTV'], columns='StreamingTV', prefix='STV')
telco = pd.concat([telco, stv_dummy], axis=1)
telco.drop(['StreamingTV'], axis=1, inplace=True)

In [9]:
smv_dummy = pd.get_dummies(telco['StreamingMovies'], columns='StreamingMovies', prefix='SMV')
telco = pd.concat([telco, smv_dummy], axis=1)
telco.drop(['StreamingMovies'], axis=1, inplace=True)

In [10]:
con_dummy = pd.get_dummies(telco['Contract'], columns='Contract', prefix='CTRT')
telco = pd.concat([telco, con_dummy], axis=1)
telco.drop(['Contract'], axis=1, inplace=True)

In [11]:
payment_dummy = pd.get_dummies(telco['PaymentMethod'], columns='PaymentMethod', prefix='Payment')
telco = pd.concat([telco, payment_dummy], axis=1)
telco.drop(['PaymentMethod'], axis=1, inplace=True)

- Convert interger-string column TotalCharge to int.

In [12]:
pd.to_numeric(telco['TotalCharges'])

0         29.85
1       1889.50
2        108.15
3       1840.75
4        151.65
         ...   
7038    1990.50
7039    7362.90
7040     346.45
7041     306.60
7042    6844.50
Name: TotalCharges, Length: 6834, dtype: float64

- Now we have a data frame without missing values and all features have numeric value type.

In [13]:
telco.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 6834 entries, 0 to 7042
Data columns (total 32 columns):
 #   Column                             Non-Null Count  Dtype  
---  ------                             --------------  -----  
 0   Unnamed: 0                         6834 non-null   int64  
 1   gender                             6834 non-null   int32  
 2   SeniorCitizen                      6834 non-null   int64  
 3   Partner                            6834 non-null   int32  
 4   Dependents                         6834 non-null   int32  
 5   tenure                             6834 non-null   int64  
 6   PhoneService                       6834 non-null   int32  
 7   MultipleLines                      6834 non-null   int32  
 8   OnlineSecurity                     6834 non-null   int32  
 9   OnlineBackup                       6834 non-null   int32  
 10  DeviceProtection                   6834 non-null   int32  
 11  TechSupport                        6834 non-null   int32

- Project 1 Result Table <br>

|No|Classifiers|Best Parameters|Accuary Score|Best Model|
|:--|:-----------|:---------------|:-------------|:----------|
|1|KNN|k=18|0.774|
|2|Logistic Regression|c = 0.1, penalty = l2|0.801|
|3|Softmax Regression|c = 0.01|0.802|
|4|Linear SVM|c = 0.01|0.803|<b>Yes|
|5|SVM with Kernel Linear|c=0.01|0.800| 
|6|SVM with Kernel RBF|c=1, gamma=0.1|0.800|
|7|SVM with Kernel Polynomial|degree=3, c=0.1|0.798|
|8|Decision Tree|depth=3|0.791|<b>*|

### Separate Train, Validation and Test dataset

In [14]:
from sklearn.model_selection import train_test_split

In [15]:
y = telco['Churn']
X = telco.drop(['Churn'], axis = 1)

In [16]:
X_train_full, X_test_org, y_train_full, y_test = train_test_split(X, y, test_size = 0.2, random_state=0)

X_train_org, X_valid_org, y_train, y_valid = train_test_split(X_train_full, y_train_full, test_size = 0.3, random_state = 1)

In [17]:
scaler = MinMaxScaler()
X_train = scaler.fit_transform(X_train_org)
X_valid = scaler.transform(X_valid_org)
X_test = scaler.transform(X_test_org)

In [18]:
model_Scores = []

In [19]:
print("train dataset size: ", X_train.shape, "\nvalidation dataset size: ", X_valid.shape, "\ntest dataset size: ", X_test.shape)

train dataset size:  (3826, 31) 
validation dataset size:  (1641, 31) 
test dataset size:  (1367, 31)


## Task 1: Apply two voting classifiers (Hard & Soft)

- For **Hard Voting Classifiers**, we choose **Logistic Regression, KNN, Linear SVM** to evaluate.
- For **Soft Coting Classifiers**, we choose **SVM with Kernel Linear, SVM with Kernel RBF and Decision Tree** to evaluate.

### 1. Voting Classifiers (Hard): Logistic Regression, KNN, Linear SVM

In [20]:
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import LinearSVC
from sklearn.ensemble import VotingClassifier
from sklearn.metrics import accuracy_score

In [21]:
# Logistic Regression
log_clf = LogisticRegression(C=0.1, penalty='l2')

# KNN
knn_clf = KNeighborsClassifier(18)

# Linear SVM
lsvm_clf = LinearSVC(C=0.01)

voting1_clf = VotingClassifier(estimators=[('lr', log_clf), ('knn', knn_clf), ('lsvc', lsvm_clf)], voting='hard')

for clf in (log_clf, knn_clf, lsvm_clf, voting1_clf):
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    print(clf.__class__.__name__, round(accuracy_score(y_test, y_pred), 4))

LogisticRegression 0.7966
KNeighborsClassifier 0.7747
LinearSVC 0.7981
VotingClassifier 0.7974


In [22]:
model_Scores.append({'Model Type':'Classification',
                    'Model Name': 'Hard Voting Classifier',
                    'Best Parameters': '',
                    'Train Score': voting1_clf.score(X_train, y_train),
                    'Test Score': voting1_clf.score(X_test, y_test)})

- The accuracy score of **Hard Voting Classifier** is **0.7974**, higher than **K Neighbors Classifier**, and similar to **Logistic Regression** and **Linear SVC**.

### 2. Voting Classifier (Soft): SVM with Kernel Linear, Logistic Regreesion, Decision Tree

In [23]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC

In [24]:
# SVM with Kernel Linear
svmkl_clf = SVC(kernel='linear', C=0.01, probability=True)
svmkl_clf.fit(X_train, y_train)

# Logistic Regression
log_clf = LogisticRegression(C=0.1, penalty='l2')
log_clf.fit(X_train, y_train)

# Decsion Tree
dt_clf = DecisionTreeClassifier(max_depth=3, random_state=0)
dt_clf.fit(X_train, y_train)

voting2_clf = VotingClassifier(estimators=[('svmkl', svmkl_clf), ('log', log_clf), ('dt', dt_clf)], voting='soft')
voting2_clf.fit(X_train, y_train)

for clf in (svmkl_clf, log_clf, dt_clf, voting2_clf):
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    print(clf.__class__.__name__, round(accuracy_score(y_test, y_pred), 4))

SVC 0.7857
LogisticRegression 0.7966
DecisionTreeClassifier 0.7805
VotingClassifier 0.7915


In [25]:
model_Scores.append({'Model Type':'Classification',
                    'Model Name': 'Soft Voting Classifier',
                    'Best Parameters': '',
                    'Train Score': voting2_clf.score(X_train, y_train),
                    'Test Score': voting2_clf.score(X_test, y_test)})

- The accuracy score of **Soft Voting Classifier** is **0.7915**, higher than **SVM with kernel linear** and **Decision Tree classifier**, and slightly lower than **Logistic Regression**.

## Task 2: Apply two models with bagging and two models with pasting.

- On the **Bagging** part, we choose **KNN** and **Decision Tree** as the base model.
- On the **Pasting** part, we choose **Logistic Regression** and **Linear SVM** as the base model.

### Bagging Part

### 1. Bagging with KNN as base model(k=18)

In [26]:
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from  sklearn.metrics import accuracy_score

In [27]:
knn_clf = KNeighborsClassifier(18)
knn_clf.fit(X_train, y_train)

param1 = {
    'n_estimators': [50, 100], 
    'max_samples':[0.5, 0.7]
}

bag_knn = BaggingClassifier(knn_clf, bootstrap=True, random_state=0)
grid1 = GridSearchCV(bag_knn, param1, cv = 5, n_jobs = -1, return_train_score= True)
grid1.fit(X_train, y_train)

GridSearchCV(cv=5, error_score=nan,
             estimator=BaggingClassifier(base_estimator=KNeighborsClassifier(algorithm='auto',
                                                                             leaf_size=30,
                                                                             metric='minkowski',
                                                                             metric_params=None,
                                                                             n_jobs=None,
                                                                             n_neighbors=18,
                                                                             p=2,
                                                                             weights='uniform'),
                                         bootstrap=True,
                                         bootstrap_features=False,
                                         max_features=1.0, max_samples=1.0,
                      

In [28]:
print('Best model parameters : ' + str(grid1.best_params_))
print('Best score with the parameters : {:.2f}'.format(grid1.best_score_))
bag_knn = grid1.best_estimator_

Best model parameters : {'max_samples': 0.5, 'n_estimators': 100}
Best score with the parameters : 0.78


In [29]:
bknn_train_score = bag_knn.score(X_train, y_train)
bknn_valid_score = bag_knn.score(X_valid, y_valid)
print('Train score: {:.4f}'.format(bknn_train_score))
print('Validation score: {:.4f}'.format(bknn_valid_score))

Train score: 0.7956
Validation score: 0.7885


In [30]:
knn_acu = knn_clf.score(X_test, y_test)
print("The accuracy score of KNN Classifier (k=18) is: {:.4f}".format(knn_acu))
bknn_acu = bag_knn.score(X_test, y_test)
print("The accuracy score of Bagging with KNN Classifier (k=18) is: {:.4f}".format(bknn_acu))

The accuracy score of KNN Classifier (k=18) is: 0.7747
The accuracy score of Bagging with KNN Classifier (k=18) is: 0.7827


In [31]:
model_Scores.append({'Model Type':'Classification',
                    'Model Name': 'Bagging with KNeighbor Classifier',
                    'Best Parameters': grid1.best_params_,
                    'Train Score': bag_knn.score(X_train, y_train),
                    'Test Score': bag_knn.score(X_test, y_test)})

- **Bagging with KNN** gets a higher score **(0.7827)** than **KNN (0.7747)**.

### 2. Bagging with Decision Tree as base model

In [32]:
from sklearn.tree import DecisionTreeClassifier

In [33]:
dt_clf = DecisionTreeClassifier(max_depth=3, random_state=0)
dt_clf.fit(X_train, y_train)

param2 = {'n_estimators': [50, 100], 'max_samples': [0.5, 0.7]}

bag_dt = BaggingClassifier(dt_clf, bootstrap=True, random_state=0)
grid2 = GridSearchCV(bag_dt, param2, cv=5, n_jobs = -1, return_train_score= True)
grid2.fit(X_train, y_train)

GridSearchCV(cv=5, error_score=nan,
             estimator=BaggingClassifier(base_estimator=DecisionTreeClassifier(ccp_alpha=0.0,
                                                                               class_weight=None,
                                                                               criterion='gini',
                                                                               max_depth=3,
                                                                               max_features=None,
                                                                               max_leaf_nodes=None,
                                                                               min_impurity_decrease=0.0,
                                                                               min_impurity_split=None,
                                                                               min_samples_leaf=1,
                                                                            

In [34]:
print('Best model parameters : ' + str(grid2.best_params_))
print('Best score with the parameters : {:.2f}'.format(grid2.best_score_))
bag_dt = grid2.best_estimator_

Best model parameters : {'max_samples': 0.5, 'n_estimators': 50}
Best score with the parameters : 0.79


In [35]:
bdt_train_score = bag_dt.score(X_train, y_train)
bdt_valid_score = bag_dt.score(X_valid, y_valid)

print('Train score: {:.4f}'.format(bdt_train_score))
print('Validation score: {:.4f}'.format(bdt_valid_score))

Train score: 0.7948
Validation score: 0.7873


In [36]:
dt_acu = dt_clf.score(X_test, y_test)
print("The accuracy score of Decision Tree is: {:.4f}".format(dt_acu))

bdt_acu = bag_dt.score(X_test, y_test)
print("The accuracy score of Bagging with Decison Tree is: {:.4f}".format(bdt_acu))

The accuracy score of Decision Tree is: 0.7805
The accuracy score of Bagging with Decison Tree is: 0.7791


In [37]:
model_Scores.append({'Model Type':'Classification',
                    'Model Name': 'Bagging with Decision Tree',
                    'Best Parameters': grid2.best_params_,
                    'Train Score': bag_dt.score(X_train, y_train),
                    'Test Score': bag_dt.score(X_test, y_test)})

- **Bagging with Decision Tree** gets a similar score **(0.7791)** to **Decision Tree (0.7805)**.

### Pasting Part

### 3. Pasting with Logistic Regression as base model

In [38]:
param3  = {'n_estimators': [50, 100], 'max_samples': [0.5, 0.7]}

bag_log = BaggingClassifier(log_clf, bootstrap=False, random_state=0)
grid3 = GridSearchCV(bag_log, param3, cv=5, n_jobs = -1, return_train_score= True)
grid3.fit(X_train, y_train)

GridSearchCV(cv=5, error_score=nan,
             estimator=BaggingClassifier(base_estimator=LogisticRegression(C=0.1,
                                                                           class_weight=None,
                                                                           dual=False,
                                                                           fit_intercept=True,
                                                                           intercept_scaling=1,
                                                                           l1_ratio=None,
                                                                           max_iter=100,
                                                                           multi_class='auto',
                                                                           n_jobs=None,
                                                                           penalty='l2',
                                                           

In [39]:
print('Best model parameters : ' + str(grid3.best_params_))
print('Best score with the parameters : {:.2f}'.format(grid3.best_score_))
bag_log = grid3.best_estimator_

Best model parameters : {'max_samples': 0.7, 'n_estimators': 50}
Best score with the parameters : 0.80


In [40]:
blog_train_score = bag_log.score(X_train, y_train)
blog_valid_score = bag_log.score(X_valid, y_valid)

print('Train score: {:.4f}'.format(blog_train_score))
print('Validation score: {:.4f}'.format(blog_valid_score))

Train score: 0.8006
Validation score: 0.8001


In [79]:
log_acu = log_clf.score(X_test, y_test)
print("The accuracy score of Logistic Regression Classifier is: {:.4f}".format(log_acu))

blog_acu = bag_log.score(X_test, y_test)
print("The accuracy score of Pasting with Logistic Regression Classifier is: {:.4f}".format(blog_acu))

The accuracy score of Logistic Regression Classifier is: 0.7966
The accuracy score of Pasting with Logistic Regression Classifier is: 0.7974


In [42]:
model_Scores.append({'Model Type':'Classification',
                    'Model Name': 'Pasting with Logistic Regression',
                    'Best Parameters': grid3.best_params_,
                    'Train Score': bag_log.score(X_train, y_train),
                    'Test Score': bag_log.score(X_test, y_test)})

- **Pasting with Logistic Regression** gets a score **(0.7974)** slightly higher than **Logistic Regression (0.7966)**.

### 4. Pasting with Linear SVM as base model

In [43]:
param4  = {'n_estimators': [50, 100], 'max_samples': [0.5, 0.7]}

bag_lsvm = BaggingClassifier(lsvm_clf, bootstrap=False, random_state=0)
grid4 = GridSearchCV(bag_lsvm, param4, cv=5, n_jobs = -1, return_train_score= True)
grid4.fit(X_train, y_train)

GridSearchCV(cv=5, error_score=nan,
             estimator=BaggingClassifier(base_estimator=LinearSVC(C=0.01,
                                                                  class_weight=None,
                                                                  dual=True,
                                                                  fit_intercept=True,
                                                                  intercept_scaling=1,
                                                                  loss='squared_hinge',
                                                                  max_iter=1000,
                                                                  multi_class='ovr',
                                                                  penalty='l2',
                                                                  random_state=None,
                                                                  tol=0.0001,
                                                          

In [44]:
print('Best model parameters : ' + str(grid4.best_params_))
print('Best score with the parameters : {:.2f}'.format(grid4.best_score_))
bag_lsvm = grid4.best_estimator_

Best model parameters : {'max_samples': 0.7, 'n_estimators': 50}
Best score with the parameters : 0.80


In [45]:
blog_train_score = bag_log.score(X_train, y_train)
blog_valid_score = bag_log.score(X_valid, y_valid)
print('Train score: {:.4f}'.format(blog_train_score))
print('Validation score: {:.4f}'.format(blog_valid_score))

Train score: 0.8006
Validation score: 0.8001


In [78]:
y_pred_lsvm = lsvm_clf.predict(X_test)
lsvm_acu = lsvm_clf.score(X_test, y_test)
print("The accuracy score of Linear SVM is: {:.4f}".format(lsvm_acu))

blsvm_acu = bag_lsvm.score(X_test, y_test)
print("The accuracy score of Pasting with Linear SVM is: {:.4f}".format(blsvm_acu))

The accuracy score of Linear SVM is: 0.7981
The accuracy score of Pasting with Linear SVM is: 0.7959


In [47]:
model_Scores.append({'Model Type':'Classification',
                    'Model Name': 'Pasting with Linear SVM',
                    'Best Parameters': grid4.best_params_,
                    'Train Score': bag_lsvm.score(X_train, y_train),
                    'Test Score': bag_lsvm.score(X_test, y_test)})

- **Pasting with Linear SVM** get a similar score **(0.7959)** to **Linear SVM (0.7981)**.

## Task 3: Apply two models with adaboost boosting

- On this part, we choose **Decision Tree** and **SVM with Kernel Linear** as the base model of adaboost.

### 1. Adaboost with Decision Tree

In [48]:
from sklearn.ensemble import AdaBoostClassifier

In [49]:
ada_dt = AdaBoostClassifier(dt_clf)
param5 = { 
    'n_estimators': [100, 200],
    'learning_rate': [0.5, 0.7]
}

grid5 = GridSearchCV(ada_dt, param5, cv= 5, n_jobs=-1)
grid5.fit(X_train, y_train)

GridSearchCV(cv=5, error_score=nan,
             estimator=AdaBoostClassifier(algorithm='SAMME.R',
                                          base_estimator=DecisionTreeClassifier(ccp_alpha=0.0,
                                                                                class_weight=None,
                                                                                criterion='gini',
                                                                                max_depth=3,
                                                                                max_features=None,
                                                                                max_leaf_nodes=None,
                                                                                min_impurity_decrease=0.0,
                                                                                min_impurity_split=None,
                                                                                min_samples_leaf=1,
    

In [50]:
print('Best model parameters : ' + str(grid5.best_params_))
print('Best score with the parameters : {:.4f}'.format(grid5.best_score_))
ada_dt = grid5.best_estimator_

Best model parameters : {'learning_rate': 0.5, 'n_estimators': 100}
Best score with the parameters : 0.7543


In [51]:
ada1_train_score = ada_dt.score(X_train, y_train)
ada1_valid_score = ada_dt.score(X_valid, y_valid)
print('Train score: {:.4f}'.format(ada1_train_score))
print('Validation score: {:.4f}'.format(ada1_valid_score))

Train score: 0.9151
Validation score: 0.7782


In [80]:
print("The accuracy score of Decision Tree is: {:.4f}".format(dt_acu))

ada1_acu = ada_dt.score(X_test, y_test)
print("The accuracy score of adaboost with Decision Tree is: {:.4f}".format(ada1_acu))

The accuracy score of Decision Tree is: 0.7805
The accuracy score of adaboost with Decision Tree is: 0.7505


In [53]:
model_Scores.append({'Model Type':'Classification',
                    'Model Name': 'Adaboosting with Decision Tree',
                    'Best Parameters': grid5.best_params_,
                    'Train Score': ada_dt.score(X_train, y_train),
                    'Test Score': ada_dt.score(X_test, y_test)})

- **Adaboost with Decision Tree** gets a score **(0.7505)** slightly lower than **Decision Tree (0.7805)**

### 2. Adaboost with SVM with Kernel Linear

In [54]:
ada_skl = AdaBoostClassifier(svmkl_clf)
param6 = { 
    'n_estimators': [100, 150],
    'learning_rate': [0.5, 0.7]
}

grid6 = GridSearchCV(ada_skl, param6, cv= 5, n_jobs=-1)
grid6.fit(X_train, y_train)

GridSearchCV(cv=5, error_score=nan,
             estimator=AdaBoostClassifier(algorithm='SAMME.R',
                                          base_estimator=SVC(C=0.01,
                                                             break_ties=False,
                                                             cache_size=200,
                                                             class_weight=None,
                                                             coef0=0.0,
                                                             decision_function_shape='ovr',
                                                             degree=3,
                                                             gamma='scale',
                                                             kernel='linear',
                                                             max_iter=-1,
                                                             probability=True,
                                                      

In [55]:
print('Best model parameters : ' + str(grid6.best_params_))
print('Best score with the parameters : {:.4f}'.format(grid6.best_score_))
ada_skl = grid6.best_estimator_

Best model parameters : {'learning_rate': 0.5, 'n_estimators': 100}
Best score with the parameters : 0.7324


In [56]:
ada2_train_score = ada_skl.score(X_train, y_train)
ada2_valid_score = ada_skl.score(X_valid, y_valid)
print('Train score: {:.4f}'.format(ada2_train_score))
print('Validation score: {:.4f}'.format(ada2_valid_score))

Train score: 0.7324
Validation score: 0.7398


In [81]:
print("The accuracy score of SVM with Kernel Linear is: {:.4f}".format(svmkl_clf.score(X_test, y_test)))

ada2_test_score = ada_skl.score(X_test, y_test)
print("The accuracy score of adaboost with SVM with Kernel Linear is: {:.4f}".format(ada2_test_score))

The accuracy score of SVM with Kernel Linear is: 0.7857
The accuracy score of adaboost with SVM with Kernel Linear is: 0.7315


In [58]:
model_Scores.append({'Model Type':'Classification',
                    'Model Name': 'Adaboosting with SVM with Kernel Linear',
                    'Best Parameters': grid6.best_params_,
                    'Train Score': ada_skl.score(X_train, y_train),
                    'Test Score': ada_skl.score(X_test, y_test)})

- **Adaboost with SVM with Kernel Linear** gets a lower score **(0.7315)** than **SVM with Kernel Linear(0.7857)**.

## Task 4: Gradient Boosting Classifier

In [59]:
from sklearn.ensemble import GradientBoostingClassifier

In [60]:
param7 = {
    "learning_rate": [0.3, 0.5, 0.7],
    "max_depth":[3,5,8],
    "n_estimators":[10, 50, 100]
    }

grid7 = GridSearchCV(GradientBoostingClassifier(), param7, cv=5, n_jobs=-1)
grid7.fit(X_train, y_train)

GridSearchCV(cv=5, error_score=nan,
             estimator=GradientBoostingClassifier(ccp_alpha=0.0,
                                                  criterion='friedman_mse',
                                                  init=None, learning_rate=0.1,
                                                  loss='deviance', max_depth=3,
                                                  max_features=None,
                                                  max_leaf_nodes=None,
                                                  min_impurity_decrease=0.0,
                                                  min_impurity_split=None,
                                                  min_samples_leaf=1,
                                                  min_samples_split=2,
                                                  min_weight_fraction_leaf=0.0,
                                                  n_estimators=100,
                                                  n_iter_no_change=None,
         

In [61]:
print('Best model parameters : ' + str(grid7.best_params_))
print('Best score with the parameters : {:.4f}'.format(grid7.best_score_))
gb_clf = grid7.best_estimator_

Best model parameters : {'learning_rate': 0.5, 'max_depth': 3, 'n_estimators': 10}
Best score with the parameters : 0.7982


In [62]:
gb_train_score = gb_clf.score(X_train, y_train)
gb_valid_score = gb_clf.score(X_valid, y_valid)
print('Train score: {:.4f}'.format(gb_train_score))
print('Validation score: {:.4f}'.format(gb_valid_score))

Train score: 0.8225
Validation score: 0.7977


In [63]:
gb_acu = gb_clf.score(X_test, y_test)
print("The accuracy score of Gradient Boosting is: {:.4f}".format(gb_acu))

The accuracy score of Gradient Boosting is: 0.7966


In [64]:
model_Scores.append({'Model Type':'Classification',
                    'Model Name': 'Gradient Boosting',
                    'Best Parameters': grid6.best_params_,
                    'Train Score': gb_clf.score(X_train, y_train),
                    'Test Score': gb_clf.score(X_test, y_test)})

- **Gradient Boosting** gets a score of **0.7966**.

## Result Table of Task 1 to Task 4

In [65]:
table = pd.DataFrame(model_Scores)
table.set_index('Model Name', inplace = True)
table

Unnamed: 0_level_0,Model Type,Best Parameters,Train Score,Test Score
Model Name,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
Hard Voting Classifier,Classification,,0.803189,0.797366
Soft Voting Classifier,Classification,,0.802927,0.791514
Bagging with KNeighbor Classifier,Classification,"{'max_samples': 0.5, 'n_estimators': 100}",0.795609,0.782736
Bagging with Decision Tree,Classification,"{'max_samples': 0.5, 'n_estimators': 50}",0.794825,0.779078
Pasting with Logistic Regression,Classification,"{'max_samples': 0.7, 'n_estimators': 50}",0.800575,0.797366
Pasting with Linear SVM,Classification,"{'max_samples': 0.7, 'n_estimators': 50}",0.801098,0.795903
Adaboosting with Decision Tree,Classification,"{'learning_rate': 0.5, 'n_estimators': 100}",0.915055,0.750549
Adaboosting with SVM with Kernel Linear,Classification,"{'learning_rate': 0.5, 'n_estimators': 100}",0.732358,0.731529
Gradient Boosting,Classification,"{'learning_rate': 0.5, 'n_estimators': 100}",0.82253,0.796635


- The table above is the result table of models from **task 1 to task 4**:
- We can see both **Hard Voting Classifier** and **Pasting with Logistic Regression** achieve the highest test score around 0.7974.
- **Adaboosting with SVM with Kernel Linear** is underpreformance with a test score around 0.7315.

## Task 5: PCA and Apply on all Project 1 models

In [66]:
from sklearn.decomposition import PCA

pca = PCA(n_components=0.95, random_state=0)

X_train_r = pca.fit_transform(X_train)
X_valid_r = pca.transform(X_valid)
X_test_r = pca.transform(X_test)

print('Number of original train set components : ' + str(X_train.shape[1]))
print('Number of train set components after PCA with 95% feature information : ' + str(pca.n_components_))

Number of original train set components : 31
Number of train set components after PCA with 95% feature information : 16


In [67]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC, LinearSVC
from sklearn.tree import DecisionTreeClassifier

In [68]:
knn_pca = KNeighborsClassifier(18)
log_pca = LogisticRegression(C=0.1, penalty='l2')
sr_pca = LogisticRegression(multi_class='multinomial', solver='lbfgs', C=0.01)
lsvm_pca = LinearSVC(C= 0.01)
skl_pca = SVC(C=0.01, kernel='linear')
srbf_pca = SVC(C=1, gamma=0.1, kernel='rbf')
skp_pca = SVC(degree=3, C=0.1)
dt_pca = DecisionTreeClassifier(max_depth=3)

for clf in (knn_pca, log_pca, sr_pca, lsvm_pca, skl_pca, srbf_pca, skp_pca, dt_pca):
    clf.fit(X_train_r, y_train)
    print(clf.__class__.__name__, round(clf.score(X_test_r, y_test), 3))

KNeighborsClassifier 0.772
LogisticRegression 0.778
LogisticRegression 0.774
LinearSVC 0.775
SVC 0.764
SVC 0.775
SVC 0.767
DecisionTreeClassifier 0.771


- Project 1 Result Table <br>

|No|Classifiers|Best Parameters|Original Score|PCA|
|:--|:-----------|:---------------|:-------------|:----------|
|1|KNN|k=18|0.774|0.772|
|2|Logistic Regression|c = 0.1, penalty = l2|0.801|0.778|
|3|Softmax Regression|c = 0.01|0.802|0.774|
|4|Linear SVM|c = 0.01|0.803|0.775|
|5|SVM with Kernel Linear|c=0.01|0.800|0.764| 
|6|SVM with Kernel RBF|c=1, gamma=0.1|0.800|0.775|
|7|SVM with Kernel Polynomial|degree=3, c=0.1|0.798|0.767|
|8|Decision Tree|depth=3|0.791|0.771|

- After using PCA to reduce dimensions, test score of each model is decrease from around **0.80** to around **0.77**.

## Task 6: Deep Learning

In [69]:
from keras.models import Sequential
from keras.layers import Dense
from keras.wrappers.scikit_learn import KerasClassifier

In [70]:
print(X_train.shape)

(3826, 31)


In [71]:
def create_model():
    model = Sequential()
    model.add(Dense(30, input_dim=31, activation="relu"))
    model.add(Dense(15, activation="relu"))
    model.add(Dense(10, activation="relu"))
    model.add(Dense(5, activation="relu"))
    model.add(Dense(1, activation='sigmoid'))
    
    model.compile(loss="mse", optimizer='adam', metrics=["accuracy"])
    return model

In [72]:
np.random.seed(10)
model = KerasClassifier(build_fn=create_model, verbose=0)

params = {'batch_size':[10, 20, 30, 40], 'epochs':[10, 50, 100]}
grid_search = GridSearchCV(estimator=model, param_grid=params, cv=5)

In [73]:
grid_search.fit(X_train, y_train)
grid_search.best_params_

{'batch_size': 30, 'epochs': 10}

In [74]:
batches = grid_search.best_params_['batch_size']
epochs = grid_search.best_params_['epochs']

In [75]:
model = grid_search.best_estimator_
model.fit(X_train, y_train)

<tensorflow.python.keras.callbacks.History at 0x21196cdc5c8>

In [76]:
y_train_pred = model.predict(X_train)
y_valid_pred = model.predict(X_valid)
y_test_pred = model.predict(X_test)

Instructions for updating:
Please use instead:* `np.argmax(model.predict(x), axis=-1)`,   if your model does multi-class classification   (e.g. if it uses a `softmax` last-layer activation).* `(model.predict(x) > 0.5).astype("int32")`,   if your model does binary classification   (e.g. if it uses a `sigmoid` last-layer activation).


In [77]:
print("Train score: {:.4f}".format(accuracy_score(y_train, y_train_pred)))
print("Validation score: {:.4f}".format(accuracy_score(y_valid, y_valid_pred)))
print("Test score: {:.4f}".format(accuracy_score(y_test, y_test_pred)))

Train score: 0.8082
Validation score: 0.8001
Test score: 0.7959


- Training the Neural Network using the original training dataset, we finally get the **test score: 0.7959**.