### <font color='red'>Project 2</font>

Project Description:
- Use same datasets as Project 1.
- Preprocess data: Explore data and apply data scaling.

Classification Task:
- Apply two voting classifiers: one with hard voting and one with soft voting.
- Apply any two models with bagging and any two models with pasting.
- Apply any two models with AdaBoost.
- Apply one model with gradient boosting.
- Apply PCA to the data, then apply all the models from Project 1 again to the PCA-transformed data. Compare the results with those from Project 1. You don't need to apply all the models twice: copy the result table from Project 1, prepare a similar table for all the models after PCA, and compare both tables. Does PCA help in getting better results?
- Apply the deep learning models covered in class.

# Classification

## Data Preprocessing

|No|Variables|Description|
|:--|:--------|:-----------|
|1|customerID|Customer ID|
|2|gender|Whether the customer is a male or a female|
|3|SeniorCitizen|Whether the customer is a senior citizen or not (1, 0)|
|4|Partner|Whether the customer has a partner or not (Yes, No)|
|5|Dependents|Whether the customer has dependents or not (Yes, No)|
|6|tenure|Number of months the customer has stayed with the company|
|7|PhoneService|Whether the customer has a phone service or not (Yes, No)|
|8|MultipleLines|Whether the customer has multiple lines or not (Yes, No, No phone service)|
|9|InternetService|Customer’s internet service provider (DSL, Fiber optic, No)|
|10|OnlineSecurity|Whether the customer has online security or not (Yes, No, No internet service)|
|11|OnlineBackup|Whether the customer has online backup or not (Yes, No, No internet service)|
|12|DeviceProtection|Whether the customer has device protection or not (Yes, No, No internet service)|
|13|TechSupport|Whether the customer has tech support or not (Yes, No, No internet service)|
|14|StreamingTV|Whether the customer has streaming TV or not (Yes, No, No internet service)|
|15|StreamingMovies|Whether the customer has streaming movies or not (Yes, No, No internet service)|
|16|Contract|The contract term of the customer (Month-to-month, One year, Two year)|
|17|PaperlessBilling|Whether the customer has paperless billing or not (Yes, No)|
|18|PaymentMethod|The customer’s payment method (Electronic check, Mailed check, Bank transfer (automatic), Credit card (automatic))|
|19|MonthlyCharges|The amount charged to the customer monthly|
|20|TotalCharges|The total amount charged to the customer|
|21|Churn|Whether the customer churned or not (Yes, No)|

In [22]:
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import MinMaxScaler
from sklearn.neighbors import KNeighborsClassifier
import matplotlib.pyplot as plt

In [23]:
telco = pd.read_csv("telco_o.csv", na_values=['?', ' '])
telco.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7043 entries, 0 to 7042
Data columns (total 22 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   Unnamed: 0        7043 non-null   int64  
 1   customerID        7043 non-null   object 
 2   gender            7043 non-null   object 
 3   SeniorCitizen     7043 non-null   int64  
 4   Partner           7043 non-null   object 
 5   Dependents        6845 non-null   object 
 6   tenure            7043 non-null   int64  
 7   PhoneService      7043 non-null   object 
 8   MultipleLines     7043 non-null   object 
 9   InternetService   7043 non-null   object 
 10  OnlineSecurity    7043 non-null   object 
 11  OnlineBackup      7043 non-null   object 
 12  DeviceProtection  6761 non-null   object 
 13  TechSupport       6722 non-null   object 
 14  StreamingTV       7043 non-null   object 
 15  StreamingMovies   7043 non-null   object 
 16  Contract          7043 non-null   object 


In [24]:
telco_na = telco.isnull().sum()
print(telco_na[telco_na>0])

Dependents          198
DeviceProtection    282
TechSupport         321
TotalCharges         11
dtype: int64


In [25]:
telco.drop(['customerID'], axis=1, inplace=True)

grps = telco.groupby(['Contract', 'MultipleLines'])
telco['DeviceProtection'] = grps['DeviceProtection'].transform(lambda grp: grp.fillna(grp.value_counts().index[0]))

grps = telco.groupby(['MultipleLines'])
telco['TechSupport'] = grps['TechSupport'].transform(lambda grp: grp.fillna(grp.value_counts().index[0]))

telco.dropna(inplace=True)

- Convert categorical features to numerical.

In [26]:
telco["Churn"] = telco["Churn"].map({"No":0, "Yes":1}).astype(int)
telco['gender'] = telco['gender'].map({"Female": 0, "Male":1}).astype(int)
telco["Partner"] = telco["Partner"].map({"Yes": 1, "No": 0}).astype(int)
telco['PhoneService'] = telco['PhoneService'].map({"Yes":1, "No":0}).astype(int)
telco["MultipleLines"] = telco["MultipleLines"].map({"No phone service":0, "No":1, "Yes":2}).astype(int)
telco["OnlineSecurity"] = telco["OnlineSecurity"].map({"No internet service":0, "No":1, "Yes":2}).astype(int)
telco["OnlineBackup"] = telco["OnlineBackup"].map({"No internet service":0, "No":1, "Yes":2}).astype(int)
telco["StreamingMovies"] = telco["StreamingMovies"].map({"No internet service":0, "No":1, "Yes":2}).astype(int)
telco["PaperlessBilling"] = telco["PaperlessBilling"].map({"No":0, "Yes":1}).astype(int)
telco['DeviceProtection'] = telco['DeviceProtection'].map({'No internet service':0, "No":1, "Yes":2}).astype(int)
telco["Dependents"] = telco["Dependents"].map({"Yes": 1, "No": 0}, na_action='ignore').astype(int)
telco['TechSupport'] = telco['TechSupport'].map({'No internet service':0, 'No':1, 'Yes':2}).astype(int)

- Add dummy variables for the InternetService, StreamingTV, StreamingMovies, Contract, and PaymentMethod columns.

In [27]:
its_dummy = pd.get_dummies(telco['InternetService'], columns='InternetService', prefix='ITS') 
telco = pd.concat([telco, its_dummy], axis=1)

In [28]:
telco.drop(['InternetService'], axis=1, inplace=True)

In [29]:
stv_dummy = pd.get_dummies(telco['StreamingTV'], columns='StreamingTV', prefix='STV')
telco = pd.concat([telco, stv_dummy], axis=1)
telco.drop(['StreamingTV'], axis=1, inplace=True)

In [30]:
smv_dummy = pd.get_dummies(telco['StreamingMovies'], columns='StreamingMovies', prefix='SMV')
telco = pd.concat([telco, smv_dummy], axis=1)
telco.drop(['StreamingMovies'], axis=1, inplace=True)

In [31]:
con_dummy = pd.get_dummies(telco['Contract'], columns='Contract', prefix='CTRT')
telco = pd.concat([telco, con_dummy], axis=1)
telco.drop(['Contract'], axis=1, inplace=True)

In [32]:
payment_dummy = pd.get_dummies(telco['PaymentMethod'], columns='PaymentMethod', prefix='Payment')
telco = pd.concat([telco, payment_dummy], axis=1)
telco.drop(['PaymentMethod'], axis=1, inplace=True)

- Convert the TotalCharges column to a numeric (float) type.

In [33]:
telco['TotalCharges'] = pd.to_numeric(telco['TotalCharges'])  # assign back; to_numeric does not modify in place
telco['TotalCharges']

0         29.85
1       1889.50
2        108.15
3       1840.75
4        151.65
         ...   
7038    1990.50
7039    7362.90
7040     346.45
7041     306.60
7042    6844.50
Name: TotalCharges, Length: 6834, dtype: float64

- The data frame now has no missing values, and every feature is numeric.

In [34]:
telco.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 6834 entries, 0 to 7042
Data columns (total 32 columns):
 #   Column                             Non-Null Count  Dtype  
---  ------                             --------------  -----  
 0   Unnamed: 0                         6834 non-null   int64  
 1   gender                             6834 non-null   int32  
 2   SeniorCitizen                      6834 non-null   int64  
 3   Partner                            6834 non-null   int32  
 4   Dependents                         6834 non-null   int32  
 5   tenure                             6834 non-null   int64  
 6   PhoneService                       6834 non-null   int32  
 7   MultipleLines                      6834 non-null   int32  
 8   OnlineSecurity                     6834 non-null   int32  
 9   OnlineBackup                       6834 non-null   int32  
 10  DeviceProtection                   6834 non-null   int32  
 11  TechSupport                        6834 non-null   int32

- Project 1 Result Table

|No|Classifiers|Best Parameters|Accuracy Score|Best Model|
|:--|:-----------|:---------------|:-------------|:----------|
|1|KNN|k=18|0.774| |
|2|Logistic Regression|C=0.1, penalty=l2|0.801| |
|3|Softmax Regression|C=0.01|0.802| |
|4|Linear SVM|C=0.01|0.803|<b>Yes</b>|
|5|SVM with Kernel Linear|C=0.01|0.800| |
|6|SVM with Kernel RBF|C=1, gamma=0.1|0.800| |
|7|SVM with Kernel Polynomial|degree=3, C=0.1|0.798| |
|8|Decision Tree|depth=3|0.791|<b>*</b>|

### Separate Train, Validation, and Test Datasets

In [35]:
from sklearn.model_selection import train_test_split

In [36]:
y = telco['Churn']
X = telco.drop(['Churn'], axis = 1)

In [37]:
X_train_full, X_test_org, y_train_full, y_test = train_test_split(X, y, test_size = 0.2, random_state=0)

X_train_org, X_valid_org, y_train, y_valid = train_test_split(X_train_full, y_train_full, test_size = 0.3, random_state = 1)

In [38]:
scaler = MinMaxScaler()
X_train = scaler.fit_transform(X_train_org)
X_valid = scaler.transform(X_valid_org)
X_test = scaler.transform(X_test_org)

In [39]:
print("train dataset size: ", X_train.shape, "\nvalidation dataset size: ", X_valid.shape, "\ntest dataset size: ", X_test.shape)

train dataset size:  (3826, 31) 
validation dataset size:  (1641, 31) 
test dataset size:  (1367, 31)


## Task 1: Apply two voting classifiers (Hard & Soft)

- For the `Hard Voting Classifier`, we combine Logistic Regression, KNN, and Linear SVM.
- For the `Soft Voting Classifier`, we combine SVM with Kernel Linear, SVM with Kernel RBF, and Decision Tree.
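To make the difference between the two voting schemes concrete, here is a small sketch using NumPy only. The probability values are made up for illustration and do not come from the telco data; the 0.5 threshold and the array names are likewise assumptions. Hard voting counts one vote per classifier, while soft voting averages the predicted probabilities, so a single very confident classifier can change the soft result.

```python
import numpy as np

# Hypothetical class-1 probabilities: rows are samples, columns are three classifiers.
proba = np.array([
    [0.45, 0.40, 0.90],  # two weak "no" votes, one confident "yes"
    [0.60, 0.70, 0.30],
    [0.30, 0.20, 0.10],
    [0.55, 0.55, 0.10],  # two weak "yes" votes, one confident "no"
])

# Hard voting: each classifier votes 0 or 1, and the majority wins.
votes = (proba >= 0.5).astype(int)
hard_pred = (votes.sum(axis=1) >= 2).astype(int)

# Soft voting: average the probabilities, then threshold the mean.
soft_pred = (proba.mean(axis=1) >= 0.5).astype(int)

print("hard:", hard_pred)
print("soft:", soft_pred)
```

The first and last samples are the ones where the two schemes disagree, because the confident minority classifier outweighs the two hesitant ones once probabilities are averaged.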

### 1. Voting Classifier (Hard): Logistic Regression, KNN, Linear SVM

In [44]:
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import LinearSVC
from sklearn.ensemble import VotingClassifier

In [47]:
from sklearn.metrics import accuracy_score

# Base estimators, using the best parameters from the Project 1 result table
log_clf = LogisticRegression(C=0.1, penalty='l2')
knn_clf = KNeighborsClassifier(n_neighbors=18)
lsvm_clf = LinearSVC(C=0.01)

# Hard voting: predict the class that receives the most votes
voting_clf = VotingClassifier(estimators=[('lr', log_clf), ('knn', knn_clf), ('lsvc', lsvm_clf)], voting='hard')

# Fit each classifier once and report its accuracy on the test set
for clf in (log_clf, knn_clf, lsvm_clf, voting_clf):
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    print(clf.__class__.__name__, round(accuracy_score(y_test, y_pred), 4))

LogisticRegression 0.7966
KNeighborsClassifier 0.7747
LinearSVC 0.7981
VotingClassifier 0.7974


### 2. Voting Classifier (Soft): SVM with Kernel Linear, SVM with Kernel RBF, Decision Tree

In [51]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC

In [54]:
from sklearn.metrics import accuracy_score

# Base estimators, using the best parameters from the Project 1 result table.
# probability=True is required so that SVC exposes predict_proba for soft voting.
svmkl_clf = SVC(kernel='linear', C=0.01, probability=True)
svmrb_clf = SVC(kernel='rbf', C=1, gamma=0.1, probability=True)
dt_clf = DecisionTreeClassifier(max_depth=3, random_state=0)

# Soft voting: average the predicted class probabilities and pick the largest
voting_clf = VotingClassifier(estimators=[('svmkl', svmkl_clf), ('svmrb', svmrb_clf), ('dt', dt_clf)], voting='soft')

# Fit each classifier once and report its accuracy on the test set
for clf in (svmkl_clf, svmrb_clf, dt_clf, voting_clf):
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    print(clf.__class__.__name__, round(accuracy_score(y_test, y_pred), 4))

SVC 0.7857
SVC 0.7835
DecisionTreeClassifier 0.7805
VotingClassifier 0.7901


## Task 2: Apply any two models with bagging and any two models with pasting.
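A minimal sketch of one way this task could be approached, assuming scikit-learn's `BaggingClassifier`. It uses a synthetic dataset from `make_classification` as a stand-in for the scaled telco data; the names `X_demo`, `base_models`, and `scores` and all hyperparameter values are illustrative assumptions, not results from this project. The only difference between the two modes is the `bootstrap` flag: sampling with replacement gives bagging, and sampling without replacement gives pasting.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption: replace with X_train / y_train etc.).
X_demo, y_demo = make_classification(n_samples=600, n_features=10, random_state=0)
Xtr, Xte, ytr, yte = train_test_split(X_demo, y_demo, test_size=0.25, random_state=0)

# Two illustrative base models.
base_models = {
    "tree": DecisionTreeClassifier(max_depth=3, random_state=0),
    "knn": KNeighborsClassifier(n_neighbors=18),
}

scores = {}
for model_name, base in base_models.items():
    for mode, bootstrap in (("bagging", True), ("pasting", False)):
        # Each of the 30 estimators is trained on a random 50% of the training rows.
        clf = BaggingClassifier(base, n_estimators=30, max_samples=0.5,
                                bootstrap=bootstrap, random_state=0)
        clf.fit(Xtr, ytr)
        scores[(model_name, mode)] = clf.score(Xte, yte)
        print(model_name, mode, round(scores[(model_name, mode)], 4))
```

On the real data, the same loop would take the scaled training split for fitting and the validation or test split for scoring.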

## Task 3: Apply any two models with AdaBoost
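A hedged sketch of how two AdaBoost ensembles might be set up, assuming scikit-learn's `AdaBoostClassifier`. The data is synthetic (`make_classification`), and the two base models (a depth-1 decision tree and a logistic regression), the variable names, and the hyperparameter values are assumptions chosen for illustration. Any base model used here must accept `sample_weight` in `fit`, because AdaBoost reweights the misclassified samples at each round.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption: replace with the scaled telco splits).
X_demo, y_demo = make_classification(n_samples=600, n_features=10, random_state=0)
Xtr, Xte, ytr, yte = train_test_split(X_demo, y_demo, test_size=0.25, random_state=0)

# Two illustrative weak learners, both of which support sample weights.
weak_learners = {
    "stump": DecisionTreeClassifier(max_depth=1, random_state=0),
    "logreg": LogisticRegression(C=0.1),
}

ada_scores = {}
for name, base in weak_learners.items():
    # The base estimator is passed positionally so the call does not depend
    # on the keyword name, which changed between scikit-learn versions.
    ada_clf = AdaBoostClassifier(base, n_estimators=50, learning_rate=0.5, random_state=0)
    ada_clf.fit(Xtr, ytr)
    ada_scores[name] = ada_clf.score(Xte, yte)
    print(name, round(ada_scores[name], 4))
```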

## Task 4: Apply one model with gradient boosting
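One possible starting point, assuming scikit-learn's `GradientBoostingClassifier` and synthetic stand-in data. The hyperparameters and names such as `gb_clf`, `staged_acc`, and `best_n` are illustrative assumptions. The sketch also uses `staged_predict` to look at validation accuracy after each added tree, which is one way to choose the number of trees without refitting.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption: replace with the telco train/validation splits).
X_demo, y_demo = make_classification(n_samples=600, n_features=10, random_state=0)
Xtr, Xva, ytr, yva = train_test_split(X_demo, y_demo, test_size=0.3, random_state=1)

gb_clf = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1,
                                    max_depth=3, random_state=0)
gb_clf.fit(Xtr, ytr)

# Validation accuracy after 1, 2, ..., 100 trees.
staged_acc = [accuracy_score(yva, pred) for pred in gb_clf.staged_predict(Xva)]
best_n = int(np.argmax(staged_acc)) + 1  # +1 because stages are counted from 1

print("best number of trees:", best_n)
print("validation accuracy at best_n:", round(staged_acc[best_n - 1], 4))
```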

## Task 5: Apply PCA on data and then apply all the models in project 1 again on data you get from PCA. 
- Compare your results with results in project 1. You don't need to apply all the models twice. Just copy the result table from project 1, prepare similar table for all the models after PCA and compare both tables. Does PCA help in getting better results?
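A minimal sketch of the comparison described above, assuming scikit-learn's `PCA` inside a `Pipeline` and synthetic stand-in data. Only logistic regression with `C=0.1` (the value from the Project 1 table) is shown; the 95% explained-variance threshold and the variable names are assumptions, and the same pattern would be repeated for the other Project 1 models. Fitting PCA inside the pipeline ensures the projection is learned from the training split only.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in data (assumption: replace with the unscaled telco splits).
X_demo, y_demo = make_classification(n_samples=600, n_features=10, random_state=0)
Xtr, Xte, ytr, yte = train_test_split(X_demo, y_demo, test_size=0.2, random_state=0)

# Same model with and without PCA; a float n_components keeps enough
# components to explain at least that fraction of the variance.
plain = make_pipeline(MinMaxScaler(), LogisticRegression(C=0.1))
with_pca = make_pipeline(MinMaxScaler(), PCA(n_components=0.95), LogisticRegression(C=0.1))

pca_scores = {}
for name, model in (("without PCA", plain), ("with PCA", with_pca)):
    model.fit(Xtr, ytr)
    pca_scores[name] = model.score(Xte, yte)
    print(name, round(pca_scores[name], 4))

pca_step = with_pca.named_steps["pca"]
n_kept = pca_step.n_components_
print("components kept:", n_kept, "of", X_demo.shape[1])
```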

## Task 6: Apply deep learning models covered in class