## Agenda
<ul>
<li><a href="#sources">Sources</a></li>
<li><a href="#cleaning">Data Cleaning</a></li>
<li><a href="#eda">Exploratory Data Analysis and some Feature Engineering</a></li>
<li><a href="#model">Modeling</a></li>
<li><a href="#conc">Conclusion</a></li>
</ul>

<a id='sources'></a>
# Sources
[Telco Customer Churn](https://www.kaggle.com/blastchar/telco-customer-churn)

#### Some of these ideas are inspired by [Muslum Polat](https://www.kaggle.com/muslump/telco-customer-churn-analysis?fbclid=IwAR0gRroMTTbjUQCzxf6Rp2FxDVu4n16pTRTcRPnCr9mqRzbu6hF0AZM5bz4)

In [None]:
# For Loading and Manipulating the data
import numpy as np
import pandas as pd
from itertools import combinations

# For splitting, scaling and upsampling the data respectively
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.utils import resample

# Models
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC

# For Evaluation 
from sklearn.metrics import classification_report, confusion_matrix


# For Visualization Purposes 
import seaborn as sns
import matplotlib.pyplot as plt

%matplotlib inline

# To display all the columns ( regardless of their number or their width )
pd.set_option('display.max_columns', None)
pd.set_option('display.max_colwidth', None)

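# To change the style of the plots ( so that we all can see the same thing :) )
# Note: newer Matplotlib releases renamed the 'seaborn' style to 'seaborn-v0_8'
plt.style.use('seaborn-v0_8' if 'seaborn-v0_8' in plt.style.available else 'seaborn')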

In [None]:
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

In [None]:
churn_df = pd.read_csv('/kaggle/input/telco-customer-churn/WA_Fn-UseC_-Telco-Customer-Churn.csv')

<a id='cleaning'></a>
# Data Cleaning

### First let's take a look at the data to know how to clean it

Overall check on columns

In [None]:
churn_df.head()

In [None]:
# First of all, let's rename some columns so that all the column names follow the same pattern
churn_df.rename(columns={'customerID':'CustomerID', 'gender':'Gender', 'tenure':'Tenure'}, inplace=True)

A closer look at the column types

In [None]:
churn_df.info()

Before going on, we can see that "TotalCharges" has an object dtype even though it is a numeric feature.

_Let's test for what I call **"Hidden NaNs"**_

In [None]:
" " in churn_df.values

> It looks like we have hidden NaNs (blank strings) in our dataset :). We will deal with them later.
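
A blank string is not counted by `isnull()`, which is why the check above is needed. As a minimal sketch on a toy column (not the real dataset), one hedged way to count such hidden NaNs is to see which entries fail numeric parsing with `pd.to_numeric(..., errors="coerce")`:

```python
import pandas as pd

# Toy column, not the real dataset: one entry is a blank string
toy = pd.Series(["29.85", " ", "108.15"], name="TotalCharges")

# Entries that cannot be parsed as numbers become NaN instead of raising an error
as_numbers = pd.to_numeric(toy, errors="coerce")

# Hidden NaNs = values that are NaN after parsing but were not NaN before
n_hidden = int(as_numbers.isna().sum() - toy.isna().sum())
print(n_hidden)
```

The same call also performs the string-to-float conversion, so it could replace a separate `astype(float)` step if that is preferred.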

Statistical Summary

In [None]:
churn_df.describe().drop(columns='SeniorCitizen') # SeniorCitizen is dropped from the statistical description
                                                  # because it is stored as a number but is really a Yes/No flag

Checking Duplicates

In [None]:
churn_df.duplicated().sum()

#### Summary: 
> ##### We can see that there are:
   - _Useless columns_  : "CustomerID"
   - _Hidden NaNs_ : " "
   - _Wrong-format Columns_ : 
      - _toObject_ : "SeniorCitizen" ( this is not essential, but I prefer it as 'Yes' and 'No' )
      - _toFloat_  : "TotalCharges"

_Useless Columns_

In [None]:
churn_df.drop(columns='CustomerID', inplace=True)

**Test**

In [None]:
churn_df.columns

_Hidden NaNs_

In [None]:
churn_df.replace(' ', np.nan, inplace=True)

**Test**

In [None]:
churn_df.isnull().sum()

> We can see that there are 11 missing values in the "TotalCharges" column. Let's take a closer look at them.

In [None]:
churn_df[churn_df['TotalCharges'].isnull()]

Before deciding what to do...

In [None]:
churn_df.shape

In [None]:
churn_df['Churn'].value_counts()

In [None]:
churn_df[churn_df['TotalCharges'].isnull()]['Churn'].value_counts()

> We can see that all the rows with missing values have Churn = 'No'. In the whole dataset, 'No' is already the majority class, so I think dropping these 11 rows will not affect the data dramatically.
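
To put a number on "not dramatically", here is a small worked calculation. It assumes the row count after the drop equals the sum of the class counts quoted in the demographic section further down (476 + 666 + 1393 + 4497); the variable names are only illustrative.

```python
# Counts quoted in the demographic section, i.e. after the 11 rows are dropped
rows_after = 476 + 666 + 1393 + 4497
rows_dropped = 11
rows_before = rows_after + rows_dropped

# Share of the dataset that is lost by dropping the rows
pct_dropped = 100 * rows_dropped / rows_before
print(f"{rows_dropped} of {rows_before} rows ({pct_dropped:.2f}%)")
```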

In [None]:
churn_df.dropna(inplace=True)

**Test**

In [None]:
churn_df.isnull().sum()

_Wrong-format Columns_

In [None]:
# toObject               
churn_df["SeniorCitizen"] = churn_df["SeniorCitizen"].map({1:'Yes', 0:'No'})    

# toFloat
churn_df["TotalCharges"]  = churn_df["TotalCharges"].astype(float) 

**Test**

In [None]:
churn_df["SeniorCitizen"].unique()

In [None]:
churn_df["TotalCharges"].dtype

<font color='green'>
<h2><center> I think the data is clean now :) </center></h2>

<a id='eda'></a>
# Now it's time for some Exploration

<font color='blue'>
    <h5> Some Helping Functions </h5>

In [None]:
def CountPlot(dataFrame, x, hue=None, ax=None):
    # Main plot
    ax = sns.countplot(data=dataFrame, x=x, hue=hue, ax=ax)
    
    ## Adding Annotation 
    # Total number of clients
    n_clients = dataFrame.shape[0]
    
    # Looping over each column
    for p in ax.patches:

        loc    = p.get_x()
        height = p.get_height()
        width  = p.get_width()
        pct    = '({:0.2f}%)'.format(100*height/n_clients)
        
        # Adding the exact height at the top
        ax.text(loc+width/2, height+3 , str(int(height)), weight = 'bold',ha="center", fontsize=15)
        
        # Adding the percentage wrt the total number of clients at the middle of each column
        ax.text(loc+width/2, int(0.5*height), pct, weight = 'bold',ha="center", fontsize=15, color='w')
        
    # Adding title
    ax.set_title(f"{x} Distribution", fontsize=25, color='brown')
    
    # Before editing the ticks we need to draw the plot first
    plt.draw()
    
    # Editing axes labels and ticks
    ax.set_xlabel(x, fontsize=20)
    ax.set_xticklabels(ax.get_xticklabels(), fontsize=15)
        
    ax.set_ylabel('Number of Users', fontsize=20)
    ax.set_yticklabels(ax.get_yticklabels(), fontsize=15);
        
    # Adding legend
    if hue:
        ax.legend(labels=list(dataFrame[hue].unique()),  prop={"size":20}, frameon=True, shadow=True);

In [None]:
def ScatterPlot(dataFrame, x, y, hue=None, ax=None):
    # Main plot
    ax = sns.scatterplot(data=dataFrame, x=x, y=y, hue=hue, ax=ax, alpha=0.7)
    
    # Adding title
    corr = dataFrame[x].corr(dataFrame[y])
    ax.set_title(f"{x} with {y} by {hue}\n (Corr = {round(corr, 2)})", fontsize=25, color='brown')
    
    # Before editing the ticks we need to draw the plot first
    plt.draw()
    
    # Editing axes labels
    ax.set_xlabel(x, fontsize=20)
    ax.set_xticklabels(ax.get_xticklabels(), fontsize=15)
    
    ax.set_ylabel(y, fontsize=20)
    ax.set_yticklabels(ax.get_yticklabels(), fontsize=15);
    
    # Adding legend
    if hue:
        ax.legend(prop={"size":13}, frameon=True, shadow=True);

In [None]:
def kdeplot_churn(dataFrame, col, ax=None):
    # Main plot
    ax = sns.kdeplot(dataFrame[col][dataFrame["Churn"] == 'Yes'], color="Red", ax=ax, shade=True)
    ax = sns.kdeplot(dataFrame[col][dataFrame["Churn"] == 'No'], color="Blue", ax=ax, shade=True)
    
    # Adding title
    ax.set_title(f"Distribution of {col} by churn", fontsize=17, color='brown')
    
    # Before editing the ticks we need to draw the plot first
    plt.draw()
    
    # Editing axes labels
    ax.set_xlabel(col, fontsize=15)
    ax.set_xticklabels(ax.get_xticklabels(), fontsize=15)
    
    ax.set_ylabel('Density', fontsize=15)
    ax.set_yticklabels(ax.get_yticklabels(), fontsize=15)
    
    # Adding legend
    ax.legend(["Churn","Not Churn"], loc='upper right', frameon=True, shadow=True);

First let's look at the **"Churn"** distribution

In [None]:
CountPlot(churn_df, 'Churn')

> It looks like this data is **imbalanced**
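
The imports at the top bring in `resample` for upsampling. As a rough sketch of how the minority class could be upsampled, here is a toy frame (not the real data); in a real workflow this would normally be applied to the training split only, so that duplicated rows do not leak into the test set.

```python
import pandas as pd
from sklearn.utils import resample

# Toy data, not the real dataset: 6 'No' rows and 2 'Yes' rows
toy = pd.DataFrame({"Tenure": [1, 2, 3, 40, 50, 60, 70, 5],
                    "Churn":  ["Yes", "Yes", "No", "No", "No", "No", "No", "No"]})

majority = toy[toy["Churn"] == "No"]
minority = toy[toy["Churn"] == "Yes"]

# Sample the minority class with replacement until it matches the majority size
minority_up = resample(minority, replace=True, n_samples=len(majority), random_state=0)

balanced = pd.concat([majority, minority_up])
print(balanced["Churn"].value_counts())
```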

Before diving deeper into the exploration phase, let's first **divide** our columns into three groups:
- Demographic 
- Services 
- Account

In [None]:
Demographic_cols = ['Gender', 'SeniorCitizen', 'Partner', 'Dependents', 'Churn']
Services_cols    = ['PhoneService', 'MultipleLines', 'InternetService', 'OnlineSecurity', 'OnlineBackup', 
                    'DeviceProtection', 'TechSupport', 'StreamingTV', 'StreamingMovies', 'Churn']
Account_cols_cat = ['Contract', 'PaperlessBilling', 'PaymentMethod', 'Churn']
Account_cols_num = ['Tenure', 'MonthlyCharges', 'TotalCharges', 'Churn']

### Demographic Features...

#### Univariate Exploration 

In [None]:
fig = plt.figure(figsize=(17, 12))
fig.suptitle('Demographic Features Distributions', fontsize=40, weight='bold')
for i, col in enumerate(Demographic_cols[:-1]):
    sorted_counts = churn_df[col].value_counts()
    plt.subplot(2, 2, i+1)
    plt.pie(sorted_counts, labels = sorted_counts.index, startangle = 90, autopct='%1.2f%%', 
                 counterclock = False, radius = 1.2, textprops={'fontsize': 14})
    plt.title(f'{col} Distribution',fontsize=15, weight='bold', color='brown', loc='center')

> Summary:
- Gender and Partner are roughly balanced.
- On the other hand, the majority of clients are not senior citizens and have no dependents. So we should keep this imbalance in mind when we judge the upcoming results.

#### Bivariate Exploration

In [None]:
fig, axes = plt.subplots(nrows = 2, ncols = 2, figsize = (22,20))
fig.suptitle('Demographic Features Distributions by Churn', fontsize=40, weight='bold')
for i, col in enumerate(Demographic_cols[:-1]):
    CountPlot(churn_df[Demographic_cols], col, hue="Churn", ax=axes[i//2, i%2])

> **We can see that:**
- The churn rate:
   - Is very close for male and female clients.
   - Is higher for senior clients ($\frac{476}{476+666} = {41.68}\% $) than for younger ones ($\frac{1393}{1393+4497} = {23.65}\% $).
   - Is higher for clients who have no partner ($\frac{1200}{1200+2439} = {32.98}\% $) than for those who have a partner ($\frac{669}{669+2724} = {19.72}\% $).
   - Is also higher for clients who have no dependents ($\frac{1543}{1543+3390} = {31.28}\% $) than for those who have dependents ($\frac{326}{326+1773} = {15.53}\% $).
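
The rates above were computed by hand from the bar heights. As a minimal sketch, the same per-category churn rate can be obtained with a group-by; `churn_rate_by` is a hypothetical helper name and the frame below is toy data, not the real dataset.

```python
import pandas as pd

def churn_rate_by(df, col, target="Churn", positive="Yes"):
    """Percentage of rows with target == positive inside each category of col."""
    return df[target].eq(positive).groupby(df[col]).mean().mul(100).round(2)

# Toy data: 4 senior clients (2 churned) and 6 non-senior clients (1 churned)
toy = pd.DataFrame({
    "SeniorCitizen": ["Yes"] * 4 + ["No"] * 6,
    "Churn": ["Yes", "Yes", "No", "No"] + ["Yes"] + ["No"] * 5,
})

rates = churn_rate_by(toy, "SeniorCitizen")
print(rates)
```

On the real data, a call such as `churn_rate_by(churn_df, 'SeniorCitizen')` should reproduce the fractions written out above, provided the columns still hold their original 'Yes'/'No' values.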

### Feature Engineering For Demographic Features

In [None]:
# I just need, for now, to convert the Gender column to ones and zeros
churn_df['Gender'] = np.where(churn_df['Gender'] == 'Male', 1, 0)

# Detect if the client has neither Partner nor Dependents ( both conditions must hold, hence & and not | )
churn_df['NoDep_NoPart'] = np.where((churn_df['Partner'] == 'No') & (churn_df['Dependents'] == 'No'), 1, 0)

# Senior or not
churn_df['SeniorCitizen'] = np.where((churn_df['SeniorCitizen'] == 'Yes'), 1, 0)

Let's drop "Partner" and "Dependents" columns

In [None]:
# Dropping the unnecessary columns ( according to the above analysis )
churn_df.drop(columns=["Partner", "Dependents"], inplace=True)
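
The `NoDep_NoPart` flag is described as "neither Partner nor Dependents", which corresponds to a logical AND of the two conditions. The toy example below (illustrative column names, not the real data) contrasts AND with OR to show how the two flags differ:

```python
import numpy as np
import pandas as pd

# All four Partner/Dependents combinations
toy = pd.DataFrame({"Partner":    ["No", "No",  "Yes", "Yes"],
                    "Dependents": ["No", "Yes", "No",  "Yes"]})

no_partner = toy["Partner"] == "No"
no_dependents = toy["Dependents"] == "No"

# AND: the client has neither a partner nor dependents
toy["neither"] = np.where(no_partner & no_dependents, 1, 0)

# OR: the client lacks at least one of the two
toy["at_least_one_missing"] = np.where(no_partner | no_dependents, 1, 0)
print(toy)
```

Only the first row is flagged by the AND version, while the OR version flags three of the four rows.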

### Services Features...

#### Univariate Exploration

In [None]:
fig = plt.figure(figsize=(17, 17))
fig.suptitle('Services Features Distributions', fontsize=40, weight='bold')
for i, col in enumerate(Services_cols[:-1]):
    sorted_counts = churn_df[col].value_counts()
    plt.subplot(3, 3, i+1)
    plt.pie(sorted_counts, labels = sorted_counts.index, startangle = 90, autopct='%1.2f%%', 
                 counterclock = False, radius = 1.2, textprops={'fontsize': 14})
    plt.title(f'{col} Distribution',fontsize=15, weight='bold', color='brown', loc='center')

> Let's get the bivariate plots before jumping to any conclusions.

#### Bivariate Exploration

In [None]:
fig, axes = plt.subplots(nrows = 3, ncols = 3, figsize = (32,30))
fig.suptitle('Services Features Distributions by Churn', fontsize=50, weight='bold')
for i, col in enumerate(Services_cols[:-1]):
    CountPlot(churn_df[Services_cols], col, hue="Churn", ax=axes[i//3, i%3])

> **According to the univariate distributions, we can't rely on the raw counts alone. We should compute the churn rate for each category so that we can compare them ( as we did for the demographic features ):**
- The churn rate:
   - Is very close for clients who have phone service ($\frac{1699}{1699+4653} = {26.75}\% $) and those who don't ($\frac{170}{170+510} = {25}\% $).
   - Is a little higher for clients who have MultipleLines ($\frac{850}{850+2117} = {28.65}\% $) compared to the others ($rate_{NoPhoneService}=\frac{170}{170+510} = {25}\% $ and $rate_{No}=\frac{849}{849+2536} = {25.08}\% $). Maybe it's expensive or something.
   - Is relatively high for clients that use Fiber optic for their internet service ($\frac{1297}{1297+1799} = {41.89}\% $).
   - Is relatively high for clients that do not have OnlineSecurity, OnlineBackup, DeviceProtection, and TechSupport (41.78%, 39.94%, 39.14%, and 41.65% respectively).
   - Is also a little higher for the clients who do not have StreamingTV or StreamingMovies.

### Feature Engineering For Services Features

In [None]:
# Phone Service
churn_df['PhoneService'] = np.where(churn_df['PhoneService']=='Yes', 1, 0)

# MultipleLines
churn_df['MultipleLines'] = np.where(churn_df['MultipleLines']=='Yes', 1, 0)

# Has Fiber optic 
churn_df['FiberOptic'] = np.where(churn_df['InternetService']=='Fiber optic', 1, 0)

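# Has none of these services: OnlineSecurity, OnlineBackup, DeviceProtection, TechSupport
# ( the flag is 1 only when all four are 'No', so that it matches its name )
churn_df['NoServ'] = np.where((churn_df['OnlineSecurity'] == 'No') & (churn_df['OnlineBackup'] == 'No') &
                              (churn_df['DeviceProtection'] == 'No') & (churn_df['TechSupport'] == 'No'), 1, 0)

# Has neither StreamingTV nor StreamingMovies
churn_df['NoStream'] = np.where((churn_df['StreamingTV'] == 'No') & (churn_df['StreamingMovies'] == 'No'), 1, 0)
 
# Number of internet add-on services subscribed by each client
# ( selected by name; these are the same six columns as the positional slice iloc[:, 6:12] )
int_services = ['OnlineSecurity', 'OnlineBackup', 'DeviceProtection', 'TechSupport', 'StreamingTV', 'StreamingMovies']
churn_df["SumOfIntServices"] = (churn_df[int_services] == 'Yes').sum(axis=1)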
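
Counting subscribed services depends on picking exactly the right columns, and selecting them by name is less fragile than relying on their positions. A minimal sketch on toy data (three of the service columns only, values invented for illustration):

```python
import pandas as pd

# Toy data: the third client has no internet service at all
toy = pd.DataFrame({
    "OnlineSecurity": ["Yes", "No",  "No internet service"],
    "OnlineBackup":   ["Yes", "Yes", "No internet service"],
    "TechSupport":    ["No",  "Yes", "No internet service"],
})

service_cols = ["OnlineSecurity", "OnlineBackup", "TechSupport"]

# Compare every cell to 'Yes', then count the True values per row
toy["SumOfIntServices"] = (toy[service_cols] == "Yes").sum(axis=1)
print(toy["SumOfIntServices"].tolist())
```

Note that 'No internet service' is neither 'Yes' nor 'No', so it contributes nothing to the count.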

In [None]:
# Dropping....
churn_df.drop(columns=["InternetService", "OnlineSecurity", "OnlineBackup", "DeviceProtection", 
                       "TechSupport", "StreamingTV", "StreamingMovies"], inplace=True)

### Categorical Account Features...

#### Univariate Exploration

In [None]:
fig = plt.figure(figsize=(17, 10))
fig.suptitle('Account Categorical Features Distributions', fontsize=40, weight='bold')
for i, col in enumerate(Account_cols_cat[:-1]):
    sorted_counts = churn_df[col].value_counts()
    plt.subplot(3, 1, i+1)
    plt.pie(sorted_counts, labels = sorted_counts.index, startangle = 90, autopct='%1.2f%%', 
                 counterclock = False, radius = 1.2, textprops={'fontsize': 14})
    plt.title(f'{col} Distribution',fontsize=15, weight='bold', color='brown', loc='center')

>To Complete our story, Let's go to Bivariate Exploration.

#### Bivariate Exploration

In [None]:
fig, axes = plt.subplots(nrows = 3, ncols = 1, figsize = (17,22))
fig.suptitle('Account Categorical Features Distributions by Churn', fontsize=25, weight='bold')
for i, col in enumerate(Account_cols_cat[:-1]):
    CountPlot(churn_df[Account_cols_cat], col, hue="Churn", ax=axes[i])

> **We can see that:**
- The churn rate:
   - Is high for clients that have a month-to-month contract ($\frac{1655}{1655+2220} = {42.71}\% $). That is reasonable, by the way, since a client who intended to stay longer could have chosen a longer contract.
   - Is high for clients that have paperless billing ($\frac{1400}{1400+2768} = {33.59}\% $). Maybe there is a problem with the website or something.
   - Is very high for clients that pay by electronic check ($\frac{1071}{1071+1294} = {45.29}\% $). Maybe the GUI or the website is not good enough.

### Feature Engineering...

In [None]:
# According to the above note....
churn_df['MonthToMonth'] = np.where((churn_df['Contract'] == 'Month-to-month'), 1,0)
churn_df['PaperlessBilling'] = np.where((churn_df['PaperlessBilling'] == 'Yes'), 1,0)
churn_df['ElectronicCheck'] = np.where((churn_df['PaymentMethod'] == 'Electronic check'), 1,0)

In [None]:
# Dropping...
churn_df.drop(columns=['Contract', 'PaymentMethod'], inplace=True)

### Numerical Account Features...

#### Univariate Exploration

In [None]:
fig, axes = plt.subplots(nrows = 1, ncols = 3, figsize = (17,6))
fig.suptitle('Numerical Account Features Distributions by Churn', fontsize=25, weight='bold')
for i in range(3):
    kdeplot_churn(churn_df, Account_cols_num[i], ax=axes[i])

> We can see that:
 - Clients with a tenure of 0 to roughly 20 months are more likely to churn.
 - Clients whose monthly charges are between 60 and 120 dollars are more likely to churn.
 - There is only a small difference between the two density curves for the total charges.

In [None]:
fig, axes = plt.subplots(nrows = 3, ncols = 1, figsize = (15,30))
combs = list(combinations(Account_cols_num[:-1], 2))
for i in range(3):
    ScatterPlot(churn_df, combs[i][0], combs[i][1], hue=Account_cols_num[-1], ax=axes[i])

> **We can see that:**
- _From the scatter plots:_
    - There is no specific pattern between Tenure and MonthlyCharges.
    - But we do see that TotalCharges is correlated with both MonthlyCharges and Tenure, which is reasonable, by the way. So I will keep only Tenure and MonthlyCharges and drop TotalCharges.
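
One plausible reason for this pattern is that total charges are roughly tenure times monthly charges. The sketch below uses synthetic data under exactly that assumption (it is not the real dataset, and the value ranges are made up) to show that such a product correlates with both factors even when the factors are independent of each other:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic, independent tenure (months) and monthly charges (dollars)
tenure = rng.integers(1, 73, size=2000)
monthly = rng.uniform(20, 120, size=2000)

# Assumption for illustration: total is exactly tenure * monthly
total = tenure * monthly

# Rows are variables: 0 = tenure, 1 = monthly, 2 = total
corr = np.corrcoef([tenure, monthly, total])
print(np.round(corr, 2))
```

Because the product carries no information beyond its two factors in this idealised setting, dropping it loses little.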


### Feature Engineering...

In [None]:
pd.qcut(churn_df["MonthlyCharges"],3).unique()  

In [None]:
# According to the above notes...
churn_df["tenure_L20"]=pd.qcut(churn_df["Tenure"],3)                       # 3 quantile bins, so that one of the categories is (0.999, 14.0]
churn_df["MonthlyCharges_60_120"] = pd.qcut(churn_df["MonthlyCharges"],3)  # 3 quantile bins, so that one of the categories is (84.0, 118.75]

In [None]:
# Dropping 
churn_df.drop(columns=["Tenure", "MonthlyCharges", "TotalCharges"], inplace=True)
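
`pd.qcut` builds equal-frequency bins, so the bin edges come from the data rather than from the thresholds read off the density plots. The toy sketch below (invented tenure values) contrasts the quantile bins with a direct threshold flag at 20 months, which is an alternative and not what the notebook does:

```python
import pandas as pd

tenure = pd.Series([1, 5, 12, 19, 30, 72], name="Tenure")

# Equal-frequency bins: each of the 3 bins receives the same number of rows
quantile_bins = pd.qcut(tenure, 3)
print(quantile_bins.value_counts())

# Alternative: a fixed threshold taken from the plot (tenure of at most 20 months)
short_tenure = (tenure <= 20).astype(int)
print(short_tenure.tolist())
```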

## Final Touch

In [None]:
churn_df.head()

In [None]:
# To avoid the dummy variable trap, set drop_first=True
churn_df = pd.get_dummies(data=churn_df, columns=['tenure_L20', 'MonthlyCharges_60_120'], drop_first=True)

# As for Churn, we don't need LabelEncoder as it's only 'Yes' or 'No'
churn_df['Churn'] = np.where(churn_df['Churn']=='Yes', 1, 0)
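
The "trap" is that a full set of dummy columns always sums to 1 per row, so any one column is determined by the others. A small sketch on a toy column (the name `bin` is only illustrative) shows what `drop_first=True` removes:

```python
import pandas as pd

toy = pd.DataFrame({"bin": ["low", "mid", "high", "mid"]})

# Full encoding: one column per category, each row sums to exactly 1
full = pd.get_dummies(toy, columns=["bin"])

# Reduced encoding: the first category (alphabetically 'high') is dropped
reduced = pd.get_dummies(toy, columns=["bin"], drop_first=True)

print(list(full.columns))
print(list(reduced.columns))
```

A row with zeros in all the remaining columns then stands for the dropped category.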

**To check whether I did something wrong**

In [None]:
churn_df.info()

In [None]:
churn_df.isnull().sum()        

In [None]:
churn_df.shape

<font color='green'>
<h2><center> Hoooooraaaay, It is time for Modeling :) </center></h2>

<a id='model'></a>
<font color='blue'>
<h2><center> Modeling </center></h2>

#### 1- First let's split the features ( X ) from the target ( Y )

In [None]:
x = churn_df.drop(columns=['Churn'])
y = churn_df['Churn']

#### 2- Splitting the data into training and testing sets

In [None]:
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=0)

#### 3- Scaling Transformation

In [None]:
sc = StandardScaler()
x_train = sc.fit_transform(x_train)
x_test = sc.transform(x_test)
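
The scaler is fitted on the training set only and then reused on the test set, so no test-set statistics leak into training. A NumPy-only sketch of the same idea on toy arrays (the numbers are arbitrary):

```python
import numpy as np

train = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0], [4.0, 40.0]])
test = np.array([[2.5, 25.0], [10.0, 0.0]])

# Statistics come from the training data only
mean = train.mean(axis=0)
std = train.std(axis=0)  # population std (ddof=0), the convention StandardScaler uses

train_scaled = (train - mean) / std
test_scaled = (test - mean) / std  # reuse the training statistics, do not refit

print(train_scaled.mean(axis=0), train_scaled.std(axis=0))
print(test_scaled)
```

The scaled test set is not expected to have zero mean and unit variance itself; it is only expressed in the training set's units.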

#### 4- Building our Models

- **Note:** Why do we care about precision here? Because if the model predicts that a client has left the company when they actually haven't (a false positive, FP), that is very bad for the company.

In [None]:
def get_precision(y_test, y_pred):
    CM = confusion_matrix(y_test, y_pred)
    TP = CM[1,1]
    FP = CM[0,1]
    precision = TP/(TP+FP)
    
    return precision
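
As a quick sanity check of the formula, here is a hand-countable example in plain Python with invented labels. It also computes recall for contrast, since recall penalises missed churners (FN) rather than false alarms (FP):

```python
# Toy labels: 1 = churned, 0 = stayed
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_hat  = [1, 1, 0, 1, 0, 0, 0, 0]

pairs = list(zip(y_true, y_hat))
TP = sum(1 for t, p in pairs if t == 1 and p == 1)  # predicted churn, did churn
FP = sum(1 for t, p in pairs if t == 0 and p == 1)  # predicted churn, stayed
FN = sum(1 for t, p in pairs if t == 1 and p == 0)  # predicted stay, churned

precision = TP / (TP + FP)
recall = TP / (TP + FN)
print(TP, FP, FN, round(precision, 3), round(recall, 3))
```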

_Logistic Regression_

In [None]:
lg_model = LogisticRegression(random_state=0)
lg_model.fit(x_train, y_train)

In [None]:
lg_acc = lg_model.score(x_test, y_test)
print("The Logistic Regression model score on train set is: {}".format(lg_model.score(x_train, y_train)))  # To check for overfitting
print("The logistic Regression model score on test set is: {}".format(lg_acc))

In [None]:
y_pred = lg_model.predict(x_test)
print(classification_report(y_test, y_pred))

_KNN_

In [None]:
knn_model = KNeighborsClassifier()
knn_model.fit(x_train, y_train)

In [None]:
knn_acc = knn_model.score(x_test, y_test)
print("The KNN model score on train set is: {}".format(knn_model.score(x_train, y_train)))    
print("The KNN model score on test set is: {}".format(knn_acc))

> It seems that this model is slightly overfitting the data.

In [None]:
y_pred = knn_model.predict(x_test)
print(classification_report(y_test, y_pred))

_Decision Tree_

In [None]:
dt_model = DecisionTreeClassifier(random_state=0)
dt_model.fit(x_train, y_train)

In [None]:
dt_acc = dt_model.score(x_test, y_test)
print("The Decision Tree model score on train set is: {}".format(dt_model.score(x_train, y_train)))
print("The Decision Tree model score on test set is: {}".format(dt_acc))

> There is overfitting here as well.

In [None]:
y_pred = dt_model.predict(x_test)
print(classification_report(y_test, y_pred))

**Let's see if we can make it better and get higher precision...**

In [None]:
# Loop over max_depth values to prune the tree
precisions = []
accuracies = []
d_range = range(2, 20) 
for d in d_range:
    tree = DecisionTreeClassifier(random_state=0, max_depth=d)
    tree.fit(x_train, y_train)
    y_pred = tree.predict(x_test)
    precisions.append(get_precision(y_test, y_pred))
    accuracies.append(tree.score(x_test, y_test))
    
# Plot how precision and accuracy on the test set change with tree depth
plt.plot(d_range, precisions, label="Precision")
plt.plot(d_range, accuracies, label="Accuracy")
plt.xlabel("Depth of Tree")
plt.ylabel("Precision & Accuracy")
plt.legend();

In [None]:
precisions

> I think the best value here is max_depth = 4
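
Choosing `max_depth` by looking at test-set scores means the test set is no longer a fully independent check. A common alternative, sketched here on synthetic data from `make_classification` (not the churn data, and not what the notebook does), is to compare depths with cross-validation on the training data only:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the training set
X, y = make_classification(n_samples=400, n_features=10, random_state=0)

depths = range(2, 8)
cv_precision = {}
for d in depths:
    tree = DecisionTreeClassifier(random_state=0, max_depth=d)
    # Mean precision over 5 folds, each fold held out once
    cv_precision[d] = cross_val_score(tree, X, y, cv=5, scoring="precision").mean()

best_depth = max(cv_precision, key=cv_precision.get)
print(best_depth, round(cv_precision[best_depth], 3))
```

The test set would then be used once at the end, with the depth already fixed.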

In [None]:
dt_model = DecisionTreeClassifier(random_state=0, max_depth=4)
dt_model.fit(x_train, y_train)

In [None]:
dt_acc = dt_model.score(x_test, y_test)
print("The Decision Tree model score on train set is: {}".format(dt_model.score(x_train, y_train)))
print("The Decision Tree model score on test set is: {}".format(dt_acc))

In [None]:
y_pred = dt_model.predict(x_test)
print(classification_report(y_test, y_pred))

_Random Forest_

In [None]:
rf_model = RandomForestClassifier(oob_score=True, random_state=0, warm_start=True, n_jobs=-1)
rf_model.fit(x_train, y_train)

In [None]:
rf_acc = rf_model.score(x_test, y_test)
print("The Random Forest model score on train set is: {}".format(rf_model.score(x_train, y_train)))
print("The Random Forest model score on test set is: {}".format(rf_acc))

In [None]:
y_pred = rf_model.predict(x_test)
print(classification_report(y_test, y_pred))

**Some Pruning**

In [None]:
rf_model = RandomForestClassifier(oob_score=True, random_state=0, warm_start=True, n_jobs=-1)

In [None]:
precisions = []
accuracies = []
# Iterate through all of the possibilities for the number of trees
n_range = range(50, 300, 10)
for n_trees in n_range:
    rf_model.set_params(n_estimators=n_trees)  # Set number of trees
    rf_model.fit(x_train, y_train)
    y_pred = rf_model.predict(x_test)
    precisions.append(get_precision(y_test, y_pred))
    accuracies.append(rf_model.score(x_test, y_test))

plt.plot(n_range, precisions, marker='o', label="Precision")
plt.plot(n_range, accuracies, label="Accuracy")
plt.xlabel("Number of Trees")
plt.ylabel("Precision & Accuracy")
plt.legend();

In [None]:
precisions

> I think n_estimators = 250 is the best number

<font color='green'>
Best Random Forest

In [None]:
rf_model = RandomForestClassifier(n_estimators=250, oob_score=True, random_state=0, warm_start=True, n_jobs=-1)
rf_model.fit(x_train, y_train)

In [None]:
rf_acc = rf_model.score(x_test, y_test)
print("The Random Forest model score on train set is: {}".format(rf_model.score(x_train, y_train)))
print("The Random Forest model score on test set is: {}".format(rf_acc))

> Overfitting again.

In [None]:
y_pred = rf_model.predict(x_test)
print(classification_report(y_test, y_pred))

_SVM_

In [None]:
svm_model = SVC(random_state=0, C=1.5)
svm_model.fit(x_train, y_train)

In [None]:
svm_acc = svm_model.score(x_test, y_test)
print("The SVM model score on train set is: {}".format(svm_model.score(x_train, y_train)))
print("The SVM model score on test set is: {}".format(svm_acc))

In [None]:
y_pred = svm_model.predict(x_test)
print(classification_report(y_test, y_pred))

In [None]:
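# To calculate AUC
from sklearn.metrics import roc_auc_score

# Note: y_pred holds hard 0/1 labels, so this value equals the balanced accuracy ( mean of TPR and TNR );
# passing svm_model.decision_function(x_test) instead would give the usual score-based AUC
roc_auc_score(y_test, y_pred)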
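
AUC is normally computed from continuous scores. When hard 0/1 predictions are passed instead, the value reduces to the balanced accuracy (the mean of the true-positive and true-negative rates). The plain-Python sketch below illustrates this on invented scores; `pairwise_auc` is a hypothetical helper that implements the pairwise-ranking definition of AUC, not a scikit-learn function.

```python
def pairwise_auc(y_true, scores):
    """Probability that a random positive is scored above a random negative (ties count 0.5)."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

y_true = [1, 1, 1, 0, 0, 0, 0, 0]
scores = [0.9, 0.6, 0.4, 0.7, 0.3, 0.2, 0.1, 0.05]
labels = [1 if s >= 0.5 else 0 for s in scores]  # hard predictions at a 0.5 threshold

auc_scores = pairwise_auc(y_true, scores)
auc_labels = pairwise_auc(y_true, labels)

# Balanced accuracy from the hard predictions
tpr = sum(1 for y, l in zip(y_true, labels) if y == 1 and l == 1) / y_true.count(1)
tnr = sum(1 for y, l in zip(y_true, labels) if y == 0 and l == 0) / y_true.count(0)
balanced_acc = (tpr + tnr) / 2

print(round(auc_scores, 4), round(auc_labels, 4), round(balanced_acc, 4))
```

For an `SVC`, the continuous scores would come from `decision_function`, since probability estimates are not enabled by default.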

<a id='conc'></a>
# Conclusion: 
> As we can see, the best of the models above is the _SVM_:
- Accuracy  = 82%
- **Precision = 70%**
- Recall    = 52%
- **F1-score  = 60%**
- **AUC = 72%**