# Customer Churn Prediction

## `Telco_Customer_Churn`

**Content**

Each row represents a customer, and each column contains one of the customer's attributes, as described in the column metadata.

The data set includes information about:

* Customers who left within the last month: the column is called `Churn`
* Services that each customer has signed up for: phone, multiple lines, internet, online security, online backup, device protection, tech support, and streaming TV and movies
* Customer account information: how long they have been a customer, contract, payment method, paperless billing, monthly charges, and total charges
* Demographic info about customers: gender, age range, and whether they have partners and dependents

In [None]:
import numpy as np
import pandas as pd 
import statsmodels.api as sm
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.preprocessing import scale, StandardScaler
from sklearn.model_selection import train_test_split, GridSearchCV, cross_val_score
from sklearn.metrics import confusion_matrix, accuracy_score, mean_squared_error, r2_score, roc_auc_score, roc_curve, classification_report
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import GradientBoostingClassifier
from collections import Counter
import warnings
warnings.filterwarnings("ignore")

In [None]:
telco=pd.read_csv('../input/telco-customer-churn/WA_Fn-UseC_-Telco-Customer-Churn.csv')
telco.head()

## EDA

In [None]:
telco.info()

In [None]:
telco.isnull().sum()

In [None]:
# TotalCharges should be numeric, but it is stored as object dtype here.
telco.TotalCharges=pd.to_numeric(telco.TotalCharges,errors='coerce') #If 'coerce', then invalid parsing will be set as NaN.
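
As a small illustration of what `errors='coerce'` does in the cell above, the sketch below converts a hypothetical series of strings (`raw` is made up and not read from the Telco file). Entries that cannot be parsed become `NaN` instead of raising an error.

```python
import pandas as pd

# Hypothetical sample mimicking a numeric column stored as strings;
# the blank entry stands in for a value that cannot be parsed.
raw = pd.Series(["29.85", "1889.5", " ", "108.15"])

# errors='coerce' converts unparseable entries to NaN instead of raising.
parsed = pd.to_numeric(raw, errors="coerce")

print(parsed.dtype)
print(parsed.isna().sum())
```

The resulting `NaN` entries are what the later null-value check counts.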

In [None]:
def null_values(telco):
    nv=pd.concat([telco.isnull().sum(), 100 * telco.isnull().sum()/telco.shape[0]],axis=1).rename(columns={0:'Missing_Records', 1:'Percentage (%)'})
    return nv[nv.Missing_Records>0].sort_values('Missing_Records', ascending=False)

In [None]:
null_values(telco)

In [None]:
telco.columns

In [None]:
null_indexes=telco[telco.TotalCharges.isnull()].index
telco.loc[null_indexes,['tenure','MonthlyCharges','TotalCharges','Churn']]

In [None]:
# Fill the missing values with 0 and drop the customerID column.
telco=telco.fillna(0)
telco=telco.drop(['customerID'],axis=1)

In [None]:
telco['Churn_Rate']=telco['Churn'].map({"No":0,"Yes":1})

In [None]:
telco.rename(columns={"tenure": "Tenure", "gender": "Gender"},inplace=True)

In [None]:
plt.figure(figsize=(10,8))
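# Restrict to numeric columns; recent pandas versions raise on object columns in corr().
sns.heatmap(telco.select_dtypes(include='number').corr(), cmap='coolwarm',annot=True);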

In [None]:
telco.select_dtypes(include='number').corr()["Churn_Rate"].sort_values().plot.barh();

In [None]:
telco.describe()

In [None]:
# Summary statistics for the categorical columns:
telco.describe(include=['O'])

In [None]:
# Number of unique values in each column
telco.apply(lambda x: x.nunique())

**Exploring the Data**

Our purpose here is mainly to understand which variables are related to 'Churn', and how. Who wants to leave the telecom company, and why? The exploration below is therefore organized around 'Churn'.

Customer churn is the loss of clients or customers.

In [None]:
print('Unique Values of Each Feature:\n')
for i in telco:
    print(f'{i}:\n{sorted(telco[i].unique())}\n')

In [None]:
telco.Churn.value_counts()

In [None]:
def perc_col(df,col):
    for i in sorted(df[col].unique(),reverse=True):
        print('%s: %.2f%%' % (i, 100*df[col].value_counts()[i]/len(df)))

In [None]:
sns.countplot(x='Churn',data=telco)
plt.show()

print(dict(Counter(telco['Churn'])))
print('\nCustomer Attrition Ratio:')
perc_col(telco,'Churn')

### Gender - Partner 

In [None]:
plt.figure(figsize=(15,4))
plt.subplot(121)
plt.title("Count of Churned According to Gender")
sns.countplot(x='Gender', data=telco, hue='Churn')
plt.subplot(122)
plt.title("Count of Churned According to Partner")
sns.countplot(x='Partner', data=telco, hue='Churn')
plt.show()

print("Churn:'Yes'\n",
      'Gender: ',dict(Counter(telco[telco.Churn=='Yes']['Gender'])),
      '\nPartner: ',dict(Counter(telco[telco.Churn=='Yes']['Partner'])),
     "\n\nChurn:'No'\n",
      'Gender: ',dict(Counter(telco[telco.Churn=='No']['Gender'])),
     '\nPartner: ',dict(Counter(telco[telco.Churn=='No']['Partner'])),sep='')

> **`Gender` does not appear to be a useful predictor of customer churn.**

### Phone Service - Streaming TV - MultipleLines

In [None]:
plt.figure(figsize=(15,4))
plt.subplot(131)
plt.title("PhoneService")
sns.countplot(x='PhoneService',hue='Churn',data=telco)
plt.subplot(132)
plt.title("StreamingTV")
sns.countplot(x='StreamingTV',hue='Churn',data=telco)
plt.subplot(133)
plt.title("MultipleLines")
sns.countplot(x='MultipleLines',hue='Churn',data=telco)
plt.show()

> **Whether or not a customer has `Phone Service` does not seem to have an effect on churn.**

### Online Security - Tech Support

In [None]:
plt.figure(figsize=(12,4))
plt.subplot(121)
plt.title("Online Security")
sns.countplot(x = 'OnlineSecurity', hue = 'Churn', data = telco)
plt.subplot(122)
plt.title("Tech Support")
sns.countplot(x = 'TechSupport', hue = 'Churn', data = telco)
plt.show()

print('Churn Ratios by Online Security','\n')
print(round(telco[telco['Churn']=='Yes']['OnlineSecurity'].value_counts() / telco['OnlineSecurity'].value_counts()*100,2))

print('\nChurn Ratios by Tech Support','\n')
print(round(telco[telco['Churn']=='Yes']['TechSupport'].value_counts() / telco['TechSupport'].value_counts()*100,2))

> **Customers who did not sign up for `OnlineSecurity` and `TechSupport` are the most likely to churn.**
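
The per-category churn ratios printed above can also be computed as a grouped mean. This is a minimal sketch on hypothetical toy data (`toy` is made up and its values are not from the Telco file):

```python
import pandas as pd

# Hypothetical toy data, not taken from the Telco file.
toy = pd.DataFrame({
    "OnlineSecurity": ["No", "No", "No", "No", "Yes", "Yes", "Yes", "Yes"],
    "Churn":          ["Yes", "Yes", "Yes", "No", "Yes", "No", "No", "No"],
})

# Share of churned customers within each category, in percent.
churn_pct = (
    toy["Churn"].eq("Yes")
    .groupby(toy["OnlineSecurity"])
    .mean()
    .mul(100)
    .round(2)
)
print(churn_pct)
```

In this toy sample, 3 of the 4 customers without online security churned (75%), against 1 of 4 with it (25%).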

### Tenure

In [None]:
churn_tenure=telco[telco.Churn=='Yes']['Tenure']
not_churn_tenure=telco[telco.Churn=='No']['Tenure']

plt.figure(figsize=(8,5))
sns.kdeplot(data=not_churn_tenure, fill=True)  # fill replaces the deprecated shade argument
sns.kdeplot(data=churn_tenure, fill=True)
plt.legend(("Churn:No", "Churn:Yes"),title='Churn')
plt.title("Distributions of Tenure, by Churn")

plt.show()

print('Average Tenure of Churned Customers:',round(churn_tenure.mean()),
      '\nAverage Tenure of Not-Churned Customers:',round(not_churn_tenure.mean()))

> **Customers who did not churn have a much longer average tenure (about 20 months longer) than churned customers.**

### Monthly Charges

In [None]:
churn_mcharge=telco[telco.Churn=='Yes']['MonthlyCharges']
not_churn_mcharge=telco[telco.Churn=='No']['MonthlyCharges']
plt.figure(figsize=(8,5))

sns.kdeplot(data=not_churn_mcharge,fill=True)  # fill replaces the deprecated shade argument
sns.kdeplot(data=churn_mcharge,fill=True)
plt.legend(("Churn:No", "Churn:Yes"),title='Churn')
plt.title("Distributions of Monthly Charges, by Churn")
plt.show()

print('Average Monthly Fee of Churned Customers:',round(churn_mcharge.mean()),
      '\nAverage Monthly Fee of Not-Churned Customers:',round(not_churn_mcharge.mean()))

> **On average, churned customers paid a monthly fee more than 20% higher than that of not-churned customers.**
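
A statement of this kind is a relative difference between two averages. The sketch below shows the arithmetic with hypothetical averages (72 and 60 are made-up numbers, not the notebook's output), and `relative_increase` is an illustrative helper name.

```python
def relative_increase(a, b):
    """Percentage by which a exceeds b."""
    return (a - b) / b * 100

# Hypothetical averages, not the notebook's actual output.
churned_avg, retained_avg = 72.0, 60.0
increase = round(relative_increase(churned_avg, retained_avg), 1)
print(increase)  # 20.0
```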

### Deal with Outliers

In [None]:
telco.info()

In [None]:
categorical=telco.select_dtypes(include='object').columns.tolist()
numeric=telco.dtypes[telco.dtypes!=object].keys().tolist() # result of dtypes is Series, so we use keys(), not columns
print('Categorical Features:',categorical,'\nNumerical Features:',numeric,sep='\n')

In [None]:
plt.figure(figsize=(20,4))
plt.subplot(131)
sns.boxplot(x='Churn', y='Tenure', data=telco, palette="coolwarm",whis=1.6)
plt.subplot(132)
sns.boxplot(x='Churn', y='MonthlyCharges', data=telco, palette="coolwarm")
plt.subplot(133)
sns.boxplot(x='Churn', y='TotalCharges', data=telco, palette="coolwarm")
plt.show()

**'TotalCharges' has some outliers. We can apply a square-root transform to reduce their influence.**

In [None]:
f=lambda x:(np.sqrt(x) if x>=0 else -np.sqrt(-x))
telco.TotalCharges=telco.TotalCharges.apply(f)
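
To see why a square root tames large values, the sketch below applies the same signed square root to a few hypothetical charges (`charges` is made up). It uses a vectorized NumPy form, which is an alternative to the `apply` call above rather than the notebook's own code.

```python
import numpy as np

# Hypothetical charge values with one large value at the upper end.
charges = np.array([0.0, 25.0, 100.0, 400.0, 8100.0])

# Signed square root: sqrt(x) for x >= 0, -sqrt(-x) for x < 0.
transformed = np.sign(charges) * np.sqrt(np.abs(charges))

# Ratio between the largest value and a typical small one, before and after.
ratio_before = charges.max() / charges[1]
ratio_after = transformed.max() / transformed[1]
print(ratio_before, ratio_after)  # 324.0 18.0
```

The largest value is 324 times the small one before the transform and only 18 times after it, so extreme values pull less on the distribution.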

In [None]:
sns.boxplot(x='Churn', y='TotalCharges', data=telco, palette="coolwarm");

### Senior Citizen-Tenure-Monthly Charges

In [None]:
g=sns.FacetGrid(telco,col='SeniorCitizen', hue='Churn',height=4)
g.map(plt.scatter, 'Tenure', 'MonthlyCharges', alpha=0.7)
g.add_legend();

In [None]:
plt.figure(figsize=(6,4))
sns.countplot(x='SeniorCitizen', data=telco, hue='Churn')
plt.show()

print('SeniorCitizens percentage of customers: {:.2f}%'.format(len(telco[telco.SeniorCitizen==1])/len(telco.SeniorCitizen)*100))
print('SeniorCitizens churn rate: {:.2f}%'.format(len(telco[(telco.SeniorCitizen==1) & (telco.Churn=='Yes')])/len(telco[telco.SeniorCitizen==1])*100))
print('non-SeniorCitizens churn rate: {:.2f}%'.format(len(telco[(telco.SeniorCitizen==0) & (telco.Churn=='Yes')])/len(telco[telco.SeniorCitizen==0])*100))

> **SeniorCitizens are only 16% of customers, but they have a much higher churn rate: 42% against 23% for non-senior customers.**

### Contract-Internet Service-Churn

In [None]:
telco.Contract.value_counts()

In [None]:
telco.InternetService.value_counts()

In [None]:
g = sns.FacetGrid(telco,col='InternetService',height=4)
ax = g.map(sns.barplot, "Contract", "Churn_Rate", palette = "Blues_d", order= telco.Contract.unique())

> **Short-term contracts have higher churn rates, so contract term clearly has an effect on churn. Very few customers with a two-year contract churned, and most churn occurred among customers with a month-to-month contract.**

> **It seems customers who signed up for Fiber optic are most likely to churn.**

In [None]:
plt.figure(figsize=(15,5))
sns.barplot(x=telco.Tenure//12+1,y='MonthlyCharges',data=telco,hue='Churn',estimator=np.sum) # tenure converted to years
plt.title("Sum of MonthlyCharges by Tenure (Years) and Churn")
plt.show()

In [None]:
print('Average MonthlyCharges by Tenure Year and Churn:',
      telco.groupby([telco.Tenure//12+1,'Churn']).MonthlyCharges.mean(),sep='\n\n')

In [None]:
print('Customer Count by Tenure Year and Churn:',
      telco.groupby([telco.Tenure//12+1,'Churn']).MonthlyCharges.count(),sep='\n\n')

> **In the first year, the counts of churned and not-churned customers are close to each other. In later years, not-churned customers outnumber churned customers.**
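
The year buckets used in the groupings above come from `Tenure//12+1`. The sketch below shows how that expression maps months to years on a few hypothetical tenure values (`tenure_months` is made up):

```python
import pandas as pd

# Hypothetical tenure values in months.
tenure_months = pd.Series([0, 1, 11, 12, 23, 24, 72], name="Tenure")

# Integer-divide by 12 and add 1 so that months 0-11 fall in year 1.
tenure_year = tenure_months // 12 + 1
print(tenure_year.tolist())  # [1, 1, 1, 2, 2, 3, 7]
```

Note that a tenure of exactly 12 months lands in year 2, and a tenure of 72 months, if present, forms a seventh bucket of its own.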

### Transform to Dummy and Drop Categorical Features

In [None]:
telco.head()

In [None]:
telco['Churn']=telco['Churn'].map({"No":0,"Yes":1})
telco.drop(columns=['Churn_Rate'],axis=1,inplace=True)

In [None]:
import pickle
# Use a context manager so the file is closed after writing.
with open("telco_not_dummy.pkl","wb") as fh:
    pickle.dump(telco,fh)

In [None]:
telco=pd.get_dummies(telco,drop_first=True)
telco.head()

In [None]:
telco.isnull().sum().any()

In [None]:
telco.to_csv("./telco_clean_20201215.csv", index = False)

## Customer Churn Prediction

In [None]:
# conda install -c districtdatalabs yellowbrick

In [None]:
import pandas as pd
import numpy as np
from numpy import percentile
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.utils import resample
from imblearn.over_sampling import SMOTE, ADASYN
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from IPython.core.pylabtools import figsize
from scipy.stats import zscore
from scipy import stats
from sklearn.metrics import accuracy_score,f1_score, recall_score, classification_report,confusion_matrix,precision_score,roc_auc_score
from sklearn.model_selection import train_test_split, cross_val_score, TimeSeriesSplit, GridSearchCV, RandomizedSearchCV
from statsmodels.formula.api import ols
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC, LinearSVC
from sklearn.ensemble import RandomForestClassifier , GradientBoostingClassifier
from xgboost import XGBClassifier
from yellowbrick.classifier import ClassificationReport
from yellowbrick.datasets import load_occupancy
font_title = {'family': 'times new roman', 'color': 'darkred', 
              'weight': 'bold', 'size': 14}
import warnings
warnings.filterwarnings('ignore')
sns.set_style("whitegrid")

plt.rcParams['figure.dpi'] = 100

### Building Models

In [None]:
df = pd.read_csv("./telco_clean_20201215.csv")
df.head()

In [None]:
df.Churn.value_counts()

In [None]:
# Churn is 0/1, so its mean is the churned fraction; multiply by 100 for a percentage.
print("Percentage of Churned Customers:",
      round(df.Churn.mean()*100,2),"%")

> **The target variable is somewhat imbalanced, so we should resample the data.**

## Splitting Data

In [None]:
X=df.drop('Churn',axis=1)
y=df.Churn

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y,test_size=0.2,random_state=42) # stratify=y
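
The split above leaves `stratify=y` commented out. As a hedged illustration of what that option does, the sketch below splits hypothetical labels (`X_demo` and `y_demo` are made up) and checks that the positive share is the same in both parts.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 75 negatives, 25 positives.
y_demo = np.array([0] * 75 + [1] * 25)
X_demo = np.arange(len(y_demo)).reshape(-1, 1)

# stratify=y_demo keeps the class proportions equal in train and test.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

print(y_tr.mean(), y_te.mean())  # 0.25 0.25
```

Without stratification the two shares can drift apart by chance, which matters more when the minority class is small.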

In [None]:
y_train.value_counts()

### SMOTE

In [None]:
from imblearn.over_sampling import SMOTE
sm = SMOTE(random_state=42)
X_smote, y_smote = sm.fit_resample(X_train, y_train) # called fit_sample in older imbalanced-learn releases
y_smote.value_counts()
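
SMOTE balances the classes by creating synthetic minority samples. The following is a simplified NumPy sketch of the core interpolation idea only, on hypothetical points. It is not imbalanced-learn's implementation, and `synth_sample` is an illustrative helper name.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical minority-class points in two dimensions.
minority = np.array([[1.0, 1.0], [2.0, 1.5], [1.5, 2.0], [2.5, 2.5]])

def synth_sample(points, k=2):
    """Create one synthetic point between a random point and one of its k nearest neighbors."""
    base = points[rng.integers(len(points))]
    # Euclidean distances from the base point to every minority point.
    dist = np.linalg.norm(points - base, axis=1)
    # Skip position 0 of the sorted order, which is the base point itself.
    neighbors = np.argsort(dist)[1:k + 1]
    neighbor = points[rng.choice(neighbors)]
    # Interpolate at a random position on the segment from base to neighbor.
    gap = rng.random()
    return base + gap * (neighbor - base)

synthetic = np.array([synth_sample(minority) for _ in range(5)])
print(synthetic.shape)  # (5, 2)
```

Because each synthetic point lies on a segment between two existing minority points, the new samples stay inside the region the minority class already occupies.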

### ADASYN

In [None]:
ad = ADASYN(random_state=42)
X_adasyn, y_adasyn = ad.fit_resample(X_train, y_train)  # called fit_sample in older imbalanced-learn releases
y_adasyn.value_counts()

In [None]:
# SMOTE
X_train, y_train = X_smote, y_smote

# ADASYN
# X_train, y_train = X_adasyn, y_adasyn

In [None]:
# pip install lazypredict==0.2.9
# import lazypredict
# from lazypredict.Supervised import LazyClassifier
# from sklearn.utils.testing import ignore_warnings

# clf = LazyClassifier(verbose=0, ignore_warnings=True, custom_metric=None)
# models, predictions = clf.fit (X_train, X_test, y_train, y_test)
# models

# ``1.XGBoost Classifier``

In [None]:
from xgboost import XGBClassifier
xgb= XGBClassifier()
xgb.fit(X_train , y_train)

In [None]:
y_pred = xgb.predict(X_test)

### **Evaluate the performance**

In [None]:
print('Confusion Matrix:',*confusion_matrix(y_test,y_pred), sep="\n")
print(classification_report(y_test, y_pred))

In [None]:
xgb_accuracy = accuracy_score(y_test, y_pred)
xgb_f1_score = f1_score(y_test, y_pred, average='weighted')
xgb_recall = recall_score(y_test, y_pred, average='weighted')
print('xgb_accuracy:',xgb_accuracy,
      '\nxgb_f1_score:',xgb_f1_score,
      '\nxgb_recall:',xgb_recall)

### **Tuning XGBoost**

In [None]:
xgb = XGBClassifier()

In [None]:
xgb_params = {"n_estimators": [50,500,1000],
             "subsample":[0.1,0.5,1],
             "max_depth":[3,7,9],
             "learning_rate":[0.1,0.01,0.3]}

In [None]:
# xgb_grid= GridSearchCV(xgb, xgb_params, cv = 5, 
#                             n_jobs = -1, verbose = 2).fit(X_train, y_train)

In [None]:
# xgb_grid= RandomizedSearchCV(xgb, xgb_params, cv = 5,
#                              n_iter=10,
#                             n_jobs = -1, verbose = 2,scoring='f1').fit(X_train, y_train)

In [None]:
# xgb_grid.best_params_

In [None]:
xgb_tuned = XGBClassifier(learning_rate= 0.01, 
                                max_depth= 3, 
                                n_estimators= 520, 
                                subsample= 0.15).fit(X_train, y_train)

y_pred = xgb_tuned.predict(X_test)

In [None]:
print('Confusion Matrix:',*confusion_matrix(y_test,y_pred), sep="\n")
print(classification_report(y_test, y_pred))

In [None]:
# F1 of the positive (churn) class, computed directly instead of parsing the report text.
xgb_f1_true=round(f1_score(y_test, y_pred, pos_label=1),2)
xgb_f1_true

In [None]:
xgb_accuracy = accuracy_score(y_test, y_pred)
xgb_f1_score = f1_score(y_test, y_pred, average='weighted')
xgb_recall = recall_score(y_test, y_pred, average='weighted')
print('xgb_accuracy:',xgb_accuracy,
      '\nxgb_f1_score:',xgb_f1_score,
      '\nxgb_recall:',xgb_recall)

**`Cross Validation Scores`**

In [None]:
# xgb_accuracy = cross_val_score(xgb_tuned, X_test, y_test,cv = 10).mean()
# xgb_f1_score = cross_val_score(xgb_tuned, X_test, y_test,cv = 10,scoring='f1_weighted').mean()
# xgb_recall = cross_val_score(xgb_tuned, X_test, y_test,cv = 10,scoring='recall_weighted').mean()
# print('xgb_accuracy:',xgb_accuracy,
#       '\nxgb_f1_score:',xgb_f1_score,
#       '\nxgb_recall:',xgb_recall)

### Visualization of Confusion Matrix with Table

In [None]:
from sklearn.metrics import classification_report,confusion_matrix
sns.heatmap(confusion_matrix(y_test,y_pred), annot=True, cmap="YlGnBu",fmt='d')
plt.ylabel('Actual Label')
plt.xlabel('Predicted Label');

# ``2.Random Forest Classifier``

In [None]:
from sklearn.ensemble import RandomForestClassifier
rf_model = RandomForestClassifier(n_estimators=100)
rf_model.fit(X_train, y_train)

In [None]:
y_pred = rf_model.predict(X_test)

### **Evaluate the performance**

In [None]:
print('Confusion Matrix:',*confusion_matrix(y_test,y_pred), sep="\n")
print(classification_report(y_test, y_pred))

In [None]:
rfc_accuracy = accuracy_score(y_test, y_pred)
rfc_f1_score = f1_score(y_test, y_pred, average='weighted')
rfc_recall = recall_score(y_test, y_pred, average='weighted')
print('rfc_accuracy:',rfc_accuracy,
      '\nrfc_f1_score:',rfc_f1_score,
      '\nrfc_recall:',rfc_recall)

**`Cross Validation Scores`**

In [None]:
# rfc_accuracy = cross_val_score(rf_model, X_test, y_test,cv = 10).mean()
# rfc_f1_score = cross_val_score(rf_model, X_test, y_test,cv = 10,scoring='f1_weighted').mean()
# rfc_recall = cross_val_score(rf_model, X_test, y_test,cv = 10,scoring='recall_weighted').mean()
# print('rfc_accuracy:',rfc_accuracy,
#       '\nrfc_f1_score:',rfc_f1_score,
#       '\nrfc_recall:',rfc_recall)

### **RF Tuning**

In [None]:
rfc_params = {"n_estimators":[300,500,1000],
              "max_depth":[7,10,15],
              "max_features": [8,10,15],
              "min_samples_split": [4,6,8]}

In [None]:
# rfc_grid = GridSearchCV(rf_model, rfc_params, cv = 5, n_jobs = -1, verbose = 2).fit(X_train, y_train)

In [None]:
# rfc_grid= RandomizedSearchCV(rf_model, rfc_params, cv = 5,
#                              n_iter=10,
#                             n_jobs = -1, verbose = 2,scoring='f1').fit(X_train, y_train)

In [None]:
# rfc_grid.best_params_

In [None]:
rfc_tuned = RandomForestClassifier(max_depth = 10,             
                                  max_features = 10, 
                                  min_samples_split = 4, 
                                  n_estimators = 500).fit(X_train, y_train)

In [None]:
y_pred = rfc_tuned.predict(X_test)
print('Confusion Matrix:',*confusion_matrix(y_test,y_pred), sep="\n")
print(classification_report(y_test, y_pred))

In [None]:
# F1 of the positive (churn) class, computed directly instead of parsing the report text.
rf_f1_true=round(f1_score(y_test, y_pred, pos_label=1),2)
rf_f1_true

### **Visualization of Confusion Matrix with Table**

In [None]:
from sklearn.metrics import classification_report,confusion_matrix
sns.heatmap(confusion_matrix(y_test,y_pred), annot=True, cmap="YlGnBu",fmt='d')
plt.ylabel('Actual Label')
plt.xlabel('Predicted Label');

# ``3.KNeighborsClassifier``

### Scaling

In [None]:
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test= sc.transform(X_test)

In [None]:
from sklearn.neighbors import KNeighborsClassifier

In [None]:
knn = KNeighborsClassifier(n_neighbors=1)
knn.fit(X_train,y_train)

In [None]:
y_pred = knn.predict(X_test)

### **Visualize Accuracies of Train & Test Data for Different k Values**

In [None]:
neighbors = range(1,18,2) # k is expected to be odd
train_accuracy =np.empty(len(neighbors))
test_accuracy = np.empty(len(neighbors))

for i,k in enumerate(neighbors):
    #Setup a knn classifier with k neighbors
    knn = KNeighborsClassifier(n_neighbors = k)
    
    #Fit the model
    knn.fit(X_train, y_train)
    
    #Compute accuracy on the training set
    train_accuracy[i] = knn.score(X_train, y_train)
    
    #Compute accuracy on the test set
    test_accuracy[i] = knn.score(X_test, y_test)

In [None]:
plt.figure(figsize=(8,5))
plt.title('k-NN assessment of number of neighbors')
plt.plot(neighbors, test_accuracy, label='Accuracy of Test Data')
plt.plot(neighbors, train_accuracy, label='Accuracy of Training Data')
plt.legend()
plt.xlabel('Number of neighbors')
plt.ylabel('Accuracy')
plt.show()

### **Find Optimum K Value with Elbow Method**

In [None]:
error_rate = []
# The error rate obtained for each k value is appended to this list.
# k is expected to be odd.
# This will take some time.
for i in range(1,18,2):
    
    model = KNeighborsClassifier(n_neighbors=i) # k= i
    model.fit(X_train,y_train)
    y_pred_i = model.predict(X_test)
    error_rate.append(np.mean(y_pred_i != y_test)) 
    

# print('Optimum K_Value: ',error_rate.index(min(error_rate)))
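
Since the loop above tries only odd values of k, the position of the smallest error in `error_rate` is not itself the k value. A minimal sketch of mapping the position back to k, using hypothetical error rates (`error_rate_demo` is made up):

```python
import numpy as np

# The same candidate values of k as in the loop above.
k_values = list(range(1, 18, 2))

# Hypothetical error rates, one per k; real values come from the loop.
error_rate_demo = [0.30, 0.27, 0.25, 0.24, 0.235, 0.23, 0.232, 0.231, 0.233]

# Position of the smallest error, then the k that produced it.
best_idx = int(np.argmin(error_rate_demo))
best_k = k_values[best_idx]
print(best_idx, best_k)  # 5 11
```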

In [None]:
plt.figure(figsize=(10,6))
plt.plot(range(1,18,2),
         error_rate,
         color='blue', 
         linestyle='dashed', 
         marker='o',
         markerfacecolor='red', 
         markersize=10)
plt.title('Error Rate vs. K Value')
plt.xlabel('K')
plt.ylabel('Error Rate');

### Tuning KNN with GridSearchCV

In [None]:
knn = KNeighborsClassifier()
knn_params = {"n_neighbors": range(1,18,2)} # k should be odd

knn_cv_model = GridSearchCV(knn, knn_params, cv=10).fit(X_train, y_train)

In [None]:
knn_cv_model.best_params_

In [None]:
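# Note: 33 lies outside the grid searched above (1 to 17); knn_cv_model.best_params_ holds the grid-search result.
knn_tuned= KNeighborsClassifier(n_neighbors = 33).fit(X_train, y_train)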
y_pred = knn_tuned.predict(X_test)

### **Evaluate the Performance**

In [None]:
print('Confusion Matrix:',*confusion_matrix(y_test,y_pred), sep="\n")
print(classification_report(y_test, y_pred))

In [None]:
knn_accuracy = accuracy_score(y_test, y_pred)
knn_f1_score = f1_score(y_test, y_pred, average='weighted')
knn_recall = recall_score(y_test, y_pred, average='weighted')
print('knn_accuracy:',knn_accuracy,
      '\nknn_f1_score:',knn_f1_score,
      '\nknn_recall:',knn_recall)

In [None]:
# F1 of the positive (churn) class, computed directly instead of parsing the report text.
knn_f1_true=round(f1_score(y_test, y_pred, pos_label=1),2)
knn_f1_true

### Visualization of Confusion Matrix with Table

In [None]:
from sklearn.metrics import classification_report,confusion_matrix
sns.heatmap(confusion_matrix(y_test,y_pred), annot=True, cmap="YlGnBu",fmt='d')
plt.ylabel('Actual Label')
plt.xlabel('Predicted Label');

# `4-Logistic Regression`

In [None]:
model=LogisticRegression()
model.fit(X_train,y_train)

In [None]:
y_pred=model.predict(X_test)

### **Evaluate the performance**

In [None]:
print('Confusion Matrix:',*confusion_matrix(y_test,y_pred), sep="\n")
print(classification_report(y_test, y_pred))

In [None]:
log_accuracy = accuracy_score(y_test, y_pred)
log_f1_score = f1_score(y_test, y_pred, average='weighted')
log_recall = recall_score(y_test, y_pred, average='weighted')
print('log_accuracy:',log_accuracy,
      '\nlog_f1_score:',log_f1_score,
      '\nlog_recall:',log_recall)

In [None]:
# F1 of the positive (churn) class, computed directly instead of parsing the report text.
log_f1_true=round(f1_score(y_test, y_pred, pos_label=1),2)
log_f1_true

### `Compare Models Accuracies & F1 Scores & Recall`

In [None]:
compare = pd.DataFrame({"Model": ["Random Forest", "XGBoost","Logistic Regression","K-Nearest Neighbor"],
                        "Accuracy": [rfc_accuracy, xgb_accuracy, log_accuracy,knn_accuracy],
                        "F1 Score": [rfc_f1_score, xgb_f1_score, log_f1_score, knn_f1_score],
                        "Recall": [rfc_recall, xgb_recall, log_recall,knn_recall],
                        "F1 Score (True)": [rf_f1_true, xgb_f1_true, log_f1_true, knn_f1_true]})

def labels(ax):
    for p in ax.patches:
        width = p.get_width()    # bar length
        ax.text(width,       # place the text at the right end of the bar
                p.get_y() + p.get_height() / 2, # vertical center of the bar
                '{:1.2f}'.format(width), # value to display, 2 decimals
                ha = 'left',   # horizontal alignment
                va = 'center')  # vertical alignment
    
plt.subplot(411)
compare = compare.sort_values(by="Accuracy", ascending=False)
ax=sns.barplot(x="Accuracy", y="Model", data=compare, palette="Blues_d")
labels(ax)
plt.show()

plt.subplot(412)
compare = compare.sort_values(by="Recall", ascending=False)
ax=sns.barplot(x="Recall", y="Model", data=compare, palette="Blues_d")
labels(ax)
plt.xlabel('Recall (Weighted)')
plt.show()

plt.subplot(413)
compare = compare.sort_values(by="F1 Score", ascending=False)
ax=sns.barplot(x="F1 Score", y="Model", data=compare, palette="Blues_d")
labels(ax)
plt.xlabel('F1 Score (Weighted)')
plt.show()

plt.subplot(414)
compare = compare.sort_values(by="F1 Score (True)", ascending=False)
ax=sns.barplot(x="F1 Score (True)", y="Model", data=compare, palette="Blues_d")
labels(ax)
plt.show()

### Result
* The F1 score is useful when the classes are skewed, i.e. when one class has many more examples than the other.
* For churn analysis, the `F1 score of True Class` is the most important metric.
* Judged by the `F1 score of True Class`, XGBoost and Random Forest Classifier with the SMOTE algorithm are the best models.
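
To make the role of the positive-class F1 score concrete, the sketch below computes precision, recall, F1, and accuracy from hypothetical confusion-matrix counts (the numbers are made up and are not results from this notebook).

```python
# Hypothetical confusion-matrix counts for the positive (churn) class.
tp, fp, fn, tn = 60, 40, 20, 280

precision = tp / (tp + fp)   # share of predicted churners who really churned
recall = tp / (tp + fn)      # share of actual churners that were found
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / (tp + fp + fn + tn)

print(round(precision, 3), round(recall, 3), round(f1, 3), round(accuracy, 3))
# 0.6 0.75 0.667 0.85
```

In this example the accuracy is 0.85 while the F1 score of the churn class is only about 0.67, because the many correctly classified non-churners dominate the accuracy.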

### Feature Importance for XGBoost

In [None]:
feature_imp = pd.Series(xgb_tuned.feature_importances_,
                        index=X.columns).sort_values(ascending=False)

sns.barplot(x=feature_imp, y=feature_imp.index)
plt.title("Feature Importance")
plt.show()

feature_imp[:10]

### Feature Importance for Random Forest

In [None]:
feature_imp = pd.Series(rfc_tuned.feature_importances_,
                        index=X.columns).sort_values(ascending=False)

sns.barplot(x=feature_imp, y=feature_imp.index)
plt.title("Feature Importance")
plt.show()

feature_imp[:10]

   **Top 10 Feature Importance for XGBoost**
   
    Contract_Two year                        0.152446
    Contract_One year                        0.147547
    InternetService_Fiber optic              0.099390
    InternetService_No                       0.072385
    Dependents_Yes                           0.053979
    OnlineSecurity_Yes                       0.046085
    TechSupport_Yes                          0.039856
    PaymentMethod_Credit card (automatic)    0.036813
    Partner_Yes                              0.036286
    Tenure                                   0.036252

### Saving Model

In [None]:
import pickle
import pandas as pd

In [None]:
# Context managers make sure each file is closed after writing.
with open("XGBoost.pkl","wb") as fh:
    pickle.dump(xgb_tuned,fh)
with open("RandomForest.pkl","wb") as fh:
    pickle.dump(rfc_tuned,fh)

In [None]:
with open("XGBoost.pkl","rb") as fh:
    xgb_model = pickle.load(fh)
with open("RandomForest.pkl","rb") as fh:
    rfc_model = pickle.load(fh)
# df = pd.read_csv("telco_clean_20201215.csv")

In [None]:
new_list=["Contract", "InternetService", "Dependents", "OnlineSecurity",'TechSupport',"PaymentMethod",'Partner','Tenure']

In [None]:
my_dict = {"Contract":'Month-to-month', 
           "InternetService":'Fiber optic', 
           "Dependents":"Yes", 
           "OnlineSecurity":'Yes',
           "TechSupport":'Yes',
           'PaymentMethod':'Electronic check',
           'Partner':'Yes',
           'Tenure':60,
           'TotalCharges':np.sqrt(2500)  # the training data used square-root-transformed TotalCharges
            }

X = pd.DataFrame([my_dict])

X=pd.get_dummies(X)
X.columns

In [None]:
all_columns=df.drop('Churn',axis=1).columns
all_columns

In [None]:
X = pd.get_dummies(X).reindex(columns=all_columns, fill_value=0)
X
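
The `reindex` step above is what makes a single new row compatible with the trained model. The sketch below shows the mechanism on a hypothetical, much smaller set of training columns (`train_columns` and `new_customer` are made up for illustration).

```python
import pandas as pd

# Hypothetical training columns after one-hot encoding with drop_first=True.
train_columns = ["Tenure", "Contract_One year", "Contract_Two year", "Partner_Yes"]

new_customer = pd.DataFrame([{"Tenure": 60, "Contract": "Two year", "Partner": "Yes"}])

# get_dummies only creates columns for categories present in this single row.
encoded = pd.get_dummies(new_customer)

# reindex adds the missing training columns (filled with 0), drops extras,
# and restores the training column order.
aligned = encoded.reindex(columns=train_columns, fill_value=0)
print(aligned.astype(int).iloc[0].tolist())  # [60, 0, 1, 1]
```

A category that was the dropped baseline during training (for example a month-to-month contract) produces a dummy column that `reindex` discards, leaving all related dummies at 0, which matches the training encoding.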

In [None]:
prediction_XGB = xgb_model.predict(X)
print("The Churn : ",'Yes' if prediction_XGB[0] else 'No')

In [None]:
prediction_XGB = xgb_model.predict_proba(X)
print(f'The Probability of the Customer Churn is {round(prediction_XGB[0][1]*100,1)}%')

In [None]:
# pip freeze -o requirements.txt