# Executive summary

Author: Ľubomír Koprla

## Goal of analysis
Explain why customers are leaving the company and what can be done about it.


## Business outcomes

* Customers with a month-to-month contract have a high churn rate, while customers with a two-year contract have a much lower churn rate.
  * The company can optimize the sales process: sales should focus on selling two-year contracts to decrease the churn rate, for example by lowering the price or adding free extra services.
* Customers with lower tenure and higher monthly charges tend to churn more than other customers.
* Customers who pay by electronic check have a higher churn rate, while customers with automatic payments have a lower churn rate.
  * The company can motivate customers to set up automatic payments (e.g. one telco company in Slovakia offers a 2 Euro discount for automatic card payments).
* Customers with one service (extra services for the internet connection) have a higher churn rate than customers with more services; the churn rate decreases as the number of services increases.
  * The company can offer more services in a bundle, or offer them at a better price with a two-year contract.
* Customers with internet, and especially customers with Fiber Optic, pay more per month than other customers.
  * The company can focus on expanding the Fiber Optic network and offering it to more customers.
* Our model can identify 77% of the customers who are about to churn, with an overall accuracy of 75.7%.

## Model notes

* The best-performing model in terms of accuracy was LogisticRegression, with 80.4% accuracy and 54.7% recall. For our task recall is more important, so we raised recall to 77% (at the cost of lowering accuracy to 75.7%) by balancing the dataset with class weighting. In relative terms, accuracy decreased by 5.8% and recall increased by 40.8%. We want to identify as many customers who are about to churn as possible.
* XGBClassifier had similar results, but it is more complicated and less explainable than LogisticRegression. A good explanation is important for our task.
* LogisticRegression confirmed that a month-to-month contract indicates a higher probability of churn, while a two-year contract indicates a lower probability. Customers with paperless billing also have a higher chance of churning. In the data exploration part we saw that MonthlyCharges has an impact on the churn rate (the distributions differ between churned and non-churned customers), but the LogisticRegression weights suggest that MonthlyCharges does not have an impact on churn.
* From the decision tree we get the rule: when a customer has a month-to-month contract, does not have Fiber Optic, and has a tenure higher than 3, the customer will not churn.
* We tried multiple models: KNeighborsClassifier, SVC, DecisionTreeClassifier, RandomForestClassifier, MLPClassifier, AdaBoostClassifier, XGBClassifier, LogisticRegression.
* We tried hyperparameter tuning, scaling of numeric features, dataset balancing, and feature selection.
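
The relative changes quoted in the first bullet follow directly from the four reported percentages. A minimal sketch of the arithmetic (the helper name `relative_change` is ours and is not part of the notebook):

```python
def relative_change(before: float, after: float) -> float:
    """Return the change from `before` to `after` as a percentage of `before`."""
    return (after - before) / before * 100


# Figures reported above: accuracy 80.4% -> 75.7%, recall 54.7% -> 77%.
accuracy_change = relative_change(80.4, 75.7)
recall_change = relative_change(54.7, 77.0)

print(f"accuracy: {accuracy_change:+.1f}% relative")
print(f"recall:   {recall_change:+.1f}% relative")
```

In absolute terms the trade is about 4.7 percentage points of accuracy for about 22.3 percentage points of recall.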

In [None]:
import matplotlib.pyplot as plt
import time
import seaborn as sns
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import StratifiedKFold
from sklearn.model_selection import RandomizedSearchCV
import numpy as np 
import pandas as pd 
from sklearn.metrics import accuracy_score    
from sklearn.metrics import recall_score
from sklearn.metrics import precision_score
from sklearn.metrics import f1_score
from sklearn.metrics import confusion_matrix
import graphviz
from sklearn import tree
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier
import plotly.graph_objects as go
from sklearn.neural_network import MLPClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.gaussian_process import GaussianProcessClassifier
from sklearn.gaussian_process.kernels import RBF
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.linear_model import LogisticRegression
from imblearn.over_sampling import SMOTE
from sklearn.metrics import roc_curve
from sklearn.metrics import roc_auc_score
from sklearn.utils.class_weight import compute_sample_weight
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import precision_recall_curve
from sklearn.metrics import auc
from sklearn.feature_selection import mutual_info_classif
import warnings
warnings.filterwarnings(action='ignore', category=UserWarning)

In [None]:
df = pd.read_csv("/kaggle/input/telco-customer-churn/WA_Fn-UseC_-Telco-Customer-Churn.csv")

In [None]:
df.head()

In [None]:
df.info()

In [None]:
df.drop("customerID",axis='columns',inplace=True)

In [None]:
df.describe(include = 'all')

## Missing values resolution
 
TotalCharges is numeric, but it is stored as a string in the data.
 
During the transformation we found 11 invalid numbers. They contain " " because they belong to new customers with tenure equal to 0, so we set TotalCharges to zero for them.

In [None]:
df['TotalCharges'] = pd.to_numeric(df['TotalCharges'], errors='coerce')
print("Number of null values after transform: " + str(df.shape[0]-df['TotalCharges'].describe()["count"]))

df[df.isna().any(axis=1)]

In [None]:
df['TotalCharges'] = df['TotalCharges'].fillna(0)

# Data exploration 

- The dataset is unbalanced: only 27% of the customers in the dataset churned.
- Customers with lower tenure and higher monthly charges tend to churn more than other customers.

In [None]:
import plotly.express as px
def display_boxplot(y,x=None):
    if x is None:
        fig = px.box(df, y=y, points="all")
    else:
        fig = px.box(df, y=y,x=x, points="all", color=x)
    fig.show()

def display_distribution(column):
    fig = go.Figure()
    fig.add_trace(go.Histogram(x=df.loc[df.Churn == "No",column],name="No", histnorm='probability'))
    fig.add_trace(go.Histogram(x=df.loc[df.Churn == "Yes",column],name="Yes", histnorm='probability'))

    # Overlay both histograms
    fig.update_layout(barmode='overlay')
    # Reduce opacity to see both histograms
    fig.update_traces(opacity=0.75)
    fig.show()

## Target - Churn

The dataset is unbalanced: only 27% of the customers in the dataset churned.

For our task, the most important thing is to catch as many customers with the potential to churn as possible, so on this dataset we want to maximize recall while still keeping the other metrics in view.
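
To see why accuracy alone is misleading on this dataset, consider a naive model that predicts "No" for everyone. The sketch below uses an illustrative population of 100 customers that matches the 27% churn share; it is not computed from the real data.

```python
# Illustrative population matching the 27% churn share reported above.
y_true = [1] * 27 + [0] * 73   # 1 = churned, 0 = stayed
y_pred = [0] * len(y_true)     # naive model: nobody churns

true_pos = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
false_neg = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)

accuracy = correct / len(y_true)
recall = true_pos / (true_pos + false_neg)

print(f"accuracy = {accuracy:.2f}, recall = {recall:.2f}")
```

The naive model reaches 73% accuracy while finding no churners at all, which is why recall on the "Yes" class is the metric we track most closely.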

In [None]:
def display_bars(column):
    fig = px.histogram(df, x=column, barmode='group', histnorm = 'probability density')
    fig.show()

In [None]:
display_bars("Churn")

## Numerical columns

No outliers in data (boxplot view).

Customers with lower tenure and higher monthly charges tend to churn more than other customers.

In [None]:
numerical_columns = ["tenure","MonthlyCharges","TotalCharges"]

In [None]:
df[numerical_columns].describe()

### Tenure

- No outliers
- Churned customers have lower tenure than customers who did not churn: the more engaged customers are with the company, the more likely they are to stay.


In [None]:
display_boxplot("tenure")

In [None]:
display_boxplot("tenure",x="Churn")

In [None]:
display_distribution("tenure")

### MonthlyCharges

- No outliers
- Churned customers have higher monthly charges.


In [None]:
display_boxplot("MonthlyCharges")

In [None]:
display_boxplot("MonthlyCharges",x="Churn")

In [None]:
display_distribution("MonthlyCharges")

### TotalCharges

- No outliers
- Churned customers have _slightly_ lower total charges. Total charges combine tenure and monthly charges, and there is a high correlation between total charges and tenure: when tenure is low, total charges are low, and the customer has a higher chance of churning.

In [None]:
display_boxplot("TotalCharges")

In [None]:
display_boxplot("TotalCharges",x="Churn")

In [None]:
display_distribution("TotalCharges")

### Scatterplot monthly charges - tenure

From the point of view of the numeric columns, it seems that customers with lower tenure and higher monthly charges tend to churn more than other customers.

In [None]:
fig = px.scatter(df.sample(frac=0.2, random_state=123), x="tenure", y="MonthlyCharges",color="Churn")
fig.show()

## Categorical columns

In [None]:
text_columns = list(df.select_dtypes(exclude='number').columns)
text_columns

In [None]:
def display_bars_splitted(a,b="Churn"):
    df_g = df.groupby([a, b]).size().reset_index()
    df_g['percentage'] = df.groupby([a, b]).size().groupby(level=0).apply(lambda x: 100 * x / float(x.sum())).values
    df_g.columns = [a, b, 'Counts', 'Percentage']

    fig = px.bar(df_g, x=a, y=['Counts'], color=b, title = a, text=df_g['Percentage'].apply(lambda x: '{0:1.2f}%'.format(x)))
    fig.show()

### Gender

Gender has no impact on churn: the churn rate is similar for both genders.

In [None]:
display_bars_splitted("gender")

### Partner

Customers without a partner have a higher chance of churning.


In [None]:
display_bars_splitted("Partner")

### Dependents

Customers without dependents have a higher chance of churning. There is a correlation between dependents and partners.


In [None]:
display_bars_splitted("Dependents")

### PhoneService

It has no impact on churn

In [None]:
display_bars_splitted("PhoneService")

### InternetService

Customers with Fiber Optic have a higher churn rate than other customers. Customers with Fiber Optic also pay more, so they have higher monthly charges.

In [None]:
display_bars_splitted("InternetService")

### OnlineSecurity, OnlineBackup, DeviceProtection, TechSupport

In these four columns we see a similar pattern in the distribution.

Customers who have internet service but do not have these services have a higher churn rate.

In [None]:
display_bars_splitted("OnlineSecurity")
display_bars_splitted("OnlineBackup")
display_bars_splitted("DeviceProtection")
display_bars_splitted("TechSupport")

### StreamingTV, StreamingMovies

Compared to the previous internet services, customers with these extra services have a churn rate similar to that of customers without them.

StreamingTV and StreamingMovies are the most popular services.

In [None]:
display_bars_splitted("StreamingTV")
display_bars_splitted("StreamingMovies")

### Contract

Customers with a month-to-month contract have a high churn rate, while customers with a two-year contract have a much lower churn rate.

Contract is a useful parameter because it can be optimized in the sales process: sales should focus on two-year contracts to decrease the churn rate, for example by lowering the price or adding free extra services.

In [None]:
display_bars_splitted("Contract")

### PaperlessBilling

Customers with paperless billing have a higher churn rate than other customers.

In [None]:
display_bars_splitted('PaperlessBilling')

### PaymentMethod

Customers who pay by electronic check have a higher churn rate, while customers with automatic payments have a lower churn rate. The company should therefore motivate customers to set up automatic payments (one company in Slovakia offers a 2 Euro discount for automatic card payments).

In [None]:
display_bars_splitted("PaymentMethod")

# Feature transformation

We transformed features with binary values to numbers (1 = yes). The gender feature has two values, so 1 = female and 0 = male; this may require changes in the future.

We transformed categorical features by one-hot encoding.

In [None]:
def myTransform(df_transformed):
    binary_columns = ["Churn","PaperlessBilling","Partner","Dependents","PhoneService"]
    for column in binary_columns:
        df_transformed[column] = df_transformed[column].apply(lambda x: 1 if x=='Yes' else 0)

    df_transformed["gender"] = df_transformed["gender"].apply(lambda x: 1 if x=='Female' else 0)

    df_transformed = pd.get_dummies(df_transformed)
    return df_transformed


In [None]:
df_transformed = myTransform(df.copy())
df_transformed.info()
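
As a small illustration of what `pd.get_dummies` does in `myTransform`, the sketch below encodes a toy frame. The rows are made up; only the column names and the Contract categories mirror the dataset.

```python
import pandas as pd

# Toy frame: "Partner" is already numeric, "Contract" is still a text category.
toy = pd.DataFrame({
    "Partner": [1, 0, 1],
    "Contract": ["Month-to-month", "Two year", "One year"],
})

# Numeric columns pass through unchanged; each text category becomes its own column.
encoded = pd.get_dummies(toy)
print(list(encoded.columns))
```

Each row has exactly one active `Contract_*` column, which is where feature names such as `Contract_Month-to-month` in the later sections come from.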

## New feature - InternetServices

This feature contains the number of extra services a customer has for the internet connection.

Customers with one service have a higher churn rate. The churn rate decreases as the number of services increases.

In [None]:
df_transformed["InternetServices"] = df_transformed.StreamingMovies_Yes+df_transformed.StreamingTV_Yes+df_transformed.DeviceProtection_Yes+df_transformed.TechSupport_Yes+df_transformed.OnlineBackup_Yes+df_transformed.OnlineSecurity_Yes

df["InternetServices"] = df_transformed["InternetServices"]
display_bars_splitted("InternetServices")
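
The same relationship can be read as a table instead of a chart. The sketch below uses made-up rows to show the pattern; on the notebook data the equivalent call would be `df_transformed.groupby("InternetServices")["Churn"].mean()`.

```python
import pandas as pd

# Hypothetical rows, for illustration only: Churn is 1 for churned customers.
toy = pd.DataFrame({
    "InternetServices": [1, 1, 1, 1, 3, 3, 3, 3],
    "Churn":            [1, 1, 0, 0, 1, 0, 0, 0],
})

# The mean of a 0/1 column is the share of ones, i.e. the churn rate per group.
churn_rate = toy.groupby("InternetServices")["Churn"].mean()
print(churn_rate)
```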

# Correlation between features

Observations from correlation: 

- Customers with a month-to-month contract have a higher chance of churning
- Customers with shorter tenure have a higher chance of churning
- The highest correlation is between TotalCharges and tenure, which is expected: customers who use the services longer have paid more in total
- Customers with internet pay more
- Customers with Fiber Optic pay more
- Customers with additional services pay more
- Customers with a two-year contract have longer tenure
- Customers with a partner have longer tenure



In [None]:
columns_to_show = ['SeniorCitizen', 'tenure', 'MonthlyCharges', 'TotalCharges', 'Churn',
       'gender', 'Partner',
        'Dependents', 'PhoneService',
         'MultipleLines_Yes',
       'InternetService_DSL', 'InternetService_Fiber optic',
       'InternetService_No', 'OnlineSecurity_No', 'OnlineSecurity_Yes',
       'OnlineBackup_No', 
       'OnlineBackup_Yes', 'DeviceProtection_No', 'DeviceProtection_Yes',
       'TechSupport_No', 'TechSupport_Yes',
       'StreamingTV_No', 'StreamingTV_Yes',
       'StreamingMovies_No', 
       'StreamingMovies_Yes', 'Contract_Month-to-month', 'Contract_One year',
       'Contract_Two year', 'PaperlessBilling',
       'PaymentMethod_Bank transfer (automatic)',
       'PaymentMethod_Credit card (automatic)',
       'PaymentMethod_Electronic check', 'PaymentMethod_Mailed check','InternetServices']
corr = df_transformed[columns_to_show].corr()
plt.figure(figsize=(30,20))
sns.heatmap(corr, cmap="Purples",annot=True)
plt.title('correlation heatmap', fontsize=30)
plt.show()

# Model comparison

We compared several basic models from several points of view: with the original data, with scaled data, and with balanced data.

The best-performing model is LogisticRegression, with 80.4% accuracy and 54.7% recall. For our task recall is more important, so we tuned the model and raised recall to 77% (lowering accuracy to 75.7%) with class weighting.

In [None]:
X = df_transformed.loc[:, df_transformed.columns != 'Churn']
y = df_transformed.loc[:,"Churn"]

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42,stratify=y)

In [None]:
def score_all(y_true,y_pred):
    print("Accuracy:  " + str(accuracy_score(y_true, y_pred.round())))
    print("Recall:    " + str(recall_score(y_true, y_pred.round(), average=None)))
    print("Precision: " + str(precision_score(y_true, y_pred.round(), average=None)))
    print("F1:        " + str(f1_score(y_true, y_pred.round(), average=None)))
    # Use the labels passed in, not the global y_test.
    return confusion_matrix(y_true,y_pred.round())
  
def compare_models(X_train,y_train,X_test,y_test):
  names = ["KNeighborsClassifier","SVC","DecisionTreeClassifier","RandomForestClassifier","MLPClassifier","AdaBoostClassifier","XGBClassifier","LogisticRegression"]
  models = [KNeighborsClassifier(3),
      SVC(gamma=2, C=1),
      DecisionTreeClassifier(),
      RandomForestClassifier(),
      MLPClassifier(max_iter=1000),
      AdaBoostClassifier(),
      XGBClassifier(eval_metric = 'logloss',use_label_encoder=False),
      LogisticRegression(max_iter=2000)]

  results = []
  for i,model in enumerate(models):
      print(names[i])
      start = time.time()
      model.fit(X_train,y_train)
      y_pred = model.predict(X_test)
      results.append({ 
      "model":names[i],
      "accuracy":accuracy_score(y_test, y_pred.round()),
      "recall":recall_score(y_test, y_pred.round(), average=None)[1],
      "precision":precision_score(y_test, y_pred.round(), average=None)[1],
      # [1] selects the score of the churn class, consistent with recall and precision
      "f1":f1_score(y_test, y_pred.round(), average=None)[1],})
      print((time.time()-start)/60)
  return pd.DataFrame(results)

In [None]:
results = compare_models(X_train,y_train,X_test,y_test)
results.sort_values("accuracy",ascending = False)

## After scaling

Minimal effect of scaling on performance.

When we tried a linear SVM, the run time decreased from 10 minutes to under 1 minute.

In [None]:
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
scaler.fit(X_train)
results = compare_models(scaler.transform(X_train),y_train,scaler.transform(X_test),y_test)
results.sort_values("accuracy",ascending = False)
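
For reference, min-max scaling maps each feature to the range seen in the training data with `(x - min) / (max - min)`. The sketch below reimplements that formula in plain Python with made-up tenure values; the helper names are ours, and the notebook itself relies on `MinMaxScaler`.

```python
def fit_min_max(values):
    """Learn the scaling bounds from the training values only."""
    return min(values), max(values)


def transform_min_max(values, lo, hi):
    """Apply (x - min) / (max - min) with the previously learned bounds."""
    return [(v - lo) / (hi - lo) for v in values]


train_tenure = [0, 12, 36, 72]   # hypothetical training values
test_tenure = [6, 80]            # hypothetical test values

lo, hi = fit_min_max(train_tenure)
scaled_train = transform_min_max(train_tenure, lo, hi)
scaled_test = transform_min_max(test_tenure, lo, hi)

print(scaled_train)
print(scaled_test)
```

Because the bounds come from the training split, a test value above the training maximum is mapped above 1. This is expected and mirrors fitting the scaler on `X_train` only.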

## After balancing

For balancing we used SMOTE.

Balancing improved the recall of the models. For XGBoost, recall increased by ~14% in relative terms.

In [None]:
oversample = SMOTE()
over_X, over_y = oversample.fit_resample(X_train, y_train)
results = compare_models(pd.DataFrame(over_X,columns=X_test.columns),over_y,X_test,y_test)
results.sort_values("accuracy",ascending = False)
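
The core idea of SMOTE is to create a synthetic minority sample on the line segment between a minority sample and one of its nearest minority-class neighbors. The sketch below shows only that interpolation step with made-up values; it is a simplified illustration and not the `imblearn` implementation, which also handles the neighbor search.

```python
import random


def interpolate(sample, neighbor, rng):
    """Return a point on the segment between `sample` and `neighbor`."""
    gap = rng.random()  # uniform in [0, 1)
    return [s + gap * (n - s) for s, n in zip(sample, neighbor)]


rng = random.Random(0)
minority_sample = [2.0, 90.0]      # hypothetical (tenure, MonthlyCharges) of a churner
minority_neighbor = [4.0, 100.0]   # hypothetical nearby churner

synthetic = interpolate(minority_sample, minority_neighbor, rng)
print(synthetic)
```

Plain SMOTE treats every column as continuous, so one-hot columns of a synthetic row can take fractional values.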

# Model comparison - deep look on selected models

## XGBoost

We experimented with balancing the dataset for XGBClassifier using sample weights, which increase the importance of rows from the smaller class. The results show that this increases recall, while accuracy is slightly lower.

### XGBoost without balancing

In [None]:
recall = []
accuracy = []
model_xgb = XGBClassifier(verbosity = 0,use_label_encoder=False)
for x in range(5):
    X_train_2, X_test_2, y_train_2, y_test_2 = train_test_split(X_train,y_train,test_size=0.2, stratify=y_train, shuffle=True)
    model_xgb.fit(X_train_2, y_train_2)
    y_pred = model_xgb.predict(X_test_2)
    accuracy.append(accuracy_score(y_test_2, y_pred))
    recall.append(recall_score(y_test_2, y_pred, average=None)[1])
print("Recall:   " + str(np.mean(recall)))
print("Accuracy: " + str(np.mean(accuracy)))

### XGBoost with balanced classes

Classes are balanced by sample weights.

In [None]:
recall = []
accuracy = []
model_xgb = XGBClassifier(verbosity = 0,use_label_encoder=False)
for x in range(5):
    X_train_2, X_test_2, y_train_2, y_test_2 = train_test_split(X_train,y_train,test_size=0.2, stratify=y_train, shuffle=True)
    model_xgb.fit(X_train_2, y_train_2, sample_weight = compute_sample_weight("balanced", y_train_2))
    y_pred = model_xgb.predict(X_test_2)
    accuracy.append(accuracy_score(y_test_2, y_pred))
    recall.append(recall_score(y_test_2, y_pred, average=None)[1])
print("Recall:   " + str(np.mean(recall)))
print("Accuracy: " + str(np.mean(accuracy)))
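
The "balanced" weighting gives each class the weight `n_samples / (n_classes * class_count)`, so both classes contribute the same total weight. The sketch below reimplements that formula in plain Python on an illustrative 73/27 split; the function name is ours, and the notebook itself calls `compute_sample_weight("balanced", ...)`.

```python
from collections import Counter


def balanced_sample_weights(labels):
    """Weight each row by n_samples / (n_classes * count of its class)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    class_weight = {c: n_samples / (n_classes * k) for c, k in counts.items()}
    return [class_weight[y] for y in labels]


labels = [0] * 73 + [1] * 27   # illustrative split matching the 27% churn share
weights = balanced_sample_weights(labels)

print(f"weight per 'No' row:  {weights[0]:.3f}")
print(f"weight per 'Yes' row: {weights[-1]:.3f}")
```

With this split a churned row counts about 2.7 times as much as a non-churned row during training.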

### Feature importance

For feature importance from XGBoost we use gain (the average gain across all splits in which the feature is used).

The most important feature is the month-to-month contract. The second is the Fiber Optic feature, with much lower importance. This corresponds with our findings from the data exploration.

In [None]:
feature_importances = pd.DataFrame({"columns":X_train.columns,"importances":model_xgb.feature_importances_}).sort_values("importances",ascending=False)
feature_importances

In [None]:
fig = px.bar(feature_importances.sort_values("importances",ascending=False).iloc[:10].sort_values("importances",ascending=True), y='columns', x='importances', text='importances', orientation='h')
fig.update_traces(texttemplate='%{text:.3r}', textposition='outside')
fig.show()

### Learning curves

The logloss curve on the test data stabilizes as the iterations progress and does not increase, so there is no sign of overfitting.

In [None]:
def plot_results_metrics(result_val, title, eval_metric=["aucpr","logloss"]):

    sns.set_theme()
    for ev in eval_metric:
        df_plot_1_train=pd.DataFrame()
        df_plot_1_test=pd.DataFrame()
        
        for i in result_val:
            df_plot_1_train[str(i)]=result_val[i]['validation_0'][ev]
            df_plot_1_test[str(i)]=result_val[i]['validation_1'][ev]

        fig = plt.figure(figsize=(30,9))


        train_label_dic={}
        for i in range(df_plot_1_train.columns.shape[0]):
            train_label_dic[str(i)]='Train_'+df_plot_1_train.columns[i]

        test_label_dic={}
        for i in range(df_plot_1_test.columns.shape[0]):
            test_label_dic[str(i)]='Test_'+df_plot_1_test.columns[i]

        sns.lineplot(data=df_plot_1_train.rename(columns=train_label_dic), marker='.')
        sns.lineplot(data=df_plot_1_test.rename(columns=test_label_dic), dashes=False)

        plt.title(title+' '+ str(ev), fontsize=20)
        plt.xlabel('epoch')
        plt.ylabel(str(ev))
        plt.show()

In [None]:
result_val = {}
model_multi = {}
eval_metric=["logloss"]
for i in range(5):
    search_time_start = time.time()
    X_train_2, X_test_2, y_train_2, y_test_2 = train_test_split(X_train,y_train,test_size=0.3, stratify=y_train, shuffle=True)
    eval_set = [(X_train_2, y_train_2), (X_test_2, y_test_2)]

    model = XGBClassifier(
                            verbosity = 0,
                            use_label_encoder=False,
                            # set here because newer xgboost releases no longer accept eval_metric in fit()
                            eval_metric = eval_metric,
                            subsample = 0.8, 
                            min_child_weight = 5,
                            max_depth = 4,
                            learning_rate = 0.1,
                            gamma = 5,
                            colsample_bytree= 1.0
                             )
    model.fit(X_train_2, y_train_2, eval_set=eval_set, verbose=False, sample_weight = compute_sample_weight("balanced", y_train_2))
    model_multi[i]=model
    result_val[i]=model.evals_result()
    print("iter: {}, search time: {}".format(i,time.time() - search_time_start))

In [None]:
print('Check learning curve + overfitting?')
title=''
plot_results_metrics(result_val, title,  eval_metric=['logloss'])

### Hyperparameter tuning



In [None]:
params = {
        'min_child_weight': [1, 5, 10],
        'gamma': [0.5, 1, 1.5, 2, 5],
        'subsample': [0.6, 0.8, 1.0],
        'colsample_bytree': [0.6, 0.8, 1.0],
        'max_depth': [3, 4, 5],
        'learning_rate': [0.02,0.1]
        
        }
xgb = XGBClassifier( objective='binary:logistic',
                    use_label_encoder=False,
                    verbosity = 0, nthread=1)
folds = 3
param_comb = 150

skf = StratifiedKFold(n_splits=folds, shuffle = True, random_state = 42)
oversample = SMOTE()
over_X, over_y = oversample.fit_resample(X_train, y_train)
over_X = pd.DataFrame(over_X,columns=X_train.columns)
random_search = RandomizedSearchCV(xgb, param_distributions=params, n_iter=param_comb, scoring='accuracy', n_jobs=-1, cv=skf.split(over_X, over_y), verbose=3, random_state=42 )


random_search.fit(over_X, over_y)

pd.DataFrame(random_search.cv_results_).sort_values("rank_test_score",ascending = False)[["param_subsample",	'param_min_child_weight',	'param_max_depth',	'param_learning_rate',	'param_gamma',	'param_colsample_bytree', "mean_test_score","rank_test_score"]]

In [None]:
print('Best hyperparameters: \n')
print(random_search.best_params_)
score_all(y_test,random_search.best_estimator_.predict(X_test))

## RandomForestClassifier

We chose random forest because it gives a good explanation of feature importance.

We compared two RF approaches to class balancing:
- without balancing
- balancing the dataset with SMOTE

The second approach increased recall by ~6% in relative terms but decreased accuracy. Recall is still low, around 50%.

### RandomForestClassifier without balancing

In [None]:
model_rf = RandomForestClassifier()

recall = []
accuracy = []
for x in range(5):
    X_train_2, X_test_2, y_train_2, y_test_2 = train_test_split(X_train,y_train,test_size=0.2, stratify=y_train, shuffle=True)
    model_rf.fit(X_train_2, y_train_2)
    y_pred = model_rf.predict(X_test_2)
    accuracy.append(accuracy_score(y_test_2, y_pred))
    recall.append(recall_score(y_test_2, y_pred, average=None)[1])
print("Recall:   " + str(np.mean(recall)))
print("Accuracy: " + str(np.mean(accuracy)))

### RandomForestClassifier with SMOTE

In [None]:
model_rf = RandomForestClassifier()
oversample = SMOTE()

recall = []
accuracy = []
for x in range(5):
    X_train_2, X_test_2, y_train_2, y_test_2 = train_test_split(X_train, y_train,test_size=0.2, stratify=y_train, shuffle=True)
    over_X, over_y = oversample.fit_resample(X_train_2, y_train_2)
    over_X = pd.DataFrame(over_X,columns = X_train_2.columns)
    model_rf.fit(over_X, over_y)
    y_pred = model_rf.predict(X_test_2)
    accuracy.append(accuracy_score(y_test_2, y_pred))
    recall.append(recall_score(y_test_2, y_pred, average=None)[1])
print("Recall:   " + str(np.mean(recall)))
print("Accuracy: " + str(np.mean(accuracy)))

### Feature importance

The most important features from RandomForest include TotalCharges, the month-to-month contract, and tenure.

The performance of RF in terms of recall is fairly low, so I prefer to trust XGBoost or logistic regression more.

In [None]:
over_X, over_y = oversample.fit_resample(X_train, y_train)
over_X = pd.DataFrame(over_X,columns = X_train.columns)
model_rf.fit(over_X, over_y)
feature_importances = pd.DataFrame({"columns":X_train.columns,"importances":model_rf.feature_importances_}).sort_values("importances",ascending=False)
feature_importances

In [None]:
fig = px.bar(feature_importances.sort_values("importances",ascending=False).iloc[:10].sort_values("importances",ascending=True), y='columns', x='importances', text='importances', orientation='h')
fig.update_traces(texttemplate='%{text:.3r}', textposition='outside')
fig.show()

## Decision Tree

The decision tree lets us see the conditions behind each decision.

The model has fairly poor performance in terms of accuracy, but we can see that customers with a month-to-month contract have a higher chance of churning.

After increasing the depth, a new rule appears: when a customer has a month-to-month contract, does not have Fiber Optic, and has a tenure higher than 3, the customer will not churn.

In [None]:
model_tree = DecisionTreeClassifier(random_state=123,max_depth=2,class_weight='balanced')


model_tree.fit(X_train, y_train)

score_all(y_test, model_tree.predict(X_test))
dot_data = tree.export_graphviz(model_tree, out_file=None, 
                               feature_names=X_train.columns,  
                               class_names=["No","Yes"],
                                filled=True)

# Draw graph
graph = graphviz.Source(dot_data, format="png") 
graph

In [None]:
model_tree = DecisionTreeClassifier(random_state=123,max_depth=3,class_weight='balanced')


model_tree.fit(X_train, y_train)

score_all(y_test, model_tree.predict(X_test))
dot_data = tree.export_graphviz(model_tree, out_file=None, 
                               feature_names=X_train.columns,  
                               class_names=["No","Yes"],
                                filled=True)

# Draw graph
graph = graphviz.Source(dot_data, format="png") 
graph
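
Rules like the one quoted above can also be printed as text with `sklearn.tree.export_text`, which is convenient when graphviz is not available. The sketch below fits a tiny tree on made-up data with a single hypothetical `tenure` feature; on the notebook data the call would be `export_text(model_tree, feature_names=list(X_train.columns))`.

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# Made-up data: short-tenure customers churn (1), long-tenure customers stay (0).
X_toy = [[0], [1], [2], [5], [6], [7]]
y_toy = [1, 1, 1, 0, 0, 0]

toy_tree = DecisionTreeClassifier(max_depth=1, random_state=123).fit(X_toy, y_toy)

# One line per split condition and one per leaf, indented by depth.
rules = export_text(toy_tree, feature_names=["tenure"])
print(rules)
```

A leaf reports the majority class of the training rows that reach it, so "will not churn" should be read as "the tree predicts No for this segment", not as a guarantee.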

## LogisticRegression

LogisticRegression was the best-performing model in the previous experiment, so we looked at it in more depth.

### Class balancing
We tried multiple approaches to class balancing.

The best results in terms of recall with acceptable accuracy were achieved by LogisticRegression with class weights (3 for No, 7 for Yes): **recall ~77% and accuracy ~75.6%**
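
Class weights multiply the per-row loss, so with weights 3 and 7 a mistake on a churner costs 7/3 times as much as an equally confident mistake on a loyal customer. The sketch below shows only this data term of the loss with made-up probabilities; the regularization term that LogisticRegression adds is left out, and the function name is ours.

```python
import math

CLASS_WEIGHT = {0: 3, 1: 7}   # weights used in the notebook: No = 3, Yes = 7


def weighted_log_loss(y_true, p_churn, class_weight=CLASS_WEIGHT):
    """Sum of class-weighted log losses for predicted churn probabilities."""
    total = 0.0
    for y, p in zip(y_true, p_churn):
        loss = -math.log(p) if y == 1 else -math.log(1.0 - p)
        total += class_weight[y] * loss
    return total


missed_churner = weighted_log_loss([1], [0.2])   # churner scored as unlikely to churn
false_alarm = weighted_log_loss([0], [0.8])      # loyal customer scored as likely to churn

print(f"missed churner: {missed_churner:.3f}")
print(f"false alarm:    {false_alarm:.3f}")
```

This asymmetry pushes the model to flag more customers as churners, which raises recall at the cost of some accuracy.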

### Hyperparameter tuning
We ran hyperparameter tuning on the best-performing model from the previous parts. The results were not significantly better: the difference between the best and the worst configuration is less than 1%.

### Feature importance

The feature weights show that a month-to-month contract indicates a higher probability of churn, while a two-year contract indicates a lower probability. To reduce churn, our goal should therefore be to get customers to sign two-year contracts. We can offer a lower price or free extra internet services: the more services a customer has, the lower the chance of churn.



In [None]:
y_train.value_counts()/(y_train.shape[0])

In [None]:
from imblearn.under_sampling import NearMiss
print("Without balancing:")
modelLR = LogisticRegression(max_iter = 5000)
modelLR.fit(X_train, y_train)
score_all(y_test,modelLR.predict(X_test))

print("With balancing - over sample:")
modelLR = LogisticRegression(max_iter = 5000)
oversample = SMOTE()
over_X, over_y = oversample.fit_resample(X_train, y_train)
over_X = pd.DataFrame(over_X,columns=X_train.columns)
modelLR.fit(over_X, over_y)
score_all(y_test,modelLR.predict(X_test))

print("With balancing - under sample:")
modelLR = LogisticRegression(max_iter = 5000)
undersample = NearMiss()
under_X, under_y = undersample.fit_resample(X_train, y_train)
under_X = pd.DataFrame(under_X,columns=X_train.columns)
modelLR.fit(under_X, under_y)
score_all(y_test,modelLR.predict(X_test))

print("With balancing - class weights:")
modelLR = LogisticRegression(max_iter = 5000,class_weight={0:3,1:7})
modelLR.fit(X_train, y_train)
score_all(y_test,modelLR.predict(X_test))

Hyperparameter tuning:

In [None]:
solvers = ['newton-cg', 'lbfgs', 'liblinear']
penalty = ['l2','l1']
c_values = np.logspace(-4, 4, 20)
class_weight = [{0:3,1:7}]
# define grid search
grid = dict(solver=solvers,penalty=penalty,C=c_values,class_weight=class_weight)
model = LogisticRegression(max_iter = 1000)
grid_search = GridSearchCV(estimator=model, param_grid=grid, n_jobs=-1, cv=4, scoring='accuracy',error_score=0)
grid_result = grid_search.fit(X_train, y_train)
# summarize results
print("Best: %f using %s" % (grid_result.best_score_, grid_result.best_params_))
means = grid_result.cv_results_['mean_test_score']
stds = grid_result.cv_results_['std_test_score']
params = grid_result.cv_results_['params']
score_all(y_test,grid_result.best_estimator_.predict(X_test))

results = pd.DataFrame(grid_result.cv_results_).sort_values("rank_test_score")
results.head(20)

### Feature importance (weights)

LogisticRegression shows that a month-to-month contract indicates a higher probability of churn, while a two-year contract indicates a lower probability. To reduce churn, our goal should therefore be to get customers to sign two-year contracts. We can offer a lower price or free extra internet services: the more services a customer has, the lower the chance of churn.

Customers with paperless billing also have a higher chance of churning. It is hard to say how to explain or improve this.

In the data exploration part we saw that MonthlyCharges has an impact on the churn rate (the distributions differ between churned and non-churned customers). The LogisticRegression weights, however, suggest that MonthlyCharges does not have an impact on churn.

In [None]:
modelLR = LogisticRegression(max_iter = 5000,class_weight={0:3,1:7})
modelLR.fit(X_train, y_train)
score_all(y_test,modelLR.predict(X_test))
importance = modelLR.coef_[0]
feature_importances = pd.DataFrame({"columns":X_train.columns,"importances":importance}).sort_values("importances")
feature_importances

In [None]:
fig = px.bar(feature_importances, y='importances', x='columns', text='importances')
fig.update_traces(texttemplate='%{text:.3r}', textposition='outside')
fig.show()
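
When reading the weights, keep in mind that this model is fitted on unscaled features. A coefficient is the change in log-odds per one unit of the feature, so a feature with a wide range may matter even when its weight looks small. The sketch below uses hypothetical coefficients and ranges purely to illustrate the effect; the real values come from `modelLR.coef_[0]` and the data.

```python
import math

# Hypothetical coefficients, for illustration only.
coefficients = {"Contract_Two year": -0.80, "MonthlyCharges": 0.01}
# Hypothetical ranges (max - min): 1 for a 0/1 dummy, 100 for a monthly charge.
feature_range = {"Contract_Two year": 1.0, "MonthlyCharges": 100.0}

effects = {}
for name, coef in coefficients.items():
    odds_ratio_per_unit = math.exp(coef)      # multiplicative change in odds per unit
    effects[name] = coef * feature_range[name]  # log-odds change across the full range
    print(f"{name}: odds ratio per unit = {odds_ratio_per_unit:.3f}, "
          f"log-odds change over range = {effects[name]:+.2f}")
```

Comparing weights after scaling the numeric features, or comparing the change over each feature's range as above, gives a fairer picture of which features matter.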

### ROC and Precision-Recall curve

We see that the model learned information from the data.

In [None]:
model = LogisticRegression(max_iter = 5000)
model.fit(X_train, y_train)
lr_probs = model.predict_proba(X_test)


ns_probs = [0 for _ in range(len(y_test))]
# keep probabilities for the positive outcome only
lr_probs = lr_probs[:, 1]
# calculate scores
ns_auc = roc_auc_score(y_test, ns_probs)
lr_auc = roc_auc_score(y_test, lr_probs)
# summarize scores
print('No Skill: ROC AUC=%.3f' % (ns_auc))
print('Logistic: ROC AUC=%.3f' % (lr_auc))
# calculate roc curves
ns_fpr, ns_tpr, _ = roc_curve(y_test, ns_probs)
lr_fpr, lr_tpr, _ = roc_curve(y_test, lr_probs)
# plot the roc curve for the model
plt.plot(ns_fpr, ns_tpr, linestyle='--', label='No Skill')
plt.plot(lr_fpr, lr_tpr, marker='.', label='Logistic')
# axis labels
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
# show the legend
plt.legend()
# show the plot
plt.show()
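
ROC AUC can be read as the probability that a randomly chosen churner receives a higher score than a randomly chosen non-churner. The sketch below computes it by direct pairwise comparison on made-up scores; the function name is ours, and the notebook itself uses `roc_auc_score`.

```python
def pairwise_auc(y_true, scores):
    """Share of (positive, negative) pairs ranked correctly; ties count as half."""
    positives = [s for y, s in zip(y_true, scores) if y == 1]
    negatives = [s for y, s in zip(y_true, scores) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


auc_value = pairwise_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
no_skill = pairwise_auc([0, 0, 1, 1], [0, 0, 0, 0])

print(f"model:    {auc_value:.2f}")
print(f"no skill: {no_skill:.2f}")
```

A constant score, like the "No Skill" baseline in the plot, ties every pair and therefore gives an AUC of 0.5.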

In [None]:
model = LogisticRegression(max_iter = 5000)
model.fit(X_train, y_train)
lr_probs = model.predict_proba(X_test)

lr_probs = lr_probs[:, 1]
# predict class values
yhat = model.predict(X_test)
lr_precision, lr_recall, _ = precision_recall_curve(y_test, lr_probs)
lr_f1, lr_auc = f1_score(y_test, yhat), auc(lr_recall, lr_precision)
# summarize scores
print('Logistic: f1=%.3f auc=%.3f' % (lr_f1, lr_auc))
# plot the precision-recall curves
no_skill = len(y_test[y_test==1]) / len(y_test)
plt.plot([0, 1], [no_skill, no_skill], linestyle='--', label='No Skill')
plt.plot(lr_recall, lr_precision, marker='.', label='Logistic')
# axis labels
plt.xlabel('Recall')
plt.ylabel('Precision')
# show the legend
plt.legend()
# show the plot
plt.show()

# Feature selection

For feature selection we used mutual information, which measures the dependency between a feature and the target. Features with higher mutual information are more useful for prediction and are selected.

Feature selection had little impact on the results: it increased recall by 1% and decreased accuracy by 1%.

Models with many features are harder to understand; with feature selection we make the decisions easier to trust.

Selected features: 

In [None]:
results = mutual_info_classif(X_train,y_train, random_state=42)
feature_selection = pd.DataFrame({"column_names":X_train.columns,"info":results})
feature_selection.sort_values("info",ascending=False).head(10)
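
To make the mutual information score more concrete, the sketch below computes it for discrete variables directly from the definition, using made-up binary features. This is a plug-in estimate for illustration only; `mutual_info_classif` relies on a nearest-neighbor-based estimator for continuous features, so its values will differ.

```python
import math
from collections import Counter


def mutual_information(xs, ys):
    """I(X;Y) = sum over (x, y) of p(x,y) * log(p(x,y) / (p(x) * p(y))), in nats."""
    n = len(xs)
    joint = Counter(zip(xs, ys))
    px, py = Counter(xs), Counter(ys)
    mi = 0.0
    for (x, y), count in joint.items():
        p_xy = count / n
        mi += p_xy * math.log(p_xy / ((px[x] / n) * (py[y] / n)))
    return mi


churn = [1, 1, 0, 0]
informative = [1, 1, 0, 0]     # hypothetical feature that determines churn
uninformative = [1, 0, 1, 0]   # hypothetical feature independent of churn

mi_informative = mutual_information(informative, churn)
mi_uninformative = mutual_information(uninformative, churn)

print(f"informative:   {mi_informative:.3f}")
print(f"uninformative: {mi_uninformative:.3f}")
```

A feature that fully determines a balanced binary target scores log(2), about 0.693 nats, while an independent feature scores 0.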


In [None]:
fs = feature_selection.sort_values("info",ascending=False).head(10).column_names
modelLR = LogisticRegression(max_iter = 5000,class_weight={0:3,1:7})
modelLR.fit(X_train[fs], y_train)
score_all(y_test,modelLR.predict(X_test[fs]))