# Table of contents

1. [Import Packages](#1)
2. [EDA](#2)
3. [Data processing](#3)
    - 3.1. [One-hot encoding and scaling](#3.1)
    - 3.2. [Oversampling using SMOTE](#3.2)
    - 3.3. [Feature selection](#3.3)
4. [Fitting models](#4)
    - 4.1. [Model evaluation - Accuracy, Confusion matrix, ROC-AUC score](#4.1)
    - 4.2. [Adding weights to Logistic Regression](#4.2)
    - 4.3. [Using different Classifiers](#4.3)
    - 4.4. [Hyperparameter tuning for KNN and RandomForest](#4.4)
5. [Summary](#5)


# Import packages <a id="1"></a>

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here are several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt
import seaborn as sns
from imblearn.over_sampling import SMOTE
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
from sklearn import preprocessing
import statsmodels.api as sm
from sklearn import metrics
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report
from sklearn.metrics import roc_auc_score
from sklearn.metrics import roc_curve


# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

In [None]:
data = pd.read_csv('/kaggle/input/telco-customer-churn/WA_Fn-UseC_-Telco-Customer-Churn.csv')


# EDA <a id="2"></a>

First let's get familiar with the data structure, data types and check if there are any missing values.

In [None]:
data.head()

In [None]:
data.dtypes

The data mixes encodings: some binary features are stored as 0/1 and others as Yes/No. This is not a problem here, since I will one-hot encode all categorical features later, but it's something to keep an eye on: it can break a pipeline or automation that assumes all binary features arrive in a single format (Yes/No or 0/1).

Also, TotalCharges has dtype object, which is suspicious. Let's convert it to float, which is what I would expect, since it is just an aggregate of MonthlyCharges.

In [None]:
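# Blank strings cannot be parsed as numbers, so coerce them to NaN first
data.TotalCharges = pd.to_numeric(data.TotalCharges, errors='coerce')
# Forward-fill the gaps, which is what replace(' ', None) did implicitly in older pandas versions
data.TotalCharges = data.TotalCharges.ffill()
data.TotalCharges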

In [None]:
data.info()

We can see that there are no missing values in the dataset. Next I will check some basic statistics for numeric columns.

In [None]:
data.describe()

In [None]:
data.Churn.value_counts().plot(kind='bar',figsize=(10,5))
plt.title('Churn comparison')
plt.ylabel('Number of customers')
plt.show()

It's very important to understand the distributions of your data, and visualization is an easy way to check for outliers.
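
As a rough sketch of what a box plot's outlier markers represent: points are conventionally flagged when they lie more than 1.5 × IQR beyond the quartiles. The values below are made up for illustration, and the helper name `iqr_outliers` is hypothetical.

```python
from statistics import quantiles

def iqr_outliers(values, whisker=1.5):
    """Return the values lying outside the box-plot whiskers."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - whisker * iqr, q3 + whisker * iqr
    return [v for v in values if v < low or v > high]

# Hypothetical monthly charges with one obviously extreme value
charges = [20, 25, 30, 35, 40, 45, 50, 55, 60, 400]
outliers = iqr_outliers(charges)
print(outliers)
```

Plotting libraries may compute the quartiles slightly differently, so borderline points can be classified differently than in this sketch.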

Now let's see if there is any feature (column), that could be indicative of churn. 

In [None]:
fig, axes = plt.subplots(nrows=1, ncols=2, figsize=(20, 5))
sns.boxplot(x=data.tenure, ax=axes[0])  # boxplot is useful because it has built-in visualization of outliers (black points)
sns.histplot(x=data.tenure, kde=True, ax=axes[1])  # histplot replaces the deprecated distplot
axes[1].set_xlabel('Number of months with company')
axes[1].set_ylabel('Number of customers')
axes[1].set_title('Distribution of tenure length of customers')
plt.show()

In [None]:
fig, axes = plt.subplots(nrows=1, ncols=2, figsize=(20, 5))
sns.boxplot(x=data.MonthlyCharges, ax=axes[0])
sns.histplot(x=data.MonthlyCharges, kde=True, ax=axes[1])  # histplot replaces the deprecated distplot
axes[1].set_xlabel('Monthly charge')
axes[1].set_ylabel('Number of customers')
axes[1].set_title('Distribution of monthly charges of customers')
plt.show()

In [None]:
fig, axes = plt.subplots(nrows=1, ncols=2, figsize=(20, 5))
sns.boxplot(x=data.TotalCharges, ax=axes[0])
sns.histplot(x=data.TotalCharges, kde=True, ax=axes[1])  # histplot replaces the deprecated distplot
axes[1].set_xlabel('Total charges')
axes[1].set_ylabel('Number of customers')
axes[1].set_title('Distribution of total charges of customers')
plt.show()

In [None]:
table=pd.crosstab(data.gender,data.Churn)
table.div(table.sum(1).astype(float), axis=0).plot(kind='bar',figsize=(10,5),title='Churn distribution based on gender',stacked=True)
plt.show()

In [None]:
data.groupby('Churn').mean(numeric_only=True) #quick way to see if there are any significant indicators among the numerical columns

The big difference in tenure is expected, because retained customers have simply spent more time with the company. Similar logic applies to TotalCharges.

More interesting is that customers with higher monthly charges seem more inclined to churn. We can act on this by asking the Product team to review the higher-priced products.

Now let's look at the distributions of some features, split by the churn indicator.

In [None]:
table=pd.crosstab(data.SeniorCitizen,data.Churn)
table.div(table.sum(1).astype(float), axis=0).plot(kind='bar',figsize=(10,5),title='Churn distribution based on senior citizenship',stacked=True)
plt.show()

Senior citizens seem more inclined to churn. The Marketing team can use this to adjust its advertising, putting less emphasis on the churn-prone senior audience and focusing on non-seniors instead.

In [None]:
table=pd.crosstab(data.tenure,data.Churn)
table.div(table.sum(1).astype(float), axis=0).plot(kind='bar',figsize=(20,5),title='Churn distribution based on tenure',stacked=True)
plt.show()

This is the closest we can come to a retention curve with this dataset.

It's not very accurate, because we don't know the full starting cohorts. Nonetheless, it's very important to understand the retention of your customers, mainly to be able to estimate customer lifetime value. The biggest advantage of this dataset is that the relationship between customer and company is contractual, so there are no silent churns. This greatly improves labeling, and it also provides opportunities to gather feedback from customers on why they decided to churn.

That feedback would be very useful data for future modelling.

In [None]:
table=pd.crosstab(data.Contract,data.Churn)
table.div(table.sum(1).astype(float), axis=0).plot(kind='bar',figsize=(10,5),title='Churn distribution based on contract type',stacked=True)
plt.show()

This graph confirms the intuition that a customer on a longer-term contract is less likely to churn. I will work on the assumption that you can cancel only after your contract expires.

It's obvious information, but having it backed by data and graphs lets us tell Product/Sales to really push for long-term contracts. Even the difference between the One year and Two year options looks significant!

In [None]:
table=pd.crosstab(data.PaymentMethod,data.Churn)
table.div(table.sum(1).astype(float), axis=0).plot(kind='bar',figsize=(10,5),title='Churn distribution based on payment method',stacked=True)
plt.show()

The payment method that really stands out here is Electronic check.

The difference is large enough that it's worth looking more closely to understand who these customers are.

In [None]:
ec_df=data.loc[data.PaymentMethod == 'Electronic check',]

In [None]:
ec_df.describe()

In [None]:
plt.figure(figsize=(10,5))
sns.countplot(x='InternetService',data=ec_df)
plt.show()

In [None]:
plt.figure(figsize=(10,5))
sns.countplot(x='MultipleLines',data=ec_df)
plt.show()

In [None]:
plt.figure(figsize=(10,5))
sns.countplot(x='Contract',data=ec_df)
plt.show()

Summarizing the three graphs and the table above, we can hypothesize that customers who pay by non-automated Electronic check tend to be on very short-term contracts and prefer more expensive services (my assumption), which drives up the monthly charge.

Here it's very important to understand the product's positioning on the market: how does this company's Fiber + MultipleLines offering compare to competitors? Is the pricing competitive, given that the mean MonthlyCharges in this group is significantly higher than in the rest of the dataset? What are the cancellation policies?

In [None]:
plt.figure(figsize=(10,5))
sns.scatterplot(x="MonthlyCharges", y="TotalCharges", hue="Churn",data=data)
plt.show()

Here we can see that higher monthly charges bring in more revenue, but they also tend to increase churn. Similar action points can be drawn here as in the previous analysis of the Electronic check data.

In [None]:
plt.figure(figsize=(10,5))
sns.scatterplot(x="tenure", y="MonthlyCharges", hue="Churn",
                     data=data)
plt.show()

The main takeaway from this graph is that there seems to be a kind of safe MonthlyCharges level that doesn't indicate churn at any tenure.

To see this level better, let's look at MonthlyCharges a bit differently.

In [None]:
plt.figure(figsize=(10,5))
ax = sns.kdeplot(data.MonthlyCharges[(data["Churn"] == 'No') ],
                color="Blue", fill = True)  # fill replaces the deprecated shade argument
ax = sns.kdeplot(data.MonthlyCharges[(data["Churn"] == 'Yes') ],
                ax =ax, color="Orange", fill= True)
ax.legend(["Retained","Churned"],loc='upper right')
ax.set_ylabel('Density')
ax.set_xlabel('Monthly Charges')
ax.set_title('Distribution of monthly charges by churn')
plt.show()

We can see that a lot of users, who churned, were paying above 60 USD in Monthly Charges and on the other hand the most prevalent Monthly Charges group in retained customers is spending around 20 USD monthly.

In [None]:
plt.figure(figsize=(10,5))
ax = sns.kdeplot(data.TotalCharges[(data["Churn"] == 'No') ],
                color="Blue", fill = True)  # fill replaces the deprecated shade argument
ax = sns.kdeplot(data.TotalCharges[(data["Churn"] == 'Yes') ],
                ax =ax, color="Orange", fill= True)
ax.legend(["Retained","Churned"],loc='upper right')
ax.set_ylabel('Density')
ax.set_xlabel('Total Charges')
ax.set_title('Distribution of total charges by churn')
plt.show()

Here is a look at another really important metric that I've already mentioned: customer lifetime value (CLV).

As with retention, this number is not as precise as I would like, but it gives us a good idea.

The company is not getting much revenue from its customers, as they seem to be canceling their contracts early. Again, the Product team can have a look at the starter packages offered to customers, if there are any, and how they stack up.

# Data processing <a id="3"></a>

After exploring the data, it's time to start building models.

The first step is to prepare the data so that it is better suited to the models I will use later.

First I will remove the customerID column. The dataset has one row per customer, so there is no need to aggregate by ID.

I assume here that an ID is assigned only once to each customer, and that customers who cancel and rejoin get their original ID back.

In [None]:
data_mod = data.iloc[:, 1:]  # drop the first column (customerID)

## One-hot encoding and scaling of data <a id="3.1"></a>

Since many of the features are categorical, I have opted to one-hot encode them, as numerical input is required by most common models.

Another option was to encode the categories as arbitrary integers, but I'm not a big fan of this approach: assigning a particular value to a category (i.e. Electronic check = 1 vs. Electronic check = 4) can affect the final coefficients and adds the need for more validation of the model.

The biggest problem with one-hot encoding is that it creates additional columns, which in turn increases collinearity in the model and can lead to more overfitting. That's why I have decided to reduce the number of new columns where possible.

I'm looking for binary columns, because each of them can be reduced to a single column: dropping one of the two indicator columns loses no information, since the kept column fully determines it.
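
A minimal sketch of this idea on a toy frame (the rows are made up; only the column names mirror the dataset). It shows that a binary column collapses to a single indicator with `drop_first=True`, while a three-level column keeps one indicator per level.

```python
import pandas as pd

# Toy frame with one binary and one three-level column (values are illustrative)
toy = pd.DataFrame({
    "Partner": ["Yes", "No", "Yes"],
    "Contract": ["Month-to-month", "One year", "Two year"],
})

# Binary column: drop the first level, so only Partner_Yes remains
encoded = pd.get_dummies(toy, columns=["Partner"], drop_first=True)
# Remaining categorical columns: keep one indicator column per level
encoded = pd.get_dummies(encoded)

print(sorted(encoded.columns))
```

`Partner_No` never appears, because it would simply be the complement of `Partner_Yes`.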

In [None]:
bin_cols = data_mod.columns[data_mod.nunique() == 2] #columns with binary values, taken from the frame that is actually encoded

In [None]:
data_dumm_bin=pd.get_dummies(data_mod,columns=bin_cols,drop_first=True) #dropping one column, because for binary one-hot encoding it's not necessary

In [None]:
data_dumm=pd.get_dummies(data_dumm_bin) #one hot encoding the rest of the dataset

In [None]:
data_dumm

In [None]:
data_dumm.groupby('Churn_Yes').mean()

In [None]:
y = data_dumm['Churn_Yes'].values
X = data_dumm.drop(columns = ['Churn_Yes'])


features = X.columns.values
# Scaling matters because the models used are sensitive to feature scale:
# features with larger ranges would otherwise skew the coefficient weights towards them.
scaler = MinMaxScaler(feature_range = (0,1))
scaler.fit(X)
X = pd.DataFrame(scaler.transform(X))
X.columns = features

## Data oversampling - SMOTE <a id="3.2"></a>

As we saw in the first graph of this notebook, this dataset is imbalanced: there is roughly a 3:1 ratio of retained to churned customers.

A balanced dataset generally makes modelling easier.

I have decided to use SMOTE oversampling because it synthesizes new minority-class observations instead of just duplicating or discarding existing rows, as plain over-/undersampling would.

One potential pitfall with SMOTE on this particular dataset is that I'm using a lot of binary features. SMOTE takes a minority-class sample and one of its nearest minority-class neighbours and then picks a random point on the line between them, so it can create values such as 0.7 in a binary feature.

The effect shows up as really good training-set metrics from a model that then performs poorly on the validation (test) set.
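
To make the binary-feature pitfall concrete, here is a simplified sketch of the interpolation step only. It skips the nearest-neighbour search and is not imbalanced-learn's implementation; the two rows and the helper name `interpolate` are hypothetical.

```python
import random

def interpolate(sample, neighbour, rng):
    """Place a synthetic point at a random position between two samples."""
    gap = rng.random()  # one draw in [0, 1), shared by all features
    return [s + gap * (n - s) for s, n in zip(sample, neighbour)]

rng = random.Random(10)
# Hypothetical scaled rows: [binary flag, scaled MonthlyCharges]
churner_a = [1.0, 0.30]
churner_b = [0.0, 0.50]
synthetic = interpolate(churner_a, churner_b, rng)
print(synthetic)  # the binary flag ends up strictly between 0 and 1
```

imbalanced-learn also provides `SMOTENC`, which is designed for data that mixes categorical and continuous features and could be worth trying in a later iteration.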

But in my experience it was always a good starting point.

In [None]:
smote = SMOTE(random_state=10) #not named "os", which would shadow the os module imported above
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=10) #splitting data into training and testing sets helps detect overfitting and better
                                                                                          #indicates performance of a model on live (unseen) data.
columns = X_train.columns

In [None]:
os_data_X,os_data_y=smote.fit_resample(X_train, y_train) #fit_resample replaces fit_sample, which was removed from imbalanced-learn
os_data_X = pd.DataFrame(data=os_data_X,columns=columns )
os_data_y= pd.DataFrame(data=os_data_y,columns=['Churn_Yes'])
print("Length of oversampled data is ",len(os_data_X))
print("Number of retained users in oversampled data",len(os_data_y[os_data_y['Churn_Yes']==0]))
print("Number of churned users in oversampled data",len(os_data_y[os_data_y['Churn_Yes']==1]))
print("Proportion of retained users in oversampled data is ",len(os_data_y[os_data_y['Churn_Yes']==0])/len(os_data_X))
print("Proportion of churned users in oversampled data is ",len(os_data_y[os_data_y['Churn_Yes']==1])/len(os_data_X))

## Feature selection <a id="3.3"></a>

As mentioned before, the biggest issue with introducing more features is increased collinearity in the model, which in turn hurts the interpretability of the model (really important in this use case).

A rule of thumb I've learned is that you need 10 data points for every feature. That is more than satisfied here, but the number of features also affects how fast we can train the model, which can be a problem if we want to deploy it on real-time data.

In [None]:
data_dumm_vars=data_dumm.columns.values.tolist()
y=['Churn_Yes']
X=[i for i in data_dumm_vars if i not in y]

In [None]:
def cor_selector(X, y,num_feats):
    cor_list = []
    feature_name = X.columns.tolist()
    # calculate the correlation with y for each feature
    for i in X.columns.tolist():
        cor = np.corrcoef(X[i], y.iloc[:,0])[0, 1]
        cor_list.append(cor)
    # replace NaN with 0
    cor_list = [0 if np.isnan(i) else i for i in cor_list]
    # feature name
    cor_feature = X.iloc[:,np.argsort(np.abs(cor_list))[-num_feats:]].columns.tolist()
    # feature selection? 0 for not select, 1 for select
    cor_support = [True if i in cor_feature else False for i in feature_name]
    return cor_support, cor_feature
cor_support, cor_feature = cor_selector(os_data_X, os_data_y,10) #here I'm just reducing the number of features to 25% of starting values. This is one of the parameters,
                                                                 #that can be tuned later.
print(str(len(cor_feature)), 'selected features')

In [None]:
cols=cor_feature

In [None]:
X=os_data_X[cols]
y=os_data_y['Churn_Yes']

# Fitting model <a id="4"></a>

For a starting model I had originally decided to go with logistic regression, but I ran into a problem: the fit would converge, yet the R^2 was inf, the function value was inf, etc. Solving this would require significant effort that, in my opinion, is not warranted for this problem, so the coefficient report below comes from a GLM fit instead.

You may also notice that I'm not using the sklearn library for this part of the modelling. The reason is simple: statsmodels produces a coefficient report very similar to R's, which is very good for inferring each feature's impact on the predicted variable.

It's **VERY IMPORTANT** to note that in this dataset a churned customer is labeled as 1. That means that when we look at the coefficients, we are looking for negative values, as those have a positive impact on retention.

In [None]:
glm_model=sm.GLM(y,X) #family is left at the statsmodels default (Gaussian)
result=glm_model.fit()
print(result.summary2())

As I've mentioned, this report is a really clean way of seeing how each feature influences the target label.

We can see that all coefficients except the one for tenure are positive, which means those features are associated with churn and should be looked at.

It's also important to note that there is one feature with p>0.05, so I will remove it and refit the model.

Tenure is also not something the company can directly influence; it is a byproduct of customer happiness, so I will remove it as well.

In [None]:
cols_red=['TechSupport_No internet service', 'DeviceProtection_No',
       'OnlineBackup_No', 'InternetService_Fiber optic',
       'PaymentMethod_Electronic check', 'TechSupport_No',
       'OnlineSecurity_No', 'Contract_Month-to-month']

X=os_data_X[cols_red]
y=os_data_y['Churn_Yes']

In [None]:
glm_model=sm.GLM(y,X)
result=glm_model.fit()
print(result.summary2())

Here we have only columns with p<0.05, so we can say that the features are statistically significant.

What can be said about those features?
* **One is related to contract length**. The strongest predictor of churn is a Month-to-month contract.
* **Fiber optic** product is also a strong indicator of churn. We might need to look at the performance of this product (what do customers expect when they buy it?) and how it stacks up against competitors.
* **4 missing products** (DeviceProtection, OnlineBackup, TechSupport, OnlineSecurity)
* **One payment method** (Electronic check)
* **No internet service**

This info can be forwarded to the Product/Sales team to improve current offerings so that they include those products for new customers. For existing customers who are missing these products and are later flagged by the models as 'at risk of churning', we can offer the products in new deals.

Now let's see how good logistic regression is at predicting possible churn!

## Model evaluation - Accuracy, Confusion matrix, ROC-AUC score <a id="4.1"></a>

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
logreg = LogisticRegression(max_iter=4000)
logreg.fit(X_train, y_train)

What metrics should we use for evaluating the models?

I'm demonstrating the evaluation with 3 metrics:
* Accuracy
* Confusion matrix and related metrics
* Area under the receiver operating characteristic curve (ROC-AUC)

Why am I not using accuracy alone? The biggest problem with standalone accuracy is that on an imbalanced dataset, a model that labels everything as the majority class will have high accuracy while missing the most valuable targets.

The confusion matrix helps with exactly this, as it shows how many false negatives and false positives the model produced.

ROC-AUC tells us how good the model is at distinguishing between the labels, based on the probabilities it outputs.
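
The three metrics can be illustrated on a made-up test set with a plain-Python sketch. The helpers `confusion_counts` and `pairwise_auc` are hypothetical stand-ins for the sklearn functions used in this notebook; the "model" here always predicts the majority class.

```python
def confusion_counts(y_true, y_pred):
    """Return (tn, fp, fn, tp) for binary labels where 1 = churned."""
    pairs = list(zip(y_true, y_pred))
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    return tn, fp, fn, tp

def pairwise_auc(y_true, scores):
    """ROC-AUC as the share of (churned, retained) pairs ranked correctly; ties count half."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Made-up test set with a 3:1 ratio of retained (0) to churned (1)
y_true = [0, 0, 0, 0, 0, 0, 1, 1]
always_retained = [0] * len(y_true)  # baseline that never predicts churn
tn, fp, fn, tp = confusion_counts(y_true, always_retained)
accuracy = (tn + tp) / len(y_true)
recall = tp / (tp + fn)
auc = pairwise_auc(y_true, [0.1] * len(y_true))  # same score for everyone

print(accuracy, recall, auc)
```

The baseline reaches 75% accuracy while finding none of the churned customers, and its ROC-AUC of 0.5 shows that it cannot separate the two groups at all.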

In [None]:
y_pred = logreg.predict(X_test)
print('Accuracy of logistic regression classifier on test set: {:.2f}'.format(logreg.score(X_test, y_test)))

Considering the simplicity of this model, 76% accuracy on the test set isn't that bad. The problem is that this is roughly the share of retained customers among all customers, so simply assuming we retain everyone would give the same result, and such a model is useless.

In [None]:
cm = confusion_matrix(y_test, y_pred) #not assigned to "confusion_matrix", which would overwrite the imported function
print(cm)

Here we can see that the model was able to classify 592 churned users and definitely didn't label everyone as retained!

In [None]:
print(classification_report(y_test, y_pred))

In [None]:
logit_roc_auc = roc_auc_score(y_test, logreg.predict_proba(X_test)[:,1])
fpr, tpr, thresholds = roc_curve(y_test, logreg.predict_proba(X_test)[:,1])
plt.figure(figsize=(10,5))
plt.plot(fpr, tpr, label='Logistic Regression (area = %0.2f)' % logit_roc_auc)
plt.plot([0, 1], [0, 1],'r--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver operating characteristic')
plt.legend(loc="lower right")
plt.savefig('Log_ROC')
plt.show()

Also, a ROC-AUC of 0.84 (max is 1) is really promising!

## Adding weights to logistic regression <a id="4.2"></a>

One way I think I can improve this model is by making it more wary of the churned class.

From economic theory it's known that retaining a customer is much cheaper than signing a new one. By how much? I don't have the exact number, but it's something the internal Sales team can provide.

So I will randomly generate the cost of signing a new customer to be 5-20 times higher than the cost of keeping an existing one.
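
To make the link between a cost ratio and class weights explicit, the sketch below converts between the two for weights of the form `{0: k, 1: 1 - k}`. Both helper names are hypothetical.

```python
def implied_cost_ratio(k):
    """Weight of the churned class relative to the retained class for {0: k, 1: 1 - k}."""
    return (1 - k) / k

def weights_for_ratio(ratio):
    """Normalised class weights in which a churned customer counts `ratio` times more."""
    k = 1 / (1 + ratio)
    return {0: k, 1: 1 - k}

print(round(implied_cost_ratio(0.2), 2))   # lower end of the ratio range
print(round(implied_cost_ratio(0.05), 2))  # upper end of the ratio range
print(weights_for_ratio(5))
```

Drawing `k` uniformly from 0.05 to 0.2 therefore corresponds to ratios between roughly 4 and 19, which is close to, though not exactly, the 5-20 range stated above.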

In [None]:
scores = []
ks=[]
for i in range(1,40):
    k=np.random.uniform(0.05,0.2)
    logreg = LogisticRegression(class_weight={0:k,1:1-k},max_iter=4000)
    logreg.fit(X_train, y_train)
    score = roc_auc_score(y_test, logreg.predict_proba(X_test)[:,1])
    scores.append(score)
    ks.append(k)

plt.figure(figsize=(10,5))
plt.plot(range(1, 40), scores,marker='o', markerfacecolor='red', markersize=5)
plt.xlabel('Iteration')
plt.ylabel('ROC-AUC')
plt.title('Model response to class weights')
plt.show()

We can see from the graph that the model is largely indifferent to the weights, which could be caused by balancing the dataset beforehand.

## Using different classifiers <a id="4.3"></a>

Since it's really easy to try different models without having to implement them, let's try different classifiers and see how they stack up against Logistic Regression.

The way I imagine this could work is that we gather information about feature importance from more interpretable models (Logistic Regression, Bayesian models) and then use more sophisticated but less interpretable models (although there are methods like LIME and SHAP) to identify customers at risk of churn.

It's important to note that I have varying degrees of experience with these classifiers (ranging from purely theoretical to having used them on multiple occasions).

In [None]:
from sklearn.neural_network import MLPClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.svm import LinearSVC
from sklearn.gaussian_process import GaussianProcessClassifier
from sklearn.gaussian_process.kernels import RBF
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis
from sklearn.dummy import DummyClassifier

classifiers = {
    "Dummy"        : DummyClassifier(strategy='uniform', random_state=2),
    "KNN(3)"       : KNeighborsClassifier(3), 
    "RBF SVM"      : SVC(gamma=2, C=1,probability=True), 
    "Decision Tree": DecisionTreeClassifier(max_depth=7), 
    "Random Forest": RandomForestClassifier(max_depth=7, n_estimators=10, max_features=8), 
    "Neural Net"   : MLPClassifier(alpha=1), 
    "AdaBoost"     : AdaBoostClassifier(),
    "Naive Bayes"  : GaussianNB(), 
    "QDA"          : QuadraticDiscriminantAnalysis(),
    "Gaussian Proc": GaussianProcessClassifier(1.0 * RBF(1.0)),
}

In [None]:
from time import time
nfast = 9      # Not running the Gaussian Process, because it's a very slow method
head = list(classifiers.items())[:nfast]

for name, classifier in head:
    start = time()
    classifier.fit(X_train, y_train)  # fit on the training split only, so X_test stays unseen
    train_time = time() - start
    start = time()
    score = roc_auc_score(y_test, classifier.predict_proba(X_test)[:,1])
    score_time = time()-start
    print("{:<15}| ROC-AUC score = {:.3f} | time = {:,.3f}s/{:,.3f}s".format(name, score, train_time, score_time))

We can see a significant improvement when using the Random Forest classifier, so let's see if we can get more out of it.

Also, when I ran this comparison with accuracy as the metric, the KNN classifier performed really well, so let's look at that one as well.

## Hyperparameter tuning for KNN and RandomForest  <a id="4.4"></a>

In [None]:
scores = []
for k in range(1, 40):
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(X_train, y_train)  # fit on the training split only, so X_test stays unseen
    score = roc_auc_score(y_test, knn.predict_proba(X_test)[:,1])
    scores.append(score)
    
scores = pd.Series(scores, index=range(1,40), name="Score")
scores

It's hard to judge the effect of the parameter just from printed values, so let's look at it in a graph.

In [None]:
plt.figure(figsize=(10,5))
plt.plot(range(1, 40), scores,marker='o', markerfacecolor='red', markersize=5)
plt.xlabel('n_neighbors')
plt.ylabel('ROC-AUC')
plt.title('Model response to number of neighbors')
plt.show()

We can see that the model is largely indifferent to the number of neighbors used for classification; it improves only slightly, peaking at 11.

Now let's look at the number of trees used for predictions in Random Forest.

Note: if someone reading this isn't familiar with RF, it's an ensemble method that trains multiple decision trees and combines their outputs by voting (scikit-learn averages the trees' predicted class probabilities).
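
A tiny sketch of that combination step, assuming each tree has already produced a churn probability for one customer. The numbers and the helper name `forest_predict_proba` are made up for illustration.

```python
def forest_predict_proba(tree_probas):
    """Average the churn probabilities returned by the individual trees."""
    return sum(tree_probas) / len(tree_probas)

# Hypothetical churn probabilities from five trees for one customer
tree_probas = [0.9, 0.6, 0.4, 0.7, 0.8]
forest_proba = forest_predict_proba(tree_probas)
label = int(forest_proba >= 0.5)  # 1 = predicted to churn
print(round(forest_proba, 2), label)
```

Adding more trees mainly stabilises this average, which is one plausible reason why the score can flatten out quickly as `n_estimators` grows.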

In [None]:
scores = []
for k in range(1, 300, 10):
    RFC = RandomForestClassifier(max_depth=7, n_estimators=k, max_features=8)
    RFC.fit(X_train, y_train)  # fit on the training split only, so X_test stays unseen
    score = roc_auc_score(y_test, RFC.predict_proba(X_test)[:,1])
    scores.append(score)
    
scores = pd.Series(scores, range(1, 300, 10), name="Score")
scores

In [None]:
plt.figure(figsize=(10,5))
plt.plot(range(1, 300, 10), scores,marker='o', markerfacecolor='red', markersize=5)
plt.xlabel('n_estimators')
plt.ylabel('ROC-AUC')
plt.title('Model response to number of trees')
plt.show()

As with the number of neighbors in KNN, this parameter doesn't seem to influence model quality.

These were just 2 examples of parameters we can tune, and I'm tuning them in isolation here. There are methods, e.g. GridSearchCV, that evaluate combinations of parameter values at the same time and select the best according to a specified metric, but implementing them is out of scope of this assignment in my opinion.
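
For orientation only, a joint search with scikit-learn's `GridSearchCV` might look roughly like this. It runs on synthetic data from `make_classification` rather than the Telco data, and the grid values are arbitrary assumptions, not tuned choices.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the churn features (not the Telco data)
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)

# Every combination of these values is evaluated with cross-validation
param_grid = {
    "n_estimators": [10, 50],
    "max_depth": [3, 7],
}
search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid,
    scoring="roc_auc",  # same metric as used in this notebook
    cv=3,
)
search.fit(X_demo, y_demo)
print(search.best_params_, round(search.best_score_, 3))
```

Because the search cross-validates on the data it is given, it should be run on the training split only, with the test split kept for a final check.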

# Summary <a id="5"></a>

After applying data science techniques I was able to identify possible causes of churn in the products/services provided.

The team would also be provided with a list of customers who are most likely to churn, and they can cross-reference those customers' services with the findings above to mitigate churn.

Since this is a contractual setting, it would be very helpful to try to gather more information at the time of churn.

I see the models improving through better hyperparameter tuning and through feature engineering, which I haven't performed and would like to do as part of the next iteration.
For example, one feature could indicate whether TotalCharges equals tenure * MonthlyCharges: if TotalCharges is higher, it might indicate purchases over the plan's limit, which in my experience is a very negative experience.
Another would be interactions between combinations of services, for example a variable indicating whether a customer has Fiber optic and Multiple lines at the same time.
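
A minimal sketch of those two features on made-up rows. The input column names follow the dataset, while the new columns (`ExtraCharges`, `PaidOverExpected`, `FiberAndMultipleLines`) are hypothetical names chosen for illustration.

```python
import pandas as pd

# Made-up rows using the dataset's column names
toy = pd.DataFrame({
    "tenure": [10, 24, 5],
    "MonthlyCharges": [50.0, 80.0, 20.0],
    "TotalCharges": [500.0, 2100.0, 100.0],
    "InternetService": ["DSL", "Fiber optic", "No"],
    "MultipleLines": ["No", "Yes", "No"],
})

# Charges beyond what tenure * MonthlyCharges would explain
expected_total = toy["tenure"] * toy["MonthlyCharges"]
toy["ExtraCharges"] = toy["TotalCharges"] - expected_total
toy["PaidOverExpected"] = (toy["ExtraCharges"] > 0).astype(int)

# Interaction flag: fiber optic internet together with multiple lines
toy["FiberAndMultipleLines"] = (
    (toy["InternetService"] == "Fiber optic") & (toy["MultipleLines"] == "Yes")
).astype(int)

print(toy[["ExtraCharges", "PaidOverExpected", "FiberAndMultipleLines"]])
```

On the real data a difference could also come from price changes during the contract, so a tolerance instead of a strict `> 0` comparison may be more appropriate.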

# References 

During the completion of this notebook I have used parts of code from these Kaggle notebooks:

https://www.kaggle.com/bandiatindra/telecom-churn-prediction

https://www.kaggle.com/nicholasgah/churn-prediction-model-and-cap-curve