# Churn Prediction

This notebook performs exploratory data analysis (EDA) on the Telco Customer Churn dataset and then trains several models to predict churn.

### Importing the data

In [None]:

import pandas as pd
import numpy as np
import seaborn as sns

data = pd.read_csv("../input/telco-customer-churn/WA_Fn-UseC_-Telco-Customer-Churn.csv")
data.head()


### Understanding the data

Viewing the columns

In [None]:
data.columns

Check for NULL values

In [None]:
data.info()

In [None]:
# TotalCharges is read as strings; missing values are stored as a single space
blank_mask = data.TotalCharges == " "
indices = list(data.index[blank_mask])
count_null = int(blank_mask.sum())

print(count_null)

In [None]:
print(100*count_null/len(data))

Since the percentage of missing data is so low, we can simply drop those rows

In [None]:
# Drop the rows with blank TotalCharges and rebuild a clean 0..n-1 index
data = data.drop(indices,axis=0).reset_index(drop=True)
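As a side note, the same cleanup can be sketched with `pd.to_numeric`, which also converts the column to floats in the same step. The small `toy` frame below is made up for illustration and only mimics the blank-string problem seen in `TotalCharges`.

```python
import pandas as pd

# Toy frame standing in for the real data (values are made up)
toy = pd.DataFrame({
    "tenure": [1, 0, 34, 0],
    "TotalCharges": ["29.85", " ", "1889.5", " "],
})

# errors="coerce" turns anything that cannot be parsed (here the blank strings) into NaN
toy["TotalCharges"] = pd.to_numeric(toy["TotalCharges"], errors="coerce")

n_missing = int(toy["TotalCharges"].isna().sum())
pct_missing = 100 * n_missing / len(toy)

# Drop the missing rows and reset the index in one step
clean = toy.dropna(subset=["TotalCharges"]).reset_index(drop=True)
print(n_missing, pct_missing, clean.shape)
```

With this approach the later `astype("float")` conversion would no longer be needed, since the column is already numeric.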

**Categorical**<br>
gender, SeniorCitizen, Partner, Dependents, MultipleLines, InternetService, OnlineSecurity, OnlineBackup, DeviceProtection, TechSupport, StreamingTV, StreamingMovies, Contract, PaperlessBilling, PaymentMethod

**Continuous**<br>
tenure, MonthlyCharges, TotalCharges

### Feature Description ####
customerID: Customer ID<br>
gender: Customer gender (female, male)<br>
SeniorCitizen: Whether the customer is a senior citizen or not (1, 0)<br>
Partner: Whether the customer has a partner or not (Yes, No)<br>
Dependents: Whether the customer has dependents or not (Yes, No)<br>
tenure: Number of months the customer has stayed with the company<br>
PhoneService: Whether the customer has a phone service or not (Yes, No)<br>
MultipleLines: Whether the customer has multiple lines or not (Yes, No, No phone service)<br>
InternetService: Customer’s internet service provider (DSL, Fiber optic, No)<br>
OnlineSecurity: Whether the customer has online security or not (Yes, No, No internet service)<br>
OnlineBackup: Whether the customer has online backup or not (Yes, No, No internet service)<br>
DeviceProtection: Whether the customer has device protection or not (Yes, No, No internet service)<br>
TechSupport: Whether the customer has tech support or not (Yes, No, No internet service)<br>
StreamingTV: Whether the customer has streaming TV or not (Yes, No, No internet service)<br>
StreamingMovies: Whether the customer has streaming movies or not (Yes, No, No internet service)<br>
Contract: The contract term of the customer (Month-to-month, One year, Two year)<br>
PaperlessBilling: Whether the customer has paperless billing or not (Yes, No)<br>
PaymentMethod: The customer’s payment method (Electronic check, Mailed check, Bank transfer (automatic), Credit card (automatic))<br>
MonthlyCharges: The amount charged to the customer monthly<br>
TotalCharges: The total amount charged to the customer<br>
Churn: Whether the customer churned or not (Yes or No)<br>

### Data Manipulation

From the feature description above, two things are clear:<br>
**A.** MultipleLines depends on whether the customer has phone service at all, so the value "No phone service" can be replaced by "No".<br>
**B.** Similarly, OnlineSecurity, OnlineBackup, TechSupport, DeviceProtection, StreamingTV and StreamingMovies depend on InternetService, so "No internet service" can be replaced by "No" in all of them.


In [None]:
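def changeService(data,original_var="No phone service",feature_list=("MultipleLines",)):
    # Replace the redundant category with "No" in each listed column.
    # .loc with a boolean mask writes into the frame itself and avoids chained assignment.
    for feature in feature_list:
        data.loc[data[feature]==original_var,feature] = "No"
    return data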

In [None]:
df = changeService(data)

In [None]:
feature_list = ["OnlineSecurity","OnlineBackup","DeviceProtection","TechSupport","StreamingTV","StreamingMovies"]
df = changeService(df,original_var="No internet service",feature_list=feature_list)
df.head()

In [None]:
# Pass the raw column: countplot does the counting itself
sns.countplot(x=df.StreamingTV)

### Data Visualization

We have certain categorical features and certain continuous features. Let us view them separately

In [None]:
## See distribution of target variable ###
ac = sns.countplot(x=df.Churn)
for p in ac.patches:
    height = p.get_height()
    ac.text(p.get_x()+p.get_width()/2.,
            height + 3,
            '{:1.2f}'.format(height/len(df)),
            ha="center")

## Conclusion
Almost 73% of the customers are labelled "No", so the dataset is imbalanced.<br>
Oversampling methods such as SMOTE were tried, but they appeared to hurt accuracy, so they were not used.
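To make the imbalance and the idea of oversampling concrete, here is a minimal sketch on made-up labels with the same 73/27 split. It uses plain random oversampling with pandas, which is a simpler technique than SMOTE and is not what was tried in this notebook; the `toy` frame and its values are assumptions for illustration.

```python
import pandas as pd

# Hypothetical labels with roughly the 73/27 split described above
labels = pd.Series(["No"] * 73 + ["Yes"] * 27, name="Churn")
toy = pd.DataFrame({"Churn": labels, "tenure": range(100)})

# Class proportions: the same numbers the countplot annotations show
proportions = toy["Churn"].value_counts(normalize=True)
print(proportions)

# Naive random oversampling: resample the minority class with replacement
majority = toy[toy["Churn"] == "No"]
minority = toy[toy["Churn"] == "Yes"]
minority_up = minority.sample(n=len(majority), replace=True, random_state=42)
balanced = pd.concat([majority, minority_up]).reset_index(drop=True)
print(balanced["Churn"].value_counts())
```

Any resampling of this kind should only be applied to the training split, otherwise duplicated rows leak into the test data.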

In [None]:
### Countplots for categorical features ###
import seaborn as sns
import matplotlib.pyplot as plt
fig,ax = plt.subplots(5,3,figsize=(20,20))
sns.set_style("dark")
categorical = ["gender","SeniorCitizen","Partner","Dependents","MultipleLines","InternetService",\
               "OnlineSecurity","OnlineBackup","DeviceProtection","TechSupport","StreamingTV","StreamingMovies",\
               "Contract","PaperlessBilling","PaymentMethod","PhoneService","Churn"]
k = 0
for i in range(5):
    for j in range(3):
        ac = sns.countplot(x=df[categorical[k]],ax=ax[i][j])
        for p in ac.patches:
            height = p.get_height()
            ac.text(p.get_x()+p.get_width()/2.,
                    height + 3,
                    '{:1.2f}'.format(height/len(df)),
                    ha="center") 
        k+=1

## Conclusions:
1. The sample consists mostly of non-senior customers, since senior citizens make up just 16% of it
2. Most people in the sample do not have dependents
3. Most people do not have tech support

In [None]:
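## Visualize probability distribution of continuous variables
cont = ["tenure","MonthlyCharges","TotalCharges"]
fig,ax = plt.subplots(1,3,figsize=(20,10))
sns.set_style("dark")
for i in range(3):
    # distplot is deprecated in seaborn; histplot with kde=True draws a histogram plus density curve.
    # TotalCharges is still stored as strings here, so convert to numbers first.
    sns.histplot(pd.to_numeric(df[cont[i]]),kde=True,stat="density",ax=ax[i])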


## Conclusions
1. The distributions of tenure and MonthlyCharges both appear to be multimodal
2. TotalCharges is right-skewed: most data points gather around the mode, which is near 0

In [None]:
### Visualize cdf ###
fig,ax = plt.subplots(1,3,figsize=(20,10))
sns.set_style("dark")
for i in range(3):
    x = pd.to_numeric(df[cont[i]])
    # Cumulative histogram plus the empirical CDF (replaces the deprecated distplot call)
    sns.histplot(x,stat="probability",cumulative=True,ax=ax[i])
    sns.ecdfplot(x,ax=ax[i])




## Conclusions
1. Around 70% of the customers in the sample have a tenure of less than 60 months
2. Around 70% of the customers pay less than 100 units per month
3. More than 80% of the customers have total charges below 6000
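The percentages above are read off the CDF plots by eye. A minimal sketch of how the same kind of number can be computed directly is shown below; the `tenure` array is made up, and `ecdf_at` is a hypothetical helper name, not something defined elsewhere in this notebook.

```python
import numpy as np

# Made-up tenure values in months, for illustration only
tenure = np.array([1, 5, 12, 24, 30, 45, 58, 61, 70, 72])

def ecdf_at(values, threshold):
    """Fraction of observations strictly below `threshold`."""
    values = np.asarray(values, dtype=float)
    return float(np.mean(values < threshold))

# Share of customers with tenure < 60 months
print(ecdf_at(tenure, 60))
```

On the real data the equivalent call would be something like `ecdf_at(df.tenure, 60)`.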

In [None]:
## Visualize categorical variables per churn label ###
fig,ax = plt.subplots(5,3,figsize=(20,20))
sns.set_style("dark")
k = 0
for i in range(5):
    for j in range(3):
        sns.countplot(data = df,x="Churn",hue=categorical[k],ax=ax[i][j])
        k+=1


In [None]:
### Visualize reverse of above plot ###
fig,ax = plt.subplots(5,3,figsize=(20,20))
sns.set_style("dark")
k = 0
for i in range(5):
    for j in range(3):
        ac = sns.countplot(data = df,x=categorical[k],hue="Churn",ax=ax[i][j])
         
        k+=1


## Conclusion
1. Number of no's are more than the number of yes's in every category
2. Very few of the customers without internet service are churners, suggesting that internet service is a major factor in retaining customers.
3. Customers with a two year contract have a very low number of churners
4. People who pay via electronic check like the service more.
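Because the countplots show raw counts, categories of different sizes are hard to compare. A small sketch of how churn rates per category could be computed instead is given below, using `pd.crosstab` on made-up rows (the `toy` frame is an assumption, not the real data).

```python
import pandas as pd

# Made-up rows, for illustration only
toy = pd.DataFrame({
    "Contract": ["Month-to-month"] * 6 + ["Two year"] * 4,
    "Churn":    ["Yes", "Yes", "Yes", "No", "No", "No", "No", "No", "No", "Yes"],
})

# normalize="index" makes each row sum to 1, giving the churn rate within each category
rates = pd.crosstab(toy["Contract"], toy["Churn"], normalize="index")
print(rates)
```

The "Yes" column of `rates` is the churn rate for each contract type, which is usually easier to interpret than the bar heights alone.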

In [None]:
### Do the same for the continuous variable distributions ###


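def Viz(df,feat = "tenure"):
    # Compare the density of one continuous feature for churners vs non-churners
    values = pd.to_numeric(df[feat])  # TotalCharges is still stored as strings at this point
    feat_yes = values[df.Churn == "Yes"]
    feat_no = values[df.Churn == "No"]

    sns.kdeplot(feat_yes,label=feat+"_yes")
    sns.kdeplot(feat_no,label=feat+"_no")
    plt.xlabel(feat)
    plt.legend()  # without this call the labels above are not displayed
    plt.show()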



In [None]:
#fig,ax = plt.subplots(1,3,figsize=(20,20))
    
Viz(df,feat=cont[0])
Viz(df,feat=cont[1])
Viz(df,feat=cont[2])



## Conclusion
1. The first plot clearly shows that churners are concentrated at short tenures, while the distribution for non-churners is much less clear-cut
2. The second plot is less clear, but it suggests that churners tend to have high monthly charges while non-churners more often pay a low amount. This hints that customers paying for additional services, such as streaming TV or movies, are not necessarily happier with the service
3. The third plot does not give much information, since it suggests that both churners and non-churners mostly have low total charges.

In [None]:
## Comparative visualization of CDF for the following ###
def VizCDF(df,feat="tenure"):
    # Overlay the cumulative distribution of one feature for churners vs non-churners
    values = pd.to_numeric(df[feat])
    feat_yes = values[df.Churn == "Yes"]
    feat_no = values[df.Churn == "No"]

    # ecdfplot replaces the deprecated distplot(..., kde_kws={"cumulative": True})
    sns.ecdfplot(feat_yes,label=feat+"_yes")
    sns.ecdfplot(feat_no,label=feat+"_no")
    plt.legend()
    plt.show()

VizCDF(df,feat=cont[0])
VizCDF(df,feat=cont[1])
VizCDF(df,feat=cont[2])


## Conclusion
1. In the first CDF, read both curves at a fixed tenure, say x = 60 months. The non-churner curve is at about 0.6, while the churner curve is at about 0.82. So about 60% of non-churners have a tenure below 60 months, compared with more than 80% of churners. This suggests that longer tenure goes with fewer churners, which supports the KDE plots above.
2. Similarly, in the second plot a comparable gap between the two curves appears around a monthly charge of 80 units.
3. Finally, in the total charges CDF a similar gap appears near the middle of the range: at a given total charge, a larger share of churners than non-churners have paid less than that amount.
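The comparison in point 1 can also be computed rather than read off the plot. The sketch below does this on a made-up `toy` frame (values chosen for illustration, not taken from the dataset): for each churn class it computes the share of customers whose tenure is below a threshold.

```python
import pandas as pd

# Made-up rows, for illustration only
toy = pd.DataFrame({
    "Churn":  ["Yes"] * 5 + ["No"] * 5,
    "tenure": [2, 5, 10, 30, 65, 12, 40, 61, 68, 72],
})

threshold = 60  # months
# Boolean "below threshold" flag, averaged within each churn class
share_below = (toy["tenure"] < threshold).groupby(toy["Churn"]).mean()
print(share_below)
```

A larger share for "Yes" than for "No" is the same gap between the two CDF curves that the conclusions describe.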

In [None]:
### Boxplot to find outliers###
#sns.boxplot(df.tenure)
fig,ax = plt.subplots(3,2,figsize=(10,10))
df["TotalCharges"] = df["TotalCharges"].astype("float")
sns.violinplot(x=df.Churn,y=df.tenure,ax=ax[0][0])
sns.violinplot(x=df.Churn,y=df.MonthlyCharges,ax=ax[1][0])
sns.violinplot(x=df.Churn,y=df.TotalCharges,ax=ax[2][0])


sns.boxplot(x=df.Churn,y=df.tenure,ax=ax[0][1])
sns.boxplot(x=df.Churn,y=df.MonthlyCharges,ax=ax[1][1])
sns.boxplot(x=df.Churn,y=df.TotalCharges,ax=ax[2][1])


In [None]:
## Convert Outliers to mean ###

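# Assign through .loc: df[df.tenure>65].tenure = ... only changes a temporary copy
df["tenure"] = df["tenure"].astype(float)
df.loc[df.tenure>65,"tenure"] = df.tenure.mean()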


In [None]:
charge_yes = df[df.Churn=="Yes"].TotalCharges.copy()  # explicit copy, written back to df in the next cell
charge_yes[charge_yes>5000] = charge_yes.mean()


In [None]:
charge_no = df[df.Churn=="No"].TotalCharges
df["TotalCharges"] = pd.concat([charge_yes,charge_no])

## Conclusion
1. There are potential outliers within each class, but when the data is viewed as a whole no outliers appear<br>
2. The difference between the two class distributions can be seen once again
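The first conclusion (outliers within a class but not overall) can be illustrated with the usual 1.5 × IQR boxplot rule. The sketch below uses made-up values, and `iqr_outliers` is a hypothetical helper name; the thresholds of 65 and 5000 used in the cells above were chosen by eye and are not derived this way.

```python
import pandas as pd

# Made-up values: one large charge among the churners
toy = pd.DataFrame({
    "Churn": ["Yes"] * 6 + ["No"] * 6,
    "TotalCharges": [50, 80, 100, 120, 150, 5000, 500, 1500, 2500, 3500, 4500, 5000],
})

def iqr_outliers(s, k=1.5):
    """Boolean mask of values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    return (s < q1 - k * iqr) | (s > q3 + k * iqr)

# Outliers relative to the whole column
overall = iqr_outliers(toy["TotalCharges"])

# Outliers relative to each churn class separately
per_class = pd.concat(
    [iqr_outliers(group) for _, group in toy.groupby("Churn")["TotalCharges"]]
).sort_index()

print(overall.sum(), per_class.sum())
```

Here the value 5000 is flagged only when churners are considered on their own, which mirrors what the per-class boxplots show.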

In [None]:
### Label encode to convert strings to integers###
from sklearn.preprocessing import LabelEncoder
lec = LabelEncoder()
dct = {}
classes = []
for col in df.columns:
    if col in categorical:
        dct[col] = list(lec.fit_transform(df[col]))
        #print(lec.classes_)
        if len(lec.classes_)>2:
        
            classes.append(lec.classes_)
    else:
        dct[col] = list(df[col].values)



In [None]:
## Visualize correlation heatmap ###
test = pd.DataFrame(dct)
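fig,ax = plt.subplots(figsize=(15,15))
# customerID is a string column, so restrict the correlation to numeric columns
sns.heatmap(test.corr(numeric_only=True),annot=True,ax=ax)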
plt.show()

Some features have more than 2 categories hence should be one hot encoded

In [None]:
### One hot encode variables with more than 2 categories ###
from sklearn.preprocessing import OneHotEncoder
onehot = OneHotEncoder()

## first column ###
features = ["InternetService","Contract","PaymentMethod"]
d = onehot.fit_transform(test[features[0]].values.reshape(-1,1)) 
onehot_df = pd.DataFrame(d.todense())
onehot_df.columns = ["DSL","Fibre Optic","No Internet service"]

## other 2 columns ##
for i in range(1,len(features)):
    d = onehot.fit_transform(test[features[i]].values.reshape(-1,1))
    temp = pd.DataFrame(d.toarray())
    # classes[i] holds the original category names of features[i], in encoder order
    temp.columns = list(classes[i])
    onehot_df = pd.concat([onehot_df,temp],axis=1)
    
for feat in features:
    test = test.drop(feat,axis=1)
test = pd.concat([test,onehot_df],axis=1)
test.head()

In [None]:
## View Columns ##
test.columns
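For comparison, a shorter way to one hot encode a column is `pd.get_dummies`. This is an alternative sketch on a made-up `toy` frame, not the approach used above; it has the advantage that the new column names are built from the category names automatically.

```python
import pandas as pd

# Made-up rows, for illustration only
toy = pd.DataFrame({
    "Contract": ["Month-to-month", "One year", "Two year", "One year"],
    "tenure": [1, 12, 48, 20],
})

# One 0/1 column per category, named "<prefix>_<category>"
encoded = pd.get_dummies(toy, columns=["Contract"], prefix="Contract", dtype=int)
print(encoded.columns.tolist())
```

Each row has exactly one of the three `Contract_*` columns set to 1.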

## Conclusion
We see a lot of correlations, mainly:<br>
a. The correlation between tenure and TotalCharges is very high. This is intuitive: the more months someone has been a customer, the more they have paid in total.<br>
b. StreamingTV and StreamingMovies have a moderately high correlation with TotalCharges since they increase charges. They have a similar effect on MonthlyCharges.<br>
c. MonthlyCharges and TotalCharges are highly correlated, since TotalCharges is essentially the monthly charge accumulated over the tenure.<br>
d. Churn and tenure have a negative correlation, as our previous analysis already suggested.<br>
e. Dependents and Churn have a moderate negative correlation, which suggests that customers with dependents are less likely to churn.<br>
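Points a and c follow almost mechanically if total charges are roughly tenure times monthly charges. The sketch below checks this on synthetic data (random values, not the real dataset), where tenure and monthly charges are drawn independently and the total is their product.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 2000

# Synthetic values, not the real dataset
sim = pd.DataFrame({
    "tenure": rng.integers(1, 73, size=n),
    "MonthlyCharges": rng.uniform(20, 120, size=n),
})
# Assumption for this sketch: total is exactly tenure x monthly charge
sim["TotalCharges"] = sim["tenure"] * sim["MonthlyCharges"]

corr = sim.corr()
print(corr.round(2))
```

Even though tenure and monthly charges are independent here, the total is strongly correlated with both, which is why it behaves like a partly redundant feature in the heatmap.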


### Implement Machine Learning Models

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier,GradientBoostingClassifier,VotingClassifier
from sklearn.tree import DecisionTreeClassifier
from xgboost import XGBClassifier
from sklearn.metrics import classification_report
from sklearn.feature_selection import SelectKBest,chi2
from imblearn.over_sampling import RandomOverSampler,SMOTE

In [None]:
## Separate data and labels ###
Y = test.Churn
X = test.drop("Churn",axis=1)
X = X.drop("customerID",axis=1)
## Drop total charges since its a redundant feature ###
#X = X.drop("TotalCharges",axis=1)

### Data Manipulation after viz

In [None]:
## Split train and test data
X_train,X_test,Y_train,Y_test = train_test_split(X,Y,random_state=42,test_size=0.43)

In [None]:
## Set up machine learning models###

## Logistic Regression ###
clf_1 = LogisticRegression(random_state=42,max_iter=500)
clf_1.fit(X_train,Y_train)
pred = clf_1.predict(X_test)

## Random Forest ###
clf_forest = RandomForestClassifier(n_estimators=590)
clf_forest.fit(X_train,Y_train)
pred_forest = clf_forest.predict(X_test)

## Decision Tree ###
clf_tree = DecisionTreeClassifier(min_samples_split=5)
clf_tree.fit(X_train,Y_train)
pred_tree = clf_tree.predict(X_test)

## XGB Classifier ###
clf_xgb = XGBClassifier()
clf_xgb.fit(X_train,Y_train)
pred_xgb = clf_xgb.predict(X_test)


## GradientBoosting Classifier ##
clf_gb = GradientBoostingClassifier(n_estimators=100, learning_rate=0.65,\
                                 max_depth=1, random_state=0)
clf_gb.fit(X_train, Y_train)
pred = clf_gb.predict(X_test)

## Voting Classifier ##
clf_vote = VotingClassifier(estimators=[('lr', clf_1), ('xgb', clf_xgb),("gb",clf_gb)],
                         voting='soft')
clf_vote.fit(X_train,Y_train)
pred_vote = clf_vote.predict(X_test)

In [None]:
## Set up neural network ###
from keras.models import Sequential
from keras.layers import Dense
from keras import Input
from keras.optimizers import Adam

model = Sequential()
model.add(Dense(128,activation="relu",input_shape=(26,),use_bias=True))
model.add(Dense(128,activation="relu",use_bias=True))

model.add(Dense(32,activation="relu",use_bias=True))
model.add(Dense(32,activation="relu",use_bias=True))

model.add(Dense(1,activation="sigmoid"))

In [None]:
## Set up opt and loss ###
opt = Adam(learning_rate=1e-5)
model.compile(optimizer=opt,metrics=["accuracy"],loss="binary_crossentropy")

In [None]:
model.summary()

### Model Evaluation

In [None]:
## Generate classification report ###
# `pred` was assigned twice above (logistic regression, then gradient boosting),
# so predict again here to make sure each report belongs to the named model.
print("======== Logistic Regression ========")
print(classification_report(Y_test,clf_1.predict(X_test)))
print("======= Random Forest ======")
print(classification_report(Y_test,pred_forest))
print("==== Decision tree ======")
print(classification_report(Y_test,pred_tree))
print("========= XGB =========")
print(classification_report(Y_test,pred_xgb))
print("=========GradientBoosting======")
print(classification_report(Y_test,clf_gb.predict(X_test)))
print("=========Voting======")
print(classification_report(Y_test,pred_vote))



In [None]:
### Training Loop ####
history = model.fit(x=X_train,y=Y_train,batch_size=32,epochs=200,validation_data=(X_test,Y_test))

After comparing all models, XGBoost appears to be the best model for this problem. Its accuracy is equal to that of logistic regression, but it does slightly better on the other metrics. It is great to see simpler algorithms performing better than the neural network
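Since the classes are imbalanced, it helps to keep in mind why accuracy alone is not enough when comparing models. The sketch below uses hypothetical labels with a 73/27 split and a "model" that always predicts the majority class; none of these numbers come from the models above.

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score, f1_score

# Hypothetical test labels with a 73/27 split (1 = churn)
y_true = np.array([0] * 73 + [1] * 27)
# A baseline that always predicts "no churn"
y_pred = np.zeros_like(y_true)

acc = accuracy_score(y_true, y_pred)
rec = recall_score(y_true, y_pred, zero_division=0)
f1 = f1_score(y_true, y_pred, zero_division=0)
print(acc, rec, f1)
```

The baseline reaches 73% accuracy without finding a single churner, so recall and F1 on the "Yes" class in the classification reports are the more informative numbers.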

In [None]:
fig,ax = plt.subplots(figsize=(8,8))
plt.plot(history.history["loss"],label = "Train Loss")
plt.plot(history.history["val_loss"],label="Val Loss")
plt.legend(['train', 'val'], loc='upper left')
plt.xlabel("epoch")
plt.ylabel("Loss")
plt.title("Training History")
plt.show()

### Interpretation of results

We look at the feature importances of our two best models, XGBoost and logistic regression

Weight importance of Logistic Regression

In [None]:
## See the logistic regression weights ###
clf_1.coef_[0]

In [None]:
## Plot feature weights of logistic regression ###
import plotly.express as px
fig = px.bar(x=X.columns,y=clf_1.coef_[0],template="ggplot2",title="Logistic Regression weight visualization")
fig.update_xaxes(title="Features")
fig.update_yaxes(title="Weight")
fig.show()
## Please hover over the plot to get value

## Insights
1. Logistic regression puts the most weight on the feature InternetService2, which is the encoded name for fibre optic. We saw in our countplot that most churners had fibre optic service, which suggests that it is an important factor.
2. The last feature, InternetService3 (no internet service), has a highly negative weight, which suggests that people without internet service are more likely to be non-churners.
3. Contract3 (a two-year contract) also has a highly negative weight, which supports our original hypothesis that longer contracts mean less churn, while month-to-month contracts have a larger weight.
4. Weights for StreamingTV and StreamingMovies are moderately high, which indicates that those features do affect churn to some degree.
5. As seen in the bar plot, a fairly high negative weight is assigned to PhoneService and OnlineSecurity, which look like good areas to improve.
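One way to make such weights easier to read is to turn them into odds ratios with `exp(weight)`: a value above 1 means the feature pushes towards churn, below 1 away from it. The sketch below does this on synthetic data from `make_classification`; the standardisation step is an assumption added here so that weights are comparable across features, whereas the model above was fitted on unscaled columns.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Synthetic data, for illustration only
X_toy, y_toy = make_classification(
    n_samples=500, n_features=4, n_informative=2, n_redundant=0, random_state=0
)

# Put all features on the same scale so their weights can be compared
X_scaled = StandardScaler().fit_transform(X_toy)
clf = LogisticRegression(max_iter=500).fit(X_scaled, y_toy)

coefs = clf.coef_[0]
odds_ratios = np.exp(coefs)  # multiplicative change in odds per one standard deviation

# List features from largest to smallest absolute weight
order = np.argsort(-np.abs(coefs))
for idx in order:
    print(f"feature_{idx}: weight={coefs[idx]:+.3f}, odds ratio={odds_ratios[idx]:.3f}")
```

Without scaling, weights on columns with large ranges such as tenure or TotalCharges are not directly comparable with weights on 0/1 columns.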


Weight Importance of XGBoost

In [None]:
## See the xgb weights ##
clf_xgb.feature_importances_

In [None]:
## Plot feature importances of XGB
import plotly.express as px
fig = px.bar(x=X.columns,y=clf_xgb.feature_importances_,template="ggplot2")
fig.update_xaxes(title="Features")
fig.update_yaxes(title="Weight")
fig.show()


The XGB classifier leads to similar conclusions. The only difference is that its most important feature is the month-to-month contract rather than internet service.<br>
Since both models perform similarly well, and so does GradientBoostingClassifier, our final classifier is an ensemble of all three in order to combine the strengths of each algorithm. That takes our final prediction accuracy to 0.83
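For intuition, soft voting simply averages the predicted class probabilities of the individual models and picks the class with the highest average. The sketch below shows this with hypothetical probabilities for three customers; the numbers are made up and are not outputs of the classifiers above.

```python
import numpy as np

# Hypothetical [P(no churn), P(churn)] from three models for three customers
proba_lr  = np.array([[0.70, 0.30], [0.40, 0.60], [0.55, 0.45]])
proba_xgb = np.array([[0.80, 0.20], [0.35, 0.65], [0.40, 0.60]])
proba_gb  = np.array([[0.60, 0.40], [0.50, 0.50], [0.45, 0.55]])

# Unweighted average of the three probability tables
avg = np.mean([proba_lr, proba_xgb, proba_gb], axis=0)
# Predicted class = column with the highest averaged probability
soft_vote = avg.argmax(axis=1)
print(avg.round(3), soft_vote)
```

For the third customer logistic regression alone would predict "no churn", but the averaged probabilities tip the ensemble towards "churn".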

### Final Conclusions
Answering some questions here

#### Q1. How did we know which features to eliminate?<br>
a. Initially we removed redundancy by eliminating the extra categories in those features that depend on internet service being present.<br><br>
b. Next we explored the data and found through the heatmap that TotalCharges and MonthlyCharges were highly collinear. This made sense, since TotalCharges divided by tenure roughly gives the monthly charges. Hence we considered eliminating TotalCharges. But on running the models in both cases, we found that the models performed better when TotalCharges was kept as a feature and its outliers were replaced by the mean<br><br>
c. After that, an effort was made to select relevant features using statistical techniques, but it turned out that most features provided relevant information about the target<br><br>
d. Using boxplots, we found outliers within each class in the TotalCharges and tenure features. By replacing these outliers with their respective means, we were able to increase the accuracy of the classifiers to 0.83 (Logistic Regression)

#### Q3. What are the key factors in predicting Churn?<br>
As described in the analysis above: tenure, PhoneService, InternetService, Contract and Dependents


#### Q4. What offers should be made to which customers to encourage them to remain with the company?<br>
a. The company should focus on promoting its internet service, since people who have fibre optic are more likely to churn than people who have no internet service. The company could offer fibre optic at lower cost to help people see how good the service is.<br>
b. Contract: The company should either give people on month-to-month contracts incentives to move to one- or two-year contracts, or find ways to keep its month-to-month users engaged. Offers like bundled Netflix or Amazon subscriptions may help boost both internet services and contract services.<br>
c. PhoneService: The company must take steps to improve its phone service, since the current one clearly does not work well for customers.<br>
d. PaymentOptions: The company can partner with e-payment gateways to give people incentives to pay online. Offers like "Save 15% by paying online" may work.<br>
e. Improve services in key areas like online security and tech support<br>

#### Q5. Assuming these actions were implemented, how would you determine whether they had worked?
Keep collecting data over a period of time and repeat the same analysis on the new data to measure the effect of the actions, and suggest countermeasures if the results do not match what the analysis predicted.
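
One simple way to make "repeat the analysis" concrete is to compare the churn rate before and after the actions with a two-proportion z-test. The sketch below uses hypothetical counts, and `two_proportion_z` is a helper written here for illustration. A before/after comparison can also be affected by seasonality or other changes, so a control group that did not receive the offers would give a cleaner answer.

```python
import math

def two_proportion_z(churned_a, n_a, churned_b, n_b):
    """z statistic and two-sided p-value for the difference between two churn rates."""
    p_a, p_b = churned_a / n_a, churned_b / n_b
    pooled = (churned_a + churned_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    # Two-sided p-value under the standard normal distribution
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

# Hypothetical counts: 270 of 1000 churned before the actions, 220 of 1000 after
z, p_value = two_proportion_z(270, 1000, 220, 1000)
print(round(z, 2), round(p_value, 4))
```

A small p-value here would indicate that the drop in churn rate is unlikely to be due to chance alone, under the assumptions of the test.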