# Telco Customer Churn
#### Focused Customer Retention Programs

Source: https://www.kaggle.com/blastchar/telco-customer-churn

If you enjoy this Kernel and/or have found it helpful, I would be very grateful for an upvote :)

### Goal

This dataset comes from a telecommunications company. The idea behind this analysis is to find patterns in the customer churn data and indicators of what drives customers to churn. At the end, a machine learning model will be built to predict whether or not a customer is likely to churn.

### About This File

- Each row represents a customer, and each column contains one of the customer’s attributes, as described in the Columns section below.

- The raw data contains 7043 rows (customers) and 21 columns (features).

- The “Churn” column is our target.

### Columns

- **customerID** : Customer ID
- **gender** : Customer gender (female, male)
- **SeniorCitizen** : Whether the customer is a senior citizen or not (1, 0)
- **Partner** : Whether the customer has a partner or not (Yes, No)
- **Dependents** : Whether the customer has dependents or not (Yes, No)
- **tenure** : Number of months the customer has stayed with the company
- **PhoneService** : Whether the customer has a phone service or not (Yes, No)
- **MultipleLines** : Whether the customer has multiple lines or not (Yes, No, No phone service)
- **InternetService** : Customer’s internet service provider (DSL, Fiber optic, No)
- **OnlineSecurity** : Whether the customer has online security or not (Yes, No, No internet service)
- **OnlineBackup** : Whether the customer has online backup or not (Yes, No, No internet service)
- **DeviceProtection** : Whether the customer has device protection or not (Yes, No, No internet service)
- **TechSupport** : Whether the customer has tech support or not (Yes, No, No internet service)
- **StreamingTV** : Whether the customer has streaming TV or not (Yes, No, No internet service)
- **StreamingMovies** : Whether the customer has streaming movies or not (Yes, No, No internet service)
- **Contract** : The contract term of the customer (Month-to-month, One year, Two year)
- **PaperlessBilling** : Whether the customer has paperless billing or not (Yes, No)
- **PaymentMethod** : The customer’s payment method (Electronic check, Mailed check, Bank transfer (automatic), Credit card (automatic))
- **MonthlyCharges** : The amount charged to the customer monthly
- **TotalCharges** : The total amount charged to the customer
- **Churn** : Whether the customer churned or not (Yes or No)

***

# Importing Libraries

Pandas, NumPy, Seaborn, and Matplotlib will be used for this analysis.

In [None]:
import os
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

# Warnings
import warnings
warnings.filterwarnings('ignore')

sns.set(style='darkgrid')
plt.rcParams["patch.force_edgecolor"] = True

***

# Data Exploration

In [None]:
df = pd.read_csv(r"../input/WA_Fn-UseC_-Telco-Customer-Churn.csv")

In [None]:
# First 5 Rows:
df.head(5)

In [None]:
print(df.info())

In [None]:
# Check for Null Values:
df.isnull().any()
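One caveat with the check above: `isnull()` only flags true missing values (NaN/None). A column stored as text can hold blank strings that are not counted as null. A minimal sketch on a small made-up frame (the `toy` data is hypothetical, not taken from the dataset):

```python
import pandas as pd

# Hypothetical toy frame: one TotalCharges entry is a blank string, not NaN.
toy = pd.DataFrame({"TotalCharges": ["29.85", " ", "108.15"]})

# isnull() does not flag the blank string.
nulls_before = int(toy["TotalCharges"].isnull().sum())

# Coercing to numeric turns unparseable entries into NaN, which isnull() does flag.
as_number = pd.to_numeric(toy["TotalCharges"], errors="coerce")
nulls_after = int(as_number.isnull().sum())

print(nulls_before, nulls_after)
```

This is why a text column can look complete in the null check and still need cleaning before it is used as a number.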

***

# Data Engineering

In [None]:
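# TotalCharges is read in as text. Coercing it to numeric turns the
# unparseable (blank) entries into NaN, which are then filled in.
# Note: the fill values are hard-coded, so the list length must match
# the number of NaN rows (11) or the assignment fails.

df['TotalCharges_new'] = pd.to_numeric(df.TotalCharges, errors='coerce')
TotalCharges_Missing = [488, 753, 936, 1082, 1340, 3331, 3826, 4380, 5218, 6670, 6754]
df.loc[pd.isnull(df.TotalCharges_new), 'TotalCharges_new'] = TotalCharges_Missing
df.TotalCharges = df.TotalCharges_new
df.drop('TotalCharges_new', axis=1, inplace=True)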


In [None]:
# Converting 'TotalCharges' column from object type to float type
# (convert_objects was removed from pandas; pd.to_numeric is its replacement)

df['TotalCharges'] = pd.to_numeric(df['TotalCharges'], errors='coerce')

In [None]:
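# Adding a feature for the total number of services.
# Note: InternetService takes the values DSL / Fiber optic / No, so it never
# equals 'Yes' and does not contribute to this count.

service_cols = ['PhoneService', 'InternetService', 'OnlineSecurity', 'OnlineBackup',
                'DeviceProtection', 'TechSupport', 'StreamingTV', 'StreamingMovies']
df['TotalServices'] = (df[service_cols] == 'Yes').sum(axis=1)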
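Because `InternetService` is recorded as DSL / Fiber optic / No rather than Yes / No, a comparison against `'Yes'` never matches that column. If one wanted an internet subscription to count as a service, a possible variant (a sketch on hypothetical toy rows, not the method used in this kernel) could look like this:

```python
import pandas as pd

# Hypothetical toy rows using the same category labels as the dataset.
toy = pd.DataFrame({
    "PhoneService":    ["Yes", "Yes", "No"],
    "InternetService": ["No", "Fiber optic", "DSL"],
    "OnlineSecurity":  ["No internet service", "No", "Yes"],
    "StreamingTV":     ["No internet service", "Yes", "No"],
})

yes_no_cols = ["PhoneService", "OnlineSecurity", "StreamingTV"]

# Count the 'Yes' answers, then add one for any internet plan (DSL or Fiber optic).
has_internet = (toy["InternetService"] != "No").astype(int)
services = (toy[yes_no_cols] == "Yes").sum(axis=1) + has_internet

print(services.tolist())
```

Switching to this variant would shift the service counts in the plots further down, so the observations made there would need to be rechecked.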

# Visual Analysis

This is a telecommunications company that offers several services. The goal for this section is to help the company determine all possible factors that are affecting the customers' decision to stay or leave the company.

First, a broad idea of the correlations can be seen with a plot of the correlation matrix, after which the individual features will be explored more in depth.

In [None]:
sns.heatmap(df.apply(lambda x: pd.factorize(x)[0]).corr(), cmap='Blues')
plt.show()

#### Churn Percentage
Let's begin with seeing just how many customers are churning.

In [None]:
plt.style.use(['seaborn-dark','seaborn-talk'])

fig, ax = plt.subplots(1,2,figsize=(16,6))

df['Churn'].value_counts().plot.pie(explode=[0,0.08], ax=ax[0], autopct='%1.2f%%', shadow=True, 
                                    fontsize=14, startangle=30, colors=["#3791D7", "#D72626"])
ax[0].set_title('Total Churn Percentage')

sns.countplot(x='Churn', data=df, ax=ax[1], palette=["#3791D7", "#D72626"])
ax[1].set_title('Total Number of Churn Customers')
ax[1].set_ylabel(' ')

plt.show()
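The percentage shown in the pie chart can also be computed directly, without plotting. A minimal sketch, using a hypothetical toy column in place of `df['Churn']`:

```python
import pandas as pd

# Hypothetical toy target column; in the notebook this would be df['Churn'].
churn = pd.Series(["No", "No", "Yes", "No"], name="Churn")

# normalize=True returns proportions instead of raw counts.
churn_pct = churn.value_counts(normalize=True) * 100

print(churn_pct.round(2))
```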

The data shows that 26.54% of the company's customers have decided to cut ties. Let's find out why.

#### Total and Monthly Charges

In [None]:
plt.style.use(['seaborn-dark','seaborn-talk'])

fig, ax = plt.subplots(1,2,figsize=(16,6))

sns.boxplot(x='Churn', y='TotalCharges', data=df, ax=ax[0], palette=["#3791D7", "#D72626"])
ax[0].set_title('Total Charges')
ax[0].set_ylabel('Total Charges ($)')
ax[0].set_xlabel('Churn')

sns.boxplot(x='Churn', y='MonthlyCharges', data=df, ax=ax[1], palette=["#3791D7", "#D72626"])
ax[1].set_title('Monthly Charges')
ax[1].set_ylabel('Monthly Charges ($)')
ax[1].set_xlabel('Churn')

plt.show()

The boxplots show that the typical (median) total amount charged is lower for the customers who have churned. Monthly charges show the opposite: customers who churned were typically paying more per month.
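The same comparison can be read off numerically by grouping on `Churn` and taking medians. A small sketch with hypothetical toy values (the numbers are made up, and `TotalCharges` is simplified here to tenure times monthly charge):

```python
import pandas as pd

# Hypothetical toy customers; values are invented for illustration only.
toy = pd.DataFrame({
    "Churn": ["No", "No", "No", "Yes", "Yes", "Yes"],
    "tenure": [60, 40, 50, 2, 5, 8],
    "MonthlyCharges": [50.0, 60.0, 55.0, 80.0, 90.0, 70.0],
})

# Simplifying assumption: total charges = months stayed * monthly charge.
toy["TotalCharges"] = toy["tenure"] * toy["MonthlyCharges"]

# Median of each numeric column within each churn group.
medians = toy.groupby("Churn")[["tenure", "MonthlyCharges", "TotalCharges"]].median()

print(medians)
```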

There are a couple of major factors to consider before coming to any conclusions regarding the average charges. Firstly, it would make sense that the total amount charged to the customers who churned would be lower, as they may have stayed with the company for considerably less time than the customers who stayed. To check this, we have to look at how long each customer has been with the company, which the tenure feature captures and which is explored further below. Secondly, with regard to the monthly charges, it is possible that long-time customers have received a discount. It's also possible that long-term customers pay for fewer services. Let's investigate this possibility by using the previously created feature for the total number of services a customer has.

In [None]:
plt.style.use(['bmh','seaborn-talk'])
plt.figure(figsize=(14,6))

sns.countplot(x='TotalServices', hue='Churn', data=df)
plt.title('Number of Customers per Number of Services')
plt.xlabel('Number of Online Services')
plt.ylabel('Number of Customers')

plt.show()

It certainly looks like most of the company's ongoing contracts consist of a single service, and interestingly, the second most common number of services for the non-churn customers is 4. For the churn customers, the most common number of services is 1, and the counts appear to decrease roughly linearly from then on.

Let's look at these services a bit closer.

In [None]:
cols = ['PhoneService', 'InternetService', 'OnlineSecurity', 'OnlineBackup', 'DeviceProtection', 'TechSupport', 'StreamingTV', 'StreamingMovies']

for i in cols:
    plt.figure(figsize=(14,4))
    sns.countplot(x=i, hue='Churn', data=df)
    plt.ylabel('Number of Customers')
    plt.show()

These plots show several patterns amongst these services. Most churn customers appear to have been paying for phone service. Regarding internet service, most of the churn customers had fiber optic services. Churn customers also followed the patterns of no online security, no online backup, no device protection, and no tech support. In terms of streaming services, not much can be concluded.
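Raw counts can mislead when categories differ in size, so the share of churners within each category is a useful complement to these plots. A minimal sketch using `pd.crosstab` on hypothetical toy rows:

```python
import pandas as pd

# Hypothetical toy rows; in the notebook these columns would come from df.
toy = pd.DataFrame({
    "InternetService": ["DSL", "DSL", "Fiber optic", "Fiber optic", "No", "No"],
    "Churn": ["No", "No", "Yes", "No", "No", "No"],
})

# normalize='index' makes each row sum to 1, giving the churn share per category.
rates = pd.crosstab(toy["InternetService"], toy["Churn"], normalize="index")

print(rates)
```

The `Yes` column of the result is the churn rate for each internet service type.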

Now that we have seen these patterns, let's look deeper into the customers' contracts and payment methods.

In [None]:
plt.style.use(['seaborn-dark','seaborn-talk'])
fig, ax = plt.subplots(1,2,figsize=(16,6))

sns.countplot(x='Contract', data=df, hue='Churn', ax=ax[0])
ax[0].set_title('Number of Customers per Contract Type')
ax[0].set_xlabel('Contract Type')
ax[0].set_ylabel('Number of Customers')

sns.countplot(x='PaymentMethod', data=df, hue='Churn', ax=ax[1])
ax[1].set_xticklabels(ax[1].get_xticklabels(), rotation = 30)
ax[1].set_title('Number of Customers per Payment Method')
ax[1].set_xlabel('Payment Method')
ax[1].set_ylabel('Number of Customers')

plt.show()

Evidently, most churn customers paid by electronic check and were on month-to-month contracts.

Since a clear majority of churn customers were on a month-to-month contract, let's check how many months the customers stayed with the company (tenure).

In [None]:
plt.figure(figsize=(14,6))

sns.boxplot(x='Churn', y='tenure', data=df)
plt.title('Tenure by Churn')
plt.xlabel('Churn')
plt.ylabel('Tenure')

plt.show()

This shows that the churn customers appear to follow a trend of shorter tenure, which makes sense.

Let's take a deeper look into who these customers are.

In [None]:
for i in ['gender', 'SeniorCitizen', 'Partner', 'Dependents']:
    plt.figure(figsize=(14,6))
    sns.countplot(x=i, data=df, hue='Churn')
    plt.ylabel('Number of Customers')
    plt.show()

We gather from this that churn customers tend to have no partner and no dependents.

***

# Data Conclusions

We have explored every given feature in our attempt to tell a story with this dataset. We have concluded the following:

#### Churn Customers:
Churn customers tend to lean towards month-to-month contracts and opt for one or two services, paying a higher monthly average with electronic checks. Most choose the fiber optic internet service, with no online security, no online backup, no device protection, and no tech support. They also seem to mostly not have a partner or dependents.

#### Potential Problems:
- It is possible that there is a problem with the fiber optic internet service, as it has the highest churn rate of the internet services provided. Improvements to this service might reduce customer churn rates.
- It is possible that month-to-month contracts have higher monthly charges. If this is the case, reducing the monthly charge for the month-to-month contract option might reduce customer churn rates.
- It is possible that there is a problem with the electronic check payment option, as it seems to have a particularly high churn rate compared with the other options. This could be a technical problem that could be looked into and fixed.

***

# Predictive Analysis

Up until now we have simply been telling a story with our data. This form of data analysis allows a company to see exactly where it stands and to observe key points and patterns in its progress. The next part consists of predictions using machine learning algorithms, which is considerably more complicated.

Not many businesses are able to predict the future of their investments. With this predictive analysis, the telecommunications company will have a model to help estimate the likelihood that a customer will churn, which could be useful information for a company that wants to reduce churn rates or predict future earnings.

## Engineering the Data for Modeling

Some of the services must be changed for the benefit of our model, without losing relevant information.

In [None]:
for i in ['OnlineSecurity','OnlineBackup','DeviceProtection',
          'TechSupport','StreamingTV','StreamingMovies']:
    df[i] = df[i].apply(lambda x: 'No' if x=='No internet service' else x)
    
df.MultipleLines=df.MultipleLines.apply(lambda x: 'No' if x=='No phone service' else x)
    

## Encoding the Data

In [None]:
# For variables with only two classifications:

from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
for i in [e for e in df.columns if len(df[e].unique())==2]:
    df[i] = le.fit_transform(df[i])

In [None]:
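# For variables with more than two classifications:
# customerID is a unique identifier with no predictive value, so it is dropped
# first; otherwise it would be expanded into thousands of dummy columns.

df = df.drop('customerID', axis=1)
df = pd.get_dummies(df, columns=[i for i in df.columns if df[i].dtypes == 'object'], drop_first=True)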
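To make the effect of `drop_first=True` concrete, here is a minimal sketch on a hypothetical toy column. Three categories produce only two dummy columns, because the dropped first category is implied when the others are all zero:

```python
import pandas as pd

# Hypothetical toy column using the dataset's contract labels.
toy = pd.DataFrame({"Contract": ["Month-to-month", "One year", "Two year", "One year"]})

# drop_first=True drops the first category in sorted order ('Month-to-month').
encoded = pd.get_dummies(toy, columns=["Contract"], drop_first=True)

print(encoded.columns.tolist())
```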

## Resampling the Data

It is important to notice that only 26.54% of customers have churned, which makes the two classes imbalanced. We will deal with this by up-sampling the minority class (Churn = 1).

In [None]:
from sklearn.utils import resample

df_majority = df[df.Churn==0]
df_minority = df[df.Churn==1]

df_minority_upsampled = resample(df_minority,
                                replace=True,
                                n_samples=5000,
                                random_state=123)

df_upsampled = pd.concat([df_majority, df_minority_upsampled])

print('Churn Count in Original Data: \n', df.Churn.value_counts(), '\n')
print('New Churn Count: \n', df_upsampled.Churn.value_counts())

## Splitting the Data 
Before applying any machine learning models, the data must be split into a training set, used to fit the models, and a test set, used to evaluate them.

In [None]:
from sklearn.model_selection import train_test_split

# Separate input features (X) and target variable (y)
y = df_upsampled.Churn
X = df_upsampled.drop('Churn', axis=1)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=123, test_size=0.30)
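One caveat with up-sampling before the split: copies of the same churn row can land in both the training and the test set, which tends to inflate test scores. A common alternative, sketched here on hypothetical toy data (this is not the method used in this kernel), is to split first and up-sample only the training portion:

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

# Hypothetical toy data: 20 rows, 5 of them churners.
toy = pd.DataFrame({"feature": range(20), "Churn": [1] * 5 + [0] * 15})

# 1) Split first, stratifying so both parts keep the original class ratio.
train, test = train_test_split(toy, test_size=0.4, stratify=toy["Churn"],
                               random_state=123)

# 2) Up-sample the minority class inside the training part only.
train_major = train[train["Churn"] == 0]
train_minor = train[train["Churn"] == 1]
train_minor_up = resample(train_minor, replace=True,
                          n_samples=len(train_major), random_state=123)
train_balanced = pd.concat([train_major, train_minor_up])

print(train_balanced["Churn"].value_counts())
```

With this ordering the test set keeps the original class balance and contains no rows that the model has already seen during training.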

## Machine Learning Models

We will now investigate the predictive results of several models to see which algorithm works best for our predictive analysis.

- Random Forest : RandomForestClassifier
- Naive Bayes : GaussianNB
- Logistic Regression : LogisticRegression
- K-Neighbors : KNeighborsClassifier

In [None]:
from sklearn.metrics import classification_report, precision_score, accuracy_score

from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier


In [None]:
classifier_list = [LogisticRegression(),
                   KNeighborsClassifier(),
                   GaussianNB(priors=None),
                   RandomForestClassifier()]

# Precision score = tp / (tp + fp)
# Accuracy score = (# of correctly assigned rows) / (all rows)

for clf in classifier_list:
    clf.fit(X_train, y_train)
    predictions = clf.predict(X_test)
    precision = precision_score(y_test, predictions)
    accuracy = accuracy_score(y_test, predictions)

    print(clf, '\n \n', classification_report(y_test, predictions),
          '\n \nPrecision Score: ', precision,
          '\nAccuracy Score: ', accuracy,
          '\n\n----------------------------------------------------------------\n\n')
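The two scores printed in the loop can be traced by hand on a tiny example. The labels below are hypothetical and chosen so that precision and accuracy differ:

```python
from sklearn.metrics import accuracy_score, precision_score

# Hypothetical true labels and predictions (1 = churn, 0 = no churn).
y_true = [1, 0, 1, 1, 0, 0]
y_pred = [1, 1, 1, 1, 0, 0]

# Precision: of the 4 predicted churners, 3 really churned -> 3 / 4.
precision_demo = precision_score(y_true, y_pred)

# Accuracy: 5 of the 6 rows are labelled correctly -> 5 / 6.
accuracy_demo = accuracy_score(y_true, y_pred)

print(precision_demo, accuracy_demo)
```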

The Random Forest Classifier appears to be the most accurate. Explore these results for yourselves: play with different parameters and optimize them, and perhaps find a better way to balance the data. I hope this helped!
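As one possible starting point for that parameter exploration, here is a minimal sketch of a grid search with cross-validation. The data is synthetic (`make_classification`), and the grid values are arbitrary assumptions rather than recommendations. In the notebook, `X_train` and `y_train` would take the place of `X_demo` and `y_demo`.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data; replace with X_train / y_train in the notebook.
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=123)

# Hypothetical grid: the values are illustrative only.
param_grid = {"n_estimators": [20, 50], "max_depth": [3, None]}

# 3-fold cross-validation, scored on precision to match the metric used above.
search = GridSearchCV(RandomForestClassifier(random_state=123),
                      param_grid, cv=3, scoring="precision")
search.fit(X_demo, y_demo)

print(search.best_params_, round(search.best_score_, 3))
```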