Customer churn is when a customer, user, or subscriber “breaks up” with a company and stops using its product or service. Churn is sometimes referred to as attrition, and nearly all companies experience it. In this notebook, let's predict churn behavior so that customers can be retained.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
data = pd.read_csv('../input/telco-customer-churn/WA_Fn-UseC_-Telco-Customer-Churn.csv')

Each row represents a customer, and each column contains one of the customer’s attributes, as described in the column metadata.

The data set includes information about:

- Customers who left within the last month – the column is called Churn
- Services that each customer has signed up for – phone, multiple lines, internet, online security, online backup, device protection, tech support, and streaming TV and movies
- Customer account information – how long they’ve been a customer, contract, payment method, paperless billing, monthly charges, and total charges
- Demographic info about customers – gender, age range, and if they have partners and dependents

# Data Understanding

In [None]:
pd.set_option('display.max_columns', None)
data.head()

In [None]:
data.shape

In [None]:
data.describe()

# Exploratory Data Analysis

In [None]:
#Distribution of the target variable
sns.set_style('darkgrid')
plt.figure(figsize = (6,5))
g = sns.countplot(x = 'Churn', data = data)
#Annotate each bar with its count
for p in g.patches:
    height = p.get_height()
    g.text(p.get_x()+p.get_width()/2., height + 0.1,
           '{:.0f}'.format(height),ha="center")
plt.show()

In [None]:
#Visualizing Binary columns
cols = ['gender','SeniorCitizen','Partner','Dependents','PhoneService','PaperlessBilling']
total = len(data['Churn'])
fig, axes = plt.subplots(ncols=3, nrows=2, figsize=(19,10), dpi= 60)
axes = axes.flatten()
for col, ax in zip(cols, axes):
    g = sns.countplot(x = col, data = data, ax = ax, hue = 'Churn')
    #The y-axis shows counts; the bar labels show each bar's share of all customers
    g.set_ylabel('Count')
    for p in g.patches:
        height = p.get_height()
        g.text(p.get_x()+p.get_width()/2., height + 0.1,
               '{:1.2f}'.format(height/total),ha="center")
plt.show()

From the above plot we can infer that the churn rate is about the same for both genders. Among senior citizens, however, the numbers of churned and retained customers are much closer to each other, which suggests that senior citizens churn at a higher rate. Also, a customer without a partner is more likely to churn than a customer with a partner. Likewise, a customer without dependents is more likely to churn than a customer with dependents.

Credits: https://www.kaggle.com/pavanraj159/telecom-customer-churn-prediction
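
The comparisons read off the plots above can also be checked numerically. Here is a minimal sketch, assuming a small made-up frame (`toy`) and a hypothetical helper `churn_rate_by`; neither is part of the original notebook. In the notebook itself the helper would be called on `data` with any of the binary columns.

```python
import pandas as pd

# Hypothetical toy data, only to illustrate the calculation
toy = pd.DataFrame({
    "Partner": ["Yes", "Yes", "No", "No", "No", "Yes", "No", "Yes"],
    "Churn":   ["No", "No", "Yes", "Yes", "No", "Yes", "Yes", "No"],
})

def churn_rate_by(frame, column):
    """Share of churned customers within each level of `column` (hypothetical helper)."""
    churned = frame["Churn"].eq("Yes")            # True where the customer churned
    return churned.groupby(frame[column]).mean()  # mean of booleans = churn rate

rates = churn_rate_by(toy, "Partner")
print(rates)
```

Comparing rates rather than raw counts makes groups of very different sizes, such as senior and non-senior customers, easier to compare.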

In [None]:
#Replacing spaces with null values in total charges column
data['TotalCharges'] = data["TotalCharges"].replace(" ",np.nan)

#Dropping rows with null total charges (0.15% of the data)
data = data[data["TotalCharges"].notnull()]
data = data.reset_index(drop=True)

#convert to float type
data["TotalCharges"] = data["TotalCharges"].astype(float)

#replace 'No internet service' with 'No' for the following columns
replace_cols = ['OnlineSecurity', 'OnlineBackup', 'DeviceProtection',
                'TechSupport', 'StreamingTV', 'StreamingMovies']
for col in replace_cols:
    data[col] = data[col].replace({'No internet service' : 'No'})

#replace 'No phone service' with 'No' for MultipleLines (once, outside the loop)
data['MultipleLines'] = data['MultipleLines'].replace({'No phone service' : 'No'})

#recode SeniorCitizen from 1/0 to Yes/No to match the other binary columns
data["SeniorCitizen"] = data["SeniorCitizen"].replace({1:"Yes",0:"No"})
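
The blank-string handling for TotalCharges can be tried out on a tiny example first. The sketch below assumes a made-up frame (`raw`) with one blank entry and uses `pd.to_numeric(..., errors="coerce")`, which turns unparseable strings into NaN in one step; it is an alternative illustration, not the cell above.

```python
import pandas as pd

# Hypothetical toy column with one blank entry, as in the raw CSV
raw = pd.DataFrame({"TotalCharges": ["29.85", " ", "1889.5", "108.15"]})

# Coerce to numbers; the blank string cannot be parsed and becomes NaN
raw["TotalCharges"] = pd.to_numeric(raw["TotalCharges"], errors="coerce")

# Share of missing values before dropping them
missing_share = raw["TotalCharges"].isna().mean()

# Drop the missing rows and renumber the index from 0
clean = raw.dropna(subset=["TotalCharges"]).reset_index(drop=True)
print(missing_share, clean["TotalCharges"].dtype)
```

Checking the missing share before dropping rows confirms that only a small part of the data is lost.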

In [None]:
#Visualizing Tenure
plt.figure(figsize = (18,4))
sns.countplot(x = 'tenure',data = data, hue = 'Churn')
plt.show()

# Data Preparation
Credits: https://www.kaggle.com/graeme16161/xgboost-tuned-with-random-search

In [None]:
from sklearn.preprocessing import LabelEncoder

#Make dummy variables for categorical variables with >2 levels
dummy_columns = ["MultipleLines","InternetService","OnlineSecurity",
                 "OnlineBackup","DeviceProtection","TechSupport",
                 "StreamingTV","StreamingMovies","Contract",
                 "PaymentMethod"]

df = pd.get_dummies(data, columns = dummy_columns)

#Encode categorical variables with 2 levels
enc = LabelEncoder()
encode_columns = ["Churn","PaperlessBilling","PhoneService",
                  "gender","Partner","Dependents","SeniorCitizen"]

for col in encode_columns:
    df[col] = enc.fit_transform(df[col])
    
#Remove customer ID column
del df["customerID"]


#Make sure TotalCharges is numeric; rows with blank values were already dropped above, so fillna(0) is only a safeguard
df["TotalCharges"] = pd.to_numeric(df["TotalCharges"],errors = 'coerce').fillna(0)
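
To see what the two encodings produce, here is a small sketch on a made-up frame (`toy`, not part of the original notebook). `pd.get_dummies` creates one indicator column per category, while `LabelEncoder` maps the sorted class labels to integers, so 'No' becomes 0 and 'Yes' becomes 1.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy data with one multi-level and one two-level column
toy = pd.DataFrame({
    "Contract": ["Month-to-month", "One year", "Two year", "One year"],
    "Partner": ["Yes", "No", "No", "Yes"],
})

# One indicator column per Contract level
encoded = pd.get_dummies(toy, columns=["Contract"])

# Integer codes for the two-level column (classes are sorted, so No=0, Yes=1)
encoded["Partner"] = LabelEncoder().fit_transform(encoded["Partner"])

print(encoded.columns.tolist())
print(encoded["Partner"].tolist())
```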

# Build Base Model

In [None]:
from sklearn.model_selection import train_test_split

#Split data into x and y
y = df[["Churn"]]
x = df.drop("Churn", axis=1)

#Create test and training sets
x_train, x_test, y_train, y_test = train_test_split(x,y, test_size= .2, random_state= 1)

In [None]:
from xgboost import XGBClassifier
from xgboost import plot_importance
from sklearn.metrics import accuracy_score

#Build XGBoost model
model = XGBClassifier()
model.fit(x_train, y_train.values.ravel())


#Predictions for test data (predict already returns 0/1 class labels)
y_pred = model.predict(x_test)

#Accuracy
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy: %.2f%%" % (accuracy * 100.0))

#Feature importance
fig, ax = plt.subplots(figsize=(10, 8))
plot_importance(model, ax = ax)
plt.show()
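
Accuracy alone can hide how well the churners themselves are detected, especially if the two classes differ in size. As a possible complement, the sketch below computes a confusion matrix, precision, and recall with scikit-learn. The label lists are made up for illustration; in the notebook they would be `y_test` and the model's predictions.

```python
from sklearn.metrics import (accuracy_score, confusion_matrix,
                             precision_score, recall_score)

# Hypothetical true labels and predictions (1 = churned)
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_hat  = [0, 0, 0, 0, 0, 1, 1, 0, 0, 0]

acc = accuracy_score(y_true, y_hat)         # share of all predictions that are correct
cm = confusion_matrix(y_true, y_hat)        # rows: true class, columns: predicted class
precision = precision_score(y_true, y_hat)  # share of predicted churners who really churned
recall = recall_score(y_true, y_hat)        # share of real churners that were found

print(acc, precision, recall)
print(cm)
```

In this toy case most predictions are correct, yet only one of the four churners is found, which is the kind of gap that accuracy by itself would not reveal.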