![](https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcTx-9z3sly5qV-SbWOQZhb88Enw010Wsty1RA&usqp=CAU)

**Problem Statement**

Churn is one of the biggest problems in the telecom industry. You are the Data Scientist at a telecom company, “Neo”, whose customers are churning to its competitors. You have to analyse your company's data, find insights, and stop your customers from churning to other telecom companies.

**Industry: Telecom**

Data Description:

Each row represents a customer; each column contains a customer attribute described in the column metadata.

The data set includes information about:

1. Customers who left within the last month – the column is called Churn
2. Services that each customer has signed up for – phone, multiple lines, internet, online security, online backup, device protection, tech support, and streaming TV and movies
3. Customer account information – how long they’ve been a customer, contract, payment method, paperless billing, monthly charges, and total charges
4. Demographic info about customers – gender, age range, and if they have partners and dependents
 

Import Required Libraries

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import matplotlib.ticker as mtick

In [None]:
cust= pd.read_csv('../input/telco-customer-churn/WA_Fn-UseC_-Telco-Customer-Churn.csv')

In [None]:
cust.head()

In [None]:
cust.describe()

In [None]:
cust.info()

In [None]:
cust.isnull().sum()

In [None]:
cust['Churn'].value_counts()

In [None]:
cust.dtypes

In [None]:
cust.shape

In [None]:
# Count fully duplicated rows instead of listing a boolean per row
cust.duplicated().sum()

The feature `TotalCharges` was read by Pandas as the `object` data type. This affects the exploratory analysis and has to be handled, so we convert the column to `float64` next.

In [None]:
cust.TotalCharges = pd.to_numeric(cust.TotalCharges, errors='coerce')
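To illustrate why a numeric-looking column can end up as text, here is a minimal sketch on made-up values (the `toy` frame is hypothetical, not the real data). It assumes the cause is a blank string among the entries, and shows how `errors='coerce'` turns such entries into `NaN`.

```python
import pandas as pd

# Toy stand-in for the real column (values are made up): one entry is a
# blank string, so pandas cannot store the column as a numeric dtype.
toy = pd.DataFrame({"TotalCharges": ["29.85", "1889.5", " ", "108.15"]})
print(toy["TotalCharges"].dtype)

# errors='coerce' replaces anything that cannot be parsed with NaN
toy_numeric = pd.to_numeric(toy["TotalCharges"], errors="coerce")
print(toy_numeric.dtype)
print(toy_numeric.isnull().sum())
```

The unparseable entry shows up as a missing value afterwards, which is why the null check is repeated after the conversion.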


In [None]:
cust.dtypes

In [None]:
cust.isnull().sum()

We can see that null values are now present in the `TotalCharges` column. Rather than removing those rows, we replace the null values with the column median.

In [None]:
total_charges_median = cust.TotalCharges.median()
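# Assign the result back; calling fillna(..., inplace=True) on a selected column is unreliable in recent pandas
cust['TotalCharges'] = cust['TotalCharges'].fillna(total_charges_median)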

By checking each feature's unique values, we can see that the column `customerID` holds a unique identifier for each customer. This feature does not contribute to the analysis, so we drop the column.
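The uniqueness check mentioned above could look roughly like the following sketch. The `toy` frame and its values are hypothetical; on the real data the same calls would be applied to `cust`.

```python
import pandas as pd

# Hypothetical miniature of the customer table (values are made up)
toy = pd.DataFrame({
    "customerID": ["0001-A", "0002-B", "0003-C", "0004-D"],
    "gender": ["Female", "Male", "Male", "Female"],
    "Contract": ["Month-to-month", "One year", "Month-to-month", "Two year"],
})

# Number of distinct values per column
unique_counts = toy.nunique()
print(unique_counts)

# A column behaves like a pure identifier when every row has a different value
id_like = unique_counts[unique_counts == len(toy)].index.tolist()
print(id_like)
```

A column with as many distinct values as rows carries no group-level signal, which is the reason for dropping it.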

In [None]:
cust = cust.drop('customerID', axis=1)

In [None]:
cust.head(5)

We successfully dropped the `customerID` column.

Let's start with descriptive statistics.

In [None]:
cust.describe()

In [None]:
colors = ['#003f7f','#ff007f']
ax = (cust['gender'].value_counts()*100.0 /len(cust)).plot(kind='bar',
                                                           stacked = True,
                                                           rot = 0,
                                                           color = colors)
ax.yaxis.set_major_formatter(mtick.PercentFormatter())
ax.set_ylabel('% Customers')
ax.set_xlabel('Gender')
ax.set_title('Gender Distribution')

Demographics - Let us first understand the gender, age range, partner and dependent status of the customers.

Gender Distribution - From the above graph, about half of the customers in our data set are male and the other half are female.
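The bar heights in a chart like this can also be read off as numbers. A minimal sketch on made-up values (the `toy_gender` series is hypothetical):

```python
import pandas as pd

# Made-up gender column for illustration only
toy_gender = pd.Series(["Male", "Female", "Female", "Male", "Male", "Female"])

# normalize=True returns shares instead of raw counts
shares = toy_gender.value_counts(normalize=True) * 100
print(shares.round(1))
```

On the real data, the same call on `cust['gender']` would give the exact split behind the chart.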

In [None]:

# Pass the column as x=; recent seaborn versions no longer accept it positionally
sns.countplot(x=cust['SeniorCitizen'], palette='twilight');
cust['SeniorCitizen'].value_counts()

Customer Account Information: Let us now look at the tenure and contract.

In [None]:
plt.hist(cust['tenure'], bins = 30, color = 'green')
plt.title('Distribution of tenure')

**Tenure**: The histogram above shows that many customers have been with the telecom company for just a month, while a sizeable group has stayed for about 72 months. This could be because different customers have different contracts.

2. Contracts: To understand the above graph, let's first look at the number of customers by contract type.

In [None]:

plt.figure(figsize = [10,5])
sns.countplot(x=cust['Contract'], palette='rocket')
plt.title('Customers by Contract Type', fontsize=20)
plt.xlabel('Contract', fontsize=16)
plt.ylabel('Count', fontsize=16)
plt.show()

As we can see from this graph, most of the customers are on a month-to-month contract, while the one-year and two-year contracts have roughly equal numbers of customers.
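The tenure histogram suggested that contract type might explain the two peaks. One way to check this is to summarise tenure per contract. This is a sketch on made-up rows (the `toy` frame is hypothetical); on the real data the same `groupby` would be applied to `cust`.

```python
import pandas as pd

# Made-up rows for illustration only
toy = pd.DataFrame({
    "Contract": ["Month-to-month", "Month-to-month", "One year",
                 "Two year", "Two year", "Month-to-month"],
    "tenure": [1, 3, 24, 70, 72, 2],
})

# Customer count and median tenure for each contract type
tenure_by_contract = toy.groupby("Contract")["tenure"].agg(["count", "median"])
print(tenure_by_contract)
```

If the hypothesis holds on the real data, the median tenure should be clearly lower for month-to-month contracts than for two-year contracts.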

In [None]:
plt.figure(figsize=(8, 4))
sns.countplot(x=cust['Churn'], palette='tab20')

3. Now let's take a quick look at the relationship between monthly and total charges.

In [None]:
cust[['MonthlyCharges', 'TotalCharges']].plot.scatter(x = 'MonthlyCharges',
                                                              y='TotalCharges')

We observe that total charges tend to increase as a customer's monthly bill increases.
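One plausible reason for this relationship, stated here as an assumption rather than something verified on the data, is that total charges are roughly the monthly charge accumulated over the customer's tenure. The sketch below uses a hypothetical helper `relative_gap` and made-up rows to show how that could be checked.

```python
import pandas as pd

def relative_gap(df):
    """Relative difference between TotalCharges and tenure * MonthlyCharges."""
    approx = df["tenure"] * df["MonthlyCharges"]
    return (df["TotalCharges"] - approx).abs() / df["TotalCharges"]

# Made-up rows for illustration only
toy = pd.DataFrame({
    "tenure": [1, 12, 24, 60],
    "MonthlyCharges": [30.0, 50.0, 80.0, 100.0],
    "TotalCharges": [30.0, 610.0, 1900.0, 6050.0],
})

gap = relative_gap(toy)
print(gap.round(3))
```

Small gaps on the real data would support the assumption. Rows with zero total charges would need to be excluded first to avoid dividing by zero.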

In [None]:
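# Restrict to numeric columns; recent pandas raises an error on text columns otherwise
corr = cust.corr(numeric_only=True).round(2)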
plt.figure(figsize=(10,10))
sns.heatmap(corr, annot = True)

In [None]:
df2=cust

In [None]:
# customerID was already dropped above, so keep all remaining columns
# (iloc[:, 1:] would silently drop `gender`); copy so that `cust` is not modified
df2 = cust.copy()
# Convert the target variable into a binary numeric variable
df2['Churn'] = df2['Churn'].map({'Yes': 1, 'No': 0})

#Let's convert all the categorical variables into dummy variables
df_dummies = pd.get_dummies(df2)
df_dummies.head()
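To make the dummy-variable step concrete, here is a minimal sketch on a hypothetical three-row frame. `pd.get_dummies` leaves numeric columns untouched and expands each text column into one indicator column per category.

```python
import pandas as pd

# Made-up rows for illustration only
toy = pd.DataFrame({
    "Contract": ["Month-to-month", "One year", "Two year"],
    "tenure": [1, 24, 70],
    "Churn": [1, 0, 0],
})

toy_dummies = pd.get_dummies(toy)
print(toy_dummies.columns.tolist())
```

Because `Churn` was already converted to 0/1, it stays a single numeric column and can be used directly as the target.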

In [None]:
cust.describe()

**After going through the above EDA we will develop some predictive models and compare them.**

# Let's Build a Logistic Regression Model

In [None]:
y = df_dummies['Churn'].values
X = df_dummies.drop(columns = ['Churn'])
from sklearn.preprocessing import MinMaxScaler
features = X.columns.values
scaler = MinMaxScaler(feature_range = (0,1))
scaler.fit(X)
X = pd.DataFrame(scaler.transform(X))
X.columns = features

It is important to scale the variables in logistic regression so that all of them are within a range of 0 to 1. In my runs, this improved the accuracy from 79.7% to 80.7%.
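Min-max scaling maps each feature to `(x - min) / (max - min)`. The sketch below uses made-up arrays and a hypothetical `toy_scaler` to show a common variant of the step above: fitting the scaler on the training rows only and reusing it for the test rows. This is an alternative to fitting on the full data set, not what the cell above does.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Made-up single-feature data for illustration only
X_train_toy = np.array([[0.0], [10.0], [20.0]])
X_test_toy = np.array([[5.0], [30.0]])

toy_scaler = MinMaxScaler(feature_range=(0, 1))
toy_scaler.fit(X_train_toy)              # learns min=0 and max=20 from the training rows

train_scaled = toy_scaler.transform(X_train_toy)
test_scaled = toy_scaler.transform(X_test_toy)

print(train_scaled.ravel())
print(test_scaled.ravel())               # values above the training max land outside [0, 1]
```

Fitting on the training rows only keeps information from the test set out of the preprocessing, at the cost of test values occasionally falling outside the 0 to 1 range.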

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=101)

In [None]:
from sklearn.linear_model import LogisticRegression
model = LogisticRegression()
result = model.fit(X_train, y_train)

In [None]:
from sklearn import metrics
prediction_test = model.predict(X_test)
# Print the prediction accuracy
print(metrics.accuracy_score(y_test, prediction_test))

The prediction accuracy on the test data set using the above Logistic Regression model is about 80%.
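Accuracy is easier to interpret next to a baseline and a confusion matrix, especially when one class is much more common than the other. The following sketch uses made-up labels (`y_true_toy` and `y_pred_toy` are hypothetical) to show how a model can match the accuracy of always predicting the majority class.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# Made-up labels for illustration only: 7 non-churners (0) and 3 churners (1)
y_true_toy = np.array([0, 0, 0, 0, 0, 0, 0, 1, 1, 1])
y_pred_toy = np.array([0, 0, 0, 0, 0, 0, 1, 1, 0, 0])

# Baseline: always predict the majority class (no churn)
baseline_acc = accuracy_score(y_true_toy, np.zeros_like(y_true_toy))
model_acc = accuracy_score(y_true_toy, y_pred_toy)

# Rows are actual classes, columns are predicted classes
cm = confusion_matrix(y_true_toy, y_pred_toy)

print(baseline_acc, model_acc)
print(cm)
```

On the real data, the same calls with `y_test` and `prediction_test` would show how many churners the model actually catches.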

# Let's Build a Random Forest Model

In [None]:
from sklearn.ensemble import RandomForestClassifier

In [None]:
random_forest = RandomForestClassifier(criterion = "gini", 
                                       min_samples_leaf = 1, 
                                       min_samples_split = 10,   
                                       n_estimators=100, 
                                       max_features='sqrt',  # 'auto' was removed in newer scikit-learn; 'sqrt' is its equivalent for classifiers
                                       oob_score=True, 
                                       random_state=1, 
                                       n_jobs=-1)

random_forest.fit(X_train, y_train)
y_pred = random_forest.predict(X_test)
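# OOB score: accuracy estimated on the training rows that each tree did not see
print("OOB score: ", round(random_forest.oob_score_ * 100, 2), "%")
# Accuracy on the held-out test set
print("Test accuracy: ", round(metrics.accuracy_score(y_test, y_pred) * 100, 2), "%")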

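The cell above reports a score of about 79.7% for the Random Forest. Note that `oob_score_` is an out-of-bag estimate computed on the training data; it approximates accuracy on unseen data but is not measured on the test data set.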
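The difference between training accuracy, the out-of-bag estimate, and test accuracy can be seen side by side in a small sketch. It uses synthetic data from `make_classification`, not the churn data, and all variable names are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic binary classification data for illustration only
X_syn, y_syn = make_classification(n_samples=500, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_syn, y_syn, test_size=0.3, random_state=101)

rf_toy = RandomForestClassifier(n_estimators=100, min_samples_split=10,
                                oob_score=True, random_state=1)
rf_toy.fit(X_tr, y_tr)

train_acc = rf_toy.score(X_tr, y_tr)   # accuracy on rows the forest was trained on
oob_acc = rf_toy.oob_score_            # estimate from rows each tree did not see
test_acc = rf_toy.score(X_te, y_te)    # accuracy on the held-out split

print(f"train={train_acc:.3f}  oob={oob_acc:.3f}  test={test_acc:.3f}")
```

Training accuracy is usually the most optimistic of the three, which is why the OOB or test figure is the one to compare against the logistic regression result.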