**Introduction**

In this notebook we will explore the Telecom Company Customers data set and try to predict customer churn.

In [None]:
#data wrangling & assessing
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)
#loading data
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))
#Models        
from xgboost import XGBClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import train_test_split
from sklearn import metrics
#Visualization
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
#warnings
import warnings 
warnings.filterwarnings("ignore")

In [None]:
df = pd.read_csv('../input/telco-customer-churn/WA_Fn-UseC_-Telco-Customer-Churn.csv')
df.head()

In [None]:
# Summary of the dataframe: non-null count and dtype for each column
df.info()

In [None]:
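# TotalCharges is read as text because some entries cannot be parsed as numbers.
# Convert the column to numeric: unparseable entries become NaN, then 0.
df['TotalCharges'] = pd.to_numeric(df['TotalCharges'], errors='coerce').fillna(0)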

In [None]:
# Confirm that the column is now stored as float
df['TotalCharges'].dtype

In [None]:
# Check for Null values
df.isnull().sum()

No null values. Now let's look at the names of our columns.

In [None]:
df.columns

Let's make a list of our categorical columns so that exploring them is quick and easy.

In [None]:
catcolumns = ['gender', 'SeniorCitizen', 'Partner', 'Dependents',
       'PhoneService', 'MultipleLines', 'InternetService',
       'OnlineSecurity', 'OnlineBackup', 'DeviceProtection', 'TechSupport',
       'StreamingTV', 'StreamingMovies', 'Contract', 'PaperlessBilling',
       'PaymentMethod',]

Check the unique values for each category

In [None]:
for cat in catcolumns:
    print('{} unique values are [{}]\n'.format(cat,df[cat].unique()))

Let's change these 'No internet service' values to 'No' since all of these features depend on internet service to work

In [None]:
catToChange = ['OnlineSecurity','OnlineBackup','DeviceProtection','TechSupport','StreamingTV','StreamingMovies']
for cat in catToChange:
    df[cat] = df[cat].replace('No internet service', 'No')

In [None]:
#let's make sure
for cat in catcolumns:
    print('{} unique values are [{}]\n'.format(cat,df[cat].unique()))

OK, good. Let's visualize them.

In [None]:
fig, axes =plt.subplots(4,4, figsize=(60,20), sharex=True,facecolor='white')
axes = axes.flatten()
for ax, catplot in zip(axes, catcolumns):
    sns.countplot(y=catplot, data=df, ax=ax, hue='Churn')
# Open the figure picture in a new tab to be able to zoom    

Now let's see which customer characteristics account for the most churned customers in the count plots:

1. Contract - Month to Month
2. Senior Citizen - 0
3. Partner - No
4. Internet service - Fiber Optic 
5. TechSupport - No
6. Paperless billing - Yes
7. Payment Method - Electronic Check
8. Online Security - No
9. Dependents - No
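
Count plots show how many churned customers fall into each category, not how likely a customer in that category is to churn. A minimal sketch of computing the churn rate per category follows. It uses a few made-up rows (not taken from the data set) and a hypothetical helper named `churn_rate_by`:

```python
import pandas as pd

# Toy rows for illustration only; in the notebook this would be `df`.
toy = pd.DataFrame({
    "Contract": ["Month-to-month", "Month-to-month", "One year",
                 "Two year", "Month-to-month", "One year"],
    "Churn":    ["Yes", "No", "No", "No", "Yes", "Yes"],
})

def churn_rate_by(frame, column):
    """Share of rows with Churn == 'Yes' within each value of `column`."""
    churned = frame["Churn"].eq("Yes")
    return churned.groupby(frame[column]).mean().sort_values(ascending=False)

rates = churn_rate_by(toy, "Contract")
print(rates)
```

Calling the helper on the real frame for each entry of `catcolumns` would give rates that can be compared across categories of different sizes.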

Now let's check numerical values

In [None]:
df.groupby('Churn')['tenure'].mean().plot(kind='barh',color=['lightblue','lightgreen']);

Customers with low tenure are more likely to churn.

In [None]:
df.groupby('Churn')['MonthlyCharges'].mean().plot(kind='barh',color=['lightblue','lightgreen']);

In [None]:
df.groupby('Churn')['TotalCharges'].mean().plot(kind='barh',color=['lightblue','lightgreen']);

The numerical features also have an effect on customer churn.

**Let's Predict**

Let's start by picking our features

In [None]:
features = ['Contract','SeniorCitizen','Partner','InternetService','TechSupport','PaperlessBilling','PaymentMethod','OnlineSecurity','Dependents','tenure','MonthlyCharges','TotalCharges']

Let's convert the categorical features to numerical ones that the models can use.

In [None]:
#Label Encoding for object to numeric conversion
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()

for feat in features[:9]:
    df[feat] = le.fit_transform(df[feat].astype(str))

# df.info() prints its summary itself and returns None, so no print() is needed
df.info()
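
`LabelEncoder` maps each category to an integer, which gives nominal columns such as `PaymentMethod` an arbitrary order. One common alternative is one-hot encoding. The sketch below shows it with `pd.get_dummies` on a few made-up rows; the frame `toy` is hypothetical and not part of the notebook:

```python
import pandas as pd

# Toy rows for illustration only.
toy = pd.DataFrame({
    "PaymentMethod": ["Electronic check", "Mailed check", "Electronic check"],
    "tenure": [1, 34, 2],
})

# One indicator column per category; other columns pass through unchanged.
encoded = pd.get_dummies(toy, columns=["PaymentMethod"], dtype=int)
print(encoded)
```

Tree-based models usually cope with integer codes, so this is likely to matter more for the neural network than for XGBoost.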

1. XGBoost

In [None]:
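#Target, encoded as 1 for 'Yes' and 0 for 'No' (recent XGBoost versions expect numeric class labels)
Y = (df['Churn'] == 'Yes').astype(int).values
#Inputs
X = df[features].values
#Split to training and testing (fixed seed so the split is reproducible)
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.2, random_state=42)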

In [None]:
#Define the model
model = XGBClassifier(learning_rate = 0.1,n_estimators=200, max_depth=6)
#train the model
model.fit(X_train, y_train)
#Check training accuracy
trainingAccuracy =  metrics.accuracy_score(y_train,model.predict(X_train))
print("Training Accuracy: %.2f%%" % (trainingAccuracy * 100.0))
#Check testing accuracy
testingAccuracy =  metrics.accuracy_score(y_test, model.predict(X_test))
print("Testing Accuracy: %.2f%%" % (testingAccuracy * 100.0))
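
Accuracy alone can be misleading when one class is much more common than the other, so it may help to look at a confusion matrix, per-class precision and recall, and ROC AUC as well. The sketch below is only an illustration. It assumes synthetic data from `make_classification` and uses scikit-learn's `GradientBoostingClassifier` as a stand-in for the XGBoost model:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced binary data (roughly 75% / 25%); not the Telco data.
X_demo, y_demo = make_classification(n_samples=600, n_features=8,
                                     weights=[0.75, 0.25], random_state=0)
Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=0)

# Stand-in model; XGBClassifier offers the same fit/predict/predict_proba calls.
demo_model = GradientBoostingClassifier(random_state=0).fit(Xd_train, yd_train)

pred = demo_model.predict(Xd_test)
proba = demo_model.predict_proba(Xd_test)[:, 1]

cm = confusion_matrix(yd_test, pred)   # rows: true class, columns: predicted class
auc = roc_auc_score(yd_test, proba)    # computed from scores, not hard labels
print(cm)
print(classification_report(yd_test, pred))
print("ROC AUC: %.3f" % auc)
```

In the notebook, the same three calls could be applied to `y_test`, `model.predict(X_test)` and `model.predict_proba(X_test)[:, 1]`.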

Let's check feature importance

In [None]:
#Add the name of the features to the model
model.get_booster().feature_names = features
#Get the importance of each feature
importance = model.get_booster().get_score(importance_type="gain")
#Display the result
importance
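
The importance scores come back as a plain dictionary, which is hard to compare by eye. A small sketch of ranking them with pandas follows. The values in `example_importance` are made up for illustration and are not results from this model:

```python
import pandas as pd

# Made-up gain values for illustration only.
example_importance = {"tenure": 3.0, "Contract": 9.5, "MonthlyCharges": 2.1}

# Sort so that the feature with the largest gain comes first.
ranked = pd.Series(example_importance, name="gain").sort_values(ascending=False)
print(ranked)
```

In the notebook, `pd.Series(importance).sort_values().plot(kind='barh')` would draw the same ranking as a horizontal bar chart.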

2. Neural network (MLPClassifier)

In [None]:
clf = MLPClassifier(max_iter=300).fit(X_train, y_train)
#Check training accuracy
trainingAccuracy =  metrics.accuracy_score(y_train,clf.predict(X_train))
print("Training Accuracy: %.2f%%" % (trainingAccuracy * 100.0))
#Check testing accuracy
testingAccuracy =  metrics.accuracy_score(y_test, clf.predict(X_test))
print("Testing Accuracy: %.2f%%" % (testingAccuracy * 100.0))
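
Neural networks are generally sensitive to feature scale, and columns such as `TotalCharges` take much larger values than the encoded categories. A possible refinement is to standardize the inputs inside a pipeline. The sketch below assumes synthetic data with one artificially inflated column, and the name `scaled_mlp` is hypothetical:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data; the first column is inflated to mimic a large-valued feature.
X_demo, y_demo = make_classification(n_samples=400, n_features=6, random_state=0)
X_demo[:, 0] *= 1000.0
Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0)

# The scaler is fitted on the training split only and reused on the test split.
scaled_mlp = make_pipeline(StandardScaler(),
                           MLPClassifier(max_iter=300, random_state=0))
scaled_mlp.fit(Xd_train, yd_train)

test_acc = scaled_mlp.score(Xd_test, yd_test)
print("Testing Accuracy: %.2f%%" % (test_acc * 100.0))
```

Keeping the scaler inside the pipeline avoids leaking test-set statistics into training.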

We got similar results from both models without any optimization.

I hope you can review my code and notebook and tell me where I need to improve.

Thanks.