Analysis of [Telco's](http://telco.com.br/) customer database, with information about the attributes of its customers.

The intention is to predict customers with greater potential to leave the company.

If you like this Kernel, please let me know by leaving an upvote. I would appreciate it very much!


In [None]:
# Importing the libraries

# Data manipulation
import pandas as pd

# Matrices and arrays
import numpy as np

# Visualization
import matplotlib.pyplot as plt
import seaborn as sns

# Statistics
import statsmodels.api as sm

df = pd.read_csv('../input/WA_Fn-UseC_-Telco-Customer-Churn.csv')

In [None]:
df.head()

In [None]:
df.info()

## Clean and Transform Data!

According to `df.info()`, the dataset has no missing values.

We have more than 7000 rows and 21 attributes (columns).

Some data that should be categorical are stored as numbers. Let's fix this.

***SeniorCitizen***

In [None]:
# SeniorCitizen

df['SeniorCitizen'] = df.SeniorCitizen.astype('object')

In [None]:
df.info()

Done! The SeniorCitizen column, a dummy variable (0 and 1) that indicates whether the customer is elderly, is now stored as a qualitative variable.

We need to convert the ***TotalCharges*** column to a numerical type, as it holds the total amount of revenue generated by the client. I'll do this using the pandas _to_numeric()_ function.

The _errors = 'coerce'_ parameter turns any record that cannot be converted into NaN.
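As a small illustration of what `errors = 'coerce'` does, here is a sketch on made-up values (the `sample` series is hypothetical and not taken from the dataset). The blank entry cannot be parsed, so it becomes NaN, while the valid strings become floats:

```python
import pandas as pd

# Hypothetical sample: two valid amounts stored as text and one blank entry
sample = pd.Series(['29.85', ' ', '1889.5'])

# Values that cannot be parsed as numbers are turned into NaN
converted = pd.to_numeric(sample, errors='coerce')

print(converted)
print('Missing after conversion:', converted.isnull().sum())
```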

In [None]:
df.TotalCharges = pd.to_numeric(df.TotalCharges, errors = 'coerce')

df.TotalCharges.describe()

In [None]:
df.info()

The conversion generated 11 null values. We will fill them with the product of the tenure and MonthlyCharges columns, since the first is the number of months the customer has been with the company and the second is the amount paid per month.

In [None]:
df.TotalCharges.isnull().sum()

In [None]:
df['TotalCharges'] = df['TotalCharges'].fillna(df['tenure'] * df['MonthlyCharges'])

In [None]:
df.TotalCharges.isnull().sum()

Problem solved!

Now that all the variables are OK, we can start exploring the data. Let's try to understand which customers spend more and which ones usually stay longer with the company, among other information that may be interesting and lead us to some insights.

Let's start by looking at a statistical summary of the numerical variables, which are:

* tenure: Number of months the customer has stayed with the company
* MonthlyCharges: Monthly amount paid by the customer
* TotalCharges: Total amount paid by the customer

## Exploratory Analysis

In [None]:
df.describe().round()

* Half of the clients remain in the company for more than 29 months (just over two years);
* The average amount per month is $30;
* The average total revenue generated per customer is 2280.


***Let's look at how these variables relate to each other.***

In [None]:
numerics = df[['tenure','MonthlyCharges', 'TotalCharges', 'Churn']]

plt.figure(figsize = (8,8))

sns.regplot(x = 'tenure', y = 'TotalCharges', data = numerics)

plt.title('Relationship between loyalty months and total revenue')

In [None]:
plt.figure(figsize = (8,8))
plt.title('Relationship between monthly fee and total revenue')
ax = sns.regplot(x = 'MonthlyCharges', y = 'TotalCharges', data = numerics)

In [None]:
# pairplot creates its own figure, so no plt.figure() call is needed
sns.pairplot(numerics)

In [None]:
plt.figure(figsize = (15,10))

sns.boxplot(x = 'tenure', y = 'TotalCharges', data = df)

plt.title('Box Plot of Total Payments X Months of Loyalty')

In [None]:
plt.figure(figsize = (15,10))
sns.countplot(x = 'tenure', data = df)

Above we explored the relationship between tenure and total charges. It is linear, as you would expect: the longer the customer stays with us, the greater their total spend.

We also observed a linear relationship between the monthly charge and total revenue. Customers with a higher monthly charge represent higher revenue.

We observe no relationship between tenure and monthly charges. In my view, many clients stay for a long time without hiring new services; in contrast, some already start with more expensive plans.

And through the box plot we have seen that, in general, the dataset does not have outliers.
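To put a number on the outlier check instead of reading it off the box plot, one option is the 1.5 x IQR rule that box plots use for their whiskers by default. This is only a sketch: `iqr_outliers` is a hypothetical helper I am introducing here, and the `toy` values are made up for illustration.

```python
import pandas as pd

def iqr_outliers(values, k=1.5):
    """Return the values outside [Q1 - k*IQR, Q3 + k*IQR] (hypothetical helper)."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return values[(values < lower) | (values > upper)]

# Made-up numbers for illustration only
toy = pd.Series([10, 12, 11, 13, 12, 11, 100])
print(iqr_outliers(toy))
```

On the notebook's data the same helper could be called as `iqr_outliers(df['TotalCharges'])`; an empty result would support the reading of the box plot.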


***Now let's explore the categorical variables. From here on, we'll take the variable 'Churn' into account in all our views. This is our target variable, which indicates whether the customer has left the company or not.***

In [None]:

df.describe(include = 'object')

From this table we can already see that:

* Most customers are not seniors;
* The most popular internet service is fiber optic;
* Most customers prefer not to receive paper bills;
* The most popular payment method is the electronic check.




***SeniorCitizen***

Does the age group influence customer churn?

In [None]:
pd.crosstab(df.Churn, df.SeniorCitizen,
            margins = True)

In [None]:
# These percentages could be wrapped in a function
print('Percentage of elderly customers who left the company: {:.1f}%'.format(476/1142*100))
print('Percentage of non-elderly customers who left the company: {:.1f}%'.format(1393/5901*100))

In [None]:
plt.figure(figsize = (8,8))
sns.set(style = 'whitegrid')

sns.countplot(x = 'SeniorCitizen', hue = 'Churn', data = df)

Proportionally speaking, the share of elderly customers leaving the company is much higher than the share of non-elderly customers.
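The churn percentages printed earlier were typed in by hand from the crosstab. A small helper could compute them for any categorical column. The sketch below is an assumption of mine: `churn_rate_by` is a hypothetical function and the `toy` frame is made up. It only relies on the column names 'SeniorCitizen' and 'Churn' and on the 'Yes'/'No' labels used in this dataset.

```python
import pandas as pd

def churn_rate_by(data, column, target='Churn', positive='Yes'):
    """Percentage of customers with target == positive in each group of `column`."""
    churned = data[target] == positive
    return churned.groupby(data[column]).mean().mul(100).round(1)

# Made-up frame for illustration only
toy = pd.DataFrame({
    'SeniorCitizen': [0, 0, 0, 0, 1, 1],
    'Churn': ['No', 'No', 'No', 'Yes', 'Yes', 'No'],
})

rates = churn_rate_by(toy, 'SeniorCitizen')
print(rates)
```

With the real data the call would be `churn_rate_by(df, 'SeniorCitizen')`, and the same function could be reused for gender, Partner, or InternetService.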

Does this indicate a dependency relationship? Is it worth investigating this relationship more closely? Or was it mere chance? A chi-square test can help us find out whether this association is statistically significant.
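A minimal sketch of such a chi-square test, assuming SciPy is available (it is not imported anywhere else in this Kernel). The contingency table is built from the counts used in the percentage calculation earlier: 476 of 1142 elderly customers and 1393 of 5901 non-elderly customers left the company.

```python
from scipy.stats import chi2_contingency

# Rows: non-elderly, elderly. Columns: stayed, churned.
observed = [
    [5901 - 1393, 1393],
    [1142 - 476, 476],
]

chi2, p_value, dof, expected = chi2_contingency(observed)
print('chi2 = {:.1f}, p-value = {:.3g}, dof = {}'.format(chi2, p_value, dof))
```

A p-value below the usual 0.05 threshold would suggest that the difference in churn between the two groups is unlikely to be due to chance alone.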

Just out of curiosity, what is the average monthly charge for elderly and non-elderly customers?

In [None]:
mens_media_idoso = df[df['SeniorCitizen'] == 1]
mens_media_idoso = mens_media_idoso.MonthlyCharges.mean()
mens_media_idoso

n_idoso_media_mes = df[df['SeniorCitizen'] == 0]
n_idoso_media_mes = n_idoso_media_mes.MonthlyCharges.mean()

print('The average monthly expenditure for the elderly is :{}'.format(mens_media_idoso))
print('The average monthly expenditure for non-elderly persons is :{}'.format(n_idoso_media_mes))

In [None]:
# Checking

media_mes_idade = df.groupby('SeniorCitizen').mean(numeric_only = True)
media_mes_idade.round()

In [None]:
plt.figure(figsize = (10,8))

sns.set(style = 'whitegrid')
sns.boxplot(x = df.SeniorCitizen, y = df.TotalCharges, hue = df.Churn)

plt.title('Total Revenue by Seniors and Non-Seniors')

In [None]:
df.SeniorCitizen.value_counts(normalize = True)

Based on the above comparisons:

* Although they represent only 16% of clients, the elderly spend more in the company: they have a higher monthly average, generate more revenue, and have a higher average tenure. However, as we have seen, they have a much higher churn rate than younger customers. The chart makes this even clearer.

These numbers make sense. A plausible explanation is that older people spend more time at home, because they are retired or lead quieter lives, so they watch more television, which leads them to subscribe to more complete and consequently more expensive packages.

Based on these data, I would recommend a deeper analysis to understand the reason for this churn rate and to propose actions to attract this audience and increase its retention.

This analysis has already shown us some very relevant insights, and that is considering only the age variable.


### Let's investigate the gender variable

In [None]:
plt.figure(figsize = (8,8))
sns.set(style = 'whitegrid')
sns.countplot(x = 'gender', hue = 'Churn', data = df)

In [None]:
receita_gender = df.groupby(by = 'gender')[['TotalCharges', 'MonthlyCharges']].mean().round()
receita_gender

In [None]:
df.groupby(by = 'gender')['tenure'].mean().round()

***There is no noticeable difference in behavior between women and men.***

### Let's investigate the Partner variable, which indicates whether the customer has a partner.

In [None]:
plt.figure(figsize = (8,8))
sns.set(style = 'whitegrid')
sns.countplot(x = 'Partner', hue = 'Churn', data = df)

In [None]:
df.groupby('Partner')[['TotalCharges', 'MonthlyCharges', 'tenure']].mean().plot(kind = 'bar', stacked = True,
                                                                               figsize = (8,8))

There is a great difference between people with partners and those without. People with a partner (married ones) spend much more money with the company and stay much longer. That may be because they have children and subscribe to more complete packages.

Let's confirm that possibility!

In [None]:
pd.crosstab(df.Partner, df.Dependents).plot(kind = 'bar', stacked = True, figsize = (8,8))

That's right: people who have partners have more dependents (children) than single ones.

In [None]:
plt.figure(figsize = (15,10))
sns.countplot(x = 'tenure', hue = 'Partner', data = df)

The chart above confirms that people who have a partner stay longer with the company.

So, the insight from this analysis is:

* People with a partner are very lucrative to the company, because they stay longer and spend more money.


***OK, I think we got some good insights exploring the demographic attributes of our clients. Let's take a look at our products! Which one is the most lucrative? Is it the same one with the highest loyalty?***

In [None]:
df.InternetService.value_counts(normalize = True)

21% of clients don't use internet services. It would be a good idea to explore ways to get them to use our internet services, perhaps using a clustering algorithm to discover specific characteristics we could use in our favor.

From here on, I think it is time to slice our dataset. Let's focus on clients that use internet services, but before that, let's take a closer look:

In [None]:
pd.crosstab(df.InternetService, df.PhoneService, margins = True)

Clients that don't use phone services use the DSL internet service, which means that Fiber Optic is available only to those who have phone services.

For those who use the phone service, DSL is still an option.

We are starting to understand the company's product strategy...

In [None]:
plt.figure(figsize = (15,5))
sns.countplot(x = 'InternetService', hue = 'Churn', data = df)

There is a strong churn tendency among Fiber Optic customers. That might indicate great dissatisfaction with this service.



In [None]:
df.groupby('InternetService')['TotalCharges'].mean()

***Let's start the modeling phase***

First of all, I'll transform the columns so that the model is able to understand our data!

In [None]:
df.head()

***Gender***

As we saw above, gender does not make any great difference, so I think this variable is not important for the model. Let's drop it, together with the customerID identifier!

In [None]:
df.drop(['customerID', 'gender'], axis = 1, inplace = True)

In [None]:
# Work on a copy so that df itself is left untouched
df_model = df.copy()

df_model.head()

In [None]:
df_model.columns

In [None]:
# Here I adapted some code from another Kernel

columns_to_convert = ['Partner', 'Dependents', 'PhoneService', 'OnlineSecurity',
                      'OnlineBackup', 'DeviceProtection', 'TechSupport', 'StreamingTV',
                      'StreamingMovies', 'PaperlessBilling', 'Churn']

# Map Yes/No to 1/0; other values (e.g. 'No internet service') are handled later
for item in columns_to_convert:
    df_model[item] = df_model[item].replace(to_replace=['Yes', 'No'], value=[1, 0])
df_model.head()

In [None]:
# adjusting the column Multiple Lines

df_model.MultipleLines = df_model.MultipleLines.replace(to_replace= 'No phone service', value = 'No')
df_model.MultipleLines = df_model.MultipleLines.replace(to_replace= ['Yes', 'No'], value = [1,0])
df_model.MultipleLines.value_counts()

In [None]:
pd.get_dummies(df_model, columns = ['InternetService', 'Contract', 'PaymentMethod'], drop_first = True)

In [None]:
# Treat 'No internet service' as 0 in the internet-dependent columns
internet_columns = ['OnlineSecurity', 'OnlineBackup', 'DeviceProtection',
                    'TechSupport', 'StreamingTV', 'StreamingMovies']
for item in internet_columns:
    df_model[item] = df_model[item].replace(to_replace = 'No internet service', value = 0)


In [None]:
df_model.head(10)

In [None]:
df_model2 = pd.get_dummies(df_model, columns = ['InternetService', 'Contract', 'PaymentMethod'], drop_first = True)
df_model2.head(20)
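To make the two encoding steps easier to follow, here is a sketch on a tiny made-up frame (`toy` is hypothetical and only mimics two of the dataset's columns). Binary Yes/No columns become 1/0, and a multi-level column becomes one indicator column per level. With `drop_first = True` the first level is dropped, because it is implied when all the other indicators are 0.

```python
import pandas as pd

# Made-up frame for illustration only
toy = pd.DataFrame({
    'Partner': ['Yes', 'No', 'Yes'],
    'Contract': ['Month-to-month', 'One year', 'Two year'],
})

# Binary Yes/No column -> 1/0
toy['Partner'] = toy['Partner'].map({'Yes': 1, 'No': 0})

# Multi-level column -> one indicator column per level, minus the first level
encoded = pd.get_dummies(toy, columns=['Contract'], drop_first=True)
print(encoded.columns.tolist())
```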

In [None]:
from sklearn.model_selection import train_test_split

X = df_model2.drop('Churn',axis=1)
y = df_model2['Churn']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=101)

In [None]:
from sklearn import tree


In [None]:
clf = tree.DecisionTreeClassifier()
clf = clf.fit(X_train,y_train)

In [None]:
# Predictions

predictions = clf.predict(X_test)

In [None]:
predictions

In [None]:
from sklearn.metrics import classification_report, confusion_matrix

print(classification_report(y_test,predictions))
print('\n')
print(confusion_matrix(y_test,predictions))
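To read the report above, it helps to see how precision and recall come out of the confusion matrix. The sketch below uses made-up labels (`y_true` and `y_pred` are hypothetical, not the notebook's predictions) and compares the manual calculation with scikit-learn's own `precision_score` and `recall_score`.

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Made-up labels: 1 = churned, 0 = stayed
y_true = [0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 1, 1, 1, 0, 0]

# For binary labels the matrix is [[tn, fp], [fn, tp]]
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

precision = tp / (tp + fp)  # of the customers flagged as churners, how many really churned
recall = tp / (tp + fn)     # of the customers who churned, how many were flagged

print('tn, fp, fn, tp:', tn, fp, fn, tp)
print('precision:', precision, 'sklearn:', precision_score(y_true, y_pred))
print('recall:', recall, 'sklearn:', recall_score(y_true, y_pred))
```

For a churn problem, recall on the churned class is often the number to watch, since every missed churner is a customer nobody tried to retain.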

In [None]:
from sklearn.ensemble import RandomForestClassifier

In [None]:
rfc = RandomForestClassifier(n_estimators= 10)

In [None]:
rfc.fit(X_train,y_train)

In [None]:
rfc_predictions = rfc.predict(X_test)

In [None]:
print(classification_report(y_test,rfc_predictions))

OK, that's all for now. I'm still trying to understand how to apply these algorithms. In summary:

Random Forest performed better than the decision tree. At the moment I can't explain why this happens, but I will keep learning.

Please leave your comments. I'll be very glad if someone can explain to me how my model performed!

Thank you!
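On comparing the two models: a single train/test split can be noisy, so one common way to check the comparison is cross-validation. The sketch below is only an illustration under my own assumptions. It uses synthetic data from `make_classification` as a stand-in (NOT the Telco dataset), and the names `X_demo`, `y_demo`, `models` and `scores` are hypothetical. A commonly cited reason for a random forest doing better is that it averages many trees trained on different bootstrap samples, which tends to reduce the variance of a single deep tree.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data, NOT the Telco dataset
X_demo, y_demo = make_classification(n_samples=500, n_features=10, random_state=101)

models = {
    'decision tree': DecisionTreeClassifier(random_state=101),
    'random forest': RandomForestClassifier(n_estimators=10, random_state=101),
}

scores = {}
for name, model in models.items():
    # Mean accuracy over 5 cross-validation folds
    scores[name] = cross_val_score(model, X_demo, y_demo, cv=5).mean()
    print('{}: {:.3f}'.format(name, scores[name]))
```

With the notebook's data, `X_demo` and `y_demo` would be replaced by `X` and `y`.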