# Telecom Users Dataset 
Every business wants to grow its customer base. Doing so requires not only attracting new customers but also retaining existing ones, and retaining a customer usually costs the company less than acquiring a new one. In addition, a new customer may have only a weak interest in the company's services and be harder to work with, whereas for existing customers the company already has data on how they interact with the service.

Accordingly, by predicting churn we can react in time and try to keep a customer who intends to leave. Based on data about the services the customer uses, we can make a special offer aimed at changing the decision to leave the operator. This makes retention easier to carry out than acquisition, since we know nothing yet about prospective users.

You are provided with a dataset from a telecommunications company. The data contains information about almost six thousand users, their demographic characteristics, the services they use, the duration of using the operator's services, the method of payment, and the amount of payment.

## What task has to be completed?
The task is to analyze the data and predict the churn of users (to identify people who will and will not renew their contract). The work should include the following mandatory items:

- Description of the data (with the calculation of basic statistics);
- Research of dependencies and formulation of hypotheses;
- Building models for predicting churn (with justification for the choice of a particular model) based on tested hypotheses and identified relationships;
- Comparison of the quality of the obtained models.

In [1]:
import pandas as pd 
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

import plotly.express as px
from wordcloud import WordCloud
from IPython.display import display, HTML, Javascript

# scaling
from sklearn.preprocessing import StandardScaler

# SMOTE
# from imblearn.over_sampling import SMOTE

# keras
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Activation, Dropout
from tensorflow.keras.callbacks import EarlyStopping

# model Evaluation
from sklearn import metrics
 
# model explainablity
import eli5
from eli5.sklearn import PermutationImportance

# plotly offline
from plotly.offline import download_plotlyjs,init_notebook_mode
init_notebook_mode(connected=True)

# MISC
import warnings
warnings.filterwarnings("ignore")


## Data Cleaning
In this step we check for NaN values, drop attributes that carry no useful information, encode categorical features, and so on.
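
As a rough illustration of these checks, the sketch below counts missing values and one-hot encodes a categorical column. The `toy` frame and its values are hypothetical stand-ins for `telecom_users.csv`, and one-hot encoding via `pd.get_dummies` is only one option (the notebook itself later uses integer category codes).

```python
import pandas as pd

# Hypothetical toy frame standing in for the real dataset
toy = pd.DataFrame({
    "Contract": ["Month-to-month", "One year", None, "Two year"],
    "MonthlyCharges": [29.85, 56.95, 53.85, None],
})

# Count missing values per column
missing_per_column = toy.isna().sum()
print(missing_per_column)

# One-hot encode the categorical column; a row with a missing
# category gets zeros in every indicator column
encoded = pd.get_dummies(toy, columns=["Contract"], dtype=int)
print(encoded.columns.tolist())
```

On the real data, the same two calls would be applied to `df` once it has been loaded.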

In [None]:
df=pd.read_csv('telecom_users.csv')

In [None]:
df.head()

In [None]:
df.info()



In [None]:
df.describe().T

In [None]:
# drop the customer identifier, which carries no predictive information
df = df.drop(columns=['customerID']).reset_index(drop=True)

df.head()

In [None]:
df.Contract.value_counts()

## EDA

In [None]:
# summary statistics for customers who churned
df[df['Churn']=='Yes'][['tenure','MonthlyCharges']].describe()

In [None]:
# summary statistics for customers who stayed
df[df['Churn']=='No'][['tenure','MonthlyCharges']].describe().T

In [None]:
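# pairplot of tenure and monthly charges, colored by churn
# note: matplotlib 3.6+ renamed the 'seaborn-dark' style to 'seaborn-v0_8-dark'
plt.style.use('seaborn-v0_8-dark')
sns.pairplot(df[['tenure','MonthlyCharges','Churn']],hue='Churn',palette='Dark2');
plt.tight_layout();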

In [None]:
# Churn
churn_plot = df['Churn'].value_counts().reset_index()
churn_plot.columns = ['Churn?',"Number_of_customers"]

# plot
px.pie(churn_plot,values ="Number_of_customers",names='Churn?',title='Churn',template='none')

### The dataset is imbalanced: only 26.5% of customers churned, while 73.5% stayed
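
One common way to account for such an imbalance is to weight the classes inversely to their frequency. The sketch below is an assumption about how that could be done, not something this notebook does; the labels in `y_toy` are synthetic and only mimic the proportions reported above.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Synthetic labels with roughly the reported proportions (1 = churned)
y_toy = np.array([1] * 265 + [0] * 735)

classes = np.unique(y_toy)
# "balanced" weights are n_samples / (n_classes * count_of_class)
weights = compute_class_weight(class_weight="balanced", classes=classes, y=y_toy)
class_weight = dict(zip(classes.tolist(), weights.tolist()))
print(class_weight)
```

A dictionary of this shape can be passed as the `class_weight` argument of Keras `model.fit`, which is an alternative to oversampling the minority class.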

In [None]:
# churn Fiber optic users
print('Total Fiber optic users',df[df['InternetService'] =='Fiber optic']['InternetService'].count())
print('\n')
print('No. of Fiber optic users Not Churn',df[(df['Churn'] =='No')& (df['InternetService'] =='Fiber optic')]['InternetService'].count())
print('\n')
print('No. of Fiber optic users Churned',df[(df['Churn'] =='Yes')& (df['InternetService'] =='Fiber optic')]['InternetService'].count())

In [None]:
print('Median monthly charges of Fiber optic customers who stayed',df[(df['Churn'] =='No')& (df['InternetService'] =='Fiber optic')]['MonthlyCharges'].median())
print('Median monthly charges of Fiber optic customers who churned',df[(df['Churn'] =='Yes')& (df['InternetService'] =='Fiber optic')]['MonthlyCharges'].median())
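
The same counts and medians can be obtained for every internet service at once with `pd.crosstab` and `groupby`. This is a minimal sketch on a hypothetical `toy` frame; only the column names are taken from the dataset.

```python
import pandas as pd

# Hypothetical rows; column names mirror the dataset
toy = pd.DataFrame({
    "InternetService": ["Fiber optic", "Fiber optic", "DSL", "Fiber optic", "DSL", "No"],
    "Churn": ["Yes", "No", "No", "Yes", "Yes", "No"],
    "MonthlyCharges": [95.0, 80.0, 50.0, 100.0, 45.0, 20.0],
})

# Customers per service and churn status, with row and column totals
counts = pd.crosstab(toy["InternetService"], toy["Churn"], margins=True)
print(counts)

# Median monthly charge per (service, churn) group
medians = toy.groupby(["InternetService", "Churn"])["MonthlyCharges"].median()
print(medians)
```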

In [None]:
# categorical columns

cat_columns = ['gender', 'SeniorCitizen', 'Partner', 'Dependents',
       'PhoneService', 'MultipleLines', 'InternetService', 'OnlineSecurity',
       'OnlineBackup', 'DeviceProtection', 'TechSupport', 'StreamingTV',
       'StreamingMovies', 'Contract', 'PaperlessBilling', 'PaymentMethod',
       'Churn']

In [None]:

num_columns = ['TotalCharges','MonthlyCharges','tenure']

In [None]:
for feature in cat_columns:
    print('\n ')
    print('*************','Column name:',feature,'*************')
    print('1. Unique values:',df[feature].unique())
    print(' ')
    print('2. Min values:',df[feature].min())
    print(' ')
    print('3. Value counts (%):',df[feature].value_counts(normalize=True)*100)
    print(' ')
    print('**************************************************')
    print('***************-end-******************************')
    print('\n ')


In [2]:

for feature in num_columns:
    print('*******','Column name:',feature,'*******')

    print('Min values:',df[feature].min())
    print('Max values:',df[feature].max())

    print('***********-end-***********')
    print('\n')

In [3]:
df['TotalCharges'].min()


In [4]:
# rows where TotalCharges is a blank string; the column is still string-typed, so its minimum is the blank value

df[df['TotalCharges'] == df['TotalCharges'].min()][0:3]

In [5]:
# replace blank strings with the median of TotalCharges (hard-coded as 2298.06)

df['TotalCharges'] = df['TotalCharges'].replace(' ', 2298.06)

In [6]:
# change data type to float

df['TotalCharges'] =  df['TotalCharges'].astype(float)

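
The blank-string replacement and the type conversion can also be done without hard-coding the median. The sketch below is one possible approach, shown on a hypothetical `charges` series rather than the real column: `pd.to_numeric` with `errors="coerce"` turns unparseable entries into NaN, which are then filled with the median computed from the data.

```python
import pandas as pd

# Hypothetical string-typed column with one blank entry
charges = pd.Series(["29.85", " ", "108.15", "1889.5"])

# Unparseable entries (such as the blank string) become NaN
numeric = pd.to_numeric(charges, errors="coerce")

# Fill the gaps with the median of the valid values
median_value = numeric.median()
cleaned = numeric.fillna(median_value)
print(cleaned)
```

Computing the median this way keeps the fill value in sync with the data if the file changes.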

In [7]:
df[df['tenure']==0][0:3]


In [8]:
# replace zero tenure with the median tenure (hard-coded as 29)
df['tenure'] = df['tenure'].replace(0, 29)

In [9]:
# convert categorical columns to integer codes
for feature in cat_columns:
    df[feature] = pd.Categorical(df[feature]).codes
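
To see what the integer codes mean, it helps to keep the mapping from code to label. The sketch below uses a hypothetical `contract` series. `pd.Categorical` sorts the distinct values, so codes follow alphabetical order rather than any business ordering, and missing values receive the code -1.

```python
import pandas as pd

# Hypothetical values for a single categorical column
contract = pd.Series(["One year", "Month-to-month", "Two year", "Month-to-month"])

cat = pd.Categorical(contract)
codes = cat.codes

# Mapping from integer code back to the original label
code_to_label = dict(enumerate(cat.categories))
print(codes.tolist())
print(code_to_label)
```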

In [10]:
# plot outliers
plt.style.use('fivethirtyeight')
df.plot(kind='box',figsize=(12,4))
plt.xticks(rotation=70);
plt.title('Outliers check');

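
By default, a box plot draws as outliers the points lying more than 1.5 interquartile ranges beyond the quartiles. The same rule can be applied numerically, as in this minimal sketch on a hypothetical `values` series.

```python
import pandas as pd

# Hypothetical numeric column with one extreme value
values = pd.Series([20.0, 25.0, 30.0, 35.0, 40.0, 45.0, 50.0, 300.0])

q1 = values.quantile(0.25)
q3 = values.quantile(0.75)
iqr = q3 - q1

# Points outside these fences are flagged as outliers
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr
outliers = values[(values < lower) | (values > upper)]
print(outliers.tolist())
```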

In [11]:
# split features and target without mutating df
X = df.drop(columns='Churn')
y = df['Churn']

In [12]:
from sklearn.model_selection import train_test_split

# test_size=0.3 holds out 30% of the rows; an integer would mean an absolute row count
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)
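
Because the classes are imbalanced, it may be worth stratifying the split so that both parts keep a similar churn rate. The sketch below uses synthetic arrays `X_toy` and `y_toy`. The `stratify` argument is a suggestion and is not used in the notebook's own split.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data: 100 rows, 25% positive class
X_toy = np.arange(200).reshape(100, 2)
y_toy = np.array([1] * 25 + [0] * 75)

# A float test_size is a fraction of the rows; stratify keeps class proportions
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=1, stratify=y_toy
)
print(len(X_tr), len(X_te), y_tr.mean(), y_te.mean())
```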

In [None]:
# scale features: fit on the training split only, then apply the same transform to the test split

sc = StandardScaler()

X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)
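
The reason for fitting the scaler on the training split alone is that the test split should not influence the statistics used for scaling. A small sketch with hypothetical arrays shows the effect: the training data ends up with zero mean and unit variance, while the test data is shifted and scaled by the training statistics.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical train and test arrays with two features
train = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
test = np.array([[2.0, 40.0]])

scaler = StandardScaler()
train_scaled = scaler.fit_transform(train)  # learns mean and std from train
test_scaled = scaler.transform(test)        # reuses the train statistics

print(train_scaled.mean(axis=0), train_scaled.std(axis=0))
print(test_scaled)
```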