<center> <img src="https://images.squarespace-cdn.com/content/v1/588f9607bebafbc786f8c5f8/1607924812500-Y1JR8L6XP5NKF2YPHDUX/image6.png?format=1000w"> </center>

### About Dataset

**Context**
* Predict behavior to retain customers. You can analyze all relevant customer data and develop focused customer retention programs.

**Content**
* Each row represents a customer; each column contains one customer attribute, as described in the column metadata.

**The data set includes information about:**

* Customers who left within the last month – the column is called Churn
* Services that each customer has signed up for – phone, multiple lines, internet, online security, online backup, device protection, tech support, and streaming TV and movies
* Customer account information – how long they’ve been a customer, contract, payment method, paperless billing, monthly charges, and total charges
* Demographic info about customers – gender, age range, and if they have partners and dependents

### <center> Notebook Code starts here </center>

* Importing all the required libraries in this notebook

In [1]:
import numpy as np
import pandas as pd  # needed for pd.read_csv, pd.to_numeric and pd.get_dummies below
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import plotly.graph_objects as go  # needed for the go.Pie charts below
import scipy.stats as st
from statsmodels.graphics.gofplots import qqplot
from imblearn.combine import SMOTEENN
import pickle
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix, classification_report, accuracy_score
import warnings
warnings.filterwarnings('ignore')

* It is good practice to work on a copy of the original dataset, so the raw data stays untouched.

In [None]:
main_df = pd.read_csv("CusomterChurn.csv")
df = main_df.copy()
df.head()

* Dimensions of the dataset

In [None]:
df.shape

* All the columns present in the dataframe

In [None]:
df.columns

* Basic information about the dataset

In [None]:
df.info()

* Number of unique values in each column

In [None]:
df.nunique()

* Statistical description of the dataset

In [None]:
df.describe()

* Dropping customerID, an identifier column that is not used as a feature.

In [None]:
df = df.drop('customerID', axis=1)
df.head()

* Converting the "TotalCharges" column to a numeric type because it is initially stored as object type. With errors="coerce", values that cannot be parsed become NaN.

In [None]:
df['TotalCharges'] = pd.to_numeric(df['TotalCharges'], errors="coerce")
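
The effect of `errors="coerce"` can be illustrated without pandas. The sketch below is a plain-Python stand-in, not the pandas implementation; the helper name `coerce_to_float` and the sample values are made up for illustration. Any entry that cannot be parsed as a number becomes NaN, which is why the null check that follows is needed.

```python
import math

def coerce_to_float(value):
    """Return value as a float, or NaN when it cannot be parsed (hypothetical helper)."""
    try:
        return float(value)
    except (TypeError, ValueError):
        return math.nan

# Made-up sample: two valid amounts, one blank string, one missing value
raw_charges = ["29.85", "1889.5", " ", None]
parsed = [coerce_to_float(v) for v in raw_charges]
n_missing = sum(math.isnan(v) for v in parsed)
print(parsed, n_missing)
```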

* Checking for any null value in the dataset.

In [None]:
df.isnull().sum()

* Dropping the rows with missing values in "TotalCharges"

In [3]:
df.drop(df[df['TotalCharges'].isnull()].index, inplace=True)
df.reset_index(drop=True, inplace=True)

In [None]:
df.head()

In [None]:
df.shape

* Visualizing the null values.

In [None]:
sns.heatmap(df.isnull())

* Checking the correlation between the numeric variables.

In [None]:
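# numeric_only=True skips the object columns, which recent pandas versions refuse to correlate
sns.heatmap(df.corr(numeric_only=True), annot=True)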

* Checking the skewness of each numeric variable
* SeniorCitizen is highly right-skewed
* TotalCharges is moderately right-skewed

In [None]:
skew_val = df.skew(numeric_only=True).sort_values(ascending=False)
skew_val
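
To make the skew values concrete, here is a minimal plain-Python sketch of a sample skewness coefficient. It uses the adjusted Fisher-Pearson formula, which is assumed (not verified here) to correspond to the unbiased estimate that pandas reports; the function name and the sample data are made up. A value near 0 indicates a symmetric distribution and a positive value indicates a long right tail. A common rule of thumb treats values above 1 as highly skewed and values between 0.5 and 1 as moderately skewed.

```python
import math

def sample_skewness(values):
    """Adjusted Fisher-Pearson skewness coefficient of a list of numbers."""
    n = len(values)
    if n < 3:
        raise ValueError("need at least three values")
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n  # second central moment
    m3 = sum((v - mean) ** 3 for v in values) / n  # third central moment
    g1 = m3 / m2 ** 1.5                            # biased skewness
    return math.sqrt(n * (n - 1)) / (n - 2) * g1   # small-sample adjustment

symmetric = [1, 2, 3, 4, 5]
right_tailed = [1, 1, 1, 2, 2, 3, 10]  # made-up data with a long right tail

print(sample_skewness(symmetric))
print(sample_skewness(right_tailed))
```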

### Uni-variate Analysis

In [None]:
sns.displot(x="TotalCharges", data=df, kde=True)
description = df['TotalCharges'].describe()
plt.axvline(description["25%"], ls="--", color='r')
plt.axvline(description["mean"], ls="--", color='r')
plt.axvline(description["75%"], ls="--", color='r')

* The plot above shows that TotalCharges is right-skewed.

In [None]:
df['TotalCharges'].skew()

In [None]:
fig, (ax1,ax2) = plt.subplots(ncols=2, figsize=(15,4))
# histplot replaces the deprecated distplot
sns.histplot(df['TotalCharges'], kde=True, stat='density', ax=ax1, color='red')
ax1.set(title='TotalCharges distribution')
qqplot(df['TotalCharges'], ax=ax2, line='s')
ax2.set(title='Quantile quantile plot')

In [None]:
# MonthlyCharges

In [None]:
sns.displot(x="MonthlyCharges", data=df, kde=True)
description = df['MonthlyCharges'].describe()
plt.axvline(description["25%"], ls="--", color='r')
plt.axvline(description["mean"], ls="--", color='r')
plt.axvline(description["75%"], ls="--", color='r')

In [None]:
# df['MonthlyCharges'].skew()

In [None]:
fig, (ax1,ax2) = plt.subplots(ncols=2, figsize=(15,4))
# histplot replaces the deprecated distplot
sns.histplot(df['MonthlyCharges'], kde=True, stat='density', ax=ax1, color='red')
ax1.set(title='MonthlyCharges distribution')
qqplot(df['MonthlyCharges'], ax=ax2, line='s')
ax2.set(title='Quantile quantile plot')

In [None]:
# for i, predictor in enumerate(df.columns):
#     plt.figure(i)
#     sns.countplot(data=df, x=predictor, hue='Churn')

In [None]:
# Bi-variate analysis

In [None]:
plt.figure(figsize=(5, 5))
sns.barplot(data = df, y="TotalCharges", x="Churn")

In [None]:
px.scatter(df, y="TotalCharges", x="tenure")

In [None]:
diag = px.histogram(df, x="Churn", color="SeniorCitizen")
# diag.update_layout(width=750, height=550)
diag.show()

In [None]:
df.info()

In [None]:
diag = px.histogram(df, x="Churn", color="gender")
diag.show()

In [None]:
diag = px.histogram(df, x="Churn", color="PhoneService")
diag.show()

In [None]:
diag = px.pie(df, values='TotalCharges', names='Churn', hole=0.5)
diag.show()

In [None]:
# Take labels and values from the same value_counts() result so they stay aligned
counts = df['MultipleLines'].value_counts()

# pull is given as a fraction of the pie radius
diag = go.Figure(data=[go.Pie(labels=counts.index, values=counts.values, pull=[0, 0.1, 0.2])])
diag.show()

In [None]:
# Take labels and values from the same value_counts() result so they stay aligned
counts = df['InternetService'].value_counts()

diag = go.Figure(data=[go.Pie(labels=counts.index, values=counts.values, pull=[0, 0.2, 0.3])])
diag.show()

In [None]:
# Take labels and values from the same value_counts() result so they stay aligned
counts = df['PaymentMethod'].value_counts()

diag = go.Figure(data=[go.Pie(labels=counts.index, values=counts.values, pull=[0, 0, 0.2, 0])])
diag.show()

In [None]:
# Take labels and values from the same value_counts() result so they stay aligned
counts = df['Contract'].value_counts()

diag = go.Figure(data=[go.Pie(labels=counts.index, values=counts.values, pull=[0, 0.2, 0.3])])
diag.show()

In [None]:
print (df['Partner'].value_counts(ascending=True))

In [None]:
for i in df.columns:
    if df[i].dtypes=="object":
        print(f'{i} : {df[i].unique()}')
        print("****************************************************")

In [None]:
df.replace('No internet service', 'No', inplace=True)
df.replace('No phone service', 'No', inplace=True)

In [None]:
for i in df.columns:
    if df[i].dtypes=="object":
        print(f'{i} : {df[i].unique()}')
        print("****************************************************")

In [None]:
print(df['gender'].value_counts(ascending=True))

In [None]:
print(df['Churn'].value_counts(ascending=True))

In [None]:
# Assign the result back instead of using inplace=True on a selected column
df['gender'] = df['gender'].replace({'Female': 1, 'Male': 0})
df.head()

In [None]:
print(df['InternetService'].value_counts(ascending=True))

In [None]:
for i in df.columns:
    if (len(df[i].unique()) >2) & (df[i].dtypes != "int64") &(df[i].dtypes!= "float64"):
        print(i)

In [None]:
print(df['InternetService'].value_counts(ascending=True))

In [None]:
print(df['Contract'].value_counts(ascending=True))

In [None]:
print(df['PaymentMethod'].value_counts(ascending=True))

In [None]:
more_than_2 = ['InternetService' ,'Contract' ,'PaymentMethod']
df = pd.get_dummies(data=df, columns= more_than_2,drop_first=True )
df.dtypes
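
With `drop_first=True`, the first category of each encoded column gets no dummy column of its own and is represented by all zeros. The plain-Python sketch below illustrates the idea; `one_hot_drop_first` is a hypothetical helper, and sorting the categories alphabetically to choose the baseline is an assumption about how the encoding behaves.

```python
def one_hot_drop_first(values, prefix):
    """One-hot encode a list of category labels, dropping the first sorted category."""
    categories = sorted(set(values))
    kept = categories[1:]  # the first category becomes the all-zeros baseline
    columns = [f"{prefix}_{c}" for c in kept]
    rows = [[int(v == c) for c in kept] for v in values]
    return columns, rows

contracts = ["Month-to-month", "Two year", "One year", "Month-to-month"]
columns, rows = one_hot_drop_first(contracts, "Contract")
print(columns)
for row in rows:
    print(row)
```

Under this assumption "Month-to-month" is the baseline for Contract, which matters later: a new customer on that contract must be encoded with both Contract dummy columns set to 0.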

In [None]:
# df.drop(['DSL', 'Month-to-month'], axis = 1)

In [None]:
df.shape

In [None]:
df.columns

In [None]:
for i in df.columns:
    if (df[i].dtypes == "int64")  | (df[i].dtypes== "float64"):
        print(i)

In [None]:
df.head()

In [None]:
# from sklearn.preprocessing import MinMaxScaler
# scaler = MinMaxScaler()

In [None]:
# large_cols = ["tenure", "MonthlyCharges", "TotalCharges"]
# df[large_cols] = scaler.fit_transform(df[large_cols])
# df[large_cols].head()

In [None]:
# Feature scaling is commented out above, so the dataset is unchanged at this point

df.head()

In [None]:
for i in df.columns:
    if (df[i].dtypes == "object"):
        print(i)

In [None]:
two_cate = ['Partner', 'Dependents', 'PhoneService', 'MultipleLines', 'OnlineSecurity', 'OnlineBackup', 'DeviceProtection', 'TechSupport', 'StreamingTV', 'StreamingMovies', 'PaperlessBilling', 'Churn']
for i in two_cate:
    # Assign the result back instead of using inplace=True on a selected column
    df[i] = df[i].replace({"No": 0, "Yes": 1})
df.head()

In [None]:
df.info()

In [None]:
df.shape

In [None]:
X = df.drop('Churn', axis=1)
y = df['Churn']

In [None]:
X.shape, y.shape

In [None]:
from imblearn.over_sampling import SMOTE
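
SMOTE balances the classes by adding synthetic minority-class rows, each placed between an existing minority sample and one of its nearest minority neighbours. The sketch below shows only that interpolation step in plain Python. The neighbour search is omitted, the names and values are made up, and it is not the imblearn implementation.

```python
import random

def interpolate(sample, neighbour, rng):
    """Create one synthetic point on the line segment between two minority samples."""
    gap = rng.random()  # position along the segment, in [0, 1)
    return [s + gap * (n - s) for s, n in zip(sample, neighbour)]

rng = random.Random(0)    # fixed seed so the sketch is repeatable
minority_a = [2.0, 30.0]  # made-up (tenure, MonthlyCharges) pairs
minority_b = [6.0, 50.0]
synthetic = interpolate(minority_a, minority_b, rng)
print(synthetic)
```

Because every synthetic row is derived from existing rows, oversampling is usually applied to the training data only, so that the test set contains real customers exclusively.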

In [None]:
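oversample = SMOTE()

In [None]:
# Split first so that the test set contains only real customers
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=0, stratify=y)

In [None]:
# Oversample the training portion only, so no synthetic rows derived from test rows leak into training
X_train, y_train = oversample.fit_resample(X_train, y_train)

In [None]:
X_train.shape, y_train.shape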

In [None]:
model_rf=RandomForestClassifier(criterion = "gini",random_state = 10,max_depth=10, min_samples_leaf=5)

In [None]:
model_rf.fit(X_train,y_train)

In [None]:
pred_rf = model_rf.predict(X_test)

In [None]:
rf  = round(accuracy_score(y_test, pred_rf)*100, 2)
print(rf)    

In [None]:
# Training accuracy, for comparison with the test accuracy above
pred = model_rf.predict(X_train)
print(accuracy_score(y_train, pred))

In [None]:
print(classification_report(y_test, pred_rf))

In [None]:
# Confusion matrix, shown as a share of all test samples
cm = confusion_matrix(y_test, pred_rf)
sns.heatmap(cm/np.sum(cm), annot = True, fmt=  '0.2%', cmap = 'Reds')
plt.title("Random Forest Confusion Matrix",fontsize=12)
plt.show()
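
The accuracy, classification report, and confusion matrix all derive from the same four counts. The plain-Python sketch below works through them on made-up labels; `confusion_counts` is a hypothetical helper, not part of scikit-learn.

```python
def confusion_counts(y_true, y_pred):
    """Count true/false positives and negatives for binary labels (1 = churn)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == 1 and p == 1 for t, p in pairs)
    tn = sum(t == 0 and p == 0 for t, p in pairs)
    fp = sum(t == 0 and p == 1 for t, p in pairs)
    fn = sum(t == 1 and p == 0 for t, p in pairs)
    return tp, tn, fp, fn

# Made-up labels for illustration
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

tp, tn, fp, fn = confusion_counts(y_true, y_pred)
accuracy = (tp + tn) / len(y_true)
precision = tp / (tp + fp)  # share of predicted churners who actually churned
recall = tp / (tp + fn)     # share of actual churners the model caught
print(tp, tn, fp, fn, accuracy, precision, recall)
```

For churn prediction, recall on the churn class is often the number to watch, since a missed churner is a customer no retention program will reach.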

In [None]:
filename = 'model.sav'

In [None]:
# Use a context manager so the file is closed after writing
with open(filename, 'wb') as f:
    pickle.dump(model_rf, f)

In [None]:
with open(filename, 'rb') as f:
    load_model = pickle.load(f)

In [None]:
model_score = load_model.score(X_test, y_test)

In [None]:
model_score

## Prediction

In [None]:
gender = ['Female', 'Male']
Partner = ['Yes','No']
Dependents = ['Yes','No']
PhoneService = ['Yes','No']
MultipleLines = ['Yes','No']
OnlineBackup = ['Yes','No']
DeviceProtection = ['Yes','No']
InternetService = ['No', 'DSL','Fiber optic']
OnlineSecurity = ['Yes','No']
TechSupport  = ['Yes','No']
StreamingTV  = ['Yes','No']
StreamingMovies  = ['Yes','No']
Contract =  ['Month-to-month', 'One year' ,'Two year']
PaperlessBilling = ['Yes','No']
PaymentMethod  =  ['Bank transfer (automatic)', 'Credit card (automatic)' , 'Mailed check', 'Electronic check']

In [None]:
gender = pd.DataFrame(gender)
Partner = pd.DataFrame(Partner)
Dependents =  pd.DataFrame(Dependents)
PhoneService =  pd.DataFrame(PhoneService)
MultipleLines =  pd.DataFrame(MultipleLines)
OnlineBackup = pd.DataFrame(OnlineBackup)
DeviceProtection =  pd.DataFrame(DeviceProtection)
InternetService =  pd.DataFrame(InternetService)
OnlineSecurity =  pd.DataFrame(OnlineSecurity)
TechSupport  = pd.DataFrame(TechSupport)
StreamingTV  = pd.DataFrame(StreamingTV)
StreamingMovies  =  pd.DataFrame(StreamingMovies)
Contract =   pd.DataFrame(Contract)
PaperlessBilling =  pd.DataFrame(PaperlessBilling)
PaymentMethod  =   pd.DataFrame(PaymentMethod)

In [None]:
# SeniorCitizen = [0 , 1]

In [None]:
# Prediction

In [None]:
data = pd.concat([gender,Partner,Dependents,PhoneService , MultipleLines, OnlineBackup,DeviceProtection, InternetService,OnlineSecurity ,TechSupport, StreamingTV, StreamingMovies, Contract, PaperlessBilling, PaymentMethod],axis=1)
data

In [None]:
data.columns=['gender', 'Partner', 'Dependents', 'PhoneService', 'MultipleLines', 'OnlineBackup', 'DeviceProtection', 'InternetService', 'OnlineSecurity','TechSupport', 'StreamingTV', 'StreamingMovies', 'Contract', 'PaperlessBilling', 'PaymentMethod']
data.head()

In [None]:
data.info()

In [4]:
# data.isnull().sum()

data = data.fillna(0)

In [None]:
data.isnull().sum()

In [None]:
two_cat = ['Partner', 'Dependents', 'PhoneService', 'MultipleLines', 'OnlineSecurity', 'OnlineBackup', 'DeviceProtection', 'TechSupport', 'StreamingTV', 'StreamingMovies', 'PaperlessBilling']
for i in two_cat:
    # Assign the result back instead of using inplace=True on a selected column
    data[i] = data[i].replace({"No": 0, np.nan: 0, "Yes": 1})
data.head()

In [None]:
data['gender'] = data['gender'].replace({"Male": 0, "Female": 1})
data

In [None]:
data.info()

In [None]:
list(data.select_dtypes(include=['object']).columns)

In [None]:
more_than2_val =  ['InternetService' ,'Contract' ,'PaymentMethod']
data = pd.get_dummies(data=data, columns = more_than2_val)
data.dtypes


In [None]:
# Drop the baseline category of each feature, matching drop_first=True used for training
data = data.drop(columns=['InternetService_DSL', 'Contract_Month-to-month', 'PaymentMethod_Bank transfer (automatic)'])

In [None]:
# data

In [None]:
# categorical_cols = ['gender', 'Partner', 'Dependents', 'PhoneService', 'MultipleLines', 'OnlineBackup', 'DeviceProtection', 'InternetService', 'OnlineSecurity','TechSupport', 'StreamingTV', 'StreamingMovies', 'Contract', 'PaperlessBilling', 'PaymentMethod']
# data = pd.get_dummies(data, columns = categorical_cols, drop_first = True )
# data.head()

In [None]:
data.info()

In [None]:
# Copy the first row so later edits do not touch `data`
dat = data.head(1).copy()
dat

In [None]:
# Reset every feature to 0 to get an empty template row
for col in dat.columns:
    dat[col] = 0
dat

In [None]:
tenure = int(input('Enter tenure : '))
MonthlyCharges = float(input('Enter MonthlyCharges : '))
TotalCharges = float(input('Enter TotalCharges : '))
SeniorCitizen = int(input('Enter SeniorCitizen (0 or 1) : '))

gender = input('Enter gender Male or Female : ')
Partner = input('Enter Partner (Yes or No) : ')
Dependents = input('Enter Dependents (Yes or No) : ')
PhoneService = input('Enter PhoneService (Yes or No) : ')
MultipleLines = input('Enter MultipleLines (Yes or No) : ')
OnlineBackup = input('Enter OnlineBackup (Yes or No) : ')
DeviceProtection = input('Enter DeviceProtection (Yes or No) : ')
OnlineSecurity = input('Enter OnlineSecurity (Yes or No) : ')
TechSupport = input('Enter TechSupport (Yes or No) : ')
StreamingTV = input('Enter StreamingTV (Yes or No) : ')
StreamingMovies = input('Enter StreamingMovies (Yes or No) : ')
PaperlessBilling = input('Enter PaperlessBilling (Yes or No) : ')

In [None]:
InternetService =  input('Enter "DSL" or "Fiber optic" or "No"  : ')
Contract =    input('Enter "Month-to-month" or "One year" or "Two year"  : ')
PaymentMethod  =  input('Enter "Electronic check" or "Mailed check" or "Credit card (automatic)"  or "Bank transfer (automatic)" : ')

In [None]:
# data

In [None]:
# more_than_2_again = ['InternetService' ,'Contract' ,'PaymentMethod']
# data = pd.get_dummies(data, columns = more_than_2_again, drop_first = True )
# data.head()

In [None]:
new_number=[tenure,MonthlyCharges,TotalCharges,SeniorCitizen,gender,Partner,Dependents,PhoneService , MultipleLines, OnlineBackup,DeviceProtection, InternetService,OnlineSecurity ,TechSupport, StreamingTV, StreamingMovies, Contract, PaperlessBilling, PaymentMethod]
data_new = pd.DataFrame(new_number)
data_new=data_new.T

data_new.columns = ['tenure','MonthlyCharges','TotalCharges','SeniorCitizen','gender','Partner','Dependents','PhoneService' , 'MultipleLines', 'OnlineBackup','DeviceProtection', 'InternetService','OnlineSecurity' ,'TechSupport', 'StreamingTV', 'StreamingMovies', 'Contract', 'PaperlessBilling', 'PaymentMethod']
data_new

In [None]:
data_new.shape

In [None]:
# new_number=[tenure,MonthlyCharges,TotalCharges,SeniorCitizen,gender,Partner,Dependents,PhoneService , MultipleLines, OnlineBackup,DeviceProtection, InternetService,OnlineSecurity ,TechSupport, StreamingTV, StreamingMovies, Contract, PaperlessBilling, PaymentMethod]
# data_new = pd.DataFrame(new_number)
# data_new=data_new.T
# data_new

In [None]:
# data_new['gender'][0]

In [None]:
data_new.columns = ['tenure','MonthlyCharges','TotalCharges','SeniorCitizen','gender', "Partner",  'Dependents', 'PhoneService', 'MultipleLines', 'OnlineBackup', 'DeviceProtection', f"InternetService_{data_new['InternetService'][0]}",  'OnlineSecurity', 'TechSupport',  'StreamingTV',  'StreamingMovies',  f"Contract_{data_new['Contract'][0]}", 'PaperlessBilling', f"PaymentMethod_{data_new['PaymentMethod'][0]}"]
data_new

In [None]:
data_new.shape

In [None]:
num=data_new.iloc[:,0:4]
cat=data_new.iloc[:,4:]

In [None]:
# from sklearn.preprocessing import MinMaxScaler
# scale = MinMaxScaler()

In [None]:
# large_num = ["tenure", "MonthlyCharges", "TotalCharges"]
# num[large_num] = scale.fit_transform(num[large_num])
# num[large_num]

In [None]:
num

In [None]:
cat

In [None]:
lis=[]
for i in cat.columns:
    if i.endswith("DSL") or i.endswith('Month-to-month') or i.endswith('Bank transfer (automatic)') :
        print("true ")
        lis.append(i)
    else:
        print("false")
        
print(lis)

In [None]:
cat=cat.drop(columns=lis, axis=1)

In [None]:
cat

In [None]:
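for col in cat.columns:
    # Dummy columns ("<feature>_<value>") are switched on; binary columns use the training encoding
    is_dummy = col.startswith(("InternetService_", "Contract_", "PaymentMethod_"))
    dat[col] = 1 if is_dummy else int(cat[col].iloc[0] in ("Yes", "Female"))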

In [None]:
dat.info()

In [None]:
new=pd.concat([num,dat], axis=1)

In [None]:
new

In [None]:
new.info()

In [None]:
new.shape

In [None]:
li=[]
for i in new.columns:
    if i.endswith("0"):
        print("true ")
        li.append(i)
    else:
        print("false")
        
print(li)

In [None]:
new=new.drop(columns=li, axis=1)
# new.info()

In [None]:
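# Reorder the columns to match the training features and make them numeric before predicting
print(model_rf.predict(new[X.columns].astype(float)))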
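
The steps above amount to building a single row whose columns match the training features in name and order. A more compact way to express the same idea is sketched below in plain Python. `encode_customer` and the shortened column list are hypothetical, and the Yes/Female to 1 encoding follows the replacements used earlier in the notebook.

```python
def encode_customer(raw, feature_columns):
    """Build one numeric row ordered like feature_columns (hypothetical helper)."""
    row = dict.fromkeys(feature_columns, 0)  # start with every feature switched off
    for name, value in raw.items():
        dummy = f"{name}_{value}"
        if dummy in row:                     # one-hot feature, e.g. "Contract_Two year"
            row[dummy] = 1
        elif name in row:
            if value in ("Yes", "Female"):   # binary features, encoded as in training
                row[name] = 1
            elif value in ("No", "Male"):
                row[name] = 0
            else:
                row[name] = float(value)     # numeric features such as tenure
    return [row[c] for c in feature_columns]

# Shortened, made-up column list; in the notebook this would be list(X.columns)
feature_columns = ["gender", "tenure", "Partner", "Contract_One year", "Contract_Two year"]
raw = {"gender": "Male", "tenure": "12", "Partner": "Yes", "Contract": "Two year"}
encoded = encode_customer(raw, feature_columns)
print(encoded)
```

A baseline category such as "Month-to-month" matches no dummy column, so it naturally leaves all Contract columns at 0. In the notebook, the resulting list could be wrapped as `pd.DataFrame([encoded], columns=X.columns)` before being passed to `model_rf.predict`.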