### Churn Analysis

- 1 Data Cleaning (3 marks)
    - 1.1 Load the data and check if it consist of any missing data or not? (3 marks)

- 2 Data Preparation & Analysis (22 marks)
    - 2.1 Drop variables that will not be used for the classification model: state, area code, phone number, customer service calls (3 marks)
    - 2.2 Replace yes with 1 and no with 0 for the following columns: international plan, voice mail plan (2 marks)
    - 2.3 Split the data into X and y, where X will have all the independent features and y will have the dependent feature(churn) (2 marks)
    - 2.4 Check the imbalance percentage: what percentage of churn customer we have in the data? (3 marks)
    - 2.5 Randomly split the data into train and test. Use the following paramters: train_size=0.7, test_size=0.3, random_state=100 (2 marks)
    - 2.6 Use min-max scaler to fit_transform the train data and transform the test data 
    - 2.7 Perfrom Logistic Regression using the SKLean package: Use the following paramters with the Logistic Regression model:max_iter = 1000, class_weight = 'balanced' (2 marks)
    - 2.8 Train the model on train data and check the sensitivity of the model in the test data (3 marks)
    - 2.9 Find the top positive influencing features and top negative influencing features using the coefficients 
    - 2.10 Explain the features and their importance to FREECELL based on the results generated in the last step (5 marks)
    

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

In [2]:
# Read the file

churn =pd.read_csv(r"telcom.csv")

# Checking top 5 rows
churn.head()

Unnamed: 0,state,account length,area code,phone number,international plan,voice mail plan,number vmail messages,total day minutes,total day calls,total day charge,...,total eve calls,total eve charge,total night minutes,total night calls,total night charge,total intl minutes,total intl calls,total intl charge,customer service calls,churn
0,KS,128,415,382-4657,no,yes,25,265.1,110,45.07,...,99,16.78,244.7,91,11.01,10.0,3,2.7,1,False
1,OH,107,415,371-7191,no,yes,26,161.6,123,27.47,...,103,16.62,254.4,103,11.45,13.7,3,3.7,1,False
2,NJ,137,415,358-1921,no,no,0,243.4,114,41.38,...,110,10.3,162.6,104,7.32,12.2,5,3.29,0,False
3,OH,84,408,375-9999,yes,no,0,299.4,71,50.9,...,88,5.26,196.9,89,8.86,6.6,7,1.78,2,False
4,OK,75,415,330-6626,yes,no,0,166.7,113,28.34,...,122,12.61,186.9,121,8.41,10.1,3,2.73,3,False


In [3]:
# Analysis- 1.1
# Check if there are any missing values for the churn dataframe or not?
# print the count of nulls for all the columns

# Write your code here
churn.isnull().sum()

state                     0
account length            0
area code                 0
phone number              0
international plan        0
voice mail plan           0
number vmail messages     0
total day minutes         0
total day calls           0
total day charge          0
total eve minutes         0
total eve calls           0
total eve charge          0
total night minutes       0
total night calls         0
total night charge        0
total intl minutes        0
total intl calls          0
total intl charge         0
customer service calls    0
churn                     0
dtype: int64

In [4]:
# Analysis- 2.1
# Here we will be dropping those variables which will not be useful for building up the classification model.
# Write your code to drop the following columns: (state, area code, phone number, customer service calls) from the churn dataframe
# update the churn dataframe such that it doesn't contain the above mentioned columns

# Write your code here
# Hint: https://stackoverflow.com/a/37069701
churn.drop(['state','area code','phone number','customer service calls'], axis='columns', inplace=True)

In [5]:
churn.head()

Unnamed: 0,account length,international plan,voice mail plan,number vmail messages,total day minutes,total day calls,total day charge,total eve minutes,total eve calls,total eve charge,total night minutes,total night calls,total night charge,total intl minutes,total intl calls,total intl charge,churn
0,128,no,yes,25,265.1,110,45.07,197.4,99,16.78,244.7,91,11.01,10.0,3,2.7,False
1,107,no,yes,26,161.6,123,27.47,195.5,103,16.62,254.4,103,11.45,13.7,3,3.7,False
2,137,no,no,0,243.4,114,41.38,121.2,110,10.3,162.6,104,7.32,12.2,5,3.29,False
3,84,yes,no,0,299.4,71,50.9,61.9,88,5.26,196.9,89,8.86,6.6,7,1.78,False
4,75,yes,no,0,166.7,113,28.34,148.3,122,12.61,186.9,121,8.41,10.1,3,2.73,False


In [6]:
# Analysis- 2.2
# Here we will be replacing the values yes and no with 1 and 0 respectively for the columns 'customer service calls' and 'voice mail plan'
# This is required because, we will not be able to train the model with string values 'yes' and 'no'

# Write your code here
# Hint: https://stackoverflow.com/a/40901792
from sklearn.preprocessing import LabelEncoder
le=LabelEncoder()
churn[churn.select_dtypes(include=['object']).columns]=churn[churn.select_dtypes(include=['object']).columns].apply(le.fit_transform)

In [7]:
churn.head()

Unnamed: 0,account length,international plan,voice mail plan,number vmail messages,total day minutes,total day calls,total day charge,total eve minutes,total eve calls,total eve charge,total night minutes,total night calls,total night charge,total intl minutes,total intl calls,total intl charge,churn
0,128,0,1,25,265.1,110,45.07,197.4,99,16.78,244.7,91,11.01,10.0,3,2.7,False
1,107,0,1,26,161.6,123,27.47,195.5,103,16.62,254.4,103,11.45,13.7,3,3.7,False
2,137,0,0,0,243.4,114,41.38,121.2,110,10.3,162.6,104,7.32,12.2,5,3.29,False
3,84,1,0,0,299.4,71,50.9,61.9,88,5.26,196.9,89,8.86,6.6,7,1.78,False
4,75,1,0,0,166.7,113,28.34,148.3,122,12.61,186.9,121,8.41,10.1,3,2.73,False


In [25]:
churn.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3333 entries, 0 to 3332
Data columns (total 17 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   account length         3333 non-null   int64  
 1   international plan     3333 non-null   int32  
 2   voice mail plan        3333 non-null   int32  
 3   number vmail messages  3333 non-null   int64  
 4   total day minutes      3333 non-null   float64
 5   total day calls        3333 non-null   int64  
 6   total day charge       3333 non-null   float64
 7   total eve minutes      3333 non-null   float64
 8   total eve calls        3333 non-null   int64  
 9   total eve charge       3333 non-null   float64
 10  total night minutes    3333 non-null   float64
 11  total night calls      3333 non-null   int64  
 12  total night charge     3333 non-null   float64
 13  total intl minutes     3333 non-null   float64
 14  total intl calls       3333 non-null   int64  
 15  tota

In [8]:
# Ananlysis- 2.3
# Create the variable X and y
# X will have all the columns from the dataframe churn while y will only have the 'churn' column from the dataframe churn

# Write your code here
X=churn.iloc[:,0:16]
Y=churn.iloc[:,-1]

In [29]:
# Analysis- 2.4
# Find out the percentage or rows where churn=1 and churn=0. If the percentage is anything other than 50%-50%,
# we call that data as an imbalace data

# Write your code here
churn.churn.value_counts

<bound method IndexOpsMixin.value_counts of 0       False
1       False
2       False
3       False
4       False
        ...  
3328    False
3329    False
3330    False
3331    False
3332    False
Name: churn, Length: 3333, dtype: bool>

In [13]:
# Analysis- 2.5
# Here we will be splitting the data into train and test
# The train data will be having 70% of the churn data while test will have 30% of the churn data
# Use the following paramters
# train_size=0.7, test_size=0.3, random_state=100

from sklearn.model_selection import train_test_split
# Write your code here
X_train,X_test,y_train,y_test=train_test_split(X,Y,test_size=0.2,random_state=100)

# Hint: https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html

In [14]:
# Analysis- 2.6
# Here we will be scaling our complete X_train and X_test data using min-max scaler
# The code has already been provided to you, you are not required to write anything over here
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()

X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

In [15]:
# Analysis- 2.7
# Here we will training our Logistic Regression model.
# Please use the following paramters with the LogisticRegression function: max_iter = 1000, class_weight = 'balanced'

from sklearn.linear_model import LogisticRegression
from sklearn import metrics

logreg = LogisticRegression(max_iter=1000,class_weight='balanced')

logreg.fit(X_train, y_train)
y_pred = logreg.predict(X_test)

print(metrics.accuracy_score(y_test, y_pred))
print(metrics.recall_score(y_test, y_pred))
metrics.confusion_matrix(y_test, y_pred)

0.7451274362818591
0.7027027027027027


array([[445, 148],
       [ 22,  52]], dtype=int64)

In [18]:
# Analysis-2.8
# Here we need to check the performance of the trained model
# Check the accuracy score
# Check the recall score
# Print the confusion matrix

# Write your code here
from sklearn.metrics import confusion_matrix
tab2=confusion_matrix(y_pred,y_test)
tab2                                   #confusion matrix
tab2.diagonal().sum()*100/tab2.sum()   #accuracy

74.5127436281859

In [19]:
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score
from sklearn.metrics import f1_score
from sklearn.metrics import classification_report

In [20]:
print(classification_report(y_pred,y_test))

              precision    recall  f1-score   support

       False       0.75      0.95      0.84       467
        True       0.70      0.26      0.38       200

    accuracy                           0.75       667
   macro avg       0.73      0.61      0.61       667
weighted avg       0.74      0.75      0.70       667



In [21]:
# Analysis-2.9
# Here we will be printing the coefficients of the model together with the variables
# You are not required to write anything here. The code is already provided
feature_importance = pd.DataFrame({"Feature":X.columns.tolist(),"Coefficients":logreg.coef_[0]})
feature_importance.sort_values(by = 'Coefficients')

Unnamed: 0,Feature,Coefficients
2,voice mail plan,-1.370603
14,total intl calls,-1.19687
11,total night calls,0.076786
0,account length,0.204462
13,total intl minutes,0.395082
15,total intl charge,0.406198
12,total night charge,0.43645
10,total night minutes,0.436462
5,total day calls,0.436752
8,total eve calls,0.456365
