# Predictive Analysis Basics

###### Problem Statement:
You are a data scientist at the telecom company "Amdocs", whose customers are churning to its competitors. Your task is to analyze the data, find insights, and help stop customers from churning to competitors.

###### Tasks to be done:
1. Data Manipulation
2. Data Visualization

###### ML models:
3. Linear regression
4. Logistic Regression
5. Decision Tree
6. Random Forest

### 1. Data Manipulation

In [None]:
# importing libraries
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt

In [None]:
# Loading customer churn data set into cust_churn dataframe:
cust_churn = pd.read_csv("../input/telco-customer-churn/WA_Fn-UseC_-Telco-Customer-Churn.csv")

In [None]:
# Top 5 records:
cust_churn.head()

We need the 5th column (the Dependents column) for our analysis, so we extract it into a separate Series named c_5.

In [None]:
# Label-based selection would be: c_5 = cust_churn['Dependents'] (or cust_churn.loc[:, 'Dependents'])
# Below we use integer-position indexing instead (syntax -> df.iloc[rows, columns])
c_5 = cust_churn.iloc[:,4]

In [None]:
c_5.head()

In [None]:
# Similarly, we extract the 15th column:
c_15 = cust_churn.iloc[:,14]
c_15.head()
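
The difference between position-based and label-based selection can be sketched on a small toy frame. The values below are hypothetical; only the column names and their order are assumed to mirror the first five columns of the churn data.

```python
import pandas as pd

# Toy frame with made-up values; column order is an assumption for illustration.
toy = pd.DataFrame({
    "customerID": ["A1", "B2", "C3"],
    "gender": ["Male", "Female", "Male"],
    "SeniorCitizen": [0, 1, 0],
    "Partner": ["Yes", "No", "No"],
    "Dependents": ["No", "No", "Yes"],
})

by_position = toy.iloc[:, 4]         # 5th column, zero-based position 4
by_label = toy.loc[:, "Dependents"]  # same column, selected by its name
by_bracket = toy["Dependents"]       # shorthand for label-based selection

print(by_position.equals(by_label))
```

Label-based selection keeps working if columns are reordered, whereas `iloc` depends on the column position.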

###### Now we need to extract all male senior citizens whose payment method is 'Electronic check'.

In [None]:
senior_male_electronic = cust_churn[(cust_churn['gender'] == 'Male') & (cust_churn['SeniorCitizen'] == 1) & (cust_churn['PaymentMethod'] == 'Electronic check')]

In [None]:
senior_male_electronic.head()

##### Now we need to extract all customers whose tenure is greater than 70 months or whose monthly charges are more than 100 dollars

In [None]:
customer_total_tenure = cust_churn[(cust_churn['tenure']>70) | (cust_churn['MonthlyCharges']> 100)]

In [None]:
customer_total_tenure.head()

##### Now we need to extract all customers whose contract is 'Two year', whose payment method is 'Mailed check', and whose Churn value is 'Yes'

In [None]:
two_mail_yes = cust_churn[(cust_churn['Contract']=='Two year')&(cust_churn['PaymentMethod']=='Mailed check')&(cust_churn['Churn']=='Yes')]

In [None]:
two_mail_yes.head()
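
The filters above combine boolean masks with `&` (and) and `|` (or). A minimal sketch on a toy frame with hypothetical values shows how the two operators differ:

```python
import pandas as pd

# Hypothetical values; only the column names come from the churn data.
toy = pd.DataFrame({
    "tenure": [72, 10, 71, 5],
    "MonthlyCharges": [50.0, 105.0, 110.0, 20.0],
})

# Each comparison yields a boolean Series (a mask) aligned with the rows.
long_tenure = toy["tenure"] > 70
high_charges = toy["MonthlyCharges"] > 100

either = toy[long_tenure | high_charges]  # rows matching at least one condition
both = toy[long_tenure & high_charges]    # rows matching both conditions

print(either)
print(both)
```

When the comparisons are written inline, each one needs its own parentheses because `&` and `|` bind more tightly than `>` and `==` in Python.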

##### Random Sampling:
Extract a total of 333 random records from the entire dataframe. For this we will use the sample function.

Each call returns a different random sample unless a random_state is supplied.

In [None]:
customer_333 = cust_churn.sample(n=333)
customer_333.head()
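
If a repeatable sample is wanted, `sample` accepts a `random_state` argument. The following sketch uses a small hypothetical frame and an arbitrary seed of 42:

```python
import pandas as pd

# Toy frame standing in for the churn data (hypothetical values).
toy = pd.DataFrame({"value": range(100)})

# The same seed gives the same rows on every call.
first = toy.sample(n=5, random_state=42)
second = toy.sample(n=5, random_state=42)

print(first.equals(second))
```

By default the rows are drawn without replacement, so no record appears twice in one sample.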

##### Get the count of each level in the Churn column.

In [None]:
cust_churn['Churn'].value_counts()

In [None]:
# Similarly, we can count the levels of the Contract column:
cust_churn['Contract'].value_counts()
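
Counts are often easier to read as proportions. As a small sketch on a hypothetical Series, `value_counts(normalize=True)` returns the share of each level instead of the raw count:

```python
import pandas as pd

# Hypothetical labels standing in for the Churn column.
churn = pd.Series(["No", "No", "No", "Yes"], name="Churn")

counts = churn.value_counts()                # absolute counts per level
shares = churn.value_counts(normalize=True)  # fraction of rows per level

print(counts)
print(shares)
```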

### 2. Data Visualization:
a. Build a bar-plot for 'InternetService' column,

b. Build a histogram for the 'tenure' column,

c. Build a scatter-plot between 'MonthlyCharges'(y-axis) and 'tenure'(x-axis),

d. Build a box-plot between 'tenure'(y-axis) and 'Contract'(x-axis).

###### a. Build a bar-plot for 'InternetService' column
We use bar-plots when we want to visualize the values of a categorical column.

In [None]:
# plt.bar(arg1, arg2, color = 'red')
# arg1 is the list of distinct values in the InternetService column: cust_churn['InternetService'].value_counts().keys().tolist()
# arg2 is the count of each distinct value in the InternetService column

plt.bar(cust_churn['InternetService'].value_counts().keys().tolist(),cust_churn['InternetService'].value_counts().tolist(), color = 'red')

# Now we need label and title:
plt.xlabel('Categories of Internet Service')
plt.ylabel('Count')
plt.title('Distribution of Internet Service')
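
The same plot can be built by computing the counts once and reusing them for both arguments. This is a sketch on hypothetical data; the `Agg` backend line is only an assumption for running outside a notebook and can be dropped in Jupyter.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; not needed inside a notebook
from matplotlib import pyplot as plt
import pandas as pd

# Hypothetical values standing in for the InternetService column.
internet = pd.Series(
    ["Fiber optic", "DSL", "Fiber optic", "No", "DSL", "Fiber optic"],
    name="InternetService",
)

# value_counts returns the levels as the index and the counts as the values.
counts = internet.value_counts()

fig, ax = plt.subplots()
bars = ax.bar(counts.index.tolist(), counts.tolist(), color="red")
ax.set_xlabel("Categories of Internet Service")
ax.set_ylabel("Count")
ax.set_title("Distribution of Internet Service")
```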

##### b. Build a histogram for the 'tenure' column
We use a histogram when we want to visualize the distribution of a numerical column.

In [None]:
# plt.hist(arg1, bins, color)
plt.hist(cust_churn['tenure'], bins = 30 ,color = 'green')

plt.title('Distribution of tenure')

##### c. Build a scatter-plot between 'MonthlyCharges'(y-axis) and 'tenure'(x-axis)

In [None]:
# plt.scatter(x, y)

plt.scatter(cust_churn['tenure'], cust_churn['MonthlyCharges'])

plt.xlabel('Tenure')
plt.ylabel('Monthly Charges')
plt.title('MonthlyCharges vs tenure')

##### d. Build a box-plot between 'tenure'(y-axis) and 'Contract'(x-axis)

In [None]:
# DF.boxplot(column='y-axis', by='x-axis')

cust_churn.boxplot(column=['tenure'], by=['Contract'])

plt.xlabel('Contract')
plt.ylabel('Tenure')
plt.title('Contract vs Tenure')

### 3. Linear regression (Machine Learning)
Build a simple linear model where the dependent variable is 'MonthlyCharges'(y) and the independent variable is 'tenure'(x).

Note: In linear regression, the dependent variable must be a numerical column.

y = mx + c
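
For a single feature, the slope m and intercept c of the least-squares line have a closed form. The sketch below computes them by hand on hypothetical points that lie exactly on y = 2x + 1 and compares the result with `LinearRegression`:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical points on the line y = 2x + 1.
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([3.0, 5.0, 7.0, 9.0])

# Least-squares slope: covariance of x and y divided by the variance of x.
m = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
# The fitted line passes through the point of means.
c = y.mean() - m * x.mean()

# scikit-learn expects a 2-D feature array, hence the reshape.
model = LinearRegression().fit(x.reshape(-1, 1), y)

print(m, c)
print(model.coef_[0], model.intercept_)
```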

In [None]:
# importing ML libraries
from sklearn import linear_model
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

##### Separating Dependent and Independent variables

In [None]:
y = cust_churn[['MonthlyCharges']]
x = cust_churn[['tenure']]

In [None]:
#y.head(),x.head()

##### Dividing dataset into training and test datasets

In [None]:
# Dividing the dataset into a training dataset and a testing dataset
# train_test_split returns four results, so we store them in four different variables
x_train,x_test,y_train,y_test = train_test_split(x,y,test_size = 0.30, random_state=0)

In [None]:
x_train.shape,y_train.shape,x_test.shape,y_test.shape
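
How `test_size` and `random_state` behave can be checked on a tiny example. This sketch uses 10 hypothetical rows, so a `test_size` of 0.30 puts 3 rows in the test set and 7 in the training set:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Ten hypothetical samples with one feature each.
x = np.arange(10).reshape(-1, 1)
y = np.arange(10)

x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.30, random_state=0
)

# Repeating the call with the same random_state reproduces the same split.
x_train2, x_test2, _, _ = train_test_split(
    x, y, test_size=0.30, random_state=0
)

print(x_train.shape, x_test.shape)
print(np.array_equal(x_test, x_test2))
```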

##### Building an ML model:
An ML model is always built on the training data.

In [None]:
# Creating an instance/object of the LinearRegression class:
regressor = LinearRegression()

# Fitting the train data sets
regressor.fit(x_train, y_train)

In [None]:
# Now predict the values based on test data set
y_predict = regressor.predict(x_test)

Now we need to check how well the model predicts. For linear regression, one way to do this is the Root Mean Squared Error (RMSE), which compares the actual test values with the predicted values.

The lower the RMSE, the better the model fits.
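
The error metric can be worked out by hand on a few hypothetical values: take the differences between actual and predicted values, square them, average them, and take the square root. The sketch compares that with the scikit-learn route:

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Hypothetical actual and predicted values.
y_true = np.array([3.0, 5.0, 7.0])
y_pred = np.array([2.0, 5.0, 9.0])

# Errors are 1, 0, -2; squares are 1, 0, 4; their mean is 5/3.
rmse_manual = np.sqrt(np.mean((y_true - y_pred) ** 2))
rmse_sklearn = np.sqrt(mean_squared_error(y_true, y_pred))

print(rmse_manual, rmse_sklearn)
```

The result is in the same units as the dependent variable, which makes it easier to interpret than the mean squared error itself.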

In [None]:
# RMSE is the square root (np.sqrt) of the mean squared error between the test values and the predictions
from sklearn.metrics import mean_squared_error

np.sqrt(mean_squared_error(y_test,y_predict))

Now we can compare the predicted values of the dependent variable (stored in the y_predict array) with the actual values.

In [None]:
print(y_predict[:5]) # Predicted Values
print(y_test[:5]) # Actual Values

### 4. Logistic Regression

a. Build a Simple Logistic Regression model where the dependent variable is 'Churn' and the independent variable is 'MonthlyCharges'

b. Build a Multiple Logistic Regression model where the dependent variable is 'Churn' and the independent variables are 'MonthlyCharges' and 'tenure'

##### Simple Logistic Regression model

In [None]:
# importing libraries
from sklearn import linear_model
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

In [None]:
# Separating dependent and independent variables
x = cust_churn[['MonthlyCharges']]
y = cust_churn['Churn']  # 1-D Series; a one-column DataFrame makes fit() emit a DataConversionWarning

In [None]:
# Dividing dataset into training and test data set
x_train,x_test,y_train,y_test = train_test_split(x,y,test_size=0.35,random_state=0)

# Let's check size of our train and test data sets
x_train.shape,y_train.shape,x_test.shape,y_test.shape

In [None]:
# Creating an instance/object of the LogisticRegression class
regressor = LogisticRegression()

# Fitting the model on our training data sets:
regressor.fit(x_train,y_train)

In [None]:
# Predicting values
y_predict = regressor.predict(x_test)

# Checking first 5 values
y_predict[:5]

Now we want to check how well the model predicts. For classification, we can use the confusion matrix and the accuracy score.

Behind the scenes, the accuracy score can be computed from the confusion matrix: correct predictions (the diagonal) divided by all predictions.
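
The link between the two metrics can be sketched on a handful of hypothetical labels. The diagonal of the confusion matrix holds the correct predictions, so dividing its sum by the total number of predictions reproduces the accuracy score:

```python
import numpy as np
from sklearn.metrics import confusion_matrix, accuracy_score

# Hypothetical actual and predicted labels.
y_true = ["No", "No", "No", "Yes", "Yes", "No"]
y_pred = ["No", "No", "Yes", "Yes", "No", "No"]

# Rows are actual classes, columns are predicted classes (sorted: 'No', 'Yes').
cm = confusion_matrix(y_true, y_pred)

# Correct predictions sit on the diagonal.
accuracy_from_cm = np.trace(cm) / cm.sum()

print(cm)
print(accuracy_from_cm, accuracy_score(y_true, y_pred))
```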

In [None]:
# importing libraries
from sklearn.metrics import confusion_matrix, accuracy_score

In [None]:
# Confusion matrix:
confusion_matrix(y_test, y_predict)

In [None]:
# Accuracy Score:
accuracy_score(y_test, y_predict)

# Accuracy = correct predictions / all predictions = (1815+0)/(1815+0+651+0)

##### Multiple Logistic Regression model

In [None]:
# Importing libraries
from sklearn import linear_model
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

In [None]:
# Separating dependent and independent variables
x = cust_churn[['MonthlyCharges','tenure']]
y = cust_churn['Churn']  # 1-D Series; a one-column DataFrame makes fit() emit a DataConversionWarning

In [None]:
# Dividing data set into training and test data sets
x_train,x_test,y_train,y_test = train_test_split(x,y,test_size=0.20,random_state=0)

# Checking shape of train and test data sets
x_train.shape,y_train.shape,x_test.shape,y_test.shape

In [None]:
# Creating instance/object of Logistic Regression class
regressor = LogisticRegression()

# Fitting data sets to our model
regressor.fit(x_train,y_train)

In [None]:
# Predicting values
y_predict = regressor.predict(x_test)

# Checking first 5 values
y_predict[:5]

Checking Confusion Matrix and Accuracy score using our test data set.

In [None]:
from sklearn.metrics import confusion_matrix, accuracy_score

In [None]:
# Confusion Matrix
confusion_matrix(y_test,y_predict)

In [None]:
# Accuracy score
accuracy_score(y_test, y_predict)

# Accuracy = correct predictions / all predictions = (935+157)/(935+157+211+106)

### 5. Decision Tree

Build a Decision Tree model whose dependent variable is 'Churn' and independent variable is 'tenure'

In [None]:
# Importing Libraries
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split

In [None]:
# Separating dependent and independent variables
x = cust_churn[['tenure']]
y = cust_churn['Churn']  # 1-D Series of class labels

In [None]:
# Dividing data set into training and test data sets
x_train,x_test,y_train,y_test = train_test_split(x,y,test_size=0.20,random_state=0)

# Checking shape of train and test data sets
x_train.shape,y_train.shape,x_test.shape,y_test.shape

In [None]:
# Creating an instance/object of the DecisionTreeClassifier class
DTree = DecisionTreeClassifier()

# Fitting data sets to our model
DTree.fit(x_train,y_train)

In [None]:
# Predicting Values
y_predict = DTree.predict(x_test)

# Checking first 5 Values
y_predict[:5]
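
What a decision tree learns from a single feature can be seen on a toy example. The data below is hypothetical and deliberately separable: short tenures churn, long tenures do not, so one split on tenure is enough.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Hypothetical, perfectly separable data (not taken from the churn file).
toy_x = pd.DataFrame({"tenure": [1, 2, 3, 50, 60, 70]})
toy_y = pd.Series(["Yes", "Yes", "Yes", "No", "No", "No"], name="Churn")

tree = DecisionTreeClassifier(random_state=0).fit(toy_x, toy_y)

# Predict for two new hypothetical customers.
new_customers = pd.DataFrame({"tenure": [2, 65]})
predictions = tree.predict(new_customers)

print(predictions)
print(tree.get_depth())
```

On real data the classes overlap, so the tree grows deeper and accuracy on the test set is the more meaningful check.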

In [None]:
# Now we will check confusion matrix and accuracy score
from sklearn.metrics import confusion_matrix, accuracy_score

In [None]:
# Confusion matrix:
confusion_matrix(y_test,y_predict)

In [None]:
# Accuracy Score
accuracy_score(y_test,y_predict)

### 6. Random Forest
Build a Random Forest model whose dependent variable is 'Churn' and independent variables are 'tenure' and 'MonthlyCharges'

In [None]:
# Importing random forest classifier
from sklearn.ensemble import RandomForestClassifier

In [None]:
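# Separating dependent and independent variables (both features this time)
x = cust_churn[['tenure','MonthlyCharges']]
y = cust_churn['Churn']
x_train,x_test,y_train,y_test = train_test_split(x,y,test_size=0.20,random_state=0)

# Creating an instance/object of the RandomForestClassifier class
rf = RandomForestClassifier()

# Fitting model
rf.fit(x_train,y_train)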

In [None]:
# Predicting values
y_predict = rf.predict(x_test)

# Checking first 5 values
y_predict[:5]
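
A random forest is built with randomness (bootstrap samples and random feature choices), so results can vary slightly between runs. As a sketch on synthetic data, fixing `random_state` makes two fits identical; the seed value, the data, and `n_estimators=20` are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Synthetic two-feature data with a simple rule for the label.
rng = np.random.default_rng(0)
x = rng.normal(size=(200, 2))
y = (x[:, 0] + x[:, 1] > 0).astype(int)

# Two forests with the same seed are trained in exactly the same way.
rf_a = RandomForestClassifier(n_estimators=20, random_state=0).fit(x, y)
rf_b = RandomForestClassifier(n_estimators=20, random_state=0).fit(x, y)

print(np.array_equal(rf_a.predict(x), rf_b.predict(x)))
```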

In [None]:
# confusion matrix
confusion_matrix(y_test,y_predict)

In [None]:
# Accuracy Score
accuracy_score(y_test,y_predict)