# Sean Pharris
# Model: Decision Tree
# Data set: Customer data of a telecommunications company
# Date: Jan 21, 2022

Part I: Research Question

A.

1.  Which customers are likely to discontinue their services in the next few months?

2.  The goal of answering this research question is to help the stakeholders understand customer turnover (churn) in detail.
 

Part II: Method Justification

B.  

1.  Prediction method and outcomes:
* We will make the prediction with the Decision Tree algorithm. This algorithm finds the statistical significance of the differences between the sub-nodes (service features) and the parent node (churn). We measure it by the sum of squares of the standardized differences between the observed and expected frequencies of the target variable. It works with the categorical target variable of “yes” or “no”. The higher the value, the higher the statistical significance of the differences between the sub-node and the parent node. With the outcome, we can predict which customers are likely to discontinue their services in the next few months.
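As a rough illustration of the measure described above, the sketch below scores one candidate split by hand. The counts are made-up numbers rather than values from the churn data, and the helper name `chi_square_split` is hypothetical. Note that scikit-learn's trees do not use this statistic: the regressor minimizes squared error and the classifier uses Gini impurity or entropy by default.

```python
import numpy as np

def chi_square_split(observed):
    """Pearson chi-square for one split (hypothetical helper).

    observed[i, j] = number of customers in child node i with churn class j.
    Expected counts assume every child has the parent's churn rate.
    """
    observed = np.asarray(observed, dtype=float)
    child_totals = observed.sum(axis=1, keepdims=True)
    parent_rates = observed.sum(axis=0, keepdims=True) / observed.sum()
    expected = child_totals * parent_rates
    return float(((observed - expected) ** 2 / expected).sum())

# Made-up counts: rows = two child nodes, columns = [no churn, churn]
informative = [[90, 10], [50, 50]]
uninformative = [[70, 30], [70, 30]]

print(chi_square_split(informative))    # large value: children differ from the parent
print(chi_square_split(uninformative))  # zero: children mirror the parent
```

A split whose children have the same churn rate as the parent scores zero, so a higher score marks a more useful split.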


2.  Assumptions: 
* Initially, the whole training set is considered as the root (Chauhan, 2020).

* Feature values are preferred to be categorical. If the values are continuous, they are discretized prior to building the model (Chauhan, 2020).

* Records are distributed recursively on the basis of attribute values (Chauhan, 2020).

* The order in which attributes are placed as the root or as internal nodes of the tree is determined by a statistical approach (Chauhan, 2020).
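The discretization mentioned in the second assumption can be sketched with pandas. The values and band labels below are hypothetical and only illustrate the idea. scikit-learn's decision trees split continuous features on thresholds directly, so this step is optional with that library.

```python
import pandas as pd

# Hypothetical monthly charges, for illustration only
monthly_charge = pd.Series([80.0, 95.5, 120.0, 150.25, 172.6, 200.1, 240.9, 290.0])

# Four equal-frequency bins (quartiles), labeled from lowest to highest
charge_band = pd.qcut(monthly_charge, q=4,
                      labels=["low", "mid-low", "mid-high", "high"])

band_counts = charge_band.value_counts(sort=False)
print(band_counts)
```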

3.  The benefits of Python are vast, but the main reasons for choosing it are its versatility, ease of use, and strong community support. There are many packages that make data analysis and prediction easier.

* Some of those packages are:
    * Pandas and NumPy - make it easy to handle large sets of data
    * Seaborn and Matplotlib - make data visualization a breeze
    * Statsmodels and scikit-learn - allow for easy data exploration and prediction
 

Part III: Data Preparation

C.  

1.  The customers that have already discontinued their services in the last month have the binary value "yes" in the "Churn" column, and the customers that have not discontinued their services have "no". We will preprocess the data by encoding "yes" as 1 and "no" as 0.

2.  The initial data set will include the variables below and our dependent variable will be "Churn", which is categorical.

* Continuous:

    'Population', 'Children', 'Age', 'Income', 'Outage_sec_perweek', 'Email', 'Contacts', 'Yearly_equip_failure', 'Tenure', 'MonthlyCharge', 'Bandwidth_GB_Year', 'TimelyResponse', 'TimelyFixes', 'TimelyReplacements', 'Reliability', 'Options', 'RespectfulResponse', 'CourteousExchange', 'EvidenceOfActiveListening'
       
* Categorical: 

    'Area', 'TimeZone', 'Job', 'Marital', 'Gender', 'Techie', 'Contract', 'Port_modem', 'Tablet', 'InternetService', 'Phone', 'Multiple', 'OnlineSecurity','OnlineBackup', 'DeviceProtection', 'TechSupport', 'StreamingTV', 'StreamingMovies', 'PaperlessBilling', 'PaymentMethod'

3.  Steps to prepare data:
    1. Read the data into the data frame ("df") using Pandas "read_csv()"
    2. Drop unneeded columns
    3. Rename columns to make the data more understandable
    4. Make sure there are no null values
    5. Encode the categorical columns as numbers
    6. Cap the outliers in the numerical columns
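Two common ways to turn categorical columns into numbers are integer codes and dummy variables. The sketch below shows both on a hypothetical three-row frame. The column names mirror the churn data, but the values are invented for illustration.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical three-row sample, not taken from the churn file
toy = pd.DataFrame({
    "Contract": ["Month-to-month", "One year", "Two Year"],
    "Techie": ["No", "Yes", "No"],
    "Tenure": [1.2, 15.0, 60.5],
})

# Confirm there are no null values before encoding
assert toy.isnull().sum().sum() == 0

# Option 1: integer codes, one number per category (alphabetical order)
codes = LabelEncoder().fit_transform(toy["Contract"])

# Option 2: dummy variables, one 0/1 column per category level
dummies = pd.get_dummies(toy, columns=["Contract", "Techie"], dtype=int)

print(codes)
print(dummies.columns.tolist())
```

Integer codes keep the frame narrow, while dummy variables avoid implying an order between categories. A tree can work with either.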

In [None]:
import warnings

warnings.filterwarnings('ignore')

In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

### Import the data

In [None]:
# Read the data set into the data frame
df = pd.read_csv('../input/clean-churn-data/churn_clean.csv')

### Removing unneeded columns

In [None]:
# Drop unnecessary columns
df.drop(columns=['CaseOrder','UID', 'Customer_id','Interaction', 'Job','State','City','County','Zip','Lat','Lng', 'TimeZone', 'Marital'], inplace=True)

### Changing the name of columns to make the data more understandable

In [None]:
# Renaming the survey columns
df.rename(columns = {'Item1':'TimelyResponse',
                     'Item2':'TimelyFixes',
                     'Item3':'TimelyReplacements',
                     'Item4':'Reliability',
                     'Item5':'Options',
                     'Item6':'RespectfulResponse',
                     'Item7':'CourteousExchange',
                     'Item8':'EvidenceOfActiveListening'},
          inplace=True)

### Check for null values in the data set

In [None]:
df.isnull().sum()

### Below, we will find our categorical data types

In [None]:
# find categorical variables

categorical = [var for var in df.columns if df[var].dtype=='O']

print('There are {} categorical variables\n'.format(len(categorical)))

print('The categorical variables are :', categorical)

In [None]:
# view the categorical variables

print(categorical)

In [None]:
df.columns

In [None]:
# check for cardinality in categorical variables

for var in categorical:
    print(var, ' contains ', df[var].nunique(), ' labels')

## Below we will find our numerical data types and cap outliers

In [None]:
# find numerical variables

numerical = [var for var in df.columns if df[var].dtype!='O']

print('There are {} numerical variables\n'.format(len(numerical)))

print('The numerical variables are :', numerical)

#### Now we will look for outliers

In [None]:
# view summary statistics in numerical variables

print(round(df.describe(), 2))

Variables that may contain outliers:

* Population
* Income
* Bandwidth_GB_Year

In [None]:
# draw boxplots to visualize outliers
import matplotlib.pyplot as plt

plt.figure(figsize=(15,10))


plt.subplot(2, 2, 1)
fig = df.boxplot(column='Population')
fig.set_title('')
fig.set_ylabel('Population')


plt.subplot(2, 2, 2)
fig = df.boxplot(column='Income')
fig.set_title('')
fig.set_ylabel('Income')


plt.subplot(2, 2, 3)
fig = df.boxplot(column='Bandwidth_GB_Year')
fig.set_title('')
fig.set_ylabel('Bandwidth_GB_Year')

In [None]:
# plot histogram to check distribution
# (the y-axis of a histogram is the number of customers in each bin)

plt.figure(figsize=(15,10))


plt.subplot(2, 2, 1)
ax = df.Population.hist(bins=10)
ax.set_xlabel('Population')
ax.set_ylabel('Number of customers')


plt.subplot(2, 2, 2)
ax = df.Income.hist(bins=10)
ax.set_xlabel('Income')
ax.set_ylabel('Number of customers')


plt.subplot(2, 2, 3)
ax = df.Bandwidth_GB_Year.hist(bins=10)
ax.set_xlabel('Bandwidth_GB_Year')
ax.set_ylabel('Number of customers')

Finding the outliers in our numerical data types
* Bandwidth_GB_Year does not appear to be skewed.
* Population and Income appear to be skewed, so we will now compute outlier boundaries from the interquartile range (IQR).

In [None]:
# find outliers for Population

IQR = df.Population.quantile(0.75) - df.Population.quantile(0.25)
lower = df.Population.quantile(0.25) - (IQR * 3)
upper = df.Population.quantile(0.75) + (IQR * 3)
print('Population outliers are values < {lowerboundary} or > {upperboundary}'.format(lowerboundary=lower, upperboundary=upper))

In [None]:
# find outliers for Income

IQR = df.Income.quantile(0.75) - df.Income.quantile(0.25)
lower = df.Income.quantile(0.25) - (IQR * 3)
upper = df.Income.quantile(0.75) + (IQR * 3)
print('Income outliers are values < {lowerboundary} or > {upperboundary}'.format(lowerboundary=lower, upperboundary=upper))
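The two cells above repeat the same calculation, so it can be wrapped in a small helper. This is a minimal sketch: the function name `iqr_bounds` and the sample values are hypothetical, and the multiplier of 3 matches the cells above.

```python
import pandas as pd

def iqr_bounds(series, k=3.0):
    """Return (lower, upper) boundaries at k * IQR beyond the quartiles."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Hypothetical values, not taken from the churn data
sample = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 100])
lower, upper = iqr_bounds(sample)

# Count how many values fall above the upper boundary
n_outliers = int((sample > upper).sum())
print(lower, upper, n_outliers)
```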

Fixing the outliers in our numerical data types

* We have seen that the Population and Income columns contain outliers.
* We will use the top-coding approach, which caps values above a maximum at that maximum, so the rows are kept but the extreme values are limited.

In [None]:
def max_value(df3, variable, top):
    # cap every value above `top` at `top` (top-coding)
    return np.where(df3[variable]>top, top, df3[variable])

# cap each column at its chosen upper boundary
df['Population'] = max_value(df, 'Population', 50458.0)
df['Income'] = max_value(df, 'Income', 155310.5275)

In [None]:
print(df.Population.max(), df.Income.max())
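Top-coding can also be written with the built-in pandas `Series.clip` method. The sketch below uses hypothetical income values and a hypothetical cap to show that only the values above the cap change.

```python
import pandas as pd

# Hypothetical income values and cap, for illustration only
income = pd.Series([30_000.0, 52_000.0, 75_000.0, 250_000.0])
cap = 155_000.0

# Values above the cap are replaced by the cap; all others are untouched
capped = income.clip(upper=cap)
print(capped.tolist())
```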

In [None]:
# plot histogram to check distribution after capping the outliers

plt.figure(figsize=(15,10))


plt.subplot(2, 2, 1)
ax = df.Population.hist(bins=10)
ax.set_xlabel('Population')
ax.set_ylabel('Number of customers')


plt.subplot(2, 2, 2)
ax = df.Income.hist(bins=10)
ax.set_xlabel('Income')
ax.set_ylabel('Number of customers')

### C4.  Provide a copy of the cleaned data set.

In [None]:
# Desired data set
df.to_csv('Decision_Tree_churn.csv', index=False)

## Part IV: Analysis

### D1.  Split the data into training and test data sets and provide the file(s).

In [None]:
from sklearn.model_selection import train_test_split

# Split the cleaned data frame into training and test sets;
# random_state matches the feature/target split below so both contain the same rows

train, test = train_test_split(df, test_size = 0.2, random_state = 1)

# check the shape of the training and test sets

train.shape, test.shape

In [None]:
# Put training and test data into their own CSVs.

train.to_csv('training_churn.csv', index=False)

test.to_csv('test_churn.csv', index=False)

### Now that we have split the test/training data, we will split the dependent variable "Churn" from the independent variables.

In [None]:
# separate the features (X) from the target variable (y)

X = df.drop(['Churn'], axis=1)

y = df[['Churn']]

In [None]:
# removing churn from categorical list because we will loop through the categorical in the encoding below

categorical.remove('Churn')

categorical

In [None]:
# splitting test/training data

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 1)

# check the shape of X_train and X_test

X_train.shape, X_test.shape

In [None]:
# keep the original "Yes"/"No" labels of the test set before they are encoded
churn_df = y_test.copy()
churn_df.head()

In [None]:
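import sklearn.preprocessing as preprocessing

# transforming categorical datatypes into numerical types;
# each encoder is fitted on the training data only and reused for the test data
X_train = X_train.copy()
X_test = X_test.copy()
for feature in categorical:
    le = preprocessing.LabelEncoder()
    X_train[feature] = le.fit_transform(X_train[feature])
    X_test[feature] = le.transform(X_test[feature])

# encode the target: "No" becomes 0 and "Yes" becomes 1
target_encoder = preprocessing.LabelEncoder()
y_train = target_encoder.fit_transform(y_train.values.ravel())
y_test = target_encoder.transform(y_test.values.ravel())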

In [None]:
from sklearn.preprocessing import StandardScaler

# normalizing/feature scaling the data

scaler = StandardScaler()

# fit the scaler on the training data only, then apply the same scaling to the test data
X_train = pd.DataFrame(scaler.fit_transform(X_train), columns = X_train.columns)
X_test = pd.DataFrame(scaler.transform(X_test), columns = X_test.columns)

### D2: Analysis and Intermediate calculations
    * The analysis technique we are using is the decision tree, more specifically the decision tree regressor model
        * The model starts at a root node (the initial feature with the highest importance)
        * The root node (which becomes the parent node) splits into the other features (called child nodes at this point, or leaves when they are not split further) based on the decisions made from the customer data
        * Because churn is encoded as 0 or 1, the value predicted at each leaf is the average churn of the training customers in that leaf, which we read as a churn likelihood
        * This results in the classification of customer churn
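To illustrate how a regression tree turns a 0/1 target into a churn likelihood, the sketch below fits a one-split tree on eight hypothetical customers. The data is invented; only the `DecisionTreeRegressor` class comes from the notebook.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Hypothetical data: one feature (tenure) and a 0/1 churn flag
tenure = np.array([[1], [2], [3], [4], [30], [40], [50], [60]])
churn = np.array([1, 1, 0, 1, 0, 0, 0, 0])

# A depth-1 tree makes a single split, giving two leaves
stump = DecisionTreeRegressor(max_depth=1, random_state=1)
stump.fit(tenure, churn)

# Each prediction is the mean churn value of the leaf the row falls into
predictions = stump.predict([[2], [45]])
short_tenure_mean = churn[tenure.ravel() <= 4].mean()
print(predictions, short_tenure_mean)
```

Three of the four short-tenure customers churned, so a short-tenure customer is scored at 0.75 and a long-tenure customer at 0.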

In [None]:
from sklearn.tree import DecisionTreeRegressor

# Declare Decision Tree algo
# min_samples_leaf = 0.1 is a fraction: every leaf must hold at least 10% of the training rows
dt = DecisionTreeRegressor(max_depth = 8,
                           min_samples_leaf = 0.1,
                           random_state = 1)
# Fit the model
dt.fit(X_train, y_train)

# Declare prediction variable
y_pred = dt.predict(X_test)

y_pred

In [None]:
# get importance
importance = dt.feature_importances_

# summarize feature importance
for i,feature_score in enumerate(importance):
    print((X_train.columns[i]), '- %.5f' % (feature_score))
    
# plot feature importance
plt.figure(figsize=(10,10))
plt.bar([x for x in range(len(importance))], importance)
plt.xlabel('Feature')
plt.ylabel('Score')
plt.show()

In [None]:
# features of importance

for i,feature_score in enumerate(importance):
    if feature_score > 0.0001:
        print((X_train.columns[i]), '- %.5f' % (feature_score))

In [None]:
x = X_test
Y = y_test

### D3.  Code is above.

## Part V: Data Summary and Implications

### E1.  Accuracy and the mean squared error (MSE)

In [None]:
# Import cross validation metrics
from sklearn.model_selection import cross_val_score

# Compute the cross-validated coefficient of determination (R-squared) on the training data
scores = cross_val_score(dt, X_train, y_train, scoring='r2')

In [None]:
# Print R-squared value
print('Cross validation R-squared values: ', scores)

In [None]:
from sklearn.metrics import mean_squared_error as MSE

print("Best Possible score for R-Squared is 1")

# Fit the model once, on the training set only
dt.fit(X_train,y_train)

# Explained variance of the training set
print("R-Squared on training dataset = {}".format(dt.score(X_train,y_train)))

# Explained variance of the held-out test set (the model is not refitted on test data)
print("R-Squared on test dataset = {}\n".format(dt.score(X_test,y_test)))


print("Best Possible score for MSE is 0")


# Print Mean Squared Error
y_pred = dt.predict(X_test)
print("MSE = ", MSE(y_test, y_pred))

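#### Mean squared error is approximately 0.1122
 * 0 would be perfect. Because the target is coded as 0 or 1, a model that always predicts the overall churn rate p would score p(1 - p), which is at most 0.25, so 0.1122 should be judged against that baseline rather than against 0 alone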
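One way to put an MSE value in context is to compare it with a baseline that always predicts the overall churn rate, and to convert the predicted likelihoods into hard labels so an accuracy can be reported. The sketch below does both on hypothetical labels and scores. The 0.5 cut-off is an assumption, not a value used elsewhere in this notebook.

```python
import numpy as np
from sklearn.metrics import accuracy_score, mean_squared_error

# Hypothetical 0/1 labels and predicted churn likelihoods
y_true = np.array([0, 0, 0, 1, 0, 1, 0, 0])
y_score = np.array([0.1, 0.2, 0.1, 0.7, 0.3, 0.4, 0.1, 0.2])

# Baseline: always predict the overall churn rate p, which gives MSE = p * (1 - p)
p = y_true.mean()
baseline_mse = mean_squared_error(y_true, np.full(len(y_true), p))
model_mse = mean_squared_error(y_true, y_score)

# Accuracy needs hard labels, so threshold the likelihoods (0.5 is an assumed cut-off)
y_label = (y_score >= 0.5).astype(int)
accuracy = accuracy_score(y_true, y_label)

print(baseline_mse, model_mse, accuracy)
```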

In [None]:
# keep only the three most important features
features = ['Contract', 'Tenure', 'MonthlyCharge']
reduced_X_train = X_train[features]
reduced_X_test = X_test[features]

In [None]:
dt.fit(reduced_X_train, y_train)
y_pred = dt.predict(reduced_X_test)
y_pred

In [None]:
data = pd.DataFrame({'Contract': X_test['Contract'], 'Tenure': X_test['Tenure'], 'MonthlyCharge': X_test['MonthlyCharge'],'Churn': y_pred})
data

In [None]:
data_features = pd.DataFrame({'Contract': X_test['Contract'], 'Tenure': X_test['Tenure'], 'MonthlyCharge': X_test['MonthlyCharge']})
len(data_features)

In [None]:
# Explained variance after model reduction
print("Best Possible score for R-Squared is 1")

# dt was fitted on the reduced training features above; score it against the true labels
print("R-Squared on training dataset = {}".format(dt.score(reduced_X_train, y_train)))
print("R-Squared on test dataset = {}\n".format(dt.score(reduced_X_test, y_test)))

In [None]:
# Parameters of Decision tree regression model
dt.get_params()

In [None]:
y_pred

In [None]:
churn_df.value_counts()

### The cell below keeps only the customers in our test sample who have not already discontinued their services

In [None]:
# keep the predicted churn likelihood only for customers who have not churned yet
churn_likelihood = [pred for cust, pred in zip(churn_df['Churn'], y_pred) if cust == "No"]
print(len(churn_likelihood))

In [None]:
very_low = 0
low = 0
medium = 0
high = 0
very_high = 0

for customer in churn_likelihood:
    if customer <= .03:
        very_low += 1
    elif customer > .03 and customer <= .05:
        low += 1
    elif customer > .05 and customer <= .08:
        medium += 1
    elif customer > .08 and customer <= .095:
        high += 1
    elif customer > .095:
        very_high += 1
        
print("Likelihood of customer churn:\n", 
      very_low, "customers have very low risk\n",
      low, "customers have low risk\n",
      medium, "customers have medium risk\n",
      high, "customers have high risk\n",
      very_high, "customers have very high risk\n")
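The same risk bands can be produced with `pandas.cut`, which avoids the chain of comparisons. This is a minimal sketch on hypothetical likelihoods; the bin edges are the thresholds used above.

```python
import numpy as np
import pandas as pd

# Hypothetical predicted churn likelihoods
likelihood = pd.Series([0.01, 0.03, 0.04, 0.07, 0.09, 0.2])

# Right-closed bins reproduce the "<=" comparisons used above
bins = [-np.inf, 0.03, 0.05, 0.08, 0.095, np.inf]
labels = ["very low", "low", "medium", "high", "very high"]
risk = pd.cut(likelihood, bins=bins, labels=labels)

# Count customers per band, in band order
counts = risk.value_counts(sort=False)
print(counts)
```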


2.  Results and implications

    * From the decision tree regression technique, we found that the main predictors of customer churn are:
        * The type of contract the customer was in
        * The length of time the customer had already been using the service (tenure)
        * The amount of money they were being charged monthly
        
    * From a sample size of 2,000 customers, we found that 438 customers were at high risk of churn as identified by our model.
    
    * The main implication that arose during the analysis was having too many "branches". Our decision tree started with 37 branches, and after "pruning" the tree we were left with only 3 that gave us a solid R-squared value/accuracy.
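    The reduction from many features to a few can be sketched as a filter on the tree's feature importances. The data below is synthetic and the names `X_toy`, `y_toy`, and `kept` are hypothetical; the 0.0001 threshold is the one used in the feature importance cell earlier.

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(1)

# Synthetic data: only "Tenure" drives the 0/1 target; "Noise" is irrelevant
X_toy = pd.DataFrame({"Tenure": rng.uniform(0, 72, 500),
                      "Noise": rng.normal(size=500)})
y_toy = (X_toy["Tenure"] < 12).astype(int)

tree = DecisionTreeRegressor(max_depth=3, random_state=1).fit(X_toy, y_toy)
importance = pd.Series(tree.feature_importances_, index=X_toy.columns)

# Keep only the features whose importance clears the threshold
kept = importance[importance > 0.0001].index.tolist()
print(kept)
```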

3.  Limitation

    * The major limitation that occurred during the analysis was the number of observations that I had to work with. An 80/20 split was chosen for the train/test data split, which gives our model 8,000 observations to train on and 2,000 to test on. The 2,000 observations tested on are essentially our sample size for the results. We could have used a different split, but the model would have been trained on less data.

4.  Recommendation

    * The recommendation for the telecommunications company is to ensure that the needs of all customers are taken care of, with a particular focus on the customers with high tenure. The company should identify which kind of contract has the highest longevity and perhaps suggest that contract type to more customers. It should also take a closer look at which price range for the monthly bill has the best rate of satisfaction with the customer base, as the monthly charge is the second most important variable with customers.
 

Part VI: Demonstration

F.  Panopto video:
 
    https://wgu.hosted.panopto.com/Panopto/Pages/Viewer.aspx?id=19c5726f-2558-4948-83f3-ae23016249fd

G.  Third Party Code sources.
    
    Brownlee, J. (2020). How to Calculate Feature Importance With Python. Machine Learning Mastery. https://machinelearningmastery.com/calculate-feature-importance-with-python/

H.  References:
    
    2U INC. (2022). Decision Tree. Master's in Data Science. https://www.mastersindatascience.org/learning/introduction-to-machine-learning-algorithms/decision-tree/
    
    Chauhan, N. (2020). Decision Tree Algorithm, Explained. KDnuggets. https://www.kdnuggets.com/2020/01/decision-tree-algorithm-explained.html
