
# Day 9 Lab, IS 4487

What do you need to do for today's project?

1. Use the model to predict on a new dataset (without the target), then use those predictions to build a contact list of customers who should be called.
2. Make a recommendation to the Director of Sales based on all of your analytic work for this project.

Remember that for this example we'll be using the MegaTelCo data, where the target is `leave` not `answer`.  

Note that the first set of steps below is identical to the workflow in previous labs.




# Load Libraries


In [1]:
from sklearn.tree import plot_tree
from sklearn import tree
from sklearn.tree import DecisionTreeClassifier, export_graphviz # Import Decision Tree Classifier
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np


# Get Data

For this part of the project we will be using the model to predict whether *current* customers will churn.

Remember:  we have trained the model on historical data, which includes information about whether customers have *already* churned.  But the important use case is to predict whether *current* customers will churn.

In [2]:
# Training data
mtc = pd.read_csv("https://raw.githubusercontent.com/jefftwebb/is_4487_base/dd870389117d5b24eee7417d5378d80496555130/Labs/DataSets/megatelco_leave_survey.csv")

# Current customer data
current_customers = pd.read_csv("https://raw.githubusercontent.com/jefftwebb/is_4487_base/main/Labs/DataSets/megatelco_new_customer_data.csv")

We should double-check that this new dataset is also clean. If it isn't, there will be problems when predicting.

In [3]:
current_customers.describe()

Unnamed: 0,income,overage,leftover,house,handset_price,over_15mins_calls_per_month,average_call_duration,id
count,24.0,24.0,24.0,24.0,24.0,24.0,24.0,24.0
mean,83705.958333,110.208333,28.625,451115.5,385.125,11.625,5.125,11275.583333
std,40593.33419,97.638632,29.793219,205207.606472,231.486184,10.66409,4.099973,4947.953779
min,20392.0,0.0,0.0,173038.0,132.0,0.0,1.0,3239.0
25%,46900.0,0.0,0.0,263864.0,203.75,1.0,2.0,8620.5
50%,90135.5,112.0,17.5,443579.0,338.0,7.5,4.0,11261.5
75%,120707.5,197.75,50.75,594438.5,416.25,22.25,9.0,14307.5
max,143929.0,252.0,87.0,853464.0,888.0,29.0,14.0,19570.0


In [4]:
current_customers.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 24 entries, 0 to 23
Data columns (total 12 columns):
 #   Column                       Non-Null Count  Dtype 
---  ------                       --------------  ----- 
 0   college                      24 non-null     object
 1   income                       24 non-null     int64 
 2   overage                      24 non-null     int64 
 3   leftover                     24 non-null     int64 
 4   house                        24 non-null     int64 
 5   handset_price                24 non-null     int64 
 6   over_15mins_calls_per_month  24 non-null     int64 
 7   average_call_duration        24 non-null     int64 
 8   reported_satisfaction        24 non-null     object
 9   reported_usage_level         24 non-null     object
 10  considering_change_of_plan   24 non-null     object
 11  id                           24 non-null     int64 
dtypes: int64(8), object(4)
memory usage: 2.4+ KB


Looks okay.

And note that there is no target variable in the data.

# Clean data

We need to take care that we perform *exactly* the same cleaning on the new data.

Here is the cleaning/preparation for the historical MegaTelco data:


In [5]:
# Make explicit copy
mtc_clean = mtc.copy()

# filter rows
mtc_clean = mtc_clean[(mtc_clean['house'] > 0) & (mtc_clean['income'] > 0) & (mtc_clean['handset_price'] < 1000)]

# remove NAs
mtc_clean = mtc_clean.dropna()

# Recode college
mtc_clean['college'] = mtc_clean['college'].replace({'one': 'yes', 'zero': 'no'})

# encode college and considering_change_of_plan as integer codes (unordered)
mtc_clean['college'] = pd.Categorical(mtc_clean['college'],
                                    ordered = False).codes

mtc_clean['considering_change_of_plan'] = pd.Categorical(mtc_clean['considering_change_of_plan'],
                                    ordered = False).codes

# change reported usage and reported satisfaction (ordered)
mtc_clean['reported_usage_level'] = pd.Categorical(mtc_clean['reported_usage_level'],
                                    categories = ['low', 'avg','high'],
                                    ordered = True).codes

mtc_clean['reported_satisfaction'] = pd.Categorical(mtc_clean['reported_satisfaction'],
                                    categories = ['low', 'avg','high'],
                                    ordered = True).codes




And here is that same cleaning applied to the data on current customers:

In [6]:
# Make explicit copy: ccc refers to current customers clean
ccc = current_customers.copy()

# filter rows
ccc = ccc[(ccc['house'] > 0) & (ccc['income'] > 0) & (ccc['handset_price'] < 1000)]

# remove NAs
ccc = ccc.dropna()

# Recode college
ccc['college'] = ccc['college'].replace({'one': 'yes', 'zero': 'no'})

# encode college and considering_change_of_plan as integer codes (unordered)
ccc['college'] = pd.Categorical(ccc['college'],
                                    ordered = False).codes

ccc['considering_change_of_plan'] = pd.Categorical(ccc['considering_change_of_plan'],
                                    ordered = False).codes

# change reported usage and reported satisfaction (ordered)
ccc['reported_usage_level'] = pd.Categorical(ccc['reported_usage_level'],
                                    categories = ['low', 'avg','high'],
                                    ordered = True).codes

ccc['reported_satisfaction'] = pd.Categorical(ccc['reported_satisfaction'],
                                    categories = ['low', 'avg','high'],
                                    ordered = True).codes
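
One pitfall with running the cleaning twice is that `pd.Categorical` infers its categories from whatever values are present, so an unordered column can be coded differently in the two datasets if a level is missing from the new data. The sketch below uses made-up toy values (not the MegaTelCo data) and hypothetical variable names to show the problem and one possible fix: reuse the levels seen in the training data.

```python
import pandas as pd

# Toy values, made up for illustration: the "new" data lacks the 'no' level.
train_college = pd.Series(['no', 'yes', 'yes', 'no'])
new_college = pd.Series(['yes', 'yes'])

# Categories inferred separately for each dataset:
# 'yes' is coded 1 in the training data but 0 in the new data.
inferred_train = pd.Categorical(train_college).codes
inferred_new = pd.Categorical(new_college).codes

# Categories taken from the training data: the codes now agree.
levels = sorted(train_college.unique())
fixed_new = pd.Categorical(new_college, categories=levels).codes

print(list(inferred_train))
print(list(inferred_new))
print(list(fixed_new))
```

A value that is not in the supplied categories receives the code `-1`, so checking for negative codes after encoding is a cheap way to spot unexpected levels.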




# Fit full model

Again, we will set `max_depth = 5` to keep the tree simple and prevent overfitting.

Since we have already determined that the model is not overfitting the data, we can dispense with splitting it into train and test sets. We will therefore use *all* the data to fit the model.

In [7]:
# split the dataframe into predictors (X) and target (y)
X = mtc_clean.drop(['id', 'leave'], axis=1)
y = mtc_clean['leave']

# initialize the tree
full_tree = DecisionTreeClassifier(criterion="entropy", max_depth = 5)

# fit the tree to all of the historical data
full_tree = full_tree.fit(X, y)
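
For reference, the overfitting check that justifies fitting on all the data amounts to comparing accuracy on a training split with accuracy on a held-out split. The sketch below is only an illustration on synthetic data from `make_classification`, not the MegaTelCo data, and the names `check_tree`, `train_acc`, and `test_acc` are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (NOT the MegaTelCo data)
X_syn, y_syn = make_classification(n_samples=500, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X_syn, y_syn, test_size=0.3, random_state=0
)

# Same settings as the lab's tree, with a fixed seed for reproducibility
check_tree = DecisionTreeClassifier(criterion="entropy", max_depth=5, random_state=0)
check_tree.fit(X_train, y_train)

# A large gap between these two numbers would suggest overfitting
train_acc = check_tree.score(X_train, y_train)
test_acc = check_tree.score(X_test, y_test)
print(f"train accuracy: {train_acc:.3f}, test accuracy: {test_acc:.3f}")
```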

# Predict

The next step is to use the model to predict churn for the current customers.

We need to ensure that the new dataset has the same predictor columns, in the same order, and the same data types as the data used to fit the model.

This will entail dropping the `id` column.

In [8]:
X_new = ccc.drop(['id'], axis=1)



Now we predict the probability of churn using the new data. Remember: we are using the model trained on the historical data, `full_tree`, to predict for the new data, `X_new` (the clean current customers data, `ccc`, without the `id` column).

The `predict_proba()` method returns an array with two columns, ordered according to the levels in the target. Because scikit-learn sorts the class labels, column 0 holds the probability of `LEAVE` (the first level) and column 1 holds the probability of `STAY` (the second level).

Hence we need to index that array to obtain the probabilities for `LEAVE` by choosing the first column: `[:, 0]`.

In [11]:
# Write your code here
probs = full_tree.predict_proba(X_new)[:,0]
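
Rather than relying on memory for the column order, it can be read from the fitted model's `classes_` attribute. The following is a minimal sketch on toy data (the values and the `*_toy` names are made up), assuming the target labels are `LEAVE` and `STAY` as described above.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Toy data, made up for illustration
X_toy = np.array([[0], [1], [2], [3]])
y_toy = np.array(['STAY', 'STAY', 'LEAVE', 'LEAVE'])
toy_tree = DecisionTreeClassifier(random_state=0).fit(X_toy, y_toy)

# classes_ lists the labels in the same order as the predict_proba columns
print(toy_tree.classes_)

# Look up the column for 'LEAVE' instead of hard-coding 0
leave_col = list(toy_tree.classes_).index('LEAVE')
probs_leave = toy_tree.predict_proba(X_toy)[:, leave_col]
print(probs_leave)
```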

# Add predictions to the data

The next step is to append the predictions to the cleaned current customers data, `ccc`, so we can link the predictions to the customer ID.



In [12]:
# Write your code here to create the contact list
ccc['churn_prob']=probs
ccc = ccc[['id', 'churn_prob']]
ccc

Unnamed: 0,id,churn_prob
0,18429,0.792105
1,13530,0.539202
2,9171,0.792105
3,3239,0.033333
4,11815,0.135
5,15632,0.792105
6,14127,0.539202
7,11963,0.9
8,9013,0.792105
9,18810,0.135
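
Note that the cell above overwrites `ccc` with a two-column table, so the predictors are no longer available if earlier cells are rerun. A possible alternative, sketched here with toy values and hypothetical names (`ccc_toy`, `probs_toy`, `contact_list`), is to build the contact list as a separate DataFrame with `assign()`.

```python
import numpy as np
import pandas as pd

# Toy stand-ins, made up for illustration. The index is non-contiguous,
# as it would be if row filtering had dropped a customer.
ccc_toy = pd.DataFrame({'id': [101, 102, 104],
                        'income': [50000, 60000, 70000]},
                       index=[0, 1, 3])
probs_toy = np.array([0.9, 0.1, 0.5])

# assign() returns a new DataFrame and leaves ccc_toy untouched;
# a NumPy array is matched to the rows by position, not by index label.
contact_list = ccc_toy.assign(churn_prob=probs_toy)[['id', 'churn_prob']]
print(contact_list)
```

Because the probabilities were computed from the same cleaned rows in the same order, positional matching pairs each customer with the right prediction.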


# Which customers to target for retention?

This is a contact list that can be handed off to the marketing department to direct their retention efforts!

We need to organize the list by first sorting it, then filtering it to include only customers with a predicted probability of churning greater than 0.2. For sorting we'll use the pandas method `sort_values`:

In [15]:
# Write your code here to sort the list and filter for probabilities > .2
ccc = ccc.sort_values(by='churn_prob', ascending=False)
ccc = ccc[ccc['churn_prob'] > 0.2]
ccc

Unnamed: 0,id,churn_prob
7,11963,0.9
0,18429,0.792105
16,6514,0.792105
14,7443,0.792105
11,3244,0.792105
8,9013,0.792105
23,13861,0.792105
5,15632,0.792105
2,9171,0.792105
19,19570,0.685185
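
The sort-and-filter step can also be written as a single chain with the cutoff held in a variable, which makes it easy to try other thresholds. This is a sketch on toy scores (made-up values; `scores`, `threshold`, and `contact` are hypothetical names).

```python
import pandas as pd

# Toy scores, made up for illustration
scores = pd.DataFrame({'id': [1, 2, 3, 4],
                       'churn_prob': [0.135, 0.9, 0.2, 0.54]})

threshold = 0.2  # same cutoff as in the lab

# Keep rows strictly above the cutoff, highest probability first
contact = (scores[scores['churn_prob'] > threshold]
           .sort_values(by='churn_prob', ascending=False)
           .reset_index(drop=True))

print(contact)
print(len(contact))
```

The comparison is strict, so a customer with a probability of exactly 0.2 is left off the list.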


How many customers are on the contact list?

In [16]:
# Write your code here
len(ccc)

17