# Day 8 Lab, IS 4487

What do you need to know how to do for today's project?

1. Fit a tree model using all the available predictors.
2. Create a confusion matrix and identify the numbers of TP, FP, TN, and FN.
3. Estimate profit (benefits - costs) using a defined cost-benefit matrix and a confusion matrix for:
  - all customers
  - only the customers predicted to answer

Of course, for this lab we'll be using the MegaTelCo data, in which case the target is `leave`, not `answer`.

Note that the first set of steps below is identical to what we did in the previous lab.




# Load Libraries


In [None]:
import pandas as pd
from sklearn.tree import plot_tree
from sklearn.preprocessing import LabelEncoder # for label encoding
from sklearn.tree import DecisionTreeClassifier # Import Decision Tree Classifier
from sklearn import tree


# Get Data


In [None]:
mtc = pd.read_csv("https://raw.githubusercontent.com/jefftwebb/is_4487_base/dd870389117d5b24eee7417d5378d80496555130/Labs/DataSets/megatelco_leave_survey.csv")

# Clean data


In [None]:
# filter rows
mtc_clean = mtc[(mtc['house'] > 0) & (mtc['income'] > 0) & (mtc['handset_price'] < 1000)]

# remove NAs
mtc_clean = mtc_clean.dropna()

# Fit full model

Again, we will set `max_depth = 5` to keep the tree simple and prevent overfitting.

In [None]:
# split the dataframe into independent attributes (X) and the dependent, predicted attribute (y)
X = mtc_clean.drop(['id', 'leave'], axis=1)
y = mtc_clean['leave']

# Convert categorical variables to numeric
le = LabelEncoder()
for column in X.select_dtypes(include=['object']):
    X[column] = le.fit_transform(X[column])

# initialize the tree
full_tree = DecisionTreeClassifier(criterion="entropy", max_depth = 5)

# Fit the decision tree classifier to the data
full_tree = full_tree.fit(X, y)
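
The label-encoding loop above replaces each text category with an integer code. Here is a minimal sketch of what that does to a single column. The `plan` column and its values are made up for illustration; they are not taken from the MegaTelCo data.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy column; the values are invented for illustration
plan = ["low", "high", "medium", "low"]

le = LabelEncoder()
codes = le.fit_transform(plan)

# classes_ holds the original categories in sorted order;
# each category's position in that list is its integer code
mapping = {str(label): int(code) for code, label in enumerate(le.classes_)}
print(mapping)
print(codes.tolist())
```

Note that the codes follow alphabetical order, not any natural ordering of the categories.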

# Create a confusion matrix

The confusion matrix shows counts of true positives, false positives, true negatives, and false negatives. Essentially, it provides a window into how the model is performing: it shows where the model is making mistakes.



In [None]:
# Get model predictions
pred = full_tree.predict(X)


In [None]:
# create a confusion matrix
from sklearn.metrics import ConfusionMatrixDisplay

ConfusionMatrixDisplay.from_predictions(y, pred)

Explanation of code:

- `ConfusionMatrixDisplay.from_predictions(y, pred)`: creates a confusion matrix in a heatmap format.

It is typical to put the predictions in the columns. What matters is that you can identify the following categories, and their associated counts, from the table:

- TP--predicted to leave and actually left:  1989
- FP--predicted to leave but stayed:  946
- TN--predicted to stay and stayed: 1580
- FN--predicted to stay but left: 479   

These counts determine accuracy: accuracy = (TP + TN) / total. Let's check.

In [None]:
# accuracy calculated from the confusion matrix
(1989 + 1580)/len(y)

In [None]:
# accuracy calculated by comparing predictions and actual
sum(pred == y)/len(y)
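
The four counts above were read off the heatmap by eye. As a sketch, they can also be extracted in code with `confusion_matrix` from `sklearn.metrics`. The short lists below are invented toy data standing in for `y` and `pred`; the label strings `"STAY"` and `"LEAVE"` are the values of `leave` used elsewhere in this lab.

```python
from sklearn.metrics import confusion_matrix

# Toy stand-ins for y (actual) and pred (predicted); invented for illustration
actual    = ["LEAVE", "LEAVE", "LEAVE", "STAY", "STAY", "STAY", "STAY", "LEAVE"]
predicted = ["LEAVE", "LEAVE", "STAY", "LEAVE", "STAY", "STAY", "STAY", "LEAVE"]

# With labels=["STAY", "LEAVE"], rows are actual and columns are predicted,
# so the flattened matrix reads TN, FP, FN, TP (LEAVE is the positive class)
tn, fp, fn, tp = confusion_matrix(actual, predicted, labels=["STAY", "LEAVE"]).ravel()

# Accuracy from the counts: correct predictions over all predictions
accuracy = (tp + tn) / len(actual)
print(f"TP={tp} FP={fp} TN={tn} FN={fn} accuracy={accuracy:.2f}")
```

In the lab, passing `y` and `pred` in place of the toy lists should reproduce the counts shown in the heatmap.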

# Calculate Profit

Using the above confusion matrix, how much profit (revenue - costs) could be expected with the following costs and benefits?

For MegaTelCo we will assume:
- benefit (revenue) = 800
- cost = 200  

TPs are a benefit; FPs are a cost. Again, we ignore those predicted to stay (TN and FN).

Note that the cost-benefit numbers will be different in the AdviseInvest case!

Why are TPs a benefit? In the MegaTelCo scenario, these are customers who are predicted to leave and actually were going to leave. If your marketing campaign is successful, then you can convince them to stay, thereby saving the company money. (In the AdviseInvest scenario, TPs are customers that you have predicted will answer the phone and do answer, thus providing an opportunity for your sales reps to make a sale.)

**Assume you can save 100% of the people who were actually going to leave.**

Here is the calculation:

- Multiply TP (the true leavers) x 600 (benefit - cost).  These are the leave-leave people in the confusion matrix that you retained.
- Multiply FP (the false leavers) x 200 (cost). These are the leave-stay people in the confusion matrix: they were predicted to leave but would have stayed anyway, so the marketing cost brings no benefit.

In [None]:
1989 * (800-200) - 946 * 200

So, this strategy would show a profit of about 1 million. Of course, we need to compare this against the default strategy: not using a model at all and marketing to all customers.

What would profit be in that case?

In [None]:
sum(y=="LEAVE") * (800-200) - sum(y=="STAY") * 200

So, it would be more profitable to use the model for targeted marketing.
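
The two profit calculations above can be written as one cost-benefit matrix applied to the confusion-matrix counts. This is a sketch only: the dictionary layout and the `expected_profit` helper are illustrative names, not part of the lab, and the sketch keeps the lab's assumption that every contacted leaver is retained.

```python
# Confusion-matrix counts reported earlier in this lab
counts = {"TP": 1989, "FP": 946, "TN": 1580, "FN": 479}

benefit, cost = 800, 200  # MegaTelCo assumptions from this lab

# Value per customer in each cell when only predicted leavers are contacted
cb_targeted = {"TP": benefit - cost, "FP": -cost, "TN": 0, "FN": 0}

# Value per customer when everyone is contacted:
# all actual leavers (TP + FN) are retained, all actual stayers (FP + TN) are a cost
cb_everyone = {"TP": benefit - cost, "FN": benefit - cost, "FP": -cost, "TN": -cost}

def expected_profit(counts, cost_benefit):
    """Sum of count x value over the four confusion-matrix cells."""
    return sum(counts[cell] * cost_benefit[cell] for cell in counts)

profit_targeted = expected_profit(counts, cb_targeted)
profit_everyone = expected_profit(counts, cb_everyone)
print(profit_targeted, profit_everyone, profit_targeted - profit_everyone)
```

Writing the costs and benefits as a matrix makes it easy to swap in different values, for example for the AdviseInvest case, without changing the calculation itself.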

# Functions

- `pd.read_csv()`: Reads a CSV file into a pandas DataFrame
- `dropna()`: Removes rows with missing values from a DataFrame
- `drop()`: Removes specified labels from rows or columns in a DataFrame
- `LabelEncoder()`: Encodes target labels with numeric values
- `fit_transform()`: Fits label encoder and transforms the data
- `DecisionTreeClassifier()`: Creates a decision tree classifier
- `fit()`: Builds a decision tree classifier from the training set (`X`, `y`)
- `predict()`: Predicts class labels for samples in `X`
- `ConfusionMatrixDisplay.from_predictions()`: Plots a confusion matrix heatmap from actual and predicted labels
- `sum()`: Returns the sum of a sequence of numbers
- `len()`: Returns the number of items in an object
