# Day 7 Lab, IS 4487

The purpose of this lab is to prepare you to complete today's project quiz. Here are the questions you need to be able to answer.

- Fit a tree model of the target using all the predictors, then create a visualization of the tree and identify the top 3 most important predictors in this model.
    
- How do these models compare to majority class prediction?
    
- How will you use a classification model as part of a solution to the AdviseInvest case?



## Load Libraries


In [None]:
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.tree import plot_tree
import numpy as np
from sklearn.preprocessing import LabelEncoder # for label encoding
from sklearn.tree import DecisionTreeClassifier, export_graphviz # Import Decision Tree Classifier
from sklearn.model_selection import train_test_split # Import train_test_split function
from sklearn import metrics #Import scikit-learn metrics module for accuracy calculation
from sklearn import tree


## Get data


In [None]:
mtc = pd.read_csv("https://raw.githubusercontent.com/jefftwebb/is_4487_base/dd870389117d5b24eee7417d5378d80496555130/Labs/DataSets/megatelco_leave_survey.csv")

## Clean data


In [None]:
# filter rows
mtc_clean = mtc[(mtc['house'] > 0) & (mtc['income'] > 0) & (mtc['handset_price'] < 1000)]

# remove NAs
mtc_clean = mtc_clean.dropna()

Rather than encoding the character variables as categories, we will convert them to numbers below, because the tree algorithm requires numeric inputs.

## Fit a full tree model

Use all of the variables.  We'll call this the "full tree." We will use `max_depth = 5` to keep the tree relatively simple.



In [None]:
# split the dataframe into independent (X) and dependent (y) attributes
X = mtc_clean.drop(['id', 'leave'], axis=1)
y = mtc_clean['leave']

# Convert categorical variables to numeric
le = LabelEncoder()
for column in X.select_dtypes(include=['object']):
    X[column] = le.fit_transform(X[column])

# initialize the tree
full_tree = DecisionTreeClassifier(criterion="entropy", max_depth = 5)

# fit the decision tree classifier to the data
full_tree = full_tree.fit(X, y)


Explanation of code:

- `X = mtc_clean.drop(['id', 'leave'], axis=1)`: Creates feature set `X` by removing `id` and `leave` columns
- `y = mtc_clean['leave']`: Sets target variable `y` as the `leave` column
- `le = LabelEncoder()`: Initializes a LabelEncoder object for converting categorical variables to numeric
- `for column in X.select_dtypes(include=['object'])`: Loops through all object (string) columns in `X`
- `X[column] = le.fit_transform(X[column])`: Applies label encoding to each categorical column
- `full_tree = DecisionTreeClassifier(criterion="entropy", max_depth = 5)`: Initializes a decision tree that splits using the entropy criterion and grows to a depth of at most 5
- `full_tree = full_tree.fit(X, y)`: Fits the decision tree model on the entire feature set `X` and target `y`
- Note: This process prepares the data, encodes categorical variables, and trains a decision tree using all available features
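
To see what the label encoding step does to a single column, here is a minimal sketch on a toy column. The column name `usage_level` and its values are made up for illustration and are not taken from the MegaTelco data.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy column (hypothetical values, for illustration only)
toy = pd.DataFrame({"usage_level": ["low", "high", "medium", "low"]})

le = LabelEncoder()
toy["usage_code"] = le.fit_transform(toy["usage_level"])

# le.classes_ holds the original labels in sorted order;
# each label's position in that array is its integer code
mapping = {label: int(code) for code, label in enumerate(le.classes_)}
print(mapping)
print(toy)
```

Note that the codes follow the sorted (alphabetical) order of the labels, so "high" gets a smaller code than "low". The numbers are identifiers rather than a meaningful ranking, which is worth remembering when reading split thresholds in the tree.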

## Visualize the full tree

In [None]:
plt.figure(figsize=(20,10))

plot_tree(full_tree,
          feature_names=list(X.columns),
          class_names=[str(c) for c in full_tree.classes_], # labels in the order the model stores them (sorted)
          filled=True,
          max_depth=2) # for legibility

plt.show()

What are the most important predictors of churn based on this model?
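
One way to approach this question, besides reading the top splits in the plot, is to rank the features by the fitted tree's `feature_importances_` attribute. The sketch below uses a hypothetical helper, `top_predictors`, and a toy dataset invented for illustration; it is one possible approach, not the only way to judge importance.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

def top_predictors(model, feature_names, n=3):
    """Return the n features with the largest impurity-based importance.

    Hypothetical helper for this lab; `model` is any fitted tree.
    """
    importances = pd.Series(model.feature_importances_, index=list(feature_names))
    return importances.sort_values(ascending=False).head(n)

# Toy data (made up): `signal` separates the classes, `noise` is constant
toy_X = pd.DataFrame({
    "signal": [0, 0, 0, 0, 1, 1, 1, 1],
    "noise":  [5, 5, 5, 5, 5, 5, 5, 5],
})
toy_y = ["STAY"] * 4 + ["LEAVE"] * 4

toy_tree = DecisionTreeClassifier(criterion="entropy", max_depth=5, random_state=0)
toy_tree.fit(toy_X, toy_y)

ranked = top_predictors(toy_tree, toy_X.columns, n=2)
print(ranked)
```

In this lab the equivalent call would be `top_predictors(full_tree, X.columns)`, assuming the cells above have been run. Importances sum to 1, and a feature the tree never splits on gets 0.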

## Check Accuracy

In [None]:
pred = full_tree.predict(X = X)

sum(pred == mtc_clean['leave']) / len(pred)

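This accuracy is **much** better than the simple model from yesterday. Keep in mind that it is in-sample accuracy: the predictions are made on the same rows used to fit the tree, so it may overstate how well the model would do on new customers.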
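
To compare the tree against majority class prediction, compute the accuracy you would get by always predicting the most frequent class. The following is a minimal sketch with a hypothetical helper, `majority_class_accuracy`, shown on a toy target invented for illustration.

```python
import pandas as pd

def majority_class_accuracy(y):
    """Accuracy obtained by always predicting the most frequent class in y.

    Hypothetical helper for this lab.
    """
    # Proportion of each class; the largest one is the baseline accuracy
    return pd.Series(y).value_counts(normalize=True).max()

# Toy target (made up): 3 of 5 customers stay
toy_leave = pd.Series(["STAY", "STAY", "STAY", "LEAVE", "LEAVE"])

baseline = majority_class_accuracy(toy_leave)
print(baseline)
```

In this lab the baseline would be `majority_class_accuracy(mtc_clean['leave'])`, assuming the cleaning cell above has been run. A model is only useful for classification to the extent that its accuracy exceeds this number.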