
# Get and Examine Our Data

In [None]:
# modules we will need
import matplotlib.pyplot as plt
import pandas as pd
from sklearn import metrics
from sklearn import tree
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
# export_text lives in sklearn.tree (the sklearn.tree.export module was removed)
from sklearn.tree import export_text

# get raw data and put in data frame
url = "https://drive.google.com/uc?export=download&id=1Okb4RuShkQ0-dxMj0xeBWnrmq1uIsuOw"
cars_raw = pd.read_csv(url)
cars_raw.head(n=10)

In [None]:
# examine data
cars_raw.describe(include='all')

In [None]:
# base rates of acceptable / unacceptable
cars_raw['acceptability'].value_counts(normalize=True)

# Training/Test Split and Building the Tree

![Training and Test Split](https://data-flair.training/blogs/wp-content/uploads/sites/2/2018/08/1-16.png)

We will break our data into training and test sets.

The training set is used to build the model: which X values explain our Y?

The test set lets us check the model against data it has never “seen,” which gives us an estimate of its performance on future data.

Other approaches use cross-validation and validation sets so that we can tune models without compromising the independence of the test data (but we won’t go there right now).
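
To make the split concrete before applying it to the cars data, here is a minimal sketch on a tiny made-up data frame. The frame `toy` and its columns `x` and `label` are invented for illustration only; the call to `train_test_split` uses the same arguments as the cells that follow.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# toy data (made up for illustration): 10 rows, one predictor, one target
toy = pd.DataFrame({
    "x": range(10),
    "label": ["acc", "unacc"] * 5,
})

X_toy = toy.drop(columns="label")
y_toy = toy["label"]

# 70% training / 30% test, with a seed so the split is repeatable
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=123
)

print(len(X_tr), "training rows;", len(X_te), "test rows")
# the two sets share no rows, so the test set is unseen during training
print("rows in both sets:", set(X_tr.index) & set(X_te.index))
```

Because the row indexes of the two sets never overlap, any score computed on the test rows reflects data the model did not train on.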


In [None]:
# create a separate copy - usually need to massage the data
cars_clean = cars_raw.copy()

# predictor variables - all but column called acceptability
X = cars_clean.drop("acceptability", axis=1)
# target variable
y = cars_clean["acceptability"]

# split into training (70%) and test (30%) sets with seed value for reproducibility
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=123)

# decision tree classifier with max depth to avoid overfitting
simple_tree = DecisionTreeClassifier(max_depth=3, criterion='gini')
# train decision tree classifier
simple_tree = simple_tree.fit(X_train,y_train)

# Oh no! What happened?

The scikit-learn implementation of decision trees can’t handle character (string) values, so we need to “one-hot” encode them!
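
As a rough illustration of what one-hot encoding does, the sketch below applies `pd.get_dummies` to a three-row made-up frame. The values and the frame name `toy` are assumptions for the example, not rows from the cars file.

```python
import pandas as pd

# toy data (made up for illustration): one numeric and one character column
toy = pd.DataFrame({
    "persons": [2, 4, 4],
    "safety_rating": ["low", "high", "med"],
})

# each distinct value of safety_rating becomes its own indicator column
encoded = pd.get_dummies(toy, columns=["safety_rating"])

print(list(encoded.columns))
print(encoded)
```

The original character column disappears and is replaced by one indicator column per category, named `<column>_<value>`; numeric columns pass through unchanged.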


In [None]:
# see current data types
X.dtypes

In [None]:
### TRY AGAIN ###

# create a separate copy - usually need to massage the data
cars_clean = cars_raw.copy()

### CLEAN DATA THIS TIME ###
# one-hot encode character variables
cars_clean = pd.get_dummies(cars_clean,columns=["purchase_cost","maint_cost","trunk_size","safety_rating"])

# predictor variables - all but column called acceptability
X = cars_clean.drop("acceptability", axis=1)
# target variable
y = cars_clean["acceptability"]

In [None]:
# what does data frame look like?
X.head()

In [None]:
# see data types now
X.dtypes

In [None]:
# okay, let's try to build our tree again

# split into training (70%) and test (30%) sets with seed value for reproducibility
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=123)

# decision tree classifier with max depth to avoid overfitting
simple_tree = DecisionTreeClassifier(max_depth=3, criterion='gini')
# train decision tree classifier
simple_tree = simple_tree.fit(X_train,y_train)

Older versions of scikit-learn’s decision trees do not accept null (NaN) values (native support for missing values arrived in version 1.3), so here we get rid of them!
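
The sketch below shows the find-then-drop pattern on a small made-up frame. The column names `doors` and `persons` and their values are invented for illustration.

```python
import numpy as np
import pandas as pd

# toy data (made up for illustration) with two missing values
toy = pd.DataFrame({
    "doors": [2, np.nan, 4, 4],
    "persons": [2, 4, np.nan, 4],
})

# count nulls per column
print(toy.isna().sum())

# drop every row that contains at least one null
toy_clean = toy.dropna()
print(len(toy), "rows before;", len(toy_clean), "rows after")
```

Note that `dropna()` removes whole rows, so a frame with nulls scattered across many columns can shrink considerably. Filling in values (imputation) is an alternative, but it is outside the scope of this walkthrough.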

In [None]:
# where are those pesky nulls?
X.isna().sum()

In [None]:
# one more time

# create a separate copy - usually need to massage the data
cars_clean = cars_raw.copy()

# one-hot encode character variables
cars_clean = pd.get_dummies(cars_clean,columns=["purchase_cost","maint_cost","trunk_size","safety_rating"])
### NEW CLEANING STEP ###
# drop na (null values)
cars_clean = cars_clean.dropna()

# predictor variables - all but column called acceptability
X = cars_clean.drop("acceptability", axis=1)
# target variable
y = cars_clean["acceptability"]

X.isna().sum()

In [None]:
# try to build our tree one last time

# split into training (70%) and test (30%) sets with seed value for reproducibility
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=123)

# decision tree classifier with max depth to avoid overfitting
simple_tree = DecisionTreeClassifier(max_depth=3, criterion='gini')
# train decision tree classifier
simple_tree = simple_tree.fit(X_train,y_train)

In [None]:
# plot it
plt.figure(figsize=(2,2), dpi=600)
# the tree is already fitted, so plot it directly rather than fitting again
tree.plot_tree(simple_tree, feature_names=list(X.columns),
               class_names=['acceptable','unacceptable'], filled=True)
print(export_text(simple_tree, feature_names = list(X.columns)))


# Model Performance and Predictions
 

In [None]:
# predict the response for test dataset
y_pred = simple_tree.predict(X_test)
# how many were classified correctly
print("Simple Tree Accuracy:", metrics.accuracy_score(y_test, y_pred))


In [None]:
# a concept like precision requires that we specify a target class with pos_label
print("Precision:",metrics.precision_score(y_test, y_pred, pos_label="acc"))

In [None]:
print("Recall:",metrics.recall_score(y_test, y_pred, pos_label="acc"))

In [None]:
print("F1 Score:",metrics.f1_score(y_test, y_pred, pos_label="acc"))
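
To show what precision, recall, and F1 measure, the sketch below computes each one by hand from counts of true positives, false positives, and false negatives, then compares the results with scikit-learn. The two label lists are made up for illustration, and "acc" is treated as the positive class as in the cells above.

```python
from sklearn import metrics

# made-up true and predicted labels for illustration
y_true = ["acc", "acc", "acc", "unacc", "unacc", "unacc", "unacc", "acc"]
y_hat = ["acc", "unacc", "acc", "acc", "unacc", "unacc", "unacc", "acc"]

pairs = list(zip(y_true, y_hat))
tp = sum(t == "acc" and p == "acc" for t, p in pairs)  # correctly flagged as acc
fp = sum(t != "acc" and p == "acc" for t, p in pairs)  # flagged as acc but not acc
fn = sum(t == "acc" and p != "acc" for t, p in pairs)  # acc cases that were missed

precision = tp / (tp + fp)  # of everything predicted acc, how much really was
recall = tp / (tp + fn)     # of everything really acc, how much was found
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two

print("by hand:", precision, recall, f1)
print("sklearn:",
      metrics.precision_score(y_true, y_hat, pos_label="acc"),
      metrics.recall_score(y_true, y_hat, pos_label="acc"),
      metrics.f1_score(y_true, y_hat, pos_label="acc"))
```

Changing `pos_label` to the other class swaps which errors count as false positives and false negatives, so the three scores generally change with it.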

In [None]:
# set up new car examples for predictions
# p = persons, m = maintenance very high, s = safety rating low
#              p               m         s
new_car1 = [[2,2,0,0,0,1,0,0,0,1,1,0,1,0,0,0]]

pred_for_new_car1 = simple_tree.predict(new_car1)
print('Car 1 acceptability prediction: ' + str(pred_for_new_car1[0]))

In [None]:
# p = persons, m = maintenance very high, s = safety rating low
#              p               m         s
new_car2 = [[2,4,0,0,0,1,0,0,0,0,1,0,1,0,0,0]]

pred_for_new_car2 = simple_tree.predict(new_car2)
print('Car 2 acceptability prediction: ' + str(pred_for_new_car2[0]))
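
Hand-typed lists of 0s and 1s like the ones above only work if every value lands in the right column position. One possible alternative, sketched here under assumptions, is to describe the new car in its original form, one-hot encode it, and align it to the training columns. The list `train_columns` is a made-up stand-in for `list(X.columns)`, and the column names are hypothetical.

```python
import pandas as pd

# hypothetical stand-in for list(X.columns) after one-hot encoding
train_columns = [
    "persons",
    "safety_rating_high",
    "safety_rating_low",
    "safety_rating_med",
]

# describe the new car in its original (pre-encoding) form
new_car = pd.DataFrame({"persons": [4], "safety_rating": ["low"]})

# encode, then align to the training columns;
# categories that do not appear in the new row are filled with 0
new_car_encoded = (
    pd.get_dummies(new_car, columns=["safety_rating"])
    .reindex(columns=train_columns, fill_value=0)
    .astype(int)
)

print(new_car_encoded)
```

With the real data, the aligned frame could then be passed to `simple_tree.predict(...)` in place of the hand-built list, assuming `train_columns` is taken from the `X` used for training.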

In [None]:
# create a new third car and make a prediction


# Exercise #1

You have been asked to predict whether or not a movie is likely to receive a high score on IMDB. 

You will be working with a file of IMDB data that you download from [URL]. You will be building a model to predict the values in the *imdb_score_high* column. A “1” designates a high-scoring movie and a “0” designates a movie that does not have a high score.

Build a decision tree model on a training set taken from the data and evaluate the performance of your model against the test set. Determine your metrics for accuracy, precision, recall, and F1 score.

Finally, tell us whether the following movie would have earned a high score or not.

```
num_critic_for_reviews	86
duration	130
director_facebook_likes	39
lead_actor_facebook_likes	2000
gross	9589875
num_voted_users	16673
cast_total_facebook_likes	5162
studio	Cosmic
facenumber_in_poster	0
num_user_for_reviews	45
budget	41000000
title_year	2008
aspect_ratio	2.35
movie_facebook_likes	0
```


