In [None]:
import pandas as pd
import numpy as np
import seaborn as sns

from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score

In [None]:
ncaa = pd.read_csv("march-madness.csv")

A note about the data: each row represents a school-season.

* Overall_W is the number of Wins
* Overall_SRS is the StRength of Schedule that year. How difficult were your opponents?
* made_it is our target: Did the school make it to the March Madness Tournament?

So in 1995, the Air Force Academy had 8 wins, had a negative strength of schedule (below average), and didn't make it to the tournament.

In [None]:
ncaa.head()

A quick glance at the summary stats of the data with `describe`.

Question: What are the first and last years in the dataset?

In [None]:
ncaa.describe()

The plot below shows that wins and SRS tend to rise together, and that the teams that made the tournament cluster in the upper-right corner of the plot. There are some edge cases, though.

In [None]:
# Pass x and y by keyword; recent seaborn releases no longer accept them positionally.
sns.scatterplot(x='Overall_W', y='Overall_SRS', data=ncaa, hue='made_it')

Prepare our data for a two-variable logistic regression: how are overall wins and overall strength of schedule associated with the odds of making it into the tournament?

In [None]:
X = ncaa[['Overall_W','Overall_SRS']]
y = ncaa.made_it
X.head()

In [None]:
logreg = LogisticRegression()
logreg.fit(X, y)
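
One way to read the "odds" part of that question is to exponentiate the fitted coefficients. The block below is only a sketch on synthetic stand-in data, not the NCAA file: the names `demo_X`, `demo_y`, `demo_logreg`, and the `wins`/`srs` columns are made up for illustration. The same two closing lines should work on `logreg` and `X` above.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data (NOT the NCAA file): two features and a 0/1 target.
rng = np.random.default_rng(0)
demo_X = pd.DataFrame({
    "wins": rng.integers(5, 31, size=500),
    "srs": rng.normal(0, 8, size=500),
})
# The target is more likely to be 1 when both features are high.
true_logit = 0.3 * (demo_X["wins"] - 20) + 0.2 * demo_X["srs"]
demo_y = (rng.random(500) < 1 / (1 + np.exp(-true_logit))).astype(int)

demo_logreg = LogisticRegression(max_iter=1000)
demo_logreg.fit(demo_X, demo_y)

# exp(coefficient) is the multiplicative change in the odds for a
# one-unit increase in that feature, holding the other feature fixed.
odds_ratios = pd.Series(np.exp(demo_logreg.coef_[0]), index=demo_X.columns)
print(odds_ratios)
```

A value above 1 means the odds go up as the feature increases. Note that `LogisticRegression` applies L2 regularization by default, so the estimates are shrunk slightly toward zero.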

Now let's understand the predictions and look at the confusion matrix.

In [None]:
logregpred = logreg.predict(X)

# scikit-learn expects (y_true, y_pred): rows are the actual classes, columns the predictions.
confusion_matrix(y, logregpred)

In [None]:
accuracy_score(y, logregpred)
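
To make the layout of the confusion matrix concrete, here is a small sketch with made-up labels (`toy_true` and `toy_pred` are hypothetical, not taken from the data). scikit-learn's `confusion_matrix` takes the true labels first and the predictions second; rows then correspond to the true classes and columns to the predicted classes.

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Tiny made-up labels, for illustration only.
toy_true = [0, 0, 1, 1, 1]
toy_pred = [0, 1, 1, 1, 0]

# Rows are the true classes, columns are the predicted classes.
cm = confusion_matrix(toy_true, toy_pred)
tn, fp, fn, tp = cm.ravel()
print(cm)

# Accuracy is the share of predictions on the diagonal.
toy_accuracy = accuracy_score(toy_true, toy_pred)
print("accuracy:", toy_accuracy)
print("by hand: ", (tn + tp) / cm.sum())
```

Swapping the two arguments transposes the matrix, which silently exchanges false positives and false negatives. Accuracy is unaffected by the order.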

Use `DecisionTreeClassifier` from `sklearn` to fit a decision tree model!

Build a confusion matrix and give an accuracy score for the model.

Hint: the format is very similar to the logistic regression above.
    
Questions to ask your neighbor: What does `max_depth` do? How would you look this up?

In [None]:
dtree = DecisionTreeClassifier(max_depth = 3)
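
As a starting point for the `max_depth` question, the sketch below fits trees of different depths on synthetic stand-in data (not the NCAA file; `demo_X`, `demo_y`, and `results` are made-up names) and reports how large each tree grows. It is an illustration of the parameter, not a solution to the exercise.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (NOT the NCAA file): two features, noisy 0/1 target.
rng = np.random.default_rng(1)
demo_X = rng.normal(size=(400, 2))
noise = rng.normal(scale=0.5, size=400)
demo_y = ((demo_X[:, 0] + demo_X[:, 1] + noise) > 0).astype(int)

results = {}
for depth in (1, 3, None):  # None lets the tree grow until its leaves are pure
    demo_tree = DecisionTreeClassifier(max_depth=depth, random_state=0)
    demo_tree.fit(demo_X, demo_y)
    results[depth] = {
        "depth": demo_tree.get_depth(),
        "leaves": demo_tree.get_n_leaves(),
        "train_accuracy": demo_tree.score(demo_X, demo_y),
    }
    print(depth, results[depth])
```

Deeper trees ask more questions in a row, so they fit the training data more closely. Whether that helps on new data is a separate question.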

Which model is better? Discuss with your neighbor why.

## If there's time...

Add `year` back into the models. Did accuracy improve for either model?

Write a sentence explaining why I should have used `train_test_split` to split the data up before building the models.
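
For reference, a minimal sketch of what a split looks like, again on synthetic stand-in data rather than the NCAA file (`demo_X`, `demo_y`, and `deep_tree` are made-up names, and the 25% test size is an arbitrary choice). It compares accuracy on the rows the model saw with accuracy on rows it did not.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (NOT the NCAA file): two features, noisy 0/1 target.
rng = np.random.default_rng(2)
demo_X = rng.normal(size=(600, 2))
noise = rng.normal(scale=1.0, size=600)
demo_y = ((demo_X[:, 0] + demo_X[:, 1] + noise) > 0).astype(int)

# Hold out 25% of the rows; stratify keeps the class balance similar in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    demo_X, demo_y, test_size=0.25, random_state=0, stratify=demo_y
)

deep_tree = DecisionTreeClassifier(random_state=0)  # no depth limit
deep_tree.fit(X_train, y_train)

train_acc = deep_tree.score(X_train, y_train)  # rows the tree has seen
test_acc = deep_tree.score(X_test, y_test)     # rows the tree has not seen
print("train accuracy:", train_acc)
print("test accuracy: ", test_acc)
```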

## Homework

Make two scatterplots, similar to the one above, but color the dots by whether or not the model made a correct prediction: one plot for the logistic regression and one for the decision tree. Write a sentence explaining why these errors might have happened.

Look at the `fast-frugal.png` image in this directory. Pretty fancy, right? It is an example of a kind of tree used to help doctors make decisions about heart disease. Without worrying too much about what each node means, discuss why this graph would be preferable to the outputs of a logistic regression. How could the visual be improved?


## Bonus: Plotting Our Tree

The code below will generate a PDF (`ncaa.pdf`) with a visual representation of our tree model `dtree`. The tree must already be fitted.

Warning: you need both the `graphviz` Python package (installed via pip) and
the back-end Graphviz library installed on your computer.

Warning 2: putting imports at the bottom of your notebooks is bad practice.
I did it here to prevent errors on everyone else's machine.

We will make sure you can see this after class today.

In [None]:
import graphviz 
from sklearn import tree

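# dtree must be fitted first, e.g. dtree.fit(X, y), as in the exercise above.
# class_names takes one label per class, in ascending class order (0, then 1),
# not one label per row of y.
class_labels = ["didn't make it", "made it"]

dot_data = tree.export_graphviz(dtree,
                                feature_names=list(X.columns),
                                class_names=class_labels,
                                out_file=None, filled=True, rounded=True,
                                special_characters=True)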
graph = graphviz.Source(dot_data) 
graph.render("ncaa") 
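
If Graphviz is not available on your machine, scikit-learn's `export_text` prints the same tree structure as plain text with no extra installs. The sketch below uses synthetic stand-in data (not the NCAA file; `demo_X`, `demo_y`, `demo_tree`, and the `wins`/`srs` columns are made-up names). On the real data you would pass your fitted `dtree` and `list(X.columns)` instead.

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

# Synthetic stand-in data (NOT the NCAA file).
rng = np.random.default_rng(3)
demo_X = pd.DataFrame({
    "wins": rng.integers(5, 31, size=300),
    "srs": rng.normal(0, 8, size=300),
})
# Made-up rule: the target is 1 only when both features are high.
demo_y = ((demo_X["wins"] > 20) & (demo_X["srs"] > 0)).astype(int)

demo_tree = DecisionTreeClassifier(max_depth=3, random_state=0)
demo_tree.fit(demo_X, demo_y)

# One line per split or leaf, indented by depth.
tree_text = export_text(demo_tree, feature_names=list(demo_X.columns))
print(tree_text)
```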