# SLU16 - Data Sufficiency and Selection
### Exercise notebook

In [1]:
import pandas as pd
import numpy as np
from sklearn.model_selection import learning_curve
from sklearn.feature_selection import mutual_info_classif
from sklearn.tree import DecisionTreeClassifier
from sklearn.tree import plot_tree
from sklearn.metrics import roc_auc_score
from matplotlib import pyplot as plt
from sklearn.preprocessing import RobustScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from utils import plot_learning_curve
import inspect

from hashlib import sha1 # just for grading purposes
import json

def _hash(obj):
    if type(obj) is not str:
        obj = json.dumps(obj)
    return sha1(obj.encode()).hexdigest()

# Context 
As you've learned, it's very important that data scientists have good domain knowledge of the field they work in, so that they can recognize unexpected effects and use their world model to choose features. 

So... to make sure we're all on the same level going into the exercises, we're going to distinguish between young and adult abalones. 

What are abalones, you ask? These cool things: 
![](https://nnimgt-a.akamaihd.net/transform/v1/crop/frm/Jesinta.Burton/30bc51dc-c571-4944-8dff-a7b5d0c14ff4.jpg/r0_0_728_409_w1200_h678_fmax.jpg)

For reasons which are frankly beyond me, there are people who know a lot about detecting the age of abalones. 

You will do this with machine learning. 

To make matters worse, your instructor is evil, and has added nonsensical random features. 

### Data
The target is `adult`, and is 0 when the abalone is a child, 1 when it's an adult. 

In [2]:
df = pd.read_csv('data/abalone.csv')
df = pd.get_dummies(df)
df.head(2)

| Unnamed: 0 | adult | Viscera weight | Coarse-grained Hormones | Diameter | Length | Phosphorylation | Ectopic relationships | Height | Whole weight | Shell weight | Shucked weight | Sex_F | Sex_I | Sex_M |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1.0 | 0.101 | 0.47 | 0.365 | 0.455 | 0.073 | 0.989 | 0.095 | 0.514 | 0.15 | 0.2245 | 0 | 0 | 1 |
| 1 | 0.0 | 0.0485 | 0.697 | 0.265 | 0.35 | 0.655 | 0.119 | 0.09 | 0.2255 | 0.07 | 0.0995 | 0 | 0 | 1 |



# Exercise 1 - find the nonsense 

There are 3 features which are just random. Without using any model, find out which ones they are. 

To determine this, use:
1. Pearson correlation 
2. mutual information (`mutual_info_classif`)

We don't really care about the intermediate steps, but you should probably visualize these in whatever way you like.  

_Hint: you can use `display(<something>)` if you want to force Jupyter to display a series._  
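If the two measures are new to you, here is a minimal sketch on made-up data (not the abalone set, so no spoilers). The `toy_*` names, the column names and the sample size are all assumptions chosen for illustration. It shows one possible way to score every feature against a target with both measures; a feature that is pure noise should land near zero on both.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import mutual_info_classif

rng = np.random.default_rng(0)
n = 500

# Toy data (hypothetical): one feature that depends on the target, one that does not.
toy_y = pd.Series(rng.integers(0, 2, size=n), name="target")
toy_X = pd.DataFrame({
    "signal": toy_y + rng.normal(scale=0.5, size=n),  # built from the target plus noise
    "noise": rng.uniform(size=n),                      # independent of the target
})

# Absolute Pearson correlation of each feature with the target, smallest first.
toy_pearson = toy_X.corrwith(toy_y).abs().sort_values()

# Mutual information between each feature and the target, smallest first.
toy_mi = pd.Series(
    mutual_info_classif(toy_X, toy_y, random_state=0),
    index=toy_X.columns,
).sort_values()

print(toy_pearson)
print(toy_mi)
```

Sorting both series in ascending order puts the least informative features at the top, which is usually the quickest way to eyeball suspects.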

In [None]:
# X = ... 
# y = ... 

# pearson_corr = ...
# something something 

# mutual_info = ... 
# something something 

# nonsense_features = [first, second, third]  (feature names only, the order does not matter)

# YOUR CODE HERE
raise NotImplementedError()

In [None]:
### BEGIN TESTS
assert _hash(sorted(nonsense_features)) == '1f2779dbe1c037234cba7a7f7f303bee81757cc1'
print('Great success!')
### END TESTS

# Exercise 2 - observe the tree 

Yay! Time to look at trees. 

To pass this exercise, you will make a function called `train_and_plot_tree` that will do the following: 

1. Fit a tree with `max_depth` of 3, and `min_samples_split` of 20 
2. Plot that tree, in a way that clearly shows the feature names, and the percentage of adults in each node. 
3. Return the plot (just assign the output of the plotting function to a variable and return it for evaluation) 
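As a reminder of how tree plotting works, here is a small sketch on made-up data. The `toy_*` names, the column names and the class labels are assumptions for illustration, and the tree settings are deliberately different from the ones the exercise asks for. It shows how `plot_tree` can display feature names, class names and proportions, and that its return value is a list of text annotations.

```python
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
from sklearn.tree import DecisionTreeClassifier, plot_tree

rng = np.random.default_rng(0)

# Toy data (hypothetical): the label is fully determined by the 'size' column.
toy_X = pd.DataFrame({
    "size": rng.uniform(size=200),
    "noise": rng.uniform(size=200),
})
toy_y = (toy_X["size"] > 0.5).astype(int)

toy_tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(toy_X, toy_y)

fig, ax = plt.subplots(figsize=(8, 4))
toy_plot = plot_tree(
    toy_tree,
    feature_names=list(toy_X.columns),  # show column names instead of x[0], x[1]
    class_names=["small", "large"],     # labels for classes 0 and 1, in that order
    proportion=True,                    # show proportions instead of raw counts
    ax=ax,
)

# plot_tree returns a list of matplotlib annotations, one per drawn node.
print(toy_plot[0].get_text())
```

Because the return value is a plain list, it can be assigned to a variable and returned from a function like any other object.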

In [None]:
def train_and_plot_tree(X, y): 

    # YOUR CODE HERE
    raise NotImplementedError()
    return my_plot 


tree_plot = train_and_plot_tree(X, y);
tree_plot;

In [None]:
sig = inspect.signature(train_and_plot_tree)
assert set(sig.parameters.keys()) == {'X', 'y'}, 'Do not change the signature!'  
all_text = ''.join([tree_plot[i].get_text() for i in range(len(tree_plot))])
assert 'Shell weight' in all_text, 'Your feature names seem weird'
assert 'child' in all_text, 'Did you make the right labels for class names?'
first_node_feature = tree_plot[0].get_text().split('<')[0].strip()
assert _hash(first_node_feature) == 'a0a91ccd0f0074dd419b7750263b9fbe107e7c86', 'Unexpected first node'
assert len(tree_plot) == 15, 'The tree seems to have the wrong size'
node_12 = tree_plot[12].get_text()
assert 'gini = 0.028' in node_12, 'Are you sure you configured the tree correctly?'
assert 'adult' in node_12, 'We want you to have the target labels in the plot'
assert '0.986' in node_12, 'Do you have the proportions in the nodes?'
print('Great success!')

# Exercise 3: model-based feature importances (linear) 
You will fit a logistic regression to find the 5 features with the largest coefficients. 
Note that the coefficients can be both positive and negative, and you care about "the biggest magnitude". 

We will take care of the normalization for you. _(If you ever train a logistic regression without normalizing the features I will place gummy bears in your lasagna. Consider yourself warned.)_
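To make "biggest magnitude" concrete, here is a sketch on made-up data. The `toy_*` names, the column names and the coefficients used to generate the labels are assumptions for illustration. It shows one way to rank features by the absolute value of their logistic regression coefficient, so that a strongly negative coefficient is not overlooked.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import RobustScaler

rng = np.random.default_rng(0)
n = 500

# Toy data (hypothetical): 'strong_neg' pushes the label down, 'weak_pos' nudges it up.
toy_X = pd.DataFrame({
    "strong_neg": rng.normal(size=n),
    "weak_pos": rng.normal(size=n),
    "noise": rng.normal(size=n),
})
toy_logits = -3.0 * toy_X["strong_neg"] + 0.5 * toy_X["weak_pos"]
toy_y = (toy_logits + rng.logistic(size=n) > 0).astype(int)

# Scale first, then fit with default parameters.
toy_X_normed = pd.DataFrame(RobustScaler().fit_transform(toy_X), columns=toy_X.columns)
toy_lr = LogisticRegression().fit(toy_X_normed, toy_y)

# coef_ has shape (1, n_features) for a binary problem, so take the single row.
toy_coefs = pd.Series(toy_lr.coef_[0], index=toy_X_normed.columns)

# Rank by absolute value, largest first.
toy_top_2 = toy_coefs.abs().sort_values(ascending=False).head(2).index.tolist()

print(toy_coefs)
print(toy_top_2)
```

Sorting the raw coefficients instead of their absolute values would push `strong_neg` to the bottom of the list, even though it is the most influential feature here.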

In [None]:
rs = RobustScaler()  # just scaling, because I'm nice. 
X_normed = pd.DataFrame(rs.fit_transform(X), 
                        columns=X.columns)  # remember this? cool huh! 


# As before, we just want the names of the features, in a list. 
# From now it's up to you. Use default parameters on the logistic regression. 
# something (~ 5 rows) 
# top_5_by_magnitude_linear = ... 

# YOUR CODE HERE
raise NotImplementedError()

print(sorted(top_5_by_magnitude_linear))

In [None]:
assert len(top_5_by_magnitude_linear) == 5 
assert _hash(sorted(top_5_by_magnitude_linear)) == 'f814d06f92beab782a3d1e0d0d9fe3098520c2b2'
print('Great success!')

# Exercise 4: model-based feature importances (non-linear) 
Oh, you made it! Good. Now for non-linear. 

Train a Random Forest, with the following parameters: 
* n_estimators = 50 
* max_depth = 2
* min_samples_split = 50 
* random_state = 1000
* n_jobs = -1  (optional, but speeds things up)

Then use it to get feature importances. Use the non-normalized features. 

As before, get the top 5 features by importance.
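As a refresher, here is a sketch of reading importances from a fitted forest, again on made-up data. The `toy_*` names, the column names and the hyperparameters are assumptions for illustration and are not the ones requested above.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
n = 500

# Toy data (hypothetical): only 'signal' carries information about the label.
toy_X = pd.DataFrame({
    "signal": rng.uniform(size=n),
    "noise_a": rng.uniform(size=n),
    "noise_b": rng.uniform(size=n),
})
toy_y = (toy_X["signal"] > 0.5).astype(int)

toy_rf = RandomForestClassifier(n_estimators=50, max_depth=2, random_state=0)
toy_rf.fit(toy_X, toy_y)

# feature_importances_ is aligned with the columns of the training data.
toy_importances = pd.Series(toy_rf.feature_importances_, index=toy_X.columns)
toy_top_1 = toy_importances.sort_values(ascending=False).head(1).index.tolist()

print(toy_importances.sort_values(ascending=False))
print(toy_top_1)
```

Unlike linear coefficients, these importances are never negative, so no absolute value is needed before sorting.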

In [None]:
# rf = ... 

# something (~5 rows)

# top_5_by_importance_random_forest = ...
# YOUR CODE HERE
raise NotImplementedError()

print(sorted(top_5_by_importance_random_forest))

In [None]:
assert len(top_5_by_importance_random_forest) == 5
assert _hash(sorted(top_5_by_importance_random_forest)) == 'bbc12adaef06b61e02cb766182fab945577633b4'
print('Great success!')

# Exercise 5: learning curves 

Do we have enough data, or should we go collect more abalones? Let's find out with learning curves! 

Using the random forest you already initialized, do the following: 

1. Define a numpy array of train_sizes, from 10% of the data to 100%, in increments of 10% (0.1, 0.2, 0.3... etc) 


2. Get the learning curve data, with the following configuration:
    - classifier: your old random forest from exercise 4 
    - metric: use area under the ROC curve as your metric 
    - use the train sizes array you just created
    - all features, not normalized 
    - cv = 5 
    - random state = 1000 (needed to pass the grader) 
    - n_jobs = -1 (optional, but faster) 

As with the learning notebooks, you should save the output to `train_sizes_abs`, `train_scores` and `test_scores`. 

3. Plot it! _(feel free to use plot_learning_curve that we used in the learning notebook, but remember that's custom code)_ 
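The steps above can be sketched on made-up data as follows. The `toy_*` names, the synthetic dataset from `make_classification` and the forest settings are assumptions for illustration, so the numbers will not match the grader. The sketch shows the shape of what `learning_curve` returns: the absolute train sizes, plus one row of scores per train size and one column per cross-validation fold.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import learning_curve

# Toy data (hypothetical stand-in for the real features and target).
toy_X, toy_y = make_classification(n_samples=300, n_features=6, random_state=0)

# Ten fractions of the available training data: 0.1, 0.2, ..., 1.0.
toy_train_sizes = np.linspace(0.1, 1.0, 10)

toy_rf = RandomForestClassifier(n_estimators=10, max_depth=2, random_state=0)

toy_sizes_abs, toy_train_scores, toy_test_scores = learning_curve(
    toy_rf,
    toy_X,
    toy_y,
    train_sizes=toy_train_sizes,
    scoring="roc_auc",  # area under the ROC curve
    cv=5,
    random_state=0,
)

# Average over the folds (axis=1) to get one point per train size.
print(toy_sizes_abs)
print(toy_train_scores.mean(axis=1).round(3))
print(toy_test_scores.mean(axis=1).round(3))
```

If the averaged test score is still climbing at the largest train size, more data would probably help; if it has flattened out, it probably would not.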


In [None]:
# train_sizes = ...   (10% increments, starting at 10%)
# train_sizes_abs, train_scores, test_scores (get the data, no plotting here)

# YOUR CODE HERE
raise NotImplementedError()

In [None]:
e1 = "Your train scores don't look right. Did you use the right features? Maybe check for categoricals, which can cause issues."
# `np.nan not in array` is always True because NaN never compares equal, so check with isnan instead
assert not np.isnan(train_scores).any(), e1 
# compare floats with a tolerance rather than exact equality
assert np.isclose(train_sizes.sum(), 5.5), 'Are your train sizes correct?'
assert np.isclose(train_sizes.mean(), .55), 'Are your train sizes correct?'
assert len(train_sizes) == 10, 'Are your train sizes correct?'
assert np.isclose(train_sizes_abs.mean(), 1837.1), 'Are your train sizes abs correct?'
assert round(pd.DataFrame(train_scores).mean().median(), 2) == 0.94, 'Are your train scores correct?'
assert round(pd.DataFrame(test_scores).median().quantile(.3), 2) == 0.93, 'Are your test scores correct?'
print('Great success!')