# HW4 Trees, Forests, and Bag of Words

Official instructions:

https://www.cs.tufts.edu/comp/135/2020f/hw4.html

This is the *starter code* notebook.

In [1]:
import numpy as np
import pandas as pd
import os
import sys
import time

In [2]:
import sklearn.tree
import sklearn.linear_model
import sklearn.metrics
import sklearn.ensemble
import sklearn.model_selection  # needed for PredefinedSplit and GridSearchCV below

In [3]:
# From the HW4 starter code
from pretty_print_sklearn_tree import pretty_print_sklearn_tree

In [4]:
# Plotting utils
import matplotlib
import matplotlib.pyplot as plt

%matplotlib inline
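# Newer matplotlib releases renamed the 'seaborn' style to 'seaborn-v0_8'; use whichever is available
plt.style.use('seaborn-v0_8' if 'seaborn-v0_8' in plt.style.available else 'seaborn') # pretty matplotlib plots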

import seaborn as sns
sns.set('notebook', font_scale=1.25, style='whitegrid')

# Load all data from train/valid/test

In [5]:
# TODO fix to path on your local system
DATA_DIR = os.path.join("../data_product_reviews/")

### Load training

In [6]:
x_tr_df = pd.read_csv(os.path.join(DATA_DIR, 'x_train.csv.zip'))
y_tr_df = pd.read_csv(os.path.join(DATA_DIR, 'y_train.csv'))
x_tr_NF = np.minimum(x_tr_df.values, 1.0).copy()
y_tr_N = y_tr_df.values[:,0].copy()

print("Training data")
print("x_tr_NF.shape: %s" % str(x_tr_NF.shape))
print("y_tr_N.shape : %s" % str(y_tr_N.shape))
print("mean(y_tr_N) : %.3f" % np.mean(y_tr_N))

Training data
x_tr_NF.shape: (6346, 7729)
y_tr_N.shape : (6346,)
mean(y_tr_N) : 0.500
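
The `np.minimum(..., 1.0)` call in the loading cell caps every count at 1, so each feature becomes a binary indicator of whether a term appears in the review at all. A minimal sketch on a made-up count matrix (`toy_counts` is hypothetical and not part of the dataset):

```python
import numpy as np

# Hypothetical word-count matrix: 2 reviews x 4 vocabulary terms
toy_counts = np.array([[0, 3, 1, 0],
                       [2, 0, 0, 5]])

# Cap each count at 1.0 so features record presence/absence only
toy_binary = np.minimum(toy_counts, 1.0)
print(toy_binary)
```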


### Load validation set

In [7]:
x_va_df = pd.read_csv(os.path.join(DATA_DIR, 'x_valid.csv.zip'))
y_va_df = pd.read_csv(os.path.join(DATA_DIR, 'y_valid.csv'))

x_va_TF = np.minimum(x_va_df.values, 1.0).copy()
y_va_T = y_va_df.values[:,0].copy()

print("Validation data")
print("x_va_TF.shape: %s" % str(x_va_TF.shape))
print("y_va_T.shape : %s" % str(y_va_T.shape))
print("mean(y_va_T) : %.3f" % np.mean(y_va_T))

Validation data
x_va_TF.shape: (792, 7729)
y_va_T.shape : (792,)
mean(y_va_T) : 0.490


### Load test set 

In [8]:
x_te_df = pd.read_csv(os.path.join(DATA_DIR, 'x_test.csv.zip'))
y_te_df = pd.read_csv(os.path.join(DATA_DIR, 'y_test.csv'))

x_te_TF = np.minimum(x_te_df.values, 1.0).copy()
y_te_T = y_te_df.values[:,0].copy()

print("Heldout Test data")
print("x_te_TF.shape: %s" % str(x_te_TF.shape))
print("y_te_T.shape : %s" % str(y_te_T.shape))
print("mean(y_te_T) : %.3f" % np.mean(y_te_T))

Heldout Test data
x_te_TF.shape: (793, 7729)
y_te_T.shape : (793,)
mean(y_te_T) : 0.515


### Load vocabulary as a list of strings

In [9]:
vocab_list = x_tr_df.columns.tolist()

In [10]:
for word in vocab_list[:8]:
    print(word)
print("...")
for word in vocab_list[-8:]:
    print(word)

good
great
time
book
don't
work
i_have
read
...
never_get
i'd_like
loves_it
an_author
nomin
could_give
bad_but
gap


### Pack training and validation sets into big arrays (so we can use sklearn's hyperparameter search tools)

In [11]:
xall_LF = np.vstack([x_tr_NF, x_va_TF])
yall_L = np.hstack([y_tr_N, y_va_T])

In [12]:
valid_indicators_L = np.hstack([
    -1 * np.ones(y_tr_N.size), # -1 means never include this example in any test split
    0  * np.ones(y_va_T.size), #  0 means include in the first test split (we count starting at 0 in python)
    ])

In [13]:
# Create splitter object using Predefined Split
# Will be used later by all hyperparameter searches

my_splitter = sklearn.model_selection.PredefinedSplit(valid_indicators_L)
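
To see what the splitter does, here is a minimal sketch on made-up sizes (5 training rows and 3 validation rows; all `toy_*` names are hypothetical). It suggests that `PredefinedSplit` yields exactly one split, in which the rows marked `-1` are always used for training and the rows marked `0` form the held-out fold:

```python
import numpy as np
import sklearn.model_selection

# Hypothetical sizes: 5 training rows followed by 3 validation rows
toy_indicators = np.hstack([
    -1 * np.ones(5),  # -1: never placed in a held-out fold
    0 * np.ones(3),   #  0: placed in held-out fold number 0
    ])

toy_splitter = sklearn.model_selection.PredefinedSplit(toy_indicators)
toy_splits = list(toy_splitter.split())

# Only one (train, held-out) pair of index arrays is produced
toy_train_ids, toy_valid_ids = toy_splits[0]
print("number of splits:", len(toy_splits))
print("train row ids   :", toy_train_ids)
print("valid row ids   :", toy_valid_ids)
```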

# Problem 1: Decision Trees

## 1A: Train a simple tree with depth 3

In [14]:
simple_tree = sklearn.tree.DecisionTreeClassifier(
    max_depth=3, min_samples_split=2, min_samples_leaf=1, criterion='gini')

### **Fit the tree** 

**TODO Train on the training set** in the next coding cell

In [15]:
simple_tree.fit(None, None) #TODO

ValueError: This DecisionTreeClassifier estimator requires y to be passed, but the target y is None.

### **Print Tree** 

Use a helper function from the starter code

In [16]:
pretty_print_sklearn_tree(simple_tree, feature_names=vocab_list)

AttributeError: 'DecisionTreeClassifier' object has no attribute 'tree_'
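
Both errors above trace back to the placeholder `fit(None, None)` call: a tree only gains its `tree_` attribute after it has been fit on real arrays. Below is a minimal sketch on made-up data (`toy_x`, `toy_y`, and `toy_vocab` are hypothetical; in this notebook the training arrays are `x_tr_NF` and `y_tr_N`). It prints the fitted tree with sklearn's built-in `sklearn.tree.export_text`, used here only as a stand-in for the course's `pretty_print_sklearn_tree` helper:

```python
import numpy as np
import sklearn.tree

# Hypothetical two-word vocabulary and binary features for 6 reviews
toy_vocab = ['good', 'bad']
toy_x = np.array([[1, 0], [1, 1], [0, 1], [0, 0], [1, 0], [0, 1]], dtype=float)
toy_y = toy_x[:, 0].astype(int)  # label copies the 'good' indicator

toy_tree = sklearn.tree.DecisionTreeClassifier(
    max_depth=3, min_samples_split=2, min_samples_leaf=1,
    criterion='gini', random_state=101)

# Fitting requires an (N, F) feature array and a length-N label array
toy_tree.fit(toy_x, toy_y)

# After fitting, the tree structure exists and can be printed as text
tree_text = sklearn.tree.export_text(toy_tree, feature_names=toy_vocab)
print(tree_text)
```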

## 1B : Find best Decision Tree with grid search

In [17]:
# Construct the default predictor
# Any hyperparameters here may be overridden by the hyperparameter grid

tree = sklearn.tree.DecisionTreeClassifier(
    criterion='gini', min_samples_split=2, min_samples_leaf=1)

In [18]:
tree_hyperparameter_grid_by_name = dict(
    max_depth=[2, 8, 32, 128],
    min_samples_leaf=[1, 3, 9],
    random_state = [101],
    )

**TODO Build the Grid Search** in the next coding cell

Hint: See Lab for Grid Search: https://github.com/tufts-ml-courses/comp135-20f-assignments/blob/master/labs/day13_HyperparameterSearch.ipynb

Key Function: sklearn.model_selection.GridSearchCV

Key Requirements:

* Provide the above tree_hyperparameter_grid_by_name dictionary as the set of hyperparameters to search
* Set scoring='balanced_accuracy', since our target metric is balanced accuracy
* Set cv=my_splitter so you can use the predefined split we defined earlier.
* Set return_train_score=True (we want training set scores as well as test set scores)
* Set refit=False (we only want fits on x_tr not on x_all)
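
The requirements above can be sketched on made-up data as follows. This is an illustration under stated assumptions, not the official solution: every `toy_*` name and the tiny grid are hypothetical, and only the keyword arguments mirror the list above.

```python
import numpy as np
import pandas as pd
import sklearn.tree
import sklearn.model_selection

# Made-up data: 40 "training" rows followed by 20 "validation" rows
rng = np.random.RandomState(0)
toy_xall = rng.randint(0, 2, size=(60, 5)).astype(float)
toy_xall[:, 0] = np.arange(60) % 2     # feature 0 alternates 0, 1, 0, 1, ...
toy_yall = toy_xall[:, 0].astype(int)  # label simply copies feature 0

toy_fold = np.hstack([-1 * np.ones(40), 0 * np.ones(20)])
toy_splitter = sklearn.model_selection.PredefinedSplit(toy_fold)

toy_searcher = sklearn.model_selection.GridSearchCV(
    sklearn.tree.DecisionTreeClassifier(criterion='gini'),
    dict(max_depth=[1, 2], min_samples_leaf=[1, 3], random_state=[101]),
    scoring='balanced_accuracy',  # target metric
    cv=toy_splitter,              # single predefined train/valid split
    return_train_score=True,      # also record training-set scores
    refit=False)                  # do not refit on train+valid combined
toy_searcher.fit(toy_xall, toy_yall)

# One row per hyperparameter configuration
toy_results_df = pd.DataFrame(toy_searcher.cv_results_)
print(toy_results_df[['param_max_depth', 'param_min_samples_leaf',
                      'mean_train_score', 'mean_test_score']])
print(toy_searcher.best_params_)
```

With `refit=False` the searcher keeps `best_params_` (single-metric scoring) but never builds a `best_estimator_`, so the winning model still has to be reconstructed and fit separately.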

In [19]:
tree_grid_searcher = None # TODO

### Do the search!


In [20]:
start_time_sec = time.time()
tree_grid_searcher.fit(xall_LF, yall_L)
elapsed_time_sec = time.time() - start_time_sec

AttributeError: 'NoneType' object has no attribute 'fit'

### Build dataframe of results

Move the results of grid search into a nice pandas data frame.

In [21]:
tree_search_results_df = pd.DataFrame(tree_grid_searcher.cv_results_).copy()
print("Grid search of %3d configurations done after %6.1f sec" % (
    tree_search_results_df.shape[0], elapsed_time_sec))

AttributeError: 'NoneType' object has no attribute 'cv_results_'

### Display search results

This block makes a pretty-printed table of your grid search results.

In [22]:
pd.set_option('display.precision', 4)  # full option name; bare 'precision' is ambiguous in newer pandas
tree_keys = ['param_max_depth', 'param_min_samples_leaf']
tree_search_results_df.sort_values(tree_keys, inplace=True)
tree_search_results_df[tree_keys + ['mean_train_score', 'mean_test_score', 'rank_test_score', 'mean_fit_time']]

NameError: name 'tree_search_results_df' is not defined

In [23]:
print("Printing a dict of the best hyperparameters")
print(tree_grid_searcher.best_params_)

Printing a dict of the best hyperparameters


AttributeError: 'NoneType' object has no attribute 'best_params_'

### Build the best decision tree

**TODO Build the Best Tree** in the next coding cell

This is necessary so you have the specific best performing tree in your workspace.

Although you fit many trees in the search, they were not stored, so we need to recreate the best one.

Hint: Just feed the best hyperparameters as keyword args to construct the tree. Or see the lab about grid search.

In [24]:
best_tree = tree.set_params(None) #TODO
best_tree.fit(x_tr_NF, y_tr_N)

TypeError: set_params() takes 1 positional argument but 2 were given
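
The `TypeError` above occurs because `set_params` accepts only keyword arguments. One way to pass a dictionary of hyperparameters is to unpack it with `**`. A minimal sketch, where `toy_best_params` is a made-up stand-in for the dictionary a finished search reports in `best_params_`, and the data is hypothetical:

```python
import numpy as np
import sklearn.tree

# Made-up stand-in for the dict a completed grid search would report
toy_best_params = dict(max_depth=2, min_samples_leaf=3, random_state=101)

toy_tree = sklearn.tree.DecisionTreeClassifier(
    criterion='gini', min_samples_split=2, min_samples_leaf=1)

# ** unpacks the dict into keyword arguments; set_params returns the same object
toy_best_tree = toy_tree.set_params(**toy_best_params)

# Hypothetical data: 12 rows, label copies feature 0
toy_x = np.tile(np.array([[0, 1], [1, 0], [1, 1], [0, 0]], dtype=float), (3, 1))
toy_y = toy_x[:, 0].astype(int)
toy_best_tree.fit(toy_x, toy_y)

print(toy_best_tree.get_params()['max_depth'])
print(toy_best_tree.get_params()['min_samples_leaf'])
```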

### Interpret the best decision tree

In [25]:
best_tree.tree_

NameError: name 'best_tree' is not defined

In [26]:
pretty_print_sklearn_tree(best_tree, feature_names=vocab_list)

NameError: name 'best_tree' is not defined

# Problem 2: Random forest

In [27]:
forest = sklearn.ensemble.RandomForestClassifier(
    criterion='gini', min_samples_split=2, random_state=101)

## 2A: Best Random Forest via grid search

Follow the instructions and use what you learned in 1B to finish this step.

This block might take 2-10 minutes. (It takes about 2 minutes on staff MacBook laptops.)

If yours runs significantly longer, try this out on Google Colab instead.

In [28]:
forest = sklearn.ensemble.RandomForestClassifier(
    n_estimators=125,
    criterion='gini',
    max_depth=15,
    min_samples_split=2,
    min_samples_leaf=1)


In [29]:
forest_hyperparameter_grid_by_name = dict(
    max_features=[3, 10, 33, 100, 333],
    max_depth=[16, 32],
    min_samples_leaf=[1],
    n_estimators=[125],
    random_state=[101],
    )

In [32]:
# TODO construct a GridSearchCV object like you did above.

forest_searcher = None 

### Do the search!

In [31]:
#TODO

### Build dataframe of results

In [None]:
#TODO

### Display search results

In [None]:
#TODO

### Build the best random forest using the best hyperparameters found in 2A and train it.

This is necessary so you have the specific best performing forest in your workspace.


In [None]:
#TODO

## 2B & Figure 2 : Feature Importances 

Access the **feature_importances_** attribute of your trained forest to get a score for each term in our vocabulary.

A higher value of this score indicates the feature is "more important".

In one panel of Figure 2, display a list of the top 10 vocabulary words of your best forest with highest feature importance.

In another panel of Figure 2, display a list of 10 randomly chosen terms that have close-to-zero feature importance (use anything with importance less than 0.00001).
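
One possible way to pull both word lists out of a trained forest is sketched below on made-up data. All `toy_*` names are hypothetical, the six-word vocabulary is only borrowed for illustration, and the list sizes are shrunk from 10 to fit the toy problem. Only the `feature_importances_` attribute and the 0.00001 threshold come from the text above.

```python
import numpy as np
import sklearn.ensemble

# Hypothetical 6-term vocabulary and binary features for 200 reviews
toy_vocab = ['good', 'time', 'book', 'nomin', 'gap', 'read']
rng = np.random.RandomState(0)
n_rows = 200
toy_x = np.zeros((n_rows, 6))
toy_x[:, 0] = np.arange(n_rows) % 2                  # informative term
toy_x[:, 1:3] = rng.randint(0, 2, size=(n_rows, 2))  # noise terms
# Columns 3-5 stay all zero: terms that never appear cannot be split on
toy_y = toy_x[:, 0].astype(int)

toy_forest = sklearn.ensemble.RandomForestClassifier(
    n_estimators=25, random_state=101)
toy_forest.fit(toy_x, toy_y)
importances = toy_forest.feature_importances_

# Most important terms: sort by descending importance, keep the first few
top_ids = np.argsort(-importances)[:3]
top_words = [toy_vocab[i] for i in top_ids]

# Unimportant terms: random draw among those below the threshold
near_zero_ids = np.flatnonzero(importances < 0.00001)
chosen_ids = rng.choice(
    near_zero_ids, size=min(2, near_zero_ids.size), replace=False)
unimportant_words = [toy_vocab[i] for i in chosen_ids]

print("important  :", top_words)
print("unimportant:", unimportant_words)
```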

### Figure 2.2
**Sample Output** (Feel free to print all words and organize them in any software)

|**Important Words**|**Unimportant Words**|
|:-:|:-:|
|I1 |  U1  |
|I2 |  U2  |
|I3 |  U3  |
|I4 |  U4  |
|I5 |  U5  |
|I6 |  U6  |
|I7 |  U7  |
|I8 |  U8  |
|I9 |  U9  |
|I10|  U10 |

# Problem 3: Comparison of models

### 3A Implementation **TODO**:

Selecting the best logistic regression with L1 penalty

Table 3 in Report

In [None]:
lasso = sklearn.linear_model.LogisticRegression(
    penalty='l1', solver='saga', random_state=101)

In [None]:
lasso_hyperparameter_grid_by_name = dict(
    C=np.logspace(-4, 4, 9),
    max_iter=[20, 40], # sneaky way to do "early stopping" 
                       # we'll take either iter 20 or iter 40 in training process, by best valid performance
    )

#### TODO Selecting the best logistic regression with L1 penalty
Hint: Follow 1B and 2A
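
Before launching the search it can help to count the configurations. In sklearn's `LogisticRegression`, `C` is the inverse of the regularization strength, so the smallest values in the grid apply the strongest L1 penalty. A minimal sketch that enumerates the same grid with `sklearn.model_selection.ParameterGrid` (the names `toy_grid` and `toy_configs` are hypothetical):

```python
import numpy as np
import sklearn.model_selection

# Same values as the lasso grid above, copied into a hypothetical variable
toy_grid = dict(
    C=np.logspace(-4, 4, 9),  # 1e-4, 1e-3, ..., 1e4
    max_iter=[20, 40],
    )

# ParameterGrid expands the dict into one dict per configuration
toy_configs = list(sklearn.model_selection.ParameterGrid(toy_grid))
print("number of configurations:", len(toy_configs))
print("C values:", toy_grid['C'])
```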

### Table 3: Comparison of methods on the bag-of-words to sentiment classification task.

Please report balanced accuracy on the train, valid, and test sets, to 3 digits of precision

**Sample Output** (Feel free to print all values and organize them in any software)

|**method**|**train**|**valid**|**test**|
|:-|:-:|:-:|:-:|
|L1-penalized LogisticRegression|0.123|0.456|0.890|
|best RandomForest|0.123|0.456|0.890|
|best Tree|0.123|0.456|0.890|
|simple Tree|0.123|0.456|0.890|
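
Each entry in Table 3 is a balanced accuracy, which averages the per-class recall and can be computed with `sklearn.metrics.balanced_accuracy_score`. A minimal sketch on made-up labels (`toy_y_true` and `toy_y_pred` are hypothetical), formatted to 3 digits as the table requests:

```python
import sklearn.metrics

# Hypothetical true labels and predictions for 4 reviews
toy_y_true = [0, 0, 0, 1]
toy_y_pred = [0, 0, 1, 1]

# Recall on class 0 is 2/3, recall on class 1 is 1/1; balanced accuracy is their mean
toy_bal_acc = sklearn.metrics.balanced_accuracy_score(toy_y_true, toy_y_pred)
print("%.3f" % toy_bal_acc)  # prints 0.833
```

Plain accuracy on the same labels would be 3/4, which shows how the two metrics differ when classes are unevenly represented.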