# Week 11 - Classification with RandomForest

KentB

A RandomForest is an ensemble of many randomly assembled **DecisionTrees**.

### Load Data

In [4]:
import pandas as pd
import numpy as np

**Predicting wine quality**

In [5]:
df = pd.read_csv("https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv",sep=';')

In [6]:
df.columns

Index(['fixed acidity', 'volatile acidity', 'citric acid', 'residual sugar',
       'chlorides', 'free sulfur dioxide', 'total sulfur dioxide', 'density',
       'pH', 'sulphates', 'alcohol', 'quality'],
      dtype='object')

In [7]:
# Define a simple binary target: a quality score above 5 counts as good
df['quality_good'] = df['quality'] > 5

In [8]:
df.head()

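| | fixed acidity | volatile acidity | citric acid | residual sugar | chlorides | free sulfur dioxide | total sulfur dioxide | density | pH | sulphates | alcohol | quality | quality_good |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 7.4 | 0.7 | 0.0 | 1.9 | 0.076 | 11.0 | 34.0 | 0.9978 | 3.51 | 0.56 | 9.4 | 5 | False |
| 1 | 7.8 | 0.88 | 0.0 | 2.6 | 0.098 | 25.0 | 67.0 | 0.9968 | 3.2 | 0.68 | 9.8 | 5 | False |
| 2 | 7.8 | 0.76 | 0.04 | 2.3 | 0.092 | 15.0 | 54.0 | 0.997 | 3.26 | 0.65 | 9.8 | 5 | False |
| 3 | 11.2 | 0.28 | 0.56 | 1.9 | 0.075 | 17.0 | 60.0 | 0.998 | 3.16 | 0.58 | 9.8 | 6 | True |
| 4 | 7.4 | 0.7 | 0.0 | 1.9 | 0.076 | 11.0 | 34.0 | 0.9978 | 3.51 | 0.56 | 9.4 | 5 | False |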


### Bagging

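* Bagging (bootstrap aggregating): each tree is trained on its own random sample of the rows and handles only a few of the features
* Averaging many such trees tries to avoid overfitting
* Probably the better choice for smaller datasets; a less complicated model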
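To make the row sampling behind bagging concrete, here is a minimal sketch of a single bootstrap draw. The seed and the variable names (`rng`, `sample`, `unique_fraction`) are assumptions chosen for illustration; only the row count matches the wine data.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed, only for reproducibility
n_rows = 1599                   # same number of rows as the wine dataset

# Draw n_rows row positions WITH replacement: one bootstrap sample
sample = rng.choice(n_rows, size=n_rows, replace=True)

# Fraction of distinct original rows that ended up in the sample
unique_fraction = np.unique(sample).size / n_rows
print(f"unique rows in bootstrap sample: {unique_fraction:.3f}")
```

Because rows are put back after each draw, a bootstrap sample of this size typically contains only about 63% of the distinct rows (roughly 1 - 1/e); the rest are repeats. This is why every tree ends up seeing a slightly different dataset.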

In [9]:
# Remove target columns from input data
#X = np.array(df.iloc[:,:-2]) # there is an alternate universe just using arrays
# ...but we will use dataFrames here
X = df.iloc[:,:-2]
# Target data is in the very last column
target = df.iloc[:,-1]

In [10]:
target

0       False
1       False
2       False
3        True
4       False
        ...  
1594    False
1595     True
1596     True
1597    False
1598     True
Name: quality_good, Length: 1599, dtype: bool

In [11]:
X.shape

(1599, 11)

In [12]:
# Define our target column as a constant
TARGET_NAME = 'quality_good'

In [13]:
# num rows to extract - FIRST RANDOMNESS - rows are drawn at random, with replacement
n_rows = X.shape[0]
# num cols to extract - SECOND RANDOMNESS - each tree gets a random subset of columns
#   we don't want the trees to see all the features - sqrt(num features) is just a
#   common heuristic for the subset size - nothing magical or mandatory
n_col = int(np.sqrt(X.shape[1]))
# num copies (learners / num trees) - a hyperparameter - tune as you like
n_trees = 50

In [44]:
n_rows, n_col

(1599, 3)
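The square-root heuristic is also available in sklearn through `max_features="sqrt"`. The sketch below uses a synthetic 11-feature dataset from `make_classification` as a stand-in for the wine data (an assumption, so that it runs without a download) and reads back how many features each fitted tree considers.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in with 11 features, like the wine data (illustration only)
X_demo, y_demo = make_classification(n_samples=200, n_features=11, random_state=0)

rf_demo = RandomForestClassifier(n_estimators=5, max_features="sqrt", random_state=0)
rf_demo.fit(X_demo, y_demo)

# Number of features a fitted tree considers when searching for a split
features_per_split = rf_demo.estimators_[0].max_features_
print(features_per_split, int(np.sqrt(X_demo.shape[1])))
```

One difference worth keeping in mind: sklearn draws a fresh random feature subset at every split, whereas the hand-built version in this notebook fixes one subset of columns per tree.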

**RandomForest Implementation**

* The trees can be trained in parallel, so training is quick
* No scaling required! Trees split on feature thresholds, so the data does not need to be normalized
* Used a lot in anomaly detection scenarios (?)
* Note: can also be used as a Regressor
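The "no scaling required" point can be checked with a small experiment. This is a sketch on made-up data (the arrays, the seed, and the scale factor are all assumptions): the same tree is fitted once on the raw features and once on rescaled features, and the predictions are compared.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
# Made-up data: 4 numeric features, label depends on the first two
X_train = np.round(rng.normal(size=(300, 4)), 3)
X_test = np.round(rng.normal(size=(100, 4)), 3)
y_train = X_train[:, 0] + X_train[:, 1] > 0

scale = 1024.0  # a power of two, so the rescaling is exact in floating point

# Same seed for both trees so the only difference is the feature scale
tree_raw = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
tree_scaled = DecisionTreeClassifier(random_state=0).fit(X_train * scale, y_train)

same = np.array_equal(tree_raw.predict(X_test), tree_scaled.predict(X_test * scale))
print("identical predictions:", same)
```

Rescaling changes the threshold values but not the ordering of the rows along each feature, so the tree makes the same splits either way.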

In [15]:
columns = X.columns

In [16]:
#index = list(range(0,n_rows))  # if using arrays vs. dataframes....
index = X.index

In [17]:
data_collection = []

In [None]:
for i in range(n_trees):
  # Randomly choose n_rows from the index
  #   After each are selected they are put back, and can be selected again!
  #   Process is called bootstrapping
  row_draw = np.random.choice(index, size = n_rows, replace=True)

  # print(row_draw)    # shuffled index array

  # Randomly choose n_col columns (some subset)
  #   Use replace=False for cols, otherwise np.random.choice might select a column twice
  col_draw = np.random.choice(columns, size = n_col, replace=False)

  #print(col_draw)  # list of 3 column names
  #print(X.loc[row_draw, col_draw])  # large! full df, but with only 3 cols

  # Capture X, y, and list of selected columns
  #   row_draw holds index labels, so use .loc for both X and target
  data_collection.append((X.loc[row_draw, col_draw], target.loc[row_draw], col_draw))

**Fit a list of sklearn DecisionTreeClassifiers**

We have randomly extracted `n_trees` sets of data, each with `n_col` (here 3) random columns.

Create a Classifier for each.

In [20]:
from sklearn.tree import DecisionTreeClassifier

In [21]:
tree_coll = []

In [22]:
for data in data_collection:
  dt = DecisionTreeClassifier()
  # Fit a DecisionTree
  #  where
  #   data[0] is a bootstrap sample of the rows (same size as the dataset), with only 3 random cols
  #   data[1] is the target values for those same sampled rows
  dt.fit(X=data[0], y=data[1])
  # Save this trained DecisionTree, which is trained on these 3 cols
  tree_coll.append(dt)

In [47]:
prediction = []

In [27]:
# For each trained tree.....
for idx, dt in enumerate(tree_coll):
  # Slice original X as input - just take first 2 rows to predict on as a test
  #   and note we are only selecting the columns listed in this Tree
  #
  # Returns an array of boolean results - Classifying each input row
  prediction.append(dt.predict(X.loc[0:1, data_collection[idx][2]]))

In [28]:
# Calc average RF prediction for 1st row and 2nd row of X
#    Avg of all trained trees!
np.mean(np.array(prediction).astype(int), axis=0)

array([0.04, 0.06])

**Note:** *The average predictions for the first two rows are close to zero, meaning almost all of the trained trees agree that these rows are False.*
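The averaged votes can be turned into a final class by taking the majority. A minimal sketch, using a small hypothetical vote matrix rather than the notebook's `prediction` list:

```python
import numpy as np

# Hypothetical votes from 5 trees (rows) for 3 samples (columns); 1 means True
votes = np.array([
    [0, 1, 1],
    [0, 1, 0],
    [0, 0, 1],
    [1, 1, 0],
    [0, 1, 0],
])

vote_share = votes.mean(axis=0)  # fraction of trees voting True for each sample
majority = vote_share > 0.5      # final class: True only if most trees say True
print(vote_share, majority)
```

With the averages from the cell above (0.04 and 0.06), the same rule would classify both rows as False.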

**Try sklearn RandomForest**

Compare our result to the full sklearn implementation.

In [30]:
from sklearn.ensemble import RandomForestClassifier

In [31]:
rf = RandomForestClassifier(n_estimators=50)

In [32]:
rf.fit(X, target)

RandomForestClassifier(n_estimators=50)

In [56]:
# Again, predict over the first 2 rows only - just as a test
rf.predict_proba(X.loc[0:1,:])[:,1]

array([0. , 0.1])

*The probabilities here are also very confident that the first 2 rows are False.*
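The forest's probabilities are themselves an average over its trees, much like the hand-built version. The sketch below rebuilds `predict_proba` from the individual trees in `estimators_`; the synthetic data from `make_classification` and the names `forest`, `forest_proba`, and `tree_mean` are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the wine data (illustration only)
X_demo, y_demo = make_classification(n_samples=300, n_features=11, random_state=0)
forest = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_demo, y_demo)

# Probability of the positive class from the forest...
forest_proba = forest.predict_proba(X_demo[:2])[:, 1]
# ...and the same quantity rebuilt by averaging the individual trees
tree_mean = np.mean(
    [tree.predict_proba(X_demo[:2])[:, 1] for tree in forest.estimators_], axis=0
)
print(forest_proba, tree_mean)
```

Note that sklearn averages each tree's class probabilities, while the hand-built forest averages hard True/False votes. With fully grown trees the leaves are usually pure, so the two tend to agree closely.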

In [35]:
from sklearn.metrics import average_precision_score

*Another possible metric: average precision*

In [53]:
# Need the predictions in hand
y_pred_rf = rf.predict(X)

In [37]:
# Calculate average precision
#   Note: this is scored on the same rows the model was trained on, so it is optimistic
average_precision_score(target, y_pred_rf)

1.0
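A perfect score on the rows the model was trained on says little about new data. A hedged sketch of a held-out evaluation follows; it uses synthetic data from `make_classification` instead of the wine data, and the split size and seeds are arbitrary assumptions. It also passes probabilities rather than hard labels to `average_precision_score`, since the metric ranks samples by score.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import average_precision_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the wine data (illustration only)
X_demo, y_demo = make_classification(n_samples=1000, n_features=11, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0, stratify=y_demo
)

model = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_train, y_train)

# Score with probabilities of the positive class, on seen and unseen rows
train_ap = average_precision_score(y_train, model.predict_proba(X_train)[:, 1])
test_ap = average_precision_score(y_test, model.predict_proba(X_test)[:, 1])
print(f"train AP: {train_ap:.3f}  test AP: {test_ap:.3f}")
```

The training score is usually at or near 1.0 for a forest of fully grown trees; the test score is the one that reflects how the model generalizes.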

### Gradient Boosting

* Easiest to understand in terms of predictions
* Take the errors from one tree's prediction, use them as the target for the next tree, and iterate
* Smaller number of hyperparameters
* Often more popular than neural networks for tabular data
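The "errors become the next target" loop can be sketched in a few lines. This is a simplified illustration for squared-error regression using sklearn's `DecisionTreeRegressor`, not what XGBoost does internally; the data, `learning_rate`, tree depth, and number of rounds are all assumptions.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
# Made-up regression data: a noisy sine curve
X_demo = rng.uniform(-3, 3, size=(200, 1))
y_demo = np.sin(X_demo[:, 0]) + rng.normal(scale=0.1, size=200)

learning_rate = 0.3
prediction = np.full_like(y_demo, y_demo.mean())  # start from a constant guess
mse_start = np.mean((y_demo - prediction) ** 2)

for _ in range(20):
    residuals = y_demo - prediction  # errors of the current ensemble
    # Fit a small tree to the errors, then add a damped version of its output
    tree = DecisionTreeRegressor(max_depth=2, random_state=0).fit(X_demo, residuals)
    prediction += learning_rate * tree.predict(X_demo)

mse_end = np.mean((y_demo - prediction) ** 2)
print(f"MSE before: {mse_start:.3f}  after: {mse_end:.3f}")
```

Unlike the random forest, the trees here are built one after another, because each tree depends on the errors left by the previous ones.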


In [38]:
!pip install xgboost --upgrade



In [39]:
from xgboost.sklearn import XGBClassifier

In [40]:
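# Note: the keyword is spelled `n_estimators`. The misspelled `n_estimator` used here
# is not recognized, so XGBoost warns about it and keeps the default n_estimators=100
xgb = XGBClassifier(n_estimator = 20)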

In [41]:
xgb.fit(X, target)

Parameters: { "n_estimator" } might not be used.

  This could be a false alarm, with some parameters getting used by language bindings but
  then being mistakenly passed down to XGBoost core, or some parameter actually being used
  but getting flagged wrongly here. Please open an issue if you find any such cases.




XGBClassifier(base_score=0.5, booster='gbtree', callbacks=None,
              colsample_bylevel=1, colsample_bynode=1, colsample_bytree=1,
              early_stopping_rounds=None, enable_categorical=False,
              eval_metric=None, gamma=0, gpu_id=-1, grow_policy='depthwise',
              importance_type=None, interaction_constraints='',
              learning_rate=0.300000012, max_bin=256, max_cat_to_onehot=4,
              max_delta_step=0, max_depth=6, max_leaves=0, min_child_weight=1,
              missing=nan, monotone_constraints='()', n_estimator=20,
              n_estimators=100, n_jobs=0, num_parallel_tree=1, predictor='auto',
              random_state=0, reg_alpha=0, ...)

In [42]:
y_pred_xgb = xgb.predict(X)

In [43]:
average_precision_score(target, y_pred_xgb)

1.0