# Advances in Machine Learning with Big Data

### (part 1 of 2) 
### Trinity 2020 Weeks 1 - 4
### Dr Jeremy Large
#### jeremy.large@economics.ox.ac.uk


© Jeremy Large; shared under [CC BY-NC-ND 4.0](https://creativecommons.org/licenses/by-nc-nd/4.0/)

In [1]:
%load_ext autoreload
%autoreload 2
%pylab inline
plt.rcParams['figure.figsize'] = [12, 4]

import sys, os
from mpl_toolkits.mplot3d import axes3d
import logging
logging.basicConfig(format='%(asctime)s %(levelname)s:%(message)s', level=logging.INFO)

import warnings
warnings.filterwarnings('ignore')

# point at library; I need some lessons on doing good PYTHONPATHs:
REPO_DIR = os.path.dirname(os.getcwd())
UCI_LIB = os.path.join(REPO_DIR, 'lib')
sys.path.append(UCI_LIB)

import numpy as np  
import pandas as pd  

#  pull in scikit-learn libraries:
from sklearn import linear_model
from sklearn import model_selection

import sbs_sklearn    # module where I've put some functions from the last class
from uci_retail_data import uci_files, stock_codes

Populating the interactive namespace from numpy and matplotlib


In [2]:
def plot_coeffs(mod, mod_name, comment):
    plt.plot(mod.coef_, marker='o')
    plt.grid()
    plt.title(f"The betas of the {mod_name} - {comment}")
    plt.axhline(color='k')

## 6. Decision trees, bagging, and random forests

## Contents Weeks 1-4:

1. Introducing this course's dataset

1. Being an econometrician _and_ a data scientist

1. Overfit and regularization

1. Regularization through predictor/feature selection (Lasso etc.)

1. Resampling methods, and model selection

1. **Decision trees, bagging, and random forests**

1. Single-layer neural networks

Load data per previous classes ...

In [3]:
df = uci_files.standard_uci_data_access()

2020-05-12 23:36:19,277 INFO:Loading c:\Users\user\Desktop\Oxford\MFE\Advances in Machine Learning\ox-sbs-ml-bd\data\raw.csv , sheet Year 2009-2010
2020-05-12 23:36:23,822 INFO:Loaded c:\Users\user\Desktop\Oxford\MFE\Advances in Machine Learning\ox-sbs-ml-bd\data\raw.csv , sheet number one, obviously


In [4]:
invalids = stock_codes.invalid_series(df)

In [5]:
invoices = stock_codes.invoice_df(df, invalid_series=invalids)

2020-05-12 23:36:24,410 INFO:NumExpr defaulting to 8 threads.


In [6]:
invoices.columns

Index(['customer', 'codes_in_invoice', 'items_in_invoice', 'invoice_spend',
       'hour', 'month', 'words', 'country', 'words_per_item'],
      dtype='object')

In [7]:
invoices.head()

| Invoice | customer | codes_in_invoice | items_in_invoice | invoice_spend | hour | month | words | country | words_per_item |
|---|---|---|---|---|---|---|---|---|---|
| 489434 | 13085.0 | 8 | 166 | 505.3 | 7 | 200912 | {WHITE, POT, SWEET, MUG, RECORD, BALL, CHRISTM... | United Kingdom | 3.625 |
| 489435 | 13085.0 | 4 | 60 | 145.8 | 7 | 200912 | {LARGE, ,, DOG, WITH, HEART, BOWL, CHASING, ME... | United Kingdom | 4.0 |
| 489436 | 13078.0 | 19 | 193 | 630.33 | 9 | 200912 | {IVORY, CLASSIC, PLATE, SIGN, BLACK, DUCKS, MA... | United Kingdom | 3.315789 |
| 489437 | 15362.0 | 23 | 145 | 310.75 | 9 | 200912 | {MARIA, PACK, BOTTLE, DESIGN, BOX, SMALL, 20, ... | United Kingdom | 3.0 |
| 489438 | 18102.0 | 17 | 826 | 2286.24 | 9 | 200912 | {JUMBO, WRITING, DOORSTOP, RED, COASTER, BOTTL... | United Kingdom | 2.235294 |


Prepare our dataset for linear regression:

In [8]:
invoices['log_item_spend'] = np.log(invoices.invoice_spend / invoices.items_in_invoice)

y = invoices.log_item_spend

# Move to after categorical variables are defined
# X = invoices[predictors] 

In [21]:
# Adapted from https://scikit-learn.org/stable/auto_examples/inspection/plot_permutation_importance.html

from sklearn.datasets import fetch_openml
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
#from sklearn.inspection import permutation_importance
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

rng = np.random.RandomState(seed=42)
invoices['random_cat'] = rng.randint(3, size=invoices.shape[0])
invoices['random_num'] = rng.randn(invoices.shape[0])

# Using headers from the invoices dataframe

# Can we use words?

categorical_columns = ['country', 'random_cat']
numerical_columns = ['customer','hour', 'month', 'codes_in_invoice', 'words_per_item', 'random_num']

invoices = invoices[categorical_columns + numerical_columns]

X_train, X_test, y_train, y_test = train_test_split(invoices, y, test_size=0.2)

# Polynomial?

categorical_pipe = Pipeline([
    ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])
numerical_pipe = Pipeline([
    ('imputer', SimpleImputer(strategy='mean'))
])

preprocessing = ColumnTransformer(
    [('cat', categorical_pipe, categorical_columns),
     ('num', numerical_pipe, numerical_columns)])

rf = Pipeline([
    ('preprocess', preprocessing),
    ('regressor', RandomForestRegressor(min_weight_fraction_leaf=0.1,
                                        max_features=int(np.sqrt(len(invoices.columns))),
                                        n_estimators=500))
])

In [22]:
X_train.head()

| Invoice | country | random_cat | customer | hour | month | codes_in_invoice | words_per_item | random_num |
|---|---|---|---|---|---|---|---|---|
| 502921 | Spain | 0 | 12496.0 | 13 | 201003 | 12 | 2.25 | -0.188981 |
| 500616 | United Kingdom | 2 | 15437.0 | 10 | 201003 | 12 | 3.25 | 0.662994 |
| 494635 | United Kingdom | 2 | 15945.0 | 17 | 201001 | 13 | 4.0 | -0.706743 |
| 513307 | United Kingdom | 1 | 15738.0 | 14 | 201006 | 7 | 3.142857 | -1.136891 |
| 537836 | United Kingdom | 0 | 14866.0 | 14 | 201012 | 1 | 3.0 | 1.068496 |


In [44]:
rf.fit(X_train, y_train)
rf.score(X_train,y_train)

0.004283893480901635

In [45]:
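# Caution: this refits the pipeline on the test split, so the score below is an
# in-sample R^2 on X_test, not an out-of-sample score for the model fitted on X_train.
rf.fit(X_test, y_test)
rf.score(X_test,y_test)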

0.010356679642113951

In [48]:
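from sklearn import ensemble   # not imported in the cells above

gb = Pipeline([
    ('preprocess', preprocessing),
    ('regressor', ensemble.GradientBoostingRegressor(n_estimators=500,              # this is B (number of trees)
                                                     min_weight_fraction_leaf=0.1,  # akin to d (limits tree size)
                                                     learning_rate=0.01))           # this is lambda (shrinkage)
])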
gb.fit(X_train, y_train)
gb.score(X_train, y_train)

0.09663209509194172

In [49]:
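# Caution: as with the random forest, this refits on the test split, so the score
# below is in-sample on X_test rather than an out-of-sample score.
gb.fit(X_test, y_test)
gb.score(X_test, y_test)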

0.1104665955755514
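
The two cells that call `fit(X_test, y_test)` score each model on the same data it was just fitted to. A minimal sketch of the more usual pattern follows: fit once on the training split, then report both $R^2$ and MSE on each split. It uses synthetic data from `make_regression` as a stand-in for the invoices, and every name in it (`X_demo`, `model`, `scores`, and so on) is hypothetical, chosen so as not to clash with the notebook's own variables.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the invoices data (an assumption, for illustration only)
X_demo, y_demo = make_regression(n_samples=400, n_features=6, noise=10.0, random_state=0)
Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0)

model = RandomForestRegressor(n_estimators=50, random_state=0)
model.fit(Xd_train, yd_train)          # fit once, on the training split only

scores = {}
for name, X_part, y_part in [('train', Xd_train, yd_train), ('test', Xd_test, yd_test)]:
    scores[name] = {
        'r2': model.score(X_part, y_part),                          # R^2 on this split
        'mse': mean_squared_error(y_part, model.predict(X_part)),   # MSE on this split
    }
    print(f"{name}: R2 = {scores[name]['r2']:.3f}, MSE = {scores[name]['mse']:.3f}")
```

The same loop should work unchanged with the `rf` or `gb` pipelines in place of `model`, since a `Pipeline` exposes `fit`, `predict` and `score` in the same way. On any one split the two measures are linked by $R^2 = 1 - \mathrm{MSE} / \mathrm{Var}(y)$, so they rank models identically there.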

In [34]:
#from sklearn import preprocessing
#poly = preprocessing.PolynomialFeatures(4, include_bias=False)
#polynomial_X = pd.DataFrame(poly.fit_transform(X.values))
#polynomial_X.columns = poly.get_feature_names(X.columns)

In [None]:
#poly_std_X = ((polynomial_X - polynomial_X.mean()) / polynomial_X.std())

**Exercise**: We aim to:
1. take account of categorical information in our raw data, in order to get a better fit

1. use a Pipeline.

* Adapt the code [here](https://scikit-learn.org/stable/auto_examples/inspection/plot_permutation_importance.html) to the present namespace. 
    * It may be of use to observe the related [notebook](https://scikit-learn.org/stable/_downloads/9e4e8e1cf9e1bc7322177aeb4a2af787/plot_permutation_importance.ipynb)
    * It will be convenient to copy the lecture notebook - then work in the copy. 

* After adding missing imports, an early line of the code should be:

> `X_train, X_test, y_train, y_test = train_test_split(invoices, y, test_size=0.2)`

* Define `categorical_columns` and `numerical_columns` as ambitiously as possible, 
    * but do not make use of `items_in_invoice` or `invoice_spend`, which combine to create the target.

* Report `test` and `train` scores for the model that you obtain. If these are $R^2$ values, find code to report MSEs as well.

* Plot the feature importance diagrams, and discuss.

* Swap in the `GradientBoostingRegressor` in place of the `RandomForestRegressor` - compare.
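
For the feature importance step, one possible approach is permutation importance on the held-out split, as in the scikit-learn example linked above. The sketch below assumes scikit-learn 0.22 or later, where `sklearn.inspection.permutation_importance` is available. It runs on a toy dataset with one informative column and one pure-noise column, mirroring the `random_num` idea used earlier; the names `toy`, `signal`, `forest` and `importances` are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

rng = np.random.RandomState(0)
n = 500

# Toy data (an assumption, for illustration): the target depends on 'signal' only
toy = pd.DataFrame({
    'signal': rng.randn(n),
    'random_num': rng.randn(n),     # pure noise, unrelated to the target
})
toy_y = 3.0 * toy['signal'] + 0.1 * rng.randn(n)

T_train, T_test, ty_train, ty_test = train_test_split(
    toy, toy_y, test_size=0.2, random_state=0)

forest = RandomForestRegressor(n_estimators=50, random_state=0)
forest.fit(T_train, ty_train)

# Shuffle each column of the held-out data in turn and record the drop in R^2
result = permutation_importance(forest, T_test, ty_test, n_repeats=10, random_state=0)
importances = pd.Series(result.importances_mean, index=toy.columns).sort_values()
print(importances)
```

In the notebook, `importances.plot.barh()` would draw the corresponding diagram. A noise column such as `random_num` gives a useful baseline: any real feature whose importance is no larger than the noise column's is probably contributing little.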