# Analyzing hyperparameter gridsearch results with xarray

This notebook will show how to explore a multidimensional hyperparameter sweep using the xarray package. The xarray package builds upon pandas to work with and quickly analyze datasets with an arbitrary number of dimensions. We can use this to understand the effect of each hyperparameter on our model.

I will keep the actual NLP methods used in this notebook to a basic `GridSearchCV` using a standard sklearn pipeline. The original contribution of the notebook is a conversion of the `GridSearchCV` results to an xarray Dataset, and an overview of how to use the xarray package to visualize the results. If you are familiar with doing a grid search with sklearn, you can skip ahead to that part.

# Perform NLP and grid search with sklearn

## Load in the data

In [None]:
import pandas as pd
import numpy as np
import xarray as xr

from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

df_train = pd.read_csv('/kaggle/input/nlp-getting-started/train.csv', index_col=0)

df_train.info()

## Perform some basic text processing

I am just reusing some text-cleaning functions from a 'getting started' notebook.

In [None]:

import re
import string

url_re = re.compile(r'https?://\S+|www\.\S+')
tag_re = re.compile(r'<.*?>')
table_punct = str.maketrans('','',string.punctuation)

def text_preprocess(text):
    text = url_re.sub(r'', text)
    text = tag_re.sub(r'', text)
    text = text.translate(table_punct)
    return text


texts = df_train['text'].apply(text_preprocess)
target = df_train['target']

#TODO: does this need to happen, as gridsearch already is doing cross-validation
X_train, X_test, y_train, y_test = train_test_split(texts, target,train_size=0.25)
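
As a quick sanity check of what the cleaning step does, here is a minimal sketch that redefines the same regexes and applies them to a made-up string (the sample text is invented for illustration and is not from the dataset). URLs are removed first, then HTML-like tags, then punctuation.

```python
import re
import string

# Same patterns as in the cell above
url_re = re.compile(r'https?://\S+|www\.\S+')
tag_re = re.compile(r'<.*?>')
table_punct = str.maketrans('', '', string.punctuation)

def text_preprocess(text):
    text = url_re.sub('', text)          # drop URLs
    text = tag_re.sub('', text)          # drop HTML-like tags
    text = text.translate(table_punct)   # drop punctuation
    return text

# Hypothetical sample text, not taken from train.csv
sample = "Fire near <b>LA</b>! http://t.co/abc #wildfire"
cleaned = text_preprocess(sample)
print(repr(cleaned))
```

Note that removing the URL leaves the surrounding whitespace in place, so the cleaned text can contain double spaces. `CountVectorizer` tokenizes on word boundaries by default, so this should not matter here.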


## Build the sklearn pipeline

I build a standard sklearn pipeline for classifying text and perform a fit/scoring just to ensure that things are set up properly.

In [None]:

from sklearn.pipeline import Pipeline
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import RidgeClassifier


pipe = Pipeline([
    ('vect', CountVectorizer()),
    ('tsvd', TruncatedSVD()),
    ('clf', RidgeClassifier()),
])

pipe.fit(X_train,y_train)
pipe.score(X_test, y_test)

## Set up and perform a hyperparameter grid search

In [None]:
from sklearn.model_selection import GridSearchCV

params = {
    'vect__min_df': np.linspace(0,0.01,4),
    'tsvd__n_components': [int(n) for n in np.logspace(0.3,2,4)],
    'tsvd__n_iter': [1,3,5,7],
    'clf__alpha': np.linspace(0.5,1, 3),

}

gridsearch = GridSearchCV(pipe, params) 

gridsearch.fit(X_train, y_train)
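
Before running a search like this, it can help to know how many fits it implies. The sketch below rebuilds the same parameter grid and counts the combinations with sklearn's `ParameterGrid`. The fold count of 5 is an assumption based on the `GridSearchCV` default when `cv` is not given.

```python
import numpy as np
from sklearn.model_selection import ParameterGrid

# Same grid as in the cell above
params = {
    'vect__min_df': np.linspace(0, 0.01, 4),
    'tsvd__n_components': [int(n) for n in np.logspace(0.3, 2, 4)],
    'tsvd__n_iter': [1, 3, 5, 7],
    'clf__alpha': np.linspace(0.5, 1, 3),
}

n_candidates = len(ParameterGrid(params))  # product of the list lengths
n_folds = 5  # assumption: GridSearchCV default when cv is not specified
n_fits = n_candidates * n_folds

print(n_candidates, n_fits)
```

With the default `refit=True`, one additional fit on the full training data follows the cross-validation fits.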

# Conversion of grid search results to xarray DataSet and visualization

At this stage, the most common practice is to simply select the model parameters that gave the highest test score. However, we may want to examine how the results depend on the model parameters in order to better understand our model and see which parameters are particularly important or unimportant.

The parameter-dependent results of the `GridSearchCV` are located in `gridsearch.cv_results_`.

In [None]:
results = gridsearch.cv_results_


Even though our search was over a 'grid' of parameters, both the results and the corresponding parameters are output as 1D arrays. For example, the 'mean_test_score':

In [None]:
results['mean_test_score'].shape

And the corresponding coordinates for one of the grid search parameters:  

In [None]:
results['param_vect__min_df'].data.shape
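
To make the flat layout concrete, here is a small self-contained sketch on synthetic data. The dataset, the estimator settings and the 2 x 2 grid are all invented for illustration. Each of the four combinations gets one entry in every `cv_results_` array, and `cv_results_['params']` lists the combination that belongs to each position.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import RidgeClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic toy problem, unrelated to the tweet data
X, y = make_classification(n_samples=60, n_features=5, random_state=0)

toy_grid = {'alpha': [0.5, 1.0], 'fit_intercept': [True, False]}
toy_search = GridSearchCV(RidgeClassifier(), toy_grid, cv=3)
toy_search.fit(X, y)

toy_results = toy_search.cv_results_
print(toy_results['mean_test_score'].shape)  # one score per combination
for combo in toy_results['params']:
    print(combo)                             # the combination at each position
```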

First we convert this data to a multi-indexed pandas dataframe. It would be possible to skip this step and form the xarray Dataset directly, but this 1D input data is particularly amenable to a multi-indexed dataframe, where each point has multiple indices.

I will only look at some key results: `mean_test_score` and `mean_fit_time`

In [None]:
var_keys = ['mean_test_score','mean_fit_time']
results_downselect = {key: results[key] for key in var_keys}

param_grid = gridsearch.param_grid
param_coords = {param : results['param_' + param].data for param in param_grid.keys()}

mi = pd.MultiIndex.from_arrays(list(param_coords.values()), names = param_grid.keys())

df = pd.DataFrame(results_downselect, index=mi)

df.head()

At this stage we could analyze the results with the multi-indexed dataframe. However, working with multi-indexes in pandas is somewhat cumbersome, as we must keep track of the different levels of the index during downselection and similar operations.

This is where xarray comes in, which is a package specifically meant for multidimensional datasets. xarray and pandas are highly interoperable, and we can simply create a dataset using the `Dataset.from_dataframe` method.

In [None]:
ds = xr.Dataset.from_dataframe(df)
ds
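
The conversion is easier to see on a tiny example. The sketch below uses a hypothetical 2 x 3 grid with made-up scores: the index levels of the dataframe become named dimensions of the Dataset, and the same point can be looked up by level position in pandas or by dimension name in xarray.

```python
import pandas as pd
import xarray as xr

# Hypothetical toy grid with invented scores
mi = pd.MultiIndex.from_product(
    [[0.5, 1.0], [1, 3, 5]],
    names=['clf__alpha', 'tsvd__n_iter'],
)
toy_df = pd.DataFrame(
    {'mean_test_score': [0.60, 0.62, 0.63, 0.61, 0.64, 0.65]},
    index=mi,
)

# pandas: the tuple must follow the order of the index levels
pd_value = toy_df.loc[(1.0, 5), 'mean_test_score']

# xarray: each index level becomes a named dimension
toy_ds = xr.Dataset.from_dataframe(toy_df)
xr_value = toy_ds['mean_test_score'].sel(clf__alpha=1.0, tsvd__n_iter=5).item()

print(toy_ds['mean_test_score'].dims, toy_ds['mean_test_score'].shape)
print(pd_value, xr_value)
```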

Now we have an `xarray.Dataset` object, which is analogous to a `pandas.DataFrame`, where the 'data variables' are analogous to columns. The key difference is that our data is now represented as an N-dimensional array, instead of the 2D tabular form of the multi-indexed DataFrame.

Individual variables can be selected out as `xarray.DataArray` objects (analogous to `pandas.Series`).

In [None]:
ds['mean_test_score']

I won't go into the details of the xarray package but just demonstrate some basic visualization. 

## Data visualization

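It is possible to visualize up to 4 dimensions in a single figure: two on the axes of each panel, plus one each for the rows and columns of the panel grid.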

In [None]:

ds['mean_test_score'].plot(col='clf__alpha', row='tsvd__n_components')

Neither the number of TSVD iterations nor the classifier alpha seems to have a dramatic effect. I'm going to downselect the data to the point `tsvd__n_iter = 5` and `clf__alpha = 0.75`.

In [None]:
ds_sel = ds.sel(tsvd__n_iter = 5, clf__alpha = 0.75)

ds_sel['mean_test_score'].plot()
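
A visual impression of which parameters matter can also be checked numerically. The sketch below is one possible approach rather than a standard recipe: for each dimension it takes the spread (max minus min) of the score along that dimension and averages it over the remaining ones. The scores are invented for illustration.

```python
import numpy as np
import xarray as xr

# Hypothetical scores on a 3 x 4 grid (values invented for illustration)
scores = np.array([
    [0.57, 0.57, 0.58, 0.57],
    [0.63, 0.64, 0.64, 0.63],
    [0.70, 0.70, 0.71, 0.70],
])
da = xr.DataArray(
    scores,
    dims=['tsvd__n_components', 'tsvd__n_iter'],
    coords={'tsvd__n_components': [1, 7, 27], 'tsvd__n_iter': [1, 3, 5, 7]},
    name='mean_test_score',
)

# Spread along each dimension, averaged over the other dimensions
spread = {
    dim: float((da.max(dim=dim) - da.min(dim=dim)).mean())
    for dim in da.dims
}
print(spread)
```

A dimension with a small spread is a reasonable candidate for fixing at a single value. In this notebook, the same expression could be applied to `ds['mean_test_score']`.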

To plot the second dimension as different color lines, we use the `hue` parameter.

Additional keyword arguments, such as `marker` or `xscale`, can also be supplied to control the matplotlib output.

In [None]:
ds_sel['mean_test_score'].plot(hue='vect__min_df', marker='o', xscale='log')

The score increases as we reduce the amount of dimensionality reduction (that is, as `tsvd__n_components` grows), as expected. Looking at the computational time, we can see the tradeoff:

In [None]:
ds_sel['mean_fit_time'].plot(hue='vect__min_df', marker='o', xscale='log')

Let's calculate the ratio of the score to the fit time to try to quantify this tradeoff, and assign it to a new variable in the dataset.

In [None]:
ds_sel['ratio'] = ds_sel['mean_test_score']/ds_sel['mean_fit_time']
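
Once a derived variable like this exists, the coordinates of its maximum can be looked up directly. The sketch below uses a hypothetical 2 x 2 dataset with made-up scores and fit times, and goes through `to_series().idxmax()`, which returns the coordinate labels as a tuple in dimension order. Other approaches, such as `DataArray.argmax`, would also work.

```python
import numpy as np
import xarray as xr

dims = ('vect__min_df', 'tsvd__n_components')

# Hypothetical values, invented for illustration
toy = xr.Dataset(
    {
        'mean_test_score': (dims, np.array([[0.60, 0.70], [0.58, 0.66]])),
        'mean_fit_time': (dims, np.array([[0.20, 0.50], [0.10, 0.40]])),
    },
    coords={'vect__min_df': [0.0, 0.01], 'tsvd__n_components': [7, 100]},
)
toy['ratio'] = toy['mean_test_score'] / toy['mean_fit_time']

# Flatten to a multi-indexed Series and find the labels of the maximum
best = toy['ratio'].to_series().idxmax()
print(dict(zip(dims, best)))
```

Keep in mind that a plain ratio strongly favors fast configurations, so the maximum is a starting point for discussion rather than a model-selection rule.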

One useful technique is to turn the variables into a dimension of a single `DataArray` with the `Dataset.to_array` method, which makes it possible to plot the variables along a column or row.

Because each variable will have a very different y axis, it's important to pass the `sharey=False` keyword argument.

In [None]:
ds_sel.to_array('var').plot(row = 'var', hue='vect__min_df', marker='o', xscale='log', sharey=False)
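
To see what `to_array` does to the data structure, here is a minimal sketch on a hypothetical one-dimensional dataset with invented values. The two data variables are stacked along a new leading dimension, named `var` here to match the cell above, whose coordinate values are the variable names.

```python
import xarray as xr

# Hypothetical values, invented for illustration
toy = xr.Dataset(
    {
        'mean_test_score': ('tsvd__n_components', [0.57, 0.64, 0.70]),
        'mean_fit_time': ('tsvd__n_components', [0.1, 0.2, 0.6]),
    },
    coords={'tsvd__n_components': [1, 7, 27]},
)

# Data variables become entries along the new 'var' dimension
stacked = toy.to_array('var')
print(stacked.dims, stacked.shape)
print(list(stacked['var'].values))
```

Because the result is a single `DataArray`, the new dimension can be used like any other one, for example as the `row` or `col` of a faceted plot.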