<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#1.-MVC-project-description" data-toc-modified-id="1.-MVC-project-description-1">1. MVC project description</a></span></li><li><span><a href="#2.-Setup" data-toc-modified-id="2.-Setup-2">2. Setup</a></span></li><li><span><a href="#3.-Get-the-data" data-toc-modified-id="3.-Get-the-data-3">3. Get the data</a></span><ul class="toc-item"><li><span><a href="#3.1.-From-matlab-to-dict" data-toc-modified-id="3.1.-From-matlab-to-dict-3.1">3.1. From matlab to dict</a></span></li><li><span><a href="#3.2.-From-dict-to-pandas-dataframe" data-toc-modified-id="3.2.-From-dict-to-pandas-dataframe-3.2">3.2. From dict to pandas dataframe</a></span></li></ul></li><li><span><a href="#4.-Data-analysis" data-toc-modified-id="4.-Data-analysis-4">4. Data analysis</a></span><ul class="toc-item"><li><span><a href="#4.1.-Muscles-by-dataset" data-toc-modified-id="4.1.-Muscles-by-dataset-4.1">4.1. Muscles by dataset</a></span></li><li><span><a href="#4.2.-Tests-by-dataset" data-toc-modified-id="4.2.-Tests-by-dataset-4.2">4.2. Tests by dataset</a></span></li><li><span><a href="#4.3.-Muscles-and-tests-count" data-toc-modified-id="4.3.-Muscles-and-tests-count-4.3">4.3. Muscles and tests count</a></span></li><li><span><a href="#4.4.-Max-for-each-test-(normalized-by-participant-number)" data-toc-modified-id="4.4.-Max-for-each-test-(normalized-by-participant-number)-4.4">4.4. Max for each test (normalized by participant number)</a></span></li><li><span><a href="#4.5.-Distribution" data-toc-modified-id="4.5.-Distribution-4.5">4.5. Distribution</a></span></li><li><span><a href="#4.6.-Summary" data-toc-modified-id="4.6.-Summary-4.6">4.6. Summary</a></span></li></ul></li><li><span><a href="#5.-Tests-selection" data-toc-modified-id="5.-Tests-selection-5">5. Tests selection</a></span><ul class="toc-item"><li><span><a href="#5.1.-Outcome" data-toc-modified-id="5.1.-Outcome-5.1">5.1. 
Outcome</a></span></li></ul></li><li><span><a href="#6.-Machine-learning-pipeline" data-toc-modified-id="6.-Machine-learning-pipeline-6">6. Machine learning pipeline</a></span><ul class="toc-item"><li><span><a href="#6.1.-Configurations" data-toc-modified-id="6.1.-Configurations-6.1">6.1. Configurations</a></span></li><li><span><a href="#6.2.-Get-&amp;-split-data" data-toc-modified-id="6.2.-Get-&amp;-split-data-6.2">6.2. Get &amp; split data</a></span></li><li><span><a href="#6.2.-Pipeline" data-toc-modified-id="6.2.-Pipeline-6.3">6.2. Pipeline</a></span><ul class="toc-item"><li><span><a href="#6.2.1.-Temporary-step:-normalize-data" data-toc-modified-id="6.2.1.-Temporary-step:-normalize-data-6.3.1">6.2.1. Temporary step: normalize data</a></span></li><li><span><a href="#6.2.2.-Assembling-the-pipeline" data-toc-modified-id="6.2.2.-Assembling-the-pipeline-6.3.2">6.2.2. Assembling the pipeline</a></span></li></ul></li><li><span><a href="#6.3.-Optimization" data-toc-modified-id="6.3.-Optimization-6.4">6.3. Optimization</a></span></li><li><span><a href="#6.4.-Diagnostic" data-toc-modified-id="6.4.-Diagnostic-6.5">6.4. Diagnostic</a></span></li><li><span><a href="#6.5.-Evaluation" data-toc-modified-id="6.5.-Evaluation-6.6">6.5. Evaluation</a></span><ul class="toc-item"><li><span><a href="#6.5.1.-Prediction" data-toc-modified-id="6.5.1.-Prediction-6.6.1">6.5.1. Prediction</a></span></li><li><span><a href="#6.5.2.-Denormalize-data" data-toc-modified-id="6.5.2.-Denormalize-data-6.6.2">6.5.2. Denormalize data</a></span></li><li><span><a href="#6.5.3.-Evaluate-on-all-muscles" data-toc-modified-id="6.5.3.-Evaluate-on-all-muscles-6.6.3">6.5.3. Evaluate on all muscles</a></span></li><li><span><a href="#6.5.4.-Evaluate-on-each-muscle" data-toc-modified-id="6.5.4.-Evaluate-on-each-muscle-6.6.4">6.5.4. Evaluate on each muscle</a></span></li><li><span><a href="#6.5.5.-Summary" data-toc-modified-id="6.5.5.-Summary-6.6.5">6.5.5. Summary</a></span></li></ul></li></ul></li></ul></div>

# 1. MVC project description

**Links**
- [github repo](https://github.com/romainmartinez/mvc)
- [plotly figures](https://plot.ly/organize/romainmartinez:114)

**Todos**
- update readme, description
- add data analysis summary
- try Keras
- stacked bar
- picture for the four mvc tests
- implement DalMaso's method

**Author**: _Romain Martinez._

# 2. Setup

In [1]:
# Common imports
import scipy.io as sio
import pandas as pd
import numpy as np

# Path
from pathlib import Path
PROJECT_PATH = Path('./')
DATA_PATH = PROJECT_PATH.joinpath('data')

# to make this notebook's output stable across runs
RANDOM_SEED = 42
np.random.seed(RANDOM_SEED)

# Figures
OFFLINE = True
if OFFLINE:
    import plotly.offline as py
    py.init_notebook_mode(connected=True)
else:
    import plotly.plotly as py
import plotly.graph_objs as go
import plotly.figure_factory as ff
from plotly import tools
BASE_LAYOUT = go.Layout(hovermode='closest', font=dict(size=14))
MARKER_LAYOUT = dict(
    color='rgba(27, 158, 119, 0.6)',
    line=dict(
        color='rgba(27, 158, 119, 1.0)',
        width=2,
    ))

# 3. Get the data

## 3.1. From matlab to dict

In [2]:
def load_data(data_path, data_format, normalize=False):
    if not data_path.is_dir():
        raise ValueError('please provide a valid data path')
        
    mat = {}
    data = {key: [] for key in ('datasets', 'participants', 'muscles', 'tests', 'mvc')}
    count = -1
    dataset_names = []
    
    for idataset, ifile in enumerate(data_path.iterdir()):
        if ifile.name.endswith(f'{data_format}.mat'):
            # strip the prefix and the format suffix to keep only the project name
            dataset = ifile.name.replace(f'_{data_format}.mat', '').replace('MVE_Data_', '')
            
            if dataset not in dataset_names:
                dataset_names.append(dataset)
            
            mat[dataset] = sio.loadmat(ifile)['MVE']
            n_participants = mat[dataset].shape[0]
            print(f"project '{dataset}' ({n_participants} participants)")
            
            for iparticipant in range(mat[dataset].shape[0]):
                count += 1
                for imuscle in range(mat[dataset].shape[1]):
                    max_mvc = np.nanmax(mat[dataset][iparticipant, imuscle, :])
                    for itest in range(mat[dataset].shape[2]):
                        data['participants'].append(count)
                        data['datasets'].append(idataset)
                        data['muscles'].append(imuscle)
                        data['tests'].append(itest)
                        if normalize:
                            data['mvc'].append(mat[dataset][iparticipant, imuscle, itest] * 100 / max_mvc)
                        else:
                            data['mvc'].append(mat[dataset][iparticipant, imuscle, itest])
                            
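    # `count` is the zero-based id of the last participant, so the value printed is (number of participants - 1)
    print(f'\n\ttotal participants: {count}')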
    return data, dataset_names

In [3]:
DATA_FORMAT = 'only_max'
data, DATASET_NAMES = load_data(data_path=DATA_PATH, data_format=DATA_FORMAT, normalize=False)

MUSCLES_NAMES = [
    'upper trapezius', 'middle trapezius', 'lower trapezius',
    'anterior deltoid', 'middle deltoid', 'posterior deltoid',
    'pectoralis major', 'serratus anterior', 'latissimus dorsi',
    'supraspinatus', 'infraspinatus', 'subscapularis'
]
TEST_NAMES = np.arange(16)

project 'Sylvain_2015' (10 participants)
project 'Landry2016' (15 participants)
project 'Landry2012' (18 participants)
project 'Tennis' (16 participants)
project 'Violon' (10 participants)
project 'Patrick_2013' (16 participants)
project 'Yoann_2015' (22 participants)
project 'Landry2013' (21 participants)
project 'Landry2015_1' (14 participants)
project 'Landry2015_2' (11 participants)

	total participants: 152



All-NaN slice encountered



## 3.2. From dict to pandas dataframe

In [4]:
df_tidy = pd.DataFrame({
    'participant': data['participants'],
    'dataset': data['datasets'],
    'muscle': data['muscles'],
    'test': data['tests'],
    'mvc': data['mvc']
}).dropna()

print(f'dataset shape = {df_tidy.shape}')
df_tidy.head()

dataset shape = (16456, 5)


| | dataset | muscle | mvc | participant | test |
|---|---|---|---|---|---|
| 0 | 0 | 0 | 0.131901 | 0 | 0 |
| 1 | 0 | 0 | 0.091436 | 0 | 1 |
| 2 | 0 | 0 | 0.112007 | 0 | 2 |
| 3 | 0 | 0 | 0.106817 | 0 | 3 |
| 4 | 0 | 0 | 0.085951 | 0 | 4 |


In [5]:
df_wide = df_tidy.pivot_table(
    index=['dataset', 'participant', 'muscle'],
    columns='test',
    values='mvc',
    fill_value=np.nan).reset_index()

df_wide = df_wide.drop(['dataset', 'participant'], axis=1)
df_wide.columns = df_wide.columns.astype(str)

print(f'dataset shape = {df_wide.shape}')
df_wide.head()

dataset shape = (1468, 17)


| | muscle | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0.131901 | 0.091436 | 0.112007 | 0.106817 | 0.085951 | 0.040268 | 0.033935 | 0.059248 | 0.105998 | 0.087777 | 0.0287 | 0.102914 | 0.012532 | 0.029702 | 0.07777 | NaN |
| 1 | 1 | 0.209955 | 0.357945 | 0.339552 | 0.184616 | 0.249566 | 0.233603 | 0.114915 | 0.047925 | 0.26544 | 0.26644 | 0.133096 | 0.264917 | 0.041322 | 0.022607 | 0.3411 | NaN |
| 2 | 2 | 0.162873 | 0.093305 | 0.409138 | 0.279272 | 0.196073 | 0.098213 | 0.156744 | 0.042138 | 0.146441 | 0.251222 | 0.043447 | 0.179067 | 0.024739 | 0.087391 | 0.240742 | NaN |
| 3 | 4 | 0.087598 | 0.086329 | 0.102634 | 0.085689 | 0.083726 | 0.051671 | 0.011998 | 0.02323 | 0.07906 | 0.069491 | 0.079671 | 0.075335 | 0.012705 | 0.017431 | 0.087955 | NaN |
| 4 | 5 | 0.170896 | 0.391435 | 0.328289 | 0.210572 | 0.169269 | 0.291393 | 0.062967 | 0.012713 | 0.212864 | 0.251469 | 0.269839 | 0.221003 | 0.025882 | 0.011095 | 0.312398 | NaN |


# 4. Data analysis

## 4.1. Muscles by dataset

In [6]:
def plot_count_by_dataset(d, values, index, columns, **kwargs):
    table = d.pivot_table(
        values,
        index,
        columns,
        aggfunc=lambda x: len(x) / x.nunique(),
        fill_value=0).astype(int)

    fig = ff.create_annotated_heatmap(
        z=np.array(table),
        x=kwargs.get('xlabel'),
        y=kwargs.get('ylabel'),
        showscale=True,
        colorscale='YlGnBu',
        colorbar=dict(title='Count', titleside='right'))

    fig['layout'].update(BASE_LAYOUT)
    fig['layout'].update(
        dict(
            title=kwargs.get('title'),
            xaxis=dict(title=kwargs.get('xtitle'), side='bottom'),
            yaxis=dict(title=kwargs.get('ytitle'), autorange='reversed'),
            margin=go.Margin(t=80, b=80, l=150, r=80, pad=0)))
    return fig

In [7]:
muscle_by_dataset = plot_count_by_dataset(
    df_tidy,
    values='test',
    index='dataset',
    columns='muscle',
    ylabel=DATASET_NAMES,
    xlabel=MUSCLES_NAMES,
    title='Muscles by dataset')
py.iplot(muscle_by_dataset, filename='mvc/muscles_by_dataset')

## 4.2. Tests by dataset

In [8]:
test_by_dataset = plot_count_by_dataset(
    df_tidy,
    values='muscle',
    index='dataset',
    columns='test',
    ylabel=DATASET_NAMES,
    xtitle='Tests',
    title='Tests by dataset')
py.iplot(test_by_dataset, filename='mvc/tests_by_dataset')

## 4.3. Muscles and tests count

In [9]:
def plot_count_bar(d, column, **kwargs):
    # sort by category id so that the counts line up with the labels
    count = np.array(d[column].value_counts().sort_index())
    trace = go.Bar(
        x=count,
        y=kwargs.get('ylabel'),
        marker=MARKER_LAYOUT,
        orientation='h')

    layout = BASE_LAYOUT.copy()
    layout.update(
        dict(
            title=kwargs.get('title'),
            xaxis=dict(
                title=kwargs.get('xtitle'), showline=True, linewidth=1.5),
            yaxis=dict(
                title=kwargs.get('ytitle'), showline=True, linewidth=1.5)))

    # adjust y axis
    layout['yaxis'].update(nticks=count.shape[0])
    layout.update(margin=go.Margin(t=80, b=80, l=150, r=80, pad=0))
    return dict(data=[trace], layout=layout)

In [10]:
test_count_bar = plot_count_bar(
    df_tidy, 'test', title='Tests count', xtitle='n', ytitle='Tests')
py.iplot(test_count_bar, filename='mvc/test_count_bar')

In [11]:
muscle_count_bar = plot_count_bar(
    df_tidy, 'muscle', ylabel=MUSCLES_NAMES, title='Muscles count', xtitle='n')
py.iplot(muscle_count_bar, filename='mvc/muscle_count_bar')

## 4.4. Max for each test (normalized by participant number)

In [12]:
normalized, _ = load_data(
    data_path=DATA_PATH, data_format=DATA_FORMAT, normalize=True)

df_normalized = pd.DataFrame({
    'participant': normalized['participants'],
    'dataset': normalized['datasets'],
    'muscle': normalized['muscles'],
    'test': normalized['tests'],
    'mvc': normalized['mvc']
}).dropna()

project 'Sylvain_2015' (10 participants)
project 'Landry2016' (15 participants)
project 'Landry2012' (18 participants)
project 'Tennis' (16 participants)
project 'Violon' (10 participants)
project 'Patrick_2013' (16 participants)
project 'Yoann_2015' (22 participants)
project 'Landry2013' (21 participants)
project 'Landry2015_1' (14 participants)
project 'Landry2015_2' (11 participants)

	total participants: 152



All-NaN slice encountered



In [13]:
def plot_max_by_test(d, **kwargs):
    # np.isclose avoids relying on exact float equality after the 100 / max scaling
    maximum = d[np.isclose(d['mvc'], 100)].pivot_table(
        values='muscle',
        index='dataset',
        columns='test',
        aggfunc='count',
        fill_value=0)
    maximum = (maximum.div(maximum.sum(axis=1), axis=0) * 100).astype(int)

    fig = ff.create_annotated_heatmap(
        z=np.array(maximum),
        x=kwargs.get('xlabel'),
        y=kwargs.get('ylabel'),
        showscale=True,
        colorscale='YlGnBu',
        colorbar=dict(title='Percentage', titleside='right'))

    fig['layout'].update(BASE_LAYOUT)
    fig['layout'].update(
        dict(
            title=kwargs.get('title'),
            xaxis=dict(title=kwargs.get('xtitle'), side='bottom'),
            yaxis=dict(title=kwargs.get('ytitle'), autorange='reversed'),
            margin=go.Margin(t=80, b=80, l=150, r=80, pad=0)))
    return fig

In [14]:
max_by_test = plot_max_by_test(
    df_normalized,
    ylabel=DATASET_NAMES,
    xtitle='Tests',
    title='Max for each test (normalized by participant number)')

py.iplot(max_by_test, filename='mvc/max_by_test')

## 4.5. Distribution

In [15]:
def plot_mvc_boxplot(d, xlabel, by='muscle', **kwargs):
    if 'subset' in kwargs:
        subset = kwargs.get('subset')
        if by == 'muscle':
            d = d[d['test'] == subset]
            print(f'test {subset} selected')
        else:
            d = d[d['muscle'] == subset]
            print(f'muscle {subset} selected')

    traces = []
    for ilabel in range(len(xlabel)):
        traces.append(
            go.Box(
                y=np.array(d[d[by] == ilabel]['mvc']),
                name=xlabel[ilabel],
                boxpoints='all',
                jitter=0.5,
                whiskerwidth=0.2,
                marker=dict(size=2),
                line=dict(width=1)))

    layout = BASE_LAYOUT.copy()
    layout.update(
        dict(
            title=kwargs.get('title'),
            xaxis=dict(
                title=kwargs.get('xtitle'), showline=True, linewidth=1.5),
            yaxis=dict(
                title=kwargs.get('ytitle'),
                showline=True,
                linewidth=1.5,
                zeroline=False)))

    return dict(data=traces, layout=layout)

In [16]:
mvc_box = plot_mvc_boxplot(
    df_normalized, xlabel=TEST_NAMES, by='test')

py.iplot(mvc_box, filename='mvc/mvc_box')

## 4.6. Summary

# 5. Tests selection

In [17]:
def plot_count_nan(d, **kwargs):
    nan_count = d.isnull().sum()
    nan_id = nan_count.index

    trace = go.Bar(x=nan_id, y=nan_count, marker=MARKER_LAYOUT)
    layout = BASE_LAYOUT.copy()
    layout.update(
        dict(
            title=kwargs.get('title'),
            xaxis=dict(
                title=kwargs.get('xtitle'), showline=True, linewidth=1.5),
            yaxis=dict(
                title=kwargs.get('ytitle'), showline=True, linewidth=1.5)))

    annotations = []
    for count, idx in zip(nan_count, nan_id):
        annotations.append(
            dict(
                y=count + 50,
                x=idx,
                text=count,
                font=dict(size=14),
                showarrow=False))
        if count < 50:
            annotations.append(
                dict(
                    y=count + 100,
                    x=idx,
                    text=f'test {idx}',
                    font=dict(size=14),
                    showarrow=True,
                    arrowhead=7,
                    ax=0,
                    ay=-200))
    layout['annotations'] = annotations
    return dict(data=[trace], layout=layout)

In [18]:
nan_count_bar = plot_count_nan(
    df_wide.iloc[:, df_wide.columns != 'muscle'],
    title='NaN count for each test',
    xtitle='Tests',
    ytitle='NaN count')
py.iplot(nan_count_bar, filename='mvc/nan_count_bar')

## 5.1. Outcome
- Based on the missing-value counts plotted in the previous cell, **four tests stand out with very few missing values**.
- The learning algorithm will be fed with the following tests: $2, 3, 4, 5$
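
The selection rule above (keep the tests with the fewest missing values) can be sketched as follows. This is a minimal illustration on a made-up array, not the notebook's data; the helper name `columns_with_fewest_nans` and the cutoff `k` are assumptions introduced for the example.

```python
import numpy as np


def columns_with_fewest_nans(values, k):
    """Return the indices of the k columns with the fewest NaNs, plus the NaN counts."""
    nan_count = np.isnan(values).sum(axis=0)
    # stable sort so that ties keep the original column order
    keep = np.argsort(nan_count, kind='stable')[:k]
    return np.sort(keep), nan_count


# toy data (hypothetical): 4 rows x 5 tests
toy = np.array([
    [0.1, np.nan, 0.3, 0.2, np.nan],
    [0.2, np.nan, 0.1, 0.4, 0.5],
    [np.nan, np.nan, 0.2, 0.3, np.nan],
    [0.3, 0.1, 0.4, 0.1, np.nan],
])

selected, nan_count = columns_with_fewest_nans(toy, k=2)
print('NaN count per column:', nan_count)
print('selected columns:', selected)
```

In the notebook itself the choice was made by reading the bar chart, so the cutoff here is only one possible way to automate it.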

# 6. Machine learning pipeline

## 6.1. Configurations

In [19]:
REF_COLS = {
    'test_cols': ['2', '3', '4', '5'],
    'categorical_cols': ['muscle'],
    'test_ref': '2'
}

## 6.2. Get & split data

In [20]:
def get_X_and_y(d, test_col_str, other_col_to_keep, remove_nans=False):
    """Build the feature matrix X (categorical + test columns) and the target y."""
    # y is the row-wise maximum, computed over the selected test columns only
    y = np.nanmax(d[test_col_str], axis=1)

    subset = d[test_col_str]
    X = np.c_[d[other_col_to_keep], subset]
    col_names = other_col_to_keep + test_col_str

    # flag rows with at least one missing value
    nan_row_idx = np.isnan(X).any(axis=1)
    if remove_nans:
        X = X[~nan_row_idx, :]
        y = y[~nan_row_idx]
        print(f'Removed {np.sum(nan_row_idx)} rows')
    else:
        if np.any(nan_row_idx):
            print(
                f'Warning: {np.sum(nan_row_idx)} rows have NaNs. You should remove them or use an imputer'
            )
    return X, y, np.array(col_names)

In [21]:
X, y, COL_NAMES = get_X_and_y(
    df_wide,
    test_col_str=REF_COLS['test_cols'],
    other_col_to_keep=REF_COLS['categorical_cols'],
    remove_nans=True)

Removed 4 rows



All-NaN axis encountered



In [22]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=RANDOM_SEED, stratify=X[:, 0])

## 6.2. Pipeline

### 6.2.1. Temporary step: normalize data
- This step is temporarily kept outside the pipeline because the current scikit-learn release does not include the [TransformedTargetRegressor](http://scikit-learn.org/dev/modules/generated/sklearn.preprocessing.TransformedTargetRegressor.html#sklearn.preprocessing.TransformedTargetRegressor) yet (it is only available in the development version).
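
As a rough illustration of what this temporary step does, the sketch below expresses every value as a percentage of a reference column and then undoes the scaling. The function names `to_percent_of_ref` and `from_percent_of_ref` and the toy numbers are assumptions made for the example; they are not part of the notebook.

```python
import numpy as np


def to_percent_of_ref(values, ref):
    """Express `values` as a percentage of `ref` (one reference value per row)."""
    values = np.asarray(values, dtype=float)
    ref = np.asarray(ref, dtype=float)
    # add a trailing axis to ref when values is 2-D so it broadcasts across columns
    scale = ref[:, np.newaxis] if values.ndim == 2 else ref
    return values * 100 / scale


def from_percent_of_ref(values, ref):
    """Inverse of `to_percent_of_ref`: go back to the original units."""
    values = np.asarray(values, dtype=float)
    ref = np.asarray(ref, dtype=float)
    scale = ref[:, np.newaxis] if values.ndim == 2 else ref
    return values / 100 * scale


# toy data (hypothetical): 2 rows, 2 test columns; the first column is the reference
X = np.array([[0.5, 0.25],
              [0.25, 0.5]])
y = np.array([0.5, 0.5])
ref = X[:, 0]

X_norm = to_percent_of_ref(X, ref)
y_norm = to_percent_of_ref(y, ref)
X_back = from_percent_of_ref(X_norm, ref)
y_back = from_percent_of_ref(y_norm, ref)
print(X_norm)
print(y_norm)
```

Note that the inverse needs the same reference vector that was used for the forward step, so it has to be kept around until the predictions are converted back.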

In [23]:
class Normalize:
    """Normalize the given array(s) based on the reference column of the first array."""

    def __init__(self, to_normalize=None, ref=0):
        self.to_normalize = to_normalize
        self.ref = ref

    def transform(self, X, y):
        normalize = lambda a, b: a * 100 / b
        X_out, y_out = [], []

        if X.any():
            X_out = X.copy()
            self.ref_vector = X_out[:, self.ref].ravel()
            X_out[:, self.to_normalize] = np.apply_along_axis(
                normalize, 0, X_out[:, self.to_normalize], self.ref_vector)

            if y.any():
                y_out = np.apply_along_axis(normalize, 0, y, self.ref_vector)

        return X_out, y_out

    def inverse_transform(self, X, y):
        denormalize = lambda a, b: a / 100 * b
        X_out, y_out = [], []
        if X.any():
            X_out = X.copy()
            X_out[:, self.to_normalize] = np.apply_along_axis(
                denormalize, 0, X_out[:, self.to_normalize], self.ref_vector)

        if y.any():
            y_out = np.apply_along_axis(
                denormalize, 0, y, self.ref_vector)
            
        return X_out, y_out

In [24]:
# normalize train set
normalizer_train = Normalize(
    to_normalize=np.in1d(COL_NAMES, REF_COLS['test_cols']),
    ref=np.in1d(COL_NAMES, REF_COLS['test_ref']))
X_train, y_train = normalizer_train.transform(X_train, y_train)

# normalize test set
normalizer_test = Normalize(
    to_normalize=np.in1d(COL_NAMES, REF_COLS['test_cols']),
    ref=np.in1d(COL_NAMES, REF_COLS['test_ref']))
X_test, y_test = normalizer_test.transform(X_test, y_test)

### 6.2.2. Assembling the pipeline

In [25]:
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.preprocessing import FunctionTransformer
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import PolynomialFeatures

from xgboost import XGBRegressor
from sklearn.ensemble import GradientBoostingRegressor


def get_categorical_cols(X, col_names=COL_NAMES):
    # keep only the categorical column(s)
    return X[:, np.in1d(col_names, REF_COLS['categorical_cols'])]


def get_numerical_cols(X, col_names=COL_NAMES):
    # keep only the selected test columns
    return X[:, np.in1d(col_names, REF_COLS['test_cols'])]


pipeline_categorical = Pipeline([
    ('selector', FunctionTransformer(get_categorical_cols, validate=False)),
    ('encoder', OneHotEncoder(sparse=False))
])

pipeline_numerical = Pipeline([
    ('selector', FunctionTransformer(get_numerical_cols, validate=False))
])

pipeline_preprocessing = FeatureUnion([
    ('categorical', pipeline_categorical),
    ('numerical', pipeline_numerical)
])

model_param = dict(
    alpha=0.85,
    learning_rate=0.1,
    loss="ls",
    max_depth=10,
    max_features=1.0,
    min_samples_leaf=1,
    min_samples_split=2,
    n_estimators=100,
    subsample=0.55)

pipeline_full = Pipeline([
    ('preprocessing', pipeline_preprocessing),
    ('regressor', GradientBoostingRegressor(**model_param))
])

In [26]:
from sklearn.metrics import make_scorer


def mape(y_test, y_pred):
    val = (np.abs((y_test - y_pred) / y_test)) * 100
    return np.mean(val)


mape_scorer = make_scorer(mape, greater_is_better=False)
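
For reference, the `mape` function above computes the mean absolute percentage error. Written out, with $y_i$ the measured value, $\hat{y}_i$ the prediction and $n$ the number of samples (symbols introduced here for illustration):

$$\mathrm{MAPE} = \frac{100}{n} \sum_{i=1}^{n} \left| \frac{y_i - \hat{y}_i}{y_i} \right|$$

Because the scorer is built with `greater_is_better=False`, scikit-learn negates the value, so scores reported through `mape_scorer` are negative and values closer to zero are better.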

## 6.3. Optimization

In [27]:
from sklearn.ensemble import ExtraTreesRegressor, GradientBoostingRegressor
from sklearn.feature_selection import SelectFromModel, SelectFwe, f_regression
from sklearn.preprocessing import Normalizer

pipeline_categorical = Pipeline([
    ('selector', FunctionTransformer(get_categorical_cols, validate=False)),
    ('encoder', OneHotEncoder(sparse=False))
])

pipeline_numerical = Pipeline([
    ('selector', FunctionTransformer(get_numerical_cols, validate=False)),
    ('normalizer', Normalizer(norm="max")),
    ('selectfwe', SelectFwe(score_func=f_regression, alpha=0.015)), 
    ('selectfrommodel', SelectFromModel(
        estimator=ExtraTreesRegressor(max_features=0.8, n_estimators=100),
        threshold=0.25)),
])

pipeline_preprocessing = FeatureUnion([
    ('categorical', pipeline_categorical),
    ('numerical', pipeline_numerical)
])

model_param = dict(
    alpha=0.85,
    learning_rate=0.1,
    loss="ls",
    max_depth=10,
    max_features=1.0,
    min_samples_leaf=1,
    min_samples_split=2,
    n_estimators=100,
    subsample=0.55)

pipeline_full = Pipeline([
    ('preprocessing', pipeline_preprocessing),
    ('regressor', GradientBoostingRegressor(**model_param, random_state=RANDOM_SEED))
])

In [28]:
pipeline_full.fit(X_train, y_train)

Pipeline(memory=None,
     steps=[('preprocessing', FeatureUnion(n_jobs=1,
       transformer_list=[('categorical', Pipeline(memory=None,
     steps=[('selector', FunctionTransformer(accept_sparse=False,
          func=<function get_categorical_cols at 0x7f7a58e248c8>,
          inv_kw_args=None, inverse_func=None, kw_args=No...rs=100, presort='auto', random_state=42,
             subsample=0.55, verbose=0, warm_start=False))])

In [29]:
y_pred = pipeline_full.predict(X_test)

In [30]:
from sklearn.model_selection import cross_val_score

# evaluate
cv_score = cross_val_score(
    pipeline_full, X_train, y_train, cv=5, scoring=mape_scorer)

print(cv_score)

print(f'mean: {np.mean(cv_score)}')

[-0.15879414 -0.17814664 -0.20262298 -0.86782877 -0.11136198]
mean: -0.3037509019530941


## 6.4. Diagnostic

In [31]:
from sklearn.model_selection import learning_curve


def plot_learning_curve(estimator,
                        X,
                        y,
                        scoring=mape_scorer,
                        cv=None,
                        n_jobs=1,
                        train_sizes=np.linspace(.1, 1.0, 5),
                        **kwargs):
    train_sizes, train_scores, test_scores = learning_curve(
        estimator, X, y, scoring=scoring, cv=cv, train_sizes=train_sizes, n_jobs=n_jobs)

    train_scores_mean = np.mean(train_scores, axis=1)
    train_scores_std = np.std(train_scores, axis=1)

    test_scores_mean = np.mean(test_scores, axis=1)
    test_scores_std = np.std(test_scores, axis=1)

    trace = []

    # training mean
    trace.append(
        go.Scatter(
            x=train_sizes,
            y=train_scores_mean,
            marker=dict(color='red'),
            name='Training score'))

    # training std
    trace.append(
        go.Scatter(
            x=train_sizes,
            y=train_scores_mean + train_scores_std,
            mode='lines',
            line=dict(color='red', width=1),
            showlegend=False))

    trace.append(
        go.Scatter(
            x=train_sizes,
            y=train_scores_mean - train_scores_std,
            mode='lines',
            line=dict(color='red', width=1),
            fill='tonexty',
            showlegend=False))

    # test mean
    trace.append(
        go.Scatter(
            x=train_sizes,
            y=test_scores_mean,
            marker=dict(color='green'),
            name='Cross-validation score'))

    # test std
    trace.append(
        go.Scatter(
            x=train_sizes,
            y=test_scores_mean + test_scores_std,
            mode='lines',
            line=dict(color='green', width=1),
            showlegend=False))

    trace.append(
        go.Scatter(
            x=train_sizes,
            y=test_scores_mean - test_scores_std,
            mode='lines',
            line=dict(color='green', width=1),
            fill='tonexty',
            showlegend=False))

    data = [itrace for itrace in trace]
    layout = BASE_LAYOUT.copy()
    layout.update(
        dict(
            title=kwargs.get('title'),
            xaxis=dict(
                title=kwargs.get('xtitle'), showline=True, linewidth=1.5),
            yaxis=dict(
                title=kwargs.get('ytitle'), showline=True, linewidth=1.5)))
    fig = dict(data=data, layout=layout)
    return fig

In [32]:
learning_curves = plot_learning_curve(
    estimator=pipeline_full,
    X=X_train,
    y=y_train,
    scoring=mape_scorer,
    cv=5,
    train_sizes=np.linspace(.1, 1.0, 10),
    n_jobs=-1,
    title='Learning curves',
    xtitle='Training examples',
    ytitle='MAPE (%)')

py.iplot(learning_curves, filename='mvc/learning_curves')

## 6.5. Evaluation

### 6.5.1. Prediction

In [33]:
y_pred = pipeline_full.predict(X_test)

### 6.5.2. Denormalize data

In [34]:
_, y_pred_denorm = normalizer_test.inverse_transform(X_test, y_pred)
_, y_test_denorm = normalizer_test.inverse_transform(X_test, y_test)

### 6.5.3. Evaluate on all muscles

In [35]:
from sklearn.metrics import mean_squared_error
from sklearn.metrics import r2_score
from sklearn.metrics import explained_variance_score


def regression_report(y_test, y_pred, verbose=True):
    report = {
        'rmse': np.sqrt(mean_squared_error(y_test, y_pred)),
        'mape': mape(y_test, y_pred),
        'r2': r2_score(y_test, y_pred),
        'variance': explained_variance_score(y_test, y_pred)
    }
    if verbose:
        print(f'\tRMSE: {report["rmse"]:.5f}')
        print(f'\tMAPE: {report["mape"]:.5f}')
        print(f'\tR2: {report["r2"]:.5f}')
        print(f'\tvariance: {report["variance"]:.5f}')
    return report

In [36]:
print('Regression report for all muscles:')
report = {}
report['mean'] = regression_report(y_test_denorm,  y_pred_denorm)

Regression report for all muscles:
	RMSE: 0.00073
	MAPE: 0.10628
	R2: 0.99999
	variance: 0.99999


### 6.5.4. Evaluate on each muscle

In [37]:
for imuscle in np.unique(X_test[:, COL_NAMES == 'muscle']).astype(int):
    subset = (X_test[:, COL_NAMES == 'muscle'] == imuscle).ravel()
    report[imuscle] = regression_report(
        y_test_denorm[subset], y_pred_denorm[subset], verbose=False)
    
report_by_muscle = pd.DataFrame(report).T.drop('mean', axis=0)
report_by_muscle.index = MUSCLES_NAMES

In [38]:
def plot_bar_metrics(d, **kwargs):
    trace_mape = go.Bar(
        x=np.array(d.index),
        y=np.array(d['mape']),
        marker=MARKER_LAYOUT,
        name='mape')

    trace_rmse = go.Scatter(
        x=np.array(d.index),
        y=np.array(d['rmse']),
        marker=dict(
            color='rgba(117, 112, 179, 0.6)',
            line=dict(
                color='rgba(117, 112, 179, 1.0)',
                width=2,
            )),
        name='rmse',
        xaxis='x1',
        yaxis='y2')

    traces = [trace_mape, trace_rmse]

    layout = BASE_LAYOUT.copy()
    layout.update(
        dict(
            title=kwargs.get('title'),
            xaxis=dict(
                title=kwargs.get('xtitle'), showline=True, linewidth=1.5),
            yaxis=dict(
                title=kwargs.get('ytitle'), showline=True, linewidth=1.5),
            yaxis2=dict(
                overlaying='y',
                side='right',
                showgrid=False,
                title=kwargs.get('y2title'),
                showline=True,
                zeroline=False,
                linewidth=1.5),
            legend=dict(x=5, y=1.1)))

    return dict(data=traces, layout=layout)

In [39]:
metrics_by_muscle = plot_bar_metrics(
    report_by_muscle,
    title='MAPE and RMSE for each muscle',
    ytitle='MAPE (%)',
    y2title='RMSE (mV)')
py.iplot(metrics_by_muscle, filename='mvc/metrics_by_muscle')

### 6.5.5. Summary

In [40]:
def table_regression_report(d, **kwargs):
    # include the dataframe index (muscle names) as the first column
    table = ff.create_table(d, index=True)

    table['layout'].update(font=dict(size=14))
    table['layout'].update(title=kwargs.get('title'))
    return table

In [41]:
summary_report = pd.DataFrame(report).T
summary_report.index = ['mean'] + MUSCLES_NAMES

table_report = table_regression_report(
    summary_report,
    title='MAPE and RMSE for each muscle')
py.iplot(table_report, filename='mvc/table_report')