<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#1.-MVC-project-description" data-toc-modified-id="1.-MVC-project-description-1">1. MVC project description</a></span></li><li><span><a href="#2.-Setup" data-toc-modified-id="2.-Setup-2">2. Setup</a></span></li><li><span><a href="#3.-Get-the-data" data-toc-modified-id="3.-Get-the-data-3">3. Get the data</a></span><ul class="toc-item"><li><span><a href="#3.1.-From-matlab-to-dict" data-toc-modified-id="3.1.-From-matlab-to-dict-3.1">3.1. From matlab to dict</a></span></li><li><span><a href="#3.2.-From-dict-to-pandas-dataframe" data-toc-modified-id="3.2.-From-dict-to-pandas-dataframe-3.2">3.2. From dict to pandas dataframe</a></span></li></ul></li><li><span><a href="#4.-Data-analysis" data-toc-modified-id="4.-Data-analysis-4">4. Data analysis</a></span><ul class="toc-item"><li><span><a href="#4.1.-Muscles-by-dataset" data-toc-modified-id="4.1.-Muscles-by-dataset-4.1">4.1. Muscles by dataset</a></span></li><li><span><a href="#4.2.-Tests-by-dataset" data-toc-modified-id="4.2.-Tests-by-dataset-4.2">4.2. Tests by dataset</a></span></li><li><span><a href="#4.3.-Muscles-and-tests-count" data-toc-modified-id="4.3.-Muscles-and-tests-count-4.3">4.3. Muscles and tests count</a></span></li><li><span><a href="#4.4.-Max-for-each-test-(normalized-by-participant-number)" data-toc-modified-id="4.4.-Max-for-each-test-(normalized-by-participant-number)-4.4">4.4. Max for each test (normalized by participant number)</a></span></li><li><span><a href="#4.5.-Distribution" data-toc-modified-id="4.5.-Distribution-4.5">4.5. Distribution</a></span></li><li><span><a href="#4.6.-Summary" data-toc-modified-id="4.6.-Summary-4.6">4.6. Summary</a></span></li></ul></li><li><span><a href="#5.-Tests-selection" data-toc-modified-id="5.-Tests-selection-5">5. 
Tests selection</a></span><ul class="toc-item"><li><span><a href="#5.1-Outcome" data-toc-modified-id="5.1-Outcome-5.1">5.1 Outcome</a></span></li></ul></li><li><span><a href="#6.-Machine-learning-pipeline" data-toc-modified-id="6.-Machine-learning-pipeline-6">6. Machine learning pipeline</a></span><ul class="toc-item"><li><span><a href="#5.1.-Shuffle-&amp;-split-data" data-toc-modified-id="5.1.-Shuffle-&amp;-split-data-6.1">5.1. Shuffle &amp; split data</a></span></li></ul></li></ul></div>

# 1. MVC project description

**Links**
- [github repo](https://github.com/romainmartinez/mvc)
- [plotly figures](https://plot.ly/organize/romainmartinez:114)

**Todos**
- update readme, description
- add data analysis summary
- one model by muscle

**Author**: _Romain Martinez._

# 2. Setup

In [79]:
# Common imports
import scipy.io as sio
import pandas as pd
import numpy as np

# Path
from pathlib import Path
PROJECT_PATH = Path('./')
DATA_PATH = PROJECT_PATH.joinpath('data')

# to make this notebook's output stable across runs
RANDOM_SEED = 42
np.random.seed(RANDOM_SEED)

# Figures
OFFLINE = True
if OFFLINE:
    import plotly.offline as py
    py.init_notebook_mode(connected=True)
else:
    import plotly.plotly as py
import plotly.graph_objs as go
import plotly.figure_factory as ff
from plotly import tools
BASE_LAYOUT = go.Layout(hovermode='closest', font=dict(size=14))
MARKER_LAYOUT = dict(
    color='rgba(50, 171, 96, 0.6)',
    line=dict(
        color='rgba(50, 171, 96, 1.0)',
        width=2,
    ))

# 3. Get the data

## 3.1. From matlab to dict

In [2]:
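def load_data(data_path, data_format, normalize=False):
    """Load every '*{data_format}.mat' file found in `data_path` into a tidy dict.

    With `normalize=True`, each value is expressed as a percentage of the
    maximum recorded for the same participant and muscle across all tests.
    """
    if not data_path.is_dir():
        raise ValueError('please provide a valid data path')

    mat = {}
    data = {key: [] for key in ('datasets', 'participants', 'muscles', 'tests', 'mvc')}
    count = -1  # zero-based participant id, shared across datasets
    dataset_names = []

    for ifile in data_path.iterdir():
        if ifile.name.endswith(f'{data_format}.mat'):
            dataset = ifile.name.replace(f'_{data_format}.mat', '').replace('MVE_Data_', '')

            if dataset not in dataset_names:
                dataset_names.append(dataset)
            # index into `dataset_names` so that ids stay aligned with the labels
            # even when the directory contains files that are skipped
            idataset = dataset_names.index(dataset)

            mat[dataset] = sio.loadmat(ifile)['MVE']
            n_participants = mat[dataset].shape[0]
            print(f"project '{dataset}' ({n_participants} participants)")

            for iparticipant in range(n_participants):
                count += 1
                for imuscle in range(mat[dataset].shape[1]):
                    # np.nanmax warns ('All-NaN slice encountered') and returns NaN
                    # when a muscle has no recorded test for this participant
                    max_mvc = np.nanmax(mat[dataset][iparticipant, imuscle, :])
                    for itest in range(mat[dataset].shape[2]):
                        data['participants'].append(count)
                        data['datasets'].append(idataset)
                        data['muscles'].append(imuscle)
                        data['tests'].append(itest)
                        if normalize:
                            data['mvc'].append(mat[dataset][iparticipant, imuscle, itest] * 100 / max_mvc)
                        else:
                            data['mvc'].append(mat[dataset][iparticipant, imuscle, itest])

    # NOTE: `count` is the id of the last participant (zero-based),
    # so the value printed here is the number of participants minus one
    print(f'\n\ttotal participants: {count}')
    return data, dataset_names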

In [3]:
DATA_FORMAT = 'only_max'
data, DATASET_NAMES = load_data(data_path=DATA_PATH, data_format=DATA_FORMAT, normalize=False)

MUSCLES_NAMES = [
    'upper trapezius', 'middle trapezius', 'lower trapezius',
    'anterior deltoid', 'middle deltoid', 'posterior deltoid',
    'pectoralis major', 'serratus anterior', 'latissimus dorsi',
    'supraspinatus', 'infraspinatus', 'subscapularis'
]
TEST_NAMES = np.arange(16)

project 'Landry2016' (15 participants)
project 'Landry2015_2' (11 participants)
project 'Landry2015_1' (14 participants)
project 'Violon' (10 participants)
project 'Yoann_2015' (22 participants)
project 'Landry2013' (21 participants)
project 'Landry2012' (18 participants)
project 'Tennis' (16 participants)
project 'Patrick_2013' (16 participants)
project 'Sylvain_2015' (10 participants)

	total participants: 152



All-NaN slice encountered



## 3.2. From dict to pandas dataframe

In [4]:
df_tidy = pd.DataFrame({
    'participant': data['participants'],
    'dataset': data['datasets'],
    'muscle': data['muscles'],
    'test': data['tests'],
    'mvc': data['mvc']
}).dropna()

print(f'dataset shape = {df_tidy.shape}')
df_tidy.head()

dataset shape = (16456, 5)


,dataset,muscle,mvc,participant,test
2,0,0,0.127825,0,2
3,0,0,0.124255,0,3
4,0,0,0.146927,0,4
5,0,0,0.041583,0,5
8,0,0,0.162206,0,8


In [5]:
df_wide = df_tidy.pivot_table(
    index=['dataset', 'participant', 'muscle'],
    columns='test',
    values='mvc',
    fill_value=np.nan).reset_index()

df_wide = df_wide.drop(['dataset', 'participant'], axis=1)

print(f'dataset shape = {df_wide.shape}')
df_wide.head()

dataset shape = (1468, 17)


test,muscle,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15
0,0,,,0.127825,0.124255,0.146927,0.041583,,,0.162206,0.017711,0.014369,,,,,0.036916
1,2,,,0.179864,0.294909,0.295846,0.107769,,,0.199097,0.20215,0.022668,,,,,0.07146
2,3,,,0.078753,0.244578,0.272709,0.010146,,,0.10649,0.0106,0.007517,,,,,0.238814
3,4,,,0.150353,0.104654,0.115272,0.057845,,,0.065429,0.05293,0.009894,,,,,0.079885
4,5,,,0.172669,0.124655,0.133114,0.196436,,,0.101393,0.187997,0.051396,,,,,0.050041


# 4. Data analysis

## 4.1. Muscles by dataset

In [6]:
def plot_count_by_dataset(d, values, index, columns, **kwargs):
    table = d.pivot_table(
        values,
        index,
        columns,
        aggfunc=lambda x: len(x) / x.nunique(),
        fill_value=0).astype(int)

    fig = ff.create_annotated_heatmap(
        z=np.array(table),
        x=kwargs.get('xlabel'),
        y=kwargs.get('ylabel'),
        showscale=True,
        colorscale='YlGnBu',
        colorbar=dict(title='Count', titleside='right'))

    fig['layout'].update(BASE_LAYOUT)
    fig['layout'].update(
        dict(
            title=kwargs.get('title'),
            xaxis=dict(title=kwargs.get('xtitle'), side='bottom'),
            yaxis=dict(title=kwargs.get('ytitle'), autorange='reversed'),
            margin=go.Margin(t=80, b=80, l=150, r=80, pad=0)))
    return fig

In [7]:
muscle_by_dataset = plot_count_by_dataset(
    df_tidy,
    values='test',
    index='dataset',
    columns='muscle',
    ylabel=DATASET_NAMES,
    xlabel=MUSCLES_NAMES,
    title='Muscles by dataset')
py.iplot(muscle_by_dataset, filename='mvc/muscles_by_dataset')

## 4.2. Tests by dataset

In [8]:
test_by_dataset = plot_count_by_dataset(
    df_tidy,
    values='muscle',
    index='dataset',
    columns='test',
    ylabel=DATASET_NAMES,
    xtitle='Tests',
    title='Tests by dataset')
py.iplot(test_by_dataset, filename='mvc/tests_by_dataset')

## 4.3. Muscles and tests count

In [80]:
def plot_count_bar(d, column, **kwargs):
    # sort by category id so that counts line up with the labels in `ylabel`
    count = np.array(d[column].value_counts().sort_index())
    trace = go.Bar(
        x=count,
        y=kwargs.get('ylabel'),
        marker=MARKER_LAYOUT,
        orientation='h')

    layout = BASE_LAYOUT.copy()
    layout.update(
        dict(
            title=kwargs.get('title'),
            xaxis=dict(
                title=kwargs.get('xtitle'), showline=True, linewidth=1.5),
            yaxis=dict(
                title=kwargs.get('ytitle'), showline=True, linewidth=1.5)))

    # adjust y axis
    layout['yaxis'].update(nticks=count.shape[0])
    layout.update(margin=go.Margin(t=80, b=80, l=150, r=80, pad=0))
    return dict(data=[trace], layout=layout)

In [81]:
test_count_bar = plot_count_bar(
    df_tidy, 'test', title='Tests count', xtitle='n', ytitle='Tests')
py.iplot(test_count_bar, filename='mvc/test_count_bar')

In [82]:
muscle_count_bar = plot_count_bar(
    df_tidy, 'muscle', ylabel=MUSCLES_NAMES, title='Muscles count', xtitle='n')
py.iplot(muscle_count_bar, filename='mvc/muscle_count_bar')

## 4.4. Max for each test (normalized by participant number)

In [12]:
normalized, _ = load_data(
    data_path=DATA_PATH, data_format=DATA_FORMAT, normalize=True)

df_normalized = pd.DataFrame({
    'participant': normalized['participants'],
    'dataset': normalized['datasets'],
    'muscle': normalized['muscles'],
    'test': normalized['tests'],
    'mvc': normalized['mvc']
}).dropna()

project 'Landry2016' (15 participants)
project 'Landry2015_2' (11 participants)
project 'Landry2015_1' (14 participants)
project 'Violon' (10 participants)
project 'Yoann_2015' (22 participants)
project 'Landry2013' (21 participants)
project 'Landry2012' (18 participants)
project 'Tennis' (16 participants)
project 'Patrick_2013' (16 participants)
project 'Sylvain_2015' (10 participants)

	total participants: 152



All-NaN slice encountered
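
The normalization applied by `load_data(normalize=True)` expresses each value as a percentage of the maximum recorded for the same participant and muscle. As a minimal sketch, the same quantity could probably be derived from an already tidy frame with a grouped transform instead of reloading the files. The column names follow `df_tidy`; the frame `df_demo`, its values, and the column `mvc_normalized` are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical tidy data: two participants, one muscle, three tests each
df_demo = pd.DataFrame({
    'participant': [0, 0, 0, 1, 1, 1],
    'muscle': [0, 0, 0, 0, 0, 0],
    'test': [0, 1, 2, 0, 1, 2],
    'mvc': [0.1, 0.2, 0.4, 0.3, 0.6, 0.15],
})

# Maximum over all tests for each (participant, muscle) pair,
# broadcast back to the original rows
max_mvc = df_demo.groupby(['participant', 'muscle'])['mvc'].transform('max')

# Express every value as a percentage of that maximum
df_demo['mvc_normalized'] = df_demo['mvc'] * 100 / max_mvc

# Rows holding the maximum; a tolerance is used because the
# floating-point result is not guaranteed to be exactly 100.0
is_max = np.isclose(df_demo['mvc_normalized'], 100)
print(df_demo[is_max])
```

Here the rows flagged by `is_max` are the tests that produced the maximum for each participant and muscle, which is the information the heatmap in this section aggregates by dataset.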



In [13]:
def plot_max_by_test(d, **kwargs):
    # compare with a tolerance: value * 100 / max is not guaranteed to be exactly 100.0
    maximum = d[np.isclose(d['mvc'], 100)].pivot_table(
        values='muscle',
        index='dataset',
        columns='test',
        aggfunc='count',
        fill_value=0)
    maximum = (maximum.div(maximum.sum(axis=1), axis=0) * 100).astype(int)

    fig = ff.create_annotated_heatmap(
        z=np.array(maximum),
        x=kwargs.get('xlabel'),
        y=kwargs.get('ylabel'),
        showscale=True,
        colorscale='YlGnBu',
        colorbar=dict(title='Percentage', titleside='right'))

    fig['layout'].update(BASE_LAYOUT)
    fig['layout'].update(
        dict(
            title=kwargs.get('title'),
            xaxis=dict(title=kwargs.get('xtitle'), side='bottom'),
            yaxis=dict(title=kwargs.get('ytitle'), autorange='reversed'),
            margin=go.Margin(t=80, b=80, l=150, r=80, pad=0)))
    return fig

In [14]:
max_by_test = plot_max_by_test(
    df_normalized,
    ylabel=DATASET_NAMES,
    xtitle='Tests',
    title='Max for each test (normalized by participant number)')

py.iplot(max_by_test, filename='mvc/max_by_test')

## 4.5. Distribution

In [15]:
def plot_mvc_boxplot(d, xlabel, by='muscle', **kwargs):
    if 'subset' in kwargs:
        subset = kwargs.get('subset')
        if by == 'muscle':
            d = d[d['test'] == subset]
            print(f'test {subset} selected')
        else:
            d = d[d['muscle'] == subset]
            print(f'muscle {subset} selected')

    traces = []
    for ilabel in range(len(xlabel)):
        traces.append(
            go.Box(
                y=np.array(d[d[by] == ilabel]['mvc']),
                name=xlabel[ilabel],
                boxpoints='all',
                jitter=0.5,
                whiskerwidth=0.2,
                marker=dict(size=2),
                line=dict(width=1)))

    layout = BASE_LAYOUT.copy()
    layout.update(
        dict(
            title=kwargs.get('title'),
            xaxis=dict(
                title=kwargs.get('xtitle'), showline=True, linewidth=1.5),
            yaxis=dict(
                title=kwargs.get('ytitle'),
                showline=True,
                linewidth=1.5,
                zeroline=False)))

    return dict(data=traces, layout=layout)

In [36]:
mvc_box = plot_mvc_boxplot(
    df_normalized, xlabel=TEST_NAMES, by='test', subset=4)

py.iplot(mvc_box, filename='mvc/mvc_box')

muscle 4 selected


## 4.6. Summary

# 5. Tests selection

In [105]:
def plot_count_nan(d, **kwargs):
    nan_count = d.isnull().sum()
    nan_id = nan_count.index

    trace = go.Bar(x=nan_id, y=nan_count, marker=MARKER_LAYOUT)

    annotations = []
    for count, idx in zip(nan_count, nan_id):
        annotations.append(
            dict(
                y=count + 50,
                x=idx,
                text=count,
                font=dict(size=14, color='rgb(50, 171, 96)'),
                showarrow=False))

    layout = BASE_LAYOUT.copy()
    layout.update(
        dict(
            title=kwargs.get('title'),
            xaxis=dict(
                title=kwargs.get('xtitle'), showline=True, linewidth=1.5),
            yaxis=dict(
                title=kwargs.get('ytitle'), showline=True, linewidth=1.5)))

    layout['annotations'] = annotations
    return dict(data=[trace], layout=layout)

In [106]:
nan_count_bar = plot_count_nan(df_wide.iloc[:, df_wide.columns != 'muscle'])
py.iplot(nan_count_bar, filename='mvc/nan_count_bar')

## 5.1 Outcome
- Based on the missing values plotted in the previous cell, **five tests have comparatively few missing values**: tests 2, 3, 4 and 5 are almost complete, and test 0 is missing far less often than the remaining tests.
- The learning algorithm will therefore be fed with tests $0, 2, 3, 4, 5$.

In [55]:
nan_count = df_wide.isnull().sum().sort_values()
nan_count

test
muscle       0
3            3
4            3
5            3
2            4
0           76
6          415
9          438
8          441
10         449
7          462
14         543
1          591
13         619
11         924
12         924
15        1137
dtype: int64
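
The selection can also be expressed as a rule rather than read off the plot. The sketch below keeps every test whose share of missing values is at or below a cutoff. The helper `select_tests_by_missing`, its `max_missing_ratio` parameter, and the demo frame are assumptions introduced for illustration, not part of the project code. With the counts listed above, test 0 is missing in 76 of 1468 rows (about 5%) while the next candidate, test 6, is missing in 415 (about 28%), so a cutoff near 10% would keep tests 0, 2, 3, 4 and 5.

```python
import numpy as np
import pandas as pd


def select_tests_by_missing(df_wide, max_missing_ratio=0.1, ignore=('muscle',)):
    """Return the test columns whose share of NaN is <= max_missing_ratio."""
    tests = df_wide.drop(columns=list(ignore), errors='ignore')
    missing_ratio = tests.isnull().mean()  # fraction of NaN per column
    return missing_ratio[missing_ratio <= max_missing_ratio].index.tolist()


# Made-up wide frame: one row per (participant, muscle), one column per test
demo = pd.DataFrame({
    'muscle': [0, 1, 2, 3],
    0: [0.1, 0.2, np.nan, 0.4],        # 25 % missing
    1: [np.nan, np.nan, np.nan, 0.3],  # 75 % missing
    2: [0.5, 0.6, 0.7, 0.8],           # complete
})

selected = select_tests_by_missing(demo, max_missing_ratio=0.25)
print(selected)
```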

# 6. Machine learning pipeline
**Steps** (_for each muscle_):
1. test selector (the number of tests to keep is still to be decided)
2. normalization (should account for infinite values)
3. imputer (should also handle infinite values)
4. polynomial features
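
One possible way to chain the steps above is a scikit-learn `Pipeline`. This is only a sketch under several assumptions: the selected tests are the five named in section 5.1, infinite values are converted to NaN so that the imputer can fill them, and the scaler, imputation strategy and polynomial degree are placeholders rather than choices made by the project. The helpers `select_tests` and `inf_to_nan` and the random demo matrix are hypothetical.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import (FunctionTransformer, PolynomialFeatures,
                                   StandardScaler)

SELECTED_TESTS = [0, 2, 3, 4, 5]  # assumption: tests retained in section 5.1


def select_tests(X, tests=SELECTED_TESTS):
    """Keep only the columns of the selected tests."""
    return X[:, tests]


def inf_to_nan(X):
    """Replace infinite values with NaN so that they are imputed later."""
    X = np.array(X, dtype=float)  # work on a copy
    X[~np.isfinite(X)] = np.nan
    return X


preprocessing = Pipeline([
    ('selector', FunctionTransformer(select_tests)),
    ('inf_to_nan', FunctionTransformer(inf_to_nan)),
    ('scaler', StandardScaler()),  # ignores NaN when fitting
    ('imputer', SimpleImputer(strategy='median')),
    ('poly', PolynomialFeatures(degree=2, include_bias=False)),
])

# Made-up data: 20 rows, 16 tests, with one NaN and one infinite value
rng = np.random.default_rng(42)
X_demo = rng.random((20, 16))
X_demo[0, 2] = np.nan
X_demo[1, 3] = np.inf

X_out = preprocessing.fit_transform(X_demo)
print(X_out.shape)
```

The imputer is placed before the polynomial expansion because `PolynomialFeatures` does not accept missing values. Since the todo list calls for one model by muscle, a pipeline like this would be fitted separately on each muscle subset.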

In [42]:
df_wide

test,muscle,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15
0,0,,,0.127825,0.124255,0.146927,0.041583,,,0.162206,0.017711,0.014369,,,,,0.036916
1,2,,,0.179864,0.294909,0.295846,0.107769,,,0.199097,0.202150,0.022668,,,,,0.071460
2,3,,,0.078753,0.244578,0.272709,0.010146,,,0.106490,0.010600,0.007517,,,,,0.238814
3,4,,,0.150353,0.104654,0.115272,0.057845,,,0.065429,0.052930,0.009894,,,,,0.079885
4,5,,,0.172669,0.124655,0.133114,0.196436,,,0.101393,0.187997,0.051396,,,,,0.050041
5,7,,,0.073350,0.127731,0.126761,0.032419,,,0.008587,0.045480,0.020818,,,,,0.166193
6,9,,,0.204902,0.250500,0.308364,0.186857,,,0.116249,0.055184,0.021912,,,,,0.063211
7,10,,,0.025012,0.016006,0.012417,0.002298,,,0.009578,0.357556,0.001553,,,,,0.030852
8,11,,,0.023162,0.027063,0.034356,0.019402,,,0.018158,0.018384,0.015013,,,,,0.020348
9,0,0.000208,,0.000162,0.000248,0.000124,0.000040,,0.000033,0.000092,0.000020,0.000019,,,0.000093,,0.000034


In [31]:
MUSCLE_SUBSET = 4
X = np.array(df_wide[df_wide['muscle'] == MUSCLE_SUBSET].drop('muscle', axis=1))
y = np.nanmax(X, axis=1)

## 6.1. Shuffle & split data

In [34]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=RANDOM_SEED)

In [39]:
xi = pd.DataFrame(X_train)

len(xi) - xi.count()

0     10
1     51
2      0
3      0
4      0
5      0
6     39
7     40
8     44
9     44
10    45
11    82
12    82
13    59
14    49
15    97
dtype: int64