<div style="text-align: center">
<img src="https://raw.githubusercontent.com/ramp-kits/meg/master/figs/meg_logo.png" width="250px" />
</div>

# RAMP: predicting source of MEG signal
<br>
<div style="text-align: center">
    <em>
        <i>Authors: Maria Teleńczuk, Lucy Liu, Hicham Janati, Guillaume Lemaitre, Alexandre Gramfort</i><br>
        <a href="http://www.datascience-paris-saclay.fr">Paris Saclay Center for Data Science</a> (Inria)
    </em>
</div>

# Table of contents
1. [Introduction](#Introduction)
    - [Origin of electrical signal in the brain](#EEG)
    - [Origin of magnetic signal in the brain](#MEG)
    - [MEG in practice and problem description](#MEG_in_practice)
2. [Data exploration](#Data_exploration)
    - [Import Python libraries](#Import)
    - [Download the data](#Download_data)
    - [MEG recordings](#X) 
    - [K-nearest neighbors algorithm](#KNN)
    - [Use of lead fields](#lars)
    - [Lasso Lars algorithm](#lassolars)
3. [Submission](#Submission) 

# Introduction <a class="anchor" id="Introduction"></a>

Brain activity produces electrical currents, which in turn generate magnetic fields. Both the electric and the magnetic signals can be recorded non-invasively at the subject's head, using electroencephalography (EEG) and magnetoencephalography (MEG) respectively.
Here, we will focus on the magnetic signals recorded by MEG.

## Origin of electrical signal in the brain  <a class="anchor" id="EEG"></a>

Communication between brain cells (neurons) happens at locations called synapses. There, the signal passed from one neuron to another causes electrical current to flow into and out of the cell. A current flowing into one part of the cell (forming a current sink) flows within the cell and must leave it elsewhere (forming a current source). The sink and source pair forms a current dipole. The dipole generated by a single cell can therefore be understood as a vector whose direction and length change constantly. Below you can see a visualisation of positive current (excitatory synaptic input) arriving at different locations on the neuron. Please note that in real conditions many such inputs happen simultaneously.

<style>
     .equalDivide tr td { width:25%; }
</style>

<table class="equalDivide" cellpadding="0" cellspacing="0" width="100%" border="0">
    <tr>
        <td width="50%">
            <img src="https://raw.githubusercontent.com/ramp-kits/meg/master/figs/4neurons.gif" width="400px" align="left">
        </td>
    <td width="50%">
        <b>A schema of 4 neurons with the same morphology.</b> All four are stimulated at the same time, each by a synaptic input at a different location (<span style="color: #FF0000">red dot</span>). This is where the current enters the cell. Most of it then flows through the cell and leaves at the cell body (<span style="color: #4B0082">purple dot</span>), forming a dipole. The direction and relative size of each cell's dipole are represented by a <span style="color: #4B0082">purple line</span>. Note how they differ. The colorful background shows the changes in the extracellular field.
    </td>
    </tr>
</table>

Neurons are constantly active, constantly receiving and propagating electrical signals, but a single neuron is tiny, so the potential it generates is too small to be recorded from the scalp. However, there are billions of neurons in the human brain, which together form brain structures. Many types of neurons are aligned with one another and correlated in their activity. As you can imagine, in this environment some of the single-cell dipoles cancel out while others add up to form a much stronger signal. This group signal, along with a lot of noise, can be recorded by EEG.
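
The cancellation argument can be illustrated numerically. The sketch below is a toy model only, not a simulation of real neurons: it assumes every cell contributes a unit-length dipole and compares the net dipole of perfectly aligned cells with that of randomly oriented ones. The cell count and the random seed are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)
n_cells = 10_000  # arbitrary number of toy cells

# Aligned cells: every unit dipole points the same way (here along +z).
aligned = np.tile([0.0, 0.0, 1.0], (n_cells, 1))

# Randomly oriented cells: unit dipoles with directions drawn uniformly
# on the sphere (normalised Gaussian vectors).
random_dirs = rng.normal(size=(n_cells, 3))
random_dirs /= np.linalg.norm(random_dirs, axis=1, keepdims=True)

# Length of the summed (net) dipole in each population.
net_aligned = np.linalg.norm(aligned.sum(axis=0))
net_random = np.linalg.norm(random_dirs.sum(axis=0))

print(f"net dipole, aligned cells: {net_aligned:.0f}")
print(f"net dipole, random cells:  {net_random:.0f}")
```

In this toy model the aligned population sums to a dipole of length `n_cells`, while the random population mostly cancels and grows only roughly like the square root of the number of cells.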

## Origin of magnetic signal in the brain  <a class="anchor" id="MEG"></a>

So far we have only spoken of electric currents, but you might remember from your physics class that electric currents are always accompanied by a magnetic field. If you consider the electric currents and the dipoles discussed above, you can imagine magnetic fields forming closed loops around them. Note that due to the alignment of cell bodies in the brain, the magnetic fields are generated by the intracellular current (i.e. current flowing inside the cell) rather than by the transmembrane currents (flowing into or out of the cell) that are responsible for EEG.

<style>
     .equalDivide tr td { width:25%; }
</style>

<table class="equalDivide" cellpadding="0" cellspacing="0" width="100%" border="0">
    <tr>
    <td width="30%">
         <img src="https://raw.githubusercontent.com/ramp-kits/meg/master/figs/magnetic_schema_small.png" width="250px" align="left" /> 
    </td>
    <td width="30%">
        The current flow is represented by the purple line (left), and the red lines show the direction of the magnetic field. Those magnetic fields are then recorded by the MEG sensors (grey on the right).
    </td>
    <td width="50%">
        <img src="https://raw.githubusercontent.com/ramp-kits/meg/master/figs/meg_scetch.png" width="250px" align="left" /> 
    </td>
    </tr>
</table>

Please note that this is only a simplified explanation.

## MEG in practice and problem description<a class="anchor" id="MEG_in_practice"></a>

A lot of things are happening in the human brain at every moment, so how is it possible to know what to look for? The subject participating in a cognitive neuroscience experiment is usually asked to perform the same task multiple times (watching something on a screen, remembering something, pressing buttons etc.). The signals recorded during all the repetitions of the experiment are then averaged, which removes much of the noise and yields clearer data related to that task. These averaged signals are the so-called [evoked responses](https://en.wikipedia.org/wiki/Evoked_potential). However, there are other challenges facing the data analyst: although MEG has many sensors measuring the magnetic field around the scalp, it is difficult to judge where exactly the signal is coming from.
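
To make the effect of averaging concrete, here is a minimal sketch on synthetic data. It assumes an identical response in every repetition plus independent Gaussian noise; the response shape, noise level and number of trials are all hypothetical values chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n_trials, n_times = 100, 200
t = np.linspace(0.0, 1.0, n_times)

# Hypothetical evoked response: a smooth bump, identical in every trial.
evoked = np.exp(-((t - 0.3) ** 2) / (2 * 0.05 ** 2))

# Each trial is the same response buried in independent noise.
noise_std = 2.0
trials = evoked + rng.normal(scale=noise_std, size=(n_trials, n_times))

# Averaging across trials keeps the response and shrinks the noise.
average = trials.mean(axis=0)

err_single = np.std(trials[0] - evoked)
err_average = np.std(average - evoked)
print(f"noise in one trial:      {err_single:.2f}")
print(f"noise after averaging:   {err_average:.2f}")
```

Under these assumptions the residual noise falls roughly with the square root of the number of trials, so 100 repetitions give about a tenfold reduction.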

This is the question we ask you in this challenge: given some simulated MEG data you should predict the brain region(s) (sources) which are at the origin of the signals. Let's explore the data.

# Data exploration <a class="anchor" id="Data_exploration"></a>

## Import <a class="anchor" id="Import"></a>

### Prerequisites

- Python >= 3.7
- [numpy](https://pypi.org/project/numpy/)
- [scipy](https://pypi.org/project/scipy/)
- [pandas](https://pypi.org/project/pandas/)
- [scikit-learn](https://pypi.org/project/scikit-learn/)
- [matplotlib](https://pypi.org/project/matplotlib/)
- [jupyter](https://pypi.org/project/jupyter/)
- [ramp-workflow](https://pypi.org/project/ramp-workflow/)

The following cell installs scikit-learn and ramp-workflow if they are missing.

In [1]:
import sys
!{sys.executable} -m pip install scikit-learn

# Install ramp-workflow from the master branch on GitHub.
!{sys.executable} -m pip install https://api.github.com/repos/paris-saclay-cds/ramp-workflow/zipball/master

Collecting https://api.github.com/repos/paris-saclay-cds/ramp-workflow/zipball/master
  Using cached https://api.github.com/repos/paris-saclay-cds/ramp-workflow/zipball/master
Building wheels for collected packages: ramp-workflow
  Building wheel for ramp-workflow (setup.py) ... done
  Created wheel for ramp-workflow: filename=ramp_workflow-0.4.0.dev0-py3-none-any.whl size=124968 sha256=582f7f54babaa83bbac64be02a62fbe810edd6247ed791f1da85c539fc9f8d89
  Stored in directory: /tmp/pip-ephem-wheel-cache-ocph721h/wheels/b7/18/af/cba50ad54ec8862831140e9f4fa3d006795fba0adab31d2853
Successfully built ramp-workflow


Required dependencies and downloads
Installation of libraries and ramp-workflow

To get this notebook running and test your models locally using the `ramp_test_submission`, we recommend that you use the Python distribution from Anaconda or Miniconda.

In [2]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from scipy import sparse
import os

## Download the data (optional) <a class="anchor" id="Download_data"></a>

If the data has not yet been downloaded locally, uncomment the following cell and run it.

In [3]:
# !python download_data.py

You should now be able to find the `test` and `train` folders in the `data/` directory.

## MEG recordings <a class="anchor" id="X"></a> 

In [4]:
X = pd.read_csv("data/train/X.csv.gz")
X.columns

Index(['e1', 'e2', 'e3', 'e4', 'e5', 'e6', 'e7', 'e8', 'e9', 'e10',
       ...
       'e196', 'e197', 'e198', 'e199', 'e200', 'e201', 'e202', 'e203', 'e204',
       'subject'],
      dtype='object', length=205)

The data has columns named e1, e2, ..., e204 and a column named 'subject'. Each column marked with 'e' is a recording from one of the MEG sensors; there are 204 sensors in these MEG recordings. 'subject' is the id of the subject on whom the recording was performed. Let's see which subjects we have:

In [5]:
np.unique(X['subject'])

array(['subject_1', 'subject_2', 'subject_3', 'subject_4', 'subject_5'],
      dtype=object)

In [6]:
X.shape

(2500, 205)

Let's now look at the heat maps of the first three samples on the head:

<div style="text-align: center">
    <img src="https://raw.githubusercontent.com/ramp-kits/meg/master/figs/topomaps.png" width="500px" align="left" /> 
</div>

(optional) If you wish to see and run the code which plots the above heat maps, you will additionally have to install the [MNE](https://pypi.org/project/mne/) library, uncomment the following line, and run the code:

In [7]:
# %load plot_topomap.py

The above heat maps are taken from the first three samples of the `train` dataset. The darker the color, the higher the recorded value. Can you already guess how many sources generated these signals?

Before we look at the ground truth, let's discuss what we actually mean by 'source'. The brain is a continuous mass, so we could consider millions of points as potential sources. However, for the sake of this study we subdivided the brain of each subject into 450 regions (225 subregions per hemisphere, excluding the <i>corpus callosum</i> located between the two hemispheres). Each subregion is part of a larger region which has an anatomical meaning (represented by different colors below):

<div style="text-align: center">
<img src="https://raw.githubusercontent.com/ramp-kits/meg/master/figs/aparc_brain.png" width="500px" align="left" />  
</div>

Your task is to predict the subregion(s) in which the MEG signal originates.
So let's look at the target:

In [8]:
y = sparse.load_npz('data/train/target.npz').toarray()
y

array([[0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.]])

What you can see is an array mostly filled with 0s. Each row is a sample and each column represents a different region; 1s mark the sources. Can you guess the shape of the target?

In [9]:
print(f'There are {y.shape[0]} samples and {y.shape[1]} brain regions')

There are 2500 samples and 450 brain regions


Now let's see whether you guessed the number of sources in each of the three heat maps above correctly:

In [10]:
n_sources = np.sum(y, axis=1)
print(f'Number of sources in first three samples: {n_sources[0:3]}')

Number of sources in first three samples: [2. 3. 3.]


Because we simulated this data (using the [MNE](https://mne.tools/stable/index.html) Python library), we were free to limit the number of sources. Let's check how many sources the other samples contain:

In [11]:
n_sources_per_sample = np.sum(y, axis=1)
n_sources = np.unique(n_sources_per_sample)
print(f'Possible number of sources: {n_sources}')

Possible number of sources: [1. 2. 3.]
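
Since the target is almost entirely zeros, it is stored as a sparse matrix and only densified on loading. The toy example below sketches that layout with made-up values (4 samples, 6 regions); it is not the challenge data, and the variable names are chosen for illustration only.

```python
import numpy as np
from scipy import sparse

# Toy target: 4 samples, 6 regions, 1 to 3 active regions per sample.
rows = np.array([0, 1, 1, 2, 2, 2, 3])
cols = np.array([4, 0, 5, 1, 2, 3, 2])
data = np.ones(len(rows))
y_sparse = sparse.csr_matrix((data, (rows, cols)), shape=(4, 6))

# Densify, as done with .toarray() on the real target.
y_dense = y_sparse.toarray()

# Row sums give the number of sources per sample.
n_sources_per_sample = y_dense.sum(axis=1)

# Fraction of entries that are actually stored (non-zero).
density = y_sparse.nnz / (y_sparse.shape[0] * y_sparse.shape[1])

print(y_dense)
print(n_sources_per_sample)
print(f"fraction of non-zero entries: {density:.2f}")
```

With at most 3 ones among 450 columns, the real target is far sparser than this toy matrix, which is why the sparse format pays off.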


## k-nearest neighbors algorithm <a class="anchor" id="KNN"></a> 

We are now going to make some predictions. We will start with an algorithm called k-nearest neighbors. You can read more about it on [Wikipedia](https://en.wikipedia.org/wiki/K-nearest_neighbors_algorithm) or in the [scikit-learn documentation](https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html). For loading the data we will now use functions stored in `problem.py`. The same functions will be used by RAMP when scoring your solution:

In [12]:
# loading the data
from problem import get_train_data, get_test_data

X_train, y_train = get_train_data()
X_test, y_test = get_test_data()

# print info
print(f"There are {len(X_train)} measurements"
      f" recorded from {len(np.unique(X_train['subject']))} subjects"
      " in the train dataset, and\n"
      f"{len(X_test)} measurements"
      f" recorded from {len(np.unique(X_test['subject']))} subjects"
      " in the test dataset.")

There are 2500 measurements recorded from 5 subjects in the train dataset, and
2500 measurements recorded from 5 subjects in the test dataset.


First, import all the libraries which we will need to write this estimator:

In [13]:
from sklearn.compose import make_column_transformer
from sklearn.multioutput import MultiOutputClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline

Let's now apply the `KNeighborsClassifier`:

In [14]:
# K-nearest neighbors
clf = KNeighborsClassifier(n_neighbors=3)

Our target is multi-output: each sample can have more than one active source. `KNeighborsClassifier` handles such multilabel targets natively, which is why the pipeline below uses `clf` directly. An estimator without that support can be wrapped in a [MultiOutputClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.multioutput.MultiOutputClassifier.html), which fits one classifier per output column:

`kneighbours = MultiOutputClassifier(clf, n_jobs=1)`

Every solution you write for RAMP has to be passed as a scikit-learn [Pipeline](https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html). However, what we have done so far is not yet sufficient, because the data `X` consists not only of the sensor measurements (`dtype`: `float64`) but also of the subject id, which is of `dtype`: `object`:

In [15]:
X_train.dtypes

e1         float64
e2         float64
e3         float64
e4         float64
e5         float64
            ...   
e201       float64
e202       float64
e203       float64
e204       float64
subject     object
Length: 205, dtype: object

`KNeighborsClassifier` won't accept the data in this form. Here, we decide to simply drop the whole column and not use the information about the subjects. We can do this using [ColumnTransformer](https://scikit-learn.org/stable/modules/generated/sklearn.compose.ColumnTransformer.html) or the function [make_column_transformer](https://scikit-learn.org/stable/modules/generated/sklearn.compose.make_column_transformer.html):

In [16]:
preprocessor = make_column_transformer(("drop", 'subject'),
                                       remainder='passthrough')

Now, we will apply the Scikit-learn pipeline:

In [17]:
pipeline = Pipeline([
        ('transformer', preprocessor),
        ('classifier', clf)
    ])

The code presented above is implemented as a sample solution in: `submissions/starting_kit/estimator.py`. If you wish to load it here, uncomment the line below:

In [18]:
# %load submissions/starting_kit/estimator.py

Let's fit this pipeline to the data and make a prediction. We will then score it with the [Hamming loss](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.hamming_loss.html):

In [19]:
pipeline.fit(X_train, y_train)
y_pred_knn = pipeline.predict(X_test)

In [20]:
from sklearn.metrics import hamming_loss

score = hamming_loss(y_test, y_pred_knn)
print(f"The hamming loss for KNN is {score}")

The hamming loss for KNN is 0.004828444444444444


In [21]:
n_sources_per_sample = np.sum(y_pred_knn, axis = 1)
n_sources = np.unique(n_sources_per_sample)
print(f'Predicted number of sources: {n_sources}')

Predicted number of sources: [0. 1. 2. 3.]
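
As a self-contained illustration of the steps in this section, the sketch below runs the same kind of pipeline on small random data, once with `KNeighborsClassifier` directly and once wrapped in `MultiOutputClassifier`. The toy frame, its sizes and the helper `make_pipeline_with` are hypothetical and exist only for this example; the random targets carry no signal, so only the shapes are of interest.

```python
import numpy as np
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.multioutput import MultiOutputClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline

rng = np.random.default_rng(0)
n_samples, n_sensors, n_regions = 60, 5, 4

# Toy stand-ins for X (sensor columns plus a 'subject' column) and y.
X_toy = pd.DataFrame(rng.normal(size=(n_samples, n_sensors)),
                     columns=[f"e{i + 1}" for i in range(n_sensors)])
X_toy["subject"] = "subject_1"
y_toy = rng.integers(0, 2, size=(n_samples, n_regions))


def make_pipeline_with(classifier):
    """Drop the 'subject' column, then apply the given classifier."""
    drop_subject = make_column_transformer(("drop", ["subject"]),
                                           remainder="passthrough")
    return Pipeline([("transformer", drop_subject),
                     ("classifier", classifier)])


knn = KNeighborsClassifier(n_neighbors=3)
direct = make_pipeline_with(knn).fit(X_toy, y_toy)
wrapped = make_pipeline_with(MultiOutputClassifier(knn)).fit(X_toy, y_toy)

# Both variants return one binary indicator row per sample.
pred_direct = direct.predict(X_toy)
pred_wrapped = wrapped.predict(X_toy)
print(pred_direct.shape, pred_wrapped.shape)
```

Both variants accept the two-dimensional target and return a prediction with one column per region.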


## Lead fields <a class="anchor" id="lars"></a> 

Looking at your data files, you probably noticed that, besides `X.csv` and `target.npz` in the `data/train` and `data/test` folders, there are also files stored directly in the `data/` directory. Their names begin with an `id` and end with "_lead_field.npz". The `id` corresponds to the `id` of the `subject` in your `X.csv` data file. Let's check that the number of subjects in the data matches the number of lead field files provided:

In [22]:
import glob

data_dir = 'data/'

lead_field_files = os.path.join(data_dir, '*lead_field.npz')
lead_field_files = sorted(glob.glob(lead_field_files))
n_subj_train = np.unique(X_train['subject'])
n_subj_test = np.unique(X_test['subject'])
len_unique = (len(n_subj_train) +
              len(n_subj_test) -
              len(np.intersect1d(n_subj_train, n_subj_test)))

print(f"{len(lead_field_files)} lead field files and"
      f" {len_unique} subjects")

10 lead field files and 10 subjects


Each brain differs in shape and structure, so the signal propagates from the sources to the sensors differently in each subject. You can think of a lead field as a matrix of weights between the sources and the sensors. Let's look at the shape of one of those files:

In [23]:
L = np.load(lead_field_files[0])
L['lead_field'].shape

(204, 4690)

This is probably not the shape you expected for a lead field. 204 is the number of sensors, but why is the number of regions not 450? As we mentioned previously, each region we consider has a certain extent, but a source can be any single point within that region, and `lead_field` stores one column for each of those points. Furthermore, the number of points differs between subjects. Let's look at the lead field of another subject:

In [24]:
L = np.load(lead_field_files[1])
L['lead_field'].shape

(204, 4688)

So how do we match which point belongs to which region? In your `lead_field` file you will find another entry called `parcel_indices`:

In [25]:
parcel_indices = L['parcel_indices']
print(f"There are {len(parcel_indices)} consisting of {len(np.unique(parcel_indices))} numbers")

There are 4688 consisting of 450 numbers


Each number in `parcel_indices` therefore tells us which region of the `target` a given point of the `lead_field` belongs to. How can we use this information for the predictions?
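
One way to use the mapping is to estimate an activity value for every point and then keep, for each region, the strongest point inside it. The sketch below shows that aggregation step on made-up numbers; `toy_parcel_indices` and `point_activity` are hypothetical and much smaller than the real arrays.

```python
import numpy as np
import pandas as pd

# Hypothetical values: 7 source points spread over 3 regions (labels 1..3).
toy_parcel_indices = np.array([1, 1, 2, 2, 2, 3, 3])
point_activity = np.array([0.0, 0.4, 0.0, 0.0, 0.0, 0.9, 0.1])

# One entry per region, holding the strongest point inside that region.
region_activity = (pd.Series(point_activity)
                   .groupby(toy_parcel_indices)
                   .max())

# A region is predicted as a source when any of its points is active.
region_is_source = (region_activity > 0).astype(int)

print(region_activity.to_numpy())
print(region_is_source.to_numpy())
```

Here regions 1 and 3 contain at least one active point and are flagged as sources, while region 2 is not.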

## Lasso Lars algorithm <a class="anchor" id="lassolars"></a> 

We will now construct a slightly more complicated estimator which uses the lead fields. First we load the lead fields that belong to our data. Note that we scale all the lead fields by 1e8, to avoid passing very small numbers to the estimator:

In [26]:
import glob

data_dir = 'data/'

# find all the files ending with '_lead_field' in the data directory
lead_field_files = os.path.join(data_dir, '*lead_field.npz')
lead_field_files = sorted(glob.glob(lead_field_files))

parcel_indices_leadfield, L = [], []
subj_dict = {}
for idx, lead_file in enumerate(lead_field_files):
    lead_matrix = np.load(lead_file)

    lead_file = os.path.basename(lead_file)
    subj_dict['subject_' + lead_file.split('_')[1]] = idx

    parcel_indices_leadfield.append(lead_matrix['parcel_indices'])

    # scale L to avoid tiny numbers
    L.append(1e8 * lead_matrix['lead_field'])
    assert parcel_indices_leadfield[idx].shape[0] == L[idx].shape[1]

assert len(parcel_indices_leadfield) == len(L) == idx + 1
assert len(subj_dict) >= 1  # at least a single subject

print(f'Loaded {len(L)} lead_fields and {len(parcel_indices_leadfield)} parcel_indices')
print(f'Created dictionary of subject_ids and matching indices: {subj_dict}')

Loaded 10 lead_fields and 10 parcel_indices
Created dictionary of subject_ids and matching indices: {'subject_10': 0, 'subject_1': 1, 'subject_2': 2, 'subject_3': 3, 'subject_4': 4, 'subject_5': 5, 'subject_6': 6, 'subject_7': 7, 'subject_8': 8, 'subject_9': 9}


We created the `subj_dict` to keep track of which entry of `L` and which entry of `parcel_indices_leadfield` correspond to which subject.

Now we will use `subj_dict` to map the subjects in the X datasets:

In [27]:
X_train_mapped = X_train.copy()
X_train_mapped['subject_id'] = X_train['subject'].map(subj_dict)
# scale to avoid tiny numbers
X_train_mapped.iloc[:, :-2] *= 1e12

X_test_mapped = X_test.copy()
X_test_mapped['subject_id'] = X_test_mapped['subject'].map(subj_dict)
# scale to avoid tiny numbers
X_test_mapped.iloc[:, :-2] *= 1e12
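
The positional slice `iloc[:, :-2]` relies on `subject` and `subject_id` being the last two columns. As a hedged alternative, the sketch below selects the sensor columns by name instead; it assumes sensor columns are named `e1`, `e2`, ... as in `X`, and uses a tiny made-up frame and mapping (`X_toy`, `toy_subj_dict`).

```python
import pandas as pd

# Toy frame with the same layout as X: sensor columns, then 'subject'.
X_toy = pd.DataFrame({"e1": [1e-12, 2e-12],
                      "e2": [3e-12, 4e-12],
                      "subject": ["subject_2", "subject_10"]})
toy_subj_dict = {"subject_10": 0, "subject_2": 1}  # hypothetical mapping

X_toy_mapped = X_toy.copy()
X_toy_mapped["subject_id"] = X_toy_mapped["subject"].map(toy_subj_dict)

# Pick the sensor columns by name ('e' followed by digits), not by position.
sensor_cols = X_toy_mapped.filter(regex=r"^e\d+$").columns

# Scale only the sensor columns to avoid tiny numbers.
X_toy_mapped[sensor_cols] = X_toy_mapped[sensor_cols] * 1e12

print(X_toy_mapped)
```

Selecting by name keeps the scaling correct even if further columns are appended to the frame later.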

Now we will write a class `SparseRegressor` that accepts an estimator (i.e. a model) and uses it, together with the lead fields, to make its decision:

In [28]:
from sklearn.base import BaseEstimator, ClassifierMixin
from sklearn.base import TransformerMixin

def _get_coef(est):
    if hasattr(est, 'steps'):
        return est.steps[-1][1].coef_
    return est.coef_


class SparseRegressor(BaseEstimator, ClassifierMixin, TransformerMixin):
    def __init__(self, lead_field, parcel_indices, model, n_jobs=1):
        self.parcel_indices = parcel_indices
        self.lead_field = lead_field
        self.model = model
        self.n_jobs = n_jobs

    def fit(self, X, y):
        return self

    def predict(self, X):
        return (self.decision_function(X) > 0).astype(int)

    def _run_model(self, model, L, X, fraction_alpha=0.2):
        norms = np.linalg.norm(L, axis=0)
        L = L / norms[None, :]

        est_coefs = np.empty((X.shape[0], L.shape[1]))
        for idx, idx_used in enumerate(X.index.values):
            x = X.iloc[idx].values
            model.fit(L, x)
            est_coef = np.abs(_get_coef(model))
            est_coef /= norms
            est_coefs[idx] = est_coef

        return est_coefs.T

    def decision_function(self, X):
        X = X.reset_index(drop=True)

        n_parcels = max(max(s) for s in self.parcel_indices)
        betas = np.empty((len(X), n_parcels))
        for subj_idx in np.unique(X['subject_id']):
            l_used = self.lead_field[subj_idx]

            X_used = X[X['subject_id'] == subj_idx]
            X_used = X_used.iloc[:, :-2]

            est_coef = self._run_model(self.model, l_used, X_used)

            beta = pd.DataFrame(
                       np.abs(est_coef)
                   ).groupby(
                   self.parcel_indices[subj_idx]).max().transpose()
            betas[X['subject_id'] == subj_idx] = np.array(beta)
        return betas

In [29]:
from sklearn import linear_model

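# `normalize` is omitted: it has no effect when fit_intercept=False and
# has been removed from recent scikit-learn releases.
model_lars = linear_model.LassoLars(alpha=1.0, max_iter=3,
                                    fit_intercept=False)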

lasso_lars = SparseRegressor(L, parcel_indices_leadfield, model_lars)

In [32]:
lasso_lars.fit(X_train_mapped, y_train)
y_pred_lassolars = lasso_lars.predict(X_test_mapped)

In [33]:
score = hamming_loss(y_test, y_pred_lassolars)
print(f'Hamming loss for the Lasso Lars using lead fields is {score}')

Hamming loss for the Lasso Lars using lead fields is 0.0044


The score is very similar to the one we got for KNN. Let's see whether the number of sources is predicted more or less correctly:

In [34]:
n_sources_by_sample = np.sum(y_pred_lassolars, axis = 1)
n_sources = np.unique(n_sources_by_sample)
print(f'Predicted possible number of sources: {n_sources}')

Predicted possible number of sources: [0]


So in fact, LassoLars with these settings predicted no sources at all.
The predictions differ from those of KNN, yet the Hamming loss is almost the same. That is because the Hamming loss only measures the fraction of wrongly predicted labels, and almost all of the 450 labels in each sample are 0.
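
For an all-zero prediction, the Hamming loss equals the average number of true sources divided by the number of regions; the value 0.0044 above corresponds to 0.0044 × 450 ≈ 2 sources per sample. The following sketch checks this relation on a synthetic target with 1 to 3 sources per sample (the sample count and seed are arbitrary).

```python
import numpy as np
from sklearn.metrics import hamming_loss

rng = np.random.default_rng(0)
n_samples, n_regions = 200, 450

# Toy target: each sample has 1, 2 or 3 active regions out of 450.
y_true = np.zeros((n_samples, n_regions), dtype=int)
for row in y_true:
    n_active = rng.integers(1, 4)  # upper bound is exclusive
    row[rng.choice(n_regions, size=n_active, replace=False)] = 1

# A "model" that never predicts any source.
y_all_zero = np.zeros_like(y_true)

loss = hamming_loss(y_true, y_all_zero)
mean_sources = y_true.sum(axis=1).mean()
print(f"Hamming loss of the all-zero prediction: {loss:.4f}")
print(f"mean number of sources / 450:            {mean_sources / n_regions:.4f}")
```

The two printed numbers coincide, which shows why a very low Hamming loss says little about whether any source was actually found.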

But before we try to change the score, let's look at the LassoLars. We previously set `alpha` to 1.0. We will now try setting it in relation to the data:

In [35]:
from sklearn.multioutput import MultiOutputRegressor


def _get_coef(est):
    if hasattr(est, 'steps'):
        return est.steps[-1][1].coef_
    return est.coef_


class SparseRegressorAlpha(BaseEstimator, ClassifierMixin, TransformerMixin):
    def __init__(self, lead_field, parcel_indices, model, n_jobs=1):
        self.lead_field = lead_field
        self.parcel_indices = parcel_indices
        self.model = model
        self.n_jobs = n_jobs

    def fit(self, X, y):
        return self

    def predict(self, X):
        return (self.decision_function(X) > 0).astype(int)

    def decision_function(self, X):
        model = MultiOutputRegressor(self.model, n_jobs=self.n_jobs)
        X = X.reset_index(drop=True)

        betas = np.empty((len(X), 0)).tolist()
        for subj_idx in np.unique(X['subject_id']):
            l_used = self.lead_field[subj_idx]

            X_used = X[X['subject_id'] == subj_idx]
            X_used = X_used.iloc[:, :-2]

            norms = l_used.std(axis=0)
            l_used = l_used / norms[None, :]

            alpha_max = abs(l_used.T.dot(X_used.T)).max() / len(l_used)
            alpha = 0.2 * alpha_max
            model.estimator.alpha = alpha
            # one LassoLars fit per sample (each column of X_used.T)
            model.fit(l_used, X_used.T)

            for idx, idx_used in enumerate(X_used.index.values):
                est_coef = np.abs(_get_coef(model.estimators_[idx]))
                est_coef /= norms
                beta = pd.DataFrame(
                        np.abs(est_coef)
                        ).groupby(
                        self.parcel_indices[subj_idx]).max().transpose()
                betas[idx_used] = np.array(beta).ravel()
        betas = np.array(betas)
        return betas

In [36]:
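# `normalize` is omitted: it has no effect when fit_intercept=False and
# has been removed from recent scikit-learn releases.
model_lars_alpha = linear_model.LassoLars(max_iter=3,
                                          fit_intercept=False)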

lasso_lars_alpha = SparseRegressorAlpha(L, parcel_indices_leadfield,
                                        model_lars_alpha)

In [37]:
lasso_lars_alpha.fit(X_train_mapped, y_train)
y_pred_alpha = lasso_lars_alpha.predict(X_test_mapped)

In [38]:
score = hamming_loss(y_test, y_pred_alpha)
print(f'Hamming loss for the Lasso Lars using lead fields is {score}')

Hamming loss for the Lasso Lars using lead fields is 0.004428444444444444


In [39]:
n_sources_by_sample = np.sum(y_pred_alpha, axis = 1)
n_sources = np.unique(n_sources_by_sample)
print(f'Possible number of sources: {n_sources}')

Possible number of sources: [0 1 2 3]
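
The role of `alpha_max` can be checked on a toy problem. With scikit-learn's scaling of the lasso objective, `max|L.T @ x| / n_samples` is the smallest `alpha` for which every coefficient is zero, so a fixed `alpha` above that value yields no sources, whereas a fraction of it lets some points enter the model. The lead field `L_toy`, the source vector `s_true` and all sizes below are hypothetical.

```python
import numpy as np
from sklearn.linear_model import LassoLars

rng = np.random.default_rng(0)
n_sensors, n_points = 30, 80

# Hypothetical lead field and a source vector with two active points.
L_toy = rng.normal(size=(n_sensors, n_points))
s_true = np.zeros(n_points)
s_true[[5, 42]] = [1.0, -2.0]
x = L_toy @ s_true  # noiseless sensor measurement: weights times sources

# Smallest alpha for which the lasso solution is entirely zero.
alpha_max = np.abs(L_toy.T @ x).max() / n_sensors

n_active_by_fraction = {}
for fraction in (1.01, 0.2):
    model = LassoLars(alpha=fraction * alpha_max, fit_intercept=False)
    model.fit(L_toy, x)
    n_active_by_fraction[fraction] = np.count_nonzero(model.coef_)
    print(f"alpha = {fraction:.2f} * alpha_max -> "
          f"{n_active_by_fraction[fraction]} non-zero coefficients")
```

Setting `alpha` relative to `alpha_max`, as `SparseRegressorAlpha` does with the factor 0.2, adapts the amount of regularisation to the scale of each subject's data.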


To use `SparseRegressorAlpha` in `RAMP`, you need to adapt it so that it returns a `scikit-learn` type of pipeline. This version is saved in `submissions/lasso_lars/estimator.py`. You can load the code here by uncommenting the line below:

In [43]:
# %load 'submissions/lasso_lars/estimator.py'

Now our estimator predicts a more feasible number of sources for each sample, but the Hamming loss remains almost the same. Let's compare all three results using the Jaccard error (meaning 1-[jaccard score](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.jaccard_score.html)):

In [41]:
from sklearn.metrics import jaccard_score
score_knn = 1 - jaccard_score(y_test, y_pred_knn, average='samples')
score_lassolars = 1 - jaccard_score(y_test, y_pred_lassolars, average='samples')
score_alpha = 1 - jaccard_score(y_test, y_pred_alpha, average='samples')

print(f'The Jaccard error for KNN model is {score_knn},')
print(f'for the model which predicts only 0s is {score_lassolars},')
print(f'for SparseRegressor with LassoLars as a model and updating alpha is {score_alpha}')

The Jaccard error for KNN model is 0.9219533333333333,
for the model which predicts only 0s is 1.0,
for SparseRegressor with LassoLars as a model and updating alpha is 0.6583266666666666


With the Jaccard error you can indeed see that the last model gives the best results.
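
To see what the sample-averaged Jaccard error measures, the sketch below computes it by hand on two made-up samples and compares the result with `jaccard_score`. For each sample, the score is the number of regions active in both the truth and the prediction, divided by the number of regions active in either.

```python
import numpy as np
from sklearn.metrics import jaccard_score

# Two made-up samples over four regions.
y_true = np.array([[1, 1, 0, 0],
                   [0, 1, 0, 0]])
y_pred = np.array([[1, 0, 0, 0],
                   [0, 0, 0, 0]])

# Per-sample Jaccard score: |intersection| / |union| of the active labels.
intersection = np.logical_and(y_true, y_pred).sum(axis=1)
union = np.logical_or(y_true, y_pred).sum(axis=1)
manual_error = 1 - np.mean(intersection / union)

sklearn_error = 1 - jaccard_score(y_true, y_pred, average="samples")
print(manual_error, sklearn_error)
```

The second sample shows why an all-zero prediction gets the worst possible error: its intersection with the true sources is empty, so its score is 0 regardless of how many zeros it gets right.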

# Submission <a class="anchor" id="Submission"></a> 

Once you have found a good model that you wish to test, place it in a directory with a name of your choice inside the `submissions` folder (you can already find there the two submissions named `starting_kit` and `lasso_lars` which we talked about above). The file placed in your submission directory should be called `estimator.py` and should return a `scikit-learn` type of pipeline.

You can then test your submission locally using the command:

`ramp-test --submission <your submission folder name>`
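
A submission file might look like the minimal sketch below. It assumes the usual ramp-workflow convention that `estimator.py` exposes a function named `get_estimator` (compare with `submissions/starting_kit/estimator.py` to confirm). The folder name `my_knn` and the choice of `n_neighbors=5` are hypothetical.

```python
# Hypothetical content of submissions/my_knn/estimator.py
from sklearn.compose import make_column_transformer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline


def get_estimator():
    """Return the scikit-learn pipeline to be fitted and scored."""
    # Drop the non-numeric 'subject' column, keep all sensor columns.
    preprocessor = make_column_transformer(("drop", ["subject"]),
                                           remainder="passthrough")
    # Arbitrary illustrative setting, not a tuned value.
    classifier = KNeighborsClassifier(n_neighbors=5)
    return Pipeline([("transformer", preprocessor),
                     ("classifier", classifier)])
```

With this layout, the local test would be run as `ramp-test --submission my_knn`.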




For more information on how to submit your code on [ramp.studio](https://ramp.studio/), refer to the [online documentation](https://paris-saclay-cds.github.io/ramp-docs/ramp-workflow/stable/using_kits.html).