# Capstone Project: Classifying clinically actionable genetic mutations

***

## Notebook 5: Kaggle Submission

This notebook converts the predictions for the test dataset into the format required for Kaggle submission at https://www.kaggle.com/c/msk-redefining-cancer-treatment/submit.

### Contents

- [Importing of Libraries](#Importing-of-Libraries)
- [Data Import](#Data-Import)
- [Data Formatting](#Data-Formatting)
- [Data Export](#Data-Export)
- [Kaggle Scores](#Kaggle-Scores)

## Importing of Libraries

In [24]:
# Standard libraries
import pandas as pd
import numpy as np

## Data Import

In [25]:
# Import the test predictions
test_pred = pd.read_csv("../assets/test_pred.csv")

In [26]:
test_pred.shape

(986, 2)

In [27]:
test_pred.head()

Unnamed: 0,id,class
0,1,6
1,2,4
2,3,2
3,4,9
4,5,7


In [28]:
test_pred['class'].value_counts()

2    252
7    200
4    130
1    111
9     78
8     65
5     61
6     51
3     38
Name: class, dtype: int64

## Data Formatting

We format the test predictions to match the submission format required by Kaggle.

We first create a dataframe with the required column headers, filled with zeros in every row.

In [29]:
# Create a list of the required column headers
columns = ["class{}".format(i) for i in range(1,10)]
columns.insert(0, 'ID')

In [30]:
# Create empty dataframe with the appropriate columns
kaggle = pd.DataFrame(np.zeros((test_pred.shape[0], 10), dtype=int), columns=columns)

In [31]:
kaggle.head()

Unnamed: 0,ID,class1,class2,class3,class4,class5,class6,class7,class8,class9
0,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0


We now populate the dataframe with the test predictions by setting the column of each row's predicted class to 1.

In [32]:
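# Set the column matching each row's predicted class to 1
for row_index in range(kaggle.shape[0]):
    # Look up the scalar prediction directly instead of indexing a one-element Series by position
    predicted_class = test_pred.loc[row_index, 'class']
    kaggle.loc[row_index, 'class' + str(predicted_class)] = 1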

In [33]:
# Verify that the respective class columns have been populated correctly
kaggle.head()

Unnamed: 0,ID,class1,class2,class3,class4,class5,class6,class7,class8,class9
0,0,0,0,0,0,0,1,0,0,0
1,0,0,0,0,1,0,0,0,0,0
2,0,0,1,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,1
4,0,0,0,0,0,0,0,1,0,0
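
The row-by-row loop above can also be expressed as a single vectorized step. The following is a minimal sketch, assuming the same `id` and `class` column layout as `test_pred`; the small `demo_pred` frame and the `kaggle_alt` name are illustrative stand-ins and are not part of the original notebook.

```python
import pandas as pd

# Hypothetical stand-in for test_pred, mirroring the first five rows shown earlier
demo_pred = pd.DataFrame({"id": [1, 2, 3, 4, 5], "class": [6, 4, 2, 9, 7]})

class_columns = ["class{}".format(i) for i in range(1, 10)]

# One-hot encode the predicted class, then reindex so that all nine class
# columns exist even when a class never appears in the predictions
one_hot = (
    pd.get_dummies(demo_pred["class"], prefix="class", prefix_sep="", dtype=int)
    .reindex(columns=class_columns, fill_value=0)
)

# Put the ID column first, followed by class1..class9
kaggle_alt = pd.concat([demo_pred["id"].rename("ID"), one_hot], axis=1)
print(kaggle_alt)
```

The `reindex` call matters here because `pd.get_dummies` only creates columns for classes that actually occur in the data.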


In [34]:
# Now populate the 'ID' column
kaggle['ID'] = test_pred['id']

In [35]:
# Show the final kaggle submission dataframe
kaggle.head()

Unnamed: 0,ID,class1,class2,class3,class4,class5,class6,class7,class8,class9
0,1,0,0,0,0,0,1,0,0,0
1,2,0,0,0,1,0,0,0,0,0
2,3,0,1,0,0,0,0,0,0,0
3,4,0,0,0,0,0,0,0,0,1
4,5,0,0,0,0,0,0,1,0,0


In [36]:
kaggle.shape

(986, 10)
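
Beyond inspecting `head()` and `shape`, a few programmatic checks can catch formatting mistakes before export. This is a sketch under stated assumptions: `check_submission` is a hypothetical helper, not part of the original notebook, and it assumes the `ID`, `class1` to `class9` layout built above with exactly one class flagged per row.

```python
import pandas as pd

def check_submission(frame, n_classes=9):
    """Return a list of problems found in a one-hot submission frame (hypothetical helper)."""
    class_columns = ["class{}".format(i) for i in range(1, n_classes + 1)]
    expected = ["ID"] + class_columns
    problems = []
    if list(frame.columns) != expected:
        # Without the expected columns the remaining checks cannot run
        problems.append("unexpected columns")
        return problems
    if frame["ID"].duplicated().any():
        problems.append("duplicate IDs")
    if not (frame[class_columns].sum(axis=1) == 1).all():
        problems.append("rows without exactly one predicted class")
    return problems

# Tiny illustrative frame; the third row is deliberately left without a prediction
columns = ["ID"] + ["class{}".format(i) for i in range(1, 10)]
demo = pd.DataFrame(0, index=range(3), columns=columns)
demo["ID"] = [1, 2, 3]
demo.loc[0, "class6"] = 1
demo.loc[1, "class4"] = 1

print(check_submission(demo))
```

In the notebook itself, the same helper could be called as `check_submission(kaggle)`, where an empty list means none of these checks failed.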

## Data Export

In [37]:
kaggle.to_csv("../assets/submission.csv", index=False)

## Kaggle Scores

The scores below are the Kaggle scores, reported as multi-class loss (lower is better). As shown by the **private scores**, the baseline model performs better, with a smaller multi-class loss than the alternative model.
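
To give a feel for how such a score behaves with 0/1 submissions like the one built in this notebook, here is a minimal sketch. It assumes the multi-class loss is the usual multi-class log loss with probabilities clipped to a small `eps`; the `multiclass_log_loss` function, the value of `eps`, and the toy arrays are illustrative assumptions, not Kaggle's actual scoring code.

```python
import numpy as np

def multiclass_log_loss(y_true, probs, eps=1e-15):
    """Mean negative log-probability assigned to the true class (illustrative sketch)."""
    probs = np.clip(np.asarray(probs, dtype=float), eps, 1 - eps)
    # Renormalize so each row still sums to 1 after clipping
    probs = probs / probs.sum(axis=1, keepdims=True)
    rows = np.arange(probs.shape[0])
    return -np.log(probs[rows, y_true]).mean()

# Two samples, three classes; the true classes are 0 and 1
y_true = np.array([0, 1])

# Hard 0/1 predictions: the first is right, the second is wrong
hard = np.array([[1.0, 0.0, 0.0],
                 [0.0, 0.0, 1.0]])

# The same predicted classes, expressed with less extreme probabilities
soft = np.array([[0.8, 0.1, 0.1],
                 [0.1, 0.1, 0.8]])

hard_loss = multiclass_log_loss(y_true, hard)
soft_loss = multiclass_log_loss(y_true, soft)
print(hard_loss, soft_loss)
```

Under these assumptions, a single confident wrong prediction dominates the loss of the hard 0/1 encoding, so submitting predicted probabilities rather than hard labels would typically score lower.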

### For Baseline Model

![](../assets/scores/kaggle_score_basemodel_20200413.jpg)

### For Alternative Model

![](../assets/scores/kaggle_score_altmodel_20200413.jpg)

![](../assets/kaggle_score_20200325(3).jpg)