# Capstone Project: Classifying clinically actionable genetic mutations

***

## Notebook 5: Kaggle Submission

This notebook contains the code that converts the predictions for the test dataset into the format required for Kaggle submission at https://www.kaggle.com/c/msk-redefining-cancer-treatment/submit.

### Contents

- [Importing of Libraries](#Importing-of-Libraries)
- [Data Import](#Data-Import)
- [Data Formatting](#Data-Formatting)
- [Data Export](#Data-Export)

## Importing of Libraries

In [21]:
import pandas as pd
import numpy as np

## Data Import

In [3]:
# Import the dataset
test_pred = pd.read_csv("../assets/test_pred.csv")

In [6]:
test_pred.shape

(5668, 2)

In [4]:
test_pred.head()

Unnamed: 0,id,class
0,0,7
1,1,4
2,2,7
3,3,7
4,4,4


In [7]:
test_pred['class'].value_counts()

7    3792
4    1017
2     393
1     378
5      37
6      36
9       8
3       7
Name: class, dtype: int64
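The counts above list only eight of the nine classes: class 8 is never predicted, so `value_counts()` omits it. A minimal sketch of how the absent classes could be shown explicitly, using a small hypothetical stand-in for `test_pred['class']` (the name `toy_classes` and its values are assumptions for illustration only):

```python
import pandas as pd

# Toy stand-in for test_pred['class'] (hypothetical values, not the real predictions)
toy_classes = pd.Series([7, 4, 7, 2, 1, 7], name="class")

# Count each label, then reindex over 1..9 so classes that were never
# predicted appear with a count of 0 instead of being dropped
counts = toy_classes.value_counts().reindex(range(1, 10), fill_value=0)
print(counts)
```

Applied to the real predictions, the same `reindex` call would list class 8 with a count of 0.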

## Data Formatting

We format the test predictions to match the submission format required by Kaggle: an `ID` column followed by one column per class (`class1` to `class9`).

We first create a DataFrame with the required column headers, filled with zeros.

In [17]:
# Create a list of the required column headers
columns = ["class{}".format(i) for i in range(1,10)]
columns.insert(0, 'ID')

In [24]:
# Create a zero-filled dataframe with the appropriate columns
kaggle = pd.DataFrame(np.zeros((test_pred.shape[0], 10), dtype=int), columns=columns)

In [25]:
kaggle.head()

Unnamed: 0,ID,class1,class2,class3,class4,class5,class6,class7,class8,class9
0,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0


We now populate the DataFrame with the test predictions by setting the column of each row's predicted class to 1.

In [54]:
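for row_index in range(kaggle.shape[0]):
    # Look up the predicted class for this row and set the matching column to 1
    predicted_class = test_pred.loc[row_index, 'class']
    kaggle.loc[row_index, 'class' + str(predicted_class)] = 1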
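The row-by-row loop performs one `.loc` assignment per prediction, which can be slow on larger inputs. As a possible alternative (not the method used in this notebook), the same one-hot layout could be built in a single step with NumPy integer indexing. The sketch below assumes class labels are integers from 1 to 9; the function name `one_hot_predictions` and the toy data are hypothetical.

```python
import numpy as np
import pandas as pd

def one_hot_predictions(pred, n_classes=9):
    """Return an ID column plus one 0/1 indicator column per class."""
    class_cols = ["class{}".format(i) for i in range(1, n_classes + 1)]
    # Row i gets a 1 in column (label - 1); every other entry stays 0
    onehot = np.zeros((len(pred), n_classes), dtype=int)
    onehot[np.arange(len(pred)), pred["class"].to_numpy() - 1] = 1
    out = pd.DataFrame(onehot, columns=class_cols)
    out.insert(0, "ID", pred["id"].to_numpy())
    return out

# Toy stand-in for test_pred (hypothetical values)
toy_pred = pd.DataFrame({"id": [0, 1, 2], "class": [7, 4, 7]})
toy_kaggle = one_hot_predictions(toy_pred)
print(toy_kaggle)
```

Because the column list is fixed up front, a class that never appears in the predictions still gets its own all-zero column, which the submission format requires.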

In [55]:
# Verify that the respective class columns have been populated correctly
kaggle.head()

Unnamed: 0,ID,class1,class2,class3,class4,class5,class6,class7,class8,class9
0,0,0,0,0,0,0,0,1,0,0
1,0,0,0,0,1,0,0,0,0,0
2,0,0,0,0,0,0,0,1,0,0
3,0,0,0,0,0,0,0,1,0,0
4,0,0,0,0,1,0,0,0,0,0


In [56]:
# Now populate the 'ID' column
kaggle['ID'] = test_pred['id']

In [57]:
# Show the final kaggle submission dataframe
kaggle.head()

Unnamed: 0,ID,class1,class2,class3,class4,class5,class6,class7,class8,class9
0,0,0,0,0,0,0,0,1,0,0
1,1,0,0,0,1,0,0,0,0,0
2,2,0,0,0,0,0,0,1,0,0
3,3,0,0,0,0,0,0,1,0,0
4,4,0,0,0,1,0,0,0,0,0


In [4]:
kaggle.shape

(5668, 10)
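Beyond inspecting the shape, a few assertions can confirm that the submission frame is consistent with the predictions before it is exported. The following is a minimal sketch under the assumption that the frame has an `ID` column followed by `class1` to `class9`; the helper `check_submission` and the toy frames are hypothetical and not part of the original notebook.

```python
import pandas as pd

def check_submission(sub, pred, n_classes=9):
    """Raise AssertionError if the submission does not match the predictions."""
    class_cols = ["class{}".format(i) for i in range(1, n_classes + 1)]
    # Column layout: ID first, then one column per class
    assert list(sub.columns) == ["ID"] + class_cols
    # One submission row per prediction
    assert len(sub) == len(pred)
    # Exactly one class flagged per row
    assert (sub[class_cols].sum(axis=1) == 1).all()
    # The flagged column must correspond to the predicted class
    recovered = (
        sub[class_cols].idxmax(axis=1).str.replace("class", "", regex=False).astype(int)
    )
    assert (recovered.to_numpy() == pred["class"].to_numpy()).all()
    # IDs must line up row for row
    assert (sub["ID"].to_numpy() == pred["id"].to_numpy()).all()
    return True

# Toy frames mirroring the first two rows shown above
cols = ["ID"] + ["class{}".format(i) for i in range(1, 10)]
toy_pred = pd.DataFrame({"id": [0, 1], "class": [7, 4]})
toy_sub = pd.DataFrame(
    [[0, 0, 0, 0, 0, 0, 0, 1, 0, 0],
     [1, 0, 0, 0, 1, 0, 0, 0, 0, 0]],
    columns=cols,
)
result = check_submission(toy_sub, toy_pred)
```

In the notebook itself, the equivalent call would be `check_submission(kaggle, test_pred)`.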

## Data Export

In [6]:
kaggle.to_csv("../assets/submission.csv", index=False)

The screenshot below shows the final Kaggle submission, which had a public leaderboard score of 0.64734.

![](../assets/kaggle_score_20200325(3).jpg)