# Personalized Medicine: Redefining Cancer Treatment

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

## Training set
First, let's have a look at the training set itself. We'll look at the text later.

In [2]:
df_train = pd.read_csv('../input/training_variants')
print('Size of training set: {} rows and {} columns'.format(*df_train.shape))
df_train.head()

As we see, we have an ID (seemingly corresponding to the row number), the gene name, its variation, and the class label. Let's check how many classes we have.

In [3]:
df_train["Class"].unique()

What about class distribution?

In [4]:
plt.figure(figsize=(8,5))
sns.countplot(x="Class", data=df_train)
plt.ylabel('Frequency', fontsize=12)
plt.xlabel('Class', fontsize=12)  # the x-axis shows class labels; counts are on the y-axis
plt.xticks(rotation='vertical')
plt.title("Class Distribution", fontsize=15)
plt.show()
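
The plot shows raw counts; relative frequencies can make the imbalance easier to quantify. Below is a minimal sketch, assuming a `Class` column like the one in `df_train`. The small `toy` frame is made up purely so the snippet runs on its own and does not reflect the real distribution.

```python
import pandas as pd

# Hypothetical stand-in for df_train; the values are invented for illustration.
toy = pd.DataFrame({"Class": [1, 1, 2, 4, 4, 4, 7, 7, 7, 7]})

# Share of rows per class, sorted by class label rather than by frequency.
class_share = toy["Class"].value_counts(normalize=True).sort_index()
print(class_share)
```

On the real data, the same call would be `df_train["Class"].value_counts(normalize=True)`.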

Let's see how many unique Gene and Variation names we have.

In [5]:
print("{} unique Gene".format(df_train["Gene"].nunique()))
print("{} unique Variation".format(df_train["Variation"].nunique()))

Okay, so we have just 264 distinct genes, while the variations are far more diverse and have few duplicates. Let's have a closer look at the variations.

In [6]:
value_counts = df_train["Variation"].value_counts()
value_counts.head(20)

Okay, so most of the duplicated variations fall into five generic categories. Letter-number sequences have four or fewer occurrences, with most being unique.

Let's take a peek at the test data. I'm curious to see whether it shares many genes or variations with the training set.

In [7]:
df_test = pd.read_csv('../input/test_variants')
print('Size of test set: {} rows and {} columns'.format(*df_test.shape))
df_test.head()

In [8]:
print("{} unique Gene in test set".format(len(df_test["Gene"].unique())))
print("{} unique Variation in test set".format(len(df_test["Variation"].unique())))

In [9]:
print("Number of common genes: {}".format(len(set(df_train["Gene"]).intersection(df_test["Gene"]))))
print("Number of common variations: {}".format(len(set(df_train["Variation"]).intersection(df_test["Variation"]))))
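
Counting shared names tells us how many distinct values overlap, but not how many test rows they cover. A rough sketch of that second view, using `Series.isin`; the `toy_train` and `toy_test` frames are hypothetical stand-ins for `df_train` and `df_test`, with made-up contents.

```python
import pandas as pd

# Hypothetical stand-ins for df_train and df_test; values are illustrative only.
toy_train = pd.DataFrame({"Gene": ["BRCA1", "TP53", "EGFR"]})
toy_test = pd.DataFrame({"Gene": ["TP53", "TP53", "KRAS", "EGFR", "ALK"]})

# Boolean mask: does each test row's gene also occur in the training set?
seen_in_train = toy_test["Gene"].isin(toy_train["Gene"])

# The mean of a boolean mask is the fraction of True values.
gene_coverage = seen_in_train.mean()
print("Test rows with a gene seen in training: {:.0%}".format(gene_coverage))
```

The same mask can be built for `Variation` by swapping the column name.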

As we see, the test set has way more distinct genes and variations, but only a few of them also appear in the training set. We will need to somehow dissect these names into more general features before we can build even a simple model that ignores the text data.
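
One possible way to break variation strings into reusable features is sketched below. It assumes that the letter-number variations follow the common point-mutation shorthand (reference residue, position, substituted residue); that format, the regular expression, and the example strings in `toy_variations` are all assumptions for illustration, not something verified against the data here.

```python
import pandas as pd

# Hypothetical variation strings; the point-mutation format is an assumption.
toy_variations = pd.Series(["V600E", "Q61R", "Amplification", "R132*", "Deletion"])

# One uppercase letter, digits, then one uppercase letter or '*'
# ('*' is assumed to mark a stop codon in this shorthand).
pattern = r"^(?P<ref>[A-Z])(?P<pos>\d+)(?P<alt>[A-Z*])$"

# Named groups become columns; rows that do not match are filled with NaN.
parts = toy_variations.str.extract(pattern)
parts["is_point_mutation"] = parts["ref"].notna()
parts["pos"] = pd.to_numeric(parts["pos"])  # stays NaN where the pattern did not match
print(parts)
```

Rows that do not match could be kept as plain categorical values, so that both kinds of variation remain usable as features.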