# Data description (from Kaggle):

In this competition you will develop algorithms to classify genetic mutations based on clinical evidence (text).

There are nine different classes into which a genetic mutation can be classified.

This is not a trivial task since interpreting clinical evidence is very challenging even for human specialists. Therefore, modeling the clinical evidence (text) will be critical for the success of your approach.

Both the training and test data sets are provided via two different files. One (training/test_variants) provides information about the genetic mutations, whereas the other (training/test_text) provides the clinical evidence (text) that our human experts used to classify the genetic mutations. Both are linked via the ID field.

Therefore, the genetic mutation (row) with ID=15 in the file training_variants was classified using the clinical evidence (text) from the row with ID=15 in the file training_text.

Finally, to make it more exciting, some of the test data is machine-generated to prevent hand labeling. You will submit all the results of your classification algorithm, and we will ignore the machine-generated samples.

File descriptions
- training_variants - a comma-separated file containing the description of the genetic mutations used for training. Fields are ID (the id of the row used to link the mutation to the clinical evidence), Gene (the gene where this genetic mutation is located), Variation (the amino acid change for this mutation), Class (1-9, the class this genetic mutation has been classified into)
- training_text - a double-pipe (||) delimited file that contains the clinical evidence (text) used to classify genetic mutations. Fields are ID (the id of the row used to link the clinical evidence to the genetic mutation), Text (the clinical evidence used to classify the genetic mutation)
- test_variants - a comma-separated file containing the description of the genetic mutations used for testing. Fields are ID (the id of the row used to link the mutation to the clinical evidence), Gene (the gene where this genetic mutation is located), Variation (the amino acid change for this mutation)
- test_text - a double-pipe (||) delimited file that contains the clinical evidence (text) used to classify genetic mutations. Fields are ID (the id of the row used to link the clinical evidence to the genetic mutation), Text (the clinical evidence used to classify the genetic mutation)
- submissionSample - a sample submission file in the correct format
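
The double-pipe layout described above can also be parsed directly with pandas. The following is a minimal sketch, not the method used in this notebook: the two-row sample is hypothetical, and the `ID,Text` header line is an assumption about the file that the description does not state.

```python
import io
import pandas as pd

# Hypothetical two-row sample in the training_text layout.
# Assumption: the file starts with an "ID,Text" header line,
# followed by one "ID||Text" record per line.
sample = io.StringIO(
    "ID,Text\n"
    "0||Cyclin-dependent kinases regulate the cell cycle.\n"
    "1||Non-small cell lung cancer, abstract and background.\n"
)

text_df = pd.read_csv(
    sample,
    sep=r"\|\|",          # regex separator, which requires the Python engine
    engine="python",
    skiprows=1,           # skip the assumed header, which uses a comma instead
    names=["ID", "Text"],
    index_col="ID",
)
print(text_df)
```

Passing a real file path instead of `sample` should work the same way if the header assumption holds.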

In [1]:
# In this case, the test dataset does not provide the mutation class, since it is intended only for
# generating predictions to submit to the Kaggle competition.

## Loading the training_text file

In [5]:
import pandas as pd

In [6]:
## Function that structures the raw text as a dictionary and saves the intermediate result to a JSON file.

def structure_text(path, json_name):
    # Read the whole file; the context manager closes it even if reading fails.
    with open(path, 'r') as f:
        text = f.read()

    lines = text.split(sep='\n')
    dummy = {'ID': [], 'Text': []}

    for line in lines:
        # Each record is "ID||Text"; split only on the first '||' so the text stays whole.
        splitted_line = line.split(sep='||', maxsplit=1)
        # Lines without the delimiter are skipped.
        if len(splitted_line) > 1:
            dummy['ID'].append(splitted_line[0])
            dummy['Text'].append(splitted_line[1])

    df = pd.DataFrame.from_dict(dummy)
    df.set_index('ID', inplace=True)
    df.to_json(path_or_buf=json_name, orient='split')

In [7]:
path_1 = './data_files/training_text'
#path_2 = './data_files/test_text'

In [8]:
structure_text(path_1, './data_files/training_text.json')

In [5]:
#structure_text(path_2, './data_files/test_text.json')

In [9]:
df1 = pd.read_json('./data_files/training_text.json', orient='split')
df1.head()

Unnamed: 0,Text
0,Cyclin-dependent kinases (CDKs) regulate a var...
1,Abstract Background Non-small cell lung canc...
2,Abstract Background Non-small cell lung canc...
3,Recent evidence has demonstrated that acquired...
4,Oncogenic mutations in the monomeric Casitas B...


In [10]:
#df2 = pd.read_json('./data_files/test_text.json', orient='split')
#df2.head()

In [11]:
df1

Unnamed: 0,Text
0,Cyclin-dependent kinases (CDKs) regulate a var...
1,Abstract Background Non-small cell lung canc...
2,Abstract Background Non-small cell lung canc...
3,Recent evidence has demonstrated that acquired...
4,Oncogenic mutations in the monomeric Casitas B...
...,...
3316,Introduction Myelodysplastic syndromes (MDS) ...
3317,Introduction Myelodysplastic syndromes (MDS) ...
3318,The Runt-related transcription factor 1 gene (...
3319,The RUNX1/AML1 gene is the most frequent targe...
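
The truncated previews for rows 1 and 2 (and for rows 3316 and 3317) look identical, which suggests that several mutations may share the same clinical text. A quick check along these lines could confirm it. This is a sketch on a hypothetical miniature frame, `toy_text`, standing in for `df1`. It reports no counts for the real data.

```python
import pandas as pd

# Hypothetical stand-in for df1: rows 1 and 2 share one text.
toy_text = pd.DataFrame(
    {"Text": ["paper A", "paper B", "paper B", "paper C"]},
    index=[0, 1, 2, 3],
)

# duplicated() flags every repeat after the first occurrence of a value.
n_repeats = int(toy_text["Text"].duplicated().sum())

# Number of distinct texts.
n_unique = toy_text["Text"].nunique()

# Missing or whitespace-only texts would carry no evidence for a classifier.
n_empty = int((toy_text["Text"].fillna("").str.strip() == "").sum())

print(n_repeats, n_unique, n_empty)
```

Running the same three expressions on `df1["Text"]` would show how much text is shared across mutations.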


## Loading the training_variants file

In [12]:
df2 = pd.read_csv('./data_files/training_variants')

In [13]:
df2

Unnamed: 0,ID,Gene,Variation,Class
0,0,FAM58A,Truncating Mutations,1
1,1,CBL,W802*,2
2,2,CBL,Q249E,2
3,3,CBL,N454D,3
4,4,CBL,L399V,4
...,...,...,...,...
3316,3316,RUNX1,D171N,4
3317,3317,RUNX1,A122*,1
3318,3318,RUNX1,Fusions,1
3319,3319,RUNX1,R80C,4


In [25]:
df2.set_index('ID', inplace=True)
df2

ID,Gene,Variation,Class
0,FAM58A,Truncating Mutations,1
1,CBL,W802*,2
2,CBL,Q249E,2
3,CBL,N454D,3
4,CBL,L399V,4
...,...,...,...
3316,RUNX1,D171N,4
3317,RUNX1,A122*,1
3318,RUNX1,Fusions,1
3319,RUNX1,R80C,4
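
Since the target is one of nine classes, it can help to look at how the labels are distributed before modeling. The sketch below uses only the first five rows shown above, in a hypothetical frame named `toy_variants`. The same two lines applied to `df2` would give the distribution for the full training set.

```python
import pandas as pd

# The first five rows displayed above, rebuilt as a small example frame.
toy_variants = pd.DataFrame(
    {
        "Gene": ["FAM58A", "CBL", "CBL", "CBL", "CBL"],
        "Variation": ["Truncating Mutations", "W802*", "Q249E", "N454D", "L399V"],
        "Class": [1, 2, 2, 3, 4],
    },
    index=pd.Index([0, 1, 2, 3, 4], name="ID"),
)

# Count rows per class, ordered by class label rather than by frequency.
class_counts = toy_variants["Class"].value_counts().sort_index()

# Relative frequency of each class.
class_share = class_counts / class_counts.sum()

print(class_counts)
print(class_share)
```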


In [26]:
# Saving the intermediate result to a JSON file.

df2.to_json(path_or_buf='./data_files/training_variants.json', orient='split')

df1

## Merging the two datasets on the index

In [31]:
df3 = df1.merge(df2, left_index=True, right_index=True)
df3

Unnamed: 0,Text,Gene,Variation,Class
0,Cyclin-dependent kinases (CDKs) regulate a var...,FAM58A,Truncating Mutations,1
1,Abstract Background Non-small cell lung canc...,CBL,W802*,2
2,Abstract Background Non-small cell lung canc...,CBL,Q249E,2
3,Recent evidence has demonstrated that acquired...,CBL,N454D,3
4,Oncogenic mutations in the monomeric Casitas B...,CBL,L399V,4
...,...,...,...,...
3316,Introduction Myelodysplastic syndromes (MDS) ...,RUNX1,D171N,4
3317,Introduction Myelodysplastic syndromes (MDS) ...,RUNX1,A122*,1
3318,The Runt-related transcription factor 1 gene (...,RUNX1,Fusions,1
3319,The RUNX1/AML1 gene is the most frequent targe...,RUNX1,R80C,4
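
An index merge silently drops or multiplies rows when the two indexes do not line up one to one. As a hedged sketch (the frames `toy_text`, `toy_variants`, and `dup_variants` are hypothetical, not the notebook's data), the `validate` argument of `DataFrame.merge` can guard against that:

```python
import pandas as pd

# Hypothetical miniature versions of the text and variants frames.
toy_text = pd.DataFrame(
    {"Text": ["evidence a", "evidence b", "evidence c"]},
    index=[0, 1, 2],
)
toy_variants = pd.DataFrame(
    {
        "Gene": ["FAM58A", "CBL", "CBL"],
        "Variation": ["Truncating Mutations", "W802*", "Q249E"],
        "Class": [1, 2, 2],
    },
    index=pd.Index([0, 1, 2], name="ID"),
)

# validate="one_to_one" raises if either index contains duplicate keys.
merged = toy_text.merge(
    toy_variants,
    left_index=True,
    right_index=True,
    how="inner",
    validate="one_to_one",
)

# With matching indexes, no row is lost or duplicated.
rows_preserved = len(merged) == len(toy_text) == len(toy_variants)

# A duplicated ID on one side is reported instead of silently multiplying rows.
dup_variants = pd.concat([toy_variants, toy_variants.iloc[[0]]])
try:
    toy_text.merge(dup_variants, left_index=True, right_index=True,
                   validate="one_to_one")
    duplicate_detected = False
except pd.errors.MergeError:
    duplicate_detected = True

print(rows_preserved, duplicate_detected)
```

Comparing `len(df3)` with `len(df1)` and `len(df2)` after the merge above is a lighter version of the same check.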
