As per [this answer](https://www.kaggle.com/c/msk-redefining-cancer-treatment/discussion/39782#224266) from the discussion forum, we can use the stage 1 test labels for training.  
I've concatenated all the earlier CSVs into train and test DataFrames for my own use and thought it would be helpful for others as well.  

---
This notebook will read all data into memory, join and output train and test DataFrames.  
The training DataFrame has shape `[3689, 4]` *(3321 training rows + 368 stage 1 test rows)* with columns `['Class', 'Gene', 'Text', 'Variation']`.  
The test DataFrame now has shape `[5300, 3]` *(5668 test rows - 368 stage 1 test rows)* with columns `['Gene', 'Text', 'Variation']`.  
  
Bear in mind that predictions made on this test DataFrame cannot be used directly as your Kaggle submission CSV, since I've removed the rows whose labels were released. Use the original test file for your submission. Removing those rows is necessary for better cross-validation if you are going to use this kernel.  
  
*Since the released test IDs and the former training IDs conflict with each other, I've reset the index on the final training DataFrame to make them unique and cause less confusion.*

In [1]:
import pandas as pd 

In [2]:
# read train_variants, train_text and join them
df_variants_train = pd.read_csv('../input/training_variants', usecols=['Gene', 'Variation', 'Class'])
df_text_train = pd.read_csv('../input/training_text', sep=r'\|\|', engine='python', skiprows=1, names=['ID', 'Text'])
df_variants_train['Text'] = df_text_train['Text']
df_train = df_variants_train

In [3]:
# read test_variants, test_text and join them
df_variants_test = pd.read_csv('../input/test_variants', usecols=['ID', 'Gene', 'Variation'])
df_text_test = pd.read_csv('../input/test_text', sep=r'\|\|', engine='python', skiprows=1, names=['ID', 'Text'])
df_variants_test['Text'] = df_text_test['Text']
df_test = df_variants_test

# read stage1 solutions
df_labels_test = pd.read_csv('../input/stage1_solution_filtered.csv')
# pick the one-hot column holding the label and strip its 5-character prefix (result is a string)
df_labels_test['Class'] = df_labels_test.drop('ID', axis=1).idxmax(axis=1).str[5:]

# remove rows with known labels from the test set (match on the ID column, not the positional index)
df_stage_2_test = df_test[~df_test['ID'].isin(df_labels_test['ID'])].set_index('ID')

# attach the stage 1 labels by ID and keep only the labelled rows
df_test = df_test.merge(df_labels_test[['ID', 'Class']], on='ID', how='left').drop('ID', axis=1)
df_test = df_test[df_test['Class'].notnull()]
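
The label decoding and the split above can be illustrated on toy data. This is only a minimal sketch: the frames, IDs, and the `class1`..`class3` column names below are made up for illustration and merely assume the solution file holds one one-hot column per class.

```python
import pandas as pd

# Toy stand-in for the stage 1 solution file: one one-hot column per class.
# The column names and IDs here are illustrative assumptions.
labels = pd.DataFrame({
    'ID': [1, 3],
    'class1': [0, 1],
    'class2': [1, 0],
    'class3': [0, 0],
})

# idxmax(axis=1) returns the name of the column holding the 1;
# .str[5:] strips the 5-character 'class' prefix, leaving a string label.
labels['Class'] = labels.drop('ID', axis=1).idxmax(axis=1).str[5:]

# Toy stand-in for the joined test set.
test = pd.DataFrame({
    'ID': [0, 1, 2, 3],
    'Gene': ['A', 'B', 'C', 'D'],
})

# Rows whose ID has a released label move to training; the rest stay in test.
has_label = test['ID'].isin(labels['ID'])
labelled = test[has_label].merge(labels[['ID', 'Class']], on='ID')
unlabelled = test[~has_label]
```

Here `labelled` keeps IDs 1 and 3 with their decoded classes, while `unlabelled` keeps IDs 0 and 2.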

In [4]:
# join train and test files
df_stage_2_train = pd.concat([df_train, df_test])

# reset index to a range for readability
df_stage_2_train.reset_index(drop=True, inplace=True)

In [5]:
# training DataFrame
df_stage_2_train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3689 entries, 0 to 3688
Data columns (total 4 columns):
Class        3689 non-null object
Gene         3689 non-null object
Text         3689 non-null object
Variation    3689 non-null object
dtypes: object(4)
memory usage: 115.4+ KB
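
Note that `Class` is reported as `object` above. Presumably this is because the original training labels are read as integers while the stage 1 labels are sliced out of column names as strings. If a single numeric dtype is preferred, one option is to cast after concatenating. A minimal sketch on toy frames (the values are made up):

```python
import pandas as pd

# Toy frames: integer labels as read from a CSV column,
# string labels as produced by slicing column names.
train_part = pd.DataFrame({'Class': [1, 4]})
test_part = pd.DataFrame({'Class': ['2', '7']})

# After concatenation the column holds a mix of int and str values.
combined = pd.concat([train_part, test_part]).reset_index(drop=True)

# Cast everything to integers so the labels compare consistently.
combined['Class'] = combined['Class'].astype(int)
```

Without a cast like this, a label such as `1` and `'1'` would be treated as two different classes by most tooling.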


In [6]:
# test DataFrame
df_stage_2_test.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 5300 entries, 0 to 5667
Data columns (total 3 columns):
Gene         5300 non-null object
Variation    5300 non-null object
Text         5300 non-null object
dtypes: object(3)
memory usage: 165.6+ KB
