# Truncate Output 

### Our earlier experiments with a massive label set (~1000 labels) yielded no results, so we decided to "make it easier" on ourselves by limiting the label set to 6 ICD codes.

We picked codes that appear frequently (in 10-25% of all cases) and ultimately settled on the following:
1) "Disorders of Lipoid Metabolism"
2) "Diseases of Esophagus"
3) "Chronic Kidney Disease"
4) "Acute Myocardial Infarction (heart attack)"
5) "Anemia"
6) "Disorders of Urethra and Urinary Tract"
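
For reference, the six categories above correspond to the rolled-up three-digit ICD-9 codes selected later in this notebook (585, 272, 530, 599, 285, 410). The pairing of each code with its name follows the standard ICD-9 category definitions; the dictionary name `SELECTED_CODES` is only illustrative and is not part of the original pipeline.

```python
# Hypothetical lookup table (the name is illustrative): rolled ICD-9 code -> category name
SELECTED_CODES = {
    '272': 'Disorders of Lipoid Metabolism',
    '530': 'Diseases of Esophagus',
    '585': 'Chronic Kidney Disease',
    '410': 'Acute Myocardial Infarction (heart attack)',
    '285': 'Anemia',
    '599': 'Disorders of Urethra and Urinary Tract',
}

for code in sorted(SELECTED_CODES):
    print(code, SELECTED_CODES[code])
```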

In [1]:
import pickle
import numpy as np
import pandas as pd

def save_as_pkl(vocab, filename):
    '''Pickle an object (e.g. a vocab or DataFrame) and save it to filename'''
    with open(filename, 'wb') as f:
        pickle.dump(vocab, f)

def load_pkl(filename):
    '''Load and return the object stored in a pkl file'''
    with open(filename, 'rb') as f:
        vocab = pickle.load(f)
    return vocab

In [3]:
path = './Perotte code/'

#Get list of SubjectIDs associated with test set
test_subj_ids = []
with open(path+'testing_SUBJ_IDs.data') as fin:
    for line in fin:
        test_subj_ids.append(line.strip('\n'))

#Get list of SubjectIDs associated with MIMIC II training set
train_subj_ids = []
with open(path+'training_SUBJ_IDs.data') as fin:
    for line in fin:
        train_subj_ids.append(line.strip('\n'))
train_subj_ids.remove('"subject_id"')
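
The IDs read above are kept as strings, while `SUBJECT_ID` is typically parsed as an integer column by `pd.read_csv`. Depending on the pandas version, `isin` may not match strings against integers, so one defensive option is to parse the IDs as integers and skip the header line while reading. The helper below is a sketch under that assumption; `read_subject_ids` is a hypothetical name, and an in-memory sample stands in for the real `.data` files.

```python
import io

def read_subject_ids(handle):
    """Return the integer IDs found in a file-like object, skipping non-numeric lines."""
    ids = []
    for line in handle:
        token = line.strip().strip('"')
        if token.isdigit():          # skips headers such as "subject_id" and blank lines
            ids.append(int(token))
    return ids

# In-memory stand-in for a file such as training_SUBJ_IDs.data
sample = io.StringIO('"subject_id"\n109\n112\n')
print(read_subject_ids(sample))  # [109, 112]
```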

In [9]:
Diagnoses_ICD = pd.read_csv('../MIMIC-III/DIAGNOSES_ICD.csv')
Diagnoses_ICD = Diagnoses_ICD[~Diagnoses_ICD['ICD9_CODE'].isnull()]

#Only consider rows that appear in MIMIC II
Diagnoses_ICD = Diagnoses_ICD[(Diagnoses_ICD['SUBJECT_ID'].isin(train_subj_ids))|
                             (Diagnoses_ICD['SUBJECT_ID'].isin(test_subj_ids))]

Diagnoses_ICD['Rolled_ICD'] = np.where(Diagnoses_ICD['ICD9_CODE'].str[0] == 'E',
                                       Diagnoses_ICD['ICD9_CODE'].str[0:4],
                                       Diagnoses_ICD['ICD9_CODE'].str[0:3])
                                       
NumberCodes = len(Diagnoses_ICD['ICD9_CODE'].unique())
NumberRolled = len(Diagnoses_ICD['Rolled_ICD'].unique())

print('Unique ICD-9 codes:', NumberCodes, '\nUnique 3-Digit Codes:',NumberRolled)
print('HADM_IDs', len(Diagnoses_ICD['HADM_ID'].unique()), '\nSUBJECT_IDs', 
                     len(Diagnoses_ICD['SUBJECT_ID'].unique()))

Diagnoses_ICD.head()

Unique ICD-9 codes: 5431 
Unique 3-Digit Codes: 970
HADM_IDs 30050 
SUBJECT_IDs 21685


Unnamed: 0,ROW_ID,SUBJECT_ID,HADM_ID,SEQ_NUM,ICD9_CODE,Rolled_ICD
0,1297,109,172335,1.0,40301,403
1,1298,109,172335,2.0,486,486
2,1299,109,172335,3.0,58281,582
3,1300,109,172335,4.0,5855,585
4,1301,109,172335,5.0,4254,425
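
The roll-up rule behind `Rolled_ICD` can be checked on individual codes. The helper below is a small standalone sketch (the name `roll_icd9` is hypothetical): codes starting with `E` keep their first four characters, and all other codes keep their first three.

```python
def roll_icd9(code):
    """Truncate an ICD-9 code to its category: 4 characters for E-codes, 3 otherwise."""
    return code[:4] if code.startswith('E') else code[:3]

print(roll_icd9('40301'))  # 403
print(roll_icd9('5855'))   # 585
print(roll_icd9('E8859'))  # E885
```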


In [10]:
#Dictionary where each key is a HADM_ID, each value is a list of rolled-up ICD9 codes
HADMID_Code_Dict = dict()
Unique_Visits = Diagnoses_ICD['HADM_ID'].unique()

for visit in Unique_Visits:
    VisitDF = Diagnoses_ICD[Diagnoses_ICD['HADM_ID']==visit].reset_index()
    ListOfICDs=[]
    for i in range(len(VisitDF)):
        ListOfICDs.append(VisitDF.loc[i, 'Rolled_ICD'])
    UniqueICDs = np.unique(ListOfICDs) #For rolled ICDs
    #ICDs = ' '.join(UniqueICDs)
    HADMID_Code_Dict[visit] = list(UniqueICDs)

HADMID_Code_Dict[100095]

['276',
 '285',
 '286',
 '403',
 '410',
 '414',
 '424',
 '428',
 '458',
 '486',
 '564',
 '585',
 '785']
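
The loop above filters the full table once per visit, which can be slow with roughly 30,000 visits. A `groupby` builds the same kind of dictionary in a single pass. The sketch below uses a toy frame (the values are made up for illustration) rather than the real `Diagnoses_ICD` table.

```python
import pandas as pd

# Toy stand-in for Diagnoses_ICD: visit 1 has a repeated rolled code
toy = pd.DataFrame({
    'HADM_ID': [1, 1, 1, 2],
    'Rolled_ICD': ['403', '486', '403', '585'],
})

# One sorted list of unique rolled codes per visit
codes_per_visit = (toy.groupby('HADM_ID')['Rolled_ICD']
                      .apply(lambda s: sorted(s.unique()))
                      .to_dict())
print(codes_per_visit)  # {1: ['403', '486'], 2: ['585']}
```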

In [11]:
HADMID_Code_DF = pd.DataFrame(pd.Series(HADMID_Code_Dict)).\
                reset_index().rename(columns={0: 'ICD9_Codes','index': 'HADM_ID'})
    
HADMID_Code_DF['ICD9_Codes_str'] = HADMID_Code_DF['ICD9_Codes'].map(lambda x: ' '.join(x))
HADMID_Code_DF.head()

Unnamed: 0,HADM_ID,ICD9_Codes,ICD9_Codes_str
0,100006,"[203, 276, 309, 486, 493, 518, 785, V12, V15]",203 276 309 486 493 518 785 V12 V15
1,100007,"[401, 486, 557, 560, 997]",401 486 557 560 997
2,100009,"[250, 272, 278, 285, 401, 411, 414, 426, 440, ...",250 272 278 285 401 411 414 426 440 996 V15 V4...
3,100014,"[278, 300, 718, 726, 738, V45]",278 300 718 726 738 V45
4,100020,"[041, 276, 293, 337, 340, 344, 345, 369, 401, ...",041 276 293 337 340 344 345 369 401 428 530 56...


### Dealing with Notes 

In [4]:
NoteEvents = pd.read_csv('../MIMIC-III/NOTEEVENTS.csv')
Notes = NoteEvents[NoteEvents['CATEGORY'] == 'Discharge summary'].reset_index(drop=True)
Notes = Notes[['SUBJECT_ID','HADM_ID','CHARTDATE','DESCRIPTION', 'TEXT']]
Notes = Notes[(Notes['SUBJECT_ID'].isin(train_subj_ids))|
                (Notes['SUBJECT_ID'].isin(test_subj_ids))]

#Dummy dataset
#Notes = Notes[Notes['HADM_ID']<100400]
Notes.head()



Unnamed: 0,SUBJECT_ID,HADM_ID,CHARTDATE,DESCRIPTION,TEXT
0,22532,167853.0,2151-08-04,Report,Admission Date: [**2151-7-16**] Dischar...
1,13702,107527.0,2118-06-14,Report,Admission Date: [**2118-6-2**] Discharg...
2,13702,167118.0,2119-05-25,Report,Admission Date: [**2119-5-4**] D...
3,13702,196489.0,2124-08-18,Report,Admission Date: [**2124-7-21**] ...
4,26880,135453.0,2162-03-25,Report,Admission Date: [**2162-3-3**] D...


In [94]:
print(Notes.loc[1, 'TEXT'])

Admission Date:  [**2118-6-2**]       Discharge Date:  [**2118-6-14**]

Date of Birth:                    Sex:  F

Service:  MICU and then to [**Doctor Last Name **] Medicine

HISTORY OF PRESENT ILLNESS:  This is an 81-year-old female
with a history of emphysema (not on home O2), who presents
with three days of shortness of breath thought by her primary
care doctor to be a COPD flare.  Two days prior to admission,
she was started on a prednisone taper and one day prior to
admission she required oxygen at home in order to maintain
oxygen saturation greater than 90%.  She has also been on
levofloxacin and nebulizers, and was not getting better, and
presented to the [**Hospital1 18**] Emergency Room.

In the [**Hospital3 **] Emergency Room, her oxygen saturation was
100% on CPAP.  She was not able to be weaned off of this
despite nebulizer treatment and Solu-Medrol 125 mg IV x2.

Review of systems is negative for the following:  Fevers,
chills, nausea, vomiting, night sweats, change in we

In [5]:
NotesPerID = Notes.groupby('HADM_ID').count()[['TEXT']]

#HADM_IDs_with_MultipleCounts: the HADM_IDs with more than one discharge summary.
HADM_IDs_with_MultipleCounts = NotesPerID[NotesPerID['TEXT'] > 1].index

#NotesToAppend: rows of text that need to be merged according to HADM_ID
NotesToAppend = Notes[Notes['HADM_ID'].isin(HADM_IDs_with_MultipleCounts)]

UnchangedNotes = Notes[~Notes['HADM_ID'].isin(HADM_IDs_with_MultipleCounts)]

In [7]:
merged_rows = []
for i, ID in enumerate(HADM_IDs_with_MultipleCounts, start=1): #For each HADM_ID with multiple texts
    print(i) #progress counter
    subdf = NotesToAppend[NotesToAppend['HADM_ID'] == ID] #create a smaller df with only that HADMID
    combined_text = subdf['TEXT'].str.cat() #combine all text in that column into one entry 
    subdf = subdf.drop_duplicates(subset='HADM_ID').copy() #turn smaller df into 1-row df
    subdf['TEXT'] = combined_text 
    merged_rows.append(subdf)
#DataFrame.append was removed in pandas 2.0, so collect the rows and concatenate once
ChangedNotes = pd.concat(merged_rows) if merged_rows else pd.DataFrame()

1
2
...
277


In [15]:
TotalNotes = pd.concat([ChangedNotes, UnchangedNotes]).reset_index(drop=True)
print(len(TotalNotes), len(TotalNotes['HADM_ID'].unique())) #now a 1-1 relationship.

29685 29685
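
Merging several discharge summaries per admission can also be written as a single `groupby` aggregation instead of an explicit loop. The sketch below assumes it is acceptable to keep the first value of the non-text columns for each `HADM_ID`; the toy frame and its values are invented for illustration.

```python
import pandas as pd

# Toy stand-in for Notes: admission 10 has two discharge summaries
toy_notes = pd.DataFrame({
    'HADM_ID': [10, 10, 20],
    'SUBJECT_ID': [1, 1, 2],
    'TEXT': ['part one. ', 'part two.', 'only note.'],
})

# Concatenate the texts in row order; keep the first SUBJECT_ID per admission
merged = (toy_notes.groupby('HADM_ID', as_index=False)
                   .agg({'SUBJECT_ID': 'first', 'TEXT': ''.join}))
print(len(merged), merged['HADM_ID'].nunique())  # 2 2
```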


### IDs 

In [13]:
IDs = Diagnoses_ICD[['SUBJECT_ID', 'HADM_ID']].drop_duplicates()
IDs['Test'] = np.where(IDs['SUBJECT_ID'].isin(test_subj_ids), 1, 0)
IDs = IDs.sort_values(by='Test').reset_index(drop=True)
NumTrain = round(0.70*len(IDs))
IDs['Test'] = np.where((IDs.index > NumTrain)&(IDs['Test']==0), 0.5, IDs['Test'])
IDs = IDs.drop(columns='SUBJECT_ID')
IDs.groupby('Test').count()

Test,HADM_ID
0.0,21036
0.5,6138
1.0,2876
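
The split logic is easier to follow on a small example. Rows are sorted so that test admissions come last, roughly the first 70% of rows stay in training (`Test == 0`), and the remaining non-test rows are marked as validation (`Test == 0.5`). Because the comparison is `index > NumTrain`, training ends up with `NumTrain + 1` rows, which is consistent with the 21036 training admissions above. The toy frame below is invented for illustration.

```python
import numpy as np
import pandas as pd

# 20 toy admissions, the last 2 of which belong to the predefined test set
ids = pd.DataFrame({'HADM_ID': range(100, 120),
                    'Test': [0] * 18 + [1] * 2})
ids = ids.sort_values(by='Test').reset_index(drop=True)

num_train = round(0.70 * len(ids))  # 14
# Non-test rows past the cutoff become validation rows (0.5)
ids['Test'] = np.where((ids.index > num_train) & (ids['Test'] == 0), 0.5, ids['Test'])

counts = ids['Test'].value_counts().to_dict()
print(counts[0.0], counts[0.5], counts[1.0])  # 15 3 2
```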


### Merge 

In [16]:
TotalNotes = pd.merge(IDs, TotalNotes, on='HADM_ID' )
TotalNotes = TotalNotes.sort_values(by=['Test', 'HADM_ID']) 
print(len(TotalNotes))

29683


In [17]:
FinalTotal = TotalNotes.merge(HADMID_Code_DF, on = 'HADM_ID')
print(len(FinalTotal))

29683


In [18]:
FinalTotal.head()

Unnamed: 0,HADM_ID,Test,SUBJECT_ID,CHARTDATE,DESCRIPTION,TEXT,ICD9_Codes,ICD9_Codes_str
0,100007,0.0,23018,2145-04-07,Report,Admission Date: [**2145-3-31**] ...,"[401, 486, 557, 560, 997]",401 486 557 560 997
1,100009,0.0,533,2162-05-21,Report,Admission Date: [**2162-5-16**] ...,"[250, 272, 278, 285, 401, 411, 414, 426, 440, ...",250 272 278 285 401 411 414 426 440 996 V15 V4...
2,100031,0.0,6892,2140-11-24,Report,Admission Date: [**2140-11-11**] Discha...,"[401, 424, 427, 441, 443, 453, 530, 578, 733]",401 424 427 441 443 453 530 578 733
3,100038,0.0,21234,2127-07-13,Report,Admission Date: [**2127-7-11**] ...,"[250, 272, 276, 285, 401, 414, 458, 584, 786, ...",250 272 276 285 401 414 458 584 786 V45
4,100045,0.0,1569,2176-02-15,Report,Admission Date: [**2176-2-5**] D...,"[250, 276, 285, 287, 571, 572, 585, 599, 780, ...",250 276 285 287 571 572 585 599 780 V54


In [19]:
def filter_out_test_set_codes(df):
    '''This function removes codes that appear exclusively in the validation/test sets of a data frame.'''
    new_df = df.copy()
    
    train_set = new_df[new_df['Test']==0].copy()
    test_set = new_df[new_df['Test']!=0].copy().reset_index()
    training_codes = np.unique(np.concatenate(train_set['ICD9_Codes']))
    testing_codes = np.unique(np.concatenate(test_set['ICD9_Codes']))
    
    codes_to_delete = [x for x in testing_codes if x not in training_codes]    
    new_df['ICD9_Codes'] = new_df['ICD9_Codes'].map(lambda x: [code for code in x 
                                                          if code not in codes_to_delete])
    new_df['ICD9_Codes_str'] = new_df['ICD9_Codes'].map(lambda x: ' '.join(x))
    return new_df, codes_to_delete

#------------------------------------------------

Full_DF, Full_deleted_codes = filter_out_test_set_codes(FinalTotal)

In [21]:
len(Full_DF)

29683
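
The core of `filter_out_test_set_codes` is a set difference: any code seen in the validation/test rows but never in the training rows is dropped everywhere. A minimal sketch on made-up code lists:

```python
# Made-up rolled codes for two training rows and two held-out rows
train_rows = [['401', '486'], ['401', '585']]
heldout_rows = [['401', '997'], ['486']]

train_codes = {c for row in train_rows for c in row}
heldout_codes = {c for row in heldout_rows for c in row}

# Codes that occur only outside the training set
codes_to_delete = heldout_codes - train_codes

filtered_heldout = [[c for c in row if c not in codes_to_delete] for row in heldout_rows]
print(sorted(codes_to_delete))  # ['997']
print(filtered_heldout)         # [['401'], ['486']]
```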

In [23]:
def generate_binary_output(chart):
    '''
    Note: index of input df should be "regular" -- np.arange(len(chart))
    @chart is the big dataframe (including summaries and ICD-9 codes)    
        
    This function returns the df, but creates a "Labels" column where each cell
    contains a binary list (only 1's and 0's).

    Each binary list is ~1000 elements long (representing the number of ICD9 Codes).
    A "1" means the summary is associated with a ICD9 code, "0" otherwise.
    
    For example, if the only codes in the universe were 001, 002, 003, 004, and one summary 
    was associated with 002 and 004, the binary list for that summary would be [0, 1, 0, 1]
    '''
    
    from sklearn.feature_extraction.text import CountVectorizer
    VectorizerCodes = CountVectorizer()
    
    matrix = VectorizerCodes.fit_transform(chart['ICD9_Codes_str'])
    #get_feature_names() was removed in scikit-learn 1.2; keep a plain list so .index() works later
    colnames = list(VectorizerCodes.get_feature_names_out())
    array_out = np.array(matrix.toarray(), dtype=np.float32)    
    Output = pd.DataFrame(array_out, columns = colnames, index = chart.index) #copy index from large table.
    
    Output['Labels'] = Output.index.map(lambda x: list(Output.loc[x,:]))
    O = Output[['Labels']]
    new_chart = pd.merge(chart, O, right_index=True, left_index=True)
    return new_chart, colnames, array_out

In [31]:
full_table, code_labels, binary_array = generate_binary_output(Full_DF)
full_table.head()

Unnamed: 0,HADM_ID,Test,SUBJECT_ID,CHARTDATE,DESCRIPTION,TEXT,ICD9_Codes,ICD9_Codes_str,Labels
0,100007,0.0,23018,2145-04-07,Report,Admission Date: [**2145-3-31**] ...,"[401, 486, 557, 560, 997]",401 486 557 560 997,"[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ..."
1,100009,0.0,533,2162-05-21,Report,Admission Date: [**2162-5-16**] ...,"[250, 272, 278, 285, 401, 411, 414, 426, 440, ...",250 272 278 285 401 411 414 426 440 996 V15 V4...,"[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ..."
2,100031,0.0,6892,2140-11-24,Report,Admission Date: [**2140-11-11**] Discha...,"[401, 424, 427, 441, 443, 453, 530, 578, 733]",401 424 427 441 443 453 530 578 733,"[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ..."
3,100038,0.0,21234,2127-07-13,Report,Admission Date: [**2127-7-11**] ...,"[250, 272, 276, 285, 401, 414, 458, 584, 786, ...",250 272 276 285 401 414 458 584 786 V45,"[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ..."
4,100045,0.0,1569,2176-02-15,Report,Admission Date: [**2176-2-5**] D...,"[250, 276, 285, 287, 571, 572, 585, 599, 780, ...",250 276 285 287 571 572 585 599 780 V54,"[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ..."
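
The encoding itself does not depend on scikit-learn. The dependency-free sketch below reproduces the docstring example: it builds a sorted vocabulary of codes and one 0/1 list per summary. It lowercases the codes to mirror `CountVectorizer`'s default `lowercase=True`, which is why V-codes show up as `v45` in the label names. The function name `multi_hot` is hypothetical.

```python
def multi_hot(code_strings):
    """Return (vocab, rows): a sorted code vocabulary and one 0/1 list per input string."""
    vocab = sorted({c.lower() for s in code_strings for c in s.split()})
    position = {code: i for i, code in enumerate(vocab)}
    rows = []
    for s in code_strings:
        row = [0.0] * len(vocab)
        for c in s.split():
            row[position[c.lower()]] = 1.0
        rows.append(row)
    return vocab, rows

vocab, rows = multi_hot(['002 004', '001 003 V45'])
print(vocab)    # ['001', '002', '003', '004', 'v45']
print(rows[0])  # [0.0, 1.0, 0.0, 1.0, 0.0]
```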


In [26]:
CountsPerCode = {}
for x in range(binary_array.shape[1]):
    count = binary_array[:,x].sum()
    code = code_labels[x]
    CountsPerCode[code] = count

In [33]:
codes_ranked_by_count = sorted(CountsPerCode, key=CountsPerCode.get, reverse = True)
codes_ranked_by_count[5:16]

['276', '518', '272', '285', '584', 'v45', '599', '038', '410', '585', '530']

### We looked at the codes ranked 6th through 16th by frequency and chose six that each appear in 10%-25% of cases and represent different types of disease:  
585, 272, 530, 599, 285, 410

In [38]:
print(CountsPerCode['410'], CountsPerCode['585'], CountsPerCode['272'], CountsPerCode['530'],
     CountsPerCode['599'], CountsPerCode['285'])

3745.0 3584.0 7213.0 3584.0 4339.0 6600.0


In [45]:
#indexes that these labels correspond to in the list
relevant_index = [code_labels.index('410'), code_labels.index('585'), code_labels.index('272'),
      code_labels.index('530'), code_labels.index('599'), code_labels.index('285')]

print(relevant_index)

[302, 446, 180, 404, 459, 193]


In [44]:
len(code_labels)

936

### Finally, extract these labels from the "real" preprocessed dataset. (The preprocessing notebook took too long to rerun, but we have a saved copy we can use.)

In [39]:
FTest = load_pkl('./NewDFs/FullTest')
FValid = load_pkl('./NewDFs/FullValidation')

In [40]:
FTrain = load_pkl('./NewDFs/FullTrain')

In [42]:
Full = pd.concat([FTrain, FValid, FTest])
Full = Full.reset_index(drop=True)

In [43]:
Full.head()

Unnamed: 0,Sentences-PlainVocab,Sentences-Leven,Sentences-Byte5K,Sentences-Byte10K,Sentences-Byte25K,Sentences-Hybrid10K,Labels
0,"[[4750, 8785, 21849, 28064, 31007, 18646, 3024...","[[4750, 8785, 21849, 28064, 31007, 18646, 3024...","[[232, 1218, 110, 124, 23, 126, 1378, 1218, 11...","[[419, 2387, 205, 220, 26, 223, 2675, 2387, 20...","[[1358, 6220, 808, 843, 44, 846, 6884, 6220, 8...","[[415, 2378, 204, 219, 26, 221, 2667, 2378, 20...","[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ..."
1,"[[4750, 8785, 21849, 28064, 31007, 18646, 3024...","[[4750, 8785, 21849, 28064, 31007, 18646, 3024...","[[232, 1218, 110, 124, 23, 126, 1378, 1218, 11...","[[419, 2387, 205, 220, 26, 223, 2675, 2387, 20...","[[1358, 6220, 808, 843, 44, 846, 6884, 6220, 8...","[[415, 2378, 204, 219, 26, 221, 2667, 2378, 20...","[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ..."
2,"[[4750, 8785, 21849, 28064, 22019, 18646, 3024...","[[4750, 8785, 21849, 28064, 22019, 18646, 3024...","[[232, 1218, 110, 124, 21, 126, 1378, 1218, 11...","[[419, 2387, 205, 220, 24, 223, 2675, 2387, 20...","[[1358, 6220, 808, 843, 42, 846, 6884, 6220, 8...","[[415, 2378, 204, 219, 24, 221, 2667, 2378, 20...","[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ..."
3,"[[4750, 8785, 21849, 28064, 31007, 18646, 3024...","[[4750, 8785, 21849, 28064, 31007, 18646, 3024...","[[232, 1218, 110, 124, 23, 126, 1378, 1218, 11...","[[419, 2387, 205, 220, 26, 223, 2675, 2387, 20...","[[1358, 6220, 808, 843, 44, 846, 6884, 6220, 8...","[[415, 2378, 204, 219, 26, 221, 2667, 2378, 20...","[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ..."
4,"[[4750, 8785, 21849, 28064, 1918, 18646, 30249...","[[4750, 8785, 21849, 28064, 1918, 18646, 30249...","[[232, 1218, 110, 124, 24, 126, 1378, 1218, 11...","[[419, 2387, 205, 220, 27, 223, 2675, 2387, 20...","[[1358, 6220, 808, 843, 45, 846, 6884, 6220, 8...","[[415, 2378, 204, 219, 27, 221, 2667, 2378, 20...","[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ..."


### Each list in the "Labels" column is currently 936 elements long. We want to shorten each list to 6 elements, keeping only the positions of the ICD9 codes chosen in the section above.

In [59]:
def extract_correct_list(full_list, relevant_list_index):
    '''This function goes through a binary list and picks out the elements whose positions
    appear in "relevant_list_index".
    
    For example, from above we want the elements at indices [302, 446, 180, 404, 459, 193],
    so this will be our "relevant_list_index". full_list will be the 936-long binary list
    covering all ICD9 codes.
    
    Elements are returned in ascending index order, not in the order of relevant_list_index.
    '''
    return [full_list[x] for x in range(len(full_list)) if x in relevant_list_index]

In [69]:
FullTrain = FTrain.copy()
FullValid = FValid.copy()
FullTest = FTest.copy()

FullTrain['Labels'] = FullTrain['Labels'].map(lambda x: extract_correct_list(x, relevant_index))
FullValid['Labels'] = FullValid['Labels'].map(lambda x: extract_correct_list(x, relevant_index))
FullTest['Labels'] = FullTest['Labels'].map(lambda x: extract_correct_list(x, relevant_index))

In [78]:
FullTest.head()

Unnamed: 0,Sentences-PlainVocab,Sentences-Leven,Sentences-Byte5K,Sentences-Byte10K,Sentences-Byte25K,Sentences-Hybrid10K,Labels
26853,"[[4750, 8785, 21849, 28064, 31007, 18646, 3024...","[[4750, 8785, 21849, 28064, 31007, 18646, 3024...","[[232, 1218, 110, 124, 23, 126, 1378, 1218, 11...","[[419, 2387, 205, 220, 26, 223, 2675, 2387, 20...","[[1358, 6220, 808, 843, 44, 846, 6884, 6220, 8...","[[415, 2378, 204, 219, 26, 221, 2667, 2378, 20...","[0.0, 0.0, 0.0, 0.0, 0.0, 1.0]"
26854,"[[4750, 8785, 21849, 28064, 31007, 18646, 3024...","[[4750, 8785, 21849, 28064, 31007, 18646, 3024...","[[232, 1218, 110, 124, 23, 126, 1378, 1218, 11...","[[419, 2387, 205, 220, 26, 223, 2675, 2387, 20...","[[1358, 6220, 808, 843, 44, 846, 6884, 6220, 8...","[[415, 2378, 204, 219, 26, 221, 2667, 2378, 20...","[0.0, 1.0, 1.0, 0.0, 1.0, 0.0]"
26855,"[[4750, 8785, 21849, 28064, 31007, 18646, 3024...","[[4750, 8785, 21849, 28064, 31007, 18646, 3024...","[[232, 1218, 110, 124, 23, 126, 1378, 1218, 11...","[[419, 2387, 205, 220, 26, 223, 2675, 2387, 20...","[[1358, 6220, 808, 843, 44, 846, 6884, 6220, 8...","[[415, 2378, 204, 219, 26, 221, 2667, 2378, 20...","[0.0, 0.0, 0.0, 1.0, 0.0, 1.0]"
26856,"[[4750, 8785, 21849, 28064, 31007, 18646, 3024...","[[4750, 8785, 21849, 28064, 31007, 18646, 3024...","[[232, 1218, 110, 124, 23, 126, 1378, 1218, 11...","[[419, 2387, 205, 220, 26, 223, 2675, 2387, 20...","[[1358, 6220, 808, 843, 44, 846, 6884, 6220, 8...","[[415, 2378, 204, 219, 26, 221, 2667, 2378, 20...","[0.0, 1.0, 0.0, 0.0, 0.0, 1.0]"
26857,"[[4750, 8785, 21849, 28064, 1918, 18646, 30249...","[[4750, 8785, 21849, 28064, 1918, 18646, 30249...","[[232, 1218, 110, 124, 24, 126, 1378, 1218, 11...","[[419, 2387, 205, 220, 27, 223, 2675, 2387, 20...","[[1358, 6220, 808, 843, 45, 846, 6884, 6220, 8...","[[415, 2378, 204, 219, 27, 221, 2667, 2378, 20...","[0.0, 0.0, 0.0, 0.0, 0.0, 0.0]"
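
Note that `extract_correct_list` walks the full list from left to right, so the six retained labels come out in ascending index order rather than in the order in which `relevant_index` was written. The sketch below uses the indices printed earlier to recover which code each of the six positions refers to; the dictionary name `code_positions` is only illustrative.

```python
def extract_correct_list(full_list, relevant_list_index):
    return [full_list[x] for x in range(len(full_list)) if x in relevant_list_index]

# Indices taken from the relevant_index output above
code_positions = {'410': 302, '585': 446, '272': 180, '530': 404, '599': 459, '285': 193}
relevant_index = list(code_positions.values())

# Order of the six labels after truncation: sorted by position in the 936-long list
column_order = sorted(code_positions, key=code_positions.get)
print(column_order)  # ['272', '285', '410', '530', '585', '599']

# Check with a 936-long list that is 1.0 only at the position of code 410
full = [0.0] * 936
full[code_positions['410']] = 1.0
print(extract_correct_list(full, relevant_index))  # [0.0, 0.0, 1.0, 0.0, 0.0, 0.0]
```

Read this way, the last position of each truncated list is code 599 and the first is code 272.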


In [81]:
F = pd.concat([FullTrain, FullValid, FullTest])
print(sum(np.concatenate(F['Labels'])))

29065.0


In [82]:
3745.0+ 3584.0+ 7213.0+ 3584.0+ 4339.0+ 6600.0

29065.0
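
Matching totals confirm that no labels were lost, but not that each position holds the intended code. A stricter check is to sum each of the six positions separately and compare the result with `CountsPerCode` in ascending index order (272, 285, 410, 530, 585, 599), assuming the preprocessed dataset uses the same label positions as `code_labels`. The sketch below shows the per-position sum on the first three label lists from `FullTest.head()`.

```python
# First three truncated label lists shown in FullTest.head()
labels = [
    [0.0, 0.0, 0.0, 0.0, 0.0, 1.0],
    [0.0, 1.0, 1.0, 0.0, 1.0, 0.0],
    [0.0, 0.0, 0.0, 1.0, 0.0, 1.0],
]

# zip(*labels) yields one tuple per label position (column)
per_code = [sum(col) for col in zip(*labels)]
print(per_code)  # [0.0, 1.0, 1.0, 1.0, 1.0, 2.0]
```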

In [86]:
save_as_pkl(FullTrain, './FixedDFs/FullTrain')
save_as_pkl(FullValid, './FixedDFs/FullValidation')
save_as_pkl(FullTest, './FixedDFs/FullTest')

In [87]:
len(FullTest)

2830

In [88]:
len(FullTrain)

20778

In [89]:
len(FullValid)

6075