# Feature Map-back

## Task Description

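__Given files 1-5 containing GPT-2 features (see the "feature_engg" notebook), map their rows back to the original datasets.__

Row ranges below are 0-based and half-open (start inclusive, end exclusive), matching the pandas slices used in the code. The record counts imply that files 1-4 each hold 300,000 rows.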

- snli_train_premise: all of file 1; 0 - 250,152 (250,152 records) of file 2
- snli_train_hypothesis: from 250,152 in file 2 (49,848 records); all of file 3; 0 - 200,304 (200,304 records) of file 4
- snli_valid_premise: 200,304 - 210,304 in file 4 (10,000 records)
- snli_valid_hypothesis: 210,304 - 220,304 in file 4 (10,000 records)
- snli_test_premise: 220,304 - 230,304 in file 4 (10,000 records)
- snli_test_hypothesis: 230,304 - 240,304 in file 4 (10,000 records)
- hans_train_premise: 240,304 - 270,304 in file 4 (30,000 records)
- hans_train_hypothesis: from 270,304 in file 4 (29,696 records); 0 - 304 of file 5 (304 records)
- hans_valid_premise: 304 - 30,304 in file 5 (30,000 records)
- hans_valid_hypothesis: 30,304 - 60,304 in file 5 (30,000 records)
- nli_diagnostics_premise: 60,304 - 61,408 in file 5 (1,104 records)
- nli_diagnostics_hypothesis: 61,408 onwards in file 5 (expected 1,104 records, matching the premise block)
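
The boundaries listed above can be re-derived from the dataset sizes alone. The following is a minimal sketch, not part of the original pipeline: it assumes that the features were written back to back in the order listed, that each file before the last holds 300,000 rows, and that the diagnostics hypothesis block has 1,104 rows. The names `segments` and `locate` are hypothetical.

```python
# Assumption: every file except the last holds 300,000 rows.
ROWS_PER_FILE = 300_000

# (name, record count), in the order the features are assumed to have been written.
segments = [
    ("snli_train_premise", 550_152),
    ("snli_train_hypothesis", 550_152),
    ("snli_valid_premise", 10_000),
    ("snli_valid_hypothesis", 10_000),
    ("snli_test_premise", 10_000),
    ("snli_test_hypothesis", 10_000),
    ("hans_train_premise", 30_000),
    ("hans_train_hypothesis", 30_000),
    ("hans_valid_premise", 30_000),
    ("hans_valid_hypothesis", 30_000),
    ("nli_diagnostics_premise", 1_104),
    ("nli_diagnostics_hypothesis", 1_104),  # assumed equal to the premise count
]

def locate(segments, rows_per_file):
    """Return {name: [(file_number, start, stop), ...]} with half-open row ranges."""
    layout = {}
    offset = 0  # running row offset across all files
    for name, count in segments:
        pieces = []
        remaining = count
        while remaining > 0:
            file_index, start = divmod(offset, rows_per_file)
            take = min(remaining, rows_per_file - start)
            pieces.append((file_index + 1, start, start + take))
            offset += take
            remaining -= take
        layout[name] = pieces
    return layout

layout = locate(segments, ROWS_PER_FILE)
for name, pieces in layout.items():
    print(name, pieces)
```

Each tuple is `(file number, start row, stop row)`, so a dataset that straddles a file boundary shows up as more than one tuple.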

__Ultimately, the following datasets are needed:__
- snli_train: 550152 obs x 768 features x 2 (hypothesis + premise) - final shape: 550152 x 1536
- snli_valid: 10000 obs x 768 features x 2 (hypothesis + premise) - final shape: 10000 x 1536
- snli_test: 10000 obs x 768 features x 2 (hypothesis + premise) - final shape: 10000 x 1536
- hans (includes train and valid folds): 30000 obs x 768 features x 2 (hypothesis + premise) x 2 (train + valid) - final shape: 60000 x 1536
- nli_diagnostics: 1104 obs x 768 features x 2 (hypothesis + premise) - final shape: 1104 x 1536
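
As an illustration of the target layout, the sketch below builds a toy frame the same way the notebook does, with premise features in columns 0-767 and hypothesis features in columns 768-1535. The toy data and the names `premise`, `hypothesis` and `combined` are assumptions made for this example only.

```python
import numpy as np
import pandas as pd

N_FEATURES = 768  # feature width per sentence, as stated in the list above

# Toy stand-ins for one dataset's premise and hypothesis blocks (3 rows each).
premise = pd.DataFrame(np.zeros((3, N_FEATURES)))
hypothesis = pd.DataFrame(np.ones((3, N_FEATURES)))

# Premise first, hypothesis second; ignore_index relabels the columns 0..1535.
combined = pd.concat([premise, hypothesis], axis=1, ignore_index=True)
print(combined.shape)

# The two halves can be recovered by column position.
premise_part = combined.iloc[:, :N_FEATURES]
hypothesis_part = combined.iloc[:, N_FEATURES:]
```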

### 1. Imports

In [17]:
import numpy as np
import time
import pandas as pd

### 2. Read files in and perform mapping
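- Files are loaded, sliced and deleted in an order chosen to fit the available resources (memory and disk): each file is dropped as soon as its rows have been copied into a dataset.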
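
The load cells below all repeat the same timing pattern. One possible way to factor it out is a small context manager. This is only a sketch, and the helper name `timed` is hypothetical, not something the notebook defines.

```python
import time
from contextlib import contextmanager

@contextmanager
def timed(label):
    """Print the wall-clock time spent inside the with-block."""
    start = time.perf_counter()
    try:
        yield
    finally:
        elapsed = time.perf_counter() - start
        print("--- %s: %.2f seconds ---" % (label, elapsed))

# Usage, mirroring the load cells in this notebook:
# with timed("out1.csv"):
#     file1 = pd.read_csv('../data/features/out1.csv', header=None)
```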

In [2]:
start_time = time.time()
file1 = pd.read_csv('../data/features/out1.csv', header = None)
print("--- %s seconds ---" % (time.time() - start_time))

--- 55.29924917221069 seconds ---


In [4]:
start_time = time.time()
file2 = pd.read_csv('../data/features/out2.csv', header = None)
print("--- %s seconds ---" % (time.time() - start_time))

--- 59.43683886528015 seconds ---


In [6]:
start_time = time.time()
file3 = pd.read_csv('../data/features/out3.csv', header = None)
print("--- %s seconds ---" % (time.time() - start_time))

--- 60.7894070148468 seconds ---


In [8]:
snli_train_premise = pd.concat([file1, file2.iloc[0:250152, :]], axis = 0).reset_index(drop=True)
del file1

In [10]:
start_time = time.time()
file4 = pd.read_csv('../data/features/out4.csv', header = None)
print("--- %s seconds ---" % (time.time() - start_time))

--- 60.93753695487976 seconds ---


In [11]:
snli_train_hypothesis = pd.concat(
    [file2.iloc[250152:, :], file3, file4[0:200304]], axis = 0).reset_index(drop=True)
del file2, file3

In [13]:
snli_train = pd.concat([snli_train_premise, snli_train_hypothesis], axis = 1, ignore_index = True)
del snli_train_premise, snli_train_hypothesis
snli_train.shape

(550152, 1536)

In [25]:
snli_valid = pd.concat([file4[200304:210304].reset_index(drop=True), 
                        file4[210304:220304].reset_index(drop=True)], axis = 1, ignore_index = True)

In [26]:
snli_test = pd.concat([file4[220304:230304].reset_index(drop=True),
                       file4[230304:240304].reset_index(drop=True)], axis = 1, ignore_index = True)
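
The `reset_index(drop=True)` calls matter because `pd.concat(..., axis=1)` aligns rows on index labels, not on position. The toy example below (made-up data, not taken from the feature files) shows what would happen without them.

```python
import pandas as pd

# Toy stand-in for a feature file: 6 rows, one feature column.
toy = pd.DataFrame({0: [10, 11, 12, 13, 14, 15]})

premise = toy[0:3]     # keeps index labels 0, 1, 2
hypothesis = toy[3:6]  # keeps index labels 3, 4, 5

# Without resetting, the labels do not overlap, so the result has 6 rows
# and every row is missing one of its two halves.
misaligned = pd.concat([premise, hypothesis], axis=1, ignore_index=True)
print(misaligned.shape)

# After resetting, both halves carry labels 0..2 and pair up row by row.
aligned = pd.concat([premise.reset_index(drop=True),
                     hypothesis.reset_index(drop=True)],
                    axis=1, ignore_index=True)
print(aligned.shape)
```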

In [21]:
start_time = time.time()
file5 = pd.read_csv('../data/features/out5.csv', header = None)
print("--- %s seconds ---" % (time.time() - start_time))

--- 11.934419870376587 seconds ---


In [34]:
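# Premise rows: train fold (file 4) followed by valid fold (file 5).
hans_premise = pd.concat([file4[240304:270304].reset_index(drop=True),
                          file5[304:30304].reset_index(drop=True)], axis = 0).reset_index(drop=True)
# Hypothesis rows: the train fold straddles files 4 and 5, then the valid fold (file 5).
hans_hypothesis = pd.concat([file4[270304:].reset_index(drop=True),
                             file5[0:304].reset_index(drop=True),
                             file5[30304:60304].reset_index(drop=True)], axis = 0).reset_index(drop=True)
# Side by side: columns 0-767 hold premise features, 768-1535 hypothesis features.
hans = pd.concat([hans_premise, hans_hypothesis], axis = 1, ignore_index = True)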

In [38]:
del hans_premise, hans_hypothesis, file4

In [39]:
nli_diagnostics = pd.concat([file5[60304:61408].reset_index(drop=True),
                             file5[61408:].reset_index(drop=True)], axis = 1, ignore_index = True)

In [41]:
del file5

### 3. Write back to disk
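- As above, the order of operations is driven by the available resources (memory and disk): each dataset is deleted right after it is written, and the largest one (snli_train) is written last.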

In [42]:
nli_diagnostics.to_csv('../data/features/nli_diagnostics.csv')

In [43]:
del nli_diagnostics

In [45]:
hans.to_csv('../data/features/hans.csv')

In [46]:
del hans

In [47]:
snli_test.to_csv('../data/features/snli_test.csv')
del snli_test

In [48]:
snli_valid.to_csv('../data/features/snli_valid.csv')
del snli_valid

In [49]:
snli_train.to_csv('../data/features/snli_train.csv')
del snli_train
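
Because `to_csv` is called with its defaults, each output file contains a header row and a leading index column. A downstream reader would therefore need something like `index_col=0` to get back a frame of the original shape. The sketch below demonstrates this on a toy frame written to an in-memory buffer, not on the real files.

```python
import io

import numpy as np
import pandas as pd

# Toy stand-in for one of the combined frames (2 rows, 4 columns).
frame = pd.DataFrame(np.arange(8.0).reshape(2, 4))

buffer = io.StringIO()
frame.to_csv(buffer)  # same defaults as the cells above
buffer.seek(0)

# index_col=0 consumes the saved index; the header row supplies column names.
restored = pd.read_csv(buffer, index_col=0)
print(restored.shape)
```

Note that the column labels come back as strings (`'0'`, `'1'`, ...), not integers, so positional access with `.iloc` is the safer way to split premise and hypothesis columns after reloading.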