# Lab 1 - Data Prep

## Read and process data using pandas

The training data for all of today's labs is in the `trainingdata` folder, and the labels are provided in an Excel file named `TrainingLabelData.xlsx`. In this lab, we prepare the data for Labs 2, 3, and 4.

Note: If you want to proceed to Labs 2-4 quickly, you can safely click `Run > Run All Cells` from the menu above and continue. 

In [None]:
import pandas as pd
labels = pd.read_excel('TrainingLabelData.xlsx')
len(labels.impactCategoryName.unique())

Join the labels for each CUSIP with a pipe (`|`), in case this is a multi-label dataset. If every CUSIP has a single label, this still produces a normal multi-class dataset.

In [None]:
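# Sort the unique labels so the joined string is the same on every run (set order is not guaranteed)
grouped_labels = labels.groupby('fileCusip').agg(lambda x: '|'.join(sorted(set(x)))).reset_index()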
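
To make the aggregation concrete, here is a plain-Python sketch of what grouping and joining does for one column. The CUSIPs and category names are made up for illustration, and the sketch sorts the labels before joining because a `set` of strings has no guaranteed order.

```python
from collections import defaultdict

# Hypothetical (CUSIP, category) rows standing in for the Excel sheet
rows = [
    ("AAA111", "Education"),
    ("AAA111", "Water"),
    ("AAA111", "Education"),  # duplicate, collapsed by the set
    ("BBB222", "Housing"),
]

# Collect the distinct categories seen for each CUSIP
labels_by_cusip = defaultdict(set)
for cusip, category in rows:
    labels_by_cusip[cusip].add(category)

# Sorting makes the joined string reproducible between runs
grouped = {cusip: "|".join(sorted(cats)) for cusip, cats in labels_by_cusip.items()}
print(grouped)
```

A CUSIP with one category keeps a single label, while a CUSIP with several gets them joined by `|`.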

In [None]:
grouped_labels.impactCategoryName = grouped_labels.impactCategoryName.apply(lambda x: '_'.join(x.replace('/','').split(' ')))
grouped_labels
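
The normalization above can be tried on its own. This sketch wraps the same string operations in a helper (the name `normalize_label` and the category names are hypothetical) to show how slashes, spaces, and the pipe separator are treated.

```python
def normalize_label(name: str) -> str:
    """Same steps as the cell above: drop '/', then turn each space into '_'."""
    return "_".join(name.replace("/", "").split(" "))

# Hypothetical category names, for illustration only
simple = normalize_label("Clean Water")
slashed = normalize_label("Water / Sewer")
multi = normalize_label("Clean Water|Public Health")
print(simple, slashed, multi)
```

Note that a slash surrounded by spaces leaves two adjacent spaces behind, so `"Water / Sewer"` becomes `Water__Sewer` with a double underscore. The pipe between joined labels is left untouched.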

In [None]:
grouped_labels.impactCategoryName.unique()

## Let's look at some training data

In [None]:
import json
# Use a context manager so the file handle is closed after reading
with open('./trainingdata/011415PX8.json') as fp:
    training_data_sample = json.load(fp)

In [None]:
training_data_sample

### What is the Cusip for this record? 

In [None]:
training_data_sample['cusip']

In [None]:
training_data_sample['file_name'].split('/')[1].split('.')[0]
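
The chained `split` calls assume that `file_name` has exactly one directory level and no extra dots. As a hedged alternative, the sketch below uses `pathlib` to take the file name without its directory or extension. The helper name and the example paths are hypothetical, since the real `file_name` values may look different.

```python
from pathlib import PurePosixPath

def cusip_from_file_name(file_name: str) -> str:
    """Return the last path component without its final extension."""
    return PurePosixPath(file_name).stem

# Hypothetical values; the real 'file_name' field may be shaped differently
one_level = "pdfs/011415PX8.pdf"
two_levels = "bucket/pdfs/011415PX8.pdf"

# For a single directory level, both approaches agree
same = cusip_from_file_name(one_level) == one_level.split('/')[1].split('.')[0]
print(same, cusip_from_file_name(two_levels))
```

With two directory levels, `split('/')[1]` would return the second directory rather than the file, while the `stem` approach still returns the CUSIP.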

### What is the corresponding label for this record? 

In [None]:
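sample_cusip = training_data_sample['file_name'].split('/')[1].split('.')[0]  # CUSIP of the sample loaded above
grouped_labels[grouped_labels.fileCusip==sample_cusip].impactCategoryName.tolist()[0]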

### Do the proceeds or first-page paragraphs carry more label-specific information?

In [None]:
' '.join(training_data_sample['first_page_paragraphs'])[:5000].replace(',','')

In [None]:
' '.join(training_data_sample['use_of_proceeds_paragraphs'])[:5000].replace(',','')

# Great, let's prepare a training dataset!

## 1. Prepare data for Amazon Comprehend Custom Labels (Lab 2)

For training, multi-label mode on Amazon Comprehend supports up to 1 million examples containing up to 100 unique classes.

To train a custom classifier, you can provide training data as a two-column CSV file. In it, labels are provided in the first column, and documents are provided in the second.

Do not include headers for the individual columns. Including headers in your CSV file may cause runtime errors. Each line of the file contains one or more classes and the text of the training document. More than one class can be indicated by using a delimiter (such as a | ) between each class.

```
CLASS,Text of document 1
CLASS,Text of document 2
CLASS|CLASS|CLASS,Text of document 3
```

For example, the following line belongs to a CSV file that trains a custom classifier to detect genres in movie abstracts:
```
COMEDY|MYSTERY|SCIENCE_FICTION|TEEN,"A band of misfit teens become unlikely detectives when they discover troubling clues about their high school English teacher. Could the strange Mrs. Doe be an alien from outer space?"
```


The default delimiter between class names is a pipe (|). However, you can use a different character as a delimiter. The delimiter cannot be part of your class name. For example, if your classes are CLASS_1, CLASS_2, and CLASS_3, the underscore (_) is part of the class name, so you cannot use an underscore as the delimiter for separating class names.
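
As an illustration of the format described above, the following sketch builds rows with Python's `csv` module, which quotes a document automatically when it contains a comma (the movie example above is quoted in the same way). The helper `to_comprehend_row` is a hypothetical name, not part of any AWS library, and the check simply enforces the rule that a class name must not contain the delimiter.

```python
import csv
import io

def to_comprehend_row(classes, text, delimiter="|"):
    """Build one [labels, document] row, rejecting class names that contain the delimiter."""
    for name in classes:
        if delimiter in name:
            raise ValueError(f"class name {name!r} contains the delimiter {delimiter!r}")
    return [delimiter.join(classes), text]

buffer = io.StringIO()  # stands in for an open CSV file
writer = csv.writer(buffer, lineterminator="\n")
writer.writerow(to_comprehend_row(["COMEDY", "MYSTERY"], "A band of misfit teens, and a teacher."))
writer.writerow(to_comprehend_row(["TEEN"], "Text of document 2"))
print(buffer.getvalue())
```

Calling `to_comprehend_row(["CLASS_1"], "text", delimiter="_")` raises a `ValueError`, which mirrors the underscore example in the paragraph above.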


Let's create our Comprehend Custom dataset:

In [None]:
import glob

errors = 0
with open('comprehend_input.csv', 'w') as f:
    for file in glob.glob('./trainingdata/*'):
        try:
            with open(file, 'r') as fp:
                training_data_sample = json.load(fp)
            # Commas are stripped so the text cannot break the two-column CSV layout
            text = ' '.join(training_data_sample['first_page_paragraphs'])[:5000].replace(',', '')
            proctext = ' '.join(training_data_sample['use_of_proceeds_paragraphs'])[:5000].replace(',', '')
            filecusip = training_data_sample['file_name'].split('/')[1].split('.')[0]
            # Raises IndexError when the CUSIP has no label; the record is then skipped
            label = grouped_labels[grouped_labels.fileCusip == filecusip].impactCategoryName.tolist()[0]
            # Write several prefixes of the same text as separate examples, since Comprehend
            # needs at least 10 examples in each category. When you have more data, delete
            # the first four writes and keep only the full-text one.
            f.write(label + ',' + text[:1000] + '\n')
            f.write(label + ',' + text[:2000] + '\n')
            f.write(label + ',' + text[:3000] + '\n')
            f.write(label + ',' + text[:4000] + '\n')
            f.write(label + ',' + text + '\n')

            f.write('PROCEEDS_PARA,' + proctext + '\n')

        except Exception as e:
            # Report the file being processed; training_data_sample may be stale if loading failed
            errors += 1
            print('***')
            print(file, repr(e))
            print('***')

Here, we skip records that don't have a label: the label lookup raises an exception, and the record is counted in `errors`.

In [None]:
errors
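
The comment in the cell above notes that Comprehend needs at least 10 examples in each category. One way to check this before uploading is to count the rows per class, as in the sketch below. The helper name and the sample rows are hypothetical. In the notebook you could pass `open('comprehend_input.csv')` instead of the list.

```python
import csv
from collections import Counter

def examples_per_class(lines, delimiter="|"):
    """Count how many rows mention each class in two-column label,text rows."""
    counts = Counter()
    for row in csv.reader(lines):
        if not row:
            continue  # skip blank lines
        counts.update(row[0].split(delimiter))
    return counts

# Hypothetical rows in the same label,text layout as comprehend_input.csv
sample = ["A|B,first text", "A,second text", "PROCEEDS_PARA,third text"]
counts = examples_per_class(sample)

# Classes that fall below the minimum mentioned in the cell above
too_few = {name: n for name, n in counts.items() if n < 10}
print(counts, too_few)
```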

In [None]:
!cat comprehend_input.csv

In [None]:
import sagemaker

In [None]:
lab2path = sagemaker.session.Session().upload_data(path='comprehend_input.csv',key_prefix='data')

#### ^ Copy this S3 location and move on to Lab 2!

In [None]:
%store lab2path

## 2. Prepare data for Amazon SageMaker Training (Lab 3)

We prepare a dataset that can be ingested using Hugging Face on SageMaker for Lab 3. The file contains one JSON object per line. For example, for the Amazon reviews dataset, we could create a dataset that looks like the following:

```
{"label":4,"review":"These are awesome, I like them somjuxh I already ordered another pair"}
{"label":4,"review":"This was purchased as a gift for my son who is a GoT lover and when he got it on Christmas day his face lit up with a smile. He wasn't sure anyone would buy it for him or not so this was a double win. Well done Crazy Dog."}
{"label":4,"review":"Nice looking shirt, actually better looking in person than in the picture<br \/>I got an XLG, (I am not a big guy but don't like clingy, tight clothing). I might have been able to get a LG, but after a washing I think it fits just right and it really didn't shrink much.<br \/>I ordered it on a Sunday and it was shipped on Monday and delivered on Wednesday So RockWaresUSA did a great job on their end as well."}
{"label":3,"review":"as expected"}
```
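
A file in this shape can be written and read back with the standard `json` module alone. The sketch below uses an in-memory buffer and made-up records. It also shows that `json.dumps` escapes newlines inside a string, so each record stays on a single line.

```python
import io
import json

# Made-up records in the same shape as the example above
records = [
    {"label": 4, "review": "These are awesome"},
    {"label": 3, "review": "line one\nline two"},  # embedded newline is escaped by json.dumps
]

buffer = io.StringIO()  # stands in for a file opened for writing
for record in records:
    buffer.write(json.dumps(record) + "\n")

# Reading back: parse each non-empty line as its own JSON document
loaded = [json.loads(line) for line in buffer.getvalue().splitlines() if line]
print(len(loaded))
```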

#### First, label encode the data

In [None]:
grouped_labels2 = grouped_labels.copy()

# converting type of columns to 'category'
grouped_labels2['impactCategoryName'] = grouped_labels2['impactCategoryName'].astype('category')
# Assigning numerical values and storing in another column

grouped_labels2['impactCategoryName'] = grouped_labels2['impactCategoryName'].cat.codes

grouped_labels2.impactCategoryName.unique()

In [None]:
# Create a category for the proceeds paragraph
proclabel = len(grouped_labels2.impactCategoryName.unique()) + 1

#### Great! Now let's create the dataset for Lab 3

In [None]:
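# Map each raw category name to an integer id, starting at 1
labeldict = {name: i for i, name in enumerate(labels.impactCategoryName.unique().tolist(), 1)}

# The proceeds paragraphs get their own class, with the next free id
labeldict['proclabel'] = len(labeldict) + 1
labeldict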
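
When you later inspect model predictions, it can help to map ids back to names. A minimal sketch, using a hypothetical mapping in the same shape as `labeldict`:

```python
# Hypothetical mapping shaped like labeldict (ids start at 1, 'proclabel' comes last)
label2id = {"Education": 1, "Water": 2, "proclabel": 3}

# Invert the mapping; this is safe because every id is unique
id2label = {idx: name for name, idx in label2id.items()}
print(id2label[2])
```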

In [None]:
import glob
import json
import re

errors = 0
WINDOW_SIZE = 500  # words per training example

def write_windows(f, label, text):
    """Write text to f as JSON Lines, one record per window of up to WINDOW_SIZE words."""
    words = text.split(' ')
    for start in range(0, len(words), WINDOW_SIZE):
        record = {"label": labeldict[label], "source": ' '.join(words[start:start + WINDOW_SIZE])}
        f.write(json.dumps(record) + '\n')


with open('sagemaker_input.json', 'w') as f:
    for file in glob.glob('./trainingdata/*'):
        try:
            with open(file, 'r') as fp:
                training_data_sample = json.load(fp)

            # Keep only letters, digits, and spaces
            text = re.sub(r'[^A-Za-z0-9 ]+', '', ' '.join(training_data_sample['first_page_paragraphs']))
            proctext = re.sub(r'[^A-Za-z0-9 ]+', '', ' '.join(training_data_sample['use_of_proceeds_paragraphs']))

            filecusip = training_data_sample['file_name'].split('/')[1].split('.')[0]
            # Look up the raw label names because labeldict is keyed on them. Only the first
            # label is kept, and a CUSIP without a label raises IndexError and is skipped.
            label = labels[labels.fileCusip == filecusip].impactCategoryName.tolist()[0]

            write_windows(f, label, text)
            write_windows(f, 'proclabel', proctext)

        except Exception as e:
            # Count and report the failing file, then continue with the next one
            errors += 1
            print(file, repr(e))
            print('***')

In [None]:
errors
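
The windowing used above can be tested in isolation. This sketch reproduces the same slicing with a small window so the result is easy to read. The function name `split_into_windows` is hypothetical.

```python
def split_into_windows(text, window_size=500):
    """Split text on single spaces and regroup the words into consecutive windows."""
    words = text.split(' ')
    return [' '.join(words[i:i + window_size]) for i in range(0, len(words), window_size)]

windows = split_into_windows("a b c d e f g", window_size=3)
print(windows)  # ['a b c', 'd e f', 'g']

# An empty string still produces one (empty) window, because ''.split(' ') returns ['']
empty = split_into_windows("")
print(empty)
```

The last window is usually shorter than the others, and an empty document yields a single record with an empty `source`, so you may want to filter such records out before training.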

In [None]:
lab3path = sagemaker.session.Session().upload_data(path='sagemaker_input.json',key_prefix='data')
%store lab3path

## 3. Finally, prepare data for Lab 4

In [None]:
import glob
import json
import re

errors = 0
WINDOW_SIZE = 500  # words per training example

def write_windows(f, label, text):
    """Write text to f as JSON Lines with a multi-hot label vector, one record per window."""
    words = text.split(' ')

    # Multi-hot vector: position labeldict[name] is set to 1 for each label of the document.
    # The ids in labeldict start at 1, so position 0 always stays 0.
    labellist = [0] * len(labeldict)
    for l in label:
        labellist[labeldict[l]] = 1

    for start in range(0, len(words), WINDOW_SIZE):
        f.write(json.dumps({"label": labellist, "source": ' '.join(words[start:start + WINDOW_SIZE])}) + '\n')


with open('sagemaker_input_hf.json', 'w') as f:
    for file in glob.glob('./trainingdata/*'):
        try:
            with open(file, 'r') as fp:
                training_data_sample = json.load(fp)

            # Keep only letters, digits, and spaces
            text = re.sub(r'[^A-Za-z0-9 ]+', '', ' '.join(training_data_sample['first_page_paragraphs']))
            # Not written in this lab; kept so that records missing this field still count as errors
            proctext = re.sub(r'[^A-Za-z0-9 ]+', '', ' '.join(training_data_sample['use_of_proceeds_paragraphs']))

            filecusip = training_data_sample['file_name'].split('/')[1].split('.')[0]
            # All labels for this CUSIP; a CUSIP without labels gives an empty list and an all-zero vector
            label = labels[labels.fileCusip == filecusip].impactCategoryName.tolist()

            write_windows(f, label, text)

        except Exception as e:
            # Count and report the failing file, then continue with the next one
            errors += 1
            print('***')
            print(file, repr(e))
            print('***')
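
The label vector built in `write_windows` is a multi-hot encoding. The sketch below isolates that step with a hypothetical mapping shaped like `labeldict`, where ids start at 1 and the last entry is `proclabel`.

```python
def multi_hot(names, label2id):
    """Return a 0/1 list with a 1 at position label2id[name] for each name."""
    vector = [0] * len(label2id)
    for name in names:
        vector[label2id[name]] = 1
    return vector

# Hypothetical mapping shaped like labeldict
label2id = {"Education": 1, "Water": 2, "proclabel": 3}

two_labels = multi_hot(["Education", "Water"], label2id)
no_labels = multi_hot([], label2id)
print(two_labels, no_labels)  # [0, 1, 1] [0, 0, 0]
```

Because the ids start at 1 while list positions start at 0, the vector is just long enough for the real categories. Passing `'proclabel'` would index one past the end and raise an `IndexError`, which is not a problem here only because this lab never writes the proceeds paragraphs.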

In [None]:
lab4path = sagemaker.session.Session().upload_data(path='sagemaker_input_hf.json',key_prefix='data')
%store lab4path

### DONE! Please proceed to Labs 2, 3, and 4, in order

Note: If you ran all cells, make sure there were no errors and that the following three files were generated successfully:

1. comprehend_input.csv
2. sagemaker_input.json
3. sagemaker_input_hf.json