## Milestone 1: Data Preprocessing

1. Open the data
2. Convert `reply_body` from a string to a list of strings using `ast.literal_eval`

In [15]:
import pandas as pd
import ast
raw_df = pd.read_csv('goodnotes_submission.csv')
raw_df['reply_body'] = raw_df['reply_body'].apply(ast.literal_eval)
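
As a small illustration of step 2, the sketch below shows how `ast.literal_eval` turns a stringified list back into a Python list. The `parse_replies` helper and its fallback to an empty list are assumptions added for illustration; the notebook itself applies `ast.literal_eval` directly and would raise on a malformed cell.

```python
import ast

def parse_replies(raw):
    """Parse a stringified list; return [] when the value cannot be parsed (hypothetical helper)."""
    if not isinstance(raw, str):
        return []  # e.g. NaN coming from an empty CSV cell
    try:
        value = ast.literal_eval(raw)
    except (ValueError, SyntaxError):
        return []  # not a valid Python literal
    return value if isinstance(value, list) else []

print(parse_replies("['Try the lasso tool', 'Works for me']"))
print(parse_replies(float('nan')))
```

Whether to fall back silently or fail loudly depends on how much the CSV can be trusted; failing loudly, as the notebook does, surfaces bad rows earlier.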

3. Create a new field called `all_text` that combines `submission_title`, `submission_selftext`, and the first three replies from `reply_body`

In [16]:
for idx, row in raw_df.iterrows():
    all_text = row.submission_title
    if isinstance(row.submission_selftext, str):  # skip NaN self-text
        all_text += row.submission_selftext
    if isinstance(row.reply_body, list):
        all_text += ' '.join(row.reply_body[:3])  # include the first 3 replies
    
    raw_df.loc[idx,'all_text'] = all_text
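
The same field can also be built without writing back into the frame row by row. The sketch below is one possible alternative on a made-up toy frame; `build_all_text` and `max_replies` are hypothetical names. Note one deliberate difference: it joins every part with a space, whereas the loop above concatenates the title and self-text with no separator.

```python
import pandas as pd

# Toy frame with the same column names as the notebook (values are made up).
toy_df = pd.DataFrame({
    'submission_title': ['Lasso tool question', 'New template pack'],
    'submission_selftext': ['How do I resize?', float('nan')],
    'reply_body': [['Tap and drag.', 'Use two fingers.', 'See settings.', 'Ignored reply'], []],
})

def build_all_text(row, max_replies=3):
    """Join the title, the self-text (if any) and the first `max_replies` replies with spaces."""
    parts = [row['submission_title']]
    if isinstance(row['submission_selftext'], str):
        parts.append(row['submission_selftext'])
    if isinstance(row['reply_body'], list):
        parts.extend(row['reply_body'][:max_replies])
    return ' '.join(parts)

toy_df['all_text'] = toy_df.apply(build_all_text, axis=1)
print(toy_df['all_text'].tolist())
```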

4. Split `raw_df` into `unlabelled_df` and `labelled_df` based on whether `submission_link_flair_text` is present

In [17]:
unlabelled_df = raw_df[raw_df['submission_link_flair_text'].isnull()]
labelled_df = raw_df[raw_df['submission_link_flair_text'].notnull()]

5. Simplify the problem: any tag that contains `Question` (for example `Questions - iPhone` or `Question - Other`) is reduced to `Question`

In [18]:
labelled_df = labelled_df.copy()  # explicit copy avoids pandas' SettingWithCopyWarning
labelled_df.loc[labelled_df.submission_link_flair_text.str.contains('Question'),'submission_link_flair_text'] = 'Question'
print(unlabelled_df.shape, labelled_df.shape)

(438, 6) (241, 6)


6. Train/validation split, with 20% of the labelled data held out as the validation set

In [19]:
from sklearn.model_selection import train_test_split

X = list(labelled_df['all_text'])
y = list(labelled_df['submission_link_flair_text'])
# Stratify so each tag keeps roughly the same share in both splits
train_texts, val_texts, train_labels, val_labels = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)
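
Because the split is stratified, each tag should make up roughly the same fraction of both lists. A minimal way to check this, using only the standard library, might look like the sketch below. `label_share` is a hypothetical helper and the toy labels are made up; in the notebook it would be called on `train_labels` and `val_labels`.

```python
from collections import Counter

def label_share(labels):
    """Return each label's fraction of the list, rounded to 3 decimals."""
    counts = Counter(labels)
    total = len(labels)
    return {label: round(n / total, 3) for label, n in counts.items()}

# Made-up labels standing in for train_labels / val_labels.
toy_train = ['Question'] * 8 + ['Templates'] * 2
toy_val = ['Question'] * 4 + ['Templates'] * 1

print(label_share(toy_train))
print(label_share(toy_val))
```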

7. Create `train_tag_df` and `val_tag_df` with `prefix`, `input_text` (text), and `target_text` (labels) columns

In [20]:
category = ['tag classification'] * len(train_labels)
train_tag_df = pd.DataFrame({'prefix': category,
                             'input_text': train_texts,
                             'target_text': train_labels})

In [21]:
category = ['tag classification'] * len(val_labels)
val_tag_df = pd.DataFrame({'prefix': category,
                           'input_text': val_texts,
                           'target_text': val_labels})
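
The training and validation frames are built the same way, so the construction could be wrapped in a small function. The sketch below is one way to do that; `make_tag_df` is a hypothetical helper, not part of the notebook, and the demo rows are made up.

```python
import pandas as pd

def make_tag_df(texts, labels, prefix='tag classification'):
    """Build a frame with prefix / input_text / target_text columns (hypothetical helper)."""
    if len(texts) != len(labels):
        raise ValueError('texts and labels must have the same length')
    return pd.DataFrame({
        'prefix': [prefix] * len(labels),
        'input_text': list(texts),
        'target_text': list(labels),
    })

demo_df = make_tag_df(['How do I export?', 'Free planner'], ['Question', 'Templates'])
print(demo_df.columns.tolist())
```

With such a helper, the two cells above would reduce to `make_tag_df(train_texts, train_labels)` and `make_tag_df(val_texts, val_labels)`.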

8. Export `train_tag_df` as `singletask_noupsampling_train.csv`,
`val_tag_df` as `singletask_noupsampling_val.csv`, and
`unlabelled_df` as `unlabelled_df.csv`

In [22]:
train_tag_df.to_csv('singletask_noupsampling_train.csv',index=False)
val_tag_df.to_csv('singletask_noupsampling_val.csv',index=False)
unlabelled_df.to_csv('unlabelled_df.csv',index=False)

9. We want to upsample certain groups of data and see whether that helps training performance.

    - Create a helper that replicates rows a given number of times based on their tag, and adds the copies to the existing training data.

    - Add 4 extra copies of every 'Review' and 'Stylus problems' row and 2 extra copies of every 'Templates' row (matching `num_repetitions` in the code below)

    - Do not add new rows to `val_tag_df`

In [24]:
from typing import List

class Upsampler:
    """Append extra copies of rows whose label is in `label_targets`."""

    def __init__(self, num_repetitions: int, label_targets: List[str]):
        self.num_repetitions = num_repetitions  # extra copies added per matching row
        self.label_targets = label_targets
        self.upsample = None

    def _generate_text(self, target, train_texts, train_labels):
        # Collect every (text, label) pair whose label equals `target`
        new_train_text = []
        new_labels = []
        for text, label in zip(train_texts, train_labels):
            if label == target:
                new_train_text.append(text)
                new_labels.append(label)
        return new_train_text, new_labels

    def create_upsample(self, train_texts, train_labels):
        self.upsample = [self._generate_text(target, train_texts, train_labels)
                         for target in self.label_targets]

    def insert_to_data(self, train_texts, train_labels):
        # Note: extends the input lists in place and also returns them
        self.create_upsample(train_texts, train_labels)
        for new_text, new_labels in self.upsample:
            for _ in range(self.num_repetitions):
                train_texts.extend(new_text)
                train_labels.extend(new_labels)
        return train_texts, train_labels

In [25]:
cur_sampler = Upsampler(num_repetitions=4,label_targets=['Review','Stylus problems'])
train_texts, train_labels = cur_sampler.insert_to_data(train_texts,train_labels)
cur_sampler = Upsampler(num_repetitions=2,label_targets=['Templates'])
train_texts, train_labels = cur_sampler.insert_to_data(train_texts,train_labels)
print("Post up-sampling:",len(train_texts), len(train_labels))

Post up-sampling: 326 326


In [26]:
pd.Series(train_labels).value_counts()

Question           143
Templates           93
Review              70
Stylus problems     20
dtype: int64
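
The counts above can be sanity-checked with a small, non-mutating variant of the upsampling step. This is a sketch under stated assumptions: `upsample` and `extra_copies` are hypothetical names, the texts are placeholders, and the pre-upsampling counts (143, 31, 14, 4) are inferred from the output above rather than printed by the notebook.

```python
from collections import Counter

def upsample(texts, labels, extra_copies):
    """Return new lists with `extra_copies[label]` additional copies of each matching row."""
    new_texts, new_labels = list(texts), list(labels)  # leave the inputs untouched
    for text, label in zip(texts, labels):
        for _ in range(extra_copies.get(label, 0)):
            new_texts.append(text)
            new_labels.append(label)
    return new_texts, new_labels

# Assumed pre-upsampling counts, inferred from the value_counts output above.
base_counts = {'Question': 143, 'Templates': 31, 'Review': 14, 'Stylus problems': 4}
labels = [label for label, n in base_counts.items() for _ in range(n)]
texts = [f'post {i}' for i in range(len(labels))]  # placeholder texts

up_texts, up_labels = upsample(
    texts, labels, {'Review': 4, 'Stylus problems': 4, 'Templates': 2})
print(len(up_labels), Counter(up_labels))
```

Under these assumptions the arithmetic is 143 + 31×3 + 14×5 + 4×5 = 326, which agrees with the "Post up-sampling" total. Returning new lists rather than extending the inputs also keeps the original split available for the no-upsampling export.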

10. Export the expanded `train_tag_df` as `singletask_train.csv`

In [29]:
category = ['tag classification'] * len(train_labels)
train_tag_df = pd.DataFrame({'prefix': category,
                             'input_text': train_texts,
                             'target_text': train_labels})

In [30]:
train_tag_df.to_csv('singletask_train.csv',index=False)