### Preprocessing
**Purpose**  
Apply the cleaning steps identified in notebook 2 and save the results to new files. Engineer additional features to improve model performance.

*High-Level Approach*
 * Combine all text into single feature
 * Remove placeholders
 * Engineer post length and title length features
 * Engineer sentiment analysis features
 
The amount of data gathered, together with the EDA in the previous notebook and the feature engineering in this one, leads me to expect that my models will predict the subreddit well above the baseline accuracy. If so, I will be able to answer my problem statement for this particular case: how much does accuracy increase as larger training datasets are used?

In [2]:
import pandas as pd
import numpy as np

In [3]:
import httpimport
url = 'https://raw.githubusercontent.com/zach-brown-18/class-toolkit/main/eda/'
with httpimport.remote_repo(['nlp'], url):
    import nlp

url = 'https://raw.githubusercontent.com/zach-brown-18/class-toolkit/main/feature-engineering/'
with httpimport.remote_repo(['nlp_features'], url):
    import nlp_features

## Functions

In [4]:
def load_data_combine(file1, file2):
    '''Loads files and combines into single DataFrame.'''
    base_path = '../data/raw/'
    df1 = pd.read_csv(base_path + file1)
    df2 = pd.read_csv(base_path + file2)
    return pd.concat([df1, df2]).reset_index(drop=True)

In [5]:
def add_len_columns(df):
    '''Adds title and selftext character-length columns in place.'''
    df['title_length'] = df['title'].map(len)
    df['selftext_length'] = df['selftext'].map(len)

In [6]:
def clean(df):
    '''Removes NaN and placeholder values. Adds length columns. Combines text into a single column. Drops old text columns. Binarizes target column.'''
    new_df = df.copy()
    new_df['selftext'] = new_df['selftext'].fillna('')
    # Blank out cells that consist solely of a Reddit placeholder.
    new_df = new_df.replace(['[removed]', '[deleted]'], '')
    add_len_columns(new_df)
    new_df['all_text'] = new_df['title'] + ' ' + new_df['selftext']
    new_df['all_text'] = new_df['all_text'].apply(nlp.expand_contractions)
    new_df.drop(columns=['selftext', 'title'], inplace=True)
    new_df['subreddit'] = new_df['subreddit'].map({'oceans':1, 'diving':0})
    
    return new_df

In [7]:
def load_clean_data(file1, file2):
    return clean(load_data_combine(file1, file2))
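
As a quick illustration of what the cleaning steps do, here is a minimal sketch on a toy DataFrame. The rows are hypothetical (not taken from the scraped files), and contraction expansion is left out because it depends on the remotely imported `nlp` module. Note that `replace` with a plain string only matches whole cells, so a placeholder embedded in longer text would be left alone.

```python
import pandas as pd

# Hypothetical rows for illustration only.
toy = pd.DataFrame({
    'title': ['Reef dive', 'Kelp forest'],
    'selftext': ['[removed]', None],
    'subreddit': ['diving', 'oceans'],
})

# Missing post bodies become empty strings.
toy['selftext'] = toy['selftext'].fillna('')
# Cells that are exactly a placeholder become empty strings.
toy = toy.replace(['[removed]', '[deleted]'], '')
# Lengths are computed after placeholder removal, so removed posts have length 0.
toy['title_length'] = toy['title'].str.len()
toy['selftext_length'] = toy['selftext'].str.len()
toy['all_text'] = toy['title'] + ' ' + toy['selftext']
toy['subreddit'] = toy['subreddit'].map({'oceans': 1, 'diving': 0})

print(toy[['title_length', 'selftext_length', 'all_text', 'subreddit']])
```

Because the lengths are taken after the placeholders are blanked, a removed post contributes a `selftext_length` of 0 rather than the length of the string `[removed]`.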

---
## Load the data

In [8]:
small = load_clean_data('oceans.csv', 'diving.csv')
medium = load_clean_data('oceans-medium.csv', 'diving-medium.csv')
large = load_clean_data('oceans-large.csv', 'diving-large.csv')

In [9]:
print(f'Small: {small.shape}')
print(f'Medium: {medium.shape}')
print(f'Large: {large.shape}')

Small: (3000, 4)
Medium: (5300, 4)
Large: (7491, 4)


*Perspective*  
The medium dataset has 77% more posts than the small (baseline) dataset.  
The large dataset has 150% more posts than the small (baseline) dataset.
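
The percentages above follow from the row counts printed earlier. A small sketch of the arithmetic, with the counts copied from that output and the variable names chosen only for illustration:

```python
# Row counts copied from the shapes printed above.
sizes = {'small': 3000, 'medium': 5300, 'large': 7491}
baseline = sizes['small']

# Percent more posts than the small dataset, rounded to the nearest whole percent.
pct_more = {
    name: round(100 * (n - baseline) / baseline)
    for name, n in sizes.items()
}
print(pct_more)
```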

## Sentiment Column

In [10]:
nlp_features.add_sentiment_columns(small, 'all_text')
nlp_features.add_sentiment_columns(medium, 'all_text')
nlp_features.add_sentiment_columns(large, 'all_text')

## Save Clean Data

In [31]:
dfs = [small, medium, large]
filenames = ['small.csv', 'medium.csv', 'large.csv']
for df, filename in zip(dfs, filenames):
    df.to_csv(f'../data/clean/{filename}', index=False)
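
The `index=False` argument matters when the files are read back in a later notebook. The sketch below uses a hypothetical two-row frame and an in-memory buffer instead of the real files, and `round_trip` is a helper defined only for this example. It shows that writing the index adds an extra unnamed column on reload.

```python
import io
import pandas as pd

# Hypothetical stand-in for one of the cleaned DataFrames.
demo = pd.DataFrame({
    'all_text': ['reef dive', 'kelp forest'],
    'subreddit': [0, 1],
})

def round_trip(df, **to_csv_kwargs):
    '''Writes df to an in-memory CSV and reads it back.'''
    buffer = io.StringIO()
    df.to_csv(buffer, **to_csv_kwargs)
    buffer.seek(0)
    return pd.read_csv(buffer)

with_index = round_trip(demo)                 # default: index is written
without_index = round_trip(demo, index=False)

print(list(with_index.columns))
print(list(without_index.columns))
```

With the default settings the reloaded frame gains an `Unnamed: 0` column, which would otherwise have to be dropped before modeling.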