# Data Cleaning

## Data Collection:
For our example product, we collect approximately 3,000 posts from each of the Physics and Biology subreddits on reddit.com. We use the Pushshift API (source: https://github.com/pushshift/api) to scrape these subreddits, with parameters aimed at ignoring deleted content and keeping the most recent posts.


CATALOGIQUE believes training on recent data will help keep our models up to date and improve accuracy over time. For modeling purposes, we parsed the data down to just the titles.

#### Imports

In [1]:
import requests
import pandas as pd
import numpy as np
import time
import nltk
import re
import codecs
import matplotlib.pyplot as plt
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.decomposition import NMF, LatentDirichletAllocation

import warnings 
warnings.filterwarnings('ignore')

%config InlineBackend.figure_format='retina'

#### Pushshift Params
Super simple parameters: self posts only, up to 500 per request, created before a given epoch time.

In [4]:
def get_subreddit_data(subreddit, epoch_time):
    # Build the URL from adjacent f-strings; a single-quoted f-string cannot span two lines
    url = (f'https://api.pushshift.io/reddit/search/submission?subreddit={subreddit}'
           f'&author!=[deleted]&size=500&is_self=true&before={epoch_time}')
    res = requests.get(url)
    data = res.json()
    return data['data']
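
As a minimal sketch of how the query string above could be assembled without writing it by hand, the parameters can be kept in a dict and encoded with the standard library. The helper name `build_pushshift_url` is hypothetical and not part of the original notebook. The parameter names are copied from the function above, except that the `author!=[deleted]` filter is left out here, on the assumption that deleted authors are caught later by `check_post`.

```python
from urllib.parse import urlencode

BASE_URL = 'https://api.pushshift.io/reddit/search/submission'

def build_pushshift_url(subreddit, epoch_time, size=500):
    """Hypothetical helper: encode the query parameters from a dict."""
    params = {
        'subreddit': subreddit,
        'size': size,
        'is_self': 'true',
        'before': epoch_time,
    }
    # urlencode escapes values and joins them with '&'
    return f'{BASE_URL}?{urlencode(params)}'

url = build_pushshift_url('biology', 1600000000)
print(url)
```

Keeping the parameters in a dict makes it easier to change one of them (for example `size`) without touching the URL string.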

#### Make sure keys are present
We don't want any incomplete data

In [None]:
def exist_keys(post_to_check):
    # True only if every field that check_post reads is present
    return all(key in post_to_check for key in ("author", "selftext", "is_self"))

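#### Filter out posts with deleted or removed authors or content
Removed/deleted authors and posts don't do us any favors, so let's get rid of them! We also drop posts whose text is 50 characters or shorter, 50,000 characters or longer, or contains a link.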

In [None]:
def check_post(post_to_check):
    if exist_keys(post_to_check):
        author = post_to_check['author']
        selftext = post_to_check['selftext']
        is_self = post_to_check['is_self']
        if (author != '[deleted]' and author != 'deleted' and author != 'removed' 
                and selftext != 'removed' and selftext != ""
                and selftext != 'deleted' and 50 < len(selftext) < 50000
                and "http://" not in selftext and "https://" not in selftext
                and is_self) :
            return True
        else:
            return False
    else:
        return False
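
To make the filter concrete, here is a compact sketch of the same checks applied to four made-up posts. The sample dictionaries and the name `is_usable_post` are illustrative assumptions; the thresholds and placeholder strings are the ones used in `check_post` above.

```python
AUTHOR_PLACEHOLDERS = {'[deleted]', 'deleted', 'removed'}
TEXT_PLACEHOLDERS = {'', 'removed', 'deleted'}

def is_usable_post(post):
    """Illustrative restatement of check_post as a single predicate."""
    required = ('author', 'selftext', 'is_self')
    if not all(key in post for key in required):
        return False
    text = post['selftext']
    return bool(
        post['is_self']
        and post['author'] not in AUTHOR_PLACEHOLDERS
        and text not in TEXT_PLACEHOLDERS
        and 50 < len(text) < 50000
        and 'http://' not in text
        and 'https://' not in text
    )

# Made-up posts, one per outcome
samples = [
    {'author': 'alice', 'selftext': 'x' * 60, 'is_self': True},      # kept
    {'author': '[deleted]', 'selftext': 'x' * 60, 'is_self': True},  # deleted author
    {'author': 'bob', 'selftext': 'see https://example.com ' + 'x' * 60,
     'is_self': True},                                               # contains a link
    {'author': 'carol', 'is_self': True},                            # missing selftext
]
print([is_usable_post(p) for p in samples])
```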

#### Get filtered posts, paging backwards by creation time


In [5]:
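def get_filtered_posts(subreddit, post_count):
    result = []
    epoch_time = int(time.time())
    is_end_of_topic = False
    while len(result) <= post_count and not is_end_of_topic:
        post_list = get_subreddit_data(subreddit, epoch_time)
        if not post_list:
            break  # the API returned nothing older, so stop
        result.extend(post for post in post_list if check_post(post))
        # Page from the oldest post fetched, not the oldest post kept; otherwise a
        # batch with no usable posts stops the loop early (or fails on an empty result).
        # Assumes the API returns posts newest first.
        oldest = int(post_list[-1]['created_utc'])
        if epoch_time != oldest:
            epoch_time = oldest
        else:
            is_end_of_topic = True
    return result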
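
The paging logic can be tried without touching the API. The sketch below swaps in a hypothetical `fake_fetch` that serves made-up posts newest first in batches of 10, and a trivial `keep` flag in place of `check_post`; it is a simplified version of the loop above, not the notebook's code. It also shows that the loop can return more posts than requested, since each filtered batch is appended whole.

```python
# 25 made-up posts with timestamps 125, 124, ..., 101 (newest first).
FAKE_POSTS = [{'created_utc': ts, 'keep': ts % 2 == 0} for ts in range(125, 100, -1)]

def fake_fetch(before, size=10):
    """Hypothetical stand-in for the API: up to `size` posts older than `before`."""
    return [p for p in FAKE_POSTS if p['created_utc'] < before][:size]

def collect(post_count, start_time=1000):
    result = []
    epoch_time = start_time
    while len(result) < post_count:
        batch = fake_fetch(epoch_time)
        if not batch:
            break  # nothing older is left
        result.extend(p for p in batch if p['keep'])
        epoch_time = batch[-1]['created_utc']  # continue from the oldest post seen
    return result

posts = collect(7)
print(len(posts))          # more than the 7 requested
print(len(collect(100)))   # capped by how many posts pass the filter
```

Slicing the returned list to the requested count is one way to get an exact number of posts, if that matters for the analysis.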

## Note: the next few lines of code will take a while as they scrape data.
Code is commented out in case you accidentally run the notebook and don't want to scrape and download the data. The cleaned datasets are stored in ('datasets/**')

#### Biology Posts
Let's scrape about 3000 biology posts. Both sets come out slightly above 3000 because the loop appends each filtered batch whole and only checks the running total before requesting the next batch.

Uncomment code to run

In [6]:
#biology_posts = get_filtered_posts("biology", 3000)
#print('We have', len(biology_posts), 'titles in the data')

#### Physics Posts
Let's scrape about 3000 physics posts

In [8]:
#physics_posts = get_filtered_posts("physics", 3000)
#print('We have', len(physics_posts), 'titles in the data')

#### Create DataFrames 

In [7]:
bio_df = pd.DataFrame(biology_posts)
phys_df = pd.DataFrame(physics_posts)


#### Choose DF Features
Initially looking at Title, Selftext, Score and time created (created_utc). 

In [None]:
bio_df = bio_df[['title', 'selftext', 'score', 'created_utc']]

In [None]:
phys_df = phys_df[['title', 'selftext', 'score', 'created_utc']]

#### Create Cleaning Function
Clean up links, punctuation and symbols using regex.

In [None]:
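def standardize_text(df, text_field):
    # regex=True is required: since pandas 2.0, str.replace treats the pattern
    # as a literal string by default
    df[text_field] = df[text_field].str.replace(r"http\S+", "", regex=True)  # URLs
    df[text_field] = df[text_field].str.replace(r"http", "", regex=True)     # stray 'http'
    df[text_field] = df[text_field].str.replace(r"@\S+", "", regex=True)     # @mentions
    df[text_field] = df[text_field].str.replace(r"[^A-Za-z0-9(),!?@\'\`\"\_\n]", " ", regex=True)
    df[text_field] = df[text_field].str.replace(r"@", "at", regex=True)
    df[text_field] = df[text_field].str.lower()
    return df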
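
To see what each substitution does, the same patterns can be applied to a single string with the standard `re` module. The sample title is made up, and `standardize_string` is a hypothetical single-string counterpart of `standardize_text`, not part of the notebook.

```python
import re

def standardize_string(text):
    text = re.sub(r"http\S+", "", text)   # drop URLs
    text = re.sub(r"http", "", text)      # drop stray 'http' fragments
    text = re.sub(r"@\S+", "", text)      # drop @mentions
    # Replace anything outside the allowed character set with a space
    text = re.sub(r"[^A-Za-z0-9(),!?@\'\`\"\_\n]", " ", text)
    text = re.sub(r"@", "at", text)       # any '@' still left becomes 'at'
    return text.lower()

sample = "Why is DNA double-stranded? See https://example.com @someone"
print(repr(standardize_string(sample)))
```

Note that the hyphen is replaced by a space while `?` survives, because `?` is in the allowed character set.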

#### Run cleaning function on the post titles
And inspect

In [None]:
bio_df = standardize_text(bio_df, 'title')
bio_df.head()

In [None]:
phys_df = standardize_text(phys_df, 'title')
phys_df.head()

#### Lemmatize!
This step runs a second quick cleaning function that overlaps with `standardize_text` above. No good excuse, I was lazy. The two will be merged in a future iteration.

In [None]:
# Major credit to 
# https://towardsdatascience.com/topic-modeling-quora-questions-with-lda-nmf-aff8dce5e1dd
    
import spacy
def clean_text(text):
    '''Make text lowercase, remove text in square brackets, 
    remove punctuation and remove words containing numbers.'''
    text = text.lower()
    text = re.sub(r'\[.*?\]', '', text)
    text = re.sub(r'[^\w\s]', '', text)
    text = re.sub(r'\w*\d\w*', '', text)
    return text

bio_clean = pd.DataFrame(bio_df.title.apply(clean_text))
phys_clean = pd.DataFrame(phys_df.title.apply(clean_text))

nlp = spacy.load('en_core_web_sm')

def lemmatizer(text):        
    sent = []
    doc = nlp(text)
    for word in doc:
        sent.append(word.lemma_)
    return " ".join(sent)
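
For a quick feel of what `clean_text` removes, the sketch below reruns its regular expressions on a made-up title. It stops short of lemmatization so that it runs without spaCy or the `en_core_web_sm` model.

```python
import re

def clean_text(text):
    text = text.lower()
    text = re.sub(r'\[.*?\]', '', text)    # text in square brackets
    text = re.sub(r'[^\w\s]', '', text)    # punctuation
    text = re.sub(r'\w*\d\w*', '', text)   # words containing digits
    return text

title = "[Question] Why do 2D materials behave so differently?"
print(repr(clean_text(title)))
```

The bracketed tag, the question mark and the token `2d` are all dropped; the leftover whitespace is left in place.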
    

#### Apply the lemmatizer to our dataframes. 

In [None]:
# Lemmatize each title. spaCy v2 lemmatizes pronouns to the placeholder '-PRON-', which we strip.
bio_clean['title'] = bio_clean['title'].apply(lemmatizer)
bio_clean['title'] = bio_clean['title'].str.replace('-PRON-', '', regex=False)

phys_clean['title'] = phys_clean['title'].apply(lemmatizer)
phys_clean['title'] = phys_clean['title'].str.replace('-PRON-', '', regex=False)

#### Add a column with class labels for each. 
Biology subreddit = 0
Physics subreddit = 1

I also checked for null values and inspected the head of each DataFrame before saving.

In [9]:
bio_clean['class_label'] = 0


In [None]:
bio_clean.head()

In [None]:
bio_clean.isnull().sum()

In [None]:
phys_clean['class_label'] = 1

In [None]:
phys_clean.head()

In [None]:
phys_clean.isnull().sum()

#### Save the new dataframes to our datasets folder for use in our models.
Removed index as it just adds an unnecessary column. 

In [None]:
bio_clean.to_csv('datasets/bio_clean.csv', index=False)
phys_clean.to_csv('datasets/phys_clean.csv', index=False)