# Project 3: Web APIs & NLP
# Notebook 2: Data Cleaning

https://github.com/pushshift/api<br>
https://api.pushshift.io/reddit/search/comment/

## Contents
- [Import Libraries and Data](#Import-Libraries-and-Data)
- [Data Cleaning](#Data-Cleaning)
- [Save Data](#Save-Data)

## Import Libraries and Data

In [1]:
import regex as re
import pandas as pd

pd.set_option('display.max_columns', 40)
pd.set_option('display.max_rows', 100)

In [2]:
df_p = pd.read_csv("../datasets/politics_subreddit_latest_6500.csv")
df_c = pd.read_csv("../datasets/conservative_subreddit_latest_6500.csv")
df_l = pd.read_csv("../datasets/libertarian_subreddit_latest_6500.csv")

In [3]:
df_p

Unnamed: 0,subreddit,subreddit_id,author,author_fullname,author_flair_text,body
0,politics,t5_2cneq,wayfarout,t2_hwcbf,,It was actually talked about in The West Wing ...
1,politics,t5_2cneq,trowawee1122,t2_ay8vmvm,,"""There's no way he'll be elected...""\n\n-2015"
2,politics,t5_2cneq,klowny,t2_4ffwz,:flag-ca: California,I think Hillary started around this point as w...
3,politics,t5_2cneq,6p6ss6,t2_rr1ku,:flag-ca: California,"""Oops, my Excel just crashed and auto recover ..."
4,politics,t5_2cneq,seKer82,t2_4axcp,,"Eh, not like they mean much to them anymore. ..."
...,...,...,...,...,...,...
6495,politics,t5_2cneq,nickebee,t2_9dbis,,hunter biden smokes crack. guess we know where...
6496,politics,t5_2cneq,Jwhitmore89,t2_mrw33,,Its got what plants crave!
6497,politics,t5_2cneq,AnAveragePotSmoker,t2_7nk5ucg,,Look at you with your strong leadership and yo...
6498,politics,t5_2cneq,110110,t2_6smdg,,Been playing hockey all my life in MD. I’m ready.


## Data Cleaning

### Drop Comments from the Comment Bot Labeled by Author Flair Text in r/politics

In [4]:
# Flag rows whose flair contains "Bot"; na = False treats missing flair as "not a bot"
df_p.drop(df_p[df_p["author_flair_text"].str.contains("Bot", na = False, regex = True)].index, inplace = True)
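
To illustrate how this kind of flair mask behaves, here is a minimal sketch on a made-up frame. The authors and flair values are hypothetical and not taken from the dataset. It assumes the goal is simply to keep every row whose flair does not contain "Bot"; `na=False` keeps the mask strictly boolean when the flair is missing.

```python
import pandas as pd

# Hypothetical toy data; the real notebook reads these columns from CSV.
toy = pd.DataFrame({
    "author": ["alice", "helper_bot", "carol"],
    "author_flair_text": [None, ":bot: Bot", ":flag-ca: California"],
    "body": ["first comment", "automated notice", "third comment"],
})

# Rows with missing flair become False in the mask because of na=False.
is_bot = toy["author_flair_text"].str.contains("Bot", na=False)

# Keep everything that is not flagged as a bot.
toy_clean = toy[~is_bot]
print(toy_clean["author"].tolist())
```

Filtering with `~is_bot` is an alternative to `drop(..., inplace = True)` that returns a new frame instead of mutating the original.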

### Check That No Comments from the Comment Bot Labeled by Author Flair Text in r/politics Remain

In [84]:
df_p[df_p["author_flair_text"].str.contains("Bot", na = False, regex = True)]

Unnamed: 0,subreddit,subreddit_id,author,author_fullname,author_flair_text,body


### Remove the Extra Column for Author Flair Text

In [85]:
df_p.drop(columns = "author_flair_text", inplace = True)
df_c.drop(columns = "author_flair_text", inplace = True)
df_l.drop(columns = "author_flair_text", inplace = True)

### Drop Rows with Null Values (Such as Missing Author Names from Deleted Posts)

In [86]:
df_p = df_p.dropna()
df_c = df_c.dropna()
df_l = df_l.dropna()
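
Called without arguments, `dropna()` removes a row if any column is null. If only certain columns should trigger a drop, the `subset` parameter narrows the check. The sketch below contrasts the two on hypothetical data; the column values are invented for illustration.

```python
import pandas as pd

# Hypothetical rows: the second has no author, the third only lacks author_fullname.
toy = pd.DataFrame({
    "author": ["alice", None, "carol"],
    "author_fullname": ["t2_aaa", None, None],
    "body": ["one", "two", "three"],
})

# Drops every row that has a null in any column.
drop_any = toy.dropna()

# Drops only rows where the author itself is null.
drop_author_only = toy.dropna(subset=["author"])

print(len(drop_any), len(drop_author_only))
```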

In [87]:
print(df_p.isnull().sum())
print(df_c.isnull().sum())
print(df_l.isnull().sum())

subreddit          0
subreddit_id       0
author             0
author_fullname    0
body               0
dtype: int64
subreddit          0
subreddit_id       0
author             0
author_fullname    0
body               0
dtype: int64
subreddit          0
subreddit_id       0
author             0
author_fullname    0
body               0
dtype: int64


### Check That No Removed Messages or Deleted Authors Remain

In [88]:
df_p[(df_p["body"] == "[removed]") | (df_p["author"] == "[deleted]")]

Unnamed: 0,subreddit,subreddit_id,author,author_fullname,body


In [89]:
df_c[(df_c["body"] == "[removed]") | (df_c["author"] == "[deleted]")]

Unnamed: 0,subreddit,subreddit_id,author,author_fullname,body


In [90]:
df_l[(df_l["body"] == "[removed]") | (df_l["author"] == "[deleted]")]

Unnamed: 0,subreddit,subreddit_id,author,author_fullname,body
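
The three checks above use the same condition, so they could be folded into one helper. This is only a sketch: `count_removed_or_deleted` is a hypothetical name, not something defined in the notebook, and the toy frame is made up.

```python
import pandas as pd

def count_removed_or_deleted(df):
    """Return the number of rows with a removed body or a deleted author."""
    mask = (df["body"] == "[removed]") | (df["author"] == "[deleted]")
    return int(mask.sum())

# Hypothetical frame with one deleted author and one removed body.
toy = pd.DataFrame({
    "author": ["alice", "[deleted]", "carol"],
    "body": ["hello", "still here", "[removed]"],
})

print(count_removed_or_deleted(toy))
```

A result of `0` for each of `df_p`, `df_c`, and `df_l` would correspond to the empty outputs shown above.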


### Remove Formatting from Comment Bodies

In [91]:
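def split_and_recombine_post(text):
    """Reduce links to their domain, keep word-like tokens, and rejoin with single spaces."""
    # Group 1 captures the domain of a URL; group 2 captures a word with optional
    # trailing periods or an apostrophe suffix. The raw string avoids invalid-escape warnings.
    pattern = r"n*https*://[www.]*([A-Za-z0-9\.]*)/[\w+/\w+-.html]*|([A-Za-z0-9]+[\.]*[']*[A-Za-z]*)"
    matches = re.findall(pattern, text, re.MULTILINE)
    match_list = []
    for i in matches:
        if i[0]:
            match_list.append(i[0])
        if i[1]:
            match_list.append(i[1])
    recombined = " ".join(match_list)
    return recombined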
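
As a quick illustration of what the cleaning function does to a single comment, the sketch below repeats it and runs it on a made-up string. Two assumptions: it uses the standard-library `re` module rather than the third-party `regex` package the notebook imports, and the sample text and URL are hypothetical.

```python
import re  # stdlib stand-in; the notebook imports the third-party `regex` package as `re`

# Same pattern as the notebook, written as a raw string.
PATTERN = r"n*https*://[www.]*([A-Za-z0-9\.]*)/[\w+/\w+-.html]*|([A-Za-z0-9]+[\.]*[']*[A-Za-z]*)"

def split_and_recombine_post(text):
    """Keep URL domains and word-like tokens, joined by single spaces."""
    matches = re.findall(PATTERN, text, re.MULTILINE)
    match_list = []
    for i in matches:
        if i[0]:  # group 1: domain captured from a link
            match_list.append(i[0])
        if i[1]:  # group 2: ordinary word token
            match_list.append(i[1])
    return " ".join(match_list)

# Hypothetical comment containing a link, punctuation, and blank lines.
sample = "Check https://www.example.com/a/b.html now!\n\nIt's great."
cleaned = split_and_recombine_post(sample)
print(cleaned)  # Check example.com now It's great.
```

The link collapses to its domain, the newlines and the exclamation mark disappear, and the straight apostrophe in "It's" survives because the pattern allows it inside a token.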


In [92]:
# Apply the cleaning function to every comment body in each dataframe
df_p["body"] = df_p["body"].apply(split_and_recombine_post)
df_c["body"] = df_c["body"].apply(split_and_recombine_post)
df_l["body"] = df_l["body"].apply(split_and_recombine_post)

### Remove Any Posts That Had No Content in the Body

In [99]:
df_p.drop(df_p[df_p["body"] == ""].index, inplace = True)
df_c.drop(df_c[df_c["body"] == ""].index, inplace = True)
df_l.drop(df_l[df_l["body"] == ""].index, inplace = True)
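
A comment made up entirely of characters the cleaning step discards ends up as an empty string, which is what this step removes. The same filter can also be written as a boolean mask. The sketch below uses hypothetical data and is only one possible way to write it.

```python
import pandas as pd

# Hypothetical bodies after cleaning; the middle one lost all of its content.
toy = pd.DataFrame({
    "author": ["alice", "bob", "carol"],
    "body": ["some words", "", "more words"],
})

# Keep only rows whose body is not the empty string.
toy_nonempty = toy[toy["body"] != ""]

print(toy_nonempty["author"].tolist())
```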

## Save Data

In [101]:
df_p.to_csv("../datasets/politics_subreddit_latest_6500_cleaned.csv", index = False)
df_c.to_csv("../datasets/conservative_subreddit_latest_6500_cleaned.csv", index = False)
df_l.to_csv("../datasets/libertarian_subreddit_latest_6500_cleaned.csv", index = False)
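
Passing `index = False` matters because rows have been dropped, so the index is no longer contiguous and would otherwise be written as an extra unnamed column. The sketch below shows the difference with an in-memory buffer; the data is hypothetical and `io.StringIO` stands in for the files on disk.

```python
import io
import pandas as pd

# Hypothetical frame with a gap in the index, as happens after dropping rows.
toy = pd.DataFrame(
    {"author": ["alice", "carol"], "body": ["one", "three"]},
    index=[0, 2],
)

# Default behaviour: the index is written as a leading unnamed column.
with_index = io.StringIO()
toy.to_csv(with_index)
with_index.seek(0)
cols_with = pd.read_csv(with_index).columns.tolist()

# With index=False only the data columns are written.
without_index = io.StringIO()
toy.to_csv(without_index, index=False)
without_index.seek(0)
cols_without = pd.read_csv(without_index).columns.tolist()

print(cols_with)     # ['Unnamed: 0', 'author', 'body']
print(cols_without)  # ['author', 'body']
```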

## See Notebook 3 for Variable Creation