# Project 3: Web APIs & NLP
## *Data Cleaning and EDA*

In this notebook:

* [Initial data exploration](#initial)
* [Handle null entries](#null-entries)
* [Encoding target variable](#ohe)
* [Cleaning text](#clean-text)
* [Export to CSV](#export-csv)

#### Import Libraries & Read in Data

In [31]:
## standard imports 
import pandas as pd 
import numpy as np
import re
## visualizations
import matplotlib.pyplot as plt
import seaborn as sns
## preprocessing
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler, PolynomialFeatures
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.dummy import DummyClassifier
## modeling
from sklearn.linear_model import LogisticRegression, LinearRegression, Lasso, Ridge, ElasticNet
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.neighbors import KNeighborsClassifier, KNeighborsRegressor
from sklearn.svm import SVC, SVR
from sklearn.naive_bayes import MultinomialNB
## trees
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor
from sklearn.ensemble import BaggingClassifier, BaggingRegressor, RandomForestClassifier, ExtraTreesClassifier, RandomForestRegressor, ExtraTreesRegressor, AdaBoostRegressor, AdaBoostClassifier, GradientBoostingRegressor
## NLP
from sklearn.feature_extraction.text import CountVectorizer
## analysis
## plot_confusion_matrix was removed in scikit-learn 1.2; ConfusionMatrixDisplay replaces it
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay, accuracy_score, make_scorer, f1_score, mean_squared_error

## options
import sklearn
pd.options.display.max_rows = 4000
pd.options.display.max_columns = 100
pd.set_option('display.max_colwidth', 100)

In [32]:
### read in data
data = pd.read_csv('../data/reddit_posts.csv')

## Initial Data Exploration <a class="anchor" id="initial"></a>
<hr/>

In [33]:
data.head()

Unnamed: 0,title,created_utc,selftext,subreddit,author,media_only,permalink
0,What to do after joining the brotherhood of st...,1601669856,I just joined after completing the quest for b...,Fallout,scud214,False,/r/Fallout/comments/j424ru/what_to_do_after_jo...
1,Pipboy,1601669271,[removed],Fallout,locogringo90,False,/r/Fallout/comments/j41yey/pipboy/
2,Fallout 5 location?,1601667863,[removed],Fallout,leolupascu03,False,/r/Fallout/comments/j41ifv/fallout_5_location/
3,Maxsons brotherhood of steel vs the NCR,1601667591,Who’d win? The NCR in my opinion would win but...,Fallout,Urmomgay890,False,/r/Fallout/comments/j41f9u/maxsons_brotherhood...
4,I have seen a new power generator its the dome...,1601667534,,Fallout,HeyyyItsFrosty,False,/r/Fallout/comments/j41ep3/i_have_seen_a_new_p...


In [19]:
data.columns

Index(['title', 'created_utc', 'selftext', 'subreddit', 'author', 'media_only',
       'permalink'],
      dtype='object')

In [20]:
data.shape

(4000, 7)

## Handle Null Entries <a class="anchor" id="null-entries"></a>
<hr/>

In [34]:
data.isna().sum()

title             0
created_utc       0
selftext       1774
subreddit         0
author            0
media_only        0
permalink         0
dtype: int64

### Missing post text

Many Reddit posts put the whole question in the title and leave the body (`selftext`) empty. Instead of dropping these rows, we'll fill the missing `selftext` with the post's title.

In [35]:
## assign back rather than calling fillna(inplace=True) on a column selection
data['selftext'] = data['selftext'].fillna(data['title'])
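
As a minimal sketch of the fill step, the toy frame below (made-up rows, not taken from the scraped data) shows `fillna` aligning the two columns by index, so only the missing body picks up its own row's title. The name `toy` is hypothetical.

```python
import numpy as np
import pandas as pd

# Toy frame with made-up rows; the second post has no body text
toy = pd.DataFrame({
    'title': ['Pipboy question', 'Favorite captain?'],
    'selftext': ['How do I open the map?', np.nan],
})

# fillna with a Series aligns on the index, so each missing body
# is replaced by the title from the same row
toy['selftext'] = toy['selftext'].fillna(toy['title'])

print(toy['selftext'].tolist())
# ['How do I open the map?', 'Favorite captain?']
```

Assigning the result back to the column avoids relying on `inplace=True` applied to a column selection, which newer pandas versions warn about.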

In [36]:
data.head()

Unnamed: 0,title,created_utc,selftext,subreddit,author,media_only,permalink
0,What to do after joining the brotherhood of st...,1601669856,I just joined after completing the quest for b...,Fallout,scud214,False,/r/Fallout/comments/j424ru/what_to_do_after_jo...
1,Pipboy,1601669271,[removed],Fallout,locogringo90,False,/r/Fallout/comments/j41yey/pipboy/
2,Fallout 5 location?,1601667863,[removed],Fallout,leolupascu03,False,/r/Fallout/comments/j41ifv/fallout_5_location/
3,Maxsons brotherhood of steel vs the NCR,1601667591,Who’d win? The NCR in my opinion would win but...,Fallout,Urmomgay890,False,/r/Fallout/comments/j41f9u/maxsons_brotherhood...
4,I have seen a new power generator its the dome...,1601667534,I have seen a new power generator its the dome...,Fallout,HeyyyItsFrosty,False,/r/Fallout/comments/j41ep3/i_have_seen_a_new_p...


### Handle `[removed]` text

Some posts have the literal string `[removed]` as their body text, so the original content is no longer available. We'll drop these rows.

In [37]:
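## .copy() makes the filtered frame independent, so later column assignments don't raise SettingWithCopyWarning
data = data[data['selftext'] != '[removed]'].copy()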
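
The filter can be sketched on a toy frame (made-up rows; `toy` and `kept` are hypothetical names). It also shows that a boolean mask keeps the original index labels, so after filtering the largest label can exceed the number of remaining rows.

```python
import pandas as pd

# Toy frame with made-up rows; one body is the '[removed]' placeholder
toy = pd.DataFrame({
    'selftext': ['Who would win?', '[removed]', 'Rewatching DS9'],
    'subreddit': ['Fallout', 'Fallout', 'startrek'],
})

# Keep rows whose body is not the placeholder; .copy() gives an
# independent frame that is safe to add columns to later
kept = toy[toy['selftext'] != '[removed]'].copy()

print(kept.shape)         # (2, 2)
print(list(kept.index))   # [0, 2]
```

If consecutive labels are preferred, `kept.reset_index(drop=True)` would renumber the rows; the notebook does not do this.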

In [38]:
data.shape

(9072, 7)

In [39]:
data['subreddit'].value_counts()

startrek    4754
Fallout     4318
Name: subreddit, dtype: int64

## Encoding Target Variable <a class="anchor" id="ohe"></a>
<hr/>

In [40]:
data['is_fallout'] = np.where(data['subreddit'] == 'Fallout', 1, 0)

# data['is_fallout'] = data['subreddit'].replace({'Fallout':1, 'startrek':0})
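
A small sketch of the encoding on made-up rows: `np.where` writes 1 where the subreddit is `Fallout` and 0 otherwise. The boolean-cast variant is an equivalent alternative shown for comparison; the column name `is_fallout_alt` is hypothetical.

```python
import numpy as np
import pandas as pd

# Toy frame with made-up subreddit labels
toy = pd.DataFrame({'subreddit': ['Fallout', 'startrek', 'Fallout']})

# 1 for Fallout posts, 0 for everything else
toy['is_fallout'] = np.where(toy['subreddit'] == 'Fallout', 1, 0)

# Equivalent: cast the boolean comparison to integers
toy['is_fallout_alt'] = (toy['subreddit'] == 'Fallout').astype(int)

print(toy['is_fallout'].tolist())  # [1, 0, 1]
```

With only two subreddits in the data, both forms give the same column. They would also map any unexpected third label to 0, which a dictionary-based mapping would not.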

In [41]:
data.tail()

Unnamed: 0,title,created_utc,selftext,subreddit,author,media_only,permalink,is_fallout
9995,Sub Rosa has to be the worst episode of all ti...,1594922247,Sub Rosa has to be the worst episode of all ti...,startrek,GuiltChip,False,/r/startrek/comments/hseles/sub_rosa_has_to_be...,0
9996,That’s a stupid question! (DS9),1594919601,I saw a subreddit today called act like you be...,startrek,alexanderatl,False,/r/startrek/comments/hsdozd/thats_a_stupid_que...,0
9997,Remember when star trek tos (supposed to be) e...,1594918197,[https://giphy.com/gifs/Kf6Wn2D1IDM4jOEY7H](ht...,startrek,britbrayt,False,/r/startrek/comments/hsd6hx/remember_when_star...,0
9998,Beware the masks for sale from the Official St...,1594916790,They are super expensive for what they are (wh...,startrek,FrMark,False,/r/startrek/comments/hscnw5/beware_the_masks_f...,0
9999,Does star trek picard ruin tng’s ending,1594916769,Because of the quarantine I finally convinced ...,startrek,kuzuthunder,False,/r/startrek/comments/hscnnm/does_star_trek_pic...,0


## Cleaning Text <a class="anchor" id="clean-text"></a>
<hr/>
Fragments of URLs kept showing up among the most frequent words, so we remove URLs from the post text.

### Remove URLs from post text

In [42]:
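## strip each URL up to the next whitespace; a greedy `.*` would also delete the rest of the line
data['selftext'] = data['selftext'].str.replace(r'https?://\S+', ' ', regex=True)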

In [43]:
### replace newlines with spaces
data['selftext'] = data['selftext'].str.replace('\n', ' ', regex=False)
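
To illustrate how the choice of URL pattern affects the result, here is a sketch on one made-up string (the `example.com` address is a placeholder). A greedy `.*` after the scheme removes everything up to the end of the line, while `\S+` stops at the first whitespace and keeps the words that follow. The final whitespace collapse is an optional extra, not a step from the notebook.

```python
import pandas as pd

# One made-up post body containing a URL followed by more text
s = pd.Series(["Look at this https://example.com/a?b=1 before you vote\nSecond line"])

# Greedy: the URL and the rest of its line (plus the newline) are replaced
greedy = s.str.replace(r'https?://.*[\r\n]*', ' ', regex=True)

# Narrow: only the URL itself (up to the next whitespace) is replaced
narrow = s.str.replace(r'https?://\S+', ' ', regex=True)

# Optional tidy-up: collapse runs of whitespace, including newlines
tidy = narrow.str.replace(r'\s+', ' ', regex=True).str.strip()

print(greedy[0])  # Look at this  Second line
print(tidy[0])    # Look at this before you vote Second line
```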

## Export Cleaned Data to CSV <a class="anchor" id="export-csv"></a>
<hr/>

In [44]:
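## index=False keeps the row index out of the file, so re-reading it does not add an 'Unnamed: 0' column
data.to_csv('../data/reddit_posts_clean.csv', index=False)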
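
Whether the index is written affects what comes back when the file is read again. The sketch below round-trips a toy frame (made-up values) through an in-memory buffer instead of a real file. The buffer and variable names are hypothetical.

```python
import io

import pandas as pd

# Toy frame with a non-consecutive index, as left behind by row filtering
toy = pd.DataFrame({'title': ['a', 'b'], 'is_fallout': [1, 0]}, index=[3, 7])

# Default: the unnamed index is written as an extra first column
with_index = io.StringIO()
toy.to_csv(with_index)
with_index.seek(0)
cols_with = pd.read_csv(with_index).columns.tolist()

# index=False: only the data columns are written
without_index = io.StringIO()
toy.to_csv(without_index, index=False)
without_index.seek(0)
cols_without = pd.read_csv(without_index).columns.tolist()

print(cols_with)     # ['Unnamed: 0', 'title', 'is_fallout']
print(cols_without)  # ['title', 'is_fallout']
```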