# Web Scraping Reddit API and Data Cleaning

This notebook scrapes posts and comments from the Real Housewives and NBA subreddits through the Reddit JSON API, then cleans and combines them into a final dataframe of 3604 unique observations with 2 columns: the text body and a binarized target variable.

In [1]:
import pandas as pd
import requests
import time

### Begin by webscraping posts and comments from Subreddit for Real Housewives (r/BravoRealHousewives)

#### Scraping Real Housewives Posts:

In [2]:
#loop through reddit api to collect post data
after = None
list_posts = []
for _ in range(20):
    posts = requests.get('https://www.reddit.com/r/bravorealhousewives.json',
                         headers = {'User-agent': 'Kyle Richards'},
                         params = {'after': after, 'limit': 100}).json()
    after = posts['data']['after']
    for post in posts['data']['children']:
        list_posts.append(post['data'])
    time.sleep(1)
    
#create dataframe
rhw_posts = pd.DataFrame(list_posts)

rhw_posts.head()

Unnamed: 0,approved_at_utc,approved_by,archived,author,author_cakeday,author_flair_background_color,author_flair_css_class,author_flair_richtext,author_flair_template_id,author_flair_text,...,thumbnail_height,thumbnail_width,title,ups,url,user_reports,view_count,visited,whitelist_status,wls
0,,,False,AutoModerator,,,,[],,"I may not be a housewife, but I'm also not real.",...,,,/r/BravoRealHousewives daily OT thread. Today ...,3,https://www.reddit.com/r/BravoRealHousewives/c...,[],,False,all_ads,6
1,,,False,amandatoryy,,,,[],78508438-2e2a-11e4-8237-12313d148d8a,suck it up cream puff,...,,,The Real Housewives of New Jersey S09E07 - Bru...,1,https://www.reddit.com/r/BravoRealHousewives/c...,[],,False,all_ads,6
2,,,False,Artemis273,,,,[],,,...,,,If we could pool our money and hire MKE to do ...,162,https://www.reddit.com/r/BravoRealHousewives/c...,[],,False,all_ads,6
3,,,False,WatermelonRadishh,,,,[],,,...,140.0,140.0,Gotta pay for that wedding somehow but holy Fa...,141,https://i.redd.it/wi6ukzj7w7521.jpg,[],,False,all_ads,6
4,,,False,gaymike219905,,,,[],78508438-2e2a-11e4-8237-12313d148d8a,"She died, Aviva.",...,73.0,140.0,RHONJ Season 9 Midseason Trailer,16,https://people.com/tv/margaret-josephs-pushes-...,[],,False,all_ads,6


In [3]:
rhw_posts.shape

(1988, 100)

 - We have 1988 posts from the Real Housewives subreddit. 
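The pagination loop above can be sketched as a reusable helper. This is a minimal sketch, not the notebook's actual code: `collect_listing`, `fetch_page`, and `fake_pages` are hypothetical names introduced for illustration, and the fake pages stand in for the API so the example runs offline. One assumed difference from the loop above is that it stops when `after` comes back empty. `requests` drops parameters whose value is `None`, so continuing past the last page may re-fetch the first page and add duplicates.

```python
import time
from typing import Callable, Optional


def collect_listing(fetch_page: Callable[[Optional[str]], dict],
                    max_pages: int = 20,
                    pause: float = 1.0) -> list:
    """Follow Reddit-style 'after' tokens and gather each child's 'data' dict."""
    after = None
    rows = []
    for _ in range(max_pages):
        page = fetch_page(after)
        rows.extend(child['data'] for child in page['data']['children'])
        after = page['data']['after']
        if after is None:
            # no further pages: stop rather than request the first page again
            break
        time.sleep(pause)
    return rows


# Offline stand-in for the API (hypothetical data) so the sketch runs without network access
fake_pages = {
    None: {'data': {'after': 't3_a',
                    'children': [{'data': {'id': 1}}, {'data': {'id': 2}}]}},
    't3_a': {'data': {'after': None,
                      'children': [{'data': {'id': 3}}]}},
}
demo_rows = collect_listing(lambda after: fake_pages[after], pause=0)
```

To use it against the live endpoint, `fetch_page` could wrap the same `requests.get(...).json()` call as in the cell above, passing `after` through `params`.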

In [4]:
#export csv file
# rhw_posts.to_csv('../data/rhw_posts.csv',index=False)

#### Scraping Real Housewives Comments:

In [5]:
#loop through reddit api to collect comment data
after = None
list_posts = []
for _ in range(20):
    posts = requests.get('https://www.reddit.com/r/bravorealhousewives/comments.json',
                         headers = {'User-agent': 'Kyle Richards'},
                         params = {'after': after, 'limit': 100}).json()
    after = posts['data']['after']
    for post in posts['data']['children']:
        list_posts.append(post['data'])
    time.sleep(1)
    
#create dataframe
rhw_comments = pd.DataFrame(list_posts)

rhw_comments.head()

Unnamed: 0,approved_at_utc,approved_by,archived,author,author_flair_background_color,author_flair_css_class,author_flair_richtext,author_flair_template_id,author_flair_text,author_flair_text_color,...,score,score_hidden,send_replies,stickied,subreddit,subreddit_id,subreddit_name_prefixed,subreddit_type,ups,user_reports
0,,,False,VirginWhoCantDr1ve,,,[],78508438-2e2a-11e4-8237-12313d148d8a,"Be cool. Don't be all, like, uncool.",dark,...,1,False,True,False,BravoRealHousewives,t5_2v6dk,r/BravoRealHousewives,public,1,[]
1,,,False,jendet010,,,[],,,,...,1,False,True,False,BravoRealHousewives,t5_2v6dk,r/BravoRealHousewives,public,1,[]
2,,,False,dixiecupdispencer,,,[],,,,...,1,False,True,False,BravoRealHousewives,t5_2v6dk,r/BravoRealHousewives,public,1,[]
3,,,False,aintnowallbih,,,[],,,,...,1,False,True,False,BravoRealHousewives,t5_2v6dk,r/BravoRealHousewives,public,1,[]
4,,,False,Jadenandwillow,,,[],,,,...,1,False,True,False,BravoRealHousewives,t5_2v6dk,r/BravoRealHousewives,public,1,[]


In [6]:
rhw_comments.shape

(1940, 63)

 - We have 1940 comments from the Real Housewives subreddit.

In [7]:
#export csv file
# rhw_comments.to_csv('../data/rhw_comments.csv', index = False)

### Cleaning Real Housewives Posts

In [8]:
#remove posts by user 'damazz09', who posts a long weekly tv ratings list
rhw_posts = rhw_posts[rhw_posts['author'] != 'damazz09']
#create subset of relevant features (.copy() avoids modifying a slice of the original)
rhw_posts = rhw_posts[['title','selftext','subreddit']].copy()
#fill in null values with a placeholder
rhw_posts = rhw_posts.fillna('.')
#create column 'body' that combines 'title' with 'selftext'
rhw_posts['body'] = rhw_posts['title'] + ' ' + rhw_posts['selftext']
#subset dataframe of Real Housewives posts
rhw_posts = rhw_posts[['body','subreddit']]

rhw_posts.head()

Unnamed: 0,body,subreddit
0,/r/BravoRealHousewives daily OT thread. Today ...,BravoRealHousewives
1,The Real Housewives of New Jersey S09E07 - Bru...,BravoRealHousewives
2,If we could pool our money and hire MKE to do ...,BravoRealHousewives
3,Gotta pay for that wedding somehow but holy Fa...,BravoRealHousewives
4,RHONJ Season 9 Midseason Trailer,BravoRealHousewives
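As a variation on the cleaning step above, the sketch below combines `title` and `selftext` using an empty string rather than `'.'` for missing values, so no stray punctuation ends up in the text. It runs on a small made-up dataframe (`toy_posts` is hypothetical), and the choice of an empty-string placeholder is an assumption, not what the notebook does.

```python
import pandas as pd

# Hypothetical stand-in for the scraped posts
toy_posts = pd.DataFrame({
    'title': ['Trailer is out', 'Game thread'],
    'selftext': [None, 'Discuss here'],
    'subreddit': ['BravoRealHousewives', 'nba'],
})

# .copy() gives an independent frame, so the assignment below does not touch a slice
subset = toy_posts[['title', 'selftext', 'subreddit']].copy()
# fill missing text with '' and strip the trailing space left by posts without selftext
subset['body'] = (subset['title'].fillna('') + ' '
                  + subset['selftext'].fillna('')).str.strip()
subset = subset[['body', 'subreddit']]
```

Whether an empty string or a visible placeholder is preferable depends on how the text is tokenized later; most vectorizers drop a lone `'.'` anyway.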


### Cleaning Real Housewives Comments

In [9]:
#subset dataframe of Real Housewives comments
rhw_comments = rhw_comments[['body','subreddit']]

rhw_comments.head()

Unnamed: 0,body,subreddit
0,"Well, this is the woman who basically dressed ...",BravoRealHousewives
1,I thought there was a strong undertone in one ...,BravoRealHousewives
2,"If Bethenny liked Jill’s latkas. Better yet, i...",BravoRealHousewives
3,Bitch yes!! And i had that same exact thought!...,BravoRealHousewives
4,Alex mccord said something similar to your sec...,BravoRealHousewives


### Combining Real Housewives Posts and Comments

In [10]:
#Combine Posts with Comments and reset index
rhw = pd.concat([rhw_posts,rhw_comments], axis =0)
rhw = rhw.reset_index(drop=True)
#Drop duplicate copies
rhw.drop_duplicates(keep='first',inplace=True)


rhw.head()

Unnamed: 0,body,subreddit
0,/r/BravoRealHousewives daily OT thread. Today ...,BravoRealHousewives
1,The Real Housewives of New Jersey S09E07 - Bru...,BravoRealHousewives
2,If we could pool our money and hire MKE to do ...,BravoRealHousewives
3,Gotta pay for that wedding somehow but holy Fa...,BravoRealHousewives
4,RHONJ Season 9 Midseason Trailer,BravoRealHousewives


In [11]:
rhw.shape

(1853, 2)

 - We now have a Real Housewives dataframe with 1853 observations and 2 columns: the body of text and the subreddit name.
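The combine-and-deduplicate step can be checked on a toy example. The sketch below uses made-up frames (`toy_posts`, `toy_comments`) and records how many rows were dropped, which is a quick way to see how much overlap the scraping loops produced. Resetting the index after dropping duplicates, rather than before, is an assumed ordering that leaves a gap-free index.

```python
import pandas as pd

# Hypothetical posts and comments with some repeated rows
toy_posts = pd.DataFrame({'body': ['a', 'b'], 'subreddit': ['x', 'x']})
toy_comments = pd.DataFrame({'body': ['b', 'c', 'c'], 'subreddit': ['x', 'x', 'x']})

combined = pd.concat([toy_posts, toy_comments], ignore_index=True)
rows_before = len(combined)
# drop exact duplicate rows first, then renumber so the index has no gaps
combined = combined.drop_duplicates(keep='first').reset_index(drop=True)
rows_removed = rows_before - len(combined)
```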

### Do the same web scraping for NBA Subreddit (r/nba)

#### Scraping NBA Posts:

In [13]:
#loop through reddit api to collect post data
after = None
list_posts = []
for _ in range(20):
    posts = requests.get('https://www.reddit.com/r/nba.json',
                         headers = {'User-agent': 'Steph Curry'},
                         params = {'after': after, 'limit': 100}).json()
    after = posts['data']['after']
    for post in posts['data']['children']:
        list_posts.append(post['data'])
    time.sleep(1)
    
#create dataframe
nba_posts = pd.DataFrame(list_posts)

nba_posts.head()

Unnamed: 0,approved_at_utc,approved_by,archived,author,author_cakeday,author_flair_background_color,author_flair_css_class,author_flair_richtext,author_flair_template_id,author_flair_text,...,suggested_sort,thumbnail,title,ups,url,user_reports,view_count,visited,whitelist_status,wls
0,,,False,brexbre,,,Bulls2,[],,[CHI] Wendell Carter Jr.,...,new,,r/NBA Game Threads Index + Daily Discussion (D...,27,https://www.reddit.com/r/nba/comments/a7n5bc/r...,[],,False,all_ads,6
1,,,False,scoutwithbryan,,,,[],,,...,qa,,Hi Reddit! My name is Bryan Oringher and I spe...,283,https://www.reddit.com/r/nba/comments/a7ou6w/h...,[],,False,all_ads,6
2,,,False,mrguister,,,76ers1,[],,[PHI] Joel Embiid,...,,,Lebron with some advice to his son Bryce,15236,https://streamable.com/o9j7l,[],,False,all_ads,6
3,,,False,urasha,,,Knicks1,[],7cbf914a-b670-11e7-9aac-0e7f0a0db672,Knicks,...,,,"[Windhorst] ""LeBron James wants the Lakers to ...",2574,https://twitter.com/WindhorstESPN/status/10753...,[],,False,all_ads,6
4,,,False,SlimShagy,,,Spurs1,[],dfb246bc-3feb-11e8-9fdf-0e057b5e85fa,Spurs,...,,,"[Windhorst] ""LEAGUE EXECUTIVES ARE puzzled by ...",873,https://www.reddit.com/r/nba/comments/a7pto8/w...,[],,False,all_ads,6


In [14]:
nba_posts.shape

(1924, 95)

 - We have 1924 posts from the NBA subreddit.

In [15]:
#export csv file
# nba_posts.to_csv('../data/nba_posts.csv', index=False)

#### Scraping NBA comments:

In [16]:
#loop through reddit api to collect comment data
after = None
list_posts = []
for _ in range(20):
    posts = requests.get('https://www.reddit.com/r/nba/comments.json',
                         headers = {'User-agent': 'Steph Curry'},
                         params = {'after': after, 'limit': 100}).json()
    after = posts['data']['after']
    for post in posts['data']['children']:
        list_posts.append(post['data'])
    time.sleep(1)
    
#create dataframe
nba_comments = pd.DataFrame(list_posts)


nba_comments.head()

Unnamed: 0,approved_at_utc,approved_by,archived,author,author_flair_background_color,author_flair_css_class,author_flair_richtext,author_flair_template_id,author_flair_text,author_flair_text_color,...,score,score_hidden,send_replies,stickied,subreddit,subreddit_id,subreddit_name_prefixed,subreddit_type,ups,user_reports
0,,,False,duskhat,,Warriors1,[],3172a552-362b-11e8-96d7-0e405102b13a,Warriors,dark,...,1,False,True,False,nba,t5_2qo4s,r/nba,public,1,[]
1,,,False,manisier,,Warriors3,[],,Warriors,dark,...,1,False,True,False,nba,t5_2qo4s,r/nba,public,1,[]
2,,,False,sriracha82,,,[],,,,...,1,False,True,False,nba,t5_2qo4s,r/nba,public,1,[]
3,,,False,TreChomes,,Raptors2,[],8aada790-3561-11e8-8a85-0eea1222c1c4,[TOR] OG Anunoby,dark,...,1,False,True,False,nba,t5_2qo4s,r/nba,public,1,[]
4,,,False,moemonay,,,[],,,,...,1,False,True,False,nba,t5_2qo4s,r/nba,public,1,[]


In [17]:
nba_comments.shape

(1964, 63)

 - We have 1964 comments from the NBA subreddit.

In [18]:
#export csv file
# nba_comments.to_csv('../data/nba_comments.csv', index=False)

### Cleaning NBA Posts 

In [19]:
#create subset of relevant features (.copy() avoids modifying a slice of the original)
nba_posts = nba_posts[['title','selftext','subreddit']].copy()
#fill in null values with placeholder
nba_posts = nba_posts.fillna('.')
#create column 'body' that combines 'title' with 'selftext'
nba_posts['body'] = nba_posts['title'] + ' ' + nba_posts['selftext']
#subset dataframe of NBA posts
nba_posts = nba_posts[['body','subreddit']]
nba_posts.head()

Unnamed: 0,body,subreddit
0,r/NBA Game Threads Index + Daily Discussion (D...,nba
1,Hi Reddit! My name is Bryan Oringher and I spe...,nba
2,Lebron with some advice to his son Bryce,nba
3,"[Windhorst] ""LeBron James wants the Lakers to ...",nba
4,"[Windhorst] ""LEAGUE EXECUTIVES ARE puzzled by ...",nba


### Cleaning NBA Comments

In [20]:
#create subset
nba_comments = nba_comments[['body','subreddit']]
nba_comments.head()

Unnamed: 0,body,subreddit
0,That doesn't sound right to me. Best record in...,nba
1,Melo doing work from the grave,nba
2,KD. He makes about 10 methodical midrangers a ...,nba
3,What was that Pascal,nba
4,"Man I love basketball, this is what separates ...",nba


### Combining NBA Posts and Comments

In [21]:
#Combine Posts with Comments and reset index
nba = pd.concat([nba_posts,nba_comments], axis =0)
nba = nba.reset_index(drop=True)
#Drop duplicate copies
nba.drop_duplicates(keep='first',inplace=True)

## Finally, combine the NBA and Real Housewives dataframes (posts and comments for both) and export for modeling

In [22]:
#create dataframe of two columns: Text and Target
df = pd.concat([rhw,nba],axis=0)
df.reset_index(drop=True,inplace=True)

#Binarize Target variable: 1 for RealHousewives, 0 for NBA
df['subreddit'] = df['subreddit'].apply(lambda x: 1 if x == 'BravoRealHousewives' else 0)
#rename columns
df.columns = ['text','target']

df.head()

Unnamed: 0,text,target
0,/r/BravoRealHousewives daily OT thread. Today ...,1
1,The Real Housewives of New Jersey S09E07 - Bru...,1
2,If we could pool our money and hire MKE to do ...,1
3,Gotta pay for that wedding somehow but holy Fa...,1
4,RHONJ Season 9 Midseason Trailer,1


In [23]:
df.shape

(3604, 2)
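Before exporting, it can be worth confirming that the binarized target is reasonably balanced between the two subreddits. The sketch below is an illustration on made-up rows (`toy` is hypothetical): it uses a vectorized comparison in place of the `apply` with a lambda and counts each class with `value_counts`.

```python
import pandas as pd

# Hypothetical rows standing in for the combined dataframe
toy = pd.DataFrame({
    'body': ['Bethenny said what?', 'KD from midrange', 'Reunion seating chart'],
    'subreddit': ['BravoRealHousewives', 'nba', 'BravoRealHousewives'],
})

labelled = toy.rename(columns={'body': 'text', 'subreddit': 'target'})
# 1 for Real Housewives, 0 for NBA, same encoding as the cell above
labelled['target'] = (labelled['target'] == 'BravoRealHousewives').astype(int)
# number of rows per class
class_counts = labelled['target'].value_counts()
```

The comparison-and-cast form gives the same labels as the lambda and tends to be faster on large frames.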

In [24]:
#export combined dataframe of Real Housewives and NBA posts and comments
df.to_csv('../data/data_final.csv',index=False)