# Project 3: Subreddit Post Classification

## Data Collection

In this part of the project, we perform web scraping to collect posts from two subreddits, `r/audiobooks` and `r/booksuggestions`.

In [1]:
# Imports 
import requests
import pandas as pd
import time
import random

## Web Scraping

- Loop to collect posts from [audiobooks](https://www.reddit.com/r/audiobooks/)

In [5]:
# Define the url
url = 'https://www.reddit.com/r/audiobooks.json'

In [6]:
# Scraping loop: page through the subreddit listing
posts = []
after = None

for a in range(50):       # number of pages to request
    if after is None:
        current_url = url
    else:
        current_url = url + '?after=' + after          # request the page that follows the last post seen
    print(current_url)
    res = requests.get(current_url, headers={'User-agent': 'Pony Inc 1.0'})

    if res.status_code != 200:
        print('Status error', res.status_code)    # stop if the request failed
        break

    current_dict = res.json()
    current_posts = [p['data'] for p in current_dict['data']['children']]
    posts.extend(current_posts)
    # `after` is None on the last page; the next request then starts again from the first page
    after = current_dict['data']['after']

    if a > 0:
        prev_posts = pd.read_csv('abooks.csv')             # posts saved by earlier iterations
        current_df = pd.DataFrame(current_posts)           # only the posts from this page
        pd.concat([prev_posts,current_df],axis=0).to_csv('abooks.csv',index=False)    # append and save

    else:
        pd.DataFrame(posts).to_csv('abooks.csv', index = False)

    # generate a random sleep duration to look more 'natural'
    sleep_duration = random.randint(2,20)
    print(sleep_duration)
    time.sleep(sleep_duration)

https://www.reddit.com/r/audiobooks.json
6
https://www.reddit.com/r/audiobooks.json?after=t3_ilfwgo
3
https://www.reddit.com/r/audiobooks.json?after=t3_ik5fqr
7
https://www.reddit.com/r/audiobooks.json?after=t3_iiaivz
10
https://www.reddit.com/r/audiobooks.json?after=t3_igbn8d
20
https://www.reddit.com/r/audiobooks.json?after=t3_ieljwj
5
https://www.reddit.com/r/audiobooks.json?after=t3_ida52k
11
https://www.reddit.com/r/audiobooks.json?after=t3_ibsc0r
2
https://www.reddit.com/r/audiobooks.json?after=t3_i9s1jo
12
https://www.reddit.com/r/audiobooks.json?after=t3_i7rnct
17
https://www.reddit.com/r/audiobooks.json?after=t3_i61ic0
7
https://www.reddit.com/r/audiobooks.json?after=t3_i3mu50
19
https://www.reddit.com/r/audiobooks.json?after=t3_i2hy0z
10
https://www.reddit.com/r/audiobooks.json?after=t3_i0xm45
15
https://www.reddit.com/r/audiobooks.json?after=t3_hz60yi
8
https://www.reddit.com/r/audiobooks.json?after=t3_hxja5d
9
https://www.reddit.com/r/audiobooks.json?after=t3_hv87o7
16
http
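
The loop above always runs for 50 iterations. A minimal sketch of a variant that stops as soon as the listing reports no further page (`after` is `None`) is shown below. The names `collect_posts`, `fetch_page`, and `fetch_reddit_page` are hypothetical helpers introduced for illustration, not part of the original notebook; separating the fetch step from the paging logic is only one possible design.

```python
from typing import Callable, Optional


def collect_posts(fetch_page: Callable[[Optional[str]], dict], max_pages: int = 50) -> list:
    """Follow `after` tokens page by page until the listing is exhausted."""
    posts = []
    after = None
    for _ in range(max_pages):
        listing = fetch_page(after)["data"]
        posts.extend(child["data"] for child in listing["children"])
        after = listing["after"]
        if after is None:  # last page reached: stop instead of starting over
            break
    return posts


def fetch_reddit_page(after: Optional[str]) -> dict:
    """Hypothetical fetcher for r/audiobooks; performs a real HTTP request."""
    import requests  # imported here so collect_posts can be used without it

    res = requests.get(
        "https://www.reddit.com/r/audiobooks.json",
        params={"after": after} if after else None,
        headers={"User-agent": "Pony Inc 1.0"},
        timeout=30,
    )
    res.raise_for_status()  # raise on any non-2xx status
    return res.json()
```

Calling `collect_posts(fetch_reddit_page)` would issue the requests; a sleep between pages, as in the loop above, could be added inside the fetcher.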

In [7]:
# total amount of posts scraped
len(posts)

1238

In [8]:
# save as a dataframe
a_books = pd.DataFrame(posts)
a_books.to_csv('audiobook.csv',index=False)

In [10]:
# drop every post whose selftext appears more than once (keep=False removes all copies, including the first)
a_books.drop_duplicates(subset='selftext',keep=False,inplace=True)

In [11]:
a_books.shape

(688, 111)
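
To illustrate what `keep=False` does compared with `keep='first'`, here is a small sketch on made-up toy data (the `toy` frame and its contents are assumptions for illustration only, not scraped posts):

```python
import pandas as pd

# Toy data: "post A" appears twice
toy = pd.DataFrame({
    "selftext": ["post A", "post B", "post A", "post C"],
    "subreddit": ["audiobooks"] * 4,
})

# keep=False drops every row whose selftext is repeated anywhere
all_copies_removed = toy.drop_duplicates(subset="selftext", keep=False)

# keep="first" keeps one copy of each repeated selftext
first_copy_kept = toy.drop_duplicates(subset="selftext", keep="first")

print(all_copies_removed["selftext"].tolist())  # ['post B', 'post C']
print(first_copy_kept["selftext"].tolist())     # ['post A', 'post B', 'post C']
```

With `keep=False`, a post that was scraped twice disappears entirely, which is stricter than removing only the repeated copies.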

In [12]:
a_books.head()

| | approved_at_utc | subreddit | selftext | author_fullname | saved | mod_reason_title | gilded | clicked | title | link_flair_richtext | ... | num_crossposts | media | is_video | post_hint | preview | link_flair_template_id | crosspost_parent_list | url_overridden_by_dest | crosspost_parent | author_cakeday |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 252 | | audiobooks | I am a 24f and I only listen to audiobooks wit... | t2_3933piuh | False | | 0 | False | Is anyone else weird like me when it comes to ... | [] | ... | 0 | | False | | | 36fa779c-80d6-11e4-9559-22000b3617ab | | | | |
| 253 | | audiobooks | I am using the app Libby to listen to audioboo... | t2_415dp1oy | False | | 0 | False | Obtaining a digital library card | [] | ... | 0 | | False | | | 36fa779c-80d6-11e4-9559-22000b3617ab | | | | |
| 254 | | audiobooks | Is there a way to bookmark where i am in a aud... | t2_5tei9wlk | False | | 0 | False | About Groove music | [] | ... | 0 | | False | | | 36fa779c-80d6-11e4-9559-22000b3617ab | | | | |
| 255 | | audiobooks | Hi all,\n\n"The Zombie Letters" by Billie Dean... | t2_13igm1 | False | | 0 | False | [FREE US &amp; UK Promotion for "The Zombie Le... | [] | ... | 0 | | False | | | 2bf22082-8f52-11e3-8877-12313d224170 | | | | |
| 256 | | audiobooks | If anyone watches the movies 'The Grand Budape... | t2_4i02pdrl | False | | 0 | False | Narrated by Jude Law? | [] | ... | 0 | | False | self | {'images': [{'source': {'url': 'https://extern... | bc43e32c-063a-11e5-a69f-0e874b638b83 | | | | |


In [14]:
# save the de-duplicated dataframe
a_books.to_csv('audiobook_1.csv',index=False)

In [15]:
# make a data frame with data needed from audiobooks
audiob = a_books[['subreddit','selftext']].copy()
audiob.head()

| | subreddit | selftext |
|---|---|---|
| 252 | audiobooks | I am a 24f and I only listen to audiobooks wit... |
| 253 | audiobooks | I am using the app Libby to listen to audioboo... |
| 254 | audiobooks | Is there a way to bookmark where i am in a aud... |
| 255 | audiobooks | Hi all,\n\n"The Zombie Letters" by Billie Dean... |
| 256 | audiobooks | If anyone watches the movies 'The Grand Budape... |


- Loop to collect posts from [booksuggestions](https://www.reddit.com/r/booksuggestions/)

Repeat the steps above for `r/booksuggestions`.

In [16]:
url2 = 'https://www.reddit.com/r/booksuggestions.json'

In [17]:
posts2 = []
after = None

for a in range(50):       # number of pages to request
    if after is None:
        current_url = url2
    else:
        current_url = url2 + '?after=' + after          # request the page that follows the last post seen
    print(current_url)
    res = requests.get(current_url, headers={'User-agent': 'Pony Inc 1.0'})

    if res.status_code != 200:
        print('Status error', res.status_code)    # stop if the request failed
        break

    current_dict = res.json()
    current_posts = [p['data'] for p in current_dict['data']['children']]
    posts2.extend(current_posts)
    # `after` is None on the last page; the next request then starts again from the first page
    after = current_dict['data']['after']

    if a > 0:
        prev_posts = pd.read_csv('books.csv')              # posts saved by earlier iterations
        current_df = pd.DataFrame(current_posts)           # only the posts from this page
        pd.concat([prev_posts,current_df],axis=0).to_csv('books.csv',index=False)    # append and save

    else:
        pd.DataFrame(posts2).to_csv('books.csv', index = False)

    # generate a random sleep duration to look more 'natural'
    sleep_duration = random.randint(2,30)
    print(sleep_duration)
    time.sleep(sleep_duration)

https://www.reddit.com/r/booksuggestions.json
7
https://www.reddit.com/r/booksuggestions.json?after=t3_im8p7m
29
https://www.reddit.com/r/booksuggestions.json?after=t3_ilrtt8
26
https://www.reddit.com/r/booksuggestions.json?after=t3_iluzrz
27
https://www.reddit.com/r/booksuggestions.json?after=t3_ikyvk1
22
https://www.reddit.com/r/booksuggestions.json?after=t3_il9za5
2
https://www.reddit.com/r/booksuggestions.json?after=t3_ilfvfk
3
https://www.reddit.com/r/booksuggestions.json?after=t3_il5u4d
13
https://www.reddit.com/r/booksuggestions.json?after=t3_il0rj3
24
https://www.reddit.com/r/booksuggestions.json?after=t3_ikqi13
15
https://www.reddit.com/r/booksuggestions.json?after=t3_ikfnr6
21
https://www.reddit.com/r/booksuggestions.json?after=t3_ikh4mv
6
https://www.reddit.com/r/booksuggestions.json?after=t3_ikcn0j
28
https://www.reddit.com/r/booksuggestions.json?after=t3_ik5njh
14
https://www.reddit.com/r/booksuggestions.json?after=t3_ijtrx3
22
https://www.reddit.com/r/booksuggestions.json

4
https://www.reddit.com/r/booksuggestions.json?after=t3_ifujb8
9
https://www.reddit.com/r/booksuggestions.json?after=t3_ifhf15
23
https://www.reddit.com/r/booksuggestions.json?after=t3_ifjq1j
14
https://www.reddit.com/r/booksuggestions.json
21
https://www.reddit.com/r/booksuggestions.json?after=t3_im8p7m
25
https://www.reddit.com/r/booksuggestions.json?after=t3_ilrtt8
11
https://www.reddit.com/r/booksuggestions.json?after=t3_iluzrz
7
https://www.reddit.com/r/booksuggestions.json?after=t3_ikyvk1
9
https://www.reddit.com/r/booksuggestions.json?after=t3_il9za5
29
https://www.reddit.com/r/booksuggestions.json?after=t3_ilfvfk
30
https://www.reddit.com/r/booksuggestions.json?after=t3_il5u4d
9
https://www.reddit.com/r/booksuggestions.json?after=t3_il0rj3
7
https://www.reddit.com/r/booksuggestions.json?after=t3_ikqi13
11
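
The loop above re-reads and rewrites `books.csv` on every iteration. One possible alternative, sketched here under the assumption that only a fixed set of columns needs to be checkpointed, is to append each page to the file directly. The helper `append_page` and the `COLUMNS` list are hypothetical names introduced for illustration.

```python
import os

import pandas as pd

# Assumed fixed column set, so that rows appended later line up with the header
COLUMNS = ["subreddit", "selftext"]


def append_page(path, page_posts, columns=COLUMNS):
    """Append one page of posts to a CSV file, writing the header only once."""
    frame = pd.DataFrame(page_posts).reindex(columns=columns)
    write_header = not os.path.exists(path)  # header only when the file is new
    frame.to_csv(path, mode="a", header=write_header, index=False)
```

Inside the scraping loop this would be called as `append_page('books.csv', current_posts)`, replacing the read, concatenate, and rewrite steps. Fixing the columns matters because different pages can return posts with different keys.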


In [18]:
len(posts2)

1250

In [19]:
books = pd.DataFrame(posts2)
books.to_csv('booksuggestion.csv',index=False)

In [20]:
books.shape

(1250, 107)

In [22]:
books.drop_duplicates(subset='selftext',keep=False,inplace=True)

In [23]:
books.shape

(651, 107)

In [24]:
books.head()

| | approved_at_utc | subreddit | selftext | author_fullname | saved | mod_reason_title | gilded | clicked | title | link_flair_richtext | ... | stickied | url | subreddit_subscribers | created_utc | num_crossposts | media | is_video | post_hint | preview | author_cakeday |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 251 | | booksuggestions | Hi! Looking for stories concerning pirates. So... | t2_3a29jo3g | False | | 0 | False | Any Pirate books? | [] | ... | False | https://www.reddit.com/r/booksuggestions/comme... | 340493 | 1598992000.0 | 0 | | False | | | |
| 252 | | booksuggestions | I know this sounds weird. „What kind of wild m... | t2_37beow9v | False | | 0 | False | Fantasy, Magic, Mythology and Science Fiction,... | [] | ... | False | https://www.reddit.com/r/booksuggestions/comme... | 340493 | 1598991000.0 | 0 | | False | | | |
| 253 | | booksuggestions | I feel like this is either very common or very... | t2_gveh3g | False | | 0 | False | Nonfiction that reads like narrative fiction | [] | ... | False | https://www.reddit.com/r/booksuggestions/comme... | 340493 | 1598991000.0 | 0 | | False | | | |
| 255 | | booksuggestions | Recently I started school from 8 to 3 with 15 ... | t2_5qsrmgx6 | False | | 0 | False | What are some easy reads | [] | ... | False | https://www.reddit.com/r/booksuggestions/comme... | 340493 | 1598976000.0 | 0 | | False | | | |
| 256 | | booksuggestions | I enjoyed it personally | t2_85nrd | False | | 0 | False | Military Sci-Fi books like Starship Troopers | [] | ... | False | https://www.reddit.com/r/booksuggestions/comme... | 340493 | 1598988000.0 | 0 | | False | | | |


In [25]:
books.to_csv('booksuggestion_1.csv',index=False)

In [26]:
# create dataframe with data needed from booksuggestions
book = books[['subreddit','selftext']].copy()
book.head()

| | subreddit | selftext |
|---|---|---|
| 251 | booksuggestions | Hi! Looking for stories concerning pirates. So... |
| 252 | booksuggestions | I know this sounds weird. „What kind of wild m... |
| 253 | booksuggestions | I feel like this is either very common or very... |
| 255 | booksuggestions | Recently I started school from 8 to 3 with 15 ... |
| 256 | booksuggestions | I enjoyed it personally |


### Combined DataFrame with the data needed from both subreddits

In [27]:
df_list = [audiob,book]

df = pd.concat(df_list)
df.head()

| | subreddit | selftext |
|---|---|---|
| 252 | audiobooks | I am a 24f and I only listen to audiobooks wit... |
| 253 | audiobooks | I am using the app Libby to listen to audioboo... |
| 254 | audiobooks | Is there a way to bookmark where i am in a aud... |
| 255 | audiobooks | Hi all,\n\n"The Zombie Letters" by Billie Dean... |
| 256 | audiobooks | If anyone watches the movies 'The Grand Budape... |


In [28]:
# shape of the final dataset
df.shape

(1339, 2)
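
The 1339 rows are the 688 audiobooks posts plus the 651 booksuggestions posts. Since the combined data is meant for classification, it can be useful to confirm the class balance and to give the combined frame a fresh index (both source frames keep their original row labels, so labels such as 252 occur twice). The sketch below uses small made-up frames, `audiob_toy` and `book_toy`, which are assumptions for illustration and not the scraped data.

```python
import pandas as pd

# Toy stand-ins for the two per-subreddit frames
audiob_toy = pd.DataFrame({"subreddit": ["audiobooks"] * 3,
                           "selftext": ["a1", "a2", "a3"]})
book_toy = pd.DataFrame({"subreddit": ["booksuggestions"] * 2,
                         "selftext": ["b1", "b2"]})

# ignore_index=True renumbers the rows 0..n-1 instead of repeating labels
combined = pd.concat([audiob_toy, book_toy], ignore_index=True)

# Number of posts and share of posts per class
counts = combined["subreddit"].value_counts()
shares = combined["subreddit"].value_counts(normalize=True)

print(counts.to_dict())  # {'audiobooks': 3, 'booksuggestions': 2}
```

Applied to the real data, the same two `value_counts` calls on `df['subreddit']` would show how evenly the two subreddits are represented.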

In [29]:
# save the dataset to csv
df.to_csv('combinedbook.csv',index=False)