**Project 3: Web APIs & Classification Part 1 - Data Extracting**

# Problem statement

The goal is to train a classifier, using data obtained by web scraping, to predict which subreddit a given post came from.

I will be scraping data from www.reddit.com. Two subreddits will be selected and 1000 posts will be scraped from each. The post data will then be used to train classifier models, which will predict the subreddit that each test post came from.

In [1]:
import requests
import random
import time
from bs4 import BeautifulSoup
import pandas as pd
from scrapy.selector import Selector
from scrapy.http import HtmlResponse

# Data Collection / Web Scraping

Selected subreddits are Parenting and Childfree.<br>
<br>
The Parenting subreddit mainly discusses issues related to raising a child. The Childfree subreddit mainly discusses events that prompt people not to have children. The two subreddits share many similar words, but the context is different. The target is to scrape 1000 posts from each subreddit. These will later be split into training and testing data.

## Parenting subreddit

Requesting the new.json listing returns the newest posts as JSON, which should let us page through roughly 1000 posts without issues.
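As a small illustration of how the listing URL is put together, here is a sketch of a helper that builds the new.json URL for a subreddit. The function name `build_listing_url` is hypothetical and not part of the original notebook. The `limit` parameter is an assumption based on my understanding of Reddit's listing API (25 posts per page by default, up to 100 per request); the notebook itself only uses `after`.

```python
from urllib.parse import urlencode

def build_listing_url(subreddit, after=None, limit=25):
    """Build the URL for one page of a subreddit's 'new' listing (hypothetical helper)."""
    base = f"https://www.reddit.com/r/{subreddit}/new.json"
    # 'limit' is an assumed query parameter; 25 matches the page size seen in this notebook.
    params = {"limit": limit}
    if after is not None:
        # 'after' is the pagination token returned in the previous page's data
        params["after"] = after
    return base + "?" + urlencode(params)

print(build_listing_url("Parenting"))
print(build_listing_url("Parenting", after="t3_dmawu1"))
```

Building the query string with urlencode avoids hand-concatenating '?' and '&' when more than one parameter is needed.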

In [2]:
url = 'https://www.reddit.com/r/Parenting/new.json'
response = requests.get(url,headers={'User-agent': 'Pony Inc 1.0'}) # a User-agent header is required because Reddit blocks default Python web scraping requests.
response

<Response [200]>

A 200 status code means the request succeeded and the connection went through.

Feed the data into a dictionary. With the dictionary, we can dig further to select the title and post content.

In [3]:
reddit_parenting_dict = response.json() 

In [4]:
reddit_parenting_dict.keys()

dict_keys(['kind', 'data'])

In [5]:
reddit_parenting_dict['data'].keys()

dict_keys(['modhash', 'dist', 'children', 'after', 'before'])

In [6]:
reddit_parenting_dict['data']['children'][0]['data']['title']

"My 5yo son wants to wear skirts - I don't mind but I had a few questions"

In [7]:
reddit_parenting_dict['data']['children'][0]['data']['selftext']

'My son turned 5 a few months ago. Since then we\'ve been letting him pick his own clothes once a week. At first he would put together his existing clothes (he once picked his turtles onesie and rocked it) then asked for new clothes he saw on TV or his friends wear. Of late he\'s been asking for female clothes, skirts especially. It\'s mostly stuff like "I want that skirt Emily wore on Tuesday" which is super specific and helpful.\n\nI don\'t care what he wears as long as it covers him and makes him comfy. But it\'s there something in terms of gender identity or am I overthinking it? He hasn\'t mentioned anything about it and I haven\'t brought it up with him because he\'s only a child. I think his skirts obsession is a fad and it\'s more driven by his friends than anything else. I\'ve not bought him any girls clothing yet and put it off by saying things like "it\'s cold today" or "you look so handsome in that shirt".\n\nTo be clear, I don\'t care how he dresses, and how he feels about

The title and selftext are what we require.
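The nested lookups above can be wrapped in a small function. The sketch below runs on a hand-made stand-in for the dictionary returned by response.json(); the sample posts and the function name `extract_title_and_text` are invented for illustration and are not real Reddit data.

```python
# A tiny stand-in with the same nesting as the real listing:
# top-level 'data' -> 'children' -> each child's 'data' -> 'title' / 'selftext'.
sample_listing = {
    "kind": "Listing",
    "data": {
        "after": "t3_example",
        "children": [
            {"kind": "t3", "data": {"title": "First post", "selftext": "Body one"}},
            {"kind": "t3", "data": {"title": "Second post", "selftext": ""}},
        ],
    },
}

def extract_title_and_text(listing):
    """Return a list of (title, selftext) tuples, one per post in the listing."""
    return [
        (child["data"].get("title", ""), child["data"].get("selftext", ""))
        for child in listing["data"]["children"]
    ]

rows = extract_title_and_text(sample_listing)
print(rows)
```

Using .get with a default keeps the function from failing on a post that lacks one of the two fields.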

In [8]:
posts = []
after = None

for a in range(40): # each iteration scrapes only 25 reddit posts, so 40 loops will get about 1000 posts.
    if after is None:
        current_url = url
    else:
        current_url = url + '?after=' + after # after each loop, the url jumps to the next page
    print(current_url)
    response = requests.get(current_url, headers={'User-agent': 'Pony Inc 2.0'})
    
    if response.status_code != 200: 
        print('Status error', response.status_code)
        break
    
    current_dict = response.json()
    current_posts = [p['data'] for p in current_dict['data']['children']] # all data under children will be scraped. This includes our title and selftext.
    posts.extend(current_posts)
    after = current_dict['data']['after']
    
    # generate a random sleep duration to look more 'natural'
    sleep_duration = random.randint(2,7)
    print(sleep_duration)
    time.sleep(sleep_duration)

https://www.reddit.com/r/Parenting/new.json
3
https://www.reddit.com/r/Parenting/new.json?after=t3_dmawu1
4
https://www.reddit.com/r/Parenting/new.json?after=t3_dm7lz1
7
https://www.reddit.com/r/Parenting/new.json?after=t3_dm1dbd
3
https://www.reddit.com/r/Parenting/new.json?after=t3_dlvg38
4
https://www.reddit.com/r/Parenting/new.json?after=t3_dlomez
2
https://www.reddit.com/r/Parenting/new.json?after=t3_dliz90
7
https://www.reddit.com/r/Parenting/new.json?after=t3_dlcca2
6
https://www.reddit.com/r/Parenting/new.json?after=t3_dl85dc
7
https://www.reddit.com/r/Parenting/new.json?after=t3_dl2rc9
4
https://www.reddit.com/r/Parenting/new.json?after=t3_dktwsd
4
https://www.reddit.com/r/Parenting/new.json?after=t3_dkmh0h
2
https://www.reddit.com/r/Parenting/new.json?after=t3_dkelyx
6
https://www.reddit.com/r/Parenting/new.json?after=t3_dk5rpd
4
https://www.reddit.com/r/Parenting/new.json?after=t3_djv5og
7
https://www.reddit.com/r/Parenting/new.json?after=t3_djox7q
2
https://www.reddit.com/r
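In the loop above, if the listing returns after as None before all 40 iterations finish, the next request goes back to the first page. The sketch below shows one way to stop as soon as there is no next page. It is an illustrative variant, not the code that produced the output above. The names `scrape_listing` and `fetch_page` are hypothetical, and the fake pages stand in for real responses so the example runs offline.

```python
import random
import time

def scrape_listing(fetch_page, max_pages=40, min_sleep=2, max_sleep=7, sleep=time.sleep):
    """Collect post dictionaries page by page.

    fetch_page(after) must return the parsed JSON listing for one page.
    """
    posts = []
    after = None
    for _ in range(max_pages):
        listing = fetch_page(after)
        posts.extend(child["data"] for child in listing["data"]["children"])
        after = listing["data"]["after"]
        if after is None:
            # No further pages: stop instead of requesting the first page again.
            break
        # Random pause between requests, as in the original loop.
        sleep(random.randint(min_sleep, max_sleep))
    return posts

# Fake two-page listing (invented data) keyed by the 'after' token.
fake_pages = {
    None: {"data": {"children": [{"data": {"title": "a"}}, {"data": {"title": "b"}}], "after": "t3_x"}},
    "t3_x": {"data": {"children": [{"data": {"title": "c"}}], "after": None}},
}

# Pass a no-op sleep so the demonstration finishes immediately.
demo_posts = scrape_listing(fake_pages.get, sleep=lambda seconds: None)
print([p["title"] for p in demo_posts])
```

With real data, fetch_page would be a small function that calls requests.get with the User-agent header, appends the after token when it is not None, and returns response.json(). Passing the fetch function in also lets the same loop serve both subreddits.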

Put the data into a DataFrame.

In [9]:
df_parenting=pd.DataFrame(posts) 
df_parenting.shape

(994, 99)

Check for duplicated posts.

In [10]:
df_parenting_duplicated=df_parenting[df_parenting.duplicated(subset ="title", 
                     keep = 'first')] 

In [11]:
df_parenting_duplicated.shape #checking for duplicated posts

(3, 99)

There are 3 duplicate posts.

In [12]:
df_parenting.drop_duplicates(subset ="title", #drop 3 posts as these are duplicates.
                     keep = 'first', inplace = True) 

In [13]:
df_parenting.shape 

(991, 99)

Removed 3 duplicate posts.
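The number of rows flagged by duplicated should equal the number of rows that drop_duplicates removes, which is the drop from 994 to 991 rows seen above. A minimal sketch on a made-up DataFrame (the titles and text are invented) shows the relationship:

```python
import pandas as pd

# Invented data: titles "A" and "B" each appear twice.
demo = pd.DataFrame({
    "title": ["A", "B", "A", "C", "B"],
    "selftext": ["first A", "first B", "second A", "only C", "second B"],
})

# keep='first' marks every repeat after the first occurrence as a duplicate
dup_mask = demo.duplicated(subset="title", keep="first")
n_duplicates = int(dup_mask.sum())

# Dropping with the same arguments removes exactly the flagged rows
deduped = demo.drop_duplicates(subset="title", keep="first").reset_index(drop=True)

print(n_duplicates, len(demo), len(deduped))
```

Note that deduplicating on title alone treats two different posts with the same title as duplicates. Adding selftext to the subset would be stricter.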

In [14]:
df_parenting2=df_parenting[['title','selftext']] # We will only require title and the selftext

In [15]:
df_parenting2['type']='parenting' #Labelling the target subreddit

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  """Entry point for launching an IPython kernel.


In [16]:
df_parenting2.to_csv('../datasets/parenting.csv', index = False) #export to csv file (parenting.csv, so the childfree export does not overwrite it)
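The SettingWithCopyWarning shown above appears because the type column is assigned on a DataFrame that was sliced out of another one. A minimal sketch of two common ways to avoid it follows. The small DataFrame is invented for illustration, and neither variant is the code that was actually run in this notebook.

```python
import pandas as pd

# Invented stand-in for the full scraped DataFrame
raw = pd.DataFrame({
    "title": ["t1", "t2"],
    "selftext": ["s1", "s2"],
    "score": [1, 2],
})

# Option 1: take an explicit copy of the selected columns before adding the label
labelled = raw[["title", "selftext"]].copy()
labelled["type"] = "parenting"

# Option 2: assign returns a new DataFrame with the extra column
labelled_alt = raw[["title", "selftext"]].assign(type="parenting")

print(labelled)
```

Both options leave the original DataFrame unchanged and produce the same three-column result.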

## Childfree subreddit

In [17]:
url = 'https://www.reddit.com/r/childfree/new.json'
response = requests.get(url,headers={'User-agent': 'Pony Inc 1.0'})
response

<Response [200]>

In [18]:
posts = []
after = None

for a in range(40):
    if after is None:
        current_url = url
    else:
        current_url = url + '?after=' + after
    print(current_url)
    response = requests.get(current_url, headers={'User-agent': 'Pony Inc 2.0'})
    
    if response.status_code != 200:
        print('Status error', response.status_code)
        break
    
    current_dict = response.json()
    current_posts = [p['data'] for p in current_dict['data']['children']]
    posts.extend(current_posts)
    after = current_dict['data']['after']
    
    # generate a random sleep duration to look more 'natural'
    sleep_duration = random.randint(2,7)
    print(sleep_duration)
    time.sleep(sleep_duration)

https://www.reddit.com/r/childfree/new.json
6
https://www.reddit.com/r/childfree/new.json?after=t3_dmf5xm
4
https://www.reddit.com/r/childfree/new.json?after=t3_dm8rf0
3
https://www.reddit.com/r/childfree/new.json?after=t3_dm4nt1
6
https://www.reddit.com/r/childfree/new.json?after=t3_dm188n
4
https://www.reddit.com/r/childfree/new.json?after=t3_dlws3g
7
https://www.reddit.com/r/childfree/new.json?after=t3_dlrzqk
7
https://www.reddit.com/r/childfree/new.json?after=t3_dlnd9i
5
https://www.reddit.com/r/childfree/new.json?after=t3_dlhowt
4
https://www.reddit.com/r/childfree/new.json?after=t3_dl98lm
2
https://www.reddit.com/r/childfree/new.json?after=t3_dl4dfm
4
https://www.reddit.com/r/childfree/new.json?after=t3_dkw7oa
5
https://www.reddit.com/r/childfree/new.json?after=t3_dkrjy2
7
https://www.reddit.com/r/childfree/new.json?after=t3_dkms5h
4
https://www.reddit.com/r/childfree/new.json?after=t3_dkg3h6
2
https://www.reddit.com/r/childfree/new.json?after=t3_dk9h9a
5
https://www.reddit.com/r

In [19]:
df_childfree=pd.DataFrame(posts)
df_childfree.shape

(995, 99)

In [20]:
df_childfree_duplicated=df_childfree[df_childfree.duplicated(subset ="title",
                     keep = 'first')]

In [21]:
df_childfree_duplicated.shape

(0, 99)

In [22]:
df_childfree2=df_childfree[['title','selftext']]

In [23]:
df_childfree2['type']='childfree'

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  """Entry point for launching an IPython kernel.


In [24]:
df_childfree2.to_csv('../datasets/childfree.csv', index = False) #export to csv file

**Data collection ends here. Part 2 continues with data cleaning and modelling.**