# Project 3: SubReddit Classifier

---

# Data Collection and Cleaning
---

## Cooking


This section scrapes the top 1000 posts from the 'cooking' subreddit. Once the data has been scraped, duplicate posts, posts with null values and unwanted columns are removed, and the clean dataset is saved to a csv file.

### Data Collection

In [2]:
#Imports:
import requests
import pandas as pd
import time
import random

In [2]:
# specify the url
url = 'https://www.reddit.com/r/Cooking.json'

In [3]:
#requesting the website
res = requests.get(url)

In [4]:
#check the response status code
res.status_code

429

In [5]:
#The requests library sends its own default user agent.
#So many scripts already 'hit' reddit's API with that default
#that reddit tends to reject such requests (hence the 429 status code above).
#We change our request a little so that it does not use the default user agent.
#creating useragent
res = requests.get(url, headers={'User-agent': 'Pony Inc 1.0'})

In [7]:
#check the response status code
res.status_code

200
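The two status codes above (429 with the default user agent, 200 with a custom one) suggest wrapping the request in a small helper. The following is a minimal sketch, not part of the original notebook: `fetch_json`, `max_retries`, `base_delay` and the injectable `get` and `sleep` parameters are hypothetical names introduced here, and the doubling back-off on 429 is an assumption rather than anything reddit documents for this endpoint.

```python
import time
import requests

# Same custom header as in the cell above.
HEADERS = {'User-agent': 'Pony Inc 1.0'}

def fetch_json(url, get=requests.get, max_retries=3, base_delay=2.0, sleep=time.sleep):
    """Return the parsed JSON body of `url`, retrying when the status is 429.

    `get` and `sleep` are injectable (hypothetical parameters) so the
    function can be exercised without touching the network.
    """
    for attempt in range(max_retries + 1):
        res = get(url, headers=HEADERS)
        if res.status_code == 200:
            return res.json()
        # Give up on any non-429 error, or once the retries are used up.
        if res.status_code != 429 or attempt == max_retries:
            raise RuntimeError(f'Status error {res.status_code} for {url}')
        # Wait a little longer after each rejected attempt.
        sleep(base_delay * (2 ** attempt))
```

With this helper, `reddit_dict = fetch_json(url)` would replace the separate request and status check.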

In [8]:
#Parses the JSON body of the response into a Python dictionary, stored in reddit_dict
reddit_dict = res.json()

In [9]:
posts = []
after = None

# each request returns about 25 posts, so the loop runs 45 times to collect about 1100 posts
for a in range(45):
    if after is None:
        current_url = url
    else:
        current_url = url + '?after=' + after   # 'after' holds the name of the last post in the previous batch
    print(current_url)
    res = requests.get(current_url, headers={'User-agent': 'Pony Inc 1.0'}) #defining the user agent
    
    if res.status_code != 200:  # if the response is not successful, print the status error and stop the loop
        print('Status error', res.status_code)
        break
    
    current_dict = res.json() # parse and store the JSON response
    current_posts = [p['data'] for p in current_dict['data']['children']]
    posts.extend(current_posts) # add the current batch to the list called posts
    after = current_dict['data']['after'] # name of the last post in the current batch
    
    # generate a random sleep duration to look more 'natural'
    sleep_duration = random.randint(2,60)
    print(sleep_duration)
    time.sleep(sleep_duration)

https://www.reddit.com/r/Cooking.json
43
https://www.reddit.com/r/Cooking.json?after=t3_esbcc1
47
https://www.reddit.com/r/Cooking.json?after=t3_esevn7
52
https://www.reddit.com/r/Cooking.json?after=t3_es9q3k
50
https://www.reddit.com/r/Cooking.json?after=t3_es5rgp
49
https://www.reddit.com/r/Cooking.json?after=t3_es19gt
57
https://www.reddit.com/r/Cooking.json?after=t3_ervu4t
30
https://www.reddit.com/r/Cooking.json?after=t3_eqx9jz
48
https://www.reddit.com/r/Cooking.json?after=t3_erko7w
29
https://www.reddit.com/r/Cooking.json?after=t3_ere5ht
11
https://www.reddit.com/r/Cooking.json?after=t3_er99n7
56
https://www.reddit.com/r/Cooking.json?after=t3_er4510
51
https://www.reddit.com/r/Cooking.json?after=t3_eql8t1
57
https://www.reddit.com/r/Cooking.json?after=t3_eqsf3y
5
https://www.reddit.com/r/Cooking.json?after=t3_eqhsk4
16
https://www.reddit.com/r/Cooking.json?after=t3_eqho90
44
https://www.reddit.com/r/Cooking.json?after=t3_eq8wee
35
https://www.reddit.com/r/Cooking.json?after=t3_e
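The loop above follows the `after` cursor from one page to the next. A minimal sketch of the same idea as a generator is shown below. It is not the notebook's code: `iter_pages` and `fetch` are hypothetical names, and `fetch` stands for any callable that maps a URL to the parsed JSON listing. The sketch also stops once `after` comes back as `None`, on the assumption that reddit returns a null cursor when there is no further page; without that check the next request would go to the base URL and fetch the first page again.

```python
def iter_pages(fetch, url, max_pages=45):
    """Yield the list of post dicts found on each listing page.

    `fetch` is a hypothetical callable: URL in, parsed JSON listing out.
    """
    after = None
    for _ in range(max_pages):
        # The first request has no cursor; later ones continue after the last post seen.
        current_url = url if after is None else url + '?after=' + after
        listing = fetch(current_url)
        yield [child['data'] for child in listing['data']['children']]
        after = listing['data']['after']
        if after is None:
            # No further page: stop instead of requesting the first page again.
            break
```

A call such as `posts = [p for page in iter_pages(fetch, url) for p in page]` would then build the same flat list of posts; the random sleep between requests can live inside `fetch`.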

In [12]:
posts = []
after = None

for a in range(45):
    if after is None:
        current_url = url
    else:
        current_url = url + '?after=' + after
    print(current_url)
    res = requests.get(current_url, headers={'User-agent': 'Pony Inc 1.0'})
    
    if res.status_code != 200:
        print('Status error', res.status_code)
        break
    
    current_dict = res.json()
    current_posts = [p['data'] for p in current_dict['data']['children']]
    posts.extend(current_posts)
    after = current_dict['data']['after']
    
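    # store the information in csv file after every batch,
    # so a failed request does not lose the posts collected so far
    current_df = pd.DataFrame(current_posts)
    if a > 0:
        # append the current batch to the posts already on disk
        prev_posts = pd.read_csv('../data/cooking.csv')
        current_df = pd.concat([prev_posts, current_df], ignore_index=True, sort=False)
    current_df.to_csv('../data/cooking.csv', index = False)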

    # generate a random sleep duration to look more 'natural'
    sleep_duration = random.randint(2,6)
    print(sleep_duration)
    time.sleep(sleep_duration)

https://www.reddit.com/r/Cooking.json
6
https://www.reddit.com/r/Cooking.json?after=t3_eslspu
5
https://www.reddit.com/r/Cooking.json?after=t3_eshvsy
5
https://www.reddit.com/r/Cooking.json?after=t3_es8r6n
2
https://www.reddit.com/r/Cooking.json?after=t3_es8bmy
5
https://www.reddit.com/r/Cooking.json?after=t3_erljl2
6
https://www.reddit.com/r/Cooking.json?after=t3_es175o
6
https://www.reddit.com/r/Cooking.json?after=t3_erumd1
3
https://www.reddit.com/r/Cooking.json?after=t3_erfcmo
2
https://www.reddit.com/r/Cooking.json?after=t3_erdrpa
6
https://www.reddit.com/r/Cooking.json?after=t3_er7mga
3
https://www.reddit.com/r/Cooking.json?after=t3_eqwhe4
4
https://www.reddit.com/r/Cooking.json?after=t3_eqwq7r
5
https://www.reddit.com/r/Cooking.json?after=t3_eqt6da
5
https://www.reddit.com/r/Cooking.json?after=t3_eqnw0x
6
https://www.reddit.com/r/Cooking.json?after=t3_eqjuym
2
https://www.reddit.com/r/Cooking.json?after=t3_eq311o
5
https://www.reddit.com/r/Cooking.json?after=t3_eqbb02
4
https://

In [13]:
# store all collected posts in the csv file
pd.DataFrame(posts).to_csv('../data/cooking.csv', index = False)
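An alternative to rewriting the whole file is to append each batch as it arrives. The sketch below is one possible way to do that under stated assumptions: `append_batch` and its `columns` argument are hypothetical names, and fixing the column list up front is a design choice made here because individual post dictionaries may not all carry the same keys.

```python
import os
import pandas as pd

def append_batch(batch, path, columns):
    """Append one batch of post dicts to the csv file at `path`.

    `columns` fixes the column order so that every appended batch
    lines up with the header written by the first batch.
    """
    df = pd.DataFrame(batch).reindex(columns=columns)
    # Write the header only when the file does not exist yet.
    first_write = not os.path.exists(path)
    df.to_csv(path, mode='a', header=first_write, index=False)
```

Inside the scraping loop this would be called as `append_batch(current_posts, '../data/cooking.csv', columns)`, with `columns` chosen beforehand.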

### Data Cleaning

In [4]:
# reading the csv file into pandas dataframes
cooking=pd.read_csv('../data/cooking.csv')

In [5]:
#displaying the first 5 rows of dataframe
cooking.head()

Unnamed: 0,approved_at_utc,subreddit,selftext,author_fullname,saved,mod_reason_title,gilded,clicked,title,link_flair_richtext,...,stickied,url,subreddit_subscribers,created_utc,num_crossposts,media,is_video,author_cakeday,crosspost_parent_list,crosspost_parent
0,,Cooking,I have been learning how to cook over the past...,t2_2tl9qdub,False,,0,False,I hate my own cooking,[],...,False,https://www.reddit.com/r/Cooking/comments/esh9...,1558181,1579723000.0,0,,False,,,
1,,Cooking,Maybe this is about cooking. Maybe not. Since ...,t2_45mcc44e,False,,0,False,My mom's measuring cup,[],...,False,https://www.reddit.com/r/Cooking/comments/es3e...,1558181,1579651000.0,1,,False,,,
2,,Cooking,I typically save several dollars every trip to...,t2_zv97z,False,,0,False,"Money has been tight this past month, so we’ve...",[],...,False,https://www.reddit.com/r/Cooking/comments/eskl...,1558181,1579737000.0,0,,False,,,
3,,Cooking,"I eat a lot of steamed vegetables, particularl...",t2_89mrw,False,,0,False,How can I make steamed vegetables taste more i...,[],...,False,https://www.reddit.com/r/Cooking/comments/ese5...,1558181,1579710000.0,0,,False,,,
4,,Cooking,I currently own both a Crock Pot and an Instan...,t2_30t97dei,False,,0,False,Can a slow cooker be fully replaced by an Inst...,[],...,False,https://www.reddit.com/r/Cooking/comments/esi2...,1558181,1579727000.0,0,,False,,,


In [6]:
# checking the number of columns and rows
cooking.shape

(1101, 102)

In [7]:
#displaying every row that has an exact duplicate (keep=False shows all copies)
cooking[cooking.duplicated(keep=False)]

Unnamed: 0,approved_at_utc,subreddit,selftext,author_fullname,saved,mod_reason_title,gilded,clicked,title,link_flair_richtext,...,stickied,url,subreddit_subscribers,created_utc,num_crossposts,media,is_video,author_cakeday,crosspost_parent_list,crosspost_parent
177,,Cooking,Hi all!\n\n\nI'm having a pizza night tonight....,t2_13vymy,False,,0,False,Homemade veggie pizza. When to cook toppings?,[],...,False,https://www.reddit.com/r/Cooking/comments/erlv...,1558183,1579563000.0,0,,False,,,
178,,Cooking,https://youtu.be/TgUyiV_gqjg,t2_1k2rgxne,False,,0,False,Breakfast in 1 Minute,[],...,False,https://www.reddit.com/r/Cooking/comments/erxe...,1558183,1579626000.0,0,,False,,,
179,,Cooking,Simply wondering if there's such a thing as so...,t2_dt64r,False,,0,False,"Outside of cooking classes, are there any ways...",[],...,False,https://www.reddit.com/r/Cooking/comments/ermp...,1558183,1579567000.0,0,,False,,,
180,,Cooking,For those of you who have used the ranch seaso...,t2_3oxutu7q,False,,0,False,Hidden Valley Ranch Packet Question,[],...,False,https://www.reddit.com/r/Cooking/comments/ermm...,1558183,1579567000.0,0,,False,,,
181,,Cooking,"\nHi folks,\n\nAs one who loves fitness, cooki...",t2_icsrl,False,,0,False,Cooking resources for one who loves fitness?,[],...,False,https://www.reddit.com/r/Cooking/comments/erml...,1558183,1579567000.0,0,,False,,,
182,,Cooking,I used egg noodles as a sub and need help addi...,t2_1m8qhcos,False,,0,False,Turkey Spaetzle Soup tastes bland,[],...,False,https://www.reddit.com/r/Cooking/comments/erpc...,1558183,1579580000.0,0,,False,,,
184,,Cooking,My last roommate used my deep fryer well over ...,t2_26y456kz,False,,0,False,Is my deep fryer safe to use?,[],...,False,https://www.reddit.com/r/Cooking/comments/ermc...,1558183,1579566000.0,0,,False,,,
185,,Cooking,The only way I can get an over easy egg to NOT...,t2_2eyzwg3b,False,,0,False,"A perfect Over Easy egg. Please Help, they ALW...",[],...,False,https://www.reddit.com/r/Cooking/comments/erm8...,1558183,1579565000.0,0,,False,,,
186,,Cooking,I remember on Sunday afternoons standing in th...,t2_d0f1b,False,,0,False,"Red-eye gravy, how I long for thee ...",[],...,False,https://www.reddit.com/r/Cooking/comments/erlu...,1558183,1579563000.0,0,,False,,,
187,,Cooking,Everything I try from guides and recipes onlin...,t2_2lup7zpp,False,,0,False,How to brown chicken breast?,[],...,False,https://www.reddit.com/r/Cooking/comments/ernn...,1558183,1579572000.0,0,,False,,,


In [8]:
#dropping the duplicates
cooking.drop_duplicates(inplace=True)

In [9]:
# number of rows and columns after removing duplicates
cooking.shape

(1083, 102)
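`drop_duplicates()` with no arguments compares entire rows. In the outputs above the `subreddit_subscribers` value differs between rows (1558181 in the first rows, 1558183 in the duplicated ones), so the same post fetched at two different moments may not count as a full-row duplicate. The toy example below sketches the difference. The data is made up for illustration, and the `name` column (the post's `t3_...` identifier) is an assumption, since the column listing in this notebook is truncated.

```python
import pandas as pd

# Made-up rows: post 't3_a' appears twice with a different subscriber count.
toy = pd.DataFrame({
    'name': ['t3_a', 't3_b', 't3_a', 't3_c'],
    'selftext': ['soup', 'bread', 'soup', 'rice'],
    'subreddit_subscribers': [100, 100, 101, 101],
})

# Full-row comparison: the two copies differ in one column, so nothing is flagged.
full_row_dupes = toy[toy.duplicated(keep=False)]

# Comparing on the post identifier alone flags both copies of 't3_a'.
by_name_dupes = toy[toy.duplicated(subset='name', keep=False)]

# Keep the first copy of each post.
deduped = toy.drop_duplicates(subset='name', keep='first')
```

Whether this matters for the cooking data depends on whether the scraped frame actually contains such near-duplicates, which the outputs here do not show.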

In [10]:
# columns in dataset
cooking.columns

Index(['approved_at_utc', 'subreddit', 'selftext', 'author_fullname', 'saved',
       'mod_reason_title', 'gilded', 'clicked', 'title', 'link_flair_richtext',
       ...
       'stickied', 'url', 'subreddit_subscribers', 'created_utc',
       'num_crossposts', 'media', 'is_video', 'author_cakeday',
       'crosspost_parent_list', 'crosspost_parent'],
      dtype='object', length=102)

In [12]:
# storing the columns to be dropped in a variable
columns_drop = [column for column in cooking.columns if column not in ('subreddit', 'selftext')]

In [13]:
#dropping unwanted columns
cooking.drop(columns_drop, axis=1, inplace=True)

In [14]:
# final dataframe
cooking.head()

Unnamed: 0,subreddit,selftext
0,Cooking,I have been learning how to cook over the past...
1,Cooking,Maybe this is about cooking. Maybe not. Since ...
2,Cooking,I typically save several dollars every trip to...
3,Cooking,"I eat a lot of steamed vegetables, particularl..."
4,Cooking,I currently own both a Crock Pot and an Instan...


In [15]:
#datatype of each variable in dataframe
cooking.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1083 entries, 0 to 1100
Data columns (total 2 columns):
subreddit    1083 non-null object
selftext     1007 non-null object
dtypes: object(2)
memory usage: 25.4+ KB


In [16]:
#dropping the rows with null values
cooking.dropna(axis = 0,inplace = True)

In [17]:
#confirming that no null values remain in the dataframe
cooking.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1007 entries, 0 to 1100
Data columns (total 2 columns):
subreddit    1007 non-null object
selftext     1007 non-null object
dtypes: object(2)
memory usage: 23.6+ KB


In [18]:
#saving the clean dataframe into a csv file
cooking.to_csv('../data/final_cooking.csv', index=False)
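The cleaning steps in this section (drop duplicate rows, keep only `subreddit` and `selftext`, drop rows with null values) could be collected into one function so they can be reused for another subreddit. This is a sketch under stated assumptions: `clean_posts` and its `keep` parameter are hypothetical names, and resetting the index at the end is an addition that the cells above do not perform.

```python
import pandas as pd

def clean_posts(df, keep=('subreddit', 'selftext')):
    """Apply the cleaning steps of this section in the same order."""
    # 1. Drop rows that are exact duplicates across all columns.
    cleaned = df.drop_duplicates()
    # 2. Keep only the columns needed for the classifier.
    cleaned = cleaned[list(keep)]
    # 3. Drop rows where any remaining column is null.
    cleaned = cleaned.dropna(axis=0)
    # Extra step, not in the cells above: renumber the rows from 0.
    return cleaned.reset_index(drop=True)
```

For example, `clean_posts(pd.read_csv('../data/cooking.csv')).to_csv('../data/final_cooking.csv', index=False)` would reproduce the flow of the cells above in one line.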