<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">

# Project 3: Scraping, NLP and Binary Classification Problem

# Objective of the project:
    
* Use the Pushshift API to collect posts from 2 subreddits: Makeup and Fragrance
* Use NLP to train a classifier that predicts which subreddit a post was submitted to


Reddit is an American social news aggregation, web content rating, and discussion website. Registered members submit content to the site such as links, text posts, images, and videos, which are then voted up or down by other members. Posts are organized by subject into user-created boards called "communities" or "subreddits", which cover a variety of topics such as news, politics, religion, science, movies, video games, music, books, sports, fitness, cooking, pets, and image-sharing. Submissions with more upvotes appear towards the top of their subreddit and, if they receive enough upvotes, ultimately on the site's front page. In this project, the 2 subreddits that were chosen are: Fragrance and Makeup.

In this notebook, I scrape the data that will be used for the binary classification problem.

In [42]:
#libraries imported

import requests
import pandas as pd
import time

# Scraping Data Using Pushshift

In [58]:
# function to scrape data
# each request returns up to 100 posts
# total rounds of scraping: 11 (up to 1,100 posts)
# the batches are concatenated and saved to a single csv file
# there is a 10-second pause after each round of scraping

def redditscrap(title):
    url = 'https://api.pushshift.io/reddit/search/submission'
    df_load = []
    params = {
        'subreddit': title,
        'size' : 100,
        'before': None
    }
    for i in range(11):
        res = requests.get(url, params=params)
        data = res.json()
        posts = data['data']
        df_new = pd.DataFrame(posts)
        df_load.append(df_new)
        # page backwards in time: the next request starts before the last
        # (oldest) post of this batch; iloc[-1] also works for short batches
        params['before'] = df_new['created_utc'].iloc[-1]
        # rewrite the csv after every round so partial results are kept
        df = pd.concat(df_load, ignore_index = True)
        df.to_csv(f'{title}.csv')
        time.sleep(10)
        print(f'{i+1} Iterations completed')
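
The `before` cursor used in the function above can be tried out without calling the API. The sketch below is only an illustration under stated assumptions: `paginate_by_created_utc`, `fake_fetch` and `fake_posts` are hypothetical names that are not part of this notebook or of Pushshift, and the fake posts stand in for real API responses.

```python
import pandas as pd


def paginate_by_created_utc(fetch_page, max_rounds=11):
    """Collect batches newest-first, moving the `before` cursor back each round.

    `fetch_page(before)` is a hypothetical callable that returns a list of
    post dicts, each with a `created_utc` key.
    """
    batches = []
    before = None
    for _ in range(max_rounds):
        posts = fetch_page(before)
        if not posts:  # nothing older is available, so stop early
            break
        batch = pd.DataFrame(posts)
        batches.append(batch)
        # the next request should only return posts older than this batch
        before = int(batch['created_utc'].min())
    if not batches:
        return pd.DataFrame()
    return pd.concat(batches, ignore_index=True)


# stand-in for the API: 250 fake posts with descending timestamps
fake_posts = [{'title': f'post {n}', 'created_utc': 1_000_000 - n} for n in range(250)]


def fake_fetch(before, size=100):
    # return up to `size` posts that are strictly older than `before`
    older = [p for p in fake_posts if before is None or p['created_utc'] < before]
    return older[:size]


df_demo = paginate_by_created_utc(fake_fetch)
print(df_demo.shape)
```

Stopping when a batch comes back empty avoids an error when a subreddit has fewer posts than the number of rounds would request.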


# Scraping Makeup Subreddit

In [45]:
#call the function to scrape the Makeup subreddit
redditscrap('Makeup')

1 Iterations completed
2 Iterations completed
3 Iterations completed
4 Iterations completed
5 Iterations completed
6 Iterations completed
7 Iterations completed
8 Iterations completed
9 Iterations completed
10 Iterations completed
11 Iterations completed


In [49]:
#read the csv file
df_makeup = pd.read_csv('Makeup.csv')

In [50]:
#check first 5 rows
df_makeup.head()

Unnamed: 0.1,Unnamed: 0,all_awardings,allow_live_comments,author,author_flair_css_class,author_flair_richtext,author_flair_text,author_flair_type,author_fullname,author_is_blocked,...,crosspost_parent_list,url_overridden_by_dest,author_cakeday,thumbnail_height,thumbnail_width,edited,author_flair_background_color,author_flair_text_color,author_flair_template_id,banned_by
0,0,[],False,ChristyR4230,,[],,text,t2_3944zu0n,False,...,,,,,,,,,,
1,1,[],False,ASOMA-59,,[],,text,t2_e4j1ejyk,False,...,,,,,,,,,,
2,2,[],False,Asmodaia,,[],,text,t2_dp629h8g,False,...,,,,,,,,,,
3,3,[],False,21446,,[],,text,t2_6by954v6,False,...,,,,,,,,,,
4,4,[],False,melissajackson07,,[],,text,t2_1pqcatke,False,...,,,,,,,,,,
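
The `Unnamed: 0` columns in the output above are what pandas typically produces when a saved row index is read back as an ordinary column. A minimal sketch of two ways to avoid this, using a small made-up frame (`df_small`) and an in-memory buffer instead of the real csv files:

```python
import io

import pandas as pd

df_small = pd.DataFrame({'title': ['a', 'b'], 'created_utc': [2, 1]})

# default behaviour: the index is written as an unnamed first column
buffer = io.StringIO()
df_small.to_csv(buffer)
buffer.seek(0)
with_index_column = pd.read_csv(buffer)
print(list(with_index_column.columns))

# option 1: do not write the index at all
buffer = io.StringIO()
df_small.to_csv(buffer, index=False)
buffer.seek(0)
without_index = pd.read_csv(buffer)
print(list(without_index.columns))

# option 2: keep the file as it is and read the first column back as the index
buffer = io.StringIO()
df_small.to_csv(buffer)
buffer.seek(0)
restored = pd.read_csv(buffer, index_col=0)
print(list(restored.columns))
```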


In [55]:
#check for duplicates
df_makeup[['subreddit', 'selftext','title','created_utc']].duplicated().sum()

0

In [69]:
#print created_utc (Unix timestamp) as a readable local date and time
for value in df_makeup['created_utc']:
    print(time.strftime('%Y-%m-%d %H:%M:%S', time.localtime(value)))

2021-08-28 13:36:12
2021-08-28 13:19:15
2021-08-28 12:58:24
2021-08-28 12:11:50
2021-08-28 11:42:49
2021-08-28 11:35:03
2021-08-28 10:03:30
2021-08-28 08:55:22
2021-08-28 07:21:24
2021-08-28 06:23:28
2021-08-28 06:04:51
2021-08-28 05:57:23
2021-08-28 05:36:36
2021-08-28 02:05:09
2021-08-28 00:43:12
2021-08-27 23:26:27
2021-08-27 22:39:08
2021-08-27 21:54:57
2021-08-27 20:22:53
2021-08-27 19:44:23
2021-08-27 19:41:21
2021-08-27 18:08:22
2021-08-27 16:34:17
2021-08-27 16:18:04
2021-08-27 12:32:45
2021-08-27 09:41:29
2021-08-27 09:39:57
2021-08-27 03:27:23
2021-08-27 03:26:57
2021-08-27 02:46:19
2021-08-27 02:43:36
2021-08-27 02:41:08
2021-08-27 02:14:58
2021-08-27 02:10:16
2021-08-27 02:02:17
2021-08-27 00:37:55
2021-08-26 22:05:47
2021-08-26 21:25:36
2021-08-26 21:13:35
2021-08-26 20:48:06
2021-08-26 19:28:10
2021-08-26 19:10:24
2021-08-26 18:27:24
2021-08-26 17:42:30
2021-08-26 12:53:44
2021-08-26 10:01:55
2021-08-26 09:49:46
2021-08-26 09:41:22
2021-08-26 09:39:15
2021-08-26 09:29:27
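
The loop above only prints the converted values, and `time.localtime` depends on the time zone of the machine running the notebook. If the readable timestamps are needed later, one option is to store them in a column. This is a sketch with made-up timestamps; the `created_dt` column name is an assumption and is not used elsewhere in this notebook.

```python
import pandas as pd

# two made-up Unix timestamps (seconds since 1970-01-01 UTC)
df_times = pd.DataFrame({'created_utc': [0, 86400]})

# convert the whole column at once and keep the result in UTC
df_times['created_dt'] = pd.to_datetime(df_times['created_utc'], unit='s', utc=True)
print(df_times)
```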


# Scraping Fragrance Subreddit

In [56]:
#call the function to scrape the fragrance subreddit
redditscrap('fragrance')

1 Iterations completed
2 Iterations completed
3 Iterations completed
4 Iterations completed
5 Iterations completed
6 Iterations completed
7 Iterations completed
8 Iterations completed
9 Iterations completed
10 Iterations completed
11 Iterations completed


In [59]:
#read the csv file
df_fragrance = pd.read_csv('fragrance.csv')

In [60]:
#check 1st 5 rows
df_fragrance.head()

Unnamed: 0.1,Unnamed: 0,all_awardings,allow_live_comments,author,author_flair_css_class,author_flair_richtext,author_flair_text,author_flair_type,author_fullname,author_is_blocked,...,removed_by_category,post_hint,preview,poll_data,author_flair_background_color,author_flair_template_id,author_flair_text_color,banned_by,edited,author_cakeday
0,0,[],False,Nikko_Kuzma,,[],,text,t2_dkyxctf5,False,...,,,,,,,,,,
1,1,[],False,xxyyzzoe,,[],,text,t2_159m0v,False,...,,,,,,,,,,
2,2,[],False,Claphamtulip,,[],,text,t2_u4kdr,False,...,,,,,,,,,,
3,3,[],False,venusstar98,,[],,text,t2_dvcb8zco,False,...,reddit,,,,,,,,,
4,4,[],False,therealmisslacreevy,,[],,text,t2_5rtsazmf,False,...,moderator,self,"{'enabled': False, 'images': [{'id': 'tjbomo4B...",,,,,,,


In [61]:
#check for duplicates
df_fragrance[['subreddit', 'selftext','title','created_utc']].duplicated().sum()

2

In [62]:
#check the shape
df_fragrance.shape

(1100, 74)

In [66]:
#drop duplicate rows (with no subset, rows must match in every column to count as duplicates)
df_fragrance.drop_duplicates(inplace=True)

In [67]:
#check the shape again
df_fragrance.shape

(1100, 74)
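
The shape is unchanged because `drop_duplicates()` without a `subset` compares every column, so the 2 rows flagged by the earlier check on `subreddit`, `selftext`, `title` and `created_utc` are still present. A minimal sketch of dropping on the same columns as the check, using a small made-up frame (`df_dupes`) rather than the scraped data:

```python
import pandas as pd

df_dupes = pd.DataFrame({
    'Unnamed: 0': [0, 1, 2],            # differs on every row, like a saved index
    'subreddit': ['fragrance'] * 3,
    'selftext': ['x', 'x', 'y'],
    'title': ['t1', 't1', 't2'],
    'created_utc': [10, 10, 20],
})
key_columns = ['subreddit', 'selftext', 'title', 'created_utc']

# one row repeats an earlier row on the key columns
print(df_dupes.duplicated(subset=key_columns).sum())

# comparing all columns removes nothing, because 'Unnamed: 0' is always different
print(df_dupes.drop_duplicates().shape)

# comparing only the key columns removes the repeated post
df_deduped = df_dupes.drop_duplicates(subset=key_columns)
print(df_deduped.shape)
```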

In [68]:
#print created_utc (Unix timestamp) as a readable local date and time
for value in df_fragrance['created_utc']:
    print(time.strftime('%Y-%m-%d %H:%M:%S', time.localtime(value)))

2021-08-28 15:38:41
2021-08-28 13:11:46
2021-08-28 13:08:37
2021-08-28 12:46:53
2021-08-28 12:43:52
2021-08-28 12:12:39
2021-08-28 12:03:46
2021-08-28 10:57:28
2021-08-28 10:40:51
2021-08-28 10:32:06
2021-08-28 10:18:49
2021-08-28 10:14:21
2021-08-28 09:48:48
2021-08-28 09:17:19
2021-08-28 09:05:42
2021-08-28 08:35:47
2021-08-28 08:28:56
2021-08-28 08:14:09
2021-08-28 07:55:50
2021-08-28 07:49:41
2021-08-28 07:25:31
2021-08-28 07:25:02
2021-08-28 07:06:44
2021-08-28 07:03:18
2021-08-28 06:38:54
2021-08-28 06:27:31
2021-08-28 06:13:06
2021-08-28 05:57:33
2021-08-28 05:56:16
2021-08-28 05:54:54
2021-08-28 05:43:20
2021-08-28 05:33:03
2021-08-28 04:44:18
2021-08-28 04:43:14
2021-08-28 03:59:16
2021-08-28 03:35:20
2021-08-28 03:25:57
2021-08-28 03:14:09
2021-08-28 03:13:31
2021-08-28 03:00:12
2021-08-28 02:59:22
2021-08-28 02:53:51
2021-08-28 02:37:58
2021-08-28 02:06:14
2021-08-28 01:48:20
2021-08-28 01:40:57
2021-08-28 01:32:10
2021-08-28 01:19:16
2021-08-28 01:01:56
2021-08-28 00:39:08


As seen above, the data was scraped successfully using the Pushshift API.

Total data collected:
* Fragrance: 1100 rows, 74 columns
* Makeup: 1100 rows, 74 columns