<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">

# Project 3: Web API and NLP

--- 
# Contents


---

### Contents:
Notebook 1
 - Part 1
    - Background
    - Problem statement
    - Summary
 - Part 2
    - Importing libraries
    - API and parameter testing
       - Whiskey
       - Rum
       - Narrowing down features of interest
    - Scraping data

--- 
# Part 1

Problem Statement and Introduction

---

## Background

We are a data science team at Sing Song Cellar, a company specialising in investing in and trading rum and whiskey. In response to concerns and feedback from company shareholders, we have been tasked with creating a classification model to help identify whether rum or whiskey is worth purchasing.

## Problem Statement

Sing Song Cellar has requested a machine learning model to help identify how the company can maximise the efficiency of its marketing spend for its target audience.  
Management has asked us to survey whether rum or whiskey is the better product to invest in and buy.

## Summary

Text from the following subreddits was scraped using the Pushshift API (https://api.pushshift.io/):  
**Whiskey** and **Rum**.

Whiskey or whisky is a distilled alcoholic beverage made from fermented grain mash. Different grains (which may be malted) are used for different varieties, including barley, corn, rye, and wheat. Whiskey is typically aged in wooden barrels, usually made of charred white oak. Uncharred white oak barrels previously used to age sherry are also sometimes used.
([*source*](https://liquorama.net/blog/what-is-whiskey))

Whiskey is a strictly regulated spirit worldwide, with many classes and types. The typical unifying characteristics of the different classes and types are the fermentation of grains, distillation, and aging in wooden barrels. ([*source*](https://liquorama.net/blog/what-is-whiskey))

Rum is made by fermenting and then distilling sugarcane molasses or sugarcane juice. The distillate, a clear liquid, is usually aged in oak barrels. Rum is produced in nearly every sugar-producing region of the world, such as the Philippines, home of Tanduay, the largest rum producer in the world. ([*source*](https://www.thespruceeats.com/introduction-to-rum-760702))

There are many grades of rum. Light rums are commonly used in cocktails, whereas "gold" and "dark" rums were typically drunk straight or neat, iced ("on the rocks"), or used in cooking, but are now commonly consumed with mixers. Premium rums are made to be drunk straight or iced. ([*source*](https://www.thespruceeats.com/introduction-to-rum-760702))

Rum plays an important role in the culture of most islands of the West Indies, as well as the Maritime provinces and Newfoundland in Canada. The drink is famously associated with the Royal Navy (where it was mixed with water or beer to make grog) and with piracy (where it was consumed as bumbo). Rum has also served as a popular medium of exchange, used to help finance enterprises such as slavery, organised crime and military insurgencies. ([*source*](https://www.thespruceeats.com/introduction-to-rum-760702))

We will build Naive Bayes, K-nearest neighbors and logistic regression models, using CountVectorizer and TfidfVectorizer to turn the text into features. Model performance will be evaluated by accuracy.
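
To make the planned evaluation concrete, here is a minimal sketch in plain Python of two ideas named above: turning text into token counts (a simplified stand-in for what CountVectorizer does, not scikit-learn's implementation) and scoring predictions by accuracy. The function names and the toy labels are assumptions for illustration.

```python
from collections import Counter
import re


def bag_of_words(text):
    """Lowercase the text and count word tokens (simplified tokenizer)."""
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    return Counter(tokens)


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


counts = bag_of_words("First Rye, then more rye")
print(counts["rye"])  # 2

# Hypothetical labels: 1 = whiskey post, 0 = rum post
print(accuracy([1, 0, 1, 1], [1, 0, 0, 1]))  # 0.75
```

Accuracy is a reasonable headline metric when the two classes are roughly balanced, which is worth checking once the scraped data is available.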

--- 
# Part 2

Data extraction

---

## 2.1 Importing libraries

In [1]:
# Import working libraries:
import requests
from bs4 import BeautifulSoup
import datetime
import pandas as pd
import time
import random

In [2]:
pd.set_option('display.max_columns', 4000)
pd.set_option('display.max_rows', 4000)

## 2.2 API and parameter testing

### 2.2.1 Whiskey

In [3]:
# Scrape data for whiskey

# Pushshift submission-search endpoint; the query string is built from params1
url1 = 'https://api.pushshift.io/reddit/search/submission'
# Query parameters: subreddit name and number of posts per request
params1 = { 
    'subreddit': 'Whiskey', 
    'size': 100
}
# Send the GET request
res1 = requests.get(url1, params=params1)
# Stop here if the request failed (4xx or 5xx status code)
res1.raise_for_status()
# Parse the JSON response into a dictionary
data1 = res1.json()
# The list of posts is stored under the 'data' key
posts1 = data1['data']
# Build a dataframe from the list of posts
df1 = pd.DataFrame(posts1)
# Preview the first six rows
df1.head(6)

Unnamed: 0,all_awardings,allow_live_comments,author,author_flair_css_class,author_flair_richtext,author_flair_text,author_flair_type,author_fullname,author_is_blocked,author_patreon_flair,author_premium,awarders,can_mod_post,contest_mode,created_utc,domain,full_link,gallery_data,gildings,id,is_created_from_ads_ui,is_crosspostable,is_gallery,is_meta,is_original_content,is_reddit_media_domain,is_robot_indexable,is_self,is_video,link_flair_background_color,link_flair_richtext,link_flair_text_color,link_flair_type,locked,media_metadata,media_only,no_follow,num_comments,num_crossposts,over_18,parent_whitelist_status,permalink,pinned,pwls,retrieved_on,score,selftext,send_replies,spoiler,stickied,subreddit,subreddit_id,subreddit_subscribers,subreddit_type,thumbnail,thumbnail_height,thumbnail_width,title,total_awards_received,treatment_tags,upvote_ratio,url,url_overridden_by_dest,whitelist_status,wls,post_hint,preview,author_flair_background_color,author_flair_text_color,media,media_embed,removed_by_category,secure_media,secure_media_embed,author_cakeday
0,[],False,Lhasabeast,,[],,text,t2_isjvz,False,False,False,[],False,False,1658807708,reddit.com,https://www.reddit.com/r/whiskey/comments/w89g...,{'items': [{'caption': 'Charts and Blending ta...,{},w89gfy,False,True,True,False,False,False,True,False,False,,[],dark,text,False,"{'9b7m15v95ud91': {'e': 'Image', 'id': '9b7m15...",False,True,0,0,False,all_ads,/r/whiskey/comments/w89gfy/my_bourbon_and_rye_...,False,6,1658807718,1,,True,False,False,whiskey,t5_2r06y,212452,public,https://b.thumbs.redditmedia.com/gtcvCwk1oGQxp...,78.0,140.0,My Bourbon and Rye infinity blend!,0,[],1.0,https://www.reddit.com/gallery/w89gfy,https://www.reddit.com/gallery/w89gfy,all_ads,6,,,,,,,,,,
1,[],False,Colon8,,[],,text,t2_30mp3wyp,False,False,False,[],False,False,1658804260,self.whiskey,https://www.reddit.com/r/whiskey/comments/w889...,,{},w889dj,False,True,,False,False,False,True,True,False,,[],dark,text,False,,False,False,0,0,False,all_ads,/r/whiskey/comments/w889dj/first_rye/,False,6,1658804271,1,My friends &amp; I are avid single malt Scotch...,True,False,False,whiskey,t5_2r06y,212448,public,self,,,First Rye,0,[],1.0,https://www.reddit.com/r/whiskey/comments/w889...,,all_ads,6,,,,,,,,,,
2,[],False,Po-thepanda,,[],,text,t2_4q6bk05o,False,False,False,[],False,False,1658799605,i.redd.it,https://www.reddit.com/r/whiskey/comments/w86k...,,{},w86kqp,False,True,,False,False,True,True,False,False,,[],dark,text,False,,False,True,0,0,False,all_ads,/r/whiskey/comments/w86kqp/review_coming_in/,False,6,1658799615,1,,True,False,False,whiskey,t5_2r06y,212444,public,https://b.thumbs.redditmedia.com/EfC1yZ_12IJH9...,140.0,140.0,Review coming in,0,[],1.0,https://i.redd.it/080zft9chtd91.jpg,https://i.redd.it/080zft9chtd91.jpg,all_ads,6,image,"{'enabled': True, 'images': [{'id': '6wQB6N8uI...",,,,,,,,
3,[],False,Professional_Arm_661,,[],,text,t2_acvxk0gu,False,False,False,[],False,False,1658798135,reddit.com,https://www.reddit.com/r/whiskey/comments/w861...,"{'items': [{'id': 168061653, 'media_id': 'd45t...",{},w861oo,False,True,True,False,False,False,True,False,False,,[],dark,text,False,"{'1ookrojyctd91': {'e': 'Image', 'id': '1ookro...",False,True,0,0,False,all_ads,/r/whiskey/comments/w861oo/random_grocery_stor...,False,6,1658798146,1,,True,False,False,whiskey,t5_2r06y,212444,public,https://b.thumbs.redditmedia.com/7u10scn0LQFYR...,140.0,140.0,Random grocery store find...Bushmill's 1608 40...,0,[],1.0,https://www.reddit.com/gallery/w861oo,https://www.reddit.com/gallery/w861oo,all_ads,6,,,,,,,,,,
4,[],False,MEX-R-US,,[],,text,t2_x5ysap7,False,False,False,[],False,False,1658795013,reddit.com,https://www.reddit.com/r/whiskey/comments/w84x...,"{'items': [{'id': 168047062, 'media_id': 'o2j6...",{},w84x31,False,True,True,False,False,False,True,False,False,,[],dark,text,False,"{'csjru1zn3td91': {'e': 'Image', 'id': 'csjru1...",False,True,0,0,False,all_ads,/r/whiskey/comments/w84x31/been_drinking_colle...,False,6,1658795025,1,,True,False,False,whiskey,t5_2r06y,212446,public,https://b.thumbs.redditmedia.com/4JbnF1s-EDGa3...,140.0,140.0,Been drinking / collecting for 4 years and am ...,0,[],1.0,https://www.reddit.com/gallery/w84x31,https://www.reddit.com/gallery/w84x31,all_ads,6,,,,,,,,,,
5,[],False,Gotbourbn,,[],,text,t2_nglx4q3j,False,False,False,[],False,False,1658790485,reddit.com,https://www.reddit.com/r/whiskey/comments/w837...,"{'items': [{'id': 168024598, 'media_id': '57ei...",{},w837qj,False,True,True,False,False,False,True,False,False,,[],dark,text,False,"{'57eip5k8qsd91': {'e': 'Image', 'id': '57eip5...",False,True,0,0,False,all_ads,/r/whiskey/comments/w837qj/finished_a_great_bo...,False,6,1658790496,1,,True,False,False,whiskey,t5_2r06y,212445,public,https://b.thumbs.redditmedia.com/HRGPglKPeXpOX...,140.0,140.0,Finished a great bottle tonight and was gifted...,0,[],1.0,https://www.reddit.com/gallery/w837qj,https://www.reddit.com/gallery/w837qj,all_ads,6,,,,,,,,,,
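
The request above sends the subreddit name and page size as query parameters. As a rough sketch of how such parameters end up in the final URL, the snippet below encodes them with the standard library only; the helper name `build_url` is hypothetical and no network call is made.

```python
from urllib.parse import urlencode

BASE_URL = "https://api.pushshift.io/reddit/search/submission"


def build_url(base_url, params):
    """Append URL-encoded query parameters to the base URL."""
    return f"{base_url}?{urlencode(params)}"


url = build_url(BASE_URL, {"subreddit": "whiskey", "size": 100})
print(url)
# https://api.pushshift.io/reddit/search/submission?subreddit=whiskey&size=100
```

Keeping each parameter in one place (either in the URL string or in the parameter dictionary, not both) avoids sending the same parameter twice.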


In [4]:
#seeing what we scraped for whiskey data
posts1_1 = data1['data'][0]
print(posts1_1)
print(type(posts1_1))
print(posts1_1.keys())
print(posts1_1.values())

{'all_awardings': [], 'allow_live_comments': False, 'author': 'Lhasabeast', 'author_flair_css_class': None, 'author_flair_richtext': [], 'author_flair_text': None, 'author_flair_type': 'text', 'author_fullname': 't2_isjvz', 'author_is_blocked': False, 'author_patreon_flair': False, 'author_premium': False, 'awarders': [], 'can_mod_post': False, 'contest_mode': False, 'created_utc': 1658807708, 'domain': 'reddit.com', 'full_link': 'https://www.reddit.com/r/whiskey/comments/w89gfy/my_bourbon_and_rye_infinity_blend/', 'gallery_data': {'items': [{'caption': 'Charts and Blending tables', 'id': 168107143, 'media_id': 'vms0y4v95ud91'}, {'caption': 'Everything that went into the bottle and the bottle itself.', 'id': 168107144, 'media_id': 's64h35v95ud91'}, {'caption': 'Blending and tasting notes.', 'id': 168107145, 'media_id': '9b7m15v95ud91'}]}, 'gildings': {}, 'id': 'w89gfy', 'is_created_from_ads_ui': False, 'is_crosspostable': True, 'is_gallery': True, 'is_meta': False, 'is_original_content

In [5]:
df1.columns

Index(['all_awardings', 'allow_live_comments', 'author',
       'author_flair_css_class', 'author_flair_richtext', 'author_flair_text',
       'author_flair_type', 'author_fullname', 'author_is_blocked',
       'author_patreon_flair', 'author_premium', 'awarders', 'can_mod_post',
       'contest_mode', 'created_utc', 'domain', 'full_link', 'gallery_data',
       'gildings', 'id', 'is_created_from_ads_ui', 'is_crosspostable',
       'is_gallery', 'is_meta', 'is_original_content',
       'is_reddit_media_domain', 'is_robot_indexable', 'is_self', 'is_video',
       'link_flair_background_color', 'link_flair_richtext',
       'link_flair_text_color', 'link_flair_type', 'locked', 'media_metadata',
       'media_only', 'no_follow', 'num_comments', 'num_crossposts', 'over_18',
       'parent_whitelist_status', 'permalink', 'pinned', 'pwls',
       'retrieved_on', 'score', 'selftext', 'send_replies', 'spoiler',
       'stickied', 'subreddit', 'subreddit_id', 'subreddit_subscribers',
       'su

In [6]:
df1['author_fullname'].head(100)

0        t2_isjvz
1     t2_30mp3wyp
2     t2_4q6bk05o
3     t2_acvxk0gu
4      t2_x5ysap7
5     t2_nglx4q3j
6        t2_npqjv
7     t2_3pe5ynry
8       t2_12yokh
9        t2_hgldm
10    t2_4lx9471k
11       t2_yxra7
12            NaN
13    t2_4lx9471k
14    t2_q3mauovm
15      t2_107t9a
16       t2_li6hi
17      t2_12mvhd
18    t2_q3mauovm
19    t2_nazlsn0o
20     t2_pym25ry
21    t2_a0wywapt
22       t2_x92a0
23    t2_56uns5te
24       t2_6nd7v
25    t2_7x3el8hx
26    t2_2urftha7
27    t2_39oql2zf
28    t2_3maq3yn9
29    t2_7azkozib
30    t2_93jkmcjm
31       t2_isjvz
32    t2_57ddr2p0
33    t2_8xfgi3pu
34     t2_xwgp1so
35    t2_dynyf5bg
36    t2_4j1fp18p
37    t2_17f85jxt
38    t2_l0xblrug
39    t2_hmu8ttlr
40    t2_8didi8qd
41       t2_4eo77
42    t2_37nycaur
43    t2_a0i0a2c9
44    t2_51hndmtq
45     t2_j8h4uny
46       t2_3qgrc
47    t2_79z09wdg
48       t2_doa77
49    t2_aqui9s76
50    t2_1uba3yk2
51      t2_13vvne
52    t2_64ixu9dn
53    t2_8soc1okn
54       t2_61hce
55    t2_8

### 2.2.2 Rum

In [7]:
# Scrape data for rum

# Pushshift submission-search endpoint; the query string is built from params2
url2 = 'https://api.pushshift.io/reddit/search/submission'
# Query parameters: subreddit name and number of posts per request
params2 = { 
    'subreddit': 'Rum', 
    'size': 100
}
# Send the GET request
res2 = requests.get(url2, params=params2)
# Stop here if the request failed (4xx or 5xx status code)
res2.raise_for_status()
# Parse the JSON response into a dictionary
data2 = res2.json()
# The list of posts is stored under the 'data' key
posts2 = data2['data']
# Build a dataframe from the list of posts
df2 = pd.DataFrame(posts2)
# Preview the first six rows
df2.head(6)

Unnamed: 0,all_awardings,allow_live_comments,author,author_flair_css_class,author_flair_richtext,author_flair_text,author_flair_type,author_fullname,author_is_blocked,author_patreon_flair,author_premium,awarders,can_mod_post,contest_mode,created_utc,domain,full_link,gildings,id,is_created_from_ads_ui,is_crosspostable,is_meta,is_original_content,is_reddit_media_domain,is_robot_indexable,is_self,is_video,link_flair_background_color,link_flair_richtext,link_flair_text_color,link_flair_type,locked,media_only,no_follow,num_comments,num_crossposts,over_18,parent_whitelist_status,permalink,pinned,pwls,retrieved_on,score,selftext,send_replies,spoiler,stickied,subreddit,subreddit_id,subreddit_subscribers,subreddit_type,thumbnail,title,total_awards_received,treatment_tags,upvote_ratio,url,whitelist_status,wls,post_hint,preview,thumbnail_height,thumbnail_width,url_overridden_by_dest,removed_by_category,gallery_data,is_gallery,media_metadata,author_flair_template_id,author_flair_text_color,crosspost_parent,crosspost_parent_list
0,[],False,HappyToBeMoi,,[],,text,t2_4t9sgf85,False,False,True,[],False,False,1658810004,self.rum,https://www.reddit.com/r/rum/comments/w8a7pe/r...,{},w8a7pe,False,True,False,False,False,True,True,False,,[],dark,text,False,False,True,0,0,False,some_ads,/r/rum/comments/w8a7pe/recommendations_for_a_f...,False,7,1658810015,1,I am in love with it for it being able to work...,True,False,False,rum,t5_2rg5e,38021,public,self,Recommendations for a flavorful darker rum?,0,[],1.0,https://www.reddit.com/r/rum/comments/w8a7pe/r...,some_ads,7,,,,,,,,,,,,,
1,[],False,SilverSandWitch2,,[],,text,t2_fhe8tpwb,False,False,False,[],False,False,1658801372,i.redd.it,https://www.reddit.com/r/rum/comments/w877ov/a...,{},w877ov,False,True,False,False,True,True,False,False,,[],dark,text,False,False,True,0,0,False,some_ads,/r/rum/comments/w877ov/any_nova_scotia_rum_fan...,False,7,1658801383,1,,True,False,False,rum,t5_2rg5e,38018,public,https://b.thumbs.redditmedia.com/_LrwMp67HE612...,Any Nova Scotia rum fans? Have you tried it? I...,0,[],1.0,https://i.redd.it/jch1sc6mmtd91.jpg,some_ads,7,image,"{'enabled': True, 'images': [{'id': 'nnD4Db_Qq...",140.0,140.0,https://i.redd.it/jch1sc6mmtd91.jpg,,,,,,,,
2,[],False,cyberhawk94,,[],,text,t2_a5o1z,False,False,False,[],False,False,1658798547,self.rum,https://www.reddit.com/r/rum/comments/w866zz/e...,{},w866zz,False,True,False,False,False,True,True,False,,[],dark,text,False,False,True,0,0,False,some_ads,/r/rum/comments/w866zz/el_dorado_questions/,False,7,1658798557,1,So Ive been meaning to try El dorado for a whi...,True,False,False,rum,t5_2rg5e,38016,public,self,El Dorado questions,0,[],1.0,https://www.reddit.com/r/rum/comments/w866zz/e...,some_ads,7,,,,,,,,,,,,,
3,[],False,Jorgenwatchsven,,[],,text,t2_9aw7uvln,False,False,False,[],False,False,1658798225,self.rum,https://www.reddit.com/r/rum/comments/w862sp/f...,{},w862sp,False,False,False,False,False,False,True,False,,[],dark,text,False,False,True,0,0,False,some_ads,/r/rum/comments/w862sp/finally_19_need_some_re...,False,7,1658798236,1,[removed],True,False,False,rum,t5_2rg5e,38016,public,self,Finally 19 need some recommendations,0,[],1.0,https://www.reddit.com/r/rum/comments/w862sp/f...,some_ads,7,,,,,,automod_filtered,,,,,,,
4,[],False,IrezumiHurts,,[],,text,t2_qfibv9u,False,False,True,[],False,False,1658797018,self.rum,https://www.reddit.com/r/rum/comments/w85n1e/r...,{},w85n1e,False,True,False,False,False,True,True,False,,[],dark,text,False,False,True,0,0,False,some_ads,/r/rum/comments/w85n1e/rum_selection_in_anchor...,False,7,1658797029,1,"Travelling through, i usually like to see if t...",True,False,False,rum,t5_2rg5e,38014,public,self,"Rum selection in anchorage, AK? 🥶",0,[],1.0,https://www.reddit.com/r/rum/comments/w85n1e/r...,some_ads,7,,,,,,,,,,,,,
5,[],False,Ar15tothedome,,[],,text,t2_2ekctmsk,False,False,True,[],False,False,1658795235,self.rum,https://www.reddit.com/r/rum/comments/w84zxh/a...,{},w84zxh,False,True,False,False,False,True,True,False,,[],dark,text,False,False,False,0,0,False,some_ads,/r/rum/comments/w84zxh/advice_needed/,False,7,1658795246,1,Ok so I’m a whiskey guy. Due to price and want...,True,False,False,rum,t5_2rg5e,38013,public,self,Advice needed.,0,[],1.0,https://www.reddit.com/r/rum/comments/w84zxh/a...,some_ads,7,,,,,,,,,,,,,



In [8]:
#seeing what we scraped for rum data
posts2_1 = data2['data'][0]
print(posts2_1)
print(type(posts2_1))
print(posts2_1.keys())
print(posts2_1.values())

{'all_awardings': [], 'allow_live_comments': False, 'author': 'HappyToBeMoi', 'author_flair_css_class': None, 'author_flair_richtext': [], 'author_flair_text': None, 'author_flair_type': 'text', 'author_fullname': 't2_4t9sgf85', 'author_is_blocked': False, 'author_patreon_flair': False, 'author_premium': True, 'awarders': [], 'can_mod_post': False, 'contest_mode': False, 'created_utc': 1658810004, 'domain': 'self.rum', 'full_link': 'https://www.reddit.com/r/rum/comments/w8a7pe/recommendations_for_a_flavorful_darker_rum/', 'gildings': {}, 'id': 'w8a7pe', 'is_created_from_ads_ui': False, 'is_crosspostable': True, 'is_meta': False, 'is_original_content': False, 'is_reddit_media_domain': False, 'is_robot_indexable': True, 'is_self': True, 'is_video': False, 'link_flair_background_color': '', 'link_flair_richtext': [], 'link_flair_text_color': 'dark', 'link_flair_type': 'text', 'locked': False, 'media_only': False, 'no_follow': True, 'num_comments': 0, 'num_crossposts': 0, 'over_18': False,

In [9]:
df2.columns

Index(['all_awardings', 'allow_live_comments', 'author',
       'author_flair_css_class', 'author_flair_richtext', 'author_flair_text',
       'author_flair_type', 'author_fullname', 'author_is_blocked',
       'author_patreon_flair', 'author_premium', 'awarders', 'can_mod_post',
       'contest_mode', 'created_utc', 'domain', 'full_link', 'gildings', 'id',
       'is_created_from_ads_ui', 'is_crosspostable', 'is_meta',
       'is_original_content', 'is_reddit_media_domain', 'is_robot_indexable',
       'is_self', 'is_video', 'link_flair_background_color',
       'link_flair_richtext', 'link_flair_text_color', 'link_flair_type',
       'locked', 'media_only', 'no_follow', 'num_comments', 'num_crossposts',
       'over_18', 'parent_whitelist_status', 'permalink', 'pinned', 'pwls',
       'retrieved_on', 'score', 'selftext', 'send_replies', 'spoiler',
       'stickied', 'subreddit', 'subreddit_id', 'subreddit_subscribers',
       'subreddit_type', 'thumbnail', 'title', 'total_awards_rece

### 2.2.3 Narrowing down the features of interest

In [10]:
# df1/df2 - whiskey/rum original scraped data in dataframe
# df1a/df2a - whiskey/rum with duplicated rows removed and epoch time converted to human-readable time

In [11]:
#focusing on features of interest
features_of_interest = ['author', 'created_utc', 'id', 'score','subreddit', 'title', 'selftext','num_comments']

In [12]:
#Removing unwanted features
#whiskey
df1a= df1[features_of_interest]
#rum
df2a = df2[features_of_interest]

In [13]:
#looking into dataframe to see what we have
#whiskey
df1a.head()

Unnamed: 0,author,created_utc,id,score,subreddit,title,selftext,num_comments
0,Lhasabeast,1658807708,w89gfy,1,whiskey,My Bourbon and Rye infinity blend!,,0
1,Colon8,1658804260,w889dj,1,whiskey,First Rye,My friends &amp; I are avid single malt Scotch...,0
2,Po-thepanda,1658799605,w86kqp,1,whiskey,Review coming in,,0
3,Professional_Arm_661,1658798135,w861oo,1,whiskey,Random grocery store find...Bushmill's 1608 40...,,0
4,MEX-R-US,1658795013,w84x31,1,whiskey,Been drinking / collecting for 4 years and am ...,,0


In [14]:
#rum
df2a.head()

Unnamed: 0,author,created_utc,id,score,subreddit,title,selftext,num_comments
0,HappyToBeMoi,1658810004,w8a7pe,1,rum,Recommendations for a flavorful darker rum?,I am in love with it for it being able to work...,0
1,SilverSandWitch2,1658801372,w877ov,1,rum,Any Nova Scotia rum fans? Have you tried it? I...,,0
2,cyberhawk94,1658798547,w866zz,1,rum,El Dorado questions,So Ive been meaning to try El dorado for a whi...,0
3,Jorgenwatchsven,1658798225,w862sp,1,rum,Finally 19 need some recommendations,[removed],0
4,IrezumiHurts,1658797018,w85n1e,1,rum,"Rum selection in anchorage, AK? 🥶","Travelling through, i usually like to see if t...",0
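
Selecting `features_of_interest` from a dataframe assumes every one of those columns is present. A minimal sketch, in plain Python on a raw post dictionary, of picking the same fields while tolerating missing keys; `select_features` and the trimmed `raw_post` record are hypothetical and for illustration only.

```python
FEATURES = ['author', 'created_utc', 'id', 'score',
            'subreddit', 'title', 'selftext', 'num_comments']


def select_features(post, features=FEATURES):
    """Return only the requested keys; keys missing from the post become None."""
    return {key: post.get(key) for key in features}


# Hypothetical, trimmed-down post record
raw_post = {'author': 'Colon8', 'id': 'w889dj', 'title': 'First Rye',
            'subreddit': 'whiskey', 'is_video': False}

row = select_features(raw_post)
print(row['title'])     # First Rye
print(row['selftext'])  # None
```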


In [15]:
#removing duplicated rows; .copy() gives independent dataframes, which avoids SettingWithCopyWarning
df1a = df1a.drop_duplicates().copy()
df2a = df2a.drop_duplicates().copy()

In [16]:
#converting created_utc (epoch seconds) to human-readable time in the machine's local time zone
df1a['timestamp'] = df1a['created_utc'].map(datetime.datetime.fromtimestamp)
df2a['timestamp'] = df2a['created_utc'].map(datetime.datetime.fromtimestamp)


In [17]:
#verifying what we get
df1a['timestamp'].head(5)

0   2022-07-26 11:55:08
1   2022-07-26 10:57:40
2   2022-07-26 09:40:05
3   2022-07-26 09:15:35
4   2022-07-26 08:23:33
Name: timestamp, dtype: datetime64[ns]

In [18]:
df2a['timestamp'].head(5)

0   2022-07-26 12:33:24
1   2022-07-26 10:09:32
2   2022-07-26 09:22:27
3   2022-07-26 09:17:05
4   2022-07-26 08:56:58
Name: timestamp, dtype: datetime64[ns]
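
`datetime.datetime.fromtimestamp` called without a `tz` argument returns the time in the machine's local time zone, so the values above depend on where the notebook was run. A small sketch, using only the standard library, of making the zone explicit; the helper name `epoch_to_datetime` and the UTC+8 offset are assumptions for illustration (UTC+8 happens to reproduce the first whiskey timestamp shown above).

```python
from datetime import datetime, timedelta, timezone


def epoch_to_datetime(epoch_seconds, utc_offset_hours=0):
    """Convert a Unix epoch to a timezone-aware datetime at a fixed UTC offset."""
    tz = timezone(timedelta(hours=utc_offset_hours))
    return datetime.fromtimestamp(epoch_seconds, tz=tz)


created_utc = 1658807708  # created_utc of the first whiskey post in the sample
print(epoch_to_datetime(created_utc).isoformat())     # 2022-07-26T03:55:08+00:00
print(epoch_to_datetime(created_utc, 8).isoformat())  # 2022-07-26T11:55:08+08:00
```

Storing timestamps in UTC keeps them comparable across machines; a local offset can then be applied only for display.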

## 2.3 Scraping data

In [19]:
# Function to scrape submissions from a subreddit via the Pushshift API

def scrap(subreddit, days = 30, n = 10000):
    
    # base URL and the query string shared by every request
    base_url = 'https://api.pushshift.io/reddit/search/submission'
    full_url = f'{base_url}?subreddit={subreddit}&size=100'
    print(full_url)
    
    # list that collects one dataframe per request
    posts = []
    
    # iterations
    for i in range(1, n+1):
        # modify the URL for each iteration: look back a further `days` days each time
        urlmod = '{}&after={}d'.format(full_url, days*i)
        # print URL used and days
        print(f'Url: {urlmod}')
        print(f'Days: {days*i}')
        
        # skip this iteration if the request fails or the status code is not 200
        try:
            res = requests.get(urlmod)
        except requests.RequestException:
            continue
        if res.status_code != 200:
            continue
        
        # extract the list of posts from the JSON response
        scraped_dict = res.json()['data']
        # construct a dataframe from the list of posts
        df = pd.DataFrame(scraped_dict)
        # add the dataframe to the list
        posts.append(df)
        
        total_scraped = sum(len(x) for x in posts)
        # print total posts scraped so far
        print(f'total_scraped: {total_scraped}')
        
        # stop once more than n posts have been collected
        if total_scraped > n:
            break
        
        # pause for a random duration between requests to seem like a human user
        sleep_duration = random.randint(1,9)
        print(f'sleep_duration: {sleep_duration}')
        time.sleep(sleep_duration)
    
    # features of interest
    features_of_interest = ['author', 'created_utc', 'id', 'score','subreddit', 'title', 'selftext','num_comments']
    
    # combine all iterations into one dataframe
    final_df = pd.concat(posts, sort=False)
    # keep only the features of interest and drop duplicates; .copy() avoids SettingWithCopyWarning
    final_df = final_df[features_of_interest].drop_duplicates().copy()
    # convert 'created_utc' (epoch seconds) to a local-time timestamp, then drop the original column
    final_df['timestamp'] = final_df['created_utc'].map(datetime.datetime.fromtimestamp)
    final_df = final_df.drop(columns='created_utc')
    # display final shape
    print(f'final_df.shape: {final_df.shape}')
    return final_df.reset_index(drop=True)
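
The loop in the function walks backwards in time by widening the `after` window on each request. As a sketch of just that URL-building step, with no network access, the generator below yields the URLs the loop would request; `window_urls` and its `n_windows` parameter are hypothetical names used only for illustration.

```python
BASE_URL = "https://api.pushshift.io/reddit/search/submission"


def window_urls(subreddit, days=30, n_windows=3, size=100):
    """Yield one request URL per time window, mirroring the loop's `after` values."""
    full_url = f"{BASE_URL}?subreddit={subreddit}&size={size}"
    for i in range(1, n_windows + 1):
        # after=30d, after=60d, after=90d, ... for days=30
        yield f"{full_url}&after={days * i}d"


for url in window_urls("whiskey"):
    print(url)
```

Printing the URLs first is a cheap way to check the window arithmetic before spending time on real requests.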

In [20]:
%%time

# Pulling data from whiskey subreddit
whiskeyscrapdata = scrap('whiskey')

https://api.pushshift.io/reddit/search/submission?subreddit=whiskey&size=100
Url: https://api.pushshift.io/reddit/search/submission?subreddit=whiskey&size=100&after=30d
Days: 30
total_scraped: 100
sleep_duration: 2
Url: https://api.pushshift.io/reddit/search/submission?subreddit=whiskey&size=100&after=60d
Days: 60
total_scraped: 200
sleep_duration: 8
Url: https://api.pushshift.io/reddit/search/submission?subreddit=whiskey&size=100&after=90d
Days: 90
total_scraped: 300
sleep_duration: 3
Url: https://api.pushshift.io/reddit/search/submission?subreddit=whiskey&size=100&after=120d
Days: 120
total_scraped: 400
sleep_duration: 7
Url: https://api.pushshift.io/reddit/search/submission?subreddit=whiskey&size=100&after=150d
Days: 150
total_scraped: 500
sleep_duration: 5
Url: https://api.pushshift.io/reddit/search/submission?subreddit=whiskey&size=100&after=180d
Days: 180
total_scraped: 600
sleep_duration: 8
Url: https://api.pushshift.io/reddit/search/submission?subreddit=whiskey&size=100&after=2

In [21]:
whiskeyscrapdata.head()

Unnamed: 0,author,id,score,subreddit,title,selftext,num_comments,timestamp
0,thor3077,vkxibq,1,whiskey,Another pick up at the liquor store next to th...,,0,2022-06-26 13:34:01
1,james21_h,vkyo5c,1,whiskey,Hakushu Distillery,,0,2022-06-26 14:51:58
2,BickeringPlum,vkz7wd,1,whiskey,"Got me some nice, smooth whiskies!",,0,2022-06-26 15:29:54
3,ETeNDaMINalc,vl3mnw,1,whiskey,That's What I do,,0,2022-06-26 20:25:47
4,danman2424,vl5ekv,1,whiskey,Old dusty find,,0,2022-06-26 22:03:01


In [22]:
%%time 

rumscrapdata = scrap('rum')

https://api.pushshift.io/reddit/search/submission?subreddit=rum&size=100
Url: https://api.pushshift.io/reddit/search/submission?subreddit=rum&size=100&after=30d
Days: 30
total_scraped: 100
sleep_duration: 7
Url: https://api.pushshift.io/reddit/search/submission?subreddit=rum&size=100&after=60d
Days: 60
total_scraped: 200
sleep_duration: 5
Url: https://api.pushshift.io/reddit/search/submission?subreddit=rum&size=100&after=90d
Days: 90
total_scraped: 300
sleep_duration: 1
Url: https://api.pushshift.io/reddit/search/submission?subreddit=rum&size=100&after=120d
Days: 120
total_scraped: 400
sleep_duration: 2
Url: https://api.pushshift.io/reddit/search/submission?subreddit=rum&size=100&after=150d
Days: 150
total_scraped: 500
sleep_duration: 2
Url: https://api.pushshift.io/reddit/search/submission?subreddit=rum&size=100&after=180d
Days: 180
total_scraped: 600
sleep_duration: 3
Url: https://api.pushshift.io/reddit/search/submission?subreddit=rum&size=100&after=210d
Days: 210
total_scraped: 700


In [24]:
#saving to csv
whiskeyscrapdata.to_csv('../datasets/whiskey.csv', index=False)
rumscrapdata.to_csv('../datasets/rum.csv', index=False)
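
The save step assumes the `../datasets/` folder already exists. A minimal sketch of creating the parent folder before writing, using only the standard library; the helper name `ensure_parent_dir` is hypothetical, and the demonstration runs in a temporary directory so nothing is written to the project.

```python
from pathlib import Path
import tempfile


def ensure_parent_dir(file_path):
    """Create the parent directory of file_path if it does not exist yet."""
    path = Path(file_path)
    path.parent.mkdir(parents=True, exist_ok=True)
    return path


# Demonstrated in a temporary directory
with tempfile.TemporaryDirectory() as tmp:
    target = ensure_parent_dir(Path(tmp) / "datasets" / "whiskey.csv")
    print(target.parent.is_dir())  # True
```

In the notebook, the returned path could be passed straight to `to_csv`, for example `whiskeyscrapdata.to_csv(ensure_parent_dir('../datasets/whiskey.csv'), index=False)`.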