<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">

# Project 3A: ProblemStatement_DataScrapping

## Contents:
- [Problem Statement](#Problem-Statement)
- [Background](#Background)
- [Import Libraries](#Import-Libraries)
- [Data Scrapping](#Data-Scrapping)
    - [Initial Test Scrapping](#Initial-Test-Scrapping)
    - [Useful Data Analysis Summary](#Useful-Data-Analysis-Summary)
    - [Actual Scrapping](#Actual-Scrapping)
- [Save Raw Scrapped Data into CSV](#Save-Raw-Scrapped-Data-into-CSV)

## Problem Statement

Company Management has tasked the Data Science Division and the Software Engineering team with finding a way to field after-hours requests for technical help with graphics cards (namely, AMD and NVIDIA). This project aims to build a classifier that uses the keywords in a question to predict whether it belongs to the AMD subreddit or the NVIDIA subreddit.

## Background

Good Games for Everyone (GGE) is a PC building company which specialises in constructing high-performance gaming terminals for our customers.

An analysis of customer service requests found that the majority of them concerned graphics cards. These requests fall mainly into two categories:
- special order requests for gaming terminal graphics cards
- technical help on graphics cards

While there are enough Customer Support Officers to handle requests for technical help on graphics cards during office hours, many such requests arrive after office hours, go unanswered, and pile up until the next working day.

Company Management has therefore tasked us, the Data Science Division, to work with the Software Engineering team on a way to field after-hours graphics card requests and so achieve a higher level of customer care.

The Software Engineering team wants to create a chatbot which receives a question from a customer and returns a relevant subreddit link containing potential solutions. They intend to do this with a two-phase approach:
1. correctly classify each question into the subreddit it belongs to (**First Phase**)
2. return a relevant subreddit link containing potential solutions to the customer (**Second Phase**).


**The focus of this project will only be on the First Phase. In other words:**

The current role of the Data Science Division in this project is to classify each question received into the correct subreddit, either AMD or NVIDIA, since GGE carries only these two brands.


## Import Libraries

In [1]:
import pandas as pd
import numpy as np
import datetime as dt
import random

#for data scraping
import requests

#for pausing (sleeping) between pulls of 100 posts
import time

#hide warnings
# import warnings
# warnings.filterwarnings('ignore')

In [2]:
pd.set_option('display.max_columns', None)

## Data Scrapping

### Initial Test Scrapping

We first scrape 100 posts from each subreddit to explore the sample data and decide which fields are worth keeping.

In [3]:
#general pushshift api url for the subreddits
url =  'https://api.pushshift.io/reddit/search/submission/'

#set parameters (in dict format) to send as query string
params_amd = {'subreddit': 'Amd',
              'size': 100
             }

params_nvidia = {'subreddit': 'nvidia', 
                'size': 100
               }
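
To see what these parameter dictionaries look like once they are sent, the query string can be built by hand. The sketch below uses `urllib.parse.urlencode` from the standard library purely for inspection; `requests` performs an equivalent encoding itself when it is given a parameter dictionary, so this step is an illustration and not part of the scraping.

```python
from urllib.parse import urlencode

# same endpoint and parameters as in the cell above
url = 'https://api.pushshift.io/reddit/search/submission/'
params_amd = {'subreddit': 'Amd', 'size': 100}

# build the full request URL by hand to inspect it (no request is sent)
full_url_amd = url + '?' + urlencode(params_amd)
print(full_url_amd)
```

The printed URL has the same shape as the ones assembled manually in the scraping function later in this notebook.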

In [4]:
#submit requests and get response code to see if it worked
req_amd = requests.get(url, params_amd)
print(req_amd.status_code)

req_nvidia = requests.get(url, params_nvidia)
print(req_nvidia.status_code)

200
200


In [5]:
#get content into dictionary format
data_amd = req_amd.json()
print(type(data_amd))

#check the keys in the dictionary
print(data_amd.keys())

<class 'dict'>
dict_keys(['data'])


In [6]:
#get content into dictionary format
data_nvidia = req_nvidia.json()
print(type(data_nvidia))

#check the keys in the dictionary
print(data_nvidia.keys())

<class 'dict'>
dict_keys(['data'])


In [7]:
#print first post in the dictionaries
print('AMD first post in dictionary\n', data_amd['data'][0])
print('\nNVIDIA first post in dictionary\n', data_nvidia['data'][0])

AMD first post in dictionary
 {'all_awardings': [], 'allow_live_comments': False, 'author': 'Snoo_11263', 'author_flair_css_class': None, 'author_flair_richtext': [], 'author_flair_text': None, 'author_flair_type': 'text', 'author_fullname': 't2_6ysgf78k', 'author_is_blocked': False, 'author_patreon_flair': False, 'author_premium': False, 'awarders': [], 'can_mod_post': False, 'contest_mode': False, 'created_utc': 1627583855, 'domain': 'self.Amd', 'full_link': 'https://www.reddit.com/r/Amd/comments/ou3f2f/3700x_vs_5600x/', 'gildings': {}, 'id': 'ou3f2f', 'is_created_from_ads_ui': False, 'is_crosspostable': False, 'is_meta': False, 'is_original_content': False, 'is_reddit_media_domain': False, 'is_robot_indexable': False, 'is_self': True, 'is_video': False, 'link_flair_background_color': '#ff9800', 'link_flair_richtext': [], 'link_flair_template_id': 'a0256696-9254-11e6-936c-0e8d317f3e82', 'link_flair_text': 'Discussion', 'link_flair_text_color': 'light', 'link_flair_type': 'text', 'loc

In [8]:
#save into dataframe for better visibility
df_amd = pd.DataFrame(data_amd['data'])

df_nvidia = pd.DataFrame(data_nvidia['data'])

In [9]:
df_amd.head()

Unnamed: 0,all_awardings,allow_live_comments,author,author_flair_css_class,author_flair_richtext,author_flair_text,author_flair_type,author_fullname,author_is_blocked,author_patreon_flair,author_premium,awarders,can_mod_post,contest_mode,created_utc,domain,full_link,gildings,id,is_created_from_ads_ui,is_crosspostable,is_meta,is_original_content,is_reddit_media_domain,is_robot_indexable,is_self,is_video,link_flair_background_color,link_flair_richtext,link_flair_template_id,link_flair_text,link_flair_text_color,link_flair_type,locked,media_only,no_follow,num_comments,num_crossposts,over_18,parent_whitelist_status,permalink,pinned,pwls,removed_by_category,retrieved_on,score,selftext,send_replies,spoiler,stickied,subreddit,subreddit_id,subreddit_subscribers,subreddit_type,suggested_sort,thumbnail,title,total_awards_received,treatment_tags,upvote_ratio,url,whitelist_status,wls,post_hint,preview,thumbnail_height,thumbnail_width,url_overridden_by_dest,crosspost_parent,crosspost_parent_list,media,media_embed,secure_media,secure_media_embed,author_flair_background_color,author_flair_template_id,author_flair_text_color,author_cakeday,gallery_data,is_gallery,media_metadata
0,[],False,Snoo_11263,,[],,text,t2_6ysgf78k,False,False,False,[],False,False,1627583855,self.Amd,https://www.reddit.com/r/Amd/comments/ou3f2f/3...,{},ou3f2f,False,False,False,False,False,False,True,False,#ff9800,[],a0256696-9254-11e6-936c-0e8d317f3e82,Discussion,light,text,False,False,True,0,0,False,all_ads,/r/Amd/comments/ou3f2f/3700x_vs_5600x/,False,6,reddit,1627583866,1,[removed],True,False,False,Amd,t5_2rw0n,1026486,public,confidence,self,3700x vs 5600x,0,[],1.0,https://www.reddit.com/r/Amd/comments/ou3f2f/3...,all_ads,6,,,,,,,,,,,,,,,,,,
1,[],False,Expensive_Worth_4071,,[],,text,t2_cuxz5rab,False,False,False,[],False,False,1627583342,self.Amd,https://www.reddit.com/r/Amd/comments/ou38y5/h...,{},ou38y5,False,False,False,False,False,False,True,False,#ff9800,[],a0256696-9254-11e6-936c-0e8d317f3e82,Discussion,light,text,False,False,True,0,0,False,all_ads,/r/Amd/comments/ou38y5/how_much_is_the_differe...,False,6,reddit,1627583352,1,[removed],True,False,False,Amd,t5_2rw0n,1026474,public,confidence,self,How much is the difference between Amd 7 5800 ...,0,[],1.0,https://www.reddit.com/r/Amd/comments/ou38y5/h...,all_ads,6,,,,,,,,,,,,,,,,,,
2,[],False,craftbot,,[],,text,t2_4bucz,False,False,False,[],False,False,1627582916,self.Amd,https://www.reddit.com/r/Amd/comments/ou341i/r...,{},ou341i,False,False,False,False,False,False,True,False,#ff9800,[],a0256696-9254-11e6-936c-0e8d317f3e82,Discussion,light,text,False,False,True,0,0,False,all_ads,/r/Amd/comments/ou341i/rx_6800_vs_rx_6700_xt/,False,6,reddit,1627582926,1,[removed],True,False,False,Amd,t5_2rw0n,1026461,public,confidence,self,RX 6800 vs RX 6700 XT,0,[],1.0,https://www.reddit.com/r/Amd/comments/ou341i/r...,all_ads,6,,,,,,,,,,,,,,,,,,
3,[],False,iiamshit,,[],,text,t2_3ctqg5jf,False,False,False,[],False,False,1627582871,self.Amd,https://www.reddit.com/r/Amd/comments/ou33gw/w...,{},ou33gw,False,False,False,False,False,False,True,False,#ffc107,[],28a601d8-9255-11e6-9c8f-0eda72ca337c,Tech Support,light,text,True,False,True,1,0,False,all_ads,/r/Amd/comments/ou33gw/will_an_amd_wraith_stea...,False,6,moderator,1627582882,1,[removed],True,False,False,Amd,t5_2rw0n,1026461,public,confidence,self,Will an AMD Wraith Stealth cooler fit on an AM...,0,[],1.0,https://www.reddit.com/r/Amd/comments/ou33gw/w...,all_ads,6,,,,,,,,,,,,,,,,,,
4,[],False,Devil-Child-6763,,[],,text,t2_as4b3u00,False,False,False,[],False,False,1627581759,self.Amd,https://www.reddit.com/r/Amd/comments/ou2q1x/w...,{},ou2q1x,False,False,False,False,False,False,True,False,#ff9800,[],a0256696-9254-11e6-936c-0e8d317f3e82,Discussion,light,text,False,False,True,0,0,False,all_ads,/r/Amd/comments/ou2q1x/what_does_14v_do_to_a_3...,False,6,reddit,1627581770,1,[removed],True,False,False,Amd,t5_2rw0n,1026446,public,confidence,self,What does 1.4v do to a 3600xt? Let's find out.,0,[],1.0,https://www.reddit.com/r/Amd/comments/ou2q1x/w...,all_ads,6,,,,,,,,,,,,,,,,,,


In [10]:
df_nvidia.head()

Unnamed: 0,all_awardings,allow_live_comments,author,author_flair_css_class,author_flair_richtext,author_flair_text,author_flair_type,author_fullname,author_is_blocked,author_patreon_flair,author_premium,awarders,can_mod_post,contest_mode,created_utc,domain,full_link,gildings,id,is_created_from_ads_ui,is_crosspostable,is_meta,is_original_content,is_reddit_media_domain,is_robot_indexable,is_self,is_video,link_flair_background_color,link_flair_css_class,link_flair_richtext,link_flair_template_id,link_flair_text,link_flair_text_color,link_flair_type,locked,media_only,no_follow,num_comments,num_crossposts,over_18,parent_whitelist_status,permalink,pinned,pwls,retrieved_on,score,selftext,send_replies,spoiler,stickied,subreddit,subreddit_id,subreddit_subscribers,subreddit_type,thumbnail,title,total_awards_received,treatment_tags,upvote_ratio,url,whitelist_status,wls,media,media_embed,post_hint,preview,secure_media,secure_media_embed,thumbnail_height,thumbnail_width,url_overridden_by_dest,removed_by_category,author_flair_background_color,author_flair_template_id,author_flair_text_color,gallery_data,is_gallery,media_metadata,crosspost_parent,crosspost_parent_list
0,[],False,Maaxxat,,[],,text,t2_66fxrzus,False,False,False,[],False,False,1627586247,self.nvidia,https://www.reddit.com/r/nvidia/comments/ou47u...,{},ou47uz,False,True,False,False,False,True,True,False,#e91e63,question,"[{'e': 'text', 't': 'Question'}]",fb6f8e52-4086-11e6-b284-0ee96c7aff3d,Question,light,richtext,False,False,True,0,0,False,all_ads,/r/nvidia/comments/ou47uz/really_bad_performan...,False,6,1627586259,1,"I hadn’t played gta 5 in a year, updated it an...",True,False,False,nvidia,t5_2rlgy,958001,public,self,Really bad performance in gta 5 only,0,[],1.0,https://www.reddit.com/r/nvidia/comments/ou47u...,all_ads,6,,,,,,,,,,,,,,,,,,
1,[],False,NurkShark,,[],,text,t2_c5wxoeuz,False,False,False,[],False,False,1627586008,self.nvidia,https://www.reddit.com/r/nvidia/comments/ou451...,{},ou4513,False,True,False,False,False,True,True,False,#e91e63,question,"[{'e': 'text', 't': 'Question'}]",fb6f8e52-4086-11e6-b284-0ee96c7aff3d,Question,light,richtext,False,False,False,0,0,False,all_ads,/r/nvidia/comments/ou4513/unlocking_gtx_1660ti/,False,6,1627586020,1,I need help me with unlocking power and temp...,True,False,False,nvidia,t5_2rlgy,957998,public,self,Unlocking gtx 1660ti,0,[],1.0,https://www.reddit.com/r/nvidia/comments/ou451...,all_ads,6,,,,,,,,,,,,,,,,,,
2,[],False,Jclevs11,,[],,text,t2_8inge,False,False,False,[],False,False,1627585551,youtube.com,https://www.reddit.com/r/nvidia/comments/ou3zj...,{},ou3zj4,False,True,False,False,False,True,False,False,#e91e63,question,"[{'e': 'text', 't': 'Question'}]",fb6f8e52-4086-11e6-b284-0ee96c7aff3d,Question,light,richtext,False,False,True,0,0,False,all_ads,/r/nvidia/comments/ou3zj4/why_is_my_geforce_ex...,False,6,1627585561,1,,True,False,False,nvidia,t5_2rlgy,957992,public,https://b.thumbs.redditmedia.com/K1w_lAf1G2cI6...,Why is my Geforce Experience doing this?,0,[],1.0,https://www.youtube.com/watch?v=nMzJgHdsLaY&am...,all_ads,6,"{'oembed': {'author_name': 'Jack Cleverly', 'a...","{'content': '&lt;iframe width=""356"" height=""20...",rich:video,"{'enabled': False, 'images': [{'id': 'VxofQcfv...","{'oembed': {'author_name': 'Jack Cleverly', 'a...","{'content': '&lt;iframe width=""356"" height=""20...",105.0,140.0,https://www.youtube.com/watch?v=nMzJgHdsLaY&am...,,,,,,,,,
3,[],False,JDSP_,,[],,text,t2_15chpm,False,False,False,[],False,False,1627585436,imgsli.com,https://www.reddit.com/r/nvidia/comments/ou3y5...,{},ou3y58,False,True,False,False,False,True,False,False,#ff9800,discussion,"[{'e': 'text', 't': 'Discussion'}]",cf69204c-61e5-11e4-890f-12313b0e8c78,Discussion,light,richtext,False,False,True,0,0,False,all_ads,/r/nvidia/comments/ou3y58/the_ascend_dlss_comp...,False,6,1627585446,1,,True,False,False,nvidia,t5_2rlgy,957990,public,https://b.thumbs.redditmedia.com/b2sVhPvut_DDL...,The Ascend DLSS comparisons - In Motion screen...,0,[],1.0,https://imgsli.com/NjMxNjc/0/4,all_ads,6,,,link,"{'enabled': False, 'images': [{'id': 'kBzvwK-2...",,,78.0,140.0,https://imgsli.com/NjMxNjc/0/4,,,,,,,,,
4,[],False,aaf66,,[],,text,t2_10eeyk,False,False,False,[],False,False,1627583955,dsogaming.com,https://www.reddit.com/r/nvidia/comments/ou3ga...,{},ou3ga4,False,True,False,False,False,True,False,False,#2196f3,news,"[{'e': 'text', 't': 'News'}]",031e2f90-61e6-11e4-aafd-12313d1652b9,News,light,richtext,False,False,False,0,0,False,all_ads,/r/nvidia/comments/ou3ga4/escape_from_naraka_i...,False,6,1627583966,1,,True,False,False,nvidia,t5_2rlgy,957956,public,https://b.thumbs.redditmedia.com/8BB7mgmsMaIeh...,Escape from Naraka is the first game using NVI...,0,[],1.0,https://www.dsogaming.com/news/escape-from-nar...,all_ads,6,,,link,"{'enabled': False, 'images': [{'id': 'rbIiHYcC...",,,78.0,140.0,https://www.dsogaming.com/news/escape-from-nar...,,,,,,,,,


In [11]:
#check all column names to sieve out the ones we need
print(df_amd.columns)

print(df_nvidia.columns)

Index(['all_awardings', 'allow_live_comments', 'author',
       'author_flair_css_class', 'author_flair_richtext', 'author_flair_text',
       'author_flair_type', 'author_fullname', 'author_is_blocked',
       'author_patreon_flair', 'author_premium', 'awarders', 'can_mod_post',
       'contest_mode', 'created_utc', 'domain', 'full_link', 'gildings', 'id',
       'is_created_from_ads_ui', 'is_crosspostable', 'is_meta',
       'is_original_content', 'is_reddit_media_domain', 'is_robot_indexable',
       'is_self', 'is_video', 'link_flair_background_color',
       'link_flair_richtext', 'link_flair_template_id', 'link_flair_text',
       'link_flair_text_color', 'link_flair_type', 'locked', 'media_only',
       'no_follow', 'num_comments', 'num_crossposts', 'over_18',
       'parent_whitelist_status', 'permalink', 'pinned', 'pwls',
       'removed_by_category', 'retrieved_on', 'score', 'selftext',
       'send_replies', 'spoiler', 'stickied', 'subreddit', 'subreddit_id',
       'subredd

### Useful Data Analysis Summary

Looking through the data we collected, most of the columns are metadata that we will not need. We will keep only the few columns listed below.

<b>Data to keep</b>:
- subreddit --> target y
- title
- selftext
- media_only (used to check that none of the collected posts contain only media; the column itself will be dropped afterwards)

<b>Data to keep but may drop afterwards (numerical, only use if correlation with target can be proven)</b>:
- num_comments
- num_crossposts
- score
- total_awards_received

<b>Initial cleaning to be done</b>:
- drop all duplicates so that only unique posts remain in the dataframe
- use the media_only column to remove posts that consist of only media
- title + selftext --> combine the post title and post body into a single text column
- drop rows with NULL values in the text column
- change subreddit column values to numerical format (amd = 1, nvidia = 0)
- keep 10,000 posts per subreddit so that the models train and test on a balanced dataset

The filtering and cleaning will be done in a separate Jupyter Notebook, part 3B.
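
The actual cleaning lives in part 3B. As a rough sketch of the steps listed above, the example below applies them to a tiny made-up sample, not to the scraped data. The combined column name `text` and the sample rows are assumptions for illustration only, and the 10,000-post cap is left out.

```python
import pandas as pd

# tiny made-up sample standing in for the scraped posts (not real data)
sample = pd.DataFrame({
    'subreddit': ['Amd', 'Amd', 'nvidia', 'nvidia', 'nvidia'],
    'title': ['RX 6800 vs RX 6700 XT', 'RX 6800 vs RX 6700 XT',
              'Unlocking gtx 1660ti', 'Driver crash', 'Screenshot'],
    'selftext': ['Which one?', 'Which one?', 'Need help', None, 'look'],
    'media_only': [False, False, False, False, True],
})

# drop exact duplicate posts
clean = sample.drop_duplicates()

# remove posts that consist of only media
clean = clean[~clean['media_only']]

# drop rows with missing text, then combine title and body into one text column
clean = clean.dropna(subset=['title', 'selftext'])
clean = clean.assign(text=clean['title'] + ' ' + clean['selftext'])

# encode the target: amd = 1, nvidia = 0
clean = clean.assign(subreddit=clean['subreddit'].str.lower().map({'amd': 1, 'nvidia': 0}))

print(clean[['subreddit', 'text']])
```

Lower-casing the subreddit name before mapping is a defensive choice here, since the scraped values appear as `Amd` and `nvidia`.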

In [12]:
#names of required columns
required = ['subreddit', 'title', 'selftext', 'media_only', 
            'num_comments', 'num_crossposts', 'score', 'total_awards_received']

### Actual Scrapping

Ultimately, we want to keep 10,000 posts per subreddit to fit into our models. Here, we scrape slightly more than required (about 12,000 per subreddit) because we expect some rows to be dropped during filtering and cleaning.

In [13]:
#define function to run the request 'iterations' number of times so we can get the number of posts we want
#'iterations' is an integer and represents how many times we want to scrape (100 posts per pull)
def get_posts(subreddit, iterations=100):

    #create empty list to hold newly created dataframes
    df_list = []
    
    #base url with pull size max 100
    base_url = 'https://api.pushshift.io/reddit/search/submission?subreddit=' + subreddit + '&size=100'
    
    #set before parameter to prevent scraping from the same timeframe and getting duplicates, initial value set to None
    before = None
    
    #loop through 'iterations' range
    for iteration in range(iterations):
        
        #set the url to pull --> first pull will be the latest posts, subsequent pulls will be from timeframe before
        if before is None:
            current_url = base_url
        else:
            current_url = base_url + '&before=' + str(before)
        
        #print url while pulling to check if pulling from correct url
        print(current_url)
        
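        #check if request went through --> 200 if successful, otherwise skip iteration
        try:
            #pull request using url set
            pull_req = requests.get(current_url)
        except requests.exceptions.RequestException as err:
            #the request itself failed (e.g. connection error), so there is no status code to report
            print('Request failed:', err, '\nSkipping current iteration\n')
            continue
        
        if pull_req.status_code != 200:
            print('Status code is not 200! Status code is', pull_req.status_code, '\nSkipping current iteration\n')
            continue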
        
        #get content into dictionary format
        current_dict = pull_req.json()
        
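        #change dictionary to dataframe format
        current_df = pd.DataFrame(current_dict['data'])
        
        #stop if no posts were returned, as there is no created_utc value to page from
        if current_df.empty:
            print('No more posts returned, stopping early\n')
            break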
        
        #set the before parameter using smallest created_utc value (the earliest post time in current dataframe)
        before = current_df['created_utc'].min()
        
        #append dataframe to df_list
        df_list.append(current_df)
        
        #print number of posts scrapped so far
        total = sum(len(post) for post in df_list)
        print('Total posts scrapped:', total)
        
        #if we exceed 12_000 posts, stop the loop
        if total > 12_000:
            break
        
        #generate random sleep duration to prevent potential blocking/throttling
        sleep_dur = random.randint(3, 5)
        print('Sleep duration:', sleep_dur, '\n')
        time.sleep(sleep_dur)
        
    print('Scrape completed!')
    
    #concat list of dataframes so it can be displayed in dataframe format instead of a list
    return pd.concat(df_list, sort=False)
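
The `before` cursor is what keeps successive pulls from overlapping. The sketch below reproduces only that paging logic against a made-up in-memory list of timestamps, so it runs without network access. `fake_fetch` and `paginate` are hypothetical stand-ins written for this illustration; they are not part of Pushshift or `requests`.

```python
def fake_fetch(posts, before=None, size=3):
    """Hypothetical stand-in for one API pull: newest first, strictly older than 'before'."""
    eligible = [p for p in posts if before is None or p < before]
    return sorted(eligible, reverse=True)[:size]


def paginate(posts, iterations=10, size=3):
    """Collect pages by moving the 'before' cursor back after every pull."""
    collected = []
    before = None
    for _ in range(iterations):
        page = fake_fetch(posts, before, size)
        # stop once a pull comes back empty
        if not page:
            break
        collected.extend(page)
        # next pull starts strictly before the earliest timestamp seen in this page
        before = min(page)
    return collected


# made-up created_utc values, for illustration only
timestamps = [100, 90, 80, 70, 60, 50, 40]
result = paginate(timestamps)
print(result)
print(len(result) == len(set(result)))  # True when no post was pulled twice
```

Because each page is requested from before the earliest timestamp of the previous page, the pages never share a post in this simplified setting.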
    

In [14]:
amd_scrape = get_posts('Amd', 120)

https://api.pushshift.io/reddit/search/submission?subreddit=Amd&size=100
Total posts scrapped: 100
Sleep duration: 3 

https://api.pushshift.io/reddit/search/submission?subreddit=Amd&size=100&before=1627500453
Total posts scrapped: 200
Sleep duration: 4 

https://api.pushshift.io/reddit/search/submission?subreddit=Amd&size=100&before=1627408591
Total posts scrapped: 300
Sleep duration: 3 

https://api.pushshift.io/reddit/search/submission?subreddit=Amd&size=100&before=1627305398
Total posts scrapped: 400
Sleep duration: 5 

https://api.pushshift.io/reddit/search/submission?subreddit=Amd&size=100&before=1627205425
Total posts scrapped: 500
Sleep duration: 5 

https://api.pushshift.io/reddit/search/submission?subreddit=Amd&size=100&before=1627116471
Total posts scrapped: 600
Sleep duration: 4 

https://api.pushshift.io/reddit/search/submission?subreddit=Amd&size=100&before=1627014925
Total posts scrapped: 700
Sleep duration: 5 

https://api.pushshift.io/reddit/search/submission?subreddit

Total posts scrapped: 6000
Sleep duration: 4 

https://api.pushshift.io/reddit/search/submission?subreddit=Amd&size=100&before=1622651994
Total posts scrapped: 6100
Sleep duration: 5 

https://api.pushshift.io/reddit/search/submission?subreddit=Amd&size=100&before=1622584290
Total posts scrapped: 6200
Sleep duration: 4 

https://api.pushshift.io/reddit/search/submission?subreddit=Amd&size=100&before=1622538880
Total posts scrapped: 6300
Sleep duration: 4 

https://api.pushshift.io/reddit/search/submission?subreddit=Amd&size=100&before=1622471168
Total posts scrapped: 6400
Sleep duration: 5 

https://api.pushshift.io/reddit/search/submission?subreddit=Amd&size=100&before=1622369253
Total posts scrapped: 6500
Sleep duration: 3 

https://api.pushshift.io/reddit/search/submission?subreddit=Amd&size=100&before=1622284730
Total posts scrapped: 6600
Sleep duration: 3 

https://api.pushshift.io/reddit/search/submission?subreddit=Amd&size=100&before=1622196549
Total posts scrapped: 6700
Sleep d

https://api.pushshift.io/reddit/search/submission?subreddit=Amd&size=100&before=1617683961
Total posts scrapped: 12000
Sleep duration: 5 

Scrape completed!
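
The `before` values in the URLs above are `created_utc` values, which are Unix timestamps in seconds. As a small sketch of how to read them, the example below converts the first and last AMD cursors printed above into UTC datetimes to gauge the time span the scrape covers. The variable names are chosen for this illustration only.

```python
import datetime as dt

# 'before' values copied from the first and last AMD pulls printed above
first_before = 1627500453
last_before = 1617683961

# created_utc values are Unix timestamps (seconds since 1970-01-01 UTC)
newest = dt.datetime.fromtimestamp(first_before, tz=dt.timezone.utc)
oldest = dt.datetime.fromtimestamp(last_before, tz=dt.timezone.utc)

print('Newest cursor:', newest.isoformat())
print('Oldest cursor:', oldest.isoformat())
print('Days covered:', (newest - oldest).days)
```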


In [15]:
nvidia_scrape = get_posts('nvidia', 120)

https://api.pushshift.io/reddit/search/submission?subreddit=nvidia&size=100
Total posts scrapped: 100
Sleep duration: 4 

https://api.pushshift.io/reddit/search/submission?subreddit=nvidia&size=100&before=1627486534
Total posts scrapped: 200
Sleep duration: 4 

https://api.pushshift.io/reddit/search/submission?subreddit=nvidia&size=100&before=1627380987
Total posts scrapped: 300
Sleep duration: 5 

https://api.pushshift.io/reddit/search/submission?subreddit=nvidia&size=100&before=1627271090
Total posts scrapped: 400
Sleep duration: 4 

https://api.pushshift.io/reddit/search/submission?subreddit=nvidia&size=100&before=1627141891
Total posts scrapped: 500
Sleep duration: 3 

https://api.pushshift.io/reddit/search/submission?subreddit=nvidia&size=100&before=1627023432
Total posts scrapped: 600
Sleep duration: 4 

https://api.pushshift.io/reddit/search/submission?subreddit=nvidia&size=100&before=1626912041
Total posts scrapped: 700
Sleep duration: 3 

https://api.pushshift.io/reddit/search

Total posts scrapped: 5900
Sleep duration: 4 

https://api.pushshift.io/reddit/search/submission?subreddit=nvidia&size=100&before=1622125921
Total posts scrapped: 6000
Sleep duration: 3 

https://api.pushshift.io/reddit/search/submission?subreddit=nvidia&size=100&before=1622029671
Total posts scrapped: 6100
Sleep duration: 4 

https://api.pushshift.io/reddit/search/submission?subreddit=nvidia&size=100&before=1621895837
Total posts scrapped: 6200
Sleep duration: 3 

https://api.pushshift.io/reddit/search/submission?subreddit=nvidia&size=100&before=1621798232
Total posts scrapped: 6300
Sleep duration: 3 

https://api.pushshift.io/reddit/search/submission?subreddit=nvidia&size=100&before=1621692603
Total posts scrapped: 6400
Sleep duration: 4 

https://api.pushshift.io/reddit/search/submission?subreddit=nvidia&size=100&before=1621581347
Total posts scrapped: 6500
Sleep duration: 4 

https://api.pushshift.io/reddit/search/submission?subreddit=nvidia&size=100&before=1621473639
Total posts s

Total posts scrapped: 11700
Sleep duration: 5 

https://api.pushshift.io/reddit/search/submission?subreddit=nvidia&size=100&before=1615886829
Total posts scrapped: 11800
Sleep duration: 4 

https://api.pushshift.io/reddit/search/submission?subreddit=nvidia&size=100&before=1615803714
Total posts scrapped: 11900
Sleep duration: 4 

https://api.pushshift.io/reddit/search/submission?subreddit=nvidia&size=100&before=1615700611
Total posts scrapped: 12000
Sleep duration: 5 

Scrape completed!


We keep only the columns we decided are necessary, so that the CSV files do not take up more space than required when saving and loading.

In [16]:
amd_raw = amd_scrape.filter(required)
nvidia_raw = nvidia_scrape.filter(required)
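
`DataFrame.filter` with a list of names keeps the listed columns that exist and silently skips any that do not, so a missing column would go unnoticed. A minimal sketch of an explicit check follows, using a made-up frame (`demo`) in place of the scraped data.

```python
import pandas as pd

required = ['subreddit', 'title', 'selftext', 'media_only',
            'num_comments', 'num_crossposts', 'score', 'total_awards_received']

# made-up frame that lacks some of the required columns (illustration only)
demo = pd.DataFrame({'subreddit': ['Amd'], 'title': ['3700x vs 5600x'], 'author': ['someone']})

# filter keeps only the requested columns that are present
kept = demo.filter(required)

# list any required columns that were not in the frame
missing = [col for col in required if col not in demo.columns]
print(list(kept.columns))
print(missing)
```

An empty `missing` list for both scraped frames would confirm that every required column made it into the raw data.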

In [17]:
amd_raw.head()

|   | subreddit | title | selftext | media_only | num_comments | num_crossposts | score | total_awards_received |
|---|---|---|---|---|---|---|---|---|
| 0 | Amd | 3700x vs 5600x | [removed] | False | 0 | 0 | 1 | 0 |
| 1 | Amd | How much is the difference between Amd 7 5800 ... | [removed] | False | 0 | 0 | 1 | 0 |
| 2 | Amd | RX 6800 vs RX 6700 XT | [removed] | False | 0 | 0 | 1 | 0 |
| 3 | Amd | Will an AMD Wraith Stealth cooler fit on an AM... | [removed] | False | 1 | 0 | 1 | 0 |
| 4 | Amd | What does 1.4v do to a 3600xt? Let's find out. | [removed] | False | 0 | 0 | 1 | 0 |


In [18]:
nvidia_raw.head()

|   | subreddit | title | selftext | media_only | num_comments | num_crossposts | score | total_awards_received |
|---|---|---|---|---|---|---|---|---|
| 0 | nvidia | GFN Thursday: 14 New Games Join GeForce NOW |  | False | 0 | 0 | 1 | 0 |
| 1 | nvidia | Where to buy GPU in Italy | Hello everyone i'm from Brazil and i'm plannin... | False | 0 | 0 | 1 | 0 |
| 2 | nvidia | Really bad performance in gta 5 only | I hadn’t played gta 5 in a year, updated it an... | False | 0 | 0 | 1 | 0 |
| 3 | nvidia | Unlocking gtx 1660ti | I need help me with unlocking power and temp... | False | 0 | 0 | 1 | 0 |
| 4 | nvidia | Why is my Geforce Experience doing this? |  | False | 0 | 0 | 1 | 0 |


## Save Raw Scrapped Data into CSV

In [19]:
amd_raw.to_csv('../data/amd_raw.csv', index=False)

In [20]:
nvidia_raw.to_csv('../data/nvidia_raw.csv', index=False)