# Pushshift API Data Scraping

## Problem Statement:

My objective is to determine which channel to direct a player to, based on the words they use. I will look at two specific games, League of Legends and Rainbow Six Siege, because both have among the most active communities and their play styles are very different. I will pull submissions and comments from both games' subreddits through the Pushshift API and feed the data to two classification models: Logistic Regression and a Random Forest Classifier. Then I will examine the ten words that players of each game are most and least likely to mention.
    
#### Contents:
- [Getting Submissions and Comments from Tom Clancy's Rainbow Six Siege Subreddit](#Getting-Submissions-and-Comments-from-Tom-Clancy's-Rainbow-Six-Siege-Subreddit)
- [Getting Submissions and Comments from League of Legends Subreddit](#Getting-Submissions-and-Comments-from-League-of-Legends-Subreddit)
- [Finalizing DataSet](#Finalizing-DataSet)

---

### Importing libraries

In [1]:
import requests
import json
import pandas as pd
import time
import re

## Push-Shift API

Here I pull subreddit data through the Pushshift API. The function below loops, each time requesting the items that are older than the oldest item returned by the previous request.

In [58]:
def pushshift(reqtype, sub, after):
    # Base endpoint: up to 1000 results per request, newest first.
    base = 'https://api.pushshift.io/reddit/search/' + reqtype + '?sort=desc&size=1000'
    print('getting ' + reqtype + ' from ' + sub)
    res = requests.get(base + '&after=' + after + '&subreddit=' + sub)
    data = json.loads(res.content)['data']
    # Timestamp of the oldest item so far; None if the first page was empty.
    date = data[-1]['created_utc'] if data else None
    for i in range(9):
        if date is None:
            break  # nothing older to page back from
        url_i = base + '&before=' + str(date) + '&after=' + after + '&subreddit=' + sub
        print('getting url:' + url_i)
        res_i = requests.get(url_i)
        data_i = json.loads(res_i.content)['data']
        if not data_i:
            break  # no more results inside the requested window
        data.extend(data_i)
        date = data_i[-1]['created_utc']
        time.sleep(1)  # pause between requests
    return data
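
The paging logic can be checked without touching the network. The sketch below is an illustration only, not the code that produced this notebook's results: `build_url`, `collect_pages`, and the in-memory `fake_fetch` are hypothetical names, the fake data assumes that `before` excludes the given timestamp, and the query string is built with the standard library's `urllib.parse.urlencode` rather than string concatenation.

```python
from urllib.parse import urlencode

BASE = 'https://api.pushshift.io/reddit/search/'

def build_url(reqtype, sub, after, before=None, size=1000):
    """Build a search URL; `before` is only added when paging backwards."""
    params = {'sort': 'desc', 'size': size}
    if before is not None:
        params['before'] = before
    params['after'] = after
    params['subreddit'] = sub
    return BASE + reqtype + '?' + urlencode(params)

def collect_pages(fetch_page, max_pages=10):
    """Call fetch_page(before) repeatedly, moving `before` to the oldest
    timestamp seen so far, until a page is empty or max_pages is reached."""
    items, before = [], None
    for _ in range(max_pages):
        page = fetch_page(before)
        if not page:
            break
        items.extend(page)
        before = page[-1]['created_utc']
    return items

# Hypothetical stand-in for the API: 25 posts with timestamps 25..1,
# returned newest first in pages of at most 10.
POSTS = [{'created_utc': t} for t in range(25, 0, -1)]

def fake_fetch(before, size=10):
    older = [p for p in POSTS if before is None or p['created_utc'] < before]
    return older[:size]

result = collect_pages(fake_fetch)
print(len(result))
print(build_url('submission', 'rainbow6', '60d', before=1545069697))
```

Passing the fetch function in as an argument keeps the loop independent of `requests`, so the stopping conditions can be exercised on toy data before any real request is sent.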

# Getting Submissions and Comments from Tom Clancy's Rainbow Six Siege Subreddit
---

**Submission**<br/>
In this case, I only needed the information in **_title_** and **_selftext_**; therefore, I combined those two fields into one column named **submission_text**.

**Comment**<br/>
I only needed the information in **_body_** from the comments data.

**Why?**<br/>
- After pulling the submissions and comments from the API, I kept only the required fields, then concatenated the two columns from the two separate DataFrames into one DataFrame.
- I added a new column as my binary target: 1 if the text is from Rainbow Six, else 0.
- I used a regular expression (regex) to strip URLs from the text, because they carry a large amount of unnecessary information and removing them can speed up processing.
---
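The URL-stripping step described above can be tried on a couple of made-up strings. This is a minimal sketch: `strip_urls` and the sample sentences are hypothetical, and the pattern is a raw-string spelling of the same idea (`http` or `https`, then `://`, then any run of non-whitespace).

```python
import re

# "http" or "https", "://", then everything up to the next whitespace.
URL_PATTERN = re.compile(r'https?://\S*')

def strip_urls(text):
    """Remove every URL-like token from text."""
    return URL_PATTERN.sub('', text)

samples = [
    'Patch notes here: https://example.com/patch?id=1 enjoy',
    'no links in this one',
]
cleaned = [strip_urls(s) for s in samples]
print(cleaned)
```

The spaces on either side of a removed URL are left in place, so a later tokenizer still sees the neighbouring words as separate tokens.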

In [3]:
sub_r6 = pushshift('submission', 'rainbow6', '60d')

getting submission from rainbow6
getting url:https://api.pushshift.io/reddit/search/submission?sort=desc&size=1000&before=1545069697&after=60d&subreddit=rainbow6
getting url:https://api.pushshift.io/reddit/search/submission?sort=desc&size=1000&before=1544948080&after=60d&subreddit=rainbow6
getting url:https://api.pushshift.io/reddit/search/submission?sort=desc&size=1000&before=1544821602&after=60d&subreddit=rainbow6
getting url:https://api.pushshift.io/reddit/search/submission?sort=desc&size=1000&before=1544680702&after=60d&subreddit=rainbow6
getting url:https://api.pushshift.io/reddit/search/submission?sort=desc&size=1000&before=1544564418&after=60d&subreddit=rainbow6
getting url:https://api.pushshift.io/reddit/search/submission?sort=desc&size=1000&before=1544458972&after=60d&subreddit=rainbow6
getting url:https://api.pushshift.io/reddit/search/submission?sort=desc&size=1000&before=1544332593&after=60d&subreddit=rainbow6
getting url:https://api.pushshift.io/reddit/search/submission?so

In [4]:
com_r6 = pushshift('comment', 'rainbow6', '60d')

getting comment from rainbow6
getting url:https://api.pushshift.io/reddit/search/comment?sort=desc&size=1000&before=1545169999&after=60d&subreddit=rainbow6
getting url:https://api.pushshift.io/reddit/search/comment?sort=desc&size=1000&before=1545159106&after=60d&subreddit=rainbow6
getting url:https://api.pushshift.io/reddit/search/comment?sort=desc&size=1000&before=1545148482&after=60d&subreddit=rainbow6
getting url:https://api.pushshift.io/reddit/search/comment?sort=desc&size=1000&before=1545138163&after=60d&subreddit=rainbow6
getting url:https://api.pushshift.io/reddit/search/comment?sort=desc&size=1000&before=1545119818&after=60d&subreddit=rainbow6
getting url:https://api.pushshift.io/reddit/search/comment?sort=desc&size=1000&before=1545102447&after=60d&subreddit=rainbow6
getting url:https://api.pushshift.io/reddit/search/comment?sort=desc&size=1000&before=1545089351&after=60d&subreddit=rainbow6
getting url:https://api.pushshift.io/reddit/search/comment?sort=desc&size=1000&before=15

In [5]:
len(sub_r6)

10000

In [6]:
len(com_r6)

10000

In [7]:
sub_r6 = pd.DataFrame(sub_r6)
com_r6 = pd.DataFrame(com_r6)

In [8]:
sub_r6.columns

Index(['author', 'author_cakeday', 'author_flair_background_color',
       'author_flair_css_class', 'author_flair_richtext',
       'author_flair_template_id', 'author_flair_text',
       'author_flair_text_color', 'author_flair_type', 'author_fullname',
       'author_patreon_flair', 'can_mod_post', 'contest_mode', 'created_utc',
       'crosspost_parent', 'crosspost_parent_list', 'domain', 'edited',
       'full_link', 'gildings', 'id', 'is_crosspostable', 'is_meta',
       'is_original_content', 'is_reddit_media_domain', 'is_robot_indexable',
       'is_self', 'is_video', 'link_flair_background_color',
       'link_flair_css_class', 'link_flair_richtext', 'link_flair_template_id',
       'link_flair_text', 'link_flair_text_color', 'link_flair_type', 'locked',
       'media', 'media_embed', 'media_only', 'no_follow', 'num_comments',
       'num_crossposts', 'over_18', 'parent_whitelist_status', 'permalink',
       'pinned', 'post_hint', 'preview', 'pwls', 'retrieved_on', 'score',
  

In [9]:
com_r6.columns

Index(['author', 'author_cakeday', 'author_flair_background_color',
       'author_flair_css_class', 'author_flair_richtext',
       'author_flair_template_id', 'author_flair_text',
       'author_flair_text_color', 'author_flair_type', 'author_fullname',
       'author_patreon_flair', 'body', 'created_utc', 'distinguished',
       'gildings', 'id', 'link_id', 'no_follow', 'parent_id', 'permalink',
       'retrieved_on', 'score', 'send_replies', 'stickied', 'subreddit',
       'subreddit_id'],
      dtype='object')

## Preparing DataSet

In [10]:
sub_r6['submission_text'] = sub_r6['title'] + sub_r6['selftext']

In [13]:
rainbow6 = pd.concat([sub_r6[['submission_text']], com_r6[['body']]], axis=1)

In [15]:
rainbow6['context'] = rainbow6['body'] + rainbow6['submission_text']

In [17]:
rainbow6['context'] = [re.sub(r'https?://\S*', '', text) for text in rainbow6.context]

In [18]:
rainbow6.drop(['body','submission_text'], axis=1, inplace=True)
rainbow6['rainbow6']= 1

In [19]:
rainbow6.head()

| | context | rainbow6 |
|---|---|---|
| 0 | Imagine having an elite skin when you're banne... | 1 |
| 1 | I legit haven’t been able to open a pack all s... | 1 |
| 2 | It's a community challenge right nowNoice | 1 |
| 3 | The toxicity carries over to reddit lolThe abs... | 1 |
| 4 | JeBaited. Feels Bad ManSiege Servers Recently.... | 1 |
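
In the preview above, texts joined with `+` run together with no separator (for example "right nowNoice"). Plain `+` on string columns also yields a missing value whenever either side is missing. The sketch below shows one possible way to guard against both; the toy frame is made up, and it is an assumption here that some submissions may arrive without a `selftext`.

```python
import pandas as pd

# Toy submissions: the second one has no selftext.
toy = pd.DataFrame({
    'title': ['Servers down again', 'New elite skin'],
    'selftext': ['Anyone else disconnected?', None],
})

# Plain `+` adds no separator and propagates the missing value.
naive = toy['title'] + toy['selftext']

# Fill missing text with '', join with a single space, then trim the ends.
safe = (toy['title'].fillna('') + ' ' + toy['selftext'].fillna('')).str.strip()

print(naive.tolist())
print(safe.tolist())
```

Keeping a space between the two fields matters for a bag-of-words model, since "nowNoice" would otherwise be counted as a single unseen token.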


# Getting Submissions and Comments from League of Legends Subreddit

I will follow the same steps for League of Legends.

**Submission**<br/>
As before, I only needed the information in **_title_** and **_selftext_**, so I combined those two fields into one column named **submission_text**.

**Comment**<br/>
I only needed the information in **_body_** from the comments data.

**Why?**<br/>
- After pulling the submissions and comments from the API, I kept only the required fields, then concatenated the two columns from the two separate DataFrames into one DataFrame.
- I added the same binary target column: 1 if the text is from Rainbow Six, else 0, so every League of Legends row is 0.
- I used a regular expression (regex) to strip URLs from the text, because they carry a large amount of unnecessary information and removing them can speed up processing.



In [20]:
data_lol = pushshift('submission', 'leagueoflegends', '60d')

getting submission from leagueoflegends
getting url:https://api.pushshift.io/reddit/search/submission?sort=desc&size=1000&before=1545067473&after=60d&subreddit=leagueoflegends
getting url:https://api.pushshift.io/reddit/search/submission?sort=desc&size=1000&before=1544953677&after=60d&subreddit=leagueoflegends
getting url:https://api.pushshift.io/reddit/search/submission?sort=desc&size=1000&before=1544827008&after=60d&subreddit=leagueoflegends
getting url:https://api.pushshift.io/reddit/search/submission?sort=desc&size=1000&before=1544724847&after=60d&subreddit=leagueoflegends
getting url:https://api.pushshift.io/reddit/search/submission?sort=desc&size=1000&before=1544615115&after=60d&subreddit=leagueoflegends
getting url:https://api.pushshift.io/reddit/search/submission?sort=desc&size=1000&before=1544512178&after=60d&subreddit=leagueoflegends
getting url:https://api.pushshift.io/reddit/search/submission?sort=desc&size=1000&before=1544404230&after=60d&subreddit=leagueoflegends
getting 

In [21]:
data_lolc = pushshift('comment', 'leagueoflegends', '60d')

getting comment from leagueoflegends
getting url:https://api.pushshift.io/reddit/search/comment?sort=desc&size=1000&before=1545175287&after=60d&subreddit=leagueoflegends
getting url:https://api.pushshift.io/reddit/search/comment?sort=desc&size=1000&before=1545168479&after=60d&subreddit=leagueoflegends
getting url:https://api.pushshift.io/reddit/search/comment?sort=desc&size=1000&before=1545162463&after=60d&subreddit=leagueoflegends
getting url:https://api.pushshift.io/reddit/search/comment?sort=desc&size=1000&before=1545155985&after=60d&subreddit=leagueoflegends
getting url:https://api.pushshift.io/reddit/search/comment?sort=desc&size=1000&before=1545150091&after=60d&subreddit=leagueoflegends
getting url:https://api.pushshift.io/reddit/search/comment?sort=desc&size=1000&before=1545143886&after=60d&subreddit=leagueoflegends
getting url:https://api.pushshift.io/reddit/search/comment?sort=desc&size=1000&before=1545137019&after=60d&subreddit=leagueoflegends
getting url:https://api.pushshif

In [73]:
len(data_lol)

10000

In [74]:
len(data_lolc)

10000

In [75]:
# Convert the results to DataFrames
sub_lol = pd.DataFrame(data_lol)
com_lol = pd.DataFrame(data_lolc)

In [76]:
sub_lol.to_csv('./datas/sub_lol.csv')
com_lol.to_csv('./datas/com_lol.csv')

## Preparing DataSet

In [77]:
sub_lol['submission_text'] = sub_lol['title'] + sub_lol['selftext']

In [78]:
lol = pd.concat([sub_lol[['submission_text']], com_lol[['body']]], axis=1)

In [79]:
lol.head()

| | submission_text | body |
|---|---|---|
| 0 | Who is Obi Zex Kenobi? 78% Winrate in high Cha... | Yup! It was amazing how quickly they released ... |
| 1 | My friend tryied to troll me but he changed wh... | First of all, you are getting better. Because ... |
| 2 | SIMPLE QUESTION: Your dream opponent is...?How... | I'm pretty new to reddit so idrk what you mean... |
| 3 | [Interview] The Former SKT T1 Support, Wolf, T... | Maybe its seems like there is no progress but ... |
| 4 | How to do deal with the disappointment/shame o... | Kalista, just because it explains Hecarims, th... |
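
`pd.concat(..., axis=1)` aligns the two frames on their row index, so each row in the preview above pairs a submission with the comment that happens to share its position, not its thread. If one document per row is preferred, an alternative is to stack both text columns vertically. The sketch below uses toy frames and a hypothetical `stack_texts` helper; it is not what this notebook does, and on the real data it would double the row count.

```python
import pandas as pd

def stack_texts(submissions, comments, label):
    """Return one row per submission or comment, plus a binary label column."""
    texts = pd.concat(
        [submissions['submission_text'], comments['body']],
        ignore_index=True,
    )
    return pd.DataFrame({'context': texts, 'rainbow6': label})

subs = pd.DataFrame({'submission_text': ['post A', 'post B']})
coms = pd.DataFrame({'body': ['comment X', 'comment Y', 'comment Z']})

stacked = stack_texts(subs, coms, label=0)
print(stacked)
```

Stacking also copes with unequal numbers of submissions and comments, whereas the side-by-side layout would leave missing values in the shorter column.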


In [80]:
lol['context'] = lol['body'] + lol['submission_text']

In [81]:
lol['context'] = [re.sub(r'https?://\S*', '', str(text)) for text in lol.context]

In [82]:
lol.drop(['body','submission_text'], axis=1, inplace=True)
lol['rainbow6'] = 0

# Finalizing DataSet
___

**What Happened?**<br/>
I concatenated the two DataFrames, one per subreddit (**rainbow6** and **leagueoflegends**), into the final DataFrame `df`, then saved it to a `.csv` file. Before saving, I also used regex to remove the subreddit names themselves, along with digits, so that the data reflects the vocabulary of each community rather than simply naming it. The dataset is then ready for the next step.
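
The name and digit removal can be illustrated on two made-up strings. In this sketch `scrub` is a hypothetical helper; the order of the two substitutions matters, because `rainbow6` itself contains a digit and would no longer match if digits were blanked out first.

```python
import re

# Subreddit names, optionally followed by "s", matched case-insensitively.
NAME_PATTERN = re.compile(r'(rainbow6|leagueoflegends)s?', flags=re.I)
DIGIT_PATTERN = re.compile(r'[0-9]')

def scrub(text):
    """Blank out subreddit names first, then any remaining digits."""
    text = NAME_PATTERN.sub(' ', text)
    return DIGIT_PATTERN.sub(' ', text)

print(scrub('Posted in r/Rainbow6 at 3am').split())
print(scrub('leagueoflegends patch 8.24').split())
```

The pattern only matches the subreddit names as written, so spaced forms such as "Rainbow Six" or "League of Legends" pass through unchanged.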


In [67]:
df = pd.concat([lol,rainbow6], ignore_index=True, sort=False)

In [68]:
df.reset_index(inplace=True, drop=True)

In [69]:
df['context'] = df.context.map(lambda x: re.sub('(rainbow6|leagueoflegends)[s]?',' ',x, flags= re.I))

In [70]:
df['context'] = df.context.map(lambda x: re.sub('[0-9]', ' ',x))

In [71]:
df.head()

| | context | rainbow6 |
|---|---|---|
| 0 | Yup! It was amazing how quickly they released ... | 0 |
| 1 | First of all, you are getting better. Because ... | 0 |
| 2 | I'm pretty new to reddit so idrk what you mean... | 0 |
| 3 | Maybe its seems like there is no progress but ... | 0 |
| 4 | Kalista, just because it explains Hecarims, th... | 0 |


In [72]:
df.to_csv('./datas/df.csv')
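
`to_csv` writes the row index by default, so reading the file back with `read_csv` produces an extra `Unnamed: 0` column. The sketch below shows the effect and one way to avoid it, using an in-memory buffer and a toy frame instead of the real `./datas/df.csv`; whether to drop the index is a choice for the next notebook, not something this one does.

```python
import io
import pandas as pd

toy = pd.DataFrame({'context': ['some text', 'more text'], 'rainbow6': [0, 1]})

# Default: the index is written as an unnamed first column.
with_index = io.StringIO()
toy.to_csv(with_index)
with_index.seek(0)
reloaded_default = pd.read_csv(with_index)

# Option: skip the index when writing.
without_index = io.StringIO()
toy.to_csv(without_index, index=False)
without_index.seek(0)
reloaded_clean = pd.read_csv(without_index)

print(reloaded_default.columns.tolist())
print(reloaded_clean.columns.tolist())
```

If the file has already been written with its index, `pd.read_csv(path, index_col=0)` restores that column as the index instead.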