<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">

# Project 3 - Web APIs & NLP

# 2_Data Extraction

## Web Scraping from r/stocks

In [1]:
import requests
import pandas as pd
import time

In [2]:
url = "https://api.pushshift.io/reddit/search/submission"

In [3]:
# set parameters to pull posts from the stocks subreddit
# request 100 posts per pull, the maximum allowed
# fix the "before" timestamp so that rerunning this notebook pulls the same data
params = {
    "subreddit": "stocks",
    "size": 100,
    "before": 1626851005
}
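
The `before` value is a Unix epoch timestamp in seconds. As a minimal sketch, this is one way such a cutoff could be derived from a readable UTC time; the helper name `to_epoch_seconds` is illustrative and not part of the original notebook.

```python
from datetime import datetime, timezone

def to_epoch_seconds(year, month, day, hour=0, minute=0, second=0):
    """Convert a UTC calendar time to whole seconds since the Unix epoch."""
    moment = datetime(year, month, day, hour, minute, second, tzinfo=timezone.utc)
    return int(moment.timestamp())

# 2021-07-21 07:03:25 UTC is the cutoff used in the params above
cutoff = to_epoch_seconds(2021, 7, 21, 7, 3, 25)
print(cutoff)  # 1626851005
```

Writing the cutoff this way keeps the pull reproducible while making the date visible to the reader.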

In [4]:
# pull data from website and check status code
response = requests.get(url, params)
response.status_code

200
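
Checking the status code by eye works for a single pull. For repeated pulls, a small retry wrapper is one possible alternative. The sketch below is an assumption about how that could look, not the notebook's method: `get_json_with_retry` is a hypothetical helper, and the HTTP function is passed in so the logic can be tried without network access.

```python
import time

def get_json_with_retry(get, url, params, retries=3, wait_seconds=5, sleep=time.sleep):
    """Call get(url, params=params) until it returns HTTP 200, then return the parsed JSON.

    `get` is any callable with the same shape as requests.get (hypothetical injection point).
    """
    for attempt in range(1, retries + 1):
        response = get(url, params=params)
        if response.status_code == 200:
            return response.json()
        if attempt < retries:
            sleep(wait_seconds)  # wait before trying again
    raise RuntimeError(
        f"request failed after {retries} attempts (last status {response.status_code})"
    )
```

With `requests` imported as in this notebook, the call would be `get_json_with_retry(requests.get, url, params)`.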

In [5]:
# parse the JSON response into a dictionary
# extract the list of posts stored under the "data" key
# check that the number of posts pulled == 100
scraped_data = response.json()
posts = scraped_data["data"]
len(posts)

100

In [6]:
# create dataframe with key variables of interest needed for NLP 
df = pd.DataFrame(posts)
data = df[["subreddit", "created_utc", "title", "selftext"]]
data.head()

| | subreddit | created_utc | title | selftext |
|---|---|---|---|---|
| 0 | stocks | 1626851004 | Advise on Long Term Stock? | I am earning very little at the moment but I w... |
| 1 | stocks | 1626847571 | Earning plays | [removed] |
| 2 | stocks | 1626847468 | Going long vega? what is the maximum loss? | [removed] |
| 3 | stocks | 1626847423 | Dad told me to sell on Monday when the market ... | The stocks I chose were aapl, net, asts, icln,... |
| 4 | stocks | 1626847066 | Newbie trader. Dad told me to sell when the ma... | [removed] |


In [7]:
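# repeat the pull 50 more times, each time starting from the oldest post collected so far
for i in range(50):
    params = {
        "subreddit": "stocks",
        "size": 100,
        # use the last row's timestamp instead of a hard-coded row counter,
        # so the cutoff stays correct even if a pull returns fewer than 100 posts
        "before": int(data["created_utc"].iloc[-1])
    }
    response = requests.get(url, params)
    scraped_data = response.json()
    posts = scraped_data["data"]
    df = pd.DataFrame(posts)
    data_new = df[["subreddit", "created_utc", "title", "selftext"]]
    # DataFrame.append was removed in pandas 2.0; pd.concat does the same job
    data = pd.concat([data, data_new], ignore_index=True)
    time.sleep(5)  # pause between requests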

print(data.shape)

(5100, 4)
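
The loop above pages backwards in time by reusing the oldest timestamp seen so far as the next `before` cutoff. The sketch below isolates that idea in a small function that also stops early when a page comes back empty. `paginate_backwards` and `fake_fetch` are hypothetical names, and the in-memory `archive` stands in for the API so the example runs offline. It assumes the cutoff is exclusive, which means posts sharing the oldest post's exact second could be skipped.

```python
def paginate_backwards(fetch_page, start_before, max_pulls):
    """Collect posts page by page, moving the cutoff to the oldest post seen."""
    collected = []
    before = start_before
    for _ in range(max_pulls):
        posts = fetch_page(before)
        if not posts:  # nothing older is available, so stop early
            break
        collected.extend(posts)
        before = min(post["created_utc"] for post in posts)
    return collected

# stand-in for the API: ten posts, newest first
archive = [{"created_utc": t} for t in range(10, 0, -1)]

def fake_fetch(before, size=4):
    """Return up to `size` posts created strictly before the cutoff."""
    return [p for p in archive if p["created_utc"] < before][:size]

posts = paginate_backwards(fake_fetch, start_before=11, max_pulls=5)
print(len(posts))  # 10
```

Taking the minimum timestamp of each page, rather than indexing a fixed row, avoids relying on every page holding exactly 100 posts.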


## Preliminary data cleaning: drop removed, empty and duplicate posts

In [8]:
# keep only posts whose text is neither removed, missing (NaN) nor empty
is_removed = data["selftext"].str.contains(r"\[removed\]", na=False)
stocks = data[~is_removed
              & data["selftext"].notna()
              & (data["selftext"] != "")]
stocks.shape

(1795, 4)

In [9]:
# drop duplicated posts within the subreddit, keeping the last occurrence
# reassign instead of using inplace=True, which triggers SettingWithCopyWarning on a filtered dataframe
stocks = stocks.drop_duplicates(subset=["selftext"], keep="last")
stocks.shape


(1746, 4)
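
To make the two cleaning steps easier to check, here is a minimal sketch on a toy dataframe. The data is invented for illustration and only mirrors the column names used in this notebook.

```python
import pandas as pd

# invented rows covering each case: valid, removed, missing, empty, duplicate
toy = pd.DataFrame({
    "title": ["a", "b", "c", "d", "e"],
    "selftext": ["hello", "[removed]", None, "", "hello"],
})

text = toy["selftext"]
keep = text.notna() & (text != "") & ~text.str.contains(r"\[removed\]", na=False)

# rows "a" and "e" survive the filter; keep="last" then retains only "e"
cleaned = toy[keep].drop_duplicates(subset=["selftext"], keep="last")
print(cleaned["title"].tolist())  # ['e']
```

Passing `na=False` to `str.contains` makes missing values count as "not removed", so they are excluded by the explicit `notna()` check rather than by a comparison against `True`.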

In [22]:
# check the time range covered by the posts
# note: this runs before the trim below, so it covers all 1,746 posts
print("last post created: ", pd.to_datetime(stocks["created_utc"].iloc[0], unit='s'))
print("first post created: ", pd.to_datetime(stocks["created_utc"].iloc[-1], unit='s'))

last post created:  2021-07-21 07:03:24
first post created:  2021-06-22 20:17:38


In [23]:
# keep the first 1,700 posts so that the datasets from both subreddits are the same size
stocks = stocks.iloc[:1700]

In [26]:
# export the dataframe to CSV
stocks.to_csv("../data/extracted/stocks_text.csv", index=False)
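
`to_csv` raises an error if the target folder does not exist. A minimal sketch of a wrapper that creates the folder first is shown below; `export_csv` is a hypothetical helper, not something the notebook defines.

```python
from pathlib import Path

import pandas as pd

def export_csv(frame, path):
    """Write `frame` to `path` as CSV, creating parent folders if needed."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)  # no error if it already exists
    frame.to_csv(path, index=False)
    return path

# usage in this notebook would be, for example:
# export_csv(stocks, "../data/extracted/stocks_text.csv")
```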

## Web Scraping from r/CryptoCurrency

In [27]:
# set parameters to pull posts from the CryptoCurrency subreddit
# request 100 posts per pull, the maximum allowed
# fix the "before" timestamp so that rerunning this notebook pulls the same data
params = {
    "subreddit": "CryptoCurrency",
    "size": 100,
    "before": 1626851005
}

In [28]:
# pull data from website and check status code
response = requests.get(url, params)
response.status_code

200

In [29]:
# parse the JSON response into a dictionary
# extract the list of posts stored under the "data" key
# check that the number of posts pulled == 100
scraped_data = response.json()
posts = scraped_data["data"]
len(posts)

100

In [30]:
# create dataframe with key variables of interest needed for NLP 
df = pd.DataFrame(posts)
data = df[["subreddit", "created_utc", "title", "selftext"]]
data.head()

| | subreddit | created_utc | title | selftext |
|---|---|---|---|---|
| 0 | CryptoCurrency | 1626850949 | Should I create a gymkhana with all my cryptoc... | Hello everyone, yesterday I was thinking about... |
| 1 | CryptoCurrency | 1626850702 | A country’s ban on crypto is only valid if you... | So I don’t understand why countries can ban cr... |
| 2 | CryptoCurrency | 1626850642 | I was already convinced. Fibonacci golden rati... | Listen, nothing you read on the internet is fi... |
| 3 | CryptoCurrency | 1626850580 | *According to Research From Fidelity * - 71% o... | |
| 4 | CryptoCurrency | 1626850543 | Illegal Crypto Miners in Ukraine Found Manipul... | |


In [31]:
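# repeat the pull 50 more times, each time starting from the oldest post collected so far
for i in range(50):
    params = {
        "subreddit": "CryptoCurrency",
        "size": 100,
        # use the last row's timestamp instead of a hard-coded row counter,
        # so the cutoff stays correct even if a pull returns fewer than 100 posts
        "before": int(data["created_utc"].iloc[-1])
    }
    response = requests.get(url, params)
    scraped_data = response.json()
    posts = scraped_data["data"]
    df = pd.DataFrame(posts)
    data_new = df[["subreddit", "created_utc", "title", "selftext"]]
    # DataFrame.append was removed in pandas 2.0; pd.concat does the same job
    data = pd.concat([data, data_new], ignore_index=True)
    time.sleep(5)  # pause between requests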

print(data.shape)

(5100, 4)


## Preliminary data cleaning: drop removed, empty and duplicate posts

In [32]:
# keep only posts whose text is neither removed, missing (NaN) nor empty
is_removed = data["selftext"].str.contains(r"\[removed\]", na=False)
crypto = data[~is_removed
              & data["selftext"].notna()
              & (data["selftext"] != "")]

crypto.shape

(2044, 4)

In [33]:
# drop duplicated posts within the subreddit, keeping the last occurrence
# reassign instead of using inplace=True, which triggers SettingWithCopyWarning on a filtered dataframe
crypto = crypto.drop_duplicates(subset=["selftext"], keep="last")
crypto.shape


(2006, 4)

In [34]:
# keep the first 1,700 posts so that the datasets from both subreddits are the same size
crypto = crypto.iloc[:1700]

In [35]:
# check the time range covered by the 1,700 retained posts
print("last post created: ", pd.to_datetime(crypto["created_utc"].iloc[0], unit='s'))
print("first post created: ", pd.to_datetime(crypto["created_utc"].iloc[-1], unit='s'))

last post created:  2021-07-21 07:02:29
first post created:  2021-07-18 15:17:13
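
The two printed ranges differ a lot in length. As a small worked example, the sketch below computes both spans from the timestamps printed in this notebook. Note that the r/stocks range was printed before trimming to 1,700 posts, while the r/CryptoCurrency range was printed after. The `span` helper is illustrative only.

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M:%S"

def span(earliest, latest):
    """Return the time elapsed between two timestamp strings."""
    return datetime.strptime(latest, FMT) - datetime.strptime(earliest, FMT)

stocks_span = span("2021-06-22 20:17:38", "2021-07-21 07:03:24")
crypto_span = span("2021-07-18 15:17:13", "2021-07-21 07:02:29")

print(stocks_span)  # 28 days, 10:45:46
print(crypto_span)  # 2 days, 15:45:16
```

So a similar number of usable posts covers roughly four weeks on r/stocks but under three days on r/CryptoCurrency.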


In [36]:
# export the dataframe to CSV
crypto.to_csv("../data/extracted/crypto_text.csv", index=False)