# Scraping titles and self text from /r/Bitcoin

> The goal of this notebook is to scrape titles, self text, and dates from the /r/Bitcoin subreddit so that I can perform sentiment analysis on them.

---

## Imports
> Here I do my imports, most notably `PushshiftAPI` from the `psaw` package.

In [1]:
from psaw import PushshiftAPI
import pandas as pd
import datetime as dt
from datetime import datetime

api = PushshiftAPI()

## Setting epoch and filter

In [2]:
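# Earliest date to collect from (a naive datetime, so it is interpreted in local time)
start_epoch = int(dt.datetime(2013, 10, 12).timestamp())

# Pull every /r/Bitcoin submission created after start_epoch, keeping only the listed fields
results = list(api.search_submissions(after=start_epoch,
                                      subreddit='bitcoin',
                                      filter=['domain', 'author', 'title', 'selftext']))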
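
> Because `dt.datetime(2013, 10, 12)` carries no timezone, the resulting epoch depends on the machine's local clock. The sketch below shows one way to pin the start date to UTC instead; `start_utc` and `start_epoch_utc` are names introduced here for illustration and are not used elsewhere in the notebook.

```python
import datetime as dt

# Attach an explicit UTC timezone so the epoch is the same on every machine
start_utc = dt.datetime(2013, 10, 12, tzinfo=dt.timezone.utc)
start_epoch_utc = int(start_utc.timestamp())

print(start_epoch_utc)  # 1381536000
```

> Whether local or UTC midnight is the better cutoff depends on the analysis; the only point here is that the choice becomes explicit.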



---

## Getting everything into a neat dataframe

In [3]:
results[3].d_

{'author': 'cryptopigmedia',
 'created_utc': 1557764125,
 'domain': 'youtube.com',
 'selftext': '',
 'title': 'BITCOIN ADOPTION IS HERE!! New HTC Smartphone Will Run a Full Node',
 'created': 1557778525.0}

In [4]:
results[:3]

[submission(author='Garandhero', created_utc=1557764215, domain='self.Bitcoin', selftext='did coinbase just die?', title='Coinbase?', created=1557778615.0, d_={'author': 'Garandhero', 'created_utc': 1557764215, 'domain': 'self.Bitcoin', 'selftext': 'did coinbase just die?', 'title': 'Coinbase?', 'created': 1557778615.0}),
 submission(author='gnulligan', created_utc=1557764188, domain='self.Bitcoin', selftext="Hey guys,\n\n&amp;#x200B;\n\nI'm looking for the absolute cheapest way to buy/sell bitcoin (using fiat) and trade with other cryptos. \n\n&amp;#x200B;\n\nMost options for buying have huge fees attached. Same for trading. \n\n&amp;#x200B;\n\nSo, what's the cheapest (PREFERABLY FREE) way to buy, sell, and trade btc and other crypto?\n\n&amp;#x200B;\n\nBTW: I already tried Cashapp and it is literally impossible to withdraw your money.", title='Absolute lowest fee options for buy/sell/trade?', created=1557778588.0, d_={'author': 'gnulligan', 'created_utc': 1557764188, 'domain': 'self.

> After getting everything into a list of objects, I decided to make an original dataframe (og_df) so I could keep the original information that I will be stripping later for my main dataframe (df). This gives me ease of use without losing any information I may decide I want later on.

In [5]:
og_df = pd.DataFrame(results)

In [6]:
og_df.shape

(701252, 7)

In [7]:
og_df.domain.value_counts().head(10)

self.Bitcoin    325655
i.redd.it        33841
youtube.com      26740
twitter.com      15229
imgur.com        13495
i.imgur.com      12058
youtu.be          9870
coindesk.com      7580
medium.com        6675
reddit.com        6113
Name: domain, dtype: int64

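> Next I added a new column 'easy_time' with the `created_utc` epoch seconds converted to readable datetimes. Note that `datetime.fromtimestamp` returns the machine's local time rather than UTC.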

In [8]:
ts = 1557172785
dt_object = datetime.fromtimestamp(ts)

print(dt_object)

2019-05-06 15:59:45


In [9]:
og_df['easy_time'] = og_df.apply(lambda row: datetime.fromtimestamp(row['created_utc']), axis=1)
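
> A row-wise `apply` over roughly 700k rows is slow. As a hedged alternative, the sketch below converts the whole column at once with `pd.to_datetime`. It differs from the cell above in one respect: the result is timezone-aware UTC rather than local time. The `sample` frame and the `easy_time_utc` column are made up for illustration.

```python
import pandas as pd

# Two epoch values taken from this notebook, standing in for og_df['created_utc']
sample = pd.DataFrame({'created_utc': [1557172785, 1381550474]})

# Vectorised conversion: epoch seconds -> timezone-aware UTC timestamps
sample['easy_time_utc'] = pd.to_datetime(sample['created_utc'], unit='s', utc=True)

print(sample['easy_time_utc'].iloc[0])  # 2019-05-06 19:59:45+00:00
```

> The printed value is four hours ahead of the `fromtimestamp` output shown earlier, which is the local-time offset of the machine that produced that output.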

## Cleaning any selftext and combining it with titles into a single column

In [10]:
og_df.dtypes

author                 object
created_utc             int64
domain                 object
selftext               object
title                  object
created                object
d_                     object
easy_time      datetime64[ns]
dtype: object

In [11]:
og_df['all_text'] = og_df['selftext'].map(str) + og_df['title'].map(str)
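
> Plain `+` joins the two strings with nothing between them, and `map(str)` turns a missing value into the literal text 'None' or 'nan'. The sketch below is one possible variant that inserts a space and treats missing values as empty strings. The `sample` frame is hypothetical, built from values seen earlier in this notebook.

```python
import pandas as pd

# Hypothetical sample: one post with self text, one with a missing value, one with an empty string
sample = pd.DataFrame({
    'selftext': ['did coinbase just die?', None, ''],
    'title': ['Coinbase?', 'free bitcoin in 30minutes....', 'BITCOIN ADOPTION IS HERE!!'],
})

# Replace missing values with '' so they do not become the text 'None'
parts = sample[['selftext', 'title']].fillna('').astype(str)

# Join with a single space, then strip the leading space left by empty self text
sample['all_text'] = (parts['selftext'] + ' ' + parts['title']).str.strip()

print(sample['all_text'].tolist())
```

> Keeping a separator matters for sentiment analysis because tokenizers would otherwise see the last word of the self text fused with the first word of the title.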

In [13]:
og_df.tail(2)

| | author | created_utc | domain | selftext | title | created | d_ | easy_time | all_text |
|---|---|---|---|---|---|---|---|---|---|
| 701250 | thetoptier | 1381550877 | tradersnetwork.biz | Why the Demise of Silk Road Means Bitcoins Are... | 1381570000.0 | {'author': 'thetoptier', 'created_utc': 138155... | | 2013-10-12 00:07:57 | Why the Demise of Silk Road Means Bitcoins Are... |
| 701251 | j6wj564jhu | 1381550474 | 2-ip.com | free bitcoin in 30minutes.... | 1381560000.0 | {'author': 'j6wj564jhu', 'created_utc': 138155... | | 2013-10-12 00:01:14 | free bitcoin in 30minutes....1381564874.0 |
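
> In the tail output above, the title text sits in the `selftext` column, the `created` float sits in `title`, and `all_text` ends in a number. One plausible explanation, stated here as an assumption rather than confirmed psaw behaviour, is that these older submissions carry no `selftext` field, so their tuples are one element shorter and `pd.DataFrame(results)` fills the columns by position. Building the frame from the per-record dicts (the `d_` attribute shown earlier) aligns values by key instead. The sketch below uses hand-written stand-in dicts; `records` and `aligned_df` are illustrative names.

```python
import pandas as pd

# Hypothetical stand-ins for [r.d_ for r in results]; the second record has no 'selftext' key
records = [
    {'author': 'Garandhero', 'created_utc': 1557764215, 'domain': 'self.Bitcoin',
     'selftext': 'did coinbase just die?', 'title': 'Coinbase?'},
    {'author': 'j6wj564jhu', 'created_utc': 1381550474, 'domain': '2-ip.com',
     'title': 'free bitcoin in 30minutes....'},
]

# A list of dicts is aligned by key, so a missing field becomes NaN instead of shifting columns
aligned_df = pd.DataFrame(records)

print(aligned_df[['selftext', 'title']])
```

> With the real data this would correspond to something like `pd.DataFrame([r.d_ for r in results])`, after which the title stays in `title` and the missing self text shows up as NaN.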


In [14]:
df = og_df[['all_text', 'easy_time']]

In [15]:
df.shape

(701252, 2)

In [16]:
df.all_text[4]

'😂🤣😂🤣🤑🤑'

## Saving to a CSV
> I've now got a solid scrape of the /r/Bitcoin subreddit and am ready to move on to sentiment analysis.

In [17]:
# df.to_csv('/Users/zoenawar/DSI/RNN_LSTM_Cryptocurrency_Project/datasets/rbitcoinscrape')
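
> The 'Unnamed: 0' column that appears when a saved frame is read back comes from the index being written as an extra, unnamed column. The sketch below shows a round trip that avoids it with `index=False` and restores the datetime dtype with `parse_dates`. The temporary path and the `sample_df` frame are placeholders for illustration, not the project's real output location.

```python
import os
import tempfile

import pandas as pd

# Small stand-in for df with the same two columns
sample_df = pd.DataFrame({
    'all_text': ['Coinbase?', '😂🤣😂🤣🤑🤑'],
    'easy_time': pd.to_datetime(['2019-05-13 12:16:55', '2013-10-12 00:01:14']),
})

# Write to a throwaway location; index=False keeps the row index out of the file
out_path = os.path.join(tempfile.mkdtemp(), 'rbitcoinscrape_sample.csv')
sample_df.to_csv(out_path, index=False)

# Read it back, asking pandas to parse easy_time as datetimes again
reloaded = pd.read_csv(out_path, parse_dates=['easy_time'])

print(reloaded.dtypes)
```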