# Scraping titles and self text from /r/Binance using the Pushshift API

> The goal of this notebook is to neatly scrape titles, self text, and dates from the /r/Binance subreddit so that I can perform sentiment analysis on them.

---

## Imports
> Here I do my imports, most notably `PushshiftAPI` from `psaw`.

In [1]:
from psaw import PushshiftAPI
import pandas as pd
import datetime as dt
from datetime import datetime

api = PushshiftAPI()

## Setting epoch and filter

In [2]:
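# Earliest submission date to fetch, as Unix epoch seconds
start_epoch = int(dt.datetime(2017, 7, 24).timestamp())

# Fetch every /r/binance submission after start_epoch, keeping only the listed fields
results = list(api.search_submissions(after=start_epoch,
                                      subreddit='binance',
                                      filter=['domain', 'author', 'title', 'selftext']))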
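
> One caveat on the epoch: calling `.timestamp()` on a naive `datetime` interprets it in the machine's local timezone, so the cutoff shifts by a few hours depending on where the notebook runs. Below is a minimal sketch of a timezone-explicit version, assuming a midnight UTC cutoff is what is wanted. The names `start_utc` and `start_epoch_utc` are illustrative and not part of the notebook.

```python
import datetime as dt

# Assumption: the cutoff is meant to be midnight UTC on 2017-07-24.
start_utc = dt.datetime(2017, 7, 24, tzinfo=dt.timezone.utc)

# An aware datetime gives the same epoch on every machine.
start_epoch_utc = int(start_utc.timestamp())

print(start_epoch_utc)  # 1500854400
```

> The resulting integer could be passed as `after=` in place of `start_epoch`.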

---

## Getting everything into a neat dataframe

In [3]:
results[3].d_

{'author': 'sulli3ms',
 'created_utc': 1557937622,
 'domain': 'i.redd.it',
 'selftext': '',
 'title': 'Withdrawals still frozen? How much longer?',
 'created': 1557952022.0}

In [4]:
results[:3]

[submission(author='Dellz2k19', created_utc=1557940373, domain='self.binance', selftext='[removed]', title='Binance Referral Links', created=1557954773.0, d_={'author': 'Dellz2k19', 'created_utc': 1557940373, 'domain': 'self.binance', 'selftext': '[removed]', 'title': 'Binance Referral Links', 'created': 1557954773.0}),
 submission(author='coinsmash1', created_utc=1557939796, domain='coinrivet.com', selftext='', title='Crypto trading volume hits $100 billion on CoinMarketCap.com', created=1557954196.0, d_={'author': 'coinsmash1', 'created_utc': 1557939796, 'domain': 'coinrivet.com', 'selftext': '', 'title': 'Crypto trading volume hits $100 billion on CoinMarketCap.com', 'created': 1557954196.0}),
 submission(author='acauseforconcern', created_utc=1557939719, domain='self.binance', selftext='', title='13 hours since the update started and still no withdrawls, this is Unacceptable.', created=1557954119.0, d_={'author': 'acauseforconcern', 'created_utc': 1557939719, 'domain': 'self.binanc

> After getting everything into a list of objects, I decided to make an original dataframe (`og_df`) so I could keep the original information that I will be stripping later for my main dataframe (`df`). This keeps the main dataframe easy to use without losing any information I may decide I want later on.

In [5]:
og_df = pd.DataFrame(results)

In [6]:
og_df.shape

(21705, 7)
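
> Because `pd.DataFrame(results)` keeps the `d_` dict as its own column, every field ends up stored twice. Below is a minimal sketch of one alternative: building the frame from the `d_` dicts directly. The `Submission` namedtuple and the two sample rows are hypothetical stand-ins for real psaw results, trimmed to three fields. They are not actual library objects.

```python
from collections import namedtuple

import pandas as pd

# Hypothetical stand-in for psaw's result objects: a namedtuple whose d_
# field repeats the other fields as a dict, as in the output above.
Submission = namedtuple("Submission", ["author", "created_utc", "title", "d_"])

rows = [
    {"author": "Dellz2k19", "created_utc": 1557940373,
     "title": "Binance Referral Links"},
    {"author": "coinsmash1", "created_utc": 1557939796,
     "title": "Crypto trading volume hits $100 billion on CoinMarketCap.com"},
]
results = [Submission(d_=row, **row) for row in rows]

full_df = pd.DataFrame(results)                  # keeps the redundant d_ column
slim_df = pd.DataFrame([r.d_ for r in results])  # one column per field, no d_

print(full_df.shape, slim_df.shape)  # (2, 4) (2, 3)
```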

In [7]:
og_df.domain.value_counts().head(10)

self.binance             16367
i.redd.it                  742
binance.com                381
youtube.com                338
support.binance.com        231
youtu.be                   177
medium.com                 159
twitter.com                150
globalcryptopress.com       93
binance.zendesk.com         84
Name: domain, dtype: int64

> Next I added a new column, `easy_time`, with the `created_utc` values converted from Unix epoch seconds to readable datetimes.

In [8]:
ts = 1557172785
dt_object = datetime.fromtimestamp(ts)

print(dt_object)

2019-05-06 15:59:45


In [9]:
og_df['easy_time'] = og_df['created_utc'].apply(datetime.fromtimestamp)
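
> Note that `datetime.fromtimestamp` without a `tz` argument returns local time, so the `easy_time` values depend on the timezone of the machine running the notebook. Below is a minimal sketch of a UTC-aware variant, assuming UTC datetimes are acceptable for the later analysis. The helper name `to_utc_datetime` is illustrative.

```python
from datetime import datetime, timezone


def to_utc_datetime(epoch_seconds):
    """Convert Unix epoch seconds to a timezone-aware UTC datetime."""
    return datetime.fromtimestamp(epoch_seconds, tz=timezone.utc)


ts = 1557172785
print(to_utc_datetime(ts))  # 2019-05-06 19:59:45+00:00
```

> The same helper could be passed to `og_df['created_utc'].apply(...)`. With pandas, `pd.to_datetime(og_df['created_utc'], unit='s', utc=True)` is a vectorized way to get the same result.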

## Cleaning any selftext and combining it with titles into a single column

In [10]:
og_df.dtypes

author                 object
created_utc             int64
domain                 object
selftext               object
title                  object
created                object
d_                     object
easy_time      datetime64[ns]
dtype: object

In [11]:
og_df['all_text'] = og_df['selftext'].map(str) + og_df['title'].map(str)

In [12]:
og_df.tail(2)

| | author | created_utc | domain | selftext | title | created | d_ | easy_time | all_text |
|---|---|---|---|---|---|---|---|---|---|
| 21703 | Binance | 1500894604 | twitter.com | | All the trading fee is free on Binance.com wit... | 1.50091e+09 | {'author': 'Binance', 'created_utc': 150089460... | 2017-07-24 07:10:04 | All the trading fee is free on Binance.com wit... |
| 21704 | [deleted] | 1500891655 | self.binance | [deleted] | The trading fee is free on binance.com within ... | 1.50091e+09 | {'author': '[deleted]', 'created_utc': 1500891... | 2017-07-24 06:20:55 | [deleted]The trading fee is free on binance.co... |


In [14]:
df = og_df[['all_text', 'easy_time']]

In [15]:
df.shape

(21705, 2)

In [16]:
df.all_text[4]

'Is there any App for iPad iOS ?Is there any App for iPad iOS ?'
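
> The outputs above show two side effects of plain concatenation: there is no separator between the two fields, and placeholder bodies such as `[deleted]` end up inside `all_text`. Below is a minimal sketch of a stricter combiner. It rests on assumptions the notebook does not state: placeholder bodies and bodies that merely repeat the title can be dropped, and the title can come first. `PLACEHOLDERS` and `combine_text` are illustrative names.

```python
# Assumption: these bodies carry no usable text for sentiment analysis.
PLACEHOLDERS = {"", "[removed]", "[deleted]"}


def combine_text(title, selftext):
    """Join title and body with a space, skipping empty, placeholder,
    or title-duplicating bodies."""
    title = str(title).strip()
    body = str(selftext).strip()
    if body in PLACEHOLDERS or body == title:
        return title
    return f"{title} {body}"


print(combine_text("Is there any App for iPad iOS ?",
                   "Is there any App for iPad iOS ?"))
# Is there any App for iPad iOS ?
print(combine_text("Binance Referral Links", "[removed]"))
# Binance Referral Links
```

> In the notebook this could be applied row by row, for example `og_df.apply(lambda row: combine_text(row['title'], row['selftext']), axis=1)`.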

## Saving to a CSV
> For now, I'm just going to save the combined text and dates to a CSV. I will be reading this into my next notebook: sentiment_analysis_testing.

In [18]:
df.to_csv('/Users/zoenawar/DSI/RNN_LSTM_Cryptocurrency_Project/datasets/rbinancescrape.csv')
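
> By default `to_csv` also writes the row index as an unnamed first column, which comes back as `Unnamed: 0` when the file is read, and the dates come back as plain strings unless they are parsed. Below is a minimal sketch of a round trip that avoids both, using an in-memory buffer in place of the real file. The single sample row is hypothetical.

```python
import io

import pandas as pd

# Hypothetical one-row stand-in for the notebook's df.
df = pd.DataFrame({
    "all_text": ["Binance Referral Links"],
    "easy_time": pd.to_datetime(["2019-05-15 13:12:53"]),
})

buffer = io.StringIO()          # in-memory stand-in for the CSV file on disk
df.to_csv(buffer, index=False)  # index=False skips the row-index column
buffer.seek(0)

# parse_dates turns the easy_time strings back into datetimes on load.
restored = pd.read_csv(buffer, parse_dates=["easy_time"])

print(list(restored.columns))
print(restored.dtypes["easy_time"])
```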