# Data Access Notebook

In [20]:
import pandas as pd
import datetime as dt
import dask

## Project Gutenberg

Visit the [homepage](https://www.gutenberg.org/) for Project Gutenberg if you are having trouble finding a specific book.

Usage documentation for the Python package can be found [here](https://pypi.org/project/Gutenberg/).

In [40]:
# See https://www.gutenberg.org/MIRRORS.ALL for a list of mirrors if this one becomes unavailable.
gutenberg_mirror = 'https://gutenberg.pglaf.org/'

from gutenberg.acquire import load_etext
from gutenberg.query import get_metadata
from gutenberg.cleanup import strip_headers
text = strip_headers(load_etext(11, mirror=gutenberg_mirror)).strip()
print(text[:1000])

[Illustration]




Alice’s Adventures in Wonderland

by Lewis Carroll

THE MILLENNIUM FULCRUM EDITION 3.0

Contents

 CHAPTER I.     Down the Rabbit-Hole
 CHAPTER II.    The Pool of Tears
 CHAPTER III.   A Caucus-Race and a Long Tale
 CHAPTER IV.    The Rabbit Sends in a Little Bill
 CHAPTER V.     Advice from a Caterpillar
 CHAPTER VI.    Pig and Pepper
 CHAPTER VII.   A Mad Tea-Party
 CHAPTER VIII.  The Queen’s Croquet-Ground
 CHAPTER IX.    The Mock Turtle’s Story
 CHAPTER X.     The Lobster Quadrille
 CHAPTER XI.    Who Stole the Tarts?
 CHAPTER XII.   Alice’s Evidence




CHAPTER I.
Down the Rabbit-Hole


Alice was beginning to get very tired of sitting by her sister on the
bank, and of having nothing to do: once or twice she had peeped into
the book her sister was reading, but it had no pictures or
conversations in it, “and what is the use of a book,” thought Alice
“without pictures or conversations?”

So she was considering in her own mind (as well as she could, for the
hot day 
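
The comment in the cell above points to a list of alternative mirrors in case the chosen one becomes unavailable. A minimal sketch of how that fallback could be automated is shown here. The helper `load_with_fallback` and the stand-in `fake_loader` are hypothetical names, not part of the Gutenberg package, and the first URL is a made-up unreachable mirror used only for illustration.

```python
def load_with_fallback(loader, mirrors):
    """Try each mirror in order; return (mirror, result) from the first that works."""
    errors = []
    for mirror in mirrors:
        try:
            return mirror, loader(mirror)
        except Exception as exc:  # broad on purpose: the loader's error types are unknown
            errors.append(f"{mirror}: {exc}")
    raise RuntimeError("all mirrors failed:\n" + "\n".join(errors))


# Stand-in loader so the sketch runs offline. In the notebook this could be
#   lambda m: strip_headers(load_etext(11, mirror=m)).strip()
def fake_loader(mirror):
    if "bad" in mirror:
        raise ConnectionError("mirror unreachable")
    return f"text fetched from {mirror}"


mirror_used, text = load_with_fallback(
    fake_loader,
    ["https://bad.example.org/", "https://gutenberg.pglaf.org/"],
)
print(mirror_used)
```

Passing the loader in as a function keeps the retry logic separate from the download call, so the same helper can be exercised without network access.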

### Simple Example

In [38]:
books = [
    [1342,"Pride and Prejudice","Jane Austen"],
    [11,"Alice's Adventures in Wonderland", "Lewis Carroll"],
    [2701,"Moby Dick; Or, The Whale","Herman Melville"],
    [84,"Frankenstein; Or, The Modern Prometheus", "Mary Wollstonecraft Shelley" ],
    [345,"Dracula", "Bram Stoker"]
]

In [41]:
gutenDF = pd.DataFrame(books, columns=['ID','Title','Author'])
gutenDF['FullText']=gutenDF.apply(lambda row: strip_headers(load_etext(row['ID'], mirror=gutenberg_mirror)).strip() , axis=1)
gutenDF

Unnamed: 0,ID,Title,Author,FullText
0,1342,Pride and Prejudice,Jane Austen,THERE IS AN ILLUSTRATED EDITION OF THIS TITLE ...
1,11,Alice's Adventures in Wonderland,Lewis Carroll,[Illustration]\n\n\n\n\nAlice’s Adventures in ...
2,2701,"Moby Dick; Or, The Whale",Herman Melville,"MOBY-DICK;\n\nor, THE WHALE.\n\nBy Herman Melv..."
3,84,"Frankenstein; Or, The Modern Prometheus",Mary Wollenstonecraft Shelley,and David Meltzer. HTML version by Al Haines.\...
4,345,Dracula,Bram Stoker,DRACULA\n\n\n\n\n\n ...


In [42]:
gutenDF[gutenDF['Title'].str.contains("W")]['FullText']

1    [Illustration]\n\n\n\n\nAlice’s Adventures in ...
2    MOBY-DICK;\n\nor, THE WHALE.\n\nBy Herman Melv...
Name: FullText, dtype: object

### MetaData and Caching

If you plan on doing a lot of work with Project Gutenberg's metadata functionality, you'll need to populate a local metadata cache first. Populating the cache can take a very long time, but once it is done, metadata queries are fast.

In [43]:
# from gutenberg.acquire import get_metadata_cache
# cache = get_metadata_cache()
# cache.populate()

In [None]:
# from gutenberg.query import get_etexts
# from gutenberg.query import get_metadata

# print(get_metadata('title', 2701))  # prints frozenset([u'Moby Dick; Or, The Whale'])
# print(get_metadata('author', 2701)) # prints frozenset([u'Melville, Hermann'])

# print(get_etexts('title', 'Moby Dick; Or, The Whale'))  # prints frozenset([2701, ...])
# print(get_etexts('author', 'Melville, Hermann'))        # prints frozenset([2701, ...])

## EDGAR Database

Here is the [homepage](https://www.sec.gov/edgar.shtml) for the Securities and Exchange Commission's EDGAR database. If you are having trouble finding a specific company, try their [full text search](https://www.sec.gov/edgar/search/#).

Usage documentation for the Python package can be found [here](https://pypi.org/project/edgar/).

Pull the last 5 10-K reports for the Oracle Corporation

In [44]:
from edgar import Company, TXTML
company = Company("Oracle Corp", "0001341439")
tree = company.get_all_filings(filing_type = "10-K")
docs = Company.get_documents(tree, no_of_documents=5)

Parse the most recent 10-K filing for IBM

In [45]:
company = Company("INTERNATIONAL BUSINESS MACHINES CORP", "0000051143")
doc = company.get_10K()
text = TXTML.parse_full_10K(doc)

Search EDGAR for companies whose names match "Cisco System"

In [46]:
from edgar import Edgar
edgar = Edgar()
possible_companies = edgar.find_company_name("Cisco System")
possible_companies

['CISCO SYSTEMS (SWITZERLAND) INVESTMENTS LTD',
 'CISCO SYSTEMS CAPITAL CORP',
 'CISCO SYSTEMS INC',
 'CISCO SYSTEMS INTERNATIONAL B.V.',
 'CISCO SYSTEMS, INC.',
 'L3TV SAN FRANCISCO CABLE SYSTEM, LLC',
 'SPANISH BROADCASTING SYSTEM SAN FRANCISCO INC']

### Simple Example

In [49]:
companies = [
    Company('AMAZON COM INC','0001018724'),
    Company('Alphabet Inc.','0001652044'),
    Company('MICROSOFT CORP','0000789019'),
]

In [53]:
entries = []

for company in companies:
    
    text = company.get_10K()
    if text is not None:
        text = TXTML.parse_full_10K(text)
    else:
        text = ''
    
    entries.append({
        'Company': company.name,
        'CIK': company.cik,
        'MostRecent_10K': text,
    })

edgarDF = pd.DataFrame(entries)
display(edgarDF)

Unnamed: 0,Company,CIK,MostRecent_10K
0,AMAZON COM INC,1018724,\n\n\n\n\n\n\n\n\t\n\t\t\n\t\tDocument\n\t\n\t...
1,Alphabet Inc.,1652044,\n\n\n\n\n\n\n\n\t\n\t\t\n\t\tDocument\n\t\n\t...
2,MICROSOFT CORP,789019,\n\n\n\n\n\n\n\n\n\n\nmsft-10k_20200630.htm\n\...
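
The cells above pass each CIK as a 10-character, zero-padded string, while the rendered table shows the same values without leading zeros. If your CIKs come from a spreadsheet or another table, they may arrive in the shorter form. The sketch below converts either form to the padded one; `normalize_cik` is a hypothetical helper (not part of the edgar package), and it assumes the padded form is what should be passed, as the cells above suggest.

```python
def normalize_cik(cik):
    """Return a CIK as a 10-character, zero-padded string."""
    digits = str(cik).strip()
    if not digits.isdigit() or len(digits) > 10:
        raise ValueError(f"not a valid CIK: {cik!r}")
    return digits.zfill(10)


print(normalize_cik(1018724))        # 0001018724
print(normalize_cik("0000789019"))   # 0000789019
```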


#### Last 5 10Ks with Filing Dates

In [68]:
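def get_edgar(companies, n):
    """Return a dataframe with up to n of the most recent 10-K filings per company."""
    filinglist = []
    for company in companies:
        tree = company.get_all_filings(filing_type="10-K")
        # Fetch the filings twice: as document objects (for the filing date) and as raw content (for the text)
        docs = company.get_documents(tree, no_of_documents=n, as_documents=True)
        texts = company.get_documents(tree, no_of_documents=n, as_documents=False)
        # Wrap a single result in a list so both cases are handled the same way
        if not isinstance(docs, list):
            docs = [docs]
        if not isinstance(texts, list):
            texts = [texts]

        # zip stops at the shorter list, so a company with fewer than n filings does not raise an IndexError
        for doc, raw_text in zip(docs, texts):
            date = doc.content['Filing Date']
            text = TXTML.parse_full_10K(raw_text)
            filinglist.append([company.name, company.cik, date, text])
    df = pd.DataFrame(filinglist, columns=['Company','CIK','Filing_Date', '10K_Text'])
    return df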

In [69]:
get_edgar(companies, 5)

Unnamed: 0,Company,CIK,Filing_Date,10K_Text
0,AMAZON COM INC,1018724,2020-01-31,\n\n\n\n\n\n\n\n\t\n\t\t\n\t\tDocument\n\t\n\t...
1,AMAZON COM INC,1018724,2019-02-01,\n\n\n\t\n\t\t\n\t\t\n\t\tDocument\n\t\n\t\nTa...
2,AMAZON COM INC,1018724,2018-02-02,\n\n\n\t\n\t\t\n\t\t\n\t\tDocument\n\t\n\t\n U...
3,AMAZON COM INC,1018724,2017-02-10,\n\n\n\t\n\t\t\n\t\t\n\t\tDocument\n\t\n\t\n U...
4,AMAZON COM INC,1018724,2016-01-29,\n\n\n\t\n\t\t\n\t\t\n\t\t10-K\n\t\n\t\n UNITE...
5,Alphabet Inc.,1652044,2020-02-04,\n\n\n\n\n\n\n\n\t\n\t\t\n\t\tDocument\n\t\n\t...
6,Alphabet Inc.,1652044,2019-02-06,\n\nAmendment No. 1 to Form 10-K\n\n \n\n \...
7,Alphabet Inc.,1652044,2019-02-05,\n\n\n\t\n\t\t\n\t\t\n\t\tDocument\n\t\n\t\nUN...
8,Alphabet Inc.,1652044,2018-02-06,\n\n\n\t\n\t\t\n\t\t\n\t\tDocument\n\t\n\t\nUN...
9,Alphabet Inc.,1652044,2017-02-03,\n\n\n\t\n\t\t\n\t\t\n\t\tDocument\n\t\n\t\nUN...


## Hansard

To use the Hansard data, you will need to request significantly more memory for your JupyterLab session. We suggest increasing it from 6GB to 64GB. This may mean a longer wait for your job to launch, but it lets you hold all of the data in a single in-memory dataframe.

In [None]:
hansard = pd.read_parquet("/scratch/group/oit_research_data/hansard/hansard_20191119.parquet")
hansard.head(1)

You can also use the Dask dataframe library instead of pandas. The tradeoff is that Dask uses less memory and lends itself to parallelization better than pandas, but it is slower. When in doubt, default to pandas.

In [None]:
from dask import dataframe as dd
hansard = dd.read_parquet("/scratch/group/oit_research_data/hansard/hansard_20191119.parquet")
hansard.head()

### Simple Example

Filtering Hansard to study specific topics

In [None]:
# Convert the speechdate column to datetime objects - forcing any values that cannot be parsed to NaT
hansard['speechdate']=pd.to_datetime(hansard['speechdate'], errors='coerce')

In [None]:
# Filter for only speechdates before 1900
hansard1800s = hansard[hansard['speechdate']<dt.datetime(1900,1,1)]

In [None]:
# Filter for speakers named Gladstone and text mentioning Dublin
# na=False treats rows with a missing speaker or text as non-matches
hansard1800s[(hansard1800s['speaker'].str.contains('Gladstone', na=False))&(hansard1800s['text'].str.contains('Dublin', na=False))]
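
The filter above matches one literal term per column. When studying a topic with several keywords, it can help to combine them into a single regular expression. The sketch below uses only the standard library so that it runs without the Hansard data; `any_term_pattern` and the sample sentences are hypothetical and serve only as illustration.

```python
import re


def any_term_pattern(terms):
    """Build one case-insensitive regex that matches any of the given literal terms."""
    escaped = (re.escape(term) for term in terms)
    return re.compile("|".join(escaped), flags=re.IGNORECASE)


pattern = any_term_pattern(["Dublin", "Belfast", "Co. Cork"])

sentences = [
    "The state of DUBLIN harbour was raised.",
    "A question on the Navy estimates.",
    "Reports from Co. Cork were read.",
]
matches = [s for s in sentences if pattern.search(s)]
print(matches)
```

The resulting pattern string (`pattern.pattern`) can then be passed to `str.contains` together with `case=False`. Escaping each term with `re.escape` matters because `str.contains` treats its argument as a regular expression by default, so a term like "Co. Cork" would otherwise let the dot match any character.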

## US Congress

The data was pulled with a shell script (congress_download.sh) from [Stanford's Congressional Record](https://data.stanford.edu/congress_text) and is available on M2. If you use this data, please use the proper citation for the dataset.

In [40]:
import pandas as pd
import csv

congress_folder = '/scratch/group/oit_research_data/stanford_congress/hein-bound'


# Modify these two variables to change the interval (inclusive) of congressional sessions that are loaded.
SESSION_START = 91
SESSION_END = 93

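def read_congress_csv(filename):
    # Files are pipe-delimited with no quoting; malformed lines are skipped silently.
    # error_bad_lines/warn_bad_lines were removed in pandas 2.0; on newer versions use on_bad_lines='skip' instead.
    return pd.read_csv(filename, sep='|', encoding="ISO-8859-1", error_bad_lines=False, warn_bad_lines=False, quoting=csv.QUOTE_NONE)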


list_dfs = []

for session in range(SESSION_START, SESSION_END + 1):
    speech_file = f'{congress_folder}/speeches_{session:03}.txt'
    list_dfs.append(read_congress_csv(speech_file))

speeches_df = pd.concat(list_dfs)
del list_dfs[:]
display(speeches_df)

Unnamed: 0,speech_id,speech
0,910000001,Mr. President. it is with deep regret that I a...
1,910000002,The Chair lays before the Senate the following...
2,910000003,(Edward E. Mansur) read as follows:
3,910000004,The Chair lays before the Senate the credentia...
4,910000005,The Senators to be sworn in will now present t...
...,...,...
357286,930357292,Mr. President. I am honored to be able to join...
357287,930357293,Mr. President. ALAN BIBLES decision to retire ...
357288,930357294,Mr. President. I am privileged to be able to j...
357289,930357295,Mr. President. when the 93d Congress adjourns ...


In [41]:
# Load all the metadata for each congressional session.
list_dfs = []

for session in range(SESSION_START, SESSION_END + 1):
    metadata_file = f'{congress_folder}/descr_{session:03}.txt'
    list_dfs.append(read_congress_csv(metadata_file))

metadata_df = pd.concat(list_dfs)
del list_dfs[:]

display(metadata_df)

Unnamed: 0,speech_id,chamber,date,number_within_file,speaker,first_name,last_name,state,gender,line_start,line_end,file,char_count,word_count
0,910000001,S,19690103,1,Mr. MANSFIELD,Unknown,MANSFIELD,Unknown,M,62,72,01031969.txt,320,57
1,910000002,S,19690103,2,The VICE PRESIDENT,Unknown,Unknown,Unknown,Special,75,84,01031969.txt,315,50
2,910000003,S,19690103,3,The legislative clerk,Unknown,Unknown,Unknown,Special,85,86,01031969.txt,35,6
3,910000004,S,19690103,4,The VICE PRESIDENT,Unknown,Unknown,Unknown,Special,258,271,01031969.txt,443,78
4,910000005,S,19690103,5,The VICE PRESIDENT,Unknown,Unknown,Unknown,Special,1015,1021,01031969.txt,213,39
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
357291,930357292,E,19741220,2670,Mr. FULBRIGHT,Unknown,FULBRIGHT,Unknown,M,163296,163356,12201974.txt,1989,340
357292,930357293,E,19741220,2671,Mr. FULBRIGHT,Unknown,FULBRIGHT,Unknown,M,163358,163373,12201974.txt,546,95
357293,930357294,E,19741220,2672,Mr. FULBRIGHT,Unknown,FULBRIGHT,Unknown,M,163404,163430,12201974.txt,971,170
357294,930357295,E,19741220,2673,Mr. FULBRIGHT,Unknown,FULBRIGHT,Unknown,M,163432,163471,12201974.txt,1273,216


In [10]:
congress_df = pd.merge(speeches_df, metadata_df, on = 'speech_id')
congress_df.fillna(0, inplace=True)
display(congress_df)

Unnamed: 0,speech_id,speech,chamber,date,number_within_file,speaker,first_name,last_name,state,gender,line_start,line_end,file,char_count,word_count
0,910000001,Mr. President. it is with deep regret that I a...,S,19690103,1,Mr. MANSFIELD,Unknown,MANSFIELD,Unknown,M,62,72,01031969.txt,320,57
1,910000002,The Chair lays before the Senate the following...,S,19690103,2,The VICE PRESIDENT,Unknown,Unknown,Unknown,Special,75,84,01031969.txt,315,50
2,910000003,(Edward E. Mansur) read as follows:,S,19690103,3,The legislative clerk,Unknown,Unknown,Unknown,Special,85,86,01031969.txt,35,6
3,910000004,The Chair lays before the Senate the credentia...,S,19690103,4,The VICE PRESIDENT,Unknown,Unknown,Unknown,Special,258,271,01031969.txt,443,78
4,910000005,The Senators to be sworn in will now present t...,S,19690103,5,The VICE PRESIDENT,Unknown,Unknown,Unknown,Special,1015,1021,01031969.txt,213,39
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1018456,930357292,Mr. President. I am honored to be able to join...,E,19741220,2670,Mr. FULBRIGHT,Unknown,FULBRIGHT,Unknown,M,163296,163356,12201974.txt,1989,340
1018457,930357293,Mr. President. ALAN BIBLES decision to retire ...,E,19741220,2671,Mr. FULBRIGHT,Unknown,FULBRIGHT,Unknown,M,163358,163373,12201974.txt,546,95
1018458,930357294,Mr. President. I am privileged to be able to j...,E,19741220,2672,Mr. FULBRIGHT,Unknown,FULBRIGHT,Unknown,M,163404,163430,12201974.txt,971,170
1018459,930357295,Mr. President. when the 93d Congress adjourns ...,E,19741220,2673,Mr. FULBRIGHT,Unknown,FULBRIGHT,Unknown,M,163432,163471,12201974.txt,1273,216


To learn how to best use this dataset, we suggest reading the codebook found [here](https://stacks.stanford.edu/file/druid:md374tz9962/codebook_v4.pdf).
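
The 'date' column in the metadata table above is stored as an integer in YYYYMMDD form (for example 19690103). Before filtering or sorting by date, it usually helps to convert it to a real date type. A minimal standard-library sketch follows; `parse_congress_date` is a hypothetical helper name.

```python
from datetime import date, datetime


def parse_congress_date(value):
    """Convert a YYYYMMDD integer or string (e.g. 19690103) to a datetime.date."""
    return datetime.strptime(str(value), "%Y%m%d").date()


first = parse_congress_date(19690103)
last = parse_congress_date("19741220")
print(first.isoformat(), last.isoformat())
print(first < date(1970, 1, 1))
```

For a whole dataframe column, the pandas equivalent would be along the lines of `pd.to_datetime(congress_df['date'].astype(str), format='%Y%m%d', errors='coerce')`. Note that the merge cell above fills missing values with 0, and a 0 in the date column is not a valid date, which is why `errors='coerce'` is suggested here.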

## Reddit Archive

**DISCLAIMER**: If you want to use Reddit, you are likely to run into issues of scale due to the size of the full dataset. Please reach out to us about how best to approach your data. We might be able to help you build a more manageable subset of the data.

To use Pandas with Reddit, you will need more memory than your typical node. To do this you will want to open a JupyterLab Session on a high-mem queue. For more information, look at the documentation [here](http://faculty.smu.edu/csc/documentation/slurm.html#maneframe-ii-s-slurm-partitions-queues).

If you want to keep the data on disk instead of using a higher memory node, you will need to use a different dataframe library than pandas. Dask will let you do this but it will be very slow.

In [None]:
from dask import dataframe as dd
reddit = dd.read_parquet("/scratch/group/oit_research_data/reddit/reddit.parquet")
reddit.head()

The fastest way to load smaller portions of the Reddit database is from the TSV (tab-separated values) files, which are already separated by month (Oct 2007 - May 2015). Keep in mind that the files grow larger for more recent months, so you might need to allocate more memory to your HPC instance beforehand. (September 2010 is only 1 GB, while May 2015 is over **12GB**.)

In [37]:
import pandas as pd
from datetime import datetime
import glob

reddit_files = glob.glob('/scratch/group/oit_research_data/reddit/*.tsv')

# Add or remove entries to this list to modify what months are loaded into the dataframe.
DATES_TO_LOAD = [
    # (year, month),
    (2008, 10),
    (2008, 11),
]

list_dfs = []
for file in reddit_files:
    filename = file.rsplit('/', maxsplit=1)[-1].rsplit('.', maxsplit=1)[0]
    filetime = datetime.strptime(filename, 'RC_%Y-%m')
    if (filetime.year, filetime.month) in DATES_TO_LOAD:
        list_dfs.append(pd.read_csv(file, sep='\t'))

reddit_df = pd.concat(list_dfs)
del list_dfs[:]

# Recommended: drop deleted comments from your dataset    
reddit_df = reddit_df[reddit_df['body'] != '[deleted]']

# Subreddit filtering example:
subreddit_filter = ('politics', 'worldnews', 'news')
reddit_df=reddit_df[reddit_df['subreddit'].isin(subreddit_filter)]

display(reddit_df)

Unnamed: 0.1,Unnamed: 0,body,created_utc,downs,id,link_id,parent_id,subreddit,subreddit_id,ups
1,1,"Dude, your phone is ringing Dude... Dude, your...",1225497628,0,c064gt4,t3_7ajhj,t1_c0648hy,news,t5_2qh3l,0
7,7,"Ask her to do it with her mouth full. Nudge, k...",1225497666,0,c064gta,t3_7ak8x,t1_c064dd3,worldnews,t5_2qh13,2
22,23,"There's a difference between subtle, tasteful ...",1225497747,0,c064gtq,t3_7aklv,t1_c064c5m,politics,t5_2cneq,8
23,24,"Holy shit, not only was my ad. a floater, mine...",1225497749,0,c064gtr,t3_7aklv,t1_c064b44,politics,t5_2cneq,2
29,30,Do you know what they call Proposition 8 in Ho...,1225497763,0,c064gtx,t3_7ajk1,t3_7ajk1,politics,t5_2cneq,1
...,...,...,...,...,...,...,...,...,...,...
782844,789835,"MOR-MONS DE-TEC-TED. LAUNCH MISSILES, *EXTERMI...",1225497473,0,c064gs0,t3_7ai4f,t1_c0646rv,politics,t5_2cneq,1
782853,789844,It is not a matter of not remembering. The do ...,1225497503,0,c064gs9,t3_7amd9,t3_7amd9,politics,t5_2cneq,1
782859,789850,"If she misunderstands or, even worse, purposef...",1225497524,0,c064gsf,t3_7aklv,t3_7aklv,politics,t5_2cneq,40
782878,789869,It's either that or military power according t...,1225497583,0,c064gsy,t3_7amdg,t1_c064gqo,politics,t5_2cneq,1
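
The loading cell above takes an explicit list of (year, month) pairs. For longer periods it may be easier to generate that list than to type it out. The sketch below does this with a hypothetical helper, `month_range`. The file names it prints assume zero-padded months, which is consistent with the 'RC_%Y-%m' pattern used above but has not been checked against the actual directory listing.

```python
def month_range(start, end):
    """Return every (year, month) pair from start to end, inclusive."""
    (year, month), (end_year, end_month) = start, end
    if (year, month) > (end_year, end_month):
        raise ValueError("start must not be later than end")
    months = []
    while (year, month) <= (end_year, end_month):
        months.append((year, month))
        # Roll over to January of the next year after December.
        year, month = (year + 1, 1) if month == 12 else (year, month + 1)
    return months


dates_to_load = month_range((2008, 10), (2009, 2))
print(dates_to_load)

# File names implied by the 'RC_%Y-%m' pattern (zero-padding is an assumption).
print([f"RC_{y}-{m:02d}.tsv" for y, m in dates_to_load])
```

Because the files grow with time, the length of the generated list is only a rough guide to memory use: five months from 2014 will need far more memory than five months from 2008.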


### Simple Example

Filtering Reddit for specific subreddits over a specific time period. In this case, filtering for subreddits whose names include "climate" and for comments made before December 21st, 2012.

In [None]:
subreddit=reddit[reddit['subreddit'].str.contains('climate')]

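Convert the 'created_utc' column to datetime objects - forcing any values that cannot be parsed to NaT. The values in 'created_utc' are [Unix time](https://en.wikipedia.org/wiki/Unix_time) stamps: seconds since January 1st, 1970 (UTC).

Then, keep only comments created before December 21st, 2012.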

In [None]:
subreddit['created_utc']=dd.to_datetime(subreddit['created_utc'], unit='s', errors='coerce')

# Filter on the converted 'created_utc' column
subreddit2012 = subreddit[subreddit['created_utc']<dt.datetime(2012,12,21)]
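
To make the `unit='s'` conversion above concrete, here is a small standard-library sketch that converts one 'created_utc' value from the table earlier in this section and compares it with the cutoff date. `from_unix` is a hypothetical helper name.

```python
from datetime import datetime, timezone


def from_unix(seconds):
    """Convert seconds since 1970-01-01 00:00:00 UTC to a timezone-aware datetime."""
    return datetime.fromtimestamp(seconds, tz=timezone.utc)


cutoff = datetime(2012, 12, 21, tzinfo=timezone.utc)

created = from_unix(1225497628)   # a created_utc value from the table above
print(created.isoformat())        # 2008-11-01T00:00:28+00:00
print(created < cutoff)           # True
```

The sketch uses timezone-aware datetimes so that the comparison is explicit about UTC. The dataframe version above produces timezone-naive values, which is why it compares against a naive `dt.datetime(2012,12,21)`.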