# Data Access Notebook

In [None]:
import pandas as pd
import datetime as dt
import dask

## Project Gutenberg

Visit the [homepage](https://www.gutenberg.org/) for Project Gutenberg if you are having trouble finding a specific book.

Usage documentation for the Python package can be found [here](https://pypi.org/project/Gutenberg/).

In [None]:
from gutenberg.acquire import load_etext
from gutenberg.cleanup import strip_headers

text = strip_headers(load_etext(11)).strip()
#print(text)

### Simple Example

In [None]:
books = [[1342,"Pride and Prejudice","Jane Austen"],
         [11,"Alice's Adventures in Wonderland", "Lewis Carroll"],
         [2701,"Moby Dick; Or, The Whale","Herman Melville"],
         [84,"Frankenstein; Or, The Modern Prometheus", "Mary Wollstonecraft Shelley"],
         [345,"Dracula", "Bram Stoker"]
        ]

In [None]:
gutenDF = pd.DataFrame(books, columns=['ID','Title','Author'])
gutenDF['FullText']=gutenDF.apply(lambda row: strip_headers(load_etext(row['ID'])).strip() , axis=1)
gutenDF

In [None]:
gutenDF[gutenDF['Title'].str.contains("W")]['FullText']

### Metadata and Caching

If you plan to do a lot of work with Project Gutenberg's metadata functionality, you need to build a local cache of their metadata first. Building the cache can take a very long time, but once it exists you can query the metadata quickly.
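
A rough sketch of what building and querying the cache might look like, assuming the `get_metadata_cache` and `get_metadata` functions described in the package's usage documentation. The wrapper names `build_gutenberg_cache` and `lookup_title` are hypothetical, and the imports sit inside the functions so that defining them does not start the slow cache build.

```python
def build_gutenberg_cache():
    """Populate the local Project Gutenberg metadata cache (one-time, slow)."""
    # Imported lazily so that defining this helper has no side effects
    from gutenberg.acquire import get_metadata_cache

    cache = get_metadata_cache()
    cache.populate()  # builds the local metadata index; can take a very long time
    return cache


def lookup_title(etext_id):
    """Return the title metadata for a Gutenberg ID (requires a populated cache)."""
    from gutenberg.query import get_metadata

    return get_metadata('title', etext_id)
```

Run `build_gutenberg_cache()` once per environment; after that, a call such as `lookup_title(2701)` should be answered from the local cache.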

## EDGAR Database

Here is the [homepage](https://www.sec.gov/edgar.shtml) for the Securities and Exchange Commission's EDGAR database. If you are having trouble finding a specific company, try their [full text search](https://www.sec.gov/edgar/search/#).

Usage documentation for the Python package can be found [here](https://pypi.org/project/edgar/).

Pull the last 5 10-K reports for the Oracle Corporation

In [None]:
from edgar import Company, TXTML
company = Company("Oracle Corp", "0001341439")
tree = company.get_all_filings(filing_type = "10-K")
docs = Company.get_documents(tree, no_of_documents=5)

Parse the most recent 10-K filing for IBM

In [None]:
company = Company("INTERNATIONAL BUSINESS MACHINES CORP", "0000051143")
doc = company.get_10K()
text = TXTML.parse_full_10K(doc)

Search EDGAR for companies whose name matches "Cisco System"

In [None]:
from edgar import Edgar
edgar = Edgar()
possible_companies = edgar.find_company_name("Cisco System")
possible_companies

### Simple Example

In [None]:
companies = [['AMAZON COM INC','0001018724'],
            ['Alphabet Inc.','0001652044'],
            ['MICROSOFT CORP','0000789019']
            ]

In [None]:
edgarDF = pd.DataFrame(companies, columns=['Company','CIK'])
edgarDF['MostRecent_10K']=edgarDF.apply(lambda row: TXTML.parse_full_10K(Company(row['Company'],row['CIK']).get_10K()) , axis=1)
edgarDF

#### Last 5 10Ks with Filing Dates

In [None]:
def get_edgar(ll, n):
    filinglist = []
    for el in ll:
        company = Company(el[0], el[1])
        tree = company.get_all_filings(filing_type = "10-K")
        docs = Company.get_documents(tree, no_of_documents=n, as_documents=True)
        texts = Company.get_documents(tree, no_of_documents=n, as_documents=False)
        if n<2:
            docs=[docs]
            texts=[texts]
        for i in range(n):
            date = docs[i].content['Filing Date']
            text = TXTML.parse_full_10K(texts[i])
            filinglist.append([el[0],el[1],date,text])
    # Column order must match the order of the values appended above: name, CIK, date, text
    df = pd.DataFrame(filinglist, columns=['Company','CIK','Filing_Date','10K_Filing'])
    return df

In [None]:
get_edgar(companies,5)

## Hansard

If you want to use the Hansard, you will need to significantly increase the memory you request for your JupyterLab session. We suggest increasing it from 6GB to 64GB. This might mean a longer wait for your job to launch, but it lets you hold the entire dataframe in memory.

In [None]:
hansard = pd.read_parquet("/scratch/group/oit_research_data/hansard/hansard_20191119.parquet")
hansard.head(1)

You also have the option to use the Dask dataframe library instead of pandas. The tradeoff is that Dask uses less memory, at the cost of speed, and lends itself to parallelization better than pandas. When in doubt, default to pandas.

In [None]:
from dask import dataframe as dd
hansard = dd.read_parquet("/scratch/group/oit_research_data/hansard/hansard_20191119.parquet")
hansard.head()

### Simple Example

Filtering Hansard to study specific topics. The cells below assume `hansard` is the pandas dataframe loaded above, not the Dask one.

In [None]:
# Convert the speechdate column to datetime objects - forcing any unparseable values to be set to NaT
hansard['speechdate']=pd.to_datetime(hansard['speechdate'], errors='coerce')

In [None]:
# Filter for only speechdates before 1900
hansard1800s = hansard[hansard['speechdate']<dt.datetime(1900,1,1)]

In [None]:
# Filter for speakers named Gladstone and text mentioning Dublin (na=False treats missing values as non-matches)
hansard1800s[(hansard1800s['speaker'].str.contains('Gladstone', na=False))&(hansard1800s['text'].str.contains('Dublin', na=False))]
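
To try the same date-then-text filter without loading the full Hansard, here is a minimal sketch on a toy dataframe. The rows are made-up placeholders, not real Hansard records; only the column names `speechdate`, `speaker`, and `text` come from the cells above. It also shows why passing `na=False` to `str.contains` can help: rows with a missing speaker are treated as non-matches instead of putting missing values into the boolean mask.

```python
import datetime as dt
import pandas as pd

# Toy stand-in for the Hansard dataframe (placeholder rows, not real data)
toy = pd.DataFrame({
    "speechdate": ["1886-04-08", "1886-04-09", "not a date", "1912-05-01"],
    "speaker": ["Mr. Gladstone", None, "Mr. Gladstone", "Mr. Gladstone"],
    "text": ["placeholder mentioning Dublin", "placeholder mentioning Dublin",
             "placeholder mentioning Dublin", "placeholder mentioning Dublin"],
})

# Unparseable dates become NaT and drop out of the date comparison below
toy["speechdate"] = pd.to_datetime(toy["speechdate"], errors="coerce")
before_1900 = toy[toy["speechdate"] < dt.datetime(1900, 1, 1)]

# na=False: a missing speaker or text counts as "does not contain"
mask = (before_1900["speaker"].str.contains("Gladstone", na=False)
        & before_1900["text"].str.contains("Dublin", na=False))
result = before_1900[mask]
print(result)
```

Only the first toy row survives: the second has no speaker, the third has an unparseable date, and the fourth falls after 1900.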

## US Congress

This data was pulled via a shell script (congress_download.sh) from [Stanford's Congressional Record](https://data.stanford.edu/congress_text) and is available on M2. If you use this data, please cite the dataset properly.

In [None]:
import glob

In [None]:
path_to_congress = "/scratch/group/oit_research_data/stanford_congress"

In [None]:
glob.glob('{}/*'.format(path_to_congress))

To learn how best to use this dataset, we suggest you read the codebook found [here](https://stacks.stanford.edu/file/druid:md374tz9962/codebook_v4.pdf).
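
As a starting point for reading one of the files listed by `glob`, here is a hedged sketch using only the standard library. It assumes the files are pipe-delimited text with a header row and that `latin-1` decoding is acceptable. Both are assumptions to confirm against the codebook, and the helper name `iter_pipe_delimited` is hypothetical.

```python
import csv


def iter_pipe_delimited(path, encoding="latin-1"):
    """Yield one dict per row from a pipe-delimited text file with a header row."""
    with open(path, newline="", encoding=encoding) as f:
        # QUOTE_NONE: treat quotation marks inside a field as ordinary characters
        reader = csv.DictReader(f, delimiter="|", quoting=csv.QUOTE_NONE)
        for row in reader:
            yield row
```

Because the helper is a generator, rows are read one at a time, so you can inspect the first few records of a large file without loading all of it into memory.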

## COVID-19 Text Data

This data was pulled via a shell script (congress_download.sh) from [MIT's COVID-19 Open Research Dataset](https://innovation.mit.edu/cord19/) and is available on M2. If you use this data, please cite the dataset properly.

In [None]:
import glob
import json

In [None]:
date = '2020-08-23'

In [None]:
path_to_covid = "/scratch/group/oit_research_data/semantic_scholar_cord_19/"+date

In [None]:
glob.glob('{}/*'.format(path_to_covid))

In [None]:
metadata = pd.read_csv('{}/metadata.csv'.format(path_to_covid), dtype=object)
metadata.head(2)

The function below reads the JSON file associated with a row of our metadata table. It tries the PDF parse first and falls back to the PMC parse if the first file is not found.

In [None]:
def read_text_from_json(row,path=path_to_covid):
    """Load the parsed JSON for one metadata row, preferring the PDF parse over the PMC parse."""
    file_pdf_json = '{}/{}'.format(path, row['pdf_json_files'])
    file_pmc_json = '{}/{}'.format(path, row['pmc_json_files'])

    try:
        with open(file_pdf_json) as f:
            text = json.load(f)
    except FileNotFoundError:
        # PDF parse not found on disk: fall back to the PMC parse
        try:
            with open(file_pmc_json) as f:
                text = json.load(f)
        except FileNotFoundError:
            # Neither file was found: return a placeholder with the same top-level key
            text = {"body_text":"No Files Listed Were Found"}

    return text

In [None]:
# Select a random sample of 100 articles from our collection that have at least 1 json file listed
covidDF = metadata.dropna(subset=['pdf_json_files', 'pmc_json_files'],thresh=1).sample(100)

# Read the text using a function to parse the associated file
covidDF['Text'] = covidDF.apply(lambda row: read_text_from_json(row,path_to_covid), axis = 1)

Pull the body text of the paper in the first row of our sample

In [None]:
covidDF.iloc[0]['Text']['body_text']
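
The body text comes back as nested JSON rather than a single string. A minimal sketch for flattening it, assuming `body_text` is a list of paragraph dictionaries that each carry a `text` field (the layout used by the CORD-19 parses, but worth checking on your own files). The helper name `body_text_to_string` is hypothetical; it also passes through the plain-string placeholder that `read_text_from_json` returns when no file is found.

```python
def body_text_to_string(doc):
    """Join the paragraphs of a parsed document into one string."""
    body = doc.get("body_text", "")
    # The placeholder used for missing files is already a plain string
    if isinstance(body, str):
        return body
    # Otherwise assume a list of paragraph dicts, each with a "text" field
    return "\n\n".join(paragraph.get("text", "") for paragraph in body)
```

It can be applied to the whole sample with `covidDF['Text'].apply(body_text_to_string)`.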

## Reddit Archive

**DISCLAIMER**: If you want to use Reddit, you are likely to run into issues of scale because the full data set is very large. Please reach out to us about how best to approach your data. We might be able to help you curate a more manageable subset of the data.

To use pandas with Reddit, you will need more memory than a typical node provides. To get it, open a JupyterLab session on a high-memory queue. For more information, look at the documentation [here](http://faculty.smu.edu/csc/documentation/slurm.html#maneframe-ii-s-slurm-partitions-queues).

If you want to keep the data on disk instead of using a higher memory node, you will need to use a different dataframe library than pandas. Dask will let you do this but it will be very slow.

In [None]:
from dask import dataframe as dd
reddit = dd.read_parquet("/scratch/group/oit_research_data/reddit/reddit.parquet")
reddit.head()

### Simple Example

Filtering Reddit for a specific subreddit over a specific time period. In this case, we filter for subreddits whose names include "climate" and for posts made before December 21st, 2012.

In [None]:
subreddit=reddit[reddit['subreddit'].str.contains('climate')]

Convert the `created_utc` column to datetime objects, forcing any unparseable values to be set to NaT. The column stores [Unix time](https://en.wikipedia.org/wiki/Unix_time), that is, seconds since January 1st, 1970 (UTC).

Then, filter for only posts before December 21st, 2012.

In [None]:
subreddit['created_utc']=dd.to_datetime(subreddit['created_utc'], unit='s', errors='coerce')

# Filter on the converted created_utc column
subreddit2012 = subreddit[subreddit['created_utc']<dt.datetime(2012,12,21)]
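
To sanity-check a single Unix timestamp against the cutoff without touching the full dataset, here is a small standard-library sketch. The helper name `unix_to_utc` and the sample timestamp are illustrative. Unlike the cell above, it uses timezone-aware datetimes so the comparison is explicitly in UTC.

```python
import datetime as dt


def unix_to_utc(seconds):
    """Convert Unix time (seconds since 1970-01-01 UTC) to a timezone-aware datetime."""
    return dt.datetime.fromtimestamp(seconds, tz=dt.timezone.utc)


# Cutoff used in the filter above, expressed as an aware UTC datetime
cutoff = dt.datetime(2012, 12, 21, tzinfo=dt.timezone.utc)

example = unix_to_utc(1356047999)  # illustrative timestamp, one second before the cutoff
print(example.isoformat(), example < cutoff)
```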