# Final Project Proposal & Data Report

Jack (Quan Cheng) Xie | qcx201 | N14077607 <br>
Professor Michael Waugh <br>
Data Bootcamp <br>
27 November 2017

## Project Proposal

The idea of the project is to explore the English language. Languages abound with interesting and unintuitive attributes.

##### The Zipf!
One example of such attributes is the [**Zipf Mystery**](https://www.youtube.com/watch?v=fCn8zs912OE) *(video-link for the printout version: https://www.youtube.com/watch?v=fCn8zs912OE, or search  **The Zipf Mystery** by Vsauce on youtube)*. The Zipf mystery is a phenomenon which shows that the frequency of word-use follows a Pareto distribution (or also the power law) such that (roughly) 20% of words in the English lexicon take up 80% of usage. Why? No one knows.

##### Questions, Questions...
The starting point of the project will be to reproduce this *Zipf* phenomenon (and obviously, plot the curve). We can test out how the *Zipf* distribution might change (or not change) in different linguistic environments ( Reddit vs Twitter vs the New York Times *(and maybe vs works of literature)* ) and across time. Then, we may look into some of the possible reasons for this phenomenon, such as the random-words hypothesis mentioned in the video. That may lead us to explore what kinds of words fall into a particular part of the distribution.

We can also see if the *Zipf* phenomenon appears in any other facets of language other than word-frequency in the basic form. Perhaps the *Zipf* may show up in not only single words but in phrases or word-groups. What about *Zipf* in semantically-related words? *Zipf* in sentiment? *Zipf* in the frequency of syllables? *Zipf* in rhymes? 

Maybe? Let's find out.


##### The Data

The datasets that we will use include a collection of comments from reddit, snippets from New York Times articles, and tweets. The python code for accessing or scraping the data are below. Other possibilities include samples from literature through the NLTK module, though I may have to get a little more familiar with NLTK first.

## Data Access

#### Imports:

In [1]:
import pandas as pd
import json
import requests, io
import zipfile as zf  
import bz2
import shutil
import os
import time
from IPython.display import clear_output
import nltk as nltk

#### DataFrame Summary Function
A function to summarize shape, columns, index, and return a DataFrame head-preview.

In [2]:
def df_summary(dataframe):
    print('Shape:', dataframe.shape, '\n')
    print('Columns:', list(dataframe.columns),'\n')
    print('Index:', dataframe.index)
    return dataframe.head()

### Data I. Reddit Archives

In [3]:
# Accessing and opening bz2 files
def get_reddit_file(file):
    url = 'https://files.pushshift.io/reddit/comments/'
    print('Fetching:', file)
    resp = requests.get(url+file)
    print(file,resp)
    result = bz2.BZ2File(io.BytesIO(resp.content))
    clear_output()
    return result

In [4]:
# Combining bz2 files into one json
def get_reddit_archives(start):
    years = list(range(2006,start+1))
    for y in range(len(years)):
        years[y] = str(years[y])

    file_list = list()
    initial = 'RC_2006-01.bz2'
    for year in years:
        file_list.append(initial.replace('2006',year))

    result = list()
    for file in file_list:
        bz_file = get_reddit_file(file)
        linelist = bz_file.readlines()
        for line in linelist:
            result.append(json.loads(line))
    return result
    

In [5]:
# Execution and Pandas
    # I've only done up to 2008 just to get an idea, otherwise it'll take a little too long
    # I'll keepy trying to use the API to scrape Reddit comments
    # But for now I will use the archives in case the API doesn't work out
all_archives = get_reddit_archives(2008)
reddit = pd.DataFrame(all_archives)

In [6]:
# Summarize DataFrame
df_summary(reddit)

Shape: (537997, 22) 

Columns: ['archived', 'author', 'author_flair_css_class', 'author_flair_text', 'body', 'controversiality', 'created_utc', 'distinguished', 'downs', 'edited', 'gilded', 'id', 'link_id', 'name', 'parent_id', 'retrieved_on', 'score', 'score_hidden', 'stickied', 'subreddit', 'subreddit_id', 'ups'] 

Index: RangeIndex(start=0, stop=537997, step=1)


Unnamed: 0,archived,author,author_flair_css_class,author_flair_text,body,controversiality,created_utc,distinguished,downs,edited,...,link_id,name,parent_id,retrieved_on,score,score_hidden,stickied,subreddit,subreddit_id,ups
0,,jh99,,,early 2006 a probable date,0,1136074029,,,False,...,t3_22569,,t3_22569,1473821517,0,,False,reddit.com,t5_6,0
1,,jpb,,,If you are going to post something that has a ...,0,1136076410,,,False,...,t3_22542,,t3_22542,1473821517,0,,False,reddit.com,t5_6,0
2,,Pichu0102,,,Microsoft hates it's own products?\r\nWho knew?,0,1136078623,,,False,...,t3_22515,,t3_22515,1473821517,2,,False,reddit.com,t5_6,2
3,,libertas,,,"this looks interesting, but it's already aired...",0,1136079346,,,False,...,t3_22528,,t3_22528,1473821517,2,,False,reddit.com,t5_6,2
4,,mdmurray,,,I have nothing but good things to say about De...,0,1136081389,,,False,...,t3_22538,,t3_22538,1473821517,0,,False,reddit.com,t5_6,0


#### Columns of Interest:
`body` for the text data. `created_utc` for the time series analysis. The timeframe here is nice if I can get all the data from 2006 to 2017, but that may be a little bit difficult with the archive volume. The more recent years have too many instances to download and work with efficiently. For now I will use the archive data until I can figure out how to scrape comments from the Reddit API, which is a little complicated.

In [7]:
reddit.head().body

0                           early 2006 a probable date
1    If you are going to post something that has a ...
2      Microsoft hates it's own products?\r\nWho knew?
3    this looks interesting, but it's already aired...
4    I have nothing but good things to say about De...
Name: body, dtype: object

In [8]:
reddit.head().created_utc

0    1136074029
1    1136076410
2    1136078623
3    1136079346
4    1136081389
Name: created_utc, dtype: object

### Data II. New York Times Articles API

In [9]:
# Retreiving data 'documents' from API call
def nyt_data(year, month, day):
    
    url = "https://api.nytimes.com/svc/search/v2/articlesearch.json"
    #nyt_key = 'e3fba4ee4f0944619aa6e7be1fc73eab' # Sometimes backups are needed because of the call limit
    nyt_key = '770978643efb463e84871368163388d7'
    #nyt_key = '7b36694d6ed447e4a3da0b2bf411d0f8'
    
    parameters = {'api-key' : nyt_key,
              'begin_date' : year+month+day,
              'sort' : 'oldest'}

    resp = requests.get(url, params = parameters)
    data = resp.json()
    
    return data

In [10]:
# Getting articles
def get_articles(year,month,day):
    articles = list()
    articles = nyt_data(year, month, day)['response']['docs']
    return articles

In [11]:
# Years and Months for iteration
years = list(range(2006,2017+1))
for y in range(len(years)):
    years[y] = str(years[y])

months = list()
for m in range(1,12+1):
    if m < 10:
        months.append('0'+str(m))
    else:
        months.append(str(m))

# We won't use this now, but it may come in handy later if we decide to up the scraping frequency.
def days(month):
    m30 = ['04','06','09','11']
    res = list()
    for d in range(1,10):
        res.append('0'+str(d))   
    if month == '02':
        for d in range(10,28+1):
            res.append(str(d))
    elif month in m30:
        for d in range(10,30+1):
            res.append(str(d))
    else:
        for d in range(10,31+1):
            res.append(str(d))
    return res

In [12]:
# Scraping data through iteration
def get_nyt():
    result = list()
    for year in years:
        print('Start:', year)
        for month in months:
            for day in ['01','15']: # Two datasets of 10 articles a month
                try: # NYT API limits call rate, so a timer is set to slow down the call iteration
                    docs = get_articles(year, month, day)
                except:
                    time.sleep(5)
                    docs = get_articles(year, month, day)
                    continue
                for article in docs:
                    result.append(article)
                print(year, month, day)
                time.sleep(2)
        print(year, 'OK')
        clear_output()
    return result

In [13]:
# Executing... Wish me luck!
nyt_api = get_nyt()
len(nyt_api)

2820

In [14]:
# Pandas DataFrame
nyt = pd.DataFrame(nyt_api)

In [15]:
# Summarize DataFrame
df_summary(nyt)

Shape: (2820, 20) 

Columns: ['_id', 'abstract', 'blog', 'byline', 'document_type', 'headline', 'keywords', 'multimedia', 'new_desk', 'print_page', 'pub_date', 'score', 'section_name', 'slideshow_credits', 'snippet', 'source', 'type_of_material', 'uri', 'web_url', 'word_count'] 

Index: RangeIndex(start=0, stop=2820, step=1)


Unnamed: 0,_id,abstract,blog,byline,document_type,headline,keywords,multimedia,new_desk,print_page,pub_date,score,section_name,slideshow_credits,snippet,source,type_of_material,uri,web_url,word_count
0,4fd24e778eb7c8105d7f036c,,{},{'original': 'By ALISON BERKLEY'},article,"{'main': 'Long, Steep and Lovely in Aspen', 'k...","[{'isMajor': None, 'rank': 0, 'name': 'glocati...",[],Travel Desk,4,2006-01-01T00:00:00Z,1.0,,,While jet setters schuss down the groomed slop...,The New York Times,News,,https://www.nytimes.com/2006/01/01/travel/01su...,568.0
1,4fd24e778eb7c8105d7f0372,George Ernsberger letter on Jesse Green's arti...,{},,article,{'main': 'The Making of an Ice Princess'},"[{'isMajor': None, 'rank': 0, 'name': 'persons...",[],Magazine,10,2006-01-01T00:00:00Z,1.0,,,Thanks for Jesse Green's terrific article (Dec...,The New York Times,Letter,,https://query.nytimes.com/gst/fullpage.html?re...,41.0
2,4fd24e778eb7c8105d7f037b,,{},{'original': 'By KELLY FEENEY'},article,"{'main': 'Quick Bite/Millburn; When in Essex, ...",[],[],New Jersey Weekly Desk,10,2006-01-01T00:00:00Z,1.0,,,The word famiglia in a food store's name can b...,The New York Times,Review,,https://query.nytimes.com/gst/fullpage.html?re...,350.0
3,4fd24e778eb7c8105d7f0381,British and Dutch researchers conduct study on...,{},{'original': 'By ALEX WILLIAMS'},article,{'main': 'Hangover Helpers: Beyond Sheep Eyes'},"[{'isMajor': None, 'rank': 0, 'name': 'glocati...","[{'type': 'image', 'subtype': 'thumbnail', 'ur...",Style Desk,1,2006-01-01T00:00:00Z,1.0,,,"As people wake up from another New Year's Eve,...",The New York Times,News,,https://www.nytimes.com/2006/01/01/fashion/sun...,1346.0
4,4fd24e778eb7c8105d7f038a,,{},{'original': 'By HOWARD MARKEL'},article,"{'main': 'If the Avian Flu Hasn't Hit, Here's ...",[],[],Week in Review Desk,10,2006-01-01T00:00:00Z,1.0,,,WILD birds have completed their seasonal migra...,The New York Times,News,,https://query.nytimes.com/gst/fullpage.html?re...,329.0


#### Columns of Interest:
`pub_date` for time series analysis. `abstract` and `snippet` for text data, though these texts are often quite short compared to the full articles. Later I may try to increase the data size by making the date frequency finer, though it might be a little tricky with the API limitations. 

In [16]:
nyt.head().abstract

0                                                  NaN
1    George Ernsberger letter on Jesse Green's arti...
2                                                  NaN
3    British and Dutch researchers conduct study on...
4                                                  NaN
Name: abstract, dtype: object

In [17]:
nyt.head().snippet

0    While jet setters schuss down the groomed slop...
1    Thanks for Jesse Green's terrific article (Dec...
2    The word famiglia in a food store's name can b...
3    As people wake up from another New Year's Eve,...
4    WILD birds have completed their seasonal migra...
Name: snippet, dtype: object

In [18]:
nyt.head().pub_date

0    2006-01-01T00:00:00Z
1    2006-01-01T00:00:00Z
2    2006-01-01T00:00:00Z
3    2006-01-01T00:00:00Z
4    2006-01-01T00:00:00Z
Name: pub_date, dtype: object

### Data III. Twitter Collection

In [19]:
# Accessing the zip file
url = 'http://followthehashtag.com/content/uploads/USA-Geolocated-tweets-free-dataset-Followthehashtag.zip'
resp = requests.get(url)
tw_zip = zf.ZipFile(io.BytesIO(resp.content))
tw_zip.namelist()

['heatmap_x_usa_x_filter_nativeretweets.xlsx',
 '__MACOSX/',
 '__MACOSX/._heatmap_x_usa_x_filter_nativeretweets.xlsx',
 'coverage_x_usa_x_filter_nativeretweets.xlsx',
 '__MACOSX/._coverage_x_usa_x_filter_nativeretweets.xlsx',
 'geolocation_x_usa_x_filter_nativeretweets.xlsx',
 '__MACOSX/._geolocation_x_usa_x_filter_nativeretweets.xlsx',
 'dashboard_x_usa_x_filter_nativeretweets.xlsx',
 '__MACOSX/._dashboard_x_usa_x_filter_nativeretweets.xlsx']

In [20]:
# Reading DataFrame through Pandas
twitter = pd.read_excel(tw_zip.open(tw_zip.namelist()[7]),sheet_name = 'Stream')

In [21]:
# Changing column names because of spaces
twitter_cols = list(twitter.columns)
new_twitter_cols = [twitter_cols[i].replace(' ','_') for i in range(len(twitter_cols))]
twitter.columns = new_twitter_cols

In [22]:
# Summarize DataFrame
df_summary(twitter)

Shape: (204820, 19) 

Columns: ['Tweet_Id', 'Date', 'Hour', 'User_Name', 'Nickname', 'Bio', 'Tweet_content', 'Favs', 'RTs', 'Latitude', 'Longitude', 'Country', 'Place_(as_appears_on_Bio)', 'Profile_picture', 'Followers', 'Following', 'Listed', 'Tweet_language_(ISO_639-1)', 'Tweet_Url'] 

Index: RangeIndex(start=0, stop=204820, step=1)


Unnamed: 0,Tweet_Id,Date,Hour,User_Name,Nickname,Bio,Tweet_content,Favs,RTs,Latitude,Longitude,Country,Place_(as_appears_on_Bio),Profile_picture,Followers,Following,Listed,Tweet_language_(ISO_639-1),Tweet_Url
0,721318437075685382,2016-04-16,12:44,Bill Schulhoff,BillSchulhoff,"Husband,Dad,GrandDad,Ordained Minister, Umpire...","Wind 3.2 mph NNE. Barometer 30.20 in, Rising s...",,,40.760278,-72.954722,US,"East Patchogue, NY",http://pbs.twimg.com/profile_images/3788000007...,386.0,705.0,24.0,en,http://www.twitter.com/BillSchulhoff/status/72...
1,721318436173979648,2016-04-16,12:44,Daniele Polis,danipolis,"Viagens, geek, moda, batons laranja, cabelos c...",Pausa pro café antes de embarcar no próximo vô...,,,32.898349,-97.039196,US,"Grapevine, TX",http://pbs.twimg.com/profile_images/7041760340...,812.0,647.0,16.0,pt,http://www.twitter.com/danipolis/status/721318...
2,721318434169102336,2016-04-16,12:44,Kasey Jacobs,KJacobs27,Norwich University Class of 2017,Good. Morning. #morning #Saturday #diner #VT #...,,,44.199476,-72.504173,US,"Barre, VT",http://pbs.twimg.com/profile_images/7169585649...,179.0,206.0,2.0,en,http://www.twitter.com/KJacobs27/status/721318...
3,721318429844582400,2016-04-16,12:44,Stan Curtis,stncurtis,"transcendental music, art for art's sake, craf...",@gratefuldead recordstoredayus 🌹🌹🌹 @ TOMS MUSI...,,,39.901474,-76.606817,US,"Red Lion, PA",http://pbs.twimg.com/profile_images/6962528246...,1229.0,2071.0,11.0,en,http://www.twitter.com/stncurtis/status/721318...
4,721318429081407488,2016-04-16,12:44,Dave Borzymowski,wi_borzo,When in doubt....Panic.,Egg in a muffin!!! (@ Rocket Baby Bakery - @ro...,,,43.060849,-87.998309,US,"Wauwatosa, WI",http://pbs.twimg.com/profile_images/6595279129...,129.0,833.0,9.0,en,http://www.twitter.com/wi_borzo/status/7213184...


#### Columns of Interest:
`Tweet_content` for text data.
`Date` for the time series analysis, though for this particular dataset the timeframe is very small because of the issues with large tweet volumes and Twitter's limitations on accessing older tweets through their API.

In [23]:
twitter.head().Tweet_content

0    Wind 3.2 mph NNE. Barometer 30.20 in, Rising s...
1    Pausa pro café antes de embarcar no próximo vô...
2    Good. Morning. #morning #Saturday #diner #VT #...
3    @gratefuldead recordstoredayus 🌹🌹🌹 @ TOMS MUSI...
4    Egg in a muffin!!! (@ Rocket Baby Bakery - @ro...
Name: Tweet_content, dtype: object

In [24]:
twitter.head().Date

0    2016-04-16
1    2016-04-16
2    2016-04-16
3    2016-04-16
4    2016-04-16
Name: Date, dtype: object