In [None]:
%matplotlib inline
import matplotlib
import seaborn as sns
matplotlib.rcParams['savefig.dpi'] = 144

In [None]:
import grader

# Quandl Miniproject

## Introduction

Data provider [Quandl](https://www.quandl.com/) offers a vast array of free and paid databases, all accessible with the same Python API (application programming interface). Quandl aggregates data from many sources, covering scientific, economic, and government-related topics. The data is conveniently delivered as Pandas DataFrames.

**In this project, you will gain experience working with Python and Pandas using the data from Quandl.**

At the completion of this project, you will understand how to access all of the Quandl data and how to then wrangle that data in Pandas.

## Getting Data From Quandl

To use Quandl you will have to create an API key. The purpose of the API key is to make it easy for Quandl to track the usage of their data (creating data for them to study!) and for them to ensure that no one user is abusing their system with too many requests.

Create an API key by first creating an account on [Quandl](https://www.quandl.com/). You can log in with your Google, Github or LinkedIn accounts if you like.

After creating an account, access your *Account Settings* from the *Me* dropdown in the upper right corner. Then click on the *API KEY* link on the left below *PASSWORD*. Save the API Key: you'll need that in a moment.

There is [documentation](https://www.quandl.com/docs/api?python#) available for the API. You'll need to look through that to find a few pieces of information, but we will walk you through the basics right now.

### Test Query

Quandl provides a Python module that allows for easy access to their API.  Let's make sure we have the right version installed.  It should start with a 3.

In [None]:
import pandas as pd
import quandl
print(quandl.version.VERSION)

Now, tell Quandl about the API key you created above:

In [None]:
quandl.ApiConfig.api_key = '<API KEY>'  # Fill in your value here

Now we will access some Sunspot data. Visit Quandl's page for the [Solar Influences Data Analysis Center](https://www.quandl.com/data/SIDC/SUNSPOTS_D-Total-Sunspot-Numbers-Daily).

This is daily data collected by the Royal Observatory of Belgium starting in 1818. In the upper right-hand corner of the page you will find a *Quandl Code*. You will need this code to access this specific dataset. Each dataset has its own code, which you can use to download the data:

In [None]:
sunspots = quandl.get('SIDC/SUNSPOTS_D')

The string 'SIDC/SUNSPOTS_D' is a code for retrieving specific data offered by Quandl. 'SIDC' refers to the Royal Observatory database, and 'SUNSPOTS_D' is a specific dataset in that database.

Let's take a look at the data.

In [None]:
sunspots.head()

In [None]:
sunspots['Daily Sunspot Number'].plot()

That's how easy Quandl is! Find the Quandl code for the data you want and then call the `get` method.

# Questions

At the end of each of the following sections is a function that returns a list of data.  Currently, they return placeholder data.  You should alter these functions to return the correct data.  These functions are passed to the `score()` function that will submit the data to the grader and print out your grade.

## Question 1: wiki_data
We want to find the daily percentage change in the closing price for the first 100 trading days of 2016 for Tableau Software (ticker symbol DATA).  This should be returned as a list of 100 tuples of (date, percentage).
- Format the dates as strings like "7/04/16" for July 4th or "11/01/16" for November 1st.
  - **IMPORTANT**: look closely at the date format. The day has a leading zero but the month does not. The year is represented with two digits. Admittedly this is not a standard way of representing dates. The goal is to get you to think carefully about date formatting [directives](http://strftime.org/).
- The returns will be percentages, not fractions. Therefore, submit a return of one-and-a-half percent as 1.5, not 0.015.
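
One possible way to produce this format is sketched below. It builds the month separately, because the directive for a non-padded month (`%-m`) is platform-specific. The helper name `format_date` is ours and not part of the project.

```python
from datetime import date


def format_date(d):
    # Month without padding, then zero-padded day and two-digit year.
    return "{}/{}".format(d.month, d.strftime("%d/%y"))


examples = [format_date(date(2016, 7, 4)), format_date(date(2016, 11, 1))]
print(examples)  # ['7/04/16', '11/01/16']
```

The same helper should also work on the `Timestamp` objects in a Pandas datetime index, since they provide `.month` and `.strftime()` as well.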

Quandl provides stock prices in the "WIKI EOD Stock Prices" database.  Use the search feature on the Quandl website to find the code for this dataset.  Use this with the `start_date='2015-12-31'` keyword argument to drop everything before the last trading day of 2015.  That one 2015 row is needed to compute the change for the first trading day of 2016.

In [None]:
prices = quandl.get(..., start_date='2015-12-31')

The dataframe you get should have several hundred rows and 12 columns, with a datetime index.

In [None]:
assert prices.shape[1] == 12
print(type(prices.index))
prices.head()

The only column we need is the "Adj. Close" column, which is adjusted for corporate actions like dividends and splits.  Use the `.pct_change()` method on this column to get the daily fractional change and then adjust it to be a percentage.
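
As a small illustration of what `.pct_change()` returns, here is a sketch on a made-up price series (the numbers are not Quandl data):

```python
import pandas as pd

# Toy closing prices (made-up values).
close = pd.Series([100.0, 102.0, 99.96])

# pct_change() gives the fractional change from the previous row;
# the first row has no predecessor, so its result is NaN.
fractional = close.pct_change()

# Convert fractions to percentages.
percent = fractional * 100
print(percent)
```

The second value is roughly 2 and the third roughly -2, that is, a 2% rise followed by a 2% fall.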

In [None]:
close_change = ...

The dates are the index of the dataframe.  They can be made into a column with the `.reset_index()` method, or accessed directly via the `.index` property of the dataframe or the `close_change` series.  Once you have the dates, use the `.strftime()` method to format them as strings, with [these directives](http://strftime.org/).

In [None]:
date_str = ...

Finally, combine these two results into a list of tuples.  There are several approaches to this, including:
1. Make the two series into columns in the same dataframe.  Use a list comprehension over the [`.itertuples()`](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.itertuples.html#pandas.DataFrame.itertuples) generator to produce a list of tuples.
2. Use [`zip()`](https://docs.python.org/2/library/functions.html#zip) to join the two series into a list of tuples.
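
Both approaches can be sketched on toy data. The two series below are made-up stand-ins for `date_str` and `close_change`, and the column names `Date` and `Change` are chosen only for this illustration.

```python
import pandas as pd

# Toy inputs standing in for date_str and close_change.
dates = pd.Series(["1/04/16", "1/05/16"])
changes = pd.Series([1.5, -0.5])

# Approach 1: put both in one dataframe and iterate over its rows.
frame = pd.DataFrame({"Date": dates, "Change": changes})
via_itertuples = [(row.Date, row.Change)
                  for row in frame.itertuples(index=False)]

# Approach 2: pair the elements up directly.
# list() is needed on Python 3, where zip() returns an iterator.
via_zip = list(zip(dates, changes))

print(via_itertuples)
print(via_zip)
```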

The cell below uses the [`zip()`](https://docs.python.org/2/library/functions.html#zip) approach.

In [None]:
wiki_data_tuples = list(zip(date_str, close_change))

print(wiki_data_tuples[0])
print(len(wiki_data_tuples))

There are two problems with this:
1. The first element is a NaN from the last day of 2015.
2. There are more than 100 tuples.

Select the first 100 days of 2016, and submit those to the grader.
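
A list slice can handle both problems at once. The sketch below uses a made-up four-element list and keeps two elements instead of 100 so that the output stays readable.

```python
# Toy list standing in for the full list of (date, percentage) tuples.
all_tuples = [("12/31/15", float("nan")), ("1/04/16", 1.0),
              ("1/05/16", -2.0), ("1/06/16", 0.5)]

n = 2  # the project needs 100; 2 keeps the toy example short

# Skip index 0 (the NaN from the last day of 2015), keep the next n.
selected = all_tuples[1:1 + n]
print(selected)  # [('1/04/16', 1.0), ('1/05/16', -2.0)]
```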

In [None]:
wiki_data_tuples_all = list(zip(date_str, close_change))
wiki_data_tuples = ...

def wiki_data():
    return wiki_data_tuples

grader.score('quandl__wiki_data', wiki_data)

## Question 2: state_industry_pairs
The rest of the questions will use data provided by the US [Bureau of Labor Statistics](https://www.quandl.com/data/BLSE?keyword=). Among other things, they track monthly employment numbers by industry for each state.

We are specifically interested in their *State and Area Employment, Hours, and Earnings* data, as described in their [documentation](https://www.quandl.com/data/BLSE/documentation/documentation). The documentation describes the *Code Nomenclature* for data files for all of the permutations of states and industries and seasonally/not seasonally adjusted data.

Each of these datasets looks like this one:

https://www.quandl.com/data/BLSE/SMS01000004300000001-All-Employees-In-Thousands-Transportation-and-Utilities-Alabama

For these questions you will need to use all of the datasets in this subgroup of the BLS database. There will be about 1118 datasets in total. To obtain all of them, you could use the information provided in the documentation and query each Quandl code permutation. Another approach is to download a (zipped) database metadata file from this URL using the Linux command `wget`:

In [None]:
%%bash
wget https://www.quandl.com/api/v3/databases/BLSE/codes.zip -O codes.zip -nc
unzip -u codes.zip

Now, we can load all of the codes into a dataframe:

In [None]:
codes = pd.read_csv('BLSE-datasets-codes.csv', header=None,
                    names=('Code', 'Description'))
codes.head()

All of the datasets with employment information have a description that begins, "All Employees".  Create a new data frame that contains only those rows.  There should be 1118 in total.
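
Row selection like this is usually done with a boolean mask. Here is a minimal sketch on a made-up three-row table; the codes and descriptions are invented for illustration, and `.str.startswith()` is only one of several string methods that could build the mask.

```python
import pandas as pd

# Made-up rows that imitate the layout of the codes file.
codes_toy = pd.DataFrame({
    "Code": ["BLSE/A", "BLSE/B", "BLSE/C"],
    "Description": [
        "All Employees, In Thousands; Manufacturing - Ohio",
        "Average Weekly Hours; Manufacturing - Ohio",
        "All Employees, In Thousands; Construction - Ohio",
    ],
})

# One True/False per row: does the description start with the prefix?
mask = codes_toy["Description"].str.startswith("All Employees")

# Indexing with the mask keeps only the rows marked True.
subset = codes_toy[mask]
print(subset)
```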

In [None]:
valid_rows = ...
valid_codes = codes[valid_rows]

In [None]:
assert valid_codes.shape[0] == 1118

We want to download and store the data from each of those datasets.  Let's start by downloading the data from the beginning of 2006 to the end of 2015, for a single set:

In [None]:
code = 'BLSE/SMS54000003000000001'
description = 'All Employees, In Thousands; Manufacturing - West Virginia'

df = ...

Because you will need to combine this data with the other sets, add columns for the state, category, and a flag for whether the data are adjusted.  If `df` is a dataframe, a constant column can be added with
```python
df['State'] = pd.Series('West Virginia', index=df.index)
```
Instead of hard-coding the values, though, work them out either through the code, as [described](https://www.quandl.com/data/BLSE-BLS-Employment-Unemployment/documentation/documentation), or the description.
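
A rough sketch of both routes follows. It assumes that every description has the shape `"...; <category> - <state>"` and that the third character of the series ID is `S` for seasonally adjusted data and `U` for unadjusted data. Please check both assumptions against the linked documentation. The function names are ours.

```python
code = "BLSE/SMS54000003000000001"
description = "All Employees, In Thousands; Manufacturing - West Virginia"


def parse_description(description):
    # Assumption: the part after "; " reads "<category> - <state>".
    detail = description.split("; ", 1)[1]
    # Split on the last " - " in case a category contains one itself.
    category, state = detail.rsplit(" - ", 1)
    return state, category


def is_adjusted(code):
    # Assumption: series IDs start with "SMS" (adjusted) or "SMU" (not).
    series_id = code.split("/", 1)[1]
    return series_id[2] == "S"


print(parse_description(description))  # ('West Virginia', 'Manufacturing')
print(is_adjusted(code))               # True
```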

In [None]:
state = ...
category = ...
adjusted = ...

df['State'] = pd.Series(state, index=df.index)
df['Category'] = pd.Series(category, index=df.index)
df['Adjusted'] = pd.Series(adjusted, index=df.index)

Now, combine this code into a single function, for future reuse.  It's best to have this function write the dataframe to a file.  This way, if a data retrieval fails, you can rerun just that dataset.  If you need to restart the notebook, you won't need to download all of the data again.

You can use Pandas' `DataFrame.to_pickle()` method and `pd.read_pickle()` function, or another mechanism.  The checkpoint library [ediblepickle](https://pypi.python.org/pypi/ediblepickle/1.1.3) could also be used to streamline the process so that the time-consuming code will only be run when necessary.
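
The caching idea can be sketched as follows, using `DataFrame.to_pickle()` and `pd.read_pickle()`. The helper `load_or_build` and the toy `build_toy` function are hypothetical; in the project, the build step would be the Quandl download.

```python
import os
import tempfile

import pandas as pd


def load_or_build(path, build):
    """Return the dataframe cached at `path`, building and saving it if absent."""
    if os.path.exists(path):
        return pd.read_pickle(path)
    df = build()
    df.to_pickle(path)
    return df


calls = []


def build_toy():
    calls.append(1)  # record each time the expensive step really runs
    return pd.DataFrame({"Value": [1.0, 2.0]})


cache_path = os.path.join(tempfile.mkdtemp(), "toy.pkl")
first = load_or_build(cache_path, build_toy)   # builds and saves
second = load_or_build(cache_path, build_toy)  # reads the saved file
print(len(calls))  # 1
```

Using one file per dataset code means a failed download can be retried without touching the others.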

In [None]:
def get_data(code, description):
    # Download data
    # Add columns
    # Save locally
    # Return the dataframe
    return df

get_data(code, description).head()

After you've tested that function for several datasets, write a loop to download all of the data sets.

That loop may send requests faster than Quandl's rate limit allows. To slow it down, you can tell Python to `sleep` for a short time between requests to keep it under the threshold.
```python
import time
time.sleep(0.1)  # sleep for 0.1 seconds (100 ms)
```

If you add that to your function above, we can load all of the data into a single dataframe with the `concat()` function.

In [None]:
df_all = pd.concat(get_data(code, description) for code, description
                   in valid_codes.itertuples(index=False))

Each question will pertain to either the unadjusted or the adjusted data.  You may find it easier to have each in its own dataframe.  Remove the *Total Private* and *Total Nonfarm* data, as these statistics are aggregations, not industries.
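
One way to do this kind of filtering is with `.isin()` and boolean masks. The sketch below runs on a made-up frame and assumes the `Adjusted` column holds booleans; adapt it if you stored the flag differently.

```python
import pandas as pd

# Made-up rows; the values are not real employment numbers.
toy = pd.DataFrame({
    "Category": ["Total Nonfarm", "Manufacturing",
                 "Total Private", "Construction"],
    "Adjusted": [True, True, False, False],
    "Value": [10.0, 2.0, 8.0, 1.0],
})

aggregates = ["Total Private", "Total Nonfarm"]

# ~ negates the mask: keep rows whose category is NOT an aggregate.
industries = toy[~toy["Category"].isin(aggregates)]

# Split on the boolean flag.
toy_adj = industries[industries["Adjusted"]]
toy_raw = industries[~industries["Adjusted"]]
print(toy_adj)
print(toy_raw)
```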

In [None]:
df_raw = ... # Unadjusted data
df_adj = ... # Adjusted data

For this question, use the *unadjusted data* to find the 100 largest state-industry pairs for December 2015.
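
The two steps involved (select one month, then take the largest rows) can be sketched on toy data. The frame below is made up, and the example keeps the top 2 rather than the top 100.

```python
import pandas as pd

# Made-up rows with a datetime index, like the Quandl dataframes.
toy = pd.DataFrame(
    {"State": ["A", "A", "B", "B"],
     "Category": ["X", "X", "X", "Y"],
     "Value": [5.0, 7.0, 9.0, 3.0]},
    index=pd.to_datetime(["2015-11-30", "2015-12-31",
                          "2015-12-31", "2015-12-31"]),
)

# Boolean mask built from the year and month of the index.
in_dec15 = (toy.index.year == 2015) & (toy.index.month == 12)
dec15_toy = toy[in_dec15]

# nlargest() sorts by the given column and keeps the first n rows.
top2 = dec15_toy.nlargest(2, "Value")
print(top2)
```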

In [None]:
# Select out only the results from 12/2015
dec15 = ...
# Sort them by 'Value' and choose the top 100
top100 = ...

Your answer should consist of 100 tuples of states, industry names, and employment numbers, like this: ((State, Industry), Employment #)

The State and Industry names will be strings, the same as you see in the documentation.

The Employment numbers will be the number of people employed on that date. Note the data is provided to you in thousands, so you will have to do some multiplication.

We can do this with a list comprehension over `top100.itertuples()`.
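
To illustrate the target format, here is a sketch on a made-up two-row frame. The Texas value is invented, and converting to `int` after rounding is our assumption about what the grader expects, based on the integer in the placeholder answer.

```python
import pandas as pd

# Made-up stand-in for top100; 'Value' is in thousands of people.
top_toy = pd.DataFrame({
    "State": ["California", "Texas"],
    "Category": ["Service-Providing", "Service-Providing"],
    "Value": [14352.6, 10000.5],
})

# itertuples() yields one named tuple per row, so columns are attributes.
pairs = [((row.State, row.Category), int(round(row.Value * 1000)))
         for row in top_toy.itertuples(index=False)]
print(pairs)
```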

In [None]:
state_industry_tuples = [...
                         for x in top100.itertuples(index=False)]

In this and the following questions, we give you a placeholder in the score function, so you can check that you understand the format of the answer.  Replace the return statement with one that returns `state_industry_tuples`.

In [None]:
def state_industry_pairs():
    return [(('California', 'Service-Providing'), 14352600)] * 100

grader.score('quandl__state_industry_pairs', state_industry_pairs)

## Question 3: state_total_employed
Using the unadjusted data, what is the total number of employed people in December 2015, by state?

Your answer should consist of 53 tuples of states and employment numbers, like this: (State, Employment #)

That's 50 states, plus Washington DC, Puerto Rico, and the Virgin Islands.

In [None]:
def state_total_employed():
    return [('Alabama', 2965000)] * 53

grader.score('quandl__state_total_employed', state_total_employed)

## Question 4: state_industry_growth
Using the unadjusted data, for each state, which industry saw the largest percent growth from December 2006 to December 2015?

Your answer should consist of 53 tuples of states, industries, and percentages, like this: ((State, Industry), Percentage).

The State and Industry names will be strings, the same as you see in the documentation.

The Percentage will be a percentage, not a fraction. Submit a return of 1.5% as 1.5, not 0.015.

Start by getting the data from December 2006.

In [None]:
dec06 = ...

We want to compare rows in the `dec06` and `dec15` dataframes that have the same state and category.  When operations are conducted on dataframes, rows are matched by index.  Indices can have multiple levels.  Use the `.set_index()` method with a list as an argument to achieve this.
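
Index alignment can be seen in a small sketch. The two toy frames below list the same (state, category) pairs in a different row order, yet the arithmetic still matches them up by index label. All values are made up.

```python
import pandas as pd

toy06 = pd.DataFrame({"State": ["A", "A"],
                      "Category": ["X", "Y"],
                      "Value": [100.0, 50.0]})
# Same pairs as toy06, but in the opposite row order.
toy15 = pd.DataFrame({"State": ["A", "A"],
                      "Category": ["Y", "X"],
                      "Value": [60.0, 110.0]})

# A list argument to set_index() creates a two-level index.
v06 = toy06.set_index(["State", "Category"])["Value"]
v15 = toy15.set_index(["State", "Category"])["Value"]

# Rows are paired by (State, Category), not by position.
growth_toy = (v15 - v06) / v06 * 100
print(growth_toy)
```

Here X grows from 100 to 110 (10%) and Y from 50 to 60 (20%), regardless of the row order in the inputs.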

In [None]:
val06 = ...
val15 = ...

Now, we can do math directly on the dataframes.

In [None]:
growth = ...

To choose the largest for each state, we need to group the rows by state.  To do this, first we have to change the indices back to columns with `.reset_index()`, and then use `.groupby()`.

In [None]:
by_state = ...
assert type(by_state) == pd.core.groupby.DataFrameGroupBy

This `DataFrameGroupBy` object records which rows belong together, but hasn't done any calculations on the groups.  We can pull out a group for analysis.

In [None]:
alabama = by_state.get_group('Alabama')
alabama

Write a function that takes this dataframe and returns the row with the maximum value.

In [None]:
def largest_value(df):
    ...

assert largest_value(alabama)['Category'] == 'Transportation and Utilities'

Now we can use the group-by object's `.apply()` method to apply this to each group.
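
The group-then-apply pattern can be sketched on a toy frame. The column name `Growth` and the helper `row_with_max` are assumptions made for this illustration; the example also selects the non-grouping columns before calling `.apply()`, which is a choice made here to sidestep version differences in how Pandas passes the grouping column to the function.

```python
import pandas as pd

# Made-up growth figures for two states and two categories.
toy = pd.DataFrame({
    "State": ["A", "A", "B", "B"],
    "Category": ["X", "Y", "X", "Y"],
    "Growth": [1.0, 3.0, 5.0, 2.0],
})


def row_with_max(group, column="Growth"):
    # idxmax() returns the index label of the largest value;
    # .loc then fetches that whole row as a Series.
    return group.loc[group[column].idxmax()]


# The function runs once per state; results are stacked into one
# dataframe indexed by state.
best = toy.groupby("State")[["Category", "Growth"]].apply(row_with_max)
print(best)
```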

In [None]:
fastest_by_state = by_state.apply(largest_value)

Now, convert this dataframe to a list of tuples in the correct form.

In [None]:
def state_industry_growth():
    return [(('Alabama', 'Transportation and Utilities'), 4.5769764216366138)] * 53

grader.score('quandl__state_industry_growth', state_industry_growth)

## Question 5: max_employment
Using the unadjusted data, find the maximum employment number for each industry across the USA. That is, find the number of people employed in each industry during the month that that industry peaked in our dataset.

Your answer should consist of 16 tuples of industries and employment numbers, like this: (Industry, Employment #)

The Industry names will be strings, just like they are in the documentation.

The Employment numbers will be the total number of people employed in any state in each industry. Note the data is provided to you in thousands, so you will have to do some multiplication.

In [None]:
def max_employment():
    return [('Air Transportation', 367900)] * 16

grader.score('quandl__max_employment', max_employment)

## Question 6: quarterly_nonfarm
Using the seasonally adjusted data, what is the quarterly percent change for total non-farm employment across all states?

Use the last data-point in each quarter to represent the data for the quarter.

The first calculated percentage will be (should be) NaN, which you can exclude from your answer.

Your answer should be 39 tuples of dates and percentages, like this: (Date, Percentage)

Format the dates as strings like "2016-07-04" for July 4th or "2016-11-01" for November 1st.

The Percentage will be a percentage, not a fraction. Submit a return of 1.5% as 1.5, not 0.015.

Hint: Try using a DataFrame's `.resample()` method.
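
The sketch below shows the quarter-end logic on six made-up monthly values. Note that it groups by calendar quarter with `.to_period()` and `.groupby()` instead of `.resample()`; this is a substitute technique chosen because the spelling of the quarter-end alias passed to `.resample()` has changed between Pandas releases. With `.resample()` the idea is the same: take the last value per quarter, then compute the percent change.

```python
import pandas as pd

# Six made-up month-end values covering two quarters.
monthly = pd.Series(
    [100.0, 101.0, 102.0, 103.0, 104.0, 105.06],
    index=pd.to_datetime(["2006-01-31", "2006-02-28", "2006-03-31",
                          "2006-04-30", "2006-05-31", "2006-06-30"]),
)

# Label each row with its calendar quarter, then keep the last value.
quarter = monthly.index.to_period("Q")
quarter_end_values = monthly.groupby(quarter).last()

# Percent change between consecutive quarters; drop the leading NaN.
pct = (quarter_end_values.pct_change() * 100).dropna()

# Format each quarter's end date as YYYY-MM-DD.
result = [(period.end_time.strftime("%Y-%m-%d"), value)
          for period, value in pct.items()]
print(result)
```

The single remaining tuple is dated 2006-06-30 with a change of about 3%, since the quarter-end values are 102 and 105.06.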

In [None]:
def quarterly_nonfarm():
    return [('2006-06-30', 0.30836643956206888)] * 39

grader.score('quandl__quarterly_nonfarm', quarterly_nonfarm)

## Question 7: third_largest
Using the unadjusted data, what is the 3rd largest industry as a percentage of each state's total industry employment in December 2015?

Your answer should consist of 53 tuples of states, industries, and percentages, like this: ((State, Industry), Percentage).

The State and Industry names will be strings, the same as you see in the documentation.

The Percentage will be a percentage, not a fraction. Submit a return of 1.5% as 1.5, not 0.015.

In [None]:
def third_largest():
    return [(('Alabama', 'Goods Producing'), 11.784148397976391)] * 53

grader.score('quandl__third_largest', third_largest)

*Copyright &copy; 2016 The Data Incubator.  All rights reserved.*