**Exercise 1.1**

1) Open up a new Jupyter notebook  
2) Download a few MTA turnstile data files  
3) Open a file, read it with a csv reader, and build a Python dict with one key for each (C/A, UNIT, SCP, STATION) combination. These are the first four columns. The value for each key should be a list of lists, where each inner list holds the remaining columns of one row.
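The notebook below uses pandas instead, but the dict-of-lists structure that step 3 describes can be sketched with the standard library alone. This is a minimal sketch rather than the notebook's method: `read_turnstiles` and the inline `SAMPLE` text are hypothetical names, and the counter values are made up for illustration.

```python
import csv
import io
from collections import defaultdict

# Made-up rows in the same column layout as the MTA files
# (hypothetical sample data, not copied from a real download).
SAMPLE = """C/A,UNIT,SCP,STATION,LINENAME,DIVISION,DATE,TIME,DESC,ENTRIES,EXITS
A002,R051,02-00-00,59 ST,NQR456,BMT,08/20/2016,00:00:00,REGULAR,100,50
A002,R051,02-00-00,59 ST,NQR456,BMT,08/20/2016,04:00:00,REGULAR,130,55
A002,R051,02-00-01,59 ST,NQR456,BMT,08/20/2016,00:00:00,REGULAR,900,400
"""


def read_turnstiles(lines):
    """Map (C/A, UNIT, SCP, STATION) to a list of the remaining columns per row."""
    reader = csv.reader(lines)
    next(reader)  # skip the header row
    readings = defaultdict(list)
    for row in reader:
        if not row:
            continue  # ignore blank lines
        key = tuple(cell.strip() for cell in row[:4])
        readings[key].append([cell.strip() for cell in row[4:]])
    return dict(readings)


turnstiles = read_turnstiles(io.StringIO(SAMPLE))
```

With a real download, the same function should accept an open file object in place of the `io.StringIO` wrapper.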


In [3]:
# Setup -- I will use Pandas
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from IPython.display import Image

%matplotlib inline

pd.set_option('display.max_rows', 100)


# I chose the last four weeks, labeled from 1-4 from furthest back to most recent
week1 = pd.read_csv('http://web.mta.info/developers/data/nyct/turnstile/turnstile_160827.txt')
week2 = pd.read_csv('http://web.mta.info/developers/data/nyct/turnstile/turnstile_160903.txt')
week3 = pd.read_csv('http://web.mta.info/developers/data/nyct/turnstile/turnstile_160910.txt')
week4 = pd.read_csv('http://web.mta.info/developers/data/nyct/turnstile/turnstile_160917.txt')

weeks = [week1, week2, week3, week4]

month = pd.concat(weeks)


#Create one column from the first 4
month['LOCATION'] = month['C/A'] + "," + month['UNIT'] + "," + month['SCP'] + "," + month['STATION']

# Exclude rows from the TRAM1 and TRAM2 control areas
month = month[~month['C/A'].isin(['TRAM1', 'TRAM2'])]

# Delete superfluous columns
month = month.drop(['C/A', 'UNIT', 'SCP', 'STATION'], axis=1, errors="ignore")

# Reorder columns so LOCATION is at the front
cols = month.columns.tolist()
cols = cols[-1:] + cols[:-1]
month = month[cols]

month.info()
month.head(50)

<class 'pandas.core.frame.DataFrame'>
Int64Index: 770831 entries, 0 to 192457
Data columns (total 8 columns):
LOCATION                                                                770831 non-null object
LINENAME                                                                770831 non-null object
DIVISION                                                                770831 non-null object
DATE                                                                    770831 non-null object
TIME                                                                    770831 non-null object
DESC                                                                    770831 non-null object
ENTRIES                                                                 770831 non-null int64
EXITS                                                                   770831 non-null int64
dtypes: int64(2), object(6)
memory usage: 52.9+ MB


Unnamed: 0,LOCATION,LINENAME,DIVISION,DATE,TIME,DESC,ENTRIES,EXITS
0,"A002,R051,02-00-00,59 ST",NQR456,BMT,08/20/2016,00:00:00,REGULAR,5790246,1963095
1,"A002,R051,02-00-00,59 ST",NQR456,BMT,08/20/2016,04:00:00,REGULAR,5790275,1963101
2,"A002,R051,02-00-00,59 ST",NQR456,BMT,08/20/2016,08:00:00,REGULAR,5790284,1963123
3,"A002,R051,02-00-00,59 ST",NQR456,BMT,08/20/2016,12:00:00,REGULAR,5790377,1963178
4,"A002,R051,02-00-00,59 ST",NQR456,BMT,08/20/2016,16:00:00,REGULAR,5790605,1963230
5,"A002,R051,02-00-00,59 ST",NQR456,BMT,08/20/2016,20:00:00,REGULAR,5790926,1963287
6,"A002,R051,02-00-00,59 ST",NQR456,BMT,08/21/2016,00:00:00,REGULAR,5791095,1963314
7,"A002,R051,02-00-00,59 ST",NQR456,BMT,08/21/2016,04:00:00,REGULAR,5791132,1963319
8,"A002,R051,02-00-00,59 ST",NQR456,BMT,08/21/2016,08:00:00,REGULAR,5791141,1963334
9,"A002,R051,02-00-00,59 ST",NQR456,BMT,08/21/2016,12:00:00,REGULAR,5791217,1963404


**Exercise 1.2**

Let's turn this into a time series. For each key (the control area, unit, device address, and station of a specific turnstile), keep a list again, but let each entry consist of just the point in time and the count of entries.

This means keeping only the date, time, and entries fields in each list. You can convert the date and time into datetime objects; datetime is a Python class that represents a point in time. You can combine the date and time fields into a string and use the dateutil module to convert it into a datetime object. For an example check this StackOverflow question.
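As a rough illustration of the dict-based version of this step, the sketch below keeps only a timestamp and the entries count for each reading. It assumes the input has the Exercise 1.1 shape; `to_time_series` and the sample values are hypothetical. It also swaps in `datetime.strptime` with an explicit format in place of the dateutil parser mentioned above, so that it runs on the standard library alone.

```python
from datetime import datetime

# Hypothetical dict in the Exercise 1.1 shape: key -> list of remaining columns.
turnstiles = {
    ("A002", "R051", "02-00-00", "59 ST"): [
        ["NQR456", "BMT", "08/20/2016", "04:00:00", "REGULAR", "130", "55"],
        ["NQR456", "BMT", "08/20/2016", "00:00:00", "REGULAR", "100", "50"],
    ],
}


def to_time_series(readings):
    """Keep only [timestamp, cumulative entries] per reading, sorted by time."""
    series = {}
    for key, rows in readings.items():
        points = []
        for _linename, _division, date, time, _desc, entries, _exits in rows:
            # Join the DATE and TIME strings, then parse them in one step.
            stamp = datetime.strptime(f"{date} {time}", "%m/%d/%Y %H:%M:%S")
            points.append([stamp, int(entries)])
        series[key] = sorted(points)
    return series


time_series = to_time_series(turnstiles)
```

Sorting each list by timestamp matters for the next exercise, where consecutive readings are subtracted from each other.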

In [4]:
# Merge date and time into a single column, then parse the strings into datetime objects.
# An explicit format avoids relying on infer_datetime_format, which is deprecated in pandas 2.x.
month['DATE/TIME'] = month['DATE'] + " " + month['TIME']
month['DATE/TIME'] = pd.to_datetime(month['DATE/TIME'], format='%m/%d/%Y %H:%M:%S')

# Eliminate superfluous columns & reorder
month = month[['LOCATION', 'DATE/TIME', 'ENTRIES']]

month.info()
month.head(20)

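# NOTE: I first tried to drop the superfluous columns with the line below. It dropped every column
# except 'EXITS'. The likely cause is visible in the info() output from Exercise 1.1: the EXITS header
# in the raw file is padded with trailing spaces, so the name 'EXITS' does not match and
# errors="ignore" skips it silently. Stripping the names right after loading, with
# month.columns = month.columns.str.strip(), should avoid this.
#month = month.drop(['LINENAME', 'DIVISION', 'DESC', 'EXITS'], axis=1, errors="ignore")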


<class 'pandas.core.frame.DataFrame'>
Int64Index: 770831 entries, 0 to 192457
Data columns (total 3 columns):
LOCATION     770831 non-null object
DATE/TIME    770831 non-null datetime64[ns]
ENTRIES      770831 non-null int64
dtypes: datetime64[ns](1), int64(1), object(1)
memory usage: 23.5+ MB


Unnamed: 0,LOCATION,DATE/TIME,ENTRIES
0,"A002,R051,02-00-00,59 ST",2016-08-20 00:00:00,5790246
1,"A002,R051,02-00-00,59 ST",2016-08-20 04:00:00,5790275
2,"A002,R051,02-00-00,59 ST",2016-08-20 08:00:00,5790284
3,"A002,R051,02-00-00,59 ST",2016-08-20 12:00:00,5790377
4,"A002,R051,02-00-00,59 ST",2016-08-20 16:00:00,5790605
5,"A002,R051,02-00-00,59 ST",2016-08-20 20:00:00,5790926
6,"A002,R051,02-00-00,59 ST",2016-08-21 00:00:00,5791095
7,"A002,R051,02-00-00,59 ST",2016-08-21 04:00:00,5791132
8,"A002,R051,02-00-00,59 ST",2016-08-21 08:00:00,5791141
9,"A002,R051,02-00-00,59 ST",2016-08-21 12:00:00,5791217


In [6]:
# Sort the data by LOCATION and DATE/TIME, then renumber the index
month = month.sort_values(['LOCATION', 'DATE/TIME'])

month = month.reset_index(drop=True)

month

Unnamed: 0,LOCATION,DATE/TIME,ENTRIES
0,"A002,R051,02-00-00,59 ST",2016-08-20 00:00:00,5790246
1,"A002,R051,02-00-00,59 ST",2016-08-20 04:00:00,5790275
2,"A002,R051,02-00-00,59 ST",2016-08-20 08:00:00,5790284
3,"A002,R051,02-00-00,59 ST",2016-08-20 12:00:00,5790377
4,"A002,R051,02-00-00,59 ST",2016-08-20 16:00:00,5790605
5,"A002,R051,02-00-00,59 ST",2016-08-20 20:00:00,5790926
6,"A002,R051,02-00-00,59 ST",2016-08-21 00:00:00,5791095
7,"A002,R051,02-00-00,59 ST",2016-08-21 04:00:00,5791132
8,"A002,R051,02-00-00,59 ST",2016-08-21 08:00:00,5791141
9,"A002,R051,02-00-00,59 ST",2016-08-21 12:00:00,5791217


**Exercise 1.3**

These counts are for every n hours. (What is n?) We want total daily entries.  
>> n = 4: the readings shown above are four hours apart.

Now make it so that we again have the same keys, but with a single value per day: the total number of passengers who entered through this turnstile on that day.
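Because ENTRIES is a running counter, a daily total has to be built from the differences between consecutive readings of the same turnstile. The sketch below shows one way to do that on the dict-of-lists time series. `daily_entries` and the sample readings are hypothetical, and two choices are assumptions rather than requirements of the exercise: each difference is credited to the day on which its period starts, and negative differences are treated as counter resets and discarded.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical series in the Exercise 1.2 shape:
# key -> [[datetime, cumulative entries], ...], already sorted by time.
time_series = {
    ("A002", "R051", "02-00-00", "59 ST"): [
        [datetime(2016, 8, 20, 0), 100],
        [datetime(2016, 8, 20, 4), 130],
        [datetime(2016, 8, 20, 8), 150],
        [datetime(2016, 8, 21, 0), 400],
        [datetime(2016, 8, 21, 4), 5],   # counter reset: this drop is discarded
        [datetime(2016, 8, 21, 8), 25],
    ],
}


def daily_entries(series):
    """Sum per-period entry counts by calendar day, separately for each turnstile."""
    totals = {}
    for key, points in series.items():
        per_day = defaultdict(int)
        # Pair each reading with the next reading of the SAME turnstile.
        for (start, before), (_end, after) in zip(points, points[1:]):
            delta = after - before
            if delta >= 0:  # a negative delta is treated as a counter reset
                per_day[start.date()] += delta
        totals[key] = dict(per_day)
    return totals


daily = daily_entries(time_series)
```

Iterating per key means no difference is ever taken between the last reading of one turnstile and the first reading of the next.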

In [7]:
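# Step 1: difference the cumulative counts to get the entries per period.
# Caveat: diff() runs over the whole column, so the first row of each turnstile is
# differenced against the last row of the previous turnstile.
month['PERIOD ENTRIES'] = month['ENTRIES'].diff()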

month.head(50)

Unnamed: 0,LOCATION,DATE/TIME,ENTRIES,PERIOD ENTRIES
0,"A002,R051,02-00-00,59 ST",2016-08-20 00:00:00,5790246,
1,"A002,R051,02-00-00,59 ST",2016-08-20 04:00:00,5790275,29.0
2,"A002,R051,02-00-00,59 ST",2016-08-20 08:00:00,5790284,9.0
3,"A002,R051,02-00-00,59 ST",2016-08-20 12:00:00,5790377,93.0
4,"A002,R051,02-00-00,59 ST",2016-08-20 16:00:00,5790605,228.0
5,"A002,R051,02-00-00,59 ST",2016-08-20 20:00:00,5790926,321.0
6,"A002,R051,02-00-00,59 ST",2016-08-21 00:00:00,5791095,169.0
7,"A002,R051,02-00-00,59 ST",2016-08-21 04:00:00,5791132,37.0
8,"A002,R051,02-00-00,59 ST",2016-08-21 08:00:00,5791141,9.0
9,"A002,R051,02-00-00,59 ST",2016-08-21 12:00:00,5791217,76.0


In [8]:
month.tail(50)

Unnamed: 0,LOCATION,DATE/TIME,ENTRIES,PERIOD ENTRIES
770781,"S102,R165,00-05-01,TOMPKINSVILLE",2016-09-09 02:00:00,46,0.0
770782,"S102,R165,00-05-01,TOMPKINSVILLE",2016-09-09 06:00:00,46,0.0
770783,"S102,R165,00-05-01,TOMPKINSVILLE",2016-09-09 10:00:00,46,0.0
770784,"S102,R165,00-05-01,TOMPKINSVILLE",2016-09-09 14:00:00,46,0.0
770785,"S102,R165,00-05-01,TOMPKINSVILLE",2016-09-09 18:00:00,46,0.0
770786,"S102,R165,00-05-01,TOMPKINSVILLE",2016-09-09 22:00:00,46,0.0
770787,"S102,R165,00-05-01,TOMPKINSVILLE",2016-09-10 02:00:00,46,0.0
770788,"S102,R165,00-05-01,TOMPKINSVILLE",2016-09-10 06:00:00,46,0.0
770789,"S102,R165,00-05-01,TOMPKINSVILLE",2016-09-10 10:00:00,46,0.0
770790,"S102,R165,00-05-01,TOMPKINSVILLE",2016-09-10 14:00:00,46,0.0


In [9]:
# Shift the differences up by one row so each value sits on the reading that starts its period.
# The period totals for one day then all carry that day's date.
month['PERIOD ENTRIES'] = month['PERIOD ENTRIES'].shift(periods = -1)
month.head(100)

Unnamed: 0,LOCATION,DATE/TIME,ENTRIES,PERIOD ENTRIES
0,"A002,R051,02-00-00,59 ST",2016-08-20 00:00:00,5790246,29.0
1,"A002,R051,02-00-00,59 ST",2016-08-20 04:00:00,5790275,9.0
2,"A002,R051,02-00-00,59 ST",2016-08-20 08:00:00,5790284,93.0
3,"A002,R051,02-00-00,59 ST",2016-08-20 12:00:00,5790377,228.0
4,"A002,R051,02-00-00,59 ST",2016-08-20 16:00:00,5790605,321.0
5,"A002,R051,02-00-00,59 ST",2016-08-20 20:00:00,5790926,169.0
6,"A002,R051,02-00-00,59 ST",2016-08-21 00:00:00,5791095,37.0
7,"A002,R051,02-00-00,59 ST",2016-08-21 04:00:00,5791132,9.0
8,"A002,R051,02-00-00,59 ST",2016-08-21 08:00:00,5791141,76.0
9,"A002,R051,02-00-00,59 ST",2016-08-21 12:00:00,5791217,168.0


In [11]:
# Check how often the values are negative
negatives = month['PERIOD ENTRIES'] < 0
negatives.sum()
# This returns 7861, about 1% of the 770,831 rows. That count includes the values that have to be
# discarded anyway because they were differenced across two turnstiles.

7861

In [12]:
# Drop negative values of 'PERIOD ENTRIES' (these come from turnstile boundaries and counter resets).
# This also drops the final row, whose value is NaN after the shift, but it keeps any
# positive difference that was computed across two turnstiles.
month = month[month['PERIOD ENTRIES'] >= 0]

# Renumber the index again
month = month.reset_index(drop=True)
month

Unnamed: 0,LOCATION,DATE/TIME,ENTRIES,PERIOD ENTRIES
0,"A002,R051,02-00-00,59 ST",2016-08-20 00:00:00,5790246,29.0
1,"A002,R051,02-00-00,59 ST",2016-08-20 04:00:00,5790275,9.0
2,"A002,R051,02-00-00,59 ST",2016-08-20 08:00:00,5790284,93.0
3,"A002,R051,02-00-00,59 ST",2016-08-20 12:00:00,5790377,228.0
4,"A002,R051,02-00-00,59 ST",2016-08-20 16:00:00,5790605,321.0
5,"A002,R051,02-00-00,59 ST",2016-08-20 20:00:00,5790926,169.0
6,"A002,R051,02-00-00,59 ST",2016-08-21 00:00:00,5791095,37.0
7,"A002,R051,02-00-00,59 ST",2016-08-21 04:00:00,5791132,9.0
8,"A002,R051,02-00-00,59 ST",2016-08-21 08:00:00,5791141,76.0
9,"A002,R051,02-00-00,59 ST",2016-08-21 12:00:00,5791217,168.0


In [None]:
# COMPLETE TO HERE

In [None]:


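# groupby() alone returns a lazy GroupBy object, and head(50) on it returns the first 50 rows of
# every group rather than a summary, so the result looks like the ungrouped data.
# An aggregation such as sum() is needed to get one row per group.
grouped = month.groupby([month['DATE/TIME'].dt.date, month['LOCATION']])
grouped.head(50)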

In [None]:


# Create a new dataframe with the total entries per turnstile per day.
# The raw ENTRIES column can't just be summed because it is a cumulative counter, so only the
# per-period differences are summed. reset_index() turns the group keys (turnstile and date)
# back into ordinary columns instead of leaving them as the index.
# An alternative I considered: keep only the first reading of each day for each turnstile
# and difference those.
day = month['DATE/TIME'].dt.date.rename('DATE')
entries_by_day = month.groupby(['LOCATION', day])['PERIOD ENTRIES'].sum().reset_index()
entries_by_day = entries_by_day.rename(columns={'PERIOD ENTRIES': 'DAILY ENTRIES'})

entries_by_day.head(20)