# Instructions for using clean2.py

``clean2.py`` essentially contains one function: ``data_wrangling()``. It takes an optional list of week numbers (the ``week_nums`` parameter) that determines the timeframe of the MTA turnstile data to pull.

The call in the first code cell below returns two dataframes:
- ``df_turnstiles``: contains all turnstile data
- ``df_ampm``: contains the same data, this time broken down by AM/PM
    - 2 entries for each station, for each day (one for AM, one for PM)
    
**NOTE 1**: This part requires a key for Google's Geocoding API. The key shown below will be deactivated before this notebook is made public. Get one yourself (it's free\*, fast and easy) [here](https://developers.google.com/maps/documentation/geocoding/start). \*Okay, it's free if you're only using it here. I think you would be charged if you ran the cell below hundreds of times or used the key for other applications.

Then pass the API key (as a string) to the ``geocode_api_key`` parameter of the ``data_wrangling`` function used in the next cell.
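
To keep the key out of the notebook itself, one option is to read it from an environment variable. This is a minimal sketch and not part of ``clean2.py``: the helper name ``load_geocode_key`` and the variable name ``GEOCODE_API_KEY`` are assumptions made up for illustration.

```python
import os


def load_geocode_key(env_var="GEOCODE_API_KEY"):
    """Return the API key stored in an environment variable.

    Both this helper and the default variable name are hypothetical;
    neither is defined in clean2.py.
    """
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(f"Set the {env_var} environment variable first.")
    return key
```

With the variable set, the key could then be passed as ``data_wrangling(geocode_api_key=load_geocode_key())``.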

**NOTE 2**: The following cell takes 4-7 minutes to run. By default, the program pulls 8 weeks of MTA turnstile data. You can pass fewer weeks through the ``week_nums`` parameter of ``data_wrangling``, but this will only save you 1-2 minutes, since most of the time is spent on geocode API calls (see [wtwy_data_merge.ipynb](https://github.com/edubu2/metis-project1/blob/main/code/wtwy_data_merge.ipynb) for details).

In [1]:
from clean2 import data_wrangling

# replace the geocode_api_key value below with your own valid key
df_turnstiles, df_ampm = data_wrangling(geocode_api_key='AIzaSyCGo0NcvTdM9fgFx7y8BzShf5OJLHH562U')

**Let's take a look at ``df_turnstiles``**:

In [2]:
df_turnstiles.sample(10)

| | DATE_TIME | C/A | UNIT | SCP | STATION | LINENAME | DIVISION | DATE | TIME | ENTRIES | EXITS | AMPM | DAY_NAME | ZIPCODE | ZIPCODE_AGI |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 151985 | 2019-11-25 12:00:00 | R208 | R014 | 03-00-01 | FULTON ST | 2345ACJZ | IRT | 11/25/2019 | 12:00:00 | 27691 | 64950 | PM | Monday | | |
| 16077 | 2019-11-12 23:00:00 | B021 | R228 | 00-00-00 | AVENUE J | BQ | BMT | 11/12/2019 | 23:00:00 | 9390032 | 7846874 | PM | Tuesday | 11230.0 | 2773769.0 |
| 159911 | 2019-11-13 08:00:00 | R240 | R047 | 00-03-07 | GRD CNTRL-42 ST | 4567S | IRT | 11/13/2019 | 08:00:00 | 8827535 | 2478268 | AM | Wednesday | | |
| 16011 | 2019-11-29 07:00:00 | B020 | R263 | 00-06-00 | AVENUE H | BQ | BMT | 11/29/2019 | 07:00:00 | 29249 | 222447 | AM | Friday | 11230.0 | 2773769.0 |
| 166613 | 2019-11-30 11:00:00 | R253 | R181 | 00-00-01 | 110 ST | 6 | IRT | 11/30/2019 | 11:00:00 | 5599505 | 8824400 | AM | Saturday | 10029.0 | 2022052.0 |
| 67192 | 2019-11-08 04:00:00 | N120 | R153 | 00-00-01 | UTICA AV | AC | IND | 11/08/2019 | 04:00:00 | 5859148 | 6875235 | AM | Friday | 11213.0 | 1378248.0 |
| 200779 | 2019-11-10 04:00:00 | R639 | R109 | 00-00-03 | CHURCH AV | 25 | IRT | 11/10/2019 | 04:00:00 | 22774690 | 2410333 | AM | Sunday | 11226.0 | 2466184.0 |
| 79028 | 2019-12-13 03:00:00 | N305A | R016 | 00-05-01 | LEXINGTON AV/53 | EM6 | IND | 12/13/2019 | 03:00:00 | 6 | 1864 | AM | Friday | | |
| 10537 | 2019-11-02 00:00:00 | A055 | R227 | 00-00-03 | RECTOR ST | NRW | BMT | 11/02/2019 | 00:00:00 | 3274435 | 1085644 | AM | Saturday | 10006.0 | 577145.0 |
| 181196 | 2019-12-13 00:00:00 | R422 | R428 | 00-05-00 | BUHRE AV | 6 | IRT | 12/13/2019 | 00:00:00 | 517 | 0 | AM | Friday | 10461.0 | 1346454.0 |
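
The ``DATE_TIME``, ``AMPM`` and ``DAY_NAME`` columns look like they are derived from ``DATE`` and ``TIME``. The sketch below shows one way to reproduce them with pandas. It is inferred from the sample rows only, not taken from ``clean2.py``, and the noon cutoff for AM/PM is an assumption that happens to match the sample.

```python
import pandas as pd

# DATE and TIME values copied from four of the sample rows above.
sample = pd.DataFrame({
    "DATE": ["11/25/2019", "11/12/2019", "11/13/2019", "11/02/2019"],
    "TIME": ["12:00:00", "23:00:00", "08:00:00", "00:00:00"],
})

# Combine the two text columns into a single timestamp.
sample["DATE_TIME"] = pd.to_datetime(
    sample["DATE"] + " " + sample["TIME"], format="%m/%d/%Y %H:%M:%S"
)

# Assumption: readings taken before noon count as AM, all others as PM.
sample["AMPM"] = sample["DATE_TIME"].dt.hour.map(
    lambda hour: "AM" if hour < 12 else "PM"
)
sample["DAY_NAME"] = sample["DATE_TIME"].dt.day_name()

print(sample[["DATE_TIME", "AMPM", "DAY_NAME"]])
```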


**And now ``df_ampm``**:

In [14]:
df_ampm.head(35)

| | C/A | UNIT | STATION | DATE | AMPM | DAY_NAME | ENTRIES | EXITS | PREV_DATE | PREV_ENTRIES | PREV_EXITS | TMP_ENTRIES | TMP_EXITS | TRAFFIC | ZIPCODE | ZIPCODE_AGI |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | A002 | R051 | 59 ST | 11/02/2019 | PM | Saturday | 11714240 | 8552046 | 11/02/2019 | 11713245.0 | 8551395.0 | 995.0 | 651.0 | 1646.0 | | |
| 2 | A002 | R051 | 59 ST | 11/03/2019 | AM | Sunday | 11714756 | 8552455 | 11/02/2019 | 11714240.0 | 8552046.0 | 516.0 | 409.0 | 925.0 | | |
| 3 | A002 | R051 | 59 ST | 11/03/2019 | PM | Sunday | 11715887 | 8553273 | 11/03/2019 | 11714756.0 | 8552455.0 | 1131.0 | 818.0 | 1949.0 | | |
| 4 | A002 | R051 | 59 ST | 11/04/2019 | AM | Monday | 11716360 | 8554098 | 11/03/2019 | 11715887.0 | 8553273.0 | 473.0 | 825.0 | 1298.0 | | |
| 5 | A002 | R051 | 59 ST | 11/04/2019 | PM | Monday | 11717806 | 8554809 | 11/04/2019 | 11716360.0 | 8554098.0 | 1446.0 | 711.0 | 2157.0 | | |
| 6 | A002 | R051 | 59 ST | 11/05/2019 | AM | Tuesday | 11718319 | 8555630 | 11/04/2019 | 11717806.0 | 8554809.0 | 513.0 | 821.0 | 1334.0 | | |
| 7 | A002 | R051 | 59 ST | 11/05/2019 | PM | Tuesday | 11719843 | 8556402 | 11/05/2019 | 11718319.0 | 8555630.0 | 1524.0 | 772.0 | 2296.0 | | |
| 8 | A002 | R051 | 59 ST | 11/06/2019 | AM | Wednesday | 11720383 | 8557257 | 11/05/2019 | 11719843.0 | 8556402.0 | 540.0 | 855.0 | 1395.0 | | |
| 9 | A002 | R051 | 59 ST | 11/06/2019 | PM | Wednesday | 11721866 | 8558037 | 11/06/2019 | 11720383.0 | 8557257.0 | 1483.0 | 780.0 | 2263.0 | | |
| 10 | A002 | R051 | 59 ST | 11/07/2019 | AM | Thursday | 11722410 | 8558944 | 11/06/2019 | 11721866.0 | 8558037.0 | 544.0 | 907.0 | 1451.0 | | |
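
In the rows shown, ``TMP_ENTRIES`` and ``TMP_EXITS`` equal the change in the cumulative counters since the previous row, and ``TRAFFIC`` is their sum. The sketch below reproduces that relationship for three of the rows. It is inferred from the output only; grouping by ``C/A``, ``UNIT`` and ``STATION`` before shifting is an assumption about how ``clean2.py`` might do it.

```python
import pandas as pd

# ENTRIES/EXITS counter readings copied from rows 1-3 of the output above.
rows = pd.DataFrame({
    "C/A": ["A002"] * 3,
    "UNIT": ["R051"] * 3,
    "STATION": ["59 ST"] * 3,
    "DATE": ["11/02/2019", "11/03/2019", "11/03/2019"],
    "AMPM": ["PM", "AM", "PM"],
    "ENTRIES": [11714240, 11714756, 11715887],
    "EXITS": [8552046, 8552455, 8553273],
})

# Previous reading within the same group (assumed grouping keys).
grouped = rows.groupby(["C/A", "UNIT", "STATION"])
rows["PREV_ENTRIES"] = grouped["ENTRIES"].shift(1)
rows["PREV_EXITS"] = grouped["EXITS"].shift(1)

# Per-period counts are the differences between consecutive readings.
rows["TMP_ENTRIES"] = rows["ENTRIES"] - rows["PREV_ENTRIES"]
rows["TMP_EXITS"] = rows["EXITS"] - rows["PREV_EXITS"]
rows["TRAFFIC"] = rows["TMP_ENTRIES"] + rows["TMP_EXITS"]

print(rows[["DATE", "AMPM", "TMP_ENTRIES", "TMP_EXITS", "TRAFFIC"]])
```

The first row has no previous reading in this small sample, so its computed values are ``NaN``. The second and third rows give 925.0 and 1949.0, matching the ``TRAFFIC`` column above.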


Let's filter by one station...

In [4]:
# mask = df_turnstiles.STATION == '34 ST-PENN STA'
# df_turnstiles[mask].head(20)

In [5]:
# df_ampm.groupby('STATION').agg(sum).sort_values("TRAFFIC", ascending=False).head(10)

In [6]:
# mask = df_ampm.STATION == '34 ST-PENN STA'
# df_ampm[mask]

In [7]:
# import json
# station_zips = json.load(open("data/station_zips.json", "r"))

In [8]:
# df_ampm['ZIPCODE'] = df_ampm['STATION'].map(station_zips)
# df_ampm.sample(50)

In [9]:
# station_agis = dict(zip(df_turnstiles['STATION'], df_turnstiles['ZIPCODE_AGI']))

In [10]:
# station_agis

In [11]:
# df_ampm['ZIPCODE_AGI'] = df_ampm['STATION'].map(station_agis)

In [12]:
# df_ampm.sample(20)

In [15]:
df_ampm.groupby(["STATION"]).agg(sum).sort_values("TRAFFIC", ascending=False).head(10)

| STATION | ENTRIES | EXITS | PREV_ENTRIES | PREV_EXITS | TMP_ENTRIES | TMP_EXITS | TRAFFIC | ZIPCODE_AGI |
|---|---|---|---|---|---|---|---|---|
| 23 ST | 455928240188 | 392699599033 | 455935800000.0 | 392703800000.0 | 1210802.0 | 1092580.0 | 2303382.0 | 13465760000.0 |
| 34 ST-PENN STA | 349522634781 | 295780285302 | 349626400000.0 | 295779200000.0 | 895644.0 | 1250075.0 | 2145719.0 | 0.0 |
| GRD CNTRL-42 ST | 246721312455 | 349337922842 | 246720700000.0 | 349337600000.0 | 992300.0 | 887092.0 | 1879392.0 | 0.0 |
| 34 ST-HERALD SQ | 402869841392 | 549231786266 | 402869900000.0 | 549231000000.0 | 796448.0 | 835936.0 | 1632384.0 | 1613071000.0 |
| 14 ST-UNION SQ | 15452091906 | 8518147149 | 15451250000.0 | 8517386000.0 | 844847.0 | 761596.0 | 1606443.0 | 4546414000.0 |
| TIMES SQ-42 ST | 605308956566 | 472890952606 | 605308400000.0 | 472890300000.0 | 892350.0 | 635912.0 | 1528262.0 | 2640326000.0 |
| FULTON ST | 352830918591 | 446832202070 | 352830400000.0 | 446831400000.0 | 562871.0 | 776994.0 | 1339865.0 | 0.0 |
| 59 ST | 134031133412 | 87093613207 | 134030700000.0 | 87093110000.0 | 479908.0 | 691139.0 | 1171047.0 | 0.0 |
| 96 ST | 82401913267 | 7113448616 | 82401370000.0 | 7112833000.0 | 540377.0 | 615304.0 | 1155681.0 | 1122239000.0 |
| CANAL ST | 349090843413 | 342922973913 | 349090600000.0 | 342922500000.0 | 637975.0 | 505804.0 | 1143779.0 | 6175548000.0 |
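
Summing every numeric column also adds up the cumulative ``ENTRIES``/``EXITS`` counter readings and the repeated ``ZIPCODE_AGI`` values, which are hard to read as totals. If only the ranking is needed, one option is to aggregate ``TRAFFIC`` alone. The sketch below uses toy rows with made-up traffic values, only to show the shape of the call.

```python
import pandas as pd

# Toy data: station names are real, TRAFFIC values are invented.
toy = pd.DataFrame({
    "STATION": ["23 ST", "59 ST", "23 ST", "CANAL ST", "59 ST"],
    "TRAFFIC": [100.0, 40.0, 50.0, 70.0, 20.0],
})

# Total traffic per station, busiest first.
top_stations = (
    toy.groupby("STATION")["TRAFFIC"]
    .sum()
    .sort_values(ascending=False)
    .head(2)
)

print(top_stations)
```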
