# Instructions for using clean2.py

``clean2.py`` essentially contains one function: ``data_wrangling()``. It takes an optional list of week numbers (the ``week_nums`` parameter), which determines the timeframe of the MTA turnstile data that gets pulled.

The call in the first cell below returns two dataframes:
- ``df_turnstiles``: contains all turnstile data
- ``df_ampm``: contains the same data, broken down by AM/PM
    - 2 entries for each station, for each day (one for AM, one for PM)
    
**NOTE 1**: This step requires a key for Google's Geocoding API. The key shown below will be deactivated before this notebook is made public, so get one yourself [here](https://developers.google.com/maps/documentation/geocoding/start) (it's free\*, fast and easy). \*Okay, it's free if you're only using it here. I think you would be charged if you ran the cell below hundreds of times or used the key for other applications.

Then pass the API key (as a string) to the ``geocode_api_key`` parameter of the ``data_wrangling`` function used in the next cell.
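
To avoid pasting a key directly into the notebook, one option is to read it from an environment variable. The sketch below is only an illustration: the helper ``load_geocode_key`` and the variable name ``GOOGLE_GEOCODE_API_KEY`` are hypothetical and not part of ``clean2.py``.

```python
import os


def load_geocode_key(env_var: str = "GOOGLE_GEOCODE_API_KEY") -> str:
    """Return the Geocoding API key stored in an environment variable.

    Both this helper and the default variable name are hypothetical
    choices for this sketch; clean2.py does not define them.
    """
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(f"Set the {env_var} environment variable to your API key.")
    return key


# Usage in the notebook (requires clean2.py on the import path):
# from clean2 import data_wrangling
# df_turnstiles, df_ampm = data_wrangling(geocode_api_key=load_geocode_key())
```

This keeps the key out of the saved notebook, so it is not published along with the code.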

**NOTE 2**: The following cell takes 4-7 minutes to run. By default, the program pulls 8 weeks of MTA turnstile data. You can pass fewer weeks through the ``week_nums`` parameter of ``data_wrangling``, but this only saves 1-2 minutes, since most of the time is spent on the geocode API (see [wtwy_data_merge.ipynb](https://github.com/edubu2/metis-project1/blob/main/code/wtwy_data_merge.ipynb) for details).

In [1]:
from clean2 import data_wrangling

# replace the geocode_api_key value below with your own valid key
df_turnstiles, df_ampm = data_wrangling(geocode_api_key='AIzaSyCGo0NcvTdM9fgFx7y8BzShf5OJLHH562U')

**Let's take a look at ``df_turnstiles``**:

In [2]:
df_turnstiles.sample(10)

Unnamed: 0,DATE_TIME,C/A,UNIT,SCP,STATION,LINENAME,DIVISION,DATE,TIME,ENTRIES,EXITS,AMPM,DAY_NAME,ZIPCODE,ZIPCODE_AGI
107036,2019-11-13 03:00:00,N550,R242,01-06-00,18 AV,F,IND,11/13/2019,03:00:00,840490,61167,AM,Wednesday,11230.0,2773769.0
69644,2019-12-22 00:00:00,N135,R385,01-03-01,ROCKAWAY BLVD,A,IND,12/22/2019,00:00:00,1387676,427910,AM,Sunday,11417.0,724600.0
83951,2019-12-26 06:36:28,N324,R018,00-00-03,JKSN HT-ROOSVLT,EFMR7,IND,12/26/2019,06:36:28,1997329,2346349,AM,Thursday,,
165845,2019-11-16 23:00:00,R254,R181,01-00-01,110 ST,6,IRT,11/16/2019,23:00:00,5216441,1518714,PM,Saturday,10029.0,2022052.0
92728,2019-12-04 08:00:00,N401,R360,00-00-00,21 ST,G,IND,12/04/2019,08:00:00,599512,776589,AM,Wednesday,11101.0,2273198.0
6542,2019-12-19 15:00:00,A041,R086,00-00-03,PRINCE ST,NRW,BMT,12/19/2019,15:00:00,13413921,5772114,PM,Thursday,10012.0,3646355.0
126473,2019-11-07 03:31:50,PTH19,R549,02-01-04,NEWARK C,1,PTH,11/07/2019,03:31:50,47878,3306,AM,Thursday,,
125089,2019-11-09 21:28:17,PTH20,R549,03-00-04,NEWARK HM HE,1,PTH,11/09/2019,21:28:17,7457,67780,PM,Saturday,,
28423,2019-12-24 16:00:00,E001,R368,00-00-04,9 AV,D,BMT,12/24/2019,16:00:00,3720620,5588981,PM,Tuesday,11219.0,1635884.0
96246,2019-12-19 11:00:00,N418,R269,01-06-01,BEDFORD-NOSTRAN,G,IND,12/19/2019,11:00:00,9660223,143799,AM,Thursday,,
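
Several rows in this sample have empty ``ZIPCODE`` and ``ZIPCODE_AGI`` values. The source does not say why; the sketch below only shows one way to measure how many rows are affected, using a few rows copied from the sample above rather than the real ``df_turnstiles``.

```python
import numpy as np
import pandas as pd

# A few rows copied from the sample output above (relevant columns only).
sample = pd.DataFrame(
    {
        "STATION": ["18 AV", "JKSN HT-ROOSVLT", "110 ST", "NEWARK C", "9 AV"],
        "DIVISION": ["IND", "IND", "IRT", "PTH", "BMT"],
        "ZIPCODE": [11230.0, np.nan, 10029.0, np.nan, 11219.0],
    }
)

# Boolean mask of rows without a zip code.
missing = sample["ZIPCODE"].isna()

print(f"{missing.mean():.0%} of rows have no ZIPCODE")
print(sorted(sample.loc[missing, "STATION"].unique()))
```

Running the same two lines against ``df_turnstiles`` itself would show which stations to exclude or fix before any zip-code-based analysis.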


**And now ``df_ampm``**:

In [5]:
df_ampm.head(35)

Unnamed: 0,C/A,UNIT,SCP,STATION,ZIPCODE,ZIPCODE_AGI,DATE,AMPM,DAY_NAME,ENTRIES,EXITS,PREV_DATE,PREV_ENTRIES,PREV_EXITS,TMP_ENTRIES,TMP_EXITS
1,A006,R079,00-00-00,5 AV/59 ST,10019,8005583.0,11/02/2019,PM,Saturday,4098923,7058586,11/02/2019,4097957.0,7057072.0,966.0,966.0
2,A006,R079,00-00-00,5 AV/59 ST,10019,8005583.0,11/03/2019,AM,Sunday,4099092,7058848,11/02/2019,4098923.0,7058586.0,169.0,169.0
3,A006,R079,00-00-00,5 AV/59 ST,10019,8005583.0,11/03/2019,PM,Sunday,4099979,7060287,11/03/2019,4099092.0,7058848.0,887.0,887.0
4,A006,R079,00-00-00,5 AV/59 ST,10019,8005583.0,11/04/2019,AM,Monday,4100128,7061222,11/03/2019,4099979.0,7060287.0,149.0,149.0
5,A006,R079,00-00-00,5 AV/59 ST,10019,8005583.0,11/04/2019,PM,Monday,4101641,7063433,11/04/2019,4100128.0,7061222.0,1513.0,1513.0
6,A006,R079,00-00-00,5 AV/59 ST,10019,8005583.0,11/05/2019,AM,Tuesday,4101898,7064433,11/04/2019,4101641.0,7063433.0,257.0,257.0
7,A006,R079,00-00-00,5 AV/59 ST,10019,8005583.0,11/05/2019,PM,Tuesday,4103329,7066762,11/05/2019,4101898.0,7064433.0,1431.0,1431.0
8,A006,R079,00-00-00,5 AV/59 ST,10019,8005583.0,11/06/2019,AM,Wednesday,4103617,7067736,11/05/2019,4103329.0,7066762.0,288.0,288.0
9,A006,R079,00-00-00,5 AV/59 ST,10019,8005583.0,11/06/2019,PM,Wednesday,4105082,7070201,11/06/2019,4103617.0,7067736.0,1465.0,1465.0
10,A006,R079,00-00-00,5 AV/59 ST,10019,8005583.0,11/07/2019,AM,Thursday,4105342,7071124,11/06/2019,4105082.0,7070201.0,260.0,260.0
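
In the output above, ``ENTRIES`` is a cumulative counter, ``PREV_ENTRIES`` holds the previous period's reading, and ``TMP_ENTRIES`` equals their difference. The implementation in ``clean2.py`` is not shown here, so the following is only a sketch of how such columns could be derived with pandas, assuming the rows are already sorted chronologically within each turnstile. It uses four rows copied from the output above.

```python
import pandas as pd

# Four consecutive AM/PM readings for one turnstile, copied from the output above.
rows = pd.DataFrame(
    {
        "C/A": ["A006"] * 4,
        "UNIT": ["R079"] * 4,
        "SCP": ["00-00-00"] * 4,
        "STATION": ["5 AV/59 ST"] * 4,
        "DATE": ["11/02/2019", "11/03/2019", "11/03/2019", "11/04/2019"],
        "AMPM": ["PM", "AM", "PM", "AM"],
        "ENTRIES": [4098923, 4099092, 4099979, 4100128],
    }
)

# Columns that together identify a single physical turnstile.
turnstile = ["C/A", "UNIT", "SCP", "STATION"]

# Previous reading within the same turnstile (NaN for its first row).
rows["PREV_ENTRIES"] = rows.groupby(turnstile)["ENTRIES"].shift(1)

# Riders counted during the period = current reading minus previous reading.
rows["TMP_ENTRIES"] = rows["ENTRIES"] - rows["PREV_ENTRIES"]

print(rows[["DATE", "AMPM", "ENTRIES", "PREV_ENTRIES", "TMP_ENTRIES"]])
```

Grouping by turnstile before shifting keeps the first reading of one turnstile from being compared with the last reading of another.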


Let's filter by one station...

In [4]:
mask = df_ampm.STATION == '50 ST'
df_ampm[mask].head(20)

Unnamed: 0,C/A,UNIT,SCP,STATION,ZIPCODE,ZIPCODE_AGI,DATE,AMPM,DAY_NAME,ENTRIES,EXITS,PREV_DATE,PREV_ENTRIES,PREV_EXITS,TMP_ENTRIES,TMP_EXITS
55769,E004,R234,00-00-00,50 ST,10019,8005583.0,11/02/2019,PM,Saturday,5948731,5525519,11/02/2019,5948370.0,5525155.0,361.0,361.0
55770,E004,R234,00-00-00,50 ST,10019,8005583.0,11/03/2019,AM,Sunday,5948968,5525777,11/02/2019,5948731.0,5525519.0,237.0,237.0
55771,E004,R234,00-00-00,50 ST,10019,8005583.0,11/03/2019,PM,Sunday,5949485,5526198,11/03/2019,5948968.0,5525777.0,517.0,517.0
55772,E004,R234,00-00-00,50 ST,10019,8005583.0,11/04/2019,AM,Monday,5950050,5526601,11/03/2019,5949485.0,5526198.0,565.0,565.0
55773,E004,R234,00-00-00,50 ST,10019,8005583.0,11/04/2019,PM,Monday,5950898,5527285,11/04/2019,5950050.0,5526601.0,848.0,848.0
55774,E004,R234,00-00-00,50 ST,10019,8005583.0,11/05/2019,AM,Tuesday,5951426,5527687,11/04/2019,5950898.0,5527285.0,528.0,528.0
55775,E004,R234,00-00-00,50 ST,10019,8005583.0,11/05/2019,PM,Tuesday,5952283,5528408,11/05/2019,5951426.0,5527687.0,857.0,857.0
55776,E004,R234,00-00-00,50 ST,10019,8005583.0,11/06/2019,AM,Wednesday,5952806,5528836,11/05/2019,5952283.0,5528408.0,523.0,523.0
55777,E004,R234,00-00-00,50 ST,10019,8005583.0,11/06/2019,PM,Wednesday,5953658,5529517,11/06/2019,5952806.0,5528836.0,852.0,852.0
55778,E004,R234,00-00-00,50 ST,10019,8005583.0,11/07/2019,AM,Thursday,5954189,5529999,11/06/2019,5953658.0,5529517.0,531.0,531.0
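
The same boolean mask can feed an aggregation. As a minimal sketch (not taken from ``clean2.py``), the example below sums ``TMP_ENTRIES`` per day for one station, using a small stand-in dataframe built from rows shown in the outputs above.

```python
import pandas as pd

# Stand-in for df_ampm, built from rows shown in the outputs above.
df = pd.DataFrame(
    {
        "STATION": ["50 ST"] * 4 + ["5 AV/59 ST"] * 2,
        "DATE": ["11/03/2019", "11/03/2019", "11/04/2019", "11/04/2019",
                 "11/03/2019", "11/03/2019"],
        "AMPM": ["AM", "PM", "AM", "PM", "AM", "PM"],
        "TMP_ENTRIES": [237.0, 517.0, 565.0, 848.0, 169.0, 887.0],
    }
)

# Keep only the station of interest, then add AM and PM together per day.
mask = df["STATION"] == "50 ST"
daily = df[mask].groupby("DATE")["TMP_ENTRIES"].sum()

print(daily)
```

If the full ``df_ampm`` holds several turnstiles for a station, the same ``groupby`` would combine all of them into one daily total.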
