# Instructions for using clean2.py

``clean2.py`` essentially contains one function: ``data_wrangling()``. This function takes an optional list of week numbers (``week_nums``) that determines the timeframe of the MTA turnstile data it pulls.

Calling it as shown in the cell below returns two dataframes:
- ``df_turnstiles``: contains all turnstile data
- ``df_ampm``: contains the same data, this time broken down by AM/PM
    - 2 entries per turnstile (``C/A``, ``UNIT``, ``SCP``) per day: one for AM, one for PM
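
As a rough illustration of the AM/PM breakdown, here is a minimal sketch that labels a timestamp the way the sample rows in this document appear to be labeled (hours before 12:00 are AM, 12:00 onward are PM). The helper name ``ampm_label`` is hypothetical and not part of ``clean2.py``; the actual implementation may differ.

```python
from datetime import datetime


def ampm_label(date_time: str) -> str:
    """Return 'AM' for hours 0-11 and 'PM' for hours 12-23 (assumed rule)."""
    ts = datetime.strptime(date_time, "%Y-%m-%d %H:%M:%S")
    return "AM" if ts.hour < 12 else "PM"


# DATE_TIME values in the same format as the df_turnstiles output
for stamp in ["2019-12-24 00:00:00", "2019-12-16 08:00:00", "2019-12-10 12:00:00"]:
    print(stamp, ampm_label(stamp))
```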
    
**NOTE 1**: This part requires a key for Google's Geocoding API. The key used here will be deactivated before this is made public, so get your own [here](https://developers.google.com/maps/documentation/geocoding/start) (it's free\*, fast, and easy). \*It's free if you only use it here. I think Google would charge you if you ran the cell below hundreds of times or used the key for other applications.

**NOTE 2: The following cell takes 4-7 minutes to run. By default, the program pulls 8 weeks of MTA turnstile data. You can pass a shorter list to the ``week_nums`` parameter of ``data_wrangling`` to pull fewer weeks**. This will only save you 1-2 minutes, since most of the time is spent on the geocode API (see [wtwy_data_merge.ipynb](https://github.com/edubu2/metis-project1/blob/main/code/wtwy_data_merge.ipynb) for details).
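
Because the cell is slow, one option is to cache its result on disk so that later sessions can skip the geocoding step. The sketch below is an illustration under stated assumptions, not part of ``clean2.py``: ``load_or_build`` and the cache file name are hypothetical, and ``build`` stands in for a slow call such as ``data_wrangling``.

```python
import pickle
from pathlib import Path


def load_or_build(cache_path, build):
    """Return the cached result if cache_path exists; otherwise call build() and cache it."""
    path = Path(cache_path)
    if path.exists():
        with path.open("rb") as f:
            return pickle.load(f)
    result = build()  # slow step, e.g. data_wrangling
    with path.open("wb") as f:
        pickle.dump(result, f)
    return result


# Hypothetical usage (uncomment once clean2 is importable and a valid key is set):
# from clean2 import data_wrangling
# df_turnstiles, df_ampm = load_or_build("wrangled.pkl", data_wrangling)
```

The sketch does not track which arguments produced the cache, so delete the cache file whenever you change ``week_nums``.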

In [5]:
from clean2 import data_wrangling

# must replace geocode_api_key's empty string with a valid key
df_turnstiles, df_ampm = data_wrangling()
df_turnstiles.sample(20)

Unnamed: 0,DATE_TIME,C/A,UNIT,SCP,STATION,LINENAME,DIVISION,DATE,TIME,ENTRIES,EXITS,AMPM,DAY_NAME,ZIPCODE,ZIPCODE_AGI
10815,2019-12-16 08:00:00,A058,R001,01-06-02,WHITEHALL S-FRY,R1W,BMT,12/16/2019,08:00:00,166161,33498,AM,Monday,,
179708,2019-12-10 12:00:00,R417,R222,00-00-03,PARKCHESTER,6,IRT,12/10/2019,12:00:00,696655,424943,PM,Tuesday,10472.0,971500.0
180251,2019-11-10 15:00:00,R420,R107,00-00-02,WESTCHESTER SQ,6,IRT,11/10/2019,15:00:00,390294,300894,PM,Sunday,,
100440,2019-12-19 19:00:00,N506,R022,00-05-04,34 ST-HERALD SQ,BDFMNQRW,IND,12/19/2019,19:00:00,1254495121,2048928040,PM,Thursday,10001.0,2906435.0
94777,2019-11-28 11:00:00,N419,R287,00-00-00,CLASSON AV,G,IND,11/28/2019,11:00:00,3515355,3689134,AM,Thursday,11238.0,3353662.0
68882,2019-12-18 15:00:00,N124,R103,00-00-03,BROADWAY JCT,ACJLZ,IND,12/18/2019,15:00:00,9446355,6037237,PM,Wednesday,,
106504,2019-11-30 15:30:00,N542,R241,00-06-00,15 ST-PROSPECT,FG,IND,11/30/2019,15:30:00,118236029,671128,PM,Saturday,,
44607,2019-11-08 16:00:00,N002A,R173,00-05-00,INWOOD-207 ST,A,IND,11/08/2019,16:00:00,4292,0,PM,Friday,10034.0,1019289.0
59913,2019-12-22 11:00:00,N091,R029,02-00-02,CHAMBERS ST,ACE23,IND,12/22/2019,11:00:00,7802503,6324267,AM,Sunday,10007.0,2910802.0
14211,2019-11-02 04:00:00,B013,R196,01-00-00,PROSPECT PARK,BQS,BMT,11/02/2019,04:00:00,8090452,16856897,AM,Saturday,,


**Let's take a look at ``df_turnstiles``**:

In [7]:
df_turnstiles.sample(10)

Unnamed: 0,DATE_TIME,C/A,UNIT,SCP,STATION,LINENAME,DIVISION,DATE,TIME,ENTRIES,EXITS,AMPM,DAY_NAME,ZIPCODE,ZIPCODE_AGI
127697,2019-12-02 12:58:22,PTH22,R540,00-02-05,PATH NEW WTC,1,PTH,12/02/2019,12:58:22,295756,147251,PM,Monday,,
155444,2019-12-24 00:00:00,R221,R170,01-00-02,14 ST-UNION SQ,456LNQRW,IRT,12/24/2019,00:00:00,16738170,9990570,AM,Tuesday,10003.0,8191737.0
36906,2019-11-30 08:00:00,H038,R350,00-06-00,LIVONIA AV,L,BMT,11/30/2019,08:00:00,525797,361728,AM,Saturday,11212.0,1344174.0
93152,2019-12-13 11:00:00,N408A,R256,00-06-03,NASSAU AV,G,IND,12/13/2019,11:00:00,71812,456115,AM,Friday,11222.0,2098579.0
143347,2019-11-27 04:00:00,R165,R167,01-00-01,86 ST,1,IRT,11/27/2019,04:00:00,802965,1030127,AM,Wednesday,,
194413,2019-12-03 07:00:00,R601A,R108,02-00-04,BOROUGH HALL,2345R,IRT,12/03/2019,07:00:00,82626,98885,AM,Tuesday,,
69211,2019-11-06 07:00:00,N129,R382,00-00-01,GRANT AV,A,IND,11/06/2019,07:00:00,9521153,7547277,AM,Wednesday,11208.0,1586724.0
151746,2019-11-27 16:00:00,R206,R014,02-03-01,FULTON ST,2345ACJZ,IRT,11/27/2019,16:00:00,446649,2189727,PM,Wednesday,,
65833,2019-11-04 15:00:00,N111,R284,00-06-00,CLINTON-WASH AV,C,IND,11/04/2019,15:00:00,2424215,248029,PM,Monday,,
191042,2019-12-19 11:00:00,R529,R208,00-00-03,103 ST-CORONA,7,IRT,12/19/2019,11:00:00,1654481,599381,AM,Thursday,,


**And now ``df_ampm``**:

In [8]:
df_ampm.sample(35)

Unnamed: 0,C/A,UNIT,SCP,STATION,ZIPCODE,ZIPCODE_AGI,DATE,AMPM,DAY_NAME,ENTRIES,EXITS,PREV_DATE,PREV_ENTRIES,PREV_EXITS,TMP_ENTRIES,TMP_EXITS
39927,B027,R136,00-00-01,SHEEPSHEAD BAY,11235,2567055.0,12/03/2019,PM,Tuesday,5017235,5794390,12/02/2019,5015586.0,5792584.0,1649.0,1649.0
121235,N304,R015,01-06-00,5 AV/53 ST,10022,14226340.0,11/13/2019,AM,Wednesday,5148310,5047785,11/12/2019,5146636.0,5043875.0,1674.0,1674.0
90679,N078,R175,01-00-01,14 ST,10011,9331779.0,11/19/2019,AM,Tuesday,1438077,2412565,11/18/2019,1436432.0,2410182.0,1645.0,1645.0
223074,R257,R182,01-03-01,116 ST,10035,833400.0,12/05/2019,PM,Thursday,69118,74529,12/04/2019,67449.0,72635.0,1669.0,1669.0
237748,R331,R364,00-05-01,GUN HILL RD,10467,1714750.0,12/20/2019,AM,Friday,683737068,638320589,12/19/2019,683737068.0,638320589.0,0.0,0.0
3000,A015,R081,00-00-03,49 ST,10019,8005583.0,12/16/2019,AM,Monday,9009739,3879144,12/15/2019,9008351.0,3878578.0,1388.0,1388.0
151438,N510,R163,02-00-00,14 ST,10011,9331779.0,11/20/2019,AM,Wednesday,153319,317633,11/19/2019,152076.0,314983.0,1243.0,1243.0
77777,K017,R401,00-00-01,CENTRAL AV,11221,1958838.0,12/12/2019,AM,Thursday,2775329,3948051,12/11/2019,2774496.0,3946903.0,833.0,833.0
228394,R306,R207,00-00-04,135 ST,10030,645026.0,11/20/2019,PM,Wednesday,7044758,742449,11/19/2019,7042009.0,742149.0,2749.0,2749.0
104093,N127,R442,00-00-02,SHEPHERD AV,11208,1586724.0,11/09/2019,AM,Saturday,785344,270891,11/08/2019,784421.0,270488.0,923.0,923.0


Let's filter by one station...

In [9]:
mask = df_ampm.STATION == '50 ST'
df_ampm[mask].head(20)

Unnamed: 0,C/A,UNIT,SCP,STATION,ZIPCODE,ZIPCODE_AGI,DATE,AMPM,DAY_NAME,ENTRIES,EXITS,PREV_DATE,PREV_ENTRIES,PREV_EXITS,TMP_ENTRIES,TMP_EXITS
55770,E004,R234,00-00-00,50 ST,10019,8005583.0,11/03/2019,AM,Sunday,5948968,5525777,11/02/2019,5948370.0,5525155.0,598.0,598.0
55771,E004,R234,00-00-00,50 ST,10019,8005583.0,11/03/2019,PM,Sunday,5949485,5526198,11/02/2019,5948731.0,5525519.0,754.0,754.0
55772,E004,R234,00-00-00,50 ST,10019,8005583.0,11/04/2019,AM,Monday,5950050,5526601,11/03/2019,5948968.0,5525777.0,1082.0,1082.0
55773,E004,R234,00-00-00,50 ST,10019,8005583.0,11/04/2019,PM,Monday,5950898,5527285,11/03/2019,5949485.0,5526198.0,1413.0,1413.0
55774,E004,R234,00-00-00,50 ST,10019,8005583.0,11/05/2019,AM,Tuesday,5951426,5527687,11/04/2019,5950050.0,5526601.0,1376.0,1376.0
55775,E004,R234,00-00-00,50 ST,10019,8005583.0,11/05/2019,PM,Tuesday,5952283,5528408,11/04/2019,5950898.0,5527285.0,1385.0,1385.0
55776,E004,R234,00-00-00,50 ST,10019,8005583.0,11/06/2019,AM,Wednesday,5952806,5528836,11/05/2019,5951426.0,5527687.0,1380.0,1380.0
55777,E004,R234,00-00-00,50 ST,10019,8005583.0,11/06/2019,PM,Wednesday,5953658,5529517,11/05/2019,5952283.0,5528408.0,1375.0,1375.0
55778,E004,R234,00-00-00,50 ST,10019,8005583.0,11/07/2019,AM,Thursday,5954189,5529999,11/06/2019,5952806.0,5528836.0,1383.0,1383.0
55779,E004,R234,00-00-00,50 ST,10019,8005583.0,11/07/2019,PM,Thursday,5954997,5530673,11/06/2019,5953658.0,5529517.0,1339.0,1339.0
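
The ``PREV_ENTRIES`` and ``TMP_ENTRIES`` columns in this output are consistent with taking, for each turnstile and each AM/PM slot, the difference from the previous day's cumulative reading. The sketch below reproduces that arithmetic on five rows copied from the table above. The grouping keys and the ``DELTA_ENTRIES`` column name are assumptions made for illustration, not the code in ``clean2.py``.

```python
import pandas as pd

# (C/A, UNIT, SCP, DATE, AMPM, ENTRIES) copied from the 50 ST rows above
rows = [
    ("E004", "R234", "00-00-00", "11/03/2019", "AM", 5948968),
    ("E004", "R234", "00-00-00", "11/03/2019", "PM", 5949485),
    ("E004", "R234", "00-00-00", "11/04/2019", "AM", 5950050),
    ("E004", "R234", "00-00-00", "11/04/2019", "PM", 5950898),
    ("E004", "R234", "00-00-00", "11/05/2019", "AM", 5951426),
]
df = pd.DataFrame(rows, columns=["C/A", "UNIT", "SCP", "DATE", "AMPM", "ENTRIES"])

# Assumed grouping: one series per turnstile and AM/PM slot, rows already in date order
keys = ["C/A", "UNIT", "SCP", "AMPM"]
df["PREV_ENTRIES"] = df.groupby(keys)["ENTRIES"].shift(1)  # previous day's reading
df["DELTA_ENTRIES"] = df["ENTRIES"] - df["PREV_ENTRIES"]   # entries since that reading

print(df[["DATE", "AMPM", "ENTRIES", "PREV_ENTRIES", "DELTA_ENTRIES"]])
```

The first AM and PM rows have no earlier reading in this small sample, so their deltas are missing. The remaining deltas (1082, 1413, 1376) match ``TMP_ENTRIES`` in the table. Applying the same subtraction to ``EXITS`` for 11/04/2019 AM gives 5526601 - 5525777 = 824, which differs from the 1082.0 shown under ``TMP_EXITS``, so that column may deserve a second look.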
