# MTA Analysis: Data Acquisition & Cleaning

This analysis uses [publicly available MTA Turnstile Data](http://web.mta.info/developers/turnstile.html), covering May and June of each year from 2015 to 2020.

#### Import required libraries.

In [25]:
import pandas as pd
import numpy as np
import os

#### Define useful functions (maybe document and put this in .py file later?)

In [19]:
def convert_timestamp_to_mta_format(timestamp):
    '''
    Input: pandas Timestamp
    Output: date string in the YYMMDD format used in the turnstile file names
    '''
    year = str(timestamp.year)[2:4]
    month = str(timestamp.month).zfill(2)
    day = str(timestamp.day).zfill(2)

    return year + month + day

def create_raw_df(path):
    '''
    Input: subdirectory path containing data files
    Output: One DF aggregating all data
    '''
    dfs = []
    # sort so the files are always read in the same order
    for file in sorted(os.listdir(path)):
        print(file)
        dfs.append(pd.read_csv(os.path.join(path, file), index_col=0))

    # concatenate once at the end instead of growing a DF inside the loop
    return pd.concat(dfs) if dfs else pd.DataFrame()
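
As a quick check of the date helper above, the sketch below builds the `YYMMDD` code used in the weekly file names in two ways. `mta_date_code` and `manual_date_code` are hypothetical names, not part of the notebook, and the sketch assumes `strftime('%y%m%d')` is an acceptable stand-in for the manual string slicing.

```python
import pandas as pd

def mta_date_code(timestamp):
    """Return a date as a YYMMDD string (hypothetical helper)."""
    return timestamp.strftime('%y%m%d')

def manual_date_code(timestamp):
    """Same result built by hand, mirroring the slicing approach above."""
    year = str(timestamp.year)[2:4]
    month = str(timestamp.month).zfill(2)
    day = str(timestamp.day).zfill(2)
    return year + month + day

example = pd.Timestamp('2015-05-02')
print(mta_date_code(example))     # 150502
print(manual_date_code(example))  # 150502
```

Both versions zero-pad the month and day, which matters because the file names always use six digits.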

#### 1. Load [MTA Turnstile Data](http://web.mta.info/developers/turnstile.html) and combine into a single dataframe.
_Note: The files already exist in the repository. This code block can be skipped over._

In [26]:
start_date = '2015-05-02'
months_of_interest = [5, 6]
start_ts = pd.Timestamp(start_date)

In [27]:
current_timestamp = start_ts
weekly_dfs = []

# check the date on every pass, including skipped weeks, so the loop
# never tries to download a file dated in the future
while current_timestamp <= pd.Timestamp.now():
    if current_timestamp.month in months_of_interest:
        print(f'Downloading data for {current_timestamp}...')
        date_formatted = convert_timestamp_to_mta_format(current_timestamp)

        # load the weekly file directly from the MTA site
        url = f'http://web.mta.info/developers/data/nyct/turnstile/turnstile_{date_formatted}.txt'
        weekly_dfs.append(pd.read_csv(url))

    # add 7 days to get the next file
    current_timestamp += pd.DateOffset(days=7)

# combine all weekly files in one step
raw_master_df = pd.concat(weekly_dfs)

Downloading data for 2015-05-02 00:00:00...
Downloading data for 2015-05-09 00:00:00...
Downloading data for 2015-05-16 00:00:00...
Downloading data for 2015-05-23 00:00:00...
Downloading data for 2015-05-30 00:00:00...
Downloading data for 2015-06-06 00:00:00...
Downloading data for 2015-06-13 00:00:00...
Downloading data for 2015-06-20 00:00:00...
Downloading data for 2015-06-27 00:00:00...
Downloading data for 2016-05-07 00:00:00...
Downloading data for 2016-05-14 00:00:00...
Downloading data for 2016-05-21 00:00:00...
Downloading data for 2016-05-28 00:00:00...
Downloading data for 2016-06-04 00:00:00...
Downloading data for 2016-06-11 00:00:00...
Downloading data for 2016-06-18 00:00:00...
Downloading data for 2016-06-25 00:00:00...
Downloading data for 2017-05-06 00:00:00...
Downloading data for 2017-05-13 00:00:00...
Downloading data for 2017-05-20 00:00:00...
Downloading data for 2017-05-27 00:00:00...
Downloading data for 2017-06-03 00:00:00...
Downloading data for 2017-06-10 
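
The weekly dates in the log above can also be generated up front instead of stepping through a loop. The following is a minimal sketch, assuming `pd.date_range` with a 7-day frequency is an acceptable replacement for the manual `DateOffset` stepping; `end_ts` is an illustrative cutoff, not a value from the notebook.

```python
import pandas as pd

start_ts = pd.Timestamp('2015-05-02')
end_ts = pd.Timestamp('2016-12-31')  # assumed cutoff, for illustration only
months_of_interest = [5, 6]

# one timestamp every 7 days, starting from the first file date
all_weeks = pd.date_range(start=start_ts, end=end_ts, freq='7D')

# keep only the weeks that fall in May or June
wanted_weeks = all_weeks[all_weeks.month.isin(months_of_interest)]

# YYMMDD codes as used in the turnstile file names
file_codes = wanted_weeks.strftime('%y%m%d').tolist()
print(file_codes[:3])  # ['150502', '150509', '150516']
```

Building the list first makes it easy to inspect which files will be requested before any download starts.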

#### 2. Combine CSVs into a single .pkl to be cleaned.

In [29]:
raw_master_df.to_pickle('processed_data/raw_mta_turnstile_data_mayjune_20152020.pkl')
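
The save step above assumes the `processed_data` folder already exists. A minimal sketch of creating the folder first and checking the round trip is shown below; the temporary directory and the `example.pkl` file name are placeholders for this demo, not paths from the repository.

```python
import os
import tempfile
import pandas as pd

df = pd.DataFrame({'STATION': ['LEXINGTON AVE'], 'ENTRIES': [5106770]})

# a temporary folder stands in for the repository root (assumption for the demo)
with tempfile.TemporaryDirectory() as root:
    out_dir = os.path.join(root, 'processed_data')
    os.makedirs(out_dir, exist_ok=True)  # no error if the folder is already there

    out_path = os.path.join(out_dir, 'example.pkl')  # hypothetical file name
    df.to_pickle(out_path)

    # read the file back to confirm nothing was lost
    restored = pd.read_pickle(out_path)

print(restored.equals(df))
```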

#### 3. Clean the data set.
- change datatypes of columns
- look for N/As
- high level summaries

In [31]:
df_mta_raw = pd.read_pickle('processed_data/raw_mta_turnstile_data_mayjune_20152020.pkl')
df_mta_raw.head()

|   | C/A | UNIT | SCP | STATION | LINENAME | DIVISION | DATE | TIME | DESC | ENTRIES | EXITS |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | A002 | R051 | 02-00-00 | LEXINGTON AVE | NQR456 | BMT | 04/25/2015 | 00:00:00 | REGULAR | 5106770 | 1729635 |
| 1 | A002 | R051 | 02-00-00 | LEXINGTON AVE | NQR456 | BMT | 04/25/2015 | 04:00:00 | REGULAR | 5106810 | 1729649 |
| 2 | A002 | R051 | 02-00-00 | LEXINGTON AVE | NQR456 | BMT | 04/25/2015 | 08:00:00 | REGULAR | 5106835 | 1729680 |
| 3 | A002 | R051 | 02-00-00 | LEXINGTON AVE | NQR456 | BMT | 04/25/2015 | 12:00:00 | REGULAR | 5106961 | 1729784 |
| 4 | A002 | R051 | 02-00-00 | LEXINGTON AVE | NQR456 | BMT | 04/25/2015 | 16:00:00 | REGULAR | 5107250 | 1729858 |
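
The three cleaning steps listed for this section could look roughly like the sketch below. It runs on a few rows copied from the preview rather than the full pickle, and the `DATETIME` column name is an assumption for illustration, not something the notebook defines.

```python
import pandas as pd

# a few rows copied from the head() output
sample = pd.DataFrame({
    'C/A': ['A002', 'A002', 'A002'],
    'STATION': ['LEXINGTON AVE'] * 3,
    'DATE': ['04/25/2015'] * 3,
    'TIME': ['00:00:00', '04:00:00', '08:00:00'],
    'ENTRIES': [5106770, 5106810, 5106835],
    'EXITS': [1729635, 1729649, 1729680],
})

# change datatypes: merge DATE and TIME into one datetime column
sample['DATETIME'] = pd.to_datetime(
    sample['DATE'] + ' ' + sample['TIME'], format='%m/%d/%Y %H:%M:%S'
)

# look for N/As: count missing values per column
na_counts = sample.isna().sum()

# high-level summary of the numeric columns
summary = sample[['ENTRIES', 'EXITS']].describe()

print(sample.dtypes)
print(na_counts)
print(summary)
```

Passing an explicit `format` to `pd.to_datetime` avoids any ambiguity between month-first and day-first dates.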


#### 4. Create a cleaned .pkl for analysis.