# Week 4 Assignment

Developed by Yongkang Liu  
December 16, 2019

<a name="top"></a>  
# Data Preparation

### Friendly Reminder: If you just want to check the cleaned data, you can skip the data preparation steps and [jump to the end of this notebook](#cleaned).

In [66]:
import pandas as pd

#### Reference

* [Transportation Data and Examples](http://transitdatatoolkit.com/lessons/mapping-a-transit-system/)

## Data Source: Turnstile Data
MTA regularly publishes the turnstile data every week. Each file contains the counts of entries and exits through each turnstile in MTA stations, recorded around every 4 hours. Each turnstile is distinguished by UNIT, SCP, and STATION. Meanwhile, each station is distinguished by the station name, line name, and division.

Major operations on the data include 1) getting the per-period entries and exits for each turnstile (using groupby), 2) combining the data within a station by time (here the time will be checked against a larger time slice), and 3) deciding how to handle the reset records for entries and exits (delete or incorporate them).
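
The first and third operations can be sketched together on toy data. This is a minimal sketch, not the processing used later in this notebook: the column names follow the turnstile file, but the sample values and the `MAX_PER_PERIOD` threshold are assumptions chosen for illustration. A counter reset shows up as a negative or implausibly large difference, and one option is to mask those rows rather than delete them.

```python
import pandas as pd

# Toy cumulative counter readings for two turnstiles (values are made up)
toy = pd.DataFrame({
    "UNIT": ["R051"] * 4 + ["R052"] * 3,
    "SCP": ["02-00-00"] * 4 + ["02-00-01"] * 3,
    "ENTRIES": [100, 130, 5, 25, 900, 950, 990],  # 130 -> 5 is a counter reset
})

MAX_PER_PERIOD = 10_000  # assumed upper bound on plausible counts per period

# Difference between consecutive readings within each turnstile
toy["ENTRIES_DIFF"] = toy.groupby(["UNIT", "SCP"])["ENTRIES"].diff()

# Mask differences that are negative (reset) or implausibly large
valid = toy["ENTRIES_DIFF"].between(0, MAX_PER_PERIOD)
toy["ENTRIES_CLEAN"] = toy["ENTRIES_DIFF"].where(valid)

print(toy)
```

Masking keeps the row count unchanged, so the choice between deleting and incorporating the reset records can be deferred to the analysis stage.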

In [67]:
# Save MTA turnstile data into a dataframe
# Source: http://web.mta.info/developers/turnstile.html
# The file is downloaded and saved in the same folder as the notebook
df_tt = pd.read_csv('turnstile_191102.txt', skipinitialspace=True)  # the data in the week of Nov. 02, 2019
df_tt.head()

Unnamed: 0,C/A,UNIT,SCP,STATION,LINENAME,DIVISION,DATE,TIME,DESC,ENTRIES,EXITS
0,A002,R051,02-00-00,59 ST,NQR456W,BMT,10/26/2019,00:00:00,REGULAR,7247322,2455491
1,A002,R051,02-00-00,59 ST,NQR456W,BMT,10/26/2019,04:00:00,REGULAR,7247336,2455499
2,A002,R051,02-00-00,59 ST,NQR456W,BMT,10/26/2019,08:00:00,REGULAR,7247351,2455532
3,A002,R051,02-00-00,59 ST,NQR456W,BMT,10/26/2019,12:00:00,REGULAR,7247463,2455623
4,A002,R051,02-00-00,59 ST,NQR456W,BMT,10/26/2019,16:00:00,REGULAR,7247755,2455679


In [68]:
# Check the column headers
print(f'df_tt.columns: {df_tt.columns}')

df_tt.columns: Index(['C/A', 'UNIT', 'SCP', 'STATION', 'LINENAME', 'DIVISION', 'DATE', 'TIME',
       'DESC', 'ENTRIES',
       'EXITS                                                               '],
      dtype='object')


Notice that the last column name, a string variable, contains many space characters. We need to remove them.
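
If more than one header carried padding, every column name could be stripped in one step. A minimal sketch on a toy dataframe (the frame and its values are made up for illustration):

```python
import pandas as pd

# Toy frame whose headers carry stray whitespace, as in the raw turnstile file
toy = pd.DataFrame({"ENTRIES": [1, 2], "EXITS      ": [3, 4]})

# Index.str.strip() removes leading and trailing whitespace from every header
toy.columns = toy.columns.str.strip()

print(toy.columns.tolist())
```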

In [69]:
print(f'Check the last column name: ("{df_tt.columns[-1]}")')

# rename the column name
df_tt.rename(columns={df_tt.columns[-1]:df_tt.columns[-1].strip(' ')}, inplace=True)
print(f'After the change, the new columns are {df_tt.columns}')

Check the last column name: ("EXITS                                                               ")
After the change, the new columns are Index(['C/A', 'UNIT', 'SCP', 'STATION', 'LINENAME', 'DIVISION', 'DATE', 'TIME',
       'DESC', 'ENTRIES', 'EXITS'],
      dtype='object')


To better identify each station, we design a unique station index. Since a station is uniquely identified by the station name, line name, and division, we build the label from the numeric indices of these three fields.

In [70]:
print(f'The number of unique station names: {len(df_tt.STATION.unique().tolist())}')
print(f'The number of unique line names: {len(df_tt.LINENAME.unique().tolist())}')
print(f'The number of unique division names: {len(df_tt.DIVISION.unique().tolist())}')

The number of unique station names: 377
The number of unique line names: 113
The number of unique division names: 6


We use three digits to label a station, three digits for a line combination, and one digit for a division. For example, the station "RIT-ROOSEVELT" in Line "R" of Division "RIT" has the encoded indices "376", "019", and "5" for the station name, line combination, and division, respectively. We label it as "t5019376" in the format "t-Division-Line-Station", where "t" stands for the turnstile data source.
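
The label layout can be made explicit with a pair of helper functions. This is a sketch only: `encode_station_idx` and `decode_station_idx` are hypothetical names that do not appear elsewhere in this notebook, and the index values are taken from the example above.

```python
def encode_station_idx(division_idx: int, line_idx: int, station_idx: int) -> str:
    """Build a label of the form t + D + LLL + SSS."""
    return f"t{division_idx}{line_idx:03d}{station_idx:03d}"


def decode_station_idx(label: str) -> tuple:
    """Split a label back into (division, line, station) index strings."""
    if len(label) != 8 or not label.startswith("t"):
        raise ValueError(f"unexpected label format: {label!r}")
    return label[1], label[2:5], label[5:8]


label = encode_station_idx(5, 19, 376)
print(label)
print(decode_station_idx(label))
```

The fixed width of each field is what makes the label decodable, which is why the line and station indices are zero-padded.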

In [None]:
# First, obtain the index in each column
dict_tt_stations = {k: str(v).zfill(3) for v, k in enumerate(df_tt.STATION.unique().tolist())}
dict_tt_lines = {k: str(v).zfill(3) for v, k in enumerate(df_tt.LINENAME.unique().tolist())}
dict_tt_divisions = {k: str(v) for v, k in enumerate(df_tt.DIVISION.unique().tolist())}

# Then, create a new column in the dataframe and assign the unique index 
df_tt['STATION_IDX'] = df_tt[['STATION', 'LINENAME', 'DIVISION']].apply(lambda x: 't'+dict_tt_divisions[x.DIVISION]+dict_tt_lines[x.LINENAME]+dict_tt_stations[x.STATION], axis=1)


In [None]:
df_tt.head()

In [None]:
# optional: check elements in each column/feature
# Explanations to Terminology can be found in http://web.mta.info/developers/resources/nyct/turnstile/ts_Field_Description.txt
# df_tt.DESC.unique()
# df_tt[df_tt.DESC=='RECOVR AUD']

## Identifying Stations' Geolocation Information

The goal is to assign each row of the turnstile data the station's geolocation information, i.e., the latitude and longitude, for later association with Foursquare venues.

One station may have multiple exits at different geolocations. As we don't have the location information for each turnstile, we will only consider per-station data by summarizing the data of all its turnstiles.
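
Summarizing per station amounts to summing the per-period counts of all turnstiles that share a station and a timestamp. A minimal sketch on made-up per-period counts (the `ENTRIES_DIFF` values below are illustrative, not taken from the MTA file):

```python
import pandas as pd

# Per-period entry counts for two turnstiles at the same station (toy values)
toy = pd.DataFrame({
    "STATION": ["59 ST"] * 4,
    "SCP": ["02-00-00", "02-00-00", "02-00-01", "02-00-01"],
    "TIME": ["04:00:00", "08:00:00", "04:00:00", "08:00:00"],
    "ENTRIES_DIFF": [14, 15, 6, 25],
})

# Sum over turnstiles so each (station, time) pair has one row
per_station = (
    toy.groupby(["STATION", "TIME"], as_index=False)["ENTRIES_DIFF"].sum()
)

print(per_station)
```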

We will use the MTA's station geolocation dataset. Because the datasets follow different spelling rules, part of the data also needs a manual check and correction. The mapping information will be saved in a separate csv file for later use.

First, let us look at the station geolocation dataset.

### Station Data

Station data includes information about each MTA station. It mainly provides the geolocation, i.e., the latitude and longitude, of each station. Each record is distinguished by the station name, line name, and division.

The major operation on this dataset is to select the stations in Manhattan.



For background on station and line naming, see [New York City Subway nomenclature](https://en.wikipedia.org/wiki/New_York_City_Subway_nomenclature).

In [None]:
# There are multiple versions of such data
# Version 1
#df_geo = pd.read_csv('DOITT_SUBWAY_STATION_01_13SEPT2010.csv')
#df_geo.loc[0]['LINE'].split('-')

# Version 2
# df_station_entrances = pd.read_csv('NYC_Transit_Subway_Entrance_And_Exit_Data.csv')

# Version 3
# We are going to use the following data published by MTA
# http://web.mta.info/developers/data/nyct/subway/Stations.csv 
# in GTFS format
df_stations = pd.read_csv('Stations.csv')

df_stations.head()

In [None]:
df_stations.columns

In [None]:
# Remove unrelated columns
df_stations.drop(["Complex ID", "GTFS Stop ID", 'Line', 'Structure', 'North Direction Label', 'South Direction Label'], axis=1, inplace=True)


In [None]:
df_stations.head()

The station data has a unique identifier, i.e., Station ID, for each record. Therefore, we don't need to create another ID for this dataframe. Our next step is to link these Station IDs with the station index created in the turnstile data.

In [None]:
df_stations.Borough.unique().tolist()

We restrict our discussion to Manhattan. Therefore, we only keep the stations whose Borough is "M".

In [None]:
df_stations = df_stations[df_stations.Borough=='M']

There is another way to retrieve Manhattan-only data, i.e., deleting all rows whose stations are outside Manhattan:
```python
indexNames = df_stations[ df_stations['Borough'] != 'M' ].index # first, get indices of these rows
df_stations.drop(indexNames , inplace=True) # remove them from the dataframe
```

In [None]:
df_stations.head()

In [None]:
print(f'We found {df_stations.shape[0]} stations with {df_stations["Stop Name"].nunique()} unique station names')

We have more station records than unique station names because one station can have multiple records if it hosts multiple lines.

In [None]:
# Stations on different routes may share the same name. Their coordinates may differ but lie in the same vicinity.
# Since the turnstile data uses one stop name for all routes, we will derive a single coordinate for each hub station

# check how many records are there for the station name of "Canal St"
df_stations[df_stations['Stop Name']=='Canal St']

This may have two consequences: 1) one station in the turnstile data may have multiple geolocations after the mapping, and 2) some mappings may be wrong because stations in different lines or divisions share the same name.
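
A quick way to see which stop names are affected is to count records per name. The sketch below uses a toy frame with the same column names as `Stations.csv`; the rows and coordinates are invented for illustration.

```python
import pandas as pd

# Toy station records (invented rows, real column names)
toy = pd.DataFrame({
    "Stop Name": ["Canal St", "Canal St", "Canal St", "Spring St"],
    "Daytime Routes": ["N Q", "6", "A C E", "6"],
    "GTFS Latitude": [40.7190, 40.7188, 40.7208, 40.7223],
})

# Number of records sharing each stop name
counts = toy.groupby("Stop Name").size()

# Names that appear more than once need the routes to tell them apart
ambiguous = counts[counts > 1].index.tolist()

print(ambiguous)
```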

To mitigate such errors, we use the combination including division, line, and stop name to uniquely locate a station with the unique Station ID. 

### Matching Stations

To find all turnstile data of Manhattan stations and assign them the correct geolocation information, we need to match the records against the station names in df_stations.

In [None]:
# Processing the turnstile dataset

# Only three divisions operate in Manhattan; count the turnstile stations they cover
print(df_tt[df_tt.DIVISION.isin(['BMT', 'IND', 'IRT'])].STATION.nunique())

# Create a unique search id for each station
df_tt['SEARCH_ID'] = df_tt[['STATION', 'LINENAME', 'DIVISION', 'STATION_IDX']].apply(lambda x: ','.join(x.dropna().astype(str)), axis=1)

In [None]:
# Obtain all stations in the turnstile dataset and save them into a list
tt_list = df_tt['SEARCH_ID'].unique().tolist()
tt_list = list(map(lambda x: x.split(','), tt_list))

# Convert the list to a dataframe
df_tt_list = pd.DataFrame(tt_list, columns=['Station', 'Routes', 'Division', 'STATION_IDX'])
df_tt_list.head()

# Adjust the name format: change "a-b" to "a - b"
def hyphen_adjust(x):
    if '-' in x:
        tmp = x.split('-')
        return ' - '.join(tmp)
    else:
        return x

#df_tt_list['Station'] = df_tt_list['Station'].apply(hyphen_adjust)
#df_tt_list['Routes'] = df_tt_list['Routes'].apply(lambda x: set(x))

# tqdm.pandas() registers progress_apply on pandas objects
from tqdm.notebook import tqdm
tqdm.pandas()
df_tt_list['Station'] = df_tt_list['Station'].progress_apply(hyphen_adjust)
df_tt_list['Routes'] = df_tt_list['Routes'].progress_apply(lambda x: set(x))
# a progress bar will appear when running the code

df_tt_list.head()

df_tt_list.shape

In [None]:
df_tt_list.head()

Next, we also need to obtain the search ID for the geolocation dataset.

In [None]:
# Processing the station profile dataset, i.e., the station geolocation reference
# Obtain unique ID for the qualified stations
df_stations['Search ID'] = df_stations[['Stop Name', 'Daytime Routes', 'Division', 'Station ID']].apply(lambda x: ','.join(x.dropna().astype(str)), axis=1)

df_stations.reset_index(inplace=True)

df_stations.head()

In [None]:
sta_list = df_stations['Search ID'].unique().tolist()

sta_list = list(map(lambda x: x.upper(), sta_list))  # Convert all names to upper case

sta_list = list(map(lambda x: x.split(','), sta_list))

df_station_list = pd.DataFrame(sta_list, columns=['Station', 'Routes', 'Division', 'Station ID'])  # save into a dataframe

df_station_list['Routes'] = df_station_list['Routes'].apply(lambda x: set(x.split(' '))) # split routes into a set

df_station_list.head()

df_station_list.shape

In [None]:
df_station_list.head()

Now, the job is to find these 153 Manhattan subway stations in the station list of the turnstile data.

In [None]:
# Define a new column to save the matched station
df_tt_list['Geo_ID']=df_tt_list['Station'].apply(lambda x: [])

# Define a new column to signal the match result in the reference station dataframe and set initial values to "False" 
df_station_list['Matched']=df_station_list['Station'].apply(lambda x: False)

In [None]:
'''
def station_match(tt, station):
    matched = False
    #print(f'tt: {tt}, station: {station}')
    if tt[2] == station[2]:  # the same division
        if tt[0] == station[0]:  # the same name
            #print(f'Station: {station} matches with TT: {tt}')
            if station[1].issubset(tt[1]):  # route set match
                matched = True
                print(f'Station: {station} matches with TT: {tt} in lines {station[1]}')
    return matched
'''
# A relaxed version that drops the division comparison, because some stations in the turnstile data
# are a union of multiple stations from different divisions.
# The station name plus the service routes can adequately define a unique station
def station_match(tt, station):
    matched = False
    #print(f'tt: {tt}, station: {station}')
    if tt[0] == station[0]:
        #print(f'Station: {station} matches with TT: {tt}')
        if station[1].issubset(tt[1]):
            matched = True
            print(f'Station: {station} matches with TT: {tt} in lines {station[1]}')
    return matched

In [None]:
count = 0
dTest = {}
dict_stations = {}
for i in range(df_station_list.shape[0]):
    for j in range(df_tt_list.shape[0]):
        if station_match(list(df_tt_list.loc[j]), list(df_station_list.loc[i])):
            count += 1
            # Add the Station ID to the turnstile record
            df_tt_list.loc[j]['Geo_ID'].append(df_station_list.at[i, "Station ID"])
            df_station_list.at[i, 'Matched'] = True
            if df_tt_list.loc[j]['STATION_IDX'] in dict_stations:  # key already present: extend its list
                dict_stations[df_tt_list.loc[j]['STATION_IDX']].append(df_station_list.at[i, "Station ID"])
            else:
                dict_stations[df_tt_list.loc[j]['STATION_IDX']] = [df_station_list.at[i, "Station ID"]]

In [None]:
len(dict_stations)

In [None]:
df_station_list['Matched'].value_counts()

We matched 112 stations by using the exact name search. Let's examine the unmatched cases.

In [None]:
df_station_list[~df_station_list['Matched']].head()

We need to manually find those unrecognized stations in the turnstile data. The main reasons for failed matches include format mismatches, different abbreviations, and different word order. Since we only have 41 such items, a manual correction is feasible. Such manual work is a routine part of data science projects.
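
Before searching by hand, candidate names can be shortlisted with approximate string matching. This is an optional aid rather than the method used in this notebook; it relies on `difflib.get_close_matches` from the Python standard library, and the name list and the `cutoff` value below are assumptions for illustration.

```python
from difflib import get_close_matches

# Hypothetical turnstile station names (a tiny illustrative subset)
tt_names = ["59 ST", "LEXINGTON AV/53", "LEXINGTON AV/63", "5 AV/59 ST", "CANAL ST"]

# Unmatched name from the geolocation dataset
target = "LEXINGTON AV/59 ST"

# Return up to three names whose similarity ratio is at least the cutoff
candidates = get_close_matches(target, tt_names, n=3, cutoff=0.5)

print(candidates)
```

A shortlist like this only narrows the search. Similar spelling does not guarantee the same station, so each candidate still needs a manual check against the routes.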

We will use one case as an example.

In [None]:
df_station_manual = df_station_list[~df_station_list['Matched']]
station_index = df_station_manual.index
iterId = 0
if iterId < len(station_index):
    print(f'Working on #{iterId} unmatched record with index: {station_index[iterId]}')
    print(f'The station name: {df_station_manual.iloc[iterId].Station}, routes: {df_station_manual.iloc[iterId].Routes}, div: {df_station_manual.iloc[iterId].Division}, Station ID: {df_station_manual.iloc[iterId]["Station ID"]}')   

In [None]:
#df_station_manual["Station ID"].T.tolist()

In [None]:
# Check the unmatched station one after another and save the result into the manually input dictionary
keyword2search = 'LEXINGTON'
df_tt_list[df_tt_list['Station'].str.contains(keyword2search)]

In the turnstile data, there is no station with the name "LEXINGTON AV/59 ST". With the keyword "LEXINGTON", the target is not among the returned stations. We then try another keyword, "59".

In [None]:
keyword2search = '59'
df_tt_list[df_tt_list['Station'].str.contains(keyword2search)]

Searching with the keyword "59" returns a number of stations, and the target is the third item in the list.

Here, I need to map the "STATION_IDX" value in the turnstile data, i.e., "t0004000", to the "Station ID" value in the geolocation data, i.e., "7".

I saved these manually identified pairs in a file, 'manual_map.txt'. 

In [None]:
# To facilitate the following processing, we further process the manually rendered mapping information and save the pairs into a csv file.
df_manual_match = pd.read_csv('manual_map.txt', skipinitialspace=True)

headers = ["turntile.station", "turntile.routes", "turntile.station_id", "geo.station", "geo.routes", "geo.station_id"]
rows2write = []
rows2write.append(headers)

for i in range(df_manual_match.shape[0]):
    geo_id = df_manual_match.at[i, "station_df_id"]
    index_tt = df_manual_match.at[i, "tt_df_id"]
    print('\n')
    print(f'Manually input station-geo pair #{i}')
    print(f'Turntile Station {df_tt_list.at[index_tt, "Station"]} w/ Routes: {df_tt_list.at[index_tt, "Routes"]}, Station ID: {df_tt_list.at[index_tt, "STATION_IDX"]}')
    print(f'Geo: {df_station_list.at[geo_id-1, "Station"]} w/ Routes: {df_station_list.at[geo_id-1, "Routes"]}')
    row = [df_tt_list.at[index_tt, "Station"], df_tt_list.at[index_tt, "Routes"], str(df_tt_list.at[index_tt, "STATION_IDX"]), \
          df_station_list.at[geo_id-1, "Station"], df_station_list.at[geo_id-1, "Routes"], str(df_station_list.at[geo_id-1, "Station ID"])]
    rows2write.append(row)

import csv
wrCSVfilename = 'turntile_station_map.csv'

# Open in write mode so that rerunning the cell does not append a second header row
with open(wrCSVfilename, mode='w', newline='') as rtd_file:
    csv_writer = csv.writer(rtd_file, delimiter=',', quotechar='"', quoting=csv.QUOTE_MINIMAL)
    for i in rows2write:
        csv_writer.writerow(i)
print('Writing Done!')

In [None]:
df_manual_map = pd.read_csv('turntile_station_map.csv')

In [None]:
df_manual_map.head()

In [None]:
df_manual_map.shape

In [None]:
df_manual_map.head()

Next, I add these manually found station mappings to the dictionary that already contains the pairs obtained in the automatic matching.
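
Merging pairs into a dictionary of lists is a common pattern, and `dict.setdefault` keeps it compact. A minimal sketch with made-up IDs (the keys and values below are placeholders, not actual mapping results):

```python
# Mapping from the automatic matching step (placeholder values)
mapping = {"t0001001": [10], "t0002002": [20]}

# Manually identified (turnstile index, Station ID) pairs (placeholder values)
manual_pairs = [("t0001001", 11), ("t0003003", 30)]

for tt_id, geo_id in manual_pairs:
    # Create an empty list for a new key, then append in either case
    mapping.setdefault(tt_id, []).append(geo_id)

print(mapping)
```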

In [None]:
for i in range(df_manual_map.shape[0]):
    if df_manual_map["turntile.station_id"][i] in dict_stations:
        dict_stations[df_manual_map["turntile.station_id"][i]].append(df_manual_map["geo.station_id"][i])
    else:
        dict_stations[df_manual_map["turntile.station_id"][i]] = [df_manual_map["geo.station_id"][i]]

In [None]:
df_manual_map["turntile.station_id"][0]

In [None]:
len(dict_stations)

In [None]:
for i in dict_stations:
    print(f'There is a key: {i} w/ value of {dict_stations[i]}')

I have created a dictionary that maps each turnstile station index to a list of Station IDs. Next, I will use this dictionary and the available information to create another dictionary that maps the turnstile station index to the station coordinates. As found earlier, some stations may be associated with multiple coordinates because they host multiple lines of service. In such a case, I will use the centroid of the point set instead.

In [None]:
df_stations.head()

In [None]:
# dict_stations
# df_stations

from statistics import mean

dict_loc = {}

for i in dict_stations:
    lat = df_stations[df_stations["Station ID"].isin(dict_stations[i])]["GTFS Latitude"].tolist()
    lat = mean(lat)
    lat = round(lat, 6)
    long = df_stations[df_stations["Station ID"].isin(dict_stations[i])]["GTFS Longitude"].tolist()
    long = mean(long)
    long = round(long, 6)
    print(f'Station: {i} has the coordinates: ({lat}, {long})')
    dict_loc[i] = (lat, long)

In [None]:
dict_loc

In [None]:
df_tt_m = df_tt[df_tt.STATION_IDX.isin(dict_stations)].copy()  # make a copy of a slice of dataframe df_tt

# If df_tt_m were instead created without .copy(), as in
# df_tt_m = df_tt[df_tt.STATION_IDX.isin(dict_stations)]
# later assignments to df_tt_m would trigger a SettingWithCopyWarning,
# because pandas cannot tell whether df_tt_m is a copy or a view of df_tt

df_tt_m.head()

In [None]:
print(f'Turntile data in Manhattan has {df_tt_m.shape[0]} records compared to the total {df_tt.shape[0]} data records')

In [None]:
#df_tt_m.loc[:,"LOCATION"] = df_tt_m.apply(lambda x: dict_loc[x["STATION_IDX"]], axis=1)

#df_tt_m.loc[:,"LOCATION"] = df_tt_m["STATION_IDX"].apply(lambda x: dict_loc[x])

df_tt_m["LOCATION"] = df_tt_m["STATION_IDX"].map(lambda x: dict_loc[x])

df_tt_m.head()

In [None]:
df_grouped = df_tt_m.groupby(['UNIT', 'SCP', 'STATION', 'LINENAME', 'DIVISION'])  # divide rows into groups based on selected columns as an index 

# Use diff() on the grouped data to calculate the difference between two consecutive rows of a specific column
# The first row of each group has a "NaN" value after the calculation
df_tt_m['ENTRIES_DIFF']=df_grouped[['ENTRIES']].diff()
df_tt_m['EXITS_DIFF']=df_grouped[['EXITS']].diff()

df_tt_m.head()

In [None]:
df_tt_m = df_tt_m[~df_tt_m['ENTRIES_DIFF'].isnull()]  # remove all rows with 'NaN' in the 'ENTRIES_DIFF' column

[Back to the top](#top)
<a name="cleaned"></a>  
### Cleaned Data

Now, we've obtained a clean dataframe for the MTA turnstile data. It contains 1) per-turnstile entry and exit counts every four hours, and 2) per-station geolocation coordinates.

In [None]:
df_tt_m.head(5)

We save the cleaned data into a csv for future analysis work.

In [None]:
df_tt_m.to_csv("modified_turntile_data.csv")

Different venues have different target clients, and their distribution may vary with location and time. In Manhattan, the rush hour is usually defined as 7 to 9 am and 4:30 pm to 7 pm.
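
To use this definition later, each record's timestamp can be labelled as rush hour or not. A minimal sketch, assuming the 7 to 9 am and 4:30 pm to 7 pm windows stated above; `is_rush_hour` is a hypothetical helper, and treating the end of each window as exclusive is an assumption.

```python
from datetime import time

# Rush-hour windows as (start, end) pairs; the end bound is treated as exclusive
RUSH_WINDOWS = [(time(7, 0), time(9, 0)), (time(16, 30), time(19, 0))]


def is_rush_hour(t: time) -> bool:
    """Return True if t falls inside any rush-hour window."""
    return any(start <= t < end for start, end in RUSH_WINDOWS)


print(is_rush_hour(time(8, 0)))
print(is_rush_hour(time(12, 0)))
print(is_rush_hour(time(16, 30)))
```

Because the turnstile counts cover periods of around four hours, a period can only be aligned with these windows approximately.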

In [None]:
# distance calculation between two points using their latitude and longitude information
# Use geopy module
# https://stackoverflow.com/questions/19412462/getting-distance-between-two-points-based-on-latitude-longitude

import geopy.distance

coords_1 = (40.576209, -73.967875)
coords_2 = (40.576507, -73.969445)

d = geopy.distance.distance(coords_1, coords_2).m
print(d)

In [None]:
df_grouped = df_tt.groupby(['UNIT', 'SCP', 'STATION', 'LINENAME', 'DIVISION'])
df_tt_grouped = df_tt.groupby(['STATION', 'LINENAME', 'DIVISION'])[['STATION', 'LINENAME', 'DIVISION']]