In [1]:
import os
import pandas as pd
import matplotlib.pyplot as plt
from collections import Counter
from itertools import chain

pd.set_option('display.max_columns', None)

This is a small exercise in exploring and cleaning the NYC MTA station metadata that accompanies the turnstile usage data.

In [45]:
df_station = pd.read_excel('http://web.mta.info/developers/resources/nyct/turnstile/Remote-Booth-Station.xls')
df_station['Line Name'] = df_station['Line Name'].map(str) ## Convert integer to string for stations with only number lines
df_station.head()

Unnamed: 0,Remote,Booth,Station,Line Name,Division
0,R001,A060,WHITEHALL ST,R1,BMT
1,R001,A058,WHITEHALL ST,R1,BMT
2,R001,R101S,SOUTH FERRY,R1,IRT
3,R002,A077,FULTON ST,ACJZ2345,BMT
4,R002,A081,FULTON ST,ACJZ2345,BMT


The *Line Name* column lists every train line that serves the station. Each NYC subway line is identified by a single character, so the first row, Whitehall Street station, is served by the __R__ and __1__ lines.
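As a minimal sketch of that one-character-per-line convention, the snippet below splits a few *Line Name* strings into individual lines. The `toy` frame is a hand-typed copy of three rows from the table above, and the names `toy`, `toy_long`, and `lines_at_fulton` are illustrative, not part of the original notebook.

```python
import pandas as pd

# Toy rows typed in from the head of the table above (not re-downloaded)
toy = pd.DataFrame({
    'Station': ['WHITEHALL ST', 'SOUTH FERRY', 'FULTON ST'],
    'Line Name': ['R1', 'R1', 'ACJZ2345'],
})

# Turn each Line Name string into a list of characters,
# then explode to one row per (station, line) pair
toy_long = toy.assign(Line=toy['Line Name'].map(list)).explode('Line')

# All lines that serve Fulton St, sorted (digits sort before letters)
lines_at_fulton = sorted(toy_long.loc[toy_long['Station'] == 'FULTON ST', 'Line'])
print(lines_at_fulton)
```

A long format like this makes per-line counts a plain `value_counts()` on the `Line` column, at the cost of repeating the station name on every row.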

I'll start with a simple count of how many stations each line serves. Even in the first 5 rows there are duplicate station name and line name pairs, so those are dropped first. The data is also restricted to the major divisions: BMT, IRT, and IND.

In [47]:
df_station_only = df_station[df_station['Division'].isin(['BMT', 'IND', 'IRT'])][['Station', 'Line Name']].drop_duplicates()

## Counting the number of stations per Line
df_line = pd.Series([item for sublist in df_station_only['Line Name'].map(lambda x: list(str(x))) for item in sublist])
df_line.groupby(df_line).count().rename_axis('Line Name').reset_index(name='Count').set_index('Line Name').T

Line Name,1,2,3,4,5,6,7,A,B,C,D,E,F,G,J,L,M,N,Q,R,S,Z
Count,50,63,48,41,57,47,26,55,50,49,45,32,52,24,35,30,43,42,43,56,14,25


The count for the __1__ line looks too high. According to Wikipedia it should serve 38 stations. Let's see what's going on here.

In [6]:
df_station_only[df_station_only['Line Name'].str.contains('1')]

Unnamed: 0,Station,Line Name
0,WHITEHALL ST,R1
2,SOUTH FERRY,R1
14,42 ST-PA BUS TE,ACENQRS1237
67,CHAMBERS ST,123
68,34 ST-PENN STA,123
72,42 ST-TIMES SQ,1237ACENQRS
74,42 ST-TIMES SQ,ACENQRS1237
81,125 ST,1
82,168 ST-BROADWAY,1AC
84,168 ST-BROADWAY,AC1


Several issues are causing duplicates here. Some entries have the same station name but list the lines in a different order, which `drop_duplicates()` does not catch. Even worse, _34 ST-PENN STA_ has both _ACE_ and _123ACE_ entries.
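One way to handle the ordering problem, sketched here on toy rows typed in from the output above, is to put each *Line Name* into a canonical order before dropping duplicates. The helper `normalize_lines` is a hypothetical name introduced for this sketch. It does not resolve the _ACE_ versus _123ACE_ case, which still needs the entries for a station to be merged.

```python
import pandas as pd

# Toy rows typed in from the output above
toy = pd.DataFrame({
    'Station': ['168 ST-BROADWAY', '168 ST-BROADWAY',
                '42 ST-TIMES SQ', '42 ST-TIMES SQ'],
    'Line Name': ['1AC', 'AC1', '1237ACENQRS', 'ACENQRS1237'],
})

def normalize_lines(line_name):
    """Return the line characters de-duplicated and in sorted order."""
    return ''.join(sorted(set(line_name)))

# Without normalization every row looks unique
before = len(toy.drop_duplicates())

# After normalization, reordered entries collapse into one row per station
toy['Line Name'] = toy['Line Name'].map(normalize_lines)
after = len(toy.drop_duplicates())

print(before, after)
```

Sorting also makes the resulting strings deterministic, whereas joining a bare `set` can yield the characters in any order.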

Some station names contain _METROCARD_; these probably just need to be filtered out.

There are also duplicate entries such as _42 ST-PA BUS TE_ and _42 ST-TIMES SQ_, which belong to the same station complex. These will be filtered out manually.

In [52]:
df_station_only = df_station[df_station['Division'].isin(['BMT', 'IND', 'IRT'])][['Station', 'Line Name']]

## Concatenate all line name for every unique station name. 
df_station_only = df_station_only.groupby(['Station'])['Line Name'].apply(lambda x: ''.join(x)).reset_index()

## Reduce each concatenated string to its unique characters (via a set) and join them back into a string
df_station_only['Line Name'] = df_station_only['Line Name'].map(lambda x: ''.join(list(set(x))))

## Unfortunately, some additional filtering is needed...
df_station_only = df_station_only[~df_station_only['Station'].str.contains('METROCARD')]
duplicate_station_list = ['42 ST-PA BUS TE', '6 AVE', '14 ST-6 AVE']
df_station_only = df_station_only[~df_station_only['Station'].isin(duplicate_station_list)]
df_station_only

Unnamed: 0,Station,Line Name
0,1 AVE,L
1,103 ST,1BC6
2,103 ST-CORONA,7
3,104 ST,JZ
4,110 ST,6
5,110 ST-CATHEDRL,1
6,110 ST-CPN,32
7,111 ST,7J
8,116 ST,B2C63
9,116 ST-COLUMBIA,1


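Looks good, except that some physically different stations share the same name and are now merged into one row. _14 ST_ is one such example. For the purpose of counting the number of stations served by each line, this is acceptable, although it will undercount lines that stop at more than one station with the same name.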
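A rough way to spot such name collisions, sketched under the assumption that entries with no line in common belong to different physical stations, is to test each station name's *Line Name* entries for disjoint sets. The rows and line names below are made up for illustration and are not copied from the dataset, and `has_disjoint_entries` is a hypothetical helper.

```python
import pandas as pd

# Illustrative rows only: these Line Name values are invented for the sketch
toy = pd.DataFrame({
    'Station': ['14 ST', '14 ST', '1 AVE', '1 AVE'],
    'Line Name': ['123', 'ACEL', 'L', 'L'],
})

def has_disjoint_entries(line_names):
    """True if any two entries for a station name share no line at all."""
    sets = [set(name) for name in line_names]
    return any(a.isdisjoint(b)
               for i, a in enumerate(sets)
               for b in sets[i + 1:])

# One boolean per station name; True marks a likely name collision
suspects = toy.groupby('Station')['Line Name'].apply(has_disjoint_entries)
suspect_names = sorted(suspects[suspects].index)
print(suspect_names)
```

This heuristic would miss same-named stations that happen to share a line, so it is only a starting point for manual review.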

In [50]:
## Alternate way using chain and Counter
pd.Series(Counter(chain(*df_station_only['Line Name']))).rename_axis('Line Name').reset_index(name='Count').set_index('Line Name').T

Line Name,1,2,3,4,5,6,7,A,B,C,D,E,F,G,J,L,M,N,Q,R,S,Z
Count,40,52,37,33,48,43,24,49,47,43,42,29,50,24,32,25,40,36,38,48,11,22


The count of 40 for the __1__ line still does not match the expected 38, but it is much closer.