<a id='Capstone 2: National Transit Database'></a>

# Capstone 2: *National Transit Database*

## Introduction

This project uses the Monthly Modal Time Series data from the National Transit Database (NTD).  The data is collected from participating transit authorities across the United States and covers 2014 to the present day.

## Column Definitions

### Columns in dataframe which benefit from explanation



| Column Name | Definition (as needed - not all columns included here) [NTD Glossary link](https://www.transit.dot.gov/ntd/national-transit-database-ntd-glossary)|
|:---------|:-------------|
|5 DIgit NTD ID|National Transit Database ID number currently in use|
|4 Digit NTD ID|Legacy ID number|
|Agency| Name of the Reporting Agency|
|Organization Type|Distinguishes between government, private, and independent public agencies|
|Mode|AR: Alaska Railroad|
||CB: Commuter Bus|
||CC: Cable Car|
||CR: Commuter Rail|
||DR: Demand Response|
||FB: Ferryboat|
||HR: Heavy Rail|
||IP: Inclined Plane|
||LR: Light Rail|
||MB: Bus|
||MG: Monorail and Automated Guideway|
||PB: Publico|
||RB: Bus Rapid Transit|
||SR: Streetcar Rail|
||TB: Trolleybus|
||TR: Aerial Tramway|
||VP: Vanpool|
||YR: Hybrid Rail|
|Type of Service|DO: Directly Operated|
||PT: Purchased Transportation|
||TX: Taxi|
||TN: Transportation Network Company (new TOS effective Sept 2019) [NTD Guidance](https://www.transit.dot.gov/sites/fta.dot.gov/files/docs/NTD%202108%20FRN%20Webinar%20Presentation.pdf)|
|Primary UZA Code|*(see note below table)* Numerical ranking by urbanized area population size| 
|Primary UZA Name|*(see note below table)*|
|Primary UZA Sq Miles|*(see note below table)*|
|Primary UZA Population|*(see note below table)*|


**Urbanized Area (UZA):** An urbanized area is an incorporated area with a population of 50,000 or more that is designated as such by the U.S. Department of Commerce, Bureau of the Census.
The Census Bureau delineates urban areas after each decennial census by applying specified criteria to decennial census and other data.

**Non-Rail Modes:**
Transit modes whose vehicles typically operate on roadways (streets, highways, or expressways) but may also operate on waterways (ferryboat (FB)) or via aerial cable (aerial tramway (TR)). Vehicles are typically powered by motors onboard the vehicle, with one exception: aerial tramway (TR) vehicles are pulled along an overhead cable by an electric motor that is not onboard the vehicle. NTD recognizes eleven non-rail modes:
1.   Aerial Tramway (TR)
2.   Bus (MB)
3.   Bus rapid transit (RB)
4.   Commuter bus (CB)
5.   Demand Response (DR)
6.   Demand taxi (DT)
7.   Ferryboat (FB)
8.   Jitney (JT)
9.   Publico (PB)
10.   Trolleybus (TB), and
11.   Vanpool (VP).
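
The mode list above can double as a consistency check against the `Rail (Y/N)` column. The following is a minimal sketch, assuming the eleven non-rail codes listed above; the sample rows are made up for illustration and are not taken from the dataset.

```python
import pandas as pd

# Non-rail mode codes as listed in the glossary excerpt above.
NON_RAIL_MODES = {"TR", "MB", "RB", "CB", "DR", "DT", "FB", "JT", "PB", "TB", "VP"}

# Hypothetical sample rows; the real frame has many more columns.
sample = pd.DataFrame({
    "Mode": ["LR", "MB", "DR", "HR", "FB"],
    "Rail (Y/N)": [True, False, False, True, True],  # last row is deliberately wrong
})

# A row is inconsistent when a non-rail mode is flagged as rail.
is_non_rail = sample["Mode"].isin(NON_RAIL_MODES)
inconsistent = sample[is_non_rail & sample["Rail (Y/N)"]]
print(inconsistent)
```

On the real data, an empty result would suggest that the `Rail (Y/N)` flag agrees with the mode codes.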


<div class="alert alert-info">
  <strong>Importing Dependencies</strong> </div>

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np

<div class="alert alert-info">
  <strong>Importing File: </strong> https://www.transit.dot.gov/ntd/national-transit-database-ntd-glossary.
  
  There are NaNs in 'Rail (Y/N)', so I cannot change the data type to bool right now.  This will be addressed later in the project.
</div>

In [2]:
file = 'Monthly_Modal_Time_Series.csv'
data = pd.read_csv(file, dtype={'5 DIgit NTD ID': str, '4 Digit NTD ID': str})
data.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 133196 entries, 0 to 133195
Data columns (total 65 columns):
 #   Column                                      Non-Null Count   Dtype  
---  ------                                      --------------   -----  
 0   5 DIgit NTD ID                              133196 non-null  object 
 1   4 Digit NTD ID                              130151 non-null  object 
 2   Agency                                      133148 non-null  object 
 3   Organization Type                           128819 non-null  object 
 4   Mode                                        133148 non-null  object 
 5   Type of Service                             133148 non-null  object 
 6   Rail (Y/N)                                  133148 non-null  object 
 7   Primary UZA Code                            133196 non-null  int64  
 8   Primary UZA Name                            128807 non-null  object 
 9   Primary UZA Sq Miles                        128855 non-null  float64
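
The `dtype` argument in the import above matters because the ID columns contain leading zeros (for example `00008`). The sketch below illustrates the difference on a hypothetical two-row CSV held in memory; the agency names are placeholders, not real records.

```python
import io
import pandas as pd

# Hypothetical two-row CSV standing in for the real file.
csv_text = (
    "5 DIgit NTD ID,4 Digit NTD ID,Agency\n"
    "00008,0008,Agency A\n"
    "20008,2008,Agency B\n"
)

# Default parsing infers integers and drops the leading zeros.
as_numbers = pd.read_csv(io.StringIO(csv_text))

# Forcing str keeps the IDs exactly as written in the file.
as_text = pd.read_csv(
    io.StringIO(csv_text),
    dtype={"5 DIgit NTD ID": str, "4 Digit NTD ID": str},
)

print(as_numbers["5 DIgit NTD ID"].tolist())
print(as_text["5 DIgit NTD ID"].tolist())
```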
 

<div class="alert alert-info">
  <strong>Taking a look at the dataframe</strong> </div>

In [3]:
data.head()

Unnamed: 0,5 DIgit NTD ID,4 Digit NTD ID,Agency,Organization Type,Mode,Type of Service,Rail (Y/N),Primary UZA Code,Primary UZA Name,Primary UZA Sq Miles,...,Pedestrian in Corsswalk Injuries,Pedestrian Not in Crosswalk Injuries,Pedestrian Crossing Tracks Injuries,Pedestrian Walking Along Tracks Injuries,Other Vehicle Occupant Injuries,Other Injuries,Trespasser Injuries,Suicide Injuries,Total Other Injuries,Total Injuries
0,8,8,Tri-County Metropolitan Transportation Distric...,Independent Public Agency or Authority of Tran...,LR,DO,True,24,"Portland, OR-WA",524.0,...,0,0,0,0,1,0,0,0,1,1
1,8,8,Tri-County Metropolitan Transportation Distric...,Independent Public Agency or Authority of Tran...,MB,DO,False,24,"Portland, OR-WA",524.0,...,0,1,0,0,3,0,0,0,4,11
2,20008,2008,MTA New York City Transit,"Subsidiary Unit of a Transit Agency, Reporting...",DR,PT,False,1,"New York-Newark, NY-NJ-CT",3450.0,...,1,2,0,0,6,0,0,0,9,17
3,20008,2008,MTA New York City Transit,"Subsidiary Unit of a Transit Agency, Reporting...",HR,DO,True,1,"New York-Newark, NY-NJ-CT",3450.0,...,0,0,0,0,0,0,0,2,2,9
4,20008,2008,MTA New York City Transit,"Subsidiary Unit of a Transit Agency, Reporting...",MB,DO,False,1,"New York-Newark, NY-NJ-CT",3450.0,...,1,2,0,2,21,0,0,0,27,42


<div class="alert alert-danger">
  <strong>Type of Service (TOS) "TN":  </strong> This type of service was not in effect for the entire life of the database, which may change how we use this field.  The documentation from NTD shows that the new TOS designation "TN" came into use in Sept 2019, but it first occurs in the database in July 2018. </div>

In [4]:
temp_df = data[data['Type of Service']=='TN']
temp_df[['Agency','Mode', 'Month','Year','Type of Service']].sort_values(by=['Year','Month'],ascending=True).head(10)

Unnamed: 0,Agency,Mode,Month,Year,Type of Service
74105,The Eastern Contra Costa Transit Authority,DR,7,2018,TN
74106,The Eastern Contra Costa Transit Authority,DR,8,2018,TN
74107,The Eastern Contra Costa Transit Authority,DR,9,2018,TN
74108,The Eastern Contra Costa Transit Authority,DR,10,2018,TN
74109,The Eastern Contra Costa Transit Authority,DR,11,2018,TN
74110,The Eastern Contra Costa Transit Authority,DR,12,2018,TN
74111,The Eastern Contra Costa Transit Authority,DR,1,2019,TN
74112,The Eastern Contra Costa Transit Authority,DR,2,2019,TN
74113,The Eastern Contra Costa Transit Authority,DR,3,2019,TN
74114,The Eastern Contra Costa Transit Authority,DR,4,2019,TN
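
The same question (when does each type of service first appear?) can be answered for every TOS value at once. This is a minimal sketch on made-up rows, assuming `Year` and `Month` are integer columns as in the output above.

```python
import pandas as pd

# Hypothetical rows; only the columns needed for the check are included.
sample = pd.DataFrame({
    "Type of Service": ["DO", "DO", "TN", "TN", "PT"],
    "Year": [2014, 2014, 2019, 2018, 2015],
    "Month": [1, 2, 1, 7, 3],
})

# Build a sortable date from Year and Month (day fixed to 1).
dates = pd.to_datetime(
    sample[["Year", "Month"]]
    .rename(columns={"Year": "year", "Month": "month"})
    .assign(day=1)
)

# Earliest reporting month for each type of service.
first_seen = dates.groupby(sample["Type of Service"]).min()
print(first_seen)
```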


<div class="alert alert-info">
  <strong>Missing Values:  </strong> From looking at the data online, I know that there are missing values in the "5 DIgit NTD ID" field, so this will require some digging. </div>

In [5]:
missing = pd.concat([data.isnull().sum(), 100 * data.isnull().mean()], axis=1)
missing.columns=['count', '%']
missing.head(20)

Unnamed: 0,count,%
5 DIgit NTD ID,0,0.0
4 Digit NTD ID,3045,2.286105
Agency,48,0.036037
Organization Type,4377,3.286135
Mode,48,0.036037
Type of Service,48,0.036037
Rail (Y/N),48,0.036037
Primary UZA Code,0,0.0
Primary UZA Name,4389,3.295144
Primary UZA Sq Miles,4341,3.259107


<div class="alert alert-info">
  <strong>Checking Unique Values:  </strong> The numbers of unique values in "5 DIgit NTD ID" and "Agency" are pretty close.  There are fewer unique values in "4 Digit NTD ID" since this is the legacy ID.  It will be better to use the 5 digit ID instead of the 4 digit ID going forward. </div>

In [6]:
data['5 DIgit NTD ID'].nunique()

609

In [7]:
data['4 Digit NTD ID'].nunique()

571

In [8]:
data['Agency'].nunique()

607

<div class="alert alert-info">
  <strong>Checking the pairing of "5 DIgit NTD ID" and "Agency":  </strong> No obvious occurrences of an ID being associated with more than one Agency or vice versa.  Also no obvious explanation for the small difference in the number of IDs vs Agencies. </div>

In [9]:
subset = data[['5 DIgit NTD ID', 'Agency']]
tuples = [tuple(x) for x in subset.to_numpy()]
unique = list(set(tuples))
print(len(unique))

609


In [10]:
print(unique)

[('10005', 'Lowell Regional Transit Authority'), ('90092', 'City of Fairfield, California'), ('30014', 'Cumberland Dauphin-Harrisburg Transit Authority'), ('40102', 'Waccamaw Regional Transportation Authority'), ('00047', 'City of Corvallis'), ('20080', 'New Jersey Transit Corporation'), ('90090', 'Yolo County Transportation District'), ('20075', 'Port Authority Transit Corporation'), ('50167', 'South Lake County Community Services, Inc.'), ('60134', 'The Woodlands Township'), ('50209', 'Central Indiana Regional Transportation Authority'), ('10014', 'Worcester Regional Transit Authority'), ('20122', 'Academy Lines, Inc.'), ('20199', 'County of Atlantic'), ('50029', 'Bay Metropolitan Transit Authority'), ('90030', 'North County Transit District'), ('20149', 'Rockland Coaches, Inc.'), ('90157', 'Access Services'), ('30094', 'City of Harrisonburg'), ('30018', 'Red Rose Transit Authority'), ('10183', "Woods Hole, Martha's Vineyard and Nantucket Steamship Authority"), ('10126', 'Worcester R
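
Scanning the printed list by eye is error-prone, so a grouped count may be a more reliable way to test the pairing. The sketch below uses hypothetical IDs and agency names; note that `groupby` drops missing keys by default, so rows without an agency would not be counted.

```python
import pandas as pd

# Hypothetical ID/agency pairs; one agency name appears under two IDs.
sample = pd.DataFrame({
    "5 DIgit NTD ID": ["00001", "00001", "10014", "10126", "00002"],
    "Agency": ["Agency A", "Agency A", "Agency B", "Agency B", "Agency C"],
})

pairs = sample[["5 DIgit NTD ID", "Agency"]].drop_duplicates()

# Count distinct IDs per agency and distinct agencies per ID.
ids_per_agency = pairs.groupby("Agency")["5 DIgit NTD ID"].nunique()
agencies_per_id = pairs.groupby("5 DIgit NTD ID")["Agency"].nunique()

print(ids_per_agency[ids_per_agency > 1])    # agencies reporting under several IDs
print(agencies_per_id[agencies_per_id > 1])  # IDs shared by several agencies
```

Any agency listed in the first result would account for part of the gap between the ID count and the agency count.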

<div class="alert alert-info">
  <strong>Looking for Irregularities:  </strong> Interesting quotation mark found for "5 DIgit NTD ID" at index 246 </div>

In [11]:
data[['5 DIgit NTD ID', 'Agency']].drop_duplicates().sort_values('5 DIgit NTD ID').head(50)

Unnamed: 0,5 DIgit NTD ID,Agency
246,"""",
9564,00001,King County Department of Metro Transit
3874,00002,Spokane Transit Authority
4828,00003,Pierce County Transportation Benefit Area Auth...
6770,00005,City of Everett
7897,00006,City of Yakima
4855,00007,Lane Transit District
0,00008,Tri-County Metropolitan Transportation Distric...
8599,00011,Valley Regional Transit
6741,00012,Municipality of Anchorage


<div class="alert alert-info">
  Index 246 appears not to contain any useful information. </div>

In [12]:
data.iloc[246].head(50)

5 DIgit NTD ID                                   "
4 Digit NTD ID                                 NaN
Agency                                         NaN
Organization Type                              NaN
Mode                                           NaN
Type of Service                                NaN
Rail (Y/N)                                     NaN
Primary UZA Code                                 0
Primary UZA Name                               NaN
Primary UZA Sq Miles                           0.0
Primary UZA Population                         0.0
Service Area Sq Miles                          0.0
Service Area Population                        0.0
Year                                          2014
Month                                           10
Vehicles                                         0
Vehicle Revenue Miles                            0
Vehicle Revenue Hours                            0
Ridership                                        0
Collisions with Motor Vehicle  

<div class="alert alert-info">
  <strong>Looking for NaNs in "Agency":  </strong> If there is no way to identify the reporting agency, region, or even organization type, then the data is not useful for the purposes of this project and can be removed. </div>

In [13]:
data[data['Agency'].isna()]

Unnamed: 0,5 DIgit NTD ID,4 Digit NTD ID,Agency,Organization Type,Mode,Type of Service,Rail (Y/N),Primary UZA Code,Primary UZA Name,Primary UZA Sq Miles,...,Pedestrian in Corsswalk Injuries,Pedestrian Not in Crosswalk Injuries,Pedestrian Crossing Tracks Injuries,Pedestrian Walking Along Tracks Injuries,Other Vehicle Occupant Injuries,Other Injuries,Trespasser Injuries,Suicide Injuries,Total Other Injuries,Total Injuries
246,"""",,,,,,,0,,0.0,...,0,0,0,0,0,0,0,0,0,0
247,"""",,,,,,,0,,0.0,...,0,0,0,0,0,0,0,0,0,0
248,"""",,,,,,,0,,0.0,...,0,0,0,0,0,0,0,0,0,0
122815,"""",,,,,,,0,,0.0,...,0,0,0,0,0,0,0,0,0,0
122816,"""",,,,,,,0,,0.0,...,0,0,0,0,0,0,0,0,0,0
122817,"""",,,,,,,0,,0.0,...,0,0,0,0,0,0,0,0,0,0
122818,"""",,,,,,,0,,0.0,...,0,0,0,0,0,0,0,0,0,0
125320,"""",,,,,,,0,,0.0,...,0,0,0,0,1,0,0,0,1,1
125321,"""",,,,,,,0,,0.0,...,0,0,0,0,0,0,0,0,0,0
125322,"""",,,,,,,0,,0.0,...,0,0,0,0,0,0,0,0,1,1


<div class="alert alert-info">
  <strong>Looking for NaNs in "5 DIgit NTD ID":  </strong> No NaNs show up in the search, but searching for the quotation mark (as found at index 246) reveals the issue.  These are the same 48 rows that we found when searching for NaNs in "Agency".  Mystery solved. </div>

In [14]:
data[data['5 DIgit NTD ID'].isna()]

Unnamed: 0,5 DIgit NTD ID,4 Digit NTD ID,Agency,Organization Type,Mode,Type of Service,Rail (Y/N),Primary UZA Code,Primary UZA Name,Primary UZA Sq Miles,...,Pedestrian in Corsswalk Injuries,Pedestrian Not in Crosswalk Injuries,Pedestrian Crossing Tracks Injuries,Pedestrian Walking Along Tracks Injuries,Other Vehicle Occupant Injuries,Other Injuries,Trespasser Injuries,Suicide Injuries,Total Other Injuries,Total Injuries


In [15]:
data[data['5 DIgit NTD ID']==r'"']

Unnamed: 0,5 DIgit NTD ID,4 Digit NTD ID,Agency,Organization Type,Mode,Type of Service,Rail (Y/N),Primary UZA Code,Primary UZA Name,Primary UZA Sq Miles,...,Pedestrian in Corsswalk Injuries,Pedestrian Not in Crosswalk Injuries,Pedestrian Crossing Tracks Injuries,Pedestrian Walking Along Tracks Injuries,Other Vehicle Occupant Injuries,Other Injuries,Trespasser Injuries,Suicide Injuries,Total Other Injuries,Total Injuries
246,"""",,,,,,,0,,0.0,...,0,0,0,0,0,0,0,0,0,0
247,"""",,,,,,,0,,0.0,...,0,0,0,0,0,0,0,0,0,0
248,"""",,,,,,,0,,0.0,...,0,0,0,0,0,0,0,0,0,0
122815,"""",,,,,,,0,,0.0,...,0,0,0,0,0,0,0,0,0,0
122816,"""",,,,,,,0,,0.0,...,0,0,0,0,0,0,0,0,0,0
122817,"""",,,,,,,0,,0.0,...,0,0,0,0,0,0,0,0,0,0
122818,"""",,,,,,,0,,0.0,...,0,0,0,0,0,0,0,0,0,0
125320,"""",,,,,,,0,,0.0,...,0,0,0,0,1,0,0,0,1,1
125321,"""",,,,,,,0,,0.0,...,0,0,0,0,0,0,0,0,0,0
125322,"""",,,,,,,0,,0.0,...,0,0,0,0,0,0,0,0,1,1
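
Placeholder strings like this one are invisible to `isna()` because they are non-empty text. One way to catch them is to test each ID against an expected pattern. The sketch below assumes IDs consist only of letters and digits, which is an assumption rather than a documented NTD rule, and the ID `9R02` is invented for illustration.

```python
import pandas as pd

# Hypothetical ID column containing one stray quotation mark.
ids = pd.Series(["00001", '"', "20008", "9R02"], name="5 DIgit NTD ID")

# isna() misses the placeholder because it is a non-empty string.
print(ids.isna().sum())

# Flag anything that is not purely letters and digits (assumed ID format).
looks_valid = ids.str.match(r"^[0-9A-Za-z]+$")
suspicious = ids[~looks_valid]
print(suspicious)
```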


<div class="alert alert-info">
  <strong>Removing the rows of data and checking to make sure they are gone.  </strong> </div>

In [16]:
data = data[data['Agency'].notna()].copy()  # copy so later column assignments act on an independent frame

In [17]:
data[data['Agency'].isna()]

Unnamed: 0,5 DIgit NTD ID,4 Digit NTD ID,Agency,Organization Type,Mode,Type of Service,Rail (Y/N),Primary UZA Code,Primary UZA Name,Primary UZA Sq Miles,...,Pedestrian in Corsswalk Injuries,Pedestrian Not in Crosswalk Injuries,Pedestrian Crossing Tracks Injuries,Pedestrian Walking Along Tracks Injuries,Other Vehicle Occupant Injuries,Other Injuries,Trespasser Injuries,Suicide Injuries,Total Other Injuries,Total Injuries


In [18]:
data[data['5 DIgit NTD ID']==r'"']

Unnamed: 0,5 DIgit NTD ID,4 Digit NTD ID,Agency,Organization Type,Mode,Type of Service,Rail (Y/N),Primary UZA Code,Primary UZA Name,Primary UZA Sq Miles,...,Pedestrian in Corsswalk Injuries,Pedestrian Not in Crosswalk Injuries,Pedestrian Crossing Tracks Injuries,Pedestrian Walking Along Tracks Injuries,Other Vehicle Occupant Injuries,Other Injuries,Trespasser Injuries,Suicide Injuries,Total Other Injuries,Total Injuries


<div class="alert alert-info">
  <strong>"Rail (Y/N)" column data type: </strong> Removing the NaNs in the Agency column fixed the NaNs in the 'Rail (Y/N)' column.  So, I'm able to change the data type to bool now.</div>

In [19]:
missing = pd.concat([data.isnull().sum(), 100 * data.isnull().mean()], axis=1)
missing.columns=['count', '%']
missing.head(20)

Unnamed: 0,count,%
5 DIgit NTD ID,0,0.0
4 Digit NTD ID,2997,2.250879
Agency,0,0.0
Organization Type,4329,3.251269
Mode,0,0.0
Type of Service,0,0.0
Rail (Y/N),0,0.0
Primary UZA Code,0,0.0
Primary UZA Name,4341,3.260282
Primary UZA Sq Miles,4341,3.260282


In [20]:
data['Rail (Y/N)'] = data['Rail (Y/N)'].astype(bool)

In [21]:
data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 133148 entries, 0 to 133190
Data columns (total 65 columns):
 #   Column                                      Non-Null Count   Dtype  
---  ------                                      --------------   -----  
 0   5 DIgit NTD ID                              133148 non-null  object 
 1   4 Digit NTD ID                              130151 non-null  object 
 2   Agency                                      133148 non-null  object 
 3   Organization Type                           128819 non-null  object 
 4   Mode                                        133148 non-null  object 
 5   Type of Service                             133148 non-null  object 
 6   Rail (Y/N)                                  133148 non-null  bool   
 7   Primary UZA Code                            133148 non-null  int64  
 8   Primary UZA Name                            128807 non-null  object 
 9   Primary UZA Sq Miles                        128807 non-null  float64
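
A caution on this conversion: `astype(bool)` gives the intended result only when the column already holds real booleans. If the values had been read as text, every non-empty string would become `True`. The sketch below shows the difference on a hypothetical text column; it is not a claim about how the real column was parsed.

```python
import pandas as pd

# Hypothetical column read as text instead of booleans.
rail_text = pd.Series(["True", "False", "True"], dtype=object, name="Rail (Y/N)")

# astype(bool) treats every non-empty string as True.
naive = rail_text.astype(bool)

# An explicit mapping converts the text to the intended values.
mapped = rail_text.map({"True": True, "False": False})

print(naive.tolist())
print(mapped.tolist())
```

Comparing `value_counts()` before and after the conversion is a quick way to confirm that no `False` values were lost.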
 

<div class="alert alert-info">
  <strong>Rechecking "5 DIgit NTD ID" and "Agency" for uniqueness: </strong> The counts still differ by one.  I might choose to explore this more later, but I'll leave it for now.</div>

In [22]:
data['5 DIgit NTD ID'].nunique()

608

In [23]:
data['Agency'].nunique()

607

<div class="alert alert-info">
  <strong>Checking for anomalies or missing data in the "Year" column:</strong>  None found.</div>

In [24]:
data[['5 DIgit NTD ID','Agency','Year']].sort_values('Year', ascending=False).head(50)

Unnamed: 0,5 DIgit NTD ID,Agency,Year
133190,90030,North County Transit District,2022
110740,30083,Transportation District Commission of Hampton ...,2022
110738,30083,Transportation District Commission of Hampton ...,2022
110737,30083,Transportation District Commission of Hampton ...,2022
110736,30083,Transportation District Commission of Hampton ...,2022
121984,50519,Minnesota Valley Transit Authority,2022
121983,50519,Minnesota Valley Transit Authority,2022
121982,50519,Minnesota Valley Transit Authority,2022
121981,50519,Minnesota Valley Transit Authority,2022
110729,30083,Transportation District Commission of Hampton ...,2022
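
Sorting and reading the top rows only shows the largest years. A range check covers the whole column. The following is a minimal sketch on made-up rows, assuming valid years run from 2014 to 2022 (the latest year visible above) and valid months from 1 to 12.

```python
import pandas as pd

# Hypothetical rows, including one out-of-range month and one missing year.
sample = pd.DataFrame({
    "Year": [2014, 2018, float("nan"), 2022],
    "Month": [1, 13, 6, 12],
})

# Rows with a missing Year, or a Year/Month outside the assumed range.
bad_year = sample["Year"].isna() | ~sample["Year"].between(2014, 2022)
bad_month = ~sample["Month"].between(1, 12)

problems = sample[bad_year | bad_month]
print(problems)
```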


<div class="alert alert-danger">
<strong>Discussion Point (for Nov 9, 2022):</strong>
    
How shall I treat the rows where Ridership is zero?  What can we understand from a zero in that field?
    
Possible options:   
* Exclude the data
    
* Leave as is
    
* Replace with an average - multiple methods possible to do this

I need to have a better understanding of why ridership is zero during these months/years before I can make a decision.</div>
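
To make the options concrete, here is a hedged sketch of the first and third options on made-up ridership numbers. Filling with the mean of the same ID/mode series is only one of several possible averaging methods, and nothing here has been decided for the real data.

```python
import pandas as pd

# Hypothetical monthly ridership for two ID/mode series.
sample = pd.DataFrame({
    "5 DIgit NTD ID": ["00006"] * 4 + ["00016"] * 4,
    "Mode": ["MB"] * 4 + ["VP"] * 4,
    "Ridership": [100, 0, 120, 110, 0, 40, 60, 50],
})

# Option 1: exclude rows with zero ridership.
excluded = sample[sample["Ridership"] != 0]

# Option 3: treat zeros as missing, then fill with the mean of the same series.
as_missing = sample["Ridership"].mask(sample["Ridership"] == 0)
series_mean = as_missing.groupby(
    [sample["5 DIgit NTD ID"], sample["Mode"]]
).transform("mean")
filled = sample.assign(Ridership=as_missing.fillna(series_mean))

print(len(excluded))
print(filled)
```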

<div class="alert alert-info">
  <strong>Checking "Ridership": </strong> A number of rows report zero for "Ridership."  Is this data correct, or is the zero a placeholder for data that is not available?  How shall I treat this data?</div>

In [25]:
data[['5 DIgit NTD ID','Agency','Month','Year','Ridership']].sort_values(by = ['Ridership', '5 DIgit NTD ID','Year','Month'], ascending=True).head(50)

Unnamed: 0,5 DIgit NTD ID,Agency,Month,Year,Ridership
7933,6,City of Yakima,1,2014,0
7934,6,City of Yakima,2,2014,0
7935,6,City of Yakima,3,2014,0
7936,6,City of Yakima,4,2014,0
7937,6,City of Yakima,5,2014,0
106373,6,City of Yakima,4,2020,0
106374,6,City of Yakima,5,2020,0
106375,6,City of Yakima,6,2020,0
106376,6,City of Yakima,7,2020,0
106377,6,City of Yakima,8,2020,0


<div class="alert alert-danger">
  <strong>Checking "Ridership": </strong> There are 1562 rows of data where "Ridership" is zero.</div>

In [26]:
data[data['Ridership']==0].shape

(1562, 65)
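
A breakdown of the zero rows by agency and year may help show whether the zeros cluster in particular periods. This sketch uses invented agencies and values.

```python
import pandas as pd

# Hypothetical rows; only the columns needed for the breakdown are included.
sample = pd.DataFrame({
    "Agency": ["City A", "City A", "City A", "City B", "City B"],
    "Year": [2014, 2014, 2020, 2020, 2021],
    "Ridership": [0, 0, 0, 500, 0],
})

# Count zero-ridership rows per agency and year, one column per year.
zero_rows = sample[sample["Ridership"] == 0]
zero_counts = zero_rows.groupby(["Agency", "Year"]).size().unstack(fill_value=0)
print(zero_counts)
```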

<div class="alert alert-info">
  <strong>Checking "Ridership": </strong> Looking at one index number from above to see its data.  At least it is consistent: zero safety data for zero riders.  I checked several fields but may need to do a more diligent exploration before being able to decide what to do with the 1562 rows.</div>

In [27]:
data.iloc[57611].head(50)

5 DIgit NTD ID                                                                            00016
4 Digit NTD ID                                                                             0016
Agency                                                                      RiverCities Transit
Organization Type                             City, County or Local Government Unit or Depar...
Mode                                                                                         VP
Type of Service                                                                              DO
Rail (Y/N)                                                                                False
Primary UZA Code                                                                            431
Primary UZA Name                                                                Longview, WA-OR
Primary UZA Sq Miles                                                                       33.0
Primary UZA Population                  
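
One caveat when looking up a row by its index number: after rows have been filtered out, `iloc` (position) and `loc` (index label) no longer point at the same row. The sketch below shows the difference on a small hypothetical frame.

```python
import pandas as pd

# Hypothetical frame with the default 0..4 index.
sample = pd.DataFrame({"Agency": ["A", None, "C", "D", "E"]})

# Filtering drops label 1 but keeps the remaining labels unchanged.
filtered = sample[sample["Agency"].notna()]

by_position = filtered.iloc[2]["Agency"]  # third remaining row
by_label = filtered.loc[2, "Agency"]      # row whose index label is 2

print(by_position, by_label)
```

When the number comes from the index column of an earlier output, `loc` is the lookup that matches it.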

In [28]:
# NTD IDs: T = tribe, R = rural, no letter to indicate urban