# Introduction

Efficient public transit systems play a pivotal role in enhancing mobility, reducing carbon footprints, and fostering sustainable city growth. In many American cities, there is an ongoing, concerted effort to improve multi-modal transit and build better systems that supplement, or replace, the predominantly car-oriented infrastructure. These improvements are often subject to scrutiny, because urban rail projects require an extensive up-front investment of public money. Many transit agencies face financial constraints, and officials are often hesitant to allocate significant public funds to long-term projects. This hesitancy is intensified by the potential for unexpected expenses that could jeopardize an entire project.

There's a pressing need for a transparent tool that offers accurate cost assessments for urban rail projects. By equipping our community members, local officials, and advocates with realistic cost projections tailored to individual communities, we can optimize the allocation of public funds and bridge the gap between political action and community need.

The objective of this study is to analyze global data on Metro system construction costs and devise a predictive model for future project costs. The study will generate two primary outputs:

- A specialized model intended for professionals, such as engineers, familiar with particular locales.
- A general-use model, designed for anyone keen on gauging transit project expenses.


By doing so, I hope to contribute to the creation of more efficient, timely, and cost-effective transit projects that better serve the needs of urban populations globally. There are several important metrics to track, and in this sheet I outline the data that will be used in the analysis.

## The Data

The basis for this analysis will be the data collected and organized by the [Transit Costs Project](https://transitcosts.com/) (TCP), which is affiliated with the NYU Marron Institute of Urban Management. The Transit Costs Project has provided their own analysis of the data, which can be found on their [analysis page](https://transitcosts.com/new-data/). I intend to build upon their analysis to create a tool that provides a baseline estimate of a project's overall cost, given some information about the project area.

The final dataset used in this project will be a modified version of the dataset discussed above. Within the TCP's dataset, there were approximately 150 rows with missing values. The original research team, for the purposes of compiling a trustworthy dataset, left several items blank if they could not verify their true values from official sources. I opted to backfill these data points using a variety of techniques that, I feel, provided a suitable approximation. It's worth noting that many of these techniques are imperfect and should be viewed as a potential source of error. I'll discuss the techniques used in the cleaning section of this analysis.
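
As a rough illustration of what a backfill can look like, the sketch below fills a missing numeric value with the median of the known values from the same country, falling back to the overall median when a country has none. This is one common approach and not necessarily any of the techniques used in this project; the record layout, the `imputed` flag, and the `fill_with_group_median` helper are all hypothetical.

```python
from collections import defaultdict
from statistics import median


def fill_with_group_median(records, group_key, value_key):
    """Return copies of `records` with missing `value_key` entries backfilled.

    Missing values (None) are replaced by the median of the known values in
    the same group, or by the overall median when the group has none.
    """
    known_by_group = defaultdict(list)
    for rec in records:
        if rec[value_key] is not None:
            known_by_group[rec[group_key]].append(rec[value_key])

    all_known = [v for values in known_by_group.values() for v in values]
    if not all_known:
        raise ValueError("no known values to estimate from")
    overall = median(all_known)

    filled = []
    for rec in records:
        new_rec = dict(rec)  # copy so the original records stay untouched
        if new_rec[value_key] is None:
            group_values = known_by_group.get(new_rec[group_key])
            new_rec[value_key] = median(group_values) if group_values else overall
            new_rec["imputed"] = True  # flag estimates as a possible error source
        else:
            new_rec["imputed"] = False
        filled.append(new_rec)
    return filled


# Made-up rows for illustration only (not taken from the TCP dataset).
rows = [
    {"country": "A", "stations": 10},
    {"country": "A", "stations": 14},
    {"country": "A", "stations": None},
    {"country": "B", "stations": None},
]
filled_rows = fill_with_group_median(rows, "country", "stations")
```

Keeping a flag such as `imputed` on each row makes it possible to check later whether the estimated values change the model's behavior.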

The Transit Costs Project data includes several important features that will be used in my analysis; however, those features primarily describe the physical attributes of the railway itself. In addition to this data, I intend to use the provided location for each project to produce several relevant features describing the conditions at the project site. Below, I'll outline the existing features and their purpose.

### The Features

| Feature                     | Unit            | Description                                                                                   |
|-----------------------------|-----------------|-----------------------------------------------------------------------------------------------|
| ID                          | -               | A unique identifier for each record in the dataset.                                           |
| Country                     | -               | The country where the transit project is located.                                             |
| City                        | -               | The city where the transit project is located.                                                |
| Line                        | -               | The name or identifier of the transit line within the city.                                   |
| Phase                       | -               | The phase of the transit project (e.g., Phase 1, Phase 2, etc.).                              |
| Start year                  | Year            | The year in which the transit project construction started.                                   |
| End year                    | Year            | The year in which the transit project construction was completed.                             |
| RR?                         | -               | A binary indicator (Yes/No) for whether the transit line is a rapid transit or not.           |
| Length                      | Kilometers/Miles | The total length of the transit line.                                                         |
| TunnelPer                   | Percentage (%)   | The percentage of the transit line that runs underground in tunnels.                          |
| Tunnel                      | Kilometers/Miles | The length of the transit line that runs underground in tunnels.                              |
| Elevated                    | Kilometers/Miles | The length of the transit line that is elevated above ground level.                           |
| Atgrade                     | Kilometers/Miles | The length of the transit line that is at ground level (at-grade).                            |
| Stations                    | Count           | The total number of stations on the transit line.                                             |
| Platform Length    | Meters          | The average length of platforms at stations.                                                  |
| Source1                     | -               | The source or reference from which the data was obtained.                                    |
| Cost                        | Currency        | The cost of the transit project in the original currency.                                     |
| Currency                    | -               | The currency in which the cost is specified.                                                  |
| Year                        | Year            | The year in which the cost value was recorded.                                                |
| PPP rate                    | -               | The Purchasing Power Parity (PPP) rate for converting the cost to a common currency.         |
| Real cost                   | Currency        | The adjusted cost of the transit project, considering the PPP rate and inflation.            |
| Cost/km         | Millions/km        | The cost of the transit project per kilometer.                                               |
| Cheap?                      | -               | A binary indicator (Yes/No) for whether the transit project is considered cheap or not.      |
| Clength                     | Millions        | The cost of the transit project per kilometer for the length of the transit line.            |
| Ctunnel                     | Millions        | The cost of the transit project per kilometer for the tunnel portion.                         |
| Anglo?                      | -               | A binary indicator (Yes/No) for whether the transit project is located in an Anglophone country. |
| Inflation Index             | -               | The inflation index for adjusting the cost to real value.                                     |
| Real cost    | 2021 dollars    | The adjusted cost of the transit project in 2021 dollars.                                    |
| Cost/km      | Millions        | The cost of the transit project per kilometer in 2021 dollars.                               |
| Source2                     | -               | Additional source or reference for the data.                                                  |
| Reference                   | -               | Any additional reference information related to the transit project.                         |
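
Several of these columns can be derived from others. The sketch below assumes that the tunnel percentage is the tunnel length divided by the total length, and that the per-km cost is the real cost divided by the total length. That reading is based only on the descriptions in the table, and the TCP's own formulas are not reproduced here; the `derive_ratios` helper and its inputs are hypothetical.

```python
def derive_ratios(length_km, tunnel_km, real_cost_millions):
    """Return (tunnel share in %, cost per km in millions) for one project."""
    if length_km <= 0:
        raise ValueError("length_km must be positive")
    tunnel_pct = 100.0 * tunnel_km / length_km
    cost_per_km = real_cost_millions / length_km
    return tunnel_pct, cost_per_km


# Hypothetical project: a 10 km line, 4 km of it tunneled, costing 2,000 million.
tunnel_pct, cost_per_km = derive_ratios(10.0, 4.0, 2000.0)
print(f"{tunnel_pct:.0f}% tunneled, {cost_per_km:.0f} million per km")
```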

### Visualizing the Data

For any model produced by this analysis to be relevant, the underlying data needs to represent a diverse population that encapsulates different approaches to implementing an urban rail project. Below I'll provide some visualizations to help show what the data look like.

#### Where are the data?

Below is a map of the data showing which countries and regions the projects are from.

In [80]:
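# pickle loads the prepared dataset; Plotly draws the figures
import pickle

import plotly.express as px
import plotly.graph_objects as go
import plotly.offline as pyo

# Render Plotly figures inline in the notebook
pyo.init_notebook_mode(connected=True)

# Load the engineered project dataset
with open('pickles/df_engineered.pkl', 'rb') as f:
    df_engineered = pickle.load(f)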

It's important that the data be well distributed globally so that a diverse group of projects is sufficiently represented. Below is a plot of how the projects are spread around the globe. This provides insight into the potential bias of the data.

In [9]:
fig = px.scatter_geo(df_engineered, lat='lat', lon='lng', color="country",
                     hover_name="country", size="length",
                     projection="natural earth")
fig.show()
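
The map gives a visual impression of geographic balance; a simple tally can put numbers on it. The sketch below counts projects per country and reports each country's share of the total. The `country_shares` helper and the sample labels are placeholders for illustration.

```python
from collections import Counter


def country_shares(countries):
    """Return [(country, count, share_of_all_projects)] sorted by count, descending."""
    counts = Counter(countries)
    total = sum(counts.values())
    return [(name, n, n / total) for name, n in counts.most_common()]


# Placeholder labels for illustration only.
sample = ["A", "A", "A", "B", "B", "C"]
for name, n, share in country_shares(sample):
    print(f"{name}: {n} projects ({share:.0%})")
```

With the real data, the input could be `df_engineered['country'].tolist()`. A country that holds a large share of the rows will weigh heavily on any model fitted to the dataset.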

#### When are the data from?

The provided dataset includes information from projects spanning 1965 to 2026. Let's look at a distribution of the project end dates to see what they look like.

First, the plot of 'start_year' vs. 'end_year' shows that the duration of a project isn't strictly tied to how long the rail line is going to be. There are clearly other factors that dictate how quickly a project can be built.

In [113]:
fig = px.scatter(df_engineered, x="start_year",y='end_year',color = 'length')
fig.show()

The duration of a project also doesn't seem to be perfectly correlated with its cost.

In [104]:
fig = px.scatter(df_engineered, x="start_year",y='duration',color = 'cost_km_2021')
fig.show()
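
One way to put a number on this visual impression is a Pearson correlation coefficient between duration and cost per km. The sketch below is a minimal plain-Python version; the `pearson_r` helper and the sample values are made up for illustration and are not results from the dataset.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)


# Made-up values: project durations (years) and costs per km (millions).
durations = [3, 5, 6, 8, 10]
costs_per_km = [150, 90, 400, 120, 300]
r = pearson_r(durations, costs_per_km)
print(f"r = {r:.2f}")
```

On the DataFrame itself, `df_engineered['duration'].corr(df_engineered['cost_km_2021'])` computes the same statistic. A value near 0 indicates a weak linear relationship, while values near 1 or -1 indicate a strong one.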

#### What question are we trying to answer?

The premise of this analysis is that transit costs vary widely from project to project and the costs of a project are difficult to predict based only on the length of the proposed line. The cost of each urban rail project varies from country to country and even significantly from city to city, so this analysis will look to identify a set of features that accurately predicts the cost of a transit project. I've generated several plots that illustrate how the price of completed projects correlates with other aspects of the project.

The plot below shows how significantly the cost of a project can vary within a country. This distribution implies that there are some underlying features unique to each project site that have a significant impact on the total cost of a project in that area and that the total cost isn't perfectly linear with the total track 'length'.

In [108]:
fig = px.scatter(df_engineered.sort_values(by='country'), 
                 x='country',
                 y='cost_km_2021',
                 color='length',
                 color_continuous_scale=px.colors.sequential.Viridis)

fig.update_layout(width=950, height=400)
fig.show()

Here is the same plot again, with duration as the color.

In [111]:
fig = px.scatter(df_engineered.sort_values(by='country'), 
                 x='country',
                 y='cost_km_2021',
                 color='duration',
                 color_continuous_scale=px.colors.sequential.Viridis)

fig.update_layout(width=950, height=400)
fig.show()
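
The within-country spread visible in these plots can also be summarized numerically. The sketch below reports, for each country, the lowest and highest per-km cost and the ratio between them. The `cost_range_by_country` helper and the sample numbers are hypothetical.

```python
from collections import defaultdict


def cost_range_by_country(rows):
    """Map each country to (min, max, max/min ratio) of its per-km costs.

    `rows` is an iterable of (country, cost_per_km) pairs with positive costs.
    """
    costs = defaultdict(list)
    for country, cost in rows:
        costs[country].append(cost)
    return {
        country: (min(values), max(values), max(values) / min(values))
        for country, values in costs.items()
    }


# Made-up numbers for illustration only.
sample = [("A", 100.0), ("A", 450.0), ("A", 200.0), ("B", 80.0), ("B", 120.0)]
spread = cost_range_by_country(sample)
for country, (low, high, ratio) in spread.items():
    print(f"{country}: {low:.0f} to {high:.0f} million per km ({ratio:.1f}x)")
```

With the real data, the pairs could come from `zip(df_engineered['country'], df_engineered['cost_km_2021'])`.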

This variance in cost persists when we compare countries by average total project cost and by average cost per km. Below, countries are sorted by the average total cost of their projects (the y-axis), and the color represents the average cost per km. The purple line marks the mean of the country averages.

In [82]:
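# Per-country averages, broadcast back onto every project row
df_engineered['average_costkm_country'] = df_engineered.groupby('country')['cost_km_2021'].transform('mean')
df_engineered['average_cost_country'] = df_engineered.groupby('country')['cost_real_2021'].transform('mean')
# Keep one row per country, sorted from highest to lowest average total cost
df_unique_countries = (df_engineered.drop_duplicates(subset='country')).sort_values(by='average_cost_country', ascending=False)
# Mean of the country averages (each country counts once), used for the reference line
overall_avg = df_unique_countries['average_cost_country'].mean()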

fig = px.bar(df_unique_countries,
             x='country',
             y='average_cost_country',
             color = 'average_costkm_country',
             height=500)

fig.add_shape(
    go.layout.Shape(
        type='line',
        x0=df_unique_countries['country'].iloc[0],
        x1=df_unique_countries['country'].iloc[-1],
        y0=overall_avg,
        y1=overall_avg,
        line=dict(color='purple')
    )
)
fig.show()
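
Note that `overall_avg` is the mean of the country averages, so each country counts once regardless of how many projects it has. This can differ noticeably from the mean over all projects, as the small sketch below shows. The `compare_averages` helper and the sample costs are made up for illustration.

```python
from collections import defaultdict
from statistics import mean


def compare_averages(rows):
    """Return (mean of per-country means, mean over all projects).

    `rows` is an iterable of (country, cost) pairs.
    """
    by_country = defaultdict(list)
    for country, cost in rows:
        by_country[country].append(cost)
    mean_of_means = mean(mean(costs) for costs in by_country.values())
    pooled_mean = mean(cost for costs in by_country.values() for cost in costs)
    return mean_of_means, pooled_mean


# Made-up costs: country A has three projects, country B has one.
sample = [("A", 100.0), ("A", 100.0), ("A", 100.0), ("B", 500.0)]
mean_of_means, pooled_mean = compare_averages(sample)
print(mean_of_means, pooled_mean)  # prints: 300.0 200.0
```

Which of the two is the better reference line depends on whether the comparison is meant to be between countries or between projects.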

### Insights

Since we've established that the price of a project isn't tied to one specific variable, there are several elements that I'd like to explore further. Specifically, I'm interested in the relationship between tunneling and cost in a given area. I believe the costs associated with tunneling will depend on the underlying soil conditions, as well as the density of the area in question. Additionally, the weather conditions and proximity to the water table in a given area would play a role in determining how much dewatering would need to be completed for each kilometer of construction.

In the next sheet, I'll evaluate the distributions of each feature and determine which features can be ignored.