# **Abstract**

In this paper, I conducted a comprehensive analysis of CSV datasets on mobility-based transportation in Central London, exploring their relationships with other societal factors over 2022. Using statistical methods and programming, I uncovered patterns and trends in the data that give insight into what happens on the streets of London during the day. I loaded the datasets into pandas data structures, then inspected and cleaned them so that the information could be analyzed. By generating visual representations of different correlations within the data, I was able to reveal relationships between variables that would otherwise have been hard to identify. I identified factors that influence transportation demand and, more broadly, gained an understanding of how transportation shapes the London community.

# **Keywords**

- Data Cleaning and Preparation
- Visual Data Analysis
- Geospatial Mapping
- Modes of Mobility-Based Transportation
- Central London


# **Introduction**

London is widely known for its multitude of transportation options, which help define its distinctive ambience and atmosphere. One extremely popular choice is active transport, which is both environmentally sustainable and enjoyable when traveling through the city streets. This report delves into the patterns of these transportation methods in Central London, using 2022 datasets on bicycle hires and other small-scale mobility options. Through data analysis, my objective is to uncover trends between the frequency of mobility-based transportation and other factors, including time of day, specific areas within Central London, and the weather.

# **Methods**

## Data Description

In this deep dive, I used five CSV files: "2022-Central.csv", "0-Count locations.csv", "2022 Q1 (January-March).csv", "2022-1-spring.csv", and "2022-2-autumn.csv".

The 2022 Central dataset recorded the frequencies of different modes of transportation throughout the day, along with several other categories of important information. For each record, there were both categorical and numerical variables.
Categorical Variables:
- Weather: 'Wet' or 'Dry'
- Path: 5 paths, each with 4 cardinal directions (total = 20)
- Mode: 6 modes of transportation
Numerical Variables:
- UnqID: 208 unique identifiers
- Date: 31 unique dates ranging over 3 months (May, June, and July)
- Time: 64 intervals of 15 minutes from 6:00 to 21:45
- Count: frequency
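
As a quick sanity check on the Time field, the interval count can be reproduced with a short sketch. The calendar date below is an arbitrary placeholder and is not taken from the dataset; only the 6:00 start, the 21:45 end, and the 15-minute step come from the description above.

```python
import pandas as pd

# Placeholder date (an assumption); only the clock times matter here.
intervals = pd.date_range("2022-01-01 06:00", "2022-01-01 21:45", freq="15min")

# 15 h 45 min between the first and last start time: 63 steps + 1 = 64 intervals
print(len(intervals))
```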

The 0-Count Locations dataset held information about specific locations within London. 
Categorical Variables:
- Location description: describes physical location
- Borough: Borough name
- Functional area for monitoring: 3 areas (Central, Inner, or Outer London)
Numerical Variables:
- UnqID: unique ID representing locations (matches first dataset)
- Latitude
- Longitude

The Quarter 1 (January to March), Spring, and Autumn datasets gave information for the remaining months of the year that the 2022 Central dataset was missing.
Categorical Variables:
- Weather: 'Wet' or 'Dry'
Numerical Variables:
- Date: Remaining months of 2022
- Count: frequency

## Methodology

I conducted my data analysis following three general steps: inspecting the data, preparing it for visualization, and then visualizing it. Visualization mostly entailed the use of the Plotly Express library, but for the data inspection and preparation processes, I have listed the methods that were most useful to me.
1. Inspection
- pandas.DataFrame.head()
- pandas.DataFrame.dtypes
2. Preparation
- pandas.DataFrame.groupby()
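
A minimal sketch of how these three methods fit together, using a small made-up frame rather than the real datasets. The column names mirror the Central data, but every value is invented for illustration.

```python
import pandas as pd

# Invented sample rows; not taken from the real CSV files.
sample = pd.DataFrame({
    "Time": ["06:00:00", "06:00:00", "06:15:00", "06:15:00"],
    "Mode": ["Pedestrians", "Cargo bikes", "Pedestrians", "Cargo bikes"],
    "Count": [12, 0, 20, 1],
})

# Inspection: preview the first rows and check each column's type
print(sample.head())
print(sample.dtypes)

# Preparation: average count per 15-minute interval
mean_by_time = sample.groupby("Time", as_index=False)["Count"].mean()
print(mean_by_time)
```

Passing `as_index=False` keeps `Time` as an ordinary column, which is usually the shape a plotting call expects.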


## Tools Used

The language I coded in is Python, and most of my program's functionality came from imported libraries.
1. **Pandas** <br>
Pandas played the largest role in my data analysis process, serving as the backbone for handling and managing the contents of the datasets. The library provided a powerful set of tools for cleaning and transforming the data into a form that could be plotted. Its ability to work quickly with large datasets made it straightforward to derive insights and understand patterns and trends.
2. **Plotly Express (and subplots)** <br>
With Pandas preparing the data for visualization, the actual plotting was done with the Plotly Express library. It gave me the tools I needed to create streamlined, easy-to-read graphs. Its versatility showed in its ability to plot complex charts and in its wide variety of graph options. Overall, it facilitated my goal of creating plots that communicate findings and patterns in the data.

# **Dataset Exploration and Cleaning**

## 2022 Central Data

In [1]:
import pandas as pd
import plotly.express as px
from plotly.subplots import make_subplots

# LOAD DATA
df = pd.read_csv('2022-Central.csv')

# INSPECT DATA
print(df.head())
print(df.tail())
print(df.dtypes)
print(df.describe())

# CLEAN DATA
# remove duplicates
print(df.duplicated().sum())
df = df.drop_duplicates()
# check for null values
null = df.isnull().sum()
print(null)
# df.dropna()
# remove irrelevant data
df = df.drop(columns = ['Year', 'Day', 'Round'])
# convert data types
df['Date'] = pd.to_datetime(df['Date'], dayfirst=True)
df['Time'] = pd.to_datetime(df['Time'], format='%H:%M:%S').dt.time
df['Weather'] = df['Weather'].astype(str)
df['Dir'] = df['Dir'].astype(str)
df['Path'] = df['Path'].astype(str)
df['Mode'] = df['Mode'].astype(str)
df['UnqID'] = df['UnqID'].astype(str)
df['Count'] = df['Count'].astype(int)
print(df.dtypes)
# capitalize the categorical text fields (applymap returns a new frame, so assign column by column;
# UnqID is left as-is so it still matches the location dataset)
for col in ['Weather', 'Dir', 'Path', 'Mode']:
    df[col] = df[col].str.capitalize()
# combine direction and path, because only these two path categories lack a direction
# (boolean mask instead of a row loop, so index gaps left by drop_duplicates do not matter)
needs_dir = df['Path'].isin(['Carriageway', 'Shared path'])
df.loc[needs_dir, 'Path'] = df.loc[needs_dir, 'Path'] + ' - ' + df.loc[needs_dir, 'Dir']
df = df.drop(columns = ['Dir'])
print(df.head())

   Year     UnqID        Date Weather      Time      Day Round         Dir  \
0  2022  CENCY001  13/07/2022     Dry  06:00:00  Weekday     A  Northbound   
1  2022  CENCY001  13/07/2022     Dry  06:15:00  Weekday     A  Northbound   
2  2022  CENCY001  13/07/2022     Dry  06:30:00  Weekday     A  Northbound   
3  2022  CENCY001  13/07/2022     Dry  06:45:00  Weekday     A  Northbound   
4  2022  CENCY001  13/07/2022     Dry  07:00:00  Weekday     A  Northbound   

          Path         Mode  Count  
0  Carriageway  Cargo bikes      0  
1  Carriageway  Cargo bikes      0  
2  Carriageway  Cargo bikes      0  
3  Carriageway  Cargo bikes      0  
4  Carriageway  Cargo bikes      0  
        Year     UnqID        Date Weather      Time      Day Round  \
370683  2022  CENCY702  14/06/2022     Dry  20:45:00  Weekday     A   
370684  2022  CENCY702  14/06/2022     Dry  21:00:00  Weekday     A   
370685  2022  CENCY702  14/06/2022     Dry  21:15:00  Weekday     A   
370686  2022  CENCY702  1
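
The duplicate and null checks in the cell above can be illustrated on a tiny invented frame. This is only a sketch: the IDs and values are placeholders, and resetting the index is an optional extra step (not part of the original cell) that keeps row labels contiguous after rows are dropped.

```python
import pandas as pd

# Invented rows: the second row repeats the first, and one Weather value is missing
toy = pd.DataFrame({
    "UnqID": ["A1", "A1", "A2"],
    "Weather": ["Dry", "Dry", None],
    "Count": [3, 3, 7],
})

n_duplicates = int(toy.duplicated().sum())          # rows identical to an earlier row
toy = toy.drop_duplicates().reset_index(drop=True)  # labels run 0..n-1 again
nulls_per_column = toy.isnull().sum()               # missing values in each column

print(n_duplicates)
print(nulls_per_column)
```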

## Location Data

In [2]:
# LOAD DATA
location = pd.read_csv('0-Count locations.csv')

# INSPECT DATA
print(location.head())
print(location.tail())
print(location.dtypes)
print(location.describe())

# CLEAN DATA
# remove duplicates
print(location.duplicated().sum())
location = location.drop_duplicates()
# check for null values
null = location.isnull().sum()
print(null)
# remove unnecessary columns
location = location.drop(columns = ['Which folder?', 'Is it on the strategic CIO panel?', 'Shared sites',
                                    'Road type','Northing (UK Grid)','Easting (UK Grid)'])
# rename columns
location.rename(columns = {'Site ID':'UnqID'}, inplace = True)
# convert datatypes
location['UnqID'] = location['UnqID'].astype(str)
location['Location description'] = location['Location description'].astype(str)
location['Borough'] = location['Borough'].astype(str)
location['Functional area for monitoring'] = location['Functional area for monitoring'].astype(str)
location['Latitude'] = location['Latitude'].astype(float)
location['Longitude'] = location['Longitude'].astype(float)
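# capitalize the monitoring-area labels only: applymap returns a new frame (its result was being discarded),
# and capitalizing every text field would lowercase proper nouns such as 'City of London' and alter UnqID
location['Functional area for monitoring'] = location['Functional area for monitoring'].str.capitalize()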
print(location.head())

    Site ID     Which folder?         Shared sites  \
0  CENCY001  Strategic counts             CSHCY084   
1  CENCY002  Strategic counts                    0   
2  CENCY003  Strategic counts             CSHCY077   
3  CENCY004  Strategic counts  CSHCY075 & QWPCY047   
4  CENCY005  Strategic counts             CSHCY045   

                 Location description         Borough  \
0  Millbank (south of Thorney Street)     Westminster   
1                         Bishopsgate  City of London   
2                    Southwark Bridge       Southwark   
3               Southwark Bridge Road       Southwark   
4                       Tooley Street       Southwark   

  Functional area for monitoring Road type  Is it on the strategic CIO panel?  \
0                        Central    A Road                                  1   
1                        Central    A Road                                  1   
2                        Central    A Road                                  1   
3       

## Seasons Data

In [3]:
# LOAD and INSPECT
spring = pd.read_csv('2022-1-spring.csv')[['Date','Count','Weather']]
print(spring.head())
print(spring.dtypes)

autumn = pd.read_csv('2022-2-autumn.csv')[['Date','Count','Weather']]
print(autumn.head())
print(autumn.dtypes)

q1 = pd.read_csv('2022 Q1 (January-March).csv')[['Date','Count','Weather']]
print(q1.head())
print(q1.dtypes)


# CLEANING
# remove null values
spring = spring.dropna()
autumn = autumn.dropna()
q1 = q1.dropna()

# capitalization
spring['Weather'] = spring['Weather'].str.capitalize()
autumn['Weather'] = autumn['Weather'].str.capitalize()
q1['Weather'] = q1['Weather'].str.capitalize()

# ensure data types
spring['Date'] = pd.to_datetime(spring['Date'], dayfirst=True)
autumn['Date'] = pd.to_datetime(autumn['Date'], dayfirst=True)
q1['Date'] = pd.to_datetime(q1['Date'], dayfirst=True)


# PROCESSING
# combine into one dataset
seasons = pd.concat([spring, autumn, q1], ignore_index=True)
# make sure only year 2022 data
seasons = seasons[seasons['Date'].dt.year == 2022]

# drop duplicates
seasons = seasons.drop_duplicates()
print(seasons.head())
print(seasons.isnull().sum())


         Date  Count Weather
0  30/06/2022     11     NaN
1  30/06/2022     14     NaN
2  30/06/2022     28     NaN
3  30/06/2022     28     NaN
4  30/06/2022     47     NaN
Date       object
Count       int64
Weather    object
dtype: object
         Date  Count Weather
0  13/10/2022      5     NaN
1  13/10/2022      5     NaN
2  13/10/2022      4     NaN
3  13/10/2022      5     NaN
4  13/10/2022      9     NaN
Date       object
Count       int64
Weather    object
dtype: object
         Date  Count Weather
0  07/03/2022     12     Dry
1  07/03/2022     43     Dry
2  07/03/2022     64     Dry
3  07/03/2022     77     Dry
4  07/03/2022    116     Dry
Date       object
Count       int64
Weather    object
dtype: object
        Date  Count Weather
0 2022-06-21      3     Dry
1 2022-06-21      1     Dry
3 2022-06-21      4     Dry
4 2022-06-21      2     Dry
7 2022-06-21      5     Dry
Date       0
Count      0
Weather    0
dtype: int64
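
The effect of `dayfirst=True` in the date conversion can be seen in a short sketch. The two strings are copied from the printed output above; everything else is illustrative.

```python
import pandas as pd

# Date strings in the day/month/year layout shown in the output above
raw = pd.Series(["07/03/2022", "13/10/2022"])

# dayfirst=True reads the first number as the day: 7 March, not 3 July
parsed = pd.to_datetime(raw, dayfirst=True)
print(parsed.dt.month.tolist())

# Filtering on the parsed year, as done for the combined seasons frame
only_2022 = parsed[parsed.dt.year == 2022]
print(len(only_2022))
```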


# **Results**

## Daily Commute Patterns 

In [4]:
## Time vs. Count Sunburst Chart
# average count for each 15-minute interval
df2 = df.groupby('Time', as_index=False)['Count'].mean()

# plot
fig = px.sunburst(df2, path = ['Time'], color = 'Count', title = 'Time vs. Count Sunburst')
fig.update_layout(width=600, height=600)
fig.show()

This sunburst chart depicts the relationship between the time of day and the frequency of mobility-based transportation. Each slice represents one 15-minute interval between 6:00 and 21:45, and its shade of color represents the average count recorded during that interval.

The two peak periods are around 8:45 and 18:00, since those two slices have the brightest color. These times align with the traditional rush hours of a normal weekday, reflecting the morning and evening commutes of the large share of the labor force who work 9-to-5 jobs. The chart also shows subtle rises around midday, most likely from lunch breaks and errands. One trend I noticed is that even though most workdays start at 9 am and end at 5 pm, the peaks are offset from those times by different amounts. The morning peak falls at 8:45 am, which suggests that most commuters travel shortly before the workday begins so that they arrive on time. In contrast, the evening peak falls at 6 pm, an hour after the usual finishing time. One likely explanation is that people are no longer on a fixed schedule once work ends and so are in less of a hurry; working overtime or socializing with co-workers after work would also push travel later.

Overall, this allowed me to envision the average day of the adult labor force. I learned how professionals working in the real world manage their time and go about their duties.
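
The peaks can also be read off numerically rather than by eye. The sketch below uses invented averages, chosen so that the result mirrors the peaks described above; they are not the measured values, and splitting the day at noon is an assumption made for illustration.

```python
import pandas as pd

# Invented average counts per interval; the real values come from the Central dataset
toy = pd.DataFrame({
    "Time": ["08:30:00", "08:45:00", "09:00:00", "17:45:00", "18:00:00", "18:15:00"],
    "Count": [40, 55, 45, 48, 60, 50],
})
toy["Hour"] = pd.to_datetime(toy["Time"], format="%H:%M:%S").dt.hour

# Split the day at noon (assumption) and take the busiest interval in each half
morning = toy[toy["Hour"] < 12]
evening = toy[toy["Hour"] >= 12]
morning_peak = morning.loc[morning["Count"].idxmax(), "Time"]
evening_peak = evening.loc[evening["Count"].idxmax(), "Time"]
print(morning_peak, evening_peak)
```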

## Uncovering London's Transportation System 

In [5]:
# geomap of count vs Location within Central London
# find population of active transport based on unique id
pop = df['Count'].groupby(df['UnqID']).sum()

# link location dataset using unique id
merge = pd.merge(location,pop, on='UnqID')
# print(merge.describe())

# plot
fig = px.scatter_mapbox(merge, lat='Latitude', lon='Longitude', 
                        zoom=12, height=600, title='Interactive Map of Central London', color = 'Count', size = 'Count',
                        hover_name = 'Borough', hover_data = ['Location description', 'Borough', 'Count']
                        )
fig.update_layout(mapbox_style="open-street-map", margin=dict(r=0, l=0, b=0))
fig.show()


The map above is a geographic bubble visualization of the frequency of mobility-based transportation in Central London. Each bubble marks a location that was observed and recorded, while its size and color represent the total count there. High-count locations appear as larger, brighter bubbles, and low-count locations as smaller, darker bubbles. <br>
<br> There is a clear pattern of high-count locations along the River Thames. This is significant because the areas near the river are home to many business headquarters, financial institutions, and tourist attractions, whose economic activity most likely attracts large numbers of commuters and visitors every day. That daily flow of people plausibly raises the demand for active transportation, which is consistent with the large, bright bubbles in those areas. The infrastructure of central London, such as cycling paths and pedestrian routes, also supports these modes of transportation. From this, I infer a positive association between economic activity and active transport demand. <br>
<br> Another pattern is the decline in counts moving away from the city center: the bubbles start off large and bright, then gradually become smaller and darker. The closer a location is to the center of London, the more compact the area and the closer destinations are to each other, which makes short-distance options such as walking and cycling more efficient. Farther from the center, areas are less compact, and longer distances can be covered more efficiently by public transport, such as buses or the Tube. This observation is consistent with the design of London's transport system, which extends outward from the central hub of the city.<br>
<br>Overall, analyzing this decline from the center revealed how cohesive London's transportation network is and how its infrastructure shapes the city's rhythm and way of life.
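
One detail of the join behind this map is worth illustrating: `pd.merge` defaults to an inner join, so any site ID without a matching location row would drop off the map silently. The sketch below, on invented IDs and coordinates, shows one way to surface such rows; it is an illustration under those assumptions, not a description of what the real data contains.

```python
import pandas as pd

# Invented counts and coordinates; the site IDs are placeholders
counts = pd.DataFrame({"UnqID": ["S1", "S1", "S2", "S3"], "Count": [5, 7, 2, 9]})
sites = pd.DataFrame({
    "UnqID": ["S1", "S2"],
    "Latitude": [51.50, 51.52],
    "Longitude": [-0.12, -0.10],
})

# Total count per site as a regular two-column frame
totals = counts.groupby("UnqID", as_index=False)["Count"].sum()

# Left join from the totals so unmatched sites stay visible; indicator flags them
merged = totals.merge(sites, on="UnqID", how="left", indicator=True)
unmatched = merged.loc[merged["_merge"] == "left_only", "UnqID"].tolist()
print(merged)
print(unmatched)
```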

## Seasonal Impact on Transportation Demand

In [6]:
# Central dataset: May, June, and July data
mjj = df[['Date','Count','Weather']]
# add data for rest of the seasons
data3 = pd.concat([mjj,seasons], ignore_index = True)

#subplots
fig = make_subplots(rows=1, cols=2, subplot_titles=('Dry Weather', 'Wet Weather'), shared_yaxes = True)

# dry weather data
dry_data = data3[data3['Weather'] == 'Dry']
dry_final = dry_data.groupby('Date', as_index=False)['Count'].mean()
# dry subplot
fig1 = px.scatter(dry_final, x='Date', y='Count')
for trace in fig1['data']:
    fig.add_trace(trace, row=1, col=1)

# wet weather data
wet_data = data3[data3['Weather'] == 'Wet']
wet_final = wet_data.groupby('Date', as_index=False)['Count'].mean()
# wet subplot
fig2 = px.scatter(wet_final, x='Date', y='Count')
for trace in fig2['data']:
    fig.add_trace(trace, row=1, col=2)

# plot
fig.update_xaxes(title_text='Date', row=1, col=1, dtick="M1", tickformat="%b\n%Y")
fig.update_xaxes(title_text='Date', row=1, col=2, dtick="M1", tickformat="%b\n%Y")
fig.update_yaxes(title_text='Count', row=1, col=1)
fig.show()


Above, I have plotted two scatter charts, one for each weather condition: Dry or Wet. Each dot is the average count of mobility-based transportation recorded on that day. Together, the two charts show the relationship between weather and active transport usage over 2022. Even though I combined multiple datasets to cover as many months of the year as possible, I was still missing data for August.

After analyzing the scatter plots, I found patterns in the demand for active transportation across the seasons. In the Dry Weather plot, I noticed sharp peaks around March and October, as well as quieter periods such as June to July. In contrast, the points in the Wet Weather plot appear more uniform, with fewer fluctuations. From this, I infer that the most significant climatic factor influencing mobility-based transportation demand is temperature rather than precipitation: demand on dry days rose and fell sharply with the seasons, whereas demand on wet days stayed comparatively flat throughout the year. In the Dry Weather plot, demand spiked in spring and autumn and declined over summer and winter, which suggests that Londoners prefer these transportation options in milder weather, as opposed to the heat of summer and the cold of winter.

In sum, these graphs suggest a seasonal trend in which preference for mobility-based transportation peaks during milder temperatures. This pattern points to temperature, more than precipitation, as the dominant influence on transportation choices throughout the year.
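
The visual impression that dry-day demand fluctuates more than wet-day demand could be checked by comparing the spread of the daily averages. The following is a rough sketch on invented numbers, not the values plotted above; standard deviation is used here as one possible measure of fluctuation.

```python
import pandas as pd

# Invented daily averages; not the values plotted above
toy = pd.DataFrame({
    "Date": pd.to_datetime(["2022-03-07", "2022-06-21", "2022-10-13",
                            "2022-03-08", "2022-06-22", "2022-10-14"]),
    "Weather": ["Dry", "Dry", "Dry", "Wet", "Wet", "Wet"],
    "Count": [30.0, 10.0, 32.0, 12.0, 11.0, 13.0],
})

# Average per day and weather, then the spread of those daily averages
daily = toy.groupby(["Weather", "Date"], as_index=False)["Count"].mean()
spread = daily.groupby("Weather")["Count"].agg(["mean", "std"])
print(spread)
```

A larger `std` for one condition would indicate stronger day-to-day or seasonal swings under that condition.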

## The Call for Infrastructure Investments

In [7]:
# empty dataframe
df4 = pd.DataFrame(columns = ['Mode','Wet','Dry'])

# list of unique modes
modes = df['Mode'].unique()
# series of wet
wet = df[df['Weather'] == 'Wet']
wet = wet['Count'].groupby(df['Mode']).sum().reset_index(name = 'Wet')
# series of dry
dry = df[df['Weather'] == 'Dry']
dry = dry['Count'].groupby(df['Mode']).sum().reset_index(name = 'Dry')

# combine for final df (outer join keeps modes that appear under only one weather condition)
df4 = pd.merge(wet,dry, on='Mode', how='outer')
# print(df4)
# check for nulls
nulldf4 = df4.isnull()
# print(nulldf4)
# find out why there are nulls: check whether any 'Private cycles' rows were recorded in wet weather
# (vectorized, so it does not rely on a contiguous index after drop_duplicates)
pc = ((df['Mode'] == 'Private cycles') & (df['Weather'] == 'Wet')).any()
# print(pc)
# fix nulls
df4['Wet'] = df4['Wet'].fillna(0)
nulldf4 = df4.isnull()
# print(nulldf4)

# print(df4.describe())
# subplot option
fig = make_subplots(rows=1, cols=2, subplot_titles=('Higher Counts', 'Lower Counts'))

anomaly = df4[df4['Mode'].isin(['Pedestrians','Conventional cycles'])]
rest = df4[~df4['Mode'].isin(['Pedestrians','Conventional cycles'])]
fig1 = px.bar(anomaly, y = 'Mode', x = ['Wet','Dry'])
fig1.update_layout(xaxis_title = 'Count', yaxis_title = 'Mode of Transport', title = 'Higher Counts')
fig1.show()
fig2 = px.bar(rest, y = 'Mode', x = ['Wet','Dry'])
fig2.update_layout(xaxis_title = 'Count', yaxis_title = 'Mode of Transport', title = 'Lower Counts')
fig2.show()

# print("Conventional Cycles ratio to Pedestrian (Dry):",df4['Dry'].loc[1]/df4['Dry'].loc[3])
# print("Conventional Cycles ratio to Pedestrian (Wet):",df4['Wet'].loc[1]/df4['Wet'].loc[3])

# plot
# fig = px.bar(df4, x = 'Mode', y = ['Wet','Dry'], title = 'Mode vs. Weather Segmented Bar Chart')
# fig.show()

This horizontal segmented bar chart compares the usage counts of each mode of mobility-based transport under wet and dry weather. To make the comparison more visually intuitive, I separated the data into two graphs: because a couple of the modes had significantly higher counts, the modes are easier to compare when plotted on different scales.

When comparing the different modes, a clear hierarchy emerges, with Pedestrians leading by a significant margin, followed by Conventional cycles and then the rest. This breakdown matters because it is a first step toward setting infrastructure investment priorities: it suggests investing more in the modes that are in higher demand, such as walking and conventional cycling. The case for infrastructure improvement is also supported by the patterns under different weather conditions. Comparing dry days with wet days, usage is considerably lower in wet weather. This supports the idea that more facilities and solutions are needed to enable mobility-based transport in wet weather.

Overall, not only would this make active transport easier and reduce traffic congestion, it would also promote healthier lifestyles by encouraging people to choose transportation options that involve physical activity. Building new walkways, expanding cycling lanes, and making travel paths more weather-protected would benefit the community as a whole.
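
The wet and dry comparison in this section uses summed counts, which also depend on how many observations fall under each condition. A hedged sketch, on invented numbers, of comparing the mean count per observation instead; the imbalance between dry and wet rows below is an assumption made to show the effect, not a property taken from the dataset.

```python
import pandas as pd

# Invented observations with more dry rows than wet rows (assumption)
toy = pd.DataFrame({
    "Mode": ["Pedestrians"] * 4 + ["Conventional cycles"] * 4,
    "Weather": ["Dry", "Dry", "Dry", "Wet"] * 2,
    "Count": [100, 120, 110, 90, 40, 50, 45, 30],
})

# Sums grow with the number of rows per condition; means do not
summary = toy.pivot_table(index="Mode", columns="Weather", values="Count",
                          aggfunc=["sum", "mean"])
wet_to_dry = summary["mean"]["Wet"] / summary["mean"]["Dry"]
print(summary)
print(wet_to_dry.round(2))
```

In this toy example the summed counts make wet-weather usage look far lower than the per-observation means do, which is why the ratio of means can be a fairer basis for comparison.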


# **Conclusion**

In conclusion, my data analysis of mobility-based transportation in Central London during 2022 revealed insightful correlations with external variables. The results provided new perspectives on the professional world, highlighting how infrastructure shapes urban lifestyles and how climate conditions influence the city's dynamics. The findings highlight the dynamic nature of urban mobility, showing how transportation influences and is influenced by the environment and the community. All in all, my investigation not only reflects the state of urban mobility in London in 2022 but also provides a starting point for future research and development.