# **Notebook 03 | Data Visualisation and Analysis**

LSE DS105A – Data for Data Science (2024/25 Autumn Term)

**AUTHOR:** SABAA PASHA (Candidate Number: **41408**)

**OBJECTIVE**: Analyse and visualise the collected data to answer the question:
> *“Is London really as rainy as the movies make it out to be?”*

## **Imports**

In [1]:
import json

import numpy as np
import pandas as pd

from lets_plot import *
LetsPlot.setup_html()

df = pd.read_json("../data/cleaned_dataframe.json", lines=True)

## **Analysing Precipitation Data**

In this section, I will create visualisations using the data from the following CSV files:
> total_precip.csv
>
> precip_hours.csv
>
> avg_precip_hour.csv

### **Total Precipitation**

Loading the CSV file and reading it as a DataFrame:

In [2]:
total_precip = pd.read_csv("../data/total_precip.csv")

In [3]:
total_precip

Unnamed: 0,City,Precipitation Type,Total Amount (mm)
0,London,Rain,769.5
1,London,Other Precipitation,11.2
2,Bangalore,Rain,795.5
3,Bangalore,Other Precipitation,0.0
4,Bogota,Rain,1038.7
5,Bogota,Other Precipitation,0.0
6,Riyadh,Rain,142.9
7,Riyadh,Other Precipitation,0.0
8,Amsterdam,Rain,1199.2
9,Amsterdam,Other Precipitation,42.6


Creating and customising a **stacked bar chart** to visualise the total volume of precipitation for each city in 2023:

> Break down the "Total Precipitation" into 2 categories ("Rain" and "Other Precipitation").
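
Before plotting, the per-city totals can be checked with a quick aggregation. This is a minimal sketch rather than part of the original pipeline: it rebuilds the table shown above by hand (values copied from that output) and sums the two precipitation types for each city. The names `city_totals` and `wetter_than_london` are introduced here for illustration only.

```python
import pandas as pd

# Values copied from the total_precip table displayed above.
total_precip = pd.DataFrame({
    "City": ["London", "London", "Bangalore", "Bangalore", "Bogota",
             "Bogota", "Riyadh", "Riyadh", "Amsterdam", "Amsterdam"],
    "Precipitation Type": ["Rain", "Other Precipitation"] * 5,
    "Total Amount (mm)": [769.5, 11.2, 795.5, 0.0, 1038.7, 0.0,
                          142.9, 0.0, 1199.2, 42.6],
})

# Sum "Rain" and "Other Precipitation" per city, largest total first.
city_totals = (
    total_precip.groupby("City")["Total Amount (mm)"]
    .sum()
    .sort_values(ascending=False)
)

# Cities whose total precipitation exceeded London's.
wetter_than_london = city_totals[city_totals > city_totals["London"]].index.tolist()

print(city_totals.round(1))
print(wetter_than_london)
```

The stacked bars in the chart correspond to these per-city sums, so the list printed at the end should match the cities named in the chart's subtitle.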

In [4]:
plot = (
    ggplot(total_precip) +
    geom_bar(
        aes(x="City", y="Total Amount (mm)", fill="Precipitation Type"),
        stat='identity', position='stack',
        tooltips=layer_tooltips()
            .line("@{Precipitation Type}: @{Total Amount (mm)} mm")
    ) +
    labs(title="☔Total Precipitation by City (2023)",
        subtitle=("Amsterdam, Bogota and Bangalore experienced greater rainfall (and thus, total precipitation) than London in 2023.\n"
                    "There were small amounts of other precipitation in Amsterdam and London, and none was recorded in the remaining cities."),
        x="City",
        y="Total Precipitation (mm)") +
    theme_bw() +
    theme(
        plot_title=element_text(size=24, face="bold", hjust=0.5),
        plot_subtitle=element_text(size=16),
        axis_text_x=element_text(size=14), 
        axis_text_y=element_text(size=14),
        axis_title_x=element_text(size=18, face="bold"),
        axis_title_y=element_text(size=18, face="bold"),
    ) +
    scale_fill_manual(values={"Rain": "#4a90e2", "Other Precipitation": "#a9a9a9"}) +
    ggsize(1200, 500)
)
plot

### **Hours of Precipitation**

Loading the CSV file and reading it as a DataFrame:

In [5]:
precip_hours = pd.read_csv("../data/precip_hours.csv")

Creating and customising a **pie chart** to visualise the distribution of precipitation hours across the different cities:
> Calculate each city's percentage of hours of precipitation
>
> Percentage of hours = (City's hours of precipitation / Total hours of precipitation)*100
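
The percentage formula above can be sketched in pandas as follows. The hours used here are hypothetical placeholder values, not the notebook's real data; only the column names (`City`, `Hours of Precipitation`, `Percentage`) are taken from the plotting code, and the assumption is that `precip_hours.csv` was built in a similar way.

```python
import pandas as pd

# Hypothetical hours, for illustration only (not the real 2023 data).
precip_hours = pd.DataFrame({
    "City": ["City A", "City B", "City C"],
    "Hours of Precipitation": [500, 300, 200],
})

# Percentage of hours = (city's hours / total hours) * 100
total_hours = precip_hours["Hours of Precipitation"].sum()
precip_hours["Percentage"] = (
    precip_hours["Hours of Precipitation"] / total_hours * 100
).round(1)

print(precip_hours)
```

Because every share is taken relative to the same total, the percentages sum to 100 (up to rounding), which is what allows them to be drawn as slices of one pie.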

In [6]:
plot = (
    ggplot(precip_hours) +
    aes(y="Percentage", fill="City") +
    geom_bar(
        stat="identity", 
        width=1, 
        tooltips=layer_tooltips()
            .line("City: | @City")
            .line("Hours of Precipitation: | @{Hours of Precipitation}")
            .line("Percentage: | @Percentage%")
    ) +
    coord_polar(theta="y") +
    labs(
        title="⌛Distribution of Precipitation Hours (2023)",
        subtitle="Amsterdam contributed the largest share (30.8%) to the total precipitation hours,\n"
                "followed closely by Bogota at 28.3%. London only accounted for 20.8% of the total hours."
    ) +
    theme_void() +
    theme(
        plot_title=element_text(size=20, face="bold", hjust=0.5),
        plot_subtitle=element_text(size=14, hjust=0.5),
        plot_margin=[0, 150, 0, 150],
        legend_position="bottom",
        legend_title=element_blank(),
    ) +
    scale_fill_gradient(low="#07285b", high="#cfdcf0")
)
plot

### **Average Precipitation per Hour**

Loading the CSV file and reading it as a DataFrame:

In [7]:
avg_precip_hour = pd.read_csv("../data/avg_precip_hour.csv")

Creating and customising a **dot plot** to visualise the average precipitation per hour:
> Calculate each city's average precipitation per hour: 
>
> Average precipitation per hour = Total precipitation (mm) / Total hours of precipitation
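
As a rough sketch of this calculation, assuming a table with one row per city that holds both the total precipitation and the hours of precipitation: the input values and the `summary` name are hypothetical, and only the output column name `Average_Precipitation_Per_Hour` comes from the plotting code.

```python
import pandas as pd

# Hypothetical totals and hours, for illustration only.
summary = pd.DataFrame({
    "City": ["City A", "City B"],
    "Total Precipitation (mm)": [900.0, 150.0],
    "Hours of Precipitation": [1500, 100],
})

# Average precipitation per hour = total precipitation (mm) / hours of precipitation
summary["Average_Precipitation_Per_Hour"] = (
    summary["Total Precipitation (mm)"] / summary["Hours of Precipitation"]
).round(2)

print(summary[["City", "Average_Precipitation_Per_Hour"]])
```

In this toy example the drier city (City B) ends up with the higher hourly average, because its small total is spread over very few hours. The metric therefore measures intensity rather than overall wetness.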

In [8]:
plot = (
    ggplot(avg_precip_hour) +
    aes(x="City", y="Average_Precipitation_Per_Hour", color="Average_Precipitation_Per_Hour") +
    geom_point(
        tooltips=layer_tooltips()
            .line("City: | @City")
            .line("Average Precipitation per Hour: | @{Average_Precipitation_Per_Hour} mm"),
        size=6
    ) +
    labs(
        title="🌍Average Precipitation per Hour by City (2023)",
        subtitle="Despite its incredibly hot and dry climate, Riyadh had the highest recorded precipitation per hour!\n"
        "Amsterdam, Bogota and London had the most hours of precipitation, yet recorded much lower average precipitation per hour.",
        x="City",
        y="Average Precipitation Per Hour (mm)"
    ) +
    theme_bw() +
    theme(
        plot_title=element_text(size=20, face="bold", hjust=0.5),
        plot_subtitle=element_text(size=16),
        axis_text_x=element_text(size=14, angle=0),
        axis_text_y=element_text(size=14),
        axis_title_x=element_text(size=18, face="bold"),
        axis_title_y=element_text(size=18, face="bold"),
        panel_background=element_rect(fill="#e0e0e0")
    ) +
    scale_color_gradient(low="#329999", high="#004c4c") +
    ggsize(1200, 500) +
    guides(color="none")
)

plot

### **Comments**

> The stacked bar chart shows that London's total precipitation in 2023 was moderate compared with Amsterdam, Bogota and Bangalore. However, although London did not record as much total precipitation, it is important to account for seasonal variation. Bangalore experiences heavy monsoon rain between June and September, which contributes significantly to its high overall precipitation. In contrast, London’s precipitation is more evenly distributed throughout the year, with frequent light showers but fewer extreme rainfall events.
>
> The pie chart shows that Amsterdam and Bogota have the largest shares of precipitation hours among the 5 cities. Although London does experience a fair amount of rain, it can be argued that it doesn’t have as much of an extended rainy period as Amsterdam and Bogota, which have longer stretches of precipitation.
>
> The dot plot shows that Riyadh recorded the highest precipitation per hour in 2023. Surprisingly, London recorded a substantially lower value. This implies that, although London experiences frequent rainfall, it is much lighter than the rainfall seen in Riyadh.

## **Analysing Rainfall Data**

In this section, I will create visualisations using the data from the following CSV files:
> total_rainy_days.csv
>
> monthly_avg.csv

### **Total Rainy Days**

Loading the CSV file and reading it as a DataFrame:

In [9]:
total_rainy_days = pd.read_csv("../data/total_rainy_days.csv")

Creating and customising a **bar chart** to visualise the total number of rainy days for each city in 2023:
> Exclude days that recorded less than 1 mm of rain.
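
The 1 mm cut-off can be sketched as a filter followed by a count per city. This is an illustrative example only: the `daily` frame, its `Rain (mm)` column and the threshold constant are assumptions about how the upstream data might look, while `Number_of_Rainy_Days` matches the column used in the chart.

```python
import pandas as pd

# Hypothetical daily rainfall records, for illustration only.
daily = pd.DataFrame({
    "City": ["City A"] * 4 + ["City B"] * 4,
    "Rain (mm)": [0.0, 0.4, 1.0, 5.2, 0.9, 0.0, 0.0, 2.1],
})

RAINY_DAY_THRESHOLD_MM = 1.0  # days below this are not counted as rainy

# Keep only days at or above the threshold, then count rows per city.
total_rainy_days = (
    daily[daily["Rain (mm)"] >= RAINY_DAY_THRESHOLD_MM]
    .groupby("City")
    .size()
    .reset_index(name="Number_of_Rainy_Days")
)

print(total_rainy_days)
```

Note that with this approach a city with no days above the threshold would drop out of the result entirely instead of appearing with a count of 0.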

In [10]:
top_3_cities = (
    total_rainy_days.nlargest(3, "Number_of_Rainy_Days")
    .assign(
        Medal=lambda x: ["🥇" if i == 0 else "🥈" if i == 1 else "🥉" for i in range(3)]
    )
    .set_index("City")["Medal"]
)

total_rainy_days["Label"] = total_rainy_days.apply(
    lambda row: f"{row['Number_of_Rainy_Days']} {top_3_cities.get(row['City'], '')}",
    axis=1
)

plot = (
    ggplot(total_rainy_days) +
    geom_bar(
        aes(x="City", y="Number_of_Rainy_Days", fill="Number_of_Rainy_Days"),
        stat="identity", 
        tooltips=layer_tooltips()
            .line("City: | @City")
            .line("Total Rainy Days: | @Number_of_Rainy_Days")
    ) +
    geom_text(
        aes(x="City", y="Number_of_Rainy_Days", label="Label"),
        position=position_stack(vjust=1.025),
        color="black"
    ) +
    labs(
        title="🌧️Number of Rainy Days by City (2023)",
        subtitle=("As with the previous bar chart and pie chart, Amsterdam and Bogota both recorded higher values than London.\n"
                "However, despite its higher total precipitation, Bangalore did not surpass London in the number of rainy days."),
        x="City",
        y="Number of Rainy Days in 2023"
    ) +
    theme_bw() +
    theme(
        plot_title=element_text(size=24, face="bold", hjust=0.5),
        plot_subtitle=element_text(size=16),
        axis_text_x=element_text(size=14), 
        axis_text_y=element_text(size=14),
        axis_title_x=element_text(size=18, face="bold"),
        axis_title_y=element_text(size=18, face="bold"),
    ) +
    scale_fill_gradient(low="#c19dd0", high="#5c2970") +
    ggsize(1200, 500) +
    guides(fill="none")
)
plot

### **Monthly Average Rainfall**

Loading the CSV file and reading it as a DataFrame:

In [11]:
monthly_avg = pd.read_csv("../data/monthly_avg.csv")

Creating and customising a **line graph** to visualise the monthly average rainfall for each city:

In [12]:
plot = (
    ggplot(monthly_avg) +
    geom_line(aes(x="Month", y="Average Rainfall (mm)", color="City", group="City"), size=1.7) +
    geom_point(
        aes(x="Month", y="Average Rainfall (mm)", color="City"),
        tooltips=layer_tooltips()
            .line("City: | @City")
            .line("Month: | @Month")
            .line("Average Rainfall | @{Average Rainfall (mm)} mm"), 
        size=5, 
        shape=16
    ) +
    labs(
        title="💧Monthly Average Rainfall by City (2023)",
        subtitle="Amsterdam and Bogota consistently saw higher average rainfall values per month compared to London.\n"
                "Bangalore's monsoon season is clearly visible, with significantly greater rainfall than London from May to July and in September.",
        x="Month",
        y="Average Rainfall (mm)"
    ) +
    theme_bw() +
    theme(
        plot_title=element_text(size=24, face="bold"),
        plot_subtitle=element_text(size=16),
        axis_text_x=element_text(size=14),
        axis_text_y=element_text(size=14),
        axis_title_x=element_text(size=18, face="bold"),
        axis_title_y=element_text(size=18, face="bold"),
        panel_background=element_rect(fill="#e0e0e0")
    ) +
    scale_color_brewer(palette="RdGy") +
    ggsize(1200, 500)
)
plot

Creating and customising a **heatmap** to visualise the monthly average rainfall for each city:
> A clearer way of displaying the same data shown in the line graph.
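
The heatmap is effectively a City-by-Month matrix, which can also be inspected as a table. The sketch below uses hypothetical values and assumes `Month` is stored as a number. It pivots the long-format data into that matrix and subtracts London's row, so that positive entries mark months where another city was wetter than London. The names `rain_matrix` and `diff_vs_london` are illustrative.

```python
import pandas as pd

# Hypothetical monthly averages, for illustration only.
monthly_avg = pd.DataFrame({
    "City": ["London", "London", "City X", "City X"],
    "Month": [1, 2, 1, 2],
    "Average Rainfall (mm)": [50.0, 40.0, 70.0, 10.0],
})

# One row per city, one column per month: the same layout as the heatmap.
rain_matrix = monthly_avg.pivot(
    index="City", columns="Month", values="Average Rainfall (mm)"
)

# Difference from London for every other city (positive = wetter than London).
diff_vs_london = rain_matrix.drop(index="London") - rain_matrix.loc["London"]

print(rain_matrix)
print(diff_vs_london)
```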

In [13]:
plot = (
    ggplot(monthly_avg) +
    aes(x="Month", y="City", fill="Average Rainfall (mm)") + 
    geom_tile(
        tooltips=layer_tooltips()
        .line("City: | @City")
        .line("Month: | @Month")
        .line("Average Rainfall | @{Average Rainfall (mm)} mm")
    ) +
    labs(
        title="💧Monthly Average Rainfall by City (2023)",
        subtitle=("Riyadh's average rainfall in January was much higher than in London, as shown by the darker tile."),
        x="Month",
        y="City"
    ) +
    theme_bw() +
    theme(
        plot_title=element_text(size=24, face="bold", hjust=0.5),
        plot_subtitle=element_text(size=17),
        axis_text_x=element_text(size=14), 
        axis_text_y=element_text(size=14),
        axis_title_x=element_text(size=18, face="bold"),
        axis_title_y=element_text(size=18, face="bold"),
    ) +
    ggsize(1200, 500) +
    scale_fill_gradient(low="#fffff0", high="#be1b1b")
)
plot

Creating and customising a **dot plot** to compare each city's monthly average rainfall directly with London:
> Prepare the comparison by listing the cities to compare against London.
>
> Combine the data into a single DataFrame, pairing each city with a copy of London's data (labelled in a `Pair` column) so that every facet shows one city alongside London.

In [14]:
comparison_cities = ["Amsterdam", "Bogota", "Riyadh", "Bangalore"]
monthly_avg_london = monthly_avg[monthly_avg['City'] == "London"]

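# Tag each comparison city's rows with its own name in "Pair", then add one copy
# of London's rows per comparison city so every facet holds that city plus London.
monthly_avg_comp = pd.concat(
    [
        monthly_avg.query(f'City == "{city}"').assign(Pair=city)
        for city in comparison_cities
    ] +
    [monthly_avg_london.assign(Pair=city) for city in comparison_cities]
)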

In [15]:
plot = (
    ggplot(
        monthly_avg_comp, 
        aes(x="Month", y="Average Rainfall (mm)")
    ) +
    geom_point(
        aes(color="City"), 
        size=4, 
        tooltips=layer_tooltips()
            .format("@Month", "{.0f}")
            .line("City | @City")
            .line("Month | @Month")
            .line("Average Rainfall | @{Average Rainfall (mm)} mm"),
        shape=21, 
        fill="#111111"
    ) +
    facet_grid(y="Pair") +
    scale_y_continuous(
        name="Average Rainfall (mm)",
        breaks=list(range(0, 200, 50))
    ) +
    labs(
        title="💧Monthly Average Rainfall Comparison (London vs All)",
        subtitle="The red dots represent London's monthly average rainfall values.",
        x="Month"
    ) +
    theme(
        plot_title=element_text(size=24, face="bold"),
        plot_subtitle=element_text(size=16),
        axis_text_x=element_text(size=14), 
        axis_text_y=element_text(size=14),
        axis_title_x=element_text(size=18, face="bold"),
        axis_title_y=element_text(size=18, face="bold"),
        strip_background=element_rect(fill="#324655"),
        strip_text=element_text(size=14, color="white")
    ) +
    scale_color_manual(
        values={
            "London": "#D32f2f",
            "Bangalore": "#ff8C00",
            "Bogota": "#8a2be2",
            "Riyadh": "#2e8b57",
            "Amsterdam": "#1e90ff"
        }
    ) +
    guides(color="none") +
    ggsize(1200, 500)
)
plot

### **Comments**
> The bar chart reveals that Amsterdam and Bogota recorded more "rainy days" in 2023 than London, despite London's reputation for rain. Additionally, the monthly average rainfall for both Amsterdam and Bogota regularly exceeds London's, reinforcing the point that London receives less rainfall than these cities.
>
> Bangalore's monsoon season stands out from May to July and in September, when its rainfall far exceeds London's.
>
> In January, Riyadh's average rainfall also surpassed London's, which is notable given Riyadh's desert climate.

## **Conclusion**

> Overall, based on the 2023 data, it can be argued that London is **not** as rainy as commonly believed. Despite its reputation for frequent rain and poor weather, the data shows that Amsterdam and Bogota recorded more rainy days, as well as higher annual precipitation. Additionally, both cities consistently experienced greater average monthly rainfall than London. Bangalore’s monsoon season also significantly boosts its total precipitation, with its average rainfall from May to July and in September surpassing London’s. Surprisingly, Riyadh, a city known for its dryness, not only recorded higher precipitation per hour than London, but also had a higher average rainfall in January, suggesting that its rainfall is more intense when it does occur.
>
> Therefore, while London does see regular rainfall, other cities experience more extreme and prolonged periods of rain, challenging the perception of London as an exceptionally rainy city.