# Boxplots

This notebook contains code for creating boxplots and violin plots in `lets-plot`, using the ["Airlines Delays from 2003-2016"](https://www.kaggle.com/datasets/giovamata/airlinedelaycauses) dataset by [Priank Ravichandar](https://www.kaggle.com/priankravichandar), licensed under [CC0 1.0](https://creativecommons.org/publicdomain/zero/1.0/). The dataset covers flight delays and cancellations at US airports from 2003 to 2016.

In [1]:
import pandas as pd
from lets_plot import *

LetsPlot.setup_html()

## Import and process the data
* Create a date/time variable from the month/year column
* Remove the first and last years (2003 and 2016), as they contain only partial records
* Create a variable with the average delay per delayed flight, for each airport and month
* Create a variable which indicates the US region that an airport is located in

In [2]:
airlines = pd.read_csv("data/airlines.csv")
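# infer_datetime_format is deprecated in pandas 2.x, so it is omitted here
airlines["Time"] = pd.to_datetime(airlines["TimeLabel"])
# .copy() avoids a SettingWithCopyWarning when adding columns to the filtered frame
airlines = airlines[~(airlines["TimeYear"].isin([2003, 2016]))].copy()
# Total delay minutes divided by the number of delayed flights
airlines["AverageMinutesDelayed"] = airlines["MinutesDelayedTotal"] / airlines["FlightsDelayed"]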

airlines["region"] = airlines["AirportCode"].map({
    "ATL": "Southeast",
    "BOS": "Northeast",
    "BWI": "Northeast",
    "CLT": "Southeast",
    "DCA": "Southeast",
    "DEN": "West",
    "DFW": "Southwest",
    "DTW": "Midwest",
    "EWR": "Northeast",
    "FLL": "Southeast",
    "IAD": "Southeast",
    "IAH": "Southwest",
    "JFK": "Northeast",
    "LAS": "West",
    "LAX": "West",
    "LGA": "Northeast",
    "MCO": "Southeast",
    "MDW": "Midwest",
    "MIA": "Southeast",
    "MSP": "Midwest",
    "ORD": "Midwest",
    "PDX": "West",
    "PHL": "Northeast",
    "PHX": "Southwest",
    "SAN": "West",
    "SEA": "West",
    "SFO": "West",
    "SLC": "West",
    "TPA": "Southeast",
})
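
As a quick sanity check on the processing steps above, the sketch below uses a tiny made-up frame (the values and the `XXX` code are hypothetical; only the column names follow the notebook). It illustrates two things worth verifying on the real data: airport codes missing from the region mapping come back as `NaN`, and a month with zero delayed flights produces a non-finite average.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; only the column names mirror the notebook.
toy = pd.DataFrame({
    "AirportCode": ["ATL", "BOS", "XXX"],
    "MinutesDelayedTotal": [1200, 0, 300],
    "FlightsDelayed": [40, 0, 10],
})

# Same calculation as in the notebook: average delay per delayed flight.
toy["AverageMinutesDelayed"] = toy["MinutesDelayedTotal"] / toy["FlightsDelayed"]

# 0 / 0 gives NaN and x / 0 gives inf; normalise both to NaN so they can be dropped.
toy["AverageMinutesDelayed"] = toy["AverageMinutesDelayed"].replace(
    [np.inf, -np.inf], np.nan
)

# Partial mapping for illustration; codes not in the dict map to NaN.
toy["region"] = toy["AirportCode"].map({"ATL": "Southeast", "BOS": "Northeast"})

unmapped = sorted(toy.loc[toy["region"].isna(), "AirportCode"].unique())
n_missing_delay = int(toy["AverageMinutesDelayed"].isna().sum())
print(unmapped, n_missing_delay)
```

Running the same two checks on `airlines` would show whether any airport lacks a region or any row lacks a usable average before plotting.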

## Boxplot showing the distribution of flight delay times, by airport

In [3]:
(
        ggplot(airlines,
               aes(x="AirportCode", y="AverageMinutesDelayed"))
        + geom_boxplot(fill="#b3cde3")
        + xlab("Airport code")
        + ylab("Flight delay (minutes)")
        + ggtitle("Distribution of flight delay times in US airports, 2004-2015")
)
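
To make the boxes easier to read, the following sketch computes a Tukey-style summary for a single group: the quartiles, the interquartile range (IQR), and whiskers reaching to the most extreme observations within 1.5 × IQR of the box. This is assumed to correspond to the defaults of `geom_boxplot`, which follows ggplot2 conventions; the data and the `box_stats` helper are made up for illustration.

```python
import pandas as pd


def box_stats(values: pd.Series) -> dict:
    """Tukey-style boxplot summary for one group (hypothetical helper)."""
    q1, median, q3 = values.quantile([0.25, 0.5, 0.75])
    iqr = q3 - q1
    # Points beyond these fences are drawn as individual outliers.
    lower_fence, upper_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    inside = values[(values >= lower_fence) & (values <= upper_fence)]
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "whisker_low": inside.min(),
        "whisker_high": inside.max(),
        "n_outliers": int(len(values) - len(inside)),
    }


# Made-up monthly averages (minutes) for one airport.
delays = pd.Series([40.0, 45.0, 50.0, 55.0, 60.0, 120.0])
stats = box_stats(delays)
print(stats)
```

In this toy series the value 120 lies beyond the upper fence, so it would appear as a single outlier point above the whisker.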

## Boxplot showing the distribution of flight delay times, by airport and US region

In [4]:
(
        ggplot(airlines,
               aes(x="AirportCode", y="AverageMinutesDelayed", fill="region"))
        + geom_boxplot()
        + xlab("Airport code")
        + ylab("Flight delay (minutes)")
        + ggtitle("Distribution of flight delay times in US airports, 2004-2015")
        + scale_fill_brewer(type="qual", palette="Pastel1", name="US region")
)

## Violin plot showing the distribution of flight delay times, by airport and US region

In [7]:
(
        ggplot(airlines,
               aes(x="AirportCode", y="AverageMinutesDelayed", fill="region"))
        + geom_violin()
        + xlab("Airport code")
        + ylab("Flight delay (minutes)")
        + ggtitle("Distribution of flight delay times in US airports, 2004-2015")
        + scale_fill_brewer(type="qual", palette="Pastel1", name="US region")
)