# STAT 201 Proposal

## Group 11
Ray Tao    
Florence Wang    
Lesley Mai    

<hr style="opacity: 0.3" />

# Comparing the Delay Rates of Two Popular Airlines in the United States

## Introduction

### Basic Information

As demand for fast travel has grown, flying has become one of the most popular ways to travel over the past several decades. At the same time, flight delays are frustrating and sometimes unavoidable. For instance, on the day we checked, the share of delayed flights ranged from about 0% to 34% across airlines (FlightAware, 2023). The average delay rate of an airline has therefore become an essential criterion for customers evaluating and choosing a carrier. In this project, we will use the dataset created by Ulrik Thyge Pedersen (2023) to determine whether there is a significant difference in delay rate between two industry giants, Delta Airlines and American Airlines.

### Research Question

Is there a difference between the delay rates of flights operated by Delta Airlines and American Airlines?



## Preliminary Results

In [1]:
library(tidyverse)
library(infer)
library(dplyr)

── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
✔ ggplot2 3.3.6      ✔ purrr   0.3.4 
✔ tibble  3.1.8      ✔ dplyr   1.0.10
✔ tidyr   1.2.1      ✔ stringr 1.4.1 
✔ readr   2.1.2      ✔ forcats 0.5.2 
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()


### Load Data

In [2]:
airlines_delay <- read.csv("https://raw.githubusercontent.com/rayyyy122/STAT-201-Project/main/airlines_delay.csv", header = TRUE)

“EOF within quoted string”


#### Raw Dataset

In [3]:
head(airlines_delay)

| | X..DOCTYPE.html. |
|---|---|
| | `<chr>` |
| 1 | `<html lang=en data-color-mode=auto data-light-theme=light data-dark-theme=dark data-a11y-animated-images=system>` |
| 2 | `<head>` |
| 3 | `<meta charset=utf-8>` |
| 4 | `<link rel=dns-prefetch href=https://github.githubassets.com>` |
| 5 | `<link rel=dns-prefetch href=https://avatars.githubusercontent.com>` |
| 6 | `<link rel=dns-prefetch href=https://github-cloud.s3.amazonaws.com>` |


<left><em>Table 1: Raw Airlines Delay Dataset</em></left>
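The output above contains HTML markup rather than flight records, which suggests the URL returned a web page instead of the CSV file. A minimal sketch of a sanity check that would flag this right after loading is shown here. The helper `check_delay_columns()` is hypothetical (it is not part of the original notebook), and the required column names `Airline` and `Class` are assumed from the wrangling code used later in this proposal.

```r
# Hypothetical helper (an assumption for illustration, not part of the notebook):
# stop early if the loaded data frame lacks the columns the analysis relies on.
check_delay_columns <- function(df, required = c("Airline", "Class")) {
  missing_cols <- setdiff(required, names(df))
  if (length(missing_cols) > 0) {
    stop("Loaded data is missing expected column(s): ",
         paste(missing_cols, collapse = ", "))
  }
  invisible(df)
}

# Toy frames for illustration only: one shaped like the expected CSV,
# one shaped like an HTML page that was parsed as a single column.
good <- data.frame(Airline = c("DL", "AA"), Class = c(0L, 1L))
bad  <- data.frame(X..DOCTYPE.html. = "<head>")

check_delay_columns(good)  # passes silently

# Capture the error message instead of halting the notebook
bad_result <- tryCatch(check_delay_columns(bad),
                       error = function(e) conditionMessage(e))
bad_result
```

Calling `check_delay_columns(airlines_delay)` directly after `read.csv()` would stop the notebook at the loading step rather than at the later `filter()` call.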

### Clean and Wrangle Data

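To clean and wrangle our data, we are going to:

1. Keep only flights operated by Delta Airlines (`DL`) and American Airlines (`AA`)

2. Rename the `Class` column to `Delay` and convert it to a factor

3. Filter out rows containing `NA`

4. Select the relevant columns for our question (`Airline` and `Delay`)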

In [4]:
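delay_data <- airlines_delay |> 
    # Keep only Delta (DL) and American (AA) flights; this also drops rows with a missing airline
    filter(Airline %in% c("DL", "AA")) |> 
    # `Class` holds the delay indicator, so give it a clearer name
    rename(Delay = Class) |>
    # Drop rows with a missing delay indicator
    filter(!is.na(Delay)) |>
    # Treat the delay indicator as a categorical variable
    mutate(Delay = as.factor(Delay)) |>
    select(Airline, Delay)

head(delay_data)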

ERROR: Error in `filter()`:
! Problem while computing `..1 = Airline %in% c("DL", "AA")`.
Caused by error in `Airline %in% c("DL", "AA")`:
! object 'Airline' not found


<left><em>Table 2: Delay Status of DL and AA Flights</em></left>

### Broad Overview of Data

We estimate the proportion of delayed flights for each airline and add these estimates to the plot as text labels, so that the difference is easier to see.

In [None]:
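# Count flights by airline and delay status, then compute proportions within each airline
delay_summary <- delay_data %>%
  group_by(Airline, Delay) %>%
  summarize(count = n(), .groups = "drop") %>%
  group_by(Airline) %>%
  mutate(prop = count / sum(count)) %>%
  ungroup()

# Stacked bar chart of flight counts, labelled with the within-airline proportions
delay_plot <- delay_data |> 
  ggplot(aes(x = Airline, fill = Delay)) +
  geom_bar(width = 0.5) + 
  ylab("Number of Flights") +
  ggtitle("Proportion of Delayed Flights by Airline") +
  scale_fill_discrete(name = "Delay or Not",
                      breaks = c("0", "1"),
                      labels = c("No", "Yes")) +
  # Centre each label inside its own bar segment
  geom_text(data = delay_summary,
            aes(y = count, label = round(prop, 2), group = Delay),
            position = position_stack(vjust = 0.5))

delay_plot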


## Methods: Plan

In this project, we plan to use two different methods to conduct a hypothesis test about the difference between the proportions of delayed flights operated by Delta Airlines and American Airlines. First we will use a simulation-based approach with the `infer` package, and then we will use a theory-based approach. We will also construct confidence intervals for the difference between the two proportions. We plan to conduct the hypothesis test at the most common significance level of $\alpha = 0.05$. In the simulation-based approach, we will use permutation to generate the null distribution and bootstrapping to build the confidence interval. In the theory-based approach, we will use a two-sample z-test for proportions, which relies on the Central Limit Theorem.
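The simulation-based part of this plan could look roughly like the following sketch. It runs on simulated toy data, not the project dataset: the sample sizes, the delay probabilities, and the names `toy_flights`, `obs_diff`, `null_dist`, and `boot_ci` are all assumptions made for illustration, and the coding `"1"` = delayed is assumed from the plot labels used earlier.

```r
library(infer)

# Simulated toy data (NOT the project data): 200 flights per airline with
# made-up delay probabilities, used only to show the workflow.
set.seed(201)
toy_flights <- data.frame(
  Airline = factor(rep(c("DL", "AA"), each = 200)),
  Delay   = factor(c(rbinom(200, 1, 0.45), rbinom(200, 1, 0.40)))
)

# Observed difference in delay proportions (DL minus AA)
obs_diff <- toy_flights |>
  specify(Delay ~ Airline, success = "1") |>
  calculate(stat = "diff in props", order = c("DL", "AA"))

# Null distribution: shuffle the airline labels under independence
null_dist <- toy_flights |>
  specify(Delay ~ Airline, success = "1") |>
  hypothesize(null = "independence") |>
  generate(reps = 1000, type = "permute") |>
  calculate(stat = "diff in props", order = c("DL", "AA"))

# Two-sided p-value, to be compared with alpha = 0.05
p_value <- null_dist |>
  get_p_value(obs_stat = obs_diff, direction = "two-sided")

# 95% percentile bootstrap confidence interval for the difference
boot_ci <- toy_flights |>
  specify(Delay ~ Airline, success = "1") |>
  generate(reps = 1000, type = "bootstrap") |>
  calculate(stat = "diff in props", order = c("DL", "AA")) |>
  get_confidence_interval(level = 0.95, type = "percentile")

p_value
boot_ci
```

With the real data, `toy_flights` would be replaced by the cleaned `delay_data` frame, provided it has the same two columns.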

Our null and alternative hypotheses are:

• $H_0 : p_1 = p_2$

• $H_a : p_1 \neq p_2$

where:

$p_1$ is the proportion of delayed flights in all flights operated by Delta Airlines

$p_2$ is the proportion of delayed flights in all flights operated by American Airlines
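For the theory-based approach, a minimal sketch of the two-sample z-test for these hypotheses is given here. The counts are hypothetical and are not results from the project data. The sketch uses the pooled standard error for the test statistic and the unpooled standard error for the confidence interval, then cross-checks both against base R's `prop.test()` with the continuity correction turned off.

```r
# Hypothetical counts for illustration only (NOT results from the project data)
delayed <- c(DL = 90, AA = 80)    # number of delayed flights per airline
flights <- c(DL = 200, AA = 200)  # total number of flights per airline

p_hat  <- delayed / flights            # sample proportions
p_pool <- sum(delayed) / sum(flights)  # pooled proportion under H_0: p_1 = p_2
diff_hat <- unname(p_hat["DL"] - p_hat["AA"])

# Test statistic: z = (p_hat_1 - p_hat_2) / sqrt(p_pool * (1 - p_pool) * (1/n_1 + 1/n_2))
se_null <- sqrt(p_pool * (1 - p_pool) * sum(1 / flights))
z_stat  <- diff_hat / se_null

# Two-sided p-value from the standard normal distribution
p_value <- 2 * pnorm(abs(z_stat), lower.tail = FALSE)

# 95% confidence interval for p_1 - p_2, using the unpooled standard error
se_ci <- sqrt(sum(p_hat * (1 - p_hat) / flights))
ci    <- diff_hat + c(-1, 1) * qnorm(0.975) * se_ci

# Cross-check: prop.test() reports X-squared, which equals z^2 without correction
builtin <- prop.test(x = delayed, n = flights, correct = FALSE)

c(z = z_stat, p_value = p_value)
ci
```

We would reject $H_0$ at $\alpha = 0.05$ when `p_value` is below 0.05, which is equivalent to the 95% interval excluding zero only approximately, since the test and the interval use different standard errors.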

Based on the sample proportions and the graph above, we expect to reject the null hypothesis in favour of the alternative that the delay rates of these two airlines differ. This finding would have several implications: for instance, it could help passengers make better choices when buying flight tickets and help investors evaluate an airline company. Future research could cover more airlines, including those outside the U.S., to provide information on flight delays to a wider range of international travellers, and could investigate the possible causes of these differences.

Meanwhile, this analysis is limited because it does not tell us where any significant difference lies. We also need to consider whether there are testing methods, other than those based on the Central Limit Theorem, that can detect a difference between two populations more accurately.
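One way to explore this question is to check whether the normal approximation is reasonable and to compare it with an exact test. The sketch below uses hypothetical counts (not project results), applies the common rule of thumb of at least 10 delayed and 10 on-time flights per airline, and runs Fisher's exact test as one candidate alternative. It illustrates an option we might consider, not a method this proposal commits to.

```r
# Hypothetical 2x2 table for illustration only (NOT results from the project data)
counts <- matrix(
  c(90, 110,   # DL: delayed, on time
    80, 120),  # AA: delayed, on time
  nrow = 2, byrow = TRUE,
  dimnames = list(Airline = c("DL", "AA"), Delay = c("Yes", "No"))
)

# Rule-of-thumb check for the normal approximation:
# every cell (delayed and on-time counts per airline) should be at least 10
clt_ok <- all(counts >= 10)

# Fisher's exact test does not rely on the Central Limit Theorem
exact_test <- fisher.test(counts, alternative = "two.sided")

clt_ok
exact_test$p.value
```

When the counts are large, as they are expected to be for two major airlines, the exact and approximate tests should give similar conclusions.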



## References

1. FlightAware. (2023). https://flightaware.com/live/cancelled/

2. Pedersen, U. T. (2023). Airlines Delay. https://www.kaggle.com/datasets/ulrikthygepedersen/airlines-delay