# Project 1

### <div class="alert alert-block alert-danger"><b> COVID-19 Impact Analysis: Alabama vs Other States </b></div>

In this project, you will analyze and compare the impact of COVID-19 in Alabama with that in another state or with the national average. This analysis will involve a deep dive into the data, where you will calculate descriptive statistics, visualize trends, and apply probability and distribution concepts covered in Chapter 3.

Your task is to use the statistical tools and techniques we've discussed to uncover patterns, trends, and anomalies in the data. By the end of this project, you should be able to draw meaningful conclusions about how COVID-19 has impacted Alabama compared to the selected region. This might include insights into the spread of the virus, the effectiveness of interventions, or other key metrics.

You are encouraged to critically analyze the data and consider factors that may explain the differences you observe. Use the skills you have developed to tell a data-driven story that sheds light on the public health situation in Alabama.

IMPORTANT: I have given template code for some parts of the project for the state of Georgia. You are expected to pick a different state and to fill in the code that is missing throughout. 

### Objectives
1. **Data Acquisition:** Learn how to access and download COVID-19 data from a reliable source.
2. **Descriptive Statistics:** Calculate and interpret key statistical measures (mean, median, standard deviation, etc.) for COVID-19 metrics in Alabama and another state.
3. **Data Visualization:** Create plots to visually compare COVID-19 trends between Alabama and the selected region.
4. **Probability and Distribution Analysis:** Apply probability concepts to understand the distribution of cases or deaths and make predictions.
5. **Draw Conclusions:** Use the data analysis to make well-supported conclusions about the impact of COVID-19 in Alabama relative to the comparison region. Consider how these conclusions can inform public health decisions or policy.


### Part 1: Data Acquisition

**Objective:** Access and download COVID-19 data for multiple states, including Alabama.

#### Steps:

1. **Visit a Reliable Data Source:**
   - The **New York Times COVID-19 Data** repository on GitHub is a reliable source of daily COVID-19 data for the U.S. You can access it at [https://github.com/nytimes/covid-19-data](https://github.com/nytimes/covid-19-data).
   - Alternatively, you can use **Our World in Data** or **Johns Hopkins University COVID-19 Data** from [https://github.com/CSSEGISandData/COVID-19](https://github.com/CSSEGISandData/COVID-19).

2. **Search for the Dataset:**
   - For the New York Times dataset, you can directly download the CSV files for cases and deaths by state from their GitHub repository.
   - Look for the file named `us-states.csv`.

3. **Download the Dataset:**
   - Download the `us-states.csv` file and save it in a folder named `data` within your project directory.

In [None]:
# Create the 'data' subfolder if it doesn't exist
if (!dir.exists("data")) {
  dir.create("data")
}

# Define the URL of the CSV file
url <- "https://raw.githubusercontent.com/nytimes/covid-19-data/master/us-states.csv"

# Define the destination file path
destfile <- "data/us-states.csv"

# Download the file
download.file(url, destfile)

# Verify that the file has been downloaded
list.files("data")

### Part 2: Descriptive Statistics

**Objective:** Calculate and interpret descriptive statistics for COVID-19 cases and deaths in Alabama and another selected state.

#### Steps:

1. **Load the Dataset into R:**
   - Use appropriate R functions to read the CSV file and inspect the data.

2. **Filter the Data:**
   - Extract COVID-19 data for Alabama and your chosen comparison state from the dataset.

3. **Calculate Descriptive Statistics:**
   - Compute the **mean**, **median**, **standard deviation**, **variance**, and **range** for cases and deaths in both states.

4. **Interpret the Results:**
   - Compare the descriptive statistics between Alabama and the selected state.
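
The loading and filtering steps can be tried out on a tiny inline table before you apply them to the real file. This is only a sketch: the values are made up, and the names `toy_csv`, `toy_data`, and `toy_filtered` are hypothetical. With the downloaded file you would pass the path `data/us-states.csv` to the reader instead of `text =`, and you would substitute your own comparison state for Georgia.

```r
library(dplyr)

# Made-up rows in the same column layout as us-states.csv (illustrative values only)
toy_csv <- "date,state,fips,cases,deaths
2020-03-13,Alabama,01,6,0
2020-03-14,Alabama,01,12,0
2020-03-13,Georgia,13,42,1
2020-03-14,Georgia,13,66,1
2020-03-13,Florida,12,70,2"

# Read the text as a data frame; keep fips as text so leading zeros survive
toy_data <- read.csv(text = toy_csv, colClasses = c(fips = "character"))
toy_data$date <- as.Date(toy_data$date)

# Inspect the first rows and the column types
head(toy_data)
str(toy_data)

# Keep only the two states being compared
toy_filtered <- toy_data %>%
  filter(state %in% c("Alabama", "Georgia"))

toy_filtered
```

Checking `str()` before filtering is a quick way to confirm that `date` was parsed as a date and that the count columns are numeric.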


In [None]:
# Install and load necessary packages
# install.packages("tidyverse") # Run only if not already installed
library(tidyverse)

# Read the CSV file


# View the first few rows of the dataset


# Filter data for Alabama and another state (e.g., Georgia)


# Calculate descriptive statistics for Alabama and Georgia
# Note: `cases` and `deaths` in us-states.csv are cumulative totals
covid_summary <- covid_data_filtered %>%
  group_by(state) %>%
  reframe(
    mean_cases = mean(cases, na.rm = TRUE),
    median_cases = median(cases, na.rm = TRUE),
    sd_cases = sd(cases, na.rm = TRUE),
    var_cases = var(cases, na.rm = TRUE),
    # range() returns c(min, max), so each state gets two rows in the output
    range_cases = range(cases, na.rm = TRUE),
    mean_deaths = mean(deaths, na.rm = TRUE),
    median_deaths = median(deaths, na.rm = TRUE),
    sd_deaths = sd(deaths, na.rm = TRUE),
    var_deaths = var(deaths, na.rm = TRUE),
    range_deaths = range(deaths, na.rm = TRUE)
  )
covid_summary

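The `fips` column in the dataset holds each state's Federal Information Processing Standards (FIPS) code, a standard numeric identifier for U.S. states. The template code in this project does not use it.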

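`reframe()` is used here rather than `summarise()` because `range()` returns two values per group (the minimum and the maximum). Since dplyr 1.1.0, `summarise()` expects exactly one row per group, whereas `reframe()` accepts any number of rows per group and always returns an ungrouped data frame. As a result, `covid_summary` has two rows per state: the `range_*` columns hold the minimum in the first row and the maximum in the second, and the other statistics are repeated.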
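
The difference is easiest to see on a small example. The sketch below uses made-up numbers and hypothetical names (`toy`, `two_rows`, `one_row`). It contrasts `reframe()` with `range()` against `summarise()` with separate `min()` and `max()` columns, which is one possible alternative if you prefer a single row per state.

```r
library(dplyr)

# Made-up cumulative counts for two states (illustrative values only)
toy <- tibble(
  state = c("Alabama", "Alabama", "Alabama", "Georgia", "Georgia", "Georgia"),
  cases = c(10, 30, 60, 20, 50, 90)
)

# range() returns c(min, max), so reframe() produces two rows per state
two_rows <- toy %>%
  group_by(state) %>%
  reframe(mean_cases = mean(cases), range_cases = range(cases))

# Splitting the range into min and max gives one row per state,
# which also works with summarise()
one_row <- toy %>%
  group_by(state) %>%
  summarise(
    mean_cases = mean(cases),
    min_cases = min(cases),
    max_cases = max(cases),
    .groups = "drop"
  )

two_rows
one_row
```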

### Part 3: Data Visualization

**Objective:** Create visual representations to compare COVID-19 cases and deaths between Alabama and the selected state.

#### Steps:
1. **Line Plot of Cases Over Time:**
    - Visualize how the number of cases has changed over time for both states.
2. **Bar Plot for Deaths:**
    - Compare the number of deaths across the states.

Both states are shown in the same plot to facilitate direct comparison.
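
One possible approach to the deaths comparison, sketched on made-up numbers: because the counts in `us-states.csv` are cumulative, the most recent row for each state holds its running total, and `geom_col()` draws one bar per state. The names `toy`, `latest_deaths`, and `deaths_plot` are hypothetical, and other designs (for example, bars by month) are equally reasonable.

```r
library(dplyr)
library(ggplot2)

# Made-up cumulative death counts (illustrative values only)
toy <- tibble(
  date = as.Date(rep(c("2020-04-01", "2020-04-02", "2020-04-03"), times = 2)),
  state = rep(c("Alabama", "Georgia"), each = 3),
  deaths = c(5, 8, 12, 9, 15, 20)
)

# Latest cumulative total for each state
latest_deaths <- toy %>%
  group_by(state) %>%
  slice_max(date, n = 1) %>%
  ungroup()

# One bar per state, with height equal to the latest cumulative total
deaths_plot <- ggplot(latest_deaths, aes(x = state, y = deaths, fill = state)) +
  geom_col() +
  labs(title = "Cumulative Deaths by State (toy data)",
       x = "State",
       y = "Cumulative Deaths") +
  theme_minimal()

deaths_plot
```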

In [None]:
# Line plot for cases over time for both states
ggplot(covid_data_filtered, aes(x = date, y = cases, color = state)) +
  geom_line(linewidth = 1) +
  labs(title = "COVID-19 Cases in Alabama vs. Georgia Over Time",
       x = "Date",
       y = "Cumulative Number of Cases") +
  theme_minimal()

# Bar plot for deaths comparison


### Part 4: Probability and Distribution Analysis

**Objective:** Apply probability concepts and work with distributions based on the COVID-19 data.

#### Steps:
1. **Assume Normal Distribution:**
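    - Assume that daily new cases for each state are normally distributed. Because the `cases` column in `us-states.csv` is a cumulative total, daily new cases must first be derived as day-to-day differences.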
2. **Calculate Probabilities:**
    - Determine the probability that a randomly selected day in Alabama has more than a certain number of new cases (e.g., 1000) and compare this with the other state.
3. **Determine Percentiles:**
    - Calculate the 90th percentile for daily new cases in both states and compare the results.
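
These steps can be sketched on a short made-up series. The numbers, the `threshold` value, and the variable names below are assumptions chosen for illustration, not values from the dataset. The sketch derives daily new cases from cumulative totals with `diff()`, then uses `pnorm()` for an upper-tail probability and `qnorm()` for a percentile under the normal model.

```r
# Made-up cumulative case counts for one state on consecutive days (illustrative only)
cumulative_cases <- c(100, 250, 450, 600, 900, 1100, 1400)

# Daily new cases are the differences between consecutive cumulative totals
new_cases <- diff(cumulative_cases)

# Parameters of the assumed normal model
mu <- mean(new_cases)
sigma <- sd(new_cases)

# Hypothetical threshold, chosen to suit the toy data
threshold <- 250

# P(X > threshold): upper-tail probability under the normal model
prob_above <- pnorm(threshold, mean = mu, sd = sigma, lower.tail = FALSE)

# 90th percentile: the value x with P(X <= x) = 0.90
p90 <- qnorm(0.90, mean = mu, sd = sigma)

prob_above
p90
```

Setting `lower.tail = FALSE` is equivalent to computing `1 - pnorm(threshold, mean = mu, sd = sigma)`.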

In [None]:
# `cases` is a cumulative total, so derive daily new cases per state first
covid_daily <- covid_data_filtered %>%
  group_by(state) %>%
  arrange(date, .by_group = TRUE) %>%
  mutate(new_cases = cases - lag(cases)) %>%
  ungroup()

# Calculate mean and standard deviation for Alabama and Georgia daily new cases
new_cases_alabama <- covid_daily$new_cases[covid_daily$state == "Alabama"]
mean_cases_alabama <- mean(new_cases_alabama, na.rm = TRUE)
sd_cases_alabama <- sd(new_cases_alabama, na.rm = TRUE)

new_cases_georgia <- covid_daily$new_cases[covid_daily$state == "Georgia"]
mean_cases_georgia <- mean(new_cases_georgia, na.rm = TRUE)
sd_cases_georgia <- sd(new_cases_georgia, na.rm = TRUE)

# Probability of more than 1000 cases in Alabama on a given day
prob_above_1000_alabama <- 1 - pnorm(1000, mean = mean_cases_alabama, sd = sd_cases_alabama)
prob_above_1000_alabama

# Probability of more than 1000 cases in Georgia on a given day


# 90th percentile for daily new cases in Alabama


# 90th percentile for daily new cases in Georgia


### Your Conclusion

Based on your calculations, what can you conclude about the severity or spread of COVID-19 in Alabama compared to the other state? Are there specific periods of concern or significant differences in the distribution of cases?