# Title

## 1. Introduction

### Instructions (delete when finished)
*Begin by providing some relevant background information on the topic so that someone unfamiliar with it will be prepared to understand the rest of your proposal.*

*Clearly state the question you will try to answer with your project. Your question should involve one or more random variables of interest, spread across two or more categories that are interesting to compare. For example, you could consider the annual maxima river flow at two different locations along a river, or perhaps gender diversity at different universities. Of the response variable, identify one location parameter (mean, median, quantile, etc.) and one scale parameter (standard deviation, inter-quartile range, etc.) that would be useful in answering your question. Justify your choices.*

*Identify and describe the dataset that will be used to answer the question. Remember, this dataset is allowed to contain more variables than you need – feel free to drop them!*

*Also, be sure to frame your question/objectives in terms of what is already known in the literature. Be sure to include at least two scientific publications that can help frame your study (you will need to include these in the References section). We have no specific citation style requirements, but be consistent.*

### 1.1 Background information on the topic

Text

### 1.2 The Question 

Text

### 1.3 The Dataset

Text

### 1.4 The Literature 

Text

## 2. Preliminary Results

### 2.0 Libraries and Packages

In [1]:
install.packages("lubridate")

Installing package into 'C:/Users/chunq/R/win-library/4.1'
(as 'lib' is unspecified)



package 'lubridate' successfully unpacked and MD5 sums checked


"cannot remove prior installation of package 'lubridate'"
"problem copying C:\Users\chunq\R\win-library\4.1\00LOCK\lubridate\libs\x64\lubridate.dll to C:\Users\chunq\R\win-library\4.1\lubridate\libs\x64\lubridate.dll: Permission denied"
"restored 'lubridate'"



The downloaded binary packages are in
	C:\Users\chunq\AppData\Local\Temp\RtmpCYRaZS\downloaded_packages


In [2]:
library(tidyverse)
library(readr)
library(tidyr)
library(dbplyr)
library(lubridate)

-- Attaching packages ------------------------------------------------------------------------------- tidyverse 1.3.1 --

v ggplot2 3.3.5     v purrr   0.3.4
v tibble  3.1.6     v dplyr   1.0.7
v tidyr   1.1.4     v stringr 1.4.0
v readr   2.1.1     v forcats 0.5.1

-- Conflicts ---------------------------------------------------------------------------------- tidyverse_conflicts() --
x dplyr::filter() masks stats::filter()
x dplyr::lag()    masks stats::lag()


Attaching package: 'dbplyr'


The following objects are masked from 'package:dplyr':

    ident, sql



Attaching package: 'lubridate'


The following objects are masked from 'package:base':

    date, intersect, setdiff, union




### 2.1 Read the data into R

In [5]:
df <- read_csv("../data/crdt-data.csv")


Rows: 5320 Columns: 54

-- Column specification ------------------------------------------------------------------------------------------------
Delimiter: ","
chr  (1): State
dbl (53): Date, Cases_Total, Cases_White, Cases_Black, Cases_Latinx, Cases_A...


i Use `spec()` to retrieve the full column specification for this data.
i Specify the column types or set `show_col_types = FALSE` to quiet this message.



In [None]:
head(df)
colnames(df)

### 2.2 Clean and wrangle data into a tidy format


In [6]:
# Select the columns of interest.
# Most columns are not needed, so keep only the total, White, Black,
# Latinx, and Asian counts for cases and deaths.
case_death <- df %>%
    # Alternative selection that also keeps hospitalizations:
    #select(Cases_Total:Cases_Other, Deaths_Total:Deaths_Other,
     #Hosp_Total:Hosp_Other)
    select(Date, State, Cases_Total:Cases_Asian, Deaths_Total:Deaths_Asian)

glimpse(case_death)

Rows: 5,320
Columns: 12
$ Date          <dbl> 20210307, 20210307, 20210307, 20210307, 20210307, 202103~
$ State         <chr> "AK", "AL", "AR", "AS", "AZ", "CA", "CO", "CT", "DC", "D~
$ Cases_Total   <dbl> 59332, 499819, 324818, NA, 826454, 3501394, 435762, 2853~
$ Cases_White   <dbl> 18300, 160347, 207596, NA, 308453, 546630, 181669, 85469~
$ Cases_Black   <dbl> 1499, 82790, 50842, NA, 25775, 111279, 12637, 19651, 201~
$ Cases_Latinx  <dbl> NA, NA, NA, NA, 244539, 1509103, 119224, 41523, NA, 1453~
$ Cases_Asian   <dbl> 2447, 2273, 2913, NA, 11921, 186562, 6406, 3019, 914, 18~
$ Deaths_Total  <dbl> 305, 10148, 5319, NA, 16328, 54124, 5986, 7704, 1030, 14~
$ Deaths_White  <dbl> 127, 4730, 4171, NA, 8066, 16586, 3869, 5413, 105, 1036,~
$ Deaths_Black  <dbl> 9, 2223, 784, NA, 433, 3275, 191, 906, 773, 

#### Reasons to drop other columns 

We are only interested in the White, Black, Asian, and Latinx groups, so we drop the following case columns:
+ Cases_Unknown               
+ Cases_Ethnicity_Hispanic    
+ Cases_Ethnicity_NonHispanic 
+ Cases_Ethnicity_Unknown     

Similarly, we drop the following death and hospitalization columns:

+ Deaths_Unknown               
+ Deaths_Ethnicity_Hispanic    
+ Deaths_Ethnicity_NonHispanic 
+ Deaths_Ethnicity_Unknown 
+ Hosp_Unknown               
+ Hosp_Ethnicity_Hispanic    
+ Hosp_Ethnicity_NonHispanic 
+ Hosp_Ethnicity_Unknown



In [None]:
# Check the unique states and decide which two to compare
states <- as.factor(case_death$State)

# Note: these include US territories (e.g. "AS"), which we ignore;
# we are interested in the states with the most cases
n_states <- length(levels(states)) 
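
One possible way to drop the territories is sketched below. It assumes that the 50 states plus DC are the regions of interest; `state.abb` is a built-in R vector of the 50 two-letter state abbreviations, while `toy`, `us_regions`, and `toy_states` are hypothetical names, and the counts are made up to stand in for `case_death`.

```r
library(dplyr)

# Hypothetical stand-in for case_death; the counts are made up.
toy <- data.frame(
  State = c("CA", "TX", "AS", "GU", "DC"),
  Cases_Total = c(100, 80, NA, 5, 10),
  stringsAsFactors = FALSE
)

# state.abb ships with R (datasets package) and lists the 50 states.
# Adding "DC" by hand is an assumption about what should count as a state.
us_regions <- c(state.abb, "DC")

# Keep only rows whose State code is one of the 50 states or DC.
toy_states <- toy %>%
  filter(State %in% us_regions)

n_distinct(toy_states$State)
```

The same `filter()` call could be applied to `case_death` before counting the levels, so that `n_states` would no longer include the territories.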


In [53]:
# Drop rows with missing values to get cleaner data to work with
cd_clean <- case_death %>%
             drop_na()
             
# Split the date into month and day, and keep only 2021
tidy_cd <- cd_clean %>%
           mutate(Date = lubridate::ymd(Date),
                  year = lubridate::year(Date),
                  month = lubridate::month(Date),
                  day = lubridate::day(Date)) %>%
           select(-Date) %>%
           filter(year == 2021) %>%
           select(-year) %>%
           group_by(month, State) %>%
           arrange(desc(Cases_Total))

# The rows are arranged by total cases in descending order
# to find the states with the most cases, excluding NAs
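
A more direct way to rank states might look like the sketch below. It assumes that `Cases_Total` is a cumulative count, so that each state's latest report is also its largest; `toy` and `top_states` are hypothetical names, and the numbers are made up.

```r
library(dplyr)

# Hypothetical stand-in for cd_clean; the counts are made up.
toy <- data.frame(
  Date = c(20210101, 20210201, 20210101, 20210201, 20210101, 20210201),
  State = c("CA", "CA", "TX", "TX", "WA", "WA"),
  Cases_Total = c(100, 150, 90, 120, 30, 40),
  stringsAsFactors = FALSE
)

# Assumption: Cases_Total is cumulative, so the most recent row per state
# carries that state's total to date.
top_states <- toy %>%
  group_by(State) %>%
  slice_max(Date, n = 1) %>%      # latest report for each state
  ungroup() %>%
  arrange(desc(Cases_Total)) %>%  # largest totals first
  slice_head(n = 2)               # keep the two states with the most cases

top_states$State
```

Reducing to one row per state before sorting avoids the same state appearing many times at the top of the ranking.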
          

In [77]:
# Choose CA and TX as our states of interest
# Let this be our sample, and visualize it first
tidy_cdc <- tidy_cd %>%
            filter(State %in% c("CA", "TX")) %>%
            arrange(month,State) %>%
            ungroup() %>%
            select(-month, -day)

tidy_cdc


| State | Cases_Total | Cases_White | Cases_Black | Cases_Latinx | Cases_Asian | Deaths_Total | Deaths_White | Deaths_Black | Deaths_Latinx | Deaths_Asian |
|---|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|
| `<chr>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` |
| CA | 3243348 | 489390 | 98959 | 1355776 | 164484 | 40697 | 12423 | 2516 | 17965 | 4532 |
| CA | 3169914 | 475186 | 96053 | 1317261 | 158576 | 38224 | 11664 | 2391 | 16822 | 4256 |
| CA | 3109151 | 464301 | 93817 | 1289848 | 154423 | 36790 | 11330 | 2333 | 16348 | 4126 |
| CA | 3019371 | 448577 | 90545 | 1245900 | 148487 | 34433 | 10530 | 2195 | 15240 | 3866 |
| CA | 2942475 | 436083 | 87689 | 1206977 | 143690 | 33392 | 10256 | 2147 | 14886 | 3790 |
| CA | 2781039 | 409082 | 82125 | 1133397 | 133241 | 31102 | 9569 | 2016 | 14078 | 3535 |
| CA | 2670962 | 391396 | 78392 | 1078116 | 126414 | 29701 | 9133 | 1934 | 13484 | 3349 |
| CA | 2482226 | 360022 | 72361 | 990854 | 114864 | 27462 | 8441 | 1833 | 12583 | 3109 |
| CA | 2391261 | 319136 | 191572 | 887580 | 100569 | 26538 | 7679 | 1697 | 11575 | 2818 |
| TX | 2360632 | 22453 | 11750 | 26927 | 920 | 36491 | 14709 | 3413 | 16964 | 690 |


In [None]:
# Focus on two states, WA and 
# tidy_cdc <- cd_clean %>%
#             filter(State == "WA") %>%
#             summarize(p_white = sum(Cases_White) / sum(Cases_Total),
#                      p_black = sum(Cases_Black) / sum(Cases_Total),
#                      p_Asian = sum(Cases_Asian) / sum(Cases_Total),
#                      p_latin = sum(Cases_Latinx) / sum(Cases_Total),
#                      d_white = sum(Deaths_White) / sum(Deaths_Total),
#                      d_black = sum(Deaths_Black) / sum(Deaths_Total),
#                      d_Asian = sum(Deaths_Asian) / sum(Deaths_Total),
#                      d_latin = sum(Deaths_Latinx) / sum(Deaths_Total))

# tidy_cdc
# Pivoting by deaths? 

# group by races? 
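
One way to answer the pivoting question above is to reshape the wide race columns into one row per date, state, and race. The sketch below is only an illustration: `toy`, `long`, `Measure`, `Race`, `Count`, and `death_ratio` are hypothetical names, the numbers are made up, and it assumes every reshaped column name has exactly two parts separated by one underscore (as in `Cases_White`).

```r
library(dplyr)
library(tidyr)

# Hypothetical stand-in for the cleaned data; the counts are made up.
toy <- data.frame(
  Date = c(20210307, 20210307),
  State = c("CA", "TX"),
  Cases_White = c(50, 40),
  Cases_Black = c(10, 12),
  Deaths_White = c(5, 4),
  Deaths_Black = c(1, 2),
  stringsAsFactors = FALSE
)

long <- toy %>%
  # Split names such as "Cases_White" into Measure = "Cases", Race = "White".
  pivot_longer(
    cols = -c(Date, State),
    names_to = c("Measure", "Race"),
    names_sep = "_",
    values_to = "Count"
  ) %>%
  # Spread Measure back out so Cases and Deaths sit side by side per race.
  pivot_wider(names_from = Measure, values_from = Count) %>%
  mutate(death_ratio = Deaths / Cases)

long
```

With the data in this shape, `group_by(State, Race)` can be used directly. Column names with three parts, such as `Cases_Ethnicity_Hispanic`, would not split cleanly with `names_sep = "_"` and would need to be dropped or renamed first.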




### 2.3 Plot the relevant raw data, tailoring your plot in a way that addresses your question.

In [None]:
# Plotting raw data
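
A plot for this cell might look like the sketch below. The response variable used here, one group's share of total cases, is only a guess at the eventual question; `toy`, `prop_latinx`, and `raw_plot` are hypothetical names, and the numbers are made up.

```r
library(ggplot2)

# Hypothetical stand-in for the two-state sample; the counts are made up.
toy <- data.frame(
  Date = as.Date(rep(c("2021-01-03", "2021-02-07", "2021-03-07"), times = 2)),
  State = rep(c("CA", "TX"), each = 3),
  Cases_Total = c(100, 150, 200, 80, 120, 160),
  Cases_Latinx = c(40, 63, 86, 30, 42, 60),
  stringsAsFactors = FALSE
)

# Share of all reported cases that fall in the Latinx group.
toy$prop_latinx <- toy$Cases_Latinx / toy$Cases_Total

# One line per state, so the two categories can be compared over time.
raw_plot <- ggplot(toy, aes(x = Date, y = prop_latinx, colour = State)) +
  geom_line() +
  geom_point() +
  labs(
    x = "Report date",
    y = "Latinx share of total cases",
    colour = "State",
    title = "Latinx share of cases by state (made-up data)"
  )

raw_plot
```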

### 2.4 Compute estimates 

*Compute estimates of the parameter you identified across your groups. Present this in a table. If relevant, include these estimates in your plot.*

In [None]:
# Summary statistics 
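
The estimates table could be built along the lines below. This sketch assumes the mean as the location parameter and the standard deviation as the scale parameter, applied to one group's share of total cases; those choices, the names `toy`, `estimates`, `mean_prop`, and `sd_prop`, and the made-up numbers are all placeholders.

```r
library(dplyr)

# Hypothetical stand-in for the two-state sample; the counts are made up.
toy <- data.frame(
  State = rep(c("CA", "TX"), each = 3),
  Cases_Total = c(100, 150, 200, 80, 120, 160),
  Cases_Latinx = c(40, 60, 90, 20, 36, 40),
  stringsAsFactors = FALSE
)

estimates <- toy %>%
  mutate(prop_latinx = Cases_Latinx / Cases_Total) %>%
  group_by(State) %>%
  summarise(
    n = n(),                       # number of reports per state
    mean_prop = mean(prop_latinx), # location estimate
    sd_prop = sd(prop_latinx)      # scale estimate
  )

estimates
```

Swapping `mean()` for `median()` and `sd()` for `IQR()` would give the median and inter-quartile range instead, if those suit the final question better.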

## 3. Methods: Plan

### 3.1 What do you expect to find?

Text

### 3.2 What impact could such findings have?

Text

### 3.3 What future questions could this lead to?

Text

## 4. References

- The COVID Tracking Project. About the Racial Data Tracker. https://covidtracking.com/race/about