# Title

## 1. Introduction

### Instructions (to be deleted when finished)
*Begin by providing some relevant background information on the topic so that someone unfamiliar with it will be prepared to understand the rest of your proposal.*

*Clearly state the question you will try to answer with your project. Your question should involve one or more random variables of interest, spread across two or more categories that are interesting to compare. For example, you could consider the annual maxima river flow at two different locations along a river, or perhaps gender diversity at different universities. Of the response variable, identify one location parameter (mean, median, quantile, etc.) and one scale parameter (standard deviation, inter-quartile range, etc.) that would be useful in answering your question. Justify your choices.*

*Identify and describe the dataset that will be used to answer the question. Remember, this dataset is allowed to contain more variables than you need – feel free to drop them!*

*Also, be sure to frame your question/objectives in terms of what is already known in the literature. Be sure to include at least two scientific publications that can help frame your study (you will need to include these in the References section). We have no specific citation style requirements, but be consistent.*

### 1.1 Background information on the topic

Text

### 1.2 The Question 

Text

### 1.3 The Dataset

Text

### 1.4 The Literature 

Text

## 2. Preliminary Results

### 2.0 Libraries and Packages

In [3]:
library(tidyverse)  # attaches ggplot2, tibble, tidyr, readr, purrr, dplyr, stringr, forcats
library(readr)      # already attached by tidyverse
library(tidyr)      # already attached by tidyverse
library(dbplyr)     # database backend for dplyr; masks dplyr::ident and dplyr::sql

── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
✔ ggplot2 3.3.5     ✔ purrr   0.3.4
✔ tibble  3.1.4     ✔ dplyr   1.0.7
✔ tidyr   1.1.3     ✔ stringr 1.4.0
✔ readr   2.0.1     ✔ forcats 0.5.1
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()

Attaching package: ‘dbplyr’

The following objects are masked from ‘package:dplyr’:

    ident, sql


### 2.1 Read the data into R

In [4]:
df <- read_csv('CRDT Data - CRDT.csv')

# May change to read from a URL later, just in case.

Rows: 5320 Columns: 54
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (1): State
dbl (53): Date, Cases_Total, Cases_White, Cases_Black, Cases_Latinx, Cases_A...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.



In [7]:
head(df)

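| Date | State | Cases_Total | Cases_White | Cases_Black | Cases_Latinx | Cases_Asian | Cases_AIAN | Cases_NHPI | Cases_Multiracial | ⋯ | Tests_Latinx | Tests_Asian | Tests_AIAN | Tests_NHPI | Tests_Multiracial | Tests_Other | Tests_Unknown | Tests_Ethnicity_Hispanic | Tests_Ethnicity_NonHispanic | Tests_Ethnicity_Unknown |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| `<dbl>` | `<chr>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | ⋯ | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` |
| 20210307 | AK | 59332 | 18300 | 1499 | NA | 2447 | 12238 | 1508 | 4453 | ⋯ | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
| 20210307 | AL | 499819 | 160347 | 82790 | NA | 2273 | NA | NA | NA | ⋯ | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
| 20210307 | AR | 324818 | 207596 | 50842 | NA | 2913 | 1070 | 3358 | 1804 | ⋯ | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
| 20210307 | AS | NA | NA | NA | NA | NA | NA | NA | NA | ⋯ | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
| 20210307 | AZ | 826454 | 308453 | 25775 | 244539 | 11921 | 40707 | NA | NA | ⋯ | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
| 20210307 | CA | 3501394 | 546630 | 111279 | 1509103 | 186562 | 9025 | 15281 | 42824 | ⋯ | 9444459 | 3980518 | 98894 | 222513 | 74171 | 6354689 | 18567612 | 9444459 | 21633943 | 18567612 |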


In [6]:
glimpse(df)

Rows: 5,320
Columns: 54
$ Date                         <dbl> 20210307, 20210307, 20210307, 20210307, 2…
$ State                        <chr> "AK", "AL", "AR", "AS", "AZ", "CA", "CO",…
$ Cases_Total                  <dbl> 59332, 499819, 324818, NA, 826454, 350139…
$ Cases_White                  <dbl> 18300, 160347, 207596, NA, 308453, 546630…
$ Cases_Black                  <dbl> 1499, 82790, 50842, NA, 25775, 111279, 12…
$ Cases_Latinx                 <dbl> NA, NA, NA, NA, 244539, 1509103, 119224, …
$ Cases_Asian                  <dbl> 2447, 2273, 2913, NA, 11921, 186562, 6406…
$ Cases_AIAN                   <dbl> 12238, NA, 1070, NA, 40707, 9025, 2527, 3…
$ Cases_NHPI                   <dbl> 1508, NA, 3358, NA, NA, 15281, 1264, NA, …
$ Cases_Multiracial            <dbl> 4453, NA, 1804, NA, NA, 42824
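
The output above shows `Date` stored as a double such as 20210307, and the `read_csv()` message suggests specifying column types. A minimal sketch of doing both, assuming the dates are always in `YYYYMMDD` form; the inline CSV text and the `toy` object are illustrative stand-ins (the second data row is made up), not the real file:

```r
library(readr)

# Toy CSV text standing in for the real file. I() tells readr (>= 2.0)
# to treat the string as literal data rather than a file path.
toy_csv <- I("Date,State,Cases_Total,Cases_White\n20210307,AK,59332,18300\n20210303,AK,58000,\n")

toy <- read_csv(
  toy_csv,
  col_types = cols(
    Date = col_date(format = "%Y%m%d"),  # parse 20210307 as 2021-03-07
    State = col_character(),
    .default = col_double()              # every remaining column is numeric
  )
)

toy
```

Passing `col_types` also silences the column specification message, and a proper `Date` column makes later time-based plots easier.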

### 2.2 Clean and wrangle data into a tidy format


In [22]:
# Keep the case and death counts by race (Total through Other).
# Note: this range selection drops Date and State.
Case_Death <- df %>% 
#     select(Cases_Total:Cases_Other, Deaths_Total:Deaths_Other, Hosp_Total:Hosp_Other)
    select(Cases_Total:Cases_Other, Deaths_Total:Deaths_Other)

glimpse(Case_Death)

Rows: 5,320
Columns: 18
$ Cases_Total        <dbl> 59332, 499819, 324818, NA, 826454, 3501394, 435762,…
$ Cases_White        <dbl> 18300, 160347, 207596, NA, 308453, 546630, 181669, …
$ Cases_Black        <dbl> 1499, 82790, 50842, NA, 25775, 111279, 12637, 19651…
$ Cases_Latinx       <dbl> NA, NA, NA, NA, 244539, 1509103, 119224, 41523, NA,…
$ Cases_Asian        <dbl> 2447, 2273, 2913, NA, 11921, 186562, 6406, 3019, 91…
$ Cases_AIAN         <dbl> 12238, NA, 1070, NA, 40707, 9025, 2527, 393, 86, NA…
$ Cases_NHPI         <dbl> 1508, NA, 3358, NA, NA, 15281, 1264, NA, 82, NA, NA…
$ Cases_Multiracial  <dbl> 4453, NA, 1804, NA, NA, 42824, 6580, 17642, NA, NA,…
$ Cases_Other        <dbl> 7130, 38000, 16491, NA, 46964, 304477, 3312, 15284,…
$ Deaths_Total       <dbl> 305, 10148, 5319, NA, 16328, 54124, 598

#### Reasons to drop other columns 


The following columns are dropped:

- `Cases_Unknown`
- `Cases_Ethnicity_Hispanic`
- `Cases_Ethnicity_NonHispanic`
- `Cases_Ethnicity_Unknown`

Similarly, for the `Deaths_` and `Hosp_` counterparts:

- `Deaths_Unknown`
- `Deaths_Ethnicity_Hispanic`
- `Deaths_Ethnicity_NonHispanic`
- `Deaths_Ethnicity_Unknown`
- `Hosp_Unknown`
- `Hosp_Ethnicity_Hispanic`
- `Hosp_Ethnicity_NonHispanic`
- `Hosp_Ethnicity_Unknown`

Also:

- all `Tests_` columns: too many NA values
- all `Hosp_` columns: possibly for the same reason (still to be confirmed)
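
One way to back up a "too many NA values" argument is to compute the share of missing values per column. The sketch below uses a made-up `toy` tibble rather than the real data, and the names `toy` and `na_share` are illustrative only:

```r
library(dplyr)
library(tidyr)

# Made-up stand-in for the real data frame
toy <- tibble(
  Cases_Total = c(100, 200, NA, 400),
  Tests_Total = c(NA, NA, NA, 50)
)

na_share <- toy %>%
  # fraction of NA values in every column
  summarise(across(everything(), ~ mean(is.na(.x)))) %>%
  # one row per column, so the shares can be sorted and compared
  pivot_longer(everything(), names_to = "column", values_to = "share_na") %>%
  arrange(desc(share_na))

na_share
```

Applied to the full data frame, a table like this could justify which column families (for example the test or hospitalization counts) are too sparse to keep.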



In [24]:
# Drop every row that has an NA in any of the selected columns
Case_Death_Clean  <- Case_Death %>% 
#     filter(!is.na('') & !is.na('') & !is.na(''))
    drop_na()

glimpse(Case_Death_Clean)

Rows: 424
Columns: 18
$ Cases_Total        <dbl> 3501394, 435762, 490011, 76861, 128121, 344532, 348…
$ Cases_White        <dbl> 546630, 181669, 327714, 39457, 59389, 96572, 541927…
$ Cases_Black        <dbl> 111279, 12637, 39044, 1025, 7780, 10932, 110115, 12…
$ Cases_Latinx       <dbl> 1509103, 119224, 43792, 4157, 31023, 62537, 1494376…
$ Cases_Asian        <dbl> 186562, 6406, 21470, 1036, 2342, 12525, 184765, 630…
$ Cases_AIAN         <dbl> 9025, 2527, 4634, 87, 309, 3231, 8915, 2504, 4609, …
$ Cases_NHPI         <dbl> 15281, 1264, 556, 0, 0, 3691, 15228, 1252, 545, 0, …
$ Cases_Multiracial  <dbl> 42824, 6580, 8424, 394, 1200, 5823, 41858, 6475, 83…
$ Cases_Other        <dbl> 304477, 3312, 6470, 2066, 2332, 3133, 305557, 3281,…
$ Deaths_Total       <dbl> 54124, 5986, 6550, 1184, 2541, 5041, 5277
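
Calling `drop_na()` with no arguments removes a row if any column is missing, which is why only 424 of 5,320 rows remain. A possible alternative, shown on a made-up `toy` tibble, is to name only the columns that must be complete; which columns those are for this project is an assumption left to the analysis:

```r
library(dplyr)
library(tidyr)

# Made-up stand-in for the selected columns
toy <- tibble(
  State = c("AK", "AL", "CA"),
  Cases_Total = c(10, 20, 30),
  Cases_NHPI = c(1, NA, 3),
  Deaths_Total = c(1, 2, NA)
)

# Strict: a row is dropped if ANY column is NA
all_complete <- toy %>% drop_na()

# Targeted: a row is dropped only if one of the named columns is NA
totals_complete <- toy %>% drop_na(Cases_Total, Deaths_Total)

nrow(all_complete)
nrow(totals_complete)
```

The targeted form keeps states that simply do not report one racial category, at the cost of leaving NA values to handle later.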

In [28]:
# Group by state.
# NOTE: group_by('State') groups by the constant string "State", not by the
# State column, so the result has a single group (see `Groups: "State" [1]`
# in the output). The State column was also dropped by the select() above, so
# it would have to be kept there before group_by(State) could work.
byState <- Case_Death_Clean  %>% 
    group_by('State')

# Open questions: pivot by cases? Pivot by deaths? Group by race?

print(byState)
glimpse(byState)

# A tibble: 424 × 19
# Groups:   "State" [1]
   Cases_Total Cases_White Cases_Black Cases_Latinx Cases_Asian Cases_AIAN
         <dbl>       <dbl>       <dbl>        <dbl>       <dbl>      <dbl>
 1     3501394      546630      111279      1509103      186562       9025
 2      435762      181669       12637       119224        6406       2527
 3      490011      327714       39044        43792       21470       4634
 4       76

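A hedged sketch of how grouping by the actual `State` column and pivoting to a long format by race might look. It runs on a made-up `toy` tibble; the object names and the choice of `max()` as the summary (which assumes the counts are cumulative) are assumptions, not the project's final method:

```r
library(dplyr)
library(tidyr)

# Made-up stand-in that keeps the State column
toy <- tibble(
  State = c("AK", "AK", "CA", "CA"),
  Cases_White = c(10, 12, 100, 110),
  Cases_Black = c(1, 2, 20, 25),
  Deaths_White = c(0, 1, 5, 6),
  Deaths_Black = c(0, 0, 2, 3)
)

# Split names such as "Cases_White" at the underscore: the first part
# becomes a value column (Cases, Deaths), the second part goes to Race.
long <- toy %>%
  pivot_longer(
    cols = -State,
    names_to = c(".value", "Race"),
    names_sep = "_"
  )

# Bare column names (no quotes) group by the columns themselves
by_state_race <- long %>%
  group_by(State, Race) %>%
  summarise(
    max_cases = max(Cases),
    max_deaths = max(Deaths),
    .groups = "drop"
  )

by_state_race
```

Splitting on a single underscore only works for names with exactly one underscore, such as `Cases_White`; columns like `Cases_Ethnicity_Hispanic` would need a different pattern.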
### 2.3 Plot the relevant raw data, tailoring your plot in a way that addresses your question.

In [29]:
# Plotting raw data
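
A possible starting point for the raw-data plot, assuming the data have already been reshaped to one row per state and race. The `toy_long` data frame and its values are made up, and the choice of a faceted bar chart is only one option:

```r
library(ggplot2)

# Made-up long-format data: one row per state and race
toy_long <- data.frame(
  State = rep(c("AK", "CA"), each = 2),
  Race = rep(c("White", "Black"), times = 2),
  Cases = c(12, 2, 110, 25)
)

p <- ggplot(toy_long, aes(x = Race, y = Cases, fill = Race)) +
  geom_col() +
  # one panel per state; free y-axes because state totals differ widely
  facet_wrap(~ State, scales = "free_y") +
  labs(
    x = "Race",
    y = "Cases (toy values)",
    title = "Cases by race and state"
  )

p
```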

### 2.4 Compute estimates 

*Compute estimates of the parameter you identified across your groups. Present this in a table. If relevant, include these estimates in your plot.*

In [None]:
# Summary statistics 
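
A sketch of the estimates table, computing location parameters (mean, median) and scale parameters (standard deviation, inter-quartile range) per group. The response variable `death_rate` and the `toy_long` values are hypothetical placeholders, since the proposal has not yet fixed its variable of interest:

```r
library(dplyr)

# Made-up data: a hypothetical death rate observed four times per group
toy_long <- tibble(
  Race = rep(c("White", "Black"), each = 4),
  death_rate = c(0.010, 0.012, 0.015, 0.011, 0.020, 0.018, 0.025, 0.021)
)

estimates <- toy_long %>%
  group_by(Race) %>%
  summarise(
    n = n(),
    mean = mean(death_rate),      # location
    median = median(death_rate),  # location, robust to outliers
    sd = sd(death_rate),          # scale
    iqr = IQR(death_rate),        # scale, robust to outliers
    .groups = "drop"
  )

estimates
```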

## 3. Methods: Plan

### 3.1 What do you expect to find?

Text

### 3.2 What impact could such findings have?

Text

### 3.3 What future questions could this lead to?

Text

## 4. References


- The COVID Tracking Project. About the Racial Data Tracker. https://covidtracking.com/race/about