# Analyzing When and Where Influenza Outbreaks Occur in the U.S.

by Berthin Bitja

In [13]:
options(warn = -1) #ignore warnings

# IMPORTANT: This assumes that all packages in "Rstart.R" are installed,
# and the fonts "Source Sans Pro" and "Open Sans Condensed Bold" are installed
# via extrafont. If ggplot2 charts fail to render, you may need to change/remove the theme call.


source("Rstart.R")
library(ggmap)


sessionInfo()

Google Maps API Terms of Service: http://developers.google.com/maps/terms.
Please cite ggmap if you use it: see citation("ggmap") for details.


R version 3.4.1 (2017-06-30)
Platform: x86_64-apple-darwin13.4.0 (64-bit)
Running under: macOS Sierra 10.12.6

Matrix products: default
BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libLAPACK.dylib

locale:
[1] en_CA.UTF-8/en_CA.UTF-8/en_CA.UTF-8/C/en_CA.UTF-8/en_CA.UTF-8

attached base packages:
[1] grid      stats     graphics  grDevices utils     datasets  methods  
[8] base     

other attached packages:
[1] ggmap_2.7          extrafont_0.17     stringr_1.2.0      digest_0.6.12     
[5] RColorBrewer_1.1-2 scales_0.4.1       ggplot2_2.2.1      dplyr_0.7.0       
[9] readr_1.1.1       

loaded via a namespace (and not attached):
 [1] Rcpp_0.12.11      compiler_3.4.1    plyr_1.8.4        bitops_1.0-6     
 [5] tools_3.4.1       uuid_0.1-2        lattice_0.20-35   jsonlite_1.5     
 [9] evaluate_0.10     tibble

# Processing the Data
We load the data using readr's read_csv(), since it is faster than base R's read.csv(). The file contains columns that are redundant for this analysis (e.g. URL, WEBSITE), so we keep only the columns we need.

In [14]:
path <- "data/FluViewPhase8_Season55_Data.csv"

df <- read_csv(path)

Parsed with column specification:
cols(
  STATENAME = col_character(),
  URL = col_character(),
  WEBSITE = col_character(),
  ACTIVITYESTIMATE = col_character(),
  WEEKEND = col_character(),
  WEEK = col_integer(),
  SEASON = col_character()
)
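
The call above reads every column; the unused ones are dropped in a later step. As a sketch of how the narrowing could instead happen at load time, readr's `cols_only()` restricts parsing to the named columns. The temporary file and the name `df_small` are assumptions for illustration; the two sample rows are copied from the data shown in this notebook so the snippet runs without the real file.

```r
library(readr)

# Small CSV with the same header as the FluView export (two sample rows only).
csv_text <- paste(
  "STATENAME,URL,WEBSITE,ACTIVITYESTIMATE,WEEKEND,WEEK,SEASON",
  "Alabama,http://adph.org/influenza/,Influenza Surveillance,No Activity,Oct-10-2015,40,2015-16",
  "Alabama,http://adph.org/influenza/,Influenza Surveillance,Local Activity,Oct-24-2015,42,2015-16",
  sep = "\n"
)
tmp <- tempfile(fileext = ".csv")
writeLines(csv_text, tmp)

# cols_only() parses just the listed columns and skips the rest.
needed <- cols_only(
  STATENAME = col_character(),
  ACTIVITYESTIMATE = col_character(),
  WEEKEND = col_character(),
  WEEK = col_integer()
)
df_small <- read_csv(tmp, col_types = needed)

names(df_small)
```

Supplying `col_types` also suppresses the "Parsed with column specification" message, since nothing is left to guess.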


In [16]:
#display the data 
df %>% head(10)
sprintf("# of Rows in Dataframe: %s", nrow(df))
sprintf("Dataframe Size: %s", format(object.size(df), units = "MB"))

STATENAME,URL,WEBSITE,ACTIVITYESTIMATE,WEEKEND,WEEK,SEASON
Alabama,http://adph.org/influenza/,Influenza Surveillance,No Activity,Oct-10-2015,40,2015-16
Alabama,http://adph.org/influenza/,Influenza Surveillance,No Activity,Oct-17-2015,41,2015-16
Alabama,http://adph.org/influenza/,Influenza Surveillance,Local Activity,Oct-24-2015,42,2015-16
Alabama,http://adph.org/influenza/,Influenza Surveillance,Sporadic,Oct-31-2015,43,2015-16
Alabama,http://adph.org/influenza/,Influenza Surveillance,Sporadic,Nov-07-2015,44,2015-16
Alabama,http://adph.org/influenza/,Influenza Surveillance,No Activity,Nov-14-2015,45,2015-16
Alabama,http://adph.org/influenza/,Influenza Surveillance,No Activity,Nov-21-2015,46,2015-16
Alabama,http://adph.org/influenza/,Influenza Surveillance,No Activity,Nov-28-2015,47,2015-16
Alabama,http://adph.org/influenza/,Influenza Surveillance,No Activity,Dec-05-2015,48,2015-16
Alabama,http://adph.org/influenza/,Influenza Surveillance,Local Activity,Dec-12-2015,49,2015-16


In [20]:
columns <- c("STATENAME","ACTIVITYESTIMATE","WEEKEND","WEEK")

# Keep only the needed columns: which() returns the indices of the matching
# column names (select() would also accept the names directly)

df <- df %>% select(which(names(df) %in% columns))

df %>% head(10)
sprintf("# of Rows in Dataframe: %s", nrow(df))
sprintf("Dataframe Size: %s", format(object.size(df), units = "MB"))

STATENAME,ACTIVITYESTIMATE,WEEKEND,WEEK
Alabama,No Activity,Oct-10-2015,40
Alabama,No Activity,Oct-17-2015,41
Alabama,Local Activity,Oct-24-2015,42
Alabama,Sporadic,Oct-31-2015,43
Alabama,Sporadic,Nov-07-2015,44
Alabama,No Activity,Nov-14-2015,45
Alabama,No Activity,Nov-21-2015,46
Alabama,No Activity,Nov-28-2015,47
Alabama,No Activity,Dec-05-2015,48
Alabama,Local Activity,Dec-12-2015,49
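
`select()` also accepts column names, so the index lookup through `which()` is optional. A minimal sketch on a one-row toy data frame (`df_demo` is a stand-in built from the first row shown above, not the real data) suggests that both forms keep the same columns:

```r
library(dplyr)

# One-row stand-in for the full data frame.
df_demo <- data.frame(
  STATENAME = "Alabama",
  URL = "http://adph.org/influenza/",
  ACTIVITYESTIMATE = "No Activity",
  WEEKEND = "Oct-10-2015",
  WEEK = 40L,
  stringsAsFactors = FALSE
)
columns <- c("STATENAME", "ACTIVITYESTIMATE", "WEEKEND", "WEEK")

# Bare column names work directly in select().
by_name <- df_demo %>% select(STATENAME, ACTIVITYESTIMATE, WEEKEND, WEEK)

# The index-based form gives the same columns in the same order.
by_index <- df_demo %>% select(which(names(df_demo) %in% columns))

identical(names(by_name), names(by_index))
```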


Next, we convert any all-caps values to proper case (see this [Stack Overflow question](http://stackoverflow.com/questions/15776732/how-to-convert-a-vector-of-strings-to-title-case)).

In [21]:
proper_case <- function(x) {
    return (gsub("\\b([A-Z])([A-Z]+)", "\\U\\1\\L\\2" , x, perl = TRUE))
}

df <- df %>% mutate(STATENAME = proper_case(STATENAME),
                 ACTIVITYESTIMATE = proper_case(ACTIVITYESTIMATE),
                 WEEKEND = proper_case(WEEKEND))

df %>% head(10)

STATENAME,ACTIVITYESTIMATE,WEEKEND,WEEK
Alabama,No Activity,Oct-10-2015,40
Alabama,No Activity,Oct-17-2015,41
Alabama,Local Activity,Oct-24-2015,42
Alabama,Sporadic,Oct-31-2015,43
Alabama,Sporadic,Nov-07-2015,44
Alabama,No Activity,Nov-14-2015,45
Alabama,No Activity,Nov-21-2015,46
Alabama,No Activity,Nov-28-2015,47
Alabama,No Activity,Dec-05-2015,48
Alabama,Local Activity,Dec-12-2015,49
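
The sample rows are already in proper case, so the effect of `proper_case()` is not visible in the output above. A quick check on a few inputs (the all-caps strings are hypothetical examples, not values from the file) illustrates what the regular expression does: it only rewrites words made of two or more consecutive capital letters.

```r
proper_case <- function(x) {
  # \\U keeps the first captured letter upper case; \\L lower-cases the rest of the word.
  gsub("\\b([A-Z])([A-Z]+)", "\\U\\1\\L\\2", x, perl = TRUE)
}

examples <- c("NEW YORK", "LOCAL ACTIVITY", "Oct-10-2015")
proper_case(examples)
# "New York" "Local Activity" "Oct-10-2015"
```

Values such as "Oct-10-2015" pass through unchanged because the capital letter is not followed by another capital.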


# Filtering the Data

Let's filter the data frame by "Local Activity" to aggregate some interesting statistics.

In [22]:
# grepl() returns TRUE for every row whose ACTIVITYESTIMATE contains the pattern
df_arrest <- df %>% filter(grepl("Local Activity", ACTIVITYESTIMATE))

df_arrest %>% head(10)
sprintf("# of Rows in Dataframe: %s", nrow(df_arrest))
sprintf("Dataframe Size: %s", format(object.size(df_arrest), units = "MB"))

STATENAME,ACTIVITYESTIMATE,WEEKEND,WEEK
Alabama,Local Activity,Oct-24-2015,42
Alabama,Local Activity,Dec-12-2015,49
Alabama,Local Activity,Dec-19-2015,50
Alabama,Local Activity,Jan-30-2016,4
Alabama,Local Activity,Feb-06-2016,5
Alabama,Local Activity,Feb-13-2016,6
Alabama,Local Activity,Mar-26-2016,12
Alabama,Local Activity,Apr-09-2016,14
Alabama,Local Activity,Apr-16-2016,15
Alabama,Local Activity,May-14-2016,19
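
Because `grepl()` matches substrings (and treats the pattern as a regular expression unless `fixed = TRUE`), it can return more rows than an exact comparison would. A small sketch using the three activity labels that appear in the rows above shows the difference:

```r
activity <- c("Local Activity", "No Activity", "Sporadic")

# Substring match: every value containing "Activity" is TRUE.
grepl("Activity", activity)
# TRUE TRUE FALSE

# Literal (non-regex) substring match on the full label.
grepl("Local Activity", activity, fixed = TRUE)
# TRUE FALSE FALSE

# Exact equality: TRUE only when the whole value matches.
activity == "Local Activity"
# TRUE FALSE FALSE
```

For this filter the full label is used as the pattern, so `grepl()` and `==` select the same rows as long as no other label contains "Local Activity".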


In [25]:
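# Count the "Local Activity" reports for each week-ending date.
# WEEKEND looks like "Oct-24-2015", so the matching date format is "%b-%d-%Y".
# Note: the grouping below uses the WEEKEND string, so rows sort alphabetically, not by date.
df_arrest_daily <- df_arrest %>%
                    mutate(Date = as.Date(WEEKEND, "%b-%d-%Y")) %>%
                    group_by(WEEKEND) %>%
                    summarize(count = n()) %>%
                    arrange(WEEKEND)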

df_arrest_daily %>% head(10)

WEEKEND,count
Apr-02-2016,5
Apr-09-2016,11
Apr-16-2016,14
Apr-23-2016,14
Apr-30-2016,18
Dec-05-2015,10
Dec-12-2015,12
Dec-19-2015,15
Dec-26-2015,12
Feb-06-2016,16
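
The output above is ordered alphabetically by the WEEKEND string (April before December), because the grouping is done on the text column. Below is a sketch of a chronological version, assuming the dates follow the `Mon-DD-YYYY` pattern shown in the data and that the session uses an English locale, since `%b` is locale-dependent. The `weekly` data frame is made-up toy data, and the names `weekly` and `weekly_counts` are hypothetical.

```r
library(dplyr)

# Toy rows in the same WEEKEND format as the data above (made-up values).
weekly <- data.frame(
  STATENAME = c("Alabama", "Alaska", "Alabama", "Alabama"),
  WEEKEND = c("Dec-12-2015", "Dec-12-2015", "Oct-24-2015", "Apr-09-2016"),
  stringsAsFactors = FALSE
)

weekly_counts <- weekly %>%
  # %b = abbreviated month name, %d = day of month, %Y = four-digit year
  mutate(Date = as.Date(WEEKEND, format = "%b-%d-%Y")) %>%
  group_by(Date) %>%      # group on the parsed date, not the string
  summarize(count = n()) %>%
  arrange(Date)           # chronological order

weekly_counts
```

Grouping on the parsed `Date` column places October 2015 before December 2015 and April 2016, which is the order a time-series plot would need.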
