# Introduction

This notebook is intended as a landing page for anyone starting to work with the UNCOVER dataset. It begins with an analysis of the data in each folder, examining which columns exist and how they overlap across files.

**Please upvote to let me know this is useful, or comment with other ways to make this a more valuable resource.**

## Data Meta-Analysis: Exploring the data, columns, and content
### General Stats

In [None]:
#Libraries Needed
import pandas as pd
import os
from glob import glob
from tqdm.notebook import tqdm

#Settings to display pandas
pd.set_option('display.max_columns', None)

#Some basic set up
base_path = "/kaggle/input/uncover"

In [None]:
#Traverse paths and pull out all .csv files
#Explanation at
#https://perials.com/getting-csv-files-directory-subdirectories-using-python/
all_csv_files = []
for path, subdir, files in os.walk(base_path):
    for file in glob(os.path.join(path, "*.csv")):
        all_csv_files.append(file)
"There are a total of {} csv files in this dataset.".format(len(all_csv_files))

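The same recursive search can also be written with `pathlib`. This is only a sketch of an alternative, not the notebook's method. The helper name `find_csv_files` and the temporary demo folder are made up for illustration.

```python
import tempfile
from pathlib import Path

def find_csv_files(root):
    """Return a sorted list of every .csv path found under root, at any depth."""
    return sorted(str(p) for p in Path(root).rglob("*.csv"))

# Small self-contained demo on a throwaway directory tree (hypothetical files)
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "source_a").mkdir()
    (Path(tmp) / "source_a" / "cases.csv").write_text("x\n1\n")
    (Path(tmp) / "notes.txt").write_text("not a csv")
    demo_files = find_csv_files(tmp)
    demo_names = [Path(f).name for f in demo_files]
```

On Kaggle, `find_csv_files(base_path)` should return the same files as the `os.walk` loop, assuming the folder layout is unchanged.
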
In [None]:
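#Note: uncover_df_dictionary is built in the cell below; run that cell first or this raises a NameError
uncover_df_dictionary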

In [None]:
#Try to make a pandas dataframe from each file
#The dataframes will be accessible by their file name using the uncover_df_dictionary
read_files = []
skipped_files = []
uncover_df_dictionary = {}
for file_path in tqdm(all_csv_files):
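    #Use the file name without its extension as the dictionary key
    #Note: files in different folders that share a name will overwrite each other
    df_name = os.path.splitext(os.path.basename(file_path))[0]
    try:
        uncover_df_dictionary[df_name] = pd.read_csv(file_path, low_memory=False)
        read_files.append(file_path)
    except Exception:
        #Record files that pandas could not parse rather than stopping
        skipped_files.append(file_path)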
"Read a total of {} files into Pandas dataframes and skipped {}.".format(len(read_files), len(skipped_files))

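Once the files are loaded, a compact overview of each dataframe can help decide where to look first. The sketch below assumes a dictionary of dataframes keyed by name, like the one built above. The function name `summarize_dataframes` and the toy data are hypothetical.

```python
import pandas as pd

def summarize_dataframes(df_dictionary):
    """Build one row per dataframe with its row, column, and missing-cell counts."""
    records = []
    for name, df in df_dictionary.items():
        records.append({
            "name": name,
            "rows": df.shape[0],
            "columns": df.shape[1],
            "missing_cells": int(df.isna().sum().sum()),
        })
    # Largest dataframes first
    return pd.DataFrame(records).sort_values("rows", ascending=False).reset_index(drop=True)

# Toy stand-in for uncover_df_dictionary (made-up values)
toy_dictionary = {
    "cases": pd.DataFrame({"state": ["NY", "CA", "WA"], "cases": [10, None, 3]}),
    "tests": pd.DataFrame({"state": ["NY", "CA"], "tests": [100, 200]}),
}
summary_df = summarize_dataframes(toy_dictionary)
```

In the notebook, `summarize_dataframes(uncover_df_dictionary)` would give the same table for the real files.
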
### Common Columns
Columns, also called features, can be used to build models or describe the data, and columns shared between files could be used for merging dataframes. This section explores the columns that exist and how they relate to other dataframes. It also examines how categorical data may be turned into numerical features.
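
As a rough illustration of turning categorical data into numerical features, here are two common pandas options on a toy dataframe. The column names and values are invented for the example and are not taken from any UNCOVER file.

```python
import pandas as pd

# Toy data (hypothetical column names and values)
toy_df = pd.DataFrame({"province": ["Ontario", "Quebec", "Ontario"], "cases": [5, 7, 2]})

# Option 1: one-hot encoding, one indicator column per category
one_hot = pd.get_dummies(toy_df, columns=["province"], dtype=int)

# Option 2: integer codes, one number per category (implies an arbitrary order)
codes = toy_df["province"].astype("category").cat.codes
```

One-hot encoding is usually the safer choice when the categories have no natural order. Integer codes are more compact but may suggest a ranking to some models.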

In [None]:
#Iterate through the dict and group dataframes by the columns they contain
column_dict = {}
for name, df in uncover_df_dictionary.items():
    for col in df.columns:
        #setdefault creates the list the first time a column is seen
        column_dict.setdefault(col, []).append(name)

#Drop any columns not found in other DataFrames
len_before_drop = len(column_dict)
to_pop = []
for col, df_list in column_dict.items():
    if len(df_list) < 2:
        to_pop.append(col)
        
#Run in separate loop as a dict cannot change size while being iterated
for col in to_pop:
    column_dict.pop(col)
    
print("A total of {} columns are unique to one dataframe.".format((len_before_drop-len(column_dict))))
print("A total of {} columns are shared by more than one dataframe.".format(len(column_dict)))

#Make DF with index of cols and a column of dfs with that feature
col_df = pd.DataFrame(pd.Series(column_dict)).reset_index()
col_df.columns = ["Feature", "DataFrames"]
col_df.head(1)

Use a pivot table to turn this into a binary matrix with files as columns and features as rows. A 1 means the feature is present in that file and a 0 means it is not.

In [None]:
#Explode-Make a new row for each of the values found in the DataFrames lists
col_df_explode = col_df.explode("DataFrames")
#Add present columns to keep track of which are where
col_df_explode["present"] = 1
#Pivot to binary matrix
col_binary_matrix = col_df_explode.pivot_table(index='Feature',
                    columns='DataFrames',
                    values='present',
                    aggfunc='sum',
                    fill_value=0)
col_binary_matrix.head()
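
To see what the explode-and-pivot step produces, the sketch below runs it on a tiny made-up column dictionary. It uses `pd.crosstab` as a compact alternative to the `present` column plus `pivot_table`. All names starting with `toy_` are illustrative only.

```python
import pandas as pd

# Hypothetical mapping of feature name -> dataframes that contain it
toy_column_dict = {"date": ["cases", "tests"], "state": ["cases", "tests", "beds"]}

toy_col_df = pd.DataFrame(pd.Series(toy_column_dict)).reset_index()
toy_col_df.columns = ["Feature", "DataFrames"]

# One row per (feature, dataframe) pair; reset the index so labels are unique
toy_exploded = toy_col_df.explode("DataFrames").reset_index(drop=True)

# crosstab counts how often each pair occurs, giving the 0/1 matrix directly
toy_matrix = pd.crosstab(toy_exploded["Feature"], toy_exploded["DataFrames"])
```

Here `date` is absent from `beds`, so that cell is 0, while `state` appears in all three toy dataframes.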

Columns most commonly found

In [None]:
col_binary_matrix.sum(axis=1).sort_values(ascending=False).head(10)

DataFrames that share the most features

In [None]:
#Get all possible file combinations to compare
#permutations yields each pair in both orders, so every pairing appears twice
from itertools import permutations
all_pairs = permutations(col_binary_matrix.columns,2)
pairs_df_list = []
for df1, df2 in all_pairs:
    boolean_check = (col_binary_matrix[df1]==1) & (col_binary_matrix[df2]==1)
    shared_feats = list(col_binary_matrix.index[boolean_check])
    num_shared_feats = len(shared_feats)
    features_dict = {"df1":df1, "df2": df2,"sim_col_count": num_shared_feats,"sim_col_list": shared_feats}
    if num_shared_feats > 1:
        pairs_df_list.append(features_dict)
shared_cols_dfs = pd.DataFrame(pairs_df_list).sort_values("sim_col_count", ascending=False)
shared_cols_dfs.head(10)
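
Once a pair of dataframes with shared columns has been identified, one possible next step is to merge them on those columns. The helper below is a minimal sketch with made-up data. A shared column name does not guarantee the values mean the same thing or use the same format in both files, so check the result before relying on it.

```python
import pandas as pd

def merge_on_shared_columns(left, right, how="inner"):
    """Merge two dataframes on every column name they have in common."""
    shared = [col for col in left.columns if col in right.columns]
    if not shared:
        raise ValueError("The dataframes share no column names.")
    return left.merge(right, on=shared, how=how), shared

# Toy dataframes (hypothetical columns and values)
toy_cases = pd.DataFrame({"state": ["NY", "CA"], "date": ["2020-04-01", "2020-04-01"], "cases": [10, 4]})
toy_tests = pd.DataFrame({"state": ["NY", "WA"], "date": ["2020-04-01", "2020-04-01"], "tests": [100, 50]})

merged_df, shared_cols = merge_on_shared_columns(toy_cases, toy_tests)
```

With an inner join only the `NY` row survives, since it is the only `state` and `date` combination present in both toy dataframes.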

# Summary of Datasets Available
This section describes each of the data sources in the UNCOVER dataset as well as their features.

## Coders Against Covid
Crowdsourced map of testing locations across the US.

## County Health Rankings

The annual Rankings provide a revealing snapshot of how health is influenced by where we live, learn, work, and play. They provide a starting point for change in communities.

## COVID-19 Canada Open Data Working Group
A working group collecting publicly available information on confirmed and presumptive positive cases during the ongoing COVID-19 outbreak in Canada. Data are entered in a spreadsheet with each line representing a unique case, including age, sex, health region location, and history of travel where available. Sources are included as a reference for each entry. All data are exclusively collected from publicly available sources including government reports and news media.

## COVID Tracker Canada
An independent project that compiles daily reports of COVID-19 cases as reported by Canadian news outlets.

Copyright © COVID19Tracker.ca 2020 // COVID19Tracker.ca reports both presumptive and confirmed cases in near real-time // contact@covid19tracker.ca

## COVID Tracking Project
The COVID Tracking Project collects and publishes the most complete testing data available for US states and territories.

## ECDC
The European Centre for Disease Prevention and Control collects the number of COVID-19 cases and deaths, based on reports from health authorities worldwide. This comprehensive and systematic process is carried out on a daily basis.

## Geotab
Geotab is a telematics company specializing in GPS fleet management and vehicle tracking. This data set provides the hourly average duration of border crossings between Canada and the US.

## GitHub
These data sets provide case counts for several countries, and are updated and maintained by citizens from government sources.

## Harvard Global Health Institute
The Harvard Global Health Institute created a model of hospital capacity and readiness across the US. This model builds on bed capacity data for each of 306 U.S. hospital markets (Hospital Referral Regions, HRR) with localized estimates of available beds, and beds needed to accommodate COVID-19 patients over the coming months. It highlights where hospitals might find additional bed and ICU bed capacity as well as other shortages that need to be addressed, from workforce to ventilators.

## HDE
The Humanitarian Data Exchange (HDX) is an open platform for sharing data across crises and organisations. Provided are data sets with global information on testing, government responses, and school closures.

## HIFLD
Homeland Infrastructure Foundation-Level Data (HIFLD) provides national foundation-level geospatial data within the open public domain that can be useful to support community preparedness, resiliency, research, and more.

## IHME
These data describe the IHME's forecasts of the COVID-19 impact on hospital bed-days, ICU-days, ventilator-days, and deaths by US state over the next 4 months.

## Johns Hopkins
These are the data powering the 2019 Novel Coronavirus Visual Dashboard operated by the Johns Hopkins University Center for Systems Science and Engineering (JHU CSSE). The data sources include the World Health Organization, the U.S. Centers for Disease Control and Prevention, the European Centre for Disease Prevention and Control, the National Health Commission of the People’s Republic of China, local media reports, local health departments, and DXY, one of the world’s largest online communities for physicians, health care professionals, pharmacies and facilities.

## New York Times
The New York Times is releasing a series of data files with cumulative counts of coronavirus cases in the United States, at the state and county level, over time. The Times is compiling this time series data from state and local governments and health departments in an attempt to provide a complete record of the ongoing outbreak.

## Nextstrain
This data set describes the phylogeny of the hCoV-19 / SARS-CoV-2 genomes being sequenced globally. Site numbering and genome structure use Wuhan-Hu-1/2019 as reference. The phylogeny is rooted relative to early samples from Wuhan. Temporal resolution assumes a nucleotide substitution rate of 8 × 10^-4 subs per site per year.
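
To put the assumed substitution rate in perspective, it can be converted into expected substitutions per genome. This is a back-of-the-envelope sketch. The genome length of 29,903 nucleotides for the Wuhan-Hu-1 reference is an assumption added here and is not stated in the dataset description.

```python
# Rate quoted in the dataset description (substitutions per site per year)
substitution_rate = 8e-4
# Assumption: length of the Wuhan-Hu-1 reference genome in nucleotides
genome_length = 29903

# Expected substitutions across the whole genome
subs_per_year = substitution_rate * genome_length
subs_per_month = subs_per_year / 12

print(f"{subs_per_year:.1f} substitutions per genome per year")
print(f"{subs_per_month:.1f} substitutions per genome per month")
```

Under these assumptions the rate works out to roughly two substitutions per genome per month.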

## OpenTable
OpenTable is publishing changes in restaurant reservations across several regions in 2020 as a year-over-year percentage change compared to 2019.
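
For readers unfamiliar with the metric, a year-over-year percentage change can be computed as in the sketch below. The function name and the reservation counts are hypothetical. The OpenTable files already contain the percentages, so this only shows how to read them.

```python
def year_over_year_pct_change(current, baseline):
    """Percentage change of current relative to the same period's baseline."""
    if baseline == 0:
        raise ValueError("Baseline must be non-zero.")
    return (current - baseline) / baseline * 100

# Hypothetical reservation counts for the same day in 2020 and 2019
pct_change = year_over_year_pct_change(current=50, baseline=100)
```

A value of -50 would mean reservations were half of what they were on the comparable day in 2019, and -100 would mean no reservations at all.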

## Our World in Data
Our World in Data aims to aggregate existing research, bring together the relevant data, and allow its readers to make sense of the published data and early research on the coronavirus outbreak. It provides data on coronavirus cases and testing.