<a href="https://colab.research.google.com/github/w-oke/covid_reproduction/blob/main/covid_owid_2_analysis.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

This notebook loads the output of covid_owid_1_preparation.ipynb, which is available at:
https://github.com/w-oke/covid_reproduction.

In [7]:
import pandas as pd
import urllib.request
import pickle

In [17]:
df_link = 'https://github.com/w-oke/covid_reproduction/raw/main/covid_owid_df.parquet'
df1 = pd.read_parquet(df_link)
df1.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 14301 entries, 0 to 14300
Data columns (total 29 columns):
 #   Column                               Non-Null Count  Dtype         
---  ------                               --------------  -----         
 0   iso_code                             14301 non-null  object        
 1   location                             14301 non-null  object        
 2   date                                 14301 non-null  datetime64[ns]
 3   reproduction_rate                    14301 non-null  float64       
 4   new_tests_smoothed_per_thousand      8803 non-null   float64       
 5   people_vaccinated_per_hundred        4093 non-null   float64       
 6   people_fully_vaccinated_per_hundred  3690 non-null   float64       
 7   total_boosters_per_hundred           738 non-null    float64       
 8   stringency_index                     13289 non-null  float64       
 9   population_density                   14064 non-null  float64       
 10  median_age

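Note that:
* about 60% of the rows have 'new_tests_smoothed_per_thousand' (8803 of 14301); 'handwashing_facilities' and 'extreme_poverty' are also present in only about half of the rows
* roughly a quarter of the rows have vaccination data (4093 rows for 'people_vaccinated_per_hundred', 3690 for 'people_fully_vaccinated_per_hundred'), and only about 5% have booster data (738 rows)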
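
The shares in the list above can be read off the non-null counts. A minimal sketch, using a small stand-in frame (`demo` and its values are made up for illustration); on the real data the same expression would be applied to `df1`:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for df1: four rows, with gaps in two columns.
demo = pd.DataFrame({
    "reproduction_rate": [0.96, 0.91, 0.87, 0.88],
    "new_tests_smoothed_per_thousand": [0.5, np.nan, 0.7, np.nan],
    "people_vaccinated_per_hundred": [np.nan, np.nan, np.nan, 1.2],
})

# Fraction of non-null values per column, sorted from sparsest to densest.
coverage = demo.notna().mean().sort_values()
print(coverage.round(2))
```

For the counts reported by `df1.info()`, 8803 / 14301 is about 0.62 and 4093 / 14301 is about 0.29.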

Many of the vaccination and booster values are probably null because the rows cover timeframes (2019-2020) or countries in which no vaccinations were available or provided. Most of these values could therefore be filled with 0 (zero).
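
A minimal sketch of that zero fill, shown on a stand-in frame rather than on `df1` (the `demo` rows are made up). It rests on the assumption that a missing value means no doses had been reported yet, which may not hold for gaps that fall between two reported values:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for df1 with the three vaccination columns.
demo = pd.DataFrame({
    "date": pd.to_datetime(["2020-12-21", "2021-01-25", "2021-06-07"]),
    "people_vaccinated_per_hundred": [np.nan, np.nan, 12.5],
    "people_fully_vaccinated_per_hundred": [np.nan, np.nan, 4.0],
    "total_boosters_per_hundred": [np.nan, np.nan, np.nan],
})

vax_cols = [
    "people_vaccinated_per_hundred",
    "people_fully_vaccinated_per_hundred",
    "total_boosters_per_hundred",
]

# Assumption: null means "nothing administered yet", so replace it with 0.
# Work on a copy so the original frame keeps its nulls for comparison.
filled = demo.copy()
filled[vax_cols] = filled[vax_cols].fillna(0)
print(filled)
```

Only the vaccination columns are touched; nulls in other columns, such as 'new_tests_smoothed_per_thousand', would need their own treatment.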


In [18]:
df1.head()

Unnamed: 0,iso_code,location,date,reproduction_rate,new_tests_smoothed_per_thousand,people_vaccinated_per_hundred,people_fully_vaccinated_per_hundred,total_boosters_per_hundred,stringency_index,population_density,median_age,gdp_per_capita,extreme_poverty,handwashing_facilities,hospital_beds_per_thousand,life_expectancy,human_development_index,Alpha,Beta,Delta,Epsilon,Eta,Gamma,Iota,Kappa,Lambda,Mu,Omicron,non_who
0,AGO,Angola,2020-12-21,0.96,,,,,65.74,23.89,16.8,5819.495,,26.664,,61.15,0.581,0.0,74.19,0.0,0.0,1.08,0.0,0.0,0.0,0.0,0.0,0.0,24.73
1,AGO,Angola,2021-01-25,0.91,,,,,62.96,23.89,16.8,5819.495,,26.664,,61.15,0.581,5.77,3.85,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,90.38
2,AGO,Angola,2021-02-01,0.87,,,,,62.96,23.89,16.8,5819.495,,26.664,,61.15,0.581,7.645,16.21,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,76.145
3,AGO,Angola,2021-02-08,0.88,,,,,61.11,23.89,16.8,5819.495,,26.664,,61.15,0.581,9.52,28.57,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,61.91
4,AGO,Angola,2021-02-15,0.9,,,,,61.11,23.89,16.8,5819.495,,26.664,,61.15,0.581,6.04,34.8,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,59.16
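
In the rows shown above, the twelve variant columns look like percentage shares that add up to 100 per row. A quick check of that reading, using the first three rows copied from the output (the name `sample` is only for illustration):

```python
import pandas as pd

variant_cols = ["Alpha", "Beta", "Delta", "Epsilon", "Eta", "Gamma",
                "Iota", "Kappa", "Lambda", "Mu", "Omicron", "non_who"]

# Variant columns of rows 0-2, copied from the df1.head() output.
rows = [
    [0.0, 74.19, 0.0, 0.0, 1.08, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 24.73],
    [5.77, 3.85, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 90.38],
    [7.645, 16.21, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 76.145],
]
sample = pd.DataFrame(rows, columns=variant_cols)

# Sum across the variant columns for each row; shares should total ~100.
row_totals = sample.sum(axis=1)
print(row_totals.round(2).tolist())
```

On the full frame, `df1[variant_cols].sum(axis=1)` would show whether this holds for every row; that has not been verified here.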


In [14]:
# column descriptions from OWID:
owid_col_desc_link = 'https://github.com/owid/covid-19-data/raw/master/public/data/owid-covid-codebook.csv'
owid_col_desc = pd.read_csv(owid_col_desc_link)
owid_col_desc.head()
# Note: the variant columns are not described in this codebook. Each one holds the variant's
# percentage in that country on that date, or the global percentage when no country data was available.

Unnamed: 0,column,source,category,description
0,iso_code,International Organization for Standardization,Others,ISO 3166-1 alpha-3 – three-letter country codes
1,continent,Our World in Data,Others,Continent of the geographical location
2,location,Our World in Data,Others,Geographical location
3,date,Our World in Data,Others,Date of observation
4,total_cases,COVID-19 Data Repository by the Center for Sys...,Confirmed cases,Total confirmed cases of COVID-19


In [22]:
# make sure that the var_dictionary.pkl matches the dataset imported above
var_link = 'https://github.com/w-oke/covid_reproduction/raw/main/covid_owid_var_dictionary.pkl'

a_file = "covid_owid_var_dictionary.pkl"
# download the pickle next to the notebook (only unpickle files from a source you trust)
urllib.request.urlretrieve(var_link, a_file)

with open(a_file, 'rb') as f:
    var = pickle.load(f)

print('The variables in df have been categorized into 4 groups:', list(var.keys()))
print()
var

The variables in df have been categorized into 4 groups: ['y', 'meta', 'number', 'variants']



{'meta': ['date', 'iso_code', 'location'],
 'number': ['new_tests_smoothed_per_thousand',
  'people_vaccinated_per_hundred',
  'people_fully_vaccinated_per_hundred',
  'total_boosters_per_hundred',
  'stringency_index',
  'population_density',
  'median_age',
  'human_development_index',
  'gdp_per_capita',
  'extreme_poverty',
  'handwashing_facilities',
  'hospital_beds_per_thousand',
  'life_expectancy'],
 'variants': ['Alpha',
  'Beta',
  'Delta',
  'Epsilon',
  'Eta',
  'Gamma',
  'Iota',
  'Kappa',
  'Lambda',
  'Mu',
  'Omicron',
  'non_who'],
 'y': ['reproduction_rate']}
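
One way to confirm that the dictionary matches the dataset is to compare its names with the DataFrame columns. The helper below is a hypothetical sketch (`check_var_groups` is not part of the repository), demonstrated on a small made-up example:

```python
from collections import Counter


def check_var_groups(columns, var_groups):
    """Compare DataFrame column names with the names listed in a group dictionary.

    Returns three sorted lists: columns missing from every group, names listed
    in a group but absent from the columns, and names listed more than once.
    """
    listed = [name for group in var_groups.values() for name in group]
    missing = sorted(set(columns) - set(listed))
    extra = sorted(set(listed) - set(columns))
    duplicated = sorted(name for name, n in Counter(listed).items() if n > 1)
    return missing, extra, duplicated


# Made-up example: 'Alpha' is a column but is not listed in any group.
var_demo = {
    "y": ["reproduction_rate"],
    "meta": ["date", "iso_code", "location"],
    "number": ["median_age"],
}
columns_demo = ["iso_code", "location", "date",
                "reproduction_rate", "median_age", "Alpha"]

print(check_var_groups(columns_demo, var_demo))
```

In this notebook the call would be `check_var_groups(df1.columns, var)`; three empty lists would mean the dictionary and the dataset agree.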

In [13]:
# create a single list of all the variable names (target, meta, numeric and variant columns)
var_all = [item for sublist in var.values() for item in sublist]
print('The first 4 items in "var_all" are: ', var_all[0:4])
print('There are {} variables in var_all'.format(len(var_all)))

# create a single comma-separated string of all the variable names
var_all2 = ', '.join(var_all)
print('var_all2: ', var_all2)

The first 4 items in "var_all" are:  ['reproduction_rate', 'date', 'iso_code', 'location']
There are 29 variables in var_all
var_all2:  reproduction_rate, date, iso_code, location, new_tests_smoothed_per_thousand, people_vaccinated_per_hundred, people_fully_vaccinated_per_hundred, total_boosters_per_hundred, stringency_index, population_density, median_age, human_development_index, gdp_per_capita, extreme_poverty, handwashing_facilities, hospital_beds_per_thousand, life_expectancy, Alpha, Beta, Delta, Epsilon, Eta, Gamma, Iota, Kappa, Lambda, Mu, Omicron, non_who
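
The groups can also be used to pick out columns by role. The sketch below assumes, as an illustration only, that the 'number' and 'variants' groups serve as predictors and 'y' as the target; `demo` and `var_demo` are cut-down stand-ins for `df1` and `var`:

```python
import pandas as pd

# Cut-down stand-in for the var dictionary (same keys, fewer names).
var_demo = {
    "y": ["reproduction_rate"],
    "meta": ["date", "iso_code", "location"],
    "number": ["stringency_index", "median_age"],
    "variants": ["Alpha", "Beta"],
}

# Cut-down stand-in for df1, using values from the first two rows shown earlier.
demo = pd.DataFrame({
    "iso_code": ["AGO", "AGO"],
    "location": ["Angola", "Angola"],
    "date": pd.to_datetime(["2020-12-21", "2021-01-25"]),
    "reproduction_rate": [0.96, 0.91],
    "stringency_index": [65.74, 62.96],
    "median_age": [16.8, 16.8],
    "Alpha": [0.0, 5.77],
    "Beta": [74.19, 3.85],
})

# Assumed split: numeric and variant columns as predictors, 'y' as the target.
feature_cols = var_demo["number"] + var_demo["variants"]
X = demo[feature_cols]
y = demo[var_demo["y"][0]]
print(X.shape, y.name)
```

Keeping the 'meta' columns out of `X` leaves the identifiers available for grouping or plotting without feeding them to a model.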
