<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">

# Project 2: Analyzing Covid-19 Data

_Author: B Rhodes (DC)_

---

For Project 2, you'll use Python to perform fundamental exploratory data analysis (EDA) tasks. In the other notebook in this project you can use Pandas, but in this notebook you should use only plain Python. The purpose here is to flex your Python muscles while thinking about data.

Below you'll import a data file with COVID-19 case counts by country and region. The original data, along with a data dictionary, can be found at [Johns Hopkins University: CSSEGISandData/COVID-19](https://github.com/CSSEGISandData/COVID-19/tree/master/csse_covid_19_data).


For these exercises, you will conduct basic exploratory data analysis using Python (Pandas is not allowed in this notebook). The goal is to understand some fundamentals of the COVID-19 data. These exercises will let you practice business analysis skills while becoming more comfortable with Python.

---

## Part 1: Load the data & initial exploration

### Problem 1: Load the file and store it in an object called `covid_csv`.

Hint: This is a csv (comma-separated value) file, so we'll use `csv.reader()` 

See: [Python Docs - csv](https://docs.python.org/3/library/csv.html).
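
As a minimal sketch of how `csv.reader()` behaves, the snippet below parses a small in-memory sample instead of the real file. The sample row is copied from the data displayed later in this notebook; `sample_text` and `rows` are illustrative names only.

```python
import csv
import io

# Two-line sample standing in for ./data/covid.csv (illustrative only)
sample_text = (
    "Combined_Key,Country_Region,Confirmed,Deaths\n"
    "Afghanistan,Afghanistan,37345,1354\n"
)

# csv.reader yields each line as a list of strings
rows = list(csv.reader(io.StringIO(sample_text)))

print(rows[0])  # header row
print(rows[1])  # first data row; every value is still a string
```

Note that the numbers come back as text, which is why a later problem asks for a conversion to floats.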



In [3]:
import csv

# import namedtuple as an option to store the data rows
from collections import namedtuple, defaultdict

DATA_FILE = './data/covid.csv'


#### Load the data

In [4]:
# newline='' is what the csv module docs recommend when opening a file for csv.reader
with open(DATA_FILE, 'r', newline='') as f:
    covid_csv = list(csv.reader(f))

### Problem 2: Separate ```covid_csv``` into a `header` and `data`. 

Complete the following tasks:

1. Split the covid_csv object into a ```header``` and ```data```.
    1. display the ```header```
    2. display the first 3 rows of ```data```.
2. What are the dimensions of your data? Print the result (neatly formatted, with each dimension identified).
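
The tasks above can be sketched on a tiny stand-in for `covid_csv`. The names `sample_csv`, `sample_header`, and `sample_data` are hypothetical and exist only for this illustration; the values are copied from rows displayed in this notebook.

```python
# Small stand-in for covid_csv (illustrative only)
sample_csv = [
    ["Combined_Key", "Country_Region", "Confirmed", "Deaths"],
    ["Afghanistan", "Afghanistan", "37345", "1354"],
    ["Albania", "Albania", "6817", "208"],
]

# The first row holds the column names; everything after it is data
sample_header, sample_data = sample_csv[0], sample_csv[1:]

n_rows = len(sample_data)
n_cols = len(sample_header)
print(f"Rows: {n_rows}, Columns: {n_cols}")
```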


**Define the header and display the contents.**

In [5]:
header = covid_csv[0]
header

['Combined_Key',
 'Country_Region',
 'Confirmed',
 'Deaths',
 'Recovered',
 'Active',
 'Incidence_Rate',
 'Case_Fatality_Ratio']

**Define the data and display the first 3 rows.**

In [6]:
# Assign the data only
# display the first 3 rows
data = covid_csv[1:]
data[0:3] 

[['Afghanistan',
  'Afghanistan',
  '37345',
  '1354',
  '26694',
  '9297.0',
  '95.93267794278722',
  '3.6256526978176464'],
 ['Albania',
  'Albania',
  '6817',
  '208',
  '3552',
  '3057.0',
  '236.882340676906',
  '3.0511955405603635'],
 ['Algeria',
  'Algeria',
  '36699',
  '1333',
  '25627',
  '9739.0',
  '83.69014164611774',
  '3.6322515599880094']]

**Bonus: Use ```namedtuple``` to assign the data.**

In [7]:
# Alternative: use namedtuple
# define the namedtuple called CovidData and assign column names
covid_data = namedtuple('CovidData', header)

# create the tuples from the original data
# a list comprehension keeps this compact - it could also be a for-loop
data_named = [covid_data._make(row) for row in covid_csv[1:]]

**How many rows and how many columns?**

In [8]:
# print the rows and columns in data - label the output
print(f"Rows in data: {len(data)}")
print(f"Columns in data: {len(data[0])}")

Rows in data: 3941
Columns in data: 8


In [9]:
# print the rows and columns in the namedtuple data_named - label the output
len_data_named = len(data_named[:])
print(f"Rows in namedtuple: {len_data_named}")
 
print(f"Columns in namedtuple: {len(covid_data._fields )}")

Rows in namedtuple: 3941
Columns in namedtuple: 8


### Problem 3: Check the data type of each column and convert all numeric values to floats. 

Complete the following tasks:

1. Check data types (you only have to do this for one row).
2. Convert all numeric values to floats.


Note: only print the data type once per column (i.e. only do this for one row of data).

**Format your output neatly and annotate properly.** Unannotated lists of data types will not receive credit. This means you should match each column name to a data type and display the combination.
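
One possible way to pair each column name with a data type is `zip()`. This is only a sketch on a hypothetical three-column row (`sample_header` and `sample_row` are made-up names, with values copied from the Albania row above).

```python
sample_header = ["Country_Region", "Confirmed", "Case_Fatality_Ratio"]
sample_row = ["Albania", "6817", "3.0511955405603635"]

# Pair each column name with the type of its value in one row
column_types = {
    name: type(value).__name__ for name, value in zip(sample_header, sample_row)
}

# Left-align the names so the types line up in one column
for name, type_name in column_types.items():
    print(f"{name:<22} {type_name}")
```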

In [10]:
# check data types for each column
for i,col in enumerate(data[0]):
    print( header[i],"",type(col))

Combined_Key  <class 'str'>
Country_Region  <class 'str'>
Confirmed  <class 'str'>
Deaths  <class 'str'>
Recovered  <class 'str'>
Active  <class 'str'>
Incidence_Rate  <class 'str'>
Case_Fatality_Ratio  <class 'str'>


In [11]:
# check data types for each column
for x in data_named[0:1] :
    for i in range(len(covid_data._fields)):
        print( covid_data._fields[i],type(x[i]))

Combined_Key <class 'str'>
Country_Region <class 'str'>
Confirmed <class 'str'>
Deaths <class 'str'>
Recovered <class 'str'>
Active <class 'str'>
Incidence_Rate <class 'str'>
Case_Fatality_Ratio <class 'str'>


#### Convert numeric data to floats.
1. Use a loop to convert only the numeric data (i.e. numbers represented as strings) to float values. You'll have to come up with a way to skip the non-numeric data.

2. If you used namedtuples, this is a little trickier because namedtuples are immutable (they can't be changed in place).

Hint: collect the converted values in a mutable placeholder, such as a list, and then put everything back into a new namedtuple.
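
A minimal sketch of that idea, assuming a hypothetical three-field `Record` namedtuple (smaller than this notebook's `CovidData`) and a helper named `to_float_if_numeric` that is not part of the assignment:

```python
from collections import namedtuple

# Hypothetical three-column record, for illustration only
Record = namedtuple("Record", ["Country_Region", "Confirmed", "Active"])


def to_float_if_numeric(value):
    """Return value as a float when it parses as a number, else unchanged."""
    try:
        return float(value)
    except ValueError:
        return value


raw = Record("Albania", "6817", "3057.0")

# namedtuples are immutable, so build a new one from the converted values
converted = Record._make(to_float_if_numeric(v) for v in raw)
print(converted)
```

Using `float()` inside `try`/`except` also handles decimal strings such as `'3057.0'`, which `str.isdigit()` would reject. One caveat: `float()` accepts text such as `'nan'` or `'inf'` too, so skipping the known text columns by position is the safer choice if such values could appear.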

##### Convert the appropriate elements of ```data``` to floats.

In [12]:
# convert all numerical data to floats.
clean_data = []

for row in data :
    new_row  =  [col if i < 2 else  float(col)  for i,col in enumerate(row) ]
    clean_data.append(new_row )
 
for i in range(6):  
    print(clean_data[i]) 

['Afghanistan', 'Afghanistan', 37345.0, 1354.0, 26694.0, 9297.0, 95.93267794278722, 3.6256526978176464]
['Albania', 'Albania', 6817.0, 208.0, 3552.0, 3057.0, 236.882340676906, 3.0511955405603635]
['Algeria', 'Algeria', 36699.0, 1333.0, 25627.0, 9739.0, 83.69014164611774, 3.6322515599880094]
['Andorra', 'Andorra', 977.0, 53.0, 855.0, 69.0, 1264.4793891153822, 5.424769703172978]
['Angola', 'Angola', 1762.0, 80.0, 577.0, 1105.0, 5.361119796138704, 4.540295119182747]
['Antigua and Barbuda', 'Antigua and Barbuda', 92.0, 3.0, 76.0, 13.0, 93.94657299240258, 3.260869565217391]


##### an alternative approach to convert the elements of ```data``` to floats.

In [13]:
# convert all numerical data to floats.
clean_data = []

for row in data :
    new_row = []
    
    for i, col in enumerate(row):      # i = column index, col = the cell's text
         
        if i < 2:                      # the first 2 columns (Combined_Key, Country_Region) are text
            new_row.append(col)
        else:
            new_row.append(float(col))     # convert to float
            
    clean_data.append(new_row)           # add the converted row to clean_data
 
for i in range(6):  
    print(clean_data[i]) 

['Afghanistan', 'Afghanistan', 37345.0, 1354.0, 26694.0, 9297.0, 95.93267794278722, 3.6256526978176464]
['Albania', 'Albania', 6817.0, 208.0, 3552.0, 3057.0, 236.882340676906, 3.0511955405603635]
['Algeria', 'Algeria', 36699.0, 1333.0, 25627.0, 9739.0, 83.69014164611774, 3.6322515599880094]
['Andorra', 'Andorra', 977.0, 53.0, 855.0, 69.0, 1264.4793891153822, 5.424769703172978]
['Angola', 'Angola', 1762.0, 80.0, 577.0, 1105.0, 5.361119796138704, 4.540295119182747]
['Antigua and Barbuda', 'Antigua and Barbuda', 92.0, 3.0, 76.0, 13.0, 93.94657299240258, 3.260869565217391]


##### Convert the appropriate elements of data_named to floats.
Note that this is a touch more complicated since namedtuples are immutable and elements cannot be changed. 

So the approach is to build a new list of values for each row, converting the type where necessary, and then create a new namedtuple from that list. We end up with a list of namedtuples holding the same information, but with all numerical values stored as floats.

In [14]:
data_named[0:3]

[CovidData(Combined_Key='Afghanistan', Country_Region='Afghanistan', Confirmed='37345', Deaths='1354', Recovered='26694', Active='9297.0', Incidence_Rate='95.93267794278722', Case_Fatality_Ratio='3.6256526978176464'),
 CovidData(Combined_Key='Albania', Country_Region='Albania', Confirmed='6817', Deaths='208', Recovered='3552', Active='3057.0', Incidence_Rate='236.882340676906', Case_Fatality_Ratio='3.0511955405603635'),
 CovidData(Combined_Key='Algeria', Country_Region='Algeria', Confirmed='36699', Deaths='1333', Recovered='25627', Active='9739.0', Incidence_Rate='83.69014164611774', Case_Fatality_Ratio='3.6322515599880094')]

In [15]:
len_data_named

3941

In [16]:
# namedtuple approach
# namedtuples are immutable, so build a converted copy of each row
# and store it back in the same position of the list
for i, x in enumerate(data_named):
    data_named[i] = covid_data(
        x.Combined_Key,
        x.Country_Region,
        float(x.Confirmed),
        float(x.Deaths),
        float(x.Recovered),
        float(x.Active),
        float(x.Incidence_Rate),
        float(x.Case_Fatality_Ratio),
    )

In [29]:
# check the result - 
data_named[0:3]

[CovidData(Combined_Key='Afghanistan', Country_Region='Afghanistan', Confirmed=37345.0, Deaths=1354.0, Recovered=26694.0, Active=9297.0, Incidence_Rate=95.93267794278722, Case_Fatality_Ratio=3.6256526978176464),
 CovidData(Combined_Key='Albania', Country_Region='Albania', Confirmed=6817.0, Deaths=208.0, Recovered=3552.0, Active=3057.0, Incidence_Rate=236.882340676906, Case_Fatality_Ratio=3.0511955405603635),
 CovidData(Combined_Key='Algeria', Country_Region='Algeria', Confirmed=36699.0, Deaths=1333.0, Recovered=25627.0, Active=9739.0, Incidence_Rate=83.69014164611774, Case_Fatality_Ratio=3.6322515599880094)]

In [30]:
# alternative approach

# rebuild the list of namedtuples from the raw strings
# a list comprehension keeps this compact - it could also be a for-loop
data_named2 = [covid_data._make(row) for row in covid_csv[1:]]

In [31]:
data_named2[0:3]

[CovidData(Combined_Key='Afghanistan', Country_Region='Afghanistan', Confirmed='37345', Deaths='1354', Recovered='26694', Active='9297.0', Incidence_Rate='95.93267794278722', Case_Fatality_Ratio='3.6256526978176464'),
 CovidData(Combined_Key='Albania', Country_Region='Albania', Confirmed='6817', Deaths='208', Recovered='3552', Active='3057.0', Incidence_Rate='236.882340676906', Case_Fatality_Ratio='3.0511955405603635'),
 CovidData(Combined_Key='Algeria', Country_Region='Algeria', Confirmed='36699', Deaths='1333', Recovered='25627', Active='9739.0', Incidence_Rate='83.69014164611774', Case_Fatality_Ratio='3.6322515599880094')]

In [32]:

# convert a value to a float if it parses as a number; otherwise leave it unchanged
# (str.isdigit() would miss decimal strings such as '9297.0')
def to_float_if_numeric(value):
    try:
        return float(value)
    except (TypeError, ValueError):
        return value

def convert_namedtuple(tuple_):
    return tuple(to_float_if_numeric(value) for value in tuple_)

In [33]:
for i, row in enumerate(data_named2):
    data_named2[i] = covid_data._make(convert_namedtuple(row))  # replace each row with its converted namedtuple

In [28]:

# check the result

data_named2[0:3]

[CovidData(Combined_Key='Afghanistan', Country_Region='Afghanistan', Confirmed=37345.0, Deaths=1354.0, Recovered=26694.0, Active=9297.0, Incidence_Rate=95.93267794278722, Case_Fatality_Ratio=3.6256526978176464),
 CovidData(Combined_Key='Albania', Country_Region='Albania', Confirmed=6817.0, Deaths=208.0, Recovered=3552.0, Active=3057.0, Incidence_Rate=236.882340676906, Case_Fatality_Ratio=3.0511955405603635),
 CovidData(Combined_Key='Algeria', Country_Region='Algeria', Confirmed=36699.0, Deaths=1333.0, Recovered=25627.0, Active=9739.0, Incidence_Rate=83.69014164611774, Case_Fatality_Ratio=3.6322515599880094)]

### Problem 4: Calculate the average number of active cases and average number of deaths.

1. Compute the average for active cases
2. Compute the average number of deaths.
3. Compute the average total number of cases.

Hint: Review the data dictionary to determine the correct information to use.

Hint: Don't over think this. Try to find the simplest approach.
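
In the spirit of the hint, a plain `sum() / len()` is usually enough. The sketch below uses the Deaths and Active values from the first three rows shown earlier; `sample_rows` and `column_mean` are hypothetical names used only here.

```python
# [Country_Region, Deaths, Active] for three rows, already converted to floats
sample_rows = [
    ["Afghanistan", 1354.0, 9297.0],
    ["Albania", 208.0, 3057.0],
    ["Algeria", 1333.0, 9739.0],
]


def column_mean(rows, index):
    """Arithmetic mean of one numeric column across all rows."""
    values = [row[index] for row in rows]
    return sum(values) / len(values)


avg_deaths = column_mean(sample_rows, 1)
avg_active = column_mean(sample_rows, 2)

print(f"Average deaths: {avg_deaths:.2f}")
print(f"Average active cases: {avg_active:.2f}")
```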


#### Find the average using the standard ```data```

In [21]:
# Create a list for active case counts and deaths
active_cases = [row[5] for row in clean_data]
death_cases = [row[3] for row in clean_data]

# 1. Compute the averages
avg_active_cases = round(sum(active_cases)/len(active_cases),2)
avg_death_cases = round(sum(death_cases)/len(death_cases),2)

# 2. Display the results
print(f"average for active cases {avg_active_cases}")
print(f"average number of deaths {avg_death_cases}")

average for active cases 1789.29
average number of deaths 190.2


In [22]:
# col3 = deaths, col4 = recovered, col5 = active
def get_country_avg(death, recovered, active):
    """Add up the counts that are present and divide by the number of rows in data."""
    calc_avg = 0
    if death is not None:
        calc_avg += death
    if recovered is not None:
        calc_avg += recovered
    if active is not None:
        calc_avg += active
    return round(calc_avg / len(data), 2)

In [23]:
# Calculate the average number of cases per country

#Goal: create an array of tuples (country,average)
country_average = []

for row in  clean_data :
    row_avg = []
    row_avg.append(row[0])
    row_avg.append(get_country_avg(row[3],row[4],row[5]))
    country_average.append(tuple(row_avg))

country_average[0:4] 
     

[('Afghanistan', 9.48),
 ('Albania', 1.73),
 ('Algeria', 9.31),
 ('Andorra', 0.25)]

#### Find the average using the namedtuple ```data_named```

In [24]:
# Create a list for active case counts and deaths

# Note: Don't forget to convert to floats.

# namedtuple approach
active   = [float(x.Active)  for x in data_named]
deaths  = [float(x.Deaths)  for x in data_named]


 
print(" average for active cases =", round(sum(active) / len_data_named,2)   ) 
print(" average for death cases =", round(sum(deaths) / len_data_named,2)   ) 

 average for active cases = 1789.29
 average for death cases = 190.2


In [25]:
# Calculate the average number of cases per country
avg_cases_country = []

for x in data_named[0:] :
    tup_row = []
     
    tup_row.append(x.Country_Region)
    tup_row.append(get_country_avg(x.Recovered,x.Deaths,x.Active))
        
    avg_cases_country.append(tuple(tup_row))
                
avg_cases_country [0:3] 

[('Afghanistan', 9.48), ('Albania', 1.73), ('Algeria', 9.31)]

**Compute the Average total number of cases**

In [26]:
# What information do we need to get this result?
recovered  = [float(x.Recovered)  for x in data_named] 

#### To compute the average total number of cases, we need the total active cases, total deaths, total recovered cases, and the total number of rows in the data (one per country or region).



In [27]:
#
avg_total = get_country_avg(sum(recovered),sum(deaths),sum(active))
print(f"Average total number of cases: {avg_total}")

Average total number of cases: 5234.42


### Problem 5: Create an object ```countries``` that contains all the country names in the data set. Each country should only be listed once.

1. Create a list (or other python data type) of unique country names.
2. Print total number of unique countries represented in the data set.
3. Print the first 5 names and the last 5 names - Print your results neatly and annotate. Your results should be in alphabetical order.
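
A small sketch of deduplicating and ordering names, using a made-up list (`sample_countries` is not the real data). Note that Python's default string sort is case-sensitive, so uppercase letters order before lowercase ones.

```python
# Made-up list with repeats, standing in for the Country_Region column
sample_countries = ["Sweden", "Albania", "US", "Sweden", "Afghanistan", "US"]

# A set drops the duplicates; sorted() returns an alphabetically ordered list
unique_sorted = sorted(set(sample_countries))

print(f"Unique countries: {len(unique_sorted)}")
print(f"First 2: {unique_sorted[:2]}")
print(f"Last 2:  {unique_sorted[-2:]}")
```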


In [28]:
# Country_Region is the second column of each row
countries = [row[1] for row in clean_data]

# remove duplicates with a set, then sort alphabetically
unique_country = sorted(set(countries))

# print the country count
print(f"total number of unique countries: {len(unique_country)}")

total number of unique countries: 188


In [29]:
# list the countries
unique_country 

['Afghanistan',
 'Albania',
 'Algeria',
 'Andorra',
 'Angola',
 'Antigua and Barbuda',
 'Argentina',
 'Armenia',
 'Australia',
 'Austria',
 'Azerbaijan',
 'Bahamas',
 'Bahrain',
 'Bangladesh',
 'Barbados',
 'Belarus',
 'Belgium',
 'Belize',
 'Benin',
 'Bhutan',
 'Bolivia',
 'Bosnia and Herzegovina',
 'Botswana',
 'Brazil',
 'Brunei',
 'Bulgaria',
 'Burkina Faso',
 'Burma',
 'Burundi',
 'Cabo Verde',
 'Cambodia',
 'Cameroon',
 'Canada',
 'Central African Republic',
 'Chad',
 'Chile',
 'China',
 'Colombia',
 'Comoros',
 'Congo (Brazzaville)',
 'Congo (Kinshasa)',
 'Costa Rica',
 "Cote d'Ivoire",
 'Croatia',
 'Cuba',
 'Cyprus',
 'Czechia',
 'Denmark',
 'Diamond Princess',
 'Djibouti',
 'Dominica',
 'Dominican Republic',
 'Ecuador',
 'Egypt',
 'El Salvador',
 'Equatorial Guinea',
 'Eritrea',
 'Estonia',
 'Eswatini',
 'Ethiopia',
 'Fiji',
 'Finland',
 'France',
 'Gabon',
 'Gambia',
 'Georgia',
 'Germany',
 'Ghana',
 'Greece',
 'Grenada',
 'Guatemala',
 'Guinea',
 'Guinea-Bissau',
 'Guyana',

In [30]:
# print the first 5.
print(unique_country[0:5])
# print the last 5
print(unique_country[-5: ])

['Afghanistan', 'Albania', 'Algeria', 'Andorra', 'Angola']
['West Bank and Gaza', 'Western Sahara', 'Yemen', 'Zambia', 'Zimbabwe']


### Problem 6: Calculate the average number of confirmed cases for the first 5 countries and the last 5 countries.

1. Determine the average number of confirmed cases for the first 5 countries.
2. Determine the average number of confirmed cases for the last 5 countries.


Note: Print your results neatly and properly annotated.

Hint: Think carefully about the easiest way to count the number of confirmed cases!
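
One way to read the task is: total the confirmed cases per country, then average those totals over the first five country names. The sketch below does that on the first five rows of the data (values copied from the output earlier in this notebook); `sample_rows` and `confirmed_by_country` are hypothetical names.

```python
from collections import defaultdict

# (Country_Region, Confirmed) pairs from the first five rows of the data
sample_rows = [
    ("Afghanistan", 37345.0),
    ("Albania", 6817.0),
    ("Algeria", 36699.0),
    ("Andorra", 977.0),
    ("Angola", 1762.0),
]

# Sum confirmed cases per country so a country with several rows is counted once
confirmed_by_country = defaultdict(float)
for country, confirmed in sample_rows:
    confirmed_by_country[country] += confirmed

# Take the first five names in alphabetical order and average their totals
first_five = sorted(confirmed_by_country)[:5]
avg_first_five = sum(confirmed_by_country[c] for c in first_five) / len(first_five)

print(f"Average confirmed cases, first 5 countries: {avg_first_five}")
```

The same pattern with `sorted(confirmed_by_country)[-5:]` would cover the last five countries.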


In [31]:
covid_data._fields

('Combined_Key',
 'Country_Region',
 'Confirmed',
 'Deaths',
 'Recovered',
 'Active',
 'Incidence_Rate',
 'Case_Fatality_Ratio')

In [32]:
#first create a list of confirmed cases
confirmed   = [float(x.Confirmed)  for x in data_named] 

In [33]:
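# each row's confirmed cases as a share of the total confirmed cases
# (for countries with several rows, the last row overwrites earlier ones)
total_confirmed = sum(confirmed)
avg_confirmed_cases = {}
for row in data_named:
    avg_confirmed_cases[row[1]] = round(float(row[2]) / total_confirmed, 8)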
        

In [34]:
first5pairs = {k: avg_confirmed_cases[k] for k in sorted(avg_confirmed_cases.keys())[:5]}
first5pairs

{'Afghanistan': 0.00181019,
 'Albania': 0.00033043,
 'Algeria': 0.00177887,
 'Andorra': 4.736e-05,
 'Angola': 8.541e-05}

In [35]:
last5pairs = {k: avg_confirmed_cases[k] for k in sorted(avg_confirmed_cases.keys())[-5:] }
last5pairs

{'West Bank and Gaza': 0.000736,
 'Western Sahara': 4.8e-07,
 'Yemen': 8.924e-05,
 'Zambia': 0.00041206,
 'Zimbabwe': 0.00023717}

In [36]:
# write a function
def get_display_recs(n, end):
    n5pairs = {key:avg_confirmed_cases[key] for key in sorted(avg_confirmed_cases.keys())[n:end]}
    return n5pairs                

In [37]:
#Get Keys
for i,key in enumerate(get_display_recs(88, 90)):
    print (key,"has average number of confirmed cases of",avg_confirmed_cases[key])
  

Jordan has average number of confirmed cases of 6.316e-05
Kazakhstan has average number of confirmed cases of 0.0049137


In [38]:
#Get Keys
for i,key in enumerate(get_display_recs(43, 47)):
    print (key,"has average number of confirmed cases of",avg_confirmed_cases[key])


Croatia has average number of confirmed cases of 0.00028453
Cuba has average number of confirmed cases of 0.00015162
Cyprus has average number of confirmed cases of 6.258e-05
Czechia has average number of confirmed cases of 0.0009246


### Problem 7: Create a dictionary of confirmed cases in the EU.

The keys in the dictionary are the EU countries and the values are the total number of confirmed cases in each.

**Expected output**: `{'Austria': 22439, 'Belgium': 75647, ...  }` (*required*)

**Bonus**: use `defaultdict` to simplify your code. (*optional*)

See: [Python Doc - defaultdict](https://docs.python.org/3/library/collections.html?highlight=defaultdict#collections.defaultdict) or [Stackoverflow - defaultdict](https://stackoverflow.com/questions/5900578/how-does-collections-defaultdict-work)
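
As a hedged illustration of the bonus, `defaultdict(float)` starts every missing key at `0.0`, so totals can be accumulated without first checking whether the key exists. The rows below are a small hypothetical sample (the Sweden values are two of the regional rows that appear in the Pandas output later in this notebook), and `eu_sample` is a shortened stand-in for the full EU list.

```python
from collections import defaultdict

eu_sample = ["Austria", "Sweden"]  # shortened stand-in for the full EU list

# (Country_Region, Confirmed) rows; Sweden appears once per region
sample_rows = [
    ("Austria", 22439.0),
    ("Sweden", 1099.0),
    ("Sweden", 925.0),
    ("Albania", 6817.0),  # not in the EU sample, so it is skipped
]

eu_confirmed = defaultdict(float)
for country, confirmed in sample_rows:
    if country in eu_sample:
        eu_confirmed[country] += confirmed  # missing keys start at 0.0

print(dict(eu_confirmed))
```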

In [39]:
# a list of EU countries
eu = ['Austria',
'Belgium',
'Bulgaria',
'Croatia',
'Cyprus',
'Czechia',
'Denmark',
'Estonia',
'Finland',
'France',
'Germany',
'Greece',
'Hungary',
'Ireland',
'Italy',
'Latvia',
'Lithuania',
'Luxembourg',
'Malta',
'Netherlands',
'Poland',
'Portugal',
'Romania',
'Slovakia',
'Slovenia',
'Spain',
'Sweden']

In [40]:
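# Create a dictionary of confirmed cases in the EU
# add up every row for a country so that multi-row countries get their total
is_eu = {}
for row in clean_data:
    if row[1] in eu:
        is_eu[row[1]] = is_eu.get(row[1], 0) + row[2]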

#print first 5 rows           
first5pairs =  {k: is_eu[k] for k in list(is_eu)[:5]}
first5pairs

{'Austria': 22439.0,
 'Belgium': 75647.0,
 'Bulgaria': 13893.0,
 'Croatia': 5870.0,
 'Cyprus': 1291.0}

In [41]:
# if you used a named tuple - answer here
# add up every row for a country so that multi-row countries get their total
is_eu_tup = {}

for x in data_named:
    if x.Country_Region in eu:
        is_eu_tup[x.Country_Region] = is_eu_tup.get(x.Country_Region, 0) + float(x.Confirmed)

In [42]:
#print first 5 rows           
first5pairs =  {k: is_eu_tup[k] for k in sorted(list(is_eu_tup))[0:5]}
first5pairs

{'Austria': 22439.0,
 'Belgium': 75647.0,
 'Bulgaria': 13893.0,
 'Croatia': 5870.0,
 'Cyprus': 1291.0}

In [43]:
# try with a defaultdict
# a plain dict raises a KeyError when you look up a key that isn't there;
# a defaultdict inserts the missing key with a default value instead
from collections import defaultdict

# look up a key that is not in the dictionary is_eu_tup
is_eu_tup['Case_Fatality ']
# raises a KeyError

KeyError: 'Case_Fatality '

In [None]:
# set a default value for keys that are not found, keeping the existing entries
is_eu_tup = defaultdict(lambda: "Key is Not present in the dictionary", is_eu_tup)

In [None]:
is_eu_tup['Case_Fatality '] 

### Problem 8: Compare the Case Fatality Rate in the EU to that in the US and North America.

1. Determine the CFR in the EU
2. Determine the CFR in the US
3. Determine the CFR in North America

Note: The Case Fatality Rate is a feature in this data set. You are not to use that feature. You should compute the CFR from the other available features. Use the existing CFR column as a check.
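
For a region-level figure, one reasonable reading is total deaths divided by total confirmed cases across every row in the region, times 100. The sketch below assumes that definition; `region_cfr` and `sample_rows` are hypothetical, and the figures are a handful of rows copied from the Pandas output later in this notebook, not the full data.

```python
# (Country_Region, Confirmed, Deaths) rows, a small sample only
sample_rows = [
    ("US", 1188.0, 22.0),
    ("US", 3710.0, 29.0),
    ("US", 0.0, 28.0),  # a row with deaths but no confirmed cases
    ("Austria", 22439.0, 724.0),
]


def region_cfr(rows, countries):
    """CFR (%) = total deaths / total confirmed * 100 over the chosen countries."""
    confirmed = sum(c for name, c, d in rows if name in countries)
    deaths = sum(d for name, c, d in rows if name in countries)
    # Guard against dividing by zero when a region has no confirmed cases
    return round(deaths / confirmed * 100, 2) if confirmed else None


us_cfr = region_cfr(sample_rows, {"US"})
austria_cfr = region_cfr(sample_rows, {"Austria"})
print(f"US CFR (sample rows only): {us_cfr}%")
print(f"Austria CFR: {austria_cfr}%")
```

Summing before dividing means a single row with zero confirmed cases does not break the calculation, and multi-row countries contribute all of their rows.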

In [83]:
covid_data._fields

('Combined_Key',
 'Country_Region',
 'Confirmed',
 'Deaths',
 'Recovered',
 'Active',
 'Incidence_Rate',
 'Case_Fatality_Ratio')

In [84]:
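# Calculate the CFR in the EU: cfr = deaths / confirmed * 100
# Note: a country with several rows keeps only the value from its last row

cfr_eu = {}
for row in clean_data:
    if row[1] in eu:
        try:
            cfr_eu[row[1]] = round(row[3] / row[2] * 100, 2)
        except ZeroDivisionError:      # no confirmed cases in this row
            cfr_eu[row[1]] = 0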

In [85]:
#print first 5 rows           
first5_cfr_eu =  {k: cfr_eu[k] for k in sorted(list(cfr_eu))[0:5]}
first5_cfr_eu

{'Austria': 3.23,
 'Belgium': 13.09,
 'Bulgaria': 3.47,
 'Croatia': 2.73,
 'Cyprus': 1.55}

In [86]:
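# Determine the CFR in the US
# Note: the US has many rows, so only the value from its last row is kept
cfr_us = {}
for row in clean_data:
    if row[1] == "US":
        try:
            cfr_us[row[1]] = round(row[3] / row[2] * 100, 2)
        except ZeroDivisionError:      # no confirmed cases in this row
            cfr_us[row[1]] = 0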

In [87]:
#print first 5 rows           
first5_cfr_us =  {k: cfr_us[k] for k in sorted(list(cfr_us))[0:5]}
first5_cfr_us

{'US': 0.0}

In [88]:
# countries in North America
na = ['Antigua and Barbuda',
'Bahamas',
'Barbados',
'Belize',
'Canada',
'Costa Rica',
'Cuba',
'Dominica',
'Dominican Republic',
'El Salvador',
'Grenada',
'Guatemala',
'Haiti',
'Honduras',
'Jamaica',
'Mexico',
'Nicaragua',
'Panama',
'Saint Kitts and Nevis',
'Saint Lucia',
'Saint Vincent and the Grenadines',
'Trinidad and Tobago',
'US'] 

In [89]:
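# Determine the CFR in North America
# Note: a country with several rows keeps only the value from its last row
cfr_na = {}
for row in clean_data:
    if row[1] in na:
        try:
            cfr_na[row[1]] = round(row[3] / row[2] * 100, 2)
        except ZeroDivisionError:      # no confirmed cases in this row
            cfr_na[row[1]] = 0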

In [90]:
#print first 5 rows           
first5_cfr_na =  {k: cfr_na[k] for k in sorted(list(cfr_na))[0:5]}
first5_cfr_na

{'Antigua and Barbuda': 3.26,
 'Bahamas': 1.45,
 'Barbados': 4.86,
 'Belize': 0.95,
 'Canada': 0.0}

In [110]:
# write a function

def calc_first5_CFR_by_region(region):
    # accept either a single country name or a list of country names
    countries = [region] if isinstance(region, str) else region
    cfr_region = {}
    for row in clean_data:
        if row[1] in countries:
            try:
                cfr_region[row[1]] = round(row[3] / row[2] * 100, 2)
            except ZeroDivisionError:      # no confirmed cases in this row
                cfr_region[row[1]] = 0
    # return the first 5 countries in alphabetical order
    return {k: cfr_region[k] for k in sorted(cfr_region)[:5]}



In [111]:
print( calc_first5_CFR_by_region (na))
    

{'Antigua and Barbuda': 3.26, 'Bahamas': 1.45, 'Barbados': 4.86, 'Belize': 0.95, 'Canada': 0.0}


In [112]:
print( calc_first5_CFR_by_region (eu))

{'Austria': 3.23, 'Belgium': 13.09, 'Bulgaria': 3.47, 'Croatia': 2.73, 'Cyprus': 1.55}


In [113]:
print( calc_first5_CFR_by_region ("US"))

{'US': 0.0}



### Bonus 1: Craft a problem statement about this data that interests you, and then answer it!


In [74]:
#The highest number of Confirmed cases held by a country: 
largest = max(float(row[2]) for row in data)

In [75]:
print(f" The highest number of Confirmed cases held by a country: {largest}")

 The highest number of Confirmed cases held by a country: 655181.0


In [78]:
#The least number of Confirmed cases held by a country:
smallest = min(float(row[2]) for row in data)
print(f" The least number of Confirmed cases held by a country: {smallest}")

 The least number of Confirmed cases held by a country: 0.0


### Bonus 2: Repeat the above analysis using Pandas!


In [46]:
# List the Country_Region column, sorted in reverse alphabetical order
import pandas as pd
prod_df = pd.read_csv(DATA_FILE, sep=',')
 
country_list = prod_df['Country_Region'].sort_values(ascending=False)
country_list

3940              Zimbabwe
3939                Zambia
3938                 Yemen
3937        Western Sahara
3936    West Bank and Gaza
               ...        
4                   Angola
3                  Andorra
2                  Algeria
1                  Albania
0              Afghanistan
Name: Country_Region, Length: 3941, dtype: object

In [47]:
# Print total number of unique countries represented in the data set.
 
prod_df['Country_Region'].nunique() 

188

In [48]:

# print the first 5.
prod_df.head(5)[['Country_Region']] 

|  | Country_Region |
|---|---|
| 0 | Afghanistan |
| 1 | Albania |
| 2 | Algeria |
| 3 | Andorra |
| 4 | Angola |


In [49]:
# print the last 5
prod_df.tail(5)[['Country_Region']]

|  | Country_Region |
|---|---|
| 3936 | West Bank and Gaza |
| 3937 | Western Sahara |
| 3938 | Yemen |
| 3939 | Zambia |
| 3940 | Zimbabwe |


In [50]:
#Calculate the average number of confirmed cases for the first 5  
prod_df['Confirmed'].head().mean()



16720.0

In [53]:
# head(5) and tail(5) already select the rows we need, so no sorting is required
first5_avg = prod_df['Confirmed'].head(5).mean()

print(f" The average number of confirmed cases for the first 5 countries {first5_avg}")

last5_avg = prod_df['Confirmed'].tail(5).mean()

print(f" The average number of confirmed cases for the last 5 countries {last5_avg}")

 The average number of confirmed cases for the first 5 countries 16720.0
 The average number of confirmed cases for the last 5 countries 6085.8


In [54]:
# write a function
def findavg(df_data):
    """Return the mean of Confirmed for the first 5 and the last 5 rows of df_data."""
    first5_avg = df_data['Confirmed'].head(5).mean()
    last5_avg = df_data['Confirmed'].tail(5).mean()
    return first5_avg, last5_avg

first5_avg, last5_avg = findavg(prod_df)
print(f" The average number of confirmed cases for the first 5 countries {first5_avg}")
print(f" The average number of confirmed cases for the last 5 countries {last5_avg}")

 The average number of confirmed cases for the first 5 countries 16720.0
 The average number of confirmed cases for the last 5 countries 6085.8


In [55]:
#Problem 7: Create a dictionary of confirmed cases in the EU.
#create filter
is_eu = prod_df['Country_Region'].isin(eu) 

#Create a dictionary of confirmed cases in the EU.
prod_df[is_eu][['Country_Region','Confirmed']] 

|  | Country_Region | Confirmed |
|---|---|---|
| 16 | Austria | 22439 |
| 23 | Belgium | 75647 |
| 58 | Bulgaria | 13893 |
| 168 | Croatia | 5870 |
| 170 | Cyprus | 1291 |
| ... | ... | ... |
| 614 | Sweden | 1099 |
| 615 | Sweden | 925 |
| 616 | Sweden | 1824 |
| 617 | Sweden | 2695 |


In [56]:
# df.to_dict() converts the DataFrame to a dictionary
# Note: a country that appears in several rows keeps only its last row's value
df = pd.DataFrame(prod_df[is_eu][['Country_Region','Confirmed']],columns=['Country_Region','Confirmed'] )
df.set_index('Country_Region', inplace=True)
df = df.rename_axis(None)  

my_dictionary = df.to_dict()
my_dictionary.items()

dict_items([('Confirmed', {'Austria': 22439, 'Belgium': 75647, 'Bulgaria': 13893, 'Croatia': 5870, 'Cyprus': 1291, 'Czechia': 19075, 'Denmark': 15070, 'Estonia': 2174, 'Finland': 7642, 'France': 230874, 'Germany': 1880, 'Greece': 6177, 'Hungary': 4768, 'Ireland': 26838, 'Italy': 20801, 'Latvia': 1303, 'Lithuania': 2309, 'Luxembourg': 7300, 'Malta': 1190, 'Netherlands': 15765, 'Poland': 53676, 'Portugal': 53223, 'Romania': 65177, 'Slovakia': 2690, 'Slovenia': 2303, 'Spain': 0, 'Sweden': 19204})])

In [58]:
# if you used a named tuple - answer here
# convert the list of namedtuples into a dictionary of columns

covid_data_dict = {}

# build the keys: one empty list per column header
for p in header:
    covid_data_dict[p] = []

# append each cell to the list for its column
for row in data_named:
    for i, cell in enumerate(row):
        covid_data_dict[header[i]].append(cell)

In [59]:
# try with a defaultdict
# a plain dict raises a KeyError for a missing key;
# a defaultdict inserts the key with a default value instead
from collections import defaultdict

# check the value of the key 'Case_Fatality_Ratio' in the dictionary covid_data_dict
covid_data_dict['Case_Fatality_Ratio'][0:5]

[3.6256526978176464,
 3.0511955405603635,
 3.6322515599880094,
 5.424769703172978,
 4.540295119182747]

In [60]:
#check value that is not there
covid_data_dict['Case_Fatality '] 
# I EXPECT A KeyError

KeyError: 'Case_Fatality '

In [61]:
try:
    covid_data_dict['Case_Fatality '] 
except KeyError:
    print("Exception - Key is Not present in the dictionary")

Exception - Key is Not present in the dictionary


In [62]:
# set a default value for keys that are not found, keeping the existing columns
covid_data_dict = defaultdict(lambda: "Key is Not present in the dictionary", covid_data_dict)

In [63]:
covid_data_dict['Case_Fatality '] 

'Key is Not present in the dictionary'

In [64]:
# Problem 8: Compare the Case Fatality Rate in the EU to that in the US and North America
# create filter
is_eu = prod_df['Country_Region'].isin(eu)

# keep the columns needed for the comparison; .copy() avoids modifying a view of prod_df
df_eu_cfr = prod_df.loc[is_eu, ['Country_Region','Confirmed','Deaths','Case_Fatality_Ratio']].copy()

# cfr: deaths / confirmed * 100
df_eu_cfr['Calculated Case Fatality Rate'] = df_eu_cfr['Deaths']/df_eu_cfr['Confirmed'] * 100
 
df_eu_cfr

|  | Country_Region | Confirmed | Deaths | Case_Fatality_Ratio | Calculated Case Fatality Rate |
|---|---|---|---|---|---|
| 16 | Austria | 22439 | 724 | 3.226525 | 3.226525 |
| 23 | Belgium | 75647 | 9900 | 13.087102 | 13.087102 |
| 58 | Bulgaria | 13893 | 482 | 3.469373 | 3.469373 |
| 168 | Croatia | 5870 | 160 | 2.725724 | 2.725724 |
| 170 | Cyprus | 1291 | 20 | 1.549187 | 1.549187 |
| ... | ... | ... | ... | ... | ... |
| 614 | Sweden | 1099 | 73 | 6.642402 | 6.642402 |
| 615 | Sweden | 925 | 31 | 3.351351 | 3.351351 |
| 616 | Sweden | 1824 | 132 | 7.236842 | 7.236842 |
| 617 | Sweden | 2695 | 180 | 6.679035 | 6.679035 |


In [66]:
# countries in North America
na = ['Antigua and Barbuda',
'Bahamas',
'Barbados',
'Belize',
'Canada',
'Costa Rica',
'Cuba',
'Dominica',
'Dominican Republic',
'El Salvador',
'Grenada',
'Guatemala',
'Haiti',
'Honduras',
'Jamaica',
'Mexico',
'Nicaragua',
'Panama',
'Saint Kitts and Nevis',
'Saint Lucia',
'Saint Vincent and the Grenadines',
'Trinidad and Tobago',
'US']

In [67]:
#create filter: True for rows whose country is in the na list
is_na = prod_df['Country_Region'].isin(na) 

#keep only the North America rows and the columns needed for the comparison
#(.copy() so the new column is added to an independent DataFrame)
df_na_cfr = prod_df.loc[is_na, ['Country_Region','Confirmed','Deaths','Case_Fatality_Ratio']].copy()

#cfr for each row: deaths/confirmed * 100
df_na_cfr['Calculated Case Fatality Rate'] = df_na_cfr['Deaths']/df_na_cfr['Confirmed']  * 100
 
df_na_cfr

Unnamed: 0,Country_Region,Confirmed,Deaths,Case_Fatality_Ratio,Calculated Case Fatality Rate
5,Antigua and Barbuda,92,3,3.260870,3.260870
18,Bahamas,1036,15,1.447876,1.447876
21,Barbados,144,7,4.861111,4.861111
24,Belize,210,2,0.952381,0.952381
65,Canada,11893,217,1.824603,1.824603
...,...,...,...,...,...
3883,US,373,0,0.000000,0.000000
3884,US,278,0,0.000000,0.000000
3885,US,0,28,-99.000000,inf
3886,US,82,0,0.000000,0.000000
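
Row 3885 has 0 confirmed cases and 28 deaths, so the calculated rate is `inf`, and the source column holds -99, which appears to be a placeholder rather than a real ratio. In plain Python the same division raises `ZeroDivisionError` instead of returning `inf`, so such rows need to be separated out before dividing. A minimal sketch, assuming rows are `(country, confirmed, deaths)` tuples; `flag_undefined` and `us_tail` are hypothetical names and the four rows are copied from the end of the table above.

```python
def flag_undefined(rows):
    """Split rows into (usable, needs_review) by whether a CFR can be computed."""
    usable, needs_review = [], []
    for country, confirmed, deaths in rows:
        if confirmed > 0:
            # Safe to divide: append the row together with its CFR in percent.
            usable.append((country, confirmed, deaths, deaths / confirmed * 100))
        else:
            # Zero confirmed cases: the ratio is undefined, so set the row aside.
            needs_review.append((country, confirmed, deaths))
    return usable, needs_review


# Last four rows of the North America table shown above.
us_tail = [('US', 373, 0), ('US', 278, 0), ('US', 0, 28), ('US', 82, 0)]

usable, needs_review = flag_undefined(us_tail)
print(len(usable), "usable rows")
print("needs review:", needs_review)
```

Setting the undefined rows aside, rather than dropping them silently, keeps the 28 deaths visible for a later decision about how to treat them.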


In [68]:
#create filter: True for rows where the country is the US
is_us = prod_df['Country_Region'] == 'US'

#keep only the US rows and the columns needed for the comparison
#(.copy() so the new column is added to an independent DataFrame)
df_us_cfr = prod_df.loc[is_us, ['Country_Region','Confirmed','Deaths','Case_Fatality_Ratio']].copy()

#cfr for each row: deaths/confirmed * 100
df_us_cfr['Calculated Case Fatality Rate'] = df_us_cfr['Deaths']/df_us_cfr['Confirmed'] * 100
 
df_us_cfr

Unnamed: 0,Country_Region,Confirmed,Deaths,Case_Fatality_Ratio,Calculated Case Fatality Rate
630,US,1188,22,1.851852,1.851852
631,US,3710,29,0.781671,0.781671
632,US,581,6,1.032702,1.032702
633,US,453,5,1.103753,1.103753
634,US,825,5,0.606061,0.606061
...,...,...,...,...,...
3883,US,373,0,0.000000,0.000000
3884,US,278,0,0.000000,0.000000
3885,US,0,28,-99.000000,inf
3886,US,82,0,0.000000,0.000000


In [71]:
# write a function that builds the CFR table for a region list ('na' or 'eu') or a single country

def create_df_calc_cfr(arg_region,mylist):
    if arg_region in ('na', 'eu'):
        # region request: keep rows whose country is in mylist
        is_in = prod_df['Country_Region'].isin(mylist)
    else:
        # single-country request: arg_region is the country name
        is_in = prod_df['Country_Region'] == arg_region
    df_isin_cfr = prod_df.loc[is_in, ['Country_Region','Confirmed','Deaths','Case_Fatality_Ratio']].copy()
    # cfr for each row: deaths/confirmed * 100, computed on the filtered rows themselves
    df_isin_cfr['Calculated Case Fatality Rate'] = df_isin_cfr['Deaths']/df_isin_cfr['Confirmed']  * 100
    return df_isin_cfr

In [72]:
create_df_calc_cfr('US',[])

Unnamed: 0,Country_Region,Confirmed,Deaths,Case_Fatality_Ratio,Calculated Case Fatality Rate
630,US,1188,22,1.851852,1.851852
631,US,3710,29,0.781671,0.781671
632,US,581,6,1.032702,1.032702
633,US,453,5,1.103753,1.103753
634,US,825,5,0.606061,0.606061
...,...,...,...,...,...
3883,US,373,0,0.000000,0.000000
3884,US,278,0,0.000000,0.000000
3885,US,0,28,-99.000000,inf
3886,US,82,0,0.000000,0.000000


In [73]:
create_df_calc_cfr('na',na)

Unnamed: 0,Country_Region,Confirmed,Deaths,Case_Fatality_Ratio,Calculated Case Fatality Rate
5,Antigua and Barbuda,92,3,3.260870,3.260870
18,Bahamas,1036,15,1.447876,1.447876
21,Barbados,144,7,4.861111,4.861111
24,Belize,210,2,0.952381,0.952381
65,Canada,11893,217,1.824603,1.824603
...,...,...,...,...,...
3883,US,373,0,0.000000,0.000000
3884,US,278,0,0.000000,0.000000
3885,US,0,28,-99.000000,inf
3886,US,82,0,0.000000,0.000000
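
The function above takes a region flag plus a separate list. One possible simplification, sketched here in plain Python, is a single argument that may be either one country name or a collection of names. `select_rows` and `sample_rows` are hypothetical names, and the rows are a handful of `(country, confirmed, deaths)` values copied from the tables in this section.

```python
def select_rows(rows, region):
    """Keep rows whose country matches `region`.

    `region` may be a single country name (str) or any iterable of names.
    """
    # A bare string is one country; anything else is treated as a collection.
    wanted = {region} if isinstance(region, str) else set(region)
    return [row for row in rows if row[0] in wanted]


# A few rows taken from the tables in this section.
sample_rows = [
    ('Austria', 22439, 724),
    ('Canada', 11893, 217),
    ('US', 1188, 22),
    ('US', 3710, 29),
]

us_only = select_rows(sample_rows, 'US')
north_america = select_rows(sample_rows, ['Canada', 'US'])
print(len(us_only), "US rows;", len(north_america), "North America rows")
```

Checking for `str` first matters because a string is itself iterable: passing `'US'` straight to `set()` would produce `{'U', 'S'}` and match nothing.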
