<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">

# Project 2: Analyzing Covid-19 Data

_Author: B Rhodes (DC)_

---

For Project 2, you'll use Python to perform fundamental exploratory data analysis (EDA) tasks. In the other notebook in this project you can use Pandas, but in this notebook you should use only plain Python. The purpose here is to flex your Python muscles while thinking about data.

Below you'll import a data file with COVID-19 case counts by country and region. The original data, along with a data dictionary, can be found at [Johns Hopkins University: CSSEGISandData/COVID-19](https://github.com/CSSEGISandData/COVID-19/tree/master/csse_covid_19_data).


For these exercises, you will conduct basic exploratory data analysis using Python (Pandas is not allowed for this notebook). The goal is to understand some fundamentals of the COVID-19 data. These exercises will allow you to practice business analysis skills while also becoming more comfortable with Python.

---

## Part 1: Load the data & initial exploration

### Problem 1: Load the file and store it in an object called `covid_csv`.

Hint: This is a csv (comma-separated value) file, so we'll use `csv.reader()` 

See: [Python Docs - csv](https://docs.python.org/2/library/csv.html).
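As a minimal sketch of what `csv.reader` does, the example below parses a small in-memory sample rather than the project file (the sample text and the names `sample_text` and `sample_rows` are made up for illustration). It shows that a quoted field such as `"Hong Kong, China"` stays in one piece even though it contains a comma.

```python
import csv
import io

# Hypothetical three-line sample in the same shape as the project file
sample_text = (
    'Combined_Key,Country_Region,Confirmed\n'
    'Albania,Albania,6817\n'
    '"Hong Kong, China",China,4243\n'
)

# csv.reader accepts any iterable of lines, including a file-like object
sample_rows = list(csv.reader(io.StringIO(sample_text)))

for row in sample_rows:
    print(row)
```

Splitting each line on `,` by hand would break the quoted key into two fields, which is one reason to prefer the `csv` module here.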



In [198]:
import csv

# import namedtuple as an option to store the data rows
from collections import namedtuple, defaultdict

DATA_FILE = './data/covid.csv'


#### Load the data

In [199]:
with open(DATA_FILE, 'r') as f:
    covid_csv = [row for row in csv.reader(f)]

In [200]:

covid_csv[0:3]

[['Combined_Key',
  'Country_Region',
  'Confirmed',
  'Deaths',
  'Recovered',
  'Active',
  'Incidence_Rate',
  'Case_Fatality_Ratio'],
 ['Afghanistan',
  'Afghanistan',
  '37345',
  '1354',
  '26694',
  '9297.0',
  '95.93267794278722',
  '3.6256526978176464'],
 ['Albania',
  'Albania',
  '6817',
  '208',
  '3552',
  '3057.0',
  '236.882340676906',
  '3.0511955405603635']]

### Problem 2: Separate ```covid_csv``` into a `header` and `data`. 

Complete the following tasks:

1. Split the covid_csv object into a ```header``` and ```data```.
    1. display the ```header```
    2. display the first 3 rows of ```data```.
2. What are the dimensions of your data? Print the result (neatly formatted, with each dimension identified).
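One possible way to do the split is starred unpacking, sketched below on a few hypothetical rows (the names `rows`, `header_row`, and `data_rows` are illustrative only and are not used elsewhere in this notebook).

```python
# Hypothetical rows standing in for the list built from csv.reader
rows = [
    ['Combined_Key', 'Country_Region', 'Confirmed'],
    ['Afghanistan', 'Afghanistan', '37345'],
    ['Albania', 'Albania', '6817'],
]

# Starred unpacking: the first row becomes the header, the rest become the data
header_row, *data_rows = rows

n_rows = len(data_rows)    # number of data rows (header excluded)
n_cols = len(header_row)   # number of columns
print(f"Rows: {n_rows}, Columns: {n_cols}")
```

Slicing (`rows[0]` and `rows[1:]`) gives the same result; the unpacking form just does both steps in one statement.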


**Define the header and display the contents.**

In [201]:
header = covid_csv[0]
header


['Combined_Key',
 'Country_Region',
 'Confirmed',
 'Deaths',
 'Recovered',
 'Active',
 'Incidence_Rate',
 'Case_Fatality_Ratio']

**Define the data and display the first 3 rows.**

In [202]:
# Assign the data only
# display the first 3 rows
data = covid_csv[1:]
data[0:3] 


[['Afghanistan',
  'Afghanistan',
  '37345',
  '1354',
  '26694',
  '9297.0',
  '95.93267794278722',
  '3.6256526978176464'],
 ['Albania',
  'Albania',
  '6817',
  '208',
  '3552',
  '3057.0',
  '236.882340676906',
  '3.0511955405603635'],
 ['Algeria',
  'Algeria',
  '36699',
  '1333',
  '25627',
  '9739.0',
  '83.69014164611774',
  '3.6322515599880094']]

**Bonus: Use ```namedtuple``` to assign the data.**

In [203]:
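# Alternative: use namedtuple
# Define the namedtuple called CovidData and assign the column names.
# namedtuple is already imported above; an alias named `col` is avoided
# because later loops reuse `col` as a variable name.
covid_data = namedtuple('CovidData', header)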


In [204]:
data_named = []

In [205]:
#List of tuples
for row in covid_csv[1:]:
     data_named.append(covid_data._make(row))

In [206]:
type(data_named )
     

list

In [207]:
# Check the header of the namedtuple
covid_data._fields

('Combined_Key',
 'Country_Region',
 'Confirmed',
 'Deaths',
 'Recovered',
 'Active',
 'Incidence_Rate',
 'Case_Fatality_Ratio')
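For reference, here is a small, self-contained sketch of the namedtuple features used in this section. The type name `Record` and the three-field layout are assumptions made for the example, not part of the assignment.

```python
from collections import namedtuple

# Hypothetical three-column layout for illustration
fields = ['Combined_Key', 'Country_Region', 'Confirmed']
Record = namedtuple('Record', fields)

row = ['Albania', 'Albania', '6817']
record = Record._make(row)      # build a namedtuple from any iterable

print(record.Confirmed)         # access a field by name
print(record[2])                # access the same field by position
as_dict = record._asdict()      # convert to a dictionary of field -> value
print(as_dict)
```

Access by name keeps later code readable (`record.Confirmed` instead of `row[2]`), while positional access still works wherever a plain tuple is expected.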

**How many rows and how many columns?**

In [208]:
# print the rows and columns in data - label the output
#get row count 
print(f"How many rows in data list? {len(data)}")
 
#get column count for 1 row
print(f"How many columns in header? {len(header)}") 

How many rows in data list? 3941
How many columns in header? 8


In [209]:
# print the rows and columns in the namedtuple data_named - label the output
len_rows_data_named = len(data_named[:])
print(f"Rows in namedtuple: {len_rows_data_named}")
len_col_data_named = len(covid_data._fields) 
print(f"Columns in namedtuple: {len_col_data_named}")
 


Rows in namedtuple: 3941
Columns in namedtuple: 8


### Problem 3: Check the data type of each column and convert all numeric values to floats. 

Complete the following tasks:

1. Check data types (you only have to do this for one row).
2. Convert all numeric values to floats.


Note: only print the data type once per column (i.e. only do this for 1 row of data).

**Format your output neatly and annotate properly.** Unannotated lists of data types will not receive credit. This means you should match each column name to a data type and display the combination.
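A minimal sketch of one way to pair column names with data types, assuming a short hypothetical header and row (`sample_header`, `sample_row`, and `column_types` are illustrative names):

```python
# Hypothetical header and one data row
sample_header = ['Combined_Key', 'Country_Region', 'Confirmed']
sample_row = ['Albania', 'Albania', '6817']

# zip pairs each column name with the value in the same position
column_types = {
    name: type(value).__name__
    for name, value in zip(sample_header, sample_row)
}

for name, type_name in column_types.items():
    # :<20 left-aligns the column name in a 20-character field
    print(f"{name:<20} {type_name}")
```

Every value read by `csv.reader` is a string, so the numeric columns report `str` until they are converted.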

In [210]:
# 1 Check data types (you only have to do this for one row).
for i,col in enumerate(data[0]):
    print( header[i],"",type(col))
 

Combined_Key  <class 'str'>
Country_Region  <class 'str'>
Confirmed  <class 'str'>
Deaths  <class 'str'>
Recovered  <class 'str'>
Active  <class 'str'>
Incidence_Rate  <class 'str'>
Case_Fatality_Ratio  <class 'str'>


In [211]:
# check data types for each column in tuple
for tup in data_named[0:1]:
    for i,item in enumerate(tup):
        print(covid_data._fields[i]," ", type(item))
        


Combined_Key   <class 'str'>
Country_Region   <class 'str'>
Confirmed   <class 'str'>
Deaths   <class 'str'>
Recovered   <class 'str'>
Active   <class 'str'>
Incidence_Rate   <class 'str'>
Case_Fatality_Ratio   <class 'str'>


In [212]:
for x in data_named[0:1] :
    for i in range(len_col_data_named):
        print( covid_data._fields[i],type(x[i]))

Combined_Key <class 'str'>
Country_Region <class 'str'>
Confirmed <class 'str'>
Deaths <class 'str'>
Recovered <class 'str'>
Active <class 'str'>
Incidence_Rate <class 'str'>
Case_Fatality_Ratio <class 'str'>


In [213]:
for x in data_named[0:1] :
    for i in range(len_col_data_named):
        print(x[i]) 

Afghanistan
Afghanistan
37345
1354
26694
9297.0
95.93267794278722
3.6256526978176464


In [214]:
data_named[0:1]

[CovidData(Combined_Key='Afghanistan', Country_Region='Afghanistan', Confirmed='37345', Deaths='1354', Recovered='26694', Active='9297.0', Incidence_Rate='95.93267794278722', Case_Fatality_Ratio='3.6256526978176464')]

In [215]:
len_col_data_named

8

In [216]:
test = []

for x in data_named[0:1] :
    col_type = []
    col_type.append(type(x.Combined_Key))    
    col_type.append(type (x.Country_Region))   
    col_type.append(type  (x.Confirmed ))
    col_type.append(type  (x.Deaths))  
    col_type.append(type (x.Recovered )) 
    col_type.append(type  (x.Active )) 
    col_type.append(type (x.Incidence_Rate))   
    col_type.append(type (x.Case_Fatality_Ratio))
    test.append(tuple(col_type))
test


[(str, str, str, str, str, str, str, str)]

In [217]:
for i , row in enumerate (data_named):
    col_values = []
    print(row)

CovidData(Combined_Key='Afghanistan', Country_Region='Afghanistan', Confirmed='37345', Deaths='1354', Recovered='26694', Active='9297.0', Incidence_Rate='95.93267794278722', Case_Fatality_Ratio='3.6256526978176464')
CovidData(Combined_Key='Albania', Country_Region='Albania', Confirmed='6817', Deaths='208', Recovered='3552', Active='3057.0', Incidence_Rate='236.882340676906', Case_Fatality_Ratio='3.0511955405603635')
CovidData(Combined_Key='Algeria', Country_Region='Algeria', Confirmed='36699', Deaths='1333', Recovered='25627', Active='9739.0', Incidence_Rate='83.69014164611774', Case_Fatality_Ratio='3.6322515599880094')
CovidData(Combined_Key='Andorra', Country_Region='Andorra', Confirmed='977', Deaths='53', Recovered='855', Active='69.0', Incidence_Rate='1264.4793891153822', Case_Fatality_Ratio='5.424769703172978')
CovidData(Combined_Key='Angola', Country_Region='Angola', Confirmed='1762', Deaths='80', Recovered='577', Active='1105.0', Incidence_Rate='5.361119796138704', Case_Fatality

CovidData(Combined_Key='Hong Kong, China', Country_Region='China', Confirmed='4243', Deaths='63', Recovered='3189', Active='991.0', Incidence_Rate='56.596062311957816', Case_Fatality_Ratio='1.4847984916332784')
CovidData(Combined_Key='Hubei, China', Country_Region='China', Confirmed='68139', Deaths='4512', Recovered='63623', Active='4.0', Incidence_Rate='115.1580192665202', Case_Fatality_Ratio='6.6217584643155885')
CovidData(Combined_Key='Hunan, China', Country_Region='China', Confirmed='1019', Deaths='4', Recovered='1015', Active='0.0', Incidence_Rate='1.4770256558921582', Case_Fatality_Ratio='0.3925417075564279')
CovidData(Combined_Key='Inner Mongolia, China', Country_Region='China', Confirmed='259', Deaths='1', Recovered='254', Active='4.0', Incidence_Rate='1.022099447513812', Case_Fatality_Ratio='0.3861003861003861')
CovidData(Combined_Key='Jiangsu, China', Country_Region='China', Confirmed='659', Deaths='0', Recovered='656', Active='3.0', Incidence_Rate='0.8185318593963483', Case_

CovidData(Combined_Key='Brown, Kansas, US', Country_Region='US', Confirmed='41', Deaths='0', Recovered='0', Active='41.0', Incidence_Rate='428.6909242994564', Case_Fatality_Ratio='0.0')
CovidData(Combined_Key='Butler, Kansas, US', Country_Region='US', Confirmed='290', Deaths='1', Recovered='0', Active='289.0', Incidence_Rate='433.41154668141263', Case_Fatality_Ratio='0.3448275862068966')
CovidData(Combined_Key='Chase, Kansas, US', Country_Region='US', Confirmed='48', Deaths='0', Recovered='0', Active='48.0', Incidence_Rate='1812.6888217522653', Case_Fatality_Ratio='0.0')
CovidData(Combined_Key='Chautauqua, Kansas, US', Country_Region='US', Confirmed='6', Deaths='0', Recovered='0', Active='6.0', Incidence_Rate='184.61538461538456', Case_Fatality_Ratio='0.0')
CovidData(Combined_Key='Cherokee, Kansas, US', Country_Region='US', Confirmed='132', Deaths='1', Recovered='0', Active='131.0', Incidence_Rate='662.0191584332214', Case_Fatality_Ratio='0.7575757575757576')
CovidData(Combined_Key='Ch

CovidData(Combined_Key='Tompkins, New York, US', Country_Region='US', Confirmed='234', Deaths='0', Recovered='0', Active='234.0', Incidence_Rate='229.00763358778642', Case_Fatality_Ratio='0.0')
CovidData(Combined_Key='Ulster, New York, US', Country_Region='US', Confirmed='2077', Deaths='92', Recovered='0', Active='1985.0', Incidence_Rate='1169.6598018842956', Case_Fatality_Ratio='4.429465575349061')
CovidData(Combined_Key='Unassigned, New York, US', Country_Region='US', Confirmed='0', Deaths='0', Recovered='0', Active='0.0', Incidence_Rate='-99.0', Case_Fatality_Ratio='-99.0')
CovidData(Combined_Key='Warren, New York, US', Country_Region='US', Confirmed='312', Deaths='33', Recovered='0', Active='279.0', Incidence_Rate='487.9269360690605', Case_Fatality_Ratio='10.576923076923077')
CovidData(Combined_Key='Washington, New York, US', Country_Region='US', Confirmed='260', Deaths='14', Recovered='0', Active='246.0', Incidence_Rate='424.8088360237893', Case_Fatality_Ratio='5.384615384615384')

CovidData(Combined_Key='Dnipropetrovsk Oblast, Ukraine', Country_Region='Ukraine', Confirmed='1602', Deaths='36', Recovered='1152', Active='414.0', Incidence_Rate='49.96137505430415', Case_Fatality_Ratio='2.247191011235955')
CovidData(Combined_Key='Donetsk Oblast, Ukraine', Country_Region='Ukraine', Confirmed='1011', Deaths='13', Recovered='780', Active='218.0', Incidence_Rate='24.268459572130972', Case_Fatality_Ratio='1.2858555885262115')
CovidData(Combined_Key='Ivano-Frankivsk Oblast, Ukraine', Country_Region='Ukraine', Confirmed='6039', Deaths='160', Recovered='2372', Active='3507.0', Incidence_Rate='439.7590536915293', Case_Fatality_Ratio='2.649445272396092')
CovidData(Combined_Key='Kharkiv Oblast, Ukraine', Country_Region='Ukraine', Confirmed='5169', Deaths='150', Recovered='2552', Active='2467.0', Incidence_Rate='193.19045686235376', Case_Fatality_Ratio='2.9019152640742885')
CovidData(Combined_Key='Kherson Oblast, Ukraine', Country_Region='Ukraine', Confirmed='276', Deaths='5', R

In [218]:
test = []
for x in data_named[0:1]:
    # One type per column; no inner loop is needed because each field is read once
    col_type = [
        type(x.Combined_Key),
        type(x.Country_Region),
        type(x.Confirmed),
        type(x.Deaths),
        type(x.Recovered),
        type(x.Active),
        type(x.Incidence_Rate),
        type(x.Case_Fatality_Ratio),
    ]
    test.append(col_type)
        

In [219]:
# This is a slice: it starts at index 1 and stops before index 2,
# so it returns a list containing only the second row
data_named[1:2]

[CovidData(Combined_Key='Albania', Country_Region='Albania', Confirmed='6817', Deaths='208', Recovered='3552', Active='3057.0', Incidence_Rate='236.882340676906', Case_Fatality_Ratio='3.0511955405603635')]

In [220]:
data_named[0:1]

[CovidData(Combined_Key='Afghanistan', Country_Region='Afghanistan', Confirmed='37345', Deaths='1354', Recovered='26694', Active='9297.0', Incidence_Rate='95.93267794278722', Case_Fatality_Ratio='3.6256526978176464')]

In [221]:
len(data_named)

3941

In [222]:
for i in range(len(data_named)) :
     print(data_named[i:i+1])
       

[CovidData(Combined_Key='Afghanistan', Country_Region='Afghanistan', Confirmed='37345', Deaths='1354', Recovered='26694', Active='9297.0', Incidence_Rate='95.93267794278722', Case_Fatality_Ratio='3.6256526978176464')]
[CovidData(Combined_Key='Albania', Country_Region='Albania', Confirmed='6817', Deaths='208', Recovered='3552', Active='3057.0', Incidence_Rate='236.882340676906', Case_Fatality_Ratio='3.0511955405603635')]
[CovidData(Combined_Key='Algeria', Country_Region='Algeria', Confirmed='36699', Deaths='1333', Recovered='25627', Active='9739.0', Incidence_Rate='83.69014164611774', Case_Fatality_Ratio='3.6322515599880094')]
[CovidData(Combined_Key='Andorra', Country_Region='Andorra', Confirmed='977', Deaths='53', Recovered='855', Active='69.0', Incidence_Rate='1264.4793891153822', Case_Fatality_Ratio='5.424769703172978')]
[CovidData(Combined_Key='Angola', Country_Region='Angola', Confirmed='1762', Deaths='80', Recovered='577', Active='1105.0', Incidence_Rate='5.361119796138704', Case

[CovidData(Combined_Key='Ford, Kansas, US', Country_Region='US', Confirmed='2093', Deaths='10', Recovered='0', Active='2083.0', Incidence_Rate='6225.6462119634725', Case_Fatality_Ratio='0.4777830864787386')]
[CovidData(Combined_Key='Franklin, Kansas, US', Country_Region='US', Confirmed='210', Deaths='1', Recovered='0', Active='209.0', Incidence_Rate='822.1108675227057', Case_Fatality_Ratio='0.4761904761904762')]
[CovidData(Combined_Key='Geary, Kansas, US', Country_Region='US', Confirmed='153', Deaths='2', Recovered='0', Active='151.0', Incidence_Rate='483.10704136406684', Case_Fatality_Ratio='1.30718954248366')]
[CovidData(Combined_Key='Gove, Kansas, US', Country_Region='US', Confirmed='6', Deaths='0', Recovered='0', Active='6.0', Incidence_Rate='227.6176024279211', Case_Fatality_Ratio='0.0')]
[CovidData(Combined_Key='Graham, Kansas, US', Country_Region='US', Confirmed='17', Deaths='0', Recovered='0', Active='17.0', Incidence_Rate='684.931506849315', Case_Fatality_Ratio='0.0')]
[CovidD

[CovidData(Combined_Key='Morgan, Ohio, US', Country_Region='US', Confirmed='30', Deaths='0', Recovered='0', Active='30.0', Incidence_Rate='206.78246484698101', Case_Fatality_Ratio='0.0')]
[CovidData(Combined_Key='Morrow, Ohio, US', Country_Region='US', Confirmed='179', Deaths='2', Recovered='0', Active='177.0', Incidence_Rate='506.6802536231884', Case_Fatality_Ratio='1.11731843575419')]
[CovidData(Combined_Key='Muskingum, Ohio, US', Country_Region='US', Confirmed='248', Deaths='1', Recovered='0', Active='247.0', Incidence_Rate='287.65296062170154', Case_Fatality_Ratio='0.4032258064516129')]
[CovidData(Combined_Key='Noble, Ohio, US', Country_Region='US', Confirmed='17', Deaths='0', Recovered='0', Active='17.0', Incidence_Rate='117.859123682751', Case_Fatality_Ratio='0.0')]
[CovidData(Combined_Key='Ottawa, Ohio, US', Country_Region='US', Confirmed='404', Deaths='26', Recovered='0', Active='378.0', Incidence_Rate='996.9154842689695', Case_Fatality_Ratio='6.435643564356438')]
[CovidData(Co

#### Convert numeric data to floats.
1. Use a loop to convert only the numeric data (i.e. numbers represented as strings) to float values. You'll have to come up with a way to skip the non-numeric data.

2. If you used namedtuples, this is a little trickier since namedtuples are immutable (they can't be changed in place).

Hint: copy the values into a mutable placeholder (such as a list), convert them there, and then put everything back into a namedtuple.
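The hint above can be sketched as follows. This is one possible approach, not the required one: the helper name `to_float_if_numeric` and the three-field `Record` type are assumptions made for the example.

```python
from collections import namedtuple

def to_float_if_numeric(text):
    """Return float(text) when text parses as a number, otherwise return text unchanged."""
    try:
        return float(text)
    except ValueError:
        return text

# Hypothetical three-field record
Record = namedtuple('Record', ['Combined_Key', 'Confirmed', 'Active'])
raw = Record('Albania', '6817', '3057.0')

# A plain list is the mutable placeholder; _make turns it back into a namedtuple
converted = Record._make([to_float_if_numeric(value) for value in raw])
print(converted)
```

One caveat of the `try`/`except` approach: `float` also accepts strings such as `'nan'` and `'inf'`, so converting by column position is safer when the layout of the file is known.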

##### Convert the appropriate elements of ```data``` to floats.

In [223]:
# convert all numerical data to floats.
clean_data = []

for row in data :
    new_row = []
    
    for i, col in enumerate(row):      #loop every item in that row. Row = row, i = index, col is the actual text
         
        if i < 2 :                   #the first 2 columns (Combined_Key, Country_Region) are text, so keep them as strings
            new_row.append(col)
        else:
            new_row.append(float(col))     #convert to float
            
    clean_data.append(new_row)           #add new row to Clean data array
   
for i in range(6):  
    print(clean_data[i])             
  

 

['Afghanistan', 'Afghanistan', 37345.0, 1354.0, 26694.0, 9297.0, 95.93267794278722, 3.6256526978176464]
['Albania', 'Albania', 6817.0, 208.0, 3552.0, 3057.0, 236.882340676906, 3.0511955405603635]
['Algeria', 'Algeria', 36699.0, 1333.0, 25627.0, 9739.0, 83.69014164611774, 3.6322515599880094]
['Andorra', 'Andorra', 977.0, 53.0, 855.0, 69.0, 1264.4793891153822, 5.424769703172978]
['Angola', 'Angola', 1762.0, 80.0, 577.0, 1105.0, 5.361119796138704, 4.540295119182747]
['Antigua and Barbuda', 'Antigua and Barbuda', 92.0, 3.0, 76.0, 13.0, 93.94657299240258, 3.260869565217391]


##### An alternative approach to convert the elements of ```data``` to floats.

In [224]:
# alternative approach

clean_data = []

for row in data :
    new_row  =  [col if i < 2 else  float(col)  for i,col in enumerate(row) ]
    clean_data.append(new_row )

for i in range(6):  
    print( clean_data[i])

['Afghanistan', 'Afghanistan', 37345.0, 1354.0, 26694.0, 9297.0, 95.93267794278722, 3.6256526978176464]
['Albania', 'Albania', 6817.0, 208.0, 3552.0, 3057.0, 236.882340676906, 3.0511955405603635]
['Algeria', 'Algeria', 36699.0, 1333.0, 25627.0, 9739.0, 83.69014164611774, 3.6322515599880094]
['Andorra', 'Andorra', 977.0, 53.0, 855.0, 69.0, 1264.4793891153822, 5.424769703172978]
['Angola', 'Angola', 1762.0, 80.0, 577.0, 1105.0, 5.361119796138704, 4.540295119182747]
['Antigua and Barbuda', 'Antigua and Barbuda', 92.0, 3.0, 76.0, 13.0, 93.94657299240258, 3.260869565217391]


##### Convert the appropriate elements of data_named to floats.
Note that this is a touch more complicated since namedtuples are immutable and elements cannot be changed. 

So the approach is to create a dictionary for each row and add each element of the namedtuple to it, converting the type when necessary. We end up with a dictionary of row dictionaries (keyed by row index) holding all the same information, but with every numerical value stored as a float.
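As an alternative sketch of the same idea, `_asdict()` can produce the per-row dictionary directly. The example below builds a plain list of dictionaries; the `Record` type, the sample records, and the `text_fields` set are assumptions made for illustration.

```python
from collections import namedtuple

# Hypothetical three-field records
Record = namedtuple('Record', ['Combined_Key', 'Country_Region', 'Deaths'])
records = [
    Record('Albania', 'Albania', '208'),
    Record('Algeria', 'Algeria', '1333'),
]

text_fields = {'Combined_Key', 'Country_Region'}   # fields left as strings

# One dictionary per record; every field outside text_fields becomes a float
row_dicts = [
    {
        name: (value if name in text_fields else float(value))
        for name, value in record._asdict().items()
    }
    for record in records
]

print(row_dicts[0])
```

Naming the text fields explicitly avoids relying on column positions, at the cost of having to update the set if the header changes.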

In [225]:
# Build one dictionary per row and store them in a dictionary keyed by row index
data_dict = {}
for j, x in enumerate(data_named):                     # one namedtuple per row
    row_dict = {}
    for i, value in enumerate(x):                      # add each element of the namedtuple to the dictionary
        if i > 1 and value is not None:                # convert all numeric columns to float
            row_dict[covid_data._fields[i]] = float(value)
        else:
            row_dict[covid_data._fields[i]] = value    # add non-numeric columns without converting

    data_dict[j] = row_dict
                                                           

for i,item in enumerate(data_dict.items()):
    if i < 3:
        print(item)

(0, {'Combined_Key': 'Afghanistan', 'Country_Region': 'Afghanistan', 'Confirmed': 37345.0, 'Deaths': 1354.0, 'Recovered': 26694.0, 'Active': 9297.0, 'Incidence_Rate': 95.93267794278722, 'Case_Fatality_Ratio': 3.6256526978176464})
(1, {'Combined_Key': 'Albania', 'Country_Region': 'Albania', 'Confirmed': 6817.0, 'Deaths': 208.0, 'Recovered': 3552.0, 'Active': 3057.0, 'Incidence_Rate': 236.882340676906, 'Case_Fatality_Ratio': 3.0511955405603635})
(2, {'Combined_Key': 'Algeria', 'Country_Region': 'Algeria', 'Confirmed': 36699.0, 'Deaths': 1333.0, 'Recovered': 25627.0, 'Active': 9739.0, 'Incidence_Rate': 83.69014164611774, 'Case_Fatality_Ratio': 3.6322515599880094})


##### An alternative approach to convert the appropriate elements of ```data_named``` to floats.

In [238]:
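def convert_namedtuple(tuple_):
    # Every value read from the CSV is a string, so an isinstance check for int/float
    # never matches. Convert by position instead: the first two fields
    # (Combined_Key, Country_Region) are text and the rest are numeric strings.
    return tuple(value if i < 2 else float(value) for i, value in enumerate(tuple_))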

In [240]:
 

for i , row in enumerate (data_named):
                       
    #x = covid_data._make(convert_namedtuple(row))   #convert converted tuple to named tuple
    #data_named[i].update(tuple(tuple_row))           #update tuple array with converted named tuple
    data_named[i] = covid_data._make(convert_namedtuple(row)) 
     

In [243]:
# Explicit version: rebuild every row field by field, converting the numeric fields to float.
# The result is stored in new_data_named, which the following cells inspect.
new_data_named = []
for x in data_named:
    new_data_named.append(covid_data(
        x.Combined_Key,
        x.Country_Region,
        float(x.Confirmed),
        float(x.Deaths),
        float(x.Recovered),
        float(x.Active),
        float(x.Incidence_Rate),
        float(x.Case_Fatality_Ratio),
    ))
     

In [None]:
new_data_named[0:3]

In [None]:
covid_data._fields

In [None]:

# check the result

# data_dict is a dictionary of row dictionaries keyed by row index
print(f"How many rows are in data_dict? {len(data_dict)}")
print(f"How many columns are in one row of data_dict? {len(data_dict[0])}")

# row count of the namedtuple list
print(f"How many rows are in this list of namedtuples? {len(new_data_named)}")

# column count for one row
print(f"How many columns are in one namedtuple row? {len(covid_data._fields)}")

### Part 4: Calculate the average number of active cases and average number of deaths.

1. Compute the average for active cases
2. Compute the average number of deaths.
3. Compute the average total number of cases.

Hint: Review the data dictionary to determine the correct information to use.

Hint: Don't overthink this. Try to find the simplest approach.
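A minimal sketch of the averaging step, assuming the rows have already been converted to floats. The shortened six-column layout, the `sample` rows, and the index constants are illustrative; `statistics.fmean` is part of the standard library (Python 3.8+).

```python
from statistics import fmean

# Hypothetical converted rows:
# [Combined_Key, Country_Region, Confirmed, Deaths, Recovered, Active]
sample = [
    ['Afghanistan', 'Afghanistan', 37345.0, 1354.0, 26694.0, 9297.0],
    ['Albania', 'Albania', 6817.0, 208.0, 3552.0, 3057.0],
]

DEATHS, ACTIVE = 3, 5   # column positions in the header

avg_active = fmean(row[ACTIVE] for row in sample)
avg_deaths = fmean(row[DEATHS] for row in sample)

print(f"Average active cases: {avg_active:.2f}")
print(f"Average deaths: {avg_deaths:.2f}")
```

`sum(values) / len(values)` gives the same result; `fmean` simply avoids writing the division by hand.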


#### Find the average using the standard ```data```

In [None]:
# Create a list for active case counts and deaths
active_cases = []
deaths = []
recovered = []
confirmed = []

for row in clean_data :
    active_cases.append(row[5])
    deaths.append(  row[3])   
    recovered.append( row[4])
    confirmed.append(row[2])

In [None]:
def get_avg(lst):
    return  round(sum(lst) / len(lst),2)

In [None]:
active_avg  = get_avg(active_cases)
print(" average for active cases =", active_avg  ) 

In [None]:
death_avg = get_avg(deaths)
print(" average number of deaths =", death_avg)

In [None]:
#per data dictionary , Active: Active cases = total cases - total recovered - total deaths.
#so, total cases = Active + Recovered + Deaths
recov_avg = get_avg(recovered)
print(" average for recovered cases =", recov_avg)  


In [None]:
total_cases = sum(active_cases) + sum(deaths) + sum(recovered)

In [None]:
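# Calculate the average number of cases per country.
# A country can span several rows, so collect each row's total
# (Active + Deaths + Recovered) by country and then average those totals.
cases_by_country = defaultdict(list)
for x in data_named:
    cases_by_country[x.Country_Region].append(float(x.Active) + float(x.Deaths) + float(x.Recovered))

avg_cases_country = {country: get_avg(values) for country, values in cases_by_country.items()}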
 

In [None]:
# Display the first 3 entries of avg_cases_country
 
for i,item in enumerate(avg_cases_country.items()):
    if i < 3:
        print(item)

#### Find the average using the namedtuple ```data_named```

In [None]:
def find_avg(tuple_):
    avg_Active = round(sum(float(x.Active)  for x in tuple_)/len(tuple_),2) 
    avg_Deaths = round(sum(float(x.Deaths)  for x in tuple_)/len(tuple_),2)
    return avg_Active,avg_Deaths


In [None]:
find_avg( data_named)
print(" average for active cases =", find_avg( data_named)[0]) 
print(" average for death cases =", find_avg( data_named)[1]) 
    

In [None]:
# Create a list for active case counts and deaths
# Note: Don't forget to convert to floats
 

# namedtuple approach
active   = [float(x.Active)  for x in data_named]
deaths  = [float(x.Deaths)  for x in data_named]


active_avg  = get_avg(active)
print(" average for active cases =", active_avg ) 

In [None]:

death_avg  = get_avg(deaths)
print(" average number of deaths =", death_avg)

**Compute the Average total number of cases**

#### What information do we need to get this result?
 total deaths + total active cases + total recovered cases (per the data dictionary, total cases = Active + Recovered + Deaths)
 

In [None]:
 #Calculate the average total number of cases per country
avg_total_cases = active_avg  + death_avg  + recov_avg

avg_total_cases

### Part 5: Create an object ```countries``` that contains all the country names in the data set. Each country should only be listed once.

1. Create a list (or other python data type) of unique country names.
2. Print total number of unique countries represented in the data set.
3. Print the first 5 names and the last 5 names - Print your results neatly and annotate. Your results should be in alphabetical order.
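The steps above can be sketched with a set comprehension followed by `sorted`. The sample rows below are a hypothetical two-column extract, and `unique_countries` is an illustrative name.

```python
# Hypothetical [Combined_Key, Country_Region] rows
sample = [
    ['Hubei, China', 'China'],
    ['Albania', 'Albania'],
    ['Hunan, China', 'China'],
    ['Brown, Kansas, US', 'US'],
]

# A set keeps each country once; sorted() returns an alphabetical list
unique_countries = sorted({row[1] for row in sample})

print(f"Number of unique countries: {len(unique_countries)}")
print(f"First 2 countries: {unique_countries[:2]}")
print(f"Last 2 countries: {unique_countries[-2:]}")
```

Membership tests on a set are fast, so this also scales better than checking `cell not in countries` against a growing list.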


In [None]:
header

In [None]:
# Collect the unique values of Country_Region (column index 1) and sort them alphabetically
countries = sorted({row[1] for row in data})
     

In [None]:
countries [0:3]

In [None]:
# Print total number of unique countries represented in the data set.
 
len(countries)

In [None]:
# print the first 5.
countries[:5]

In [None]:
# print the last 5
countries[-5:]

### Part 6: Calculate the average number of confirmed cases for the first 5 countries and the last 5 countries.

1. Determine the average number of confirmed cases for the first 5 countries.
2. Determine the average number of confirmed cases for the last 5 countries.


Note: Print your results neatly and properly annotated.

Hint: Think carefully about the easiest way to count the number of confirmed cases!


In [None]:
# Total confirmed cases per country (a country can span several rows)
confirmed_by_country = defaultdict(float)
for row in data:
    confirmed_by_country[row[1]] += float(row[2])

country_names = sorted(confirmed_by_country)      # alphabetical order

In [None]:
# Determine the average number of confirmed cases for the first 5 countries.
avg_confirmed_first5 = sum(confirmed_by_country[c] for c in country_names[:5]) / 5
print(f"Average number of confirmed cases, first 5 countries: {avg_confirmed_first5:.2f}")

In [None]:
# Determine the average number of confirmed cases for the last 5 countries.
avg_confirmed_last5 = sum(confirmed_by_country[c] for c in country_names[-5:]) / 5
print(f"Average number of confirmed cases, last 5 countries: {avg_confirmed_last5:.2f}")

### Problem 7: Create a dictionary of confirmed cases in the EU.

The keys in the dictionary are the EU countries and the values are the total number of confirmed cases for each country.

**Expected output**: `{'Austria': 22439, 'Belgium': 75647, ...  }` (*required*)

**Bonus**: use `.defaultdict()` to simplify your code. (*optional*)

See: [Python Doc - defaultdict](https://docs.python.org/3/library/collections.html?highlight=defaultdict#collections.defaultdict) or [Stackoverflow - defaultdict](https://stackoverflow.com/questions/5900578/how-does-collections-defaultdict-work)

In [None]:
# a list of EU countries
eu = ['Austria',
'Belgium',
'Bulgaria',
'Croatia',
'Cyprus',
'Czechia',
'Denmark',
'Estonia',
'Finland',
'France',
'Germany',
'Greece',
'Hungary',
'Ireland',
'Italy',
'Latvia',
'Lithuania',
'Luxembourg',
'Malta',
'Netherlands',
'Poland',
'Portugal',
'Romania',
'Slovakia',
'Slovenia',
'Spain',
'Sweden']

In [None]:
header

In [None]:
# The keys are the EU countries and the values are the total number of confirmed cases.
# Some countries span several rows, so add each row to a running total instead of overwriting it.
is_eu = {}
for row in data:
    country = row[1]
    if country in eu:
        is_eu[country] = is_eu.get(country, 0) + int(float(row[2]))

for i,item in enumerate(is_eu.items()):
    if i < 3:
        print ( item )  
        
type(is_eu)

A concise way to retrieve the first few items of a dictionary is to combine a comprehension with slicing. If you don't need a particular order (you just want the first n pairs in insertion order), you can use a dictionary comprehension like this:

first2pairs = {k: mydict[k] for k in list(mydict)[:2]}

In Python 3, `mydict.keys()` and `mydict.values()` return views that cannot be sliced, so convert them to a list first. A comprehension like this is often faster than the equivalent "for x in y" loop.

If you don't need the keys (only the values), you can use a list comprehension:

first2vals = [v for v in list(mydict.values())[:2]]

If you need the values sorted based on their keys, it's not much more trouble:

first2vals = [mydict[k] for k in sorted(mydict)[:2]]

or, if you need the keys as well:

first2pairs = {k: mydict[k] for k in sorted(mydict)[:2]}

In [None]:
is_eu_tup = {}

# if you used a named tuple - answer here

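for x in data_named:
    if x.Country_Region in eu:
        # add to the running total so that multi-row countries are not overwritten
        is_eu_tup[x.Country_Region] = is_eu_tup.get(x.Country_Region, 0) + float(x.Confirmed)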
        
        
        

In [None]:
for i,item in enumerate(is_eu_tup.items()):
    if i < 3:
        print ( item )  
        
type(is_eu_tup)





In [None]:
# Try with a defaultdict.
# A plain dict raises a KeyError when you look up a key that isn't there;
# a defaultdict instead inserts the missing key with a default value.
from collections import defaultdict

# Look up a key that is not in the plain dictionary is_eu_tup
is_eu_tup['Case_Fatality ']
# -> raises KeyError
        

In [None]:
# Set a default value for keys that are not found.
# Note: this replaces is_eu_tup with a new, empty defaultdict.
is_eu_tup = defaultdict(lambda: "Key is Not present in the dictionary")

In [None]:
is_eu_tup['Case_Fatality '] 
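For the bonus, `defaultdict(int)` is probably the more useful form: missing keys start at 0, so totals can be accumulated without checking whether the key exists. The sketch below uses hypothetical rows (the two French regions and their counts are made up; `eu_sample` is a shortened stand-in for the `eu` list).

```python
from collections import defaultdict

# Hypothetical [Combined_Key, Country_Region, Confirmed] rows
sample_rows = [
    ['Region A, France', 'France', '100'],
    ['Region B, France', 'France', '50'],
    ['Austria', 'Austria', '22439'],
    ['Albania', 'Albania', '6817'],
]
eu_sample = ['Austria', 'France']   # shortened stand-in for the full EU list

confirmed_by_country = defaultdict(int)   # missing keys start at 0
for combined_key, country, confirmed in sample_rows:
    if country in eu_sample:
        confirmed_by_country[country] += int(confirmed)

result = dict(confirmed_by_country)   # plain dict for display
print(result)
```

Because a lookup on a missing key inserts it, it is worth converting to a plain `dict` once the totals are built so that later lookups do not add stray keys.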

### Problem 8: Compare the Case Fatality Rate in the EU to that in the US and North America.

1. Determine the CFR in the EU
2. Determine the CFR in the US
3. Determine the CFR in North America

Note: The Case Fatality Rate is a feature in this data set. You are not to use that feature. You should compute the CFR from the other available features. Use the existing CFR column as a check.

Per data dictionary
Case_Fatality_Ratio (%): Case-Fatality Ratio (%) = Number recorded deaths / Number cases.
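A pure-Python sketch of the formula above, assuming rows shaped like `[Combined_Key, Country_Region, Confirmed, Deaths]`. The function name `case_fatality_rate` and the sample rows are illustrative. Deaths and confirmed cases are summed over the whole region first and divided once, rather than averaging the per-row ratios.

```python
def case_fatality_rate(rows, countries):
    """CFR (%) = total deaths / total confirmed * 100 for rows whose country is in `countries`."""
    deaths = sum(float(r[3]) for r in rows if r[1] in countries)
    confirmed = sum(float(r[2]) for r in rows if r[1] in countries)
    # Guard against regions with no confirmed cases
    return deaths / confirmed * 100 if confirmed else float('nan')

# Hypothetical extract: [Combined_Key, Country_Region, Confirmed, Deaths]
sample = [
    ['Afghanistan', 'Afghanistan', '37345', '1354'],
    ['Albania', 'Albania', '6817', '208'],
]

cfr_afg = case_fatality_rate(sample, {'Afghanistan'})
cfr_both = case_fatality_rate(sample, {'Afghanistan', 'Albania'})
print(f"CFR Afghanistan: {cfr_afg:.2f}%")
print(f"CFR both countries: {cfr_both:.2f}%")
```

For a single-row country the result can be checked against the existing `Case_Fatality_Ratio` column; for multi-row regions the aggregated value will generally differ from a simple mean of that column.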

In [None]:
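# NOTE: this cell assumes pandas is imported as pd and that prod_df is a DataFrame
# holding this data set; neither is defined in the cells above.
# create filter
is_eu = prod_df['Country_Region'].isin(eu) 

# Preview the Country_Region values selected by the filter
prod_df[is_eu]['Country_Region']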

#cfr : total deaths/total confirmed * 100
df_eu_cfr = pd.DataFrame(prod_df[is_eu][['Country_Region','Confirmed','Deaths','Case_Fatality_Ratio']]  )

df_eu_cfr['Calculated Case Fatality Rate'] = df_eu_cfr['Deaths']/df_eu_cfr['Confirmed'] * 100
 
df_eu_cfr

In [None]:
# countries in North America
na = ['Antigua and Barbuda',
'Bahamas',
'Barbados',
'Belize',
'Canada',
'Costa Rica',
'Cuba',
'Dominica',
'Dominican Republic',
'El Salvador',
'Grenada',
'Guatemala',
'Haiti',
'Honduras',
'Jamaica',
'Mexico',
'Nicaragua',
'Panama',
'Saint Kitts and Nevis',
'Saint Lucia',
'Saint Vincent and the Grenadines',
'Trinidad and Tobago',
'US'] 

In [None]:
#create filter
is_na = prod_df['Country_Region'].isin(na) 

# Preview the Country_Region values selected by the North America filter
prod_df[is_na]['Country_Region']

#cfr : total deaths/total confirmed * 100
df_na_cfr = pd.DataFrame(prod_df[is_na][['Country_Region','Confirmed','Deaths','Case_Fatality_Ratio']]  )

df_na_cfr['Calculated Case Fatality Rate'] = df_na_cfr['Deaths']/df_na_cfr['Confirmed']  * 100
 
df_na_cfr

In [None]:
#create filter
is_us = prod_df['Country_Region'] == 'US'

# Preview the Country_Region values selected by the US filter
prod_df[is_us]['Country_Region']

#cfr : total deaths/total confirmed * 100
df_us_cfr = pd.DataFrame(prod_df[is_us][['Country_Region','Confirmed','Deaths','Case_Fatality_Ratio']]  )

df_us_cfr['Calculated Case Fatality Rate'] = df_us_cfr['Deaths']/df_us_cfr['Confirmed'] * 100
 
df_us_cfr

In [None]:
# write a function: build the CFR table for a region ('na' or 'eu', using mylist)
# or for a single country named by arg_region

def create_df_calc_cfr(arg_region, mylist):
    if arg_region in ('na', 'eu'):
        # regional request: keep the rows whose country is in mylist
        is_in = prod_df['Country_Region'].isin(mylist)
    else:
        # single-country request: match the country name directly
        is_in = prod_df['Country_Region'] == arg_region
    df_isin_cfr = prod_df.loc[is_in, ['Country_Region','Confirmed','Deaths','Case_Fatality_Ratio']].copy()
    # compute the CFR from the selected rows themselves, not from df_us_cfr
    df_isin_cfr['Calculated Case Fatality Rate'] = df_isin_cfr['Deaths']/df_isin_cfr['Confirmed'] * 100
    return df_isin_cfr

 

In [None]:
create_df_calc_cfr('US',[])

In [None]:
create_df_calc_cfr('na',na)
 


### Bonus 1: Craft a problem statement about this data that interests you, and then answer it!


In [None]:
# The highest number of confirmed cases reported by a country:
largest = max(float(row[2]) for row in data)



In [None]:
print(f"The highest number of confirmed cases reported by a country: {largest}")

In [None]:
# The lowest number of confirmed cases reported by a country:
smallest = min(float(row[2]) for row in data)



In [None]:
print(f"The lowest number of confirmed cases reported by a country: {smallest}")

### Bonus 2: Repeat the above analysis using Pandas!



##### Part 5: Create an object ```countries``` that contains all the country names in the data set. Each country should only be listed once.

1. Create a list (or other python data type) of unique country names.
2. Print total number of unique countries represented in the data set.
3. Print the first 5 names and the last 5 names - Print your results neatly and annotate. Your results should be in alphabetical order.
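
As a cross-check on the Pandas result, the three steps above can also be sketched in plain Python. The `country_column` list is a hypothetical sample of `Country_Region` values with repeats, not the real data.

```python
# Hypothetical sample of Country_Region values, including repeats.
country_column = ['US', 'Albania', 'US', 'Canada', 'Afghanistan',
                  'Canada', 'Belgium', 'Austria']

# A set removes duplicates; sorted() returns an alphabetical list.
countries = sorted(set(country_column))

print(f"Unique countries: {len(countries)}")
print(f"First 5: {countries[:5]}")
print(f"Last 5:  {countries[-5:]}")
```

With fewer than ten unique names, as in this sample, the first-5 and last-5 slices overlap. On the full data set they will not.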


In [None]:
# Load the data with Pandas and collect the unique country names in alphabetical order
import pandas as pd
prod_df = pd.read_csv(DATA_FILE, sep=',')

country_list = prod_df['Country_Region'].drop_duplicates().sort_values()

In [None]:
country_list

In [None]:
# Print total number of unique countries represented in the data set.
 
prod_df['Country_Region'].nunique() 

In [None]:
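# print the first 5 unique country names, in alphabetical order
prod_df['Country_Region'].drop_duplicates().sort_values().head(5)

In [None]:
# print the last 5 unique country names, in alphabetical order
prod_df['Country_Region'].drop_duplicates().sort_values().tail(5)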

#####  Part 6: Calculate the average number of confirmed cases for the first 5 countries and the last 5 countries.

1. Determine the average number of confirmed cases for the first 5 countries.
2. Determine the average number of confirmed cases for the last 5 countries.


Note: Print your results neatly and properly annotated.

Hint: Think carefully about the easiest way to count the number of confirmed cases!


In [None]:
prod_df['Confirmed'].head().mean()

In [None]:
# average of the first 5 rows (no sorting needed: the mean does not depend on order)
first5_avg = prod_df['Confirmed'].head(5).mean()

print(f"The average number of confirmed cases for the first 5 countries: {first5_avg}")

In [None]:
last5_avg = prod_df['Confirmed'].tail(5).mean()

print(f"The average number of confirmed cases for the last 5 countries: {last5_avg}")

In [None]:
# write a function
def findavg(df_data):
    # use the DataFrame passed in rather than the global prod_df
    last5_avg = df_data['Confirmed'].tail(5).mean()
    first5_avg = df_data['Confirmed'].head(5).mean()
    return last5_avg, first5_avg

last5_avg, first5_avg = findavg(prod_df)
print(f"The average number of confirmed cases for the first 5 countries: {first5_avg}")
print(f"The average number of confirmed cases for the last 5 countries: {last5_avg}")

#####  Problem 7: Create a dictionary of confirmed cases in the EU.

The keys in the dictionary are the countries in the EU, and the values are the total number of confirmed cases in each country.

**Expected output**: `{'Austria': 22439, 'Belgium': 75647, ...  }` (*required*)

**Bonus**: use `collections.defaultdict` to simplify your code. (*optional*)

See: [Python Doc - defaultdict](https://docs.python.org/3/library/collections.html?highlight=defaultdict#collections.defaultdict) or [Stackoverflow - defaultdict](https://stackoverflow.com/questions/5900578/how-does-collections-defaultdict-work)
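
For the bonus, one possible approach is `defaultdict(int)`, which starts every new key at 0 so totals can be accumulated without checking whether the key exists. This is a sketch only: `eu_sample` is a two-country subset of the EU list, and the `rows` pairs are hypothetical, with Belgium split across two rows to show the accumulation.

```python
from collections import defaultdict

# Subset of the EU list, for illustration only.
eu_sample = {'Austria', 'Belgium'}

# Hypothetical (country, confirmed) pairs; a country may span several rows.
rows = [('Austria', 22439), ('Belgium', 70000), ('Belgium', 5647), ('Albania', 6817)]

# Missing keys start at int() == 0, so += works on first sight of a country.
eu_confirmed = defaultdict(int)
for country, confirmed in rows:
    if country in eu_sample:
        eu_confirmed[country] += confirmed

print(dict(eu_confirmed))
```

Converting back with `dict(...)` at the end gives a plain dictionary, so later lookups of missing countries raise a `KeyError` instead of silently adding zero entries.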

In [None]:
# a list of EU countries
eu = ['Austria',
'Belgium',
'Bulgaria',
'Croatia',
'Cyprus',
'Czechia',
'Denmark',
'Estonia',
'Finland',
'France',
'Germany',
'Greece',
'Hungary',
'Ireland',
'Italy',
'Latvia',
'Lithuania',
'Luxembourg',
'Malta',
'Netherlands',
'Poland',
'Portugal',
'Romania',
'Slovakia',
'Slovenia',
'Spain',
'Sweden']

In [None]:
#create filter
is_eu = prod_df['Country_Region'].isin(eu) 

# preview the EU rows that will go into the dictionary
prod_df[is_eu][['Country_Region','Confirmed']] 

In [None]:
# Sum the confirmed cases per EU country, then use Series.to_dict() to get
# a {country: total confirmed} dictionary that matches the expected output
eu_confirmed = prod_df[is_eu].groupby('Country_Region')['Confirmed'].sum()

my_dictionary = eu_confirmed.to_dict()
my_dictionary

In [None]:
# if you used a named tuple - answer here
# convert the list of named tuples into a dictionary of columns

covid_data_dict = {}

# build the keys: one empty list per column header (Combined_Key, Country_Region, Confirmed, ...)
for p in header:
    covid_data_dict[p] = []

# append each cell to the list for its column
for row in array_of_tuples:
    for i, cell in enumerate(row):
        covid_data_dict[header[i]].append(cell)

#print(covid_data_dict.items())





In [None]:
# try with a defaultdict
# a plain dict raises a KeyError for a missing key; a defaultdict instead
# inserts the key with a default value the first time it is looked up
from collections import defaultdict

# check the value of the key 'Case_Fatality_Ratio' in covid_data_dict
covid_data_dict['Case_Fatality_Ratio'][0:5]
        

In [None]:
# look up a key that is not in the dictionary
covid_data_dict['Case_Fatality ']
# raises a KeyError

In [None]:
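# set a default value for keys that are not found; pass the existing dictionary
# as the second argument so its data is kept
covid_data_dict = defaultdict(lambda: "Key is not present in the dictionary", covid_data_dict)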

In [None]:
covid_data_dict['Case_Fatality '] 

#####  Problem 8: Compare the Case Fatality Rate in the EU to that in the US and North America.

1. Determine the CFR in the EU
2. Determine the CFR in the US
3. Determine the CFR in North America

Note: The Case Fatality Rate is a feature in this data set. You are not to use that feature. You should compute the CFR from the other available features. Use the existing CFR column as a check.

Per the data dictionary:
Case_Fatality_Ratio (%) = number of recorded deaths / number of cases, expressed as a percentage (multiply the ratio by 100).

In [None]:
# filter: rows whose country is in the EU list
is_eu = prod_df['Country_Region'].isin(eu)

# keep only the columns needed for the CFR comparison (copy to avoid modifying a view)
df_eu_cfr = prod_df.loc[is_eu, ['Country_Region','Confirmed','Deaths','Case_Fatality_Ratio']].copy()

# CFR (%) = deaths / confirmed * 100
df_eu_cfr['Calculated Case Fatality Rate'] = df_eu_cfr['Deaths']/df_eu_cfr['Confirmed'] * 100

df_eu_cfr

In [None]:
# countries in North America
na = ['Antigua and Barbuda',
'Bahamas',
'Barbados',
'Belize',
'Canada',
'Costa Rica',
'Cuba',
'Dominica',
'Dominican Republic',
'El Salvador',
'Grenada',
'Guatemala',
'Haiti',
'Honduras',
'Jamaica',
'Mexico',
'Nicaragua',
'Panama',
'Saint Kitts and Nevis',
'Saint Lucia',
'Saint Vincent and the Grenadines',
'Trinidad and Tobago',
'US'] 

In [None]:
# filter: rows whose country is in the North America list
is_na = prod_df['Country_Region'].isin(na)

# keep only the columns needed for the CFR comparison (copy to avoid modifying a view)
df_na_cfr = prod_df.loc[is_na, ['Country_Region','Confirmed','Deaths','Case_Fatality_Ratio']].copy()

# CFR (%) = deaths / confirmed * 100
df_na_cfr['Calculated Case Fatality Rate'] = df_na_cfr['Deaths']/df_na_cfr['Confirmed'] * 100

df_na_cfr

In [None]:
# filter: rows for the US only
is_us = prod_df['Country_Region'] == 'US'

# keep only the columns needed for the CFR comparison (copy to avoid modifying a view)
df_us_cfr = prod_df.loc[is_us, ['Country_Region','Confirmed','Deaths','Case_Fatality_Ratio']].copy()

# CFR (%) = deaths / confirmed * 100
df_us_cfr['Calculated Case Fatality Rate'] = df_us_cfr['Deaths']/df_us_cfr['Confirmed'] * 100

df_us_cfr

In [None]:
# write a function: build the CFR table for a region ('na' or 'eu', using mylist)
# or for a single country named by arg_region

def create_df_calc_cfr(arg_region, mylist):
    if arg_region in ('na', 'eu'):
        # regional request: keep the rows whose country is in mylist
        is_in = prod_df['Country_Region'].isin(mylist)
    else:
        # single-country request: match the country name directly
        is_in = prod_df['Country_Region'] == arg_region
    df_isin_cfr = prod_df.loc[is_in, ['Country_Region','Confirmed','Deaths','Case_Fatality_Ratio']].copy()
    # compute the CFR from the selected rows themselves, not from df_us_cfr
    df_isin_cfr['Calculated Case Fatality Rate'] = df_isin_cfr['Deaths']/df_isin_cfr['Confirmed'] * 100
    return df_isin_cfr

 

In [None]:
create_df_calc_cfr('US',[])

In [None]:
create_df_calc_cfr('na',na)
 