<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">

# Project 2: Analyzing Covid-19 Data

_Author: B Rhodes (DC)_

---

For Project 2, you'll be using Python to perform fundamental exploratory data analysis (EDA) tasks. In the other notebook in this project, we can use Pandas, but this notebook you should only use Python. The purpose here is to flex your Python muscles while thinking about data.

Below you'll import a data file with information on Covid-19 in a number of patients from the Cleveland Clinic. The original data along with a data dictionary can be found at the [John Hopkins University: CSSEGISandData/COVID-19](https://github.com/CSSEGISandData/COVID-19/tree/master/csse_covid_19_data).


For these exercises, you will conduct basic exploratory data analysis using Python (Pandas is not allowed for this notebook). The goal is to understand the some fundamentals of the COVID-19 data: These exercises will allow you to practice business analysis skills while also becoming more comfortable with Python.

---

## Part 1: Load the data & initial exploration

### Problem 1: Load the file and store it in an object called `covid_csv`.

Hint: This is a csv (comma-separated value) file, so we'll use `csv.reader()`

See: [Python Docs - csv](https://docs.python.org/2/library/csv.html).



In [None]:
import csv

# import namedtuple as an option to store the data rows
from collections import namedtuple, defaultdict

DATA_FILE = './data/covid.csv'


#### Load the data

In [2]:
import pandas as pd
import numpy as np


DATA_FILE = '/content/covid.csv'



# loading the data into a panda datarame
prod = pd.read_csv(DATA_FILE)
prod.head()

Unnamed: 0,Combined_Key,Country_Region,Confirmed,Deaths,Recovered,Active,Incidence_Rate,Case_Fatality_Ratio
0,Afghanistan,Afghanistan,37345,1354,26694,9297.0,95.932678,3.625653
1,Albania,Albania,6817,208,3552,3057.0,236.882341,3.051196
2,Algeria,Algeria,36699,1333,25627,9739.0,83.690142,3.632252
3,Andorra,Andorra,977,53,855,69.0,1264.479389,5.42477
4,Angola,Angola,1762,80,577,1105.0,5.36112,4.540295


### Problem 2: Separate ```covid_csv``` into a `header` and `data`.

Complete the following tasks:

1. Split the covid_csv object into a ```header``` and ```data```.
    1. display the ```header```
    2. display the first 3 rows of ```data```.
2. What are dimensions of your data? Print the result (neatly formatted and each dimension identified.)


**Define the header and display the contents.**

In [4]:
import pandas as pd



# header = prod.columns.tolist()
# print(header)

print(prod.columns)

Index(['Combined_Key', 'Country_Region', 'Confirmed', 'Deaths', 'Recovered',
       'Active', 'Incidence_Rate', 'Case_Fatality_Ratio'],
      dtype='object')


**Define the data and display the first 3 rows.**

In [5]:
# Assign the data only
# display the first 3 rows

prod.head(3)


Unnamed: 0,Combined_Key,Country_Region,Confirmed,Deaths,Recovered,Active,Incidence_Rate,Case_Fatality_Ratio
0,Afghanistan,Afghanistan,37345,1354,26694,9297.0,95.932678,3.625653
1,Albania,Albania,6817,208,3552,3057.0,236.882341,3.051196
2,Algeria,Algeria,36699,1333,25627,9739.0,83.690142,3.632252


**Bonus: Use ```namedtuple``` to assign the data.**

In [6]:
import pandas as pd
from collections import namedtuple


prod = pd.DataFrame({
    'Country_Region': ['USA', 'Canada'],
    'Confirmed': [100, 200],
    'Deaths': [5, 10],
    'Recovered': [95, 190],
    'Active': [0, 0],
    'Incidence_Rate': [300.5, 450.2],
    'Case_Fatality_Ratio': [5.0, 5.0]
})

# Alternative : use namedtuple
# define the namedtuple called CovidData and assign column names
CovidData = namedtuple('CovidData', [col.replace(' ', '_') for col in prod.columns])



# create the tuples from the original data
# use a list comprehension for compactness - could also be a for-loop.
data_named = [CovidData(*row) for row in prod.itertuples(index=False, name=None)] #itertuples() method will iterate through each row of the Dataframe


first_record = data_named[0]
print(f"Country/Region: {first_record.Country_Region}, Confirmed: {first_record.Confirmed}")


Country/Region: USA, Confirmed: 100


In [7]:
# Check the header of the namedtuple
print(CovidData._fields)

('Country_Region', 'Confirmed', 'Deaths', 'Recovered', 'Active', 'Incidence_Rate', 'Case_Fatality_Ratio')


**How many rows and how many columns?**

In [8]:
# print the rows and columns in data - label the output


num_rows = len(data_named) # number of rows
num_columns = len(CovidData._fields) # number of columns

print(f"Number of rows: {num_rows}")
print(f"Number of columns: {num_columns}")

Number of rows: 2
Number of columns: 7


In [9]:
# print the rows and columns in the namedtuple data_named - label the output


num_rows = len(data_named)
num_columns = len(CovidData._fields)

# Printing the results with labels
print("Number of rows in data_named:", num_rows)
print("Number of columns in data_named:", num_columns)


Number of rows in data_named: 2
Number of columns in data_named: 7


### Problem 3: Check the data type of each column and convert all numeric values to floats.

Complete the following tasks:

1. Check data types (you only have to do this for one row).
2. Convert all numeric values to floats.


Note: only print the data type once per column (i.e. only do this for 1 row of data.

**Format your output neatly and annotate properly.** Unannotated lists of data types will not receive credit. This means you should match each column name to a data type and display the combination.

In [17]:
# check data types for each column


from collections import namedtuple
import decimal


if data_named:  # Checking to see if the list is not empty
    # Step 1: Check Data Types for Each Column
    first_row = data_named[0]
    print("Data types for each column:")
    for field in CovidData._fields:
        value = getattr(first_row, field) # this method will check the value for the first row
        print(f"{field}: {type(value).__name__}")

    # Step 2: Convert Numeric Values to Floats
    CovidDataFloat = namedtuple('CovidDataFloat', CovidData._fields)
    def convert_to_float(val):
        if isinstance(val, (int, float, decimal.Decimal)):  # Check if the value is numeric
            return float(val)
        return val

    data_named_float = [
        CovidDataFloat(*(convert_to_float(val) for val in row))
        for row in data_named
    ]

    #print a sample to check conversion
    print("\nConverted types in the first row after conversion:")
    for field in CovidDataFloat._fields:
        value = getattr(data_named_float[0], field)
        print(f"{field}: {type(value).__name__}")
else:
    print("The data list 'data_named' is empty.")


Data types for each column:
Country_Region: str
Confirmed: int
Deaths: int
Recovered: int
Active: int
Incidence_Rate: float
Case_Fatality_Ratio: float

Converted types in the first row after conversion:
Country_Region: str
Confirmed: float
Deaths: float
Recovered: float
Active: float
Incidence_Rate: float
Case_Fatality_Ratio: float


In [18]:
# check data types for each column

#  This one seems like the first one, I'm assuming you want the floats  here


#### Convert numeric data to floats.
1. use a loop to convert only the numeric data (i.e. numbers represented as strings) to float values. You'll have to come up with a way to skip the non-numeric data.

2. If you used namedtuples this is a little trickier since namedtuples are immutable (can't be changed).

Hint: you need to use a placeholder data type that you can convert the values. After conversion put everything back into a namedtuple.

##### Convert the appropriate elements of ```data``` to floats.

In [20]:
# convert all numerical data to floats.

from collections import namedtuple
import decimal

def is_numeric(s):
    """Check if a string can be interpreted as a number."""
    try:
        float(s)
        return True
    except ValueError:
        return False

##### an alternative approach to convert the elements of ```data``` to floats.

In [22]:
# convert all numerical data to floats.

from collections import namedtuple
import decimal

# Convert data and store in a new list
converted_data_named = []
for row in data_named:
    # Convert numeric strings to floats, leave other data types unchanged
    converted_row = tuple(float(field) if isinstance(field, str) and is_numeric(field) else field for field in row)
    # Create a new namedtuple instance with the converted data
    converted_data_named.append(CovidDataConverted(*converted_row))

print("Converted row sample:")
for field, value in zip(CovidDataConverted._fields, converted_data_named[0]):
    print(f"{field}: {type(value).__name__} ({value})")

Converted row sample:
Country_Region: str (USA)
Confirmed: int (100)
Deaths: int (5)
Recovered: int (95)
Active: int (0)
Incidence_Rate: float (300.5)
Case_Fatality_Ratio: float (5.0)


##### Convert the appropriate elements of data_named to floats.
Note that this is a touch more complicated since namedtuples are immutable and elements cannot be changed.

So the approach is to create a dictionary for each row and add each element of the namedtuples to the dictionary, converting the type when necessary. We end up with a list of dictionaries with all the same information, but all numerical values are now floats.

In [25]:
# alternative approach

from collections import namedtuple
import decimal

converted_data_dicts = []

for record in data_named:
    # Convert the namedtuple to a dictionary
    record_dict = record._asdict()

    # Modify the dictionary by converting numeric values to floats
    for key, value in record_dict.items():
        if isinstance(value, (int, float, decimal.Decimal)):  # Check if the value is a numeric type
            record_dict[key] = float(value)  # Convert to float

    # Append the modified dictionary to the new list
    converted_data_dicts.append(record_dict)

In [None]:
# namedtuple approach



In [26]:
# check the result -

print("Sample converted dictionary:", converted_data_dicts[0])

Sample converted dictionary: {'Country_Region': 'USA', 'Confirmed': 100.0, 'Deaths': 5.0, 'Recovered': 95.0, 'Active': 0.0, 'Incidence_Rate': 300.5, 'Case_Fatality_Ratio': 5.0}


### Part 4: Calculate the average number of active cases and average number of deaths.

1. Compute the average for active cases
2. Compute the average number of deaths.
3. Compute the average total number of cases.

Hint: Review the data dictionary to determine the correct information to use.

Hint: Don't over think this. Try to find the simplest approach.


#### Find the average using the standard ```data```

In [23]:
# Create a list for active case counts and deaths

active_cases = []
deaths = []
confirmed_cases = []


for record in data_named:
    active_cases.append(record.Active)
    deaths.append(record.Deaths)
    confirmed_cases.append(record.Confirmed)

# Function to calculate the average from a list of numbers
def calculate_average(numbers):
    if numbers:  # Checking if the list is not empty to avoid division by zero
        return sum(numbers) / len(numbers)
    return 0  # Return 0 or suitable value if the list is empty


In [24]:
# Calculate the average number of cases per country

average_active_cases = calculate_average(active_cases)
average_deaths = calculate_average(deaths)
average_confirmed_cases = calculate_average(confirmed_cases)

# Print the results
print(f"Average Active Cases: {average_active_cases}")
print(f"Average Deaths: {average_deaths}")
print(f"Average Confirmed Cases: {average_confirmed_cases}")

Average Active Cases: 0.0
Average Deaths: 7.5
Average Confirmed Cases: 150.0


#### Find the average using the namedtuple ```data_named```

In [43]:
# Create a list for active case counts and deaths

from collections import defaultdict, namedtuple

country_data = defaultdict(lambda: {'active_sum': 0.0, 'death_sum': 0.0, 'count': 0})

# Note: Don't forget to convert to floats.


In [None]:
# Calculate the average number of cases per country


**Compute the Average total number of cases**

In [None]:
# What information do we need to get this result?


In [None]:
#

### Part 5: Create an object ```countries``` that contains all the country names in the data set. Each country should only be listed once.

1. Create a list (or other python data type) of unique country names.
2. Print total number of unique countries represented in the data set.
3. Print the first 5 names and the last 5 names - Print your results neatly and annotate. Your results should be in alphabetical order.


In [39]:
# Where are countries in the rows

df = pd.read_csv ('/content/covid.csv')


# Extract all unique country names and sort them
countries = sorted(df['Country_Region'].unique())

# print the country count
print(f"Total number of unique countries: {len(countries)}")

Total number of unique countries: 188


In [40]:
# list the countries
print("List of all countries:")
for country in countries:
    print(country)

List of all countries:
Afghanistan
Albania
Algeria
Andorra
Angola
Antigua and Barbuda
Argentina
Armenia
Australia
Austria
Azerbaijan
Bahamas
Bahrain
Bangladesh
Barbados
Belarus
Belgium
Belize
Benin
Bhutan
Bolivia
Bosnia and Herzegovina
Botswana
Brazil
Brunei
Bulgaria
Burkina Faso
Burma
Burundi
Cabo Verde
Cambodia
Cameroon
Canada
Central African Republic
Chad
Chile
China
Colombia
Comoros
Congo (Brazzaville)
Congo (Kinshasa)
Costa Rica
Cote d'Ivoire
Croatia
Cuba
Cyprus
Czechia
Denmark
Diamond Princess
Djibouti
Dominica
Dominican Republic
Ecuador
Egypt
El Salvador
Equatorial Guinea
Eritrea
Estonia
Eswatini
Ethiopia
Fiji
Finland
France
Gabon
Gambia
Georgia
Germany
Ghana
Greece
Grenada
Guatemala
Guinea
Guinea-Bissau
Guyana
Haiti
Holy See
Honduras
Hungary
Iceland
India
Indonesia
Iran
Iraq
Ireland
Israel
Italy
Jamaica
Japan
Jordan
Kazakhstan
Kenya
Korea, South
Kosovo
Kuwait
Kyrgyzstan
Laos
Latvia
Lebanon
Lesotho
Liberia
Libya
Liechtenstein
Lithuania
Luxembourg
MS Zaandam
Madagascar
Malawi
Mal

In [41]:
# print the first 5.
print("First 5 countries:", countries[:5])

# print the last 5
print("Last 5 countries:", countries[-5:])


First 5 countries: ['Afghanistan', 'Albania', 'Algeria', 'Andorra', 'Angola']
Last 5 countries: ['West Bank and Gaza', 'Western Sahara', 'Yemen', 'Zambia', 'Zimbabwe']


### Part 6: Calculate the average number of confirmed cases for the first 5 countries and the last 5 countries.

1. Determine the average number of confirmed cases for the first 5 countries.
2. Determine the average number of confirmed cases for the last 5 countries.


Note: Print your results neatly and properly annotated.

Hint: Think carefully about the easiest way to count the number of confirmed cases!


In [38]:
import pandas as pd

# Load the data from the CSV file
df = pd.read_csv ('/content/covid.csv')

# Sort the DataFrame by 'Country_Region' to ensure alphabetical order
df_sorted = df.sort_values(by='Country_Region')

# Get unique countries and ensure they are sorted
countries_sorted = df_sorted['Country_Region'].unique()

# Select the first and last 5 countries
first_five_countries = countries_sorted[:5]
last_five_countries = countries_sorted[-5:]

# Calculate the average confirmed cases for these countries
# Filtering data for the first 5 countries
first_five_avg = df_sorted[df_sorted['Country_Region'].isin(first_five_countries)]['Confirmed'].mean()

# Filtering data for the last 5 countries
last_five_avg = df_sorted[df_sorted['Country_Region'].isin(last_five_countries)]['Confirmed'].mean()

# Print the results neatly
print(f"Average number of confirmed cases for the first 5 countries: {first_five_avg:.2f}")
print(f"Average number of confirmed cases for the last 5 countries: {last_five_avg:.2f}")


Average number of confirmed cases for the first 5 countries: 16720.00
Average number of confirmed cases for the last 5 countries: 6085.80


### Problem 7: Create a dictionary of confirmed cases in the EU.

The keys in the dictionary are the countries in Europe and the values will be the total number of confirmed cases.

**Expected output**: `{'Austria': 22439, 'Belgium': 75647, ...  }` (*required*)

**Bonus**: use `.defaultdict()` to simplify your code. (*optional*)

See: [Python Doc - defaultdict](https://docs.python.org/3/library/collections.html?highlight=defaultdict#collections.defaultdict) or [Stackoverflow - defaultdict](https://stackoverflow.com/questions/5900578/how-does-collections-defaultdict-work)

In [None]:
# a list of EU countries
eu = ['Austria',
'Belgium',
'Bulgaria',
'Croatia',
'Cyprus',
'Czechia',
'Denmark',
'Estonia',
'Finland',
'France',
'Germany',
'Greece',
'Hungary',
'Ireland',
'Italy',
'Latvia',
'Lithuania',
'Luxembourg',
'Malta',
'Netherlands',
'Poland',
'Portugal',
'Romania',
'Slovakia',
'Slovenia',
'Spain',
'Sweden']

In [None]:
# if you used a named tuple - answer here

In [42]:
#try with a defaultdict

import pandas as pd
from collections import defaultdict

df = pd.read_csv ('/content/covid.csv')

eu_countries = [
    'Austria', 'Belgium', 'Bulgaria', 'Croatia', 'Cyprus', 'Czech Republic',
    'Denmark', 'Estonia', 'Finland', 'France', 'Germany', 'Greece', 'Hungary',
    'Ireland', 'Italy', 'Latvia', 'Lithuania', 'Luxembourg', 'Malta',
    'Netherlands', 'Poland', 'Portugal', 'Romania', 'Slovakia', 'Slovenia',
    'Spain', 'Sweden'
]

eu_data = df[df['Country_Region'].isin(eu_countries)] # filtering the data

confirmed_cases = defaultdict(int) # accumulating the confirmed cases here

# Aggregate confirmed cases by country
for index, row in eu_data.iterrows():
    confirmed_cases[row['Country_Region']] += row['Confirmed']

# Convert the defaultdict to a regular dictionary
confirmed_dict = dict(confirmed_cases)

# Print the dictionary
print("Confirmed COVID-19 cases in EU countries:")
for country, cases in sorted(confirmed_dict.items()):  # Sorted for alphabetical order
    print(f"{country}: {cases}")

Confirmed COVID-19 cases in EU countries:
Austria: 22439
Belgium: 75647
Bulgaria: 13893
Croatia: 5870
Cyprus: 1291
Denmark: 15423
Estonia: 2174
Finland: 7642
France: 244088
Germany: 220859
Greece: 6177
Hungary: 4768
Ireland: 26838
Italy: 251713
Latvia: 1303
Lithuania: 2309
Luxembourg: 7300
Malta: 1190
Netherlands: 61718
Poland: 53676
Portugal: 53223
Romania: 65177
Slovakia: 2690
Slovenia: 2303
Spain: 329784
Sweden: 83455


### Problem 8: Compare the Case Fatality Rate in the EU to that in the US and North America.

1. Determine the CFR in the EU
2. Determine the CFR in the US
3. Determine the CFR in North America

Note: The Case Fatality Rate is a feature in this data set. You are not to use that feature. You should compute the CFR from the other available features. Use the existing CFR column as a check.

In [None]:
import pandas as pd

df = pd.read_csv ('/content/covid.csv')

In [None]:
# countries in North America
na = ['Antigua and Barbuda',
'Bahamas',
'Barbados',
'Belize',
'Canada',
'Costa Rica',
'Cuba',
'Dominica',
'Dominican Republic',
'El Salvador',
'Grenada',
'Guatemala',
'Haiti',
'Honduras',
'Jamaica',
'Mexico',
'Nicaragua',
'Panama',
'Saint Kitts and Nevis',
'Saint Lucia',
'Saint Vincent and the Grenadines',
'Trinidad and Tobago',
'US']

In [None]:
# write a function





### Bonus 1: Craft a problem statement about this data that interests you, and then answer it!


In [None]:
# Problem Statement: Impact of COVID-19 on Small vs. Large Countries in Europe
# Given the global nature of the COVID-19 pandemic, different countries have faced varying challenges
# based on numerous factors including healthcare infrastructure, population density, and government response.
# An interesting aspect to explore is how smaller European countries (by population) have fared in
# comparison to larger ones in terms of managing COVID-19 cases and fatalities.

# The question is do smaller European countries have a higher or
# lower Case Fatality Rate (CFR) compared to larger European countries during the COVID-19 pandemic?



### Bonus 2: Repeat the above analysis using Pandas!
