#### Question 1: Import the appropriate packages and load the dataset
Use pandas to load *covid_19.csv*.

In [3]:
import pandas as pd
import numpy as np

# Load your dataset 
file_path = 'data/covid_19.csv'
df = pd.read_csv(file_path)

# Output the first 5 rows of the dataframe
df.head()

Unnamed: 0,country,continent,population,day,time,Cases,Recovered,Deaths,Tests
0,Slovakia,Europe,5460193.0,2024-06-30,2024-06-30T16:15:11+00:00,1877605,1856381.0,21224.0,7448789.0
1,Saint-Pierre-Miquelon,North-America,5759.0,2024-06-30,2024-06-30T16:15:16+00:00,3452,2449.0,2.0,25400.0
2,Isle-of-Man,EuRope,85732.0,2024-06-30,2024-06-30T16:15:14+00:00,38008,,116.0,150753.0
3,Somalia,Africa,16841795.0,2024-06-30,2024-06-30T16:15:14+00:00,27334,13182.0,1361.0,400466.0
4,Comoros,Africa,907419.0,2024-06-30,2024-06-30T16:15:15+00:00,9109,8939.0,161.0,
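
Before filling anything in, it can help to count how many values are missing per column. The sketch below is only an illustration: it rebuilds three of the rows shown above from an in-memory string (the name `sample_csv` is hypothetical, not part of the exercise) so that it runs without the data file.

```python
import io

import pandas as pd

# Three rows copied from the preview above; empty fields become NaN.
sample_csv = io.StringIO(
    "country,continent,population,day,time,Cases,Recovered,Deaths,Tests\n"
    "Slovakia,Europe,5460193.0,2024-06-30,2024-06-30T16:15:11+00:00,1877605,1856381.0,21224.0,7448789.0\n"
    "Isle-of-Man,EuRope,85732.0,2024-06-30,2024-06-30T16:15:14+00:00,38008,,116.0,150753.0\n"
    "Comoros,Africa,907419.0,2024-06-30,2024-06-30T16:15:15+00:00,9109,8939.0,161.0,\n"
)
sample = pd.read_csv(sample_csv)

# Number of missing values in each column.
missing_per_column = sample.isna().sum()
print(missing_per_column)
```

On the real dataframe the same check would be `df.isna().sum()`.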


In [2]:
!pip install numpy



#### Question 2: Complete the 'handle_missing_values' function to handle missing values in the dataset. 

Iterate over each column of the dataframe **df_filled**. Using **fillna()**, fill the ***NaN*** values in each column *with zero (0) if the column is numeric, or with the string "Unknown" if it is a text column*.

*Hint* : You can use numpy to detect if the column's data type is a number with the following condition:
-  **if np.issubdtype(df_filled[column].dtype, np.number)**

Save the result in the **filled_missing_df** variable.

In [4]:
# Function to handle missing values for both text and numeric columns
def handle_missing_values(df):
    df_filled = df.copy()
    
    for column in df_filled.columns:
        
        if np.issubdtype(df_filled[column].dtype, np.number):
            df_filled[column] = df_filled[column].fillna(0)
        else:
            df_filled[column] = df_filled[column].fillna('Unknown')

    return df_filled
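
To see how the dtype check steers the fill value, here is a minimal sketch on a made-up two-column frame. The names `toy`, `is_numeric`, and `toy_filled` are illustrative only, and the text column is created with `dtype=object` explicitly so the NumPy check applies to a plain NumPy dtype.

```python
import numpy as np
import pandas as pd

# Made-up data: one numeric column and one text column, each with a gap.
toy = pd.DataFrame({
    "Tests": pd.Series([100.0, np.nan, 300.0]),
    "continent": pd.Series(["Europe", None, "Asia"], dtype=object),
})

# The same condition used in handle_missing_values, evaluated per column.
is_numeric = {col: np.issubdtype(toy[col].dtype, np.number) for col in toy.columns}

# Apply the fill rule: 0 for numeric columns, "Unknown" for everything else.
toy_filled = toy.copy()
for col in toy_filled.columns:
    fill_value = 0 if is_numeric[col] else "Unknown"
    toy_filled[col] = toy_filled[col].fillna(fill_value)

print(is_numeric)
print(toy_filled)
```

Keep in mind that a filled-in zero is indistinguishable from a genuine zero afterwards, which matters for any later statistics on those columns.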

In [6]:
filled_missing_df = handle_missing_values(df)
filled_missing_df.head(35)

Unnamed: 0,country,continent,population,day,time,Cases,Recovered,Deaths,Tests
0,Slovakia,Europe,5460193.0,2024-06-30,2024-06-30T16:15:11+00:00,1877605,1856381.0,21224.0,7448789.0
1,Saint-Pierre-Miquelon,North-America,5759.0,2024-06-30,2024-06-30T16:15:16+00:00,3452,2449.0,2.0,25400.0
2,Isle-of-Man,EuRope,85732.0,2024-06-30,2024-06-30T16:15:14+00:00,38008,0.0,116.0,150753.0
3,Somalia,Africa,16841800.0,2024-06-30,2024-06-30T16:15:14+00:00,27334,13182.0,1361.0,400466.0
4,Comoros,Africa,907419.0,2024-06-30,2024-06-30T16:15:15+00:00,9109,8939.0,161.0,0.0
5,Aruba,North-America,107609.0,2024-06-30,2024-06-30T16:15:14+00:00,44224,42438.0,292.0,177885.0
6,Fiji,Oceania,909466.0,2024-06-30,2024-06-30T16:15:13+00:00,69117,67226.0,885.0,672883.0
7,Romania,europe,19031340.0,2024-06-30,2024-06-30T16:15:11+00:00,3529735,3460149.0,68929.0,28758670.0
8,Saint-Pierre-Miquelon,North-America,5759.0,2024-06-30,2024-06-30T16:15:16+00:00,3452,2449.0,2.0,25400.0
9,Yemen,Asia,31154870.0,2024-06-30,2024-06-30T16:15:15+00:00,11945,9124.0,2159.0,329592.0


#### Question 3: Complete the 'handle_outliers' function to handle outliers in the dataset
As in Question 2, iterate over each column of the dataframe (filled_missing_df) and check whether its data type is numeric. If it is, identify outliers using the interquartile range (IQR) method taught in class and drop the rows that contain them. Save the result in the **df_without_outliers** variable.

In [9]:
# Function to handle outliers for numeric columns
def handle_outliers(df):
    df_outliers_removed = df.copy()
    
    for column in df_outliers_removed.columns:
        if np.issubdtype(df_outliers_removed[column].dtype, np.number):
            q1 = df_outliers_removed[column].quantile(0.25)
            q3 = df_outliers_removed[column].quantile(0.75)
            iqr = q3 - q1

            lower_bound = q1 - 1.5 * iqr
            upper_bound = q3 + 1.5 * iqr

            # Keep only the rows whose value lies within the IQR bounds (inclusive)
            within_bounds = df_outliers_removed[column].between(lower_bound, upper_bound)
            df_outliers_removed = df_outliers_removed[within_bounds]
    
    return df_outliers_removed

In [10]:
df_without_outliers = handle_outliers(filled_missing_df)
print("Shape of dataframe before:", filled_missing_df.shape)
print("Shape of dataframe after removing outliers:", df_without_outliers.shape)

Shape of dataframe before: (309, 9)
Shape of dataframe after removing outliers: (149, 9)
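
The IQR bounds are easier to follow on a tiny example. The sketch below uses five made-up numbers (not taken from the dataset) and the same 1.5 × IQR rule as `handle_outliers`.

```python
import pandas as pd

# Made-up values; 100 is the obvious outlier.
values = pd.Series([10, 12, 13, 15, 100])

q1 = values.quantile(0.25)       # 12.0
q3 = values.quantile(0.75)       # 15.0
iqr = q3 - q1                    # 3.0

lower_bound = q1 - 1.5 * iqr     # 7.5
upper_bound = q3 + 1.5 * iqr     # 19.5

# Keep only the values inside the bounds (both ends inclusive).
kept = values[values.between(lower_bound, upper_bound)]
print(kept.tolist())             # [10, 12, 13, 15]
```

Two properties of `handle_outliers` as written are worth keeping in mind. Rows are filtered one column at a time, so the quartiles for later columns are computed on a frame that earlier columns have already reduced. The zeros filled in for missing values in Question 2 also count toward the quartiles.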


#### Question 4: Complete the 'handle_duplicates' function to remove duplicates in the dataset
Drop duplicates from the dataframe (df_without_outliers) and save it in the **df_deduplicated** variable. 

In [12]:
# Function to handle duplicates
def handle_duplicates(df):
    df_deduplicated = df.drop_duplicates()

    print("Original dataframe shape:", df.shape)
    print("Deduplicated dataframe shape:", df_deduplicated.shape)
    
    return df_deduplicated

In [13]:
df_deduplicated = handle_duplicates(df_without_outliers)

Original dataframe shape: (149, 9)
Deduplicated dataframe shape: (116, 9)
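
As a small illustration of what `drop_duplicates()` considers a duplicate, the sketch below uses a made-up three-row frame (the names `toy`, `dup_mask`, and `deduped` are illustrative). With the default settings, a row is dropped only if every column matches an earlier row, and the first occurrence is kept.

```python
import pandas as pd

# Made-up frame: the third row repeats the first one exactly.
toy = pd.DataFrame({
    "country": ["Aruba", "Fiji", "Aruba"],
    "Cases": [44224, 69117, 44224],
})

# True marks rows that repeat an earlier row in all columns.
dup_mask = toy.duplicated()

# Drop the repeats; the original index labels of the kept rows are preserved.
deduped = toy.drop_duplicates()

print(dup_mask.tolist())
print(deduped)
```

Calling `duplicated().sum()` first is a quick way to check how many rows the function is about to remove.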


#### Question 5: Complete the 'standardize_data' function to standardize the 'continent' column in the dataset
As we learned in class, standardize the *continent* column in the dataframe (df_deduplicated) and save the result in the **standarized_df** variable.

To identify the inconsistencies first, use the unique() method to list the distinct values within the column:

In [14]:
pd.unique(df_deduplicated['continent'])

array(['North-America', 'EuRope', 'Africa', 'Oceania', 'Asia', 'Europe',
       'africa', 'Unknown', 'South-America', 'OceanIa', 'AFRICA',
       'north-America', 'AfriCa', 'OCEANIA', 'EUROPE', 'ASIA', 'AsiA',
       'North-america', 'NORTH-AMERICA', 'europe'], dtype=object)

In [15]:
# Function to address inconsistency and standardize data
def standardize_data(df):
    df_standardized = df.copy()
    
    df_standardized['continent'] = df_standardized['continent'].str.strip().str.title()
    
    return df_standardized

In [16]:
standarized_df = standardize_data(df_deduplicated)
pd.unique(standarized_df['continent'])

array(['North-America', 'Europe', 'Africa', 'Oceania', 'Asia', 'Unknown',
       'South-America'], dtype=object)

#### Question 6: Save Your Cleaned DataFrame
Save your cleaned DataFrame (standarized_df) to the Results folder.

In [17]:
output_path = 'Results/covid_19_cleaned.csv'
standarized_df.to_csv(output_path, index = False)
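
`to_csv()` does not create missing folders, so the call above assumes that `Results/` already exists. The sketch below shows one way to create the folder first. It writes a made-up one-row frame into a temporary directory so that it runs anywhere without side effects; the names `toy` and `round_trip` are illustrative only.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Made-up frame standing in for standarized_df.
toy = pd.DataFrame({"country": ["Fiji"], "Cases": [69117]})

with tempfile.TemporaryDirectory() as tmp:
    output_path = Path(tmp) / "Results" / "covid_19_cleaned.csv"

    # Create the Results folder if it is missing; do nothing if it exists.
    output_path.parent.mkdir(parents=True, exist_ok=True)

    toy.to_csv(output_path, index=False)

    # Read the file back to confirm what was written.
    round_trip = pd.read_csv(output_path)

print(round_trip)
```

In the notebook itself, the equivalent would be calling `Path('Results').mkdir(parents=True, exist_ok=True)` before `to_csv()`.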