# SLU06 - Dealing with Data Problems


In [1]:
import os
import pandas as pd
import numpy as np
import copy
import hashlib
import json
import warnings
import calendar
import datetime
import pycountry_convert as cc
warnings.filterwarnings('ignore')

In this notebook we will be covering the following:

- Data Entry Problems
  - Data entry problems in categorical variables
  - Data entry problems in numerical variables
  - Duplicated entries
- Missing Values
- Tidy Data

Welcome to the wonderful world of Data Cleanup! In the real world, a lot of good people spend a lot of time cleaning datasets and getting them into a form they can work with.

There is one thing that you should always keep in mind when working with data:

![garbage](https://memegenerator.net/img/instances/50405824.jpg)

Let's get our hands dirty.

## The CRSet() Hotel

You're sitting at your desk on your first day of work as a data scientist, and your manager has just assigned you your first task:

> DataSlave, here's the dataset from *CRSet() Hotels*. Before we can do anything with it we need it nice and *tidy*. Can you take care of this? Thanks.  

Attached was a data dictionary for the dataset:
- **hotel:** Resort Hotel or City Hotel
- **is_canceled:** Value indicating if the booking was canceled (1) or not (0)
- **lead_time:** Number of days that elapsed between the entering date of the booking and the arrival date
- **arrival_date:** Arrival date formatted as "Month Day Year"
- **stays_in_weekend_nights:** Number of weekend nights (Saturday or Sunday) the guest stayed or booked to stay at the hotel
- **stays_in_week_nights:** Number of week nights (Monday to Friday) the guest stayed or booked to stay at the hotel
- **adults:** Number of adults
- **is_repeated_guest:** Value indicating if the booking name was from a repeated guest (1) or not (0)
- **previous_cancellations:** Number of previous bookings that were cancelled by the customer prior to the current booking
- **agent:** ID of the travel agency that made the booking
- **adr:** Average Daily Rate as defined by dividing the sum of all lodging transactions by the total number of staying nights
- **total_of_special_requests:** Number of special requests made by the customer (e.g. twin bed or high floor)
- **reservation_status:** Last reservation status, assuming one of three categories: Canceled – booking was canceled by the customer; Check-Out – customer has checked in but already departed; No-Show – customer did not check in and did inform the hotel of the reason why
- **reservation_status_date:** Date at which the last status was set. This variable can be used in conjunction with `reservation_status` to understand when the booking was canceled or when the customer checked out of the hotel

Let's start by importing the dataset and taking a look at it. 

### Exercise 1.1

The dataset is located in the `data` folder, in a file named `crset_hotel_bookins.csv`. This file came straight out of MS Excel, so the values are separated by semicolons.
Save the dataset in the `df_crset` variable.


In [2]:
# use pandas' read_csv to load the data into a dataframe and save it in df_crset
df_crset = pd.read_csv(os.path.join('data', 'crset_hotel_bookins.csv'), sep=';')

# YOUR CODE HERE

In [3]:
assert isinstance(df_crset, pd.DataFrame), "Should be a dataframe"
assert df_crset.shape == (119390, 14), "The shape of the dataframe is different than expected. Are you setting the right separator?"
df_crset.head()

Unnamed: 0,hotel,is_canceled,lead_time,arrival_date,stays_in_weekend_nights,stays_in_week_nights,adults,is_repeated_guest,previous_cancellations,agent,adr,total_of_special_requests,reservation_status,reservation_status_date
0,Resort Hotel,0,342,July 1 2015,0,0,2,0,0,,0.0,0,Check-Out,01/07/2015
1,Resort Hotel,0,737,July 1 2015,0,0,2,0,0,,0.0,0,Check-Out,01/07/2015
2,Resort Hotel,0,7,July 1 2015,0,1,1,0,0,,75.0,0,Check-Out,02/07/2015
3,Resort Hotel,0,13,July 1 2015,0,1,1,0,0,304.0,75.0,0,Check-Out,02/07/2015
4,Resort Hotel,0,14,July 1 2015,0,2,2,0,0,240.0,98.0,1,Check-Out,03/07/2015
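
To see why the separator matters, here is a minimal sketch that uses a small in-memory sample instead of the real file. The two sample rows are made up for illustration; only the semicolon-separated layout mirrors the exercise.

```python
import io

import pandas as pd

# Hypothetical two-row sample in the same semicolon-separated layout
raw = "hotel;is_canceled;lead_time\nResort Hotel;0;342\nCity Hotel;1;7\n"

# With the default separator (a comma) each line ends up in a single column
wrong = pd.read_csv(io.StringIO(raw))

# Passing sep=';' splits the values into the expected columns
right = pd.read_csv(io.StringIO(raw), sep=';')

print(wrong.shape)  # a single column
print(right.shape)  # three columns
```

A dataframe with far fewer columns than the data dictionary lists is usually the first sign that the separator is wrong.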


### Exercise 1.2 - Arrival Date 

Let's start by looking at the arrival date. According to the data dictionary and the first 5 rows of the dataframe, `arrival_date` stores a string with the month spelled out, followed by the day and the year as numbers.

Create a function called `format_arrival_date()` that splits the values in this column and returns a dataframe with this information in the columns `arrival_date_month`, `arrival_date_day` and `arrival_date_year`, all stored as integers.

In [4]:
df_crset.arrival_date.str.split(pat=' ', expand=True)

Unnamed: 0,0,1,2
0,July,1,2015
1,July,1,2015
2,July,1,2015
3,July,1,2015
4,July,1,2015
...,...,...,...
119385,August,30,2017
119386,August,31,2017
119387,August,31,2017
119388,August,31,2017


In [5]:
def format_arrival_date(df: pd.DataFrame)->pd.DataFrame:
    """
    This function cleans "arrival_date" column
    """
    
    # start by copying the dataframe
    _df = df.copy()
    
    # split the "arrival_date" into the "arrival_date_month", "arrival_date_day" and "arrival_date_year" columns
    # hint: make sure you set expand to True
    _df[['arrival_date_month', 'arrival_date_day', 'arrival_date_year']] = _df.arrival_date.str.split(pat=' ', expand=True)
    
    # transform "arrival_date_month" to numeric value
    # hint: use calendar.month_name from the calendar python module and pandas' map() method
    # build the list of month names once instead of once per row
    month_names = list(calendar.month_name)
    _df['arrival_date_month'] = _df['arrival_date_month'].map(month_names.index)
    
    # convert new 'arrival_date_day' and 'arrival_date_year' to numeric
    _df['arrival_date_day'] = _df['arrival_date_day'].astype(int)
    _df['arrival_date_year'] = _df['arrival_date_year'].astype(int)
    
    # drop the "arrival_date" column
    _df = _df.drop(columns='arrival_date')
    
    return _df
    
    # YOUR CODE HERE


In [6]:
clean_arrival = format_arrival_date(df_crset)
assert isinstance(clean_arrival, pd.DataFrame), "Should be a dataframe"
assert clean_arrival.shape == (119390, 16), "The shape of the dataframe is different than expected. Have you dropped the old arrival_date column?"
assert 'arrival_date' not in clean_arrival.columns, "You should remove the old arrival_date column"
assert 'arrival_date_month' in clean_arrival.columns, "You're missing the arrival_date_month column. Have you named the new column correctly?"
assert 'arrival_date_day' in clean_arrival.columns, "You're missing the arrival_date_day column. Have you named the new column correctly?"
assert 'arrival_date_year'  in clean_arrival.columns, "You're missing the arrival_date_year column. Have you named the new column correctly?"
assert all(isinstance(item, int) for item in clean_arrival.arrival_date_month), "Months should be saved as integers" 
assert all(isinstance(item, int) for item in clean_arrival.arrival_date_day), "Days should be saved as integers"
assert all(isinstance(item, int) for item in clean_arrival.arrival_date_year), "Years should be saved as integers"
assert hashlib.sha256(json.dumps(str(clean_arrival.iloc[50])).encode()).hexdigest() == '676d3dca5425221c77c8400481d442d96b3a5f88185f729d628e3541ea78ae15', "Something is wrong with your data conversion"
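
As a small illustration of the month conversion, the sketch below builds a name-to-number lookup from `calendar.month_name` and applies it to a made-up sample. The `month_to_number` dictionary and the sample dates are assumptions for the example, and the month names are assumed to be in English (the default locale).

```python
import calendar

import pandas as pd

# calendar.month_name[0] is an empty string, so it is skipped
month_to_number = {name: number for number, name in enumerate(calendar.month_name) if name}

# Hypothetical sample in the "Month Day Year" layout
dates = pd.Series(["July 1 2015", "August 31 2017"])
parts = dates.str.split(pat=" ", expand=True)
parts.columns = ["month", "day", "year"]

# Month names become numbers; day and year are cast from str to int
parts["month"] = parts["month"].map(month_to_number)
parts["day"] = parts["day"].astype(int)
parts["year"] = parts["year"].astype(int)

print(parts)
```

Unlike `list.index`, a dictionary passed to `map()` does not raise on an unknown month name; it leaves a missing value instead, so it is worth checking for nulls afterwards.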

### Exercise 1.3 - Week of year 

Create a function named `get_week_of_year` that takes the newly created `arrival_date_month`, `arrival_date_day` and `arrival_date_year` columns and creates a new column in the dataframe called `arrival_date_week_number` with the week number of the year for the arrival date.

Hint: *`datetime.date()` receives the year, month and day as integers, and you can get the week of the year with `.isocalendar()[1]`.*

In [7]:
def get_week_of_year(df: pd.DataFrame)->pd.DataFrame:
    """
    This function gets the arrival week of the year
    """
    
    # copy the dataframe
    _df = df.copy()
    
    # get the week of the year number and save it in a new column 
    # hint:  use pandas' apply() with axis=1
    _df['arrival_date_week_number'] = _df.apply(
        lambda row: 
            datetime.date(
                row['arrival_date_year'], 
                row['arrival_date_month'], 
                row['arrival_date_day']
            ).isocalendar()[1], 
        axis=1)
    
    return _df
    
    # YOUR CODE HERE

In [8]:
clean_arrival_week_of_year = get_week_of_year(clean_arrival)
assert isinstance(clean_arrival_week_of_year, pd.DataFrame), "Should be a dataframe"
assert clean_arrival_week_of_year.shape == (119390, 17), "The shape of the dataframe is different than expected. Have you saved the new column?"
assert 'arrival_date_week_number' in clean_arrival_week_of_year.columns, "You're missing the arrival_date_week_number column. Have you named the new column correctly?"
assert all(isinstance(item, int) for item in clean_arrival_week_of_year.arrival_date_week_number), "The values in the new column should be saved as integers" 
assert hashlib.sha256(json.dumps(str(clean_arrival_week_of_year.arrival_date_week_number)).encode()).hexdigest() == 'fbd3bdce9f2b3a1aff60369c351a1ceb63a167a80d78b140859af7f687cd76a3', "Something is wrong with your data conversion"
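
The hint above can be tried on single dates first. This short sketch uses the first arrival date from the dataset preview plus one extra date, chosen only to illustrate a caveat of ISO week numbering.

```python
import datetime

# July 1 2015 (the first arrival date in the preview) falls in ISO week 27
week = datetime.date(2015, 7, 1).isocalendar()[1]
print(week)

# Caveat: an ISO week can belong to the neighbouring year.
# January 1 2016 was a Friday, so it is still part of ISO week 53 of 2015.
iso_year, iso_week = datetime.date(2016, 1, 1).isocalendar()[:2]
print(iso_year, iso_week)
```

Keep this in mind when combining the week number with the calendar year: around New Year the two may not refer to the same year.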

### Exercise 1.4 - The reservation status date

Do the same processing as we did for the `arrival_date` column, but this time for `reservation_status_date`. Note that this column uses a different layout: day, month and year separated by slashes. 
All steps should be done in a single function named `process_reservation_status_date()`. 

In [9]:
def process_reservation_status_date(df: pd.DataFrame)->pd.DataFrame:
    """
    This function cleans "reservation_status_date" column
    """
    # copy the dataframe
    _df = df.copy()
    
    # split the "reservation_status_date" into "reservation_status_date_day", "reservation_status_date_month" and "reservation_status_date_year"
    _df[["reservation_status_date_day", "reservation_status_date_month", "reservation_status_date_year"]] = _df.reservation_status_date.str.split(pat='/', expand=True)
    
    # convert new columns to numeric
    _df['reservation_status_date_day'] = _df['reservation_status_date_day'].astype(int)
    _df['reservation_status_date_month'] = _df['reservation_status_date_month'].astype(int)
    _df['reservation_status_date_year'] = _df['reservation_status_date_year'].astype(int)
    
    # get the week of the year number and save it in a new column 
    # hint:  use pandas' apply() with axis=1
    _df['reservation_status_date_week_number'] = _df.apply(
        lambda row: 
            datetime.date(
                row['reservation_status_date_year'], 
                row['reservation_status_date_month'], 
                row['reservation_status_date_day']
            ).isocalendar()[1], 
        axis=1)
    
    # drop the "reservation_status_date" column
    _df = _df.drop(columns='reservation_status_date')
    
    return _df
    
    # YOUR CODE HERE

In [10]:
clean_status_date = process_reservation_status_date(clean_arrival_week_of_year)
assert isinstance(clean_status_date, pd.DataFrame), "Should be a dataframe"
assert clean_status_date.shape == (119390, 20), "The shape of the dataframe is different than expected. Have you dropped the old reservation_status_date column?"
assert 'reservation_status_date' not in clean_status_date.columns, "You should remove the old reservation_status_date column"
assert 'reservation_status_date_day' in clean_status_date.columns, "You're missing the column. Have you named the new column correctly?"
assert 'reservation_status_date_month' in clean_status_date.columns, "You're missing the column. Have you named the new column correctly?"
assert 'reservation_status_date_year'  in clean_status_date.columns, "You're missing the column. Have you named the new column correctly?"
assert 'reservation_status_date_week_number'  in clean_status_date.columns, "You're missing the column. Have you named the new column correctly?"
assert all(isinstance(item, int) for item in clean_status_date.reservation_status_date_day), "Days should be saved as integers" 
assert all(isinstance(item, int) for item in clean_status_date.reservation_status_date_month), "Months should be saved as integers"
assert all(isinstance(item, int) for item in clean_status_date.reservation_status_date_year), "Years should be saved as integers"
assert all(isinstance(item, int) for item in clean_status_date.reservation_status_date_week_number), "Week of the year should be saved as integer"
assert hashlib.sha256(json.dumps(str(clean_status_date.reservation_status_date_week_number)).encode()).hexdigest() == '035d73f3bf98929a3e2d1ea7b1dd02645a86cc8f1d751dd088533b0e11de0352', "Something is wrong with your data conversion in reservation_status_date_week_number"
assert hashlib.sha256(json.dumps(str(clean_status_date.iloc[50])).encode()).hexdigest() == '996ff301f605548a5d9f5b698bbf3873ce1fe3bd35fef30b249d4159035216cc', "Something is wrong with your data conversion"
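
An alternative to splitting the string by hand is to let pandas parse the date with an explicit format. The sketch below is not the exercise's expected solution, just a way to cross-check it; it assumes pandas 1.1 or later for `.dt.isocalendar()` and uses the first three status dates from the dataset preview.

```python
import pandas as pd

# Day-first dates, as in the reservation_status_date column
status_dates = pd.Series(["01/07/2015", "02/07/2015", "03/07/2015"])

# An explicit format removes any day/month ambiguity
parsed = pd.to_datetime(status_dates, format="%d/%m/%Y")

days = parsed.dt.day.tolist()
months = parsed.dt.month.tolist()
weeks = parsed.dt.isocalendar().week.tolist()

print(days, months, weeks)
```

Spelling out the format matters here: a value such as `01/07/2015` could otherwise be read as January 7 instead of July 1.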

### Exercise 2 - Missing data

Let's now look at missing data.

In [11]:
np.sum(clean_status_date.isnull())

hotel                                      0
is_canceled                                0
lead_time                                  0
stays_in_weekend_nights                    0
stays_in_week_nights                       0
adults                                     0
is_repeated_guest                          0
previous_cancellations                     0
agent                                  16340
adr                                        0
total_of_special_requests                  0
reservation_status                         0
arrival_date_month                         0
arrival_date_day                           0
arrival_date_year                          0
arrival_date_week_number                   0
reservation_status_date_day                0
reservation_status_date_month              0
reservation_status_date_year               0
reservation_status_date_week_number        0
dtype: int64

There are over 16,000 missing values in the `agent` column, representing roughly 14% of the observations, and we need to do something about it. 

As a rule of thumb, if more than 70% of the values in a column are missing and there is no way to fill them in, the column can be dropped from the dataset. That is not the case here: `agent` is a categorical variable that represents the ID of the travel agency that made the booking, so we can fill in the missing values with a new category named `unknown`.

Create a new function named `impute_agents` that does exactly that.

In [12]:
def impute_agents(df: pd.DataFrame)->pd.DataFrame:
    """
    This function imputes the missing values in the agent column with a new category
    """
    # copy the dataframe
    _df = df.copy()
    
    # fill the missing values with "unknown"
    _df['agent'] = _df.agent.fillna('unknown')
    
    return _df
    
    # YOUR CODE HERE

In [13]:
imputed_df = impute_agents(clean_status_date) 
assert isinstance(imputed_df, pd.DataFrame), "Should be a dataframe"
assert hashlib.sha256(json.dumps(str(imputed_df.agent)).encode()).hexdigest() == 'e88eaebe8668246fc74259c12ea8e38311c86ec96a73e2a8de26eb30bb7aefb9', "Something is wrong with your data imputation" 
assert hashlib.sha256(json.dumps(sorted(imputed_df.agent[imputed_df.agent == 'unknown'].index)).encode()).hexdigest() == 'cccf60b5fae5bc38bc21b3f899b7bc47e18053fc5d90a0ac9cb6b8770b3c5921'
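
To make the reasoning about missing values concrete, the following sketch computes the share of missing values per column and then imputes a new category. The `toy` dataframe is invented for the example; only the column names come from the dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical four-row sample with two missing agents
toy = pd.DataFrame({
    "agent": [304.0, np.nan, 240.0, np.nan],
    "adr": [75.0, 0.0, 98.0, 75.0],
})

# isnull().mean() gives the share of missing values per column
missing_share = toy.isnull().mean()
print(missing_share)

# Fill the gaps with an explicit category instead of dropping the rows
toy["agent"] = toy["agent"].fillna("unknown")
print(toy["agent"].tolist())
```

After the imputation the column mixes numbers and strings, which is fine for a categorical ID but means it should no longer be treated as numeric.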

### Exercise 3 - Drop duplicates

Lastly, you need to ensure that your dataset doesn't have any duplicated rows.
Create a short function to do just that!

In [14]:
def drop_duplicated_entries(df: pd.DataFrame)->pd.DataFrame:
    """
    This function drops duplicates
    """
    # copy the dataframe
    _df = df.copy()
    
    # drop duplicates
    _df = _df.drop_duplicates()
    
    return _df
    
    # YOUR CODE HERE    

In [15]:
clean_crset_df = drop_duplicated_entries(imputed_df)
assert isinstance(clean_crset_df, pd.DataFrame), "Should be a dataframe"
assert clean_crset_df.shape == (82870, 20), "The shape of the dataframe is different than expected. Have you removed the duplicated rows?" 
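
Before dropping anything, it can help to count how many rows are exact repeats. This is a minimal sketch on an invented four-row dataframe.

```python
import pandas as pd

# Hypothetical sample: the second row repeats the first one
toy = pd.DataFrame({
    "hotel": ["Resort Hotel", "Resort Hotel", "City Hotel", "Resort Hotel"],
    "lead_time": [7, 7, 7, 13],
})

# duplicated() flags every repeat of an earlier row (the first copy is not flagged)
n_duplicates = toy.duplicated().sum()

# drop_duplicates() keeps the first copy of each row
deduplicated = toy.drop_duplicates()

print(n_duplicates)
print(deduplicated.index.tolist())  # the original index labels are kept
```

Note that `drop_duplicates()` leaves gaps in the index; whether to call `reset_index(drop=True)` afterwards depends on what the downstream code expects.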

Congratulations! The *CRSet() Hotel* dataset is looking very clean and tidy!
![cr7](https://static.toiimg.com/thumb/msid-78693940,width-1200,height-900,resizemode-4/.jpg)

## A mess to tidy up

Now that you've nailed your first task, your manager thinks you're ready to face your next challenge! The World Health Organization (WHO) has been recording cases of tuberculosis as a way to monitor the incidence of the disease in several countries over time. They have good intentions, but not very good methods for storing data. Your manager warns you to prepare yourself...

Let's look at the data.

In [16]:
df_tb_who = pd.read_csv(os.path.join('data', 'tb.csv'), sep=',')
df_tb_who.head()

Unnamed: 0,iso2,year,new_sp,new_sp_m04,new_sp_m514,new_sp_m014,new_sp_m1524,new_sp_m2534,new_sp_m3544,new_sp_m4554,...,new_sp_f04,new_sp_f514,new_sp_f014,new_sp_f1524,new_sp_f2534,new_sp_f3544,new_sp_f4554,new_sp_f5564,new_sp_f65,new_sp_fu
0,AD,1989,,,,,,,,,...,,,,,,,,,,
1,AD,1990,,,,,,,,,...,,,,,,,,,,
2,AD,1991,,,,,,,,,...,,,,,,,,,,
3,AD,1992,,,,,,,,,...,,,,,,,,,,
4,AD,1993,15.0,,,,,,,,...,,,,,,,,,,


According to your data provider (the WHO), the dataset contains counts of confirmed tuberculosis cases by **country**, **year** and **demographic group**. The demographic data contains information on sex (*m* for male and *f* for female)  and age (*0-14, 15-24, 25-34, 35-44, 45-54, 55-64* and *65+*). 

![tb](https://i.pinimg.com/originals/59/b4/35/59b4358ac8b3251a52d76c65cad0ee44.jpg)

You have the data, in the `df_tb_who` variable, as provided. Except for the column `year`, the rest of the column names are not very intuitive. The column `iso2` contains the country code in *iso2 format*. The remaining columns are actually joint realizations of two variables: `sex` and `age`.

## Exercise 4 - Country 

Start by addressing the `iso2` column. Create a new `country` column holding the country name that corresponds to each iso2 code, and drop the original `iso2` column. *Hint:* the pycountry-convert package is your friend! Check the documentation [here](https://pypi.org/project/pycountry-convert/). It's already imported as `cc`. Save the resulting dataframe in `df_tb_who_country`.

In [17]:
# start by creating a function that receives an iso2 code and returns the country name
# hint: when a name can't be retrieved, the original code should be returned, as str
# hint2: make sure to return the value as a string!
def get_country(x):
    try:
        return cc.country_alpha2_to_country_name(x)
    except Exception:
        return str(x)

# copy the dataframe
_df = df_tb_who.copy()

# apply the `get_country()` function and store the results in a new column named "country"
_df['country'] = _df.iso2.apply(get_country)

# drop the original "iso2" column and store the resulting dataframe in df_tb_who_country
df_tb_who_country = _df.drop(columns='iso2')

# YOUR CODE HERE

In [18]:
assert isinstance(df_tb_who_country, pd.DataFrame), "Should be a dataframe"
assert 'iso2' not in df_tb_who_country.columns, "you should drop the original iso2 column"
assert 'country' in df_tb_who_country.columns, "Have you stored the results in a new column named 'country'?"
assert hashlib.sha256(json.dumps(sorted(df_tb_who_country['country'].unique())).encode()).hexdigest() == '40515e68a196feaac974999d8d4fa9f3dd814e1bde66243a968fadf41a8e84de', "Have you converted the iso2 codes to the country NAME?"
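
A side note on why some codes cannot be converted: with its default settings, `pd.read_csv` treats the literal string `NA` as a missing value, and `NA` is also a valid iso2 code (Namibia). Whether this is what happens in `tb.csv` is an assumption; the sketch below only demonstrates the behaviour on made-up data.

```python
import io

import pandas as pd

# Hypothetical sample with the iso2 code "NA"
raw = "iso2,year,new_sp\nAD,1993,15\nNA,1993,20\n"

# Default behaviour: "NA" is read as a missing value
default = pd.read_csv(io.StringIO(raw))

# Only treat empty strings as missing, so "NA" survives as text
kept = pd.read_csv(io.StringIO(raw), keep_default_na=False, na_values=[""])

print(default["iso2"].tolist())
print(kept["iso2"].tolist())
```

In the exercise, such rows end up with the string `nan` as their country because the fallback in `get_country()` returns `str(x)`.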

## Exercise 5 - the melt function

Before we can do anything else, we need to "tidy" the dataframe. Use the `melt()` function from pandas to tidy the dataframe and store it in `tidy_tb`. Use the new `country` column and the `year` column as id variables. Name the variable column `column` and the value column `cases`. 


In [19]:
tidy_tb = pd.melt(df_tb_who_country, 
                  id_vars=['country', 'year'], 
                  var_name='column', 
                  value_name='cases')
# YOUR CODE HERE

In [20]:
assert isinstance(tidy_tb, pd.DataFrame), "Should be a dataframe"
assert tidy_tb.shape == (121149, 4), "Your dataframe doesn't have the expected shape. Have you 'melted' the dataframe correctly?"
assert "column" in tidy_tb.columns, "The variables other than 'country' and 'year' should be stored in a column named 'column'"
assert "cases" in tidy_tb.columns, "Number of cases should be stored in a column named 'cases'"
assert hashlib.sha256(json.dumps(sorted(tidy_tb['column'])).encode()).hexdigest() == '4ba594d958d63b5bab87fe50944b16f30a93824e56fa331d5d9b59dddf285e35'
assert hashlib.sha256(json.dumps(sorted(tidy_tb['cases'])).encode()).hexdigest() == '8a294cd49c8fc1b29b60893e45426a80cb9681be4c699ae950c7103552cc7153', "Your cases column doesn't look as expected"
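
If `melt()` is new to you, this small sketch shows what it does to a wide dataframe. The numbers are invented; the column names follow the pattern used in the WHO data.

```python
import pandas as pd

# Hypothetical wide sample: one column per demographic group
wide = pd.DataFrame({
    "country": ["Andorra", "Andorra"],
    "year": [1993, 1994],
    "new_sp_m014": [1, 2],
    "new_sp_f014": [3, 4],
})

# Every non-id column becomes a (column, cases) pair on its own row
long = pd.melt(wide, id_vars=["country", "year"], var_name="column", value_name="cases")

print(long)
```

Two rows with two value columns become four rows: one row per country, year and original column name.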

## Exercise 6 - Data cleanup

Our dataframe is tidy, but it's **not** clean. From the `tidy_tb` dataframe, drop all the rows where `cases` **OR** `country` is null: we simply don't have that information, and we **cannot** assume that the number of cases is zero or guess the country of origin. The `cases` column should be set as *int*. Save the final dataframe in `clean_tidy_tb`, sorted by `country`, `year` and then `column`. The index should be reset (with `drop=True`).

In [21]:
# Drop rows with missing values (hint: the notnull() function is your friend)
clean_tidy_tb = tidy_tb[tidy_tb[['cases', 'country']].notnull().all(axis=1)]

# Countries that could not be converted are stored as the string "nan": set those rows to missing and drop them
clean_tidy_tb[clean_tidy_tb.country == 'nan'] = None
clean_tidy_tb = clean_tidy_tb.dropna()

# Reset the index
clean_tidy_tb = clean_tidy_tb.reset_index(drop=True)
clean_tidy_tb
# YOUR CODE HERE

Unnamed: 0,country,year,column,cases
0,Andorra,1993.0,new_sp,15.0
1,Andorra,1994.0,new_sp,24.0
2,Andorra,1996.0,new_sp,8.0
3,Andorra,1997.0,new_sp,17.0
4,Andorra,1998.0,new_sp,1.0
...,...,...,...,...
38614,Vanuatu,2008.0,new_sp_fu,0.0
38615,Yemen,2008.0,new_sp_fu,0.0
38616,South Africa,2008.0,new_sp_fu,0.0
38617,Zambia,2008.0,new_sp_fu,0.0


In [22]:
assert isinstance(clean_tidy_tb, pd.DataFrame), "Should be a dataframe"
assert clean_tidy_tb.shape == (38619, 4), "The shape of your dataframe is off"
assert hashlib.sha256(json.dumps(sorted(tidy_tb['country'])).encode()).hexdigest() == '43543c9e06fe9846897c269db635da02fc76dae4527775062a2f66efc707e87e'
assert hashlib.sha256(json.dumps(sorted(tidy_tb['cases'])).encode()).hexdigest() == '8a294cd49c8fc1b29b60893e45426a80cb9681be4c699ae950c7103552cc7153'
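
The cleanup steps described in this exercise (filter, cast, sort, reset) can also be chained without assigning `None` to rows, which keeps the column dtypes intact. This is a sketch on an invented dataframe, not the notebook's reference solution; the `mask` and `clean` names are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical sample with a missing count and an unconverted country
toy = pd.DataFrame({
    "country": ["Zambia", "nan", "Andorra", "Andorra"],
    "year": [2008, 2008, 1994, 1993],
    "column": ["new_sp_fu", "new_sp_fu", "new_sp", "new_sp"],
    "cases": [0.0, 5.0, np.nan, 15.0],
})

# Keep rows that have a case count and a real country name
mask = toy["cases"].notnull() & toy["country"].notnull() & (toy["country"] != "nan")

clean = (
    toy[mask]
    .astype({"cases": int})                      # counts are whole numbers
    .sort_values(["country", "year", "column"])  # order requested in the exercise
    .reset_index(drop=True)
)

print(clean)
```

Filtering with a boolean mask avoids the float conversion of `year` and `cases` that can appear when whole rows are overwritten with missing values.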

## Exercise 7 - Multiple Variables stored in one Column

Our `clean_tidy_tb` is looking better, but now we need to address the problem of having multiple variables stored in the `column` column. Let's fix that in a few steps.



### Exercise 7.1 

In `clean_tidy_tb`, first extract the sex (male or female) and store it in a column named `sex`, and store the age code in a column named `age`. Use pandas' `str.extract` to do this. 

In [23]:
# [mf] matches a single "m" or "f"; inside a character class a "|" would be matched literally
clean_tidy_tb[["sex", "age"]] = clean_tidy_tb["column"].str.extract(r'new_sp_([mf])(.*)')

#drop all missing values
clean_tidy_tb = clean_tidy_tb.dropna()

# YOUR CODE HERE

In [24]:
assert isinstance(clean_tidy_tb, pd.DataFrame), "Should be a dataframe"
assert clean_tidy_tb.shape == (35552, 6), "The shape of your dataframe is off. Have you dropped the missing values?"
assert hashlib.sha256(json.dumps(sorted(clean_tidy_tb['sex'].unique())).encode()).hexdigest() == '1a336f5ee71cf591bfd047e8facc048011b4b2bb760743e979ebe7c445dacf1b'
assert hashlib.sha256(json.dumps(sorted(clean_tidy_tb['age'].unique())).encode()).hexdigest() == '70a6918917681862857b955b14e70f7bc68d0050382ef81945fe963e48135f10'
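
To see what `str.extract` returns, here is a small sketch on a handful of column names taken from the WHO data. Each capture group in the pattern becomes one output column, and names that do not match yield missing values.

```python
import pandas as pd

# A few of the original column names, including the total "new_sp"
columns = pd.Series(["new_sp", "new_sp_m014", "new_sp_f65", "new_sp_fu"])

# Group 1 captures the sex, group 2 captures whatever follows (the age code)
extracted = columns.str.extract(r"new_sp_([mf])(.*)")
extracted.columns = ["sex", "age"]

print(extracted)
```

The `new_sp` total has no sex or age suffix, so both of its fields are missing; this is why dropping missing values after the extraction removes the totals.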

### Exercise 7.2

The values in your `age` column are not very easy to understand. Use the `decode_age` dictionary to convert them to a more readable format.

In [25]:
decode_age = {
    "014": "0-14",
    "1524": "15-24",
    "2534": "25-34",
    "3544": "35-44",
    "4554": "45-54",
    "5564": "55-64",
    "65": "65+",
    "u": "unknown",
}

In [26]:
# map() with a dictionary leaves a missing value wherever the key is not found
clean_tidy_tb["age"] = clean_tidy_tb["age"].map(decode_age)

#drop any row where the values could not be converted
clean_tidy_tb = clean_tidy_tb.dropna()

# YOUR CODE HERE

In [27]:
assert hashlib.sha256(json.dumps(sorted(clean_tidy_tb['age'].unique())).encode()).hexdigest() == '8135dd0c090f9073cbb69a7bbacefd8ad0ecdb6e26415ece93e3fb5f8f5d17e6'
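
The effect of mapping with a dictionary is easy to check on a few values. In this sketch, `decode` is a shortened, hypothetical version of `decode_age`, and `04` stands in for an age code that has no entry.

```python
import pandas as pd

# Shortened lookup for illustration only
decode = {"014": "0-14", "65": "65+", "u": "unknown"}

ages = pd.Series(["014", "04", "65", "u"])

# Codes without an entry in the dictionary become missing values
decoded = ages.map(decode)

print(decoded)
print(decoded.dropna().tolist())
```

This is why the rows with age codes outside the dictionary disappear once the missing values are dropped.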

### Exercise 7.3

Finally, save in `final_tb_df` the dataframe with just the columns "country", "year", "sex", "age" and "cases".

In [28]:
final_tb_df = clean_tidy_tb[['country', 'year', 'sex', 'age', 'cases']]
# YOUR CODE HERE

In [29]:
assert isinstance(final_tb_df, pd.DataFrame), "Should be a dataframe"
assert final_tb_df.shape == (33962, 5), "The shape of your dataframe is off"
assert sorted(final_tb_df.columns) == ['age', 'cases', 'country', 'sex', 'year']

Congratulations!!! You're a data cleaning master!

![cleaning](https://mlgq5aailvbd.i.optimole.com/elKhEKc-U13ryZPl/w:1068/h:712/q:auto/https://www.urbancleanpro.com/wp-content/uploads/2019/11/good-job.jpg)