# Content
1. [Load Data](#1)
1. [Check Data](#2) 
1. [Variable description](#3)
1. [Handle missing values](#4)
1. [Trim extra spaces in text](#5)
1. [Check duplicates](#6)
1. [Other (checking typos, misspellings, etc.)](#7)

<a id = "1"></a>
# Load data

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import seaborn as sns
import matplotlib.pyplot as plt

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

Load the raw (uncleaned) version of the dataset to practice data cleaning.

In [None]:
df = pd.read_csv('../input/cause-of-death-in-indonesia/Penyebab Kematian di Indonesia yang Dilaporkan - Raw.csv')

<a id = "2"></a>

# Check data

In [None]:
df.head()

In [None]:
df.info()

<a id = "3"></a>

# Variable description
<br>1. Cause          : specific cause of the death
<br>2. Type           : category of the cause
<br>3. Year           : year of the occurrence
<br>4. Data Redundancy: number of copies of the same data in the database
<br>5. Total Deaths   : number of deaths
<br>6. Source         : source name
<br>7. Page at Source : page number at the source
<br>8. Source URL     : source url
<br>

<a id = "4"></a>

# Handle missing values

In [None]:
df.isnull().sum()

In [None]:
df[df["Source URL"].isna() | df["Page at Source"].isna()]

The two COVID-19 records are the rows with missing values. Merge their information into a single record and delete the other one. Use a new category, "unknown", for "Source URL" and "Page at Source".

In [None]:
df.loc[df.Cause == "COVID-19", "Total Deaths"] = 22138 + 37889
df.loc[df.Cause == "COVID-19", "Year"] = 2021
df.loc[df.Cause == "COVID-19", "Source URL"] = "unknown"
df.loc[df.Cause == "COVID-19", "Page at Source"] = "unknown"

In [None]:
df_to_drop = df[ df['Cause'] == "COVID-19 (per tanggal 3/7/2021)" ] 
df = df.drop(df_to_drop.index, axis=0)
df.reset_index(drop=True, inplace=True)

In [None]:
df[(df["Cause"]=="COVID-19") | (df["Cause"]=="COVID-19 (per tanggal 3/7/2021)")]

In [None]:
df.isnull().sum()

The missing values are now gone.

In [None]:
df.info()

<a id = "5"></a>

# Trim extra spaces in text

Remove leading and trailing spaces, and collapse repeated spaces into a single space.

In [None]:
text_cols = ['Cause', 'Type', 'Source', 'Page at Source', 'Source URL']
for col in text_cols:
    # Collapse runs of whitespace into one space, then trim both ends.
    # Series.replace with regex=True only touches string values.
    df[col] = (df[col]
               .replace(r'\s+', ' ', regex=True)
               .replace(r'^ | $', '', regex=True))
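As a small illustration of the whitespace cleanup, here is a sketch on a toy Series. The helper name `normalize_spaces` and the sample values are made up for this example and are not part of the dataset. It uses `Series.replace` with `regex=True`, which leaves non-string values such as numbers untouched.

```python
import pandas as pd

def normalize_spaces(series: pd.Series) -> pd.Series:
    """Collapse whitespace runs to one space and trim both ends (strings only)."""
    # Step 1: any run of whitespace becomes a single space.
    collapsed = series.replace(r"\s+", " ", regex=True)
    # Step 2: at most one space can remain at each end, so remove it.
    return collapsed.replace(r"^ | $", "", regex=True)

# Hypothetical sample values, including a non-string entry
sample = pd.Series(["  Demam   Berdarah ", "Malaria", 12])
cleaned = normalize_spaces(sample)
print(cleaned.tolist())
```

The two-step approach is one option among several; `.str.strip()` would also trim text, but it turns non-string entries into missing values, which matters for a mixed column.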

<a id = "6"></a>

# Check duplicates

First, check for rows that are exact duplicates across all columns, and keep only one copy of each.

In [None]:
df[df.duplicated(keep=False)]

In [None]:
df.drop_duplicates(inplace=True)

In [None]:
df[df.duplicated(keep=False)]

We successfully removed the duplicated data we found earlier.

Now check for duplicates again, but this time ignore the "Page at Source" column. As the output below shows, there are more duplicated rows that differ only in their "Page at Source" value. Keep the first of each and remove the rest.

In [None]:
df[df.duplicated(['Cause', 'Type', 'Year', 'Data Redundancy', 'Total Deaths', 'Source', 'Source URL'], keep=False)]

In [None]:
df.drop_duplicates(['Cause', 'Type', 'Year', 'Data Redundancy', 'Total Deaths', 'Source', 'Source URL'], keep='first', inplace=True)

In [None]:
df[df.duplicated(['Cause', 'Type', 'Year', 'Data Redundancy', 'Total Deaths', 'Source', 'Source URL'], keep=False)]

The duplicated rows are gone.
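The idea of ignoring one column when looking for duplicates can be sketched on a toy frame. The rows below are invented for illustration. Building the list of key columns once, and reusing it for both the check and the drop, helps keep the two steps consistent.

```python
import pandas as pd

# Hypothetical rows: the first two differ only in "Page at Source"
toy = pd.DataFrame({
    "Cause": ["Malaria", "Malaria", "Petir"],
    "Year": [2010, 2010, 2010],
    "Total Deaths": [5, 5, 1],
    "Page at Source": ["12", "13", "40"],
})

# Compare on every column except the one we want to ignore
key_cols = [c for c in toy.columns if c != "Page at Source"]

# keep=False marks every member of a duplicate group, not just the extras
dup_mask = toy.duplicated(subset=key_cols, keep=False)

# keep="first" retains the first row of each duplicate group
deduped = toy.drop_duplicates(subset=key_cols, keep="first").reset_index(drop=True)
print(deduped)
```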

In [None]:
df.reset_index(drop=True, inplace=True)
df.info()

<a id = "7"></a>
# Other (checking typos, misspellings, etc.)

value_counts() can help spot typos in categorical columns: a misspelled label usually shows up as a separate value with a very low count.

In [None]:
df["Type"].value_counts()

In [None]:
df["Source"].value_counts()

In [None]:
df["Source URL"].value_counts()
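Reading value_counts() output by eye can miss labels that differ only in letter case or stray spaces. One possible complement, sketched here on invented labels, is to group the raw labels by a normalized key and flag keys that have more than one raw spelling. The sample values are hypothetical.

```python
import pandas as pd

# Hypothetical labels: three spellings of the same category
labels = pd.Series(["Bencana Alam", "bencana alam", "Bencana Alam ", "Penyakit"])

# Normalized comparison key: trimmed and case-folded
key = labels.str.strip().str.casefold()

# Number of distinct raw spellings behind each normalized key
spellings_per_key = labels.groupby(key).nunique()

# Keys with more than one raw spelling are candidates for cleanup
suspects = spellings_per_key[spellings_per_key > 1]
print(suspects)
```

This only catches case and spacing differences; real misspellings still need manual review.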

In [None]:
def set_pandas_display_options() -> None:
    """Set pandas display options."""
    # Ref: https://stackoverflow.com/a/52432757/
    display = pd.options.display

    display.max_columns = 1000
    display.max_rows = 1000
    display.max_colwidth = 199
    display.width = 1000
    # display.precision = 2  # set as needed

set_pandas_display_options()

In [None]:
df["Cause"].value_counts().tail(100)

* "Keracunan" ≈ "Keracunan/KLB" ≈ "KLB Keracunan" 
* "Tuberkulosis paru lainnya" ≈ "Tuberkulosis paru" ≈ "Tuberkulosis"
* "Petir" ≈ "Tersambar Petir"
* etc.; these categories still need to be consolidated
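One way to fix such near-duplicate labels is an explicit mapping from each variant to a canonical name. The sketch below runs on a toy frame. Which spelling counts as canonical is an assumption made here for illustration. The "Tuberkulosis" variants are deliberately left out, on the assumption that they may describe different diagnoses and need a closer look at the source first.

```python
import pandas as pd

# Hypothetical mapping: the choice of canonical label is an assumption
cause_map = {
    "Keracunan/KLB": "Keracunan",
    "KLB Keracunan": "Keracunan",
    "Tersambar Petir": "Petir",
}

# Toy data with invented death counts
toy = pd.DataFrame({
    "Cause": ["Keracunan", "KLB Keracunan", "Tersambar Petir", "Malaria"],
    "Total Deaths": [3, 2, 1, 7],
})

# A dict passed to Series.replace swaps exact matches and leaves other values as-is
toy["Cause"] = toy["Cause"].replace(cause_map)
print(toy["Cause"].value_counts())
```

After relabeling, rows that now share the same cause and year may need to be reviewed again for duplicates or combined, depending on what the sources report.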