In [12]:
import pandas as pd
from collections import namedtuple

# Load EM-DAT Dataset

### What is the EM-DAT database and what is it used for?

The EM-DAT dataset catalogs over 26,000 mass disasters worldwide from 1900 to the present day.

It is critical to understand that EM-DAT does not catalog all natural disasters worldwide. It instead focuses on mass disasters.

According to EM-DAT and CRED, a mass disaster is a specific type of natural disaster that leads to significant human and economic loss, requiring that at least one of the following criteria hold:

- 100 or more affected people
- 10 or more fatalities
- A declaration of a state of emergency (at the country level)
- A call for international assistance (again, at the country level)
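
The "at least one criterion" rule above can be sketched as a small predicate. This is only an illustration: the function name `meets_entry_criteria` and its parameters are hypothetical, and the thresholds are read as minimums (10 or more fatalities, 100 or more affected people), which is an assumption about how the criteria are applied.

```python
def meets_entry_criteria(deaths=0, affected=0,
                         emergency_declared=False,
                         assistance_requested=False):
    """Return True if at least one of the four criteria listed above holds.

    Hypothetical helper: thresholds are treated as minimums (assumption).
    """
    return (
        deaths >= 10
        or affected >= 100
        or emergency_declared
        or assistance_requested
    )


# Few deaths, but enough affected people to qualify
qualifies = meets_entry_criteria(deaths=3, affected=250)

# Below both numeric thresholds and no declaration or appeal
too_small = meets_entry_criteria(deaths=3, affected=40)

print(qualifies, too_small)
```

An event only needs to satisfy one condition, so a small event with a declared state of emergency would still be cataloged under this reading.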


In [4]:
em_dat_data = pd.read_excel("EM_DAT.xlsx")

## Explanation of EM-DAT Dataset Columns

- **Dis No**: Unique identifier for each recorded disaster event.
- **Year**: The year the disaster occurred.
- **Seq**: Sequence number representing the occurrence order of events in a given year.
- **Disaster Group**: Broad categorization of the disaster (e.g., natural or technological).
- **Disaster Subgroup**: More specific category within the main disaster group (e.g., biological, climatological, hydrological, etc.).
- **Disaster Type**: Specific type of disaster (e.g., wildfire, flood, storm, etc.).
- **Disaster Subtype**: Further subclassification of disaster type (e.g., forest fire, flash flood, tropical cyclone, etc.).
- **Disaster Subsubtype**: Even more detailed classification within the subtype (e.g., mudslide, blizzard, tornado, etc.).
- **Event Name**: Name given to the disaster event, if any (e.g., Tropical Storm Noul, Typhoon Molave, etc.).
- **Entry Criteria**: Criteria that justified the event’s inclusion in the database.
- **Country**: Country where the disaster occurred.
- **ISO**: The ISO code representation of the country (e.g., USA, SRB, YEM, etc.).
- **Region**: Geographical region in which the country is located (e.g., Northern America, Southern Europe, Caribbean, etc.).
- **Continent**: The continent where the disaster took place.
- **Location**: Specific location or city affected by the disaster (note: this column is very inconsistent — values range from state/province, city, county, etc.).
- **Origin**: Root cause or source of the disaster (e.g., heavy rains, earthquake, landslide, etc.).
- **Associated Dis**: Related disaster events, if any (e.g., famine, industrial accident, heat wave, etc.).
- **Associated Dis2**: Secondary related disaster events.
- **OFDA Response**: U.S. Office of Foreign Disaster Assistance’s response, if any (values are either "yes" or NaN).
- **Appeal**: Any international appeals for assistance (values are either "yes", "no", or NaN).
- **Declaration**: Declarations made regarding the disaster (values are either "yes", "no", or NaN).
- **Aid Contribution**: Amount of aid contributed (reported in thousands of US dollars) in response to the disaster.
- **Dis Mag Value**: Numeric value representing the magnitude of the disaster (must be used in conjunction with Dis Mag Scale to properly interpret).
- **Dis Mag Scale**: Scale used to measure the disaster’s magnitude, such as KPH, Richter, etc. For example, if Dis Mag Value has a value of 110, and Dis Mag Scale reports KPH, then the natural disaster was reported as having a (presumed) wind speed of 110 KPH.
- **Latitude**: Geographic latitude of the disaster’s epicenter or main affected area.
- **Longitude**: Geographic longitude of the disaster.
- **Local Time**: Local time when the disaster occurred or was first reported.
- **River Basin**: The river basin affected (only applicable for flood events).
- **Start Year**: Year the disaster event started.
- **End Year**: Year the disaster event ended.
- **Start Month**: Month the disaster event started.
- **Start Day**: Day the disaster event started.
- **End Month**: Month the disaster event ended.
- **End Day**: Day the disaster event ended.
- **Total Deaths**: Total number of deaths caused by the disaster.
- **No Injured**: Number of individuals injured due to the disaster.
- **No Affected**: Number of individuals affected in any way by the disaster.
- **No Homeless**: Number of individuals rendered homeless by the disaster.
- **Total Affected**: Combined total of injured, affected, and homeless individuals.
- **Reconstruction Costs (‘000 US$)**: Estimated costs in thousands of US dollars for reconstruction after the disaster.
- **Insured Damages (‘000 US$)**: Estimated damages in thousands of US dollars covered by insurance.
- **Total Damages (‘000 US$)**: Total estimated damages in thousands of US dollars due to the disaster.
- **CPI**: Consumer Price Index at the time of the disaster, useful for adjusting costs over time.
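
To make the role of **CPI** concrete, here is a minimal sketch of a cost adjustment. The formula `nominal * 100 / CPI` is an assumption inferred from a single row of the `head()` preview in this notebook (total damage 30000.0, adjusted total damage 1098720.0, CPI 2.730451), not from EM-DAT documentation. The helper name `adjust_for_cpi` and the output column name are hypothetical.

```python
import pandas as pd


def adjust_for_cpi(nominal, cpi):
    """Scale a nominal amount by 100 / CPI.

    Assumes the CPI is expressed relative to a reference period with CPI = 100.
    """
    return nominal * 100.0 / cpi


# Two toy rows: the first mirrors the first row of the preview,
# the second has no reported damage.
sample = pd.DataFrame({
    "Total Damage ('000 US$)": [30000.0, None],
    "CPI": [2.730451, 2.730451],
})

sample["Adjusted (sketch)"] = adjust_for_cpi(
    sample["Total Damage ('000 US$)"], sample["CPI"]
)

print(sample)
```

Missing damage values stay missing after the adjustment, so the adjusted column is exactly as sparse as the nominal one.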

In [5]:
em_dat_data.head(5)

Unnamed: 0,DisNo.,Historic,Classification Key,Disaster Group,Disaster Subgroup,Disaster Type,Disaster Subtype,External IDs,Event Name,ISO,...,Reconstruction Costs ('000 US$),"Reconstruction Costs, Adjusted ('000 US$)",Insured Damage ('000 US$),"Insured Damage, Adjusted ('000 US$)",Total Damage ('000 US$),"Total Damage, Adjusted ('000 US$)",CPI,Admin Units,Entry Date,Last Update
0,1900-0003-USA,Yes,nat-met-sto-tro,Natural,Meteorological,Storm,Tropical cyclone,,,USA,...,,,,,30000.0,1098720.0,2.730451,,2004-10-18,2023-10-17
1,1900-0005-USA,Yes,tec-ind-fir-fir,Technological,Industrial accident,Fire (Industrial),Fire (Industrial),,,USA,...,,,,,,,2.730451,,2003-07-01,2023-09-25
2,1900-0006-JAM,Yes,nat-hyd-flo-flo,Natural,Hydrological,Flood,Flood (General),,,JAM,...,,,,,,,2.730451,,2003-07-01,2023-09-25
3,1900-0007-JAM,Yes,nat-bio-epi-vir,Natural,Biological,Epidemic,Viral disease,,Gastroenteritis,JAM,...,,,,,,,2.730451,,2003-07-01,2023-09-25
4,1900-0008-JPN,Yes,nat-geo-vol-ash,Natural,Geophysical,Volcanic activity,Ash fall,,,JPN,...,,,,,,,2.730451,,2003-07-01,2023-09-25
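
The `DisNo.` values in the preview (for example `1900-0003-USA`) look like a year, a sequence number, and an ISO code joined by hyphens. Assuming that pattern holds for every row, which is not verified here, the identifier can be split into its parts. The frame `preview` below is a hand-copied toy sample, not the full dataset.

```python
import pandas as pd

# Toy sample copied from the preview above
preview = pd.DataFrame(
    {"DisNo.": ["1900-0003-USA", "1900-0006-JAM", "1900-0008-JPN"]}
)

# Split on the first two hyphens only, so the last part is kept whole
parts = preview["DisNo."].str.split("-", n=2, expand=True)
parts.columns = ["Year", "Seq", "ISO"]

# Year and sequence number are numeric; ISO stays a string
parts["Year"] = parts["Year"].astype(int)
parts["Seq"] = parts["Seq"].astype(int)

print(parts)
```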


# Basic EDA

In [8]:
em_dat_data.shape

(26545, 46)

In [9]:
em_dat_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 26545 entries, 0 to 26544
Data columns (total 46 columns):
 #   Column                                     Non-Null Count  Dtype  
---  ------                                     --------------  -----  
 0   DisNo.                                     26545 non-null  object 
 1   Historic                                   26545 non-null  object 
 2   Classification Key                         26545 non-null  object 
 3   Disaster Group                             26545 non-null  object 
 4   Disaster Subgroup                          26545 non-null  object 
 5   Disaster Type                              26545 non-null  object 
 6   Disaster Subtype                           26545 non-null  object 
 7   External IDs                               2401 non-null   object 
 8   Event Name                                 8324 non-null   object 
 9   ISO                                        26545 non-null  object 
 10  Country               

In [16]:
def basic_eda(df):
    """Return a small table of dataset-level quality checks."""
    n_rows, n_cols = df.shape
    is_nan = df.isnull()  # boolean mask, computed once and reused

    num_duplicated = df.duplicated().sum()    # fully identical rows
    num_null_rows = is_nan.any(axis=1).sum()  # rows with at least one NaN
    num_total_null = is_nan.sum().sum()       # NaN cells in the whole frame

    EDARow = namedtuple("EDARow", ["Name", "Value", "Notes"])

    rows = [
        EDARow("Samples", n_rows, ""),
        EDARow("Features", n_cols, ""),
        EDARow("Duplicate Rows", num_duplicated, ""),
        EDARow("Rows With NaN", num_null_rows,
               "{:.2f}% all rows".format(num_null_rows / n_rows * 100)),
        EDARow("Total NaNs", num_total_null,
               "{:.2f}% features matrix".format(
                   num_total_null / (n_rows * n_cols) * 100)),
    ]

    return pd.DataFrame(rows, columns=["Name", "Value", "Notes"])

In [17]:
basic_eda(em_dat_data)

Unnamed: 0,Name,Value,Notes
0,Samples,26545,
1,Features,46,
2,Duplicate Rows,0,
3,Rows With NaN,26545,100.00% all rows
4,Total NaNs,463985,38.00% features matrix


### Further summarizing the dataset because it's too noisy

In [20]:
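def summarize_data(df):
    """Return one row per column with its dtype, missing count, and unique count."""
    summary = pd.DataFrame(df.dtypes, columns=["dtypes"])

    # Move the column names out of the index into a regular "Name" column
    summary = summary.reset_index().rename(columns={"index": "Name"})
    summary = summary[["Name", "dtypes"]]

    summary["Missing"] = df.isnull().sum().values  # NaN count per column
    summary["Unique"] = df.nunique().values        # distinct non-null values per column

    return summary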

In [21]:
summarize_data(em_dat_data)

Unnamed: 0,Name,dtypes,Missing,Unique
0,DisNo.,object,0,26545
1,Historic,object,0,2
2,Classification Key,object,0,66
3,Disaster Group,object,0,2
4,Disaster Subgroup,object,0,9
5,Disaster Type,object,0,32
6,Disaster Subtype,object,0,66
7,External IDs,object,24144,1822
8,Event Name,object,18221,3596
9,ISO,object,0,231
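
Raw missing counts are hard to compare at a glance, so one possible follow-up is to express them as a share of all rows and sort the columns by sparsity. This is a sketch on a toy frame, not the notebook's own next step. The helper name `missing_share` is hypothetical, and the toy values only imitate the shape of the real columns.

```python
import pandas as pd


def missing_share(df):
    """Percentage of missing values per column, sparsest column first."""
    return (df.isnull().mean() * 100).sort_values(ascending=False)


# Toy frame imitating three of the columns summarized above
toy = pd.DataFrame({
    "DisNo.": ["1900-0003-USA", "1900-0005-USA", "1900-0006-JAM", "1900-0007-JAM"],
    "Event Name": [None, None, None, "Gastroenteritis"],
    "External IDs": [None, None, None, None],
})

shares = missing_share(toy)
print(shares)
```

Applied to the full dataset, the same call would rank columns such as `External IDs` and `Event Name` by how much of each is missing, which helps decide which columns to drop or impute.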
