# IT362: Principles of Data Science

## Phase 1: Data Collection Research and Assessment  

### 1. Introduction
The **Natural Disaster Prediction Model: PREDINA** aims to analyze and predict the impacts of natural disasters based on historical data, with a focus on enhancing disaster preparedness and response strategies. The main research question guiding this project is: *How can historical data on natural disasters inform future predictions and improve community resilience?*

### 2. Importing Libraries
In this section, we will import the necessary libraries for our analysis and data processing.

In [75]:
# Install required libraries
!pip install openpyxl

# Import libraries
import pandas as pd
import requests
import csv


[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip is available: [0m[31;49m24.3.1[0m[39;49m -> [0m[32;49m25.0[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49mpython3 -m pip install --upgrade pip[0m


### 3. Datasets

- #### **EM-DAT: The International Disaster Database**
The EM-DAT Public Table is a global disaster database maintained by CRED, tracking natural and technological disasters. It includes data on fatalities, affected populations, and economic damages, and is used for research and disaster management.

In [76]:
emdat_df = pd.read_excel("Datasets/emdat.xlsx")
print(emdat_df.head())

          DisNo. Historic Classification Key Disaster Group Disaster Subgroup  \
0  1999-9388-DJI       No    nat-cli-dro-dro        Natural    Climatological   
1  1999-9388-SDN       No    nat-cli-dro-dro        Natural    Climatological   
2  1999-9388-SOM       No    nat-cli-dro-dro        Natural    Climatological   
3  2000-0001-AGO       No    tec-tra-roa-roa  Technological         Transport   
4  2000-0002-AGO       No    nat-hyd-flo-riv        Natural      Hydrological   

  Disaster Type Disaster Subtype External IDs Event Name  ISO  ...  \
0       Drought          Drought          NaN        NaN  DJI  ...   
1       Drought          Drought          NaN        NaN  SDN  ...   
2       Drought          Drought          NaN        NaN  SOM  ...   
3          Road             Road          NaN        NaN  AGO  ...   
4         Flood   Riverine flood          NaN        NaN  AGO  ...   

  Reconstruction Costs ('000 US$) Reconstruction Costs, Adjusted ('000 US$)  \
0            

- #### **Kaggle Dataset: ALL NATURAL DISASTERS 1900-2021 / EOSDIS**
This dataset, hosted on Kaggle, provides a record of natural disasters worldwide from 1900 to 2021, sourced from NASA's Earth Observing System Data and Information System (EOSDIS). It includes details such as disaster type, location, dates, and impacts (e.g., fatalities, affected populations, and economic damages), making it useful for analyzing historical disaster trends and impacts.

In [77]:
kaggle_df = pd.read_csv("Datasets/EOSDIS.csv")
print(kaggle_df.head())

   Year   Seq Glide Disaster Group Disaster Subgroup      Disaster Type  \
0  1900  9002   NaN        Natural    Climatological            Drought   
1  1900  9001   NaN        Natural    Climatological            Drought   
2  1902    12   NaN        Natural       Geophysical         Earthquake   
3  1902     3   NaN        Natural       Geophysical  Volcanic activity   
4  1902    10   NaN        Natural       Geophysical  Volcanic activity   

  Disaster Subtype Disaster Subsubtype   Event Name     Country  ...  \
0          Drought                 NaN          NaN  Cabo Verde  ...   
1          Drought                 NaN          NaN       India  ...   
2  Ground movement                 NaN          NaN   Guatemala  ...   
3         Ash fall                 NaN  Santa Maria   Guatemala  ...   
4         Ash fall                 NaN  Santa Maria   Guatemala  ...   

  No Affected No Homeless Total Affected Insured Damages ('000 US$)  \
0         NaN         NaN            NaN     

- #### **Global Disaster Alert and Coordination System (GDACS)**
The GDACS API provides real-time alerts on natural disasters such as earthquakes, tsunamis, and storms, offering data on disaster type, location, magnitude, and impact. It is useful for monitoring, coordinating disaster response efforts, and supplementing data for building predictive models.

In [78]:
import requests
import json

def save_raw_dataset():
    LATEST_EVENTS_URL = 'https://www.gdacs.org/gdacsapi/api/events/geteventlist/EVENTS4APP'
    
    try:
        # Send a GET request to the API; the timeout keeps the call from hanging indefinitely
        response = requests.get(LATEST_EVENTS_URL, timeout=30)
        response.raise_for_status()
        
        # Save the raw JSON data to a file
        with open('gdacs_dataset.json', 'w') as f:
            json.dump(response.json(), f)
        
        print("Raw dataset saved to 'gdacs_dataset.json'.")

    except requests.exceptions.RequestException as e:
        print(f"Error fetching data: {e}")

# Example usage
if __name__ == "__main__":
    save_raw_dataset()

Raw dataset saved to 'gdacs_dataset.json'.


In [79]:
import json
import csv

def load_and_extract_columns():
    # Load the raw dataset from the file written by save_raw_dataset()
    with open('gdacs_dataset.json', 'r') as f:
        dataset = json.load(f)
    
    # Extract specific columns from the dataset
    extracted_data = []
    for event in dataset.get('features', []):
        properties = event.get('properties', {})
        coordinates = event["geometry"]["coordinates"]
        latitude = coordinates[1]  # Lat from GeoJSON
        longitude = coordinates[0]  # Lon from GeoJSON

        # Extract magnitude and its unit directly from severitydata
        severity_data = properties.get("severitydata", {})
        magnitude = severity_data.get("severity", 0)
        magnitude_unit = severity_data.get("severityunit", "")  # Get the unit

        extracted_data.append({
            "Event Name": properties.get("eventname", "Unknown"),
            "Country": properties.get("country", "Unknown"),
            "ISO": properties.get("iso3", "Unknown"),
            "Disaster Group": properties.get("eventtype", "Unknown"),
            "Latitude": latitude,
            "Longitude": longitude,
            "Start Year": properties["fromdate"][:4],  # Extract year from date
            "Start Month": properties["fromdate"][5:7],  # Extract month
            "Start Day": properties["fromdate"][8:10],  # Extract day
            "End Year": properties["todate"][:4] if properties["todate"] else None,
            "End Month": properties["todate"][5:7] if properties["todate"] else None,
            "End Day": properties["todate"][8:10] if properties["todate"] else None,
            "Magnitude": magnitude,  # Directly include the magnitude value
            "Magnitude Unit": magnitude_unit,  # Include the magnitude unit
            "Losses": properties.get("economicloss", None)
        })
    
    return extracted_data

def save_to_csv(data, filename='gdacs_disasters.csv'):
    # Save the extracted data to a CSV file; skip writing when there are no events
    if not data:
        print("No data to save.")
        return

    with open(filename, 'w', newline='', encoding='utf-8') as csvfile:
        fieldnames = list(data[0].keys())  # Get the column names from the first data entry
        writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
        
        writer.writeheader()  # Write the header
        writer.writerows(data)  # Write the data rows

    print(f"Data saved to {filename}.")

# Example usage
if __name__ == "__main__":
    data = load_and_extract_columns()
    print(data)  # Print the extracted data
    save_to_csv(data)  # Save the extracted data as a CSV file

[{'Event Name': '', 'Country': 'Jordan', 'ISO': 'JOR', 'Disaster Group': 'FL', 'Latitude': 31.1847222, 'Longitude': 35.7047218, 'Start Year': '2025', 'Start Month': '02', 'Start Day': '05', 'End Year': '2025', 'End Month': '02', 'End Day': '07', 'Magnitude': 0.0, 'Magnitude Unit': '', 'Losses': None}, {'Event Name': '', 'Country': 'Papua New Guinea', 'ISO': 'PNG', 'Disaster Group': 'EQ', 'Latitude': -2.8138, 'Longitude': 141.6201, 'Start Year': '2025', 'Start Month': '02', 'Start Day': '06', 'End Year': '2025', 'End Month': '02', 'End Day': '06', 'Magnitude': 5.1, 'Magnitude Unit': 'M', 'Losses': None}, {'Event Name': '', 'Country': 'Indonesia', 'ISO': 'IDN', 'Disaster Group': 'EQ', 'Latitude': 4.4689, 'Longitude': 126.6945, 'Start Year': '2025', 'Start Month': '02', 'Start Day': '06', 'End Year': '2025', 'End Month': '02', 'End Day': '06', 'Magnitude': 5.4, 'Magnitude Unit': 'M', 'Losses': None}, {'Event Name': '', 'Country': 'Chile', 'ISO': 'CHL', 'Disaster Group': 'EQ', 'Latitude': 
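
The extraction loop above depends on two conventions that are easy to get wrong: GeoJSON stores coordinates as `[longitude, latitude]`, and the date fields are sliced as ISO 8601 strings. The sketch below repeats that step on a single hand-written feature, parsing the date instead of slicing it. The `sample_feature` values and the helper name `split_iso_date` are illustrative assumptions, not part of the GDACS response or of the notebook's code.

```python
from datetime import datetime

def split_iso_date(value):
    """Return (year, month, day) from an ISO 8601 string, or three Nones if empty."""
    if not value:
        return (None, None, None)
    parsed = datetime.fromisoformat(value)
    return (parsed.year, parsed.month, parsed.day)

# Hypothetical feature shaped like one GeoJSON entry; the values are made up.
sample_feature = {
    "geometry": {"type": "Point", "coordinates": [141.6201, -2.8138]},
    "properties": {"fromdate": "2025-02-06T04:10:00", "todate": None},
}

# GeoJSON order is longitude first, then latitude.
longitude, latitude = sample_feature["geometry"]["coordinates"]
start = split_iso_date(sample_feature["properties"].get("fromdate"))
end = split_iso_date(sample_feature["properties"].get("todate"))

print(latitude, longitude, start, end)
```

Unlike string slicing, this variant yields integers (for example `2` rather than `'02'`) and raises a `ValueError` on a malformed date rather than silently returning a wrong substring.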

Convert the column data types so that both DataFrames match before integration.

In [80]:
numeric_columns = ["Latitude", "Longitude", "Start Year", "Start Month", "Start Day", 
                   "End Year", "End Month", "End Day", "Total Deaths", "No. Injured", 
                   "No. Affected", "No. Homeless", "Total Affected", "Total Damages ('000 US$)", "CPI", "Magnitude"]

for col in numeric_columns:
    if col in emdat_df.columns:
        emdat_df[col] = pd.to_numeric(emdat_df[col], errors='coerce')  # Convert to numeric, set errors to NaN
    if col in kaggle_df.columns:
        kaggle_df[col] = pd.to_numeric(kaggle_df[col], errors='coerce')
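
With `errors='coerce'`, any value that cannot be parsed becomes `NaN` without a warning, so it can be worth counting how many values were lost. The following is a minimal sketch on a hypothetical toy column, not on the project's datasets:

```python
import pandas as pd

# Hypothetical column mixing clean numbers, free text, and a missing value.
raw = pd.Series(["12", "3.5", "unknown", None])

converted = pd.to_numeric(raw, errors="coerce")

# Count values that were present before conversion but are NaN after it.
newly_missing = int((raw.notna() & converted.isna()).sum())

print(converted.tolist())
print(f"Values lost to coercion: {newly_missing}")
```

The same comparison could be run per column inside the loop above to flag columns where coercion discards a large share of entries.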

In [81]:
string_columns = ["Year", "Disaster Group", "Disaster Subgroup", "Disaster Type", "ISO", "Magnitude Scale"]

for col in string_columns:
    if col in emdat_df.columns:
        emdat_df[col] = emdat_df[col].astype(str).str.strip()  # Ensure string format
    if col in kaggle_df.columns:
        kaggle_df[col] = kaggle_df[col].astype(str).str.strip()

print("Updated df1 column types:\n", emdat_df.dtypes)
print("Updated df2 column types:\n", kaggle_df.dtypes)

Updated df1 column types:
 DisNo.                                        object
Historic                                      object
Classification Key                            object
Disaster Group                                object
Disaster Subgroup                             object
Disaster Type                                 object
Disaster Subtype                              object
External IDs                                  object
Event Name                                    object
ISO                                           object
Country                                       object
Subregion                                     object
Region                                        object
Location                                      object
Origin                                        object
Associated Types                              object
OFDA/BHA Response                             object
Appeal                                        object
Declaration        
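
One caveat with `astype(str)`: depending on the pandas version, missing values may be turned into the literal text `"nan"`, which then no longer counts as missing. The sketch below, on a hypothetical toy column, compares it with the nullable `"string"` dtype, which keeps missing entries missing. Whether this matters for the project's columns is an assumption to check against the real data.

```python
import numpy as np
import pandas as pd

# Hypothetical ISO column with stray whitespace and one missing entry.
iso = pd.Series([" DJI", "SDN ", np.nan])

as_plain_str = iso.astype(str).str.strip()
as_string_dtype = iso.astype("string").str.strip()

# How many entries became the literal text "nan" (version-dependent).
literal_nan_count = int((as_plain_str == "nan").sum())

print(f"Literal 'nan' strings after astype(str): {literal_nan_count}")
print(as_string_dtype.isna().tolist())
```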

### 4. Data Integration

In [82]:
emdat_df['DisNo.'] = emdat_df['DisNo.'].astype(str).str[:4]

emdat_df.rename(columns={'DisNo.': 'Year'}, inplace=True)
emdat_df.rename(columns={"Total Damage, Adjusted ('000 US$)": "Total Damages ('000 US$)"}, inplace=True)

kaggle_df.rename(columns={'Dis Mag Value': 'Magnitude'}, inplace=True)
kaggle_df.rename(columns={'Dis Mag Scale': 'Magnitude Scale'}, inplace=True)
kaggle_df.rename(columns={'No Injured': 'No. Injured'}, inplace=True)
kaggle_df.rename(columns={'No Affected': 'No. Affected'}, inplace=True)
kaggle_df.rename(columns={'No Homeless': 'No. Homeless'}, inplace=True)
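
The renames above could also be kept in a single mapping, followed by a check that every column needed for integration is present. This is a minimal sketch on a hypothetical two-row frame; `sample`, `rename_map`, and `required` are illustrative names, and the column list is shortened.

```python
import pandas as pd

# Hypothetical frame using Kaggle-style column names; the values are made up.
sample = pd.DataFrame({
    "Year": [1900, 1902],
    "Dis Mag Value": [None, 7.5],
    "No Injured": [None, 12],
})

# One mapping keeps all renames in a single place; absent keys are ignored.
rename_map = {
    "Dis Mag Value": "Magnitude",
    "Dis Mag Scale": "Magnitude Scale",
    "No Injured": "No. Injured",
}
renamed = sample.rename(columns=rename_map)

# Report required columns that are still absent before selecting them.
required = ["Year", "Magnitude", "Magnitude Scale", "No. Injured"]
missing = [col for col in required if col not in renamed.columns]

print(list(renamed.columns))
print(f"Missing columns: {missing}")
```

Checking for missing columns first gives a readable message instead of the `KeyError` that selecting a non-existent column would raise.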


In [83]:
import pandas as pd

# Columns to keep from the EM-DAT DataFrame (df1)
df1_columns = [
    "Year", "Disaster Group", "Disaster Subgroup", "Disaster Type", "ISO", 
    "Latitude", "Longitude", "Start Year", "Start Month", "Start Day", 
    "End Year", "End Month", "End Day", "Total Deaths", "No. Injured", 
    "No. Affected", "No. Homeless", "Total Affected", "Total Damages ('000 US$)", 
    "CPI", "Magnitude", "Magnitude Scale"
]

# Columns to keep from the Kaggle DataFrame (df2)
df2_columns = [
    "Year", "Disaster Group", "Disaster Subgroup", "Disaster Type", "ISO", 
    "Latitude", "Longitude", "Start Year", "Start Month", "Start Day", 
    "End Year", "End Month", "End Day", "Total Deaths", "No. Injured", 
    "No. Affected", "No. Homeless", "Total Affected", "Total Damages ('000 US$)", 
    "CPI", "Magnitude", "Magnitude Scale"
]

# Select only the specified columns from both dataframes
df1_selected = emdat_df[df1_columns]
df2_selected = kaggle_df[df2_columns]

# Concatenate the two DataFrames vertically (adding rows);
# ignore_index=True already produces a clean 0..n-1 index
integrated_df = pd.concat([df1_selected, df2_selected], axis=0, ignore_index=True)

# Display the resulting DataFrame
print(integrated_df)

# Print number of rows after combining
print(f"Rows after combining: {len(integrated_df)}")

# If the combined data isn't empty, save it
if not integrated_df.empty:
    integrated_df.to_csv('Datasets/integrated_data.csv', index=False)
    print("Combined data successfully saved!")
else:
    print("Combined dataset is empty!")


       Year Disaster Group Disaster Subgroup Disaster Type  ISO  Latitude  \
0      1999        Natural    Climatological       Drought  DJI       NaN   
1      1999        Natural    Climatological       Drought  SDN       NaN   
2      1999        Natural    Climatological       Drought  SOM       NaN   
3      2000  Technological         Transport          Road  AGO       NaN   
4      2000        Natural      Hydrological         Flood  AGO       NaN   
...     ...            ...               ...           ...  ...       ...   
32225  2021        Natural      Hydrological         Flood  YEM       NaN   
32226  2021        Natural      Hydrological         Flood  ZAF       NaN   
32227  2021        Natural        Biological      Epidemic  COD       NaN   
32228  2021        Natural      Hydrological         Flood  SRB       NaN   
32229  2021        Natural      Hydrological         Flood  SSD       NaN   

       Longitude  Start Year  Start Month  Start Day  ...  End Day  \
0    
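
When two sources cover overlapping years, stacking them can repeat the same event. One way to keep this visible is to tag each row with its origin before concatenating and then list rows that share key columns. The sketch below uses hypothetical toy frames, and the `Year` plus `ISO` key is far too coarse for real data; it only illustrates the mechanism.

```python
import pandas as pd

# Hypothetical stand-ins for the two selected frames.
first = pd.DataFrame({"Year": ["1999", "2000"], "ISO": ["DJI", "AGO"]})
second = pd.DataFrame({"Year": ["2000", "2021"], "ISO": ["AGO", "YEM"]})

# Tag each row with its origin before stacking the frames.
combined = pd.concat(
    [first.assign(Source="EM-DAT"), second.assign(Source="Kaggle")],
    ignore_index=True,
)

# Rows sharing the same key columns across sources are candidate duplicates.
key_columns = ["Year", "ISO"]
candidates = combined[combined.duplicated(subset=key_columns, keep=False)]

print(combined)
print(f"Candidate duplicates: {len(candidates)}")
```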