# IT362: Principles of Data Science

## Phase 1: Data Collection Research and Assessment  

### 1. Introduction
The **Natural Disaster Prediction Model: PREDINA** aims to analyze and predict the impacts of natural disasters based on historical data, with a focus on enhancing disaster preparedness and response strategies. The main research question guiding this project is: *How can historical data on natural disasters inform future predictions and improve community resilience?*

### 2. Importing Libraries
In this section, we will import the necessary libraries for our analysis and data processing.

In [None]:
# Install required libraries
!pip install openpyxl

# Import libraries
import pandas as pd
import requests
import csv

### 3. Datasets

- #### **EM-DAT: The International Disaster Database**
The EM-DAT Public Table is a global disaster database maintained by CRED, tracking natural and technological disasters. It includes data on fatalities, affected populations, and economic damages, and is used for research and disaster management.

In [2]:
df1 = pd.read_excel("Datasets/emdat.xlsx")
print(df1.head())

          DisNo. Historic Classification Key Disaster Group Disaster Subgroup  \
0  1999-9388-DJI       No    nat-cli-dro-dro        Natural    Climatological   
1  1999-9388-SDN       No    nat-cli-dro-dro        Natural    Climatological   
2  1999-9388-SOM       No    nat-cli-dro-dro        Natural    Climatological   
3  2000-0001-AGO       No    tec-tra-roa-roa  Technological         Transport   
4  2000-0002-AGO       No    nat-hyd-flo-riv        Natural      Hydrological   

  Disaster Type Disaster Subtype External IDs Event Name  ISO  ...  \
0       Drought          Drought          NaN        NaN  DJI  ...   
1       Drought          Drought          NaN        NaN  SDN  ...   
2       Drought          Drought          NaN        NaN  SOM  ...   
3          Road             Road          NaN        NaN  AGO  ...   
4         Flood   Riverine flood          NaN        NaN  AGO  ...   

  Reconstruction Costs ('000 US$) Reconstruction Costs, Adjusted ('000 US$)  \
0            

- #### **Kaggle Dataset: ALL NATURAL DISASTERS 1900-2021 / EOSDIS**
This dataset, hosted on Kaggle, provides a record of natural disasters worldwide from 1900 to 2021, sourced from NASA's Earth Observing System Data and Information System (EOSDIS). It includes details such as disaster type, location, dates, and impacts (e.g., fatalities, affected populations, and economic damages), making it useful for analyzing historical disaster trends and impacts.

In [3]:
df2 = pd.read_csv("Datasets/EOSDIS.csv")
print(df2.head())

   Year   Seq Glide Disaster Group Disaster Subgroup      Disaster Type  \
0  1900  9002   NaN        Natural    Climatological            Drought   
1  1900  9001   NaN        Natural    Climatological            Drought   
2  1902    12   NaN        Natural       Geophysical         Earthquake   
3  1902     3   NaN        Natural       Geophysical  Volcanic activity   
4  1902    10   NaN        Natural       Geophysical  Volcanic activity   

  Disaster Subtype Disaster Subsubtype   Event Name     Country  ...  \
0          Drought                 NaN          NaN  Cabo Verde  ...   
1          Drought                 NaN          NaN       India  ...   
2  Ground movement                 NaN          NaN   Guatemala  ...   
3         Ash fall                 NaN  Santa Maria   Guatemala  ...   
4         Ash fall                 NaN  Santa Maria   Guatemala  ...   

  No Affected No Homeless Total Affected Insured Damages ('000 US$)  \
0         NaN         NaN            NaN     

- #### **Global Disaster Alert and Coordination System (GDACS)**
The GDACS API provides real-time alerts on natural disasters such as earthquakes, tsunamis, and storms, offering data on disaster type, location, magnitude, and impact. It is useful for monitoring, coordinating disaster response efforts, and supplementing data for building predictive models.

In [4]:


# API URL
api_url = "https://www.gdacs.org/gdacsapi/api/events/geteventlist/EVENTS4APP"

# API Parameters (Modify dates as needed)
params = {
    "from": "2010-01-01",  # Modify this
    "to": "2024-02-06",  # Modify this
    "source": "DFO",
    "alertlevel": "RED",
    "datatype": "4DAYS",
    "type": "json"
}

# Send API request
response = requests.get(api_url, params=params)

if response.status_code == 200:
    data = response.json()  # Convert response to JSON
    features = data.get("features", [])  # Extract list of disaster events

    # CSV Output File
    csv_filename = "gdacs_disasters.csv"

    # Define CSV Headers
    headers = [
        "Event Name", "Country", "ISO", "Disaster Group", "Disaster Subgroup",
        "Disaster Type", "Disaster Subtype", "Latitude", "Longitude",
        "Start Year", "Start Month", "Start Day", "End Year", "End Month", "End Day",
        "Magnitude", "Affected People"
    ]

    # Write to CSV
    with open(csv_filename, mode="w", newline="", encoding="utf-8") as file:
        writer = csv.writer(file)
        writer.writerow(headers)  # Write header row

        for feature in features:
            properties = feature.get("properties", {})
            geometry = feature.get("geometry", {})

            # Extracting required fields
            event_name = properties.get("name", "Unknown")
            country = properties.get("country", "Unknown")
            iso = properties.get("iso3", "Unknown")
            disaster_group = properties.get("Class", "Unknown")  # Adjust as needed
            disaster_subgroup = "Unknown"  # No direct field found in API
            disaster_type = properties.get("eventtype", "Unknown")
            disaster_subtype = "Unknown"  # No direct field found in API

            # Extract latitude & longitude
            latitude, longitude = None, None
            if geometry.get("type") == "Point":
                coordinates = geometry.get("coordinates", [])
                if len(coordinates) >= 2:
                    longitude, latitude = coordinates[0], coordinates[1]

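            # Extract start and end dates
            start_date = properties.get("fromdate", "1900-01-01").split("T")[0]
            # Assumption: the API exposes a "todate" field alongside "fromdate";
            # fall back to the start date when it is missing or empty
            end_date = (properties.get("todate") or start_date).split("T")[0]

            start_year, start_month, start_day = start_date.split("-")
            end_year, end_month, end_day = end_date.split("-")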

            # Determine magnitude based on alert level
            alert_level = str(properties.get("alertlevel", "Green")).upper()  # Normalize case before comparing
            if alert_level == "RED":
                magnitude = ">4"
            elif alert_level == "ORANGE":
                magnitude = ">2"
            else:
                magnitude = "<=2"

            # Affected people
            affected_people = properties.get("totalaffected", "N/A")

            # Write row to CSV
            writer.writerow([
                event_name, country, iso, disaster_group, disaster_subgroup,
                disaster_type, disaster_subtype, latitude, longitude,
                start_year, start_month, start_day, end_year, end_month, end_day,
                magnitude, affected_people
            ])

    print(f"Data successfully saved to {csv_filename}")

else:
    print(f"Failed to fetch data. Status Code: {response.status_code}")


NameError: name 'end_date' is not defined
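
The `NameError` above arises because `end_date` is split before it is ever assigned. One way to make the date handling more defensive is a small helper that splits an ISO-style timestamp and tolerates missing or malformed values. The sketch below is illustrative only: `split_iso_date` is a hypothetical helper, not part of the GDACS API, and the sample timestamps are invented.

```python
from typing import Optional, Tuple


def split_iso_date(value: Optional[str], default: str = "1900-01-01") -> Tuple[str, str, str]:
    """Split 'YYYY-MM-DD' or 'YYYY-MM-DDTHH:MM:SS' into (year, month, day) strings.

    Falls back to `default` when the value is missing or not in the expected shape.
    """
    date_part = (value or default).split("T")[0]  # drop the time component, if any
    parts = date_part.split("-")
    if len(parts) != 3:  # malformed input: use the default date instead
        parts = default.split("-")
    year, month, day = parts
    return year, month, day


# Hypothetical inputs, for illustration only
print(split_iso_date("2023-09-10T00:00:00"))
print(split_iso_date(None))
```

With a helper like this, both the start and the end date could go through the same code path, so a missing end date would not interrupt the CSV export.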

Convert the data types of shared columns so that they match across both datasets.

In [None]:
numeric_columns = ["Latitude", "Longitude", "Start Year", "Start Month", "Start Day", 
                   "End Year", "End Month", "End Day", "Total Deaths", "No. Injured", 
                   "No. Affected", "No. Homeless", "Total Affected", "Total Damages ('000 US$)", "CPI", "Magnitude"]

for col in numeric_columns:
    if col in df1.columns:
        df1[col] = pd.to_numeric(df1[col], errors='coerce')  # Convert to numeric, set errors to NaN
    if col in df2.columns:
        df2[col] = pd.to_numeric(df2[col], errors='coerce')

In [None]:
string_columns = ["Year", "Disaster Group", "Disaster Subgroup", "Disaster Type", "ISO", "Magnitude Scale"]

for col in string_columns:
    if col in df1.columns:
        df1[col] = df1[col].astype(str).str.strip()  # Ensure string format
    if col in df2.columns:
        df2[col] = df2[col].astype(str).str.strip()

print("Updated df1 column types:\n", df1.dtypes)
print("Updated df2 column types:\n", df2.dtypes)
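
To see what the two conversions above do, a toy example may help. It is a minimal sketch on invented values (none of them come from EM-DAT or the Kaggle file): `errors="coerce"` turns unparseable entries into `NaN`, and `.str.strip()` removes stray whitespace from string keys.

```python
import pandas as pd

# Invented sample rows, for illustration only
sample = pd.DataFrame({
    "Total Deaths": ["12", "n/a", "7"],
    "ISO": [" DJI", "SDN ", "SOM"],
})

# Unparseable values such as "n/a" become NaN instead of raising an error
sample["Total Deaths"] = pd.to_numeric(sample["Total Deaths"], errors="coerce")

# Normalize string keys so that " DJI" and "DJI" compare equal
sample["ISO"] = sample["ISO"].astype(str).str.strip()

print(sample.dtypes)
print(sample)
```

Because coercion silently introduces `NaN`, counting missing values before and after the conversion (for example with `sample["Total Deaths"].isna().sum()`) is a simple way to check how much data was affected.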

### 4. Data Integration

In [1]:
# Keep only the 4-digit year from the disaster number (e.g. "1999-9388-DJI" -> "1999")
df1['DisNo.'] = df1['DisNo.'].astype(str).str[:4]

# Align df1 column names with the shared schema
df1.rename(columns={'DisNo.': 'Year',
                    "Total Damage, Adjusted ('000 US$)": "Total Damages ('000 US$)"}, inplace=True)

# Align df2 column names with the shared schema
df2.rename(columns={'Dis Mag Value': 'Magnitude',
                    'Dis Mag Scale': 'Magnitude Scale',
                    'No Injured': 'No. Injured',
                    'No Affected': 'No. Affected',
                    'No Homeless': 'No. Homeless'}, inplace=True)


NameError: name 'df1' is not defined

In [18]:
import pandas as pd

# Print column names and types to debug
print("df1 columns and types:\n", df1.dtypes)
print("df2 columns and types:\n", df2.dtypes)

# Rename 'DisNo.' to 'Year' in df1 if it exists
if 'DisNo.' in df1.columns:
    df1.rename(columns={'DisNo.': 'Year'}, inplace=True)

# Keep 'Year' as a string in both DataFrames so the keys compare equal
df1['Year'] = df1['Year'].astype(str).str[:4]  # Ensure it's a 4-digit string
df2['Year'] = df2['Year'].astype(str)  # Convert to string for matching

# Rename other columns (the target name must match the entry in merge_columns exactly)
df1.rename(columns={"Total Damage, Adjusted ('000 US$)": "Total Damages ('000 US$)"}, inplace=True)
df2.rename(columns={'Dis Mag Value': 'Magnitude', 'Dis Mag Scale': 'Magnitude Scale', 
                    'No Injured': 'No. Injured', 'No Affected': 'No. Affected', 
                    'No Homeless': 'No. Homeless'}, inplace=True)

# Define merge columns
merge_columns = [
    "Year", "Disaster Group", "Disaster Subgroup", "Disaster Type", "ISO",
    "Latitude", "Longitude", "Start Year", "Start Month", "Start Day", 
    "End Year", "End Month", "End Day", "Total Deaths", "No. Injured", 
    "No. Affected", "No. Homeless", "Total Affected", "Total Damages ('000 US$)", 
    "CPI", "Magnitude", "Magnitude Scale"
]

# Check for missing columns
missing_cols_df1 = [col for col in merge_columns if col not in df1.columns]
missing_cols_df2 = [col for col in merge_columns if col not in df2.columns]

if missing_cols_df1 or missing_cols_df2:
    print("Missing columns in df1:", missing_cols_df1)
    print("Missing columns in df2:", missing_cols_df2)
else:
    # Merge DataFrames
    merged_df = pd.merge(df1, df2, on=merge_columns, how="inner")

    # Drop duplicates and null values
    merged_df = merged_df.dropna().drop_duplicates()

    # Save the merged data
    merged_df.to_csv('Datasets/integrated_data.csv', index=False)
    print("Merged data successfully saved!")


df1 columns and types:
 Year                                          object
Historic                                      object
Classification Key                            object
Disaster Group                                object
Disaster Subgroup                             object
Disaster Type                                 object
Disaster Subtype                              object
External IDs                                  object
Event Name                                    object
ISO                                           object
Country                                       object
Subregion                                     object
Region                                        object
Location                                      object
Origin                                        object
Associated Types                              object
OFDA/BHA Response                             object
Appeal                                        object
Declaration           

ValueError: You are trying to merge on float64 and object columns for key 'Latitude'. If you wish to proceed you should use pd.concat
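
The `ValueError` indicates that `Latitude` is `float64` in one DataFrame and `object` in the other, so the key columns need a common dtype before `pd.merge` is called. The following is a minimal sketch on toy data, not the project's actual fix: `align_key_dtypes` is a hypothetical helper, and the rows and the `Deaths_a` / `Deaths_b` columns are invented for illustration.

```python
import pandas as pd


def align_key_dtypes(left, right, numeric_keys, string_keys):
    """Return copies of both frames whose key columns share the same dtype."""
    left, right = left.copy(), right.copy()
    for frame in (left, right):
        for col in numeric_keys:
            # Unparseable values become NaN rather than raising
            frame[col] = pd.to_numeric(frame[col], errors="coerce")
        for col in string_keys:
            frame[col] = frame[col].astype(str).str.strip()
    return left, right


# Invented rows: 'Latitude' is text in one frame and float in the other
a = pd.DataFrame({"Year": [1999, 2000], "Latitude": ["11.5", "bad"], "Deaths_a": [1, 2]})
b = pd.DataFrame({"Year": ["1999", "2000"], "Latitude": [11.5, 3.0], "Deaths_b": [10, 20]})

a2, b2 = align_key_dtypes(a, b, numeric_keys=["Latitude"], string_keys=["Year"])
merged = pd.merge(a2, b2, on=["Year", "Latitude"], how="inner")
print(merged)
```

Note that an inner join on floating-point coordinates requires exact equality, so merging on many keys at once may return very few rows. Joining on a smaller key set, or stacking the two sources with `pd.concat` as the error message suggests, may be worth considering.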