# **Phase 2: Exploratory Data Analysis and Data Cleaning**

## **Phase Introduction**

This notebook documents the Exploratory Data Analysis (EDA) and data cleaning process for the integrated dataset. The primary objectives of this phase are:

1. **Initial Data Inspection** – Understanding the structure, completeness, and potential issues in the dataset.  
2. **Exploratory Data Analysis (EDA)** – Identifying patterns, trends, and anomalies through statistical summaries and visualizations.  
3. **Data Cleaning & Preprocessing** – Handling missing values, duplicates, and irrelevant features, as well as normalizing or transforming data where necessary.  
4. **Comparison of Primary and Secondary Data** – Evaluating similarities, differences, and potential biases between the two sources.  

Each step records the rationale behind the processing decision, the challenges encountered, and the key findings. These findings guide the later analysis and hypothesis development.
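
The inspection and cleaning objectives above can be sketched on a small stand-in frame. This is a minimal illustration, not the notebook's actual cleaning code. The `toy` values and the `missing_report` helper are hypothetical, and only the column names (`ISO`, `Latitude`, `Total Deaths`) are taken from the dataset. It assumes that coordinates stored as text should be coerced to numbers, with unparseable entries treated as missing.

```python
import numpy as np
import pandas as pd

# Toy frame with made-up values; only the column names come from the dataset.
toy = pd.DataFrame(
    {
        "ISO": ["USA", "IND", None, "USA", "JPN"],
        "Latitude": ["12.5", None, "bad", "12.5", None],
        "Total Deaths": [10.0, np.nan, 3.0, 10.0, 7.0],
    }
)


def missing_report(frame: pd.DataFrame) -> pd.DataFrame:
    """Return per-column missing counts and percentages, worst column first."""
    counts = frame.isna().sum()
    report = pd.DataFrame(
        {"missing": counts, "missing_pct": (counts / len(frame) * 100).round(1)}
    )
    return report.sort_values("missing_pct", ascending=False)


report = missing_report(toy)
print(report)

# Coerce a text column to numbers; entries that cannot be parsed become NaN.
latitude_numeric = pd.to_numeric(toy["Latitude"], errors="coerce")
print(latitude_numeric)

# Count rows that repeat an earlier row in every column.
n_duplicates = int(toy.duplicated().sum())
print("Duplicate rows:", n_duplicates)
```

Reporting percentages rather than raw counts makes sparsely populated columns easy to spot before deciding whether to impute, drop, or keep them.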

In [None]:
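import pandas as pd

# Load the integrated dataset (path is relative to the notebook)
df = pd.read_csv("Datasets/integrated.csv")

df.info()             # info() prints directly and returns None, so no print() wrapper
print(df.head())      # first five rows
print(df.describe())  # summary statistics for numeric columns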


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 32319 entries, 0 to 32318
Data columns (total 20 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   Disaster Group            32319 non-null  object 
 1   Disaster Subgroup         32319 non-null  object 
 2   Disaster Type             32319 non-null  object 
 3   ISO                       32309 non-null  object 
 4   Latitude                  4648 non-null   object 
 5   Longitude                 4651 non-null   object 
 6   Start Year                32319 non-null  int64  
 7   Start Month               31864 non-null  float64
 8   Start Day                 27129 non-null  float64
 9   End Year                  32319 non-null  int64  
 10  End Month                 31449 non-null  float64
 11  End Day                   27264 non-null  float64
 12  Total Deaths              24343 non-null  float64
 13  No. Injured               9848 non-null   float64
 14  No. Af

In [2]:
import pandas as pd

# Load dataset
df = pd.read_csv("Datasets/integrated.csv")  # Modify path if needed

# Initial inspection
print("Dataset Info:\n")
df.info()  # info() prints directly and returns None; print() would add a stray "None"

print("\nFirst Five Rows:\n")
print(df.head())

print("\nSummary Statistics:\n")
print(df.describe())

print("\nMissing Values:\n")
print(df.isnull().sum())

print("\nDuplicate Records:", df.duplicated().sum())

print("\nData Types:\n")
print(df.dtypes)

print("\nFeature Relevance Check:\n")
print(df.nunique())  # Unique values per column; a count of 1 flags a constant, uninformative feature


Dataset Info:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 32319 entries, 0 to 32318
Data columns (total 20 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   Disaster Group            32319 non-null  object 
 1   Disaster Subgroup         32319 non-null  object 
 2   Disaster Type             32319 non-null  object 
 3   ISO                       32309 non-null  object 
 4   Latitude                  4648 non-null   object 
 5   Longitude                 4651 non-null   object 
 6   Start Year                32319 non-null  int64  
 7   Start Month               31864 non-null  float64
 8   Start Day                 27129 non-null  float64
 9   End Year                  32319 non-null  int64  
 10  End Month                 31449 non-null  float64
 11  End Day                   27264 non-null  float64
 12  Total Deaths              24343 non-null  float64
 13  No. Injured               9848 non-null   floa