# Group 2 Coding Tutorial

Our project investigates the relationship between natural disasters and the economic damage they cause. The first step is to load the dataset into our Python notebook.

## Loading in data

To load the data, we import pandas, which we use to read in our CSV file.

In [1]:
import pandas as pd

data = pd.read_csv('CLEANED_DATA.csv')
data.head()

| | Year | Reported # Disasters | Total Economic Damage | Unnamed: 3 |
|---|---|---|---|---|
| 0 | 1900 | 5 | 30000000 | NaN |
| 1 | 1903 | 8 | 480000000 | NaN |
| 2 | 1906 | 17 | 650750000 | NaN |
| 3 | 1907 | 5 | 30000000 | NaN |
| 4 | 1908 | 4 | 116000000 | NaN |


This shows the first five rows of the CSV file. One issue we ran into is that pandas also reads in empty cells: the trailing Unnamed: 3 column in the output above contains no values at all. We therefore had to impute those empty cells so that they would not cause problems later in the analysis.
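
As an alternative to imputing, a column that is entirely empty can simply be dropped. The sketch below is illustrative only: it rebuilds a few rows from the output above as an in-memory CSV and assumes the empty column comes from a trailing delimiter on each line, which the original file may or may not have.

```python
import io
import pandas as pd

# Sample rows copied from the data.head() output above. The trailing comma
# on each line is an assumption that reproduces the empty fourth column.
csv_text = """Year,Reported # Disasters,Total Economic Damage,
1900,5,30000000,
1903,8,480000000,
1906,17,650750000,
"""

sample = pd.read_csv(io.StringIO(csv_text))
print(sample.columns.tolist())  # includes an auto-named 'Unnamed: 3' column

# Drop every column whose values are all missing
cleaned = sample.dropna(axis=1, how="all")
print(cleaned.columns.tolist())
```

On the real dataset the same `dropna(axis=1, how="all")` call could be applied to `data` right after loading.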

The output above also shows the features in the dataset: the year, the reported number of disasters, and the total economic damage that occurred during that year (in USD). These few rows hint that years with more reported disasters tend to have higher total economic damage, although five rows are far too few to establish this.
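
One quick way to check this impression is a correlation coefficient. The sketch below uses only the five rows printed above, so the number it prints is illustrative rather than a result for the whole dataset; the variable names `sample` and `r` are ours.

```python
import pandas as pd

# The five rows shown in the data.head() output above
sample = pd.DataFrame({
    "Reported # Disasters": [5, 8, 17, 5, 4],
    "Total Economic Damage": [30000000, 480000000, 650750000, 30000000, 116000000],
})

# Pearson correlation between the two columns
r = sample["Reported # Disasters"].corr(sample["Total Economic Damage"])
print(round(r, 2))
```

Running the same `.corr()` call on the corresponding columns of `data` would give the correlation over all years.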

Next, we look at summary statistics for the dataset to get a clearer understanding of the data.

## Understanding our data

In [2]:
data.describe()

| | Year | Reported # Disasters | Total Economic Damage | Unnamed: 3 |
|---|---|---|---|---|
| count | 109.0 | 109.0 | 109.0 | 0.0 |
| mean | 1963.192661 | 120.697248 | 29515950000.0 | NaN |
| std | 32.802291 | 135.945285 | 57025250000.0 | NaN |
| min | 1900.0 | 4.0 | 8000000.0 | NaN |
| 25% | 1936.0 | 12.0 | 129000000.0 | NaN |
| 50% | 1964.0 | 60.0 | 1529000000.0 | NaN |
| 75% | 1991.0 | 227.0 | 34104950000.0 | NaN |
| max | 2018.0 | 432.0 | 364093000000.0 | NaN |


This shows the count, mean, standard deviation, minimum, first quartile, median, third quartile, and maximum of each feature, which helps us identify outliers and understand the scale of the data. For example, the mean economic damage (about 2.95e10 USD) is far larger than the median (about 1.53e9 USD), so the distribution is strongly right-skewed, with a small number of extreme years pulling the mean up. Checking for such extreme values matters because they can skew the data and, in turn, affect our model.
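
To make the outlier check concrete, the quartiles above can be plugged into the common 1.5 x IQR rule of thumb. This rule is our own choice for illustration, not something the analysis depends on, and the variable names are hypothetical.

```python
# Quartiles and maximum of "Total Economic Damage", copied from data.describe()
q1 = 129_000_000.0
q3 = 34_104_950_000.0
max_damage = 364_093_000_000.0

iqr = q3 - q1
upper_fence = q3 + 1.5 * iqr  # values above this are flagged as potential outliers

print(f"IQR: {iqr:.4g}")
print(f"Upper fence: {upper_fence:.4g}")
print("Maximum exceeds fence:", max_damage > upper_fence)
```

The maximum yearly damage lies well above the fence, which agrees with the right skew seen in the summary statistics.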

## Creating our model

Our data fits a supervised learning model rather than an unsupervised one, since it is labeled: there are known inputs (year, reported # of disasters) and a known output (total economic damage). Among the many supervised learning models available, we chose a Random Forest because it can capture non-linear relationships (our data looks closer to exponential than linear). A Random Forest trains each tree on a different bootstrap sample of the data and considers random subsets of features, so the trees are only weakly correlated; averaging their predictions reduces variance and helps mitigate overfitting.

To start, we import the packages needed to create this model, along with our cross-validation approach (GridSearchCV) and the metrics we will use to evaluate how the model is doing.

In [7]:
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.metrics import mean_squared_error, r2_score

These scikit-learn imports provide GridSearchCV and the train/test split, as well as the Random Forest regressor mentioned previously. As noted at the beginning of the tutorial, pandas read in null values, so we use a pipeline with an imputer to handle them. Finally, mean squared error and the R^2 score are imported as the metrics that we will analyze later.

## Splitting data

Using the data that we now have, we split it into its X and y components, and then split those into training and testing sets so that the model can be evaluated on data it has not seen.

In [8]:
# Split the dataset into features (X) and target variable (y)
X = data.drop(columns=['Total Economic Damage'])  # Features
y = data['Total Economic Damage']  # Target variable

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

All of the columns except the economic damage are used as the X variables (features), and the economic damage is the y (target) variable. Note that X therefore still includes the empty Unnamed: 3 column, which the imputer in the pipeline has to deal with. We chose an 80-20 split (training and testing, respectively) so that most of the data is used to fit the model while a held-out portion remains for evaluating it.
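
With 109 rows, an 80-20 split cannot be exact. The sketch below uses placeholder arrays (`X_demo`, `y_demo`, hypothetical names) of the same length to show how many rows end up on each side with the same `test_size` and `random_state`.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder arrays with 109 rows, matching the count from data.describe()
X_demo = np.arange(109).reshape(-1, 1)
y_demo = np.arange(109)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)
print(len(X_tr), len(X_te))  # 87 22
```

scikit-learn rounds the test share up, so 22 rows are held out and 87 are used for training. With a test set this small, the evaluation metrics can change noticeably with a different `random_state`.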

In [9]:
# Define the pipeline
pipeline = Pipeline([
    ('imputer', SimpleImputer()),
    ('rf_regressor', RandomForestRegressor(random_state=42))
])

This creates the pipeline, which first imputes missing values (SimpleImputer uses the column mean by default) and then fits the regression model.
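
It is worth knowing what the default imputer does with a column that has no values at all, like Unnamed: 3. The toy matrix below is hypothetical: its missing value in the second column is invented purely to show mean imputation, and the third column mimics the empty one.

```python
import warnings
import numpy as np
from sklearn.impute import SimpleImputer

# Toy feature matrix: two populated columns and one column with no values,
# mimicking Year, Reported # Disasters, and the empty Unnamed: 3 column
X_demo = np.array([
    [1900.0, 5.0, np.nan],
    [1903.0, np.nan, np.nan],  # hypothetical gap, to show mean imputation
    [1906.0, 17.0, np.nan],
])

with warnings.catch_warnings():
    warnings.simplefilter("ignore")  # the imputer warns about the empty column
    X_imputed = SimpleImputer().fit_transform(X_demo)

print(X_imputed.shape)
print(X_imputed)
```

With the default mean strategy, the gap in the second column is filled with the column mean, while the all-empty column has no mean to use and is dropped from the output. In practice the imputer therefore removes the empty column rather than filling it.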

## Hyperparameters and GridSearch

We defined a list of candidate values for each hyperparameter in order to identify the best combination for our data. GridSearchCV then fits the whole pipeline (imputer and regressor) for every combination using 5-fold cross-validation, scores each one by negative mean squared error, and keeps the best model.

In [10]:
# Define the hyperparameters grid for tuning
param_grid = {
    'rf_regressor__n_estimators': [50, 100, 150],
    'rf_regressor__max_depth': [5, 10, 15],
    'rf_regressor__min_samples_split': [2, 5, 10],
    'rf_regressor__min_samples_leaf': [1, 2, 4]
}

# Perform Grid Search Cross-Validation with the pipeline
grid_search = GridSearchCV(estimator=pipeline, param_grid=param_grid, cv=5, scoring='neg_mean_squared_error', n_jobs=-1)
grid_search.fit(X_train, y_train)

# Get the best model from the pipeline
best_rf_model = grid_search.best_estimator_
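
The cost of this search follows directly from the grid. As a small sketch using only the standard library, the number of candidate combinations and cross-validation fits can be counted like this (`cv_folds`, `n_combinations`, and `n_fits` are our own names):

```python
from itertools import product

# Same grid as in the cell above
param_grid = {
    'rf_regressor__n_estimators': [50, 100, 150],
    'rf_regressor__max_depth': [5, 10, 15],
    'rf_regressor__min_samples_split': [2, 5, 10],
    'rf_regressor__min_samples_leaf': [1, 2, 4],
}
cv_folds = 5  # matches cv=5 in GridSearchCV

n_combinations = len(list(product(*param_grid.values())))
n_fits = n_combinations * cv_folds
print(n_combinations, n_fits)  # 81 405
```

That is 405 pipeline fits during the search, plus one final refit of the best combination on the full training set, which GridSearchCV performs by default.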




## Prediction and Model Evaluation

Now that the model is tuned, we make predictions on the testing data and evaluate its performance. As evaluation metrics, we chose R^2 and mean squared error (MSE) to understand how well the model fits the data.

In [11]:
# Make predictions on the test set
y_pred = best_rf_model.predict(X_test)

# Evaluate the model performance
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

# Print the evaluation metrics
print("Mean Squared Error (MSE):", mse)
print("R-squared (R2):", r2)

# Print the best hyperparameters
print("Best Hyperparameters:", grid_search.best_params_)

Mean Squared Error (MSE): 4.718087187578521e+20
R-squared (R2): 0.5738448875315248
Best Hyperparameters: {'rf_regressor__max_depth': 5, 'rf_regressor__min_samples_leaf': 4, 'rf_regressor__min_samples_split': 10, 'rf_regressor__n_estimators': 150}




These results show that the MSE is still relatively high: its square root (the RMSE) is about 2.2e10 USD, which is on the same order as the mean yearly damage in the dataset (about 2.95e10 USD). The R^2 of about 0.57 means the model explains roughly 57% of the variance in the test-set damage, so while a relationship was found, it could be stronger with more information and context. We have also concluded that numerous other factors likely affect the observed economic damage, such as inflation or other situational economic variability. Additionally, since natural disasters are volatile and their impact varies from event to event, it is difficult to draw firm conclusions from this dataset alone.
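
The MSE and R^2 printed above can be put on a more interpretable scale with a little arithmetic. The sketch below takes the two printed values as inputs. It relies on the definition R^2 = 1 - MSE / Var(y_test), where the variance is the population variance of the test targets; the variable names are ours.

```python
import math

# Metrics copied from the printed output above
mse = 4.718087187578521e+20
r2 = 0.5738448875315248

rmse = math.sqrt(mse)  # typical error size, in USD

# R^2 = 1 - MSE / Var(y_test), so the test-set variance can be recovered
test_variance = mse / (1 - r2)
test_std = math.sqrt(test_variance)

print(f"RMSE: {rmse:.3e} USD")
print(f"Implied std of y_test: {test_std:.3e} USD")
```

Comparing the two numbers shows how much the model improves on always predicting the test-set mean, whose RMSE would equal the standard deviation of the test targets.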