# Project Proposal: Predicting Hurricane Categories Using Machine Learning

## Course: UNC MADS DATA780

### Authors:
- **Hubert Hwang**
- **Sooji Rhodes**

---

## 1. Project Title:
**Predicting Hurricane Categories Using Machine Learning on NOAA Atlantic Hurricane Data**

---

## 2. Objective:
The goal of this project is to build a machine learning model that predicts the category of a hurricane from meteorological features such as wind speed, pressure, and geographical position. Using historical data from the NOAA Atlantic Hurricane Database (HURDAT2), we will compare several machine learning methods on two representations of the data: features aggregated per storm and the sequential six-hourly observations.

---

## 3. Motivation:
The recent devastation caused by **Hurricane Helene** in western North Carolina highlights the urgency of improving hurricane prediction models. Early and accurate prediction of hurricane categories is crucial for effective disaster preparedness and response. Machine learning offers the potential to analyze large historical datasets to enhance hurricane category predictions. By leveraging the NOAA dataset, we aim to contribute to this field by building a machine learning pipeline capable of predicting hurricane strength and category based on key meteorological data.

---

## 4. Dataset:
The dataset for this project will be sourced from a Kaggle-hosted copy of the [NOAA Atlantic Hurricane Database (HURDAT2)](https://www.kaggle.com/datasets/utkarshx27/noaa-atlantic-hurricane-database/data). It contains detailed information on the positions and attributes of hurricanes and tropical storms from 1975 to 2021.

- **Key Features**:
  - **Timestamps**: Measurements taken every six hours during a storm’s lifecycle (from 1979 onwards).
  - **Position**: Latitude and longitude of the storm at each timestamp.
  - **Attributes**: Wind speed, pressure, storm type, and other relevant meteorological features.
  - **Target Variable**: Storm intensity class, ranging from tropical depression and tropical storm up to Saffir-Simpson Category 1 through Category 5 hurricanes.
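
To make the target variable concrete, the sketch below maps a maximum sustained wind speed to an intensity class using the Saffir-Simpson thresholds expressed in knots. It assumes the dataset reports wind in knots; the function name `intensity_class` is hypothetical and only for illustration.

```python
from typing import Optional

# Saffir-Simpson lower bounds in knots, checked from strongest to weakest.
# Assumption: the dataset's wind column is the maximum sustained wind in knots.
_THRESHOLDS = [
    (137, "Category 5"),
    (113, "Category 4"),
    (96, "Category 3"),
    (83, "Category 2"),
    (64, "Category 1"),
    (34, "Tropical Storm"),
]


def intensity_class(wind_knots: Optional[float]) -> Optional[str]:
    """Return the intensity class for a wind speed, or None if it is missing."""
    if wind_knots is None:
        return None
    for minimum, label in _THRESHOLDS:
        if wind_knots >= minimum:
            return label
    return "Tropical Depression"


examples = {w: intensity_class(w) for w in (30, 34, 64, 100, 140)}
```

Because this label is a deterministic function of wind speed at the same timestamp, a setup along these lines is likely to be most informative when the input features come from earlier observations than the label.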

The data is incomplete for earlier years (before 1979), where features such as wind speed and pressure may have missing values. These gaps will be handled through imputation or, where necessary, by excluding the affected data points.
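
One possible way to handle these gaps is sketched below: storms with too many missing values are dropped, and the remaining gaps are filled forward and then backward within each storm. The column names `storm_id`, `timestamp`, `wind`, and `pressure`, the helper `fill_within_storm`, and the 50% threshold are assumptions for illustration, not properties of the actual dataset.

```python
import numpy as np
import pandas as pd


def fill_within_storm(df, value_cols=("wind", "pressure"), max_missing_frac=0.5):
    """Drop storms with too many gaps, then fill remaining gaps within each storm."""
    value_cols = list(value_cols)
    df = df.sort_values(["storm_id", "timestamp"]).copy()

    # Share of missing cells per storm across the chosen columns.
    missing_frac = df[value_cols].isna().mean(axis=1).groupby(df["storm_id"]).mean()
    keep_ids = missing_frac[missing_frac <= max_missing_frac].index
    df = df[df["storm_id"].isin(keep_ids)].copy()

    # Forward fill, then backward fill, so values never leak across storms.
    df[value_cols] = df.groupby("storm_id")[value_cols].transform(
        lambda s: s.ffill().bfill()
    )
    return df.reset_index(drop=True)


# Toy data (made up): storm A has a few gaps, storm B is mostly missing.
toy = pd.DataFrame({
    "storm_id": ["A", "A", "A", "B", "B", "B", "B"],
    "timestamp": pd.Timestamp("2000-09-01") + pd.to_timedelta(np.arange(7) * 6, unit="h"),
    "wind": [np.nan, 40.0, 45.0, np.nan, np.nan, np.nan, 30.0],
    "pressure": [1005.0, np.nan, 1000.0, np.nan, np.nan, np.nan, 1008.0],
})
filled = fill_within_storm(toy)
```

Filling within each storm, rather than across the whole table, keeps one storm's measurements from being copied into another's.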

---

## 5. Approach and Methodology:

### Data Preprocessing:
1. **Handling Missing Data**:
   - For storms with missing data, use appropriate imputation techniques (e.g., forward/backward filling, mean imputation), or drop storms whose share of missing values exceeds a chosen threshold.

2. **Feature Engineering**:
   - **Meteorological Features**: Use wind speed, pressure, and geographical data to create features relevant for predicting hurricane category.
   - **Derived Features**: Create features such as the rate of change in wind speed, the storm's forward speed (calculated from successive latitude and longitude positions), and minimum pressure to capture storm dynamics.
   - **Temporal Features**: Encode the time dimension explicitly (for example, hours elapsed since the first observation) to capture storm evolution.

3. **Data Aggregation**:
   - **Aggregated Data**: For simpler models, aggregate features like maximum wind speed, minimum pressure, and average storm speed over the storm’s lifecycle.
   - **Sequential Data**: For more advanced models, keep the six-hourly sequential data, allowing the model to capture storm progression over time.
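
The feature engineering and aggregation steps above could look roughly like the following sketch. It assumes a table with hypothetical columns `storm_id`, `timestamp`, `lat`, `long`, `wind`, and `pressure`; forward speed is approximated with the haversine great-circle distance between consecutive positions.

```python
import numpy as np
import pandas as pd

EARTH_RADIUS_KM = 6371.0


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between points given in degrees."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(a))


def add_derived_features(df):
    """Add per-observation forward speed and wind rate of change."""
    df = df.sort_values(["storm_id", "timestamp"]).copy()
    g = df.groupby("storm_id")
    hours = g["timestamp"].diff().dt.total_seconds() / 3600.0
    dist = haversine_km(g["lat"].shift(), g["long"].shift(), df["lat"], df["long"])
    df["storm_speed_kmh"] = dist / hours              # NaN for each storm's first row
    df["wind_change_per_hour"] = g["wind"].diff() / hours
    return df


def aggregate_per_storm(df):
    """Collapse the six-hourly rows into one feature row per storm."""
    return df.groupby("storm_id").agg(
        max_wind=("wind", "max"),
        min_pressure=("pressure", "min"),
        mean_storm_speed_kmh=("storm_speed_kmh", "mean"),
        max_wind_change_per_hour=("wind_change_per_hour", "max"),
    ).reset_index()


# Toy storm (made up): stationary for 6 hours, then moves 1 degree north.
toy = pd.DataFrame({
    "storm_id": ["A", "A", "A"],
    "timestamp": pd.Timestamp("2000-09-01") + pd.to_timedelta([0, 6, 12], unit="h"),
    "lat": [25.0, 25.0, 26.0],
    "long": [-80.0, -80.0, -80.0],
    "wind": [30.0, 42.0, 60.0],
    "pressure": [1005.0, 1000.0, 990.0],
})
sequential = add_derived_features(toy)
per_storm = aggregate_per_storm(sequential)
```

The `sequential` frame keeps the six-hourly rows for sequence models, while `per_storm` is the aggregated view intended for the simpler tabular models.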

### Machine Learning Models:
We will experiment with several machine learning models to determine which one performs best at predicting hurricane categories. These include:

1. **Random Forests** (via Scikit-learn):
   - A robust ensemble method suited for tabular data with the ability to handle non-linear relationships between features and the target variable.

2. **Gradient Boosting** (via Scikit-learn):
   - A powerful ensemble technique that builds trees sequentially, improving performance by correcting errors from previous trees. This is especially useful for complex non-linear relationships.

3. **Artificial Neural Networks (ANNs)** (via TensorFlow):
   - ANNs can capture complex, non-linear patterns in the data, especially when we have large datasets with complex interactions between features.

4. **Recurrent Neural Networks (RNNs/LSTMs)** (via TensorFlow):
   - If we decide to keep the sequential structure of the storm data, RNNs and LSTMs will be used to model the temporal progression of hurricanes, capturing how features like wind speed evolve over time.
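
As a rough illustration of how the two tree-based models might be compared, the sketch below runs stratified cross-validation with Scikit-learn. The data here is synthetic and generated only so the example runs; the feature matrix, labels, and hyperparameters are placeholders, not project results.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in for aggregated storm features (not hurricane data).
rng = np.random.default_rng(0)
n_samples = 300
X = rng.normal(size=(n_samples, 4))
signal = X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.3, size=n_samples)
y = np.digitize(signal, bins=[-0.5, 0.5])  # three classes: 0, 1, 2

models = {
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=0),
    "gradient_boosting": GradientBoostingClassifier(n_estimators=50, random_state=0),
}

# Stratified folds keep class proportions similar in every split.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = {
    name: cross_val_score(model, X, y, cv=cv, scoring="f1_macro").mean()
    for name, model in models.items()
}
```

With per-observation rows rather than one row per storm, a grouped splitter such as `GroupKFold` keyed on the storm identifier would probably be the safer choice, so that rows from one storm do not appear in both training and validation folds.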

### Evaluation Metrics:
We will evaluate the performance of our models using the following metrics:

- **Accuracy**: The proportion of correct predictions.
- **Confusion Matrix**: A detailed breakdown of how often each category is predicted correctly or misclassified.
- **Precision, Recall, and F1-Score**: Especially important for measuring performance across hurricane categories if the dataset is imbalanced.
- **ROC-AUC (macro and micro)**: To evaluate the overall performance of the classifier across all hurricane categories.
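
The metrics listed above can all be computed with Scikit-learn. The sketch below uses a small made-up set of labels and predicted probabilities for three classes; macro ROC-AUC uses the one-vs-rest scheme, and micro ROC-AUC is computed by pooling the binarized labels.

```python
import numpy as np
from sklearn.metrics import (accuracy_score, confusion_matrix,
                             precision_recall_fscore_support, roc_auc_score)
from sklearn.preprocessing import label_binarize

labels = [0, 1, 2]
y_true = np.array([0, 0, 1, 1, 2, 2])
# Made-up predicted class probabilities; each row sums to 1.
y_prob = np.array([
    [0.8, 0.1, 0.1],
    [0.6, 0.3, 0.1],
    [0.2, 0.7, 0.1],
    [0.5, 0.4, 0.1],  # true class 1, predicted as class 0
    [0.1, 0.2, 0.7],
    [0.1, 0.3, 0.6],
])
y_pred = y_prob.argmax(axis=1)

accuracy = accuracy_score(y_true, y_pred)
cm = confusion_matrix(y_true, y_pred, labels=labels)  # rows: true, columns: predicted
precision, recall, f1, support = precision_recall_fscore_support(
    y_true, y_pred, labels=labels, zero_division=0
)

# Macro: average of per-class one-vs-rest AUCs.
macro_auc = roc_auc_score(y_true, y_prob, multi_class="ovr",
                          average="macro", labels=labels)
# Micro: pool every (sample, class) pair into one binary problem.
y_onehot = label_binarize(y_true, classes=labels)
micro_auc = roc_auc_score(y_onehot.ravel(), y_prob.ravel())
```

Per-class precision and recall are returned as arrays in the order given by `labels`, which makes it straightforward to spot the rarer categories where the model underperforms.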

---

## 6. Project Timeline (4 weeks):

| **Task**                                          | **Deadline**  |
|---------------------------------------------------|---------------|
| Data Collection and Preprocessing                 | Week 1        |
| Feature Engineering and Data Aggregation          | Week 1        |
| Model Training (Random Forest, Gradient Boosting) | Week 2        |
| Model Training (ANNs and LSTMs)                   | Week 3        |
| Model Evaluation and Hyperparameter Tuning        | Week 4        |
| Final Presentation and Report                     | End of Week 4 |

---

## 7. Expected Outcomes:
By the end of this project, we expect to have:
1. A trained machine learning model capable of accurately predicting hurricane categories based on historical data.
2. Insights into which meteorological features are most predictive of hurricane strength and category.
3. A comparison of various machine learning models to determine which performs best for this task.
4. A report summarizing the findings, including model performance and recommendations for future work.

---

## 8. Tools and Technologies:
- **Programming Language**: Python
- **Libraries**: 
  - **Data Manipulation and Preprocessing**: Pandas, NumPy
  - **Machine Learning Models**: Scikit-learn, TensorFlow
  - **Visualization**: Matplotlib, Seaborn
- **Version Control**: GitHub (repository located [here](https://github.com/soojirhodes/DATA780_Final_Project))

---

## 9. Challenges and Risks:
- **Missing Data**: Handling missing data, particularly for earlier years (pre-1979), may limit the dataset size or require imputation.
- **Imbalanced Dataset**: There may be imbalances in the number of storms for each category (e.g., fewer Category 5 hurricanes), which could impact model performance.
- **Sequential Modeling**: Using LSTMs or RNNs to model the sequential nature of the data may be challenging due to sequence length variation and potential overfitting on smaller datasets.
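
Two of these risks have fairly standard mitigations, sketched below with NumPy only. Variable sequence lengths can be handled by padding storms to a common length and keeping a mask of real timesteps, and class imbalance can be softened with inverse-frequency class weights. The helper names `pad_sequences` and `balanced_class_weights` and the toy inputs are hypothetical.

```python
import numpy as np


def pad_sequences(sequences, max_len, pad_value=0.0):
    """Pad (or truncate to the last max_len steps) to shape (n, max_len, features)."""
    n_features = np.asarray(sequences[0]).shape[1]
    padded = np.full((len(sequences), max_len, n_features), pad_value, dtype=np.float32)
    mask = np.zeros((len(sequences), max_len), dtype=bool)
    for i, seq in enumerate(sequences):
        seq = np.asarray(seq, dtype=np.float32)[-max_len:]  # keep most recent steps
        padded[i, :len(seq)] = seq
        mask[i, :len(seq)] = True  # True marks real (non-padded) timesteps
    return padded, mask


def balanced_class_weights(y):
    """Inverse-frequency weights: n_samples / (n_classes * class_count)."""
    classes, counts = np.unique(y, return_counts=True)
    weights = len(y) / (len(classes) * counts)
    return dict(zip(classes.tolist(), weights.tolist()))


# Toy storms (made up): 2 and 5 timesteps, 3 features each.
storms = [np.ones((2, 3)), 2 * np.ones((5, 3))]
padded, mask = pad_sequences(storms, max_len=4)
weights = balanced_class_weights(np.array([0, 0, 0, 1]))
```

The mask can be passed to a sequence model so that padded timesteps are ignored (in Keras, a `Masking` layer serves a similar purpose), and the weights can be supplied to training routines that accept per-class weights.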

---

## 10. Conclusion:
This project will apply a variety of machine learning techniques to predict hurricane categories using historical NOAA data. By the end of the project, we aim to deliver a predictive model that can assist in forecasting hurricane strength and provide valuable insights into the factors that contribute to hurricane intensity. If the model proves accurate and robust, it could serve as an additional tool for improving disaster preparedness and response.

---

## 11. Acknowledgment of AI Use:
This proposal was developed with the assistance of **ChatGPT**. ChatGPT was used to help with idea generation, project organization, and drafting of certain sections of the document. The final content was reviewed and edited by the authors to ensure accuracy and adherence to the guidelines.