# CPSC322 Final Project -- UFO Sightings Dataset
### Sebastian Matthews and Ethan France

## 1. Introduction
We share an interest in UFO sightings, and the possibility of explaining the phenomenon through statistical analysis was too enticing to pass up, which made for a unique project. Our primary dataset contains over 80,000 UFO sighting reports from around the world, with a variety of fields for each case (e.g., datetime, city, state, country, shape, duration in seconds and in hours/minutes, comments on the report, date posted, and latitude and longitude), sourced from [Kaggle](https://www.kaggle.com/datasets/NUFORC/ufo-sightings) in CSV format. The second dataset holds the weather conditions on the day of each sighting, which we assembled from [Wunderground's Historical Weather Reports](https://www.wunderground.com/history) with a Selenium-based web scraping bot; the bot stored its data in an Excel file, which was later merged into our primary dataset. The third dataset, a CSV file of [ICAO codes](https://github.com/ip2location/ip2location-iata-icao/blob/master/iata-icao.csv), served as input to the Selenium bot, allowing it to enter a location into the website and retrieve the weather data. According to the graphs below, the most influential features in our models were the maximum temperature and humidity for a given day.

## 2. Data Analysis
Before we reduced our dataset from ~80,000 entries to 3,657 samples, the UFO sightings dataset had attributes such as the date and time when a sighting began; the city, state, and country where the sighting occurred; the shape of the formation spotted; the duration of the encounter in seconds, hours, and minutes; comments from the witness; the date when the sighting was posted; and the latitude and longitude of the sighting's location. After sampling 1,000 random UFO sightings in the US, we appended to each instance the nearest airport from a random selection of a given number of airports (we chose 5 to reduce data-gathering time), using the Haversine formula via Geopandas. The airport's ICAO code was then used to gather weather data over a ten-year span (2003-2013), which was merged with the UFO data on date and airport code.
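The nearest-airport step can be sketched as follows. This is a minimal illustration using only the standard library rather than Geopandas, and the airport coordinates, the sighting record, and the function names `haversine_km` and `nearest_airport` are hypothetical stand-ins, not the project's actual code or data.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points given in decimal degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))

def nearest_airport(sighting, airports):
    """Return the airport record closest to the sighting's coordinates."""
    return min(
        airports,
        key=lambda ap: haversine_km(
            sighting["latitude"], sighting["longitude"],
            ap["latitude"], ap["longitude"],
        ),
    )

# Hypothetical candidate airports (approximate coordinates, for illustration only).
airports = [
    {"icao": "KSEA", "latitude": 47.449, "longitude": -122.309},
    {"icao": "KGEG", "latitude": 47.620, "longitude": -117.534},
    {"icao": "KPDX", "latitude": 45.589, "longitude": -122.597},
]

# Hypothetical sighting record.
sighting = {"city": "spokane", "latitude": 47.659, "longitude": -117.426}

closest = nearest_airport(sighting, airports)
print(closest["icao"])
```

With only a handful of candidate airports, a linear scan like this is sufficient; a spatial index would only matter for much larger candidate sets.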

The weather attributes that we used as input features for classification were:
- The dew point average of a given day, labeled as "dew-avg".
- The atmospheric pressure of a given day, labeled as "pressure-avg".
- The temperature average of a given day, labeled as "temp-avg".
- The average wind speed of a given day, labeled as "wind-avg".
- The humidity average of a given day, labeled as "humidity-avg".
- The total precipitation in inches for a given day, labeled as "precipitation-total".

All of these attributes are continuous numerical values, which we used to predict whether or not a UFO sighting occurs on a given day, based on weather features and location, in our binary classification scenario. The expected outcome for each instance is stored in a "prediction" column (a boolean value) within the table, telling us whether or not a UFO sighting was present.
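One way to derive such a boolean column is to key both tables on date and airport code and mark a weather row as positive when a sighting shares its key. The sketch below assumes hypothetical rows and a hypothetical helper `label_weather_rows`; the project's real merge was done on its Excel data and may differ in detail.

```python
from datetime import date

# Hypothetical daily weather rows, one per (date, airport) pair.
weather_rows = [
    {"date": date(2010, 7, 4), "icao": "KGEG", "temp-avg": 71.2, "humidity-avg": 38.0},
    {"date": date(2010, 7, 5), "icao": "KGEG", "temp-avg": 74.9, "humidity-avg": 35.5},
    {"date": date(2010, 7, 4), "icao": "KSEA", "temp-avg": 63.0, "humidity-avg": 66.1},
]

# Hypothetical sightings, already tagged with their nearest airport.
sightings = [
    {"date": date(2010, 7, 4), "icao": "KGEG", "shape": "light"},
]

def label_weather_rows(weather_rows, sightings):
    """Copy each weather row and add a boolean 'prediction' column."""
    sighting_keys = {(s["date"], s["icao"]) for s in sightings}
    labeled = []
    for row in weather_rows:
        labeled_row = dict(row)
        # True when at least one sighting matches this date and airport.
        labeled_row["prediction"] = (row["date"], row["icao"]) in sighting_keys
        labeled.append(labeled_row)
    return labeled

labeled = label_weather_rows(weather_rows, sightings)
print([row["prediction"] for row in labeled])
```

Because most (date, airport) pairs have no matching sighting, a labeling like this naturally yields far more negative rows than positive ones.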

Our Naive Bayes model scored an accuracy of 77%, a precision of 4%, a recall of 19%, and an F1 score of 6%; in comparison, our Random Forest model scored an accuracy of 95%, a precision of 25%, a recall of 3%, and an F1 score of 5%. Lastly, our KNN model scored an accuracy of 93%, a precision of 7%, a recall of 3%, and an F1 score of 4%. The high accuracy of KNN and Random Forest on this dataset is dubious because these classifiers identify almost no UFO sightings (3% recall), flagging nearly the entire dataset as false.
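As a quick consistency check, the F1 score can be recomputed from precision and recall as their harmonic mean. The sketch below plugs in the rounded percentages reported in this section; the helper name `f1_from` is ours for illustration.

```python
def f1_from(precision, recall):
    """Harmonic mean of precision and recall; 0 when both are 0."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# (precision, recall) pairs as reported above, already rounded to whole percents.
reported = {
    "Naive Bayes": (0.04, 0.19),
    "Random Forest": (0.25, 0.03),
    "KNN": (0.07, 0.03),
}

for name, (precision, recall) in reported.items():
    print(f"{name}: F1 = {f1_from(precision, recall):.3f}")
```

The recomputed values come out at roughly 0.07, 0.05, and 0.04, which agrees with the reported 6%, 5%, and 4% once rounding of the inputs is taken into account.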


![Classifier Metrics Comparison](Classifier_comp.png)
![F1 Classifier Comparison](f1_classifier_comp.png)
![Feature Importance Chart RF](Feature_Importance_-_Random_Forest.png)
![Feature Importance Chart NB](Feature_Importance_-_Naive_Bayes.png)
![Classification Bar Graph](TPxTNxFPxFN.png)

## 3. Classification Results

The classifiers that we designed for our dataset were the Naive Bayes, KNN, Binary, and Random Forest algorithms, which we had previously built during the course's individual programming assignments. Before a classifier predicted a UFO sighting, the weather attributes were normalized and scrubbed of null values, and the data was split into a training set and a test set. Once the classifier was trained, it calculated the class and feature probabilities and made predictions, and we compared the predicted classes to the actual results. For the Random Forest algorithm, the fit function generates a stratified test set along with N random decision trees, then selects the M most accurate trees and uses their majority vote to produce the prediction for the provided dataset.
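The "N trees, keep the best M, then vote" idea can be sketched as below. This is an illustration under assumptions, not the course implementation: the "trees" are hypothetical threshold rules standing in for trained decision trees, and the vote is taken per instance.

```python
from collections import Counter

def accuracy(tree, X, y):
    """Fraction of rows for which the tree predicts the true label."""
    correct = sum(1 for row, label in zip(X, y) if tree(row) == label)
    return correct / len(y)

def select_best_trees(trees, X_val, y_val, m):
    """Rank trees by validation accuracy and keep the top m."""
    ranked = sorted(trees, key=lambda t: accuracy(t, X_val, y_val), reverse=True)
    return ranked[:m]

def forest_predict(trees, row):
    """Majority vote of the selected trees for a single instance."""
    votes = Counter(tree(row) for tree in trees)
    return votes.most_common(1)[0][0]

# Hypothetical stand-ins for N = 5 trained trees; each row is [temp, humidity].
trees = [
    lambda row: 1 if row[0] > 70 else 0,
    lambda row: 1 if row[1] < 40 else 0,
    lambda row: 1 if row[0] > 60 else 0,
    lambda row: 0,
    lambda row: 1,
]

# Hypothetical validation set used to rank the trees.
X_val = [[75, 30], [65, 50], [80, 35], [55, 60]]
y_val = [1, 0, 1, 0]

best = select_best_trees(trees, X_val, y_val, m=3)
prediction = forest_predict(best, [78, 33])
print(prediction)
```

Ranking on a held-out set rather than on the training data keeps the selection from simply rewarding trees that memorized their own bootstrap sample.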

Ultimately, we decided that Naive Bayes was our best classifier. Given our dataset, we felt that accuracy could be a misleading statistic because the distribution of days with and without UFO sightings is severely skewed in favor of no sightings. So, although it had the lowest accuracy, Naive Bayes achieved the highest F1 score and actually produced true positives for our dataset. We weighted this metric most heavily because the F1 score reflects the balance between precision and recall.
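The effect of class skew on accuracy can be illustrated with made-up numbers. The labels and predictions below are hypothetical (5 sighting days out of 100) and are not drawn from our dataset; they only show why a classifier that never predicts a sighting can still look accurate.

```python
def metrics(y_true, y_pred):
    """Return (accuracy, precision, recall, f1) for binary 0/1 labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    accuracy = (tp + tn) / len(y_true)
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) > 0 else 0.0
    return accuracy, precision, recall, f1

# Hypothetical skewed labels: 5 sighting days followed by 95 ordinary days.
y_true = [1] * 5 + [0] * 95

# Baseline that never predicts a sighting.
always_no = [0] * 100

# Hypothetical classifier that flags 20 days and catches 2 of the 5 sightings.
some_yes = [1, 1, 0, 0, 0] + [1] * 18 + [0] * 77

print("always no:", metrics(y_true, always_no))
print("some yes: ", metrics(y_true, some_yes))
```

In this toy case the all-negative baseline reaches 95% accuracy with an F1 score of 0, while the second classifier drops to 79% accuracy but earns a nonzero F1 score because it finds some true positives.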

## 4. Classification Web App

```python
import random

import openpyxl
from MyNaiveBayesClassifier import MyNaiveBayesClassifier

# Column index of "Humidity (%) Avg" in the spreadsheet (0-based).
HUMIDITY_INDEX = 14


def read_excel(file_path):
    """Read the active worksheet into a list of rows, skipping the header."""
    workbook = openpyxl.load_workbook(file_path)
    sheet = workbook.active
    data = []
    for row in sheet.iter_rows(values_only=True):
        data.append(list(row))
    return data[1:]  # Skip the header


def normalize_units(row, indices):
    """Select the given columns, scaling humidity from percent to a 0-1 fraction."""
    normalized_row = []
    for i in indices:
        value = row[i]
        if value is None:
            normalized_row.append(0)  # Handle missing values by setting to 0
        elif i == HUMIDITY_INDEX:
            normalized_row.append(value / 100)
        else:
            normalized_row.append(value)
    return normalized_row


def load_filtered_dataset(file_path):
    """Load the merged spreadsheet and return (features, labels) for complete rows."""
    data = read_excel(file_path)
    filtered_data = []
    relevant_indices = [11, 14, 17, 20, 23, 26] + [-1]  # Avg value columns and label

    for row in data:
        # Skip rows with any missing feature or label
        if any(row[i] is None for i in relevant_indices):
            continue

        # Map "yes"/"no" labels to 1/0; any other string is rejected
        label = row[-1]
        if isinstance(label, str):
            label = label.strip().lower()
            label = 1 if label == "yes" else 0 if label == "no" else None

        # Skip rows with an unusable label or non-numeric features
        if label is None or not all(isinstance(row[i], (int, float)) for i in relevant_indices[:-1]):
            continue

        normalized_row = normalize_units(row, relevant_indices[:-1])
        filtered_data.append(normalized_row + [label])

    features = [row[:-1] for row in filtered_data]
    labels = [row[-1] for row in filtered_data]
    return features, labels


def split_data(features, labels, test_size=0.2, seed=0):
    """Shuffle and split the data into training and test sets.

    The original split_data helper is not shown in this report. This minimal
    stand-in assumes it returns X_train, y_train, X_test, y_test in that order;
    the test_size and seed defaults are also assumptions.
    """
    indices = list(range(len(features)))
    random.Random(seed).shuffle(indices)
    n_test = int(len(indices) * test_size)
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    X_train = [features[i] for i in train_idx]
    y_train = [labels[i] for i in train_idx]
    X_test = [features[i] for i in test_idx]
    y_test = [labels[i] for i in test_idx]
    return X_train, y_train, X_test, y_test


def calculate_metrics(y_true, y_pred):
    """Return accuracy, precision, recall, and F1 score for binary 0/1 labels."""
    true_positive = sum(1 for true, pred in zip(y_true, y_pred) if true == pred == 1)
    false_positive = sum(1 for true, pred in zip(y_true, y_pred) if true == 0 and pred == 1)
    false_negative = sum(1 for true, pred in zip(y_true, y_pred) if true == 1 and pred == 0)

    precision = true_positive / (true_positive + false_positive) if (true_positive + false_positive) > 0 else 0
    recall = true_positive / (true_positive + false_negative) if (true_positive + false_negative) > 0 else 0
    f1_score = 2 * (precision * recall) / (precision + recall) if (precision + recall) > 0 else 0

    accuracy = sum(1 for true, pred in zip(y_true, y_pred) if true == pred) / len(y_true)
    return accuracy, precision, recall, f1_score


def run_prediction_interface():
    """Prompt for weather values, train Naive Bayes, and print a prediction."""
    print("Welcome to the Naive Bayes Weather-UFO Prediction App")
    print("Enter the average weather data values below:")

    # The order here must match relevant_indices in load_filtered_dataset.
    # NOTE: humidity is listed third here, while the loader treats column 14
    # (the second index) as humidity; verify the order against the spreadsheet.
    feature_names = [
        "Temperature (°F) Avg", "Dew Point (°F) Avg",
        "Humidity (%) Avg", "Wind Speed (mph) Avg", "Pressure (in) Avg", "Precipitation (in) Total"
    ]

    user_input = []
    for feature in feature_names:
        while True:
            try:
                value = float(input(f"Enter {feature}: "))
                if "Humidity" in feature:
                    value /= 100  # Normalize humidity input
                user_input.append(value)
                break
            except ValueError:
                print("Invalid input. Please enter a numerical value.")

    print("\nProcessing your input...")

    try:
        # Load dataset and train model
        file_path = 'merged_weather_ufo.xlsx'
        features, labels = load_filtered_dataset(file_path)
        X_train, y_train, _, _ = split_data(features, labels)

        nb_classifier = MyNaiveBayesClassifier()
        nb_classifier.fit(X_train, y_train)

        # Predict
        prediction = nb_classifier.predict([user_input])[0]
        result = "likely" if prediction == 1 else "unlikely"
        print(f"\nPrediction: It is {result} that a UFO sighting will occur based on the provided weather data.")

    except Exception as e:
        print(f"Error: {e}")


if __name__ == "__main__":
    run_prediction_interface()
```

## 5. Conclusion

According to our analysis, there is no strong correlation between weather data and UFO sightings in our dataset; the likelihood of a sighting, which is already rare, is also heavily skewed by airport selection and the amount of weather report data. A “run” with data from 100 airports gives greater fidelity to our classification results than a “run” with only 5 airports. Therefore, we cannot predict with any certainty whether a UFO will be spotted based on the weather. We evaluated our classifiers' predictive ability by paying close attention to F1 score, as accuracy is misleading when the data is so heavily skewed toward days without sightings.

The inherent challenges with the dataset came from the size of the charts and the manual assembly of weather data, along with the weak correlation that results from weather and UFO sightings being separate, fairly independent phenomena. Nevertheless, the Naive Bayes classifier performed fairly well given the circumstances, properly identifying UFO sightings for seven days, and would likely have performed better given data on movie/media releases featuring aliens, since our exploratory data analysis showed sightings spiking during the 90s. Adding the day of the week of each sighting as a feature could also help explain what contributes to an increased likelihood of experiencing alien encounters.

## 6. Acknowledgements
Historic weather data sourced from [Wunderground](https://www.wunderground.com/history).
<br>
Project idea inspired by Bilal Ali Shah's [article on Medium](https://medium.com/@24020041/ufo-dataset-predicting-ufo-sightings-in-the-us-7539c95e75a8).
<br>
[Notes on Statistics with R (SwR)](https://bookdown.org/pbaumgartner/swr-harris/10-logistic-regression.html).
<br>
[ChatGPT](https://chatgpt.com/) was utilized for cleaning up visualizations and troubleshooting error messages.