# Baseline Model

## Table of Contents
1. [Model Choice](#model-choice)
2. [Feature Selection](#feature-selection)
3. [Implementation](#implementation)
4. [Evaluation](#evaluation)


## Model Choice

Logistic regression serves as the baseline model because it is simple, interpretable, and cheap to train, and it handles both binary and multi-class classification. Since the target, Navigational Status, is a categorical variable, logistic regression gives a reasonable reference point against which later models can be compared.


## Feature Selection

The features are chosen for their relevance to predicting Navigational Status:
- Latitude and Longitude: The vessel's position.
- Speed over Ground (SOG): The vessel's speed, which varies significantly across statuses.
- Course over Ground (COG): The direction in which the vessel is actually moving.
- Heading: The direction in which the vessel's bow points.

The target variable is Navigational Status.
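
COG and Heading are angles, so values such as 359 and 1 are numerically far apart even though they point in almost the same direction. One common way to handle this is to replace each angle with its sine and cosine. The sketch below is a possible preprocessing step rather than part of the baseline; it assumes the angles are given in degrees, and the helper name `encode_angle` and the sample values are hypothetical.

```python
import numpy as np

def encode_angle(degrees):
    """Map angles in degrees to (sin, cos) pairs so that 0 and 360 coincide."""
    radians = np.deg2rad(np.asarray(degrees, dtype=float))
    return np.sin(radians), np.cos(radians)

# Hypothetical COG values: 359 and 1 degrees are only 2 degrees apart
cog = [359.0, 1.0, 180.0]
cog_sin, cog_cos = encode_angle(cog)

# Euclidean distance in (sin, cos) space reflects the angular closeness
dist_near = np.hypot(cog_sin[0] - cog_sin[1], cog_cos[0] - cog_cos[1])
dist_far = np.hypot(cog_sin[0] - cog_sin[2], cog_cos[0] - cog_cos[2])
print(dist_near < dist_far)
```

With this encoding, each angle contributes two columns (sine and cosine) instead of one raw value.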

## Implementation




In [None]:
import os
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
import matplotlib.pyplot as plt
import seaborn as sns

# Folder containing the 15-minute interval CSV files
data_folder = './time_intervals'

# Columns every file must provide
required_columns = ['Latitude', 'Longitude', 'SOG', 'COG', 'Heading', 'Navigational status']

# Read each CSV file (in sorted order, for reproducibility) and keep only
# those that contain all required columns
frames = []
for file in sorted(os.listdir(data_folder)):
    if file.endswith('.csv'):
        file_path = os.path.join(data_folder, file)
        df = pd.read_csv(file_path)

        if all(col in df.columns for col in required_columns):
            frames.append(df)
        else:
            print(f"Skipping {file}, missing required columns.")

# Combine all files in a single concat call instead of growing a DataFrame in the loop
if not frames:
    raise ValueError(f"No usable CSV files found in {data_folder}")
all_data = pd.concat(frames, ignore_index=True)

# Feature selection
features = ['Latitude', 'Longitude', 'SOG', 'COG', 'Heading']
target = 'Navigational status'

# Drop rows with missing values in the selected columns
all_data = all_data.dropna(subset=features + [target])

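# Convert the target to a pandas categorical; its categories are the class names
all_data[target] = all_data[target].astype('category')

# Define feature matrix (X) and target vector (y)
# y holds the integer category codes; the column itself stays categorical
# so that .cat.categories remains available for labeling plots
X = all_data[features]
y = all_data[target].cat.codes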

# Split the dataset into training and test sets (80/20 split)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Initialize and train the logistic regression model
model = LogisticRegression(max_iter=500, random_state=42)
model.fit(X_train, y_train)

# Predict on the test set
y_pred = model.predict(X_test)
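

The five features are on different scales (Latitude lies between -90 and 90, while COG ranges up to 360), which can slow the solver's convergence. One possible variant standardizes the features in a scikit-learn pipeline before fitting. The sketch below runs on synthetic data, not on the project's files; the arrays, the toy target rule, and the name `scaled_model` are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the five features (not real AIS data)
rng = np.random.default_rng(42)
n = 300
X_demo = np.column_stack([
    rng.uniform(-90, 90, n),    # Latitude
    rng.uniform(-180, 180, n),  # Longitude
    rng.uniform(0, 20, n),      # SOG
    rng.uniform(0, 360, n),     # COG
    rng.uniform(0, 360, n),     # Heading
])
# Toy target: class 1 when the synthetic speed exceeds 10, else class 0
y_demo = (X_demo[:, 2] > 10).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# Standardize each feature, then fit logistic regression
scaled_model = make_pipeline(
    StandardScaler(),
    LogisticRegression(max_iter=500, random_state=42),
)
scaled_model.fit(X_tr, y_tr)
demo_accuracy = scaled_model.score(X_te, y_te)
print(f"Accuracy on synthetic data: {demo_accuracy:.2f}")
```

Because the scaler sits inside the pipeline, it is fitted on the training split only and then applied unchanged to the test split.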


## Evaluation


The baseline model is evaluated with three metrics:

- Accuracy: The proportion of correct predictions out of all predictions, indicating overall performance.
- Classification Report: Precision, recall, and F1-score for each class, which helps identify underperforming categories.
- Confusion Matrix: Actual vs. predicted classes, highlighting where the model misclassifies.
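
To make these three metrics concrete, the following sketch computes them on a small hand-made label set. The labels are invented for illustration and are unrelated to the AIS data.

```python
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Invented labels for three classes (0, 1, 2)
y_true_demo = [0, 0, 0, 1, 1, 2, 2, 2]
y_pred_demo = [0, 0, 1, 1, 1, 2, 0, 2]

# 6 of the 8 predictions match the true labels
demo_acc = accuracy_score(y_true_demo, y_pred_demo)

# Rows are actual classes, columns are predicted classes
demo_cm = confusion_matrix(y_true_demo, y_pred_demo, labels=[0, 1, 2])

print(demo_acc)
print(demo_cm)
print(classification_report(y_true_demo, y_pred_demo, zero_division=0))
```

Diagonal entries of the matrix count correct predictions; off-diagonal entries, such as the one in row 0, column 1, count misclassifications.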



In [None]:
# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
print("Baseline Model Accuracy:", accuracy)

# Generate and display a classification report
print("\nClassification Report:")
print(classification_report(y_test, y_pred))

# Confusion Matrix
conf_matrix = confusion_matrix(y_test, y_pred)
plt.figure(figsize=(10, 7))
sns.heatmap(conf_matrix, annot=True, fmt='d', cmap='Blues', 
            xticklabels=all_data[target].cat.categories, 
            yticklabels=all_data[target].cat.categories)
plt.title('Confusion Matrix')
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.show()
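

Accuracy alone can be misleading when one class dominates the data. As a sanity check, the logistic regression can be compared with a classifier that always predicts the most frequent class. The sketch below uses scikit-learn's `DummyClassifier` on invented, imbalanced labels; the data and variable names are assumptions for illustration.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# Invented, imbalanced labels: 90 samples of class 0 and 10 of class 1
y_imbalanced = np.array([0] * 90 + [1] * 10)
X_placeholder = np.zeros((len(y_imbalanced), 1))  # features are ignored by this strategy

# Always predict the most frequent class seen during fit
majority = DummyClassifier(strategy="most_frequent")
majority.fit(X_placeholder, y_imbalanced)
majority_acc = accuracy_score(y_imbalanced, majority.predict(X_placeholder))
print(majority_acc)
```

A model whose accuracy does not clearly exceed this value has learned little beyond the class frequencies.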

