# Exploratory Data Analysis

This notebook explores the UCODTS datasets, starting with the processed crew data. It uses descriptive statistics, visualizations, and correlation analysis to surface key trends, data quality issues, and areas for further investigation.

**Objectives:**
- Understand the distribution of key metrics (e.g., fatigue scores).
- Identify missing values and outliers.
- Explore relationships between variables to inform feature engineering.
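
One common way to approach the outlier objective is Tukey's IQR rule, which flags values more than 1.5 interquartile ranges beyond the quartiles. The sketch below is an illustration under assumptions, not a step this notebook runs: `iqr_outlier_mask` is a hypothetical helper, and the sample values are made up to stand in for `crew_df['fatigue_score']`.

```python
import pandas as pd


def iqr_outlier_mask(values: pd.Series, k: float = 1.5) -> pd.Series:
    """Return True where a value lies outside [Q1 - k*IQR, Q3 + k*IQR].

    Missing values compare as False, so they are never flagged.
    """
    q1, q3 = values.quantile([0.25, 0.75])
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return (values < lower) | (values > upper)


# Made-up sample standing in for crew_df['fatigue_score'] (assumption).
sample = pd.Series([3.1, 3.4, 2.9, 3.6, 3.2, 9.8, None, 3.0], name="fatigue_score")
mask = iqr_outlier_mask(sample)
outliers = sample[mask]
print(outliers)
```

The multiplier `k` is a convention rather than a rule; a flagged value is a candidate for review, not automatically an error.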

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Load processed crew data
crew_df = pd.read_csv('../datasets/processed/crew_data_processed.csv')
print('Descriptive statistics for crew data:')
print(crew_df.describe())

# Check for missing values
print('\nMissing values in crew data:')
print(crew_df.isnull().sum())
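
Raw missing-value counts are easier to act on when shown as a share of rows. The sketch below is one possible way to do that; `missing_summary` is a hypothetical helper, and the demo frame with its `hours_on_duty` column is invented for illustration rather than taken from the crew data.

```python
import numpy as np
import pandas as pd


def missing_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Count and percentage of missing values, for columns that have any."""
    counts = df.isnull().sum()
    summary = pd.DataFrame({
        "missing": counts,
        "percent": (counts / len(df) * 100).round(1),
    })
    # Keep only affected columns, worst first.
    return summary[summary["missing"] > 0].sort_values("missing", ascending=False)


# Invented demo data (assumption); in the notebook this would be crew_df.
demo = pd.DataFrame({
    "crew_id": [1, 2, 3, 4],
    "fatigue_score": [3.2, np.nan, 4.1, np.nan],
    "hours_on_duty": [8.0, 10.5, np.nan, 7.0],
})
report = missing_summary(demo)
print(report)
```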

In [2]:
# Plot histogram of fatigue scores
plt.figure(figsize=(8, 5))
sns.histplot(crew_df['fatigue_score'], bins=10, kde=True)
plt.title('Fatigue Score Distribution')
plt.xlabel('Fatigue Score')
plt.ylabel('Frequency')
plt.show()

In [3]:
# Correlation matrix to explore relationships between numeric features
plt.figure(figsize=(10, 8))
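# numeric_only=True skips non-numeric columns, which raise an error in pandas >= 2.0
corr = crew_df.corr(numeric_only=True)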
sns.heatmap(corr, annot=True, cmap='coolwarm', fmt='.2f')
plt.title('Correlation Matrix for Crew Data')
plt.show()
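
A heatmap gets hard to read as the number of features grows, so it can help to list the strongest pairs as a table. The sketch below is a possible approach, not part of the original analysis: `top_correlations` is a hypothetical helper, and the demo columns other than `fatigue_score` are invented.

```python
import numpy as np
import pandas as pd


def top_correlations(df: pd.DataFrame, n: int = 5) -> pd.DataFrame:
    """Rank distinct feature pairs by absolute Pearson correlation."""
    corr = df.corr(numeric_only=True)
    cols = corr.columns
    # Upper triangle without the diagonal: each pair once, no self-pairs.
    i, j = np.triu_indices(len(cols), k=1)
    pairs = pd.DataFrame({
        "feature_a": cols[i],
        "feature_b": cols[j],
        "r": corr.to_numpy()[i, j],
    }).dropna(subset=["r"])
    pairs = pairs.sort_values("r", key=lambda s: s.abs(), ascending=False)
    return pairs.head(n).reset_index(drop=True)


# Invented demo data (assumption); in the notebook this would be crew_df.
demo = pd.DataFrame({
    "fatigue_score": [1, 2, 3, 4, 5],
    "hours_on_duty": [2, 4, 6, 8, 10],
    "rest_hours": [5, 3, 4, 1, 2],
    "role": ["a", "b", "a", "b", "a"],  # non-numeric, ignored by corr
})
ranked = top_correlations(demo)
print(ranked)
```

Pairs with very high absolute correlation may be redundant features, which is worth checking during feature engineering.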

## Conclusion

This analysis summarizes the crew data, checks it for missing values, plots the distribution of fatigue scores, and maps correlations between numeric features. These results will guide further data cleaning, feature engineering, and model development for predicting crew fatigue and other operational metrics.