# Data Cleaning Prototype

This notebook prototypes the cleaning steps for the raw UCODTS crew data: coercing `fatigue_score` to a numeric type, dropping rows with missing values, and removing duplicate rows. The result is a cleaned file ready for further analysis and modeling.

**Objectives:**
- Ensure data integrity by addressing missing or invalid values.
- Convert date and numeric fields appropriately.
- Generate a clean version of the raw crew data for downstream processing.
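The cells below only convert `fatigue_score`, so the following is a minimal sketch of how a date field could be converted alongside a numeric one. The `shift_date` column and the sample values are hypothetical; the actual columns of `crew_data.csv` are not shown in this notebook.

```python
import pandas as pd

# Hypothetical sample: 'shift_date' is an illustrative column name,
# not one confirmed to exist in crew_data.csv.
sample = pd.DataFrame({
    'shift_date': ['2024-01-15', 'not a date', None],
    'fatigue_score': ['3.5', 'n/a', '7'],
})

# errors='coerce' turns unparseable entries into NaT / NaN instead of raising
sample['shift_date'] = pd.to_datetime(
    sample['shift_date'], format='%Y-%m-%d', errors='coerce'
)
sample['fatigue_score'] = pd.to_numeric(sample['fatigue_score'], errors='coerce')

print(sample.dtypes)
print(sample.isna().sum())
```

Coercing rather than raising keeps the load step from failing on a single bad value, at the cost of silently creating missing values that a later step has to handle.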

In [1]:
import pandas as pd
import numpy as np

# Load raw crew data
raw_file = '../datasets/raw/crew_data.csv'
df = pd.read_csv(raw_file)
print('Initial data shape:', df.shape)

# Display first few rows
df.head()

In [2]:
# Convert fatigue_score to numeric; values that cannot be parsed become NaN
df['fatigue_score'] = pd.to_numeric(df['fatigue_score'], errors='coerce')

# Drop rows with a missing value in any column, including scores coerced to NaN above
df_clean = df.dropna()
print('Data shape after dropping missing values:', df_clean.shape)

# Remove exact duplicate rows (all columns identical)
df_clean = df_clean.drop_duplicates()
print('Data shape after removing duplicates:', df_clean.shape)
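Because `errors='coerce'` hides parse failures and `dropna()` with no arguments drops a row when any column is missing, it can help to count what is lost at each step. The sketch below uses a small made-up frame; the `crew_id` and `notes` columns are hypothetical and only `fatigue_score` comes from this notebook.

```python
import pandas as pd

# Made-up data for illustration; 'crew_id' and 'notes' are hypothetical columns.
demo = pd.DataFrame({
    'crew_id': [1, 2, 2, 3, 4],
    'fatigue_score': ['4.0', 'high', 'high', None, '6.5'],
    'notes': ['ok', None, None, 'ok', None],
})

scores = pd.to_numeric(demo['fatigue_score'], errors='coerce')

# Values present in the raw column that could not be parsed as numbers
n_coerced = int((scores.isna() & demo['fatigue_score'].notna()).sum())
demo['fatigue_score'] = scores

# Drop on any missing column versus only on the column being cleaned
dropped_any = demo.dropna()
dropped_subset = demo.dropna(subset=['fatigue_score'])

print('Values coerced to NaN:', n_coerced)
print('Rows kept with dropna():', len(dropped_any))
print("Rows kept with dropna(subset=['fatigue_score']):", len(dropped_subset))
```

Whether to drop on every column or only on a subset depends on which fields downstream steps actually require; the notebook as written drops on every column.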

In [3]:
# Save the cleaned data to the processed directory
output_file = '../datasets/processed/crew_data_processed.csv'
df_clean.to_csv(output_file, index=False)
print('Data cleaning complete. Clean file saved to:', output_file)
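`to_csv` fails if the target directory does not exist, and a saved file is only useful if it reloads as expected. The following sketch, which writes to a temporary directory rather than the real `../datasets/processed/` path and uses a made-up frame with a hypothetical `crew_id` column, shows one way to create the directory and check the round trip.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Stand-in for df_clean; 'crew_id' is a hypothetical column.
clean = pd.DataFrame({'crew_id': [1, 2], 'fatigue_score': [4.0, 6.5]})

with tempfile.TemporaryDirectory() as tmp:
    out_path = Path(tmp) / 'processed' / 'crew_data_processed.csv'
    # Create the output directory if it is missing
    out_path.parent.mkdir(parents=True, exist_ok=True)
    clean.to_csv(out_path, index=False)
    reloaded = pd.read_csv(out_path)

# Basic checks on the reloaded file
same_shape = reloaded.shape == clean.shape
no_missing = int(reloaded.isna().sum().sum()) == 0
no_duplicates = int(reloaded.duplicated().sum()) == 0
print(same_shape, no_missing, no_duplicates)
```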

## Conclusion

The cleaned dataset contains no rows with missing values and no duplicate rows, and `fatigue_score` is stored as a numeric type. It is saved to `../datasets/processed/crew_data_processed.csv`, ready for feature engineering and model development.