**Final Project: EMPA Data Analysis**

**Trevor Gardner**

# 1. What’s the Project About?
The EMPA Monitoring Program in California focuses on tracking the health of estuaries. For this project, I wanted to see if I could predict fish sizes (Small, Medium, Large) based on environmental and biological factors. The idea is that if I can accurately classify fish sizes, it could help understand how Marine Protected Areas (MPAs) work and how fish populations grow over time.

I used a dataset from the SOP 9 Fish Seines project, which includes data on things like fish length, species, and the areas where the fish were sampled. This kind of analysis can be useful for conservationists and policymakers trying to figure out how MPAs are performing.


# 2. Data Collection and Preparation
I got the data from EMPA’s monitoring system, which tracks fish caught with seining nets. Before the analysis, I had to clean it up. Missing values are coded as -88 or "Not recorded", so I converted those to NaN and dropped rows with no recorded length. I then binned the fish into three size classes: Small (under 50 mm), Medium (50 mm to under 100 mm), and Large (100 mm and above).
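To make the bin edges concrete, here is a small sketch of how `pd.cut` assigns values that fall exactly on a boundary when `right=False`. The lengths are made-up illustration values, not rows from the dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical lengths chosen to sit on or near the bin edges
lengths = pd.Series([0.0, 49.9, 50.0, 99.9, 100.0, 250.0])

bins = [0, 50, 100, np.inf]
labels = ["Small", "Medium", "Large"]

# right=False makes each bin closed on the left: [0, 50), [50, 100), [100, inf)
categories = pd.cut(lengths, bins=bins, labels=labels, right=False)
print(pd.DataFrame({"length_mm": lengths, "length_category": categories}))
```

A fish of exactly 50 mm therefore lands in Medium, and one of exactly 100 mm lands in Large.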

Then, I split the data into two sets: 80% for training the model and 20% for testing it. Holding out a test set gives an estimate of how well the model generalizes to fish it has not seen, and I set the random state to 42 so the split is reproducible.

In [22]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split

# Load sop9 dataset
df = pd.read_csv("combined_sop9_length_data.csv")

# Replace missing values (Not recorded and -88 values)
df.replace([-88], np.nan, inplace=True)
df.replace(["Not recorded"], np.nan, inplace=True)

# Drop rows where 'length_mm' is missing in sop9 dataset
df.dropna(subset=["length_mm"], inplace=True)

# Categorize 'length_mm' into bins for classification
bins = [0, 50, 100, np.inf]  # Define bins for small, medium, large
labels = ['Small', 'Medium', 'Large']  # Define class labels
df['length_category'] = pd.cut(df['length_mm'], bins=bins, labels=labels, right=False)

# Encode the categorical target variable
target = "length_category"
print(df[target].value_counts())

# LabelEncoder sorts class names alphabetically: Large=0, Medium=1, Small=2
target_encoder = LabelEncoder()
df[target] = target_encoder.fit_transform(df[target])

# Categorical feature columns used as model inputs
categorical_features = [
    "stationno", "estuaryname", "scientificname",
    "samplecollectiondate",
]

# Label encoding for categorical features
label_encoders = {}
for col in categorical_features:
    le = LabelEncoder()
    df[col] = le.fit_transform(df[col].astype(str))
    label_encoders[col] = le

# Prepare feature matrix X and target vector y
X = df[categorical_features]
y = df[target]

# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Now you have the sop9 data ready for training


length_category
Small     372
Medium    217
Large      51
Name: count, dtype: int64
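The class counts above are uneven (372 / 217 / 51), so a plain random split can leave the test set with a slightly different share of Large fish than the full data. One option, which the run above did not use, is to pass `stratify=y` to `train_test_split`. The sketch below uses synthetic labels with the same class counts and a placeholder feature column, so it only illustrates the split itself.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic labels with the same class counts as the printed output above
# (0 = Large, 1 = Medium, 2 = Small, the order LabelEncoder assigns)
y_demo = np.array([2] * 372 + [1] * 217 + [0] * 51)
X_demo = np.arange(len(y_demo)).reshape(-1, 1)  # placeholder feature

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# Class shares in the full data and in the stratified test set
full_share = np.bincount(y_demo) / len(y_demo)
test_share = np.bincount(y_te, minlength=3) / len(y_te)
print("full:", full_share.round(3))
print("test:", test_share.round(3))
```

With stratification the test-set shares track the full-data shares closely, which matters most for the small Large class.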


I then tried to improve the model with additional data from SOP 2, which holds water quality measurements collected on the same days as the fish samples. Integrating it was challenging because no key links an individual water quality reading to an individual fish record. I merged the two datasets on the shared "samplecollectiondate" column: the water quality readings, which are reported throughout the day at different depths, were averaged per date, and those daily means were attached to the fish data as extra features. Fish records with no water quality data on their collection date were dropped, leaving 604 of the 640 rows. Unfortunately, this addition didn’t improve the model’s performance.

In [23]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split

# Load sop9 and sop2 datasets
df = pd.read_csv("combined_sop9_length_data.csv")
df_sop2 = pd.read_csv("combined_sop2_water_quality_data.csv")

# Select relevant columns from sop2 dataset
df_sop2 = df_sop2[['projectid', 'siteid', 'stationno', 'h2otemp', 'do_mgl', 'do_percent', 'salinity', 'conductivity', 'samplecollectiondate']]

# Replace missing values (Not recorded and -88 values)
df.replace([-88], np.nan, inplace=True)
df.replace(["Not recorded"], np.nan, inplace=True)
df_sop2.replace([-88], np.nan, inplace=True)
df_sop2.replace(["Not recorded"], np.nan, inplace=True)

# Drop rows where 'length_mm' is missing in sop9 dataset
df.dropna(subset=["length_mm"], inplace=True)

# Categorize 'length_mm' into bins for classification
bins = [0, 50, 100, np.inf]  # Define bins for small, medium, large
labels = ['Small', 'Medium', 'Large']  # Define class labels
df['length_category'] = pd.cut(df['length_mm'], bins=bins, labels=labels, right=False)

# Group water quality data by the sample collection date
df_sop2_grouped = df_sop2.groupby('samplecollectiondate').agg({
    'h2otemp': 'mean',
    'do_mgl': 'mean',
    'do_percent': 'mean',
    'salinity': 'mean',
    'conductivity': 'mean'
}).reset_index()

# Merge the grouped water quality data with the fish data based on 'samplecollectiondate'
df_combined = pd.merge(df, df_sop2_grouped, on='samplecollectiondate', how='left')

# Remove rows where there is no corresponding data for water quality parameters (i.e., NaN values in sop2 columns)
df_combined = df_combined.dropna(subset=['h2otemp', 'do_mgl', 'do_percent', 'salinity', 'conductivity'], how='all')

# Save the combined dataset to a new CSV file
df_combined.to_csv("combined_sop9_and_sop2_data.csv", index=False)

# Encode the categorical target variable
target = "length_category"
print(df_combined[target].value_counts())

target_encoder = LabelEncoder()
df_combined[target] = target_encoder.fit_transform(df_combined[target])

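# Feature columns for the combined model
# Note: every column listed here goes through LabelEncoder below, including the
# numeric water quality columns (h2otemp, do_mgl, do_percent, salinity,
# conductivity), so their values are treated as string categories and their
# numeric ordering is not preserved.
categorical_features = [
    "stationno", "estuaryname", "scientificname",
    "samplecollectiondate", "surveytype", "status", "h2otemp", "do_mgl",
    "do_percent", "salinity", "conductivity"
]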

# Label encoding for categorical features
label_encoders = {}
for col in categorical_features:
    le = LabelEncoder()
    df_combined[col] = le.fit_transform(df_combined[col].astype(str))
    label_encoders[col] = le

# Prepare features and target for the combined model (sop9 + sop2)
X_combined = df_combined[categorical_features]
y_combined = df_combined[target]

# Split the data into training and test sets for the combined model
X_combined_train, X_combined_test, y_combined_train, y_combined_test = train_test_split(X_combined, y_combined, test_size=0.2, random_state=42)

# Now, you have separate training and test sets for the combined model:
# - `X_combined_train`, `y_combined_train` for training the combined model
# - `X_combined_test`, `y_combined_test` for testing the combined model


length_category
Small     339
Medium    217
Large      48
Name: count, dtype: int64
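The drop from 640 to 604 rows reflects fish records whose collection date has no water quality data. A possible way to inspect that coverage, and to keep the water quality columns numeric rather than label-encoding them, is sketched below. The two small frames are hypothetical stand-ins for the SOP 9 and SOP 2 tables, and this is not the code used for the results above.

```python
import pandas as pd

# Hypothetical miniature versions of the two tables
fish = pd.DataFrame({
    "samplecollectiondate": ["2022-05-01", "2022-05-01", "2022-05-02", "2022-05-03"],
    "length_mm": [42.0, 77.0, 120.0, 35.0],
})
water = pd.DataFrame({
    "samplecollectiondate": ["2022-05-01", "2022-05-01", "2022-05-02"],
    "h2otemp": [18.0, 20.0, 17.5],
    "salinity": [30.0, 32.0, 28.0],
})

# One row per date: daily mean of each water quality reading
water_daily = water.groupby("samplecollectiondate", as_index=False).mean(numeric_only=True)

# indicator=True adds a `_merge` column showing which fish rows found a match
merged = fish.merge(water_daily, on="samplecollectiondate", how="left", indicator=True)
print(merged["_merge"].value_counts())

# Keep only matched rows; the water quality columns stay as floats
matched = merged[merged["_merge"] == "both"].drop(columns="_merge")
print(matched.dtypes)
```

Tree-based models such as XGBoost can split directly on float columns, so the daily means can be passed in as numbers while only the text columns are encoded.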


# 3. Building the Models

I tested a couple of models to see which would perform best:

1. XGBoost Classifier: I chose this model because it works really well with structured data like the one I was using.
2. MLP Neural Network: This one was just a comparison to see if it would outperform XGBoost.

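I first trained the XGBoost model with its default settings, then set up a GridSearchCV over the depth of the trees, learning rate, number of trees, and the row and column subsampling rates to search for better settings. In the saved notebook output, the grid search cell shows a KeyboardInterrupt, and the 85.16% accuracy reported in Section 4 matches the default-parameter model. I evaluated the model with test-set accuracy and a confusion matrix to see how well it generalized.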
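Test accuracy on its own does not show whether a model is overfitting. One common check is to compare accuracy on the training data with accuracy on the test data. The sketch below wraps that comparison in a hypothetical helper, `accuracy_gap`, and demonstrates it on synthetic data with scikit-learn's `GradientBoostingClassifier` as a stand-in, so it runs without the fish dataset. The same helper should work with the fitted `XGBClassifier` and the real splits.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split


def accuracy_gap(fitted_model, X_train, y_train, X_test, y_test):
    """Return (train accuracy, test accuracy, train minus test)."""
    train_acc = accuracy_score(y_train, fitted_model.predict(X_train))
    test_acc = accuracy_score(y_test, fitted_model.predict(X_test))
    return train_acc, test_acc, train_acc - test_acc


# Synthetic three-class data standing in for the fish features
X_demo, y_demo = make_classification(
    n_samples=600, n_features=4, n_informative=3, n_redundant=0,
    n_classes=3, n_clusters_per_class=1, random_state=42,
)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

demo_model = GradientBoostingClassifier(random_state=42).fit(X_tr, y_tr)
train_acc, test_acc, gap = accuracy_gap(demo_model, X_tr, y_tr, X_te, y_te)
print(f"train={train_acc:.3f}  test={test_acc:.3f}  gap={gap:.3f}")
```

A large positive gap suggests the model has memorized the training data; a small gap suggests it generalizes about as well as it fits.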



In [24]:
import xgboost as xgb
from sklearn.metrics import accuracy_score
model = xgb.XGBClassifier(random_state=42)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)

Accuracy: 0.8515625


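The combined model, which added the SOP 2 water quality data, did not improve performance. Accuracy dropped from about 85% to about 77%. The two numbers are not a strict like-for-like comparison: the combined dataset has fewer rows (604 vs. 640) and therefore a different test split, and its feature list also adds "surveytype" and "status". Given the lower accuracy, I decided to move forward with the original, non-combined dataset.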

In [25]:
combined_model = xgb.XGBClassifier(random_state=42)
combined_model.fit(X_combined_train, y_combined_train)
y_combined_pred = combined_model.predict(X_combined_test)
accuracy = accuracy_score(y_combined_test, y_combined_pred)
print("Accuracy:", accuracy)

Accuracy: 0.768595041322314


In [26]:
from sklearn.model_selection import GridSearchCV

param_grid = {
    'max_depth': [4, 6, 8],
    'learning_rate': [0.01, 0.05, 0.1],
    'n_estimators': [200, 500, 1000],
    'subsample': [0.6, 0.8, 1.0],
    'colsample_bytree': [0.6, 0.8, 1.0]
}

grid_search = GridSearchCV(xgb.XGBClassifier(random_state=42), param_grid, cv=3, scoring='accuracy', n_jobs=-1)
grid_search.fit(X_train, y_train)

print("Best Parameters:", grid_search.best_params_)

# best_estimator_ is already refit on the full training set (refit=True by default),
# so no second call to fit() is needed
model = grid_search.best_estimator_

# Make predictions
y_pred = model.predict(X_test)
# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)

KeyboardInterrupt: 
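A quick way to see why this search is slow is to count how many models it has to fit. The sketch below reuses the grid from the cell above and scikit-learn's `ParameterGrid` to do the arithmetic.

```python
from sklearn.model_selection import ParameterGrid

# Same grid as in the cell above
param_grid = {
    "max_depth": [4, 6, 8],
    "learning_rate": [0.01, 0.05, 0.1],
    "n_estimators": [200, 500, 1000],
    "subsample": [0.6, 0.8, 1.0],
    "colsample_bytree": [0.6, 0.8, 1.0],
}

n_candidates = len(ParameterGrid(param_grid))  # 3**5 combinations
n_fits = n_candidates * 3                      # cv=3 folds per candidate
print(n_candidates, n_fits)  # 243 729
```

That is 729 cross-validation fits, many with up to 1000 trees, plus one final refit. If the full grid is too slow, a smaller grid or `RandomizedSearchCV` with a fixed `n_iter` are common alternatives.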

In [None]:
from sklearn.neural_network import MLPClassifier
# Define MLPClassifier (neural network) with two hidden layers of 200 and 100 units;
# hidden_layer_sizes must be an ordered tuple, not a set
mlp = MLPClassifier(hidden_layer_sizes=(200, 100), activation='relu', solver='lbfgs', max_iter=1000, random_state=42)
# Train the model
mlp.fit(X_train, y_train)
# Make predictions (kept separate from the XGBoost y_pred)
y_pred_mlp = mlp.predict(X_test)
# Evaluate the model
accuracy_mlp = accuracy_score(y_test, y_pred_mlp)
print("Neural Network Accuracy:", accuracy_mlp)

KeyboardInterrupt: 
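Label-encoded integers give a neural network the impression that station or species codes have a magnitude and an order. A possible alternative, not what the cell above does, is to one-hot encode the categorical columns inside a pipeline. The sketch below uses a small synthetic frame with hypothetical category values, in which the species fully determines the size class, so it only shows the mechanics.

```python
import numpy as np
import pandas as pd
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

rng = np.random.default_rng(42)

# Hypothetical categorical features; the species drives the size class here
X_demo = pd.DataFrame({
    "scientificname": rng.choice(["species_a", "species_b", "species_c"], size=300),
    "estuaryname": rng.choice(["estuary_x", "estuary_y"], size=300),
})
y_demo = X_demo["scientificname"].map(
    {"species_a": 0, "species_b": 1, "species_c": 2}
).to_numpy()

# One-hot encode the categories so the network does not treat codes as magnitudes
mlp_pipeline = make_pipeline(
    OneHotEncoder(handle_unknown="ignore"),
    MLPClassifier(hidden_layer_sizes=(200, 100), max_iter=500, random_state=42),
)
mlp_pipeline.fit(X_demo.iloc[:240], y_demo[:240])
demo_accuracy = mlp_pipeline.score(X_demo.iloc[240:], y_demo[240:])
print("Demo accuracy:", demo_accuracy)
```

Because the encoder lives inside the pipeline, it is fitted only on the training rows, and `handle_unknown="ignore"` keeps prediction from failing on categories that appear only in the test rows.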


# 4. Evaluation and Testing
When I tested the XGBoost model, it achieved an accuracy of 85.16% on the test data (109 of 128 test fish). I also looked at the confusion matrix, which showed that most fish were correctly classified, though there were some misclassifications between adjacent size categories. Interestingly, the MLP Neural Network achieved the same accuracy of 85.16% when I used two hidden layers of 200 and 100 units. So, both models performed similarly in terms of accuracy, though XGBoost is generally regarded as a strong and efficient choice for structured, tabular data like this.
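For context on the 85.16% figure, it helps to know what a model would score by always predicting the most common class. The arithmetic below uses the class counts printed in Section 2. Those counts cover the full dataset rather than the 20% test split, so the result is an approximate baseline.

```python
# Class counts from the value_counts() output in Section 2
counts = {"Small": 372, "Medium": 217, "Large": 51}

total = sum(counts.values())
majority_class = max(counts, key=counts.get)
baseline_accuracy = counts[majority_class] / total

print(f"Always predicting '{majority_class}' scores {baseline_accuracy:.4f}")
```

The baseline is roughly 58%, so the model's accuracy is a clear improvement over guessing the majority class.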

In [None]:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
import xgboost as xgb

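# Recompute the XGBoost predictions so the matrix does not pick up
# a y_pred left over from another model
y_pred = model.predict(X_test)

# Compute confusion matrix
cm = confusion_matrix(y_test, y_pred)
disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=target_encoder.classes_)

fig1, axs1 = plt.subplots(1, 2, figsize=(14, 6))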

# Plot confusion matrix
disp.plot(ax=axs1[0], cmap=plt.cm.Blues, values_format="d")
axs1[0].set_title("Confusion Matrix")

# Plot distribution of length categories (one color per size class)
df['length_category'].value_counts().plot(kind='bar', ax=axs1[1], color=['blue', 'orange', 'green'])
axs1[1].set_xlabel("Length Category")
axs1[1].set_ylabel("Count")
axs1[1].set_title("Distribution of Length Categories")
axs1[1].tick_params(axis='x', rotation=0)

plt.tight_layout()
plt.show()

# Feature Importance Plot
fig2, axs2 = plt.subplots(1, 2, figsize=(14, 6))

importances = model.feature_importances_
features = X.columns

# Sort feature importance
sorted_idx = np.argsort(importances)[::-1]

axs2[0].barh(range(len(features)), importances[sorted_idx], align="center")
axs2[0].set_yticks(range(len(features)))
axs2[0].set_yticklabels([features[i] for i in sorted_idx])
axs2[0].set_xlabel("Feature Importance Score")
axs2[0].set_title("XGBoost Feature Importance")
axs2[0].invert_yaxis()

# XGBoost built-in importance plot
xgb.plot_importance(model, ax=axs2[1])

plt.tight_layout()
plt.show()



# 5. Wrapping It Up

The project moved from a problem definition through data preparation and model building to evaluation. I used visualizations like the confusion matrix and feature importance chart to make the findings easier to understand.

Each stage, from preparing the data to building and testing the models, is documented in the code cells and the text around them.

---
**Final Model Selection:** XGBoost (85.16% Accuracy)  
**Key Insight:** Sampling location, species, and collection date were enough to predict fish size class with about 85% accuracy, while adding daily mean water quality values did not improve the model.  
**Conclusion:** I successfully used machine learning to classify fish sizes in California’s estuarine environments. This analysis can support conservation efforts and help assess the effectiveness of MPAs.