# Adding Nutrient Load to Initial Particle Positions and PyLag Outputs

### Notebook Overview

This notebook processes nutrient flux data from stream watersheds and integrates it with particle tracking outputs.

---

#### 1. Read Zonal Statistics for Stream Watersheds

We begin by reading the zonal statistics output, which represents the nutrient flux (Total Nitrogen - **TN** and Total Phosphorus - **TP**) from each stream watershed to Lake Huron. These nutrient values are then associated with their corresponding `group_id` to facilitate linkage with particle tracking results.

---

#### 2. Integrate Nutrient Loads with Particle Tracking Outputs

Next, we add TN and TP values to the particle tracking dataset by:

- Counting how many times each `group_id` occurs (i.e., the number of particles released from each stream watershed).
- Dividing the total TN and TP values by the number of particles released **and** the number of unique days in each release month.

For example, for `release_month = 202301`:

$TN_{\text{per particle}} = \dfrac{\text{total TN for the group}}{\text{number of particles in Jan 2023} \times \text{number of unique days}}$

This allows for a per-particle representation of nutrient loads, supporting fine-scale spatial analysis of nutrient delivery to the lake.
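
As a quick check of the arithmetic, the sketch below evaluates the per-particle load for group 0 using values that appear in this notebook's outputs (13.480809 kg/day of TN, 75 particles in the group). The helper `per_particle_load` and its `n_days` parameter are illustrative names, not part of the notebook. With `n_days=1` the formula reduces to a plain division by the particle count.

```python
def per_particle_load(total_load, n_particles, n_days=1):
    """Split a group's total load evenly across particles and days (hypothetical helper)."""
    if n_particles <= 0 or n_days <= 0:
        raise ValueError("n_particles and n_days must be positive")
    return total_load / (n_particles * n_days)


# Group 0: StreamDirectTN_kgday = 13.480809, 75 particles in the group
tn_group0 = 13.480809
tn_per_particle = per_particle_load(tn_group0, n_particles=75)
print(round(tn_per_particle, 6))  # 0.179744
```

The result matches the `StreamTN_kgdayparticle` value printed for group 0 in the verification cells.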


# Imports

In [1]:
# Standard library
import glob
import os
import re
from collections import defaultdict

# Numerical and plotting tools
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# NetCDF reading
import xarray as xr

# Inputs

In [2]:
# Path to the file holding the initial particle positions
path = '/home/abolmaal/modelling/FVCOM/Huron/input/initial_position'
initial_positions_file = os.path.join(path, 'initial_positions_releasezone_intersection_multigroup_midplume_lastrevised.dat')

# Output

In [3]:
out_path = '/home/abolmaal/modelling/FVCOM/Huron/InDirectNP_load'

# Functions 

In [4]:
# Read the initial positions file (space-separated, no column names in the file)
Initial_position = pd.read_csv(initial_positions_file, sep=' ', header=None, names=['group_id', 'lon', 'lat', 'depth'])
# Drop the first row (presumably the file's header line rather than a particle record)
Initial_position = Initial_position.drop(0, axis=0)
Initial_position.head()

|   | group_id | lon | lat | depth |
|---|----------|-----|-----|-------|
| 1 | 0 | 275.3339 | 45.7428 | 0.0 |
| 2 | 0 | 275.3351 | 45.7428 | 0.0 |
| 3 | 0 | 275.3363 | 45.7428 | 0.0 |
| 4 | 0 | 275.3375 | 45.7428 | 0.0 |
| 5 | 0 | 275.3387 | 45.7428 | 0.0 |


In [5]:
# Step 1: Number the particles within each group_id (0, 1, 2, ...)
Initial_position['particle_index'] = Initial_position.groupby('group_id').cumcount()

# Step 2: Build particle_id as group_id (no padding) + particle_index (zero-padded to 2 digits)
Initial_position['particle_id'] = Initial_position.apply(
    lambda row: f"{int(row['group_id'])}{int(row['particle_index']):02d}", axis=1
)
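
To make the ID layout concrete, here is a minimal sketch of the string formatting involved. The helper `make_particle_id` and its `group_width` parameter are hypothetical names introduced for illustration. `group_width=0` reproduces the unpadded `group_id` used in this cell, and `group_width=3` gives the three-digit variant used later when merging.

```python
def make_particle_id(group_id, particle_index, group_width=0):
    """Concatenate group_id and a 2-digit particle index (hypothetical helper)."""
    group_part = str(int(group_id)).zfill(group_width)  # zfill(0) leaves the string unchanged
    return f"{group_part}{int(particle_index):02d}"


print(make_particle_id(0, 4))                     # 004
print(make_particle_id(0, 4, group_width=3))      # 00004
print(make_particle_id(162, 70, group_width=3))   # 16270
```

The fixed widths only give unambiguous IDs while each group has fewer than 100 particles and, for the padded variant, `group_id` stays below 1000. Both hold for the 75 particles per group and the group IDs shown in this notebook.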

In [6]:
# Keep and reorder the columns (this drops the helper column particle_index)
Initial_position = Initial_position[['group_id', 'particle_id', 'lon', 'lat', 'depth']]

Initial_position.head()

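|   | group_id | particle_id | lon | lat | depth |
|---|----------|-------------|-----|-----|-------|
| 1 | 0 | 0 | 275.3339 | 45.7428 | 0.0 |
| 2 | 0 | 1 | 275.3351 | 45.7428 | 0.0 |
| 3 | 0 | 2 | 275.3363 | 45.7428 | 0.0 |
| 4 | 0 | 3 | 275.3375 | 45.7428 | 0.0 |
| 5 | 0 | 4 | 275.3387 | 45.7428 | 0.0 |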


# Adding TN and TP from Zonal Stats to the Initial Positions File

In [7]:
DirectNPcoast = pd.read_csv('/mnt/d/Users/abolmaal/Arcgis/NASAOceanProject/ZonalStats/DirectTNTP_StreamWatresheds.csv')

In [8]:
DirectNPcoast

|   | Group_id | Shape_Area | StreamDirectTP_kgday | StreamDirectTN_kgday | StreamDirectTN_grm2yr | StreamDirectTP_grm2yr |
|---|----------|------------|----------------------|----------------------|-----------------------|-----------------------|
| 0 | 0 | 2.608193e+07 | 0.974675 | 13.480809 | 0.188655 | 0.013640 |
| 1 | 1 | 6.915027e+07 | 2.000936 | 51.878556 | 0.273834 | 0.010562 |
| 2 | 2 | 1.703711e+08 | 4.179389 | 87.793043 | 0.188086 | 0.008954 |
| 3 | 3 | 9.425426e+07 | 3.598198 | 54.996792 | 0.212975 | 0.013934 |
| 4 | 4 | 3.772995e+07 | 1.326998 | 27.250568 | 0.263622 | 0.012837 |
| ... | ... | ... | ... | ... | ... | ... |
| 140 | 170 | 1.961733e+07 | 4.299652 | 63.984811 | 1.190501 | 0.079999 |
| 141 | 171 | 1.228468e+03 | 0.000563 | 0.013980 | 4.153712 | 0.167337 |
| 142 | 173 | 1.396735e+07 | 4.034916 | 23.636322 | 0.617673 | 0.105442 |
| 143 | 174 | 7.877722e+06 | 2.496025 | 14.078233 | 0.652289 | 0.115649 |


In [9]:
# Count the number of unique Group_id values (stream watersheds) in the data
unique_stwater_ids = DirectNPcoast['Group_id'].nunique()
print(f"Number of unique Stream watersheds: {unique_stwater_ids}")

Number of unique Stream watersheds: 145


## Merging Direct NP to coast with Initial position

In [11]:
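df = Initial_position    # initial positions table, kept for later checks
df1 = Initial_position   # working table that will receive the nutrient columns
df2 = DirectNPcoast      # zonal statistics (TN/TP loads per stream watershed)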


In [12]:
# Step 1: Count how many times each group_id occurs in df
group_counts = df1['group_id'].value_counts().rename_axis('group_id').reset_index(name='count')
print(group_counts)

     group_id  count
0           0     75
1         116     75
2         113     75
3         112     75
4         111     75
..        ...    ...
109        50     75
110        49     75
111        48     75
112        45     75
113       162     75

[114 rows x 2 columns]
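
The count table lists 114 groups, while the zonal statistics report 145 stream watersheds, so some watersheds have no released particles and drop out of any inner merge. A minimal sketch for listing the unmatched IDs on either side is shown below. The two toy tables and their values are illustrative stand-ins, not data from this notebook.

```python
import pandas as pd

# Toy stand-ins for the two tables (illustrative values only)
particles = pd.DataFrame({"group_id": [0, 0, 1, 1, 3]})
watersheds = pd.DataFrame({
    "group_id": [0, 1, 2, 3, 4],
    "StreamDirectTN_kgday": [13.5, 51.9, 87.8, 55.0, 27.3],
})

# One row per group_id that has at least one particle
groups_with_particles = particles[["group_id"]].drop_duplicates()

# An outer merge with indicator=True labels each key by the table(s) it came from
check = watersheds.merge(groups_with_particles, on="group_id", how="outer", indicator=True)

no_particles = sorted(check.loc[check["_merge"] == "left_only", "group_id"].tolist())
no_load = sorted(check.loc[check["_merge"] == "right_only", "group_id"].tolist())
print("watersheds without particles:", no_particles)
print("particle groups without a load:", no_load)
```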


In [13]:
# Step 1: Rename df2's key column so it matches df1
df2 = df2.rename(columns={'Group_id': 'group_id'})

# Step 2: Count how many times each group_id occurs in df1
group_counts = df1['group_id'].value_counts().rename_axis('group_id').reset_index(name='count')

# Step 3: Merge counts into df2 (to know how many times to repeat each row)
df2_expanded = df2.merge(group_counts, on='group_id', how='inner')  # Only keep matches

# Step 4: Repeat each row in df2 based on the count
df2_repeated = df2_expanded.loc[df2_expanded.index.repeat(df2_expanded['count'])].reset_index(drop=True)

# Step 5: Keep only matching group_ids in df1 and reset both indexes to ensure alignment
df1 = df1[df1['group_id'].isin(df2_repeated['group_id'])]
df1 = df1.reset_index(drop=True)
df2_repeated = df2_repeated.drop(columns='count').reset_index(drop=True)

# Step 6: Add particle_index within each group_id
df1['particle_index'] = df1.groupby('group_id').cumcount()

# Step 7: Create particle_id as group_id (3 digits) + particle_index (2 digits)
df1['particle_id'] = df1.apply(
    lambda row: f"{int(row['group_id']):03d}{int(row['particle_index']):02d}", axis=1
)
df1['particle_id'] = df1['particle_id'].astype(str)

# Step 8: Concatenate the two DataFrames side by side.
# This is positional, so it assumes df1 and df2_repeated list their group_ids in the same row order.
df2_repeated = df2_repeated.drop(columns='group_id')
merged_df = pd.concat([df1, df2_repeated], axis=1)
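
Because a side-by-side concatenation depends on both tables sharing the same row order, a key-based merge is a common alternative that attaches the watershed loads by `group_id` regardless of ordering. The sketch below is not the notebook's method. It uses toy tables with illustrative values, and `validate="many_to_one"` is an optional safeguard that raises an error if a `group_id` appears more than once in the loads table.

```python
import pandas as pd

# Toy data (illustrative values only); group 2 is listed before group 1 on purpose
positions = pd.DataFrame({
    "group_id": [0, 0, 2, 2, 1],
    "lon": [275.33, 275.34, 276.10, 276.11, 275.90],
})
loads = pd.DataFrame({
    "group_id": [0, 1, 2, 5],
    "StreamDirectTN_kgday": [13.5, 51.9, 87.8, 1.0],
})

# The inner merge keeps only group_ids present in both tables and matches rows by key
merged = positions.merge(loads, on="group_id", how="inner", validate="many_to_one")
print(merged)
```

Group 5 has no particles and is dropped. Every particle row receives the load of its own group even though the two tables are ordered differently.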


In [14]:
merged_df.columns

Index(['group_id', 'particle_id', 'lon', 'lat', 'depth', 'particle_index',
       'Shape_Area', 'StreamDirectTP_kgday', 'StreamDirectTN_kgday',
       'StreamDirectTN_grm2yr', 'StreamDirectTP_grm2yr'],
      dtype='object')

In [14]:
merged_df

|   | group_id | particle_id | lon | lat | depth | particle_index | Shape_Area | StreamDirectTP_kgday | StreamDirectTN_kgday | StreamDirectTN_grm2yr | StreamDirectTP_grm2yr |
|---|----------|-------------|-----|-----|-------|----------------|------------|----------------------|----------------------|-----------------------|-----------------------|
| 0 | 0 | 00000 | 275.3339 | 45.7428 | 0.0 | 0 | 2.608193e+07 | 0.974675 | 13.480809 | 0.188655 | 0.013640 |
| 1 | 0 | 00001 | 275.3351 | 45.7428 | 0.0 | 1 | 2.608193e+07 | 0.974675 | 13.480809 | 0.188655 | 0.013640 |
| 2 | 0 | 00002 | 275.3363 | 45.7428 | 0.0 | 2 | 2.608193e+07 | 0.974675 | 13.480809 | 0.188655 | 0.013640 |
| 3 | 0 | 00003 | 275.3375 | 45.7428 | 0.0 | 3 | 2.608193e+07 | 0.974675 | 13.480809 | 0.188655 | 0.013640 |
| 4 | 0 | 00004 | 275.3387 | 45.7428 | 0.0 | 4 | 2.608193e+07 | 0.974675 | 13.480809 | 0.188655 | 0.013640 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 8545 | 162 | 16270 | 276.1465 | 43.6591 | 0.0 | 70 | 1.496975e+10 | 996.783451 | 28752.263320 | 0.701052 | 0.024304 |
| 8546 | 162 | 16271 | 276.1477 | 43.6591 | 0.0 | 71 | 1.496975e+10 | 996.783451 | 28752.263320 | 0.701052 | 0.024304 |
| 8547 | 162 | 16272 | 276.1489 | 43.6591 | 0.0 | 72 | 1.496975e+10 | 996.783451 | 28752.263320 | 0.701052 | 0.024304 |
| 8548 | 162 | 16273 | 276.1501 | 43.6591 | 0.0 | 73 | 1.496975e+10 | 996.783451 | 28752.263320 | 0.701052 | 0.024304 |


In [15]:
# save the merged DataFrame to a CSV file
output_file = os.path.join(out_path, 'initial_positions_NP.csv')
merged_df.to_csv(output_file,index=False)

# Adding This to PyLag Outputs

In [16]:
# Sort key: the integer sequence number at the end of the filename
def sort_key(file):
    filename = os.path.basename(file)
    try:
        # Extract the number after the last underscore and before the `.nc` extension
        number = int(filename.split('_')[-1].split('.')[0])
        return number
    except (IndexError, ValueError):
        # Filenames that do not match the pattern get an infinite key, which places them last
        return float('inf')
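
A short usage sketch of this key function follows. It is repeated inside the block so the example runs on its own. The directory `/data` and the `OctNov_10` and `notes.txt` entries are hypothetical, added only to show how a two-digit sequence number and a non-matching name are handled.

```python
import os


def sort_key(file):
    filename = os.path.basename(file)
    try:
        # Trailing integer after the last underscore, before the extension
        return int(filename.split('_')[-1].split('.')[0])
    except (IndexError, ValueError):
        return float('inf')


files = [
    "/data/updated_FVCOM_Huron_2323_OctNov_10.nc",  # hypothetical name
    "/data/updated_FVCOM_Huron_2323_FebMar_2.nc",
    "/data/updated_FVCOM_Huron_2323_JanFeb_1.nc",
    "/data/notes.txt",                               # no trailing number, sorts last
]

ordered = [os.path.basename(f) for f in sorted(files, key=sort_key)]
print(ordered)
```

Sorting on the parsed integer keeps file 10 after file 2, which a sort on the numeric suffix as text would not.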

In [17]:
# Set the directory of the FVCOM model outputs
FVCOM_dir = '/home/abolmaal/modelling/FVCOM/Huron/output'
# Collect the updated NetCDF files and sort them by their sequence number
updated_files = glob.glob(os.path.join(FVCOM_dir, "updated_FVCOM_Huron_2323_*.nc"))
updated_files.sort(key=sort_key)

In [18]:
ds = xr.open_dataset(updated_files[0])
# Convert the NetCDF 'time' variable to pandas datetime format
time_vals = pd.to_datetime(ds['time'].values)
df_time = pd.DataFrame({'time': time_vals})
# Extract string-format date and compute unique day counts per month
df_time['date'] = df_time['time'].dt.strftime('%Y-%m-%d')
df_time['month'] = df_time['time'].dt.to_period('M')
unique_days_per_month = df_time.groupby('month')['date'].nunique().reset_index()
# Total number of unique days in the file (used in mass calculation)
total_days = unique_days_per_month['date'].sum()

In [19]:
ds

In [8]:
print(f"Total number of unique days in the file: {total_days}")

Total number of unique days in the file: 60
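
The day-counting step can be tried without a NetCDF file. The sketch below applies the same pandas operations to synthetic six-hourly timestamps that straddle a month boundary. The timestamps are made up for illustration.

```python
import pandas as pd

# Synthetic timestamps: 16 steps of 6 hours starting 2023-01-30 (covers 4 calendar days)
times = pd.date_range("2023-01-30", periods=16, freq=pd.Timedelta(hours=6))
df_time = pd.DataFrame({"time": times})

# Calendar date and month of each time step
df_time["date"] = df_time["time"].dt.normalize()
df_time["month"] = df_time["time"].dt.to_period("M")

# Unique days per month, then the total over the whole series
days_per_month = df_time.groupby("month")["date"].nunique()
total_days = int(days_per_month.sum())

print(days_per_month)
print(total_days)
```

Here two days fall in January and two in February, so the total is 4 even though the series has 16 time steps.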


# Adding TN and TP to Particle Tracking Outputs

✅ Remarks:

This function assigns nutrient loads from stream watersheds to individual particles, ensuring that:

- The load is divided evenly across the particles in each group.
- The final output includes the TN/TP flux and the per-particle load.
- The updated file is saved with the new nutrient tracking fields.

In [20]:
def update_nc_with_mass_flux(nc_file, merge_df, output_path):
    """
    Processes a single NetCDF particle tracking file:
    - Reads the file and extracts 'group_id' and 'particle_id'
    - Merges TN and TP load data per 'group_id' from `merge_df`
    - Calculates the per-particle TN and TP load by dividing each group's load
      by the number of particles of that group in this file
    - Saves the updated NetCDF file with new variables
    """

    # Step 1: Load the NetCDF File
    ds = xr.open_dataset(nc_file)

    # Convert 'time' to datetime and count unique calendar days
    # (note: unique_days is computed here but not used in the Step 5 calculation)
    time_vals = pd.to_datetime(ds['time'].values)
    unique_days = time_vals.normalize().nunique()

    # Step 2: Convert dataset to DataFrame
    df = ds[['group_id', 'particle_id']].to_dataframe().reset_index()

    # Step 3: Count how many times each group_id appears (in this file only)
    group_id_counts = df['group_id'].value_counts().reset_index()
    group_id_counts.columns = ['group_id', 'group_id_occurrence']
    df = df.merge(group_id_counts, on='group_id', how='left')

    # Step 4: Merge external nutrient flux values for this file
    flux_subset = merge_df[['group_id', 'StreamDirectTN_kgday', 'StreamDirectTP_kgday']].drop_duplicates()
    df = df.merge(flux_subset, on='group_id', how='left')

    # Step 5: Calculate the per-particle flux (kg/day per particle) for this specific file
    df['StreamTN_kgdayparticle'] = df['StreamDirectTN_kgday'] / df['group_id_occurrence']
    df['StreamTP_kgdayparticle'] = df['StreamDirectTP_kgday'] / df['group_id_occurrence']

    # Step 6: Sanity check
    assert len(df) == ds.sizes['particles'], "Mismatch between particle count in NetCDF and DataFrame."

    # Step 7: Write results back into the NetCDF
    ds['StreamDirectTN_kgday'] = (('particles'), df['StreamDirectTN_kgday'].values)
    ds['StreamDirectTP_kgday'] = (('particles'), df['StreamDirectTP_kgday'].values)
    ds['StreamTN_kgdayparticle'] = (('particles'), df['StreamTN_kgdayparticle'].values)
    ds['StreamTP_kgdayparticle'] = (('particles'), df['StreamTP_kgdayparticle'].values)

    # Step 8: Save output and release the file handle
    filename = os.path.basename(nc_file)
    updated_file = os.path.join(output_path, f"particleload_{filename}")
    ds.to_netcdf(updated_file)
    ds.close()
    print(f"✅ File saved: {updated_file}")
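
One property worth checking after an even split is that the per-particle values of a group sum back to the group's load. The sketch below reproduces the count-and-divide logic in plain Python, using `collections.Counter` in place of `value_counts` and `merge`. The loads are the first two rows of the zonal statistics table, and the particle counts are made up for illustration.

```python
import math
from collections import Counter

# group_id of each particle (toy release: 3 particles in group 0, 2 in group 1)
particle_groups = [0, 0, 0, 1, 1]
# Daily TN load per group in kg/day (first two rows of the zonal statistics table)
tn_kgday = {0: 13.480809, 1: 51.878556}

# Number of particles per group, then an even split of each group's load
counts = Counter(particle_groups)
per_particle = [tn_kgday[g] / counts[g] for g in particle_groups]

# Sum the per-particle values back up by group
totals = {}
for g, value in zip(particle_groups, per_particle):
    totals[g] = totals.get(g, 0.0) + value

# Each group's total should match its original load
for g, load in tn_kgday.items():
    assert math.isclose(totals[g], load)
print(per_particle)
```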


In [21]:
for nc in updated_files:  # list of your NetCDF files
    update_nc_with_mass_flux(nc, merged_df, FVCOM_dir)

✅ File saved: /home/abolmaal/modelling/FVCOM/Huron/output/particleload_updated_FVCOM_Huron_2323_JanFeb_1.nc
✅ File saved: /home/abolmaal/modelling/FVCOM/Huron/output/particleload_updated_FVCOM_Huron_2323_FebMar_2.nc
✅ File saved: /home/abolmaal/modelling/FVCOM/Huron/output/particleload_updated_FVCOM_Huron_2323_MarApr_3.nc
✅ File saved: /home/abolmaal/modelling/FVCOM/Huron/output/particleload_updated_FVCOM_Huron_2323_AprMay_4.nc
✅ File saved: /home/abolmaal/modelling/FVCOM/Huron/output/particleload_updated_FVCOM_Huron_2323_MayJun_5.nc
✅ File saved: /home/abolmaal/modelling/FVCOM/Huron/output/particleload_updated_FVCOM_Huron_2323_JunJul_6.nc
✅ File saved: /home/abolmaal/modelling/FVCOM/Huron/output/particleload_updated_FVCOM_Huron_2323_JulAug_7.nc
✅ File saved: /home/abolmaal/modelling/FVCOM/Huron/output/particleload_updated_FVCOM_Huron_2323_AugSep_8.nc
✅ File saved: /home/abolmaal/modelling/FVCOM/Huron/output/particleload_updated_FVCOM_Huron_2323_SepOct_9.nc
✅ File saved: /home/abolmaal

In [24]:
mass_updated = glob.glob(FVCOM_dir + "/particleload_updated_FVCOM_Huron_2323_*.nc")
mass_updated.sort(key=sort_key)


In [25]:
# Open the dataset
ds = xr.open_dataset(mass_updated[0])

In [26]:
ds

In [37]:
# Convert ds to a pandas DataFrame
ds_df = ds.to_dataframe().reset_index()
# Count how many particles belong to each group_id (uses `df` from earlier in the notebook, not `ds_df`)
group_id_counts = df['group_id'].value_counts().reset_index()
print(f"Number of unique group_id where group_id== 0: {group_id_counts}")
# Print particle_id together with the total and per-particle TN/TP columns
print(ds_df[['group_id','particle_id','StreamDirectTN_kgday','StreamDirectTP_kgday' ,'StreamTN_kgdayparticle', 'StreamTP_kgdayparticle']].head())


Number of unique group_id where group_id== 0:      group_id  count
0           0     75
1         116     75
2         113     75
3         112     75
4         111     75
..        ...    ...
109        50     75
110        49     75
111        48     75
112        45     75
113       162     75

[114 rows x 2 columns]
   group_id particle_id  StreamDirectTN_kgday  StreamDirectTP_kgday  \
0         0   000002301             13.480809              0.974675   
1         0   000012301             13.480809              0.974675   
2         0   000022301             13.480809              0.974675   
3         0   000032301             13.480809              0.974675   
4         0   000042301             13.480809              0.974675   

   StreamTN_kgdayparticle  StreamTP_kgdayparticle  
0                0.179744                0.012996  
1                0.179744                0.012996  
2                0.179744                0.012996  
3                0.179744                0.

In [46]:
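# Convert to DataFrame
# Note: massParticleTN and massParticleTP are not written by update_nc_with_mass_flux as defined above
df = ds[['group_id', 'particle_id', 'StreamDirectTN_kgday', 'StreamDirectTP_kgday',
         'massParticleTN', 'massParticleTP']].to_dataframe().reset_index()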

# Preview a few rows
print(df.head(10))

   particles  group_id particle_id  StreamDirectTN_kgday  \
0          0         0   000002301             13.480809   
1          1         0   000012301             13.480809   
2          2         0   000022301             13.480809   
3          3         0   000032301             13.480809   
4          4         0   000042301             13.480809   
5          5         0   000052301             13.480809   
6          6         0   000062301             13.480809   
7          7         0   000072301             13.480809   
8          8         0   000082301             13.480809   
9          9         0   000092301             13.480809   

   StreamDirectTP_kgday  massParticleTN  massParticleTP  
0              0.974675       10.784647         0.77974  
1              0.974675       10.784647         0.77974  
2              0.974675       10.784647         0.77974  
3              0.974675       10.784647         0.77974  
4              0.974675       10.784647         0

In [47]:
# save the DataFrame to a CSV file
output_file = os.path.join(out_path, 'mass_flux_data.csv')
df.to_csv(output_file, index=False)

In [33]:
# Read the NetCDF file
ds = xr.open_dataset(mass_updated[0])
# Read the variable massParticleTN
massParticleTN = ds['massParticleTN'].values
# Read the variable StreamDirectTN_kgday
StreamDirectTN_kgday = ds['StreamDirectTN_kgday'].values
StreamDirectTN_kgday


array([1.34808094e+01, 1.34808094e+01, 1.34808094e+01, ...,
       2.87522633e+04, 2.87522633e+04, 2.87522633e+04])

In [34]:
massParticleTN

array([1.07846475e+01, 1.07846475e+01, 1.07846475e+01, ...,
       2.30018107e+04, 2.30018107e+04, 2.30018107e+04])
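
The printed values are consistent with `massParticleTN` being the daily load multiplied by the 60 unique days in the file and divided by the 75 particles per group. This relation is inferred from the numbers shown above, not taken from the notebook's code. A quick check:

```python
import math

total_days = 60           # unique days reported for the first file
particles_per_group = 75  # count reported for every group_id

# (StreamDirectTN_kgday, massParticleTN) pairs copied from the printed arrays
pairs = [
    (1.34808094e+01, 1.07846475e+01),
    (2.87522633e+04, 2.30018107e+04),
]

for flux_kgday, mass in pairs:
    expected = flux_kgday * total_days / particles_per_group
    print(f"{expected:.7e} vs {mass:.7e}")
    # Agreement to within the precision of the printed values
    assert math.isclose(expected, mass, rel_tol=1e-7)
```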