Script Description:
This script reads the GPP_NT and RECO_NT columns from the individual NOBV CSV files, computes the daily averages of these features, reads the pre-processed complete dataset, merges the two columns into it, and exports the merged dataset.

File Name: 01_02_Filter_Merge_EC_Tower_Data_GPP.ipynb

Date: 2025

Created by: Rob Alamgir

Version: 1.0

References:

#### Import the relevant packages

In [1]:
import os
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

#### Extract and merge data from all the individual CSV files

In [2]:
directory_path = 'C:/Data_MSc_Thesis/EC_Tower_Data/GPP_RECO_Rob'              # Directory holding the per-site CSV files

# Keep regular files only (skip sub-directories); sort so the processing order is reproducible
files = sorted(f for f in os.listdir(directory_path)
               if os.path.isfile(os.path.join(directory_path, f)))

files_with_data = []                                                          # Files that contain at least one row
files_without_data = []                                                       # Files that contain no rows
data_list = []                                                                # One DataFrame per non-empty file

# Loop through each file
for file in files:
    file_path = os.path.join(directory_path, file)
    df = pd.read_csv(file_path, low_memory=False)                            # Read the CSV file with all columns
    # Check if there is any data in the DataFrame
    if not df.empty:
        df['location'] = file                                                # Add the file name as a new column
        data_list.append(df)                                                 # Append the full DataFrame to the list
        files_with_data.append(file)
    else:
        files_without_data.append(file)                                      # If the file is empty, add it to files_without_data
# Combine all data into a single DataFrame if there's any data
if data_list:
    GPP_RECO_df = pd.concat(data_list, ignore_index=True)                       # Combine all the data into a single DataFrame
    print("Data successfully extracted!")
else:
    GPP_RECO_df = pd.DataFrame()                                                # Empty placeholder so the summary below does not raise a NameError
    print("No data available to merge.")

print("\nFiles with data:")
print(files_with_data)
print("\nFiles without data:")
print(files_without_data)
print(GPP_RECO_df.info())

Data successfully extracted!

Files with data:
['ALB_MS_GPP_RECO.csv', 'ALB_RF_GPP_RECO.csv', 'AMM_GPP_RECO.csv', 'AMR_GPP_RECO.csv', 'BUO_GPP_RECO.csv', 'BUW_GPP_RECO.csv', 'HOC_GPP_RECO.csv', 'HOH_GPP_RECO.csv', 'LDC_GPP_RECO.csv', 'LDH_GPP_RECO.csv']

Files without data:
[]
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 425510 entries, 0 to 425509
Data columns (total 7 columns):
 #   Column      Non-Null Count   Dtype  
---  ------      --------------   -----  
 0   Unnamed: 0  425510 non-null  int64  
 1   datetime    425510 non-null  object 
 2   day         425510 non-null  object 
 3   hour        425510 non-null  float64
 4   location    425510 non-null  object 
 5   GPP_NT      372236 non-null  float64
 6   RECO_NT     424916 non-null  float64
dtypes: float64(3), int64(1), object(3)
memory usage: 22.7+ MB
None
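
The summary above shows that GPP_NT has fewer non-null values than RECO_NT. To see how those gaps are distributed across the source files, the share of missing values per file can be tabulated. The following is a minimal sketch on invented toy data; `toy_df` and its values are assumptions for illustration and are not taken from the NOBV files.

```python
import numpy as np
import pandas as pd

# Toy stand-in for GPP_RECO_df; the values are invented for illustration only.
toy_df = pd.DataFrame({
    "location": ["ALB_MS_GPP_RECO.csv"] * 4 + ["AMM_GPP_RECO.csv"] * 4,
    "GPP_NT":  [np.nan, np.nan, 1.2, 3.4, 0.5, np.nan, 2.0, 2.5],
    "RECO_NT": [2.0, 2.1, 2.2, np.nan, 1.0, 1.1, 1.2, 1.3],
})

# Fraction of missing values in each flux column, grouped by source file.
missing_share = (
    toy_df[["GPP_NT", "RECO_NT"]]
    .isna()                          # True where a value is missing
    .groupby(toy_df["location"])     # group the boolean mask by file name
    .mean()                          # mean of booleans = share of missing rows
)
print(missing_share)
```

Applied to the real data, the same expression with `GPP_RECO_df` in place of `toy_df` would give one row per CSV file.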


In [3]:
GPP_RECO_df.head(15)
#GPP_RECO_df.tail(15)

Unnamed: 0.1,Unnamed: 0,datetime,day,hour,location,GPP_NT,RECO_NT
0,0,2022-01-05 00:00:00+01:00,2022-01-05 00:00:00+01:00,0.0,ALB_MS_GPP_RECO.csv,,
1,1,2022-01-05 00:30:00+01:00,2022-01-05 00:00:00+01:00,0.5,ALB_MS_GPP_RECO.csv,,
2,2,2022-01-05 01:00:00+01:00,2022-01-05 00:00:00+01:00,1.0,ALB_MS_GPP_RECO.csv,,
3,3,2022-01-05 01:30:00+01:00,2022-01-05 00:00:00+01:00,1.5,ALB_MS_GPP_RECO.csv,,
4,4,2022-01-05 02:00:00+01:00,2022-01-05 00:00:00+01:00,2.0,ALB_MS_GPP_RECO.csv,,
5,5,2022-01-05 02:30:00+01:00,2022-01-05 00:00:00+01:00,2.5,ALB_MS_GPP_RECO.csv,,
6,6,2022-01-05 03:00:00+01:00,2022-01-05 00:00:00+01:00,3.0,ALB_MS_GPP_RECO.csv,,
7,7,2022-01-05 03:30:00+01:00,2022-01-05 00:00:00+01:00,3.5,ALB_MS_GPP_RECO.csv,,
8,8,2022-01-05 04:00:00+01:00,2022-01-05 00:00:00+01:00,4.0,ALB_MS_GPP_RECO.csv,,
9,9,2022-01-05 04:30:00+01:00,2022-01-05 00:00:00+01:00,4.5,ALB_MS_GPP_RECO.csv,,


#### Dataframe pre-processing

In [4]:
# Convert 'datetime' to timezone-naive timestamps (the UTC offset is dropped, the wall-clock time is kept)
if pd.api.types.is_datetime64_any_dtype(GPP_RECO_df['datetime']):
    GPP_RECO_df['datetime'] = GPP_RECO_df['datetime'].dt.tz_localize(None)
else:
    GPP_RECO_df['datetime'] = pd.to_datetime(GPP_RECO_df['datetime'], errors='coerce').dt.tz_localize(None)

# Remove unnecessary columns
GPP_RECO_df = GPP_RECO_df.drop(columns=['Unnamed: 0','day'])
GPP_RECO_df['date'] = GPP_RECO_df['datetime'].dt.date                             # Extract only the date part (removes time)
GPP_RECO_df['location'] = GPP_RECO_df['location'].str.replace('_GPP_RECO.csv', '', regex=False)   # Reduce the file name to the site code
GPP_RECO_df.rename(columns={"location": "Source"}, inplace=True)                  # Match the key column name used in the complete dataset
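
The timestamps in the CSV files carry a UTC offset such as `+01:00`. If a file were to mix offsets (for example winter and summer time), the parsing behaviour of `pd.to_datetime` may differ between pandas versions. One alternative, sketched below, strips the offset text before parsing so that the wall-clock time is kept as written. The second timestamp is a hypothetical summer-time row, and the sketch assumes that the wall-clock time is what the complete dataset uses as its key.

```python
import pandas as pd

raw = pd.Series([
    "2022-01-05 00:30:00+01:00",
    "2022-07-01 12:00:00+02:00",   # hypothetical row with a different offset
])

# Remove a trailing "+HH:MM" or "-HH:MM" offset, then parse the remainder.
wall_clock = pd.to_datetime(
    raw.str.replace(r"[+-]\d{2}:\d{2}$", "", regex=True),
    format="%Y-%m-%d %H:%M:%S",
    errors="coerce",               # unparseable strings become NaT
)
print(wall_clock)
```

The result is timezone-naive, which is the same form the cell above produces with `.dt.tz_localize(None)`.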

In [14]:
print(GPP_RECO_df.info()) 
#GPP_RECO_df.head(15)
#GPP_RECO_df.tail(15)
#GPP_RECO_df.describe()

#Check for Missing or Null Values
#missing_values = GPP_RECO_df['datetime'].isnull().sum()
#print(f"Missing values: {missing_values}")

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 425510 entries, 0 to 425509
Data columns (total 6 columns):
 #   Column    Non-Null Count   Dtype         
---  ------    --------------   -----         
 0   datetime  425510 non-null  datetime64[ns]
 1   hour      425510 non-null  float64       
 2   Source    425510 non-null  object        
 3   GPP_NT    372236 non-null  float64       
 4   RECO_NT   424916 non-null  float64       
 5   date      425510 non-null  object        
dtypes: datetime64[ns](1), float64(3), object(2)
memory usage: 19.5+ MB
None
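
The script description mentions daily averages of the two features. With the `Source` and `date` columns listed above, one way to compute them is a grouped mean. This is a minimal sketch on invented toy data, not necessarily the author's method; `toy` and `daily_mean` are hypothetical names.

```python
import pandas as pd

# Toy half-hourly records for one site; values are invented for illustration.
toy = pd.DataFrame({
    "datetime": pd.to_datetime([
        "2022-01-05 00:00:00", "2022-01-05 00:30:00",
        "2022-01-06 00:00:00", "2022-01-06 00:30:00",
    ]),
    "Source": ["ALB_MS"] * 4,
    "GPP_NT": [1.0, 3.0, 2.0, None],     # None becomes NaN and is skipped by mean()
    "RECO_NT": [2.0, 4.0, 6.0, 8.0],
})
toy["date"] = toy["datetime"].dt.date    # same date extraction as in the notebook

# Average both flux columns per site and calendar day.
daily_mean = (
    toy.groupby(["Source", "date"], as_index=False)[["GPP_NT", "RECO_NT"]]
    .mean()
)
print(daily_mean)
```

Note that `mean()` ignores missing values by default, so a day with only a few valid half-hours still yields an average; whether that is acceptable depends on the analysis.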


#### Import the complete dataset into which GPP_NT & RECO_NT will be merged

In [12]:
#Load and preprocess data
data_path = "C:/Data_MSc_Thesis/Pre_Processed_Data_Final/Pre_Processed_Data_All_Locations_V1.csv"
complete_dataset = pd.read_csv(data_path, low_memory=False)
complete_dataset['datetime'] = pd.to_datetime(complete_dataset['datetime'], errors='coerce', format='%Y-%m-%d %H:%M:%S')

In [15]:
print(complete_dataset.info()) 
#complete_dataset.head(15)
#complete_dataset.tail(15)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 425308 entries, 0 to 425307
Data columns (total 41 columns):
 #   Column      Non-Null Count   Dtype         
---  ------      --------------   -----         
 0   datetime    425299 non-null  datetime64[ns]
 1   DOY         425200 non-null  object        
 2   daytime     141704 non-null  object        
 3   Source      425308 non-null  object        
 4   SWCT_1_005  390173 non-null  float64       
 5   SWCT_1_015  405102 non-null  float64       
 6   SWCT_1_025  405250 non-null  float64       
 7   SWCT_1_035  405140 non-null  float64       
 8   SWCT_1_045  405447 non-null  float64       
 9   SWCT_1_055  405289 non-null  float64       
 10  SWCT_1_065  405442 non-null  float64       
 11  SWCT_1_075  405565 non-null  float64       
 12  SWCT_1_085  405564 non-null  float64       
 13  SWCT_1_095  391620 non-null  float64       
 14  SWCT_1_105  405528 non-null  float64       
 15  SWCT_1_115  405570 non-null  float64       
 16  ST

#### Merge the two dataframes based on 'datetime' and 'Source'

In [17]:
# Merge only the GPP_NT and RECO_NT columns based on datetime and Source
merged_df = complete_dataset.merge(
    GPP_RECO_df[['datetime', 'Source', 'GPP_NT', 'RECO_NT']], 
    on=['datetime', 'Source'], 
    how='left'
)
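
A left merge keeps every row of the left table, but duplicated keys on the right side would silently multiply rows. The sketch below shows how `validate` and `indicator`, both standard `DataFrame.merge` parameters, can be used to check this on invented toy data; `left`, `right` and `checked` are hypothetical names.

```python
import pandas as pd

# Toy stand-ins for complete_dataset (left) and GPP_RECO_df (right).
left = pd.DataFrame({
    "datetime": pd.to_datetime(["2022-01-05 00:00", "2022-01-05 00:30", "2022-01-05 01:00"]),
    "Source": ["ALB_MS", "ALB_MS", "ALB_MS"],
    "SWCT_1_005": [0.41, 0.42, 0.43],
})
right = pd.DataFrame({
    "datetime": pd.to_datetime(["2022-01-05 00:00", "2022-01-05 00:30"]),
    "Source": ["ALB_MS", "ALB_MS"],
    "GPP_NT": [1.5, 1.7],
    "RECO_NT": [2.5, 2.7],
})

keys = ["datetime", "Source"]

# Count key combinations that occur more than once on the right side.
n_duplicate_keys = int(right.duplicated(subset=keys).sum())

# validate="many_to_one" raises a MergeError if the right keys are not unique;
# indicator=True adds a "_merge" column telling where each row found a match.
checked = left.merge(right, on=keys, how="left",
                     validate="many_to_one", indicator=True)
match_counts = checked["_merge"].value_counts()
print(n_duplicate_keys)
print(match_counts)
```

Rows labelled `left_only` are timestamps in the complete dataset that have no GPP/RECO record; their `GPP_NT` and `RECO_NT` values are NaN after the merge.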

In [18]:
print(merged_df.info()) 
#merged_df.head(15)
#merged_df.tail(15)
#merged_df.dtypes
#merged_df.columns.tolist()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 425308 entries, 0 to 425307
Data columns (total 43 columns):
 #   Column      Non-Null Count   Dtype         
---  ------      --------------   -----         
 0   datetime    425299 non-null  datetime64[ns]
 1   DOY         425200 non-null  object        
 2   daytime     141704 non-null  object        
 3   Source      425308 non-null  object        
 4   SWCT_1_005  390173 non-null  float64       
 5   SWCT_1_015  405102 non-null  float64       
 6   SWCT_1_025  405250 non-null  float64       
 7   SWCT_1_035  405140 non-null  float64       
 8   SWCT_1_045  405447 non-null  float64       
 9   SWCT_1_055  405289 non-null  float64       
 10  SWCT_1_065  405442 non-null  float64       
 11  SWCT_1_075  405565 non-null  float64       
 12  SWCT_1_085  405564 non-null  float64       
 13  SWCT_1_095  391620 non-null  float64       
 14  SWCT_1_105  405528 non-null  float64       
 15  SWCT_1_115  405570 non-null  float64       
 16  ST

In [19]:
columns_of_interest = ['GPP_NT', 'RECO_NT']  
print(merged_df[columns_of_interest].describe())

              GPP_NT        RECO_NT
count  372314.000000  425010.000000
mean        6.263830       6.154915
std         9.801603       3.904230
min       -48.126830       0.424153
25%        -0.070678       2.822918
50%         1.689503       5.247238
75%        10.602459       8.944092
max        60.494108      27.719204


In [20]:
# Export the final dataframe to a CSV file
output_path = "C:/Data_MSc_Thesis/Pre_Processed_Data_Final/Pre_Processed_Data_All_Locations_V2.csv"  # Update the path as needed
merged_df.to_csv(output_path, index=False)

print(f"DataFrame successfully saved to {output_path}")

DataFrame successfully saved to C:/Data_MSc_Thesis/Pre_Processed_Data_Final/Pre_Processed_Data_All_Locations_V2.csv
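
After exporting, a quick round-trip check can confirm that the file reads back with the same shape and columns. The sketch below uses a temporary directory and invented toy data rather than the thesis paths; `toy` and `reloaded` are hypothetical names.

```python
import os
import tempfile

import numpy as np
import pandas as pd

# Toy frame with the same kind of columns as merged_df; values are invented.
toy = pd.DataFrame({
    "datetime": pd.to_datetime(["2022-01-05 00:00:00", "2022-01-05 00:30:00"]),
    "Source": ["ALB_MS", "ALB_MS"],
    "GPP_NT": [1.5, np.nan],
    "RECO_NT": [2.5, 2.7],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    tmp_path = os.path.join(tmp_dir, "toy_export.csv")
    toy.to_csv(tmp_path, index=False)
    # parse_dates restores the datetime dtype, which a CSV file cannot store.
    reloaded = pd.read_csv(tmp_path, parse_dates=["datetime"])

print(reloaded.shape == toy.shape)
print(list(reloaded.columns) == list(toy.columns))
```

Without `parse_dates`, the `datetime` column would come back as plain text, which is why the loading cell earlier converts it explicitly.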
