# Water Quality Prediction: Benchmark Notebook 

## Challenge Overview

<p align="justify">
Welcome to the EY AI & Data Challenge 2026!  
The objective of this challenge is to build a robust <strong>machine learning model</strong> capable of predicting water quality across various river locations in South Africa. In addition to producing accurate predictions, the model should identify the key factors that most strongly influence water quality.
</p>

<p align="justify">
Participants will be provided with a dataset containing three water quality parameters (<strong>Total Alkalinity</strong>, <strong>Electrical Conductance</strong>, and <strong>Dissolved Reactive Phosphorus</strong>) collected between 2011 and 2015 from approximately 200 river locations across South Africa. Each data point includes the geographic coordinates (latitude and longitude) of the sampling site, the date of collection, and the corresponding water quality measurements.
</p>

<p align="justify">
Using this dataset, participants are expected to build a machine learning model to predict water quality parameters for a separate validation dataset, which includes locations from different regions not present in the training data. The challenge also encourages participants to explore feature importance and provide insights into the factors most strongly associated with variations in water quality.
</p>

<p align="justify">
This challenge is designed for participants with varying levels of experience in data science, remote sensing, and environmental analytics. It offers a valuable opportunity to apply machine learning techniques to real-world environmental data and contribute to advancing water quality monitoring using artificial intelligence.
</p>


<b>About the Notebook:</b>

<p align="justify"> In this notebook, we demonstrate a basic workflow that serves as a foundation for the challenge. The model has been developed to predict <b>water quality parameters</b> using features derived from the <b>Landsat</b> and <b>TerraClimate</b> datasets. Specifically, four spectral bands—<b>SWIR22</b> (Shortwave Infrared 2), <b>NIR</b> (Near Infrared), <b>Green</b>, and <b>SWIR16</b> (Shortwave Infrared 1)—were utilized from Landsat, along with derived spectral indices such as <b>NDMI</b> (Normalized Difference Moisture Index) and <b>MNDWI</b> (Modified Normalized Difference Water Index). In addition, the <b>PET</b> (Potential Evapotranspiration) variable was incorporated from the <b>TerraClimate</b> dataset to account for climatic influences on water quality. </p> 

<p align="justify"> The dataset spans a five-year period from <b>2011 to 2015</b>. Using <b>API-based data extraction</b> methods, both Landsat and TerraClimate features were retrieved directly from the <a href="https://planetarycomputer.microsoft.com/">Microsoft Planetary Computer</a> portal. These combined spectral, index-based, and climatic features were used as predictors in a regression model to estimate three key water quality parameters: <b>Total Alkalinity (TA)</b>, <b>Electrical Conductance (EC)</b>, and <b>Dissolved Reactive Phosphorus (DRP)</b>. </p>

<p align="justify"> Please note that this notebook serves only as a starting point. Several assumptions were made during the data extraction and model development process, which you may find opportunities to improve upon. Participants are encouraged to explore additional features, enhance preprocessing techniques, or experiment with different regression algorithms to optimize predictive performance. </p>

## Load In Dependencies

To run this demonstration notebook, the packages imported below must be installed. Installation may take some time.

In [1]:
# Suppress warnings
import warnings
warnings.filterwarnings('ignore')

# Visualization libraries
import matplotlib.pyplot as plt
import seaborn as sns

# Data manipulation and analysis
import numpy as np
import pandas as pd

# Multi-dimensional arrays and datasets (e.g., NetCDF, Zarr)
import xarray as xr

# Geospatial raster data handling with CRS support
import rioxarray as rxr

# Raster operations and spatial windowing
import rasterio
from rasterio.windows import Window

# Feature preprocessing and data splitting
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from scipy.spatial import cKDTree

# Machine Learning
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.metrics import mean_squared_error

# Planetary Computer tools for STAC API access and authentication
import pystac_client
import planetary_computer as pc
from odc.stac import stac_load
from pystac.extensions.eo import EOExtension as eo

from datetime import date
from tqdm import tqdm
import os 

## Response Variable

<p align="justify">
Before building the model, we first load the <b>water quality training dataset</b>. The curated dataset contains samples collected from various monitoring stations across the study region. Each record includes the geographical coordinates (Latitude and Longitude), the sample collection date, and the corresponding <b>measured values</b> for the three key water quality parameters — <b>Total Alkalinity (TA)</b>, <b>Electrical Conductance (EC)</b>, and <b>Dissolved Reactive Phosphorus (DRP)</b>.
</p>


In [2]:
Water_Quality_df=pd.read_csv('water_quality_training_dataset.csv')
Water_Quality_df.head()

Unnamed: 0,Latitude,Longitude,Sample Date,Total Alkalinity,Electrical Conductance,Dissolved Reactive Phosphorus
0,-28.760833,17.730278,02-01-2011,128.912,555.0,10.0
1,-26.861111,28.884722,03-01-2011,74.72,162.9,163.0
2,-26.45,28.085833,03-01-2011,89.254,573.0,80.0
3,-27.671111,27.236944,03-01-2011,82.0,203.6,101.0
4,-27.356667,27.286389,03-01-2011,56.1,145.1,151.0
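
The `Sample Date` strings in the preview appear to be day-first (`DD-MM-YYYY`); a value such as `16-09-2015`, which appears in the submission template shown later in this notebook, can only be read that way. A minimal sketch of parsing the dates explicitly, so that `02-01-2011` is not misread as 1 February, is shown below. The derived `year` and `month` columns are illustrative only and are not part of the benchmark feature set.

```python
import pandas as pd

# Toy rows copied from the previews shown in this notebook
dates = pd.DataFrame({"Sample Date": ["02-01-2011", "03-01-2011", "16-09-2015"]})

# Parse with an explicit day-first format so 02-01-2011 becomes 2 January 2011
dates["Sample Date"] = pd.to_datetime(dates["Sample Date"], format="%d-%m-%Y")

# Illustrative calendar features (hypothetical additions, not used by the benchmark)
dates["year"] = dates["Sample Date"].dt.year
dates["month"] = dates["Sample Date"].dt.month

print(dates)
```

Calendar features like these may help a model pick up seasonal patterns, but whether they improve predictions here would need to be tested.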


## Predictor Variables

<p align="justify">
Now that we have our water quality dataset, the next step is to gather the predictor variables from the <b>Landsat</b> and <b>TerraClimate</b> datasets. In this notebook, we demonstrate how to <b>load previously extracted satellite and climate data</b> from separate files, rather than performing the extraction directly, which allows for a smoother and faster experience. Participants can refer to the dedicated extraction notebooks—one for Landsat and another for TerraClimate—to understand how the data was retrieved and processed, and they can also generate their own output CSV files if needed. Using these pre-extracted CSV files, this notebook focuses on loading the predictor features and running the subsequent analysis and model training efficiently.
</p>
<p align="justify">
For more detailed guidance on the original data extraction process, you can review the <a href="https://planetarycomputer.microsoft.com/dataset/landsat-c2-l2#Example-Notebook">Landsat example notebook</a> and the <a href="https://planetarycomputer.microsoft.com/dataset/terraclimate#Example-Notebook">TerraClimate example notebook</a> available on the Planetary Computer portal.
</p>

<p align="justify">We have used selected spectral bands (SWIR22, NIR, Green, and SWIR16) and computed two spectral indices: NDMI (Normalized Difference Moisture Index) and MNDWI (Modified Normalized Difference Water Index). These features capture surface moisture, vegetation, and water content characteristics that influence water quality variability. </p>
<p align="justify"> In addition to Landsat features, we also incorporated the <b>Potential Evapotranspiration (PET)</b> variable from the <b>TerraClimate</b> dataset, which provides high-resolution global climate data. The PET feature captures the atmospheric demand for moisture, reflecting climatic conditions such as temperature, humidity, and radiation that influence surface water evaporation and thus affect water quality parameters. </p>
<ul>
<li>SWIR22 – Sensitive to surface moisture and turbidity variations in water bodies.</li>
<li>NIR – Helps in identifying vegetation and suspended matter in water.</li>
<li>Green – Useful for detecting water color and surface reflectance changes.</li>
<li>SWIR16 – Provides information on surface dryness and sediment concentration.</li>
<li>NDMI – Derived from NIR and SWIR16, indicates moisture and vegetation-water interaction.</li>
<li>MNDWI – Derived from Green and SWIR16 (the index values in the pre-extracted Landsat CSV are consistent with this band pairing), effective for distinguishing open water areas and reducing built-up noise.</li>
<li>PET – Extracted from the TerraClimate dataset, represents the potential evapotranspiration that influences hydrological and water quality dynamics.</li>
</ul>
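
Both indices are normalized differences of two bands. The sketch below recomputes them for the first two rows of the Landsat preview shown later in this notebook. It assumes the usual definitions, NDMI = (NIR − SWIR16) / (NIR + SWIR16) and MNDWI = (Green − SWIR16) / (Green + SWIR16); with these band pairings the results agree with the `NDMI` and `MNDWI` columns of the pre-extracted CSV. The helper name `normalized_difference` is hypothetical.

```python
import pandas as pd

# First two rows of the Landsat preview shown in this notebook
bands = pd.DataFrame({
    "nir": [11190.0, 17658.5],
    "green": [11426.0, 9550.0],
    "swir16": [7687.5, 13746.5],
})

def normalized_difference(a, b):
    """Return (a - b) / (a + b), the general form shared by both indices."""
    return (a - b) / (a + b)

# NDMI pairs NIR with SWIR16; MNDWI pairs Green with SWIR16 (assumed definitions)
bands["NDMI"] = normalized_difference(bands["nir"], bands["swir16"])
bands["MNDWI"] = normalized_difference(bands["green"], bands["swir16"])

print(bands.round(6))
```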

<h4 style="color:rgb(255, 0, 0)"><strong>Tip 1</strong></h4>  
<p align="justify">  
Participants are encouraged to experiment with different combinations of <b>Landsat</b> bands or even include data from other public satellite data sources. By creating mathematical combinations of bands, you can derive various spectral indices that capture surface and environmental characteristics. 
</p>
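
As one illustration of this tip, the sketch below derives two extra features from the bands that are already in the pre-extracted CSV: NDWI, commonly defined as (Green − NIR) / (Green + NIR), and a simple SWIR band ratio. The column names `NDWI` and `swir_ratio` are hypothetical, and neither feature is part of the benchmark model.

```python
import numpy as np
import pandas as pd

# Toy rows copied from the Landsat preview shown in this notebook
features = pd.DataFrame({
    "nir": [11190.0, 17658.5, 15210.0],
    "green": [11426.0, 9550.0, 10720.0],
    "swir16": [7687.5, 13746.5, 17974.0],
    "swir22": [7645.0, 10574.0, 14201.0],
})

# NDWI = (Green - NIR) / (Green + NIR); a zero denominator becomes NaN instead of inf
denominator = (features["green"] + features["nir"]).replace(0, np.nan)
features["NDWI"] = (features["green"] - features["nir"]) / denominator

# A simple band ratio as a second illustrative feature
features["swir_ratio"] = features["swir16"] / features["swir22"].replace(0, np.nan)

print(features[["NDWI", "swir_ratio"]])
```

Any new feature has to be computed the same way for the validation data before it can be used for a submission.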


<h3>Loading Pre-Extracted Landsat Data</h3>
<p align="justify">
In this notebook, we <b>load previously extracted Landsat data</b> from CSV files generated in a separate extraction notebook. This approach ensures a smoother and faster workflow, allowing participants to focus on data analysis and model development without waiting for time-consuming data retrieval.
</p>
<p align="justify">
Participants are expected to generate their own data extraction CSV files by running the dedicated Landsat extraction notebook. These CSV files can then be used here to smoothly run this benchmark notebook. Participants can refer to the extraction notebook to understand the API-based process, including how individual bands and indices like <b>NDMI</b> were computed. Using these pre-extracted CSV files simplifies preprocessing and is ideal for large-scale environmental and water quality analysis.
</p>


<h4 style="color:rgb(255, 0, 0)"><strong>Tip 2</strong></h4>
In the data extraction process (performed in the dedicated extraction notebooks), a 100 m focal buffer was applied around each sampling location rather than using a single point. Participants may explore different buffer sizes (e.g., 50 m or 150 m) during extraction. For example, if a 50 m buffer is used for "Band 2", the extracted CSV value is the average of "Band 2" within 50 meters of each location. Averaging over a buffer can help reduce noise from individual pixels and from small positional errors in the sampling coordinates.
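
To make the buffer idea concrete, here is a minimal NumPy sketch of averaging a band around one pixel. It is not the extraction notebooks' implementation: it assumes a square 30 m pixel grid, treats the sampling site as the centre of pixel `(row, col)`, and uses a hypothetical helper named `buffer_mean`.

```python
import numpy as np

def buffer_mean(band, row, col, buffer_m=100.0, pixel_m=30.0):
    """Mean of valid pixels whose centres lie within buffer_m of pixel (row, col)."""
    rows, cols = np.indices(band.shape)
    # Distance in metres from each pixel centre to the centre of the target pixel
    dist_m = np.hypot(rows - row, cols - col) * pixel_m
    mask = dist_m <= buffer_m
    # nanmean skips pixels flagged as missing (e.g. cloud-masked)
    return float(np.nanmean(band[mask]))

# Toy 7x7 band with one missing pixel next to the sampling site
band = np.arange(49, dtype=float).reshape(7, 7)
band[3, 4] = np.nan

print(buffer_mean(band, 3, 3, buffer_m=50.0))
print(buffer_mean(band, 3, 3, buffer_m=100.0))
```

Larger buffers smooth more but are also more likely to mix in land pixels next to narrow rivers, so the best size is something to test rather than assume.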


In [3]:
landsat_train_features = pd.read_csv('landsat_features_training.csv')
landsat_train_features.head()

Unnamed: 0,Latitude,Longitude,Sample Date,nir,green,swir16,swir22,NDMI,MNDWI
0,-28.760833,17.730278,02-01-2011,11190.0,11426.0,7687.5,7645.0,0.185538,0.195595
1,-26.861111,28.884722,03-01-2011,17658.5,9550.0,13746.5,10574.0,0.124566,-0.180134
2,-26.45,28.085833,03-01-2011,15210.0,10720.0,17974.0,14201.0,-0.083293,-0.252805
3,-27.671111,27.236944,03-01-2011,14887.0,10943.0,13522.0,11403.0,0.048048,-0.105416
4,-27.356667,27.286389,03-01-2011,16828.5,9502.5,12665.5,9643.0,0.141147,-0.142683


<h3>Loading Pre-Extracted TerraClimate Data</h3>
<p align="justify">
In this notebook, we <b>load previously extracted TerraClimate data</b> from CSV files generated in a dedicated extraction notebook. This approach ensures a smoother and faster workflow, allowing participants to focus on data analysis and model development without waiting for time-consuming data retrieval.
</p>
<p align="justify">
Participants are expected to generate their own data extraction CSV files by running the dedicated TerraClimate extraction notebook. These CSV files can then be used here to smoothly run this benchmark notebook. Participants can refer to the extraction notebook to understand the API-based process, including how climate variables such as <b>Potential Evapotranspiration (PET)</b> were extracted. Using these pre-extracted CSV files ensures consistent, automated retrieval of high-resolution climate data that can be easily integrated with satellite-derived features for comprehensive environmental and hydrological analysis.
</p>


In [4]:
Terraclimate_df = pd.read_csv('terraclimate_features_training.csv')
Terraclimate_df.head()

Unnamed: 0,Latitude,Longitude,Sample Date,pet
0,-28.760833,17.730278,02-01-2011,174.2
1,-26.861111,28.884722,03-01-2011,124.1
2,-26.45,28.085833,03-01-2011,127.5
3,-27.671111,27.236944,03-01-2011,129.7
4,-27.356667,27.286389,03-01-2011,129.2


## Joining the predictor variables and response variables
Now that we have loaded our predictor variables, we need to join them to the response variables. The function <i><b>combine_two_datasets</b></i> does this with the pandas <i><b>concat</b></i> function, placing the three tables side by side.

In [5]:
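# Combine the datasets side by side (column-wise, axis=1) using the pandas concat function.
def combine_two_datasets(dataset1,dataset2,dataset3):
    '''
    Returns a single dataset with the columns of all three inputs placed side by side.
    Rows are aligned by index, so the inputs must share the same row order.
    Duplicated column names (e.g. Latitude, Longitude, Sample Date) are kept only once.
    Parameters:
    dataset1 - Dataset 1 to be combined
    dataset2 - Dataset 2 to be combined
    dataset3 - Dataset 3 to be combined
    '''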
    
    data = pd.concat([dataset1,dataset2,dataset3], axis=1)
    data = data.loc[:, ~data.columns.duplicated()]
    return data

In [6]:
# Combining the ground measurements with the Landsat and TerraClimate features into a single dataset.
wq_data = combine_two_datasets(Water_Quality_df, landsat_train_features, Terraclimate_df)
wq_data.head()

Unnamed: 0,Latitude,Longitude,Sample Date,Total Alkalinity,Electrical Conductance,Dissolved Reactive Phosphorus,nir,green,swir16,swir22,NDMI,MNDWI,pet
0,-28.760833,17.730278,02-01-2011,128.912,555.0,10.0,11190.0,11426.0,7687.5,7645.0,0.185538,0.195595,174.2
1,-26.861111,28.884722,03-01-2011,74.72,162.9,163.0,17658.5,9550.0,13746.5,10574.0,0.124566,-0.180134,124.1
2,-26.45,28.085833,03-01-2011,89.254,573.0,80.0,15210.0,10720.0,17974.0,14201.0,-0.083293,-0.252805,127.5
3,-27.671111,27.236944,03-01-2011,82.0,203.6,101.0,14887.0,10943.0,13522.0,11403.0,0.048048,-0.105416,129.7
4,-27.356667,27.286389,03-01-2011,56.1,145.1,151.0,16828.5,9502.5,12665.5,9643.0,0.141147,-0.142683,129.2
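
The cell above relies on `pd.concat(axis=1)`, which aligns rows by index, so it is only correct when all three files share the same row order. A more defensive sketch, assuming the three key columns identify each sample uniquely, joins on those keys instead. The toy frames below are illustrative; `validate="one_to_one"` makes pandas raise an error if a key is duplicated on either side.

```python
import pandas as pd

keys = ["Latitude", "Longitude", "Sample Date"]

targets = pd.DataFrame({
    "Latitude": [-28.760833, -26.861111],
    "Longitude": [17.730278, 28.884722],
    "Sample Date": ["02-01-2011", "03-01-2011"],
    "Total Alkalinity": [128.912, 74.72],
})

# Deliberately stored in a different row order to show that the join still lines up
climate = pd.DataFrame({
    "Latitude": [-26.861111, -28.760833],
    "Longitude": [28.884722, 17.730278],
    "Sample Date": ["03-01-2011", "02-01-2011"],
    "pet": [124.1, 174.2],
})

# Left join keeps every target row; validate guards against duplicated keys
merged = targets.merge(climate, on=keys, how="left", validate="one_to_one")
print(merged)
```

Joining on floating-point coordinates depends on exact equality, so if the files were produced with different rounding, the keys may need to be rounded to a common precision first.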


<h3>Handling Missing Values</h3>  
<p align="justify">  
Before model training, missing values are filled so that the model receives complete inputs. Each numerical column is imputed with its median, which is less sensitive to outliers than the mean.  
</p>

In [7]:
wq_data = wq_data.fillna(wq_data.median(numeric_only=True))
wq_data.isna().sum()

Latitude                         0
Longitude                        0
Sample Date                      0
Total Alkalinity                 0
Electrical Conductance           0
Dissolved Reactive Phosphorus    0
nir                              0
green                            0
swir16                           0
swir22                           0
NDMI                             0
MNDWI                            0
pet                              0
dtype: int64
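
The cell above computes medians over the whole table. One alternative, sketched here under the assumption that you want held-out data to be filled with values learned from the training rows only, is scikit-learn's `SimpleImputer`. The toy frames and the variable names `train` and `new` are illustrative.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy training rows and toy held-out rows, each with gaps
train = pd.DataFrame({"swir22": [7645.0, np.nan, 14201.0], "pet": [174.2, 124.1, 127.5]})
new = pd.DataFrame({"swir22": [np.nan, 9958.0], "pet": [177.6, np.nan]})

# Learn the per-column medians from the training rows only
imputer = SimpleImputer(strategy="median")
train_filled = pd.DataFrame(imputer.fit_transform(train), columns=train.columns)

# Reuse the training medians for the held-out rows
new_filled = pd.DataFrame(imputer.transform(new), columns=new.columns)

print(train_filled)
print(new_filled)
```

Keeping the imputer fitted on training data avoids letting statistics from held-out rows influence the model inputs.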

## Model Building

<p align="justify"> Now let us select the columns required for model building. We will use only the swir22 band, NDMI and MNDWI from the Landsat data and pet from the TerraClimate dataset as predictor variables. Latitude and longitude are excluded: they serve only as location references, and because the validation locations lie in regions not present in the training data, a model that keys on coordinates is unlikely to transfer well.</p>


In [8]:
# Retaining only the predictor columns (swir22, NDMI, MNDWI, pet) and the three target columns (Total Alkalinity, Electrical Conductance, Dissolved Reactive Phosphorus).
wq_data = wq_data[['swir22','NDMI','MNDWI','pet', 'Total Alkalinity', 'Electrical Conductance', 'Dissolved Reactive Phosphorus']]

<h4 style="color:rgb(255, 0, 0)"><strong>Tip 3</strong></h4>
<p align="justify">We are developing individual models for each water quality parameter using a common set of features: SWIR22, NDMI, MNDWI, and PET. However, participants are encouraged to experiment with different feature combinations to build more robust machine learning models.</p>

## Helper Functions
### Train and Test Split 
<p align="justify">We will now split the data into 70% training data and 30% test data. Scikit-learn (imported as <code>sklearn</code>) is a widely used machine learning library for Python. Its <i><b>model_selection</b></i> module provides the splitting function <i><b>train_test_split</b></i>, which we use here.</p>
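
A random split can place samples from the same river location in both the training and test sets, whereas the challenge's validation data comes from locations not seen in training. The sketch below shows one way to hold out whole locations using scikit-learn's `GroupShuffleSplit`. The data are synthetic and `site_id` is a hypothetical grouping variable; with the real data, one option would be to build it from the `Latitude` and `Longitude` columns before they are dropped.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import GroupShuffleSplit

rng = np.random.default_rng(0)

# Synthetic stand-in: 10 sampling sites with 5 samples each
n_sites, samples_per_site = 10, 5
site_id = np.repeat(np.arange(n_sites), samples_per_site)
X = pd.DataFrame({
    "swir22": rng.normal(size=site_id.size),
    "pet": rng.normal(size=site_id.size),
})
y = pd.Series(rng.normal(size=site_id.size))

# Hold out roughly 30% of the sites (not 30% of the rows)
splitter = GroupShuffleSplit(n_splits=1, test_size=0.3, random_state=42)
train_idx, test_idx = next(splitter.split(X, y, groups=site_id))

X_train, X_test = X.iloc[train_idx], X.iloc[test_idx]
y_train, y_test = y.iloc[train_idx], y.iloc[test_idx]

# No site appears on both sides of the split
train_sites = set(site_id[train_idx])
test_sites = set(site_id[test_idx])
print(sorted(train_sites), sorted(test_sites))
```

A test score obtained this way is likely to be a closer proxy for performance on unseen locations than one from a purely random split.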

### Feature Scaling 

<p align="justify"> Before model training, we may need to apply data pre-processing steps. Here we demonstrate scaling the 'swir22', 'NDMI', 'MNDWI' and 'pet' variables using <code>StandardScaler</code>.</p>

<p align = "justify">Feature scaling is a preprocessing step for numerical features. Many machine learning algorithms, such as gradient-descent-based methods, KNN, and linear and logistic regression, need scaled inputs to produce good results. Scikit-learn provides functions for this; here we use <code>StandardScaler</code>, which transforms each feature so that it has a mean of 0 and a standard deviation of 1. Tree-based models such as the Random Forest used in this notebook are largely insensitive to feature scaling, so the step matters most if you switch to one of the scale-sensitive algorithms above.</p>
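
A small sketch of what standard scaling does: each feature is transformed as z = (x − mean) / std, with the mean and standard deviation taken from the training rows and then reused for the test rows. The toy values are taken from the previews in this notebook.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy rows with two features (swir22, pet)
X_train = np.array([
    [7645.0, 174.2],
    [10574.0, 124.1],
    [14201.0, 127.5],
    [11403.0, 129.7],
])
X_test = np.array([[9643.0, 129.2]])

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learn mean/std from training rows
X_test_scaled = scaler.transform(X_test)        # reuse the same mean/std

print(X_train_scaled.mean(axis=0))  # approximately 0 for each feature
print(X_train_scaled.std(axis=0))   # approximately 1 for each feature
print(X_test_scaled)
```

Fitting the scaler on the training rows only keeps information from the test rows out of the preprocessing step.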

### Model Training
<p align="justify">
Now that we have the data in a format suitable for machine learning, we can begin training our models. In this demonstration notebook, we will build three separate regression models — one for each target water quality parameter: Total Alkalinity, Electrical Conductance, and Dissolved Reactive Phosphorus. 
Each model will be trained independently to capture the unique relationships between the satellite-derived features and each parameter.
</p>

<p align="justify">
We will use the Random Forest Regressor from the scikit-learn library to build our models. Scikit-learn provides a wide range of regression algorithms with extensive parameter tuning and customization capabilities.
</p>

<p align="justify">
For model training, the predictor variables (e.g., SWIR22, NDMI, MNDWI, and pet) will be stored in an array X, and the response variable (one of the three water quality parameters) will be stored in an array Y. 
It is important not to include the response variable in X. Additionally, since latitude, longitude, and sample date are only used for spatial and temporal reference, they will be excluded from the predictor variables during model training.
</p>

### Model Evaluation
<p align="justify">
Now that we have trained our models for the three water quality parameters, the next step is to evaluate their performance. Each regression model for Total Alkalinity, Electrical Conductance, and Dissolved Reactive Phosphorus is assessed using the R² score and the Root Mean Square Error (RMSE). The R² score measures how well the model explains the variance in the observed values, while RMSE quantifies the average magnitude of prediction errors. Together, these metrics help determine how effectively each model captures variations in water quality across different locations and sampling dates. Scikit-learn provides built-in functions to compute these metrics, and participants may explore additional evaluation methods or custom metrics as needed.</p>
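
For reference, both metrics can be computed by hand from the residuals: RMSE is the square root of the mean squared residual, and R² is one minus the ratio of the residual sum of squares to the total sum of squares around the mean of the observed values. The sketch below checks these formulas against scikit-learn on made-up numbers.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Made-up observed and predicted values, for illustration only
y_true = np.array([128.9, 74.7, 89.3, 82.0, 56.1])
y_pred = np.array([120.0, 80.0, 95.0, 70.0, 60.0])

residuals = y_true - y_pred

# RMSE: square root of the mean squared residual
rmse_manual = np.sqrt(np.mean(residuals ** 2))

# R2: 1 - (residual sum of squares) / (total sum of squares)
ss_res = np.sum(residuals ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1.0 - ss_res / ss_tot

rmse_sklearn = np.sqrt(mean_squared_error(y_true, y_pred))
r2_sklearn = r2_score(y_true, y_pred)

print(rmse_manual, rmse_sklearn)
print(r2_manual, r2_sklearn)
```

Because RMSE is expressed in the units of the target, values for the three parameters are not directly comparable with each other, whereas R² is unitless.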


<h4 style="color:rgb(255, 0, 0)"><strong>Tip 4</strong></h4>
<p align="justify">There are many data preprocessing methods available, which might help to improve the model performance. Participants should explore various suitable preprocessing methods as well as different machine learning algorithms to build a robust model.</p>

In [9]:
def split_data(X, y, test_size=0.3, random_state=42):
    return train_test_split(X, y, test_size=test_size, random_state=random_state)

def scale_data(X_train, X_test):
    scaler = StandardScaler()
    X_train_scaled = scaler.fit_transform(X_train)
    X_test_scaled = scaler.transform(X_test)
    return X_train_scaled, X_test_scaled, scaler

def train_model(X_train_scaled, y_train):
    model = RandomForestRegressor(n_estimators=100, random_state=42)
    model.fit(X_train_scaled, y_train)
    return model

def evaluate_model(model, X_scaled, y_true, dataset_name="Test"):
    y_pred = model.predict(X_scaled)
    r2 = r2_score(y_true, y_pred)
    rmse = np.sqrt(mean_squared_error(y_true, y_pred))
    print(f"\n{dataset_name} Evaluation:")
    print(f"R²: {r2:.3f}")
    print(f"RMSE: {rmse:.3f}")
    return y_pred, r2, rmse

<div class="section">
  <h2>Model Workflow (Pipeline)</h2>
  <p align="justify">
    The complete model development process follows a structured pipeline to ensure consistency, reproducibility, and clarity. 
    Each stage in the workflow is modularized into independent functions, which can be reused for different water quality parameters. 
    This modular approach helps streamline the process and makes the workflow easily adaptable to new datasets or parameters in the future.
  </p>

  <p align="justify">
    The pipeline automates the sequence of steps — from data preparation to evaluation — for each target parameter. 
    The same set of predictor variables is used, while the response variable changes for each of the three targets: 
    <i>Total Alkalinity (TA)</i>, <i>Electrical Conductance (EC)</i>, and <i>Dissolved Reactive Phosphorus (DRP)</i>. 
    By maintaining a consistent framework, comparisons across models remain fair and interpretable.
  </p>
</div>


In [10]:
def run_pipeline(X, y, param_name="Parameter"):
    print(f"\n{'='*60}")
    print(f"Training Model for {param_name}")
    print(f"{'='*60}")
    
    # Split data
    X_train, X_test, y_train, y_test = split_data(X, y)
    
    # Scale
    X_train_scaled, X_test_scaled, scaler = scale_data(X_train, X_test)
    
    # Train
    model = train_model(X_train_scaled, y_train)
    
    # Evaluate (in-sample)
    y_train_pred, r2_train, rmse_train = evaluate_model(model, X_train_scaled, y_train, "Train")
    
    # Evaluate (out-of-sample)
    y_test_pred, r2_test, rmse_test = evaluate_model(model, X_test_scaled, y_test, "Test")
    
    # Return summary
    results = {
        "Parameter": param_name,
        "R2_Train": r2_train,
        "RMSE_Train": rmse_train,
        "R2_Test": r2_test,
        "RMSE_Test": rmse_test
    }
    return model, scaler, pd.DataFrame([results])
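
A single 70/30 split gives one estimate of out-of-sample performance, and the results reported later in this notebook show a sizeable gap between train and test R². As a complementary sketch, not a replacement for `run_pipeline`, the same scale-then-fit steps can be wrapped in a scikit-learn `Pipeline` and scored with k-fold cross-validation. The data below are synthetic and the choice of 5 folds is an assumption.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the four predictors and one target
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 4))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.5, size=200)

# Scaling is refitted inside every fold, so no fold sees statistics from its own test rows
pipeline = make_pipeline(
    StandardScaler(),
    RandomForestRegressor(n_estimators=100, random_state=42),
)

cv = KFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(pipeline, X, y, cv=cv, scoring="r2")

print(scores)
print(scores.mean(), scores.std())
```

The spread of the fold scores gives a rough sense of how much a single split's result might vary.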

### Model Training and Evaluation for Each Parameter

<p align="justify">In this step, we apply the complete modeling pipeline to each of the three selected water quality parameters — Total Alkalinity, Electrical Conductance, and Dissolved Reactive Phosphorus. The input feature set (<code>X</code>) remains the same across all three models, while the target variable (<code>y</code>) changes for each parameter. For every parameter, the <code>run_pipeline()</code> function is executed, which handles data preprocessing, model training, and both in-sample and out-of-sample evaluation. This ensures a consistent workflow and allows for a fair comparison of model performance across different water quality indicators.</p>

In [11]:
X = wq_data.drop(columns=['Total Alkalinity', 'Electrical Conductance', 'Dissolved Reactive Phosphorus'])

y_TA = wq_data['Total Alkalinity']
y_EC = wq_data['Electrical Conductance']
y_DRP = wq_data['Dissolved Reactive Phosphorus']

model_TA, scaler_TA, results_TA = run_pipeline(X, y_TA, "Total Alkalinity")
model_EC, scaler_EC, results_EC = run_pipeline(X, y_EC, "Electrical Conductance")
model_DRP, scaler_DRP, results_DRP = run_pipeline(X, y_DRP, "Dissolved Reactive Phosphorus")



Training Model for Total Alkalinity

Train Evaluation:
R²: 0.903
RMSE: 23.132

Test Evaluation:
R²: 0.546
RMSE: 50.870

Training Model for Electrical Conductance

Train Evaluation:
R²: 0.918
RMSE: 98.007

Test Evaluation:
R²: 0.585
RMSE: 219.999

Training Model for Dissolved Reactive Phosphorus

Train Evaluation:
R²: 0.882
RMSE: 17.455

Test Evaluation:
R²: 0.529
RMSE: 35.182


### Model Performance Summary

<p align="justify">After training and evaluating the models for each water quality parameter, the individual performance metrics are combined into a single summary table. This table consolidates the R² and RMSE values for both in-sample and out-of-sample evaluations, enabling an easy comparison of model performance across Total Alkalinity, Electrical Conductance, and Dissolved Reactive Phosphorus. Such a summary provides a quick overview of how well each model captures the variability in the respective parameter and highlights any differences in predictive accuracy.</p>

In [12]:
results_summary = pd.concat([results_TA, results_EC, results_DRP], ignore_index=True)
results_summary


Unnamed: 0,Parameter,R2_Train,RMSE_Train,R2_Test,RMSE_Test
0,Total Alkalinity,0.903199,23.132468,0.545661,50.870362
1,Electrical Conductance,0.917884,98.007101,0.585458,219.998675
2,Dissolved Reactive Phosphorus,0.882169,17.454757,0.529145,35.181776
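
The challenge also asks which factors are most strongly associated with water quality. A minimal sketch of two common ways to inspect this for a Random Forest is given below: the built-in impurity-based `feature_importances_` and scikit-learn's `permutation_importance` on held-out rows. The data are synthetic, with a target deliberately constructed to depend mostly on `pet`, so the ranking shown says nothing about the real dataset.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic features named after the benchmark predictors; the target is made up
rng = np.random.default_rng(0)
feature_names = ["swir22", "NDMI", "MNDWI", "pet"]
X = pd.DataFrame(rng.normal(size=(300, 4)), columns=feature_names)
y = 5.0 * X["pet"] + 1.0 * X["NDMI"] + rng.normal(scale=0.5, size=300)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
model = RandomForestRegressor(n_estimators=100, random_state=42).fit(X_train, y_train)

# Impurity-based importances (computed from the training data, sum to 1)
impurity = pd.Series(model.feature_importances_, index=feature_names)

# Permutation importances: drop in test score when one feature is shuffled
perm = permutation_importance(model, X_test, y_test, n_repeats=10, random_state=42)
permutation = pd.Series(perm.importances_mean, index=feature_names)

summary = pd.DataFrame({"impurity": impurity, "permutation": permutation})
print(summary.sort_values("permutation", ascending=False))
```

With the models trained in this notebook, the same calls could be applied to `model_TA`, `model_EC`, and `model_DRP`, using inputs scaled the same way as during training.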


## Submission

<p align="justify">Once you are satisfied with your model’s performance, you can proceed to make predictions for unseen data. To do this, use your trained models to estimate the values of the target water quality parameters (Total Alkalinity, Electrical Conductance, and Dissolved Reactive Phosphorus) for the set of test locations provided in the "submission_template.csv" file. The predicted results can then be uploaded to the challenge platform for evaluation.</p>

In [13]:
#Reading the coordinates for the submission
test_file = pd.read_csv('submission_template.csv')
test_file.head()

Unnamed: 0,Latitude,Longitude,Sample Date,Total Alkalinity,Electrical Conductance,Dissolved Reactive Phosphorus
0,-32.043333,27.822778,01-09-2014,,,
1,-33.329167,26.0775,16-09-2015,,,
2,-32.991639,27.640028,07-05-2015,,,
3,-34.096389,24.439167,07-02-2012,,,
4,-32.000556,28.581667,01-10-2014,,,


<p align="justify">
Similarly, participants can use the <b>Landsat</b> and <b>TerraClimate</b> data extraction demonstration notebooks to produce feature CSVs for their <b>validation</b> data. For convenience, we have already computed and saved example validation outputs as <code>landsat_features_validation.csv</code> and <code>terraclimate_features_validation.csv</code>, the files loaded in the cells below. Participants should save their own extracted files in the same format and column schema so that this benchmark notebook can load the validation features directly.
</p>


In [14]:
landsat_val_features = pd.read_csv('landsat_features_validation.csv')
landsat_val_features.head()

Unnamed: 0,Latitude,Longitude,Sample Date,nir,green,swir16,swir22,NDMI,MNDWI
0,-32.043333,27.822778,01-09-2014,15229.0,12868.0,14797.0,12421.0,0.014388,-0.069727
1,-33.329167,26.0775,16-09-2015,,,,,,
2,-32.991639,27.640028,07-05-2015,16221.0,9304.5,12536.5,9958.0,0.128123,-0.147979
3,-34.096389,24.439167,07-02-2012,,,,,,
4,-32.000556,28.581667,01-10-2014,9125.0,11100.5,9455.0,8711.0,-0.017761,0.080052


In [15]:
Terraclimate_val_df = pd.read_csv('terraclimate_features_validation.csv')
Terraclimate_val_df.head()

Unnamed: 0,Latitude,Longitude,Sample Date,pet
0,-32.043333,27.822778,01-09-2014,161.90001
1,-33.329167,26.0775,16-09-2015,177.6
2,-32.991639,27.640028,07-05-2015,158.40001
3,-34.096389,24.439167,07-02-2012,130.0
4,-32.000556,28.581667,01-10-2014,152.5


In [16]:
#Consolidate all the extracted bands and features in a single dataframe
val_data = pd.DataFrame({
    'Longitude': landsat_val_features['Longitude'].values,
    'Latitude': landsat_val_features['Latitude'].values,
    'Sample Date': landsat_val_features['Sample Date'].values,
    'nir': landsat_val_features['nir'].values,
    'green': landsat_val_features['green'].values,
    'swir16': landsat_val_features['swir16'].values,
    'swir22': landsat_val_features['swir22'].values,
    'NDMI': landsat_val_features['NDMI'].values,
    'MNDWI': landsat_val_features['MNDWI'].values,
    'pet': Terraclimate_val_df['pet'].values,
})

In [17]:
# Impute the missing values
val_data = val_data.fillna(val_data.median(numeric_only=True))

In [18]:
# Extracting specific columns (swir22, NDMI, MNDWI, pet) from the validation dataset
submission_val_data=val_data.loc[:,['swir22','NDMI','MNDWI','pet']]
submission_val_data.head()

Unnamed: 0,swir22,NDMI,MNDWI,pet
0,12421.0,0.014388,-0.069727,161.90001
1,9973.0,0.081427,-0.130571,177.6
2,9958.0,0.128123,-0.147979,158.40001
3,9973.0,0.081427,-0.130571,130.0
4,8711.0,-0.017761,0.080052,152.5


In [19]:
submission_val_data.shape

(200, 4)

In [20]:
# --- Predicting for Total Alkalinity ---
X_sub_scaled_TA = scaler_TA.transform(submission_val_data)
pred_TA_submission = model_TA.predict(X_sub_scaled_TA)

# --- Predicting for Electrical Conductance ---
X_sub_scaled_EC = scaler_EC.transform(submission_val_data)
pred_EC_submission = model_EC.predict(X_sub_scaled_EC)

# --- Predicting for Dissolved Reactive Phosphorus ---
X_sub_scaled_DRP = scaler_DRP.transform(submission_val_data)
pred_DRP_submission = model_DRP.predict(X_sub_scaled_DRP)

In [21]:
submission_df = pd.DataFrame({
    'Longitude': test_file['Longitude'].values,
    'Latitude': test_file['Latitude'].values,
    'Sample Date': test_file['Sample Date'].values,
    'Total Alkalinity': pred_TA_submission,
    'Electrical Conductance': pred_EC_submission,
    'Dissolved Reactive Phosphorus': pred_DRP_submission
})

In [22]:
#Displaying the sample submission dataframe
submission_df.head()

Unnamed: 0,Longitude,Latitude,Sample Date,Total Alkalinity,Electrical Conductance,Dissolved Reactive Phosphorus
0,27.822778,-32.043333,01-09-2014,114.833126,314.267271,25.235333
1,26.0775,-33.329167,16-09-2015,156.957291,648.500033,60.455833
2,27.640028,-32.991639,07-05-2015,62.87798,724.577283,29.445333
3,24.439167,-34.096389,07-02-2012,72.334447,234.950986,13.726667
4,28.581667,-32.000556,01-10-2014,109.078753,304.0102,30.21


In [23]:
#Dumping the predictions into a csv file.
submission_df.to_csv("submission.csv",index = False)
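
Before uploading, it can be worth running a few sanity checks on the file. The sketch below uses toy frames and a hypothetical helper, `check_submission`; the rules it checks (same number of rows as the template, no missing predictions, no negative values) are assumptions and not the platform's official validation.

```python
import pandas as pd

target_cols = ["Total Alkalinity", "Electrical Conductance", "Dissolved Reactive Phosphorus"]

# Toy template and toy predictions, for illustration only
template = pd.DataFrame({
    "Latitude": [-32.043333, -33.329167],
    "Longitude": [27.822778, 26.0775],
    "Sample Date": ["01-09-2014", "16-09-2015"],
})
submission = pd.DataFrame({
    "Latitude": [-32.043333, -33.329167],
    "Longitude": [27.822778, 26.0775],
    "Sample Date": ["01-09-2014", "16-09-2015"],
    "Total Alkalinity": [114.8, 157.0],
    "Electrical Conductance": [314.3, 648.5],
    "Dissolved Reactive Phosphorus": [25.2, 60.5],
})

def check_submission(submission, template, target_cols):
    """Return a list of human-readable problems; an empty list means all checks passed."""
    problems = []
    if len(submission) != len(template):
        problems.append("row count differs from the template")
    missing = [c for c in target_cols if c not in submission.columns]
    if missing:
        problems.append(f"missing columns: {missing}")
        return problems
    if submission[target_cols].isna().any().any():
        problems.append("some predictions are missing")
    if (submission[target_cols] < 0).any().any():
        problems.append("some predictions are negative")
    return problems

print(check_submission(submission, template, target_cols))
```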

### Upload submission file on platform

Upload <code>submission.csv</code> to the <a href="https://challenge.ey.com">platform</a> to have your score generated on the scoreboard.

## Conclusion

<div align ="justify">Now that you have learned a basic approach to model training, it’s time to try your own approach! Feel free to modify any of the functions presented in this notebook. We look forward to seeing your version of the model and the results. Best of luck with the challenge!</div>