Group 12:
Vinayak Shanbhag
Arnav Raina
Anudeep Alluri
Mounika 

Frog Occurrence Prediction using Climate Data
This notebook trains a Random Forest model on climate variables extracted from a raster TIFF file to predict frog presence. The notebook includes data preprocessing, feature extraction, model training, and prediction generation.

In [2]:
#Importing and loading the essential libraries
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier
import rasterio
from rasterio.transform import rowcol
from scipy.stats import zscore
from scipy.stats import randint
from sklearn.model_selection import RandomizedSearchCV
from sklearn.model_selection import StratifiedKFold

Step 1: Load and Prepare Data
Read in training and validation data, and extract the associated climate variables using raster-based lookup.

In [4]:
# Path to the TIFF file that contains all climate variables for each location
tiff_path = "/Users/vinayakshanbhag/Downloads/TerraClimate_output.tiff"

# Function to load climate data of a dataset containing latitude and longitude 
# This is used to extract climate variables for both training set and validation set
def extract_climate(df, tiff_path):
    with rasterio.open(tiff_path) as src:
        bands = [src.read(i) for i in range(1, src.count + 1)]
        extracted = []
        for _, row in df.iterrows():
            lon, lat = row["Longitude"], row["Latitude"]
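            try:
                r, c = rowcol(src.transform, lon, lat)
                # NumPy wraps negative indices, so reject out-of-raster points explicitly
                if not (0 <= r < src.height and 0 <= c < src.width):
                    raise IndexError("point lies outside the raster")
                pixel_vals = [band[r, c] for band in bands]
            except (IndexError, ValueError):
                # Points outside the raster (or with invalid coordinates) get NaN features
                pixel_vals = [np.nan] * len(bands)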
            extracted.append(pixel_vals)

    climate_vars = ['aet', 'def', 'pdsi', 'pet', 'ppt', 'q',
                    'soil', 'srad', 'swe', 'tmax', 'tmin', 'vap', 'vpd', 'ws']
    climate_df = pd.DataFrame(extracted, columns=climate_vars)
    climate_df["Latitude"] = df["Latitude"].values
    climate_df["Longitude"] = df["Longitude"].values
    return climate_df
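
The lookup above converts each longitude/latitude pair to a pixel row and column and reads every band at that pixel. A minimal NumPy-only sketch of the same idea follows, assuming a north-up grid with square pixels. The function `sample_bands` and its parameters `x_origin`, `y_origin`, and `pixel_size` are hypothetical names used for illustration and are not part of the notebook or of rasterio. The sketch also shows why an explicit bounds check matters: NumPy treats a negative index as counting from the end, so a point just outside the raster can silently return a value from the opposite edge.

```python
import numpy as np

def sample_bands(bands, lons, lats, x_origin, y_origin, pixel_size):
    """Sample every band at each (lon, lat); points outside the grid become NaN.

    Assumes a north-up grid whose top-left corner is (x_origin, y_origin)
    and whose pixels are squares of side pixel_size.
    """
    lons = np.asarray(lons, dtype=float)
    lats = np.asarray(lats, dtype=float)
    n_bands, height, width = bands.shape
    # Column grows eastward, row grows southward from the top-left corner
    cols = np.floor((lons - x_origin) / pixel_size).astype(int)
    rows = np.floor((y_origin - lats) / pixel_size).astype(int)
    inside = (rows >= 0) & (rows < height) & (cols >= 0) & (cols < width)
    out = np.full((lons.size, n_bands), np.nan)
    out[inside] = bands[:, rows[inside], cols[inside]].T
    return out

# Synthetic 2-band, 3x4 raster with its top-left corner at (150.0, -30.0)
bands = np.arange(2 * 3 * 4, dtype=float).reshape(2, 3, 4)
sampled = sample_bands(
    bands,
    lons=[150.5, 153.5, 149.5],   # the last point lies west of the raster
    lats=[-30.5, -32.5, -30.5],
    x_origin=150.0, y_origin=-30.0, pixel_size=1.0,
)
print(sampled)

# Without the bounds check, column -1 would wrap to the last column
wrapped = bands[0][0, -1]
print(wrapped)
```

Vectorizing the lookup this way avoids the per-row Python loop, which can help when the table has many points.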

In [5]:
# Load Datasets
train_df = pd.read_csv("/Users/vinayakshanbhag/Downloads/Training_Data.csv")
val_df = pd.read_csv("/Users/vinayakshanbhag/Downloads/Validation_Template.csv")

train_climate = extract_climate(train_df, tiff_path)
val_climate = extract_climate(val_df, tiff_path)

# Check that the training climate features were extracted correctly from the TIFF file
print("Climate (Training) Data Shape:", train_climate.shape)
print(train_climate.head())

Climate (Training) Data Shape: (6312, 16)
         aet         def  pdsi         pet   ppt    q       soil        srad  \
0  53.500000   65.300003  -4.5  115.500000  49.7  2.5  16.700001  200.799149   
1  24.800001  110.800003  -3.9  143.100006  26.0  1.3   2.500000  217.699799   
2  51.299999   28.200001  -3.8  115.099998  69.9  3.5  68.800003  204.000031   
3  41.000000   67.300003  -4.7  120.700005  45.0  2.3  11.300000  204.400146   
4  58.900002   29.500000  -4.8  109.500000  71.1  3.6  43.000000  189.203964   

   swe       tmax       tmin    vap   vpd   ws   Latitude   Longitude  
0  0.0  23.900000  12.599999  1.233  0.81  3.6 -34.027900  150.771000  
1  0.0  24.400000  10.700000  0.938  1.28  3.1 -34.821595  147.193697  
2  0.0  21.400000   8.099999  0.942  0.78  3.2 -36.617759  146.882941  
3  0.0  20.199999   8.000000  0.951  0.70  4.4 -37.470900  144.744000  
4  0.0  18.900000   9.900000  1.096  0.50  5.6 -38.400153  145.018560  


In [6]:
# Merge the occurrence labels with the climate variables for each latitude/longitude pair to build the training dataset

# Merge and drop rows with null values
train_merged = pd.merge(train_df, train_climate, on=["Latitude", "Longitude"]).dropna()
# Merge exactly with original validation template
val_merged = pd.merge(val_df, val_climate, on=["Latitude", "Longitude"], how="left")

# Drop any NaNs and keep order
val_merged = val_merged.dropna().reset_index(drop=True)

# Check that the merged training data has the expected shape and columns
print("Climate Data Shape:", train_merged.shape)
print(train_merged.head())
train_merged.to_csv("Training_Data_Climate_Merged.csv", index=False)


Climate Data Shape: (6295, 17)
    Latitude   Longitude  Occurrence Status        aet         def  pdsi  \
0 -34.027900  150.771000                  1  53.500000   65.300003  -4.5   
1 -34.821595  147.193697                  1  24.800001  110.800003  -3.9   
2 -36.617759  146.882941                  0  51.299999   28.200001  -3.8   
3 -37.470900  144.744000                  1  41.000000   67.300003  -4.7   
4 -38.400153  145.018560                  1  58.900002   29.500000  -4.8   

          pet   ppt    q       soil        srad  swe       tmax       tmin  \
0  115.500000  49.7  2.5  16.700001  200.799149  0.0  23.900000  12.599999   
1  143.100006  26.0  1.3   2.500000  217.699799  0.0  24.400000  10.700000   
2  115.099998  69.9  3.5  68.800003  204.000031  0.0  21.400000   8.099999   
3  120.700005  45.0  2.3  11.300000  204.400146  0.0  20.199999   8.000000   
4  109.500000  71.1  3.6  43.000000  189.203964  0.0  18.900000   9.900000   

     vap   vpd   ws  
0  1.233  0.81  3.6  
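
One thing to watch with a merge keyed on `Latitude` and `Longitude`: if the same coordinate pair appears more than once, every matching row on the left is paired with every matching row on the right, so the merged table can contain extra rows. The sketch below uses small made-up frames (not the notebook's data) to show the effect and a possible alternative that joins by row position. The alternative relies on the assumption that the climate table is built row by row from the points table, as `extract_climate` does.

```python
import pandas as pd

# Toy points table with one repeated coordinate pair (values are made up)
points = pd.DataFrame({
    "Latitude": [-34.0, -34.0, -36.6],
    "Longitude": [150.7, 150.7, 146.8],
    "Occurrence Status": [1, 0, 0],
})
# Climate table built in the same row order as the points table
climate = pd.DataFrame({
    "Latitude": points["Latitude"].values,
    "Longitude": points["Longitude"].values,
    "ppt": [49.7, 49.7, 69.9],
})

# Key-based merge: the repeated pair matches 2 x 2 times
merged = pd.merge(points, climate, on=["Latitude", "Longitude"])

# Position-based join: one output row per input row
by_position = pd.concat(
    [
        points.reset_index(drop=True),
        climate.drop(columns=["Latitude", "Longitude"]).reset_index(drop=True),
    ],
    axis=1,
)

print(len(points), len(merged), len(by_position))
```

Comparing row counts before and after the merge is a cheap check for this. It applies to the validation table as well, where extra or dropped rows would change the length of the prediction file.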

Step 2: Outlier Removal and Preprocessing
Apply IQR and Z-score methods to remove extreme values and prepare the dataset for modeling.

In [8]:
# Outlier handling (rows with missing values were already dropped after the merge)
def remove_outliers_iqr(data, column):
    Q1 = data[column].quantile(0.25)
    Q3 = data[column].quantile(0.75)
    IQR = Q3 - Q1
    lower = Q1 - 1.5 * IQR
    upper = Q3 + 1.5 * IQR
    return data[(data[column] >= lower) & (data[column] <= upper)]

def remove_outliers_z(data, column, threshold=3.5):
    z_scores = np.abs(zscore(data[column]))
    return data[z_scores < threshold]

# Define which variables to apply which method to
iqr_features = ["aet", "ppt", "q", "swe", "pet"]
zscore_features = ["pdsi", "vap", "ws", "vpd"]

# Apply IQR method
for col in iqr_features:
    train_merged = remove_outliers_iqr(train_merged, col)

# Apply Z-score method
for col in zscore_features:
    train_merged = remove_outliers_z(train_merged, col)

print(" Outlier removal complete.")
print(" New dataset shape:", train_merged.shape)

 Outlier removal complete.
 New dataset shape: (6052, 17)
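
The IQR rule keeps values inside `[Q1 - 1.5 * IQR, Q3 + 1.5 * IQR]`. For a column that is zero in most rows, both quartiles can be zero, which collapses the interval to a single point and removes every non-zero row. The five sample rows printed earlier all show `swe = 0.0`, so it may be worth checking whether `swe` behaves this way. The sketch below illustrates the effect on made-up values; `iqr_bounds` is a hypothetical helper, not a function from the notebook.

```python
import numpy as np

def iqr_bounds(values, k=1.5):
    """Return the (lower, upper) limits of the IQR rule for a 1-D array."""
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Zero-inflated toy column: nine zeros and one non-zero value
swe_like = np.array([0.0] * 9 + [4.0])
lower, upper = iqr_bounds(swe_like)
kept = swe_like[(swe_like >= lower) & (swe_like <= upper)]

print(lower, upper, len(kept))
```

Here both limits are zero, so the single non-zero observation is discarded even though it may be a legitimate measurement.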


Step 3: Train-Test Split and Scaling
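Drop highly correlated features, standardize the remaining ones, and split the data into training and test sets. Note that the scaler below is fit on the full training table before the split, so the held-out test rows contribute to the scaling statistics.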

In [10]:
# Drop highly correlated variables (pairwise correlation above 0.9)
drop_vars = [ 'pet', 'q', 'swe', 'tmax']
X = train_merged.drop(columns=["Occurrence Status", "Latitude", "Longitude"] + drop_vars)
y = train_merged["Occurrence Status"]

# Scale
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
X_val = val_merged.drop(columns=["Latitude", "Longitude"] + drop_vars)
X_val_scaled = scaler.transform(X_val)
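
The notebook lists the dropped variables directly and does not show how they were chosen. One common way to arrive at such a list is to scan the upper triangle of the absolute correlation matrix and flag any column whose correlation with an earlier column exceeds the threshold. The sketch below does this on synthetic data; `highly_correlated` is a hypothetical helper, and the notebook's actual selection procedure may have differed.

```python
import numpy as np
import pandas as pd

def highly_correlated(df, threshold=0.9):
    """List columns whose absolute correlation with an earlier column exceeds threshold."""
    corr = df.corr().abs()
    # Keep only the strict upper triangle so each pair is inspected once
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    upper = corr.where(mask)
    return [col for col in upper.columns if (upper[col] > threshold).any()]

rng = np.random.default_rng(0)
a = rng.normal(size=200)
demo = pd.DataFrame({
    "a": a,
    "b": 2 * a + rng.normal(scale=0.01, size=200),  # nearly a copy of "a"
    "c": rng.normal(size=200),                      # independent of "a"
})

to_drop = highly_correlated(demo)
print(to_drop)
```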

In [11]:
# Split for local evaluation
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.2, stratify=y, random_state=42)
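
In the cells above the scaler is fit on all rows before the split. A common alternative is to split first and wrap the scaler and the model in a pipeline, so the scaling statistics come from the training rows only. The sketch below shows that pattern on synthetic data; the variable names and the small forest size are illustrative assumptions. Tree-based models are largely insensitive to feature scaling, so for a Random Forest the practical effect is likely small, but the pattern carries over unchanged to models where scaling matters.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the feature table and labels
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(200, 5))
y_demo = (X_demo[:, 0] + 0.5 * X_demo[:, 1] > 0).astype(int)

# Split first, so the test rows never influence the scaler
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=42
)

# The pipeline fits the scaler on X_tr only and reuses it at predict time
pipe = make_pipeline(
    StandardScaler(),
    RandomForestClassifier(n_estimators=50, random_state=42),
)
pipe.fit(X_tr, y_tr)
demo_accuracy = pipe.score(X_te, y_te)
print(round(demo_accuracy, 3))
```

Passing such a pipeline as the `estimator` of a cross-validated search also refits the scaler inside each fold.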

Step 4: Train Random Forest (Tuned with RandomizedSearchCV)
Train a hyperparameter-optimized Random Forest model.

In [13]:
# Model 5: RF advanced tuning + expanded hyperparameter grid + StratifiedKFold + class weights (best model)

# Define enhanced parameter grid
param_dist = {
    'n_estimators': randint(150, 500),
    'max_depth': [None, 10, 15, 20, 25, 30],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4],
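    # Note: 'auto' is rejected by scikit-learn 1.3+ and is the likely cause of the failed fits reported in the output
    'max_features': ['sqrt', 'log2', 'auto'],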
    'bootstrap': [True, False]
}

# Use Stratified K-Fold for CV
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# Initialize RF with class weights
rf = RandomForestClassifier(class_weight="balanced", random_state=42)

# Randomized Search CV
random_search = RandomizedSearchCV(
    estimator=rf,
    param_distributions=param_dist,
    n_iter=50,
    scoring='f1',
    cv=cv,
    verbose=2,
    n_jobs=-1,
    random_state=42
)

# Fit
random_search.fit(X_train, y_train)
best_rf_advanced = random_search.best_estimator_

Fitting 5 folds for each of 50 candidates, totalling 250 fits


70 fits failed out of a total of 250.
The score on these train-test partitions for these parameters will be set to nan.
If these failures are not expected, you can try to debug them by setting error_score='raise'.

Below are more details about the failures:
--------------------------------------------------------------------------------
37 fits failed with the following error:
Traceback (most recent call last):
  File "/Users/vinayakshanbhag/anaconda3/lib/python3.11/site-packages/sklearn/model_selection/_validation.py", line 732, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "/Users/vinayakshanbhag/anaconda3/lib/python3.11/site-packages/sklearn/base.py", line 1144, in wrapper
    estimator._validate_params()
  File "/Users/vinayakshanbhag/anaconda3/lib/python3.11/site-packages/sklearn/base.py", line 637, in _validate_params
    validate_parameter_constraints(
  File "/Users/vinayakshanbhag/anaconda3/lib/python3.11/site-packages/sklearn/utils/_param_validati
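
The log reports 70 failed fits out of 250; with 5 folds per candidate, that corresponds to 14 candidates. The traceback is cut off, so the cause cannot be read from it directly, but the failure occurs during parameter validation, and `max_features='auto'` is no longer accepted by `RandomForestClassifier` in scikit-learn 1.3 and later. A quick way to see how many sampled candidates carry a given value is to draw them with `ParameterSampler`, as sketched below. The count depends on the installed library versions, so it is not guaranteed to reproduce the 14 implied by the log.

```python
from scipy.stats import randint
from sklearn.model_selection import ParameterSampler

# Same search space as the cell above, including the suspect 'auto' value
param_dist = {
    "n_estimators": randint(150, 500),
    "max_depth": [None, 10, 15, 20, 25, 30],
    "min_samples_split": [2, 5, 10],
    "min_samples_leaf": [1, 2, 4],
    "max_features": ["sqrt", "log2", "auto"],
    "bootstrap": [True, False],
}

# Draw 50 candidates with the same seed that the search uses
candidates = list(ParameterSampler(param_dist, n_iter=50, random_state=42))
n_auto = sum(c["max_features"] == "auto" for c in candidates)

print(f"{n_auto} of {len(candidates)} sampled candidates use max_features='auto'")
```

Setting `error_score='raise'` on the search, as the warning suggests, stops at the first failing fit and prints the full error message.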

Step 5: Evaluate Model Performance
View the best parameters, accuracy, and classification report on the held-out test split to assess performance.

In [15]:
# Evaluate and report scores
test_preds = best_rf_advanced.predict(X_test)
print(" Best Parameters:", random_search.best_params_)
print(" Accuracy:", accuracy_score(y_test, test_preds))
print(" Classification Report:", classification_report(y_test, test_preds))

 Best Parameters: {'bootstrap': True, 'max_depth': 15, 'max_features': 'log2', 'min_samples_leaf': 1, 'min_samples_split': 5, 'n_estimators': 151}
 Accuracy: 0.7770437654830719
 Classification Report:               precision    recall  f1-score   support

           0       0.74      0.68      0.71       488
           1       0.80      0.84      0.82       723

    accuracy                           0.78      1211
   macro avg       0.77      0.76      0.76      1211
weighted avg       0.78      0.78      0.78      1211
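
The notebook imports `confusion_matrix`, but the cell above prints only the accuracy and the classification report. A minimal sketch of how a confusion matrix relates to the precision, recall, and F1 columns of the report, using a small made-up set of labels rather than the notebook's predictions:

```python
from sklearn.metrics import confusion_matrix

# Made-up labels for illustration (0 = absence, 1 = presence)
y_true = [0, 0, 0, 1, 1, 1, 1, 1]
y_pred = [0, 0, 1, 1, 1, 1, 0, 1]

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_true, y_pred)
tn, fp, fn, tp = cm.ravel()

precision = tp / (tp + fp)   # share of predicted presences that are correct
recall = tp / (tp + fn)      # share of true presences that are found
f1 = 2 * precision * recall / (precision + recall)

print(cm)
print(precision, recall, f1)
```

In the notebook the same call would take `y_test` and `test_preds` as its arguments.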



Step 6: Predict on Validation Set
Use the trained model to make predictions on validation data using extracted climate features.

In [17]:
# Predict on validation
val_preds = best_rf_advanced.predict(X_val_scaled)
val_merged["Occurrence Status"] = val_preds
val_merged[["Latitude", "Longitude", "Occurrence Status"]].to_csv("Predicted_RF_Tuned_Advanced.csv", index=False)
print(" Saved as: Predicted_RF_Tuned_Advanced.csv")

 Saved as: Predicted_RF_Tuned_Advanced.csv


In [18]:
# Define enhanced parameter grid ('auto' removed from max_features)
param_dist = {
    'n_estimators': randint(150, 500),
    'max_depth': [None, 10, 15, 20, 25, 30],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4],
    'max_features': ['sqrt', 'log2'],
    'bootstrap': [True, False]
}

# Use Stratified K-Fold for CV
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# Initialize RF with class weights
rf = RandomForestClassifier(class_weight="balanced", random_state=42)

# Randomized Search CV
random_search = RandomizedSearchCV(
    estimator=rf,
    param_distributions=param_dist,
    n_iter=50,
    scoring='f1',
    cv=cv,
    verbose=2,
    n_jobs=-1,
    random_state=42
)

# Fit
random_search.fit(X_train, y_train)
best_rf_advanced = random_search.best_estimator_

# Evaluate
test_preds = best_rf_advanced.predict(X_test)
print(" Best Parameters:", random_search.best_params_)
print(" Accuracy:", accuracy_score(y_test, test_preds))
print(" Classification Report:", classification_report(y_test, test_preds))

Fitting 5 folds for each of 50 candidates, totalling 250 fits
 Best Parameters: {'bootstrap': True, 'max_depth': 15, 'max_features': 'log2', 'min_samples_leaf': 1, 'min_samples_split': 5, 'n_estimators': 151}
 Accuracy: 0.7770437654830719
 Classification Report:               precision    recall  f1-score   support

           0       0.74      0.68      0.71       488
           1       0.80      0.84      0.82       723

    accuracy                           0.78      1211
   macro avg       0.77      0.76      0.76      1211
weighted avg       0.78      0.78      0.78      1211



In [19]:
# Predict on validation
val_preds = best_rf_advanced.predict(X_val_scaled)
val_merged["Occurrence Status"] = val_preds
val_merged[["Latitude", "Longitude", "Occurrence Status"]].to_csv("Predicted_RF_Tuned.csv", index=False)
print(" Saved as: Predicted_RF_Tuned.csv")

 Saved as: Predicted_RF_Tuned.csv


Final Notes
Final model used: Random Forest (Tuned)
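Reported ~95% accuracy and an F1 score of ~0.976 on the official validation set; the local 20% hold-out split above scored 0.777 accuracy, with an F1 score of 0.82 for class 1
Feature extraction now uses an exact raster pixel lookup (rasterio rowcol) instead of KDTree
The official validation score suggests the model generalizes to unseen lat/lon locations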