## Day 18

In [1]:
import pandas as pd

In [2]:
df = pd.read_csv("sensitivity_soil_nutrient.csv")
df.head()

Unnamed: 0,SampleID,site,block,paddock,plot,slope,rainfall_reduction,grazing_treatment,year,type,faith_pd
0,GMDR-FK-2018-4,FK,1,1,4,4,0,stable,2018,bacteria,3.912.908.739
1,GMDR-TB-2018-45,TB,3,2,45,2,50,heavy,2018,bacteria,3.298.556.468
2,GMDR-TB-2018-8,TB,1,3,8,1,75,destock,2018,bacteria,3.332.257.877
3,GMDR-TB-2018-31,TB,2,3,31,4,99,destock,2018,bacteria,3.806.342.641
4,GMDR-TB-2018-23,TB,2,2,23,6,0,heavy,2018,bacteria,4.135.953.774


In [3]:
df.shape

(1006, 11)

In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1006 entries, 0 to 1005
Data columns (total 11 columns):
 #   Column              Non-Null Count  Dtype 
---  ------              --------------  ----- 
 0   SampleID            1006 non-null   object
 1   site                1006 non-null   object
 2   block               1006 non-null   int64 
 3   paddock             1006 non-null   int64 
 4   plot                1006 non-null   int64 
 5   slope               1006 non-null   int64 
 6   rainfall_reduction  1006 non-null   int64 
 7   grazing_treatment   1006 non-null   object
 8   year                1006 non-null   int64 
 9   type                1006 non-null   object
 10  faith_pd            1006 non-null   object
dtypes: int64(6), object(5)
memory usage: 86.6+ KB


In [5]:
df.columns

Index(['SampleID', 'site', 'block', 'paddock', 'plot', 'slope',
       'rainfall_reduction', 'grazing_treatment', 'year', 'type', 'faith_pd'],
      dtype='object')

In [6]:
df.dtypes

SampleID              object
site                  object
block                  int64
paddock                int64
plot                   int64
slope                  int64
rainfall_reduction     int64
grazing_treatment     object
year                   int64
type                  object
faith_pd              object
dtype: object

In [7]:
df["faith_pd"] = (
    df["faith_pd"]
    .str.replace(".", "", regex=False)
    .astype(float)
)

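# The target column is read as object (string) because its values contain several dots.
# To train an ML model, the target variable must be numeric.
# Note: stripping every "." turns "3.912.908.739" into 3912908739.0.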
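The dotted strings look like decimals whose thousands separators were misplaced during export. The sketch below shows one alternative parse, assuming (not confirmed by the source) that the first "." is the decimal point and the remaining dots are noise. `parse_dotted` is a hypothetical helper, not part of the original notebook.

```python
def parse_dotted(raw: str) -> float:
    """Keep the first '.' as the decimal point and drop the others.

    Assumption: '3.912.908.739' was originally 3.912908739.
    The file alone cannot confirm where the decimal point belongs.
    """
    whole, _, fraction = raw.partition(".")
    return float(f"{whole}.{fraction.replace('.', '')}")


# Strings copied from the df.head() output
samples = ["3.912.908.739", "3.298.556.468", "4.135.953.774"]
parsed = [parse_dotted(s) for s in samples]
print(parsed)
```

With pandas, the same helper could be applied as `df["faith_pd"].map(parse_dotted)`. It is worth checking the result against the original data source before trusting either parse.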

In [8]:
df.dtypes

SampleID               object
site                   object
block                   int64
paddock                 int64
plot                    int64
slope                   int64
rainfall_reduction      int64
grazing_treatment      object
year                    int64
type                   object
faith_pd              float64
dtype: object

SampleID is dropped because it is a row identifier and carries no predictive meaning of its own.
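One way to check that the identifier adds nothing new is to split it and compare the parts with the existing columns. This is a minimal sketch using only the five rows shown in `df.head()`; the `prefix-site-year-plot` layout is inferred from those rows and is an assumption for the rest of the file. `split_sample_id` is a hypothetical helper.

```python
def split_sample_id(sample_id: str) -> dict:
    """Split an ID such as 'GMDR-FK-2018-4' into site, year and plot.

    Assumed layout (inferred from the displayed rows): prefix-site-year-plot.
    """
    _prefix, site, year, plot = sample_id.split("-")
    return {"site": site, "year": int(year), "plot": int(plot)}


# (SampleID, site, year, plot) taken from the df.head() output
rows = [
    ("GMDR-FK-2018-4", "FK", 2018, 4),
    ("GMDR-TB-2018-45", "TB", 2018, 45),
    ("GMDR-TB-2018-8", "TB", 2018, 8),
    ("GMDR-TB-2018-31", "TB", 2018, 31),
    ("GMDR-TB-2018-23", "TB", 2018, 23),
]

# True when every ID only repeats values already stored in other columns
redundant = all(
    split_sample_id(sid) == {"site": site, "year": year, "plot": plot}
    for sid, site, year, plot in rows
)
print(redundant)
```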

In [9]:
df = df.drop(columns=["SampleID"])

In [10]:
df.columns

Index(['site', 'block', 'paddock', 'plot', 'slope', 'rainfall_reduction',
       'grazing_treatment', 'year', 'type', 'faith_pd'],
      dtype='object')

In [11]:
X = df.drop("faith_pd", axis=1)
y = df["faith_pd"]

- faith_pd is the value we want to predict
- All other columns are inputs

In [12]:
cat = X.select_dtypes(include="object").columns
num = X.select_dtypes(exclude="object").columns

cat, num

(Index(['site', 'grazing_treatment', 'type'], dtype='object'),
 Index(['block', 'paddock', 'plot', 'slope', 'rainfall_reduction', 'year'], dtype='object'))

In [14]:
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

preprocessor = ColumnTransformer(
    transformers=[
        ("num", StandardScaler(), num),
        ("cat", OneHotEncoder(drop="first"), cat)
    ]
)

In [15]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

In [16]:
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestRegressor

model = Pipeline([
    ("preprocessing", preprocessor),
    ("rf", RandomForestRegressor(
        n_estimators=300,
        random_state=42,
        n_jobs=-1
    ))
])

In [17]:
model.fit(X_train, y_train)

In [18]:
y_pred = model.predict(X_test)

In [21]:
from sklearn.metrics import mean_squared_error, r2_score
import numpy as np

y_pred = model.predict(X_test)

mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)
r2 = r2_score(y_test, y_pred)

rmse, r2

(np.float64(1185600297.522036), -0.04042782742404705)

The Random Forest regression model produced an RMSE of approximately 1.19 × 10⁹ and an R² score of −0.04.

This indicates that the model’s predictions are highly inaccurate and perform worse than a simple baseline model that predicts the mean value of the target variable. A negative R² suggests that the current feature set and model configuration are not capturing meaningful patterns in the data.
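To make the meaning of a negative R² concrete, the sketch below computes R² = 1 − SS_res / SS_tot by hand on toy numbers (not taken from the dataset). Predicting the mean of the true values gives exactly 0; predictions that are further from the truth than the mean give a negative score.

```python
def r2(y_true, y_pred):
    """R^2 = 1 - SS_res / SS_tot, with SS_tot measured around the mean of y_true."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


y_true = [1.0, 2.0, 3.0, 4.0]           # toy targets
mean_baseline = [2.5, 2.5, 2.5, 2.5]    # always predict the mean of y_true
reversed_pred = [4.0, 3.0, 2.0, 1.0]    # deliberately poor predictions

print(r2(y_true, mean_baseline))   # 0.0
print(r2(y_true, reversed_pred))   # -3.0
```

An R² of −0.04 therefore sits just below the mean baseline: the model is only slightly worse than predicting a constant, but it is not better.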

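Possible reasons include:

- Extremely large target values after numeric conversion: removing every "." in In [7] turns a string such as "3.912.908.739" into 3912908739.0, so the target is on the order of 10⁹ and any information about the original decimal position is discarded
- Noise or weak relationships between the input features and the target variable (faith_pd)
- Insufficient feature engineering, or the need for a target transformation (e.g., log scaling)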
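As an illustration of the log-scaling idea, here is a minimal sketch on toy values (not from the dataset). It assumes a non-negative target and uses `log1p` and `expm1` so that the transform can be inverted after prediction. It only compresses the range of the target; it does not correct values that were parsed on the wrong scale.

```python
import math


def to_log(values):
    """Compress a non-negative target with log(1 + y)."""
    return [math.log1p(v) for v in values]


def from_log(values):
    """Invert to_log so predictions return to the original units."""
    return [math.expm1(v) for v in values]


y = [3.0, 30.0, 300.0, 3000.0]   # toy targets spanning three orders of magnitude
z = to_log(y)
restored = from_log(z)

print(max(y) - min(y))   # range on the original scale
print(max(z) - min(z))   # much smaller range on the log scale
```

In scikit-learn, the same idea could be wired into the existing pipeline with `sklearn.compose.TransformedTargetRegressor(regressor=model, func=np.log1p, inverse_func=np.expm1)`, so that metrics are still reported in the original units.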