## About Dataset

This CSV file contains agricultural data for crops cultivated across Indian states from 1997 to 2020. The dataset is intended for predicting crop yield from agronomic factors such as rainfall, fertilizer use, and pesticide use. It is tabular: each row describes one crop in a given state, season, and year. It has 19,689 rows and 10 columns (9 features and 1 label).

#### Column Descriptions

- Crop: The name of the crop cultivated.
- Crop_Year: The year in which the crop was grown.
- Season: The specific cropping season (e.g., Kharif, Rabi, Whole Year).
- State: The Indian state where the crop was cultivated.
- Area: The total land area (in hectares) under cultivation for the specific crop.
- Production: The quantity of crop production (in metric tons).
- Annual_Rainfall: The annual rainfall received in the crop-growing region (in mm).
- Fertilizer: The total amount of fertilizer used for the crop (in kilograms).
- Pesticide: The total amount of pesticide used for the crop (in kilograms).
- Yield: The calculated crop yield (production per unit area).
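
The `Yield` definition above can be sanity-checked against `Production` and `Area`. The sketch below does so for the first five rows of the file, with the values copied by hand into a plain list. On these rows the simple ratio does not reproduce the stored `Yield`, which suggests the column was computed at a finer level and then aggregated; that explanation is an assumption, not something the dataset page confirms.

```python
# First five rows of crop_yield.csv: (Crop, Area, Production, reported Yield)
rows = [
    ("Arecanut", 73814.0, 56708, 0.796087),
    ("Arhar/Tur", 6637.0, 4685, 0.710435),
    ("Castor seed", 796.0, 22, 0.238333),
    ("Coconut", 19656.0, 126905000, 5238.051739),
    ("Cotton(lint)", 1739.0, 794, 0.420909),
]


def production_per_area(area, production):
    """Naive yield: production divided by cultivated area."""
    return production / area


# Pair each naive ratio with the value stored in the Yield column
comparison = [
    (crop, production_per_area(area, production), reported)
    for crop, area, production, reported in rows
]

for crop, ratio, reported in comparison:
    print(f"{crop:<13} Production/Area = {ratio:12.4f}   Yield = {reported:12.4f}")
```

If the two columns disagree this much across the full file, `Yield` should be treated as its own measured quantity rather than something to recompute from the other columns.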



Source: https://www.kaggle.com/datasets/akshatgupta7/crop-yield-in-indian-states-dataset/data


In [1]:
import pandas as pd

df = pd.read_csv("../data/crop_yield.csv")
df.head()

Unnamed: 0,Crop,Crop_Year,Season,State,Area,Production,Annual_Rainfall,Fertilizer,Pesticide,Yield
0,Arecanut,1997,Whole Year,Assam,73814.0,56708,2051.4,7024878.38,22882.34,0.796087
1,Arhar/Tur,1997,Kharif,Assam,6637.0,4685,2051.4,631643.29,2057.47,0.710435
2,Castor seed,1997,Kharif,Assam,796.0,22,2051.4,75755.32,246.76,0.238333
3,Coconut,1997,Whole Year,Assam,19656.0,126905000,2051.4,1870661.52,6093.36,5238.051739
4,Cotton(lint),1997,Kharif,Assam,1739.0,794,2051.4,165500.63,539.09,0.420909


In [2]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 19689 entries, 0 to 19688
Data columns (total 10 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   Crop             19689 non-null  object 
 1   Crop_Year        19689 non-null  int64  
 2   Season           19689 non-null  object 
 3   State            19689 non-null  object 
 4   Area             19689 non-null  float64
 5   Production       19689 non-null  int64  
 6   Annual_Rainfall  19689 non-null  float64
 7   Fertilizer       19689 non-null  float64
 8   Pesticide        19689 non-null  float64
 9   Yield            19689 non-null  float64
dtypes: float64(5), int64(2), object(3)
memory usage: 1.5+ MB


In [7]:
import os
import warnings

import joblib
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import RandomizedSearchCV, cross_val_score, train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.svm import SVR
from xgboost import XGBRegressor

warnings.filterwarnings("ignore")

In [None]:
DATA_PATH = "../data/crop_yield.csv"  # same file that the first cell loads
RANDOM_STATE = 42
TEST_SIZE = 0.2
TARGET = "Yield"

In [None]:
print("Data shape:", df.shape)
print("\nColumns:", list(df.columns))
print("\nMissing values per column:\n", df.isna().sum())
print("\nSample rows:\n", df.sample(5, random_state=RANDOM_STATE))
print("\nNumeric summary:\n", df.describe().T)

Data shape: (19689, 10)

Columns: ['Crop', 'Crop_Year', 'Season', 'State', 'Area', 'Production', 'Annual_Rainfall', 'Fertilizer', 'Pesticide', 'Yield']

Missing values per column:
 Crop               0
Crop_Year          0
Season             0
State              0
Area               0
Production         0
Annual_Rainfall    0
Fertilizer         0
Pesticide          0
Yield              0
dtype: int64

Sample rows:
                         Crop  Crop_Year       Season              State  \
18238  Peas & beans (Pulses)       2016  Kharif       Jammu and Kashmir   
6918                   Maize       1999  Rabi                    Odisha   
4894                  Potato       2016  Winter               Meghalaya   
10960                   Ragi       2008  Autumn               Jharkhand   
15615            Castor seed       2017  Kharif          Madhya Pradesh   

           Area  Production  Annual_Rainfall   Fertilizer  Pesticide     Yield  
18238    210.00        1010            902.8    3

In [9]:
df.columns

Index(['Crop', 'Crop_Year', 'Season', 'State', 'Area', 'Production',
       'Annual_Rainfall', 'Fertilizer', 'Pesticide', 'Yield'],
      dtype='object')

In [14]:
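# Drop rows with a missing target
df = df.dropna(subset=[TARGET])

# Define features. Area and Production are left out, presumably because Yield
# is derived from them and they would leak the target; Crop_Year is also unused.
categorical_cols = ["Crop", "Season", "State"]
numeric_cols = ["Annual_Rainfall", "Fertilizer", "Pesticide"]
X = df[categorical_cols + numeric_cols]
y = df[TARGET]

# Train/test split, using the constants defined earlier
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=TEST_SIZE, random_state=RANDOM_STATE
)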


In [15]:
numeric_transformer = Pipeline(steps=[
    ("scaler", StandardScaler())
])

categorical_transformer = Pipeline(steps=[
    ("onehot", OneHotEncoder(handle_unknown="ignore"))
])

preprocessor = ColumnTransformer(
    transformers=[
        ("num", numeric_transformer, numeric_cols),
        ("cat", categorical_transformer, categorical_cols)
    ]
)
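
The `handle_unknown="ignore"` setting matters once the test split contains a crop, season, or state that the encoder never saw during fitting. A minimal sketch of that behavior on made-up season labels follows; "Zaid" is a hypothetical stand-in for an unseen category, and the two training labels are only a subset of what the dataset contains.

```python
from sklearn.preprocessing import OneHotEncoder

# Toy training categories (illustrative, not the full set in the dataset)
train_seasons = [["Kharif"], ["Rabi"]]
# "Zaid" is a hypothetical label that was not present at fit time
new_seasons = [["Kharif"], ["Zaid"]]

encoder = OneHotEncoder(handle_unknown="ignore")
encoder.fit(train_seasons)

# transform returns a sparse matrix by default; densify it for display
encoded = encoder.transform(new_seasons).toarray()
print(encoder.categories_[0])
print(encoded)
```

The unseen label is encoded as an all-zero row instead of raising an error, so the downstream model sees it as "none of the known categories".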

In [16]:
models = {
    "Linear Regression": LinearRegression(),
    "KNN Regressor": KNeighborsRegressor(),
    "SVR": SVR(),
    "Random Forest": RandomForestRegressor(),
    "Gradient Boosting": GradientBoostingRegressor(),
    "XGBoost": XGBRegressor(objective="reg:squarederror")
}

In [17]:
results = []

for name, model in models.items():
    pipeline = Pipeline(steps=[("preprocessor", preprocessor),
                               ("model", model)])
    pipeline.fit(X_train, y_train)

    y_train_pred = pipeline.predict(X_train)
    y_test_pred = pipeline.predict(X_test)

    metrics = {
        "Model": name,
        "Train RMSE": np.sqrt(mean_squared_error(y_train, y_train_pred)),
        "Test RMSE": np.sqrt(mean_squared_error(y_test, y_test_pred)),
        "Train MAE": mean_absolute_error(y_train, y_train_pred),
        "Test MAE": mean_absolute_error(y_test, y_test_pred),
        "Train R2": r2_score(y_train, y_train_pred),
        "Test R2": r2_score(y_test, y_test_pred),
    }
    results.append(metrics)

    print("\nModel Name:", name)
    for k, v in metrics.items():
        if k != "Model":
            print(f"{k}: {v:.4f}")
    print("="*50)

df_results = pd.DataFrame(results)
print("\nSummary of all models:")
print(df_results)


Model Name: Linear Regression
Train RMSE: 336.9407
Test RMSE: 389.7400
Train MAE: 57.2619
Test MAE: 63.1729
Train R2: 0.8514
Test R2: 0.8104

Model Name: KNN Regressor
Train RMSE: 129.7956
Test RMSE: 245.1775
Train MAE: 8.7709
Test MAE: 16.5693
Train R2: 0.9779
Test R2: 0.9250

Model Name: SVR
Train RMSE: 875.3793
Test RMSE: 896.4453
Train MAE: 77.5028
Test MAE: 76.6288
Train R2: -0.0031
Test R2: -0.0030

Model Name: Random Forest
Train RMSE: 67.1915
Test RMSE: 136.9942
Train MAE: 3.5644
Test MAE: 9.4811
Train R2: 0.9941
Test R2: 0.9766

Model Name: Gradient Boosting
Train RMSE: 82.6199
Test RMSE: 162.9958
Train MAE: 8.1268
Test MAE: 13.6068
Train R2: 0.9911
Test R2: 0.9668

Model Name: XGBoost
Train RMSE: 4.9801
Test RMSE: 257.1538
Train MAE: 1.1855
Test MAE: 15.6503
Train R2: 1.0000
Test R2: 0.9175

Summary of all models:
               Model  Train RMSE   Test RMSE  Train MAE   Test MAE  Train R2  \
0  Linear Regression  336.940700  389.739981  57.261898  63.172873  0.851386   
1  

In [18]:
df_results

Unnamed: 0,Model,Train RMSE,Test RMSE,Train MAE,Test MAE,Train R2,Test R2
0,Linear Regression,336.9407,389.739981,57.261898,63.172873,0.851386,0.810422
1,KNN Regressor,129.795551,245.177458,8.770855,16.569302,0.977947,0.924976
2,SVR,875.379265,896.445289,77.502821,76.62878,-0.003104,-0.002963
3,Random Forest,67.191539,136.99416,3.564426,9.481136,0.99409,0.976577
4,Gradient Boosting,82.619868,162.9958,8.126821,13.606837,0.991064,0.966842
5,XGBoost,4.980126,257.153791,1.185512,15.650346,0.999968,0.917468
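
Two patterns in the table are worth unpacking: RMSE is several times larger than MAE for every model, and SVR scores an R2 just below zero. Both are consistent with a heavily skewed target (the Coconut row in the first cell has a yield above 5000, while the other rows shown there are below 1). The sketch below uses hypothetical numbers, not the dataset, to show how a constant prediction near the median can have a lower MAE than predicting the mean and still get a negative R2. Whether this is what happens inside SVR here is an assumption.

```python
import math


def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Root mean squared error."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Hypothetical skewed target: four small yields and one very large one
y_true = [1.0, 2.0, 1.0, 2.0, 5000.0]

mean_value = sum(y_true) / len(y_true)
mean_pred = [mean_value] * len(y_true)    # constant prediction at the mean
median_pred = [2.0] * len(y_true)         # constant prediction at the median

for label, pred in [("mean", mean_pred), ("median", median_pred)]:
    print(f"{label:<6} MAE={mae(y_true, pred):.2f} "
          f"RMSE={rmse(y_true, pred):.2f} R2={r2(y_true, pred):.4f}")
```

Because a single extreme value dominates the squared-error terms, RMSE and R2 on data like this mostly measure how well the largest yields are predicted, while MAE reflects the typical row.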


In [19]:
models = {
    "RandomForest": (RandomForestRegressor(random_state=42),
                     {
                         "model__n_estimators": [100, 200, 300, 500],
                         "model__max_depth": [5, 10, 20, None],
                         "model__min_samples_split": [2, 5, 10],
                         "model__min_samples_leaf": [1, 2, 4]
                     }),
    "XGBoost": (XGBRegressor(objective="reg:squarederror", random_state=42),
                {
                    "model__n_estimators": [200, 400, 600],
                    "model__learning_rate": [0.01, 0.05, 0.1],
                    "model__max_depth": [3, 5, 7, 9],
                    "model__subsample": [0.7, 0.8, 1.0],
                    "model__colsample_bytree": [0.7, 0.8, 1.0]
                })
}
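
The dictionaries above define discrete grids, and the randomized search in the next cell samples only `n_iter=10` combinations from each. A quick count, sketched below with the standard library, shows how small a share of each grid that is. The grids are copied from the cell above with the `model__` prefix dropped for brevity.

```python
from math import prod

# Parameter grids copied from the tuning cell (prefix "model__" omitted)
rf_grid = {
    "n_estimators": [100, 200, 300, 500],
    "max_depth": [5, 10, 20, None],
    "min_samples_split": [2, 5, 10],
    "min_samples_leaf": [1, 2, 4],
}
xgb_grid = {
    "n_estimators": [200, 400, 600],
    "learning_rate": [0.01, 0.05, 0.1],
    "max_depth": [3, 5, 7, 9],
    "subsample": [0.7, 0.8, 1.0],
    "colsample_bytree": [0.7, 0.8, 1.0],
}
N_ITER = 10  # value passed to RandomizedSearchCV in the tuning cell


def grid_size(grid):
    """Number of distinct parameter combinations in a discrete grid."""
    return prod(len(values) for values in grid.values())


sizes = {"RandomForest": grid_size(rf_grid), "XGBoost": grid_size(xgb_grid)}
for name, size in sizes.items():
    print(f"{name}: {size} combinations, {N_ITER / size:.1%} sampled")
```

With so few samples per grid, the "best params" reported by the search are best read as a reasonable configuration rather than the optimum of the grid; raising `n_iter` trades run time for coverage.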

In [20]:
best_model = None
best_score = -np.inf
best_name = ""

for name, (model, param_dist) in models.items():
    print(f"\n🔎 Tuning {name}...")
    pipe = Pipeline(steps=[("preprocessor", preprocessor),
                           ("model", model)])
    rs = RandomizedSearchCV(
        pipe,
        param_distributions=param_dist,
        n_iter=10,
        scoring="r2",
        cv=3,
        random_state=42,
        n_jobs=-1,
        verbose=1
    )
    rs.fit(X_train, y_train)

    y_pred = rs.predict(X_test)
    r2 = r2_score(y_test, y_pred)
    rmse = np.sqrt(mean_squared_error(y_test, y_pred))
    mae = mean_absolute_error(y_test, y_pred)

    print(f"✅ Best params for {name}: {rs.best_params_}")
    print(f"R2: {r2:.4f}, RMSE: {rmse:.4f}, MAE: {mae:.4f}")

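    # Note: the winner is picked on test-set R2, so the reported score is
    # somewhat optimistic; rs.best_score_ (mean CV R2) would be an alternative
    # criterion that leaves the test set untouched.
    if r2 > best_score: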
        best_score = r2
        best_model = rs.best_estimator_
        best_name = name

print(f"\n🏆 Best Model: {best_name} with R2 = {best_score:.4f}")
joblib.dump(best_model, "best_crop_yield_model.pkl")
print("💾 Saved as best_crop_yield_model.pkl")


🔎 Tuning RandomForest...
Fitting 3 folds for each of 10 candidates, totalling 30 fits
✅ Best params for RandomForest: {'model__n_estimators': 300, 'model__min_samples_split': 10, 'model__min_samples_leaf': 1, 'model__max_depth': 20}
R2: 0.9756, RMSE: 139.7845, MAE: 10.2476

🔎 Tuning XGBoost...
Fitting 3 folds for each of 10 candidates, totalling 30 fits
✅ Best params for XGBoost: {'model__subsample': 0.7, 'model__n_estimators': 200, 'model__max_depth': 3, 'model__learning_rate': 0.1, 'model__colsample_bytree': 0.8}
R2: 0.9422, RMSE: 215.1549, MAE: 19.8925

🏆 Best Model: RandomForest with R2 = 0.9756
💾 Saved as best_crop_yield_model.pkl
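
Because the saved object is the whole pipeline (preprocessing plus model), it can be reloaded and applied to raw rows that have the same six feature columns. The sketch below demonstrates the round trip on a hypothetical toy frame and a temporary file so that it runs without the dataset. The values and the `toy_model.pkl` name are made up, and `LinearRegression` stands in for the tuned model.

```python
import os
import tempfile

import joblib
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy rows, not taken from the dataset
toy = pd.DataFrame({
    "Crop": ["Rice", "Maize", "Rice", "Maize"],
    "Season": ["Kharif", "Rabi", "Rabi", "Kharif"],
    "State": ["Assam", "Assam", "Odisha", "Odisha"],
    "Annual_Rainfall": [2000.0, 1500.0, 1200.0, 1800.0],
    "Fertilizer": [100000.0, 200000.0, 150000.0, 50000.0],
    "Pesticide": [300.0, 500.0, 400.0, 200.0],
})
toy_yield = [2.1, 1.4, 1.9, 1.2]

# Same preprocessing layout as the notebook, with a stand-in model
pre = ColumnTransformer([
    ("num", StandardScaler(), ["Annual_Rainfall", "Fertilizer", "Pesticide"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["Crop", "Season", "State"]),
])
pipe = Pipeline([("preprocessor", pre), ("model", LinearRegression())])
pipe.fit(toy, toy_yield)

# Round trip through joblib using a temporary directory
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "toy_model.pkl")
    joblib.dump(pipe, path)
    reloaded = joblib.load(path)

new_row = toy.iloc[[0]]  # a one-row DataFrame with the raw feature columns
original_pred = pipe.predict(new_row)
reloaded_pred = reloaded.predict(new_row)
print(original_pred, reloaded_pred)
```

For the real artifact, `joblib.load("best_crop_yield_model.pkl")` followed by `.predict(...)` on a DataFrame with the same columns should behave the same way, provided a compatible scikit-learn version is installed.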
