# Model 4: Final Iteration

After reviewing the limitations of our previous models, we worked out a new strategy for training and developing the model.

## Data Cleaning and Feature Engineering 

### Cleaning Steps

* **Removed the `Date` column** (not useful for prediction)
* **Created categorical columns**:

  * `Soil_Quality_Class` from continuous `Soil_Quality`, using custom bins: poor (25 or below), average (above 25 up to 45), good (above 45 up to 60), excellent (above 60)
* **Label-encoded categorical variables** for model compatibility:

  * `Soil_Type`
  * `Soil_Quality_Class`
  * `Crop_Type` (encoded into `Crop_Type_Label`, the classification target)
* **Left all other columns as numeric values** (the subset used as model features is listed under Feature Engineering)
* **Added `sowing_month` and `harvesting_month`** based on `Crop_Type`, as per Pakistan’s crop calendars
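
The binning step for `Soil_Quality_Class` can also be written with `pandas.cut`. The following is a minimal sketch, not the code used in this notebook: the sample values are hypothetical, and the bin edges are chosen to mirror the thresholds above (25, 45, 60).

```python
import numpy as np
import pandas as pd

# Hypothetical sample values; the real ones come from the `Soil_Quality` column
sample = pd.DataFrame({'Soil_Quality': [22.83, 25.0, 27.33, 45.0, 52.0, 60.0, 66.67]})

# Bin edges mirror the notebook's thresholds:
# poor <= 25 < average <= 45 < good <= 60 < excellent
bins = [-np.inf, 25, 45, 60, np.inf]
labels = ['poor', 'average', 'good', 'excellent']

# right=True (the default) closes each interval on the right, e.g. (25, 45]
sample['Soil_Quality_Class'] = pd.cut(sample['Soil_Quality'], bins=bins, labels=labels)
print(sample)
```

Values that sit exactly on an edge (25, 45, 60) fall into the lower class, which matches the strict `>` comparisons in the `soil_quality_class` function.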



### Feature Engineering

* The following columns were used as model features:

  * `Soil_Type` (encoded)
  * `Soil_pH`
  * `Temperature`
  * `Humidity`
  * `Wind_Speed`
  * `N`
  * `P`
  * `K`
  * `Soil_Quality_Class` (encoded)
  * `sowing_month`
  * `harvesting_month`

In [10]:
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Load data
df = pd.read_csv('crop_yield_dataset.csv')

# Remove date column
if 'Date' in df.columns:
    df = df.drop(columns=['Date'])

# Add sowing/harvesting months based on crop type
crop_months = {
    'wheat':      (11, 4),
    'corn':       (2, 7),
    'rice':       (6, 10),
    'barley':     (10, 3),
    'soybean':    (6, 10),
    'cotton':     (4, 10),
    'sugarcane':  (2, 11),
    'tomato':     (8, 11),
    'potato':     (10, 1),
    'sunflower':  (1, 5)
}
df['Crop_Type_LC'] = df['Crop_Type'].str.strip().str.lower()
df['sowing_month'] = df['Crop_Type_LC'].map(lambda x: crop_months.get(x, (0, 0))[0])
df['harvesting_month'] = df['Crop_Type_LC'].map(lambda x: crop_months.get(x, (0, 0))[1])
df = df.drop(columns=['Crop_Type_LC'])

# Add Soil_Quality_Class
def soil_quality_class(value):
    if value > 60:
        return 'excellent'
    elif value > 45:
        return 'good'
    elif value > 25:
        return 'average'
    else:
        return 'poor'

df['Soil_Quality_Class'] = df['Soil_Quality'].apply(soil_quality_class)

# --- FIX: Use separate encoders for each feature ---
soil_type_le = LabelEncoder()
df['Soil_Type'] = soil_type_le.fit_transform(df['Soil_Type'])

soil_quality_le = LabelEncoder()
df['Soil_Quality_Class'] = soil_quality_le.fit_transform(df['Soil_Quality_Class'])

crop_le = LabelEncoder()
df['Crop_Type_Label'] = crop_le.fit_transform(df['Crop_Type'].str.strip().str.lower())

# Now you can safely save these encoders for deployment:
import joblib
encoders = {
    'soil_type': soil_type_le,
    'soil_quality': soil_quality_le,
    'crop_type': crop_le
}
joblib.dump(encoders, 'encoder.pkl')

# Data is ready for ML!
df.head()


| Unnamed: 0 | Crop_Type | Soil_Type | Soil_pH | Temperature | Humidity | Wind_Speed | N | P | K | Crop_Yield | Soil_Quality | sowing_month | harvesting_month | Soil_Quality_Class | Crop_Type_Label |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Wheat | 2 | 5.5 | 9.440599 | 80.0 | 10.956707 | 60.5 | 45.0 | 31.5 | 0.0 | 22.833333 | 11 | 4 | 3 | 9 |
| 1 | Corn | 1 | 6.5 | 20.052576 | 79.947424 | 8.591577 | 84.0 | 66.0 | 50.0 | 104.87131 | 66.666667 | 2 | 7 | 1 | 1 |
| 2 | Rice | 2 | 5.5 | 12.143099 | 80.0 | 7.227751 | 71.5 | 54.0 | 38.5 | 0.0 | 27.333333 | 6 | 10 | 0 | 4 |
| 3 | Barley | 4 | 6.75 | 19.751848 | 80.0 | 2.682683 | 50.0 | 40.0 | 30.0 | 58.939796 | 35.0 | 10 | 3 | 0 | 0 |
| 4 | Soybean | 2 | 5.5 | 16.110395 | 80.0 | 7.69607 | 49.5 | 45.0 | 38.5 | 32.970413 | 22.166667 | 6 | 10 | 3 | 5 |
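
The integer codes in the encoded columns come from `LabelEncoder`, which assigns codes by sorted class name, not by any ranking. The sketch below illustrates this for `Crop_Type_Label`, assuming the dataset contains exactly the ten crops listed in `crop_months`; `demo_le` is a throwaway encoder introduced only for this illustration.

```python
from sklearn.preprocessing import LabelEncoder

# Assumption: the dataset contains exactly the ten crops listed in `crop_months`
crops = ['wheat', 'corn', 'rice', 'barley', 'soybean',
         'cotton', 'sugarcane', 'tomato', 'potato', 'sunflower']

demo_le = LabelEncoder()
demo_le.fit(crops)

# classes_ is stored in sorted order; a class's position is its integer code
crop_codes = {str(name): idx for idx, name in enumerate(demo_le.classes_)}
print(crop_codes)

# inverse_transform maps codes back to names, e.g. for reading report rows
decoded = [str(c) for c in demo_le.inverse_transform([9, 1, 4])]
print(decoded)
```

The same rule applies to `Soil_Quality_Class`: sorted alphabetically, the codes are average = 0, excellent = 1, good = 2, poor = 3, so the codes do not follow the quality order. This is consistent with the first row of the output, where a `Soil_Quality` of 22.83 is encoded as 3.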


Rows with a `Crop_Yield` of zero were dropped.

In [11]:
df = df[df['Crop_Yield'] != 0].copy()


## Training 

Next, we train two random forest models on the same features: a classifier that predicts the crop type and a regressor that predicts the crop yield.
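
Both models share one train/test split, so the feature rows and both targets have to stay aligned. A minimal sketch of that pattern on synthetic stand-in data (the `toy` frame and its column names are hypothetical) could look like this:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (hypothetical): 100 rows, 2 balanced classes
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    'feature': rng.normal(size=100),
    'label': np.repeat([0, 1], 50),
    'yield_value': rng.uniform(10, 100, size=100),
})

X_toy = toy[['feature']]
y_label = toy['label']
y_yield = toy['yield_value']

# Passing several arrays splits them with the same row selection;
# stratify keeps the class proportions similar in train and test
X_tr, X_te, yl_tr, yl_te, yy_tr, yy_te = train_test_split(
    X_toy, y_label, y_yield, test_size=0.2, random_state=42, stratify=y_label
)

# The test features and both test targets refer to the same rows
print(X_te.index.equals(yl_te.index) and X_te.index.equals(yy_te.index))
print(yl_te.value_counts().to_dict())
```

Stratifying on the class label only controls the class balance; the yield target simply follows the same row selection.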

In [12]:
import pandas as pd
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from sklearn.metrics import classification_report, accuracy_score, mean_absolute_error, mean_squared_error, r2_score
import joblib
import json



# -- FEATURES AND TARGETS --
features = [
    'Soil_Type', 'Soil_pH', 'Temperature', 'Humidity', 'Wind_Speed',
    'N', 'P', 'K', 'Soil_Quality_Class', 'sowing_month', 'harvesting_month'
]
X = df[features]
y_class = df['Crop_Type_Label']
y_reg = df['Crop_Yield']

# -- SPLIT --
X_train, X_test, y_class_train, y_class_test, y_reg_train, y_reg_test = train_test_split(
    X, y_class, y_reg, test_size=0.2, random_state=42, stratify=y_class
)

# -- TRAIN --
clf = RandomForestClassifier(n_estimators=120, random_state=42, class_weight='balanced')
clf.fit(X_train, y_class_train)

reg = RandomForestRegressor(n_estimators=120, random_state=42)
reg.fit(X_train, y_reg_train)

# -- TEST METRICS --
y_class_pred = clf.predict(X_test)
print("Crop Recommendation (Classification) Report:")
print(classification_report(y_class_test, y_class_pred))
print(f"Accuracy: {accuracy_score(y_class_test, y_class_pred):.3f}")

y_reg_pred = reg.predict(X_test)
print("Yield Prediction (Regression) Metrics:")
print(f"MAE: {mean_absolute_error(y_reg_test, y_reg_pred):.2f}")
# Take the square root explicitly; the `squared` argument is not available in recent scikit-learn releases
print(f"RMSE: {mean_squared_error(y_reg_test, y_reg_pred) ** 0.5:.2f}")
print(f"R2: {r2_score(y_reg_test, y_reg_pred):.3f}")

# -- SAVE MODELS AND ENCODERS --
joblib.dump(clf, 'model_classifier.pkl')
joblib.dump(reg, 'model_regressor.pkl')
encoders = {
    'soil_type': soil_type_le,
    'soil_quality': soil_quality_le,
    'crop_type': crop_le
}
joblib.dump(encoders, 'encoder.pkl')
with open('features.json', 'w') as f:
    json.dump(features, f)



Crop Recommendation (Classification) Report:
              precision    recall  f1-score   support

           0       1.00      1.00      1.00       504
           1       1.00      1.00      1.00       505
           2       1.00      1.00      1.00       508
           3       1.00      1.00      1.00       512
           4       1.00      1.00      1.00       514
           5       1.00      1.00      1.00       513
           6       1.00      1.00      1.00       511
           7       1.00      1.00      1.00       510
           8       1.00      1.00      1.00       513
           9       1.00      1.00      1.00       509

    accuracy                           1.00      5099
   macro avg       1.00      1.00      1.00      5099
weighted avg       1.00      1.00      1.00      5099

Accuracy: 1.000
Yield Prediction (Regression) Metrics:
MAE: 3.34
RMSE: 4.88
R2: 0.953
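
The `sowing_month` and `harvesting_month` features are looked up from `Crop_Type`, which is also the classification target, so the perfect classification scores above may partly reflect those two columns encoding the label. A quick way to see how much they reveal is to check which crops have a unique month pair. This sketch uses only the `crop_months` mapping from the preprocessing cell; the names `by_months`, `unique_crops`, and `shared_pairs` are introduced here for illustration.

```python
from collections import defaultdict

# Crop calendar copied from the preprocessing cell
crop_months = {
    'wheat':      (11, 4),
    'corn':       (2, 7),
    'rice':       (6, 10),
    'barley':     (10, 3),
    'soybean':    (6, 10),
    'cotton':     (4, 10),
    'sugarcane':  (2, 11),
    'tomato':     (8, 11),
    'potato':     (10, 1),
    'sunflower':  (1, 5)
}

# Group crops by their (sowing_month, harvesting_month) pair
by_months = defaultdict(list)
for crop, months in crop_months.items():
    by_months[months].append(crop)

# Crops whose month pair is shared with no other crop
unique_crops = sorted(c[0] for c in by_months.values() if len(c) == 1)
# Month pairs that map to more than one crop
shared_pairs = {m: sorted(c) for m, c in by_months.items() if len(c) > 1}

print(f"{len(unique_crops)} of {len(crop_months)} crops are identified by their months alone")
print("Shared month pairs:", shared_pairs)
```

If the classifier is meant to recommend a crop before the crop is known, one option would be to retrain it without these two columns and compare the scores. The regressor is a different case, since the crop may be a legitimate input when predicting yield.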


