## Objective  
Identify the single most predictive soil feature for determining the optimal crop.  

Using the provided `soil_measures.csv` dataset, which contains nitrogen (N), phosphorus (P), potassium (K), and pH (`ph`) measurements plus the target `crop`, this notebook trains one model per feature and identifies the feature that, on its own, best predicts `crop`.  

The result will be stored in a dictionary `best_predictive_feature` where:  
- **Key** = best feature name  
- **Value** = that feature's weighted F1-score on the held-out test set.  

This helps farmers prioritize which soil metric to measure when resources are limited, enabling data-driven decisions to maximize crop yield.


In [20]:
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score
from sklearn.preprocessing import StandardScaler



In [18]:
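crops = pd.read_csv("soil_measures.csv")
crops.isna().sum()   # missing values per column (not displayed: only the last expression is shown)
crops.crop.unique()  # distinct crop labels (not displayed)
crops.shape          # (rows, columns)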

(2200, 5)

In [11]:
X, y = crops.drop(columns="crop"), crops["crop"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
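
The split above is purely random. One optional variation is to pass `stratify=y` so that each crop keeps the same share of rows in the train and test sets. The sketch below illustrates this on a small made-up frame; `toy` and its three crop names are hypothetical stand-ins, not rows from `soil_measures.csv`.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in data: 3 made-up crops with 20 rows each.
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "N": rng.normal(50, 10, 60),
    "ph": rng.normal(6.5, 0.5, 60),
    "crop": np.repeat(["rice", "maize", "coffee"], 20),
})
X_toy, y_toy = toy.drop(columns="crop"), toy["crop"]

# stratify=y_toy preserves the per-class proportions in both splits.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=42, stratify=y_toy
)

test_counts = y_te.value_counts().to_dict()
print(test_counts)
```

With stratification, every class contributes the same number of rows to the test set here, which keeps a weighted F1-score from being skewed by an unlucky split.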


In [26]:
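# Fit one logistic regression per feature and record its weighted F1 on the test set.
features_performance = {}
for feature in ["N", "P", "K", "ph"]:
    # Fit the scaler on the training split only, then reuse it on the test split.
    scaler = StandardScaler()
    X_train_scaled = scaler.fit_transform(X_train[[feature]])
    X_test_scaled = scaler.transform(X_test[[feature]])

    log_reg = LogisticRegression()
    log_reg.fit(X_train_scaled, y_train)

    y_pred = log_reg.predict(X_test_scaled)

    # "weighted" averages per-class F1 by the number of true samples in each class.
    f1 = f1_score(y_test, y_pred, average="weighted")
    features_performance[feature] = f1
    print(f"F1-score for {feature}: {f1:.4f}")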

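# Pick the top-scoring feature from the results instead of hard-coding it.
best_feature = max(features_performance, key=features_performance.get)
best_predictive_feature = {best_feature: features_performance[best_feature]}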
best_predictive_feature

F1-score for N: 0.1008
F1-score for P: 0.0940
F1-score for K: 0.1356
F1-score for ph: 0.0675


{'K': 0.1356131859628798}
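
The scores above come from a single 70/30 split, so the ranking could shift with a different `random_state`. As a rough check, the same per-feature comparison can be run with cross-validation. The sketch below is one possible way to do that, not the method used in this notebook: `rank_single_features` is a hypothetical helper, `max_iter=1000` is an assumed setting, and the `signal`/`noise` columns are made-up toy data rather than soil measurements.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler


def rank_single_features(df, features, target="crop", cv=5):
    """Return {feature: mean weighted F1 across CV folds}, best feature first."""
    scores = {}
    for feature in features:
        # The pipeline refits the scaler inside each fold, so no test data leaks into it.
        model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
        fold_scores = cross_val_score(
            model, df[[feature]], df[target], cv=cv, scoring="f1_weighted"
        )
        scores[feature] = fold_scores.mean()
    return dict(sorted(scores.items(), key=lambda kv: kv[1], reverse=True))


# Hypothetical toy data: "signal" separates the classes, "noise" does not.
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "signal": np.repeat([0.0, 5.0, 10.0], 40) + rng.normal(0, 0.5, 120),
    "noise": rng.normal(0, 1, 120),
    "crop": np.repeat(["a", "b", "c"], 40),
})

ranking = rank_single_features(toy, ["signal", "noise"])
best_feature = next(iter(ranking))
print(ranking)
```

On the real data, the call would look like `rank_single_features(crops, ["N", "P", "K", "ph"])`, and the first key of the returned dictionary would be the candidate for `best_predictive_feature`.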