`How Machine Learning Helps Farmers Select the Best Crops for Their Soil`

`October 2025`

This project explores how machine learning can assist farmers in selecting the most suitable crops based on soil nutrient levels. By analyzing a dataset containing soil properties and crop types, we aim to identify which soil feature is the best predictor for crop selection.

`Any questions, please reach out!`

Chiawei Wang, PhD\
Data & Product Analyst\
<chiawei.w@outlook.com>

`*` Note that the table of contents and other links may not work directly on GitHub.

[Table of Contents](#table-of-contents)
1. [Executive Summary](#executive-summary)
   - [Challenge](#challenge)
   - [Research Questions](#research-questions)
   - [Data Overview](#data-overview)
   - [Approach](#approach)
   - [Results](#results)
   - [Conclusion](#conclusion)
2. [Exploratory Data Analysis](#exploratory-data-analysis)

# Executive Summary

## Challenge

Farmers need to decide which crop to plant each season to maximize their yield. Soil condition is a crucial factor affecting crop growth, but measuring every soil metric can be expensive. The goal is to use machine learning to identify the single soil feature that best predicts the crop type, so that a farmer on a limited budget knows which measurement to prioritize.

## Research Questions

1. Which single soil feature in the dataset best predicts the crop type?
2. What evaluation score does that feature achieve?

## Data Overview

| Index | Column | Type    | Description                                   |
| ----- | ------ | ------- | --------------------------------------------- |
| 0     | `N`    | int64   | Nitrogen ratio in the soil                    |
| 1     | `P`    | int64   | Phosphorus ratio in the soil                  |
| 2     | `K`    | int64   | Potassium ratio in the soil                   |
| 3     | `ph`   | float64 | pH value of the soil                          |
| 4     | `crop` | object  | Categorical values that contain various crops |

## Approach

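1. Read the data into a pandas DataFrame and perform exploratory data analysis
2. Split the data into training (80%) and test (20%) sets
3. Evaluate each feature by fitting a single-feature logistic regression and computing its weighted F1-score on the test set
4. Create the `best_predictive_feature` variable, holding the top-scoring feature and its score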

## Results

The weighted F1-scores on the test set, one single-feature model per soil property, are as follows:

- F1-score for N: 0.09
- F1-score for P: 0.15
- F1-score for K: 0.24
- F1-score for ph: 0.05
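
As a minimal sketch of the last step in the approach, the best feature can be read straight off these scores. The scores below are copied from the list above; the name `best_feature` and the one-entry dictionary layout of `best_predictive_feature` are assumptions made for illustration, not necessarily the form used in the original notebook.

```python
# F1-scores copied from the results reported above
feature_performance = {"N": 0.09, "P": 0.15, "K": 0.24, "ph": 0.05}

# Pick the feature whose score is largest
best_feature = max(feature_performance, key=feature_performance.get)

# Assumed layout: a one-entry dict mapping the feature to its score
best_predictive_feature = {best_feature: feature_performance[best_feature]}

print(best_predictive_feature)  # {'K': 0.24}
```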

## Conclusion

Potassium (`K`) was the best single predictor of crop type, achieving the highest F1-score of 0.24. If a farmer can afford only one soil measurement, potassium is the one to prioritize when choosing a crop. However, all four F1-scores are low, which suggests that a single soil feature is not sufficient for accurate crop prediction and that a more comprehensive approach combining multiple soil properties is needed.
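
To illustrate why combining features can help, here is a small sketch on synthetic data. It does not use the soil dataset: the four classes, their centers, and the helper `weighted_f1` are all made up for the example, and the classes are deliberately constructed so that neither feature alone can tell them all apart.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (NOT the soil dataset): four classes whose
# centers can only be told apart when both features are used together
rng = np.random.default_rng(0)
centers = np.array([[0, 0], [0, 5], [5, 0], [5, 5]])
y = np.repeat(np.arange(4), 200)
X = centers[y] + rng.normal(scale=1.0, size=(len(y), 2))

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

def weighted_f1(columns):
    """Fit on the given column indices and return the weighted F1 on the test set."""
    model = LogisticRegression(max_iter=1000)
    model.fit(X_train[:, columns], y_train)
    return f1_score(y_test, model.predict(X_test[:, columns]), average="weighted")

scores = {
    "feature 0 only": weighted_f1([0]),
    "feature 1 only": weighted_f1([1]),
    "both features": weighted_f1([0, 1]),
}
for name, score in scores.items():
    print(f"{name}: {score:.2f}")
```

On this toy data the two-feature model should score far higher than either single-feature model. Whether the same holds for the soil data would need to be checked by fitting on `N`, `P`, `K`, and `ph` together.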

# Exploratory Data Analysis

In [1]:
# Import necessary libraries
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn import metrics
import warnings

In [2]:
# Read in the CSV as a DataFrame
df = pd.read_csv('soil.csv')

# Preview the data
print(df.shape)
df.head()

(2200, 5)


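|     | N   | P   | K   | ph       | crop |
| --- | --- | --- | --- | -------- | ---- |
| 0   | 90  | 42  | 43  | 6.502985 | rice |
| 1   | 85  | 58  | 41  | 7.038096 | rice |
| 2   | 60  | 55  | 44  | 7.840207 | rice |
| 3   | 74  | 35  | 40  | 6.980401 | rice |
| 4   | 78  | 42  | 42  | 7.628473 | rice |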


In [3]:
# Suppress all warnings, including LogisticRegression convergence warnings
warnings.filterwarnings('ignore')

# Check for missing values
df.isna().sum()

# Check how many crops we have, i.e. multi-class target
df.crop.unique()

# Split into feature and target sets
X = df.drop(columns = 'crop')
y = df['crop']

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 42)

# Create a dictionary to store the model performance for each feature
feature_performance = {}

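# Train a logistic regression model for each feature
# Note: `multi_class` is deprecated in recent scikit-learn releases, where
# multinomial is already the default behaviour for multi-class targets
for feature in ['N', 'P', 'K', 'ph']:
    log_reg = LogisticRegression(multi_class = 'multinomial')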
    log_reg.fit(X_train[[feature]], y_train)
    y_pred = log_reg.predict(X_test[[feature]])
    
    # Calculate the weighted F1 score: per-class F1 (harmonic mean of
    # precision and recall) averaged by the number of true samples per class
    # Could also use balanced_accuracy_score
    f1 = metrics.f1_score(y_test, y_pred, average = 'weighted')

    # Add feature-f1 score pairs to the dictionary
    feature_performance[feature] = f1
    print(f'F1-score for {feature}: {f1:.2f}')

F1-score for N: 0.09
F1-score for P: 0.15
F1-score for K: 0.24
F1-score for ph: 0.05
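
To make the `average = 'weighted'` setting more concrete, the sketch below computes a weighted F1-score by hand in plain Python. The labels are a made-up toy example rather than predictions from the models above, and the function `weighted_f1` is a hypothetical helper, not part of scikit-learn.

```python
from collections import Counter

def weighted_f1(y_true, y_pred):
    """Weighted F1: per-class F1 averaged by each class's share of y_true."""
    support = Counter(y_true)
    total = 0.0
    for label, n_true in support.items():
        # True positives and false positives for this class
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = n_true - tp
        # F1 = 2*TP / (2*TP + FP + FN), the harmonic mean of precision and recall
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        total += n_true * f1
    return total / len(y_true)

# Toy labels, made up for illustration
y_true = ["rice", "rice", "rice", "maize", "maize", "lentil"]
y_pred = ["rice", "rice", "maize", "maize", "lentil", "lentil"]

print(f"{weighted_f1(y_true, y_pred):.3f}")  # 0.678
```

Here the per-class F1 values are 0.8 for rice, 0.5 for maize, and about 0.667 for lentil. Weighting them by their supports of 3, 2, and 1 gives roughly 0.678.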
