Measuring essential soil metrics such as nitrogen, phosphorus, and potassium levels and pH is an important part of assessing soil condition. However, it can be expensive and time-consuming, so farmers often have to prioritize which metrics to measure within their budget.

Farmers have various options when it comes to deciding which crop to plant each season. Their primary objective is to maximize the yield of their crops, taking into account different factors. One crucial factor that affects crop growth is the condition of the soil in the field, which can be assessed by measuring basic elements such as nitrogen and potassium levels. Each crop has an ideal soil condition that ensures optimal growth and maximum yield.

A farmer has reached out to you, as a machine learning expert, for help selecting the best crop for their field. They have provided a dataset called soil_measures.csv, which contains:

"N": Nitrogen content ratio in the soil
"P": Phosphorus content ratio in the soil
"K": Potassium content ratio in the soil
"ph": pH value of the soil
"crop": categorical values that contain various crops (target variable).

Each row in this dataset represents various measures of the soil in a particular field. Based on these measurements, the crop specified in the "crop" column is the optimal choice for that field.

In this project, you will build multi-class classification models to predict the type of "crop" and identify the single most important feature for predictive performance.

Identify the single feature that has the strongest predictive performance for classifying crop types.

Find the feature in the dataset that produces the best score for predicting "crop".
From this information, create a variable called best_predictive_feature, which:
Should be a dictionary containing the best predictive feature name as a key and the evaluation score (for the metric you chose) as the value.


In [1]:
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn import metrics
from sklearn.metrics import accuracy_score
from sklearn.metrics import f1_score

In [2]:
# Load the dataset
crop_df = pd.read_csv("/kaggle/input/soil-measure/soil_measures.csv")

**Check for missing values**

You can chain the pandas DataFrame methods isna().sum() to count the number of null values in each column, helping you decide whether you need to drop or impute missing values.

In [3]:
# Check for missing values
print(crop_df)

missing_values = crop_df.isna().sum()
print(missing_values)

        N   P   K        ph    crop
0      90  42  43  6.502985    rice
1      85  58  41  7.038096    rice
2      60  55  44  7.840207    rice
3      74  35  40  6.980401    rice
4      78  42  42  7.628473    rice
...   ...  ..  ..       ...     ...
2195  107  34  32  6.780064  coffee
2196   99  15  27  6.086922  coffee
2197  118  33  30  6.362608  coffee
2198  117  32  34  6.758793  coffee
2199  104  18  30  6.779833  coffee

[2200 rows x 5 columns]
N       0
P       0
K       0
ph      0
crop    0
dtype: int64


**Check for crop types**

To confirm whether "crop" is a binary or multi-class target, you can use the pandas Series .unique() method to display all unique values in that column.

In [4]:
# Check for crop types
crop_unique = crop_df['crop'].unique()
print(crop_unique)

['rice' 'maize' 'chickpea' 'kidneybeans' 'pigeonpeas' 'mothbeans'
 'mungbean' 'blackgram' 'lentil' 'pomegranate' 'banana' 'mango' 'grapes'
 'watermelon' 'muskmelon' 'apple' 'orange' 'papaya' 'coconut' 'cotton'
 'jute' 'coffee']
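
Beyond listing the labels, it can help to count them. The sketch below is an illustration on a hypothetical miniature Series (not the real "crop" column): it uses nunique() to count distinct labels and value_counts() to see how many rows each label has, which is useful background when choosing between accuracy and a weighted F1 score.

```python
import pandas as pd

# Hypothetical miniature stand-in for the "crop" column; not the real data.
crops = pd.Series(
    ["rice", "rice", "maize", "maize", "coffee", "coffee"], name="crop"
)

n_classes = crops.nunique()          # number of distinct crop labels
class_counts = crops.value_counts()  # number of rows per label

print(f"{n_classes} distinct classes")
print(class_counts)

# A target with more than two distinct labels is a multi-class target.
is_multiclass = n_classes > 2
```

On the real data, the same two calls would be applied to crop_df['crop'].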


**Split the data**

Create training and test sets using all features.

Features and target variables
Create a variable containing the features, all columns except "crop", and another variable containing only the "crop".



In [5]:
# Features and target variables
X = crop_df.drop("crop", axis=1).values
y = crop_df["crop"].values

Use train_test_split()

In [6]:
# Use train_test_split()
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
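
As an optional variation (the notebook itself does not do this), train_test_split() accepts a stratify argument that keeps class proportions the same in the training and test sets. The sketch below uses hypothetical toy arrays, X_demo and y_demo, to show the effect.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 3 classes with 10 rows each (not the real dataset).
X_demo = np.arange(60).reshape(30, 2)
y_demo = np.repeat(["rice", "maize", "coffee"], 10)

# stratify=y_demo asks for the same class proportions in both splits.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# Count how many test rows each class received.
labels, counts = np.unique(y_te, return_counts=True)
test_counts = {str(label): int(count) for label, count in zip(labels, counts)}
print(test_counts)
```

With many classes and a small test set, stratifying avoids test folds in which some crops are barely represented.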


**Evaluate feature performance**

Predict the crop using each feature individually. You should build a model for each feature. That means you will build four models.

In [7]:
# Create lists to store each feature's predictive performance
# (scores are appended in the same order as `features`)
acc = []
f1 = []
features = ["N", "P", "K", "ph"]

**Loop through the features**

You can train and evaluate the performance of each feature by looping through them using the syntax for feature in ["N", "P", "K", "ph"]:

**Training a multi-class classifier algorithm**

Inside of the for loop iterating over a list of features, you can call LogisticRegression() to create your model, assigning to the variable log_reg.
You should set the multi_class argument to "multinomial" so that multi-class prediction is supported.
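Fit the model to the feature in X_train by subsetting it using double square brackets, e.g., log_reg.fit(X_train[[feature]], y_train). This works when X_train is a pandas DataFrame; if the features were converted to a NumPy array with .values, select the column by position instead, e.g., X_train[:, [i]].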
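
The reason for the double brackets is shape: scikit-learn estimators expect a 2-D feature matrix, even for a single feature. A minimal sketch on a hypothetical three-row frame (X_demo is illustrative, not the real data) compares the options:

```python
import pandas as pd

# Hypothetical three-row frame with the same column names as the dataset.
X_demo = pd.DataFrame(
    {"N": [90, 85, 60], "P": [42, 58, 55], "K": [43, 41, 44], "ph": [6.5, 7.0, 7.8]}
)

single = X_demo["K"]    # single brackets: 1-D Series, shape (3,)
double = X_demo[["K"]]  # double brackets: 2-D DataFrame, shape (3, 1)

# NumPy equivalent: index with a list of column positions to keep 2 dimensions.
X_array = X_demo.values
col = X_demo.columns.get_loc("K")
as_column = X_array[:, [col]]  # shape (3, 1)
flat = X_array[:, col]         # shape (3,), would need reshape(-1, 1) before fit()

print(single.shape, double.shape, as_column.shape, flat.shape)
```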

**Predicting target values using the test set**

You can use the model's .predict() method, subsetting the feature from X_test, to predict target values.
Convention is to store the results as a variable called y_pred.



**Evaluating the performance of each feature**

You can calculate F1 score, which is the harmonic mean of precision and recall, to evaluate feature performance.

Alternatively, you can use metrics.balanced_accuracy_score().
Scikit-learn's metrics.f1_score() function takes the target values, y_test, and the predicted values, y_pred, in order to calculate the F1 score.
Set the f1_score()'s keyword argument average equal to "weighted" when calculating performance for each feature.

Assign the result of f1_score() to a variable called feature_importance.
If you created an empty dictionary called feature_performance outside of the for loop where you build your models, you can add the feature-score key-value pairs to the dictionary using the syntax feature_performance[feature] = feature_importance.
You can use a print() statement with an f-string to output the feature and its score, for example, print(f"F1-score for {feature}: {feature_importance}")
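
To make the "weighted" average concrete, here is a small worked sketch on hypothetical labels (y_true and y_hat are made up for illustration). It computes the per-class F1 scores, weights them by each class's number of true samples, and checks the result against f1_score(average="weighted"); balanced_accuracy_score() is shown alongside for comparison. zero_division=0 is passed because one class is never predicted.

```python
import numpy as np
from sklearn.metrics import balanced_accuracy_score, f1_score

# Hypothetical true and predicted labels, for illustration only.
y_true = ["rice", "rice", "rice", "maize", "maize", "coffee"]
y_hat = ["rice", "rice", "maize", "maize", "maize", "rice"]
labels = ["rice", "maize", "coffee"]

# Per-class F1: rice = 2/3, maize = 0.8, coffee = 0 (never predicted).
per_class = f1_score(y_true, y_hat, average=None, labels=labels, zero_division=0)

# Support = number of true samples per class: 3, 2, 1.
support = np.array([y_true.count(label) for label in labels])

# Weighted F1 = sum(F1_c * support_c) / total samples = (2 + 1.6 + 0) / 6 = 0.6
manual_weighted = float(np.sum(per_class * support) / support.sum())
weighted = float(f1_score(y_true, y_hat, average="weighted", zero_division=0))

# Balanced accuracy = mean of per-class recall = (2/3 + 1 + 0) / 3
balanced = float(balanced_accuracy_score(y_true, y_hat))

print(manual_weighted, weighted, balanced)
```

The weighted average lets frequent classes dominate the score, whereas balanced accuracy gives every class equal weight.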

In [8]:
# Loop through the features; idx is the feature's column position in X_train
for idx, feature in enumerate(features):
    # Training a multi-class classifier algorithm
    # max_iter is raised to help convergence; warnings can still appear on unscaled data
    log_reg = LogisticRegression(multi_class='multinomial', max_iter=1000)
    log_reg.fit(X_train[:, [idx]], y_train)

    # Predicting target values using the test set
    y_pred = log_reg.predict(X_test[:, [idx]])

    # Evaluating the performance of each feature
    feature_accuracy = accuracy_score(y_test, y_pred)
    acc.append(feature_accuracy)
    print(f"Feature: {feature}, Accuracy: {feature_accuracy}")

    # Using f1_score() with weighted averaging
    feature_performance = f1_score(y_test, y_pred, average='weighted')
    f1.append(feature_performance)
    print(f"F1-score for {feature}: {feature_performance}")

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


Feature: N, Accuracy: 0.1590909090909091
F1-score for N: 0.11469020273959639




Feature: P, Accuracy: 0.18863636363636363
F1-score for P: 0.12613828555719708
Feature: K, Accuracy: 0.2727272727272727
F1-score for K: 0.23123431237441186
Feature: ph, Accuracy: 0.09772727272727273
F1-score for ph: 0.04532731061152114




**Saving the information**

Create a variable called best_predictive_feature.
It should contain a single key-value pair.
The key should be a string representing the name of the feature that produced the best model performance.
The value should be the model's evaluation metric score.
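
As a minimal sketch of that structure, the snippet below takes the F1 scores printed in the output above (rounded to four decimals here for readability) and builds a single-pair dictionary. The names scores and best_predictive_feature_demo are illustrative only.

```python
# F1 scores from the cell output above, rounded to four decimals.
scores = {"N": 0.1147, "P": 0.1261, "K": 0.2312, "ph": 0.0453}

# max() with key=scores.get returns the key whose value is largest.
best_name = max(scores, key=scores.get)

# Single key-value pair: feature name -> evaluation score.
best_predictive_feature_demo = {best_name: scores[best_name]}
print(best_predictive_feature_demo)  # {'K': 0.2312}
```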

In [9]:
print("accuracy:", acc)
print("f1_score:", f1)

accuracy: [0.1590909090909091, 0.18863636363636363, 0.2727272727272727, 0.09772727272727273]
f1_score: [0.11469020273959639, 0.12613828555719708, 0.23123431237441186, 0.04532731061152114]


In [10]:
# Select the best feature by accuracy; acc follows the order of `features`
best_predictive_feature = {}  # Initialize the variable

if acc:
    max_accuracy = max(acc)
    best_feature = features[acc.index(max_accuracy)]  # Use the index to find the corresponding feature
    # Single key-value pair: feature name -> evaluation score
    best_predictive_feature = {best_feature: max_accuracy}

# Print the best predictive feature and its score
for name, score in best_predictive_feature.items():
    print("Best Predictive Feature:", name)
    print("Score:", score)


Best Predictive Feature: K
Score: 0.2727272727272727


In [11]:
# Print F1 scores
print("F1 scores:", f1)

# Find the index of the best F1 score
best_f1_index = f1.index(max(f1))

# Select the feature corresponding to the best F1 score
best_feature = features[best_f1_index]

# Print the best feature
print("Best Predictive Feature:", best_feature)


F1 scores: [0.11469020273959639, 0.12613828555719708, 0.23123431237441186, 0.04532731061152114]
Best Predictive Feature: K


In [12]:
# All required libraries are imported here for you.
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score

# Load the dataset
crop_df = pd.read_csv("/kaggle/input/soil-measure/soil_measures.csv")

# Features and target variables
X = crop_df.drop("crop", axis=1)
y = crop_df["crop"]

# Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create a dictionary to store each feature's performance
feature_performance = {}

# Loop through the features
for feature in X_train.columns:
    # Training a multi-class classifier algorithm
    log_reg = LogisticRegression(multi_class='multinomial', max_iter=1000)
    log_reg.fit(X_train[[feature]], y_train)
    
    # Predicting target values using the test set
    y_pred = log_reg.predict(X_test[[feature]])
    
    # Using the F1 score for evaluation
    feature_performance[feature] = f1_score(y_test, y_pred, average='weighted')
    
    # Print the feature and its performance
    print(f"F1-score for {feature}: {feature_performance[feature]}")

# Find the feature with the best performance
best_feature = max(feature_performance, key=feature_performance.get)

# Create the best_predictive_feature variable
best_predictive_feature = {best_feature: feature_performance[best_feature]}

# Print the best predictive feature and its score
print("Best Predictive Feature:", best_predictive_feature)


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


F1-score for N: 0.11469020273959639




F1-score for P: 0.12613828555719708




F1-score for K: 0.23123431237441186
F1-score for ph: 0.04532731061152114
Best Predictive Feature: {'K': 0.23123431237441186}
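
The convergence warnings in the output suggest scaling the data. One way to act on that, sketched here on hypothetical synthetic data (X_demo and y_demo are made up, and the scores will not match the notebook's), is to put a StandardScaler in front of the classifier with make_pipeline(). The multi_class argument is deliberately left out in this sketch, since the default lbfgs solver already handles multi-class targets and recent scikit-learn releases deprecate that argument.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Hypothetical single-feature data: three classes centred at 20, 60 and 100.
X_demo = np.concatenate(
    [rng.normal(loc, 5, size=50) for loc in (20, 60, 100)]
).reshape(-1, 1)
y_demo = np.repeat(["a", "b", "c"], 50)

# The scaler is fitted on the training data only, then applied before the model.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(X_demo, y_demo)

train_accuracy = model.score(X_demo, y_demo)
print(f"Training accuracy on the toy data: {train_accuracy:.3f}")
```

In the loop above, the same pipeline could stand in for log_reg; whether it silences the warnings or changes the ranking of features on the real data would need to be checked by rerunning the notebook.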
