# Crop Recommendation Decision Engine

Author: Darrell Leong

Contact: darrell.leong@yara.com

This engine aims to automate crop selection. The dataset lists the ideal crop to plant for a given set of macro-environmental conditions:

- Soil N-P-K ratios
- Soil pH
- Expected temperature
- Expected humidity
- Expected annual rainfall

A model is trained on these historical recommendations so that it can recommend a crop automatically from the variables above.

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here are several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

## Data Exploration

In [None]:
df = pd.read_csv("../input/crop-recommendation-dataset/Crop_recommendation.csv")
df.tail()

In [None]:
ax = df['label'].value_counts().plot(kind='bar')
ax.set_ylabel("Counts")

Labels are evenly distributed, with no observable class imbalance.
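
The visual check can be backed by a single number. A minimal sketch, assuming a hypothetical helper `imbalance_ratio` (not part of the notebook) that divides the largest class count by the smallest; a value close to 1 suggests balanced classes. The labels below are made up for illustration.

```python
from collections import Counter

def imbalance_ratio(labels):
    """Return the largest class count divided by the smallest class count."""
    counts = Counter(labels)
    return max(counts.values()) / min(counts.values())

# Made-up labels for illustration, not taken from the dataset
balanced = ["rice"] * 100 + ["maize"] * 100 + ["coffee"] * 100
skewed = ["rice"] * 150 + ["maize"] * 50

print(imbalance_ratio(balanced))  # 1.0
print(imbalance_ratio(skewed))    # 3.0
```

On the notebook's data this could be called as `imbalance_ratio(df['label'])`.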

In [None]:
# Restrict to numeric columns; the string 'label' column cannot be correlated
df.corr(method="pearson", numeric_only=True)

There is a significant correlation between P and K, which may cause multicollinearity issues.
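
Pairwise correlation is only a first look. One common way to quantify multicollinearity is the variance inflation factor (VIF), which can be read off the diagonal of the inverse correlation matrix. A minimal sketch on synthetic data (the arrays and the helper name `vif` are assumptions for illustration, not part of the notebook):

```python
import numpy as np

def vif(X):
    """Variance inflation factor per column of X (rows are samples)."""
    corr = np.corrcoef(X, rowvar=False)
    return np.diag(np.linalg.inv(corr))

rng = np.random.default_rng(0)
a = rng.normal(size=500)
b = a + 0.1 * rng.normal(size=500)   # nearly a copy of a
c = rng.normal(size=500)             # independent of a and b

factors = vif(np.column_stack([a, b, c]))
print(factors)
```

A common rule of thumb treats values above roughly 5 to 10 as a sign of problematic collinearity; here the first two columns should land far above that range and the third close to 1.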

In [None]:
ax1 = df.plot.scatter(x='P', y='K')

In [None]:
X, y = df[['N', 'P', 'K', 'temperature', 'humidity', 'ph', 'rainfall']].values, df['label'].values

In [None]:
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
lda = LinearDiscriminantAnalysis()
X_lda = lda.fit_transform(X, y)

lda.explained_variance_ratio_

In [None]:
import matplotlib.pyplot as plt 

fig = plt.figure()
ax = fig.add_axes([0,0,1,1])
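# Note: lda.coef_[0] holds the weights for the first class label only,
# not the first discriminant axis
ax.bar(['N', 'P', 'K', 'temperature', 'humidity', 'ph', 'rainfall'], lda.coef_[0])
ax.set_ylabel("LDA coefficient (first class)")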

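Crop recommendation appears to be most sensitive to the K input. However, the features are not standardized at this point, so coefficient magnitudes are not directly comparable, and the effect could be attributed to the high variance of K in the dataset.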
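
Raw LDA coefficients depend on the units of each feature, which makes them hard to compare across features. The sketch below uses synthetic data (not the crop dataset; all names are illustrative assumptions) to show the effect: multiplying one feature by 100 divides its coefficient by 100.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(0)

# Synthetic 3-class data with 3 features (illustration only)
y_demo = np.repeat([0, 1, 2], 100)
X_demo = rng.normal(size=(300, 3)) + y_demo[:, None] * np.array([1.0, 0.5, 2.0])

# Same data, but the third feature is expressed in units 100 times smaller
scale = np.array([1.0, 1.0, 100.0])
X_rescaled = X_demo * scale

coef_raw = LinearDiscriminantAnalysis().fit(X_demo, y_demo).coef_
coef_rescaled = LinearDiscriminantAnalysis().fit(X_rescaled, y_demo).coef_

# Multiplying back by the scale recovers the original coefficients
print(coef_raw[0])
print(coef_rescaled[0] * scale)
```

Standardizing the features before fitting (for example with `StandardScaler`) would put the coefficients on a comparable footing.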

## Data Preparation

In [None]:
from sklearn.model_selection import train_test_split

# Split data 70%-30% into training set and test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=0)

print('Training Set: %d rows\nTest Set: %d rows' % (X_train.shape[0], X_test.shape[0]))
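
A plain random split does not guarantee that every crop label appears in the same proportion in both sets. `train_test_split` accepts a `stratify` argument for this. A minimal sketch on made-up data (the arrays are illustrative assumptions, not the crop dataset):

```python
import numpy as np
from collections import Counter
from sklearn.model_selection import train_test_split

# Made-up data: 3 classes with 100 samples each
X_demo = np.arange(300).reshape(-1, 1)
y_demo = np.repeat(["rice", "maize", "coffee"], 100)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.30, random_state=0, stratify=y_demo)

print(X_tr.shape[0], X_te.shape[0])  # number of rows in each split
print(Counter(y_te))                 # class counts in the test set
```

Note that `shape[0]` gives the row count, whereas `size` would count every element of the array.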

In [None]:
# Define the preprocessing steps
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder

# Define preprocessing for numeric columns (scale them)
numeric_features = [0,1,2,3,4,5,6]
numeric_transformer = Pipeline(steps=[
    ('scaler', StandardScaler())])

# Define preprocessing for categorical features (encode them)
# The input features here are all numeric, so this list is empty
categorical_features = []
categorical_transformer = Pipeline(steps=[
    ('onehot', OneHotEncoder(handle_unknown='ignore'))])

# Combine preprocessing steps
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features)])
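
For intuition, `StandardScaler` rescales each selected column to zero mean and unit variance, using statistics learned from the data passed to `fit`. A minimal sketch with a made-up three-column array (the values and the name `demo_preprocessor` are illustrative assumptions):

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler

# Made-up feature matrix with columns on very different scales
X_demo = np.array([[90.0, 6.5, 200.0],
                   [20.0, 7.0, 100.0],
                   [60.0, 5.5, 250.0],
                   [10.0, 6.0,  50.0]])

# Scale all three columns, addressed by position as in the notebook
demo_preprocessor = ColumnTransformer(
    transformers=[('num', StandardScaler(), [0, 1, 2])])

X_scaled = demo_preprocessor.fit_transform(X_demo)
print(X_scaled.mean(axis=0))  # approximately 0 for every column
print(X_scaled.std(axis=0))   # approximately 1 for every column
```

Wrapping the scaler in a pipeline means the same training-set statistics are reused when the model later transforms test data.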

## Model Training

In [None]:
from sklearn.metrics import accuracy_score
from sklearn.ensemble import RandomForestClassifier

# Create preprocessing and training pipeline
# (fixed seed so that repeated runs give the same forest)
pipeline = Pipeline(steps=[('preprocessor', preprocessor),
                           ('classifier', RandomForestClassifier(random_state=0))])

# Fit the pipeline to train a classification model on the training set
model = pipeline.fit(X_train, y_train)
print(model, "\n")

# Get predictions
predictions = model.predict(X_test)

# Display overall accuracy on the test set
print("Accuracy:", accuracy_score(y_test, predictions))

## Model Performance

In [None]:
import matplotlib.pyplot as plt
from sklearn.metrics import ConfusionMatrixDisplay

# Requires scikit-learn >= 1.0
ConfusionMatrixDisplay.from_estimator(model, X_test, y_test, xticks_rotation='vertical')
plt.show()

In [None]:
# Classification Report
from sklearn import metrics

# Argument order is (y_true, y_pred)
print(metrics.classification_report(y_test, predictions))
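
The argument order of scikit-learn's metric functions matters: true labels come first, predictions second. A small sketch with made-up binary labels (illustrative only) shows that swapping the two makes precision and recall trade places:

```python
from sklearn.metrics import precision_score, recall_score

# Made-up binary labels for illustration
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 1, 1, 0, 0]

# Correct order: (y_true, y_pred)
precision = precision_score(y_true, y_pred)
recall = recall_score(y_true, y_pred)

# Swapped order: precision and recall trade places
precision_swapped = precision_score(y_pred, y_true)
recall_swapped = recall_score(y_pred, y_true)

print(precision, recall)
print(precision_swapped, recall_swapped)
```

The same applies per class in a multi-class `classification_report`.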

## Model Export

In [None]:
# Save Model
import pickle

# Refit on the full dataset before export; this refits `pipeline` in place,
# so `model` above now refers to the same refitted object
modelFinal = pipeline.fit(X, y)

# Save the model to disk
filename = 'finalized_model.sav'
with open(filename, 'wb') as f:
    pickle.dump(modelFinal, f)
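
To use an exported model, it has to be loaded back and given one row per field in the same feature order used for training. The sketch below is a round trip under stated assumptions: a stand-in pipeline trained on synthetic data, a temporary file instead of `finalized_model.sav`, and a made-up sample. None of these values come from the crop dataset.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Stand-in for the notebook's pipeline, trained on synthetic data
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(60, 7))            # 7 features, as in the notebook
y_demo = np.repeat(["rice", "maize", "coffee"], 20)
demo_model = Pipeline(steps=[
    ('scaler', StandardScaler()),
    ('classifier', RandomForestClassifier(random_state=0))]).fit(X_demo, y_demo)

# Save to a temporary file, then load it back
path = os.path.join(tempfile.mkdtemp(), 'demo_model.sav')
with open(path, 'wb') as f:
    pickle.dump(demo_model, f)
with open(path, 'rb') as f:
    loaded = pickle.load(f)

# One made-up sample in the order N, P, K, temperature, humidity, ph, rainfall
sample = np.array([[90, 42, 43, 20.9, 82.0, 6.5, 202.9]])
print(loaded.predict(sample))
```

Pickled models should only be loaded from trusted sources, and ideally with the same scikit-learn version that was used to save them.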