<img src="https://images.theconversation.com/files/304187/original/file-20191128-176618-zrwazf.jpg">
<h1><center>Wild Blueberry Yield Prediction Model</center></h1>
<p><center>Exploratory Data Analysis and Explainable AI</center></p>

# Introduction

Accurately predicting crop yield is one of the most challenging tasks in agriculture. This notebook builds a wild blueberry yield prediction model, combining Exploratory Data Analysis (EDA) with Explainable AI.

**Importing the required libraries**

In [None]:
!pip install dabl

In [None]:
!pip install shap

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
import dabl
import shap

from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import mean_squared_error, r2_score

import warnings
warnings.filterwarnings("ignore")

**Data Preprocessing**

In [None]:
dataset = pd.read_csv('../input/wild-blueberry-yield-prediction/Data in Brief/Data in Brief/WildBlueberryPollinationSimulationData.csv')

In [None]:
dataset.head(10)

In [None]:
dataset.drop('Row#', axis='columns', inplace=True)

In [None]:
dataset.head(10)

In [None]:
dataset.tail(10)

# Exploratory Data Analysis

In [None]:
dataset.info()

In [None]:
import plotly.express as px
fig = px.histogram(dataset, x="yield")
fig.show()

In [None]:
dabl.plot(dataset, target_col="yield")

**Removing Outliers**

In [None]:
sns.boxplot(x=dataset['bumbles'])

In [None]:
sns.boxplot(x=dataset['honeybee'])

In [None]:
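# First and third quartiles of every column
q1 = dataset.quantile(0.25)
q3 = dataset.quantile(0.75)
iqr = q3 - q1
print(iqr)

In [None]:
# Drop every row that has at least one value outside the 1.5 * IQR fences
dataset = dataset[~((dataset < (q1 - 1.5 * iqr)) | (dataset > (q3 + 1.5 * iqr))).any(axis=1)]
dataset.shape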
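
The IQR rule used above can be checked on a small toy frame. This is a minimal sketch: the helper name `iqr_filter`, the multiplier argument `k`, and the sample values are illustrative assumptions, not part of the original dataset.

```python
import pandas as pd


def iqr_filter(df, k=1.5):
    """Return the rows of df with no value outside the k * IQR fences (hypothetical helper)."""
    q1 = df.quantile(0.25)
    q3 = df.quantile(0.75)
    iqr = q3 - q1
    # True wherever a single cell falls below the lower or above the upper fence
    is_outlier = (df < (q1 - k * iqr)) | (df > (q3 + k * iqr))
    # A row is dropped as soon as any of its columns is flagged
    return df[~is_outlier.any(axis=1)]


# Toy data (assumption): one extreme honeybee density, no extreme bumbles value
demo = pd.DataFrame({
    "honeybee": [0.25, 0.25, 0.5, 0.5, 0.75, 18.43],
    "bumbles": [0.25, 0.25, 0.38, 0.38, 0.25, 0.25],
})

filtered = iqr_filter(demo)
print(demo.shape, "->", filtered.shape)
```

Because the mask is combined with `any(axis=1)`, one extreme value in any column removes the whole row, so this filter can shrink the dataset considerably when many columns are checked at once.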

**Correlation**

In [None]:
# Pairwise Pearson correlation between all columns
c = dataset.corr()


In [None]:
c

In [None]:
plt.figure(figsize=(15,12))
sns.heatmap(c, annot=True, cmap="YlGnBu")
plt.title('Correlation heatmap of the dataset variables', fontsize=15)
plt.show()
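
A heatmap with many variables can be hard to read, so it may help to rank the features by their correlation with the target. The sketch below uses a small made-up frame; the column names `x1`, `x2`, `x3` and their values are assumptions for illustration only.

```python
import pandas as pd

# Toy data (assumption): three features and a target that depends only on x1
demo = pd.DataFrame({
    "x1": [1, 2, 3, 4, 5],
    "x2": [2, 1, 4, 3, 5],
    "x3": [3, 5, 1, 4, 2],
})
demo["yield"] = 2 * demo["x1"]

# Correlation of every feature with the target, excluding the target itself
corr_with_target = demo.corr()["yield"].drop("yield")

# Order by absolute strength while keeping the sign of each correlation
order = corr_with_target.abs().sort_values(ascending=False).index
ranked = corr_with_target.reindex(order)
print(ranked)
```

On the notebook's data the same two steps would be applied to `c["yield"]`.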

# Multiple Linear Regression

**Modelling**

In [None]:
X = dataset[['clonesize','honeybee','bumbles', 'andrena', 'osmia', 'MaxOfUpperTRange', 'MinOfUpperTRange', 'AverageOfUpperTRange', 'MaxOfLowerTRange', 'MinOfLowerTRange', 'AverageOfLowerTRange', 'RainingDays', 'AverageRainingDays', 'fruitset', 'fruitmass', 'seeds']]
X

In [None]:
y = dataset['yield']
y

In [None]:
print(X.shape)

In [None]:
print(y.shape)

In [None]:
X_train, X_val, Y_train, Y_val = train_test_split(X, y, test_size=0.2, random_state=0)

In [None]:
print(X_train.shape)

In [None]:
print(X_val.shape)

In [None]:
from sklearn.linear_model import LinearRegression
reg = LinearRegression()

**Model Training**

In [None]:
reg.fit(X_train, Y_train)

**Prediction**

In [None]:
Y_pred = reg.predict(X_val)

In [None]:
rmse = np.sqrt(np.mean((Y_val - Y_pred)**2))
rmse

In [None]:
from sklearn.metrics import r2_score
r2_score(Y_val, Y_pred)
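
scikit-learn's `r2_score` expects the true values first and the predictions second, and the score is not symmetric in its arguments. The sketch below reimplements the standard R² formula in NumPy to show the effect; the function name `r2` and the sample numbers are assumptions for illustration.

```python
import numpy as np


def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot (hypothetical helper)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    # The total sum of squares is taken around the mean of the first argument
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot


y_true = [1, 2, 3, 4]
y_pred = [2, 2, 3, 3]

print(r2(y_true, y_pred))  # correct order
print(r2(y_pred, y_true))  # swapped order gives a different score
```

The residual term is identical in both calls, but the denominator uses the variance of whichever array is passed first, so swapping the arguments changes the result.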

In [None]:
# Predict the yield for a single validation sample (the row at position 8)
data_pred = X_val.iloc[8, :]
data_pred_array = data_pred.values.reshape(1, -1)

reg.predict(data_pred_array)

# Random Forest Regression

**Modelling**

In [None]:
from sklearn.ensemble import RandomForestRegressor

**Model Evaluation and Prediction**

In [None]:
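# Hyperparameter grid explored by the cross-validated search
params = {
    'n_estimators' : [25, 50, 75, 100, 150, 200],
    'max_depth' : [2, 4, 6, 8, 10]
}
# GridSearchCV uses 5-fold cross-validation by default and refits the best model
rfreg = GridSearchCV(RandomForestRegressor(random_state=0), params)
rfreg.fit(X_train, Y_train)

# Predict with the refit best estimator
Y_pred = rfreg.predict(X_val)

rmse = np.sqrt(np.mean((Y_val - Y_pred)**2))

print("RMSE : {:.2f}".format(rmse))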
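
After a grid search it is usually worth checking which combination was selected. The sketch below runs a much smaller grid on synthetic data; the data from `make_regression`, the reduced grid, and the names `X_demo`, `y_demo`, `demo_params`, and `search` are assumptions, not the notebook's actual setup.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the blueberry data (assumption, not the real dataset)
X_demo, y_demo = make_regression(n_samples=120, n_features=5, noise=5.0, random_state=0)

# Deliberately small grid so the example runs quickly
demo_params = {"n_estimators": [10, 25], "max_depth": [2, 4]}
search = GridSearchCV(RandomForestRegressor(random_state=0), demo_params, cv=3)
search.fit(X_demo, y_demo)

# Winning combination and its mean cross-validated score (R^2 for regressors)
print("Best parameters:", search.best_params_)
print("Best mean CV score: {:.3f}".format(search.best_score_))

# cv_results_ holds one entry per parameter combination that was tried
print("Combinations evaluated:", len(search.cv_results_["params"]))
```

In the notebook the same attributes are available on `rfreg` once it has been fitted.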

In [None]:
r2_score(Y_val, Y_pred)

# XGBoost

In [None]:
from xgboost import XGBRegressor
regressor = XGBRegressor()

In [None]:
regressor.fit(X_train, Y_train)

In [None]:
Y_pred = regressor.predict(X_val)

In [None]:
rmse = np.sqrt(np.mean((Y_val - Y_pred)**2))
rmse

In [None]:
r2_score(Y_val, Y_pred)

# Explainable AI

In [None]:
shap_values = shap.TreeExplainer(regressor).shap_values(X_val)

In [None]:
shap.summary_plot(shap_values, X_val, plot_type="bar")

In [None]:
shap.summary_plot(shap_values, X_val)

In [None]:
exp = shap.Explainer(reg, X_train)
shap_values = exp.shap_values(data_pred)

In [None]:
shap.initjs()
shap.force_plot(exp.expected_value, shap_values, data_pred)
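
For a linear model, the contributions shown in a force plot have a simple closed form if the features are treated as independent: each feature contributes its coefficient times its deviation from the feature mean. The sketch below checks this with plain NumPy on synthetic data. It does not call `shap`, and the independence assumption, the synthetic data, and all variable names are assumptions for illustration.

```python
import numpy as np

# Synthetic data (assumption): three features and an exactly linear target
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
y_demo = X_demo @ np.array([2.0, -1.0, 0.5]) + 4.0

# Ordinary least squares with an intercept column appended
A = np.column_stack([X_demo, np.ones(len(X_demo))])
coef, *_ = np.linalg.lstsq(A, y_demo, rcond=None)
w, b = coef[:-1], coef[-1]

x = X_demo[8]                       # the sample to explain
feature_means = X_demo.mean(axis=0)

base_value = feature_means @ w + b  # average model output over the data
phi = w * (x - feature_means)       # per-feature contribution
prediction = x @ w + b

print("Contributions:", phi)
# The contributions add up to the gap between the prediction and the base value
print(np.isclose(base_value + phi.sum(), prediction))
```

This additivity is what the force plot visualises: it starts at the base value and each feature pushes the output towards the final prediction.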


# References

* dabl library: https://www.kaggle.com/parulpandey/useful-python-libraries-for-data-science

* Explainable AI: <br>
https://en.wikipedia.org/wiki/Explainable_artificial_intelligence <br>
https://towardsdatascience.com/explainable-artificial-intelligence-part-2-model-interpretation-strategies-75d4afa6b739