# Decision Tree Regressor to determine Crop Yield.

## Problem Statement
Agriculture is the backbone of our country's economy, as it provides employment to a majority of our society.
However, with a changing climate, agricultural produce has become uncertain.
A machine learning model can be built to predict crop yield in advance.
Such a model is appealing, but one should remember that it predicts from previous outcomes,
so the more past outcomes it is trained on, the better its predictions are likely to be.

We use Decision Tree Regression to predict the crop yield (Y) for a given feature vector, X = [humidity, moisture,
rainfall, meanTemp, minTemp, maxTemp, soil quality: alkaline, chalky, sandy, clay], on the given millet dataset.


## Importing Libraries

In [1]:
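# Pandas is used for data manipulation
import pandas as pd
# NumPy provides the arrays that scikit-learn works with
import numpy as np
# train_test_split divides the data into training and test sets
from sklearn.model_selection import train_test_split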

## Data Collection

In [2]:
data = pd.read_csv('/Users/thepanshu/PycharmProjects/CropYield/milletdata.csv')

## Handling Missing Attributes

In [3]:
# Replace missing values in each column with that column's mean.
data.fillna(data.mean(), inplace=True)
print(data.head(10))
print('The shape of our features is:', data.shape)

    moisture  rainfall  humidity  meanTemp  maxTemp  minTemp  alkaline  sandy  \
0  12.801685  0.012360      57.0      62.0     71.0     52.0       0.0    1.0   
1  12.851654  0.004172      57.0      58.0     73.0     43.0       0.0    1.0   
2  12.776774  0.000000      56.0      58.0     69.0     46.0       0.0    0.0   
3  12.942001  0.031747      62.0      56.0     70.0     43.0       0.0    1.0   
4  12.984652  0.066629      65.0      56.0     70.0     42.0       0.0    0.0   
5  12.964471  0.027191      65.0      58.0     70.0     46.0       1.0    0.0   
6  12.737998  0.026821      61.0      56.0     70.0     42.0       0.0    0.0   
7  12.819382  0.010284      58.0      57.0     72.0     42.0       0.0    0.0   
8  12.883909  0.020465      63.0      60.0     76.0     45.0       0.0    0.0   
9  12.784513  0.060054      62.0      59.0     71.0     47.0       0.0    1.0   

   chalky  clay  label  
0       0   0.0      2  
1       0   0.0      0  
2       1   0.0      4  
3       
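Mean imputation is easier to see on a tiny frame. The following is a minimal sketch, assuming a hypothetical two-column toy table (not the millet data) with one gap per column; it mirrors the `fillna(data.mean())` call above but assigns the result instead of modifying the frame in place.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with two gaps; the column names are borrowed from the dataset.
toy = pd.DataFrame({
    "humidity": [57.0, np.nan, 56.0, 62.0],
    "rainfall": [0.01, 0.02, np.nan, 0.03],
})

# Count the missing values per column before filling.
print(toy.isna().sum())

# Each NaN is replaced by the mean of the non-missing values in its own column.
filled = toy.fillna(toy.mean())
print(filled)
```

Checking `isna().sum()` first is a cheap way to see how much of each column would be imputed; a column that is mostly missing may not be worth filling this way.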

In [4]:
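# One-hot encode the label column; iloc[:, 1:] drops the first indicator column
label = pd.get_dummies(data.label).iloc[:, 1:]
data = pd.concat([data, label], axis=1)
data.drop('label', axis=1, inplace=True)
# The first 10 columns are the features; the remaining columns are the encoded targets
train = data.iloc[:, 0:10].values
test = data.iloc[:, 10:].values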
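The encoding step above can be hard to picture from the slices alone. This is a small sketch on a hypothetical four-row frame (the values are made up for illustration) showing what `pd.get_dummies(...).iloc[:, 1:]` produces and how the feature and target arrays are separated.

```python
import pandas as pd

# Hypothetical toy frame: two feature columns and an integer label.
toy = pd.DataFrame({
    "moisture": [12.8, 12.9, 12.7, 13.0],
    "rainfall": [0.01, 0.00, 0.03, 0.06],
    "label": [2, 0, 4, 0],
})

# One indicator column per distinct label value (0, 2, 4).
# iloc[:, 1:] drops the first one, so a row of all zeros stands for label 0.
onehot = pd.get_dummies(toy["label"]).iloc[:, 1:]

features = toy.drop(columns="label").values  # plays the role of `train` above
targets = onehot.values                      # plays the role of `test` above

print(onehot)
print(features.shape, targets.shape)
```

Note that the names `train` and `test` in the cell above hold the features and the targets respectively; the actual train/test split happens in the next section.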


## Training and Testing

In [5]:
# Hold out 33% of the rows for testing; train on the remaining 67%
X_train, X_test, y_train, y_test = train_test_split(train, test, test_size=0.33)
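Because the split above is random, the reported score can change from run to run. A minimal sketch, assuming synthetic arrays in place of the millet data, of how passing `random_state` makes the split repeatable:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# 100 hypothetical samples with 2 features each.
X = np.arange(200).reshape(100, 2)
y = np.arange(100)

# A fixed random_state yields the same shuffle every time.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.33, random_state=0)
print(X_tr.shape, X_te.shape)

# Repeating the call with the same seed reproduces the same test rows.
X_tr2, X_te2, _, _ = train_test_split(X, y, test_size=0.33, random_state=0)
print(np.array_equal(X_te, X_te2))
```

The seed value `0` is arbitrary; any fixed integer gives a reproducible split.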

## Pre-processing

In [6]:
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
# Fit on the training set and scale each feature to zero mean and unit standard deviation
X_train = sc.fit_transform(X_train)
print(X_train)
# Reuse the training-set statistics to scale the test set
X_test = sc.transform(X_test)

[[ 0.92503115 -0.6056408   0.15708972 ... -0.66129655 -0.23031541
   1.56015249]
 [ 0.1103655  -0.6829728   1.7225628  ... -0.66129655 -0.23031541
   1.56015249]
 [ 0.84382078 -0.53181428 -0.0665493  ... -0.66129655 -0.23031541
  -0.64215195]
 ...
 [ 0.88782147  2.52733664  2.00211157 ... -0.66129655 -0.23031541
   1.56015249]
 [-0.83577216 -0.73126252  0.71618725 ... -0.66129655 -0.23031541
  -0.64215195]
 [ 0.68269222 -0.73126252 -0.01063954 ... -0.66129655 -0.23031541
   1.56015249]]


## Regressor Model

In [7]:
# max_depth caps how deep the tree may grow, a form of pre-pruning that limits overfitting.
# 'squared_error' selects mean squared error; scikit-learn versions before 1.0 call it 'mse'.
from sklearn.tree import DecisionTreeRegressor
rg = DecisionTreeRegressor(max_depth=7, criterion='squared_error', splitter='best')
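The effect of `max_depth` as a pre-pruning control can be illustrated on synthetic data. The sketch below assumes a hypothetical noisy sine curve rather than the millet dataset, and leaves `criterion` at its default so it does not depend on the scikit-learn version; it compares how large the tree grows with and without a depth limit.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(300, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.3, size=300)

trees = {}
for depth in (2, 7, None):  # None lets the tree grow until its leaves are pure
    tree = DecisionTreeRegressor(max_depth=depth, random_state=0).fit(X, y)
    trees[depth] = tree
    print(f"max_depth={depth}: depth={tree.get_depth()}, leaves={tree.get_n_leaves()}")
```

An unrestricted tree keeps splitting until it memorizes the noise, which is the overfitting that a depth limit guards against. scikit-learn also offers cost-complexity pruning through the `ccp_alpha` parameter, which this notebook does not use.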

In [8]:
# Fit the regressor on the training set, then predict on the test set
rg.fit(X_train,y_train)
pred = rg.predict(X_test)

In [9]:
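from sklearn.metrics import accuracy_score
# accuracy_score is a classification metric; it is usable here because the targets are
# one-hot label columns, and it counts a row as correct only if every column matches.
a = accuracy_score(y_test, pred)
print("The accuracy of this model is: ", a*100)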

The accuracy of this model is:  97.72727272727273
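Since the model is a regressor, its outputs are not guaranteed to be exact 0/1 values, and `accuracy_score` rejects continuous predictions. The following is a hedged sketch, using hypothetical hand-written targets and predictions (not results from the millet data), of two complementary checks: a regression metric on the raw outputs, and accuracy after thresholding at an assumed cut-off of 0.5.

```python
import numpy as np
from sklearn.metrics import accuracy_score, mean_squared_error

# Hypothetical one-hot targets and fractional regressor outputs.
y_true = np.array([[1, 0], [0, 0], [0, 1], [0, 0]])
y_pred = np.array([[0.9, 0.0], [0.1, 0.0], [0.2, 0.8], [0.0, 0.6]])

# Regression view: average squared error over all outputs.
mse = mean_squared_error(y_true, y_pred)
print("MSE:", mse)

# Classification view: convert to hard 0/1 indicators first (0.5 is an assumed threshold).
y_hard = (y_pred >= 0.5).astype(int)

# For indicator matrices, a row counts as correct only if every column matches.
acc = accuracy_score(y_true, y_hard)
print("Accuracy:", acc)
```

Reporting both numbers gives a fuller picture: the squared error reflects how far the raw outputs are from the targets, while the accuracy reflects how often the thresholded prediction picks the right class.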


## Conclusion

In [10]:
# A decision tree was built for the problem, and its accuracy was found to be above 95%
# (97.73% in the run shown), with mean squared error as the tree's split criterion.
# The train/test split was 67:33.