# PROBLEM STATEMENT
**The Toxic Pesticides
Though many of us don't fully appreciate it, a farmer's job is a real test of endurance and determination. Once the seeds are sown, he works day and night to ensure a good harvest at the end of the season. A good harvest depends on several factors, such as availability of water, soil fertility, protection of crops from rodents, timely use of pesticides and other useful chemicals, and nature. While many of these factors are difficult to control, the amount and frequency of pesticide use is something the farmer can control.**

**Pesticides are also special because, while they protect the crop at the right dosage, adding more than required may spoil the entire harvest. A high level of pesticide can leave the crop dead or unsuitable for consumption, among other outcomes. This data is based on crops harvested by various farmers at the end of the harvest season. To simplify the problem, you can assume that all other factors, such as variations in farming techniques, have been controlled for.**

**You need to determine the outcome of the harvest season, i.e. whether the crop would be healthy (alive), damaged by pesticides, or damaged by other reasons.**

The evaluation metric for this hackathon is Accuracy Score.
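
Accuracy is simply the fraction of predictions that match the true label. A minimal sketch on made-up labels (the arrays are illustrative and not taken from the competition data; the 0/1/2 encoding follows the plot legends used later in this notebook):

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Hypothetical labels: 0 = alive, 1 = damaged by other causes, 2 = damaged by pesticides
y_true = np.array([0, 0, 1, 2, 0, 1])
y_pred = np.array([0, 0, 1, 0, 0, 2])

# Fraction of positions where the prediction equals the truth
manual_acc = (y_true == y_pred).mean()

# The same quantity via scikit-learn
sk_acc = accuracy_score(y_true, y_pred)
print(manual_acc, sk_acc)
```

Because accuracy ignores which class was missed, a model can score well on an imbalanced target simply by favouring the majority class, which is worth keeping in mind when reading the scores below.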

In [None]:
import pandas as pd
from sklearn.impute import SimpleImputer
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.ensemble import RandomForestRegressor
from sklearn.preprocessing import OneHotEncoder
from sklearn import datasets, ensemble
from sklearn.linear_model import LogisticRegression
from xgboost import XGBRegressor
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import RandomizedSearchCV, GridSearchCV
import datetime as dt
from sklearn import  metrics   
import xgboost as xgb
from xgboost.sklearn import XGBClassifier

# DATA INPUT

In [None]:

train=pd.read_csv('../input/av-janatahack-machine-learning-in-agriculture/train_yaOffsB.csv')
test=pd.read_csv('../input/av-janatahack-machine-learning-in-agriculture/test_pFkWwen.csv')

In [None]:
train['dataset']='train'
test['dataset']='test'

In [None]:
final=pd.concat([train,test])

In [None]:
final['dataset'].value_counts()

# EXPLORATORY DATA ANALYSIS (EDA)

In [None]:
sns.set_style('whitegrid')

In [None]:
# Number_Weeks_Used
sns.distplot(final[final['Crop_Damage']==0]['Number_Weeks_Used'],bins=40,color='black',kde=False)
sns.distplot(final[final['Crop_Damage']==1]['Number_Weeks_Used'],bins=40,color='green',kde=False)
sns.distplot(final[final['Crop_Damage']==2]['Number_Weeks_Used'],bins=40,color='blue',kde=False)
plt.legend(labels=['ALIVE', 'DAMAGE-OTHERS', 'DAMAGE-PESTICIDE'])

In [None]:
#Number_Weeks_Quit
sns.distplot(final[final['Crop_Damage']==0]['Number_Weeks_Quit'],bins=40,color='black',kde=False)
sns.distplot(final[final['Crop_Damage']==1]['Number_Weeks_Quit'],bins=40,color='green',kde=False)
sns.distplot(final[final['Crop_Damage']==2]['Number_Weeks_Quit'],bins=40,color='blue',kde=False)
plt.legend(labels=['ALIVE', 'DAMAGE-OTHERS', 'DAMAGE-PESTICIDE'])

In [None]:
# Season
sns.countplot(x='Season',data=final,hue='Crop_Damage')
plt.legend(labels=['ALIVE', 'DAMAGE-OTHERS', 'DAMAGE-PESTICIDE'])

In [None]:
df = final[final['dataset']=='train'].groupby(['Season', 'Crop_Damage']).agg({'Crop_Damage': 'count'})
percentage = df.groupby(level=0).apply(lambda x:round(100 * x / x.sum(),2))
df, percentage

*Season does not appear to have much effect on the survival rate.*
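
The per-group percentages computed in the cell above can also be obtained with `pd.crosstab(..., normalize='index')`. A small sketch on a toy frame (the values are invented for illustration; only the column names come from this notebook):

```python
import pandas as pd

# Toy data: two seasons with a handful of harvest outcomes each
toy = pd.DataFrame({
    'Season':      [1, 1, 1, 1, 2, 2, 2, 2],
    'Crop_Damage': [0, 0, 0, 1, 0, 0, 1, 2],
})

# Share of each outcome within a season, in percent; every row sums to 100
pct = (pd.crosstab(toy['Season'], toy['Crop_Damage'], normalize='index') * 100).round(2)
print(pct)
```

The same call works for `Pesticide_Use_Category`, `Soil_Type` and `Crop_Type` by swapping the first argument.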

In [None]:
# Number_Doses_Week
sns.distplot(final[final['Crop_Damage']==0]['Number_Doses_Week'],bins=30,color='black',kde=False)
sns.distplot(final[final['Crop_Damage']==1]['Number_Doses_Week'],bins=30,color='green',kde=False)
sns.distplot(final[final['Crop_Damage']==2]['Number_Doses_Week'],bins=30,color='red',kde=False)
plt.legend(labels=['ALIVE', 'DAMAGE-OTHERS', 'DAMAGE-PESTICIDE'])

In [None]:
#Pesticide_Use_Category
sns.countplot(x='Pesticide_Use_Category',data=final,hue='Crop_Damage')
plt.legend(labels=['ALIVE', 'DAMAGE-OTHERS', 'DAMAGE-PESTICIDE'])

In [None]:
df = final[final['dataset']=='train'].groupby(['Pesticide_Use_Category', 'Crop_Damage']).agg({'Crop_Damage': 'count'})
percentage = df.groupby(level=0).apply(lambda x:round(100 * x / x.sum(),2))
df,percentage

*Among the pesticide categories, category 2 (previously used) has the highest survival rate.*

In [None]:
#Soil_Type
sns.countplot(x='Soil_Type',data=final,hue='Crop_Damage')
plt.legend(labels=['ALIVE', 'DAMAGE-OTHERS', 'DAMAGE-PESTICIDE'])

In [None]:
df = final[final['dataset']=='train'].groupby(['Soil_Type', 'Crop_Damage']).agg({'Crop_Damage': 'count'})
percentage = df.groupby(level=0).apply(lambda x:round(100 * x / x.sum(),2))
df, percentage

*Soil type 1 has a slightly higher survival rate than type 0.*

In [None]:
#Estimated_Insects_Count
sns.distplot(final[final['Crop_Damage']==0]['Estimated_Insects_Count'],bins=30,color='black',kde=False)
sns.distplot(final[final['Crop_Damage']==1]['Estimated_Insects_Count'],bins=30,color='blue',kde=False)
sns.distplot(final[final['Crop_Damage']==2]['Estimated_Insects_Count'],bins=30,color='red',kde=False)
plt.legend(labels=['ALIVE', 'DAMAGE-OTHERS', 'DAMAGE-PESTICIDE'])

In [None]:
#Crop_Type
sns.countplot(x='Crop_Type',data=final,hue='Crop_Damage')
plt.legend(labels=['ALIVE', 'DAMAGE-OTHERS', 'DAMAGE-PESTICIDE'])

In [None]:
df = final[final['dataset']=='train'].groupby(['Crop_Type', 'Crop_Damage']).agg({'Crop_Damage': 'count'})
percentage = df.groupby(level=0).apply(lambda x:round(100 * x / x.sum(),2))
df, percentage

*Crop type 1 has a slightly higher survival rate than type 0.*

In [None]:
sns.heatmap(final.isna())

In [None]:
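# Restrict to numeric columns; string columns such as 'dataset' cannot be correlated
sns.heatmap(final.select_dtypes('number').corr(), annot=True)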

*Estimated_Insects_Count shows a fairly strong correlation with Number_Weeks_Used.*

# FEATURE ENGINEERING

In [None]:
# Impute missing values with a sentinel of -1 (this also fills Crop_Damage on the test rows, which is dropped later)
final.fillna(-1, inplace=True)
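
A sentinel such as -1 is only safe when it cannot collide with a real value. The sketch below assumes the missing entries sit in a non-negative count feature such as `Number_Weeks_Used` (an assumption, not something verified here) and fills only the chosen feature columns, leaving the label column untouched. The toy frame and the `feature_cols` list are hypothetical.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the combined frame: the test row has no label
toy = pd.DataFrame({
    'Number_Weeks_Used': [10.0, np.nan, 25.0],
    'Crop_Damage':       [0.0, 1.0, np.nan],
    'dataset':           ['train', 'train', 'test'],
})

# Count missing values per column before imputing
missing_before = toy.isna().sum()

# Fill only the feature columns with the sentinel
feature_cols = ['Number_Weeks_Used']
toy[feature_cols] = toy[feature_cols].fillna(-1)
print(missing_before)
print(toy)
```

In this notebook the blanket `fillna` is harmless because `Crop_Damage` is dropped from the test part later, but checking `isna().sum()` first makes it clear which columns are actually affected.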

In [None]:
final['Uniq_crops_per_season']=final.groupby(['Season'])['ID'].transform('nunique')

final['Uniq_crops_per_soil']=final.groupby(['Soil_Type'])['ID'].transform('nunique')

final['Uniq_crops_per_pesticide']=final.groupby(['Pesticide_Use_Category'])['ID'].transform('nunique')

final['Uniq_crops_per_CT']=final.groupby(['Crop_Type'])['ID'].transform('nunique')

final['Crop_Type'] = final['Crop_Type'].astype('category')
#final['Crop_Type'].value_counts()

final['Soil_Type'] = final['Soil_Type'].astype('category')
#final['Soil_Type'].value_counts()

final['Pesticide_Use_Category']=final['Pesticide_Use_Category'].astype('category')
#final['Pesticide_Use_Category'].value_counts()

final['Season']=final['Season'].astype('category')
#final['Season'].value_counts()

final['Uniq_crops_per_season'] = final['Uniq_crops_per_season'].astype('category')
#final['Uniq_crops_per_season'].value_counts()

final['Uniq_crops_per_soil'] = final['Uniq_crops_per_soil'].astype('category')
#final['Uniq_crops_per_soil'].value_counts()

final['Uniq_crops_per_pesticide']=final['Uniq_crops_per_pesticide'].astype('category')
#final['Uniq_crops_per_pesticide'].value_counts()

final['Uniq_crops_per_CT']=final['Uniq_crops_per_CT'].astype('category')
#final['Uniq_crops_per_CT'].value_counts()
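
To make the `Uniq_crops_per_*` features concrete: `groupby(...)['ID'].transform('nunique')` writes, on every row, the number of distinct IDs that share that row's group value. If each ID occurs once (assumed here, not checked in this notebook), this is just the group size, i.e. a frequency encoding of the grouping column. A toy sketch with invented IDs:

```python
import pandas as pd

toy = pd.DataFrame({
    'ID':     ['F1', 'F2', 'F3', 'F4', 'F5'],
    'Season': [1, 1, 2, 1, 2],
})

# Number of distinct IDs in each row's season, broadcast back to every row
toy['Uniq_crops_per_season'] = toy.groupby('Season')['ID'].transform('nunique')

# With unique IDs this equals the plain group size
group_size = toy.groupby('Season')['ID'].transform('size')
print(toy)
```

Since the resulting column takes one value per category, it carries the same information as the grouping column itself, plus the ordering by frequency.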


# TRAIN TEST SPLIT

In [None]:
# Copy the slices so the in-place drops act on independent frames, not on views of `final`
train = final[final.dataset == 'train'].copy()
test = final[final.dataset == 'test'].copy()
train.drop(['dataset'], inplace=True, axis=1)
test.drop(['dataset', 'Crop_Damage'], inplace=True, axis=1)
train.shape, test.shape

In [None]:
train.drop(['ID'], axis=1, inplace=True)
test_id=test['ID']
test.drop(['ID'], axis=1, inplace=True)
test_id

In [None]:
y=train['Crop_Damage']
train.drop(['Crop_Damage'], axis=1, inplace=True)

In [None]:
x_train, x_test, y_train, y_test = train_test_split(train, y,test_size=0.20,random_state=0,
                                                    stratify=y)
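
Passing `stratify=y` keeps the class proportions of the target the same in both parts of the split, which matters when one outcome is much rarer than the others. A sketch on hypothetical, deliberately imbalanced labels (the 80/15/5 mix is made up, not the competition's distribution):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 80 of class 0, 15 of class 1, 5 of class 2
y_toy = np.array([0] * 80 + [1] * 15 + [2] * 5)
X_toy = np.arange(len(y_toy)).reshape(-1, 1)

X_tr, X_va, y_tr, y_va = train_test_split(
    X_toy, y_toy, test_size=0.20, random_state=0, stratify=y_toy
)

# Class counts in the validation part mirror the overall 80/15/5 proportions
val_counts = np.bincount(y_va, minlength=3)
print(val_counts)
```

Without `stratify`, a random 20% sample could easily contain no rows of the rarest class, which would make the validation error unreliable for that class.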

# FITTING MODEL AND MAKING PREDICTION ON TEST

In [None]:
params = {}
params['learning_rate'] = 0.04
#params['min_child_samples'] = 40
params['max_depth'] = 18
params['n_estimators'] = 1000
params['objective'] = 'multiclass'
params['boosting_type'] = 'gbdt'
params['subsample'] = 0.7
params['random_state'] = 42
params['colsample_bytree']=0.7
params['min_data_in_leaf'] = 55
params['reg_alpha'] = 1.7
params['reg_lambda'] = 1.11
params['class_weight'] = {0: 0.44, 1: 0.4, 2: 0.37}  # per-class weights for labels 0, 1, 2

In [None]:
from lightgbm import LGBMClassifier
predictors=train.columns
clf = LGBMClassifier(**params)

clf.fit(x_train[predictors], y_train, eval_set=[(x_test, y_test)], verbose=50,
        eval_metric = 'multi_error', early_stopping_rounds = 100)

In [None]:
# Finding best iteration 
best_iter = clf.best_iteration_
params['n_estimators'] = best_iter
params

In [None]:
clf = LGBMClassifier(**params)

clf.fit(train[predictors], y, eval_metric='multi_error', verbose=False, categorical_feature='auto')

# Accuracy on the same data the model was fit on (an optimistic, training-set figure)
eval_score_acc = accuracy_score(y, clf.predict(train[predictors]))

print('Train ACC: {}'.format(eval_score_acc))

In [None]:
preds = clf.predict(test)
pred=pd.Series(preds)

In [None]:
submission = pd.DataFrame({'ID':test_id, 'Crop_Damage':preds})
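
The notebook stops at building the `submission` frame. A sketch of the remaining step, writing it out as CSV, is shown below on toy values. The two-column layout is taken from the frame above, while the IDs and any output file name are hypothetical. Because `Crop_Damage` went through the concat and `fillna` steps as a float column, predictions may come back as floats, so the sketch casts them to integers first.

```python
import io
import pandas as pd

# Hypothetical IDs and predictions standing in for test_id and preds
toy_submission = pd.DataFrame({'ID': ['F1', 'F2', 'F3'],
                               'Crop_Damage': [0.0, 2.0, 1.0]})

# Cast float predictions back to integer class labels
toy_submission['Crop_Damage'] = toy_submission['Crop_Damage'].astype(int)

# index=False keeps the row index out of the file; a StringIO stands in for a file path
buf = io.StringIO()
toy_submission.to_csv(buf, index=False)
csv_text = buf.getvalue()
print(csv_text)
```

With the real frame, replacing the buffer with a file name of your choice writes the file to disk.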

# FEATURE IMPORTANCE

In [None]:
fi = pd.Series(index = predictors, data = clf.feature_importances_)
fi.sort_values(ascending=False)[0:20][::-1].plot(kind = 'barh')