As a beginner in machine learning, I have yet to learn and explore neural networks. Currently, I am trying to learn more about Random Forest and boosting algorithms. Hence, in this notebook my main objective is to focus on the **exploratory data analysis part** and **hyperparameter tuning of the RF and XGBoost algorithms**.

## **Step 1: Getting Data**

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
from matplotlib import pyplot as plt
import matplotlib.ticker as ticker
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
import pickle
import gc

In [None]:
train = pd.read_csv('../input/tabular-playground-series-dec-2021/train.csv')
test = pd.read_csv('../input/tabular-playground-series-dec-2021/test.csv')

In [None]:
train.columns

In [None]:
# Checking for class imbalance
plt.figure(figsize=(15,10))
train.groupby("Cover_Type").Cover_Type.hist()
train['Cover_Type'].value_counts()
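
Alongside the histogram, relative class shares make the imbalance easier to read. A minimal sketch on made-up labels (the toy `labels` series is an assumption standing in for `train['Cover_Type']`):

```python
import pandas as pd

# Toy labels standing in for train['Cover_Type']; the values are made up.
labels = pd.Series([1, 1, 1, 2, 2, 2, 2, 2, 3, 5], name="Cover_Type")

# Absolute counts per class, ordered by class label.
counts = labels.value_counts().sort_index()

# Share of each class in the data set (sums to 1).
shares = labels.value_counts(normalize=True).sort_index()

print(pd.DataFrame({"count": counts, "share": shares}))
```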

In [None]:
# Remove class 5, as it has only 1 instance
class_5_indices = train.index[train.Cover_Type==5]
train = train.drop(class_5_indices,axis=0)
# Working copies that collect the cleaned and engineered features
Z_train = train.copy()
Z_test = test.copy()
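
Dropping the single class 5 row also helps later on: a class with fewer members than there are cross-validation folds cannot be represented in every fold. The same idea can be written generically, without hard-coding the class label. This is only a sketch on toy data; `min_count` is a hypothetical threshold, not something used in this notebook:

```python
import pandas as pd

# Toy labels standing in for train['Cover_Type']; class 5 appears once.
toy = pd.DataFrame({"Cover_Type": [1, 1, 2, 2, 2, 3, 3, 5]})

min_count = 2  # hypothetical threshold, e.g. the number of CV folds
counts = toy["Cover_Type"].value_counts()
rare_classes = counts[counts < min_count].index

# Keep only rows whose class has at least `min_count` instances.
toy_filtered = toy[~toy["Cover_Type"].isin(rare_classes)].reset_index(drop=True)
print(sorted(toy_filtered["Cover_Type"].unique()))
```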

## **Step 2: EDA**

## List of new features added
- Elevation squared and cubed
- Soil stoniness classes: extremely stony, very stony, stony, rubbly and so on
- Common soil families: Rock outcrop, Vanet, Leighcan and so on
- Soil freq: number of soil types flagged for a sample
- tan(Slope)
- Hillshade average, standard deviation and increase/decrease over the day
- (Horizontal distance to hydrology)$^2$ + (vertical distance to hydrology)$^2$
- (Distance to roadways)$^2$ + (distance to fire points)$^2$
- Wilderness count: number of wilderness areas flagged for a sample

## $Elevation$, $Elevation^2$ & $Elevation^3$

In [None]:
# Create squared and cubed elevation as new features
Z_train['Elevation_sq'] = train.Elevation**2
Z_test['Elevation_sq'] = test.Elevation**2
Z_train['Elevation_cube'] = train.Elevation**3
Z_test['Elevation_cube'] = test.Elevation**3

In [None]:
plt.figure(figsize=(15,10))
ax = sns.countplot(data=train,x='Elevation',hue='Cover_Type')
plt.xticks(rotation=45,horizontalalignment='right',fontweight='light')
ax.xaxis.set_major_locator(ticker.MultipleLocator(250))
plt.show()

## Soil Types 
### The names of the soil types can be used to extract more features, such as stone density and the families shared by several soil types.

- Cathedral family - Rock outcrop complex, extremely stony.
- Vanet - Ratake families complex, very stony.
- Haploborolis - Rock outcrop complex, rubbly.
- Ratake family - Rock outcrop complex, rubbly.

(See the data page of the competition for the names of the other soil types.)

- Many soil families are shared by several soil types (for example, Rock outcrop is present in soil types 1, 3, 4, 6, 10, 27, 28, 32, 33, 35 and 37).
- Further characteristics of the soil, such as whether it is stony or rubbly, can also be extracted from the names.
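
The grouping idea above can be sketched in a vectorized way: a group flag is 1 whenever any one-hot soil column of the group is active. This is only an illustration on a toy frame with three soil columns; the helper `group_flag` and the reduced group lists are assumptions made for the example:

```python
import pandas as pd

# Toy frame with three one-hot soil columns standing in for Soil_Type1..Soil_Type40.
toy = pd.DataFrame({
    "Soil_Type1": [1, 0, 0, 1],
    "Soil_Type2": [0, 1, 0, 0],
    "Soil_Type3": [0, 0, 0, 1],
})

# Hypothetical groups, reduced to the soil types present in the toy frame.
rock_outcrop = [1, 3]
very_stony = [2]

def group_flag(df, soil_numbers):
    """Return 1 where any soil type of the group is active in the row, else 0."""
    cols = [f"Soil_Type{n}" for n in soil_numbers]
    return df[cols].any(axis=1).astype(int)

toy["Rock_outcrop"] = group_flag(toy, rock_outcrop)
toy["very_stony"] = group_flag(toy, very_stony)
print(toy)
```

Working on whole columns avoids a row-wise `apply`, which tends to be slow on large frames.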




In [None]:
soil_types = ['Soil_Type1', 'Soil_Type2', 'Soil_Type3',
       'Soil_Type4', 'Soil_Type5', 'Soil_Type6', 'Soil_Type7', 'Soil_Type8',
       'Soil_Type9', 'Soil_Type10', 'Soil_Type11', 'Soil_Type12',
       'Soil_Type13', 'Soil_Type14', 'Soil_Type15', 'Soil_Type16',
       'Soil_Type17', 'Soil_Type18', 'Soil_Type19', 'Soil_Type20',
       'Soil_Type21', 'Soil_Type22', 'Soil_Type23', 'Soil_Type24',
       'Soil_Type25', 'Soil_Type26', 'Soil_Type27', 'Soil_Type28',
       'Soil_Type29', 'Soil_Type30', 'Soil_Type31', 'Soil_Type32',
       'Soil_Type33', 'Soil_Type34', 'Soil_Type35', 'Soil_Type36',
       'Soil_Type37', 'Soil_Type38', 'Soil_Type39', 'Soil_Type40']

def all_soil_types(x):
    """Return the numbers (1-40) of all soil types flagged in row x."""
    return [n for n, col in enumerate(soil_types, start=1) if x[col]==1]

train['soil_lst'] = train.apply(all_soil_types,axis=1)
test['soil_lst'] = test.apply(all_soil_types,axis=1)

In [None]:
extremely_stony = [1,24,25,27,28,29,30,31,32,33,34,36,37,38,39,40]
very_stony = [2,9,18,26]
rubbly = [3,4,5,10,11,13]
stony = [6,12]
extremely_bouldery = [22]
no_stony_info = [7,8,14,15,16,17,19,20,21,23,35]

def check(z,chck_lst):
    """Return 1 if any soil type number in z belongs to chck_lst, else 0."""
    return int(any(soil_type in chck_lst for soil_type in z))
    
    
Z_train['extremely_stony'] = train.soil_lst.map(lambda x: check(x,extremely_stony))
Z_train['very_stony'] = train.soil_lst.map(lambda x: check(x,very_stony))
Z_train['rubbly'] = train.soil_lst.map(lambda x: check(x,rubbly))
Z_train['stony'] = train.soil_lst.map(lambda x: check(x,stony))
Z_train['extremely_bouldery'] = train.soil_lst.map(lambda x: check(x,extremely_bouldery))
Z_train['no_stony_info'] = train.soil_lst.map(lambda x: check(x,no_stony_info))

Z_test['extremely_stony'] = test.soil_lst.map(lambda x: check(x,extremely_stony))
Z_test['very_stony'] = test.soil_lst.map(lambda x: check(x,very_stony))
Z_test['rubbly'] = test.soil_lst.map(lambda x: check(x,rubbly))
Z_test['stony'] = test.soil_lst.map(lambda x: check(x,stony))
Z_test['extremely_bouldery'] = test.soil_lst.map(lambda x: check(x,extremely_bouldery))
Z_test['no_stony_info'] = test.soil_lst.map(lambda x: check(x,no_stony_info))


In [None]:
plt.figure(figsize=(20,10))
ax = sns.scatterplot(data=Z_train[train['Cover_Type']>=0],x='Elevation',y='extremely_stony',hue=train['Cover_Type'],palette='Set2')
plt.xticks(rotation=45,horizontalalignment='right',fontweight='light')
ax.xaxis.set_major_locator(ticker.MultipleLocator(250))
plt.show()

In [None]:
Rock_outcrop = [1,3,4,6,10,27,28,32,33,35,37]
Leighcan = [22,23,24,25,27,28,31,32,33,38,39]
Catamount = [10,11,13,26,31,32,33]
Rock_land = [12,13,30,34,36,40]
Vanet = [2,5,6]
Bullwark = [10,11,13]
Moran = [38,39,40]
Typic = [19,20,21,23]
Aquolis = [14,16,17,19]
Cryorthents = [34,37,39,40]
Cryumbrepts = [35,36,37]
Cryaquepts = [20,35]
Z_train['Rock_outcrop'] = train.soil_lst.map(lambda x: check(x,Rock_outcrop))
Z_train['Leighcan'] = train.soil_lst.map(lambda x: check(x,Leighcan))
Z_train['Catamount'] = train.soil_lst.map(lambda x: check(x,Catamount))
Z_train['Rock_land'] = train.soil_lst.map(lambda x: check(x,Rock_land))
Z_train['Vanet'] = train.soil_lst.map(lambda x: check(x,Vanet))
Z_train['Bullwark'] = train.soil_lst.map(lambda x: check(x,Bullwark))
Z_train['Moran'] = train.soil_lst.map(lambda x: check(x,Moran))
Z_train['Typic'] = train.soil_lst.map(lambda x: check(x,Typic))
Z_train['Aquolis'] = train.soil_lst.map(lambda x: check(x,Aquolis))
Z_train['Cryorthents'] = train.soil_lst.map(lambda x: check(x,Cryorthents))
Z_train['Cryumbrepts'] = train.soil_lst.map(lambda x: check(x,Cryumbrepts))
Z_train['Cryaquepts'] = train.soil_lst.map(lambda x: check(x,Cryaquepts))

Z_test['Rock_outcrop'] = test.soil_lst.map(lambda x: check(x,Rock_outcrop))
Z_test['Leighcan'] = test.soil_lst.map(lambda x: check(x,Leighcan))
Z_test['Catamount'] = test.soil_lst.map(lambda x: check(x,Catamount))
Z_test['Rock_land'] = test.soil_lst.map(lambda x: check(x,Rock_land))
Z_test['Vanet'] = test.soil_lst.map(lambda x: check(x,Vanet))
Z_test['Bullwark'] = test.soil_lst.map(lambda x: check(x,Bullwark))
Z_test['Moran'] = test.soil_lst.map(lambda x: check(x,Moran))
Z_test['Typic'] = test.soil_lst.map(lambda x: check(x,Typic))
Z_test['Aquolis'] = test.soil_lst.map(lambda x: check(x,Aquolis))
Z_test['Cryorthents'] = test.soil_lst.map(lambda x: check(x,Cryorthents))
Z_test['Cryumbrepts'] = test.soil_lst.map(lambda x: check(x,Cryumbrepts))
Z_test['Cryaquepts'] = test.soil_lst.map(lambda x: check(x,Cryaquepts))

In [None]:
train['soil_lst_freq'] = train.soil_lst.map(lambda x: len(x))
test['soil_lst_freq'] = test.soil_lst.map(lambda x: len(x))
Z_train['soil_lst_freq'] = train['soil_lst_freq']
Z_test['soil_lst_freq'] = test['soil_lst_freq']

In [None]:
plt.figure(figsize=(20,10))
ax = sns.scatterplot(data=Z_train[train['Cover_Type']>=0],x='Elevation',y='soil_lst_freq',hue=train['Cover_Type'],palette='Set2')
plt.xticks(rotation=45,horizontalalignment='right',fontweight='light')
#ax.xaxis.set_major_locator(ticker.MultipleLocator(250))
plt.show()

## Aspect

In [None]:
# Wrap aspect values that fall outside the 0-360 degree range back into it
train['Aspect_2'] = train.Aspect.map(lambda x : x-360 if x>360 else (x+360 if x<0  else x))
test['Aspect_2'] = test.Aspect.map(lambda x : x-360 if x>360 else (x+360 if x<0  else x))
Z_train['Aspect'] = train['Aspect_2']
Z_test['Aspect'] = test['Aspect_2']
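
The same wrapping can also be sketched with a modulo, which additionally handles values more than one full turn out of range. Note one difference, stated as an assumption about what is wanted: with a modulo, exactly 360 becomes 0 (the same direction), so the result lies in [0, 360). The toy values are made up:

```python
import numpy as np
import pandas as pd

# Toy aspect values in degrees, some outside the 0-360 range.
aspect = pd.Series([-33, 0, 90, 359, 360, 407])

# Modulo wraps every angle into [0, 360).
wrapped = np.mod(aspect, 360)
print(wrapped.tolist())
```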

In [None]:
plt.figure(figsize=(20,10))
ax = sns.scatterplot(data=Z_train,x='Elevation',y='Aspect',hue=train.Cover_Type,palette='Set2')
plt.xticks(rotation=45,horizontalalignment='right',fontweight='light')
ax.xaxis.set_major_locator(ticker.MultipleLocator(250))
plt.show()

## Slope & tan(Slope)

In [None]:
train['Slope_wo_negative'] = train.Slope.map(lambda x: x if x>0 else 0)
test['Slope_wo_negative'] = test.Slope.map(lambda x: x if x>0 else 0)
Z_train.Slope = train['Slope_wo_negative']
Z_test.Slope = test['Slope_wo_negative']

In [None]:
# Slope is in degrees; convert to radians before taking the tangent
Z_train['slope_tan'] = np.tan(np.deg2rad(Z_train.Slope))
Z_test['slope_tan'] = np.tan(np.deg2rad(Z_test.Slope))
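
The tangent needs the slope in radians. A small sketch on toy values, comparing the exact conversion via `np.deg2rad` with a `3.14/180` approximation of the conversion factor:

```python
import numpy as np
import pandas as pd

# Toy slopes in degrees.
slope_deg = pd.Series([0, 15, 30, 45])

# Exact degree-to-radian conversion.
slope_tan = np.tan(np.deg2rad(slope_deg))

# Conversion with pi rounded to 3.14, for comparison.
approx = np.tan(slope_deg * (3.14 / 180))

print(pd.DataFrame({"exact": slope_tan, "approx": approx}))
```

The gap is small at these angles, but it grows quickly as the slope approaches 90 degrees, where the tangent blows up.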

In [None]:
plt.figure(figsize=(20,10))
ax = sns.scatterplot(data=Z_train,x='Elevation',y='Slope',hue=train.Cover_Type,palette='Set2')
plt.xticks(rotation=45,horizontalalignment='right',fontweight='light')
ax.xaxis.set_major_locator(ticker.MultipleLocator(250))
plt.show()

In [None]:
plt.figure(figsize=(20,10))
ax = sns.scatterplot(data=Z_train,x='Elevation',y='slope_tan',hue=train.Cover_Type,palette='Set2')
plt.xticks(rotation=45,horizontalalignment='right',fontweight='light')
ax.xaxis.set_major_locator(ticker.MultipleLocator(250))
plt.show()

## Vertical Distance to Hydrology, Horizontal Distance to Hydrology & (Vertical Distance to Hydrology)$^2$ + (Horizontal Distance to Hydrology)$^2$

In [None]:
max_threshold = 1000
train['hor_dist_hydro'] = train.Horizontal_Distance_To_Hydrology.map(lambda x: abs(x) if x<0 else(max_threshold if x>max_threshold else x))
test['hor_dist_hydro'] = test.Horizontal_Distance_To_Hydrology.map(lambda x: abs(x) if x<0 else(max_threshold if x>max_threshold else x))
Z_train.Horizontal_Distance_To_Hydrology = train.hor_dist_hydro
Z_test.Horizontal_Distance_To_Hydrology = test.hor_dist_hydro

In [None]:
# Sign flag: 0 when the vertical distance is negative, 1 otherwise
train['ver_dist_negtve'] = train.Vertical_Distance_To_Hydrology.map(lambda x: 0 if x<0 else 1)
test['ver_dist_negtve'] = test.Vertical_Distance_To_Hydrology.map(lambda x: 0 if x<0 else 1)

# converting negative to positive
train['ver_dist_hydro'] = train.Vertical_Distance_To_Hydrology.map(lambda x: abs(x) if x<0 else x)
test['ver_dist_hydro'] = test.Vertical_Distance_To_Hydrology.map(lambda x: abs(x) if x<0 else x)

# max threshold for features
max_threshold = 400
train['ver_dist_hydro'] = train.ver_dist_hydro.map(lambda x: max_threshold if x>max_threshold else x)
test['ver_dist_hydro'] = test.ver_dist_hydro.map(lambda x: max_threshold if x>max_threshold else x)

Z_train.Vertical_Distance_To_Hydrology = train.ver_dist_hydro
Z_test.Vertical_Distance_To_Hydrology = test.ver_dist_hydro
Z_train['ver_dist_negtve'] = train.ver_dist_negtve
Z_test['ver_dist_negtve'] = test.ver_dist_negtve

In [None]:
Z_train['hydrology_sum'] = Z_train.Horizontal_Distance_To_Hydrology*Z_train.Horizontal_Distance_To_Hydrology + Z_train.Vertical_Distance_To_Hydrology*Z_train.Vertical_Distance_To_Hydrology
Z_test['hydrology_sum'] = Z_test.Horizontal_Distance_To_Hydrology*Z_test.Horizontal_Distance_To_Hydrology + Z_test.Vertical_Distance_To_Hydrology*Z_test.Vertical_Distance_To_Hydrology
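
`hydrology_sum` is the squared straight-line distance to the nearest water, built from its horizontal and vertical components. A minimal sketch on toy numbers (the column names `horizontal` and `vertical` are placeholders), next to the Euclidean distance itself:

```python
import numpy as np
import pandas as pd

# Toy horizontal and vertical distances to hydrology.
toy = pd.DataFrame({"horizontal": [3, 0, 30], "vertical": [4, 5, 40]})

# Sum of squares, as used for the hydrology_sum feature.
toy["sq_sum"] = toy["horizontal"]**2 + toy["vertical"]**2

# Euclidean distance, i.e. the square root of the sum of squares.
toy["euclid"] = np.hypot(toy["horizontal"], toy["vertical"])
print(toy)
```

For tree-based models the two versions rank the samples identically, since the square root is monotonic on non-negative values.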

## Distance to Road, Fire Points and (Distance to Road)$^2$ + (Distance to Fire)$^2$

In [None]:
# converting negative to positive
train['hor_dist_road'] = train.Horizontal_Distance_To_Roadways.map(lambda x: abs(x) if x<0 else x)
test['hor_dist_road'] = test.Horizontal_Distance_To_Roadways.map(lambda x: abs(x) if x<0 else x)

# max threshold for features
max_threshold = 5000
train['hor_dist_road'] = train.hor_dist_road.map(lambda x: max_threshold if x>max_threshold else x)
test['hor_dist_road'] = test.hor_dist_road.map(lambda x: max_threshold if x>max_threshold else x)

Z_train.Horizontal_Distance_To_Roadways = train.hor_dist_road
Z_test.Horizontal_Distance_To_Roadways = test.hor_dist_road

In [None]:
# converting negative to positive
train['hor_dist_fire'] = train.Horizontal_Distance_To_Fire_Points.map(lambda x: abs(x) if x<0 else x)
test['hor_dist_fire'] = test.Horizontal_Distance_To_Fire_Points.map(lambda x: abs(x) if x<0 else x)

# max threshold for features
max_threshold = 5000
train['hor_dist_fire'] = train.hor_dist_fire.map(lambda x: max_threshold if x>max_threshold else x)
test['hor_dist_fire'] = test.hor_dist_fire.map(lambda x: max_threshold if x>max_threshold else x)

Z_train.Horizontal_Distance_To_Fire_Points = train.hor_dist_fire
Z_test.Horizontal_Distance_To_Fire_Points = test.hor_dist_fire
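
The two cleaning steps used for the distance columns (make negatives positive, then cap at a maximum) can be sketched as one vectorized helper. `abs_and_cap` is a hypothetical name for this illustration, and the toy values are made up:

```python
import pandas as pd

def abs_and_cap(series, max_threshold):
    """Make negative distances positive, then cap them at max_threshold."""
    return series.abs().clip(upper=max_threshold)

# Toy distances, including a negative value and one above the cap.
toy = pd.Series([-120, 0, 800, 5400])
capped = abs_and_cap(toy, 5000)
print(capped.tolist())
```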

In [None]:
Z_train['fire_road_sum'] = Z_train.Horizontal_Distance_To_Fire_Points*Z_train.Horizontal_Distance_To_Fire_Points + Z_train.Horizontal_Distance_To_Roadways*Z_train.Horizontal_Distance_To_Roadways
Z_test['fire_road_sum'] = Z_test.Horizontal_Distance_To_Fire_Points*Z_test.Horizontal_Distance_To_Fire_Points + Z_test.Horizontal_Distance_To_Roadways*Z_test.Horizontal_Distance_To_Roadways

In [None]:
plt.figure(figsize=(20,10))
ax = sns.scatterplot(data=Z_train,x='Elevation',y='fire_road_sum',hue=train.Cover_Type,palette='Set2')
plt.xticks(rotation=45,horizontalalignment='right',fontweight='light')
ax.xaxis.set_major_locator(ticker.MultipleLocator(1000))
plt.show()

In [None]:
plt.figure(figsize=(20,10))
ax = sns.scatterplot(data=Z_train,x='Elevation',y='hydrology_sum',hue=train.Cover_Type,palette='Set2')
plt.xticks(rotation=45,horizontalalignment='right',fontweight='light')
ax.xaxis.set_major_locator(ticker.MultipleLocator(1000))
plt.show()

## Hillshade

In [None]:
# max threshold for features
max_threshold = 255
train['Hillshade_9am_thresholded'] = train.Hillshade_9am.map(lambda x: 0 if x<0 else(x if x < max_threshold else max_threshold))
test['Hillshade_9am_thresholded'] = test.Hillshade_9am.map(lambda x: 0 if x<0 else(x if x < max_threshold else max_threshold))
Z_train.Hillshade_9am = train.Hillshade_9am_thresholded
Z_test.Hillshade_9am = test.Hillshade_9am_thresholded

In [None]:
# max threshold for features
max_threshold = 255
train['Hillshade_Noon_thresholded'] = train.Hillshade_Noon.map(lambda x: 0 if x<0 else(x if x < max_threshold else max_threshold))
test['Hillshade_Noon_thresholded'] = test.Hillshade_Noon.map(lambda x: 0 if x<0 else(x if x < max_threshold else max_threshold))

Z_train.Hillshade_Noon = train.Hillshade_Noon_thresholded
Z_test.Hillshade_Noon = test.Hillshade_Noon_thresholded

In [None]:
# max threshold for features
max_threshold = 255
train['Hillshade_3pm_thresholded'] = train.Hillshade_3pm.map(lambda x: 0 if x<0 else(x if x < max_threshold else max_threshold))
test['Hillshade_3pm_thresholded'] = test.Hillshade_3pm.map(lambda x: 0 if x<0 else(x if x < max_threshold else max_threshold))


Z_train.Hillshade_3pm = train.Hillshade_3pm_thresholded
Z_test.Hillshade_3pm = test.Hillshade_3pm_thresholded


## Hillshade variation over the day

In [None]:
Z_train['Hillshade_3pm-9am'] =  train['Hillshade_3pm'].astype(int) - train['Hillshade_9am'].astype(int)
Z_train['Hillshade_Noon-9am'] =  train['Hillshade_Noon'].astype(int) - train['Hillshade_9am'].astype(int)
Z_train['Hillshade_3pm-Noon'] =  train['Hillshade_3pm'].astype(int) - train['Hillshade_Noon'].astype(int)
Z_test['Hillshade_3pm-9am'] =  test['Hillshade_3pm'].astype(int) - test['Hillshade_9am'].astype(int)
Z_test['Hillshade_Noon-9am'] =  test['Hillshade_Noon'].astype(int) - test['Hillshade_9am'].astype(int)
Z_test['Hillshade_3pm-Noon'] =  test['Hillshade_3pm'].astype(int) - test['Hillshade_Noon'].astype(int)



## Hillshade Average and std deviation over the day

In [None]:
train['Hillshade_avg'] = train[['Hillshade_9am','Hillshade_3pm','Hillshade_Noon']].agg(func=np.mean,axis=1)
test['Hillshade_avg'] = test[['Hillshade_9am','Hillshade_3pm','Hillshade_Noon']].agg(func=np.mean,axis=1)

Z_train['Hillshade_avg'] = train['Hillshade_avg']
Z_test['Hillshade_avg'] = test['Hillshade_avg']

In [None]:
train['Hillshade_std'] = train[['Hillshade_9am','Hillshade_3pm','Hillshade_Noon']].agg(func=np.std,axis=1)
test['Hillshade_std'] = test[['Hillshade_9am','Hillshade_3pm','Hillshade_Noon']].agg(func=np.std,axis=1)

Z_train['Hillshade_std'] = train['Hillshade_std']
Z_test['Hillshade_std'] = test['Hillshade_std']
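
Row-wise mean and standard deviation can also be sketched with the built-in DataFrame methods. One assumption to flag: pandas' `.std` defaults to the sample standard deviation (`ddof=1`), while NumPy's `np.std` defaults to the population version (`ddof=0`), and which one a `func=np.std` aggregation ends up using may depend on the pandas version. Passing `ddof` explicitly removes the ambiguity. Toy values only:

```python
import pandas as pd

cols = ["Hillshade_9am", "Hillshade_Noon", "Hillshade_3pm"]

# Toy hillshade readings for two samples.
toy = pd.DataFrame({
    "Hillshade_9am": [200, 100],
    "Hillshade_Noon": [220, 100],
    "Hillshade_3pm": [240, 100],
})

toy["Hillshade_avg"] = toy[cols].mean(axis=1)
toy["Hillshade_std"] = toy[cols].std(axis=1, ddof=1)      # sample std
toy["Hillshade_std_pop"] = toy[cols].std(axis=1, ddof=0)  # population std
print(toy)
```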

In [None]:
plt.figure(figsize=(20,10))
ax = sns.scatterplot(data=Z_train,x='Elevation',y='Hillshade_avg',hue=train.Cover_Type,palette='Set2')
plt.xticks(rotation=45,horizontalalignment='right',fontweight='light')
ax.xaxis.set_major_locator(ticker.MultipleLocator(1000))
plt.show()

In [None]:
plt.figure(figsize=(20,10))
ax = sns.scatterplot(data=Z_train,x='Elevation',y='Hillshade_std',hue=train.Cover_Type,palette='Set2')
plt.xticks(rotation=45,horizontalalignment='right',fontweight='light')
ax.xaxis.set_major_locator(ticker.MultipleLocator(1000))
plt.show()

## Wilderness_Area 

In [None]:
Z_train['Wilderness_freq'] = Z_train.Wilderness_Area1 + Z_train.Wilderness_Area2 + Z_train.Wilderness_Area3 + Z_train.Wilderness_Area4
Z_test['Wilderness_freq'] = Z_test.Wilderness_Area1 + Z_test.Wilderness_Area2 + Z_test.Wilderness_Area3 + Z_test.Wilderness_Area4

## **Step 3: Random Forest Based Machine Learning Model**
    Parameters tested 
    - max_features = [0.3,0.5,0.7,1]
    - max_depth = [50,100,150]
    

In [None]:
# Free memory; test is kept because its Id column is needed for the submission
del train
gc.collect()

In [None]:

# Separate the target, then drop the columns that are not features
Z_y = Z_train.Cover_Type
Z_train = Z_train.drop(['Cover_Type','Id'],axis=1)
Z_test = Z_test.drop('Id',axis=1)




In [None]:

# The grid holds a single combination taken from the parameters tested above
Randomforest_6 = RandomForestClassifier()
parameters_grid = [{'max_features':[0.7],'max_depth':[100]}]
# 2-fold cross-validation, scored by accuracy
Randomforest_grid_search_6 = GridSearchCV(Randomforest_6,param_grid=parameters_grid,cv=2,scoring='accuracy')
Randomforest_grid_search_6.fit(Z_train,Z_y)


In [None]:
Randomforest_grid_search_6.best_score_
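
For reference, a search over the full set of parameters listed under Step 3 could look roughly like the sketch below. It runs on a small synthetic data set from `make_classification`, not on the competition data, and `n_estimators=20` and `random_state=0` are assumptions chosen only to keep it quick and repeatable. Also note that in scikit-learn an integer `max_features=1` means one feature per split, while the float `1.0` means all features; the sketch assumes the float was meant:

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Small synthetic stand-in for the real training data.
X, y = make_classification(n_samples=300, n_features=10, n_informative=5,
                           n_classes=3, random_state=0)

# Parameters tested; 1.0 (float) = use all features at each split.
param_grid = {"max_features": [0.3, 0.5, 0.7, 1.0],
              "max_depth": [50, 100, 150]}

search = GridSearchCV(
    RandomForestClassifier(n_estimators=20, random_state=0),
    param_grid=param_grid,
    cv=2,
    scoring="accuracy",
)
search.fit(X, y)

# One row per parameter combination, with its mean CV accuracy.
results = pd.DataFrame(search.cv_results_)[
    ["param_max_features", "param_max_depth", "mean_test_score"]
]
print(results.sort_values("mean_test_score", ascending=False))
print(search.best_params_, search.best_score_)
```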

In [None]:
y_test_predict_8 = Randomforest_grid_search_6.best_estimator_.predict(Z_test)

In [None]:

output = pd.DataFrame()
output['Id'] = test['Id']
output['Cover_Type'] = y_test_predict_8
output.to_csv('submission.csv',index=False)


In [None]:
# Rank the features by their importance in the fitted random forest
feature_importances = pd.DataFrame()
feature_importances['Features'] = list(Z_train.columns)
feature_importances['RF_feature_importance'] = Randomforest_grid_search_6.best_estimator_.feature_importances_
feature_importances.sort_values(by=['RF_feature_importance'],ascending=False)
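
The same ranking can be sketched more compactly as a pandas Series indexed by feature name. This runs on synthetic data with made-up feature names, so the numbers say nothing about the competition features:

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data with hypothetical feature names.
X, y = make_classification(n_samples=200, n_features=6, n_informative=3,
                           random_state=0)
X = pd.DataFrame(X, columns=[f"feature_{i}" for i in range(6)])

model = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)

# Impurity-based importances; they are normalized to sum to 1.
importances = pd.Series(model.feature_importances_, index=X.columns)
importances = importances.sort_values(ascending=False)
print(importances)
```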