# Forest Cover Type Project

## Load Data & Setup

In [None]:
%matplotlib inline

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, classification_report

In [None]:
import warnings
warnings.filterwarnings('ignore')

In [None]:
train = pd.read_csv('../input/forest-cover-type-prediction/train.csv')

# display train data
train.head()

In [None]:
# drop ID column
train = train.iloc[:,1:]
train.head()

In [None]:
# size of data frame
train.shape

Since all variables are integer-valued, there is no need for further type conversion.

In [None]:
# look at the data types of each feature and see if there needs to be any pre-processing
train.dtypes

## Exploratory Data Analysis
- Our dataset has **54** features and **1** target variable, `Cover_Type`.
- Of the 54 features, 10 are numeric and 44 are binary categorical indicators.
- Of the 44 binary features, 40 are `Soil_Type` indicators and 4 are `Wilderness_Area` indicators.
- The target variable `Cover_Type` takes the following forest cover types:
    1. Spruce/Fir
    2. Lodgepole Pine
    3. Ponderosa Pine
    4. Cottonwood/Willow
    5. Aspen
    6. Douglas-fir
    7. Krummholz

## Data Exploration
### Feature Statistics
- Part 1. Describe **numerical features**
- Part 2. Describe **binary/categorical features**

In [None]:
# extract all numerical features from train
num_features = train.iloc[:,:10]

# extract all binary features from train
cat_features = train.iloc[:, 10:-1]

#### Part 1. Describe numerical features
- The **mean** of the features varies from 16 to 2749.
- By **std**, `Horizontal_Distance_To_Roadways` is the most spread out, followed by `Horizontal_Distance_To_Fire_Points` and `Elevation`.
- The densest feature, with values closest to its mean, is `Slope`, followed by the 3 `Hillshade` features.
    - See **Boxplot #1** in *Feature Visualization Section*
- All features have a minimum value of 0 except `Elevation` and `Vertical_Distance_To_Hydrology`.
    - `Elevation` has the highest minimum value and `Vertical_Distance_To_Hydrology` has a negative minimum value.
- The `Hillshade` features, except `Hillshade_3pm`, have similar maximum values.
- `Horizontal_Distance_To_Fire_Points` has the highest maximum value, followed by `Horizontal_Distance_To_Roadways`. They also have the widest ranges of all features.
- `Slope` has the lowest maximum value and range. `Aspect`, also measured in degrees, is similarly bounded.

Note that some features are widely spread and take high values because 5 of the 10 numerical variables are measured in meters: `Elevation`, `Horizontal_Distance_To_Hydrology`, `Vertical_Distance_To_Hydrology`, `Horizontal_Distance_To_Roadways`, and `Horizontal_Distance_To_Fire_Points`. High values and wide ranges are therefore expected for these.

Features like `Aspect` and `Slope` are measured in degrees, which means their values cannot exceed 360. The `Hillshade` features can only take values up to 255.


In [None]:
num_features.describe()

#### Part 2. Describe categorical features
- Categorical variables have a value of either 0 or 1, so the **mean** tells us the share of observations in which the feature is present.
    - `Wilderness_Area3`, followed by `Wilderness_Area4`, has the highest mean. These areas have the most presence in the data compared to the other Wilderness Areas, so most of our observations come from `Wilderness_Area3` and `Wilderness_Area4`.
    - The fewest observations come from `Wilderness_Area2`.
- One more thing to notice is that the means of the `Wilderness_Area` columns add up to 0.999999, which is approximately 1. This suggests that each observation belongs to exactly one Wilderness Area. (Cross Check Here: **xx**)
- Probability wise, the next observation we get has a 42.0% probability of coming from `Wilderness_Area3`, a 30.9% probability of coming from `Wilderness_Area4`, and so on for the others.
    - We can look into more details with the following plot in the *Feature Visualization Section*: **Barplot #2**.
- Probability wise, we can document the same for the `Soil_Type` features too.
    - We can look at **Barplot** #3 and plot xx in *Feature Visualization Section*.
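
The cross-check on the `Wilderness_Area` means can be sketched on a small made-up frame. The `toy` data and its column names are hypothetical stand-ins, not the competition data. For 0/1 columns the mean is the share of rows equal to 1, and checking the row sums is the stricter test that each row belongs to exactly one category.

```python
import pandas as pd

# Hypothetical one-hot data: 5 observations, 3 mutually exclusive areas
toy = pd.DataFrame({
    "Area1": [1, 0, 0, 1, 0],
    "Area2": [0, 1, 0, 0, 0],
    "Area3": [0, 0, 1, 0, 1],
})

# For 0/1 columns, the mean is the fraction of rows equal to 1
shares = toy.mean()

# Every row belongs to exactly one area when every row sum is 1
one_area_per_row = bool((toy.sum(axis=1) == 1).all())

print(shares)
print("means sum to:", shares.sum())
print("exactly one area per row:", one_area_per_row)
```

Means that sum to 1 are consistent with one area per row but do not prove it, since a row with two areas and a row with none would cancel out. That is why the row-sum check is included.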


Looking at these statistics for the two data types, we can see that the features have different spreads and uneven distributions. We will therefore feature scale them so that all features fall in the same range, between 0 and 1. Some algorithms are sensitive to large values and can give poor results, while others are not. To be on the safe side, we will scale the features in the **Feature Engineering** section: **xx**.

In [None]:
cat_features.describe()

### Feature Skew
- For a normal distribution, the skewness is zero. Any symmetric data should therefore have a skewness near zero.
- Negative values indicate data that is skewed left: the left tail is long relative to the right tail.
- Positive values indicate data that is skewed right: the right tail is long relative to the left tail.
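
As a small illustration of the sign convention, here is a sketch using pandas `Series.skew()` on made-up series (the data below is hypothetical). The mostly-zero binary series mimics a rarely present `Soil_Type` column.

```python
import pandas as pd

symmetric = pd.Series([1, 2, 3, 4, 5])
right_skewed = pd.Series([0] * 95 + [1] * 5)  # mostly 0, long tail to the right
left_skewed = pd.Series([1] * 95 + [0] * 5)   # mostly 1, long tail to the left

print("symmetric:   ", symmetric.skew())
print("right skewed:", right_skewed.skew())
print("left skewed: ", left_skewed.skew())
```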

In [None]:
skew = train.skew()
skew_df = pd.DataFrame(skew, index=None, columns=['Skewness'])

In [None]:
print(skew)

#### Skewness Inferences
- `Soil_Type8` and `Soil_Type25` have the highest skewness, followed by `Soil_Type9`, `Soil_Type28` and `Soil_Type26`. This means that the mass of the distribution is concentrated on the left with a long tail to the right. This is called a **right-skewed distribution**.
    - We can see that almost all observations have a value of 0 for these features in the **Feature Visualization Section**: **Barplot #3**
- The `Hillshade` variables have a negatively skewed distribution.
- ML algorithms can be very sensitive to such ranges of data and can give us inappropriate/weak results. **Feature Scaling** will bring the ranges in line, as discussed earlier, although a linear rescaling does not change the skewness itself.
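
One caveat worth checking: min-max scaling is a linear rescaling, so it changes the range of a feature but leaves its skewness as it was. The sketch below applies the min-max formula by hand to a hypothetical right-skewed series and compares the skewness before and after.

```python
import pandas as pd

# Hypothetical right-skewed values
values = pd.Series([0, 0, 0, 1, 2, 3, 10, 50])

# Same formula as min-max scaling to the range [0, 1]
scaled = (values - values.min()) / (values.max() - values.min())

print("range before:", values.min(), values.max())
print("range after: ", scaled.min(), scaled.max())
print("skew before: ", values.skew())
print("skew after:  ", scaled.skew())
```

If skewness itself turns out to matter for a model, a non-linear transform would be needed. That is outside what this notebook does.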

In [None]:
plt.figure(figsize=(15,7))
sns.barplot(x=skew_df.index, y='Skewness', data=skew_df)
var = plt.xticks(rotation=90)

### Class Distribution
Now we will look at the class distribution for `Cover_Type` by grouping it and calculating total occurrence.


We can see that `Cover_Type` has an equal distribution.

In [None]:
train.groupby('Cover_Type').size()
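
A compact way to confirm balance is to compare class proportions. This is a minimal sketch on a hypothetical `cover` series; in the notebook the same call could be applied to `train['Cover_Type']`.

```python
import pandas as pd

# Hypothetical target column with three balanced classes
cover = pd.Series([1, 2, 3, 1, 2, 3], name="Cover_Type")

# Share of observations per class, ordered by class label
proportions = cover.value_counts(normalize=True).sort_index()

# Balanced when every class has the same share
is_balanced = proportions.nunique() == 1

print(proportions)
print("balanced:", is_balanced)
```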

### Feature Visualization
First, we will visualize the spread and outliers of the data of numerical features.

#### Boxplot #1: Numerical Features Inferences
- `Slope` has the most compressed box plot. Its narrow range means that the **median** and **mean** will be quite close.
- `Aspect` is the only feature with few or no outliers. Although both `Aspect` and `Slope` are measured in degrees, `Aspect` spans a much wider range than `Slope`, which has the lowest maximum value, so `Aspect` is less dense than `Slope`.
- The `Hillshade` features have plots similar to `Slope`: a small range with many outliers.
- `Vertical_Distance_To_Hydrology` is also similar to `Slope`, except that its minimum value is negative.
- `Elevation` is the only feature whose minimum value is above 0. Its box sits in the middle of the plot and it also has many outliers.
- `Horizontal_Distance_To_Roadways` has the most spread-out data of all features, consistent with its having the highest standard deviation. `Horizontal_Distance_To_Fire_Points` looks similar, but it has the highest maximum value.
    - If we compare these two features, the upper 50% of `Horizontal_Distance_To_Roadways` is much more spread out and less dense than that of `Horizontal_Distance_To_Fire_Points`, hence its higher standard deviation.

In [None]:
# plot bg
sns.set_style("whitegrid")

plt.subplots(figsize=(21,14))
color = sns.color_palette('pastel')
sns.boxplot(data=num_features, orient='h', palette=color)
plt.title('Spread of Data in Numerical Features', size=18)
plt.xlabel('Feature Value', size=16)  # horizontal boxplot: the x-axis shows feature values, not counts
plt.ylabel('Features', size=16)
plt.xticks(size=16)
plt.yticks(size=16)

sns.despine()
plt.show()

### Feature Distribution
Now we will plot how the observations are distributed across the `Wilderness_Area` features.

#### Barplot #2: Number of Observations of Wilderness Areas Inferences:
- Visually, we can see that `Wilderness_Area3` and `Wilderness_Area4` have the most presence.
- `Wilderness_Area2` has the fewest observations, which confirms that it has the least presence in our data.

In [None]:
# split cat_features
wild_data, soil_data = cat_features.iloc[:,:4], cat_features.iloc[:,4:]

# plot bg
sns.set_style("darkgrid", {'grid.color':'.1'})
flatui = ["#e74c3c", "#34495e", "#2ecc71","#3498db"]

# use seaborn, pass colors to palette
palette = sns.color_palette(flatui)

# sum the data, plot bar
wild_data.sum().plot(kind='bar', figsize=(10,8), color='#34a028')
plt.title('# of Observations of Wilderness Areas', size=18)
plt.xlabel('Wilderness Areas', size=16)
plt.ylabel('# of Observations', size=16)
plt.xticks(rotation='horizontal', size=12)
plt.yticks(size=12)

sns.despine()
plt.show()

In [None]:
# total count of each wilderness area
wild_data.sum()

#### Barplot #3: Number of Observations of Soil Type Inferences:


Now we will plot the number of observations for each `Soil_Type`.
- In the bar plot below, we can see many different distribution shapes showing up in pieces: **normal, bimodal, unimodal, and left- and right-skewed distributions**.
- The most observations come from `Soil_Type10`, followed by `Soil_Type29`.
    - From a statistical analysis, `Soil_Type10` is present in 14.1% of observations in the data.
    - `Soil_Type10` also had the least skewed value of all Soil Types, as we saw earlier in data exploration.
- The variables with the fewest observations are `Soil_Type7` and `Soil_Type15`.
    - The Soil Types have the most skewed values because so few observations take the value 1. The values are densely concentrated at 0 with a long, flat tail to the right, which gives a **positively skewed distribution** or **right-skewed distribution** (Details in *Feature Skew* Section).

In [None]:
# plot bg
sns.set_style("darkgrid", {'grid.color': '.1'})

# sum data, plot bar
soil_data.sum().plot(kind='bar', figsize=(24,12), color='#a87539')
plt.title('# of Observations of Soil Types', size=18)
plt.xlabel('Soil Types', size=16)
plt.ylabel('# of Observations', size=16)
plt.xticks(rotation=90, size=14)
plt.yticks(size=14)

sns.despine()
plt.show()

In [None]:
# statistical description of highest observation of soil type
soil_data.loc[:,'Soil_Type10'].describe()

In [None]:
# plot bg
sns.set_style("darkgrid", {'grid.color': '.1'})  # key is 'grid.color'; 'grid_color' would be silently ignored

# sum soil data, pass it as a series
soil_sum = pd.Series(soil_data.sum())
soil_sum.sort_values(ascending=False, inplace=True)

# plot bar
soil_sum.plot(kind='barh', figsize=(23,17), color='#a87539')
plt.gca().invert_yaxis()
plt.title('# of Observations of Soil Types', size=18)
plt.xlabel('# of Observation', size=16)
plt.ylabel('Soil Types', size=16)
plt.xticks(rotation='horizontal',size=14)
plt.yticks(size=14)

sns.despine()
plt.show()

### Feature Comparison
Next we will compare each feature in our data to the target variable. This will help us visualize how dense and how widely distributed each class of the target variable is for a given feature. We will use violin plots for this.


#### Violin Plot 4.1 Numerical Features Inferences:
- `Elevation`
    - `Cover_Type4` has most of its forest cover at elevations between 2000m and 2500m.
    - `Cover_Type3` has the least presence around that same elevation.
    - `Cover_Type7` contains the most elevated trees, ranging from as low as ~2800m to as high as ~3800m.
        - The maximum elevation value belongs to this forest type.
        - This will be an important feature, since it tells a different story for each class of forest cover type. This could be useful in our algorithm.
- `Aspect`
    - This feature has a normal distribution for each class.
- `Slope`
    - `Slope` has lower values than most features. It is measured in degrees, like `Aspect`, but takes much smaller values.
    - It has the lowest maximum value of all features. Looking at the plot, we can say that this maximum belongs to `Cover_Type2`.
    - All classes have dense slope observations between 0 and 20 degrees.
- `Horizontal_Distance_To_Hydrology`
    - This has a right (positively) skewed distribution, where most of the values for all classes fall between 0 and 50m.
- `Vertical_Distance_To_Hydrology`
    - This is also a positively skewed distribution, but for most observations it takes values much closer to 0 for all classes.
    - The highest value of this feature belongs to `Cover_Type2`, and so does the lowest minimum value. `Cover_Type2` therefore has the widest range of observations compared to the other classes.
- `Hillshade_9am` and `Hillshade_Noon` have left (negatively) skewed distributions, with most observations in each class taking index values between 200 and 250.
- `Hillshade_3pm` has a normal distribution for all classes.

In [None]:
# plot bg
sns.set_style("darkgrid", {'grid.color': '.1'})

# set target variable
target = train['Cover_Type']

# features to be compared with target variable
features = num_features.columns

# loop for violin plot
for i in range(0, len(features)):
    plt.subplots(figsize=(16,11))
    sns.violinplot(data=num_features, x=target, y=features[i])
    plt.xticks(size=14)
    plt.yticks(size=14)
    plt.xlabel('Forest Cover Types', size=18)
    plt.ylabel(features[i], size=18)
    
    plt.show()

#### Violin Plot 4.2 Wilderness Area Inferences:
- Observations in `Wilderness_Area1` belong to `Cover_Type1`, `Cover_Type2`, and `Cover_Type5`.
- Observations in `Wilderness_Area3` belong to all classes except `Cover_Type4`.
- `Wilderness_Area2` has the fewest observations. In the plots, `Wilderness_Area2` and `Wilderness_Area4` show less density at 1 across the classes than `Wilderness_Area1` and `Wilderness_Area3`.

In [None]:
# plot bg
sns.set_style("darkgrid", {'grid.color': '.1'})

# set target variable
target = train['Cover_Type']
# features to be compared with target variable
features = wild_data.columns

# loop for violin plots
for i in range(0, len(features)):
    
    plt.subplots(figsize=(13,9))
    sns.violinplot(data=wild_data, x=target, y=features[i])
    plt.xticks(size=14)
    plt.yticks(size=14)
    plt.xlabel('Forest Cover Types', size=16)
    plt.ylabel(features[i], size=16)
    
    plt.show()

#### Violin Plot 4.3 Soil Type Inferences:
- `Soil_Type4` is the only soil type that is present in all forest cover types.
- `Soil_Type`: 7 and 15 visually have little to no presence in any forest cover type.
- `Soil_Type`: 3 and 6 are present in `Cover_Type`: 2, 3, 4, 6.
- `Soil_Type`: 10, 11, 16, and 17 are present in `Cover_Type` 1 through 6.
- `Soil_Type`: 23, 24, 31 and 33 are present in `Cover_Type`: 1, 2, 5, 6, 7.
- `Soil_Type`: 29 and 30 are present in `Cover_Type`: 1, 2, 5, 7.
- `Soil_Type`: 22, 27, 35, 38, 39, and 40 are present in `Cover_Type`: 1, 2, and 7.
- `Soil_Type`: 18 and 28 are present in `Cover_Type`: 2 and 5.
- `Soil_Type`: 19 and 26 are present in `Cover_Type`: 1, 2, and 5.
- `Soil_Type`: 8 and 25 are present only in `Cover_Type2`.
- `Soil_Type`: 1, 5, and 14 are present in `Cover_Type`: 3, 4, and 6.
- `Soil_Type37` is present in `Cover_Type7`.

- `Cover_Type4` has the lowest `Soil_Type` count.
- `Cover_Type2` has the highest `Soil_Type` count.

In [None]:
# plot bg
sns.set_style("darkgrid", {'grid.color':'.1'})

# set target variable
target = train['Cover_Type']
# features compare with target variable
features = soil_data.columns

# violin for loop
for i in range(0, len(features)):
    plt.subplots(figsize=(13,9))
    sns.violinplot(data=soil_data, x=target, y=features[i])
    plt.xticks(size=14)
    plt.yticks(size=14)
    plt.xlabel('Forest Cover Types', size=16)
    plt.ylabel(features[i], size=16)
    
    plt.show()

### Feature Correlation
Part of our data is binary. Pearson correlation is most meaningful for continuous data, so we will exclude the binary features from the **correlation matrix**.

- Features with little or no correlation are colored **black**.
- Features with positive correlation are colored **orange**.
- Features with negative correlation are colored **blue**.


#### Correlation Plot #5 Inferences:
- `Hillshade_3pm` and `Hillshade_9am` show a high negative correlation.
- `Hillshade_3pm` and `Aspect` show a high positive correlation.
- `Hillshade_3pm` and `Aspect` also had the most normal distribution compared to forest cover type classes (**Plot 4.1**)
- The following pairs had a positive correlation:
    - `Vertical_Distance_To_Hydrology` and `Horizontal_Distance_To_Hydrology`
    - `Horizontal_Distance_To_Roadways` and `Elevation`
    - `Hillshade_3pm` and `Aspect`
    - `Hillshade_3pm` and `Hillshade_Noon`
- The following pairs had a negative correlation:
    - `Hillshade_9am` and `Aspect`
    - `Hillshade_Noon` and `Slope`
- The following pair has no correlation:
    - `Hillshade_9am` and `Horizontal_Distance_To_Roadways`
- The least correlated value tells us that each feature has different valuable information that could be important features for predictions.

In [None]:
plt.subplots(figsize=(15,10))

# compute correlation matrix
num_features_corr = num_features.corr()

# generate mask for upper triangle
mask = np.zeros_like(num_features_corr, dtype=bool)  # np.bool was removed from NumPy; use the builtin bool
mask[np.triu_indices_from(mask)] = True

# generate heatmap masking the upper triangle and shrink the cbar
sns.heatmap(num_features_corr, mask=mask, center=0, square=True, annot=True, annot_kws={"size": 15}, cbar_kws={"shrink": .8})
plt.xticks(size=13)
plt.yticks(size=13)

plt.show()

#### Scatterplot #6 Features with correlation greater than 0.5
Let's look at the paired features with correlation greater than 0.5. These will be the feature pairs with a positive correlation.
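
The pairs above the threshold could also be pulled out of the correlation matrix programmatically rather than read off the heatmap. The following is a minimal sketch under stated assumptions: `correlated_pairs` is a hypothetical helper and `toy` is made-up data. In the notebook it could be called on `num_features`.

```python
import numpy as np
import pandas as pd

def correlated_pairs(df, threshold=0.5):
    """Return feature pairs whose Pearson correlation exceeds `threshold`."""
    corr = df.corr()
    # Keep only the strictly upper triangle so each pair is listed once
    upper_mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(upper_mask).stack()
    # NaN entries (masked cells) compare as False and are filtered out here
    return pairs[pairs > threshold].sort_values(ascending=False)

# Hypothetical data: a and b rise together, c moves the opposite way
toy = pd.DataFrame({
    "a": [1, 2, 3, 4, 5],
    "b": [2, 4, 6, 8, 10.5],
    "c": [5, 3, 4, 1, 2],
})

strong_pairs = correlated_pairs(toy)
print(strong_pairs)
```

The result is a series indexed by feature pairs, so its index could be iterated directly in a plotting loop.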

#### Inferences:
- `Hillshade_3pm` and `Aspect` follow a **sigmoid-shaped** relationship. The data points at the boundaries mostly belong to `Cover_Type`: 3, 4, 5.
- `Vertical_Distance_To_Hydrology` and `Horizontal_Distance_To_Hydrology` follow a roughly **linear** relationship, but with more spread.
    - `Cover_Type`: 1, 2, 7 have more spread-out observations.
    - `Cover_Type`: 3, 4, 5, 6 are more densely packed between 0 and 600m of `Horizontal_Distance_To_Hydrology`.
- `Elevation` and `Horizontal_Distance_To_Roadways` follow a spread-out **linear** relationship.
    - `Cover_Type` 1, 2, and 7 have the highest elevation and a wide spread of points from 0m to ~7000m of `Horizontal_Distance_To_Roadways`.
    - `Cover_Type` 4 and 6 are densely clustered where both elevation and horizontal distance to roadways are low.
- `Hillshade_Noon` and `Hillshade_3pm`
    - `Cover_Type` 1, 2, 6 and 7 have a higher hillshade index at noon and 3pm.
    - `Cover_Type` 4 and 5 have a lower hillshade index at noon and 3pm.

In [None]:
# plot bg
sns.set_style("darkgrid", {'grid.color': '.1'})

# paired features with positive correlation
list_data_corr = [['Horizontal_Distance_To_Hydrology','Vertical_Distance_To_Hydrology'],
                  ['Elevation','Horizontal_Distance_To_Roadways'],
                  ['Aspect','Hillshade_3pm'],
                  ['Hillshade_3pm','Hillshade_Noon']]

# loop through outer list
# take 2 features from inner list
for i,j in list_data_corr:
    plt.subplots(figsize=(15,12))
    sns.scatterplot(data=train, x=i, y=j, hue="Cover_Type", legend='full', palette='rainbow_r')
    plt.xticks(size=15)
    plt.yticks(size=15)
    plt.xlabel(i, size=16)
    plt.ylabel(j, size=16)
    
    plt.show()

# Feature Engineering
We will do the following in the Feature Engineering section:
- Check whether any observation is marked present in more than one `Wilderness_Area` or more than one `Soil_Type`.
- Delete columns that have a value of 0 for all observations.
- Delete observations that have null values in any feature.
- Delete any duplicate entries.
- Reduce the features by keeping the best ones.
- Scale values to a specific range.
- Perform the train-test split.
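
The step about all-zero columns can be sketched as follows. This is an illustration on a hypothetical `toy` frame with made-up column names. In the notebook the same expression could be applied to `train` to see whether any indicator column is never present.

```python
import pandas as pd

# Hypothetical indicator columns; Soil_TypeB is never present
toy = pd.DataFrame({
    "Soil_TypeA": [0, 1, 0, 1],
    "Soil_TypeB": [0, 0, 0, 0],
    "Soil_TypeC": [1, 0, 1, 0],
})

# Names of the columns in which every value is 0
empty_cols = toy.columns[(toy == 0).all()].tolist()

# Drop them, keeping the result in a new frame
cleaned = toy.drop(columns=empty_cols)

print("all-zero columns:", empty_cols)
print("remaining columns:", list(cleaned.columns))
```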

## Observation Cleaning
It is possible that `Soil_Type` or `Wilderness_Area` is recorded as present for more than one type, or for none, in a single observation. We will check each feature group.

#### Inference:
For both `Soil_Type` and `Wilderness_Area`, no observation is present in more than one type or in none.

#### 1. Check for `Wilderness_Area`

In [None]:
# number of wilderness areas marked present for each observation
row_totals = wild_data.sum(axis=1)

# count observations with more than 1 presence
# (the earlier loop stopped at the first such row because of a `break`)
more_count = (row_totals > 1).sum()

# count observations with no presence
none_count = (row_totals == 0).sum()

print(f'We have {more_count} observations that show presence in more than 1 Wilderness Area.')
print(f'We have {none_count} observations that show no presence in any Wilderness Area.')

#### 2. Check for `Soil_Type`

In [None]:
# number of soil types marked present for each observation
row_totals = soil_data.sum(axis=1)

# count observations with more than 1 presence
# (the earlier loop stopped at the first such row because of a `break`)
more_count = (row_totals > 1).sum()

# count observations with no presence
none_count = (row_totals == 0).sum()

print(f'We have {more_count} observations that show presence in more than 1 Soil Type.')
print(f'We have {none_count} observations that show no presence in any Soil Type.')

### Handling Missing Values
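It looks like we have no missing values.

In [None]:
# drop observations with null values; assign the result, since dropna() returns a new dataframe
train = train.dropna()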

In [None]:
train.shape

### Handling Duplicates
There are no duplicates.

In [None]:
# delete duplicates, except the first observation; assign the result so the change is kept
train = train.drop_duplicates(keep='first')

In [None]:
train.shape

# Dimensionality Reduction
- Based on the EDA, we have a lot of observations and features to train the model on. This can make the algorithms run slowly and can make it harder for the ML models to learn, leading to overfitting on the training set and worse results in submission/testing.
- The missing-value and duplicate checks removed nothing, and we have not dropped any `Wilderness_Area` or `Soil_Type` category by hand, since even a sparsely populated category may be important for the ML models when predicting classes.

To approach the problem, in this section we will see how much impact each feature has on predicting classes. We will use the following classifiers: **Extra Trees, Random Forest, Gradient Boosting** and **AdaBoost**. Each exposes the attribute `feature_importances_`, which shows which features are more important than others and by how much.

#### Dimensionality Reduction Inferences:
- We can see that **RFC** and **ETC** show similar results. The features appear in different ranks, but the differences are small.
- In **ADB**, the top 8 features are enough to predict classes. It is interesting that `Wilderness_Area4` ties with `Elevation`, because it scores quite low in the other classifiers.
- `Elevation` is similarly dominant in each classifier.
- `Hillshade` features appear in the top 10 list of every classifier except **ADB**.
- In the *correlation* part of the feature visualization section, we saw that the `Hillshade` features were noticeably correlated with each other and with features like `Slope`, `Aspect`, and `Horizontal_Distance_To_Roadways`. They also rank highly here, meaning that although they are correlated, they carry useful information for predicting the target variable.
- `Elevation` and the `Vertical` and `Horizontal` Distance to Hydrology features appear in the top 10 for all classifiers.
- `Horizontal Distance To Roadways` and `Fire Points`, which have the highest standard deviations and many outliers, rank highly in all classifiers except **ADB**.

All this being said, these classifiers show that numerical features dominate when it comes to predicting forest cover classes. We will consider the top 15 to 20 features a reasonable choice.

### Extra-Trees Classifier

In [None]:
from sklearn.ensemble import ExtraTreesClassifier

etc_model = ExtraTreesClassifier(random_state = 53) # pass the model
X = train.iloc[:,:-1] # feed features to var X
y = train['Cover_Type'] # feed target variable to y

etc_model.fit(X,y) # train the ETC model

# extract feature importances
etc_feature_importances = pd.DataFrame(etc_model.feature_importances_, index=X.columns,
                                      columns=['ETC']).sort_values('ETC', ascending=False)

etc_model = None # remove trace of this ETC model
etc_feature_importances.head(10)

### Random Forest Classifier

In [None]:
from sklearn.ensemble import RandomForestClassifier

rfc_model = RandomForestClassifier(random_state = 53) # pass the model
rfc_model.fit(X,y) # train the model

# extract feature importances
rfc_feature_importances = pd.DataFrame(rfc_model.feature_importances_, index=X.columns, 
                                       columns=['RFC']).sort_values('RFC', ascending=False)

rfc_model = None # remove trace of this RFC model
rfc_feature_importances.head(10)

### AdaBoost Classifier

In [None]:
from sklearn.ensemble import AdaBoostClassifier

adb_model = AdaBoostClassifier(random_state = 53) # pass the model
adb_model.fit(X,y) # train the model

# extract feature importances
adb_feature_importances = pd.DataFrame(adb_model.feature_importances_, index=X.columns,
                                      columns=['ADB']).sort_values('ADB', ascending=False)

adb_model = None # remove trace of this ADB model
adb_feature_importances.head(10)

### Gradient Boosting Classifier

In [None]:
from sklearn.ensemble import GradientBoostingClassifier

gbc_model = GradientBoostingClassifier(random_state = 53) # pass the model
gbc_model.fit(X,y) # train the model

# extract feature importances
gbc_feature_importances = pd.DataFrame(gbc_model.feature_importances_, index=X.columns,
                                      columns=['GBC']).sort_values('GBC', ascending=False)

gbc_model = None # remove trace of GBC model
gbc_feature_importances.head(10)

We store the top 20 features, together with the target variable, in a new dataframe. The selection takes the features that appear near the top in all four classifiers and adds the 10 **soil types** that ranked highest.

In [None]:
sample = train[[
    'Elevation','Horizontal_Distance_To_Roadways','Horizontal_Distance_To_Fire_Points','Wilderness_Area4',
    'Horizontal_Distance_To_Hydrology','Vertical_Distance_To_Hydrology','Aspect','Hillshade_3pm','Hillshade_Noon',
    'Hillshade_9am','Soil_Type28','Soil_Type18','Soil_Type19','Soil_Type20','Soil_Type21','Soil_Type22',
    'Soil_Type10','Soil_Type3','Soil_Type30','Soil_Type4','Cover_Type'
]]
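
The hand-picked list above could also be derived from the importance tables. The sketch below is one possible approach, not the method used in this notebook: it averages importances across classifiers and keeps the top `k`. The scores and the `top_k_features` helper are hypothetical. In the notebook, the four `*_feature_importances` frames could be combined with `pd.concat(..., axis=1)` first, which aligns them on the feature index.

```python
import pandas as pd

# Hypothetical importance scores for four features from two classifiers
importances = pd.DataFrame(
    {"ETC": [0.40, 0.30, 0.20, 0.10],
     "RFC": [0.50, 0.10, 0.30, 0.10]},
    index=["Elevation", "Aspect", "Slope", "Soil_Type10"],
)

def top_k_features(importances, k):
    """Rank features by mean importance across classifiers and keep the top k."""
    mean_importance = importances.mean(axis=1).sort_values(ascending=False)
    return mean_importance.head(k).index.tolist()

selected = top_k_features(importances, k=2)
print(selected)
```

Averaging raw importances treats every classifier equally. Averaging ranks instead would be a reasonable alternative when the classifiers report scores on very different scales.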

## Feature Scaling
Before the train-test split, we will scale all feature values to the range 0 to 1. We first separate the features from the target variable, because we do not want to scale the target.

In [None]:
from sklearn.preprocessing import MinMaxScaler

# pass range to the function and then save it
scaler = MinMaxScaler(feature_range = (0,1))

X = sample.iloc[:,:-1] # feed sample features to X
y = sample['Cover_Type'] # feed target variable to y

X_scaled = scaler.fit_transform(X) # apply feature scaling to all features

In [None]:
X_scaled

## Train-Test Split
Now we can split into 75% - 25% train-test set respectively.

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.25, random_state=53)

In [None]:
print(X_train.shape, X_test.shape)
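
One thing to be aware of: fitting the scaler on all rows before splitting lets the minima and maxima of the test rows influence the training features. A common alternative, sketched here on synthetic data, is to split first and put the scaler inside a scikit-learn `Pipeline`, so it is fit only on the training portion (and on each training fold during cross-validation). The names `X_demo`, `y_demo`, `X_tr`, `X_te` and `pipe` are hypothetical and chosen so they do not overwrite the notebook's variables.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in for the real features and target
X_demo, y_demo = make_classification(n_samples=300, n_features=8, n_informative=5,
                                     n_classes=3, random_state=53)

# Split first, so the test rows never reach the scaler during fitting
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.25,
                                          random_state=53)

# The scaler is refit on the training part of every CV fold
pipe = Pipeline([
    ("scale", MinMaxScaler(feature_range=(0, 1))),
    ("clf", ExtraTreesClassifier(n_estimators=50, random_state=53)),
])

cv_scores = cross_val_score(pipe, X_tr, y_tr, cv=5, scoring="accuracy")

# Fit on the full training set, then score once on the held-out rows
pipe.fit(X_tr, y_tr)
test_accuracy = pipe.score(X_te, y_te)

print("mean CV accuracy:", cv_scores.mean())
print("test accuracy:", test_accuracy)
```

For tree ensembles the effect of scaling is usually small, but distance-based models such as KNN are more sensitive to it.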

# Model Evaluation
Next we will feed our data to the models and see how each performs using two evaluation metrics: **accuracy** and **f1 score**:
- **Accuracy** is the number of correctly predicted observations divided by the total number of observations. It ranges between 0 and 1, where 0 means no observation was predicted correctly and 1 means all were.
- **f1 score** is more useful than accuracy especially when the class distribution is uneven (ours is balanced, as seen in *Class Distribution*). It is the harmonic mean of precision and recall, so it takes both false positives and false negatives into account.
- Accuracy works best if false positives and false negatives have similar cost. If the costs of false positives and false negatives are very different, it's better to look at both precision and recall, or at the f1 score.
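
To make the two metrics concrete, here is a small worked sketch on made-up labels. `f1_from_counts` is a hypothetical helper that applies F1 = 2 · precision · recall / (precision + recall) to hand-counted true positives, false positives and false negatives for class 2, and the result is compared with scikit-learn's per-class value.

```python
from sklearn.metrics import accuracy_score, f1_score

# Hypothetical true and predicted labels for 6 observations
y_true = [1, 1, 2, 2, 3, 3]
y_pred = [1, 2, 2, 2, 3, 1]

acc = accuracy_score(y_true, y_pred)                   # share of correct predictions
per_class_f1 = f1_score(y_true, y_pred, average=None)  # one F1 value per class
macro_f1 = f1_score(y_true, y_pred, average="macro")   # unweighted mean over classes

def f1_from_counts(tp, fp, fn):
    """F1 as the harmonic mean of precision and recall."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Class 2: two correct hits, one observation of class 1 wrongly predicted as 2, no misses
manual_f1_class2 = f1_from_counts(tp=2, fp=1, fn=0)

print("accuracy:", acc)
print("per-class f1:", per_class_f1)
print("macro f1:", macro_f1)
print("manual f1 for class 2:", manual_f1_class2)
```

The `f1_macro` scorer used later in the notebook corresponds to `average="macro"`, which gives every class equal weight.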


We will do the following:
- Train the benchmark model (Naive Bayes Classifier) on the training set and test its performance.
- Use 10-fold cross-validation (CV) to test the performance of each model.

First we will define a function to train the models using the training data and calculate model's performance using `accuracy` and `f1 score`.

In [None]:
from sklearn.model_selection import cross_val_score
import time

# function
def model_evaluation(clf):
    clf = clf # pass classifier to variable
    
    t_start = time.time() # record time
    clf = clf.fit(X_train, y_train) # classifier learning model
    t_end = time.time() # record time
    
    c_start = time.time() # record time
    accuracy = cross_val_score(clf, X_train, y_train, cv=10, scoring='accuracy')
    f1_score = cross_val_score(clf, X_train, y_train, cv=10, scoring='f1_macro')
    c_end = time.time() # record time
    
    # calculate mean of all 10 obs' accuracy and f1 as percent
    acc_mean = np.round(accuracy.mean() * 100, 2)
    f1_mean = np.round(f1_score.mean() * 100, 2)
    
    t_time = np.round((t_end - t_start) / 60, 3) # time for training
    c_time = np.round((c_end - c_start) / 60, 3) # time for evaluating scores
    
    clf = None # remove traces of classifier
    
    print(f'The accuracy score of this classifier is: {acc_mean}%.')
    print(f'The f1 score of this classifier is: {f1_mean}%.')
    print(f'This classifier took {t_time} minutes to train and {c_time} minutes to evaluate CV and metric scores.')
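
The function above runs cross-validation twice, once per metric. As a possible alternative, scikit-learn's `cross_validate` accepts several scorers and fits each fold only once, and it also reports fit and score times. The sketch below uses synthetic data and 5 folds to keep it quick. `X_demo` and `y_demo` are hypothetical stand-ins for `X_train` and `y_train`.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.model_selection import cross_validate

# Synthetic stand-in for the training data
X_demo, y_demo = make_classification(n_samples=200, n_features=8, n_informative=5,
                                     n_classes=3, random_state=53)

# One pass over the folds computes both metrics
results = cross_validate(
    ExtraTreesClassifier(n_estimators=20, random_state=53),
    X_demo, y_demo, cv=5, scoring=["accuracy", "f1_macro"],
)

acc_mean = results["test_accuracy"].mean()
f1_mean = results["test_f1_macro"].mean()
fit_seconds = results["fit_time"].sum()

print(f"accuracy: {acc_mean:.3f}, f1: {f1_mean:.3f}, total fit time: {fit_seconds:.2f}s")
```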

### Benchmark Model: `MultinomialNB Classifier`
We will now see how `MultinomialNB Classifier` performs on the given training data. It runs quite quickly, but has poor **precision** and **recall**.

In [None]:
from sklearn.naive_bayes import MultinomialNB

model_evaluation(MultinomialNB())

## Models
Now we will move on to measure performance:
1. K-Nearest Neighbor (KNN)
2. Random Forest (RF)
3. Stochastic Gradient Descent Classifier (SGDC)
4. Extra Trees Classifier (ETC)
5. Logistic Regression (LR)

### 1. K-Nearest Neighbors

In [None]:
from sklearn.neighbors import KNeighborsClassifier
model_evaluation(KNeighborsClassifier(n_jobs=-1))

### 2. Random Forest Classifier

In [None]:
from sklearn.ensemble import RandomForestClassifier
model_evaluation(RandomForestClassifier(n_jobs=-1, random_state=53))

### 3. Stochastic Gradient Descent Classifier

In [None]:
from sklearn.linear_model import SGDClassifier
model_evaluation(SGDClassifier(n_jobs=-1, random_state=53))

### 4. Extra Trees Classifier

In [None]:
from sklearn.ensemble import ExtraTreesClassifier
model_evaluation(ExtraTreesClassifier(n_jobs=-1, random_state=53))

### 5. Logistic Regression

In [None]:
from sklearn.linear_model import LogisticRegression
model_evaluation(LogisticRegression(n_jobs=-1, random_state=53, solver='liblinear'))

#### Model Inferences:
- All models work better than our benchmark model.
- **ETC** performs the best, with an accuracy of 84.59% and an f1 score of 84.32%, while taking the least amount of time to run cross-validation and compute the metrics. It also performed well with default parameters.
- **RF** performs second best to ETC. Their results are high and close to each other, and some tuning of the parameters would probably improve them further.
- **KNN** performs a little under RF. This is expected, because KNN works well with datasets that contain duplicates. Since we checked that the data has no duplicates, KNN can only choose the most similar samples by distance, while RF and ETC can learn other definitions of locality, which may stretch far along some features and stay short along others.

# Train Final Model
We will be choosing `ExtraTreesClassifier` for our submission model.

In [None]:
from sklearn.metrics import accuracy_score, f1_score

clf = ExtraTreesClassifier(n_estimators=50, random_state=53) # best classifier
clf = clf.fit(X_train, y_train) # train model
predict = clf.predict(X_test) # predict unseen data
accuracy = accuracy_score(y_test, predict) # calculate accuracy
f1 = f1_score(y_test, predict, average='macro') # calculate f1 score (separate name so the imported f1_score function is not overwritten)

accuracy = np.round(accuracy * 100, 3)
f1 = np.round(f1 * 100, 3)

clf = None # clean traces

print(f'The accuracy score of our final model ETC on our testing set is {accuracy}%.')
print(f'The f1 score of our final model ETC on our testing set is {f1}%.')
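
The setup cell imports `confusion_matrix` and `classification_report`, which give a per-class view that the two summary numbers hide. The sketch below shows their use on made-up labels. In the notebook they could be called with `y_test` and `predict` before `clf` is cleared.

```python
from sklearn.metrics import classification_report, confusion_matrix

# Hypothetical true and predicted labels for 6 observations
y_true_demo = [1, 1, 2, 2, 3, 3]
y_pred_demo = [1, 2, 2, 2, 3, 1]

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_true_demo, y_pred_demo, labels=[1, 2, 3])

# Precision, recall and f1 for each class as formatted text
report = classification_report(y_true_demo, y_pred_demo, labels=[1, 2, 3])

print(cm)
print(report)
```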