# Forest Cover Type Project Exploratory Data Analysis

In the Forest Cover Type competition, we are asked to predict the forest cover type (the predominant kind of tree cover) from cartographic variables. The purpose of this EDA notebook is to show how Python visualization tools can be used to understand a large, complex dataset. EDA is the first step in the workflow and informs the later feature selection decisions. Valuable insights can be obtained by looking at the distribution of the target, the relationship between each feature and the target, and the relationships among the features.

# Load Data & Setup

In [None]:
%matplotlib inline

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, classification_report

import warnings
warnings.filterwarnings('ignore')

In [None]:
train = pd.read_csv('../input/forest-cover-type-prediction/train.csv')

# display train data
train.head()

In [None]:
# drop ID column
train = train.iloc[:,1:]
train.head()

In [None]:
# size of data frame
train.shape

Since all variables are integers, no further type conversion is needed.

In [None]:
# look at the data types of each feature and see if there needs to be any pre-processing
train.dtypes
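The claim that every column is an integer (and that nothing is missing) can also be checked programmatically rather than by reading the `dtypes` output. The following is a minimal sketch on a made-up stand-in frame called `toy`; the values are invented and only the column names mirror the dataset.

```python
import pandas as pd

# Toy stand-in for `train`; values are made up for illustration.
toy = pd.DataFrame({
    "Elevation": [2596, 2590, 2804],
    "Slope": [3, 2, 9],
    "Wilderness_Area1": [1, 0, 1],
    "Cover_Type": [5, 5, 2],
})

# Columns whose dtype is not an integer type (expected: none).
non_integer_cols = [
    col for col in toy.columns
    if not pd.api.types.is_integer_dtype(toy[col])
]

# Total number of missing values across the whole frame.
n_missing = int(toy.isna().sum().sum())

print("non-integer columns:", non_integer_cols)
print("missing values:", n_missing)
```

Replacing `toy` with `train` would run the same check on the real data.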

# Exploratory Data Analysis
- Our dataset has **54** features and **1** target variable, `Cover_Type`. 
- Of the 54 features, 10 are numeric and 44 are binary (one-hot encoded categorical).
- Of the 44 binary features, 40 are `Soil_Type` indicators and 4 are `Wilderness_Area` indicators.
- The target variable `Cover_Type` contains the following forest cover types:
    1. Spruce/Fir
    2. Lodgepole Pine
    3. Ponderosa Pine
    4. Cottonwood/Willow
    5. Aspen
    6. Douglas-fir
    7. Krummholz

# Data Exploration
# Feature Statistics
- Part 1. Describe **numerical features**
- Part 2. Describe **binary/categorical features**

In [None]:
# extract all numerical features from train
num_features = train.iloc[:,:10]

# extract all binary features from train
cat_features = train.iloc[:, 10:-1]
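Positional slicing depends on the exact column order, which changes whenever a column such as the ID is dropped or a new one is added. As a hedged alternative, the same groups can be selected by column-name prefix. The frame `toy` below is a made-up miniature, not the real data.

```python
import pandas as pd

# Miniature frame with the same naming scheme as the dataset (values invented).
toy = pd.DataFrame({
    "Elevation": [2596, 2590],
    "Wilderness_Area1": [1, 0],
    "Wilderness_Area2": [0, 1],
    "Soil_Type1": [0, 1],
    "Soil_Type2": [1, 0],
    "Cover_Type": [5, 2],
})

# Pick the one-hot groups by prefix instead of by position.
wild_cols = [c for c in toy.columns if c.startswith("Wilderness_Area")]
soil_cols = [c for c in toy.columns if c.startswith("Soil_Type")]
binary_cols = wild_cols + soil_cols

# Everything that is neither binary nor the target is treated as numeric.
numeric_cols = [
    c for c in toy.columns
    if c not in binary_cols and c != "Cover_Type"
]

print(numeric_cols, wild_cols, soil_cols)
```

This selection keeps working even if helper columns are appended to the frame later in the notebook.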

#### Part 1. Describe numerical features
- The **mean** of the features ranges from about 16 to about 2749.
- **std**: `Horizontal_Distance_To_Roadways` has the largest standard deviation (the most spread-out values), followed by `Horizontal_Distance_To_Fire_Points` and `Elevation`.
- `Slope` has the values most tightly clustered around the mean, followed by the three `Hillshade` features.
    - See **Boxplot #1** in the *Feature Visualization* section.
- All features have a minimum value of 0 except `Elevation` and `Vertical_Distance_To_Hydrology`.
    - `Elevation` has the highest minimum value, and `Vertical_Distance_To_Hydrology` has a negative minimum.
- The `Hillshade` features other than `Hillshade_3pm` have similar maximum values.
- `Horizontal_Distance_To_Fire_Points` has the highest maximum value, followed by `Horizontal_Distance_To_Roadways`. These two also have the widest ranges of all features.
- `Slope` has the lowest maximum value and the narrowest range. `Aspect` also has a comparatively low maximum and range.

Some features are widely spread and take large values because 5 of the 10 numeric variables are measured in meters: `Elevation`, `Horizontal_Distance_To_Hydrology`, `Vertical_Distance_To_Hydrology`, `Horizontal_Distance_To_Roadways`, and `Horizontal_Distance_To_Fire_Points`. It makes sense that these have large values and ranges.

Features such as `Aspect` and `Slope` are measured in degrees, so their values cannot exceed 360. The `Hillshade` features are indices with a maximum value of 255.


In [None]:
num_features.describe()

In [None]:
train.iloc[:,:10].hist(figsize=(16,12), bins=50)
plt.show()

#### Part 2. Describe categorical features
- Each binary feature takes a value of 0 or 1, so its **mean** is the share of observations in which it equals 1.
    - `Wilderness_Area3` has the highest mean, followed by `Wilderness_Area4`. These two areas are the most represented in the data, so most observations come from `Wilderness_Area3` or `Wilderness_Area4`.
    - `Wilderness_Area2` has the fewest observations.
- The means of the four `Wilderness_Area` features sum to 0.999999, which is approximately 1. This suggests that each observation belongs to exactly one wilderness area. (Cross Check Here: **xx**)
- In probability terms, a randomly drawn observation has a 42.0% chance of coming from `Wilderness_Area3`, a 30.9% chance of coming from `Wilderness_Area4`, and so on for the others.
    - See **Barplot #2** in the *Feature Visualization* section for more detail.
- The same reading applies to the `Soil_Type` features.
    - See **Barplot #3** and plot xx in the *Feature Visualization* section.
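Means that sum to 1 only show that each row has one active indicator on average. A stricter check is that every row sums to exactly 1 across the one-hot group. The sketch below uses a made-up frame `wild_toy`; it illustrates the idea and is not the notebook's own cross-check.

```python
import pandas as pd

# Made-up one-hot block: each row belongs to exactly one wilderness area.
wild_toy = pd.DataFrame({
    "Wilderness_Area1": [1, 0, 0, 0],
    "Wilderness_Area2": [0, 0, 1, 0],
    "Wilderness_Area3": [0, 1, 0, 0],
    "Wilderness_Area4": [0, 0, 0, 1],
})

# Number of active indicators per row.
row_sums = wild_toy.sum(axis=1)

# True only if every row has exactly one active indicator.
is_one_hot = bool((row_sums == 1).all())

# Sum of the column means; equals 1 whenever the block is one-hot.
mean_total = float(wild_toy.mean().sum())

print(is_one_hot, mean_total)
```

The same two lines applied to the real `Wilderness_Area` or `Soil_Type` columns would confirm or refute the one-area-per-row reading.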


These statistics show that the two groups of features have very different spreads and scales. We will therefore scale the features so that they all fall in a similar range between 0 and 1. Some algorithms are sensitive to features with large values and can give poor results without scaling, while others are not affected. To be on the safe side, we will scale the features in the **Data Engineering** section: **xx**.

In [None]:
cat_features.describe()
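Scaling to the range 0 to 1, as described above, corresponds to min-max scaling. The sketch below does it with plain pandas on made-up values; it is an illustration of the idea, not the scaling code used later in the project. Note that `StandardScaler`, which is imported at the top of this notebook, standardizes to zero mean and unit variance instead of bounding values to [0, 1].

```python
import pandas as pd

# Made-up numeric features on very different scales.
toy = pd.DataFrame({
    "Elevation": [2000, 2500, 3000, 3500],
    "Slope": [0, 10, 20, 40],
})

# Min-max scaling: (x - min) / (max - min), computed per column.
col_min = toy.min()
col_max = toy.max()
scaled = (toy - col_min) / (col_max - col_min)

print(scaled)
```

In a modelling workflow the minimum and maximum would normally be computed on the training split only and then reused for validation and test data.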

# Feature Skew
- For a normal distribution the skewness is zero, and any symmetric distribution has a skewness near zero.
- Negative values indicate that the data is skewed left: the left tail is long relative to the right tail.
- Positive values indicate that the data is skewed right: the right tail is long relative to the left tail.
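The sign conventions above can be seen on three tiny made-up series. The sparse 0/1 series imitates a rarely occurring `Soil_Type` indicator; the data is invented purely for illustration.

```python
import pandas as pd

# Symmetric data: skewness close to zero.
symmetric = pd.Series([1, 2, 3, 4, 5])

# Mostly zeros with a rare one: long right tail, positive skewness.
right_skewed = pd.Series([0, 0, 0, 0, 0, 0, 0, 0, 0, 1])

# Mostly ones with a rare zero: long left tail, negative skewness.
left_skewed = pd.Series([1, 1, 1, 1, 1, 1, 1, 1, 1, 0])

skews = {
    "symmetric": symmetric.skew(),
    "right_skewed": right_skewed.skew(),
    "left_skewed": left_skewed.skew(),
}
print(skews)
```

The rarer the 1s in a binary column, the larger its positive skewness, which is why sparse indicator columns dominate a skewness ranking.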

In [None]:
skew = train.skew()
skew_df = pd.DataFrame(skew, index=None, columns=['Skewness'])

In [None]:
print(skew)

#### Skewness Inferences
- `Soil_Type8` and `Soil_Type25` have the highest skewness, followed by `Soil_Type9`, `Soil_Type28` and `Soil_Type26`. The mass of these distributions is concentrated on the left, with a long tail to the right. This is called a **right-skewed distribution**.
    - Almost all observations have a value of 0 for these features, as seen in **Barplot #3** in the *Feature Visualization* section.
- The `Hillshade` variables have negatively skewed distributions.
- ML algorithms can be sensitive to such ranges of data and may give poor results. **Feature scaling**, as discussed earlier, addresses the differences in range, although it does not by itself remove skewness.

In [None]:
plt.figure(figsize=(15,7))
sns.barplot(x=skew_df.index, y='Skewness', data=skew_df)
var = plt.xticks(rotation=90)

### Class Distribution
Now we will look at the class distribution of `Cover_Type` by grouping on it and counting the occurrences of each class.

We can see that the classes of `Cover_Type` are equally represented.

In [None]:
train.groupby('Cover_Type').size()
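Class balance can also be expressed as shares and as a single ratio between the most and least frequent class. The sketch below uses a made-up, perfectly balanced target series; `imbalance_ratio` is a name introduced here for illustration.

```python
import pandas as pd

# Made-up target with seven classes, three rows each.
cover = pd.Series([1, 2, 3, 4, 5, 6, 7] * 3, name="Cover_Type")

# Share of each class, ordered by class label.
shares = cover.value_counts(normalize=True).sort_index()

# Ratio of the most frequent to the least frequent class (1.0 means balanced).
imbalance_ratio = shares.max() / shares.min()

print(shares)
print("imbalance ratio:", imbalance_ratio)
```

A ratio well above 1 would be a signal to consider stratified splitting or class weights later in the workflow.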

# Feature Visualization
First, we will visualize the spread and the outliers of the numerical features.

#### Boxplot #1: Numerical Features Inferences
- `Slope` has the most compressed box plot. Its narrow range means that the **median** and **mean** are likely to be close.
- `Aspect` is the only feature with few or no outliers. Both `Aspect` and `Slope` are measured in degrees, but `Aspect` covers a much wider range than `Slope`, which has the lowest maximum of all features, so `Aspect` is less densely concentrated than `Slope`.
- The `Hillshade` features look similar to `Slope`: they cover a small range and show many outliers.
- `Vertical_Distance_To_Hydrology` is also similar to `Slope`, except that its minimum value is negative.
- `Elevation` is the only feature whose minimum is above 0. Its box sits in the middle of the axis, and it also has many outliers.
- `Horizontal_Distance_To_Roadways` has the most spread-out data of all features, consistent with it having the highest standard deviation. `Horizontal_Distance_To_Fire_Points` looks similar but has the highest maximum value.
    - Comparing the two, the upper 50% of `Horizontal_Distance_To_Roadways` is much more spread out and less dense than that of `Horizontal_Distance_To_Fire_Points`, which explains its higher standard deviation.

In [None]:
# plot bg
sns.set_style("whitegrid")

plt.subplots(figsize=(21,14))
color = sns.color_palette('pastel')
sns.boxplot(data=num_features, orient='h', palette=color)
plt.title('Spread of Data in Numerical Features', size=18)
# the horizontal axis of a horizontal boxplot shows feature values, not counts
plt.xlabel('Feature Value', size=16)
plt.ylabel('Features', size=16)
plt.xticks(size=16)
plt.yticks(size=16)

sns.despine()
plt.show()
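The points drawn beyond the whiskers of a boxplot follow, by default, the 1.5 × IQR rule. The outliers seen in the plot can therefore be counted directly. The helper `count_iqr_outliers` and the frame `toy` below are hypothetical and use made-up values.

```python
import pandas as pd

def count_iqr_outliers(series, whis=1.5):
    """Count values outside [Q1 - whis*IQR, Q3 + whis*IQR]."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    lower = q1 - whis * iqr
    upper = q3 + whis * iqr
    return int(((series < lower) | (series > upper)).sum())

# Made-up values: Slope has one extreme value, Aspect is evenly spread.
toy = pd.DataFrame({
    "Slope": [5, 8, 10, 12, 14, 15, 16, 18, 20, 60],
    "Aspect": [0, 40, 80, 120, 160, 200, 240, 280, 320, 359],
})

outlier_counts = {col: count_iqr_outliers(toy[col]) for col in toy.columns}
print(outlier_counts)
```

Applied to `num_features`, the same dictionary comprehension would give one outlier count per numeric feature to compare against the visual impression.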

# Feature Distribution
Now we will plot how the observations are distributed across the `Wilderness_Area` features.

#### Barplot #2: Number of Observations of Wilderness Areas Inferences:
- Visually, `Wilderness_Area3` and `Wilderness_Area4` have the most observations.
- `Wilderness_Area2` has the fewest observations, which confirms what its low mean suggested earlier.

In [None]:
# split cat_features
wild_data, soil_data = cat_features.iloc[:,:4], cat_features.iloc[:,4:]

# plot bg
sns.set_style("darkgrid", {'grid.color':'.1'})
flatui = ["#e74c3c", "#34495e", "#2ecc71","#3498db"]

# use seaborn, pass colors to palette
palette = sns.color_palette(flatui)

# sum the data, plot bar
wild_data.sum().plot(kind='bar', figsize=(10,8), color='#34a028')
plt.title('# of Observations of Wilderness Areas', size=18)
plt.xlabel('Wilderness Areas', size=16)
plt.ylabel('# of Observations', size=16)
plt.xticks(rotation='horizontal', size=12)
plt.yticks(size=12)

sns.despine()
plt.show()

In [None]:
# total count of each wilderness area
wild_data.sum()

#### Barplot #3: Number of Observations of Soil Type Inferences:


Now we will plot the number of observations for each `Soil_Type`.
- In the bar plot below, the counts vary widely across soil types; different stretches of the plot resemble **normal, bimodal, unimodal, and left- and right-skewed distributions**.
- The most observations come from `Soil_Type10`, followed by `Soil_Type29`.
    - Statistically, `Soil_Type10` is present in 14.1% of the observations.
    - `Soil_Type10` also has the lowest skewness of all soil types, as seen earlier in the data exploration.
- The soil types with the fewest observations are `Soil_Type7` and `Soil_Type15`.
    - Soil types with very few observations have the highest skewness: almost all of their values are 0, so the distribution is densely concentrated at 0 with a long, flat tail to the right. This is a **positively skewed** or **right-skewed distribution** (details in the *Feature Skew* section).

In [None]:
# plot bg
sns.set_style("darkgrid", {'grid.color': '.1'})

# sum data, plot bar
soil_data.sum().plot(kind='bar', figsize=(24,12), color='#a87539')
plt.title('# of Observations of Soil Types', size=18)
plt.xlabel('Soil Types', size=16)
plt.ylabel('# of Observations', size=16)
plt.xticks(rotation=90, size=14)
plt.yticks(size=14)

sns.despine()
plt.show()

In [None]:
# statistical description of highest observation of soil type
soil_data.loc[:,'Soil_Type10'].describe()

In [None]:
# plot bg
sns.set_style("darkgrid", {'grid.color': '.1'})

# sum soil data, pass it as a series
soil_sum = pd.Series(soil_data.sum())
soil_sum.sort_values(ascending=False, inplace=True)

# plot bar
soil_sum.plot(kind='barh', figsize=(23,17), color='#a87539')
plt.gca().invert_yaxis()
plt.title('# of Observations of Soil Types', size=18)
plt.xlabel('# of Observation', size=16)
plt.ylabel('Soil Types', size=16)
plt.xticks(rotation='horizontal',size=14)
plt.yticks(size=14)

sns.despine()
plt.show()

# Feature Comparison
Next we will compare each feature with the target variable. This helps us visualize how densely each class of the target is distributed over the values of a feature. We will use violin plots for this.


#### Violin Plot 4.1 Numerical Features Inferences:
- `Elevation`
    - `Cover_Type4` is mostly found at elevations between 2000m and 2500m.
    - `Cover_Type3` has the fewest observations around that same elevation.
    - `Cover_Type7` is found at the highest elevations, from as low as ~2800m to as high as ~3800m.
        - The maximum elevation in the data belongs to `Cover_Type7`.
        - `Elevation` will be an important feature, since its distribution differs clearly between the forest cover classes. This could be useful in our algorithm.
- `Aspect`
    - This feature has a normal distribution for each class.
- `Slope`
    - `Slope` takes lower values than most features because it is measured in degrees, and its values are lower than those of `Aspect`, which is also measured in degrees.
    - It has the lowest maximum value of all features. From the plot, that maximum belongs to `Cover_Type2`.
    - All classes have dense slope observations between 0 and 20 degrees.
- `Horizontal_Distance_To_Hydrology`
    - This feature has a right (positively) skewed distribution, where most values for all classes are near 0-50m.
- `Vertical_Distance_To_Hydrology`
    - This feature is also positively skewed, but most observations in every class take values much closer to 0.
    - The highest value of this feature belongs to `Cover_Type2`. This feature also has the lowest minimum value of all features. `Cover_Type2` has the widest range of observations compared with the other classes.
- `Hillshade_9am` and `Hillshade_Noon` have left (negatively) skewed distributions, with most observations in each class taking index values between 200 and 250.
- `Hillshade_3pm` has a normal distribution for all classes.

In [None]:
# plot bg
sns.set_style("darkgrid", {'grid.color': '.1'})

# set target variable
target = train['Cover_Type']

# features to be compared with target variable
features = num_features.columns

# loop for violin plot
for i in range(0, len(features)):
    plt.subplots(figsize=(16,11))
    sns.violinplot(data=num_features, x=target, y=features[i])
    plt.xticks(size=14)
    plt.yticks(size=14)
    plt.xlabel('Forest Cover Types', size=18)
    plt.ylabel(features[i], size=18)
    
    plt.show()

#### Violin Plot 4.2 Wilderness Area Inferences:
- `Wilderness_Area1` is associated with `Cover_Type1`, `Cover_Type2`, and `Cover_Type5`.
- `Wilderness_Area3` contains all classes except `Cover_Type4`.
- `Wilderness_Area2` (which has the fewest observations) and `Wilderness_Area4` show less density at 1 across the classes than `Wilderness_Area1` and `Wilderness_Area3`.

In [None]:
# plot bg
sns.set_style("darkgrid", {'grid.color': '.1'})

# set target variable
target = train['Cover_Type']
# features to be compared with target variable
features = wild_data.columns

# loop for violin plots
for i in range(0, len(features)):
    
    plt.subplots(figsize=(13,9))
    sns.violinplot(data=wild_data, x=target, y=features[i])
    plt.xticks(size=14)
    plt.yticks(size=14)
    plt.xlabel('Forest Cover Types', size=16)
    plt.ylabel(features[i], size=16)
    
    plt.show()

#### Violin Plot 4.3 Soil Type Inferences:
- `Soil_Type4` is the only soil type present in all forest cover types.
- `Soil_Type`: 7 and 15 visually have little to no presence in any forest cover type.
- `Soil_Type`: 3 and 6 are present in `Cover_Type`: 2, 3, 4, 6.
- `Soil_Type`: 10, 11, 16, and 17 are present in `Cover_Type`: 1 through 6.
- `Soil_Type`: 23, 24, 31 and 33 are present in `Cover_Type`: 1, 2, 5, 6, 7.
- `Soil_Type`: 29 and 30 are present in `Cover_Type`: 1, 2, 5, 7.
- `Soil_Type`: 22, 27, 35, 38, 39, and 40 are present in `Cover_Type`: 1, 2, and 7.
- `Soil_Type`: 18 and 28 are present in `Cover_Type`: 2 and 5.
- `Soil_Type`: 19 and 26 are present in `Cover_Type`: 1, 2, and 5.
- `Soil_Type`: 8 and 25 are present only in `Cover_Type2`.
- `Soil_Type`: 1, 5, and 14 are present in `Cover_Type`: 3, 4, and 6.
- `Soil_Type37` is present in `Cover_Type7`.


- `Cover_Type4` is associated with the fewest soil types.
- `Cover_Type2` is associated with the most soil types.
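Presence lists like the ones above are read off the violin plots by eye, so they are easy to get slightly wrong. As a cross-check, the cover types in which each soil type occurs can be tabulated with a `groupby`. The frame `toy` is made up, and `cover_types_per_soil` is a name introduced here for illustration.

```python
import pandas as pd

# Made-up rows: one active soil indicator per row plus the target.
toy = pd.DataFrame({
    "Soil_Type1": [1, 0, 0, 1, 0],
    "Soil_Type2": [0, 1, 1, 0, 1],
    "Soil_Type3": [0, 0, 0, 0, 0],
    "Cover_Type": [3, 2, 5, 4, 2],
})

soil_cols = [c for c in toy.columns if c.startswith("Soil_Type")]

# Rows: cover types, columns: soil types, values: number of observations.
presence = toy.groupby("Cover_Type")[soil_cols].sum()

# For each soil type, list the cover types with at least one observation.
cover_types_per_soil = {
    soil: presence.index[(presence[soil] > 0).to_numpy()].tolist()
    for soil in soil_cols
}

print(cover_types_per_soil)
```

An empty list, as for `Soil_Type3` in this toy example, flags a soil type that never occurs in the data.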

In [None]:
# plot bg
sns.set_style("darkgrid", {'grid.color':'.1'})

# set target variable
target = train['Cover_Type']
# features compare with target variable
features = soil_data.columns

# violin for loop
for i in range(0, len(features)):
    plt.subplots(figsize=(13,9))
    sns.violinplot(data=soil_data, x=target, y=features[i])
    plt.xticks(size=14)
    plt.yticks(size=14)
    plt.xlabel('Forest Cover Types', size=16)
    plt.ylabel(features[i], size=16)
    
    plt.show()

# Feature Correlation
Part of our data is binary. The **correlation matrix** is most informative for continuous data, so we will exclude the binary features.


- Feature pairs with little or no correlation are colored **black**.
- Features with positive correlation are colored **orange**.
- Features with negative correlation are colored **blue**.


#### Correlation Plot #5 Inferences:
- `Hillshade_3pm` and `Hillshade_9am` show a high negative correlation.
- `Hillshade_3pm` and `Aspect` show a high positive correlation.
- `Hillshade_3pm` and `Aspect` also had the most nearly normal distributions across the forest cover type classes (**Plot 4.1**).
- The following pairs had a positive correlation:
    - `Vertical_Distance_To_Hydrology` and `Horizontal_Distance_To_Hydrology`
    - `Horizontal_Distance_To_Roadways` and `Elevation`
    - `Hillshade_3pm` and `Aspect`
    - `Hillshade_3pm` and `Hillshade_Noon`
- The following pairs had a negative correlation:
    - `Hillshade_9am` and `Aspect`
    - `Hillshade_Noon` and `Slope`
- The following pair has no correlation:
    - `Hillshade_9am` and `Horizontal_Distance_To_Roadways`
- Weakly correlated pairs tell us that these features carry different information, which could make them important for prediction.

In [None]:
plt.subplots(figsize=(15,10))

# compute correlation matrix
num_features_corr = num_features.corr()

# generate mask for upper triangle
# use the built-in bool: the np.bool alias was removed in NumPy 1.24
mask = np.zeros_like(num_features_corr, dtype=bool)
mask[np.triu_indices_from(mask)] = True

# generate heatmap masking the upper triangle and shrink the cbar
sns.heatmap(num_features_corr, mask=mask, center=0, square=True, annot=True, annot_kws={"size": 15}, cbar_kws={"shrink": .8})
plt.xticks(size=13)
plt.yticks(size=13)

plt.show()

#### Scatterplot #6 Features with correlation greater than 0.5
Let's look at the feature pairs with a correlation greater than 0.5, that is, the pairs with a strong positive correlation.
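Pairs above a threshold can be pulled out of a correlation matrix programmatically rather than read off the heatmap. The sketch below uses made-up columns `x`, `y`, `z` and a hypothetical `threshold` variable; it is not how the pair list in this notebook was produced.

```python
from itertools import combinations

import pandas as pd

# Made-up data: y rises with x, z alternates and is unrelated to both.
toy = pd.DataFrame({
    "x": [1, 2, 3, 4, 5, 6, 7, 8],
    "y": [2, 4, 5, 8, 10, 13, 14, 16],
    "z": [1, -1, 1, -1, 1, -1, 1, -1],
})

corr = toy.corr()
threshold = 0.5

# Visit each unordered pair once and keep those above the threshold.
strong_pairs = [
    (a, b)
    for a, b in combinations(corr.columns, 2)
    if corr.loc[a, b] > threshold
]

print(strong_pairs)
```

Comparing `abs(corr.loc[a, b])` against the threshold instead would also pick up strongly negative pairs.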

#### Inferences:
- `Hillshade_3pm` and `Aspect` show a **sigmoid-shaped** relationship. The data points at the boundaries mostly belong to `Cover_Type`: 3, 4, 5.
- `Vertical_Distance_To_Hydrology` and `Horizontal_Distance_To_Hydrology` show a roughly **linear** relationship, but with more spread.
    - `Cover_Type`: 1, 2, 7 have observations that are more spread out.
    - `Cover_Type`: 3, 4, 5, 6 are more densely packed between 0 and 600m of `Horizontal_Distance_To_Hydrology`.
- `Elevation` and `Horizontal_Distance_To_Roadways` show a spread-out **linear** relationship.
    - `Cover_Type` 1, 2, and 7 have the highest elevations and points spread widely from 0m to ~7000m of `Horizontal_Distance_To_Roadways`.
    - `Cover_Type` 4 and 6 are densely concentrated at low elevation and short horizontal distance to roadways.
- `Hillshade_Noon` and `Hillshade_3pm`
    - `Cover_Type` 1, 2, 6 and 7 have a higher hillshade index at noon and 3pm.
    - `Cover_Type` 4 and 5 have a lower hillshade index at noon and 3pm.

In [None]:
# plot bg
sns.set_style("darkgrid", {'grid.color': '.1'})

# paired features with positive correlation
list_data_corr = [['Horizontal_Distance_To_Hydrology','Vertical_Distance_To_Hydrology'],
                  ['Elevation','Horizontal_Distance_To_Roadways'],
                  ['Aspect','Hillshade_3pm'],
                  ['Hillshade_3pm','Hillshade_Noon']]

# loop through outer list
# take 2 features from inner list
for i,j in list_data_corr:
    plt.subplots(figsize=(15,12))
    sns.scatterplot(data=train, x=i, y=j, hue="Cover_Type", legend='full', palette='rainbow_r')
    plt.xticks(size=15)
    plt.yticks(size=15)
    plt.xlabel(i, size=16)
    plt.ylabel(j, size=16)
    
    plt.show()

# Reverse One-Hot Encoding
Since `Soil_Type` and `Wilderness_Area` are one-hot encoded, we will apply reverse one-hot encoding to look at the relationship of the following:
- Wilderness Area and Cover Type
- Wilderness Area, Soil Type and Cover Type
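Reverse one-hot encoding can be sketched in two steps: `idxmax(axis=1)` returns the name of the active column in each row, and a regular expression pulls the trailing number out of that name. The frame `toy` is made up for illustration.

```python
import pandas as pd

# Made-up one-hot block with exactly one active column per row.
toy = pd.DataFrame({
    "Soil_Type1": [1, 0, 0],
    "Soil_Type2": [0, 0, 1],
    "Soil_Type3": [0, 1, 0],
})

# Name of the column holding the 1 in each row.
labels = toy.idxmax(axis=1)

# Trailing digits of the column name as an integer code.
soil_number = labels.str.extract(r"(\d+)$", expand=False).astype("int64")

print(labels.tolist())
print(soil_number.tolist())
```

One caveat: for a row with no active indicator, `idxmax` still returns the first column name, so it is worth confirming beforehand that every row sums to 1.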

# Understanding `Cover_Type` Distribution Among `Wilderness_Areas`
**Inferences:**
- Spruce/Fir, Lodgepole Pine and Krummholz (Cover_Type 1, 2, 7) are mostly found in Rawah, Neota and Comanche Peak (Wilderness_Area 1, 2, and 3).
- Ponderosa Pine (Cover_Type3) is more likely to be found in Cache la Poudre (Wilderness_Area4) than in other areas.
- Cottonwood/Willow (Cover_Type4) seems to be found only in Cache la Poudre (Wilderness_Area4).
- Aspen (Cover_Type5) is equally likely to come from Rawah and Comanche Peak (Wilderness_Area 1 and 3).
- Douglas-fir (Cover_Type6) can be found in any of the wilderness areas.

In [None]:
# create one column as Wilderness_Area_Type and represent it as categorical data
train['Wilderness_Area_Type'] = (train.iloc[:, 10:14] == 1).idxmax(1)

# list of wilderness area
wilderness_area = sorted(train['Wilderness_Area_Type'].value_counts().index.tolist())

# plot cover_type distribution for each wilderness area
for area in wilderness_area:
    subset = train[train['Wilderness_Area_Type'] == area]
    sns.kdeplot(subset["Cover_Type"], label=area, linewidth=2)
    
# set title, legends and labels
plt.ylabel("Density")
plt.xlabel("Cover_Type")
plt.title("Density of Cover_Type Among Different Wilderness_Areas", size=14)
plt.legend()
plt.show()
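Because `Cover_Type` is a discrete label, a density curve smooths across the integer class codes. A contingency table gives the exact counts behind the curves and can back up the inferences above. The sketch below uses a made-up frame `toy` with the same two column names.

```python
import pandas as pd

# Made-up observations: wilderness area label and cover type per row.
toy = pd.DataFrame({
    "Wilderness_Area_Type": [
        "Wilderness_Area1", "Wilderness_Area1", "Wilderness_Area3",
        "Wilderness_Area4", "Wilderness_Area4",
    ],
    "Cover_Type": [1, 2, 2, 3, 4],
})

# Raw counts of each cover type within each wilderness area.
counts = pd.crosstab(toy["Wilderness_Area_Type"], toy["Cover_Type"])

# Share of each cover type that falls into each area (columns sum to 1).
shares = pd.crosstab(
    toy["Wilderness_Area_Type"], toy["Cover_Type"], normalize="columns"
)

print(counts)
print(shares)
```

Reading down a column of `shares` answers the question of which areas a given cover type comes from.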

# Understanding `Soil_Type` and `Cover_Type` Relationship
**Inferences:**
- Wilderness_Area3 is the most diverse in Soil_Type and Cover_Type.
- Only Soil_Type 1 through 20 are represented in Wilderness_Area4, so the Cover_Types in that area grow on those soils.
- Cover_Type7 seems to grow on Soil_Type 25 through 40.
- Cover_Type5 and 6 can grow on most of the soil types.
- Cover_Type3 is concentrated in the lower-numbered soil types (roughly up to Soil_Type 15).
- Cover_Type1 and 2 can grow on any Soil_Type.

In [None]:
def split_numbers_chars(row):
    '''This function fetches the numerical characters at the end of a string
    and returns the alphabetical characters and numerical characters respectively'''
    head = row.rstrip('0123456789')
    tail = row[len(head):]
    return head, tail

def reverse_one_hot_encode(dataframe, start_loc, end_loc, numeric_column_name):
    ''' this function takes the start and end location of the one-hot-encoded column set and numeric column name to be created as arguments
    1) transforms one-hot-encoded columns into one column consisting of column names with string data type
    2) splits string column into the alphabetical and numerical characters
    3) fetches numerical character and creates numeric column in the given dataframe
    '''
    dataframe['String_Column'] = (dataframe.iloc[:, start_loc:end_loc] == 1).idxmax(1)
    dataframe['Tuple_Column'] = dataframe['String_Column'].apply(split_numbers_chars)
    dataframe[numeric_column_name] = dataframe['Tuple_Column'].apply(lambda x: x[1]).astype('int64')
    dataframe.drop(columns=['String_Column','Tuple_Column'], inplace=True)

In [None]:
# with the ID column dropped, Soil_Type1..Soil_Type40 sit at positions 14..53
reverse_one_hot_encode(train, 14, 54, "Soil_Type")

In [None]:
# plot relationship of soil type and cover type among different wilderness areas
g = sns.FacetGrid(train, col="Wilderness_Area_Type",
                  col_wrap=2, height=5, col_order=wilderness_area)
g = g.map(plt.scatter, "Cover_Type", "Soil_Type", edgecolor="w", color="b")

# Next Steps

We continue this project with the following notebooks:
- **[G2] ForestCoverType_Training Notebook** [https://www.kaggle.com/emknowles/g2-forestcovertype-training-notebook/](https://www.kaggle.com/emknowles/g2-forestcovertype-training-notebook/)
- **[G2] ForestCoverType_ModelParams Notebook** [https://www.kaggle.com/emknowles/g2-forestcovertype-modelparams-notebook/](https://www.kaggle.com/emknowles/g2-forestcovertype-modelparams-notebook/)
- **[G2] ForestCoverType_FinalModelEvaluation Notebook** [https://www.kaggle.com/emknowles/g2-forestcovertype-finalmodelevaluation-notebook](https://www.kaggle.com/emknowles/g2-forestcovertype-finalmodelevaluation-notebook)
- **[G2] ForestCoverType_Submission Notebook** [https://www.kaggle.com/emknowles/g2-forest-cover-type-submission-v05](https://www.kaggle.com/emknowles/g2-forest-cover-type-submission-v05)