# Forest cover types EDA

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')


In [None]:
dataset=pd.read_csv('../input/forest-cover-type-prediction/train.csv')

In [None]:
df=dataset.copy()

In [None]:
df.shape

## Data types

In [None]:
df.info()

All features are numeric

In [None]:
df.isna().sum()

There aren't any null values

In [None]:
df=df.drop(['Id'],axis=1)

In [None]:
df.head()

In [None]:
df.rename(columns={'Wilderness_Area1':'Rawah','Wilderness_Area2':'Neota',
'Wilderness_Area3': 'Comanche Peak','Wilderness_Area4' : 'Cache la Poudre'},inplace=True)

In [None]:
cover_types={1:'Spruce',2 :'L.Pine',3 :'P.Pine',4 : 'Willow',5 : 'Aspen', 6 : 'Douglas-fir',7: 'Krummholz'}
df=df.replace({'Cover_Type':cover_types})

In [None]:
# Cover_Type now holds strings, so restrict skewness to the numeric columns
df.skew(numeric_only=True)

## Descriptive statistics with respect to cover types
 

In [None]:
df.groupby(['Cover_Type']).describe().T

In [None]:
# factorplot was renamed to catplot in seaborn 0.9
g=sns.catplot(x='Cover_Type',kind='count',data=df,color='darkseagreen')
g.set(title='Sampling distribution of cover types')
g.set_xticklabels(rotation=90)

The training sample is balanced across the seven cover types, which are drawn from four wilderness areas.

Each cover type is represented by 2,160 patches.
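
The balance claim above can also be checked numerically rather than read off the bar chart. The following is a minimal sketch on a toy stand-in for the `Cover_Type` column (the values are hypothetical, not the real data); on the notebook's data the same two lines would be applied to `df['Cover_Type']`.

```python
import pandas as pd

# Toy stand-in for the Cover_Type column (hypothetical values, not the real data)
cover = pd.Series(['Spruce', 'L.Pine', 'Aspen'] * 4, name='Cover_Type')

# Number of patches per cover type
counts = cover.value_counts()

# A balanced sample has a single distinct count across all classes
is_balanced = counts.nunique() == 1

print(counts)
print('balanced:', is_balanced)
```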

In [None]:
df1=df.copy()
df1["Wild_area"] = df.iloc[:,10:14].idxmax(axis=1)
df1['Soil'] = df.iloc[:,14:54].idxmax(axis=1)

In [None]:
plt.figure(figsize=(10,5))
sns.countplot(x='Wild_area',data=df1,palette="Set3",hue='Cover_Type')

Each wilderness area has its own most abundant cover type, so an area's properties should reflect those of its dominant cover type.
Now that we know the dominant cover type in each area, we shall compare cover types and wilderness areas with respect to the continuous features.

Abundant cover types:

1) Rawah: Lodgepole Pine and Spruce

2) Neota: Krummholz

3) Comanche Peak: Aspen and Krummholz

4) Cache la Poudre: Willow

One interesting thing to note here is that Willow is present only in Cache la Poudre.

Also, Comanche Peak is the most diverse of the four areas and Neota the least. A possible reason for Neota's low diversity is its high elevation. We shall check this in the plots below.
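
The dominance and diversity observations above can be tabulated with a cross-tabulation instead of being read from the count plot. This is a sketch on a small made-up frame; the column names `Wild_area` and `Cover_Type` follow the notebook, but the rows are hypothetical and "diversity" is simply taken here as the number of cover types present in an area.

```python
import pandas as pd

# Hypothetical rows, only to illustrate the technique
toy = pd.DataFrame({
    'Wild_area': ['Rawah', 'Rawah', 'Rawah',
                  'Neota', 'Neota',
                  'Comanche Peak', 'Comanche Peak', 'Comanche Peak', 'Comanche Peak'],
    'Cover_Type': ['L.Pine', 'L.Pine', 'Spruce',
                   'Krummholz', 'Krummholz',
                   'Aspen', 'Aspen', 'Krummholz', 'Spruce'],
})

# Rows are wilderness areas, columns are cover types, cells are patch counts
table = pd.crosstab(toy['Wild_area'], toy['Cover_Type'])

# Most abundant cover type in each area
dominant = table.idxmax(axis=1)

# Number of distinct cover types present in each area
n_types = (table > 0).sum(axis=1)

print(table)
print(dominant)
print(n_types)
```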

## Soil type as a parameter to distinguish within cover types

In [None]:
soil_columns=['Soil_Type'+str(i)for i  in range(1,41)]
abundance=[df[df[i]==1][i].count() for i in soil_columns]
num = [i for i in range(1,41)]
plt.figure(figsize=(10,5))
g=sns.barplot(x=num,y=abundance,palette='ch:.25')
g.set(title="Abundance of Soil Types",ylabel="No. of patches",xlabel="Soil Type")
sns.despine()

Soil type 10 (Bullwark - Catamount families - Rock outcrop complex, rubbly) is the most common soil type. If soil type 10 supports many cover types, it cannot serve as a distinguishing factor for predicting cover type. The same goes for any other soil type that is shared by many cover types.

We shall now take a closer look at which cover types each soil type supports in the plot below.

In [None]:
dd=df.groupby(['Cover_Type'])[soil_columns].sum()
dd.T.plot(kind = 'bar', figsize = (18,10),stacked=True)
plt.title('Abundance of Soil type with respect to cover types',fontsize=15)

A pattern emerges here. Soil types 1 to 10 (excluding 7, 8 and 9), which belong mainly to the Cathedral and Vanet soil families, support Douglas-fir, Willow and Ponderosa Pine.

Soil types 22 to 33, belonging to the Leighcan and Como soil families, support Aspen, Spruce and Lodgepole Pine.

Soil types 35 to 40, belonging to the Cryumbrepts, Bross and Moran soil families, support exclusively Krummholz and Spruce.

Soil types that support the fewest cover types are the most useful for distinguishing between cover types.
Conversely, the plot above shows why soil types 4, 6, 10, 23, 30, 31, 32 and 33 cannot be used as class-separating parameters: each of them supports many cover types.

Soil type 9 (Troutville family, very stony) supports only Lodgepole Pine. Other soil types that can likewise help distinguish cover types include 12, 14, 18, 27, 28, 35, 36 and 37.

Some cover types grow only in certain soil types, whereas trees such as Aspen and Douglas-fir can be seen growing across a wide range of soil types.
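
The idea of ranking soil types by how many cover types they support can be sketched as follows. The frame below is hypothetical toy data with only three one-hot soil columns; on the notebook's data the same steps would presumably run on `df` with the full `soil_columns` list.

```python
import pandas as pd

# Hypothetical one-hot soil indicators, only to illustrate the technique
toy = pd.DataFrame({
    'Cover_Type':  ['L.Pine', 'L.Pine', 'Aspen', 'Spruce', 'Aspen'],
    'Soil_Type9':  [1, 1, 0, 0, 0],
    'Soil_Type10': [0, 0, 1, 1, 0],
    'Soil_Type30': [0, 0, 0, 0, 1],
})
soil_cols = [c for c in toy.columns if c.startswith('Soil_Type')]

# Patches per (cover type, soil type)
per_cover = toy.groupby('Cover_Type')[soil_cols].sum()

# For each soil type, how many cover types occur on it at least once
n_cover_types = (per_cover > 0).sum(axis=0)

# Soil types that occur with exactly one cover type are the most discriminating
exclusive = n_cover_types[n_cover_types == 1].index.tolist()

print(n_cover_types.sort_values())
print(exclusive)
```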

## Continuous features 

Let us now address the continuous features, namely: Elevation, Aspect, Slope, Horizontal_Distance_To_Hydrology, Vertical_Distance_To_Hydrology, Horizontal_Distance_To_Roadways and Horizontal_Distance_To_Fire_Points.

First, let us look at the distribution of observations for each of these features.

In [None]:
def distp(feature,a,b):
    # histplot replaces the deprecated distplot(..., kde=False)
    return sns.histplot(df[feature],color=a,ax=axs[b])
fig, axs = plt.subplots(ncols=7,figsize=(22,10))
fig.suptitle("Distribution of observations",fontsize='20')

distp('Elevation','green',0)
distp('Aspect','turquoise',1)
distp('Slope','yellow',2)
distp('Horizontal_Distance_To_Hydrology','navy',3)
distp('Vertical_Distance_To_Hydrology','brown',4)
distp('Horizontal_Distance_To_Roadways','orange',5)
distp('Horizontal_Distance_To_Fire_Points','purple',6)

We can see from the plots above that horizontal distance to hydrology, horizontal distance to roadways and horizontal distance to fire points are positively skewed: most observations sit at low values, with a long tail toward larger distances.
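
Positive skew can be confirmed numerically as well as visually. A minimal sketch on two made-up series (not the real features): a right-tailed series has a positive skewness coefficient and a mean above its median, while a symmetric one has skewness zero.

```python
import pandas as pd

# Hypothetical values: mostly small, with a few large ones (long right tail)
right_tailed = pd.Series([1, 1, 2, 2, 3, 10, 50])

# Hypothetical symmetric values for comparison
symmetric = pd.Series([1, 2, 3, 4, 5])

# Sample skewness: > 0 for a right tail, 0 for a symmetric sample
print('right-tailed skew:', right_tailed.skew())
print('symmetric skew:   ', symmetric.skew())

# A right tail also pulls the mean above the median
print('mean > median:', right_tailed.mean() > right_tailed.median())
```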

In [None]:
df1.drop(df1.columns[14:54], axis=1, inplace=True)
df1.drop(df1.columns[10:14],axis=1,inplace=True)

Let us look at pairwise scatter and density plots across the parameters in the dataset using Seaborn's pairplot function, and also at a correlation heatmap. We shall also select and individually plot the pairs with interesting correlations.

We shall look for correlations and other patterns that help identify class-separating factors.

In [None]:
sns.pairplot(data=df1,hue='Cover_Type',palette='Set1')

We can observe some class separation in the case of elevation, and some interesting patterns with respect to aspect, hillshade and slope. So, let us take a closer look at these parameters with Cover Type as the response variable.

In [None]:
fig, axs = plt.subplots(ncols=3,figsize=(15,4))
fig.suptitle("Positive Correlations",fontsize='20')

sns.lineplot(x= "Aspect",y="Hillshade_3pm",data=df,color='green',ax=axs[0])
sns.lineplot(x= "Hillshade_Noon",y="Hillshade_3pm",data=df,color='green',ax=axs[1])
sns.lineplot(x= "Horizontal_Distance_To_Hydrology",y="Vertical_Distance_To_Hydrology",color="green",data=df,ax=axs[2])

In [None]:

fig, axs = plt.subplots(ncols=3,figsize=(15,4))
fig.suptitle("Negative Correlations",fontsize='20')

sns.lineplot(x= "Hillshade_3pm",y="Hillshade_9am",data=df,color='red',ax=axs[0])
sns.lineplot(x= "Slope",y="Elevation",data=df,color='red',ax=axs[1])
sns.lineplot(x= "Hillshade_Noon",y="Slope",data=df,color='red',ax=axs[2])

As for the hillshades, the relationship between hillshade at 3pm and at 9am is clearly negative, while hillshade at 3pm and at noon are positively correlated.

In [None]:
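plt.figure(figsize=(18,12))
# df1 also holds string columns (Cover_Type, Wild_area, Soil), so keep numeric ones only
corr_matrix=df1.corr(numeric_only=True)
# Boolean mask that hides the upper triangle (and diagonal) of the symmetric matrix
mask = np.triu(np.ones_like(corr_matrix, dtype=bool))
sns.heatmap(corr_matrix,annot=True ,cbar = True,cmap="YlGnBu",mask=mask)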

In addition to what we observed in the pairplots and line plots above, here we can see all positive and negative correlations at once. With the YlGnBu colormap, the darkest cells mark the largest positive coefficients and the lightest cells mark the most negative ones.
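
Reading the strongest pairs off a heatmap can be error-prone, so it may help to list them directly. This is a sketch on a toy frame with hypothetical columns `a`, `b` and `c`; on the notebook's data the same steps would start from the correlation matrix of the numeric columns.

```python
import numpy as np
import pandas as pd

# Hypothetical columns, only to illustrate the technique
toy = pd.DataFrame({
    'a': [1, 2, 3, 4, 5],
    'b': [2, 4, 6, 8, 11],   # rises with a
    'c': [5, 4, 3, 2, 0],    # falls as a rises
})

corr = toy.corr()

# Keep only the strictly lower triangle so each pair appears once
lower = corr.where(np.tril(np.ones(corr.shape, dtype=bool), k=-1))

# Long format: one row per feature pair, sorted from most negative to most positive
pairs = lower.stack().dropna().sort_values()

print(pairs)
```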

Now that we have seen the abundant cover types in each area, let us compare each continuous feature with cover type and wilderness area.

In [None]:
fig, axs = plt.subplots(ncols=2,figsize=(15,6))
fig.suptitle("Elevation",fontsize='20')
sns.swarmplot(x= "Cover_Type",y="Elevation",data=df1,palette='Set2',ax=axs[0])
sns.swarmplot(x= "Wild_area",y="Elevation",data=df1,palette="Set2",ax=axs[1],hue='Cover_Type')

As expected, the Neota wilderness area is highly elevated, which is consistent with its low diversity compared with the other three areas.

Elevation in the other three areas falls roughly in the range 2,400 to 2,900 meters, and only Neota's is above 3,000 meters.

Comparing both plots, Krummholz can clearly be seen growing on the elevated patches in Rawah, Comanche Peak and Neota.

Ponderosa Pine and Douglas-fir grow in Comanche Peak and Cache la Poudre at elevations in the range 2,000 to 2,750 meters.

As discussed earlier, Willow, which grows on patches with low elevation, is found only in Cache la Poudre.

In [None]:
fig, axs = plt.subplots(ncols=2,figsize=(15,6))
fig.suptitle("Aspect",fontsize='20')
sns.boxplot(x= "Cover_Type",y="Aspect",data=df1,palette='Set2',ax=axs[0])
sns.swarmplot(x= "Wild_area",y="Aspect",data=df1,palette="Set2",ax=axs[1],hue='Cover_Type')

In [None]:
degrees = df1['Aspect']

# Histogram of aspect in 20-degree bins covering 0 to 360
bin_size = 20
a , b=np.histogram(degrees, bins=np.arange(0, 360+bin_size, bin_size))
# Bin centers converted to radians, as the polar axes expect
centers = np.deg2rad(np.ediff1d(b)//2 + b[:-1])

fig = plt.figure(figsize=(8,8))
ax = fig.add_subplot(111, projection='polar')
ax.bar(centers, a, width=np.deg2rad(bin_size), bottom=0.0, color='.8', edgecolor='k')
ax.set_theta_zero_location("N")
ax.set_theta_direction(-1)
plt.title('Aspect')

Aspect is the orientation of the slope, measured clockwise in degrees from 0 to 360, where 0 is north-facing, 90 is east-facing, 180 is south-facing and 270 is west-facing.

In the plots above we can see that most of the patches in the data set lie on slopes facing east.

In the box plot of Aspect, every cover type except Willow has Aspect values spanning the full range from 0 to 360 degrees, so Aspect is not a suitable parameter for differentiating between these cover types.

One thing to note, though, is that Willow has no Aspect value above 270-280 degrees, indicating that Willow is rare or absent on slopes in the west-to-north quadrant.
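
The aspect convention described above (0 north, 90 east, 180 south, 270 west, clockwise) can be turned into compass labels for easier reading. The helper below, `aspect_to_compass`, is a hypothetical name introduced here for illustration; it assigns each angle to the nearest of the four cardinal directions, using 90-degree sectors centered on each direction.

```python
import numpy as np

def aspect_to_compass(deg):
    """Map aspect angles in degrees (clockwise from north) to N, E, S or W."""
    labels = np.array(['N', 'E', 'S', 'W'])
    angles = np.asarray(deg) % 360
    # Shift by 45 degrees so each 90-degree sector is centered on a cardinal direction
    idx = np.floor((angles + 45) / 90).astype(int) % 4
    return labels[idx]

print(aspect_to_compass([0, 90, 180, 270, 350]))
```

On the notebook's data, applying such a helper to the Aspect column and counting labels per cover type would give a compact check of the observation about Willow.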

In [None]:
fig, axs = plt.subplots(ncols=2,figsize=(15,6))
fig.suptitle("Slope",fontsize='20')


sns.violinplot(x= "Cover_Type",y="Slope",data=df1,palette='Set2',ax=axs[0])
sns.swarmplot(x= "Wild_area",y="Slope",data=df1,palette="Set2",ax=axs[1],hue='Cover_Type')

In [None]:
fig, axs = plt.subplots(ncols=2,figsize=(15,6))
fig.suptitle("Horizontal Distance To Hydrology",fontsize='20')


sns.swarmplot(x= "Cover_Type",y='Horizontal_Distance_To_Hydrology',data=df1,palette='Set2',ax=axs[0])
sns.swarmplot(x= "Wild_area",y='Horizontal_Distance_To_Hydrology',data=df1,palette="Set2",ax=axs[1],hue='Cover_Type')

Krummholz can be seen growing over a wider range of horizontal distances to hydrology than the other cover types.


In [None]:
fig, axs = plt.subplots(ncols=2,figsize=(15,6))
fig.suptitle("Vertical Distance To Hydrology",fontsize='20')


sns.swarmplot(x= "Cover_Type",y='Vertical_Distance_To_Hydrology',data=df1,palette='Set2',ax=axs[0])
sns.swarmplot(x= "Wild_area",y='Vertical_Distance_To_Hydrology',data=df1,palette="Set2",ax=axs[1],hue='Cover_Type')

Vertical distance to hydrology is approximately 100 on average, with a maximum of about 400.

Vertical distance falls into the same range for all cover types, so no major inferences about them can be drawn from it.

In [None]:
fig, axs = plt.subplots(ncols=2,figsize=(15,6))
fig.suptitle("Horizontal Distance To Roadways",fontsize='20')
sns.violinplot(x= "Cover_Type",y='Horizontal_Distance_To_Roadways',data=df1,palette='Set2',ax=axs[0])
sns.swarmplot(x= "Wild_area",y='Horizontal_Distance_To_Roadways',data=df1,palette="Set2",ax=axs[1],hue='Cover_Type')

Patches of all cover types can be found near roadways, up to a distance of 2,000 meters.

Only L.Pine and Spruce are found beyond a distance of 6,000 meters from roadways.

The Rawah wilderness area can be seen to accommodate more forest cover beyond 6,000 meters from the road than the other areas.
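
Threshold statements like the ones above can be checked with a simple filter. The sketch below uses made-up rows (the column names follow the dataset, the values are hypothetical) to list which cover types appear beyond a chosen distance and the maximum distance per cover type.

```python
import pandas as pd

# Hypothetical rows, only to illustrate the technique
toy = pd.DataFrame({
    'Cover_Type': ['L.Pine', 'Spruce', 'Aspen', 'Willow', 'L.Pine'],
    'Horizontal_Distance_To_Roadways': [6500, 6200, 1500, 800, 3000],
})

threshold = 6000  # meters

# Cover types with at least one patch farther than the threshold from a road
far = toy[toy['Horizontal_Distance_To_Roadways'] > threshold]
types_far = sorted(far['Cover_Type'].unique())

# Largest observed distance for each cover type
max_dist = toy.groupby('Cover_Type')['Horizontal_Distance_To_Roadways'].max()

print(types_far)
print(max_dist)
```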

In [None]:
fig, axs = plt.subplots(ncols=2,figsize=(15,6))
fig.suptitle("Horizontal Distance To Fire Points",fontsize='20') 
sns.boxplot(x= "Cover_Type",y='Horizontal_Distance_To_Fire_Points',data=df1,palette='Set2',ax=axs[0])
sns.swarmplot(x= "Wild_area",y='Horizontal_Distance_To_Fire_Points',data=df1,palette="Set2",ax=axs[1],hue='Cover_Type')

We can see many outliers here.

Fire is known to indirectly benefit Aspen trees, since it lets the saplings flourish in open sunlight in the burned landscape. This may explain why patches with Aspen trees in the Rawah wilderness area lie at shorter horizontal distances to fire points.

## Summary 

Elevation is the one feature in which a pattern of class separation was easy to detect, although with slight overlap.
Krummholz is more prevalent on highly elevated patches, whereas Willow, Douglas-fir and P.Pine grow on patches with lower elevation than the other cover types.

Soil types that support the fewest cover types can be used to differentiate between cover types.

Vertical distance to hydrology is positively correlated with horizontal distance to hydrology.

Krummholz grows on patches anywhere from zero to 1,200 meters of horizontal distance to hydrology, suggesting that it can also grow on steeper slopes where the soil holds less water.

Willow is the only cover type that has no Aspect value above 270-280 degrees, and it grows only in Cache la Poudre.