# Forest Cover Data Exploration

## Motivation

The purpose of this project is to provide a simple overview of how Python data visualization tools can be used to understand a large, complex dataset.
The dataset contains information about features of forested areas. It includes numerical variables (such as elevation and distances to hydrology, roadways, and fire points) as well as categorical variables (wilderness area, soil type, and tree cover type).

Through seven data visualization techniques, we will build an understanding of this data and the trends that underlie it.



## Setup  
1. Import necessary packages
2. Gain a high level understanding of data  
3. Set up data for manipulation

In [None]:
# Suppress unnecessary warnings so that the presentation looks clean
import warnings
warnings.filterwarnings('ignore')

In [None]:
#Import packages
import numpy as np 
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

In [None]:
#import data
training = pd.read_csv('/kaggle/input/forest-cover-type-prediction/train.csv')
test = pd.read_csv('/kaggle/input/forest-cover-type-prediction/test.csv')

In [None]:
#Look at training data - 56 populated columns
training.head(10)

In [None]:
training.describe()

In [None]:
# %matplotlib inline renders each plot as a static image directly in the notebook
# .columns lists the names of the columns in the training data
%matplotlib inline
training.columns

## Exploratory Plan

**Explore by:**

1. Making histograms to understand the distributions of all numeric variables
2. Creating and visualizing a correlation matrix to understand correlations between variables
3. Creating a pivot table to view average values numerically
4. Visualizing average numerical values using bar plots
5. Using violin plots to see the relationship between the categorical variable (cover type) and other forest attributes
6. Making scatterplots to visualize horizontal distance to hydrology by elevation by cover type
7. Creating countplots to relate categorical variables:  
    a. Cover type by wilderness area  
    b. Cover type by soil type

## Histograms  
Understand distributions for all numeric variables   

In [None]:
# .describe() shows summary statistics for our data: count, mean, standard deviation, min, quartiles, and max
# These help us think about the data differently and make associations
training.describe()

In [None]:
# look at numeric values separately 
df_num = training[['Elevation', 'Aspect', 'Slope',
                   'Horizontal_Distance_To_Hydrology', 'Vertical_Distance_To_Hydrology',
                   'Horizontal_Distance_To_Roadways',
                   'Hillshade_9am', 'Hillshade_Noon', 'Hillshade_3pm',
                   'Horizontal_Distance_To_Fire_Points']]

In [None]:
# for loop runs through all of the numeric variables and displays histograms for them all

for i in df_num.columns:
    plt.hist(df_num[i])
    plt.title(i)
    plt.show()
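To see the numbers behind one of these histograms, the sketch below bins a column with `np.histogram`, using 10 equal-width bins to match the `plt.hist` default. The synthetic `elevation` values are an assumption standing in for `df_num['Elevation']`.

```python
import numpy as np

# Hypothetical stand-in for df_num['Elevation']; the real column comes from train.csv
rng = np.random.default_rng(0)
elevation = rng.normal(loc=2750, scale=400, size=1000)

# plt.hist uses 10 equal-width bins by default; np.histogram does the same binning
counts, edges = np.histogram(elevation, bins=10)

# Print each bin's range next to the number of values that fall in it
for left, right, count in zip(edges[:-1], edges[1:], counts):
    print(f"{left:8.1f} to {right:8.1f}: {count}")
```

Each printed row corresponds to one bar: the bin edges give the bar's position and the count gives its height.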

## Correlation Matrices  
Understand correlations between variables  

In [None]:
# Prints a correlation matrix that shows how variables correlate with each other.
print(df_num.corr())

In [None]:
# Print a heatmap representation of the above correlation matrix
sns.heatmap(df_num.corr())

**Interpretation of degrees of correlation:**

Perfect: If the value is ±1, it is said to be a perfect correlation: as one variable increases, the other variable also increases (if positive) or decreases (if negative).

High degree: If the magnitude of the coefficient lies between 0.50 and 1, it is said to be a strong correlation.

Moderate degree: If the magnitude lies between 0.30 and 0.49, it is said to be a medium correlation.

Low degree: When the magnitude lies below 0.30 (but above zero), it is said to be a small correlation.

No correlation: When the value is zero.
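The thresholds above can be sketched as a small lookup. `correlation_strength` is a hypothetical helper, not part of pandas or seaborn, and two choices in it are assumptions: only exactly ±1 counts as perfect, and 0.30 and 0.50 are used as the cut-offs.

```python
def correlation_strength(r):
    """Label a correlation coefficient using the thresholds listed above."""
    magnitude = abs(r)
    if magnitude > 1:
        raise ValueError("a correlation coefficient must lie in [-1, 1]")
    if magnitude == 1:
        return "perfect"
    if magnitude >= 0.5:
        return "high"
    if magnitude >= 0.3:
        return "moderate"
    if magnitude > 0:
        return "low"
    return "none"


for r in (1.0, -0.62, 0.35, -0.1, 0.0):
    print(r, correlation_strength(r))
```

The sign is ignored on purpose: it tells us the direction of the relationship, while the magnitude tells us its strength.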

## Pivot Table  
View average values numerically

In [None]:
#Get the average numerical attributes of the cover types
pd.pivot_table(training, index = 'Cover_Type', values = ['Elevation', 'Aspect', 'Slope', 'Horizontal_Distance_To_Hydrology', 'Vertical_Distance_To_Hydrology','Horizontal_Distance_To_Roadways',
                                                         'Hillshade_9am', 'Hillshade_Noon', 'Hillshade_3pm','Horizontal_Distance_To_Fire_Points'])
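As a rough illustration of what the pivot table computes, the sketch below builds the same per-class averages on a small made-up table and checks them against an explicit `groupby`. The `toy` frame and its values are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the training set
toy = pd.DataFrame({
    "Cover_Type": [1, 1, 2, 2, 2],
    "Elevation": [3000, 3100, 2500, 2600, 2700],
    "Slope": [10, 14, 20, 22, 24],
})

# pivot_table aggregates with the mean unless aggfunc says otherwise
pivot = pd.pivot_table(toy, index="Cover_Type", values=["Elevation", "Slope"])

# The same averages via an explicit groupby
grouped = toy.groupby("Cover_Type")[["Elevation", "Slope"]].mean()

print(pivot)
print(np.allclose(pivot.values, grouped[pivot.columns].values))
```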

## Barplots  
Visualize average numerical values

In [None]:
# Visualize the numerical attributes based on their average value
    # We will visualize all the attributes using bar plots
        # Black tips on bars are error bars; by default seaborn draws a 95% confidence interval around the mean

#names of all the attributes 
cols = training.columns

#number of attributes (exclude target)
size = len(cols)-1

#x-axis has target attribute to distinguish between classes
x = cols[size]

#y-axis shows values of an attribute
y = cols[0:size]

#Plot a bar plot for each attribute
for i in range(0,size):
    sns.barplot(x=x, y=y[i], data=training)
    plt.show()
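Each bar sits at the mean of an attribute for one cover type. As far as seaborn's documented defaults go, the error bar is a bootstrap confidence interval around that mean. The sketch below approximates the idea with NumPy on synthetic values; the data, the 1000 resamples, and the percentile method are assumptions, and seaborn's own resampling details may differ.

```python
import numpy as np

# Hypothetical elevations for a single cover type
rng = np.random.default_rng(0)
values = rng.normal(loc=2750, scale=400, size=200)

# Bar height: the mean of the group
mean = values.mean()

# Error bar: resample with replacement, recompute the mean each time,
# and take the middle 95% of the resampled means
boot_means = np.array([
    rng.choice(values, size=values.size, replace=True).mean()
    for _ in range(1000)
])
lower, upper = np.percentile(boot_means, [2.5, 97.5])

print(f"mean={mean:.1f}, 95% CI=({lower:.1f}, {upper:.1f})")
```

The interval describes uncertainty in the mean, so it is much narrower than the spread of the individual values.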

## Violin Plots  
See the relationship between categorical variables (cover type) and other forest attributes. 

In [None]:
# We will visualize all the attributes using Violin Plot - a combination of box and density plots

#names of all the attributes 
cols = training.columns

#number of attributes (exclude target)
size = len(cols)-1

#x-axis has target attribute to distinguish between classes
x = cols[size]

#y-axis shows values of an attribute
y = cols[0:size]

#Plot violin for all attributes
for i in range(0,size):
    sns.violinplot(data=training,x=x,y=y[i])  
    plt.show()

# Elevation is strongly associated with Cover_Type
# Aspect shows a mix of roughly normal distributions for several classes
# Horizontal distances to roadways, fire points, and hydrology have similar distributions
# Hillshade 9am and noon display left skew
# Hillshade 3pm is roughly normal
# Vertical distance to hydrology contains many 0s


## Scatterplots  
Visualize horizontal distance to hydrology by elevation by cover type    

In [None]:
# Make a list 'plot_features' with the horizontal distance columns

plot_features = ['Horizontal_Distance_To_Hydrology', 
                 'Horizontal_Distance_To_Roadways', 
                 'Horizontal_Distance_To_Fire_Points']

# Pick a Seaborn color palette
colors = sns.color_palette('deep')

# Make a copy of the training data
sample = training.copy()

#Set up a for loop

#Loop through Cover_Type values 1-7
for cover in [1,2,3,4,5,6,7]:
    
    # Rest = every element in the list except for the current cover element
    rest = list(set([1,2,3,4,5,6,7]) - set([cover]))
    
    # Copy Cover_Type from training set
    sample['Cover_Type'] = training['Cover_Type'].copy()
    
    # Set every value from the "rest" list to 0
    sample['Cover_Type'] = sample['Cover_Type'].replace(rest, 0)
    
    # create a figure object
    fig = plt.figure(figsize=(16, 12))
    #Choose colors
    palette = ['lavender', colors[cover]]
    
    # For loop to create scatterplots
    
    # Loops over indices 0-2 because we are showing 3 horizontal distances
    for i in range(3):
        
        # The (3, 3) defines a 3x3 subplot grid
        # i+1 is the position in that grid, so the 3 elements of plot_features (horizontal distances) fill the top row
        fig.add_subplot(3, 3, i+1)
        
        # x axis = elevation
        # y axis = the horizontal distance for this iteration
        # data = our new sample
        # hue = the cover type for this iteration of the outer loop (all other types are 0)
        # marker = the symbol drawn for each point ('+' draws a small plus sign)
        # palette = the colors we picked above
        
        ax = sns.scatterplot(x='Elevation', 
                             y=plot_features[i], 
                             data=sample, 
                             hue='Cover_Type',
                             marker='+',
                             palette=palette)
    #tight_layout automatically adjusts subplot params so that the subplot(s) fits in to the figure area. 
    plt.tight_layout()
    plt.show()
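The relabeling step inside the loop, which keeps one cover type and sets every other type to 0, can also be written with `Series.where`. This is a minimal sketch on a made-up series; `cover_type` is an assumption standing in for `training['Cover_Type']`.

```python
import pandas as pd

# Hypothetical Cover_Type values standing in for training['Cover_Type']
cover_type = pd.Series([1, 2, 3, 2, 7, 1], name="Cover_Type")

cover = 2  # the class highlighted in this iteration

# Keep values equal to `cover`; replace everything else with 0
one_vs_rest = cover_type.where(cover_type == cover, 0)

print(one_vs_rest.tolist())  # [0, 2, 0, 2, 0, 0]
```

This avoids building the `rest` list on every iteration while producing the same two-level hue column.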

## Countplots  
Relate categorical variables:    
            a. Cover type by wilderness area  
            b. Cover type by soil type

In [None]:
# Quick for loop to get numbered list of columns for use below
for col in training.columns:
    print(training.columns.get_loc(col),col)     

In [None]:
# Group one-hot encoded variables of a category into one single variable
    # One hot encoded variables are representations of categorical variables as binary vectors
        # For example, Wilderness_Area is represented as a binary vector in 4 columns for each of the 4 wilderness areas


#names of all the columns
cols = training.columns

# training.shape returns a tuple of (number of rows, number of columns)
    # So number of rows = r, number of columns = c
r,c = training.shape

#Create a new dataframe with r rows, one column for each encoded category, and target in the end

data = pd.DataFrame(index=np.arange(0, r),columns=['Wilderness_Area','Soil_Type','Cover_Type'])

# We now have an empty dataframe with as many rows as training and one column each for Wilderness_Area, Soil_Type, and Cover_Type
data

In [None]:
#Make an entry in 'data' for each r as category_id, target value

# For loop in range (0 up to the number of rows in 'Data'- which is 15,120)
    # Range (0,15120) is actually 0-15119

for i in range(0,r):
    w = 0
    s = 0
    
    # Category 1 range - find the wilderness area
        # Columns 10-13 run from the first wilderness area through the last wilderness area
            # range(10,14) is actually the numbers 10-13
            
    for j in range(10,14):
        
        # training.iloc[row,column] returns the value at the row, column location.
            # So if there is a 1 at this location, this row has a binary "yes" identifying it as being from that wilderness area
            # For example, training.iloc[i,j] will evaluate to 1 if i=5 and j=10, and at that "cell (5,10)" the binary identifier is yes
                # column 10 is Wilderness Area 1
            # So once we evaluate to yes, we move on to the steps inside the if statement
            
        if (training.iloc[i,j] == 1):
            
            # W is going to be our wilderness location when we input below 
                # So, using the above example (i=5, j=10), wilderness area would be set to 10-9 =1. 
                # w =1
                     # If the area in row 5 had been of Wilderness Area 2, it would have looped through once more. 
                        # So J would have equaled 11.
                            # W would have equaled (11-9)=2
                        
            w=j-9  # Wilderness Area input. 10-9=1
            
            # now we have a W value, so we stop the loop for finding wilderness area for this given row and move on to find soil type
            break
            
    # Category 2 range - find the soil type
        # Columns 14-53 run from the first soil type through the last soil type
              # range(14,54) is actually the numbers 14-53
                # If you look above at the numbered column list, 53 is the last soil column
            
    for k in range(14,54):
        
        # training.iloc[row,column] returns the value at the row, column location.
            # So if there is a 1 at this location, this row has a binary "yes" identifying it as being of that soil type
            # For example, training.iloc[i,k] will evaluate to 1 if i=5 and k=43, and at that "cell (5,43)" the binary identifier is yes
                # column 43 is Soil Type 30
            # So once we evaluate to yes, we move on to the steps inside the if statement
            
        if (training.iloc[i,k] == 1):
            
            # s is going to be our soil type when we input below
            # So, using the above example (i=5, k=43), soil type would be set to 43-13 = 30.
            # s = 30
                 # If the area in row 5 had been of Soil Type 31, it would have looped through once more.
                    # So k would have equaled 44.
                        # s would have equaled (44-13) = 31
            
            s=k-13 # Soil Type input. 43-13=30
            
            # now we have a S value, so we stop the loop for finding Soil Type for this given row and move on to input the values
            break
    
    
    # Make an entry in 'data' for each r
    
        # Set the row i = 5 to a 3 element array that fills in the 3 empty columns in the 'data' table.
            # 3 elements are wilderness area, soil type, and Cover type
        
        # i is the row in question - 5 in this example
        # w is the wilderness area - (10-9) = 1 in this example
        # s is the soil type - (43-13) = 30 in this example
        # training.iloc[i,c-1] is the cover type
            # Cover type is not one-hot encoded, so we can index it directly using training.iloc
                # i = the row in question, 5 in this example
                # c - 1 gets us the last column of the training set
                
          
    data.iloc[i]=[w,s,training.iloc[i,c-1]]
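The row-by-row loop above can also be expressed by column name rather than by position, which may be easier to keep correct if the column order ever changes. The sketch below is one possible approach, not the method used in this notebook: `decode_one_hot` is a hypothetical helper, the `toy` frame is made up, and it assumes every row has exactly one 1 in each one-hot group.

```python
import pandas as pd

# Hypothetical toy frame with the same one-hot naming as the training data
toy = pd.DataFrame({
    "Wilderness_Area1": [1, 0, 0],
    "Wilderness_Area2": [0, 1, 1],
    "Soil_Type1": [0, 1, 0],
    "Soil_Type2": [1, 0, 0],
    "Soil_Type10": [0, 0, 1],
    "Cover_Type": [5, 2, 1],
})


def decode_one_hot(frame, prefix):
    """Return the numeric suffix of the column that holds the 1 in each row."""
    cols = [c for c in frame.columns if c.startswith(prefix)]
    # idxmax(axis=1) gives the name of the column with the largest value per row
    winners = frame[cols].idxmax(axis=1)
    return winners.str[len(prefix):].astype(int)


decoded = pd.DataFrame({
    "Wilderness_Area": decode_one_hot(toy, "Wilderness_Area"),
    "Soil_Type": decode_one_hot(toy, "Soil_Type"),
    "Cover_Type": toy["Cover_Type"],
})
print(decoded)
```

One difference to keep in mind: a row with no 1 in a group would be assigned the first column of that group here, whereas the loop leaves it at 0.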

In [None]:
# Now our data table is populated
data

In [None]:
#Plot for Wilderness Area  
sns.countplot(x="Wilderness_Area", hue="Cover_Type", data=data)
plt.show()

#Plot for Soil Type
plt.rc("figure", figsize=(25, 10))
sns.countplot(x="Soil_Type", hue="Cover_Type", data=data)
plt.show()

# right-click and open the image in a new window for larger size 
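To read exact numbers off a countplot, a cross-tabulation gives the same counts as a table. This is a minimal sketch on made-up rows; `toy` is an assumption standing in for the `data` table.

```python
import pandas as pd

# Hypothetical decoded rows standing in for the `data` table built above
toy = pd.DataFrame({
    "Wilderness_Area": [1, 1, 3, 4, 4, 4],
    "Cover_Type": [1, 2, 2, 4, 4, 3],
})

# Rows: wilderness area; columns: cover type; cells: number of observations
counts = pd.crosstab(toy["Wilderness_Area"], toy["Cover_Type"])
print(counts)
```

Each cell corresponds to the height of one bar in the countplot, with zeros where a cover type never appears in an area.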

Wilderness_Area 4 contains many observations of Cover_Type 4, which gives good class distinction.  

Wilderness_Area 3 shows very little class distinction.  

Soil Types **1-6, 10-14, 17, 22-23, 29-33, 35, and 38-40** offer significant class distinction, as the counts for some cover types are very high.

## Review and Future Work  

In this notebook, we used data visualization to gain an understanding of a large dataset of information about forests.   
In the future, we could leverage this understanding to build a classification model that predicts forest cover type by using the other columns of the dataset.  

Thank you for reading!