# Overview:
In this project, you will use deep learning to predict forest cover type (the most common kind of tree cover) based only on cartographic variables. The actual forest cover type for a given 30 x 30 meter cell was determined from US Forest Service (USFS) Region 2 Resource Information System data. The cover types are the following:

1. Spruce/Fir
2. Lodgepole Pine
3. Ponderosa Pine
4. Cottonwood/Willow
5. Aspen
6. Douglas-fir
7. Krummholz


Independent variables were then derived from data obtained from the US Geological Survey and USFS. The data is raw and has not been scaled or preprocessed for you. It contains binary columns of data for qualitative independent variables such as wilderness areas and soil type.

This study area includes four wilderness areas located in the Roosevelt National Forest of northern Colorado. These areas represent forests with minimal human-caused disturbances, so existing forest cover types are mainly a result of ecological processes rather than forest management practices.

Project Objectives:

- Develop one or more classifiers for this multi-class classification problem.
- Use TensorFlow with Keras to build your classifier(s).
- Use your knowledge of hyperparameter tuning to improve the performance of your model(s).
- Test and analyze performance.
- Create clean and modular code.

Prerequisites:

- All lessons, articles, and previous projects in Build Deep Learning Models with TensorFlow

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

In [None]:
from matplotlib.image import imread

In [None]:
data = pd.read_csv("../input/forest-cover-type-dataset/covtype.csv")

In [None]:
data.head()

In [None]:
data['Cover_Type'].nunique()

In [None]:
# Let's check the shape of the dataset
data.shape

In [None]:
# There are more than 581,000 rows (observations) and 55 columns: 54 features plus the Cover_Type target

In [None]:
data.info()

In [None]:
data.isnull().sum()

# Section 1: Exploratory Data Analysis

I want to understand which variables are important and visualize the data to check for outliers or abnormal data points.

In [None]:
import seaborn as sns

In [None]:
# Let's see how the classes are distributed across the whole dataset
sns.countplot(x=data['Cover_Type'])

In [None]:
# Class 4 (Cottonwood/Willow) appears least often, while Lodgepole Pine has the highest count, followed by Spruce/Fir
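
The count plot shows relative bar heights, but exact counts and shares make the imbalance easier to quantify. Below is a minimal sketch on a small made-up label array; `toy_labels` and its values are illustrative assumptions, not numbers taken from the dataset. On the real data, `data['Cover_Type'].value_counts(normalize=True)` in pandas should give the same kind of summary.

```python
import numpy as np

# Hypothetical toy labels standing in for data['Cover_Type'] (coded 1..7).
toy_labels = np.array([2, 2, 2, 2, 1, 1, 1, 3, 7, 4])

# Count how often each class occurs and convert the counts to shares.
classes, counts = np.unique(toy_labels, return_counts=True)
shares = counts / counts.sum()

for cls, count, share in zip(classes, counts, shares):
    print(f"class {cls}: {count} rows ({share:.0%})")

# Ratio between the most and least frequent class: a rough imbalance measure.
imbalance_ratio = counts.max() / counts.min()
print("imbalance ratio:", imbalance_ratio)
```

A ratio far above 1 suggests that plain accuracy may be dominated by the largest classes.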

In [None]:
data.corrwith(data['Cover_Type']).sort_values()

In [None]:
data.corr()

In [None]:
# Visualize through heatmap
plt.figure(figsize=(80,60))
sns.heatmap(data=data.corr(), cmap='YlGnBu', annot=True)

In [None]:
data.describe()

In [None]:
# Wilderness Area 4 has the strongest correlation with the class label among the features in the dataset

In [None]:
# Let's check out the feature Slope
plt.figure(figsize=(19,8))
sns.countplot(x=data['Slope'], hue=data['Cover_Type'])

In [None]:
# A significant number of class 1 and class 2 samples fall in the slope range from 1 to 27.

In [None]:
# Similarly, we can visualize the feature Aspect
plt.figure(figsize=(19,8))
sns.countplot(x=data['Aspect'], hue=data['Cover_Type'])

In [None]:
data.head()

In [None]:
data.dtypes

In [None]:
# Since all columns are int64, no type conversion or categorical encoding is needed; we only need to bring the features onto the same scale before training the model

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
# Set X and y variables to the .values of features and labels
X = data.drop('Cover_Type', axis=1).values
y = data['Cover_Type'].values

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

In [None]:
# Now we need to normalize the data, as mentioned above
from sklearn.preprocessing import MinMaxScaler

In [None]:
scaler = MinMaxScaler()

In [None]:
X_train = scaler.fit_transform(X_train)

In [None]:
X_test = scaler.transform(X_test)
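
To make the fit/transform split concrete, here is a minimal NumPy sketch of min-max scaling with the statistics learned from the training rows only. The arrays `toy_train` and `toy_test` are made-up values for illustration, and the sketch ignores edge cases such as constant columns, which `MinMaxScaler` handles for you.

```python
import numpy as np

# Hypothetical toy feature matrices (think of an elevation and a slope column).
toy_train = np.array([[2000.0, 5.0], [2500.0, 15.0], [3000.0, 25.0]])
toy_test = np.array([[2750.0, 30.0]])

# "Fit": learn the per-column minimum and range from the training rows only.
col_min = toy_train.min(axis=0)
col_range = toy_train.max(axis=0) - col_min

# "Transform": apply the training statistics to both splits.
train_scaled = (toy_train - col_min) / col_range
test_scaled = (toy_test - col_min) / col_range

print(train_scaled)
print(test_scaled)  # the second column exceeds 1 because 30 > training max of 25
```

Test values outside the training range land outside [0, 1]. That is expected, and it is the reason the scaler is fitted on `X_train` and only applied to `X_test`.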

In [None]:
# Encode the class labels 1..7 as integers 0..6, which is what the 7-unit softmax output with sparse_categorical_crossentropy expects
from sklearn.preprocessing import LabelEncoder


In [None]:
class_names = ['Spruce/Fir', 'Lodgepole Pine',
                   'Ponderosa Pine', 'Cottonwood/Willow',
                   'Aspen', 'Douglas-fir', 'Krummholz']

In [None]:
le = LabelEncoder()

In [None]:
y_train = le.fit_transform(y_train)

In [None]:
y_test = le.transform(y_test)
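
`LabelEncoder` assigns each distinct label its position in sorted order, so the original codes 1..7 become 0..6. The sketch below reproduces that mapping with `np.unique` on a made-up array (`toy_y` is an illustrative assumption) and shows how an encoded value maps back to an entry of `class_names`.

```python
import numpy as np

class_names = ['Spruce/Fir', 'Lodgepole Pine', 'Ponderosa Pine',
               'Cottonwood/Willow', 'Aspen', 'Douglas-fir', 'Krummholz']

# Hypothetical toy labels using the dataset's 1..7 coding.
toy_y = np.array([2, 1, 7, 4, 2, 5, 3, 6])

# np.unique returns the sorted distinct labels and, for each row,
# the index of its label in that sorted array (the encoded value).
original_labels, encoded = np.unique(toy_y, return_inverse=True)

# Encoded index i corresponds to class_names[i] because the labels are 1..7.
decoded_names = [class_names[i] for i in encoded]

print(original_labels)
print(encoded)
print(decoded_names)
```

Keeping this mapping in mind helps when reading reports that print the encoded labels: label 3 in such a report is Cottonwood/Willow, not Ponderosa Pine.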

In [None]:
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Activation, Dropout, InputLayer
from tensorflow.keras.constraints import max_norm
from tensorflow.keras.optimizers import Adam

In [None]:
model = Sequential()
#input layer
model.add(InputLayer(input_shape=(X_train.shape[1],)))

#hidden layer
model.add(Dense(78, activation='relu'))
model.add(Dropout(0.3))
#hidden layer
#model.add(Dense(39, activation='relu'))
#model.add(Dropout(0.3))
# hidden layer
model.add(Dense(39, activation='relu'))
model.add(Dropout(0.3))

# output layer
model.add(Dense(7, activation='softmax'))

# compile model, passing the configured optimizer so that learning_rate=0.01 actually takes effect
opt = Adam(learning_rate=0.01)
model.compile(loss='sparse_categorical_crossentropy', optimizer=opt, metrics=['accuracy'])
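
The pairing of a 7-unit softmax output with `sparse_categorical_crossentropy` is easier to follow with a small numeric example. The following is a simplified NumPy sketch, not Keras' implementation (it omits the probability clipping Keras applies), and `toy_logits` and `toy_y` are made-up values.

```python
import numpy as np

def softmax(logits):
    # Subtract the row maximum for numerical stability before exponentiating.
    shifted = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)

def sparse_categorical_crossentropy(y_true, probs):
    # Pick the predicted probability of the true class for every row
    # and average the negative log of those probabilities.
    picked = probs[np.arange(len(y_true)), y_true]
    return float(-np.log(picked).mean())

# Hypothetical raw outputs for 2 samples and 7 classes.
toy_logits = np.array([[2.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0],
                       [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]])
toy_y = np.array([0, 6])  # integer labels in 0..6, as produced by LabelEncoder

probs = softmax(toy_logits)
loss = sparse_categorical_crossentropy(toy_y, probs)
print(probs.sum(axis=1), loss)
```

Because the loss indexes the output vector with the integer label, the labels must lie in 0..6. This is why the original 1..7 codes are encoded before training.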

In [None]:
from tensorflow.keras.callbacks import EarlyStopping

In [None]:
early_stop = EarlyStopping(monitor='val_accuracy',verbose=0, patience=3)
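
The patience rule can be sketched in plain Python. This is a simplified re-implementation for illustration only; the function `early_stopping_epoch` and the `toy_val_accuracy` values are assumptions, and the real Keras callback has further options such as `restore_best_weights`.

```python
def early_stopping_epoch(val_scores, patience=3, min_delta=0.0):
    """Return the 0-based epoch at which training would stop, or None."""
    best = float("-inf")
    wait = 0
    for epoch, score in enumerate(val_scores):
        if score - min_delta > best:
            # Improvement: remember it and reset the patience counter.
            best = score
            wait = 0
        else:
            # No improvement: count this epoch against the patience budget.
            wait += 1
            if wait >= patience:
                return epoch
    return None

# Hypothetical validation accuracies, one per epoch.
toy_val_accuracy = [0.70, 0.75, 0.76, 0.76, 0.755, 0.76, 0.80]
stop_epoch = early_stopping_epoch(toy_val_accuracy, patience=3)
print(stop_epoch)
```

In this toy run, training stops after three epochs without improvement and never reaches the later, better epoch. A small `patience` is therefore a trade-off between training time and the risk of stopping on a plateau.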

In [None]:
model.fit(X_train, y_train, validation_data=(X_test, y_test), epochs=100, batch_size=1024, callbacks=[early_stop])

In [None]:
#from tensorflow.keras.models import load_model

In [None]:
#model.save('forest_cover_classification.h5')

In [None]:
losses = pd.DataFrame(model.history.history)

In [None]:
losses[['loss','val_loss']].plot()

In [None]:
losses[['accuracy','val_accuracy']].plot()

In [None]:
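# evaluate returns its values in the order given by model.metrics_names (loss first, then accuracy)
print(model.metrics_names)
score = model.evaluate(X_test, y_test, verbose=0)
print('Test loss:', score[0])
print('Test accuracy:', score[1])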

In [None]:
from sklearn.metrics import classification_report,confusion_matrix

In [None]:
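# predict_classes was removed in TensorFlow 2.6; take the argmax of the softmax output instead
predictions = np.argmax(model.predict(X_test), axis=1)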

In [None]:
print(classification_report(y_test,predictions))

In [None]:
# This could be better. According to the report, the model made few or no correct predictions for classes 3, 4 and 5 (encoded labels, i.e. Cottonwood/Willow, Aspen and Douglas-fir).

In [None]:
early_stop = EarlyStopping(monitor='val_accuracy',min_delta=0.0001, verbose=0, patience=3)

In [None]:
model = Sequential()
#input layer
model.add(InputLayer(input_shape=(X_train.shape[1],)))

model.add(Dense(128, activation='relu'))
model.add(Dropout(0.2))

#hidden layer
model.add(Dense(78, activation='relu'))
model.add(Dropout(0.2))
#hidden layer
model.add(Dense(39, activation='relu'))
model.add(Dropout(0.2))
# hidden layer
model.add(Dense(19, activation='relu'))
model.add(Dropout(0.2))

# output layer
model.add(Dense(7, activation='softmax'))

# compile model, passing the configured optimizer so that learning_rate=0.01 actually takes effect
opt = Adam(learning_rate=0.01)
model.compile(loss='sparse_categorical_crossentropy', optimizer=opt, metrics=['accuracy'])

In [None]:
history = model.fit(X_train, y_train, validation_split=0.1, epochs=200, batch_size=1024, callbacks=[early_stop])

In [None]:
score = pd.DataFrame(history.history)

In [None]:
score[['loss','val_loss']].plot()

In [None]:
score[['accuracy','val_accuracy']].plot()

In [None]:
y_preds = model.predict(X_test)

In [None]:
y_preds = np.argmax(y_preds, axis=1)

In [None]:
print(classification_report(y_test, y_preds, target_names=class_names))

In [None]:
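# confusion_matrix expects (y_true, y_pred): rows are the true classes, columns the predicted classes
print(confusion_matrix(y_test, y_preds))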
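
When scikit-learn's `confusion_matrix` is called as `confusion_matrix(y_true, y_pred)`, rows correspond to true classes and columns to predicted classes. The sketch below builds such a matrix by hand on made-up 3-class labels (`toy_true`, `toy_pred`, and the helper `confusion_counts` are illustrative assumptions) and derives per-class recall and precision from it.

```python
import numpy as np

def confusion_counts(y_true, y_pred, n_classes):
    # Rows index the true class, columns index the predicted class.
    matrix = np.zeros((n_classes, n_classes), dtype=int)
    np.add.at(matrix, (y_true, y_pred), 1)
    return matrix

# Hypothetical toy labels for a 3-class problem.
toy_true = np.array([0, 0, 0, 0, 1, 1, 1, 2, 2, 2])
toy_pred = np.array([0, 0, 0, 1, 1, 1, 0, 1, 1, 2])

cm = confusion_counts(toy_true, toy_pred, n_classes=3)

# Recall: correct predictions divided by the row total (all true members).
recall = cm.diagonal() / cm.sum(axis=1)
# Precision: correct predictions divided by the column total (all predictions).
precision = cm.diagonal() / cm.sum(axis=0)

print(cm)
print("recall:", recall)
print("precision:", precision)
```

Large off-diagonal entries in a row show which classes a given true class is being confused with. That is usually more informative than the overall accuracy.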

- The confusion matrix shows many misclassifications in every class. Lodgepole Pine (class 2), Cottonwood/Willow, Aspen, and Douglas-fir have especially large numbers of misclassified samples. That is why we might want to investigate our dataset:
 + Is our dataset imbalanced?
 + In the analysis above, Wilderness Area 4 was the feature most correlated with the class label, but the others were not. The soil types are also similar to each other and might add noise to the model.
 + If the dataset is imbalanced, accuracy might not be a reliable metric for training and evaluating the model.
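
One common way to act on the imbalance question is to weight rare classes more heavily during training. The sketch below computes "balanced" weights with the usual `n_samples / (n_classes * count)` heuristic; the helper `balanced_class_weights` and the `toy_y_train` values are illustrative assumptions, not part of the original notebook.

```python
import numpy as np

def balanced_class_weights(labels):
    # weight_c = n_samples / (n_classes * count_c): rare classes get larger weights.
    classes, counts = np.unique(labels, return_counts=True)
    weights = len(labels) / (len(classes) * counts)
    return {int(c): float(w) for c, w in zip(classes, weights)}

# Hypothetical toy encoded labels (0-based), heavily skewed toward class 1.
toy_y_train = np.array([1] * 6 + [0] * 3 + [2] * 1)
class_weight = balanced_class_weights(toy_y_train)
print(class_weight)
```

Keras' `model.fit` accepts such a dictionary through its `class_weight` argument, so one option to try would be `model.fit(X_train, y_train, class_weight=class_weight, ...)` with weights computed from the real `y_train`. Looking at the macro-averaged F1 score in the classification report, rather than accuracy alone, would show whether it helps the minority classes.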
 

## Still not done yet..