# December Tabular Series
My approach will be in 3 main steps:
1. EDA (Exploratory Data Analysis)
2. Preprocessing
3. Modeling & fine-tuning

Library Installs and Imports:

In [None]:
!pip install millify #Readable large numbers

In [None]:
#Library Imports
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt #Visualization
import seaborn as sns #Visualization
from millify import millify #Readable large numbers

In [None]:
#Get Data
train = pd.read_csv('../input/tabular-playground-series-dec-2021/train.csv')
test = pd.read_csv('../input/tabular-playground-series-dec-2021/test.csv')

The test data will only be used for model testing, to avoid any data leakage. All the EDA will be done on the train set.

# 1. EDA (Exploratory Data Analysis)

## 1.1 Data Inspection 
I start by visualizing the table to get an idea of all the features and their values. An explanation of all features is detailed in: https://www.kaggle.com/c/forest-cover-type-prediction/data

In [None]:
train.head().T #I am using .T (transpose) as the table is too big to visualize horizontally.

## 1.2 Target class distribution 

I want to know how many values we have for each class. We will see if there are any class imbalances and take them into account for future steps.

In [None]:
y_counts = train["Cover_Type"].value_counts()                # Count number of records for each distinct value (class)
y_counts.sort_index(inplace = True)                          # Sort the returned Series, for better plotting
bars = sns.barplot(x = y_counts.index,y = y_counts.values)   # Make barplot

records = []                                                 # Convert the numerical values to easily readable strings
for x in y_counts:
    records.append(millify(x, precision= 1))

plt.bar_label(bars.containers[0], labels = records)          # Set the labels displaying the values of each bar
plt.show()
print("Total number of records: {:,}".format(sum(y_counts)))

There is a huge class imbalance: one class has a single record while another has 2.3 million! We will have to pay close attention to it to train our models correctly. We will also need to be careful when reading the results, as a model that predicts only classes 1 and 2 could reach a very good accuracy despite failing to identify all the other classes.
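To make the accuracy caveat concrete, here is a minimal sketch on synthetic labels (not the competition data). The 90/8/2 class proportions are an assumption chosen only for illustration: a "model" that always predicts the majority class gets a high accuracy but a low balanced accuracy, which averages the recall of each class.

```python
import numpy as np
from sklearn.metrics import accuracy_score, balanced_accuracy_score

# Synthetic, imbalanced labels (illustrative proportions, not the real data)
y_true = np.array([1] * 90 + [2] * 8 + [3] * 2)

# A "model" that always predicts the majority class
y_pred = np.full_like(y_true, 1)

acc = accuracy_score(y_true, y_pred)                 # share of correct predictions
bal_acc = balanced_accuracy_score(y_true, y_pred)    # mean of the per-class recall

print(f"Accuracy: {acc:.2f}")
print(f"Balanced accuracy: {bal_acc:.2f}")
```

Reporting a per-class metric next to plain accuracy is one possible way to notice when the minority classes are being ignored.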

## 1.3 Descriptive Statistics 
Next, we are going to look at the descriptive statistics of each feature to get an idea of the values it contains. As we have continuous and categorical data, I will first split them and work with each group independently. As I am just trying to get a feel for the data, there is no problem in doing this in several steps. Later I will combine them to look for more specific characteristics.

In [None]:
train.info()   #Checking if all features are numbers and how they are organised in the table

In [None]:
#Splitting continuous and categorical data for easier analysis
continuous = train.iloc[:, 1:11].copy()    # .copy() so that adding columns later does not touch train
categorical = train.iloc[:, 11:-1]
target = train["Cover_Type"]

In [None]:
continuous.describe().T    # Only the continuous columns; as they are ordered, I just selected the first 10

Features come in very different scales and ranges. It might be interesting to try scaling for some models. However, we have to be careful with scaling, as models can become harder to interpret and relations between variables may be less obvious.
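As a minimal sketch of what scaling could look like, the block below standardizes two synthetic features with very different ranges. The feature values and the names `X_tr` and `X_te` are hypothetical stand-ins, not the notebook's data. The scaler learns its mean and standard deviation from the training part only and reuses them on the test part, which keeps the test data out of the fitting step.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)

# Two synthetic features on very different scales (hypothetical stand-ins)
X_tr = np.column_stack([rng.normal(3000, 300, 500), rng.normal(15, 8, 500)])
X_te = np.column_stack([rng.normal(3000, 300, 200), rng.normal(15, 8, 200)])

scaler = StandardScaler()
X_tr_scaled = scaler.fit_transform(X_tr)   # learn mean/std on the training part only
X_te_scaled = scaler.transform(X_te)       # reuse the same mean/std on the test part

print(X_tr_scaled.mean(axis=0).round(3), X_tr_scaled.std(axis=0).round(3))
```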

## 1.4 Feature Distribution
Now we are going to look at the distribution of each feature and its relation to the target (Cover Type).

In [None]:
continuous['Cover_Type'] = target     # I am joining the target to see the relation between features and target 

In [None]:
continuous_s =  continuous.sample(30000, random_state = 1)   # As we have many rows I am just going to take a sample of the data to work faster

In [None]:
features = continuous.columns[:-1]   # I drop the target as it is not a feature

fig = plt.figure(figsize = (30,10))
for i, feature in enumerate(features, start = 1):
    ax = fig.add_subplot(2, 5, i)
    sns.kdeplot(data = continuous_s, x = feature, alpha = 0.2, fill = True, ax = ax)   # One density plot per feature
plt.show()

We clearly see that the features don't follow a normal distribution. We may have to standardize or transform them for some models.
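One way to put a number on "not normal" is the sample skewness. The sketch below uses synthetic columns (the names `symmetric` and `right_skewed` are made up for illustration) and also shows a log transform as one possible remedy, which only applies to non-negative values.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
demo = pd.DataFrame({
    "symmetric": rng.normal(0, 1, 10_000),          # roughly normal
    "right_skewed": rng.exponential(1.0, 10_000),   # long right tail
})

skewness = demo.skew()    # close to 0 for symmetric data, positive for a right tail
print(skewness.round(2))

# log1p compresses the long right tail of non-negative data
log_skew = np.log1p(demo["right_skewed"]).skew()
print(round(log_skew, 2))
```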

## 1.5 Relation with target 
As we have a huge class imbalance, I will use violin plots to understand the direct relationship between features and target. A direct density plot is unsuitable, as the imbalance makes the plot harder to interpret.

In [None]:
#Example of direct density plot, classes with very low samples are harder to see!
fig = plt.figure(figsize = (12,4))
ax1 = fig.add_subplot(1, 2, 1)
ax2 = fig.add_subplot(1, 2, 2)
sns.kdeplot(data = continuous_s,x = "Elevation", hue="Cover_Type", alpha = 0.2, fill = True, legend = True, ax = ax1, palette="Set2")
sns.violinplot(data = continuous_s,x = "Cover_Type", y = "Elevation", linewidth = 0.5, palette="Set2", ax=ax2)

In [None]:
features = continuous.columns[:-1]   # I drop the target as it is not a feature

i = 1
fig = plt.figure(figsize = (30,10))
for feature in features:
    ax = fig.add_subplot(2, 5, i)
    sns.violinplot(data = continuous_s,x = "Cover_Type", y = feature, linewidth = 0.5, palette="Set2")
    i += 1
plt.show()

Many distributions are similar across classes; the most distinctive feature is Elevation. It will be interesting to check whether a naive model based on Elevation alone gives good results.
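A naive elevation-only model could be sketched as a shallow decision tree on that single column. The data below is synthetic (three classes placed at assumed elevations), so the score says nothing about the real dataset; it only shows the mechanics.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(1)

# Synthetic stand-in: three classes living at different elevations (assumed values)
demo = pd.DataFrame({
    "Elevation": np.concatenate([
        rng.normal(2200, 100, 300),
        rng.normal(2800, 100, 300),
        rng.normal(3400, 100, 300),
    ]),
    "Cover_Type": np.repeat([1, 2, 3], 300),
})

X_tr, X_te, y_tr, y_te = train_test_split(
    demo[["Elevation"]], demo["Cover_Type"], test_size=0.33, random_state=1
)

# A shallow tree on one feature learns a few elevation thresholds
naive = DecisionTreeClassifier(max_depth=3, random_state=1)
naive.fit(X_tr, y_tr)
naive_acc = naive.score(X_te, y_te)
print(f"Elevation-only accuracy: {naive_acc:.3f}")
```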

## 1.6 Correlation between features

I am checking whether there is any correlation between the variables. Strongly correlated features could lead to poor model performance and would need to be addressed.

In [None]:
sns.pairplot(continuous_s, hue= "Cover_Type", palette="Set2")

There is definitely no obvious correlation between features. But again, Elevation seems to be a good predictor of Cover Type (target). As an alternative check, I am going to plot the correlation matrix (Pearson coefficient) of the features.

In [None]:
corr = continuous_s.corr(method='pearson')
sns.heatmap(corr)

As expected, we can clearly see again that there is no strong correlation between features.
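A heatmap is quick to read, but a ranked list of pairs makes the same check easier to verify. This is a minimal sketch on synthetic columns (`a`, `b` and `c` are made-up names, with `b` built to correlate with `a`); the same lines could be run on the sampled features.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
a = rng.normal(size=1000)
demo = pd.DataFrame({
    "a": a,
    "b": a * 0.9 + rng.normal(scale=0.3, size=1000),   # built to correlate with "a"
    "c": rng.normal(size=1000),                         # independent noise
})

demo_corr = demo.corr(method="pearson")

# Keep only the upper triangle (k=1 drops the diagonal) so each pair appears once
upper = demo_corr.where(np.triu(np.ones(demo_corr.shape, dtype=bool), k=1))
pairs = upper.stack().dropna().abs().sort_values(ascending=False)
print(pairs.head())
```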

## 1.7 Categorical EDA

Next, I am going to analyse the categorical features.

In [None]:
categorical.columns                   #Print all categorical columns

Categorical features are all one-hot encoded, which means that each feature is spread over multiple binary columns, one per possible class. That is fine for modeling but not for EDA, so the first thing to do is to group each feature into a single column.
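To show how the grouping works, here is a minimal sketch on a toy one-hot block (the `Area_*` column names are hypothetical). `idxmax(axis=1)` returns, for each row, the name of the first column holding the maximum value.

```python
import pandas as pd

# Toy one-hot block with hypothetical column names
onehot = pd.DataFrame({
    "Area_1": [1, 0, 0],
    "Area_2": [0, 1, 0],
    "Area_3": [0, 0, 1],
})

# One label per row: the column where the 1 sits
labels = onehot.idxmax(axis=1)
print(labels.tolist())

# Sanity check: idxmax is only meaningful when exactly one column is active per row
active_per_row = onehot.sum(axis=1)
print((active_per_row == 1).all())
```

A row of all zeros would silently map to the first column, so the row-sum check is worth running on the real columns as well.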

In [None]:
categories = pd.DataFrame()
categories["Wilderness_Area"] = categorical.iloc[:,: 4].idxmax(1)
categories["Soil_Type"] = categorical.iloc[:,4:].idxmax(1)
categories["Cover_Type"] = target                                  #Include target to see relation with features
categories.head()

Now we can check how many times each class appears to get an idea of the distribution.

In [None]:
fig = plt.figure(figsize = (12,4))
ax1 = fig.add_subplot(1, 2, 1)
ax2 = fig.add_subplot(1, 2, 2)

sns.countplot(data= categories, y = "Wilderness_Area", ax = ax1)
sns.countplot(data= categories, y = "Soil_Type", ax = ax2)
ax2.axes.get_yaxis().set_visible(False)


Both distributions are quite unbalanced. I wonder if there is a class that is directly related to a Cover Type (target). To check it, I am going to plot the relation.

In [None]:
fig = plt.figure(figsize = (12,4))
ax1 = fig.add_subplot(1, 2, 1)
ax2 = fig.add_subplot(1, 2, 2)

sns.countplot(data= categories, y = "Wilderness_Area", hue = "Cover_Type", ax = ax1)
sns.countplot(data= categories, y = "Soil_Type", hue = "Cover_Type", ax = ax2)
ax2.get_yaxis().set_visible(False)

I cannot see any strong relationship between them.

In [None]:
features = continuous.columns[:-1]                                  # Exclude target
complete = pd.concat([continuous[features], categories], axis=1)    # Concat continuous and categorical datasets (target comes from categories)
complete_s =  complete.sample(30000, random_state = 1)              # Take sample to speed up viz

i = 0
fig = plt.figure(figsize = (30,60))

for feature in features:
    i += 1
    ax = fig.add_subplot(10, 2, i)
    sns.violinplot(data = complete_s, x= "Wilderness_Area", y = feature)
    i += 1
    ax = fig.add_subplot(10, 2, i)
    sns.boxplot(data = complete_s, x= "Soil_Type", y = feature, width = 0.2)
    ax.get_xaxis().set_visible(False)
    

Finally, let's see the relationship between the categorical variables.

In [None]:
df = complete[['Soil_Type','Wilderness_Area']]                                            # Select only cat. variables
df = pd.DataFrame(df.groupby(["Soil_Type","Wilderness_Area"], as_index=False).size())     # Count the records of each combination
matrix = df.pivot(index='Soil_Type', columns='Wilderness_Area', values='size')            # Pivot into matrix
sns.heatmap(matrix)                                                                       #Viz, easier as there are many Soil types 

As seen before, most of the data is in Soil Type 1 and Wilderness Areas 1 and 3. I can't see any relevant relationship in this matrix.
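The same count matrix can also be built in one step with `pd.crosstab`. This is a sketch on toy data with hypothetical category labels. The `normalize="index"` option turns each row into shares, which can make a relationship easier to spot when a few categories dominate the counts.

```python
import pandas as pd

# Toy data with hypothetical category labels
demo = pd.DataFrame({
    "Soil_Type": ["S1", "S1", "S2", "S2", "S2", "S3"],
    "Wilderness_Area": ["W1", "W3", "W1", "W1", "W3", "W3"],
})

# Rows: soil type, columns: wilderness area, cells: number of records
counts = pd.crosstab(demo["Soil_Type"], demo["Wilderness_Area"])

# normalize="index" turns each row into shares that sum to 1
shares = pd.crosstab(demo["Soil_Type"], demo["Wilderness_Area"], normalize="index")

print(counts)
print(shares.round(2))
```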

# 2. Data Preprocessing
Check for errors in the dataset and handle them to get a clean dataset.

## 2.1 Missing Values

First I am going to check for nulls or missing values. If there are any, I will decide whether to remove the record or fill the gap:

In [None]:
train.isna().sum()       #Count missing

Luckily the dataset is clean! There are no missing values.
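Had there been gaps, the two options mentioned above (removing the record or filling the gap) could be sketched as follows. The toy values are hypothetical, and the column median is just one possible fill strategy.

```python
import numpy as np
import pandas as pd

# Toy frame with gaps (hypothetical values)
demo = pd.DataFrame({
    "Elevation": [2500.0, np.nan, 3100.0, 2900.0],
    "Slope": [10.0, 12.0, np.nan, 8.0],
})

print(demo.isna().sum())                 # missing values per column

dropped = demo.dropna()                  # option 1: remove incomplete records
filled = demo.fillna(demo.median())      # option 2: fill gaps with the column median
print(len(dropped), filled.isna().sum().sum())
```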

## 2.2 Duplicates
Now I am going to check for duplicate rows:

In [None]:
train[train.duplicated(keep=False)]

Again, no duplicate rows. We can get to the final steps safely.

## 2.3 Train Test Split
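Before splitting, I bring the out-of-range values back into their valid ranges: Aspect is an angle, so it is wrapped into 0-359, and the Hillshade features are clipped to 0-255. Then I split my data into train/test with 1/3 of the data as test.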

In [None]:
for data in (train, test):
    # Aspect is an angle: wrap values outside 0-359 back into range
    # (.loc avoids the chained assignment, which may not modify the DataFrame)
    data.loc[data["Aspect"] < 0, "Aspect"] += 360
    data.loc[data["Aspect"] > 359, "Aspect"] -= 360

    # Hillshade is an index from 0 to 255: clip anything outside that range
    for col in ["Hillshade_9am", "Hillshade_Noon", "Hillshade_3pm"]:
        data[col] = data[col].clip(lower = 0, upper = 255)

In [None]:
from sklearn.model_selection import train_test_split

x = train.iloc[:, :-1]        # Drop the target column (the Id column is kept as a feature)
y = train["Cover_Type"]

X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.33, random_state=1)

x.head()                    # Check that everything is OK


# 3. Modeling
First, I am going to create a baseline model. Then I am going to try multiple models with some basic parameters and compare their scores. Finally, I will invest some time in fine-tuning the best models to get the most out of them and select the best one.
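The plan above can be sketched as a small comparison loop. This is only an illustration on a synthetic 3-class problem from `make_classification`; the `candidates` dictionary and the variable names are my own, and the scores say nothing about the real data.

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

# Synthetic 3-class problem standing in for the real data
X_demo, y_demo = make_classification(
    n_samples=600, n_features=8, n_informative=5, n_classes=3, random_state=1
)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.33, random_state=1)

candidates = {
    "baseline": DummyClassifier(strategy="most_frequent"),
    "naive_bayes": GaussianNB(),
    "decision_tree": DecisionTreeClassifier(max_depth=10, random_state=1),
}

scores = {}
for name, model in candidates.items():
    model.fit(X_tr, y_tr)
    scores[name] = model.score(X_te, y_te)   # accuracy on the held-out part

# Print the candidates from best to worst
for name, acc in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:15s} {acc:.3f}")
```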
## 3.1 Baseline
I am going to start with a basic model that can serve as a baseline. As the dataset is highly imbalanced, I will use a dummy classifier that just predicts the most frequent class. This will probably give me an accuracy of just over 50%.

In [None]:
from sklearn.dummy import DummyClassifier

baseline = DummyClassifier(strategy = "most_frequent")
baseline.fit(X_train,y_train )
baseline.score(X_test,y_test )

Exactly what we expected. Now we have a score to beat!

## 3.2 Naive Bayes - Gaussian

In [None]:
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import cross_val_score

nb = GaussianNB()
nb.fit(X_train,y_train )

train_score = nb.score(X_train,y_train) *100
test_score = nb.score(X_test,y_test) *100      # Score the fitted model on the held-out split

print(f"Train accuracy: {train_score} %")
print(f"Test accuracy: {test_score} %")

## 3.3 Logistic Regression

In [None]:
from sklearn.linear_model import LogisticRegression

lr = LogisticRegression(random_state=1, solver='liblinear')
lr.fit(X_train,y_train )

train_score = lr.score(X_train,y_train) *100
test_score = lr.score(X_test,y_test) *100      # Score the fitted model on the held-out split

print(f"Train accuracy: {train_score} %")
print(f"Test accuracy: {test_score} %")

## 3.4 Decision Tree

In [None]:
from sklearn.tree import DecisionTreeClassifier

dtc = DecisionTreeClassifier(random_state=1, max_depth=10)
dtc.fit(X_train,y_train )

train_score = dtc.score(X_train,y_train) *100
test_score = dtc.score(X_test,y_test) *100     # Score the fitted model on the held-out split

print(f"Train accuracy: {train_score} %")
print(f"Test accuracy: {test_score} %")

## 3.5 XGBoost

In [None]:
# XGBoost expects class labels starting at 0, so shift Cover_Type down by 1
y_train_xgb = y_train - 1
y_test_xgb = y_test - 1
y_train_xgb.value_counts()

In [None]:
from xgboost import XGBClassifier

tree_method = "gpu_hist" # "hist" (CPU) or "gpu_hist" (GPU); XGBoost >= 2.0 replaces "gpu_hist" with tree_method="hist" plus device="cuda"

xgb = XGBClassifier(objective='multi:softmax', n_estimators=100,seed=123, tree_method = tree_method,use_label_encoder=False)
xgb.fit(X_train,y_train_xgb )

train_score = xgb.score(X_train,y_train_xgb) *100
test_score = xgb.score(X_test,y_test_xgb) *100     # Score the fitted model on the held-out split

print(f"Train accuracy: {train_score} %")
print(f"Test accuracy: {test_score} %")

In [None]:
from sklearn.metrics import confusion_matrix

y_pred = xgb.predict(X_test)

cm = confusion_matrix(y_test_xgb, y_pred)
plt.figure(figsize=(12,10))
sns.heatmap(cm/np.sum(cm), annot=True, fmt='.2%')

In [None]:
from sklearn.model_selection import train_test_split

# Smaller, more balanced sample for the hyperparameter search: at most 10,000 records per class
#train_s = train.sample(50000, random_state = 1)
train_s = pd.concat([g.sample(min(len(g), 10000), random_state = 1) for _, g in train.groupby("Cover_Type")])

x = train_s.iloc[:, :-1]        # Drop the target column
y = train_s["Cover_Type"]

X_train_s, X_test_s, y_train_s, y_test_s = train_test_split(x, y, test_size=0.33, random_state=1)

y_train_xgb = y_train_s - 1     # XGBoost expects class labels starting at 0
y_test_xgb = y_test_s - 1

In [None]:
y_train_xgb.value_counts()

In [None]:
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import RandomizedSearchCV
from sklearn.metrics import accuracy_score
from xgboost import XGBClassifier

tree_method = "gpu_hist" #hist or gpu_hist

#score = {"Accuracy": "accuracy"}
clf_xgb = XGBClassifier(objective='multi:softmax', n_estimators=100,seed=123, tree_method = tree_method,use_label_encoder=False)

parameter = {'n_estimators': [10,100,300,500, 800, 1000, 5000]}
search = GridSearchCV(clf_xgb, parameter, n_jobs=-1, return_train_score = True)
search.fit(X_train_s, y_train_xgb)
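print(search.score(X_test_s, y_test_xgb))   # Accuracy of the refitted best model on the held-out sample

print(search.best_score_)                   # Mean cross-validated accuracy of the best candidate
print(search.best_estimator_)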

In [None]:
results = pd.DataFrame(search.cv_results_)
results.T

In [None]:
clf_xgb = XGBClassifier(objective='multi:softmax', n_estimators = 200, seed = 123, tree_method = tree_method, use_label_encoder=False)

parameter = {'max_depth': [4,6,8,10], 'subsample':[0.5, 0.7, 0.9], 'colsample_bytree': [0.5, 0.7, 0.9]}
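# No AUC scorer is configured, so candidates are ranked by the default accuracy score; refit=True retrains the best one
search = RandomizedSearchCV(clf_xgb, parameter, refit = True, cv = 5, n_jobs=-1, random_state = 1)
search.fit(X_train_s, y_train_xgb)
print(search.score(X_test_s, y_test_xgb))   # Accuracy of the refitted best model on the held-out sample

print(search.best_score_)
print(search.best_estimator_)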

In [None]:
results = pd.DataFrame(search.cv_results_)
results.T

In [None]:
# Back to the full train/test split for the final model (labels shifted to start at 0)
y_train_xgb = y_train - 1
y_test_xgb = y_test - 1
y_train_xgb.value_counts()

In [None]:
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import RandomizedSearchCV
from sklearn.metrics import accuracy_score
from xgboost import XGBClassifier

tree_method = "gpu_hist" #hist or gpu_hist

# Final model: retrain on the full training split with the chosen hyperparameters
clf_xgb = XGBClassifier(objective='multi:softmax', n_estimators = 200, seed = 123, tree_method = tree_method, use_label_encoder=False, subsample = 0.9, max_depth = 8, colsample_bytree = 0.9 )
clf_xgb.fit(X_train,y_train_xgb)

print(f"Train accuracy: {clf_xgb.score(X_train,y_train_xgb)}")
print(f"Test accuracy: {clf_xgb.score(X_test, y_test_xgb)}")

In [None]:
predictions = clf_xgb.predict(test) + 1       # Make predictions and shift the labels back to the original Cover_Type values
submission = pd.DataFrame(test["Id"])     # Create submission file with Ids
submission["Cover_Type"] = predictions    # Append predictions
submission.head()                         # Check

In [None]:
submission.to_csv('submission.csv', index=False)