### Exploratory Analysis and XGBoost Classifier

In this notebook we run a simple EDA that leads to a rule-based submission --> 0.87

Then a simple XGBoost classifier gets --> 0.91.

NOTE: You can go further just by adjusting the parameters.

### --> Did you like this notebook?  **PLEASE UPVOTE**


In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt
import xgboost as xgb


# from sklearn.cluster import KMeans
# from sklearn.cluster import MiniBatchKMeans
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

import os

files = [ os.path.join(dirname, filename) for dirname, _, filenames in os.walk('/kaggle/input') for filename in filenames   ]         
files

In [None]:
data = pd.read_csv(files[1])
data.set_index('Id', inplace = True )
y =  data.iloc[:,-1:]

Using a histogram, we can see that Cover Types 2 and 1 are the most frequent in the dataset.

In [None]:
plt.hist(y, bins=np.arange(0.5, 8.5))  # one bin per Cover_Type (1 to 7), centered on the integer labels
plt.title("Distribution of Cover_Types in train-set")
plt.xlabel("Cover Type")
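
The same class frequencies can also be read as numbers rather than from the chart. This is a minimal sketch using a small made-up label list; the helper name `class_frequencies` is hypothetical and not part of the original notebook.

```python
import pandas as pd

def class_frequencies(labels):
    """Return the count and share of each class label, sorted by label."""
    counts = pd.Series(labels).value_counts().sort_index()
    shares = counts / counts.sum()
    return pd.DataFrame({"count": counts, "share": shares})

# Toy labels for illustration only (not the real dataset).
toy_labels = [1, 2, 2, 3]
toy_frequencies = class_frequencies(toy_labels)
print(toy_frequencies)
```

In the notebook this would be called on a 1-D column such as `data['Cover_Type']`.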

Then we split the features into 3 groups:
- Geographical (first 10 columns)
- Wilderness Area (4 columns in the middle)
- Soil Type (last 40 columns)


In [None]:
# cols = data.iloc[:,:14].columns
cols_geo  = data.iloc[:,:10].columns     # Columns with Geographic information
cols_wild = data.iloc[:,10:14].columns  # Wilderness columns
cols_soil = data.iloc[:,-41:-1].columns   # Soil Types columns
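
Positional slicing depends on the column order staying fixed. As a hedged alternative, the groups could be selected by name. The sketch below assumes the columns are prefixed `Wilderness_Area` and `Soil_Type` (an assumption about the file, not confirmed above), and the helper name `split_feature_groups` is hypothetical.

```python
import pandas as pd

def split_feature_groups(df, target="Cover_Type",
                         wild_prefix="Wilderness_Area", soil_prefix="Soil_Type"):
    """Split feature columns into geographic, wilderness and soil groups by name."""
    features = [c for c in df.columns if c != target]
    wild = [c for c in features if c.startswith(wild_prefix)]
    soil = [c for c in features if c.startswith(soil_prefix)]
    # Everything that is neither wilderness nor soil is treated as geographic.
    geo = [c for c in features if c not in wild and c not in soil]
    return geo, wild, soil

# Toy frame with assumed column names, for illustration only.
toy = pd.DataFrame(columns=["Elevation", "Aspect", "Wilderness_Area1",
                            "Soil_Type1", "Soil_Type2", "Cover_Type"])
geo_cols, wild_cols, soil_cols = split_feature_groups(toy)
```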

## 1. Geographic features

First, we check which geographic features have an impact on the "Cover Type".

When a histogram is plotted per Cover_Type (the target), we can see that the 'Elevation' distribution is clearly different between classes.

In [None]:
rows = 2
cols = len(cols_geo)//rows

fig, ax = plt.subplots(rows,cols, figsize=(20,8))

for cover in [1,2,3]:
    for item in range(len(cols_geo)):
        if cover ==3:
            ax[item//cols, item%cols].hist(data[data.Cover_Type.isin(range(3,8))][cols_geo[item]], 
                                           alpha=0.5, 
                                           label='Cover3-7',  # this branch pools Cover Types 3 to 7
                                          bins = 100)
        else:
            ax[item//cols, item%cols].hist(data[data.Cover_Type.isin([cover])][cols_geo[item]], 
                                           alpha=0.5, 
                                           label='Cover'+str(cover),
                                          bins=100)
        ax[item//cols, item%cols].set_title(cols_geo[item])
        
ax[0,0].legend()


In [None]:
fig, ax = plt.subplots(figsize=(10,5))
for item in range(1,8):
    ax.hist(data[data.Cover_Type.isin([item])][cols_geo[0]], alpha=0.5, label='Cover'+str(item), bins= 100)
    ax.set_title(cols_geo[0])
ax.legend()

From this chart, a simple prediction rule can be read off:
- Elevation < 2500 --> Cover Type = 3
- 2500 < Elevation < 3000 --> Cover Type = 2
- Elevation > 3000 --> Cover Type = 1

### --> This submission scores 0.8724 on the public leaderboard.
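
The elevation thresholds above can be sketched as a vectorized rule. This is an illustration rather than the exact code behind the submission. The function name `predict_from_elevation` is hypothetical, and the handling of the boundary values (exactly 2500 or 3000, which the rule leaves open) is an assumption.

```python
import numpy as np

def predict_from_elevation(elevation):
    """Map elevation values to a Cover_Type using the thresholds read off the chart."""
    elevation = np.asarray(elevation)
    # Conditions are checked in order. Assumption: 2500 goes to type 2, 3000 to type 1.
    return np.select(
        [elevation < 2500, elevation < 3000],
        [3, 2],
        default=1,
    )

# Toy elevations for illustration only.
toy_predictions = predict_from_elevation([2000, 2700, 3200])
```

In the notebook, the input would be the `Elevation` column of the test set.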

## 2. Now we check the WildernessAreas

We joined the 4 Wilderness Area columns into one combined column. For example, if a row has [0,1,1,0] in the respective Wilderness Areas, the combined column becomes '23'.

In [None]:
import pandas as pd

data = pd.read_csv(files[1])
data.set_index('Id', inplace = True )

cols_wild = data.iloc[:,10:14].columns  # Wilderness columns

Grab a cup of coffee. The cell below takes a while to run (about 10 min).

In [None]:
# This cell takes a while

from tqdm.notebook import tqdm_notebook

data['join_Wilder'] = [""]*len(data)

for n, col in enumerate(cols_wild):
    print(col)
    data['join_Wilder'] += [ str(n+1) if data[col][i]==1 else "" for i in tqdm_notebook(range(len(data[col])) )]
data
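
The row-by-row loop is what makes this cell slow. A column-wise version should build the same combined string much faster. This is a minimal sketch on a toy frame; the helper name `join_indicator_columns` is hypothetical, and the column names in the toy frame are assumptions.

```python
import pandas as pd

def join_indicator_columns(df, columns):
    """Concatenate the 1-based positions of the indicator columns that equal 1."""
    joined = pd.Series("", index=df.index, dtype=object)
    for position, column in enumerate(columns, start=1):
        mask = df[column].eq(1)
        # Append the position only on rows where this indicator is set.
        joined = joined.where(~mask, joined + str(position))
    return joined

# Toy rows: [0,1,1,0] should become '23' and [1,0,0,0] should become '1'.
toy = pd.DataFrame({
    "Wilderness_Area1": [0, 1],
    "Wilderness_Area2": [1, 0],
    "Wilderness_Area3": [1, 0],
    "Wilderness_Area4": [0, 0],
})
toy_joined = join_indicator_columns(toy, list(toy.columns))
```

In the notebook this would be used as `data['join_Wilder'] = join_indicator_columns(data, cols_wild)`.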

From the histogram below, Wilderness Area 3 is the most common in the dataset. The second most frequent is 1.

In [None]:
fig, ax = plt.subplots(figsize=(20,7))
ax.hist(sorted(data['join_Wilder']))

In [None]:
cols = ['Cover_Type', 'join_Wilder']
wilder_gb = data.groupby(cols)['Elevation'].count()
wilder_gb

In [None]:
ax = wilder_gb.unstack(level=0).plot(kind='bar', 
#                                      subplots=True, 
                                     layout =(7,1), 
                                     figsize=(20,5), 
#                                      sharey=True,
                                     sharex=True)

In [None]:
ax = wilder_gb.unstack(level=1).plot(kind='bar', 
#                                      subplots=True, 
                                     layout =(7,1), 
                                     figsize=(20,5), 
#                                      sharey=True,
                                     sharex=True)

In [None]:
# Now without the Covers 1 and 2

wilder_gb.unstack(level=0)[[3,4,5,6,7]]

In [None]:
ax = wilder_gb.unstack(level=0)[[3,4,5,6,7]].plot(kind='bar', 
                                     layout =(7,1), 
                                     figsize=(20,5), 
                                     sharex=True)

## 3. Now we check the Soil Type

We could not find any pattern in the Soil Type columns beyond the 40 different types, so we used them as delivered.

In [None]:
data[cols_soil].sum(axis=0)
# Only Soil Types 7 and 15 have no occurrences

## 4. XGBoost - Classifier

This model uses 'Elevation' together with all the Wilderness Area and Soil Type columns.

In [None]:
def prepare_tables(file = 'train'):
    
    if file == 'train':
        data = pd.read_csv(files[1])
        data.set_index('Id', inplace = True)
        data, y = data.iloc[:,:-1], data.iloc[:,-1:]
        
    elif file =='test':
        data = pd.read_csv(files[2])
        data.set_index('Id', inplace = True)
        y = 0
        
    cols_wild = data.iloc[:,10:14].columns  # Wilderness columns
    cols_soil = data.iloc[:,-40:].columns   # Soil Types columns

    # Wilderness columns first, then Soil Type columns
    all_columns = list(cols_wild) + list(cols_soil)

    train = data[all_columns].astype(np.uint8)      # Change the type to reduce memory usage
    train['Elevation'] = data['Elevation']/1000
    
    return train, y

In [None]:
train, target = prepare_tables('train')
test , y = prepare_tables('test')

To begin, we tried a simple classifier below with fixed parameters for the model.

In [None]:
import xgboost as xgb

X_train, X_valid, y_train, y_valid = train_test_split(train, target, test_size=0.2, random_state=13)

# xg_train = xgb.DMatrix( X_train, y_train.Cover_Type,  enable_categorical = True)
xg_train = xgb.DMatrix( X_train, y_train)
xg_valid = xgb.DMatrix( X_valid, y_valid)
xg_test = xgb.DMatrix( test)
print("Matrices - Loaded\n")

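params = {
    'booster': 'gbtree',
    'max_depth': 5,
    'learning_rate': 0.05,           # 'eta' is an alias of learning_rate, so only one of them is set
    'objective': 'multi:softprob',   # 'multi:softmax',
    # 'num_class' is omitted: XGBClassifier infers the number of classes from y
}

# Unpack the dict into keyword arguments; passed positionally it is not applied as hyperparameters.
# Note: recent XGBoost releases expect labels 0..n_classes-1, so Cover_Type (1 to 7) may need shifting by one.
classifier = xgb.XGBClassifier(**params, tree_method='gpu_hist')  ## HERE YOU ACTIVATE YOUR GPU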

model =  classifier.fit(X_train, y_train)
pred_xg_geo = model.predict(X_valid)
print('Accuracy on validation = ', np.round(accuracy_score(y_valid, pred_xg_geo)*100, 2), '%')
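
Recent XGBoost releases expect class labels to be consecutive integers starting at 0, while Cover_Type runs from 1 to 7. If `fit` complains about the labels, one option is to shift them before training and shift the predictions back afterwards. This is a minimal sketch; both helper names are hypothetical.

```python
import numpy as np

def to_zero_based(labels, first_label=1):
    """Shift labels so the smallest class becomes 0 (e.g. 1..7 becomes 0..6)."""
    return np.asarray(labels).ravel() - first_label

def from_zero_based(predictions, first_label=1):
    """Undo the shift so predictions use the original label values again."""
    return np.asarray(predictions) + first_label

# Toy labels for illustration only.
toy_labels = [1, 2, 7]
shifted = to_zero_based(toy_labels)
restored = from_zero_based(shifted)
```

In the notebook this would mean fitting on `to_zero_based(y_train)` and applying `from_zero_based` to every `predict` result, including the one written to the submission file.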

### Using the model above, we got --> 0.91199 on the public leaderboard, an improvement over the "Just Elevation" model.


In [None]:
test_xg_geo = model.predict(test)

subm = pd.DataFrame( test_xg_geo, index= test.index, columns = ['Cover_Type'])
subm.to_csv('Submission.csv')

After a parameter search, the best score and its parameters can be inspected (for example through `best_params_` on a scikit-learn search object). Using those parameters, just predict on the test set and submit.
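
One way to find such a best parameter set is a small exhaustive search over a grid. The sketch below uses plain Python and a caller-supplied scoring function, so it does not stand in for any library's search API. The names `expand_grid`, `search_best` and `example_grid` are hypothetical, and the grid values are illustrative, not tuned results.

```python
from itertools import product

def expand_grid(grid):
    """Yield one dict per combination of the values listed in `grid`."""
    names = list(grid)
    for values in product(*(grid[name] for name in names)):
        yield dict(zip(names, values))

def search_best(grid, score_fn):
    """Return (best_params, best_score) for the highest score_fn(params)."""
    best_params, best_score = None, float("-inf")
    for candidate in expand_grid(grid):
        score = score_fn(candidate)
        if score > best_score:
            best_params, best_score = candidate, score
    return best_params, best_score

# Hypothetical grid; the values are illustrative only.
example_grid = {"max_depth": [3, 5, 7], "learning_rate": [0.05, 0.1]}

# In the notebook, score_fn could fit the classifier and return validation accuracy:
# def score_fn(params):
#     model = xgb.XGBClassifier(**params).fit(X_train, y_train)
#     return accuracy_score(y_valid, model.predict(X_valid))
```

Each candidate trains a full model, so the grid is best kept small on a dataset of this size.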