# Table of Contents


* [Introduction](#sec1)
* [Libraries](#sec2)
* [Loading Data/Rapid Overview](#sec3)
* [Feature Engineering/Data Cleansing](#sec4)
   - [Quick Data Transformation](#subsec1)
   - [Data Visualisation](#subsec2)
   - [Dealing With Skewness](#subsec3)
   - [Correlation between features](#subsec4)
   - [Reducing Memory Usage](#subsec5)
* [Building and Training an NN model](#sec5)


<a id="sec1"></a>
## Introduction


This month, rather than building a complete, easy-to-follow pipeline, we are going to focus on data transformation and **how handling skewness can benefit a model (or not)**.

As always, this notebook mostly uses **basic Bokeh visualisation**.

This month's data is based on a previous competition, [Forest Cover Type Prediction](https://www.kaggle.com/c/forest-cover-type-prediction/overview).

The main goal is to predict a forest cover type from multiple variables. Because this competition is close to the previous one, many insights found there also apply here, and several discussion posts can help improve a model ([Fixing features](https://www.kaggle.com/c/tabular-playground-series-dec-2021/discussion/293373)).

However, we are not going to focus on that aspect here.

<a id="sec2"></a>
## Libraries

In [None]:
import numpy as np
import pandas as pd


from bokeh.io import output_notebook

from bokeh.models import ColumnDataSource, ColorBar, LinearColorMapper, BasicTicker, HoverTool
from bokeh.layouts import row, column, grid

from bokeh.plotting import figure, show, output_file
from bokeh.palettes import Viridis256, Spectral7, Inferno
from bokeh.transform import factor_cmap

import os

import seaborn as sns
import matplotlib.pyplot as plt


import tensorflow as tf
from tensorflow import keras


import sklearn

from sklearn.model_selection import train_test_split
from sklearn.model_selection import StratifiedKFold


from scipy.stats import skew

output_notebook()

<a id="sec3"></a>
## Loading Data / Rapid Overview

___

The dataset was generated by CTGAN using a real dataset, [Forest Cover Type Prediction](https://www.kaggle.com/c/forest-cover-type-prediction/overview).

After going through the discussions on this dataset, a few things stand out:

1. Several features can be created to improve accuracy
2. Understanding the features helps


Using this information, let's build a solid classifier that achieves strong results.
___

In [None]:
files = []

for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        files.append(os.path.join(dirname, filename))
        
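# os.walk does not guarantee file order, so select the files by name instead of by position
train_path = next(f for f in files if f.endswith('train.csv'))
test_path = next(f for f in files if f.endswith('test.csv'))

raw_data = pd.read_csv(train_path).set_index('Id')

test_df = pd.read_csv(test_path).set_index('Id')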


In [None]:
print(f'Number of samples in train.csv : {len(raw_data)}\n')

print(f'Number of rows in test.csv : {len(test_df)}\n')

print(f'Number of features for both training and testing : {len(raw_data.columns)}\n ')
raw_data.head(5)

In [None]:
print(raw_data.isna().sum()[0:10])

# No missing values here! Otherwise we would have to handle them (imputing with the mean, removing rows, ...)

In [None]:
raw_data.describe().transpose()

<a id="sec4"></a>
## Feature Engineering / Data Cleansing
___
<a id="subsec1"></a>
### Quick Data Transformation


**Some features are *static*: their values are always the same:**

**-`Soil_Type7`**

**-`Soil_Type15`**

**Thus we can safely remove them.**
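
One way to spot such static columns is to count the distinct values per column. The snippet below is a minimal sketch on a toy frame; the column names `Soil_TypeA` and `Soil_TypeB` are made up for illustration and are not part of the dataset.

```python
import pandas as pd

# Toy frame: `Soil_TypeA` is constant, the other two columns vary.
toy = pd.DataFrame({
    "Soil_TypeA": [0, 0, 0, 0],
    "Soil_TypeB": [0, 1, 0, 1],
    "Elevation": [2500, 2710, 3002, 2890],
})

# A column is static when it holds a single distinct value.
constant_cols = [col for col in toy.columns if toy[col].nunique() == 1]
print(constant_cols)
```

The same list comprehension could be run on the training frame to confirm which columns carry no information.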


**Moreover, a few features go *beyond* their expected range.**

Thanks to the information given here: [fixing features](https://www.kaggle.com/c/tabular-playground-series-dec-2021/discussion/293373)


We know that:

1. **The following variables should have values in $[0, 255]$:**

    - `Hillshade_Noon` 
    - `Hillshade_9am` 
    - `Hillshade_3pm`

2. **`Aspect` is an angle, so its values must lie in $[0, 360)$.**
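
The two fixes behave differently: clipping saturates out-of-range values at the bounds, while the modulo wraps an angle around the circle. A small sketch on made-up values (not taken from the dataset):

```python
import numpy as np

hillshade = np.array([-12, 0, 130, 255, 270])
aspect = np.array([-30, 0, 90, 360, 400])

# Out-of-range hillshade values are pushed back to the nearest bound.
clipped = np.clip(hillshade, 0, 255)

# Angles are wrapped: -30 degrees and 330 degrees point in the same direction.
wrapped = np.mod(aspect, 360)

print(clipped)
print(wrapped)
```

Wrapping keeps the direction information intact, which clipping an angle to the range bounds would not.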


In [None]:
clean_df = raw_data.copy()

del raw_data


clean_df.drop(columns = ['Soil_Type7', 'Soil_Type15'], inplace = True)

test_df.drop(columns = ['Soil_Type7', 'Soil_Type15'], inplace = True)


train_target = clean_df.pop('Cover_Type')

clean_df['Hillshade_9am'] = np.clip(clean_df['Hillshade_9am'].values, 0,255)
test_df['Hillshade_9am'] = np.clip(test_df['Hillshade_9am'].values, 0,255)

clean_df['Hillshade_Noon'] = np.clip(clean_df['Hillshade_Noon'].values, 0,255)
test_df['Hillshade_Noon'] = np.clip(test_df['Hillshade_Noon'].values, 0,255)


clean_df['Hillshade_3pm'] = np.clip(clean_df['Hillshade_3pm'].values, 0,255)
test_df['Hillshade_3pm'] = np.clip(test_df['Hillshade_3pm'].values, 0,255)





clean_df['Aspect'] = np.mod(clean_df['Aspect'].values, 360)
test_df['Aspect'] = np.mod(test_df['Aspect'].values, 360)



**Thanks to a [discussion](https://www.kaggle.com/c/forest-cover-type-prediction/discussion/10693) from the first competition, we know that adding features indicating whether values are negative or positive might help.**

In [None]:
clean_df['HighWater'] = (clean_df['Vertical_Distance_To_Hydrology']< 0).astype(np.int16)

test_df['HighWater'] = (test_df['Vertical_Distance_To_Hydrology']< 0).astype(np.int16)



clean_df['EVDtH'] = clean_df['Elevation'] -clean_df['Vertical_Distance_To_Hydrology']
test_df['EVDtH'] = test_df['Elevation'] -test_df['Vertical_Distance_To_Hydrology']


#clean_df['EHDtH'] = clean_df['Elevation'] -clean_df['Horizontal_Distance_To_Hydrology']*0.2
#test_df['EHDtH'] = test_df['Elevation'] -test_df['Horizontal_Distance_To_Hydrology']*0.2

<a id="subsec2"></a>
### Data Analysis (Visualisation)

___
We clearly have two main types of features :

- ***Numerical***  : Elevation, Aspect , ...
- ***Categorical*** : Soil_TypeX, Wilderness_AreaX


First, let's look closely at the categorical features and at how balanced they are.

___



In [None]:
forest_type = ['Spruce/Fir', 'Lodgepole Pine', 'Ponderosa Pine', 'CottonWood/willow', 'Aspen', 'Douglas-fir', 'Krummholz']

class_ratio = train_target.value_counts().sort_index().to_list()



source = ColumnDataSource(data = dict(x_values = forest_type, y_values = class_ratio, color = Spectral7))

TOOLTIPS = [
    ("Type of forest", "@x_values"),
    ('value', "@y_values")
]



p = figure(x_range = forest_type,
           y_axis_type="log",
           y_range = [0.1,10**7],
           width = 800,
           height = 600,
           title = 'Target feature proportion by class', tooltips = TOOLTIPS)

p.vbar(x = 'x_values',
       top = 'y_values',
       source = source, 
       alpha = 0.8,
       line_color = 'white',
       color = 'color',
       bottom=0.1,
      width = 0.9)

p.xgrid.grid_line_color = None
p.background_fill_color = '#fafafa'

show(p)

In [None]:
number_sample = len(train_target)

for idx,forest in enumerate(forest_type):
    
    print(f'Class {idx+1} corresponding to {forest} forest represents {class_ratio[idx]/number_sample:.5%} of the total dataset')

 ___
 **Classes are clearly imbalanced**: 
 
 - ***Aspen type* has only one example! Even if we keep it, our model won't be able to learn to predict it properly.**

 - **Classes 1 and 2 are over-represented and overshadow the other classes.**
 
 We must keep the same class ratios when splitting the data into training and validation sets!
 
 Removing Class 5 might also be an option.
 
 ___
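
To illustrate what keeping the same ratio means, here is a minimal sketch of a stratified hold-out split done with pandas only. The toy class counts are invented for the example; `StratifiedKFold`, used further down in this notebook, applies the same idea to every fold.

```python
import pandas as pd

# Toy imbalanced target: 60% / 35% / 5%.
toy = pd.DataFrame({"target": [1] * 600 + [2] * 350 + [3] * 50})
toy["feature"] = range(len(toy))

# Sample 20% of the rows *within each class* for validation.
val = toy.groupby("target").sample(frac=0.2, random_state=0)
train = toy.drop(val.index)

full_ratio = toy["target"].value_counts(normalize=True).sort_index()
val_ratio = val["target"].value_counts(normalize=True).sort_index()
print(pd.DataFrame({"full": full_ratio, "validation": val_ratio}))
```

A purely random split could easily leave the rarest class out of the validation set; sampling per class avoids that.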

In [None]:
cat_features = [ 'Wilderness_Area1',
       'Wilderness_Area2', 'Wilderness_Area3', 'Wilderness_Area4',
       'Soil_Type1', 'Soil_Type2', 'Soil_Type3', 'Soil_Type4', 'Soil_Type5',
       'Soil_Type6', 'Soil_Type8', 'Soil_Type9', 'Soil_Type10',
       'Soil_Type11', 'Soil_Type12', 'Soil_Type13', 'Soil_Type14',
       'Soil_Type16', 'Soil_Type17', 'Soil_Type18',
       'Soil_Type19', 'Soil_Type20', 'Soil_Type21', 'Soil_Type22',
       'Soil_Type23', 'Soil_Type24', 'Soil_Type25', 'Soil_Type26',
       'Soil_Type27', 'Soil_Type28', 'Soil_Type29', 'Soil_Type30',
       'Soil_Type31', 'Soil_Type32', 'Soil_Type33', 'Soil_Type34',
       'Soil_Type35', 'Soil_Type36', 'Soil_Type37', 'Soil_Type38',
       'Soil_Type39', 'Soil_Type40','HighWater']



number_cat_features = len(cat_features)

figures = []

palette = [Viridis256[63], Viridis256[191]]

for num, cat in enumerate(cat_features):
        
    cat_serie = clean_df[cat].value_counts()
    
    source = ColumnDataSource(data = 
                              dict(x_values = cat_serie.index.astype('str').to_list(),
                                   y_values = cat_serie.to_list()))
    
    p = figure(x_range = source.data['x_values'],
               width = 200, height = 200,
               title = 'Proportion of '+cat,
               toolbar_location=None,
               tools="")
    
    p.vbar(x = 'x_values',
           top ='y_values',
           source = source,
           line_color='white',
           fill_color=factor_cmap('x_values', palette=palette, factors = source.data['x_values']),
           width = 0.9)
    
    p.title.text_font_size = '8pt'
    p.xgrid.grid_line_color = None
    p.yaxis.visible = False
    
    figures.append(p)
    

show(grid(figures, ncols = 6))
    
    

---
Most categorical features are imbalanced, so these features are sparse. 

By analysing their impact on the target, we might be able to remove a few of them.

**But before doing so, let's explore Numerical and Ordinal features**

---

In [None]:
num_features = ['Elevation', 'Aspect', 'Slope', 'Horizontal_Distance_To_Hydrology',
                'Vertical_Distance_To_Hydrology', 'Horizontal_Distance_To_Roadways',
                'Hillshade_9am', 'Hillshade_Noon', 'Hillshade_3pm',
                'Horizontal_Distance_To_Fire_Points']


def plot_violin(dataframe, features, sample_size):
    
    distrib_df = dataframe[features].sample(sample_size).melt(var_name = 'column_name', value_name = 'values')
    
    
    plt.figure(figsize = (20,8))
    
    sns.violinplot(x = 'column_name', y = 'values', data = distrib_df)
    
    plt.title('Distribution of numerical features without normalisation')
    plt.xlabel(" ")
    plt.xticks(rotation=45)
    plt.show()
    
plot_violin(clean_df,num_features, 100000)
    
    

In [None]:
def make_histogram(dataframe, features):
    
    figures = []
    
    for feature in features : 
    
        hist_val, edges_val = np.histogram(dataframe[feature], density=True, bins=100)

        p = figure(width = 500, height = 300, title = 'Histogram of raw '+ feature, tools='', background_fill_color="#fafafa")

        p.quad(top=hist_val, bottom=0, left=edges_val[:-1], right=edges_val[1:],
               fill_color="navy", line_color="white", alpha=0.5)
        p.y_range.start = 0
        p.xaxis.axis_label = 'x'
        p.yaxis.axis_label = 'Pr(x)'
        p.grid.grid_line_color="white"
        
        figures.append(p)
        
    return figures

histos = make_histogram(clean_df,num_features)

show(grid(histos, ncols = 3))
    
    

<a id="subsec3"></a>
### Dealing with Skewness

---
**After analysing each feature's histogram, some features clearly have skewed distributions.**

- *Negative skew*, for example `Hillshade_Noon`

- *Positive skew*, for example `Horizontal_Distance_To_Roadways`

**A solution would be to apply *transformations* in order to obtain a more bell-shaped (Gaussian-like) distribution.**

- Why?


- Distributions closer to the normal distribution might help the model learn relationships between variables (I think?)

---

In [None]:
print('Training data ----------- \n')

for feat in num_features : 
    print('Skewness of the ' +feat+ f' distribution is : {skew(clean_df[feat])}')
    

print('Testing data ----------- \n')

for feat in num_features : 
    print('Skewness of the ' +feat+ f' distribution is : {skew(test_df[feat])}')


---
**First, let's focus on `Hillshade_Noon` and `Hillshade_9am`.** 

**Other variables have negative values, and most transformations do not handle this type of data.**

**Features whose skewness lies in $[-1, 1]$ won't be treated, because such skewness is not significant.**

*(Also because I still have everything to learn in terms of statistics)*


They both have a negative skew, so two transformations are possible to bring them closer to a normal distribution:

- $\sqrt{\max(x+1) - x}$

- $\log(\max(x+1) - x + eps)$


After some analysis, the square root provides better results (should I quickly add a function to show it?)

---
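
As a rough illustration of how the two candidate transformations can be compared, the sketch below applies both to a synthetic left-skewed sample and prints the resulting skewness. The gamma-based sample is an assumption standing in for the real Hillshade columns, and `sample_skewness` is a small helper written for this example that follows the default (biased) definition used by `scipy.stats.skew`.

```python
import numpy as np

def sample_skewness(values):
    """Biased Fisher-Pearson skewness, the default of scipy.stats.skew."""
    centered = np.asarray(values, dtype=float) - np.mean(values)
    return np.mean(centered**3) / np.mean(centered**2) ** 1.5

rng = np.random.default_rng(0)
# Synthetic left-skewed feature capped at 255 (a stand-in, not the real data).
x = np.clip(255 - rng.gamma(shape=2.0, scale=15.0, size=50_000), 0, 255)

# Reflect around the maximum: the result is always >= 1, so sqrt and log are defined.
reflected = (x + 1).max() - x
candidates = {"raw": x, "sqrt": np.sqrt(reflected), "log": np.log(reflected)}

for name, values in candidates.items():
    print(f"{name:>4}: skewness = {sample_skewness(values):+.3f}")
```

Which transformation ends up closest to zero depends on the feature, so the same loop would have to be run on the actual columns before drawing a conclusion.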

In [None]:

def handle_neg_skew(dataframe, feature):
    
    find_max = (dataframe[feature]+1).max()
        

    return  np.sqrt(find_max - dataframe.pop(feature))




clean_df['Sqrt_Hillshade_Noon'] = handle_neg_skew(clean_df,'Hillshade_Noon')
test_df['Sqrt_Hillshade_Noon'] = handle_neg_skew(test_df,'Hillshade_Noon')



clean_df['Sqrt_Hillshade_9am'] = handle_neg_skew(clean_df,'Hillshade_9am')
test_df['Sqrt_Hillshade_9am'] = handle_neg_skew(test_df,'Hillshade_9am')




print('Training data')
print('Skewness of Sqrt_Hillshade_Noon distribution is : ',skew(clean_df['Sqrt_Hillshade_Noon']))
print('Skewness of Sqrt_Hillshade_9am distribution is : ',skew(clean_df['Sqrt_Hillshade_9am']))

print('Testing data')
print('Skewness of Sqrt_Hillshade_Noon distribution is : ',skew(test_df['Sqrt_Hillshade_Noon']))
print('Skewness of Sqrt_Hillshade_9am distribution is : ',skew(test_df['Sqrt_Hillshade_9am']))




>**Normally we should use the max found in the training set for the test set; however, after checking, both splits have the same max, so it changes nothing.**
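
For reference, a leakage-free variant could learn the offset on the training column and reuse it on the test column. This is only a sketch: `fit_reflect_sqrt` is a hypothetical helper, not a function used in this notebook, and the extra clipping at $0$ is an added assumption to guard against test values above the training maximum.

```python
import numpy as np
import pandas as pd

def fit_reflect_sqrt(train_col):
    """Return a transform that reuses the offset learned on the training column."""
    offset = train_col.max() + 1

    def transform(col):
        # Clipping at 0 is an extra guard (an assumption) for unseen values
        # that exceed the training maximum.
        return np.sqrt(np.clip(offset - col, 0, None))

    return transform

train_col = pd.Series([200, 230, 254])
test_col = pd.Series([100, 255, 260])

transform = fit_reflect_sqrt(train_col)
train_out = transform(train_col)
test_out = transform(test_col)
print(train_out.tolist())
print(test_out.tolist())
```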

In [None]:
histos = make_histogram(clean_df,['Sqrt_Hillshade_9am','Sqrt_Hillshade_Noon'])

show(grid(histos, ncols = 2))

---
Dealing with skewness in other variables like `Horizontal_Distance_To_Hydrology` or `Horizontal_Distance_To_Roadways` is more complex,

because they have a **positive skew** but also **negative values**.

When dealing with positive skew, one mostly uses two types of transformation:

- $\log(x + eps)$

- $\sqrt{x}$


Neither function can handle negative values. **Thus two solutions are possible:**

* **Adding a constant** (not great when facing new values that are even more negative)

* **Removing/clipping the negative values** (not great, because we lose information on the real distribution)


Since we have a lot of data here, we will clip the negative values to $0$.

---
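
The trade-off between the two options can be seen on a few made-up numbers: a constant learned on the training data fails as soon as an unseen value falls below the training minimum, while clipping stays defined but collapses every negative value to the same point.

```python
import numpy as np

train = np.array([-50.0, 0.0, 100.0, 400.0])
unseen = np.array([-80.0, 10.0])  # contains a value below the training minimum

# Option 1: shift by a constant learned on the training data.
shift = -train.min()
with np.errstate(invalid="ignore"):
    shifted = np.sqrt(unseen + shift)  # the first entry is sqrt(-30), i.e. nan

# Option 2: clip negative values to 0 before transforming.
clipped = np.sqrt(np.clip(unseen, 0, None))

print(shifted)
print(clipped)
```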

> **After trying to fix Skewed distributions, results on this dataset are WORSE. Sometimes data transformation isn't useful !! We won't use it !!**



In [None]:

"""
clean_df['Sqrt_Horizontal_Distance_To_Hydrology'] = np.sqrt(np.clip(clean_df['Horizontal_Distance_To_Hydrology'],0,None))

histos = make_histogram(clean_df,['Horizontal_Distance_To_Hydrology','Sqrt_Horizontal_Distance_To_Hydrology'])




show(grid(histos, ncols = 2))

clean_df['Sqrt_Horizontal_Distance_To_Roadways'] = np.sqrt(np.clip(clean_df['Horizontal_Distance_To_Roadways'],0,None))

histos = make_histogram(clean_df,['Horizontal_Distance_To_Roadways','Sqrt_Horizontal_Distance_To_Roadways'])


show(grid(histos, ncols = 2))

clean_df['Sqrt_Horizontal_Distance_To_Fire_Points'] = np.sqrt(np.clip(clean_df['Horizontal_Distance_To_Fire_Points'],0,None))

histos = make_histogram(clean_df,['Horizontal_Distance_To_Fire_Points','Sqrt_Horizontal_Distance_To_Fire_Points'])


show(grid(histos, ncols = 2))

#The problem with clipping is clearly visible across histograms: the shape of the distribution is slightly modified at $0$.

clean_df.drop(columns = ['Horizontal_Distance_To_Hydrology',
                         'Horizontal_Distance_To_Roadways',
                         'Horizontal_Distance_To_Fire_Points'], inplace = True)


test_df['Sqrt_Horizontal_Distance_To_Fire_Points'] = np.sqrt(np.clip(test_df.pop('Horizontal_Distance_To_Fire_Points'),0,None))
test_df['Sqrt_Horizontal_Distance_To_Roadways'] = np.sqrt(np.clip(test_df.pop('Horizontal_Distance_To_Roadways'),0,None))
test_df['Sqrt_Horizontal_Distance_To_Hydrology'] = np.sqrt(np.clip(test_df.pop('Horizontal_Distance_To_Hydrology'),0,None))



"""
print()

<a id="subsec4"></a>
### Correlation between features
___


Sometimes the simplest way to find relationships between variables is through a correlation matrix.

This matrix can help create new features or even drop some.

___
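
Besides the full heatmap, a quick way to read a correlation matrix is to rank features by their absolute correlation with the target. The sketch below uses a synthetic frame in which the target depends only on `Elevation`; the `Noise` column and the threshold of 3000 are assumptions made for the example.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 2_000
toy = pd.DataFrame({
    "Elevation": rng.normal(2900, 300, n),
    "Noise": rng.normal(size=n),
})
# Binary toy target driven by elevation only.
toy["target"] = (toy["Elevation"] > 3000).astype(int)

# Rank features by the strength of their linear relationship with the target.
corr_with_target = (
    toy.corr()["target"].drop("target").abs().sort_values(ascending=False)
)
print(corr_with_target)
```

Pearson correlation only captures linear relationships, so a low value does not prove that a feature is useless.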

In [None]:
clean_df['target'] = train_target


In [None]:


def plot_correlation_heatmap(dataframe_corr, width = 900, height = 800):
    
    dataframe_corr.index.name = 'Features1'
    dataframe_corr.columns.name = 'Features2'

    corr_matrix = pd.DataFrame(dataframe_corr.stack(), columns = ['correlation']).reset_index()



    mapper = LinearColorMapper(palette = Viridis256,
                              low = corr_matrix['correlation'].min(),
                              high = corr_matrix['correlation'].max())


    TOOLS = "hover,save,pan, box_zoom,reset"

    p = figure(title = 'Correlation Matrix',
              x_range = corr_matrix['Features1'].drop_duplicates().to_list(),
              y_range = corr_matrix['Features2'].drop_duplicates().to_list(),
              x_axis_location = 'below',
              width = width,
              height = height,
              tools = TOOLS,
              toolbar_location = 'left')


    p.grid.grid_line_color = None
    p.axis.axis_line_color = None
    p.axis.major_tick_line_color = None
    p.axis.major_label_text_font_size = "7px"
    p.axis.major_label_standoff = 0
    p.xaxis.major_label_orientation = np.pi / 3

    p.rect(x='Features1', y="Features2", width=1, height=1,
           source=corr_matrix,
           fill_color={'field': 'correlation', 'transform': mapper},
           line_color=None)

    color_bar = ColorBar(color_mapper=mapper,
                         major_label_text_font_size="7px",
                         ticker=BasicTicker(desired_num_ticks=256),
                         border_line_color=None)
    p.add_layout(color_bar, 'right')

    return p



#corr_df = clean_df.corr()

#p = plot_correlation_heatmap(corr_df)

#show(p)

___
**Much information can be deduced from the correlation heatmap**

1. Because `Wilderness Areas` are mutually exclusive, they are highly correlated with each other
2. `Soil Type` features have more or less the same impact
3. `Elevation` is highly correlated with the target



Let's first normalise inputs, then through the process of creating models we will find the optimal solution (more or less)
___

In [None]:
soil_features = [x for x in clean_df.columns if x.startswith('Soil_Type')]

clean_df['Soil_type_count'] = clean_df[soil_features].sum(axis = 1)
test_df['Soil_type_count'] = test_df[soil_features].sum(axis = 1)


wilderness_features = [x for x in clean_df.columns if x.startswith('Wilderness_Area')]

clean_df['Wilderness_area_count'] = clean_df[wilderness_features].sum(axis=1)
test_df['Wilderness_area_count'] = test_df[wilderness_features].sum(axis=1)

> Note: A few more features could have been added, but we will keep it intuitive

<a id="subsec5"></a>
### Reducing memory usage
___

When loading a large dataset in pandas, adjusting column ***types*** can drastically reduce memory usage!
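
As a quick illustration of the gain, a binary flag stored as `int64` takes eight times more memory than the same flag stored as `int8`. The column name `Soil_TypeX` below is a placeholder.

```python
import numpy as np
import pandas as pd

# A binary column loaded with the default 64-bit integer type.
flags = pd.DataFrame({"Soil_TypeX": np.zeros(100_000, dtype=np.int64)})

before = flags.memory_usage(index=False).sum()
after = flags.astype(np.int8).memory_usage(index=False).sum()
print(f"int64: {before} bytes, int8: {after} bytes")
```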

In [None]:
def reduce_mem_usage(df, verbose=True):
    numerics = ['int16', 'int32', 'int64', 'float16', 'float32', 'float64']
    start_mem = df.memory_usage().sum() / 1024**2    
    for col in df.columns:
        col_type = df[col].dtypes
        if col_type in numerics:
            c_min = df[col].min()
            c_max = df[col].max()
            if str(col_type)[:3] == 'int':
                if c_min > np.iinfo(np.int8).min and c_max < np.iinfo(np.int8).max:
                    df[col] = df[col].astype(np.int8)
                elif c_min > np.iinfo(np.int16).min and c_max < np.iinfo(np.int16).max:
                    df[col] = df[col].astype(np.int16)
                elif c_min > np.iinfo(np.int32).min and c_max < np.iinfo(np.int32).max:
                    df[col] = df[col].astype(np.int32)
                elif c_min > np.iinfo(np.int64).min and c_max < np.iinfo(np.int64).max:
                    df[col] = df[col].astype(np.int64)  
            else:
                if c_min > np.finfo(np.float32).min and c_max < np.finfo(np.float32).max:
                    df[col] = df[col].astype(np.float32)
                else:
                    df[col] = df[col].astype(np.float64)    
    end_mem = df.memory_usage().sum() / 1024**2
    if verbose: print('Mem. usage decreased to {:5.2f} Mb ({:.1f}% reduction)'.format(end_mem, 100 * (start_mem - end_mem) / start_mem))
    return df





clean_df = reduce_mem_usage(clean_df)

test_df = reduce_mem_usage(test_df)

In [None]:
new_num_features = ['Elevation', 'Aspect', 'Slope',
                    'Horizontal_Distance_To_Hydrology',
                    'Vertical_Distance_To_Hydrology',
                    'Horizontal_Distance_To_Roadways',
                    'Sqrt_Hillshade_9am', 'Sqrt_Hillshade_Noon', 'Hillshade_3pm',
                    'Horizontal_Distance_To_Fire_Points', 'EVDtH']


mean_val = clean_df[new_num_features].mean()
standard_dev_val = clean_df[new_num_features].std()   


clean_df[new_num_features] = (clean_df[new_num_features] - mean_val )/ standard_dev_val



test_df[new_num_features] = (test_df[new_num_features] - mean_val )/ standard_dev_val  


<a id="sec5"></a>
## Building and Training an NN model

___

**Many approaches might work. One can try, for example:**

1. Normalising the one-hot encoded features (so that *mean* and *standard deviation* are the same across all variables)
2. Modifying more feature distributions 


> But first let's remove class 5 from our training dataset

In [None]:
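# .copy() gives an independent frame, so the target column can be reassigned safely
clean_df = clean_df.loc[clean_df['target'] != 5].copy()

mapping = {1: 0,
           2: 1,
           3: 2,
           4: 3,
           6: 4,
           7: 5}


# Re-index the remaining classes to 0..5, as expected by SparseCategoricalCrossentropy
clean_df['target'] = clean_df['target'].map(mapping)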


### Class Weights
____

We have to remember that we are dealing with an imbalanced dataset, so we don't have many samples for a few classes.

So the idea is to build a classifier that heavily weights the few examples that are available. This can be done by passing Keras a weight for each class through the `class_weight` parameter of `model.fit`.


**Unfortunately, this does not always improve convergence/accuracy. On this dataset, using CLASS WEIGHTS DOES NOT IMPROVE ACCURACY.**
___
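
The weighting idea can be sketched with NumPy alone: each class receives a weight inversely proportional to its frequency, scaled so that the total weight over all samples equals the number of samples. The toy label counts are invented for the example; this is the same "balanced" heuristic that scikit-learn offers.

```python
import numpy as np

# Toy imbalanced labels: 900 / 90 / 10 samples.
y = np.array([0] * 900 + [1] * 90 + [2] * 10)

classes, counts = np.unique(y, return_counts=True)

# weight_c = n_samples / (n_classes * count_c)
weights = len(y) / (len(classes) * counts)
class_weight = dict(zip(classes.tolist(), weights.tolist()))
print(class_weight)
```

A dictionary of this shape, mapping class index to weight, is what `model.fit(..., class_weight=class_weight)` expects.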

In [None]:
my_dict = clean_df['target'].value_counts().to_dict()

class_weight = {}

total = len(clean_df)

for key, value in my_dict.items():
    class_weight[key] = (1/value) * (total/len(my_dict))
    
print(class_weight)




In [None]:
train_target = np.array(clean_df.pop('target'))

train_values = np.asarray(clean_df)


---
1. `ReduceLROnPlateau` is a callback that decreases the learning rate once the model is *stuck*, helping it settle more accurately into a local minimum

2. `EarlyStopping` stops the training when no improvement is seen after a given number of consecutive epochs
---
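
To make the interplay of the two patience values concrete, here is a simplified, framework-free simulation. It is not the Keras implementation (it ignores `min_delta`, `cooldown` and `min_lr`), and `simulate_callbacks` together with the validation accuracies are made up for illustration.

```python
def simulate_callbacks(val_acc, lr=4e-4, factor=0.5, lr_patience=2, stop_patience=4):
    """Simplified patience logic; returns (last_epoch_run, final_lr)."""
    best = float("-inf")
    lr_wait = stop_wait = 0
    for epoch, acc in enumerate(val_acc):
        if acc > best:
            # Improvement: remember it and reset both counters.
            best, lr_wait, stop_wait = acc, 0, 0
            continue
        lr_wait += 1
        stop_wait += 1
        if lr_wait >= lr_patience:
            # Plateau detected: shrink the learning rate and start counting again.
            lr *= factor
            lr_wait = 0
        if stop_wait >= stop_patience:
            # No improvement for too long: stop training.
            return epoch, lr
    return len(val_acc) - 1, lr

history = [0.90, 0.92, 0.93, 0.93, 0.93, 0.93, 0.93, 0.95]
stopped_epoch, final_lr = simulate_callbacks(history)
print(stopped_epoch, final_lr)
```

With a learning-rate patience of 2 and a stopping patience of 4, the learning rate is halved twice before training stops, so the last value of the toy history is never reached.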

In [None]:
BATCH_SIZE = 2048

EPOCHS = 30

FOLDS = 10

NUMBER_FEATURE = 56


predictions = np.zeros((1,1))

reduce_lr = keras.callbacks.ReduceLROnPlateau(monitor = 'val_accuracy',
                                              patience = 2,
                                              factor = 0.5,
                                              verbose = 0)


early_stop = keras.callbacks.EarlyStopping(monitor ='val_accuracy',
                                           patience = 4,
                                           restore_best_weights = True, 
                                           verbose = 1)

In [None]:
def make_model(input_size):
    
    model = tf.keras.Sequential([
        keras.layers.InputLayer(input_shape = (input_size,)),
        keras.layers.Dense(units = 256 , activation = 'relu'),
        
        keras.layers.BatchNormalization(),
        keras.layers.Dense(units = 128, activation = 'relu'),
        
        keras.layers.BatchNormalization(),
        keras.layers.Dense(units = 128,  activation = 'relu'),
        
        keras.layers.BatchNormalization(),
        keras.layers.Dense(units = 64, activation = 'relu'),
        keras.layers.Dense(units = 6, activation = 'softmax'),
        
        
    ])
    
    model.compile(optimizer = keras.optimizers.Adam(learning_rate = 4e-4),
                 loss =  tf.keras.losses.SparseCategoricalCrossentropy() ,
                 metrics = ['accuracy'])
    
    return model

    
    
def build_dataset(features_matrix, target):
    
    dataset = tf.data.Dataset.from_tensor_slices((features_matrix,target))
    #dataset = dataset.batch(BATCH_SIZE, drop_remainder = True).prefetch(1)
    
    return dataset

In [None]:
complete_hist = []

cross_val = StratifiedKFold(n_splits= FOLDS, random_state = 1337, shuffle = True)

for fold_idx, (train_idx, val_idx) in enumerate(cross_val.split(train_values, train_target)):
    
    X_train, X_val = train_values[train_idx], train_values[val_idx]
    y_train, y_val = train_target[train_idx], train_target[val_idx]
    
    
    
    mymodel = make_model(input_size = NUMBER_FEATURE)
    
    history = mymodel.fit(X_train,y_train,
                   batch_size = BATCH_SIZE,
                   epochs = EPOCHS,
                   validation_data = (X_val,y_val),
                   callbacks = [reduce_lr,early_stop],
                   verbose = 0)
    print(f"----Fold number {fold_idx}-----")
    #print(f"Training Loss : {history.history['loss'][-1]} , Training Accuracy : {history.history['accuracy'][-1]} ")
    print(f"Validation Loss : {history.history['val_loss'][-1]} Validation Accuracy : {history.history['val_accuracy'][-1]}")
    
    
    
    predictions = predictions + mymodel.predict(np.asarray(test_df))
    
    complete_hist.append(history)
    
    
    
#x_train, x_val, y_train, y_val = train_test_split(np.asarray(clean_df),train_target , test_size = 0.15, stratify =train_target )


#The stratify option splits the dataset while keeping the class proportions of train_target,
#so class proportions remain the same across the training and validation sets




In [None]:
def plot_metrics(history,custom_title=""):
    
    titles = ['Loss', 'Accuracy']
    metrics = ['loss', 'val_loss', 'accuracy','val_accuracy']
    
    palette = Inferno[4]
    figures = []
    
    epochs = len(history.history['loss'])
    
    for k in range(0,4,2):
        
        p = figure(width = 600, height = 400, title = custom_title)
        
        
        
        p.line(np.arange(epochs), history.history[metrics[k]], legend_label = 'training', line_width = 4, color = palette[1])
        
        p.line(np.arange(epochs), history.history[metrics[k+1]], line_width = 4, legend_label = 'validation', color = palette[2])
        
        figures.append(p)
        
    return figures

show(row(plot_metrics(complete_hist[-1], custom_title= 'Last Fold results ')))

In [None]:
pred_df = pd.DataFrame({'pred_val' :  np.argmax(predictions, axis = 1)})


mapping = {0: 1,
           1: 2,
           2: 3,
           3: 4,
           4: 6,
           5: 7}


# Map the model's class indices (0..5) back to the original Cover_Type labels
pred_df['pred_val'] = pred_df['pred_val'].map(mapping)


In [None]:
# Index on 'Id' so the CSV columns are written in the order Id, Cover_Type
submission_df = pd.DataFrame(data = {'Id' : test_df.index, 'Cover_Type' : pred_df.values.reshape(-1,)}).set_index('Id')

submission_df.to_csv('submission.csv', header=True)