<h1><font color="#113D68" size=6>TINTOlib: Converting Tidy Data into Image for 2-Dimensional Convolutional Neural Networks</font></h1>

<h1><font color="#113D68" size=5>Template Classification Machine Learning problem with a Hybrid Network (CNN+MLP)</font></h1>

<br><br>
<div style="text-align: right">
<font color="#113D68" size=3>Manuel Castillo-Cara</font><br>
<font color="#113D68" size=3>Raúl García-Castro</font><br>

</div>

---

---

<a id="indice"></a>
<h2><font color="#004D7F" size=5>Index</font></h2>

* [0. Context](#section0)
* [1. Description](#section1)
    * [1.1. Main Features](#section11)
    * [1.2. Citation](#section12)
    * [1.3. Documentation and License](#section13)
* [2. Libraries](#section2)
    * [2.1. System setup](#section21)
    * [2.2. Invoke the libraries](#section22)
* [3. Data processing](#section3)
    * [3.1. TINTOlib methods](#section31)
    * [3.2. Read the dataset](#section32)
    * [3.3. Generate images](#section33)
    * [3.4. Read images](#section34)
    * [3.5. Mix images and tidy data](#section35)
* [4. Pre-modelling phase](#section4)
    * [4.1. Data curation](#section41)
    * [4.2. One-hot encoding](#section42)
* [5. Modelling hybrid network](#section5)
    * [5.1. FFNN for tabular data](#section51)
    * [5.2. CNN for TINTOlib images](#section52)
    * [5.3. Concatenate branches](#section53)
    * [5.4. Metrics](#section54)
    * [5.5. Compile and fit](#section55)
* [6. Results](#section6)
    * [6.1. Train/Validation representation](#section61)
    * [6.2. Validation/Test evaluation](#section62)

---
<a id="section0"></a>
# <font color="#004D7F" size=6> 0. Context</font>

This tutorial shows how to read the images created by TINTOlib and pass them, together with the original tabular data, to a simple hybrid network: a Convolutional Neural Network (CNN) for the images and a Multilayer Perceptron (MLP) for the tabular data. The images must be created with TINTOlib first. See the documentation on GitHub for how to create the images from tabular data.

Remember that CNN training can run on a GPU to reduce training time.

<div class="alert alert-block alert-info">

<i class="fa fa-info-circle" aria-hidden="true"></i>
You can see all information about TINTOlib code in [GitHub](https://github.com/oeg-upm/TINTOlib)

<div class="alert alert-block alert-info">

<i class="fa fa-info-circle" aria-hidden="true"></i>
You can see all information about TINTOlib documentation in [Read the Docs](https://tintolib.readthedocs.io/en/latest/installation.html)

---
<div style="text-align: right"> <font size=5> <a href="#indice"><i class="fa fa-arrow-circle-up" aria-hidden="true" style="color:#004D7F"></i></a></font></div>

---

<a id="section1"></a>
# <font color="#004D7F" size=6> 1. Description</font>

Growing interest in machine learning for predictive tasks has produced a large and diverse set of algorithms. However, not all of these algorithms work well on datasets in tidy data format. For this reason, novel techniques are being developed to convert tidy data into images so that Convolutional Neural Networks (CNNs) can be applied. TINTOlib converts tidy data into images through several techniques: TINTO, IGTD, REFINED, SuperTML, BarGraph, DistanceMatrix and Combination.

---
<a id="section11"></a>
# <font color="#004D7F" size=5> 1.1. Main Features</font>

- Supports all CSV data in **[Tidy Data](https://www.jstatsoft.org/article/view/v059i10)** format.
- For now, the algorithm converts tabular data for binary and multi-class classification machine learning problems.
- Input data formats:
    - **Tabular files**: The input data can be in **[CSV](https://en.wikipedia.org/wiki/Comma-separated_values)**, taking into account the **[Tidy Data](https://www.jstatsoft.org/article/view/v059i10)** format.
    - **Dataframe**: The input data can be a **[Pandas Dataframe](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html)**, taking into account the **[Tidy Data](https://www.jstatsoft.org/article/view/v059i10)** format.
    - **Tidy Data**: The **target** (variable to be predicted) should be set as the last column of the dataset. Therefore, the first columns will be the features.
    - All data must be in numerical form. TINTOlib does not accept data in string or any other non-numeric format.
- Runs on **Linux**, **Windows** and **macOS** systems.
- Compatible with **[Python](https://www.python.org/)** 3.7 or higher.
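
The input requirements above can be checked before calling TINTOlib. The following is a minimal sketch and not part of TINTOlib: the helper name `check_tidy_input` and the toy dataframe are assumptions made for illustration. It only verifies that every column is numeric and returns the name of the last column, which is the one treated as the target.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype


def check_tidy_input(df: pd.DataFrame) -> str:
    """Hypothetical helper: validate a dataframe against the rules listed above.

    Returns the name of the last column, which is treated as the target.
    """
    non_numeric = [col for col in df.columns if not is_numeric_dtype(df[col])]
    if non_numeric:
        raise ValueError(f"Non-numeric columns found: {non_numeric}")
    return df.columns[-1]


# Toy data: two numeric features followed by a numeric target in the last column.
toy = pd.DataFrame({"f1": [0.1, 0.4], "f2": [1.0, 2.0], "target": [1, 2]})
target_name = check_tidy_input(toy)
print(target_name)
```

A dataframe with a string column would raise a `ValueError` here, which is the cue to encode that column numerically first.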

---
<a id="section12"></a>
# <font color="#004D7F" size=5> 1.2. Citation</font>

**TINTOlib** is a Python library that makes **Synthetic Images** from [Tidy Data](https://www.jstatsoft.org/article/view/v059i10) (also known as **Tabular Data**).

**Citing TINTO**: If you used TINTO in your work, please cite the **[SoftwareX](https://doi.org/10.1016/j.softx.2023.101391)** paper:

```bib
@article{softwarex_TINTO,
    title = {TINTO: Converting Tidy Data into Image for Classification
            with 2-Dimensional Convolutional Neural Networks},
    journal = {SoftwareX},
    author = {Manuel Castillo-Cara and Reewos Talla-Chumpitaz and
              Raúl García-Castro and Luis Orozco-Barbosa},
    year = {2023},
    pages = {101391},
    issn = {2352-7110},
    doi = {https://doi.org/10.1016/j.softx.2023.101391}
}
```

A use case is developed in the **[INFFUS Paper](https://doi.org/10.1016/j.inffus.2022.10.011)**:

```bib
@article{inffus_TINTO,
    title = {A novel deep learning approach using blurring image
            techniques for Bluetooth-based indoor localisation},
    journal = {Information Fusion},
    author = {Reewos Talla-Chumpitaz and Manuel Castillo-Cara and
              Luis Orozco-Barbosa and Raúl García-Castro},
    volume = {91},
    pages = {173-186},
    year = {2023},
    issn = {1566-2535},
    doi = {https://doi.org/10.1016/j.inffus.2022.10.011}
}
```

---
<a id="section13"></a>
# <font color="#004D7F" size=5> 1.3. Documentation and License</font>

TINTOlib has a wide range of documentation on both GitHub and PyPI.

Moreover, TINTOlib is free and open-source software under the Apache 2.0 license.

<div class="alert alert-block alert-info">

<i class="fa fa-info-circle" aria-hidden="true"></i>
You can see all information about TINTOlib code in [GitHub](https://github.com/oeg-upm/TINTOlib)

<div class="alert alert-block alert-info">

<i class="fa fa-info-circle" aria-hidden="true"></i>
You can see all information about TINTOlib documentation in [Read the Docs](https://tintolib.readthedocs.io/en/latest/installation.html)

---
<div style="text-align: right"> <font size=5> <a href="#indice"><i class="fa fa-arrow-circle-up" aria-hidden="true" style="color:#004D7F"></i></a></font></div>

---

<a id="section2"></a>
# <font color="#004D7F" size=6> 2. Libraries</font>

---
<a id="section21"></a>
# <font color="#004D7F" size=5> 2.1. System setup</font>

Before installing the libraries you must have the `mpi4py` package installed on the native (Linux) system. This link shows how to install it: 
- Link: [`mpi4py` in Linux](https://www.geeksforgeeks.org/how-to-install-python3-mpi4py-package-on-linux/)

For example, in Linux:

```
    sudo apt-get install python3
    sudo apt install python3-pip
    sudo apt install python3-mpi4py
```

On Windows, macOS or Linux, you can alternatively install it from PyPI:
```
    pip3 install mpi4py
```

<div class="alert alert-block alert-info">
    
<i class="fa fa-info-circle" aria-hidden="true"></i>
Note that you must **restart the kernel or the system** so that it can load the libraries. 

Once `mpi4py` is installed, you can install the PyPI libraries and dependencies.

In [None]:
!pip install torchmetrics pytorch_lightning TINTOlib imblearn keras_preprocessing mpi4py

<div class="alert alert-block alert-info">
    
<i class="fa fa-info-circle" aria-hidden="true"></i>
Note that you must **restart the kernel** so that it can load the libraries. 
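
After restarting the kernel, a quick way to confirm that the environment is complete is to check which modules can be found. This is a minimal sketch using only the standard library; the list of import names is an assumption based on the imports used later in this notebook, so adjust it to your setup.

```python
import importlib.util

# Import names (not pip names) used later in this notebook; adjust as needed.
REQUIRED_MODULES = ["mpi4py", "TINTOlib", "tensorflow", "sklearn", "imblearn", "cv2"]


def missing_modules(names):
    """Return the module names that cannot be found in the current environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


print("Missing modules:", missing_modules(REQUIRED_MODULES))
```

An empty list means every module was found; any name that is printed still needs to be installed.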

---
<a id="section22"></a>
# <font color="#004D7F" size=5> 2.2. Invoke the libraries</font>

The first thing we need to do is to import the libraries.

In [None]:
import os
import pandas as pd
import numpy as np
from tqdm import tqdm
#import cv2
import gc
import matplotlib.pyplot as plt
#import openslide
#from openslide.deepzoom import DeepZoomGenerator
import tifffile as tifi
import sklearn
import tensorflow as tf
import seaborn as sns
from PIL import Image


from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score,mean_absolute_percentage_error

from keras_preprocessing.image import ImageDataGenerator
from tensorflow.keras.models import load_model

from sklearn.model_selection import train_test_split
from tensorflow.keras.applications import vgg16, vgg19, resnet50, mobilenet, inception_resnet_v2, densenet, inception_v3, xception, nasnet, ResNet152V2
from tensorflow.keras.models import Sequential, Model, load_model
from tensorflow.keras.layers import Dense, Conv2D, MaxPool2D, Flatten, Dropout, BatchNormalization, InputLayer, LayerNormalization
from tensorflow.keras.callbacks import ModelCheckpoint, ReduceLROnPlateau, EarlyStopping
from tensorflow.keras.optimizers import SGD, Adam, Adadelta, Adamax
from tensorflow.keras import layers, models, Model
from tensorflow.keras.losses import MeanAbsoluteError, MeanAbsolutePercentageError
from tensorflow.keras.layers import Input, Activation,MaxPooling2D, concatenate
from tensorflow.keras.utils import to_categorical

from imblearn.over_sampling import RandomOverSampler

#Models of TINTOlib
from TINTOlib.tinto import TINTO
from TINTOlib.supertml import SuperTML
from TINTOlib.igtd import IGTD
from TINTOlib.refined import REFINED
from TINTOlib.barGraph import BarGraph
from TINTOlib.distanceMatrix import DistanceMatrix
from TINTOlib.combination import Combination
import TINTOlib.utils

---
<div style="text-align: right"> <font size=5> <a href="#indice"><i class="fa fa-arrow-circle-up" aria-hidden="true" style="color:#004D7F"></i></a></font></div>

---

<a id="section3"></a>
# <font color="#004D7F" size=6> 3. Data processing</font>

The first thing to do is to read all the images created by TINTOlib. TINTOlib creates a folder that contains one subfolder per target class of the problem. Each image corresponds to one sample of the original dataset.

---
<a id="section31"></a>
# <font color="#004D7F" size=5> 3.1. TINTOlib methods</font>

We first instantiate the TINTOlib method we want to use for the transformation. TINTOlib offers several methods, and we must choose one of them because each method generates different images.

In addition, we establish the paths where the dataset is located and also the folder where the images will be created.

In [None]:
#Select the model and the parameters
problem_type = "supervised"
image_model = REFINED(problem= problem_type,hcIterations=5)
#image_model = TINTO(problem= problem_type,blur=True)
#image_model = IGTD(problem= problem_type)
#image_model = BarGraph(problem= problem_type)
#image_model = DistanceMatrix(problem= problem_type)
#image_model = Combination(problem= problem_type)

#Define the dataset path and the folder where the images will be saved
dataset_path = "iris.csv"
images_folder = "hybridclassification"

<div class="alert alert-block alert-info">

<i class="fa fa-info-circle" aria-hidden="true"></i>
You can see all TINTOlib methods in the [Read the Docs documentation](https://tintolib.readthedocs.io/en/latest/installation.html)

---
<a id="section32"></a>
# <font color="#004D7F" size=5> 3.2. Read the dataset</font>

We now read the dataset from the path specified above. As required by TINTOlib, the target (the variable to be predicted) must be the last column of the dataframe.

In [None]:
#Read CSV
df = pd.read_csv(dataset_path)
df.head(2)

---
<a id="section33"></a>
# <font color="#004D7F" size=5> 3.3. Generate images</font>

Now we can generate the images with the generic `generateImages()` function. We also build the path to the CSV file that TINTOlib writes in the images folder, which lists the image created for each sample.

Note that each image is created based on a row, therefore, each numerical sample of the dataset will correspond to a particular image. In other words, we will have the same number of images as samples/rows.

In [None]:
#Generate the images
image_model.generateImages(df, images_folder)
img_paths = os.path.join(images_folder,problem_type+".csv")

print(img_paths)

---
<a id="section34"></a>
# <font color="#004D7F" size=5> 3.4. Read Images</font>

Now we read the CSV file that lists the created images and prepend the images folder to each path.

In [None]:
imgs = pd.read_csv(img_paths)

#imgs["images"]= images_folder + "\\" + imgs["images"]
imgs["images"]= images_folder + "/" + imgs["images"]

---
<a id="section35"></a>
# <font color="#004D7F" size=5> 3.5. Mix images and tidy data</font>

Since we are going to use a hybrid network, i.e. a model that joins a CNN for the images and an MLP for the tabular data, we combine both sources so that all the data can be fed to the hybrid model. First, we normalize the tabular attributes.


In [None]:
# Select all the attributes to normalize
columns_to_normalize = df.columns[:-1]

# Normalize between 0 and 1
df_normalized = (df[columns_to_normalize] - df[columns_to_normalize].min()) / (df[columns_to_normalize].max() - df[columns_to_normalize].min())

# Combine the attributes and the label
df_normalized = pd.concat([df_normalized, df[df.columns[-1]]], axis=1)

df_normalized.head(2)
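
The min-max scaling above divides by `max - min`, which is zero for a constant column. A guarded version could look like the following minimal sketch; the function name `min_max_scale` and the choice to map constant columns to 0 are assumptions for illustration, not TINTOlib behaviour.

```python
import pandas as pd


def min_max_scale(features: pd.DataFrame) -> pd.DataFrame:
    """Scale each column to [0, 1]; constant columns are mapped to 0 (assumption)."""
    col_min = features.min()
    col_range = features.max() - col_min
    # Replace zero ranges by 1 so the division is defined; the numerator is 0 there.
    safe_range = col_range.replace(0, 1)
    return (features - col_min) / safe_range


toy = pd.DataFrame({"a": [1.0, 2.0, 3.0], "b": [5.0, 5.0, 5.0]})
scaled = min_max_scale(toy)
print(scaled)
```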

Combine the images and the tidy data in the same dataframe, then split it into attributes and target value.

In [None]:
combined_dataset = pd.concat([imgs,df_normalized],axis=1)

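# The target is the last column of the original dataset
target_column = df.columns[-1]
df_x = combined_dataset.drop(target_column, axis=1)
df_y = combined_dataset[target_column]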

print(df_y)
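
`pd.concat(..., axis=1)` aligns rows by index, so the image paths and the tabular rows only pair up correctly when both dataframes share the same index. The sketch below illustrates this with toy data; the file names are made up, and it assumes that the i-th listed image belongs to the i-th row of the dataset.

```python
import pandas as pd

# Toy stand-ins for `imgs` and `df_normalized`; the file names are made up.
toy_imgs = pd.DataFrame({"images": ["a/000000.png", "b/000001.png", "a/000002.png"]})
toy_tab = pd.DataFrame({"f1": [0.0, 0.5, 1.0], "class": [1, 2, 1]}, index=[10, 11, 12])

# With different indexes, concat produces NaN-padded rows instead of pairs.
misaligned = pd.concat([toy_imgs, toy_tab], axis=1)

# Resetting both indexes pairs the i-th image with the i-th tabular row.
aligned = pd.concat(
    [toy_imgs.reset_index(drop=True), toy_tab.reset_index(drop=True)], axis=1
)
print(len(misaligned), len(aligned))
```

If the dataset was filtered or shuffled after reading it, resetting the index before concatenating is a cheap safeguard.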

---
<div style="text-align: right"> <font size=5> <a href="#indice"><i class="fa fa-arrow-circle-up" aria-hidden="true" style="color:#004D7F"></i></a></font></div>

---

<a id="section4"></a>
# <font color="#004D7F" size=6> 4. Pre-modelling phase</font>

Once the data is ready, we split it and load the images into memory as arrays in order to pass them to the network.

---
<a id="section41"></a>
# <font color="#004D7F" size=5> 4.1. Data curation</font>

Note that each method generates images of a **different pixel size**. For example:
- The `TINTO` method has a parameter that specifies the size in pixels, which is 20 by default.
- Other methods, such as `Combination`, set the size automatically, and you must obtain it from the _shape_ of the images.
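
When the size is set automatically, it can be read from one of the generated images. The following is a minimal sketch using Pillow; it creates a throwaway 20x20 image so that it runs on its own, whereas in this notebook you would open one of the paths in `imgs["images"]` instead.

```python
import os
import tempfile

import numpy as np
from PIL import Image

# Create a throwaway 20x20 RGB image so the sketch is self-contained.
tmp_dir = tempfile.mkdtemp()
sample_path = os.path.join(tmp_dir, "sample.png")
Image.fromarray(np.zeros((20, 20, 3), dtype=np.uint8)).save(sample_path)

with Image.open(sample_path) as img:
    width, height = img.size  # PIL reports (width, height)

print("Image size (pixels):", width, "x", height)
```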

<div class="alert alert-block alert-info">

<i class="fa fa-info-circle" aria-hidden="true"></i>
You can see all information about TINTOlib documentation in [PyPI](https://tintolib.readthedocs.io/en/latest/installation.html)

In [None]:
pixels = 369

Split into train/validation/test sets (60%/20%/20%).

Note that the partitioning of the images is also performed, in addition to the tabular data.

In [None]:
import cv2
X_train, X_val, y_train, y_val = train_test_split(df_x, df_y, test_size = 0.40, random_state = 123,stratify=df_y)
X_val, X_test, y_val, y_test = train_test_split(X_val, y_val, test_size = 0.50, random_state = 123,stratify=y_val)

X_train_num = X_train.drop("images",axis=1)
X_val_num = X_val.drop("images",axis=1)
X_test_num = X_test.drop("images",axis=1)

"""X_train_img = np.array([cv2.imread(img) for img in X_train["images"]])
X_val_img = np.array([cv2.imread(img) for img in X_val["images"]])
X_test_img = np.array([cv2.imread(img) for img in X_test["images"]])"""

X_train_img = np.array([cv2.resize(cv2.imread(img),(pixels,pixels)) for img in X_train["images"]])
X_val_img = np.array([cv2.resize(cv2.imread(img),(pixels,pixels)) for img in X_val["images"]])
X_test_img = np.array([cv2.resize(cv2.imread(img),(pixels,pixels)) for img in X_test["images"]])

n_class = df_y.nunique()  # number of distinct target classes
attributes = len(X_train_num.columns)

print("Image shape",X_train_img[0].shape)
print("Attributes",attributes)
print("Classes",n_class)
size=X_train_img[0].shape[0]
print("Image size (pixels):", pixels)
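
The two chained calls to `train_test_split` above yield a 60%/20%/20% partition. The sketch below reproduces the same calls on toy labels to show the resulting sizes; the 150 balanced samples are an assumption chosen to mirror the size of the classic Iris data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy labels: 150 samples, 3 balanced classes.
y = np.repeat([1, 2, 3], 50)
X = np.arange(len(y)).reshape(-1, 1)

# First split: 60% train, 40% held out.
X_tr, X_rest, y_tr, y_rest = train_test_split(
    X, y, test_size=0.40, random_state=123, stratify=y
)
# Second split: the held-out 40% is halved into validation and test.
X_va, X_te, y_va, y_te = train_test_split(
    X_rest, y_rest, test_size=0.50, random_state=123, stratify=y_rest
)

print(len(y_tr), len(y_va), len(y_te))
```

Because both splits are stratified, each subset keeps the class proportions of the full dataset.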

---
<a id="section42"></a>
# <font color="#004D7F" size=5> 4.2. One-hot encoding</font>

One-hot encoding is needed **only** for multiclass machine learning problems.

In [None]:
# y-1 because the target labels start at 1, while to_categorical expects labels in [0, n_class-1]
y_train_oh =  to_categorical(y_train-1,n_class)
y_val_oh = to_categorical(y_val-1,n_class)
y_test_oh = to_categorical(y_test-1,n_class)
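
To make the encoding concrete, here is a small NumPy sketch of what one-hot encoding produces for labels that start at 1. The labels are toy values, and the result should have the same form as the matrix returned by `to_categorical(labels - 1, n_classes)`.

```python
import numpy as np

n_classes = 3
labels = np.array([1, 3, 2, 1])  # labels start at 1 in this toy example

# Row i of the identity matrix is the one-hot vector for class index i,
# so shifting the labels to start at 0 and indexing gives the encoding.
one_hot = np.eye(n_classes, dtype="float32")[labels - 1]
print(one_hot)
```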

<a id="section5"></a>
# <font color="#004D7F" size=6> 5. Modelling hybrid network</font>

Now we can start the CNN+MLP training. Before that, we define the network that will read the data.

In this example, a network with two branches is created:
- 1st branch: FFNN for tabular data
- 2nd branch: CNN for TINTOlib images

---
<a id="section51"></a>
# <font color="#004D7F" size=5> 5.1. FFNN for tabular data</font>

This is an example of a simple FFNN for tabular data. Note that we are not trying to optimize the FFNN, only to show an example of TINTOlib execution.

In [None]:
dropout = 0.5

In [None]:
filters_ffnn = [128,64,32]

ff_model = Sequential()
ff_model.add(Input(shape=(attributes,)))

for layer in filters_ffnn:
    ff_model.add(Dense(layer, activation="relu"))
    ff_model.add(BatchNormalization())
    ff_model.add(Dropout(dropout))

---
<a id="section52"></a>
# <font color="#004D7F" size=5> 5.2. CNN for TINTOlib images</font>

This is an example of a simple CNN for TINTOlib images. Note that we are not trying to optimize the CNN, only to show an example of TINTOlib execution.

In [None]:
filters_cnn =  [16,32,64,128]

cnn_model = Sequential()
cnn_model.add(Input(shape=(pixels,pixels, 3)))

for layer in filters_cnn:
    cnn_model.add(Conv2D(layer, (3, 3), padding="same"))
    cnn_model.add(Activation("relu"))
    cnn_model.add(BatchNormalization())
    cnn_model.add(MaxPooling2D(pool_size=(2, 2)))

# flatten the volume, then FC => RELU => BN => DROPOUT
cnn_model.add(Flatten())
cnn_model.add(Dense(128))
cnn_model.add(Activation("relu"))
cnn_model.add(BatchNormalization())
cnn_model.add(Dropout(dropout))
cnn_model.add(Dense(64))
cnn_model.add(Activation("relu"))
cnn_model.add(BatchNormalization())
cnn_model.add(Dropout(dropout))
# apply another FC layer, this one to match the number of nodes
# coming out of the MLP
cnn_model.add(Dense(32))
cnn_model.add(Activation("relu"))
cnn_model.add(BatchNormalization())
cnn_model.add(Dropout(dropout))
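
The image size has a strong effect on the number of parameters, because the `Flatten` output feeds the first `Dense` layer. The arithmetic can be sketched as follows, assuming the default `MaxPooling2D` behaviour (stride equal to the pool size, no padding); the function name `flattened_size` is hypothetical.

```python
def flattened_size(pixels: int, filters: list) -> int:
    """Size of the Flatten output for the Conv2D('same') + MaxPooling2D(2, 2) stack."""
    side = pixels
    for _ in filters:
        # 'same' convolutions keep the side length; 2x2 pooling floors side / 2.
        side = side // 2
    return side * side * filters[-1]


flat = flattened_size(369, [16, 32, 64, 128])
dense_params = flat * 128 + 128  # weights + biases of the first Dense(128)
print(flat, dense_params)
```

With 369-pixel images, most of the trainable parameters of the CNN branch sit in that first dense layer, so smaller images or an extra pooling block reduce the model size considerably.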


---
<a id="section53"></a>
# <font color="#004D7F" size=5> 5.3. Concatenate branches</font>

Finally, we must concatenate the output of the CNN branch with the output of the FFNN branch in a final FFNN that will give the predictions.

In [None]:
combinedInput = concatenate([ff_model.output, cnn_model.output])
x = Dense(64, activation="relu")(combinedInput)
x = BatchNormalization()(x)
x = Dropout(dropout)(x)
x = Dense(64, activation="relu")(x)
x = BatchNormalization()(x)
x = Dropout(dropout)(x)
x = Dense(64, activation="relu")(x)
x = BatchNormalization()(x)
x = Dropout(dropout)(x)
x = Dense(n_class, activation="softmax")(x)

model = Model(inputs=[ff_model.input, cnn_model.input], outputs=x)
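
In terms of shapes, concatenating the branches simply places the two 32-dimensional outputs side by side for each sample. The NumPy sketch below illustrates this with random stand-in arrays; it is not the Keras layer itself, only the equivalent array operation.

```python
import numpy as np

batch = 8
ff_out = np.random.rand(batch, 32)   # stand-in for the FFNN branch output
cnn_out = np.random.rand(batch, 32)  # stand-in for the CNN branch output

# Keras `concatenate` joins along the last axis by default, like this call.
combined = np.concatenate([ff_out, cnn_out], axis=-1)
print(combined.shape)
```

The final FFNN therefore receives 64 features per sample, the first 32 from the tabular branch and the last 32 from the image branch.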

---
<a id="section54"></a>
# <font color="#004D7F" size=5> 5.4. Metrics</font>

Define metrics and some hyperparameters

In [None]:
METRICS = [
    #tf.keras.metrics.TruePositives(name = 'tp'),
    #tf.keras.metrics.FalsePositives(name = 'fp'),
    #tf.keras.metrics.TrueNegatives(name = 'tn'),
    #tf.keras.metrics.FalseNegatives(name = 'fn'), 
    tf.keras.metrics.CategoricalAccuracy(name ='accuracy'),  # use BinaryAccuracy for binary problems
    tf.keras.metrics.Precision(name = 'precision'),
    tf.keras.metrics.Recall(name = 'recall'),
    tf.keras.metrics.AUC(name = 'auc'),
]
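
The choice of accuracy metric matters with one-hot targets. Element-wise binary accuracy thresholds every output at 0.5 and compares each cell, whereas categorical accuracy compares the arg-max class of each sample. The NumPy sketch below shows the difference on toy values; it mimics the two definitions and does not call Keras.

```python
import numpy as np

y_true = np.array([[1, 0, 0], [0, 1, 0]])              # one-hot targets, 3 classes
y_prob = np.array([[0.2, 0.7, 0.1], [0.1, 0.8, 0.1]])  # softmax-like outputs

# Element-wise accuracy: threshold each probability at 0.5, compare every cell.
binary_acc = np.mean((y_prob > 0.5).astype(int) == y_true)

# Categorical accuracy: compare the arg-max class of each sample.
categorical_acc = np.mean(y_prob.argmax(axis=1) == y_true.argmax(axis=1))

print(binary_acc, categorical_acc)
```

Here one of the two samples is misclassified, yet the element-wise score is higher than 0.5 because the correctly predicted zeros also count as hits.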

Print the model

In [None]:
# plot_model additionally requires the pydot package and Graphviz
from tensorflow.keras.utils import plot_model
model.summary()
plot_model(model)

---
<a id="section55"></a>
# <font color="#004D7F" size=5> 5.5. Compile and fit</font>

Remember to choose the **loss** according to whether you have a binary or a multiclass classification problem.
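
One common convention for this choice can be sketched as a small helper. The function name `output_config` is hypothetical and the mapping reflects usual Keras practice rather than anything prescribed by TINTOlib: a single sigmoid unit with `binary_crossentropy` for two classes, and one softmax unit per class with `categorical_crossentropy` for one-hot encoded multiclass targets.

```python
def output_config(n_classes: int) -> dict:
    """Hypothetical helper: pick output units, activation and loss from the class count."""
    if n_classes < 2:
        raise ValueError("A classification problem needs at least two classes")
    if n_classes == 2:
        # Binary: one sigmoid unit with 0/1 labels (no one-hot encoding needed).
        return {"units": 1, "activation": "sigmoid", "loss": "binary_crossentropy"}
    # Multiclass: one softmax unit per class with one-hot encoded labels.
    return {"units": n_classes, "activation": "softmax", "loss": "categorical_crossentropy"}


print(output_config(3))
```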

In [None]:
#HYPERPARAMETERS
opt = Adam(learning_rate=1e-3)

In [None]:
model.compile(
    loss="categorical_crossentropy", 
    optimizer=opt,
    metrics = METRICS
)
epochs = 5

In [None]:
model_history=model.fit(
    x=[X_train_num, X_train_img], y=y_train_oh,
    validation_data=([X_val_num, X_val_img], y_val_oh),
    epochs=epochs , 
    batch_size=8,
)

In [None]:
print(model_history.history.keys())

<a id="section6"></a>
# <font color="#004D7F" size=6> 6. Results</font>

Finally, we can evaluate our hybrid model with the images created by TINTOlib in any of the ways represented below.

---
<a id="section61"></a>
# <font color="#004D7F" size=5> 6.1. Train/Validation representation</font>

In [None]:
#print(model_history.history['loss'])
plt.plot(model_history.history['loss'], color = 'red', label = 'loss')
plt.plot(model_history.history['val_loss'], color = 'green', label = 'val loss')
plt.legend(loc = 'upper right')

plt.show()

In [None]:
plt.plot(model_history.history['accuracy'], color = 'red', label = 'accuracy')
plt.plot(model_history.history['val_accuracy'], color = 'green', label = 'val accuracy')
plt.legend(loc = 'upper right')
plt.show()

---
<a id="section62"></a>
# <font color="#004D7F" size=5> 6.2. Validation/Test evaluation</font>

In [None]:
score_test= model.evaluate([X_test_num, X_test_img], y_test_oh)

In [None]:
prediction = model.predict([X_test_num, X_test_img])
real_values= y_test.values-1
predicted_classes = np.argmax(prediction, axis = 1)

result = [list(t) for t in zip(predicted_classes, real_values)]
#print(np.round(prediction))
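
Beyond listing predicted and real values side by side, a confusion matrix summarizes which classes are confused with each other. The sketch below uses toy arrays as stand-ins for `real_values` and `predicted_classes`; the values are made up for illustration.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# Toy stand-ins for `real_values` and `predicted_classes` (class indexes from 0).
toy_real = np.array([0, 0, 1, 1, 2, 2])
toy_pred = np.array([0, 1, 1, 1, 2, 0])

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(toy_real, toy_pred)
acc = accuracy_score(toy_real, toy_pred)
print(cm)
print("Accuracy:", acc)
```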


In [None]:
test_accuracy = score_test[1]
test_auc = score_test[4]
test_precision = score_test[2]
test_recall = score_test[3]

print("Test accuracy:",test_accuracy)
print("Test AUC:",test_auc)
print("Test precision:",test_precision)
print("Test recall:",test_recall)

In [None]:
train_accuracy = model_history.history["accuracy"][-1]
train_auc = model_history.history["auc"][-1]
train_precision = model_history.history["precision"][-1]
train_recall = model_history.history["recall"][-1]
train_loss = model_history.history["loss"][-1]

print("Train accuracy:",train_accuracy)
print("Train AUC:",train_auc)
print("Train precision:",train_precision)
print("Train recall:",train_recall)
print("Train loss:",train_loss)

In [None]:
validation_accuracy = model_history.history["val_accuracy"][-1]
validation_auc = model_history.history["val_auc"][-1]
validation_precision = model_history.history["val_precision"][-1]
validation_recall = model_history.history["val_recall"][-1]
validation_loss = model_history.history["val_loss"][-1]

print("Validation accuracy:",validation_accuracy)
print("Validation AUC:",validation_auc)
print("Validation precision:",validation_precision)
print("Validation recall:",validation_recall)
print("Validation loss:",validation_loss)

<div style="text-align: right"> <font size=5> <a href="#indice"><i class="fa fa-arrow-circle-up" aria-hidden="true" style="color:#004D7F"></i></a></font></div>

---

<div style="text-align: right"> <font size=6><i class="fa fa-coffee" aria-hidden="true" style="color:#004D7F"></i> </font></div>