# Open this notebook with Google Colab ;)

In [0]:
import os, sys
import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
%matplotlib inline

# Now, let's load data from the Wine review Kaggle challenge

You can download the data from the [dedicated Kaggle page](https://www.kaggle.com/zynicide/wine-reviews).

However, as we are running this on Google Colab, let's see how to load the data from Kaggle directly to Google Colab. First, let's install the `kaggle` package : 

In [0]:
!pip install kaggle

Then, go to your Kaggle account and create a "New API TOKEN". This triggers the download of a `kaggle.json` file that you can store on your computer. Now, upload this file to Colab with the following command:

In [0]:
from google.colab import files
files.upload()

Once this is done, just run : 

In [0]:
!mkdir -p ~/.kaggle
!cp kaggle.json ~/.kaggle/
!chmod 600 ~/.kaggle/kaggle.json

You can now download the dataset thanks to : 

In [0]:
!kaggle datasets download -d zynicide/wine-reviews 

We can now unzip the files

In [0]:
import zipfile

# The context manager closes the archive automatically, even if extraction fails
with zipfile.ZipFile('wine-reviews.zip', 'r') as zip_ref:
    zip_ref.extractall('files')

Let's check the files that you have downloaded : 

In [0]:
!ls files

Let's now read one of the files

In [0]:
import pandas as pd
import os

data_path = 'files'
df = pd.read_csv(os.path.join(data_path, 'winemag-data-130k-v2.csv'), index_col=0)

Each line of the dataset corresponds to a wine description. Let's check out what is in the data, in particular the following columns:
- the price of the wine bottle
- its description
- the number of "points" it has (on a scale from 0 to 100)

In [0]:
df.head()

The goal here is not to do any data engineering. So let's take care of the missing values by removing the corresponding lines

In [0]:
df = df.dropna(axis=0, subset=['price', 'description', 'points'], how='any')

df = df[:5000] # Always start with a small subset so that the first experiments run fast

Let's select the input data and the target. Again, the goal is not to do any data engineering, so let's skip the train/test split here.

In [0]:
X_text = df['description']
y = df['price']

If we take a look at the data, we see that each description is one long string. However, if we don't tell the computer where each word starts and ends, it will treat the whole description as a single long word. For that reason, you have to split it into words.

In [0]:
X_text.iloc[0]

In [0]:
X_text = X_text.apply(lambda x: x.split(' '))

The next step is to convert the words into integer tokens, as this is what the model will work on

In [0]:
from tensorflow.keras.preprocessing.text import Tokenizer

tk = Tokenizer()
tk.fit_on_texts(X_text)
X_tokens = tk.texts_to_sequences(X_text)
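To make the tokenization step concrete, here is a simplified pure-Python sketch of the idea: every distinct word gets an integer id, the most frequent word first, and id 0 stays free for padding. It only mimics the spirit of `Tokenizer` and is not its implementation. The helper names `build_word_index` and `to_sequences`, the alphabetical tie-break, and the toy sentences are all assumptions made for illustration.

```python
from collections import Counter

def build_word_index(tokenized_texts):
    """Map each word to an integer id, most frequent word first (ids start at 1)."""
    counts = Counter(word.lower() for text in tokenized_texts for word in text)
    # Sort by descending frequency, then alphabetically (assumed tie-break) for determinism
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return {word: idx for idx, (word, _) in enumerate(ordered, start=1)}

def to_sequences(tokenized_texts, word_index):
    """Replace each known word by its id; unknown words are dropped."""
    return [[word_index[w.lower()] for w in text if w.lower() in word_index]
            for text in tokenized_texts]

# Toy data (hypothetical), already split into words
toy_texts = [["ripe", "fruit", "aromas"], ["fruit", "and", "oak"]]
word_index = build_word_index(toy_texts)
sequences = to_sequences(toy_texts, word_index)
print(word_index)
print(sequences)
```

"fruit" appears twice, so it receives id 1. No word ever receives id 0, which is why 0 can safely be used as the padding value later on.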



Now, we need to pad the data. But instead of padding it to the maximum length within the input data, let's pad it to a smaller number to speed up training. The cost should be small, as there are reasons to believe that we don't need the entire description to judge the quality of the wine, and thus its price.

In [0]:
plt.hist([len(_) for _ in X_tokens])
plt.show()

In [0]:
from tensorflow.keras.preprocessing.sequence import pad_sequences

maxlen = 40
X_pad = pad_sequences(X_tokens, dtype='int32', padding='post', maxlen=maxlen)  # token ids are integers
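The sketch below shows, on toy data, what happens to sequences that are shorter or longer than `maxlen` with the settings used above. It is a pure-Python illustration, not the Keras code, and `pad_post` is a hypothetical helper name. One detail worth knowing: `pad_sequences` truncates with `truncating='pre'` by default, so a description that is too long loses its first tokens, not its last ones.

```python
def pad_post(sequences, maxlen, value=0):
    """Pad at the end; truncate from the start when a sequence is too long."""
    padded = []
    for seq in sequences:
        kept = seq[-maxlen:]  # keep the last `maxlen` tokens ('pre' truncation)
        padded.append(kept + [value] * (maxlen - len(kept)))
    return padded

# Toy token sequences (hypothetical): one too short, one too long
toy_tokens = [[5, 1, 3], [1, 2, 4, 7, 9, 8]]
toy_padded = pad_post(toy_tokens, maxlen=4)
print(toy_padded)
```

If you would rather keep the beginning of each description, pass `truncating='post'` to `pad_sequences`.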

# A sequential model to handle text

Let's first build a model with : 
- an Embedding designed for our task
- a Conv1D layer instead of a RNN

In [0]:
from tensorflow.keras import layers, models, Sequential


model = Sequential()
model.add(layers.Embedding(input_dim=20000, output_dim=10, mask_zero=True, input_length=maxlen))
model.add(layers.Conv1D(10, kernel_size=5, activation='relu'))
model.add(layers.Flatten())
model.add(layers.Dense(10, activation='relu'))
model.add(layers.Dense(1, activation='relu'))

model.compile(loss="mse", optimizer='adam', metrics=['mae'])



Let's look at the parameters of the model

In [0]:
model.summary()
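As a sanity check, the parameter counts can be recomputed by hand. The sketch below assumes the Keras defaults for `Conv1D` (`padding='valid'`, stride 1) and a bias term in every layer; the variable names are only illustrative.

```python
maxlen = 40
vocab_size, embedding_dim = 20000, 10
filters, kernel_size = 10, 5

embedding_params = vocab_size * embedding_dim                  # one vector per token id
conv_params = kernel_size * embedding_dim * filters + filters  # kernel weights + biases
conv_length = maxlen - kernel_size + 1                         # 'valid' padding shortens the sequence
flat_size = conv_length * filters                              # size after Flatten
dense_params = flat_size * 10 + 10                             # Dense(10)
output_params = 10 * 1 + 1                                     # Dense(1)

total = embedding_params + conv_params + dense_params + output_params
print(embedding_params, conv_params, dense_params, output_params, total)
```

Almost all of the parameters sit in the embedding layer, which is typical for small text models.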

<!> There is another way to check the model layers :

In [0]:
import tensorflow as tf
tf.keras.utils.plot_model(model, "sequential_model.png", show_shapes=True)


# Let's dive into TensorBoard! This is an amazing tool to see how the neural network works

In [0]:

# Load the TensorBoard notebook extension
%load_ext tensorboard

# Clear any logs from previous runs
!rm -rf ./logs/ 

In [0]:
import datetime
import tensorflow as tf 

log_dir = "logs/fit/" + datetime.datetime.now().strftime("%Y%m%d-%H%M%S")
tensorboard_callback = tf.keras.callbacks.TensorBoard(log_dir=log_dir, histogram_freq=1)

Now, the logs of our fit will be written to a dedicated folder that stores the information TensorBoard needs

In [0]:
model.fit(X_pad, y.values, 
          epochs=5, 
          batch_size=32,
          callbacks=[tensorboard_callback])

In [0]:
%tensorboard --logdir logs/fit

# But how to take into account other information?

How can we feed other information into the model? For example, the points (on a 0 to 100 scale) of each bottle, which should tell us how cheap or expensive a wine can be.

In [0]:
df.plot.scatter('price', 'points')
plt.xlim(0, 200)

In [0]:
X_num = df['points'].values

Not all models are sequential! In fact, what we will do is build two branches, one for the text and the other for the numerical data

<img src="https://raw.githubusercontent.com/lewagon/data-images/master/DL/stack_layers.png" width='70%'>

The very interesting aspect is that you can optimize the two sub-NN jointly! 

### How to do that?

In [0]:
# Let's write a branch for the points values

input_num = layers.Input(shape=(1,))

x_num = layers.Dense(64, activation="relu")(input_num)
x_num = layers.Dense(32, activation="relu")(x_num)
x_num = layers.Dense(4, activation="relu")(x_num)
x_num = models.Model(inputs=input_num, outputs=x_num)


In [0]:
# Let's write the second branch for the text
input_text = layers.Input(shape=(maxlen,))

x_text = layers.Embedding(input_dim=20000, output_dim=10, mask_zero=True)(input_text)
x_text = layers.Conv1D(10, kernel_size=5, activation='relu')(x_text)
x_text = layers.Flatten()(x_text)
x_text = layers.Dense(10, activation='relu')(x_text)
x_text = models.Model(inputs=input_text, outputs=x_text)




In [0]:
# Let's combine the two streams of data and add two dense layers on top!
combined = layers.concatenate([x_text.output, x_num.output])
output = layers.Dense(2, activation="relu")(combined)
output = layers.Dense(1, activation="linear")(output)
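To see what the concatenation does to the shapes, here is a small NumPy sketch of the same data flow. The random arrays stand in for the outputs of the two branches and the weights are random placeholders, so the numbers themselves are meaningless; only the shapes matter.

```python
import numpy as np

rng = np.random.default_rng(0)
batch_size = 3

text_features = rng.random((batch_size, 10))  # stand-in for the text branch output
num_features = rng.random((batch_size, 4))    # stand-in for the points branch output

# Concatenate along the feature axis: one row per wine, 10 + 4 columns
combined_demo = np.concatenate([text_features, num_features], axis=1)

# Dense(2, relu) then Dense(1, linear), with random placeholder weights
w1, b1 = rng.normal(size=(14, 2)), np.zeros(2)
w2, b2 = rng.normal(size=(2, 1)), np.zeros(1)
hidden = np.maximum(combined_demo @ w1 + b1, 0.0)
prediction = hidden @ w2 + b2

print(combined_demo.shape, hidden.shape, prediction.shape)
```

Each wine ends up with a single vector of 14 features that mixes both sources, and the layers on top learn from the two sources at once.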


So far, we have written the entire data flow through the neural network.

The line below defines the model itself. Chaining layers is not enough here: since the model is not Sequential, we have to state explicitly what its inputs and outputs are

In [0]:
model_combined = models.Model(inputs=[x_text.input, x_num.input], outputs=output)

Now, we can compile and fit it!

And pay attention to the `X` data 

In [0]:
model_combined.compile(loss="mse", optimizer='adam', metrics=['mae'])

model_combined.fit(x=[X_pad, X_num], 
          y=y.values,
          epochs=100, 
          batch_size=8)
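A quick way to check the `X` data before fitting is sketched below with made-up arrays (the `_demo` names are hypothetical). The list of inputs must follow the same order as the model's `inputs`, and every array must have one row per sample.

```python
import numpy as np

n_samples, maxlen = 6, 40
X_pad_demo = np.zeros((n_samples, maxlen), dtype="int32")    # padded token ids
X_num_demo = np.arange(80, 80 + n_samples, dtype="float32")  # points, shape (n_samples,)
y_demo = np.linspace(10.0, 60.0, n_samples)                  # prices

# Give the numerical input an explicit feature axis to match Input(shape=(1,))
X_num_demo = X_num_demo.reshape(-1, 1)

inputs_demo = [X_pad_demo, X_num_demo]  # text first, numbers second, like the model inputs
assert all(len(arr) == len(y_demo) for arr in inputs_demo), "inputs and targets must align"
print([arr.shape for arr in inputs_demo], y_demo.shape)
```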

Let's have a look at the summary now : 

In [0]:
model_combined.summary()

Let's also plot the architecture:

In [0]:
import tensorflow as tf
tf.keras.utils.plot_model(model_combined, "multi_input_model.png", show_shapes=True)


You can think about each stream / branch as an input source of data. There are many use cases where you can encounter this type of data.

# Any examples of where you might have such data?


- Medical data : ECG, EEG, MRI, PET, cognitive assessments, biomarkers, ...

<img src="https://raw.githubusercontent.com/lewagon/data-images/master/DL/medical_data.png" width='70%'>

- Object detection, in autonomous cars for instance, where a decision is based on many sensors (multiple cameras, radars, speed, map, ...)

<img src="https://raw.githubusercontent.com/lewagon/data-images/master/DL/autonomous_vehicle.png" width='70%'>
