# TensorFlow Tutorial

This tutorial implements a [matrix factorization](http://base.sjtu.edu.cn/~bjshen/2.pdf) model with the [TensorFlow](https://www.tensorflow.org/) library.

Installing the dependencies (`scikit-learn` is included because the batching cell imports it):
```
$ pip install numpy
$ pip install pandas
$ pip install scikit-learn
$ pip install tensorflow
```

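The first step is to import the libraries used throughout the tutorial. The code relies on the TensorFlow 1.x graph API (`tf.placeholder`, `tf.Session`), so it needs a 1.x release or the `tf.compat.v1` module of TensorFlow 2.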

In [1]:
import numpy as np
import tensorflow as tf
import pandas as pd
import time
! mkdir tmp

mkdir: tmp: File exists


## Loading the data

This tutorial uses the same dataset as practical lab No. 2.

In [2]:
# Load dataset
data_path = '../assignment-2/dataset/ratings.dat'
headers = ['user_id', 'movie_id', 'rating', 'timestamp']
data = pd.read_csv(data_path,
                   names=headers,
                   delimiter=';',
                   usecols=['user_id', 'movie_id', 'rating'])

print('# Users:', data['user_id'].nunique())
print('# Items:', data['movie_id'].nunique())
print('# Data:', data.shape[0])
data[:5]

# Users: 6040
# Items: 3706
# Data: 1000209


| | user_id | movie_id | rating |
|---|---|---|---|
| 0 | 1 | 1193 | 5 |
| 1 | 1 | 661 | 3 |
| 2 | 1 | 914 | 3 |
| 3 | 1 | 3408 | 4 |
| 4 | 1 | 2355 | 5 |


Once the data is loaded, user and movie IDs are remapped through a dictionary so that they become consecutive integers from 0 to N-1, which lets them index the rows of the embedding matrices directly.
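
The remapping can be illustrated on a handful of IDs. This is a minimal sketch in plain Python; the raw IDs below are made up for the example, and the loop is just one way to build the same kind of dictionary that `enumerate` over the unique IDs produces.

```python
# Hypothetical raw IDs; the real ones come from the ratings file.
raw_user_ids = [10, 42, 10, 7, 42]

# Give each distinct ID the next free index, in order of first appearance.
user_to_int = {}
for user in raw_user_ids:
    if user not in user_to_int:
        user_to_int[user] = len(user_to_int)

# Replace every raw ID with its consecutive index.
user_data = [user_to_int[user] for user in raw_user_ids]

print(user_to_int)  # {10: 0, 42: 1, 7: 2}
print(user_data)    # [0, 1, 0, 2, 1]
```

The largest index is always one less than the number of distinct IDs, so an embedding matrix with one row per distinct ID has no unused rows.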

Next, the data is split into a training set and a test set, with 20% of the ratings held out for the latter.
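
The key property of the split is that the user, item and rating lists are shuffled with the same permutation, so each rating stays attached to its user and item. The sketch below shows that idea in plain Python. `split_parallel` is a hypothetical helper written for this illustration, not the scikit-learn implementation, and its rounding of the test size is an assumption.

```python
import random

def split_parallel(columns, test_size=0.2, seed=0):
    """Split equally long lists with one shared permutation so rows stay aligned."""
    n = len(columns[0])
    order = list(range(n))
    random.Random(seed).shuffle(order)
    n_test = int(n * test_size)  # assumption: round the test size down
    test_idx, train_idx = order[:n_test], order[n_test:]
    return [([col[i] for i in train_idx], [col[i] for i in test_idx])
            for col in columns]

# Toy data: ten (user, item, rating) triples.
users = [0, 0, 0, 1, 1, 1, 2, 2, 3, 3]
items = [0, 1, 2, 0, 3, 4, 1, 4, 2, 3]
ratings = [5, 3, 3, 4, 5, 2, 1, 4, 3, 5]

(u_train, u_test), (v_train, v_test), (r_train, r_test) = split_parallel(
    [users, items, ratings])
print(len(r_train), len(r_test))  # 8 2
```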

Finally, a generator function is defined that yields the data grouped into shuffled mini-batches of a fixed size; ratings that do not fill a complete batch are dropped.

In [3]:
# Prepare batches
from sklearn.model_selection import train_test_split
from sklearn.utils import shuffle

# Transform data to int
user_to_int = {user: i for i, user in enumerate(data['user_id'].unique())}
item_to_int = {item: i for i, item in enumerate(data['movie_id'].unique())}

# Map user and items to int
user_data = data['user_id'].map(lambda user: user_to_int[user]).tolist()
item_data = data['movie_id'].map(lambda item: item_to_int[item]).tolist()
rating_data = data['rating'].tolist()

# split into train / test
u_train, u_test, v_train, v_test, r_train, r_test = train_test_split(
    user_data, item_data, rating_data, test_size=0.2)

def get_batch(user_data, item_data, rating_data, batch_size=32):
    # Generate complete batches
    count = 0
    max_len = len(rating_data)
    n_batches = max_len // batch_size
    
    # Shuffle data
    user_data, item_data, rating_data = shuffle(user_data, item_data, rating_data)
    
    user_data = user_data[0:n_batches*batch_size]
    item_data = item_data[0:n_batches*batch_size]
    rating_data = rating_data[0:n_batches*batch_size]
    
    # Step over the truncated data only, so no empty trailing batch is yielded
    for i in range(0, n_batches * batch_size, batch_size):
        count += 1
        u = user_data[i:i+batch_size]
        v = item_data[i:i+batch_size]
        y = rating_data[i:i+batch_size]
            
        yield u, v, y, count, n_batches
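
The batching logic can be checked on toy data without any dependencies. The sketch below is a stand-alone variant written for illustration: `iter_batches` is a hypothetical name, and it shuffles an index list instead of calling `sklearn.utils.shuffle`. The last lines reproduce the batch count seen in the training log, assuming the test split takes 20% of the 1,000,209 ratings rounded up.

```python
import math
import random

def iter_batches(users, items, ratings, batch_size=2, seed=None):
    """Yield shuffled, complete mini-batches; the remainder is dropped."""
    order = list(range(len(ratings)))
    random.Random(seed).shuffle(order)
    n_batches = len(order) // batch_size
    for b in range(n_batches):
        idx = order[b * batch_size:(b + 1) * batch_size]
        yield ([users[i] for i in idx],
               [items[i] for i in idx],
               [ratings[i] for i in idx],
               b + 1, n_batches)

batches = list(iter_batches([0, 0, 1, 1, 2], [0, 1, 0, 2, 1], [5, 3, 4, 2, 1]))
print(len(batches))  # 2: five ratings with batch_size=2 give two full batches

# Batch count for the tutorial's training set with batch_size=1024.
n_train = 1000209 - math.ceil(1000209 * 0.2)
print(n_train // 1024)  # 781
```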

## Defining the computation graph

The following cell uses the framework to build the computation graph:

- The first step defines the graph parameters: the number of users and items in the dataset, plus the hyperparameters k (embedding size), alpha (regularization weight) and the _learning rate_.
- Next come the placeholders, which are fed at run time through the feed dictionary.
- The model matrices P and Q are initialized from a uniform distribution on [-1, 1).
- User and item _embeddings_ are fetched with a _lookup_ table.
- *y_hat* is the model's prediction, i.e. the dot product of the user embedding and the item embedding.
- Finally, the loss to optimize is defined (mean squared error plus an L2 penalty on the embeddings) and the optimizer object is created.
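
The steps above can be sketched outside TensorFlow to make the shapes concrete. The NumPy version below is an illustration under assumptions, not the tutorial's graph: the sizes and ratings are toy values, and integer array indexing stands in for the embedding lookup.

```python
import numpy as np

rng = np.random.default_rng(0)
n_users, n_items, k = 4, 5, 3               # toy sizes, not the tutorial's
P = rng.uniform(-1, 1, size=(n_users, k))   # user embeddings
Q = rng.uniform(-1, 1, size=(n_items, k))   # item embeddings

# One batch of three (user, item, rating) examples.
users = np.array([0, 2, 2])
items = np.array([1, 1, 4])
ratings = np.array([5.0, 3.0, 4.0])
alpha = 0.001

user_embed = P[users]                        # lookup: one row of P per example
item_embed = Q[items]                        # lookup: one row of Q per example
y_hat = np.sum(user_embed * item_embed, axis=1)  # row-wise dot product

mse = np.mean((ratings - y_hat) ** 2)
# Half the sum of squares, scaled by alpha, for both sets of embeddings.
reg = alpha * (np.sum(user_embed ** 2) + np.sum(item_embed ** 2)) / 2
loss = mse + reg
print(y_hat.shape, loss)
```

Each prediction is a single number per example, so `y_hat` has one entry per row of the batch regardless of k.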

In [4]:
# Define the graph model

# 1. Define model parameters
n_users = data['user_id'].nunique()
n_items = data['movie_id'].nunique()
k = 40
alpha = tf.constant(.001, name='alpha')
learning_rate = .01

# 2. Define variables that are fed through the dictionary session
# User, item and ratings placeholders
user_input = tf.placeholder(tf.int32, [None], name='user_input')
item_input = tf.placeholder(tf.int32, [None], name='item_input')
y = tf.placeholder(tf.float32, [None], name='ratings_input')

# 3. Define and initialize matrix embeddings
# User embeddings come from P matrix
# Item embeddings come from Q matrix

with tf.name_scope('P_matrix'):
    P_matrix = tf.Variable(tf.random_uniform((n_users, k), -1, 1), name='user_embeddings')
    tf.summary.tensor_summary('P_matrix', P_matrix)
with tf.name_scope('Q_matrix'):
    Q_matrix = tf.Variable(tf.random_uniform((n_items, k), -1, 1), name='item_embeddings')
    tf.summary.tensor_summary('Q_matrix', Q_matrix)

# 4. Fetch embeddings with a lookup table
# Define user and item embedding
user_embed = tf.nn.embedding_lookup(P_matrix, user_input, name='user_embed')
item_embed = tf.nn.embedding_lookup(Q_matrix, item_input, name='item_embed')

# 5. Compute prediction
with tf.name_scope('prediction'):
    y_hat = tf.reduce_sum(tf.multiply(user_embed, item_embed), 1)
    # Scalar summaries need a rank-0 tensor, so log the batch mean
    tf.summary.scalar('prediction', tf.reduce_mean(y_hat))
    pred_histogram = tf.summary.histogram("mean_prediction", y_hat)

    
# Compute loss function
# mse_loss = 1/n * sum((y - y_hat) ** 2)
mse_loss = tf.losses.mean_squared_error(y, y_hat)

# reg_loss = alpha / 2 * (||p||^2 + ||q||^2), since tf.nn.l2_loss(t) returns sum(t ** 2) / 2
reg_loss = tf.add(tf.multiply(alpha, tf.nn.l2_loss(user_embed)), tf.multiply(alpha, tf.nn.l2_loss(item_embed)))

with tf.name_scope('error'):
    loss = tf.add(mse_loss, reg_loss)
    tf.summary.scalar('error', loss)
    loss_histogram = tf.summary.histogram("mean_loss", loss)

    
optimizer = tf.train.AdamOptimizer(learning_rate).minimize(loss)
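
Written out, the loss assembled in this cell appears to be the following. The symbols are introduced here for illustration and do not come from the code: $B$ is the batch of (user, item) pairs, $r_{ui}$ the observed rating, and $p_u$, $q_i$ the looked-up rows of P and Q.

$$
\mathcal{L}(B) = \frac{1}{|B|} \sum_{(u,i) \in B} \left(r_{ui} - p_u^\top q_i\right)^2 + \frac{\alpha}{2} \sum_{(u,i) \in B} \left(\lVert p_u \rVert_2^2 + \lVert q_i \rVert_2^2\right)
$$

The error term is a mean over the batch while the penalty is a sum, so the relative weight of the regularization grows with the batch size.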

## Training the model

To run the graph defined above, first create a session object and initialize the variables. Then iterate over the data for as many epochs as needed to fit the model.

For each batch, a dictionary that feeds the placeholders is built and the session is run with it. Here the session evaluates the loss and the optimizer op, which computes the gradients and updates the parameters.
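
To show what "compute the gradients and update the parameters" amounts to, here is a hand-written update in NumPy. It is a sketch under assumptions: it uses plain gradient descent rather than the Adam optimizer used in the tutorial, `sgd_step` is a hypothetical helper, and the data and step size are toy values.

```python
import numpy as np

def sgd_step(P, Q, users, items, ratings, alpha=0.001, lr=0.05):
    """One in-place gradient step on a batch; returns the loss before the update."""
    p, q = P[users], Q[items]                # copies of the looked-up rows
    err = np.sum(p * q, axis=1) - ratings    # prediction error per example
    loss = np.mean(err ** 2) + alpha * (np.sum(p ** 2) + np.sum(q ** 2)) / 2
    # Gradients of the loss with respect to each looked-up row.
    grad_p = 2 * err[:, None] * q / len(ratings) + alpha * p
    grad_q = 2 * err[:, None] * p / len(ratings) + alpha * q
    # ufunc.at accumulates correctly when an index repeats within the batch.
    np.subtract.at(P, users, lr * grad_p)
    np.subtract.at(Q, items, lr * grad_q)
    return loss

rng = np.random.default_rng(0)
P = rng.uniform(-1, 1, size=(3, 4))
Q = rng.uniform(-1, 1, size=(4, 4))
users = np.array([0, 0, 1, 2])
items = np.array([0, 1, 1, 3])
ratings = np.array([5.0, 3.0, 4.0, 1.0])

# Repeatedly fit the same toy batch and record the loss.
losses = [sgd_step(P, Q, users, items, ratings) for _ in range(200)]
print(losses[0], losses[-1])
```

Only the rows of P and Q that appear in the batch are touched, which is why the lookup-based formulation scales to large numbers of users and items.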

In [7]:
# Train model
epochs = 20
batch_size = 1024
print_every_n = 100

# Create session
sess = tf.Session()

# Write the graph so the model can be inspected in TensorBoard
merged = tf.summary.merge_all()
writer = tf.summary.FileWriter('./tmp/run1', sess.graph)

# Initialize session variables
sess.run(tf.global_variables_initializer())

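# `start` is set once, before the first epoch, so the sec/batch figure printed
# below is the total elapsed time divided by the batch index of the current
# epoch; it is a true per-batch average only during the first epoch
start = time.time()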

# Iterate over epochs
for e in range(epochs):
    # Iterate over all batches
    for users, items, scores, batch_number, total_batches in get_batch(u_train, v_train, r_train, batch_size=batch_size):
        train_feed = {
            user_input: users,
            item_input: items,
            y: scores
        }

        # Feed the graph
        batch_loss, _ = sess.run([loss, optimizer], feed_dict=train_feed)
            
        # Print progress
        if (batch_number % print_every_n == 0):
            end = time.time()
            print('[Train] Epoch: {}/{}  '.format(e+1, epochs),
                  'Batch: {}/{} ({:.2f}%)  '.format(batch_number, total_batches, batch_number / total_batches * 100),
                  'Train loss: {:.10f}  '.format(batch_loss),
                  '{:.4f} sec/batch'.format((end - start) / batch_number))
            
            sum1 = sess.run(pred_histogram, feed_dict=train_feed)
            sum2 = sess.run(loss_histogram, feed_dict=train_feed)  
            writer.add_summary(sum1, e)
            writer.add_summary(sum2, e)

    # Validate with test set
    val_start = time.time()
    val_loss = 0.
    for users, items, scores, batch_number, total_batches in get_batch(u_test, v_test, r_test, batch_size=batch_size):
        validation_feed = {
            user_input: users,
            item_input: items,
            y: scores
        }
        # Feed the graph
        val_loss += sess.run([loss], feed_dict=validation_feed)[0]

    val_loss /= total_batches
    end = time.time()
    print('[Validation] Epoch: {}/{}  '.format(e+1, epochs),
          'Validation loss: {:.10f}  '.format(val_loss),
          '{:.4f} sec'.format((end - val_start)))

[Train] Epoch: 1/20   Batch: 100/781 (12.80%)   Train loss: 24.7124023438   0.0143 sec/batch
[Train] Epoch: 1/20   Batch: 200/781 (25.61%)   Train loss: 20.5979576111   0.0098 sec/batch
[Train] Epoch: 1/20   Batch: 300/781 (38.41%)   Train loss: 16.9126205444   0.0081 sec/batch
[Train] Epoch: 1/20   Batch: 400/781 (51.22%)   Train loss: 10.9266796112   0.0076 sec/batch
[Train] Epoch: 1/20   Batch: 500/781 (64.02%)   Train loss: 7.2323837280   0.0072 sec/batch
[Train] Epoch: 1/20   Batch: 600/781 (76.82%)   Train loss: 6.2153320312   0.0070 sec/batch
[Train] Epoch: 1/20   Batch: 700/781 (89.63%)   Train loss: 5.6692728996   0.0070 sec/batch
[Validation] Epoch: 1/20   Validation loss: 5.2253034347   0.6958 sec
[Train] Epoch: 2/20   Batch: 100/781 (12.80%)   Train loss: 4.9673781395   0.0799 sec/batch
[Train] Epoch: 2/20   Batch: 200/781 (25.61%)   Train loss: 4.7676677704   0.0427 sec/batch
[Train] Epoch: 2/20   Batch: 300/781 (38.41%)   Train loss: 4.6218252182   0.0302 sec/batch
...

[Train] Epoch: 12/20   Batch: 500/781 (64.02%)   Train loss: 4.3661437035   0.1426 sec/batch
[Train] Epoch: 12/20   Batch: 600/781 (76.82%)   Train loss: 4.4077596664   0.1197 sec/batch
[Train] Epoch: 12/20   Batch: 700/781 (89.63%)   Train loss: 4.3776950836   0.1033 sec/batch
[Validation] Epoch: 12/20   Validation loss: 4.3800078685   0.7495 sec
[Train] Epoch: 13/20   Batch: 100/781 (12.80%)   Train loss: 4.5186204910   0.7513 sec/batch
[Train] Epoch: 13/20   Batch: 200/781 (25.61%)   Train loss: 4.4384679794   0.3786 sec/batch
[Train] Epoch: 13/20   Batch: 300/781 (38.41%)   Train loss: 4.2769827843   0.2550 sec/batch
[Train] Epoch: 13/20   Batch: 400/781 (51.22%)   Train loss: 4.3247418404   0.1932 sec/batch
[Train] Epoch: 13/20   Batch: 500/781 (64.02%)   Train loss: 4.3625121117   0.1555 sec/batch
[Train] Epoch: 13/20   Batch: 600/781 (76.82%)   Train loss: 4.3220934868   0.1304 sec/batch
[Train] Epoch: 13/20   Batch: 700/781 (89.63%)   Train loss: 4.3627882004   0.1125 sec/batch

## Testing the model

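Finally, the graph can be fed a new dictionary to generate predictions. This example uses user index 0 and item indices 1, 2 and 3. These are the remapped indices, not the raw IDs; judging by the first rows of the dataset shown earlier, they correspond to `user_id` 1 and `movie_id` 661, 914 and 3408.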

In [6]:
test_dict = {
    user_input: [0, 0, 0],
    item_input: [1, 2, 3],
}
predictions = sess.run([y_hat], feed_dict=test_dict)[0]
    
print(predictions)

[2.824893  3.9052625 3.0218327]
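
Because the model works on remapped indices, raw IDs have to be translated before they are fed to the graph. The sketch below shows one way to do that in plain Python. `to_internal` is a hypothetical helper, and the two small dictionaries stand in for `user_to_int` and `item_to_int`, filled only with the entries implied by the first rows of the dataset.

```python
def to_internal(raw_ids, mapping):
    """Translate raw IDs to embedding row indices; unseen IDs raise KeyError."""
    return [mapping[raw] for raw in raw_ids]

# Stand-ins for the tutorial's dictionaries (assumed from the data preview).
user_to_int = {1: 0}
item_to_int = {1193: 0, 661: 1, 914: 2, 3408: 3, 2355: 4}

raw_users = [1, 1, 1]
raw_items = [661, 914, 3408]

feed_users = to_internal(raw_users, user_to_int)
feed_items = to_internal(raw_items, item_to_int)
print(feed_users, feed_items)  # [0, 0, 0] [1, 2, 3]
```

A user or movie that never appeared in the data has no row in P or Q, so the `KeyError` is a useful signal that the model cannot score it.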
