# Machine Learning III Cheat Sheet - Yago Tobio Souto
---
## Index:

+ [**EDA**](#section1)
  + [Import & File Previews](#eda-section1)
  + [Grouping, data types, unique values](#eda-section2)
  + [Outliers](#eda-section3)
  + [Sparsity](#eda-section4)
  + [Recommendation Systems EDA](#eda-section5)
+ [**PCA**](#section2)
  + [Dataset preparation and normalisation](#pca-section1)
  + [Generate latent dims, apply PCA, determine error](#pca-section2)
  + [Model explanation - eigenvalues, variance](#pca-section3)
  + [Find the optimal latent dim with plots](#pca-section4)
+ [**PCA combined with Clustering**](#important-exercise)
+ [**Clustering**](#section3)
  + [Data preparation](#clustering-section1)
  + [Calculations (Silhouette-score, Rand Index, Mutual Info, etc...)](#clustering-section2)
  + [HAC](#clustering-section3)
  + [K-Means Clustering](#clustering-section4)
  + [Mini-batches](#clustering-section5)
  + [Mixture Models](#clustering-section6)

+ [**Memory-Based Filtering**](#section4)
  + [EDA](#rs-section0)
  + [User-Based Filtering](#rs-section1)
  + [Item-Based Filtering](#rs-section2)
  
+ [**Model-Based Filtering**](#section5)
  + [EDA](#rs-section00)
  + [Singular Value Decomposition](#rs-section3)
  + [Matrix Factorisation](#rs-section4)

+ [**Implicit Feedback**](#section6)
  + [BPR - Bayesian Personalised Ranking](#rs-section5)
  + [WMF - Weighted Matrix Factorisation](#rs-section6)
  + [FM - Factorisation Machines](#rs-section7)
  
+ [**Natural Language Processing**](#section7)
  + [Intro](#nlp-section0)
  + [VADER](#nlp-section1)
  + [ROBERTA](#nlp-section2)
  
+ [**Network Analysis**](#section8)
  + [Graph Modality - SoRec](#na-section1)
  + [Text Modality - CTR (Collaborative Topic Regression)](#na-section2)

[*So that later people can say nobody does any work around here...*](https://www.youtube.com/watch?v=Zcb8yPEItwA)

#### **Libraries**


In [13]:
# Standard library imports
import itertools
import math
import os
import sys
import time
import warnings
from collections import defaultdict

# Data manipulation and numerical libraries
import numpy as np
import pandas as pd
from scipy import sparse as sp, stats
from scipy.cluster.hierarchy import dendrogram, linkage
from scipy.spatial.distance import pdist
from scipy.sparse.linalg import svds
from scipy.special import softmax
from scipy.stats import multivariate_normal

# Text and natural language processing
import nltk
from nltk.sentiment import SentimentIntensityAnalyzer

# Machine Learning and Data Science libraries
from sklearn import metrics
from sklearn.cluster import AgglomerativeClustering, KMeans, MiniBatchKMeans
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
from sklearn.metrics import mean_squared_error, silhouette_samples, silhouette_score
from sklearn.metrics.cluster import adjusted_rand_score, normalized_mutual_info_score, rand_score
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.mixture import GaussianMixture
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, OneHotEncoder, StandardScaler
from sklearn.metrics import root_mean_squared_error
from sklearn.pipeline import Pipeline
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import ConfusionMatrixDisplay
from sklearn.calibration import CalibrationDisplay
from sklearn.metrics import RocCurveDisplay

# Graph and plot libraries
import matplotlib.pyplot as plt
import matplotlib as mpl
from matplotlib.cbook import boxplot_stats
from matplotlib.ticker import FixedLocator, FixedFormatter
import seaborn as sns
import plotly.express as px

# Deep Learning libraries
import tensorflow as tf
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
import torch.utils.data as data
from torch.utils.data import Dataset

# Transformer models
from transformers import AutoModelForSequenceClassification, AutoTokenizer, pipeline

# Utility and other specific tools
from adjustText import adjust_text
from tabulate import tabulate
from tqdm.notebook import tqdm

# Recommender systems and specific utilities
import cornac
from cornac.data.text import BaseTokenizer
from cornac.data import GraphModality, ImageModality, TextModality, text as cornac_text
from cornac.datasets import amazon_clothing, filmtrust, movielens
from cornac.eval_methods import RatioSplit
from cornac.models import BPR, BaselineOnly, CTR, ItemKNN, MF, NMF, PMF, SVD, SoRec, UserKNN, VBPR, WMF
from cornac.utils import cache
#from cornac.datasets.python_splitters import python_random_split
#from cornac.models.cornac.cornac_utils import predict_ranking

# Elasticsearch
from elasticsearch import Elasticsearch, helpers

#from recommenders.utils.timer import Timer
#from recommenders.datasets import movielens
#from recommenders.utils.notebook_utils import store_metadata
#from recommenders.models.recommenders.utils.constants import SEED
#from recommenders.datasets.python_splitters import python_random_split
#from recommenders.evaluation.python_evaluation import (
#    map,
#    ndcg_at_k,
#    precision_at_k,
#    recall_at_k,
#)

# Constants and configurations
SEED = 42
VERBOSE = False
pd.set_option("max_colwidth", 0)
%matplotlib inline

# Printing versions of libraries
print(f"System version: {sys.version}")
print(f"Cornac version: {cornac.__version__}")
print(f"TensorFlow version: {tf.__version__}")

System version: 3.12.3 (v3.12.3:f6650f9ad7, Apr  9 2024, 08:18:47) [Clang 13.0.0 (clang-1300.0.29.30)]
Cornac version: 2.1
TensorFlow version: 2.16.1


# EDA (Import-file + pre-processing) <a id="section1"></a>

#### Import & File Previews <a id='eda-section1'></a>
* Import file
* Preview file 
* Drop columns
* Sample the data
* Copy a dataset

In [None]:
# * Import file
df_path = "df_path"
df = pd.read_csv(df_path, header=0)

# * - Preview, feature summary, number of rows x columns.
df.head()
df.info()
df.shape

# * - Drop columns
df.drop("Timestamp", axis=1, inplace=True)

# * - Locate the NaNs
df.isna().sum()

# * - Take a random sample of the data
df_sample = df.sample(n=10000, random_state=42, ignore_index=True)

# * - Copy a dataset
df_X = df_sample.copy()

#### Grouping, data types, unique values <a id='eda-section2'></a>

* Proportion of ratings
* Unique - count and unique values
* Column grouping
* Inspect the data types
* Get the products with the highest rating count
* THRESHOLD FROM THE MIDTERM (INTER) - keep only products with a minimum number of reviews

In [None]:

# TODO - Add a bar chart
# ? - Share of each individual rating value (useful to see the proportion and scale of the ratings)
df_sample.value_counts("Rating", normalize=True)

column_unique_values = df_sample[
    "column"
].unique()  # ? - Unique values of the column

number_column_unique_values = df_sample[
    "column"
].nunique()  # ? - Number of unique values

# * - Column grouping:
group_by_and_count = pd.DataFrame(df_sample.groupby("ProductId")["Rating"].count())
sorted_values_by_criteria = group_by_and_count.sort_values("Rating", ascending=False)
sorted_values_by_criteria.head(10)
#! - The block above and the one below do nearly the same thing
# * - Get the products with the most reviews (distinct users per product)
item_rate_count = (
    df_sample.groupby("ProductId")["UserId"].nunique().sort_values(ascending=False)
)
item_rate_count  # ? - Number of reviews per product

# * - Collect the information above in a DataFrame:
unique_counts = df_sample.nunique()
unique_values = [df_sample[column].unique() for column in df_sample.columns]

# * - Data type of each column, as a list:
data_types = [str(df_sample[column].dtype) for column in df_sample.columns]

unique_counts_df = pd.DataFrame(
    {
        "feature": df_sample.columns,
        "unique_count": unique_counts,
        "unique_values": unique_values,
        "data_type": data_types,
    }
)
unique_counts_df

# ! - THRESHOLD FROM THE MIDTERM (INTER): filter the dataset and keep only products with at least 20 reviews
reviews_per_product = df_sample["ProductId"].value_counts()  # total reviews per product
select_product = reviews_per_product[reviews_per_product >= 20].index.to_list()
df = df_sample.loc[df_sample["ProductId"].isin(select_product)]
df.shape
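
The minimum-reviews filter can also be written with `groupby(...).transform("count")`, which keeps the counts aligned with the original rows. The sketch below runs on a hypothetical toy table; the column names follow this notebook, while the data and the `MIN_REVIEWS` value are assumptions made only for illustration.

```python
import pandas as pd

# Hypothetical toy ratings table (values are illustrative only)
toy = pd.DataFrame(
    {
        "UserId": ["u1", "u2", "u3", "u1", "u2", "u4"],
        "ProductId": ["p1", "p1", "p1", "p2", "p2", "p3"],
        "Rating": [5, 4, 3, 2, 5, 1],
    }
)

MIN_REVIEWS = 2  # assumed threshold for the toy data

# Number of reviews of the product on each row, aligned with the original index
reviews_per_row = toy.groupby("ProductId")["Rating"].transform("count")
filtered = toy[reviews_per_row >= MIN_REVIEWS]

print(filtered["ProductId"].unique())
```

Here `p3` has a single review, so its row is dropped and the other five rows survive.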

#### Outliers <a id='eda-section3'></a>

* Rating outlier analysis function
* Get the outliers and show the boxplot
* Get the percentage of outliers in our sample
* Remove the outliers if there is no evident skew in the dataset

In [None]:
def explore_outliers(df, num_vars):
    """
    Explore and identify the outliers of the numeric variables in a DataFrame.

    Returns:
    - outliers_df (dict): Dictionary keyed by numeric variable. Each value is another dictionary
      with the keys 'values' (the outliers), 'positions' (positions of the outliers in the DataFrame)
      and 'indices' (index labels of the outliers in the DataFrame).
    """
    outliers_df = dict()
    for var in num_vars:
        plt.figure()  # one boxplot per variable
        sns.boxplot(data=df, x=var)
        # ? - Fliers of the boxplot over ALL THE RATINGS IN OUR SAMPLE
        outliers_df[var] = boxplot_stats(df[var])[0]["fliers"]
        out_pos = np.where(df[var].isin(outliers_df[var]))[0].tolist()
        out_idx = df.index[out_pos].tolist()
        outliers_df[var] = {
            "values": outliers_df[var],
            "positions": out_pos,
            "indices": out_idx,
        }
    return outliers_df

In [None]:
# * Get the outliers and show the boxplot.
outlier_ratings = explore_outliers(df_sample, ["Rating"])

# * Get the percentage of outliers in our sample:
n_outliers = len(outlier_ratings["Rating"]["indices"])
print(f"Percentage of outliers: {100 * n_outliers / len(df_sample):.1f}%")

# ! - If the boxplot shows a very clear skew, we do NOT recommend removing the anomalies,
# ! - so that every possible user behaviour is captured.

# * If you do want to remove the outliers:
df_sample.drop(outlier_ratings["Rating"]["indices"], inplace=True)
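
The boxplot fliers used above follow the usual 1.5 x IQR rule. As a minimal sketch of that rule (the helper `iqr_outliers` and the sample ratings are hypothetical, not part of the notebook), the same points can be found with NumPy alone:

```python
import numpy as np

def iqr_outliers(values, whis=1.5):
    """Return the values lying outside [Q1 - whis*IQR, Q3 + whis*IQR]."""
    values = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    lower, upper = q1 - whis * iqr, q3 + whis * iqr
    return values[(values < lower) | (values > upper)]

# Hypothetical ratings, heavily skewed towards high scores
ratings = [5, 5, 4, 5, 4, 5, 5, 4, 1]
fliers = iqr_outliers(ratings)
print(fliers)
```

With ratings this skewed the single low score is flagged, which is exactly the case where dropping "outliers" would discard genuine user behaviour.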

#### Sparsity <a id='eda-section4'></a>

* Sparsity calculation - indicates the long-tail property. (If sparsity is high, we opt for cosine similarity)
* Plot to check whether the long-tail property is present

In [None]:
# * - Sparsity calculation. It tells us whether our dataset exhibits long-tail properties
# * - If sparsity is high, I would go for cosine similarity
# * - How full our ratings matrix is:
def print_sparsity(df):
    n_users = df.UserId.nunique()
    n_items = df.ProductId.nunique()
    n_ratings = len(df)
    rating_matrix_size = n_users * n_items
    sparsity = 1 - n_ratings / rating_matrix_size

    print(f"Number of users: {n_users}")
    print(f"Number of items: {n_items}")
    print(f"Number of available ratings: {n_ratings}")
    print(f"Number of all possible ratings: {rating_matrix_size}")
    print("-" * 40)
    print(f"SPARSITY: {sparsity * 100.0:.2f}%")


print_sparsity(df_sample)
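
As a small worked example of the same formula, sparsity = 1 - n_ratings / (n_users * n_items). The function `sparsity` and the toy table below are illustrative assumptions; only the column names come from this notebook.

```python
import pandas as pd

def sparsity(df, user_col="UserId", item_col="ProductId"):
    """Fraction of empty cells in the user x item ratings matrix."""
    n_users = df[user_col].nunique()
    n_items = df[item_col].nunique()
    return 1 - len(df) / (n_users * n_items)

# Hypothetical toy data: 3 users, 4 items, 5 ratings
toy = pd.DataFrame(
    {
        "UserId": ["u1", "u1", "u2", "u3", "u3"],
        "ProductId": ["p1", "p2", "p1", "p3", "p4"],
        "Rating": [5, 3, 4, 2, 5],
    }
)
toy_sparsity = sparsity(toy)
print(f"{toy_sparsity:.2%}")
```

Five of the twelve possible user-item cells are filled, so seven twelfths of the matrix is empty.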

In [None]:
# * - Optional: plot to check for the long-tail property (WATCH THE COLUMN NAMES)
popular_products = pd.DataFrame(df_sample.groupby("ProductId")["Rating"].count())
most_popular = popular_products.sort_values("Rating", ascending=False)

fig, axes = plt.subplots(nrows=1, ncols=2, figsize=(14, 5))
# First plot
axes[0].bar(
    x=range(len(item_rate_count)),
    height=item_rate_count.values,
    width=5.0,
    align="edge",
)
axes[0].set_xticks([])
axes[0].set(
    title="Long tail of rating frequency",
    xlabel="Item ordered by decreasing frequency",
    ylabel="#Ratings",
)

# Second plot: the 30 most popular items
# most_popular is a DataFrame, so the counts are read from its "Rating" column.
x_pos = range(len(most_popular.head(30)))  # Generate x positions
axes[1].bar(x=x_pos, height=most_popular.head(30)["Rating"], align="center")
axes[1].set_xticks(x_pos)
axes[1].set_xticklabels(most_popular.head(30).index, rotation="vertical")
axes[1].set(
    title="Top 30 Most Popular Items",
    xlabel="Item",
    ylabel="#Ratings",
)

plt.tight_layout()
plt.show()

#### Recommendation Systems EDA <a id='eda-section5'></a>
* Build the ratings matrix
* EDA of the ratings matrix
* User profiling
* Item profiling
* Distribution of ratings per product
* Distribution of ratings per user

In [None]:
# * Build the ratings matrix:
ratings_matrix = df_sample.pivot_table(
    index="UserId",
    columns="ProductId",
    values="Rating",
)

# * EDA of the ratings matrix:
ratings_matrix.head()
df = ratings_matrix.copy()  # ? - Copy so the extra column does not end up in ratings_matrix
df["Mean Rating"] = df.mean(axis=1)  # ? - Get the mean score for each user
sns.histplot(
    x="Mean Rating", binwidth=0.5, data=df
)  # ? - Histogram of the mean rating

# * TO DO ITEM AND USER PROFILING
# * User profiling for user-based filtering
# ? - Dataset indexed by user, grouped by item
df_user_10k = pd.read_csv("path.csv").set_index("UserId").drop("Timestamp", axis=1)
items = df_user_10k.groupby(
    "ProductId"
)  # ? - One group per product, holding the users who rated it
items.get_group("B002OVV7F0")  # ? - Pass ProductId - Get the ratings

# ? - Dataset indexed by item, grouped by user
df_item_10k = pd.read_csv("path.csv").set_index("ProductId").drop("Timestamp", axis=1)
users = df_item_10k.groupby("UserId")  # ? - One group per user, holding the products they rated
users.get_group("A39HTATAQ9V7YF")  # ? - Pass UserId - Get the ratings for a user

# ? - Distribution of ratings for a specific product:
df_item_10k.loc["B0050QLE4U"].hist()
# ? - Distribution of ratings for a specific user:
df_user_10k.loc["A39HTATAQ9V7YF"].hist()

----

# PCA (Principal Component Analysis)  <a id="section2"></a>

#### Dataset preparation + normalisation <a id='pca-section1'></a>

* Copy the dataset
* Define categorical and numerical columns
* Train/Test/Split
* Apply Standard Scaler to normalise the dataset

In [None]:


#! If we are given a regular file:
df = pd.read_csv('path_to_csv.csv')
dfY = df['target']
dfX = df.drop('target', axis=1)  # drop returns a new DataFrame; with inplace=True it would return None

# ? - In case you want to join them again
df = dfX.copy()
df['target'] = dfY

#! If the data comes from a library loader such as sklearn's load_breast_cancer:
from sklearn.datasets import load_breast_cancer

dataset = load_breast_cancer()
X_data = dataset.data
Y_data = dataset.target

dfX = pd.DataFrame(X_data, columns=dataset.feature_names)
dfY = pd.DataFrame(Y_data, columns=["target"])

df = dfX.copy()
df['target'] = dfY["target"]

# * - For a regular dataset: copy the dataset to operate on it + define the columns
df_pca = df.copy()
df_pca.head()

# ? - In case you need to split categorical from numerical columns; I think the exam will give us numerical ones
CATEGORICAL_COLUMNS = [
    "TIME_OF_DAY",
    "AIRCRAFT",
    "AC_MASS",
    "PHASE_OF_FLIGHT",
    "SPECIES",
    "NUM_STRUCK",
]

NUMERICAL_COLUMNS = ["INCIDENT_YEAR", "HEIGHT", "SPEED", "DISTANCE", "COST_INFL_ADJ"]

# * - Step II: apply PCA only to the numerical variables. Convert variables if needed
# * - Then we start with the train-test split:
X = df_pca[NUMERICAL_COLUMNS] # ! - Do not forget to adapt all of this
X = X.drop(columns=["TARGET_VAR"], errors="ignore")  # * Drop the target variable if it is among the features
y = df_pca.TARGET_VAR  # * Continuous data for the target

# * - Train/Test Split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# * - Step III: apply the Standard Scaler to normalise
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)
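
What the scaler does can be sketched by hand: the mean and standard deviation are estimated on the training split only, and the same statistics are then reused for the test split. The synthetic data below is an assumption made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
X_train = rng.normal(loc=50.0, scale=10.0, size=(200, 3))  # synthetic training data
X_test = rng.normal(loc=50.0, scale=10.0, size=(50, 3))    # synthetic test data

mu = X_train.mean(axis=0)
sigma = X_train.std(axis=0)  # population std (ddof=0), the convention StandardScaler uses

Z_train = (X_train - mu) / sigma  # z = (x - mu) / sigma
Z_test = (X_test - mu) / sigma    # reuse the training statistics, never refit on test

print(Z_train.mean(axis=0).round(6), Z_train.std(axis=0).round(6))
```

The scaled test split will generally not have exactly zero mean and unit variance, which is expected because its statistics were never used.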

#### Generate latent dims, apply PCA, determine its error <a id='pca-section2'></a>

* Look at the original dimension D of the matrix
* Generate the vector of latent dimensions
* Apply PCA and determine its error
* Plot the train-set vs. test-set reconstruction error

In [None]:
# Reconstruction error vs number of latent dimensions used by PCA
# * This is dimension D
X_rank = np.linalg.matrix_rank(X_train)
max_components = min(X_train.shape[0], X_train.shape[1]) - 1
Ks = list(
    range(1, max_components + 1)
)  # Adjust the range dynamically based on the size of the dataset
# * - Another way to get it
#L_linspace = np.linspace(1, X_rank, 10, dtype=int) 


# ? - Lists that store the RMSE values for the train and test datasets.
# * For each K we want to test, fit a PCA and compute the corresponding reconstruction
# * for both the train and test datasets, then log the RMSE for that K in order to plot it.
RMSE_train = []
RMSE_test = []

for K in Ks:
    pca = PCA(n_components=K)

    Xtrain_transformed = pca.fit_transform(X_train)
    Xtrain_proj = pca.inverse_transform(Xtrain_transformed)
    RMSE_train.append(root_mean_squared_error(X_train, Xtrain_proj))

    Xtest_transformed = pca.transform(X_test)
    Xtest_proj = pca.inverse_transform(Xtest_transformed)
    RMSE_test.append(root_mean_squared_error(X_test, Xtest_proj))

# * Plot train set reconstruction error vs. test set reconstruction error:
# Set up the figure and axes for two side-by-side subplots
fig, axs = plt.subplots(1, 2, figsize=(12, 5))  # 1 row, 2 columns

# Plot for the training set
axs[0].plot(Ks, RMSE_train, marker="o", color="blue")
axs[0].set_title("Train Set Reconstruction Error")
axs[0].set_xlabel("Number of Principal Components")
axs[0].set_ylabel("RMSE")

# Plot for the test set
axs[1].plot(Ks, RMSE_test, marker="x", color="red")
axs[1].set_title("Test Set Reconstruction Error")
axs[1].set_xlabel("Number of Principal Components")
# axs[1].set_ylabel("RMSE")  # Optional, since both plots show the same quantity

plt.tight_layout()  # Automatically adjust the subplot parameters to add padding
plt.show()
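
To see what the reconstruction error measures, the encode/decode step can be sketched with a plain SVD instead of `sklearn.decomposition.PCA`. Everything below (the synthetic data, the helper `reconstruction_rmse`) is an illustrative assumption; it computes a global RMSE, which may differ slightly from the per-column average reported by `root_mean_squared_error`.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic data: 5 observed features driven by 2 latent factors plus small noise
latent = rng.normal(size=(300, 2))
mixing = rng.normal(size=(2, 5))
X = latent @ mixing + 0.05 * rng.normal(size=(300, 5))

X_train, X_test = X[:240], X[240:]
mean = X_train.mean(axis=0)

# Principal directions = right singular vectors of the centred training data
_, _, Vt = np.linalg.svd(X_train - mean, full_matrices=False)

def reconstruction_rmse(X, k):
    W = Vt[:k].T              # (n_features, k) projection matrix
    Z = (X - mean) @ W        # encode into k latent dimensions
    X_hat = Z @ W.T + mean    # decode back to the original space
    return float(np.sqrt(np.mean((X - X_hat) ** 2)))

rmse_train = [reconstruction_rmse(X_train, k) for k in range(1, 6)]
rmse_test = [reconstruction_rmse(X_test, k) for k in range(1, 6)]
print(rmse_train)
print(rmse_test)
```

The training error can only go down as components are added and reaches zero once all directions are kept; the point where the curve flattens is the candidate latent dimension.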

#### Explanation of the PCA model: <a id='pca-section3'></a>
* Once PCA has been run for every value and the reconstruction error inspected, we pick the optimal model
  * Get the eigenvalues, explained variance + cumulative explained variance
* Scree plot
* Log-likelihood (this one is not working too well for me)
* Cumulative variance to set the threshold (choose the dimension based on it) + plot

In [None]:
# ! - The last loop fitted PCA for every number of components, so `pca` now holds the largest model and we can find the point where our threshold is reached:
pca.components_.shape
eigenvalues = pca.explained_variance_
variance_explained = pca.explained_variance_ratio_
cumulative_variance_explained = pca.explained_variance_ratio_.cumsum()

In [None]:
# * - Screeplot
fig, ax = plt.subplots()
xs = np.arange(1, len(eigenvalues))
ys = eigenvalues[0:len(eigenvalues)-1]
plt.title("screeplot")
plt.xlabel("num PCs")
plt.ylabel("eigenvalues")
plt.ticklabel_format(axis="y", style="sci", scilimits=(0, 0))
ax.plot(xs, ys)
plt.show()

In [None]:
# * Profile Likelihood
def log_likelihood(evals):
    Lmax = len(evals)
    ll = np.arange(0.0, Lmax)

    for L in range(Lmax):

        group1 = evals[0 : L + 1]  # Divide Eigenvalues in two groups
        group2 = evals[L + 1 : Lmax]

        mu1 = np.mean(group1)
        mu2 = np.mean(group2)

        # eqn (20.30)
        sigma = (np.sum((group1 - mu1) ** 2) + np.sum((group2 - mu2) ** 2)) / Lmax

        ll_group1 = np.sum(multivariate_normal.logpdf(group1, mu1, sigma))
        ll_group2 = np.sum(multivariate_normal.logpdf(group2, mu2, sigma))

        ll[L] = ll_group1 + ll_group2
    return ll

In [None]:
ll = log_likelihood(eigenvalues)  # * Insert all of the corresponding eigenvalues.

xs = np.arange(1, len(eigenvalues))
ys = ll[0:len(eigenvalues)-1]

plt.xlabel("num PCs")
plt.ylabel("profile log likelihood")
plt.plot(xs, ys)
idx = np.argmax(ys)
plt.axvline(xs[idx], c="grey")
plt.show()

In [None]:
# * Change the threshold as you see fit
threshold = 0.9
idx = np.where(cumulative_variance_explained > threshold)[0][0]
n_latent = idx + 1  # idx is 0-based, so the number of components is idx + 1

n_pcs = np.arange(1, len(cumulative_variance_explained) + 1)
sns.lineplot(x=n_pcs, y=cumulative_variance_explained)
plt.axvline(n_latent, c="r")
plt.axhline(cumulative_variance_explained[idx], c="r")
plt.xlabel("# PCA Components")
plt.ylabel("% Variance Explained")
plt.title("Number of Latent dimensions: " + str(n_latent))
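
scikit-learn can also apply the variance threshold directly: passing a float in (0, 1) as `n_components` (with `svd_solver="full"`) keeps the smallest number of components whose cumulative explained variance exceeds that fraction. The sketch below compares both routes on synthetic data, which is an assumption made only for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Synthetic data with decreasing variance per feature (illustrative only)
X = rng.normal(size=(500, 6)) * np.array([5.0, 3.0, 2.0, 1.0, 0.5, 0.1])

threshold = 0.9

# Manual route: fit all components, then find the first count exceeding the threshold
pca_full = PCA().fit(X)
cum_var = pca_full.explained_variance_ratio_.cumsum()
n_manual = int(np.argmax(cum_var > threshold)) + 1  # +1 because the index is 0-based

# Built-in route: a float in (0, 1) asks scikit-learn to pick the count itself
pca_auto = PCA(n_components=threshold, svd_solver="full").fit(X)

print(n_manual, pca_auto.n_components_)
```

Both routes should agree; the manual one is still useful when you want to draw the cumulative-variance plot.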

In [None]:
# Perform PCA
n_components = 2  #! - Set this
pca = PCA(n_components=n_components)
X_transformed = pca.fit_transform(X_train)  # fit and project in a single call
evals = pca.explained_variance_  # eigenvalues in descending order
fraction_var = np.cumsum(evals[:n_components] / np.sum(evals))  # ! - The slice must match the number of components

# Access the matrix W of eigenvectors (principal components)
W = pca.components_
print("Matrix W (Eigenvectors/Principal Components):")
print(W)

# Access the eigenvalues (explained variance)
eigenvalues = pca.explained_variance_
print("Eigenvalues (Explained Variance):")
print(eigenvalues)

# Access the explained variance ratio
explained_variance_ratio = pca.explained_variance_ratio_
print("Explained Variance Ratio:")
print(explained_variance_ratio)

In [None]:
# * Shape of the fitted PCA
pca.components_.shape

# * - Array of eigenvalues
eigenvalues = pca.explained_variance_
eigenvalues

# * - Array of explained variance
variance_explained = pca.explained_variance_ratio_
variance_explained

# * - Cumulative sum
cum_variance_explained = pca.explained_variance_ratio_.cumsum()
cum_variance_explained

# * - Plot with the threshold to determine the optimal latent dim
# ! - The PCA must keep enough components for the cumulative variance to exceed the threshold
threshold = 0.9
idx = np.where(cum_variance_explained > threshold)[0][0]
n_latent = idx + 1  # idx is 0-based, so the number of components is idx + 1

n_pcs = np.arange(1, len(cum_variance_explained) + 1)
sns.lineplot(x=n_pcs, y=cum_variance_explained)
plt.axvline(n_latent, c="r")
plt.axhline(cum_variance_explained[idx], c="r")
plt.xlabel("# PCA Components")
plt.ylabel("% Variance Explained")
plt.title("Number of Latent dimensions: " + str(n_latent))

#### Find the optimal latent dim with plots <a id='pca-section4'></a>
1. Build the table to inspect the properties of the PCs.
2. Visualise the data using the two most relevant latent dims and look for similarity

In [None]:
# Principal components correlation coefficients
loadings = pca.components_

# Number of features before PCA
n_features = pca.n_features_in_

# Feature names before PCA
feature_names = NUMERICAL_COLUMNS  # dfX.columns if you have not split the columns

# PC names (one per fitted component)
pc_list = [f"PC{i}" for i in range(1, pca.n_components_ + 1)]

# Match PC names to loadings
pc_loadings = dict(zip(pc_list, loadings))

# Matrix of corr coefs between feature names and PCs
loadings_df = pd.DataFrame.from_dict(pc_loadings)
loadings_df["feature_names"] = feature_names
loadings_df = loadings_df.set_index("feature_names")
loadings_df

In [None]:
# Get the loadings of x and y axes
xs = loadings_df.iloc[:, 0]
ys = loadings_df.iloc[:, 1]

plt.figure(figsize=(15, 10))
# Plot the loadings on a scatterplot
for i, varname in enumerate(feature_names):
    plt.scatter(xs.iloc[i], ys.iloc[i], s=200)
    plt.arrow(
        0,
        0,  # coordinates of arrow base
        xs.iloc[i],  # length of the arrow along x
        ys.iloc[i],  # length of the arrow along y
        color="r",
        head_width=0.01,
    )
    plt.text(xs.iloc[i], ys.iloc[i], varname)

# Define the axes from the range of the loadings (always including the origin)
xticks = np.linspace(min(xs.min(), 0), max(xs.max(), 0), num=10)
yticks = np.linspace(min(ys.min(), 0), max(ys.max(), 0), num=10)
plt.xticks(xticks)
plt.yticks(yticks)
plt.xlabel("PC1")
plt.ylabel("PC2")

# Show plot
plt.title("2D Loading plot with vectors")
plt.show()

In [None]:
sorted_components = sorted(
    enumerate(variance_explained), key=lambda x: x[1], reverse=True
)
sorted_components

----
# Quick Clustering section to run in one go, once you have finished PCA <a id='important-exercise'></a>
* Define the model with train, test and the target
* Plot the accuracies
* Run PCA
* Confusion matrix
* Positive class classification
* AUC

In [None]:
# useful function that help us with the dataframes
def dataset_to_pandas(data, target):
    data = pd.DataFrame(data, columns=["X" + str(i + 1) for i in range(data.shape[1])])
    inputs = data.columns
    data["Y"] = target
    output = "Y"
    return data, inputs, output

In [None]:

# * Standardise the full feature matrix, then split the dataset
# ! - X and y must have the same number of rows, so start from X (not from X_train)
data_std = StandardScaler().fit_transform(X)
XTR, XTS, YTR, YTS = train_test_split(
    data_std, y, test_size=0.2, random_state=1, stratify=y
)

In [None]:
# Perform PCA on all datasets
subspace_dim = 6
pca = PCA(subspace_dim) # ! PCA with the optimal number found in the previous section
XTR_pca = pca.fit_transform(XTR) # * X Training: the projection is fitted here
XTS_pca = pca.transform(XTS) # * X Test: reuse the projection fitted on the training data
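
One way to make sure the scaler and the PCA are only ever fitted on training data is to chain them with the classifier in a single `Pipeline`. This is a minimal sketch on synthetic data from `make_classification`; the dataset, the number of components and the number of neighbours are assumptions, not values taken from the notebook.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic binary classification data (illustrative only)
X_demo, y_demo = make_classification(
    n_samples=400, n_features=10, n_informative=5, random_state=0
)
X_tr, X_ts, y_tr, y_ts = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=1, stratify=y_demo
)

pipe = Pipeline(
    steps=[
        ("scaler", StandardScaler()),
        ("pca", PCA(n_components=3)),
        ("knn", KNeighborsClassifier(n_neighbors=5)),
    ]
)
pipe.fit(X_tr, y_tr)                    # every step is fitted on training data only
test_accuracy = pipe.score(X_ts, y_ts)  # test data is only transformed, never fitted
X_ts_latent = pipe[:-1].transform(X_ts) # scaler + PCA applied to the test split

print(test_accuracy, X_ts_latent.shape)
```

The same pipeline object can be handed to `GridSearchCV`, in which case the scaler and PCA are refitted inside every fold.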

In [None]:
accrcies = []

# Select the values of k neighbours
k_start = 2
k_stop = 50
k_step = 1

k_values = np.arange(start=k_start, stop=k_stop, step=k_step).astype("int")

train_data, train_inputs, train_outputs = dataset_to_pandas(XTR_pca, YTR)

for k in k_values:
    knn_pipe = Pipeline(
        steps=[
            ("scaler", StandardScaler()),
            ("knn", KNeighborsClassifier(n_neighbors=k)),
        ]
    )
    knn_pipe.fit(train_data[train_inputs], train_data[train_outputs])

    accrcies.append(knn_pipe.score(train_data[train_inputs], train_data[train_outputs]))

In [None]:
accrcies = np.array(accrcies)
# Plot accuracies vs k
ax_acc = sns.scatterplot(x=k_values, y=accrcies)
sns.lineplot(x=k_values, y=accrcies, ax=ax_acc)
ax_acc.set(xlabel="k (num. of neighbors)", ylabel="Accuracy");

**Cross-Validation section**

In [None]:
k_values = np.arange(1, 50)
hyp_grid = {"knn__n_neighbors": k_values}
knn_pipe_hyper = Pipeline(
    steps=[("scaler", StandardScaler()), ("knn", KNeighborsClassifier())]
)

In [None]:
num_folds = 10

knn_gridCV = GridSearchCV(
    estimator=knn_pipe_hyper, param_grid=hyp_grid, cv=num_folds, return_train_score=True
)

knn_gridCV.fit(XTR_pca, YTR)

In [None]:
knn_gridCV.best_params_
knn_gridCV.score(XTR_pca, YTR), knn_gridCV.score(XTS_pca, YTS)

**Confusion matrix**

In [None]:
model = knn_gridCV # * Confusion matrix
fig = plt.figure(constrained_layout=True, figsize=(6, 2))
spec = fig.add_gridspec(1, 2)
ax1 = fig.add_subplot(spec[0, 0])
ax1.set_title("Training")
ax1.grid(False)
ax3 = fig.add_subplot(spec[0, 1])
ax3.set_title("Test")
ax3.grid(False)
ConfusionMatrixDisplay.from_estimator(
    model, XTR_pca, YTR, cmap="Greens", colorbar=False, ax=ax1, labels=[1, 0]
)
ConfusionMatrixDisplay.from_estimator(
    model, XTS_pca, YTS, cmap="Greens", colorbar=False, ax=ax3, labels=[1, 0]
)
plt.suptitle("Confusion Matrices")
plt.show();

**Fraction of positives**

In [None]:
fig, ax = plt.subplots(figsize=(6, 6))
CalibrationDisplay.from_estimator(
    knn_gridCV, XTS_pca, YTS, n_bins=10, name="knn_pipe", pos_label=1, ax=ax
)

**ROC curve (AUC)**

In [None]:
fig = plt.figure(figsize=(12, 4))
spec = fig.add_gridspec(1, 2)
ax1 = fig.add_subplot(spec[0, 0])
ax1.set_title("Training")
ax2 = fig.add_subplot(spec[0, 1])
ax2.set_title("Test")
RocCurveDisplay.from_estimator(knn_gridCV, XTR_pca, YTR, plot_chance_level=True, ax=ax1)
RocCurveDisplay.from_estimator(knn_gridCV, XTS_pca, YTS, plot_chance_level=True, ax=ax2)
plt.show();

----
# Clustering <a id="section3"></a>

#### Data preparation: <a id='clustering-section1'></a>
* Copy the dataset
* Pick the prediction variable

In [None]:
df_clustering = df.copy()
df_clustering.head()

#### Calculations <a id='clustering-section2'></a>
* Purity Score
* Rand Index
* Adjusted Rand Index 
* Mutual Information

In [None]:
def purity_score(y_true, y_pred):  # * Purity Score
    contingency_matrix = metrics.cluster.contingency_matrix(y_true, y_pred)
    return np.sum(np.amax(contingency_matrix, axis=0)) / np.sum(contingency_matrix)


purity_score(dfTR_eval["Y"], dfTR_eval["Y_knn_pred"])

# * Rand Index
rand_score(dfTR_eval["Y"], dfTR_eval["Y_knn_pred"])

# * Adjusted Rand Score
adjusted_rand_score(dfTR_eval["Y"], dfTR_eval["Y_knn_pred"])

# * Normalized Mutual Info Score
normalized_mutual_info_score(dfTR_eval["Y"], dfTR_eval["Y_knn_pred"])
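
To make the Rand index concrete: it is the fraction of sample pairs on which the two labelings agree, either by putting both samples together or by keeping them apart. The helper `rand_index` below is a hypothetical pure-Python sketch of that definition, not the scikit-learn implementation.

```python
from itertools import combinations

def rand_index(labels_true, labels_pred):
    """Fraction of sample pairs on which the two labelings agree."""
    pairs = list(combinations(range(len(labels_true)), 2))
    agreements = 0
    for i, j in pairs:
        same_true = labels_true[i] == labels_true[j]
        same_pred = labels_pred[i] == labels_pred[j]
        agreements += same_true == same_pred  # both together or both apart
    return agreements / len(pairs)

y_true = [0, 0, 1, 1]
ri_swapped = rand_index(y_true, [1, 1, 0, 0])  # same partition, label names swapped
ri_crossed = rand_index(y_true, [0, 1, 0, 1])  # every true cluster is split in two
print(ri_swapped, ri_crossed)
```

Swapping the label names leaves the score at its maximum, which is why these metrics suit clustering, where cluster ids are arbitrary.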

#### Hierarchical Agglomerative Clustering (HAC) <a id='clustering-section3'></a>
* Normal 
* Single-link
* Complete-link
* Average-link
* Ward-Linkage

In [None]:
linked = linkage(
    X, method="single"
)  # ! - You can change the method from single to 'complete', 'ward', 'average', etc...
""" #! - Uncomment this block if you want to find the merge with the largest increase in distance.
distances = linked[:, 2]
# Compute the distance increments between successive merges
increments = np.diff(distances)

# Identify the merge with the largest increase in distance
largest_increment_index = np.argmax(increments)
# The distance at which this merge happens
largest_increment_distance = distances[largest_increment_index + 1]
"""
# labelList = range(1, 6)
plt.figure(figsize=(10, 7))
dendrogram(
    linked,
    orientation="top",
    # labels=labelList,
    distance_sort="descending",
    show_leaf_counts=True,
)
#! - plt.axhline(y=largest_increment_distance, c="k", ls="--", lw=0.5) <- Uncomment this line as well if it suits you.
plt.title("Dendrogram with the Largest Merge Marked")
plt.show()
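
The largest jump between successive merge distances can also be used to cut the tree into flat clusters with SciPy's `fcluster`. The sketch below uses six hypothetical points on a line; cutting halfway across the largest gap is one reasonable choice among several, not a rule from the notebook.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

# Two well-separated groups on a line (illustrative data)
points = np.array([[0.0], [0.1], [0.2], [10.0], [10.1], [10.2]])
Z = linkage(points, method="single")

distances = Z[:, 2]  # merge distances, in increasing order
gap_index = int(np.argmax(np.diff(distances)))

# Cut halfway between the last "cheap" merge and the first "expensive" one
cut_height = (distances[gap_index] + distances[gap_index + 1]) / 2
labels = fcluster(Z, t=cut_height, criterion="distance")

print(cut_height, labels)
```

Every merge below `cut_height` is kept, so the points end up in as many clusters as there are branches crossing that height in the dendrogram.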

#### K-Means Clustering - Make sure beforehand that your values are numerical, not categorical <a id='clustering-section4'></a>
* Elbow method to find the optimal number
* Silhouette Score
* Silhouette Score w/ clusters diagrams

In [None]:
# Range of k values to try
k_values = range(2, 12)  # ! - Change the range

# One fitted model per k, reused for both the elbow and the silhouette plots
kmeans_per_k = [
    KMeans(n_clusters=k, n_init=10, random_state=42).fit(df) for k in k_values
]

inertias = [model.inertia_ for model in kmeans_per_k]

silhouette_scores = [silhouette_score(df, model.labels_) for model in kmeans_per_k]

In [None]:
plt.figure(figsize=(8, 5))  # * Plot the elbow method
plt.plot(k_values, inertias, "-o")
plt.title("Elbow Method")
plt.xlabel("Number of Clusters, k")
plt.ylabel("Inertia")
plt.xticks(k_values)
plt.show()

In [None]:
plt.figure()  # * Silhouette Score
plt.plot(k_values, silhouette_scores, "bo-")
plt.xlabel("$k$", fontsize=14)
plt.ylabel("Silhouette score", fontsize=14)
plt.tight_layout()
plt.show()

In [None]:
warnings.filterwarnings("ignore")


# * Silhouette Score visualization per cluster
def plot_silhouette(model, X):
    mu = model.cluster_centers_
    K, D = mu.shape
    y_pred = model.labels_
    silhouette_coefficients = silhouette_samples(X, y_pred)
    score = silhouette_score(X, y_pred)
    padding = len(X) // 30
    pos = padding
    for i in range(K):
        coeffs = silhouette_coefficients[y_pred == i]
        coeffs.sort()
        color = mpl.cm.Spectral(i / K)
        plt.fill_betweenx(
            np.arange(pos, pos + len(coeffs)),
            0,
            coeffs,
            facecolor=color,
            edgecolor=color,
            alpha=0.7,
        )
        pos += len(coeffs) + padding
    plt.axvline(x=score, color="red", linestyle="--")  # mean silhouette score
    plt.title("$k={}, score={:0.2f}$".format(K, score), fontsize=16)


for model in kmeans_per_k:
    K, D = model.cluster_centers_.shape
    plt.figure()
    plot_silhouette(model, df)
    fname = f"kmeans_silhouette_diagram{K}.pdf"
    # plt.savefig(fname)  # uncomment to save each diagram
    plt.tight_layout()
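
For reference, the silhouette coefficient of a sample is s = (b - a) / max(a, b), where a is its mean distance to the other members of its own cluster and b is its mean distance to the nearest other cluster. The function `silhouette_by_hand` below is a hypothetical NumPy sketch of that definition on toy data, not a replacement for `silhouette_samples`.

```python
import numpy as np

def silhouette_by_hand(X, labels):
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    # Pairwise Euclidean distances between all samples
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    scores = np.zeros(len(X))
    for i in range(len(X)):
        own = labels == labels[i]
        if own.sum() == 1:
            continue  # convention: s(i) = 0 for singleton clusters
        a = dist[i, own].sum() / (own.sum() - 1)  # excludes the zero self-distance
        b = min(
            dist[i, labels == c].mean() for c in np.unique(labels) if c != labels[i]
        )
        scores[i] = (b - a) / max(a, b)
    return scores

# Toy data: two tight, well-separated clusters on a line
X_toy = [[0.0], [1.0], [10.0], [11.0]]
labels_toy = [0, 0, 1, 1]
s = silhouette_by_hand(X_toy, labels_toy)
print(s, s.mean())
```

Values close to 1 mean a sample sits well inside its cluster; values near 0 or below suggest it lies between clusters or is misassigned.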

#### Mini-Batches <a id='clustering-section5'></a>

In [None]:
K = 50
times = np.empty((K, 2))
inertias = np.empty((K, 2))

for k in range(1, K + 1):
    kmeans = KMeans(n_clusters=k, random_state=42)
    minibatch_kmeans = MiniBatchKMeans(
        n_clusters=k, random_state=42
    )  #! This is the minibatch declaration

    start = time.perf_counter()  # `time` is imported as a module, so call time.perf_counter()
    kmeans.fit(df)
    times[k - 1, 0] = time.perf_counter() - start
    inertias[k - 1, 0] = kmeans.inertia_

    start = time.perf_counter()
    minibatch_kmeans.fit(
        df
    )  # ! - Here is where we fit the mini-batch model for faster training
    times[k - 1, 1] = time.perf_counter() - start
    inertias[k - 1, 1] = minibatch_kmeans.inertia_

# Plot the inertia and the training times
plt.figure(figsize=(10, 5))

# Subplot for the inertia/distortion
plt.subplot(121)
plt.plot(range(1, K + 1), inertias[:, 0], "r--", label="K-Means")
plt.plot(range(1, K + 1), inertias[:, 1], "b.-", label="Mini-batch K-Means")
plt.xlabel("$k$", fontsize=16)
plt.title("Distortion", fontsize=14)
plt.legend(fontsize=14)

# Subplot for the training time
plt.subplot(122)
plt.plot(range(1, K + 1), times[:, 0], "r--", label="K-Means")
plt.plot(range(1, K + 1), times[:, 1], "b.-", label="Mini-batch K-Means")
plt.xlabel("$k$", fontsize=16)
plt.title("Training time (seconds)", fontsize=14)
plt.legend(fontsize=14)

plt.tight_layout()
plt.show()

#### Mixture-Models <a id='clustering-section6'></a>
* Define the Gaussian Mixture
* Extract its weights
* Visualise
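
To make the role of the weights concrete, here is a minimal NumPy sketch of a 1D Gaussian mixture: the density is the weighted sum of the component densities, it should integrate to roughly 1, and the responsibilities give the soft assignment of a point to each component. All parameter values are hypothetical; with a fitted model they would come from `gm.weights_`, `gm.means_` and `gm.covariances_`.

```python
import numpy as np

# Hypothetical 1D mixture: weights must be non-negative and sum to 1
weights = np.array([0.3, 0.7])
means = np.array([-2.0, 3.0])
stds = np.array([1.0, 0.5])


def component_pdfs(x):
    """Density of every component at every x, shape [n, n_components]."""
    x = np.asarray(x, dtype=float)[:, None]
    return np.exp(-0.5 * ((x - means) / stds) ** 2) / (stds * np.sqrt(2 * np.pi))


def mixture_pdf(x):
    """Mixture density: sum_k w_k * N(x | mu_k, sigma_k^2)."""
    return component_pdfs(x) @ weights


def responsibilities(x):
    """Posterior probability of each component for every x (soft assignment)."""
    joint = component_pdfs(x) * weights
    return joint / joint.sum(axis=1, keepdims=True)


step = 0.01
grid = np.arange(-10, 10, step)
integral = (mixture_pdf(grid) * step).sum()  # Riemann sum, should be close to 1
resp = responsibilities(np.array([-2.0, 3.0]))
print(integral)
print(resp)
```

The Riemann sum is the 1D analogue of the grid check done with `score_samples` further down.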

In [None]:
K = 5  # ! - To modify
gm = GaussianMixture(n_components=K, covariance_type="full", n_init=10, random_state=42)
gm.fit(X)

In [None]:
w = gm.weights_
mu = gm.means_
Sigma = gm.covariances_

In [None]:
resolution = 100
grid = np.arange(-10, 10, 1 / resolution)
xx, yy = np.meshgrid(grid, grid)
X_full = np.vstack([xx.ravel(), yy.ravel()]).T

In [None]:
# score_samples is the log pdf
pdf = np.exp(gm.score_samples(X_full))
pdf_probas = pdf * (1 / resolution) ** 2
print("integral of pdf {}".format(pdf_probas.sum()))

In [None]:
# Choosing K. Compare to kmeans_silhouette
Ks = range(2, 9)
gms_per_k = [
    GaussianMixture(n_components=k, n_init=10, random_state=42).fit(X) for k in Ks
]

bics = [model.bic(X) for model in gms_per_k]
aics = [model.aic(X) for model in gms_per_k]
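
For reference, both criteria trade the log-likelihood off against the number of free parameters: BIC = p·ln(n) − 2·ln(L) and AIC = 2p − 2·ln(L), and lower is better for both. The sketch below computes them by hand for a single 1D Gaussian fitted by maximum likelihood; the sample is hypothetical and the single Gaussian stands in for a one-component mixture.

```python
import numpy as np

rng = np.random.default_rng(42)
x = rng.normal(loc=1.0, scale=2.0, size=200)  # hypothetical 1D sample

# Maximum-likelihood fit of a single Gaussian (a 1-component "mixture")
mu, sigma = x.mean(), x.std()
log_likelihood = np.sum(
    -0.5 * np.log(2 * np.pi * sigma**2) - (x - mu) ** 2 / (2 * sigma**2)
)

n_params = 2  # mean and variance
n = len(x)
bic = n_params * np.log(n) - 2 * log_likelihood
aic = 2 * n_params - 2 * log_likelihood
print(f"log-likelihood={log_likelihood:.2f}, BIC={bic:.2f}, AIC={aic:.2f}")
```

Because ln(n) exceeds 2 once n is larger than about 7, BIC penalises extra components harder than AIC and tends to pick a smaller K.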

In [None]:
gm_full = GaussianMixture(
    n_components=K, n_init=10, covariance_type="full", random_state=42
)
gm_tied = GaussianMixture(
    n_components=K, n_init=10, covariance_type="tied", random_state=42
)
gm_spherical = GaussianMixture(
    n_components=K, n_init=10, covariance_type="spherical", random_state=42
)
gm_diag = GaussianMixture(
    n_components=K, n_init=10, covariance_type="diag", random_state=42
)
gm_full.fit(X)
gm_tied.fit(X)
gm_spherical.fit(X)
gm_diag.fit(X)

make_plot(
    gm_full, X, "full"
)  # * You can then switch the 'full' and the model for any of the ones defined above

#### Clustering combined with PCA - SEE THE OTHER ONE INSTEAD <a id='important-exercise-jic'></a>
* **You will most likely have to do the PCA section from the previous block**
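
The component-selection step used in this exercise can be sketched with plain NumPy: standardise, take the SVD, turn the squared singular values into explained-variance ratios and keep the first count whose cumulative ratio reaches the threshold. The synthetic data and variable names below are assumptions for illustration; the cell that follows does the same with scikit-learn's `PCA`.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical data: 200 samples, 5 features, with most variance in 2 directions
latent = rng.normal(size=(200, 2)) * np.array([5.0, 3.0])
X_demo = latent @ rng.normal(size=(2, 5)) + 0.1 * rng.normal(size=(200, 5))

# Standardise, then get the explained variance from the singular values
X_std = (X_demo - X_demo.mean(axis=0)) / X_demo.std(axis=0)
_, singular_values, _ = np.linalg.svd(X_std, full_matrices=False)
explained_variance_ratio = singular_values**2 / np.sum(singular_values**2)
cum_variance = np.cumsum(explained_variance_ratio)

threshold = 0.9
# First index where the cumulative ratio reaches the threshold, converted to a count
num_components = int(np.argmax(cum_variance >= threshold)) + 1
print(cum_variance.round(3), num_components)
```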

In [None]:
# Copy the dataset to preserve the original data
df_pca = df.copy()

# Define numerical columns
NUMERICAL_COLUMNS = ["INCIDENT_YEAR", "HEIGHT", "SPEED", "DISTANCE", "COST_INFL_ADJ"]

# * Select only numerical data
X = df_pca[NUMERICAL_COLUMNS]
y = df_pca["TARGET_VAR"]  # Assuming you're interested in some target variable

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# * - Standardise the data, since PCA is sensitive to feature scales and variance
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)

# ! - To find the optimal number of components, go back up to section 2 of the PCA block
pca = PCA()
X_train_transformed = pca.fit_transform(X_train)

# Calculate the cumulative variance explained by each component
cum_variance_explained = np.cumsum(pca.explained_variance_ratio_)

# Determine the number of components that explain at least 90% of the variance
threshold = 0.9
num_components = np.where(cum_variance_explained >= threshold)[0][0] + 1

print(
    f"Number of components to retain {num_components} which explain at least {threshold*100}% of variance"
)

# Recent seaborn versions require x and y to be passed as keyword arguments
sns.lineplot(x=list(range(1, len(cum_variance_explained) + 1)), y=cum_variance_explained)
plt.axvline(num_components, color="r", linestyle="--")
plt.xlabel("# PCA Components")
plt.ylabel("% Variance Explained")
plt.title("Explained Variance vs. Number of Components")
plt.show()

# Perform PCA with the selected number of components
pca_optimal = PCA(n_components=num_components)
X_train_pca = pca_optimal.fit_transform(X_train)
X_test_pca = pca_optimal.transform(X_test)

# Test different numbers of clusters
k_values = range(1, 11)
inertias = []

for k in k_values:
    kmeans = KMeans(n_clusters=k, random_state=42)
    kmeans.fit(X_train_pca)
    inertias.append(kmeans.inertia_)

# Plot the elbow graph
plt.figure(figsize=(8, 5))
plt.plot(k_values, inertias, "-o")
plt.title("Elbow Method For Optimal k")
plt.xlabel("Number of Clusters, k")
plt.ylabel("Inertia")
plt.xticks(k_values)
plt.show()

optimal_k = 3  # Example based on the elbow plot observation
final_kmeans = KMeans(n_clusters=optimal_k, random_state=42)
final_kmeans.fit(X_train_pca)

# Assign clusters
clusters = final_kmeans.labels_

# Cluster centers
cluster_centers = final_kmeans.cluster_centers_

# Plotting cluster centers and data points in 2D (only possible if num_components >= 2)
plt.figure(figsize=(8, 6))
plt.scatter(
    X_train_pca[:, 0],
    X_train_pca[:, 1],
    c=clusters,
    cmap="viridis",
    marker="o",
    alpha=0.6,
)
plt.scatter(
    cluster_centers[:, 0], cluster_centers[:, 1], c="red", marker="x", s=100
)  # Marking the cluster centers
plt.title("Data points and cluster centers")
plt.xlabel("Principal Component 1")
plt.ylabel("Principal Component 2")
plt.show()

----

# Rec. Systems - **Memory-Based** Collaborative Filtering (Cornac) <a id='section4'></a>

#### EDA for Rec. Systems - Just in case you forgot to do it before <a id='rs-section0'></a>

In [None]:
# * Build the ratings matrix:
ratings_matrix = df_sample.pivot_table(
    index="UserId",
    columns="ProductId",
    values="Rating",
)

# * EDA of the ratings matrix:
ratings_matrix.head()
df_rm = ratings_matrix.copy()  #! - Copy, so the extra column does not leak into ratings_matrix
df_rm["Mean Rating"] = df_rm.mean(axis=1)  # ? - Get the mean score for each user
sns.histplot(
    x="Mean Rating", binwidth=0.5, data=df_rm
)  # ? - Histogram of the mean rating per user

# * NEEDED FOR ITEM AND USER PROFILING
# * User-profiling for user-based filtering
# ? - Dataset indexed by user, used to group by item
df_user_10k = pd.read_csv("path.csv").set_index("UserId").drop("Timestamp", axis=1)
items = df_user_10k.groupby(
    "ProductId"
)  # ? - One group per product: the users who rated it and their ratings
items.get_group("B002OVV7F0")  # ? - Pass a ProductId - Get its ratings

# ? - Dataset indexed by item, used to group by user
df_item_10k = pd.read_csv("path.csv").set_index("ProductId").drop("Timestamp", axis=1)
users = df_item_10k.groupby("UserId")  # ? - One group per user: the products they rated
users.get_group("A39HTATAQ9V7YF")  # ? - Pass a UserId - Get the ratings for that user

# ? - Rating distribution for a specific product:
df_item_10k.loc["B0050QLE4U"].hist()
# ? - Rating distribution for a specific user:
df_user_10k.loc["A39HTATAQ9V7YF"].hist()

#### User-based filtering - Cornac <a id='rs-section1'></a>
* Just in case you forgot to do the recommendation system EDA
* Individual user and item exploration
* User-Based Cornac Function (Pearson, Cosine, Mean-Centered)
* User-Profiling
* Score-prediction for K items for a specific user 
* Obtaining the predicted ratings matrix
* Predicting a score for a specific user and item
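
As a quick reminder of how the similarity options differ, the sketch below compares cosine and Pearson similarity between two users on their co-rated items. Pearson is just the cosine of the vectors after subtracting each one's mean, which is why it ignores the fact that one user rates systematically higher. The ratings are hypothetical, and Cornac's own implementation may differ in details such as how the means are computed.

```python
import numpy as np

# Hypothetical ratings of two users on five items (np.nan = not rated)
user_a = np.array([5.0, 3.0, np.nan, 4.0, 1.0])
user_b = np.array([4.0, 2.0, 5.0, 3.0, np.nan])

# Similarities are computed on co-rated items only
co_rated = ~np.isnan(user_a) & ~np.isnan(user_b)
a, b = user_a[co_rated], user_b[co_rated]


def cosine(u, v):
    """Cosine of the angle between two rating vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))


cosine_sim = cosine(a, b)
# Pearson = cosine after centring each vector on its own mean
pearson_sim = cosine(a - a.mean(), b - b.mean())
print(f"cosine={cosine_sim:.3f}, pearson={pearson_sim:.3f}")
```

Here user B always rates one point below user A, so Pearson sees them as perfectly correlated while cosine is slightly below 1.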

In [None]:
# * Normalise the ratings matrix by subtracting each user's mean rating from that user's ratings:
normalized_ratings_matrix = ratings_matrix.subtract(ratings_matrix.mean(axis=1), axis=0)

# * Build user-based models with Pearson, cosine and mean-centred variants
def userknn_cornac(df: pd.DataFrame):
    # * Watch the column names: sometimes it is userId, other times UserId
    df = df.astype({"UserId": object, "ProductId": object})
    records = df.to_records(index=False)
    result = list(records)

    K = 18  #! - Number of nearest neighbours; lower it if you get an error
    VERBOSE = False
    SEED = 42
    uknn_cosine = UserKNN(
        k=K, similarity="cosine", name="UserKNN-Cosine", verbose=VERBOSE
    )
    uknn_cosine_mc = UserKNN(
        k=K,
        similarity="cosine",
        mean_centered=True,
        name="UserKNN-Cosine-MC",
        verbose=VERBOSE,
    )
    uknn_pearson = UserKNN(
        k=K, similarity="pearson", name="UserKNN-Pearson", verbose=VERBOSE
    )
    uknn_pearson_mc = UserKNN(
        k=K,
        similarity="pearson",
        mean_centered=True,
        name="UserKNN-Pearson-MC",
        verbose=VERBOSE,
    )

    # Metrics (note: K is reused here as the top-K cutoff for Recall and Precision)
    rec_k = cornac.metrics.Recall(k=K)
    prec_k = cornac.metrics.Precision(k=K)
    rmse = cornac.metrics.RMSE()
    mae = cornac.metrics.MAE()

    ratio_split = RatioSplit(result, test_size=0.1, seed=SEED, verbose=VERBOSE)
    cornac.Experiment(
        eval_method=ratio_split,
        models=[uknn_cosine, uknn_cosine_mc, uknn_pearson, uknn_pearson_mc],
        metrics=[rec_k, prec_k, rmse, mae],
    ).run()

    userknn_models = {
        "uknn_cosine": uknn_cosine,
        "uknn_cosine_mc": uknn_cosine_mc,
        "uknn_pearson": uknn_pearson,
        "uknn_pearson_mc": uknn_pearson_mc,
    }

    return userknn_models

In [None]:
# UserKNN methods #! - Use this cell if you want to apply re-weighting or amplification - Otherwise skip to the next cell
K = 50  # number of nearest neighbors
VERBOSE = False  # defined here because the ones inside userknn_cornac are local to that function
SEED = 42
uknn_base = UserKNN(
  k=K, similarity="pearson", name="UserKNN-Base", verbose=VERBOSE
)
uknn_amp1 = UserKNN(
  k=K, similarity="pearson", amplify=0.5, name="UserKNN-Amp0.5", verbose=VERBOSE
)
uknn_amp2 = UserKNN(
  k=K, similarity="pearson", amplify=3.0, name="UserKNN-Amp3.0", verbose=VERBOSE
)
uknn_idf = UserKNN(
  k=K, similarity="pearson", weighting="idf", name="UserKNN-IDF", verbose=VERBOSE
)
uknn_bm25 = UserKNN(
  k=K, similarity="pearson", weighting="bm25", name="UserKNN-BM25", verbose=VERBOSE
)

feedback = movielens.load_feedback(variant="100K")
ratio_split = RatioSplit(feedback, test_size=0.1, seed=SEED, verbose=VERBOSE)
cornac.Experiment(
  eval_method=ratio_split,
  models=[uknn_base, uknn_amp1, uknn_amp2, uknn_idf, uknn_bm25],
  metrics=[cornac.metrics.RMSE()],
).run()

In [None]:
userknn_models = userknn_cornac(df_sample)  # ? - Returns the data with the Metrics
model = userknn_models.get("uknn_cosine_mc")
# ?^^You will have to justify which model you choose later. Picking the mean-centred (mc) one is fine

In [None]:
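def user_profiling(UID, model, user_df, TOPK=5):
    # UID is Cornac's internal user index; UIDX is the raw user id from the dataset
    rating_mat = model.train_set.matrix

    UIDX = list(model.train_set.uid_map.items())[UID][0]

    print(f"UserID = {UIDX}")
    print("-" * 35)
    print(user_df.loc[UIDX])

    ratings = pd.DataFrame(rating_mat.toarray())
    user_ratings = ratings.loc[UID]
    top_rated_items = np.argsort(user_ratings.to_numpy())[-TOPK:]  # internal item indices
    item_ids = list(model.train_set.iid_map.keys())
    print(f"\nTOP {TOPK} RATED ITEMS BY USER {UID}:")
    print("-" * 35)
    # Map internal item indices back to raw ProductIds instead of using them as row positions
    print([item_ids[i] for i in top_rated_items])
    return top_rated_items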

In [None]:
df_user = pd.read_csv("dataset.csv").set_index("userId")
df_user.head()

In [None]:
model = userknn_models.get(
    "uknn_cosine_mc"
)  #!!! - VERY IMPORTANT: CHOOSE THE MODEL BASED ON THE RESULTS
top_rated_items = user_profiling(3, model, df_user, TOPK=10)

In [None]:
# * Score prediction for any product:
def uknn_get_scores(UID, model, user_df, TOPK=5):
    # `rank` returns the item indices sorted by score and the scores indexed by item index
    UIDX = list(model.train_set.uid_map.items())[UID][0]
    recommendations, scores = model.rank(UID)
    item_ids = list(model.train_set.iid_map.keys())
    print(f"\nTOP {TOPK} RECOMMENDATIONS FOR USER {UIDX}:")
    print("Scores:", scores[recommendations[:TOPK]])
    print("Items:", [item_ids[i] for i in recommendations[:TOPK]])


model = userknn_models.get('uknn_cosine') # * Choose again which model to use
# NOTE: train_set.matrix holds the observed training ratings (0 = unrated), not model predictions;
# use model.score(user_idx, item_idx), as done below, to get a predicted rating.
predicted_ratings = pd.DataFrame(
    model.train_set.matrix.toarray(), 
    columns=list(model.train_set.iid_map.keys()), 
    index=list(model.train_set.uid_map.keys())
)

predicted_ratings.head()

user_model = userknn_models.get("uknn_pearson_mc") # * Choose the similarity coefficient
item_mat_id = user_model.train_set.iid_map.get("B002I098JE") # * Choose the item
user_mat_id = user_model.train_set.uid_map.get("A7B5JEED0RKXG") # * Choose the user
# * Get the rating
user_model.score(user_mat_id, item_mat_id)

#### Item-Based filtering (Cornac) <a id='rs-section2'></a>

* Function based on k nearest neighbours with Pearson, cosine and mean-centred variants
* Item-profiling
* Obtaining the predicted ratings matrix
* How to obtain a specific score from the predicted ratings matrix for a given user and item

Steps to predict a user's rating for one movie:
1. Create a list of the movies that user 1 has watched and rated.
2. Rank the similarities between the movies that user 1 has rated and the movie to predict.
3. Select the top n movies with the highest similarity scores.
4. Calculate the predicted rating as the similarity-weighted average of user 1's ratings for those movies.
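
The four steps above can be sketched for a single (user, movie) pair as follows. The ratings and similarity values are hypothetical, and the plain weighted average shown here is only one common form of the prediction rule; ItemKNN in Cornac may add mean-centring or other adjustments.

```python
import numpy as np

# Hypothetical inputs for one (user, target movie) pair
user_ratings = np.array([5.0, 3.0, 4.0, 2.0])  # step 1: movies the user has rated
similarities = np.array([0.9, 0.1, 0.7, 0.4])  # step 2: similarity of each one to the target
top_n = 2

# Step 3: keep the n rated movies most similar to the target
neighbours = np.argsort(similarities)[::-1][:top_n]

# Step 4: similarity-weighted average of the user's ratings on those neighbours
weights = similarities[neighbours]
predicted_rating = float(weights @ user_ratings[neighbours] / weights.sum())
print(neighbours, round(predicted_rating, 4))
```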

In [None]:
def itemknn_cornac(df):
    # * Watch the column names here
    df = df.astype({"UserId": object, "ProductId": object})
    records = df.to_records(index=False)
    result = list(records)
    K = 18  #!- number of nearest neighbors
    VERBOSE = False
    SEED = 42
    iknn_cosine = ItemKNN(
        k=K, similarity="cosine", name="ItemKNN-Cosine", verbose=VERBOSE
    )
    iknn_cosine_mc = ItemKNN(
        k=K,
        similarity="cosine",
        mean_centered=True,
        name="ItemKNN-Cosine-MC",
        verbose=VERBOSE,
    )
    iknn_pearson = ItemKNN(
        k=K, similarity="pearson", name="ItemKNN-Pearson", verbose=VERBOSE
    )
    iknn_pearson_mc = ItemKNN(
        k=K,
        similarity="pearson",
        mean_centered=True,
        name="ItemKNN-Pearson-MC",
        verbose=VERBOSE,
    )

    # Metrics
    rmse = cornac.metrics.RMSE()
    mae = cornac.metrics.MAE()
    prec = cornac.metrics.Precision(k=K)
    ratio_split = RatioSplit(result, test_size=0.2, seed=SEED, verbose=VERBOSE)
    cornac.Experiment(
        eval_method=ratio_split,
        models=[iknn_cosine, iknn_pearson, iknn_pearson_mc, iknn_cosine_mc],
        metrics=[rmse, mae, prec],
    ).run()
    itemknn_models = {
        "iknn_cosine": iknn_cosine,
        "iknn_pearson": iknn_pearson,
        "iknn_pearson_mc": iknn_pearson_mc,
        "iknn_cosine_mc": iknn_cosine_mc,
    }
    return itemknn_models

In [None]:
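def item_profiling(UID, model, item_df, TOPK=5):
    # UID is Cornac's internal item index; UIDX is the raw ProductId
    rating_mat = model.train_set.matrix

    UIDX = list(model.train_set.iid_map.items())[UID][0]

    print(f"ProductID = {UIDX}")
    print("-" * 35)
    print(item_df.loc[UIDX])

    ratings = pd.DataFrame(rating_mat.toarray())
    item_ratings = ratings.iloc[:, UID]  # column = ratings this item received (rows are users)
    top_users = np.argsort(item_ratings.to_numpy())[-TOPK:]  # internal user indices
    user_ids = list(model.train_set.uid_map.keys())
    print(f"\nTOP {TOPK} USERS WHO RATED ITEM {UID} HIGHEST:")
    print("-" * 35)
    # Map internal user indices back to raw UserIds instead of using them as row positions
    print([user_ids[u] for u in top_users])
    return top_users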

In [None]:
def itemknn_get_scores(
    UID, model, item_df, TOPK=5
):  # ? In case you want a function more specific to the ratings (optional)
    # Cornac's `rank` takes a user index, also for ItemKNN, and returns the ranked items
    UIDX = list(model.train_set.uid_map.items())[UID][0]
    recommendations, scores = model.rank(UID)
    item_ids = list(model.train_set.iid_map.keys())
    print(f"\nTOP {TOPK} ITEMS FOR USER {UIDX}:")
    print("Scores:", scores[recommendations[:TOPK]])
    print("Items:", [item_ids[i] for i in recommendations[:TOPK]])

In [None]:
# * Create the models
itemknn_models = itemknn_cornac(df)
# * Pick the models
model = itemknn_models.get(
    "iknn_cosine_mc"
)  #!!! - VERY IMPORTANT: CHOOSE THE MODEL

# * Profile a specific item (internal item index 2)
top_rated_items = item_profiling(2, model, df_item_10k)

# * Index, selected model, and the dataset to pass in
itemknn_get_scores(3, model, df_item_10k)

# * Ratings matrix seen by the model (observed training ratings, 0 = unrated; not predictions)
predicted_ratings = pd.DataFrame(
    model.train_set.matrix.toarray(), 
    columns=list(model.train_set.iid_map.keys()), 
    index=list(model.train_set.uid_map.keys())
)

predicted_ratings.head()

In [None]:
# * Get the specific rating for a product and user:
item_model = itemknn_models.get("iknn_cosine_mc")
item_mat_id = item_model.train_set.iid_map.get("B002I098JE")
user_mat_id = item_model.train_set.uid_map.get("A7B5JEED0RKXG")
item_model.score(user_mat_id, item_mat_id)

----
# Rec. Systems - **Model-Based** Collaborative Filtering (Cornac) <a id='section5'></a>

#### EDA <a id='rs-section00'></a>

In [None]:
# * Make sure you already have the pivot-table matrix (ratings_matrix)
# * In case you do not have it:
ratings_matrix = df_sample.pivot_table(
    index="UserId",
    columns="ProductId",
    values="Rating",
)

# * Fill the missing entries of the ratings matrix with a constant "mean" value
mean_rating = 2.5  # ! - Hard-coded placeholder; adjust it to the dataset's actual mean rating
r_df = ratings_matrix.fillna(mean_rating)

# * Convert the df to numpy
r = r_df.to_numpy()

# * - Centre the ratings by subtracting each user's mean rating (row mean, axis=1).
user_ratings_mean = np.mean(r, axis=1)
r_centered = r - user_ratings_mean.reshape(-1, 1)

# * - Check that the shape was preserved after the subtraction
# * - Multiply the two numbers in r.shape and check that the product matches count_nonzero
print(r.shape)
print(np.count_nonzero(r))

#### Singular Value Decomposition <a id='rs-section3'></a>
* Build the SVD model with latent dimension K
* Plot K against RMSE to choose the optimal K
* Baseline model (biases only)
* Product recommendation using SVD
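
To see why the latent dimension K matters, here is a small NumPy sketch that decomposes a dense, user-centred ratings matrix and measures the reconstruction RMSE for every rank k. The matrix is hypothetical, and this is an exact SVD of a filled-in matrix evaluated on the same data; Cornac's `SVD` is instead trained by SGD on the observed ratings and evaluated on a held-out split, so the curves are not directly comparable.

```python
import numpy as np

rng = np.random.default_rng(42)
# Hypothetical dense ratings matrix (6 users x 5 items), already filled in
R = rng.integers(1, 6, size=(6, 5)).astype(float)
user_means = R.mean(axis=1, keepdims=True)
R_centered = R - user_means

U, s, Vt = np.linalg.svd(R_centered, full_matrices=False)

rmse_per_k = {}
for k in range(1, len(s) + 1):
    # Rank-k reconstruction, then add the user means back
    R_hat = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :] + user_means
    rmse_per_k[k] = float(np.sqrt(np.mean((R - R_hat) ** 2)))

print(rmse_per_k)
```

On training data the error can only go down as k grows; on held-out data it usually flattens or rises again, which is the point where the plot below suggests stopping.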

In [None]:
def svd_cornac(  # * - Function to build the SVD models
    df, k_min=10, k_max=2000, step=100
):  # ! - Adjust depending on the dataset
    """
    Run SVD experiments over a range of 'k' values on a DataFrame using Cornac.

    Returns:
    - (list, cornac.Experiment): List of trained SVD models and the experiment object with the results.

    """
    # * Watch the column names here - userId or UserId
    df = df.astype({"UserId": object, "ProductId": object})
    records = df.to_records(index=False)
    result = list(records)

    VERBOSE = False
    SEED = 42

    svd_models = []
    k_values = np.arange(k_min, k_max, step)
    for k in k_values:
        svd_models.append(
            SVD(
                name="SVD" + str(k),
                k=int(k),  # plain Python int rather than numpy.int64
                max_iter=30,
                learning_rate=0.01,
                lambda_reg=0.02,
                verbose=True,
                seed=SEED,  # fixed seed so the runs are reproducible
            )
        )

    # Metrics
    rmse = cornac.metrics.RMSE()
    mae = cornac.metrics.MAE()

    ratio_split = RatioSplit(result, test_size=0.1, seed=SEED, verbose=VERBOSE)
    svd_experiment = cornac.Experiment(
        eval_method=ratio_split,
        models=svd_models,
        show_validation=True,
        metrics=[rmse, mae],
    )
    svd_experiment.run()

    return svd_models, svd_experiment

svd_models, svd_experiment = svd_cornac(df_sample, 100, 500, 50)

In [None]:

# * Function to plot the number of latent dimensions vs. the RMSE error.
# * Purpose: find the optimal K
def plot_rmse_cornac(experiment, metric_name="RMSE"):
    metric_values = []
    names_models = []
    # Use the experiment passed in, not the global svd_experiment
    for result in experiment.result:
        metric_values.append(result.metric_avg_results.get(metric_name))
        names_models.append(result.model_name)

    plt.xlabel("Latent Dimensions")
    plt.ylabel(metric_name)
    plt.title("SVD")
    plt.plot(names_models, metric_values, "o-")
    plt.show()


plot_rmse_cornac(svd_experiment)

In [None]:

# ! -  In case you want a baseline model for SVD - considering only the biases:
df = df_sample.astype({"UserId": object, "ProductId": object})
records = df.to_records(index=False)
result = list(records)

# ? - Instantiate an evaluation method to split data into train and test sets.
ratio_split = cornac.eval_methods.RatioSplit(
    data=result, test_size=0.1, seed=42, verbose=True
)  # same data format and seed as svd_cornac, so the split is comparable

# ? - Instantiate the models of interest
bo = cornac.models.BaselineOnly(
    max_iter=30, learning_rate=0.01, lambda_reg=0.02, verbose=True
)

# Instantiate evaluation measures
mae = cornac.metrics.MAE()
rmse = cornac.metrics.RMSE()

# Instantiate and run an experiment.
cornac.Experiment(eval_method=ratio_split, models=[bo], metrics=[mae, rmse]).run()
# ? - After running this, compare with the block above

In [None]:
# * Function to recommend products to a user, given the user's internal index
def recommend_products(index, model, data, num_products=5):
    # `data` is kept for call compatibility; ids are read from the model's own train_set maps
    print("Name of Model:", model.name)

    # Rank all items for the given user: item_rank is sorted by descending score,
    # while item_scores is indexed by internal item index.
    item_rank, item_scores = model.rank(index)
    top_items = item_rank[:num_products]

    user_ids = list(model.train_set.uid_map.keys())
    item_ids = np.array(list(model.train_set.iid_map.keys()))
    print("Target UserId", user_ids[index])
    print("Recommended products:", item_ids[top_items])
    print("Predicted scores: ", item_scores[top_items])


n = 0  # ? - Replace n with the index of the model with the lowest RMSE
recommend_products(1, svd_models[n], df_sample)

#### Matrix Factorisation <a id='rs-section4'></a>
* Option 1 - Single Matrix Factorisation model, no comparison
* Option 2 - Multiple models including Non-Negative Matrix Factorisation & Probabilistic Matrix Factorisation **<- Choose this one**
  * Visualise the points using the 2 factors with the highest variance
  * Apply clustering in the space of those 2 factors

> Factorises the ratings matrix into the product of two lower-rank matrices, "capturing the low-rank structure of the user-item interactions".
> * $Y$ ($m \times n$) $\approx P Q^T$, with $P$ ($m \times k$) and $Q^T$ ($k \times n$), where the latent factor size $k \ll m, n$.
> * $P$ is the user matrix ($m$ = # of users): each row measures a user's interest in the item characteristics.
> * $Q$ is the item matrix ($n$ = # of items): each row measures an item's set of characteristics.
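
A minimal sketch of how $P$ and $Q$ can be learned by stochastic gradient descent on the observed ratings only, with L2 regularisation. The triples, sizes and hyperparameters are hypothetical, and there are no bias terms; Cornac's `MF` follows the same idea but its implementation details may differ.

```python
import numpy as np

rng = np.random.default_rng(42)
# Hypothetical observed ratings as (user index, item index, rating) triples
triples = [(0, 0, 5.0), (0, 1, 3.0), (1, 0, 4.0), (1, 2, 1.0), (2, 1, 2.0), (2, 2, 5.0)]
m, n, k = 3, 3, 2
P = 0.1 * rng.normal(size=(m, k))  # user factors
Q = 0.1 * rng.normal(size=(n, k))  # item factors
learning_rate, lambda_reg = 0.05, 0.01


def rmse(P, Q):
    """Training RMSE over the observed triples."""
    errors = [r - P[u] @ Q[i] for u, i, r in triples]
    return float(np.sqrt(np.mean(np.square(errors))))


rmse_start = rmse(P, Q)
for epoch in range(200):
    for u, i, r in triples:
        err = r - P[u] @ Q[i]
        p_old = P[u].copy()
        # Gradient step on the regularised squared error of this single rating
        P[u] += learning_rate * (err * Q[i] - lambda_reg * P[u])
        Q[i] += learning_rate * (err * p_old - lambda_reg * Q[i])
rmse_end = rmse(P, Q)
print(f"RMSE before: {rmse_start:.3f}, after: {rmse_end:.3f}")
```

Once trained, the prediction for any (user, item) pair, rated or not, is simply `P[u] @ Q[i]`.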

In [None]:

# * Load the data for the MF
# ? Option 1 - Non-csv
data = movielens.load_feedback(variant="100K")

# !!!! - Use this when you want a single MF model; IF YOU NEED TO COMPARE, GO TO THE CELL BELOW
# ? !!! - Option 2 - CORNAC CSV IMPORT
pandas_df = pd.read_csv("csv_path")
df = pandas_df.astype({"UserId": object, "ProductId": object})
records = df.to_records(index=False)
result = list(records)

# * Split the data and define the RMSE metric
VERBOSE = False  # defined at cell level; inside the helper functions they are local
SEED = 42
rs = RatioSplit(result, test_size=0.2, seed=SEED, verbose=VERBOSE)
rmse = cornac.metrics.RMSE()

K = 10
lbd = 0.01  # ? Lambda -> Regularisation
mf = MF(
    k=K,
    max_iter=20,
    learning_rate=0.01,
    lambda_reg=lbd,
    use_bias=False,
    verbose=VERBOSE,
    seed=SEED,
    name=f"MF(K={K},lambda={lbd:.4f})",
)

# * Execute the MF model
cornac.Experiment(eval_method=rs, models=[mf], metrics=[rmse]).run()
# ? - ^^If you only have one model.

In [None]:

# * NMF: Variant where the latent factors are constrained to be non-negative
# * Suited to non-negative data, as in image processing, text mining, and rec. systems.
# * Non-negative factors are easier to interpret, since contributions can only add up:
def mf_cornac(df, K=10):
    df = df.astype({"UserId": object, "ProductId": object})
    records = df.to_records(index=False)
    result = list(records)
    VERBOSE = False
    SEED = 42
    lbd = 0.01
    baseline = BaselineOnly(
        max_iter=20, learning_rate=0.01, lambda_reg=lbd, verbose=VERBOSE
    )
    mf1 = MF(
        k=K,
        max_iter=20,
        learning_rate=0.01,
        lambda_reg=0.0,
        use_bias=False,
        verbose=VERBOSE,
        seed=SEED,
        name=f"MF(K={K})",
    )
    mf2 = MF(
        k=K,
        max_iter=20,
        learning_rate=0.01,
        lambda_reg=lbd,
        use_bias=False,
        verbose=VERBOSE,
        seed=SEED,
        name=f"MF(K={K},lambda={lbd:.4f})",
    )
    mf3 = MF(
        k=K,
        max_iter=20,
        learning_rate=0.01,
        lambda_reg=lbd,
        use_bias=True,
        verbose=VERBOSE,
        seed=SEED,
        name=f"MF(K={K},bias)",
    )
    nmf = NMF(
        k=K,
        max_iter=200,
        learning_rate=0.01,
        use_bias=False,
        verbose=VERBOSE,
        seed=SEED,
        name=f"NMF(K={K})",
    )
    ratio_split = RatioSplit(result, test_size=0.1, seed=SEED, verbose=VERBOSE)
    cornac.Experiment(
        eval_method=ratio_split,
        models=[baseline, mf1, mf2, mf3, nmf],
        metrics=[cornac.metrics.RMSE()],
    ).run()

    mf_models = {"baseline": baseline, "mf1": mf1, "mf2": mf2, "mf3": mf3, "nmf": nmf}
    return mf_models

In [None]:
K = 10
mf_models = mf_cornac(df_sample, K)

model = mf_models.get("mf3")  # ? - Select the one with the best RMSE
var_df = pd.DataFrame(
    {"Factor": np.arange(K), "Variance": np.var(model.i_factors, axis=0)}
)

# * Inspect the variance of each latent dimension of the MF
fig, ax = plt.subplots(figsize=(12, 5))
plt.title("MF")
sns.barplot(x="Factor", y="Variance", data=var_df, hue="Factor", legend=False, ax=ax)
# * With NMF you force the latent factors to be non-negative, which improves interpretability
# * PMF -> Probabilistic Matrix Factorisation (Topic 6) -> it almost always improves results (worth studying this way - make yourself an outline)

#### **K-Means on the Matrix Factorisation using the 2 factors with the highest variance**

In [None]:
TOP2F = (0, 2) # * Change to the most significant factors
SAMPLE_SIZE = 500

mf = model
rng = np.random.RandomState(42)
sample_inds = rng.choice(np.arange(mf.i_factors.shape[0]), size=SAMPLE_SIZE)
sample_df = pd.DataFrame(data=mf.i_factors[sample_inds][:, TOP2F], columns=["x", "y"])
sns.lmplot(x="x", y="y", data=sample_df, height=11.0, fit_reg=False)

In [None]:
def pick_centroids(data, k):
    indexes = np.random.choice(len(data), size=k, replace=False)
    centroids = data[indexes]
    return centroids

def assign_cluster(data, centroids):
    # Pairwise squared L2 distances. Shape [n, k]
    distances = ((data[:, np.newaxis] - centroids) ** 2).sum(axis=2)
    # find closest centroid index. Shape [n]
    clusters = np.argmin(distances, axis=1)
    return clusters

def update_centroids(data, clusters, k):
    # Mean positions of data within clusters
    centroids = [np.mean(data[clusters == i], axis=0) for i in range(k)]
    return np.array(centroids)

In [None]:
class KMEANS:
    def __init__(self, k):
        self.k = k

    def fit(self, data, steps=20):
        self.centroids = pick_centroids(data, self.k)
        for step in range(steps):
            clusters = assign_cluster(data, self.centroids)
            self.centroids = update_centroids(data, clusters, self.k)

    def predict(self, data):
        return assign_cluster(data, self.centroids)

In [None]:
kmeans = KMEANS(k=3)
data = sample_df.to_numpy()
kmeans.fit(data)
clusters = kmeans.predict(data)

In [None]:
plt.scatter(data[:, 0], data[:, 1], c=clusters, cmap="viridis", alpha=0.5)
plt.scatter(
    kmeans.centroids[:, 0],
    kmeans.centroids[:, 1],
    c="red",
    s=100,
    edgecolor="black",
    label="Centroids",
)
plt.title("Cluster Visualization with Centroids")
plt.grid(False)
plt.show()

In [None]:
def plot_decision_boundaries(clusterer, X, resolution=1000):
    plt.figure()
    mins = X.min(axis=0)
    maxs = X.max(axis=0)
    xx, yy = np.meshgrid(
        np.linspace(mins[0], maxs[0], resolution),
        np.linspace(mins[1], maxs[1], resolution),
    )
    Z = clusterer.predict(np.c_[xx.ravel(), yy.ravel()])
    Z = Z.reshape(xx.shape)
    plt.title("Cluster Visualization with Voronoi cells")
    plt.contourf(Z, extent=(mins[0], maxs[0], mins[1], maxs[1]), cmap="Pastel2")
    plt.contour(
        Z, extent=(mins[0], maxs[0], mins[1], maxs[1]), linewidths=1, colors="k"
    )
    # Use the function arguments rather than the globals `data`, `clusters` and `kmeans`
    plt.scatter(X[:, 0], X[:, 1], c=clusterer.predict(X), cmap="viridis", alpha=0.5)
    plt.scatter(
        clusterer.centroids[:, 0],
        clusterer.centroids[:, 1],
        c="red",
        s=100,
        edgecolor="black",
        label="Centroids",
    )
    plt.show()


plot_decision_boundaries(kmeans, data)

----

# Implicit Feedback - Interaction based (Cornac) <a id='section6'></a>

#### BPR - Bayesian Personalised Ranking <a id='rs-section5'></a>

In [None]:
# * Import the data and split into train/test
data = pd.read_csv("path_to_csv")
train, test = python_random_split(data, 0.75)
train_set = cornac.data.Dataset.from_uir(train.itertuples(index=False), seed=SEED)

# print("Number of users: {}".format(train_set.num_users))
# print("Number of items: {}".format(train_set.num_items))

In [None]:
# * BPR Model
# ? -top k items to recommend
TOP_K = 10

# ? - Model parameters
NUM_FACTORS = 250
NUM_EPOCHS = 100

bpr = cornac.models.BPR(
    k=NUM_FACTORS,  # ? - Control the dimension of the latent space.
    max_iter=NUM_EPOCHS,  # ? - Num of iterations for SGD
    learning_rate=0.01,  # ? - Controls the step size alpha for gradient update. Small in this case
    lambda_reg=0.001,  # ? - L2 Regularisation
    verbose=True,
    seed=SEED,
).fit(
    train_set
)  # ? - In case you wish to train it directly

# * The BPR model is effectively designed for item ranking. So we should only measure performance using the ranking metrics.
with Timer() as t:
    all_predictions = predict_ranking(
        bpr, train, usercol="userID", itemcol="itemID", remove_seen=True
    )
print(f"Took {t} seconds for the prediction")

all_predictions.head()  # ? - Visualise the predicted scores per user
bpr.rank(3)[1][
    1394
]  # ? - rank(3) returns (ranked items, scores) for the user with internal index 3 -> take the scores and read the one for item index 1394
# TODO ^^ In the above case, shouldn't we apply a mask for the prediction for it to be zero?

# * Analysis of the predictions and extraction of their performance metrics
# Mean Average Precision for top k prediction items
# (assumes `map` is the metric imported from the recommenders evaluation module, which shadows the built-in `map`)
eval_map = map(test, all_predictions, col_prediction="prediction", k=TOP_K)
# Normalized Discounted Cumulative Gain (nDCG)
eval_ndcg = ndcg_at_k(test, all_predictions, col_prediction="prediction", k=TOP_K)
# precision at k (min=0, max=1)
eval_precision = precision_at_k(
    test, all_predictions, col_prediction="prediction", k=TOP_K
)
eval_recall = recall_at_k(test, all_predictions, col_prediction="prediction", k=TOP_K)

print(
    "MAP:\t%f" % eval_map,
    "NDCG:\t%f" % eval_ndcg,
    "Precision@K:\t%f" % eval_precision,
    "Recall@K:\t%f" % eval_recall,
    sep="\n",
)
warnings.filterwarnings("ignore")
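
As a sanity check on what the ranking metrics measure, here is a small pure-Python sketch of Precision@k and Recall@k for a single user. The helper name and the item ids are hypothetical; the library functions used above average over all users and may differ in details such as how they handle users with fewer than k recommendations.

```python
def precision_recall_at_k(recommended, relevant, k):
    """Precision@k and Recall@k for one user (hypothetical helper)."""
    top_k = recommended[:k]
    hits = sum(1 for item in top_k if item in relevant)
    precision = hits / k
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical ranked recommendations and held-out test items for one user
recommended = ["i3", "i7", "i1", "i9", "i4"]
relevant = {"i1", "i4", "i8"}

precision, recall = precision_recall_at_k(recommended, relevant, k=5)
print(f"Precision@5={precision:.2f}, Recall@5={recall:.2f}")
```

Precision asks how many of the k recommended items were relevant; recall asks how many of the relevant items made it into the top k.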

#### Weighted Matrix Factorisation: <a id='rs-section6'></a>

In [None]:
K = 50
wmf = WMF(
    k=K,
    max_iter=100,
    a=1.0,
    b=0.01,
    learning_rate=0.001,
    lambda_u=0.01,
    lambda_v=0.01,
    verbose=VERBOSE,
    seed=SEED,
    name=f"WMF(K={K})",
)

eval_metrics = [
    cornac.metrics.RMSE(),
    cornac.metrics.AUC(),
    cornac.metrics.Precision(k=10),
    cornac.metrics.Recall(k=10),
    cornac.metrics.FMeasure(k=10),
    cornac.metrics.NDCG(k=[10, 20, 30]),
    cornac.metrics.MRR(),
    cornac.metrics.MAP(),
]

pandas_df = pd.read_csv("csv_path")
# RatioSplit expects raw (user, item, rating) tuples rather than a cornac Dataset
data = list(pandas_df.itertuples(index=False, name=None))
rs = RatioSplit(data, test_size=0.2, seed=SEED, verbose=VERBOSE)
cornac.Experiment(
    eval_method=rs, models=[wmf, mf], metrics=eval_metrics
).run()  # ? - This will output all of the metrics mentioned
# * Keep in mind that MF models are strong at predicting the rating values.
# * WMF models, however, are designed to rank items by fitting binary adoptions (a click, a purchase, a view).
# * This is more about showing interest than about judging how much the user will like the item.
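
The `a` and `b` arguments of `WMF` are the confidence weights for observed and unobserved entries. A minimal NumPy sketch of the resulting weighted squared loss is shown below; the adoption matrix and the random factors are hypothetical, and the regularisation terms controlled by `lambda_u` and `lambda_v` are left out for brevity.

```python
import numpy as np

# Hypothetical binary adoption matrix (1 = click/purchase/view, 0 = unobserved)
R = np.array([[1, 0, 1],
              [0, 1, 0]], dtype=float)
a, b = 1.0, 0.01  # confidence for observed and unobserved entries

rng = np.random.default_rng(42)
P = rng.normal(scale=0.1, size=(2, 2))  # user factors
Q = rng.normal(scale=0.1, size=(3, 2))  # item factors

C = np.where(R > 0, a, b)  # per-entry confidence weights
errors = R - P @ Q.T
weighted_loss = float(np.sum(C * errors**2))
unweighted_loss = float(np.sum(errors**2))
print(weighted_loss, unweighted_loss)
```

With `b` much smaller than `a`, a zero is treated as weak evidence of disinterest rather than as a confirmed dislike, which is what makes the model suitable for implicit feedback.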

#### Factorisation Machines <a id='rs-section7'></a>

In [None]:
if torch.backends.mps.is_available():
    device = torch.device("mps")
    x = torch.ones(1, device=device)
    print(x)
else:
    print("MPS device not found.")

# * - Get the item df
item_df = pd.read_csv("path_to_csv")
# ? - Make sure to create a column with the id index in case the ids don't start at 0
item_df["itemId_index"] = item_df["itemId"].astype("category").cat.codes
item_df.head()

# * - Get the user df
user_df = pd.read_csv("path_to_csv")
# ? - Remember to factorise all categorical variables !!! - Select those which are relevant
user_df["gender_index"] = user_df["gender"].astype("category").cat.codes
user_df["age_index"] = user_df["age"].astype("category").cat.codes
user_df["occupation_index"] = user_df["occupation"].astype("category").cat.codes
user_df["userId_index"] = user_df["userId"].astype("category").cat.codes
user_df.head()

# * - Get the ratings df and join it with the item and user features
ratings_df = pd.read_csv("path_to_csv")
# ? - `on` must name the item column of ratings_df ("movieId" here; adjust if yours is called "itemId")
ratings = ratings_df.join(item_df.set_index("itemId"), on="movieId")
# ? - Join onto `ratings` (not `ratings_df`), otherwise the item columns from the previous join are lost
ratings = ratings.join(user_df.set_index("userId"), on="userId")

# * - Get the feature columns to prepare for Factorisation Machines. !!! Don't forget to modify for the real ones.
# TODO - Are multi-feature recommendation systems only relevant when it comes to implicit feedback?
feature_columns = [
    "userId_index",
    "itemId_index",
    "age_index",
    "gender_index",
    "occupation_index",
]

feature_sizes = {
    "userId_index": len(ratings["userId_index"].unique()),
    "itemId_index": len(ratings["itemId_index"].unique()),  # ? - Key must match the name used in feature_columns
    "age_index": len(ratings["age_index"].unique()),
    "gender_index": len(ratings["gender_index"].unique()),
    "occupation_index": len(ratings["occupation_index"].unique()),
}

# * Set the second order FM model made of three parts:
# ? - 1. The offsets:
next_offset = 0
feature_offsets = {}

# * This is in order to establish when to pass to the next feature
for k, v in feature_sizes.items():
    feature_offsets[k] = next_offset
    next_offset += v

# * Map all column indices to start from correct offset
for column in feature_columns:
    ratings[column] = ratings[column].apply(lambda c: c + feature_offsets[column])

# * - Only visualise the feature columns along with the ratings, because that's what we need for FM.
ratings[[*feature_columns, "rating"]].head(5)

# * - Initialise the data and split it into train and test
data_x = torch.tensor(ratings[feature_columns].values, dtype=torch.long)  # ? - nn.Embedding needs integer indices
data_y = torch.tensor(ratings["rating"].values).float()
# ? - Use the full torch.utils.data path: the name `data` is reused for other objects in this notebook
dataset = torch.utils.data.TensorDataset(data_x, data_y)

bs = 1024
train_n = int(len(dataset) * 0.9)
valid_n = len(dataset) - train_n
splits = [train_n, valid_n]
assert sum(splits) == len(dataset)  # ? - Verify that the split sizes add up
trainset, devset = torch.utils.data.random_split(
    dataset, splits
)  # ? - Assign the data to each split
train_dataloader = torch.utils.data.DataLoader(trainset, batch_size=bs, shuffle=True)
dev_dataloader = torch.utils.data.DataLoader(devset, batch_size=bs, shuffle=False)  # ? - No need to shuffle for evaluation


# * Function to fill in a tensor with a 'truncated distribution' -> mean 0, std 1
# copied from fastai:
def trunc_normal_(x, mean=0.0, std=1.0):
    """
    Modifies a PyTorch tensor in-place, filling it with random values that approximate a truncated normal distribution.

    This function fills the tensor `x` with values drawn from a standard normal distribution, then applies a modulus operation to limit the absolute values, and finally scales and shifts these values to achieve the desired mean and standard deviation. Note that the approach does not strictly adhere to a statistically accurate truncated normal distribution, as it does not cut off values outside a specific range but rather wraps them within a limited range.

    Parameters:
    - x (Tensor): The PyTorch tensor to be modified in-place.
    - mean (float, optional): The mean of the distribution after adjustment. Defaults to 0.0.
    - std (float, optional): The standard deviation of the distribution after adjustment. Defaults to 1.0.

    Returns:
    - Tensor: The modified tensor `x` with values approximating a truncated normal distribution centered around `mean` and with a standard deviation of `std`. The tensor is modified in-place, so the return value is the same tensor object `x`.
    """
    return x.normal_().fmod_(2).mul_(std).add_(mean)


class FMModel(nn.Module):
    def __init__(
        self, n, k
    ):  # ? - n: Total number of feature indices (sum of all feature sizes). k: Dimension of each latent vector
        super().__init__()

        self.w0 = nn.Parameter(torch.zeros(1))  # ? - Global bias
        self.bias = nn.Embedding(n, 1)  # ? - Embedding layer for bias per feature
        self.embeddings = nn.Embedding(
            n, k
        )  # ? - The actual embedding with dimension k

        # ? - This initialises the embeddings and bias layers with a truncated normal distribution
        with torch.no_grad():
            trunc_normal_(self.embeddings.weight, std=0.01)
        with torch.no_grad():
            trunc_normal_(self.bias.weight, std=0.01)

    def forward(
        self, X
    ):  # ? - How is the input tensor processed to produce a prediction?
        emb = self.embeddings(X)  # ? - Compute embeddings for the input features
        # ? - emb has shape: [batch_size, num_of_features, k]
        # ? - Pairwise interactions in O(nk): 0.5 * ((sum of embeddings)^2 - sum of (embeddings^2)), Lemma 3.1 of the FM paper
        pow_of_sum = emb.sum(dim=1).pow(2)
        sum_of_pow = emb.pow(2).sum(dim=1)
        pairwise = (pow_of_sum - sum_of_pow).sum(1) * 0.5
        bias = self.bias(X).squeeze(-1).sum(1)  # ? - squeeze(-1) keeps the batch dim even when batch_size == 1
        # ? - The sigmoid maps the raw score to (0, 1); multiplying by 5.5 rescales it to (0, 5.5).
        # ? - The bound sits slightly above the top rating, presumably so that a 5 can be predicted without saturating the sigmoid.
        return torch.sigmoid(self.w0 + bias + pairwise) * 5.5


# fit/test functions
def fit(iterator, model, optimizer, criterion):
    train_loss = 0
    model.train()
    for x, y in iterator:
        optimizer.zero_grad()
        y_hat = model(x.to(device))
        loss = criterion(y_hat, y.to(device))
        train_loss += loss.item() * x.shape[0]
        loss.backward()
        optimizer.step()
    return train_loss / len(iterator.dataset)


def test(iterator, model, criterion):
    valid_loss = 0
    model.eval()
    for x, y in iterator:
        with torch.no_grad():
            y_hat = model(x.to(device))
        loss = criterion(y_hat, y.to(device))
        valid_loss += loss.item() * x.shape[0]
    return valid_loss / len(iterator.dataset)


def train_n_epochs(model, n, optimizer, scheduler):
    criterion = nn.MSELoss().to(device)
    for epoch in range(n):
        start_time = time.time()
        train_loss = fit(train_dataloader, model, optimizer, criterion)
        valid_loss = test(dev_dataloader, model, criterion)
        scheduler.step()
        secs = int(time.time() - start_time)
        print(f"epoch {epoch}. time: {secs}[s]")
        print(f"\ttrain rmse: {(math.sqrt(train_loss)):.4f}")
        print(f"\tvalidation rmse: {(math.sqrt(valid_loss)):.4f}")


model = FMModel(int(data_x.max()) + 1, 20).to(device)  # ? - Cast to int: nn.Embedding expects a plain integer size
wd = 1e-5
lr = 0.001
epochs = 10
optimizer = optim.Adam(model.parameters(), lr=lr, weight_decay=wd)
scheduler = optim.lr_scheduler.MultiStepLR(optimizer, milestones=[7], gamma=0.1)
# ? - train_n_epochs (defined above) runs the fit / validate / scheduler.step() loop and prints the RMSE per epoch
train_n_epochs(model, epochs, optimizer, scheduler)

# TODO: Learn how to run the model to actually get the recommendations. Too abstract
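
On the TODO about getting recommendations out of the FM: one way to think about it is to score every candidate item for a given user and sort. Below is a minimal NumPy sketch of that idea, not the notebook's actual model. It assumes only two feature fields (user and item), with items indexed after users via an offset; `fm_score`, `fm_score_naive` and all toy parameters are hypothetical. It also checks that the O(nk) trick used in `forward` matches the explicit sum over feature pairs.

```python
import numpy as np


def fm_score(indices, w0, bias, emb):
    """Second-order FM score for one row of (already offset) feature indices."""
    v = emb[indices]                      # [num_features, k]
    pow_of_sum = v.sum(axis=0) ** 2       # (sum of embeddings)^2, per latent dim
    sum_of_pow = (v ** 2).sum(axis=0)     # sum of squared embeddings, per latent dim
    pairwise = 0.5 * (pow_of_sum - sum_of_pow).sum()
    return w0 + bias[indices].sum() + pairwise


def fm_score_naive(indices, w0, bias, emb):
    """Same score, written as the explicit sum over all feature pairs."""
    total = w0 + bias[indices].sum()
    for p in range(len(indices)):
        for q in range(p + 1, len(indices)):
            total += emb[indices[p]] @ emb[indices[q]]
    return total


rng = np.random.default_rng(0)
n_users, n_items, k = 3, 4, 5
item_offset = n_users                     # item indices start after the user indices
emb = rng.normal(scale=0.1, size=(n_users + n_items, k))
bias = rng.normal(scale=0.1, size=n_users + n_items)
w0 = 0.0

user = 1
# One row per candidate item: [user index, offset item index]
scores = np.array(
    [fm_score(np.array([user, item_offset + i]), w0, bias, emb) for i in range(n_items)]
)
top_items = np.argsort(scores)[::-1][:2]  # two highest-scoring items, best first
print("Top items for user", user, "->", top_items)
```

With the PyTorch `FMModel`, the equivalent would presumably be to build an index tensor with one row per candidate item (the user's own feature indices plus each item index, all with their offsets applied), call the model under `torch.no_grad()`, and sort the output.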

-----

Non-negative Matrix Factorisation (can be skipped)

In [None]:
K = 10  # ? - Upper-case K is what the plotting and top-items code below iterates over
nmf = NMF(
    k=K,
    max_iter=100,  # ? - How do we decide on the number of iterations?
    learning_rate=0.01,
    lambda_reg=0.0,
    verbose=VERBOSE,
    seed=SEED,
    name=f"NMF (K = {K})",
)

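pandas_df = pd.read_csv("csv_path")  # ? - Placeholder path; columns expected in (user, item, rating) order
# ? - RatioSplit expects the raw (user, item, rating) tuples, not an already-built cornac Dataset
data = list(pandas_df.itertuples(index=False, name=None))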

rs = RatioSplit(data, test_size=0.2, seed=SEED, verbose=VERBOSE)
rmse = cornac.metrics.RMSE()
cornac.Experiment(eval_method=rs, models=[nmf], metrics=[rmse]).run()

# ? - Visualise the variance for each latent factor in the NMF
var_df = pd.DataFrame(
    {
        "Factor": np.arange(nmf.i_factors.shape[1]),  # ? - One entry per latent factor
        "Variance": np.var(nmf.i_factors, axis=0),
    }
)
fig, ax = plt.subplots(figsize=(12, 5))
plt.title("NMF")
sns.barplot(x="Factor", y="Variance", data=var_df, palette="ch:.25", ax=ax)

# ? - Rebuild the full ratings matrix from the fitted model, using the original dimensions
# ? - (assumes the rows / columns of ratings_matrix follow the model's internal user / item indices)
ratings_arr = np.nan_to_num(np.asarray(ratings_matrix, dtype=float))
recons_matrix = np.zeros_like(ratings_arr)
# ? - Populate it with the predicted score of every (user, item) pair
for u, i in itertools.product(
    range(recons_matrix.shape[0]), range(recons_matrix.shape[1])
):
    recons_matrix[u, i] = nmf.score(u, i)
# ? - ^^Careful if you had multiple models: this scores with `nmf` only, swap in another fitted model as needed.

ratings_mask = (ratings_arr > 0).astype(float)  # ? - 1.0 where a rating was observed, 0.0 elsewhere

# ? - Average the squared error over the observed entries only
rmse = np.sqrt(
    (((ratings_arr - recons_matrix) ** 2) * ratings_mask).sum() / ratings_mask.sum()
)
print(f"\nRMSE = {rmse:.3f}")
print("Reconstructed matrix:")
pd.DataFrame(
    recons_matrix.round(2),
    index=[f"User {u + 1}" for u in range(recons_matrix.shape[0])],
    columns=[f"Item {i + 1}" for i in range(recons_matrix.shape[1])],
)

# * - Identify the top items associated with each latent factor in an NMF
item_idx2id = list(nmf.train_set.item_ids)  # ? - Map the original id's of the items
top_items = {}
for k in range(K):  # ? - For each latent vector
    # ? - For each column in the latent matrix, pick the top five items (Slice the last 5 items in ascending order. [::-1] then just reverses it)
    top_inds = np.argsort(nmf.i_factors[:, k])[-5:][::-1]
    # * Make sure you have an item df
    # ? - Append to the dictionary the latent factor with its top 5 elements
    top_items[f"Factor {k}"] = item_df.loc[[int(item_idx2id[i]) for i in top_inds]][
        "Title"
    ].values

pd.DataFrame(top_items)

# * Attempt to extract latent vector information by sorting into genre and see if they're related:
item_idx2id = list(nmf.train_set.item_ids)
top_genres = {}
for k in range(K):
    top_inds = np.argsort(nmf.i_factors[:, k])[-100:]  # ? - Same procedure
    # ? - Make sure you have an item df
    top_items = item_df.loc[
        [int(item_idx2id[i]) for i in top_inds]
    ]  # ? - Get the top films per latent factor
    # ? - Then drop the columns to just get the genre count.
    top_genres[f"Factor {k}"] = top_items.drop(columns=["Title", "Release Date"]).sum(
        axis=0
    )
pd.DataFrame(top_genres)
# TODO: Still don't have it clear how MF and SVD fill in the remaining elements !!!
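
On the TODO about how MF fills in the remaining elements: the factors are fitted using only the observed cells, and a missing cell is then predicted as the dot product of that user's and that item's learned vectors. A minimal NumPy sketch with plain SGD follows; the toy matrix, learning rate and regularisation are arbitrary assumptions, and this is not Cornac's implementation.

```python
import numpy as np

# Toy ratings matrix, 0 = missing (illustrative data only)
R = np.array([[5.0, 3.0, 0.0],
              [4.0, 0.0, 1.0],
              [0.0, 1.0, 5.0]])
mask = R > 0

rng = np.random.default_rng(0)
k, lr, reg = 2, 0.02, 0.01
U = rng.normal(scale=0.1, size=(R.shape[0], k))  # user factors
V = rng.normal(scale=0.1, size=(R.shape[1], k))  # item factors


def masked_rmse(U, V):
    """RMSE over the observed cells only."""
    err = (R - U @ V.T)[mask]
    return float(np.sqrt((err ** 2).mean()))


rmse_before = masked_rmse(U, V)
observed = list(zip(*np.nonzero(mask)))          # (user, item) pairs that have a rating
for _ in range(1000):
    for u, i in observed:
        err = R[u, i] - U[u] @ V[i]
        u_old = U[u].copy()
        U[u] += lr * (err * V[i] - reg * U[u])
        V[i] += lr * (err * u_old - reg * V[i])
rmse_after = masked_rmse(U, V)

R_hat = U @ V.T                                  # dense: every cell now has a value
print(f"RMSE on observed cells: {rmse_before:.3f} -> {rmse_after:.3f}")
print("Prediction for the missing cell (user 0, item 2):", round(float(R_hat[0, 2]), 2))
```

The missing cells never enter the loss; they get a value only because each user row and item column shares its vector with cells that were observed.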

----

# Natural Language Processing <a id='section7'></a>

#### **Introduction - DistilBERT base model** <a id='nlp-section0'></a>
* Sentence classifiers for a model - single sentence
* Model tokenizer for individual sentences
* Classifying multiple sentences with torch

In [None]:
classifier = pipeline('sentiment-analysis')  # * With no model specified, a default one is used
res = classifier('I was sick last week')
print(res)

In [None]:
model_name = 'distilbert-base-uncased-finetuned-sst-2-english'
tokenizer = AutoTokenizer.from_pretrained(model_name)
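# ? - Passing model=model_name to pipeline() is enough for classification; the model object is loaded explicitly because the manual forward pass further down calls it directly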
model = AutoModelForSequenceClassification.from_pretrained(model_name)
classifier = pipeline('sentiment-analysis', model=model, tokenizer=tokenizer)
res = classifier(["I hate you", "I love you", "I'm indifferent about you"])
print(res)

In [None]:
sentence = "I hate you" # * Tokeniser section for individual sentences
tokens = tokenizer.tokenize(sentence)
token_ids = tokenizer.convert_tokens_to_ids(tokens)
encoding = tokenizer(sentence)  # ? - Returns input_ids (with special tokens added) plus the attention mask

# * Visualise the tokens, their IDs, and the full model input with the attention mask
print(f"Tokens: {tokens}")
print(f"Token IDs: {token_ids}")
print(f"Model input: {encoding}")
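
The attention mask matters once several sentences are batched together with `padding=True`. Here is a minimal pure-Python sketch of what padding and the mask amount to. The token ids are made up and `pad_batch` is a hypothetical helper; the real tokenizer picks its own pad id and padding side.

```python
def pad_batch(sequences, pad_id=0):
    """Right-pad every sequence to the longest one and build the matching attention masks."""
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        input_ids.append(list(seq) + [pad_id] * n_pad)
        attention_mask.append([1] * len(seq) + [0] * n_pad)  # 1 = real token, 0 = padding
    return input_ids, attention_mask


toy_ids = [[101, 7, 8, 102], [101, 7, 102]]  # made-up token ids for two sentences
ids, mask = pad_batch(toy_ids)
print(ids)
print(mask)
```

The model reads the mask to ignore the padded positions, so a short sentence gets the same prediction whether or not it was padded.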

In [None]:
X_train = ["I hate you", "I love you"] # * Tokeniser section for multiple sentences
batch = tokenizer(
    X_train, padding=True, truncation=True, max_length=512, return_tensors="pt"
)
# ? - Padding -> All sequences must be padded to the same length
# ? - Truncation -> Truncates any sequence longer than the specified length
# ? - return_tensors -> pt for pytorch, tf for tensorflow 

with torch.no_grad():  # * Disables gradient calculations to save memory and compute, no backpropagation
    # ? - The preprocessed data is fed into the model; ** unpacks the dictionary into the model's input arguments.
    outputs = model(
        **batch
    )  

    # ? - SequenceClassifier Output which includes logits [# of examples, # of classes] 
    # ? - We have two scores per instance corresponding to negative or positive.
    # ? - Raw prediction of each class.
    print(f"Model output:\n{outputs}")

    # ? - We apply a softmax to convert the logits to probabilities.
    # ? - dim=1 applies the softmax across the class dimension, so each row (one sentence) sums to 1.
    predictions = F.softmax(outputs.logits, dim=1)
    print(f"\nPrediction output:\n{predictions}")

    # ? - Pick the index of the highest value in each row of predictions, which gives the most likely class for each sentence
    labels = torch.argmax(predictions, dim=1)
    print(f"\nLabels:\n{labels}")

    # ? - Finds the corresponding model label to the readable value
    labels = [model.config.id2label[label_id] for label_id in labels.tolist()]
    print(labels)
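
The logits → softmax → argmax → label chain can be reproduced by hand, which helps when reading the raw model output. A minimal pure-Python sketch follows; the logits are made up, and `id2label` is assumed to mirror `model.config.id2label` for the SST-2 model.

```python
import math


def softmax(logits):
    """Turn one row of logits into probabilities that sum to 1."""
    shift = max(logits)                          # subtract the max for numerical stability
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


id2label = {0: "NEGATIVE", 1: "POSITIVE"}        # assumption: same mapping as model.config.id2label
toy_logits = [[4.0, -3.5], [-4.2, 4.5]]          # made-up logits: one row per sentence

probs = [softmax(row) for row in toy_logits]
label_ids = [max(range(len(p)), key=p.__getitem__) for p in probs]  # argmax per row
labels = [id2label[i] for i in label_ids]
print(probs)
print(labels)
```

Softmax is monotonic, so the argmax of the logits and the argmax of the probabilities always agree; the probabilities are only needed when you want a confidence score.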

#### **VADER** <a id='nlp-section1'></a>
* Library downloads | Dataset import and analysis | Tokens, tags and entities
* Sentiment Intensity Analysis over a dataset - *Bag of Words*
* Matrix Factorisation with VADER (modifies the processed dataset) - **Requires running the previous section first**
* Offsets

In [None]:

# ? - The chunker is a statistical model that identifies and classifies named entities in text (people, dates, companies, locations)
# ? - It works by chunking, i.e. grouping tokens based on their PoS tags.
nltk.download("maxent_ne_chunker")
# ? - Plain list of English words, required by the named-entity chunker
nltk.download("words")
# ? - Essential for sentiment analysis with VADER: the lexicon VADER uses to score the sentiment intensity of words and phrases
nltk.download("vader_lexicon")
# ? - Needed by nltk.word_tokenize / nltk.pos_tag if not already installed (resource names can differ in newer NLTK releases)
nltk.download("punkt")
nltk.download("averaged_perceptron_tagger")

In [None]:
data_path = 'data_path'
df = pd.read_csv(data_path)
df.head()

# * Most commonly we get a dataset with a userId, productId, Score and Text
# * Check this in the head

In [None]:
ax = (  # * Plot the distribution of scores
    df["Score"]  # ! - This could just as easily be the rating column
    .value_counts()
    .sort_index()
    .plot(kind="bar", title="Count of Reviews by Stars", figsize=(10, 5))
)
ax.set_xlabel("Review Stars")
plt.show()

In [None]:
# * Pick one text and apply tokenisation + part-of-speech tagging
example = df["Text"][42]  # ! - Take one review as an example
print(example)

tokens = nltk.word_tokenize(example)    # * Tokenise the example.
tagged = nltk.pos_tag(tokens)           # * Part-of-speech tagging: the grammatical role of each token in the sentence.
entities = nltk.chunk.ne_chunk(tagged)  # * Named-entity extraction from the tagged tokens

# ? - Show the tokens and the PoS tags (print them, otherwise only the last expression of the cell is displayed)
print(tokens[:10])
print(tagged[:10])
entities.pprint()

**Sentiment Intensity Analysis**

In [None]:
# * Sentiment analysis on a single sentence
sentence = 'I am so happy!'  # ! - This can be replaced with the review text
sia = SentimentIntensityAnalyzer()
sia.polarity_scores(sentence)  # * Output: neg | neu | pos | compound
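
For reading the `compound` value: as far as I know, VADER sums the valence of the words in the sentence and squashes that sum into (-1, 1) with `x / sqrt(x^2 + alpha)`, using `alpha = 15`, and the usual convention is to call a sentence positive above 0.05 and negative below -0.05. A minimal sketch of just that last step; the function names are hypothetical, and the lexicon lookup and heuristics are not reproduced.

```python
import math


def normalize_compound(valence_sum, alpha=15.0):
    """Squash an unbounded valence sum into the open interval (-1, 1)."""
    return valence_sum / math.sqrt(valence_sum ** 2 + alpha)


def label_from_compound(compound, threshold=0.05):
    """Conventional three-way reading of the compound score."""
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    return "neutral"


for raw in (-6.0, 0.0, 1.0, 4.0):  # made-up valence sums
    c = normalize_compound(raw)
    print(f"valence sum {raw:+.1f} -> compound {c:+.4f} ({label_from_compound(c)})")
```

Because of the squashing, compound scores saturate: very emotional reviews all end up close to ±1, which is worth remembering when comparing them against star ratings.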

In [None]:
# * Sentiment analysis over the whole dataset:
res = {}
for i, row in tqdm(df.iterrows(), total=len(df)):
    text = row["Text"]
    myid = row["Id"]
    res[myid] = sia.polarity_scores(text)  # * Store in a dict: key = the row's Id, value = the SIA result

# * Convert the result into a DataFrame and merge it with the original data
vaders = pd.DataFrame(res).T
vaders = vaders.reset_index().rename(columns={"index": "Id"})
vaders = vaders.merge(df, how="left")  # ? - Merges on the shared "Id" column
vaders.head()  # * Now we can relate the compound score to the rating score

In [None]:

# * Visual analysis of the compound score distribution
ax = sns.barplot(data=vaders, x="Score", y="compound")
ax.set_title("Compound Score by Amazon Star Review")
plt.show()

# * Visual analysis of the neg/neu/pos distributions
fig, axs = plt.subplots(1, 3, figsize=(12, 3))
sns.barplot(data=vaders, x="Score", y="pos", ax=axs[0])
sns.barplot(data=vaders, x="Score", y="neu", ax=axs[1])
sns.barplot(data=vaders, x="Score", y="neg", ax=axs[2])
axs[0].set_title("Positive")
axs[1].set_title("Neutral")
axs[2].set_title("Negative")
plt.tight_layout()
plt.show()

**Factorisation Machine Data Prep w/ VADER**

In [None]:
vaders.info()  # * To see the available columns
columns_to_drop = ['define', 'the', 'columns', 'to', 'drop', 'here']  # ! - Placeholder: replace with real column names
data = vaders.drop(columns_to_drop, axis=1)
data.columns
data.head()  # * You should have neg | neu | pos | compound | userId & productId | rating | any other numeric column
data.info()  # * Check the columns and their types

In [None]:

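# * Convert the string IDs into integer codes (0..n-1), which is what an embedding index needs
# ? - Python's hash() is avoided here: it can change between sessions and returns huge, possibly negative values
data["userId"] = data["UserId"].astype("category").cat.codes
data["productId"] = data["ProductId"].astype("category").cat.codes
data = data.drop(["ProductId", "UserId"], axis=1)  # * Drop the original ID columns so that everything is numeric
data.head()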
data.head()

**This prepares our dataset so it can be fed to a Factorisation Machine**

In [None]:
feature_columns = [
    "neg",
    "neu",
    "pos",
    "compound",
    "HelpfulnessNumerator",
    "HelpfulnessDenominator",
    "Time",
    "userId",
    "productId",
]

features_sizes = {
    "userId": data["userId"].nunique(),
    "productId": data["productId"].nunique(),
    "HelpfulnessNumerator": data["HelpfulnessNumerator"].nunique(),
    "HelpfulnessDenominator": data["HelpfulnessDenominator"].nunique(),
    "Time": data["Time"].nunique(),
    "neg": data["neg"].nunique(),
    "neu": data["neu"].nunique(),
    "pos": data["pos"].nunique(),
    "compound": data["compound"].nunique(),
}

# * Calculate offsets.
# * Each feature starts from the end of the last one.
next_offset = 0
features_offsets = {}
for k, v in features_sizes.items():
    features_offsets[k] = next_offset
    next_offset += v

features_offsets

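# * Map all column indices to start from the correct offset
# * Each column is first encoded as integer codes 0..n-1, so that continuous values such as `compound` or `Time`
# * become valid indices matching features_sizes, and then shifted by that column's offset.
for column in feature_columns:
    codes = data[column].astype("category").cat.codes.astype("int64")
    data[column] = codes + features_offsets[column]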

data[[*feature_columns, "Score"]].head(
    5
)  # * This is a way to just display the columns we're interested in, and we unpack the feature columns
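
The offset trick is easier to see on a toy example. This minimal pure-Python sketch uses made-up data and a hypothetical `encode` helper: each column is mapped to codes `0..n-1`, then shifted so that all columns share one index space without collisions, which is what a single embedding table needs.

```python
toy = {
    "userId": ["u1", "u2", "u1", "u3"],
    "productId": ["p9", "p9", "p7", "p8"],
}


def encode(values):
    """Map each distinct value to 0..n-1 (sorted order, similar to pandas category codes)."""
    mapping = {v: idx for idx, v in enumerate(sorted(set(values)))}
    return [mapping[v] for v in values], len(mapping)


encoded, sizes = {}, {}
for name, values in toy.items():
    encoded[name], sizes[name] = encode(values)

# Each feature starts where the previous one ended
offsets, next_offset = {}, 0
for name, size in sizes.items():
    offsets[name] = next_offset
    next_offset += size

shifted = {name: [c + offsets[name] for c in codes] for name, codes in encoded.items()}
print(offsets)      # start index of each feature block
print(shifted)      # globally unique indices
print(next_offset)  # total number of rows the embedding table needs
```

Here the three users take indices 0 to 2 and the three products take 3 to 5, so user `u1` and product `p7` (both code 0 locally) no longer collide.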

#### **ROBERTA** <a id='nlp-section2'></a>
* Model setup
* Tokenisation
* Process for classifying a single sentence
* Process for doing it over a whole dataframe

In [None]:
# * model - based on tweets
model_name = "cardiffnlp/twitter-roberta-base-sentiment"
model = AutoModelForSequenceClassification.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)
classifier = pipeline("sentiment-analysis", model=model, tokenizer=tokenizer)

In [None]:
sentence = "I hate you"
tokens = tokenizer.tokenize(sentence)
token_ids = tokenizer.convert_tokens_to_ids(tokens)
encoding = tokenizer(sentence)  # ? - Returns input_ids (with special tokens added) plus the attention mask

print(f"Tokens: {tokens}")
print(f"Token IDs: {token_ids}")
print(f"Model input: {encoding}")

In [None]:
# Run for Roberta Model
encoded_text = tokenizer(sentence, return_tensors="pt")
print(f"Example text: {sentence}\nEncoded Text Output:{encoded_text}\n")

# * We can observe from the output the negative, neutral and positive logits
output = model(**encoded_text)
print(f"\nOutput:\n{output}")

# ? - Get the scores as logits and place it in a numpy vector
scores = output[0][0].detach().numpy()
# * Apply softmax to pass it to a probability and assign
scores = softmax(scores)
scores_dict = {
    "roberta_neg": scores[0],
    "roberta_neu": scores[1],
    "roberta_pos": scores[2],
}
print(f"\nScores Dictionary final: {scores_dict}")

In [None]:
# * File setup
path = 'path_to_csv'
df = pd.read_csv(path)
df.head()

reviews_per_product = df["ProductId"].value_counts()

# !! - Keep only the products with at least 10 reviews, like in the exam (use > 10 if it must be strictly more than 10)
select_product = reviews_per_product[reviews_per_product >= 10].index.to_list()
df = df.loc[df["ProductId"].isin(select_product)]
df.shape

df.value_counts("Score", normalize=True)

# Number of unique users and products
n_users = df["UserId"].nunique()
print("UNIQUE USERS: ", n_users)
n_products = df["ProductId"].nunique()
print("UNIQUE PRODUCTS: ", n_products)

def print_sparsity(df):
    n_users = df.UserId.nunique()
    n_items = df.ProductId.nunique()
    n_ratings = len(df)
    rating_matrix_size = n_users * n_items
    sparsity = 1 - n_ratings / rating_matrix_size

    print(f"Number of users: {n_users}")
    print(f"Number of items: {n_items}")
    print(f"Number of available ratings: {n_ratings}")
    print(f"Number of all possible ratings: {rating_matrix_size}")
    print("-" * 40)
    print(f"SPARSITY: {sparsity * 100.0:.2f}%")


print_sparsity(df)

In [None]:
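def polarity_scores_roberta(example):
    # ? - Truncate to the model's 512-token limit so that long reviews do not raise a RuntimeError
    encoded_text = tokenizer(
        example, return_tensors="pt", truncation=True, max_length=512
    )
    with torch.no_grad():  # ? - Inference only, no gradients needed
        output = model(**encoded_text)
    scores = output[0][0].detach().numpy()
    scores = softmax(scores)
    scores_dict = {
        "roberta_neg": scores[0],
        "roberta_neu": scores[1],
        "roberta_pos": scores[2],
    }
    return scores_dict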

In [None]:
res = {}
for i, row in tqdm(df.iterrows(), total=len(df)):
    try:
        text = row["Text"]
        myid = row["Id"]
        roberta_result = polarity_scores_roberta(str(text))
        res[myid] = {**roberta_result}
    except RuntimeError:
        print(f"Broke for id {myid}")

ratings = pd.DataFrame(res).T
ratings = ratings.reset_index().rename(columns={"index": "Id"})
ratings = ratings.merge(vaders, how="left")
ratings.info()

and so the wheel turns ...

----
# Graph networks <a id='section8'></a>

Examples on the project page: https://github.com/PreferredAI/cornac/tree/master/examples

#### **Graph Modality**  <a id='na-section1'></a>
* SoRec vs. PMF (Probabilistic Matrix Factorisation)
* Interpreting Recommendations

In [None]:
K = 20
# * Definition of the SoRec Model
sorec = SoRec(k=K, max_iter=50, learning_rate=0.001, verbose=VERBOSE, seed=SEED)
# * Definition of the PMF model
pmf = PMF(
    k=K, max_iter=50, learning_rate=0.001, lambda_reg=0.01, verbose=VERBOSE, seed=SEED
)

In [None]:
# ! Option A - if we are asked to use the FilmTrust dataset that ships with Cornac:
ratings = filmtrust.load_feedback()
trust = filmtrust.load_trust()  # * Trust defines the relationships between users, e.g. whether they are friends or not

# ! Option B - if the data is NOT taken from Cornac's filmtrust pack (run only one of the two options):
ratings_df = pd.read_csv('path_to_ratings.csv')
trust_df = pd.read_csv('path_to_trust.csv')
ratings = list(zip(ratings_df.user_id, ratings_df.item_id, ratings_df.rating))
# ? - GraphModality expects (node, node, weight) triplets; the weight of 1.0 is an assumed default
trust = [(u, f, 1.0) for u, f in zip(trust_df.user_id, trust_df.friend_id)]

user_graph_modality = GraphModality(data=trust)

# * Split the dataset
ratio_split = RatioSplit(
    data=ratings,
    test_size=0.2,
    rating_threshold=2.5,
    exclude_unknowns=True,
    user_graph=user_graph_modality,
    verbose=VERBOSE,
    seed=SEED,
)
# * Error metrics
mae = cornac.metrics.MAE()
# * Execute the model to compare both the Social Recommendation vs. the normal Probabilistic Matrix Factorisation
# * If you only want SoRec, remove pmf from the list
cornac.Experiment(eval_method=ratio_split, models=[sorec, pmf], metrics=[mae]).run() 
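
What separates SoRec from PMF is that the same user factors have to explain both the ratings and the trust graph. Below is a minimal NumPy sketch of that shared-factor objective, following my understanding of the SoRec formulation (ratings rescaled to (0, 1], logistic link, one extra factor matrix for the social side). `sorec_loss`, the toy matrices and the weights are hypothetical, and this is not Cornac's code.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def sorec_loss(R, C, U, V, Z, lambda_c=1.0, lambda_reg=0.01):
    """Rating error + lambda_c * social error + L2 penalty, with U shared by both terms."""
    rating_err = (R - sigmoid(U @ V.T))[R > 0]   # only observed ratings
    social_err = (C - sigmoid(U @ Z.T))[C > 0]   # only observed trust links
    reg = (U ** 2).sum() + (V ** 2).sum() + (Z ** 2).sum()
    return float(
        0.5 * (rating_err ** 2).sum()
        + 0.5 * lambda_c * (social_err ** 2).sum()
        + 0.5 * lambda_reg * reg
    )


# Toy data: ratings rescaled to (0, 1], 0 = unobserved; C[i, j] = 1 means user i trusts user j
R = np.array([[1.0, 0.0, 0.5],
              [0.0, 0.75, 0.0],
              [0.25, 0.0, 1.0]])
C = np.array([[0.0, 1.0, 0.0],
              [1.0, 0.0, 1.0],
              [0.0, 0.0, 0.0]])

rng = np.random.default_rng(0)
k = 2
U = rng.normal(scale=0.1, size=(3, k))  # user factors (shared)
V = rng.normal(scale=0.1, size=(3, k))  # item factors
Z = rng.normal(scale=0.1, size=(3, k))  # social factors

loss_with_social = sorec_loss(R, C, U, V, Z, lambda_c=1.0)
loss_ratings_only = sorec_loss(R, C, U, V, Z, lambda_c=0.0)
print(f"with social term: {loss_with_social:.4f} | ratings only: {loss_ratings_only:.4f}")
```

With `lambda_c = 0` the social term disappears and the objective reduces to a PMF-like one, which is why comparing SoRec against PMF isolates the contribution of the trust graph.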

**Factor Interpretability and Visualising the Connections**

In [None]:
var_df = pd.DataFrame({"Factor": np.arange(K), "Variance": np.var(sorec.U, axis=0)})
fig, ax = plt.subplots(figsize=(13, 5))
# * Plot the factors against their variance
sns.barplot(
    x="Factor",
    y="Variance",
    hue="Factor",
    data=var_df,
    palette="ch:.25",
    ax=ax,
    legend=False,
);

In [None]:
TOP2F = (9, 19)  # ! - Set here the two factors with the highest variance in the plot above
SAMPLE_SIZE = 200

rng = np.random.RandomState(SEED)
sample_inds = rng.choice(np.arange(sorec.U.shape[0]), size=SAMPLE_SIZE, replace=False)
sample_df = pd.DataFrame(data=sorec.U[sample_inds][:, TOP2F], columns=["x", "y"])
g = sns.lmplot(x="x", y="y", data=sample_df, height=11.0, fit_reg=False)
g.ax.set_title("Users in latent space with their social connections", fontsize=16)

adj_mat = sorec.train_set.user_graph.matrix
for i in range(len(sample_inds)):
    for j in range(len(sample_inds)):
        if j != i and adj_mat[sample_inds[i], sample_inds[j]]:
            sns.lineplot(x="x", y="y", data=sample_df.loc[[i, j]])

#### **Text Modality** <a id='na-section2'></a>
* Collaborative Topic Regression (CTR) vs. WMF (Weighted Matrix Factorisation)
* Interpreting Recommendations

In [None]:
K = 20
ctr = CTR(
    k=K,
    max_iter=50,
    a=1.0,
    b=0.01,
    lambda_u=0.01,
    lambda_v=0.01,
    verbose=VERBOSE,
    seed=SEED,
)
wmf = WMF(
    k=K,
    max_iter=50,
    a=1.0,
    b=0.01,
    learning_rate=0.005,
    lambda_u=0.01,
    lambda_v=0.01,
    verbose=VERBOSE,
    seed=SEED,
)

In [None]:

# ! - Option A: if the data is NOT taken from the Cornac library (run only one of the two options):
# Load the ratings data
ratings_df = pd.read_csv('ratings.csv')  # Expected columns: ['user_id', 'item_id', 'rating']
ratings = list(zip(ratings_df.user_id, ratings_df.item_id, ratings_df.rating))

# Load the text data
text_df = pd.read_csv('item_text.csv')  # Expected columns: ['item_id', 'text']
docs = list(text_df.text)
item_ids = list(text_df.item_id)

# ! - Option B: if we ARE asked to use the dataset from the Cornac library:
ratings = amazon_clothing.load_feedback()
docs, item_ids = amazon_clothing.load_text()

# Prepare the Text Modality
item_text_modality = TextModality(
    corpus=docs,
    ids=item_ids,
    tokenizer=BaseTokenizer(sep=" ", stop_words="english"),
    max_vocab=8000,
    max_doc_freq=0.5
)

# Define the data split
ratio_split = RatioSplit(
    data=ratings,
    test_size=0.2,
    rating_threshold=4.0,
    exclude_unknowns=True,
    item_text=item_text_modality,  # Incorporating text modality here
    verbose=VERBOSE,
    seed=SEED
)

rec_50 = cornac.metrics.Recall(50)

cornac.Experiment(eval_method=ratio_split, models=[ctr, wmf], metrics=[rec_50]).run()


Interpretability of the recommendations:
* Get the top words of each topic
* We can select a user, see which top topics they're interested in, and then look at their recommendations
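
Why the user factors can be read as topics: in CTR, as I understand the model, an item's latent vector is its topic proportions `theta` (from the topic model on the item text) plus a learned offset `epsilon`, and the score is the dot product with the user vector. So the largest entries of a user's vector point at the topics they care about. A minimal NumPy sketch with made-up numbers:

```python
import numpy as np

# Toy topic proportions for two items over three topics (each row sums to 1)
theta = np.array([[0.8, 0.1, 0.1],
                  [0.1, 0.1, 0.8]])
# Learned per-item offsets that capture what the text alone does not explain
epsilon = np.array([[0.05, 0.00, -0.05],
                    [0.00, 0.10, 0.00]])
V = theta + epsilon                  # item latent vectors

u = np.array([0.9, 0.0, 0.2])        # one user's latent vector, living in the same topic space
scores = V @ u                       # predicted preference for each item

top_topic = int(np.argmax(u))        # the topic this user weights most
best_item = int(np.argmax(scores))   # the item that would be recommended first
print(f"scores={scores}, top topic={top_topic}, best item={best_item}")
```

The user leans heavily on topic 0, and item 0 is mostly about topic 0, so it scores highest; that is the link between the topic table and the recommendation list.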

In [None]:
vocab = ctr.train_set.item_text.vocab
topic_word_dist = ctr.model.beta.T[:, -ctr.train_set.item_text.max_vocab :]
top_word_inds = np.argsort(topic_word_dist, axis=1) + 4  # ignore 4 special tokens

topic_words = {}
topic_df = defaultdict(list)
print("WORD TOPICS:")
for t in range(len(topic_word_dist)):
    top_words = vocab.to_text(top_word_inds[t][-10:][::-1], sep=", ")
    topic_words[t + 1] = top_words
    topic_df["Topic"].append(t + 1)
    topic_df["Top words"].append(top_words)
topic_df = pd.DataFrame(topic_df)
topic_df

In [None]:
UIDX = 123
TOPK = 5

item_id2idx = ctr.train_set.iid_map
item_idx2id = list(ctr.train_set.item_ids)

print(f"USER {UIDX} TOP-3 TOPICS:")
topic_df.loc[np.argsort(ctr.U[UIDX])[-3:][::-1]]

In [None]:
recommendations, scores = ctr.rank(UIDX)
print(f"\nTOP {TOPK} RECOMMENDATIONS FOR USER {UIDX}:")
rec_df = defaultdict(list)
for i in recommendations[:TOPK]:
    rec_df["URL"].append(f"https://www.amazon.com/dp/{item_idx2id[i]}")
    rec_df["Description"].append(ctr.train_set.item_text.corpus[i])
pd.DataFrame(rec_df)

----
#### Image Modality and VBPR are not covered, cool as they are