# Recommendation Model

For the Recommendation Model, we use the **Collaborative Filtering** approach:

* *User-Item*:
    * Identify similar users.
    * Recommend new items to a user based on the ratings given by similar users.
* *Item-based*:
    * Compute the similarity between items.
    * Find the *'most similar items'* that a user has not rated yet and recommend them.

In this case, we use two kinds of filters for the functions:
- User-Item: take a user, find similar users, and recommend items those users liked. The input is the user ID and the output is a list of 5 items.
- Item-Item: take an item, find users who liked that item, and look for other items that those users (or similar users) liked. The input is the item ID and the output is a list of 5 items.
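The user-based flow can be sketched on a tiny, made-up ratings matrix. The user names, game names, ratings and the helper `recommend_for` are all assumptions for illustration. For brevity the sketch relies on the single most similar user only, which is a simplification rather than the approach used later in this notebook.

```python
import numpy as np

# Toy ratings matrix (rows = users, columns = games); 0 means "not rated".
# All names and values here are made up for illustration.
users = ["ana", "ben", "cat"]
games = ["Game A", "Game B", "Game C", "Game D"]
ratings = np.array([
    [5.0, 0.0, 0.0, 0.0],   # ana
    [5.0, 4.0, 0.0, 0.0],   # ben
    [0.0, 0.0, 3.0, 5.0],   # cat
])

def cosine(a, b):
    """Cosine similarity between two rating vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def recommend_for(user, k=5):
    """Recommend games rated by the most similar user that `user` has not rated."""
    u = users.index(user)
    # Similarity of `user` to every other user
    sims = [(cosine(ratings[u], ratings[v]), v) for v in range(len(users)) if v != u]
    _, neighbour = max(sims)
    # Games the neighbour rated and the target user did not, best rated first
    candidates = [(ratings[neighbour, g], games[g])
                  for g in range(len(games))
                  if ratings[neighbour, g] > 0 and ratings[u, g] == 0]
    return [name for _, name in sorted(candidates, reverse=True)[:k]]

print(recommend_for("ana"))
```

Here "ana" and "ben" both rated "Game A", so "ben" is the nearest neighbour and his other rated game is what gets recommended.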

## Importing libraries

In [7]:
import pandas as pd
import numpy as np
import Utilities as u
import scipy as sp
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.preprocessing import MinMaxScaler
import operator
import pyarrow as pa
import pyarrow.parquet as pq

import warnings
warnings.filterwarnings("ignore")

## Loading the datasets

In [8]:
model = pd.read_csv('./dataset/Endpoints/model_recommend.csv')
steam_games = pd.read_csv('./dataset/Modificados/steam_games.csv')

## Building the Recommendation Model

### User-Based

Let's recall the data in the *model* dataset:

In [9]:
model.head()

Unnamed: 0,user_id,item_id,app_name,rating
0,76561197970982479,1250,Killing Floor,5
1,76561197970982479,22200,Zeno Clash,5
2,76561197970982479,43110,,5
3,js41637,251610,,5
4,js41637,227300,Euro Truck Simulator 2,5


#### Pivoting the *'model'* table

In [10]:
model_ = model.pivot_table(index='user_id', columns='app_name', values='rating', aggfunc='mean')  # 'mean' avoids the deprecation warning that np.mean triggers in recent pandas

In [11]:
model_

app_name,! That Bastard Is Trying To Steal Our Gold !,//N.P.P.D. RUSH//- The milk of Ultraviolet,0RBITALIS,"10,000,000",100% Orange Juice,100% Orange Juice - Krila & Kae Character Pack,1001 Spikes,12 Labours of Hercules,12 Labours of Hercules II: The Cretan Bull,12 is Better Than 6,...,nail'd,oO,planetarian ~the reverie of a little planet~,resident evil 4 / biohazard 4,sZone-Online,the static speaks my name,theBlu,theHunter Classic,theHunter: Primal,Астролорды: Облако Оорта
user_id,,,,,,,,,,,,,,,,,,,,,
--000--,,,,,,,,,,,...,,,,,,,,,,
--ace--,,,,,,,,,,,...,,,,,,,,,,
--ionex--,,,,,,,,,,,...,,,,,,,,,,
-2SV-vuLB-Kg,,,,,,,,,,,...,,,,,,,,,,
-Azsael-,,,,,,,,,,,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
zvanik,,,,,,,,,,,...,,,,,,,,,,
zwanzigdrei,,,,,,,,,,,...,,,,,,,,,,
zy0705,,,,,,,,,,,...,,,,,,,,,,
zynxgameth,,,,,,,,,,,...,,,,,,,,,,


We can see many empty cells: most users have not rated most games, so there is no mean to compute for those user-game pairs.
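A minimal sketch of why the pivot produces empty cells, using a made-up three-row table with the same column names as the *model* dataset (the users and ratings are invented):

```python
import pandas as pd

# Made-up long-format ratings, same column names as the `model` dataset
toy = pd.DataFrame({
    "user_id":  ["u1", "u1", "u2"],
    "app_name": ["Killing Floor", "Terraria", "Terraria"],
    "rating":   [5, 3, 4],
})

toy_pivot = toy.pivot_table(index="user_id", columns="app_name",
                            values="rating", aggfunc="mean")
print(toy_pivot)
# u2 never rated "Killing Floor", so that cell is NaN
```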

With this matrix we will compute the similarity between games and between users.

First, we replace the empty cells with zero.

In [12]:
model_.fillna(0, inplace=True)

#### Applying *Min-Max normalization*
Min-Max normalization is a technique commonly used to scale features to a specific range, usually [0, 1]. It puts all features on the same scale and keeps any single feature from dominating the others.
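As a rough illustration, the formula behind Min-Max scaling is `x' = (x - min) / (max - min)`, applied column by column. The helper `min_max_scale` and the toy values below are assumptions for this sketch. To our understanding `MinMaxScaler` applies the same formula with its default `feature_range=(0, 1)`, but this is a hand-written stand-in, not the library code.

```python
import numpy as np

def min_max_scale(X):
    """Scale each column of X to [0, 1]; constant columns are mapped to 0."""
    X = np.asarray(X, dtype=float)
    col_min = X.min(axis=0)
    col_range = X.max(axis=0) - col_min
    col_range[col_range == 0] = 1.0  # avoid division by zero
    return (X - col_min) / col_range

# Made-up ratings: rows = users, columns = games, 0 = not rated
toy = np.array([[0, 5],
                [3, 0],
                [1, 4]])
print(min_max_scale(toy))
```

Since missing ratings were filled with 0, the minimum of a column is 0 whenever at least one user has not rated that game, so in practice each rating is divided by the highest rating that game received.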

We create an instance of the scaler.

In [13]:
scaler = MinMaxScaler()

We normalize the *model_* DataFrame.

In [14]:
model_normalized = scaler.fit_transform(model_)

In [15]:
model_normalized

array([[0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.]])

We convert the array into a DataFrame.

In [16]:
model_normalized_df = pd.DataFrame(model_normalized, columns=model_.columns, index=model_.index)

In [17]:
model_normalized_df

app_name,! That Bastard Is Trying To Steal Our Gold !,//N.P.P.D. RUSH//- The milk of Ultraviolet,0RBITALIS,"10,000,000",100% Orange Juice,100% Orange Juice - Krila & Kae Character Pack,1001 Spikes,12 Labours of Hercules,12 Labours of Hercules II: The Cretan Bull,12 is Better Than 6,...,nail'd,oO,planetarian ~the reverie of a little planet~,resident evil 4 / biohazard 4,sZone-Online,the static speaks my name,theBlu,theHunter Classic,theHunter: Primal,Астролорды: Облако Оорта
user_id,,,,,,,,,,,,,,,,,,,,,
--000--,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
--ace--,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
--ionex--,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
-2SV-vuLB-Kg,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
-Azsael-,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
zvanik,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
zwanzigdrei,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
zy0705,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
zynxgameth,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


#### Creating a sparse matrix from this DataFrame

In [18]:
model_sparse = sp.sparse.csr_matrix(model_normalized_df.values)
model_sparse

<24380x3195 sparse matrix of type '<class 'numpy.float64'>'
	with 53197 stored elements in Compressed Sparse Row format>
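To give a feel for how sparse this matrix is, here is a small sketch with a made-up dense array, followed by the same density calculation using the shape and element count reported above (the result is well under 0.1%).

```python
import numpy as np
from scipy import sparse

# Small dense matrix that is mostly zeros (made-up values)
dense = np.array([
    [0.0, 0.0, 1.0, 0.0],
    [0.0, 0.0, 0.0, 0.0],
    [0.6, 0.0, 0.0, 1.0],
])
toy_sparse = sparse.csr_matrix(dense)

# Only the non-zero entries are stored
print(toy_sparse.nnz)
density = toy_sparse.nnz / (dense.shape[0] * dense.shape[1])
print(f"toy density: {density:.2%}")

# Same calculation for the 24380 x 3195 matrix with 53197 stored elements
real_density = 53197 / (24380 * 3195)
print(f"real density: {real_density:.4%}")
```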

#### Computing the similarity on the matrix

In [19]:
users_similarity = cosine_similarity(model_sparse)

Let's look at the matrix:

In [20]:
users_similarity

array([[1.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.        , 1.        , 0.        , ..., 0.70710678, 0.        ,
        0.18190172],
       [0.        , 0.        , 1.        , ..., 0.        , 0.        ,
        0.30316953],
       ...,
       [0.        , 0.70710678, 0.        , ..., 1.        , 0.        ,
        0.25724788],
       [0.        , 0.        , 0.        , ..., 0.        , 1.        ,
        0.        ],
       [0.        , 0.18190172, 0.30316953, ..., 0.25724788, 0.        ,
        1.        ]])

In [21]:
users_similarity.shape

(24380, 24380)
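For reference, cosine similarity is `cos(a, b) = (a · b) / (‖a‖ ‖b‖)`. The sketch below uses made-up rating vectors to show one way a value like the 0.70710678 seen in the matrix can arise: one user rated a single game and the other rated that game plus one more. Whether that is the actual situation for those users is an assumption.

```python
import numpy as np

def cosine(a, b):
    """cos(a, b) = (a . b) / (||a|| * ||b||)"""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Made-up normalized rating vectors over three games
user_a = [1.0, 0.0, 0.0]   # rated only the first game
user_b = [1.0, 1.0, 0.0]   # rated the first and second games
user_c = [0.0, 0.0, 1.0]   # rated only the third game

print(round(cosine(user_a, user_b), 8))  # one shared game, 1 / sqrt(2)
print(cosine(user_a, user_c))            # no games in common
```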

We convert this matrix to a DataFrame for easier inspection.

In [22]:
df_users_similarity = pd.DataFrame(users_similarity, index=model_.index, columns=model_.index) 

In [23]:
df_users_similarity.head()

user_id,--000--,--ace--,--ionex--,-2SV-vuLB-Kg,-Azsael-,-Beave-,-I_AM_EPIC-,-Kenny,-Mad-,-PRoSlayeR-,...,zuilde,zukuta,zunbae,zuzuga2003,zv_odd,zvanik,zwanzigdrei,zy0705,zynxgameth,zyr0n1c
user_id,,,,,,,,,,,,,,,,,,,,,
--000--,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
--ace--,0.0,1.0,0.0,0.405554,0.0,0.0,0.707107,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.707107,0.707107,0.0,0.181902
--ionex--,0.0,0.0,1.0,0.405554,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.707107,0.456435,0.0,0.353553,0.0,0.0,0.0,0.30317
-2SV-vuLB-Kg,0.0,0.405554,0.405554,1.0,0.0,0.0,0.573539,0.0,0.0,0.0,...,0.0,0.0,0.573539,0.370218,0.0,0.28677,0.573539,0.573539,0.0,0.393445
-Azsael-,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


We reset the index.

In [29]:
df_users_similarity.reset_index(inplace=True)

In [30]:
df_users_similarity.rename_axis(None, axis=1, inplace=True)

We save the DataFrame to a CSV file.

In [32]:
df_users_similarity.to_csv('./dataset/Endpoints/Recommendation System/users_similarity.csv', encoding='utf-8', index=False)

We save the DataFrame to a Parquet file.

In [None]:
pq.write_table(pa.Table.from_pandas(df_users_similarity), './dataset/Endpoints/Recommendation System/users_similarity.parquet')

Let's test it by looking for users similar to a given user:

In [19]:
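def top_users(user):
    # If reset_index() moved user_id into a column, use it as the index again
    # so that the results are user IDs rather than row positions
    sim = df_users_similarity
    if 'user_id' in sim.columns:
        sim = sim.set_index('user_id')

    print('TOP 5 usuarios similares a {}:'.format(user))
    print('-----' * 8)

    # Skip position 0, which is normally the user itself (similarity 1.0)
    ranked = sim.sort_values(by=user, ascending=False).index[1:6]
    for count, similar_user in enumerate(ranked, start=1):
        print('Nro. {}: {}'.format(count, similar_user))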

In [20]:
top_users('--000--')

TOP 5 usuarios similares a --000--:
----------------------------------------
Nro. 1: 76561198066714498
Nro. 2: trollviper
Nro. 3: LesDexter
Nro. 4: 76561198081529182
Nro. 5: Llamadyl


Let's verify:

In [25]:
model[model.user_id == '--000--']

Unnamed: 0,user_id,item_id,app_name,rating
20667,--000--,1250,Killing Floor,3


In [26]:
model[model.user_id == '76561198066714498']

Unnamed: 0,user_id,item_id,app_name,rating
6746,76561198066714498,1250,Killing Floor,5


In [27]:
model[model.user_id == 'trollviper']

Unnamed: 0,user_id,item_id,app_name,rating
47189,trollviper,1250,Killing Floor,5


In [28]:
model[model.user_id == 'LesDexter']

Unnamed: 0,user_id,item_id,app_name,rating
33361,LesDexter,1250,Killing Floor,3


In [29]:
top_users('--ionex--')

TOP 5 usuarios similares a --ionex--:
----------------------------------------
Nro. 1: zomethingelses
Nro. 2: Hypahh
Nro. 3: jamesaway
Nro. 4: 76561198074447562
Nro. 5: LukasCoady


In [30]:
model[model.user_id == '--ionex--']

Unnamed: 0,user_id,item_id,app_name,rating
31482,--ionex--,730,Counter-Strike: Global Offensive,5
31483,--ionex--,105600,Terraria,5


In [31]:
model[model.user_id == 'zomethingelses']

Unnamed: 0,user_id,item_id,app_name,rating
53129,zomethingelses,730,Counter-Strike: Global Offensive,5
53130,zomethingelses,105600,Terraria,5


In [32]:
model[model.user_id == 'Hypahh']

Unnamed: 0,user_id,item_id,app_name,rating
46524,Hypahh,105600,Terraria,5
46525,Hypahh,730,Counter-Strike: Global Offensive,5


At first glance, the **User-Based** *Recommendation Model* seems to work well.

Now let's move on to the items...

### Item-based

We compute the similarities on the sparse matrix again, this time between items.

In [14]:
item_similarity = cosine_similarity(model_sparse.T)  # transpose so that rows are items

In [17]:
item_similarity

array([[1., 0., 0., ..., 0., 0., 0.],
       [0., 1., 0., ..., 0., 0., 0.],
       [0., 0., 1., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 1., 0., 0.],
       [0., 0., 0., ..., 0., 1., 0.],
       [0., 0., 0., ..., 0., 0., 1.]])

We convert this matrix to a DataFrame.

In [18]:
df_item_similarity = pd.DataFrame(item_similarity, index=model_.columns, columns=model_.columns)

In [19]:
df_item_similarity.head()

app_name,! That Bastard Is Trying To Steal Our Gold !,//N.P.P.D. RUSH//- The milk of Ultraviolet,0RBITALIS,"10,000,000",100% Orange Juice,100% Orange Juice - Krila & Kae Character Pack,1001 Spikes,12 Labours of Hercules,12 Labours of Hercules II: The Cretan Bull,12 is Better Than 6,...,nail'd,oO,planetarian ~the reverie of a little planet~,resident evil 4 / biohazard 4,sZone-Online,the static speaks my name,theBlu,theHunter Classic,theHunter: Primal,Астролорды: Облако Оорта
app_name,,,,,,,,,,,,,,,,,,,,,
! That Bastard Is Trying To Steal Our Gold !,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.356235,0.0,0.0,0.0,0.0
//N.P.P.D. RUSH//- The milk of Ultraviolet,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
0RBITALIS,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
"10,000,000",0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
100% Orange Juice,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


We reset the index.

In [40]:
df_item_similarity.reset_index(inplace=True)

In [41]:
df_item_similarity.rename_axis(None, axis=1, inplace=True)

In [42]:
df_item_similarity.head(3)

Unnamed: 0,app_name,! That Bastard Is Trying To Steal Our Gold !,//N.P.P.D. RUSH//- The milk of Ultraviolet,0RBITALIS,"10,000,000",100% Orange Juice,100% Orange Juice - Krila & Kae Character Pack,1001 Spikes,12 Labours of Hercules,12 Labours of Hercules II: The Cretan Bull,...,nail'd,oO,planetarian ~the reverie of a little planet~,resident evil 4 / biohazard 4,sZone-Online,the static speaks my name,theBlu,theHunter Classic,theHunter: Primal,Астролорды: Облако Оорта
0,! That Bastard Is Trying To Steal Our Gold !,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.356235,0.0,0.0,0.0,0.0
1,//N.P.P.D. RUSH//- The milk of Ultraviolet,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0RBITALIS,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


We save the item matrix as a binary NumPy file.<br>
Before that, we sort each row by descending similarity and keep only the 5 most similar items.

In [44]:
matriz_test = np.argsort(-item_similarity, axis=1)[:,1:6]

In [45]:
matriz_test

array([[1200, 3190, 2288,  847, 1726],
       [1135,  434, 1400,  349, 2614],
       [1222, 2123, 2124, 2125, 2126],
       ...,
       [  86, 2014, 2287,  353, 3112],
       [1014, 3123, 1046,  693, 3050],
       [2429, 1121, 2123, 2124, 2125]], dtype=int64)
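The `argsort` step can be followed on a small, made-up similarity matrix. The values and the choice of keeping 2 neighbours instead of 5 are assumptions to keep the example readable.

```python
import numpy as np

# Made-up 4x4 similarity matrix (symmetric, ones on the diagonal)
sim = np.array([
    [1.0, 0.2, 0.9, 0.0],
    [0.2, 1.0, 0.1, 0.7],
    [0.9, 0.1, 1.0, 0.3],
    [0.0, 0.7, 0.3, 1.0],
])

# Negating the matrix makes argsort return indices from most to least similar
order = np.argsort(-sim, axis=1)
top2 = order[:, 1:3]   # drop column 0 (the item itself here), keep the next 2
print(top2)
```

Each row of the result holds the column positions of the most similar items for that row's item. Note that dropping column 0 assumes the item itself ranks first. If another item also has similarity 1.0, the tie can be resolved either way, so the item itself may end up in the kept slice.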

In [46]:
np.save('items_similatiry.npy', matriz_test)

We create a function that, given a game, returns 5 similar games.

In [43]:
def item_test(game: str):
    # Precomputed indices of the 5 most similar items for every item
    cosine_sim = np.load('./items_similatiry.npy')

    # Row position of the requested game
    idx = df_item_similarity[df_item_similarity['app_name'] == game].index[0]

    # Look up the names of the recommended games
    rec_indices = cosine_sim[idx]
    rec_games = df_item_similarity.iloc[rec_indices]['app_name']

    print('TOP 5 juegos similares a {}:'.format(game))
    print('-----' * 8)

    for count, rec_game in enumerate(rec_games, start=1):
        print('Nro. {}: {}'.format(count, rec_game))

In [44]:
item_test('Counter-Strike')

TOP 5 juegos similares a Counter-Strike:
----------------------------------------
Nro. 1: Streets of Rage 2
Nro. 2: Days Under Custody
Nro. 3: Serious Sam Classic: The First Encounter
Nro. 4: Obscure II (Obscure: The Aftermath)
Nro. 5: Half-Life Deathmatch: Source


Good. Now we create a function that takes a user as a parameter and recommends 5 games that similar users liked.

We transpose the DataFrame so that *users* are the columns and *games* are the rows.

In [45]:
user_item = model_normalized_df.T

In [80]:
user_item.head()

user_id,--000--,--ace--,--ionex--,-2SV-vuLB-Kg,-Azsael-,-Beave-,-I_AM_EPIC-,-Kenny,-Mad-,-PRoSlayeR-,...,zuilde,zukuta,zunbae,zuzuga2003,zv_odd,zvanik,zwanzigdrei,zy0705,zynxgameth,zyr0n1c
app_name,,,,,,,,,,,,,,,,,,,,,
! That Bastard Is Trying To Steal Our Gold !,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
//N.P.P.D. RUSH//- The milk of Ultraviolet,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
0RBITALIS,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
"10,000,000",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
100% Orange Juice,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


We reset the index.

In [81]:
user_item.reset_index(inplace=True)

In [82]:
user_item.rename_axis(None, axis=1, inplace=True)

In [83]:
user_item.head()

Unnamed: 0,app_name,--000--,--ace--,--ionex--,-2SV-vuLB-Kg,-Azsael-,-Beave-,-I_AM_EPIC-,-Kenny,-Mad-,...,zuilde,zukuta,zunbae,zuzuga2003,zv_odd,zvanik,zwanzigdrei,zy0705,zynxgameth,zyr0n1c
0,! That Bastard Is Trying To Steal Our Gold !,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,//N.P.P.D. RUSH//- The milk of Ultraviolet,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0RBITALIS,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,"10,000,000",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,100% Orange Juice,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


We save this DataFrame to a CSV file.

In [88]:
user_item.to_csv('./dataset/Endpoints/Recommendation System/user_items.csv', encoding='utf-8', index=False)

We save this DataFrame to a Parquet file.

In [66]:
pq.write_table(pa.Table.from_pandas(user_item), './dataset/Endpoints/Recommendation System/user_items.parquet')


Now we create the function using this DataFrame.

In [24]:
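def game_for_user(user):
    # Recommend up to 5 games that the 10 most similar users rated highest
    if user not in user_item.columns:
        return 'No hay registros con ese usuario'

    # If reset_index() moved user_id into a column, use it as the index again
    # so that similar users are identified by ID rather than by row position
    sim = df_users_similarity
    if 'user_id' in sim.columns:
        sim = sim.set_index('user_id')
    sim_users = sim.sort_values(by=user, ascending=False).index[1:11]

    # Count how often each game is a similar user's top-rated game
    most_common = {}
    for i in sim_users:
        max_score = user_item.loc[:, i].max()
        best_games = user_item[user_item.loc[:, i] == max_score].index.tolist()

        for game in best_games:
            most_common[game] = most_common.get(game, 0) + 1

    # Keep the 5 most frequent games
    sorted_list = sorted(most_common.items(), key=operator.itemgetter(1), reverse=True)
    top_games = sorted_list[:5]

    # Map row positions back to game names
    game_names = [user_item.loc[game_id, 'app_name'] for game_id, _ in top_games]

    for count, game_name in enumerate(game_names, start=1):
        print('Nro. {}: {}'.format(count, game_name))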

In [25]:
game_for_user('--000--')

Nro. 1: Killing Floor
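In the output above, the only recommendation is *Killing Floor*, a game that `--000--` has already rated. One possible refinement, which is not part of the original function, is to filter out games the target user has already rated. The sketch below is self-contained and uses made-up data. The helper `unseen_games` and the choice of scoring by the sum of neighbour ratings are assumptions.

```python
import pandas as pd

# Made-up data: rows = games, columns = users (same orientation as `user_item`)
toy_user_item = pd.DataFrame(
    {"u1": [1.0, 0.0, 0.0], "u2": [1.0, 0.8, 0.0], "u3": [0.6, 0.0, 1.0]},
    index=["Killing Floor", "Terraria", "Zeno Clash"],
)

def unseen_games(user, neighbours, n=5):
    """Rank games rated by `neighbours` that `user` has not rated yet."""
    # Sum of neighbour ratings per game as a simple score
    scores = toy_user_item[neighbours].sum(axis=1)
    # Drop the games the target user already rated and games nobody rated
    already_rated = toy_user_item[user] > 0
    scores = scores[~already_rated & (scores > 0)]
    return scores.sort_values(ascending=False).head(n).index.tolist()

print(unseen_games("u1", ["u2", "u3"]))
```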


### Item-Item

In this case, we build a Recommendation Model based on the *tags* of each item.

#### Extracting the required data

We take the columns needed for this model:
* id: unique identifier of the content.
* app_name: name of the content.
* tags: tags of the content.

In [68]:
games = steam_games[['id', 'app_name', 'tags']].copy()  # copy so later in-place edits don't act on a slice of steam_games

#### General information

Let's take a look at the DataFrame.

In [69]:
games.head()

Unnamed: 0,id,app_name,tags
0,761140.0,Lost Summoner Kitty,"['Strategy', 'Action', 'Indie', 'Casual', 'Sim..."
1,643980.0,Ironbound,"['Free to Play', 'Strategy', 'Indie', 'RPG', '..."
2,670290.0,Real Pool 3D - Poolians,"['Free to Play', 'Simulation', 'Sports', 'Casu..."
3,767400.0,弹炸人2222,"['Action', 'Adventure', 'Casual']"
4,773570.0,Log Challenge,"['Action', 'Indie', 'Casual', 'Sports']"


In [70]:
games.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 32132 entries, 0 to 32131
Data columns (total 3 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   id        32132 non-null  float64
 1   app_name  32131 non-null  object 
 2   tags      31970 non-null  object 
dtypes: float64(1), object(2)
memory usage: 753.2+ KB


#### Cleaning the dataset

We check the percentage of null values.

In [71]:
u.porcentaje_nulos(games)

La columna id tiene un 0.0% de valores nulos.
La columna app_name tiene un 0.0% de valores nulos.
La columna tags tiene un 0.5% de valores nulos.


We drop the nulls.

In [72]:
games.dropna(inplace=True)

Let's check the data type in the *'tags'* column.

In [73]:
type(games['tags'][0])

str

We convert the values to lists so we can work with them later.

In [74]:
games['tags'] = games['tags'].apply(u.convert_to_list)

We verify the change.

In [75]:
type(games['tags'][0])

list
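The `Utilities` module is not shown in this notebook, so the body of `u.convert_to_list` is unknown. As a hypothetical stand-in, a conversion from a string such as `"['Action', 'Indie']"` to a Python list could look like the sketch below. The function name `convert_to_list_sketch` and its fallback to an empty list are assumptions.

```python
import ast

def convert_to_list_sketch(value):
    """Hypothetical stand-in for u.convert_to_list: parse "['a', 'b']" into a list."""
    if isinstance(value, list):
        return value
    try:
        # literal_eval only evaluates Python literals, unlike eval
        parsed = ast.literal_eval(value)
    except (ValueError, SyntaxError):
        return []
    return list(parsed) if isinstance(parsed, (list, tuple)) else []

print(convert_to_list_sketch("['Action', 'Indie', 'Casual', 'Sports']"))
```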

#### Transforming the dataset for the model

We split the tags (*tags*) into separate columns.

In [76]:
games_ = games['tags'].apply(','.join).str.split(',', expand=True)

In [77]:
games_.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19
0,Strategy,Action,Indie,Casual,Simulation,,,,,,,,,,,,,,,
1,Free to Play,Strategy,Indie,RPG,Card Game,Trading Card Game,Turn-Based,Fantasy,Tactical,Dark Fantasy,Board Game,PvP,2D,Competitive,Replay Value,Character Customization,Female Protagonist,Difficult,Design & Illustration,
2,Free to Play,Simulation,Sports,Casual,Indie,Multiplayer,,,,,,,,,,,,,,
3,Action,Adventure,Casual,,,,,,,,,,,,,,,,,
4,Action,Indie,Casual,Sports,,,,,,,,,,,,,,,,


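We generate dummy variables. Each column name is prefixed with the tag's position (for example `0_Action`), so the same tag appearing in different positions is encoded as separate columns.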

In [78]:
games_ = pd.get_dummies(games_)

In [79]:
games_.head()

Unnamed: 0,0_2D,0_2D Fighter,0_360 Video,0_3D Platformer,0_4 Player Local,0_4X,0_Action,0_Action RPG,0_Adventure,0_America,...,19_Underwater,19_Utilities,19_VR,19_Video Production,19_Violent,19_Visual Novel,19_Voxel,19_Walking Simulator,19_Wargame,19_Zombies
0,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
1,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
2,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
3,False,False,False,False,False,False,True,False,False,False,...,False,False,False,False,False,False,False,False,False,False
4,False,False,False,False,False,False,True,False,False,False,...,False,False,False,False,False,False,False,False,False,False
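Because the dummy columns above depend on tag position, two games that share a tag in different positions do not share a column. A position-independent encoding is one alternative. The sketch below is not what this notebook does: it uses made-up tag lists and plain pandas to produce one column per tag regardless of position.

```python
import pandas as pd

# Made-up tag lists, in the same shape as the `tags` column
toy_tags = pd.Series([
    ["Strategy", "Action", "Indie"],
    ["Action", "Adventure"],
])

# One row per (game, tag), one-hot encode, then collapse back to one row per game
tag_matrix = pd.get_dummies(toy_tags.explode()).groupby(level=0).max()
print(tag_matrix)
```

With this layout both games have a truthy value in the single `Action` column, so a similarity measure computed on it would count the shared tag.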


We concatenate both DataFrames.

In [80]:
games_final = pd.concat([games, games_], axis= 1)

In [81]:
games_final.head()

Unnamed: 0,id,app_name,tags,0_2D,0_2D Fighter,0_360 Video,0_3D Platformer,0_4 Player Local,0_4X,0_Action,...,19_Underwater,19_Utilities,19_VR,19_Video Production,19_Violent,19_Visual Novel,19_Voxel,19_Walking Simulator,19_Wargame,19_Zombies
0,761140.0,Lost Summoner Kitty,"[Strategy, Action, Indie, Casual, Simulation]",False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
1,643980.0,Ironbound,"[Free to Play, Strategy, Indie, RPG, Card Game...",False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
2,670290.0,Real Pool 3D - Poolians,"[Free to Play, Simulation, Sports, Casual, Ind...",False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
3,767400.0,弹炸人2222,"[Action, Adventure, Casual]",False,False,False,False,False,False,True,...,False,False,False,False,False,False,False,False,False,False
4,773570.0,Log Challenge,"[Action, Indie, Casual, Sports]",False,False,False,False,False,False,True,...,False,False,False,False,False,False,False,False,False,False


We reset the index.

In [82]:
games_final.reset_index(drop=True,inplace= True)

We save the DataFrame to a CSV file.

In [77]:
games_final.to_csv('./dataset/Endpoints/Recommendation System/item_items.csv', encoding='utf-8', index=False)

We save the DataFrame to a Parquet file.

In [85]:
pq.write_table(pa.Table.from_pandas(games_final), './dataset/Endpoints/Recommendation System/item_items.parquet')

Now we compute the cosine similarity.

In [24]:
similarity = cosine_similarity(games_final.iloc[:,3:])

In [25]:
similarity

array([[1.        , 0.10259784, 0.18257419, ..., 0.        , 0.        ,
        0.36514837],
       [0.10259784, 1.        , 0.09365858, ..., 0.        , 0.        ,
        0.09365858],
       [0.18257419, 0.09365858, 1.        , ..., 0.23570226, 0.        ,
        0.        ],
       ...,
       [0.        , 0.        , 0.23570226, ..., 1.        , 0.23570226,
        0.        ],
       [0.        , 0.        , 0.        , ..., 0.23570226, 1.        ,
        0.        ],
       [0.36514837, 0.09365858, 0.        , ..., 0.        , 0.        ,
        1.        ]])

In [26]:
similarity.shape

(31969, 31969)

We convert the resulting matrix to a DataFrame.

In [71]:
item_similarity_df = pd.DataFrame(similarity, index=games_final['app_name'], columns=games_final['app_name'])

In [72]:
item_similarity_df.head()

app_name,Lost Summoner Kitty,Ironbound,Real Pool 3D - Poolians,弹炸人2222,Log Challenge,Battle Royale Trainer,SNOW - All Access Basic Pass,SNOW - All Access Pro Pass,SNOW - All Access Legend Pass,Race,...,The spy who shot me™,Raining blocks,Bravium,BAE 2,Kebab it Up!,Colony On Mars,LOGistICAL: South Africa,Russian Roads,EXIT 2 - Directions,Maze Run VR
app_name,,,,,,,,,,,,,,,,,,,,,
Lost Summoner Kitty,1.0,0.102598,0.182574,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.258199,0.0,0.0,0.0,0.0,0.223607,0.258199,0.0,0.0,0.365148
Ironbound,0.102598,1.0,0.093659,0.0,0.0,0.0,0.114708,0.114708,0.114708,0.0,...,0.132453,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.093659
Real Pool 3D - Poolians,0.182574,0.093659,1.0,0.0,0.0,0.0,0.204124,0.204124,0.204124,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.235702,0.0,0.0
弹炸人2222,0.0,0.0,0.0,1.0,0.57735,0.408248,0.0,0.0,0.0,0.0,...,0.666667,0.0,0.0,0.0,0.516398,0.288675,0.333333,0.0,0.0,0.235702
Log Challenge,0.0,0.0,0.0,0.57735,1.0,0.176777,0.5,0.5,0.5,0.0,...,0.288675,0.0,0.0,0.0,0.67082,0.5,0.57735,0.0,0.0,0.0


Let's test it by looking for games similar to *Counter-Strike*.

In [73]:
item_similarity_df['Counter-Strike'].sort_values(ascending=False)

app_name
Counter-Strike                                   1.000000
Call of Duty®: Advanced Warfare - Season Pass    0.447214
Blacklight: Tango Down                           0.387298
Call of Duty®: Ghosts - Onslaught                0.365148
Shadow Ops: Red Mercury                          0.335410
                                                   ...   
Bobbi_Cities                                     0.000000
Glo                                              0.000000
The Girl on the Train                            0.000000
Tank Force                                       0.000000
Maze Run VR                                      0.000000
Name: Counter-Strike, Length: 31969, dtype: float64

We create a function that, given a game, returns 5 similar games.

In [74]:
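def recommend_item(item):
    print('TOP 5 juegos similares a {}:'.format(item))
    print('-----' * 8)

    # Sort the whole DataFrame by the item's column and skip position 0,
    # which is normally the item itself (similarity 1.0)
    ranked = item_similarity_df.sort_values(by=item, ascending=False).index[1:6]
    for count, game in enumerate(ranked, start=1):
        print('Nro. {}: {}'.format(count, game))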

In [75]:
recommend_item('Counter-Strike')

TOP 5 juegos similares a Counter-Strike:
----------------------------------------
Nro. 1: Call of Duty®: Advanced Warfare - Season Pass
Nro. 2: Blacklight: Tango Down
Nro. 3: Call of Duty®: Ghosts - Onslaught
Nro. 4: Shadow Ops: Red Mercury
Nro. 5: Call of Duty®: Ghosts - Devastation


The function takes quite a while to respond, because the DataFrame built from the matrix is 31969 x 31969. So instead we precompute, for each game, the indices of its most similar games and save that much smaller array as a binary NumPy file. <br>
The .npy format serializes and deserializes arrays efficiently and without losing precision, so the precomputed result can be loaded back quickly.
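A back-of-the-envelope estimate shows why a small precomputed array helps. The sketch assumes 8 bytes per value (float64 similarities, int64 indices) and keeps 5 neighbours per game. It compares the in-memory size of the full 31969 x 31969 matrix, roughly 8 GB, with that of a 31969 x 5 index array, about 1.3 MB.

```python
n_games = 31969          # rows/columns of the similarity matrix reported above
bytes_per_value = 8      # assumption: float64 similarities, int64 indices
top_k = 5

full_matrix_bytes = n_games * n_games * bytes_per_value
top_k_bytes = n_games * top_k * bytes_per_value

print(f"full similarity matrix: {full_matrix_bytes / 1e9:.2f} GB")
print(f"top-{top_k} index array: {top_k_bytes / 1e6:.2f} MB")
```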

First, for each row of the matrix, we sort the indices of the elements from most to least similar. Then we keep only the 5 entries after the first one (position 0 is normally the game itself).

In [None]:
similarity_ = np.argsort(-similarity, axis=1)[:, 1:6]

Now we save the array to a *.npy* file.

In [56]:
np.save('similarity.npy', similarity_)

We create the function again: given a game, it returns 5 similar games.

In [64]:
def game_recommendation(game: str):
    # Precomputed indices of the 5 most similar games for every game
    cosine_sim = np.load('./similarity.npy')

    # Row position of the requested game
    idx = games_final[games_final['app_name'] == game].index[0]

    # Look up the names of the recommended games
    rec_indices = cosine_sim[idx]
    rec_games = games_final.iloc[rec_indices]['app_name']

    print('TOP 5 juegos similares a {}:'.format(game))
    print('-----' * 8)

    for count, rec_game in enumerate(rec_games, start=1):
        print('Nro. {}: {}'.format(count, rec_game))


In [65]:
game_recommendation('Counter-Strike')

TOP 5 juegos similares a Counter-Strike:
----------------------------------------
Nro. 1: Call of Duty®: Advanced Warfare - Season Pass
Nro. 2: Blacklight: Tango Down
Nro. 3: Call of Duty®: Ghosts - Onslaught
Nro. 4: Project: Snowblind
Nro. 5: Call of Duty®: Ghosts - Devastation
