# Feature Engineering

In [36]:
import sys
import os
import pandas as pd
import numpy as np
import scipy as sp
import textblob
import sklearn
from textblob import TextBlob
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.preprocessing import StandardScaler

print(f"System version: {sys.version}")
print(f"pandas version: {pd.__version__}")
print(f"numpy version: {np.__version__}")
print(f"scipy version: {sp.__version__}")
print(f"scikit-learn version: {sklearn.__version__}")

System version: 3.11.8 (main, Mar 12 2024, 11:52:02) [GCC 12.2.0]
pandas version: 2.2.1
numpy version: 1.26.4
scipy version: 1.12.0
scikit-learn version: 1.4.1.post1


## Extraction

In this section we load the steam_games, user_items and user_reviews datasets, which are stored in Parquet format.

In [37]:
# Load the Parquet files into a dict of DataFrames keyed by file name
def read_parquet_files(parquet_files):
    dataframes = {}
    for name in parquet_files:
        dataframes[name] = pd.read_parquet(f'../dataset/{name}.parquet', engine='pyarrow')
    return dataframes


parquet_files = ['steam_games', 'user_items', 'user_reviews']
dataframes = read_parquet_files(parquet_files)

# Unpack the dict into one DataFrame per dataset
df_steam_games = dataframes['steam_games']
df_user_items = dataframes['user_items']
df_user_reviews = dataframes['user_reviews']

In [38]:


# Build a DataFrame that combines the required information
combined_df = pd.merge(df_user_items, df_steam_games[['item_id', 'release_year', 'genres']], on='item_id', how='left')

# Group by genre and release year, summing the hours played
grouped = combined_df.groupby(['genres', 'release_year'])['playtime_forever'].sum().reset_index()

# For each genre, keep the release year with the most hours played
max_playtime_by_genre_year = grouped.loc[grouped.groupby('genres')['playtime_forever'].idxmax()]

# Convert the 'release_year' column from float to integer
max_playtime_by_genre_year['release_year'] = max_playtime_by_genre_year['release_year'].astype(int)

# Show the result
max_playtime_by_genre_year

Unnamed: 0,genres,release_year,playtime_forever
26,Action,2012,18931510.0
55,Adventure,2006,7555119.0
73,Animation & Modeling,2016,1227.233
76,Audio Production,2014,7779.483
107,Casual,2015,3686977.0
114,Design & Illustration,2016,1175.2
116,Early Access,2013,1981715.0
128,Education,2010,1383026.0
149,Free to Play,2013,2490635.0
167,Indie,2006,7435650.0
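
The group-then-`idxmax` pattern used in the cell above can be sketched on a small table. This is a minimal illustration with made-up values, not the notebook's data: `toy`, `totals` and `best` are names introduced here only for the example.

```python
import pandas as pd

# Toy stand-in for combined_df; the values are invented for illustration
toy = pd.DataFrame({
    "genres": ["Action", "Action", "Action", "Indie", "Indie"],
    "release_year": [2011, 2012, 2012, 2006, 2010],
    "playtime_forever": [10.0, 30.0, 5.0, 8.0, 2.0],
})

# Total playtime per (genre, release year)
totals = (
    toy.groupby(["genres", "release_year"])["playtime_forever"]
    .sum()
    .reset_index()
)

# idxmax gives, for each genre, the row label of its largest total;
# .loc then selects those rows from the aggregated frame
best_rows = totals.loc[totals.groupby("genres")["playtime_forever"].idxmax()]

# Map each genre to its top release year
best = best_rows.set_index("genres")["release_year"].to_dict()
print(best)
```

Because `idxmax` returns a single label per group, ties between years are resolved in favor of the first occurrence.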


We add a `sentiment_analysis` column to the user_reviews dataset, using NLP to score the sentiment of each game review. This gives a compact numeric summary of user opinion. Reviews are scored as follows:

0: Negative (dissatisfaction, dislike, disappointment)
1: Neutral (indifference, objectivity, no emotion)
2: Positive (satisfaction, enjoyment, admiration)

We define a function **`analisis_sentimiento`** that uses TextBlob to score each review. It relies on the polarity score, which ranges from -1 to 1: negative polarity maps to 0, zero polarity maps to 1, and positive polarity maps to 2. Missing reviews are treated as neutral.

In [39]:
def analisis_sentimiento(review):
    # If the review is missing, return 1 (neutral)
    if pd.isnull(review):
        return 1

    # Compute the polarity of the review with TextBlob
    polarity = TextBlob(review).sentiment.polarity

    # Return 0 (negative) if polarity < 0, 2 (positive) if polarity > 0, and 1 (neutral) otherwise
    if polarity < 0:
        return 0
    elif polarity > 0:
        return 2
    else:
        return 1
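
The thresholding step can be checked in isolation from TextBlob. The sketch below assumes the polarity score is already available; `label_from_polarity` is a hypothetical helper introduced only for this example, and the scores are hand-picked rather than computed from real reviews.

```python
def label_from_polarity(polarity):
    """Map a polarity score in [-1, 1] to 0 (negative), 1 (neutral) or 2 (positive)."""
    if polarity < 0:
        return 0
    if polarity > 0:
        return 2
    return 1


# Hypothetical polarity scores, chosen by hand rather than computed by TextBlob
scores = [-0.6, 0.0, 0.35, -1.0, 1.0]
labels = [label_from_polarity(s) for s in scores]
print(labels)
```

Note that only a polarity of exactly 0 is labeled neutral, so even a very weak positive or negative score is assigned to class 2 or 0.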

- We apply the function to the review column.

In [40]:
df_user_reviews.sample(5)

Unnamed: 0,item_id,recommend,review,user_id,posted_year
28499,297020,True,Buen juego a simple vista se ve malo pero es m...,jkajkablamblam,2015
37992,265630,True,10/10 would pass the whiskey to myself again.....,TheAmityPrincess,2014
55961,221640,True,It's a great game. Once you have a go of it yo...,76561198089076206,2014
46820,22370,True,"Does not run well on windows 7 at all, great n...",76561198036297711,2014
58378,440,True,มันคับ,76561198120570508,2014


In [41]:
df_user_reviews['sentiment_analysis'] = df_user_reviews['review'].apply(analisis_sentimiento)
df_user_reviews[['review','sentiment_analysis']].sample(5)

Unnamed: 0,review,sentiment_analysis
36424,NEVER DOWNLOAD THIS GAME BECAUSE YOU CAN NEVER...,0
46800,"An absolute masterpiece from the Gameplay, to ...",2
35391,This game is a great game to play with other p...,0
29210,For anyone coming from Borderlands 2 wanting t...,2
15316,"Man, what a rollercoaster.",1


## Creating Datasets for the API Endpoints

The goal of this section is to build several datasets that act as pseudo-databases for the API endpoint functions. This lets each endpoint retrieve the data it needs quickly, without loading the full datasets, which improves API performance.

## Creating Pseudo-Database 1
#### (API endpoints def PlayTimeGenre( genero : str ): and def UserForGenre( genero : str ):)
To build a single dataset that serves as the pseudo-database for both endpoints, we merge df_steam_games and df_user_items so that all required information sits in one place. The columns needed are item_id, genres and release_year from df_steam_games, and item_id, user_id and playtime_forever from df_user_items.

1. Select only the required columns:
```python
steam_games_columns = ['item_id','genres','release_year']
user_items_columns = ['item_id','user_id', 'playtime_forever']
```
2. Create subsets of the DataFrames with only those columns:
```python
df_games_subset = df_steam_games[steam_games_columns]
df_items_subset = df_user_items[user_items_columns]
```

In [42]:
steam_games_columns = ['item_id','genres','release_year']
user_items_columns = ['item_id','user_id', 'playtime_forever']

# Take explicit copies so the dtype assignments below modify independent
# DataFrames instead of slices of the originals (avoids SettingWithCopyWarning)
df_games_subset = df_steam_games[steam_games_columns].copy()
df_items_subset = df_user_items[user_items_columns].copy()

# Make sure 'item_id' has the same dtype in both DataFrames before merging
df_games_subset['item_id'] = df_games_subset['item_id'].astype('int')
df_items_subset['item_id'] = df_items_subset['item_id'].astype('int')

# Inner join on 'item_id'
df_endpoints1_2 = pd.merge(df_games_subset, df_items_subset, on='item_id')
df_endpoints1_2.head(5)
df_endpoints1_2.shape


(15255102, 5)
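
A small sketch of the same merge on toy data may help explain the row count. It assumes that the games table holds one row per (item_id, genre) pair, which this notebook does not state explicitly; all values and the names `games`, `items` and `merged` are invented for the example.

```python
import pandas as pd

# Toy games table: assumed to hold one row per (item_id, genre) pair,
# with item_id stored as strings
games = pd.DataFrame({
    "item_id": ["10", "10", "20"],
    "genres": ["Action", "Indie", "Strategy"],
    "release_year": [2006, 2006, 2012],
})

# Toy user-items table with item_id stored as integers
items = pd.DataFrame({
    "item_id": [10, 20, 30],
    "user_id": ["u1", "u1", "u2"],
    "playtime_forever": [15.0, 3.0, 7.0],
})

# Align the key dtype, then inner-join on item_id
games["item_id"] = games["item_id"].astype("int")
merged = pd.merge(games, items, on="item_id")
print(merged.shape)
```

Under that assumption, a game with two genres contributes its playtime to both genre rows, and items without a match in the games table (item 30 here) are dropped by the default inner join.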

In [43]:
# The 10 most frequent genres (defined for reference; not applied in the filter below)
top_10_popular_genres = ['Action', 'Adventure', 'Indie', 'Strategy', 'RPG', 'Simulation', 'Casual', 'Massively Multiplayer', 'Racing', 'Sports']

# Keep only rows with a known release year and a positive playtime
df_endpoints1_2 = df_endpoints1_2[(df_endpoints1_2['release_year'] != 'unknown') & (df_endpoints1_2['playtime_forever'] > 0)].reset_index(drop=True)
df_endpoints1_2.head()
print(df_endpoints1_2.shape)

(10511145, 5)
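
The filter combines two boolean masks with `&`. The sketch below reproduces it on toy data and also shows how a genre shortlist could be added with `isin`; that extra condition is an assumption for illustration and is not applied in this notebook. The names `toy`, `top_genres`, `filtered` and `filtered_top` are hypothetical.

```python
import pandas as pd

top_genres = ["Action", "Indie"]  # hypothetical shortlist for the toy data

toy = pd.DataFrame({
    "genres": ["Action", "Indie", "Utilities", "Action"],
    "release_year": ["2012", "unknown", "2014", "2010"],
    "playtime_forever": [5.0, 3.0, 1.0, 0.0],
})

# The same two conditions as in the notebook cell: known year, positive playtime
mask = (toy["release_year"] != "unknown") & (toy["playtime_forever"] > 0)
filtered = toy[mask].reset_index(drop=True)

# Optional extra condition (not used in the notebook): keep shortlisted genres only
filtered_top = toy[mask & toy["genres"].isin(top_genres)].reset_index(drop=True)

print(len(filtered), len(filtered_top))
```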


In [44]:
df_endpoints1_2['release_year'] = df_endpoints1_2['release_year'].astype('int16')
df_endpoints1_2['playtime_forever'] = df_endpoints1_2['playtime_forever'].astype('float32')
df_endpoints1_2.memory_usage(deep=True)

Index                     132
item_id              84089160
genres              681639958
release_year         21022290
user_id             739967023
playtime_forever     42044580
dtype: int64
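
The effect of narrower dtypes on memory can be measured on a toy frame. The sketch below repeats the int16 and float32 casts from the cell above and, as an extra assumption not used in this notebook, also converts the repetitive `genres` strings to the `category` dtype. All data is synthetic.

```python
import numpy as np
import pandas as pd

n = 10_000
toy = pd.DataFrame({
    "genres": np.tile(["Action", "Indie", "Strategy", "RPG"], n // 4),
    "release_year": np.tile([2006, 2010, 2012, 2015], n // 4),
    "playtime_forever": np.linspace(0.1, 500.0, n),
})

before = toy.memory_usage(deep=True)

# int16 and float32 mirror the notebook; 'category' for genres is an
# additional, assumed optimization that stores each distinct string once
slim = toy.astype({
    "release_year": "int16",
    "playtime_forever": "float32",
    "genres": "category",
})
after = slim.memory_usage(deep=True)

print(before.sum(), after.sum())
```

int16 covers values up to 32767, which is enough for years, while float32 keeps roughly seven significant decimal digits, so very large playtime totals lose some precision.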

- Finally, we build a pivot table indexed by user_id and release_year, with one column per genre and the summed playtime_forever as values.

In [45]:
df_endpoints1_2 = df_endpoints1_2.pivot_table(index=['user_id', 'release_year'], columns='genres', values='playtime_forever', aggfunc='sum', fill_value=0)
df_endpoints1_2

user_id,release_year,Action,Adventure,Animation & Modeling,Audio Production,Casual,Design & Illustration,Early Access,Education,Free to Play,Indie,...,RPG,Racing,Simulation,Software Training,Sports,Strategy,Utilities,Video Production,Web Publishing,unknown
--000--,2006,15.416667,15.416667,0.0,0.0,0.000000,0.0,0.0,0.000000,0.000000,15.416667,...,0.000000,0.000000,15.416667,0.0,0.000000,0.000000,0.0,0.0,0.0,0.0
--000--,2009,88.816666,88.816666,0.0,0.0,0.000000,0.0,0.0,0.000000,0.000000,0.000000,...,0.000000,0.000000,0.000000,0.0,0.000000,0.000000,0.0,0.0,0.0,0.0
--000--,2010,0.366667,0.000000,0.0,0.0,0.000000,0.0,0.0,0.000000,0.366667,0.366667,...,0.000000,0.000000,0.000000,0.0,0.000000,0.000000,0.0,0.0,0.0,0.0
--000--,2011,108.699997,108.699997,0.0,0.0,0.000000,0.0,0.0,0.000000,46.049999,30.616665,...,62.649998,46.049999,11.083333,0.0,0.000000,11.083333,0.0,0.0,0.0,0.0
--000--,2012,1822.516724,37.150002,0.0,0.0,30.016666,0.0,0.0,0.000000,10.500000,37.700001,...,29.516666,0.000000,0.000000,0.0,7.683333,1796.400024,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
zzzmidmiss,2010,7.783334,0.166667,0.0,0.0,3.916667,0.0,0.0,0.683333,4.550000,7.950000,...,0.000000,0.000000,3.233333,0.0,3.233333,3.400000,0.0,0.0,0.0,0.0
zzzmidmiss,2011,38.366665,38.366665,0.0,0.0,1.250000,0.0,0.0,0.000000,0.266667,1.750000,...,37.599998,0.266667,0.000000,0.0,0.000000,1.150000,0.0,0.0,0.0,0.0
zzzmidmiss,2012,98.366669,61.650005,0.0,0.0,6.083333,0.0,0.1,0.000000,22.549999,51.316666,...,45.500000,0.000000,6.450000,0.0,0.000000,15.383334,0.0,0.0,0.0,0.0
zzzmidmiss,2013,1.633333,1.750000,0.0,0.0,0.283333,0.0,0.0,0.000000,0.166667,1.750000,...,0.166667,0.000000,0.000000,0.0,0.000000,1.466667,0.0,0.0,0.0,0.0
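
One way the pivot table could serve the two endpoints is sketched below on toy data. The helpers `year_with_most_playtime` and `user_with_most_playtime` are hypothetical names introduced here; the notebook does not show how PlayTimeGenre and UserForGenre are actually implemented, so this is only an assumption about their lookups.

```python
import pandas as pd

# Toy long-format data; values are invented for illustration
toy = pd.DataFrame({
    "user_id": ["u1", "u1", "u1", "u2", "u2"],
    "release_year": [2006, 2006, 2012, 2012, 2012],
    "genres": ["Action", "Indie", "Action", "Action", "Indie"],
    "playtime_forever": [15.0, 15.0, 40.0, 100.0, 5.0],
})

# Same pivot_table call shape as in the notebook cell
pivot = toy.pivot_table(
    index=["user_id", "release_year"],
    columns="genres",
    values="playtime_forever",
    aggfunc="sum",
    fill_value=0,
)


def year_with_most_playtime(pivot, genre):
    """Hypothetical helper: release year with the largest total playtime for a genre."""
    per_year = pivot[genre].groupby(level="release_year").sum()
    return int(per_year.idxmax())


def user_with_most_playtime(pivot, genre):
    """Hypothetical helper: user with the largest total playtime for a genre."""
    per_user = pivot[genre].groupby(level="user_id").sum()
    return per_user.idxmax()


print(year_with_most_playtime(pivot, "Action"), user_with_most_playtime(pivot, "Action"))
```

With the genre as a column, each lookup reduces to selecting one column and summing over one index level, which avoids scanning the long-format table on every request.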


In [46]:
max_playtime_by_genre_year

Unnamed: 0,genres,release_year,playtime_forever
26,Action,2012,18931510.0
55,Adventure,2006,7555119.0
73,Animation & Modeling,2016,1227.233
76,Audio Production,2014,7779.483
107,Casual,2015,3686977.0
114,Design & Illustration,2016,1175.2
116,Early Access,2013,1981715.0
128,Education,2010,1383026.0
149,Free to Play,2013,2490635.0
167,Indie,2006,7435650.0
