In [38]:
import warnings
warnings.filterwarnings("ignore")

# Initialization

We import the libraries used to process and explore the data: `pandas`, `numpy`, `json`, `gzip`, `datetime`, and `ast`.

In [39]:
import pandas as pd
import numpy as np
import ast
import json
import gzip
from datetime import datetime

%load_ext autoreload
%autoreload 2

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


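# Extracting the User Reviews Data

We read the records from the compressed file "user_reviews.json.gz", load them into a pandas DataFrame, and take a first look at the contents. Each line of the file is a Python-style dictionary literal (single quotes, `True`/`False`) rather than strict JSON, which is why the records are parsed with `ast.literal_eval` instead of `json.loads`.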


In [40]:
# Path to the compressed file
file_path = 'user_reviews.json.gz'

# List that collects the parsed records
data_list = []

# Decompress the file and read it line by line
with gzip.open(file_path, 'rt', encoding='utf-8') as file:
    for line in file:
        # Each line is a Python dict literal, not strict JSON
        data = ast.literal_eval(line)
        data_list.append(data)

# Build a DataFrame from the list of records
data_reviews = pd.DataFrame(data_list)

In [41]:
data_reviews

Unnamed: 0,user_id,user_url,reviews
0,76561197970982479,http://steamcommunity.com/profiles/76561197970...,"[{'funny': '', 'posted': 'Posted November 5, 2..."
1,js41637,http://steamcommunity.com/id/js41637,"[{'funny': '', 'posted': 'Posted June 24, 2014..."
2,evcentric,http://steamcommunity.com/id/evcentric,"[{'funny': '', 'posted': 'Posted February 3.',..."
3,doctr,http://steamcommunity.com/id/doctr,"[{'funny': '', 'posted': 'Posted October 14, 2..."
4,maplemage,http://steamcommunity.com/id/maplemage,"[{'funny': '3 people found this review funny',..."
...,...,...,...
25794,76561198306599751,http://steamcommunity.com/profiles/76561198306...,"[{'funny': '', 'posted': 'Posted May 31.', 'la..."
25795,Ghoustik,http://steamcommunity.com/id/Ghoustik,"[{'funny': '', 'posted': 'Posted June 17.', 'l..."
25796,76561198310819422,http://steamcommunity.com/profiles/76561198310...,"[{'funny': '1 person found this review funny',..."
25797,76561198312638244,http://steamcommunity.com/profiles/76561198312...,"[{'funny': '', 'posted': 'Posted July 21.', 'l..."


# Dataset

To explore the data we first read the `shape` attribute to get the total number of rows and columns, then use the `head()` and `tail()` methods for a first look at the DataFrame. The dataset contains the following fields:

- **user_id:** Unique identifier assigned to each user.
- **user_url:** URL of the user's profile on SteamCommunity.
- **reviews:** A list of dictionaries. Each dictionary corresponds to one review written by the user and contains the following subfields:
    - **funny:** Text stating how many people found the review funny (for example "3 people found this review funny"); empty when nobody did.
    - **posted:** Date the review was posted (format "Posted November 5, 2011."; some entries omit the year, such as "Posted February 3.").
    - **last_edited:** Date of the last edit.
    - **item_id:** Unique identifier of the game.
    - **helpful:** Text stating how many users found the review helpful (for example "15 of 20 people (75%) found this review helpful", or "No ratings yet").
    - **recommend:** Whether the user recommends the game (Boolean).
    - **review:** The user's comments about the game.
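The `helpful` subfield is free text, so any numeric use of it needs parsing first. The following is a minimal sketch, not part of the original notebook: `parse_helpful` and `HELPFUL_RE` are hypothetical names, and the sample record only mirrors the field layout described above (its review text is elided).

```python
import re

# Hypothetical helper (not in the original notebook): extracts the vote
# counts from strings such as "15 of 20 people (75%) found this review helpful".
HELPFUL_RE = re.compile(r"(\d[\d,]*) of (\d[\d,]*) people")

def parse_helpful(text):
    """Return (helpful_votes, total_votes); (0, 0) when no counts are present."""
    match = HELPFUL_RE.search(text)
    if match is None:
        return (0, 0)  # e.g. "No ratings yet"
    helpful, total = (int(group.replace(",", "")) for group in match.groups())
    return (helpful, total)

# Illustrative record with the same subfields as the dataset.
sample_review = {
    "funny": "",
    "posted": "Posted June 24, 2014.",
    "last_edited": "",
    "item_id": "251610",
    "helpful": "15 of 20 people (75%) found this review helpful",
    "recommend": True,
    "review": "...",
}

helpful_votes = parse_helpful(sample_review["helpful"])
no_votes = parse_helpful("No ratings yet")
```

Returning `(0, 0)` for unrated reviews is a design choice of this sketch; returning `None` would keep "no ratings" distinguishable from "zero helpful votes".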


In [42]:
data_reviews.shape

(25799, 3)

In [43]:
data_reviews.head()

Unnamed: 0,user_id,user_url,reviews
0,76561197970982479,http://steamcommunity.com/profiles/76561197970...,"[{'funny': '', 'posted': 'Posted November 5, 2..."
1,js41637,http://steamcommunity.com/id/js41637,"[{'funny': '', 'posted': 'Posted June 24, 2014..."
2,evcentric,http://steamcommunity.com/id/evcentric,"[{'funny': '', 'posted': 'Posted February 3.',..."
3,doctr,http://steamcommunity.com/id/doctr,"[{'funny': '', 'posted': 'Posted October 14, 2..."
4,maplemage,http://steamcommunity.com/id/maplemage,"[{'funny': '3 people found this review funny',..."


In [44]:
data_reviews.tail()

Unnamed: 0,user_id,user_url,reviews
25794,76561198306599751,http://steamcommunity.com/profiles/76561198306...,"[{'funny': '', 'posted': 'Posted May 31.', 'la..."
25795,Ghoustik,http://steamcommunity.com/id/Ghoustik,"[{'funny': '', 'posted': 'Posted June 17.', 'l..."
25796,76561198310819422,http://steamcommunity.com/profiles/76561198310...,"[{'funny': '1 person found this review funny',..."
25797,76561198312638244,http://steamcommunity.com/profiles/76561198312...,"[{'funny': '', 'posted': 'Posted July 21.', 'l..."
25798,LydiaMorley,http://steamcommunity.com/id/LydiaMorley,"[{'funny': '1 person found this review funny',..."


Our DataFrame initially has 25,799 rows and 3 columns. At first glance there are no missing values, but a more detailed analysis is needed to identify possible problems in the data and correct them where necessary.

What we can already see is that the "reviews" column holds a list of dictionaries, each one describing a review written by the user. To move forward with the analysis we will need to examine this column more closely.

# Data Preparation and Cleaning

Before analyzing and modeling the data, we need to make sure it is clean and well prepared. This means handling missing values, removing duplicates, correcting errors, and converting data to suitable formats. It also means reviewing each variable up front to detect problems and apply the necessary corrections, so that the data is reliable and consistent for later analysis.

In [45]:
data_reviews.isnull().sum()

user_id     0
user_url    0
reviews     0
dtype: int64

The dataset has no missing values in any of its columns.

# Inspecting the Structure of Some Variables

In [46]:
data_reviews['user_id'][0]

'76561197970982479'

In [47]:
data_reviews['user_url'][0]

'http://steamcommunity.com/profiles/76561197970982479'

In [48]:
data_reviews['reviews'][137]

[{'funny': '',
  'posted': 'Posted September 13, 2015.',
  'last_edited': '',
  'item_id': '570',
  'helpful': 'No ratings yet',
  'recommend': True,
  'review': 'Played for a bit...enjoyed it. ( ͡° ͜ʖ ͡°)'},
 {'funny': '1 person found this review funny',
  'posted': 'Posted July 26, 2013.',
  'last_edited': '',
  'item_id': '212680',
  'helpful': 'No ratings yet',
  'recommend': True,
  'review': 'Great game. Funner than what I expected.'},
 {'funny': '',
  'posted': 'Posted May 18, 2013.',
  'last_edited': '',
  'item_id': '200210',
  'helpful': 'No ratings yet',
  'recommend': True,
  'review': 'good to play with friends'},
 {'funny': '',
  'posted': 'Posted October 30, 2011.',
  'last_edited': '',
  'item_id': '440',
  'helpful': 'No ratings yet',
  'recommend': True,
  'review': 'AWESOME'}]

# Checking for Duplicate Data

We check the 'user_id' column of the DataFrame for duplicates, to determine whether any records share this unique identifier.

In [86]:
#duplicated_reviews = data_reviews.loc[data_reviews['user_id'].duplicated(keep=False)]
#if not duplicated_reviews.empty:
#    print("Registros duplicados")
#else:
#    print("No se encontraron registros duplicados en id. Total: 0")
#duplicated_reviews

623 duplicate records were identified in the dataset. We now look more closely to determine whether the duplication comes from the reviews contained in the nested 'reviews' structure or from the same user ('user_id') appearing in several rows with different reviews.


To do so, we compare the duplicated records by printing the reviews of specific users whose 'user_id' appears more than once.
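The check described above can be sketched on a small made-up DataFrame. This is an illustration under assumptions, not the notebook's own cell: the toy data and the variable names are invented, and comparing lists through `repr` is just one simple way around lists being unhashable.

```python
import pandas as pd

# Toy data (made up for illustration); 'u1' appears twice.
toy = pd.DataFrame({
    "user_id": ["u1", "u2", "u1", "u3"],
    "reviews": [
        [{"review": "a"}],
        [{"review": "b"}],
        [{"review": "a"}],
        [{"review": "c"}],
    ],
})

# keep=False flags every row of a duplicated group, not only the repeats.
dup_mask = toy["user_id"].duplicated(keep=False)
n_rows_in_dup_groups = int(dup_mask.sum())

# keep='first' flags only the extra copies, i.e. the rows that
# drop_duplicates(subset='user_id', keep='first') would remove.
n_extra_copies = int(toy["user_id"].duplicated(keep="first").sum())

# Do duplicated users carry identical review lists?
# Lists are unhashable, so compare their string representation instead.
same_content = bool(
    toy[dup_mask]
    .assign(reviews_repr=lambda df: df["reviews"].map(repr))
    .groupby("user_id")["reviews_repr"]
    .nunique()
    .eq(1)
    .all()
)
```

Note that the two counts answer different questions: the first is the number of rows involved in any duplication, the second is the number of rows that would actually be dropped.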

In [49]:
user_id = '76561198092022514'  # The 'user_id' of the user to inspect
user_reviews = data_reviews[data_reviews['user_id'] == user_id]['reviews']

for review_list in user_reviews:
    for review in review_list:
        print("Review:", review['review'])
    print('*******************************')


Review: Muy entretenido y una coleccion de armas prometedora, un buen soundtrack, 9/10
Review: Tiene una jugabilidad y tematica muy buena :D, El poder subir stats para caracterizar a tu manera de jugar es buena, lo que si le falta la cantidad excesiva de dlc que posee, es algo descarado...
Review: Buen juego, no importa el desarrrollo que tiene con tan solo que de desarrolladores nuevos es un muy buen trabajo,un buen mod y a la espera del act 2 :DD tiene todo lo escencial de un juego de esta categoria no importa los graficos si no una jugabilidad buena
Review: exelente aporte :D¡¡¡ es una buen mod basado en half life 2 muy bueno aparte de ser co-op situaciones reales ej puedes infectarte, cortes profundos etc xDDDDDDDD
*******************************
Review: Muy entretenido y una coleccion de armas prometedora, un buen soundtrack, 9/10
Review: Tiene una jugabilidad y tematica muy buena :D, El poder subir stats para caracterizar a tu manera de jugar es buena, lo que si le falta la canti

The reviews are identical across the repeated records, which confirms that these rows are true duplicates. We therefore remove them, keeping only the first occurrence of each 'user_id'. This avoids counting the same reviews more than once in the later analysis.

In [50]:
data_reviews.drop_duplicates(subset='user_id', keep='first', inplace=True)
data_reviews

Unnamed: 0,user_id,user_url,reviews
0,76561197970982479,http://steamcommunity.com/profiles/76561197970...,"[{'funny': '', 'posted': 'Posted November 5, 2..."
1,js41637,http://steamcommunity.com/id/js41637,"[{'funny': '', 'posted': 'Posted June 24, 2014..."
2,evcentric,http://steamcommunity.com/id/evcentric,"[{'funny': '', 'posted': 'Posted February 3.',..."
3,doctr,http://steamcommunity.com/id/doctr,"[{'funny': '', 'posted': 'Posted October 14, 2..."
4,maplemage,http://steamcommunity.com/id/maplemage,"[{'funny': '3 people found this review funny',..."
...,...,...,...
25794,76561198306599751,http://steamcommunity.com/profiles/76561198306...,"[{'funny': '', 'posted': 'Posted May 31.', 'la..."
25795,Ghoustik,http://steamcommunity.com/id/Ghoustik,"[{'funny': '', 'posted': 'Posted June 17.', 'l..."
25796,76561198310819422,http://steamcommunity.com/profiles/76561198310...,"[{'funny': '1 person found this review funny',..."
25797,76561198312638244,http://steamcommunity.com/profiles/76561198312...,"[{'funny': '', 'posted': 'Posted July 21.', 'l..."


In [49]:
data_reviews.isnull().sum()

user_id     0
user_url    0
reviews     0
dtype: int64

In [50]:
data_reviews['reviews'][9]

[{'funny': '',
  'posted': 'Posted June 16.',
  'last_edited': '',
  'item_id': '252950',
  'helpful': '0 of 1 people (0%) found this review helpful',
  'recommend': True,
  'review': 'love it'}]

In [51]:
# Confirm that no duplicates remain in the 'user_id' column of the DataFrame
duplicated_reviews = data_reviews.loc[data_reviews['user_id'].duplicated(keep=False)]
if not duplicated_reviews.empty:
    print("Registros duplicados")
else:
    print("No se encontraron registros duplicados en user_id. Total: 0")

Registros duplicados


# Transforming the 'reviews' Column and Handling Null Values

The 'reviews' column is nested: each value is a list containing one or more dictionaries. We apply a series of transformations to reorganize this information, split it into independent columns, and keep the data consistent.

First, each element of the list in the 'reviews' column becomes its own column.
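As an alternative sketch of the same flattening, `DataFrame.explode` followed by `pd.json_normalize` yields one row per review directly. The toy data and variable names below are made up for illustration, and this is a different technique from the column-wise approach used in this notebook, shown only for comparison.

```python
import pandas as pd

# Toy data (made up) with the same nesting as the 'reviews' column.
toy = pd.DataFrame({
    "user_id": ["u1", "u2"],
    "user_url": ["http://example.com/u1", "http://example.com/u2"],
    "reviews": [
        [{"item_id": "10", "recommend": True}, {"item_id": "20", "recommend": False}],
        [{"item_id": "30", "recommend": True}],
    ],
})

# One row per review. Users with an empty list would produce NaN, so drop
# those rows, then reset the index so it lines up with json_normalize's output.
exploded = (
    toy.explode("reviews")
    .dropna(subset=["reviews"])
    .reset_index(drop=True)
)

# Expand each dictionary into columns, prefixed with 'reviews_'.
fields = pd.json_normalize(exploded["reviews"].tolist()).add_prefix("reviews_")

# Both frames now share a 0..n-1 index, so the column-wise concat aligns.
flat = pd.concat([exploded[["user_id", "user_url"]], fields], axis=1)
```

Resetting the index before the concat matters: `pd.concat(axis=1)` aligns on index labels, and a frame with gaps in its index would otherwise pair users with the wrong reviews.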

In [52]:
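# Turn the elements of each list into columns
data_reviews_2 = pd.json_normalize(data_reviews['reviews'])

# Add the 'user_id' and 'user_url' columns next to the separated columns.
# Note: pd.concat(axis=1) aligns on the index and json_normalize returns a fresh 0..n-1 index,
# so the rows only line up if data_reviews has a default index (reset it after dropping rows).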
data_reviews_2 = pd.concat([data_reviews[['user_id', 'user_url']], data_reviews_2], axis=1)


data_reviews_2.head()


Unnamed: 0,user_id,user_url,0,1,2,3,4,5,6,7,8,9
0,76561197970982479,http://steamcommunity.com/profiles/76561197970...,"{'funny': '', 'posted': 'Posted November 5, 20...","{'funny': '', 'posted': 'Posted July 15, 2011....","{'funny': '', 'posted': 'Posted April 21, 2011...",,,,,,,
1,js41637,http://steamcommunity.com/id/js41637,"{'funny': '', 'posted': 'Posted June 24, 2014....","{'funny': '', 'posted': 'Posted September 8, 2...","{'funny': '', 'posted': 'Posted November 29, 2...",,,,,,,
2,evcentric,http://steamcommunity.com/id/evcentric,"{'funny': '', 'posted': 'Posted February 3.', ...","{'funny': '', 'posted': 'Posted December 4, 20...","{'funny': '', 'posted': 'Posted November 3, 20...","{'funny': '', 'posted': 'Posted October 15, 20...","{'funny': '', 'posted': 'Posted October 15, 20...","{'funny': '', 'posted': 'Posted October 15, 20...",,,,
3,doctr,http://steamcommunity.com/id/doctr,"{'funny': '', 'posted': 'Posted October 14, 20...","{'funny': '', 'posted': 'Posted July 28, 2012....","{'funny': '', 'posted': 'Posted June 2, 2012.'...","{'funny': '', 'posted': 'Posted June 29, 2014....","{'funny': '', 'posted': 'Posted November 22, 2...","{'funny': '', 'posted': 'Posted February 23, 2...",,,,
4,maplemage,http://steamcommunity.com/id/maplemage,"{'funny': '3 people found this review funny', ...","{'funny': '1 person found this review funny', ...","{'funny': '2 people found this review funny', ...","{'funny': '', 'posted': 'Posted July 11, 2013....",,,,,,


Next, we turn the columns into individual rows while keeping the identifying fields 'user_id' and 'user_url'. We then drop the rows that contain null values ('None'), which correspond to users with fewer reviews than the number of columns.

In [53]:
# Use pd.melt to turn the columns into rows while keeping 'user_id' and 'user_url'
# Note: list(range(9)) selects columns 0 to 8 only; column 9 shown above is not included
data_reviews_2 = pd.melt(data_reviews_2, id_vars=['user_id', 'user_url'], 
                       value_vars=list(range(9)),
                       value_name='reviews')

# Drop the rows whose value is None
data_reviews_2 = data_reviews_2.dropna()

In [54]:
data_reviews_2

Unnamed: 0,user_id,user_url,variable,reviews
0,76561197970982479,http://steamcommunity.com/profiles/76561197970...,0,"{'funny': '', 'posted': 'Posted November 5, 20..."
1,js41637,http://steamcommunity.com/id/js41637,0,"{'funny': '', 'posted': 'Posted June 24, 2014...."
2,evcentric,http://steamcommunity.com/id/evcentric,0,"{'funny': '', 'posted': 'Posted February 3.', ..."
3,doctr,http://steamcommunity.com/id/doctr,0,"{'funny': '', 'posted': 'Posted October 14, 20..."
4,maplemage,http://steamcommunity.com/id/maplemage,0,"{'funny': '3 people found this review funny', ..."
...,...,...,...,...
231919,SKELETRONPRIMEISOP,http://steamcommunity.com/id/SKELETRONPRIMEISOP,8,"{'funny': '', 'posted': 'Posted August 15, 201..."
231921,76561198141079508,http://steamcommunity.com/profiles/76561198141...,8,"{'funny': '', 'posted': 'Posted August 2, 2014..."
232047,ShadowYT100,http://steamcommunity.com/id/ShadowYT100,8,"{'funny': '', 'posted': 'Posted July 31, 2015...."
232127,bestcustomurlevermade,http://steamcommunity.com/id/bestcustomurlever...,8,"{'funny': '', 'posted': 'Posted December 20, 2..."


In [55]:
data_reviews_2.isnull().sum()

user_id     0
user_url    0
variable    0
reviews     0
dtype: int64

In [56]:
import pandas as pd

# Use pd.json_normalize to expand the dictionaries in 'reviews' into columns
data_reviews_expanded = pd.json_normalize(data_reviews_2['reviews'])




In [57]:
data_reviews_expanded.isnull().sum()

funny          0
posted         0
last_edited    0
item_id        0
helpful        0
recommend      0
review         0
dtype: int64

In [58]:
# Rename the columns with 'reviews_' as a prefix
data_reviews_expanded.columns = ['reviews_' + col for col in data_reviews_expanded.columns]


In [59]:
data_reviews_expanded.columns

Index(['reviews_funny', 'reviews_posted', 'reviews_last_edited',
       'reviews_item_id', 'reviews_helpful', 'reviews_recommend',
       'reviews_review'],
      dtype='object')

In [60]:
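# Add the 'user_id' and 'user_url' columns at the front
# Note: json_normalize returned a fresh 0..n-1 index while data_reviews_2 keeps its melted index,
# so this join misaligns rows (see the empty trailing rows in the output of the next cell)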
data_reviews_expanded = data_reviews_2[['user_id', 'user_url']].join(data_reviews_expanded)

data_reviews = data_reviews_expanded

In [61]:
data_reviews

Unnamed: 0,user_id,user_url,reviews_funny,reviews_posted,reviews_last_edited,reviews_item_id,reviews_helpful,reviews_recommend,reviews_review
0,76561197970982479,http://steamcommunity.com/profiles/76561197970...,,"Posted November 5, 2011.",,1250,No ratings yet,True,Simple yet with great replayability. In my opi...
1,js41637,http://steamcommunity.com/id/js41637,,"Posted June 24, 2014.",,251610,15 of 20 people (75%) found this review helpful,True,I know what you think when you see this title ...
2,evcentric,http://steamcommunity.com/id/evcentric,,Posted February 3.,,248820,No ratings yet,True,A suitably punishing roguelike platformer. Wi...
3,doctr,http://steamcommunity.com/id/doctr,,"Posted October 14, 2013.",,250320,2 of 2 people (100%) found this review helpful,True,This game... is so fun. The fight sequences ha...
4,maplemage,http://steamcommunity.com/id/maplemage,3 people found this review funny,"Posted April 15, 2014.",,211420,35 of 43 people (81%) found this review helpful,True,Git gud
...,...,...,...,...,...,...,...,...,...
231919,SKELETRONPRIMEISOP,http://steamcommunity.com/id/SKELETRONPRIMEISOP,,,,,,,
231921,76561198141079508,http://steamcommunity.com/profiles/76561198141...,,,,,,,
232047,ShadowYT100,http://steamcommunity.com/id/ShadowYT100,,,,,,,
232127,bestcustomurlevermade,http://steamcommunity.com/id/bestcustomurlever...,,,,,,,


In [62]:
# Use .apply(pd.Series, dtype='object') to expand the dictionaries in 'reviews';
# unlike json_normalize, this keeps the original index, so the join below aligns
data_reviews_expanded = data_reviews_2['reviews'].apply(pd.Series, dtype='object')

# Rename the columns with 'reviews_' as a prefix
data_reviews_expanded.columns = ['reviews_' + col for col in data_reviews_expanded.columns]

# Add the 'user_id' and 'user_url' columns at the front
data_reviews_expanded = data_reviews_2[['user_id', 'user_url']].join(data_reviews_expanded)

# Keep the result as the working DataFrame
data_reviews = data_reviews_expanded


In [63]:
data_reviews

Unnamed: 0,user_id,user_url,reviews_funny,reviews_posted,reviews_last_edited,reviews_item_id,reviews_helpful,reviews_recommend,reviews_review
0,76561197970982479,http://steamcommunity.com/profiles/76561197970...,,"Posted November 5, 2011.",,1250,No ratings yet,True,Simple yet with great replayability. In my opi...
1,js41637,http://steamcommunity.com/id/js41637,,"Posted June 24, 2014.",,251610,15 of 20 people (75%) found this review helpful,True,I know what you think when you see this title ...
2,evcentric,http://steamcommunity.com/id/evcentric,,Posted February 3.,,248820,No ratings yet,True,A suitably punishing roguelike platformer. Wi...
3,doctr,http://steamcommunity.com/id/doctr,,"Posted October 14, 2013.",,250320,2 of 2 people (100%) found this review helpful,True,This game... is so fun. The fight sequences ha...
4,maplemage,http://steamcommunity.com/id/maplemage,3 people found this review funny,"Posted April 15, 2014.",,211420,35 of 43 people (81%) found this review helpful,True,Git gud
...,...,...,...,...,...,...,...,...,...
231919,SKELETRONPRIMEISOP,http://steamcommunity.com/id/SKELETRONPRIMEISOP,,"Posted August 15, 2014.","Last edited November 3, 2014.",440,No ratings yet,True,TF2 is alot of fun and its really good but the...
231921,76561198141079508,http://steamcommunity.com/profiles/76561198141...,,"Posted August 2, 2014.",,304930,No ratings yet,True,Fun game with friends
232047,ShadowYT100,http://steamcommunity.com/id/ShadowYT100,,"Posted July 31, 2015.",,265630,No ratings yet,True,So Fun!! :D
232127,bestcustomurlevermade,http://steamcommunity.com/id/bestcustomurlever...,,"Posted December 20, 2015.",,304050,No ratings yet,True,"This game is great. The only thing is,Why cant..."


# Removing Irrelevant Columns

We now refine the dataset by dropping the "reviews_funny" and "reviews_last_edited" columns. They are not relevant to the analysis and are empty for a large share of the reviews, so they add little information. Dropping them simplifies the data for the steps that follow.

In [64]:
# Drop the 'reviews_funny' and 'reviews_last_edited' columns
data_reviews = data_reviews_expanded.drop(columns=['reviews_funny', 'reviews_last_edited'])

In [65]:
data_reviews.head(3)

Unnamed: 0,user_id,user_url,reviews_posted,reviews_item_id,reviews_helpful,reviews_recommend,reviews_review
0,76561197970982479,http://steamcommunity.com/profiles/76561197970...,"Posted November 5, 2011.",1250,No ratings yet,True,Simple yet with great replayability. In my opi...
1,js41637,http://steamcommunity.com/id/js41637,"Posted June 24, 2014.",251610,15 of 20 people (75%) found this review helpful,True,I know what you think when you see this title ...
2,evcentric,http://steamcommunity.com/id/evcentric,Posted February 3.,248820,No ratings yet,True,A suitably punishing roguelike platformer. Wi...


In [66]:
# Convert the original format to 'YYYY-MM-DD', or return 'Dato no disponible' when parsing fails
def transform_date(date_str):
    try:
        # Parse the string into a date
        date = datetime.strptime(date_str, 'Posted %B %d, %Y.')
        # Format the date as 'YYYY-MM-DD'
        return date.strftime('%Y-%m-%d')
    except ValueError:
        # Strings that do not match (e.g. 'Posted February 3.', with no year) fall back to the placeholder
        return 'Dato no disponible'



In [67]:
# Apply the function to the 'reviews_posted' column of data_reviews
data_reviews['reviews_posted'] = data_reviews['reviews_posted'].apply(transform_date)

# Display the resulting DataFrame
data_reviews.head()

Unnamed: 0,user_id,user_url,reviews_posted,reviews_item_id,reviews_helpful,reviews_recommend,reviews_review
0,76561197970982479,http://steamcommunity.com/profiles/76561197970...,2011-11-05,1250,No ratings yet,True,Simple yet with great replayability. In my opi...
1,js41637,http://steamcommunity.com/id/js41637,2014-06-24,251610,15 of 20 people (75%) found this review helpful,True,I know what you think when you see this title ...
2,evcentric,http://steamcommunity.com/id/evcentric,Dato no disponible,248820,No ratings yet,True,A suitably punishing roguelike platformer. Wi...
3,doctr,http://steamcommunity.com/id/doctr,2013-10-14,250320,2 of 2 people (100%) found this review helpful,True,This game... is so fun. The fight sequences ha...
4,maplemage,http://steamcommunity.com/id/maplemage,2014-04-15,211420,35 of 43 people (81%) found this review helpful,True,Git gud


In [68]:
data_reviews.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 59036 entries, 0 to 232129
Data columns (total 7 columns):
 #   Column             Non-Null Count  Dtype 
---  ------             --------------  ----- 
 0   user_id            59036 non-null  object
 1   user_url           59036 non-null  object
 2   reviews_posted     59036 non-null  object
 3   reviews_item_id    59036 non-null  object
 4   reviews_helpful    59036 non-null  object
 5   reviews_recommend  59036 non-null  bool  
 6   reviews_review     59036 non-null  object
dtypes: bool(1), object(6)
memory usage: 5.2+ MB


We count how many values of 'reviews_posted' are equal to 'Dato no disponible'.

In [69]:
data_reviews['reviews_posted'].value_counts().get('Dato no disponible', 0)

10100

A total of 10,100 values in the 'reviews_posted' column are equal to "Dato no disponible". Although this is a significant amount of missing dates, the affected rows still carry other useful information (the game, the recommendation, and the review text), so we keep them.
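The unparsed values are mostly strings such as "Posted February 3.", which lack a year. Below is a minimal sketch of a parser that can optionally fill one in. `parse_posted` and its `default_year` parameter are hypothetical, and which year is appropriate is an assumption that would have to come from knowing when the data was collected; nothing in this notebook states it.

```python
from datetime import datetime

def parse_posted(text, default_year=None):
    """Parse 'Posted November 5, 2011.' or 'Posted February 3.' strings.

    default_year is a hypothetical parameter: when given, it fills in a
    missing year; when None, dates without a year return None.
    """
    try:
        return datetime.strptime(text, "Posted %B %d, %Y.").date()
    except ValueError:
        pass
    if default_year is None:
        return None
    try:
        # Append the assumed year so strptime never parses a year-less date.
        return datetime.strptime(f"{text} {default_year}", "Posted %B %d. %Y").date()
    except ValueError:
        return None

full_date = parse_posted("Posted November 5, 2011.")
missing_year = parse_posted("Posted February 3.")
assumed_year = parse_posted("Posted February 3.", default_year=2016)  # 2016 is only an example
```

Returning `None` instead of a placeholder string keeps the column convertible to a datetime type without a later coercion step.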

In [70]:
data_reviews.isnull().sum()

user_id              0
user_url             0
reviews_posted       0
reviews_item_id      0
reviews_helpful      0
reviews_recommend    0
reviews_review       0
dtype: int64

# Transforming the 'reviews_posted' Column
From the 'reviews_posted' column we extract the year in which each review was posted and store it in a new column. When the date is missing, the year is marked as "Dato no disponible".
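Storing a text placeholder next to integer years gives the column a mixed `object` dtype. As an alternative sketch (an assumption of this example, not what the notebook does), pandas' nullable `Int64` dtype can hold the missing years as `<NA>`; the sample values are made up.

```python
import pandas as pd

# Made-up sample in the same shape as 'reviews_posted' after transform_date.
posted = pd.Series(["2011-11-05", "Dato no disponible", "2014-06-24"])

# Unparseable strings become NaT instead of raising.
dates = pd.to_datetime(posted, errors="coerce")

# Nullable integer dtype: valid years stay integers, NaT becomes <NA>.
years = dates.dt.year.astype("Int64")

n_missing = int(years.isna().sum())
```

With this representation, numeric operations such as `years.min()` or grouping by year work without first filtering out placeholder strings.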

In [71]:
fecha_minima = data_reviews['reviews_posted'].min()
fecha_maxima = data_reviews['reviews_posted'].max()

print("Fecha más temprana:", fecha_minima)
print("Fecha más reciente:", fecha_maxima)


Fecha más temprana: 2010-10-16
Fecha más reciente: Dato no disponible


In [72]:
# Convert the 'reviews_posted' column to datetime; placeholder strings become NaT
data_reviews['reviews_posted'] = pd.to_datetime(data_reviews['reviews_posted'], errors='coerce')

# Extract the year into 'reviews_year', assigning -1 to invalid values
data_reviews['reviews_year'] = data_reviews['reviews_posted'].dt.year.fillna(-1).astype(int)

# Replace the -1 values with "Dato no disponible" in the 'reviews_year' column
data_reviews['reviews_year'] = data_reviews['reviews_year'].replace(-1, "Dato no disponible")


In [73]:
data_reviews.head()

Unnamed: 0,user_id,user_url,reviews_posted,reviews_item_id,reviews_helpful,reviews_recommend,reviews_review,reviews_year
0,76561197970982479,http://steamcommunity.com/profiles/76561197970...,2011-11-05,1250,No ratings yet,True,Simple yet with great replayability. In my opi...,2011
1,js41637,http://steamcommunity.com/id/js41637,2014-06-24,251610,15 of 20 people (75%) found this review helpful,True,I know what you think when you see this title ...,2014
2,evcentric,http://steamcommunity.com/id/evcentric,NaT,248820,No ratings yet,True,A suitably punishing roguelike platformer. Wi...,Dato no disponible
3,doctr,http://steamcommunity.com/id/doctr,2013-10-14,250320,2 of 2 people (100%) found this review helpful,True,This game... is so fun. The fight sequences ha...,2013
4,maplemage,http://steamcommunity.com/id/maplemage,2014-04-15,211420,35 of 43 people (81%) found this review helpful,True,Git gud,2014


In [74]:
data_reviews['reviews_recommend'].value_counts()

True     52233
False     6803
Name: reviews_recommend, dtype: int64

In [78]:
data_reviews.drop_duplicates(inplace=True, ignore_index=True)

#if not duplicated_reviews.empty:
#    print("Registros duplicados")
#else:
#    print("No se encontraron registros duplicados en id. Total: 0")


In [80]:
data_reviews

Unnamed: 0,user_id,user_url,reviews_posted,reviews_item_id,reviews_helpful,reviews_recommend,reviews_review,reviews_year
0,76561197970982479,http://steamcommunity.com/profiles/76561197970...,2011-11-05,1250,No ratings yet,True,Simple yet with great replayability. In my opi...,2011
1,js41637,http://steamcommunity.com/id/js41637,2014-06-24,251610,15 of 20 people (75%) found this review helpful,True,I know what you think when you see this title ...,2014
2,evcentric,http://steamcommunity.com/id/evcentric,NaT,248820,No ratings yet,True,A suitably punishing roguelike platformer. Wi...,Dato no disponible
3,doctr,http://steamcommunity.com/id/doctr,2013-10-14,250320,2 of 2 people (100%) found this review helpful,True,This game... is so fun. The fight sequences ha...,2013
4,maplemage,http://steamcommunity.com/id/maplemage,2014-04-15,211420,35 of 43 people (81%) found this review helpful,True,Git gud,2014
...,...,...,...,...,...,...,...,...
58164,SKELETRONPRIMEISOP,http://steamcommunity.com/id/SKELETRONPRIMEISOP,2014-08-15,440,No ratings yet,True,TF2 is alot of fun and its really good but the...,2014
58165,76561198141079508,http://steamcommunity.com/profiles/76561198141...,2014-08-02,304930,No ratings yet,True,Fun game with friends,2014
58166,ShadowYT100,http://steamcommunity.com/id/ShadowYT100,2015-07-31,265630,No ratings yet,True,So Fun!! :D,2015
58167,bestcustomurlevermade,http://steamcommunity.com/id/bestcustomurlever...,2015-12-20,304050,No ratings yet,True,"This game is great. The only thing is,Why cant...",2015


# Saving the Transformed Dataset

In this step we save the transformed and cleaned dataset as `"data_reviews_cleaned.csv"`. This file holds the dataset prepared for later analysis and modeling.
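When the CSV is read back, pandas infers column types again, so numeric-looking identifiers would come back as integers and dates as plain strings unless told otherwise. The sketch below illustrates this with a made-up two-row frame and an in-memory buffer instead of a file; the column subset is an assumption chosen for brevity.

```python
import io
import pandas as pd

# Made-up sample with a subset of the cleaned columns.
toy = pd.DataFrame({
    "user_id": ["76561197970982479", "js41637"],
    "reviews_item_id": ["1250", "251610"],
    "reviews_posted": pd.to_datetime(["2011-11-05", "2014-06-24"]),
})

# Write to an in-memory buffer rather than a file on disk.
buffer = io.StringIO()
toy.to_csv(buffer, index=False)
buffer.seek(0)

# Keep identifiers as strings and parse the date column on the way back in.
restored = pd.read_csv(
    buffer,
    dtype={"user_id": str, "reviews_item_id": str},
    parse_dates=["reviews_posted"],
)
```

The same `dtype` and `parse_dates` arguments apply when reading from a file path.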

In [81]:
# Name of the output file
nombre_del_archivo = 'data_reviews_cleaned.csv'

# Save the DataFrame to the CSV file
data_reviews.to_csv(nombre_del_archivo, index=False, encoding='utf-8')
print(f'Se ha guardado el archivo {nombre_del_archivo} en la misma carpeta.')

Se ha guardado el archivo data_reviews_cleaned.csv en la misma carpeta.
