## ETL

In [2]:
# Load the modules and libraries used throughout the notebook.

import pandas as pd
import numpy as np
import json
import gzip
import ast
import re

The files **'australian_user_reviews.json'** and **'australian_users_items.json'** are loaded into DataFrames for later cleaning and normalization.

In [3]:
# Load the 'australian_user_reviews.json' dataset into a DataFrame

reviews = []

with open('Data/Datasets/australian_user_reviews.json', 'r', encoding='utf-8') as f:
    for line in f.readlines():
        reviews.append(ast.literal_eval(line))
df_user_reviews = pd.DataFrame(reviews)

In [4]:
# Load the 'australian_users_items.json' dataset into a DataFrame

reviews_ui = []

with open('Data/Datasets/australian_users_items.json', 'r', encoding='utf-8') as u:
    for line in u.readlines():
        reviews_ui.append(ast.literal_eval(line))
df_users_items = pd.DataFrame(reviews_ui)
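
Both loading cells follow the same pattern, so it could be factored into one function. The sketch below assumes, as the cells above do, that each line of the file is a Python-literal dict; the helper name `load_literal_lines` is hypothetical and not part of the original notebook.

```python
import ast

import pandas as pd


def load_literal_lines(path, encoding="utf-8"):
    """Parse a file holding one Python-literal dict per line into a DataFrame.

    Hypothetical helper: it mirrors the two loading cells above.
    """
    records = []
    with open(path, "r", encoding=encoding) as fh:
        for line in fh:  # iterate lazily instead of calling readlines()
            line = line.strip()
            if line:  # skip blank lines
                records.append(ast.literal_eval(line))
    return pd.DataFrame(records)
```

With this helper, the first cell would reduce to `df_user_reviews = load_literal_lines('Data/Datasets/australian_user_reviews.json')`.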

The **'output_steam_games.json'** dataset is loaded into a DataFrame directly from the compressed file 'steam_games.json.gz', because loading it from the decompressed file raised a recurring error. The method is similar to the one above (the records are collected in a list, in this case of JSON objects), except that each line is decoded individually because of the inconsistent text format of the file.
 

In [5]:
# Load the compressed file 'steam_games.json.gz'

json_objects = []

with gzip.open('Data/Datasets/steam_games.json.gz', 'rb') as f:
    for line in f:
        json_obj = json.loads(line.decode('utf-8'))
        json_objects.append(json_obj)

df_steam_games = pd.DataFrame(json_objects)
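
A possible variant of the cell above opens the archive in text mode, so the file object handles the decoding of each line. This is only a sketch, assuming the file holds one JSON object per line; the function name `load_gzip_json_lines` is hypothetical.

```python
import gzip
import json

import pandas as pd


def load_gzip_json_lines(path):
    """Read a gzip file with one JSON object per line into a DataFrame.

    Hypothetical helper, not part of the original notebook.
    """
    records = []
    # 'rt' opens the archive in text mode, so each line arrives already decoded
    with gzip.open(path, "rt", encoding="utf-8") as fh:
        for line in fh:
            if line.strip():  # skip blank lines
                records.append(json.loads(line))
    return pd.DataFrame(records)
```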

The approach is to merge the DataFrames **'df_user_reviews'** and **'df_users_items'** on the 'user_id' column to build a unified DataFrame with the relevant columns, from which CSV files are created. These files are later processed with queries in **Workbench**, and those queries produce the final DataFrames for the endpoints.
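
The notebook does not show the merge call itself, so the following is only a sketch of what merging on 'user_id' could look like. The toy frames and the choice of `how="left"` are assumptions made for illustration.

```python
import pandas as pd

# Toy stand-ins for df_user_reviews and df_users_items (values are made up)
reviews_demo = pd.DataFrame({"user_id": ["a", "b"], "reviews": [["r1"], ["r2"]]})
items_demo = pd.DataFrame({"user_id": ["a", "b", "c"], "items_count": [3, 5, 1]})

# A left join keeps every row of items_demo; users without reviews get NaN
merged_demo = items_demo.merge(reviews_demo, on="user_id", how="left")
print(merged_demo.isnull().sum())
```

In this toy case the 'reviews' column ends up with one null, because user 'c' exists only in the larger frame.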

### EDA

The number of nulls in each DataFrame is checked. The columns that originally belong to **'df_user_reviews'** have far fewer rows, because their source DataFrame is smaller (25,799 rows versus 88,310 in 'df_users_items'); the relevance of this is shown later.

In [8]:
print(len(df_user_reviews))
print(df_user_reviews.isnull().sum())
df_user_reviews.head()

25799
user_id     0
user_url    0
reviews     0
dtype: int64


Unnamed: 0,user_id,user_url,reviews
0,76561197970982479,http://steamcommunity.com/profiles/76561197970...,"[{'funny': '', 'posted': 'Posted November 5, 2..."
1,js41637,http://steamcommunity.com/id/js41637,"[{'funny': '', 'posted': 'Posted June 24, 2014..."
2,evcentric,http://steamcommunity.com/id/evcentric,"[{'funny': '', 'posted': 'Posted February 3.',..."
3,doctr,http://steamcommunity.com/id/doctr,"[{'funny': '', 'posted': 'Posted October 14, 2..."
4,maplemage,http://steamcommunity.com/id/maplemage,"[{'funny': '3 people found this review funny',..."


In [9]:
print(len(df_users_items))
print(df_users_items.isnull().sum())
df_users_items.head()

88310
user_id        0
items_count    0
steam_id       0
user_url       0
items          0
dtype: int64


Unnamed: 0,user_id,items_count,steam_id,user_url,items
0,76561197970982479,277,76561197970982479,http://steamcommunity.com/profiles/76561197970...,"[{'item_id': '10', 'item_name': 'Counter-Strik..."
1,js41637,888,76561198035864385,http://steamcommunity.com/id/js41637,"[{'item_id': '10', 'item_name': 'Counter-Strik..."
2,evcentric,137,76561198007712555,http://steamcommunity.com/id/evcentric,"[{'item_id': '1200', 'item_name': 'Red Orchest..."
3,Riot-Punch,328,76561197963445855,http://steamcommunity.com/id/Riot-Punch,"[{'item_id': '10', 'item_name': 'Counter-Strik..."
4,doctr,541,76561198002099482,http://steamcommunity.com/id/doctr,"[{'item_id': '300', 'item_name': 'Day of Defea..."


In [10]:
print(len(df_steam_games))
print(df_steam_games.isnull().sum())
df_steam_games.head()

120445
publisher       96362
genres          91593
app_name        88312
title           90360
url             88310
release_date    90377
tags            88473
reviews_url     88312
specs           88980
price           89687
early_access    88310
id              88312
developer       91609
dtype: int64


Unnamed: 0,publisher,genres,app_name,title,url,release_date,tags,reviews_url,specs,price,early_access,id,developer
0,,,,,,,,,,,,,
1,,,,,,,,,,,,,
2,,,,,,,,,,,,,
3,,,,,,,,,,,,,
4,,,,,,,,,,,,,


-----------------------------------------------

## Limpieza y normalizacion

In this section the DataFrames based on the **'reviews'** column are created, so the first step is to drop the **'user_url'** column, which is not needed.

In [11]:
# Work on a copy for safety and convenience
df_ur = df_user_reviews.copy()

In [12]:
# Drop the 'user_url' column

df_ur = df_ur.drop(['user_url'], axis=1)

In [13]:
# List that collects one record per review
unpacked_data_ur = []

# Iterate over the rows of the DataFrame
for index, row in df_user_reviews.iterrows():
    user_id = row['user_id']
    for review_dict in row['reviews']:  # Iterate over the dicts in the 'reviews' column
        review_dict['user_id'] = user_id  # Attach the owning user to the review
        unpacked_data_ur.append(review_dict)  # Append the record to the list

# Flattened DataFrame with one row per review
df_flattened_ur = pd.DataFrame(unpacked_data_ur)
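
The same flattening can be sketched without an explicit loop by using `DataFrame.explode` followed by `pd.json_normalize`. This is an alternative technique, not the notebook's method, and the toy data below is invented for illustration.

```python
import pandas as pd

demo = pd.DataFrame({
    "user_id": ["a", "b"],
    "reviews": [
        [{"item_id": "10", "recommend": True}, {"item_id": "20", "recommend": False}],
        [{"item_id": "30", "recommend": True}],
    ],
})

# One row per review dict; users with an empty list produce NaN and are dropped
exploded = (
    demo.explode("reviews")
    .dropna(subset=["reviews"])
    .reset_index(drop=True)
)

# Expand each dict into columns and put user_id back alongside them
flat = pd.concat(
    [pd.json_normalize(exploded["reviews"].tolist()), exploded[["user_id"]]],
    axis=1,
)
```

Unlike the loop above, this version does not write 'user_id' into the original dicts, so the source DataFrame is left unmodified.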

In [14]:
df_flattened_ur.head()

Unnamed: 0,funny,posted,last_edited,item_id,helpful,recommend,review,user_id
0,,"Posted November 5, 2011.",,1250,No ratings yet,True,Simple yet with great replayability. In my opi...,76561197970982479
1,,"Posted July 15, 2011.",,22200,No ratings yet,True,It's unique and worth a playthrough.,76561197970982479
2,,"Posted April 21, 2011.",,43110,No ratings yet,True,Great atmosphere. The gunplay can be a bit chu...,76561197970982479
3,,"Posted June 24, 2014.",,251610,15 of 20 people (75%) found this review helpful,True,I know what you think when you see this title ...,js41637
4,,"Posted September 8, 2013.",,227300,0 of 1 people (0%) found this review helpful,True,For a simple (it's actually not all that simpl...,js41637


In [15]:
print(len(df_flattened_ur))
print(df_flattened_ur.isnull().sum())

59305
funny          0
posted         0
last_edited    0
item_id        0
helpful        0
recommend      0
review         0
user_id        0
dtype: int64


In [16]:
df_flattened_ur = df_flattened_ur.drop(['funny', 'last_edited', 'helpful'], axis=1)

In [22]:
df_flattened_ur_2 = df_flattened_ur.copy()

# astype returns a new DataFrame; the result is only displayed here, not stored
df_flattened_ur_2.astype(str)

Unnamed: 0,posted,item_id,recommend,review,user_id
0,"Posted November 5, 2011.",1250,True,Simple yet with great replayability. In my opi...,76561197970982479
1,"Posted July 15, 2011.",22200,True,It's unique and worth a playthrough.,76561197970982479
2,"Posted April 21, 2011.",43110,True,Great atmosphere. The gunplay can be a bit chu...,76561197970982479
3,"Posted June 24, 2014.",251610,True,I know what you think when you see this title ...,js41637
4,"Posted September 8, 2013.",227300,True,For a simple (it's actually not all that simpl...,js41637
...,...,...,...,...,...
59300,Posted July 10.,70,True,a must have classic from steam definitely wort...,76561198312638244
59301,Posted July 8.,362890,True,this game is a perfect remake of the original ...,76561198312638244
59302,Posted July 3.,273110,True,had so much fun plaing this and collecting res...,LydiaMorley
59303,Posted July 20.,730,True,:D,LydiaMorley


In [23]:
print(len(df_flattened_ur_2))
print(df_flattened_ur_2.isnull().sum())

59305
posted       0
item_id      0
recommend    0
review       0
user_id      0
dtype: int64


In [24]:
# Function that extracts the year from the 'posted' column
def extraer_anio(fecha):
    pattern = r'\b\d{4}\b'  # Regular expression that matches a standalone four-digit year
    match = re.search(pattern, str(fecha))
    return match.group(0) if match else None

# Apply the function to the 'posted' column to extract the year
df_flattened_ur_2['posted'] = df_flattened_ur_2['posted'].apply(extraer_anio)
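
To see why some rows end up without a year, the regular expression can be tried on two of the 'posted' strings shown earlier. The sketch uses a standalone copy of the function under the hypothetical English name `extract_year`.

```python
import re


def extract_year(text):
    """Return the first standalone four-digit number in text, or None."""
    match = re.search(r"\b\d{4}\b", str(text))
    return match.group(0) if match else None


print(extract_year("Posted November 5, 2011."))  # 2011
print(extract_year("Posted July 10."))           # None
```

Strings such as 'Posted July 10.' carry no year at all, so the function returns `None` for them and they appear as nulls in the 'posted' column.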

In [25]:
df_flattened_ur_2['posted'].head(10)

0    2011
1    2011
2    2011
3    2014
4    2013
5    2013
6    None
7    2015
8    2014
9    2014
Name: posted, dtype: object

In [26]:
print(len(df_flattened_ur_2))
print(df_flattened_ur_2.isnull().sum())

59305
posted       10119
item_id          0
recommend        0
review           0
user_id          0
dtype: int64


In [32]:
df_ur_final1 = df_flattened_ur_2.copy()

In [33]:
# Neither call is assigned, so df_ur_final1 is unchanged; the result is only displayed
df_ur_final1.reset_index(drop=True)
df_ur_final1.astype(str)

Unnamed: 0,posted,item_id,recommend,review,user_id
0,2011,1250,True,Simple yet with great replayability. In my opi...,76561197970982479
1,2011,22200,True,It's unique and worth a playthrough.,76561197970982479
2,2011,43110,True,Great atmosphere. The gunplay can be a bit chu...,76561197970982479
3,2014,251610,True,I know what you think when you see this title ...,js41637
4,2013,227300,True,For a simple (it's actually not all that simpl...,js41637
...,...,...,...,...,...
59300,,70,True,a must have classic from steam definitely wort...,76561198312638244
59301,,362890,True,this game is a perfect remake of the original ...,76561198312638244
59302,,273110,True,had so much fun plaing this and collecting res...,LydiaMorley
59303,,730,True,:D,LydiaMorley


In [34]:
df_ur_final1 = df_ur_final1.dropna(subset=['posted'])

In [35]:
print(len(df_ur_final1))
print(df_ur_final1.isnull().sum())

49186
posted       0
item_id      0
recommend    0
review       0
user_id      0
dtype: int64


In [310]:
df_ur_final1.to_csv('Data/data_NLPM_1.csv', index=False)

-----------------------------------------

In [36]:
df_ur_final2 = df_flattened_ur_2.copy()

df_ur_final2 = df_ur_final2.drop(['review'], axis=1)

# Neither call is assigned, so df_ur_final2 is unchanged; the result is only displayed
df_ur_final2.reset_index(drop=True)
df_ur_final2.astype(str)

Unnamed: 0,posted,item_id,recommend,user_id
0,2011,1250,True,76561197970982479
1,2011,22200,True,76561197970982479
2,2011,43110,True,76561197970982479
3,2014,251610,True,js41637
4,2013,227300,True,js41637
...,...,...,...,...
59300,,70,True,76561198312638244
59301,,362890,True,76561198312638244
59302,,273110,True,LydiaMorley
59303,,730,True,LydiaMorley


In [37]:
print(len(df_ur_final2))
print(df_ur_final2.isnull().sum())

59305
posted       10119
item_id          0
recommend        0
user_id          0
dtype: int64


In [38]:
df_ur_final2 = df_ur_final2.dropna(subset=['posted'])

In [39]:
print(len(df_ur_final2))
print(df_ur_final2.isnull().sum())

49186
posted       0
item_id      0
recommend    0
user_id      0
dtype: int64


In [40]:
# Neither call is assigned, so the index keeps the gaps left by dropna
df_ur_final2.reset_index(drop=True)
df_ur_final2.astype(str)

Unnamed: 0,posted,item_id,recommend,user_id
0,2011,1250,True,76561197970982479
1,2011,22200,True,76561197970982479
2,2011,43110,True,76561197970982479
3,2014,251610,True,js41637
4,2013,227300,True,js41637
...,...,...,...,...
59252,2015,730,True,wayfeng
59255,2015,253980,True,76561198251004808
59265,2015,730,True,72947282842
59267,2015,730,True,ApxLGhost


In [None]:
df_ur_final2.to_csv('Data/data_Review.csv', index=False)

In this section a DataFrame based on **'df_users_items'** is created: the **'items'** column is normalized and joined with the other columns, then unneeded columns are dropped and the result is written to a CSV file.

In [42]:
# List that collects one record per item
unpacked_data = []

# Iterate over the rows of the DataFrame
for index, row in df_users_items.iterrows():
    user_id = row['user_id']
    items_count = row['items_count']
    steam_id = row['steam_id']
    user_url = row['user_url']
    for review_dict in row['items']:  # Iterate over the dicts in the 'items' column
        review_dict['user_id'] = user_id
        review_dict['items_count'] = items_count
        review_dict['steam_id'] = steam_id
        review_dict['user_url'] = user_url
        unpacked_data.append(review_dict)  # Append the record to the list

# Flattened DataFrame with one row per item
df_flattened_ui = pd.DataFrame(unpacked_data)

In [43]:
# Make a copy and drop the columns that are not needed

df_flattened_ui_v2 = df_flattened_ui.copy()

df_flattened_ui_v2 = df_flattened_ui_v2.drop(['user_url', 'playtime_2weeks', 'steam_id'], axis=1)

In [44]:
# View the data as strings ahead of the CSV export (the result is not assigned, so the DataFrame keeps its dtypes)

df_flattened_ui_v2.astype(str)

Unnamed: 0,item_id,item_name,playtime_forever,user_id,items_count
0,10,Counter-Strike,6,76561197970982479,277
1,20,Team Fortress Classic,0,76561197970982479,277
2,30,Day of Defeat,7,76561197970982479,277
3,40,Deathmatch Classic,0,76561197970982479,277
4,50,Half-Life: Opposing Force,0,76561197970982479,277
...,...,...,...,...,...
5153204,346330,BrainBread 2,0,76561198329548331,7
5153205,373330,All Is Dust,0,76561198329548331,7
5153206,388490,One Way To Die: Steam Edition,3,76561198329548331,7
5153207,521570,You Have 10 Seconds 2,4,76561198329548331,7


In [45]:
# Replace every non-digit character with '0' in the string values of 'item_id', 'playtime_forever' and 'items_count'

df_flattened_ui_v2['item_id'] = df_flattened_ui_v2['item_id'].replace(regex=r'[^0-9]', value='0')
df_flattened_ui_v2['playtime_forever'] = df_flattened_ui_v2['playtime_forever'].replace(regex=r'[^0-9]', value='0')
df_flattened_ui_v2['items_count'] = df_flattened_ui_v2['items_count'].replace(regex=r'[^0-9]', value='0')

In [46]:
print(len(df_flattened_ui_v2))
print(df_flattened_ui_v2.isnull().sum())

5153209
item_id             0
item_name           0
playtime_forever    0
user_id             0
items_count         0
dtype: int64


In [None]:
# Write the DataFrame to a CSV file for later use

df_flattened_ui_v2.to_csv('Data/data_Items.csv', index=False)

In this section a DataFrame based on **'df_steam_games'** is created: rows that are entirely null are removed and the result is written to a CSV file.

In [48]:
df_sg_v1 = df_steam_games.copy()

# Drop the rows in which every column is null
df_sg_v1.dropna(subset=df_sg_v1.columns[0:15], how='all', inplace=True)
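
A minimal sketch of how `how='all'` behaves, using made-up rows: a row is removed only when every column is null, and the surviving rows keep their original index labels.

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame({
    "app_name": [np.nan, "Ironbound", np.nan],
    "id": [np.nan, "643980", "773570"],
})

# how='all' drops a row only when every column is null
cleaned = demo.dropna(how="all")
print(cleaned.index.tolist())  # [1, 2]
```

This also explains why the first row displayed after this step carries the label 88310 rather than 0: the index is not reset by `dropna`.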

In [49]:
print(len(df_sg_v1))
print(df_sg_v1.isnull().sum())

32135
publisher       8052
genres          3283
app_name           2
title           2050
url                0
release_date    2067
tags             163
reviews_url        2
specs            670
price           1377
early_access       0
id                 2
developer       3299
dtype: int64


In [50]:
df_mlm_v1 = df_sg_v1.drop(['publisher', 'title', 'url', 'reviews_url', 'release_date'], axis=1)

In [57]:
print(len(df_mlm_v1))
print(df_mlm_v1.isnull().sum())
print(df_mlm_v1.columns)
df_mlm_v1.head()

32135
genres          3283
app_name           2
tags             163
specs            670
price           1377
early_access       0
id                 2
developer       3299
dtype: int64
Index(['genres', 'app_name', 'tags', 'specs', 'price', 'early_access', 'id',
       'developer'],
      dtype='object')


Unnamed: 0,genres,app_name,tags,specs,price,early_access,id,developer
88310,"[Action, Casual, Indie, Simulation, Strategy]",Lost Summoner Kitty,"[Strategy, Action, Indie, Casual, Simulation]",[Single-player],4.99,False,761140,Kotoshiro
88311,"[Free to Play, Indie, RPG, Strategy]",Ironbound,"[Free to Play, Strategy, Indie, RPG, Card Game...","[Single-player, Multi-player, Online Multi-Pla...",Free To Play,False,643980,Secret Level SRL
88312,"[Casual, Free to Play, Indie, Simulation, Sports]",Real Pool 3D - Poolians,"[Free to Play, Simulation, Sports, Casual, Ind...","[Single-player, Multi-player, Online Multi-Pla...",Free to Play,False,670290,Poolians.com
88313,"[Action, Adventure, Casual]",弹炸人2222,"[Action, Adventure, Casual]",[Single-player],0.99,False,767400,彼岸领域
88314,,Log Challenge,"[Action, Indie, Casual, Sports]","[Single-player, Full controller support, HTC V...",2.99,False,773570,


In [63]:
df_mlm_v2 = df_mlm_v1.copy()

In [64]:
df_mlm_v2  = df_mlm_v2.drop(['genres', 'tags', 'specs', 'price', 'early_access', 'developer'], axis=1)

In [65]:
df_mlm_v2 = df_mlm_v2.dropna()

In [66]:
print(len(df_mlm_v2))
print(df_mlm_v2.isnull().sum())
print(df_mlm_v2.columns)
df_mlm_v2.head()


32132
app_name    0
id          0
dtype: int64
Index(['app_name', 'id'], dtype='object')


Unnamed: 0,app_name,id
88310,Lost Summoner Kitty,761140
88311,Ironbound,643980
88312,Real Pool 3D - Poolians,670290
88313,弹炸人2222,767400
88314,Log Challenge,773570


In [108]:
df_mlm_v2.to_csv('Dataframes/data_MLM_1.csv', index=False)

-------------------------------------

In [67]:
df_mlm_v3 = df_mlm_v1.copy()

df_mlm_v3  = df_mlm_v3.drop(['id'], axis=1)

In [68]:
# These conversions are not assigned; only the last expression is displayed
df_mlm_v3['genres'].astype(str)
df_mlm_v3['tags'].astype(str)
df_mlm_v3['specs'].astype(str)

88310                                     ['Single-player']
88311     ['Single-player', 'Multi-player', 'Online Mult...
88312     ['Single-player', 'Multi-player', 'Online Mult...
88313                                     ['Single-player']
88314     ['Single-player', 'Full controller support', '...
                                ...                        
120440              ['Single-player', 'Steam Achievements']
120441    ['Single-player', 'Steam Achievements', 'Steam...
120442    ['Single-player', 'Steam Achievements', 'Steam...
120443    ['Single-player', 'Steam Achievements', 'Steam...
120444    ['Single-player', 'Stats', 'Steam Leaderboards...
Name: specs, Length: 32135, dtype: object

In [69]:
df_mlm_v3['genres'] = df_mlm_v3['genres'].apply(lambda x: ' '.join(x) if isinstance(x, list) else x)
df_mlm_v3['tags'] = df_mlm_v3['tags'].apply(lambda x: ' '.join(x) if isinstance(x, list) else x)
df_mlm_v3['specs'] = df_mlm_v3['specs'].apply(lambda x: ' '.join(x) if isinstance(x, list) else x)

In [70]:
df_mlm_v3.head()

Unnamed: 0,genres,app_name,tags,specs,price,early_access,developer
88310,Action Casual Indie Simulation Strategy,Lost Summoner Kitty,Strategy Action Indie Casual Simulation,Single-player,4.99,False,Kotoshiro
88311,Free to Play Indie RPG Strategy,Ironbound,Free to Play Strategy Indie RPG Card Game Trad...,Single-player Multi-player Online Multi-Player...,Free To Play,False,Secret Level SRL
88312,Casual Free to Play Indie Simulation Sports,Real Pool 3D - Poolians,Free to Play Simulation Sports Casual Indie Mu...,Single-player Multi-player Online Multi-Player...,Free to Play,False,Poolians.com
88313,Action Adventure Casual,弹炸人2222,Action Adventure Casual,Single-player,0.99,False,彼岸领域
88314,,Log Challenge,Action Indie Casual Sports,Single-player Full controller support HTC Vive...,2.99,False,


In [71]:
df_mlm_v4 = df_mlm_v3.copy()

df_mlm_v4 = df_mlm_v4.replace(to_replace=np.nan, value='')

In [72]:
print(len(df_mlm_v4))
print(df_mlm_v4.isnull().sum())

32135
genres          0
app_name        0
tags            0
specs           0
price           0
early_access    0
developer       0
dtype: int64


In [73]:
df_mlm_v4.head()

Unnamed: 0,genres,app_name,tags,specs,price,early_access,developer
88310,Action Casual Indie Simulation Strategy,Lost Summoner Kitty,Strategy Action Indie Casual Simulation,Single-player,4.99,False,Kotoshiro
88311,Free to Play Indie RPG Strategy,Ironbound,Free to Play Strategy Indie RPG Card Game Trad...,Single-player Multi-player Online Multi-Player...,Free To Play,False,Secret Level SRL
88312,Casual Free to Play Indie Simulation Sports,Real Pool 3D - Poolians,Free to Play Simulation Sports Casual Indie Mu...,Single-player Multi-player Online Multi-Player...,Free to Play,False,Poolians.com
88313,Action Adventure Casual,弹炸人2222,Action Adventure Casual,Single-player,0.99,False,彼岸领域
88314,,Log Challenge,Action Indie Casual Sports,Single-player Full controller support HTC Vive...,2.99,False,


In [109]:
df_mlm_v4.to_csv('Dataframes/data_MLM_2.csv', index=False)

-------------------------------------

In [86]:
df_sg_v1.head()

Unnamed: 0,publisher,genres,app_name,title,url,release_date,tags,reviews_url,specs,price,early_access,id,developer
0,Kotoshiro,"[Action, Casual, Indie, Simulation, Strategy]",Lost Summoner Kitty,Lost Summoner Kitty,http://store.steampowered.com/app/761140/Lost_...,2018-01-04,"[Strategy, Action, Indie, Casual, Simulation]",http://steamcommunity.com/app/761140/reviews/?...,[Single-player],4.99,False,761140,Kotoshiro
1,"Making Fun, Inc.","[Free to Play, Indie, RPG, Strategy]",Ironbound,Ironbound,http://store.steampowered.com/app/643980/Ironb...,2018-01-04,"[Free to Play, Strategy, Indie, RPG, Card Game...",http://steamcommunity.com/app/643980/reviews/?...,"[Single-player, Multi-player, Online Multi-Pla...",Free To Play,False,643980,Secret Level SRL
2,Poolians.com,"[Casual, Free to Play, Indie, Simulation, Sports]",Real Pool 3D - Poolians,Real Pool 3D - Poolians,http://store.steampowered.com/app/670290/Real_...,2017-07-24,"[Free to Play, Simulation, Sports, Casual, Ind...",http://steamcommunity.com/app/670290/reviews/?...,"[Single-player, Multi-player, Online Multi-Pla...",Free to Play,False,670290,Poolians.com
3,彼岸领域,"[Action, Adventure, Casual]",弹炸人2222,弹炸人2222,http://store.steampowered.com/app/767400/2222/,2017-12-07,"[Action, Adventure, Casual]",http://steamcommunity.com/app/767400/reviews/?...,[Single-player],0.99,False,767400,彼岸领域
4,,,Log Challenge,,http://store.steampowered.com/app/773570/Log_C...,,"[Action, Indie, Casual, Sports]",http://steamcommunity.com/app/773570/reviews/?...,"[Single-player, Full controller support, HTC V...",2.99,False,773570,


In [87]:
df_sg_v1 = df_sg_v1.reset_index(drop=True)

df_sg_v2 = df_sg_v1.copy()

df_sg_v2 = df_sg_v2.drop(['publisher', 'title', 'url', 'tags', 'reviews_url', 'specs', 'early_access'], axis=1)

In [88]:
print(len(df_sg_v2))
print(df_sg_v2.isnull().sum())
df_sg_v2.head()

32135
genres          3283
app_name           2
release_date    2067
price           1377
id                 2
developer       3299
dtype: int64


Unnamed: 0,genres,app_name,release_date,price,id,developer
0,"[Action, Casual, Indie, Simulation, Strategy]",Lost Summoner Kitty,2018-01-04,4.99,761140,Kotoshiro
1,"[Free to Play, Indie, RPG, Strategy]",Ironbound,2018-01-04,Free To Play,643980,Secret Level SRL
2,"[Casual, Free to Play, Indie, Simulation, Sports]",Real Pool 3D - Poolians,2017-07-24,Free to Play,670290,Poolians.com
3,"[Action, Adventure, Casual]",弹炸人2222,2017-12-07,0.99,767400,彼岸领域
4,,Log Challenge,,2.99,773570,


In [89]:
# Fill missing genres from 'tags' and missing developers from 'publisher'
df_sg_v2['genres'] = df_sg_v2['genres'].fillna(df_sg_v1['tags'])
df_sg_v2['developer'] = df_sg_v2['developer'].fillna(df_sg_v1['publisher'])
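
Filling one column from another relies on index alignment: each missing value takes the value found under the same index label in the other Series. A small sketch with invented rows:

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame({
    "genres": [["Action"], np.nan, np.nan],
    "tags": [["Indie"], ["Sports"], np.nan],
})

# Missing genres take the value of tags on the same index label
demo["genres"] = demo["genres"].fillna(demo["tags"])
print(demo["genres"].isnull().sum())
```

Rows where both columns are null stay null, which is consistent with some null genres remaining after this step.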

In [90]:
df_sg_v2 = df_sg_v2.dropna(subset=['id'])
df_sg_v2 = df_sg_v2.dropna(subset=['app_name'])

In [93]:
df_sg_v3 = df_sg_v2.copy()

In [94]:
# Function that extracts the year from the 'release_date' column
def extraer_anio(fecha):
    pattern = r'\b\d{4}\b'  # Regular expression that matches a standalone four-digit year
    match = re.search(pattern, str(fecha))
    return match.group(0) if match else None

# Apply the function to the 'release_date' column to extract the year
df_sg_v3['release_date'] = df_sg_v3['release_date'].apply(extraer_anio)


In [95]:
print(len(df_sg_v3))
print(df_sg_v3.isnull().sum())
df_sg_v3.head(20)

32132
genres           138
app_name           0
release_date    2170
price           1376
id                 0
developer       3232
dtype: int64


Unnamed: 0,genres,app_name,release_date,price,id,developer
0,"[Action, Casual, Indie, Simulation, Strategy]",Lost Summoner Kitty,2018.0,4.99,761140,Kotoshiro
1,"[Free to Play, Indie, RPG, Strategy]",Ironbound,2018.0,Free To Play,643980,Secret Level SRL
2,"[Casual, Free to Play, Indie, Simulation, Sports]",Real Pool 3D - Poolians,2017.0,Free to Play,670290,Poolians.com
3,"[Action, Adventure, Casual]",弹炸人2222,2017.0,0.99,767400,彼岸领域
4,"[Action, Indie, Casual, Sports]",Log Challenge,,2.99,773570,
5,"[Action, Adventure, Simulation]",Battle Royale Trainer,2018.0,3.99,772540,Trickjump Games Ltd
6,"[Free to Play, Indie, Simulation, Sports]",SNOW - All Access Basic Pass,2018.0,9.99,774276,Poppermost Productions
7,"[Free to Play, Indie, Simulation, Sports]",SNOW - All Access Pro Pass,2018.0,18.99,774277,Poppermost Productions
8,"[Free to Play, Indie, Simulation, Sports]",SNOW - All Access Legend Pass,2018.0,29.99,774278,Poppermost Productions
9,"[Casual, Indie, Racing, Simulation]",Race,2018.0,,768800,RewindApp


From here on, two different DataFrames are created with the information needed to build the final DataFrames that serve as the basis for the endpoints.

In [96]:
# Genre/release-date DataFrame: make a copy, drop the unnecessary columns, and remove the nulls.

df_sg_grd = df_sg_v3.copy()

In [97]:
df_sg_grd = df_sg_grd.drop(['app_name', 'price', 'developer'], axis=1)

df_sg_grd = df_sg_grd.dropna(subset=['genres'])
df_sg_grd = df_sg_grd.dropna(subset=['release_date'])

In [98]:
print(len(df_sg_grd))
print(df_sg_grd.isnull().sum())
df_sg_grd.head(20)

29825
genres          0
release_date    0
id              0
dtype: int64


Unnamed: 0,genres,release_date,id
0,"[Action, Casual, Indie, Simulation, Strategy]",2018,761140
1,"[Free to Play, Indie, RPG, Strategy]",2018,643980
2,"[Casual, Free to Play, Indie, Simulation, Sports]",2017,670290
3,"[Action, Adventure, Casual]",2017,767400
5,"[Action, Adventure, Simulation]",2018,772540
6,"[Free to Play, Indie, Simulation, Sports]",2018,774276
7,"[Free to Play, Indie, Simulation, Sports]",2018,774277
8,"[Free to Play, Indie, Simulation, Sports]",2018,774278
9,"[Casual, Indie, Racing, Simulation]",2018,768800
12,"[Action, Adventure, Casual, Indie, RPG]",2018,770380


In [102]:
df_sg_grd.astype(str)

Unnamed: 0,genres,release_date,id
0,"['Action', 'Casual', 'Indie', 'Simulation', 'S...",2018,761140
1,"['Free to Play', 'Indie', 'RPG', 'Strategy']",2018,643980
2,"['Casual', 'Free to Play', 'Indie', 'Simulatio...",2017,670290
3,"['Action', 'Adventure', 'Casual']",2017,767400
5,"['Action', 'Adventure', 'Simulation']",2018,772540
...,...,...,...
32129,"['Action', 'Adventure', 'Casual', 'Indie']",2018,745400
32130,"['Casual', 'Indie', 'Simulation', 'Strategy']",2018,773640
32131,"['Casual', 'Indie', 'Strategy']",2018,733530
32132,"['Indie', 'Racing', 'Simulation']",2018,610660


In [104]:
data = []

for index, row in df_sg_grd.iterrows():
    generos = row['genres']
    for genero in generos:
        data.append({'genres': genero.strip(), 'release_date': row['release_date'], 'id': row['id']})

df_sg_grd_v2 = pd.DataFrame(data)
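
The loop above produces one row per genre. As an alternative sketch (not the notebook's method), `DataFrame.explode` gives the same shape, assuming every 'genres' value is a list; the toy rows are made up.

```python
import pandas as pd

demo = pd.DataFrame({
    "genres": [["Action", "Casual"], ["Indie"]],
    "release_date": ["2018", "2017"],
    "id": ["761140", "670290"],
})

# explode() repeats release_date and id once per genre in the list
long_demo = demo.explode("genres").reset_index(drop=True)
print(long_demo)
```

Unlike the loop, this version does not call `strip()` on each genre, so it assumes the genre strings carry no surrounding whitespace.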


In [105]:
print(len(df_sg_grd_v2))
print(df_sg_grd_v2.isnull().sum())
df_sg_grd_v2.head(20)

74500
genres          0
release_date    0
id              0
dtype: int64


Unnamed: 0,genres,release_date,id
0,Action,2018,761140
1,Casual,2018,761140
2,Indie,2018,761140
3,Simulation,2018,761140
4,Strategy,2018,761140
5,Free to Play,2018,643980
6,Indie,2018,643980
7,RPG,2018,643980
8,Strategy,2018,643980
9,Casual,2017,670290


In [178]:
# Reset the index and write the result to a CSV file

df_sg_grd_v2 = df_sg_grd_v2.reset_index(drop=True)

df_sg_grd_v2.to_csv('Data/data_Steam_1.csv', index=False)

-------------------------------

In [106]:
# developer_price_release DataFrame: make a copy, drop the unnecessary columns, and remove the nulls.

df_sg_dpr = df_sg_v3.copy()

df_sg_dpr = df_sg_dpr.drop(['app_name', 'genres'], axis=1)

df_sg_dpr = df_sg_dpr.dropna(subset=['price'])
df_sg_dpr = df_sg_dpr.dropna(subset=['release_date'])
df_sg_dpr = df_sg_dpr.dropna(subset=['developer'])

In [107]:
print(len(df_sg_dpr))
print(df_sg_dpr.isnull().sum())
df_sg_dpr.head()

27656
release_date    0
price           0
id              0
developer       0
dtype: int64


Unnamed: 0,release_date,price,id,developer
0,2018,4.99,761140,Kotoshiro
1,2018,Free To Play,643980,Secret Level SRL
2,2017,Free to Play,670290,Poolians.com
3,2017,0.99,767400,彼岸领域
5,2018,3.99,772540,Trickjump Games Ltd


In [108]:
# Replace every non-digit character in string prices with '0', since all values containing letters are either 'Free to Play' or 'Free'

df_sg_dpr['price'] = df_sg_dpr['price'].replace(regex=r'[^0-9]', value='0')
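
If a numeric 'price' column is the goal, one alternative is `pd.to_numeric` with `errors="coerce"`, which turns non-numeric entries into NaN that can then be filled with zero. This swaps in a different technique from the regex replacement above; the sample values are taken from the rows displayed earlier.

```python
import pandas as pd

prices = pd.Series([4.99, "Free To Play", "Free to Play", 0.99], dtype="object")

# Non-numeric entries become NaN, which is then filled with 0.0
clean = pd.to_numeric(prices, errors="coerce").fillna(0.0)
print(clean.tolist())  # [4.99, 0.0, 0.0, 0.99]
```

The result has dtype `float64`, so later aggregations on price do not need any further conversion.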

In [109]:
print(len(df_sg_dpr))
print(df_sg_dpr.isnull().sum())
df_sg_dpr.head()

27656
release_date    0
price           0
id              0
developer       0
dtype: int64


Unnamed: 0,release_date,price,id,developer
0,2018,4.99,761140,Kotoshiro
1,2018,0.0,643980,Secret Level SRL
2,2017,0.0,670290,Poolians.com
3,2017,0.99,767400,彼岸领域
5,2018,3.99,772540,Trickjump Games Ltd


In [198]:
# Reset the index and write the result to a CSV file

df_sg_dpr = df_sg_dpr.reset_index(drop=True)

df_sg_dpr.to_csv('Data/data_Steam_2.csv', index=False)

---------------------------------------------------