# 2. Initial data processing
In this notebook we take a look at the null values and convert the data type of some columns.

In [1]:
import numpy as np
import pandas as pd

In [2]:
df = pd.read_csv('data/raw/MainDF.csv').drop(labels='Unnamed: 0', axis=1)
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 62771 entries, 0 to 62770
Data columns (total 8 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   Game          62764 non-null  object 
 1   Release date  62771 non-null  object 
 2   Price         59468 non-null  object 
 3   Owners        62771 non-null  object 
 4   Developer(s)  62550 non-null  object 
 5   Publisher(s)  62570 non-null  object 
 6   ID            55909 non-null  float64
 7   tags          55234 non-null  object 
dtypes: float64(1), object(7)
memory usage: 3.8+ MB


In [3]:
# Since 'tags' is the main column we will explore, we drop every row that has a NaN in that column
df.dropna(how='all', subset=['tags'], inplace=True, ignore_index=True)

In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 55234 entries, 0 to 55233
Data columns (total 8 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   Game          55234 non-null  object 
 1   Release date  55234 non-null  object 
 2   Price         52251 non-null  object 
 3   Owners        55234 non-null  object 
 4   Developer(s)  55091 non-null  object 
 5   Publisher(s)  55107 non-null  object 
 6   ID            55234 non-null  float64
 7   tags          55234 non-null  object 
dtypes: float64(1), object(7)
memory usage: 3.4+ MB


In [6]:
# Check whether any ID is duplicated
df['ID'].is_unique

True

In [218]:
# Convert the 'Price' column to float: strip the '$' sign and replace 'Free' with 0
# regex=False makes '$' a literal character rather than a regex anchor
df['Price'] = df['Price'].str.replace('$', '', regex=False).replace('Free', 0).astype('float64')

In [219]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 55234 entries, 0 to 55233
Data columns (total 8 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   Game          55234 non-null  object 
 1   Release date  55234 non-null  object 
 2   Price         52251 non-null  float64
 3   Owners        55234 non-null  object 
 4   Developer(s)  55091 non-null  object 
 5   Publisher(s)  55107 non-null  object 
 6   ID            55234 non-null  float64
 7   tags          55234 non-null  object 
dtypes: float64(2), object(6)
memory usage: 3.4+ MB
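
The price conversion above can be checked on a small sample before running it on the full column. This is a minimal sketch: the `raw` series is made up for illustration, and it replaces `'Free'` with the string `'0'` (an assumption that differs slightly from the cell above) so the cast to float also works on string-typed columns.

```python
import pandas as pd

# Hypothetical sample imitating the raw 'Price' strings
raw = pd.Series(['$9.99', 'Free', None, '$0.99'])

cleaned = (
    raw.str.replace('$', '', regex=False)  # strip the literal dollar sign
       .replace('Free', '0')               # free games cost 0
       .astype('float64')                  # missing values become NaN
)
print(cleaned)
```

Missing prices stay as NaN after the cast, which is why the null count of `Price` does not change.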


In [220]:
# Check what the rows with a NaN 'Price' look like.
df[df['Price'].isna()].head(20)

Unnamed: 0,Game,Release date,Price,Owners,Developer(s),Publisher(s),ID,tags
48,Universe Sandbox Legacy,"Apr 29, 2011",,"500,000 .. 1,000,000",Giant Army,Giant Army,72200.0,"{'Simulation': 401, 'Sandbox': 385, 'Space': 3..."
59,Section 8: Prejudice,"May 4, 2011",,"500,000 .. 1,000,000",TimeGate Studios,Atari,97100.0,"{'Action': 70, 'FPS': 54, 'Sci-fi': 46, 'Shoot..."
65,Lord of the Rings: War in the North,"Nov 1, 2011",,"500,000 .. 1,000,000",Snowblind Studios,Warner Bros. Interactive Entertainment,32800.0,"{'RPG': 332, 'Action': 247, 'Co-op': 205, 'Fan..."
81,Magic: The Gathering - Duels of the Planeswalk...,"Jun 15, 2011",,"200,000 .. 500,000",Stainless Games Ltd,Wizards of the Coast LLC,49470.0,"{'Card Battler': 102, 'Deckbuilding': 81, 'Car..."
83,Might & Magic: Clash of Heroes,"Sep 22, 2011",,"200,000 .. 500,000",Capybara Games,Ubisoft,61700.0,"{'Strategy': 202, 'Puzzle': 163, 'RPG': 158, '..."
105,Majesty 2 Collection,"Apr 19, 2011",,"200,000 .. 500,000",1C:InoCo,Paradox Interactive,73020.0,"{'Strategy': 68, 'RTS': 42, 'Fantasy': 33, 'Ba..."
124,Shift 2 Unleashed,"Mar 29, 2011",,"100,000 .. 200,000",Slightly Mad Studios,Electronic Arts,47920.0,"{'Racing': 170, 'Automobile Sim': 104, 'Simula..."
127,Rochard,"Nov 15, 2011",,"100,000 .. 200,000",Recoil Games,Warner Bros. Interactive Entertainment,107800.0,"{'Platformer': 101, 'Indie': 89, 'Action': 82,..."
140,IL-2 Sturmovik: Cliffs of Dover,"Jul 19, 2011",,"100,000 .. 200,000",1C: Maddox Games,Fulqrum Publishing,63950.0,"{'Simulation': 51, 'Flight': 33, 'World War II..."
179,GundeadliGne,"Sep 27, 2011",,"50,000 .. 100,000",Platine Dispositif,Rockin' Android,92220.0,"{'Bullet Hell': 70, ""Shoot 'Em Up"": 63, 'Anime..."


When we look these titles up on the official Steam website, we find that most of them no longer exist. There is at least one known **reason**:<br>
The game has been re-released, so the earlier version was withdrawn from the store. For example, "Might & Magic: Clash of Heroes" was replaced by "Might & Magic: Clash of Heroes - Definitive Edition", and "DARK SOULS: Prepare To Die Edition" by "DARK SOULS: REMASTERED".<br>
Unfortunately, based on this brief investigation, we cannot fill the NaN values with the median or mode price ($4.99), as that would introduce a large error. For example, the re-release of "Might & Magic" costs $17.99 and the re-release of "DARK SOULS" costs $39.99. In general, a game gets re-released because it was very successful, so it rarely has a low price.<br>
In any case, these games are already represented in the df through their re-released versions, so keeping both versions would effectively amount to having duplicates.<br><br>
On the other hand, the games without a price that do still exist in the store, such as "Shift 2 Unleashed" or "IL-2 Sturmovik: Cliffs of Dover", seem to share this restriction: *"Requires agreement to a 3rd-party EULA eula_63950"*. I do not know any further details.<br><br>
Given this uncertainty, when I evaluate the games by price I will discard the entries whose price is NaN.
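
The decision above (drop NaN prices rather than impute them) can be sketched on a toy frame. The `games` frame and its values are hypothetical and only illustrate the mechanics.

```python
import numpy as np
import pandas as pd

# Hypothetical mini-frame; names and prices are illustrative only
games = pd.DataFrame({
    'Game': ['A', 'B', 'C', 'D'],
    'Price': [4.99, np.nan, 39.99, 0.0],
})

# Option chosen in this notebook: keep only rows with a known price
priced = games.dropna(subset=['Price'])

# Option rejected: imputing the median assigns an arbitrary price to game 'B'
imputed = games['Price'].fillna(games['Price'].median())

print(len(games), len(priced))
print(imputed.tolist())
```

Dropping is done on a copy (`priced`), so the rows without a price remain available for analyses that do not depend on it.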

In [221]:
# Now we convert the 'Release date' column to datetime
# Some dates do not include the day of the month, so the format is inconsistent, and we cannot check them one by one
# The errors argument gives us an alternative: since few rows are affected, errors='coerce' turns them into NaT
df['Release date'] = pd.to_datetime(df['Release date'], format='%b %d, %Y', errors='coerce')
df

Unnamed: 0,Game,Release date,Price,Owners,Developer(s),Publisher(s),ID,tags
0,Terraria,2011-05-16,9.99,"20,000,000 .. 50,000,000",Re-Logic,Re-Logic,105600.0,"{'Open World Survival Craft': 15689, 'Sandbox'..."
1,Portal 2,2011-04-18,9.99,"10,000,000 .. 20,000,000",Valve,Valve,620.0,"{'Platformer': 7324, 'Puzzle': 7109, 'Dark Hum..."
2,The Elder Scrolls V: Skyrim,2011-11-10,19.99,"5,000,000 .. 10,000,000",Bethesda Game Studios,Bethesda Softworks,72850.0,"{'Open World': 9937, 'RPG': 8637, 'Fantasy': 6..."
3,STAR WARS: The Old Republic,2011-12-20,0.00,"5,000,000 .. 10,000,000",Broadsword,Electronic Arts,1286830.0,"{'Free to Play': 506, 'Multiplayer': 390, 'MMO..."
4,APB Reloaded,2011-12-06,0.00,"5,000,000 .. 10,000,000",Reloaded Productions,Little Orbit,113400.0,"{'Free to Play': 1723, 'Open World': 989, 'Mul..."
...,...,...,...,...,...,...,...,...
55229,My Tribe,2008-11-28,9.99,"0 .. 20,000",Big Fish Games,Big Fish Games,51010.0,"{'Casual': 24, 'Simulation': 24}"
55230,Virtual Villagers: A New Home,2008-05-12,9.99,"0 .. 20,000",Last Day of Work,Last Day of Work,16100.0,"{'Simulation': 26, 'Casual': 25}"
55231,Virtual Villagers - The Secret City,2008-05-28,9.99,"0 .. 20,000",Last Day of Work,Last Day of Work,16180.0,"{'Casual': 24, 'Simulation': 24}"
55232,Petz Horsez 2,2008-06-13,9.99,"0 .. 20,000",Ubisoft,Ubisoft,15160.0,"{'Simulation': 34, 'Horses': 32, 'Family Frien..."


In [223]:
# Using the info method we see that the number of nulls is small (79 of 55,234 rows)
df.info()
print('Number of null dates:', df['Release date'].isna().sum())
# When we analyse by date we will need to take these nulls into account and drop them

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 55234 entries, 0 to 55233
Data columns (total 8 columns):
 #   Column        Non-Null Count  Dtype         
---  ------        --------------  -----         
 0   Game          55234 non-null  object        
 1   Release date  55155 non-null  datetime64[ns]
 2   Price         52251 non-null  float64       
 3   Owners        55234 non-null  object        
 4   Developer(s)  55091 non-null  object        
 5   Publisher(s)  55107 non-null  object        
 6   ID            55234 non-null  float64       
 7   tags          55234 non-null  object        
dtypes: datetime64[ns](1), float64(2), object(5)
memory usage: 3.4+ MB
Number of null dates: 79
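
To see how `errors='coerce'` behaves on dates that lack the day of the month, here is a minimal sketch. The `dates` series is a made-up sample; keeping the raw strings in a separate variable is an assumption that makes it possible to inspect the values that failed to parse.

```python
import pandas as pd

# Hypothetical sample: the second entry has no day of the month
dates = pd.Series(['May 16, 2011', 'Apr 2011', 'Nov 1, 2011'])

# Strings that do not match the format become NaT instead of raising an error
parsed = pd.to_datetime(dates, format='%b %d, %Y', errors='coerce')

# Raw strings that could not be parsed
unparsed = dates[parsed.isna()]

print(parsed)
print(unparsed.tolist())
```

Inspecting `unparsed` before overwriting the original column is a cheap way to confirm that only the month-only dates were lost.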


In [224]:
# To keep things tidy, we convert ID from float to integer values stored as object
df['ID'] = df['ID'].astype('int').astype('object')
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 55234 entries, 0 to 55233
Data columns (total 8 columns):
 #   Column        Non-Null Count  Dtype         
---  ------        --------------  -----         
 0   Game          55234 non-null  object        
 1   Release date  55155 non-null  datetime64[ns]
 2   Price         52251 non-null  float64       
 3   Owners        55234 non-null  object        
 4   Developer(s)  55091 non-null  object        
 5   Publisher(s)  55107 non-null  object        
 6   ID            55234 non-null  object        
 7   tags          55234 non-null  object        
dtypes: datetime64[ns](1), float64(1), object(6)
memory usage: 3.4+ MB


In [225]:
# Now we turn the Owners column into a categorical
# First let's see how many categories we have
df.Owners.unique()

array(['20,000,000\xa0..\xa050,000,000', '10,000,000\xa0..\xa020,000,000',
       '5,000,000\xa0..\xa010,000,000', '2,000,000\xa0..\xa05,000,000',
       '1,000,000\xa0..\xa02,000,000', '500,000\xa0..\xa01,000,000',
       '200,000\xa0..\xa0500,000', '100,000\xa0..\xa0200,000',
       '50,000\xa0..\xa0100,000', '20,000\xa0..\xa050,000',
       '0\xa0..\xa020,000', '100,000,000\xa0..\xa0200,000,000',
       '200,000,000\xa0..\xa0500,000,000',
       '50,000,000\xa0..\xa0100,000,000'], dtype=object)

In [226]:
# Import CategoricalDtype to build the categorization more quickly
from pandas.api.types import CategoricalDtype

In [227]:
# Define the ordered category, from the smallest range to the largest
owner_cat = CategoricalDtype(categories=['0\xa0..\xa020,000',
                                         '20,000\xa0..\xa050,000',
                                         '50,000\xa0..\xa0100,000',
                                         '100,000\xa0..\xa0200,000',
                                         '200,000\xa0..\xa0500,000',
                                         '500,000\xa0..\xa01,000,000',
                                         '1,000,000\xa0..\xa02,000,000',
                                         '2,000,000\xa0..\xa05,000,000',
                                         '5,000,000\xa0..\xa010,000,000',
                                         '10,000,000\xa0..\xa020,000,000',
                                         '20,000,000\xa0..\xa050,000,000',
                                         '50,000,000\xa0..\xa0100,000,000',
                                         '100,000,000\xa0..\xa0200,000,000',
                                         '200,000,000\xa0..\xa0500,000,000',],
                                         ordered=True)
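
As an alternative to typing the ordered list by hand, the order could be derived from the data by sorting the labels on their numeric lower bound. This is only a sketch: `lower_bound` is a hypothetical helper, the `owners` series is a small made-up sample, and it assumes every label has the form `'a\xa0..\xa0b'` seen in the output above.

```python
import pandas as pd

# Hypothetical subset of the 'Owners' labels (with non-breaking spaces, as in the data)
owners = pd.Series([
    '20,000\xa0..\xa050,000',
    '0\xa0..\xa020,000',
    '1,000,000\xa0..\xa02,000,000',
    '0\xa0..\xa020,000',
])

def lower_bound(label: str) -> int:
    """Return the numeric lower bound of an 'a .. b' owners range."""
    return int(label.split('\xa0..\xa0')[0].replace(',', ''))

# Sort the distinct labels by lower bound and build the ordered dtype from them
ordered_labels = sorted(owners.unique(), key=lower_bound)
derived_cat = pd.CategoricalDtype(categories=ordered_labels, ordered=True)

print(ordered_labels)
```

The hand-written list has the advantage of being explicit; the derived version avoids typos when the set of ranges changes.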

In [228]:
# Assign the category to the 'Owners' column of the df
df['Owners'] = df['Owners'].astype(owner_cat)
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 55234 entries, 0 to 55233
Data columns (total 8 columns):
 #   Column        Non-Null Count  Dtype         
---  ------        --------------  -----         
 0   Game          55234 non-null  object        
 1   Release date  55155 non-null  datetime64[ns]
 2   Price         52251 non-null  float64       
 3   Owners        55234 non-null  category      
 4   Developer(s)  55091 non-null  object        
 5   Publisher(s)  55107 non-null  object        
 6   ID            55234 non-null  object        
 7   tags          55234 non-null  object        
dtypes: category(1), datetime64[ns](1), float64(1), object(5)
memory usage: 3.0+ MB
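
What an ordered categorical buys us can be illustrated on three labels. This is a minimal sketch with a made-up `owners` series; the labels are chosen so that alphabetical order and category order disagree.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

# Three labels in their intended order (smallest range first)
levels = ['0\xa0..\xa020,000', '50,000\xa0..\xa0100,000', '100,000\xa0..\xa0200,000']
owner_cat = CategoricalDtype(categories=levels, ordered=True)

owners = pd.Series([levels[2], levels[0], levels[1]]).astype(owner_cat)

# Plain string sorting puts '100,000 ..' before '50,000 ..'
alphabetical = sorted(levels)

# The ordered categorical sorts and compares by category position instead
by_category = owners.sort_values().tolist()
at_least_50k = owners[owners >= levels[1]].tolist()

print(alphabetical)
print(by_category)
print(at_least_50k, owners.max())
```

Comparisons such as `>=` and reductions such as `max()` are only defined because the dtype was created with `ordered=True`.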


In [197]:
# Now we can sort the df by Owners
df.sort_values('Owners',ascending=False)

Unnamed: 0,Game,Release date,Price,Owners,Developer(s),Publisher(s),ID,tags
706,Dota 2,2013-07-09,0.00,"200,000,000 .. 500,000,000",Valve,Valve,570,"{'Free to Play': 59256, 'MOBA': 19844, 'Multip..."
305,Counter-Strike: Global Offensive,2012-08-21,0.00,"100,000,000 .. 200,000,000",Valve,Valve,730,"{'FPS': 89573, 'Shooter': 64398, 'Multiplayer'..."
45773,Lost Ark,2022-02-11,0.00,"50,000,000 .. 100,000,000",Smilegate RPG,Amazon Games,1599340,"{'MMORPG': 347, 'Free to Play': 309, 'Action R..."
37394,New World,2021-09-28,39.99,"50,000,000 .. 100,000,000",Amazon Games,Amazon Games,1063730,"{'Massively Multiplayer': 718, 'Open World': 7..."
29379,Apex Legends,2020-11-04,0.00,"50,000,000 .. 100,000,000",Respawn Entertainment,Electronic Arts,1172470,"{'Free to Play': 1827, 'Battle Royale': 1263, ..."
...,...,...,...,...,...,...,...,...
25246,Hour of the Snake,2019-04-19,2.99,"0 .. 20,000",Grzegorz Fila,Grzegorz Fila,1052660,"{'Indie': 33, 'Casual': 24, 'Fast-Paced': 12, ..."
25247,UNI TURRET,2019-04-10,1.99,"0 .. 20,000",gentome,gentome,1050430,"{'Casual': 22, 'Indie': 21}"
25248,Noodle Jump,2019-05-01,7.99,"0 .. 20,000",Toastar Virtual Entertainment,Toastar Virtual Entertainment,1050660,"{'RPG': 21, 'Indie': 21, 'Casual': 21, 'Advent..."
25249,Breaking Bunny,2019-04-23,0.99,"0 .. 20,000",Heinz Poetter,ADE,1050700,"{'Indie': 21, 'Casual': 21}"


Since CSV files do not preserve these data types, these cells will need to be run again when we start the exploration.
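
The loss of dtypes in a CSV round trip, and one possible way to restore them at load time, can be sketched as follows. The `sample` frame is hypothetical and an in-memory buffer stands in for the file; passing `parse_dates` and `dtype` to `read_csv` is offered as an alternative to re-running the cells, not as what this project does.

```python
import io

import pandas as pd
from pandas.api.types import CategoricalDtype, is_datetime64_any_dtype

levels = ['0\xa0..\xa020,000', '20,000\xa0..\xa050,000']
owner_cat = CategoricalDtype(categories=levels, ordered=True)

# Hypothetical two-row frame with the processed dtypes
sample = pd.DataFrame({
    'Release date': pd.to_datetime(['2011-05-16', '2008-11-28']),
    'Owners': pd.Series([levels[1], levels[0]]).astype(owner_cat),
})

# Write to an in-memory buffer instead of a file on disk
buffer = io.StringIO()
sample.to_csv(buffer, index=False)

# A plain read loses both the datetime and the categorical dtype
buffer.seek(0)
plain = pd.read_csv(buffer)

# Declaring the dtypes at load time restores them
buffer.seek(0)
restored = pd.read_csv(buffer, parse_dates=['Release date'],
                       dtype={'Owners': owner_cat})

print(plain.dtypes)
print(restored.dtypes)
```

A binary format such as pickle (`DataFrame.to_pickle`) keeps the dtypes without extra arguments, at the cost of not being human-readable.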

In [230]:
# Save the dataframe to a new file, 'processed'
df.to_csv('MainDF_processed.csv')
