Index:

* Libraries and configuration
* Data loading
* Helper functions
* Tables
    * Table df_users
    * Table user_items_df
    * Table df_items
    * Table item_genre
    * Table genres
    * Table reviews

---

---

# Libraries and configuration


In [109]:
import pandas as pd
import ast
import json
from textblob import TextBlob

In [110]:
pd.set_option('display.max_colwidth', 100)

---

---

# Data loading

In [111]:
raw_steam_games = pd.read_csv('data/raw_steam_games.csv')
raw_user_items = pd.read_csv('data/raw_user_items.csv')
raw_user_reviews = pd.read_csv('data/raw_user_reviews.csv')

----

----

# Helper functions

In [112]:
def valores_unicos(df):
    # Print the number of distinct values (NaN included) in each column.
    for x in df.columns:
        print('cant de valores unicos enla columna', x, ': ', len(df[x].unique()))

In [113]:
def duplicados_de(df):
    # For each column, count the rows whose value appears more than once.
    for i in df.columns:
        print( 'Duplicados de ',i,': ', df.duplicated(subset=i,keep=False).sum())

In [114]:
def lista_de_dic_a_df(df,col1,col_list_dic):
    df = df[[col1,col_list_dic]].copy() # keep only the needed columns; work on a copy, not a slice
    df[col_list_dic] = df[col_list_dic].apply(json.loads) # parse the JSON string into a list of dicts
    df = df.explode(col_list_dic).reset_index(drop=True) # one row per dict in the list
    llaves = list(df[col_list_dic].iloc[0].keys()) # the keys of the first dict define the new columns
    for llave in llaves:
        df[llave] = df[col_list_dic].apply(lambda x: x[llave])
    return df

In [115]:
def columna_con_listas_a_df(df,columna):
    data = df[[columna]].dropna().copy() # drop rows without a list; work on a copy, not a slice
    data['name'] = data[columna].apply(ast.literal_eval) # parse the string into a Python list
    data = data.explode('name') # one row per list element
    data = data.drop_duplicates(subset='name').reset_index(drop=True)
    data.insert(0,'id_' + columna,range(1,len(data)+1)) # sequential id starting at 1
    data = data.drop(columns=[columna])

    return data

In [116]:
def tabla_intermedia(df,col1_id,col2_lista):
    data = df[[col1_id,col2_lista]].dropna().copy() # rows without a list are discarded
    data[col2_lista] = data[col2_lista].apply(ast.literal_eval) # parse the string into a Python list
    data = data.explode(col2_lista) # one row per (id, element) pair
    data = data.drop_duplicates() # remove repeated (id, element) pairs
    data.reset_index(drop=True,inplace=True)

    return data

In [117]:
def play_4ever_x_genre(genre):
    id_items_filtrados = df_item_genre[df_item_genre.genres == genre]
    df = pd.merge(
        id_items_filtrados,
        item_id_playtime_forever,
        on = 'item_id',
        how = 'inner'
    )
    # print(genre , df.shape[0])
    cant = df.playtime_forever.sum()

    return cant

In [118]:
def analisis_de_sentimientos(text):
    polarity = TextBlob(text).sentiment.polarity # float in [-1.0, 1.0]
    if polarity < 0:
        return 0  # Negative
    elif polarity == 0:
        return 1  # Neutral
    else:
        return 2  # Positive

----

----

# Tables

---

# Table df_users

---

This table will hold user information in the following columns: user_id, items_count, and user_url.

In [119]:
raw_user_items.head(1)

Unnamed: 0,user_id,items_count,steam_id,user_url,items
0,76561197970982479,277,76561197970982479,http://steamcommunity.com/profiles/76561197970982479,"[{""item_id"": ""10"", ""item_name"": ""Counter-Strike"", ""playtime_forever"": 6, ""playtime_2weeks"": 0}, ..."


In [120]:
raw_user_items.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 70939 entries, 0 to 70938
Data columns (total 5 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   user_id      70939 non-null  object
 1   items_count  70939 non-null  int64 
 2   steam_id     70939 non-null  int64 
 3   user_url     70939 non-null  object
 4   items        70939 non-null  object
dtypes: int64(2), object(3)
memory usage: 2.7+ MB


In [121]:
valores_unicos(raw_user_items)

cant de valores unicos enla columna user_id :  70912
cant de valores unicos enla columna items_count :  924
cant de valores unicos enla columna steam_id :  70912
cant de valores unicos enla columna user_url :  70912
cant de valores unicos enla columna items :  68901


A previous analysis showed duplicated id values. They are dropped when building this table, because the duplicated rows differ only in the items column, which this table does not use.

In [122]:
df_users = raw_user_items.drop_duplicates(subset='user_id')

In [123]:
df_users.head(2)

Unnamed: 0,user_id,items_count,steam_id,user_url,items
0,76561197970982479,277,76561197970982479,http://steamcommunity.com/profiles/76561197970982479,"[{""item_id"": ""10"", ""item_name"": ""Counter-Strike"", ""playtime_forever"": 6, ""playtime_2weeks"": 0}, ..."
1,js41637,888,76561198035864385,http://steamcommunity.com/id/js41637,"[{""item_id"": ""10"", ""item_name"": ""Counter-Strike"", ""playtime_forever"": 0, ""playtime_2weeks"": 0}, ..."


In [124]:
df_users.shape

(70912, 5)

In [125]:
valores_unicos(df_users)

cant de valores unicos enla columna user_id :  70912
cant de valores unicos enla columna items_count :  924
cant de valores unicos enla columna steam_id :  70912
cant de valores unicos enla columna user_url :  70912
cant de valores unicos enla columna items :  68874


We can see that every user_id value and every user_url value is unique.

In [126]:
df_users = df_users[['user_id','items_count','user_url']] # select from the deduplicated frame, not from the raw one

## TODO: decide whether to keep or drop items_count

In [127]:
# save to csv
df_users.to_csv('data/df_users.csv',index=False)

-----

# Table user_items_df

This table will have the following columns:
    * user_id (unique identifier of the user),
    * item_id (unique identifier of the game), and
    * playtime_forever (total time the user has played a specific game; Steam reports this field in minutes).

In [128]:
raw_user_items.head(1)

Unnamed: 0,user_id,items_count,steam_id,user_url,items
0,76561197970982479,277,76561197970982479,http://steamcommunity.com/profiles/76561197970982479,"[{""item_id"": ""10"", ""item_name"": ""Counter-Strike"", ""playtime_forever"": 6, ""playtime_2weeks"": 0}, ..."


In [129]:
# extract the information from the items column
df_users_items = lista_de_dic_a_df(raw_user_items,'user_id','items')
df_users_items = df_users_items[['user_id','item_id','playtime_forever']]
df_users_items


Unnamed: 0,user_id,item_id,playtime_forever
0,76561197970982479,10,6
1,76561197970982479,20,0
2,76561197970982479,30,7
3,76561197970982479,40,0
4,76561197970982479,50,0
...,...,...,...
5097252,76561198329548331,346330,0
5097253,76561198329548331,373330,0
5097254,76561198329548331,388490,3
5097255,76561198329548331,521570,4
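The extraction performed above can be illustrated on a toy frame. This is a minimal sketch rather than the notebook's own function: the `sample` data is invented, every list is assumed to be non-empty, and `pd.json_normalize` is swapped in for the per-key loop.

```python
import json
import pandas as pd

# Hypothetical two-user sample mimicking the structure of the `items` column.
sample = pd.DataFrame({
    "user_id": ["u1", "u2"],
    "items": [
        '[{"item_id": "10", "playtime_forever": 6}, {"item_id": "20", "playtime_forever": 0}]',
        '[{"item_id": "10", "playtime_forever": 3}]',
    ],
})

# 1) Parse each JSON string into a Python list of dicts.
parsed = sample.assign(items=sample["items"].apply(json.loads))
# 2) Produce one row per dict.
exploded = parsed.explode("items").reset_index(drop=True)
# 3) Turn the dict keys into columns, aligned by position with the exploded rows.
expanded = pd.concat(
    [exploded[["user_id"]], pd.json_normalize(exploded["items"].tolist())],
    axis=1,
)
print(expanded)
```

Note that item_id comes out as a string here, just as in the real data, because the JSON stores it in quotes.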


* Null and duplicate analysis

In [130]:
print('nulls in playtime_forever: ', df_users_items.playtime_forever.isnull().sum())
print('nulls in item_id: ', df_users_items.item_id.isnull().sum())

nulls in playtime_forever:  0
nulls in item_id:  0


In [131]:
( 
    df_users_items
    .duplicated()
    .sum()
    )

3165

We will drop the 3165 duplicated rows.

In [132]:
df_users_items = df_users_items.drop_duplicates()

In [133]:
df_users_items.shape

(5094092, 3)

Next, we analyze the rows that share the same user_id and item_id but differ in playtime_forever.

In [134]:
df_users_items.head(2)

Unnamed: 0,user_id,item_id,playtime_forever
0,76561197970982479,10,6
1,76561197970982479,20,0


In [135]:
combinaciones_userId_itemId_duplicadas = df_users_items[df_users_items.duplicated(subset=['user_id', 'item_id'], keep=False)].sort_values(by=['user_id', 'item_id'])

In [136]:
combinaciones_userId_itemId_duplicadas

Unnamed: 0,user_id,item_id,playtime_forever
398432,76561198050680344,377160,1997
1487810,76561198050680344,377160,2058
426051,76561198064956087,40100,2504
1557729,76561198064956087,40100,2543
54357,76561198072861800,433850,5083
2074960,76561198072861800,433850,5084
1535868,76561198079079942,282070,1486
3355746,76561198079079942,282070,1516
355776,76561198081666970,361600,2
3155027,76561198081666970,361600,9


We can see that, instead of updating the value in the playtime_forever column, a new record was created. We will therefore keep a single row per user_id-item_id combination: the one with the larger playtime_forever. The cell below does this by keeping the last occurrence of each pair, which in the rows shown above is also the one with the larger value.

In [137]:
combinaciones_userId_itemId_duplicadas_a_eliminar = df_users_items[df_users_items.duplicated(subset=['user_id', 'item_id'], keep='last')].sort_values(by=['user_id', 'item_id'])
combinaciones_userId_itemId_duplicadas_a_eliminar

Unnamed: 0,user_id,item_id,playtime_forever
398432,76561198050680344,377160,1997
426051,76561198064956087,40100,2504
54357,76561198072861800,433850,5083
1535868,76561198079079942,282070,1486
355776,76561198081666970,361600,2
355743,76561198081666970,730,31410
679943,bwolf7803,221640,1000
680080,bwolf7803,345180,976
680082,bwolf7803,385770,1464
1506379,sergioxks1,291410,1357


In [138]:
filas_a_eliminar = list(combinaciones_userId_itemId_duplicadas_a_eliminar.index)
df_users_items = df_users_items.drop(filas_a_eliminar)
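Keeping the last occurrence only yields the larger playtime_forever when the later record is also the larger one. A sketch that enforces the rule explicitly, on hypothetical data (`sample` is invented for illustration), sorts by playtime before dropping duplicates:

```python
import pandas as pd

# Invented records: user "b" has the larger value FIRST, so keep='last' alone would pick the wrong row.
sample = pd.DataFrame({
    "user_id": ["a", "a", "b", "b", "c"],
    "item_id": ["10", "10", "20", "20", "30"],
    "playtime_forever": [1997, 2058, 9, 2, 5],
})

deduped = (
    sample
    .sort_values("playtime_forever")                           # smallest playtime first
    .drop_duplicates(subset=["user_id", "item_id"], keep="last")  # so the last one kept is the largest
    .sort_index()                                              # restore the original row order
    .reset_index(drop=True)
)
print(deduped)
```

An equivalent formulation would be a `groupby(['user_id', 'item_id'])['playtime_forever'].max()`; the sort-based version is shown because it keeps the original frame layout.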

* 'playtime_forever' == 0

## TODO: add a chart of the share of rows with playtime_forever == 0 versus the rest

We will remove the records whose 'playtime_forever' value is 0, since that means the user has not played that game.

In [139]:
filtro_playtime_forever_not0 = df_users_items.playtime_forever != 0
df_users_items = df_users_items[filtro_playtime_forever_not0].reset_index(drop=True)
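The share of zero-playtime rows, which is the figure a chart of this step would show, can be computed with a boolean mask. The snippet below is only a sketch on invented values (`sample` is hypothetical); on the real data the same expressions would be applied to df_users_items before the filter.

```python
import pandas as pd

# Hypothetical playtime values; three of the eight rows are zero.
sample = pd.DataFrame({"playtime_forever": [6, 0, 7, 0, 0, 4733, 1853, 333]})

is_zero = sample["playtime_forever"].eq(0)
pct_zero = is_zero.mean() * 100  # the mean of a boolean mask is the fraction of True values

# Two-row summary that could feed a bar or pie chart.
summary = is_zero.map({True: "never played", False: "played"}).value_counts()
print(f"{pct_zero:.1f}% of the rows have playtime_forever == 0")
print(summary)
```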


In [140]:
df_users_items.shape

(3246352, 3)

In [141]:
df_users_items

Unnamed: 0,user_id,item_id,playtime_forever
0,76561197970982479,10,6
1,76561197970982479,30,7
2,76561197970982479,300,4733
3,76561197970982479,240,1853
4,76561197970982479,3830,333
...,...,...,...
3246347,76561198329548331,304930,677
3246348,76561198329548331,227940,43
3246349,76561198329548331,388490,3
3246350,76561198329548331,521570,4


In [142]:
# save to csv
df_users_items.to_csv("data/df_users_items.csv",index=False)

-----


## Table items

In [143]:
raw_steam_games.head(1)


Unnamed: 0,publisher,genres,app_name,title,url,release_date,tags,reviews_url,discount_price,specs,price,early_access,item_id,metascore,developer
0,Kotoshiro,"['Action', 'Casual', 'Indie', 'Simulation', 'Strategy']",Lost Summoner Kitty,Lost Summoner Kitty,http://store.steampowered.com/app/761140/Lost_Summoner_Kitty/,2018-01-04,"['Strategy', 'Action', 'Indie', 'Casual', 'Simulation']",http://steamcommunity.com/app/761140/reviews/?browsefilter=mostrecent&p=1,4.49,['Single-player'],4.99,False,761140,,Kotoshiro


In [144]:
raw_steam_games.shape

(32133, 15)

In [145]:
# We have no information about the conditions under which the discounts were applied, so only the 'price' column is used for the queries.
items = raw_steam_games[['item_id','title','url','release_date','developer','price']].copy() # work on a copy, not a slice
items

Unnamed: 0,item_id,title,url,release_date,developer,price
0,761140,Lost Summoner Kitty,http://store.steampowered.com/app/761140/Lost_Summoner_Kitty/,2018-01-04,Kotoshiro,4.99
1,643980,Ironbound,http://store.steampowered.com/app/643980/Ironbound/,2018-01-04,Secret Level SRL,Free To Play
2,670290,Real Pool 3D - Poolians,http://store.steampowered.com/app/670290/Real_Pool_3D__Poolians/,2017-07-24,Poolians.com,Free to Play
3,767400,弹炸人2222,http://store.steampowered.com/app/767400/2222/,2017-12-07,彼岸领域,0.99
4,773570,,http://store.steampowered.com/app/773570/Log_Challenge/,,,2.99
...,...,...,...,...,...,...
32128,773640,Colony On Mars,http://store.steampowered.com/app/773640/Colony_On_Mars/,2018-01-04,"Nikita ""Ghost_RUS""",1.99
32129,733530,LOGistICAL: South Africa,http://store.steampowered.com/app/733530/LOGistICAL_South_Africa/,2018-01-04,Sacada,4.99
32130,610660,Russian Roads,http://store.steampowered.com/app/610660/Russian_Roads/,2018-01-04,Laush Dmitriy Sergeevich,1.99
32131,658870,EXIT 2 - Directions,http://store.steampowered.com/app/658870/EXIT_2__Directions/,2017-09-02,"xropi,stev3ns",4.99


Analysis of the items table

In [146]:
(
    items
    .isna()
    .sum()
)

item_id            0
title           2049
url                0
release_date    2066
developer       3298
price           1377
dtype: int64

In [147]:
(
    items
    .duplicated()
    .sum()
)

0

There are no null values in the item_id and url columns, as expected, since both identify an item. There are no duplicated rows either.

In the table displayed above we noticed that the price column contains non-numeric values.

In [148]:
valores_unicos_de_price = items.price.unique()
print('Number of unique values in "price": ', len(valores_unicos_de_price))
# print('Unique values in "price": ', valores_unicos_de_price)
price_str = [x for x in valores_unicos_de_price if not (str(x).replace('.', '', 1).isdigit() or str(x).isdigit())]
print('non-numeric values in price: ', price_str)
print('number of non-numeric values in price: ', len(price_str))

Number of unique values in "price":  163
non-numeric values in price:  ['Free To Play', 'Free to Play', nan, 'Free', 'Free Demo', 'Play for Free!', 'Install Now', 'Play WARMACHINE: Tactics Demo', 'Free Mod', 'Install Theme', 'Third-party', 'Play Now', 'Free HITMAN™ Holiday Pack', 'Play the Demo', 'Starting at $499.00', 'Starting at $449.00', 'Free to Try', 'Free Movie', 'Free to Use']
number of non-numeric values in price:  19
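An alternative way to find the non-numeric labels, sketched here on a hypothetical series rather than on the real column, is to let `pd.to_numeric` coerce failures to NaN and keep the values that failed without being null to begin with:

```python
import pandas as pd

# Invented sample of raw price values; dtype=object mimics a mixed CSV column.
prices = pd.Series(
    ["4.99", "Free To Play", None, "0.99", "Starting at $499.00"],
    dtype=object,
)

as_number = pd.to_numeric(prices, errors="coerce")  # unparsable entries become NaN
failed = as_number.isna() & prices.notna()          # exclude values that were already missing
non_numeric = prices[failed].unique().tolist()
print(non_numeric)
```

Unlike the digit-based test, this separates genuine nulls from text labels, so NaN does not show up in the resulting list.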


We will now analyze the list price_str.


* Null values

In [149]:
items[items.price.isna()]

Unnamed: 0,item_id,title,url,release_date,developer,price
9,768800,Race,http://store.steampowered.com/app/768800/Race/,2018-01-04,RewindApp,
10,768570,Uncanny Islands,http://store.steampowered.com/app/768570/Uncanny_Islands/,Soon..,Qucheza,
31,520680,Lost Cities,http://store.steampowered.com/app/520680/Lost_Cities/,2018-01-01,BlueLine Games,
32,690410,Twisted Enhanced Edition,http://store.steampowered.com/app/690410/Twisted_Enhanced_Edition/,2018-01-01,Games by Brundle,
34,413120,Tactics Forever,http://store.steampowered.com/app/413120/Tactics_Forever/,2018-01-01,ProjectorGames,
...,...,...,...,...,...,...
32097,771070,Infinos Gaiden,http://store.steampowered.com/app/771070/Infinos_Gaiden/,2018-01-19,Picorinne Soft,
32109,90007,International Online Soccer,http://store.steampowered.com/app/90007/International_Online_Soccer/,2002-01-01,I.O.S. Team,
32121,772180,Cricket Club,http://store.steampowered.com/app/772180/Cricket_Club/,January 2018,VersoVR,
32123,771810,The spy who shot me™,http://store.steampowered.com/app/771810/The_spy_who_shot_me/,2018-10-01,Retro Army Limited,


There are 1377 rows with null values in the price column. We will check whether users have consumed these items.

In [150]:
# DataFrame of games without a price
priceless_items_df = items[items.price.isnull()]
priceless_items_df.to_csv('data/priceless_items_df.csv',index=False)
priceless_items_id_list = list(priceless_items_df.item_id)
df_users_items[df_users_items.item_id.isin(priceless_items_id_list)]


Unnamed: 0,user_id,item_id,playtime_forever


The result is empty, which indicates that no user has consumed the items that lack a price.
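One caveat applies to membership checks of this kind. The item_id values parsed from the JSON in the items column are strings, while the ids read from the games CSV are integers, and `isin` never matches a string against an integer. The sketch below, on invented ids, shows how an empty result can come from the type mismatch alone; casting both sides to a common type before the check would make it conclusive.

```python
import pandas as pd

played = pd.Series(["10", "20", "30"])  # hypothetical ids parsed from JSON, stored as strings
priceless = [10, 30]                    # hypothetical ids read from a CSV as integers

naive = played.isin(priceless)                    # "10" is never equal to 10
aligned = played.astype("int64").isin(priceless)  # compare on a common integer type

print(naive.sum(), aligned.sum())
```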

 * 'Starting at $449.00' and 'Starting at $499.00'


In [151]:
items[items.price ==  'Starting at $449.00']

Unnamed: 0,item_id,title,url,release_date,developer,price
24999,353390,Alienware Steam Machine,http://store.steampowered.com/app/353390/Alienware_Steam_Machine/,2015-11-10,,Starting at $449.00


In [152]:
items[items.price ==  'Starting at $499.00']

Unnamed: 0,item_id,title,url,release_date,developer,price
24998,353420,Syber Steam Machine,http://store.steampowered.com/app/353420/Syber_Steam_Machine/,2015-11-10,,Starting at $499.00


In [153]:
items[items.item_id.isin([353390,353420])]

Unnamed: 0,item_id,title,url,release_date,developer,price
24998,353420,Syber Steam Machine,http://store.steampowered.com/app/353420/Syber_Steam_Machine/,2015-11-10,,Starting at $499.00
24999,353390,Alienware Steam Machine,http://store.steampowered.com/app/353390/Alienware_Steam_Machine/,2015-11-10,,Starting at $449.00


In [154]:
df_users_items[df_users_items.item_id.isin([353390,353420])]

Unnamed: 0,user_id,item_id,playtime_forever


Items 353390 and 353420 will be removed: their URLs redirect to https://store.steampowered.com/, and the check above returned no users who acquired them.

In [155]:
items = items[~(items.price.isin(['Starting at $449.00','Starting at $499.00']))]

* Third-party


In [156]:
items[items.price == 'Third-party']

Unnamed: 0,item_id,title,url,release_date,developer,price
3917,362970,Parcel - Soundtrack,http://store.steampowered.com/app/362970/Parcel__Soundtrack/,2015-07-02,Polar Bunny Ltd,Third-party
31836,3483,Peggle Extreme,http://store.steampowered.com/app/3483/Peggle_Extreme/,2007-09-11,"PopCap Games, Inc.",Third-party


Checking the URLs of items 362970 and 3483, we verified that both are free.

* 'Install Now'

In [157]:
items[items.price == 'Install Now']

Unnamed: 0,item_id,title,url,release_date,developer,price
2404,268850,EVGA Precision XOC,http://store.steampowered.com/app/268850/EVGA_Precision_XOC/,2014-09-19,EVGA,Install Now


Checking the URL of item 268850, we verified that it is free.

* 'Play Now'

In [158]:
items[items.price == 'Play Now']

Unnamed: 0,item_id,title,url,release_date,developer,price
4025,345040,Oblivious Garden ~White Day,http://store.steampowered.com/app/345040/Oblivious_Garden_White_Day/,2015-07-20,"CorypheeSoft,DigitalEZ",Play Now
26215,383860,Area-X - Extra Gallery,http://store.steampowered.com/app/383860/AreaX__Extra_Gallery/,2015-06-24,Zeiva Inc,Play Now


Checking the URLs of items 345040 and 383860, we verified that they are free.

* The labels 'Free', 'Free to Play', 'Free To Play', 'Play WARMACHINE: Tactics Demo', 'Free HITMAN™ Holiday Pack', 'Free Movie', 'Play for Free!', 'Free to Use', 'Free Mod', 'Play the Demo' and 'Free to Try' indicate that the item is free.
* The items labeled 'Starting at $449.00' and 'Starting at $499.00' were removed because no user consumed them, they have no developer data, and their URLs redirect to the store's home page.
* We verified that the labels 'Third-party', 'Install Now' and 'Play Now' correspond to free items.
* We verified that the items with a null price have not been consumed by any user, so they will be removed. Their information is nevertheless saved in the DataFrame priceless_items_df. All remaining values in the list price_str will be replaced by 0, since they represent no cost to the user.
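The rules summarized above can be written as a single function. This is a hedged sketch, not the notebook's implementation: `normalize_price` is a hypothetical helper, it returns None for rows that the notebook removes, and it treats every other non-numeric label as free, which is the assumption stated in the last bullet.

```python
import math

def normalize_price(raw):
    """Map a raw price value to a float, or to None when the row is meant to be dropped."""
    # Null prices: rows are removed from the table.
    if raw is None or (isinstance(raw, float) and math.isnan(raw)):
        return None
    text = str(raw).strip()
    # Plain numbers such as '4.99' are kept as they are.
    try:
        return float(text)
    except ValueError:
        pass
    # Hardware listings ('Starting at $449.00'): rows are removed as well.
    if text.startswith("Starting at"):
        return None
    # Assumption: any other label ('Free To Play', 'Install Now', ...) means no cost.
    return 0.0

for value in ["4.99", "Free To Play", float("nan"), "Starting at $449.00"]:
    print(repr(value), "->", normalize_price(value))
```

Keeping the rule in one place makes it easy to review which labels are treated as free before the column is cast to float.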

In [159]:
items = items.dropna(subset='price') # reassign instead of modifying a slice in place



In [160]:
items['price'] = items['price'].replace(price_str, 0) # every remaining non-numeric label means "free"
items['price'] = items['price'].astype("float64")
items


Unnamed: 0,item_id,title,url,release_date,developer,price
0,761140,Lost Summoner Kitty,http://store.steampowered.com/app/761140/Lost_Summoner_Kitty/,2018-01-04,Kotoshiro,4.99
1,643980,Ironbound,http://store.steampowered.com/app/643980/Ironbound/,2018-01-04,Secret Level SRL,0.00
2,670290,Real Pool 3D - Poolians,http://store.steampowered.com/app/670290/Real_Pool_3D__Poolians/,2017-07-24,Poolians.com,0.00
3,767400,弹炸人2222,http://store.steampowered.com/app/767400/2222/,2017-12-07,彼岸领域,0.99
4,773570,,http://store.steampowered.com/app/773570/Log_Challenge/,,,2.99
...,...,...,...,...,...,...
32128,773640,Colony On Mars,http://store.steampowered.com/app/773640/Colony_On_Mars/,2018-01-04,"Nikita ""Ghost_RUS""",1.99
32129,733530,LOGistICAL: South Africa,http://store.steampowered.com/app/733530/LOGistICAL_South_Africa/,2018-01-04,Sacada,4.99
32130,610660,Russian Roads,http://store.steampowered.com/app/610660/Russian_Roads/,2018-01-04,Laush Dmitriy Sergeevich,1.99
32131,658870,EXIT 2 - Directions,http://store.steampowered.com/app/658870/EXIT_2__Directions/,2017-09-02,"xropi,stev3ns",4.99


In [161]:
duplicados_de(items)

Duplicados de  item_id :  0
Duplicados de  title :  1983
Duplicados de  url :  0
Duplicados de  release_date :  29447
Duplicados de  developer :  23795
Duplicados de  price :  30682


In [162]:
(
    items
    .isna()
    .sum()
)

item_id            0
title           1932
url                0
release_date    1936
developer       3154
price              0
dtype: int64

In [163]:
items.to_csv("data/df_items.csv",index=False)

## TODO: review the items with duplicated titles

In [164]:
# items[items.title.duplicated(keep=False)].sort_values(by=['title','developer']).tail(50)

In [165]:
# items[items.title.duplicated(keep=False)].sort_values(by=['title','developer'])

----

# Table games_genres

In [166]:
df_item_genre = tabla_intermedia(raw_steam_games,'item_id','genres')

In [167]:
df_item_genre.head(2)

Unnamed: 0,item_id,genres
0,761140,Action
1,761140,Casual
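The intermediate table is built by parsing each stringified list and producing one row per (item, genre) pair. A minimal sketch of that idea on invented data (the `games` frame is hypothetical) looks like this:

```python
import ast
import pandas as pd

# Hypothetical games table; genres are stored as the string form of a Python list.
games = pd.DataFrame({
    "item_id": [1, 2, 3],
    "genres": ["['Action', 'Indie']", None, "['Action']"],
})

pairs = games.dropna(subset=["genres"]).copy()             # items without genres are left out
pairs["genres"] = pairs["genres"].apply(ast.literal_eval)  # string -> list, without using eval
pairs = pairs.explode("genres").drop_duplicates().reset_index(drop=True)
print(pairs)
```

`ast.literal_eval` only accepts Python literals, which makes it a safer choice than `eval` for data read from a file.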


In [168]:
df_item_genre.genres = df_item_genre.genres.str.lower()

In [169]:
df_item_genre

Unnamed: 0,item_id,genres
0,761140,action
1,761140,casual
2,761140,indie
3,761140,simulation
4,761140,strategy
...,...,...
71548,610660,indie
71549,610660,racing
71550,610660,simulation
71551,658870,casual


In [170]:
df_item_genre.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 71553 entries, 0 to 71552
Data columns (total 2 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   item_id  71553 non-null  int64 
 1   genres   71553 non-null  object
dtypes: int64(1), object(1)
memory usage: 1.1+ MB


In [171]:
(
    df_item_genre
    .duplicated()
    .sum()
)

0

This table has no null values or duplicated rows.

In [172]:
df_item_genre.to_csv("data/df_item_genre.csv",index=False)

----

# Table genres

In [173]:
raw_steam_games.head(1)

Unnamed: 0,publisher,genres,app_name,title,url,release_date,tags,reviews_url,discount_price,specs,price,early_access,item_id,metascore,developer
0,Kotoshiro,"['Action', 'Casual', 'Indie', 'Simulation', 'Strategy']",Lost Summoner Kitty,Lost Summoner Kitty,http://store.steampowered.com/app/761140/Lost_Summoner_Kitty/,2018-01-04,"['Strategy', 'Action', 'Indie', 'Casual', 'Simulation']",http://steamcommunity.com/app/761140/reviews/?browsefilter=mostrecent&p=1,4.49,['Single-player'],4.99,False,761140,,Kotoshiro


In [174]:
genres = columna_con_listas_a_df(raw_steam_games,'genres')
genres.name = genres.name.str.lower()


In [175]:
genres

Unnamed: 0,id_genres,name
0,1,action
1,2,casual
2,3,indie
3,4,simulation
4,5,strategy
5,6,free to play
6,7,rpg
7,8,sports
8,9,adventure
9,10,racing


Adding the playtime_forever column

In [176]:
raw_steam_games[raw_steam_games.item_id == 10]

Unnamed: 0,publisher,genres,app_name,title,url,release_date,tags,reviews_url,discount_price,specs,price,early_access,item_id,metascore,developer
32104,Valve,['Action'],Counter-Strike,Counter-Strike,http://store.steampowered.com/app/10/CounterStrike/,2000-11-01,"['Action', 'FPS', 'Multiplayer', 'Shooter', 'Classic', 'Team-Based', 'Competitive', 'First-Perso...",http://steamcommunity.com/app/10/reviews/?browsefilter=mostrecent&p=1,,"['Multi-player', 'Valve Anti-Cheat enabled']",9.99,False,10,88.0,Valve


In [177]:
item_id_playtime_forever = df_users_items[['item_id','playtime_forever']]
item_id_playtime_forever = item_id_playtime_forever.groupby('item_id')['playtime_forever'].sum().reset_index()

In [178]:
item_id_playtime_forever

Unnamed: 0,item_id,playtime_forever
0,10,17107858
1,100,301732
2,10000,62685
3,1002,894
4,100400,6544
...,...,...
10045,99890,127442
10046,9990,2083
10047,99900,17083206
10048,99910,426210


In [179]:
item_id_playtime_forever.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10050 entries, 0 to 10049
Data columns (total 2 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   item_id           10050 non-null  object
 1   playtime_forever  10050 non-null  int64 
dtypes: int64(1), object(1)
memory usage: 157.2+ KB


In [180]:
item_id_playtime_forever['item_id'] = item_id_playtime_forever['item_id'].astype('int64')


In [181]:
item_id_playtime_forever.duplicated().sum()

0

There are no null values or duplicates.

In [182]:
#  Already defined at the beginning of the notebook.
def play_4ever_x_genre(genre):
    id_games_filtrados = df_item_genre[df_item_genre.genres == genre]
    df = pd.merge(
        id_games_filtrados,
        item_id_playtime_forever,
        on = 'item_id',
        how = 'inner'
    )
    # print(genre , df.shape[0])
    cant = df.playtime_forever.sum()

    return cant

In [183]:
play_4ever_x_genre_list = []
for genero in genres.name:
    play_4ever_x_genre_list.append((play_4ever_x_genre(genero)))
genres['playtime_forever'] = play_4ever_x_genre_list
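The per-genre loop can also be expressed as one merge followed by a group-by. The following is a sketch on invented frames (`item_genre` and `playtime` are hypothetical stand-ins for df_item_genre and item_id_playtime_forever), not a drop-in replacement for the cell above:

```python
import pandas as pd

# Hypothetical item-genre pairs and per-item playtime totals.
item_genre = pd.DataFrame({
    "item_id": [10, 10, 20, 30],
    "genres": ["action", "indie", "action", "rpg"],
})
playtime = pd.DataFrame({
    "item_id": [10, 20, 40],
    "playtime_forever": [100, 50, 7],
})

per_genre = (
    item_genre
    .merge(playtime, on="item_id", how="inner")  # keep only items that have playtime data
    .groupby("genres")["playtime_forever"]
    .sum()
    .reindex(["action", "indie", "rpg"], fill_value=0)  # genres with no played items get 0
)
print(per_genre)
```

A game that belongs to several genres contributes its full playtime to each of them, which matches the behaviour of the loop.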

In [184]:
genres = genres.sort_values(by='playtime_forever', ascending=False)
genres['ranking'] = range(1, len(genres) + 1)
genres

Unnamed: 0,id_genres,name,playtime_forever,ranking
0,1,action,3074865964,1
2,3,indie,1475383715,2
6,7,rpg,1027849083,3
8,9,adventure,898675144,4
3,4,simulation,855261582,5
4,5,strategy,650991704,6
5,6,free to play,603563359,7
11,12,massively multiplayer,441038278,8
1,2,casual,249315170,9
10,11,early access,156682370,10


In [185]:
genres.to_csv("data/df_genres.csv",index=False)

-----

## Table df_reviews

In [186]:
raw_user_reviews.head(2)

Unnamed: 0,user_id,user_url,reviews
0,76561197970982479,http://steamcommunity.com/profiles/76561197970982479,"[{""funny"": """", ""posted"": ""Posted November 5, 2011."", ""last_edited"": """", ""item_id"": ""1250"", ""help..."
1,js41637,http://steamcommunity.com/id/js41637,"[{""funny"": """", ""posted"": ""Posted June 24, 2014."", ""last_edited"": """", ""item_id"": ""251610"", ""helpf..."


In [187]:
df_reviews = raw_user_reviews[['user_id','reviews']].copy() # work on a copy, not a slice
df_reviews['reviews'] = df_reviews['reviews'].apply(json.loads) # parse the JSON string into a list of dicts
df_reviews = df_reviews.explode('reviews').reset_index(drop=True) # one row per review; each cell now holds a single dict


In [188]:
llaves = list(df_reviews[['reviews']].iloc[0,0].keys())
llaves

['funny', 'posted', 'last_edited', 'item_id', 'helpful', 'recommend', 'review']

In [189]:
for llave in llaves:
    df_reviews[llave] = df_reviews['reviews'].apply(lambda x: x[llave]) # create one column per key

In [190]:
df_reviews = df_reviews.drop(columns=['reviews']) # drop the column with nested data


In [191]:
df_reviews

Unnamed: 0,user_id,funny,posted,last_edited,item_id,helpful,recommend,review
0,76561197970982479,,"Posted November 5, 2011.",,1250,No ratings yet,True,"Simple yet with great replayability. In my opinion does ""zombie"" hordes and team work better tha..."
1,76561197970982479,,"Posted July 15, 2011.",,22200,No ratings yet,True,It's unique and worth a playthrough.
2,76561197970982479,,"Posted April 21, 2011.",,43110,No ratings yet,True,Great atmosphere. The gunplay can be a bit chunky at times but at the end of the day this game i...
3,js41637,,"Posted June 24, 2014.",,251610,15 of 20 people (75%) found this review helpful,True,"I know what you think when you see this title ""Barbie Dreamhouse Party"" but do not be intimidate..."
4,js41637,,"Posted September 8, 2013.",,227300,0 of 1 people (0%) found this review helpful,True,"For a simple (it's actually not all that simple but it can be!) truck driving Simulator, it is q..."
...,...,...,...,...,...,...,...,...
52229,76561198310819422,1 person found this review funny,Posted June 23.,,570,1 of 1 people (100%) found this review helpful,True,Well Done
52230,76561198312638244,,Posted July 21.,,233270,No ratings yet,True,this is a very fun and nice 80s themed shooter. im not good at spoiler free reviews so im just g...
52231,76561198312638244,,Posted July 10.,,130,No ratings yet,True,if you liked Half life i would really recommend getting this expansion.im not gonna spoil anythi...
52232,76561198312638244,,Posted July 10.,,70,No ratings yet,True,a must have classic from steam definitely worth buying.


In [192]:
df_reviews.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 52234 entries, 0 to 52233
Data columns (total 8 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   user_id      52234 non-null  object
 1   funny        52234 non-null  object
 2   posted       52234 non-null  object
 3   last_edited  52234 non-null  object
 4   item_id      52234 non-null  object
 5   helpful      52234 non-null  object
 6   recommend    52234 non-null  bool  
 7   review       52234 non-null  object
dtypes: bool(1), object(7)
memory usage: 2.8+ MB


In [193]:
df_reviews.duplicated().sum()

0

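There are no null values or duplicated rows. Note that missing fields such as funny or last_edited are stored as empty strings, so they do not count as nulls.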


We will create the posted_date column from the information in the posted column. Entries that are incomplete (for example, those missing the year) will be treated as null values.

In [195]:
df_reviews.sort_values('posted').tail(50)

Unnamed: 0,user_id,funny,posted,last_edited,item_id,helpful,recommend,review
40012,viihdavanzo,,"Posted September 9, 2015.",,730,No ratings yet,True,
37918,W33dl3y,1 person found this review funny,"Posted September 9, 2015.","Last edited September 9, 2015.",239030,No ratings yet,True,"Guy walks into the checkpoint, hands in papers, invalid expiration date, invalid issuing city, n..."
35027,76561198071340853,,"Posted September 9, 2015.",,440,No ratings yet,True,"TF2.Me be on 19 Kill StreakMe see gibus solder, ""free easy kill"" Me thinkRandom Crit.I Die.,Rand..."
51164,76561198107730535,,"Posted September 9, 2015.",,230290,2 of 2 people (100%) found this review helpful,True,"the absolute most fun i've ever had out of a simulation (its not exactly a game, but god damn is..."
40096,76561198005708745,,"Posted September 9, 2015.",,234140,1 of 1 people (100%) found this review helpful,True,"Didn't waste my $60, well done movie and game!! c:"
17522,Skirin200,,"Posted September 9, 2015.","Last edited September 9, 2015.",440,No ratings yet,False,Cara jogo bom mas... se tu não gastar dinheiro nele vc vai ser visto como uma bosta... todos vão...
19253,76561198086388497,,"Posted September 9, 2015.",,570,No ratings yet,True,"Set my region to South America, got matched with a team with russians and mexicans 10/10 wou..."
42655,76561198049429171,1 person found this review funny,"Posted September 9, 2015.",,730,No ratings yet,False,♥♥♥♥ing ♥♥♥♥♥ everywhere
25095,derplaymc,,Posted September 9.,,304050,No ratings yet,True,ITS SO ADICTIVE
15848,BokuNoSekai,,Posted September 9.,,57690,No ratings yet,True,What can i say? This game's a pretty fun city management with a sizable amount of political humo...


In [196]:
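# Capture only complete dates of the form "Month D, YYYY"; entries without a year do not match and become NaN
formato_de_la_fecha = r'Posted (\w+ \d{1,2}, \d{4})'
df_reviews['posted_date'] = df_reviews['posted'].str.extract(formato_de_la_fecha)
# Parse the captured text into datetimes; the unmatched rows become NaT
df_reviews['posted_date'] = pd.to_datetime(df_reviews['posted_date'], format='%B %d, %Y')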
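
As a minimal sketch of how the extraction treats complete and incomplete dates, the following reproduces the same pattern on a two-row sample. The `sample` frame is hypothetical and only mimics the shape of the `posted` column; `expand=False` is used here so that `str.extract` returns a Series.

```python
import pandas as pd

# Hypothetical sample mimicking the `posted` column (not the real data)
sample = pd.DataFrame({
    'posted': [
        'Posted September 9, 2015.',  # complete date
        'Posted September 9.',        # year missing
    ]
})

pattern = r'Posted (\w+ \d{1,2}, \d{4})'

# Rows that do not match the pattern yield NaN
sample['posted_date'] = sample['posted'].str.extract(pattern, expand=False)

# NaN values are converted to NaT when parsing
sample['posted_date'] = pd.to_datetime(sample['posted_date'], format='%B %d, %Y')

print(sample)
```

The second row has no year, so the pattern does not match and the parsed value is NaT.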


#### We will drop 20% of the data

In [197]:
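# Preview without the raw `posted` column (the result is not assigned, so df_reviews keeps the column)
df_reviews.drop(columns='posted')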

Unnamed: 0,user_id,funny,last_edited,item_id,helpful,recommend,review,posted_date
0,76561197970982479,,,1250,No ratings yet,True,"Simple yet with great replayability. In my opinion does ""zombie"" hordes and team work better tha...",2011-11-05
1,76561197970982479,,,22200,No ratings yet,True,It's unique and worth a playthrough.,2011-07-15
2,76561197970982479,,,43110,No ratings yet,True,Great atmosphere. The gunplay can be a bit chunky at times but at the end of the day this game i...,2011-04-21
3,js41637,,,251610,15 of 20 people (75%) found this review helpful,True,"I know what you think when you see this title ""Barbie Dreamhouse Party"" but do not be intimidate...",2014-06-24
4,js41637,,,227300,0 of 1 people (0%) found this review helpful,True,"For a simple (it's actually not all that simple but it can be!) truck driving Simulator, it is q...",2013-09-08
...,...,...,...,...,...,...,...,...
52229,76561198310819422,1 person found this review funny,,570,1 of 1 people (100%) found this review helpful,True,Well Done,NaT
52230,76561198312638244,,,233270,No ratings yet,True,this is a very fun and nice 80s themed shooter. im not good at spoiler free reviews so im just g...,NaT
52231,76561198312638244,,,130,No ratings yet,True,if you liked Half life i would really recommend getting this expansion.im not gonna spoil anythi...,NaT
52232,76561198312638244,,,70,No ratings yet,True,a must have classic from steam definitely worth buying.,NaT
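
One possible way to check what share of the rows has no parsed date, and to drop those rows if desired, is sketched below. The `sample` frame and the names `missing_share` and `cleaned` are hypothetical; this is not necessarily the step applied in this notebook, where the rows with NaT still appear in later outputs.

```python
import pandas as pd

# Hypothetical stand-in for df_reviews with one missing date out of five
sample = pd.DataFrame({
    'user_id': ['a', 'b', 'c', 'd', 'e'],
    'posted_date': pd.to_datetime(
        ['2011-11-05', '2014-06-24', None, '2013-09-08', '2015-09-09']
    ),
})

# Fraction of rows whose date could not be parsed
missing_share = sample['posted_date'].isna().mean()
print(f'Share of rows without a date: {missing_share:.0%}')

# Drop the rows without a date and rebuild the index
cleaned = sample.dropna(subset=['posted_date']).reset_index(drop=True)
print(cleaned)
```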


In [201]:
df_reviews['year'] = df_reviews['posted_date'].dt.year
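
Because some dates are NaT, the extracted year is stored as a float (for example 2011.0) with NaN for the missing entries. If integer years are preferred, one option is the pandas nullable integer dtype `Int64`. The sketch below uses a hypothetical `dates` Series, not the real column.

```python
import pandas as pd

# Hypothetical dates: one valid, one missing
dates = pd.Series(pd.to_datetime(['2011-11-05', None]))

# With missing values present, the year comes back as float with NaN
year_float = dates.dt.year
print(year_float)

# The nullable Int64 dtype keeps integer years and marks missing ones as <NA>
year_int = dates.dt.year.astype('Int64')
print(year_int)
```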

In [205]:
df_reviews['sentiment_analysis'] = df_reviews['review'].apply(analisis_de_sentimientos)
df_reviews

Unnamed: 0,user_id,funny,posted,last_edited,item_id,helpful,recommend,review,posted_date,year,sentiment_analysis
0,76561197970982479,,"Posted November 5, 2011.",,1250,No ratings yet,True,"Simple yet with great replayability. In my opinion does ""zombie"" hordes and team work better tha...",2011-11-05,2011.0,2
1,76561197970982479,,"Posted July 15, 2011.",,22200,No ratings yet,True,It's unique and worth a playthrough.,2011-07-15,2011.0,2
2,76561197970982479,,"Posted April 21, 2011.",,43110,No ratings yet,True,Great atmosphere. The gunplay can be a bit chunky at times but at the end of the day this game i...,2011-04-21,2011.0,2
3,js41637,,"Posted June 24, 2014.",,251610,15 of 20 people (75%) found this review helpful,True,"I know what you think when you see this title ""Barbie Dreamhouse Party"" but do not be intimidate...",2014-06-24,2014.0,2
4,js41637,,"Posted September 8, 2013.",,227300,0 of 1 people (0%) found this review helpful,True,"For a simple (it's actually not all that simple but it can be!) truck driving Simulator, it is q...",2013-09-08,2013.0,0
...,...,...,...,...,...,...,...,...,...,...,...
52229,76561198310819422,1 person found this review funny,Posted June 23.,,570,1 of 1 people (100%) found this review helpful,True,Well Done,NaT,,1
52230,76561198312638244,,Posted July 21.,,233270,No ratings yet,True,this is a very fun and nice 80s themed shooter. im not good at spoiler free reviews so im just g...,NaT,,2
52231,76561198312638244,,Posted July 10.,,130,No ratings yet,True,if you liked Half life i would really recommend getting this expansion.im not gonna spoil anythi...,NaT,,2
52232,76561198312638244,,Posted July 10.,,70,No ratings yet,True,a must have classic from steam definitely worth buying.,NaT,,2


In [206]:
df_reviews = df_reviews[['user_id', 'item_id', 'helpful','recommend', 'posted_date','year', 'sentiment_analysis']]

In [207]:
df_reviews

Unnamed: 0,user_id,item_id,helpful,recommend,posted_date,year,sentiment_analysis
0,76561197970982479,1250,No ratings yet,True,2011-11-05,2011.0,2
1,76561197970982479,22200,No ratings yet,True,2011-07-15,2011.0,2
2,76561197970982479,43110,No ratings yet,True,2011-04-21,2011.0,2
3,js41637,251610,15 of 20 people (75%) found this review helpful,True,2014-06-24,2014.0,2
4,js41637,227300,0 of 1 people (0%) found this review helpful,True,2013-09-08,2013.0,0
...,...,...,...,...,...,...,...
52229,76561198310819422,570,1 of 1 people (100%) found this review helpful,True,NaT,,1
52230,76561198312638244,233270,No ratings yet,True,NaT,,2
52231,76561198312638244,130,No ratings yet,True,NaT,,2
52232,76561198312638244,70,No ratings yet,True,NaT,,2


In [208]:
df_reviews.to_csv('data/df_reviews.csv',index=False)
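
CSV does not store dtypes, so `posted_date` is read back as plain text unless it is parsed on load. A minimal sketch of the round trip, assuming a hypothetical `sample` frame and an in-memory buffer in place of `data/df_reviews.csv`:

```python
import io

import pandas as pd

# Hypothetical stand-in for the exported table
sample = pd.DataFrame({
    'user_id': ['a', 'b'],
    'posted_date': pd.to_datetime(['2011-11-05', None]),
})

# Write to an in-memory buffer instead of a file on disk
buffer = io.StringIO()
sample.to_csv(buffer, index=False)
buffer.seek(0)

# parse_dates restores the datetime dtype; empty cells become NaT
reloaded = pd.read_csv(buffer, parse_dates=['posted_date'])
print(reloaded.dtypes)
```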

----