***EDA***
We will perform an exploratory data analysis with the pandas library, inspecting the contents of the JSON file.

In [1]:
### Import the pandas library
import pandas as pd

### Read the JSON file
file = './australian_user_reviews.json'
### Define the df
df = pd.read_json(file)


In [2]:
### View the first rows of the dataset
df.head()


Unnamed: 0,user_id,user_url,reviews
0,76561197970982479,http://steamcommunity.com/profiles/76561197970...,"[{'funny': '', 'posted': 'Posted November 5, 2..."
1,js41637,http://steamcommunity.com/id/js41637,"[{'funny': '', 'posted': 'Posted June 24, 2014..."
2,evcentric,http://steamcommunity.com/id/evcentric,"[{'funny': '', 'posted': 'Posted February 3.',..."
3,doctr,http://steamcommunity.com/id/doctr,"[{'funny': '', 'posted': 'Posted October 14, 2..."
4,maplemage,http://steamcommunity.com/id/maplemage,"[{'funny': '3 people found this review funny',..."


In [3]:
### Take a general look at how the df is structured
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 25799 entries, 0 to 25798
Data columns (total 3 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   user_id   25799 non-null  object
 1   user_url  25799 non-null  object
 2   reviews   25799 non-null  object
dtypes: object(3)
memory usage: 604.8+ KB


This concludes the general analysis of the **australian_user_reviews** file, which has 3 columns:

***user_id***: the unique ID of each player.

***user_url***: the URL of each player's profile; it adds no analytical value and will be dropped in the ETL.

***reviews***: a nested column holding JSON-like objects (a list of dictionaries per user); it must be unpacked to get a cleaner view of its contents.
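
Before unpacking a nested column like `reviews`, it can help to check how many dictionaries each user holds and which keys they contain. The sketch below does this on a small made-up frame (`toy` and all of its values are hypothetical, not taken from the dataset):

```python
import pandas as pd

# Toy frame mimicking the layout described above (values are made up).
toy = pd.DataFrame({
    "user_id": ["user_a", "user_b", "user_c"],
    "reviews": [
        [{"item_id": "10", "recommend": True, "review": "fun"},
         {"item_id": "20", "recommend": False, "review": "buggy"}],
        [{"item_id": "10", "recommend": True, "review": "classic"}],
        [],  # a user with no reviews
    ],
})

# Number of review dictionaries stored per user.
reviews_per_user = toy["reviews"].apply(len)

# Union of the keys found across all nested dictionaries.
review_keys = sorted(
    {key for reviews in toy["reviews"] for review in reviews for key in review}
)

print(reviews_per_user.tolist())
print(review_keys)
```

If some users hold more than one dictionary, an extraction that reads only the first element of each list will leave the remaining reviews out.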

### ***ETL***

In [4]:

### Function to extract information from a dictionary in the 'reviews' list
def extract_review_data(review_list):
    
    if not review_list:
        return pd.Series([None] * 7)
    
    # Extract data from the first dictionary in the list only
    review_dict = review_list[0]
    funny = review_dict.get('funny', '')
    posted = review_dict.get('posted', '').replace('Posted ', '')  # Remove the "Posted " prefix
    last_edited = review_dict.get('last_edited', '')
    item_id = review_dict.get('item_id', '')
    helpful = review_dict.get('helpful', '')
    recommend = review_dict.get('recommend', False)
    review = review_dict.get('review', '')

    return pd.Series([funny, posted, last_edited, item_id, helpful, recommend, review])

# Apply the function and create the new columns
columns = ['funny', 'posted', 'last_edited', 'item_id', 'helpful', 'recommend', 'review_content']
df[columns] = df['reviews'].apply(extract_review_data)
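
The function above keeps only the first review of each user. As an alternative sketch (not the approach used in this notebook), `DataFrame.explode` followed by `pd.json_normalize` yields one row per review. The frame `toy` and its values are hypothetical:

```python
import pandas as pd

# Hypothetical data with the same nested shape as the 'reviews' column.
toy = pd.DataFrame({
    "user_id": ["user_a", "user_b", "user_c"],
    "reviews": [
        [{"item_id": "10", "recommend": True, "review": "fun"},
         {"item_id": "20", "recommend": False, "review": "buggy"}],
        [{"item_id": "10", "recommend": True, "review": "classic"}],
        [],  # a user with no reviews
    ],
})

# One row per (user, review); an empty list becomes a single NaN row.
exploded = toy.explode("reviews", ignore_index=True)

# Drop users without reviews, then expand each dictionary into columns.
exploded = exploded.dropna(subset=["reviews"]).reset_index(drop=True)
details = pd.json_normalize(exploded["reviews"].tolist())

flat = pd.concat([exploded[["user_id"]], details], axis=1)
print(flat)
```

With this layout the row count equals the number of reviews rather than the number of users, so downstream counts would differ from the ones shown in this notebook.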


In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 25799 entries, 0 to 25798
Data columns (total 10 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   user_id         25799 non-null  object
 1   user_url        25799 non-null  object
 2   reviews         25799 non-null  object
 3   funny           25771 non-null  object
 4   posted          25771 non-null  object
 5   last_edited     25771 non-null  object
 6   item_id         25771 non-null  object
 7   helpful         25771 non-null  object
 8   recommend       25771 non-null  object
 9   review_content  25771 non-null  object
dtypes: object(10)
memory usage: 2.0+ MB


In [7]:
df.head(100)

Unnamed: 0,user_id,user_url,reviews,funny,posted,last_edited,item_id,helpful,recommend,review_content
0,76561197970982479,http://steamcommunity.com/profiles/76561197970...,"[{'funny': '', 'posted': 'Posted November 5, 2...",,"November 5, 2011.",,1250,No ratings yet,True,Simple yet with great replayability. In my opi...
1,js41637,http://steamcommunity.com/id/js41637,"[{'funny': '', 'posted': 'Posted June 24, 2014...",,"June 24, 2014.",,251610,15 of 20 people (75%) found this review helpful,True,I know what you think when you see this title ...
2,evcentric,http://steamcommunity.com/id/evcentric,"[{'funny': '', 'posted': 'Posted February 3.',...",,February 3.,,248820,No ratings yet,True,A suitably punishing roguelike platformer. Wi...
3,doctr,http://steamcommunity.com/id/doctr,"[{'funny': '', 'posted': 'Posted October 14, 2...",,"October 14, 2013.",,250320,2 of 2 people (100%) found this review helpful,True,This game... is so fun. The fight sequences ha...
4,maplemage,http://steamcommunity.com/id/maplemage,"[{'funny': '3 people found this review funny',...",3 people found this review funny,"April 15, 2014.",,211420,35 of 43 people (81%) found this review helpful,True,Git gud
...,...,...,...,...,...,...,...,...,...,...
95,76561198090715178,http://steamcommunity.com/profiles/76561198090...,"[{'funny': '', 'posted': 'Posted March 28, 201...",,"March 28, 2014.",,4000,1 of 1 people (100%) found this review helpful,True,"Nice serious nice game ,"
96,CatsWhale,http://steamcommunity.com/id/CatsWhale,"[{'funny': '', 'posted': 'Posted February 23, ...",,"February 23, 2015.",,333950,1 of 2 people (50%) found this review helpful,True,the structural integrity says it all if you li...
97,76561198073784601,http://steamcommunity.com/profiles/76561198073...,"[{'funny': '', 'posted': 'Posted March 26, 201...",,"March 26, 2015.",,221100,3 of 6 people (50%) found this review helpful,True,"love the game, just needs to work more on bug ..."
98,mimimomoma,http://steamcommunity.com/id/mimimomoma,"[{'funny': '', 'posted': 'Posted November 23, ...",,"November 23, 2015.",,730,No ratings yet,True,"Tried this game once, its alright."


In [8]:
### Clean and organize the data to make the analysis easier.

# Drop unnecessary columns
df.drop(['user_url', 'reviews'], axis=1, inplace=True)

# Convert 'item_id' to a numeric type (where applicable)
df['item_id'] = pd.to_numeric(df['item_id'], errors='coerce')

# Handle null values
df.fillna('Unknown', inplace=True)  # or df.dropna(inplace=True)

# Normalize the text in 'review_content'
df['review_content'] = df['review_content'].str.lower()

# Remove the trailing period from each date in the 'posted' column
df['posted'] = df['posted'].str.rstrip('.')

# Convert 'posted' to a datetime format
df['posted'] = pd.to_datetime(df['posted'], errors='coerce', format='%B %d, %Y')  # Adjust the format as needed
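
Filling the whole frame with the string `'Unknown'` also puts that string into numeric columns, which is consistent with the next `df.info()` output reporting `item_id` as `object`. A possible alternative, sketched here on made-up data (`toy` is hypothetical), is to fill only the text columns and keep missing IDs in a nullable integer dtype:

```python
import pandas as pd

# Hypothetical sample with a bad ID and missing values.
toy = pd.DataFrame({
    "item_id": ["10", "not-a-number", None],
    "review_content": ["Great", None, "Meh"],
})

# Unparseable or missing IDs become <NA>; the column stays numeric.
toy["item_id"] = pd.to_numeric(toy["item_id"], errors="coerce").astype("Int64")

# Only the text column receives the 'Unknown' placeholder.
toy["review_content"] = toy["review_content"].fillna("Unknown").str.lower()

print(toy.dtypes)
print(toy)
```

Keeping `item_id` numeric means later grouping and sorting operate on numbers instead of a mix of floats and strings.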







In [9]:
df.info()
df.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 25799 entries, 0 to 25798
Data columns (total 8 columns):
 #   Column          Non-Null Count  Dtype         
---  ------          --------------  -----         
 0   user_id         25799 non-null  object        
 1   funny           25799 non-null  object        
 2   posted          21069 non-null  datetime64[ns]
 3   last_edited     25799 non-null  object        
 4   item_id         25799 non-null  object        
 5   helpful         25799 non-null  object        
 6   recommend       25799 non-null  object        
 7   review_content  25799 non-null  object        
dtypes: datetime64[ns](1), object(7)
memory usage: 1.6+ MB


Unnamed: 0,user_id,funny,posted,last_edited,item_id,helpful,recommend,review_content
0,76561197970982479,,2011-11-05,,1250.0,No ratings yet,True,simple yet with great replayability. in my opi...
1,js41637,,2014-06-24,,251610.0,15 of 20 people (75%) found this review helpful,True,i know what you think when you see this title ...
2,evcentric,,NaT,,248820.0,No ratings yet,True,a suitably punishing roguelike platformer. wi...
3,doctr,,2013-10-14,,250320.0,2 of 2 people (100%) found this review helpful,True,this game... is so fun. the fight sequences ha...
4,maplemage,3 people found this review funny,2014-04-15,,211420.0,35 of 43 people (81%) found this review helpful,True,git gud


In [36]:
print(df['posted'].head())


0   2011-11-05
1   2014-06-24
2          NaT
3   2013-10-14
4   2014-04-15
Name: posted, dtype: datetime64[ns]


In [40]:
# Count how many NaT values the 'posted' column contains
conteo_nan = df['posted'].isna().sum()

print(conteo_nan)


4730
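
Part of these `NaT` values probably comes from dates that carry no year, such as `February 3.` in row 2, which cannot match the `'%B %d, %Y'` format. A minimal sketch for spotting such strings before parsing, using made-up values (the series `posted` below is hypothetical):

```python
import pandas as pd

# Hypothetical date strings, already stripped of "Posted " and the final period.
posted = pd.Series(["November 5, 2011", "February 3", "June 24, 2014"])

# True where the string contains a four-digit year.
has_year = posted.str.contains(r"\b\d{4}\b", regex=True)

# Strings that do not match the format are coerced to NaT.
parsed = pd.to_datetime(posted, format="%B %d, %Y", errors="coerce")

print(has_year.tolist())
print(parsed.isna().tolist())
```

Comparing `(~has_year).sum()` with the `NaT` count would show whether missing years account for all the failures or whether other formats are involved as well.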


In [13]:
# Create a new 'year' column by extracting the year from 'posted'
df['year'] = df['posted'].dt.year.astype('Int64')

# Check the results
print(df[['posted', 'year']].head())


      posted  year
0 2011-11-05  2011
1 2014-06-24  2014
2        NaT  <NA>
3 2013-10-14  2013
4 2014-04-15  2014


In [14]:
df.head()

Unnamed: 0,user_id,funny,posted,last_edited,item_id,helpful,recommend,review_content,year
0,76561197970982479,,2011-11-05,,1250.0,No ratings yet,True,simple yet with great replayability. in my opi...,2011.0
1,js41637,,2014-06-24,,251610.0,15 of 20 people (75%) found this review helpful,True,i know what you think when you see this title ...,2014.0
2,evcentric,,NaT,,248820.0,No ratings yet,True,a suitably punishing roguelike platformer. wi...,
3,doctr,,2013-10-14,,250320.0,2 of 2 people (100%) found this review helpful,True,this game... is so fun. the fight sequences ha...,2013.0
4,maplemage,3 people found this review funny,2014-04-15,,211420.0,35 of 43 people (81%) found this review helpful,True,git gud,2014.0


In [17]:
# Convert 'year' to float for the interpolation
df['year'] = df['year'].astype(float)

# Interpolate within each group; the index is reset afterwards to avoid alignment problems
def interpolate_group(group):
    group['year'] = group['year'].interpolate()
    return group

# Apply the function to each group
df = df.groupby('item_id').apply(interpolate_group).reset_index(drop=True)

# Check the results
print(df[['item_id', 'year']].head())



  item_id    year
0    10.0  2015.0
1    10.0  2015.0
2    10.0  2014.0
3    10.0  2015.0
4    10.0  2015.0
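
The same per-group interpolation can be sketched with `groupby(...).transform`, which returns a result aligned to the original index, so the rows keep their order and no index reset is needed. The frame `toy` and the column name `year_filled` are hypothetical, and `limit_direction="both"` is an added assumption so that a gap at the start of a group is filled too:

```python
import numpy as np
import pandas as pd

# Hypothetical sample: two items, each with one missing year.
toy = pd.DataFrame({
    "item_id": [10, 10, 10, 20, 20],
    "year": [2014.0, np.nan, 2016.0, np.nan, 2015.0],
})

# Linear interpolation inside each item_id group, aligned to the original rows.
toy["year_filled"] = (
    toy.groupby("item_id")["year"]
       .transform(lambda s: s.interpolate(limit_direction="both"))
)

print(toy)
```

Linear interpolation depends on row order and can produce fractional years when a gap sits between years that are not two apart, so rounding may be needed before treating the result as a calendar year.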


In [25]:
df.head(20)

Unnamed: 0,user_id,funny,posted,last_edited,item_id,helpful,recommend,review_content,year
0,Bennysaputra,,2015-08-01,,10.0,No ratings yet,True,cool game,2015.0
1,Monsta45,,2015-04-26,,10.0,No ratings yet,True,wallbang simulator,2015.0
2,76561198072207162,,2014-01-23,,10.0,No ratings yet,True,people still play this! siq game,2014.0
3,Monsta45,,2015-04-26,,10.0,No ratings yet,True,wallbang simulator,2015.0
4,shaman3soul3,,NaT,,10.0,No ratings yet,True,please put australian servers in cs 1.6 (espea...,2015.0
5,76561198015886143,,2015-06-14,,10.0,1 of 1 people (100%) found this review helpful,True,simplismente perfeito!,2015.0
6,lanatbeonakeehsasamokoshtan,,2014-12-26,,10.0,No ratings yet,True,best game ever in the world,2014.0
7,aman98,,2015-07-03,,10.0,6 of 8 people (75%) found this review helpful,True,best game ever,2015.0
8,76561198001699914,,2013-08-16,,10.0,No ratings yet,True,jueguenlooooooo,2013.0
9,brahhhhcsgomatee,,2015-08-24,,10.0,0 of 1 people (0%) found this review helpful,True,clickin heads since i was 12,2015.0


In [None]:
### Drop columns that are no longer needed, such as 'posted'
df.drop(['posted', 'last_edited', 'helpful'], axis=1, inplace=True)



In [31]:

df.drop(['funny'], axis=1, inplace=True)

In [33]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 25799 entries, 0 to 25798
Data columns (total 5 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   user_id         25799 non-null  object 
 1   item_id         25799 non-null  object 
 2   recommend       25799 non-null  object 
 3   review_content  25799 non-null  object 
 4   year            25799 non-null  float64
dtypes: float64(1), object(4)
memory usage: 1007.9+ KB


In [34]:
from textblob import TextBlob

# Function to analyze the sentiment of the reviews
def analyze_sentiment(text):
    # Build the TextBlob object
    analysis = TextBlob(text)
    # Get the polarity
    polarity = analysis.sentiment.polarity
    # Classify the sentiment
    if polarity > 0:
        return 2  # Positive
    elif polarity == 0:
        return 1  # Neutral
    else:
        return 0  # Negative

# Apply the sentiment analysis function to the review column
df['sentiment_analysis'] = df['review_content'].apply(analyze_sentiment)

# Check the results
print(df[['review_content', 'sentiment_analysis']].head())


                                      review_content  sentiment_analysis
0                                          cool game                   0
1                                 wallbang simulator                   1
2                   people still play this! siq game                   0
3                                 wallbang simulator                   1
4  please put australian servers in cs 1.6 (espea...                   2
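
The mapping from polarity to label can be isolated in a small helper so it can be checked without running TextBlob. This is only a sketch: `label_from_polarity` and its `eps` tolerance are hypothetical names, and treating a missing score as neutral is an assumption, not something the notebook does.

```python
import math

def label_from_polarity(polarity, eps=0.0):
    """Map a polarity score in [-1, 1] to 0 (negative), 1 (neutral) or 2 (positive)."""
    if polarity is None or math.isnan(polarity):
        return 1  # assumption: a missing score is treated as neutral
    if polarity > eps:
        return 2
    if polarity < -eps:
        return 0
    return 1

# With eps=0.0 the thresholds match the rule used in analyze_sentiment.
labels = [label_from_polarity(p) for p in (0.35, 0.0, -0.4)]
print(labels)

# A small tolerance sends weak scores to the neutral class.
print(label_from_polarity(-0.025, eps=0.05))
```

A tolerance above zero may be worth considering for short reviews, where a single word can push the polarity slightly below or above zero.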


In [35]:
df.info()
df.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 25799 entries, 0 to 25798
Data columns (total 6 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   user_id             25799 non-null  object 
 1   item_id             25799 non-null  object 
 2   recommend           25799 non-null  object 
 3   review_content      25799 non-null  object 
 4   year                25799 non-null  float64
 5   sentiment_analysis  25799 non-null  int64  
dtypes: float64(1), int64(1), object(4)
memory usage: 1.2+ MB


Unnamed: 0,user_id,item_id,recommend,review_content,year,sentiment_analysis
0,Bennysaputra,10.0,True,cool game,2015.0,0
1,Monsta45,10.0,True,wallbang simulator,2015.0,1
2,76561198072207162,10.0,True,people still play this! siq game,2014.0,0
3,Monsta45,10.0,True,wallbang simulator,2015.0,1
4,shaman3soul3,10.0,True,please put australian servers in cs 1.6 (espea...,2015.0,2


In [36]:
# Export the DataFrame to a CSV file
df.to_csv('', index=False)
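
As a small illustration of what `index=False` does, the sketch below writes a made-up frame to a temporary file and reads it back. The file name `reviews_clean.csv` and the frame `toy` are hypothetical and are not the path used by this notebook:

```python
import os
import tempfile

import pandas as pd

# Hypothetical frame with a few of the final columns.
toy = pd.DataFrame({
    "user_id": ["user_a", "user_b"],
    "item_id": [10, 20],
    "sentiment_analysis": [2, 0],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    out_path = os.path.join(tmp_dir, "reviews_clean.csv")  # hypothetical file name
    # index=False keeps the row index out of the file.
    toy.to_csv(out_path, index=False)
    restored = pd.read_csv(out_path)

print(restored.columns.tolist())
```

Without `index=False`, the file would gain an extra leading column that shows up as `Unnamed: 0` when read back.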


