# __ETL__ _(Extract, Transform, Load)_

## Introduction

This notebook focuses on the **ETL** process applied to data extracted from the Yelp and Google Maps platforms. The process involves _extracting, transforming, and loading_ the data to prepare it for later analysis. This step is crucial in any data science project because it determines the quality and usefulness of the data.

## Global Configuration and Imports

In this section, all the libraries and modules needed for the ETL (Extract, Transform, Load) process are imported, and global settings are applied where required. The following libraries and tools are used:

In [1]:
import warnings
warnings.filterwarnings("ignore") # Suppresses all warnings to keep the notebook output clean.

In [2]:
import os # Provides functions for interacting with the operating system.
import pandas as pd # A data analysis library.
import json # Used to work with data in JSON format.
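Ignoring every warning also hides deprecation notices that may matter later. As a minimal sketch (the `noisy` function is hypothetical and exists only for this illustration), a filter can be limited to one warning category so that other warnings still surface:

```python
import warnings


def noisy():
    """Hypothetical function that emits two kinds of warnings."""
    warnings.warn("deprecated call", DeprecationWarning)
    warnings.warn("something odd happened", UserWarning)
    return 42


with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")  # start from a known filter state
    # Silence only DeprecationWarning; everything else is still recorded.
    warnings.filterwarnings("ignore", category=DeprecationWarning)
    result = noisy()

remaining = [type(w.message).__name__ for w in caught]
print(remaining)
```

Using `catch_warnings` as a context manager also restores the previous filters on exit, so the suppression does not leak into later cells.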

# YELP

**Dataset:** USER.PARQUET

DATASET PATH DECLARATION

In [3]:
data_path = 'C:/Users/Usuario/Desktop/Proyecto Final/Yelp/user.parquet'
yelp_users = pd.read_parquet(data_path)
print(yelp_users.info()) # info() prints the summary itself and returns None, hence the trailing "None" below.
yelp_users.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2105597 entries, 0 to 2105596
Data columns (total 22 columns):
 #   Column              Dtype  
---  ------              -----  
 0   user_id             object 
 1   name                object 
 2   review_count        int64  
 3   yelping_since       object 
 4   useful              int64  
 5   funny               int64  
 6   cool                int64  
 7   elite               object 
 8   friends             object 
 9   fans                int64  
 10  average_stars       float64
 11  compliment_hot      int64  
 12  compliment_more     int64  
 13  compliment_profile  int64  
 14  compliment_cute     int64  
 15  compliment_list     int64  
 16  compliment_note     int64  
 17  compliment_plain    int64  
 18  compliment_cool     int64  
 19  compliment_funny    int64  
 20  compliment_writer   int64  
 21  compliment_photos   int64  
dtypes: float64(1), int64(16), object(5)
memory usage: 353.4+ MB
None


Unnamed: 0,user_id,name,review_count,yelping_since,useful,funny,cool,elite,friends,fans,...,compliment_more,compliment_profile,compliment_cute,compliment_list,compliment_note,compliment_plain,compliment_cool,compliment_funny,compliment_writer,compliment_photos
0,qVc8ODYU5SZjKXVBgXdI7w,Walker,585,2007-01-25 16:47:26,7217,1259,5994,2007,"NSCy54eWehBJyZdG2iE84w, pe42u7DcCH2QmI81NX-8qA...",267,...,65,55,56,18,232,844,467,467,239,180
1,j14WgRoU_-2ZE1aw1dXrJg,Daniel,4333,2009-01-25 04:35:42,43091,13066,27281,"2009,2010,2011,2012,2013,2014,2015,2016,2017,2...","ueRPE0CX75ePGMqOFVj6IQ, 52oH4DrRvzzl8wh5UXyU0A...",3138,...,264,184,157,251,1847,7054,3131,3131,1521,1946
2,2WnXYQFK0hXEoTxPtV2zvg,Steph,665,2008-07-25 10:41:00,2086,1010,1003,20092010201120122013,"LuO3Bn4f3rlhyHIaNfTlnA, j9B4XdHUhDfTKVecyWQgyA...",52,...,13,10,17,3,66,96,119,119,35,18
3,SZDeASXq7o05mMNLshsdIA,Gwen,224,2005-11-29 04:38:33,512,330,299,200920102011,"enx1vVPnfdNUdPho6PH_wg, 4wOcvMLtU6a9Lslggq74Vg...",28,...,4,1,6,2,12,16,26,26,10,9
4,hA5lMy-EnncsH4JoR-hFGQ,Karen,79,2007-01-05 19:40:59,29,15,7,,"PBK4q9KEEBHhFvSXCUirIw, 3FWPpM7KU1gXeOM_ZbYMbA...",1,...,1,0,0,0,1,1,0,0,0,0


In [4]:
# Count fully duplicated rows
yelp_users.duplicated().sum()

117700

In [5]:
# Drop duplicated rows
yelp_users.drop_duplicates(inplace=True)
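`duplicated()` without arguments flags only rows that repeat in every column. The sketch below, on made-up toy data, shows the difference between that check and a check on the key column alone. It is an illustration, not a claim about what `user.parquet` contains:

```python
import pandas as pd

# Toy data, invented for this illustration.
toy = pd.DataFrame({
    "user_id": ["a", "a", "b", "c", "c"],
    "review_count": [1, 1, 5, 2, 3],
})

# Rows identical in every column (what the cells above count and drop).
full_dupes = int(toy.duplicated().sum())

# Rows whose key repeats, even if other columns differ.
id_dupes = int(toy.duplicated(subset="user_id").sum())

deduped = toy.drop_duplicates().reset_index(drop=True)
print(full_dupes, id_dupes, len(deduped))
```

If `user_id` is meant to be a unique key, checking `deduped["user_id"].is_unique` after the drop is a cheap way to confirm it.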

In [6]:
# Convert 'yelping_since' to datetime
yelp_users['yelping_since'] = pd.to_datetime(yelp_users['yelping_since'])

In [7]:
# Sort by 'yelping_since'
yelp_users = yelp_users.sort_values('yelping_since')
yelp_users.reset_index(drop=True, inplace=True)
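The conversion and sort above can be sketched on toy data with an explicit format string. The format is inferred from the sample values shown earlier, and the malformed entry is invented to show what `errors="coerce"` does:

```python
import pandas as pd

# Toy data; "not a date" is a made-up bad value for illustration.
toy = pd.DataFrame({
    "user_id": ["u1", "u2", "u3"],
    "yelping_since": [
        "2007-01-25 16:47:26",
        "not a date",
        "2005-11-29 04:38:33",
    ],
})

# An explicit format avoids per-value inference; unparseable values become NaT.
toy["yelping_since"] = pd.to_datetime(
    toy["yelping_since"], format="%Y-%m-%d %H:%M:%S", errors="coerce"
)
n_bad = int(toy["yelping_since"].isna().sum())

# ignore_index=True sorts and renumbers the index in one step; NaT sorts last.
toy = toy.sort_values("yelping_since", ignore_index=True)
print(n_bad)
print(toy)
```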

In [8]:
# Show the first rows of the DataFrame
yelp_users.head()

Unnamed: 0,user_id,name,review_count,yelping_since,useful,funny,cool,elite,friends,fans,...,compliment_more,compliment_profile,compliment_cute,compliment_list,compliment_note,compliment_plain,compliment_cool,compliment_funny,compliment_writer,compliment_photos
0,fFGPBtsutYpn3A155Sf75Q,Brandon,194,2004-10-12 08:46:11,250,103,121,20062007,"u-BjshHIamkWw4lBOsyscw, vGtIy5jDIqNS4oT0w-mkfw...",26,...,1,3,11,0,6,11,21,21,1,3
1,nkN_do3fJ9xekchVC-v68A,Jeremy,1366,2004-10-12 08:46:43,18524,10049,15141,"2006,2007,2008,2009,2010,2011,2012,2013,2014,2...","5HJvYcAM6FLat695V_JF1A, m_y6jQ5AeVpXfTSc9c_LEQ...",2107,...,178,152,249,45,890,1254,2150,2150,473,581
2,wqoXYLWmpkEH0YvTmHBsJQ,Michael,398,2004-10-12 08:51:07,1393,734,662,2006200720082009201020112012201320142015,"g-mL-8J1_9iuDLUhyTzvmg, mjdYqMhqlNi3qz9d2NWgLQ...",156,...,22,13,24,4,90,98,168,168,27,33
3,co_jK_x-CvK2Z3ZrJLz1SQ,j,12,2004-10-12 09:36:53,21,3,4,,"KSoGqlg42blFfukIMfZd_w, OxAPD1OFV9qgS6EmbbHhtw...",0,...,0,0,0,0,1,0,0,0,0,0
4,23J4vG9_xxxdnmi8CBX7Ng,Joan,1674,2004-10-12 12:29:35,21509,15514,19972,"2006,2007,2008,2009,2010,2011,2012,2013,2014,2...","PNa2-EjHe_ApIgZXD6kxBg, _BHTC7nyCBoZcfiiD5cOXg...",1425,...,292,290,342,98,1480,4961,5126,5126,1904,3291


In [9]:
# Show the first rows of "elite"
yelp_users.elite.head()

0                                            2006,2007
1    2006,2007,2008,2009,2010,2011,2012,2013,2014,2...
2    2006,2007,2008,2009,2010,2011,2012,2013,2014,2015
3                                                     
4    2006,2007,2008,2009,2010,2011,2012,2013,2014,2...
Name: elite, dtype: object

In [10]:
# Normalize 'elite': one row per (user, elite year)
elite = yelp_users['elite'].str.split(',', expand=True).stack().reset_index(level=-1, drop=True)
elite = elite.to_frame('elite_year')
elite['user_id'] = yelp_users['user_id']

In [11]:
# Replace blank strings with NaN
elite['elite_year'] = elite['elite_year'].replace('', pd.NA)

In [12]:
# Drop the NaN values
elite = elite.dropna(subset=['elite_year'])

In [13]:
# Sort the DataFrame by 'elite_year'
elite = elite.sort_values(['elite_year'])
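The same normalization (split, one row per year, drop blanks, sort) can also be written with `Series.str.split` and `DataFrame.explode`. This is an alternative sketch on toy data, not the method used in the cells above:

```python
import pandas as pd

# Toy data mirroring the shape of the 'elite' column.
toy = pd.DataFrame({
    "user_id": ["u1", "u2", "u3"],
    "elite": ["2006,2007", "", "2009"],
})

# Split into lists, then emit one row per list element.
elite_toy = (
    toy.assign(elite_year=toy["elite"].str.split(","))
       .explode("elite_year")[["elite_year", "user_id"]]
)

# An empty 'elite' string splits into [''], so blanks are filtered out here.
elite_toy = (
    elite_toy[elite_toy["elite_year"] != ""]
    .sort_values("elite_year")
    .reset_index(drop=True)
)
print(elite_toy)
```

Because `explode` keeps the other columns of each row, `user_id` travels with every year and no index-aligned assignment is needed afterwards.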

In [14]:
# Show the unique values of 'elite_year'
elite.elite_year.unique()

array(['20', '2006', '2007', '2008', '2009', '2010', '2011', '2012',
       '2013', '2014', '2015', '2016', '2017', '2018', '2019', '2021'],
      dtype=object)

In [15]:
# Count rows per year in "elite"
elite.elite_year.value_counts()

elite_year
20      79858
2021    44542
2019    44044
2018    41009
2017    36015
2016    29636
2015    24175
2014    18571
2013    16193
2012    15222
2011    10997
2010     8772
2009     5479
2008     3185
2007     2023
2006      775
Name: count, dtype: int64

In [16]:
# Filter the elite DataFrame to exclude rows where 'elite_year' is '20'
elite = elite[elite['elite_year'] != '20']

# Reset the index of the resulting DataFrame and discard the old one
elite.reset_index(drop=True, inplace=True)

# Show the resulting elite DataFrame
elite

Unnamed: 0,elite_year,user_id
0,2006,fFGPBtsutYpn3A155Sf75Q
1,2006,t9mXmz8cPrCXqTz6tOlveQ
2,2006,S3gNrUh7N9oWoackKlHGMQ
3,2006,H4JNrBAoyCk_ZMZWbAf8OA
4,2006,3XxsH5vS3yJDnYLxSnRu3A
...,...,...
300633,2021,MdGnY6eLxy82nEZhCc6rnQ
300634,2021,grY9VXMDbjzrKiFmKdXPVg
300635,2021,qMhIZuS7OC3v-IN7kU4m2w
300636,2021,6o3CiQzYpLse_kJx2qYfSw
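The unique values jump from 2019 to 2021, and `'20'` is the most frequent token (79858, an even number). One possible explanation, which this notebook does not verify, is that the raw strings store 2020 as `20,20`, so each affected user contributes two `'20'` tokens. If inspecting the raw `elite` strings confirmed that, dropping `'20'` would discard the 2020 elite year. A minimal sketch of a repair applied before splitting, on invented toy strings:

```python
import pandas as pd

# Toy strings; the "20,20" pattern is an ASSUMPTION about the raw data.
raw = pd.Series(["2019,20,20,2021", "20,20", "2018", ""])

# Replace "20,20" only when it spans two whole comma-delimited tokens,
# so a correctly stored "2020,2021" would be left untouched.
pattern = r"(?<![^,])20,20(?![^,])"
fixed = raw.str.replace(pattern, "2020", regex=True)

years = fixed.str.split(",").explode().reset_index(drop=True)
years = years[years != ""]
print(years.value_counts().sort_index())
```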


In [17]:
# Show the summary of "elite"
elite.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 300638 entries, 0 to 300637
Data columns (total 2 columns):
 #   Column      Non-Null Count   Dtype 
---  ------      --------------   ----- 
 0   elite_year  300638 non-null  object
 1   user_id     300638 non-null  object
dtypes: object(2)
memory usage: 4.6+ MB


In [18]:
# Drop the 'elite' column
yelp_users = yelp_users.drop(columns='elite')

yelp_users.columns

Index(['user_id', 'name', 'review_count', 'yelping_since', 'useful', 'funny',
       'cool', 'friends', 'fans', 'average_stars', 'compliment_hot',
       'compliment_more', 'compliment_profile', 'compliment_cute',
       'compliment_list', 'compliment_note', 'compliment_plain',
       'compliment_cool', 'compliment_funny', 'compliment_writer',
       'compliment_photos'],
      dtype='object')

In [19]:
# Split the 'friends' string at index 1987896 on commas
sample_friends = yelp_users.friends[1987896].split(',')

# Compute the length of the resulting list
len(sample_friends)

1

In [20]:
# Check for null values in the 'yelp_users' DataFrame and count them per column
yelp_users.isnull().sum()

user_id               0
name                  0
review_count          0
yelping_since         0
useful                0
funny                 0
cool                  0
friends               0
fans                  0
average_stars         0
compliment_hot        0
compliment_more       0
compliment_profile    0
compliment_cute       0
compliment_list       0
compliment_note       0
compliment_plain      0
compliment_cool       0
compliment_funny      0
compliment_writer     0
compliment_photos     0
dtype: int64

In [21]:
# Apply a lambda to each value in the 'friends' column of 'yelp_users'.
# The lambda splits each string on commas and returns the length of the resulting list,
# so the original strings are replaced by friend counts.
yelp_users['friends'] = yelp_users['friends'].apply(lambda x: len(x.split(',')))

# Show the updated 'friends' column
yelp_users['friends']

0           245
1          6849
2          2454
3             7
4          3926
           ... 
1987892       1
1987893       1
1987894       1
1987895       1
1987896       1
Name: friends, Length: 1987897, dtype: int64
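Note that `len(x.split(','))` returns 1 for any string without a comma, including an empty string or a textual placeholder, so a count of 1 does not necessarily mean one friend. The sketch below treats `""` and `"None"` as "no friends". Both placeholders are assumptions and should be checked against the raw column before use:

```python
import pandas as pd


def count_friends(value, placeholders=("", "None")):
    """Count comma-separated IDs; the placeholder values are an assumption."""
    text = value.strip()
    if text in placeholders:
        return 0
    # Ignore empty fragments such as those left by a trailing comma.
    return len([part for part in text.split(",") if part.strip()])


# Toy data, invented for this illustration.
toy = pd.Series(["a, b, c", "None", "", "solo"])
counts = toy.apply(count_friends)
print(counts.tolist())
```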

In [22]:
yelp_users.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1987897 entries, 0 to 1987896
Data columns (total 21 columns):
 #   Column              Dtype         
---  ------              -----         
 0   user_id             object        
 1   name                object        
 2   review_count        int64         
 3   yelping_since       datetime64[ns]
 4   useful              int64         
 5   funny               int64         
 6   cool                int64         
 7   friends             int64         
 8   fans                int64         
 9   average_stars       float64       
 10  compliment_hot      int64         
 11  compliment_more     int64         
 12  compliment_profile  int64         
 13  compliment_cute     int64         
 14  compliment_list     int64         
 15  compliment_note     int64         
 16  compliment_plain    int64         
 17  compliment_cool     int64         
 18  compliment_funny    int64         
 19  compliment_writer   int64         
 20  co

## Load: Writing the Output File

In [23]:
ruta_nuevo_json = "C:/Users/Usuario/Desktop/Proyecto Final/PF_Google_yelp_Map/Notebook/User_limpio.json"
yelp_users.to_json(ruta_nuevo_json, orient='records', lines=True)
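With `orient='records'` and `lines=True`, each row is written as one JSON object per line. A quick round-trip on toy data, using a temporary file with a made-up name, is one way to sanity-check the export format:

```python
import os
import tempfile

import pandas as pd

# Toy data; column names follow the notebook, values are invented.
toy = pd.DataFrame({
    "user_id": ["u1", "u2"],
    "friends": [3, 0],
    "average_stars": [4.5, 3.0],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    out_path = os.path.join(tmp_dir, "toy_users.json")  # hypothetical file name
    toy.to_json(out_path, orient="records", lines=True)

    # One non-empty line per row is expected in JSON Lines output.
    with open(out_path, encoding="utf-8") as fh:
        n_lines = sum(1 for line in fh if line.strip())

    restored = pd.read_json(out_path, lines=True)

print(n_lines, restored.shape)
```

By default `to_json` writes datetime columns such as `yelping_since` as epoch milliseconds; passing `date_format="iso"` writes ISO 8601 strings instead, if that suits the downstream consumer better.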