# __ETL__ _(Extract, Transform, Load)_

## Introduction

This notebook covers the **ETL** process for data extracted from the Yelp and Google Maps platforms. The process consists of _extracting, transforming, and loading_ the data to prepare it for later analysis. This step is crucial in any data science project because it ensures the quality and usefulness of the data.

## Global Settings and Imports

This section imports the libraries and modules needed for the ETL (Extract, Transform, Load) process and sets any global configuration that is required. The following libraries and tools are used:

In [1]:
import warnings
warnings.filterwarnings("ignore") # Suppresses all warnings to keep the notebook output clean.
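
Silencing every warning for the whole session can also hide useful deprecation notices. As a hedged alternative, the sketch below limits suppression to a single block with the standard library's `warnings.catch_warnings()`. The function `noisy_step` is a hypothetical stand-in for any call that emits a warning; it is not part of this notebook.

```python
import warnings


def noisy_step():
    # Hypothetical helper that emits a warning, standing in for a library call.
    warnings.warn("this API is deprecated", DeprecationWarning)
    return 42


# Suppress warnings only inside this block; the previous filters are restored on exit.
with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    result = noisy_step()

# Outside the suppressed block, record warnings to confirm they are visible again.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    noisy_step()

print(result, len(caught))
```

With this pattern, warnings raised elsewhere in the notebook still reach the output.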

In [2]:
import os # Provides functions for interacting with the operating system.
import pandas as pd # Data analysis library.
import json # Used to work with data in JSON format.

# YELP

**Dataset:** BUSINESS

DECLARING THE DATASET PATH

In [4]:
# Full local path to the business.pkl file
ruta_local_business = r"C:/Users/Usuario/Desktop/Proyecto Final/Yelp/business.pkl"

In [5]:
# Load the pickle file with pd.read_pickle
business_yelp = pd.read_pickle(ruta_local_business)
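
A hard-coded absolute path only works on one machine. The sketch below shows one possible way to build the path with `pathlib` and fail early when the file is missing. It assumes a temporary directory and a tiny stand-in DataFrame so it can run anywhere; neither is part of the real dataset.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Assumption: a temporary directory stands in for the real data folder.
with tempfile.TemporaryDirectory() as tmp_dir:
    pickle_path = Path(tmp_dir) / "Yelp" / "business.pkl"
    pickle_path.parent.mkdir(parents=True)

    # Write a tiny stand-in file so the example is self-contained.
    pd.DataFrame({"business_id": ["a1", "b2"], "stars": [4.5, 3.0]}).to_pickle(pickle_path)

    # Fail early with a clear message if the file is missing.
    if not pickle_path.exists():
        raise FileNotFoundError(f"Dataset not found: {pickle_path}")

    loaded = pd.read_pickle(pickle_path)

print(loaded.shape)
```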

In [6]:
# Show the first 5 rows of the DataFrame
business_yelp.head(5)

Unnamed: 0,business_id,name,address,city,state,postal_code,latitude,longitude,stars,review_count,...,state.1,postal_code.1,latitude.1,longitude.1,stars.1,review_count.1,is_open,attributes,categories,hours
0,Pns2l4eNsfO8kk83dixA6A,"Abby Rappoport, LAC, CMQ","1616 Chapala St, Ste 2",Santa Barbara,,93101,34.426679,-119.711197,5.0,7,...,,,,,,,,,,
1,mpf3x-BjTdTEA3yCZrAYPw,The UPS Store,87 Grasso Plaza Shopping Center,Affton,,63123,38.551126,-90.335695,3.0,15,...,,,,,,,,,,
2,tUFrWirKiKi_TAnsVWINQQ,Target,5255 E Broadway Blvd,Tucson,,85711,32.223236,-110.880452,3.5,22,...,,,,,,,,,,
3,MTSW4McQd7CbVtyjqoe9mw,St Honore Pastries,935 Race St,Philadelphia,CA,19107,39.955505,-75.155564,4.0,80,...,,,,,,,,,,
4,mWMc6_wTdE0EUBKIGXDVfA,Perkiomen Valley Brewery,101 Walnut St,Green Lane,MO,18054,40.338183,-75.471659,4.5,13,...,,,,,,,,,,


In [12]:
business_yelp.count()

business_id     150346
name            150346
address         150346
city            150346
state           150346
postal_code     150346
latitude        150346
longitude       150346
stars           150346
review_count    150346
is_open         150346
attributes      136602
categories      150243
hours           127123
dtype: int64

In [8]:
# Check for duplicated columns in the business_yelp DataFrame
columns = business_yelp.columns.duplicated()
columns

array([False, False, False, False, False, False, False, False, False,
       False, False, False, False, False,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True])

In [9]:
# Filter the business_yelp DataFrame to drop the duplicated columns
business_yelp = business_yelp.loc[:, ~columns]
business_yelp.head()

Unnamed: 0,business_id,name,address,city,state,postal_code,latitude,longitude,stars,review_count,is_open,attributes,categories,hours
0,Pns2l4eNsfO8kk83dixA6A,"Abby Rappoport, LAC, CMQ","1616 Chapala St, Ste 2",Santa Barbara,,93101,34.426679,-119.711197,5.0,7,0,{'ByAppointmentOnly': 'True'},"Doctors, Traditional Chinese Medicine, Naturop...",
1,mpf3x-BjTdTEA3yCZrAYPw,The UPS Store,87 Grasso Plaza Shopping Center,Affton,,63123,38.551126,-90.335695,3.0,15,1,{'BusinessAcceptsCreditCards': 'True'},"Shipping Centers, Local Services, Notaries, Ma...","{'Monday': '0:0-0:0', 'Tuesday': '8:0-18:30', ..."
2,tUFrWirKiKi_TAnsVWINQQ,Target,5255 E Broadway Blvd,Tucson,,85711,32.223236,-110.880452,3.5,22,0,"{'BikeParking': 'True', 'BusinessAcceptsCredit...","Department Stores, Shopping, Fashion, Home & G...","{'Monday': '8:0-22:0', 'Tuesday': '8:0-22:0', ..."
3,MTSW4McQd7CbVtyjqoe9mw,St Honore Pastries,935 Race St,Philadelphia,CA,19107,39.955505,-75.155564,4.0,80,1,"{'RestaurantsDelivery': 'False', 'OutdoorSeati...","Restaurants, Food, Bubble Tea, Coffee & Tea, B...","{'Monday': '7:0-20:0', 'Tuesday': '7:0-20:0', ..."
4,mWMc6_wTdE0EUBKIGXDVfA,Perkiomen Valley Brewery,101 Walnut St,Green Lane,MO,18054,40.338183,-75.471659,4.5,13,1,"{'BusinessAcceptsCreditCards': 'True', 'Wheelc...","Brewpubs, Breweries, Food","{'Wednesday': '14:0-22:0', 'Thursday': '16:0-2..."
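
The boolean-mask technique used above can be illustrated on a small frame. This is a minimal sketch with made-up data; the `toy` frame and its values are assumptions chosen only to show how `columns.duplicated()` and `loc[:, ~mask]` interact.

```python
import pandas as pd

# Toy frame with a repeated column label (made-up values).
toy = pd.DataFrame(
    [["a1", "CA", None], ["b2", "MO", None]],
    columns=["business_id", "state", "state"],
)

# duplicated() marks every repeat of a label after its first occurrence.
mask = toy.columns.duplicated()

# Keep only the first occurrence of each column label.
deduped = toy.loc[:, ~mask]

print(mask.tolist())
print(deduped.columns.tolist())
```

Because only the first occurrence of each label survives, it is worth checking that the first copy is the one holding the data before dropping the rest.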


In [11]:
# Replace the NaN values in 'state' with the correct values, inferred from 'address', 'city' and 'postal_code'
business_yelp.at[0, 'state'] = 'CA'
business_yelp.at[1, 'state'] = 'MO'
business_yelp.at[2, 'state'] = 'AZ'
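
Assigning values row by row works for a handful of rows but does not scale. As a hedged sketch, missing states could instead be filled from a lookup table. The `city_to_state` dictionary below is a hand-built assumption for illustration; a real lookup would more likely come from a postal-code reference table.

```python
import pandas as pd

# Toy data (made-up subset) with some missing states.
toy = pd.DataFrame({
    "city": ["Santa Barbara", "Affton", "Tucson", "Philadelphia"],
    "state": [None, None, None, "PA"],
})

# Assumption: a hand-built lookup used only for this illustration.
city_to_state = {"Santa Barbara": "CA", "Affton": "MO", "Tucson": "AZ"}

# Fill only the missing states; existing values are left untouched.
toy["state"] = toy["state"].fillna(toy["city"].map(city_to_state))

print(toy["state"].tolist())
```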

In [None]:
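# Convert 'yelping_since' to datetime; the guard skips this step when the column is absent,
# as it is in the business columns listed in this notebook
if 'yelping_since' in business_yelp.columns:
    business_yelp['yelping_since'] = pd.to_datetime(business_yelp['yelping_since'])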

In [13]:
def contar_nulos(dataframe):
    """Print the number of null values in each column of `dataframe`."""
    # Count the null values per column
    nulos_por_columna = dataframe.isnull().sum()

    print("Number of null values per column:\n", nulos_por_columna)

In [14]:
# How many nulls are there in each column?
contar_nulos(business_yelp)

Number of null values per column:
 business_id         0
name                0
address             0
city                0
state               0
postal_code         0
latitude            0
longitude           0
stars               0
review_count        0
is_open             0
attributes      13744
categories        103
hours           23223
dtype: int64
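
Raw counts are easier to judge next to the share of rows they represent. The sketch below is one possible extension that reports both; `null_summary` is a hypothetical helper, not a function from this notebook, and the `toy` frame is made-up data.

```python
import pandas as pd


def null_summary(dataframe: pd.DataFrame) -> pd.DataFrame:
    """Return null counts and percentages per column (hypothetical helper)."""
    counts = dataframe.isnull().sum()
    percent = (counts / len(dataframe) * 100).round(2)
    return pd.DataFrame({"nulls": counts, "percent": percent})


# Made-up data: three of the four 'hours' values are missing.
toy = pd.DataFrame({"name": ["a", "b", "c", "d"], "hours": [None, "8-5", None, None]})
summary = null_summary(toy)
print(summary)
```

Applied to the counts above, `attributes` and `hours` would come out at roughly 9% and 15% of the 150346 rows.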


In [15]:
# Access the 'categories' value of the row labeled 0 in the 'business_yelp' DataFrame
business_yelp.categories[0]

'Doctors, Traditional Chinese Medicine, Naturopathic/Holistic, Acupuncture, Health & Medical, Nutritionists'

In [16]:
# Split the strings in the 'categories' column on commas, expand them into rows, and strip leading spaces
business_yelp_categories = business_yelp.categories.str.split(',').explode().str.lstrip()

# Count the number of unique categories in 'business_yelp_categories'
business_yelp_categories.nunique()

1311
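
The split, explode, and strip chain can be followed step by step on a small Series. This is a minimal sketch with made-up category strings; it also uses `value_counts()` to show how often each category appears.

```python
import pandas as pd

# Made-up category strings, including one missing value.
toy = pd.Series([
    "Restaurants, Food, Coffee & Tea",
    "Food, Breweries",
    None,
])

# Split on commas, expand each list element into its own row, trim whitespace.
exploded = toy.str.split(",").explode().str.strip()

# value_counts() and nunique() both ignore missing values by default.
counts = exploded.value_counts()

print(exploded.nunique())
print(counts["Food"])
```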

In [17]:
# Normalize 'attributes' into long format: one row per (business_id, attribute_key, attribute_value)
attributes = pd.json_normalize(business_yelp['attributes'])
attributes['business_id'] = business_yelp['business_id']
attributes = attributes.melt(id_vars='business_id', var_name='attribute_key', value_name='attribute_value')
attributes.dropna(inplace=True)
attributes.reset_index(drop=True, inplace=True)
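
The wide-to-long reshaping above may be easier to see on two records. The sketch below uses made-up attribute dictionaries; the names `toy`, `wide`, and `long_df` are assumptions for illustration.

```python
import pandas as pd

# Made-up records with nested attribute dictionaries.
toy = pd.DataFrame({
    "business_id": ["a1", "b2"],
    "attributes": [
        {"BikeParking": "True", "WiFi": "free"},
        {"BikeParking": "False"},
    ],
})

# One column per attribute key; keys absent from a record become NaN.
wide = pd.json_normalize(toy["attributes"].tolist())

# Assign by position so the result does not depend on index alignment.
wide["business_id"] = toy["business_id"].to_numpy()

# Wide to long: one row per (business_id, attribute_key) pair.
long_df = wide.melt(id_vars="business_id", var_name="attribute_key", value_name="attribute_value")
long_df = long_df.dropna().reset_index(drop=True)

print(len(long_df))
```

Positional assignment via `to_numpy()` is a defensive choice here: `json_normalize` returns a fresh 0-based index, so label-based assignment only lines up when the source frame has the same index.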

In [18]:
attributes[attributes['attribute_value'] == 'None']

Unnamed: 0,business_id,attribute_key,attribute_value
774,m0JTpAD6Hf7AO71hmmqxIg,ByAppointmentOnly,
3702,Xypw6Dn6Mt1gCywme5OoUw,ByAppointmentOnly,
5941,Bo-AALoRsKeLqfJbcyzm8Q,ByAppointmentOnly,
6048,gMC-74chzFpSoGbKWWuElg,ByAppointmentOnly,
7545,fq1gweldy1FqSeazemzReA,ByAppointmentOnly,
...,...,...,...
1206477,7gEZO8zTIlJdGcWZMBGMsw,HairSpecializesIn,
1206517,Tqkhl0H83bXDyZJE6S66kQ,HairSpecializesIn,
1206537,LQMa64U__ryF2J_ArnAsUQ,HairSpecializesIn,
1206577,eAv4CxnFb19Fpyi9HaiytA,HairSpecializesIn,


In [19]:
attributes = attributes[attributes['attribute_value'] != 'None']
attributes.reset_index(drop=True, inplace=True)

In [21]:
attributes.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1193993 entries, 0 to 1193992
Data columns (total 3 columns):
 #   Column           Non-Null Count    Dtype 
---  ------           --------------    ----- 
 0   business_id      1193993 non-null  object
 1   attribute_key    1193993 non-null  object
 2   attribute_value  1193993 non-null  object
dtypes: object(3)
memory usage: 27.3+ MB


In [22]:
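# Normalize 'categories': one row per (business_id, category) pair
categories = business_yelp[['business_id', 'categories']].copy()
categories = categories.rename(columns={'categories': 'category'})
# Split the comma-separated string into a list, then expand it into one row per category
categories['category'] = categories['category'].str.split(',')
categories = categories.explode('category')
# Remove the whitespace left around each category by the split
categories['category'] = categories['category'].str.strip()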

In [23]:
categories[categories.category.isnull()]

Unnamed: 0,business_id,category
1917,SMYXOLPyM95JvZ-oqnsWUA,
2243,9ryVeDaaR-le3kiSayTGow,
3304,xT3J-SP5g49g2FjQfLEQfg,
3324,_obl2-rphXvtzP3y_ekV1Q,
4640,mKxCNYEoKt6d_1rXmvRwww,
...,...,...
144328,szluot9mpdIAnUDGi27__w,
145039,s54FBcv78I6QNjqznP9oKw,
146058,KYI2rHE3vTG_z9ddqhp58A,
148225,DCvA43gLeetay_qttR9ABQ,


In [24]:
categories.dropna(inplace=True)
categories.reset_index(drop=True, inplace=True)

In [25]:
categories.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 668592 entries, 0 to 668591
Data columns (total 2 columns):
 #   Column       Non-Null Count   Dtype 
---  ------       --------------   ----- 
 0   business_id  668592 non-null  object
 1   category     668592 non-null  object
dtypes: object(2)
memory usage: 10.2+ MB


In [26]:
# Normalize 'hours' into long format: one row per (business_id, day_of_week, opening_hours)
hours = pd.json_normalize(business_yelp['hours'])
hours['business_id'] = business_yelp['business_id']
hours = hours.melt(id_vars='business_id', var_name='day_of_week', value_name='opening_hours')
hours.dropna(inplace=True)
hours.reset_index(drop=True, inplace=True)

In [27]:
hours

Unnamed: 0,business_id,day_of_week,opening_hours
0,mpf3x-BjTdTEA3yCZrAYPw,Monday,0:0-0:0
1,tUFrWirKiKi_TAnsVWINQQ,Monday,8:0-22:0
2,MTSW4McQd7CbVtyjqoe9mw,Monday,7:0-20:0
3,CF33F8-E6oudUQ46HnavjQ,Monday,0:0-0:0
4,n_0UpQx1hsNbnPUSlodU8w,Monday,0:0-0:0
...,...,...,...
801010,2O2K6SXPWv56amqxCECd4w,Sunday,0:0-0:0
801011,hn9Toz3s-Ei3uZPt7esExA,Sunday,11:0-22:0
801012,IUQopTMmYQG-qRtBk-8QnA,Sunday,11:0-17:0
801013,c8GjPIOTGVmIemT7j5_SyQ,Sunday,0:0-16:0


In [28]:
hours.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 801015 entries, 0 to 801014
Data columns (total 3 columns):
 #   Column         Non-Null Count   Dtype 
---  ------         --------------   ----- 
 0   business_id    801015 non-null  object
 1   day_of_week    801015 non-null  object
 2   opening_hours  801015 non-null  object
dtypes: object(3)
memory usage: 18.3+ MB
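
The `opening_hours` strings follow an `H:M-H:M` pattern, which is awkward to compare or aggregate as text. As a hedged sketch, they could be split into numeric opening and closing times expressed in minutes after midnight. The column names `open_minutes` and `close_minutes` are assumptions, and the rows are made-up.

```python
import pandas as pd

# Made-up rows in the same shape as the 'hours' table.
toy = pd.DataFrame({
    "business_id": ["a1", "b2"],
    "day_of_week": ["Monday", "Sunday"],
    "opening_hours": ["8:0-22:0", "11:0-17:30"],
})

# Capture the four numeric parts of 'H:M-H:M' and convert them to integers.
parts = toy["opening_hours"].str.extract(r"(\d+):(\d+)-(\d+):(\d+)").astype(int)
parts.columns = ["open_h", "open_m", "close_h", "close_m"]

# Express both times as minutes after midnight.
toy["open_minutes"] = parts["open_h"] * 60 + parts["open_m"]
toy["close_minutes"] = parts["close_h"] * 60 + parts["close_m"]

print(toy[["open_minutes", "close_minutes"]].values.tolist())
```

This sketch does not interpret values such as `0:0-0:0`, which appear in the data; how to treat them is left as a separate decision.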


In [29]:
# Drop 'attributes', 'categories' and 'hours', which have already been normalized.
business = business_yelp.drop(columns=['attributes', 'categories', 'hours'])
business.columns

Index(['business_id', 'name', 'address', 'city', 'state', 'postal_code',
       'latitude', 'longitude', 'stars', 'review_count', 'is_open'],
      dtype='object')

In [30]:
business.info()

<class 'pandas.core.frame.DataFrame'>
Index: 150346 entries, 0 to 150345
Data columns (total 11 columns):
 #   Column        Non-Null Count   Dtype 
---  ------        --------------   ----- 
 0   business_id   150346 non-null  object
 1   name          150346 non-null  object
 2   address       150346 non-null  object
 3   city          150346 non-null  object
 4   state         150346 non-null  object
 5   postal_code   150346 non-null  object
 6   latitude      150346 non-null  object
 7   longitude     150346 non-null  object
 8   stars         150346 non-null  object
 9   review_count  150346 non-null  object
 10  is_open       150346 non-null  object
dtypes: object(11)
memory usage: 17.8+ MB
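
Every column above is stored as `object`, including the numeric ones. The sketch below shows one possible way to convert such columns with `pd.to_numeric`. It runs on made-up string values; the split into `float_cols` and `int_cols` is an assumption about which types suit each column.

```python
import pandas as pd

# Made-up values stored as strings, so every column starts as object dtype.
toy = pd.DataFrame({
    "latitude": ["34.426679", "38.551126"],
    "stars": ["5.0", "3.0"],
    "review_count": ["7", "15"],
    "is_open": ["0", "1"],
}, dtype=object)

float_cols = ["latitude", "stars"]
int_cols = ["review_count", "is_open"]

# errors="coerce" turns unparseable values into NaN instead of raising.
for col in float_cols:
    toy[col] = pd.to_numeric(toy[col], errors="coerce")

# The nullable Int64 dtype keeps integers even if some values are missing.
for col in int_cols:
    toy[col] = pd.to_numeric(toy[col], errors="coerce").astype("Int64")

print(toy.dtypes)
```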


## Load: Saving Our File

In [32]:
# Export the DataFrame as JSON Lines (one JSON record per line)
ruta_json = "C:/Users/Usuario/Desktop/Proyecto Final/PF_Google_yelp_Map/Notebook/business-limpio.json"
business_yelp.to_json(ruta_json, orient='records', lines=True)
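
A quick round trip can confirm that the exported file reads back as expected. This is a minimal sketch that writes a made-up frame to a temporary directory and reads it again with `pd.read_json(..., lines=True)`. The file name `business-clean.json` and the `toy` data are assumptions.

```python
import os
import tempfile

import pandas as pd

# Made-up frame standing in for the cleaned data.
toy = pd.DataFrame({"business_id": ["a1", "b2"], "stars": [4.5, 3.0]})

with tempfile.TemporaryDirectory() as tmp_dir:
    out_path = os.path.join(tmp_dir, "business-clean.json")

    # orient='records' with lines=True writes one JSON object per line.
    toy.to_json(out_path, orient="records", lines=True)

    # Count the non-empty lines: there should be one per row.
    with open(out_path, encoding="utf-8") as fh:
        line_count = sum(1 for line in fh if line.strip())

    # Read the file back with the matching options.
    restored = pd.read_json(out_path, orient="records", lines=True)

print(line_count, restored.shape)
```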