# __ETL__ _(Extract, Transform, Load)_

## Introduction

This notebook covers the **ETL** process for data extracted from the Yelp and Google Maps platforms. The process consists of _extracting, transforming, and loading_ the data so that it is ready for later analysis. This step is crucial in any data science project because it safeguards the quality and usefulness of the data.

## Global Configuration and Imports

In this section we install and import every library and module that the ETL (Extract, Transform, Load) process needs, and set global configuration where required. The following libraries and tools are used:

In [1]:
# Connect Google Colaboratory to Google Drive
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


We install Spark (through PySpark) to handle large volumes of data.

In [2]:
# Install pyspark in Google Colaboratory
!pip install pyspark

Collecting pyspark
  Downloading pyspark-3.5.1.tar.gz (317.0 MB)
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: pyspark
  Building wheel for pyspark (setup.py) ... done
  Created wheel for pyspark: filename=pyspark-3.5.1-py2.py3-none-any.whl size=317488493 sha256=67690606e9fd8c4732203f7da33a0de825a528e7119c9080cb5cabb2e6497b24
  Stored in directory: /root/.cache/pip/wheels/80/1d/60/2c256ed38dddce2fdd93be545214a63e02fbd8d74fb0b7f3a6
Successfully built pyspark
Installing collected packages: pyspark
Successfully installed pyspark-3.5.1


## Importing the Required Libraries

In [3]:
import re
import os # Functions for interacting with the operating system.
import requests # HTTP requests.
import pandas as pd # Data analysis library.
import seaborn as sns # Data visualization.
import pyspark.pandas as ps # pandas-style DataFrame interface on top of Spark.
import json # Working with data in JSON format.
from pyspark.sql import SparkSession # Creates the SparkSession, the main entry point for Spark SQL.
from pyspark.sql import functions as F # Functions for working with Spark DataFrames.
from pyspark.sql.functions import array_contains # Tests whether an array column contains a given value.
from pyspark.sql.functions import sum, col # sum aggregates a column (it shadows Python's built-in sum); col references a column by name.
from pyspark.sql.functions import split, substring, concat_ws # Splitting, slicing, and joining strings.
from pyspark.sql.functions import expr, regexp_replace, when # SQL expressions, regex replacement, conditional values.
from pyspark.sql.types import StringType



In [4]:
import warnings
warnings.filterwarnings("ignore")

In [5]:
from pyspark.sql import SparkSession
spark = SparkSession.builder\
        .master("local")\
        .appName("Preliminary EDA")\
        .config('spark.ui.port', '4050')\
        .getOrCreate()
spark

In [6]:
# Load the files in the metadata-sitios folder and combine them into a single DataFrame
metadata_df = []

for i in range(1, 12):
    archivo = spark.read.json(f"/content/drive/MyDrive/Colab-Notebooks/metadata-sitios/{i}.json")
    # Cast MISC to string before the union
    archivo = archivo.withColumn("MISC", col("MISC").cast("string"))
    metadata_df.append(archivo)

# Union the per-file DataFrames by column name
df_final = metadata_df[0]
for dataframe in metadata_df[1:]:
    df_final = df_final.unionByName(dataframe)

metadata_df = df_final

In [7]:
# Show the DataFrame
metadata_df.show()

+--------------------+--------------------+----------+--------------------+--------------------+--------------------+--------------------+------------------+-------------------+--------------------+--------------+-----+--------------------+--------------------+--------------------+
|                MISC|             address|avg_rating|            category|         description|             gmap_id|               hours|          latitude|          longitude|                name|num_of_reviews|price|    relative_results|               state|                 url|
+--------------------+--------------------+----------+--------------------+--------------------+--------------------+--------------------+------------------+-------------------+--------------------+--------------+-----+--------------------+--------------------+--------------------+
|{[Wheelchair acce...|Porter Pharmacy, ...|       4.9|          [Pharmacy]|                NULL|0x88f16e41928ff68...|[[Friday, 8AM–6PM...|           32

### **LOCATING THE CLIENT IN OUR DATASET**

In [8]:
cliente_sgambatis = metadata_df.filter(col('name').like('%Sgambati%'))
cliente_sgambatis.show()


+--------------------+--------------------+----------+--------------------+--------------------+--------------------+--------------------+------------------+------------------+--------------------+--------------+-----+--------------------+--------------------+--------------------+
|                MISC|             address|avg_rating|            category|         description|             gmap_id|               hours|          latitude|         longitude|                name|num_of_reviews|price|    relative_results|               state|                 url|
+--------------------+--------------------+----------+--------------------+--------------------+--------------------+--------------------+------------------+------------------+--------------------+--------------+-----+--------------------+--------------------+--------------------+
|{[Wheelchair acce...|Sgambati's New Yo...|       3.4|[Pizza restaurant...|                NULL|0x8802e40e3b00b01...|[[Thursday, 9:30A...|         44.4650

We notice that our client's rows have columns with **NULL** values, which could cause significant losses when the file is cleaned.

For this reason, we look for the empty values and transform them so that the client is not lost anywhere in the process.

In [9]:
# Filter the rows by name and select the description
sgambatis_description = metadata_df.filter(col('name').like('%Sgambati%')).select('description')

# Show the values in the 'description' column
sgambatis_description.show(truncate=False)

# Get the first value of the description
primer_dato_desc = sgambatis_description.first()[0]

print("First value in the description column:", primer_dato_desc)


+---------------------------------------------------------------------------------------+
|description                                                                            |
+---------------------------------------------------------------------------------------+
|NULL                                                                                   |
|NULL                                                                                   |
|NULL                                                                                   |
|Brick-front pizza & pasta parlor with both bar & booth seating, plus a takeout counter.|
|NULL                                                                                   |
+---------------------------------------------------------------------------------------+

First value in the description column: None


We fill in the NULL values of the description column so that these rows are not lost when NULLs are dropped.

In [10]:
# Replace NULL values in the 'description' column for rows whose name contains "Sgambati"
# Note: the name pattern also matches unrelated businesses, so the two law firms in the output below receive the same description
metadata_df = metadata_df.withColumn('description',
                                     when((col('name').like('%Sgambati%')) & (col('description').isNull()),
                                          'Brick-front pizza & pasta parlor with both bar & booth seating, plus a takeout counter')
                                     .otherwise(col('description')))

# Show the modified rows
metadata_df.filter(col('name').like('%Sgambati%')).select('name', 'description').show(truncate=False)


+-------------------------------------+---------------------------------------------------------------------------------------+
|name                                 |description                                                                            |
+-------------------------------------+---------------------------------------------------------------------------------------+
|Sgambati's New York Pizza            |Brick-front pizza & pasta parlor with both bar & booth seating, plus a takeout counter |
|Sgambati's New York Pizza            |Brick-front pizza & pasta parlor with both bar & booth seating, plus a takeout counter |
|Engel Devlin Sgambati, LLC           |Brick-front pizza & pasta parlor with both bar & booth seating, plus a takeout counter |
|Sgambati's New York Pizza            |Brick-front pizza & pasta parlor with both bar & booth seating, plus a takeout counter.|
|The Law Firm of Green Haines Sgambati|Brick-front pizza & pasta parlor with both bar & booth seating, p

In [11]:
# Filter the rows by name and select the category
sgambatis_category = metadata_df.filter(col('name').like('%Sgambati%')).select('category')

# Show the values in the 'category' column
sgambatis_category.show(truncate=False)

# Get the first value of the category
primer_dato_cte = sgambatis_category.first()[0]

print("First value in the category column:", primer_dato_cte)

+------------------------------------------------------------------+
|category                                                          |
+------------------------------------------------------------------+
|[Pizza restaurant, Italian restaurant, Pizza delivery, Restaurant]|
|[Pizza restaurant, Italian restaurant, Pizza delivery, Restaurant]|
|[Attorney]                                                        |
|[Pizza restaurant, Italian restaurant]                            |
|[Law firm]                                                        |
+------------------------------------------------------------------+

First value in the category column: ['Pizza restaurant', 'Italian restaurant', 'Pizza delivery', 'Restaurant']


### Dropping the columns we will not work with

In [12]:
# List of columns to drop
columnas_a_eliminar = ['MISC', 'state', 'price', 'hours', 'description', 'relative_results']

In [13]:
# Drop the columns
metadata_df = metadata_df.drop(*columnas_a_eliminar)

# Show the resulting DataFrame
metadata_df.show(truncate=False)

+------------------------------------------------------------------------------------------------------+----------+-----------------------------------------------------------------------+-------------------------------------+------------------+-------------------+---------------------------------------------------+--------------+----------------------------------------------------------------------------------------------------------------+
|address                                                                                               |avg_rating|category                                                               |gmap_id                              |latitude          |longitude          |name                                               |num_of_reviews|url                                                                                                             |
+------------------------------------------------------------------------------------------------------+------

### Looking up the client again to make sure it is still in the dataset

In [14]:
cliente_sgambatis = metadata_df.filter(col('name').like('%Sgambati%'))
cliente_sgambatis.show()

+--------------------+----------+--------------------+--------------------+------------------+------------------+--------------------+--------------+--------------------+
|             address|avg_rating|            category|             gmap_id|          latitude|         longitude|                name|num_of_reviews|                 url|
+--------------------+----------+--------------------+--------------------+------------------+------------------+--------------------+--------------+--------------------+
|Sgambati's New Yo...|       3.4|[Pizza restaurant...|0x8802e40e3b00b01...|         44.465098|-87.94651689999999|Sgambati's New Yo...|            18|https://www.googl...|
|Sgambati's New Yo...|       3.4|[Pizza restaurant...|0x8802e40e3b00b01...|         44.465098|-87.94651689999999|Sgambati's New Yo...|            18|https://www.googl...|
|Engel Devlin Sgam...|       5.0|          [Attorney]|0x89c3b95088808ab...|        40.6203381|        -74.488506|Engel Devlin Sgam...|           

In [15]:
# Show the number of rows in the DataFrame
metadata_df.count()

3025011

### **Checking for NULLs**

In [16]:
# Count the number of nulls in each column

def conteo_nulos(dataframe):
    # Build one aggregation expression per column that counts its nulls
    expresiones_agregacion = [sum(col(c).isNull().cast("int")).alias(c) for c in dataframe.columns]

    # Apply the aggregation expressions to the DataFrame
    conteo_nulos_por_columna = dataframe.agg(*expresiones_agregacion)

    # Show the result
    conteo_nulos_por_columna.show()

In [17]:
# Call the function on the DataFrame
conteo_nulos(metadata_df)

+-------+----------+--------+-------+--------+---------+----+--------------+---+
|address|avg_rating|category|gmap_id|latitude|longitude|name|num_of_reviews|url|
+-------+----------+--------+-------+--------+---------+----+--------------+---+
|  80511|         0|   17419|      0|       0|        0|  37|             0|  0|
+-------+----------+--------+-------+--------+---------+----+--------------+---+
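The aggregation in `conteo_nulos` adds up a 0/1 flag per column. As a rough, Spark-free illustration of the same idea, the sketch below counts `None` values per key over a small list of dictionaries. The sample rows and the helper name `count_nulls` are made up for illustration and are not part of the notebook.

```python
def count_nulls(rows, columns):
    """Return {column: number of rows whose value is None}."""
    counts = {c: 0 for c in columns}
    for row in rows:
        for c in columns:
            # Same idea as sum(col(c).isNull().cast("int")): add 1 per null
            if row.get(c) is None:
                counts[c] += 1
    return counts

# Hypothetical sample rows (not taken from the dataset)
sample_rows = [
    {"name": "A", "address": None, "category": ["Pharmacy"]},
    {"name": None, "address": None, "category": None},
    {"name": "C", "address": "1 Main St", "category": ["Restaurant"]},
]

null_counts = count_nulls(sample_rows, ["name", "address", "category"])
print(null_counts)  # {'name': 1, 'address': 2, 'category': 1}
```

Reading the counts per column first, as the notebook does, makes it easier to decide which columns are worth a `dropna` and which would remove too many rows.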



### **Dropping NULL values**

In [18]:
# Look up the client to confirm it is still present
cliente_sgambatis2 = metadata_df.filter(col('name').like('%Sgambati%'))
cliente_sgambatis2.show()

+--------------------+----------+--------------------+--------------------+------------------+------------------+--------------------+--------------+--------------------+
|             address|avg_rating|            category|             gmap_id|          latitude|         longitude|                name|num_of_reviews|                 url|
+--------------------+----------+--------------------+--------------------+------------------+------------------+--------------------+--------------+--------------------+
|Sgambati's New Yo...|       3.4|[Pizza restaurant...|0x8802e40e3b00b01...|         44.465098|-87.94651689999999|Sgambati's New Yo...|            18|https://www.googl...|
|Sgambati's New Yo...|       3.4|[Pizza restaurant...|0x8802e40e3b00b01...|         44.465098|-87.94651689999999|Sgambati's New Yo...|            18|https://www.googl...|
|Engel Devlin Sgam...|       5.0|          [Attorney]|0x89c3b95088808ab...|        40.6203381|        -74.488506|Engel Devlin Sgam...|           

In [19]:
# Drop rows where 'address' is null
metadata_df = metadata_df.dropna(subset=['address'])

# Check that the null rows were removed
conteo_nulos(metadata_df)

+-------+----------+--------+-------+--------+---------+----+--------------+---+
|address|avg_rating|category|gmap_id|latitude|longitude|name|num_of_reviews|url|
+-------+----------+--------+-------+--------+---------+----+--------------+---+
|      0|         0|   17414|      0|       0|        0|   0|             0|  0|
+-------+----------+--------+-------+--------+---------+----+--------------+---+



In [20]:
# Look up the client to confirm it is still present
cliente_sgambatis3 = metadata_df.filter(col('name').like('%Sgambati%'))
cliente_sgambatis3.show()

+--------------------+----------+--------------------+--------------------+------------------+------------------+--------------------+--------------+--------------------+
|             address|avg_rating|            category|             gmap_id|          latitude|         longitude|                name|num_of_reviews|                 url|
+--------------------+----------+--------------------+--------------------+------------------+------------------+--------------------+--------------+--------------------+
|Sgambati's New Yo...|       3.4|[Pizza restaurant...|0x8802e40e3b00b01...|         44.465098|-87.94651689999999|Sgambati's New Yo...|            18|https://www.googl...|
|Sgambati's New Yo...|       3.4|[Pizza restaurant...|0x8802e40e3b00b01...|         44.465098|-87.94651689999999|Sgambati's New Yo...|            18|https://www.googl...|
|Engel Devlin Sgam...|       5.0|          [Attorney]|0x89c3b95088808ab...|        40.6203381|        -74.488506|Engel Devlin Sgam...|           

In [21]:
# Drop rows where 'category' is null
metadata_df = metadata_df.dropna(subset=['category'])

# Check that the null rows were removed
conteo_nulos(metadata_df)

+-------+----------+--------+-------+--------+---------+----+--------------+---+
|address|avg_rating|category|gmap_id|latitude|longitude|name|num_of_reviews|url|
+-------+----------+--------+-------+--------+---------+----+--------------+---+
|      0|         0|       0|      0|       0|        0|   0|             0|  0|
+-------+----------+--------+-------+--------+---------+----+--------------+---+



### **Checking that our client is still in the DATASET**

In [22]:
# Look up the client to confirm it is still present
cliente_sgambatis4 = metadata_df.filter(col('name').like('%Sgambati%'))
cliente_sgambatis4.show()

+--------------------+----------+--------------------+--------------------+------------------+------------------+--------------------+--------------+--------------------+
|             address|avg_rating|            category|             gmap_id|          latitude|         longitude|                name|num_of_reviews|                 url|
+--------------------+----------+--------------------+--------------------+------------------+------------------+--------------------+--------------+--------------------+
|Sgambati's New Yo...|       3.4|[Pizza restaurant...|0x8802e40e3b00b01...|         44.465098|-87.94651689999999|Sgambati's New Yo...|            18|https://www.googl...|
|Sgambati's New Yo...|       3.4|[Pizza restaurant...|0x8802e40e3b00b01...|         44.465098|-87.94651689999999|Sgambati's New Yo...|            18|https://www.googl...|
|Engel Devlin Sgam...|       5.0|          [Attorney]|0x89c3b95088808ab...|        40.6203381|        -74.488506|Engel Devlin Sgam...|           

### **Function to fetch the first row of a column**

In [23]:
# Function that returns the first value of a given column

def obtener_primer_dato(dataframe, columna):
    # Select the given column and extract its first value
    primer_dato = dataframe.select(columna).first()[0]
    return primer_dato

### **Checking for DUPLICATES**

In [24]:
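# Look for duplicated rows in the DataFrame
# Note: subtract() has EXCEPT DISTINCT semantics, so subtracting the de-duplicated
# DataFrame always returns an empty result, even when duplicated rows exist
duplicados = metadata_df.subtract(metadata_df.dropDuplicates())
# Show the duplicated records
duplicados.show()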

+-------+----------+--------+-------+--------+---------+----+--------------+---+
|address|avg_rating|category|gmap_id|latitude|longitude|name|num_of_reviews|url|
+-------+----------+--------+-------+--------+---------+----+--------------+---+
+-------+----------+--------+-------+--------+---------+----+--------------+---+
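The empty table above does not by itself show that the data has no duplicates: `subtract` behaves like SQL `EXCEPT DISTINCT`, and every distinct row is still present in the de-duplicated DataFrame, so the difference is empty by construction. The plain-Python sketch below illustrates the difference between a set difference and an occurrence count. The rows are hypothetical placeholders.

```python
from collections import Counter

# Hypothetical rows; the first two are exact duplicates
rows = [("Pizza Place", "id-1"), ("Pizza Place", "id-1"), ("Pharmacy", "id-2")]

# De-duplicated rows, preserving order
distinct_rows = list(dict.fromkeys(rows))

# Set difference against the de-duplicated rows: always empty,
# because each distinct row still appears on the right-hand side
set_difference = set(rows) - set(distinct_rows)

# Counting occurrences instead exposes the repeated rows
duplicates = {row: n for row, n in Counter(rows).items() if n > 1}

print(set_difference)  # set()
print(duplicates)      # {('Pizza Place', 'id-1'): 2}
```

In Spark, one option along the same lines would be to group by all columns and keep the groups with a count above 1, for example `metadata_df.groupBy(metadata_df.columns).count().filter(col("count") > 1)`. Comparing `metadata_df.count()` with `metadata_df.dropDuplicates().count()` gives a quicker yes/no answer.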



In [25]:
# Show the schema of the DataFrame
metadata_df.printSchema()

root
 |-- address: string (nullable = true)
 |-- avg_rating: double (nullable = true)
 |-- category: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- gmap_id: string (nullable = true)
 |-- latitude: double (nullable = true)
 |-- longitude: double (nullable = true)
 |-- name: string (nullable = true)
 |-- num_of_reviews: long (nullable = true)
 |-- url: string (nullable = true)



### **CASTING THE WHOLE CATEGORY COLUMN TO STRING**

In [26]:
# Cast the 'category' column to string
metadata_df = metadata_df.withColumn("category", col("category").cast("string"))

In [27]:
# Call the function with the DataFrame and the column name
primer_dato_category = obtener_primer_dato(metadata_df, "category")
print("First value in the category column:", primer_dato_category)

First value in the category column: [Pharmacy]


In [28]:
# Look up the client to confirm it is still present
cliente_sgambatis5 = metadata_df.filter(col('name').like('%Sgambati%'))
cliente_sgambatis5.show()

+--------------------+----------+--------------------+--------------------+------------------+------------------+--------------------+--------------+--------------------+
|             address|avg_rating|            category|             gmap_id|          latitude|         longitude|                name|num_of_reviews|                 url|
+--------------------+----------+--------------------+--------------------+------------------+------------------+--------------------+--------------+--------------------+
|Sgambati's New Yo...|       3.4|[Pizza restaurant...|0x8802e40e3b00b01...|         44.465098|-87.94651689999999|Sgambati's New Yo...|            18|https://www.googl...|
|Sgambati's New Yo...|       3.4|[Pizza restaurant...|0x8802e40e3b00b01...|         44.465098|-87.94651689999999|Sgambati's New Yo...|            18|https://www.googl...|
|Engel Devlin Sgam...|       5.0|          [Attorney]|0x89c3b95088808ab...|        40.6203381|        -74.488506|Engel Devlin Sgam...|           

### **Keeping only the RESTAURANT rows**

In [29]:
# Keep the rows whose category contains 'restaurant'
df = metadata_df.filter(col('category').like('%restaurant%'))

In [30]:
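# Use substring to strip the first and last character (the brackets) from the 'category' column
# Note: this starts again from metadata_df, so the 'restaurant' filter assigned to df in the previous cell is overwritten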
df = metadata_df.withColumn("category", expr("substring(category, 2, length(category)-2)"))

In [31]:
df_aux = df.withColumn("category", col("category").cast("string"))

# Count the rows whose 'category' column contains 'restaurant'
contador_restaurant = df_aux.filter(col("category").contains('restaurant')).count()


# Print the result
print("Number of restaurants:", contador_restaurant)

Number of restaurants: 151771
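The count above relies on a case-sensitive substring match over the stringified category list, so a row whose only label is `Restaurant` (capitalized) would not be counted. The following sketch, in plain Python, shows one way to parse the bracketed string and match case-insensitively. The helper names `parse_categories` and `is_restaurant` are illustrative assumptions, and the parsing assumes that labels themselves contain no commas.

```python
def parse_categories(category_str):
    """Turn '[Pizza restaurant, Italian restaurant]' into a list of labels."""
    inner = category_str.strip()
    if inner.startswith("[") and inner.endswith("]"):
        inner = inner[1:-1]
    return [label.strip() for label in inner.split(",") if label.strip()]

def is_restaurant(category_str):
    """Case-insensitive check for 'restaurant' in any label."""
    return any("restaurant" in label.lower()
               for label in parse_categories(category_str))

examples = ["[Pizza restaurant, Italian restaurant]", "[Restaurant]", "[Attorney]"]

# Case-sensitive substring match, as in the notebook's filter
case_sensitive = ["restaurant" in c for c in examples]
# Case-insensitive match on parsed labels
case_insensitive = [is_restaurant(c) for c in examples]

print(case_sensitive)    # [True, False, False]
print(case_insensitive)  # [True, True, False]
```

In Spark, a comparable effect could be obtained by lowercasing the column before matching, for instance with `F.lower(col("category")).contains("restaurant")`.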


In [32]:
df_aux.show()

+--------------------+----------+--------------------+--------------------+------------------+-------------------+--------------------+--------------+--------------------+
|             address|avg_rating|            category|             gmap_id|          latitude|          longitude|                name|num_of_reviews|                 url|
+--------------------+----------+--------------------+--------------------+------------------+-------------------+--------------------+--------------+--------------------+
|Porter Pharmacy, ...|       4.9|            Pharmacy|0x88f16e41928ff68...|           32.3883|           -83.3571|     Porter Pharmacy|            16|https://www.googl...|
|City Textile, 300...|       4.5|    Textile exporter|0x80c2c98c0e3c16f...|        34.0188913|       -118.2152898|        City Textile|             6|https://www.googl...|
|San Soo Dang, 761...|       4.4|   Korean restaurant|0x80c2c778e3b73d3...|        34.0580917|       -118.2921295|        San Soo Dang|     

In [33]:
# Check that the client is still in the data
cliente_sgambatis6 = df.filter(col('name').like('%Sgambati%'))
cliente_sgambatis6.show()

+--------------------+----------+--------------------+--------------------+------------------+------------------+--------------------+--------------+--------------------+
|             address|avg_rating|            category|             gmap_id|          latitude|         longitude|                name|num_of_reviews|                 url|
+--------------------+----------+--------------------+--------------------+------------------+------------------+--------------------+--------------+--------------------+
|Sgambati's New Yo...|       3.4|Pizza restaurant,...|0x8802e40e3b00b01...|         44.465098|-87.94651689999999|Sgambati's New Yo...|            18|https://www.googl...|
|Sgambati's New Yo...|       3.4|Pizza restaurant,...|0x8802e40e3b00b01...|         44.465098|-87.94651689999999|Sgambati's New Yo...|            18|https://www.googl...|
|Engel Devlin Sgam...|       5.0|            Attorney|0x89c3b95088808ab...|        40.6203381|        -74.488506|Engel Devlin Sgam...|           

In [34]:
# Call the function with the DataFrame and the column name
primer_dato_address = obtener_primer_dato(df, "address")
print("First value in the address column:", primer_dato_address)

First value in the address column: Porter Pharmacy, 129 N Second St, Cochran, GA 31014


In [35]:
# Filter the rows by name and select the address
sgambatis_address = metadata_df.filter(col('name').like('%Sgambati%')).select('address')

# Show the values in the 'address' column
sgambatis_address.show(truncate=False)

# Get the first value of the address
primer_dato_address = sgambatis_address.first()[0]

print("First value in the address column:", primer_dato_address)

+----------------------------------------------------------------------------------+
|address                                                                           |
+----------------------------------------------------------------------------------+
|Sgambati's New York Pizza, 2725 Manitowoc Rd, Green Bay, WI 54311                 |
|Sgambati's New York Pizza, 2725 Manitowoc Rd, Green Bay, WI 54311                 |
|Engel Devlin Sgambati, LLC, 31 Mountain Blvd # E, Warren, NJ 07059                |
|Sgambati's New York Pizza, 1700 Sand Acres Dr, De Pere, WI 54115, United States   |
|The Law Firm of Green Haines Sgambati, 100 E Federal St #800, Youngstown, OH 44503|
+----------------------------------------------------------------------------------+

First value in the address column: Sgambati's New York Pizza, 2725 Manitowoc Rd, Green Bay, WI 54311


In [36]:
# Filter the rows by name and select the gmap_id
sgambatis_gmap_id = df.filter(col('name').like('%Sgambati%')).select('gmap_id')

# Show the values in the gmap_id column
sgambatis_gmap_id.show(truncate=False)

# Get the first value of the gmap_id
primer_dato_gmap_id = sgambatis_gmap_id.first()[0]

print("First value in the gmap_id column:", primer_dato_gmap_id)

+-------------------------------------+
|gmap_id                              |
+-------------------------------------+
|0x8802e40e3b00b01b:0xbc746336817e4381|
|0x8802e40e3b00b01b:0xbc746336817e4381|
|0x89c3b95088808ab7:0xbacd90b30fbc7da6|
|0x8802fee355be516f:0x95a8f520c250b36b|
|0x8833e5764c013bab:0xd7e626dd2c3093bf|
+-------------------------------------+

First value in the gmap_id column: 0x8802e40e3b00b01b:0xbc746336817e4381


We split ADDRESS into three columns to get clean data.

In [37]:
# Split the 'address' column into 3 columns: 'address', 'city', and 'state'
# (assumes the layout 'name, street, city, ST ZIP')
split_col = split(df['address'], ',')
df = df.withColumn('state', substring(split_col.getItem(3), 2, 2))
df = df.withColumn('city', split_col.getItem(2))
df = df.withColumn('address', concat_ws(',', split_col.getItem(0), split_col.getItem(1)))
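Splitting by position from the left works for addresses shaped like `name, street, city, ST ZIP`, but it shifts when the business name itself contains a comma. For `Engel Devlin Sgambati, LLC, 31 Mountain Blvd # E, Warren, NJ 07059`, item 3 is ` Warren`, so the two-character state would come out as `Wa`. The sketch below, in plain Python, parses from the right instead. The helper `split_address` and the `STATE_ZIP` pattern are illustrative assumptions, not the notebook's method.

```python
import re

# Two-letter state followed by a 5-digit ZIP (optionally ZIP+4)
STATE_ZIP = re.compile(r"^([A-Z]{2})\s+\d{5}(?:-\d{4})?$")

def split_address(full_address):
    """Return (street_part, city, state), parsing from the right.

    Assumes the tail looks like '..., City, ST 12345[, United States]'.
    Returns (full_address, None, None) when that shape is not found.
    """
    parts = [p.strip() for p in full_address.split(",")]
    # Drop an optional trailing country
    if parts and parts[-1] == "United States":
        parts = parts[:-1]
    if len(parts) >= 3:
        match = STATE_ZIP.match(parts[-1])
        if match:
            return ", ".join(parts[:-2]), parts[-2], match.group(1)
    return full_address, None, None

print(split_address("Sgambati's New York Pizza, 2725 Manitowoc Rd, Green Bay, WI 54311"))
print(split_address("Engel Devlin Sgambati, LLC, 31 Mountain Blvd # E, Warren, NJ 07059"))
print(split_address("Sgambati's New York Pizza, 1700 Sand Acres Dr, De Pere, WI 54115, United States"))
```

Stripping each part also removes the leading space that a plain split on `,` leaves in front of the city name.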

In [38]:
# Check that the client is still in the data
cliente_sgambatis7 = df.filter(col('name').like('%Sgambati%'))
cliente_sgambatis7.show()

+--------------------+----------+--------------------+--------------------+------------------+------------------+--------------------+--------------+--------------------+-----+--------------------+
|             address|avg_rating|            category|             gmap_id|          latitude|         longitude|                name|num_of_reviews|                 url|state|                city|
+--------------------+----------+--------------------+--------------------+------------------+------------------+--------------------+--------------+--------------------+-----+--------------------+
|Sgambati's New Yo...|       3.4|Pizza restaurant,...|0x8802e40e3b00b01...|         44.465098|-87.94651689999999|Sgambati's New Yo...|            18|https://www.googl...|   WI|           Green Bay|
|Sgambati's New Yo...|       3.4|Pizza restaurant,...|0x8802e40e3b00b01...|         44.465098|-87.94651689999999|Sgambati's New Yo...|            18|https://www.googl...|   WI|           Green Bay|
|Engel Dev

### **Filtering for NY, CA, and WI**

In [39]:
# Keep the rows whose state is 'NY', 'CA', or 'WI'
estados_seleccionados = ['NY', 'CA', 'WI']

df = df.filter(df['state'].isin(estados_seleccionados))

In [40]:
# Check that the client is still in the data
cliente_sgambatis8 = df.filter(col('name').like('%Sgambati%'))
cliente_sgambatis8.show()

+--------------------+----------+--------------------+--------------------+----------+------------------+--------------------+--------------+--------------------+-----+----------+
|             address|avg_rating|            category|             gmap_id|  latitude|         longitude|                name|num_of_reviews|                 url|state|      city|
+--------------------+----------+--------------------+--------------------+----------+------------------+--------------------+--------------+--------------------+-----+----------+
|Sgambati's New Yo...|       3.4|Pizza restaurant,...|0x8802e40e3b00b01...| 44.465098|-87.94651689999999|Sgambati's New Yo...|            18|https://www.googl...|   WI| Green Bay|
|Sgambati's New Yo...|       3.4|Pizza restaurant,...|0x8802e40e3b00b01...| 44.465098|-87.94651689999999|Sgambati's New Yo...|            18|https://www.googl...|   WI| Green Bay|
|Sgambati's New Yo...|       4.2|Pizza restaurant,...|0x8802fee355be516...|44.4318372|       -88.124

### **Normalizing the LATITUDE and LONGITUDE columns to the same data type and number of digits**

In [41]:
# Call the function on the DataFrame
conteo_nulos(df)

+-------+----------+--------+-------+--------+---------+----+--------------+---+-----+----+
|address|avg_rating|category|gmap_id|latitude|longitude|name|num_of_reviews|url|state|city|
+-------+----------+--------+-------+--------+---------+----+--------------+---+-----+----+
|      0|         0|       0|      0|       0|        0|   0|             0|  0|    0|   0|
+-------+----------+--------+-------+--------+---------+----+--------------+---+-----+----+



In [42]:
# Keep only the rows where 'latitude' and 'longitude' are numeric
df = df.filter(col("latitude").rlike("^-?\\d+\\.\\d+$"))
df = df.filter(col("longitude").rlike("^-?\\d+\\.\\d+$"))

df.show()


+--------------------+----------+--------------------+--------------------+------------------+-------------------+--------------------+--------------+--------------------+-----+---------------+
|             address|avg_rating|            category|             gmap_id|          latitude|          longitude|                name|num_of_reviews|                 url|state|           city|
+--------------------+----------+--------------------+--------------------+------------------+-------------------+--------------------+--------------+--------------------+-----+---------------+
|City Textile, 300...|       4.5|    Textile exporter|0x80c2c98c0e3c16f...|        34.0188913|       -118.2152898|        City Textile|             6|https://www.googl...|   CA|    Los Angeles|
|San Soo Dang, 761...|       4.4|   Korean restaurant|0x80c2c778e3b73d3...|        34.0580917|       -118.2921295|        San Soo Dang|            18|https://www.googl...|   CA|    Los Angeles|
|Nova Fabrics, 220...|       3
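The filter above keeps a value only if its text form is an optional minus sign, digits, a decimal point, and more digits. A quick way to see what that pattern accepts and rejects is to try it on a few strings in plain Python. The sample values below are hypothetical, and `re.fullmatch` is used here as a stand-in for the `^...$` anchors.

```python
import re

# Same shape as the notebook's pattern: optional sign, digits, point, digits
COORD = re.compile(r"-?\d+\.\d+")

samples = ["34.0188913", "-118.2152898", "34", "1.0E-4", ""]

valid = [s for s in samples if COORD.fullmatch(s)]
rejected = [s for s in samples if not COORD.fullmatch(s)]

print(valid)     # ['34.0188913', '-118.2152898']
print(rejected)  # ['34', '1.0E-4', '']
```

Since both columns are already typed as `double` in the schema, this check acts on their string rendering. Values rendered in scientific notation would likely be dropped by it, which is worth keeping in mind for coordinates very close to zero.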

In [43]:
# Normalize the number of digits before and after the decimal point
# Note: the replacement "$1.$2" rebuilds the matched text unchanged, so this step mainly converts the columns to strings
df = df.withColumn("latitude", regexp_replace(col("latitude"), r"^(-?\d{1,3})\.(\d{1,7})$", "$1.$2"))
df = df.withColumn("longitude", regexp_replace(col("longitude"), r"^(-?\d{1,3})\.(\d{1,7})$", "$1.$2"))

df.show()


+--------------------+----------+--------------------+--------------------+------------------+-------------------+--------------------+--------------+--------------------+-----+---------------+
|             address|avg_rating|            category|             gmap_id|          latitude|          longitude|                name|num_of_reviews|                 url|state|           city|
+--------------------+----------+--------------------+--------------------+------------------+-------------------+--------------------+--------------+--------------------+-----+---------------+
|City Textile, 300...|       4.5|    Textile exporter|0x80c2c98c0e3c16f...|        34.0188913|       -118.2152898|        City Textile|             6|https://www.googl...|   CA|    Los Angeles|
|San Soo Dang, 761...|       4.4|   Korean restaurant|0x80c2c778e3b73d3...|        34.0580917|       -118.2921295|        San Soo Dang|            18|https://www.googl...|   CA|    Los Angeles|
|Nova Fabrics, 220...|       3
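
The replacement `$1.$2` writes back exactly what was captured, which is why the output is identical to the previous cell. If the intent were to cap the number of decimals, one possible approach is to capture only the digits to keep and match the rest without capturing it. The sketch below shows that idea in plain Python, where group references are written `\1` rather than `$1`. The function name `truncate_decimals` and the sample values are assumptions for illustration.

```python
import re

# Capture sign + integer part, then at most 7 decimals.
# Any further decimals are matched by the trailing \d* but not captured.
TRUNCATE_PATTERN = re.compile(r"^(-?\d{1,3})\.(\d{1,7})\d*$")


def truncate_decimals(value):
    """Cut a decimal string to at most 7 decimal places (truncation, no rounding)."""
    return TRUNCATE_PATTERN.sub(r"\1.\2", value)


# Made-up sample values (not from the dataset)
print(truncate_decimals("34.018891234"))
print(truncate_decimals("-118.21"))
```

Strings that do not match the pattern are returned unchanged, which mirrors how `regexp_replace` behaves on non-matching input.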

In [44]:
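def limpiar_lat_lon(lat_lon):
    # Null values cannot be searched with a regex; pass them through as None
    if lat_lon is None:
        return None
    # Use a regular expression to extract the first signed decimal number (at most 7 decimals)
    cleaned_lat_lon = re.findall(r'-?\d+\.\d{1,7}', lat_lon)
    if cleaned_lat_lon:
        return cleaned_lat_lon[0]
    return None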
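
As a quick illustration of what this function returns, the sketch below runs it on a few made-up inputs. The function is repeated so the snippet runs on its own, and this copy includes a guard for `None`, since `re.findall` raises `TypeError` when given `None`. The sample strings are assumptions, not values from the dataset.

```python
import re


def limpiar_lat_lon(lat_lon):
    # Copy of the notebook function so this snippet is self-contained,
    # with a None guard because re.findall raises TypeError on None.
    if lat_lon is None:
        return None
    cleaned_lat_lon = re.findall(r'-?\d+\.\d{1,7}', lat_lon)
    return cleaned_lat_lon[0] if cleaned_lat_lon else None


# Made-up inputs (not from the dataset)
examples = ["34.0188913", "-118.21528981234", "lat=44.465098;", "unknown", None]
results = [limpiar_lat_lon(x) for x in examples]
print(results)
```

Only the first match is kept, and decimals beyond the seventh are dropped rather than rounded.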

In [45]:
# Apply the cleaning function to the latitude and longitude columns
limpiar_lat_lon_udf = F.udf(limpiar_lat_lon)
df = df.withColumn("latitude", limpiar_lat_lon_udf("latitude"))
df = df.withColumn("longitude", limpiar_lat_lon_udf("longitude"))

In [46]:
# Check that the client's rows are still present
cliente_sgambatis9 = df.filter(col('name').like('%Sgambati%'))
cliente_sgambatis9.show()

+--------------------+----------+--------------------+--------------------+----------+-----------+--------------------+--------------+--------------------+-----+----------+
|             address|avg_rating|            category|             gmap_id|  latitude|  longitude|                name|num_of_reviews|                 url|state|      city|
+--------------------+----------+--------------------+--------------------+----------+-----------+--------------------+--------------+--------------------+-----+----------+
|Sgambati's New Yo...|       3.4|Pizza restaurant,...|0x8802e40e3b00b01...| 44.465098|-87.9465168|Sgambati's New Yo...|            18|https://www.googl...|   WI| Green Bay|
|Sgambati's New Yo...|       3.4|Pizza restaurant,...|0x8802e40e3b00b01...| 44.465098|-87.9465168|Sgambati's New Yo...|            18|https://www.googl...|   WI| Green Bay|
|Sgambati's New Yo...|       4.2|Pizza restaurant,...|0x8802fee355be516...|44.4318372|-88.1245254|Sgambati's New Yo...|           608|h

In [47]:
df.show()

+--------------------+----------+--------------------+--------------------+----------+------------+--------------------+--------------+--------------------+-----+---------------+
|             address|avg_rating|            category|             gmap_id|  latitude|   longitude|                name|num_of_reviews|                 url|state|           city|
+--------------------+----------+--------------------+--------------------+----------+------------+--------------------+--------------+--------------------+-----+---------------+
|City Textile, 300...|       4.5|    Textile exporter|0x80c2c98c0e3c16f...|34.0188913|-118.2152898|        City Textile|             6|https://www.googl...|   CA|    Los Angeles|
|San Soo Dang, 761...|       4.4|   Korean restaurant|0x80c2c778e3b73d3...|34.0580917|-118.2921295|        San Soo Dang|            18|https://www.googl...|   CA|    Los Angeles|
|Nova Fabrics, 220...|       3.3|        Fabric store|0x80c2c89923b27a4...|34.0236689|-118.2329297|      

In [48]:
df.count()

464324

In [49]:
# Remove double quotes and commas from all columns
df_clean = df.select([
    regexp_replace(regexp_replace(col(column), '"', ''), ',', '').alias(column)
    for column in df.columns
])

In [50]:
# Remove double quotes and semicolons from all columns
df_clean = df_clean.select([
    regexp_replace(regexp_replace(col(column), '"', ''), ';', '').alias(column)
    for column in df_clean.columns
])
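
The two cells above strip `"` and `,`, then `"` and `;`. As a rough illustration of the combined effect on a single string value, the plain-Python sketch below removes all three characters in one pass with a character class. The helper name `strip_delimiters` and the sample address are assumptions; the same character class could in principle be passed to a single `regexp_replace` call.

```python
import re

# One character class covering double quotes, commas and semicolons
DELIMITERS = re.compile(r'[",;]')


def strip_delimiters(value):
    """Remove double quotes, commas and semicolons from a string; keep None as None."""
    if value is None:
        return None
    return DELIMITERS.sub("", value)


# Made-up sample value (not from the dataset)
print(strip_delimiters('"Acme Store", 12 Main St; Suite 4'))
```

Removing commas also merges the parts of an address into one run of text, so this kind of cleaning trades readability of free-text fields for a CSV that is simpler to parse.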

LOAD FILE

In [53]:
# Output path on the mounted Google Drive
file_path = '/content/drive/MyDrive/Colab-Notebooks/transformaciones/meta-finalgaby.csv'

# Write the DataFrame as CSV with "," as separator; coalesce(1) yields a single part file inside the directory `file_path`
df.coalesce(1).write.csv(file_path, header=True, sep=',')
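
Spark's CSV writer treats the path as a directory: after the write above, `file_path` contains a `part-*.csv` file plus marker files such as `_SUCCESS`, even with `coalesce(1)`. The sketch below, which uses only the standard library, shows one way to locate that part file and copy it to a conventional single-file name. The function name `export_single_csv` and the target path in the usage comment are hypothetical.

```python
import glob
import os
import shutil


def export_single_csv(spark_output_dir, target_file):
    """Copy the single part file from a Spark CSV output directory to target_file."""
    part_files = sorted(glob.glob(os.path.join(spark_output_dir, "part-*.csv")))
    if len(part_files) != 1:
        raise ValueError(f"expected exactly one part file, found {len(part_files)}")
    shutil.copyfile(part_files[0], target_file)
    return target_file


# Hypothetical usage after df.coalesce(1).write.csv(file_path, ...):
# export_single_csv(file_path, "/content/drive/MyDrive/some-folder/output.csv")
```

The helper raises an error instead of guessing when it finds zero or several part files, which would happen if the write failed or `coalesce(1)` was omitted.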