# Plantilla para la Tarea online BDA03

# Nombre del alumno: Victoria Jiménez Martín

Realiza las tareas que se plantean en cada ejercicio. En algunas tareas deberás completar las celdas que están incompletas en otras añadir nuevas celdas. Se trata de que implementes una serie de consultas con HQL (Hive) y Pig Latin.

Vamos a seguir utilizando el `dataset` de retrasos en vuelos en EEUU de la guía práctica. A modo de recordatorio, en el siguiente apartado, repetimos la explicación del significado de los campos.

# Dataset de retrasos en vuelos

Vamos a usar [este](https://www.kaggle.com/datasets/tylerx/flights-and-airports-data) de Kaggle
para aprender a usar tanto Hive como Pig. Kaggle es un sitio muy popular en ciencia de datos. En este sitio los científicos de datos pueden publicar y compartir sus trabajos. Además también se pueden proponer concursos en los que los participantes compiten en la construcción del mejor modelo para el problema propuesto.

El `dataset` contiene información sobre retrasos en vuelos en EEUU. Hay dos ficheros de interés: `airports.csv` y `flights.csv`.

El primero tiene información sobre los aeropuertos y consta de los siguientes campos:
   * airport_id: identificador del aeropuerto. Numérico, aunque se utilizará un campo `string` en Hive.
   * city: ciudad del aeropuerto.
   * state: estado del aeropuerto.
   * name: nombre del aeropuerto.
   
El fichero `flights` tiene la siguiente estructura:
   * DayofMonth: día del mes del vuelo.
   * DayOfWeek: día de la semana del vuelo.
   * Carrier: Identificador de la compañía aérea.
   * OriginAirportID: Identificador del aeropuerto de origen.
   * DestAirportID: Identificador del aeropuerto de destino.
   * DepDelay: Minutos de retraso en la salida de un vuelo (puede ser negativo si el vuelo sale antes de lo previsto).
   * ArrDelay: Minutos de retraso en la llegada de un vuelo (puede ser negativo si el vuelo sale antes de lo previsto).

El directorio `notebooks` contiene el `archiv.zip` con los dos ficheros. Para descargarlo de Kaggle hay que estar registrado y se ha incluido para que no tengas que registrarte.

## 1.- Realiza el proceso de preparación que se hizo en la guia práctica:

* Crea las celdas y muestra el resultado de su ejecución de la extracción de los ficheros del `dataset` de vuelos.
* Crea la base de datos de Hive y las tablas `airports` y `flights`. Presta atención a cambiar los comentarios y no simplemente copiar los de la guía.
* Carga las tablas y crea consultas de HQL que muestren 10 aeropuertos y 10 vuelos como se hizo en la guía práctica.
* Crea un `script` en Pig Latin que muestre 10 aeropuertos y 10 vuelos como se hizo en la guía práctica.

In [20]:
# Lo primero que haremos será actualizar el entorno

! apt update

Hit:1 http://archive.ubuntu.com/ubuntu focal InRelease
Hit:2 http://security.ubuntu.com/ubuntu focal-security InRelease
Hit:3 http://archive.ubuntu.com/ubuntu focal-updates InRelease
Hit:4 http://archive.ubuntu.com/ubuntu focal-backports InRelease
Reading package lists... Done3m[33m
Building dependency tree       
Reading state information... Done
207 packages can be upgraded. Run 'apt list --upgradable' to see them.


In [21]:
# Instalamos el unzip

! apt install unzip

Reading package lists... Done
Building dependency tree       
Reading state information... Done
unzip is already the newest version (6.0-25ubuntu1.2).
0 upgraded, 0 newly installed, 0 to remove and 207 not upgraded.


In [22]:
# Descomprimimos los archivos

! unzip -j -o archive.zip airports.csv flights.csv

Archive:  archive.zip
  inflating: airports.csv            
  inflating: flights.csv             


In [23]:
# Leemos el numero de lineas y primeras lineas del fichero de airports.csv
! wc -l airports.csv && head airports.csv

366 airports.csv
airport_id,city,state,name
10165,Adak Island,AK,Adak
10299,Anchorage,AK,Ted Stevens Anchorage International
10304,Aniak,AK,Aniak Airport
10754,Barrow,AK,Wiley Post/Will Rogers Memorial
10551,Bethel,AK,Bethel Airport
10926,Cordova,AK,Merle K Mudhole Smith
14709,Deadhorse,AK,Deadhorse Airport
11336,Dillingham,AK,Dillingham Airport
11630,Fairbanks,AK,Fairbanks International


In [24]:
# Leemos el numero de lineas y primeras lineas del fichero de flights.csv
! wc -l flights.csv && head flights.csv

2702219 flights.csv
DayofMonth,DayOfWeek,Carrier,OriginAirportID,DestAirportID,DepDelay,ArrDelay
19,5,DL,11433,13303,-3,1
19,5,DL,14869,12478,0,-8
19,5,DL,14057,14869,-4,-15
19,5,DL,15016,11433,28,24
19,5,DL,11193,12892,-6,-11
19,5,DL,10397,15016,-1,-19
19,5,DL,15016,10397,0,-1
19,5,DL,10397,14869,15,24
19,5,DL,10397,10423,33,34


In [25]:
! hdfs dfs -mkdir -p /user/root/flights
! hdfs dfs -put -f ./airports.csv /user/root/flights
! hdfs dfs -put -f ./flights.csv /user/root/flights
! hdfs dfs -ls /user/root/flights

Found 2 items
-rw-r--r--   3 root supergroup      16308 2024-02-17 01:01 /user/root/flights/airports.csv
-rw-r--r--   3 root supergroup   72088113 2024-02-17 01:01 /user/root/flights/flights.csv


In [26]:
# Para comprobar que se conecta a Hive, ejecutamos el siguiente comando para que nos muestre las
# bases de datos disponibles
! beeline -u "jdbc:hive2://localhost:10000" -e "SHOW DATABASES"

Connecting to jdbc:hive2://localhost:10000
Connected to: Apache Hive (version 3.1.2)
Driver: Hive JDBC (version 3.1.2)
Transaction isolation: TRANSACTION_REPEATABLE_READ
INFO  : Compiling command(queryId=root_20240217010123_dd6ffd58-0dce-4729-9e34-50bb3e3e1998): SHOW DATABASES
INFO  : Concurrency mode is disabled, not creating a lock manager
INFO  : Semantic Analysis Completed (retrial = false)
INFO  : Returning Hive schema: Schema(fieldSchemas:[FieldSchema(name:database_name, type:string, comment:from deserializer)], properties:null)
INFO  : Completed compiling command(queryId=root_20240217010123_dd6ffd58-0dce-4729-9e34-50bb3e3e1998); Time taken: 0.013 seconds
INFO  : Concurrency mode is disabled, not creating a lock manager
INFO  : Executing command(queryId=root_20240217010123_dd6ffd58-0dce-4729-9e34-50bb3e3e1998): SHOW DATABASES
INFO  : Starting task [Stage-0:DDL] in serial mode
INFO  : Completed executing command(queryId=root_20240217010123_dd6ffd58-0dce-4729-9e34-50bb3e3e1998); Ti

In [27]:
# Creamos una base de datos nueva
! beeline -u "jdbc:hive2://localhost:10000/" -e "CREATE DATABASE IF NOT EXISTS bda03 \
COMMENT 'Base de datos de la unidad BDA03' \
WITH DBPROPERTIES ('Creada por' = 'Victoria Jiménez', 'Fecha' = '13/02/24');"

Connecting to jdbc:hive2://localhost:10000/
Connected to: Apache Hive (version 3.1.2)
Driver: Hive JDBC (version 3.1.2)
Transaction isolation: TRANSACTION_REPEATABLE_READ
INFO  : Compiling command(queryId=root_20240217010135_99a6171d-1de3-41d6-8d4d-d407edf81835): CREATE DATABASE IF NOT EXISTS bda03  COMMENT 'Base de datos de la unidad BDA03'  WITH DBPROPERTIES ('Creada por' = 'Victoria Jim?nez', 'Fecha' = '13/02/24')
INFO  : Concurrency mode is disabled, not creating a lock manager
INFO  : Semantic Analysis Completed (retrial = false)
INFO  : Returning Hive schema: Schema(fieldSchemas:null, properties:null)
INFO  : Completed compiling command(queryId=root_20240217010135_99a6171d-1de3-41d6-8d4d-d407edf81835); Time taken: 0.01 seconds
INFO  : Concurrency mode is disabled, not creating a lock manager
INFO  : Executing command(queryId=root_20240217010135_99a6171d-1de3-41d6-8d4d-d407edf81835): CREATE DATABASE IF NOT EXISTS bda03  COMMENT 'Base de datos de la unidad BDA03'  WITH DBPROPERTIES

In [28]:
# Creamos una tabla nueva
! beeline -u "jdbc:hive2://localhost:10000/bda03" -e "DROP TABLE IF EXISTS airports; \
CREATE EXTERNAL TABLE IF NOT EXISTS airports (airportid STRING, city STRING, state STRING, airportname STRING) \
COMMENT 'USA Airports' ROW FORMAT DELIMITED FIELDS TERMINATED BY '\,' \
TBLPROPERTIES ('Autora' = 'Victoria Jiménez', 'Fecha' = '13/02/24', 'skip.header.line.count' = '1');"

Connecting to jdbc:hive2://localhost:10000/bda03
Connected to: Apache Hive (version 3.1.2)
Driver: Hive JDBC (version 3.1.2)
Transaction isolation: TRANSACTION_REPEATABLE_READ
INFO  : Compiling command(queryId=root_20240217010147_a86e5c0e-35e0-4b00-8f5b-5d091539f5cd): DROP TABLE IF EXISTS airports
INFO  : Concurrency mode is disabled, not creating a lock manager
INFO  : Semantic Analysis Completed (retrial = false)
INFO  : Returning Hive schema: Schema(fieldSchemas:null, properties:null)
INFO  : Completed compiling command(queryId=root_20240217010147_a86e5c0e-35e0-4b00-8f5b-5d091539f5cd); Time taken: 0.065 seconds
INFO  : Concurrency mode is disabled, not creating a lock manager
INFO  : Executing command(queryId=root_20240217010147_a86e5c0e-35e0-4b00-8f5b-5d091539f5cd): DROP TABLE IF EXISTS airports
INFO  : Starting task [Stage-0:DDL] in serial mode
INFO  : Completed executing command(queryId=root_20240217010147_a86e5c0e-35e0-4b00-8f5b-5d091539f5cd); Time taken: 0.435 seconds
INFO  : O

In [29]:
# Creamos una tabla nueva
! beeline -u "jdbc:hive2://localhost:10000/bda03" -e "DROP TABLE IF EXISTS flights; \
CREATE EXTERNAL TABLE IF NOT EXISTS flights (dayofmonth TINYINT, dayofweek TINYINT, carrier STRING, \
depairportid STRING, arrairportid STRING, depdelay SMALLINT, arrdelay SMALLINT) \
COMMENT 'Flights' ROW FORMAT DELIMITED FIELDS TERMINATED BY '\,' \
TBLPROPERTIES ('Autora' = 'Victoria Jiménez', 'Fecha' = '13/02/24', 'skip.header.line.count' = '1');"

Connecting to jdbc:hive2://localhost:10000/bda03
Connected to: Apache Hive (version 3.1.2)
Driver: Hive JDBC (version 3.1.2)
Transaction isolation: TRANSACTION_REPEATABLE_READ
INFO  : Compiling command(queryId=root_20240217010200_ea619836-4ba9-4f56-8503-326b73955f5d): DROP TABLE IF EXISTS flights
INFO  : Concurrency mode is disabled, not creating a lock manager
INFO  : Semantic Analysis Completed (retrial = false)
INFO  : Returning Hive schema: Schema(fieldSchemas:null, properties:null)
INFO  : Completed compiling command(queryId=root_20240217010200_ea619836-4ba9-4f56-8503-326b73955f5d); Time taken: 0.086 seconds
INFO  : Concurrency mode is disabled, not creating a lock manager
INFO  : Executing command(queryId=root_20240217010200_ea619836-4ba9-4f56-8503-326b73955f5d): DROP TABLE IF EXISTS flights
INFO  : Starting task [Stage-0:DDL] in serial mode
INFO  : Completed executing command(queryId=root_20240217010200_ea619836-4ba9-4f56-8503-326b73955f5d); Time taken: 0.422 seconds
INFO  : OK


In [30]:
# Asignamos permisos al directorio
! hdfs dfs -chmod 777 /user/root/flights

In [31]:
# Cargamos el csv en la tabla
! beeline -u "jdbc:hive2://localhost:10000/bda03" -e "LOAD DATA INPATH '/user/root/flights/airports.csv' \
OVERWRITE INTO TABLE airports;"

Connecting to jdbc:hive2://localhost:10000/bda03
Connected to: Apache Hive (version 3.1.2)
Driver: Hive JDBC (version 3.1.2)
Transaction isolation: TRANSACTION_REPEATABLE_READ
INFO  : Compiling command(queryId=root_20240217010217_34a876b0-9576-4efd-84ea-966b9e0d6349): LOAD DATA INPATH '/user/root/flights/airports.csv'  OVERWRITE INTO TABLE airports
INFO  : Concurrency mode is disabled, not creating a lock manager
INFO  : Semantic Analysis Completed (retrial = false)
INFO  : Returning Hive schema: Schema(fieldSchemas:null, properties:null)
INFO  : Completed compiling command(queryId=root_20240217010217_34a876b0-9576-4efd-84ea-966b9e0d6349); Time taken: 0.072 seconds
INFO  : Concurrency mode is disabled, not creating a lock manager
INFO  : Executing command(queryId=root_20240217010217_34a876b0-9576-4efd-84ea-966b9e0d6349): LOAD DATA INPATH '/user/root/flights/airports.csv'  OVERWRITE INTO TABLE airports
INFO  : Starting task [Stage-0:MOVE] in serial mode
INFO  : Loading data to table bda

In [32]:
# Cargamos el csv en la tabla
! beeline -u "jdbc:hive2://localhost:10000/bda03" -e "LOAD DATA INPATH '/user/root/flights/flights.csv' \
OVERWRITE INTO TABLE flights;"

Connecting to jdbc:hive2://localhost:10000/bda03
Connected to: Apache Hive (version 3.1.2)
Driver: Hive JDBC (version 3.1.2)
Transaction isolation: TRANSACTION_REPEATABLE_READ
INFO  : Compiling command(queryId=root_20240217010232_70b0aa88-4f5b-4d7f-b52a-54dd6530b18c): LOAD DATA INPATH '/user/root/flights/flights.csv'  OVERWRITE INTO TABLE flights
INFO  : Concurrency mode is disabled, not creating a lock manager
INFO  : Semantic Analysis Completed (retrial = false)
INFO  : Returning Hive schema: Schema(fieldSchemas:null, properties:null)
INFO  : Completed compiling command(queryId=root_20240217010232_70b0aa88-4f5b-4d7f-b52a-54dd6530b18c); Time taken: 0.049 seconds
INFO  : Concurrency mode is disabled, not creating a lock manager
INFO  : Executing command(queryId=root_20240217010232_70b0aa88-4f5b-4d7f-b52a-54dd6530b18c): LOAD DATA INPATH '/user/root/flights/flights.csv'  OVERWRITE INTO TABLE flights
INFO  : Starting task [Stage-0:MOVE] in serial mode
INFO  : Loading data to table bda03.f

In [33]:
# Cargamos 10 aeropuestos
! beeline -u "jdbc:hive2://localhost:10000/bda03" -e "SELECT * FROM airports LIMIT 10"

Connecting to jdbc:hive2://localhost:10000/bda03
Connected to: Apache Hive (version 3.1.2)
Driver: Hive JDBC (version 3.1.2)
Transaction isolation: TRANSACTION_REPEATABLE_READ
INFO  : Compiling command(queryId=root_20240217010239_a0cbb21a-5077-4d0f-a16c-41833878e8f3): SELECT * FROM airports LIMIT 10
INFO  : Concurrency mode is disabled, not creating a lock manager
INFO  : Semantic Analysis Completed (retrial = false)
INFO  : Returning Hive schema: Schema(fieldSchemas:[FieldSchema(name:airports.airportid, type:string, comment:null), FieldSchema(name:airports.city, type:string, comment:null), FieldSchema(name:airports.state, type:string, comment:null), FieldSchema(name:airports.airportname, type:string, comment:null)], properties:null)
INFO  : Completed compiling command(queryId=root_20240217010239_a0cbb21a-5077-4d0f-a16c-41833878e8f3); Time taken: 0.191 seconds
INFO  : Concurrency mode is disabled, not creating a lock manager
INFO  : Executing command(queryId=root_20240217010239_a0cbb21

In [34]:
# Muestra 10 vuelos
! beeline -u "jdbc:hive2://localhost:10000/bda03" -e "SELECT * FROM flights LIMIT 10"

Connecting to jdbc:hive2://localhost:10000/bda03
Connected to: Apache Hive (version 3.1.2)
Driver: Hive JDBC (version 3.1.2)
Transaction isolation: TRANSACTION_REPEATABLE_READ
INFO  : Compiling command(queryId=root_20240217010254_9fa8a1d5-1418-40b1-9655-bc5631db4e8d): SELECT * FROM flights LIMIT 10
INFO  : Concurrency mode is disabled, not creating a lock manager
INFO  : Semantic Analysis Completed (retrial = false)
INFO  : Returning Hive schema: Schema(fieldSchemas:[FieldSchema(name:flights.dayofmonth, type:tinyint, comment:null), FieldSchema(name:flights.dayofweek, type:tinyint, comment:null), FieldSchema(name:flights.carrier, type:string, comment:null), FieldSchema(name:flights.depairportid, type:string, comment:null), FieldSchema(name:flights.arrairportid, type:string, comment:null), FieldSchema(name:flights.depdelay, type:smallint, comment:null), FieldSchema(name:flights.arrdelay, type:smallint, comment:null)], properties:null)
INFO  : Completed compiling command(queryId=root_2024

In [21]:
%%writefile flights.pig

-- resgistramos la librería PiggyBank para poder usar la función de carga CSVExcelStorage.
REGISTER piggybank.jar

/*
Leemos el fichero de airports.csv.

Usamos el loader CSVExcelStorage indicando el delimitador (,) y que se debe excluir la cabecera.
*/

AIRPORTS = LOAD '$airports_file' USING
       org.apache.pig.piggybank.storage.CSVExcelStorage(',', 'NO_MULTILINE', 'UNIX', 'SKIP_INPUT_HEADER')
       AS (airportid:chararray, city:chararray, state:chararray, airportname:chararray);

-- Leemos el fichero fligths.csv

FLIGHTS = LOAD '$flights_file' USING
       org.apache.pig.piggybank.storage.CSVExcelStorage(',', 'NO_MULTILINE', 'UNIX', 'SKIP_INPUT_HEADER')
       AS (dayofmonth:int, dayofweek:int, carrier:chararray, 
               depairportid:chararray, arrairportid:chararray, depdelay:int, arrdelay:int);


-- Probamos que podemos recuperar datos.
      
-- Nos quedamos con 10 aeropuertos
AIRPORTS_10 = LIMIT AIRPORTS 10;

-- Mostramos 10 aeropuertos
DUMP AIRPORTS_10;

-- Hacemos lo mismo con los vuelos
FLIGHTS_10 = LIMIT FLIGHTS 10;
DUMP FLIGHTS_10;




Writing flights.pig


In [23]:
# Ejecutamos el script
! pig -x local -f flights.pig -param airports_file='airports.csv' -param flights_file='flights.csv' -param output_dir='pig/output/flights'

2024-02-13 23:12:47,084 INFO pig.ExecTypeProvider: Trying ExecType : LOCAL
2024-02-13 23:12:47,084 INFO pig.ExecTypeProvider: Picked LOCAL as the ExecType
2024-02-13 23:12:47,129 [main] INFO  org.apache.pig.Main - Apache Pig version 0.17.0 (r1797386) compiled Jun 02 2017, 15:41:58
2024-02-13 23:12:47,129 [main] INFO  org.apache.pig.Main - Logging error messages to: /media/notebooks/Tarea 3/notebooks/pig_1707862367128.log
2024-02-13 23:12:47,141 [main] INFO  org.apache.hadoop.conf.Configuration.deprecation - user.name is deprecated. Instead, use mapreduce.job.user.name
2024-02-13 23:12:47,237 [main] INFO  org.apache.pig.impl.util.Utils - Default bootup file /root/.pigbootup not found
2024-02-13 23:12:47,277 [main] INFO  org.apache.hadoop.conf.Configuration.deprecation - mapred.job.tracker is deprecated. Instead, use mapreduce.jobtracker.address
2024-02-13 23:12:47,278 [main] INFO  org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to hadoop file system at: file:

## 2.- Con una consulta de HQL muestra: La cinco compañías que más vuelos retrasados tienen.

* El campo `carrier` contiene la compañía aérea.
* Vamos a considerar que un vuelo llega con retraso cuando el vuelo llega más de 15 minutos tarde (campo `arrdelay` > 15).

Se espera el siguiente resultado:

![solución 2](./img/2.png)

In [3]:
! beeline -u "jdbc:hive2://localhost:10000/bda03" -e "SELECT carrier, COUNT(*) as total_retrasos \
FROM flights WHERE arrdelay > 15 \
GROUP BY carrier \
ORDER BY total_retrasos \
DESC LIMIT 5;"

Connecting to jdbc:hive2://localhost:10000/bda03
Connected to: Apache Hive (version 3.1.2)
Driver: Hive JDBC (version 3.1.2)
Transaction isolation: TRANSACTION_REPEATABLE_READ
INFO  : Compiling command(queryId=root_20240217003157_98fc1f4f-6e33-4680-8806-fb8ee8933661): SELECT carrier, COUNT(*) as total_retrasos FROM flights WHERE arrdelay > 15 GROUP BY carrier ORDER BY total_retrasos DESC LIMIT 5
INFO  : Concurrency mode is disabled, not creating a lock manager
INFO  : Semantic Analysis Completed (retrial = false)
INFO  : Returning Hive schema: Schema(fieldSchemas:[FieldSchema(name:carrier, type:string, comment:null), FieldSchema(name:total_retrasos, type:bigint, comment:null)], properties:null)
INFO  : Completed compiling command(queryId=root_20240217003157_98fc1f4f-6e33-4680-8806-fb8ee8933661); Time taken: 3.392 seconds
INFO  : Concurrency mode is disabled, not creating a lock manager
INFO  : Executing command(queryId=root_20240217003157_98fc1f4f-6e33-4680-8806-fb8ee8933661): SELECT c

## 3.- Con una consulta de HQL muestra: Las 5 compañías que mejor recuperación de tiempo en vuelo tienen.

* Se considera que se ha recuperado el tiempo de un vuelo cuando habiendo salido con retraso (`depdelay` > 15), llega sin retraso (`arraydelay` <= 15).
* Se trata de que muestres las 5 compañías que han recuperado el tiempo en un mayor porcentaje de vuelos que salieron retrasados.

El resultado esperado es el siguiente:

![solución 3](./img/3.png)

In [47]:
! beeline -u "jdbc:hive2://localhost:10000/bda03" -e "SELECT carrier, \
       CAST(COUNT(CASE WHEN depdelay > 15 AND arrdelay <= 15 THEN 1 ELSE NULL END) AS FLOAT) \
       / CAST(COUNT(CASE WHEN depdelay > 15 THEN 1 ELSE NULL END) AS FLOAT) AS porcentaje_recuperacion \
FROM flights \
WHERE depdelay > 15 \
GROUP BY carrier \
HAVING COUNT(CASE WHEN depdelay > 15 THEN 1 ELSE NULL END) > 0 \
ORDER BY porcentaje_recuperacion DESC \
LIMIT 5";


Connecting to jdbc:hive2://localhost:10000/bda03
Connected to: Apache Hive (version 3.1.2)
Driver: Hive JDBC (version 3.1.2)
Transaction isolation: TRANSACTION_REPEATABLE_READ
INFO  : Compiling command(queryId=root_20240217013255_e77c7792-c4be-4646-9638-89f196160ec8): SELECT carrier,         CAST(COUNT(CASE WHEN depdelay > 15 AND arrdelay <= 15 THEN 1 ELSE NULL END) AS FLOAT)         / CAST(COUNT(CASE WHEN depdelay > 15 THEN 1 ELSE NULL END) AS FLOAT) AS porcentaje_recuperacion  FROM flights  WHERE depdelay > 15  GROUP BY carrier  HAVING COUNT(CASE WHEN depdelay > 15 THEN 1 ELSE NULL END) > 0  ORDER BY porcentaje_recuperacion DESC  LIMIT 5
INFO  : Concurrency mode is disabled, not creating a lock manager
INFO  : Semantic Analysis Completed (retrial = false)
INFO  : Returning Hive schema: Schema(fieldSchemas:[FieldSchema(name:carrier, type:string, comment:null), FieldSchema(name:porcentaje_recuperacion, type:double, comment:null)], properties:null)
INFO  : Completed compiling command(qu

## 4.- Resuelve el ejercicio 2 con Pig Latin

El resultado esperado es:

![solución 4](./img/4.png)

In [13]:
%%writefile ejercicio2.pig

REGISTER piggybank.jar

FLIGHTS = LOAD '$flights_file' USING org.apache.pig.piggybank.storage.CSVExcelStorage(',', 'NO_MULTILINE', 'UNIX', 'SKIP_INPUT_HEADER')
    AS (dayofmonth:int, dayofweek:int, carrier:chararray, flightnum:int, depdelay:int, arrdelay:int);

-- Filtrar los vuelos que tienen un retraso mayor a 15 minutos
delayed_flights = FILTER FLIGHTS BY arrdelay > 15;

-- Agrupar los vuelos retrasados por aerolínea
grouped_flights = GROUP delayed_flights BY carrier;

-- Contar los vuelos retrasados por aerolínea
counted_flights = FOREACH grouped_flights GENERATE group AS carrier, COUNT(delayed_flights) AS delayed_flights;

-- Ordenar las aerolíneas por la cantidad de vuelos retrasados de mayor a menor
sorted_flights = ORDER counted_flights BY delayed_flights DESC;

-- Limitar los resultados a las cinco principales aerolíneas
top_five_carriers = LIMIT sorted_flights 5;

-- Almacenar o mostrar los resultados
DUMP top_five_carriers;

Overwriting ejercicio2.pig


In [14]:
! pig -x local -f ejercicio2.pig -param flights_file='flights.csv' -param output_dir='pig/output/flights'

2024-02-18 16:23:03,912 INFO pig.ExecTypeProvider: Trying ExecType : LOCAL
2024-02-18 16:23:03,912 INFO pig.ExecTypeProvider: Picked LOCAL as the ExecType
2024-02-18 16:23:03,958 [main] INFO  org.apache.pig.Main - Apache Pig version 0.17.0 (r1797386) compiled Jun 02 2017, 15:41:58
2024-02-18 16:23:03,958 [main] INFO  org.apache.pig.Main - Logging error messages to: /media/notebooks/Tarea 3/notebooks/pig_1708269783955.log
2024-02-18 16:23:03,976 [main] INFO  org.apache.hadoop.conf.Configuration.deprecation - user.name is deprecated. Instead, use mapreduce.job.user.name
2024-02-18 16:23:04,114 [main] INFO  org.apache.pig.impl.util.Utils - Default bootup file /root/.pigbootup not found
2024-02-18 16:23:04,156 [main] INFO  org.apache.hadoop.conf.Configuration.deprecation - mapred.job.tracker is deprecated. Instead, use mapreduce.jobtracker.address
2024-02-18 16:23:04,157 [main] INFO  org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to hadoop file system at: file:

2024-02-18 16:23:05,732 [Thread-6] INFO  org.apache.hadoop.mapred.LocalJobRunner - Waiting for map tasks
2024-02-18 16:23:05,735 [LocalJobRunner Map Task Executor #0] INFO  org.apache.hadoop.mapred.LocalJobRunner - Starting task: attempt_local1221214718_0001_m_000000_0
2024-02-18 16:23:05,776 [LocalJobRunner Map Task Executor #0] INFO  org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter - File Output Committer Algorithm version is 2
2024-02-18 16:23:05,776 [LocalJobRunner Map Task Executor #0] INFO  org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter - FileOutputCommitter skip cleanup _temporary folders under output directory:false, ignore cleanup failures: false
2024-02-18 16:23:05,800 [LocalJobRunner Map Task Executor #0] INFO  org.apache.hadoop.mapred.Task -  Using ResourceCalculatorProcessTree : [ ]
2024-02-18 16:23:05,804 [LocalJobRunner Map Task Executor #0] INFO  org.apache.hadoop.mapred.MapTask - Processing split: Number of splits :1
Total Length = 33554432
Input 

2024-02-18 16:23:09,564 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 12% complete
2024-02-18 16:23:09,584 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Running jobs are [job_local1221214718_0001]
2024-02-18 16:23:11,234 [LocalJobRunner Map Task Executor #0] INFO  org.apache.hadoop.mapred.LocalJobRunner - 
2024-02-18 16:23:11,237 [LocalJobRunner Map Task Executor #0] INFO  org.apache.hadoop.mapred.MapTask - Starting flush of map output
2024-02-18 16:23:11,237 [LocalJobRunner Map Task Executor #0] INFO  org.apache.hadoop.mapred.MapTask - Spilling map output
2024-02-18 16:23:11,237 [LocalJobRunner Map Task Executor #0] INFO  org.apache.hadoop.mapred.MapTask - bufstart = 0; bufend = 2360410; bufvoid = 104857600
2024-02-18 16:23:11,237 [LocalJobRunner Map Task Executor #0] INFO  org.apache.hadoop.mapred.MapTask - kvstart = 26214396(104857584); kvend = 25270236(101080944); length = 944161/6553600

2024-02-18 16:23:11,714 [pool-4-thread-1] INFO  org.apache.hadoop.mapreduce.task.reduce.MergeManagerImpl - MergerManager: memoryLimit=652528832, maxSingleShuffleLimit=163132208, mergeThreshold=430669056, ioSortFactor=10, memToMemMergeOutputsThreshold=10
2024-02-18 16:23:11,716 [EventFetcher for fetching Map Completion Events] INFO  org.apache.hadoop.mapreduce.task.reduce.EventFetcher - attempt_local1221214718_0001_r_000000_0 Thread started: EventFetcher for fetching Map Completion Events
2024-02-18 16:23:11,742 [localfetcher#1] INFO  org.apache.hadoop.mapreduce.task.reduce.LocalFetcher - localfetcher#1 about to shuffle output of map attempt_local1221214718_0001_m_000001_0 decomp: 228 len: 232 to MEMORY
2024-02-18 16:23:11,760 [localfetcher#1] INFO  org.apache.hadoop.mapreduce.task.reduce.InMemoryMapOutput - Read 228 bytes from map-output for attempt_local1221214718_0001_m_000001_0
2024-02-18 16:23:11,768 [localfetcher#1] INFO  org.apache.hadoop.mapreduce.task.reduce.MergeManagerImpl - 

2024-02-18 16:23:12,218 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Setting Parallelism to 1
2024-02-18 16:23:12,220 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Setting up single store job
2024-02-18 16:23:12,260 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 1 map-reduce job(s) waiting for submission.
2024-02-18 16:23:12,300 [JobControl] WARN  org.apache.hadoop.metrics2.impl.MetricsSystemImpl - JobTracker metrics system already initialized!
2024-02-18 16:23:12,306 [JobControl] WARN  org.apache.hadoop.mapreduce.JobResourceUploader - No job jar file set.  User classes may not be found. See Job or Job#setJar(String).
2024-02-18 16:23:12,313 [JobControl] INFO  org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input files to process : 1
2024-02-18 16:23:12,313 [JobControl] INFO  org.apache.pig.backend.hadoop.executionengine.util.Ma

2024-02-18 16:23:12,604 [EventFetcher for fetching Map Completion Events] INFO  org.apache.hadoop.mapreduce.task.reduce.EventFetcher - attempt_local1153469595_0002_r_000000_0 Thread started: EventFetcher for fetching Map Completion Events
2024-02-18 16:23:12,635 [localfetcher#2] INFO  org.apache.hadoop.mapreduce.task.reduce.LocalFetcher - localfetcher#2 about to shuffle output of map attempt_local1153469595_0002_m_000000_0 decomp: 342 len: 346 to MEMORY
2024-02-18 16:23:12,668 [localfetcher#2] INFO  org.apache.hadoop.mapreduce.task.reduce.InMemoryMapOutput - Read 342 bytes from map-output for attempt_local1153469595_0002_m_000000_0
2024-02-18 16:23:12,681 [localfetcher#2] INFO  org.apache.hadoop.mapreduce.task.reduce.MergeManagerImpl - closeInMemoryFile -> map-output of size: 342, inMemoryMapOutputs.size() -> 1, commitMemory -> 0, usedMemory ->342
2024-02-18 16:23:12,685 [EventFetcher for fetching Map Completion Events] INFO  org.apache.hadoop.mapreduce.task.reduce.EventFetcher - Event

2024-02-18 16:23:13,079 [JobControl] INFO  org.apache.hadoop.mapreduce.Job - The url to track the job: http://localhost:8080/
2024-02-18 16:23:13,096 [Thread-24] INFO  org.apache.hadoop.mapred.LocalJobRunner - OutputCommitter set in config null
2024-02-18 16:23:13,109 [Thread-24] INFO  org.apache.hadoop.conf.Configuration.deprecation - mapred.reduce.tasks is deprecated. Instead, use mapreduce.job.reduces
2024-02-18 16:23:13,109 [Thread-24] INFO  org.apache.hadoop.conf.Configuration.deprecation - mapred.job.tracker is deprecated. Instead, use mapreduce.jobtracker.address
2024-02-18 16:23:13,109 [Thread-24] INFO  org.apache.hadoop.conf.Configuration.deprecation - mapred.job.reduce.markreset.buffer.percent is deprecated. Instead, use mapreduce.reduce.markreset.buffer.percent
2024-02-18 16:23:13,109 [Thread-24] INFO  org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter - File Output Committer Algorithm version is 2
2024-02-18 16:23:13,109 [Thread-24] INFO  org.apache.hadoop.mapreduce

2024-02-18 16:23:13,344 [EventFetcher for fetching Map Completion Events] INFO  org.apache.hadoop.mapreduce.task.reduce.EventFetcher - EventFetcher is interrupted.. Returning
2024-02-18 16:23:13,368 [pool-12-thread-1] INFO  org.apache.hadoop.mapred.LocalJobRunner - 1 / 1 copied.
2024-02-18 16:23:13,369 [pool-12-thread-1] INFO  org.apache.hadoop.mapreduce.task.reduce.MergeManagerImpl - finalMerge called with 1 in-memory map-outputs and 0 on-disk map-outputs
2024-02-18 16:23:13,378 [pool-12-thread-1] INFO  org.apache.hadoop.mapred.Merger - Merging 1 sorted segments
2024-02-18 16:23:13,380 [pool-12-thread-1] INFO  org.apache.hadoop.mapred.Merger - Down to the last merge-pass, with 1 segments left of total size: 90 bytes
2024-02-18 16:23:13,381 [pool-12-thread-1] INFO  org.apache.hadoop.mapreduce.task.reduce.MergeManagerImpl - Merged 1 segments, 102 bytes to disk to satisfy reduce memory limit
2024-02-18 16:23:13,382 [pool-12-thread-1] INFO  org.apache.hadoop.mapreduce.task.reduce.MergeMan

2024-02-18 16:23:13,816 [pool-15-thread-1] INFO  org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter - File Output Committer Algorithm version is 2
2024-02-18 16:23:13,816 [pool-15-thread-1] INFO  org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter - FileOutputCommitter skip cleanup _temporary folders under output directory:false, ignore cleanup failures: false
2024-02-18 16:23:13,817 [pool-15-thread-1] INFO  org.apache.hadoop.mapred.Task -  Using ResourceCalculatorProcessTree : [ ]
2024-02-18 16:23:13,817 [pool-15-thread-1] INFO  org.apache.hadoop.mapred.ReduceTask - Using ShuffleConsumerPlugin: org.apache.hadoop.mapreduce.task.reduce.Shuffle@433ffce5
2024-02-18 16:23:13,817 [pool-15-thread-1] WARN  org.apache.hadoop.metrics2.impl.MetricsSystemImpl - JobTracker metrics system already initialized!
2024-02-18 16:23:13,820 [pool-15-thread-1] INFO  org.apache.hadoop.mapreduce.task.reduce.MergeManagerImpl - MergerManager: memoryLimit=652528832, maxSingleShuffleLimit=163132208

(WN,142850)
(AA,59007)
(UA,58758)
(DL,53687)
(EV,36161)
2024-02-18 16:23:14,078 [main] INFO  org.apache.pig.Main - Pig script completed in 10 seconds and 287 milliseconds (10287 ms)


## 5.- Resuelve el ejercicio 3 con Pig Latin

Se espera el siguiente resultado:

![solución 5](./img/5.png)

In [5]:
%%writefile ejercicio3.pig

REGISTER piggybank.jar

-- Leemos el fichero fligths.csv

FLIGHTS = LOAD '$flights_file' USING
       org.apache.pig.piggybank.storage.CSVExcelStorage(',', 'NO_MULTILINE', 'UNIX', 'SKIP_INPUT_HEADER')
       AS (dayofmonth:int, dayofweek:int, carrier:chararray, 
               depairportid:chararray, arrairportid:chararray, depdelay:int, arrdelay:int);

-- Filtramos los vuelos que salieron con retraso
DELAYED_FLIGHTS = FILTER FLIGHTS BY depdelay > 15;

-- Filtramos los vuelos que salieron con retraso pero que llegaron con 15 minutos o menos de retraso
RECOVERED_FLIGHTS = FILTER FLIGHTS BY depdelay > 15 AND arrdelay <= 15;

-- Agrupamos los vuelos retrasados por aerolínea
GROUPED_DELAYED = GROUP DELAYED_FLIGHTS BY carrier;
GROUPED_RECOVERED = GROUP RECOVERED_FLIGHTS BY carrier;
-- Contamos los vuelos retrasados y los vuelos recuperados por aerolínea
COUNT_DELAYED = FOREACH GROUPED_DELAYED GENERATE group AS carrier, COUNT(DELAYED_FLIGHTS) AS total_delayed;
COUNT_RECOVERED = FOREACH GROUPED_RECOVERED GENERATE group AS carrier, COUNT(RECOVERED_FLIGHTS) AS total_recovered;

-- Realizamos un JOIN por aerolínea para tener ambos conteos en la misma tupla
JOINED = JOIN COUNT_DELAYED BY carrier, COUNT_RECOVERED BY carrier;

-- Calculamos el porcentaje de recuperación
CALCULATED_PERCENTAGE = FOREACH JOINED GENERATE
    COUNT_DELAYED::carrier AS carrier,
    ((float)COUNT_RECOVERED::total_recovered / (float)COUNT_DELAYED::total_delayed) AS percent_recovered;

-- Ordenamos las aerolíneas por el porcentaje de recuperación de mayor a menor
ORDERED_RECOVERY = ORDER CALCULATED_PERCENTAGE BY percent_recovered DESC;

-- Limitamos a las 5 aerolíneas principales
TOP_5_RECOVERY = LIMIT ORDERED_RECOVERY 5;

-- Mostramos el resultado
DUMP TOP_5_RECOVERY;

Overwriting ejercicio3.pig


In [6]:
! pig -x local -f ejercicio3.pig -param airports_file='airports.csv' -param flights_file='flights.csv' -param output_dir='pig/output/flights'

2024-02-18 16:12:38,477 INFO pig.ExecTypeProvider: Trying ExecType : LOCAL
2024-02-18 16:12:38,477 INFO pig.ExecTypeProvider: Picked LOCAL as the ExecType
2024-02-18 16:12:38,510 [main] INFO  org.apache.pig.Main - Apache Pig version 0.17.0 (r1797386) compiled Jun 02 2017, 15:41:58
2024-02-18 16:12:38,511 [main] INFO  org.apache.pig.Main - Logging error messages to: /media/notebooks/Tarea 3/notebooks/pig_1708269158509.log
2024-02-18 16:12:38,521 [main] INFO  org.apache.hadoop.conf.Configuration.deprecation - user.name is deprecated. Instead, use mapreduce.job.user.name
2024-02-18 16:12:38,629 [main] INFO  org.apache.pig.impl.util.Utils - Default bootup file /root/.pigbootup not found
2024-02-18 16:12:38,676 [main] INFO  org.apache.hadoop.conf.Configuration.deprecation - mapred.job.tracker is deprecated. Instead, use mapreduce.jobtracker.address
2024-02-18 16:12:38,679 [main] INFO  org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to hadoop file system at: file:

2024-02-18 16:12:40,203 [Thread-6] INFO  org.apache.hadoop.mapred.LocalJobRunner - Waiting for map tasks
2024-02-18 16:12:40,203 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - HadoopJobId: job_local1824884447_0001
2024-02-18 16:12:40,203 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Processing aliases COUNT_DELAYED,COUNT_RECOVERED,DELAYED_FLIGHTS,FLIGHTS,GROUPED_DELAYED,GROUPED_RECOVERED,RECOVERED_FLIGHTS
2024-02-18 16:12:40,203 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - detailed locations: M: FLIGHTS[6,10],FLIGHTS[-1,-1],DELAYED_FLIGHTS[12,18],COUNT_DELAYED[21,16],GROUPED_DELAYED[18,18],RECOVERED_FLIGHTS[15,20],COUNT_RECOVERED[22,18],GROUPED_RECOVERED[19,20] C: COUNT_DELAYED[21,16],GROUPED_DELAYED[18,18],COUNT_RECOVERED[22,18],GROUPED_RECOVERED[19,20] R: COUNT_DELAYED[21,16],COUNT_RECOVERED[22,18]
2024-02-18 16:12:40,203 [LocalJobRunner Map 

2024-02-18 16:12:43,836 [LocalJobRunner Map Task Executor #0] INFO  org.apache.pig.impl.util.SpillableMemoryManager - Selected heap (PS Old Gen) of size 699400192 to monitor. collectionUsageThreshold = 489580128, usageThreshold = 489580128
2024-02-18 16:12:43,836 [LocalJobRunner Map Task Executor #0] WARN  org.apache.pig.data.SchemaTupleBackend - SchemaTupleBackend has already been initialized
2024-02-18 16:12:43,843 [LocalJobRunner Map Task Executor #0] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapReduce$Map - Aliases being processed per job phase (AliasName[line,offset]): M: FLIGHTS[6,10],FLIGHTS[-1,-1],DELAYED_FLIGHTS[12,18],COUNT_DELAYED[21,16],GROUPED_DELAYED[18,18],RECOVERED_FLIGHTS[15,20],COUNT_RECOVERED[22,18],GROUPED_RECOVERED[19,20] C: COUNT_DELAYED[21,16],GROUPED_DELAYED[18,18],COUNT_RECOVERED[22,18],GROUPED_RECOVERED[19,20] R: COUNT_DELAYED[21,16],COUNT_RECOVERED[22,18]
2024-02-18 16:12:46,308 [LocalJobRunner Map Task Executor #0] INFO  or

2024-02-18 16:12:47,009 [Thread-6] INFO  org.apache.hadoop.mapred.LocalJobRunner - Waiting for reduce tasks
2024-02-18 16:12:47,009 [pool-4-thread-1] INFO  org.apache.hadoop.mapred.LocalJobRunner - Starting task: attempt_local1824884447_0001_r_000000_0
2024-02-18 16:12:47,020 [pool-4-thread-1] INFO  org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter - File Output Committer Algorithm version is 2
2024-02-18 16:12:47,020 [pool-4-thread-1] INFO  org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter - FileOutputCommitter skip cleanup _temporary folders under output directory:false, ignore cleanup failures: false
2024-02-18 16:12:47,021 [pool-4-thread-1] INFO  org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter - File Output Committer Algorithm version is 2
2024-02-18 16:12:47,021 [pool-4-thread-1] INFO  org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter - FileOutputCommitter skip cleanup _temporary folders under output directory:false, ignore cleanup failures: f

2024-02-18 16:12:47,428 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 20% complete
2024-02-18 16:12:47,431 [main] WARN  org.apache.hadoop.metrics2.impl.MetricsSystemImpl - JobTracker metrics system already initialized!
2024-02-18 16:12:47,441 [main] WARN  org.apache.hadoop.metrics2.impl.MetricsSystemImpl - JobTracker metrics system already initialized!
2024-02-18 16:12:47,442 [main] INFO  org.apache.hadoop.conf.Configuration.deprecation - mapred.map.tasks is deprecated. Instead, use mapreduce.job.maps
2024-02-18 16:12:47,443 [main] WARN  org.apache.hadoop.metrics2.impl.MetricsSystemImpl - JobTracker metrics system already initialized!
2024-02-18 16:12:47,469 [main] INFO  org.apache.pig.tools.pigstats.mapreduce.MRScriptState - Pig script settings are added to the job
2024-02-18 16:12:47,470 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - mapred.job.reduce.markreset.buffer.percent is not set, s

2024-02-18 16:12:47,735 [LocalJobRunner Map Task Executor #0] INFO  org.apache.hadoop.mapred.LocalJobRunner - map
2024-02-18 16:12:47,735 [LocalJobRunner Map Task Executor #0] INFO  org.apache.hadoop.mapred.Task - Task 'attempt_local383870519_0002_m_000000_0' done.
2024-02-18 16:12:47,736 [LocalJobRunner Map Task Executor #0] INFO  org.apache.hadoop.mapred.Task - Final Counters for attempt_local383870519_0002_m_000000_0: Counters: 18
	File System Counters
		FILE: Number of bytes read=72102982
		FILE: Number of bytes written=1257842
		FILE: Number of read operations=0
		FILE: Number of large read operations=0
		FILE: Number of write operations=0
	Map-Reduce Framework
		Map input records=16
		Map output records=16
		Map output bytes=186
		Map output materialized bytes=224
		Input split bytes=377
		Combine input records=0
		Spilled Records=16
		Failed Shuffles=0
		Merged Map outputs=0
		GC time elapsed (ms)=0
		Total committed heap usage (bytes)=720371712
	MultiInputCounters
		Input recor

2024-02-18 16:12:47,929 [pool-9-thread-1] INFO  org.apache.hadoop.mapred.LocalJobRunner - 2 / 2 copied.
2024-02-18 16:12:47,929 [pool-9-thread-1] INFO  org.apache.hadoop.mapreduce.task.reduce.MergeManagerImpl - finalMerge called with 2 in-memory map-outputs and 0 on-disk map-outputs
2024-02-18 16:12:47,930 [pool-9-thread-1] INFO  org.apache.hadoop.mapred.Merger - Merging 2 sorted segments
2024-02-18 16:12:47,942 [pool-9-thread-1] INFO  org.apache.hadoop.mapred.Merger - Down to the last merge-pass, with 2 segments left of total size: 418 bytes
2024-02-18 16:12:47,944 [pool-9-thread-1] INFO  org.apache.hadoop.mapreduce.task.reduce.MergeManagerImpl - Merged 2 segments, 432 bytes to disk to satisfy reduce memory limit
2024-02-18 16:12:47,944 [pool-9-thread-1] INFO  org.apache.hadoop.mapreduce.task.reduce.MergeManagerImpl - Merging 1 files, 434 bytes from disk
2024-02-18 16:12:47,945 [pool-9-thread-1] INFO  org.apache.hadoop.mapreduce.task.reduce.MergeManagerImpl - Merging 0 segments, 0 byt

2024-02-18 16:12:48,353 [JobControl] INFO  org.apache.hadoop.mapreduce.Job - The url to track the job: http://localhost:8080/
2024-02-18 16:12:48,364 [Thread-25] INFO  org.apache.hadoop.mapred.LocalJobRunner - OutputCommitter set in config null
2024-02-18 16:12:48,367 [Thread-25] INFO  org.apache.hadoop.conf.Configuration.deprecation - mapred.reduce.tasks is deprecated. Instead, use mapreduce.job.reduces
2024-02-18 16:12:48,367 [Thread-25] INFO  org.apache.hadoop.conf.Configuration.deprecation - mapred.job.tracker is deprecated. Instead, use mapreduce.jobtracker.address
2024-02-18 16:12:48,367 [Thread-25] INFO  org.apache.hadoop.conf.Configuration.deprecation - mapred.job.reduce.markreset.buffer.percent is deprecated. Instead, use mapreduce.reduce.markreset.buffer.percent
2024-02-18 16:12:48,367 [Thread-25] INFO  org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter - File Output Committer Algorithm version is 2
2024-02-18 16:12:48,368 [Thread-25] INFO  org.apache.hadoop.mapreduce

2024-02-18 16:12:48,616 [pool-12-thread-1] INFO  org.apache.hadoop.mapreduce.task.reduce.MergeManagerImpl - Merged 1 segments, 386 bytes to disk to satisfy reduce memory limit
2024-02-18 16:12:48,617 [pool-12-thread-1] INFO  org.apache.hadoop.mapreduce.task.reduce.MergeManagerImpl - Merging 1 files, 390 bytes from disk
2024-02-18 16:12:48,617 [pool-12-thread-1] INFO  org.apache.hadoop.mapreduce.task.reduce.MergeManagerImpl - Merging 0 segments, 0 bytes from memory into reduce
2024-02-18 16:12:48,617 [pool-12-thread-1] INFO  org.apache.hadoop.mapred.Merger - Merging 1 sorted segments
2024-02-18 16:12:48,618 [pool-12-thread-1] INFO  org.apache.hadoop.mapred.Merger - Down to the last merge-pass, with 1 segments left of total size: 370 bytes
2024-02-18 16:12:48,618 [pool-12-thread-1] INFO  org.apache.hadoop.mapred.LocalJobRunner - 1 / 1 copied.
2024-02-18 16:12:48,619 [pool-12-thread-1] INFO  org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter - File Output Committer Algorithm versi

2024-02-18 16:12:49,104 [Thread-32] INFO  org.apache.hadoop.mapred.LocalJobRunner - Waiting for map tasks
2024-02-18 16:12:49,104 [LocalJobRunner Map Task Executor #0] INFO  org.apache.hadoop.mapred.LocalJobRunner - Starting task: attempt_local1946139539_0004_m_000000_0
2024-02-18 16:12:49,124 [LocalJobRunner Map Task Executor #0] INFO  org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter - File Output Committer Algorithm version is 2
2024-02-18 16:12:49,124 [LocalJobRunner Map Task Executor #0] INFO  org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter - FileOutputCommitter skip cleanup _temporary folders under output directory:false, ignore cleanup failures: false
2024-02-18 16:12:49,125 [LocalJobRunner Map Task Executor #0] INFO  org.apache.hadoop.mapred.Task -  Using ResourceCalculatorProcessTree : [ ]
2024-02-18 16:12:49,125 [LocalJobRunner Map Task Executor #0] INFO  org.apache.hadoop.mapred.MapTask - Processing split: Number of splits :1
Total Length = 224
Input spli

2024-02-18 16:12:49,270 [pool-15-thread-1] INFO  org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter - Saved output of task 'attempt_local1946139539_0004_r_000000_0' to file:/tmp/temp1004047324/tmp-1394609891
2024-02-18 16:12:49,280 [pool-15-thread-1] INFO  org.apache.hadoop.mapred.LocalJobRunner - reduce > reduce
2024-02-18 16:12:49,280 [pool-15-thread-1] INFO  org.apache.hadoop.mapred.Task - Task 'attempt_local1946139539_0004_r_000000_0' done.
2024-02-18 16:12:49,282 [pool-15-thread-1] INFO  org.apache.hadoop.mapred.Task - Final Counters for attempt_local1946139539_0004_r_000000_0: Counters: 24
	File System Counters
		FILE: Number of bytes read=72107270
		FILE: Number of bytes written=2496230
		FILE: Number of read operations=0
		FILE: Number of large read operations=0
		FILE: Number of write operations=0
	Map-Reduce Framework
		Combine input records=0
		Combine output records=0
		Reduce input groups=5
		Reduce shuffle bytes=86
		Reduce input records=5
		Reduce output records=

2024-02-18 16:12:49,696 [pool-18-thread-1] INFO  org.apache.hadoop.mapred.Task -  Using ResourceCalculatorProcessTree : [ ]
2024-02-18 16:12:49,698 [pool-18-thread-1] INFO  org.apache.hadoop.mapred.ReduceTask - Using ShuffleConsumerPlugin: org.apache.hadoop.mapreduce.task.reduce.Shuffle@50efb789
2024-02-18 16:12:49,698 [pool-18-thread-1] WARN  org.apache.hadoop.metrics2.impl.MetricsSystemImpl - JobTracker metrics system already initialized!
2024-02-18 16:12:49,700 [pool-18-thread-1] INFO  org.apache.hadoop.mapreduce.task.reduce.MergeManagerImpl - MergerManager: memoryLimit=652528832, maxSingleShuffleLimit=163132208, mergeThreshold=430669056, ioSortFactor=10, memToMemMergeOutputsThreshold=10
2024-02-18 16:12:49,704 [EventFetcher for fetching Map Completion Events] INFO  org.apache.hadoop.mapreduce.task.reduce.EventFetcher - attempt_local743114634_0005_r_000000_0 Thread started: EventFetcher for fetching Map Completion Events
2024-02-18 16:12:49,709 [localfetcher#5] INFO  org.apache.hado

2024-02-18 16:12:50,078 [main] WARN  org.apache.hadoop.metrics2.impl.MetricsSystemImpl - JobTracker metrics system already initialized!
2024-02-18 16:12:50,081 [main] WARN  org.apache.hadoop.metrics2.impl.MetricsSystemImpl - JobTracker metrics system already initialized!
2024-02-18 16:12:50,083 [main] WARN  org.apache.hadoop.metrics2.impl.MetricsSystemImpl - JobTracker metrics system already initialized!
2024-02-18 16:12:50,083 [main] WARN  org.apache.hadoop.metrics2.impl.MetricsSystemImpl - JobTracker metrics system already initialized!
2024-02-18 16:12:50,084 [main] WARN  org.apache.hadoop.metrics2.impl.MetricsSystemImpl - JobTracker metrics system already initialized!
2024-02-18 16:12:50,089 [main] WARN  org.apache.hadoop.metrics2.impl.MetricsSystemImpl - JobTracker metrics system already initialized!
2024-02-18 16:12:50,091 [main] WARN  org.apache.hadoop.metrics2.impl.MetricsSystemImpl - JobTracker metrics system already initialized!
2024-02-18 16:12:50,092 [main] WARN  org.apache.