## Pig Review
---

## Big Data Analytics
* ##### Facultad de Minas
* ##### Universidad Nacional de Colombia
* ##### Author: Valentina Vásquez Hernandez

#### 0. How do I run the code in this workshop?

* **Step 1.** Run the docker image `jdvelasq/pig:0.17.0` [click here to see the command](https://jdvelasq.github.io/courses/analitica_de_grandes_datos/index.html)
* **Step 2.** Once inside the image, you can follow either of these options:
    > **Step 2.1.** Clone the workshop repository onto your machine and start Jupyter (Lab or Notebook) in the directory that contains this `.ipynb` notebook.
    
    > **Step 2.2.** Save the commands in a `.pig` file and then run it with the `pig -execute` command.

---

### 1. What is Apache Pig?

Apache Pig is an open-source data-flow platform for big data that analyzes data by running MapReduce programs on Hadoop. The language used to interact with the system is called `Pig Latin` [1](https://pig.apache.org/about.html).

Compared with the hand-written MapReduce functions and other implementations explored so far, `Pig` offers the following advantages:

* It has built-in commands for counting (the reduce step).
* There is no need to write the mapper and the reducer explicitly in Java or Python.
* It translates queries written in Pig Latin, a language that is easy to learn and use, into MapReduce jobs.
* It supports a wide range of data types, including nested ones. Examples are `int`, `float`, `datetime` and `chararray`.
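
To make the second and third points concrete, here is a minimal sketch of the map, shuffle and reduce steps that a frequency count needs when written by hand. The function names and the sample values are illustrative assumptions, not part of Pig or Hadoop; in Pig Latin the same result takes a `GROUP` followed by `COUNT`.

```python
from itertools import groupby
from operator import itemgetter


def mapper(records):
    """Map step: emit a (key, 1) pair for every record."""
    for record in records:
        yield (record, 1)


def reducer(pairs):
    """Reduce step: sum the counts of each key.

    Sorting first plays the role of the shuffle phase, which brings
    equal keys together before they reach the reducer.
    """
    for key, group in groupby(sorted(pairs), key=itemgetter(0)):
        yield (key, sum(count for _, count in group))


# Hypothetical sample values, used only for illustration
sample = ["gas", "diesel", "gas", "gas"]
counts = dict(reducer(mapper(sample)))
print(counts)
```

Everything except the last three lines is plumbing that Pig generates from a two-line query.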



> **Pig architecture** [2](https://forum.huawei.com/enterprise/es/apache-pig/thread/861847-100759)


<img src="pigarq.jpg" alt="drawing" width="400"/>


**Hierarchy**
> Map []
>> Bag {(a,b),(c,d)}
>>> Tuple (a,b)
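
As a loose analogy, the nesting above can be pictured with Python containers. This is only a sketch: a Pig bag is unordered, unlike a Python list, and the key name `pairs` is a made-up example.

```python
# A Pig tuple is an ordered set of fields: (a,b)
pig_tuple = ("a", "b")

# A Pig bag is a collection of tuples: {(a,b),(c,d)}
pig_bag = [("a", "b"), ("c", "d")]

# A Pig map is a set of key-value pairs with chararray keys: [key#value]
pig_map = {"pairs": pig_bag}

# Walking down the hierarchy: map -> bag -> tuple -> field
first_tuple = pig_map["pairs"][0]
first_field = first_tuple[0]
print(first_tuple, first_field)
```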


**Main commands:**

* `LOAD`, `CROSS`, `DISTINCT`, `FILTER`, `FOREACH`, `GROUP`, `LIMIT`, `ORDER BY`, `SPLIT`, `UNION`
* `AVG`, `CONCAT`, `COUNT`, `IN`, `MAX`, `MIN`, `SIZE`, `SUM`, `TOKENIZE`
* `ABS`, `CBRT`, `CEIL`, `FLOOR`, `LOG`, `LOG10`, `ROUND`, `SQRT`


*Documentation*: https://pig.apache.org/

### 2. How do I use Pig?

Pig can run in local mode, in pseudo-distributed mode, or directly against HDFS.
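
The mode is chosen with the `-x` flag of the `pig` command. The sketch below builds the command line for two execution types, `local` and `mapreduce`; the helper `pig_command` and the script name `my_script.pig` are hypothetical, and the command is only printed, not executed.

```python
import shlex


def pig_command(script, mode="local"):
    """Build the argument list for running a Pig script in a given mode.

    "local" uses the local file system and a single JVM;
    "mapreduce" sends the jobs to Hadoop and reads from HDFS.
    """
    if mode not in ("local", "mapreduce"):
        raise ValueError("unsupported mode: {}".format(mode))
    return ["pig", "-x", mode, "-execute", "run {}".format(script)]


# my_script.pig is a placeholder name
cmd = pig_command("my_script.pig")
print(shlex.join(cmd))  # shlex.join requires Python 3.8 or later
```

The resulting list can be passed to `subprocess.run` from a Python script, or the printed line can be pasted into a notebook cell after `!`.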

In [1]:
!pwd

/workspace/Desktop/Maestría Ingeniería - Analítica/2022-01_Monitoria_AnaliticaGrandesDatos/Analitica-Grandes-Datos


Next, the dataset is loaded and the table schema is defined:

In [2]:
%%writefile make_fueltype_price.pig

-- load the data from the local folder
cars_table = LOAD 'data/cars_subset.csv' USING PigStorage(',')
    AS (
            car_id:int,
            make:chararray,
            fuel_type:chararray,
            length:float,
            width:float,
            height:float,
            price:int

    );

specific_columns = FOREACH cars_table GENERATE make, fuel_type, price;
STORE specific_columns INTO 'output';

Writing make_fueltype_price.pig
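
For readers more familiar with Python, the `LOAD ... AS (...)` and `FOREACH ... GENERATE` steps can be pictured as naming the CSV columns and then keeping a subset of them. This is a rough sketch on two made-up rows, not on `data/cars_subset.csv`, and the helper `project` is hypothetical.

```python
import csv
import io

# Field names taken from the schema declared in the LOAD statement
SCHEMA = ["car_id", "make", "fuel_type", "length", "width", "height", "price"]

# Two made-up rows that follow the schema; not taken from cars_subset.csv
sample_csv = (
    "1,mazda,gas,170.0,65.0,54.0,9000\n"
    "2,toyota,diesel,175.0,66.0,55.0,11000\n"
)


def project(lines, columns):
    """Keep only the requested columns, like FOREACH ... GENERATE."""
    reader = csv.DictReader(lines, fieldnames=SCHEMA)
    for row in reader:
        yield tuple(row[name] for name in columns)


rows = list(project(io.StringIO(sample_csv), ["make", "fuel_type", "price"]))
for row in rows:
    # PigStorage writes tab-separated fields by default
    print("\t".join(row))
```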


In [3]:
!pig -x local -execute 'run make_fueltype_price.pig'

2022-06-13 21:44:21,297 [main] INFO  org.apache.hadoop.metrics.jvm.JvmMetrics - Initializing JVM Metrics with processName=JobTracker, sessionId=
2022-06-13 21:44:21,441 [JobControl] INFO  org.apache.hadoop.metrics.jvm.JvmMetrics - Cannot initialize JVM Metrics with processName=JobTracker, sessionId= - already initialized
2022-06-13 21:44:21,578 [JobControl] WARN  org.apache.hadoop.mapreduce.JobResourceUploader - No job jar file set.  User classes may not be found. See Job or Job#setJar(String).
2022-06-13 21:44:21,599 [JobControl] INFO  org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input files to process : 1
2022-06-13 21:44:21,652 [JobControl] INFO  org.apache.hadoop.mapreduce.JobSubmitter - number of splits:1
2022-06-13 21:44:21,900 [JobControl] INFO  org.apache.hadoop.mapreduce.JobSubmitter - Submitting tokens for job: job_local1976651350_0001
2022-06-13 21:44:22,002 [JobControl] INFO  org.apache.hadoop.mapreduce.Job - The url to track the job: http://localhost:8080/

In [4]:
!head output/part-m-*

gas	four	
gas	four	
gas	four	
gas	four	
gas	two	
gas	four	
gas	two	
gas	four	
gas	two	
gas	two	


Next, a frequency count is computed for one of the columns:

In [5]:
%%writefile fueltype_count.pig
-- load the data from the local folder
cars_table = LOAD 'data/cars_subset.csv' USING PigStorage(',')
    AS (
            car_id:int,
            make:chararray,
            fuel_type:chararray,
            length:float,
            width:float,
            height:float,
            price:int

    );
words = FOREACH cars_table GENERATE FLATTEN(TOKENIZE(fuel_type)) AS word;
grouped = GROUP words BY word;
wordcount = FOREACH grouped GENERATE group, COUNT(words);
STORE wordcount INTO 'output_wordcount';

Writing fueltype_count.pig


In [6]:
!pig -x local -execute 'run fueltype_count.pig'

2022-06-13 21:44:34,079 [main] INFO  org.apache.hadoop.metrics.jvm.JvmMetrics - Initializing JVM Metrics with processName=JobTracker, sessionId=
2022-06-13 21:44:34,273 [JobControl] INFO  org.apache.hadoop.metrics.jvm.JvmMetrics - Cannot initialize JVM Metrics with processName=JobTracker, sessionId= - already initialized
2022-06-13 21:44:34,371 [JobControl] WARN  org.apache.hadoop.mapreduce.JobResourceUploader - No job jar file set.  User classes may not be found. See Job or Job#setJar(String).
2022-06-13 21:44:34,403 [JobControl] INFO  org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input files to process : 1
2022-06-13 21:44:34,461 [JobControl] INFO  org.apache.hadoop.mapreduce.JobSubmitter - number of splits:1
2022-06-13 21:44:34,697 [JobControl] INFO  org.apache.hadoop.mapreduce.JobSubmitter - Submitting tokens for job: job_local1422260926_0001
2022-06-13 21:44:34,801 [JobControl] INFO  org.apache.hadoop.mapreduce.Job - The url to track the job: http://localhost:8080/

In [7]:
!head output_wordcount/part-r-* 

two	64
four	95
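
`STORE` writes a directory of part files rather than a single file: `part-m-*` when the job only has a map phase and `part-r-*` when it ends with a reduce phase, as this count does. A minimal sketch for reading such a directory back into Python follows; the helper `read_counts` is hypothetical and the demonstration uses a temporary directory instead of the real output.

```python
import glob
import os
import tempfile


def read_counts(output_dir):
    """Read tab-separated (key, count) lines from every part file."""
    counts = {}
    for path in sorted(glob.glob(os.path.join(output_dir, "part-*"))):
        with open(path) as f:
            for line in f:
                key, value = line.rstrip("\n").split("\t")
                counts[key] = int(value)
    return counts


# Demonstration on a temporary directory holding the two lines shown above
with tempfile.TemporaryDirectory() as tmp:
    with open(os.path.join(tmp, "part-r-00000"), "w") as f:
        f.write("two\t64\nfour\t95\n")
    counts = read_counts(tmp)

print(counts)
```

With the real job output, the call would be `read_counts("output_wordcount")`.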


Next, a report is generated for each make and stored in a separate output:

In [8]:
%%writefile make_report_template.pig
-- load the data from the local folder
cars_table = LOAD 'data/cars_subset.csv' USING PigStorage(',')
    AS (
            car_id:int,
            make:chararray,
            fuel_type:chararray,
            length:float,
            width:float,
            height:float,
            price:int

    );
    
filtered = FILTER cars_table BY (make MATCHES {}{}{});
STORE filtered INTO {}{}{};

Writing make_report_template.pig
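
The `MATCHES` operator takes a Java regular expression that has to match the whole field, not just part of it. Python's `re.fullmatch` behaves in a similar way, so it can serve as a rough stand-in to try patterns before putting them in the template. The helper name and the value `"mazda rx"` are made up for illustration, and Java and Python regex syntax differ in some details.

```python
import re


def pig_matches(value, pattern):
    """Approximate Pig's MATCHES: the pattern must cover the whole string."""
    return re.fullmatch(pattern, value) is not None


print(pig_matches("mazda", "mazda"))        # literal pattern, same string
print(pig_matches("mazda rx", "mazda"))     # extra text after the pattern
print(pig_matches("mazda rx", "mazda.*"))   # wildcard covers the extra text
```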


In [2]:
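def make_pig_file(make):
    # Read the template; its six {} placeholders are filled positionally
    with open("make_report_template.pig", "r") as f:
        template = f.read()
    # Wrap the make (for MATCHES) and the output directory (for STORE) in quotes
    pig_query = template.format("'", str(make), "'", "'", "output_" + str(make), "'")
    pig_file_path = "make_report_{}.pig".format(make)
    # The with block closes the file, so no explicit close() is needed
    with open(pig_file_path, "w") as f:
        f.write(pig_query)
    return pig_file_path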

In [3]:
make_pig_file("mazda")

'make_report_mazda.pig'

In [12]:
make_pig_file("toyota")

'make_report_toyota.pig'
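
When many makes are needed, the same idea can be written with named placeholders and a loop. This is a sketch of one possible variant, not the template used above: `TEMPLATE` only contains the two lines that change, and `render_reports` is a hypothetical helper that returns the scripts instead of writing them to disk.

```python
# Hypothetical variant of the template body with named placeholders;
# the quotes live in the template instead of being passed as arguments
TEMPLATE = (
    "filtered = FILTER cars_table BY (make MATCHES '{make}');\n"
    "STORE filtered INTO '{output}';\n"
)


def render_reports(makes):
    """Return a dict mapping each script name to its rendered Pig code."""
    scripts = {}
    for make in makes:
        name = "make_report_{}.pig".format(make)
        scripts[name] = TEMPLATE.format(make=make, output="output_" + make)
    return scripts


scripts = render_reports(["mazda", "toyota"])
for name, body in scripts.items():
    print(name)
    print(body)
```

Named placeholders make it harder to pass the arguments in the wrong order, which is easy to do with six positional `{}`.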

In [13]:
!pig -x local -execute 'run make_report_toyota.pig'

2022-06-13 21:45:17,752 [main] INFO  org.apache.hadoop.metrics.jvm.JvmMetrics - Initializing JVM Metrics with processName=JobTracker, sessionId=
2022-06-13 21:45:17,877 [JobControl] INFO  org.apache.hadoop.metrics.jvm.JvmMetrics - Cannot initialize JVM Metrics with processName=JobTracker, sessionId= - already initialized
2022-06-13 21:45:18,033 [JobControl] WARN  org.apache.hadoop.mapreduce.JobResourceUploader - No job jar file set.  User classes may not be found. See Job or Job#setJar(String).
2022-06-13 21:45:18,056 [JobControl] INFO  org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input files to process : 1
2022-06-13 21:45:18,103 [JobControl] INFO  org.apache.hadoop.mapreduce.JobSubmitter - number of splits:1
2022-06-13 21:45:18,347 [JobControl] INFO  org.apache.hadoop.mapreduce.JobSubmitter - Submitting tokens for job: job_local1636098767_0001
2022-06-13 21:45:18,463 [JobControl] INFO  org.apache.hadoop.mapreduce.Job - The url to track the job: http://localhost:8080/

In [14]:
!pig -x local -execute 'run make_report_mazda.pig'

2022-06-13 21:45:25,286 [main] INFO  org.apache.hadoop.metrics.jvm.JvmMetrics - Initializing JVM Metrics with processName=JobTracker, sessionId=
2022-06-13 21:45:25,419 [JobControl] INFO  org.apache.hadoop.metrics.jvm.JvmMetrics - Cannot initialize JVM Metrics with processName=JobTracker, sessionId= - already initialized
2022-06-13 21:45:25,560 [JobControl] WARN  org.apache.hadoop.mapreduce.JobResourceUploader - No job jar file set.  User classes may not be found. See Job or Job#setJar(String).
2022-06-13 21:45:25,580 [JobControl] INFO  org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input files to process : 1
2022-06-13 21:45:25,724 [JobControl] INFO  org.apache.hadoop.mapreduce.JobSubmitter - number of splits:1
2022-06-13 21:45:26,029 [JobControl] INFO  org.apache.hadoop.mapreduce.JobSubmitter - Submitting tokens for job: job_local1717844432_0001
2022-06-13 21:45:26,138 [JobControl] INFO  org.apache.hadoop.mapreduce.Job - The url to track the job: http://localhost:8080/

In [4]:
!cat output_mazda/part-m-* 

	gas	four	176.6	66.2	13950.0	
	gas	four	176.6	66.4	17450.0	
	gas	four	192.7	71.4	17710.0	
	gas	four	192.7	71.4	23875.0	
	gas	two	176.8	64.8	16430.0	
	gas	four	176.8	64.8	16925.0	
	gas	two	176.8	64.8	20970.0	
	gas	four	176.8	64.8	21105.0	
	gas	two	141.1	60.3	5151.0	
	gas	two	155.9	63.6	6295.0	
	gas	four	158.8	63.6	6575.0	
	gas	two	157.3	63.8	5572.0	
	gas	two	157.3	63.8	6377.0	
	gas	two	157.3	63.8	7957.0	
	gas	four	157.3	63.8	6229.0	
	gas	four	157.3	63.8	6692.0	
	gas	four	157.3	63.8	7609.0	
	gas	four	174.6	64.6	8921.0	
	gas	two	173.2	66.3	12964.0	
	gas	two	144.6	63.9	6479.0	
	gas	two	144.6	63.9	6855.0	
	gas	two	150.0	64.0	5399.0	
	gas	two	150.0	64.0	6529.0	
	gas	two	150.0	64.0	7129.0	
	gas	four	163.4	64.0	7295.0	
	gas	four	157.1	63.9	7295.0	
	gas	two	167.5	65.2	7895.0	
	gas	two	167.5	65.2	9095.0	
	gas	four	175.4	65.2	8845.0	
	gas	four	175.4	62.5	10295.0	
	gas	four	175.4	65.2	12945.0	
	gas	two	169.1	66.0	10345.0	
	gas	four	199.6	69.6	32250.0	
	gas	two	159.1	64.2	5195.0	
	gas	two	159.1	64.

In [16]:
!head output_toyota/part-m-* 

	gas	four	176.6	66.2	13950.0	
	gas	four	176.6	66.4	17450.0	
	gas	four	192.7	71.4	17710.0	
	gas	four	192.7	71.4	23875.0	
	gas	two	176.8	64.8	16430.0	
	gas	four	176.8	64.8	16925.0	
	gas	two	176.8	64.8	20970.0	
	gas	four	176.8	64.8	21105.0	
	gas	two	141.1	60.3	5151.0	
	gas	two	155.9	63.6	6295.0	


---