# Mini-Assignment 1: Analysis of music preference data

This activity must be completed individually. The deliverable is the .ipynb file with all cells executed. Sections that pose explicit questions must be answered in text cells; the output of a code cell alone will not be accepted as an answer.

**Due date:** Tuesday, April 9, 2024, 09:00.

**Student name:** Sebastian Navea Aguirre

# Apache Pig

## Environment setup

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
exec(open('/content/drive/MyDrive/BigDataSw/hadoop_colab_installer.py').read())

Active services:
2545 Jps
2322 NodeManager
2165 NameNode
2426 JobHistoryServer
2095 ResourceManager
2239 DataNode



## Dataset download

In [None]:
!cp /content/drive/MyDrive/BigDataSw/lastfm-dataset-1K.tar.gz .
!tar xzf lastfm-dataset-1K.tar.gz

## Activity 0

In [None]:
# Code for copying the data into HDFS and defining the UDFs

# First, create the target directory for file 1 (userid-profile);
# -p keeps the command from failing if the directory already exists
!hdfs dfs -mkdir -p user_profile
# Create the target directory for file 2 (userid-timestamp-artid)
!hdfs dfs -mkdir -p user_timestamp

# Copy the user profile data into the new directory
!hdfs dfs -put lastfm-dataset-1K/userid-profile.tsv user_profile
# Copy the play records into the new directory
!hdfs dfs -put lastfm-dataset-1K/userid-timestamp-artid-artname-traid-traname.tsv user_timestamp

In [None]:
%%writefile my_pig_udfs.py
# UDF that detects empty fields

from pig_util import outputSchema

@outputSchema('detected:boolean')
def empty_detector( query ):
  # A missing field reaches the UDF as None
  if query is None:
    return True
  query = query.tostring() # 'array' to 'str'
  if len( query ) == 0:
    return True
  return False

Writing my_pig_udfs.py
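
The detector's logic can be checked outside Pig. The sketch below is a plain-Python stand-in, not the registered UDF: the name `is_empty` and the sample values are made up for illustration, and it accepts `bytes` or `str` where the Jython version receives an array.

```python
def is_empty(value):
    """Return True for None or a zero-length value, mirroring empty_detector."""
    if value is None:
        return True
    # Assumption: outside Jython the field arrives as bytes or str, not an array.
    if isinstance(value, bytes):
        value = value.decode("utf-8")
    return len(value) == 0


# Made-up sample fields: missing, empty, and populated.
samples = [None, "", b"", "m", b"Radiohead"]
flags = [is_empty(s) for s in samples]
print(flags)
```

Only the first three samples are flagged as empty, which is the behavior the UDF is meant to have for missing and zero-length fields.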


## Activity 1

In [None]:
%%writefile top_artists.pig

-- REGISTER loads the UDFs defined above and names the scripting engine (Jython).
REGISTER 'my_pig_udfs.py' USING jython AS my_pig_udfs;

-- data: load the play records, giving the field separator and each field's type.
data = LOAD 'user_timestamp' USING PigStorage('\t') AS (userid:chararray, ts:chararray, musicbrainz_artist_id:chararray, artist_name:chararray, musicbrainz_track_id:chararray, track_name:chararray);

-- grouped_data: group the play records by artist name.
grouped_data = GROUP data BY artist_name;
-- artist_counts: count the play records in each group.
artist_counts = FOREACH grouped_data GENERATE group AS artist_name, COUNT(data) AS count;

-- ordered_data: sort artist_counts by count, descending.
ordered_data = ORDER artist_counts BY count DESC;
-- top_artists: keep only the first 10 artists.
top_artists = LIMIT ordered_data 10;

-- Store the final result in the top_artists directory.
STORE top_artists INTO 'top_artists';


Writing top_artists.pig


In [None]:
!pig -f top_artists.pig


24/04/09 04:29:54 INFO pig.ExecTypeProvider: Trying ExecType : LOCAL
24/04/09 04:29:54 INFO pig.ExecTypeProvider: Trying ExecType : MAPREDUCE
24/04/09 04:29:54 INFO pig.ExecTypeProvider: Picked MAPREDUCE as the ExecType
2024-04-09 04:29:55,038 [main] INFO  org.apache.pig.Main - Apache Pig version 0.17.0 (r1797386) compiled Jun 02 2017, 15:41:58
2024-04-09 04:29:55,038 [main] INFO  org.apache.pig.Main - Logging error messages to: /content/pig_1712636995034.log
2024-04-09 04:29:56,267 [main] INFO  org.apache.pig.impl.util.Utils - Default bootup file /root/.pigbootup not found
2024-04-09 04:29:56,436 [main] INFO  org.apache.hadoop.conf.Configuration.deprecation - mapred.job.tracker is deprecated. Instead, use mapreduce.jobtracker.address
2024-04-09 04:29:56,437 [main] INFO  org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to hadoop file system at: hdfs://localhost:9000
2024-04-09 04:29:57,573 [main] INFO  org.apache.pig.PigServer - Pig Script ID for the session:

In [None]:
!hdfs dfs -ls

Found 3 items
drwxr-xr-x   - root supergroup          0 2024-04-09 04:42 top_artists
drwxr-xr-x   - root supergroup          0 2024-04-09 04:28 user_profile
drwxr-xr-x   - root supergroup          0 2024-04-09 04:28 user_timestamp


In [None]:
!hdfs dfs -ls top_artists

Found 2 items
-rw-r--r--   1 root supergroup          0 2024-04-09 04:42 top_artists/_SUCCESS
-rw-r--r--   1 root supergroup        180 2024-04-09 04:42 top_artists/part-r-00000


In [None]:
!hdfs dfs -cat top_artists/part-r-00000

Radiohead	115209
The Beatles	100338
Nine Inch Nails	84421
Muse	63351
Coldplay	62251
Depeche Mode	59910
Pink Floyd	58561
Death Cab For Cutie	58083
Placebo	53543
Elliott Smith	50278


Answer the following question:

Which are the most popular and the least popular artists in the top-10 ranking?

The most popular artist is Radiohead, with 115209 plays, and the least popular is Elliott Smith, with 50278 plays.
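
The group, count, order, and limit steps of the script can be illustrated in a few lines of plain Python. This is only a sketch of the same idea on made-up records; `top_artists` is a hypothetical helper and the artist names are placeholders, not values from the Last.fm data.

```python
from collections import Counter

# Made-up play records (userid, artist_name); not taken from the dataset.
plays = [
    ("user_a", "Artist X"),
    ("user_b", "Artist X"),
    ("user_a", "Artist Y"),
    ("user_c", "Artist X"),
    ("user_c", "Artist Z"),
    ("user_b", "Artist Y"),
]


def top_artists(records, n):
    """Count play records per artist and return the n most played."""
    counts = Counter(artist for _, artist in records)
    # most_common sorts by count, descending, and truncates to n entries.
    return counts.most_common(n)


ranking = top_artists(plays, 2)
print(ranking)
```

As in the Pig script, each record counts once, so a user who plays the same artist many times contributes many plays.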


## Activity 2

In [None]:
%%writefile gender_counts.pig

-- REGISTER loads the UDFs defined above and names the scripting engine (Jython).
REGISTER 'my_pig_udfs.py' USING jython AS my_pig_udfs;

-- user_data: load the user profiles, giving the field separator and each field's type.
user_data = LOAD 'user_profile' USING PigStorage('\t') AS (userid:chararray, gender:chararray, age:int, country:chararray, signup:chararray);

-- data: load the play records.
data = LOAD 'user_timestamp' USING PigStorage('\t') AS (userid:chararray, ts:chararray, musicbrainz_artist_id:chararray, artist_name:chararray, musicbrainz_track_id:chararray, track_name:chararray);

-- joined_data: join the play records with the user profiles on userid.
joined_data = JOIN data BY userid, user_data BY userid;

-- filtered_data: keep only the plays of the most popular artist.
filtered_data = FILTER joined_data BY artist_name == 'Radiohead';

-- grouped_data: group the filtered plays by gender.
grouped_data = GROUP filtered_data BY gender;
-- gender_counts: count the play records in each gender group.
gender_counts = FOREACH grouped_data GENERATE group AS gender, COUNT(filtered_data) AS count;

-- Store the final result in the gender_counts directory.
STORE gender_counts INTO 'gender_counts';




Writing gender_counts.pig


In [None]:
!pig -f gender_counts.pig


24/04/09 04:43:12 INFO pig.ExecTypeProvider: Trying ExecType : LOCAL
24/04/09 04:43:12 INFO pig.ExecTypeProvider: Trying ExecType : MAPREDUCE
24/04/09 04:43:12 INFO pig.ExecTypeProvider: Picked MAPREDUCE as the ExecType
2024-04-09 04:43:12,448 [main] INFO  org.apache.pig.Main - Apache Pig version 0.17.0 (r1797386) compiled Jun 02 2017, 15:41:58
2024-04-09 04:43:12,448 [main] INFO  org.apache.pig.Main - Logging error messages to: /content/pig_1712637792441.log
2024-04-09 04:43:13,813 [main] INFO  org.apache.pig.impl.util.Utils - Default bootup file /root/.pigbootup not found
2024-04-09 04:43:14,036 [main] INFO  org.apache.hadoop.conf.Configuration.deprecation - mapred.job.tracker is deprecated. Instead, use mapreduce.jobtracker.address
2024-04-09 04:43:14,042 [main] INFO  org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to hadoop file system at: hdfs://localhost:9000
2024-04-09 04:43:15,169 [main] INFO  org.apache.pig.PigServer - Pig Script ID for the session:

In [None]:
!hdfs dfs -ls

Found 4 items
drwxr-xr-x   - root supergroup          0 2024-04-09 04:50 gender_counts
drwxr-xr-x   - root supergroup          0 2024-04-09 04:42 top_artists
drwxr-xr-x   - root supergroup          0 2024-04-09 04:28 user_profile
drwxr-xr-x   - root supergroup          0 2024-04-09 04:28 user_timestamp


In [None]:
!hdfs dfs -ls gender_counts

Found 2 items
-rw-r--r--   1 root supergroup          0 2024-04-09 04:50 gender_counts/_SUCCESS
-rw-r--r--   1 root supergroup         22 2024-04-09 04:50 gender_counts/part-r-00000


In [None]:
!hdfs dfs -cat gender_counts/part-r-00000

f	43748
m	63784
	7677


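Answer the following question:

How many men and women have listened to a song by the most popular artist?

The script counts play records rather than distinct users, so the figures are Radiohead plays by gender: 63784 plays come from male users and 43748 from female users. A further 7677 plays come from users with no gender recorded. The three rows add up to 115209, the Radiohead total from Activity 1. Obtaining the number of distinct listeners per gender would require deduplicating by userid before counting.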
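
The difference between counting plays and counting listeners can be seen on a small example. The following is a minimal Python sketch on made-up rows, with an empty string standing in for a missing gender; `plays_and_listeners` is a hypothetical helper, not part of the assignment code.

```python
from collections import defaultdict

# Made-up profiles (userid -> gender); "" stands in for a missing gender.
profiles = {"user_a": "f", "user_b": "m", "user_c": "m", "user_d": ""}

# Made-up play records (userid, artist_name).
plays = [
    ("user_a", "Artist X"),
    ("user_a", "Artist X"),
    ("user_b", "Artist X"),
    ("user_c", "Artist Y"),
    ("user_d", "Artist X"),
    ("user_b", "Artist X"),
    ("user_b", "Artist X"),
]


def plays_and_listeners(plays, profiles, artist):
    """Return (plays per gender, distinct listeners per gender) for one artist."""
    play_counts = defaultdict(int)
    listeners = defaultdict(set)
    for userid, artist_name in plays:
        # Inner join on userid plus the filter on the artist name.
        if artist_name != artist or userid not in profiles:
            continue
        gender = profiles[userid]
        play_counts[gender] += 1        # one per play record
        listeners[gender].add(userid)   # a set keeps each user once
    return dict(play_counts), {g: len(u) for g, u in listeners.items()}


play_counts, listener_counts = plays_and_listeners(plays, profiles, "Artist X")
print(play_counts)
print(listener_counts)
```

Here the male group has three plays but a single listener, which is the gap between what `COUNT(filtered_data)` reports and what the question asks for.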

## Activity 3

In [None]:
%%writefile age_counts.pig

-- REGISTER loads the UDFs defined above and names the scripting engine (Jython).
REGISTER 'my_pig_udfs.py' USING jython AS my_pig_udfs;

-- user_data: load the user profiles, giving the field separator and each field's type.
user_data = LOAD 'user_profile' USING PigStorage('\t') AS (userid:chararray, gender:chararray, age:int, country:chararray, signup:chararray);

-- data: load the play records.
data = LOAD 'user_timestamp' USING PigStorage('\t') AS (userid:chararray, ts:chararray, musicbrainz_artist_id:chararray, artist_name:chararray, musicbrainz_track_id:chararray, track_name:chararray);

-- joined_data: join the play records with the user profiles on userid.
joined_data = JOIN data BY userid, user_data BY userid;

-- filtered_data: keep only the plays of the most popular artist by users with a known age.
filtered_data = FILTER joined_data BY artist_name == 'Radiohead' AND age IS NOT NULL;

-- grouped_data: group the filtered plays by age.
grouped_data = GROUP filtered_data BY age;

-- age_counts: count the play records in each age group.
age_counts = FOREACH grouped_data GENERATE group AS age, COUNT(filtered_data) AS count;

-- ordered_data: sort age_counts by age.
ordered_data = ORDER age_counts BY age;

-- Store the final result in the age_counts directory.
STORE ordered_data INTO 'age_counts';



Overwriting age_counts.pig


In [None]:
!pig -f age_counts.pig


24/04/09 05:01:39 INFO pig.ExecTypeProvider: Trying ExecType : LOCAL
24/04/09 05:01:39 INFO pig.ExecTypeProvider: Trying ExecType : MAPREDUCE
24/04/09 05:01:39 INFO pig.ExecTypeProvider: Picked MAPREDUCE as the ExecType
2024-04-09 05:01:39,334 [main] INFO  org.apache.pig.Main - Apache Pig version 0.17.0 (r1797386) compiled Jun 02 2017, 15:41:58
2024-04-09 05:01:39,336 [main] INFO  org.apache.pig.Main - Logging error messages to: /content/pig_1712638899332.log
2024-04-09 05:01:40,559 [main] INFO  org.apache.pig.impl.util.Utils - Default bootup file /root/.pigbootup not found
2024-04-09 05:01:40,757 [main] INFO  org.apache.hadoop.conf.Configuration.deprecation - mapred.job.tracker is deprecated. Instead, use mapreduce.jobtracker.address
2024-04-09 05:01:40,757 [main] INFO  org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to hadoop file system at: hdfs://localhost:9000
2024-04-09 05:01:42,038 [main] INFO  org.apache.pig.PigServer - Pig Script ID for the session:

In [None]:
!hdfs dfs -ls

Found 5 items
drwxr-xr-x   - root supergroup          0 2024-04-09 05:09 age_counts
drwxr-xr-x   - root supergroup          0 2024-04-09 04:50 gender_counts
drwxr-xr-x   - root supergroup          0 2024-04-09 04:42 top_artists
drwxr-xr-x   - root supergroup          0 2024-04-09 04:28 user_profile
drwxr-xr-x   - root supergroup          0 2024-04-09 04:28 user_timestamp


In [None]:
!hdfs dfs -ls age_counts

Found 2 items
-rw-r--r--   1 root supergroup          0 2024-04-09 05:09 age_counts/_SUCCESS
-rw-r--r--   1 root supergroup        216 2024-04-09 05:09 age_counts/part-r-00000


In [None]:
!hdfs dfs -cat age_counts/part-r-00000

3	2
4	14
7	2
15	2
16	122
17	444
18	65
19	4514
20	5760
21	1589
22	7038
23	2552
24	1313
25	1289
26	2527
27	2360
28	1330
29	359
30	617
31	146
32	433
33	1441
34	404
35	1543
36	1962
38	422
39	213
42	13
48	35
75	109
103	1


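Answer the following question:

How many 35-year-old users have listened to the most popular artist?

The output row for age 35 shows 1543. Because the script counts play records, this is the number of Radiohead plays by 35-year-old users, not the number of distinct users; counting users would require deduplicating by userid before grouping.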
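
The value for a given age can also be read off programmatically rather than by eye. This is a minimal sketch that assumes the contents of `age_counts/part-r-00000` have been captured as a string; `parse_counts` is a hypothetical helper, and the sample holds three rows copied from the output above.

```python
def parse_counts(text):
    """Parse tab-separated 'age<TAB>count' lines into a dict of ints."""
    counts = {}
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        age, count = line.split("\t")
        counts[int(age)] = int(count)
    return counts


# Three rows copied from the age_counts output shown above.
sample = "34\t404\n35\t1543\n36\t1962\n"
age_counts = parse_counts(sample)
print(age_counts[35])
```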