# Project - MapReduce



Analysis of large datasets is being performed at
an unprecedented frequency. Several technologies have been
developed to do so, offering a variety of solutions and drawbacks
related to the processing of different data types and
data processing requirements. 

This notebook implements MapReduce in order to solve a series of questions by using a data set regarding air polution in the USA.
In the report, we compared the performance
of five different technologies – MapReduce, Spark RDD,
SparkDF, Spark SQL and Hive.

# Q.1) Which states have more/less monitors?

## First map-reduce for exercise 1

### Mapper


In [1]:
%%file mapper_q1.py
#!/usr/bin/env python
import sys 
import string

x = input("")
for line in sys.stdin:
    line = line.strip()
    words = line.split(",")
    machine = words[5]+words[6] 
    
    state = words[24] #25 state name

    print(machine + "\t" + state)

Overwriting mapper_q1.py


### Reducer

In [2]:
%%file reducer_q1.py
#!/usr/bin/python
import sys 

last_machine = None

for line in sys.stdin:
    line = line.strip()
    words = line.split("\t")
    machine = words[0] 
    state = words[1]
    if machine != last_machine:
        if last_machine:
            print(state + "\t" + "1")
        last_machine = machine
            
if last_machine:
    print(state + "\t" + "1")

Overwriting reducer_q1.py


### Hadoop standalone mode execution - Clear the output directory

In [3]:
rm -rf results_q1

### Submitting the job

In [4]:
!hadoop jar /opt/hadoop-3.2.0/share/hadoop/tools/lib/hadoop-*streaming*.jar -files mapper_q1.py,reducer_q1.py -mapper mapper_q1.py -reducer reducer_q1.py -input epa_hap_daily_summary-small.csv -output results_q1

2021-12-21 10:52:53,154 INFO impl.MetricsConfig: Loaded properties from hadoop-metrics2.properties
2021-12-21 10:52:53,212 INFO impl.MetricsSystemImpl: Scheduled Metric snapshot period at 10 second(s).
2021-12-21 10:52:53,212 INFO impl.MetricsSystemImpl: JobTracker metrics system started
2021-12-21 10:52:53,225 WARN impl.MetricsSystemImpl: JobTracker metrics system already initialized!
2021-12-21 10:52:53,372 INFO mapred.FileInputFormat: Total input files to process : 1
2021-12-21 10:52:53,383 INFO mapreduce.JobSubmitter: number of splits:4
2021-12-21 10:52:53,502 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_local497766770_0001
2021-12-21 10:52:53,502 INFO mapreduce.JobSubmitter: Executing with tokens: []
2021-12-21 10:52:53,736 INFO mapred.LocalDistributedCacheManager: Localized file:/home/jovyan/work/shared/Projeto/mapper_q1.py as file:/tmp/hadoop-jovyan/mapred/local/job_local497766770_0001_342c9ad7-5247-4383-86c2-0c7a5b1cf3f9/mapper_q1.py
2021-12-21 10:52:53,772 INFO 

2021-12-21 10:52:56,185 INFO streaming.PipeMapRed: R/W/S=1/0/0 in:NA [rec/s] out:NA [rec/s]
2021-12-21 10:52:56,186 INFO streaming.PipeMapRed: R/W/S=10/0/0 in:NA [rec/s] out:NA [rec/s]
2021-12-21 10:52:56,190 INFO streaming.PipeMapRed: R/W/S=100/0/0 in:NA [rec/s] out:NA [rec/s]
2021-12-21 10:52:56,237 INFO streaming.PipeMapRed: Records R/W=457/1
2021-12-21 10:52:56,248 INFO streaming.PipeMapRed: R/W/S=1000/757/0 in:NA [rec/s] out:NA [rec/s]
2021-12-21 10:52:56,402 INFO streaming.PipeMapRed: R/W/S=10000/9724/0 in:NA [rec/s] out:NA [rec/s]
2021-12-21 10:52:56,876 INFO mapreduce.Job:  map 100% reduce 0%
2021-12-21 10:52:57,826 INFO streaming.PipeMapRed: R/W/S=100000/99618/0 in:100000=100000/1 [rec/s] out:99618=99618/1 [rec/s]
2021-12-21 10:52:57,996 INFO streaming.PipeMapRed: MRErrorThread done
2021-12-21 10:52:57,997 INFO streaming.PipeMapRed: mapRedFinished
2021-12-21 10:52:57,997 INFO mapred.LocalJobRunner: 
2021-12-21 10:52:57,997 INFO mapred.MapTask: Starting flush of map output
2021

2021-12-21 10:53:01,125 INFO mapred.MapTask: Finished spill 0
2021-12-21 10:53:01,127 INFO mapred.Task: Task:attempt_local497766770_0001_m_000003_0 is done. And is in the process of committing
2021-12-21 10:53:01,134 INFO mapred.LocalJobRunner: Records R/W=485/1
2021-12-21 10:53:01,134 INFO mapred.Task: Task 'attempt_local497766770_0001_m_000003_0' done.
2021-12-21 10:53:01,135 INFO mapred.Task: Final Counters for attempt_local497766770_0001_m_000003_0: Counters: 17
	File System Counters
		FILE: Number of bytes read=123211762
		FILE: Number of bytes written=13146287
		FILE: Number of read operations=0
		FILE: Number of large read operations=0
		FILE: Number of write operations=0
	Map-Reduce Framework
		Map input records=73551
		Map output records=73550
		Map output bytes=2115204
		Map output materialized bytes=2262310
		Input split bytes=121
		Combine input records=0
		Spilled Records=73550
		Failed Shuffles=0
		Merged Map outputs=0
		GC time elapsed (ms)=0
		Total committed heap usage

2021-12-21 10:53:02,880 INFO mapreduce.Job:  map 100% reduce 100%
2021-12-21 10:53:02,880 INFO mapreduce.Job: Job job_local497766770_0001 completed successfully
2021-12-21 10:53:02,893 INFO mapreduce.Job: Counters: 30
	File System Counters
		FILE: Number of bytes read=473209145
		FILE: Number of bytes written=61225215
		FILE: Number of read operations=0
		FILE: Number of large read operations=0
		FILE: Number of write operations=0
	Map-Reduce Framework
		Map input records=404854
		Map output records=404850
		Map output bytes=11639993
		Map output materialized bytes=12449717
		Input split bytes=484
		Combine input records=0
		Combine output records=0
		Reduce input groups=1846
		Reduce shuffle bytes=12449717
		Reduce input records=404850
		Reduce output records=1846
		Spilled Records=809700
		Shuffled Maps =4
		Failed Shuffles=0
		Merged Map outputs=4
		GC time elapsed (ms)=20
		Total committed heap usage (bytes)=2206203904
	Shuffle Errors
		BAD_ID=0
		CONNE

### Checking the results

In [5]:
!cat results_q1/part-*

Virgin Islands	1
Virgin Islands	1
Virgin Islands	1
Virgin Islands	1
Puerto Rico	1
Virgin Islands	1
Virgin Islands	1
Puerto Rico	1
Puerto Rico	1
Puerto Rico	1
Puerto Rico	1
Puerto Rico	1
Hawaii	1
Hawaii	1
Hawaii	1
Hawaii	1
Hawaii	1
Florida	1
Florida	1
Florida	1
Florida	1
Florida	1
Florida	1
Texas	1
Florida	1
Florida	1
Florida	1
Florida	1
Texas	1
Florida	1
Florida	1
Florida	1
Florida	1
Florida	1
Florida	1
Florida	1
Texas	1
Florida	1
Florida	1
Florida	1
Texas	1
Florida	1
Texas	1
Florida	1
Florida	1
Texas	1
Texas	1
Texas	1
Florida	1
Florida	1
Florida	1
Texas	1
Texas	1
Texas	1
Texas	1
Texas	1
Texas	1
Florida	1
Florida	1
Florida	1
Florida	1
Florida	1
Florida	1
Florida	1
Florida	1
Florida	1
Florida	1
Florida	1
Florida	1
Florida	1
Texas	1
Louisiana	1
Texas	1
Texas	1
Texas	1
Texas	1
Texas	1
Texas	1
Texas	1
Texas	1
Country Of Mexico	1
Texas	1
Texas	1
Texas	1
Texas	1
Texas	1
Texas	1
Texas	1
Texas	1
Texas	1
T

## Second map-reduce for exercise 1

### Mapper

In [6]:
%%file mapper_q1_2.py
#!/usr/bin/env python
import sys  

for line in sys.stdin:
    line = line.strip()
    state, count = line.split("\t")
    print(state + "\t" + count)

Overwriting mapper_q1_2.py


### Reducer

In [7]:
%%file reducer_q1_2.py
#!/usr/bin/python
import sys 

last_state = None
count_machine = 0

for line in sys.stdin:
    line = line.strip()
    state, count = line.split("\t")
    if state != last_state:
        if last_state:
            print(last_state + "\t" + str(count_machine))
        last_state = state
        count_machine = 1
    else:
        count_machine += 1
            
if last_state:
    print(last_state + "\t" + str(count_machine))

Overwriting reducer_q1_2.py


### Hadoop standalone mode execution - Clear the output directory

In [8]:
rm -rf results_q1_2

### Submitting the job

In [9]:
!hadoop jar /opt/hadoop-3.2.0/share/hadoop/tools/lib/hadoop-*streaming*.jar -files mapper_q1_2.py,reducer_q1_2.py -mapper mapper_q1_2.py -reducer reducer_q1_2.py -input results_q1/part-*  -output results_q1_2

2021-12-21 10:53:05,145 INFO impl.MetricsConfig: Loaded properties from hadoop-metrics2.properties
2021-12-21 10:53:05,200 INFO impl.MetricsSystemImpl: Scheduled Metric snapshot period at 10 second(s).
2021-12-21 10:53:05,200 INFO impl.MetricsSystemImpl: JobTracker metrics system started
2021-12-21 10:53:05,212 WARN impl.MetricsSystemImpl: JobTracker metrics system already initialized!
2021-12-21 10:53:05,367 INFO mapred.FileInputFormat: Total input files to process : 1
2021-12-21 10:53:05,386 INFO mapreduce.JobSubmitter: number of splits:1
2021-12-21 10:53:05,503 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_local1615258921_0001
2021-12-21 10:53:05,504 INFO mapreduce.JobSubmitter: Executing with tokens: []
2021-12-21 10:53:05,716 INFO mapred.LocalDistributedCacheManager: Localized file:/home/jovyan/work/shared/Projeto/mapper_q1_2.py as file:/tmp/hadoop-jovyan/mapred/local/job_local1615258921_0001_5ffcf939-4c33-4e7e-9796-2a45613f5800/mapper_q1_2.py
2021-12-21 10:53:05,758

2021-12-21 10:53:06,430 INFO streaming.PipeMapRed: PipeMapRed exec [/home/jovyan/work/shared/Projeto/./reducer_q1_2.py]
2021-12-21 10:53:06,433 INFO Configuration.deprecation: mapred.job.tracker is deprecated. Instead, use mapreduce.jobtracker.address
2021-12-21 10:53:06,434 INFO Configuration.deprecation: mapred.map.tasks is deprecated. Instead, use mapreduce.job.maps
2021-12-21 10:53:06,581 INFO streaming.PipeMapRed: R/W/S=1/0/0 in:NA [rec/s] out:NA [rec/s]
2021-12-21 10:53:06,582 INFO streaming.PipeMapRed: R/W/S=10/0/0 in:NA [rec/s] out:NA [rec/s]
2021-12-21 10:53:06,586 INFO streaming.PipeMapRed: R/W/S=100/0/0 in:NA [rec/s] out:NA [rec/s]
2021-12-21 10:53:06,600 INFO streaming.PipeMapRed: R/W/S=1000/0/0 in:NA [rec/s] out:NA [rec/s]
2021-12-21 10:53:06,612 INFO streaming.PipeMapRed: Records R/W=1846/1
2021-12-21 10:53:06,612 INFO streaming.PipeMapRed: MRErrorThread done
2021-12-21 10:53:06,614 INFO streaming.PipeMapRed: mapRedFinished
2021-12-21 10:53:06,618 INFO mapred.Task: Task:a

### Checking the results

In [10]:
!cat results_q1_2/part-*

Alabama	31
Alaska	13
Arizona	38
Arkansas	11
California	170
Colorado	51
Connecticut	15
Country Of Mexico	18
Delaware	6
District Of Columbia	5
Florida	55
Georgia	35
Hawaii	5
Idaho	17
Illinois	48
Indiana	52
Iowa	18
Kansas	37
Kentucky	34
Louisiana	41
Maine	21
Maryland	17
Massachusetts	19
Michigan	92
Minnesota	94
Mississippi	21
Missouri	18
Montana	62
Nebraska	6
Nevada	9
New Hampshire	17
New Jersey	24
New Mexico	18
New York	67
North Carolina	50
North Dakota	7
Ohio	91
Oklahoma	22
Oregon	32
Pennsylvania	61
Puerto Rico	6
Rhode Island	13
South Carolina	64
South Dakota	7
Tennessee	29
Texas	133
Utah	12
Vermont	22
Virgin Islands	6
Virginia	18
Washington	43
West Virginia	10
Wisconsin	26
Wyoming	9


## Third map-reduce for exercise 1

### Mapper

In [11]:
%%file mapper_q1_3.py
#!/usr/bin/env python
import sys 

init= 1000
for line in sys.stdin:
    line = line.strip()
    state, count = line.split("\t")
    init_count = init - int(count)
    print(str(init_count) + "\t" + state)

Overwriting mapper_q1_3.py


### Reducer

In [12]:
%%file reducer_q1_3.py
#!/usr/bin/env python
import sys

init= 1000
for line in sys.stdin:
    line = line.strip()
    init_count, state = line.split("\t")
    count = init - int(init_count)
    print(state + "\t" + str(count))

Overwriting reducer_q1_3.py


### Hadoop standalone mode execution - Clear the output directory

In [13]:
rm -rf results_q1_3

### Submitting the job

In [14]:
!hadoop jar /opt/hadoop-3.2.0/share/hadoop/tools/lib/hadoop-*streaming*.jar -files mapper_q1_3.py,reducer_q1_3.py -mapper mapper_q1_3.py -reducer reducer_q1_3.py -input results_q1_2/part-*  -output results_q1_3

2021-12-21 10:53:10,826 INFO impl.MetricsConfig: Loaded properties from hadoop-metrics2.properties
2021-12-21 10:53:10,905 INFO impl.MetricsSystemImpl: Scheduled Metric snapshot period at 10 second(s).
2021-12-21 10:53:10,905 INFO impl.MetricsSystemImpl: JobTracker metrics system started
2021-12-21 10:53:10,925 WARN impl.MetricsSystemImpl: JobTracker metrics system already initialized!
2021-12-21 10:53:11,143 INFO mapred.FileInputFormat: Total input files to process : 1
2021-12-21 10:53:11,160 INFO mapreduce.JobSubmitter: number of splits:1
2021-12-21 10:53:11,313 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_local1349947382_0001
2021-12-21 10:53:11,313 INFO mapreduce.JobSubmitter: Executing with tokens: []
2021-12-21 10:53:11,599 INFO mapred.LocalDistributedCacheManager: Localized file:/home/jovyan/work/shared/Projeto/mapper_q1_3.py as file:/tmp/hadoop-jovyan/mapred/local/job_local1349947382_0001_4a1b76e6-9534-443e-a60b-f70c7db20ec6/mapper_q1_3.py
2021-12-21 10:53:11,662

2021-12-21 10:53:12,489 INFO streaming.PipeMapRed: PipeMapRed exec [/home/jovyan/work/shared/Projeto/./reducer_q1_3.py]
2021-12-21 10:53:12,493 INFO Configuration.deprecation: mapred.job.tracker is deprecated. Instead, use mapreduce.jobtracker.address
2021-12-21 10:53:12,494 INFO Configuration.deprecation: mapred.map.tasks is deprecated. Instead, use mapreduce.job.maps
2021-12-21 10:53:12,649 INFO streaming.PipeMapRed: R/W/S=1/0/0 in:NA [rec/s] out:NA [rec/s]
2021-12-21 10:53:12,649 INFO streaming.PipeMapRed: R/W/S=10/0/0 in:NA [rec/s] out:NA [rec/s]
2021-12-21 10:53:12,653 INFO streaming.PipeMapRed: Records R/W=54/1
2021-12-21 10:53:12,658 INFO streaming.PipeMapRed: MRErrorThread done
2021-12-21 10:53:12,658 INFO streaming.PipeMapRed: mapRedFinished
2021-12-21 10:53:12,661 INFO mapred.Task: Task:attempt_local1349947382_0001_r_000000_0 is done. And is in the process of committing
2021-12-21 10:53:12,689 INFO mapred.LocalJobRunner: 1 / 1 copied.
2021-12-21 10:53:12,689 INFO mapred.Task:

### Checking the results

In [15]:
!cat results_q1_3/part-*

California	170
Texas	133
Minnesota	94
Michigan	92
Ohio	91
New York	67
South Carolina	64
Montana	62
Pennsylvania	61
Florida	55
Indiana	52
Colorado	51
North Carolina	50
Illinois	48
Washington	43
Louisiana	41
Arizona	38
Kansas	37
Georgia	35
Kentucky	34
Oregon	32
Alabama	31
Tennessee	29
Wisconsin	26
New Jersey	24
Oklahoma	22
Vermont	22
Maine	21
Mississippi	21
Massachusetts	19
Missouri	18
Virginia	18
New Mexico	18
Iowa	18
Country Of Mexico	18
New Hampshire	17
Maryland	17
Idaho	17
Connecticut	15
Rhode Island	13
Alaska	13
Utah	12
Arkansas	11
West Virginia	10
Nevada	9
Wyoming	9
North Dakota	7
South Dakota	7
Nebraska	6
Puerto Rico	6
Delaware	6
Virgin Islands	6
Hawaii	5
District Of Columbia	5


# Q.2) Which counties have the best/worst air quality?

## Frist map-reduce for exercise 2 

### Mapper

In [26]:
%%file mapper_q2.py
#!/usr/bin/env python
import sys  

#2 county code = words[1]
#17 Arithmetic mean pollution per day = words[16]

x = input("")
for line in sys.stdin:
    line = line.strip()
    words = line.split(",")
    county = words[25]
    pollution = words[16]
    print(county + "\t" + pollution)

Overwriting mapper_q2.py


### Reducer

In [27]:
%%file reducer_q2.py
#!/usr/bin/python
import sys 

pollution_sum = 0
pollution_count = 0
last_county = None 

for line in sys.stdin:
    line = line.strip()
    county, pollution = line.split("\t")

    pollution = float(pollution)
    if county != last_county:
        if last_county:
            print(last_county + "\t" + str(pollution_sum/pollution_count)) #verificar aqui este print
        last_county = county
        pollution_sum = pollution
        pollution_count = 1
    else:
        pollution_sum += pollution
        pollution_count += 1

if last_county:
    print(last_county + "\t" + str(pollution_sum/pollution_count))

Overwriting reducer_q2.py


### Hadoop standalone mode execution - Clear the output directory

In [28]:
rm -rf results_q2

### Submitting the job

In [29]:
!hadoop jar /opt/hadoop-3.2.0/share/hadoop/tools/lib/hadoop-*streaming*.jar -files mapper_q2.py,reducer_q2.py -mapper mapper_q2.py -reducer reducer_q2.py -input epa_hap_daily_summary-small.csv -output results_q2

2021-12-22 12:21:12,139 INFO impl.MetricsConfig: Loaded properties from hadoop-metrics2.properties
2021-12-22 12:21:12,233 INFO impl.MetricsSystemImpl: Scheduled Metric snapshot period at 10 second(s).
2021-12-22 12:21:12,234 INFO impl.MetricsSystemImpl: JobTracker metrics system started
2021-12-22 12:21:12,251 WARN impl.MetricsSystemImpl: JobTracker metrics system already initialized!
2021-12-22 12:21:12,485 INFO mapred.FileInputFormat: Total input files to process : 1
2021-12-22 12:21:12,500 INFO mapreduce.JobSubmitter: number of splits:4
2021-12-22 12:21:12,690 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_local943483185_0001
2021-12-22 12:21:12,690 INFO mapreduce.JobSubmitter: Executing with tokens: []
2021-12-22 12:21:13,012 INFO mapred.LocalDistributedCacheManager: Localized file:/home/jovyan/work/shared/Projeto/mapper_q2.py as file:/tmp/hadoop-jovyan/mapred/local/job_local943483185_0001_363fcf50-6bdb-4a45-a03c-0df893ffebd8/mapper_q2.py
2021-12-22 12:21:13,074 INFO 

2021-12-22 12:21:17,498 INFO streaming.PipeMapRed: PipeMapRed exec [/home/jovyan/work/shared/Projeto/./mapper_q2.py]
2021-12-22 12:21:17,513 INFO streaming.PipeMapRed: R/W/S=1/0/0 in:NA [rec/s] out:NA [rec/s]
2021-12-22 12:21:17,514 INFO streaming.PipeMapRed: R/W/S=10/0/0 in:NA [rec/s] out:NA [rec/s]
2021-12-22 12:21:17,521 INFO streaming.PipeMapRed: R/W/S=100/0/0 in:NA [rec/s] out:NA [rec/s]
2021-12-22 12:21:17,604 INFO streaming.PipeMapRed: Records R/W=916/1
2021-12-22 12:21:17,607 INFO streaming.PipeMapRed: R/W/S=1000/668/0 in:NA [rec/s] out:NA [rec/s]
2021-12-22 12:21:17,889 INFO streaming.PipeMapRed: R/W/S=10000/9741/0 in:NA [rec/s] out:NA [rec/s]
2021-12-22 12:21:18,254 INFO mapreduce.Job:  map 100% reduce 0%
2021-12-22 12:21:20,752 INFO streaming.PipeMapRed: R/W/S=100000/99780/0 in:33333=100000/3 [rec/s] out:33260=99780/3 [rec/s]
2021-12-22 12:21:21,109 INFO streaming.PipeMapRed: MRErrorThread done
2021-12-22 12:21:21,111 INFO streaming.PipeMapRed: mapRedFinished
2021-12-22 12:2

2021-12-22 12:21:25,770 INFO streaming.PipeMapRed: R/W/S=10000/9745/0 in:NA [rec/s] out:NA [rec/s]
2021-12-22 12:21:26,261 INFO mapreduce.Job:  map 100% reduce 0%
2021-12-22 12:21:28,031 INFO streaming.PipeMapRed: MRErrorThread done
2021-12-22 12:21:28,032 INFO streaming.PipeMapRed: mapRedFinished
2021-12-22 12:21:28,033 INFO mapred.LocalJobRunner: 
2021-12-22 12:21:28,033 INFO mapred.MapTask: Starting flush of map output
2021-12-22 12:21:28,033 INFO mapred.MapTask: Spilling map output
2021-12-22 12:21:28,033 INFO mapred.MapTask: bufstart = 0; bufend = 924504; bufvoid = 104857600
2021-12-22 12:21:28,033 INFO mapred.MapTask: kvstart = 26214396(104857584); kvend = 25920200(103680800); length = 294197/6553600
2021-12-22 12:21:28,114 INFO mapred.MapTask: Finished spill 0
2021-12-22 12:21:28,117 INFO mapred.Task: Task:attempt_local943483185_0001_m_000003_0 is done. And is in the process of committing
2021-12-22 12:21:28,122 INFO mapred.LocalJobRunner: Records R/W=890/1
2021-12-22 12:21:28,1

2021-12-22 12:21:30,264 INFO mapreduce.Job:  map 100% reduce 100%
2021-12-22 12:21:31,266 INFO mapreduce.Job: Job job_local943483185_0001 completed successfully
2021-12-22 12:21:31,294 INFO mapreduce.Job: Counters: 30
	File System Counters
		FILE: Number of bytes read=460116266
		FILE: Number of bytes written=30866315
		FILE: Number of read operations=0
		FILE: Number of large read operations=0
		FILE: Number of write operations=0
	Map-Reduce Framework
		Map input records=404854
		Map output records=404850
		Map output bytes=5092836
		Map output materialized bytes=5902560
		Input split bytes=484
		Combine input records=0
		Combine output records=0
		Reduce input groups=582
		Reduce shuffle bytes=5902560
		Reduce input records=404850
		Reduce output records=582
		Spilled Records=809700
		Shuffled Maps =4
		Failed Shuffles=0
		Merged Map outputs=4
		GC time elapsed (ms)=15
		Total committed heap usage (bytes)=2206728192
	Shuffle Errors
		BAD_ID=0
		CONNECTION=0
		IO_ERROR=0
		WRONG_LENGT

### Checking the results

In [30]:
!cat results_q2/part-*

Abbeville	0.370916666667
Ada	0.255412936803
Adair	0.000630681818182
Adams	0.0534428615085
Aiken	0.00378947368421
Alameda	0.369576591784
Alamosa	0.000407651006711
Albany	0.0705495055402
Aleutians East	0.000109571428571
Allegan	0.762988189189
Allegheny	0.359486976322
Allen	0.129140108401
Alpena	0.103741007194
Amador	0.089375
Anchorage	0.254714147727
Anderson	0.000970588235294
Androscoggin	0.237539877301
Angelina	0.3089
Anne Arundel	0.352224405956
Anoka	0.0937284237726
Apache	0.000328120805369
Arlington	0.590909090909
Aroostook	0.0666321311475
Ascension	0.611785714286
Ashley	0.00345675675676
Asotin	2.025
Atlantic	0.00111488653555
Avery	0.000587362637363
BAJA CALIFORNIA NORTE	0.801891038697
Baldwin	0.0837226102941
Baltimore	0.619025034045
Baltimore (City)	0.416897926873
Barbour	0.00116487179487
Barceloneta	0.687767955801
Barnstable	0.0165090829493
Bartholomew	0.169285714286
Bay	0.18691025641
Bayamon	1.12003636364
Beaufort	0.00127272727273
Beaver	0.016

## Second map-reduce for exercise 2 -ordering the counties by pollution

### Mapper

In [31]:
%%file mapper_q2_2.py
#!/usr/bin/env python
import sys 

init = 1000
for line in sys.stdin:
    line = line.strip()
    county, pollution = line.split("\t")
    pollution_init = 10 - float(pollution)
    print(str(pollution_init) + "\t" + county)

Overwriting mapper_q2_2.py


### Reducer

In [32]:
%%file reducer_q2_2.py
#!/usr/bin/python
import sys 

init = 1000
for line in sys.stdin:
    line = line.strip()
    pollution_init, county = line.split("\t")
    pollution = 10 - float(pollution_init)
    print(county + "\t" + str(pollution))

Overwriting reducer_q2_2.py


### Hadoop standalone mode execution - Clear the output directory

In [33]:
rm -rf results_q2_2

### Submitting the job

In [34]:
!hadoop jar /opt/hadoop-3.2.0/share/hadoop/tools/lib/hadoop-*streaming*.jar -files mapper_q2_2.py,reducer_q2_2.py -mapper mapper_q2_2.py -reducer reducer_q2_2.py -input results_q2/part-* -output results_q2_2

2021-12-22 12:21:35,013 INFO impl.MetricsConfig: Loaded properties from hadoop-metrics2.properties
2021-12-22 12:21:35,104 INFO impl.MetricsSystemImpl: Scheduled Metric snapshot period at 10 second(s).
2021-12-22 12:21:35,104 INFO impl.MetricsSystemImpl: JobTracker metrics system started
2021-12-22 12:21:35,120 WARN impl.MetricsSystemImpl: JobTracker metrics system already initialized!
2021-12-22 12:21:35,342 INFO mapred.FileInputFormat: Total input files to process : 1
2021-12-22 12:21:35,359 INFO mapreduce.JobSubmitter: number of splits:1
2021-12-22 12:21:35,532 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_local1069061182_0001
2021-12-22 12:21:35,532 INFO mapreduce.JobSubmitter: Executing with tokens: []
2021-12-22 12:21:35,815 INFO mapred.LocalDistributedCacheManager: Localized file:/home/jovyan/work/shared/Projeto/mapper_q2_2.py as file:/tmp/hadoop-jovyan/mapred/local/job_local1069061182_0001_71094aad-26ba-4f2d-9d49-55e29739c015/mapper_q2_2.py
2021-12-22 12:21:35,861

2021-12-22 12:21:36,737 INFO streaming.PipeMapRed: R/W/S=1/0/0 in:NA [rec/s] out:NA [rec/s]
2021-12-22 12:21:36,738 INFO streaming.PipeMapRed: R/W/S=10/0/0 in:NA [rec/s] out:NA [rec/s]
2021-12-22 12:21:36,744 INFO streaming.PipeMapRed: R/W/S=100/0/0 in:NA [rec/s] out:NA [rec/s]
2021-12-22 12:21:36,764 INFO streaming.PipeMapRed: Records R/W=582/1
2021-12-22 12:21:36,772 INFO streaming.PipeMapRed: MRErrorThread done
2021-12-22 12:21:36,777 INFO streaming.PipeMapRed: mapRedFinished
2021-12-22 12:21:36,782 INFO mapred.Task: Task:attempt_local1069061182_0001_r_000000_0 is done. And is in the process of committing
2021-12-22 12:21:36,821 INFO mapred.LocalJobRunner: 1 / 1 copied.
2021-12-22 12:21:36,821 INFO mapred.Task: Task attempt_local1069061182_0001_r_000000_0 is allowed to commit now
2021-12-22 12:21:36,973 INFO mapreduce.Job: Job job_local1069061182_0001 running in uber mode : false
2021-12-22 12:21:36,975 INFO mapreduce.Job:  map 100% reduce 0%
2021-12-22 12:21:37,153 INFO output.File

### Checking the results

In [35]:
!cat results_q2_2/part-*

Tipton	2556.0
Nassau	19.0
Sweet Grass	0.0
Martin	0.0
Columbiana	7.38569073579
CHIHUAHUA STATE	4.5121875
Caldwell	4.11666666667
Madera	3.7393
Oakland	2.8888778481
Duval	2.77946039785
Kearny	2.37533333333
Bucks	2.3675
San Luis Obispo	2.33333333333
Edgecombe	2.325
Pawnee	2.29411764706
Westchester	2.239375
Johnston	2.225
Hartford	2.07870558962
Granville	2.02857142857
Asotin	2.025
Duplin	2.0
Boulder	1.96047090123
Yancey	1.9
Crittenden	1.9
Los Angeles	1.83913508599
Iberville	1.82131322667
Caswell	1.80075
Pitt	1.8
Clinton	1.75830188679
Wayne	1.7137964213
Imperial	1.64786005155
Ozaukee	1.55733332
Stillwater	1.53846153846
Crow Wing	1.53204651163
Boyd	1.4857022549
Gloucester	1.48142857143
Henderson	1.42955090909
Kennebec	1.375
Stanislaus	1.27819412698
Muscatine	1.2375
Deer Lodge	1.21357933884
Mesa	1.21095249242
Spokane	1.14697335106
East Baton Rouge	1.14447284567
Harris	1.14270582867
Davis	1.14184812402
Phillips	1.14140939597
Tuscola	1.13696140625


# Q.3) Which states have the best/worst air quality in each year?

## First map-reduce for exercise 3 

### Mapper - taking from the data set only the state_year and all the corresponding pollution measures.

In [62]:
%%file mapper_q3.py
#!/usr/bin/env python
import sys  

#state name = words[24] and date = words[11] por dia ; year sao os primeiros 4 valores
#17 Arithmetic mean pollution per day = words[16]

x = input("")
for line in sys.stdin:
    line.strip()
    words = line.split(",")
    state_name = words[24]
    year = words[11][:4] #queremos apenas os 4 digitos do ano
    pollution = words[16]
    state_year = state_name + year
    print(state_year + "\t" + pollution)

Overwriting mapper_q3.py


### Reducer - calculate the mean pollution for each state_year

In [63]:
%%file reducer_q3.py
#!/usr/bin/python
import sys 

pollution_sum = 0
pollution_count = 0
last_state_year = None 

for line in sys.stdin:
    state_year, pollution = line.split("\t")
    pollution = float(pollution)
    if state_year != last_state_year:
        if last_state_year:
            print(last_state_year + "\t" + str(pollution_sum/pollution_count))
        last_state_year = state_year
        pollution_sum = pollution
        pollution_count = 1
    else:
        pollution_sum += pollution
        pollution_count += 1

if last_state_year:
    print(last_state_year + "\t" + str(pollution_sum/pollution_count))

Overwriting reducer_q3.py


### Hadoop standalone mode execution - Clear the output directory

In [64]:
rm -rf results_q3

### Submitting the job

In [65]:
!hadoop jar /opt/hadoop-3.2.0/share/hadoop/tools/lib/hadoop-*streaming*.jar -files mapper_q3.py,reducer_q3.py -mapper mapper_q3.py -reducer reducer_q3.py -input epa_hap_daily_summary-small.csv -output results_q3

2021-12-20 09:53:38,936 INFO impl.MetricsConfig: Loaded properties from hadoop-metrics2.properties
2021-12-20 09:53:39,057 INFO impl.MetricsSystemImpl: Scheduled Metric snapshot period at 10 second(s).
2021-12-20 09:53:39,057 INFO impl.MetricsSystemImpl: JobTracker metrics system started
2021-12-20 09:53:39,082 WARN impl.MetricsSystemImpl: JobTracker metrics system already initialized!
2021-12-20 09:53:39,398 INFO mapred.FileInputFormat: Total input files to process : 1
2021-12-20 09:53:39,422 INFO mapreduce.JobSubmitter: number of splits:4
2021-12-20 09:53:39,640 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_local1723390945_0001
2021-12-20 09:53:39,640 INFO mapreduce.JobSubmitter: Executing with tokens: []
2021-12-20 09:53:40,126 INFO mapred.LocalDistributedCacheManager: Localized file:/home/jovyan/work/shared/Projeto/mapper_q3.py as file:/tmp/hadoop-jovyan/mapred/local/job_local1723390945_0001_bdb2e93f-7d4c-4959-83d1-dbbd84f963aa/mapper_q3.py
2021-12-20 09:53:40,208 INF

2021-12-20 09:53:45,273 INFO streaming.PipeMapRed: PipeMapRed exec [/home/jovyan/work/shared/Projeto/./mapper_q3.py]
2021-12-20 09:53:45,292 INFO streaming.PipeMapRed: R/W/S=1/0/0 in:NA [rec/s] out:NA [rec/s]
2021-12-20 09:53:45,293 INFO streaming.PipeMapRed: R/W/S=10/0/0 in:NA [rec/s] out:NA [rec/s]
2021-12-20 09:53:45,301 INFO streaming.PipeMapRed: R/W/S=100/0/0 in:NA [rec/s] out:NA [rec/s]
2021-12-20 09:53:45,397 INFO streaming.PipeMapRed: Records R/W=866/1
2021-12-20 09:53:45,415 INFO streaming.PipeMapRed: R/W/S=1000/477/0 in:NA [rec/s] out:NA [rec/s]
2021-12-20 09:53:45,444 INFO mapreduce.Job:  map 100% reduce 0%
2021-12-20 09:53:45,804 INFO streaming.PipeMapRed: R/W/S=10000/9489/0 in:NA [rec/s] out:NA [rec/s]
2021-12-20 09:53:48,350 INFO streaming.PipeMapRed: R/W/S=100000/99709/0 in:33333=100000/3 [rec/s] out:33236=99709/3 [rec/s]
2021-12-20 09:53:48,673 INFO streaming.PipeMapRed: MRErrorThread done
2021-12-20 09:53:48,676 INFO streaming.PipeMapRed: mapRedFinished
2021-12-20 09:5

2021-12-20 09:53:53,451 INFO mapreduce.Job:  map 100% reduce 0%
2021-12-20 09:53:55,116 INFO streaming.PipeMapRed: MRErrorThread done
2021-12-20 09:53:55,118 INFO streaming.PipeMapRed: mapRedFinished
2021-12-20 09:53:55,119 INFO mapred.LocalJobRunner: 
2021-12-20 09:53:55,119 INFO mapred.MapTask: Starting flush of map output
2021-12-20 09:53:55,119 INFO mapred.MapTask: Spilling map output
2021-12-20 09:53:55,119 INFO mapred.MapTask: bufstart = 0; bufend = 1270631; bufvoid = 104857600
2021-12-20 09:53:55,119 INFO mapred.MapTask: kvstart = 26214396(104857584); kvend = 25920200(103680800); length = 294197/6553600
2021-12-20 09:53:55,250 INFO mapred.MapTask: Finished spill 0
2021-12-20 09:53:55,256 INFO mapred.Task: Task:attempt_local1723390945_0001_m_000003_0 is done. And is in the process of committing
2021-12-20 09:53:55,268 INFO mapred.LocalJobRunner: Records R/W=867/1
2021-12-20 09:53:55,268 INFO mapred.Task: Task 'attempt_local1723390945_0001_m_000003_0' done.
2021-12-20 09:53:55,269

2021-12-20 09:53:59,458 INFO mapreduce.Job:  map 100% reduce 100%
2021-12-20 09:53:59,458 INFO mapreduce.Job: Job job_local1723390945_0001 completed successfully
2021-12-20 09:53:59,479 INFO mapreduce.Job: Counters: 30
	File System Counters
		FILE: Number of bytes read=463916257
		FILE: Number of bytes written=39710188
		FILE: Number of read operations=0
		FILE: Number of large read operations=0
		FILE: Number of write operations=0
	Map-Reduce Framework
		Map input records=404854
		Map output records=404850
		Map output bytes=6992464
		Map output materialized bytes=7802188
		Input split bytes=484
		Combine input records=0
		Combine output records=0
		Reduce input groups=1356
		Reduce shuffle bytes=7802188
		Reduce input records=404850
		Reduce output records=1356
		Spilled Records=809700
		Shuffled Maps =4
		Failed Shuffles=0
		Merged Map outputs=4
		GC time elapsed (ms)=17
		Total committed heap usage (bytes)=2206728192
	Shuffle Errors
		BAD_ID=0
		CONNECT

### Checking the results

In [66]:
!cat results_q3/part-*

Alabama1990	0.024325
Alabama1992	0.220744242424
Alabama1993	1.367275
Alabama1994	1.78628020408
Alabama1995	1.18004323077
Alabama1996	3.22631405797
Alabama1997	0.000985714285714
Alabama1998	0.00130277777778
Alabama1999	0.00133333333333
Alabama2000	0.00195058823529
Alabama2001	0.0153583333333
Alabama2002	0.143727961165
Alabama2003	0.108966856436
Alabama2004	0.0128894475138
Alabama2005	0.0924626443769
Alabama2006	0.0991246280992
Alabama2007	0.0296045238095
Alabama2008	0.00813945945946
Alabama2009	0.0700396803653
Alabama2010	0.0652468461538
Alabama2011	0.333157198276
Alabama2012	0.483460076046
Alabama2013	0.00195679012346
Alabama2014	0.00414705882353
Alabama2015	0.00772857142857
Alabama2016	0.0104923076923
Alaska1990	0.000442083333333
Alaska1991	0.000176
Alaska1992	0.000102692307692
Alaska1993	0.000156470588235
Alaska1994	0.000193448275862
Alaska1995	0.0002468
Alaska1996	0.000128
Alaska1997	0.0003925
Alaska1998	0.000294838709677
Alaska1999	0.000916
Alask

## Second map-reduce for exercise 3 -ordering the state-year by pollution

### Mapper

In [67]:
%%file mapper_q3_2.py
#!/usr/bin/env python
import sys  

init=1000

for line in sys.stdin:
    line= line.strip()
    state_year, pollution = line.split("\t")
    init_pollution = init - float(pollution)
    print(str(init_pollution) + "\t" + state_year)

Overwriting mapper_q3_2.py


### Reducer

In [68]:
%%file reducer_q3_2.py
#!/usr/bin/python
import sys 

init=1000
for line in sys.stdin:
    line= line.strip()
    init_pollution, state_year = line.split("\t")
    pollution = init - float(init_pollution)
    print(state_year + "\t" + str(pollution))

Overwriting reducer_q3_2.py


### Hadoop standalone mode execution - Clear the output directory

In [69]:
rm -rf results_q3_2

### Submitting the job

In [70]:
!hadoop jar /opt/hadoop-3.2.0/share/hadoop/tools/lib/hadoop-*streaming*.jar -files mapper_q3_2.py,reducer_q3_2.py -mapper mapper_q3_2.py -reducer reducer_q3_2.py -input results_q3/part-* -output results_q3_2

2021-12-20 09:54:04,164 INFO impl.MetricsConfig: Loaded properties from hadoop-metrics2.properties
2021-12-20 09:54:04,282 INFO impl.MetricsSystemImpl: Scheduled Metric snapshot period at 10 second(s).
2021-12-20 09:54:04,282 INFO impl.MetricsSystemImpl: JobTracker metrics system started
2021-12-20 09:54:04,314 WARN impl.MetricsSystemImpl: JobTracker metrics system already initialized!
2021-12-20 09:54:04,653 INFO mapred.FileInputFormat: Total input files to process : 1
2021-12-20 09:54:04,678 INFO mapreduce.JobSubmitter: number of splits:1
2021-12-20 09:54:04,902 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_local478091603_0001
2021-12-20 09:54:04,902 INFO mapreduce.JobSubmitter: Executing with tokens: []
2021-12-20 09:54:05,377 INFO mapred.LocalDistributedCacheManager: Localized file:/home/jovyan/work/shared/Projeto/mapper_q3_2.py as file:/tmp/hadoop-jovyan/mapred/local/job_local478091603_0001_f115e20d-be57-466d-8a9b-67c29f5bf95c/mapper_q3_2.py
2021-12-20 09:54:05,477 I

2021-12-20 09:54:06,738 INFO reduce.LocalFetcher: localfetcher#1 about to shuffle output of map attempt_local478091603_0001_m_000000_0 decomp: 43837 len: 43841 to MEMORY
2021-12-20 09:54:06,746 INFO reduce.InMemoryMapOutput: Read 43837 bytes from map-output for attempt_local478091603_0001_m_000000_0
2021-12-20 09:54:06,750 INFO reduce.MergeManagerImpl: closeInMemoryFile -> map-output of size: 43837, inMemoryMapOutputs.size() -> 1, commitMemory -> 0, usedMemory ->43837
2021-12-20 09:54:06,755 INFO reduce.EventFetcher: EventFetcher is interrupted.. Returning
2021-12-20 09:54:06,758 INFO mapred.LocalJobRunner: 1 / 1 copied.
2021-12-20 09:54:06,759 INFO reduce.MergeManagerImpl: finalMerge called with 1 in-memory map-outputs and 0 on-disk map-outputs
2021-12-20 09:54:06,771 INFO mapred.Merger: Merging 1 sorted segments
2021-12-20 09:54:06,772 INFO mapred.Merger: Down to the last merge-pass, with 1 segments left of total size: 43828 bytes
2021-12-20 09:54:06,793 INFO reduce.MergeManagerImpl:

### Check the results

In [242]:
!cat results_q3_2/part-*

Ohio1991	0.0
Virgin Islands1990	0.0
West Virginia1990	0.0
Arkansas1991	0.0
Oklahoma1990	0.0
Wisconsin1990	0.0
Ohio1992	0.0
Tennessee1990	170.400930667
Country Of Mexico1995	8.46
Michigan2001	4.50613871637
Massachusetts1993	4.30583328571
Colorado2017	4.225
Indiana1990	4.09897837838
Illinois1992	3.9118251634
Massachusetts1994	3.46099061224
Louisiana1995	3.36434886585
Rhode Island1994	3.3635714
Alabama1996	3.22631405797
Connecticut1993	3.09754615385
Massachusetts1990	3.02468235294
Wisconsin1994	2.95048333333
Indiana1993	2.89722580645
Rhode Island1995	2.73130434783
Delaware1993	2.723077
Indiana1992	2.66063636364
Pennsylvania1993	2.5750862069
District Of Columbia1995	2.50474633333
Wisconsin1995	2.50224443333
Wisconsin1998	2.49189189189
Connecticut1998	2.38514745161
Country Of Mexico1993	2.38
Connecticut1995	2.221777775
Massachusetts1995	2.21076472727
Louisiana1993	2.117611625
Massachusetts1992	2.1175
Connecticut1997	2.09315096
North Carolina1995	2.0843603

# Q.4) For each state, what is the average distance of the monitors to the state center?

### Mapper

In [51]:
%%file mapper_q4.py
#!/usr/bin/env python
import sys 

x = input("")
for line in sys.stdin:
    line = line.strip()
    words = line.split(",")
    
    state = words[24] #25 state name
    lat = words[5]
    long = words[6]
    state_lat_long = state + "_" + lat + "_" + long
    machine = lat + long

    print(machine + "\t" + state_lat_long)

Overwriting mapper_q4.py


### Reducer

In [52]:
%%file reducer_q4.py
#!/usr/bin/python
import sys 

last_machine = None

for line in sys.stdin:
    line = line.strip()
    words = line.split("\t")
    machine = words[0] 
    state_lat_long = words[1]
    
    if machine != last_machine:
        if last_machine:
            print(state_lat_long + "\t" + "1")
        last_machine = machine
            
if last_machine:
    print(state_lat_long + "\t" + "1")

Overwriting reducer_q4.py


### Hadoop standalone mode execution - Clear the output directory

In [53]:
rm -rf results_q4

### Submitting the job

In [54]:
!hadoop jar /opt/hadoop-3.2.0/share/hadoop/tools/lib/hadoop-*streaming*.jar -files mapper_q4.py,reducer_q4.py -mapper mapper_q4.py -reducer reducer_q4.py -input epa_hap_daily_summary-small.csv -output results_q4

2021-12-22 13:17:13,785 INFO impl.MetricsConfig: Loaded properties from hadoop-metrics2.properties
2021-12-22 13:17:13,871 INFO impl.MetricsSystemImpl: Scheduled Metric snapshot period at 10 second(s).
2021-12-22 13:17:13,871 INFO impl.MetricsSystemImpl: JobTracker metrics system started
2021-12-22 13:17:13,890 WARN impl.MetricsSystemImpl: JobTracker metrics system already initialized!
2021-12-22 13:17:14,133 INFO mapred.FileInputFormat: Total input files to process : 1
2021-12-22 13:17:14,150 INFO mapreduce.JobSubmitter: number of splits:4
2021-12-22 13:17:14,321 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_local746631316_0001
2021-12-22 13:17:14,321 INFO mapreduce.JobSubmitter: Executing with tokens: []
2021-12-22 13:17:14,642 INFO mapred.LocalDistributedCacheManager: Localized file:/home/jovyan/work/shared/Projeto/mapper_q4.py as file:/tmp/hadoop-jovyan/mapred/local/job_local746631316_0001_95731d64-23a9-465e-8fb0-5b181e77eddf/mapper_q4.py
2021-12-22 13:17:14,691 INFO 

2021-12-22 13:17:19,159 INFO streaming.PipeMapRed: Records R/W=433/1
2021-12-22 13:17:19,173 INFO streaming.PipeMapRed: R/W/S=1000/838/0 in:NA [rec/s] out:NA [rec/s]
2021-12-22 13:17:19,392 INFO streaming.PipeMapRed: R/W/S=10000/9706/0 in:NA [rec/s] out:NA [rec/s]
2021-12-22 13:17:19,855 INFO mapreduce.Job:  map 100% reduce 0%
2021-12-22 13:17:21,963 INFO streaming.PipeMapRed: R/W/S=100000/99724/0 in:50000=100000/2 [rec/s] out:49862=99724/2 [rec/s]
2021-12-22 13:17:22,215 INFO streaming.PipeMapRed: MRErrorThread done
2021-12-22 13:17:22,216 INFO streaming.PipeMapRed: mapRedFinished
2021-12-22 13:17:22,217 INFO mapred.LocalJobRunner: 
2021-12-22 13:17:22,217 INFO mapred.MapTask: Starting flush of map output
2021-12-22 13:17:22,217 INFO mapred.MapTask: Spilling map output
2021-12-22 13:17:22,217 INFO mapred.MapTask: bufstart = 0; bufend = 5430507; bufvoid = 104857600
2021-12-22 13:17:22,217 INFO mapred.MapTask: kvstart = 26214396(104857584); kvend = 25772740(103090960); length = 441657/6

2021-12-22 13:17:27,317 INFO mapred.MapTask: Finished spill 0
2021-12-22 13:17:27,322 INFO mapred.Task: Task:attempt_local746631316_0001_m_000003_0 is done. And is in the process of committing
2021-12-22 13:17:27,329 INFO mapred.LocalJobRunner: Records R/W=435/1
2021-12-22 13:17:27,329 INFO mapred.Task: Task 'attempt_local746631316_0001_m_000003_0' done.
2021-12-22 13:17:27,330 INFO mapred.Task: Final Counters for attempt_local746631316_0001_m_000003_0: Counters: 17
	File System Counters
		FILE: Number of bytes read=123211861
		FILE: Number of bytes written=21421366
		FILE: Number of read operations=0
		FILE: Number of large read operations=0
		FILE: Number of write operations=0
	Map-Reduce Framework
		Map input records=73551
		Map output records=73550
		Map output bytes=3618373
		Map output materialized bytes=3765479
		Input split bytes=121
		Combine input records=0
		Spilled Records=73550
		Failed Shuffles=0
		Merged Map outputs=0
		GC time elapsed (ms)=0
		Total committed heap usage

2021-12-22 13:17:29,863 INFO mapreduce.Job:  map 100% reduce 100%
2021-12-22 13:17:29,864 INFO mapreduce.Job: Job job_local746631316_0001 completed successfully
2021-12-22 13:17:29,900 INFO mapreduce.Job: Counters: 30
	File System Counters
		FILE: Number of bytes read=489759600
		FILE: Number of bytes written=99632227
		FILE: Number of read operations=0
		FILE: Number of large read operations=0
		FILE: Number of write operations=0
	Map-Reduce Framework
		Map input records=404854
		Map output records=404850
		Map output bytes=19914973
		Map output materialized bytes=20724697
		Input split bytes=484
		Combine input records=0
		Combine output records=0
		Reduce input groups=1846
		Reduce shuffle bytes=20724697
		Reduce input records=404850
		Reduce output records=1846
		Spilled Records=809700
		Shuffled Maps =4
		Failed Shuffles=0
		Merged Map outputs=4
		GC time elapsed (ms)=14
		Total committed heap usage (bytes)=2101870592
	Shuffle Errors
		BAD_ID=0
		CONNECTION=0
		IO_ERROR=0
		WRONG_

### Checking the results

In [55]:
!cat results_q4/part-*

Virgin Islands_17.708308_-64.793479	1
Virgin Islands_17.712474_-64.784868	1
Virgin Islands_17.714444_-64.785278	1
Virgin Islands_17.725278_-64.780278	1
Puerto Rico_18.014281_-66.611694	1
Virgin Islands_18.334399_-64.795972	1
Virgin Islands_18.336389_-64.796389	1
Puerto Rico_18.417315_-66.150293	1
Puerto Rico_18.420089_-66.150615	1
Puerto Rico_18.425652_-66.115846	1
Puerto Rico_18.436764_-66.58002	1
Puerto Rico_18.449167_-66.181667	1
Hawaii_19.4308_-155.2578	1
Hawaii_20.7585_-156.24789	1
Hawaii_20.808788_-156.283507	1
Hawaii_21.323745_-158.088613	1
Hawaii_21.392833_-157.969126	1
Florida_25.39122_-80.680819	1
Florida_25.586384_-80.326811	1
Florida_25.794222_-80.215556	1
Florida_25.798333_-80.210278	1
Florida_25.865278_-80.278611	1
Florida_25.87583_-80.2583	1
Texas_25.892518_-97.49383	1
Florida_25.982222_-80.247778	1
Florida_26.000833_-80.160556	1
Florida_26.001202_-80.160324	1
Florida_26.053889_-80.256944	1
Texas_26.069615_-97.1622	1
Florida_26.073536_-80.338

### Q4- parte 2

### Mapper

In [66]:
%%file mapper_q4_2.py
#!/usr/bin/env python
import sys  

coordinates = {}
first_line = True

with open('usa_states.csv','r') as coord:
    for line in coord:
        if first_line:
            first_line = False
        else:
            line = line.strip()
            line = line.split(',')
            media_lat = (float(line[2]) + float(line[3])) / 2
            media_long = (float(line[4]) + float(line[5])) / 2
            coordinates[line[1]] = [media_lat, media_long]
            
for line in sys.stdin:
    line = line.strip()
    words = line.split("\t")
    state_lat_long = words[0]
    state, lat, long = state_lat_long.split("_")
    lat = float(lat)
    long = float(long)
    if state in coordinates:
        dist_lat = float(coordinates[state][0]) - float(lat)
        dist_lat_km = abs(dist_lat)*111
        dist_long = float(coordinates[state][1]) - float(long)
        dist_long_km = abs(dist_long)*111
        distance =  ((dist_lat_km)**2 + dist_long_km**2)**(1/2)
        print(state + "\t" + str(distance))

Overwriting mapper_q4_2.py


### Reducer

In [68]:
%%file reducer_q4_2.py
#!/usr/bin/python
import sys 

dist_sum = 0
dist_count = 0
last_state = None

for line in sys.stdin:
    line = line.strip()
    state, dist = line.split("\t")
    dist = float(dist)
    if state != last_state:
        if last_state:
            print(last_state + "\t" + str(dist_sum/dist_count))
        last_state = state
        dist_sum = dist
        dist_count = 1
    else:
        dist_sum += dist
        dist_count += 1

if last_state:
    print(last_state + "\t" + str(dist_sum/dist_count))

Overwriting reducer_q4_2.py


### Hadoop standalone mode execution - Clear the output directory

In [69]:
rm -rf results_q4_2

### Submitting the job

In [70]:
!hadoop jar /opt/hadoop-3.2.0/share/hadoop/tools/lib/hadoop-*streaming*.jar -files mapper_q4_2.py,reducer_q4_2.py -mapper mapper_q4_2.py -reducer reducer_q4_2.py -input results_q4/part-* -output results_q4_2

2021-12-22 14:36:41,412 INFO impl.MetricsConfig: Loaded properties from hadoop-metrics2.properties
2021-12-22 14:36:41,547 INFO impl.MetricsSystemImpl: Scheduled Metric snapshot period at 10 second(s).
2021-12-22 14:36:41,547 INFO impl.MetricsSystemImpl: JobTracker metrics system started
2021-12-22 14:36:41,574 WARN impl.MetricsSystemImpl: JobTracker metrics system already initialized!
2021-12-22 14:36:41,944 INFO mapred.FileInputFormat: Total input files to process : 1
2021-12-22 14:36:41,989 INFO mapreduce.JobSubmitter: number of splits:1
2021-12-22 14:36:42,314 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_local968284185_0001
2021-12-22 14:36:42,315 INFO mapreduce.JobSubmitter: Executing with tokens: []
2021-12-22 14:36:42,770 INFO mapred.LocalDistributedCacheManager: Localized file:/home/jovyan/work/shared/Projeto/mapper_q4_2.py as file:/tmp/hadoop-jovyan/mapred/local/job_local968284185_0001_f4b2e730-5140-4059-a0fb-e2bef1b5c05f/mapper_q4_2.py
2021-12-22 14:36:42,849 I

2021-12-22 14:36:43,986 INFO reduce.MergeManagerImpl: Merged 1 segments, 54310 bytes to disk to satisfy reduce memory limit
2021-12-22 14:36:43,987 INFO reduce.MergeManagerImpl: Merging 1 files, 54314 bytes from disk
2021-12-22 14:36:43,989 INFO reduce.MergeManagerImpl: Merging 0 segments, 0 bytes from memory into reduce
2021-12-22 14:36:43,989 INFO mapred.Merger: Merging 1 sorted segments
2021-12-22 14:36:43,991 INFO mapred.Merger: Down to the last merge-pass, with 1 segments left of total size: 54300 bytes
2021-12-22 14:36:43,992 INFO mapred.LocalJobRunner: 1 / 1 copied.
2021-12-22 14:36:44,009 INFO streaming.PipeMapRed: PipeMapRed exec [/home/jovyan/work/shared/Projeto/./reducer_q4_2.py]
2021-12-22 14:36:44,011 INFO mapreduce.Job: Job job_local968284185_0001 running in uber mode : false
2021-12-22 14:36:44,013 INFO mapreduce.Job:  map 100% reduce 0%
2021-12-22 14:36:44,013 INFO Configuration.deprecation: mapred.job.tracker is deprecated. Instead, use mapreduce.jobtracker.address
202

### Checking the results

In [71]:
!cat results_q4_2/part-*

Alabama	167.577796147
Alaska	637.388697729
Arizona	178.875740556
Arkansas	157.850104723
California	328.226381316
Colorado	180.259742033
Connecticut	49.9897454878
Delaware	51.5797704802
Florida	336.544914657
Georgia	184.348235073
Hawaii	155.73279515
Idaho	289.635073276
Illinois	224.067615069
Indiana	177.741957772
Iowa	206.598941037
Kansas	292.07968413
Kentucky	219.951516808
Louisiana	173.276312309
Maine	167.771435414
Maryland	89.2875944556
Massachusetts	92.3540125679
Michigan	326.411606485
Minnesota	195.068270828
Mississippi	174.238609984
Missouri	234.329534128
Montana	286.838335276
Nebraska	307.141182606
Nevada	326.28118072
New Hampshire	115.622050923
New Jersey	80.7436730378
New Mexico	183.189121285
New York	283.727339864
North Carolina	179.094048154
North Dakota	248.421930733
Ohio	176.182964829
Oklahoma	236.882574373
Oregon	268.853807923
Pennsylvania	251.415176341
Puerto Rico	32.7315162758
Rhode Island	22.1925206066
South Carolina	131.491886698

# Q.5) How many sensors per quadrant in each state?

### Mapper

In [16]:
%%file mapper_q5.py
#!/usr/bin/env python
import sys 

x = input("")
for line in sys.stdin:
    line = line.strip()
    words = line.split(",")
    state = words[24] #25 state name
    lat = words[5]
    long = words[6]
    state_lat_long = state + "_" + lat + "_" + long
    machine = lat + long

    print(machine + "\t" + state_lat_long)

Overwriting mapper_q5.py


### Reducer

In [17]:
%%file reducer_q5.py
#!/usr/bin/python
import sys 

last_machine = None

for line in sys.stdin:
    line = line.strip()
    words = line.split("\t")
    machine = words[0] 
    state_lat_long = words[1]
    
    if machine != last_machine:
        if last_machine:
            print(state_lat_long + "\t" + "1")
        last_machine = machine
            
if last_machine:
    print(state_lat_long + "\t" + "1")

Overwriting reducer_q5.py


### Hadoop standalone mode execution - Clear the output directory

In [18]:
rm -rf results_q5

### Submitting the job

In [19]:
!hadoop jar /opt/hadoop-3.2.0/share/hadoop/tools/lib/hadoop-*streaming*.jar -files mapper_q5.py,reducer_q5.py -mapper mapper_q5.py -reducer reducer_q5.py -input epa_hap_daily_summary-small.csv -output results_q5

2021-12-21 10:53:34,714 INFO impl.MetricsConfig: Loaded properties from hadoop-metrics2.properties
2021-12-21 10:53:34,774 INFO impl.MetricsSystemImpl: Scheduled Metric snapshot period at 10 second(s).
2021-12-21 10:53:34,774 INFO impl.MetricsSystemImpl: JobTracker metrics system started
2021-12-21 10:53:34,788 WARN impl.MetricsSystemImpl: JobTracker metrics system already initialized!
2021-12-21 10:53:34,933 INFO mapred.FileInputFormat: Total input files to process : 1
2021-12-21 10:53:34,945 INFO mapreduce.JobSubmitter: number of splits:4
2021-12-21 10:53:35,051 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_local227818063_0001
2021-12-21 10:53:35,051 INFO mapreduce.JobSubmitter: Executing with tokens: []
2021-12-21 10:53:35,296 INFO mapred.LocalDistributedCacheManager: Localized file:/home/jovyan/work/shared/Projeto/mapper_q5.py as file:/tmp/hadoop-jovyan/mapred/local/job_local227818063_0001_e2ab63fc-9e73-4f6a-acad-851815519c4e/mapper_q5.py
2021-12-21 10:53:35,331 INFO 

2021-12-21 10:53:38,721 INFO streaming.PipeMapRed: PipeMapRed exec [/home/jovyan/work/shared/Projeto/./mapper_q5.py]
2021-12-21 10:53:38,737 INFO streaming.PipeMapRed: R/W/S=1/0/0 in:NA [rec/s] out:NA [rec/s]
2021-12-21 10:53:38,738 INFO streaming.PipeMapRed: R/W/S=10/0/0 in:NA [rec/s] out:NA [rec/s]
2021-12-21 10:53:38,742 INFO streaming.PipeMapRed: R/W/S=100/0/0 in:NA [rec/s] out:NA [rec/s]
2021-12-21 10:53:38,796 INFO streaming.PipeMapRed: Records R/W=433/1
2021-12-21 10:53:38,813 INFO streaming.PipeMapRed: R/W/S=1000/838/0 in:NA [rec/s] out:NA [rec/s]
2021-12-21 10:53:39,032 INFO streaming.PipeMapRed: R/W/S=10000/9872/0 in:NA [rec/s] out:NA [rec/s]
2021-12-21 10:53:39,434 INFO mapreduce.Job:  map 100% reduce 0%
2021-12-21 10:53:41,243 INFO streaming.PipeMapRed: R/W/S=100000/99724/0 in:50000=100000/2 [rec/s] out:49862=99724/2 [rec/s]
2021-12-21 10:53:41,478 INFO streaming.PipeMapRed: MRErrorThread done
2021-12-21 10:53:41,479 INFO streaming.PipeMapRed: mapRedFinished
2021-12-21 10:5

2021-12-21 10:53:45,438 INFO mapreduce.Job:  map 100% reduce 0%
2021-12-21 10:53:46,395 INFO streaming.PipeMapRed: MRErrorThread done
2021-12-21 10:53:46,396 INFO streaming.PipeMapRed: mapRedFinished
2021-12-21 10:53:46,397 INFO mapred.LocalJobRunner: 
2021-12-21 10:53:46,397 INFO mapred.MapTask: Starting flush of map output
2021-12-21 10:53:46,397 INFO mapred.MapTask: Spilling map output
2021-12-21 10:53:46,397 INFO mapred.MapTask: bufstart = 0; bufend = 3618373; bufvoid = 104857600
2021-12-21 10:53:46,397 INFO mapred.MapTask: kvstart = 26214396(104857584); kvend = 25920200(103680800); length = 294197/6553600
2021-12-21 10:53:46,439 INFO mapreduce.Job:  map 75% reduce 0%
2021-12-21 10:53:46,507 INFO mapred.MapTask: Finished spill 0
2021-12-21 10:53:46,509 INFO mapred.Task: Task:attempt_local227818063_0001_m_000003_0 is done. And is in the process of committing
2021-12-21 10:53:46,514 INFO mapred.LocalJobRunner: Records R/W=435/1
2021-12-21 10:53:46,514 INFO mapred.Task: Task 'attempt_

2021-12-21 10:53:48,440 INFO mapreduce.Job:  map 100% reduce 100%
2021-12-21 10:53:48,441 INFO mapreduce.Job: Job job_local227818063_0001 completed successfully
2021-12-21 10:53:48,460 INFO mapreduce.Job: Counters: 30
	File System Counters
		FILE: Number of bytes read=489759575
		FILE: Number of bytes written=99632202
		FILE: Number of read operations=0
		FILE: Number of large read operations=0
		FILE: Number of write operations=0
	Map-Reduce Framework
		Map input records=404854
		Map output records=404850
		Map output bytes=19914973
		Map output materialized bytes=20724697
		Input split bytes=484
		Combine input records=0
		Combine output records=0
		Reduce input groups=1846
		Reduce shuffle bytes=20724697
		Reduce input records=404850
		Reduce output records=1846
		Spilled Records=809700
		Shuffled Maps =4
		Failed Shuffles=0
		Merged Map outputs=4
		GC time elapsed (ms)=25
		Total committed heap usage (bytes)=2206203904
	Shuffle Errors
		BAD_ID=0
		CONNECTION=0
		IO_ERROR=0
		WRONG_

### Checking the results

In [20]:
!cat results_q5/part-*

Virgin Islands_17.708308_-64.793479	1
Virgin Islands_17.712474_-64.784868	1
Virgin Islands_17.714444_-64.785278	1
Virgin Islands_17.725278_-64.780278	1
Puerto Rico_18.014281_-66.611694	1
Virgin Islands_18.334399_-64.795972	1
Virgin Islands_18.336389_-64.796389	1
Puerto Rico_18.417315_-66.150293	1
Puerto Rico_18.420089_-66.150615	1
Puerto Rico_18.425652_-66.115846	1
Puerto Rico_18.436764_-66.58002	1
Puerto Rico_18.449167_-66.181667	1
Hawaii_19.4308_-155.2578	1
Hawaii_20.7585_-156.24789	1
Hawaii_20.808788_-156.283507	1
Hawaii_21.323745_-158.088613	1
Hawaii_21.392833_-157.969126	1
Florida_25.39122_-80.680819	1
Florida_25.586384_-80.326811	1
Florida_25.794222_-80.215556	1
Florida_25.798333_-80.210278	1
Florida_25.865278_-80.278611	1
Florida_25.87583_-80.2583	1
Texas_25.892518_-97.49383	1
Florida_25.982222_-80.247778	1
Florida_26.000833_-80.160556	1
Florida_26.001202_-80.160324	1
Florida_26.053889_-80.256944	1
Texas_26.069615_-97.1622	1
Florida_26.073536_-80.338

### 2 map-reduce for exercise 5

### Mapper

In [21]:
%%file mapper_q5_2.py
#!/usr/bin/env python

import sys  
coordinates = {}
first_line = True

with open('usa_states.csv','r') as coord:
    for line in coord:
        if first_line:
            first_line = False
        else:
            line = line.strip()
            line = line.split(',')
            media_lat = (float(line[2]) + float(line[3])) / 2
            media_long = (float(line[4]) + float(line[5])) / 2
            coordinates[line[1]] = [media_lat, media_long]
            coordinates[line[1]] = [line[2], line[3], media_lat, line[4], line[5], media_long] #MinLat,MaxLat,media_lat,MinLon,MaxLon,media_long


for line in sys.stdin:
    line2 = line.strip()
    words = line.split("\t")
    state_lat_long = words[0]
    state, lat, long = state_lat_long.split("_")
    lat = float(lat)
    long = float(long)
    if state in coordinates:
        min_lat = float(coordinates[state][0])
        max_lat = float(coordinates[state][1])
        media_lat = float(coordinates[state][2])
        min_long = float(coordinates[state][3])
        max_long = float(coordinates[state][4])
        media_long = float(coordinates[state][5])

        
        if min_lat < lat < media_lat:
            state_quad = state + "S"
        
        elif media_lat < lat < max_lat:
            state_quad = state + "N"
        
        else:
            state_quad = "Out of range"
            
        if min_long < long < media_long:
            state_quad += "W"
        
        elif media_long < long < max_long:
            state_quad += "E"
        
        else:
            state_quad = "Out of range"
        

    print(state_quad + "\t" + "1")

Overwriting mapper_q5_2.py


### Reducer

In [22]:
%%file reducer_q5_2.py
#!/usr/bin/python
import sys 

last_state = None
count_machine = 0

for line in sys.stdin:
    line = line.strip()
    state, count = line.split("\t")
    if state != last_state:
        if last_state:
            print(last_state + "\t" + str(count_machine))
        last_state = state
        count_machine = 1
    else:
        count_machine += 1
            
if last_state:
    print(last_state + "\t" + str(count_machine))

Overwriting reducer_q5_2.py


### Hadoop standalone mode execution - Clear the output directory

In [23]:
rm -rf results_q5_2

### Submitting the job

In [24]:
!hadoop jar /opt/hadoop-3.2.0/share/hadoop/tools/lib/hadoop-*streaming*.jar -files mapper_q5_2.py,reducer_q5_2.py -mapper mapper_q5_2.py -reducer reducer_q5_2.py -input results_q5/part-* -output results_q5_2

2021-12-21 10:53:51,979 INFO impl.MetricsConfig: Loaded properties from hadoop-metrics2.properties
2021-12-21 10:53:52,080 INFO impl.MetricsSystemImpl: Scheduled Metric snapshot period at 10 second(s).
2021-12-21 10:53:52,080 INFO impl.MetricsSystemImpl: JobTracker metrics system started
2021-12-21 10:53:52,109 WARN impl.MetricsSystemImpl: JobTracker metrics system already initialized!
2021-12-21 10:53:52,394 INFO mapred.FileInputFormat: Total input files to process : 1
2021-12-21 10:53:52,419 INFO mapreduce.JobSubmitter: number of splits:1
2021-12-21 10:53:52,638 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_local1354469250_0001
2021-12-21 10:53:52,638 INFO mapreduce.JobSubmitter: Executing with tokens: []
2021-12-21 10:53:53,017 INFO mapred.LocalDistributedCacheManager: Localized file:/home/jovyan/work/shared/Projeto/mapper_q5_2.py as file:/tmp/hadoop-jovyan/mapred/local/job_local1354469250_0001_fbcadc35-43c3-473d-86c8-e0206ce352cd/mapper_q5_2.py
2021-12-21 10:53:53,083

2021-12-21 10:53:53,924 INFO mapred.Merger: Merging 1 sorted segments
2021-12-21 10:53:53,924 INFO mapred.Merger: Down to the last merge-pass, with 1 segments left of total size: 28295 bytes
2021-12-21 10:53:53,941 INFO reduce.MergeManagerImpl: Merged 1 segments, 28307 bytes to disk to satisfy reduce memory limit
2021-12-21 10:53:53,942 INFO reduce.MergeManagerImpl: Merging 1 files, 28311 bytes from disk
2021-12-21 10:53:53,943 INFO reduce.MergeManagerImpl: Merging 0 segments, 0 bytes from memory into reduce
2021-12-21 10:53:53,943 INFO mapred.Merger: Merging 1 sorted segments
2021-12-21 10:53:53,945 INFO mapred.Merger: Down to the last merge-pass, with 1 segments left of total size: 28295 bytes
2021-12-21 10:53:53,946 INFO mapred.LocalJobRunner: 1 / 1 copied.
2021-12-21 10:53:53,957 INFO streaming.PipeMapRed: PipeMapRed exec [/home/jovyan/work/shared/Projeto/./reducer_q5_2.py]
2021-12-21 10:53:53,961 INFO Configuration.deprecation: mapred.job.tracker is deprecated. Instead, use mapred

### Checking the results

In [25]:
!cat results_q5_2/part-*

AlabamaNE	5
AlabamaNW	14
AlabamaSE	5
AlabamaSW	7
AlaskaNE	4
AlaskaNW	4
AlaskaSE	2
AlaskaSW	3
ArizonaNE	2
ArizonaNW	10
ArizonaSE	16
ArizonaSW	10
ArkansasNE	2
ArkansasNW	3
ArkansasSE	1
ArkansasSW	5
CaliforniaNE	2
CaliforniaNW	85
CaliforniaSE	70
CaliforniaSW	16
ColoradoNE	25
ColoradoNW	17
ColoradoSE	4
ColoradoSW	5
ConnecticutNE	5
ConnecticutNW	2
ConnecticutSW	8
DelawareNW	4
DelawareSW	2
FloridaNE	27
FloridaNW	5
FloridaSE	23
GeorgiaNE	4
GeorgiaNW	21
GeorgiaSE	5
GeorgiaSW	7
HawaiiNE	2
HawaiiNW	2
HawaiiSE	1
IdahoNW	7
IdahoSE	3
IdahoSW	7
IllinoisNE	32
IllinoisNW	2
IllinoisSW	18
IndianaNE	18
IndianaNW	18
IndianaSE	8
IndianaSW	8
IowaNE	6
IowaSE	8
IowaSW	4
KansasNE	18
KansasNW	2
KansasSE	9
KansasSW	8
KentuckyNE	13
KentuckyNW	3
KentuckySE	2
KentuckySW	16
LouisianaNW	8
LouisianaSE	35
LouisianaSW	1
MaineNE	2
MaineNW	1
MaineSE	6
MaineSW	12
MarylandNE	16
MarylandNW	1
MassachusettsNE	11
MassachusettsNW	4
MassachusettsSE	4
Michigan