# MapReduce Exercises

## Due Thurs, May 5 at 8 AM

In this lab, you will use MapReduce to redo a few exercises that you previously did in Pandas. You should only edit the mapper and reducer functions in each question.

After completing this lab, you should have a pretty good understanding of the similarities and differences between Pandas and MapReduce.

## Question 1: Shark Tank (20 points)

Write a MapReduce job to calculate the number of funded companies and the number of total companies, sliced by industry and gender. 

(Make sure you understand the definition of "slicing" before doing this problem. If you've forgotten, take a look at the slides on "Conditioning and Slicing".)
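As a rough illustration of slicing in MapReduce terms, the sketch below simulates the map, shuffle, and reduce phases in plain Python, without mrjob. The rows are hypothetical and not taken from the Shark Tank data, and the names `map_row` and `reduce_slice` are made up for this example. The point is that the slice, the pair (industry, gender), becomes the key.

```python
from collections import defaultdict

# Hypothetical rows: (industry, gender, deal). Not real data.
rows = [
    ("Food", "Female", "Yes"),
    ("Food", "Female", "No"),
    ("Food", "Male", "Yes"),
    ("Toys", "Male", "No"),
]

def map_row(row):
    industry, gender, deal = row
    # Key: the slice. Value: (funded indicator, 1 toward the total).
    yield (industry, gender), (int(deal == "Yes"), 1)

def reduce_slice(key, values):
    num_funded = 0
    num_total = 0
    for funded, one in values:
        num_funded += funded
        num_total += one
    yield key, (num_funded, num_total)

# "Shuffle": collect every mapped value under its key.
grouped = defaultdict(list)
for row in rows:
    for key, value in map_row(row):
        grouped[key].append(value)

# Reduce each key's values to a single (funded, total) pair.
results = {}
for key in sorted(grouped):
    for out_key, out_value in reduce_slice(key, grouped[key]):
        results[out_key] = out_value

print(results)
```

In a real job the shuffle step is done by the framework; only the two generator functions correspond to code you would write.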

In [37]:
%%file sharktank.py

# Hint 1: the mapper yields one pair per company, keyed by the slice:
#   yield (industry, gender), (1, 1)   # funded
#   yield (industry, gender), (0, 1)   # not funded

# Hint 2: don't keep track of column names, because the rows will,
# in theory, be split among different machines.

# Hint 3: the reducer receives a generator of values for each key, e.g.
#   values = [(1, 1), (0, 1), (0, 1), (1, 1), (0, 1)]
#
#   num_funded = 0
#   num_total = 0
#   for x, y in values:
#       num_funded += x
#       num_total += y
#
#   yield key, (num_funded, num_total)

# Hint 4:
#   industry = ...
#   gender = ...
#   yield industry + gender
    

from mrjob.job import MRJob
import csv

class CompanyCount(MRJob):

    def mapper(self, _, line):

        # csv.reader copes with quoted fields that contain commas;
        # wrapping the line in a list lets it parse this single row.
        row = next(csv.reader([line]))
        deal = row[3]
        industry = row[4]
        gender = row[5]

        # Emit (funded flag, 1) so the reducer can sum both counts at once.
        yield (industry, gender), (int(deal == 'Yes'), 1)

    def reducer(self, key, values):
        
        num_funded = 0
        num_total = 0
        
        for x,y in values:
            
            num_funded += x
            num_total += y
        
        
        yield key, (num_funded,num_total)
        

if __name__ == '__main__':
    CompanyCount.run()

Overwriting sharktank.py


In [38]:
! python3 sharktank.py /data/sharktank.csv

No configs found; falling back on auto-configuration
Creating temp directory /tmp/sharktank.wileong.20160504.215605.034874
Running step 1 of 1...
Streaming final output from /tmp/sharktank.wileong.20160504.215605.034874/output...
["Business Services","Female"]	[0,2]
["Business Services","Male"]	[2,9]
["Business Services","Mixed Team"]	[1,2]
["Children \/ Education","Female"]	[14,20]
["Children \/ Education","Male"]	[9,22]
["Children \/ Education","Mixed Team"]	[6,13]
["Consumer Products","Female"]	[1,1]
["Consumer Products","Male"]	[6,13]
["Consumer Products","Mixed Team"]	[3,5]
["Fashion \/ Beauty","Female"]	[18,37]
["Fashion \/ Beauty","Male"]	[19,41]
["Fashion \/ Beauty","Mixed Team"]	[6,15]
["Fitness \/ Sports","Female"]	[4,7]
["Fitness \/ Sports","Male"]	[18,29]
["Fitness \/ Sports","Mixed Team"]	[1,4]
["Food and Beverage","Female"]	[15,28]
["Food and Beverage","Male"]	[30,59]
["Food and Beverage","Mixed Team"]	[11,17]
["Green\/CleanTech","Male"]	[5,8]
["Green\/CleanTech","Mixed T

## Question 2: Movielens 1 (20 points)

Write a MapReduce job to calculate the number of movies in each genre. (Note: A movie may belong to more than one genre.)
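Because a movie can belong to several genres, one input line can produce several key-value pairs. The sketch below shows that step on its own, assuming each line has the layout `MovieID::Title::Genres` with genres separated by `|`. The sample line and the helper name `genres_of` are illustrative assumptions, not part of the lab.

```python
def genres_of(line):
    # Assumed layout: MovieID::Title::Genres, genres separated by '|'.
    # The genres are taken from the last '::'-separated field.
    return line.split("::")[-1].split("|")

# Illustrative sample line in the assumed format.
sample = "1::Toy Story (1995)::Animation|Children's|Comedy"

# One (genre, 1) pair per genre, as a mapper would emit them.
pairs = [(genre, 1) for genre in genres_of(sample)]
print(pairs)
```

A movie listed under three genres therefore adds one to each of three counts.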

In [49]:
%%file movielens1.py

from mrjob.job import MRJob

class MovieCount(MRJob):

    def mapper(self, _, line):
        
        
        # Fields are separated by '::'; the last field holds the
        # movie's genres, separated by '|'.
        genres = line.split('::')[-1].split('|')

        # Emit one pair per genre, so a movie with several genres
        # is counted once in each of them.
        for genre in genres:
            yield genre, 1

    def reducer(self, key, values):
        
        num_movies = 0
        
        for x in values:
            
            num_movies+=x
            
        yield key, num_movies
           
        

if __name__ == '__main__':
    MovieCount.run()

Overwriting movielens1.py


In [50]:
! python3 movielens1.py /data/movielens/movies.dat

No configs found; falling back on auto-configuration
Creating temp directory /tmp/movielens1.wileong.20160504.220631.832450
Running step 1 of 1...
Streaming final output from /tmp/movielens1.wileong.20160504.220631.832450/output...
"Action"	503
"Adventure"	283
"Animation"	105
"Children's"	251
"Comedy"	1200
"Crime"	211
"Documentary"	127
"Drama"	1603
"Fantasy"	68
"Film-Noir"	44
"Horror"	343
"Musical"	114
"Mystery"	106
"Romance"	471
"Sci-Fi"	276
"Thriller"	492
"War"	143
"Western"	68
Removing temp directory /tmp/movielens1.wileong.20160504.220631.832450...


## Question 3: Movielens 2 (20 points)

Write a MapReduce job to calculate the number of ratings each movie had.
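For a quick local cross-check of the counting logic, a plain-Python version can be useful. This sketch assumes each line has the layout `UserID::MovieID::Rating::Timestamp`; the sample lines are made up for illustration, and `count_ratings` is a hypothetical helper. It also shows that movie IDs kept as strings sort lexicographically rather than numerically.

```python
from collections import Counter

def count_ratings(lines):
    """Count lines per movie ID (assumed to be the second '::' field)."""
    return Counter(line.split("::")[1] for line in lines)

# Made-up sample lines in the assumed format.
sample_lines = [
    "1::1193::5::978300000",
    "1::661::3::978300001",
    "2::1193::4::978300002",
    "2::10::4::978300003",
]

counts = count_ratings(sample_lines)

# String keys sort character by character, so '1193' comes before '661'.
string_order = sorted(counts)
numeric_order = sorted(counts, key=int)
print(counts, string_order, numeric_order)
```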

In [31]:
%%file movielens2.py

from mrjob.job import MRJob

class RatingsCount(MRJob):

    def mapper(self, _, line):
        # Split the line on '::'; the movie ID is the second field.
        fields = line.split('::')
        movie_id = fields[1]

        # Each line is exactly one rating, so emit 1 per line.
        # (Calling .count() on the fields would instead count how many
        # fields equal the rating value, which goes wrong when, say,
        # the user ID and the rating are both 3.)
        yield movie_id, 1

    def reducer(self, key, values):
        # Add up the 1s emitted for this movie ID.
        num_ratings = 0
        for x in values:
            num_ratings += x

        yield key, num_ratings

if __name__ == '__main__':
    RatingsCount.run()

Overwriting movielens2.py


In [32]:
! python3 movielens2.py /data/movielens/ratings.dat

No configs found; falling back on auto-configuration
Creating temp directory /tmp/movielens2.wileong.20160504.214621.109438
Running step 1 of 1...
Streaming final output from /tmp/movielens2.wileong.20160504.214621.109438/output...
"1"	2077
"10"	888
"100"	128
"1000"	20
"1002"	8
"1003"	121
"1004"	101
"1005"	142
"1006"	78
"1007"	232
"1008"	97
"1009"	291
"101"	253
"1010"	242
"1011"	135
"1012"	301
"1013"	258
"1014"	136
"1015"	234
"1016"	156
"1017"	276
"1018"	123
"1019"	575
"102"	60
"1020"	392
"1021"	247
"1022"	577
"1023"	221
"1024"	126
"1025"	293
"1026"	8
"1027"	344
"1028"	1011
"1029"	568
"103"	33
"1030"	323
"1031"	319
"1032"	525
"1033"	287
"1034"	169
"1035"	882
"1036"	1666
"1037"	615
"1038"	11
"1039"	3
"104"	682
"1040"	5
"1041"	474
"1042"	528
"1043"	108
"1044"	35
"1046"	90
"1047"	434
"1049"	451
"105"	387
"1050"	146
"1051"	98
"1053"	35
"1054"	81
"1055"	19
"1056"	50
"1057"	274
"1058"	11
"1059"	479
"106"	12
"1060"	786
"1061"	374
"1062"	2
"1063"	26
"1064"	209
"1066"	175
"1067"	23
"1068"	35
"1

## Question 4: Baby Names (20 points)

Write a MapReduce job to count the number of babies born between 2000 and 2009 whose names begin with each letter. (In other words, I want to know how many babies had names beginning with the letter 'A', and so on.)

In [51]:
%%file babynames.py

from mrjob.job import MRJob

class BabiesCount(MRJob):

    def mapper(self, _, line):
        
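        # Split the comma-separated line; the name is the first field.
        fields = line.split(',')
        first_letter = fields[0][0]

        # Emits 1 per input line, so the reducer's total is the number
        # of input lines whose name starts with this letter.
        yield first_letter, 1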

    def reducer(self, key, values):
        
        yield key, sum(values)
        
        

if __name__ == '__main__':
    BabiesCount.run()

Overwriting babynames.py


In [52]:
! python3 babynames.py /data/babynames/yob200*.txt

No configs found; falling back on auto-configuration
Creating temp directory /tmp/babynames.wileong.20160505.022234.086346
Running step 1 of 1...
Streaming final output from /tmp/babynames.wileong.20160505.022234.086346/output...
"A"	39530
"B"	11993
"C"	18279
"D"	21703
"E"	11847
"F"	3551
"G"	6701
"H"	6843
"I"	5565
"J"	31372
"K"	29503
"L"	15058
"M"	26545
"N"	12030
"O"	3049
"P"	4259
"Q"	1347
"R"	14110
"S"	23719
"T"	19599
"U"	643
"V"	3076
"W"	2108
"X"	875
"Y"	5362
"Z"	6387
Removing temp directory /tmp/babynames.wileong.20160505.022234.086346...
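
The job above emits 1 per input line, so its totals are the number of lines per first letter. If each line has the form `name,sex,count` (an assumption about the `yob` files that this lab does not state), counting babies rather than lines would mean emitting the count field instead. A minimal plain-Python sketch of that variant, with made-up sample lines and hypothetical helper names:

```python
from collections import defaultdict

def map_line(line):
    # Assumed layout: name,sex,count
    name, sex, count = line.split(",")
    # Emit the number of babies on this line, not 1.
    yield name[0], int(count)

def total_by_letter(lines):
    # Stand-in for the shuffle and reduce steps: sum values per key.
    totals = defaultdict(int)
    for line in lines:
        for letter, n in map_line(line):
            totals[letter] += n
    return dict(totals)

# Made-up sample lines; the numbers are not real.
sample = ["Emily,F,100", "Emma,F,50", "Jacob,M,120"]
totals = total_by_letter(sample)
print(totals)
```

In the mrjob version, the only change would be the value the mapper yields; the reducer's `sum(values)` stays the same.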
