# Homework 2 - Python Higher Order Functions

### DUE: 03/01/2017 before class at 9:30am

In this homework, we will practice Python's higher-order functions (HOFs). Please note that you may only use higher-order functions **without access to global variables**. Your expression should contain only **map()**, **filter()**, **sorted()**, **reduce()** and your custom functions.
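
As a minimal sketch of the expected style, the example below builds a whole computation as one nested expression with no global state. The `count_word` and `demo_counts` names and the sample word list are made up for illustration and are not part of the assignment. On Python 3, `reduce` has to be imported from `functools`; the import is harmless on Python 2.6+.

```python
from functools import reduce  # built in on Python 2, required import on Python 3

def count_word(counts, word):
    # Reducer: fold one word into the running dict of counts.
    counts[word] = counts.get(word, 0) + 1
    return counts

def demo_counts(words):
    # One HOF expression: normalize each word, drop empty strings,
    # count occurrences, then sort the (word, count) tuples by word.
    return sorted(reduce(count_word,
                         filter(lambda w: w != '',
                                map(lambda w: w.strip().lower(), words)),
                         {}).items())

result = demo_counts(['Bike', 'dock ', '', 'bike'])
```

Here `result` is `[('bike', 2), ('dock', 1)]`. The accumulator dict is passed in as the initial value of `reduce()`, so no variable outside the expression is read or modified.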

You are required to turn in this notebook with all the parts filled in between <>. Your notebook must be named BDM\_HW2.ipynb

We will be using only the citibike data (i.e. *citibike.csv*) for this homework.

In [7]:
import csv
from functools import reduce  # built in on Python 2; must be imported on Python 3



## Task 1 (2 points)

We would like to write an HOF expression to count the total number of trip activities involving each station. For example, if a rider starts a trip at station A and ends at station B, stations A and B each receive +1 for the trip. The output must be tuples, each consisting of a station name and a total count. A portion of the expected output is included below.

* **NOTE:** a suggested solution is given below to demonstrate the use of **sorted()**

In [2]:
def mapper1(row):
    return (row['start_station_name'], row['end_station_name'])

def reducer1(counts, pair):
    for p in pair:
        counts[p] = counts.get(p, 0)+1
    return counts

with open('citibike.csv', 'r') as fi:
    reader = csv.DictReader(fi)
    output1 = sorted(reduce(reducer1, map(mapper1, reader), {}).items())

output1[:10]

[('1 Ave & E 15 St', 795),
 ('1 Ave & E 44 St', 219),
 ('10 Ave & W 28 St', 422),
 ('11 Ave & W 27 St', 354),
 ('11 Ave & W 41 St', 461),
 ('11 Ave & W 59 St', 242),
 ('12 Ave & W 40 St', 217),
 ('2 Ave & E 31 St', 588),
 ('2 Ave & E 58 St', 125),
 ('3 Ave & Schermerhorn St', 34)]
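
To see how the counting behaves without opening the CSV, the suggested solution can be run on a few in-memory rows. This is only a sketch: the `toy_rows` below are hypothetical and are not taken from *citibike.csv*.

```python
from functools import reduce  # needed on Python 3

def mapper1(row):
    return (row['start_station_name'], row['end_station_name'])

def reducer1(counts, pair):
    # Each station named in the pair gets +1.
    for p in pair:
        counts[p] = counts.get(p, 0) + 1
    return counts

# Hypothetical rows standing in for csv.DictReader output.
toy_rows = [
    {'start_station_name': 'A', 'end_station_name': 'B'},
    {'start_station_name': 'B', 'end_station_name': 'C'},
    {'start_station_name': 'A', 'end_station_name': 'A'},
]
toy_output1 = sorted(reduce(reducer1, map(mapper1, toy_rows), {}).items())
```

`toy_output1` is `[('A', 3), ('B', 2), ('C', 1)]`. Note that the round trip from A to A adds 2 to station A, because the start and the end are counted separately.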


## Task 2 (2 points)

Next, we would like to do the same task as Task 1, but keep only the stations involved in more than 1000 trips. Please add your HOF expression below.

In [3]:
with open('citibike.csv', 'r') as fi:
    reader = csv.DictReader(fi)
    output2 = sorted(filter(lambda x: x[1]>1000, reduce(reducer1, map(mapper1, reader), {}).items()))

output2

[('8 Ave & W 31 St', 1065),
 ('E 43 St & Vanderbilt Ave', 1003),
 ('Lafayette St & E 8 St', 1013),
 ('W 21 St & 6 Ave', 1057),
 ('W 41 St & 8 Ave', 1095)]


## Task 3 (2 points)

We would like to count the number of trips taken between pairs of stations. Trips taken from station A to station B or from station B to station A are both counted towards the station pair A and B. *Please note that the station pair should be identified by station names, as a tuple, and **in lexical order**, i.e. **(A,B)** instead of ~~(B,A)~~ in this case*. The output must be tuples, each consisting of the station pair identification and a count. A portion of the expected output is included below. Please provide your HOF expression.

In [4]:
def mapper3(row):
    return tuple(sorted((row['start_station_name'], row['end_station_name'])))

def reducer3(counts, pair):
    counts[pair] = counts.get(pair, 0)+1
    return counts

with open('citibike.csv', 'r') as fi:
    reader = csv.DictReader(fi)
    output3 = sorted(reduce(reducer3, map(mapper3, reader), {}).items())

output3[:10]

[(('1 Ave & E 15 St', '1 Ave & E 15 St'), 5),
 (('1 Ave & E 15 St', '1 Ave & E 44 St'), 6),
 (('1 Ave & E 15 St', '11 Ave & W 27 St'), 1),
 (('1 Ave & E 15 St', '2 Ave & E 31 St'), 9),
 (('1 Ave & E 15 St', '5 Ave & E 29 St'), 2),
 (('1 Ave & E 15 St', '6 Ave & Broome St'), 3),
 (('1 Ave & E 15 St', '6 Ave & Canal St'), 1),
 (('1 Ave & E 15 St', '8 Ave & W 31 St'), 5),
 (('1 Ave & E 15 St', '9 Ave & W 14 St'), 3),
 (('1 Ave & E 15 St', '9 Ave & W 16 St'), 3)]
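
The key step is that `mapper3` sorts the two names, so both directions of travel collapse into the same dictionary key. A small check on hypothetical rows (not taken from *citibike.csv*) might look like this:

```python
from functools import reduce  # needed on Python 3

def mapper3(row):
    # Sorting the two names gives one canonical key per unordered pair.
    return tuple(sorted((row['start_station_name'], row['end_station_name'])))

def reducer3(counts, pair):
    counts[pair] = counts.get(pair, 0) + 1
    return counts

# Hypothetical rows: B->A and A->B should land on the same pair ('A', 'B').
toy_rows = [
    {'start_station_name': 'B', 'end_station_name': 'A'},
    {'start_station_name': 'A', 'end_station_name': 'B'},
    {'start_station_name': 'C', 'end_station_name': 'A'},
]
toy_output3 = sorted(reduce(reducer3, map(mapper3, toy_rows), {}).items())
```

`toy_output3` is `[(('A', 'B'), 2), (('A', 'C'), 1)]`. The conversion to `tuple` matters because `sorted()` returns a list, and lists cannot be used as dictionary keys.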


## Task 4 (2 points)

Next, we would like to further process the output from Task 3 to determine the station popularity among all of the station pairs that have 35 or more trips. The popularity of a station is calculated by how many times it appears on the list. In other words, we would like to first filter the station pairs to only those that have 35 or more trips. Then, among these pairs, we count how many times each station appears and report back the counts. The output will be tuples, each consisting of the station name and a count. The expected output is included below. As illustrated, *W 41 St & 8 Ave* station is the most "popular" with 4 appearances. Please provide your HOF expression below. You can use the output3 from the previous task.

In [5]:
output4 = sorted(reduce(reducer1, map(lambda x: x[0], filter(lambda x: x[1]>=35, output3)), {}).items())
output4

[('10 Ave & W 28 St', 1),
 ('11 Ave & W 27 St', 2),
 ('11 Ave & W 41 St', 1),
 ('8 Ave & W 31 St', 3),
 ('8 Ave & W 33 St', 1),
 ('9 Ave & W 22 St', 1),
 ('Adelphi St & Myrtle Ave', 1),
 ('DeKalb Ave & Hudson Ave', 1),
 ('E 10 St & Avenue A', 1),
 ('E 24 St & Park Ave S', 2),
 ('E 27 St & 1 Ave', 1),
 ('E 32 St & Park Ave', 1),
 ('E 33 St & 2 Ave', 2),
 ('E 43 St & Vanderbilt Ave', 2),
 ('E 47 St & Park Ave', 1),
 ('E 6 St & Avenue B', 1),
 ('E 7 St & Avenue A', 1),
 ('Lafayette St & E 8 St', 3),
 ('Pershing Square North', 1),
 ('Pershing Square South', 2),
 ('Vesey Pl & River Terrace', 1),
 ('W 17 St & 8 Ave', 1),
 ('W 20 St & 11 Ave', 2),
 ('W 21 St & 6 Ave', 1),
 ('W 26 St & 8 Ave', 1),
 ('W 31 St & 7 Ave', 2),
 ('W 33 St & 7 Ave', 2),
 ('W 41 St & 8 Ave', 4),
 ('West Thames St', 1)]
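
The filter-then-count chain can be traced on a small hand-made list. The `toy_output3` values below are hypothetical and only mimic the shape of `output3`.

```python
from functools import reduce  # needed on Python 3

def reducer1(counts, pair):
    # Same reducer as in Task 1: every station in the pair gets +1.
    for p in pair:
        counts[p] = counts.get(p, 0) + 1
    return counts

# Hypothetical (pair, trip count) tuples in the shape of output3.
toy_output3 = [(('A', 'B'), 40), (('A', 'C'), 35), (('B', 'C'), 3), (('C', 'C'), 50)]

toy_output4 = sorted(reduce(reducer1,
                            map(lambda x: x[0],                 # keep only the pair
                                filter(lambda x: x[1] >= 35,    # 35 or more trips
                                       toy_output3)),
                            {}).items())
```

`toy_output4` is `[('A', 2), ('B', 1), ('C', 3)]`. The pair `('B', 'C')` is dropped by the filter, and the self-pair `('C', 'C')` contributes 2 to station C because `reducer1` counts both positions. The assignment text does not say whether a self-pair should count once or twice, so treat that behavior as a property of this particular reducer.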


## Task 5 (2 points)

In this task, you are asked to find, for each gender among *'Subscriber'* users, the station where the most trips started. That is, which station had the highest number of bike pickups by female riders, by male riders, and by riders of unknown gender.

The output will be a list of tuples, each consisting of a gender label (as indicated below) and another tuple holding a station name and the total number of trips started at that station for that gender. The expected output is included below. Please provide your HOF expression below.

The label mapping for the gender column in citibike.csv is: (0=**Unknown**; 1=**Male**; 2=**Female**)

In [9]:
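def mapper5(row):
    # The boolean acts as a count: True adds 1, False adds 0
    return ((int(row['gender']), row['start_station_name']), row['usertype']=='Subscriber')

def reducer5(gc, item):
    # Unpack in the body: tuple parameters are not valid syntax in Python 3
    gender_station, count = item
    gc[gender_station] = gc.get(gender_station, 0) + count
    return gc

def mapper6(item):
    gender, station_count = item
    label = ('Unknown', 'Male', 'Female')
    return (label[gender], station_count)

def reducer6(gc, item):
    # Keep the station with the largest count seen so far for each gender
    (gender, station), count = item
    if count>gc.get(gender, (None, 0))[1]:
        gc[gender] = (station, count)
    return gc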

with open('citibike.csv', 'r') as fi:
    reader = csv.DictReader(fi)
    output5 = sorted(map(mapper6, 
                         reduce(reducer6, 
                                reduce(reducer5, 
                                       map(mapper5, reader), {}
                                      ).items(), {}
                               ).items()
                        )
                    )

output5

[('Female', ('W 21 St & 6 Ave', 107)),
 ('Male', ('8 Ave & W 31 St', 488)),
 ('Unknown', ('Fulton St & William St', 1))]
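
The second `reduce()` in this solution is a "keep the maximum per key" fold. The sketch below isolates that step on hypothetical per-station counts. The `keep_max` name and the `toy_counts` values are illustrative only, and the reducer unpacks its argument in the body so that it runs on both Python 2 and Python 3.

```python
from functools import reduce  # needed on Python 3

def keep_max(best, item):
    # item is ((gender, station), count); remember the largest count per gender.
    (gender, station), count = item
    if count > best.get(gender, (None, 0))[1]:
        best[gender] = (station, count)
    return best

# Hypothetical subscriber counts per (gender, station), i.e. the shape
# produced by the first reduce step.
toy_counts = [((1, 'A'), 2), ((1, 'B'), 5), ((2, 'B'), 1), ((0, 'C'), 0)]
toy_best = reduce(keep_max, toy_counts, {})
```

`toy_best` is `{1: ('B', 5), 2: ('B', 1)}`. Two details follow from the strict `>` comparison against a default of 0: a gender whose every count is 0 never enters the result, and when two stations tie, the one encountered first is kept.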