## Higher Order Functions

In this assignment, I practiced using Python's higher-order functions to process two data sets. The goal was to use map, filter, and reduce, without relying on global variables, to perform calculations on the data in a way that could be applied at scale.

In [1]:
import csv
from functools import reduce
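
As a quick illustration of how the three functions chain together, here is a minimal sketch on toy data. The `scores` list and the passing threshold are hypothetical and are not part of either data set; none of the functions reads or writes shared state.

```python
from functools import reduce

# Hypothetical toy data, for illustration only
scores = [55, 72, 90, 48, 81]

# filter keeps the items for which the predicate is true
passing = filter(lambda s: s >= 60, scores)

# map transforms each remaining item (here: to a fraction of 100)
fractions = map(lambda s: s / 100, passing)

# reduce folds the items into one value, starting from the initial value 0.0
total = reduce(lambda acc, f: acc + f, fractions, 0.0)

print(round(total, 2))
```

Because map and filter are lazy in Python 3, nothing is computed until reduce consumes the chain.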

### Student Performance

In this task, I used HOFs and NYC's graduation outcomes data set to look at the correlation between the share of students who dropped out and the share who graduated with advanced regents, by borough and cohort year.

In [2]:
# Keep only the rows that hold each borough's totals
def filterer(x):
    return x['Demographic'] == 'Borough Total'

# Get the borough, cohort year, fraction of students with advanced regents, and fraction who dropped out
def mapper(x):
    total = float(x['Total Cohort'])
    return (x['Borough'], x['Cohort'],
            float(x['Advanced Regents']) / total,
            float(x['Dropped Out']) / total)

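# Apply both functions using map and filter. The result is materialized with list()
# while the file is still open, because map and filter are lazy in Python 3.
with open('data/nyc_grads.csv', 'r') as fi:
    reader = csv.DictReader(fi)
    output1 = list(map(mapper, filter(filterer, reader)))

output1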

[('Bronx', '2001', 0.08713874094123811, 0.21286999039552956),
 ('Bronx', '2002', 0.08244680851063829, 0.1778590425531915),
 ('Bronx', '2003', 0.09206279342723005, 0.1813380281690141),
 ('Bronx', '2004', 0.09711779448621553, 0.16033138401559455),
 ('Bronx', '2005', 0.10174629324546952, 0.1414827018121911),
 ('Bronx', '2006', 0.10000641889723345, 0.1541819115475961),
 ('Brooklyn', '2001', 0.14172636641450828, 0.1776965081909724),
 ('Brooklyn', '2002', 0.1376874279123414, 0.16190888119953864),
 ('Brooklyn', '2003', 0.15182338051935876, 0.14990156557607576),
 ('Brooklyn', '2004', 0.1673600858945108, 0.13300228157294322),
 ('Brooklyn', '2005', 0.16201692714164168, 0.11544489722806861),
 ('Brooklyn', '2006', 0.1676060783694819, 0.12314560129864274),
 ('Manhattan', '2001', 0.14609313338595106, 0.1548539857932123),
 ('Manhattan', '2002', 0.13904776052885687, 0.1294659436975414),
 ('Manhattan', '2003', 0.18207363642913754, 0.1245766986094099),
 ('Manhattan', '2004', 0.18582666754809282, 0.12176


### Overall Rates

Using the output above, I then found each borough's average rates across all cohort years.

In [3]:
# Build a dict keyed by borough; each value holds the row count ('count'),
# the summed advanced regents rates ('ara'), and the summed dropout rates ('ard')
def reducer(x, y):
    value = x.get(y[0], {'count':0,'ara':0,'ard':0})
    value['count'] += 1
    value['ara'] += y[2]
    value['ard'] += y[3]
    x[y[0]] = value    
    return x

# Return only each borough, the average advanced regents rate, and average dropout rate
def mapper2(x):
    return x[0], x[1]['ara']/x[1]['count'], x[1]['ard']/x[1]['count']

output2 = sorted(map(mapper2, reduce(reducer, output1, {}).items()))
output2

[('Bronx', 0.0934198082513375, 0.17134384308218617),
 ('Brooklyn', 0.15470337770864048, 0.14351662251104022),
 ('Manhattan', 0.16519452307558916, 0.1223416853729485),
 ('Queens', 0.1769903541049419, 0.13903528573260707),
 ('Staten Island', 0.23144827521877342, 0.09375060031471814)]
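
One possible way to put a number on the relationship between the two rates is a Pearson correlation coefficient computed with a single reduce pass. The sketch below is an assumption about how that could be done, not necessarily the method used in the assignment; the names `sum_reducer` and `pearson` are hypothetical, and the input is the five borough averages printed above, so the sample is very small.

```python
from functools import reduce
from math import sqrt

# Borough averages copied from the output above:
# (borough, average advanced regents rate, average dropout rate)
borough_rates = [
    ('Bronx', 0.0934198082513375, 0.17134384308218617),
    ('Brooklyn', 0.15470337770864048, 0.14351662251104022),
    ('Manhattan', 0.16519452307558916, 0.1223416853729485),
    ('Queens', 0.1769903541049419, 0.13903528573260707),
    ('Staten Island', 0.23144827521877342, 0.09375060031471814),
]

def sum_reducer(acc, row):
    # Accumulate n, sum(x), sum(y), sum(x^2), sum(y^2), sum(x*y) in one pass
    _, x, y = row
    n, sx, sy, sxx, syy, sxy = acc
    return (n + 1, sx + x, sy + y, sxx + x * x, syy + y * y, sxy + x * y)

def pearson(rows):
    # Pearson r from the running sums; returns a new tuple each step, no shared state
    n, sx, sy, sxx, syy, sxy = reduce(sum_reducer, rows, (0, 0.0, 0.0, 0.0, 0.0, 0.0))
    return (n * sxy - sx * sy) / sqrt((n * sxx - sx ** 2) * (n * syy - sy ** 2))

r = pearson(borough_rates)
print(r)
```

A negative value of `r` would indicate that boroughs with higher advanced regents rates tend to have lower dropout rates.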

### Sales Data

Here, I used map and reduce to read a CSV file of sales data and output, for each product, the number of unique customers who bought it and the total revenue it brought in.

In [4]:
# For each product ID, collect the set of unique customer IDs and add up every revenue value.
def sale_reduce(out_dict, in_data):
    value = out_dict.get(in_data['Product ID'], {'customers':set(),'revenue':0})
    value['customers'].add(in_data['Customer ID'])
    value['revenue'] += float(in_data['Item Cost'])
    out_dict[in_data['Product ID']] = value 
    return out_dict

# Return each product ID and get the number of unique customers plus the total revenue
def sale_mapper(in_dict):
    return in_dict[0], len(in_dict[1]['customers']), round(in_dict[1]['revenue'], 2)

# Apply both functions to the data
with open('data/sale.csv', 'r') as fi:
    reader = csv.DictReader(fi)
    output3 = sorted(map(sale_mapper, reduce(sale_reduce, reader, {}).items()))

output3

[('P02291', 16, 1181.97),
 ('P19498', 17, 989.99),
 ('P32565', 17, 1006.09),
 ('P33162', 18, 1210.92),
 ('P39328', 17, 1129.01),
 ('P58225', 17, 1349.82),
 ('P61235', 18, 959.02),
 ('P76615', 18, 1087.96),
 ('P82222', 17, 950.05),
 ('P92449', 14, 966.17)]
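
To check the reduce-then-map pattern without the data file, the same logic can be run on a small in-memory CSV. This is a minimal sketch: the sample rows are hypothetical, only the column names come from the code above, and `setdefault` is swapped in for the `get`-then-assign step.

```python
import csv
import io
from functools import reduce

# Hypothetical sample rows; only the column names come from the code above
sample_csv = io.StringIO(
    "Customer ID,Product ID,Item Cost\n"
    "C1,P1,10.00\n"
    "C2,P1,5.50\n"
    "C1,P1,4.50\n"
    "C3,P2,7.25\n"
)

def sale_reduce(out_dict, in_data):
    # Create the product's entry on first sight, then update it in place
    value = out_dict.setdefault(in_data['Product ID'], {'customers': set(), 'revenue': 0.0})
    value['customers'].add(in_data['Customer ID'])
    value['revenue'] += float(in_data['Item Cost'])
    return out_dict

def sale_mapper(item):
    # item is a (product ID, stats dict) pair from dict.items()
    product_id, stats = item
    return product_id, len(stats['customers']), round(stats['revenue'], 2)

sample_output = sorted(
    map(sale_mapper, reduce(sale_reduce, csv.DictReader(sample_csv), {}).items())
)
print(sample_output)
```

Customer C1 buys product P1 twice in the sample, so the set counts that customer once while both purchases are added to the revenue.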