# Manipulating Collections using Map Reduce APIs – Python 3

Now that we understand collections and how to manipulate them using traditional loops, let us look at existing APIs such as map, filter and reduce to process collection data.

* Define problem statements
* Develop myFilter, myMap and myReduce APIs
* Understanding existing packages and APIs
* Developing Solutions using Map Reduce APIs

## Define Problem Statements
Let us look at a few similar problem statements and understand how we can build solutions using conventional loops.

* Filtering
    * Get COMPLETE orders from the orders data set
    * Get orders placed on 2013-07-25
    * Get order items for a given order id
    * In all 3 cases we need to iterate through the collection, filter it based on the criteria and return a new collection.

In [1]:
# 01-loops-filtering-by-order-status.py
ordersPath = "/data/retail_db/orders/part-00000"

def readData(dataPath):
    # Read the whole file and return its lines as a list of strings
    with open(dataPath) as dataFile:
        return dataFile.read().splitlines()

orders = readData(ordersPath)

# Keep only the orders whose status (4th field) is COMPLETE
ordersFiltered = []
for order in orders:
    if order.split(",")[3] == "COMPLETE":
        ordersFiltered.append(order)

ordersFiltered[:10]

['3,2013-07-25 00:00:00.0,12111,COMPLETE',
 '5,2013-07-25 00:00:00.0,11318,COMPLETE',
 '6,2013-07-25 00:00:00.0,7130,COMPLETE',
 '7,2013-07-25 00:00:00.0,4530,COMPLETE',
 '15,2013-07-25 00:00:00.0,2568,COMPLETE',
 '17,2013-07-25 00:00:00.0,2667,COMPLETE',
 '22,2013-07-25 00:00:00.0,333,COMPLETE',
 '26,2013-07-25 00:00:00.0,7562,COMPLETE',
 '28,2013-07-25 00:00:00.0,656,COMPLETE',
 '32,2013-07-25 00:00:00.0,3960,COMPLETE']

In [2]:
# 02-loops-filtering-by-order-date.py
ordersPath = "/data/retail_db/orders/part-00000"

def readData(dataPath):
    # Read the whole file and return its lines as a list of strings
    with open(dataPath) as dataFile:
        return dataFile.read().splitlines()

orders = readData(ordersPath)

# Keep only the orders whose date (2nd field) is 2013-07-25
ordersFiltered = []
for order in orders:
    if order.split(",")[1] == "2013-07-25 00:00:00.0":
        ordersFiltered.append(order)

ordersFiltered[:10]

['1,2013-07-25 00:00:00.0,11599,CLOSED',
 '2,2013-07-25 00:00:00.0,256,PENDING_PAYMENT',
 '3,2013-07-25 00:00:00.0,12111,COMPLETE',
 '4,2013-07-25 00:00:00.0,8827,CLOSED',
 '5,2013-07-25 00:00:00.0,11318,COMPLETE',
 '6,2013-07-25 00:00:00.0,7130,COMPLETE',
 '7,2013-07-25 00:00:00.0,4530,COMPLETE',
 '8,2013-07-25 00:00:00.0,2911,PROCESSING',
 '9,2013-07-25 00:00:00.0,5657,PENDING_PAYMENT',
 '10,2013-07-25 00:00:00.0,5648,PENDING_PAYMENT']

In [3]:
# 03-loops-filtering-by-order-id.py
orderItemsPath = "/data/retail_db/order_items/part-00000"

def readData(dataPath):
    # Read the whole file and return its lines as a list of strings
    with open(dataPath) as dataFile:
        return dataFile.read().splitlines()

orderItems = readData(orderItemsPath)

# Keep only the order items whose order id (2nd field) is 2
orderItemsFiltered = []
for orderItem in orderItems:
    if int(orderItem.split(",")[1]) == 2:
        orderItemsFiltered.append(orderItem)

orderItemsFiltered[:10]

['2,2,1073,1,199.99,199.99', '3,2,502,5,250.0,50.0', '4,2,403,1,129.99,129.99']

* Mapping
    * Get order_id and order_status from orders (1st and 4th fields of orders data)
    * Get order_item_order_id and order_item_subtotal from order_items (2nd and 5th fields of order_items data)
    * Get order_month from orders data (extract year and month from the 2nd field)
    * In all 3 cases we need to iterate through the collection, transform individual records and add them to a new collection.

In [5]:
# 01-loops-get-order-id-and-order-status.py
ordersPath = "/data/retail_db/orders/part-00000"

def readData(dataPath):
    # Read the whole file and return its lines as a list of strings
    with open(dataPath) as dataFile:
        return dataFile.read().splitlines()

orders = readData(ordersPath)

# Build (order_id, order_status) tuples from the 1st and 4th fields
ordersMap = []
for order in orders:
    fields = order.split(",")
    ordersMap.append((int(fields[0]), fields[3]))

ordersMap[:10]


[(1, 'CLOSED'),
 (2, 'PENDING_PAYMENT'),
 (3, 'COMPLETE'),
 (4, 'CLOSED'),
 (5, 'COMPLETE'),
 (6, 'COMPLETE'),
 (7, 'COMPLETE'),
 (8, 'PROCESSING'),
 (9, 'PENDING_PAYMENT'),
 (10, 'PENDING_PAYMENT')]

In [6]:
# 02-loops-get-order-id-and-subtotal.py
orderItemsPath = "/data/retail_db/order_items/part-00000"

def readData(dataPath):
    # Read the whole file and return its lines as a list of strings
    with open(dataPath) as dataFile:
        return dataFile.read().splitlines()

orderItems = readData(orderItemsPath)

# Build (order_item_order_id, order_item_subtotal) tuples from the 2nd and 5th fields
orderItemsMap = []
for orderItem in orderItems:
    fields = orderItem.split(",")
    orderItemsMap.append((int(fields[1]), float(fields[4])))

orderItemsMap[:10]

[(1, 299.98),
 (2, 199.99),
 (2, 250.0),
 (2, 129.99),
 (4, 49.98),
 (4, 299.95),
 (4, 150.0),
 (4, 199.92),
 (5, 299.98),
 (5, 299.95)]

In [7]:
# 03-loops-get-order-month.py
ordersPath = "/data/retail_db/orders/part-00000"

def readData(dataPath):
    # Read the whole file and return its lines as a list of strings
    with open(dataPath) as dataFile:
        return dataFile.read().splitlines()

orders = readData(ordersPath)

# The first 7 characters of the order date (2nd field) give the month as YYYY-MM
ordersMap = []
for order in orders:
    ordersMap.append(order.split(",")[1][:7])

ordersMap[:10]

['2013-07',
 '2013-07',
 '2013-07',
 '2013-07',
 '2013-07',
 '2013-07',
 '2013-07',
 '2013-07',
 '2013-07',
 '2013-07']

* Reduce (on order item subtotals that have been filtered by order_id and mapped)
    * Get total revenue by adding all the subtotals
    * Get the minimum order item subtotal
    * Get the maximum order item subtotal
    * In all 3 cases we need to initialize an aggregator, loop through the values in the collection and update the aggregator with each value (add it for the total, compare it for the minimum and maximum).
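The filtering and mapping cases above come with loop-based code, so here is a rough sketch of the three reduce cases in the same style. To keep it runnable without the data files, it assumes the `(order_id, subtotal)` pairs shown in the mapped output earlier in this section. The order id `2` and the variable names `orderId`, `subtotals`, `totalRevenue`, `minSubtotal` and `maxSubtotal` are chosen only for illustration.

```python
# (order_id, subtotal) pairs copied from the mapped order_items output above
orderItemsMap = [
    (1, 299.98), (2, 199.99), (2, 250.0), (2, 129.99), (4, 49.98),
    (4, 299.95), (4, 150.0), (4, 199.92), (5, 299.98), (5, 299.95),
]

orderId = 2  # illustrative choice of order id

# Filter and map: keep only the subtotals that belong to the chosen order
subtotals = []
for itemOrderId, subtotal in orderItemsMap:
    if itemOrderId == orderId:
        subtotals.append(subtotal)

# Total revenue: initialize the aggregator to 0.0 and add every subtotal
totalRevenue = 0.0
for subtotal in subtotals:
    totalRevenue += subtotal

# Minimum: initialize with the first value and keep the smaller one each time
minSubtotal = subtotals[0]
for subtotal in subtotals[1:]:
    if subtotal < minSubtotal:
        minSubtotal = subtotal

# Maximum: initialize with the first value and keep the larger one each time
maxSubtotal = subtotals[0]
for subtotal in subtotals[1:]:
    if subtotal > maxSubtotal:
        maxSubtotal = subtotal

# Round the total because floating point addition is not exact
print(round(totalRevenue, 2), minSubtotal, maxSubtotal)
```

The three loops have the same shape and differ only in how the aggregator is initialized and updated, which is the part a generic reduce function would accept as an argument. Note that the minimum and maximum loops assume at least one matching record; an empty `subtotals` list would need separate handling.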

## Develop myFilter, myMap and myReduce APIs
Now let us see how we can leverage lambda functions to develop generic functions that filter data, apply transformations (mapping) and perform aggregations (reduce).
* myFilter function
    * Define a function with two arguments
    * first argument – a lambda function with one argument (at run time we pass a code snippet which returns True or False)
    * second argument – a collection
    * Develop the logic which iterates through the elements in the collection, applies the passed filter criteria and adds the elements that satisfy the criteria to a new collection.
    * Here is the code, along with sample invocations covering all 3 filtering scenarios discussed above.
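As a rough illustration of the steps listed above, a generic filter could look like the sketch below. The parameter names `f` and `c` and the local name `newC` are assumptions made for this example, and the sample records are copied from the outputs shown earlier in this section so that the code runs without the data files.

```python
def myFilter(f, c):
    # f: function that takes one element and returns True or False
    # c: collection to filter
    newC = []
    for e in c:
        if f(e):
            newC.append(e)
    return newC

# Sample records taken from the outputs earlier in this section
orders = [
    '1,2013-07-25 00:00:00.0,11599,CLOSED',
    '2,2013-07-25 00:00:00.0,256,PENDING_PAYMENT',
    '3,2013-07-25 00:00:00.0,12111,COMPLETE',
    '5,2013-07-25 00:00:00.0,11318,COMPLETE',
]
orderItems = [
    '2,2,1073,1,199.99,199.99',
    '3,2,502,5,250.0,50.0',
    '4,2,403,1,129.99,129.99',
]

# Scenario 1: COMPLETE orders
completeOrders = myFilter(lambda o: o.split(",")[3] == "COMPLETE", orders)

# Scenario 2: orders placed on 2013-07-25
# (every sample record has this date, so all of them are returned)
ordersOnDate = myFilter(lambda o: o.split(",")[1][:10] == "2013-07-25", orders)

# Scenario 3: order items for order id 2
# (every sample record belongs to order 2, so all of them are returned)
itemsForOrder = myFilter(lambda oi: int(oi.split(",")[1]) == 2, orderItems)

print(completeOrders)
print(len(ordersOnDate), len(itemsForOrder))
```

The loop is written once, and each scenario supplies only its own condition as a lambda, which is what separates this from the three near-identical loops in the problem statements.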