## Exercises - Manipulating Collections using Loops

Let us go through some exercises to understand how to process collections using conventional loops and conditionals. Create a function for each of the problem statements below.

   * Get number of COMPLETE orders placed by each customer.
   * Get total number of PENDING or PENDING_PAYMENT orders for the month of January 2014.
   * Get outstanding amount for each month considering orders with status PAYMENT_REVIEW, PENDING, PENDING_PAYMENT and PROCESSING.
   
## Details of Data

Here are the details about the orders data which you can leverage to take care of these exercises.

   * Location `/data/retail_db/orders/part-00000`.
   * Each record is line separated or line delimited.
   * Attributes in each record are comma separated.
   * Here are the columns in the orders data set.
      * order_id
      * order_date
      * order_customer_id
      * order_status
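As a quick illustration of this layout, the sketch below splits one comma-separated record into its four attributes. The sample record and its timestamp format are made up for illustration and are assumptions, not values copied from the file.

```python
# Hypothetical record following the orders layout (not taken from the file)
sample_order = '101,2014-01-15 00:00:00.0,42,COMPLETE'

# split(',') returns every attribute as a string
order_id, order_date, order_customer_id, order_status = sample_order.split(',')

# Convert the numeric attributes explicitly when they are needed as numbers
order_id = int(order_id)
order_customer_id = int(order_customer_id)

print(order_id, order_date, order_customer_id, order_status)
```

Positions 0 to 3 of the split list correspond to the four columns listed above, in the same order.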

In [None]:
# get the details about file
!ls -ltr /data/retail_db/orders/part-00000

In [None]:
# Get first five lines from the file
!head -5 /data/retail_db/orders/part-00000

In [None]:
# Get number of lines from the file
# We can use linux command wc with -l
!wc -l /data/retail_db/orders/part-00000

Here are the details about the order_items data which you can leverage to take care of these exercises.

   * Location `/data/retail_db/order_items/part-00000`.
   * Each record is line separated or line delimited.
   * Attributes in each record are comma separated.
   * Here are the columns in the order_items data set.
      * order_item_id
      * order_item_order_id
      * order_item_product_id
      * order_item_quantity
      * order_item_subtotal
      * order_item_product_price

In [None]:
# get the details about file
!ls -ltr /data/retail_db/order_items/part-00000

In [None]:
# Get first five lines from the file
!head -5 /data/retail_db/order_items/part-00000

In [None]:
# Get number of lines from the file
# We can use linux command wc with -l
!wc -l /data/retail_db/order_items/part-00000

## Exercise 1 - read data from file

Before getting into the problem statements, develop the code to read a file into a list of elements.

   * We should be able to use this function to read any file with text data using line as record delimiter.

In [None]:
# Update the logic here
def get_list_from_file(file_path):
    # Use a context manager so the file is closed after reading
    with open(file_path) as data_file:
        data_list = data_file.read().splitlines()
    return data_list

* Run the cells below to validate the function.
* You should see 68883 records as part of the output for the cell with `len(orders)` below.
* You should see 172198 records as part of the output for the cell with `len(order_items)` below.

In [None]:
orders = get_list_from_file('/data/retail_db/orders/part-00000')

In [None]:
orders[:5]

In [None]:
len(orders)

In [None]:
order_items = get_list_from_file('/data/retail_db/order_items/part-00000')

In [None]:
order_items[:5]

In [None]:
len(order_items)

## Exercise 2 - Complete Order Count by Customer

Get the number of COMPLETE orders placed by each customer. Develop a function that reads the orders data and returns the complete order count for each customer using **order_customer_id.**

   * The function should take the complete order list as argument and return count of complete orders by customer. The function should return **dict** type object.
   * The order is said to be complete if the **order_status** is **COMPLETE.**
   * You can review structure of the data under **Details of Data** section in this notebook.
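One possible shape for such a loop is sketched below on a tiny made-up sample. The function name `count_complete_orders_sketch` and the sample records are hypothetical; this is an illustration of the dict-counting pattern, not the reference solution.

```python
def count_complete_orders_sketch(orders):
    """Count COMPLETE orders per customer id using a loop and a dict."""
    counts = {}
    for order in orders:
        fields = order.split(',')
        customer_id = int(fields[2])  # order_customer_id
        status = fields[3]            # order_status
        if status == 'COMPLETE':
            if customer_id in counts:
                counts[customer_id] += 1
            else:
                counts[customer_id] = 1
    return counts


# Hypothetical sample records, only for trying the function out
sample_orders = [
    '1,2014-01-01 00:00:00.0,10,COMPLETE',
    '2,2014-01-02 00:00:00.0,10,COMPLETE',
    '3,2014-01-03 00:00:00.0,11,PENDING',
    '4,2014-02-01 00:00:00.0,12,COMPLETE',
]
print(count_complete_orders_sketch(sample_orders))
```

Customers with no COMPLETE orders never get a key, so the length of the returned dict is the number of customers with at least one complete order.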

In [None]:
# Update the logic here
def get_complete_order_count_by_customer(orders):
    order_count_by_customer = {}
    
    return order_count_by_customer

* Run the cells below to validate the function. You should get **22899** as output.

In [None]:
orders = get_list_from_file('/data/retail_db/orders/part-00000')

In [None]:
complete_order_count_by_customer = get_complete_order_count_by_customer(orders)

In [None]:
# This should return dict
type(complete_order_count_by_customer)

In [None]:
# This should return 10538
len(complete_order_count_by_customer)

* Run the cell below to preview the data. You should see output like this:
```text
(1, 1)
(2, 2)
(3, 5)
(4, 4)
(5, 2)
```

In [None]:
for e in sorted(complete_order_count_by_customer.items())[:5]:
    print(e)

## Exercise 3 - Pending Order Count

Get the total number of PENDING or PENDING_PAYMENT orders for the month of January 2014. Develop a function that reads the orders data and returns the pending order count.

   * The function should take the complete order list as argument and return count of pending orders.
   * The order is said to be pending if the status is **PENDING** or **PENDING_PAYMENT.** We should only consider the orders placed in the month of January 2014.
   * The second element in each comma separated record gives us the date.
   * The 4th or last element in each comma separated record gives us the order status.
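The two conditions above can be combined in a single loop. The sketch below assumes the date starts with a `YYYY-MM` prefix, which is an assumption about the format; the function name and the sample records are hypothetical.

```python
def count_pending_sketch(orders, month_prefix='2014-01'):
    """Count PENDING or PENDING_PAYMENT orders whose date starts with month_prefix."""
    order_count = 0
    for order in orders:
        fields = order.split(',')
        order_date = fields[1]    # second element: the date
        order_status = fields[3]  # fourth element: the status
        if order_date.startswith(month_prefix) and \
                order_status in ('PENDING', 'PENDING_PAYMENT'):
            order_count += 1
    return order_count


# Hypothetical sample records, only for trying the function out
sample_orders = [
    '1,2014-01-05 00:00:00.0,10,PENDING',
    '2,2014-01-09 00:00:00.0,11,PENDING_PAYMENT',
    '3,2014-01-11 00:00:00.0,12,COMPLETE',
    '4,2013-12-30 00:00:00.0,13,PENDING',
]
print(count_pending_sketch(sample_orders))
```

Comparing the status with an exact match avoids accidentally counting other statuses that merely contain the word PENDING.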

In [None]:
# Update the logic here
def get_pending_order_count(orders):
    order_count = 0

    return order_count

* Run the cell below to validate your function. You should get **1969** as output.

In [None]:
get_pending_order_count(orders)

* You can also validate results using simple linux scripts.

In [None]:
!egrep -w '(PENDING|PENDING_PAYMENT)' /data/retail_db/orders/part-00000|grep 2014-01|wc -l

## Exercise 4 - Get Outstanding Revenue

Get outstanding amount for each month considering orders with status PAYMENT_REVIEW, PENDING, PENDING_PAYMENT and PROCESSING. Modularize by developing multiple functions.

   * Develop a function which takes the orders list as argument and returns a collection of order ids with one of the pending statuses.
   * Develop a function which takes the **order_items list** as well as the **collection of pending orders** as arguments and returns the outstanding amount.
   * You can use **order_item_subtotal** to compute the outstanding amount.
   * Here are the instructions for the solution.
      * Create a list or set or dict for pending orders as part of first function with name that starts with **get_pending_orders.**
      * As part of **get_outstanding_revenue** make sure to iterate through **order_items** and lookup into **pending_orders** to get the subtotal for each order item.
   * Review **Details of Data** section to get more details of columns.
   * Develop a function to create list of orders with pending status and lookup into it.
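The two-step structure described above can be sketched as follows, here using a set for the lookup. The function names ending in `_sketch`, the constant `PENDING_STATUSES`, and the sample records are all hypothetical; the list and dict variants are left for you to write and compare.

```python
PENDING_STATUSES = {'PAYMENT_REVIEW', 'PENDING', 'PENDING_PAYMENT', 'PROCESSING'}


def get_pending_orders_set_sketch(orders):
    """Collect the ids of orders whose status is one of the pending statuses."""
    pending_orders = set()
    for order in orders:
        fields = order.split(',')
        if fields[3] in PENDING_STATUSES:
            pending_orders.add(int(fields[0]))  # order_id
    return pending_orders


def get_outstanding_revenue_sketch(order_items, pending_orders):
    """Sum order_item_subtotal for items that belong to pending orders."""
    outstanding_revenue = 0.0
    for order_item in order_items:
        fields = order_item.split(',')
        order_id = int(fields[1])    # order_item_order_id
        subtotal = float(fields[4])  # order_item_subtotal
        if order_id in pending_orders:
            outstanding_revenue += subtotal
    return round(outstanding_revenue, 2)


# Hypothetical sample records, only for trying the functions out
sample_orders = [
    '1,2014-01-05 00:00:00.0,10,PENDING',
    '2,2014-01-09 00:00:00.0,11,COMPLETE',
    '3,2014-02-11 00:00:00.0,12,PROCESSING',
]
sample_order_items = [
    '1,1,957,1,299.98,299.98',
    '2,2,1073,1,199.99,199.99',
    '3,2,502,5,250.0,50.0',
    '4,3,403,1,129.99,129.99',
]
sample_pending = get_pending_orders_set_sketch(sample_orders)
print(sample_pending)
print(get_outstanding_revenue_sketch(sample_order_items, sample_pending))
```

Because the second function only relies on the `in` operator, the same code also works when `pending_orders` is a list or a dict keyed by order id.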

In [None]:
# Update the logic here


In [None]:
# It should return 31644
len(pending_orders)

In [None]:
def get_outstanding_revenue(order_items, pending_orders):
    # Update the logic here
    outstanding_revenue = 0.0

    return round(outstanding_revenue, 2)

In [None]:
order_items = get_list_from_file('/data/retail_db/order_items/part-00000')

In [None]:
%%time
# You should get 15982030.54 as output. Even if it differs by a few dollars, that is fine.
get_outstanding_revenue(order_items, pending_orders)

## Exercise 5 - Compare Performance

As part of the previous exercise you were asked to come up with the solution using 3 different approaches. You need to add a markdown cell below each question and provide your answer.

   * Question: Which of the 3 approaches is faster? Add a markdown cell below and provide your answer.
      * list
      * set
      * dict

   * Question: Provide an explanation of why the option you have chosen is faster than the others. Add a markdown cell below and provide your answer.
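To back your answer with measurements, one option is to time the same membership lookups against each container type. The sketch below uses the standard `timeit` module on synthetic ids; the sizes, the names `count_hits` and `lookups`, and the repeat count are arbitrary assumptions, and absolute timings will vary by machine.

```python
import timeit

# Synthetic order ids, built once and stored in three container types
ids = list(range(5000))
containers = {
    'list': ids,
    'set': set(ids),
    'dict': dict.fromkeys(ids, True),
}

# 500 lookup keys; half of them are not present in the containers
lookups = range(0, 10000, 20)


def count_hits(container):
    """Count how many lookup keys are found in the container."""
    hits = 0
    for key in lookups:
        if key in container:
            hits += 1
    return hits


for name, container in containers.items():
    seconds = timeit.timeit(lambda: count_hits(container), number=3)
    print(name, count_hits(container), round(seconds, 4))
```

All three containers return the same hit count, so any difference in the printed times comes from how each type performs the `in` check.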