## Data Processing using Pandas
Let us understand how to process data using Pandas, a third-party Python library.

* Limitations of Collections
* Overview of Pandas
* Overview of Series
* Reading files into Data Frames
* Standard Transformations
* Projection and Filtering Data
* Aggregations
* Writing to Files
* Joining Data Frames
* Exercises

## Limitations of Collections
Let us understand some of the limitations of Python collections when they are used for data processing.
* No structure is defined for the data.
* The resulting code is not very readable.
* Some of the required APIs are scattered across multiple Python modules or plugins.

Pandas addresses these limitations by providing a robust set of APIs where we can refer to columns by name and perform all standard transformations.
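
As a minimal sketch of the readability point, the same filter can be written once with a plain list of tuples and once with a Data Frame. The sample records below are hypothetical and only mimic the layout of the orders data used later.

```python
import pandas as pd

# Hypothetical sample records:
# (order_id, order_date, order_customer_id, order_status)
orders_list = [
    (1, "2013-07-25", 11599, "CLOSED"),
    (2, "2013-07-25", 256, "PENDING_PAYMENT"),
    (3, "2013-07-25", 12111, "COMPLETE"),
    (4, "2013-07-26", 8827, "CLOSED"),
]

# With plain collections, fields are addressed by position,
# so the reader has to remember that index 3 means order_status.
closed_ids_list = [rec[0] for rec in orders_list if rec[3] == "CLOSED"]

# With a Data Frame, the same logic refers to columns by name.
orders_df = pd.DataFrame(
    orders_list,
    columns=["order_id", "order_date", "order_customer_id", "order_status"],
)
closed_ids_df = orders_df[orders_df["order_status"] == "CLOSED"]["order_id"].tolist()

print(closed_ids_list)
print(closed_ids_df)
```

Both approaches select the same order ids; the Data Frame version simply states which column each condition applies to.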


## Overview of Pandas
Let us understand the details with respect to Pandas.
* Pandas is not part of the Python standard library, so we need to install it using pip.
* It has 2 main data structures: Series and DataFrame.
* We can perform all standard transformations using Pandas APIs.
* There are also SQL-based wrappers on top of Pandas that let us write queries.


## Overview of Series
Let us quickly go through one of the Pandas data structures, Series.
* A Pandas Series is a one-dimensional labeled array capable of holding any data type.
* It is similar to one column in an Excel spreadsheet or a database table.
* We can create a Series from a dict or a list.

In [None]:
d = {"JAN": 10, "FEB": 15, "MAR": 12, "APR": 16}

In [None]:
import pandas as pd
s = pd.Series(d)

In [None]:
s

In [None]:
type(s)

In [None]:
l = [10, 15, 12, 16]
pd.Series(l)

In [None]:
s.count()

In [None]:
s.sum()

In [None]:
s.min()

In [None]:
s.max()
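
The cells above can be condensed into one small sketch that also shows where the labels of a Series come from. It reuses the same month values; the variable name `t` is only illustrative.

```python
import pandas as pd

# When a Series is created from a dict, the keys become the index labels.
s = pd.Series({"JAN": 10, "FEB": 15, "MAR": 12, "APR": 16})
print(list(s.index))
print(s["FEB"])  # look up a value by its label

# When a Series is created from a list, a default integer index is used.
t = pd.Series([10, 15, 12, 16])
print(list(t.index))
```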

* When we fetch only one column from a Pandas Data Frame, it is returned as a Series.

In [1]:
orders_path = "/data/retail_db/orders/part-00000"

In [2]:
orders_schema = [
  "order_id",
  "order_date",
  "order_customer_id",
  "order_status"
]

In [None]:
pd.read_csv?

In [None]:
orders = pd.read_csv(orders_path,
  header=None,
  names=orders_schema
)

In [None]:
type(orders)

In [None]:
order_dates = orders.order_date

In [None]:
type(order_dates)

In [None]:
# Preview Series
order_dates

**Don’t worry too much about creating Data Frames yet. For now, we are only trying to understand how Data Frame and Series are related.**
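
The relationship between Data Frame and Series can also be checked without the data file. The following is a minimal sketch with a hypothetical in-memory Data Frame: selecting with a single column name gives a Series, while selecting with a list of names gives a Data Frame.

```python
import pandas as pd

# Hypothetical in-memory data standing in for the orders file
df = pd.DataFrame({
    "order_id": [1, 2, 3],
    "order_status": ["CLOSED", "PENDING_PAYMENT", "COMPLETE"],
})

one_column = df["order_status"]          # single label -> Series
one_column_frame = df[["order_status"]]  # list of labels -> Data Frame

print(type(one_column).__name__)
print(type(one_column_frame).__name__)
```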

## Reading files into Data Frames
Let us see how we can create a Pandas Data Frame.
read_csv is the most popular API to create a Data Frame by reading data from files.
* Here are some of the important options.
  * sep or delimiter
  * header or names
  * index_col
  * dtype
  * and many more
* There are several other APIs that can be used to create a Data Frame
  * read_fwf
  * read_table
  * read_json (pandas.io.json)
  * and more
* Here is how we can create a Data Frame for the orders dataset.
  * The data is comma separated, which is the default delimiter, so we do not have to specify it.
  * There is no header, so we have to set the keyword argument header to None.
  * We can pass the column names as a list using names.
  * The data type of each column is typically inferred from the data; however, we can explicitly specify data types using dtype.
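
To try these options without access to the data files, here is a minimal sketch that reads a few rows from an in-memory string. The sample rows and their date format are assumptions made for illustration; only the column names come from the orders schema used in this section.

```python
import io
import pandas as pd

# Hypothetical sample rows laid out like the orders file (no header line)
sample = io.StringIO(
    "1,2013-07-25 00:00:00.0,11599,CLOSED\n"
    "2,2013-07-25 00:00:00.0,256,PENDING_PAYMENT\n"
    "3,2013-07-25 00:00:00.0,12111,COMPLETE\n"
)

orders_sample = pd.read_csv(
    sample,
    sep=",",               # the default delimiter, shown only for clarity
    header=None,           # the data has no header row
    names=["order_id", "order_date", "order_customer_id", "order_status"],
    dtype={"order_id": "int64", "order_customer_id": "int64"},
    index_col="order_id",  # use order_id as the row index
)

print(orders_sample.dtypes)
print(orders_sample)
```

Setting index_col is optional; the examples in this section leave it out, so order_id stays a regular column there.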


In [None]:
orders_path = "/data/retail_db/orders/part-00000"

In [None]:
orders_schema = [
  "order_id",
  "order_date",
  "order_customer_id",
  "order_status"
]

In [None]:
orders = pd.read_csv(orders_path,
  header=None,
  names=orders_schema
)

In [None]:
type(orders)

In [None]:
# Preview Data Frame
orders

## Standard Transformations
Let us see some of the standard transformations that can be performed using Data Frame APIs.
* Projection
* Filtering
* Aggregations
* Joins

Let us see some examples of standard transformations using Pandas APIs. Before that, let us read both orders and order_items into Pandas Data Frames.

* Read the order_items data, from which we will later project order_item_order_id and order_item_subtotal. The columns are named as follows, in the same order.
  * order_item_id
  * order_item_order_id
  * order_item_product_id
  * order_item_quantity
  * order_item_subtotal
  * order_item_product_price

In [None]:
import pandas as pd

In [None]:
# Reading orders
orders_path = "/data/retail_db/orders/part-00000"

In [None]:
orders_schema = [
    "order_id",
    "order_date",
    "order_customer_id",
    "order_status"
]

In [None]:
orders = pd.read_csv(
    orders_path,
    header=None,
    names=orders_schema
)

In [None]:
# Reading order_items
order_items_path = "/data/retail_db/order_items/part-00000"

In [None]:
order_items_schema = [
    "order_item_id",
    "order_item_order_id",
    "order_item_product_id",
    "order_item_quantity",
    "order_item_subtotal",
    "order_item_product_price"
]

In [None]:
order_items = pd.read_csv(
    order_items_path,
    header=None,
    names=order_items_schema
)

## Projection and Filtering Data

Let us understand how to project as well as filter data in Data Frames.

* Projecting data

In [None]:
orders.order_date

In [None]:
orders['order_date']

In [None]:
# Project order_item_order_id and order_item_subtotal
order_items[["order_item_order_id", "order_item_subtotal"]]

* Filter for order_item_order_id 2

In [None]:
order_items[order_items.order_item_order_id == 2]

In [None]:
order_items[order_items["order_item_order_id"] == 2]

In [None]:
order_items.query('order_item_order_id == 2')

* Filter for order_item_order_id 2 and order_item_subtotal between 150 and 250

In [None]:
order_items[(order_items.order_item_order_id == 2) & 
            ((order_items.order_item_subtotal >= 150) & (order_items.order_item_subtotal <= 250))]

In [None]:
order_items.query('order_item_order_id == 2 and ' +
                  'order_item_subtotal >= 150 and ' +
                  'order_item_subtotal <= 250'
                 )

* Filter for orders placed on August 1st, 2013

In [None]:
orders[orders.order_date.str.startswith('2013-08-01')]

In [None]:
orders.query('order_date.str.startswith("2013-08-01")', engine='python')
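
The filtering styles above can be compared on a small in-memory sample. This is only a sketch: the rows are hypothetical, and `Series.between` is used as an alternative way to write the range check.

```python
import pandas as pd

# Hypothetical sample of order_items with only the columns needed here
items = pd.DataFrame({
    "order_item_order_id": [1, 2, 2, 2, 4],
    "order_item_subtotal": [299.98, 199.99, 250.00, 129.99, 49.98],
})

# Boolean masks are combined with & (and), | (or) and ~ (not);
# each condition needs its own parentheses.
mask = (items["order_item_order_id"] == 2) & (
    items["order_item_subtotal"].between(150, 250)  # inclusive on both ends by default
)
by_mask = items[mask]

# The same filter expressed as a query string.
by_query = items.query(
    "order_item_order_id == 2 and "
    "order_item_subtotal >= 150 and "
    "order_item_subtotal <= 250"
)

print(by_mask)
print(by_query)
```

Both forms return the same rows, so the choice between them is mostly a matter of readability.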

## Aggregations

Let us understand how to perform aggregations using Pandas. There are 2 types of aggregations.
* Global Aggregations
* By key Aggregations

### Global Aggregations

There are several global aggregations that can be performed.

* Getting the number of records in the Data Frame. shape returns a tuple of (number of rows, number of columns).

In [None]:
orders.shape

* Getting the number of non-null (non np.NaN) values in each column of a Data Frame

In [None]:
orders.count()

* Getting basic statistics of numeric fields of a Data Frame

In [None]:
orders.describe()

* Get the revenue for order id 2 from order_items

In [None]:
order_items[order_items.order_item_order_id == 2].order_item_subtotal.sum()
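
The difference between shape and count only shows up when values are missing. Here is a minimal sketch with a hypothetical sample that contains one np.nan value.

```python
import numpy as np
import pandas as pd

# Hypothetical sample with one missing subtotal
df = pd.DataFrame({
    "order_item_order_id": [2, 2, 2, 4],
    "order_item_subtotal": [199.99, 250.00, np.nan, 49.98],
})

rows, cols = df.shape   # (number of rows, number of columns)
non_null = df.count()   # per-column count that skips NaN values

# sum also skips NaN values by default
revenue_order_2 = df.loc[
    df["order_item_order_id"] == 2, "order_item_subtotal"
].sum()

print(rows, cols)
print(non_null)
print(round(revenue_order_2, 2))
```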

### By Key Aggregations

By Key Aggregations are those which are computed per key. Here are some examples.

* Getting number of orders per day
* Getting number of orders per status
* Computing revenue per order

In [None]:
## Getting number of orders per day
orders.groupby(orders.order_date).count()
## This gives count of each and every field by default

In [None]:
orders.groupby(orders.order_date)['order_status'].count()

In [None]:
## Getting number of orders per status
orders.groupby(orders.order_status)['order_status'].count()

In [None]:
orders. \
    groupby(orders.order_status)['order_status']. \
    agg(['count', 'min', 'max']). \
    rename(columns={'count': 'order_count'})

In [None]:
## Computing revenue per order
order_items. \
    groupby('order_item_order_id')['order_item_subtotal']. \
    agg(['sum']). \
    rename(columns={'sum': 'revenue'})
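
As an alternative to agg followed by rename, the output columns can be named directly in agg. This is a sketch using hypothetical sample data; it assumes a Pandas version that supports named aggregation (0.25 or later), and the column name item_count is only illustrative.

```python
import pandas as pd

# Hypothetical sample of order_items
items = pd.DataFrame({
    "order_item_order_id": [1, 2, 2, 2, 4],
    "order_item_subtotal": [299.98, 199.99, 250.00, 129.99, 49.98],
})

order_revenue = (
    items
    .groupby("order_item_order_id")
    .agg(
        revenue=("order_item_subtotal", "sum"),
        item_count=("order_item_subtotal", "count"),
    )
    .round(2)
    .reset_index()  # turn the group key back into a regular column
)

print(order_revenue)
```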

## Writing to Files

Pandas also provides simple APIs to write the data back to files.

* Let us write the revenue per order along with order_id to a file.

In [None]:
order_items.to_csv?

In [None]:
import getpass

username = getpass.getuser()

username

In [None]:
import os
os.system(f'mkdir -p /home/{username}/retail_db')
os.system(f'ls -ltr /home/{username}/retail_db')

In [None]:
order_items. \
    groupby('order_item_order_id')['order_item_subtotal']. \
    agg(['sum']). \
    rename(columns={'sum': 'revenue'}). \
    round(2). \
    to_csv(f'/home/{username}/retail_db/order_revenue.csv')

In [None]:
order_items. \
    groupby('order_item_order_id')['order_item_subtotal']. \
    agg(['sum']). \
    rename(columns={'sum': 'revenue'}). \
    round(2). \
    to_json(f'/home/{username}/retail_db/order_revenue.json', orient='table')
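
To check what actually gets written, here is a minimal round-trip sketch. It writes a small, hypothetical revenue Data Frame to a temporary directory instead of the home directory and reads it back.

```python
import os
import tempfile

import pandas as pd

# Hypothetical revenue per order, indexed by order id as in the cells above
order_revenue = pd.DataFrame(
    {"revenue": [299.98, 579.98, 49.98]},
    index=pd.Index([1, 2, 4], name="order_item_order_id"),
)

with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = os.path.join(tmp_dir, "order_revenue.csv")

    # The index (order id) is written as the first column by default.
    order_revenue.to_csv(csv_path)

    with open(csv_path) as f:
        header_line = f.readline().strip()

    # Read the file back and restore the order id as the index.
    restored = pd.read_csv(csv_path, index_col="order_item_order_id")

print(header_line)
print(restored)
```

Because the group key is the index of the aggregated Data Frame, it is included in the file without any extra steps.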

In [None]:
order_items.to_json?

In [None]:
import platform
print(platform.python_version())

## Joining Data Frames

Let us understand how to join Data Frames using Pandas.

* Join orders and order_items using orders.order_id and order_items.order_item_order_id.

In [None]:
# Join orders and order_items
orders.set_index("order_id"). \
    join(order_items.set_index("order_item_order_id"))
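
Note that join performs a left join on the index by default. The following sketch, using hypothetical sample data, contrasts it with merge, which joins on columns and performs an inner join when how="inner" is passed.

```python
import pandas as pd

# Hypothetical samples; order 3 has no order items
orders_s = pd.DataFrame({
    "order_id": [1, 2, 3],
    "order_status": ["CLOSED", "PENDING_PAYMENT", "COMPLETE"],
})
items_s = pd.DataFrame({
    "order_item_order_id": [1, 2, 2],
    "order_item_subtotal": [299.98, 199.99, 250.00],
})

# join defaults to a left join on the index, so order 3 is kept
# with NaN in the order_items columns.
left_joined = orders_s.set_index("order_id").join(
    items_s.set_index("order_item_order_id")
)

# merge joins on columns directly; how="inner" keeps only matching orders.
inner_merged = orders_s.merge(
    items_s,
    left_on="order_id",
    right_on="order_item_order_id",
    how="inner",
)

print(left_joined)
print(inner_merged)
```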

* Compute Daily Revenue using orders.order_date and order_items.order_item_subtotal considering only COMPLETE and CLOSED orders.

In [None]:
# Compute Daily Revenue using
# orders.order_date and order_items.order_item_subtotal
# considering only COMPLETE and CLOSED orders.

import pandas as pd

In [None]:
# Reading orders
orders_path = "/data/retail_db/orders/part-00000"
orders_schema = [
    "order_id",
    "order_date",
    "order_customer_id",
    "order_status"
]

orders = pd.read_csv(
    orders_path,
    header=None,
    names=orders_schema
)

In [None]:
# Reading order_items
order_items_path = "/data/retail_db/order_items/part-00000"
order_items_schema = [
    "order_item_id",
    "order_item_order_id",
    "order_item_product_id",
    "order_item_quantity",
    "order_item_subtotal",
    "order_item_product_price"
]

order_items = pd.read_csv(
    order_items_path,
    header=None,
    names=order_items_schema
)

In [None]:
orders_filtered = orders[orders.order_status.isin(["COMPLETE", "CLOSED"])]

In [None]:
orders_join = orders_filtered. \
    set_index("order_id"). \
    join(order_items.set_index("order_item_order_id"))

In [None]:
daily_revenue = orders_join. \
    groupby("order_date")["order_item_subtotal"]. \
    agg(revenue="sum").round(2)

In [None]:
daily_revenue

## Exercises
Here are some exercises related to Pandas.

* Get all the orders which belong to the month of August 2013.
* Get all the orders which belong to the months of August, September and October in 2013.
* Get the count of orders by status for the month of January 2014.
* Get all the records from orders where there are no corresponding records in order_items
* Get all the customers who have not placed any orders
* Get the revenue by status