# Simple Statistics
Summarising the data across various dimensions

In [2]:
]import LINK ../link
LINK.Setup '#.EC' '../APLSource'
⎕CS EC

By now we have already:
- imported data from file
- cleaned the data by removing missing items

Now we are going to analyse the data. To start with, we will select some columns and join the data together to form a cohesive table containing only the data relevant to our query.

## Total orders
Use **tally** `≢⍵` to count the total orders.

In [3]:
orders ← ImportDataTable '../data/olist_orders_dataset.csv'

In [9]:
≢orders.data
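
As a minimal sketch of what tally measures, assuming the table data is a matrix with one row per order (the matrix `m` below is a hypothetical stand-in, not the real data set), `≢` returns the number of rows while `⍴` returns the full shape:

```apl
⍝ Hypothetical stand-in for a data table: 3 rows, 2 columns
m ← 3 2⍴'o1' 'delivered' 'o2' 'shipped' 'o3' 'delivered'
≢m          ⍝ 3    number of rows (major cells)
⍴m          ⍝ 3 2  rows and columns
```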

## Orders by product category
Now we need to merge the orders and their product categories. First, let us list the columns we will use for the merge.

| Data set | Columns |
|   ---    |   ---  |
| items    | order_id, product_id |
| products | product_id, product_category_name |


In [32]:
items ← (1 2 1 1 1 2 2) ImportDataTable '../data/olist_order_items_dataset.csv'

In [37]:
products ← (1 1 3 3 3 3 3 3 3) ImportDataTable '../data/olist_products_dataset.csv'

Now we will merge data from `items` and `products` to count orders grouped by product category.

An order may contain several items, so we match each row of `items` to its product by `product_id`. This pairs the `order_id` from `items` with the `product_category_name` from `products`.

In [16]:
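oid←items.Column'order_id'      ⍝ Order ID of each item row
pid←products.Column'product_id' ⍝ Product IDs listed in the products table
iid←items.Column'product_id'    ⍝ Product ID of each item row
idx←pid⍳iid                     ⍝ Position of each item's product in the products table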
∧/idx≤≢pid   ⍝ Are all product IDs found?

The **index-of** function returns `1+≢⍺` for elements of `⍵` not found in `⍺`, so we append an `'Unknown'` category to the end of the category list for those indices to select.

In [49]:
cat ← products.Column'product_category_name'
pcn ← (cat,⊂'Unknown')[idx]
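
The fallback lookup can be tried on a small example. This is only a sketch on hypothetical toy data (the IDs and categories below are invented for illustration), where one product ID is deliberately missing from the products list:

```apl
⍝ Hypothetical toy data, not taken from the Olist files
pid ← 'p1' 'p2' 'p3'            ⍝ known product IDs
cat ← 'toys' 'books' 'garden'   ⍝ category of each known product
iid ← 'p1' 'p3' 'p9' 'p2'       ⍝ product IDs in the item rows; 'p9' is unknown
idx ← pid⍳iid                   ⍝ 1 3 4 2  (4 is 1+≢pid, meaning "not found")
∧/idx≤≢pid                      ⍝ 0  not every product ID was found
pcn ← (cat,⊂'Unknown')[idx]     ⍝ 'toys' 'garden' 'Unknown' 'books'
```

Appending the extra item means the out-of-range index selects a placeholder instead of causing an INDEX ERROR.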

Now we will count the number of items in each order, and check that this matches the maximum `order_item_id` for each order in the `items` data set.

In [50]:
order_items_count ← oid {≢⍵}⌸ pcn
max_items ← oid {⌈/⍵}⌸ items.Column'order_item_id'
≢¨order_items_count max_items     ⍝ Both vectors have one element per order
order_items_count ≡ max_items     ⍝ 1 if the counts match the maxima exactly
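
To see how **key** `⌸` groups values in this check, here is a minimal sketch on hypothetical toy data (the names `seq`, `count` and `max` are invented for the example). The left argument supplies the grouping keys, and the operand is applied to the values belonging to each unique key in order of first appearance:

```apl
⍝ Hypothetical toy data: three orders spread over six item rows
oid ← 'A' 'A' 'B' 'C' 'C' 'C'   ⍝ order ID of each item row
seq ← 1 2 1 1 2 3               ⍝ order_item_id of each item row
count ← oid {≢⍵}⌸ seq           ⍝ 2 1 3  item rows per order
max   ← oid {⌈/⍵}⌸ seq          ⍝ 2 1 3  largest order_item_id per order
count ≡ max                     ⍝ 1
```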

Finally, let's count the number of distinct orders that include each category. We will then save the result to a file, sorted by descending order count.

In [52]:
5↑orders_by_category ← pcn {⍺,≢∪⍵}⌸ oid

We expect that the sum of orders in this table may be greater than the total number of orders placed, since the same order may be counted in several categories.

In [53]:
+/⊢/orders_by_category
≢∪orders.Column'order_id'
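
The over-count is easy to reproduce on a small example. In this sketch on hypothetical toy data (the name `tbl` is invented here), order `A` contains items from two categories, so it contributes to both rows of the table:

```apl
⍝ Hypothetical toy data: order A spans two categories
oid ← 'A' 'A' 'B'               ⍝ order ID of each item row
pcn ← 'toys' 'books' 'toys'     ⍝ category of each item row
tbl ← pcn {⍺,≢∪⍵}⌸ oid          ⍝ two rows: toys 2, books 1
+/⊢/tbl                         ⍝ 3  sum of per-category order counts
≢∪oid                           ⍝ 2  distinct orders
```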

Let's sort our data before writing it to file. We will also attach the English product category translations.

In [65]:
sorted ← orders_by_category[⍒⊢/orders_by_category;]
(port eng)←↓⍉⎕CSV'../data/product_category_name_translation.csv'
trans←(eng, ⊂'Translation NOT FOUND')[port⍳⊣/sorted]
header ← 'English Product Category Name' 'Portuguese Product Category Name' 'Count of Orders'
(header⍪trans,sorted) (⎕CSV⎕OPT'IfExists' 'Replace') 'orders_by_category.csv'
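
The sorting step uses **grade down** `⍒` on the last column to obtain a row ordering, which is then used to index the rows of the matrix. A minimal sketch on a hypothetical three-row table (the values are invented for illustration):

```apl
⍝ Hypothetical table: category name and order count
tbl ← 3 2⍴'toys' 2 'books' 5 'garden' 1
⍒⊢/tbl                          ⍝ 2 1 3  row indices in descending count order
sorted ← tbl[⍒⊢/tbl;]           ⍝ rows reordered: books 5, toys 2, garden 1
⊣/sorted                        ⍝ 'books' 'toys' 'garden'
```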

## Payments over time
Now let's add time to our analysis.