# Analyzing Retail Data using Pandas

This exercise involves processing a sampled dataset from a UK-based online retailer.
The original dataset has over a million records, which is more than Excel will let you load into a single worksheet!
Pandas will let you load it and process it but, in the interest of not overloading our server, we'll only work with about 0.1% of the rows.
1. Read in the data from the dataset `online_retail_II.csv`. 
A fair number of the records are bad, with null or empty fields where there should be data. Just throw those records away at the outset using `dropna()`.
1. Sample invoice numbers by computing the remainder after dividing by 1009 (the smallest prime number greater than 1000). 
If the remainder is zero, choose the invoice, otherwise throw it away.
1. But there are a couple of wrinkles to keep in mind:
    * An invoice represents a shopping cart and can contain multiple items.
    If we keep an invoice in our sample, we must keep every item in that shopping cart.
    If we drop it, we must drop every item in it.
    * Some invoice numbers start with a "C". Invoice number C123456 is to be interpreted as a return of items in invoice 123456.
    If we keep an invoice in our sampled dataset, we must also keep its corresponding return, if one exists.
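
The sampling rule above can be sketched on a tiny made-up table. This is only an illustration under stated assumptions: the frame `toy`, its invoice numbers, and its quantities are hypothetical and are not taken from `online_retail_II.csv`. The idea is that once a return shares a numeric key with its original invoice, a single modulo test keeps or drops the whole cart and its return together.

```python
import pandas as pd

# Hypothetical toy data; 2018 = 2 * 1009 and 3027 = 3 * 1009.
toy = pd.DataFrame({
    "Invoice": ["2018", "2018", "C2018", "2019", "3027"],
    "Quantity": [3, 1, -1, 5, 2],
})

# Give a return the same numeric key as the invoice it refers to.
toy["inum"] = toy["Invoice"].str.replace("^C", "", regex=True).astype(int)

# One test on the shared key keeps every item of a chosen invoice,
# plus its return, and drops every item of an unchosen invoice.
sampled = toy[toy["inum"] % 1009 == 0]
print(sampled)
```

In this sketch both items of invoice 2018 and its return C2018 survive, invoice 3027 survives, and invoice 2019 is dropped entirely.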

In [None]:
import pandas as pd

## 1. Read in the data

The code for initial reading of the data is provided in the next cell. You need only run it!

In [None]:
data = pd.read_csv('online_retail_II.csv', encoding='utf-8')
data.dropna(inplace=True)
data = data.reset_index().drop('index',axis=1)
data['Total'] = data['Price'] * data['Quantity']
data.rename(columns={"Customer ID": "CustomerID"}, errors="raise", inplace=True)
print("Incoming rows:", len(data), "Incoming customers:", data['CustomerID'].nunique())
data

## 1a. What does the `data = data.reset_index()`&hellip; line above do?

**Your Answer** 

---

## 2. Add a new column `inum` to `data`&hellip;

The new column `inum` should hold the invoice number: unchanged for a regular invoice, and with the "C" stripped out for a return. Hint: Use [`pandas.DataFrame.apply()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.apply.html).

Write the result into a new dataframe `data2`.
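
As a rough illustration of how `apply()` behaves, here is a minimal sketch on a made-up frame. The frame `toy`, its values, and the helper `strip_return_marker` are all hypothetical names introduced for this example. It shows two ways to call a function per value, column-wise and row-wise, and one way to write the result into a new dataframe without modifying the original.

```python
import pandas as pd

# Hypothetical toy frame; the invoice values are made up for illustration.
toy = pd.DataFrame({"Invoice": ["536365", "C536379", "536381"]})

def strip_return_marker(invoice):
    """Hypothetical helper: drop one leading 'C' and return an int."""
    text = str(invoice)
    return int(text[1:]) if text.startswith("C") else int(text)

# Column-wise: Series.apply calls the helper once per value.
by_value = toy["Invoice"].apply(strip_return_marker)

# Row-wise: DataFrame.apply with axis=1 passes each row as a Series.
by_row = toy.apply(lambda row: strip_return_marker(row["Invoice"]), axis=1)

# assign() returns a new dataframe and leaves `toy` unchanged.
result = toy.assign(inum=by_value)
print(result)
```

Both calls produce the same values here. The column-wise form is usually the simpler choice when the function needs only one column.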

In [None]:
# Fill in 
data2 = ...
data2

## 3. Choose rows in `data2` with `inum modulo 1009 == 0`

Store the result in a new dataframe `data3`.

In [None]:
# Fill in
data3 = ...
data3

## 3a. Analysis Validation

1. How many rows are in `data3`?
2. All things being equal, because of the modulo 1009 calculation, what would you expect the row count for `data3` to be?
3. Is the number of rows in `data3` consistent with what you might have expected?

---

## 4. What countries are represented in `data3`?

In [None]:
# Fill in
...

## 5. Aggregate values for `data3` by Country

Show the minimum, maximum, mean, count, and sum for the invoices by Country.
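
One way to get several statistics per group in a single call is `groupby()` followed by `agg()` with a list of function names. The sketch below uses a made-up frame `toy` with hypothetical values, and it assumes the column being summarized is `Total`; the exercise text does not say which column to aggregate, so treat that choice as an assumption.

```python
import pandas as pd

# Hypothetical toy data, not taken from the retail dataset.
toy = pd.DataFrame({
    "Country": ["France", "France", "Germany"],
    "Total": [10.0, 30.0, 5.0],
})

# Assumption: `Total` is the column of interest.
# Each name in the list becomes one column of the result.
summary = toy.groupby("Country")["Total"].agg(["min", "max", "mean", "count", "sum"])
print(summary)
```

The result has one row per country and one column per statistic, which makes it easy to compare countries side by side.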

In [None]:
# Fill in
...

![The end](https://live.staticflickr.com/32/89187454_3ae6aded89_b.jpg)