# Sales Exercise

In this exercise, you will perform a simulated auditing task with data on invoices, customer orders, and shipments.

## Overview and setting

Imagine you are asked to work on a new audit client that distributes bath bombs and cowbells (my daughter’s favorite toys at the time I created this exercise). Your manager has provided you with three related datasets containing invoice, customer order, and shipment information. Your task is to explore the data and identify observations that require further testing.

## Housekeeping

In [None]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

## Import datasets

In the working folder, you will find three datasets: [invoices](invoices.csv), [customer orders](customer_orders.txt), and [shipping data](shipping_data.csv). A description of these datasets can be found in the [sales data sheet](sales_data_sheet.pdf). The datasets are in CSV or TXT format and can be read into Python using the `pd.read_csv()` function.

In [None]:
invoices =
customer_orders =
shipping_data =
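As a minimal sketch of how `pd.read_csv()` handles different delimiters, the example below reads two small in-memory "files". The column names, values, and the tab delimiter are assumptions made for illustration; check the sales data sheet and the raw files for the actual layout.

```python
import io

import pandas as pd

# Hypothetical file contents; the real files and their delimiters may differ.
csv_text = "invoice_num,item,quantity\n1001,cowbell,5\n1001,bath bomb,12\n"
txt_text = "customer_order\tcustomer_order_date\n5001\t2024-01-15\n"

# A comma is the default separator for pd.read_csv().
demo_csv = pd.read_csv(io.StringIO(csv_text))

# For a delimited text file, pass the delimiter explicitly (a tab is assumed here).
demo_txt = pd.read_csv(io.StringIO(txt_text), sep="\t")

print(demo_csv.shape, demo_txt.shape)
```

If a text file loads as a single wide column, that is usually a sign that the `sep` argument does not match the file's delimiter.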

## Inspect and clean each dataset

After reviewing the [sales data sheet](sales_data_sheet.pdf), inspect each dataset carefully. Think about your “strategy” for wrangling each dataset. Consider how each dataset needs to be cleaned, formatted, or reshaped.

### Clean and process the `invoices` dataset:

In [None]:
invoices.info()

In [None]:
invoices.head()

Notice there are 'duplicate' invoice numbers. This is because each invoice can contain multiple items (i.e., bath bombs and cowbells). We will need to pivot this data from long to wide format so that each invoice number is unique and cowbells and bath bombs are in separate columns.

In [None]:
invoices_clean =

invoices_clean.head()
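The long-to-wide reshape can be sketched on toy data with `DataFrame.pivot()`. The column names `item` and `quantity` are hypothetical stand-ins, not the actual columns of the `invoices` dataset.

```python
import pandas as pd

# Toy long-format data: one row per invoice/item combination (hypothetical columns).
long_df = pd.DataFrame({
    "invoice_num": [1001, 1001, 1002],
    "item": ["bath_bomb", "cowbell", "cowbell"],
    "quantity": [12, 5, 3],
})

# Pivot so each invoice number appears once and each item gets its own column.
wide_df = (
    long_df
    .pivot(index="invoice_num", columns="item", values="quantity")
    .reset_index()
)
wide_df.columns.name = None  # drop the leftover "item" label on the column axis

print(wide_df)
```

Invoices that lack an item show `NaN` in that item's column. If other columns also describe the invoice (for example, a date), one option is to pass a list of columns to `index` so they are carried through the pivot.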

Everything looks pretty good. However, notice that the `invoice_date` is in string format. Let's convert it to a date format.

In [None]:
invoices_clean['invoice_date'] =

invoices_clean.head()
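A small sketch of string-to-date conversion with `pd.to_datetime()` follows. The `%Y-%m-%d` format is an assumption; inspect the raw strings in the dataset before choosing a format.

```python
import pandas as pd

# Toy data with dates stored as strings (the format here is assumed).
dates_df = pd.DataFrame({"invoice_date": ["2024-01-15", "2024-12-31"]})

# An explicit format makes parsing predictable and surfaces malformed values early.
dates_df["invoice_date"] = pd.to_datetime(dates_df["invoice_date"], format="%Y-%m-%d")

print(dates_df.dtypes)
```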

### Clean and process the `customer_orders` dataset:

In [None]:
customer_orders.info()

In [None]:
customer_orders.head()

Everything looks pretty good here. However, notice that the `customer_order_date` is in string format. Let's convert it to a date format.

In [None]:
customer_orders_clean = customer_orders.copy()

customer_orders_clean['customer_order_date'] =

customer_orders_clean.info()

### Clean and process the `shipping_data` dataset:

In [None]:
shipping_data.info()

In [None]:
shipping_data.head()

This dataset needs a bit more work. Notice that the `customer_id` and `invoice_num` are combined in one column. We will need to split this into two separate columns. Also, the `shipping_date` and `delivery_date` are in string format. Let's convert these to date format.

In [None]:
shipping_data_clean = shipping_data.copy()

shipping_data_clean[['customer_id', 'invoice_num']] =

shipping_data_clean['shipping_date'] =
shipping_data_clean['delivery_date'] =

shipping_data_clean.head()
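Splitting a combined column can be sketched with `Series.str.split(..., expand=True)`. The combined column name `cust_inv` and the hyphen delimiter are hypothetical; the actual column name and separator in `shipping_data` may differ.

```python
import pandas as pd

# Toy data: customer ID and invoice number fused into one string (assumed layout).
ship_df = pd.DataFrame({"cust_inv": ["C001-1001", "C002-1002"]})

# expand=True returns one column per piece, which can be assigned to two new columns.
ship_df[["customer_id", "invoice_num"]] = ship_df["cust_inv"].str.split("-", expand=True)

print(ship_df)
```

Both new columns hold strings after the split, which is why a numeric conversion is typically needed before joining on the invoice number.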

OK, we are very close, but notice that `invoice_num` is in string format. Let's convert it to numeric format.

In [None]:
shipping_data_clean['invoice_num'] =

shipping_data_clean.info()
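One possible way to convert strings to numbers is `pd.to_numeric()`. The sketch below uses made-up values, including one deliberately malformed entry, to show how `errors="coerce"` behaves.

```python
import pandas as pd

# Toy values; the last one contains the letter "O" instead of a zero.
nums = pd.Series(["1001", "1002", "10O3"])

# errors="coerce" turns unparseable values into NaN instead of raising an error.
converted = pd.to_numeric(nums, errors="coerce")

print(converted)
print("Unparseable values:", converted.isna().sum())
```

Counting the resulting `NaN` values is a quick check for data-entry problems, which may themselves be worth noting in an audit setting.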

## Join and filter the sales datasets

Use join functions to create a list of invoices that **did not** have an associated shipment (save in a new dataset called `non_shipped_invoices`):

In [None]:
non_shipped_invoices =

non_shipped_invoices.info()
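pandas has no dedicated anti-join function, but one common approach is a left merge with `indicator=True`, keeping only the rows marked `left_only`. The sketch below uses toy data with assumed column names.

```python
import pandas as pd

# Toy invoices and shipments (hypothetical values).
inv = pd.DataFrame({
    "invoice_num": [1001, 1002, 1003],
    "invoice_amount": [50.0, 75.0, 20.0],
})
shp = pd.DataFrame({"invoice_num": [1001, 1003], "shipping_weight": [2.5, 1.0]})

# Left merge against the unique shipment keys; _merge records where each row matched.
merged = inv.merge(
    shp[["invoice_num"]].drop_duplicates(),
    on="invoice_num",
    how="left",
    indicator=True,
)

# Rows found only in the left table are invoices with no shipment.
unmatched = merged.loc[merged["_merge"] == "left_only"].drop(columns="_merge")

print(unmatched)
```

Merging against only the key column keeps the result limited to the invoice columns and avoids duplicating invoices that have more than one shipment.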

Next, use join functions to combine all three datasets to form a new dataset called `complete_sales_data`. Make sure this dataset retains only observations that match across all three datasets:

In [None]:
# Inner join invoices_clean with customer_orders_clean on 'customer_order'
complete_sales_data =

# Inner join the result with shipping_data_clean on 'invoice_num'
complete_sales_data =

complete_sales_data.head()

Notice that `customer_id` appears twice with different suffixes (i.e., `_x` and `_y`). This is because `customer_id` appears in both the invoices and shipping datasets.
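The suffix behavior can be reproduced on toy data. The tables and values below are hypothetical; only the join keys mirror the ones named in this exercise. The `suffixes` argument of `merge()` is one way to give the overlapping columns more descriptive names.

```python
import pandas as pd

# Toy tables (hypothetical values); customer_id exists in both inv and shp.
inv = pd.DataFrame({
    "invoice_num": [1001, 1002],
    "customer_order": [5001, 5002],
    "customer_id": ["C001", "C002"],
})
orders = pd.DataFrame({
    "customer_order": [5001, 5002, 5003],
    "customer_order_date": pd.to_datetime(["2024-01-02", "2024-01-05", "2024-01-09"]),
})
shp = pd.DataFrame({
    "invoice_num": [1001],
    "customer_id": ["C001"],
    "shipping_weight": [2.5],
})

# Chained inner joins keep only rows that match across all three tables.
step_one = inv.merge(orders, on="customer_order", how="inner")
joined = step_one.merge(shp, on="invoice_num", how="inner")
print(joined.columns.tolist())  # includes customer_id_x and customer_id_y

# Optional: choose clearer suffixes for the overlapping column.
named = step_one.merge(shp, on="invoice_num", how="inner", suffixes=("_inv", "_ship"))
print(named.columns.tolist())
```

Comparing the two `customer_id` columns after the join can itself be a useful check, since a mismatch would mean an invoice and its shipment point to different customers.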

Using the dataset created above, can you isolate the invoices that may have been recorded in the wrong year? Hint: Look at the shipping dates and shipping terms. Save in a new dataset called `wrong_period_sales`:

In [None]:
wrong_period_sales =

wrong_period_sales.info()

In [None]:
wrong_period_sales.head()
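One way to think about this cutoff test: under FOB shipping point terms, control generally transfers when the goods ship, while under FOB destination terms it transfers on delivery. The sketch below flags toy invoices whose invoice year differs from the year of the relevant transfer date. The column `shipping_terms` and its labels are assumptions; the sales data sheet governs how terms are actually coded in this exercise.

```python
import pandas as pd

# Toy data; the shipping_terms column and its labels are hypothetical.
sales = pd.DataFrame({
    "invoice_num": [1001, 1002, 1003],
    "invoice_date": pd.to_datetime(["2024-12-30", "2024-12-30", "2024-12-29"]),
    "shipping_terms": ["FOB Shipping Point", "FOB Destination", "FOB Shipping Point"],
    "shipping_date": pd.to_datetime(["2024-12-31", "2024-12-30", "2025-01-02"]),
    "delivery_date": pd.to_datetime(["2025-01-03", "2025-01-04", "2025-01-05"]),
})

# Use the delivery date for FOB Destination rows and the shipping date otherwise.
is_destination = sales["shipping_terms"] == "FOB Destination"
transfer_date = sales["delivery_date"].where(is_destination, sales["shipping_date"])

# Flag invoices recorded in a different year than the transfer date.
flagged = sales.loc[sales["invoice_date"].dt.year != transfer_date.dt.year]

print(flagged[["invoice_num", "shipping_terms", "invoice_date"]])
```

In this toy example, invoice 1001 shipped before year-end under shipping point terms and is not flagged, while 1002 and 1003 are flagged for follow-up.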

Create a variable in `complete_sales_data` called `inv_month` that represents the month of the invoice date:

In [None]:
complete_sales_data['inv_month'] =
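Once a column has a datetime type, its components are available through the `.dt` accessor. A minimal sketch on toy data:

```python
import pandas as pd

# Toy data with an already-converted datetime column.
month_demo = pd.DataFrame({
    "invoice_date": pd.to_datetime(["2024-01-15", "2024-03-02"]),
})

# .dt.month returns the month as an integer from 1 to 12.
month_demo["inv_month"] = month_demo["invoice_date"].dt.month

print(month_demo)
```

The `.dt` accessor only works on datetime columns, so this step depends on the earlier date conversion.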

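Create a summary of the total value of all invoices (`invoice_amount`) for each month (`inv_month`) during 2024. Name the summed column `total_sales`, since the plotting code later in this exercise expects that name, and store the result in `sales_per_month`: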

In [None]:
sales_per_month =

sales_per_month.head(12)
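A monthly summary of this kind can be sketched with a year filter followed by `groupby()` and named aggregation. The toy values are made up, and the 2025 row is included only to show the filter at work.

```python
import pandas as pd

# Toy invoice data (hypothetical values), including one invoice outside 2024.
toy_sales = pd.DataFrame({
    "invoice_date": pd.to_datetime(
        ["2024-01-05", "2024-01-20", "2024-02-10", "2025-01-03"]
    ),
    "invoice_amount": [100.0, 50.0, 75.0, 999.0],
})
toy_sales["inv_month"] = toy_sales["invoice_date"].dt.month

# Keep 2024 only, then sum invoice amounts within each month.
monthly = (
    toy_sales.loc[toy_sales["invoice_date"].dt.year == 2024]
    .groupby("inv_month", as_index=False)
    .agg(total_sales=("invoice_amount", "sum"))
)

print(monthly)
```

Without the year filter, the January 2025 invoice would be added to the January 2024 total, because `inv_month` on its own does not distinguish between years.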

## Generate visualizations

Use a line chart to visualize trends in monthly sales using `sales_per_month`:

In [None]:
# Create the line plot using Seaborn
sns.relplot(
    data=sales_per_month,
    x='inv_month',
    y='total_sales',
    kind='line'
)

plt.ticklabel_format(style='plain', axis='y')

plt.show()

Use a scatterplot to visualize the relationship between shipping weight and invoice amount. Use techniques such as random sampling and transparency to avoid “overplotting”:

In [None]:
sns.relplot(
    data=complete_sales_data.sample(n=4000, random_state=42),
    x='shipping_weight',
    y='invoice_amount',
    color='black',
    alpha=0.25,
    s=10,  # fixed marker size; `size=` would instead map a variable to point size
    kind='scatter'
)

plt.show()