<h1 style="font-size:42px; text-align:center; margin-bottom:30px;"><span style="color:SteelBlue">Module 1:</span> Data Wrangling</h1>
<hr>

Welcome to <span style="color:royalblue">Module 1: Data Wrangling</span>!

Often, the most interesting machine learning applications require you to **wrangle** your data first.

For this project, we've been given a **transaction-level** dataset. In other words, each observation in the raw dataset is for a single transaction - one item, one customer, one purchase.

Therefore, to get **customer-level** data (since we want to build clusters for customers, not transactions), we need to **aggregate transactions by customer** and **engineer customer-level features**.

* This step blends together exploratory analysis, data cleaning, and feature engineering.
* Here, feature engineering comes from aggregating the transaction-level data.
* As you'll see, you still have a lot of room for creativity in this step!

<br><hr id="toc">

### In this module...

In this module, we'll cover:
1. [Transaction-level exploration](#exploration)
2. [Transaction-level cleaning](#cleaning)
3. [Customer-level feature engineering](#customer-level)
4. [Intermediary levels](#intermediary-levels)
5. [Joining together the ABT](#joining)


<br><hr>

### First, let's import libraries and load the dataset.

First, import libraries that you'll need.

In [None]:
# print_function for compatibility with Python 3


# NumPy for numerical computing


# Pandas for DataFrames



# Matplotlib for visualization

# display plots in the notebook


# Seaborn for easier visualization



Next, let's import the dataset.
* The file path is <code style="color:crimson">'project_files/int_online_tx.csv'</code>
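
As a rough sketch of the loading step, the snippet below calls `pd.read_csv` on an in-memory string that stands in for the real file, so it runs on its own. The column names and values are assumptions made up for illustration; the actual CSV may differ.

```python
import io
import pandas as pd

# Stand-in for 'project_files/int_online_tx.csv' so the snippet runs anywhere.
# Column names and values here are hypothetical.
csv_text = io.StringIO(
    "InvoiceID,StockCode,Quantity,UnitPrice,CustomerID,Country\n"
    "1001,A1,2,3.50,17001.0,France\n"
    "1001,B2,1,10.00,17001.0,France\n"
    "1002,A1,4,3.50,,Germany\n"
)

# In the notebook, the argument would be the file path instead of csv_text
toy_df = pd.read_csv(csv_text)
print(toy_df.shape)
```

Note that a column containing missing IDs is read in as floats, which is why the cleaning section later converts CustomerID's to integers.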

In [None]:
# Load international online transactions data from CSV


Now we're ready to jump into wrangling the data!

<span id="exploration"></span>
# 1. Transaction-level exploration

Even though we eventually want customer-level data, it's still helpful to do some basic exploratory analysis at the transaction level. 

<br>
**First, display the shape of the dataset.**

In [None]:
# Dataframe dimensions


**Next, display the first 10 observations from the dataset.**

In [None]:
# First 10 rows of data


**Finally, display the distribution of transactions by country.**
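
The three exploration steps in this section can be sketched on a toy DataFrame. The data below is made up, and `value_counts()` is used here as a tabular stand-in for the bar plot; a count plot of the same column would show these counts graphically.

```python
import pandas as pd

# Hypothetical toy transactions (not the real dataset)
toy_df = pd.DataFrame({
    'InvoiceID': [1001, 1001, 1002, 1003],
    'Country': ['France', 'France', 'Germany', 'France'],
})

# Dataframe dimensions as (rows, columns)
print(toy_df.shape)

# First 10 rows (only 4 exist in this toy example)
print(toy_df.head(10))

# Number of transactions per country
country_counts = toy_df['Country'].value_counts()
print(country_counts)
```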

In [None]:
# Make figsize 6x7


# Bar plot by country


<div style="text-align:center; margin: 40px 0 40px 0; font-weight:bold">
<a href="#toc">Back to Contents</a>
</div>

<span id="cleaning"></span>
# 2. Transaction-level cleaning

Before we aggregate to the customer level, we need to tidy up a few things at the transaction level.

<br>
**Display the number of missing observations for each feature.**

In [None]:
# Missing data by feature


**Drop observations with missing customer ID's.**

In [None]:
# Only keep transactions with CustomerID's


**Next, just for clarity, convert the CustomerID's from floats into integers.**
* This is technically not required, but it's good practice.
* ID's should be saved as strings or integers just so they don't get mixed up with other numeric features.
* Then, display the first 5 CustomerID's in the transaction dataset just to confirm.
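
One possible way to carry out these cleaning steps is sketched below on a toy DataFrame. The values are made up, and `cleaned` is a hypothetical name used only in this example.

```python
import numpy as np
import pandas as pd

# Hypothetical toy transactions with one missing CustomerID
toy_df = pd.DataFrame({
    'CustomerID': [17001.0, np.nan, 17002.0, 17001.0],
    'Quantity': [2, 1, 4, 3],
})

# Count missing values in each column
missing_by_feature = toy_df.isnull().sum()
print(missing_by_feature)

# Keep only rows that have a CustomerID
cleaned = toy_df[toy_df['CustomerID'].notnull()].copy()

# Convert the float IDs to integers (safe only after missing values are gone)
cleaned['CustomerID'] = cleaned['CustomerID'].astype(int)
print(cleaned['CustomerID'].head())
```

The conversion has to come after the drop, because a missing value cannot be cast to a plain integer.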

In [None]:
# Convert customer ID's into integers


# Display first 5 CustomerID's in the transaction dataset



**Create a <code style="color:steelblue">'Sales'</code> interaction feature.**
* Then, display the first 5 Sales values in the transaction dataset just to confirm.
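
A minimal sketch of an interaction feature follows. It assumes that Sales is the product of the Quantity and UnitPrice columns, which this notebook does not state explicitly; the toy values are made up.

```python
import pandas as pd

# Hypothetical toy transactions
toy_df = pd.DataFrame({
    'Quantity': [2, 1, 4],
    'UnitPrice': [3.50, 10.00, 0.25],
})

# Interaction feature: multiply two existing columns row by row
# (assumes Sales = Quantity * UnitPrice)
toy_df['Sales'] = toy_df['Quantity'] * toy_df['UnitPrice']
print(toy_df['Sales'].head())
```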

In [None]:
# Create 'Sales' interaction feature


# Display first 5 Sales values in the transaction dataset



**Before moving on, save your cleaned transaction-level data as <code style="color:crimson">cleaned_transactions.csv</code>.**
* We'll return to this in <span style="color:royalblue">Modules 2 and 3</span> to create the features for individual items.
* Remember to set <code style="color:steelblue">index=None</code> to save only the data and not the auto-generated index.
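
To see what <code style="color:steelblue">index=None</code> does, the sketch below writes a toy DataFrame to an in-memory buffer instead of a file, so the output can be inspected directly. In the notebook, the first argument would be the file name instead.

```python
import io
import pandas as pd

# Hypothetical toy data
toy_df = pd.DataFrame({'CustomerID': [17001, 17002], 'Sales': [7.0, 10.0]})

# Write to an in-memory buffer; index=None leaves out the auto-generated index
buffer = io.StringIO()
toy_df.to_csv(buffer, index=None)

saved_text = buffer.getvalue()
print(saved_text)
```

The header line contains only the two data columns, with no extra unnamed index column in front of them.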

In [None]:
# Save cleaned transaction-level data


<div style="text-align:center; margin: 40px 0 40px 0; font-weight:bold">
<a href="#toc">Back to Contents</a>
</div>

<span id="customer-level"></span>
# 3. Customer-level feature engineering

Now that we have a cleaned transaction-level dataset, it's time to **roll it up** to the customer level. The resulting customer-level data is what we'll feed into our machine learning algorithms later.

<br>
**First, we'll aggregate invoice data by customer. We'll engineer 1 feature:**
* <code style="color:steelblue">'total_transactions'</code> - the total number of unique transactions for each customer.
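
As a sketch of this roll-up, the snippet below counts unique invoices per customer on toy data. It assumes the invoice column is named `InvoiceID`, as in the text later in this module (check the real column name in your dataset), and uses pandas named aggregation, which requires pandas 0.25 or newer. The name `invoice_data` is only illustrative.

```python
import pandas as pd

# Hypothetical toy transactions: customer 17001 has 3 rows across 2 invoices
toy_df = pd.DataFrame({
    'CustomerID': [17001, 17001, 17001, 17002],
    'InvoiceID': [1001, 1001, 1002, 1003],
})

# One row per customer, counting distinct invoices
invoice_data = toy_df.groupby('CustomerID').agg(
    total_transactions=('InvoiceID', 'nunique')
)
print(invoice_data.head())
```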

In [None]:
# Aggregate invoice data


# Display invoice data for first 5 customers


**Next, using the same technique, aggregate product data by customer. Engineer 2 features:**
* <code style="color:steelblue">'total_products'</code> - the total number of products purchased by each customer (not the unit Quantity within each product!).
* <code style="color:steelblue">'total_unique_products'</code> - the number of unique products purchased by each customer.
* **Hint:** you'll want to use the Item ID column (i.e. StockCode) after your groupby!
* Let's name it <code style="color:crimson">product_data</code>.
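
The difference between the two product features can be seen on a toy example: `count` tallies every purchased row, while `nunique` tallies distinct item IDs. The data is made up for illustration.

```python
import pandas as pd

# Hypothetical toy transactions: customer 17001 buys item A1 twice
toy_df = pd.DataFrame({
    'CustomerID': [17001, 17001, 17001, 17002],
    'StockCode': ['A1', 'B2', 'A1', 'C3'],
})

# Two aggregations of the same column after one groupby
product_data = toy_df.groupby('CustomerID').agg(
    total_products=('StockCode', 'count'),
    total_unique_products=('StockCode', 'nunique'),
)
print(product_data.head())
```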

In [None]:
# Aggregate product data


# Display product data for first 5 customers


**Finally, aggregate sales data by customer. Engineer 2 features:**
* <code style="color:steelblue">'total_sales'</code> - the total sales for each customer.
* <code style="color:steelblue">'avg_product_value'</code> - the average value of the products purchased by the customer (not the UnitPrice!).
* **Hint:** you'll want to use the Sales column after your groupby!
* Let's name it <code style="color:crimson">sales_data</code>.
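
A sketch of the sales roll-up on toy data follows. It assumes that "average product value" means the mean of the Sales column over a customer's rows; the numbers are made up.

```python
import pandas as pd

# Hypothetical toy transactions with a precomputed Sales column
toy_df = pd.DataFrame({
    'CustomerID': [17001, 17001, 17002],
    'Sales': [7.0, 10.0, 1.0],
})

# Sum and mean of Sales per customer
sales_data = toy_df.groupby('CustomerID').agg(
    total_sales=('Sales', 'sum'),
    avg_product_value=('Sales', 'mean'),
)
print(sales_data.head())
```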

In [None]:
# Roll up sales data


# Display sales data for first 5 customers


<div style="text-align:center; margin: 40px 0 40px 0; font-weight:bold">
<a href="#toc">Back to Contents</a>
</div>

<span id="intermediary-levels"></span>
# 4. Intermediary levels

You won't always be able to roll up to the customer level directly. Sometimes, it's easier to create intermediary levels first.

<br>
**Let's first aggregate cart data at the "cart-level."**
* We'll group by <code style="color:steelblue">'CustomerID'</code> AND by <code style="color:steelblue">'InvoiceID'</code>. Remember, we're treating each invoice as a "cart."
* Then, we'll calculate <code style="color:steelblue">'cart_value'</code> by taking the sum of the Sales column. This is the total sales by invoice (i.e. cart).
* Finally, we'll call <code style="color:steelblue">.reset_index()</code> to turn CustomerID and InvoiceID back into regular columns so we can perform another aggregation.
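
The three steps above can be sketched on toy data as follows. The values are made up, and `cart_data` is a hypothetical name chosen for this example.

```python
import pandas as pd

# Hypothetical toy transactions: invoice 1001 contains two items
toy_df = pd.DataFrame({
    'CustomerID': [17001, 17001, 17001, 17002],
    'InvoiceID': [1001, 1001, 1002, 1003],
    'Sales': [7.0, 10.0, 1.0, 4.0],
})

# Group by customer AND invoice, then total the sales in each cart
cart_data = toy_df.groupby(['CustomerID', 'InvoiceID']).agg(
    cart_value=('Sales', 'sum')
)

# Turn the two grouping keys back into regular columns
cart_data = cart_data.reset_index()
print(cart_data.head())
```

Without the reset, CustomerID and InvoiceID would remain a two-level index, which makes the next groupby less convenient.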

In [None]:
# Aggregate cart-level data (i.e. invoice-level)


# Reset index


# Display cart data for first 5 CARTS


**Aggregate cart data by customer. Engineer 3 features:**
* <code style="color:steelblue">'avg_cart_value'</code> - average cart value by customer.
* <code style="color:steelblue">'min_cart_value'</code> - minimum cart value by customer.
* <code style="color:steelblue">'max_cart_value'</code> - maximum cart value by customer.
* Let's name it <code style="color:crimson">agg_cart_data</code>.
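
Continuing with made-up numbers, the second aggregation might look like the sketch below. It starts from a small cart-level table (one row per cart) so that it runs on its own.

```python
import pandas as pd

# Hypothetical cart-level data: one row per (customer, invoice)
cart_data = pd.DataFrame({
    'CustomerID': [17001, 17001, 17002],
    'InvoiceID': [1001, 1002, 1003],
    'cart_value': [17.0, 1.0, 4.0],
})

# Roll carts up to one row per customer
agg_cart_data = cart_data.groupby('CustomerID').agg(
    avg_cart_value=('cart_value', 'mean'),
    min_cart_value=('cart_value', 'min'),
    max_cart_value=('cart_value', 'max'),
)
print(agg_cart_data.head())
```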

In [None]:
# Aggregate cart data (at customer-level)



# Display cart data for first 5 CUSTOMERS


<div style="text-align:center; margin: 40px 0 40px 0; font-weight:bold">
<a href="#toc">Back to Contents</a>
</div>

<span id="joining"></span>
# 5. Joining together the ABT

Great, now you have multiple dataframes that each contain customer-level features. 

<br>
**Let's join the various customer-level datasets together with the <code style="color:steelblue">.join()</code> function.**
* Just pick one of the customer-level dataframes and join it to a list of the others.
* By default, it will join the dataframes on their <code style="color:steelblue">index</code>. In this case, that means joining by CustomerID, which is exactly what we want.
* You can read more about the <code style="color:steelblue">.join()</code> function in the <a href="http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.join.html" target="_blank">documentation</a>.
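
A minimal sketch of joining one dataframe to a list of others is shown below. The three toy dataframes share a CustomerID index and hold made-up values; only their names and column names follow this module.

```python
import pandas as pd

# Shared index, as produced by grouping on CustomerID
customer_ids = pd.Index([17001, 17002], name='CustomerID')

# Hypothetical customer-level dataframes with made-up values
invoice_data = pd.DataFrame({'total_transactions': [2, 1]}, index=customer_ids)
product_data = pd.DataFrame(
    {'total_products': [3, 1], 'total_unique_products': [2, 1]},
    index=customer_ids,
)
sales_data = pd.DataFrame(
    {'total_sales': [18.0, 4.0], 'avg_product_value': [6.0, 4.0]},
    index=customer_ids,
)

# Join one dataframe to a list of the others, matching rows on the index
customer_df = invoice_data.join([product_data, sales_data])
print(customer_df.head())
```

The column names must not overlap across the dataframes being joined this way, which holds here because each feature has a unique name.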

In [None]:
# Join together customer-level ABT


# Display customer-level data for first 5 customers


**Finally, let's save <code style="color:crimson">customer_df</code> as our analytical base table to use later.**
* **Important:** We will not set <code style="color:steelblue">index=None</code> because we want to keep the CustomerID's as the index (this will be important when we get to Module 4).
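
To illustrate why the index is kept, the sketch below writes a toy table to an in-memory buffer with the default settings and reads it back. The data is made up, and using `index_col=0` when reloading is one possible way to restore the index, not necessarily what later modules do.

```python
import io
import pandas as pd

# Hypothetical customer-level table indexed by CustomerID
customer_df = pd.DataFrame(
    {'total_sales': [18.0, 4.0]},
    index=pd.Index([17001, 17002], name='CustomerID'),
)

# Default to_csv keeps the index, so CustomerID becomes the first column
buffer = io.StringIO()
customer_df.to_csv(buffer)
saved_text = buffer.getvalue()
print(saved_text)

# Reading back with index_col=0 restores CustomerID as the index
buffer.seek(0)
restored = pd.read_csv(buffer, index_col=0)
print(restored.head())
```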

In [None]:
# Save analytical base table


<div style="text-align:center; margin: 40px 0 40px 0; font-weight:bold">
<a href="#toc">Back to Contents</a>
</div>

<br>
## Next Steps

Congratulations on making it through Project 4's Data Wrangling module!

As a reminder, here are a few things you did in this module:
* You explored and cleaned the transaction-level dataset.
* You then rolled it up to the customer level, aggregating new features along the way.
* You also saw how intermediary groupby's can help you create otherwise tricky features.
* And you joined all of your customer-level datasets together into one analytical base table.

In the next module, <span style="color:royalblue">Module 2: Dimensionality Reduction</span>, we'll see how trying to include features for individual items can raise new challenges (i.e. the Curse of Dimensionality). We'll also introduce you to the first of 2 recommended solutions: thresholding.

<div style="text-align:center; margin: 40px 0 40px 0; font-weight:bold">
<a href="#toc">Back to Contents</a>
</div>