<h1 style="font-size:42px; text-align:center; margin-bottom:30px;"><span style="color:SteelBlue">Module 2:</span> Dimensionality Reduction</h1>
<hr>

Welcome to <span style="color:royalblue">Module 2: Dimensionality Reduction</span>!

In the previous module, you created an analytical base table with useful customer-level features for **purchase patterns**.

However, remember, our client wishes to incorporate information about **specific item purchases** into the clusters. For example, our model should be more likely to group together customers who buy similar items.

* In this module, we'll prepare individual item features for our clustering algorithms.
* The Curse of Dimensionality is especially relevant for clustering because it means observations are "far away" from each other.
* We'll introduce a simple way to reduce the number of dimensions by applying thresholds.

<br><hr id="toc">

### In this module...

In this module, we'll cover:
1. [The Curse of Dimensionality](#curse)
2. [Item data](#item-data)
3. [Toy example: rolling up item data](#toy)
4. [High dimensionality](#high-dimensionality)
5. [Thresholds](#thresholds)


<br><hr>

### First, let's import libraries and load the cleaned transaction-level data.

First, import libraries that you'll need.

In [2]:
# print_function for compatibility with Python 3
from __future__ import print_function

# NumPy for numerical computing
import numpy as np

# Pandas for DataFrames
import pandas as pd


# Matplotlib for visualization
from matplotlib import pyplot as plt
# display plots in the notebook
%matplotlib inline

# Seaborn for easier visualization
import seaborn as sns

Next, let's import the cleaned dataset (not the analytical base table) that we saved in the previous module.
* Remember, we saved it as <code style="color:crimson">'cleaned_transactions.csv'</code>.

In [3]:
# Read cleaned_transactions.csv
df = pd.read_csv('cleaned_transactions.csv')

<span id="curse"></span>
# 1. The Curse of Dimensionality

No code for this part. Please see the online lesson for intuition behind The Curse of Dimensionality.

<p style="text-align:center; margin: 40px 0 40px 0; font-weight:bold">
<a href="#toc">Back to Contents</a>
</p>

<span id="item-data"></span>
# 2. Item data

So how does The Curse of Dimensionality arise in this problem?

<br>
**Display the first 10 StockCodes and Descriptions from the transaction dataset.**

In [9]:
# First 10 StockCodes and Descriptions
df[['StockCode', 'Description']].head(10)

Unnamed: 0,StockCode,Description
0,22728,ALARM CLOCK BAKELIKE PINK
1,22727,ALARM CLOCK BAKELIKE RED
2,22726,ALARM CLOCK BAKELIKE GREEN
3,21724,PANDA AND BUNNIES STICKER SHEET
4,21883,STARS GIFT TAPE
5,10002,INFLATABLE POLITICAL GLOBE
6,21791,VINTAGE HEADS AND TAILS CARD GAME
7,21035,SET/2 RED RETROSPOT TEA TOWELS
8,22326,ROUND SNACK BOXES SET OF4 WOODLAND
9,22629,SPACEBOY LUNCH BOX


**Next, display the number of unique items in the dataset.**

In [11]:
# Number of unique items
len(df.StockCode.unique())

2574

<p style="text-align:center; margin: 40px 0 40px 0; font-weight:bold">
<a href="#toc">Back to Contents</a>
</p>

<span id="toy"></span>
# 3. Toy example: rolling up item data

To illustrate how we'll **roll up item information to the customer level**, let's use another toy example. 

<br>
**First, create a <code style="color:crimson">toy_df</code> that only contains transactions for 2 customers.**
* Include transactions for these 2 CustomerID's: <code style="color:crimson">14566</code> and <code style="color:crimson">17844</code>.
* By the way, there's nothing special about these customers... we just chose them because they have relatively few purchases, making the toy example more manageable.
* Then, display the toy dataframe.

In [22]:
# Create toy_df
toy_df = df[df['CustomerID'].isin([14566,17644])]

# Display toy_df
toy_df

Unnamed: 0,InvoiceNo,StockCode,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country,Sales
19250,563900,85099C,JUMBO BAG BAROQUE BLACK WHITE,200,8/21/11 11:05,1.79,14566,Channel Islands,358.0
19251,563900,85099B,JUMBO BAG RED RETROSPOT,200,8/21/11 11:05,1.79,14566,Channel Islands,358.0
19252,563900,23199,JUMBO BAG APPLES,200,8/21/11 11:05,1.79,14566,Channel Islands,358.0
19253,563900,22386,JUMBO BAG PINK POLKADOT,200,8/21/11 11:05,1.79,14566,Channel Islands,358.0


**Create a dataframe of dummy variables for <code style="color:steelblue">'StockCode'</code>.**
* Name it <code style="color:crimson">toy_item_dummies</code>.
* We don't need the other features right now, so you can actually just directly pass in the <code style="color:steelblue">toy_df.StockCode</code> Series to <code style="color:steelblue">pd.get_dummies()</code>.
* Then, add <code style="color:steelblue">'CustomerID'</code> to this new dataframe so that we can roll up by customer later.
* Finally, display the dataframe.

In [17]:
# Get toy_item_dummies
toy_item_dummies = pd.get_dummies(toy_df.StockCode)
# Add CustomerID to toy_item_dummies
toy_item_dummies['CustomerID'] = toy_df.CustomerID

# Display toy_item_dummies
toy_item_dummies

Unnamed: 0,22386,23199,85099B,85099C,CustomerID
19250,0,0,0,1,14566
19251,0,0,1,0,14566
19252,0,1,0,0,14566
19253,1,0,0,0,14566


**Finally, we can aggregate this information to the customer-level**.
* In fact, it's as simple as grouping by customer and counting the number of times each customer bought each item.

In [21]:
# Create toy_item_data by aggregating at customer level
toy_item_data = toy_item_dummies.groupby('CustomerID').sum()

# Display toy_item_data
toy_item_data

Unnamed: 0_level_0,22386,23199,85099B,85099C
CustomerID,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
14566,1,1,1,1


<p style="text-align:center; margin: 40px 0 40px 0; font-weight:bold">
<a href="#toc">Back to Contents</a>
</p>

<span id="high-dimensionality"></span>
# 4. High dimensionality

Now, perhaps the alarms in your head have already started ringing!

<br>
**First, create a dataframe of dummy variables for <code style="color:steelblue">'StockCode'</code>, this time for the full dataset.**
* Name it <code style="color:crimson">item_dummies</code>.
* Then, add <code style="color:steelblue">'CustomerID'</code> to this new dataframe so that we can roll up by customer later.
* Then, display the first 5 rows in this dataframe.

In [25]:
# Get item_dummies
item_dummies = pd.get_dummies(df.StockCode)

# Add CustomerID to item_dummies
item_dummies['CustomerID'] = df.CustomerID

# Display first 5 rows of item_dummies
item_dummies.head()

Unnamed: 0,10002,10120,10125,10133,10135,11001,15034,15036,15039,15044A,...,90201A,90201B,90201C,90201D,90202D,90204,C2,M,POST,CustomerID
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,12583
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,12583
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,12583
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,12583
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,12583


**Next, roll up the item dummies data into customer-level item data**.
* Name it <code style="color:crimson">item_data</code>.
* This could take a few seconds.
* Then, display the first 5 rows of the dataframe.

In [27]:
# Create item_data by aggregating at customer level
item_data = item_dummies.groupby('CustomerID').sum()

# Display first 5 rows of item_data
item_data.head()

Unnamed: 0_level_0,10002,10120,10125,10133,10135,11001,15034,15036,15039,15044A,...,90192,90201A,90201B,90201C,90201D,90202D,90204,C2,M,POST
CustomerID,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
12347,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
12348,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,4
12349,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
12350,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
12352,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,3,5


**Finally, let's display the total number times each item was purchased.**
* This quick check confirms these features are pretty sparse.

In [28]:
# Total times each item was purchased
item_data.sum()

10002        12
10120         1
10125        13
10133         5
10135         4
11001         8
15034         5
15036        19
15039         3
15044A        6
15044B        3
15044C        2
15044D        4
15056BL      50
15056N       35
15056P       24
15058A        9
15058B        8
15058C        4
15060B       12
16008        11
16011         3
16012         4
16014        10
16016        16
16045         8
16048         8
16054         2
16156L        6
16156S       12
           ... 
90098         1
90099         2
90108         1
90114         1
90120B        1
90145         2
90160A        1
90160B        1
90160C        1
90160D        1
90161B        1
90161C        1
90161D        1
90162A        1
90162B        1
90164A        1
90170         1
90173         1
90184B        1
90184C        1
90192         1
90201A        1
90201B        3
90201C        2
90201D        1
90202D        1
90204         1
C2            6
M            34
POST       1055
Length: 2574, dtype: int

**Before moving on, let's save this customer-level item dataframe as <code style="color:crimson">'item_data.csv'</code>. We'll use it again in the next module.**
* In the next module, we'll look at an alternative way to reduce dimensionality.
* Again, we won't set <code style="color:steelblue">index=None</code> because we want to keep the CustomerID's as the index.

In [30]:
# Save item_data.csv
item_data.to_csv('item_data.csv')

<p style="text-align:center; margin: 40px 0 40px 0; font-weight:bold">
<a href="#toc">Back to Contents</a>
</p>

<span id="thresholds"></span>
# 5. Thresholds

One very **simple and straightforward way** to reduce the dimensionality of this item data is to set a **threshold** for keeping features.

<br>
**First, we can see which items those are and the number of times they were purchased.**
1. Take the sum by column.
* Sort the values.
* Look at the last 20 (since they are sorted in ascending order by default).

In [35]:
# Display most popular 20 items
item_data.sum().sort_values().tail(20)

22961      114
22630      115
22139      117
21080      122
85099B     123
20726      123
20719      128
20750      132
23084      140
20725      141
21212      143
22551      158
22629      160
22328      166
21731      169
22556      179
22554      197
22423      222
22326      271
POST      1055
dtype: int64

**Next, if we take the <code style="color:steelblue">.index</code> of the above series, we can get just a list of the StockCodes for those 20 items.**

In [37]:
# Get list of StockCodes for the 20 most popular items
top_20_items = item_data.sum().sort_values().tail(20).index

top_20_items

Index([u'22961', u'22630', u'22139', u'21080', u'85099B', u'20726', u'20719',
       u'20750', u'23084', u'20725', u'21212', u'22551', u'22629', u'22328',
       u'21731', u'22556', u'22554', u'22423', u'22326', u'POST'],
      dtype='object')

**Finally, we can keep only the features for those 20 items.**

In [39]:
# Keep only features for top 20 items
top_20_item_data = item_data[top_20_items]

# Shape of remaining dataframe
top_20_item_data.shape

(414, 20)

Here, take a look:

In [41]:
# Display first 5 rows of top_20_item_data
top_20_item_data.head(5)

Unnamed: 0_level_0,22961,22630,22139,21080,85099B,20726,20719,20750,23084,20725,21212,22551,22629,22328,21731,22556,22554,22423,22326,POST
CustomerID,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1
12347,0,0,0,0,0,0,4,0,3,0,0,0,0,0,5,0,0,4,0,0
12348,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,4
12349,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,1,1,1
12350,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,1
12352,0,1,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,2,0,5


**Finally, save this top 20 items dataframe as <code style="color:crimson">'threshold_item_data.csv'</code>.**
* We'll see a different way to reduce dimensionality in the next module, but we'll come back to this dataframe again in Module 4.
* Don't set <code style="color:steelblue">index=None</code> because we want to keep the CustomerID's as the index.

In [42]:
# Save threshold_item_data.csv
top_20_item_data.to_csv('threshold_item_data.csv')

<p style="text-align:center; margin: 40px 0 40px 0; font-weight:bold">
<a href="#toc">Back to Contents</a>
</p>

<br>
## Next Steps

Congratulations for making it through Project 4's Dimensionality Reduction!

As a reminder, here are a few things you did in this module:
* You learned about the Curse of Dimensionality and how it can cause issues for clustering.
* You used another toy example to see the process of rolling up item data.
* You created customer-level item features that represent the number of times each item was purchased.
* And you reduced the dimensionality of that dataset using thresholds.

In the next module, <span style="color:royalblue">Module 3: Principal Components Analysis</span>, we'll look at a different way to reduce the number of customer-level item features. This is a more advanced technique, and it's actually considered its own Unsupervised Learning task!

<p style="text-align:center; margin: 40px 0 40px 0; font-weight:bold">
<a href="#toc">Back to Contents</a>
</p>