# <code style="color:Crimson">2. DIMENSIONALITY REDUCTION</code>

When we have many features (high dimensionality), clustering becomes especially hard because every observation ends up "far away" from every other observation. The amount of "space" that a data point could occupy grows with each added feature, so the data becomes sparse and clusters become very hard to form.
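
To see this effect numerically, here is a minimal sketch using uniformly random points (synthetic data, not the transaction dataset; the helper `nearest_to_farthest_ratio` is a name made up for illustration). It compares the distance from one point to its nearest and to its farthest neighbour as the number of dimensions grows.

```python
import numpy as np

def nearest_to_farthest_ratio(n_points, n_dims, seed=0):
    """Ratio of the nearest to the farthest distance from one reference point."""
    rng = np.random.default_rng(seed)
    points = rng.random((n_points, n_dims))  # uniform in the unit hypercube
    # Euclidean distances from the first point to every other point
    dists = np.linalg.norm(points[1:] - points[0], axis=1)
    return dists.min() / dists.max()

ratios = {d: nearest_to_farthest_ratio(500, d) for d in (2, 100, 2500)}
for d, r in ratios.items():
    print(f"{d:>5} dimensions: nearest/farthest = {r:.2f}")
```

A ratio close to 1 means that all points look roughly equally far away, which leaves a distance-based clustering algorithm with very little to work with.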

### Import all the libraries

In [None]:
# NumPy for numerical computing
import numpy as np

# Pandas for DataFrames
import pandas as pd


#### Next, read in the cleaned transaction dataset (not the analytical base table) that we saved in the previous module.
* i.e. <code style="color:crimson">'cleaned_transactions.csv'</code>.

In [2]:
# Read cleaned_transactions.csv
df = pd.read_csv('cleaned_transactions.csv')


In [4]:
df.isnull().sum()

InvoiceNo      0
StockCode      0
Description    0
Quantity       0
InvoiceDate    0
UnitPrice      0
CustomerID     0
Country        0
Sales          0
dtype: int64


During Data Wrangling, we created a customer-level **analytical base table** with important features such as **total sales by customer** and **average cart value by customer**. However, we would also like to include **information about individual items** that were purchased.

* For example, if two customers purchased similar items, our model should be more likely to group them into the same cluster.
* In other words, we care not just about *how much* a customer purchases, but also *what* they purchase.

To get a better idea, let's take a look at the item information from our transactions dataset.

#### A. Display the first 10 StockCodes and Descriptions from the cleaned transaction dataset.

In [32]:
# First 10 StockCodes and Descriptions
# Equivalent positional form: df.iloc[:, [1, 2]].head(10)
df[['StockCode','Description']].head(10)

| | StockCode | Description |
|---|---|---|
| 0 | 22728 | ALARM CLOCK BAKELIKE PINK |
| 1 | 22727 | ALARM CLOCK BAKELIKE RED |
| 2 | 22726 | ALARM CLOCK BAKELIKE GREEN |
| 3 | 21724 | PANDA AND BUNNIES STICKER SHEET |
| 4 | 21883 | STARS GIFT TAPE |
| 5 | 10002 | INFLATABLE POLITICAL GLOBE |
| 6 | 21791 | VINTAGE HEADS AND TAILS CARD GAME |
| 7 | 21035 | SET/2 RED RETROSPOT TEA TOWELS |
| 8 | 22326 | ROUND SNACK BOXES SET OF4 WOODLAND |
| 9 | 22629 | SPACEBOY LUNCH BOX |


Just within the first 10 transactions, we already have 10 different items!

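#### B. Let's display the number of unique values in each column. The <code style="color:steelblue">'StockCode'</code> row gives the number of unique items in the dataset.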

In [36]:
df.agg('nunique')

InvoiceNo      1536
StockCode      2574
Description    2639
Quantity         80
InvoiceDate    1523
UnitPrice       168
CustomerID      414
Country          36
Sales           873
dtype: int64

### <span style="color:RoyalBlue">I. Dealing with High Dimensionality</span>


#### A. Let's create a dataframe of dummy variables for <code style="color:steelblue">'StockCode'</code>, for the full dataset.
* Name it <code style="color:crimson">item_dummies</code>.
* Then, add <code style="color:steelblue">'CustomerID'</code> to this new dataframe so that we can roll up by customer later.
* Then, display the first 5 rows in this dataframe.

In [72]:
# Get item_dummies
item_dummies = pd.get_dummies(df.StockCode)


In [75]:
# Check the shape
item_dummies.shape

(33698, 2574)

In [77]:
# Add CustomerID to item_dummies
item_dummies['CustomerID'] = df.CustomerID

In [None]:
# Display first 5 rows of item_dummies
item_dummies.head()

As we can see, there are MANY features in this item dummies dataset.
* 1 for customer ID
* 2574 for the items!
* And very importantly... most of the values are 0. Each row is a single transaction line, so it has a 1 in exactly one item column and a 0 in all the others.
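
A minimal sketch on a made-up transaction table (the `toy` data and its values are hypothetical) makes the structure of the dummy matrix concrete. It counts the 1s in each row and the overall share of cells that are 0.

```python
import pandas as pd

# Hypothetical transactions: one row per purchased line item
toy = pd.DataFrame({
    'CustomerID': [1, 1, 2, 2, 3],
    'StockCode': ['A', 'B', 'A', 'C', 'B'],
})

# One dummy column per distinct StockCode
toy_dummies = pd.get_dummies(toy.StockCode, dtype=int)

# Number of 1s in each transaction row
ones_per_row = toy_dummies.sum(axis=1)

# Share of cells equal to 0 across the whole dummy matrix
zero_share = (toy_dummies == 0).to_numpy().mean()
print(ones_per_row.tolist(), round(zero_share, 2))
```

With 3 distinct items, two thirds of the cells are 0. With 2574 items, the same reasoning gives a share of 2573/2574, so at this stage the zeros come mostly from the encoding itself.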


<br>
<br>
<br>



#### B. Next, let's roll up the item dummies data into customer-level item data and call it <code style="color:crimson">item_data</code>.
                                                                                         


In [79]:
# Create item_data by aggregating at customer level
item_data = item_dummies.groupby('CustomerID').sum()

Even after rolling up to the customer level, most of the values are still 0. That means most customers buy only a small fraction of the 2574 available items, which is to be expected.
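
The roll-up can be checked on a small made-up example (the `toy` table below is hypothetical). The sketch also shows `pd.crosstab`, which on this example produces the same customer-by-item counts without first building the wide dummy dataframe. It is offered as an alternative, not as the method used in this notebook.

```python
import pandas as pd

# Hypothetical transactions: one row per purchased line item
toy = pd.DataFrame({
    'CustomerID': [1, 1, 1, 2, 2, 3],
    'StockCode': ['A', 'B', 'A', 'C', 'A', 'B'],
})

# Same steps as above: dummies, attach CustomerID, sum per customer
toy_dummies = pd.get_dummies(toy.StockCode, dtype=int)
toy_dummies['CustomerID'] = toy.CustomerID
toy_item_data = toy_dummies.groupby('CustomerID').sum()

# Alternative: count (customer, item) pairs directly
toy_crosstab = pd.crosstab(toy.CustomerID, toy.StockCode)

# Fraction of customer-item cells that are 0
sparsity = (toy_item_data == 0).to_numpy().mean()
print(toy_item_data)
print(f"Share of zeros: {sparsity:.0%}")
```

On the real data, the same `sparsity` calculation applied to `item_data` would put a number on "most of the values are still 0".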



<br>
<br>
<br>

#### C. Finally, let's display the total number of times each item was purchased.
* This quick check confirms these features are pretty sparse.

In [83]:
# Total times each item was purchased
item_data.sum().sort_values()

85087        1
21331        1
84927A       1
85131C       1
21327        1
          ... 
22556      179
22554      197
22423      222
22326      271
POST      1055
Length: 2574, dtype: int64

We can see that most items were purchased fewer than a handful of times!
* First of all, we've just created 2574 customer-level item features for only 414 customers, which exposes us to **The Curse of Dimensionality.**
* To make matters even worse, most of the values for many of those features are 0!

<br>
<br>
<br>

#### D. Before moving on, let's save this customer-level item dataframe as <code style="color:crimson">'item_data.csv'</code>. 
* In the next module, we'll look at an alternative way to reduce dimensionality.
* We won't set <code style="color:steelblue">index=None</code> because we want to keep the CustomerIDs as the index.

In [84]:
# Save item_data.csv
item_data.to_csv('item_data.csv')
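
When a file saved this way is loaded again, the CustomerIDs have to be restored as the index explicitly. Here is a minimal sketch of the round trip, using a made-up two-customer table and a temporary directory so that it does not touch the real 'item_data.csv' (the table, its values, and the file name `toy_item_data.csv` are all hypothetical).

```python
import os
import tempfile
import pandas as pd

# Hypothetical customer-level table with CustomerID as the index
toy_item_data = pd.DataFrame(
    {'A': [2, 0], 'B': [1, 3]},
    index=pd.Index([12347, 12348], name='CustomerID'),
)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'toy_item_data.csv')
    # The index is written as the first column of the file
    toy_item_data.to_csv(path)
    # index_col=0 turns that first column back into the index
    reloaded = pd.read_csv(path, index_col=0)

print(reloaded.index.name)
```

Without `index_col=0`, the CustomerIDs would come back as an ordinary column and would be treated as a feature by any model trained on the dataframe.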

### <span style="color:RoyalBlue">II. Thresholds</span>

One very **simple and straightforward way** to reduce the dimensionality of this item data is to set a **threshold** for features.
* The rationale is that we might only want to keep **popular items**.
* For example, let's say item A was only purchased by 2 customers. Well, the feature for item A will be 0 for almost all observations, which isn't very helpful.
* On the other hand, let's say item B was purchased by 100 customers. The feature for item B will allow more meaningful comparisons.
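
Note that "purchased by N customers" and "purchased N times" are different counts: the column sums of `item_data` give the total number of purchases, and the two can differ when one customer buys the same item repeatedly. A minimal sketch of both counts on a made-up table (all values hypothetical):

```python
import pandas as pd

# Hypothetical customer-level item counts (rows: customers, columns: items)
toy_item_data = pd.DataFrame(
    {'A': [5, 0, 0], 'B': [1, 1, 1]},
    index=pd.Index([1, 2, 3], name='CustomerID'),
)

# Total number of times each item was purchased
times_purchased = toy_item_data.sum()

# Number of distinct customers who bought each item at least once
customers_per_item = (toy_item_data > 0).sum()

print(pd.DataFrame({'times': times_purchased, 'customers': customers_per_item}))
```

Here item A is purchased more often in total, but item B reaches more customers. Which count is the better measure of popularity depends on the goal; this notebook ranks items by total purchases.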

To make this concrete, assume we only wish to keep item features for the **20 most popular items**. 

<br>
<br>

#### A. First, we can see which items those are and the number of times they were purchased.
1. Take the sum by column.
2. Sort the values.
3. Look at the last 20 (since they are sorted in ascending order by default).

In [85]:
# Display most popular 20 items
item_data.sum().sort_values().tail(20)

22961      114
22630      115
22139      117
21080      122
85099B     123
20726      123
20719      128
20750      132
23084      140
20725      141
21212      143
22551      158
22629      160
22328      166
21731      169
22556      179
22554      197
22423      222
22326      271
POST      1055
dtype: int64

#### B. Next, if we take the <code style="color:steelblue">.index</code> of the above series, we can get just a list of the StockCodes for those 20 items.

In [87]:
# Get list of StockCodes for the 20 most popular items
top20items = item_data.sum().sort_values().tail(20).index

#### C. Next, keep only the features for those 20 items and save them in a new object <code style="color:steelblue">top_20_item_data</code>.
* Then, as a quick sanity check, display its shape.

In [92]:
# Keep only features for top 20 items
top_20_item_data = item_data[top20items]

# Shape of remaining dataframe
top_20_item_data.shape
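
As an aside, pandas also offers `Series.nlargest`, which selects the top entries without sorting the whole series. The sketch below wraps it in a small helper (`top_k_items` is a hypothetical name, not part of this notebook) and runs it on made-up data. Unlike `sort_values().tail()`, the result comes back in descending order, and ties at the cut-off may be resolved differently.

```python
import pandas as pd

def top_k_items(item_data, k):
    """Keep only the columns for the k most purchased items."""
    top_items = item_data.sum().nlargest(k).index
    return item_data[top_items]

# Hypothetical customer-level item counts
toy_item_data = pd.DataFrame(
    {'A': [1, 0, 0], 'B': [2, 3, 1], 'C': [0, 4, 0], 'D': [0, 0, 1]},
    index=pd.Index([1, 2, 3], name='CustomerID'),
)

top_2 = top_k_items(toy_item_data, 2)
print(top_2.shape)
```

Making the threshold a parameter also makes it easy to try values other than 20 later on.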

#### D. Let's take a look at some example rows in <code style="color:steelblue">top_20_item_data</code>
* These 20 features are much more manageable than the 2574 from earlier, and they are arguably the most important features because they are the most popular items.

In [96]:
# Display top 5 rows
top_20_item_data.head()

Unnamed: 0_level_0,22961,22630,22139,21080,85099B,20726,20719,20750,23084,20725,21212,22551,22629,22328,21731,22556,22554,22423,22326,POST
CustomerID,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1
12347,0,0,0,0,0,0,4,0,3,0,0,0,0,0,5,0,0,4,0,0
12348,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,4
12349,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,1,1,1
12350,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,1
12352,0,1,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,2,0,5
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
17444,0,1,0,0,0,1,0,0,0,0,0,1,0,0,0,1,1,1,0,1
17508,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
17828,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,3
17829,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0


#### E. Finally, let's save this top 20 items dataframe as <code style="color:crimson">'threshold_item_data.csv'</code>.
* We'll see a different way to reduce dimensionality in the next module.
* We won't set <code style="color:steelblue">index=None</code> because we want to keep the CustomerIDs as the index.

In [97]:
# Save threshold_item_data.csv
top_20_item_data.to_csv('threshold_item_data.csv')