# Dimensionality Reduction



Our objective is to incorporate information about **specific item purchases** into the clusters, so that the model is more likely to group together customers who buy similar items.

**Import the libraries and load the cleaned transaction-level data.**

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns

**Import the cleaned dataset.**


In [2]:
df = pd.read_csv('cleaned_data.csv')



# Toy example: rolling up item data

To illustrate how we'll **roll up item information to the customer level**, let's use a toy example.

**Create a toy dataframe that contains only the transactions for 2 customers** (#12734, #12755).


In [3]:
toy_df = df[df.CustomerID.isin([12734,12755])]
toy_df.head()

Unnamed: 0,InvoiceNo,StockCode,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country,Sales
949,537899,22328,ROUND SNACK BOXES SET OF 4 FRUITS,1488,12/9/2010 10:44,2.55,12755,Japan,3794.4
2181,539829,82613D,METAL SIGN CUPCAKE SINGLE HOOK,20,12/22/2010 12:47,0.42,12734,France,8.4
2182,539829,82613B,"METAL SIGN,CUPCAKE SINGLE HOOK",20,12/22/2010 12:47,0.42,12734,France,8.4
2183,539829,22223,CAKE PLATE LOVEBIRD PINK,24,12/22/2010 12:47,1.95,12734,France,46.8
2184,539829,22222,CAKE PLATE LOVEBIRD WHITE,24,12/22/2010 12:47,1.95,12734,France,46.8


**Create a dataframe of dummy variables for <code style="color:steelblue">'StockCode'</code>.**


In [4]:
toy_item_dummies  = pd.get_dummies(toy_df.StockCode)

In [5]:
toy_item_dummies['CustomerID'] = toy_df['CustomerID']

In [6]:
toy_item_dummies

Unnamed: 0,22222,22223,22328,22652,22654,22968,82613B,82613D,M,CustomerID
949,0,0,1,0,0,0,0,0,0,12755
2181,0,0,0,0,0,0,0,1,0,12734
2182,0,0,0,0,0,0,1,0,0,12734
2183,0,1,0,0,0,0,0,0,0,12734
2184,1,0,0,0,0,0,0,0,0,12734
8336,0,0,0,0,1,0,0,0,0,12755
8337,0,0,0,0,0,1,0,0,0,12755
8338,0,0,0,1,0,0,0,0,0,12755
8484,0,0,0,0,0,0,0,0,1,12755
11085,0,0,1,0,0,0,0,0,0,12755


**Finally, we can aggregate this information to the customer-level**.


In [7]:
toy_item_data = toy_item_dummies.groupby('CustomerID').sum()
toy_item_data

CustomerID,22222,22223,22328,22652,22654,22968,82613B,82613D,M
12734,2,2,0,0,0,0,2,2,0
12755,0,0,3,2,2,2,0,0,1
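The roll-up can also be checked on a smaller, fully visible example. The sketch below uses an invented five-row transaction table (the customer IDs and stock codes are hypothetical, not taken from the dataset) and compares the two-step dummies-then-groupby approach with `pd.crosstab`, which counts rows per customer and item in one call.

```python
import pandas as pd

# Hypothetical miniature transaction table (values invented for illustration)
toy = pd.DataFrame({
    "CustomerID": [1, 1, 1, 2, 2],
    "StockCode": ["A", "B", "A", "B", "C"],
})

# Two-step roll-up: one dummy column per item, then sum per customer
dummies = pd.get_dummies(toy["StockCode"], dtype=int)
dummies["CustomerID"] = toy["CustomerID"]
rolled_up = dummies.groupby("CustomerID").sum()

# One-step alternative: count rows for each (customer, item) pair
counts = pd.crosstab(toy["CustomerID"], toy["StockCode"])

print(rolled_up)
print(counts)
```

Both tables hold the number of transaction rows per customer and item, so either form should work as customer-level item data.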


# High dimensionality: applying the toy example to the entire dataset


**First, create a dataframe of dummy variables for <code style="color:steelblue">'StockCode'</code>, this time for the full dataset.**


In [8]:
# Get item_dummies
item_dummies = pd.get_dummies(df['StockCode'])

# Add CustomerID to item_dummies
item_dummies['CustomerID'] = df['CustomerID']

# Display first 5 rows of item_dummies
item_dummies.head(5)

Unnamed: 0,10002,10120,10125,10133,10135,11001,15034,15036,15039,15044A,...,90205A,90205C,90208,90209A,90209C,C2,D,M,POST,CustomerID
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,12583
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,12583
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,12583
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,12583
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,12583


**Roll up the item dummies into customer-level item data.**


In [9]:
# Create item_data by aggregating at customer level
item_data = item_dummies.groupby('CustomerID').sum()

# Display first 5 rows of item_data
item_data.head(5)

CustomerID,10002,10120,10125,10133,10135,11001,15034,15036,15039,15044A,...,90204,90205A,90205C,90208,90209A,90209C,C2,D,M,POST
12347,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
12348,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,8
12349,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,2
12350,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,2
12352,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,9,10


**The total number of times each item was purchased.**

In [10]:
# Total times each item was purchased
item_data.sum()

10002        25
10120         2
10125        26
10133        12
10135         9
11001        17
15034        10
15036        42
15039         6
15044A       12
15044B        6
15044C        4
15044D        9
15056BL     113
15056N       86
15056P       57
15058A       18
15058B       16
15058C        9
15060B       28
16008        22
16011         7
16012         8
16014        21
16016        33
16045        20
16048        19
16052         2
16054         6
16156L       13
           ... 
90177A        2
90177C        2
90177D        3
90177E        2
90182C        2
90183A        2
90184B        2
90184C        2
90185A        2
90185B        3
90185C        3
90192         4
90201A        4
90201B        8
90201C        6
90201D        3
90202A        2
90202B        2
90202C        2
90202D        4
90204         4
90205A        1
90205C        1
90208         2
90209A        2
90209C        2
C2          110
D             3
M           114
POST       2165
Length: 2796, dtype: int64

**Save this customer-level item dataframe as <code style="color:crimson">'item_data.csv'</code> so that it can be used again at a later stage.**

In [17]:
# Save item_data.csv
item_data.to_csv('item_data.csv')
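When the file is read back later, `CustomerID` is stored as the first CSV column rather than as the index. A minimal sketch of the round trip, using an invented two-customer frame and an in-memory buffer instead of a real file, shows how `index_col=0` restores the index:

```python
import io
import pandas as pd

# Hypothetical customer-level frame (values invented for illustration)
frame = pd.DataFrame(
    {"A": [2, 0], "B": [0, 1]},
    index=pd.Index([12347, 12348], name="CustomerID"),
)

# Write to an in-memory buffer; to_csv includes the index by default
buffer = io.StringIO()
frame.to_csv(buffer)

# Reading without index_col turns CustomerID into an ordinary column
buffer.seek(0)
without_index = pd.read_csv(buffer)

# Reading with index_col=0 restores CustomerID as the index
buffer.seek(0)
restored = pd.read_csv(buffer, index_col=0)

print(without_index.columns.tolist())
print(restored.index.name)
```

If the saved index has no name, the same behavior produces a column labelled `Unnamed: 0` on reload, which is probably where that column in the earlier output comes from.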

# Reduce Dimensionality by threshold creation



One **simple and straightforward way** to reduce the dimensionality of this item data is to set a **threshold** for keeping features. Here, we keep only the 20 most frequently purchased items.
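A threshold can be expressed either as a minimum purchase count or as a fixed number of top items. The sketch below shows both on an invented three-customer frame; the cutoffs `min_purchases` and `top_n` are illustrative assumptions, not values used in this notebook.

```python
import pandas as pd

# Hypothetical customer-level item counts (values invented for illustration)
toy_item_counts = pd.DataFrame(
    {"A": [3, 0, 2], "B": [0, 1, 0], "C": [4, 5, 1]},
    index=pd.Index([101, 102, 103], name="CustomerID"),
)

# Total purchases per item across all customers
item_totals = toy_item_counts.sum()

# Option 1: keep items purchased at least `min_purchases` times (assumed cutoff)
min_purchases = 5
kept_by_count = toy_item_counts.loc[:, item_totals >= min_purchases]

# Option 2: keep the `top_n` most purchased items (assumed cutoff)
top_n = 1
kept_top_n = toy_item_counts[item_totals.nlargest(top_n).index]

print(kept_by_count.columns.tolist())
print(kept_top_n.columns.tolist())
```

A count-based cutoff lets the number of retained features vary with the data, while a top-N cutoff fixes the output width in advance.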

In [12]:
# Display most popular 20 items
item_data.sum().sort_values().tail(20)

23245      258
22961      265
21080      267
22630      267
20726      270
20719      278
20750      279
85099B     280
23084      299
20725      315
21212      329
22551      338
22629      361
21731      361
22328      361
22556      386
22554      415
22423      553
22326      584
POST      2165
dtype: int64

In [13]:
# Get list of StockCodes for the 20 most popular items
top_20_items = item_data.sum().sort_values().tail(20).index

top_20_items

Index(['23245', '22961', '21080', '22630', '20726', '20719', '20750', '85099B',
       '23084', '20725', '21212', '22551', '22629', '21731', '22328', '22556',
       '22554', '22423', '22326', 'POST'],
      dtype='object')

In [14]:
# Keep only features for top 20 items
top_20_item_data  = item_data[top_20_items]

#Shape of remaining dataframe
top_20_item_data.shape

(422, 20)

Here, take a look:

In [15]:
# Display first 5 rows of top_20_item_data
top_20_item_data.head(5)

CustomerID,23245,22961,21080,22630,20726,20719,20750,85099B,23084,20725,21212,22551,22629,21731,22328,22556,22554,22423,22326,POST
12347,0,0,0,0,0,8,0,0,6,0,0,0,0,10,0,0,0,8,0,0
12348,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,8
12349,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,2,2,2,2,2
12350,0,0,0,0,0,0,0,0,0,0,0,2,0,0,0,0,0,0,0,2
12352,4,0,0,2,0,0,0,0,0,0,0,0,0,2,0,0,0,4,0,10


**Saving this top 20 items dataframe as <code style="color:blue">'threshold_item_data.csv'</code>.**


In [16]:
# Save threshold_item_data.csv
top_20_item_data.to_csv('threshold_item_data.csv')