# Dimensionality Reduction

In the previous step, we analysed and cleaned the data set to identify purchase patterns. In this section we derive product-specific dimensions (features) for the model. Only relevant features should be included; otherwise the model would suffer from the "curse of dimensionality".

Here we discuss a simple method for reducing dimensions by using thresholds. The resulting dataset will be used by the clustering algorithm.

In [1]:
# Pandas for DataFrames
import pandas as pd

# NumPy for numerical computing
import numpy as np

# Matplotlib for visualization
import matplotlib.pyplot as plt

# display plots in the notebook
%matplotlib inline

# Seaborn for easier visualization
import seaborn as sns

# StandardScaler from Scikit-Learn
from sklearn.preprocessing import StandardScaler
# Suppress warnings to keep the notebook output readable
import warnings
warnings.filterwarnings("ignore")

In [2]:
# loading dataset
data = pd.read_excel("/Users/ruchitha/Desktop/DataMining_Projects/Online-retail-Analysis/clean_data.xlsx")

In [3]:
# StockCode of every line item (data['StockCode'].nunique() gives the number of unique items)
data['StockCode']

0         85123A
1          71053
2         84406B
3         84029G
4         84029E
5          22752
6          21730
7          22633
8          22632
9          84879
10         22745
11         22748
12         22749
13         22310
14         84969
15         22623
16         22622
17         21754
18         21755
19         21777
20         48187
21         22960
22         22913
23         22912
24         22914
25         21756
26         22728
27         22727
28         22726
29         21724
           ...  
294411     21588
294412     21980
294413     21983
294414     22076
294415     22077
294416     22111
294417     22112
294418     22113
294419     22139
294420     22423
294421     22549
294422     22585
294423     22605
294424     22866
294425     21810
294426     21811
294427     22741
294428     23080
294429     23103
294430     23284
294431     23271
294432     22621
294433     22694
294434     21892
294435     21430
294436     85150
294437     21809
294438    4756
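
The column above lists one stock code per line item, so the number of distinct products has to be computed separately. A minimal sketch on a hypothetical toy frame (the `toy` data below is invented for illustration and is not part of `clean_data.xlsx`):

```python
import pandas as pd

# Hypothetical toy transactions standing in for the cleaned retail data
toy = pd.DataFrame({
    "CustomerID": [17850, 17850, 13047, 13047, 12583],
    "StockCode": ["85123A", "71053", "85123A", "22752", "22752"],
})

# Number of distinct products; this is also how many dummy columns
# pd.get_dummies would create for this column
n_products = toy["StockCode"].nunique()

# Number of line items per product
line_items = toy["StockCode"].value_counts()

print(n_products)
print(line_items)
```

On the real data, the same two calls on `data['StockCode']` would show how wide the dummy-variable frame is going to be before it is built.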

**Dataframe of dummy variables for 'StockCode'**

In [4]:
# generating Product dummies
product_dummies = pd.get_dummies(data['StockCode'])

# adding CustomerID feature to Product dummies
product_dummies['CustomerID'] = data['CustomerID']

# displaying first n rows
product_dummies.head(5)

Unnamed: 0,10002,10080,10120,10125,10133,10135,11001,15030,15034,15036,...,90214M,90214N,90214P,90214R,90214S,90214V,90214Y,BANK CHARGES,C2,CustomerID
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,17850
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,17850
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,17850
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,17850
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,17850
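
To make the structure of the dummy frame easier to see, here is a small sketch on invented data (the `toy` frame is an assumption for illustration). Passing `dtype=int` is a choice made here so that the indicators are 0/1 integers regardless of the pandas version in use:

```python
import pandas as pd

# Hypothetical toy transactions: two line items for one customer, one for another
toy = pd.DataFrame({
    "CustomerID": [17850, 17850, 13047],
    "StockCode": ["85123A", "71053", "85123A"],
})

# One indicator column per distinct StockCode, one row per line item
dummies = pd.get_dummies(toy["StockCode"], dtype=int)

# Attach the customer so that rows can later be aggregated per customer
dummies["CustomerID"] = toy["CustomerID"]

print(dummies)
```

Each row contains exactly one 1 among the product columns, marking the product bought on that line.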


In [5]:
# Aggregating at customer level
product_data = product_dummies.groupby('CustomerID').sum()

# Display first 5 rows of product_data
product_data.head(5)

CustomerID,10002,10080,10120,10125,10133,10135,11001,15030,15034,15036,...,90214L,90214M,90214N,90214P,90214R,90214S,90214V,90214Y,BANK CHARGES,C2
12346,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
12347,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
12348,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
12350,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
12352,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
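
The aggregation step can be checked on a small example. The sketch below uses invented data and also shows `pd.crosstab`, which is one possible shortcut (not the method used in this notebook) that produces the same customer-by-product count table without materialising the intermediate dummy frame:

```python
import pandas as pd

# Hypothetical toy transactions
toy = pd.DataFrame({
    "CustomerID": [17850, 17850, 17850, 13047],
    "StockCode": ["85123A", "71053", "85123A", "85123A"],
})

# Route used in this notebook: dummies, then sum per customer
via_dummies = (
    pd.get_dummies(toy["StockCode"], dtype=int)
    .assign(CustomerID=toy["CustomerID"])
    .groupby("CustomerID")
    .sum()
)

# Alternative route: count (customer, product) pairs directly
via_crosstab = pd.crosstab(toy["CustomerID"], toy["StockCode"])

print(via_dummies)
print(via_crosstab)
```

Both tables have one row per customer and one column per product, with each cell holding the number of line items for that pair.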


In [6]:
# number of times each product was purchased
product_data.sum()

10002            49
10080            14
10120            20
10125            59
10133           124
10135            92
11001            52
15030             9
15034            74
15036           289
15039            58
16008            45
16010             3
16011            35
16012            28
16014            42
16015            20
16016            49
16033             4
16043             2
16045            52
16046            12
16048            33
16049             5
16052             6
16054            21
16216            28
16218            31
16219            36
16225            35
               ... 
90209B           13
90209C           12
90210A            3
90210B            4
90210C            3
90210D            3
90211A            1
90211B            1
90212B            1
90212C            2
90214A            8
90214B            1
90214C            4
90214D            3
90214E            4
90214G            3
90214H            2
90214I            1
90214J            3


In [7]:
# writing the new set of features to an Excel sheet
# (index=False drops the CustomerID index from the output)
with pd.ExcelWriter('product_data.xlsx') as writer:
    product_data.to_excel(writer, sheet_name='Sheet1', index=False)

In [8]:
# Display the 30 most popular products (ascending order, most popular last)
product_data.sum().sort_values().tail(30)

22727      668
22666      679
22178      685
22699      686
20726      713
82482      720
22993      721
23206      741
22386      751
22960      756
22197      758
22961      760
22469      767
23209      797
22457      801
22382      801
22384      813
23203      818
20728      834
23298      872
20727      887
21212      892
22383      904
22720      960
84879     1094
20725     1114
47566     1278
85099B    1308
22423     1449
85123A    1674
dtype: int64

**Using a threshold to limit the feature count**

This involves the following steps:
    
    Step-1: Calculate the total number of times each product was sold
    Step-2: Sort the products by that total in descending order
    Step-3: Select the first 30 products as features

Keeping only these 30 columns reduces the dimensionality of the dataset.
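
The three steps can be sketched as a small helper. The function name `top_n_products` and the toy count matrix are hypothetical, introduced only for illustration; `Series.nlargest` is used here as a compact way to combine the sort and the selection:

```python
import pandas as pd

# Hypothetical customer-by-product count matrix
product_counts = pd.DataFrame(
    {"A": [2, 0, 1], "B": [0, 1, 0], "C": [3, 4, 1], "D": [1, 0, 1]},
    index=pd.Index([12346, 12347, 12348], name="CustomerID"),
)

def top_n_products(counts: pd.DataFrame, n: int) -> pd.DataFrame:
    """Keep only the n product columns with the largest column totals."""
    totals = counts.sum()                # Step 1: total per product
    top_cols = totals.nlargest(n).index  # Steps 2 and 3: sort descending, take first n
    return counts[top_cols]

reduced = top_n_products(product_counts, n=2)
print(reduced)
```

With `n=30` on `product_data`, this should select the same 30 products as sorting in ascending order and taking the tail, only with the columns ordered from most to least popular.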
        

In [9]:
# Getting the column labels of the top 30 product features
Products_top_30_indx = product_data.sum().sort_values().tail(30).index

# Selecting top 30 features
# This ensures that only the most frequently purchased products are kept for further analysis
Products_top_30_data = product_data[Products_top_30_indx]

# Shape of this new dataset
Products_top_30_data.shape

(3835, 30)

In [10]:
# writing the top 30 product features to an Excel file, keeping CustomerID as the index column
with pd.ExcelWriter('Products_top_30_data.xlsx') as writer:
    Products_top_30_data.to_excel(writer, sheet_name='Sheet1', index=True, index_label='CustomerID')

This dimensionality reduction step keeps only the 30 most frequently purchased products as features and filters out every product that falls below that cut-off.
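
A closely related variant, shown here only as a sketch and not used in this notebook, expresses the cut-off as a minimum purchase count instead of a fixed number of columns. The value of `min_purchases` and the toy matrix are assumptions for illustration:

```python
import pandas as pd

# Hypothetical customer-by-product count matrix
product_counts = pd.DataFrame(
    {"A": [2, 0, 1], "B": [0, 1, 0], "C": [3, 4, 1]},
    index=pd.Index([12346, 12347, 12348], name="CustomerID"),
)

min_purchases = 3  # hypothetical cut-off, to be tuned on the real data

# Keep only the products whose total purchase count reaches the cut-off
totals = product_counts.sum()
kept = product_counts.loc[:, totals >= min_purchases]

print(kept.columns.tolist())
```

The trade-off is that a count threshold lets the number of retained features vary with the data, whereas a top-30 rule fixes the feature count in advance.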