# 1a.Introduction

The goal of this notebook is to **explore a powerful method called "Principal Component Analysis" (PCA) and its extensions**. The main use of PCA is dimensionality reduction, which helps to understand which part of the data really contributes to explaining the variance. PCA is closely connected with Singular Value Decomposition (SVD), and the two overlap on many levels ([more info](https://mlfromscratch.com/principal-component-analysis-pca-svd/)).
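
As a minimal sketch of the PCA/SVD connection, the snippet below computes the principal axes of a small synthetic matrix twice: once through an SVD of the centered data and once through scikit-learn's `PCA`. The matrix `X_toy` and all variable names are hypothetical and are not part of the notebook's data set.

```python
import numpy as np
from sklearn.decomposition import PCA

# Hypothetical toy data: 200 samples, 5 correlated features
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(200, 5)) @ rng.normal(size=(5, 5))

# PCA by hand: SVD of the centered data matrix
X_centered = X_toy - X_toy.mean(axis=0)
U, s, Vt = np.linalg.svd(X_centered, full_matrices=False)
svd_variance = s**2 / (X_toy.shape[0] - 1)  # variance along each axis

# The same decomposition through scikit-learn
pca_toy = PCA(svd_solver="full").fit(X_toy)

# Rows of Vt agree with the principal axes up to a sign flip
same_axes = np.allclose(np.abs(Vt), np.abs(pca_toy.components_), atol=1e-6)
same_variance = np.allclose(svd_variance, pca_toy.explained_variance_)
print(same_axes, same_variance)
```

The comparison uses absolute values because the sign of each axis is arbitrary.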

Content:
* In '1a' and '1b' I introduce the topic, load the data and perform the key data cleaning and preparation
* Chapter '2' introduces the main **PCA** definition
* Chapter '3' presents **Sparse PCA**, in other words PCA with a sparsity constraint imposed
* In chapter '4' I describe **Kernel PCA**, which is a response to PCA's biggest shortcoming: the linearity assumption
* Next, in '5' I go through **Non-negative matrix factorization**, analysing it as an alternative dimensionality reduction method
* Last, in '6' the methods are summarised

**For clarity all the code was hidden; the exception is chapter 4, where the code is the point of the material.**

# 1b.Data preparation

I take this step from [another notebook of mine on this topic](https://www.kaggle.com/jjmewtw/clustering-k-means-hierarchical-debscan-ema). I apply the following steps:
* NA's cleaning
* categorical variables encoding (nominal & ordinal)
* numeric variables cleaning
* scaling

In [None]:
import numpy as np 
import pandas as pd 
import matplotlib.pyplot as plt
from mlxtend.preprocessing import minmax_scaling
import seaborn as sns
from sklearn.decomposition import PCA,SparsePCA,KernelPCA,NMF
from sklearn.datasets import make_circles

In [None]:
summer_products_path = "../input/summer-products-and-sales-in-ecommerce-wish/summer-products-with-rating-and-performance_2020-08.csv"
unique_categories_path = "../input/summer-products-and-sales-in-ecommerce-wish/unique-categories.csv"
unique_categories_sort_path = "../input/summer-products-and-sales-in-ecommerce-wish/unique-categories.sorted-by-count.csv"

summer_products = pd.read_csv(summer_products_path)
unique_categories = pd.read_csv(unique_categories_path)
unique_categories_sort = pd.read_csv(unique_categories_sort_path)

df = summer_products

C = (df.dtypes == 'object')
CategoricalVariables = list(C[C].index)
Integer = (df.dtypes == 'int64') 
Float   = (df.dtypes == 'float64') 
NumericVariables = list(Integer[Integer].index) + list(Float[Float].index)

df[NumericVariables]=df[NumericVariables].fillna(0)
df=df.drop('has_urgency_banner', axis=1) # 70 % NA's

df[CategoricalVariables]=df[CategoricalVariables].fillna('Unknown')
df=df.drop('urgency_text', axis=1) # 70 % NA's
df=df.drop('merchant_profile_picture', axis=1) # 86 % NA's

C = (df.dtypes == 'object')
CategoricalVariables = list(C[C].index)
Integer = (df.dtypes == 'int64') 
Float   = (df.dtypes == 'float64') 
NumericVariables = list(Integer[Integer].index) + list(Float[Float].index)

Size_map  = {'NaN':1, 'XXXS':2,'Size-XXXS':2,'SIZE XXXS':2,'XXS':3,'Size-XXS':3,'SIZE XXS':3,
            'XS':4,'Size-XS':4,'SIZE XS':4,'s':5,'S':5,'Size-S':5,'SIZE S':5,
            'M':6,'Size-M':6,'SIZE M':6,'32/L':7,'L.':7,'L':7,'SizeL':7,'SIZE L':7,
            'XL':8,'Size-XL':8,'SIZE XL':8,'XXL':9,'SizeXXL':9,'SIZE XXL':9,'2XL':9,
            'XXXL':10,'Size-XXXL':10,'SIZE XXXL':10,'3XL':10,'4XL':10,'5XL':10}

df['product_variation_size_id'] = df['product_variation_size_id'].map(Size_map)
df['product_variation_size_id']=df['product_variation_size_id'].fillna(1)
OrdinalVariables = ['product_variation_size_id']

Color_map  = {'NaN':'Unknown','Black':'black','black':'black','White':'white','white':'white','navyblue':'blue',
             'lightblue':'blue','blue':'blue','skyblue':'blue','darkblue':'blue','navy':'blue','winered':'red',
             'red':'red','rosered':'red','rose':'red','orange-red':'red','lightpink':'pink','pink':'pink',
              'armygreen':'green','green':'green','khaki':'green','lightgreen':'green','fluorescentgreen':'green',
             'gray':'grey','grey':'grey','brown':'brown','coffee':'brown','yellow':'yellow','purple':'purple',
             'orange':'orange','beige':'beige'}

df['product_color'] = df['product_color'].map(Color_map)
df['product_color']=df['product_color'].fillna('Unknown')

NominalVariables = [x for x in CategoricalVariables if x not in OrdinalVariables]
Lvl = df[NominalVariables].nunique()

ToDrop=['title','title_orig','currency_buyer', 'theme', 'crawl_month', 'tags', 'merchant_title','merchant_name',
              'merchant_info_subtitle','merchant_id','product_url','product_picture','product_id']
df = df.drop(ToDrop, axis = 1)
FinalNominalVariables = [x for x in NominalVariables if x not in ToDrop]

df_dummy = pd.get_dummies(df[FinalNominalVariables], columns=FinalNominalVariables)

df_clean = df.drop(FinalNominalVariables, axis = 1)
df_clean = pd.concat([df_clean, df_dummy], axis=1)

NumericVariablesNoTarget = [x for x in NumericVariables if x not in ['units_sold']]
df_scale=df_clean
df_scale = minmax_scaling(df_clean, columns=df_clean.columns)

print("The number of categorical variables: " + str(len(FinalNominalVariables)+len(OrdinalVariables)) +"; where 1 ordinal variable and 35 dummy variables")
print("The number of numeric variables: " + str(len(NumericVariables)))
df_scale.describe()

# 2.Principal Component Analysis (PCA)

First, I apply the aforementioned PCA to the given data. I do it for the whole data set, namely both numeric and categorical data. For this, it is very important to remember to scale and clean the data, because otherwise variables with bigger numbers will dominate the analysis. For the theory, [Wikipedia](https://en.wikipedia.org/wiki/Principal_component_analysis) should suffice, chapter: "Details".
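
To illustrate why scaling matters, here is a small sketch on synthetic data: two strongly correlated features, one of which is measured on a scale about 1000 times larger. The data and names are hypothetical, and scikit-learn's `MinMaxScaler` is used here as a stand-in for the `minmax_scaling` helper from mlxtend.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import MinMaxScaler

# Hypothetical toy data: the same signal on two very different scales
rng = np.random.default_rng(0)
base = rng.normal(size=500)
X_raw = np.column_stack([
    1000 * base + rng.normal(scale=100, size=500),  # large-scale feature
    base + rng.normal(scale=0.1, size=500),         # small-scale feature
])

# Without scaling, the first component points almost entirely
# along the large-scale feature
first_raw = PCA(n_components=1).fit(X_raw).components_[0]

# After min-max scaling both features contribute comparably
X_minmax = MinMaxScaler().fit_transform(X_raw)
first_scaled = PCA(n_components=1).fit(X_minmax).components_[0]

print(np.round(first_raw, 3))
print(np.round(first_scaled, 3))
```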

Basically, I would like to assess how many components suffice to cover the majority of variance in the data set. For this I look at cumulative variance explained:

In [None]:
pca = PCA().fit(df_scale)

fig, axes = plt.subplots(nrows=1, ncols=2, figsize=(12, 5), dpi=80, facecolor='w', edgecolor='k')
ax0, ax1 = axes.flatten()

ax0.plot(np.cumsum(pca.explained_variance_ratio_))
ax0.set_xlabel('Number of components')
ax0.set_ylabel('Cumulative explained variance');

ax1.bar(range(len(pca.explained_variance_)),pca.explained_variance_)
ax1.set_xlabel('Number of components')
ax1.set_ylabel('Explained variance');

plt.show()

In [None]:
# Cumulative share of variance, computed once and reused for every threshold
cum_var = np.cumsum(pca.explained_variance_ratio_)
# Each count is the number of components whose cumulative share is still
# at or below the threshold; one more component is needed to exceed it
n_PCA_50 = np.size(cum_var) - np.count_nonzero(cum_var > 0.5)
n_PCA_80 = np.size(cum_var) - np.count_nonzero(cum_var > 0.8)
n_PCA_90 = np.size(cum_var) - np.count_nonzero(cum_var > 0.9)
print("Already: " + format(n_PCA_50) + " components cover about 50% of variance.")
print("Already: " + format(n_PCA_80) + " components cover about 80% of variance.")
print("Already: " + format(n_PCA_90) + " components cover about 90% of variance.")
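
A small sketch of the counting logic, on made-up variance ratios rather than the notebook's results: the "size minus count" expression used in the cell above counts the components whose cumulative share has not yet exceeded the threshold, while the smallest number of components that does exceed it is one larger. The helper names are hypothetical.

```python
import numpy as np

# Hypothetical explained-variance ratios of four components
ratios = np.array([0.4, 0.3, 0.2, 0.1])
cum_ratio = np.cumsum(ratios)

def components_below(cum_ratio, threshold):
    """Components whose cumulative share is still at or below the threshold."""
    return int(np.size(cum_ratio) - np.count_nonzero(cum_ratio > threshold))

def components_needed(cum_ratio, threshold):
    """Smallest number of components whose cumulative share exceeds the threshold.

    Assumes the threshold is below the final cumulative share,
    otherwise argmax would return 0 for an all-False mask.
    """
    return int(np.argmax(cum_ratio > threshold)) + 1

print(components_below(cum_ratio, 0.5), components_needed(cum_ratio, 0.5))
print(components_below(cum_ratio, 0.8), components_needed(cum_ratio, 0.8))
```

Which of the two numbers to report depends on whether the threshold should be approached from below or actually exceeded.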

I decide to go further with 12 components, since 80% of variance explained satisfies my arbitrary threshold. Next, I plot our 59 variables in columns and principal components in rows. This way I can see how much a particular variable contributes to a particular component and what its sign is (-/+).

In [None]:
pca = PCA(12).fit(df_scale)

X_pca=pca.transform(df_scale) 

plt.matshow(pca.components_,cmap='viridis')
plt.yticks(range(12),['1st Comp','2nd Comp','3rd Comp','4th Comp','5th Comp','6th Comp','7th Comp','8th Comp','9th Comp','10th Comp','11th Comp','12th Comp'],fontsize=10)
plt.colorbar()
plt.xticks(range(len(df_scale.columns)),rotation=0)
plt.tight_layout()
plt.show()

A lot of variables, let's look closer:

In [None]:
CompOne = pd.DataFrame(list(zip(df_scale.columns,pca.components_[0])),columns=('Name','Contribution to Component 1'),index=range(1,60,1))
CompOne = CompOne[(CompOne['Contribution to Component 1']>0.05) | (CompOne['Contribution to Component 1']< -0.05)]
CompOne

Alright, the 1st component is vastly dominated by 'uses_ad_boosts': ads are power, ok. But with so many features and so many components I need something iterative, so I define a procedure which chooses the features that contribute more than 0.10 or less than -0.10 to any of the components.

In [None]:
def ExtractColumn(lst,j): 
    return [item[j] for item in lst] 

PCA_vars = [0]*len(df_scale.columns)

for i, feature in zip(range(len(df_scale.columns)),df_scale.columns):
    x = ExtractColumn(pca.components_,i)
    if ((max(x) > 0.1) | (min(x) < -0.1)):
        if abs(max(x)) > abs(min(x)):
            PCA_vars[i] = max(x)
        else:
            PCA_vars[i] = min(x)                 
    else:
        PCA_vars[i] = 0

PCA_vars = pd.DataFrame(list(zip(df_scale.columns,PCA_vars)),columns=('Name','Max absolute contribution'),index=range(1,60,1))      
PCA_vars = PCA_vars[(PCA_vars['Max absolute contribution']!=0)]
PCA_vars
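
The selection loop above can also be written in vectorized form. The following is only a sketch on a made-up 2 x 3 component matrix; `max_abs_contribution` and `toy_components` are hypothetical names. One difference in behaviour: on an exact tie between a positive and a negative value of equal magnitude, this version keeps the one from the earlier component, whereas the loop keeps the negative one.

```python
import numpy as np

def max_abs_contribution(components, threshold=0.1):
    """For every feature (column), return the loading with the largest
    absolute value across components, or 0 if it does not pass the threshold."""
    components = np.asarray(components)
    strongest = np.abs(components).argmax(axis=0)  # row index per column
    values = components[strongest, np.arange(components.shape[1])]
    return np.where(np.abs(values) > threshold, values, 0.0)

# Hypothetical loadings: 2 components (rows) x 3 features (columns)
toy_components = np.array([[0.5, -0.05, 0.02],
                           [-0.7, 0.08, 0.30]])
selected = max_abs_contribution(toy_components)
print(selected)
```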

Quick observations: the algorithm did not eliminate the correlated 'rating' variables; a lot of color dummy variables remained.

# 3.Sparse Principal Component Analysis (Sparse PCA)

The idea behind the algorithm is similar, but enhanced. As defined on the [scikit-learn website](https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.SparsePCA.html):
> Finds the set of sparse components that can optimally reconstruct the data. The amount of sparseness is controllable by the coefficient of the L1 penalty, given by the parameter alpha.

I see here a use similar to the one behind LASSO, namely an L1 norm used as the penalty. I start the analysis with 12 components, the value suggested by plain PCA. The sparsity-inducing norm also prevents learning components from noise when few training samples are available.

First, what we expect is sparsity of the results: in contrast to dense structures (matrices, vectors, ...), most elements are zeros. Namely, components are mapped just to particular parts of the data. This simplifies interpretation compared to regular PCA. Sparse principal components yield a more parsimonious, interpretable representation, clearly emphasizing which of the original features contribute to the differences between samples.
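
A rough sketch of what "sparse" means in practice, on synthetic data with three latent factors that each drive a separate group of three features. Everything here (the data, `alpha=5`, the variable names) is an assumption chosen for illustration and is not tuned for the notebook's data set.

```python
import numpy as np
from sklearn.decomposition import PCA, SparsePCA

# Hypothetical toy data: 3 latent factors, each loading on its own 3 features
rng = np.random.default_rng(0)
n_samples = 200
latent = rng.normal(size=(n_samples, 3)) * np.array([3.0, 2.0, 1.0])
loadings = np.zeros((3, 9))
loadings[0, 0:3] = 1.0
loadings[1, 3:6] = 1.0
loadings[2, 6:9] = 1.0
X_toy = latent @ loadings + 0.1 * rng.normal(size=(n_samples, 9))

dense = PCA(n_components=3).fit(X_toy)
sparse = SparsePCA(n_components=3, alpha=5, random_state=0).fit(X_toy)

# Share of loadings that are exactly zero in each model
dense_zero_share = np.mean(dense.components_ == 0)
sparse_zero_share = np.mean(sparse.components_ == 0)
print(f"exact zeros: PCA {dense_zero_share:.2f}, Sparse PCA {sparse_zero_share:.2f}")
```

A larger `alpha` pushes more loadings to exactly zero, at the cost of a worse reconstruction.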

In [None]:
SPCA = SparsePCA(n_components=12)
SPCA_fit = SPCA.fit(df_scale)

plt.matshow(SPCA_fit.components_,cmap='viridis')
plt.yticks(range(12),['1st Comp','2nd Comp','3rd Comp','4th Comp','5th Comp','6th Comp','7th Comp','8th Comp','9th Comp','10th Comp','11th Comp','12th Comp'],fontsize=10)
plt.colorbar()
plt.xticks(range(len(df_scale.columns)),rotation=0)
plt.tight_layout()
plt.show()

Indeed, the results seem correct at a glance: the output matrix is sparse. Next I repeat the step from the previous chapter, where the top contributing variables are selected.

In [None]:
SPCA_vars = [0]*len(df_scale.columns)

for i, feature in zip(range(len(df_scale.columns)),df_scale.columns):
    x = ExtractColumn(SPCA_fit.components_,i)
    if ((max(x) > 0.1) | (min(x) < -0.1)):
        if abs(max(x)) > abs(min(x)):
            SPCA_vars[i] = max(x)
        else:
            SPCA_vars[i] = min(x)                 
    else:
        SPCA_vars[i] = 0

SPCA_vars = pd.DataFrame(list(zip(df_scale.columns,SPCA_vars)),columns=('Name','Max absolute contribution'),index=range(1,60,1))      
SPCA_vars = SPCA_vars[(SPCA_vars['Max absolute contribution']!=0)]
SPCA_vars

I will compare these results altogether at the end.

# 4.Kernel Principal Component Analysis (Kernel PCA)

My favorite extension of PCA is Kernel PCA (KPCA), which deals epically with the linearity requirement. Once more: PCA detects only linear relationships, and KPCA is the amazing answer and a crucial generalization of PCA. Using a kernel, the originally linear operations of PCA are performed in a reproducing kernel Hilbert space, more on [Wikipedia](https://en.wikipedia.org/wiki/Kernel_principal_component_analysis). I will use the [example from scikit-learn](https://scikit-learn.org/stable/auto_examples/decomposition/plot_kernel_pca.html#sphx-glr-auto-examples-decomposition-plot-kernel-pca-py) as the basic methodology and repeat the use of this fantastic method for our data.

To be clear, KPCA is not another method for variability detection; rather, it offers a transformation of the data such that linear methods can then be applied to it.
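
A minimal sketch of this idea, loosely following the scikit-learn example linked above: two concentric circles cannot be separated by a straight line in the original space or after linear PCA, but become much easier to separate after an RBF kernel PCA. The sample size, `gamma=10` and the use of `LogisticRegression` as a simple linear probe are assumptions made for illustration.

```python
import numpy as np
from sklearn.datasets import make_circles
from sklearn.decomposition import PCA, KernelPCA
from sklearn.linear_model import LogisticRegression

# Two concentric circles: a classic non-linear toy problem
X_circ, y_circ = make_circles(n_samples=400, factor=0.3, noise=0.05, random_state=0)

# Project with linear PCA and with RBF kernel PCA
X_lin = PCA(n_components=2).fit_transform(X_circ)
X_rbf = KernelPCA(n_components=2, kernel="rbf", gamma=10).fit_transform(X_circ)

# A linear classifier serves as a probe for linear separability
acc_lin = LogisticRegression().fit(X_lin, y_circ).score(X_lin, y_circ)
acc_rbf = LogisticRegression().fit(X_rbf, y_circ).score(X_rbf, y_circ)
print(f"linear PCA: {acc_lin:.2f}, kernel PCA: {acc_rbf:.2f}")
```

The accuracies are training accuracies and only indicate how linearly separable each projection is.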

In [None]:
KPCA = KernelPCA(n_components = len(df_scale.columns), kernel="rbf", fit_inverse_transform=True, gamma=10)
KPCA_fit = KPCA.fit(df_scale)
X_KPCA = KPCA_fit.transform(df_scale)  # reuse the model fitted above instead of fitting it a second time
X_KPCA_back = KPCA.inverse_transform(X_KPCA)

A frequently asked question, **is it possible to use KPCA for feature selection**, is discussed [here](https://stats.stackexchange.com/questions/8182/is-it-possible-to-use-kernel-pca-for-feature-selection); the answer reads as follows:
> (...) in kernel PCA each principal component is a linear combination of features in the target space, and for e.g. Gaussian kernel (which is often used) the target space is infinite-dimensional. So the concept of "loadings" does not really make sense for kPCA,(...)

# 5.Non-negative matrix factorization (NMF)

This algorithm is based on the requirement that the matrices involved have only non-negative elements. Perhaps paradoxically, this non-negativity makes the resulting matrices easier to inspect. Its performance compared to PCA depends on the case and is hard to assess in general.

**Some relations to other techniques:** NMF is not a direct member of the PCA family, but is considered an alternative method. Some types of NMF are an instance of a more general probabilistic model called "multinomial PCA". What's interesting, NMF with the least-squares objective is equivalent to a relaxed form of K-means clustering. Furthermore, NMF is an instance of nonnegative quadratic programming (NQP), just like the support vector machine (SVM).
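
A small sketch of the factorization itself, on a random non-negative matrix: NMF approximates `X` by a product `W @ H` in which both factors contain only non-negative entries. The toy matrix, the number of components and the solver settings are assumptions made for illustration.

```python
import numpy as np
from sklearn.decomposition import NMF

# Hypothetical non-negative toy data: 20 samples, 6 features in [0, 1)
rng = np.random.default_rng(0)
X_toy = rng.random((20, 6))

nmf_toy = NMF(n_components=3, init="nndsvda", max_iter=1000, random_state=0)
W = nmf_toy.fit_transform(X_toy)  # sample weights, shape (20, 3)
H = nmf_toy.components_           # component loadings, shape (3, 6)

# Relative reconstruction error of the rank-3 approximation
rel_error = np.linalg.norm(X_toy - W @ H) / np.linalg.norm(X_toy)
print(W.min() >= 0, H.min() >= 0, round(rel_error, 3))
```

Because no entry can be negative, every component can only add to the reconstruction, which is what makes the rows of `H` readable as additive parts.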



In [None]:
NNMF = NMF(n_components=12)
NMF_fit = NNMF.fit(df_scale)

# Global maximum over all components, used to rescale the values into [0, 1]
GMax = NMF_fit.components_.max()

ScaledList = NMF_fit.components_ / GMax

plt.matshow(ScaledList,cmap='viridis')
plt.yticks(range(12),['1st Comp','2nd Comp','3rd Comp','4th Comp','5th Comp','6th Comp','7th Comp','8th Comp','9th Comp','10th Comp','11th Comp','12th Comp'],fontsize=10)
plt.colorbar()
plt.xticks(range(len(df_scale.columns)),rotation=0)
plt.tight_layout()
plt.show()

The results of NMF are not standardized. Hence, I divide the components by their global maximum so that the values above lie between 0 and 1. The selection method used previously can then be applied as before.

In [None]:
NMF_vars = [0]*len(df_scale.columns)

for i, feature in zip(range(len(df_scale.columns)),df_scale.columns):
    x = ExtractColumn(ScaledList,i)
    if ((max(x) > 0.1) | (min(x) < -0.1)):
        if abs(max(x)) > abs(min(x)):
            NMF_vars[i] = max(x)
        else:
            NMF_vars[i] = min(x)                 
    else:
        NMF_vars[i] = 0

NMF_vars = pd.DataFrame(list(zip(df_scale.columns,NMF_vars)),columns=('Name','Max absolute contribution'),index=range(1,60,1))      
NMF_vars = NMF_vars[(NMF_vars['Max absolute contribution']!=0)]
NMF_vars

The values were calibrated to obtain a similar number of features.

# 6.Break it down altogether

I summarize the results by comparing the output of PCA, Sparse PCA and NMF. As mentioned, 12 components were used for all methods, since on the basis of PCA they account for 80% of the variability. First I look at the list of variables chosen by our algorithms.

In [None]:
All_Features = np.unique(list(PCA_vars['Name'])+list(SPCA_vars['Name'])+list(NMF_vars['Name']))

All_Features_df =  pd.DataFrame(zip(All_Features,[False]*len(All_Features),[False]*len(All_Features),
                                [False]*len(All_Features)),columns=['Feature','Is in PCA','Is in SPCA','Is in NMF'])

All_Features_df['Is in PCA'] = All_Features_df['Feature'].isin(PCA_vars['Name'])
All_Features_df['Is in SPCA'] = All_Features_df['Feature'].isin(SPCA_vars['Name'])
All_Features_df['Is in NMF'] = All_Features_df['Feature'].isin(NMF_vars['Name'])

All_Features_df=All_Features_df.sort_values('Feature')

All_Features_df

The big difference between plain PCA and SPCA/NMF is the set of 'rating' features. SPCA and NMF drop these highly correlated features, not bad. However, they keep different dummy levels for color, which is a second good sign, because these features indeed point at other interesting groups.

In [None]:
print(format(sum(All_Features_df['Is in PCA'])) + " features by PCA; " + format(sum(All_Features_df['Is in SPCA'])) + " features by SPCA; " +
     format(sum(All_Features_df['Is in NMF'])) + " features by NMF. ")

Principal component analysis, the basic algorithm, accepted the most variables. SPCA dropped the most of them, focusing just on the most significant elements. For NMF I applied scaling to keep the results comparable. All three methods can be improved by calibration or by the choice of solver. In the final plots, PCA and SPCA are shown in absolute values to make them comparable with NMF.

In [None]:
fig, axes = plt.subplots(nrows=3, ncols=1, figsize=(15, 10), dpi=80, facecolor='w', edgecolor='k')
ax0, ax1, ax2 = axes.flatten()

ScaledListPCA = abs(pca.components_)
ScaledListSPCA = abs(SPCA_fit.components_)

ax0.matshow(ScaledListPCA,cmap='viridis')
ax1.matshow(ScaledListSPCA,cmap='viridis')
ax2.matshow(ScaledList,cmap='viridis')

plt.show()

# Add.Further reading
* [Data cleaning step by step, notebook by me](https://www.kaggle.com/jjmewtw/prices-cleaning-analysis-estimation-in-stages)
* [Correlation and the method's relevance, notebook by me](https://www.kaggle.com/jjmewtw/yt-pearson-spearman-distance-corr-rv-coef)
* [Comprehensive PCA, notebook by Andrea Sindico](https://www.kaggle.com/asindico/customer-segments-with-pca)
* [Kernel PCA, notebook by "bronson"](https://www.kaggle.com/jsultan/visualizing-classifier-boundaries-using-kernel-pca)
* [PCA, KPCA, KNN, notebook by "nic"](https://www.kaggle.com/nicw102168/trying-out-some-pca-nmf-and-knn)