# Principal component analysis (PCA)

---

**Sources:**

[Principal Component Analysis and k-means Clustering to Visualize a High Dimensional Dataset](https://medium.com/@dmitriy.kavyazin/principal-component-analysis-and-k-means-clustering-to-visualize-a-high-dimensional-dataset-577b2a7a5fe2)

[PCA using Python (scikit-learn)](https://towardsdatascience.com/pca-using-python-scikit-learn-e653f8989e60)

[Everything you did and didn't know about PCA](http://alexhwilliams.info/itsneuronalblog/2016/03/27/pca/)

[Feature Extraction using Principal Component Analysis — A Simplified Visual Demo](https://towardsdatascience.com/feature-extraction-using-principal-component-analysis-a-simplified-visual-demo-e5592ced100a)

---

## Environment setup

In [None]:
# WARNING: make sure the correct virtual environment is selected!

# For macOS/Ubuntu
# !which pip

# For Windows
# !where pip

In [None]:
# pandas is imported below, so it is included in the install list
# !conda install matplotlib numpy pandas scikit-learn seaborn -y

In [None]:
import numpy as np

np.__version__

In [None]:
import pandas as pd

pd.__version__

In [None]:
import matplotlib
import matplotlib.pyplot as plt

matplotlib.__version__

In [None]:
import seaborn as sns

sns.__version__

In [None]:
import sklearn

from sklearn.decomposition import PCA

from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import FunctionTransformer
from sklearn.preprocessing import QuantileTransformer

from sklearn.impute import SimpleImputer

sklearn.__version__

In [None]:
# ipympl with the widget backend enables interactive matplotlib features

# !conda install ipympl -y
# !conda install -c conda-forge nodejs

## Customer Clustering

### Loading the data

[Source (Customer Clustering)](https://www.kaggle.com/dev0914sharma/customer-clustering)

In [None]:
df = pd.read_csv("./../../data/segmentation data.csv", index_col=0)
df

### Data analysis

In [None]:
df.info()

In [None]:
df.hist(figsize=(12, 4))
plt.tight_layout()

### Preparation

In [None]:
sns.scatterplot(data=df, alpha=0.3)

In [None]:
df_norm = pd.DataFrame(StandardScaler().fit_transform(df), columns=df.columns)
df_norm.hist(figsize=(12, 4))
plt.tight_layout()
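
PCA looks for directions of maximum variance, so a feature measured in large units can dominate the components unless the columns are put on a common scale first. The sketch below uses hypothetical synthetic data (the two columns and their scales are assumptions, not taken from the dataset) to show what `StandardScaler` computes: each column is centred on its mean and divided by its population standard deviation.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Hypothetical data: two features on very different scales
X = np.column_stack([
    rng.normal(35, 10, size=500),           # e.g. an age-like column
    rng.normal(120_000, 40_000, size=500),  # e.g. an income-like column
])

X_scaled = StandardScaler().fit_transform(X)

# Manual equivalent: subtract column means, divide by the population std (ddof=0)
X_manual = (X - X.mean(axis=0)) / X.std(axis=0)

print(X_scaled.mean(axis=0).round(6))  # close to 0 for every column
print(X_scaled.std(axis=0).round(6))   # close to 1 for every column
```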

In [None]:
def explained_variance_plot(pca):
    """Plot the individual and cumulative explained variance ratios of a fitted PCA."""
    components = range(pca.n_components_)
    cumulative_sum = np.cumsum(pca.explained_variance_ratio_)

    # explained_variance_ratio_ holds fractions in [0, 1], not percentages
    plt.xlabel('Principal component')
    plt.ylabel('Explained variance ratio')
    plt.xticks(components)

    plt.bar(components, pca.explained_variance_ratio_, align='center', label='Individual explained variance')
    plt.step(components, cumulative_sum, where='mid', label='Cumulative explained variance')

    plt.legend(loc='best')
    plt.tight_layout()

<img src="images/explained_variance_ratio_1.png"/>

<img src="images/explained_variance_ratio_2.png" width=300/>
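
As a rough illustration of where the plotted ratios come from: the explained variance of each component is an eigenvalue of the sample covariance matrix, and the ratio is that eigenvalue divided by the sum of all eigenvalues. The sketch below checks this on hypothetical synthetic data (the mixing matrix `A` is an assumption made only to create correlated columns).

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Hypothetical correlated data: independent columns mixed by a random matrix
A = rng.normal(size=(4, 4))
X = rng.normal(size=(300, 4)) @ A

pca = PCA().fit(X)

# Eigenvalues of the sample covariance matrix, sorted in descending order
eigvals = np.linalg.eigvalsh(np.cov(X, rowvar=False))[::-1]
ratio_manual = eigvals / eigvals.sum()

print(pca.explained_variance_ratio_.round(4))
print(ratio_manual.round(4))
```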

### N-D PCA

In [None]:
pca = PCA()
pca.fit(df_norm)
explained_variance_plot(pca)
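
One common way to read such a plot is to keep the smallest number of components whose cumulative ratio exceeds a chosen threshold. The sketch below assumes a 95% threshold and hypothetical synthetic data driven by three latent factors; it compares a manual count with scikit-learn's own selection, which is triggered by passing a float to `n_components`.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(1)

# Hypothetical data: 10 observed features driven by 3 latent factors plus small noise
latent = rng.normal(size=(500, 3))
X = latent @ rng.normal(size=(3, 10)) + 0.05 * rng.normal(size=(500, 10))

pca_full = PCA().fit(X)
cumulative = np.cumsum(pca_full.explained_variance_ratio_)

# Smallest k whose cumulative explained variance ratio exceeds the threshold
threshold = 0.95
k = int(np.searchsorted(cumulative, threshold, side="right") + 1)

# A float n_components asks scikit-learn to make the same selection
pca_95 = PCA(n_components=threshold, svd_solver="full").fit(X)

print(k, pca_95.n_components_)
```

The threshold itself is a judgment call; 95% is only a frequently used default, not a rule.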

### 2-D PCA

In [None]:
pca_2 = PCA(n_components=2)
df_pca_2 = pd.DataFrame(pca_2.fit_transform(df_norm))
sns.scatterplot(x=df_pca_2[0], y=df_pca_2[1], alpha=0.3)
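
The axes of a 2-D PCA scatter plot are linear combinations of the original features, and the weights are stored in `components_`. A minimal sketch of inspecting them, using a hypothetical three-column frame (the column names and data are assumptions, not the customer dataset):

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(2)

# Hypothetical stand-in: two correlated columns and one independent column
base = rng.normal(size=300)
demo = pd.DataFrame({
    "age": base + 0.3 * rng.normal(size=300),
    "income": base + 0.3 * rng.normal(size=300),
    "noise": rng.normal(size=300),
})

demo_norm = StandardScaler().fit_transform(demo)
pca_demo = PCA(n_components=2).fit(demo_norm)

# Rows: original features; columns: weight of each feature in PC1 and PC2
loadings = pd.DataFrame(pca_demo.components_.T, index=demo.columns, columns=["PC1", "PC2"])
print(loadings.round(2))
```

Features with large absolute weights in a column contribute most to that axis; the sign of a whole column is arbitrary.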

### 3-D PCA

In [None]:
pca_3 = PCA(n_components=3)
df_pca_3 = pd.DataFrame(pca_3.fit_transform(df_norm))
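
Principal components are nested: fitting with three components does not change the first two, it only adds a third axis. The sketch below checks this on hypothetical synthetic data, comparing absolute values because the sign of each component is arbitrary, and using `svd_solver="full"` so both fits use the same deterministic decomposition.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(5)

# Hypothetical correlated data with 5 features
X = rng.normal(size=(300, 5)) @ rng.normal(size=(5, 5))

Z2 = PCA(n_components=2, svd_solver="full").fit_transform(X)
Z3 = PCA(n_components=3, svd_solver="full").fit_transform(X)

# The 2-D projection matches the first two columns of the 3-D projection (up to sign)
same_plane = np.allclose(np.abs(Z2), np.abs(Z3[:, :2]))
print(same_plane)
```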

In [None]:
%matplotlib widget

In [None]:
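# The import is only needed to register the 3-D projection on old matplotlib (< 3.2)
from mpl_toolkits.mplot3d.axes3d import Axes3D  # noqa: F401
fig = plt.figure(figsize=(5, 5))
# Axes3D(fig) no longer attaches itself to the figure in recent matplotlib releases,
# so the 3-D axes are created through the figure instead
ax = fig.add_subplot(projection='3d')

ax.scatter(df_pca_3[0], df_pca_3[1], df_pca_3[2], alpha=0.3)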

In [None]:
%matplotlib inline

---

## Credit Card Dataset for Clustering

### Loading the data

[Source (Credit Card Dataset for Clustering)](https://www.kaggle.com/arjunbhasin2013/ccdata)

In [None]:
df = pd.read_csv("./../../data/CC GENERAL.csv", index_col=0)
df

### Data analysis

In [None]:
df.info()

In [None]:
df.isna().sum()

In [None]:
df.describe()

In [None]:
df.hist(figsize=(12, 7))
plt.tight_layout()

### Preparation

In [None]:
df_without_nan = pd.DataFrame(SimpleImputer(strategy='median').fit_transform(df), columns=df.columns)
df_without_nan.isna().sum()
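
PCA in scikit-learn does not accept missing values, which is why the gaps are filled first. A small sketch of what median imputation does, on a hypothetical 4x2 array: each `NaN` is replaced by the median of the observed values in its own column, which is less sensitive to outliers than the mean.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical data with one missing value per column and an outlier (100.0)
X = np.array([
    [1.0, 10.0],
    [2.0, np.nan],
    [np.nan, 30.0],
    [100.0, 50.0],
])

X_filled = SimpleImputer(strategy="median").fit_transform(X)

# Column medians of the observed values are 2.0 and 30.0
print(X_filled)
```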

In [None]:
quant_trans = QuantileTransformer(output_distribution='normal')
df_norm = pd.DataFrame(quant_trans.fit_transform(df_without_nan), columns=df.columns)
df_norm.hist(bins=30, figsize=(12, 7))
plt.tight_layout()
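
`QuantileTransformer(output_distribution='normal')` maps each feature through its empirical quantiles onto a normal distribution, which tames heavy right tails before PCA. The sketch below uses a hypothetical log-normal feature (an assumption standing in for skewed monetary columns) and a small helper, `skewness`, written here only for illustration.

```python
import numpy as np
from sklearn.preprocessing import QuantileTransformer

rng = np.random.default_rng(3)

# Hypothetical heavily right-skewed feature
x = rng.lognormal(mean=0.0, sigma=1.5, size=(2000, 1))

qt = QuantileTransformer(output_distribution="normal", n_quantiles=1000, random_state=0)
x_norm = qt.fit_transform(x)

def skewness(a):
    """Sample skewness: third central moment divided by std cubed."""
    a = a.ravel()
    return float(np.mean((a - a.mean()) ** 3) / a.std() ** 3)

print(round(skewness(x), 2), round(skewness(x_norm), 2))
```

The transform is monotonic, so the ordering of observations is kept, but distances between them are not; this is a trade-off to keep in mind when interpreting the resulting components.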

### N-D PCA

In [None]:
pca = PCA()
pca.fit(df_norm)
explained_variance_plot(pca)

### 2-D PCA

In [None]:
pca_2 = PCA(n_components=2)
df_pca_2 = pd.DataFrame(pca_2.fit_transform(df_norm))
df_pca_2

In [None]:
sns.scatterplot(x=df_pca_2[0], y=df_pca_2[1], alpha=0.3)
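
How much information a 2-D projection discards can be estimated by mapping the projected points back with `inverse_transform` and measuring the reconstruction error. In the sketch below, on hypothetical synthetic data, the lost share of variance and the summed explained variance ratio of the two kept components add up to one.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(4)

# Hypothetical correlated data with 6 features
X = rng.normal(size=(400, 6)) @ rng.normal(size=(6, 6))

pca_demo = PCA(n_components=2).fit(X)

# Project to 2-D and map back to the original 6-D space
X_back = pca_demo.inverse_transform(pca_demo.transform(X))

# Share of total variance lost in the round trip vs. share kept by the 2 components
lost = np.sum((X - X_back) ** 2) / np.sum((X - X.mean(axis=0)) ** 2)
kept = pca_demo.explained_variance_ratio_.sum()

print(round(float(lost), 4), round(float(kept), 4))
```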

### 3-D PCA

In [None]:
pca_3 = PCA(n_components=3)
df_pca_3 = pd.DataFrame(pca_3.fit_transform(df_norm))
df_pca_3

In [None]:
%matplotlib widget

In [None]:
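# The import is only needed to register the 3-D projection on old matplotlib (< 3.2)
from mpl_toolkits.mplot3d.axes3d import Axes3D  # noqa: F401
fig = plt.figure(figsize=(5, 5))
# Axes3D(fig) no longer attaches itself to the figure in recent matplotlib releases,
# so the 3-D axes are created through the figure instead
ax = fig.add_subplot(projection='3d')

ax.scatter(df_pca_3[0], df_pca_3[1], df_pca_3[2], alpha=0.3)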

In [None]:
%matplotlib inline

---

## Simple Clustering Data ID Gender Income Spending

### Loading the data

[Source (Simple Clustering Data ID Gender Income Spending)](https://www.kaggle.com/harrimansaragih/clustering-data-id-gender-income-spending)

In [None]:
df = pd.read_csv("./../../data/ClusteringHSS.csv", index_col=0)
df

### Data analysis

In [None]:
df.info()

In [None]:
df.isna().sum()

In [None]:
df.describe()

In [None]:
df.hist(figsize=(12, 7))
plt.tight_layout()

### Preparation

In [None]:
df_without_nan = pd.DataFrame(SimpleImputer(strategy='median').fit_transform(df), columns=df.columns)
df_without_nan.isna().sum()

In [None]:
quant_trans = QuantileTransformer(output_distribution='normal')
df_norm = pd.DataFrame(quant_trans.fit_transform(df_without_nan), columns=df.columns)
df_norm.hist(bins=30, figsize=(12, 7))
plt.tight_layout()

### N-D PCA

In [None]:
pca = PCA()
pca.fit(df_norm)
explained_variance_plot(pca)

### 2-D PCA

In [None]:
pca_2 = PCA(n_components=2)
df_pca_2 = pd.DataFrame(pca_2.fit_transform(df_norm))
df_pca_2

In [None]:
sns.scatterplot(x=df_pca_2[0], y=df_pca_2[1], alpha=0.3)

### 3-D PCA

In [None]:
pca_3 = PCA(n_components=3)
df_pca_3 = pd.DataFrame(pca_3.fit_transform(df_norm))
df_pca_3

In [None]:
%matplotlib widget

In [None]:
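# The import is only needed to register the 3-D projection on old matplotlib (< 3.2)
from mpl_toolkits.mplot3d.axes3d import Axes3D  # noqa: F401
fig = plt.figure(figsize=(5, 5))
# Axes3D(fig) no longer attaches itself to the figure in recent matplotlib releases,
# so the 3-D axes are created through the figure instead
ax = fig.add_subplot(projection='3d')

ax.scatter(df_pca_3[0], df_pca_3[1], df_pca_3[2], alpha=0.3)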

In [None]:
%matplotlib inline