# Principal component analysis (PCA)

---

**Sources:**

[Principal Component Analysis and k-means Clustering to Visualize a High Dimensional Dataset](https://medium.com/@dmitriy.kavyazin/principal-component-analysis-and-k-means-clustering-to-visualize-a-high-dimensional-dataset-577b2a7a5fe2)

[PCA using Python (scikit-learn)](https://towardsdatascience.com/pca-using-python-scikit-learn-e653f8989e60)

[Everything you did and didn't know about PCA](http://alexhwilliams.info/itsneuronalblog/2016/03/27/pca/)

[Feature Extraction using Principal Component Analysis — A Simplified Visual Demo](https://towardsdatascience.com/feature-extraction-using-principal-component-analysis-a-simplified-visual-demo-e5592ced100a)

---

## Environment setup

In [None]:
# NOTE: make sure the correct virtual environment is selected!

# For macOS/Ubuntu
# !which pip

# For Windows
# !where pip

In [None]:
# !conda install matplotlib numpy pandas scikit-learn seaborn -y
# !conda install -c conda-forge missingno -y

In [None]:
import numpy as np

np.__version__

In [None]:
import pandas as pd

pd.__version__

In [None]:
import matplotlib
import matplotlib.pyplot as plt

matplotlib.__version__

In [None]:
import seaborn as sns

sns.__version__

In [None]:
import sklearn

from sklearn.decomposition import PCA

from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import FunctionTransformer
from sklearn.preprocessing import QuantileTransformer
from sklearn.preprocessing import OrdinalEncoder

from sklearn.impute import SimpleImputer

sklearn.__version__

In [None]:
import missingno as msno

msno.__version__

In [None]:
# ipympl together with `%matplotlib widget` enables interactive matplotlib figures

# !conda install ipympl -y
# !conda install -c conda-forge nodejs

## Description

TODO

<img src="images/explained_variance_ratio_1.png"/>

<img src="images/explained_variance_ratio_2.png" width=300/>
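
The explained variance ratio shown in the figures above is the share of total variance captured by each principal component. The following is a minimal sketch on synthetic data (the data and the names `X`, `manual_ratio`, and `pca_demo` are assumptions for illustration only). It computes the ratio by hand from the eigenvalues of the sample covariance matrix and compares it with `PCA.explained_variance_ratio_`.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic data (an assumption for illustration): 200 samples, 3 features,
# two of which are driven by the same latent variable
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 1))
X = np.hstack([
    latent + 0.1 * rng.normal(size=(200, 1)),
    2.0 * latent + 0.1 * rng.normal(size=(200, 1)),
    rng.normal(size=(200, 1)),
])

# Eigenvalues of the sample covariance matrix, sorted in descending order
eigenvalues = np.linalg.eigvalsh(np.cov(X, rowvar=False))[::-1]
manual_ratio = eigenvalues / eigenvalues.sum()

pca_demo = PCA().fit(X)

# Both approaches should give the same share of variance per component
print(np.allclose(manual_ratio, pca_demo.explained_variance_ratio_))
```

When all components are kept, the ratios sum to 1, which is why the cumulative curve in the plots ends at 1.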

## Customer Clustering

### Loading the data

[Source (Customer Clustering)](https://www.kaggle.com/dev0914sharma/customer-clustering)

In [None]:
df_1 = pd.read_csv("./../../data/segmentation data.csv", index_col=0)
df_1

### Data analysis

#### Data types

In [None]:
df_1.info()

#### Missing values

In [None]:
df_1.isna().sum()

#### Data distribution

In [None]:
df_1.hist(figsize=(12, 4))
plt.tight_layout()

### Preparation

#### Scaling

In [None]:
df_1_norm = pd.DataFrame(StandardScaler().fit_transform(df_1), columns=df_1.columns)
df_1_norm.hist(figsize=(12, 4))
plt.tight_layout()
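
Scaling matters for PCA because the components follow the directions of largest variance, so a feature measured in large units can dominate the result. The sketch below uses two hypothetical, independent features on very different scales (the names `age` and `income` and their distributions are assumptions, not taken from the dataset). It shows that `StandardScaler` is plain z-scoring and compares the explained variance ratio before and after scaling.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Hypothetical features on very different scales
age = rng.normal(40, 10, size=500)
income = rng.normal(100_000, 30_000, size=500)
X = np.column_stack([age, income])

X_scaled = StandardScaler().fit_transform(X)
# StandardScaler subtracts the column mean and divides by the population std (ddof=0)
manual = (X - X.mean(axis=0)) / X.std(axis=0)
print(np.allclose(X_scaled, manual))

raw_ratio = PCA().fit(X).explained_variance_ratio_
scaled_ratio = PCA().fit(X_scaled).explained_variance_ratio_
print(raw_ratio)     # the large-scale feature takes almost all of the variance
print(scaled_ratio)  # after scaling, both features contribute comparably
```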

#### Helper functions

In [None]:
def explained_variance_plot(pca):
    """Plot individual and cumulative explained variance ratios of a fitted PCA."""
    features = range(pca.n_components_)
    cumulative_sum = np.cumsum(pca.explained_variance_ratio_)

    plt.xlabel('Principal component')
    plt.ylabel('Explained variance ratio')
    plt.xticks(features)

    plt.bar(features, pca.explained_variance_ratio_, align='center', label='Individual explained variance')
    plt.step(features, cumulative_sum, where='mid', label='Cumulative explained variance')

    plt.legend(loc='best')
    plt.tight_layout()

### N-D PCA

In [None]:
pca = PCA()
pca.fit(df_1_norm)
explained_variance_plot(pca)
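
The cumulative curve in this plot can also be read off numerically. As a sketch on synthetic data (the 0.95 threshold and all variable names here are assumptions for illustration), the snippet below finds the smallest number of components whose cumulative explained variance reaches the threshold and compares it with scikit-learn's shortcut of passing a float to `n_components`.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Synthetic data: 6 features driven by 2 latent factors plus small noise
latent = rng.normal(size=(300, 2))
mixing = rng.normal(size=(2, 6))
X = latent @ mixing + 0.05 * rng.normal(size=(300, 6))

pca_full = PCA().fit(X)
cumulative = np.cumsum(pca_full.explained_variance_ratio_)

threshold = 0.95
# Index of the first component at which the cumulative ratio reaches the threshold, plus one
n_needed = int(np.searchsorted(cumulative, threshold) + 1)

# A float n_components in (0, 1) asks PCA to keep enough components for that share of variance
pca_auto = PCA(n_components=threshold).fit(X)
print(n_needed, pca_auto.n_components_)
```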

### 2-D PCA

In [None]:
pca_2 = PCA(n_components=2)
df_1_pca_2 = pd.DataFrame(pca_2.fit_transform(df_1_norm))
sns.scatterplot(x=df_1_pca_2[0], y=df_1_pca_2[1], alpha=0.3)
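
The axes of such a scatter plot are linear combinations of the original features. To see which features drive each axis, one option is to inspect `components_` of the fitted PCA. The sketch below uses a synthetic frame with placeholder column names (`feature_a`, `feature_b`, `feature_c` are assumptions, not columns of the dataset).

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
base = rng.normal(size=300)
# Placeholder features: a and b are strongly correlated, c is independent
df_demo = pd.DataFrame({
    "feature_a": base + 0.1 * rng.normal(size=300),
    "feature_b": base + 0.1 * rng.normal(size=300),
    "feature_c": rng.normal(size=300),
})

pca_demo = PCA(n_components=2).fit(df_demo)

# Each row of components_ is a unit-length direction in the original feature space
weights = pd.DataFrame(
    pca_demo.components_,
    columns=df_demo.columns,
    index=["PC1", "PC2"],
)
print(weights.round(2))
```

In this synthetic case the first component is dominated by the two correlated features. The sign of a component is arbitrary, so only the relative magnitudes and relative signs within a row are meaningful.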

### 3-D PCA

In [None]:
pca_3 = PCA(n_components=3)
df_1_pca_3 = pd.DataFrame(pca_3.fit_transform(df_1_norm))

In [None]:
%matplotlib widget

In [None]:
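from mpl_toolkits.mplot3d import Axes3D  # noqa: F401 (registers the '3d' projection on older Matplotlib)

fig = plt.figure(figsize=(5, 5))
# Create the 3D axes through the figure so that they are attached to it
ax = fig.add_subplot(111, projection='3d')
ax.view_init(elev=45, azim=-55)

ax.scatter(df_1_pca_3[0], df_1_pca_3[1], df_1_pca_3[2], alpha=0.3)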

In [None]:
%matplotlib inline

---

## Credit Card Dataset for Clustering

### Loading the data

[Source (Credit Card Dataset for Clustering)](https://www.kaggle.com/arjunbhasin2013/ccdata)

In [None]:
df_2 = pd.read_csv("./../../data/CC GENERAL.csv", index_col=0)
df_2

### Data analysis

#### Data types

In [None]:
df_2.info()

#### Missing values

In [None]:
df_2.isna().sum()

#### Data distribution

In [None]:
df_2.hist(figsize=(12, 7))
plt.tight_layout()

### Preparation

#### Missing values

In [None]:
df_2_without_na = pd.DataFrame(SimpleImputer(strategy='median').fit_transform(df_2), columns=df_2.columns)
df_2_without_na.isna().sum()

#### Scaling

In [None]:
norm_trans = QuantileTransformer(output_distribution='normal')
df_2_norm = pd.DataFrame(norm_trans.fit_transform(df_2_without_na), columns=df_2.columns)
df_2_norm.hist(bins=30, figsize=(12, 7))
plt.tight_layout()
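
`QuantileTransformer(output_distribution='normal')` maps each feature to an approximately normal distribution through its empirical quantiles, which helps with heavily skewed columns. The sketch below illustrates this on a synthetic log-normal feature (the data, the helper `skewness`, and the parameter values are assumptions for illustration) and compares the skewness after `StandardScaler` and after `QuantileTransformer`.

```python
import numpy as np
from sklearn.preprocessing import QuantileTransformer, StandardScaler


def skewness(values):
    """Sample skewness: third central moment divided by the cubed standard deviation."""
    centered = values - values.mean()
    return (centered ** 3).mean() / centered.std() ** 3


rng = np.random.default_rng(0)
# Synthetic right-skewed feature standing in for a monetary column
x = rng.lognormal(mean=0.0, sigma=1.0, size=(2000, 1))

# StandardScaler only shifts and rescales, so the shape of the distribution is unchanged
x_std = StandardScaler().fit_transform(x)

# QuantileTransformer reshapes the distribution itself
qt = QuantileTransformer(output_distribution="normal", n_quantiles=1000, random_state=0)
x_qt = qt.fit_transform(x)

skew_std = skewness(x_std.ravel())
skew_qt = skewness(x_qt.ravel())
print(skew_std, skew_qt)
```

The transform is non-linear, so distances between points are not preserved. Whether that is acceptable before PCA depends on the analysis.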

### N-D PCA

In [None]:
pca = PCA()
pca.fit(df_2_norm)
explained_variance_plot(pca)

### 2-D PCA

In [None]:
pca_2 = PCA(n_components=2)
df_2_pca_2 = pd.DataFrame(pca_2.fit_transform(df_2_norm))
df_2_pca_2

In [None]:
sns.scatterplot(x=df_2_pca_2[0], y=df_2_pca_2[1], alpha=0.3)

### 3-D PCA

In [None]:
pca_3 = PCA(n_components=3)
df_2_pca_3 = pd.DataFrame(pca_3.fit_transform(df_2_norm))
df_2_pca_3

In [None]:
%matplotlib widget

In [None]:
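from mpl_toolkits.mplot3d import Axes3D  # noqa: F401 (registers the '3d' projection on older Matplotlib)

fig = plt.figure(figsize=(5, 5))
# Create the 3D axes through the figure so that they are attached to it
ax = fig.add_subplot(111, projection='3d')
ax.view_init(elev=-25, azim=-120)

ax.scatter(df_2_pca_3[0], df_2_pca_3[1], df_2_pca_3[2], alpha=0.3)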

In [None]:
%matplotlib inline

---

## Simple Clustering Data ID Gender Income Spending

### Loading the data

[Source (Simple Clustering Data ID Gender Income Spending)](https://www.kaggle.com/harrimansaragih/clustering-data-id-gender-income-spending)

In [None]:
df_3 = pd.read_csv("./../../data/ClusteringHSS.csv", index_col=0)
df_3

### Data analysis

#### Data types

In [None]:
df_3.info()

#### Missing values

In [None]:
df_3.isna().sum()

In [None]:
# percentage of missing values in each column
percent_missing = round(df_3.isnull().mean()*100, 2)
percent_missing.sort_values(ascending=False)

In [None]:
msno.matrix(df_3, figsize=(15, 3))

#### Data distribution

In [None]:
df_3.Gender_Code.value_counts()

In [None]:
df_3.Region.value_counts()

In [None]:
df_3.hist(figsize=(10, 3))
plt.tight_layout()

### Preparation

#### Missing values

In [None]:
# drop all ROWS that contain missing values
# and store the result in a new variable
df_3_without_na = df_3.dropna(axis='rows').reset_index(drop=True)
df_3_without_na.isna().sum()

#### Scaling

In [None]:
# norm_trans = QuantileTransformer(output_distribution='normal')
norm_trans = StandardScaler()
df_3_norm = pd.DataFrame(norm_trans.fit_transform(df_3_without_na[['Income', 'Spending']]), columns=['Income', 'Spending'])
df_3_norm.hist(bins=30, figsize=(10, 3))
plt.tight_layout()

#### Categorical features

In [None]:
enc = OrdinalEncoder()
df_3_without_na[['is_Male', 'is_Urban']] = enc.fit_transform(df_3_without_na[['Gender_Code', 'Region']])
df_3_without_na[['Gender_Code', 'is_Male', 'Region', 'is_Urban']]

In [None]:
df_3_norm = df_3_norm.join(df_3_without_na[['is_Male', 'is_Urban']])
df_3_norm
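
Column names such as `is_Male` and `is_Urban` only match the encoded values if the encoder assigns code 1 to the intended category. By default `OrdinalEncoder` sorts the categories of each column and numbers them from 0, so it is worth checking `categories_` after fitting. The sketch below uses hypothetical category values (the real values in the dataset may differ).

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical category values; the real ones in the dataset may differ
df_demo = pd.DataFrame({
    "Gender_Code": ["Male", "Female", "Female", "Male"],
    "Region": ["Urban", "Rural", "Urban", "Rural"],
})

enc_demo = OrdinalEncoder()
codes = enc_demo.fit_transform(df_demo)

# categories_ lists, per column, the values in the order of their integer codes
for col, cats in zip(df_demo.columns, enc_demo.categories_):
    print(col, list(enumerate(cats)))

print(codes)
```

With these example values, "Female" and "Rural" get code 0 and "Male" and "Urban" get code 1. With different spellings or more than two categories the mapping changes, and the ordinal codes would imply an order that PCA treats as numeric distance.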

### N-D PCA

In [None]:
pca = PCA()
pca.fit(df_3_norm)
explained_variance_plot(pca)

### 2-D PCA

In [None]:
pca_2 = PCA(n_components=2)
df_3_pca_2 = pd.DataFrame(pca_2.fit_transform(df_3_norm))
df_3_pca_2

In [None]:
sns.scatterplot(x=df_3_pca_2[0], y=df_3_pca_2[1], alpha=0.3)

### 3-D PCA

In [None]:
pca_3 = PCA(n_components=3)
df_3_pca_3 = pd.DataFrame(pca_3.fit_transform(df_3_norm))
df_3_pca_3

In [None]:
%matplotlib widget

In [None]:
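from mpl_toolkits.mplot3d import Axes3D  # noqa: F401 (registers the '3d' projection on older Matplotlib)

fig = plt.figure(figsize=(5, 5))
# Create the 3D axes through the figure so that they are attached to it
ax = fig.add_subplot(111, projection='3d')
ax.view_init(elev=30, azim=-100)

ax.scatter(df_3_pca_3[0], df_3_pca_3[1], df_3_pca_3[2], alpha=0.3)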

In [None]:
%matplotlib inline

## Customer Segmentation

### Loading the data

[Source (custDatasets)](https://www.kaggle.com/gangliu/custdatasets)

In [None]:
df_4 = pd.read_csv("./../../data/Cust_Segmentation.csv", index_col=0)
df_4

### Data analysis

#### Data types

In [None]:
df_4.info()

#### Missing values

In [None]:
df_4.isna().sum()

In [None]:
# percentage of missing values in each column
percent_missing = round(df_4.isnull().mean()*100, 2)
percent_missing.sort_values(ascending=False)

In [None]:
msno.matrix(df_4, figsize=(15, 3))

#### Data distribution

In [None]:
df_4.Address.value_counts()

In [None]:
df_4.Defaulted.value_counts()

In [None]:
df_4.hist(figsize=(15, 5))
plt.tight_layout()

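### Preparation

#### Missing values

In [None]:
df_4_without_na = df_4.copy()
# Treat a missing `Defaulted` flag as 0; assign the result back rather than using a chained inplace fillna
df_4_without_na['Defaulted'] = df_4_without_na['Defaulted'].fillna(0)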
df_4_without_na['Defaulted'].isna().sum()

#### Scaling

In [None]:
num_cols = df_4_without_na.select_dtypes(include=np.number).columns.tolist()
norm_trans = QuantileTransformer(output_distribution='normal', n_quantiles=100)
# norm_trans = StandardScaler()
df_4_norm = pd.DataFrame(norm_trans.fit_transform(df_4_without_na[num_cols]), columns=num_cols)
df_4_norm.hist(bins=30, figsize=(15, 5))
plt.tight_layout()

#### Categorical features

In [None]:
#TODO

### N-D PCA

In [None]:
pca = PCA()
pca.fit(df_4_norm)
explained_variance_plot(pca)

### 2-D PCA

In [None]:
pca_2 = PCA(n_components=2)
df_4_pca_2 = pd.DataFrame(pca_2.fit_transform(df_4_norm))
df_4_pca_2

In [None]:
sns.scatterplot(x=df_4_pca_2[0], y=df_4_pca_2[1], alpha=0.3)

### 3-D PCA

In [None]:
pca_3 = PCA(n_components=3)
df_4_pca_3 = pd.DataFrame(pca_3.fit_transform(df_4_norm))
df_4_pca_3

In [None]:
%matplotlib widget

In [None]:
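from mpl_toolkits.mplot3d import Axes3D  # noqa: F401 (registers the '3d' projection on older Matplotlib)

fig = plt.figure(figsize=(5, 5))
# Create the 3D axes through the figure so that they are attached to it
ax = fig.add_subplot(111, projection='3d')
ax.view_init(elev=45, azim=-155)

ax.scatter(df_4_pca_3[0], df_4_pca_3[1], df_4_pca_3[2], alpha=0.3)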

In [None]:
%matplotlib inline

## Weather Stations in USA

### Loading the data

[Source (Weather Stations in USA)](https://www.kaggle.com/akashsdas/weather-stations-in-usa/version/1)

In [None]:
# NOTE: the name df_4 is reused here and overwrites the Customer Segmentation data
df_4 = pd.read_csv("./../../data/weather-stations20140101-20141231.csv")
df_4

### Data analysis

#### Data types

In [None]:
df_4.info()

#### Missing values

In [None]:
df_4.isna().sum()

In [None]:
# percentage of missing values in each column
percent_missing = round(df_4.isnull().mean()*100, 2)
percent_missing.sort_values(ascending=False)

In [None]:
msno.matrix(df_4, figsize=(15, 3))

#### Data distribution

In [None]:
# number of distinct values in each non-numeric column
df_4.select_dtypes(exclude=np.number).nunique()

In [None]:
df_4.hist(figsize=(15, 5))
plt.tight_layout()

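### Preparation

#### Missing values

In [None]:
# Keep the numeric columns that have at least one value and fill the gaps with the column median
# (median imputation is an assumption here, mirroring the Credit Card section)
df_4_num = df_4.select_dtypes(include=np.number).dropna(axis='columns', how='all')
df_4_without_na = df_4_num.fillna(df_4_num.median())
df_4_without_na.isna().sum()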

#### Scaling

In [None]:
num_cols = df_4_without_na.select_dtypes(include=np.number).columns.tolist()
norm_trans = QuantileTransformer(output_distribution='normal', n_quantiles=100)
# norm_trans = StandardScaler()
df_4_norm = pd.DataFrame(norm_trans.fit_transform(df_4_without_na[num_cols]), columns=num_cols)
df_4_norm.hist(bins=30, figsize=(15, 5))
plt.tight_layout()

#### Categorical features

In [None]:
#TODO

### N-D PCA

In [None]:
pca = PCA()
pca.fit(df_4_norm)
explained_variance_plot(pca)

### 2-D PCA

In [None]:
pca_2 = PCA(n_components=2)
df_4_pca_2 = pd.DataFrame(pca_2.fit_transform(df_4_norm))
df_4_pca_2

In [None]:
sns.scatterplot(x=df_4_pca_2[0], y=df_4_pca_2[1], alpha=0.3)

### 3-D PCA

In [None]:
pca_3 = PCA(n_components=3)
df_4_pca_3 = pd.DataFrame(pca_3.fit_transform(df_4_norm))
df_4_pca_3

In [None]:
%matplotlib widget

In [None]:
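from mpl_toolkits.mplot3d import Axes3D  # noqa: F401 (registers the '3d' projection on older Matplotlib)

fig = plt.figure(figsize=(5, 5))
# Create the 3D axes through the figure so that they are attached to it
ax = fig.add_subplot(111, projection='3d')
ax.view_init(elev=45, azim=-155)

ax.scatter(df_4_pca_3[0], df_4_pca_3[1], df_4_pca_3[2], alpha=0.3)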

In [None]:
%matplotlib inline