<img src="http://hilpisch.com/tpq_logo.png" alt="The Python Quants" width="35%" align="right" border="0"><br>

# Python for Finance Key Skills

&copy; Dr. Yves J. Hilpisch | The Python Quants GmbH

http://tpq.io | [training@tpq.io](mailto:training@tpq.io) | [@dyjh](http://twitter.com/dyjh)

## Unsupervised Learning & PCA

### Imports

In [None]:
!git clone https://github.com/tpq-classes/pff_key_skills.git
import sys
sys.path.append('pff_key_skills')


In [None]:
import numpy as np
import pandas as pd
import matplotlib as mpl
import matplotlib.pyplot as plt

In [None]:
plt.style.use('seaborn-v0_8')
%config InlineBackend.figure_format = 'svg'

### The Data

In [None]:
url = 'https://certificate.tpq.io/eod_fundamentals.csv'

In [None]:
# url = 'https://certificate.tpq.io/eod_dax_fundamentals.csv'

In [None]:
raw = pd.read_csv(url, index_col=0)

In [None]:
# raw

In [None]:
raw.index

In [None]:
raw.columns

In [None]:
raw.info()

In [None]:
raw.dropna(inplace=True)

In [None]:
raw.T['MarketCapitalization'].astype(int)

In [None]:
raw.T['MarketCapitalization'].astype(int) / 1_000_000

In [None]:
# raw.T['MarketCapitalization'] / 2  # does not work without dtype conversion

In [None]:
raw = raw.drop('MostRecentQuarter')

In [None]:
raw = raw.astype(float)
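
Why the dtype conversion is needed can be illustrated with a toy table. This is a minimal sketch, assuming that the CSV file mixes numeric fields with a date field such as `MostRecentQuarter` in the same columns. The tickers `AAA` and `BBB` and all values are made up for illustration.

```python
import io
import pandas as pd

# Hypothetical stand-in for the fundamentals file:
# fields as rows, tickers as columns (all values are made up).
csv_text = """field,AAA,BBB
MarketCapitalization,1000000,2500000
MostRecentQuarter,2023-09-30,2023-06-30
"""
toy = pd.read_csv(io.StringIO(csv_text), index_col=0)

# Each column mixes numbers and dates, so pandas keeps the values as strings.
print(toy.dtypes)

try:
    toy.T['MarketCapitalization'] / 2  # arithmetic on strings fails
except TypeError as err:
    print('TypeError:', err)

# Explicit conversion works for a purely numeric field ...
cap = toy.T['MarketCapitalization'].astype(int)

# ... and after dropping the date row, the whole table can be cast to float.
toy_num = toy.drop('MostRecentQuarter').astype(float)
print(toy_num.dtypes)
```

Dropping the non-numeric row first is what makes the blanket `astype(float)` call possible.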

In [None]:
raw.info()

In [None]:
raw.T.describe().round(2)

In [None]:
raw.T['MarketCapitalizationMln'].plot.bar();

In [None]:
raw.T['MarketCapitalizationMln'].sort_values().plot.bar();

### Unsupervised Learning

About unsupervised learning (from https://perplexity.ai):

> Unsupervised learning is a branch of machine learning that aims to analyze data and find hidden patterns or structures within it without the use of labeled outputs. In other words, the algorithms are left to their own devices to discover and present interesting structures in the data. The main goal of unsupervised learning is to identify patterns within the training data sets and categorize input objects based on these patterns. Common algorithms used in unsupervised learning include clustering, anomaly detection, and dimensionality reduction. Some typical goals of unsupervised learning include:
> * Clustering: Grouping similar data points together based on their characteristics, such as customer segmentation or grouping images of animals based on their features.
> * Anomaly Detection: Identifying outliers or anomalies in the data that do not conform to expected patterns, which is useful for fraud detection or quality control.
> * Dimensionality Reduction: Reducing the number of input variables in the dataset while retaining important information, which can aid in visualization and computational efficiency.

#### Original Data

In [None]:
cols = ['MarketCapitalization', 'QuarterlyEarningsGrowthYOY']

In [None]:
data = raw.T[cols]

In [None]:
data

In [None]:
from sklearn.cluster import KMeans

In [None]:
n = 3

In [None]:
model = KMeans(n_clusters=n, n_init='auto')  # step 1: model instantiation

In [None]:
model.fit(data)  # step 2: model fitting

In [None]:
model.predict(data)  # step 3: prediction

In [None]:
p = model.predict(data)
p
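
Because k-means relies on Euclidean distances, a feature measured in large units (such as a market capitalization) can dominate one measured in fractions (such as a growth rate). The following sketch illustrates this on synthetic data. The column names, the group structure, and all parameter values are assumptions made for illustration, not properties of the data set above.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

rng = np.random.default_rng(100)
m = 30  # observations per synthetic group

# Two synthetic groups that differ only in the small-scale feature.
truth = np.repeat([0, 1], m)
toy = pd.DataFrame({
    'cap': rng.normal(5e10, 1e10, 2 * m),  # large scale, unrelated to the groups
    'growth': np.where(truth == 0, -0.5, 0.5) + rng.normal(0, 0.05, 2 * m),
})

def cluster(frame):
    # A fixed random_state makes the labels reproducible across runs.
    model = KMeans(n_clusters=2, n_init=10, random_state=0)
    return model.fit_predict(frame)

labels_raw = cluster(toy)                  # distances dominated by 'cap'
toy_std = (toy - toy.mean()) / toy.std()   # z-score standardization
labels_std = cluster(toy_std)              # both features on a comparable scale

# The adjusted Rand index measures agreement with the true groups
# (1.0 = identical grouping, about 0.0 = no better than chance).
ari_raw = adjusted_rand_score(truth, labels_raw)
ari_std = adjusted_rand_score(truth, labels_std)
print(f'ARI raw: {ari_raw:.2f} | ARI standardized: {ari_std:.2f}')
```

On the raw data the clusters essentially follow `cap` alone. After standardization the `growth` feature can contribute to the distances as well.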

#### Normalized Data

About Gaussian normalization (from https://perplexity.ai):

> Gaussian normalization, also known as feature scaling or feature normalization, is a data preprocessing technique used to bring all the features of a dataset onto a similar scale. This is done by transforming the data so that it has a mean of 0 and a standard deviation of 1, or by rescaling the data to a range between 0 and 1. The main purpose of Gaussian normalization is to ensure that all features contribute equally to the analysis and modeling processes, regardless of their original scale or distribution. It is particularly useful when the dataset contains features with different units or scales, as it helps to prevent certain features from dominating the analysis simply because of their larger scale. Gaussian normalization is especially important for algorithms that are sensitive to the scale of the input features, such as support vector machines (SVM) with Gaussian kernels, and it can also improve the convergence and performance of various machine learning algorithms. Overall, Gaussian normalization helps to make the data more interpretable, improves the performance of machine learning models, and ensures that the features are treated fairly during the analysis process.

In [None]:
data_ = (data - data.mean()) / data.std()

In [None]:
data_.describe().round(3)
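
The quoted description covers two distinct techniques: z-score standardization (mean 0, standard deviation 1), which is what the cell above applies, and min-max scaling (range 0 to 1). The sketch below contrasts the two on made-up values. It also shows a detail worth knowing: pandas `std()` defaults to the sample standard deviation (`ddof=1`), whereas scikit-learn's `StandardScaler` uses the population standard deviation (`ddof=0`), so the two results differ by a constant factor.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Made-up values on two very different scales.
toy = pd.DataFrame({'a': [1.0, 2.0, 3.0, 4.0],
                    'b': [10.0, 20.0, 30.0, 60.0]})

z_pandas = (toy - toy.mean()) / toy.std()            # sample std (ddof=1)
z_population = (toy - toy.mean()) / toy.std(ddof=0)  # population std (ddof=0)
z_sklearn = StandardScaler().fit_transform(toy)      # also uses ddof=0

minmax = (toy - toy.min()) / (toy.max() - toy.min())  # rescaled to [0, 1]

print(np.allclose(z_population, z_sklearn))  # True
print(np.allclose(z_pandas, z_sklearn))      # False (constant factor)
print(minmax)
```

For clustering, the constant factor between the two z-score variants is the same for every column given a fixed number of rows, so it does not change the relative weighting of the features.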

In [None]:
model = KMeans(n_clusters=n, n_init='auto')

In [None]:
model.fit(data_)

In [None]:
p_ = model.predict(data_)
p_

In [None]:
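p == p_  # elementwise comparison; cluster labels are arbitrary, so mismatches do not imply different groupings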
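
Cluster labels are arbitrary integers, so two runs can produce the same grouping under different label names. An elementwise comparison then understates the agreement. A permutation-invariant measure such as the adjusted Rand index is one possible alternative, sketched here on hand-made label arrays.

```python
import numpy as np
from sklearn.metrics import adjusted_rand_score

# Same grouping of six observations, but with the label names permuted.
a = np.array([0, 0, 1, 1, 2, 2])
b = np.array([2, 2, 0, 0, 1, 1])

print((a == b).mean())            # 0.0: no label matches elementwise
print(adjusted_rand_score(a, b))  # 1.0: the groupings are identical
```

Applied to the notebook, the analogous call would be `adjusted_rand_score(p, p_)`.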

### Visualization

#### 2D Visualization

In [None]:
data.plot.scatter(x='MarketCapitalization',
                  y='QuarterlyEarningsGrowthYOY',
                  c=p,  # clustering results based on original data 
                  cmap='brg');

In [None]:
data.plot.scatter(x='MarketCapitalization',
                  y='QuarterlyEarningsGrowthYOY',
                  c=p_,  # clustering results based on normalized data
                  cmap='brg');

### Dimensionality Reduction

About PCA (from https://perplexity.ai):

> Principal Component Analysis (PCA) is a statistical technique used to simplify and reduce the dimensionality of large datasets while preserving most of the information present in the original data. It achieves this by transforming a large set of variables into a smaller set, known as principal components, that still captures the essential information in the original dataset. The main goal of PCA is to find a reduced number of features that represent the original dataset in a compressed way, capturing a significant portion of its variance. This transformation is defined in such a way that the first principal component has the largest possible variance, and each succeeding component has the highest possible variance under the constraint that it is orthogonal to the preceding components. In simpler terms, PCA helps to identify the most important directions in the high-dimensional space where the data actually resides, making it easier to interpret the data and visualize multidimensional data. It is commonly used for dimensionality reduction and exploratory data analysis, as well as for making predictive models.

In [None]:
from sklearn.preprocessing import StandardScaler

In [None]:
scaler = StandardScaler()

In [None]:
f_norm = scaler.fit_transform(raw.T)

In [None]:
f_norm.round(2)[:3]

In [None]:
model = KMeans(n_clusters=n, n_init='auto')

In [None]:
model.fit(f_norm)

In [None]:
p__ = model.predict(f_norm)
p__

In [None]:
from sklearn.decomposition import PCA

In [None]:
pca = PCA(n_components=2)

In [None]:
f_pca = pca.fit_transform(f_norm)

In [None]:
f_pca.shape
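
What `PCA` computes can be made concrete with a small NumPy sketch. The principal components are the eigenvectors of the covariance matrix, ordered by decreasing eigenvalue, and each one captures the largest remaining variance while being orthogonal to the preceding ones. The data below is synthetic and only stands in for a standardized feature matrix. The results agree with scikit-learn up to the sign of each component, which is arbitrary.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(42)

# Synthetic stand-in for standardized features: 50 observations, 4 correlated columns.
X = rng.normal(size=(50, 4)) @ rng.normal(size=(4, 4))
X = (X - X.mean(axis=0)) / X.std(axis=0)

# Eigen-decomposition of the sample covariance matrix.
cov = np.cov(X, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)  # eigh returns ascending eigenvalues
order = np.argsort(eigvals)[::-1]       # reorder to descending variance
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Project the data onto the two directions of largest variance.
scores_manual = X @ eigvecs[:, :2]

pca = PCA(n_components=2)
scores_sklearn = pca.fit_transform(X)

# Same projection up to sign; eigenvalues equal the explained variances.
print(np.allclose(np.abs(scores_manual), np.abs(scores_sklearn)))
print(np.allclose(eigvals[:2], pca.explained_variance_))
```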

In [None]:
plt.scatter(f_pca[:, 0], f_pca[:, 1], c=p__, cmap='brg')
plt.xlabel('principal component 1')
plt.ylabel('principal component 2');
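
How faithful such a two-dimensional picture is depends on how much of the total variance the first two components capture. The attribute `explained_variance_ratio_` of a fitted `PCA` object reports this. The sketch below uses synthetic data with two underlying factors plus noise; the shapes and parameter values are assumptions for illustration only.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(7)

# Synthetic data: 6 observed features driven by 2 latent factors plus small noise.
latent = rng.normal(size=(40, 2))
loadings = rng.normal(size=(2, 6))
X = latent @ loadings + 0.1 * rng.normal(size=(40, 6))

X_norm = StandardScaler().fit_transform(X)
pca = PCA(n_components=2).fit(X_norm)

ratio = pca.explained_variance_ratio_  # share of variance per component
cumulative = ratio.sum()               # share captured by the 2D projection
print(ratio.round(3), round(cumulative, 3))
```

In the notebook, the same check would be `pca.explained_variance_ratio_` on the fitted `pca` object. A low cumulative share would suggest that clusters which overlap in the plot may still be well separated in the full feature space.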
