
In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Multilinear Regression

Consider a larger data set to determine which factors best predict prices in a housing market.

In [None]:
# Load dataset
H = np.loadtxt('housing.data')
print(H.shape)

H_dataset = pd.DataFrame(H)
H_dataset.head()

The data contain 13 features and the median housing value for each of 506 neighborhoods.

In [None]:
A = H[:,:-1] # house features (e.g., property tax rate, per-capita crime rate)
y = H[:,-1]  # housing values in $1000s

It is important to pad this matrix with an additional column of ones to account for a possible nonzero constant offset in the regression formula. This offset corresponds to the "y-intercept" in a simple one-dimensional linear regression.

In [None]:
A = np.pad(A,[(0,0),(0,1)],mode='constant',constant_values=1)
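
As a small illustration of what the padding call does, the sketch below applies the same `np.pad` arguments to a made-up 3-by-2 matrix `F` (not part of the housing data) and compares the result with explicitly stacking a column of ones.

```python
import numpy as np

# Made-up 3x2 feature matrix, used only to show the effect of the padding
F = np.arange(6.0).reshape(3, 2)

# pad_width [(0,0),(0,1)]: no extra rows, one extra column on the right, filled with 1
padded = np.pad(F, [(0, 0), (0, 1)], mode='constant', constant_values=1)

# Equivalent construction: append a column of ones explicitly
stacked = np.column_stack([F, np.ones(len(F))])

print(padded)
print(np.array_equal(padded, stacked))
```

With this layout, the last entry of the regression coefficient vector plays the role of the intercept.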

Perform multilinear regression via SVD:

In [None]:
from numpy.linalg import svd
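U, S, Vh = svd(A, full_matrices=False)  # economy SVD: A = U @ diag(S) @ Vh

D = 1.0 / S                # inverse singular values
x = (Vh.T * D) @ U.T @ y   # x = V diag(1/S) U^T y
print(x)                   # least-squares solution (minimum norm if A is rank-deficient)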
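
To see that the SVD formula really gives the least-squares solution, here is a minimal sketch on synthetic data (the names `A_demo`, `y_demo` and `x_true` are made up for this check and are not the housing data). It compares the SVD-based solution with `np.linalg.lstsq` and `np.linalg.pinv`.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for the housing matrix: 50 samples, 3 features plus a column of ones
A_demo = np.hstack([rng.normal(size=(50, 3)), np.ones((50, 1))])
x_true = np.array([2.0, -1.0, 0.5, 3.0])           # last entry is the intercept
y_demo = A_demo @ x_true + 0.01 * rng.normal(size=50)

# Least squares via the economy SVD: x = V diag(1/s) U^T y
U_d, s_d, Vh_d = np.linalg.svd(A_demo, full_matrices=False)
x_svd = Vh_d.T @ ((U_d.T @ y_demo) / s_d)

# Reference solutions from NumPy
x_lstsq = np.linalg.lstsq(A_demo, y_demo, rcond=None)[0]
x_pinv = np.linalg.pinv(A_demo) @ y_demo

print(np.allclose(x_svd, x_lstsq), np.allclose(x_svd, x_pinv))
print(x_svd)
```

All three routes should agree up to rounding, and the recovered coefficients should be close to `x_true` because the added noise is small.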

In [None]:
plt.plot(y, 'b', label='housing value')
plt.plot(A @ x, 'r', label='best-fit price prediction')
plt.xlabel('neighborhood')
plt.ylabel('median housing value (in $1000s)')
plt.legend()
plt.show()

Sorting data by housing value:

In [None]:
sort_ind = np.argsort(y)  # indices that sort y in ascending order
ys = y[sort_ind]          # sorted values
As = A[sort_ind,:]        # rows of A reordered to match

plt.plot(ys, 'b', label='housing value')
plt.plot(As@x, 'r', label='best-fit price prediction')
plt.xlabel('neighborhood')
plt.ylabel('median housing value (in $1000s)')
plt.legend()
plt.show()

Although the housing values are not perfectly predicted, the overall trend agrees quite well. The highest-value outliers, however, are often poorly captured by a simple linear fit.
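
One common way to put a number on "agrees quite well" is the coefficient of determination $R^2$. The sketch below is an illustration only: the helper `r_squared` and the toy vector `y_obs` are made up here and are not part of the original analysis.

```python
import numpy as np

def r_squared(y_obs, y_fit):
    """Coefficient of determination: 1 - (residual sum of squares) / (total sum of squares)."""
    ss_res = np.sum((y_obs - y_fit) ** 2)
    ss_tot = np.sum((y_obs - np.mean(y_obs)) ** 2)
    return 1.0 - ss_res / ss_tot

# Toy check on made-up values
y_obs = np.array([1.0, 2.0, 3.0, 4.0])
r2_perfect = r_squared(y_obs, y_obs)                      # a perfect fit
r2_mean = r_squared(y_obs, np.full(4, y_obs.mean()))      # predicting the mean everywhere
print(r2_perfect, r2_mean)
```

A perfect fit gives 1 and a constant prediction at the mean gives 0. For the regression above, the analogous call would be `r_squared(y, A @ x)`.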

# Principal Component Analysis

## Scikit-learn

[Scikit-learn](https://scikit-learn.org/stable/) is a Python library for machine learning, that is, for making predictions from data. There are four basic tasks:

 1. Regression: predict a number from data points, given data points and corresponding numbers
 2. Classification: predict a category from data points, given data points and corresponding categories
 3. Clustering: predict a category from data points, given only data points
 4. Dimensionality reduction: make data points lower-dimensional so that we can visualize the data

Here is a [flowchart from the scikit-learn documentation](https://scikit-learn.org/stable/tutorial/machine_learning_map/index.html) of when to use each technique.

![](https://scikit-learn.org/stable/_static/ml_map.png)

In [None]:
import sklearn # scikit-learn

A good place to look for example data sets to use in machine learning tasks is the [UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/index.php).

This repository (currently) contains 559 data sets, including information on where they came from and how to use them.

On this page we'll use the [Iris](https://archive.ics.uci.edu/ml/datasets/Iris) and [Abalone](https://archive.ics.uci.edu/ml/datasets/Abalone) data sets.

The Iris data set consists of measurements of three species of Iris (a flower). The Abalone data set consists of measurements of abalone, a type of edible marine snail.

You can download the data by going to the data folder for each data set ([here is the one for Iris](https://archive.ics.uci.edu/ml/machine-learning-databases/iris/)). You will see a file with the extension `*.data`, which is a CSV file containing the data. This file does not have a header, so you need to look at the attribute information on the data set home page to get the attribute names.
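
A minimal sketch of reading such a headerless file with pandas is shown below. To keep it self-contained, it reads a few rows in the format of `iris.data` from an in-memory string instead of a downloaded file. The column names are our own choice, based on the attribute list on the data set page.

```python
import io
import pandas as pd

# A few rows in the format of iris.data (no header line), typed in for illustration
raw = io.StringIO(
    "5.1,3.5,1.4,0.2,Iris-setosa\n"
    "4.9,3.0,1.4,0.2,Iris-setosa\n"
    "7.0,3.2,4.7,1.4,Iris-versicolor\n"
    "6.3,3.3,6.0,2.5,Iris-virginica\n"
)

# Assumed column names (the file itself carries none)
columns = ["sepal_length", "sepal_width", "petal_length", "petal_width", "species"]

# header=None tells pandas that the first line is data, not column names
iris_df = pd.read_csv(raw, header=None, names=columns)
print(iris_df.dtypes)
print(iris_df["species"].unique())
```

For a downloaded file, a path to the `*.data` file would take the place of `raw`.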

Scikit-learn also has a few built-in data sets for easy loading:

In [None]:
from sklearn import datasets

In [None]:
from sklearn.datasets import load_breast_cancer
breast = load_breast_cancer()

In [None]:
print(breast)

In [None]:
breast_data = breast.data
print(breast_data.shape)

In [None]:
features = breast.feature_names
print(features.shape)

In [None]:
breast_labels = breast.target
print(breast_labels.shape)

In [None]:
breast_dataset = pd.DataFrame(breast_data)
breast_dataset.columns = features

In [None]:
breast_dataset.head()

## PCA via eigendecomposition

In [None]:
A = breast_dataset.loc[:,features].values
m,n = A.shape
print(A.shape)

#### Step 1. Standardize the data along the features

In [None]:
Z = (A - A.mean(axis = 0)) / A.std(axis=0)

print("mean: ", Z.mean())
print("std: ", Z.std())
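
Note that `Z.mean()` and `Z.std()` without an `axis` argument summarize the whole matrix at once. To check that every feature is standardized individually, the statistics can be taken column by column, as in this sketch on made-up data (`A_demo` is hypothetical, with three features on deliberately different scales).

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical data: 100 samples, 3 features with very different means and spreads
A_demo = rng.normal(loc=[10.0, -5.0, 0.0], scale=[1.0, 20.0, 0.1], size=(100, 3))

# Standardize each column: subtract its mean, divide by its standard deviation
Z_demo = (A_demo - A_demo.mean(axis=0)) / A_demo.std(axis=0)

print(Z_demo.mean(axis=0))  # per-feature means, all close to 0
print(Z_demo.std(axis=0))   # per-feature standard deviations, all close to 1
```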

#### Step 2. Calculate covariance matrix for the features

In [None]:
C = np.cov(Z, ddof = 1, rowvar = False)
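
For centered data, the covariance matrix is $C = Z^T Z/(m-1)$. The sketch below checks this on made-up data and also shows a small subtlety: since `std` defaults to `ddof=0` while the covariance here uses `ddof=1`, the result is the correlation matrix of the raw data scaled by $m/(m-1)$. All names with the `_demo` suffix are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(2)
A_demo = rng.normal(size=(200, 4))   # made-up data: 200 samples, 4 features
m_demo = A_demo.shape[0]

# Standardize as in Step 1 (np.std uses ddof=0 by default)
Z_demo = (A_demo - A_demo.mean(axis=0)) / A_demo.std(axis=0)

C_demo = np.cov(Z_demo, ddof=1, rowvar=False)     # columns are the variables
C_manual = Z_demo.T @ Z_demo / (m_demo - 1)       # explicit formula for centered data
R_demo = np.corrcoef(A_demo, rowvar=False)        # correlation matrix of the raw data

print(np.allclose(C_demo, C_manual))
print(np.allclose(C_demo, R_demo * m_demo / (m_demo - 1)))
```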

#### Step 3. Perform eigendecomposition on covariance matrix

In [None]:
Lam, X = np.linalg.eig(C)
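
Since a covariance matrix is symmetric, `np.linalg.eigh` can also be used. It is intended for symmetric (Hermitian) matrices and returns real eigenvalues in ascending order together with orthonormal eigenvectors. The sketch below compares the two routines on a made-up covariance matrix (`C_demo` is hypothetical).

```python
import numpy as np

rng = np.random.default_rng(3)
B = rng.normal(size=(50, 4))          # made-up data
C_demo = np.cov(B, rowvar=False)      # symmetric 4x4 covariance matrix

lam_h, X_h = np.linalg.eigh(C_demo)   # symmetric solver, eigenvalues ascending
lam_g, X_g = np.linalg.eig(C_demo)    # general solver, no guaranteed order

same_spectrum = np.allclose(np.sort(lam_g.real), lam_h)
eigen_equation = np.allclose(C_demo @ X_h, X_h * lam_h)   # C x_i = lambda_i x_i, column by column
orthonormal = np.allclose(X_h.T @ X_h, np.eye(4))

print(same_spectrum, eigen_equation, orthonormal)
```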

#### Step 4. Sort (descending) PCs (eigenvectors) based on their eigenvalues

In [None]:
idx = np.argsort(Lam)[::-1] # by default: ascending

sorted_eigvals = Lam[idx]
sorted_eigvecs = X[:,idx]  # sort columns

plt.plot(sorted_eigvals,'-o')
plt.ylabel('eigenvalue')
plt.show()

#### Step 5. Calculate the explained variance for each PC

In [None]:
expvar = sorted_eigvals/ np.sum(sorted_eigvals)

S = np.cumsum(expvar)
plt.plot(S,'-o')
plt.ylabel('cumulative explained variance')
plt.show()

#### Step 6. Reduce the standardized data to the desired number of PCs

In [None]:
p = 5 # desired number of PCs
reduced_data = Z @ sorted_eigvecs[:,:p]

print(sum(expvar[:p]))
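
The explained variance has a direct interpretation: if the reduced data are mapped back to the original feature space, the relative squared reconstruction error equals one minus the cumulative explained variance of the retained PCs. A sketch on made-up correlated data (all names here are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(4)

# Made-up correlated features: 100 samples, 6 features
A_demo = rng.normal(size=(100, 6)) @ rng.normal(size=(6, 6))
Z_demo = (A_demo - A_demo.mean(axis=0)) / A_demo.std(axis=0)

# Eigendecomposition of the covariance matrix, sorted in descending order
lam, V = np.linalg.eigh(np.cov(Z_demo, rowvar=False))
order = np.argsort(lam)[::-1]
lam, V = lam[order], V[:, order]

p_demo = 2
scores = Z_demo @ V[:, :p_demo]       # reduced data (100 x 2)
Z_approx = scores @ V[:, :p_demo].T   # mapped back to the 6 original features

# Relative squared error (Frobenius norm) versus retained variance fraction
rel_err = np.linalg.norm(Z_demo - Z_approx) ** 2 / np.linalg.norm(Z_demo) ** 2
kept = lam[:p_demo].sum() / lam.sum()
print(rel_err, 1 - kept)
```

The two printed numbers should agree up to rounding.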

In [None]:
# scikit-learn encodes target 0 as malignant and 1 as benign (see breast.target_names)
targets = ['Malignant', 'Benign']
colors = ['r', 'b']
markers = ['x','.']

for i in range(2):
  idx = (breast.target == i)
  plt.scatter(reduced_data[idx,0], reduced_data[idx,1], marker = markers[i],color = colors[i], label=targets[i])

plt.title("Principal Component Analysis")
plt.xlabel("PC1")
plt.ylabel("PC2")
plt.legend()
plt.show()

## PCA via SVD

In [None]:
from sklearn.preprocessing import StandardScaler

A = breast_dataset.loc[:,features].values
Z = StandardScaler().fit_transform(A)

print(Z.mean(), Z.std())

In [None]:
from sklearn.decomposition import PCA

pca = PCA().fit(Z)
S = np.cumsum(pca.explained_variance_ratio_)

plt.plot(S,'-o')
plt.ylabel('cumulative explained variance')
plt.show()

In [None]:
pca = PCA(n_components = 5)
reduced_data = pca.fit_transform(Z)
print(Z.shape, reduced_data.shape)

# scikit-learn encodes target 0 as malignant and 1 as benign (see breast.target_names)
targets = ['Malignant', 'Benign']
colors = ['r', 'b']
markers = ['x','.']

for i in range(2):
  idx = (breast.target == i)
  plt.scatter(reduced_data[idx,0], reduced_data[idx,1], marker = markers[i],color = colors[i], label=targets[i])

plt.title("Principal Component Analysis")
plt.xlabel("PC1")
plt.ylabel("PC2")
plt.legend()
plt.show()
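
The link between the eigendecomposition and SVD routes can be checked with NumPy alone. If $Z = U \Sigma V^T$ is the SVD of the centered data, then the covariance eigenvalues are $\sigma_i^2/(m-1)$ and the PC scores are $ZV = U\Sigma$. The following is a sketch on made-up data, not a description of scikit-learn's internal implementation.

```python
import numpy as np

rng = np.random.default_rng(5)

# Made-up correlated features: 80 samples, 5 features
A_demo = rng.normal(size=(80, 5)) @ rng.normal(size=(5, 5))
Z_demo = (A_demo - A_demo.mean(axis=0)) / A_demo.std(axis=0)
m_demo = Z_demo.shape[0]

# SVD route: singular values come out in descending order
U_d, s_d, Vh_d = np.linalg.svd(Z_demo, full_matrices=False)
eig_from_svd = s_d ** 2 / (m_demo - 1)

# Eigendecomposition route, sorted in descending order for comparison
eig_direct = np.sort(np.linalg.eigvalsh(np.cov(Z_demo, rowvar=False)))[::-1]

# PC scores computed two ways: U * s and Z @ V
scores_svd = U_d * s_d
scores_proj = Z_demo @ Vh_d.T

print(np.allclose(eig_from_svd, eig_direct), np.allclose(scores_svd, scores_proj))
```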

## Other Dimensionality Reduction Routines

Scikit-learn contains many other unsupervised dimensionality reduction routines. Some that are useful to know about:

- [sklearn.decomposition.PCA](http://scikit-learn.org/0.13/modules/generated/sklearn.decomposition.PCA.html):
   Principal Component Analysis
- [sklearn.decomposition.RandomizedPCA](http://scikit-learn.org/0.13/modules/generated/sklearn.decomposition.RandomizedPCA.html):
   extremely fast approximate PCA implementation based on a randomized algorithm (removed in later scikit-learn releases, where `PCA(svd_solver='randomized')` covers this use)
- [sklearn.decomposition.SparsePCA](http://scikit-learn.org/0.13/modules/generated/sklearn.decomposition.SparsePCA.html):
   PCA variant including L1 penalty for sparsity
- [sklearn.decomposition.FastICA](http://scikit-learn.org/0.13/modules/generated/sklearn.decomposition.FastICA.html):
   Independent Component Analysis
- [sklearn.decomposition.NMF](http://scikit-learn.org/0.13/modules/generated/sklearn.decomposition.NMF.html):
   non-negative matrix factorization
- [sklearn.manifold.LocallyLinearEmbedding](http://scikit-learn.org/0.13/modules/generated/sklearn.manifold.LocallyLinearEmbedding.html):
   nonlinear manifold learning technique based on local neighborhood geometry
- [sklearn.manifold.Isomap](http://scikit-learn.org/0.13/modules/generated/sklearn.manifold.Isomap.html):
   nonlinear manifold learning technique based on a sparse graph algorithm
   
Each of these has its own strengths & weaknesses, and areas of application. You can read about them on the [scikit-learn website](http://sklearn.org).

# Image Compression

In [None]:
import cv2

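Img = cv2.imread('sunflowergray.jpg', cv2.IMREAD_GRAYSCALE)
if Img is None:  # cv2.imread returns None instead of raising when the file cannot be read
  raise FileNotFoundError('sunflowergray.jpg')
print(Img.shape)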

plt.imshow(Img, cmap = 'gray', vmin = 0, vmax = 255)
plt.axis("off")
plt.show()

In [None]:
from numpy.linalg import svd
U, S, Vh = svd(Img, full_matrices=False)
print(U.shape, S.shape, Vh.shape)

In [None]:
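p = 5  # number of singular values to keep

# rank-p approximation: first p columns of U, scaled by S, times first p rows of Vh
ApproxImg = (U[:,0:p] * S[0:p]) @ Vh[0:p,:]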

plt.imshow(ApproxImg, cmap = 'gray', vmin = 0, vmax = 255)
plt.axis("off")
plt.show()
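
Two quantities help judge a rank-$p$ approximation: the Frobenius-norm error, which equals the square root of the sum of the squared discarded singular values (Eckart-Young theorem), and the storage needed for the truncated factors, $p(m+n+1)$ numbers instead of $mn$. The sketch below uses a made-up random array `Img_demo` as a stand-in for the image, so that it runs without the image file.

```python
import numpy as np

rng = np.random.default_rng(6)

# Synthetic 60x80 "image" with values in [0, 255], standing in for the real picture
Img_demo = rng.uniform(0, 255, size=(60, 80))
U_d, s_d, Vh_d = np.linalg.svd(Img_demo, full_matrices=False)

p_demo = 10
Approx = (U_d[:, :p_demo] * s_d[:p_demo]) @ Vh_d[:p_demo, :]

# Frobenius error of the truncation versus the discarded singular values
err = np.linalg.norm(Img_demo - Approx)
err_predicted = np.sqrt(np.sum(s_d[p_demo:] ** 2))

# Storage for U[:, :p], S[:p], Vh[:p, :] relative to the full image
m_d, n_d = Img_demo.shape
storage_ratio = p_demo * (m_d + n_d + 1) / (m_d * n_d)

print(err, err_predicted, storage_ratio)
```

Random noise compresses poorly. A natural image typically has rapidly decaying singular values, so a small `p` captures much more of it.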

In [None]:
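plt.plot(S, 'b')
plt.yscale('log')  # singular values span several orders of magnitude

p = np.array([5, 20, 200])
for i in p:
  plt.plot(i-1, S[i-1], 'r.')  # mark the p-th singular value (0-based index p-1)
plt.xlabel('index')
plt.ylabel('singular value')
plt.show()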

In [None]:
ImgVar = np.cumsum(S)/np.sum(S)  # cumulative fraction of the sum of singular values
plt.plot(ImgVar, 'b')
for i in p:
  plt.plot(i-1, ImgVar[i-1],'r.')
plt.show()
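
Instead of reading a suitable `p` off the plot, the cumulative curve can be searched for the smallest rank that reaches a chosen fraction. The helper `smallest_rank` below is a hypothetical illustration, checked on a toy vector of singular values.

```python
import numpy as np

def smallest_rank(s, threshold):
    """Smallest p such that the first p singular values reach `threshold` of their total sum."""
    cumulative = np.cumsum(s) / np.sum(s)
    # searchsorted returns the first 0-based index where cumulative >= threshold
    return int(np.searchsorted(cumulative, threshold) + 1)

# Toy singular values with cumulative fractions 0.4, 0.7, 0.9, 1.0
s_demo = np.array([4.0, 3.0, 2.0, 1.0])
p_half = smallest_rank(s_demo, 0.5)
p_most = smallest_rank(s_demo, 0.8)
print(p_half, p_most)  # 2 3
```

For the image above, the analogous call would be `smallest_rank(S, 0.9)`, with the threshold chosen to taste.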

# References

1. [Brad Nelson (2021), Scientific Computing with Python](https://caam37830.github.io/book/index.html)
2. [Krishna et al. (2022), Introduction to Data Science with Python](https://nustat.github.io/DataScience_Intro_python/Introduction%20to%20Python%20and%20Jupyter%20Notebooks.html)
3. [Serafina Di Gioia (2024), Python 101, SMR 3935](https://indico.ictp.it/event/10473)
