# Supplements

In [310]:
import numpy as np
import pandas as pd

from sklearn.preprocessing import Normalizer
from sklearn.decomposition import TruncatedSVD, PCA

## TF-IDF Transformation

TF-IDF transformation is a technique for vectorizing text data. TF stands for **term frequency** and IDF stands for **inverse document frequency**.

Here, we use the **inverse of each term's mean count** over the training set as our IDF:

```python
IDF = X.shape[0] / X.sum(axis=0)
```
and the **L1-normalized counts** of each row as our TF:

```python
TF = X / X.sum(axis=1).reshape(-1,1)
```
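As a small illustration of the two formulas above, the sketch below applies them to a made-up count matrix. The matrix `X_toy` and its values are hypothetical and only serve to make the arithmetic visible; note that a term that never occurs would cause a division by zero in the IDF.

```python
import numpy as np

# Hypothetical toy count matrix: 3 documents (rows) x 4 terms (columns).
X_toy = np.array([[2, 0, 1, 1],
                  [0, 3, 1, 0],
                  [1, 1, 2, 0]], dtype=float)

# IDF: number of documents divided by each term's total count.
idf = X_toy.shape[0] / X_toy.sum(axis=0)

# TF: each row divided by its own sum, so every row sums to 1.
tf = X_toy / X_toy.sum(axis=1).reshape(-1, 1)

# TF-IDF: element-wise product, with idf broadcast across rows.
tfidf = tf * idf
print(tfidf)
```

The rarest term (the last column, which appears once) receives the largest IDF weight, which is the intended effect of the weighting.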

In [None]:
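import scipy.sparse

class tfidfTransformer():
    """TF-IDF with L1-normalized rows as TF and inverse mean counts as IDF."""

    def __init__(self):
        self.idf = None
        self.fitted = False

    def fit(self, X):
        # IDF per term: number of documents divided by the term's total count.
        # A term that never occurs in X leads to a division by zero here.
        self.idf = X.shape[0] / X.sum(axis=0)
        self.fitted = True
        return self

    def transform(self, X):
        if not self.fitted:
            raise RuntimeError('Transformer was not fitted on any data')
        if scipy.sparse.issparse(X):
            # Sparse path: element-wise multiply keeps the result sparse.
            tf = X.multiply(1 / X.sum(axis=1))
            return tf.multiply(self.idf)
        else:
            # Dense path: divide each row by its sum, then weight by IDF.
            tf = X / X.sum(axis=1).reshape(-1,1)
            return tf * self.idf

    def fit_transform(self, X):
        self.fit(X)
        return self.transform(X)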

## L2 Normalization

We use the following code to perform **`L2 normalization`**, which rescales each sample (row) to unit Euclidean norm:

```python
normalizer = Normalizer(norm="l2")
X = normalizer.fit_transform(X)
```
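To see what the normalizer does, here is a minimal sketch on a made-up matrix (`X_demo` is hypothetical). After the transformation every row should have Euclidean length 1.

```python
import numpy as np
from sklearn.preprocessing import Normalizer

# Hypothetical data: three samples with two features each.
X_demo = np.array([[3.0, 4.0],
                   [1.0, 0.0],
                   [2.0, 2.0]])

# Normalizer works row by row: each sample is divided by its own L2 norm.
X_unit = Normalizer(norm="l2").fit_transform(X_demo)

# Every row norm should now be (numerically) equal to 1.
row_norms = np.linalg.norm(X_unit, axis=1)
print(X_unit)
print(row_norms)
```

The first row `[3, 4]` has norm 5 and therefore becomes `[0.6, 0.8]`.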

## Log Transformation

We use **`log transformation`** to reduce the right skew of the data and bring it closer to a normal distribution, which tends to help linear models:

```python
np.log1p(X)
```

The above line of code returns the natural logarithm of one plus the input array **log(1 + x)**, element-wise.
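A short sketch of the effect, using made-up values that span several orders of magnitude (the array `skewed` is hypothetical). It also shows that `np.expm1` inverts the transformation, which is useful if values have to be mapped back to the original scale.

```python
import numpy as np

# Hypothetical right-skewed values spanning several orders of magnitude.
skewed = np.array([0.0, 1.0, 10.0, 100.0, 1000.0])

# log1p compresses large values and maps 0 to 0 (no log(0) problem).
logged = np.log1p(skewed)

# expm1 computes exp(x) - 1 and therefore undoes log1p.
restored = np.expm1(logged)

print(logged)
print(restored)
```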


## Truncated SVD

We use the following code to perform **`truncated SVD`** for dimension reduction (down to 512 dimensions):

```python
svd = TruncatedSVD(n_components=512, random_state=0)
X = svd.fit_transform(X)
```

The default solver of `TruncatedSVD` is a randomized algorithm, and singular vectors are only determined up to sign, so a **random state** was set to ensure reproducibility.

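Truncated SVD projects the original data onto a small number of new features, each a linear combination of the original ones, chosen to capture as much of the data's variation as possible. It is unsupervised, so these features are not selected for their relevance to the prediction target. The reduced representation corresponds to a matrix of **lower rank** that **approximates** the original matrix.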

This is done by first performing the standard SVD to decompose an $m \times n$ matrix $A$ into:

$$A = U \Sigma V^T$$

where $U$ is an **$m \times m$** unitary matrix, $\Sigma$ is an **$m \times n$** rectangular diagonal matrix, and $V$ is an **$n \times n$** unitary matrix (for a real matrix $A$, $U$ and $V$ are real orthogonal matrices).

The diagonal values of $\Sigma$ are the **singular values** of $A$. The columns of $U$ and $V$ are the **left-singular vectors** and the **right-singular vectors** of $A$.

Then, we keep the **k largest** singular values in $\Sigma$: the first **k columns** of $\Sigma$ (giving $\Sigma_K$) and the first **k rows** of $V^T$ (giving $V_K^T$).

The **approximate matrix** $B$ is given as:

$$B = U \Sigma_K V_K^T$$

In practice, we usually retain and work with a **descriptive subset** (dimensionally reduced) of the data. This is a **dense summary** of the original data matrix:

$$T = U \Sigma_K = A V_K$$
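The relations above can be checked numerically. The sketch below uses a random matrix (an assumption made purely for illustration) and the "thin" SVD from NumPy, where `Vt` holds $V^T$. It verifies that the two expressions for the reduced data agree and that the Frobenius error of the rank-k approximation equals the root of the sum of the squared discarded singular values.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(6, 4))   # hypothetical 6 x 4 data matrix
k = 2                         # number of singular values to keep

# Thin SVD: U is 6 x 4, s has 4 entries, Vt is 4 x 4 and holds V^T.
U, s, Vt = np.linalg.svd(A, full_matrices=False)
U_k, s_k, Vt_k = U[:, :k], s[:k], Vt[:k, :]

# Rank-k approximation of A.
B = U_k @ np.diag(s_k) @ Vt_k

# Two equivalent ways to obtain the reduced data.
T_left = U_k * s_k        # first k columns of U, scaled by the singular values
T_right = A @ Vt_k.T      # A projected onto the first k right-singular vectors

# The approximation error depends only on the discarded singular values.
frob_error = np.linalg.norm(A - B)
print(np.allclose(T_left, T_right))
print(frob_error, np.sqrt(np.sum(s[k:] ** 2)))
```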

We choose truncated SVD over PCA because it can deal with **sparse matrices**. Therefore, we don't have to **densify** our data matrix, which would require a lot of memory.
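As a minimal sketch of this point, the example below fits `TruncatedSVD` directly on a SciPy sparse matrix. The matrix is randomly generated and its shape, density, and the number of components are arbitrary assumptions for illustration.

```python
import scipy.sparse as sp
from sklearn.decomposition import TruncatedSVD

# Hypothetical sparse data: 100 samples, 50 features, about 5% non-zero entries.
X_sparse = sp.random(100, 50, density=0.05, format="csr", random_state=0)

# TruncatedSVD accepts the sparse matrix as is; no call to .toarray() is needed.
svd = TruncatedSVD(n_components=5, random_state=0)
X_reduced = svd.fit_transform(X_sparse)

# The input stays sparse; only the much smaller reduced output is dense.
print(type(X_sparse).__name__, X_sparse.shape)
print(type(X_reduced).__name__, X_reduced.shape)
```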

Note: if we **subtract the column-wise means** from our data, the resultant reduced matrix will be the **same** for truncated SVD and PCA (up to the sign of each component).

Why is centering necessary for PCA?

- It ensures that the resulting components are only looking at the variance within the dataset, and not capturing the overall mean of the dataset as an important variable.
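The equivalence on centered data can be checked with a small sketch. The data below is randomly generated with a non-zero mean (an assumption for illustration), and the `arpack` solver is chosen here instead of the default randomized one so that the comparison is numerically tight. The absolute values are compared because each component is only determined up to sign.

```python
import numpy as np
from sklearn.decomposition import PCA, TruncatedSVD

rng = np.random.default_rng(0)
X_demo = rng.normal(loc=5.0, size=(20, 5))   # hypothetical data with non-zero mean

# PCA centers the data internally; TruncatedSVD does not, so we center by hand.
X_centered = X_demo - X_demo.mean(axis=0)

pca_scores = PCA(n_components=2).fit_transform(X_demo)
svd_scores = TruncatedSVD(
    n_components=2, algorithm="arpack", random_state=0
).fit_transform(X_centered)

# Components may differ in sign, so compare absolute values.
same_up_to_sign = np.allclose(np.abs(pca_scores), np.abs(svd_scores), atol=1e-6)
print(same_up_to_sign)
```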

An example of standard SVD:

In [379]:
X = np.array([[1,4,1,3,7],[9,4,2,6,5]])
X

array([[1, 4, 1, 3, 7],
       [9, 4, 2, 6, 5]])

In [380]:
U, s, V = np.linalg.svd(X, full_matrices=True)  # NumPy returns V^T; it is named V here

In [381]:
U

array([[-0.51310666, -0.85832485],
       [-0.85832485,  0.51310666]])

In [382]:
s

array([14.48530309,  5.30810648])

In [383]:
Sigma = np.zeros((X.shape[0], X.shape[1]))
Sigma[:X.shape[0], :X.shape[0]] = np.diag(s)
Sigma

array([[14.48530309,  0.        ,  0.        ,  0.        ,  0.        ],
       [ 0.        ,  5.30810648,  0.        ,  0.        ,  0.        ]])

In [384]:
V

array([[-0.56871646, -0.37870979, -0.15393232, -0.46179697, -0.54423237],
       [ 0.70828177, -0.26014414,  0.03162869,  0.09488607, -0.64858169],
       [-0.12902955, -0.03842183,  0.98656765, -0.04029704, -0.08327995],
       [-0.38708866, -0.11526549, -0.04029704,  0.87910888, -0.24983985],
       [ 0.09171832, -0.87985314,  0.0191018 ,  0.05730541,  0.46238232]])

In [385]:
np.dot(np.dot(U,Sigma), V)

array([[1., 4., 1., 3., 7.],
       [9., 4., 2., 6., 5.]])

An example of truncated SVD:

In [386]:
X = np.array([[1,4,1,3,7],[9,4,2,6,5]])
X

array([[1, 4, 1, 3, 7],
       [9, 4, 2, 6, 5]])

In [387]:
U, s, V = np.linalg.svd(X, full_matrices=True)  # NumPy returns V^T; it is named V here

In [390]:
Sigma = np.zeros((X.shape[0], X.shape[1]))
Sigma[:X.shape[0], :X.shape[0]] = np.diag(s)
Sigma

array([[14.48530309,  0.        ,  0.        ,  0.        ,  0.        ],
       [ 0.        ,  5.30810648,  0.        ,  0.        ,  0.        ]])

In [391]:
n = 2
Sigmak = Sigma[:, :n]
Vk = V[:n, :]

In [392]:
U.dot(Sigmak)

array([[ -7.43250547,  -4.55607972],
       [-12.43309568,   2.72362478]])

In [393]:
np.dot(X, Vk.T)

array([[ -7.43250547,  -4.55607972],
       [-12.43309568,   2.72362478]])

Using sklearn:

In [396]:
X = np.array([[1,4,1,3,7],[9,4,2,6,5]])
X

array([[1, 4, 1, 3, 7],
       [9, 4, 2, 6, 5]])

In [397]:
svd = TruncatedSVD(n_components=2, random_state=0)

In [398]:
X = svd.fit_transform(X)
X

array([[ 7.43250547,  4.55607972],
       [12.43309568, -2.72362478]])

An example of PCA:

In [399]:
X = np.array([[1,4,1,3,7],[9,4,2,6,5]])
X

array([[1, 4, 1, 3, 7],
       [9, 4, 2, 6, 5]])

In [400]:
pca = PCA(2)

In [401]:
X = pca.fit_transform(X)
X

array([[ 4.41588043e+00,  6.72203811e-17],
       [-4.41588043e+00,  6.72203811e-17]])

## Standardization

We **`standardize`** the features (rescale each to zero mean and unit variance) so that differences in feature scale do not affect the regression.
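The notebook does not show which tool was used for this step. One common option, assumed here, is scikit-learn's `StandardScaler`; the data below is made up to show two features on very different scales.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical data: the second feature is on a much larger scale than the first.
X_demo = np.array([[1.0, 100.0],
                   [2.0, 300.0],
                   [3.0, 500.0]])

# StandardScaler subtracts each column's mean and divides by its standard deviation.
X_std = StandardScaler().fit_transform(X_demo)

# After scaling, every column has mean 0 and (population) standard deviation 1.
print(X_std.mean(axis=0))
print(X_std.std(axis=0))
```

In a train/test setting the scaler would normally be fitted on the training data only and then applied to the test data with `transform`.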