### Truncated SVD example

Truncated Singular Value Decomposition (SVD) is a matrix factorization technique that factors a matrix M into three matrices U, Σ, and V, keeping only the k largest singular values and their corresponding vectors. It is very similar to PCA, except that SVD factorizes the data matrix directly, whereas PCA factorizes the covariance matrix. In practice, SVD is typically used under the hood to find the principal components of a matrix.
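To make the relationship with PCA concrete, here is a minimal sketch on a small random matrix (the toy data and variable names are illustrative assumptions, not part of the newsgroups example). It checks that the eigenvalues of the covariance matrix match the squared singular values of the mean-centered data matrix divided by n - 1.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 4))              # toy data: 50 samples, 4 features (illustrative)
Xc = X - X.mean(axis=0)                   # PCA operates on mean-centered data

# Route 1: eigendecomposition of the covariance matrix (the PCA view)
cov = np.cov(Xc, rowvar=False)            # shape (4, 4), uses the n - 1 denominator
eigvals = np.linalg.eigvalsh(cov)[::-1]   # eigvalsh returns ascending order, so reverse

# Route 2: SVD of the centered data matrix
U, s, Vh = np.linalg.svd(Xc, full_matrices=False)
var_from_svd = s**2 / (Xc.shape[0] - 1)

print(np.allclose(eigvals, var_from_svd))
```

Note that scikit-learn's `TruncatedSVD` does not center the data, which is why it can work directly on sparse matrices.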

Like PCA, Truncated SVD is used to reduce the number of dimensions (columns) of a matrix.
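The "truncation" step can be sketched with plain NumPy as follows. The toy matrix `M` and the choice `k = 2` are assumptions made for illustration only. Keeping the first k singular triplets gives both a rank-k approximation of the matrix and a k-column representation of its rows.

```python
import numpy as np

rng = np.random.default_rng(0)
M = rng.normal(size=(6, 5))               # toy matrix (illustrative)

U, s, Vh = np.linalg.svd(M, full_matrices=False)

k = 2                                     # number of components to keep (assumed)
M_k = U[:, :k] @ np.diag(s[:k]) @ Vh[:k, :]   # rank-k approximation of M
reduced = U[:, :k] * s[:k]                # each row of M described by k numbers

# The reduced representation is the projection of M onto the top-k right singular vectors
print(np.allclose(reduced, M @ Vh[:k, :].T))

# The Frobenius error equals the norm of the discarded singular values
err = np.linalg.norm(M - M_k)
print(reduced.shape, np.isclose(err, np.sqrt(np.sum(s[k:] ** 2))))
```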

A couple of good videos on this topic:

1. https://www.youtube.com/watch?v=P5mlg91as1c

2. https://www.youtube.com/watch?v=UyAfmAZU_WI

In [76]:
import pandas as pd
import numpy as np
import json
import zipfile
import os
import time

from sklearn import preprocessing
import matplotlib.pyplot as plt

from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture
from sklearn.datasets import fetch_20newsgroups


from sklearn.datasets import load_iris
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import CountVectorizer
from scipy import linalg

### Download the data to run Truncated SVD

In [77]:
categories = ['alt.atheism', 'talk.religion.misc', 'comp.graphics', 'sci.space']
remove = ('headers', 'footers', 'quotes')
newsgroups_train = fetch_20newsgroups(subset='train', categories=categories, remove=remove)
newsgroups_test = fetch_20newsgroups(subset='test', categories=categories, remove=remove)

In [78]:
newsgroups_train.filenames.shape, newsgroups_train.target.shape

((2034,), (2034,))

In [79]:
print(newsgroups_train.target_names)
print(np.unique(newsgroups_train.target))

['alt.atheism', 'comp.graphics', 'sci.space', 'talk.religion.misc']
[0 1 2 3]


In [80]:
print("\n".join(newsgroups_train.data[:3]))

Hi,

I've noticed that if you only save a model (with all your mapping planes
positioned carefully) to a .3DS file that when you reload it after restarting
3DS, they are given a default position and orientation.  But if you save
to a .PRJ file their positions/orientation are preserved.  Does anyone
know why this information is not stored in the .3DS file?  Nothing is
explicitly said in the manual about saving texture rules in the .PRJ file. 
I'd like to be able to read the texture rule information, does anyone have 
the format for the .PRJ file?

Is the .CEL file format available from somewhere?

Rych


Seems to be, barring evidence to the contrary, that Koresh was simply
another deranged fanatic who thought it neccessary to take a whole bunch of
folks with him, children and all, to satisfy his delusional mania. Jim
Jones, circa 1993.


Nope - fruitcakes like Koresh have been demonstrating such evil corruption
for centuries.

 >In article <1993Apr19.020359.26996@sq.sq.com>, msb@sq.sq.c

In [81]:
np.array(newsgroups_train.target_names)[newsgroups_train.target[:3]]

array(['comp.graphics', 'talk.religion.misc', 'sci.space'], dtype='<U18')

In [82]:
vectorizer = CountVectorizer(stop_words='english')
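# toarray() returns a regular ndarray; todense() returns np.matrix, which newer scikit-learn versions reject
x_input = vectorizer.fit_transform(newsgroups_train.data).toarray()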
x_input.shape

(2034, 26576)

In [83]:
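# Note: get_feature_names() was removed in scikit-learn 1.2; newer versions provide get_feature_names_out()
vocab = np.array(vectorizer.get_feature_names())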
vocab.shape

(26576,)

In [84]:
vocab[7000:7020]

array(['cosmonauts', 'cosmos', 'cosponsored', 'cost', 'costa', 'costar',
       'costing', 'costly', 'costruction', 'costs', 'cosy', 'cote',
       'couched', 'couldn', 'council', 'councils', 'counsel',
       'counselees', 'counselor', 'count'], dtype='<U80')

#### Calculate Truncated SVD

#### Calculate first using SciPy

In [85]:
%time U, s, Vh = linalg.svd(x_input,full_matrices=False)

CPU times: user 42.6 s, sys: 1.13 s, total: 43.7 s
Wall time: 12 s


In [86]:
print(U.shape)
print(s.shape)
print(Vh.shape)

(2034, 2034)
(2034,)
(2034, 26576)


In [87]:
print(s)

[4.33926985e+02 2.91510127e+02 2.40711377e+02 ... 1.86503480e-15
 1.50688986e-15 1.35283161e-15]


In [135]:
s_df = pd.DataFrame(s).round(2)
s_df.rename(columns={0:"variance"},inplace=True)
s_df["variance_pct"] = s_df["variance"]*100.0/s_df["variance"].sum()
s_df.head(30)

| | variance | variance_pct |
|---|---|---|
| 0 | 433.93 | 2.262575 |
| 1 | 291.51 | 1.519976 |
| 2 | 240.71 | 1.255097 |
| 3 | 220.0 | 1.147112 |
| 4 | 182.74 | 0.952833 |
| 5 | 168.15 | 0.876759 |
| 6 | 148.16 | 0.772528 |
| 7 | 128.79 | 0.67153 |
| 8 | 126.49 | 0.659538 |
| 9 | 118.35 | 0.617094 |


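_The singular values in s (the diagonal of Σ) decay slowly: the largest accounts for only about 2.3% of their sum, so the weight is spread across many components rather than concentrated in a few (which is understandable for such documents). Note that the column labeled `variance` holds singular values; the variance captured by a component is proportional to the square of its singular value._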
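The `variance_pct` column above is each singular value's share of the sum of singular values. A common alternative is the share of the squared singular values, since their sum equals the squared Frobenius norm of the matrix. The sketch below contrasts the two on a small toy count matrix (the matrix `A` and its size are assumptions for illustration, not the newsgroups data).

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.poisson(0.3, size=(40, 60)).astype(float)   # toy count matrix (illustrative)

s = np.linalg.svd(A, compute_uv=False)              # singular values only, descending

share_of_sum = s / s.sum()                  # same idea as variance_pct, as a fraction
share_of_energy = s**2 / np.sum(s**2)       # fraction of the squared Frobenius norm

print(np.round(share_of_sum[:5], 3))
print(np.round(share_of_energy[:5], 3))

# The squared singular values add up to the squared Frobenius norm of A
print(np.isclose(np.sum(s**2), np.linalg.norm(A) ** 2))
```

Squaring emphasizes the leading components, so the first component's share is never smaller under the squared measure than under the plain sum.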

In [88]:
print(Vh)

[[-9.40971949e-03 -1.14531979e-02 -2.16949925e-05 ... -5.71798766e-06
  -1.14359753e-05 -1.09243411e-03]
 [-3.56688261e-03 -1.76916681e-02 -3.04483622e-05 ... -7.73124401e-06
  -1.54624880e-05 -1.85490440e-03]
 [ 9.49713213e-04 -2.28284545e-02 -2.33939629e-05 ... -1.22019598e-05
  -2.44039195e-05  1.50537828e-03]
 ...
 [-7.98329636e-03  8.54523075e-05 -6.51951431e-03 ... -2.62895556e-05
  -1.76065710e-05  4.79428021e-07]
 [-1.63296813e-03  2.47318168e-04 -1.86095724e-04 ...  5.35465333e-06
   1.39548625e-05 -1.42946143e-04]
 [ 4.14026862e-04 -1.29658375e-03 -4.80365443e-04 ...  1.41504622e-05
   5.47468666e-05  5.88689486e-05]]


### Find the top words in the first 10 components

In [138]:
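num_top_words = 9

def show_topics(a):
    # For each component (row of a), pick the num_top_words highest-weighted words,
    # listed from lowest to highest weight. The original slice [-num_top_words-1:]
    # with num_top_words=8 also returned 9 words, so the output is unchanged.
    top_words = lambda t: [vocab[i] for i in np.argsort(t)[-num_top_words:]]
    topic_words = [top_words(t) for t in a]
    return [' '.join(t) for t in topic_words]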

In [140]:
# Use abs so that words with large positive or large negative weights are both picked up
show_topics(np.abs(Vh[0:10]))

['pub data gif images graphics file edu image jpeg',
 'color file pub space data graphics gif edu jpeg',
 'matthew satellite people edu graphics god launch jesus space',
 'atheism atheists satellite people matthew launch god jesus space',
 'edu pub analysis processing space graphics jpeg data image',
 'religion believe prophecy religious atheism matthew atheists god jesus',
 'probe mars market lunar satellite commercial nasa space launch',
 'mars lunar surface probe ftp nasa available data image',
 'premises argumentum ad true atheists example conclusion fallacy argument',
 'mars moon surface theory data image larson probe space']
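The effect of taking absolute values before ranking can be seen on a tiny example. The six-word vocabulary and the weights below are made up for illustration; they are not taken from the fitted components.

```python
import numpy as np

vocab_toy = np.array(["space", "launch", "god", "jesus", "image", "file"])  # toy vocabulary
component = np.array([0.70, 0.10, -0.65, -0.05, 0.02, 0.20])                # toy weights

# Ranking signed weights only finds the largest positive entries
top_signed = vocab_toy[np.argsort(component)[-2:]]

# Ranking magnitudes finds the heaviest entries regardless of sign
top_abs = vocab_toy[np.argsort(np.abs(component))[-2:]]

print(top_signed)   # ['file' 'space']
print(top_abs)      # ['god' 'space']
```

Without `np.abs`, the strongly negative weight on "god" would be ranked last even though it contributes almost as much to the component as "space".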

### Calculate using scikit-learn TruncatedSVD

In [143]:
svd = TruncatedSVD(n_components=10)
svd_transformed = svd.fit_transform(x_input)

In [144]:
svd.explained_variance_

array([90.64507815, 41.57611948, 26.9144205 , 23.63770842, 16.41821839,
       13.83983333, 10.76926986,  8.15117348,  7.86458302,  6.88208483])

In [104]:
svd.explained_variance_ratio_

array([0.21206798, 0.09726909, 0.06296742, 0.05530141, 0.03841112,
       0.03237886, 0.02519518, 0.01906992, 0.01839947, 0.01610078])

_The first 10 components together explain only about 58% of the variance (the ratios above sum to roughly 0.577), so more components are needed to capture most of it._
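One way to decide how many components to keep is to look at the cumulative sum of `explained_variance_ratio_`. The following is a rough sketch on a small synthetic sparse matrix; the data, the 50-component budget, and the 0.5 target are all assumptions chosen for illustration, not recommendations for the newsgroups data.

```python
import numpy as np
from scipy import sparse
from sklearn.decomposition import TruncatedSVD

rng = np.random.default_rng(0)
dense = rng.poisson(0.1, size=(200, 100)).astype(float)   # toy count data (illustrative)
X = sparse.csr_matrix(dense)                               # TruncatedSVD accepts sparse input

svd_toy = TruncatedSVD(n_components=50, random_state=0)
svd_toy.fit(X)

cumulative = np.cumsum(svd_toy.explained_variance_ratio_)

target = 0.5                                   # assumed threshold
reached = cumulative >= target
# Smallest number of components whose cumulative ratio meets the target, if any
n_needed = int(np.argmax(reached)) + 1 if reached.any() else None

print(n_needed, round(float(cumulative[-1]), 3))
```

If `n_needed` is `None`, the fitted components do not reach the target and `n_components` would need to be increased.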

In [105]:
svd.components_

array([[ 9.40971948e-03,  1.14531979e-02,  2.16949924e-05, ...,
         5.71798669e-06,  1.14359734e-05,  1.09243411e-03],
       [-3.56688198e-03, -1.76916530e-02, -3.04483508e-05, ...,
        -7.73131066e-06, -1.54626213e-05, -1.85490432e-03],
       [-9.49720119e-04,  2.28284087e-02,  2.33938394e-05, ...,
         1.22014265e-05,  2.44028531e-05, -1.50537818e-03],
       ...,
       [ 2.17172433e-03,  4.32226412e-02,  1.25511772e-04, ...,
        -3.74205502e-05, -7.48411004e-05, -1.60929401e-03],
       [-3.87402588e-04,  4.90966885e-03,  3.02618036e-06, ...,
        -1.38485613e-05, -2.76971226e-05, -1.50091441e-04],
       [-3.07858550e-03,  1.41343220e-02,  3.77826186e-06, ...,
         3.13907212e-05,  6.27814424e-05, -5.62417736e-04]])
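Comparing `svd.components_` with the rows of `Vh` from SciPy shows the same numbers up to a sign flip per row: singular vectors are only defined up to sign, and the two libraries may pick different conventions. The sketch below checks this on a small toy matrix (the data and `k = 3` are assumptions for illustration). It uses `algorithm="arpack"` rather than the default randomized solver so that the comparison is tight.

```python
import numpy as np
from scipy import linalg, sparse
from sklearn.decomposition import TruncatedSVD

rng = np.random.default_rng(0)
dense = rng.poisson(0.2, size=(30, 12)).astype(float)   # toy count matrix (illustrative)

# Full (thin) SVD with SciPy
U, s, Vh = linalg.svd(dense, full_matrices=False)

# Truncated SVD with scikit-learn, fitted directly on a sparse matrix
k = 3
svd_toy = TruncatedSVD(n_components=k, algorithm="arpack", random_state=0)
svd_toy.fit(sparse.csr_matrix(dense))

# Rows agree up to sign, and the singular values agree directly
same_up_to_sign = np.allclose(np.abs(svd_toy.components_), np.abs(Vh[:k]), atol=1e-6)
same_singular_values = np.allclose(svd_toy.singular_values_, s[:k])
print(same_up_to_sign, same_singular_values)
```

Because `TruncatedSVD` accepts sparse input, the document-term matrix does not need to be converted to a dense array for this step.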