# Equivalence of normalize and unit_norm

I wrote a function to 'normalize' the points in my doc_topic matrix using the mapping
$\left(\frac{x}{\sqrt{x^2+y^2}},\frac{y}{\sqrt{x^2+y^2}}\right)$, based on a suggestion from Roberto.  I later found out about sklearn.preprocessing.normalize, which by default scales each row to unit L2 norm.  This notebook checks whether the two give the same result.
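For reference, here is a minimal NumPy sketch of the formula above, applied row by row. The name `l2_normalize_rows` and the demo points are hypothetical and used only for illustration; this is not the code in `src.functions.unit_norm`, and it assumes no row is all zeros (which would cause a division by zero).

```python
import numpy as np

def l2_normalize_rows(points):
    """Divide each row by its Euclidean length, sqrt(x^2 + y^2)."""
    points = np.asarray(points, dtype=float)
    # keepdims=True keeps the norms as a column so the division broadcasts per row
    norms = np.sqrt((points ** 2).sum(axis=1, keepdims=True))
    return points / norms

# Made-up demo points, not taken from the doc_topic matrix
demo = np.array([[3.0, 4.0], [1.0, 0.0]])
unit = l2_normalize_rows(demo)
print(unit)
print(np.linalg.norm(unit, axis=1))  # each row length should be (approximately) 1
```

After this transformation every point lies on the unit circle, so only its direction from the origin is kept.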

In [1]:
from sklearn.preprocessing import normalize
from src.functions import unit_norm

I'm including this next cell only to get the vector I was normalizing when I encountered this issue...

In [2]:
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import TruncatedSVD
from src.functions import get_sents
# Read in the transcripts
raw_data_path = './data/raw/ted-talks/'
transcripts_filename = 'transcripts.csv'
t_df = pd.read_csv(raw_data_path+transcripts_filename)
# Pare down the corpus to only those talks with the word 'love'
love=t_df[t_df['transcript'].str.contains('love',case=False)]
# Tokenize
# Get the collection of n(=5)-sentence snippets with the word 'love'
love_snippets = get_sents(love,'love',0,0)
# Topic modeling
# Vectorize
cv1 = CountVectorizer(stop_words='english',binary=True)
cv_doc_word = cv1.fit_transform(love_snippets.love)
# Dimension Reduction
cv_lsa=[]
cv_doc_topic=[]
for i in range(2,3):
    cv_lsa.append(TruncatedSVD(i))
    cv_doc_topic.append(cv_lsa[i-2].fit_transform(cv_doc_word))
    print(int(i),'topics variance ratios:',cv_lsa[i-2].explained_variance_ratio_)
cv_doc_topic[0]

2 topics variance ratios: [0.02570641 0.0142085 ]


array([[ 1.25876704, -0.39405516],
       [ 1.04318722, -0.5211399 ],
       [ 0.69086968, -0.71191713],
       ...,
       [ 0.65087842, -0.67896504],
       [ 1.04650948, -0.71010517],
       [ 0.78235133, -0.67294219]])

Now I will normalize using two functions: sklearn.preprocessing.normalize and src.functions.unit_norm:

In [3]:
x=normalize(cv_doc_topic[0])
y=unit_norm(pd.DataFrame(cv_doc_topic[0]),demean=False).to_numpy()

Now I'll compare the results:

In [4]:
(x==y).all()

True

In [5]:
for i in range(len(x)):
    if x[i][0]!=y[i][0] or x[i][1]!=y[i][1]:
        print(i, x[i][0]-y[i][0],x[i][1]-y[i][1])

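Since `(x==y).all()` returned `True` and the loop above printed no differences, the two functions gave identical values for this matrix. In general, though, two implementations of the same formula are only guaranteed to be essentially the same, not necessarily equal to the last decimal, because floating-point results can differ in the final digits.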
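A tolerance-based comparison is usually a more robust way to check two floating-point results than `==`. The sketch below uses synthetic random points (not the TED data) and two hypothetical stand-ins for the functions compared above: the explicit formula and a version based on `np.linalg.norm`. It reports whether they match exactly, whether they match within NumPy's default tolerances, and how large the biggest difference is.

```python
import numpy as np

# Synthetic stand-in data; an assumption for illustration only
rng = np.random.default_rng(0)
points = rng.normal(size=(1000, 2))

# Two ways of computing the same row-wise L2 normalization
a = points / np.sqrt((points ** 2).sum(axis=1, keepdims=True))
b = points / np.linalg.norm(points, axis=1, keepdims=True)

exact = np.array_equal(a, b)      # strict, bit-for-bit equality
close = np.allclose(a, b)         # equality within default rtol/atol
max_diff = np.abs(a - b).max()    # largest absolute discrepancy
# Indices of rows where at least one coordinate differs exactly
mismatched = np.flatnonzero((a != b).any(axis=1))

print("exactly equal:", exact)
print("close:", close)
print("max abs difference:", max_diff)
print("rows with any exact mismatch:", len(mismatched))
```

The `mismatched` array plays the same role as the loop in cell 5, but without iterating over rows in Python.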