<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Dimensionality-Reduction" data-toc-modified-id="Dimensionality-Reduction-1">Dimensionality Reduction</a></span></li><li><span><a href="#Latent-Semantic-Analysis" data-toc-modified-id="Latent-Semantic-Analysis-2">Latent Semantic Analysis</a></span></li><li><span><a href="#numpy를-이용한-cosine-similarity-계산" data-toc-modified-id="numpy를-이용한-cosine-similarity-계산-3">Computing cosine similarity with numpy</a></span></li><li><span><a href="#SVD(Singular-Value-Decomposition)" data-toc-modified-id="SVD(Singular-Value-Decomposition)-4">SVD(Singular Value Decomposition)</a></span></li><li><span><a href="#exam" data-toc-modified-id="exam-5">exam</a></span></li></ul></div>

### Dimensionality Reduction

- https://opentutorials.org/module/3653/22994

![Dimensionality_Reduction.png](./images/Dimensionality_Reduction.png)

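Dimensionality reduction is a way to reduce the number of features (dimensions) used to represent data, according to some objective.

There are several reasons to shrink data on purpose this way.

First, with fewer dimensions, <br>
both time complexity (computation time) and space complexity (the amount of stored values) generally go down.

Second, a machine learning model trained on very high-dimensional data ends up with complicated internal parameters (that is, it overfits easily), <br>
so with a small dataset it produces unstable results that swing back and forth. <br>
Training on lower-dimensional input makes the model comparatively simple, and a simpler model gives more stable (robust) results on a small dataset.
 
Third, the simpler a model is, the easier it is for a person to understand its internals (interpretable), <br>
and results can be reduced to a 2D or 3D picture, which makes them easier to read.

Dimensionality reduction alters the original data, so it discards some of the information that was there.<br>
Each dimensionality reduction method is an answer to the question of which information to discard, and how much, for a given objective.

The figure is divided into three parts.<br>
The pink area at the upper right shows the data.

There are two broad approaches to dimensionality reduction.<br>
The dark gray area at the top illustrates feature selection.<br>
The light gray area at the bottom illustrates feature extraction.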
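
The difference between the two approaches can be sketched in a few lines of numpy. This is only an illustration under stated assumptions: the toy data, the variance-based selection rule, and the use of PCA (via SVD) as the extraction method are choices made for this example, not something the figure prescribes.

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy data (hypothetical): 100 samples, 5 features with different scales
data = rng.normal(size=(100, 5)) * np.array([5.0, 3.0, 1.0, 0.5, 0.1])

# Feature selection: keep 2 of the ORIGINAL columns (here, the highest-variance ones)
variances = data.var(axis=0)
selected_idx = np.sort(np.argsort(variances)[-2:])
selected = data[:, selected_idx]

# Feature extraction: build 2 NEW features as linear combinations of all columns
# (PCA computed through the SVD of the centered data)
centered = data - data.mean(axis=0)
_, _, vt = np.linalg.svd(centered, full_matrices=False)
extracted = centered @ vt[:2].T

print(selected.shape, extracted.shape)
```

Both results have two columns, but `selected` is a subset of the original features, while each column of `extracted` mixes all five.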

### Latent Semantic Analysis

- https://ratsgo.github.io/from%20frequency%20to%20semantics/2017/04/06/pcasvdlsa/

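Latent semantic analysis (LSA) applies singular value decomposition to input data such as a word-document matrix or a word-context matrix (window-based co-occurrence matrix). It reduces the dimensionality of the data to improve computational efficiency, and at the same time draws out the latent meaning hidden between the lines.

The procedure is as follows.

Suppose we are given an input matrix $A$ that represents n documents using m words (an m x n word-document matrix), with singular value decomposition $A = U \Sigma V^T$.

Let r be the number of singular values of $A$ that are greater than 0. The researcher picks some k smaller than r and builds $\Sigma_k$ from the k largest singular values. <br>
Then only the corresponding parts of $U$ and $V$ are kept, giving $U_k$ and $V_k$. <br>
This yields a matrix $A_k$ that approximates $A$.

$A_k = U_k \Sigma_k V_k^T$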

![LSA](./images/LSA.png)

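Let $X_1$ be the result of multiplying the equation above on the left by the transpose of $U_k$, and $X_2$ the result of multiplying it on the right by $V_k$. <br>
In $X_1$, the n documents are represented by k variables, far fewer than the original m words. <br>
In $X_2$, the m words are represented by k variables, fewer than the original n documents. 

This can be understood as similar to the dimensionality reduction effect of principal component analysis (PCA).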

$U^T_k A_k = U^T_k U_k \Sigma_k V^T_k = I \Sigma_k V^T_k = \Sigma_k V^T_k = X_1$

$A_k V_k = U_k \Sigma_k V^T_k V_k = U_k \Sigma_k I = U_k \Sigma_k = X_2$
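
The two identities above can be checked numerically. The following is a minimal sketch, assuming a small made-up 4-term by 3-document matrix and k = 2; the variable names are chosen for this example only.

```python
import numpy as np

# Hypothetical word-document matrix: m = 4 words (rows), n = 3 documents (columns)
A = np.array([[1.0, 0.0, 1.0],
              [0.0, 1.0, 0.0],
              [1.0, 1.0, 0.0],
              [0.0, 0.0, 1.0]])
k = 2

U, s, Vt = np.linalg.svd(A, full_matrices=False)
U_k = U[:, :k]            # m x k
S_k = np.diag(s[:k])      # k x k, the k largest singular values
Vt_k = Vt[:k, :]          # k x n

A_k = U_k @ S_k @ Vt_k    # rank-k approximation of A
X1 = U_k.T @ A_k          # k x n: documents in k latent dimensions
X2 = A_k @ Vt_k.T         # m x k: words in k latent dimensions

print(np.allclose(X1, S_k @ Vt_k))  # X1 equals Sigma_k V_k^T
print(np.allclose(X2, U_k @ S_k))   # X2 equals U_k Sigma_k
```

Both checks rely on the columns of $U_k$ and $V_k$ being orthonormal, which is what turns $U_k^T U_k$ and $V_k^T V_k$ into the identity.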


### Computing cosine similarity with numpy

In [1]:
import numpy as np

In [7]:
X = np.array([[1, 1, 0, 0, 0],
                    [0, 1, 0, 0, 0],
                    [0, 0, 1, 0, 0],
                    [0, 0, 0, 1, 0],
                    [0, 0, 0, 0, 1]])

In [28]:
X, X.shape, X.ndim
# Term-Document Matrix
# Row: Terms
# Column: Documents

# Document-Term Matrix
# Row: Documents
# Column: Terms

(array([[1, 1, 0, 0, 0],
        [0, 1, 0, 0, 0],
        [0, 0, 1, 0, 0],
        [0, 0, 0, 1, 0],
        [0, 0, 0, 0, 1]]), (5, 5), 2)

In [22]:
X.T

array([[1, 0, 0, 0, 0],
       [1, 1, 0, 0, 0],
       [0, 0, 1, 0, 0],
       [0, 0, 0, 1, 0],
       [0, 0, 0, 0, 1]])

In [33]:
_X = X.dot(X.T)
_X

array([[2, 1, 0, 0, 0],
       [1, 1, 0, 0, 0],
       [0, 0, 1, 0, 0],
       [0, 0, 0, 1, 0],
       [0, 0, 0, 0, 1]])

In [27]:
np.matmul(X, X.T)

array([[2, 1, 0, 0, 0],
       [1, 1, 0, 0, 0],
       [0, 0, 1, 0, 0],
       [0, 0, 0, 1, 0],
       [0, 0, 0, 0, 1]])

In [36]:
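def inner_product(x, y):
    # Dot product of two 1-D vectors
    return np.dot(x, y)

def cosine(x, y):
    # Cosine similarity: inner product divided by the product of the vector norms
    return inner_product(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))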

In [34]:
np.linalg.norm(X[0])

1.4142135623730951

In [35]:
np.linalg.norm(X, axis=1)    # norm of each row: length of each document vector

array([1.41421356, 1.        , 1.        , 1.        , 1.        ])

In [39]:
np.linalg.norm(X.T, axis=0)    # norm of each column of X.T (equal to the row norms of X)

array([1.41421356, 1.        , 1.        , 1.        , 1.        ])

In [44]:
# cosine similarity between rows: divide by the product of both norms (the parentheses matter)
X.dot(X.T) / (np.linalg.norm(X, axis=1).reshape(5, 1) * np.linalg.norm(X.T, axis=0))

array([[1.        , 0.70710678, 0.        , 0.        , 0.        ],
       [0.70710678, 1.        , 0.        , 0.        , 0.        ],
       [0.        , 0.        , 1.        , 0.        , 0.        ],
       [0.        , 0.        , 0.        , 1.        , 0.        ],
       [0.        , 0.        , 0.        , 0.        , 1.        ]])
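
The row-wise cosine similarity can also be wrapped in a small helper so that the matrix size does not have to be hard-coded in `reshape`. This is a sketch, not part of the original notebook: the function name `cosine_similarity_rows` is made up here, and it assumes no row is all zeros (a zero row would cause a division by zero).

```python
import numpy as np

def cosine_similarity_rows(M):
    """Pairwise cosine similarity between the rows of M (assumes no all-zero row)."""
    M = np.asarray(M, dtype=float)
    norms = np.linalg.norm(M, axis=1, keepdims=True)   # shape (rows, 1)
    return (M @ M.T) / (norms * norms.T)               # broadcast to (rows, rows)

X = np.array([[1, 1, 0, 0, 0],
              [0, 1, 0, 0, 0],
              [0, 0, 1, 0, 0],
              [0, 0, 0, 1, 0],
              [0, 0, 0, 0, 1]])

sim = cosine_similarity_rows(X)
print(np.round(sim, 3))
```

A valid cosine similarity matrix is symmetric with ones on the diagonal, which makes a quick sanity check for any hand-written version.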

### SVD(Singular Value Decomposition)

Singular value decomposition (SVD) factorizes an m x n data matrix $A$ as shown below.

![SVD](./images/SVD.png)

- SVD: $C = U \Sigma V^T$ (where $C$ is the term-document matrix)

![lsi](./images/LSI.png)

In [95]:
import pandas as pd

In [85]:
words = ["ship", "boat", "ocean", "wood", "tree"]
documents = ["d1", "d2", "d3", "d4", "d5", "d6"]

In [154]:
C = np.array([[1, 0, 1, 0, 0, 0],
              [0, 1, 0, 0, 0, 0],
              [1, 1, 0, 0, 0, 0],
              [1, 0, 0, 1, 1, 0],
              [0, 0, 0, 1, 0, 1]    
])

In [93]:
pd.DataFrame(C, index=words, columns=documents)

Unnamed: 0,d1,d2,d3,d4,d5,d6
ship,1,0,1,0,0,0
boat,0,1,0,0,0,0
ocean,1,1,0,0,0,0
wood,1,0,0,1,1,0
tree,0,0,0,1,0,1


In [57]:
# C: M x N
# U: M x K
# sigma: 1-D array of K singular values, K = min(M, N) = 5 (the number of clusters, or latent meanings)
# VT: K x N
U, sigma, VT = np.linalg.svd(C, full_matrices=False)

In [91]:
pd.DataFrame(U, index=words)

Unnamed: 0,0,1,2,3,4
ship,0.440347,-0.296174,-0.569498,0.5773503,-0.246402
boat,0.129346,-0.331451,0.587022,9.436896e-16,-0.727197
ocean,0.47553,-0.511115,0.36769,4.518514e-16,0.614358
wood,0.70302,0.350572,-0.154906,-0.5773503,-0.159788
tree,0.262673,0.646747,0.414592,0.5773503,0.086614


In [79]:
sigma   # Latent semantic strengths => U x sigma: term importance,     sigma x VT: document importance

array([2.16250096, 1.59438237, 1.27529025, 1.        , 0.39391525])

In [92]:
_sigma = np.diag(sigma)
pd.DataFrame(_sigma)

Unnamed: 0,0,1,2,3,4
0,2.162501,0.0,0.0,0.0,0.0
1,0.0,1.594382,0.0,0.0,0.0
2,0.0,0.0,1.27529,0.0,0.0
3,0.0,0.0,0.0,1.0,0.0
4,0.0,0.0,0.0,0.0,0.393915


In [94]:
pd.DataFrame(VT, columns=documents)

Unnamed: 0,d1,d2,d3,d4,d5,d6
0,0.748623,0.2797116,0.203629,0.4465631,0.325096,0.121467
1,-0.286454,-0.5284591,-0.185761,0.6255207,0.21988,0.405641
2,-0.2797116,0.748623,-0.446563,0.2036288,-0.121467,0.325096
3,-4.162907e-16,1.066959e-15,0.57735,2.382035e-16,-0.57735,0.57735
4,0.5284591,-0.286454,-0.625521,-0.1857612,-0.405641,0.21988


In [99]:
# C = UΣVT
pd.DataFrame(U.dot(_sigma).dot(VT), index=words, columns=documents)

Unnamed: 0,d1,d2,d3,d4,d5,d6
ship,1.0,2.866176e-16,1.0,-2.957499e-16,1.371194e-16,-3.70787e-16
boat,-5.863986e-17,1.0,-1.655668e-16,1.602963e-16,7.954781e-17,-4.819879e-17
ocean,1.0,1.0,-5.924052e-16,-5.787719e-17,1.7335480000000002e-17,-1.454892e-16
wood,1.0,6.760135000000001e-17,-1.334818e-16,1.0,1.0,2.535997e-16
tree,-3.275686e-16,-5.436625e-16,-4.857263e-16,1.0,2.091058e-16,1.0


In [100]:
# Equal to C only up to floating-point error; rounding recovers C
pd.DataFrame(np.round(U.dot(_sigma).dot(VT)), index=words, columns=documents)

Unnamed: 0,d1,d2,d3,d4,d5,d6
ship,1.0,0.0,1.0,-0.0,0.0,-0.0
boat,-0.0,1.0,-0.0,0.0,0.0,-0.0
ocean,1.0,1.0,-0.0,-0.0,0.0,-0.0
wood,1.0,0.0,-0.0,1.0,1.0,0.0
tree,-0.0,-0.0,-0.0,1.0,0.0,1.0
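
Instead of rounding and comparing by eye, the reconstruction can be checked with `np.allclose`, which tolerates floating-point error. The sketch below rebuilds the same `C` so that it runs on its own; the orthonormality checks are an extra added here for illustration.

```python
import numpy as np

C = np.array([[1, 0, 1, 0, 0, 0],
              [0, 1, 0, 0, 0, 0],
              [1, 1, 0, 0, 0, 0],
              [1, 0, 0, 1, 1, 0],
              [0, 0, 0, 1, 0, 1]])

U, sigma, VT = np.linalg.svd(C, full_matrices=False)
reconstructed = U @ np.diag(sigma) @ VT

print(np.allclose(reconstructed, C))        # True: C = U Sigma V^T
print(np.allclose(U.T @ U, np.eye(5)))      # True: columns of U are orthonormal
print(np.allclose(VT @ VT.T, np.eye(5)))    # True: rows of VT are orthonormal
```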


In [108]:
# U x sigma: importance of each term along the latent semantic (sigma) dimensions (reduced matrix)
pd.DataFrame(U.dot(_sigma), index=words)

Unnamed: 0,0,1,2,3,4
ship,0.952252,-0.472215,-0.726275,0.5773503,-0.097062
boat,0.279712,-0.528459,0.748623,9.436896e-16,-0.286454
ocean,1.028335,-0.814913,0.468911,4.518514e-16,0.242005
wood,1.520282,0.558946,-0.19755,-0.5773503,-0.062943
tree,0.56803,1.031162,0.528725,0.5773503,0.034119


In [109]:
# sigma x VT: importance of each document along the latent semantic (sigma) dimensions (reduced matrix)
# Rows are latent dimensions, not words, so no word index is attached
pd.DataFrame(_sigma.dot(VT), columns=documents)

Unnamed: 0,d1,d2,d3,d4,d5,d6
0,1.618898,0.6048766,0.440347,0.9656932,0.70302,0.262673
1,-0.4567172,-0.8425659,-0.296174,0.9973192,0.350572,0.646747
2,-0.3567135,0.9547117,-0.569498,0.2596858,-0.154906,0.414592
3,-4.162907e-16,1.066959e-15,0.57735,2.382035e-16,-0.57735,0.57735
4,0.2081681,-0.1128386,-0.246402,-0.07317416,-0.159788,0.086614


In [148]:
# Cosine similarity between terms in the latent semantic space (all 5 dimensions kept, so it matches the original space)
_U = U.dot(np.diag(sigma))
cos_sim_U = _U.dot(_U.T) / (np.linalg.norm(_U, axis=1).reshape(5,1) * np.linalg.norm(_U.T, axis=0))
pd.DataFrame(np.round(cos_sim_U*100), index=words, columns=words)

Unnamed: 0,ship,boat,ocean,wood,tree
ship,100.0,-0.0,50.0,41.0,-0.0
boat,-0.0,100.0,71.0,-0.0,-0.0
ocean,50.0,71.0,100.0,41.0,-0.0
wood,41.0,-0.0,41.0,100.0,41.0
tree,-0.0,-0.0,-0.0,41.0,100.0


In [149]:
# Cosine similarity between documents in the latent semantic space (all 5 dimensions kept, so it matches the original space)
_V = np.diag(sigma).dot(VT)
cos_sim_V = _V.T.dot(_V) / (np.linalg.norm(_V.T, axis=1).reshape(6,1) * np.linalg.norm(_V, axis=0))
pd.DataFrame(np.round(cos_sim_V*100), index=documents, columns=documents)

Unnamed: 0,d1,d2,d3,d4,d5,d6
d1,100.0,41.0,58.0,41.0,58.0,-0.0
d2,41.0,100.0,-0.0,-0.0,0.0,-0.0
d3,58.0,-0.0,100.0,-0.0,-0.0,-0.0
d4,41.0,-0.0,-0.0,100.0,71.0,71.0
d5,58.0,0.0,-0.0,71.0,100.0,0.0
d6,-0.0,-0.0,-0.0,71.0,0.0,100.0
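
When every singular value is kept, the latent-space similarities are the same as the similarities computed directly on `C`, because $(U\Sigma)(U\Sigma)^T = U\Sigma^2U^T = CC^T$ (and likewise $C^TC = V\Sigma^2V^T$). The sketch below checks this; the helper name `cosine_rows` is made up for the example. The similarities only start to differ once dimensions are dropped.

```python
import numpy as np

def cosine_rows(M):
    """Pairwise cosine similarity between rows of M (assumes no all-zero row)."""
    M = np.asarray(M, dtype=float)
    norms = np.linalg.norm(M, axis=1, keepdims=True)
    return (M @ M.T) / (norms * norms.T)

C = np.array([[1, 0, 1, 0, 0, 0],
              [0, 1, 0, 0, 0, 0],
              [1, 1, 0, 0, 0, 0],
              [1, 0, 0, 1, 1, 0],
              [0, 0, 0, 1, 0, 1]])
U, sigma, VT = np.linalg.svd(C, full_matrices=False)

term_latent = U @ np.diag(sigma)          # terms x 5 latent dimensions
doc_latent = (np.diag(sigma) @ VT).T      # documents x 5 latent dimensions

print(np.allclose(cosine_rows(term_latent), cosine_rows(C)))    # terms: same as raw rows of C
print(np.allclose(cosine_rows(doc_latent), cosine_rows(C.T)))   # documents: same as raw columns of C
```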


In [161]:
_U = U[:, :2].dot(_sigma[:2, :2])    # dimensionality reduction: keep only the top 2 latent dimensions
cos_sim_U = _U.dot(_U.T) / (np.linalg.norm(_U, axis=1).reshape(5,1) * np.linalg.norm(_U.T, axis=0))
pd.DataFrame(cos_sim_U, index=words, columns=words)

Unnamed: 0,ship,boat,ocean,wood,tree
ship,1.0,0.811764,0.978079,0.687557,0.043137
boat,0.811764,1.0,0.915575,0.134084,-0.548426
ocean,0.978079,0.915575,1.0,0.52128,-0.165849
wood,0.687557,0.134084,0.52128,1.0,0.755113
tree,0.043137,-0.548426,-0.165849,0.755113,1.0


In [163]:
cluster = pd.DataFrame(cos_sim_U, index=words)
temp = cluster.sort_values(by=[1], ascending=False)
temp[temp[1] > 0][1].to_dict()

{'boat': 1.0,
 'ocean': 0.9155748759872768,
 'ship': 0.8117637410018149,
 'wood': 0.13408431998247014}

In [164]:
temp = cluster.sort_values(by=[0], ascending=False)
temp[temp[0] > 0][0].to_dict()

{'ship': 1.0000000000000002,
 'ocean': 0.9780790149520933,
 'boat': 0.8117637410018149,
 'wood': 0.6875573362073368,
 'tree': 0.043136502613820224}

### exam

1. SVD => U, Sigma, Vt for an (N, M) matrix: N documents and M terms
2. U.Sigma : U[:, :K].Sigma[:K] => which documents matter along the latent semantic dimensions
3. Sigma.Vt : Sigma[:, :K].Vt[:K, :] => which terms matter along the latent semantic dimensions
4. From 2 -> which documents each document is similar to (on Latent Semantic Dimensions)
5. From 3 -> which terms each term is similar to (on Latent Semantic Dimensions)
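
The five steps above can be sketched as one small pipeline. This is an illustration under assumptions rather than the notebook's own code: the function names `lsa` and `cosine_rows` are made up, the vectors are kept as N x K and M x K arrays (without the zero padding that slicing `Sigma[:K]` leaves behind), and the matrix is the document-term matrix used in this section.

```python
import numpy as np

def lsa(C, k):
    """C is (N documents x M terms). Returns (N x k) document and (M x k) term vectors."""
    U, sigma, VT = np.linalg.svd(np.asarray(C, dtype=float), full_matrices=False)
    doc_vecs = U[:, :k] * sigma[:k]                    # step 2: U_k Sigma_k
    term_vecs = (sigma[:k, None] * VT[:k, :]).T        # step 3: (Sigma_k V_k^T)^T
    return doc_vecs, term_vecs

def cosine_rows(M):
    """Pairwise cosine similarity between rows of M (assumes no all-zero row)."""
    norms = np.linalg.norm(M, axis=1, keepdims=True)
    return (M @ M.T) / (norms * norms.T)

C = np.array([
    [1, 1, 0, 0, 0, 0, 0],
    [0, 0, 1, 1, 1, 0, 0],
    [0, 1, 0, 0, 0, 1, 0],
    [0, 0, 1, 0, 0, 0, 1],
    [0, 0, 0, 1, 1, 0, 0],
    [1, 0, 1, 0, 1, 1, 1],
])

doc_vecs, term_vecs = lsa(C, k=2)
doc_sim = cosine_rows(doc_vecs)      # step 4: document-document similarity
term_sim = cosine_rows(term_vecs)    # step 5: term-term similarity
print(doc_sim.shape, term_sim.shape)
```

Dropping the zero-padded columns does not change the cosine similarities, since zeros contribute nothing to either the dot products or the norms.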

In [174]:
C = np.array([
    [1, 1, 0, 0, 0, 0, 0],
    [0, 0, 1, 1, 1, 0, 0],
    [0, 1, 0, 0, 0, 1, 0],
    [0, 0, 1, 0, 0, 0, 1],
    [0, 0, 0, 1, 1, 0, 0],
    [1, 0, 1, 0, 1, 1, 1]
])

In [178]:
doc = ["A", "B", "C", "D", "E", "F"]
term = ["cute", "kitty", "eat", "rice", "cake", "hamster", "bread"]

In [204]:
pd.DataFrame(C, index=doc, columns=term)

Unnamed: 0,cute,kitty,eat,rice,cake,hamster,bread
A,1,1,0,0,0,0,0
B,0,0,1,1,1,0,0
C,0,1,0,0,0,1,0
D,0,0,1,0,0,0,1
E,0,0,0,1,1,0,0
F,1,0,1,0,1,1,1


In [205]:
U, sigma, VT = np.linalg.svd(C, full_matrices=False)
U.shape, sigma.shape, VT.shape

((6, 6), (6,), (6, 7))

In [207]:
_sigma = np.diag(sigma)
pd.DataFrame(_sigma)

Unnamed: 0,0,1,2,3,4,5
0,2.822643,0.0,0.0,0.0,0.0,0.0
1,0.0,1.89393,0.0,0.0,0.0,0.0
2,0.0,0.0,1.562689,0.0,0.0,0.0
3,0.0,0.0,0.0,1.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.899362,0.0
5,0.0,0.0,0.0,0.0,0.0,0.441438


In [224]:
_U = U[:, :2].dot(_sigma[:2])
_U

array([[ 0.4185015 , -0.91368177,  0.        ,  0.        ,  0.        ,
         0.        ],
       [ 1.34429039,  0.94566246,  0.        ,  0.        ,  0.        ,
         0.        ],
       [ 0.4185015 , -0.91368177,  0.        ,  0.        ,  0.        ,
         0.        ],
       [ 0.92201403, -0.07999361,  0.        ,  0.        ,  0.        ,
         0.        ],
       [ 0.79892052,  0.85384019,  0.        ,  0.        ,  0.        ,
         0.        ],
       [ 2.07882785, -0.53630501,  0.        ,  0.        ,  0.        ,
         0.        ]])

In [225]:
pd.DataFrame(U[:, :2].dot(_sigma[:2]), index=doc)

Unnamed: 0,0,1,2,3,4,5
A,0.418502,-0.913682,0.0,0.0,0.0,0.0
B,1.34429,0.945662,0.0,0.0,0.0,0.0
C,0.418502,-0.913682,0.0,0.0,0.0,0.0
D,0.922014,-0.079994,0.0,0.0,0.0,0.0
E,0.798921,0.85384,0.0,0.0,0.0,0.0
F,2.078828,-0.536305,0.0,0.0,0.0,0.0


In [227]:
_V = _sigma[:, :2].dot(VT[:2, :])
_V

array([[ 0.88474861,  0.29653167,  1.53938435,  0.75929227,  1.49577504,
         0.88474861,  1.06313197],
       [-0.76559676, -0.96485267,  0.17390496,  0.95014204,  0.66697161,
        -0.76559676, -0.32540726],
       [ 0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
         0.        ,  0.        ],
       [ 0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
         0.        ,  0.        ],
       [ 0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
         0.        ,  0.        ],
       [ 0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
         0.        ,  0.        ]])

In [228]:
pd.DataFrame(_sigma[:, :2].dot(VT[:2, :]), columns=term)

Unnamed: 0,cute,kitty,eat,rice,cake,hamster,bread
0,0.884749,0.296532,1.539384,0.759292,1.495775,0.884749,1.063132
1,-0.765597,-0.964853,0.173905,0.950142,0.666972,-0.765597,-0.325407
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [229]:
# Cosine similarity between documents in the latent semantic space
cos_sim_U = _U.dot(_U.T) / (np.linalg.norm(_U, axis=1).reshape(6,1) * np.linalg.norm(_U.T, axis=0))
pd.DataFrame(np.round(cos_sim_U*100), index=doc, columns=doc)

Unnamed: 0,A,B,C,D,E,F
A,100.0,-18.0,100.0,49.0,-38.0,63.0
B,-18.0,100.0,-18.0,77.0,98.0,65.0
C,100.0,-18.0,100.0,49.0,-38.0,63.0
D,49.0,77.0,49.0,100.0,62.0,99.0
E,-38.0,98.0,-38.0,62.0,100.0,48.0
F,63.0,65.0,63.0,99.0,48.0,100.0


In [230]:
# Cosine similarity between terms in the latent semantic space
cos_sim_V = _V.T.dot(_V) / (np.linalg.norm(_V.T, axis=1).reshape(7,1) * np.linalg.norm(_V, axis=0))
pd.DataFrame(np.round(cos_sim_V*100), columns=term, index=term)

Unnamed: 0,cute,kitty,eat,rice,cake,hamster,bread
cute,100.0,85.0,68.0,-4.0,42.0,100.0,91.0
kitty,85.0,100.0,18.0,-56.0,-12.0,85.0,56.0
eat,68.0,18.0,100.0,71.0,95.0,68.0,92.0
rice,-4.0,-56.0,71.0,100.0,89.0,-4.0,37.0
cake,42.0,-12.0,95.0,89.0,100.0,42.0,75.0
hamster,100.0,85.0,68.0,-4.0,42.0,100.0,91.0
bread,91.0,56.0,92.0,37.0,75.0,91.0,100.0


In [238]:
cluster = pd.DataFrame(cos_sim_U, index=doc)
cluster

Unnamed: 0,0,1,2,3,4,5
A,1.0,-0.182501,1.0,0.493458,-0.379352,0.630345
B,-0.182501,1.0,-0.182501,0.765105,0.978945,0.648239
C,1.0,-0.182501,1.0,0.493458,-0.379352,0.630345
D,0.493458,0.765105,0.493458,1.0,0.617561,0.986264
E,-0.379352,0.978945,-0.379352,0.617561,1.0,0.479164
F,0.630345,0.648239,0.630345,0.986264,0.479164,1.0


In [239]:
temp = cluster.sort_values(by=[1], ascending=False)     # column 1 holds similarity to document B, so this ranks documents by similarity to B
temp

Unnamed: 0,0,1,2,3,4,5
B,-0.182501,1.0,-0.182501,0.765105,0.978945,0.648239
E,-0.379352,0.978945,-0.379352,0.617561,1.0,0.479164
D,0.493458,0.765105,0.493458,1.0,0.617561,0.986264
F,0.630345,0.648239,0.630345,0.986264,0.479164,1.0
C,1.0,-0.182501,1.0,0.493458,-0.379352,0.630345
A,1.0,-0.182501,1.0,0.493458,-0.379352,0.630345


In [240]:
temp[temp[1] > 0][1].to_dict()

{'B': 1.0000000000000002,
 'E': 0.9789454660748421,
 'D': 0.7651054648220396,
 'F': 0.6482386350719546}

In [241]:
temp = cluster.sort_values(by=[0], ascending=False)  # column 0 holds similarity to document A, so this ranks documents by similarity to A
temp

Unnamed: 0,0,1,2,3,4,5
A,1.0,-0.182501,1.0,0.493458,-0.379352,0.630345
C,1.0,-0.182501,1.0,0.493458,-0.379352,0.630345
F,0.630345,0.648239,0.630345,0.986264,0.479164,1.0
D,0.493458,0.765105,0.493458,1.0,0.617561,0.986264
B,-0.182501,1.0,-0.182501,0.765105,0.978945,0.648239
E,-0.379352,0.978945,-0.379352,0.617561,1.0,0.479164


In [242]:
temp[temp[0] > 0][0].to_dict()

{'A': 0.9999999999999998,
 'C': 0.9999999999999998,
 'F': 0.6303451754109326,
 'D': 0.49345847594937386}

In [244]:
cluster = pd.DataFrame(cos_sim_V, columns=term)
cluster

Unnamed: 0,cute,kitty,eat,rice,cake,hamster,bread
0,1.0,0.847627,0.677955,-0.039102,0.424155,1.0,0.914593
1,0.847627,1.0,0.184613,-0.563331,-0.120974,0.847627,0.560674
2,0.677955,0.184613,1.0,0.708032,0.95326,0.677955,0.917311
3,-0.039102,-0.563331,0.708032,1.0,0.888312,-0.039102,0.368305
4,0.424155,-0.120974,0.95326,0.888312,1.0,0.424155,0.754128
5,1.0,0.847627,0.677955,-0.039102,0.424155,1.0,0.914593
6,0.914593,0.560674,0.917311,0.368305,0.754128,0.914593,1.0


In [249]:
temp = cluster.sort_values(by=[1], axis=1, ascending=False)
temp.T

Unnamed: 0,0,1,2,3,4,5,6
kitty,0.847627,1.0,0.184613,-0.563331,-0.120974,0.847627,0.560674
hamster,1.0,0.847627,0.677955,-0.039102,0.424155,1.0,0.914593
cute,1.0,0.847627,0.677955,-0.039102,0.424155,1.0,0.914593
bread,0.914593,0.560674,0.917311,0.368305,0.754128,0.914593,1.0
eat,0.677955,0.184613,1.0,0.708032,0.95326,0.677955,0.917311
cake,0.424155,-0.120974,0.95326,0.888312,1.0,0.424155,0.754128
rice,-0.039102,-0.563331,0.708032,1.0,0.888312,-0.039102,0.368305


In [251]:
temp.T[temp.T[1] > 0][1].to_dict()

{'kitty': 1.0000000000000002,
 'hamster': 0.8476267357426155,
 'cute': 0.8476267357426152,
 'bread': 0.560674315082687,
 'eat': 0.18461264781848385}