In [1]:
import numpy as np
import pandas as pd

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

### Insights

#### **Punctuation in the documents matters because the tokenizer splits terms on these characters (`Sorcerer's` is indexed as `sorcerer`, while `Sorcerers` is indexed as `sorcerers`). It is better to remove punctuation before calculating TF-IDF.**



- We will focus on the similarity of `"Harry Potter and the Sorcerer's Stone"` (document 1) with the other documents in the list, in particular with document 4.

Actual document => `"Great Sorcerer's of NY"`
```
docs=[
    "Harry Potter and the Sorcerer's Stone", 
    "Harry Potter and the Chamber of Secrets", 
    "The Sorcerer's Den",
    "Great Sorcerer's of NY", 
    "Great Secrets of Amazon", 
]
```
Score => row 4, column 1 = `0.18665146`
```
array([[1.        , 0.45682318, 0.22470915, 0.18665146, 0.        ],
       [0.45682318, 1.        , 0.        , 0.        , 0.24967495],
       [0.22470915, 0.        , 1.        , 0.25719572, 0.        ],
       [0.18665146, 0.        , 0.25719572, 1.        , 0.29609938],
       [0.        , 0.24967495, 0.        , 0.29609938, 1.        ]])
```

---

Actual document => `"Great Sorcerers of NY"`
```
docs=[
    "Harry Potter and the Sorcerer's Stone", 
    "Harry Potter and the Chamber of Secrets", 
    "The Sorcerer's Den",
    "Great Sorcerers of NY", 
    "Great Secrets of Amazon", 
]
```
Score => row 4, column 1 = `0`
```
array([[1.        , 0.4408883 , 0.29481481, 0.        , 0.        ],
       [0.4408883 , 1.        , 0.        , 0.        , 0.24967495],
       [0.29481481, 0.        , 1.        , 0.        , 0.        ],
       [0.        , 0.        , 0.        , 1.        , 0.2635058 ],
       [0.        , 0.24967495, 0.        , 0.2635058 , 1.        ]])
```

#### cosine_similarity([5 x 11], [5 x 11]) => Output = [5 x 5] and cosine_similarity([5 x 11], [1 x 11]) => Output = [5 x 1]

```
my_matrix = 
v1=> 1 2 3
v2=> 4 5 6
v3=> 7 8 9
```

`cosine_similarity(my_matrix, my_matrix) =`

```
     col1      col2      col3
[
[v1 vs v1, v2 vs v1, v3 vs v1],
[v1 vs v2, v2 vs v2, v3 vs v2],
[v1 vs v3, v2 vs v3, v3 vs v3]
]
```

---

```
other_matrix =
v4 => 1 5 9
```

`cosine_similarity(my_matrix, other_matrix) =`

```
    col1
[    
[v4 vs v1], 
[v4 vs v2], 
[v4 vs v3]
]
```
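The shapes described above can be checked directly. This is a small sketch using the same `my_matrix` and `other_matrix` values; the variable names `pairwise`, `against_v4`, and `manual` are illustrative.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

my_matrix = np.array([[1, 2, 3],
                      [4, 5, 6],
                      [7, 8, 9]])
other_matrix = np.array([[1, 5, 9]])  # a single row vector, v4

# Every row of the first argument is compared with every row of the second
pairwise = cosine_similarity(my_matrix, my_matrix)
against_v4 = cosine_similarity(my_matrix, other_matrix)

print(pairwise.shape)    # (3, 3)
print(against_v4.shape)  # (3, 1)

# One entry recomputed by hand: dot product divided by the product of norms
v1, v4 = my_matrix[0], other_matrix[0]
manual = v1 @ v4 / (np.linalg.norm(v1) * np.linalg.norm(v4))
print(np.isclose(against_v4[0, 0], manual))  # True
```

In general, comparing an `[n x d]` matrix with an `[m x d]` matrix yields an `[n x m]` result, which is why the 5-document TF-IDF matrix compared with itself gives `[5 x 5]`.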


In [77]:
docs=[
    "Harry Potter and the Sorcerer's Stone", 
    "Harry Potter and the Chamber of Secrets", 
    "The Sorcerer's Den",
    "Great Sorcerers of NY", 
    "Great Secrets of Amazon", 
]

In [78]:
tf = TfidfVectorizer(analyzer='word', ngram_range=(1, 1),
                     min_df=1,  # default; keeps every term (newer scikit-learn releases reject the integer 0)
                     stop_words='english')

In [79]:
tfidf = tf.fit_transform(docs)

In [80]:
tfidf.shape

(5, 11)

In [81]:
tfidf.todense()

matrix([[0.        , 0.        , 0.        , 0.        , 0.4695148 ,
         0.        , 0.4695148 , 0.        , 0.4695148 , 0.        ,
         0.5819515 ],
        [0.        , 0.5819515 , 0.        , 0.        , 0.4695148 ,
         0.        , 0.4695148 , 0.4695148 , 0.        , 0.        ,
         0.        ],
        [0.        , 0.        , 0.77828292, 0.        , 0.        ,
         0.        , 0.        , 0.        , 0.62791376, 0.        ,
         0.        ],
        [0.        , 0.        , 0.        , 0.49552379, 0.        ,
         0.61418897, 0.        , 0.        , 0.        , 0.61418897,
         0.        ],
        [0.659118  , 0.        , 0.        , 0.53177225, 0.        ,
         0.        , 0.        , 0.53177225, 0.        , 0.        ,
         0.        ]])

In [82]:
tf.get_feature_names_out()

array(['amazon', 'chamber', 'den', 'great', 'harry', 'ny', 'potter',
       'secrets', 'sorcerer', 'sorcerers', 'stone'], dtype=object)
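The vocabulary above contains both `sorcerer` and `sorcerers`. A quick way to see where they come from is to run the vectorizer's analyzer on the two spellings. This sketch assumes the same settings as the `tf` cell and uses `build_analyzer()`, which returns the preprocessing and tokenization callable; `analyzer`, `tokens_apostrophe`, and `tokens_plain` are illustrative names.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

analyzer = TfidfVectorizer(analyzer="word", ngram_range=(1, 1),
                           stop_words="english").build_analyzer()

# The default token pattern keeps runs of two or more word characters,
# so the apostrophe splits "Sorcerer's" and the lone "s" is dropped.
tokens_apostrophe = analyzer("Great Sorcerer's of NY")
tokens_plain = analyzer("Great Sorcerers of NY")

print(tokens_apostrophe)  # ['great', 'sorcerer', 'ny']
print(tokens_plain)       # ['great', 'sorcerers', 'ny']
```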

In [83]:
pd.DataFrame(data=tfidf.toarray(), index=docs, columns=tf.get_feature_names_out())

| | amazon | chamber | den | great | harry | ny | potter | secrets | sorcerer | sorcerers | stone |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Harry Potter and the Sorcerer's Stone | 0.0 | 0.0 | 0.0 | 0.0 | 0.469515 | 0.0 | 0.469515 | 0.0 | 0.469515 | 0.0 | 0.581951 |
| Harry Potter and the Chamber of Secrets | 0.0 | 0.581951 | 0.0 | 0.0 | 0.469515 | 0.0 | 0.469515 | 0.469515 | 0.0 | 0.0 | 0.0 |
| The Sorcerer's Den | 0.0 | 0.0 | 0.778283 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.627914 | 0.0 | 0.0 |
| Great Sorcerers of NY | 0.0 | 0.0 | 0.0 | 0.495524 | 0.0 | 0.614189 | 0.0 | 0.0 | 0.0 | 0.614189 | 0.0 |
| Great Secrets of Amazon | 0.659118 | 0.0 | 0.0 | 0.531772 | 0.0 | 0.0 | 0.0 | 0.531772 | 0.0 | 0.0 | 0.0 |
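The weights in the first row can be reproduced by hand. The sketch below assumes scikit-learn's default TF-IDF settings, which the `tf` cell does not override: smoothed IDF, `idf = ln((1 + n) / (1 + df)) + 1`, raw term counts, and L2 normalization of each row. The document frequencies in `df` are read off the five titles; `n_docs`, `df`, `idf`, `raw`, and `weights` are illustrative names.

```python
import numpy as np

n_docs = 5

# Number of titles containing each term of "Harry Potter and the Sorcerer's Stone"
df = {"harry": 2, "potter": 2, "sorcerer": 2, "stone": 1}

# Smoothed inverse document frequency
idf = {term: np.log((1 + n_docs) / (1 + count)) + 1 for term, count in df.items()}

# Each term occurs once in the title, so tf = 1 and the raw weight equals the idf
raw = np.array(list(idf.values()))

# L2-normalize the row so that its Euclidean length is 1
weights = raw / np.linalg.norm(raw)

for term, weight in zip(df, weights):
    print(f"{term}: {weight:.6f}")
```

The three terms shared with another title come out near `0.469515` and the rarer `stone` near `0.581951`, matching the first row of the table.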


In [84]:
cosine_sim = cosine_similarity(tfidf, tfidf)
cosine_sim

array([[1.        , 0.4408883 , 0.29481481, 0.        , 0.        ],
       [0.4408883 , 1.        , 0.        , 0.        , 0.24967495],
       [0.29481481, 0.        , 1.        , 0.        , 0.        ],
       [0.        , 0.        , 0.        , 1.        , 0.2635058 ],
       [0.        , 0.24967495, 0.        , 0.2635058 , 1.        ]])

In [85]:
[
    "Harry Potter and the Sorcerer's Stone", 
    "Harry Potter and the Chamber of Secrets", 
    "The Sorcerer's Den",
    "Great Sorcerers of NY", 
    "Great Secrets of Amazon", 
]

["Harry Potter and the Sorcerer's Stone",
 'Harry Potter and the Chamber of Secrets',
 "The Sorcerer's Den",
 'Great Sorcerers of NY',
 'Great Secrets of Amazon']

In [94]:
cosine_sim = cosine_similarity(tfidf, tf.transform(["Harry Secrets Sorcerers"]))
cosine_sim

array([[0.24967495],
       [0.49934989],
       [0.        ],
       [0.404823  ],
       [0.28278173]])

In [95]:
np.argsort(cosine_sim.flatten())

array([2, 0, 4, 3, 1], dtype=int64)

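- `argpartition` does not fully sort the array. It performs a partial sort so that the *kth* element lands in the position it would occupy in a sorted array.
- All elements smaller than the *kth* element are placed before it and all larger elements after it; the order within each side is not guaranteed.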

In [101]:
np.argpartition(cosine_sim.flatten(),-3)

array([2, 0, 4, 3, 1], dtype=int64)

In [102]:
list(reversed((np.argsort(cosine_sim.flatten())[-3:])))

[1, 3, 4]

In [103]:
list(reversed((np.argpartition(cosine_sim.flatten(),-3)[-3:])))

[1, 3, 4]
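Because `argpartition` does not guarantee any order within the top-*k* slice, the match with `argsort` above is not something to rely on in general. A minimal sketch of a top-*k* helper that partitions first and then sorts only the *k* selected entries is shown below; `top_k_indices` is a hypothetical helper name, and the scores are copied from the `cosine_sim` output above.

```python
import numpy as np


def top_k_indices(scores, k):
    """Return the indices of the k largest scores, best first."""
    scores = np.asarray(scores).ravel()
    # Partial sort: the last k positions hold the k largest scores, in any order
    candidates = np.argpartition(scores, -k)[-k:]
    # Fully sort only those k candidates, in descending order of score
    order = np.argsort(scores[candidates])[::-1]
    return candidates[order].tolist()


scores = np.array([0.24967495, 0.49934989, 0.0, 0.404823, 0.28278173])
print(top_k_indices(scores, 3))  # [1, 3, 4]
```

For a handful of documents a plain `argsort` is just as good; the partition step only pays off when the score array is large and *k* is small.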