# Recommender Setup

In [39]:
import pickle
import numpy as np
import pandas as pd
from gensim.similarities import Similarity
from sklearn.metrics.pairwise import cosine_similarity

We use Gensim's ```Similarity``` module to get cosine similarity between documents' topics. See the [documentation](https://radimrehurek.com/gensim/similarities/docsim.html).

Let's create and save the ```Similarity``` object, then demonstrate its functionality. We will load this object in our recommender app.

In [3]:
# Use a context manager so the file handle is closed after loading
with open("../models/avatar_model.pickle", "rb") as f:
    model_info = pickle.load(f)
model = model_info["model"]
dictionary = model_info["dict"]
corpus = model_info["corpus"]

In [9]:
# 'temp' is the filename prefix Gensim uses for the index shards it writes to disk
index = Similarity('temp', model[corpus], num_features=model.num_topics)
index.save("../models/sim_index.pickle")

```Similarity``` takes advantage of sparseness to compute cosine similarity between documents, returning results as 32-bit floats. It lets us quickly get the pairwise cosine similarities between all documents in a corpus, or the similarity of any new document (given its topic distribution) to every existing document in the corpus.
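To make the idea concrete, here is a minimal NumPy sketch of a cosine-similarity index. It is not Gensim's implementation: the helper names `build_index` and `query_index` and the toy corpus are assumptions made up for illustration. The sketch normalizes each document vector once, so a query reduces to a single matrix-vector product.

```python
import numpy as np

def build_index(doc_vectors):
    """Return the document vectors scaled to unit length (hypothetical helper)."""
    arr = np.asarray(doc_vectors, dtype=float)
    norms = np.linalg.norm(arr, axis=1, keepdims=True)
    norms[norms == 0.0] = 1.0  # leave all-zero documents as zero rows
    return arr / norms

def query_index(unit_rows, query):
    """Cosine similarity of one query vector against every indexed document."""
    q = np.asarray(query, dtype=float)
    q_norm = np.linalg.norm(q)
    if q_norm == 0.0:
        return np.zeros(unit_rows.shape[0])
    return unit_rows @ (q / q_norm)

# Toy corpus of three documents over three topics (made-up weights)
toy_corpus = [[3.0, 4.0, 0.0], [4.0, 0.0, 3.0], [0.0, 0.0, 0.0]]
toy_index = build_index(toy_corpus)
sims = query_index(toy_index, toy_corpus[0])
print(sims)
```

A document compared with itself scores 1, and a document with no topic weight at all scores 0 against everything, which matches the pattern in the real output further down.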

Get the similarities between document 0 and every document in the corpus (including itself):

In [21]:
index.similarity_by_id(0)

array([0.99999994, 0.32554293, 0.5997491 , ..., 0.7460016 , 0.07849149,
       0.        ], dtype=float32)

Let's verify the accuracy against scikit-learn, starting with the topic distributions of documents 0 and 1.

In [30]:
doc0_top = model[corpus[0]]
doc1_top = model[corpus[1]]
print(doc0_top, doc1_top)

[(1, 0.50612485), (7, 0.14371134), (8, 0.34252024)] [(1, 0.23142198), (2, 0.040100906), (3, 0.5694259), (7, 0.07503654), (9, 0.08157011)]


Convert these into dense array form.

In [32]:
doc0_arr = np.zeros(model.num_topics)
for topic, weight in doc0_top:
    doc0_arr[topic] = weight
doc1_arr = np.zeros(model.num_topics)
for topic, weight in doc1_top:
    doc1_arr[topic] = weight
print(doc0_arr, doc1_arr)

[0.         0.50612485 0.         0.         0.         0.
 0.         0.14371134 0.34252024 0.         0.        ] [0.         0.23142198 0.04010091 0.56942588 0.         0.
 0.         0.07503654 0.         0.08157011 0.        ]
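The two loops above do the same thing, so the conversion could be factored into a small helper. This is only a sketch: `to_dense` is a hypothetical name, and the example weights are made up. Gensim's `matutils.sparse2full` may serve the same purpose if you would rather not write your own.

```python
import numpy as np

def to_dense(sparse_doc, length):
    """Expand a list of (topic, weight) pairs into a dense array of the given length."""
    dense = np.zeros(length)
    for topic, weight in sparse_doc:
        dense[topic] = weight
    return dense

# Made-up sparse document over 11 topics
example = to_dense([(1, 0.5), (7, 0.25)], 11)
print(example)
```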


Compute the cosine similarity, which should be approximately 0.3255 (the second entry of the ```similarity_by_id(0)``` output).

In [35]:
cosine_similarity([doc0_arr, doc1_arr])

array([[1.      , 0.325548],
       [0.325548, 1.      ]])

As expected. Note that we can also pass a document's topic distribution to ```index``` and get its similarity to each document in the corpus.

In [36]:
index[doc0_top]

array([1.        , 0.32554862, 0.599726  , ..., 0.74596924, 0.07848788,
       0.        ], dtype=float32)

In [38]:
# A theoretical document with 50% weight to topic 0 and 50% to topic 1
index[[(0, 0.5), (1, 0.5)]]

array([0.57003903, 0.26146775, 0.04861848, ..., 0.19564898, 0.5913024 ,
       0.69760233], dtype=float32)

We'll load the ```sim_index.pickle``` file in our recommender app and make recommendations by passing in the topic distribution of the user's input text.
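Once the app has a similarity array for the user's text, it still needs to pick the best matches. The sketch below shows one way to do that with NumPy. `top_n_similar` is a hypothetical helper, and the toy scores are made up. It assumes the array was obtained roughly as shown earlier, for example from `index[...]` applied to the topic distribution of the tokenized input.

```python
import numpy as np

def top_n_similar(similarities, n=5):
    """Return (document position, score) pairs for the n highest similarities."""
    sims = np.asarray(similarities)
    n = min(n, sims.shape[0])
    # Negate so that argsort orders from most to least similar
    order = np.argsort(-sims, kind="stable")[:n]
    return [(int(i), float(sims[i])) for i in order]

# Made-up similarity scores for a six-document corpus
toy = np.array([0.57, 0.26, 0.05, 0.20, 0.59, 0.70], dtype=np.float32)
print(top_n_similar(toy, n=3))
```

The returned positions refer to rows of the corpus, so they can be used to look up whatever metadata is stored in the same order.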

## Extra Setup

Let's save some metadata for the works in our database so we can access it without loading the entire corpus. In particular, we want to avoid loading the full text; the app can simply display the first 100 characters along with the title, authors, and a link to the work.

In [44]:
fic_df = pd.read_pickle("../data/avatar_fics_processed.pickle")

In [47]:
info_list = []

# save the first 100 characters, the title, the authors, and the work ID of every fic in the database
for i in range(len(fic_df)):
    doc_info = {}
    the_fic = fic_df.iloc[i]
    doc_info["starting_text"] = the_fic["text"][:100]
    doc_info["title"] = the_fic["title"]
    doc_info["all_authors"] = the_fic["all_authors"]
    doc_info["work_id"] = the_fic["work_id"]
    info_list.append(doc_info)

#pickle.dump(info_list, open("../data/fic_info.pickle", "wb"))
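As a rough illustration of how the app might use this metadata, the sketch below formats a few ranked matches for display. `describe_matches` and the toy records are hypothetical, and the code assumes ```all_authors``` is either a single string or a list of strings, which the data above does not confirm.

```python
def describe_matches(info_list, ranked):
    """Turn (doc_id, score) pairs into short display strings using saved metadata."""
    lines = []
    for doc_id, score in ranked:
        info = info_list[doc_id]
        authors = info["all_authors"]
        # Assumption: all_authors is either a single string or a list of strings
        if not isinstance(authors, str):
            authors = ", ".join(authors)
        lines.append(
            f'{info["title"]} by {authors} ({score:.2f}): {info["starting_text"]}...'
        )
    return lines

# Made-up records in the same shape as the entries of info_list
toy_info = [
    {"starting_text": "Once upon a time", "title": "First",
     "all_authors": ["a", "b"], "work_id": 1},
    {"starting_text": "Long ago", "title": "Second",
     "all_authors": "c", "work_id": 2},
]
lines = describe_matches(toy_info, [(1, 0.7), (0, 0.5)])
for line in lines:
    print(line)
```

Because the list is built in the same order as ```fic_df```, a position in the similarity array should map directly to the matching entry, provided the corpus was built from the same rows in the same order.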