# Profile-Based Information Retrieval System using Word2Vec
![alt text](https://liaison2020.eu/wp-content/uploads/2020/01/partner_logo_UPM_.jpg "Logo UPM")
<strong>Assignment for the course of Information Retrieval, Extraction and Integration @ Universidad Politécnica de Madrid \
Developed by Tom van Knippenberg &copy;</strong> 




The idea of the information retrieval system is as follows.
User 1 prefers sports, so the profile vector consists of "Tennis", "Football", and "Rugby". User 2 prefers cars, so the profile vector may, for example, consist of "BMW", "Engine", and "Wheels". A new document arrives: "Tennis is healthy". Its score for user 1 is 0.9 and its score for user 2 is 0. The document is sent to user 1 only, because that score exceeds the threshold (0.5 in this example).
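
The routing decision in this example can be sketched in a few lines. The scores and the threshold are the illustrative values from the paragraph above; the names `scores`, `THRESHOLD`, and `recipients` are hypothetical and do not appear in the notebook.

```python
# Illustrative scores and threshold taken from the example in the text.
scores = {"user1": 0.9, "user2": 0.0}
THRESHOLD = 0.5


def recipients(scores, threshold):
    """Return the users whose score is strictly above the threshold."""
    return [user for user, score in scores.items() if score > threshold]


print(recipients(scores, THRESHOLD))  # ['user1']
```

The comparison is strict, matching step 4 in the list of steps (score > threshold).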

The steps that are part of the information retrieval system are:
1. Build the profile vector
2. Transform each new incoming document into keywords
3. Calculate the score of the prepared document against the profile
4. If score > threshold, recommend the document to the user
5. Let the user provide feedback
6. Adapt the profile according to the feedback provided


In the next sections, each step will be explained.


In [None]:
!pip install numpy
!pip install gensim
!pip install nltk


In [1]:
import numpy as np

import gensim
from gensim.test.utils import common_texts
from gensim.models import Word2Vec
import gensim.downloader as api

from nltk.tokenize import wordpunct_tokenize
from nltk.corpus import stopwords
import nltk
nltk.download('stopwords')

import random
from collections import Counter

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Tomva\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


## Profile Vectors
The profile vector of each user is built by extending the user's stated preferences with the words most similar to them according to our vector space model.

In this example we use the pretrained GloVe model `glove-wiki-gigaword-100`, loaded through the gensim downloader.
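
To illustrate the expansion step without downloading the pretrained model, here is a minimal sketch on invented 2-D vectors. It ranks candidate words by cosine similarity to the averaged seed vector, which is assumed to be in the same spirit as the `most_similar` call used in the notebook. The names `toy_vectors`, `unit`, and `expand_profile` are hypothetical.

```python
import math

# Toy 2-D embeddings, invented for illustration (not taken from GloVe).
toy_vectors = {
    "sport": (1.0, 0.1),
    "football": (0.9, 0.2),
    "tennis": (0.8, 0.3),
    "engine": (0.1, 1.0),
    "wheels": (0.2, 0.9),
}


def unit(v):
    """Scale a vector to length 1."""
    norm = math.sqrt(sum(x * x for x in v))
    return tuple(x / norm for x in v)


def expand_profile(seeds, vectors, n):
    """Return the n words closest to the mean seed vector, followed by the seeds."""
    seeds = [s.lower() for s in seeds]
    units = [unit(vectors[s]) for s in seeds]
    mean = unit(tuple(sum(c) / len(units) for c in zip(*units)))
    # For unit vectors, cosine similarity is simply the dot product.
    ranked = sorted(
        (w for w in vectors if w not in seeds),
        key=lambda w: sum(a * b for a, b in zip(mean, unit(vectors[w]))),
        reverse=True,
    )
    return ranked[:n] + seeds


print(expand_profile(["Sport"], toy_vectors, n=2))  # ['football', 'tennis', 'sport']
```

As in the notebook's `profile_vectors`, the lowercased seed words are appended after their neighbours, so a profile built with `n` neighbours contains `n` plus the number of seed words.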

In [2]:

def profile_initialization_model():
    return api.load('glove-wiki-gigaword-100')


def profile_vectors(user_list, model, n):
    # Function to build profile vectors
    profiles = dict()
    for num, user in enumerate(user_list):
        user_proc = [word.lower() for word in user]
        words = model.most_similar(positive=user_proc, topn=n)
        words = [t[0] for t in words]
        words.extend([word.lower() for word in user])
        profiles[num+1] = words
    return profiles



In [3]:
# Load the model for Word2Vec evaluation
model = profile_initialization_model()

In [4]:
# Initialize User Preferences
user1 = ["Sport"]
user2 = ["Music", "Films"]
user3 = ["Politics", "Sport"]
user4 = ["Cars"]
user5 = ["Cars", "Sport"]

user_list = [user1, user2, user3, user4, user5]
profiles = profile_vectors(user_list, model, n=10)
db_profile = {key: Counter(value) for key, value in profiles.items()}

# Show results of profile initialization
profiles

{1: ['sports',
  'cycling',
  'soccer',
  'racing',
  'sporting',
  'competition',
  'football',
  'compete',
  'competitive',
  'professional',
  'sport'],
 2: ['film',
  'movies',
  'musical',
  'movie',
  'productions',
  'songs',
  'soundtrack',
  'cinema',
  'comedy',
  'artists',
  'music',
  'films'],
 3: ['sports',
  'culture',
  'history',
  'political',
  'social',
  'rather',
  'well',
  'popular',
  'life',
  'focus',
  'politics',
  'sport'],
 4: ['vehicles',
  'trucks',
  'car',
  'automobiles',
  'motorcycles',
  'vehicle',
  'buses',
  'truck',
  'vans',
  'bicycles',
  'cars'],
 5: ['car',
  'vehicles',
  'vehicle',
  'trucks',
  'sports',
  'racing',
  'driving',
  'motorcycles',
  'models',
  'suv',
  'cars',
  'sport']}

## Document Preparation
To process documents we use the function provided in the lecture. It performs the necessary pre-processing: splitting the text into tokens, removing stopwords, converting to lowercase, and removing tokens of fewer than three characters (which also discards most punctuation tokens).
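
The effect of these pre-processing steps can be tried without NLTK using the sketch below. It is an approximation under stated assumptions: the regular expression is meant to mimic a word/punctuation tokenizer, and `TOY_STOPWORDS` is a small hand-picked set rather than NLTK's English stopword list. The name `normalize_doc_sketch` is hypothetical.

```python
import re

# Small hand-picked stopword set for illustration only;
# the notebook itself uses NLTK's English stopword list.
TOY_STOPWORDS = {"is", "the", "to", "and", "a", "of"}


def normalize_doc_sketch(doc, stopset=TOY_STOPWORDS):
    """Tokenize, lowercase, and drop stopwords and tokens shorter than 3 characters."""
    # Split into runs of word characters or runs of punctuation.
    tokens = re.findall(r"\w+|[^\w\s]+", doc)
    return [t.lower() for t in tokens if t.lower() not in stopset and len(t) > 2]


print(normalize_doc_sketch("Tennis is healthy, so go play!"))
# ['tennis', 'healthy', 'play']
```

Note that "so" and "go" disappear because of the length filter, not because of the stopword set, and the same filter removes the single-character punctuation tokens.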

In [5]:
def normalize_doc(doc): 
    # Function to process new incoming documents
    stopset = set(stopwords.words('english'))
    tokens = wordpunct_tokenize(doc)
    clean = [token.lower() for token in tokens if token.lower() not in stopset and len(token) > 2]
    return clean



The following documents were retrieved from the New York Times website on 14 March 2021. They serve as candidate documents for the profiles.

In [6]:
# Sample documents from NY Times
corpus = [
    "N.C.A.A. Tournament: Things to Know Going Into Selection Sunday. The field of 68 teams for the men’s N.C.A.A. tournament will include 37 at-large bids and 31 automatic qualifiers.",
    "Women’s Basketball Makes Room for New Stars, and New Contenders. The usual elites are still great, but the rest of the college field has a real shot to win the championship this year. Star power isn’t concentrated at the top anymore.",
    "The Pandemic Drove People to Tennis and Golf. Will They Keep Playing?",
    "How Democrats Hope to Press Their Advantage on the Stimulus. Our political reporter lays out Democrats’ messaging goals for the relief bill as they eye the midterms.",
    "After Capitol Riot, Lawmakers Chafe at Security Measures. There is bipartisan interest in removing fencing around the Capitol and dismissing the National Guard troops deployed there, but law enforcement officials fear new threats.",
    "‘We’ll Be Back,’ Broadway Says, on Shutdown Anniversary. A pop-up performance in Times Square on Friday, featuring stars like André De Shields, was full of excitement as reopenings may be on the horizon.",
    "Meet the Best New Artist Grammy Nominees",
    "Electric Cars Are Coming. How Long Until They Rule the Road? A new car sold today can last a decade or two before retiring. This “fleet turnover” poses a major challenge for climate policy.",
    "Volvo Plans to Sell Only Electric Cars by 2030. The Swedish company would phase out internal combustion engine vehicles faster than other automakers."
]

## Scoring Documents
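Documents are scored with cosine similarity: every profile word is compared with every document word, and the pairwise similarities are combined by taking either their average (`method='avg'`) or their maximum. The function `evaluate_cosine_sim` in the next cell implements this.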
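
The difference between averaging and taking the maximum can be seen on a toy example. The sketch below uses invented 2-D vectors instead of the pretrained model; `toy_vectors`, `cosine`, and `pairwise_score` are hypothetical names and are separate from the notebook's own function.

```python
import math

# Toy 2-D vectors, invented for illustration.
toy_vectors = {
    "tennis": (1.0, 0.0),
    "healthy": (0.6, 0.8),
    "sport": (1.0, 0.0),
    "engine": (0.0, 1.0),
}


def cosine(u, v):
    """Cosine similarity: dot product divided by the product of the norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def pairwise_score(doc, profile, vectors, method="avg"):
    """Combine all profile-word/document-word similarities by average or maximum."""
    scores = [cosine(vectors[p], vectors[d]) for p in profile for d in doc]
    return sum(scores) / len(scores) if method == "avg" else max(scores)


doc = ["tennis", "healthy"]
profile = ["sport", "engine"]
print(pairwise_score(doc, profile, toy_vectors, "avg"))  # about 0.6
print(pairwise_score(doc, profile, toy_vectors, "max"))  # about 1.0
```

With the maximum, a single strongly matching word pair is enough to lift a document above the threshold, whereas the average is pulled down by every unrelated pair.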

In [7]:
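# Score a document against a profile with pairwise cosine similarity
def evaluate_cosine_sim(doc, user, model, method='avg'):
    # Returns (score, max_word): the average (method='avg') or the maximum
    # (any other value) of all pairwise similarities, and the word from
    # `user` that produced the highest similarity.
    user_low = [word.lower() for word in user]
    scores = []
    max_score = 0
    max_word = ""
    
    for word in user_low:
        for w2 in doc:
            score = model.similarity(word, w2)
            scores.append(score)
            # Remember the `user` word behind the best similarity so far
            if score > max_score:
                max_score = score
                max_word = word
    if method == 'avg':
        score = sum(scores) / len(scores)
    else:
        score = max(scores)
    return score, max_word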


## Information Retrieval System
This section describes the information retrieval system. The system normalizes the incoming documents and evaluates each one against the profile of the given user. Each document is first classified into one of the predefined categories. The document word that contributes most to that classification is then selected and, if the user marks the document as relevant, its most similar words are used to update the profile vector (see the next section).

In the example, user 1 (who prefers sports) goes through the corpus twice and both times shows interest only in the documents about cars. A single round of such feedback does not change the profile, but after the second round the profile vector changes.

In [8]:
def classify_doc(model, doc):
    categories = ["Sport", "Music", "Film", "Politics", "Cars", "Tech", "Science", "Food", "Business", "Books", "Health"]
    cat_vectors = profile_vectors([[i] for i in categories], model, n = 4)
    classification = []
    max_score = 0
    topic_word = ""
    for topic in cat_vectors.values():
        score, max_word = evaluate_cosine_sim(topic, doc, model, method='max')
        if score > max_score:
            max_score = score
            classification = topic
            topic_word = max_word
    return classification, topic_word


def profile_ir(corpus, model, profile, threshold=0.65, eps=0.1):
    pref = []
    docs = [normalize_doc(doc) for doc in corpus]
    for n, doc in enumerate(docs):
        show = 0 
        # Classify topic of document 
        topic, topic_word = classify_doc(model, doc)
        
        # epsilon greedy algorithm to promote exploration
        random_n = random.random()
        if random_n > eps:
            score, _ = evaluate_cosine_sim(doc, profile, model, method = 'max')
            if score > threshold:
                print("Show: '", corpus[n], "' to user based on scoring. \n")
                show = 1
        else:
            print("Show: '", corpus[n], "' to user based on exploration. \n")
            show = 1
        
        # When the document is shown, ask if relevant
        if show == 1:
            select = input("Do you like to show this article? (1 = yes, 0 = no) ")
            while select not in ['0', '1']:
                print("Select 0 or 1")
                select = input("Do you like to show this article? (1 = yes, 0 = no) ")
            print("\n")
            select = int(select)
            preferences = [word for word, _ in model.most_similar(topic_word)]
            if select == 1:
                pref.extend(preferences)
    return pref
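
The show/explore decision inside `profile_ir` follows an epsilon-greedy pattern. A minimal sketch of only that decision is given below, assuming the same default threshold and epsilon as the function above; `should_show` and its `rng` parameter are hypothetical names introduced for illustration.

```python
import random


def should_show(score, threshold=0.65, eps=0.1, rng=random):
    """Return (show, reason) for one document under epsilon-greedy selection."""
    # With probability eps, show the document regardless of its score.
    if rng.random() < eps:
        return True, "exploration"
    # Otherwise show it only when the score is above the threshold.
    if score > threshold:
        return True, "scoring"
    return False, None


rng = random.Random(0)  # seeded only to make the sketch repeatable
shown = sum(should_show(0.2, rng=rng)[0] for _ in range(1000))
print(shown)  # roughly a tenth of the low-scoring documents are shown
```

Exploration lets documents outside the current profile reach the user occasionally, which is what allows the profile to drift towards new topics when the user reacts to them.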



In [9]:
# Perform Information Retrieval for user 1
user_no = 1
pref_u1 = profile_ir(corpus, model, profiles[user_no])

Show: ' N.C.A.A. Tournament: Things to Know Going Into Selection Sunday. The field of 68 teams for the men’s N.C.A.A. tournament will include 37 at-large bids and 31 automatic qualifiers. ' to user based on scoring. 

Do you like to show this article? (1 = yes, 0 = no) 0


Show: ' Women’s Basketball Makes Room for New Stars, and New Contenders. The usual elites are still great, but the rest of the college field has a real shot to win the championship this year. Star power isn’t concentrated at the top anymore. ' to user based on scoring. 

Do you like to show this article? (1 = yes, 0 = no) 0


Show: ' The Pandemic Drove People to Tennis and Golf. Will They Keep Playing? ' to user based on scoring. 

Do you like to show this article? (1 = yes, 0 = no) 0


Show: ' Electric Cars Are Coming. How Long Until They Rule the Road? A new car sold today can last a decade or two before retiring. This “fleet turnover” poses a major challenge for climate policy. ' to user based on scoring. 

Do you

## New Profile Vector
The new profile vector consists of the words that best describe the documents the user marked as relevant, mixed with the user's initial preferences. We will see the effect of the previous choices on the profile and on which documents are considered relevant.
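
The counting mechanism behind the adaptation can be illustrated with a miniature profile. This is a sketch with hypothetical words and a hypothetical helper `top_words`; it relies on `Counter.most_common` ordering equal counts by first occurrence.

```python
from collections import Counter

# Hypothetical miniature profile and feedback words.
counts = Counter(["sports", "soccer", "football"])
feedback = ["car", "truck", "vehicle"]


def top_words(counter, k):
    """Return the k most frequent words."""
    return [word for word, _ in counter.most_common(k)]


counts.update(feedback)           # first round: every word now has count 1
after_one = top_words(counts, 3)  # ties keep first-seen order

counts.update(feedback)           # second round: feedback words reach count 2
after_two = top_words(counts, 3)

print(after_one)  # ['sports', 'soccer', 'football']
print(after_two)  # ['car', 'truck', 'vehicle']
```

After one round all words are tied, so the original words stay on top; only when the feedback words are counted a second time do they overtake the initial preferences.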

In [10]:

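def adapt_profile(profiles, pref, db_profile, user_no): 
    # Function that adapts the profile vector of a given user.
    # Note: dict.copy() is shallow, so the Counter of user_no is shared with
    # db_profile and the feedback counts accumulate across successive calls.
    new_db = db_profile.copy()
    new_db[user_no].update(pref)
    # Keep the ten most frequent words (ties keep first-seen order)
    new_profile = [word for word, _ in new_db[user_no].most_common(10)]
    new_profile = list(set(new_profile))
    return new_profile, new_db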


print("Old profile contains following words: ", profiles[user_no])

# Adapt profile vector
new_profile, new_db = adapt_profile(profiles, pref_u1, db_profile, user_no)
print("New profile contains following words: ", new_profile)

# Adapt profiles with new profile and is ready for new documents
profiles[user_no] = new_profile

Old profile contains following words:  ['sports', 'cycling', 'soccer', 'racing', 'sporting', 'competition', 'football', 'compete', 'competitive', 'professional', 'sport']
New profile contains following words:  ['competitive', 'competition', 'racing', 'sporting', 'sports', 'cycling', 'compete', 'professional', 'soccer', 'football']


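As can be seen, the initial preferences are largely preserved after one round of feedback: only 'sport' drops out, because the adapted profile is limited to ten words. Now the user again shows interest only in the documents about cars. See below.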

In [12]:
pref_u1 = profile_ir(corpus, model, profiles[user_no])

Show: ' N.C.A.A. Tournament: Things to Know Going Into Selection Sunday. The field of 68 teams for the men’s N.C.A.A. tournament will include 37 at-large bids and 31 automatic qualifiers. ' to user based on scoring. 

Do you like to show this article? (1 = yes, 0 = no) 0


Show: ' Women’s Basketball Makes Room for New Stars, and New Contenders. The usual elites are still great, but the rest of the college field has a real shot to win the championship this year. Star power isn’t concentrated at the top anymore. ' to user based on scoring. 

Do you like to show this article? (1 = yes, 0 = no) 0


Show: ' The Pandemic Drove People to Tennis and Golf. Will They Keep Playing? ' to user based on exploration. 

Do you like to show this article? (1 = yes, 0 = no) 0


Show: ' Electric Cars Are Coming. How Long Until They Rule the Road? A new car sold today can last a decade or two before retiring. This “fleet turnover” poses a major challenge for climate policy. ' to user based on scoring. 

Do

In [13]:
print("Old profile contains following words: ", profiles[user_no])

# Adapt profile vector
new_profile, new_db = adapt_profile(profiles, pref_u1, db_profile, user_no)
print("New profile contains following words: ", new_profile)

# Adapt profiles with new profile and is ready for new documents
profiles[user_no] = new_profile

Old profile contains following words:  ['competitive', 'competition', 'racing', 'sporting', 'sports', 'cycling', 'compete', 'professional', 'soccer', 'football']
New profile contains following words:  ['bus', 'parked', 'truck', 'vehicles', 'driver', 'cars', 'driving', 'motorcycle', 'vehicle', 'taxi']


The profile now consists of car-related keywords, because the user keeps choosing the car documents and ignores the sport documents. This example shows how a profile can change in response to the user's behaviour.