# Information Retrieval System

Use cases:<br>
1. Search engines
2. Document retrieval systems
3. Passage retrieval systems
4. Question answering systems

Paragraphs from Wikipedia pages on the following topics are used as documents:<br>
1. Football
2. Computer
3. Car
4. Law

Ranked results for sample queries are shown at the bottom.

In [32]:
Doc1 = ['''Football is a family of team sports that involve, to varying degrees, kicking a ball to score a goal. Unqualified, the word football normally means the form of football that is the most popular where the word is used. Sports commonly called football include association football (known as soccer in some countries); gridiron football (specifically American football or Canadian football); Australian rules football; rugby football (either rugby union or rugby league); and Gaelic football.[1][2] These various forms of football share to varying extent common origins and are known as football codes.''']

In [33]:
Doc2 = ['''A computer is a machine that can be instructed to carry out sequences of arithmetic or logical operations automatically via computer programming. Modern computers have the ability to follow generalized sets of operations, called programs. These programs enable computers to perform an extremely wide range of tasks. A "complete" computer including the hardware, the operating system (main software), and peripheral equipment required and used for "full" operation can be referred to as a computer system. This term may as well be used for a group of computers that are connected and work together, in particular a computer network or computer cluster.''']

In [34]:
Doc3 = ['''Cars came into global use during the 20th century, and developed economies depend on them. The year 1886 is regarded as the birth year of the modern car when German inventor Karl Benz patented his Benz Patent-Motorwagen. Cars became widely available in the early 20th century. One of the first cars accessible to the masses was the 1908 Model T, an American car manufactured by the Ford Motor Company. Cars were rapidly adopted in the US, where they replaced animal-drawn carriages and carts, but took much longer to be accepted in Western Europe and other parts of the world.[citation needed]''']

In [35]:
Doc4 = ['''Law commonly refers to a system of rules created and enforced through social or governmental institutions to regulate behavior,[2] with its precise definition a matter of longstanding debate.[3][4][5] It has been variously described as a science[6][7] and the art of justice.[8][9][10] State-enforced laws can be made by a group legislature or by a single legislator, resulting in statutes; by the executive through decrees and regulations; or established by judges through precedent, usually in common law jurisdictions. Private individuals may create legally binding contracts, including arbitration agreements that adopt alternative ways of resolving disputes to standard court litigation. The creation of laws themselves may be influenced by a constitution, written or tacit, and the rights encoded therein. The law shapes politics, economics, history and society in various ways and serves as a mediator of relations between people.''']

In [36]:
import re
from nltk.corpus import stopwords

In [6]:
stp = stopwords.words('english')

In [7]:
def preprocess(text):
    text = text.lower()
    text = re.sub(r'[^a-zA-Z0-9\s]', ' ', text)
    #text = re.sub(r'[^\w\s]+',' ', text)
    text = ' '.join([word for word in text.split() if word not in stp])
    return text
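
The cleaning steps in `preprocess` can be tried without downloading the NLTK corpus. The sketch below is illustrative only: `toy_stopwords` and `preprocess_demo` are hypothetical names, and the stopword set is a small stand-in for `stopwords.words('english')`, which is much longer.

```python
import re

# Hypothetical stand-in for stopwords.words('english'); the real list is much longer.
toy_stopwords = {"is", "a", "of", "the", "that", "to"}

def preprocess_demo(text, stop_words=toy_stopwords):
    # Lowercase first so the stopword comparison is case-insensitive.
    text = text.lower()
    # Replace every character that is not a letter, digit, or whitespace with a space.
    text = re.sub(r'[^a-z0-9\s]', ' ', text)
    # Drop stopwords and collapse repeated whitespace.
    return ' '.join(word for word in text.split() if word not in stop_words)

cleaned = preprocess_demo("Football is a family of team sports!")
print(cleaned)  # football family team sports
```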

In [8]:
path = r'E:\GoogleNews-vectors-negative300.bin'

In [9]:
import gensim
import numpy as np

In [10]:
w2vec = gensim.models.KeyedVectors.load_word2vec_format(path, binary=True)

In [11]:
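def get_embeddings(word):
    # `word in w2vec` works across gensim versions; the `.vocab`
    # attribute was removed in gensim 4.0.
    if word in w2vec:
        return w2vec[word]
    else:
        # Out-of-vocabulary words fall back to a zero vector.
        return np.zeros(w2vec.vector_size)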

In [12]:
# Represent each document as the average of its word embeddings

In [37]:
docs = Doc1+Doc2+Doc3+Doc4

In [38]:
out_dict = {}

In [39]:
import nltk

In [40]:
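for sent in docs:
    # Clean and tokenize the document, then average its word vectors.
    tokens = nltk.word_tokenize(preprocess(sent))
    average_vector = np.mean(np.array([get_embeddings(word) for word in tokens]), axis=0)
    # Key each document vector by the original document text.
    out_dict[sent] = average_vector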
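
To see what averaging does when some words are out of vocabulary, here is a small sketch with made-up 3-dimensional vectors. `toy_vectors`, `toy_embedding`, and `average_vector` are hypothetical names for illustration; the real model uses 300 dimensions.

```python
import numpy as np

# Hypothetical 3-dimensional embeddings; not taken from the word2vec model.
toy_vectors = {
    "ball": np.array([1.0, 0.0, 2.0]),
    "goal": np.array([3.0, 2.0, 0.0]),
}

def toy_embedding(word, dim=3):
    # Same fallback idea as get_embeddings: unknown words map to a zero vector.
    return toy_vectors.get(word, np.zeros(dim))

def average_vector(tokens, dim=3):
    # Stack one vector per token and take the element-wise mean.
    return np.mean(np.array([toy_embedding(t, dim) for t in tokens]), axis=0)

known_only = average_vector(["ball", "goal"])
with_unknown = average_vector(["ball", "goal", "xyzzy", "qwerty"])
print(known_only)    # [2. 1. 1.]
print(with_unknown)  # [1.  0.5 0.5]
```

Zero vectors shrink the mean but do not change its direction, so as long as at least one token is in the vocabulary, cosine similarity against the averaged vector is unaffected by the unknown words.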

In [41]:
# Get similarity between query and document vectors

In [42]:
import scipy

In [43]:
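def get_similarity(query, doc):
    # Cosine similarity; 0.0 if either vector has zero norm (e.g. all words unknown).
    denom = np.linalg.norm(query) * np.linalg.norm(doc)
    if denom == 0:
        return 0.0
    return np.dot(query, doc) / denom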
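
As a quick sanity check of the cosine formula, the sketch below evaluates it on hand-picked 2-dimensional vectors. The function name `cosine_similarity` and the vectors are illustrative, not part of the notebook.

```python
import numpy as np

def cosine_similarity(u, v):
    # Dot product divided by the product of the Euclidean norms.
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

a = np.array([1.0, 0.0])
same_direction = cosine_similarity(a, np.array([3.0, 0.0]))  # parallel vectors
orthogonal = cosine_similarity(a, np.array([0.0, 2.0]))      # perpendicular vectors
forty_five = cosine_similarity(a, np.array([1.0, 1.0]))      # 45 degrees apart

print(same_direction, orthogonal, round(forty_five, 4))  # 1.0 0.0 0.7071
```

The score depends only on the angle between the vectors, not on their lengths, which is why a long document and a one-word query can still be compared.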

In [44]:
# Function to generate ranked documents

In [60]:
def rank_text(query):
    query_vector = np.mean(np.array([get_embeddings(word) for word in nltk.word_tokenize(preprocess(query))]),axis=0)
    rank = []
    for k,v in out_dict.items():
        rank.append((k,get_similarity(query_vector, v)))
    rank = sorted(rank, key=lambda x:x[1], reverse=True)
    return rank
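
The ranking step can be exercised without loading the word2vec model. This is a minimal sketch under stated assumptions: `toy_docs` and `rank_toy` are hypothetical, and the 2-dimensional vectors are invented stand-ins for the averaged document embeddings in `out_dict`.

```python
import numpy as np

# Hypothetical 2-dimensional document vectors standing in for out_dict.
toy_docs = {
    "football paragraph": np.array([0.9, 0.1]),
    "computer paragraph": np.array([0.1, 0.9]),
    "car paragraph": np.array([0.6, 0.6]),
}

def rank_toy(query_vector, doc_vectors):
    scores = []
    for text, vec in doc_vectors.items():
        # Cosine similarity between the query and this document.
        sim = np.dot(query_vector, vec) / (np.linalg.norm(query_vector) * np.linalg.norm(vec))
        scores.append((text, float(sim)))
    # Highest similarity first.
    return sorted(scores, key=lambda pair: pair[1], reverse=True)

ranking = rank_toy(np.array([1.0, 0.0]), toy_docs)
for text, score in ranking:
    print(f"{score:.3f}  {text}")
```

With this query vector the football paragraph ranks first, the car paragraph second, and the computer paragraph last, mirroring how `rank_text` returns all documents ordered by score rather than only the best match.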

In [61]:
rank_text('play')

[('Football is a family of team sports that involve, to varying degrees, kicking a ball to score a goal. Unqualified, the word football normally means the form of football that is the most popular where the word is used. Sports commonly called football include association football (known as soccer in some countries); gridiron football (specifically American football or Canadian football); Australian rules football; rugby football (either rugby union or rugby league); and Gaelic football.[1][2] These various forms of football share to varying extent common origins and are known as football codes.',
  0.4023621),
 ('A computer is a machine that can be instructed to carry out sequences of arithmetic or logical operations automatically via computer programming. Modern computers have the ability to follow generalized sets of operations, called programs. These programs enable computers to perform an extremely wide range of tasks. A "complete" computer including the hardware, the operating sy

In [66]:
rank_text('disk')

[('A computer is a machine that can be instructed to carry out sequences of arithmetic or logical operations automatically via computer programming. Modern computers have the ability to follow generalized sets of operations, called programs. These programs enable computers to perform an extremely wide range of tasks. A "complete" computer including the hardware, the operating system (main software), and peripheral equipment required and used for "full" operation can be referred to as a computer system. This term may as well be used for a group of computers that are connected and work together, in particular a computer network or computer cluster.',
  0.36218327),
 ('Cars came into global use during the 20th century, and developed economies depend on them. The year 1886 is regarded as the birth year of the modern car when German inventor Karl Benz patented his Benz Patent-Motorwagen. Cars became widely available in the early 20th century. One of the first cars accessible to the masses w

In [73]:
rank_text('engine')

[('Cars came into global use during the 20th century, and developed economies depend on them. The year 1886 is regarded as the birth year of the modern car when German inventor Karl Benz patented his Benz Patent-Motorwagen. Cars became widely available in the early 20th century. One of the first cars accessible to the masses was the 1908 Model T, an American car manufactured by the Ford Motor Company. Cars were rapidly adopted in the US, where they replaced animal-drawn carriages and carts, but took much longer to be accepted in Western Europe and other parts of the world.[citation needed]',
  0.3748004376327669),
 ('A computer is a machine that can be instructed to carry out sequences of arithmetic or logical operations automatically via computer programming. Modern computers have the ability to follow generalized sets of operations, called programs. These programs enable computers to perform an extremely wide range of tasks. A "complete" computer including the hardware, the operating

In [74]:
rank_text('judge')

[('Law commonly refers to a system of rules created and enforced through social or governmental institutions to regulate behavior,[2] with its precise definition a matter of longstanding debate.[3][4][5] It has been variously described as a science[6][7] and the art of justice.[8][9][10] State-enforced laws can be made by a group legislature or by a single legislator, resulting in statutes; by the executive through decrees and regulations; or established by judges through precedent, usually in common law jurisdictions. Private individuals may create legally binding contracts, including arbitration agreements that adopt alternative ways of resolving disputes to standard court litigation. The creation of laws themselves may be influenced by a constitution, written or tacit, and the rights encoded therein. The law shapes politics, economics, history and society in various ways and serves as a mediator of relations between people.',
  0.3329420774624741),
 ('Cars came into global use durin