# Inverted Index

An **inverted index** is a data structure that, given a term, provides access to the list of documents that contain that term. In other words, it maps each word in a collection to the documents in which it appears.
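
As a minimal sketch of the idea, the snippet below builds an inverted index for three made-up one-line documents. The `toy_docs` list and the `build_index` helper are illustrative names, not part of this notebook.

```python
from collections import defaultdict

# Hypothetical toy corpus (not the notebook's data)
toy_docs = [
    "the cat sat",
    "the dog barked",
    "a cat and a dog",
]

def build_index(docs):
    """Map each term to the set of document numbers that contain it."""
    index = defaultdict(set)
    for doc_id, text in enumerate(docs):
        for term in text.split():
            index[term].add(doc_id)
    return dict(index)

toy_index = build_index(toy_docs)
print(sorted(toy_index["cat"]))  # [0, 2]
```

Looking up `"cat"` returns documents 0 and 2 without scanning any document text, which is the point of the structure.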

<img src="invertedindex.jpeg">

## Load libraries

In [1]:
import os
import nltk

In [2]:
cwd = os.getcwd()             # Get the current working directory
all_files = os.listdir(cwd)   # List all files in that directory
strtxt = '.txt'               # Extension of the files to index

# Characters to strip from the text (raw string so the backslash is literal)
punctuations = r'''!()-[]{};:'"\,<>./?@#$%^&*_~'''

In [3]:
documents = []
for txtfile in all_files:
    if txtfile.endswith(strtxt):          # only read the .txt files
        with open(txtfile, "r") as file:  # the file is closed automatically
            data = file.read().lower()    # convert all text to lower case
        # Keep only the characters that are not punctuation
        new_data = "".join(char for char in data if char not in punctuations)
        # With the sentence-ending punctuation removed, each file is expected
        # to tokenize as a single sentence, so it adds one entry to the list
        sent = nltk.sent_tokenize(new_data)
        documents += sent
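
The punctuation filter can also be written with `str.translate`, which deletes every listed character in one pass. This is a minimal sketch that assumes the same punctuation set as the notebook; `clean_text` and `remove_punct` are hypothetical names.

```python
# Same punctuation characters as the notebook's `punctuations` string
punctuations = '''!()-[]{};:'"\\,<>./?@#$%^&*_~'''

# Translation table that deletes each punctuation character
remove_punct = str.maketrans("", "", punctuations)

def clean_text(text):
    """Lower-case the text and strip punctuation (hypothetical helper)."""
    return text.lower().translate(remove_punct)

print(clean_text("Hello, World! It's 2-for-1."))  # hello world its 2for1
```

Deleting a hyphen joins the words on either side of it, which is likely why terms such as `differentdifferent` and `nonhappening` show up in the output. Replacing punctuation with a space instead of deleting it would keep such words apart.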


In [4]:
documents

['religion is a set in which collection of beliefs differentdifferent\ncultures and the views that are related with civilization at present so\nmany religions are there and all are explaining the origin of life and\norigin of universe',
 'science is the essentiality of education what we are watching with our\neyes happening nonhappening all include science various scientists\nworking on different topics help us to understand the science in our\nlives science is continuous evolutionary study each day something\nnew is determined',
 'land rover to built discovery sport in brazil and also opens new plant\nin america maruti launches the new version of alto k10 with or\nwithout automatic facility in the indian market',
 'entertainment includes so many things that are responsible for holding\nthe attention and interest of an audience it includes various things like\nentertainment news celebrities fashion movies music and television\nshows etc',
 'engineering makes the development of any coun

### The `documents` list contains all 8 documents, indexed from 0 to 7.

In [5]:
documents[0]

'religion is a set in which collection of beliefs differentdifferent\ncultures and the views that are related with civilization at present so\nmany religions are there and all are explaining the origin of life and\norigin of universe'

In [6]:
documents[7]

'cricket is the most popular game in india in crocket a player uses a\nbat to hit the ball and scoring runs it is played between two teams the\nteam scoring maximum runs will win the game'

In [7]:
len(documents)

8

In [8]:
unique_terms = {term for doc in documents for term in doc.split()}

In [9]:
unique_terms

{'a',
 'agents',
 'air',
 'all',
 'also',
 'alto',
 'america',
 'an',
 'and',
 'any',
 'are',
 'army',
 'as',
 'at',
 'attention',
 'audience',
 'automatic',
 'ball',
 'bat',
 'be',
 'beliefs',
 'beneficial',
 'between',
 'brazil',
 'builder',
 'buildings',
 'built',
 'by',
 'called',
 'can',
 'career',
 'celebrities',
 'civil',
 'civilization',
 'collection',
 'computer',
 'consists',
 'construction',
 'continuous',
 'controlled',
 'country',
 'covering',
 'cricket',
 'crocket',
 'cultures',
 'day',
 'defense',
 'determined',
 'development',
 'develops',
 'different',
 'differentdifferent',
 'discovery',
 'distribution',
 'each',
 'ecological',
 'economic',
 'economy',
 'education',
 'engineering',
 'engineers',
 'entertainment',
 'essentiality',
 'etc',
 'everything',
 'evolutionary',
 'explaining',
 'eyes',
 'facility',
 'fashion',
 'for',
 'force',
 'form',
 'forms',
 'game',
 'given',
 'gives',
 'good',
 'goods',
 'happening',
 'help',
 'hit',
 'holding',
 'hospitals',
 'important

In [10]:
len(unique_terms)

192

In [11]:
inverted_index = {}       # Dictionary mapping each unique term to the set of document numbers that contain it

for i, doc in enumerate(documents):
    for term in doc.split():
        if term in inverted_index:
            inverted_index[term].add(i)    # term seen before: record this document too
        else:
            inverted_index[term] = {i}     # first occurrence: start a new set
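
Because each term maps to a set of document numbers, multi-term queries reduce to set operations. The sketch below uses a small hand-written index with illustrative data rather than the notebook's full index; `search_all` and `search_any` are hypothetical helper names.

```python
# Hand-written stand-in for an index of the same shape (illustrative data)
sample_index = {
    "science": {1},
    "various": {1, 3, 6},
    "different": {1, 5},
    "game": {7},
}

def search_all(index, terms):
    """Documents containing every term (boolean AND)."""
    postings = [index.get(term, set()) for term in terms]
    return set.intersection(*postings) if postings else set()

def search_any(index, terms):
    """Documents containing at least one term (boolean OR)."""
    return set().union(*(index.get(term, set()) for term in terms))

print(sorted(search_all(sample_index, ["various", "different"])))  # [1]
print(sorted(search_any(sample_index, ["science", "game"])))       # [1, 7]
print(sorted(search_all(sample_index, ["science", "missing"])))    # []
```

Using `index.get(term, set())` means an unknown term yields an empty result instead of raising a `KeyError`.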

In [12]:
print(inverted_index)

{'religion': {0}, 'is': {0, 1, 5, 6, 7}, 'a': {0, 5, 6, 7}, 'set': {0}, 'in': {0, 1, 2, 5, 7}, 'which': {0, 4}, 'collection': {0}, 'of': {0, 1, 2, 3, 4, 5, 6}, 'beliefs': {0}, 'differentdifferent': {0}, 'cultures': {0}, 'and': {0, 2, 3, 5, 6, 7}, 'the': {0, 1, 2, 3, 4, 6, 7}, 'views': {0}, 'that': {0, 3}, 'are': {0, 1, 3, 4, 6}, 'related': {0}, 'with': {0, 1, 2}, 'civilization': {0}, 'at': {0}, 'present': {0}, 'so': {0, 3}, 'many': {0, 3}, 'religions': {0}, 'there': {0}, 'all': {0, 1}, 'explaining': {0}, 'origin': {0}, 'life': {0}, 'universe': {0}, 'science': {1}, 'essentiality': {1}, 'education': {1}, 'what': {1}, 'we': {1}, 'watching': {1}, 'our': {1}, 'eyes': {1}, 'happening': {1}, 'nonhappening': {1}, 'include': {1}, 'various': {1, 3, 6}, 'scientists': {1}, 'working': {1}, 'on': {1}, 'different': {1, 5}, 'topics': {1}, 'help': {1}, 'us': {1}, 'to': {1, 2, 4, 7}, 'understand': {1}, 'lives': {1}, 'continuous': {1}, 'evolutionary': {1}, 'study': {1}, 'each': {1}, 'day': {1, 4}, 'somet