# Research Papers Text Analysis

## Table of Contents

- [Requirements](#requirements)
- [Functions](#functions)

## Requirements

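This script requires Python 3.x along with the following libraries and modules (`os`, `string`, and `math` are part of the Python standard library; the rest are provided by NLTK):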

- nltk
- os
- string
- nltk.corpus.words
- nltk.tokenize.word_tokenize
- nltk.stem.porter.PorterStemmer
- nltk.corpus.wordnet
- math

## Functions

Several functions are defined in the script to handle different aspects of text analysis:

- `extract_data(directory)`: Extracts data from text files in the specified directory, including tokenization and filtering based on various criteria.
- `cleaning_pipeline(token)`: Cleans and processes tokens, removing unnecessary characters, splitting compound words, and filtering out non-English words.
- `token_seperator(token)`: Separates tokens into individual words using a word segmentation algorithm.
- `remove_conjunctions_from_sets(input_list)`: Removes conjunctions from a list of tuples containing word sets and their associated information.
- `word_segment(sentences)`: Segments sentences into individual words and filters out stop words.
- `build_inverted_index(tokens)`: Builds an inverted index for efficient search operations based on the tokens.

In [58]:
import nltk
import os
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.corpus import wordnet
from nltk.stem import PorterStemmer
from math import log
import csv
import string

In [59]:
nltk.download('stopwords')
nltk.download('words')

[nltk_data] Downloading package stopwords to C:\Users\Zain
[nltk_data]     Abbas\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package words to C:\Users\Zain
[nltk_data]     Abbas\AppData\Roaming\nltk_data...
[nltk_data]   Package words is already up-to-date!


True

# COMMENTS

## COMMENT 1

### Purpose

The main purpose of this module is to sort tokens into separate lists based on specific conditions. Two kinds of variables are created, one per index type. Variables for the boolean index (the inverted index) carry the prefix `boolean_`, which makes them easy to identify throughout the code. Variables for the positional index carry the prefix `positional_` and store each token together with its position.

### Function Naming Convention

Each function name ends with a suffix that identifies the index it serves:
   - `_boolean` for the boolean index.
   - `_positional` for the positional index.

### VARIABLES

- **Boolean Token 1**: Tokens with a length between 3 and 14 characters, without punctuation, and not consisting entirely of digits.
- **Positional Token 1**: Same as boolean token 1, but tokens are also stored with their positions.
- **Boolean Token of Length 2**: Tokens exactly 2 characters long and not consisting entirely of digits.
- **Positional Token of Length 2**: Same as boolean token of length 2, but tokens are also stored with their positions.
- **Boolean Token as Sentence**: Tokens longer than 14 characters, without hyphens, and not consisting entirely of digits.
- **Positional Token as Sentence**: Same as boolean token as sentence, but tokens are also stored with their positions.
- **Boolean Token with Hyphen**: Tokens containing hyphens.
- **Positional Token with Hyphen**: Same as boolean token with hyphen, but tokens are also stored with their positions.
- **Boolean Token with Punctuation**: Tokens containing punctuation (excluding hyphens), with a length less than or equal to 14 characters, and not consisting entirely of digits.
- **Positional Token with Punctuation**: Same as boolean token with punctuation, but tokens are also stored with their positions.
- **Positional All Number**: Tokens consisting entirely of digits, stored along with their positions.
- **Boolean All Number**: Tokens consisting entirely of digits.
- **Boolean All Token**: Tokens not meeting any specific condition.
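
The category rules above can be sketched as a single function. This is a minimal illustration, not the notebook's own code: the function name `categorize` and the returned labels are hypothetical, and the checks are applied in the same order as the list above.

```python
import string

def categorize(token):
    """Return an illustrative label for the category a token falls into."""
    has_punct = any(ch in string.punctuation for ch in token)
    if 3 <= len(token) <= 14 and not has_punct and not token.isdigit():
        return "token1"
    if len(token) == 2 and not token.isdigit():
        return "len_2"
    if len(token) > 14 and "-" not in token and not token.isdigit():
        return "as_sentence"
    if "-" in token:
        return "have_hyphen"
    if has_punct and len(token) <= 14 and not token.isdigit():
        return "have_punctuation"
    if token.isdigit():
        return "all_number"
    return "all_token"  # nothing else matched, e.g. a single letter

for tok in ["retrieval", "ir", "informationretrievalsystem",
            "state-of-the-art", "e.g.", "2024", "x"]:
    print(tok, "->", categorize(tok))
```

Because the checks run in order, a token is assigned to the first category whose condition it satisfies, so the order of the rules matters.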


## COMMENT 2

### Cleaning Pipeline for Boolean and Positional Indexing

The `cleaning_pipeline` function performs token cleaning and preprocessing for both boolean and positional indexing purposes. Here's a concise breakdown of its functionality:

### Purpose

This function is designed to clean and preprocess tokens for indexing purposes, distinguishing between boolean and positional indexing needs.

### Functionality

1. **Initialization**: 
   - Initialize lists to store different types of tokens for both boolean and positional indexing.

2. **Token Processing Loop**:
   - Iterate through each token provided.
   - Clean tokens by replacing punctuation with spaces and splitting if they contain spaces.
   - Separate tokens consisting entirely of digits.

3. **Post-Processing Loop**:
   - For each cleaned token:
     - Remove digits if present.
     - Split into words and filter based on length.
     - Filter tokens based on English word list.

4. **Return Values**:
   - Return three lists (the boolean and the positional pipeline each return their own set):
     - Tokens not needing further processing.
     - Tokens needing further processing.
     - Tokens consisting only of digits.
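
The steps above can be sketched in simplified form as follows. This is an assumption-laden illustration rather than the notebook's implementation: `VOCAB` is a tiny hypothetical stand-in for the NLTK word list, `clean` is a hypothetical name, and the three loops of the real pipeline are collapsed into one.

```python
import string

PUNCT = set(string.punctuation)
VOCAB = {"index", "search", "query"}  # hypothetical stand-in for the English word list

def clean(tokens, min_len=4):
    """Split (text, doc) tokens into known words, unknown words, and numbers."""
    known, unknown, numbers = [], [], []
    for text, doc in tokens:
        if text.isdigit():
            numbers.append((text, doc))  # digit-only tokens are set aside
            continue
        # Replace punctuation with spaces, then split on whitespace.
        spaced = "".join(" " if ch in PUNCT else ch for ch in text)
        for part in spaced.split():
            part = "".join(ch for ch in part if not ch.isdigit())  # strip digits
            if len(part) < min_len:
                continue  # too short to keep
            (known if part in VOCAB else unknown).append((part, doc))
    return known, unknown, numbers

known, unknown, numbers = clean(
    [("index/search", "1.txt"), ("query2vec", "2.txt"), ("2024", "2.txt")]
)
print(known)
print(unknown)
print(numbers)
```

The three returned lists correspond to tokens that need no further processing, tokens that still need segmentation, and digit-only tokens.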

## COMMENT 3

The word-segmentation code in this section draws on Zipf's Law, which shapes how word costs are estimated from word rank and how run-together tokens are split.

### EXPLANATION

1. **token_seperator(token)**:
   - This function takes a list of `(token, filename)` tuples as input.
   - It initializes an empty list called `words`.
   - It iterates over each word in `english_words` (the NLTK English word set defined earlier) and appends it to the `words` list.
   - It then iterates over the `(x, y)` pairs in the global `boolean_token1` list, where `x` is a token and `y` is the file it came from.
   - Each token from these pairs is appended to the `words` list.
   - It assigns each word the cost `log((i + 1) * log(len(words)))`, where `i` is the word's position in `words`. Following Zipf's Law, the position is treated as a frequency rank, so words earlier in the list are cheaper.
   - It finds the maximum length of words in the `words` list.
   - It calls the `infer_spaces_boolean` function on the lowercased text of each input token and pairs the segmented result with the token's filename.
   - Finally, it returns the result of the `cleaning_pipeline_boolean` function applied to the segmented terms.

2. **infer_spaces(s, maxword, wordcost)**:
   - This function takes a string `s`, maximum word length `maxword`, and a dictionary `wordcost` as input.
   - It defines a nested function `best_match(i)` that finds the best matching words for a given index `i` in the string.
   - It initializes a list `cost` with a single entry `0`, the cost of segmenting the empty prefix of `s`.
   - It iterates over each character in the input string `s`.
   - For each character, it calculates the cost of matching the substring ending at that character with the known words and updates the `cost` list accordingly, potentially taking into account Zipf's Law in estimating word frequencies.
   - It constructs the best matching words by backtracking from the end of the string using the `best_match(i)` function.
   - Finally, it joins the reversed list of best matching words with spaces and returns the result.

Zipf's Law therefore explains how word frequencies, and consequently word costs, are estimated in this part of the code.
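
The segmentation idea can be illustrated with a small self-contained sketch. It uses the same rank-based cost formula, but the five-word vocabulary is hypothetical and the backtracking stores back-pointers instead of recomputing `best_match`, so it is a variant of the approach rather than the notebook's exact code.

```python
from math import log

def build_wordcost(words):
    # Rank-based cost: words earlier in the list are assumed more frequent and cheaper.
    return {w: log((i + 1) * log(len(words))) for i, w in enumerate(words)}

def infer_spaces(s, wordcost, maxword):
    """Split s into known words by minimising the total word cost."""
    cost = [0.0]  # cost[i] = minimal cost of segmenting s[:i]
    back = [0]    # back[i] = start index of the last word in that segmentation
    for i in range(1, len(s) + 1):
        best_cost, best_start = min(
            (cost[j] + wordcost.get(s[j:i], float("inf")), j)
            for j in range(max(0, i - maxword), i)
        )
        cost.append(best_cost)
        back.append(best_start)
    out = []
    i = len(s)
    while i > 0:  # walk the back-pointers from the end of the string
        out.append(s[back[i]:i])
        i = back[i]
    return " ".join(reversed(out))

words = ["the", "index", "search", "engine", "text"]  # hypothetical vocabulary
wordcost = build_wordcost(words)
maxword = max(len(w) for w in words)
print(infer_spaces("searchengine", wordcost, maxword))  # search engine
```

Substrings that are not in the vocabulary receive an infinite cost, so the minimisation only prefers them when no known word fits.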

## COMMENT 4

The two functions described here belong to the text-processing pipeline; they clean and segment run-together text.

### EXPLANATION

1. **remove_conjunctions_from_sets_boolean(input_list)**:
   - This function takes a list of tuples `(sentence, info)` as input.
   - It initializes a list `conjunctions` containing common conjunction words.
   - It iterates through each tuple in the input list, where `s` represents the sentence and `x` represents some associated information.
   - For each sentence, it scans character offsets rather than whole words:
      - It checks prefixes of the sentence from shortest to longest. At the first prefix that matches an entry in `conjunctions`, it removes that prefix and stops scanning.
      - It then checks suffixes of the remaining string from shortest to longest. At the first suffix that matches a conjunction, it removes that suffix and stops scanning.
   - It appends the modified sentence along with the associated information to the `filtered_list`.
   - Finally, it returns the filtered list of tuples.

2. **word_segment_boolean(sentences)**:
   - This function takes a list of tuples `(sentence, info)` as input.
   - It initializes an empty list `len_3_word` to store words with fewer than 4 characters.
   - It initializes an empty list `segmented_sentences` to store segmented sentences.
   - It initializes an empty list `term` to store the final terms after processing.
   - For each sentence in the input list:
      - It initializes an empty list `words` to store segmented words of the sentence.
      - It iterates through the sentence character by character, attempting to find valid words by checking if substrings form valid English words.
      - If a word is found and its length is greater than or equal to 4, it is added to the `words` list.
      - If a word has a length less than 4, it is added to the `len_3_word` list.
   - After segmenting each sentence, it constructs the `term` list by checking each word:
      - It ensures the word is not a stop word, it's a valid English word, and its length is at least 3.
      - It checks if the word is a noun using WordNet.
      - If the conditions are met, it appends the word along with its associated information to the `term` list.
   - Finally, it returns the `term` list containing the segmented terms for further analysis or processing.
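
Both steps can be sketched on plain strings. The word sets below are small hypothetical stand-ins for the notebook's `conjunctions` list and the NLTK vocabulary, and the function names are illustrative; the length, stop-word, and WordNet noun filters are left out.

```python
CONJUNCTIONS = {"and", "or", "but", "of", "the", "for", "with"}  # hypothetical subset
VOCAB = {"boolean", "retrieval", "model", "vector", "space"}     # hypothetical stand-in

def strip_conjunctions(s):
    """Drop one leading and one trailing conjunction from a run-together string."""
    for i in range(1, len(s)):           # prefixes, shortest first
        if s[:i] in CONJUNCTIONS:
            s = s[i:]
            break
    for i in range(len(s) - 1, 0, -1):   # suffixes, shortest first
        if s[i:] in CONJUNCTIONS:
            s = s[:i]
            break
    return s

def greedy_segment(s):
    """Greedy longest match: at each position take the longest vocabulary word."""
    words, start = [], 0
    while start < len(s):
        for end in range(len(s), start, -1):
            if s[start:end] in VOCAB:
                words.append(s[start:end])
                start = end
                break
        else:
            start += 1  # no vocabulary word starts here; skip one character
    return words

stripped = strip_conjunctions("theretrievalmodeland")
print(stripped)
print(greedy_segment(stripped))
```

One limitation of affix matching on raw characters is that a real word ending in a conjunction is also trimmed: with this subset, `strip_conjunctions("vector")` returns `"vect"` because `"or"` matches as a suffix.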



In [60]:
stop_words = set(["a", "is", "the", "of", "all", "and", "to", "can", "be", "as", "once", "for", "at", "am", "are", "has", "have", "had", "up", "his", "her", "in", "on", "no", "we", "do"])
punctuation = set(['!', '\\', '#', '$', '%', '&', "'", '(', ')', '*', '+', ',', '-', '.', '/', ':', ';', '<', '=', '>', '?', '@', '[', '\\', ']', '^', '_', '`', '{', '|', '}', '~', '—'])
english_words = set(nltk.corpus.words.words())
porter = PorterStemmer()


# remove_stopwords: drop stop words and tokens that are a single punctuation character
def remove_stopwords(tokens):    
    punctuation = set(string.punctuation)
    filtered_tokens = [word for word in tokens if word.lower() not in stop_words and word not in punctuation]
    return filtered_tokens

#comment 1

def extract_and_index_data(directory):
    boolean_token1 = []
    positional_token1 = []
    
    boolean_token_of_len_2 = []
    positional_token_of_len_2 = []
    
    boolean_token_as_sentence = []
    positional_token_as_sentence = []
    
    boolean_token_have_hyphen = []
    positional_token_have_hyphen = []
    
    boolean_token_have_punctuation = []
    positional_token_have_punctuation = []
    
    boolean_all_token = []
    positional_all_token = []

    boolean_all_number = []
    positional_all_number = []
    
    for filename in os.listdir(directory):
        index = 0
        if os.path.isfile(os.path.join(directory, filename)):
            with open(os.path.join(directory, filename), 'r') as text:
                data = text.read()

            file_tokens = word_tokenize(data)
            normalized = remove_stopwords(file_tokens)

            # index = 0
            for token in normalized:
                if 3 <= len(token) <= 14 and all(char not in string.punctuation for char in token) and not token.isdigit():
                    boolean_token1.append((token.lower(), filename))
                    positional_token1.append((token.lower(), filename, index))
                elif len(token) == 2 and not token.isdigit():
                    boolean_token_of_len_2.append((token.lower(), filename))
                    positional_token_of_len_2.append((token.lower(), filename, index))
                elif len(token) > 14 and '-' not in token and not token.isdigit():
                    boolean_token_as_sentence.append((token.lower(), filename))
                    positional_token_as_sentence.append((token.lower(), filename, index))
                elif '-' in token:
                    boolean_token_have_hyphen.append((token.lower(), filename))
                    positional_token_have_hyphen.append((token.lower(), filename, index))
                elif any(char in string.punctuation for char in token) and '-' not in token and len(token) <= 14 and not token.isdigit():
                    boolean_token_have_punctuation.append((token.lower(), filename))
                    positional_token_have_punctuation.append((token.lower(), filename, index))
                elif token.isdigit():
                    positional_all_number.append((token.lower(), filename, index))
                    boolean_all_number.append((token.lower(), filename))
                else:
                    boolean_all_token.append((token.lower(), filename))
                index += 1  # Advance to the next token position within this file
                positional_all_token.append((token.lower(), filename, index))

    return (boolean_token1, boolean_token_of_len_2, boolean_token_as_sentence, boolean_token_have_hyphen, boolean_token_have_punctuation, boolean_all_token, boolean_all_number), \
           (positional_token1, positional_token_of_len_2, positional_token_as_sentence, positional_token_have_hyphen, positional_token_have_punctuation, positional_all_token, positional_all_number)






directory_path = r'D:\SEMESTER 06\INFORMATION RETRIEVAL\ASSIGNMENT I\ResearchPapers'
#function invoking
boolean_tokens, positional_tokens = extract_and_index_data(directory_path)

boolean_token1, boolean_token_of_len_2, boolean_token_as_sentence, boolean_token_have_hyphen, boolean_token_have_punctuation, boolean_all_token, boolean_all_number = boolean_tokens
positional_token1, positional_token_of_len_2, positional_token_as_sentence, positional_token_have_hyphen, positional_token_have_punctuation, positional_all_token, positional_all_number = positional_tokens

#########################   


#COMMENT 2

def cleaning_pipeline_boolean(token):
    clean_token = []
    token_split = []
    new_token = []
    tokens_no_further_processing_required = []
    token_contain_only_numbers = []
    for x, y in token:
        if len(x) >= 4 and not x.isdigit() and any(c.isalpha() for c in x):
            new_x = ''.join(' ' if char in punctuation else char for char in x)
            if ' ' in new_x:
                token_split = new_x.split(' ')
                for word in token_split:
                    if len(word) >= 4:
                        new_token.append((word, y))
            else:
                if len(new_x) >= 4:
                    new_token.append((new_x, y))
        elif x.isdigit():
            token_contain_only_numbers.append((x, y))
    
    for x,y in new_token:
        if x.isdigit():
            token_contain_only_numbers.append((x,y))

    
    for x, y in new_token:
        if any(char.isdigit() for char in x):
            # Drop em-dashes and digits in one pass (the original `and` expression kept only the digit filter)
            cleaned_x = ''.join(char for char in x if char != '—' and not char.isdigit())
            split_tokens = nltk.word_tokenize(cleaned_x)
            for word in split_tokens:
                if len(word) >= 4:
                    clean_token.append((word, y))
        else:
            clean_token.append((x, y))
    tokens_no_further_processing_required = [(x,y) for x,y in clean_token if x in english_words]
    clean_token = [(x, y) for x, y in clean_token if x not in english_words]
    return tokens_no_further_processing_required,clean_token,token_contain_only_numbers


def cleaning_pipeline_positional(token):
    clean_token = []
    token_split = []
    new_token = []
    tokens_no_further_processing_required = []
    token_contain_only_numbers = []
    for x, y, z in token:
        if len(x) >= 4 and not x.isdigit() and any(c.isalpha() for c in x):
            new_x = ''.join(' ' if char in punctuation else char for char in x)
            if ' ' in new_x:
                token_split = new_x.split(' ')
                for word in token_split:
                    if len(word) >= 4:
                        new_token.append((word, y, z))
            else:
                if len(new_x) >= 4:
                    new_token.append((new_x, y, z))
        elif x.isdigit():
            token_contain_only_numbers.append((x, y, z))
    
    for x,y,z in new_token:
        if x.isdigit():
            token_contain_only_numbers.append((x,y,z))

    
    for x, y, z in new_token:
        if any(char.isdigit() for char in x):
            # Drop em-dashes and digits in one pass (the original `and` expression kept only the digit filter)
            cleaned_x = ''.join(char for char in x if char != '—' and not char.isdigit())
            split_tokens = nltk.word_tokenize(cleaned_x)
            for word in split_tokens:
                if len(word) >= 4:
                    clean_token.append((word, y, z))
        else:
            clean_token.append((x, y, z))
    tokens_no_further_processing_required = [(x,y, z) for x,y, z in clean_token if x in english_words]
    clean_token = [(x, y, z) for x, y , z in clean_token if x not in english_words]
    return tokens_no_further_processing_required,clean_token,token_contain_only_numbers



#########################################

#COMMENT 3

def token_seperator_boolean(token):
    words = []
    for x in english_words:
        words.append(x)

    for x,y in boolean_token1:
        words.append(x)


    wordcost = {k: log((i + 1) * log(len(words))) for i, k in enumerate(words)}
    maxword = max(len(x) for x in words)
    # return infer_spaces(token,maxword,wordcost)
    clean_term = []
    for i in range(len(token)):
        clean_term.append((infer_spaces_boolean(token[i][0].lower(),maxword,wordcost),token[i][1]))
    return cleaning_pipeline_boolean(clean_term)
    

def infer_spaces_boolean(s,maxword,wordcost):
    def best_match(i):
        candidates = enumerate(reversed(cost[max(0, i - maxword):i]))
        return min((c + wordcost.get(s[i - k - 1:i], 9e999), k + 1) for k, c in candidates)
    
    cost = [0]
    for i in range(1, len(s) + 1):
        c, k = best_match(i)
        cost.append(c)
    
    out = []
    i = len(s)
    while i > 0:
        c, k = best_match(i)
        assert c == cost[i]
        out.append(s[i - k:i])
        i -= k
    
    return " ".join(reversed(out))



def token_seperator_poistional(token):
    words = []
    for x in english_words:
        words.append(x)

    for x, y, z in positional_token1:
        words.append(x)

    wordcost = {k: log((i + 1) * log(len(words))) for i, k in enumerate(words)}
    maxword = max(len(x) for x in words)
    clean_term = []
    for i in range(len(token)):
        clean_term.append((infer_spaces_positional(token[i][0].lower(), maxword, wordcost), token[i][1], token[i][2]))
    return cleaning_pipeline_positional(clean_term)


def infer_spaces_positional(s, maxword, wordcost):
    def best_match(i):
        candidates = enumerate(reversed(cost[max(0, i - maxword):i]))
        return min((c + wordcost.get(s[i - k - 1:i], 9e999), k + 1) for k, c in candidates)

    cost = [0]
    for i in range(1, len(s) + 1):
        c, k = best_match(i)
        cost.append(c)

    out = []
    i = len(s)
    while i > 0:
        c, k = best_match(i)
        assert c == cost[i]
        out.append(s[i - k:i])
        i -= k

    return " ".join(reversed(out))




#######################################

#COMMENT 4

def remove_conjunctions_from_sets_boolean(input_list):
    conjunctions = ['this', 'that','of', 'and', 'or', 'but', 'for', 'nor', 'so', 'yet', 'to', 'with', 'in', 'on', 'at', 'by', 'is', "the", "of", "all", "and", "to", "can", "be", "as", "once", "for", "at", "am", "are", "has", "have", "had", "up", "his", "her", "in", "on", "no"]
    filtered_list = []
    for s, x in input_list:  # Unpack the tuple
        # Check from the beginning
        for i in range(len(s)):
            first_word = s[:i].lower()
            if first_word in conjunctions:
                s = s[i:]
                break
        # Check from the end
        for i in range(len(s), 0, -1):
            last_word = s[i:].lower()
            if last_word in conjunctions:
                s = s[:i]
                break
        filtered_list.append((s, x))  # Append the modified tuple
    return filtered_list






def word_segment_boolean(sentences):
    len_3_word = []
    segmented_sentences = []
    term = []
    for sentence, info in sentences:
        words = []
        start = 0
        while start < len(sentence):
            found = False
            for end in range(len(sentence), start, -1):
                word = sentence[start:end]
                if word.lower() in english_words:
                    if word.lower().startswith("the"):
                        word = word[2:]
                    if len(word) >= 4:
                        if word.lower() not in {"a", "an", "the"} or start != 0:
                            words.append(word)
                        start = end
                        found = True
                    else:
                        len_3_word.append((word, info))
                    break
            if not found:
                start += 1
        segmented_sentences.append((words, info))

    # Filter on each short word `wd` itself, not on the leftover loop variable `word`
    term = [(wd.lower(), info) for wd, info in len_3_word if len(wd) >= 3 and wordnet.synsets(wd, pos=wordnet.NOUN) and wd.lower() not in stop_words and wd not in punctuation]
    for x, y in segmented_sentences:
        for word in x:
            if word not in stop_words and word in english_words:
                term.append((word, y))
    return term





def remove_conjunctions_from_sets_positional(input_list):
    conjunctions = ['this', 'that','of', 'and', 'or', 'but', 'for', 'nor', 'so', 'yet', 'to', 'with', 'in', 'on', 'at', 'by', 'is', "the", "of", "all", "and", "to", "can", "be", "as", "once", "for", "at", "am", "are", "has", "have", "had", "up", "his", "her", "in", "on", "no"]
    filtered_list = []
    for s, x, index in input_list:  # Unpack the tuple
        # Check from the beginning
        for i in range(len(s)):
            first_word = s[:i].lower()
            if first_word in conjunctions:
                s = s[i:]
                break
        # Check from the end
        for i in range(len(s), 0, -1):
            last_word = s[i:].lower()
            if last_word in conjunctions:
                s = s[:i]
                break
        filtered_list.append((s, x, index))  # Append the modified tuple
    return filtered_list


def word_segment_positional(sentences):
    len_3_word = []
    segmented_sentences = []
    term = []
    for sentence, info, index in sentences:
        words = []
        start = 0
        while start < len(sentence):
            found = False
            for end in range(len(sentence), start, -1):
                word = sentence[start:end]
                if word.lower() in english_words:
                    if word.lower().startswith("the"):
                        word = word[2:]
                    if len(word) >= 4:
                        if word.lower() not in {"a", "an", "the"} or start != 0:
                            words.append(word)
                        start = end
                        found = True
                    else:
                        len_3_word.append((word, info, index))
                    break
            if not found:
                start += 1
        segmented_sentences.append((words, info, index))

    # Filter on each short word `wd` itself, not on the leftover loop variable `word`
    term = [(wd.lower(), info, index) for wd, info, index in len_3_word if len(wd) >= 3 and wordnet.synsets(wd, pos=wordnet.NOUN) and wd.lower() not in stop_words and wd not in punctuation]
    for x, y, index in segmented_sentences:
        for word in x:
            if word not in stop_words and word in english_words:
                term.append((word, y, index))
    return term


    

####################





# INVERTED INDEX
# COMMENT 5


1. **Boolean Token Segregation and Cleaning**:
   - The code starts by segregating different types of boolean tokens based on specific conditions like presence of hyphens, existence in vocabulary, presence of punctuation, etc.
   - Tokens that start with `//`, `www`, or `org` are treated as links and stored in `boolean_token_link`. They are removed from the lists they came from.
   - Tokens containing hyphens are also segregated, with those meeting certain length criteria and containing a single hyphen being stored in `boolean_hyphen_term`.
   - Long tokens that exist in the English vocabulary are identified and stored in `boolean_token_exist_in_vocab`, while those not found in the vocabulary are kept for further cleaning.
   - Additionally, tokens containing only numbers are processed separately and stored in `boolean_all_number`.
   - The remaining tokens go through the cleaning pipeline (`cleaning_pipeline_boolean`), which replaces punctuation with spaces, strips digits, and separates vocabulary words from tokens that still need segmentation.
   - Conjunctions are removed from the sets of cleaned tokens using `remove_conjunctions_from_sets_boolean`.
   - Finally, the cleaned tokens are segmented into terms using `word_segment_boolean`.

2. **Stemming and Inverted Index Creation**:
   - Porter stemming is applied to the terms to reduce them to their root forms.
   - Stemmed terms along with their corresponding information are stored in `stem_terms`.
   - An inverted index is then created, where each stemmed term serves as a key and the corresponding value is the list of documents in which the term appears.
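
A minimal sketch of the inverted index follows. It assumes the `(term, document)` pairs are already stemmed, uses made-up sample pairs, and stores postings as sets before sorting, which differs from the list-based code in the next cell; `build_inverted_index` is a name used here for illustration.

```python
from collections import defaultdict

def build_inverted_index(pairs):
    """Map each term to the sorted list of documents that contain it."""
    postings = defaultdict(set)
    for term, doc in pairs:
        postings[term].add(doc)  # a set ignores repeated (term, doc) pairs
    return {term: sorted(docs) for term, docs in postings.items()}

# Hypothetical, already-stemmed sample data
pairs = [("index", "1.txt"), ("search", "2.txt"), ("index", "2.txt"), ("index", "1.txt")]
index = build_inverted_index(pairs)
print(index)

# A boolean AND query is the intersection of two posting lists.
both = set(index["index"]) & set(index["search"])
print(both)
```

Using a set for each posting list makes the duplicate check constant-time, whereas testing membership in a list grows slower as the list gets longer.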

In [61]:
#COMMENT 5
boolean_hyphen_term = []
boolean_token_exist_in_vocab = []

boolean_token_link = [(x, y) for x, y in boolean_token_have_hyphen if x.startswith('//') or x.startswith('www') or x.startswith('org')]
boolean_token_have_hyphen = [(x, y) for x, y in boolean_token_have_hyphen if not (x.startswith('//') or x.startswith('www') or x.startswith('org'))]

boolean_hyphen_term = [(x, y) for x, y in boolean_token_have_hyphen if x.count('-') == 1 and '-' not in [x[0], x[-1]] and 4 <= len(x.split('-')[0]) + len(x.split('-')[1]) <= 20]
boolean_token_have_hyphen = [(x, y) for x, y in boolean_token_have_hyphen if (x, y) not in boolean_hyphen_term]

# Extend rather than overwrite, so the links collected from hyphenated tokens above are kept
boolean_token_link += [(x, y) for x, y in boolean_token_as_sentence if x.startswith('//') or x.startswith('www') or x.startswith('org')]
boolean_token_as_sentence = [(x, y) for x, y in boolean_token_as_sentence if not (x.startswith('//') or x.startswith('www') or x.startswith('org'))]



boolean_token_exist_in_vocab = [(x,y) for x,y in boolean_token_as_sentence if x in english_words]
boolean_token_as_sentence = [(x, y) for x, y in boolean_token_as_sentence if x not in english_words]

boolean_token1 += boolean_token_exist_in_vocab

boolean_tokens_no_further_processing_required, boolean_clean_token, boolean_token_contain_only_numbers = cleaning_pipeline_boolean(boolean_token_as_sentence)
boolean_all_number += boolean_token_contain_only_numbers
boolean_token1 += boolean_tokens_no_further_processing_required


boolean_tokens_no_further_processing_required1, boolean_clean_token1, boolean_token_contain_only_numbers1 = cleaning_pipeline_boolean(boolean_token_have_punctuation)
boolean_all_number += boolean_token_contain_only_numbers1
boolean_token1 += boolean_tokens_no_further_processing_required1
boolean_clean_token += boolean_clean_token1

tokens_no_further_processing_required3, boolean_clean_token3, token_contain_only_numbers3 = cleaning_pipeline_boolean(boolean_token_have_hyphen)
boolean_all_number += token_contain_only_numbers3
boolean_token1 += tokens_no_further_processing_required3
boolean_clean_token += boolean_clean_token3

boolean_tokens_no_further_processing_required2, boolean_clean_token2,token_contain_only_numbers2 = token_seperator_boolean(boolean_clean_token)
boolean_token1 += boolean_tokens_no_further_processing_required2

boolean_sets_list = remove_conjunctions_from_sets_boolean(boolean_clean_token2)
boolean_token_sentences = word_segment_boolean(boolean_sets_list)

boolean_term = []

boolean_term = boolean_token_link + boolean_token1 + boolean_all_number + boolean_token_sentences + boolean_hyphen_term

stem_terms = []
for x, y in boolean_term:
    stemmed_x = porter.stem(x)
    stem_terms.append((stemmed_x,y))

inverted_index = {}

for x, y in stem_terms:
    if x not in inverted_index:
        inverted_index[x] = []
    if y not in inverted_index[x]:
        inverted_index[x].append(y)

# POSITIONAL INDEX

# COMMENT 6


1. **Positional Token Segregation and Cleaning**:
   - Similar to before, the code segregates different types of positional tokens based on conditions like the presence of hyphens, existence in the vocabulary, and the presence of punctuation.
   - Tokens starting with URLs or domain names are separated out and stored in `positional_token_link`. Other tokens are filtered accordingly.
   - Tokens containing hyphens are also segregated based on specific length and structure criteria.
   - Tokens existing in the English vocabulary are identified and stored in `positional_token_exist_in_vocab`, while those not found are filtered out.
   - The tokens then undergo cleaning pipelines (`cleaning_pipeline_positional`) and are processed for further segmentation and cleaning.
   - Finally, tokens are segmented into terms using `word_segment_positional`.

2. **Stemming and Positional Index Creation**:
   - Porter stemming is applied to the terms to reduce them to their root forms.
   - Stemmed terms along with their corresponding document names and positions are stored in `stem_terms_positional`.
   - A positional index is then created, where each document name serves as a key, and the corresponding values are dictionaries that map each term to the sorted list of its positions within that document.
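
The nested structure can be sketched as follows. The sample triples are made up, the helper names are hypothetical, and `bisect.insort` is used here to keep each position list sorted on insertion instead of re-sorting after every append.

```python
from bisect import insort

def build_positional_index(triples):
    """Build {document: {term: sorted positions}} from (term, doc, pos) triples."""
    index = {}
    for term, doc, pos in triples:
        insort(index.setdefault(doc, {}).setdefault(term, []), pos)
    return index

def is_phrase(index, doc, first, second):
    """True if `second` occurs immediately after `first` somewhere in `doc`."""
    terms = index.get(doc, {})
    later = set(terms.get(second, []))
    return any(p + 1 in later for p in terms.get(first, []))

# Hypothetical, already-stemmed sample data
triples = [("search", "1.txt", 5), ("index", "1.txt", 2),
           ("search", "1.txt", 3), ("engine", "1.txt", 4)]
pos_index = build_positional_index(triples)
print(pos_index)
print(is_phrase(pos_index, "1.txt", "search", "engine"))
```

Storing positions per document is what makes phrase and proximity queries possible, since two terms can be compared by their offsets within the same file.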

In [62]:
#COMMENT 6
positional_hyphen_term = []
positional_token_exist_in_vocab = []

positional_token_link = [(x, y , z) for x, y, z in positional_token_have_hyphen if x.startswith('//') or x.startswith('www') or x.startswith('org')]
positional_token_have_hyphen = [(x, y , z) for x, y, z in positional_token_have_hyphen if not (x.startswith('//') or x.startswith('www') or x.startswith('org'))]

positional_hyphen_term = [(x, y , z) for x, y, z in positional_token_have_hyphen if x.count('-') == 1 and '-' not in [x[0], x[-1]] and 4 <= len(x.split('-')[0]) + len(x.split('-')[1]) <= 20]
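# Compare full (token, filename, position) triples; a 2-tuple never matches the triples in positional_hyphen_term
positional_token_have_hyphen = [(x, y , z) for x, y, z in positional_token_have_hyphen if (x, y, z) not in positional_hyphen_term]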

# Extend rather than overwrite, so the links collected from hyphenated tokens above are kept
positional_token_link += [(x, y , z) for x, y, z in positional_token_as_sentence if x.startswith('//') or x.startswith('www') or x.startswith('org')]
positional_token_as_sentence = [(x, y , z) for x, y, z in positional_token_as_sentence if not (x.startswith('//') or x.startswith('www') or x.startswith('org'))]

positional_token_exist_in_vocab = [(x, y , z) for x, y, z in positional_token_as_sentence if x in english_words]
positional_token_as_sentence = [(x, y , z) for x, y, z in positional_token_as_sentence if x not in english_words]

positional_token1 += positional_token_exist_in_vocab

positional_tokens_no_further_processing_required, positional_clean_token, positional_token_contain_only_numbers = cleaning_pipeline_positional(positional_token_as_sentence)
positional_all_number += positional_token_contain_only_numbers
positional_token1 += positional_tokens_no_further_processing_required

positional_tokens_no_further_processing_required1, positional_clean_token1, positional_token_contain_only_numbers1 = cleaning_pipeline_positional(positional_token_have_punctuation)
positional_all_number += positional_token_contain_only_numbers1
positional_token1 += positional_tokens_no_further_processing_required1
positional_clean_token += positional_clean_token1

tokens_no_further_processing_required3, positional_clean_token3, token_contain_only_numbers3 = cleaning_pipeline_positional(positional_token_have_hyphen)
positional_all_number += token_contain_only_numbers3
positional_token1 += tokens_no_further_processing_required3
positional_clean_token += positional_clean_token3

positional_tokens_no_further_processing_required2, positional_clean_token2, token_contain_only_numbers2 = token_seperator_poistional(positional_clean_token)
positional_token1 += positional_tokens_no_further_processing_required2

positional_sets_list = remove_conjunctions_from_sets_positional(positional_clean_token2)
positional_token_sentences = word_segment_positional(positional_sets_list)

# Combine every token category into the final list of (term, doc_name, position) triples.
positional_term = positional_token_link + positional_token1 + positional_all_number + positional_token_sentences + positional_hyphen_term

# Stem every term before indexing.
porter = PorterStemmer()
stem_terms_positional = [(porter.stem(x), y, z) for x, y, z in positional_term]

# Positional index: {doc_name: {term: [positions in ascending order]}}
positional_index = {}
for term, doc_name, index in stem_terms_positional:
    positional_index.setdefault(doc_name, {}).setdefault(term, []).append(index)

# Sort each posting list once, after all positions have been collected.
for postings in positional_index.values():
    for positions in postings.values():
        positions.sort()
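
To make the shape of the positional index concrete, here is a minimal sketch on toy data. The helper name `build_positional_index`, the triples, and the document names are invented for illustration and are not taken from the corpus. It builds the same `doc_name -> term -> positions` mapping with `collections.defaultdict` and sorts each posting list once at the end.

```python
from collections import defaultdict

def build_positional_index(triples):
    """Build {doc_name: {term: [positions]}} from (term, doc_name, position) triples."""
    index = defaultdict(lambda: defaultdict(list))
    for term, doc_name, position in triples:
        index[doc_name][term].append(position)
    # Sort each posting list once and convert to plain dicts for readability.
    return {doc: {term: sorted(positions) for term, positions in terms.items()}
            for doc, terms in index.items()}

# Hypothetical, already-stemmed triples (not real corpus data).
toy_triples = [
    ("heart", "1.txt", 7),
    ("diseas", "1.txt", 8),
    ("heart", "1.txt", 2),
    ("heart", "2.txt", 5),
]

toy_index = build_positional_index(toy_triples)
print(toy_index)
# {'1.txt': {'heart': [2, 7], 'diseas': [8]}, '2.txt': {'heart': [5]}}
```

Keeping the position lists sorted is what makes later proximity checks cheap, since neighbouring occurrences can be compared without scanning every pair.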


# SEARCH ENGINE

In [63]:
def boolean_search(boolean_query):
    """Evaluate a flat boolean query (AND, OR, NOT) from left to right."""
    porter = PorterStemmer()
    # Every document in the collection; NOT is evaluated as a complement against this set.
    all_docs = {'1.txt', '2.txt', '3.txt', '7.txt', '8.txt', '9.txt', '11.txt', '12.txt', '13.txt', '14.txt', '15.txt', '16.txt', '17.txt', '18.txt', '21.txt', '22.txt', '23.txt', '24.txt', '25.txt', '26.txt'}
    query_terms = boolean_query.lower().split()

    def read_operand(pos):
        """Return (matching docs, next position) for the operand starting at pos."""
        negate = False
        # Consume any leading NOTs; two NOTs cancel each other out.
        while pos < len(query_terms) and query_terms[pos] == 'not':
            negate = not negate
            pos += 1
        if pos >= len(query_terms):
            return set(), pos  # Dangling operator: nothing left to match
        term_docs = set(inverted_index.get(porter.stem(query_terms[pos]), []))
        return (all_docs - term_docs if negate else term_docs), pos + 1

    common_docs, i = read_operand(0)
    while i < len(query_terms):
        operator = query_terms[i]
        if operator in ('and', 'or'):
            i += 1
        else:
            operator = 'and'  # Adjacent terms without an operator are ANDed
        term_docs, i = read_operand(i)

        if operator == 'and':
            common_docs = common_docs.intersection(term_docs)
        else:
            common_docs = common_docs.union(term_docs)

    return list(common_docs)


def proximity_search(term1, term2, proximity):
    """Return the documents in which term1 and term2 occur within `proximity` positions of each other."""
    porter = PorterStemmer()
    # The positional index stores stemmed terms, so stem the query terms the same way.
    term1 = porter.stem(term1.lower())
    term2 = porter.stem(term2.lower())

    result = set()
    for doc_name, postings in positional_index.items():
        if term1 in postings and term2 in postings:
            indexes1 = postings[term1]
            indexes2 = postings[term2]
            # any() stops at the first pair of positions that is close enough.
            if any(abs(index1 - index2) <= proximity for index1 in indexes1 for index2 in indexes2):
                result.add(doc_name)
    return result
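
Because each posting list in `positional_index` is kept in ascending order, the pairwise comparison can also be done in a single pass with two pointers. The sketch below is an alternative to the nested loops, not the notebook's own method, and `within_proximity` is a hypothetical helper name.

```python
def within_proximity(positions1, positions2, proximity):
    """Return True if some pair of positions differs by at most `proximity`.

    Both lists must be sorted in ascending order.
    """
    i = j = 0
    while i < len(positions1) and j < len(positions2):
        if abs(positions1[i] - positions2[j]) <= proximity:
            return True
        # Advance the pointer at the smaller position; moving the other
        # pointer could only widen the gap.
        if positions1[i] < positions2[j]:
            i += 1
        else:
            j += 1
    return False

print(within_proximity([2, 40], [10, 43], 3))   # True  (40 and 43 are 3 apart)
print(within_proximity([1, 50], [20, 30], 5))   # False (closest pair is 19 apart)
```

This runs in time proportional to the combined length of the two lists rather than their product, which matters for terms that occur many times in a long document.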



In [None]:
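def search_documents(query):
    """Run a proximity query ('term1 term2 /k') or a boolean query and print the matches."""
    # Check if the query is empty
    if not query.strip():
        print("Please enter a valid search query.")
        return

    if '/' in query:
        # Proximity query: exactly two terms followed by '/k', e.g. "past research /3"
        try:
            terms, proximity = query.split('/')
            term1, term2 = terms.split()
            proximity = int(proximity)
        except ValueError:
            print("Proximity queries must look like: term1 term2 /k")
            return
        print(term1, term2, proximity)  # Echo the parsed query
        results = proximity_search(term1, term2, proximity)
    else:
        results = boolean_search(query)

    if results:
        print("Documents found:", results)
    else:
        print("No documents found for the query:", query)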


while True:
    query = input("Enter your search query (or press Enter on an empty line to quit): ")
    if not query.strip():
        break
    print("Query:", query)
    search_documents(query)
    print()


Enter your search query (or press Enter on an empty line to quit):  heart


Query: heart
Documents found: ['11.txt', '3.txt', '1.txt', '8.txt', '9.txt', '26.txt', '7.txt']



Enter your search query (or press Enter on an empty line to quit):  past research /3


Query: past research /3
past research 3
Documents found: {'12.txt'}
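
The boolean operators accepted by `boolean_search` correspond directly to set operations: AND is an intersection, OR is a union, and NOT is a complement against the full document collection. A minimal sketch on an invented three-document index (the index contents and variable names below are hypothetical):

```python
# Hypothetical inverted index: stemmed term -> set of documents containing it.
toy_inverted_index = {
    "heart": {"1.txt", "2.txt"},
    "cancer": {"2.txt", "3.txt"},
}
all_docs = {"1.txt", "2.txt", "3.txt"}

heart = toy_inverted_index["heart"]
cancer = toy_inverted_index["cancer"]

and_docs = heart & cancer                   # heart AND cancer
or_docs = heart | cancer                    # heart OR cancer
and_not_docs = heart & (all_docs - cancer)  # heart AND NOT cancer

print(sorted(and_docs))      # ['2.txt']
print(sorted(or_docs))       # ['1.txt', '2.txt', '3.txt']
print(sorted(and_not_docs))  # ['1.txt']
```

NOT needs the complete set of document names, which is why the search function carries a list of every file in the collection.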

