# 1. Text processing

We will build a text preprocessing pipeline.

# 1.1 Normalization

The first step is normalization.
It might include:
* converting all letters to lower or upper case
* converting numbers into words or removing numbers
* removing punctuation, accent marks and other diacritics
* removing extra whitespace
* expanding abbreviations
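
The `normalize` function in this notebook does not strip diacritics (the sample output keeps `sē`). As a minimal sketch of that list item, assuming the standard-library `unicodedata` module is acceptable here, one could decompose each character and drop the combining marks. `strip_diacritics` is a hypothetical helper, not part of the exercise code.

```python
import unicodedata


def strip_diacritics(text):
    """Drop combining marks after canonical decomposition (NFD)."""
    decomposed = unicodedata.normalize("NFD", text)
    # unicodedata.combining() is non-zero for accents, macrons and similar marks
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


print(strip_diacritics("per sē"))
print(strip_diacritics("Thalía"))
```

Whether to apply this step depends on the collection: for some languages the diacritics distinguish different words.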

For this exercise, it is enough to produce lowercase text with special characters and digits removed and extra whitespace collapsed.

How could neural networks be used for text normalization?

In [1]:
import re
# normalize text
def normalize(text, allow_asterix=False):
    text = text.lower()
    text = re.sub(r"'", '', text)                          # remove apostrophes
    text = re.sub(r'[!@#$.\-+—,():“”]', ' ', text)         # replace the listed punctuation signs with spaces
    if not allow_asterix:
        text = re.sub(r'\*', ' ', text)                    # replace asterisks (*) with spaces
    text = re.sub(r'[0-9]', ' ', text)                     # replace all digits with spaces
    result = " ".join(text.split())                        # collapse runs of whitespace into single spaces
    return result
     

In [2]:
text = """Borrowed from Latin per sē (“by itself”), from per (“by, through”) and sē (“itself, himself, herself, themselves”)"""

text = normalize(text)
print(text)

borrowed from latin per sē by itself from per by through and sē itself himself herself themselves


# 1.2 Tokenize
Use the NLTK tokenizer to tokenize the text.

In [3]:
# tokenize text using nltk lib
import nltk


nltk.download('punkt')
nltk.download('wordnet')
nltk.download('averaged_perceptron_tagger')
nltk.download('stopwords')

def tokenize(text):
    result = nltk.word_tokenize(text)
    return result

[nltk_data] Downloading package punkt to /home/sabzirov/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to /home/sabzirov/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /home/sabzirov/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package stopwords to
[nltk_data]     /home/sabzirov/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [4]:
tokens = tokenize(text)
print(tokens)

['borrowed', 'from', 'latin', 'per', 'sē', 'by', 'itself', 'from', 'per', 'by', 'through', 'and', 'sē', 'itself', 'himself', 'herself', 'themselves']


# 1.3 Lemmatization
What is the difference between stemming and lemmatization?

**Stemming**: strips word endings with heuristic rules, so the result is not always a dictionary word (played -> play)

**Lemmatization**: converts the word to its dictionary form, the lemma (wolves -> wolf)
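
To make the contrast concrete, here is a deliberately tiny sketch. It is not how NLTK works: `naive_stem`, `toy_lemmatize`, the suffix list and the lookup table are all hypothetical and only illustrate that a stemmer applies string rules while a lemmatizer consults a vocabulary.

```python
SUFFIXES = ("ing", "ed", "es", "s")          # toy rule set, checked in this order
LEMMAS = {"wolves": "wolf", "played": "play", "better": "good"}  # toy dictionary


def naive_stem(word):
    """Cut the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def toy_lemmatize(word):
    """Look the word up in a dictionary; unknown words are returned unchanged."""
    return LEMMAS.get(word, word)


for w in ("played", "wolves", "better"):
    print(w, naive_stem(w), toy_lemmatize(w))
```

The rule-based version turns "wolves" into the non-word "wolv" and cannot relate "better" to "good", while the dictionary lookup handles both but only for words it knows.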

[Optional reading](https://towardsdatascience.com/state-of-the-art-multilingual-lemmatization-f303e8ff1a8)


In [5]:
def lemmatization(tokens):
    lemmatizer = nltk.stem.WordNetLemmatizer()
    result = []
    for token in tokens:
        # each token is tagged on its own, without sentence context
        speech_part = nltk.pos_tag([token])[0][1]
        if speech_part == 'VBN':
            result.append(lemmatizer.lemmatize(token, pos='v'))  # past participle: lemmatize as a verb
        else:
            result.append(lemmatizer.lemmatize(token))           # default: lemmatize as a noun
    return result

In [6]:
lemmed = lemmatization(tokens)
print(lemmed)

['borrow', 'from', 'latin', 'per', 'sē', 'by', 'itself', 'from', 'per', 'by', 'through', 'and', 'sē', 'itself', 'himself', 'herself', 'themselves']


# 1.4 Stop words
The next step is to remove stop words. Take the list of stop words from nltk.

In [7]:
stopwords = set(nltk.corpus.stopwords.words('english'))

def remove_stop_word(tokens):
    result = [word for word in tokens if word not in stopwords]
    return result

In [8]:
clean = remove_stop_word(lemmed)
print(clean)

['borrow', 'latin', 'per', 'sē', 'per', 'sē']


# 1.5 Pipeline
Run the complete pipeline in one function.

In [91]:
def preprocess(text, lemmatize=True):
    text = normalize(text, allow_asterix=True)
    text = tokenize(text)
    if lemmatize:
        text = lemmatization(text)
    text = remove_stop_word(text)
    return text


In [10]:
clean = preprocess(text)
print(clean)

['borrow', 'latin', 'per', 'sē', 'per', 'sē']


# 2. Collection

Download Reuters data from here:
https://archive.ics.uci.edu/ml/machine-learning-databases/reuters21578-mld/reuters21578.tar.gz

Read data description here:
https://archive.ics.uci.edu/ml/datasets/reuters-21578+text+categorization+collection

The function should return a list of strings (the raw texts). Remove HTML tags using the bs4 package.

## 2.1 Alternative (0.5 task bonus points)

Download songs from https://www.lyrics.com/ (the process takes time; 1000 documents should be enough for the sake of the exercise). Implement a text search on them. In this case you have to create a class *Song* with fields *title*, *artist* and *text*. The collection will contain a list of songs.

In [11]:
import hashlib
import os
import requests

class Document:
    def __init__(self, url):
        self.url = url
        self.content = None

    def get(self):
        if not self.load():
            if not self.download():
                raise FileNotFoundError(self.url)
            else:
                self.persist()

    def __get_filename(self):
        # The built-in hash() of a string changes between interpreter runs,
        # so use a digest of the URL to get a stable cache file name
        name = hashlib.sha256(self.url.encode('utf-8')).hexdigest()
        return name

    def download(self):
        try:
            r = requests.get(self.url)
            if r.status_code // 100 not in (2, 3):  # either 2.. or 3..
                return False
            self.content = r.content
            return True
        except requests.RequestException:
            return False

    def persist(self):
        if self.content is None:  # If there is nothing to save
            return False

        file_name = self.__get_filename()
        with open(file_name, "wb") as file:
            file.write(self.content)
        return True

    def load(self):
        # load content from the hard drive into self.content; return True on success
        file_name = self.__get_filename()
        if not os.path.isfile(file_name):  # no cached copy in the current folder
            return False

        with open(file_name, "rb") as file:
            self.content = file.read()
        return True
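
A note on the cache file name: Python salts the built-in `hash()` of strings per interpreter process unless `PYTHONHASHSEED` is fixed, so a name derived from it is generally not reproducible between sessions and a cached page may not be found again. A digest of the URL is stable. The sketch below assumes SHA-256 is acceptable for naming; `cache_filename` is a hypothetical helper used only for illustration.

```python
import hashlib


def cache_filename(url):
    """Return a file name that is identical in every interpreter session."""
    return hashlib.sha256(url.encode("utf-8")).hexdigest()


name = cache_filename("https://www.lyrics.com/lyric/3525851/")
print(len(name), name[:12])
```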

In [159]:
from bs4 import BeautifulSoup
from bs4.element import Comment

BASE_URL = "https://www.lyrics.com/lyric/{}/"
SONGS_LIMIT = 1000    # number of songs to process
START_SONG = 3525851  # song id to start searching from (chosen randomly)

class Song:
    def __init__(self, title, artists, text):
        self.title = title
        self.artists = artists
        self.text = text

    def __str__(self):
        # Preview the first non-empty line of the lyrics, at most 25 characters;
        # splitlines() also drops the "\r" of "\r\n" line endings
        lines = self.text.strip().splitlines()
        preview = lines[0][:25] if lines else ""
        s = f"Title: `{self.title}`, Artists: `{self.artists}`, Text: `{preview}...`"
        return s


def tag_visible(element):
    if element.parent.name in ['style', 'script', 'head', 'title', 'meta', '[document]']:
        return False
    if isinstance(element, Comment):
        return False
    return True


def get_collection():
    collection = []
    i = START_SONG
    while len(collection) < SONGS_LIMIT:
        url = BASE_URL.format(i)

        try:
            doc = Document(url)
            doc.get()
            html_page = doc.content
        except Exception:
            i -= 10
            print("Skipped")
            continue

        soup = BeautifulSoup(html_page, "html.parser")

        title_tags = soup.find_all(id="lyric-title-text")
        if not title_tags:  # not a lyrics page
            i -= 10
            print("Skipped")
            continue

        title = title_tags[0].contents[0]
        artists = soup.find_all(class_="lyric-artist")
        artists = artists[0].find_next('a', href=True).contents[0]

        text = ""
        for t in soup.find_all(id="lyric-body-text"):
            if tag_visible(t):  # if the tag is visible to the user
                for s in t:
                    if tag_visible(s):
                        try:
                            text += s               # plain text node
                        except TypeError:
                            text += s.contents[0]   # nested tag: take its first child
        song = Song(title, artists, text)
        collection.append(song)
        print(song)
        i -= 1

    return collection

In [160]:
collection = get_collection()
print(len(collection))

Title: `Georgia on My Mind`, Artists: `Billie Holiday`, Text: `Georgia...`
Title: `Loveless Love [Take 1]`, Artists: `Billie Holiday`, Text: `Love is like a hydrant tu...`
Title: `What Is This Going to Get Us?`, Artists: `Billie Holiday`, Text: `What is this going to get...`
Title: `Night and Day`, Artists: `Billie Holiday`, Text: `Like the beat, beat, beat...`
Title: `Under a Blue Jungle Moon`, Artists: `Billie Holiday`, Text: `Life's just like a dream...`
Title: `More Than You Know`, Artists: `Billie Holiday`, Text: `Whether you are here or y...`
Title: `Let's Dream in the Moonlight`, Artists: `Billie Holiday`, Text: `Let's dream in the moonli...`
Title: `Hello My Darling`, Artists: `Billie Holiday`, Text: `I'll forget your tender k...`
Title: `You're So Desirable`, Artists: `Billie Holiday`, Text: `You're so desirable...`
Title: `Please Keep Me in Your Dreams`, Artists: `Billie Holiday`, Text: `Please keep me in your dr...`
Title: `Let's Call a Heart a Heart`, Artists: `Billie Holiday`, Text: `

Title: `Hey You [2011 Remastered Version]`, Artists: `Pink Floyd`, Text: `Hey you, out there in the...`
Title: `Goodbye Cruel World`, Artists: `Pink Floyd`, Text: `Goodbye cruel world,...`
Title: `Another Brick in the Wall, Pt. 3`, Artists: `Pink Floyd`, Text: `I don't need no walls aro...`
Title: `Don't Leave Me Now`, Artists: `Pink Floyd`, Text: `Ooooh, babe...`
Title: `One of My Turns`, Artists: `Pink Floyd`, Text: `"Oh my God! What a fabulo...`
Title: `Young Lust`, Artists: `Pink Floyd`, Text: `I am just a new boy...`
Title: `Empty Spaces`, Artists: `Pink Floyd`, Text: `What shall we use...`
Title: `Goodbye Blue Sky`, Artists: `Pink Floyd`, Text: `Look mummy, there's an ae...`
Title: `Mother`, Artists: `Pink Floyd`, Text: `Mother do you think they'...`
Title: `Another Brick in the Wall, Pt. 2`, Artists: `Pink Floyd`, Text: `We don't need no educatio...`
Title: `The Happiest Days of Our Lives`, Artists: `Pink Floyd`, Text: `You! Yes, you! Stand stil...`
Title: `Another Brick in the Wall, Pt. 1`, Ar

Skipped
Skipped
Skipped
Skipped
Skipped
Skipped
Title: `Get Out of Town`, Artists: `Russell Malone`, Text: `Get out of town...`
Skipped
Skipped
Skipped
Skipped
Skipped
Skipped
Title: `Backyard [Single Version]`, Artists: `Salt-N-Pepa`, Text: `Do you get suspicious?...`
Title: `Girlfriend [Single Version]`, Artists: `Pebbles`, Text: `To believe...`
Title: `Giving You the Benefit [Single Version]`, Artists: `Pebbles`, Text: `Lately I´ve been trying ...`
Skipped
Title: `No Hay Que Llorar`, Artists: `Thalía`, Text: `Si piensas que el mundo p...`
Title: `Arrasando`, Artists: `Thalía`, Text: `Arrasando oye papi damelo...`
Title: `Reencarnación`, Artists: `Thalía`, Text: `Cuando una pregunta ...`
Title: `Regresa a Mi`, Artists: `Thalía`, Text: `Oh señor...`
Title: `Entre el Mar y una Estrella`, Artists: `Thalía`, Text: `Aunque te hayas ido sigue...`
Skipped
Skipped
Skipped
Skipped
Skipped
Skipped
Skipped
Skipped
Skipped
Skipped
Skipped
Skipped
Skipped
Skipped
Skipped
Skipped
Skipped
Skipped
Skipped
Skipped
Titl

Title: `Respect Yourself`, Artists: `The Staple Singers`, Text: `If you disrespect anybody...`
Skipped
Skipped
Skipped
Title: `World Weary Eyes`, Artists: `Arid`, Text: `You should have seen her...`
Title: `Me and My Melody`, Artists: `Arid`, Text: `Oh lord, it's pulling me ...`
Title: `Dearly Departed`, Artists: `Arid`, Text: `Fingers numb, feel sick a...`
Title: `Believer`, Artists: `Arid`, Text: `If I could be someone...`
Title: `Little Things of Venom`, Artists: `Arid`, Text: `Out on the scene today...`
Title: `All Will Wait`, Artists: `Arid`, Text: `We wait for love to come...`
Title: `Too Late Tonight`, Artists: `Arid`, Text: `It's too late tonight...`
Title: `At the Close of Every Day`, Artists: `Arid`, Text: `At the close of every day...`
Skipped
Skipped
Skipped
Skipped
Skipped
Skipped
Skipped
Skipped
Skipped
Skipped
Skipped
Skipped
Skipped
Skipped
Skipped
Skipped
Skipped
Skipped
Skipped
Skipped
Skipped
Skipped
Skipped
Skipped
Skipped
Skipped
Title: `We'll Meet Again`, Artists: `Vera Lynn`, Te

Title: `Wild Thing`, Artists: `Tone-Loc`, Text: `Let's do it...`
Skipped
Title: `Noche de Ronda`, Artists: `Alejandro Algara`, Text: `Noche de ronda ...`
Skipped
Skipped
Skipped
Skipped
Skipped
Skipped
Skipped
Skipped
Title: `How High the Moon`, Artists: `Jerry Vale`, Text: `Somewhere there's music...`
Skipped
Skipped
Skipped
Skipped
Title: `The End of the World`, Artists: `Skeeter Davis`, Text: `Why does the sun go on sh...`
Title: `Teddy Bear Song`, Artists: `Barbara Fairchild`, Text: `I wish I had button eyes ...`
Title: `Queen of the House`, Artists: `Jody Miller`, Text: `Up every day at six ...`
Title: `Harper Valley P.T.A.`, Artists: `Jeannie C. Riley`, Text: `I wanna tell you all the ...`
Title: `Help Me Make It Through the Night`, Artists: `Sammi Smith`, Text: `Take the ribbon from my h...`
Title: `Don't Touch Me`, Artists: `Jeannie Seely`, Text: `Your hand is like a torch...`
Title: `Right or Wrong`, Artists: `Wanda Jackson`, Text: `Right or wrong I'll be wi...`
Title: `Satin Sheets`, Artists

Title: `Dancing in the Dark`, Artists: `Frank Sinatra`, Text: `Dancin' in the dark 'til ...`
Title: `Come Fly With Me`, Artists: `Frank Sinatra`, Text: `Come fly with me, let's f...`
Skipped
Skipped
Skipped
Skipped
Skipped
Skipped
Skipped
Skipped
Skipped
Skipped
Skipped
Title: `Body and Soul`, Artists: `Pete Johnson`, Text: `My heart is sad and lonel...`
Skipped
Skipped
Skipped
Skipped
Skipped
Skipped
Skipped
Skipped
Skipped
Skipped
Skipped
Title: `Le fils de Superman`, Artists: `Céline Dion`, Text: `Tout comme son père...`
Title: `Le blues du businessman`, Artists: `Céline Dion`, Text: `J'ai du succès dans mes a...`
Skipped
Skipped
Title: `On My Radio '91`, Artists: `The Selecter`, Text: `Someone who loves me swit...`
Skipped
Skipped
Skipped
Skipped
Skipped
Skipped
Skipped
Skipped
Skipped
Skipped
Skipped
Skipped
Skipped
Skipped
Skipped
Skipped
Skipped
Skipped
Skipped
Skipped
Skipped
Skipped
Skipped
Skipped
Skipped
Skipped
Skipped
Title: `If You Must Leave My Life`, Artists: `Richard Harri

Skipped
Title: `Outlaw`, Artists: `Axel Rudi Pell`, Text: `Oh so long, hide your fac...`
Title: `Casbah`, Artists: `Axel Rudi Pell`, Text: `Here I go and make my fin...`
Skipped
Skipped
Skipped
Skipped
Skipped
Skipped
Skipped
Skipped
Skipped
Skipped
Skipped
Skipped
Skipped
Skipped
Title: `Sad Sweet Dreamer`, Artists: `Sweet Sensation`, Text: `Sad sweet dreamer ...`
Title: `This Will Be (An Everlasting Love)`, Artists: `Natalie Cole`, Text: `This will be...`
Title: `Bad Time`, Artists: `Grand Funk Railroad`, Text: `I'm in love with the girl...`
Title: `All By Myself`, Artists: `Eric Carmen`, Text: `When I was young...`
Title: `Magic`, Artists: `Pilot`, Text: `It's magic you know...`
Title: `Lovin' You`, Artists: `Minnie Riperton`, Text: `Lovin' you is easy 'cause...`
Title: `How Long`, Artists: `Ace`, Text: `How long has this been go...`
Skipped
Title: `Former Lee Warmer`, Artists: `Alice Cooper`, Text: `In an upstairs room, unde...`
Title: `Enough's Enough`, Artists: `Alice Cooper`, Text: `Enough's en

Title: `Grant Hart`, Artists: `The Posies`, Text: `I can't cry, I can't appl...`
Title: `Precious Moments`, Artists: `The Posies`, Text: `The bloodless toil...`
Title: `Start a Life`, Artists: `The Posies`, Text: `I don't know why...`
Skipped
Title: `Someone's Gonna Break Your Heart`, Artists: `Jill Sobule`, Text: `Someone's gonna break you...`
Title: `Guy Who Doesn't Get It`, Artists: `Jill Sobule`, Text: `Can't you tell that I am ...`
Title: `Somewhere in New Mexico`, Artists: `Jill Sobule`, Text: `I have a friend, swear sh...`
Title: `Mary Kay`, Artists: `Jill Sobule`, Text: `Mary Kay, she's got it ba...`
Title: `Heroes`, Artists: `Jill Sobule`, Text: `Why are all our heroes so...`
Title: `Mexican Wrestler`, Artists: `Jill Sobule`, Text: `Sometimes I wish that I w...`
Title: `Claire`, Artists: `Jill Sobule`, Text: `Dear Claire she gets up a...`
Title: `Lucy at the Gym`, Artists: `Jill Sobule`, Text: `Lucy at the gym...`
Title: `One of These Days`, Artists: `Jill Sobule`, Text: `One of these

Title: `Cry`, Artists: `The Raspberries`, Text: `Cry...`
Title: `I Can Hardly Believe You're Mine`, Artists: `The Raspberries`, Text: `Darlin' I was feelin' som...`
Title: `Cruisin' Music`, Artists: `The Raspberries`, Text: `Get up in the morning, ch...`
Title: `All Through the Night`, Artists: `The Raspberries`, Text: `Hey there, sugar, let me ...`
Title: `Rose Coloured Glasses`, Artists: `The Raspberries`, Text: `I'm an ivory tower boy...`
Title: `I Don't Know What I Want`, Artists: `The Raspberries`, Text: `My old man says success i...`
Skipped
Title: `Last Dance`, Artists: `The Raspberries`, Text: `Won't you let me get to k...`
Title: `Tonight`, Artists: `Eric Carmen`, Text: `One, two, three, four...`
Skipped
Skipped
Skipped
Skipped
Skipped
Skipped
Skipped
Title: `Femme Fatale`, Artists: `Nico`, Text: `Here she comes, you bette...`
Skipped
Skipped
Skipped
Skipped
Skipped
Skipped
Skipped
Skipped
Skipped
Skipped
Skipped
Skipped
Skipped
Skipped
Skipped
Skipped
Skipped
Skipped
Skipped
Skipped
Skip

Title: `Rush`, Artists: `Big Audio Dynamite`, Text: `If I have my time again...`
Title: `Gonna Make You Sweat (Everybody Dance Now)`, Artists: `C+C Music Factory`, Text: `Everybody dance now...`
Title: `Try a Piece of My Love`, Artists: `Wild Cherry`, Text: `Oh, I can see by the way ...`
Skipped
Title: `To Love Somebody`, Artists: `Bonnie Tyler`, Text: `There's a light...`
Title: `Take Me Back`, Artists: `Bonnie Tyler`, Text: `You said it's over now ...`
Title: `If You Were a Woman (And I Was a Man)`, Artists: `Bonnie Tyler`, Text: `How's it feel to be a wom...`
Skipped
Skipped
Skipped
Skipped
Skipped
Title: `I'll Feel a Whole Lot Better`, Artists: `The Byrds`, Text: `The reason why oh, I can'...`
Skipped
Title: `Celebration`, Artists: `Argent`, Text: `Celebration...`
Skipped
Skipped
Skipped
Skipped
Skipped
Title: `Reasons`, Artists: `Earth, Wind & Fire`, Text: `Now, I'm craving your bod...`
Title: `Me and Mrs. Jones`, Artists: `Billy Paul`, Text: `Me and Mrs. Jones...`
Title: `Best of My Love`, Artists: 

Title: `She Shot a Hole in My Soul`, Artists: `Clifford Curry`, Text: `Oh...`
Title: `Make Me Yours`, Artists: `Bettye Swann`, Text: `I never had a love to cal...`
Title: `Lipstick Traces (On a Cigarette)`, Artists: `The O'Jays`, Text: `Don't ever leave me...`
Title: `Love Makes a Woman`, Artists: `Barbara Acklin`, Text: `In the fire...`
Title: `I Sold My Heart to the Junkman`, Artists: `The Starlets`, Text: `All over town they're tal...`
Title: `The "In" Crowd`, Artists: `Dobie Gray`, Text: `I'm in with the in crowd ...`
Title: `Memphis Soul Stew`, Artists: `King Curtis`, Text: `Today's special is Memphi...`
Title: `Function at the Junction`, Artists: `Shorty Long`, Text: `I'm getting ready for the...`
Title: `Stay with Me`, Artists: `Lorraine Ellison`, Text: `Where did you go when thi...`
Title: `In the Heat of the Night`, Artists: `Ray Charles`, Text: `In the heat of the night...`
Title: `Wish Someone Would Care`, Artists: `Irma Thomas`, Text: `Cry, cry...`
Skipped
Title: `Sunny`, Artists: `Bobby H

Skipped
Title: `Honey In the Honeycomb (extended version)`, Artists: `Lena Horne`, Text: `In this cloudy sky overhe...`
Skipped
Skipped
Skipped
Skipped
Skipped
Skipped
Skipped
Skipped
Skipped
Skipped
Skipped
Skipped
Skipped
Skipped
Skipped
Skipped
Skipped
Title: `Airwave [DVD]`, Artists: `Rank 1`, Text: `I feel you...`
Title: `New York City Boy [Lange RMX]`, Artists: `Pet Shop Boys`, Text: `When you're a boy...`
Skipped
Skipped
Title: `Samurai Showdown`, Artists: `RZA`, Text: `Yo, it's a samurai showdo...`
Skipped
Skipped
Skipped
Skipped
Title: `Bring the Boys Back Home`, Artists: `Pink Floyd`, Text: `Bring the boys back home...`
Title: `Vera`, Artists: `Pink Floyd`, Text: `Does anybody here remembe...`
Title: `Nobody Home`, Artists: `Pink Floyd`, Text: `I've got a little black b...`
Title: `Is There Anybody Out There?`, Artists: `Pink Floyd`, Text: `"Well, only got an hour o...`
Title: `Hey You [2011 Remastered Version]`, Artists: `Pink Floyd`, Text: `Hey you, out there in the...`
Title: `Goodbye

Skipped
Skipped
Title: `Goodnight Irene`, Artists: `The Weavers`, Text: `Irene, goodnight...`
Skipped
Title: `Catch The Wind`, Artists: `Donovan`, Text: `In the chilly hours and m...`
Title: `We'll Sing in the Sunshine`, Artists: `Gale Garnett`, Text: `We'll sing in the sunshin...`
Title: `Where Have All the Flowers Gone?`, Artists: `The Kingston Trio`, Text: `Where have all the flower...`
Skipped
Skipped
Skipped
Skipped
Skipped
Skipped
Skipped
Skipped
Title: `Deep Night [1929]`, Artists: `The Three Suns`, Text: `Deep night, stars in the ...`
Skipped
Skipped
Title: `No No No`, Artists: `Tony Touch`, Text: `No no no...`
Title: `Return of the Diaz Bros.`, Artists: `Tony Touch`, Text: `[speaking to Pain in Da A...`
Skipped
Skipped
Skipped
Skipped
Skipped
Title: `What Are You Doing the Rest of Your Life`, Artists: `Melvin Rhyne`, Text: `I want to see your face i...`
Skipped
Skipped
Skipped
Skipped
Skipped
Title: `That's Not What I Said`, Artists: `Anita Cochran`, Text: `That's not what i said 

In [161]:
print(collection[0])

Title: `Georgia on My Mind`, Artists: `Billie Holiday`, Text: `Georgia...`


# 3. Inverted index
You will work with the boolean search model. Construct a dictionary which maps words to the postings.  

In [194]:
def make_inverted_index(collection):
    inverted_index = dict() # {term: [list of documents]}
    for i, song in enumerate(collection):
        text = preprocess(song.text, lemmatize=True)
        for term in text:
            if term in inverted_index:
                inverted_index[term].add(i)
            else:
                inverted_index[term] = {i}

    return inverted_index

In [195]:
inverted_index = make_inverted_index(collection)
print(inverted_index['love'])

{1, 2, 3, 4, 5, 6, 7, 9, 12, 13, 14, 21, 23, 24, 25, 26, 27, 29, 31, 32, 33, 34, 45, 48, 49, 61, 62, 63, 65, 81, 89, 91, 92, 93, 94, 99, 101, 102, 104, 106, 109, 111, 115, 117, 118, 120, 121, 122, 123, 125, 126, 127, 128, 129, 130, 131, 132, 136, 137, 139, 140, 141, 142, 143, 146, 147, 148, 149, 150, 151, 152, 153, 158, 160, 161, 162, 170, 171, 175, 179, 180, 181, 183, 184, 185, 186, 187, 188, 193, 204, 205, 206, 207, 208, 210, 212, 215, 216, 217, 220, 229, 230, 232, 234, 235, 241, 242, 245, 246, 247, 249, 250, 251, 252, 254, 255, 256, 257, 258, 261, 262, 266, 267, 268, 272, 274, 281, 286, 287, 288, 291, 295, 297, 301, 303, 304, 309, 311, 312, 313, 317, 318, 319, 320, 321, 322, 324, 325, 326, 327, 328, 330, 331, 332, 333, 335, 336, 339, 351, 353, 355, 356, 360, 366, 367, 368, 369, 370, 372, 373, 376, 377, 378, 379, 382, 383, 384, 385, 386, 389, 390, 391, 395, 396, 401, 404, 406, 407, 411, 414, 416, 417, 418, 427, 430, 432, 433, 434, 435, 437, 442, 443, 445, 446, 452, 455, 458, 459, 460

# 4. Query processing

Given a search query, find all relevant documents. In the Boolean model, a document is relevant if it contains every word of the query.

Return the list of indexes of the relevant documents.
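
The notebook stores postings as Python sets and intersects them directly. For comparison, here is a minimal sketch of the classic alternative, a linear merge of sorted postings lists. The names `intersect`, `boolean_and` and `toy_index` are hypothetical and the toy index is made up for illustration.

```python
def intersect(p1, p2):
    """Merge two sorted postings lists, keeping ids present in both."""
    i = j = 0
    result = []
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            result.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return result


def boolean_and(terms, index):
    """AND query: a document must appear in the postings of every term."""
    postings = [sorted(index.get(term, ())) for term in terms]
    if not postings:
        return []
    postings.sort(key=len)          # start with the rarest term
    result = postings[0]
    for other in postings[1:]:
        result = intersect(result, other)
    return result


toy_index = {"love": {1, 2, 5, 7}, "true": {2, 5, 9}, "blue": {5}}
print(boolean_and(["love", "true"], toy_index))
print(boolean_and(["love", "missing"], toy_index))
```

Processing the shortest postings list first keeps the intermediate result small, which matters more on large collections than on this one.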

In [196]:
def search(query):
    query = preprocess(query)
    print(query)
    relevant_documents = None
    for term in query:
        if term not in inverted_index:  # a term that never occurs rules out every document
            return []
        postings = inverted_index[term]
        if relevant_documents is None:  # first term of the query
            relevant_documents = postings
        else:
            relevant_documents = relevant_documents.intersection(postings)
    return list(relevant_documents) if relevant_documents else []

In [197]:
query = 'love true' # change to something else if you are searching song lyrics
relevant = search(query)  # select how many arguments to pass to the function

print(len(relevant))
print(relevant)

['love', 'true']
66
[512, 640, 514, 256, 132, 5, 257, 768, 261, 777, 522, 523, 267, 141, 526, 142, 268, 149, 414, 543, 33, 547, 932, 297, 171, 811, 815, 304, 821, 311, 185, 61, 574, 319, 903, 321, 711, 715, 460, 974, 207, 465, 467, 596, 854, 857, 473, 474, 987, 93, 861, 477, 992, 376, 250, 747, 492, 506, 883, 500, 254, 120, 126, 637, 638, 639]


# Advanced query processing


In [198]:
class PrefixTree():  
    def __init__(self, letter, value=None):
        self.stop = False
        self.letter = letter
        self.children = dict()
        self.value = value
        
    def add_word(self, word, value=None):
        if len(word) == 0:
            self.stop = True
            self.value = value
            return
        
        letter = word[0]
        if letter not in self.children:
            self.children[letter] = PrefixTree(letter)
        # pass the value down so that it reaches the terminal node
        self.children[letter].add_word(word[1:], value)
    
    def contains(self, word):
        if len(word) == 0:
            return self.stop
        
        letter = word[0]
        if letter not in self.children.keys():
            return False
        return self.children[letter].contains(word[1:])
    
    def print(self, prev = ''):
        if self.stop:
            print(prev + self.letter)
        for letter, child in self.children.items():
            child.print(prev + self.letter)
    
    def heirs(self, prev=''):
        res = list()
        if self.stop:
            res.append(prev)
        for letter, child in self.children.items():
            q = child.heirs(prev + letter)
            for x in q:
                res.append(x)
        return res
    
    def find_words(self, pattern, prev=''):
        assert pattern.count("*") == 0
        
        if len(pattern) == 0:
            return self.heirs(prev)
        
        letter = pattern[0]
        if letter not in self.children.keys():
            return []
        
        return self.children[letter].find_words(pattern[1:], prev+letter)

In [199]:
def shifted_permutations(term):
    term = term + "$"
    n = len(term)
    perms = []
    for i in range(n):
        perms.append(term[i:] + term[:i])
    return perms
  
  
def make_permuterm_index(collection):
    index = PrefixTree('')
    for i, song in enumerate(collection):
        text = preprocess(song.text, lemmatize=False)
        for term in text:
            perms = shifted_permutations(term)
            for perm in perms:
                index.add_word(perm, value=term)
    return index

In [200]:
permuterm_index = make_permuterm_index(collection[:2])
#permuterm_index.print()

q = permuterm_index.find_words("go")
print(q, len(q))

['gone$', 'goldiess$', 'gold$'] 3
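
The wildcard lookup built on this index works by rotating the query until the `*` sits at the end, which turns any single-wildcard pattern into a prefix search over the stored rotations. The sketch below shows the idea on a made-up vocabulary; it uses a plain dictionary scan instead of the prefix tree, and `rotations`, `wildcard_prefix` and `restore` are hypothetical helper names.

```python
def rotations(term):
    """All rotations of term + '$', where '$' marks the end of the word."""
    marked = term + "$"
    return [marked[i:] + marked[:i] for i in range(len(marked))]


def wildcard_prefix(pattern):
    """Rotate 'head*tail' into 'tail$head', the prefix to look up."""
    head, tail = pattern.split("*")
    return tail + "$" + head


def restore(rotation):
    """Undo a rotation: move everything after '$' back to the front."""
    pos = rotation.index("$")
    return rotation[pos + 1:] + rotation[:pos]


vocabulary = ["love", "like", "lemon", "gone"]
all_rotations = [rot for term in vocabulary for rot in rotations(term)]

prefix = wildcard_prefix("l*e")
matches = sorted({restore(rot) for rot in all_rotations if rot.startswith(prefix)})
print(prefix, matches)
```

A prefix tree answers the `startswith` step without scanning every rotation, which is why the notebook stores the rotations in one.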


In [201]:
def soundex(term):
    first = term[0]
    term = term[1:]
    # Change letters to digits as follows:
    term = re.sub("[aeiouhwy]", "0", term)  # A, E, I, O, U, H, W, Y
    term = re.sub("[bfpv]", "1", term)      # B, F, P, V → 1
    term = re.sub("[cgjkqsxz]", "2", term)  # C, G, J, K, Q, S, X, Z → 2
    term = re.sub("[dt]", "3", term)        # D,T → 3
    term = re.sub("l", "4", term)           # L → 4
    term = re.sub("[mn]", "5", term)        # M, N → 5
    term = re.sub("r", "6", term)           # R → 6
    if len(term):
        term = term[0] + "".join([term[i] for i in range(1, len(term)) if term[i] != term[i - 1]]) # Collapse each run of identical consecutive digits into one.
    term = re.sub("0", "", term) # Remove all zeros from the resulting string
    while len(term) < 3:
        term = term + "0"
    term = first.upper() + term[:3]
    return term


def make_soundex_index(collection):
    index = dict()
    for i, song in enumerate(collection):
        text = preprocess(song.text, lemmatize=False)
        for term in text:
            s = soundex(term)
            if s in index:
                index[s].add(term)
            else:
                index[s] = set([term])
    return index


def levenshtein_distance(s, t):
    s = " " + s
    t = " " + t
    dp = [[0] * len(t) for _ in range(len(s))]
    for i in range(1, len(s)):
        dp[i][0] = i
    for j in range(1, len(t)):
        dp[0][j] = j
    for i in range(1, len(s)):
        for j in range(1, len(t)):
            if s[i] == t[j]:
                diag = dp[i - 1][j - 1]
            else:
                diag = dp[i - 1][j - 1] + 1
            up = dp[i - 1][j] + 1
            left = dp[i][j - 1] + 1
            dp[i][j] = min(diag, up, left)
    return dp[-1][-1]

In [202]:
soundex_index = make_soundex_index(collection[:2])
print(soundex_index)

{'G620': {'georgia'}, 'W400': {'whole'}, 'D000': {'day'}, 'O430': {'old'}, 'S300': {'sweet', 'said'}, 'S520': {'song'}, 'K120': {'keeps'}, 'M530': {'mind'}, 'C520': {'comes'}, 'C450': {'clean'}, 'M542': {'moonlight'}, 'P520': {'pines'}, 'A652': {'arms'}, 'R200': {'reach'}, 'E200': {'eyes'}, 'S540': {'smile'}, 'T536': {'tenderly'}, 'S340': {'still'}, 'P214': {'peaceful'}, 'D652': {'dreams'}, 'S000': {'saw', 'see'}, 'R320': {'roads'}, 'L320': {'leads'}, 'B200': {'back'}, 'P200': {'peace'}, 'F530': {'find'}, 'L100': {'love'}, 'L200': {'less', 'like'}, 'H365': {'hydrant'}, 'T652': {'turns'}, 'F653': {'friendships'}, 'M520': {'moneys'}, 'G500': {'gone'}, 'S353': {'stands'}, 'L500': {'loan'}, 'S620': {'sharks'}, 'H632': {'hearts'}, 'T520': {'times', 'tongs'}, 'S365': {'strong'}, 'W520': {'wings'}, 'A614': {'aeroplane'}, 'B630': {'broad'}, 'W430': {'would'}, 'F400': {'full', 'fly', 'flaw'}, 'A000': {'away'}, 'F616': {'forever'}, 'H160': {'hover'}, 'C500': {'come'}, 'O000': {'oh'}, 'L142': {'l

In [203]:
print(levenshtein_distance("hello", "hell"))
print(levenshtein_distance("innopolis", "innopoints"))
print(levenshtein_distance("inno", "inna"))

1
3
1
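
The two pieces above are meant to work together: Soundex narrows the vocabulary to words that sound alike, and edit distance picks the closest of those. The following self-contained sketch illustrates that selection on a made-up vocabulary. `soundex_key`, `edit_distance` and `suggest` are hypothetical re-implementations for illustration; unlike the notebook's `soundex`, this `soundex_key` treats every unlisted character as a vowel.

```python
CODES = {"bfpv": "1", "cgjkqsxz": "2", "dt": "3", "l": "4", "mn": "5", "r": "6"}


def soundex_key(term):
    """First letter plus three digits; unlisted characters count as '0'."""
    digits = ""
    for ch in term[1:]:
        digit = next((d for letters, d in CODES.items() if ch in letters), "0")
        if not digits or digits[-1] != digit:   # collapse runs of equal digits
            digits += digit
    digits = digits.replace("0", "")
    return term[0].upper() + (digits + "000")[:3]


def edit_distance(s, t):
    """Levenshtein distance using two rows of the dynamic-programming table."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        curr = [i]
        for j, ct in enumerate(t, 1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (cs != ct)))
        prev = curr
    return prev[-1]


def suggest(word, vocabulary):
    """Closest vocabulary words among those sharing the word's Soundex key."""
    bucket = [v for v in vocabulary if soundex_key(v) == soundex_key(word)]
    if not bucket:
        return []
    best = min(edit_distance(word, v) for v in bucket)
    return sorted(v for v in bucket if edit_distance(word, v) == best)


vocabulary = ["forget", "forged", "fright", "love"]
print(soundex_key("foregt"), suggest("foregt", vocabulary))
```

Restricting the comparison to one Soundex bucket avoids computing the edit distance against the whole vocabulary, at the cost of missing corrections whose first letter or consonant pattern differs.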


In [214]:
from itertools import chain

INF = 10**9

soundex_index = make_soundex_index(collection)
inverted_index = make_inverted_index(collection)
permuterm_index = make_permuterm_index(collection)


def process_wildcard(term):
    i = term.find("*")
    perm = term[i+1:] + "$" + term[:i]
    candidates = permuterm_index.find_words(perm)
    for i, cand in enumerate(candidates):
        pos = cand.find("$")
        candidates[i] = cand[pos + 1:] + cand[:pos]
    docs = [inverted_index[cand] for cand in candidates if cand in inverted_index]
    docs = set().union(*docs)
    return docs


def process_regular(term):
    docs = inverted_index[term]
    return docs


def process_misspelled(term):
    sx = soundex(term)
    if sx not in soundex_index:
        return set()

    candidates = soundex_index[sx]
    best_cand = set()
    min_dist = INF

    for cand in candidates:
        dist = levenshtein_distance(term, cand)
        if dist == min_dist:
            best_cand.add(cand)
        elif dist < min_dist:
            min_dist = dist
            best_cand = set([cand])

    best_cand = [lemmatization([cand])[0] for cand in best_cand]
    docs = [inverted_index[cand] for cand in best_cand if cand in inverted_index]
    docs = set().union(*docs)       
    return docs


def advanced_search(query):
    query = preprocess(query, lemmatize=False)
    ans = None
    
    for term in query:
        assert term.count("*") <= 1
        lemmatized_term = lemmatization([term])[0]
        
        if "*" in term:
            docs = process_wildcard(term)
            
        elif lemmatized_term in inverted_index:
            docs = process_regular(lemmatized_term)
        
        else:
            docs = process_misspelled(term)
        
        if ans is None:
            ans = docs
        else:
            ans = ans.intersection(docs)
    return ans if ans is not None else set()  # empty query after preprocessing

In [215]:
res = advanced_search("l*e foregt")
print(res)

{516, 773, 774, 7, 268, 14, 528, 657, 146, 530, 660, 917, 152, 665, 923, 411, 799, 33, 673, 547, 680, 427, 46, 818, 691, 567, 569, 189, 65, 972, 591, 977, 595, 212, 469, 347, 606, 609, 231, 106, 107, 747, 237, 110, 368, 633, 375, 121, 379, 763}


In [216]:
res = advanced_search("inn* you")
print(res)

{290, 516, 903, 111, 660, 469, 822, 665, 349, 957, 190, 447}


In [217]:
print(collection[516])
print(collection[516].text)

Title: `Help Is on Its Way`, Artists: `Little River Band`, Text: `Help Is On Its Way...`
Help Is On Its Way
by The Little River Band
Why're you in so much hurry?
Is it really worth the worry?

Look around.
Then slow down.
What's it like inside the bubble?
Does your head ever give you trouble?
It's no sin.
Trade it in.

[Chorus]

Hang on.
Help is on its way.
I'll be there as fast as I can.
Hang on,

A tiny voice did say,
From somewhere deep inside the inner man.
Are you always in confusion?
Surrounded by illusion?

Sort it out.
You'll make out.
Seem to make a good beginning
Someone else ends up winning.

Don't seem fair.
Don't you care?

[Chorus]

Don't you forget who'll take care of you.
It don't matter what you do.
Form a duet let him sing melody.
You'll provide the harmony.

Why're you in so much hurry?
Is it really worth the worry?
Look around.
Then slow down.

What's it like inside the bubble?
Does your head ever give you trouble?
It's