# 1. Text processing

We will build a text preprocessing pipeline.

# 1.1 Normalization

The first step is normalization.
It might include:
* converting all letters to lower or upper case
* converting numbers into words or removing numbers
* removing punctuation, accent marks and other diacritics
* removing extra whitespace
* expanding abbreviations

For this exercise, it is enough to produce lowercase text with special characters and digits removed and redundant whitespace collapsed.
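
One of the optional steps listed above, removing accent marks and other diacritics, can be sketched with the standard library. This is a minimal illustration, not part of the exercise's `normalize` function (which keeps characters such as `ē`); the helper name `strip_accents` is hypothetical.

```python
import unicodedata

def strip_accents(text):
    # Decompose characters (NFD) so that "ē" becomes "e" + a combining macron,
    # then drop the combining marks (Unicode category "Mn").
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")

print(strip_accents("per sē"))  # per se
```

Whether to apply this step depends on the language: in some languages diacritics distinguish otherwise identical words, so stripping them can merge unrelated terms.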

How could neural networks be applied to text normalization?

In [0]:
import re
# normalize text
def normalize(text):
    text = text.lower()
    text = re.sub(r"'", '', text)                          # remove apostrophes
    text = re.sub(r'[!@#$.\-+*—,():“”]', ' ', text)        # replace the listed punctuation signs with spaces
    text = re.sub(r'[0-9]', ' ', text)                     # replace all digits with spaces
    result = " ".join(text.split())                        # collapse runs of whitespace into single spaces
    return result
     

In [4]:
text = """Borrowed from Latin per sē (“by itself”), from per (“by, through”) and sē (“itself, himself, herself, themselves”)"""

text = normalize(text)
print(text)

borrowed from latin per sē by itself from per by through and sē itself himself herself themselves


# 1.2 Tokenize
Use the NLTK tokenizer to tokenize the text.

In [5]:
# tokenize text using nltk lib
import nltk


nltk.download('punkt')
nltk.download('wordnet')
nltk.download('averaged_perceptron_tagger')
nltk.download('stopwords')

def tokenize(text):
    result = nltk.word_tokenize(text)
    return result

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Unzipping corpora/wordnet.zip.
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


In [6]:
tokens = tokenize(text)
print(tokens)

['borrowed', 'from', 'latin', 'per', 'sē', 'by', 'itself', 'from', 'per', 'by', 'through', 'and', 'sē', 'itself', 'himself', 'herself', 'themselves']


# 1.3 Lemmatization
What is the difference between stemming and lemmatization?

**Stemming**: cuts off the ending of a word using heuristic rules; the result need not be a dictionary word (played -> play).

**Lemmatization**: converts a word to its dictionary form, the lemma (wolves -> wolf).

[Optional reading](https://towardsdatascience.com/state-of-the-art-multilingual-lemmatization-f303e8ff1a8)
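
To make the distinction concrete, here is a toy sketch. It is not NLTK's Porter stemmer or WordNet lemmatizer: the suffix list, the lookup table, and the names `toy_stem` and `toy_lemmatize` are all hypothetical and deliberately tiny.

```python
SUFFIXES = ("ing", "ed", "es", "s")  # hypothetical, deliberately tiny rule set

def toy_stem(word):
    # Strip the first matching suffix, keeping at least three characters.
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word

# Hypothetical lookup table standing in for a real vocabulary.
LEMMAS = {"wolves": "wolf", "played": "play", "better": "good"}

def toy_lemmatize(word):
    # Look the word up in the vocabulary; fall back to the word itself.
    return LEMMAS.get(word, word)

for w in ["played", "wolves", "better"]:
    print(w, toy_stem(w), toy_lemmatize(w))
# played play play
# wolves wolv wolf
# better better good
```

The stemmer produces the non-word `wolv` because it only applies surface rules, while the lookup-based lemmatizer can handle irregular forms but only for words it knows.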


In [0]:
def lemmatization(tokens):
    lemmatizer = nltk.stem.WordNetLemmatizer()
    result = []
    for token in tokens:
        # tag each token on its own; only past participles (VBN) are lemmatized as verbs
        speech_part = nltk.pos_tag([token])[0][1]
        if speech_part == 'VBN':
            result.append(lemmatizer.lemmatize(token, pos='v'))
        else:
            result.append(lemmatizer.lemmatize(token))
    return result

In [8]:
lemmed = lemmatization(tokens)
print(lemmed)

['borrow', 'from', 'latin', 'per', 'sē', 'by', 'itself', 'from', 'per', 'by', 'through', 'and', 'sē', 'itself', 'himself', 'herself', 'themselves']


# 1.4 Stop words
The next step is to remove stop words. Take the list of stop words from nltk.

In [0]:
stopwords = set(nltk.corpus.stopwords.words('english'))

def remove_stop_word(tokens):
    result = [word for word in tokens if word not in stopwords]
    return result

In [10]:
clean = remove_stop_word(lemmed)
print(clean)

['borrow', 'latin', 'per', 'sē', 'per', 'sē']


# 1.5 Pipeline
Run the complete pipeline in one function.

In [0]:
def preprocess(text):
    text = normalize(text)
    text = tokenize(text)
    text = lemmatization(text)
    text = remove_stop_word(text)
    return text


In [14]:
clean = preprocess(text)
print(clean)

['borrow', 'latin', 'per', 'sē', 'per', 'sē']


# 2. Collection

Download Reuters data from here:
https://archive.ics.uci.edu/ml/machine-learning-databases/reuters21578-mld/reuters21578.tar.gz

Read data description here:
https://archive.ics.uci.edu/ml/datasets/reuters-21578+text+categorization+collection

The function should return a list of strings (the raw texts). Remove HTML tags using the bs4 package.

## 2.1 Alternative (0.5 task bonus points)

Download songs from https://www.lyrics.com/ (the process takes time; 1000 documents might be enough for the sake of the exercise). Implement a text search on them. In this case you have to create a class *Song* with the fields *title*, *artist* and *text*. The collection will contain a list of songs.

In [0]:
import requests
from bs4 import BeautifulSoup
from bs4.element import Comment

BASE_URL = "https://www.lyrics.com/lyric/{}/"
SONGS_LIMIT = 100    # number of songs to process
START_SONG = 3525851 # determines from which song id start searching (chosen randomly)

class Song:
    # Plain container for one song; fields are set per instance,
    # so no mutable defaults are shared between objects.
    def __init__(self, title, artists, text):
        self.title = title
        self.artists = artists
        self.text = text

    def __str__(self):
        s = f"Title: `{self.title}`, Artists: `{self.artists}`, Text: `{self.text[:25]}...`"
        return s


def tag_visible(element):
    if element.parent.name in ['style', 'script', 'head', 'title', 'meta', '[document]']:
        return False
    if isinstance(element, Comment):
        return False
    return True


def get_collection():
    collection = []
    i = START_SONG
    while len(collection) < SONGS_LIMIT:
        url = BASE_URL.format(i)

        r = requests.get(url)
        

        if r.history: # if redirected
            i -= 10
            print("Skipped")
            continue
        
        html_page = r.content
        soup = BeautifulSoup(html_page, "html.parser")  # name the parser explicitly
        title = soup.findAll(id="lyric-title-text")[0].contents[0]
        artists = soup.findAll(class_="lyric-artist")
        artists = artists[0].findNext('a', href=True).contents[0]
      
        text = ""
        for t in soup.findAll(id="lyric-body-text"): 
            if tag_visible(t):  # skip tags that are not visible to the user
                for s in t:
                    if tag_visible(s):
                        try:
                            text += s               # plain text node
                        except TypeError:           # s is a nested tag, not a string
                            text += s.contents[0]
        song = Song(title, artists, text)
        collection.append(song)
        print(song)
        i -= 1

    return collection

In [207]:
collection = get_collection()
print(len(collection))

Title: `Georgia on My Mind`, Artists: `Billie Holiday`, Text: `Georgia
Georgia, the who...`
Title: `Loveless Love [Take 1]`, Artists: `Billie Holiday`, Text: `Love is like a hydrant tu...`
Title: `What Is This Going to Get Us?`, Artists: `Billie Holiday`, Text: `What is this going to get...`
Title: `Night and Day`, Artists: `Billie Holiday`, Text: `Like the beat, beat, beat...`
...`
Title: `More Than You Know`, Artists: `Billie Holiday`, Text: `Whether you are here or y...`
Title: `Let's Dream in the Moonlight`, Artists: `Billie Holiday`, Text: `Let's dream in the moonli...`
Title: `Hello My Darling`, Artists: `Billie Holiday`, Text: `I'll forget your tender k...`
Title: `You're So Desirable`, Artists: `Billie Holiday`, Text: `You're so desirable
I ju...`
Title: `Please Keep Me in Your Dreams`, Artists: `Billie Holiday`, Text: `Please keep me in your dr...`
Title: `Let's Call a Heart a Heart`, Artists: `Billie Holiday`, Text: `When we're in a friendly ...`
Title: `These 'N' That 'N' T

In [209]:
print(collection[0])

Title: `Georgia on My Mind`, Artists: `Billie Holiday`, Text: `Georgia
Georgia, the who...`


# 3. Inverted index
You will work with the boolean search model. Construct a dictionary which maps words to the postings.  
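
To make the data structure concrete, here is a minimal sketch of an inverted index over three made-up documents. The `docs` list and the `build_toy_index` helper are hypothetical, and the sketch uses plain whitespace splitting rather than the `preprocess` pipeline.

```python
from collections import defaultdict

# Hypothetical toy collection; each document's position in the list is its ID.
docs = [
    "love me tender",
    "friend of mine",
    "love my friend",
]

def build_toy_index(documents):
    # Map every term to the set of IDs of the documents that contain it.
    toy_index = defaultdict(set)
    for doc_id, doc_text in enumerate(documents):
        for term in doc_text.split():
            toy_index[term].add(doc_id)
    return dict(toy_index)

toy_index = build_toy_index(docs)
print(sorted(toy_index["love"]))    # [0, 2]
print(sorted(toy_index["friend"]))  # [1, 2]
```

Each set of document IDs is the postings for its term; a term that appears several times in one document still contributes that ID only once.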

In [0]:
def make_index(collection):
    inverted_index = dict() # {term: set of document indexes}
    for i, song in enumerate(collection):
        text = preprocess(song.text)
        for term in text:
            if term in inverted_index:
                inverted_index[term].add(i)
            else:
                inverted_index[term] = {i}

    return inverted_index

In [210]:
index = make_index(collection)
print(index['love'])

{1, 2, 3, 4, 5, 6, 7, 9, 12, 13, 14, 21, 23, 24, 25, 26, 27, 29, 31, 32, 33, 34, 45, 48, 49, 61, 62, 63, 65, 81, 82, 83, 84, 89, 91, 92, 94, 96, 99}


# 4. Query processing

Given a search query, find all relevant documents. In the Boolean model, a document is relevant if it contains all the words from the query.

Return the list of indexes of the relevant documents.
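
Set intersection is one way to evaluate such a conjunctive query. When postings are kept as sorted lists instead, the same result can be computed with a linear merge. The sketch below assumes sorted, duplicate-free postings; the function name `intersect_postings` and the sample lists are made up for illustration.

```python
def intersect_postings(p1, p2):
    # Walk both sorted lists in step, keeping the IDs present in both.
    result = []
    i = j = 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            result.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return result

love = [1, 2, 5, 29, 99]       # made-up postings
friend = [5, 17, 29, 40, 99]   # made-up postings
print(intersect_postings(love, friend))  # [5, 29, 99]
```

For a query with more than two terms, the merge can be applied repeatedly, feeding each intermediate result into the next intersection.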

In [0]:
def search(query):
    query = preprocess(query)
    print(query)
    if not query:  # nothing left after preprocessing (e.g. only stop words)
        return []
    relevant_documents = None
    for term in query:
        if term not in index:  # a term that never occurs means no document matches
            return []
        if relevant_documents is None:
            relevant_documents = index[term]
        else:
            relevant_documents = relevant_documents.intersection(index[term])
    return list(relevant_documents)

In [212]:
query = 'love friend' # change to something else if you are searching song lyrics
relevant = search(query)  # select how many arguments to pass to the function
print(len(relevant))

['love', 'friend']
3


In [213]:
print(relevant[0])

29


In [214]:
print(relevant)

[29, 99, 5]
