# Computational Methods in Science and Engineering
## Laboratory 4 - Singular Value Decomposition (Search Engine)
### Albert Gierlach

### 1. Data preparation
The data was prepared with a wiki crawler. A Python script (https://github.com/bornabesic/wikipedia-crawler) was used and adapted to the task by adding an option to download N articles. The sources (wikipedia.py and crawler.py) are available in the archive with the assignment.

Usage:
```
python crawler.py N subdomain
```
where N is the number of documents to download and 'subdomain' is the Wikipedia subdomain (the value 'en' was used).
To improve the results, only articles longer than 512 characters are kept.

The data is downloaded as .txt files into the ./data folder.

### 2., 3. Defining the bag-of-words
A class was created to hold the data of a single document, together with the methods used later on. A few words that should be ignored when searching for articles are discarded. A second class caches the computed vectors and matrices, because computing them takes quite a long time: the values are computed once, and subsequent runs only load the stored data. The first run takes at most 5 minutes. The cache takes about 1 GB (with compression it would shrink to about 10 MB, since the files consist mostly of zeros).
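
The remark about compression can be illustrated with a minimal sketch that writes the pickle through `gzip`. The helper names `save_compressed`, `load_compressed` and `pickled_sizes` are hypothetical and are not part of the `CacheManager` class below; the toy vector only imitates a mostly-zero word vector.

```python
import gzip
import pickle

import numpy as np


def save_compressed(path, obj):
    """Pickle obj and write it to path through gzip."""
    with gzip.open(path, "wb") as f:
        pickle.dump(obj, f, protocol=pickle.HIGHEST_PROTOCOL)


def load_compressed(path):
    """Read back an object written by save_compressed."""
    with gzip.open(path, "rb") as f:
        return pickle.load(f)


def pickled_sizes(obj):
    """Return (raw, gzip-compressed) size in bytes of the pickled object."""
    raw = pickle.dumps(obj, protocol=pickle.HIGHEST_PROTOCOL)
    return len(raw), len(gzip.compress(raw))


# A mostly-zero vector, similar in spirit to a per-article word vector.
vec = np.zeros(100_000)
vec[::1000] = 1.0
raw_size, packed_size = pickled_sizes(vec)
print(raw_size, packed_size)
```

The trade-off is extra CPU time on every save and load, so whether this pays off for the real cache would have to be measured.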

In [1]:
from collections import Counter
from typing import List, Any
from scipy import sparse
import os
import pickle
import re
import numpy as np
import operator

data_dir = "./data"

In [2]:
class CacheManager:
    cache_dir = "./cache"  # place for storing calculated matrices, etc

    def __init__(self):
        self.loaded = set()

        if not os.path.exists(CacheManager.cache_dir):
            os.makedirs(CacheManager.cache_dir)

    def was_loaded(self, filename):
        return filename in self.loaded

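    def save(self, filename, obj):
        # objects that were read from the cache in this session are not rewritten
        if self.was_loaded(filename):
            return

        try:
            with open('{}/{}'.format(CacheManager.cache_dir, filename), "wb") as f:
                pickle.dump(obj, f, protocol=pickle.HIGHEST_PROTOCOL)
                print("> caching " + filename)
        except Exception:  # caching is best-effort; a failed write is ignored
            return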

    def load(self, filename):
        try:
            with open('{}/{}'.format(CacheManager.cache_dir, filename), "rb") as f:
                res = pickle.load(f)
                print("> using cached " + filename)
                self.loaded.add(filename)
                return res
        except Exception:  # missing or unreadable cache file: the caller recomputes
            return None

class ArticleData:
    ignored_words = ["a", "the", "of", "is"]  # and probably more

    def __init__(self, title):
        self.title = title.split('.')[0]
        self.bag_of_words = Counter()
        self.words_vec = None
        self.words_vec_norm = None

    def load_bag_of_words(self, path):
        with open(path, "rt", encoding='utf-8') as f:
            words = re.findall(r'\w+', f.read().lower())
            loaded_words = [word for word in words if len(word) > 2]
            self.bag_of_words.update(loaded_words)

        for ignore_token in ArticleData.ignored_words:
            del self.bag_of_words[ignore_token]

    def create_full_bag_of_words(self, keyset, size):
        self.words_vec = np.zeros(size)  # d_j
        for i, k in enumerate(keyset):
            self.words_vec[i] = self.bag_of_words[k]

        self.words_vec_norm = np.linalg.norm(self.words_vec)

    def print_contents(self):
        with open('{}/{}.txt'.format(data_dir, self.title), "rt", encoding='utf-8') as f:
            print(f.read())

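    def normalize_word_vec(self):
        norm = np.linalg.norm(self.words_vec)
        if norm > 0:  # an all-zero vector cannot be normalized
            self.words_vec = self.words_vec / norm
            self.words_vec_norm = 1.0  # keep the stored norm in sync with the vector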

In [3]:
cache = CacheManager()

articles_data: List[ArticleData] = cache.load('articles_data.dump')
if articles_data is None:
    articles_data = []
    for file in os.listdir(data_dir):
        a_data = ArticleData(file)
        a_data.load_bag_of_words("{}/{}".format(data_dir, file))
        articles_data.append(a_data)
print("total number of articles {}".format(len(articles_data)))

total_bag_of_words: Counter = cache.load('total_bag_of_words.dump')
if total_bag_of_words is None:
    total_bag_of_words = Counter()
    for article in articles_data:
        total_bag_of_words += article.bag_of_words

sizeof_total = len(total_bag_of_words)
wordset: List[Any] = cache.load('wordset.dump')
if wordset is None:
    wordset = list(total_bag_of_words.keys())
print("total number of words: {}".format(sizeof_total))

if not cache.was_loaded('articles_data.dump'):
    print("creating bag of words for every article")
    for article in articles_data:
        article.create_full_bag_of_words(wordset, sizeof_total)
print("created {} bags, every has {} elements".format(len(articles_data), sizeof_total))

> using cached articles_data.dump
total number of articles 2018
> using cached total_bag_of_words.dump
> using cached wordset.dump
total number of words: 77275
created 2018 bags, every has 77275 elements
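
The construction above can be traced on a toy corpus. This is a minimal sketch, not the code used in the notebook: the function `bag_of_words` and the two sample sentences are made up for illustration, and only the tokenization rule (lowercase `\w+` tokens longer than 2 characters, with stop words dropped) follows `ArticleData`.

```python
import re
from collections import Counter

import numpy as np


def bag_of_words(text, min_len=3, ignored=("the",)):
    """Count lowercase word tokens of at least min_len characters."""
    words = re.findall(r"\w+", text.lower())
    return Counter(w for w in words if len(w) >= min_len and w not in ignored)


docs = ["The cat chased the mouse", "The mouse ate cheese, cheese, cheese"]
bags = [bag_of_words(d) for d in docs]

# Shared vocabulary: the union of all per-document bags, in a fixed order.
vocabulary = sorted(set().union(*bags))

# One dense count vector per document, indexed by the shared vocabulary.
vectors = np.array([[bag[w] for w in vocabulary] for bag in bags])

print(vocabulary)
print(vectors)
```

Every document vector has as many entries as the whole vocabulary, which is why the real vectors (77275 entries each) are almost entirely zeros.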


### 4., 5. Sparse matrix of feature vectors and IDF
The sparse matrix is built with csr_matrix() (Compressed Sparse Row format), which stores only the non-zero entries, row by row.
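
As a small worked example of both steps, the sketch below computes IDF weights for a made-up 3x4 count matrix and stores the weighted result in CSR form. The array `counts` is illustrative data, and the vectorized computation is an alternative formulation, not the loop-based code used in the cells that follow.

```python
import numpy as np
from scipy import sparse

# Toy term counts: rows are terms, columns are documents.
counts = np.array([
    [2, 0, 1, 0],   # term 0 occurs in 2 of 4 documents
    [0, 3, 0, 0],   # term 1 occurs in 1 of 4 documents
    [1, 1, 1, 1],   # term 2 occurs in every document
])
n_docs = counts.shape[1]

# Document frequency: in how many documents each term appears.
doc_freq = np.count_nonzero(counts, axis=1)
idf = np.log10(n_docs / doc_freq)

# Scale each term row by its IDF and keep only the non-zero entries.
tfidf = sparse.csr_matrix(counts * idf[:, np.newaxis])

print(idf)
print(tfidf.nnz)
```

A term that occurs in every document gets an IDF of 0, so its whole row disappears from the sparse matrix.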

In [4]:
def getIDF(wordset, articles_data):
    articles_num = len(articles_data)
    idf = []
    for word in wordset:
        cnt = 0
        for article in articles_data:
            if article.bag_of_words[word] != 0:
                cnt += 1

        idf.append(np.log10(articles_num/cnt))

    return idf


def create_sparse(articles_data, sizeof_total, idf):
    row = []
    column = []
    data = []

    for i in range(len(articles_data)):
        article = articles_data[i]
        for j in range(sizeof_total):
            if article.words_vec[j] != 0:
                row.append(j)
                column.append(i)
                data.append(article.words_vec[j] * idf[j])


    term_by_document_matirx = sparse.csr_matrix((data, (row, column)), shape=(sizeof_total, len(articles_data)))
    return term_by_document_matirx

In [5]:
idf: List[Any] = cache.load('idf.dump')
if idf is None:
    print('calculating idf')
    idf = getIDF(wordset, articles_data)

term_by_document_matirx: sparse.csr_matrix = cache.load('term_by_document_sparse_matrix.dump')
if term_by_document_matirx is None:
    print('creating sparse matrix')
    term_by_document_matirx = create_sparse(articles_data, sizeof_total, idf)
print("term by document matrix size: {}x{}".format(term_by_document_matirx.shape[0],
                                                   term_by_document_matirx.shape[1]))

> using cached idf.dump
> using cached term_by_document_sparse_matrix.dump
term by document matrix size: 77275x2018


In [6]:
cache.save('articles_data.dump', articles_data)
cache.save('wordset.dump', wordset)
cache.save('term_by_document_sparse_matrix.dump', term_by_document_matirx)
cache.save('total_bag_of_words.dump', total_bag_of_words)
cache.save('idf.dump', idf)

### 6. Program for searching articles

In [7]:
def parse_query(query, word_list):
    query = query.lower()
    words_dict = {word: index for index, word in enumerate(word_list)}
    words = re.findall(r'\w+', query)

    vec_query = np.zeros(len(word_list), dtype=int)
    for w in words:
        if w in words_dict.keys():
            vec_query[words_dict[w]] += 1

    if not np.any(vec_query):
        print("No results")
        return

    return vec_query


def print_search_results(res, k, query):
    res.sort(key=operator.itemgetter(0), reverse=True)
    print("Found articles for query [{}]:".format(query))
    for res_entry in res[:k]:
        print('> ' + res_entry[1].title.replace("_", " "))

    print("\n\nFull articles:")
    for res_entry in res[:k]:
        res_entry[1].print_contents()  # prints the article itself and returns None
        print('\n')
        print('*' * 40)


def do_query(query, k, word_list, articles):
    vec_query = parse_query(query, word_list)
    if vec_query is None:  # no query word occurs in the vocabulary
        return

    q_norm = np.linalg.norm(vec_query)
    res = []
    for a in articles:
        divider = q_norm * a.words_vec_norm
        prod = vec_query @ a.words_vec
        cos_theta = prod / divider if divider else 0.0  # guard against empty articles
        res.append((cos_theta, a))

    print_search_results(res, k, query)
        
        
        
# reassign variables, just for readability
# articles - list of all documents (word vectors + bags of words)
# word_list - list of all distinct words (the keys of total_bag_of_words)
# A - sparse matrix whose columns are the word vectors from articles_data
articles, word_list, A = articles_data, wordset, term_by_document_matirx
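
The ranking in `do_query` is based on the cosine of the angle between the query vector and each document vector. A minimal, self-contained sketch of that measure on made-up 3-dimensional vectors (the names `cosine_similarity` and `rank_documents` are illustrative and do not appear in the notebook):

```python
import numpy as np


def cosine_similarity(q, d):
    """Cosine of the angle between q and d; 0.0 if either vector is zero."""
    denom = np.linalg.norm(q) * np.linalg.norm(d)
    if denom == 0:
        return 0.0
    return float(q @ d / denom)


def rank_documents(q, docs, k):
    """Return indices of the k documents most similar to q, best first."""
    scores = [cosine_similarity(q, d) for d in docs]
    return sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)[:k]


query = np.array([1, 1, 0])
documents = [
    np.array([2, 2, 0]),   # same direction as the query
    np.array([0, 0, 5]),   # orthogonal to the query
    np.array([1, 0, 0]),   # shares one of the two query terms
]
top = rank_documents(query, documents, k=2)
print(top)
```

Because the cosine depends only on direction, a long article is not favoured merely for repeating the query words more often.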

### Example searches

In [8]:
do_query("Action film", 5, word_list, articles)

Found articles for query [Action film]:
> Arjun Sarja
> Hong Sangsoo
> Nangna Kappa Pakchade
> Ian Harnarine
> Prathap


Full articles:
Srinivasa Sarja (born 15 August 1964), known professionally Arjun, is an Indian actor, producer and director. Referred to by the media and his fans as "Action King" for his roles in action films, he works predominantly in Tamil, Kannada and Telugu language films, while also performing in a few Malayalam and Hindi films. As of 2017, Arjun had acted in more than 150 movies. Until his 150th film, he has mostly performed in lead roles and is one of few South Indian actors to attract fan following from multiple states of India. He has directed 11 films and also produced and distributed a number of films..
In 1993, he starred in S. Shankar's blockbuster Gentleman which opened to positive reviews, while Arjun went on to win the Tamil Nadu State Film Award for Best Actor. During this time, he starred in hits such as Jai Hind (1994), Karnaa (1995), and the acti

In [9]:
do_query("Winston Churchill", 5, word_list, articles)

Found articles for query [Winston Churchill]:
> Old Town Township Forsyth County North Carolina
> Austropyrgus pusillus
> United Goans Democratic Party
> Paradox Access Solutions
> Joe Lahoud


Full articles:
Old Town Township is one of fifteen townships in Forsyth County, North Carolina, United States. The township had a population of 149 according to the 2010 census.Geographically, Old Town Township occupies 0.53 square miles (1.4 km2) in central Forsyth County.  Parts of the town of Bethania are located here but nearly all of the original township has been annexed by the City of Winston-Salem and made part of Winston Township, including the original community of Old Town.
As if its small area was not enough it is geographically in three pieces.


== References ==
None


****************************************
Austropyrgus pusillus is a species of minute freshwater snail with an operculum, an aquatic gastropod mollusc or micromollusc in the Hydrobiidae family. This species is endemi

In [10]:
do_query("Beautiful places on the earth", 5, word_list, articles)

Found articles for query [Beautiful places on the earth]:
> Earth materials
> Jaspers Warp
> Garden of Prayer
> Glossary of geography terms
> National Register of Historic Places listings in Jefferson County Mississippi


Full articles:
Earth materials include minerals, rocks, soil and water. These are the naturally occurring materials found on Earth that constitute the raw materials upon which our global society exists. Earth materials are vital resources that provide the basic components for life, agriculture and industry. Earth materials can also include metals and precious rocks.


== Definitions ==
The type of materials available locally will of course vary depending upon the conditions in the area of the building site. Take considerations of what is explained below.
In many areas, indigenous stone is available from the local region, such as limestone, marble, granite, and sandstone. It may be cut in quarries or removed from the surface of the ground (flag and fieldstone). Ideally

### Interactive search engine

In [11]:
from ipywidgets import Layout, Button, Box, FloatText, Textarea, Text, Label, IntSlider, Output
from IPython.display import display, clear_output

def btn(b):
    output.clear_output()
    with output:
        how_many = form.children[0].children[1].value
        text_to_search = form.children[1].children[1].value
        if len(text_to_search) > 1:
            do_query(text_to_search, how_many, word_list, articles)
        else:
            output.append_stdout("Text is too short")

form_item_layout = Layout(
    display='flex',
    flex_flow='row',
    justify_content='space-between'
)

form_items = [
    Box([Label(value='Results num'), IntSlider(min=1, max=30, value=10)], layout=form_item_layout),
    Box([Label(value='Query'), Text(placeholder="Enter a query")], layout=form_item_layout),
    Box([Label(), Button(description="Search!")], layout=form_item_layout)
]

form = Box(form_items, layout=Layout(
    display='flex',
    flex_flow='column',
    align_items='stretch',
    width='50%'
))
output = Output()
form.children[2].children[1].on_click(btn)

In [12]:
form

Box(children=(Box(children=(Label(value='Results num'), IntSlider(value=10, max=30, min=1)), layout=Layout(dis…

In [13]:
display(output) # place for results

Output()

### 7. Vector normalization
The vectors stored in the ArticleData class were normalized, and the sparse matrix A was rebuilt from the new vectors. The query vector was normalized as well. The same searches as in the variant without normalization were run to verify correctness.
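
The point of normalizing is that, once the query and every column of the matrix have unit length, all cosines come out of a single matrix-vector product. The sketch below shows this on a made-up 3x3 matrix. It normalizes the columns of the sparse matrix directly, which is an assumption of this example and differs from the cells below, where the per-article vectors are normalized before the matrix is rebuilt and weighted by IDF.

```python
import numpy as np
from scipy import sparse
from scipy.sparse.linalg import norm as sparse_norm


def normalize_columns(A):
    """Scale every non-zero column of a sparse matrix to unit Euclidean length."""
    col_norms = sparse_norm(A, axis=0)
    col_norms[col_norms == 0] = 1.0   # leave empty columns untouched
    return (A @ sparse.diags(1.0 / col_norms)).tocsr()


def cosine_scores(q, A_unit):
    """Cosines between q and each unit-length column of A_unit."""
    q = q / np.linalg.norm(q)
    return np.asarray(A_unit.T @ q).ravel()


A_toy = sparse.csr_matrix(np.array([
    [3.0, 0.0, 1.0],
    [4.0, 0.0, 0.0],
    [0.0, 2.0, 1.0],
]))
q_toy = np.array([1.0, 1.0, 0.0])
scores = cosine_scores(q_toy, normalize_columns(A_toy))
print(scores)
```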

In [14]:
def normalize_vectors(articles):
    for a in articles:
        a.normalize_word_vec()
        

normalize_vectors(articles)
A_normalized: sparse.csr_matrix = cache.load('A_normalized.dump')
if A_normalized is None:
    print('calculating new sparse matrix with new vectors')
    A_normalized = create_sparse(articles, len(word_list), idf)
    cache.save('A_normalized.dump', A_normalized)

> using cached A_normalized.dump


In [15]:
def do_query2(query, k, word_list, articles, A):
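    vec_query = parse_query(query, word_list)
    if vec_query is None:  # no query word occurs in the vocabulary
        return

    vec_query = vec_query / np.linalg.norm(vec_query)  # unit-length query
    res = vec_query @ A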
    probabilities = []
    for i, cos_theta in enumerate(res):
        probabilities.append((cos_theta, articles[i]))

    print_search_results(probabilities, k, query)

In [16]:
do_query2("Action film", 5, word_list, articles, A_normalized)

Found articles for query [Action film]:
> Arjun Sarja
> Hong Sangsoo
> Ian Harnarine
> Nangna Kappa Pakchade
> Prathap


Full articles:
Srinivasa Sarja (born 15 August 1964), known professionally Arjun, is an Indian actor, producer and director. Referred to by the media and his fans as "Action King" for his roles in action films, he works predominantly in Tamil, Kannada and Telugu language films, while also performing in a few Malayalam and Hindi films. As of 2017, Arjun had acted in more than 150 movies. Until his 150th film, he has mostly performed in lead roles and is one of few South Indian actors to attract fan following from multiple states of India. He has directed 11 films and also produced and distributed a number of films..
In 1993, he starred in S. Shankar's blockbuster Gentleman which opened to positive reviews, while Arjun went on to win the Tamil Nadu State Film Award for Best Actor. During this time, he starred in hits such as Jai Hind (1994), Karnaa (1995), and the acti

In [17]:
do_query2("Winston Churchill", 5, word_list, articles, A_normalized)

Found articles for query [Winston Churchill]:
> Old Town Township Forsyth County North Carolina
> Austropyrgus pusillus
> United Goans Democratic Party
> Paradox Access Solutions
> Joe Lahoud


Full articles:
Old Town Township is one of fifteen townships in Forsyth County, North Carolina, United States. The township had a population of 149 according to the 2010 census.Geographically, Old Town Township occupies 0.53 square miles (1.4 km2) in central Forsyth County.  Parts of the town of Bethania are located here but nearly all of the original township has been annexed by the City of Winston-Salem and made part of Winston Township, including the original community of Old Town.
As if its small area was not enough it is geographically in three pieces.


== References ==
None


****************************************
Austropyrgus pusillus is a species of minute freshwater snail with an operculum, an aquatic gastropod mollusc or micromollusc in the Hydrobiidae family. This species is endemi

In [18]:
do_query2("Beautiful places on the earth", 5, word_list, articles, A_normalized)

Found articles for query [Beautiful places on the earth]:
> Earth materials
> Jaspers Warp
> Garden of Prayer
> Glossary of geography terms
> Tarzan at the Earths Core


Full articles:
Earth materials include minerals, rocks, soil and water. These are the naturally occurring materials found on Earth that constitute the raw materials upon which our global society exists. Earth materials are vital resources that provide the basic components for life, agriculture and industry. Earth materials can also include metals and precious rocks.


== Definitions ==
The type of materials available locally will of course vary depending upon the conditions in the area of the building site. Take considerations of what is explained below.
In many areas, indigenous stone is available from the local region, such as limestone, marble, granite, and sandstone. It may be cut in quarries or removed from the surface of the ground (flag and fieldstone). Ideally, stone from the building site can be utilized. Depe

### 8. SVD and low rank approximation
Application of SVD, low rank approximation and a new measure of probab
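
Since the text of this section is cut off, the following is only a sketch of one common way to combine SVD with cosine ranking, not necessarily the approach used in the original. It assumes `scipy.sparse.linalg.svds` for the truncated decomposition; the function name `low_rank_cosines` and the toy data are hypothetical.

```python
import numpy as np
from scipy import sparse
from scipy.sparse.linalg import svds


def low_rank_cosines(A, q, k):
    """Cosines between q and the columns of the rank-k approximation of A."""
    u, s, vt = svds(A, k=k)            # A_k = u @ diag(s) @ vt
    coords = s[:, np.newaxis] * vt     # column coordinates in the basis u
    col_norms = np.linalg.norm(coords, axis=0)   # equal to the column norms of A_k
    col_norms[col_norms == 0] = 1.0
    scores = (q @ u) @ coords          # q^T A_k without forming A_k
    return scores / (np.linalg.norm(q) * col_norms)


# Toy 6x5 matrix of rank 2, so the rank-2 approximation is exact.
rng = np.random.default_rng(0)
dense = rng.random((6, 2)) @ rng.random((2, 5))
A_toy = sparse.csr_matrix(dense)
q_toy = rng.random(6)

approx = low_rank_cosines(A_toy, q_toy, k=2)
exact = (q_toy @ dense) / (np.linalg.norm(q_toy) * np.linalg.norm(dense, axis=0))
print(np.allclose(approx, exact, atol=1e-6))
```

Keeping the factors `u`, `s` and `vt` instead of the dense product avoids materializing a full terms-by-documents matrix, which matters at the 77275x2018 scale reported above. The choice of `k` is a tuning parameter that this sketch does not address.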


### Sources:
* [Latent semantic indexing](https://nlp.stanford.edu/IR-book/html/htmledition/latent-semantic-indexing-1.html)
* [Latent Semantic Analysis](https://www.engr.uvic.ca/~seng474/svd.pdf)