Wikipedia articles are saved as files and their words are stemmed, i.e. reduced to their root form, for example:
- cats, catlike, catty = cat
- argue, argued, argues, arguing, argus = argu, etc.
Subsequently, using the Singular Value Decomposition (SVD) [See more] algorithm, the engine searches the files mentioned above for the results whose vectors correlate best with the entry phrase.
The crawler starts at a specific page of the English Wikipedia, downloads the text of the article, and saves it as a file with a specific name. The crawler also looks for tags that contain links to be visited, so that "crawling" is possible. Only Wikipedia pages are taken into account.
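The link-extraction step can be sketched with the standard library alone; `WikiLinkParser` and the filtering rule (keep `/wiki/...` hrefs without a colon, which skips special pages like `File:`) are illustrative assumptions, not the project's actual code:

```python
from html.parser import HTMLParser

class WikiLinkParser(HTMLParser):
    """Collects internal Wikipedia article links from <a href="/wiki/..."> tags."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href", "")
        # Keep only article links; skip special pages such as /wiki/File:...
        if href.startswith("/wiki/") and ":" not in href:
            self.links.append("https://en.wikipedia.org" + href)

parser = WikiLinkParser()
parser.feed('<p><a href="/wiki/Cat">Cat</a> <a href="/wiki/File:Cat.jpg">img</a>'
            ' <a href="https://example.com">ext</a></p>')
print(parser.links)  # ['https://en.wikipedia.org/wiki/Cat']
```

The collected links would then be queued and visited in turn, which is what makes the "crawling" possible.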
The file parser uses the tools of the NLTK library to stem the text correctly. Each file is:
- split into separate words
- filtered so that it does not include stop words or words shorter than 2 characters
- stemmed with the PorterStemmer
- saved as new content to a different file
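The steps above can be sketched as a small pipeline. To keep the example self-contained, a tiny hard-coded stopword set and a toy suffix-stripper stand in for NLTK's stopword list and `PorterStemmer` — in the real project you would call `nltk.stem.PorterStemmer().stem(word)` instead:

```python
import re

STOP_WORDS = {"the", "a", "an", "is", "are", "and", "of", "in", "to"}  # tiny sample

def simple_stem(word):
    # Toy stand-in for nltk.stem.PorterStemmer().stem(word)
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def preprocess(text):
    words = re.findall(r"[a-z]+", text.lower())                        # split into words
    words = [w for w in words if w not in STOP_WORDS and len(w) >= 2]  # filter
    return [simple_stem(w) for w in words]                             # stem

print(preprocess("The cats are arguing"))  # ['cat', 'argu']
```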
The SearchEngine class is responsible for the numerical side. It allows searching for articles in two ways:
- without SVD factorization
- with SVD factorization of degree N (default: 100)
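What "SVD factorization of degree N" does to the matrix can be sketched with NumPy; `truncated_svd` and the toy matrix below are illustrative assumptions, not the project's code — only the N (here `k`) largest singular values are kept, giving a low-rank approximation:

```python
import numpy as np

def truncated_svd(matrix, k):
    """Low-rank approximation: keep only the k largest singular values."""
    u, s, vt = np.linalg.svd(matrix, full_matrices=False)
    return u[:, :k] @ np.diag(s[:k]) @ vt[:k, :]

# toy term-document matrix (4 terms x 3 documents)
a = np.array([[3.0, 0.0, 1.0],
              [2.0, 0.0, 0.0],
              [0.0, 4.0, 0.0],
              [0.0, 1.0, 2.0]])
approx = truncated_svd(a, k=2)
print(approx.shape)  # (4, 3)
```

The approximation has the same shape as the original matrix, but its rank is at most `k`, which tends to smooth out noise and expose latent topic structure.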
The simplest idea behind it is to build a dictionary (the union of the words in all files) and, based on it, create a matrix that assigns to each file a vector with the count of each word in that text. The top result would then be the file whose vector has the highest dot product (correlation) with the entry phrase's vector. However, longer texts obviously contain more words, so this type of matrix is not necessarily the best idea.
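This count-matrix idea can be sketched in a few lines; the file names and toy texts are made up for illustration:

```python
from collections import Counter

docs = {"cats.txt": "cat cat dog", "dogs.txt": "dog dog dog bird"}

# union of all words across the files
vocabulary = sorted({word for text in docs.values() for word in text.split()})

# one count vector per file, indexed by the shared vocabulary
def count_vector(text):
    counts = Counter(text.split())
    return [counts[word] for word in vocabulary]

matrix = {name: count_vector(text) for name, text in docs.items()}
print(vocabulary)          # ['bird', 'cat', 'dog']
print(matrix["cats.txt"])  # [0, 2, 1]
```

Note how `dogs.txt`, simply by being longer, gets larger counts overall — exactly the bias described above.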
This is where the TF-IDF matrix comes in (TF-IDF stands for term frequency–inverse document frequency), which is much better to work with as it also respects the length of the text. It is built with the TfidfVectorizer from the scikit-learn package.
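A stripped-down sketch of the idea behind TF-IDF (scikit-learn's `TfidfVectorizer` additionally applies smoothing and normalization, so its exact numbers differ); the toy documents are made up:

```python
import math

docs = ["cat cat dog", "dog dog bird", "cat bird bird"]

def tf_idf(doc, word):
    words = doc.split()
    tf = words.count(word) / len(words)             # term frequency, scaled by doc length
    df = sum(1 for d in docs if word in d.split())  # document frequency
    idf = math.log(len(docs) / df)                  # inverse document frequency
    return tf * idf

print(tf_idf("cat cat dog", "cat"))
```

Dividing by the document length removes the bias toward long texts, and the IDF factor down-weights words that appear in many documents (a word occurring in every document gets an IDF of zero).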
Thanks to this we obtain a matrix of shape (count_of_words_in_union, count_of_files). To keep time and memory complexity low, arrays/queues/matrices are replaced by dictionaries/sparse matrices where possible, and they are loaded from a file when the program is run a second (or later) time.
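The load-from-file-on-second-run behaviour can be sketched with `pickle`; `CACHE_PATH`, `build_matrix`, and the dict-of-dicts sparse representation (only non-zero counts stored) are illustrative assumptions:

```python
import os
import pickle

CACHE_PATH = "matrix.pkl"  # hypothetical cache file name

def build_matrix():
    # sparse representation: only non-zero counts are stored
    return {"cats.txt": {"cat": 2, "dog": 1}, "dogs.txt": {"dog": 3}}

def load_or_build():
    if os.path.exists(CACHE_PATH):      # second+ run: load from file
        with open(CACHE_PATH, "rb") as f:
            return pickle.load(f)
    matrix = build_matrix()             # first run: build, then cache to disk
    with open(CACHE_PATH, "wb") as f:
        pickle.dump(matrix, f)
    return matrix

matrix = load_or_build()
print(matrix["cats.txt"]["cat"])  # 2
```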
When a phrase is entered, the search steps are similar: its words are stemmed and a vector for the phrase is built based on the aforementioned union. The result is the k highest values of the multiplication of the entry-phrase vector with the TF-IDF matrix.
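The query step can be sketched as a dot product per file followed by a top-k selection with `heapq`; the vocabulary, file names, and toy TF-IDF vectors are made up, and the real pipeline would also stem the query words:

```python
import heapq

vocabulary = ["bird", "cat", "dog"]
# toy tf-idf vectors, one per file (dense here for brevity)
matrix = {"cats.txt":  [0.0, 0.9, 0.1],
          "dogs.txt":  [0.2, 0.0, 0.8],
          "birds.txt": [0.9, 0.1, 0.0]}

def query(phrase, k=2):
    words = phrase.lower().split()  # the real pipeline also stems here
    q = [1.0 if w in words else 0.0 for w in vocabulary]
    scores = {name: sum(a * b for a, b in zip(q, vec))
              for name, vec in matrix.items()}
    return heapq.nlargest(k, scores.items(), key=lambda item: item[1])

print(query("cat dog"))  # [('cats.txt', 1.0), ('dogs.txt', 0.8)]
```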
As an example, I collected 45 000+ files of Wikipedia articles, with a union of words of size > 1 100 000. I have also prepared a basic front-end [here] and a default Flask server which handles the requests [here].
The algorithm seems to be working fine ;)

