# Self study 2


In this self-study we build an index that supports Boolean search over the web pages you crawl with the crawler from self study 1. You can keep extracting only the titles of the crawled pages, or be more adventurous and use the whole text returned by the .get_text() method of a BeautifulSoup parser. In either case, the collection of texts from the crawled web pages is your corpus. You should then:

- construct the vocabulary of terms for your corpus
- build an 'inverted' index for your vocabulary
- implement Boolean search for your index (perhaps only for a limited set of Boolean queries)
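
As a rough illustration of the first two steps, the sketch below builds a vocabulary and an inverted index for a tiny hand-written corpus. The documents in `docs` and the helper name `build_index` are made up for this example, and a plain whitespace split stands in for proper tokenization and stemming.

```python
from collections import defaultdict

# Hypothetical mini-corpus: document id -> text (stand-ins for crawled titles)
docs = {
    0: "AAU viden for verden",
    1: "Aalborg Universitet forskning",
    2: "viden og forskning fra AAU",
}

def build_index(docs):
    """Map each lower-cased term to the sorted list of ids of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():   # crude tokenization: split on whitespace
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}

index = build_index(docs)
vocabulary = sorted(index)   # the vocabulary is the set of index keys

print(vocabulary)
print(index["viden"])        # postings list for 'viden'
```

Storing each postings list in sorted order keeps the index deterministic and makes intersecting two lists straightforward later on.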

In [61]:
# Some things already used in self study 1:
import requests
from bs4 import BeautifulSoup
from urllib.robotparser import RobotFileParser
from nltk.stem.snowball import SnowballStemmer
import string


A useful resource is the nltk natural language processing package:
https://www.nltk.org/
which provides methods for tokenization, stemming, and much more (the 'punkt' package is needed for tokenization):

In [5]:
import nltk
nltk.download('punkt')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\minhs\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping tokenizers\punkt.zip.


True

Now let's use the title string of the AAU homepage as an example:

In [6]:
r=requests.get('https://www.aau.dk/')
r_parse = BeautifulSoup(r.text, 'html.parser')
# named title rather than string, so the imported string module is not shadowed
title=r_parse.find('title').string
print(title)

AAU - Viden for verden - Aalborg Universitet


We can tokenize:

In [7]:
tokens=nltk.word_tokenize(title)
for t in tokens:
    print(t)

AAU
-
Viden
for
verden
-
Aalborg
Universitet
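
The tokenizer keeps punctuation such as `-` as separate tokens, which are rarely useful as index terms. A minimal sketch of one possible clean-up step follows; the helper name `normalize` is hypothetical, and the token list is copied by hand from the output above so that the example runs without nltk.

```python
# Tokens as printed above, copied by hand so the example runs without nltk
tokens = ['AAU', '-', 'Viden', 'for', 'verden', '-', 'Aalborg', 'Universitet']

def normalize(tokens):
    """Lower-case tokens and drop those that contain no letter or digit."""
    return [t.lower() for t in tokens if any(ch.isalnum() for ch in t)]

print(normalize(tokens))
```

Unlike filtering on ASCII characters, `str.isalnum()` also accepts letters such as æ, ø and å, so Danish words pass through unchanged.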


And we can stem:

In [8]:
ps=nltk.PorterStemmer()
for t in tokens:
    print(ps.stem(t))



aau
-
viden
for
verden
-
aalborg
universitet


For Danish text the Porter stemmer is not very useful, since it was designed for English. nltk also provides a Danish Snowball stemmer:

In [9]:
from nltk.stem.snowball import SnowballStemmer

dstemmer=SnowballStemmer("danish")

In [12]:
for t in tokens:
    print(dstemmer.stem(t))


aau
-
vid
for
verd
-
aalborg
universit


In [77]:
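from urllib.parse import urljoin

numOfArticles = 10  # maximum number of linked pages to fetch titles from

dstemmer=SnowballStemmer("danish")  # Danish alternative to the Porter stemmer used below

def getTitles(link):
    """Return the titles of up to numOfArticles pages linked from link."""
    titles = []

    # robots.txt lives at the root of the site, not at the page URL itself
    rp=RobotFileParser()
    rp.set_url(urljoin(link, '/robots.txt'))
    rp.read()

    r=requests.get(link)
    r_parse = BeautifulSoup(r.text, 'html.parser')
    for a in r_parse.find_all('a', href=True):        # skip <a> tags without an href
        if len(titles) == numOfArticles: break
        _link = a['href']
        if _link.startswith('#'): continue            # in-page anchors such as '#main'
        _link = urljoin(link, _link)                  # resolve relative links
        if not _link.startswith('http'): continue     # skip mailto:, tel:, ...
        if not rp.can_fetch('*', _link): continue     # respect robots.txt
        title = _getTitles(_link)
        if title: titles.append(title)
    return titles

def _getTitles(link):
    """Fetch a single page and return its <title> text, or None if it has none."""
    r=requests.get(link)
    r_parse = BeautifulSoup(r.text, 'html.parser')
    title = r_parse.find('title')
    return title.string if title else None

def remove_non_ascii(a_str):
    # Note: this also drops the Danish letters æ, ø and å
    ascii_chars = set(string.printable)

    return ''.join(
        filter(lambda x: x in ascii_chars, a_str)
    )

def tokenizeAndStemTitles(titles):
    """Return (vocabulary, postings); postings is a list of
    dicts of the form {'vocabulary': term, 'postings': [document ids]}."""
    _postings = []
    tokens = []
    ps=nltk.PorterStemmer()  # created once; dstemmer could be used instead for Danish pages
    for i, title in enumerate(titles):
        for t in nltk.word_tokenize(title):
            s = remove_non_ascii(ps.stem(t))
            if s == '': continue
            if s not in tokens: tokens.append(s)

            found = False
            for el in _postings:
                if el['vocabulary'] == s:
                    # record each document at most once per term
                    if el['postings'][-1] != i: el['postings'].append(i)
                    found = True
                    break

            if not found: _postings.append(dict(vocabulary=s, postings=[i]))

    return tokens, _postings

titles = getTitles('https://www.aau.dk/')


# construct the vocabulary of terms for your corpus (corpus)
# build an 'inverted' index for your vocabulary (postings)
corpus, postings = tokenizeAndStemTitles(titles)
print(corpus)

# implement Boolean search for your index (perhaps only for a limited set of Boolean queries)
def booleanSearch(query):
    pass  # not implemented yet

query = input("What is your Boolean query?\n(AND, NOT, OR | e.g. aau AND viden) ")
booleanSearch(query)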


['aau', '-', 'viden', 'for', 'verden', 'aalborg', 'universitet', 'universitetsuddannels', 'videregend', 'uddannels', 'p', 'kandidatuddannels', 'sidefag', 'og', 'tilvalgsfag', 'studieby', 'her', 'kan', 'du', 'studer', 'su', 'sp', 'stttemulighed', 'forskn', 'forskningsnyt', 'fra']
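
One possible way to fill in the Boolean search step is sketched below. It is only an illustration, not a full query parser: the function name `boolean_search` and its extra parameters are hypothetical, it assumes the postings format produced by `tokenizeAndStemTitles` (a list of dicts with `vocabulary` and `postings` keys), it evaluates operators strictly left to right without parentheses, and it treats `NOT` as a prefix on a single term. The sample `postings` values are invented, and query terms are only lower-cased, not stemmed.

```python
def boolean_search(query, postings, num_docs):
    """Evaluate e.g. 'aau AND viden' or 'aau AND NOT for' left to right."""
    index = {el['vocabulary']: set(el['postings']) for el in postings}
    all_docs = set(range(num_docs))   # needed to compute NOT

    result = None      # documents matched so far
    op = 'OR'          # operator applied to the next term (default: OR)
    negate = False     # True when the next term is preceded by NOT
    for word in query.split():
        if word in ('AND', 'OR'):
            op = word
        elif word == 'NOT':
            negate = True
        else:
            docs = index.get(word.lower(), set())   # unknown terms match nothing
            if negate:
                docs = all_docs - docs
                negate = False
            if result is None:
                result = docs
            elif op == 'AND':
                result = result & docs
            else:
                result = result | docs
            op = 'OR'   # fall back to the default operator
    return sorted(result or [])

# Invented sample index over three documents (ids 0, 1, 2)
sample_postings = [
    dict(vocabulary='aau', postings=[0, 2]),
    dict(vocabulary='viden', postings=[0, 1]),
    dict(vocabulary='for', postings=[0]),
]

print(boolean_search('aau AND viden', sample_postings, num_docs=3))
print(boolean_search('aau OR viden', sample_postings, num_docs=3))
print(boolean_search('aau AND NOT for', sample_postings, num_docs=3))
```

Converting each postings list to a set keeps the operators short (`&`, `|`, `-`); for large indexes, merging sorted postings lists would avoid building the sets.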


In [2]:
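k = 10
# range(k) yields single integers, which cannot be unpacked into x, y, z;
# use one loop per variable with 1 <= x <= y <= z < k instead
pyth = [(x,y,z) for x in range(1,k) for y in range(x,k) for z in range(y,k) if x**2+y**2==z**2]
print(pyth)

[(3, 4, 5)]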

Which stemmer is most useful for you depends on which websites you crawl. It is not essential for the exercise that the stemming is always the best possible!