# INFO 4271 - Group Project

Issued: June 11, 2024

Due: July 22, 2024

Please submit a link to your code base (ideally on a branch that no longer changes after the submission deadline) and your 4-page report via email to carsten.eickhoff@uni-tuebingen.de by the due date. One submission per team.

---

# 1. Web Crawling & Indexing
Crawl the web to discover **English content related to Tübingen**. The crawled content should be stored locally. If the crawler is interrupted at any point, it should be able to restart and resume the crawl where it left off.
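
One possible way to make a crawl resumable is to checkpoint the frontier and the set of visited URLs to disk at regular intervals. The following is a minimal sketch, not a prescribed design: the function names `save_state` and `load_state`, the JSON layout, and the checkpoint path are all assumptions made for illustration.

```python
import json
import os
import tempfile
from collections import deque


def save_state(path, frontier, visited):
    """Write frontier and visited set to a temp file, then rename it over the checkpoint."""
    state = {"frontier": list(frontier), "visited": sorted(visited)}
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w", encoding="utf-8") as f:
        json.dump(state, f)
    # os.replace swaps the file in one step, so a crash never leaves a half-written checkpoint
    os.replace(tmp_path, path)


def load_state(path, seeds):
    """Resume from an existing checkpoint, or start from the seed URLs."""
    if os.path.exists(path):
        with open(path, encoding="utf-8") as f:
            state = json.load(f)
        return deque(state["frontier"]), set(state["visited"])
    return deque(seeds), set()
```

A crawler built this way would call `load_state` once at start-up and `save_state` every few fetched pages, so that at most the work since the last checkpoint is repeated after an interruption.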

In [18]:
import requests
from bs4 import BeautifulSoup
import justext
from boilerpy3 import extractors
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk import pos_tag
import re
from trafilatura import fetch_url, extract
import contractions
from nltk.stem import WordNetLemmatizer
from duplicateCheck import check_simhash,computeHash
import math
from sklearn.feature_extraction.text import TfidfVectorizer
import pandas as pd
import numpy as np
from sklearn.cluster import KMeans




#Add a document to the index. You need (at least) two parameters:
	#doc: The document to be indexed.
	#index: The location of the local index storing the discovered documents.
def index(doc, index):
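 #TODO: extract the URL and the document from doc (placeholders for now)
 url = 'X'
 #index layout: (inverted index, [url, cluster, length] per doc, doc hashes, processed docs)
 index = ({},[],[], [])
 docID = len(index[1])
 doc = requests.get("https://uni-tuebingen.de/en/").text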


####  Text Preprocessing of the doc ###################

 #extract the relevant text from the raw html doc with trafilatura
 content = extract(doc)
 #extract returns None if no main text is found: nothing to index then
 if content is None:
     return False


 #convert to lowercase
 content = content.lower()

 #expand contractions (e.g. "don't" -> "do not"):
 content = contractions.fix(content)
 
 
 #list where final tokens will be stored
 finalTokens = []
 #get nltk stopwords
 stop_words = set(stopwords.words('english'))

 #initialize lemmatizer
 lemmatizer = WordNetLemmatizer()

 #position of the currently regarded term t in the unprocessed (tokenized) document
 tPos = 0

 #store the words of the processed document and their positions in the unprocessed document
 processedDoc = []
 positions = []

 #compile the patterns once instead of once per token
 urlPattern = re.compile(r'https?://\S+|www\.\S+')
 specialCharPattern = re.compile(r'[^a-zA-Z0-9äöüÄÖÜ\s]')

 #tokenize, go through all words and do the steps for each at once
 for t, tag in pos_tag(word_tokenize(content)):
      #convert to lowercase
      t = t.lower()
      #skip urls (starting with http(s) or www.), tokens starting with a special
      #character or punctuation, and stopwords
      if urlPattern.match(t) or specialCharPattern.match(t) or t in stop_words:
           tPos += 1
           continue

      #convert the Penn Treebank pos tag to a WordNet tag for lemmatization
      #(adjective tags start with 'J', but WordNet expects 'a')
      ltag = tag[0].lower()
      if ltag == 'j':
          ltag = 'a'
      if ltag in ['a', 'r', 'n', 'v']:
          t = lemmatizer.lemmatize(t, ltag)

      #add the processed term and its position to the processed doc
      processedDoc.append(t)
      positions.append(tPos)
      tPos += 1

 ############# Near Duplicate checking ##############
 #near duplicate checking by means of the words in the processed document (processedDoc)
 docHash = computeHash(processedDoc)
 #compare to the other doc hashes
 if check_simhash(docHash, index[2]):
     #stop and do not index if it is a near duplicate
     return False
 #if it is no duplicate, save the hash and the processed doc
 index[2].append(docHash)
 index[3].append(processedDoc)


############ Build up inverted index #####################

 #entry in a posting list is: [docID, [position1, position2, ...], skip pointer]
 #add on the fly to the inverted index
 for t, pos in zip(processedDoc, positions):
      postings = index[0].setdefault(t, [])
      #if the doc id is already in the posting list of that word it must be the last entry
      if postings and postings[-1][0] == docID:
           #if so append the position there
           postings[-1][1].append(pos)
           continue
      #else add a new entry [docID, [position of t], skip pointer (None)] for that word
      postings.append([docID, [pos], None])

      #### rearrange skip pointers ###########
      #place about sqrt(p) evenly spaced skip pointers per posting list (p: length of the posting list of term t)
      p = len(postings)
      #only if the posting list has at least length 4
      if p >= 4:
           spaceBetween = math.isqrt(p)
           #drop the old pointers, their spacing may no longer fit
           for posting in postings:
                posting[2] = None
           #current index in the postings list
           idx = 0
           while idx + spaceBetween < p:
                #set skip pointer [idx to jump to in postings list, docID at that index]
                postings[idx][2] = [idx + spaceBetween, postings[idx + spaceBetween][0]]
                idx += spaceBetween

 length = len(processedDoc)
 #add url , cluster of document (None so far) and length of preprocessed doc to list index[1] after indexing the doc
 index[1].append([url, None, length])

 return True


#clusters the docs of the index and inserts the labels into the index (at most 30 clusters)
#cluster using kmeans clustering with tf-idf vectors
def cluster(index):
    #convert index to tf-idf vector representations for each document
    idx = index[0]
    docs = index[1]
    terms = list(idx.keys())
    #matrix to store vectors (rows: documents, cols: terms)
    tfIDFMatrix = np.zeros((len(docs), len(terms)))

    for col, t in enumerate(terms):
        for posting in idx[t]:
            docID = posting[0]
            #tf: nr occurrences of t in doc / length of doc
            tf = len(posting[1]) / docs[docID][2]
            #idf: log(#docs / #docs containing t)
            idf = math.log(len(docs) / len(idx[t]))
            tfIDFMatrix[docID][col] = tf * idf

    #KMeans needs at least as many documents as clusters
    num_clusters = min(30, len(docs))
    kmeans = KMeans(n_clusters=num_clusters, random_state=42)

    #fit the KMeans model to the TF-IDF matrix
    kmeans.fit(tfIDFMatrix)
    docLabels = kmeans.labels_

    #insert labels in index
    for d in range(len(docs)):
        docs[d][1] = int(docLabels[d])


#Crawl the web. You need (at least) two parameters:
	#frontier: The frontier of known URLs to crawl. You will initially populate this with your seed set of URLs and later maintain all discovered (but not yet crawled) URLs here.
	#index: The location of the local index storing the discovered documents. 
def crawl(frontier, index):
    #TODO: Implement me
	pass

index("", "")

         00       11       13       15      200       29        30     7071  \
0  0.186501  0.09325  0.09325  0.09325  0.09325  0.09325  0.186501  0.09325   

     74444  administration  ...  subject      sum  teaching    three  \
0  0.09325         0.09325  ...  0.09325  0.09325   0.09325  0.09325   

   thursday      tip      top  tübingen  university     word  
0   0.09325  0.09325  0.09325  0.186501    0.279751  0.09325  

[1 rows x 68 columns]


# 2. Query Processing 
Process a textual query and return the 100 most relevant documents from your index. Please incorporate **at least one retrieval model innovation** that goes beyond BM25 or TF-IDF. Please allow for queries to be entered either individually in an interactive user interface (see also #3 below), or via a batch file containing multiple queries at once. The batch file will be formatted to have one query per line, listing the query number, and query text as tab-separated entries. An example of the batch file for the first two queries looks like this:

```
1   tübingen attractions
2   food and drinks
```
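
The batch format described above could be read along the following lines. This is only a sketch under stated assumptions: the function names `parse_batch_lines` and `read_batch` are hypothetical, the file is assumed to be UTF-8 encoded, and the fallback to whitespace-separated fields is an added tolerance that the specification does not require.

```python
def parse_batch_lines(lines):
    """Return (query_number, query_text) pairs from an iterable of batch-file lines."""
    queries = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        # the first field is the query number, the rest of the line is the query text
        parts = line.split("\t", 1)
        if len(parts) < 2:
            # assumption: also accept files that separate the two fields with spaces
            parts = line.split(None, 1)
        if len(parts) < 2:
            continue  # no query text on this line
        number, text = parts
        queries.append((int(number), text.strip()))
    return queries


def read_batch(path):
    """Read a batch file from disk (assumed to be UTF-8)."""
    with open(path, encoding="utf-8") as f:
        return parse_batch_lines(f)
```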

In [None]:
#Retrieve documents relevant to a query. You need (at least) two parameters:
	#query: The user's search query
	#index: The location of the local index storing the discovered documents.
def retrieve(query, index):
    #TODO: Implement me
	pass

# 3. Search Result Presentation
Once you have a result set, we want to return it to the searcher in two ways: a) in an interactive user interface. For this user interface, please think of **at least one innovation** that goes beyond the traditional 10-blue-links interface that most commercial search engines employ. b) as a text file used for batch performance evaluation. The text file should be formatted to produce one ranked result per line, listing the query number, rank position, document URL and relevance score as tab-separated entries. An example of the first three lines of such a text file looks like this:

```
1   1   https://www.tuebingen.de/en/3521.html   0.725
1   2   https://www.komoot.com/guide/355570/castles-in-tuebingen-district   0.671
1   3   https://www.unimuseum.uni-tuebingen.de/en/museum-at-hohentuebingen-castle   0.529
...
1   100 https://www.tuebingen.de/en/3536.html   0.178
2   1   https://www.tuebingen.de/en/3773.html   0.956
2   2   https://www.tuebingen.de/en/4456.html   0.797
...
```
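
As an illustration of the result format above, the lines for one query could be produced roughly as follows. The helper name `format_results` and its `(url, score)` input shape are assumptions for this sketch; rounding the score to three decimals simply mirrors the example and is not stated as a requirement.

```python
def format_results(query_number, ranked, limit=100):
    """Turn (url, score) pairs into tab-separated result lines, best score first."""
    ordered = sorted(ranked, key=lambda pair: pair[1], reverse=True)[:limit]
    return [
        f"{query_number}\t{rank}\t{url}\t{score:.3f}"
        for rank, (url, score) in enumerate(ordered, start=1)
    ]
```

Writing the file then amounts to joining the lines of all queries with newlines, for example `"\n".join(lines)`.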

In [1]:
#TODO: Implement an interactive user interface for part a of this exercise.

#Produce a text file with 100 results per query in the format specified above.
def batch(results):
    #TODO: Implement me.    
    pass

# 4. Performance Evaluation 
We will evaluate the performance of our search systems on the basis of five queries. Two of them are available to you now for engineering purposes:
- `tübingen attractions`
- `food and drinks`

The remaining three queries will be given to you during our final session on July 23rd. Please be prepared to run your systems and produce a single result file for all five queries live in class. That means you should aim for processing times of no more than ~1 minute per query. We will ask you to send carsten.eickhoff@uni-tuebingen.de that file.

# Grading
Your final projects will be graded along the following criteria:
- 25% Code correctness and quality (to be delivered on this sheet)
- 25% Report (4 pages, PDF, explanation and justification of your design choices)
- 25% System performance (based on how well your system performs on the 5 queries relative to the other teams in terms of nDCG)
- 15% Creativity and innovativeness of your approach (in particular with respect to your search system #2 and user interface #3 innovations)
- 10% Presentation quality and clarity
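
For self-evaluation during development, nDCG can be approximated with a few lines of code. The sketch below makes several assumptions that this sheet does not specify: gains are the raw relevance grades (linear gain), the discount is log2(rank + 1), the cutoff `k` defaults to 100, and the ideal ranking is built only from the grades of the returned list. The variant used for grading may differ.

```python
import math


def dcg(relevances):
    """Discounted cumulative gain of relevance grades listed in rank order."""
    return sum(rel / math.log2(rank + 1) for rank, rel in enumerate(relevances, start=1))


def ndcg(relevances, k=100):
    """DCG of the top k results divided by the DCG of the ideal ordering of the same grades."""
    ideal = dcg(sorted(relevances, reverse=True)[:k])
    return dcg(relevances[:k]) / ideal if ideal > 0 else 0.0
```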

# Permissible libraries
You can use any general-purpose ML and NLP libraries such as scipy, numpy, scikit-learn, spacy, nltk, but please stay away from dedicated web crawling or search engine toolkits such as scrapy, whoosh, lucene, terrier, galago and the like. Pretrained models are fine to use as part of your system, as long as they have not been built/trained for retrieval.
