# **Algorithmic Methods of Data Mining - Fall 2022**

#### Group Members: Yoanna Efimova, Angelo Mandara, Cem Sirin
<yoanna.efimova@gmail.com>, <mandara.2077139@studenti.uniroma1.it>, <sirincem1@gmail.com>

## **Homework 3: Places of the World**

**We structured the notebook such that there is a section and subsection for each question and subquestion. The outline is as follows:**
* **Question 1: Data Collection**
    * **1.1:** *Getting the list of places*
    * **1.2:** *Crawling pages*
    * **1.3:** *Parsing pages*
* **Question 2: Search Engine**
    * **2.1:** *Conjunctive queries*
        * **2.1.1:** *Creating an index*
        * **2.1.2:** *Querying the index*
    * **2.2:** *Conjunctive queries and Ranking Scores*
        * **2.2.1:** *Creating an inverted index*
        * **2.2.2:** *Querying the inverted index*
* **Question 3: Defining a score**
* **Question 4: Visualization**
* **Question 5: Complex search engines**
* **Question 6: Command line interface**
* **Question 7: Theoretical Aspects**

**Packages that are used throughout the notebook:**

In [1]:
# Data manipulation
import numpy as np
import pandas as pd

# Progress tracking
from tqdm import tqdm

# Standard library: serialization, file system, timing, dates
import csv
import json
import os
import pickle
import time
from datetime import datetime

# Web scraping
import asyncio
import requests
from bs4 import BeautifulSoup
from requests_html import AsyncHTMLSession

## Question 1: Data Collection

We start by defining the constants and functions that are used throughout this section.

In [2]:
# Constants
BASE_URL = 'https://www.atlasobscura.com'

### 1.1. Getting the list of places

Our task is to collect the places listed on the first 400 pages of Atlas Obscura, sorted by popularity. The URLs of these pages all follow the same format and differ only in the `page` parameter:

In [3]:
# URLs of the first 400 pages of Atlas Obscura, sorted by number of likes
page_urls = [(f'{BASE_URL}/places?page={i}&sort=likes_count') for i in range(1, 401)]

In [6]:
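s = requests.Session()

def save_html(url, path):
    # Skip pages that were already downloaded
    if os.path.exists(path):
        return

    r = s.get(url)
    while r.status_code != 200:
        # Wait before retrying, so that a successful retry is not followed by a needless pause
        time.sleep(30)
        r = s.get(url)

    with open(path, 'w') as f:
        f.write(r.text)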

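We iterate over the URLs of the first 400 pages and save their HTML content in a folder. Because `save_html` skips files that already exist, re-running the cell finishes almost instantly, as the progress bar below shows.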

In [11]:
for i, url in tqdm(enumerate(page_urls), total=len(page_urls)):
    path = f'data/pages/page_{i}.html'
    save_html(url, path)

100%|██████████| 400/400 [00:00<00:00, 106017.16it/s]
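
The retry loop in `save_html` waits a fixed 30 seconds between attempts and never gives up. As a minimal sketch of an alternative, the helper below retries with exponentially growing delays and stops after a bounded number of attempts. The name `fetch_with_backoff`, its parameters, and the `fetch` callable that returns a `(status_code, text)` pair are all hypothetical and not part of the homework code.

```python
import time
from typing import Callable, Optional, Tuple

def fetch_with_backoff(fetch: Callable[[], Tuple[int, str]],
                       max_attempts: int = 5,
                       base_delay: float = 1.0,
                       sleep: Callable[[float], None] = time.sleep) -> Optional[str]:
    """Call `fetch` until it returns status 200 or the attempts run out."""
    for attempt in range(max_attempts):
        status, text = fetch()
        if status == 200:
            return text
        # Double the waiting time after every failed attempt (no wait after the last one)
        if attempt < max_attempts - 1:
            sleep(base_delay * 2 ** attempt)
    return None

# Demo with a stub that fails twice before succeeding (no network access needed)
attempts = {'n': 0}

def stub_fetch():
    attempts['n'] += 1
    return (200, '<html>ok</html>') if attempts['n'] == 3 else (429, '')

waits = []
html = fetch_with_backoff(stub_fetch, sleep=waits.append)
print(html, waits)
```

With `requests`, the `fetch` argument could be a small wrapper around `s.get(url)` that returns `(r.status_code, r.text)`.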


Now we parse the HTML content of each page and extract the list of places. We save the URLs of the places in a text file, as instructed.

In [13]:
# Grabbing the urls of all the places
place_urls = []
for i in tqdm(range(400)):

    with open(f'data/pages/page_{i}.html', 'r') as f:
        html = f.read()
    
    soup = BeautifulSoup(html, 'html.parser')
    place_urls.extend([a['href'] for a in soup.find_all('a', class_='content-card content-card-place')])

# Save the list of hrefs as text file
with open('data/misc/place_urls.txt', 'w') as f:
    for url in place_urls:
        f.write(f'{url}\n')

100%|██████████| 400/400 [01:11<00:00,  5.60it/s]
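
To illustrate what the extraction step does, here is a minimal sketch that pulls place links out of a small hand-written HTML snippet. It uses the standard-library `html.parser` instead of BeautifulSoup so that it runs without extra dependencies. The snippet and the class `PlaceLinkParser` are made up for illustration and only mimic the `content-card content-card-place` class used above.

```python
from html.parser import HTMLParser

class PlaceLinkParser(HTMLParser):
    """Collects the href of every <a> tag that carries the place-card class."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag != 'a':
            return
        attrs = dict(attrs)
        classes = (attrs.get('class') or '').split()
        if 'content-card-place' in classes and attrs.get('href'):
            self.hrefs.append(attrs['href'])

# Hypothetical page fragment: two place cards and one unrelated link
sample_html = '''
<a class="content-card content-card-place" href="/places/example-one">One</a>
<a class="nav-link" href="/about">About</a>
<a class="content-card content-card-place" href="/places/example-two">Two</a>
'''

parser = PlaceLinkParser()
parser.feed(sample_html)
print(parser.hrefs)
```

Only the two place cards are collected; the navigation link is ignored because it lacks the place-card class.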


In [19]:
# Delete all variables that are no longer needed
del html, soup, f, i, url, page_urls, place_urls

### 1.2. Crawl places

In [17]:
# Read the list of place urls
with open('data/misc/place_urls.txt', 'r') as f:
    place_urls = [url.strip() for url in f.readlines()]

In [18]:
for url in tqdm(place_urls):
    save_html(BASE_URL + url, f'data{url}.html')

100%|██████████| 7200/7200 [00:00<00:00, 112391.33it/s]


### 1.3. Parse downloaded pages

In [None]:
# Function to parse and extract data from the htmls
def parse_place(html, url):
    soup = BeautifulSoup(html, 'html.parser')
    placeName = soup.find('h1', class_='DDPage__header-title').text.strip()
    placeTags = [x.text.strip() for x in soup.find('div', class_='item-tags').find_all('a')] if soup.find('div', class_='item-tags') else None
    numPeopleVisited = int(soup.find_all('div', class_='title-md item-action-count')[0].text.strip())
    numPeopleWant = int(soup.find_all('div', class_='title-md item-action-count')[1].text.strip())
    placeDesc = soup.find('div', id='place-body').text.strip()
    placeShortDesc = soup.find('h3', class_='DDPage__header-dek').text.strip()
    placeNearby = [x['href'].strip() for x in soup.find('div', class_='DDPageSiderailRecirc').find_all('a')]
    placeAddress = '; '.join([x.strip() for x in soup.find('address').find('div').contents if isinstance(x, str)])
    placeLat = float(soup.find('div', class_='DDPageSiderail__coordinates js-copy-coordinates')['data-coordinates'].split(',')[0])
    placeLong = float(soup.find('div', class_='DDPageSiderail__coordinates js-copy-coordinates')['data-coordinates'].split(',')[1])
    placeEditors = soup.find('a', class_='DDPContributorsList__contributor')['href'] if soup.find('a', class_='DDPContributorsList__contributor') else None
    placePubDate = datetime.strptime(soup.find('div', class_='DDPContributor__name').text.strip(), '%B %d, %Y') if soup.find('div', class_='DDPContributor__name') else None
    placeRelatedLists  = [x['href'] for x in soup.find('div', attrs={'data-gtm-template': 'DDP Footer Recirc Lists'}).find_all('a')] if soup.find('div', attrs={'data-gtm-template': 'DDP Footer Recirc Lists'}) else None
    placeRelatedPlaces = [x['href'] for x in soup.find('div', attrs={'data-gtm-template': 'DDP Footer Recirc Related'}).find_all('a')] if soup.find('div', attrs={'data-gtm-template': 'DDP Footer Recirc Related'}) else None
    placeURL = '/places/' + url
    return {
        'placeName': placeName,
        'placeTags': placeTags,
        'numPeopleVisited': numPeopleVisited,
        'numPeopleWant': numPeopleWant,
        'placeDesc': placeDesc,
        'placeShortDesc': placeShortDesc,
        'placeNearby': placeNearby,
        'placeAddress': placeAddress,
        'placeLat': placeLat,
        'placeLong': placeLong,
        'placeEditors': placeEditors,
        'placePubDate': placePubDate,
        'placeRelatedLists': placeRelatedLists,
        'placeRelatedPlaces': placeRelatedPlaces,
        'placeURL': placeURL
    }

In [None]:
# Parse all the html files in data/places
place_data = []
for i, file in tqdm(enumerate(os.listdir('data/places'))):
    with open(f'data/places/{file}', 'r') as f:
        html = f.read()
    d = parse_place(html, file.replace('.html', ''))

    # Write the values to a tsv file (separate handle name to avoid shadowing `f`)
    with open(f'data/parsed_places/place_{i}.tsv', 'w', newline='') as out:
        writer = csv.writer(out, delimiter='\t')
        writer.writerow(d.values())

    place_data.append(d)

In [None]:
# Convert the list of dictionaries to a dataframe
df = pd.DataFrame(place_data)

# Save the dataframe as pickle
df.to_pickle('places.pkl')

In [None]:
# Delete all variables that are no longer needed
del place_data, df

## Question 2: Search Engine

Libraries used in this section:

In [20]:
# Stopwords
from nltk.corpus import stopwords # nltk.download('stopwords')
from spacy.lang.en.stop_words import STOP_WORDS
stop_words = set(stopwords.words('english')) | set(STOP_WORDS)  # Combine two sets of stopwords to get a bigger list

# Lemmatizer
from nltk.stem import WordNetLemmatizer
wnl = WordNetLemmatizer()

# Suppress FutureWarning
import warnings; warnings.simplefilter(action='ignore', category=FutureWarning)

In [21]:
def preprocess_column(text_column: pd.Series) -> pd.Series:
    # Step 1: Lowercase
    text_column = text_column.str.lower()
    # Step 2.1: Replace newlines with spaces
    text_column = text_column.str.replace('\n', ' ', regex=False)
    # Step 2.2: Remove all punctuation (regex=True must be explicit in recent pandas versions)
    text_column = text_column.str.replace(r'[^\w\s]', '', regex=True)
    # Step 3: Remove stopwords
    text_column = text_column.apply(lambda x: ' '.join([word for word in x.split() if word not in stop_words]))
    # Step 4: Lemmatize
    text_column = text_column.apply(lambda x: ' '.join([wnl.lemmatize(word) for word in x.split()]))

    return text_column

import re

def preprocess_str(text: str) -> list:
    # Step 1: Lowercase
    text = text.lower()
    # Step 2.1: Replace newlines with spaces
    text = text.replace('\n', ' ')
    # Step 2.2: Remove all punctuation (str.replace does not support regex, so we use re.sub)
    text = re.sub(r'[^\w\s]', '', text)
    # Step 3: Remove stopwords
    tokens = [word for word in text.split() if word not in stop_words]
    # Step 4: Lemmatize with the same default setting as preprocess_column, so queries match the index
    tokens = [wnl.lemmatize(word) for word in tokens]

    return tokens

In [23]:
# Load the parsed places
df = pd.read_pickle('data/misc/places.pkl')
# Preprocess the descriptions (lowercase, strip punctuation, remove stopwords, lemmatize)
df['placeDescX'] = preprocess_column(df['placeDesc'])
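
As a worked example of the preprocessing steps, the sketch below applies lowercasing, punctuation removal, and stopword filtering to one short string. It is a simplified stand-in: `toy_preprocess` and `TOY_STOP_WORDS` are hypothetical names, the stopword list is a tiny illustrative one rather than the NLTK and spaCy lists, and lemmatization is left out so that the block runs with the standard library only.

```python
import re

# Tiny illustrative stopword list (the notebook combines the NLTK and spaCy lists)
TOY_STOP_WORDS = {'the', 'a', 'an', 'of', 'in', 'is', 'and', 'to'}

def toy_preprocess(text: str) -> list:
    text = text.lower()                  # Step 1: lowercase
    text = text.replace('\n', ' ')       # Step 2.1: newlines become spaces
    text = re.sub(r'[^\w\s]', '', text)  # Step 2.2: drop punctuation
    # Step 3: drop stopwords
    return [word for word in text.split() if word not in TOY_STOP_WORDS]

print(toy_preprocess("The Museum of Food,\nand Drink!"))
# ['museum', 'food', 'drink']
```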

### 2.1. Conjunctive query

#### 2.1.1. Create your index!

In [27]:
from sklearn.feature_extraction.text import CountVectorizer

# Count terms in each document
vectorizer = CountVectorizer(min_df=5)
X = vectorizer.fit_transform(df['placeDescX'])

# Save the vocabulary
words = vectorizer.get_feature_names_out()
print('Last 10 words in the vocabulary:', words[-10:])

Last 10 words in the vocabulary: ['zone' 'zoning' 'zoo' 'zoological' 'zoology' 'zoom' 'zooming' 'zuni'
 'černý' 'černýs']


In [28]:
vocabulary = {} # Dictionary that maps words to term_id
inv_idx = {} # Dictionary that maps term_id to document_id
for i, word in tqdm(enumerate(words), total=len(words)):
    vocabulary[word] = i
    inv_idx[i] = X[:, i].nonzero()[0]

print('The term ID of the word "coffee" is:', vocabulary['coffee'], 'and the doc_IDs of the document that contains the word "coffee" is:', inv_idx[vocabulary['coffee']])

100%|██████████| 15335/15335 [01:21<00:00, 187.26it/s]

The term ID of the word "coffee" is: 3081 and the doc_IDs of the document that contains the word "coffee" is: [ 101  127  142  239  376  428  499  551  598  718  789  944  971 1047
 1155 1156 1169 1266 1284 1398 1428 1440 1534 1549 1723 1773 1860 1880
 1886 1932 2044 2056 2093 2184 2186 2237 2280 2290 2445 2581 2650 2755
 2868 2884 3016 3018 3092 3107 3110 3123 3187 3213 3302 3420 3423 3428
 3596 3682 3936 3951 4043 4153 4163 4214 4257 4264 4319 4358 4370 4399
 4407 4420 4561 4702 4739 4810 4894 5006 5155 5273 5319 5338 5498 5539
 5650 5766 5786 5802 5804 5852 5896 6070 6092 6105 6132 6363 6398 6513
 6535 6573 6575 6596 6748 6903 6969 7020 7186]





#### 2.1.2. Execute the query

In [29]:
def search(q):
    '''q: query string'''

    # Preprocess query
    q = preprocess_str(q)
    print(q)
    # Get term IDs for query terms (words outside the vocabulary are ignored)
    idx = [vocabulary[word] for word in q if word in vocabulary]
    print(idx)
    # Document IDs that contain all query terms; empty if no query term is in the vocabulary
    docs = set.intersection(*[set(inv_idx[i]) for i in idx]) if idx else set()

    return df.iloc[list(docs)][['placeDesc', 'placeName', 'placeURL']]

In [30]:
# Example query
q0 = 'American Museum yomama' # I added 'yomama' to test
search(q0).head()

['american', 'museum', 'yomama']
[1068, 9156]


|      | placeDesc | placeName | placeURL |
|------|-----------|-----------|----------|
| 3072 | Once only open to academics, Lombroso’s Museum... | Cesare Lombroso's Museum of Criminal Anthropology | /places/cesare-lombrosos-museum-of-criminal-an... |
| 2049 | It’s easy to work up an appetite as you meande... | Museum of Food and Drink | /places/museum-of-food-and-drink-mofad |
| 7169 | With its rich collection of historic and conte... | Philbrook Museum of Art | /places/philbrook-museum-of-art |
| 6661 | Located in Madison County, Tennessee, this par... | Pinson Mounds State Archeological Park | /places/pinson-mounds-state-archeological-park |
| 6662 | Steve McVoy was always fascinated by TV. In mi... | Early Television Museum | /places/early-television-museum |
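
The index and the conjunctive query can be illustrated on a toy corpus. The sketch below is a pure-Python simplification: it maps words directly to sets of document IDs instead of going through term IDs and a sparse matrix, and the documents and the names `build_inverted_index` and `conjunctive_query` are made up for illustration.

```python
from collections import defaultdict

# Hypothetical, already preprocessed documents: doc_id -> list of tokens
toy_docs = {
    0: ['american', 'museum', 'art'],
    1: ['coffee', 'museum'],
    2: ['american', 'coffee', 'house'],
}

def build_inverted_index(docs):
    """Map every word to the set of documents that contain it."""
    index = defaultdict(set)
    for doc_id, tokens in docs.items():
        for token in tokens:
            index[token].add(doc_id)
    return index

def conjunctive_query(index, terms):
    """Return the documents that contain every term in `terms`."""
    postings = [index.get(term, set()) for term in terms]
    if not postings:
        return set()
    return set.intersection(*postings)

toy_index = build_inverted_index(toy_docs)
print(conjunctive_query(toy_index, ['american', 'museum']))  # {0}
```

Note one design difference: in this sketch an unknown word yields an empty result, whereas `search` above drops words that are not in the vocabulary before intersecting.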


### 2.2. Conjunctive query & ranking score

#### 2.2.1. Inverted index

In [31]:
from sklearn.feature_extraction.text import TfidfTransformer

# Transform the count matrix to a normalized tf or tf-idf representation
tfidf = TfidfTransformer()
X_tfidf = tfidf.fit_transform(X)

In [33]:
# Dictionary that maps each term_id to a list of (document_id, tf-idf score) tuples,
# initialized with an empty list for every term_id
inv_idx_tfidf = {term_id: [] for term_id in vocabulary.values()}

for doc_id, term_id in tqdm(zip(*X_tfidf.nonzero()), total=X_tfidf.nnz):
    inv_idx_tfidf[term_id].append((doc_id, X_tfidf[doc_id, term_id]))



100%|██████████| 752948/752948 [00:48<00:00, 15588.92it/s]
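
For intuition about the weights stored in this index, the sketch below recomputes tf-idf by hand on three toy documents. It assumes the default settings of scikit-learn's `TfidfTransformer` (smoothed idf, i.e. $\ln\frac{1+n}{1+\text{df}(t)} + 1$, raw term counts, and L2 normalization of each document vector). The data and the function name `tfidf_rows` are made up for illustration.

```python
import math

# Hypothetical term counts per document: term -> count
toy_counts = [
    {'museum': 2, 'art': 1},
    {'museum': 1, 'coffee': 3},
    {'coffee': 1},
]

def tfidf_rows(counts):
    n_docs = len(counts)
    # Document frequency: in how many documents does each term occur?
    df = {}
    for row in counts:
        for term in row:
            df[term] = df.get(term, 0) + 1
    # Smoothed inverse document frequency
    idf = {term: math.log((1 + n_docs) / (1 + d)) + 1 for term, d in df.items()}
    rows = []
    for row in counts:
        weights = {term: count * idf[term] for term, count in row.items()}
        # L2-normalize so that every document vector has unit length
        norm = math.sqrt(sum(w * w for w in weights.values()))
        rows.append({term: w / norm for term, w in weights.items()})
    return rows

for row in tfidf_rows(toy_counts):
    print(row)
```

Because every row is normalized to unit length, a document that consists of a single term gets weight 1 for that term regardless of how rare the term is.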


#### 2.2.2. Execute the query

In [None]:
def search_tfidf(q: str, k: int = 10):
    '''
    q: query string
    k: number of results to return
    '''

    # Preprocess query
    q = preprocess_str(q)

    # Get term IDs for query terms (words outside the vocabulary are ignored)
    term_idx = [vocabulary[word] for word in q if word in vocabulary]

    # Documents that contain all query terms (intersection); empty if no query term is known
    docs = set.intersection(*[set(doc_id for doc_id, _ in inv_idx_tfidf[t]) for t in term_idx]) \
        if term_idx else set()

    # Sum the tf-idf weights of the query terms for each matching document
    scores = {doc_id: 0 for doc_id in docs}
    for term_id in term_idx:
        for doc_id, score in inv_idx_tfidf[term_id]:
            if doc_id in docs:
                scores[doc_id] += score

    # Spread score values over [0, 1]; compute min and max once and guard against a zero range
    if scores:
        lo, hi = min(scores.values()), max(scores.values())
        span = (hi - lo) or 1
        scores = {doc_id: (v - lo) / span for doc_id, v in scores.items()}

    # Sort the documents by their tf-idf scores
    docs = sorted(scores, key=scores.get, reverse=True)
    filtered = df.iloc[docs][['placeDesc', 'placeName', 'placeURL']].copy()
    filtered['score'] = [scores[doc_id] for doc_id in docs]

    return filtered.head(k)

In [None]:
q0 = 'American Museum yomama' # I added 'yomama' to test
search_tfidf(q0).head()

### *Comparing Results*

## 3. Define a new score!

In [None]:
def search_new_score(q: str, k: int = 10):
    '''
    q: query string
    k: number of results to return
    '''

    # Start from all conjunctive matches, scored by tf-idf
    filtered = search_tfidf(q, k=len(df))

    # Preprocess query
    q = preprocess_str(q)

    # Processed URL words
    filtered['placeURLX'] = filtered['placeURL'].apply(
        lambda x: [wnl.lemmatize(y) for y in x.split('/')[-1].split('-')])
    
    # New score: tf-idf score plus the number of query words found in the URL
    filtered['new_score'] = filtered['placeURLX'].apply(lambda x: len(set(q) & set(x))) + filtered['score']

    # Spread new_score values over [0, 1], guarding against a zero range
    lo, hi = filtered['new_score'].min(), filtered['new_score'].max()
    filtered['new_score'] = (filtered['new_score'] - lo) / ((hi - lo) or 1)

    # Drop unnecessary columns
    filtered.drop(['placeURLX', 'score'], axis=1, inplace=True)

    return filtered.sort_values(by='new_score', ascending=False).head(k)

In [None]:
# Example query
q0 = 'American Museum'
search_new_score(q0, k=5)

### Comparing the results of TF-IDF vs. New Score
We first describe how the new score is computed. We define the new score of document $d$ given a query $q$ as

$$
\text{new-score}(d, q) = \text{tf-idf-score}(d, q) + |q \cap \text{URL}(d)|,
$$

where $\text{tf-idf-score}(d, q)$ is the normalized tf-idf score of document $d$ given query $q$, and $|q \cap \text{URL}(d)|$ is the number of query words that also appear in the URL of the document. In the code, this sum is then rescaled to $[0, 1]$ with min-max normalization over the result set. The top 5 documents for the query "American Museum" under each score are shown below.

| Rank | Top results by New Score                   | Top results by TF-IDF Score       |
|------|--------------------------------------------|-----------------------------------|
| 1    | American Writers Museum                    | Siriraj Medical Museum            |
| 2    | American Banjo Museum                      | Museum of the Weird               |
| 3    | The American Kennel Club Museum of the Dog | Harvard Museum of Natural History |
| 4    | American Museum of Western Art             | Milwaukee Art Museum              |
| 5    | American Museum of Magic                   | Sweet Home Cafe                   |

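For this query, the new score gives clearly better results than the TF-IDF score alone: all five of its top results have both query words in their names, whereas none of the top TF-IDF results has "American" in its name and one of them is a cafe. The new score rewards documents whose URL contains the query words, which is a good indicator of relevance because the URL mirrors the place name.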
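
The effect of the URL bonus can be checked on a small worked example. The sketch below applies the formula and the min-max rescaling to three candidates. The URLs, the tf-idf scores, and the names `min_max` and `new_scores` are all invented for illustration, and the URL words are not lemmatized as they are in `search_new_score`.

```python
def min_max(values):
    """Rescale values to [0, 1]; return zeros if all values are equal."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

def new_scores(query_terms, candidates):
    """candidates: list of (url, normalized tf-idf score) pairs."""
    raw = []
    for url, tfidf_score in candidates:
        # Words of the last URL segment, e.g. 'museum-of-examples' -> {'museum', 'of', 'examples'}
        slug_words = set(url.rstrip('/').split('/')[-1].split('-'))
        raw.append(tfidf_score + len(set(query_terms) & slug_words))
    return min_max(raw)

# Hypothetical candidates with made-up tf-idf scores
toy_candidates = [
    ('/places/american-example-museum', 0.2),  # 2 query words in the URL
    ('/places/museum-of-examples', 0.9),       # 1 query word in the URL
    ('/places/example-cafe', 1.0),             # 0 query words in the URL
]
scores = new_scores(['american', 'museum'], toy_candidates)
print(scores)
```

The raw sums are 2.2, 1.9 and 1.0, so after rescaling the candidate with the lowest tf-idf score ranks first and the one with the highest tf-idf score ranks last. This shows that the URL bonus dominates the tf-idf term, since each matching word adds as much as the full tf-idf range.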

## 4. Visualizing the most relevant places

In [None]:
import plotly.graph_objects as go

# Plot the query results on a map using the columns placeLat and placeLong
def plot_map(q, k=10):

    # Attach the scores to the full records; joining only `new_score` avoids duplicated column names
    filtered = df.join(search_new_score(q, k=k)[['new_score']], how='inner')

    fig = go.Figure(go.Scattermapbox(
        lat=filtered['placeLat'],
        lon=filtered['placeLong'],
        mode='markers',
        marker=go.scattermapbox.Marker(
            size=15,
            opacity=0.5,
            color=filtered['new_score'],
            colorbar=go.scattermapbox.marker.ColorBar(
                title='Score'
            ) 
        ),
        # Hover labels must come from the filtered rows, not from the full dataframe
        text=filtered['placeName'],
        hoverinfo='text',
    ))

    fig.update_layout(
        mapbox_style="open-street-map",
        hovermode='closest',
        margin=dict(l=0, r=0, t=0, b=0)
        )

    fig.show()

In [None]:
# Example query
q0 = 'American Museum'
plot_map(q0, k=30)

## 5. More complex search engine

## 6. Command line question

In [None]:
# read df from pickle
df = pd.read_pickle('places.pkl')

# Remove all tabs (\t) and newlines (\n) from the text
df['placeDesc'] = df['placeDesc'].apply(lambda x: x.replace('\t', ' ').replace('\n', ' '))
df.to_csv('df.tsv', sep='\t', index=False)

**How many places in Italy, Spain, France, England and the United States are there in our dataset?**

In [None]:
%%bash

for country in Italy Spain France England;
do
    echo "The number of places in $country:"
    awk -F '\t' '$8 ~ /'$country'/{c++} END{print c}' df.tsv
done

echo "The number of places in United States:"
awk -F '\t' '$8 ~ /United States/{c++} END{print c}' df.tsv

**The average number of visitors to places in Italy, Spain, France, England and the United States**

In [None]:
%%bash

countries=(Italy Spain France England)

for country in "${countries[@]}";
do
    echo "The average number of visitors for $country"
    awk -F '\t' '$8 ~ /'$country'/{total += $3; count++} END{print total/count}' df.tsv
done

echo "The average number of visitors for United States"
awk -F '\t' '$8 ~ /United States/{total += $3; count++} END{print total/count}' df.tsv

**The number of people who want to visit places in Italy, Spain, France, England and the United States**

In [None]:
%%bash

countries=(Italy Spain France England)

for country in "${countries[@]}";
do
    echo "The number of people who want to visit $country"
    awk -F '\t' '$8 ~ /'$country'/{total += $4; count++} END{print total}' df.tsv
done

echo "The number of people who want to visit United States"
awk -F '\t' '$8 ~ /United States/{total += $4; count++} END{print total}' df.tsv
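
As a cross-check for the `awk` commands above, the same three statistics can be computed in Python with the standard-library `csv` module. This is only a sketch under assumptions: the function name `country_stats` is hypothetical, the column positions (address in column 8, visitors in column 3, want-to-visit in column 4, counted from 1) mirror the `awk` field numbers, and the toy rows are invented.

```python
import csv
import io

def country_stats(tsv_text, countries, address_col=7, visited_col=2, want_col=3):
    """Count places and sum visitor numbers per country (0-based column indices)."""
    stats = {c: {'places': 0, 'visited': 0, 'want': 0} for c in countries}
    reader = csv.reader(io.StringIO(tsv_text), delimiter='\t')
    next(reader, None)  # skip the header row
    for row in reader:
        if len(row) <= address_col:
            continue  # skip malformed rows
        for c in countries:
            # Substring match on the address, like the awk pattern match on $8
            if c in row[address_col]:
                stats[c]['places'] += 1
                stats[c]['visited'] += int(row[visited_col])
                stats[c]['want'] += int(row[want_col])
    return stats

# Hypothetical rows in the same column order as df.tsv
header = ['placeName', 'placeTags', 'numPeopleVisited', 'numPeopleWant',
          'placeDesc', 'placeShortDesc', 'placeNearby', 'placeAddress']
toy_rows = [
    ['A', '[]', '10', '30', 'desc', 'short', '[]', 'Rome; Italy'],
    ['B', '[]', '20', '50', 'desc', 'short', '[]', 'Milan; Italy'],
    ['C', '[]', '5', '7', 'desc', 'short', '[]', 'Paris; France'],
]
toy_tsv = '\n'.join('\t'.join(r) for r in [header] + toy_rows)

toy_stats = country_stats(toy_tsv, ['Italy', 'France', 'United States'])
for country, s in toy_stats.items():
    average = s['visited'] / s['places'] if s['places'] else 0
    print(country, s['places'], average, s['want'])
```

To run it on the real data, the text of `df.tsv` would be passed as `tsv_text`. The explicit check on `places` avoids the division by zero that the `awk` average would hit for a country with no matches.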

## 7. Theoretical question