# **Algorithmic Methods of Data Mining - Fall 2022**

#### Group Members: Yoanna Efimova, Angelo Mandara, Cem Sirin
<yoanna.efimova@gmail.com>, <mandara.2077139@studenti.uniroma1.it>, <sirincem1@gmail.com>

## **Homework 3: Places of the World**

**We structured the notebook such that there is a section and subsection for each question and subquestion. The outline is as follows:**
* **Question 1: Data Collection**
    * **1.1:** *Getting the list of places*
    * **1.2:** *Crawling pages*
    * **1.3:** *Parsing pages*
* **Question 2: Search Engine**
    * **2.1:** *Conjunctive queries*
        * **2.1.1:** *Creating an index*
        * **2.1.2:** *Querying the index*
    * **2.2:** *Conjunctive queries and Ranking Scores*
        * **2.2.1:** *Creating an inverted index*
        * **2.2.2:** *Querying the inverted index*
* **Question 3: Defining a score**
* **Question 4: Visualization**
* **Question 5: Complex search engines**
* **Question 6: Command line interface**
* **Question 7: Theoretical Aspects**

**Packages that are used troughout the notebook:**

In [2]:
# Library for data manipulation
import pandas as pd
import numpy as np
# Library for tracking progress
from tqdm import tqdm
import time
import os
import sys 
# Libraries to scrape the web
import requests
from bs4 import BeautifulSoup
# Library to save data
import json
import pickle
# Library to work with dates
from datetime import datetime
import csv
# Our scripts
import scripts

## Question 1: Data Collection

We start by importing the necessary packages and defining the contstants and functions that will be used throughout this section.

In [2]:
# Constants
BASE_URL = 'https://www.atlasobscura.com'

### 1.1. Getting the list of places

Our task is to get the list of places in the top 400 pages sorted by popularity. The URLs of those pages follow the same format

In [3]:
# Top first 400 pages of Atlas Obscura
page_urls = [(f'{BASE_URL}/places?page={i}&sort=likes_count') for i in range(1, 401)]

In [4]:
s = requests.Session()

def save_html(url, path):
    if os.path.exists(path):
        return

    r = s.get(url)
    while r.status_code != 200:
        r = s.get(url)
        time.sleep(30)

    with open(path, 'w') as f:
        f.write(r.text)

We iterate over the URLs of the top 400 pages and save their html content in a folder.

In [5]:
for i, url in tqdm(enumerate(page_urls), total=len(page_urls)):
    path = f'data/pages/page_{i}.html'
    save_html(url, path)

100%|██████████| 400/400 [00:00<00:00, 82744.21it/s]


Now, we parse over the html content of each page and extract the list of places. We save the URLs of the places in a text file as instructed.

In [6]:
# Grabbing the urls of all the places
place_urls = []
for i in tqdm(range(400)):

    with open(f'data/pages/page_{i}.html', 'r') as f:
        html = f.read()
    
    soup = BeautifulSoup(html, 'html.parser')
    place_urls.extend([a['href'] for a in soup.find_all('a', class_='content-card content-card-place')])

# Save the list of hrefs as text file
with open('data/misc/place_urls.txt', 'w') as f:
    for url in place_urls:
        f.write(f'{url}\n')

100%|██████████| 400/400 [01:01<00:00,  6.45it/s]


In [7]:
# Delete all variables that are no longer needed
del html, soup, f, i, url, page_urls, place_urls, path

### 1.2. Crawl places

In [9]:
# Read the list of place urls
with open('data/misc/place_urls.txt', 'r') as f:
    place_urls = [url.strip() for url in f.readlines()]

In [10]:
for url in tqdm(place_urls):
    save_html(BASE_URL + url, f'data{url}.html')

100%|██████████| 7200/7200 [00:00<00:00, 55704.23it/s]


### 1.3. Parse downloaded pages

In [23]:
# Function to parse and extract data from the htmls
def parse_place(html, url):
    soup = BeautifulSoup(html, 'html.parser')
    placeName = soup.find('h1', class_='DDPage__header-title').text.strip()
    placeTags = [x.text.strip() for x in soup.find('div', class_='item-tags').find_all('a')] if soup.find('div', class_='item-tags') else None
    numPeopleVisited = int(soup.find_all('div', class_='title-md item-action-count')[0].text.strip())
    numPeopleWant = int(soup.find_all('div', class_='title-md item-action-count')[1].text.strip())
    placeDesc = soup.find('div', id='place-body').text.strip()
    placeShortDesc = soup.find('h3', class_='DDPage__header-dek').text.strip()
    placeNearby = [x['href'].strip() for x in soup.find('div', class_='DDPageSiderailRecirc').find_all('a')]
    placeAddress = '; '.join([x.strip() for x in soup.find('address').find('div').contents if isinstance(x, str)])
    placeLat = float(soup.find('div', class_='DDPageSiderail__coordinates js-copy-coordinates')['data-coordinates'].split(',')[0])
    placeLong = float(soup.find('div', class_='DDPageSiderail__coordinates js-copy-coordinates')['data-coordinates'].split(',')[1])
    placeEditors = [*set([x.text.strip().split('\n')[-1] for x in soup.find_all('a', class_='DDPContributorsList__contributor')])] if soup.find_all('a', class_='DDPContributorsList__contributor') else None
    placePubDate = datetime.strptime(soup.find('div', class_='DDPContributor__name').text.strip(), '%B %d, %Y') if soup.find('div', class_='DDPContributor__name') else None
    placeRelatedLists  = [x['href'] for x in soup.find('div', attrs={'data-gtm-template': 'DDP Footer Recirc Lists'}).find_all('a')] if soup.find('div', attrs={'data-gtm-template': 'DDP Footer Recirc Lists'}) else None
    placeRelatedPlaces = [x['href'] for x in soup.find('div', attrs={'data-gtm-template': 'DDP Footer Recirc Related'}).find_all('a')] if soup.find('div', attrs={'data-gtm-template': 'DDP Footer Recirc Related'}) else None
    placeURL = '/places/' + url
    return {
        'placeName': placeName,
        'placeTags': placeTags,
        'numPeopleVisited': numPeopleVisited,
        'numPeopleWant': numPeopleWant,
        'placeDesc': placeDesc,
        'placeShortDesc': placeShortDesc,
        'placeNearby': placeNearby,
        'placeAddress': placeAddress,
        'placeLat': placeLat,
        'placeLong': placeLong,
        'placeEditors': placeEditors,
        'placePubDate': placePubDate,
        'placeRelatedLists': placeRelatedLists,
        'placeRelatedPlaces': placeRelatedPlaces,
        'placeURL': placeURL
    }

In [25]:
# Parse all the html files in the data/places
place_data = []
for i, file in tqdm(enumerate(os.listdir('data/places')), total=len(os.listdir('data/places'))):
    with open(f'data/places/{file}', 'r') as f:
        html = f.read()
        d = parse_place(html, file.replace('.html', ''))

        # Write the values to a tsv file
        with open(f'data/parsed_places/place_{i}.tsv', 'w') as f:
            writer = csv.writer(f, delimiter='\t')
            writer.writerow(d.values())

        place_data.append(d)

100%|██████████| 7200/7200 [36:24<00:00,  3.30it/s]  


In [26]:
# Convert the list of dictionaries to a dataframe
df = pd.DataFrame(place_data)

# Save the dataframe as pickle
df.to_pickle('places2.pkl')

In [27]:
# Delete all variables that are no longer needed
del place_data, df, d, file, f, html, i, writer, url

## Question 2: Search Engine

In [3]:
preprocessor = scripts.Preprocessor()

# open places.pkl
df = pd.read_pickle('data/misc/places2.pkl')
# Step 1: Lowercase
df['placeDescX'] = preprocessor.preprocess_column(df['placeDesc'])

### 2.1. Conjunctive query

#### 2.1.1. Create your index!

To create the index, we use the CountVectorizer module from the sklearn library. We use the default parameters of the CountVectorizer module, except minimum document frequency, which we set to 5. This means that we only consider words that appear in at least 5 documents.

In [36]:
from sklearn.feature_extraction.text import CountVectorizer

# Count terms in each document
vectorizer = CountVectorizer(min_df=5)
X = vectorizer.fit_transform(df['placeDescX'])

# Save the vocabulary
words = vectorizer.get_feature_names_out()
print('Last 10 words in the vocabulary:', words[-10:])

Last 10 words in the vocabulary: ['zone' 'zoning' 'zoo' 'zoological' 'zoology' 'zoom' 'zooming' 'zuni'
 'černý' 'černýs']


In [37]:
vocabulary = {} # Dictionary that maps words to term_id
inv_idx = {} # Dictionary that maps term_id to document_id
for term_id, word in tqdm(enumerate(words), total=len(words)):
    vocabulary[word] = term_id
    inv_idx[term_id] = X[:, term_id].nonzero()[0]

100%|██████████| 15335/15335 [01:21<00:00, 188.38it/s]


In [38]:
print('The term ID of the word "kafka" is:', vocabulary['kafka'], \
     'and the doc_IDs of the document that contains the word "kafka" are:', \
        inv_idx[vocabulary['kafka']])

The term ID of the word "kafka" is: 7774 and the doc_IDs of the document that contains the word "kafka" are: [ 867 1104 1671 3174 5093 5554]


#### 2.1.2. Execute the query

In [39]:
def search(q):
    '''q: query string'''

    # Preprocess query
    q = preprocessor.preprocess_str(q)
    
    # Get term IDs for query terms
    idx = [vocabulary[word] for word in q if word in vocabulary]

    # Document IDs that contain all query terms
    docs = set.intersection(*[set(inv_idx[i]) for i in idx])

    return df.iloc[list(docs)][['placeDesc', 'placeName', 'placeURL']]

In [40]:
# Example query
q0 = 'American Museum yomama' # I added 'yomama' to test
search(q0).head()

Unnamed: 0,placeDesc,placeName,placeURL
3072,"Once only open to academics, Lombroso’s Museum...",Cesare Lombroso's Museum of Criminal Anthropology,/places/cesare-lombrosos-museum-of-criminal-an...
2049,It’s easy to work up an appetite as you meande...,Museum of Food and Drink,/places/museum-of-food-and-drink-mofad
7169,With its rich collection of historic and conte...,Philbrook Museum of Art,/places/philbrook-museum-of-art
6661,"Located in Madison County, Tennessee, this par...",Pinson Mounds State Archeological Park,/places/pinson-mounds-state-archeological-park
6662,Steve McVoy was always fascinated by TV. In mi...,Early Television Museum,/places/early-television-museum


### 2.2. Conjunctive query & ranking score

#### 2.2.1. Inverted index

Now, we use the TfidfTransformer module from the sklearn library to transform the count matrix into a tf-idf representation.

In [41]:
from sklearn.feature_extraction.text import TfidfTransformer

# Transform the count matrix to a normalized tf or tf-idf representation
tfidf = TfidfTransformer()
X_tfidf = tfidf.fit_transform(X)

In [42]:
inv_idx_tfidf = {} # Dictionary that maps term_id to document_id and tf-idf score

# create an empty list for each term_id
inv_idx_tfidf = {term_id: [] for term_id in vocabulary.values()}

for doc_id, term_id in tqdm(zip(*X_tfidf.nonzero()), total=X_tfidf.nnz):
    inv_idx_tfidf[term_id].append((doc_id, X_tfidf[doc_id, term_id]))

# The length of the document vector fot tf-idf values
doc_len = np.sqrt(X_tfidf.power(2).sum(axis=0).A1)

100%|██████████| 752948/752948 [00:37<00:00, 19849.07it/s]


In [43]:
print('Now, the tf-idf scores for the word "kafka" are:',
inv_idx_tfidf[vocabulary['kafka']])

Now, the tf-idf scores for the word "kafka" are: [(867, 0.5921927432620192), (1104, 0.4840873049178918), (1671, 0.11799002687449961), (3174, 0.2164608512912014), (5093, 0.08311932460342333), (5554, 0.632818067340741)]


#### 2.2.2. Execute the query

In [44]:
import heapq

class SearchEngine:

    def __init__(self, df, inv_idx, vocabulary, doc_len, q: str):
        self.df = df
        self.inv_idx = inv_idx
        self.vocabulary = vocabulary
        self.doc_len = doc_len
        self.q = preprocessor.preprocess_str(q)
        self.heap_order = []

        self.rank_documents()

    def rank_documents(self):

        # Get term IDs for query terms
        term_idx = [self.vocabulary[word] for word in self.q if word in self.vocabulary]

        # Document IDs that contain all query terms
        docs = set.intersection(*[set(i[0] for i in self.inv_idx[i]) for i in term_idx])

        # Calculate the tf-idf score for each document
        scores = {doc_id: 0 for doc_id in docs}
        for term_id in term_idx:
            for doc_id, tfidf in self.inv_idx[term_id]:
                if doc_id in docs:
                    scores[doc_id] += tfidf / self.doc_len[doc_id]

        # Spread the scores between 0 and 1
        scores = {k: (v - min(scores.values())) / (max(scores.values()) - min(scores.values())) \
                    for k, v in scores.items()}

        # Sort the scores in descending order
        # self.heap_order = heapq.nlargest(len(scores), scores, key=scores.get)

        self.heap_order = [(-v, k) for k, v in scores.items()]
        heapq.heapify(self.heap_order)

        self.df = self.df.loc[docs][['placeName', 'placeDesc', 'placeURL']]
        self.df['score'] = [scores[doc_id] for doc_id in self.df.index]


    def get_top_k(self, k: int = 10):
        '''
        k: number of results to return
        '''

        top_k_docs = [i[1] for i in heapq.nsmallest(k, self.heap_order)]
        return self.df.loc[top_k_docs]

In [45]:
q0 = 'American Museum yomama' # I added 'yomama' to test
se = SearchEngine(df, inv_idx_tfidf, vocabulary, doc_len, q0)
se.get_top_k(5)

Unnamed: 0,placeName,placeDesc,placeURL,score
942,AAF Tank Museum,The American Armed Forces Tank Museum (AAF Tan...,/places/aaf-tank-museum,1.0
705,Self-Taught Genius Gallery,"In 2017, the American Folk Art Museum in Manha...",/places/self-taught-genius-gallery,0.892628
2224,Glore Psychiatric Museum,"Located in St. Joseph, Missouri, the Glore Psy...",/places/glore-psychiatric-museum,0.786417
2510,Indian Steps Museum,"Constructed by a local lawyer from 1908-1912, ...",/places/indian-steps-museum,0.748543
5232,Museum of the Weird,The dime or dime store museum is by all accoun...,/places/museum-weird,0.695378


## 3. Define a new score!

In [69]:
import numpy as np

class NewSearchEngine:

    def __init__(self, df, inv_idx, vocabulary, doc_len, q: str, lat = 41.9028, long = 12.4964):
        self.df = df
        self.inv_idx = inv_idx
        self.vocabulary = vocabulary
        self.doc_len = doc_len
        self.q = preprocessor.preprocess_str(q)
        self.lat = lat
        self.long = long
        self.heap_order = []

        self.preprocess_df()
        self.rank_documents()


    def preprocess_df(self):

        # The distance between the user and the place
        self.df['placeDistance'] = self.normalize_inv(
            self.df.apply(lambda x: ((x['placeLat'] - self.lat)**2 + (x['placeLong'] - self.long)**2)**0.5, axis=1)
        )
        # The interest in place with respect to popularity
        self.df['placeInterest'] = self.normalize(
            self.df['numPeopleWant'] / (self.df['numPeopleVisited'] + 1)
        )
        # The interest in place with respect to the past amount of time the post was made
        self.df['averageWants'] = self.normalize(
            self.df['numPeopleWant'] / ((datetime.now() - self.df['placePubDate']).dt.days + 1)
        )
        # The number of editors that worked on the post
        self.df['numEditors'] = self.normalize(self.df['placeEditors'].apply(lambda x: len(x) if x else 0))
        # Preprocessed tags
        self.df['placeTagsX'] = self.df['placeTags'].apply(lambda x: [scripts.wnl.lemmatize(y.lower()) for y in x] if x else [])
        # The number of matching tags with the query
        self.df['matchedTags'] = self.df['placeTagsX'].apply(lambda x: len(set(x).intersection(self.q)))
        # Preprocessed URL
        self.df['placeURLX'] = self.df['placeURL'].apply(lambda x: [scripts.wnl.lemmatize(y) for y in x.split('/')[-1].split('-')])
        # The number of matching words in the URL with the query
        self.df['matchedURL'] = self.df['placeURLX'].apply(lambda x: len(set(x).intersection(self.q)))

    def rank_documents(self):

        # Get term IDs for query terms
        term_idx = [self.vocabulary[word] for word in self.q if word in self.vocabulary]

        # Document IDs that contain all query terms
        docs = set.intersection(*[set(i[0] for i in self.inv_idx[i]) for i in term_idx])

        # Calculate the tf-idf score for each document
        scores = {doc_id: 0 for doc_id in docs}
        for term_id in term_idx:
            for doc_id, tfidf in self.inv_idx[term_id]:
                if doc_id in docs:
                    scores[doc_id] += tfidf / self.doc_len[doc_id]

        # Spread the scores between 0 and 1
        scores = {k: (v - min(scores.values())) / (max(scores.values()) - min(scores.values())) \
                    for k, v in scores.items()}

        # These scores are dependent on the user/query
        priority1 = self.df['matchedTags'] + self.df['matchedURL'] + self.df['placeDistance']
        # This scores are independent of the user/query
        priority2 = (self.df['placeInterest'] + self.df['averageWants'] + self.df['numEditors']) / 3

        for doc_id in docs:
            scores[doc_id] += priority1[doc_id] * np.exp(priority2[doc_id])

        scores = {k: (v - min(scores.values())) / (max(scores.values()) - min(scores.values())) \
                    for k, v in scores.items()}
        # Create list to heapify
        self.heap_order = [(-v, k) for k, v in scores.items()]
        heapq.heapify(self.heap_order)

        # Store teh filtered dataframe
        self.df = self.df.loc[docs][['placeName', 'placeDesc', 'placeURL']]
        self.df['score'] = [scores[doc_id] for doc_id in self.df.index]

    def search_new_score(self, k: int = 10):
        '''
        k: number of results to return
        '''

        top_k_docs = [i[1] for i in heapq.nsmallest(k, self.heap_order)]
        return self.df.loc[top_k_docs]

    def normalize_inv(self, column: pd.Series) -> pd.Series:
        minv, maxv = min(column), max(column)
        return (maxv - column) / (maxv - minv)

    def normalize(self, column: pd.Series) -> pd.Series:
        minv, maxv = min(column), max(column)
        return (column - minv) / (maxv - minv)


In [70]:
# Example query
q0 = 'American Museum' # I added 'yomama' to test

# Create a new search engine
nse = NewSearchEngine(df, inv_idx_tfidf, vocabulary, doc_len, q0)
nse.search_new_score(5)

Unnamed: 0,placeName,placeDesc,placeURL,score
133,The American Visionary Art Museum,"The art of farmers, postmen, the mentally ill,...",/places/the-american-visionary-art-museum-balt...,1.0
4432,American Prohibition Museum,When the 18th Amendment to the U.S. Constituti...,/places/american-prohibition-museum,0.9574
195,American Classic Arcade Museum,"Housed inside New Hampshire’s Funspot, which h...",/places/american-classic-arcade-museum,0.953753
2678,The American Kennel Club Museum of the Dog,At the intersection of the Venn diagram where ...,/places/the-american-kennel-club-museum-of-the...,0.932502
6932,American Museum of the House Cat,Cats have a regal bearing that seems to have f...,/places/american-museum-of-the-house-cat,0.915866


### Comparing the results of TF-IDF vs. New Score
Firstly, let's introduce how we calculate the new score. We define the new score of document $d$ given a query $q$ as

$$
\text{new-score}(d, q) = \text{tf-idf-score}(d, q) + |q \cap \text{URL}(d)|,
$$

where $\text{tf-idf-score}(d, q)$ is the normalized tf-idf score of document $d$ given query $q$, and $|q \cap \text{URL}(d)|$ is the number of words in the query that are also in the URL of the document. The top 5 documents for the query "American Museum" are shown below.

| Rank | New Score                                   | TF-IDF Score |
|------|---------------------------------------------|--------------|
| 1    | American Writers Museum	                 | AAF Tank Museum |
| 2    | American Computer Museum                    | Self-Taught Genius Gallery |
| 3    | The American Pigeon Museum                  | Glore Psychiatric Museum |
| 4    | American Banjo Museum                       | Indian Steps Museum |
| 5    | American Museum of Natural History          | Museum of the Weird |

We can see that the new score gives a better result than the TF-IDF score. The new score gives a higher score to documents that have the query words in the URL, which is a good indicator of the relevance of the document.

## 4. Visualizing the most relevant places

In [77]:
df['mapDesc'] = df.apply(lambda x:
    'Name: ' + x['placeName'] + '<br>' +
    'Adress: '+ x['placeAddress'] + '<br>' +
    'Number of people visited: ' + str(x['numPeopleVisited'])
    , axis=1)

In [78]:
import plotly.graph_objects as go

# Let's map the query results to a map using columns placeLat placeLong
def plot_map(q, k=10):

    # add other columns from df
    filtered = pd.merge(df, search_new_score(q, k=k), how='inner', left_index=True, right_index=True)

    fig = go.Figure(go.Scattermapbox(
        lat=filtered['placeLat'],
        lon=filtered['placeLong'],
        mode='markers',
        marker_color=filtered['new_score'],
        marker=go.scattermapbox.Marker(
            size=15,
            opacity=0.5,
            colorbar=go.scattermapbox.marker.ColorBar(
                title='Score'
            ) 
        ),
        # text=df['mapDesc'],
        hovertemplate=df['mapDesc'],
    ))

    fig.update_layout(
        mapbox_style="open-street-map",
        hovermode='closest',
        margin=dict(l=0, r=0, t=0, b=0)
        )

    fig.show()

In [79]:
# Example query
q0 = 'American Museum' # I added 'yomama' to test
plot_map(q0, k=30)

## 5. More complex search engine

In [14]:
import complex_engine as ComplexEngine
import pandas as pd
df = pd.read_pickle('data/misc/places2.pkl')

# Create a ComplexEngine instance
ce = ComplexEngine.ComplexEngine(df)

In [15]:
# Set query params
query = {
    'usernames': ['nick'],
    'tags': ['graffiti'],
    'address': 'prague',
}

# Get the results
ce.search(query)

Unnamed: 0,placeName,placeURL,placeEditors
1436,Squat Milada,/places/squat-milada,"[Blindcolour, Sebastian Wortys, Molly McBride ..."
3762,Lennon Wall,/places/lennon-wall,"[muzeumlennon, Steven Vacher, Giorgio, spirit3..."
4897,R2-D2 of Prague,/places/r2d2-of-prague,"[hrnick, Mathias Van de Velde, usarepublican, ..."


## 6. Command line question

In [16]:
# read df from pickle
df = pd.read_pickle('data/misc/places2.pkl')

# Remove all tabs (\t) and newlines (\n) from the text
df['placeDesc'] = df['placeDesc'].apply(lambda x: x.replace('\t', ' ').replace('\n', ' '))
df.to_csv('df.tsv', sep='\t', index=False)

**How many places in Italy, Spain, France, England and United States are there in our dataset?**

In [17]:
%%bash

for country in Italy Spain France England;
do
    echo "The number of places in $country:"
    awk -F '\t' '$8 ~ /'$country'/{c++} END{print c}' df.tsv
done

echo "The number of places in United States:"
awk -F '\t' '$8 ~ /United States/{c++} END{print c}' df.tsv

The number of places in Italy:
210
The number of places in Spain:
92
The number of places in France:
206
The number of places in England:
367
The number of places in United States:
4613


**The average numer of visitors in places in Italy, Spain, France, England and United States**

In [18]:
%%bash

countries=(Italy Spain France England)

for country in "${countries[@]}";
do
    echo "The average number visitors for $country"
    awk -F '\t' '$8 ~ /'$country'/{total += $3; count++} END{print total/count}' df.tsv
done

echo "The average number of visitors for United States"
awk -F '\t' '$8 ~ /United States/{total += $3; count++} END{print total/count}' df.tsv

The average number visitors for Italy
384.352
The average number visitors for Spain
446.424
The average number visitors for France
426.146
The average number visitors for England
476.659
The average number of visitors for United States
437.599


**The number of people who wants to visit the places in Italy, Spain, France, England and United States**

In [19]:
%%bash

countries=(Italy Spain France England)

for country in "${countries[@]}";
do
    echo "The number of people who want to visit $country"
    awk -F '\t' '$8 ~ /'$country'/{total += $4; count++} END{print total}' df.tsv
done

echo "The number of people who want to visit United States"
awk -F '\t' '$8 ~ /United States/{total += $4; count++} END{print total}' df.tsv

The number of people who want to visit Italy
182975
The number of people who want to visit Spain
72037
The number of people who want to visit France
205332
The number of people who want to visit England
389860
The number of people who want to visit United States
4350222


## 7. Theoretical question

You can look at the point 7 opening the notebook 'exer7.ipynb' in the repository.