# 1. Data collection

## 1.1. Get the list of master's degree courses
We created a file named 'urls.txt' that contains the URL of every master's course page.
To do this, we iterate over the first 400 listing pages and collect the 15 course links on each page.
All the URLs are stored in the 'urls.txt' file, one per line.
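
One detail worth checking when building the file is whether each `href` is relative or already absolute. A minimal sketch, assuming the links may come in either form (the sample paths below are made up for illustration), uses `urllib.parse.urljoin` so that the result never contains a doubled slash:

```python
from urllib.parse import urljoin

BASE_URL = "https://www.findamasters.com/"

def to_absolute(href):
    """Resolve a link taken from a listing page against the site root."""
    return urljoin(BASE_URL, href)

# Hypothetical hrefs, for illustration only
samples = [
    "/masters-degrees/course/example-msc/",
    "masters-degrees/course/example-msc/",
    "https://www.findamasters.com/masters-degrees/course/example-msc/",
]
absolute = [to_absolute(h) for h in samples]
```

All three sample forms resolve to the same absolute URL, which plain string concatenation would not guarantee.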

In [1]:
from bs4 import BeautifulSoup
import requests
import re
import os
from urllib.request import urlopen
import time
from concurrent.futures import ThreadPoolExecutor
import csv
import pandas as pd

In [None]:
with open("urls.txt", "w") as f:  # create the text file that will hold the URLs ("w" = write mode)
    for i in range(1, 401):  # from the first listing page to page 400
        url = f"https://www.findamasters.com/masters-degrees/msc-degrees/?PG={i}"  # the number after PG selects the page
        result = requests.get(url)  # as we have done in class
        soup = BeautifulSoup(result.text, 'html.parser')  # parse the HTML of the listing page

        for link in soup.find_all(class_=re.compile('courseLink')):  # every tag whose class matches courseLink
            c = link.get("href")  # URL of one course page found on the i-th listing page
            f.write("https://www.findamasters.com/" + c)  # write one URL per line
            f.write("\n")
print('The "urls.txt" file is generated!')

## 1.2. Crawl master's degree pages

We wrote a function named 'download_url'.
Since the FindAMasters website blocks us for about 70 seconds after every 20 to 22 requests, we call 'time.sleep(70)' when a request fails and then resend the HTTP GET request.
We also skip every page whose folder already exists, so nothing is downloaded twice.
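
The wait-and-retry idea can also be written as a loop with a cap on the number of attempts, which avoids unbounded recursion if a page never becomes available. This is only a sketch: `fetch_with_retry`, `max_attempts` and the injected `fetch` and `sleep` callables are hypothetical names introduced here so that the logic can be tried without touching the network.

```python
import time

def fetch_with_retry(fetch, url, wait_seconds=70, max_attempts=3, sleep=time.sleep):
    """Call fetch(url); on failure wait and try again, up to max_attempts times."""
    last_error = None
    for attempt in range(1, max_attempts + 1):
        try:
            return fetch(url)
        except Exception as error:  # in the notebook: requests.exceptions.RequestException
            last_error = error
            if attempt < max_attempts:
                sleep(wait_seconds)
    raise RuntimeError(f"giving up on {url} after {max_attempts} attempts") from last_error

# Offline demonstration: a fake fetch that fails twice, then succeeds
calls = {"count": 0}

def flaky_fetch(url):
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("blocked")
    return f"<html>{url}</html>"

waits = []  # records the requested waits instead of actually sleeping
html = fetch_with_retry(flaky_fetch, "https://example.org/page", sleep=waits.append)
```

In a real run the default `sleep=time.sleep` would be kept, and `fetch` could wrap `requests.get` followed by `raise_for_status()`.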

To send HTTP GET requests asynchronously, one option is to use async/await together with the "aiohttp" library; the other is to use ThreadPoolExecutor.
With the latter, each call to 'executor.submit' returns a future, and we keep these futures in a dictionary named 'future_to_url' so that their results can be retrieved later.
ThreadPoolExecutor is a class from Python's standard 'concurrent.futures' module that manages a pool of worker threads. Tasks submitted to the pool are executed by one of its worker threads.
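
To illustrate how the futures are used, here is a small offline sketch. `fake_download` is a stand-in introduced for this example (it performs no network access), and `max_workers=2` is an arbitrary choice:

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def fake_download(url):
    """Stand-in for a real download; returns the length of the URL string."""
    return len(url)

urls = ["https://example.org/a", "https://example.org/bb", "https://example.org/ccc"]

results = {}
with ThreadPoolExecutor(max_workers=2) as executor:
    # submit() returns a Future immediately; the dict remembers which URL it belongs to
    future_to_url = {executor.submit(fake_download, url): url for url in urls}
    for future in as_completed(future_to_url):
        url = future_to_url[future]
        results[url] = future.result()  # re-raises any exception raised in the worker
```

Calling `future.result()` is what surfaces errors from the worker threads; if the futures are never inspected, exceptions raised inside a task go unnoticed.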

In [2]:
from concurrent.futures import ThreadPoolExecutor

# Function to download and save HTML for a given URL
def download_url(url, folder_path, page_number):
    # Skip the page if its folder already exists (it was downloaded in an earlier run)
    page_folder = os.path.join(folder_path, f"page_{page_number}")
    if os.path.exists(page_folder):
        # Uncomment the line below to see which pages are skipped because they were already downloaded.
        # print(f"Skipping Page: {page_number} - Folder already exists.")
        return

    try:
        response = requests.get(url) # Send a GET request to the URL
        response.raise_for_status()  # Raise an exception for bad responses 

        # Create a folder for each page if it doesn't exist
        os.makedirs(page_folder, exist_ok=True)

        # Save the HTML content to a file
        file_path = os.path.join(page_folder, f"html_{page_number}.html")
        with open(file_path, 'w', encoding='utf-8') as file:
            file.write(response.text)
        print(f"Downloaded page {page_number}: {url}")
    except requests.exceptions.RequestException as e:
        print(f"Failed to download page {page_number}: {url}")
        print(f"Error: {e}")
        print("Retrying in 70 seconds...")
        time.sleep(70)  # Wait 70 seconds before retrying
        download_url(url, folder_path, page_number)  # Retry the download

# Read all URLs one by one
with open('urls.txt', 'r') as urls_file:
    urls = urls_file.read().splitlines()

output_folder = 'HTML_folders' # Store all HTML files into this directory.

# ThreadPoolExecutor could send the HTTP requests concurrently.
# However, since the FindAMasters website blocks us for about 70 seconds after every 20 to 22 requests,
# max_workers is set to 1 below, so the requests are sent one after another.
with ThreadPoolExecutor(max_workers=1) as executor:
    # Enumerate through each URL and submit download tasks to the executor
    future_to_url = {executor.submit(download_url, url, output_folder, page_number): url for page_number, url in enumerate(urls, start=1)}

print("All HTML files are stored in the HTML_folders directory.")

All HTML files are stored in the HTML_folders directory.


## 1.3. Parse downloaded pages
For each HTML file we create a '.tsv' file with the following columns.

1. Course Name (to save as ```courseName```): string;
2. University (to save as ```universityName```): string;
3. Faculty (to save as ```facultyName```): string;
4. Full or Part Time (to save as ```isItFullTime```): string;
5. Short Description (to save as ```description```): string;
6. Start Date (to save as ```startDate```): string;
7. Fees (to save as ```fees```): string;
8. Modality (to save as ```modality```): string;
9. Duration (to save as ```duration```): string;
10. City (to save as ```city```): string;
11. Country (to save as ```country```): string;
12. Presence or online modality (to save as ```administration```): string;
13. Link to the page (to save as ```url```): string.

Then, we merge all those files together to generate our final dataset.
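
The write-then-merge step can be sketched without `os.chdir` or absolute paths by using `pathlib`. Everything below is an illustrative assumption rather than the notebook's own code: the helper names `write_page_tsv` and `merge_tsvs`, the shortened column list, and the made-up rows written into a temporary directory.

```python
import csv
import tempfile
from pathlib import Path

COLUMNS = ['courseName', 'universityName', 'url']  # shortened column list for the demo

def write_page_tsv(page_dir, row):
    """Write a header line and one data row into <page_dir>/<page_dir.name>.tsv."""
    tsv_path = page_dir / f"{page_dir.name}.tsv"
    with open(tsv_path, 'w', encoding='utf-8', newline='') as handle:
        writer = csv.writer(handle, delimiter='\t')
        writer.writerow(COLUMNS)
        writer.writerow(row)
    return tsv_path

def merge_tsvs(root):
    """Collect the data rows of every page_*/ TSV file under root, in page order."""
    rows = []
    page_dirs = sorted(root.glob('page_*'), key=lambda p: int(p.name.split('_')[1]))
    for page_dir in page_dirs:
        for tsv_path in page_dir.glob('*.tsv'):
            with open(tsv_path, encoding='utf-8', newline='') as handle:
                rows.extend(csv.DictReader(handle, delimiter='\t'))
    return rows

# Demo in a temporary directory with made-up rows
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    for number, name in [(1, 'Course A'), (2, 'Course B'), (10, 'Course C')]:
        page_dir = root / f"page_{number}"
        page_dir.mkdir()
        write_page_tsv(page_dir, [name, 'Some University', f"https://example.org/{number}"])
    merged = merge_tsvs(root)
```

Sorting on the numeric suffix keeps `page_10` after `page_2`, which a plain alphabetical sort would not.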

In [5]:
current_path = os.getcwd()
# '/Users/armanfeili/Arman/Sapienza Courses/ADM/Homeworks/HW3/phase-2/ADM-HW3/HTML_folders'

for i in range(1,6001):
    # os.chdir(r'C:\Users\susan\Documents\DS\ADM\HW3\ADM-HW3\HTML_folders\page_'+str(i)) #change directories
    os.chdir(r'/Users/armanfeili/Arman/Sapienza Courses/ADM/Homeworks/HW3/phase-3/ADM-HW3/HTML_folders/page_'+str(i)) #change directories
    
    for filename in os.listdir(os.getcwd()): # list all the files in the page folder
        if filename.endswith(".tsv"): continue # skip .tsv files generated by a previous run
        elif filename.endswith(".html"): # only parse .html files
            with open(os.path.join(os.getcwd(), filename), 'r',encoding='utf-8') as f: # open the HTML file
                soup = BeautifulSoup(f,'html.parser') # parse the HTML file
                out=[] # list collecting all the fields parsed from this HTML file

                # 1  Course Name
                courseName = soup.find_all(class_=re.compile("course-header__course-title"))
                out.append(courseName[0].text.strip() if courseName else "") # strip() removes the surrounding whitespace characters
                # 2  University
                universityName = soup.find_all(class_=re.compile("course-header__institution"))
                out.append(universityName[0].text if universityName else "")
                # 3  Faculty
                facultyName = soup.find_all(class_=re.compile("course-header__department"))
                out.append(facultyName[0].text if facultyName else "")
                # 4  Full or Part Time
                isItFullTime = soup.find_all(class_=re.compile("concealLink"))
                out.append(isItFullTime[0].text if isItFullTime else "")
                # 5  Short Description
                description = soup.find_all(class_=re.compile("course-sections__content"))
                out.append(description[0].text.replace('\n', '') if description else "")
                # 6  Start Date
                startDate = soup.find_all(class_=re.compile("key-info__start-date"))
                out.append(startDate[0].text if startDate else "")
                # 7  Fees 
                fees_elements = soup.find_all(class_=re.compile("course-sections__fees")) # taking the fee
                fees_text = fees_elements[0].text.replace('\n', '') if fees_elements else "" 
                cleaned_fees = re.sub(r'Fees', '', fees_text)  # remove the "Fees" label from the text
                out.append(cleaned_fees.strip() if cleaned_fees else "")
                # 8  Modality
                modality = soup.find_all(class_=re.compile("key-info__qualification"))
                out.append(modality[0].text if modality else "")
                # 9  Duration
                duration = soup.find_all(class_=re.compile("key-info__duration"))
                out.append(duration[0].text if duration else "")
                # 10  City
                city = soup.find_all(class_=re.compile("course-data__city"))
                out.append(city[0].text if city else "")
                # 11  Country
                country = soup.find_all(class_=re.compile("course-data__country"))
                out.append(country[0].text if country else "")
                # 12  Presence or online modality
                # Some courses offer both the online and the on-campus modality, for example "Master of Business Administration"
                on_campus_elements = soup.find_all(class_=re.compile("course-data__on-campus"))
                online_elements = soup.find_all(class_=re.compile("course-data__online"))
                if on_campus_elements and online_elements:
                    out.append("both")
                else:
                    out.append(on_campus_elements[0].text if on_campus_elements else online_elements[0].text if online_elements else "Nan")
                # 13  Link to the page
                out.append(soup.find('link', {'rel': 'canonical'}).get('href') if soup.find('link', {'rel': 'canonical'}) else "Nan")
                
                # Creating file .tsv
                l = ['courseName','universityName','facultyName','isItFullTime','description','startDate','fees','modality','duration',
                    'city','country','administration','url']
                with open(filename+'.tsv','w',encoding='utf-8') as tsv:
                    tsv_output = csv.writer(tsv, delimiter='\t')
                    tsv_output.writerow(l)
                    tsv_output.writerow(out)
    os.chdir('..')  

print("All HTML files have been read and all .tsv files have been generated.")

All HTML files have been read and all .tsv files have been generated.


In [6]:
data=[]
# to merge all the .tsv files
for i in range(1,6001):
    # os.chdir(r'./HTML_folders/page_'+str(i)) #change directories
    os.chdir(r'/Users/armanfeili/Arman/Sapienza Courses/ADM/Homeworks/HW3/phase-3/ADM-HW3/HTML_folders/page_'+str(i)) #change directories
    for filename in os.listdir(os.getcwd()):
        if filename.endswith(".tsv"):
            a = pd.read_csv(filename,sep='\t')
            data.append(a)
    os.chdir('..')
data=pd.concat(data,ignore_index=True)   
data.to_csv('../dataset.tsv',sep='\t',index=False) # Save the merged dataset
print("dataset.tsv file has been generated as the main dataset.")

dataset.tsv file has been generated as the main dataset.


In [7]:
# A preview of the dataset:
data.head(5)

Unnamed: 0,courseName,universityName,facultyName,isItFullTime,description,startDate,fees,modality,duration,city,country,administration,url
0,3D Design for Virtual Environments - MSc,Glasgow Caledonian University,School of Engineering and Built Environment,Full time,3D visualisation and animation play a role in ...,September,Please see the university website for further ...,MSc,1 year full-time,Glasgow,United Kingdom,On Campus,https://www.findamasters.com/masters-degrees/c...
1,Accounting and Finance - MSc,University of Leeds,Leeds University Business School,Full time,Businesses and governments rely on sound finan...,September,"UK: £18,000 (Total) International: £34,750 (To...",MSc,1 year full time,Leeds,United Kingdom,On Campus,https://www.findamasters.com/masters-degrees/c...
2,"Accounting, Accountability & Financial Managem...",King’s College London,King’s Business School,Full time,"Our Accounting, Accountability & Financial Man...",September,Please see the university website for further ...,MSc,1 year FT,London,United Kingdom,On Campus,https://www.findamasters.com/masters-degrees/c...
3,"Accounting, Financial Management and Digital B...",University of Reading,Henley Business School,Full time,Embark on a professional accounting career wit...,September,Please see the university website for further ...,MSc,1 year full time,Reading,United Kingdom,On Campus,https://www.findamasters.com/masters-degrees/c...
4,Addictions MSc,King’s College London,"Institute of Psychiatry, Psychology and Neuros...",Full time,Join us for an online session for prospective ...,September,Please see the university website for further ...,MSc,One year FT,London,United Kingdom,On Campus,https://www.findamasters.com/masters-degrees/c...


In [None]:
data = pd.read_table(r"C:\Users\susan\Documents\DS\ADM\HW3\ADM-HW3\HTML_folders\Dataset.tsv")

# 2. Search Engine
## 2.0. Preprocessing

### 2.0.0)  Preprocessing the text

1. Removing stopwords
2. Removing punctuation
3. Stemming
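
The three steps can be sketched as one small function. This is a simplified stand-in, not the notebook's pipeline: it tokenizes with a regular expression instead of NLTK, uses a tiny hand-written stop-word set (`DEMO_STOPWORDS`), and takes the stemmer as a parameter that defaults to the identity. Passing `PorterStemmer().stem` as `stem` would apply NLTK's Porter stemmer.

```python
import re

# Tiny stop-word list for illustration; the notebook uses NLTK's English list
DEMO_STOPWORDS = {"our", "the", "a", "and", "to", "of", "in", "you"}

def preprocess(text, stopwords=DEMO_STOPWORDS, stem=lambda word: word):
    """Tokenize, drop punctuation and stop words, then apply the stemmer."""
    tokens = re.findall(r"[A-Za-z0-9]+", text)  # keeps only alphanumeric runs
    kept = [t.lower() for t in tokens if t.lower() not in stopwords]
    return [stem(t) for t in kept]

tokens = preprocess("Our MSc Economics allows you to apply economic theory.")
```

Lower-casing before the stop-word test means that capitalised stop words such as "Our" are removed as well.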


In [None]:
# Import the libraries used for text cleaning
import nltk
from nltk.stem import PorterStemmer
from nltk.corpus import stopwords
nltk.download('stopwords')
nltk.download('punkt')
stemmer = PorterStemmer()
lst_stopwords = stopwords.words('english')


[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\susan\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\susan\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [None]:
# The idea is to create a new column with the cleaned description.
# As we have done in class, we use apply with a lambda function: each description is tokenized, tokens that are stop words
# or not alphanumeric (which removes the punctuation) are dropped, and the remaining tokens are stemmed.
# Note: the stop-word test is case-sensitive, so capitalised stop words such as "Our" are kept.
data['description_clean'] = data.description.apply(lambda row: [stemmer.stem(word) for word in nltk.word_tokenize(row) if word not in lst_stopwords and word.isalnum()])
data["description_clean"].head(5) # to check if it worked

0    [3d, visualis, anim, play, role, mani, area, p...
1    [busi, govern, reli, sound, financi, knowledg,...
2    [our, account, account, financi, manag, msc, c...
3    [embark, profession, account, career, academ, ...
4    [join, us, onlin, session, prospect, student, ...
Name: description_clean, dtype: object

### 2.0.1) Preprocessing the fees column

In [None]:
# Known limitation: rows 6 and 7 ("18.000 €") are not parsed, because the regex below
# only matches a currency symbol that comes BEFORE the number.

# For the fees column we want to:
# - pick the highest fee
# - keep only the number and the currency
# - convert everything to a single currency (the plan is to use ChatGPT for this)

# The dataset writes thousands separators in several ways ("," or "."), so we strip both
# commas and points before converting the number to float.
# Note that this also removes a decimal point, e.g. "18000.50" becomes 1800050.0.
def float_correct(value):
    try:
        # Remove commas and points, then convert to float
        return float(value.replace(',', '').replace('.', ''))
    except ValueError:
        return None

def extract_max_fee(text):
    # Find every (currency, amount) pair where the currency precedes the amount
    matches = re.findall(r'(?i)(GBP|USD|ISK|£|\$|₹|¥|₪|₽|₩|₦|₴|﷼|€|Euro)\s*[:,]?\s*([\d.]+(?:,\d{3})*(?:\.\d{1,2})?|\d+(?:\.\d{1,2})?)', str(text))
    # No match at all
    if not matches:
        return (None, None)
    # Drop amounts that could not be converted, so that max() never compares None with a float
    fees = [(currency, float_correct(fee)) for currency, fee in matches]
    fees = [pair for pair in fees if pair[1] is not None]
    return max(fees, key=lambda x: x[1], default=(None, None))

# We apply the function and create 2 new columns
data['fees_currency_clean'], data['fees_number_clean'] = zip(*data['fees'].apply(extract_max_fee))

# Just to see if it works
data[['fees', 'fees_currency_clean', 'fees_number_clean']].iloc[:20]
# We have a lot of NaN values...

Unnamed: 0,fees,fees_currency_clean,fees_number_clean
0,Please see the university website for further ...,,
1,"UK: £18,000 (Total) International: £34,750 (To...",£,34750.0
2,Please see the university website for further ...,,
3,Please see the university website for further ...,,
4,Please see the university website for further ...,,
5,"UK: £13,750 (Total) International: £31,000 (To...",£,31000.0
6,18.000 €,,
7,18.000 €,,
8,Please see the university website for further ...,,
9,Tuition fee per year (non-EU/EEA students): 15...,,
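
Rows such as "18.000 €" stay empty because the currency follows the amount. One possible way to cover both orders is sketched below. It is an assumption-laden variant, not the code that produced the table above: the currency list is shortened, `parse_amount` is a hypothetical helper, and a separator followed by exactly three digits is assumed to be a thousands separator.

```python
import re

CURRENCY = r"GBP|USD|£|\$|€|Euro"  # shortened list, for illustration
AMOUNT = r"\d{1,3}(?:[.,]\d{3})+(?:[.,]\d{1,2})?|\d+(?:[.,]\d{1,2})?"
# Currency before the amount ("£34,750") or after it ("18.000 €")
PATTERN = re.compile(
    rf"(?P<pre>{CURRENCY})\s*[:,]?\s*(?P<a1>{AMOUNT})|(?P<a2>{AMOUNT})\s*(?P<post>{CURRENCY})",
    re.IGNORECASE,
)

def parse_amount(raw):
    """Turn '34,750', '18.000' or '1,234.50' into a float."""
    decimal = ""
    match = re.search(r"[.,](\d{1,2})$", raw)  # a trailing 1-2 digit group is the decimal part
    if match:
        decimal = "." + match.group(1)
        raw = raw[:match.start()]
    digits = re.sub(r"[.,]", "", raw)  # the remaining separators are thousands separators
    return float(digits + decimal)

def extract_max_fee(text):
    """Return (currency, amount) for the largest fee found, or (None, None)."""
    fees = []
    for m in PATTERN.finditer(str(text)):
        currency = m.group("pre") or m.group("post")
        amount = parse_amount(m.group("a1") or m.group("a2"))
        fees.append((currency, amount))
    return max(fees, key=lambda pair: pair[1], default=(None, None))

leading = extract_max_fee("UK: £18,000 (Total) International: £34,750 (Total)")
trailing = extract_max_fee("18.000 €")
missing = extract_max_fee("Please see the university website for further information")
```

Text that mixes both orders in one sentence can still be ambiguous, so the results of a pattern like this are worth spot-checking against the raw `fees` column.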


In [None]:
# Now that we have one column with the amount and one with the currency, the next step is to ask ChatGPT to convert all of them into a single currency.

## 2.1. Conjunctive query

### 2.1.1) Create your index!

In [None]:
from collections import Counter
from functools import reduce
import json

# First we build the vocabulary (word -> number of occurrences in the corpus)
vocabulary = Counter(reduce(lambda x, y: x + y, data.description_clean.values))

# Then we assign a unique integer id to every word
unique_index = {}
unique_id = 1
for word in list(vocabulary):
    unique_index[unique_id] = word
    unique_id += 1

# As asked, we save the mapping to a file named vocabulary.json (here: integer id -> word)
with open('vocabulary.json', 'w') as json_file:
    json.dump(unique_index, json_file)

# The inverted index maps each term id to the list of documents that contain the term
inverted = {}
# For every (id, word) pair we scan all the documents and collect the index of each document
# whose token list contains the word.
for j, w in unique_index.items():
    lista = []
    for tokens, idx in zip(data["description_clean"], data.index):
        if w in tokens:
            lista.append(idx)
    inverted[j] = lista  # term ids are unique, so each key is written exactly once

# Let's save the inverted index
with open('inverted.json', 'w') as json_file:
    json.dump(inverted, json_file)
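
The same index can also be built in a single pass over the documents, which avoids scanning the whole corpus once per vocabulary word. The sketch below is an alternative under stated assumptions, not the notebook's code: `build_index` is a hypothetical helper, the vocabulary maps word to term id (the reverse direction of `unique_index`), and the three toy documents are made up.

```python
from collections import defaultdict

def build_index(documents):
    """documents: list of token lists. Returns (vocabulary, inverted_index)."""
    vocabulary = {}               # word -> term id
    inverted = defaultdict(list)  # term id -> list of document ids
    for doc_id, tokens in enumerate(documents):
        for word in set(tokens):  # set(): each document is listed once per term
            term_id = vocabulary.setdefault(word, len(vocabulary) + 1)
            inverted[term_id].append(doc_id)
    return vocabulary, dict(inverted)

docs = [["advanc", "knowledg"], ["knowledg", "financ"], ["advanc", "financ", "knowledg"]]
vocabulary, inverted = build_index(docs)
```

Because documents are visited in order, every posting list comes out sorted by document id.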


### 2.1.2) Execute the query

In [None]:
# First we need to load in the memory the inverted index
with open('inverted.json') as d:
    dictionary = d.read()
inverted = json.loads(dictionary)    

In [None]:
# We define a function to preprocess the query
def preprocess_query(query):
    cleaned_query = [stemmer.stem(word) for word in nltk.word_tokenize(query) if not word in lst_stopwords and word.isalnum()]
    return cleaned_query

In [None]:
# Read the query from the input and preprocess it
query = input()  # sample input, as on GitHub: advanced knowledge
query = preprocess_query(query)

# Reverse mapping (word -> term id), so that each query word is looked up directly
word_to_id = {word: term_id for term_id, word in unique_index.items()}

l = []
for word in query:  # for every word in the query
    if word in word_to_id:
        # the keys of the index loaded from JSON are strings, hence str(...)
        l.append(inverted[str(word_to_id[word])])  # indexes of all the documents containing the word
    else:
        print('Sorry, no correspondence for word -->', word)

# We want the documents that contain all the matched words of the query, so we use the intersection
x = set.intersection(*map(set, l)) if l else set()
y = sorted(x)
# The goal is to show courseName, universityName, description and url.
# Note: position 10 is `country`; `url` is at position 12.
search = data.iloc[y, [0,1,4,10]]
search

Unnamed: 0,courseName,universityName,description,country
1,Accounting and Finance - MSc,University of Leeds,Businesses and governments rely on sound finan...,United Kingdom
4,Addictions MSc,King’s College London,Join us for an online session for prospective ...,United Kingdom
12,Analytical Toxicology MSc,King’s College London,The Analytical Toxicology MSc is a unique stud...,United Kingdom
48,Civil Engineering MSc,University of Greenwich,Meet the future demands of the construction in...,United Kingdom
86,Economics - MSc,University of Leeds,Our MSc Economics allows you to apply economic...,United Kingdom
96,Energy and Environment - MSc,University of Leeds,The sustainable use of energy is fundamental t...,United Kingdom
113,Executive MSc Strategic Marketing,King’s College London,Looking to develop your marketing strategy ski...,United Kingdom
115,Fashion Management,The New School,The online master’s in Fashion Management at P...,USA
129,Forensic Science MSc / MRes,King’s College London,The Forensic Science programme has a reputatio...,United Kingdom
148,"Healthcare Technologies MSc, MRes",King’s College London,The Healthcare Technologies MSc/MRes will trai...,United Kingdom
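
The conjunctive logic can be isolated in a small function and tried on a toy index. This is a sketch with assumptions: `conjunctive_search` is a hypothetical helper, the toy vocabulary maps word to term id, and, unlike the cell above, an unknown query term makes the whole result empty, which is the strict reading of a conjunctive (AND) query.

```python
def conjunctive_search(query_terms, vocabulary, inverted):
    """Return the sorted ids of the documents that contain every query term."""
    postings = []
    for term in query_terms:
        term_id = vocabulary.get(term)
        if term_id is None:
            return []  # an unknown term means no document can match
        postings.append(set(inverted[term_id]))
    if not postings:
        return []
    return sorted(set.intersection(*postings))

# Toy index: word -> term id, term id -> document ids
vocabulary = {"advanc": 1, "knowledg": 2, "financ": 3}
inverted = {1: [0, 2], 2: [0, 1, 2], 3: [1, 2]}

hits = conjunctive_search(["advanc", "knowledg"], vocabulary, inverted)
no_hits = conjunctive_search(["advanc", "unknown"], vocabulary, inverted)
```

The returned ids can then be passed to `data.iloc` to display the matching rows.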
