In [30]:
# Importing useful libraries
import asyncio
import os
import time
import warnings

import aiohttp
import pandas as pd
import requests
from aiohttp import ClientSession, ClientResponseError
from bs4 import BeautifulSoup

# Project modules
from parser import *
from crawler import *
from functions import *

pd.set_option("display.max_colwidth", None)
warnings.filterwarnings("ignore")

## 1.1. Get the list of master's degree courses

We start with the list of courses to include in your corpus of documents. In particular, we focus on web scraping the MSc Degrees. Next, we want you to collect the URL associated with each site in the list from the previously collected list. The list is long and split into many pages. Therefore, we ask you to retrieve only the URLs of the places listed in the first 400 pages (each page has 15 courses, so you will end up with 6000 unique master's degree URLs).

The output of this step is a .txt file whose single line corresponds to the master's URL.

---
Let's use BeautifulSoup and Requests to scrape the links related to universities present on the first 400 pages of the following website:
'https://www.findamasters.com/masters-degrees/msc-degrees'.

Note that it's possible to navigate to different pages by modifying the link in the final part, appending "/?PG=" + the page number.

We save one URL per line in a text file called 'course_links.txt'. In this case there are no exceptions to handle.

In [2]:
# create the file course_links.txt, which contains one course URL per line
path = 'course_links.txt'
parser(path)

'The file already exists.'
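As a rough sketch of this step (not the actual `parser` function from `parser.py`), the snippet below builds the paginated listing URLs and pulls course links out of one page's HTML. To stay dependency-free it uses the standard-library `HTMLParser` in place of BeautifulSoup, and it assumes that course links contain `/masters-degrees/course/`, as the URLs collected later do. The names `listing_url`, `CourseLinkCollector` and `extract_course_links` are hypothetical.

```python
from html.parser import HTMLParser

BASE = "https://www.findamasters.com/masters-degrees/msc-degrees"


def listing_url(page: int) -> str:
    """URL of a listing page; pagination uses the '/?PG=' suffix."""
    return f"{BASE}/?PG={page}"


class CourseLinkCollector(HTMLParser):
    """Collects hrefs of <a> tags that look like course pages (assumed pattern)."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href") or ""
        # Keep each course link once, in order of appearance
        if "/masters-degrees/course/" in href and href not in self.links:
            self.links.append(href)


def extract_course_links(html: str) -> list:
    collector = CourseLinkCollector()
    collector.feed(html)
    return collector.links


# Tiny made-up page fragment, only for illustration
sample = '<a href="/masters-degrees/course/x-msc/?i1">X</a><a href="/about">About</a>'
print(listing_url(2))
print(extract_course_links(sample))
```

In a full run, each of the 400 listing pages would be fetched with `requests.get(listing_url(page))` and the extracted links appended to `course_links.txt`, one per line.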

## 1.2. Crawl master's degree pages
Once you get all the URLs in the first 400 pages of the list, you:

1. Download the HTML corresponding to each of the collected URLs.
2. After you collect a single page, immediately save its HTML in a file. In this way, if your program stops for any reason, you will not lose the data collected up to the stopping point.
3. Organize the downloaded HTML pages into folders. Each folder will contain the HTML of the courses on page 1, page 2, ... of the list of master's programs.

**Tip**: Due to the large number of pages you should download, you can use some methods that can help you shorten the time. If you employed a particular process or approach, kindly describe it.



---
The proposed solution for this task involves creating 400 folders, one for each listing page scraped in the previous exercise. Each folder stores the HTML of the 15 course pages linked from that listing page.

The script is organized into three asynchronous functions:
- *get_info(url, session, folder, page_number)* performs an asynchronous HTTP GET request for one URL (one line of the txt file created in the previous exercise), writes the HTML page to the given folder, and returns it. It handles errors such as "Error 429: Too many requests to the website": when this happens, it waits 1 second with time.sleep before sending a new GET request.
- *process_batch(urls_session, folder)* takes 15 URLs as input and creates a list of asynchronous tasks, each fetching the HTML content of one URL with the *get_info* function defined above. It uses asyncio.gather to run all tasks concurrently and returns the results.
- *main(urls, batch_size, starting_folder)* creates the directory where all the HTML files are stored. It iterates over batches of URLs, creating a sub-folder for each batch and calling process_batch to asynchronously download and save the HTML content of every URL in the batch.


The Python script is designed for asynchronous tasks using the aiohttp library to fetch HTML content from a list of URLs concurrently. 

In this particular case, running many downloads at the same time is not very effective because all 6000 URLs are served by the same server, so we need to insert a time.sleep to handle "Too many requests" responses. As a consequence, the code takes several hours to complete. To mitigate this, we introduced the *starting_folder* parameter in the main function. This allows us to resume the download process from where it left off, avoiding the need to recreate files and folders from scratch each time.

In [3]:
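html_dir = "master_programs_html"  # assumed to be the folder main() writes to (same path used in 1.3)

if __name__ == "__main__":
    # Check for the downloaded pages, not for course_links.txt, which must exist to be read below
    if not os.path.exists(html_dir):
        with open(path, 'r') as file:
            urls = [line.strip() for line in file] # creating a list with the 6000 URLs from the lines of course_links.txt

        # top-level await works inside a Jupyter notebook cell
        result = await main(urls, starting_folder = 1) # starting_folder represents the folder we are starting from

        for text in result:
            pass # text contains your html (text) response
        print("Download and organization of HTML pages completed.")
    else:
        print('The files already exist.')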

The files already exist.
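The retry-on-429 and resume-from-folder ideas described above can be sketched as follows. This is a minimal illustration rather than the code in `crawler.py`: the HTTP call is injected as a hypothetical coroutine `fetch` so the example runs offline, and it swaps `time.sleep` for `asyncio.sleep`, which pauses only the current task instead of the whole event loop.

```python
import asyncio


async def fetch_with_retry(fetch, url, delay=1.0, max_tries=5):
    """Call `fetch(url)` until it stops answering 429 or tries run out.

    `fetch` is a hypothetical coroutine returning (status, html); in the
    notebook this role is played by an aiohttp GET request.
    """
    for _ in range(max_tries):
        status, html = await fetch(url)
        if status != 429:
            return html
        await asyncio.sleep(delay)  # back off without blocking other tasks
    return None  # give up after max_tries


def batches(urls, batch_size=15, starting_folder=1):
    """Yield (folder_number, urls_in_folder), skipping folders already done."""
    first = (starting_folder - 1) * batch_size
    for start in range(first, len(urls), batch_size):
        yield start // batch_size + 1, urls[start:start + batch_size]


async def demo():
    calls = {"n": 0}

    async def fake_fetch(url):  # first answer is 429, second is 200
        calls["n"] += 1
        return (429, "") if calls["n"] == 1 else (200, f"<html>{url}</html>")

    return await fetch_with_retry(fake_fetch, "course-1", delay=0.01)


result = asyncio.run(demo())
resumed = list(batches([f"u{i}" for i in range(31)], starting_folder=2))
print(result)
print([(folder, len(group)) for folder, group in resumed])
```

Inside a notebook cell, where an event loop is already running, `await demo()` would replace `asyncio.run(demo())`.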


## 1.3 Parse downloaded pages
At this point, you should have all the HTML documents about the master's degree of interest, and you can start to extract specific information. The list of the information we desire for each course and their format is as follows:

- Course Name (to save as courseName): string;
- University (to save as universityName): string;
- Faculty (to save as facultyName): string
- Full or Part Time (to save as isItFullTime): string;
- Short Description (to save as description): string;
- Start Date (to save as startDate): string;
- Fees (to save as fees): string;
- Modality (to save as modality):string;
- Duration (to save as duration):string;
- City (to save as city): string;
- Country (to save as country): string;
- Presence or online modality (to save as administration): string;
- Link to the page (to save as url): string.

For each master's degree, you create a course_i.tsv file of this structure:

        courseName \t universityName \t  ... \t url
If a piece of information is missing, you just leave it as an empty string.

---
First, let's create the empty dataframe with the variables described above.

In [4]:
# Name dataframe columns

columns = [
    "courseName",
    "universityName",
    "facultyName",
    "isItFullTime",
    "description",
    "startDate",
    "fees",
    "modality",
    "duration",
    "city",
    "country",
    "administration",
    "url"
]

# Create a dataframe with the specific columns above
df = pd.DataFrame(columns=columns)

# Display the DataFrame
df


Unnamed: 0,courseName,universityName,facultyName,isItFullTime,description,startDate,fees,modality,duration,city,country,administration,url


Now we need to populate the dataframe by opening every HTML page in every folder created in exercise 1.2.

For each page we look up specific elements in the HTML and handle exceptions: for example, if the *find_all* function finds no matching element, the variable is set to "".

For every row of the dataframe we also create a .tsv file, named after the master's degree, containing the corresponding values.

In [5]:
pth = "master_programs_html" # path of the html pages
path = 'course_links.txt'   # URLs path
fold = 0  # change this parameter to resume populating the df from a later folder without losing progress
if not os.path.exists('files_tsv'):
    df = populate_df(pth, path, fold, df)
else:
    print('Already satisfied. There are 21 corrupted links.')

Processing:   0%|          | 0/400 [00:00<?, ?it/s]

Processing: 100%|██████████| 400/400 [55:44<00:00,  8.36s/it] 
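Writing one course as a tab-separated line, with missing fields left as empty strings, could look like the sketch below. It is not the `populate_df` implementation; `record_to_tsv` is a hypothetical helper, and whether each file also carries a header row is left open here.

```python
import csv
import io

COLUMNS = ["courseName", "universityName", "facultyName", "isItFullTime",
           "description", "startDate", "fees", "modality", "duration",
           "city", "country", "administration", "url"]


def record_to_tsv(record: dict) -> str:
    """Serialize one course as a tab-separated line; missing fields become ''."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    writer.writerow([record.get(col) or "" for col in COLUMNS])
    return buffer.getvalue()


line = record_to_tsv({"courseName": "Addictions MSc", "city": "London"})
print(repr(line))
```

Using `csv.writer` rather than a plain `"\t".join(...)` means a description that itself contains a tab or a newline is quoted instead of silently breaking the column layout.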


Now let's check if we have considered all the 6000 rows.

In [7]:
df.shape[0]



And print the first and the last 10 rows of the dataframe created.

In [49]:
df.head(10)

Unnamed: 0,courseName,universityName,facultyName,isItFullTime,description,startDate,fees,modality,duration,city,country,administration,url
0,3D Design for Virtual Environments - MSc,Glasgow Caledonian University,School of Engineering and Built Environment,Full time,3D visualisation and animation play a role in ...,September,Please see the university website for further ...,MSc,1 year full-time,Glasgow,United Kingdom,On Campus,www.findamasters.com/masters-degrees/course/3d...
1,Accounting and Finance - MSc,University of Leeds,Leeds University Business School,Full time,Businesses and governments rely on sound finan...,September,"UK: £18,000 (Total)International: £34,750 (Total)",MSc,1 year full time,Leeds,United Kingdom,On Campus,www.findamasters.com/masters-degrees/course/ac...
2,Accounting and Finance (MSc),University of Bath,School of Management,Full time,Develop in-depth knowledge of accounting and f...,September,Please see the university website for further ...,MSc,1 year full-time,Bath,United Kingdom,On Campus,www.findamasters.com/masters-degrees/course/ac...
3,"Accounting, Accountability & Financial Managem...",King’s College London,King’s Business School,Full time,"Our Accounting, Accountability & Financial Man...",September,Please see the university website for further ...,MSc,1 year FT,London,United Kingdom,On Campus,www.findamasters.com/masters-degrees/course/ac...
4,"Accounting, Financial Management and Digital B...",University of Reading,Henley Business School,Full time,Embark on a professional accounting career wit...,September,Please see the university website for further ...,MSc,1 year full time,Reading,United Kingdom,On Campus,www.findamasters.com/masters-degrees/course/ac...
5,Addictions MSc,King’s College London,"Institute of Psychiatry, Psychology and Neuros...",Full time&Part time,Join us for an online session for prospective ...,September,Please see the university website for further ...,MSc,One year FT,London,United Kingdom,On Campus,www.findamasters.com/masters-degrees/course/ad...
6,Advanced Chemical Engineering - MSc,University of Leeds,School of Chemical and Process Engineering,Full time,The Advanced Chemical Engineering MSc at Leeds...,September,"UK: £13,750 (Total)International: £31,000 (Total)",MSc,1 year full time,Leeds,United Kingdom,On Campus,www.findamasters.com/masters-degrees/course/ad...
7,Advanced Physiotherapy Practice - MSc,Glasgow Caledonian University,School of Health and Life Sciences,Full time&Part time,Progress your career as a physiotherapist with...,"January, September",Please see the university website for further ...,MSc,1 Year Full Time / 2-3 Years Part Time,Glasgow,United Kingdom,On Campus,www.findamasters.com/masters-degrees/course/ad...
8,Agricultural Sciences - MSc (Agriculture and F...,University of Helsinki,International Masters Degree Programmes,Full time,Goal of the pro­grammeWould you like to be inv...,September,Tuition fee per year (non-EU/EEA students): 15...,MSc,2 years,Helsinki,Finland,On Campus,www.findamasters.com/masters-degrees/course/ag...
9,"Agricultural, Environmental and Resource Econo...",University of Helsinki,International Masters Degree Programmes,Full time,Goal of the pro­grammeAre you looking forward ...,September,Tuition fee per year (non-EU/EEA students): 15...,MSc,2 years,Helsinki,Finland,On Campus,www.findamasters.com/masters-degrees/course/ag...


In [50]:
df.tail(10)

Unnamed: 0,courseName,universityName,facultyName,isItFullTime,description,startDate,fees,modality,duration,city,country,administration,url
5990,Master's of Financial Technology (Fintech),Harbour.Space University,Masters Programmes,Full time,Harbour.Space's FinTech Master programmeis des...,"September, January","€29,900/year","MBA,MSc",1 Year,Barcelona,Spain,On Campus,www.findamasters.com/masters-degrees/course/ma...
5991,Master's of Front-end Development,Harbour.Space University,Masters Programmes,Full time,Front-end DevelopmentatHarbour.Space Universit...,"September, January","€29,900/year",MSc,1 year,Barcelona,Spain,On Campus,www.findamasters.com/masters-degrees/course/ma...
5992,Masters of Science in Business,Oregon State University,School of Business,Full time&Part time,Our Master of Science in Business (MSB) will g...,See Course,You can find more information here:Please see ...,"MSc,MBA",12 months,Corvallis,USA,Online,www.findamasters.com/masters-degrees/course/ma...
5993,"Masters of Science in Business, Supply Chain A...",Oregon State University,School of Business,Full time&Part time,Master of Science in Business (MSB)Our Master ...,See Course,You can find more information here:Please see ...,MSc,12 months,Corvallis,USA,Online,www.findamasters.com/masters-degrees/course/ma...
5994,"Masters Program in Climate Change, Agriculture...",University of Galway,Ryan Institute,Full time,The world’s climate is rapidly changing due to...,September,Please seePlease see the university website fo...,MSc,1 year,Galway,Ireland,On Campus,www.findamasters.com/masters-degrees/course/ma...
5995,Masters's in Digital Politics and Governance,European School of Political and Social Scienc...,Masters Programs,Full time,Digitalisation is a critical issue in today’s ...,See Course,Please see the university website for further ...,MSc,2 years,Lille,France,On Campus,www.findamasters.com/masters-degrees/course/ma...
5996,Material Culture & Artefact Studies - MSc/PgDip,University of Glasgow,College of Arts & Humanities,Full time&Part time,Material culture and artefact studies combines...,September,Please see the university website for further ...,"MSc,PGDip","9-12 months full-time, 18-24 months part-time",Glasgow,United Kingdom,On Campus,www.findamasters.com/masters-degrees/course/ma...
5997,Material Culture and Gàidhealtachd History MSc,University of the Highlands and Islands,"Arts, Humanities and Business",Part time,"This ground-breaking, internationally acclaime...",September,Please see the university website for further ...,"PGDip,PGCert,MSc",3-6 Years Part-time,Inverness,United Kingdom,Online,www.findamasters.com/masters-degrees/course/ma...
5998,Materials and Manufacturing,Jonkoping University,Masters Programmes,Full time,Numerical methods and knowledge about the rela...,See Course,Please see the university website for further ...,MSc,2 Years Full Time,Jonkoping,Sweden,On Campus,www.findamasters.com/masters-degrees/course/ma...
5999,Materials and Molecular Modelling MSc,University College London,Department of Chemistry,Full time,Register your interest in graduate study at UC...,September,"Full time - £14,100",MSc,1 year full time,London,United Kingdom,On Campus,www.findamasters.com/masters-degrees/course/ma...


# 2. Search Engine


Importing my engine.py file

Note: rather than describing here how each function imported from engine.py works, I refer the reader to engine.py itself, which is well commented and easy to read. Kindly check it out in parallel.

In [8]:
from engine import *

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\ricca\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\ricca\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


First, I will load all the 6000 tsv files into one final dataset.

In [9]:
directory = "/Users/ricca/Desktop/ADM/files_tsv"
output_file = "/Users/ricca/Desktop/ADM-HW3/unique_tsv_file.tsv"
dataset = create_store_tsv(directory, output_file)

Now, let's look at the first rows of the combined dataset.

In [10]:
dataset.head(5)

Unnamed: 0,courseName,universityName,facultyName,isItFullTime,description,startDate,fees,modality,duration,city,country,administration,url
0,3D Design for Virtual Environments - MSc,Glasgow Caledonian University,School of Engineering and Built Environment,Full time,3D visualisation and animation play a role in ...,September,Please see the university website for further ...,MSc,1 year full-time,Glasgow,United Kingdom,On Campus,www.findamasters.com/masters-degrees/course/3d...
1,"Agricultural, Environmental and Resource Econo...",University of Helsinki,International Masters Degree Programmes,Full time,Goal of the pro­grammeAre you looking forward ...,September,Tuition fee per year (non-EU/EEA students): 15...,MSc,2 years,Helsinki,Finland,On Campus,www.findamasters.com/masters-degrees/course/ag...
2,Energy Policy and Finance (MSc),University of St Andrews,Interdisciplinary Studies,Full time&Part time,The MSc in Energy Ethics explores how we balan...,September,"Home£13,470Overseas£27,230",MSc,"1 year full time, 2 year part time",St Andrews,United Kingdom,On Campus,www.findamasters.com/masters-degrees/course/en...
3,Aircraft Design option - Aerospace Vehicle Des...,Cranfield University,Aerospace,Full time,Study at Cranfield UniversityTo design modern ...,"October, March",Please see the university website for further ...,MSc,See website for details,Bedford,United Kingdom,On Campus,www.findamasters.com/masters-degrees/course/ai...
4,Aircraft Engineering (MSc),Cranfield University,Aerospace,Part time,"With a projected demand for 27,000 new civil a...",February,Please see the university website for further ...,"PGDip,PGCert,MSc",See website for details,Bedford,United Kingdom,On Campus,www.findamasters.com/masters-degrees/course/ai...


Before going on, to protect the original dataset, let's create a copy of it and delete the 21 rows whose corrupted links give us empty values for the variables. Furthermore, as we will be working with the description column, let's replace missing description values with `NA`.

In [11]:
df = dataset.copy()
print(df.shape)
df.description.isnull().sum() # outputs 21
df['description'] = df['description'].fillna("NA") # assign back instead of relying on inplace=True on a column selection
df = df.dropna(subset=['courseName'], axis=0)
print(df.shape)

(6000, 13)
(5979, 13)


## 2.0.0 Preprocessing the text
First, you must pre-process all the information collected for each MSc by:

  * Removing stopwords
  * Removing punctuation
  * Stemming
  * Anything else you think it's needed
---

Now that the dataset is loaded, I will apply some preprocessing to all columns. This preprocessing includes tokenizing, standardizing the text (lowercasing and removing stop words), and stemming the text of all columns.

In [12]:
desired_columns = [x for x in df.columns if x != "fees"]
for column in desired_columns:
    df[column+"_clean"] = df[column].apply(lambda x: preprocessing(x))

In [13]:
df.head()

Unnamed: 0,courseName,universityName,facultyName,isItFullTime,description,startDate,fees,modality,duration,city,...,facultyName_clean,isItFullTime_clean,description_clean,startDate_clean,modality_clean,duration_clean,city_clean,country_clean,administration_clean,url_clean
0,3D Design for Virtual Environments - MSc,Glasgow Caledonian University,School of Engineering and Built Environment,Full time,3D visualisation and animation play a role in ...,September,Please see the university website for further ...,MSc,1 year full-time,Glasgow,...,"[school, engin, built, environ]","[full, time]","[visualis, anim, play, role, mani, area, popul...",[septemb],[msc],"[year, full, time]",[glasgow],"[unit, kingdom]",[campu],"[www, findamast, com, master, degre, cours, de..."
1,"Agricultural, Environmental and Resource Econo...",University of Helsinki,International Masters Degree Programmes,Full time,Goal of the pro­grammeAre you looking forward ...,September,Tuition fee per year (non-EU/EEA students): 15...,MSc,2 years,Helsinki,...,"[intern, master, degre, programm]","[full, time]","[goal, pro, grammear, look, forward, futur, ex...",[septemb],[msc],[year],[helsinki],[finland],[campu],"[www, findamast, com, master, degre, cours, ag..."
2,Energy Policy and Finance (MSc),University of St Andrews,Interdisciplinary Studies,Full time&Part time,The MSc in Energy Ethics explores how we balan...,September,"Home£13,470Overseas£27,230",MSc,"1 year full time, 2 year part time",St Andrews,...,"[interdisciplinari, studi]","[full, time, part, time]","[msc, energi, ethic, explor, balanc, energi, d...",[septemb],[msc],"[year, full, time, year, part, time]","[st, andrew]","[unit, kingdom]",[campu],"[www, findamast, com, master, degre, cours, en..."
3,Aircraft Design option - Aerospace Vehicle Des...,Cranfield University,Aerospace,Full time,Study at Cranfield UniversityTo design modern ...,"October, March",Please see the university website for further ...,MSc,See website for details,Bedford,...,[aerospac],"[full, time]","[studi, cranfield, universityto, design, moder...","[octob, march]",[msc],"[see, websit, detail]",[bedford],"[unit, kingdom]",[campu],"[www, findamast, com, master, degre, cours, ai..."
4,Aircraft Engineering (MSc),Cranfield University,Aerospace,Part time,"With a projected demand for 27,000 new civil a...",February,Please see the university website for further ...,"PGDip,PGCert,MSc",See website for details,Bedford,...,[aerospac],"[part, time]","[project, demand, new, civil, airlin, industri...",[februari],"[pgdip, pgcert, msc]","[see, websit, detail]",[bedford],"[unit, kingdom]",[campu],"[www, findamast, com, master, degre, cours, ai..."
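The preprocessing applied above (the `preprocessing` function lives in `engine.py`) can be approximated by the dependency-free sketch below. The stopword set is a tiny illustrative stand-in for NLTK's English list, and stemming is a pluggable callable that defaults to the identity; the `.stem` method of an NLTK `PorterStemmer` could be passed in its place. `preprocess_sketch` is a hypothetical name.

```python
import re

# Tiny illustrative stopword list; the notebook downloads NLTK's full English list.
STOPWORDS = {"the", "a", "an", "and", "of", "in", "to", "for", "is", "with"}


def preprocess_sketch(text, stem=lambda word: word):
    """Lowercase, keep runs of letters only, drop stopwords, then stem."""
    if not isinstance(text, str):
        return []  # missing values (NaN/None) become an empty token list
    # [^\W\d_]+ matches runs of letters, so digits and punctuation act as separators
    tokens = re.findall(r"[^\W\d_]+", text.lower())
    return [stem(token) for token in tokens if token not in STOPWORDS]


print(preprocess_sketch("1 year full-time, School of Engineering"))
```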


### 2.0.1 Preprocessing the fees column
Moreover, we want the field `fees` to collect numeric information. As you will see, you scraped textual information for this attribute in the dataset: sketch whatever method you need (using regex, for example, to find currency symbol) to collect information and, in case of multiple information, retrieve only the highest fees. Finally, once you have collected numerical information, you likely will have different currencies: this can be chaotic, so let chatGPT guide you in the choice and deployment of an API to convert this column to a common currency of your choice (it can be USD, EUR or whatever you want). Ultimately, you will have a `float` column renamed `fees (CHOSEN COMMON CURRENCY)`.

-----

Now, it is time to do some preprocessing on the fees column. We have to ensure that all fees are in euros (our chosen common currency) and that the column has dtype float. So, let's do it!

First, I will replace all the entries that simply tell the reader to contact the university or visit its webpage. Then, I will check which currency symbols are used throughout the dataset so that I can design my regex accordingly.

In [14]:
if_website_or_contact_replace(df, "fees")
df["fees_clean"] = df["fees_clean"].apply(lambda x: "Not Available" if pd.isnull(x) else x)
symbols_list = extract_symbols(df)
symbols_list

['$', '£', '€']

Now, from all of the text in each row of the `fees` column, I will only keep the max amount and the currency name.

In [15]:
df["fees_clean"] = df["fees_clean"].apply(lambda x: extract_numeric_and_currency(x))
df.loc[:, ["fees","fees_clean"]]

Unnamed: 0,fees,fees_clean
0,Please see the university website for further ...,Not Available
1,Tuition fee per year (non-EU/EEA students): 15...,15000.0 EUR
2,"Home£13,470Overseas£27,230",27230.0 GBP
3,Please see the university website for further ...,Not Available
4,Please see the university website for further ...,Not Available
...,...,...
5995,Please see the university website for further ...,Not Available
5996,UK Fees: 2022/23 fees TBC*;2021/22 fees - 1040...,17900.0 EUR
5997,Please see the university website for further ...,Not Available
5998,Please see the university website for further ...,Not Available
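A regex-based extraction of the highest amount and its currency might look like the sketch below. It is not the `extract_numeric_and_currency` function from `engine.py`: it only recognises amounts written with one of the three symbols found above (`$`, `£`, `€`) placed before the number, and `max_fee_sketch` is a hypothetical name.

```python
import re

SYMBOL_TO_CODE = {"$": "USD", "£": "GBP", "€": "EUR"}
# A currency symbol, optional spaces, then digits with optional thousands commas and decimals
AMOUNT = re.compile(r"([$£€])\s*(\d[\d,]*(?:\.\d+)?)")


def max_fee_sketch(text):
    """Return (amount, currency_code) for the largest amount found, or None."""
    found = [(float(number.replace(",", "")), SYMBOL_TO_CODE[symbol])
             for symbol, number in AMOUNT.findall(text)]
    return max(found, default=None)


print(max_fee_sketch("Home£13,470Overseas£27,230"))
print(max_fee_sketch("Please see the university website"))
```

Note that the maximum is taken on the raw numbers, so a text mixing several currencies would need conversion before comparison.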


Now, I will create two more columns and use them to build a new column `FEE_EUR` where all the fees are in EUR.

In [16]:
# Apply extraction functions to create new columns
df['numeric_value'] = df['fees_clean'].apply(extract_numeric_value)
df['currency_name'] = df['fees_clean'].apply(extract_currency_name)

To convert all other currencies to EUR, I will use v6.exchangerate-api.com, as suggested by ChatGPT. I have already signed up and acquired my API key.

In [17]:
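# Read the key from an environment variable instead of hard-coding it in the notebook
# (EXCHANGERATE_API_KEY is a name chosen here for illustration, not one mandated by the API)
my_api_key = os.environ["EXCHANGERATE_API_KEY"]
url = f"https://v6.exchangerate-api.com/v6/{my_api_key}/latest/EUR" # to access the url with EUR rates
response = requests.get(url) # to get the latest eur to other currencies rates
response.raise_for_status() # stop early if the API call failed
data = response.json() # storing the data in the variable data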

# Apply the conversion using a lambda function with our pre-defined convert_to_eur function
df['FEE_EUR'] = df.apply(lambda row: convert_to_eur(row['numeric_value'], row['currency_name'], data), axis=1)

Finally, I will delete the extra columns. Namely: `fees_clean`, `numeric_value`, `currency_name`.

In [18]:
df.columns
columns_to_drop = ["fees_clean", "numeric_value", "currency_name"]
df = df.drop(columns=columns_to_drop)

In [19]:
# Check if the data type of FEE_EUR column is float:
df['FEE_EUR'].dtype

dtype('float64')
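The conversion to EUR above relies on `convert_to_eur` from `engine.py`, which is not shown here. A minimal sketch of the idea follows, assuming the API response stores EUR-based rates under a `conversion_rates` key (units of each currency per 1 EUR), so that dividing by the rate yields EUR. The rates in `fake_data` are made up for illustration, and `to_eur_sketch` is a hypothetical name.

```python
import math


def to_eur_sketch(amount, currency, data):
    """Convert `amount` in `currency` to EUR using EUR-based rates.

    Assumes data["conversion_rates"][code] is the number of units of
    `code` per 1 EUR, so dividing by the rate yields EUR.
    """
    rate = data.get("conversion_rates", {}).get(currency)
    if amount is None or rate is None:
        return math.nan  # keeps the column float even when a fee is missing
    return round(amount / rate, 2)


# Made-up rates for illustration only, not real market data.
fake_data = {"conversion_rates": {"EUR": 1.0, "GBP": 0.8, "USD": 1.25}}
print(to_eur_sketch(27230.0, "GBP", fake_data))
print(to_eur_sketch(None, None, fake_data))
```

Returning `nan` for unavailable fees is one way to end up with a `float64` column, as checked above.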

### 2.1. Conjunctive query

#### 2.1.1 Create your index!
Before building the index,

- Create a file named `vocabulary`, in the format you prefer, that maps each word to an integer (`term_id`).

In [22]:
vocabulary = set() # a set removes duplicates and makes insertion fast
# Only the description column is used, as stated in the HW
for row in df.description_clean:
    vocabulary.update(row)
vocabulary = list(vocabulary)

The variable vocabulary contains, in a list, all the words used in the description column. The question requires us to make a file named vocabulary that *maps each word to an integer (term_id).*

In [23]:
import json

# maps each word to an integer (term_id)
vocab_with_index = {word: term_id for term_id, word in enumerate(vocabulary)}

with open("./vocabulary.json", "w") as file: # storing for reusability
    json.dump(vocab_with_index, file)

In [24]:
with open("./vocabulary.json", "r") as file: # loading for use
    vocab_with_index = json.load(file)

Then, the first brick of your homework is to create the Inverted Index. It will be a dictionary in this format:

`{
term_id_1:[document_1, document_2, document_4],
term_id_2:[document_1, document_3, document_5, document_6],
...}`

To create the inverted index, I can use the procedure used in the lab as follows:

In [25]:
Terms = pd.DataFrame(data=vocab_with_index.keys(), columns=['Term']); tqdm.pandas() # from lab
Terms['reverse'] = Terms.Term.progress_apply(lambda item: list(df.loc[df.description_clean.apply(lambda row: item in row)].index))

# word_and_appearances helps with the search engine and the inverted index. Its keys are the words and its values are
# lists of the dataset rows (documents) in which each word occurs

word_and_appearances = dict(zip(Terms.Term, Terms.reverse)) 

# The inverted index can also be made as commented below: 
# inverted_index = dict(zip(Terms.index, Terms['reverse'])); however I will implement my own function
Terms.head()

100%|██████████| 9704/9704 [01:05<00:00, 147.46it/s]


Unnamed: 0,Term,reverse
0,ofbig,[1549]
1,agenda,"[56, 468, 638, 783, 1238, 1286, 2297, 2341, 23..."
2,annual,"[478, 1420, 1520, 1662, 2692, 2782, 3240, 3241..."
3,conductor,[889]
4,hormon,[640]


Or, I can use the function that I created to build the inverted index. Here's how:

In [26]:
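# Note: this rebinds the name `inverted_index` from the function to its result, so re-running this cell requires re-importing engine
inverted_index = inverted_index(vocab_with_index, df)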

I will now store this inverted index locally

In [27]:
with open("./inverted_index_1.json", "w") as file: # storing for reusability
    json.dump(inverted_index, file)
inverted_index = json.load(open("./inverted_index_1.json", "r")) # loading for use

### 2.1.2 Execute the query
Given a query input by the user, for example:

`advanced knowledge`

The Search Engine is supposed to return a list of documents.

---

For this task, I will first take the input from the user and then normalize the query using the same technique I used to preprocess the description column.

In [28]:
# user_input = input("Add your Input")
user_input = "data science"
normalized_query = normalize_query(user_input)

Instead of using the inverted index, which maps each term_id to the list of documents where the term appears, I will use the related index created above under the name word_and_appearances, which has the words themselves as keys and the lists of documents (dataframe rows) in which they occur as values.

In [29]:
new_dataset = search_engine(normalized_query, df, word_and_appearances).head()
new_dataset

Unnamed: 0,courseName,universityName,description,url
0,Digital Forensics and Cyber Investigation (wit...,Teesside University,With a clear focus on forensics and industry p...,www.findamasters.com/masters-degrees/course/di...
1,Digital Forensics and Cyber Investigation MSc,Teesside University,With a clear focus on forensics and industry p...,www.findamasters.com/masters-degrees/course/di...
2,Applied Analytical Chemistry MSc,University College London,Register your interest in graduate study at UC...,www.findamasters.com/masters-degrees/course/ap...
3,Applied Artificial Intelligence and Data Analy...,University of Bradford,*SCHOLARSHIPS AVAILABLE*This new MSc programme...,www.findamasters.com/masters-degrees/course/ap...
4,Environmental Data Science and Analytics - MSc,University of Leeds,As global discussions are increasingly focused...,www.findamasters.com/masters-degrees/course/en...
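The conjunctive (AND) lookup behind this result can be sketched as an intersection of posting lists. This is an illustration and not the `search_engine` function from `engine.py`; `postings` plays the role of `word_and_appearances` with made-up contents, and `conjunctive_search` is a hypothetical name.

```python
def conjunctive_search(query_terms, postings):
    """Return ids of documents containing every query term (AND semantics)."""
    lists = []
    for term in query_terms:
        if term not in postings:
            return []  # an unknown term means no document can match
        lists.append(set(postings[term]))
    if not lists:
        return []
    lists.sort(key=len)  # intersect starting from the rarest term
    return sorted(set.intersection(*lists))


postings = {"data": [0, 1, 4], "scienc": [0, 2, 4], "engin": [1]}
print(conjunctive_search(["data", "scienc"], postings))
print(conjunctive_search(["data", "unknown"], postings))
```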


## 2.2 Conjunctive query & Ranking score

For the second search engine, given a query, we want to get the top-k (the choice of k it's up to you!) documents related to the query. In particular:

Find all the documents that contain all the words in the query. Sort them by their similarity with the query.

Return in output k documents, or all the documents with non-zero similarity with the query when the results are less than k. You must use a heap data structure (you can use Python libraries) for maintaining the top-k documents.

To solve this task, you must use the tfIdf score and the Cosine similarity. The field to consider is still the description.

#### 2.2.1 Make an Inverted index

Below, I will build an inverted index whose keys are the unique term_ids of the words and whose values are the documents (dataset rows) in which each word occurs, together with the corresponding tf-idf values.

In [46]:
# Our inverted_index with tf-idf value as well.
inverted_index_tf_idf = inverted_index_tfidf(word_and_appearances, df)

In [47]:
with open("./inverted_index_tfidf.json", "w") as file: # storing for reusability
    json.dump(inverted_index_tf_idf, file) # dump the tf-idf index, not the plain inverted index
with open("./inverted_index_tfidf.json", "r") as file: # loading for use
    inverted_index_tf_idf = json.load(file)

### 2.2.2 Execute the query
In this new setting, given a query, you get the proper documents (i.e., those containing all the query's words) and sort them according to their similarity to the query. For this purpose, as the scoring function, we will use the Cosine Similarity concerning the tfIdf representations of the documents.

---

First, I will take the user input and normalize it.

In [48]:
user_input = input("Add your Input")
normalized_query = normalize_query(user_input)

Time to compute and get the TF-IDF scores for our query:

In [49]:
query_tfidf = calculate_query_tfidf(normalized_query, word_and_appearances, df)
query_tfidf

{'data': 1.04, 'scienc': 0.74}
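The exact weighting used by `calculate_query_tfidf` is defined in `engine.py` and not shown here. One common variant, given as an assumption rather than as the notebook's formula, takes tf as the term count divided by the number of tokens and idf as the natural log of the number of documents over the document frequency. `tfidf_sketch` and the toy `doc_freq` values are hypothetical.

```python
import math
from collections import Counter


def tfidf_sketch(tokens, doc_freq, n_docs):
    """tf-idf weights for one token list (a document or a query).

    Assumed formula: tf = count / len(tokens), idf = log(n_docs / df).
    `doc_freq` maps a term to the number of documents containing it;
    terms absent from `doc_freq` are skipped.
    """
    counts = Counter(tokens)
    return {
        term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
        for term, count in counts.items()
        if doc_freq.get(term)
    }


doc_freq = {"data": 2, "scienc": 1}
weights = tfidf_sketch(["data", "scienc", "data"], doc_freq, n_docs=4)
print(weights)
```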

I will now compute the intersection list of query and documents.

In [50]:
# posting list (rows where the word appears) for every query term
appearances = [word_and_appearances[word] for word in normalized_query]
# initialise the set with the posting list of the first query term
intersection_list = set(appearances[0])
for appearance in appearances[1:]: # intersect with the posting lists of the remaining terms
    intersection_list.intersection_update(appearance)
intersection_list = sorted(intersection_list)

I will now compute the similarity scores using cosine similarity with respect to the tf-idf scores. I will store these scores in a dictionary named similarity_scores, where each key is the document_id (row of the dataframe) and each value is the cosine similarity between that document and the query.

In [51]:
similarity_scores = {}
norm_query = np.linalg.norm(list(query_tfidf.values()))  # norm of the query vector (same for every document)
for document_i in intersection_list:
    document_vector = df.at[document_i, "description_clean"]  # preprocessed description at the row of interest
    # tf-idf of the document: words as keys, tf-idf scores as values
    document_tfidf = calculate_document_tfidf(document_vector, word_and_appearances, df)
    # dot product: only the words of the query contribute, all other query components are zero
    dot_product = 0
    for word in normalized_query:
        if word in document_tfidf:
            dot_product += query_tfidf[word] * document_tfidf[word]

    norm_doc_i = np.linalg.norm(list(document_tfidf.values()))  # norm of the document vector

    if norm_doc_i != 0 and norm_query != 0:  # skip degenerate (all-zero) vectors
        cosine_similarity_doc_i_query = dot_product / (norm_doc_i * norm_query)
        similarity_scores[document_i] = cosine_similarity_doc_i_query
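To make the computation concrete, here is a small self-contained sketch of cosine similarity on sparse vectors stored as dictionaries. The function name and the toy vectors are illustrative assumptions. As in the cell above, the dot product runs over the shared terms only, while each norm uses the full vector.

```python
import math

def cosine_similarity_sparse(query_vec, doc_vec):
    """Cosine similarity between two sparse vectors stored as {term: weight}."""
    dot = sum(w * doc_vec[t] for t, w in query_vec.items() if t in doc_vec)
    norm_q = math.sqrt(sum(w * w for w in query_vec.values()))
    norm_d = math.sqrt(sum(w * w for w in doc_vec.values()))
    if norm_q == 0 or norm_d == 0:
        return 0.0  # similarity is undefined for an all-zero vector
    return dot / (norm_q * norm_d)

query = {"data": 1.0, "scienc": 1.0}
focused_doc = {"data": 2.0, "scienc": 2.0}               # only talks about the query terms
broad_doc = {"data": 2.0, "scienc": 2.0, "health": 4.0}  # also covers another topic

focused_score = cosine_similarity_sparse(query, focused_doc)
broad_score = cosine_similarity_sparse(query, broad_doc)
```

The focused document points in the same direction as the query and gets the maximum score, while the extra weight on `health` enlarges the norm of the broad document and lowers its score.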

Using a heap data structure to retrieve the 10 documents with the highest similarity scores:

In [52]:
top_10_documents = heapq.nlargest(10, similarity_scores, key=similarity_scores.get)
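As a quick illustration of this call on made-up scores: iterating over a dictionary yields its keys, and `key=...get` ranks them by their values, so `heapq.nlargest` returns the document ids in decreasing order of score. The ids and scores below are invented for the example.

```python
import heapq

# Hypothetical {document_id: cosine similarity} mapping
toy_scores = {10: 0.71, 11: 0.12, 12: 0.63, 13: 0.55, 14: 0.08}

# Keep the 3 ids with the highest scores
top_3 = heapq.nlargest(3, toy_scores, key=toy_scores.get)
```

Internally `nlargest` maintains a heap of at most `k` elements, so it does not need to sort the whole collection.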

Finally, I return the results of the query search.

In [53]:
results = return_results(top_10_documents, similarity_scores, df)
results

Unnamed: 0,courseName,universityName,description,url,Cosine_Similarity
0,"Data Science and its Applications, MSc",University of Greenwich,"Our MSc degree in Data Science and its Applications is designed to provide you with a solid grounding in data science theory and practice.Enter the exciting world of data science whatever your academic background! Our MSc in Data Science and its Applications course has been designed to increase the skilled workforce and diversity in qualified data science experts in the UK market. The programme is tailored to transform professionals from a wide range of backgrounds into accomplished data scientists who are well-placed to enhance their existing careers with an expansive set of data science skills, to move fully into data science roles, or to pursue further data science specialisations.",www.findamasters.com/masters-degrees/course/data-science-and-its-applications-msc/?i309d6382c70560,0.706323
1,Data Science - Master of Science (MS),University of Colorado Boulder,"The on-campus Master of Science in Data Science program focuses on developing knowledge and skills in interdisciplinary and collaborative data science competencies including statistical analysis, data structures and algorithms, data mining, machine learning, big data architecture and data visualization.Data science is a multidisciplinary field that focuses on the extraction of knowledge and insight from large datasets. Data scientists are tasked with using a range of skills in applied mathematics, statistics and computer science, and in domain applications such as information science, geography, business, media and the humanities.",www.findamasters.com/masters-degrees/course/data-science-master-of-science-ms/?i512d8309c68346,0.627953
2,Data Science MSc,University of Greenwich,"Our MSc in Data Science equips graduates to embark on highly skilled careers in data science, artificial intelligence and machine learning.This specialist Master's in Data Science provides a theoretical knowledge of data science alongside the practical skills that will help you prosper in the jobs market.You'll be exposed to a broad range of topics such as data science, statistics, specialist programming, machine learning and data visualisation.",www.findamasters.com/masters-degrees/course/data-science-msc/?i309d6382c56314,0.614736
3,Data Science - MSc/PgD/PgC,Cardiff Metropolitan University,"This Master's degree in Data Science is an industry-relevant and popular postgraduate programme of study. Data Science and ""Big Data"" are more important than ever. Graduates trained to garner insights from large data sets, extract patterns and provide actionable information to solve real-world problems are sought after in both the private and public sector.This course will equip you with in-demand theoretical knowledge and practical skills to develop data science systems, use software to analyse and synthesise data, and manage all aspects of data science.Course ContentCompulsory",www.findamasters.com/masters-degrees/course/data-science-msc-pgd-pgc/?i366d8243c45565,0.560065
4,Big Data Science and Technology - MSc,University of Bradford,"Develop the skills you need for a career in this fast emerging field of data science.This programme will deepen your understanding of advanced software development, systems for big data analytics, statistical data analysis, data mining, data privacy and security, data visualisation and exploration.It is an interdisciplinary programme, designed for students with a first degree in subjects such as:",www.findamasters.com/masters-degrees/course/big-data-science-and-technology-msc/?i285d7565c52845,0.548488
5,Data Science MSc,Coventry University,"Data is everywhere. As the volume and complexity of data collected continues to grow, there is increasing demand for expertise in data science to support the analysis and visualisation of all this information.The MSc Data Science is a conversion course for graduates from a wide range of disciplines and backgrounds looking to pursue a career, or upskill, in this new and rapidly developing field. Data Scientists are in short supply and there is high demand for data science skills across sectors including business, government, healthcare, science, finance, and marketing.",www.findamasters.com/masters-degrees/course/data-science-msc/?i49d2694c62559,0.541314
6,Data Science MSc,University of Southampton,"Become a proficient data scientist with a master’s in Data Science at the University of Southampton.You’ll study the latest techniques and technologies, including data mining, machine learning, and data visualisation. By the end of your studies, you’ll be able to develop original ideas and solve problems using advanced data science methods. You can apply what you have learned in fields such as data journalism, open government and social media.",www.findamasters.com/masters-degrees/course/data-science-msc/?i349d6709c37394,0.522204
7,Data Science - MSc,University of Glasgow,"The Masters in Data Science is a specialist version of the MSc (Computing Science) which will provide you with a thorough grounding in the analysis and use of large data sets together with experience of conducting a substantial development or research project focused on Data Science techniques, preparing you for responsible positions in the Big Data and IT industries.",www.findamasters.com/masters-degrees/course/data-science-msc/?i307d4813c32689,0.518929
8,Master of Science in Data Science and Business Analytics (ODL) - 100% ONLINE,Asia Pacific University of Technology & Innovation (APU),"This programme is specifically designed to provide:Knowledge and applied skills in data science, big data analytics and business intelligence.Overall understanding of the impact of data science upon modern processes and business.Exposure towards data science tools and techniques, as well as methods of data collection and utilization, to turn data into useful information via various processes.Joint Professional Certification by TIBCO Software",www.findamasters.com/masters-degrees/course/master-of-science-in-data-science-and-business-analytics-odl-100-online/?i3514d8592c71220,0.514846
9,Data Science MSc,University of Chester,"A conversion Master’s degree in Data Science, to enable students to make the leap to the fast-growing area of data science.Course overviewThe Data Science Master’s degree is aimed at people who have a technical, mathematical, or engineering background who want to develop their skills into the field of data science. Students who have worked in one of these areas but do not process a degree in the area are welcome to apply.Why study this course with us?The Data Science course is within the Department of Computer Science, who are a forward-thinking innovative department.We work with employers to tailor the course to real-world needs, giving students an in-depth knowledge in the area.",www.findamasters.com/masters-degrees/course/data-science-msc/?i292d1884c67154,0.513056


# 3. Define a new score!

Now it's your turn: build a new metric to rank MSc degrees.

Practically:

- The user will enter a text query. As a starting point, get the query-related documents by exploiting the search engine of Step 2.1.
- Once you have the documents, you need to sort them according to your new score. In this step, you no longer have to consider only the description field of the documents; you can also use the remaining variables in your dataset (or new variables that you create from the existing ones or scrape again from the original web pages). You must use a heap data structure (you can use Python libraries) for maintaining the top-k documents.

N.B.: You have to define a scoring function, not a filter!

The output must contain:

- `courseName`
- `universityName`
- `description`
- `URL`
- The new `similarity score` of the documents with respect to the query

Are the results you obtain better than with the previous scoring function? Explain and compare results.

---

The idea behind the development of the new scoring function is an extension of the concept introduced in *Q2.2* to encompass other variables. This involves assigning varying weights based on the input query. This approach is motivated by the understanding that a user may have diverse interests when searching for information, such as specifying a city, a university name, or the course name. In such cases, if the query aligns with one of these specific variables, it is deemed to be more significant than others. This allows for a more nuanced and personalized scoring system tailored to the user's preferences and priorities.

Initially, we developed a function to efficiently compute weights based on whether the query matches any of the variables. Emphasis is given to the variable that matches at least a word of the input query, assigning higher importance to that specific variable.

After that, we use the `fuzzywuzzy` package, which is based on the Levenshtein distance, to compute the similarity between the query and the value of each variable.
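The two steps described above can be sketched as follows. This is only an illustration of the idea, not the code of `calculate_weights` and `calculate_total_score`: the function names, the `boost` parameter, and the list of fields are assumptions, and the standard-library `difflib.SequenceMatcher` is used as a stand-in for the `fuzzywuzzy` similarity so that the example runs without extra dependencies.

```python
from difflib import SequenceMatcher

FIELDS = ("courseName", "universityName", "city")  # assumed set of scored fields

def field_weights(query, rows, boost=3.0):
    """Give a higher weight to every field that shares at least one word with the query."""
    query_words = set(query.lower().split())
    raw = {}
    for field in FIELDS:
        matched = any(query_words & set(str(row[field]).lower().split()) for row in rows)
        raw[field] = boost if matched else 1.0
    total = sum(raw.values())
    return {field: value / total for field, value in raw.items()}  # weights sum to 1

def total_score(query, row, weights):
    """Weighted sum of string similarities (each in [0, 1]) between the query and each field."""
    return sum(
        weight * SequenceMatcher(None, query.lower(), str(row[field]).lower()).ratio()
        for field, weight in weights.items()
    )

rows = [
    {"courseName": "Data Science MSc", "universityName": "Lancaster University", "city": "Lancaster"},
    {"courseName": "Data Science MSc", "universityName": "Coventry University", "city": "Coventry"},
]
weights = field_weights("data science lancaster", rows)
scores = [total_score("data science lancaster", row, weights) for row in rows]
```

Because the result is a weighted sum and not a hard condition, a course that misses on one field is ranked lower but never excluded, which keeps this a scoring function and not a filter.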

In [32]:
user_query = "data science"
normalized_query = normalize_query(user_query)
term_ids = get_term_id_from_query(vocab_with_index, normalized_query)
# start from the smaller dataset returned by the search engine of Step 2.1
new_dataset = search_engine_full(term_ids, df, inverted_index)

user_query_2 = "data science lancaster"  # second, more specific user query

# compute the field weights from the more specific query
weights = calculate_weights(user_query_2, new_dataset)

# add the new_score as a new column of the df
new_dataset['score'] = new_dataset.apply(lambda row: calculate_total_score(user_query_2, row, weights), axis=1)

# heapq is a min-heap, so push the negated score to pop the highest-scoring documents first
heap = []
for index, row in new_dataset.iterrows():
    heapq.heappush(heap, (-row['score'], index))

# Pop every document in decreasing order of score
sorted_indices = [heapq.heappop(heap)[1] for _ in range(len(heap))]

# Build a new DataFrame sorted by the new score
sorted_df = new_dataset.loc[sorted_indices]
k = 5  # number of top-scoring courses to display
sorted_df[['courseName', 'universityName', 'description', 'score']].head(k)


Unnamed: 0,courseName,universityName,description,score
330,Data Science MSc,Lancaster University,"From business and finance to health and medicine, from infrastructure to societal studies, data science plays a vital role in all aspects of the modern world. Our MSc programme will ensure you have an advanced level of skills, knowledge, and experience in this rapidly expanding, highly in-demand field to achieve your career aspirations.",0.428696
267,Data Science - MSc,Lancaster University,"From business and finance to health and medicine, from infrastructure to societal studies, data science plays a vital role in all aspects of the modern world. Our MSc programme will ensure you have an advanced level of skills, knowledge, and experience in this rapidly expanding, highly in-demand field to achieve your career aspirations.",0.411304
211,Health Data Science MSc,Lancaster University,"Deep statistical thinking combined with expertise in health and computer science is becoming increasingly fundamental in tackling public health problems across the world. The MSc in Health Data Science will equip you with advanced technical skills which will allow you to develop a career as a data-scientist in the health and care sector.The MSc in Health Data Science, consists of an initial set of 4 core modules: “Statistical methods and models for health research”, “Programming for Health Data Science”, “Fundamentals for Health Data Science” and “Introduction to applied epidemiology”. These will allow you to develop and consolidate foundational skills in the three main areas of Health Data Science: epidemiology, statistics and computer science.",0.401739
210,Health Data Science MSc,Lancaster University,"Deep statistical thinking combined with expertise in health and computer science is becoming increasingly fundamental in tackling public health problems across the world. The MSc in Health Data Science will equip you with advanced technical skills which will allow you to develop a career as a data-scientist in the health and care sector.The MSc in Health Data Science, consists of an initial set of 4 core modules: “Statistical methods and models for health research”, “Programming for Health Data Science”, “Fundamentals for Health Data Science” and “Introduction to applied epidemiology”. These will allow you to develop and consolidate foundational skills in the three main areas of Health Data Science: epidemiology, statistics and computer science.",0.397391
327,Data Science MSc,Northumbria University,"This exciting Data Science master's has been designed by top academics in the field, in close consultation with leading data scientists from the industry and the Northumbria University Institute of Coding (IoC). It will provide you with the relevant skills needed to analyse, synthesise and manage different types and sizes of data efficiently.This programme is designed to train and produce data scientists who will fill a range of jobs requiring skills in methodical and statistical data analysis and help organisations (e.g., businesses, healthcare providers, financial institutions, industries) make the most of their huge amounts of data. You will develop knowledge insight from a variety of structured and unstructured data, using a range of data analysis methods, processes, algorithms, and systems.",0.392174


After trying several queries, we conclude that the new similarity score gives more robust results and is more versatile than cosine similarity. The improvement comes from taking additional variables into account and from analysing the user query to infer the search intent. The underlying assumption is that users formulate queries to obtain the most relevant results from the database, ranked so that the best matches are shown first.

As the output above shows, ranking on the *description* variable alone (as in Q2.2) can produce results that do not match what the user asked for. With the query "data science lancaster", the four highest-scoring courses under the new score are all offered by Lancaster University, a match on the university name that a description-only score does not take into account.

Moreover, the Levenshtein distance is a convenient string metric for comparing a query with short field values. Informally, the Levenshtein distance between two words is the minimum number of single-character edits (insertions, deletions, or substitutions) required to transform one word into the other. We chose this measure because user input often contains errors such as missing or extra letters or concatenated words, and an edit-distance score degrades gradually under such variations instead of failing to match.
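For reference, the definition above translates into a short dynamic-programming routine. This is a plain-Python sketch for illustration, not the implementation used inside `fuzzywuzzy`; the `similarity` helper and its normalisation by the longer string are assumptions made for the example.

```python
def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions or substitutions turning a into b."""
    previous = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, char_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to the empty prefix of b
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution (or match)
            ))
        previous = current
    return previous[-1]

def similarity(a, b):
    """Scale the distance to [0, 1], where 1 means identical strings."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1 - levenshtein(a, b) / longest

typo_distance = levenshtein("lancaster", "lancastr")  # one missing letter
typo_similarity = similarity("lancaster", "lancastr")
```

A single missing letter costs one edit, so the misspelt query still scores close to 1 against the correct name.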

# 4. Visualizing the most relevant MSc degrees


Using maps can help people understand how far one university is from another so they can plan their academic careers more adequately. Here, we challenge you to show a map of the courses found with the score defined in point 3. You should be able to identify at least the city and country for each MSc degree. You can find some ideas on how to create maps in Python [here](https://plotly.com/python/maps/) and [here](https://towardsdatascience.com/visualizing-geospatial-data-in-python-e070374fe621) but you will maybe need further information for a proper visualization, like coordinates (latitude and longitude). You can retrieve this data using various tools:

1. [Here](https://medium.com/@manilwagle/geocoding-the-world-using-google-api-and-python-1f6b6fb6ca48) you can find a helpful tutorial on how to geocode locations using the Google API in Python (this tool can also be used in Google Sheets)
2. You can collect a list of unique places in the format (City, Country) and ask chatGPT (or, as usual, any other LLM chatbot) to provide you with a list of corresponding representative coordinates
3. Explore and find the best solution for your case!

Once you have defined your visualization strategy, include a way to encode fees in your charts. The map should show (with a proper legend) different courses and associated fees: the user wants a glimpse not only of how far he will need to move but also of how much it will cost him!

---

Let's start by creating a new column with the full address (universityName, city, country), as suggested in the links above.

In [33]:
k = 100
sorted_df['full_address'] = sorted_df['universityName'] + ", " + sorted_df['city'] + ", " + sorted_df['country']
sorted_df.head(k)

Unnamed: 0,courseName,universityName,facultyName,city,country,description,FEE_EUR,score,full_address
330,Data Science MSc,Lancaster University,Department of Mathematics and Statistics,Lancaster,United Kingdom,"From business and finance to health and medicine, from infrastructure to societal studies, data science plays a vital role in all aspects of the modern world. Our MSc programme will ensure you have an advanced level of skills, knowledge, and experience in this rapidly expanding, highly in-demand field to achieve your career aspirations.",,0.428696,"Lancaster University, Lancaster, United Kingdom"
267,Data Science - MSc,Lancaster University,School of Computing and Communications,Lancaster,United Kingdom,"From business and finance to health and medicine, from infrastructure to societal studies, data science plays a vital role in all aspects of the modern world. Our MSc programme will ensure you have an advanced level of skills, knowledge, and experience in this rapidly expanding, highly in-demand field to achieve your career aspirations.",,0.411304,"Lancaster University, Lancaster, United Kingdom"
211,Health Data Science MSc,Lancaster University,Lancaster Medical School,Lancaster,United Kingdom,"Deep statistical thinking combined with expertise in health and computer science is becoming increasingly fundamental in tackling public health problems across the world. The MSc in Health Data Science will equip you with advanced technical skills which will allow you to develop a career as a data-scientist in the health and care sector.The MSc in Health Data Science, consists of an initial set of 4 core modules: “Statistical methods and models for health research”, “Programming for Health Data Science”, “Fundamentals for Health Data Science” and “Introduction to applied epidemiology”. These will allow you to develop and consolidate foundational skills in the three main areas of Health Data Science: epidemiology, statistics and computer science.",,0.401739,"Lancaster University, Lancaster, United Kingdom"
210,Health Data Science MSc,Lancaster University,Division of Health Research,Lancaster,United Kingdom,"Deep statistical thinking combined with expertise in health and computer science is becoming increasingly fundamental in tackling public health problems across the world. The MSc in Health Data Science will equip you with advanced technical skills which will allow you to develop a career as a data-scientist in the health and care sector.The MSc in Health Data Science, consists of an initial set of 4 core modules: “Statistical methods and models for health research”, “Programming for Health Data Science”, “Fundamentals for Health Data Science” and “Introduction to applied epidemiology”. These will allow you to develop and consolidate foundational skills in the three main areas of Health Data Science: epidemiology, statistics and computer science.",,0.397391,"Lancaster University, Lancaster, United Kingdom"
327,Data Science MSc,Northumbria University,Computer and Information Sciences,Newcastle,United Kingdom,"This exciting Data Science master's has been designed by top academics in the field, in close consultation with leading data scientists from the industry and the Northumbria University Institute of Coding (IoC). It will provide you with the relevant skills needed to analyse, synthesise and manage different types and sizes of data efficiently.This programme is designed to train and produce data scientists who will fill a range of jobs requiring skills in methodical and statistical data analysis and help organisations (e.g., businesses, healthcare providers, financial institutions, industries) make the most of their huge amounts of data. You will develop knowledge insight from a variety of structured and unstructured data, using a range of data analysis methods, processes, algorithms, and systems.",,0.392174,"Northumbria University, Newcastle, United Kingdom"
...,...,...,...,...,...,...,...,...,...
51,Economics with Data Science - MSc,University of Bristol,Faculty of Social Sciences and Law,Bristol,United Kingdom,"Gain an in-depth understanding of economics, econometrics and data science on our interdisciplinary MSc Economics with Data Science.This programme is suited to both professionals and undergraduates seeking advanced training in economics and data science, following a related degree in economics, computer science, physics, maths, or engineering, with exposure to quantitative methods and/or programming.Combining theoretical training and the teaching of practical tools, this programme is taught by experts in both economics and mathematical engineering disciplines and provides training on essential programming and data science software, alongside robust economics content distinctive to this programme.",,0.314783,"University of Bristol, Bristol, United Kingdom"
257,Data Science - Master of Science (MS),University of Colorado Boulder,College of Arts and Sciences,Boulder,USA,"The on-campus Master of Science in Data Science program focuses on developing knowledge and skills in interdisciplinary and collaborative data science competencies including statistical analysis, data structures and algorithms, data mining, machine learning, big data architecture and data visualization.Data science is a multidisciplinary field that focuses on the extraction of knowledge and insight from large datasets. Data scientists are tasked with using a range of skills in applied mathematics, statistics and computer science, and in domain applications such as information science, geography, business, media and the humanities.",,0.314783,"University of Colorado Boulder, Boulder, USA"
189,Mechanical Engineering MSc,University of Groningen,Science and Engineering,Groningen,Netherlands,"Digital technology, AI and data science are integral and essential parts of the 21st century mechanical engineering discipline. 21st century mechanical engineers must be versed in both the classical design and construction of complex mechanical systems, as well as the physical, mathematical and digital representation and analysis of these systems.At the Faculty of Science and Engineering, we offer students a Masters degree programme in Mechanical Engineering that fits the needs of digital technology and digital society. The university is a unique place where engineering, natural sciences and social Sciences meet to tackle urgent societal challenges. It prepares you for an international digital and high-tech career in industry.",,0.314348,"University of Groningen, Groningen, Netherlands"
61,Geospatial Data Science & Modelling - MSc,University of Glasgow,College of Science and Engineering,Glasgow,United Kingdom,"The Masters in Geospatial Data Science and Modelling will equip you with specialist geospatial, analytical, programming and modelling skills that can be applied to data in a broad range of sectors.You will get hands-on experience and gain a practical understanding of geospatial statistics. Computationally and mathematically literate graduates with the ability to address geospatial problems are in increasing demand across industries.",,0.313913,"University of Glasgow, Glasgow, United Kingdom"


In [35]:
# Load the required libraries
import googlemaps
from keplergl import KeplerGl
import geopandas as gpd

# Google Maps client used for geocoding. The API key is read from an environment
# variable (the name GOOGLE_MAPS_API_KEY is our choice) instead of being hardcoded in the notebook.
gmaps = googlemaps.Client(key=os.environ["GOOGLE_MAPS_API_KEY"])

In [36]:
addresses1 = sorted_df[['full_address']].copy()  # independent copy, so new columns can be added safely
addresses1.head(10)

Unnamed: 0,full_address
330,"Lancaster University, Lancaster, United Kingdom"
267,"Lancaster University, Lancaster, United Kingdom"
211,"Lancaster University, Lancaster, United Kingdom"
210,"Lancaster University, Lancaster, United Kingdom"
327,"Northumbria University, Newcastle, United Kingdom"
329,"University of Chester, Chester, United Kingdom"
321,"Wageningen University & Research, Wageningen, Netherlands"
302,"University of Westminster, London, United Kingdom"
255,"Politecnico di Milano, Milan, Italy"
328,"University of Exeter, Exeter, United Kingdom"


In [37]:
# placeholder columns for the coordinates, filled in by the geocoding step
addresses1['long'] = np.nan
addresses1['lat'] = np.nan


In [38]:
for idx, address in addresses1['full_address'].items():
    geocode_result = gmaps.geocode(address)
    if geocode_result:  # skip addresses the geocoder cannot resolve
        location = geocode_result[0]['geometry']['location']
        addresses1.at[idx, 'lat'] = location['lat']
        addresses1.at[idx, 'long'] = location['lng']

In [40]:
addresses1

Unnamed: 0,full_address,long,lat
330,"Lancaster University, Lancaster, United Kingdom",-2.787729,54.010394
267,"Lancaster University, Lancaster, United Kingdom",-2.787729,54.010394
211,"Lancaster University, Lancaster, United Kingdom",-2.787729,54.010394
210,"Lancaster University, Lancaster, United Kingdom",-2.787729,54.010394
327,"Northumbria University, Newcastle, United Kingdom",-1.61778,54.978252
...,...,...,...
295,"University College London, London, United Kingdom",-0.127586,51.507218
74,"University College Cork, Cork, Ireland",-8.491198,51.893609
63,"University of Groningen, Groningen, Netherlands",6.56283,53.219327
107,"WHU, Koblenz, Germany",7.613507,50.400325
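Many rows share the same address (for example, several courses at Lancaster University), so each distinct address only needs to be geocoded once. The sketch below shows one way to do that with a small cache. `geocode_unique` is a hypothetical helper, and `fake_geocode` is a stand-in that only mimics the response shape used above (`result[0]['geometry']['location']`), so the example runs without an API key.

```python
def geocode_unique(addresses, geocode_fn):
    """Geocode each distinct address once and return {address: (lat, lng) or None}."""
    cache = {}
    for address in addresses:
        if address in cache:
            continue  # already resolved, no extra API call
        results = geocode_fn(address)
        if results:
            location = results[0]["geometry"]["location"]
            cache[address] = (location["lat"], location["lng"])
        else:
            cache[address] = None  # the geocoder found nothing for this address
    return cache

# Stand-in geocoder that records every call it receives
calls = []
def fake_geocode(address):
    calls.append(address)
    known = {"Lancaster University, Lancaster, United Kingdom": (54.010394, -2.787729)}
    if address not in known:
        return []
    lat, lng = known[address]
    return [{"geometry": {"location": {"lat": lat, "lng": lng}}}]

addresses = ["Lancaster University, Lancaster, United Kingdom"] * 4 + ["Nowhere"]
coords = geocode_unique(addresses, fake_geocode)
```

With a real client, `geocode_fn` would be the client's `geocode` method, and the resulting dictionary could be mapped back onto the `full_address` column.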


In [41]:
# add a small random offset so that courses sharing the same coordinates do not overlap on the map
addresses1['lat'] += np.random.normal(-0.005, 0.005, len(addresses1))
addresses1['long'] += np.random.normal(-0.005, 0.005, len(addresses1))
addresses1

Unnamed: 0,full_address,long,lat
330,"Lancaster University, Lancaster, United Kingdom",-2.799807,54.008988
267,"Lancaster University, Lancaster, United Kingdom",-2.798462,54.002775
211,"Lancaster University, Lancaster, United Kingdom",-2.78574,53.998273
210,"Lancaster University, Lancaster, United Kingdom",-2.789567,54.003592
327,"Northumbria University, Newcastle, United Kingdom",-1.624651,54.968912
...,...,...,...
295,"University College London, London, United Kingdom",-0.132554,51.493346
74,"University College Cork, Cork, Ireland",-8.493422,51.894016
63,"University of Groningen, Groningen, Netherlands",6.558684,53.214482
107,"WHU, Koblenz, Germany",7.61584,50.390221
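The cell above draws the noise with mean -0.005, which moves every marker slightly in the same direction on average. A possible alternative, sketched here under our own assumptions, is to use zero-mean noise and to shift only the points whose coordinates occur more than once. The helper name, the `scale` value, and the fixed seed are illustrative choices.

```python
import random
from collections import Counter

def jitter_duplicates(points, scale=0.005, seed=0):
    """Shift only the (lat, lon) points whose coordinates occur more than once."""
    rng = random.Random(seed)  # seeded so the map is reproducible
    counts = Counter(points)
    jittered = []
    for lat, lon in points:
        if counts[(lat, lon)] > 1:
            lat += rng.gauss(0, scale)  # zero-mean noise: no systematic drift
            lon += rng.gauss(0, scale)
        jittered.append((lat, lon))
    return jittered

points = [(54.010394, -2.787729)] * 3 + [(54.978252, -1.61778)]
spread = jitter_duplicates(points)
```

Unique locations keep their exact coordinates, while overlapping markers are spread around their true position.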


In [42]:
# Let's join the coordinates back onto the original DataFrame (rows are aligned on the index)
sorted_df['Lat'] = addresses1['lat']
sorted_df['Lon'] = addresses1['long']
sorted_df.head()

Unnamed: 0,courseName,universityName,facultyName,city,country,description,FEE_EUR,score,full_address,Lat,Lon
330,Data Science MSc,Lancaster University,Department of Mathematics and Statistics,Lancaster,United Kingdom,"From business and finance to health and medicine, from infrastructure to societal studies, data science plays a vital role in all aspects of the modern world. Our MSc programme will ensure you have an advanced level of skills, knowledge, and experience in this rapidly expanding, highly in-demand field to achieve your career aspirations.",,0.428696,"Lancaster University, Lancaster, United Kingdom",54.008988,-2.799807
267,Data Science - MSc,Lancaster University,School of Computing and Communications,Lancaster,United Kingdom,"From business and finance to health and medicine, from infrastructure to societal studies, data science plays a vital role in all aspects of the modern world. Our MSc programme will ensure you have an advanced level of skills, knowledge, and experience in this rapidly expanding, highly in-demand field to achieve your career aspirations.",,0.411304,"Lancaster University, Lancaster, United Kingdom",54.002775,-2.798462
211,Health Data Science MSc,Lancaster University,Lancaster Medical School,Lancaster,United Kingdom,"Deep statistical thinking combined with expertise in health and computer science is becoming increasingly fundamental in tackling public health problems across the world. The MSc in Health Data Science will equip you with advanced technical skills which will allow you to develop a career as a data-scientist in the health and care sector.The MSc in Health Data Science, consists of an initial set of 4 core modules: “Statistical methods and models for health research”, “Programming for Health Data Science”, “Fundamentals for Health Data Science” and “Introduction to applied epidemiology”. These will allow you to develop and consolidate foundational skills in the three main areas of Health Data Science: epidemiology, statistics and computer science.",,0.401739,"Lancaster University, Lancaster, United Kingdom",53.998273,-2.78574
210,Health Data Science MSc,Lancaster University,Division of Health Research,Lancaster,United Kingdom,"Deep statistical thinking combined with expertise in health and computer science is becoming increasingly fundamental in tackling public health problems across the world. The MSc in Health Data Science will equip you with advanced technical skills which will allow you to develop a career as a data-scientist in the health and care sector.The MSc in Health Data Science, consists of an initial set of 4 core modules: “Statistical methods and models for health research”, “Programming for Health Data Science”, “Fundamentals for Health Data Science” and “Introduction to applied epidemiology”. These will allow you to develop and consolidate foundational skills in the three main areas of Health Data Science: epidemiology, statistics and computer science.",,0.397391,"Lancaster University, Lancaster, United Kingdom",54.003592,-2.789567
327,Data Science MSc,Northumbria University,Computer and Information Sciences,Newcastle,United Kingdom,"This exciting Data Science master's has been designed by top academics in the field, in close consultation with leading data scientists from the industry and the Northumbria University Institute of Coding (IoC). It will provide you with the relevant skills needed to analyse, synthesise and manage different types and sizes of data efficiently.This programme is designed to train and produce data scientists who will fill a range of jobs requiring skills in methodical and statistical data analysis and help organisations (e.g., businesses, healthcare providers, financial institutions, industries) make the most of their huge amounts of data. You will develop knowledge insight from a variety of structured and unstructured data, using a range of data analysis methods, processes, algorithms, and systems.",,0.392174,"Northumbria University, Newcastle, United Kingdom",54.968912,-1.624651


In [43]:
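# Make sure missing fees are NaN, then cast the column to float
sorted_df.loc[sorted_df['FEE_EUR'].isnull(), 'FEE_EUR'] = np.nan
sorted_df['FEE_EUR'] = sorted_df['FEE_EUR'].astype(float)
# Show the available fees rounded to the nearest euro
sorted_df['FEE_EUR'].dropna().apply(round)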

14      1000
337       18
336    18044
358    14283
105    35500
       ...  
166    19263
66     52281
201    20687
74     18500
107    37200
Name: FEE_EUR, Length: 95, dtype: int64

In [44]:
# Create a basemap
maps = KeplerGl(height=600, width=800)
# Create a GeoDataFrame with one point per course
gdf = gpd.GeoDataFrame(sorted_df, geometry=gpd.points_from_xy(sorted_df.Lon, sorted_df.Lat))
# Add the data to Kepler and show the map
map_2 = KeplerGl(height=400, data={"data_1": gdf})
map_2

User Guide: https://docs.kepler.gl/docs/keplergl-jupyter
User Guide: https://docs.kepler.gl/docs/keplergl-jupyter


KeplerGl(data={'data_1':                                                                                   cou…

In [45]:
map_2.save_to_html(file_name="map_courses.html")

Map saved to map_courses.html!



# 5. BONUS: More complex search engine

IMPORTANT: This is a bonus step, so it's not mandatory. You can get the maximum score also without doing this. We will take this into account, only if the rest of the homework has been completed.

For the Bonus part, we ask you to build a more sophisticated search engine that lets users issue more complex queries. The options of this new search engine are:

1. Give the possibility to specify queries for the following features (the user should have the option to issue none or all of them):
- `courseName`
- `universityName`
- `universityCity`
2. Specify a range for the fees to retrieve only the MSc courses whose fees are in that range.
3. Specify a list of countries so that the search engine returns only the courses taking place in cities within those countries.
4. Filter based on the courses that have already started.
5. Filter based on the presence of online modality.

Note 1: You should give the user the possibility to select any of the abovementioned options. How should the user use the options? We will accept any manual that you provide to the user.

Note 2: As you may have realized from the 1st option, you need to build inverted indexes for those values and return all of the documents whose similarity to the given queries is greater than 0. Choose a logical way to aggregate the similarity coming from each of them and explain your idea in detail.

Note 3: The options other than the 1st one can be considered filtering criteria, so the retrieved documents must respect all of those filters.

The output must contain the following information about the places:

- `courseName`
- `universityName`
- `url`
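
As an illustration of Notes 2 and 3, here is a minimal sketch of one possible way to combine the per-field similarities and then apply the filters. It is not the required solution: averaging over the queried fields is only one reasonable aggregation, the per-field similarity scores are assumed to come from inverted indexes built elsewhere, and the names `aggregate_scores`, `passes_filters`, `country` and `fee_eur` are hypothetical.

```python
def aggregate_scores(per_field_scores):
    """Average the similarity of each document over the queried fields.

    per_field_scores maps a field name to {doc_id: similarity}; a document
    missing from a field counts as similarity 0 for that field.
    """
    if not per_field_scores:
        return {}
    totals = {}
    for scores in per_field_scores.values():
        for doc_id, score in scores.items():
            totals[doc_id] = totals.get(doc_id, 0.0) + score
    n_fields = len(per_field_scores)
    # Keep only the documents with an aggregated similarity greater than 0
    return {doc_id: total / n_fields for doc_id, total in totals.items() if total > 0}


def passes_filters(course, fee_range=None, countries=None):
    """Apply the hard filters (options 2 and 3); None means 'not requested'."""
    if fee_range is not None:
        low, high = fee_range
        fee = course.get("fee_eur")
        if fee is None or not low <= fee <= high:
            return False
    if countries is not None and course.get("country") not in countries:
        return False
    return True


# Toy data: the keys "country" and "fee_eur" are illustrative only.
courses = {
    1: {"courseName": "Data Science MSc", "country": "United Kingdom", "fee_eur": 18500},
    2: {"courseName": "Health Data Science MSc", "country": "Italy", "fee_eur": 1000},
}
scores = aggregate_scores({
    "courseName": {1: 0.8, 2: 0.4},
    "universityCity": {1: 0.2},
})
ranked = sorted(
    (doc_id for doc_id in scores
     if passes_filters(courses[doc_id], fee_range=(0, 20000), countries={"United Kingdom"})),
    key=scores.get,
    reverse=True,
)
print(ranked)
```

Averaging rewards documents that match several of the queried fields, while the filters act as hard constraints that a document either satisfies or not.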

# 7. Algorithmic Question

Leonardo is an intern at a company. He is paid based on the total number of hours he has worked. They agreed `d` days ago that Leonardo could not work less than `l_i` or more than `r_i` hours on the i-th day. Furthermore, he was warned by HR that on his last day at the company, he should provide a detailed report on how many hours he worked each day for the previous `d` days.

Today is the day Leonardo should report to HR, but the problem is that he didn't account for how many hours he put in for each day, so he only has the total sum of the hours (`total_hours`) he put in total in these `d` days. He believes that if he creates a report in which each number `x_i` corresponds to the total hours he worked on the i-th day while satisfying the HR limitations and the total sum of all `x_i` equals `total_hours`, he would be fine.

He cannot create such a report independently and requests your assistance. He will give you the number of days `d`, total hours spent `total_hours`, and the HR limitations for each day (`l_i`, `r_i`), and he wants you to assist him in determining whether it is possible to create such a fake report. If that is possible, make such a report.

**Input**

The first line of input contains two integers `d` and `total_hours` - the number of days Leonardo worked there and the total number of hours he worked for the company. Each of the following `d` lines contains two integer numbers `l_i` and `r_i` - the minimum and maximum hours he can work on the i-th day.

**Output**

If such a report cannot be generated, print 'NO' in one output line. If such a report is possible, print 'YES' in the output and `d` numbers - the number of hours Leonardo spent each day - in the second line. If more than one solution exists, print any of them.
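
Assuming `l_i <= r_i` for every day, a valid report exists exactly when `total_hours` lies between the sum of all minimums and the sum of all maximums. A minimal sketch of that check follows; the helper name `report_exists` is hypothetical and not part of the assignment.

```python
def report_exists(limits, total_hours):
    """Return True when some x_i with l_i <= x_i <= r_i sums to total_hours."""
    lowest = sum(low for low, _ in limits)     # every day at its minimum
    highest = sum(high for _, high in limits)  # every day at its maximum
    return lowest <= total_hours <= highest


# Two days with limits (1, 3) and (2, 4): any total from 3 to 7 works.
print(report_exists([(1, 3), (2, 4)], 5))  # True
print(report_exists([(1, 3), (2, 4)], 8))  # False
```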


### 1. Implement code to solve the above-mentioned problem.

In [None]:
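d, total_hours = map(int, input().split())
limits = []
for _ in range(d):
    min_hours, max_hours = map(int, input().split())
    limits.append((min_hours, max_hours))

# Every day starts at its minimum number of hours
hours = [0] * d
for i in range(d):
    hours[i] = limits[i][0]
    total_hours -= limits[i][0]

if total_hours < 0:
    # The minimums alone already exceed the total
    print('NO')
else:
    # Hand out the remaining hours, filling each day up to its maximum
    for i in range(d):
        extra = min(limits[i][1] - hours[i], total_hours)
        hours[i] += extra
        total_hours -= extra
    if total_hours == 0:
        print('YES')
        print(*hours)
    else:
        # Even with every day at its maximum, hours are left over
        print('NO')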

### 2. What is the time complexity (the Big O notation) of your solution? Please provide a detailed explanation of how you calculated the time complexity.

Overall time complexity (the Big O notation) of the solution: $O(1) + O(d)+ O(d) + O(d)+ O(d)= O(d)$

Checking the individual time complexities by breaking down the operations:

In [None]:
d, total_hours = map(int, input().split())
# Reading the first line is a constant-time operation, so it has a time complexity of O(1).
limits = []
for _ in range(d):
    min_hours, max_hours = map(int, input().split())
    limits.append((min_hours, max_hours))

The next part of the code creates a list and appends $d$ pairs of integers to it.
It involves a loop that runs $d$ times, so it has a time complexity of $O(d)$.

In [None]:
hours = [0] * d
# This operation creates a list of length d and initializes each element to zero, so it has a time complexity of O(d).

for i in range(d):
    # Every day starts at its minimum number of hours
    hours[i] = limits[i][0]
    total_hours -= limits[i][0]

This next loop iterates over a range of $d$ days. Each iteration involves a constant number of operations and the time complexity is $O(d)$.

In [None]:
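if total_hours < 0:
    print('NO')
else:
    for i in range(d):
        # Fill each day up to its maximum while hours remain
        extra = min(limits[i][1] - hours[i], total_hours)
        hours[i] += extra
        total_hours -= extra
    if total_hours == 0:
        print('YES')
        print(*hours)
    else:
        print('NO')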

Like the one from before, this loop iterates over a range of $d$ days. Each iteration involves constant time operations and the time complexity is also $O(d)$.
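
To make the linear growth concrete, the sketch below counts how many loop iterations the three loops perform for a given `d`. It is only an illustration of the analysis above, and the helper name `count_loop_iterations` is hypothetical.

```python
def count_loop_iterations(d):
    """Count the iterations of the three loops for an input of d days."""
    steps = 0
    for _ in range(d):  # reading the limits
        steps += 1
    for _ in range(d):  # assigning the minimum hours
        steps += 1
    for _ in range(d):  # distributing the remaining hours
        steps += 1
    return steps


for d in (10, 100, 1000):
    print(d, count_loop_iterations(d))
```

The count is always $3d$, so multiplying $d$ by ten multiplies the work by ten, which is what $O(d)$ expresses.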

### 3. Ask ChatGPT or any other LLM chatbot tool to check your code's time complexity (the Big O notation). Compare your answer to theirs. Do you believe this is correct? If the two differ, which one is right? (why?)

ChatGPT gave a different breakdown of the solution's time complexity, but the calculations and the result are the same: $O(d)$.

- Reading input: The first line reads two integers, which takes constant time. The loop that follows reads d pairs of integers, which has a time complexity of $O(d)$.
- Initializing the hours list: This operation has a time complexity of $O(d)$ since it involves creating a list of length d and initializing each element to zero.
- First loop (for i in range(d)): This loop iterates over the range of days (d). Each iteration involves a constant number of operations (assignments and condition checks) and does not depend on the size of the input. Therefore, the time complexity of this loop is $O(d)$.
- Second loop (for i in range(d)): Similar to the first loop, it iterates over the range of days (d). Each iteration involves constant time operations. The time complexity of this loop is also $O(d)$.

So, the overall time complexity of the provided code is $O(d)$, where d is the number of days.

### 4. What do you think of the optimality of your code? Do you believe it is optimal? Can you improve? Please elaborate on your response.

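Given the task, I believe the solution is asymptotically optimal: any program has to read the $d$ pairs of limits and print $d$ numbers, so it cannot run faster than $O(d)$. However, readability could be improved and the constant factors reduced. The assignment of the minimum hours and of the additional hours could be combined into a single loop, provided the sum of the minimums is computed beforehand. Another improvement would be to use a list comprehension when creating the limits list.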
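
A minimal sketch of these two improvements, assuming the whole input is available as a single string instead of being read with `input()`. The function name `solve` is hypothetical, and the greedy filling shown here is one valid way to build a report, not the only one.

```python
def solve(text):
    """Return 'NO', or 'YES' followed by one valid report, for the given input text."""
    lines = text.strip().splitlines()
    d, total_hours = map(int, lines[0].split())
    # List comprehension instead of an explicit append loop
    limits = [tuple(map(int, line.split())) for line in lines[1:d + 1]]

    # Hours left to distribute once every day is at its minimum
    remaining = total_hours - sum(low for low, _ in limits)
    if remaining < 0:
        return "NO"

    hours = []
    for low, high in limits:
        # Minimum and extra hours are assigned in the same loop
        extra = min(high - low, remaining)
        hours.append(low + extra)
        remaining -= extra

    if remaining > 0:
        return "NO"
    return "YES\n" + " ".join(map(str, hours))


print(solve("2 5\n1 3\n2 4"))
```

Returning the answer as a string rather than printing inside the function also makes the logic easier to test.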