## 1. Data collection

For this homework, there is no provided dataset; you have to build your own. Your search engine will run on text documents, so here we detail the procedure to follow for data collection. We strongly suggest you split the required functions into separate modules. For example, you may have a ```crawler.py``` module, a ```parser.py``` module, and an ```engine.py``` module: this is good practice that improves the readability of your report and makes the code easier to deploy. Be careful: you will likely have to deal with exceptions and other possible issues!

### 1.1. Get the list of master's degree courses

We start with the list of courses to include in your corpus of documents. In particular, we focus on web scraping the [MSc Degrees](https://www.findamasters.com/masters-degrees/msc-degrees/) listing. We want you to **collect the URL** associated with each course in that list.
The list is long and split into many pages. Therefore, we ask you to retrieve only the URLs of the courses listed in **the first 400 pages** (each page has 15 courses, so you will end up with 6000 unique master's degree URLs).

The output of this step is a `.txt` file in which each line contains the URL of one master's degree.
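A quick sanity check on the output file can catch duplicated or malformed lines before crawling starts. The sketch below is a minimal example, assuming the file has one URL per line; the helper name `check_url_file` is hypothetical and not part of the assignment.

```python
from pathlib import Path


def check_url_file(path):
    """Return (total_lines, unique_urls, malformed_lines) for a one-URL-per-line file."""
    lines = Path(path).read_text(encoding="utf-8").splitlines()
    # Ignore empty lines and surrounding whitespace
    urls = [line.strip() for line in lines if line.strip()]
    # Anything that does not look like an absolute HTTP(S) URL is reported as malformed
    malformed = [u for u in urls if not u.startswith(("http://", "https://"))]
    return len(urls), len(set(urls)), malformed


if __name__ == "__main__":
    # Assumes the file is called urls.txt; adjust the name if yours differs
    if Path("urls.txt").exists():
        total, unique, malformed = check_url_file("urls.txt")
        print(f"{total} lines, {unique} unique URLs, {len(malformed)} malformed")
```

If the total and unique counts differ, some course was collected more than once.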


In [2]:
from bs4 import BeautifulSoup
import requests
import re
import os
from urllib.request import urlopen
import time
from concurrent.futures import ThreadPoolExecutor
import csv
import pandas as pd

In [None]:
from urllib.parse import urljoin

BASE_URL = "https://www.findamasters.com"
# Open the output file in write mode; the with block closes it even if a request fails
with open("urls.txt", "w", encoding="utf-8") as f:
    for i in range(1, 401):  # from the first page to page 400
        url = f"{BASE_URL}/masters-degrees/msc-degrees/?PG={i}"  # the number after PG selects the page
        result = requests.get(url)
        soup = BeautifulSoup(result.text, 'html.parser')  # parse the HTML of the listing page

        # Every tag whose class matches 'courseLink' points to a course page
        for link in soup.find_all(class_=re.compile('courseLink')):
            c = link.get("href")  # relative URL of a course on the i-th page
            if c:
                # urljoin avoids the double slash produced by plain string concatenation
                f.write(urljoin(BASE_URL, c) + "\n")
print('The "urls.txt" file is generated!')

### 1.2. Crawl master's degree pages

We wrote a function named `download_url`.
Since the FindAMasters website blocks us for about 70 seconds after every 20 to 22 requests, we call `time.sleep(70)` when a request fails and then resend the HTTP GET request.
We also skip pages whose folder already exists, so they are not downloaded twice.
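The wait-and-retry behaviour described above can also be written as a bounded loop, which avoids unbounded recursion when a page keeps failing. This is a minimal sketch, not the function used in this notebook: `fetch_with_retry`, `max_attempts`, and the injectable `fetch` and `sleep` callables are hypothetical names introduced for illustration, and the 70-second wait is the value observed above.

```python
import time


def fetch_with_retry(fetch, url, wait_seconds=70, max_attempts=5, sleep=time.sleep):
    """Call fetch(url) until it succeeds or max_attempts is reached.

    `fetch` is any callable that returns the page text or raises an exception
    (for example, a thin wrapper around requests.get). Hypothetical helper.
    """
    last_error = None
    for attempt in range(1, max_attempts + 1):
        try:
            return fetch(url)
        except Exception as error:  # narrow this to the HTTP client's error type in real code
            last_error = error
            if attempt < max_attempts:
                sleep(wait_seconds)  # wait out the temporary block before retrying
    raise RuntimeError(f"Giving up on {url} after {max_attempts} attempts") from last_error
```

Passing `sleep` as a parameter keeps the helper easy to test without actually waiting.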

To send HTTP GET requests asynchronously, one option is to use `async` and `await` together with the `aiohttp` library. Another option is `ThreadPoolExecutor`.
Each call to the executor's `submit` method returns a future; we store these futures in a dictionary named `future_to_url`, which maps each future to its URL so that the results can be retrieved later.
`ThreadPoolExecutor` is a class in Python's built-in `concurrent.futures` module that manages a pool of worker threads. It allows us to submit tasks to the pool, which are then executed by one of the worker threads.
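The mapping from futures to URLs can be illustrated with a small example. This is a minimal sketch with a stand-in task that performs no network access; `fake_download` and `run_all` are hypothetical names, and `as_completed` is used here to collect results, which the notebook's own code does not do.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def fake_download(url):
    """Stand-in for a real download; returns the length of the URL string."""
    return len(url)


def run_all(urls, max_workers=2):
    """Submit one task per URL and collect the results keyed by URL."""
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        # Each submit() returns a Future; the dict remembers which URL it belongs to
        future_to_url = {executor.submit(fake_download, url): url for url in urls}
        for future in as_completed(future_to_url):
            url = future_to_url[future]
            try:
                results[url] = future.result()  # re-raises any exception from the worker
            except Exception as error:
                results[url] = error
    return results


if __name__ == "__main__":
    print(run_all(["https://example.com/a", "https://example.com/bb"]))
```

With `max_workers=1` the tasks run one after another, which is the setting that suits a site that rate-limits requests.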

In [2]:
from concurrent.futures import ThreadPoolExecutor

# Function to download and save HTML for a given URL
def download_url(url, folder_path, page_number):
    # Skip pages whose folder already exists (they have already been downloaded)
    page_folder = os.path.join(folder_path, f"page_{page_number}")
    if os.path.exists(page_folder):
        # Comment out the line below to silence the messages about skipped pages.
        print(f"Skipping Page: {page_number} - Folder already exists.")
        return

    try:
        response = requests.get(url) # Send a GET request to the URL
        response.raise_for_status()  # Raise an exception for bad responses 

        # Create a folder for each page if it doesn't exist
        os.makedirs(page_folder, exist_ok=True)

        # Save the HTML content to a file
        file_path = os.path.join(page_folder, f"html_{page_number}.html")
        with open(file_path, 'w', encoding='utf-8') as file:
            file.write(response.text)
        print(f"Downloaded page {page_number}: {url}")
    except requests.exceptions.RequestException as e:
        print(f"Failed to download page {page_number}: {url}")
        print(f"Error: {e}")
        print("Retrying in 70 seconds...")
        time.sleep(70)  # Wait for 70 seconds before retrying
        download_url(url, folder_path, page_number)  # Retry the download

# Read all URLs (one per line) into a list
with open('urls.txt', 'r') as urls_file:
    urls = urls_file.read().splitlines()

output_folder = 'HTML_folders' # Store all HTML files into this directory.

# ThreadPoolExecutor can send HTTP requests concurrently.
# However, since the FindAMasters website blocks us for 70 seconds after every 20 to 22 requests,
# max_workers is set to 1 below, so the requests are sent one at a time.
with ThreadPoolExecutor(max_workers=1) as executor:
    # Enumerate through each URL and submit download tasks to the executor
    future_to_url = {executor.submit(download_url, url, output_folder, page_number): url for page_number, url in enumerate(urls, start=1)}


Downloaded page 1: https://www.findamasters.com//masters-degrees/course/3d-design-for-virtual-environments-msc/?i93d2645c19223
Downloaded page 2: https://www.findamasters.com//masters-degrees/course/accounting-and-finance-msc/?i321d3232c3891
Downloaded page 3: https://www.findamasters.com//masters-degrees/course/accounting-accountability-and-financial-management-msc/?i132d7816c25522
Downloaded page 4: https://www.findamasters.com//masters-degrees/course/accounting-financial-management-and-digital-business-msc/?i345d4286c351
Downloaded page 5: https://www.findamasters.com//masters-degrees/course/addictions-msc/?i132d4318c27100
Downloaded page 6: https://www.findamasters.com//masters-degrees/course/advanced-chemical-engineering-msc/?i321d8433c50447
Downloaded page 7: https://www.findamasters.com//masters-degrees/course/advanced-master-in-financial-markets/?i1298d6514c28542
Downloaded page 8: https://www.findamasters.com//masters-degrees/course/advanced-master-in-innovation-and-strategic-

### 1.3. Parse downloaded pages

At this point, you should have all the HTML documents about the master's degree of interest, and you can start to extract specific information. The list of the information we desire for each course and their format is as follows:

1. Course Name (to save as ```courseName```): string;
2. University (to save as ```universityName```): string;
3. Faculty (to save as ```facultyName```): string;
4. Full or Part Time (to save as ```isItFullTime```): string;
5. Short Description (to save as ```description```): string;
6. Start Date (to save as ```startDate```): string;
7. Fees (to save as ```fees```): string;
8. Modality (to save as ```modality```): string;
9. Duration (to save as ```duration```): string;
10. City (to save as ```city```): string;
11. Country (to save as ```country```): string;
12. Presence or online modality (to save as ```administration```): string;
13. Link to the page (to save as ```url```): string.
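One way to keep the column names and their order consistent across all course files is to define them once and write each record through `csv.DictWriter`. The sketch below is a minimal example under that assumption; `FIELDS`, `write_course_tsv`, and the empty-string default for missing values are hypothetical choices, not requirements of the assignment.

```python
import csv

# The 13 field names listed above, in the requested order
FIELDS = ['courseName', 'universityName', 'facultyName', 'isItFullTime',
          'description', 'startDate', 'fees', 'modality', 'duration',
          'city', 'country', 'administration', 'url']


def write_course_tsv(path, record, missing=""):
    """Write a header row and one course row; absent fields get `missing`."""
    row = {name: record.get(name, missing) for name in FIELDS}
    with open(path, "w", encoding="utf-8", newline="") as handle:
        writer = csv.DictWriter(handle, fieldnames=FIELDS, delimiter="\t")
        writer.writeheader()
        writer.writerow(row)


if __name__ == "__main__":
    write_course_tsv("course_example.tsv",
                     {"courseName": "Addictions MSc", "city": "London"})
```

Opening the file with `newline=""` lets the `csv` module control line endings, which avoids blank lines between rows on Windows.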
    

In [15]:
for i in range(1,2):   
    os.chdir(r'C:\Users\susan\Documents\DS\ADM\HW3\ADM-HW3\HTML_folders\page_'+str(i)) #change directories
    for filename in os.listdir(os.getcwd()): # get all the files in a folder
        if filename.endswith(".html"): # if file extension is .html
            with open(os.path.join(os.getcwd(), filename), 'r',encoding='utf-8') as f: # open each file into a folder
                soup = BeautifulSoup(f,'html.parser') # get the html file by each file 
                out=[] # initialize a list to which we append all the information parsed from each HTML file

                # 1  Course Name
                courseName = soup.find_all(class_=re.compile("course-header__course-title"))
                out.append(courseName[0].text.strip() if courseName else "Nan") # text.strip() removes the surrounding whitespace characters
                # 2  University
                #[out.append(i.text) for i in soup.find_all(class_=re.compile("course-header__institution"))] 
                universityName = soup.find_all(class_=re.compile("course-header__institution"))
                out.append(universityName[0].text if universityName else "Nan")
                # 3  Faculty
                #[out.append(i.text) for i in soup.find_all(class_=re.compile("course-header__department"))]
                facultyName = soup.find_all(class_=re.compile("course-header__department"))
                out.append(facultyName[0].text if facultyName else "Nan")
                # 4  Full or Part Time
                #a = [i.text for i in soup.find_all(class_=re.compile("concealLink"))]
                #out.append(a[0])
                isItFullTime = soup.find_all(class_=re.compile("concealLink"))
                out.append(isItFullTime[0].text if isItFullTime else "Nan")
                # 5  Short Description
                #b = [i.text.replace('\n','') for i in soup.find_all(class_=re.compile("course-sections__content"))]
                #out.append(b[0])
                description = soup.find_all(class_=re.compile("course-sections__content"))
                out.append(description[0].text.replace('\n', '') if description else "Nan")
                # 6  Start Date
                #[out.append(i.text) for i in soup.find_all(class_=re.compile("key-info__start-date"))]
                startDate = soup.find_all(class_=re.compile("key-info__start-date"))
                out.append(startDate[0].text if startDate else "Nan")
                # 7  Fees
                #[out.append(i.text.replace('\n','')) for i in soup.find_all(class_=re.compile("course-sections__fees"))]
                fees = soup.find_all(class_=re.compile("course-sections__fees"))
                out.append(fees[0].text.replace('\n', '') if fees else "Nan")
                # 8  Modality
                #[out.append(i.text) for i in soup.find_all(class_=re.compile("key-info__qualification"))]
                modality = soup.find_all(class_=re.compile("key-info__qualification"))
                out.append(modality[0].text if modality else "Nan")
                # 9  Duration
                #[out.append(i.text) for i in soup.find_all(class_=re.compile("key-info__duration"))]
                duration = soup.find_all(class_=re.compile("key-info__duration"))
                out.append(duration[0].text if duration else "Nan")
                # 10  City
                #[out.append(i.text) for i in soup.find_all(class_=re.compile("course-data__city"))]
                city = soup.find_all(class_=re.compile("course-data__city"))
                out.append(city[0].text if city else "Nan")
                # 11  Country
                #[out.append(i.text) for i in soup.find_all(class_=re.compile("course-data__country"))]
                country = soup.find_all(class_=re.compile("course-data__country"))
                out.append(country[0].text if country else "Nan")
                # 12  Presence or online modality
                # Some courses are offered both online and on campus; one of them is "Master of Business Administration"
                on_campus_elements = soup.find_all(class_=re.compile("course-data__on-campus"))
                online_elements = soup.find_all(class_=re.compile("course-data__online"))
                if on_campus_elements and online_elements:
                    out.append("both")
                else:
                    out.append(on_campus_elements[0].text if on_campus_elements else online_elements[0].text if online_elements else "Nan")
                # 13  Link to the page
                canonical = soup.find('link', {'rel': 'canonical'})
                out.append(canonical['href'] if canonical else "Nan")  # the with block closes the file, so no f.close() is needed
                
                # Creating file .tsv
                l = ['courseName','universityName','facultyName','isItFullTime','description','startDate','fees','modality','duration',
                    'city','country','administration','url']
                with open(filename+'.tsv','w',encoding='utf-8') as tsv:
                    tsv_output = csv.writer(tsv, delimiter='\t')
                    tsv_output.writerow(l)
                    tsv_output.writerow(out)
    os.chdir('..')  

In [16]:
out

['3D Design for Virtual Environments - MSc',
 'Glasgow Caledonian University',
 'School of Engineering and Built Environment',
 'Full time',
 "3D visualisation and animation play a role in many areas, and the popularity of these media just keeps growing. Digital animation provides the eye-catching special effects in the 21st century's favourite films and television shows; 3D design is also essential to everyday work in everything from computer games development, online virtual world development and industrial design to marketing, product design and architecture.GCU's programme in 3D Design for Virtual Environments will help you develop the skills to thrive in a successful career as a visual designer. The programme is practical and career-focused, oriented towards current industry needs, technology and practice. No prior knowledge of 3D design is required.",
 'September',
 'FeesPlease see the university website for further information on fees for this course.',
 'MSc',
 '1 year full-tim

In [4]:
data=[]
# to merge all the .tsv files
for i in range(1,200):
    os.chdir(r'C:\Users\susan\Documents\DS\ADM\HW3\ADM-HW3\HTML_folders\page_'+str(i))
    for filename in os.listdir(os.getcwd()):
        if filename.endswith(".tsv"):
            a = pd.read_csv(filename,sep='\t')
            data.append(a)
    os.chdir('..')
data=pd.concat(data,ignore_index=True)   
data.to_csv('Dataset.tsv',sep='\t',index=False) # Saving the big one
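The same merge can be done without changing the working directory, which keeps the output location predictable. The sketch below is a minimal alternative using only the standard library; `merge_tsv_files` is a hypothetical name, and it assumes the folders are named `page_<number>` and that every `.tsv` file starts with the same header row.

```python
import csv
from pathlib import Path


def merge_tsv_files(root, output_path):
    """Concatenate every .tsv under root/page_*/ into one file; return the row count."""
    # Sort numerically by the page number in the folder name (page_2 before page_10)
    files = sorted(Path(root).glob("page_*/*.tsv"),
                   key=lambda p: int(p.parent.name.split("_")[1]))
    count = 0
    header = None
    with open(output_path, "w", encoding="utf-8", newline="") as out:
        writer = csv.writer(out, delimiter="\t")
        for tsv_path in files:
            with open(tsv_path, encoding="utf-8", newline="") as handle:
                rows = list(csv.reader(handle, delimiter="\t"))
            if not rows:
                continue  # skip empty files
            if header is None:
                header = rows[0]  # write the header only once
                writer.writerow(header)
            for row in rows[1:]:
                writer.writerow(row)
                count += 1
    return count
```

A call such as `merge_tsv_files('HTML_folders', 'Dataset.tsv')` would write the merged file next to the notebook rather than inside the last visited folder.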

In [10]:
data.head(30)

| | courseName | universityName | facultyName | isItFullTime | description | startDate | fees | modality | duration | city | country | administration | url |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 3D Design for Virtual Environments - MSc | Glasgow Caledonian University | School of Engineering and Built Environment | Full time | 3D visualisation and animation play a role in ... | September | FeesPlease see the university website for furt... | MSc | 1 year full-time | Glasgow | United Kingdom | On Campus | https://www.findamasters.com/masters-degrees/c... |
| 1 | Accounting and Finance - MSc | University of Leeds | Leeds University Business School | Full time | Businesses and governments rely on sound finan... | September | FeesUK: £18,000 (Total) International: £34,750... | MSc | 1 year full time | Leeds | United Kingdom | On Campus | https://www.findamasters.com/masters-degrees/c... |
| 2 | Accounting, Accountability & Financial Managem... | King’s College London | King’s Business School | Full time | Our Accounting, Accountability & Financial Man... | September | FeesPlease see the university website for furt... | MSc | 1 year FT | London | United Kingdom | On Campus | https://www.findamasters.com/masters-degrees/c... |
| 3 | Accounting, Financial Management and Digital B... | University of Reading | Henley Business School | Full time | Embark on a professional accounting career wit... | September | FeesPlease see the university website for furt... | MSc | 1 year full time | Reading | United Kingdom | On Campus | https://www.findamasters.com/masters-degrees/c... |
| 4 | Addictions MSc | King’s College London | Institute of Psychiatry, Psychology and Neuros... | Full time | Join us for an online session for prospective ... | September | FeesPlease see the university website for furt... | MSc | One year FT | London | United Kingdom | On Campus | https://www.findamasters.com/masters-degrees/c... |
| 5 | Advanced Chemical Engineering - MSc | University of Leeds | School of Chemical and Process Engineering | Full time | The Advanced Chemical Engineering MSc at Leeds... | September | FeesUK: £13,750 (Total) International: £31,000... | MSc | 1 year full time | Leeds | United Kingdom | On Campus | https://www.findamasters.com/masters-degrees/c... |
| 6 | Advanced Master in Financial Markets | Solvay Brussels School | Economics and Management | Full time | Programme overviewThe Advanced Master in Finan... | September | Fees18.000 € | MA, MSc, Other, Pre-Masters, Masters Module | 1 year | Brussels | Belgium | On Campus | https://www.findamasters.com/masters-degrees/c... |
| 7 | Advanced Master in Innovation & Strategic Mana... | Solvay Brussels School | Economics and Management | Full time | Programme overviewThe Advanced Master in Innov... | September | Fees18.000 € | MA, MSc, Other, Pre-Masters, Masters Module | 10 months | Brussels | Belgium | On Campus | https://www.findamasters.com/masters-degrees/c... |
| 8 | Advanced Physiotherapy Practice - MSc | Glasgow Caledonian University | School of Health and Life Sciences | Full time | Progress your career as a physiotherapist with... | January, September | FeesPlease see the university website for furt... | MSc | 1 Year Full Time / 2-3 Years Part Time | Glasgow | United Kingdom | On Campus | https://www.findamasters.com/masters-degrees/c... |
| 9 | Agricultural Sciences - MSc (Agriculture and F... | University of Helsinki | International Masters Degree Programmes | Full time | Goal of the programmeWould you like to be inv... | September | FeesTuition fee per year (non-EU/EEA students)... | MSc | 2 years | Helsinki | Finland | On Campus | https://www.findamasters.com/masters-degrees/c... |
