## 1. Data collection

For this homework, there is no provided dataset. Instead, you have to build your own. Your search engine will run on text documents. So, here
we detail the procedure to follow for the data collection. We strongly suggest you work on different modules when implementing the required functions. For example, you may have a ```crawler.py``` module, a ```parser.py``` module, and a ```engine.py``` module: this is a good practice that improves readability in reporting and efficiency in deploying the code. Be careful; you are likely dealing with exceptions and other possible issues! 

### 1.1. Get the list of master's degree courses

We start with the list of courses to include in your corpus of documents. In particular, we focus on web scrapping the [MSc Degrees](https://www.findamasters.com/masters-degrees/msc-degrees/). Next, we want you to **collect the URL** associated with each site in the list from the previously collected list.
The list is long and split into many pages. Therefore, we ask you to retrieve only the URLs of the places listed in **the first 400 pages** (each page has 15 courses, so you will end up with 6000 unique master's degree URLs).

The output of this step is a `.txt` file whose single line corresponds to the master's URL.


In [6]:
from bs4 import BeautifulSoup
import requests
import re
import os
from urllib.request import urlopen
import time

In [3]:
f = open("urls.txt","w") # First I create a txt file where I can write the URLs  #  w means writing mode
for i in range(1, 401): #from first page to page 400
    url = f"https://www.findamasters.com/masters-degrees/msc-degrees/?PG={i}" #pages can be scrolled by changing the number after PG
    result = requests.get(url) # as we have done in class
    soup = BeautifulSoup(result.text, 'html.parser') # to get the html of each page

    for link in soup.find_all(class_ = re.compile('courseLink')): #as in class to get each tag of the page which belongs to class courseLink
        c = (link.get("href"))  # url of each page in the i-th page
        f.write("https://www.findamasters.com/"+c) #writing the rows
        f.write("\n")
f.close()

### 1.2. Crawl master's degree pages

Once you get all the URLs in the first 400 pages of the list, you:

1. Download the HTML corresponding to each of the collected URLs.
2. After you collect a single page, immediately save its `HTML` in a file. In this way, if your program stops for any reason, you will not lose the data collected up to the stopping point.
3. Organize the downloaded `HTML` pages into folders. Each folder will contain the `HTML` of the courses on page 1, page 2, ... of the list of master's programs.
   
__Tip__: Due to the large number of pages you should download, you can use some methods that can help you shorten the time. If you employed a particular process or approach, kindly describe it.
 

In [8]:
# Suggested by chatgpt.. to not be blocked but I'm blocked :(
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3'}

In [9]:
for i in range(1, 2): #scraping only in the first page to see if it works
    url = f"https://www.findamasters.com/masters-degrees/msc-degrees/?PG={i}" 
    reqs = requests.get(url, headers=headers)
    soup = BeautifulSoup(reqs.text, 'html.parser')
    os.makedirs("PAGE" + str(i)) # Creating a folder, its name is PAGE and the number of the page
    os.chdir("PAGE" + str(i)) # Going into the folder
    
    for link in soup.find_all(class_=re.compile('courseLink')):
        a = link.get('href')
        url_ = "https://www.findamasters.com/" + a
        filename = re.sub(r'[^a-zA-Z0-9_.]', '_', str(a)[8:]) + '.html' #to have a right file name 
        file = open(filename, 'w', encoding="UTF-8")
        try:
            with urlopen(url_) as webpage:
                content = webpage.read().decode()
                file.write(content)
        except Exception as e:
            print(f"Error fetching {url_}: {e}")
        
        file.close()
        time.sleep(2)  # Introduce a delay to avoid rate limiting
    os.chdir('..')

Error fetching https://www.findamasters.com//masters-degrees/course/3d-design-for-virtual-environments-msc/?i93d2645c19223: HTTP Error 403: Forbidden
Error fetching https://www.findamasters.com//masters-degrees/course/accounting-and-finance-msc/?i321d3232c3891: HTTP Error 403: Forbidden
Error fetching https://www.findamasters.com//masters-degrees/course/accounting-accountability-and-financial-management-msc/?i132d7816c25522: HTTP Error 403: Forbidden
Error fetching https://www.findamasters.com//masters-degrees/course/accounting-financial-management-and-digital-business-msc/?i345d4286c351: HTTP Error 403: Forbidden
Error fetching https://www.findamasters.com//masters-degrees/course/addictions-msc/?i132d4318c27100: HTTP Error 403: Forbidden
Error fetching https://www.findamasters.com//masters-degrees/course/advanced-chemical-engineering-msc/?i321d8433c50447: HTTP Error 403: Forbidden
Error fetching https://www.findamasters.com//masters-degrees/course/advanced-master-in-financial-markets/