# Parallelization Lab

In this lab, you will apply several concepts you have learned to obtain a list of links from a web page, then crawl and index the pages those links reference, first sequentially and then in parallel. Follow the steps below to complete the lab.

### Step 1: Use the requests library to retrieve the content from the URL below.

In [1]:
import requests

url = 'https://en.wikipedia.org/wiki/Data_science'

In [2]:
# your code here
response = requests.get(url)

if response.status_code == 200:
    content = response.content
    print("Retrieved content from URL")
else:
    print("Failed to retrieve content from URL:", url)

Retrieved content from URL


### Step 2: Use BeautifulSoup to extract a list of all the unique links on the page.

In [3]:
from bs4 import BeautifulSoup

In [4]:
if response.status_code == 200:
    content = response.content
    soup = BeautifulSoup(content, "html.parser")
    
    links = set()
    for link in soup.find_all("a"):
        href = link.get("href")
        if href is not None and href.startswith("http"):
            links.add(href)

    print("Found", len(links), "links on the page:")
    for link in links:
        print(link)
else:
    print("Failed to retrieve content from URL:", url)

Found 108 links on the page:
https://api.semanticscholar.org/CorpusID:6107147
https://magazine.amstat.org/blog/2015/10/01/asa-statement-on-the-role-of-statistics-in-data-science/
https://dstf.acm.org/DSTF_Final_Report.pdf
https://hbr.org/2012/10/data-scientist-the-sexiest-job-of-the-21st-century/
https://uk.wikipedia.org/wiki/%D0%9D%D0%B0%D1%83%D0%BA%D0%B0_%D0%BF%D1%80%D0%BE_%D0%B4%D0%B0%D0%BD%D1%96
https://el.wikipedia.org/wiki/%CE%95%CF%80%CE%B9%CF%83%CF%84%CE%AE%CE%BC%CE%B7_%CE%B4%CE%B5%CE%B4%CE%BF%CE%BC%CE%AD%CE%BD%CF%89%CE%BD
https://es.wikipedia.org/wiki/Ciencia_de_datos
http://www2.isye.gatech.edu/~jeffwu/presentations/datascience.pdf
https://api.semanticscholar.org/CorpusID:114558008
https://vi.wikipedia.org/wiki/Khoa_h%E1%BB%8Dc_d%E1%BB%AF_li%E1%BB%87u
https://hy.wikipedia.org/wiki/%D5%8F%D5%BE%D5%B5%D5%A1%D5%AC%D5%B6%D5%A5%D6%80%D5%AB_%D5%A3%D5%AB%D5%BF%D5%B8%D6%82%D5%A9%D5%B5%D5%B8%D6%82%D5%B6
https://kk.wikipedia.org/wiki/%D0%94%D0%B5%D1%80%D0%B5%D0%BA%D1%82%D0%B5%D1%80_%D1

### Step 3: Use list comprehensions with conditions to clean the link list.

There are two types of links, absolute and relative. Absolute links have the full URL and begin with *http* while relative links begin with a forward slash (/) and point to an internal page within the *wikipedia.org* domain. Clean the respective types of URLs as follows.

- Absolute Links: Create a list of these and remove any that contain a percentage sign (%).
- Relative Links: Create a list of these, add the domain to the link so that you have the full URL, and remove any that contain a percentage sign (%).
- Combine the list of absolute and relative links and ensure there are no duplicates.
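The cleaning rules above can be tried on a small in-memory list before running them against the real page. The sketch below is only an illustration: the `hrefs` values and the `base` URL are made-up assumptions, and it uses `urllib.parse.urljoin` from the standard library instead of string concatenation to build the full URL.

```python
from urllib.parse import urljoin

# Hypothetical href values, standing in for what soup.find_all("a") would yield
hrefs = [
    "https://example.org/paper.pdf",          # absolute link
    "/wiki/Data_set",                         # relative link
    "/wiki/Data_set",                         # duplicate of the line above
    "/wiki/%D0%9D%D0%B0%D1%83%D0%BA%D0%B0",   # contains a percentage sign
    "//upload.wikimedia.org/logo.png",        # protocol-relative link
    "#cite_note-1",                           # in-page fragment, neither type
    None,                                     # <a> tag without an href
]

# Assumed address of the page the links were taken from
base = "https://en.wikipedia.org/wiki/Data_science"

# Absolute links: keep those starting with http and without a percentage sign
absolute = [h for h in hrefs if h and h.startswith("http") and "%" not in h]

# Relative links: resolve against the base page, dropping percent-encoded ones
relative = [urljoin(base, h) for h in hrefs if h and h.startswith("/") and "%" not in h]

# A set removes duplicates; sorting only makes the printed order stable
cleaned = sorted(set(absolute + relative))
print(cleaned)
```

One reason to consider `urljoin` here is that it resolves protocol-relative links such as `//upload.wikimedia.org/...` to the right host, whereas prepending a fixed domain would produce a malformed URL for them.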

In [5]:
domain = 'http://wikipedia.org'

In [6]:
if response.status_code == 200:
    content = response.content
    soup = BeautifulSoup(content, "html.parser")

    absolute_links = [link.get("href") for link in soup.find_all("a") if link.has_attr("href") and link.get("href").startswith("http") and "%" not in link.get("href")]
    relative_links = [domain + link.get("href") for link in soup.find_all("a") if link.has_attr("href") and link.get("href").startswith("/") and "%" not in link.get("href")]

    unique_links = set(absolute_links + relative_links)

    print("Found", len(unique_links), "unique links on the page:")
    for link in unique_links:
        print(link)
else:
    print("Failed to retrieve content from URL:", url)

Found 251 unique links on the page:
https://api.semanticscholar.org/CorpusID:6107147
http://wikipedia.org/wiki/Journal_of_Computational_and_Graphical_Statistics
http://wikipedia.org/wiki/Data_set
http://wikipedia.org/wiki/Data_integrity
https://magazine.amstat.org/blog/2015/10/01/asa-statement-on-the-role-of-statistics-in-data-science/
http://wikipedia.org/wiki/Data_steward
https://dstf.acm.org/DSTF_Final_Report.pdf
https://hbr.org/2012/10/data-scientist-the-sexiest-job-of-the-21st-century/
http://wikipedia.org/w/index.php?title=Special:UserLogin&returnto=Data+science
http://wikipedia.org/wiki/Jim_Gray_(computer_scientist)
http://wikipedia.org/wiki/Special:Search
http://wikipedia.org/wiki/Ben_Fry
http://wikipedia.org/wiki/Data_augmentation
http://wikipedia.org/wiki/Data_retention
http://wikipedia.org/w/index.php?title=Data_science&action=edit&section=3
http://wikipedia.org/wiki/Buzzword
http://wikipedia.org/wiki/Special:MyContributions
http://wikipedia.org/wiki/Special:Random
http://wi

### Step 4: Use the os library to create a folder called *wikipedia* and make that the current working directory.

In [7]:
import os

In [8]:
# your code here
folder_name = "wikipedia"
if not os.path.exists(folder_name):
    os.mkdir(folder_name)

# Set the current working directory to 'wikipedia'
os.chdir(folder_name)
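
A point worth noting about the cell above: because `folder_name` is a relative path, running the cell a second time creates and enters `wikipedia/wikipedia`. The sketch below shows one possible way around this, assuming the target folder is fixed once as an absolute path; the temporary directory is only a stand-in for the notebook's starting folder.

```python
import os
import tempfile
from pathlib import Path

# A temporary directory stands in for the notebook's starting folder (assumption)
start_dir = Path(tempfile.mkdtemp())

# Build the absolute path once, before changing directories
target = start_dir / "wikipedia"

# Running these two lines repeatedly always lands in the same folder
for _ in range(2):
    target.mkdir(exist_ok=True)   # no error if the folder already exists
    os.chdir(target)

print(Path.cwd().name)
```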

### Step 5: Write a function called index_page that accepts a link and does the following.

- Tries to request the content of the page referenced by that link.
- Slugifies the filename using the `slugify` function from the [python-slugify](https://pypi.org/project/python-slugify/) library and adds a .html file extension.
    - If you don't already have the python-slugify library installed, you can pip install it as follows: `$ pip3 install python-slugify`.
    - To import the slugify function, you would do the following: `from slugify import slugify`.
    - You can then slugify a link as follows `slugify(link)`.
- Creates a file in the wikipedia folder using the slugified filename and writes the contents of the page to the file.
- If an exception occurs during the process above, just `pass`.
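
To see why the link is slugified before it is used as a filename, it helps to look at what happens to the characters in a URL. The helper `rough_slug` below is a hypothetical stand-in written with the standard `re` module, not python-slugify's implementation; the real `slugify` function also handles Unicode and other edge cases that this sketch ignores.

```python
import re

def rough_slug(text):
    """Hypothetical stand-in for slugify: lowercase letters, digits and hyphens only."""
    lowered = text.lower()
    # Collapse every run of other characters (":", "/", ".", "_") into one hyphen
    hyphenated = re.sub(r"[^a-z0-9]+", "-", lowered)
    return hyphenated.strip("-")

link = "https://en.wikipedia.org/wiki/Data_science"
filename = rough_slug(link) + ".html"
print(filename)  # https-en-wikipedia-org-wiki-data-science.html
```

Characters such as `/` and `:` are not safe in filenames on every operating system, so replacing them gives each link a name that can be written to disk.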

In [9]:
from slugify import slugify

In [10]:
# your code here
def index_page(link):
    try:
        # Request the content of the page referenced by the link
        response = requests.get(link)
        content = response.content
        
        # Slugify the filename and add a .html file extension
        filename = slugify(link) + ".html"
        
        # Create a file in the wikipedia folder using the slugified filename and write the contents of the page to the file
        with open(os.path.join(os.getcwd(), filename), "wb") as f:
            f.write(content)
            
    except Exception:
        # Skip pages that fail; unlike a bare except, Ctrl+C still interrupts the crawl
        pass

### Step 6: Sequentially loop through the list of links, running the index_page function each time.

Remember to put `%%time` on the first line of the cell, before any comments or code, so that it measures how long the cell takes to run.

In [11]:
# your code here
%%time
for link in unique_links:
    index_page(link)

UsageError: Line magic function `%%time` not found.


In [12]:
# %%time raised the error above because it was not on the first line of the cell; timing with the time module instead

In [16]:
import time

start_time = time.time()

# Index each page exactly once; index_page already handles its own exceptions
for link in unique_links:
    index_page(link)

end_time = time.time()
elapsed_time = end_time - start_time
print(f"Elapsed time: {elapsed_time:.2f} seconds")
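
When `%%time` is unavailable, the start/stop pattern above can be wrapped so it is reusable for both the sequential and the parallel run. The following is a minimal sketch under stated assumptions: `timed` and `timings` are hypothetical names, `time.perf_counter` is used because it is intended for measuring intervals, and the `time.sleep` calls merely simulate work.

```python
import time
from contextlib import contextmanager

timings = {}  # hypothetical store: label -> elapsed seconds

@contextmanager
def timed(label):
    """Measure the wall-clock time of the enclosed block and record it."""
    start = time.perf_counter()
    try:
        yield
    finally:
        elapsed = time.perf_counter() - start
        timings[label] = elapsed
        print(f"{label}: {elapsed:.2f} seconds")

# Simulated work; in the lab this would be the loop over unique_links
with timed("Sequential (simulated)"):
    for _ in range(3):
        time.sleep(0.01)
```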

### Step 7: Perform the page indexing in parallel and note the difference in performance.

Remember to put `%%time` on the first line of the cell, before any comments or code, so that it measures how long the cell takes to run.

In [17]:
import multiprocessing

In [18]:
# your code here
if __name__ == "__main__":
    # Reuse unique_links from Step 3 so both runs index the same pages.
    # The working directory is already the wikipedia folder from Step 4,
    # so it is not created or entered a second time here.

    # Index the pages in parallel
    start_time = time.time()
    with multiprocessing.Pool(processes=4) as pool:
        pool.map(index_page, unique_links)
    end_time = time.time()

    # Print the elapsed time
    elapsed_time = end_time - start_time
    print(f"Elapsed time: {elapsed_time:.2f} seconds")
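
Downloading pages is mostly time spent waiting on the network, so a thread pool is a commonly used alternative to a process pool for this kind of task. The sketch below swaps in `concurrent.futures.ThreadPoolExecutor` for `multiprocessing.Pool`; it is not the lab's method. `fake_index_page` and the `example.org` links are hypothetical stand-ins that sleep instead of downloading, so the comparison runs without network access.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def fake_index_page(link):
    """Hypothetical stand-in for index_page: waits instead of downloading."""
    time.sleep(0.05)      # simulated network latency
    return len(link)

# Made-up links used only for this illustration
links = [f"https://example.org/page/{i}" for i in range(8)]

# Sequential run: one simulated download after another
start = time.perf_counter()
sequential = [fake_index_page(link) for link in links]
sequential_time = time.perf_counter() - start

# Parallel run: up to four simulated downloads wait at the same time
start = time.perf_counter()
with ThreadPoolExecutor(max_workers=4) as executor:
    parallel = list(executor.map(fake_index_page, links))
parallel_time = time.perf_counter() - start

print(f"Sequential: {sequential_time:.2f} seconds")
print(f"Parallel:   {parallel_time:.2f} seconds")
```

Threads also avoid a practical issue with process pools in notebooks: on platforms that start worker processes with the spawn method, a function defined in a notebook cell may not be importable by the workers, whereas threads share the notebook's memory and can call it directly.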