# Parallelization Lab

In this lab, you will use several concepts you have learned to obtain a list of links from a web page, then crawl and index the pages those links reference, first sequentially and then in parallel. Follow the steps below to complete the lab.

### Step 1: Use the requests library to retrieve the content from the URL below.

In [1]:
import requests

url = 'https://en.wikipedia.org/wiki/Data_science'

In [3]:
# your code here
response = requests.get(url)
content = response.content
content

b'<!DOCTYPE html>\n<html class="client-nojs vector-feature-language-in-header-enabled vector-feature-language-in-main-page-header-disabled vector-feature-language-alert-in-sidebar-enabled vector-feature-sticky-header-disabled vector-feature-page-tools-disabled vector-feature-page-tools-pinned-disabled vector-feature-main-menu-pinned-disabled vector-feature-limited-width-enabled vector-feature-limited-width-content-enabled" lang="en" dir="ltr">\n<head>\n<meta charset="UTF-8"/>\n<title>Data science - Wikipedia</title>\n<script>document.documentElement.className="client-js vector-feature-language-in-header-enabled vector-feature-language-in-main-page-header-disabled vector-feature-language-alert-in-sidebar-enabled vector-feature-sticky-header-disabled vector-feature-page-tools-disabled vector-feature-page-tools-pinned-disabled vector-feature-main-menu-pinned-disabled vector-feature-limited-width-enabled vector-feature-limited-width-content-enabled";(function(){var cookie=document.cookie.m

### Step 2: Use BeautifulSoup to extract a list of all the unique links on the page.

In [None]:
from bs4 import BeautifulSoup

In [4]:
# your code here
import requests
from bs4 import BeautifulSoup

url = 'https://en.wikipedia.org/wiki/Data_science'
response = requests.get(url)
content = response.content

soup = BeautifulSoup(content, 'html.parser')

links = []
for link in soup.find_all('a'):
    href = link.get('href')
    if href and href.startswith('http'):
        links.append(href)

unique_links = set(links)
print(unique_links)


{'https://ca.wikipedia.org/wiki/Ci%C3%A8ncia_de_les_dades', 'https://doi.org/10.3390%2Fmake1010015', 'https://fa.wikipedia.org/wiki/%D8%B9%D9%84%D9%85_%D8%AF%D8%A7%D8%AF%D9%87%E2%80%8C%D9%87%D8%A7', 'https://zh.wikipedia.org/wiki/%E6%95%B0%E6%8D%AE%E7%A7%91%E5%AD%A6', 'http://www2.isye.gatech.edu/~jeffwu/presentations/datascience.pdf', 'https://hbr.org/2012/10/data-scientist-the-sexiest-job-of-the-21st-century/', 'https://www.wikidata.org/wiki/Special:EntityPage/Q2374463', 'http://cacm.acm.org/magazines/2013/12/169933-data-science-and-prediction/fulltext', 'https://foundation.wikimedia.org/wiki/Privacy_policy', 'https://qu.wikipedia.org/wiki/Willakuy_hamut%27ay', 'https://fr.wikipedia.org/wiki/Science_des_donn%C3%A9es', 'https://web.archive.org/web/20190620184935/https://magazine.amstat.org/blog/2015/10/01/asa-statement-on-the-role-of-statistics-in-data-science/', 'https://fi.wikipedia.org/wiki/Datatiede', 'https://www.worldcat.org/issn/0017-8012', 'https://en.wikiversity.org/wiki/Data

### Step 3: Use list comprehensions with conditions to clean the link list.

There are two types of links, absolute and relative. Absolute links have the full URL and begin with *http* while relative links begin with a forward slash (/) and point to an internal page within the *wikipedia.org* domain. Clean the respective types of URLs as follows.

- Absolute Links: Create a list of these and remove any that contain a percentage sign (%).
- Relative Links: Create a list of these, add the domain to the link so that you have the full URL, and remove any that contain a percentage sign (%).
- Combine the list of absolute and relative links and ensure there are no duplicates.

In [None]:
domain = 'http://wikipedia.org'

In [8]:
# your code here
from bs4 import BeautifulSoup

html = """
<html>
<body>
<a href="http://example.com">Link</a>
<a href="/page">Page</a>
<a href="http://example.com/#section">Section</a>
</body>
</html>
"""

soup = BeautifulSoup(html, 'html.parser')

soup


<html>
<body>
<a href="http://example.com">Link</a>
<a href="/page">Page</a>
<a href="http://example.com/#section">Section</a>
</body>
</html>

In [9]:
links = soup.find_all('a')
links

[<a href="http://example.com">Link</a>,
 <a href="/page">Page</a>,
 <a href="http://example.com/#section">Section</a>]

In [11]:
urls = [link.get('href') for link in links]
urls

['http://example.com', '/page', 'http://example.com/#section']
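
The toy example above stops at extracting the `href` values. The cleaning rules listed in this step could be sketched with list comprehensions roughly as follows. The `hrefs` sample is hypothetical and stands in for the values pulled from the real page. The `domain` value is the one given in this step. The name `cleaned_links` is chosen to match the variable used in Steps 6 and 7.

```python
# Hypothetical sample of href values; in the lab these would come from
# [link.get('href') for link in soup.find_all('a')] on the real page.
hrefs = [
    'http://example.com',
    '/wiki/Machine_learning',
    '/wiki/Data_science',
    'https://fr.wikipedia.org/wiki/Science_des_donn%C3%A9es',
    '/wiki/Willakuy_hamut%27ay',
    '#cite_note-1',          # page anchor: neither absolute nor relative
    None,                    # <a> tag without an href attribute
    '/wiki/Data_science',    # duplicate
]

domain = 'http://wikipedia.org'

# Absolute links: start with "http" and contain no percent sign.
absolute_links = [h for h in hrefs if h and h.startswith('http') and '%' not in h]

# Relative links: start with "/", get the domain prepended, no percent sign.
relative_links = [domain + h for h in hrefs if h and h.startswith('/') and '%' not in h]

# Combine both lists and drop duplicates; sorting just makes the order stable.
cleaned_links = sorted(set(absolute_links + relative_links))
print(cleaned_links)
```

On a real page, protocol-relative links such as `//example.org/page` also begin with a slash, so a stricter condition may be needed depending on what the page contains.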

### Step 4: Use the os library to create a folder called *wikipedia* and make that the current working directory.

In [None]:
import os

In [1]:
# your code here
import os

folder_name = 'wikipedia'
# exist_ok=True makes this a no-op when the folder already exists.
os.makedirs(folder_name, exist_ok=True)

os.chdir(folder_name)


### Step 5: Write a function called index_page that accepts a link and does the following.

- Tries to request the content of the page referenced by that link.
- Slugifies the filename using the `slugify` function from the [python-slugify](https://pypi.org/project/python-slugify/) library and adds a .html file extension.
    - If you don't already have the python-slugify library installed, you can pip install it as follows: `$ pip3 install python-slugify`.
    - To import the slugify function, you would do the following: `from slugify import slugify`.
    - You can then slugify a link as follows `slugify(link)`.
- Creates a file in the wikipedia folder using the slugified filename and writes the contents of the page to the file.
- If an exception occurs during the process above, just `pass`.

In [None]:
from slugify import slugify

In [2]:
# your code here
import os
import requests
from slugify import slugify

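def index_page(link):
    """Download the page at `link` and save it as <slug>.html in the current directory."""
    try:
        response = requests.get(link)
        # Turn the URL into a filesystem-safe name.
        filename = slugify(link) + '.html'
        with open(filename, 'w', encoding='utf-8') as f:
            f.write(response.content.decode('utf-8'))
    except Exception:
        # The lab asks us to skip any page that fails.
        # A bare `except:` would also swallow KeyboardInterrupt.
        pass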


### Step 6: Sequentially loop through the list of links, running the index_page function each time.

Remember to include `%%time` at the beginning of the cell so that it measures the time it takes for the cell to run.
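
`%%time` is an IPython cell magic, so it only works when it is the very first line of the cell, before any comment or code. Outside a notebook, a comparable wall-clock measurement can be sketched with `time.perf_counter`. The helper `timed` and the stand-in `fake_index_all` are hypothetical names used only for illustration. The sleep takes the place of a real network request.

```python
import time

def timed(func, *args):
    """Run func(*args) and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = func(*args)
    elapsed = time.perf_counter() - start
    return result, elapsed

def fake_index_all(links):
    # Stand-in for looping over index_page: sleep briefly per link.
    for _ in links:
        time.sleep(0.01)
    return len(links)

count, seconds = timed(fake_index_all, ['a', 'b', 'c'])
print(f'{count} links processed in {seconds:.3f} s')
```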

In [3]:
%%time
# your code here
# Assumes `cleaned_links` holds the combined, de-duplicated list from Step 3.
for link in cleaned_links:
    index_page(link)


### Step 7: Perform the page indexing in parallel and note the difference in performance.

Remember to include `%%time` at the beginning of the cell so that it measures the time it takes for the cell to run.
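
Fetching pages is mostly time spent waiting on the network, so a thread pool is one alternative to a process pool for this kind of work. The sketch below uses `concurrent.futures.ThreadPoolExecutor`, which is a swapped-in technique and not something the lab requires. The function `fake_index_page` and the example URLs are hypothetical; the sleep stands in for a real request so that the sequential and parallel timings can be compared without network access.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def fake_index_page(link):
    # Stand-in for index_page: simulate a slow network request.
    time.sleep(0.05)
    return link

# Hypothetical links used only for the timing comparison.
links = [f'http://example.com/page{i}' for i in range(8)]

# Sequential run: each call waits for the previous one to finish.
start = time.perf_counter()
sequential = [fake_index_page(link) for link in links]
sequential_seconds = time.perf_counter() - start

# Parallel run: up to 8 calls wait at the same time.
start = time.perf_counter()
with ThreadPoolExecutor(max_workers=8) as executor:
    parallel = list(executor.map(fake_index_page, links))
parallel_seconds = time.perf_counter() - start

print(f'sequential: {sequential_seconds:.2f} s, parallel: {parallel_seconds:.2f} s')
```

`executor.map` returns results in the same order as the input list, so the two result lists can be compared directly.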

In [4]:
import multiprocessing

In [5]:
%%time
# your code here
from multiprocessing import Pool

# Assumes `cleaned_links` holds the combined, de-duplicated list from Step 3.
with Pool(processes=8) as pool:
    pool.map(index_page, cleaned_links)
