# Parallelization Lab

In this lab, you will apply several concepts you have learned to extract a list of links from a web page, then crawl and index the pages those links reference, first sequentially and then in parallel. Follow the steps below to complete the lab.

### Step 1: Use the requests library to retrieve the content from the URL below.

In [5]:
import requests

url = 'https://en.wikipedia.org/wiki/Data_science'

In [6]:
# your code here

response = requests.get(url)
print(response)

<Response [200]>


### Step 2: Use BeautifulSoup to extract a list of all the unique links on the page.

In [7]:
from bs4 import BeautifulSoup

In [8]:
# your code here

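# Name the parser explicitly so the result does not depend on which parsers are installed
soup = BeautifulSoup(response.content, 'html.parser')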
print(soup.prettify())

<!DOCTYPE html>
<html class="client-nojs" dir="ltr" lang="en">
 <head>
  <meta charset="utf-8"/>
  <title>
   Data science - Wikipedia
  </title>
  <script>
   document.documentElement.className="client-js";RLCONF={"wgBreakFrames":false,"wgSeparatorTransformTable":["",""],"wgDigitTransformTable":["",""],"wgDefaultDateFormat":"dmy","wgMonthNames":["","January","February","March","April","May","June","July","August","September","October","November","December"],"wgRequestId":"d3a77995-a605-4fde-9b08-ab51d8dbb8f4","wgCSPNonce":false,"wgCanonicalNamespace":"","wgCanonicalSpecialPageName":false,"wgNamespaceNumber":0,"wgPageName":"Data_science","wgTitle":"Data science","wgCurRevisionId":1120574273,"wgRevisionId":1120574273,"wgArticleId":35458904,"wgIsArticle":true,"wgIsRedirect":false,"wgAction":"view","wgUserName":null,"wgUserGroups":["*"],"wgCategories":["Articles with short description","Short description matches Wikidata","Use dmy dates from August 2021","Information science","Computer oc

In [9]:
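# Collect every href that looks like a relative ('/...') or absolute ('http...') link
anchors = soup.find_all('a', href=True)
links = []
for item in anchors:
    href = item['href']
    # startswith() is safe on empty strings; skip repeats so the list stays unique
    if href.startswith(('/', 'h')) and href not in links:
        links.append(href)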

### Step 3: Use list comprehensions with conditions to clean the link list.

There are two types of links: absolute and relative. Absolute links contain the full URL and begin with *http*, while relative links begin with a forward slash (/) and point to an internal page within the *wikipedia.org* domain. Clean each type of URL as follows.

- Absolute Links: Create a list of these and remove any that contain a percentage sign (%).
- Relative Links: Create a list of these, add the domain to the link so that you have the full URL, and remove any that contain a percentage sign (%).
- Combine the list of absolute and relative links and ensure there are no duplicates.
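
The cleaning rules above can be tried on a small hand-made list first, so the behaviour can be checked without fetching anything. This is only a sketch: the `sample_links` values and the helper name `clean_links` are made up for illustration, and `urljoin` from the standard library is used as one possible way to attach the domain (plain string concatenation works too for links that begin with a single slash).

```python
from urllib.parse import urljoin

domain = 'http://wikipedia.org'

# Hypothetical hrefs, similar in shape to what Step 2 produces
sample_links = [
    '/wiki/Statistics',
    '/wiki/Statistics',                # duplicate
    'https://example.org/article',
    '//upload.wikimedia.org/x.png',    # protocol-relative link
    '/wiki/Machine_learning',
    '/wiki/Caf%C3%A9',                 # contains a percent sign
    'https://example.org/a%20b',       # contains a percent sign
    '#cite_note-1',                    # neither absolute nor relative
]

def clean_links(links, domain):
    # Absolute links: start with 'http' and contain no percent sign
    absolute = [l for l in links if l.startswith('http') and '%' not in l]
    # Relative links: start with '/'; urljoin adds the scheme and host as needed
    relative = [urljoin(domain, l) for l in links
                if l.startswith('/') and '%' not in l]
    # dict.fromkeys drops duplicates while keeping first-seen order
    return list(dict.fromkeys(absolute + relative))

cleaned = clean_links(sample_links, domain)
print(cleaned)
```

Using `dict.fromkeys` instead of `set` keeps the links in the order they were found, which makes the result easier to compare between runs.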

In [10]:
domain = 'http://wikipedia.org'

In [11]:
# your code here
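# Absolute links: keep http(s) and protocol-relative ('//...') URLs, dropping any with '%'
absolute_links = ['http:' + link if link.startswith('//') else link
                  for link in links
                  if link.startswith(('h', '//')) and '%' not in link]

# Relative links: prepend the domain, skipping protocol-relative URLs and any with '%'
relative_links = [domain + link for link in links
                  if link.startswith('/') and not link.startswith('//') and '%' not in link]

# Combine both lists and drop duplicates while preserving order
list_combined = list(dict.fromkeys(absolute_links + relative_links))
list_combined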

['http://cacm.acm.org/magazines/2013/12/169933-data-science-and-prediction/fulltext',
 'https://api.semanticscholar.org/CorpusID:6107147',
 'https://web.archive.org/web/20141109113411/http://cacm.acm.org/magazines/2013/12/169933-data-science-and-prediction/fulltext',
 'http://simplystatistics.org/2013/12/12/the-key-word-in-data-science-is-not-data-it-is-science/',
 'https://web.archive.org/web/20140102194117/http://simplystatistics.org/2013/12/12/the-key-word-in-data-science-is-not-data-it-is-science/',
 'http://www.worldcat.org/issn/0360-0300',
 'https://api.semanticscholar.org/CorpusID:207595944',
 'https://www.springer.com/book/9784431702085',
 'https://books.google.com/books?id=oGs_AQAAIAAJ',
 'https://web.archive.org/web/20170320193019/https://books.google.com/books?id=oGs_AQAAIAAJ',
 'http://www.worldcat.org/issn/0036-8075',
 'http://pubmed.ncbi.nlm.nih.gov/19265007',
 'https://api.semanticscholar.org/CorpusID:9743327',
 'https://hbr.org/2012/10/data-scientist-the-sexiest-job-of-

### Step 4: Use the os library to create a folder called *wikipedia* and make that the current working directory.

In [12]:
import os

In [13]:
# your code here

# exist_ok=True makes this a no-op if the folder already exists
os.makedirs('wikipedia', exist_ok=True)

os.chdir('wikipedia')

### Step 5: Write a function called index_page that accepts a link and does the following.

- Tries to request the content of the page referenced by that link.
- Slugifies the filename using the `slugify` function from the [python-slugify](https://pypi.org/project/python-slugify/) library and adds a .html file extension.
    - If you don't already have the python-slugify library installed, you can pip install it as follows: `$ pip3 install python-slugify`.
    - To import the slugify function, you would do the following: `from slugify import slugify`.
    - You can then slugify a link as follows `slugify(link)`.
- Creates a file in the wikipedia folder using the slugified filename and writes the contents of the page to the file.
- If an exception occurs during the process above, just `pass`.

In [14]:
from slugify import slugify

In [21]:
# your code here

#!pip install python-slugify

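def index_page(link):
    try:
        # Request the page for this specific link (not the global `url`)
        resp = requests.get(link)

        file_name = slugify(link) + '.html'

        # Write the page contents to a file named after the slugified link
        with open(file_name, 'w', encoding='utf-8') as f:
            f.write(resp.text)
    except Exception:
        # The lab asks us to ignore any failure and move on
        pass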
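
Silently passing on every exception makes it hard to tell how many pages were actually saved. One possible variation, sketched here under stated assumptions, returns a success flag for each link. To keep the sketch runnable offline, the fetcher is passed in as a parameter; `index_page_reporting`, `fake_fetch`, and `simple_slug` are hypothetical names, and `simple_slug` is only a rough stand-in for `slugify`, not a replacement for it.

```python
import os
import re
import tempfile

def simple_slug(text):
    # Hypothetical stand-in for slugify(): lowercase, non-alphanumerics become '-'
    return re.sub(r'[^a-z0-9]+', '-', text.lower()).strip('-')

def index_page_reporting(link, fetch, folder):
    """Save fetch(link) under folder; return (link, True) on success, else (link, False)."""
    try:
        html = fetch(link)
        path = os.path.join(folder, simple_slug(link) + '.html')
        with open(path, 'w', encoding='utf-8') as f:
            f.write(html)
        return link, True
    except Exception:
        return link, False

def fake_fetch(link):
    # Hypothetical offline fetcher: fails for any link containing 'broken'
    if 'broken' in link:
        raise ConnectionError('simulated failure')
    return '<html>' + link + '</html>'

with tempfile.TemporaryDirectory() as folder:
    outcomes = [index_page_reporting(link, fake_fetch, folder)
                for link in ['http://example.org/a', 'http://example.org/broken']]
    saved_files = sorted(os.listdir(folder))

print(outcomes)
print(saved_files)
```

In the lab itself, the fetcher would be something like `lambda link: requests.get(link).text`, and the returned flags could be counted after the loop.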
    

### Step 6: Sequentially loop through the list of links, running the index_page function each time.

Remember to include `%%time` at the beginning of the cell so that it measures the time it takes for the cell to run.

In [22]:
%%time
# your code here

for web in list_combined:
    index_page(web)

CPU times: user 4.43 s, sys: 251 ms, total: 4.68 s
Wall time: 39.3 s
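
The gap between CPU time (about 4.7 s) and wall time (39.3 s) in the output above suggests that most of the run is spent waiting rather than computing. Outside a notebook, where `%%time` is not available, a similar measurement can be sketched with the standard `time` module. The helper names `timed` and `wait_like_a_request` are hypothetical, and `time.sleep` merely stands in for waiting on a network response.

```python
import time

def timed(func, *args):
    """Run func(*args) and return (result, wall_seconds, cpu_seconds)."""
    wall_start = time.perf_counter()   # wall-clock timer
    cpu_start = time.process_time()    # CPU time used by this process
    result = func(*args)
    cpu = time.process_time() - cpu_start
    wall = time.perf_counter() - wall_start
    return result, wall, cpu

def wait_like_a_request(seconds):
    # Sleeping uses almost no CPU, much like waiting on a server
    time.sleep(seconds)
    return 'done'

result, wall, cpu = timed(wait_like_a_request, 0.3)
print(f'wall: {wall:.2f} s, cpu: {cpu:.2f} s')
```

A large wall time paired with a small CPU time is the pattern that usually benefits from running tasks concurrently.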


### Step 7: Perform the page indexing in parallel and note the difference in performance.

Remember to include `%%time` at the beginning of the cell so that it measures the time it takes for the cell to run.

In [23]:
import multiprocessing

In [25]:
%%time
# your code here


# The context manager shuts the pool down once map() has returned all results
with multiprocessing.Pool() as pool:
    result = pool.map(index_page, list_combined)

CPU times: user 21.5 ms, sys: 42.3 ms, total: 63.8 ms
Wall time: 5.25 s
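
Because this workload is mostly waiting on HTTP responses, a thread pool is a commonly used alternative to a process pool. The sketch below is not the approach the lab asks for; it swaps in `concurrent.futures.ThreadPoolExecutor` from the standard library to show the same map-style pattern. `fake_index_page` and `sample_links` are hypothetical, with a short sleep standing in for the real download.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def fake_index_page(link):
    # Hypothetical stand-in for index_page: pretend the request takes 0.1 s
    time.sleep(0.1)
    return len(link)

sample_links = [f'http://example.org/page{i}' for i in range(8)]

start = time.perf_counter()
with ThreadPoolExecutor(max_workers=8) as executor:
    # executor.map returns results in the same order as the inputs
    results = list(executor.map(fake_index_page, sample_links))
elapsed = time.perf_counter() - start

print(results)
print(f'{elapsed:.2f} s')
```

Run one after another, the eight simulated requests would take roughly 0.8 s; with eight threads the waits overlap, so the total should be much closer to a single request.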
