# Parallelization Lab

In this lab, you will apply several concepts you have learned to extract a list of links from a web page, then crawl and index the pages those links reference, both sequentially and in parallel. Follow the steps below to complete the lab.

### Step 1: Use the requests library to retrieve the content from the URL below.

In [2]:
import requests

url = 'https://en.wikipedia.org/wiki/Data_science'

In [3]:
response = requests.get(url)
response.status_code

200

### Step 2: Use BeautifulSoup to extract a list of all the unique links on the page.

In [4]:
from bs4 import BeautifulSoup

In [9]:
soup = BeautifulSoup(response.content, "html.parser") # Name the parser explicitly to avoid a bs4 warning

links = [link["href"] for link in soup.find_all("a",href=True)]
links


['#bodyContent',
 '/wiki/Main_Page',
 '/wiki/Wikipedia:Contents',
 '/wiki/Portal:Current_events',
 '/wiki/Special:Random',
 '/wiki/Wikipedia:About',
 '//en.wikipedia.org/wiki/Wikipedia:Contact_us',
 'https://donate.wikimedia.org/wiki/Special:FundraiserRedirector?utm_source=donate&utm_medium=sidebar&utm_campaign=C13_en.wikipedia.org&uselang=en',
 '/wiki/Help:Contents',
 '/wiki/Help:Introduction',
 '/wiki/Wikipedia:Community_portal',
 '/wiki/Special:RecentChanges',
 '/wiki/Wikipedia:File_upload_wizard',
 '/wiki/Main_Page',
 '/wiki/Special:Search',
 '/w/index.php?title=Special:CreateAccount&returnto=Data+science',
 '/w/index.php?title=Special:UserLogin&returnto=Data+science',
 '/w/index.php?title=Special:CreateAccount&returnto=Data+science',
 '/w/index.php?title=Special:UserLogin&returnto=Data+science',
 '/wiki/Help:Introduction',
 '/wiki/Special:MyContributions',
 '/wiki/Special:MyTalk',
 '#',
 '#Foundations',
 '#Relationship_to_statistics',
 '#Etymology',
 '#Early_usage',
 '#Modern_usag
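
Step 2 asks for unique links, while the comprehension above keeps repeats (for example, `/wiki/Main_Page` appears twice in the output). A minimal sketch of one way to drop duplicates while preserving page order, using a few hrefs copied from the output above as sample data (`sample_links` and `unique_links` are illustrative names, not part of the lab):

```python
# Sample hrefs copied from the output above; the real list is much longer.
sample_links = [
    "#bodyContent",
    "/wiki/Main_Page",
    "/wiki/Wikipedia:Contents",
    "/wiki/Main_Page",
    "/wiki/Help:Introduction",
    "/wiki/Help:Introduction",
]

# dict keys are unique and keep insertion order (Python 3.7+),
# so this removes repeats without reordering the links.
unique_links = list(dict.fromkeys(sample_links))
print(unique_links)
```

Unlike `set`, this keeps the links in the order they appear on the page, which can make the list easier to inspect.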

### Step 3: Use list comprehensions with conditions to clean the link list.

There are two types of links, absolute and relative. Absolute links have the full URL and begin with *http* while relative links begin with a forward slash (/) and point to an internal page within the *wikipedia.org* domain. Clean the respective types of URLs as follows.

- Absolute Links: Create a list of these and remove any that contain a percentage sign (%).
- Relative Links: Create a list of these, add the domain to the link so that you have the full URL, and remove any that contain a percentage sign (%).
- Combine the list of absolute and relative links and ensure there are no duplicates.

In [18]:
domain = 'https://en.wikipedia.org' # The relative links belong to the English Wikipedia site the page was fetched from

In [25]:
abs_links = [link for link in links if (link.startswith("http") and ("%" not in link))]
rel_links = [domain+link for link in links if (link.startswith("/") and (not link.startswith("//")) and ("%" not in link))]

all_links = list(set(abs_links).union(set(rel_links))) # Combining the lists without duplicates
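
As an alternative sketch, not the approach the lab asks for, the standard library's `urllib.parse.urljoin` can resolve relative links against the page URL. It also handles protocol-relative links such as `//en.wikipedia.org/...`, which the comprehension above skips. The helper name `clean_links` and the sample list are assumptions made for illustration:

```python
from urllib.parse import urljoin

# Assumption: the URL the links were scraped from in Step 1.
base_url = "https://en.wikipedia.org/wiki/Data_science"

def clean_links(links, base):
    """Resolve links against `base`, skipping fragment-only and '%' links."""
    cleaned = []
    for link in links:
        if link.startswith("#") or "%" in link:
            continue
        cleaned.append(urljoin(base, link))
    # Drop duplicates while keeping the original order.
    return list(dict.fromkeys(cleaned))

# A few hrefs taken from the Step 2 output.
sample = [
    "#bodyContent",
    "/wiki/Main_Page",
    "//en.wikipedia.org/wiki/Wikipedia:Contact_us",
    "/wiki/Main_Page",
    "/wiki/Special:Random",
]
resolved = clean_links(sample, base_url)
print(resolved)
```

Note that `urljoin` passes through links with other schemes (for example `mailto:`) unchanged, so a real crawl would probably still want a `startswith("http")` check on the result.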


### Step 4: Use the os library to create a folder called *wikipedia* and make that the current working directory.

In [26]:
import os

In [29]:
path = "wikipedia"
os.makedirs(path, exist_ok=True) # exist_ok avoids an error if the folder already exists
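
The cell above creates the folder but does not change into it; the later code writes files through the relative `path` instead. If you do want to follow the step literally and make the folder the working directory, a minimal sketch could look like the following. The helper `make_and_enter` is a hypothetical name, and a temporary directory is used as the parent so the example does not touch your project folder:

```python
import os
import tempfile

def make_and_enter(folder, parent="."):
    """Create `folder` under `parent` (if needed) and make it the cwd."""
    target = os.path.join(parent, folder)
    os.makedirs(target, exist_ok=True)
    os.chdir(target)
    return os.getcwd()

original_cwd = os.getcwd()
scratch = tempfile.mkdtemp()  # throwaway parent directory for the demo

new_cwd = make_and_enter("wikipedia", parent=scratch)
print(os.path.basename(new_cwd))

# Go back so the rest of the session is unaffected.
os.chdir(original_cwd)
```

If the working directory is changed this way, files should be opened by bare filename rather than with the `wikipedia/` prefix, otherwise they would be looked up in a nested `wikipedia/wikipedia` folder.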

### Step 5: Write a function called index_page that accepts a link and does the following.

- Tries to request the content of the page referenced by that link.
- Slugifies the filename using the `slugify` function from the [python-slugify](https://pypi.org/project/python-slugify/) library and adds a .html file extension.
    - If you don't already have the python-slugify library installed, you can pip install it as follows: `$ pip3 install python-slugify`.
    - To import the slugify function, you would do the following: `from slugify import slugify`.
    - You can then slugify a link as follows: `slugify(link)`.
- Creates a file in the wikipedia folder using the slugified filename and writes the contents of the page to the file.
- If an exception occurs during the process above, just `pass`.

In [30]:
from slugify import slugify

In [44]:
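def index_page(url):
    try:
        response = requests.get(url)
        # Slugify the URL into a safe filename inside the wikipedia folder
        filename = os.path.join(path, slugify(url) + '.html')
        # The with block closes the file even if the write fails
        with open(filename, 'w', encoding='utf-8') as f:
            f.write(response.text)
    except Exception:
        # Report the failing URL and move on to the next link
        print("Error: ",url)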
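
The steps listed above can also be sketched in a form that runs offline, which is handy for checking the file-writing logic before crawling real pages. Everything here is an assumption made for illustration: `simple_slug` is a rough stand-in for `slugify` (not the python-slugify implementation), `fake_fetch` replaces the network request, and this variant of `index_page` takes extra arguments and returns a success flag:

```python
import os
import re
import tempfile

def simple_slug(text):
    """Hypothetical stand-in for slugify: lowercase, other characters -> '-'."""
    return re.sub(r"[^a-z0-9]+", "-", text.lower()).strip("-")

def index_page(url, fetch, folder):
    """Save the page at `url` into `folder`; return True on success."""
    try:
        html = fetch(url)
        filename = os.path.join(folder, simple_slug(url) + ".html")
        with open(filename, "w", encoding="utf-8") as f:
            f.write(html)
        return True
    except Exception as exc:
        print("Error:", url, exc)
        return False

def fake_fetch(url):
    """Pretend download: fails for PDFs, returns a tiny page otherwise."""
    if url.endswith(".pdf"):
        raise ValueError("not an HTML page")
    return "<html><body>demo</body></html>"

folder = tempfile.mkdtemp()  # scratch folder instead of ./wikipedia
results = [
    index_page(u, fake_fetch, folder)
    for u in [
        "https://en.wikipedia.org/wiki/Data_science",
        "https://example.org/paper.pdf",
    ]
]
saved = sorted(os.listdir(folder))
print(results, saved)
```

Returning a flag instead of only printing makes it possible to count failures afterwards, for example with `results.count(False)`.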

### Step 6: Sequentially loop through the list of links, running the index_page function each time.

Remember to include `%%time` at the beginning of the cell so that it measures the time it takes for the cell to run.

In [45]:
%%time

for link in all_links:
    index_page(link)

Error:  https://www2.isye.gatech.edu/~jeffwu/publications/fazhan.pdf
Error:  http://www2.isye.gatech.edu/~jeffwu/presentations/datascience.pdf
CPU times: user 43.4 s, sys: 3.32 s, total: 46.7 s
Wall time: 4min 28s
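
In the output above, the CPU time totals 46.7 s while the wall time is 4 min 28 s, which suggests that most of the run was spent waiting on network responses rather than computing. A small sketch of the same effect outside a notebook, where `fake_download` is a hypothetical stand-in that sleeps instead of making a request:

```python
import time

def fake_download():
    """Imitate waiting on a slow server; uses almost no CPU."""
    time.sleep(0.05)

wall_start = time.perf_counter()   # elapsed real time
cpu_start = time.process_time()    # CPU time used by this process

for _ in range(5):
    fake_download()

wall = time.perf_counter() - wall_start
cpu = time.process_time() - cpu_start
print(f"wall: {wall:.2f}s, cpu: {cpu:.2f}s")
```

A large gap between the two numbers is a hint that the task is I/O-bound, which is the kind of workload that tends to benefit most from running requests concurrently.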


### Step 7: Perform the page indexing in parallel and note the difference in performance.

Remember to include `%%time` at the beginning of the cell so that it measures the time it takes for the cell to run.

In [65]:
import multiprocessing

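# This only worked when index_page was imported from an external file (helper_function.py),
# likely because worker processes cannot always load functions defined in a notebook cell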

from helper_function import index_page

In [73]:
%%time

pool = multiprocessing.Pool(10) # Higher values either froze (100, 250) or gave a similar time (50)
pool.map(index_page, all_links) # Blocks until every link has been processed
pool.close() # No more tasks will be submitted
pool.join() # Wait for the worker processes to exit


Error:  https://www2.isye.gatech.edu/~jeffwu/publications/fazhan.pdf
Error:  http://www2.isye.gatech.edu/~jeffwu/presentations/datascience.pdf
CPU times: user 39.4 ms, sys: 173 ms, total: 212 ms
Wall time: 29.2 s
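
The drop from 4 min 28 s to 29.2 s comes from overlapping the time spent waiting on the network. Because the work is mostly waiting, a thread pool is one possible alternative to `multiprocessing`; threads share the notebook's memory, so the function would not need to live in an external file. The sketch below is not the approach used in this lab, and `fake_index_page` is a hypothetical stand-in that sleeps instead of downloading:

```python
import time
from concurrent.futures import ThreadPoolExecutor

def fake_index_page(url):
    """Stand-in for index_page: sleeps to imitate waiting on the network."""
    time.sleep(0.05)
    return url

# Made-up URLs used only to drive the demo.
urls = [f"https://example.org/page{i}" for i in range(10)]

# Sequential baseline.
start = time.perf_counter()
sequential = [fake_index_page(u) for u in urls]
sequential_time = time.perf_counter() - start

# Same work spread over 10 threads; map returns results in input order.
start = time.perf_counter()
with ThreadPoolExecutor(max_workers=10) as executor:
    threaded = list(executor.map(fake_index_page, urls))
threaded_time = time.perf_counter() - start

print(f"sequential: {sequential_time:.2f}s, threaded: {threaded_time:.2f}s")
```

With real requests the gain depends on the server and the connection, so the worker count is worth tuning experimentally, as was done with the pool size above.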
