# Parallelization Lab

In this lab, you will use several concepts you have learned to obtain a list of links from a web page, then crawl and index the pages those links reference, first sequentially and then in parallel. Follow the steps below to complete the lab.

### Step 1: Use the requests library to retrieve the content from the URL below.

In [None]:
import requests

url = 'https://en.wikipedia.org/wiki/Data_science'

In [None]:
response = requests.get(url)

response.content[:500]

### Step 2: Use BeautifulSoup to extract a list of all the unique links on the page.

In [None]:
from bs4 import BeautifulSoup

In [None]:
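# Name the parser explicitly so the result does not depend on which parsers are installed
soup = BeautifulSoup(response.content, 'html.parser')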

links = soup.find_all('a')
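
For comparison, the same idea can be sketched with only the standard library. This is an illustration rather than part of the lab solution: it swaps BeautifulSoup for `html.parser.HTMLParser`, and the `LinkCollector` class and `sample_html` string are hypothetical names invented for the demonstration.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag, skipping repeats."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag != 'a':
            return
        href = dict(attrs).get('href')
        # Keep only tags that have an href we have not seen yet
        if href and href not in self.hrefs:
            self.hrefs.append(href)


# Hypothetical sample markup standing in for response.content
sample_html = (
    '<a href="/wiki/Statistics">Statistics</a>'
    '<a href="https://example.com/">Example</a>'
    '<a href="/wiki/Statistics">Statistics again</a>'
    '<a name="no-href">Anchor without a link</a>'
)

collector = LinkCollector()
collector.feed(sample_html)
unique_hrefs = collector.hrefs
print(unique_hrefs)
```

Collecting into a list while checking for repeats keeps the links in page order, which a plain `set` would not guarantee.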

### Step 3: Use list comprehensions with conditions to clean the link list.

There are two types of links, absolute and relative. Absolute links have the full URL and begin with *http* while relative links begin with a forward slash (/) and point to an internal page within the *wikipedia.org* domain. Clean the respective types of URLs as follows.

- Absolute Links: Create a list of these and remove any that contain a percent sign (%).
- Relative Links: Create a list of these, add the domain to each link so that you have the full URL, and remove any that contain a percent sign (%).
- Combine the lists of absolute and relative links and ensure there are no duplicates.
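
The three cleaning rules above can be sketched on a small, made-up list before applying them to the real page. The `hrefs` values and the `base` domain below are assumptions chosen for illustration, and `urljoin` from the standard library is one possible way to attach the domain, not necessarily the approach the lab expects.

```python
from urllib.parse import urljoin

# Hypothetical href values standing in for the ones scraped from the page
hrefs = [
    'https://example.com/about',
    '/wiki/Statistics',
    '/wiki/Machine_learning',
    '/wiki/Statistics',            # duplicate
    '/wiki/Na%C3%AFve_Bayes',      # contains a percent sign
    'https://example.com/a%20b',   # contains a percent sign
    '#cite_note-1',                # neither absolute nor relative
]

base = 'https://en.wikipedia.org'  # assumed domain for relative links

# Conditions inside the comprehensions do the filtering
absolute = [h for h in hrefs if h.startswith('http') and '%' not in h]
relative = [urljoin(base, h) for h in hrefs if h.startswith('/') and '%' not in h]

# dict.fromkeys removes duplicates while preserving first-seen order
cleaned = list(dict.fromkeys(absolute + relative))
print(cleaned)
```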

In [None]:
# Relative links on this page point to the English Wikipedia site
domain = 'https://en.wikipedia.org'

In [None]:
import re

# Absolute links: keep tags whose href starts with http and pull out the href value
absolute_links = [link for link in links if 'href="http' in str(link)]
absolute_links = [re.findall(r'href="([^"]+)', str(link)) for link in absolute_links]
# Split any href that contains more than one 'http' into separate entries
absolute_links = [link[0].replace('http', ' http') for link in absolute_links if link]
absolute_links = [link.split(' ') for link in absolute_links]
absolute_links = [link for sublist in absolute_links for link in sublist if link]
# Drop links that contain a percent sign
absolute_links = [link for link in absolute_links if '%' not in link]

# Relative links: keep /wiki/ paths, prepend the domain, and drop links with a percent sign
relative_links = [link for link in links if 'href="/' in str(link)]
relative_links = [re.findall(r'/wiki/[^"]+', str(link)) for link in relative_links]
relative_links = [domain + link[0] for link in relative_links if link]
relative_links = [link for link in relative_links if '%' not in link]

# Combine both lists and remove duplicates while keeping the original order
links = list(dict.fromkeys(absolute_links + relative_links))

### Step 4: Use the os library to create a folder called *wikipedia* and make that the current working directory.

In [None]:
import os

In [None]:
# exist_ok=True avoids an error if the folder is already there
os.makedirs('wikipedia', exist_ok=True)
os.chdir('wikipedia')

### Step 5: Write a function called index_page that accepts a link and does the following.

- Tries to request the content of the page referenced by that link.
- Slugifies the filename using the `slugify` function from the [python-slugify](https://pypi.org/project/python-slugify/) library and adds a .html file extension.
    - If you don't already have the python-slugify library installed, you can pip install it as follows: `$ pip3 install python-slugify`.
    - To import the slugify function, you would do the following: `from slugify import slugify`.
    - You can then slugify a link as follows `slugify(link)`.
- Creates a file in the wikipedia folder using the slugified filename and writes the contents of the page to the file.
- If an exception occurs during the process above, do not let it stop the run: just `pass`, or print a short message so that failures are visible.
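
The file-writing part of these requirements can be tried without any network access. The sketch below is an assumption-laden illustration: `simple_slug` is a hypothetical, rough stand-in for `slugify` (its output is not guaranteed to match the real library), `save_page` is a made-up helper, and it returns `None` on failure instead of using `pass` so that the outcome is easy to inspect.

```python
import os
import re
import tempfile


def simple_slug(text):
    """Rough stand-in for slugify: lowercase, other character runs become '-'."""
    return re.sub(r'[^a-z0-9]+', '-', text.lower()).strip('-')


def save_page(link, html, folder):
    """Write html to <folder>/<slug>.html and return the path, or None on failure."""
    try:
        path = os.path.join(folder, simple_slug(link) + '.html')
        # An explicit encoding avoids platform-dependent defaults
        with open(path, 'w', encoding='utf-8') as file:
            file.write(html)
        return path
    except OSError:
        return None


with tempfile.TemporaryDirectory() as folder:
    saved = save_page('https://en.wikipedia.org/wiki/Data_science', '<p>naïve</p>', folder)
    saved_name = os.path.basename(saved)
    with open(saved, encoding='utf-8') as file:
        saved_text = file.read()
    # Writing into a folder that does not exist fails and is reported as None
    missing = save_page('https://example.com/', '<p></p>', os.path.join(folder, 'no-such-dir'))

print(saved_name)
```

Separating the download from the file write in this way makes each half easy to test on its own.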

In [None]:
from slugify import slugify

In [None]:
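def index_page(link):
    try:
        # A timeout keeps a stalled request from blocking the run indefinitely
        response = requests.get(link, timeout=10)
        soup = BeautifulSoup(response.content, 'html.parser')
        file_name = slugify(link) + '.html'
        # Write as UTF-8 so non-ASCII characters do not depend on the platform's default encoding
        with open(file_name, 'w', encoding='utf-8') as file:
            file.write(str(soup))
    except Exception:
        print(link, 'failed')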

### Step 6: Sequentially loop through the list of links, running the index_page function each time.

Remember to include `%%time` at the beginning of the cell so that it measures the time it takes for the cell to run.

In [None]:
%%time

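import time

for link in links[:10]:
    index_page(link)
    # Pause between requests to avoid overloading the server; this wait is included in the measured time
    time.sleep(3)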
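
`%%time` is an IPython cell magic, so it is only available in IPython or Jupyter. If the same loop is run as a plain script, one possible substitute is `time.perf_counter`, sketched below. The `fake_index_page` function and the `sample_links` values are hypothetical placeholders so that the example runs without network access.

```python
import time


def fake_index_page(link):
    """Hypothetical stand-in for index_page that just waits briefly."""
    time.sleep(0.05)
    return link


sample_links = ['link-1', 'link-2', 'link-3']  # placeholder values

start = time.perf_counter()
done = [fake_index_page(link) for link in sample_links]
elapsed = time.perf_counter() - start

print(f'Indexed {len(done)} links in {elapsed:.2f} s')
```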

### Step 7: Perform the page indexing in parallel and note the difference in performance.

Remember to include `%%time` at the beginning of the cell so that it measures the time it takes for the cell to run.

In [None]:
import threading

# Threads are used here because multiprocessing is unreliable on Windows with VS Code + Jupyter/IPython

In [None]:
%%time

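threads = []

# Use the same 10 links as the sequential run so that the timings are comparable
for link in links[:10]:
    thread = threading.Thread(target=index_page, args=(link,))
    thread.start()
    threads.append(thread)

# Wait for every thread to finish before the cell's timer stops
for thread in threads:
    thread.join()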
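
As an alternative to managing `Thread` objects by hand, the standard library's `concurrent.futures.ThreadPoolExecutor` can run the same kind of work with a bounded number of threads. The sketch below is an illustration under stated assumptions: `fake_index_page` is a hypothetical stand-in that sleeps instead of downloading, and `max_workers=5` is an arbitrary choice.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_index_page(link):
    """Hypothetical stand-in for index_page: waits as a request would."""
    time.sleep(0.2)
    return link.upper()


sample_links = ['a', 'b', 'c', 'd', 'e']  # placeholder values

start = time.perf_counter()
# The with-block waits for all submitted work to finish before exiting
with ThreadPoolExecutor(max_workers=5) as executor:
    # map returns results in the same order as the inputs
    results = list(executor.map(fake_index_page, sample_links))
elapsed = time.perf_counter() - start

print(results)
print(f'{elapsed:.2f} s')
```

Because the five waits overlap, the elapsed time should be close to a single wait rather than the sum of all five. Capping the worker count also limits how many requests hit the server at once.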