# Parallelization Lab

In this lab, you will use several concepts you have learned to obtain a list of links from a web page, then crawl and index the pages those links reference, first sequentially and then in parallel. Follow the steps below to complete the lab.

### Step 1: Use the requests library to retrieve the content from the URL below.

In [1]:
import requests
url = 'https://en.wikipedia.org/wiki/Data_science'

In [2]:
# your code here
response = requests.get(url)
print(response)
print(response.text)

<Response [200]>
<!DOCTYPE html>
<html class="client-nojs" lang="en" dir="ltr">
<head>
<meta charset="UTF-8"/>
<title>Data science - Wikipedia</title>
<script>document.documentElement.className="client-js";RLCONF={"wgBreakFrames":!1,"wgSeparatorTransformTable":["",""],"wgDigitTransformTable":["",""],"wgDefaultDateFormat":"dmy","wgMonthNames":["","January","February","March","April","May","June","July","August","September","October","November","December"],"wgRequestId":"d6d38968-74aa-4a6d-a702-64828bf180ff","wgCSPNonce":!1,"wgCanonicalNamespace":"","wgCanonicalSpecialPageName":!1,"wgNamespaceNumber":0,"wgPageName":"Data_science","wgTitle":"Data science","wgCurRevisionId":986108274,"wgRevisionId":986108274,"wgArticleId":35458904,"wgIsArticle":!0,"wgIsRedirect":!1,"wgAction":"view","wgUserName":null,"wgUserGroups":["*"],"wgCategories":["CS1 maint: others","Use dmy dates from December 2012","Lists having no precise inclusion criteria from June 2020","All lists having no precise inclusion c

### Step 2: Use BeautifulSoup to extract a list of all the unique links on the page.

In [3]:
from bs4 import BeautifulSoup

In [4]:
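# your code here
soup = BeautifulSoup(response.content, features='lxml')
# href=True skips <a> tags that have no href attribute (they would yield None)
links_raw = soup.find_all('a', href=True)
links_pur = [link.get('href') for link in links_raw]
# set() removes duplicate links
links_pur = list(set(links_pur))
links_pur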

['/wiki/Anomaly_detection',
 '/wiki/Data_reduction',
 'https://en.wikipedia.org/w/index.php?title=Data_science&oldid=986108274',
 'https://doi.org/10.3390%2Fbdcc2020014',
 'https://doi.org/10.1126%2Fscience.1170411',
 '/wiki/TensorFlow',
 '/wiki/CURE_data_clustering_algorithm',
 '#cite_ref-23',
 '#cite_ref-:5_28-0',
 '#cite_ref-26',
 'https://www.mediawiki.org/wiki/Special:MyLanguage/How_to_contribute',
 '/wiki/Special:Random',
 'https://books.google.com/books?id=oGs_AQAAIAAJ',
 '/wiki/National_Science_Board',
 '/wiki/Data_loss',
 '#cite_ref-2',
 '/wiki/Cluster_analysis',
 '/wiki/Category:Computational_fields_of_study',
 '/wiki/T-distributed_stochastic_neighbor_embedding',
 '/wiki/Machine_learning',
 '/wiki/Decision_tree_learning',
 '#cite_note-16',
 '/wiki/Special:BookSources/978-0-9825442-0-4',
 '/wiki/Occam_learning',
 '#cite_note-18',
 'https://az.wikipedia.org/wiki/Veril%C9%99nl%C9%99r_elmi_(Data_Science)',
 '#Early_usage',
 '/wiki/OPTICS_algorithm',
 '#cite_ref-:7_17-0',
 'https:

### Step 3: Use list comprehensions with conditions to clean the link list.

There are two types of links, absolute and relative. Absolute links have the full URL and begin with *http* while relative links begin with a forward slash (/) and point to an internal page within the *wikipedia.org* domain. Clean the respective types of URLs as follows.

- Absolute Links: Create a list of these and remove any that contain a percentage sign (%).
- Relative Links: Create a list of these, add the domain to the link so that you have the full URL, and remove any that contain a percentage sign (%).
- Combine the list of absolute and relative links and ensure there are no duplicates.
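
The three rules above can be tried out on a small hand-made list before running them on the real page. The sketch below is only an illustration: `sample_links` and the helper name `clean_links` are made up for this example and are not part of the lab.

```python
# Hypothetical sample standing in for the hrefs scraped in Step 2.
sample_links = [
    '/wiki/Data_mining',
    '/wiki/Data_mining',                          # duplicate
    'https://doi.org/10.1126%2Fscience.1170411',  # contains '%'
    'https://books.google.com/books?id=oGs_AQAAIAAJ',
    '#cite_ref-23',                               # page fragment, neither type
    None,                                         # <a> tag without an href
]
domain = 'https://wikipedia.org'

def clean_links(raw_links, domain):
    """Return the unique full URLs that contain no percent sign."""
    hrefs = [str(h) for h in raw_links if h is not None]
    absolute = [h for h in hrefs if h.startswith('http') and '%' not in h]
    relative = [domain + h for h in hrefs if h.startswith('/') and '%' not in h]
    # set() removes duplicates; sorted() makes the order deterministic.
    return sorted(set(absolute + relative))

cleaned = clean_links(sample_links, domain)
print(cleaned)
# ['https://books.google.com/books?id=oGs_AQAAIAAJ', 'https://wikipedia.org/wiki/Data_mining']
```

Note that a test based on `startswith('/')` also matches protocol-relative links that begin with `//`, so those may need separate handling if they appear on the page.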

In [5]:
domain = 'https://wikipedia.org'

In [6]:
# your code here
links_abs = [str(link) for link in links_pur if '%' not in str(link) and str(link).startswith('http')]
links_rel = [domain+str(link) for link in links_pur if '%' not in str(link) and str(link).startswith('/')] 
links_tot = links_abs + links_rel
# Removing potential duplicates
links = list(set(links_tot))
links

['https://en.wikipedia.org/w/index.php?title=Data_science&oldid=986108274',
 'https://wikipedia.org/wiki/Decision_tree_learning',
 'https://wikipedia.org/wiki/Unsupervised_learning',
 'https://wikipedia.org/wiki/Artificial_neural_network',
 'https://wikipedia.org/wiki/Probably_approximately_correct_learning',
 'https://wikipedia.org/wiki/Machine_Learning_(journal)',
 'https://wikipedia.org/wiki/Restricted_Boltzmann_machine',
 'https://www.mediawiki.org/wiki/Special:MyLanguage/How_to_contribute',
 'https://wikipedia.org/wiki/Category:Computer_occupations',
 'https://books.google.com/books?id=oGs_AQAAIAAJ',
 'https://wikipedia.org/wiki/Unstructured_data',
 'https://wikipedia.org/wiki/Data_mining',
 'https://wikipedia.org/w/index.php?title=Data_science&action=edit&section=12',
 'https://wikipedia.org/wiki/Montpellier_2_University',
 'https://wikipedia.org/wiki/Data_compression',
 'https://wikipedia.org/wiki/Conference_on_Neural_Information_Processing_Systems',
 'https://wikipedia.org/w/in

### Step 4: Use the os library to create a folder called *wikipedia* and make that the current working directory.

In [7]:
import os

In [8]:
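# your code here
# exist_ok=True makes this safe to re-run if the folder already exists
os.makedirs('wikipedia', exist_ok=True)
# The working directory is left unchanged here because index_page (Step 5)
# writes to './wikipedia/' relative to the notebook's folder.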
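
The step also asks for the new folder to become the current working directory. A minimal sketch of that part is shown below. It runs inside a temporary parent directory (an assumption made only so the example does not touch the real project folder) and switches back afterwards.

```python
import os
import tempfile

original_cwd = os.getcwd()
# A throwaway parent directory keeps this example away from real files.
with tempfile.TemporaryDirectory() as parent:
    target = os.path.join(parent, 'wikipedia')
    os.makedirs(target, exist_ok=True)   # creates the folder
    os.makedirs(target, exist_ok=True)   # second call is a harmless no-op
    os.chdir(target)                     # make it the working directory
    inside = os.path.basename(os.getcwd())
    os.chdir(original_cwd)               # step back out before cleanup
print(inside)  # wikipedia
```

If the working directory is changed this way in the notebook, `index_page` would need to open just the filename rather than `'./wikipedia/' + filename`.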

### Step 5: Write a function called index_page that accepts a link and does the following.

- Tries to request the content of the page referenced by that link.
- Slugifies the filename using the `slugify` function from the [python-slugify](https://pypi.org/project/python-slugify/) library and adds a .html file extension.
    - If you don't already have the python-slugify library installed, you can pip install it as follows: `$ pip3 install python-slugify`.
    - To import the slugify function, you would do the following: `from slugify import slugify`.
    - You can then slugify a link as follows `slugify(link)`.
- Creates a file in the wikipedia folder using the slugified filename and writes the contents of the page to the file.
- If an exception occurs during the process above, just `pass`.

In [9]:
# from slugify import slugify  # import failed; installing python-slugify did not help either
import re

In [10]:
# your code here

# The slugify package did not work here (NameError: name 'unicode' is not defined),
# so the regex-based helper below is used instead.
# ref: https://blndxp.wordpress.com/2016/03/04/python-django-wagtail-slugify-name-unicode-is-not-defined/
def slug(string):
    # Drop punctuation, then collapse runs of whitespace or hyphens into one hyphen
    return re.sub(r'[-\s]+', '-', re.sub(r'[^\w\s-]', '', string).strip().lower())

def index_page(url):
    try:
        response = requests.get(url)
        # Build the filename from the url argument, not the global loop variable
        filename = slug(url) + '.html'
        with open('./wikipedia/' + filename, 'w', encoding='utf-8') as file:
            file.write(response.text)
    except Exception:
        # Per the lab instructions, skip any page that fails
        pass
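
To see what filenames the regex helper produces, the sketch below applies it to one sample URL. `slug_hyphenated` is a hypothetical variant added only for comparison; it is not the python-slugify implementation, and `sample_url` is just an illustrative value.

```python
import re

def slug(string):
    # Same regex approach as the cell above: drop punctuation, then
    # collapse runs of whitespace or hyphens into one hyphen.
    return re.sub(r'[-\s]+', '-', re.sub(r'[^\w\s-]', '', string).strip().lower())

def slug_hyphenated(string):
    # Hypothetical variant: every run of non-alphanumeric characters becomes
    # a hyphen, which keeps the parts of the URL visually separated.
    return re.sub(r'[^a-z0-9]+', '-', string.lower()).strip('-')

sample_url = 'https://wikipedia.org/wiki/Data_mining'
print(slug(sample_url) + '.html')             # httpswikipediaorgwikidata_mining.html
print(slug_hyphenated(sample_url) + '.html')  # https-wikipedia-org-wiki-data-mining.html
```

Because the first helper deletes separators instead of replacing them, the resulting names run together; either form is a valid filename, so the choice is mostly about readability.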

### Step 6: Sequentially loop through the list of links, running the index_page function each time.

Remember to include `%%time` at the beginning of the cell so that it measures the time it takes for the cell to run.

In [11]:
# your code here
import time
start = time.perf_counter()
for link in links:
    index_page(link)
finish = time.perf_counter()
print('finished in ' + str(finish-start)+' seconds')

finished in 304.09935527700003 seconds


### Step 7: Perform the page indexing in parallel and note the difference in performance.

Remember to include `%%time` at the beginning of the cell so that it measures the time it takes for the cell to run.

In [12]:
import multiprocessing
import time

In [13]:
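# your code here
start = time.perf_counter()
processes = []
for link in links:  # the de-duplicated list, as in the sequential run
    p = multiprocessing.Process(target=index_page, args=[link])
    p.start()
    # keep a handle on each process so we can wait for it below
    processes.append(p)
# Wait for every process to finish. Without the joins the timer stops as soon as
# the processes are launched, which is what the 4.6 s recorded below measured.
for p in processes:
    p.join()
finish = time.perf_counter()
print('finished in ' + str(finish-start)+' seconds')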

finished in 4.601657735999993 seconds
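
Starting one process per link can mean hundreds of processes at once. Since downloading pages is mostly waiting on the network, a bounded thread pool is a common alternative. The sketch below swaps in `concurrent.futures.ThreadPoolExecutor` instead of `multiprocessing.Process`, and uses a stand-in function `fake_index_page` with made-up URLs so that it runs without network access. It is an illustration of the timing pattern, not a measurement of the real crawl.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def fake_index_page(url):
    # Stand-in for index_page: sleeps instead of downloading, so the
    # sketch needs no network access.
    time.sleep(0.05)
    return url

demo_links = [f'https://example.org/page{i}' for i in range(8)]  # hypothetical URLs

start = time.perf_counter()
sequential = [fake_index_page(u) for u in demo_links]
sequential_seconds = time.perf_counter() - start

start = time.perf_counter()
with ThreadPoolExecutor(max_workers=8) as pool:
    # list(pool.map(...)) blocks until every task has finished, so the timer
    # covers the full run, not just the time needed to submit the work.
    parallel = list(pool.map(fake_index_page, demo_links))
parallel_seconds = time.perf_counter() - start

print(f'sequential: {sequential_seconds:.2f}s, parallel: {parallel_seconds:.2f}s')
```

In the notebook, the same pattern would apply with `index_page` and `links` in place of the stand-ins; the value of `max_workers` is a tuning choice rather than a fixed recommendation.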
