# Parallelization Lab

In this lab, you will be leveraging several concepts you have learned to obtain a list of links from a web page and crawl and index the pages referenced by those links - both sequentially and in parallel. Follow the steps below to complete the lab.

### Step 1: Use the requests library to retrieve the content from the URL below.

In [1]:
import requests

url = 'https://en.wikipedia.org/wiki/Data_science'

In [2]:
# your code here
response = requests.get(url) 
print(response.status_code)

200


### Step 2: Use BeautifulSoup to extract a list of all the unique links on the page.

In [3]:
from bs4 import BeautifulSoup

In [4]:
# your code here
soup = BeautifulSoup(response.content)
list_of_As = soup.find_all("a", href=True)
list_links = [link["href"] for link in list_of_As if not link["href"].startswith("#")]

len(list_links) #340
list_links = list(set(list_links))
len(list_links) #288
list_links

['/wiki/Computer_science',
 '/wiki/Data_storage',
 'https://nl.wikipedia.org/wiki/Datawetenschap',
 '/wiki/Category:Information_science',
 '/w/index.php?title=Data_science&action=info',
 '/wiki/Boston',
 '/wiki/Data_(computing)',
 '/wiki/Special:RecentChangesLinked/Data_science',
 'https://eu.wikipedia.org/wiki/Datu_zientzia',
 '/wiki/Peter_Naur',
 '/wiki/Data_validation',
 '/wiki/Wikipedia:File_Upload_Wizard',
 '/wiki/Exploration',
 'https://doi.org/10.3390%2Fmake1010015',
 '/wiki/Montpellier_2_University',
 'https://bn.wikipedia.org/wiki/%E0%A6%89%E0%A6%AA%E0%A6%BE%E0%A6%A4%E0%A7%8D%E0%A6%A4_%E0%A6%AC%E0%A6%BF%E0%A6%9C%E0%A7%8D%E0%A6%9E%E0%A6%BE%E0%A6%A8',
 'https://api.semanticscholar.org/CorpusID:6107147',
 '/wiki/Wikipedia:General_disclaimer',
 '/wiki/Special:Random',
 'https://en.wikipedia.org/w/index.php?title=Template:Data&action=edit',
 '/wiki/ODSC',
 '/wiki/Data_archaeology',
 '/wiki/Database',
 '/wiki/Scientific_method',
 '/wiki/Basic_research',
 '/w/index.php?title=Data_sci

### Step 3: Use list comprehensions with conditions to clean the link list.

There are two types of links, absolute and relative. Absolute links have the full URL and begin with *http* while relative links begin with a forward slash (/) and point to an internal page within the *wikipedia.org* domain. Clean the respective types of URLs as follows.

- Absolute Links: Create a list of these and remove any that contain a percentage sign (%).
- Relative Links: Create a list of these, add the domain to the link so that you have the full URL, and remove any that contain a percentage sign (%).
- Combine the list of absolute and relative links and ensure there are no duplicates.

In [5]:
domain = 'http://wikipedia.org'

In [6]:
# your code here
list_links_abs = [link for link in list_links if link.startswith("http") and "%" not in link]
list_links_rel = [domain+link for link in list_links if not link.startswith("http") and link.startswith("/")]

list_links_clean = list(set(list_links_rel+list_links_abs))
len(list_links_clean) #252
list_links_clean

['https://nl.wikipedia.org/wiki/Datawetenschap',
 'http://wikipedia.org/wiki/Special:Search',
 'http://wikipedia.org/wiki/William_S._Cleveland',
 'http://wikipedia.org/wiki/Main_Page',
 'http://wikipedia.org/wiki/Data_fusion',
 'http://wikipedia.org/wiki/Data_philanthropy',
 'https://eu.wikipedia.org/wiki/Datu_zientzia',
 'http://wikipedia.org//foundation.wikimedia.org/wiki/Terms_of_Use',
 'https://api.semanticscholar.org/CorpusID:6107147',
 'http://wikipedia.org/wiki/Data_loss',
 'http://wikipedia.org/wiki/Boston_Globe',
 'http://wikipedia.org/wiki/Data_retention',
 'http://wikipedia.org/wiki/Data_archaeology',
 'https://en.wikipedia.org/w/index.php?title=Template:Data&action=edit',
 'http://wikipedia.org/wiki/Analysis',
 'https://www.stat.purdue.edu/~wsc/',
 'https://www2.isye.gatech.edu/~jeffwu/publications/fazhan.pdf',
 'https://books.google.com/books?id=oGs_AQAAIAAJ',
 'http://wikipedia.org/wiki/Graphic_design',
 'http://wikipedia.org/wiki/American_Statistical_Association',
 'http

### Step 4: Use the os library to create a folder called *wikipedia* and make that the current working directory.

In [7]:
import os

In [8]:
# your code here
cwd = os.getcwd()
path_wikifolder = os.path.join(cwd, '..', 'wikipedia')
if not os.path.exists(path_wikifolder):
    os.makedirs(path_wikifolder)
os.chdir(path_wikifolder)

### Step 5: Write a function called index_page that accepts a link and does the following.

- Tries to request the content of the page referenced by that link.
- Slugifies the filename using the `slugify` function from the [python-slugify](https://pypi.org/project/python-slugify/) library and adds a .html file extension.
    - If you don't already have the python-slugify library installed, you can pip install it as follows: `$ pip3 install python-slugify`.
    - To import the slugify function, you would do the following: `from slugify import slugify`.
    - You can then slugify a link as follows `slugify(link)`.
- Creates a file in the wikipedia folder using the slugified filename and writes the contents of the page to the file.
- If an exception occurs during the process above, just `pass`.

In [9]:
from slugify import slugify

In [10]:
# your code here
def index_page(link):
    try: 
        filename = slugify(link)
        with open(filename+'.txt', 'w') as f:
            response = requests.get(link) 
            if response.status_code == 200:
                f.write(str(response.content))
            else:
                print("Page could not be called:", link)
    except:
        pass

### Step 6: Sequentially loop through the list of links, running the index_page function each time.

Remember to include `%%time` at the beginning of the cell so that it measures the time it takes for the cell to run.

In [11]:
import time

In [25]:
%%time
for link in list_links_clean:
    index_page(link)
    
## finished in 4:25 min

Page could not be called: http://cacm.acm.org/magazines/2013/12/169933-data-science-and-prediction/fulltext
Page could not be called: http://wikipedia.org//en.m.wikipedia.org/w/index.php?title=Data_science&mobileaction=toggle_view_mobile
Page could not be called: http://wikipedia.org//en.wikipedia.org/wiki/Wikipedia:Contact_us
Page could not be called: http://www.datascienceassn.org/about-data-science
Page could not be called: http://wikipedia.org/w/index.php?title=Application_of_Statistics_and_Management&action=edit&redlink=1
Page could not be called: https://cacm.acm.org/blogs/blog-cacm/267286-why-is-it-hard-to-define-data-science/fulltext
Page could not be called: http://priceonomics.com/whats-the-difference-between-data-science-and/
Page could not be called: http://wikipedia.org//creativecommons.org/licenses/by-sa/3.0/
Page could not be called: https://magazine.amstat.org/blog/2016/06/01/datascience-2/
Page could not be called: http://wikipedia.org//foundation.wikimedia.org/wiki/Pr

### Step 7: Perform the page indexing in parallel and note the difference in performance.

Remember to include `%%time` at the beginning of the cell so that it measures the time it takes for the cell to run.

In [12]:
import multiprocessing

In [13]:
print("Number of cpu : ", multiprocessing.cpu_count())

Number of cpu :  8


In [14]:
starttotal = time.perf_counter()

#do it in packages of 5 links at a time:
list_links_part = []
for i,link in enumerate(list_links_clean):
    list_links_part.append(link)
    if (len(list_links_part) == 8) or (i == len(list_links_clean)-1):
        
        #and now do the multiprocessing for these 5
        #start = time.perf_counter()
        processes = []

        for linkx in list_links_part:
            p = multiprocessing.Process(target = index_page(linkx))
            p.start()
            processes.append(p)
            
        for process in processes:
            process.join()
            
        #finish = time.perf_counter()
        #print('5 subprocesses: finished in ' + str(finish - start) + ' seconds')

        list_links_part = []

finishtotal = time.perf_counter()
print('finished in ' + str(finishtotal - starttotal) + ' seconds')

## multiprocessing of 5 parallel: finished in 144.8067623 seconds -> 2:24 min
## multiprocessing of 8 parallel: finished in 147.4203463 seconds -> 2:27 min

Page could not be called: http://wikipedia.org//foundation.wikimedia.org/wiki/Terms_of_Use
Page could not be called: http://www.datascienceassn.org/about-data-science
Page could not be called: http://wikipedia.org//en.m.wikipedia.org/w/index.php?title=Data_science&mobileaction=toggle_view_mobile
Page could not be called: http://wikipedia.org//en.wikipedia.org/wiki/Wikipedia:Contact_us
Page could not be called: http://wikipedia.org//www.wikimediafoundation.org/
Page could not be called: http://wikipedia.org//creativecommons.org/licenses/by-sa/3.0/
Page could not be called: https://magazine.amstat.org/blog/2015/10/01/asa-statement-on-the-role-of-statistics-in-data-science/
Page could not be called: https://statmodeling.stat.columbia.edu/2013/11/14/statistics-least-important-part-data-science/
Page could not be called: http://priceonomics.com/whats-the-difference-between-data-science-and/
Page could not be called: https://cacm.acm.org/blogs/blog-cacm/267286-why-is-it-hard-to-define-data-s