# Parallelization Lab

In this lab, you will apply several concepts you have learned to obtain a list of links from a web page, then crawl and index the pages those links reference, first sequentially and then in parallel. Follow the steps below to complete the lab.

### Step 1: Use the requests library to retrieve the content from the URL below.

In [1]:
import requests
url = 'https://en.wikipedia.org/wiki/Data_science'

In [2]:
# your code here
response = requests.get(url)
html = response.content

### Step 2: Use BeautifulSoup to extract a list of all the unique links on the page.

In [3]:
from bs4 import BeautifulSoup

In [4]:
import re

In [5]:
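# your code here
soup = BeautifulSoup(html, 'html.parser')  # name the parser explicitly to avoid a bs4 warning
soup = str(soup.find_all('a'))  # stringify every <a> tag so the regex can scan it
soup = re.findall(r'"/wiki/[a-zA-Z_-]{1,50}"', soup)  # quoted relative /wiki/ links only
soup = set(soup)  # drop duplicates
soup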

{'"/wiki/American_Statistical_Association"',
 '"/wiki/Andrew_Gelman"',
 '"/wiki/Anomaly_detection"',
 '"/wiki/Apache_Hadoop"',
 '"/wiki/Artificial_neural_network"',
 '"/wiki/Association_rule_learning"',
 '"/wiki/Autoencoder"',
 '"/wiki/Automated_machine_learning"',
 '"/wiki/BIRCH"',
 '"/wiki/Basic_research"',
 '"/wiki/Bayesian_network"',
 '"/wiki/Ben_Fry"',
 '"/wiki/Big_data"',
 '"/wiki/Bootstrap_aggregating"',
 '"/wiki/CURE_data_clustering_algorithm"',
 '"/wiki/Canonical_correlation"',
 '"/wiki/Cluster_analysis"',
 '"/wiki/Clustering"',
 '"/wiki/Committee_on_Data_for_Science_and_Technology"',
 '"/wiki/Computational_learning_theory"',
 '"/wiki/Computational_science"',
 '"/wiki/Computer_science"',
 '"/wiki/Conditional_random_field"',
 '"/wiki/Convolutional_neural_network"',
 '"/wiki/DBSCAN"',
 '"/wiki/DJ_Patil"',
 '"/wiki/Data_analysis"',
 '"/wiki/Data_mining"',
 '"/wiki/Data_science"',
 '"/wiki/Database"',
 '"/wiki/David_Donoho"',
 '"/wiki/Decision_tree_learning"',
 '"/wiki/DeepDream"'
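
The regex above keeps only relative `/wiki/` links whose names consist of letters, underscores, and hyphens, so absolute links never reach Step 3. As a hedged alternative, the sketch below collects every `href` value directly. It uses the standard library's `html.parser` so that it runs without extra installs; with BeautifulSoup, `soup.find_all('a', href=True)` should select the same tags. The `LinkCollector` class and the `sample_html` string are illustrative names and data, not part of the lab.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the unique href values of all <a> tags (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.links = set()

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value is None for bare attributes
        if tag == 'a':
            for name, value in attrs:
                if name == 'href' and value:
                    self.links.add(value)


# Made-up sample page standing in for the downloaded HTML
sample_html = (
    '<p><a href="/wiki/Big_data">Big data</a> '
    '<a href="https://example.org/page">external</a> '
    '<a href="/wiki/Big_data">duplicate</a> '
    '<a name="anchor-without-href">no link</a></p>'
)

collector = LinkCollector()
collector.feed(sample_html)
unique_links = sorted(collector.links)
print(unique_links)
```

Collecting into a set removes the duplicate link, and the tag without an `href` is skipped.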

### Step 3: Use list comprehensions with conditions to clean the link list.

There are two types of links, absolute and relative. Absolute links have the full URL and begin with *http* while relative links begin with a forward slash (/) and point to an internal page within the *wikipedia.org* domain. Clean the respective types of URLs as follows.

- Absolute Links: Create a list of these and remove any that contain a percentage sign (%).
- Relative Links: Create a list of these, add the domain to the link so that you have the full URL, and remove any that contain a percentage sign (%).
- Combine the list of absolute and relative links and ensure there are no duplicates.
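
The three rules above can be sketched with two list comprehensions and a set. The `raw_links` list is made-up sample data standing in for the links collected in Step 2, and the variable names are only illustrative.

```python
# Assumed values for illustration; the real list comes from Step 2
domain = 'http://wikipedia.org'
raw_links = [
    'https://example.org/about',
    '/wiki/Data_mining',
    '/wiki/Na%C3%AFve_Bayes',     # contains %, should be dropped
    'https://example.org/a%20b',  # contains %, should be dropped
    '/wiki/Data_mining',          # duplicate
    '#cite_note-1',               # neither absolute nor relative to the site root
]

# Absolute links: start with http and contain no percent sign
absolute = [l for l in raw_links if l.startswith('http') and '%' not in l]

# Relative links: start with a slash, get the domain prepended, no percent sign
relative = [domain + l for l in raw_links if l.startswith('/') and '%' not in l]

# Combine both lists and remove duplicates
clean_links = sorted(set(absolute + relative))
print(clean_links)
```

Protocol-relative links that start with `//` also begin with a slash; this sketch does not treat them specially.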

In [6]:
domain = 'http://wikipedia.org'

In [7]:
# your code here
link = [domain + item.replace('"', '') for item in soup]
link

['http://wikipedia.org/wiki/Jeff_Hammerbacher',
 'http://wikipedia.org/wiki/Temporal_difference_learning',
 'http://wikipedia.org/wiki/Ensemble_learning',
 'http://wikipedia.org/wiki/Occam_learning',
 'http://wikipedia.org/wiki/Autoencoder',
 'http://wikipedia.org/wiki/Data_science',
 'http://wikipedia.org/wiki/Nathan_Yau',
 'http://wikipedia.org/wiki/DBSCAN',
 'http://wikipedia.org/wiki/Structured_prediction',
 'http://wikipedia.org/wiki/Distributed_computing',
 'http://wikipedia.org/wiki/Nate_Silver',
 'http://wikipedia.org/wiki/Tableau_Software',
 'http://wikipedia.org/wiki/Computational_science',
 'http://wikipedia.org/wiki/Multilayer_perceptron',
 'http://wikipedia.org/wiki/Association_rule_learning',
 'http://wikipedia.org/wiki/Generative_adversarial_network',
 'http://wikipedia.org/wiki/Bootstrap_aggregating',
 'http://wikipedia.org/wiki/Unsupervised_learning',
 'http://wikipedia.org/wiki/Information_visualization',
 'http://wikipedia.org/wiki/Unstructured_data',
 'http://wikipe

### Step 4: Use the os library to create a folder called *wikipedia* and make that the current working directory.

In [8]:
import os

In [9]:
# your code here
os.makedirs('wikipedia', exist_ok=True)  # create the folder; later cells write to ./wikipedia/ directly
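
The step also asks for the new folder to become the current working directory. A minimal sketch of that variant follows; it works inside a temporary parent folder (an assumption made only so the demo leaves no files behind) and restores the original directory at the end. If the working directory is changed this way in the notebook, later cells would need to drop the `./wikipedia/` prefix from their paths.

```python
import os
import tempfile

start_dir = os.getcwd()
base = tempfile.mkdtemp()            # throwaway parent folder, for the demo only
target = os.path.join(base, 'wikipedia')

os.makedirs(target, exist_ok=True)   # no error if the folder already exists
os.chdir(target)                     # make it the current working directory
current_name = os.path.basename(os.getcwd())
print(current_name)

os.chdir(start_dir)                  # restore the original directory after the demo
```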

### Step 5: Write a function called index_page that accepts a link and does the following.

- Tries to request the content of the page referenced by that link.
- Slugifies the filename using the `slugify` function from the [python-slugify](https://pypi.org/project/python-slugify/) library and adds a .html file extension.
    - If you don't already have the python-slugify library installed, you can pip install it as follows: `$ pip3 install python-slugify`.
    - To import the slugify function, you would do the following: `from slugify import slugify`.
    - You can then slugify a link as follows `slugify(link)`.
- Creates a file in the wikipedia folder using the slugified filename and writes the contents of the page to the file.
- If an exception occurs during the process above, just `pass`.

In [10]:
#!pip3 install python-slugify

In [11]:
from slugify import slugify

In [12]:
slugify(link[0])

'http-wikipedia-org-wiki-jeff-hammerbacher'

In [13]:
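# your code here
def index_page(link):
    """Download one page into ./wikipedia/ under a slugified filename."""
    try:
        content = requests.get(link).content
        # the with-block closes the file even if the write fails
        with open('./wikipedia/' + slugify(link) + '.html', 'wb') as arquivo:
            arquivo.write(content)
    except Exception:
        pass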

### Step 6: Sequentially loop through the list of links, running the index_page function each time.

Remember to include `%%time` at the beginning of the cell so that it measures the time it takes for the cell to run. 

_hint: Use tqdm to keep track of the time._ 
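
`%%time` only works inside Jupyter or IPython. Outside a notebook, a comparable measurement can be sketched with `time.perf_counter`. Here `fake_index_page` is a hypothetical stand-in that sleeps instead of downloading, and `links` is placeholder data.

```python
import time


def fake_index_page(link):
    """Stand-in for index_page: sleeps briefly instead of downloading."""
    time.sleep(0.01)
    return link


links = ['page-%d' % i for i in range(5)]  # placeholder links

start = time.perf_counter()
processed = [fake_index_page(link) for link in links]
elapsed = time.perf_counter() - start

print('Processed %d links in %.3f s' % (len(processed), elapsed))
```

In the notebook, wrapping the iterable as `tqdm(link)` adds a progress bar to the same kind of loop.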

In [20]:
%%time
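for item in link:
    try:
        content = requests.get(item).content
        # the with-block closes each file before the next download starts
        with open('./wikipedia/' + slugify(item) + '.html', 'wb') as arquivo:
            arquivo.write(content)
    except Exception:
        pass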

Wall time: 5min 22s


### Step 7: Perform the page indexing in parallel and note the difference in performance.

Remember to include `%%time` at the beginning of the cell so that it measures the time it takes for the cell to run.

Use both methods. First, use the `multiprocess` module, which can run a function defined inside the Jupyter notebook, to download the pages in parallel.

Second, create a Python file containing the download function and run it with the standard `multiprocessing` module.
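
The second method could look roughly like the file below. This is a sketch under stated assumptions, not the lab's reference solution: the filename `download_pages.py`, the helper `filename_for`, and the function `index_all` are hypothetical, and `urllib.request` plus a small regex replace `requests` and `slugify` so that the file has no third-party dependencies. In the lab itself you would import `requests` and `slugify` instead.

```python
# download_pages.py (hypothetical filename)
import multiprocessing
import os
import re
import sys
import urllib.request


def filename_for(link):
    """Rough stand-in for slugify(link) + '.html' using only the standard library."""
    return re.sub(r'[^a-z0-9]+', '-', link.lower()).strip('-') + '.html'


def index_page(link, folder='wikipedia'):
    """Download one page into folder; failures are silently skipped, as in Step 5."""
    try:
        with urllib.request.urlopen(link, timeout=10) as response:
            content = response.read()
        with open(os.path.join(folder, filename_for(link)), 'wb') as handle:
            handle.write(content)
    except Exception:
        pass


def index_all(links, processes=4):
    """Run index_page over all links in a pool of worker processes."""
    os.makedirs('wikipedia', exist_ok=True)
    with multiprocessing.Pool(processes) as pool:
        pool.map(index_page, links)  # blocks until every link is processed


# The guard keeps worker processes from re-running the pool when they import this file
if __name__ == '__main__':
    cli_links = sys.argv[1:]  # links passed on the command line
    if cli_links:
        index_all(cli_links)
```

From the notebook, `from download_pages import index_all` followed by `index_all(link)` in a `%%time` cell would give the second timing to compare. Because the function lives in an importable file, the standard `multiprocessing` module can pickle it by name.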

In [16]:
#!pip install multiprocess

In [17]:
import multiprocess

In [18]:
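def Links(item):
    """Download one page into ./wikipedia/; errors are ignored so one bad link cannot stop the pool."""
    try:
        content = requests.get(item).content
        # the with-block closes the file even if the write fails
        with open('./wikipedia/' + slugify(item) + '.html', 'wb') as arquivo:
            arquivo.write(content)
    except Exception:
        pass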

In [19]:
%%time
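pool = multiprocess.Pool()      # one worker process per CPU core by default
result = pool.map(Links, link)  # blocks until every link has been processed
pool.close()                    # no more work will be submitted
pool.join()                     # wait for the worker processes to exit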

Wall time: 478 ms


**BONUS**: Create a function that uses the `os` module to count how many files are in the wikipedia folder.

Delete the files from the folder, then run the above solution asynchronously.

Use your function to check how many files have been downloaded so far while the download runs.
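
One way to sketch the counting function is shown below. The name `count_files` and the temporary demo folder are assumptions made for illustration; in the lab you would pass `'wikipedia'` as the folder.

```python
import os
import tempfile


def count_files(folder):
    """Return the number of regular files directly inside folder."""
    return len([
        name for name in os.listdir(folder)
        if os.path.isfile(os.path.join(folder, name))
    ])


# Demo on a throwaway folder instead of the real wikipedia folder
demo_folder = tempfile.mkdtemp()
for name in ('a.html', 'b.html'):
    with open(os.path.join(demo_folder, name), 'w') as handle:
        handle.write('placeholder')
os.mkdir(os.path.join(demo_folder, 'subfolder'))  # directories are not counted

file_count = count_files(demo_folder)
print(file_count)
```

For the asynchronous run, `pool.map_async(Links, link)` returns immediately with a result object. Calling `count_files('wikipedia')` in a loop until `result.ready()` is true should show the count growing as pages arrive.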