# Parallelization Lab

In this lab, you will be leveraging several concepts you have learned to obtain a list of links from a web page and crawl and index the pages referenced by those links - both sequentially and in parallel. Follow the steps below to complete the lab.

### Step 1: Use the requests library to retrieve the content from the URL below.

In [1]:
import requests
import pandas as pd
import numpy as np

url = 'https://en.wikipedia.org/wiki/Data_science'

In [2]:
response = requests.get(url)
response

<Response [200]>

In [3]:
# response.content

### Step 2: Use BeautifulSoup to extract a list of all the unique links on the page.

In [4]:
from bs4 import BeautifulSoup

In [5]:
# your code here

soup = BeautifulSoup(response.content)
# print(soup.prettify())

### Step 3: Use list comprehensions with conditions to clean the link list.

There are two types of links, absolute and relative. Absolute links have the full URL and begin with *http* while relative links begin with a forward slash (/) and point to an internal page within the *wikipedia.org* domain. Clean the respective types of URLs as follows.

- Absolute Links: Create a list of these and remove any that contain a percentage sign (%).
- Relative Links: Create a list of these, add the domain to the link so that you have the full URL, and remove any that contain a percentage sign (%).
- Combine the list of absolute and relative links and ensure there are no duplicates.

In [6]:
# Absolute links

# Get the request and clean string
absol_links = soup.find_all("a", attrs={'class':'external text'})
list = [str(link) for link in absol_links]
list = [link.replace('<a class="external text" href=', '') for link in list]
list = [link.split('"') for link in list]

# Create link list, delete the ones with "?"
list2 = [(i[1]) for i in list] 
absol_list = [item for item in list2 if not "?" in item]

In [7]:
# Relative links

domain = 'http://wikipedia.org'
relat_links = soup.find_all("a",attrs={'class':'mw-redirect'})

# Get the request and clean string
list = [str(link) for link in relat_links]
list = [link.replace('<a class="mw-redirect" href=', '') for link in list]
list = [link.split('"') for link in list]

# Create link list, delete the ones with "?"
list2 = [(domain+i[1]) for i in list] 
relat_list = [item for item in list2 if not "?" in item]

In [8]:
# Convert to set to erase repeated and then again into list
final_set = set(absol_list+relat_list)
final_list = [item for item in final_set]
final_list2 = []

# Adding http where is missing
for item in final_list:
    if item[0]=="/":
        item = 'http:'+ item
        final_list2.append(item)
    else:
        final_list2.append(item)

final_list2

['http://www.worldcat.org/issn/0036-8075',
 'http://pubmed.ncbi.nlm.nih.gov/23074866',
 'https://api.semanticscholar.org/CorpusID:9743327',
 'https://www.oreilly.com/library/view/doing-data-science/9781449363871/ch01.html',
 'http://wikipedia.org/wiki/Interdisciplinary',
 'http://wikipedia.org/wiki/Data_transformation',
 'https://www.bostonglobe.com/business/2015/11/11/behind-scenes-sexiest-job-century/Kc1cvXIu31DfHhVmyRQeIJ/story.html',
 'https://hbr.org/2022/07/is-data-scientist-still-the-sexiest-job-of-the-21st-century',
 'http://wikipedia.org/wiki/New_York_Times',
 'http://www.worldcat.org/oclc/489990740',
 'https://doi.org/10.3390%2Fmake1010015',
 'https://api.semanticscholar.org/CorpusID:207595944',
 'https://medriscoll.com/post/4740157098/the-three-sexy-skills-of-data-geeks',
 'http://simplystatistics.org/2013/12/12/the-key-word-in-data-science-is-not-data-it-is-science/',
 'https://flowingdata.com/2009/06/04/rise-of-the-data-scientist/',
 'http://archive.nyu.edu/handle/2451/315

### Step 4: Use the os library to create a folder called *wikipedia* and make that the current working directory.

In [9]:
import os

In [10]:
# your code here

#     os.mkdir('wikipedia') 
os.chdir('wikipedia')

print("New working directory: {0}".format(os.getcwd()))


New working directory: C:\Users\fcavanagh\Desktop\Ironhack\W10\Day3\lab-parallelization\your-code\wikipedia


### Step 5: Write a function called index_page that accepts a link and does the following.

- Tries to request the content of the page referenced by that link.
- Slugifies the filename using the `slugify` function from the [python-slugify](https://pypi.org/project/python-slugify/) library and adds a .html file extension.
    - If you don't already have the python-slugify library installed, you can pip install it as follows: `$ pip3 install python-slugify`.
    - To import the slugify function, you would do the following: `from slugify import slugify`.
    - You can then slugify a link as follows `slugify(link)`.
- Creates a file in the wikipedia folder using the slugified filename and writes the contents of the page to the file.
- If an exception occurs during the process above, just `pass`.

In [11]:
from slugify import slugify
import time

In [12]:
# your code here

def index_page(link):
    try:
        response = requests.get(link)
        soup = BeautifulSoup(response.content)
        file_name = slugify(link)+'.html'
        fp = open(file_name, 'w')
        fp.write(str(soup))
        fp.close()
        print('Link: ', link, 'done..')
    except:
#         print('This link: ', link, 'failed...')
        pass
    

### Step 6: Sequentially loop through the list of links, running the index_page function each time.

Remember to include `%%time` at the beginning of the cell so that it measures the time it takes for the cell to run.

In [13]:
%%time

# os.chdir('wikipedia')
print("New working directory: {0}".format(os.getcwd()))

for link in final_list2:
    index_page(link)
    

New working directory: C:\Users\fcavanagh\Desktop\Ironhack\W10\Day3\lab-parallelization\your-code\wikipedia
Link:  http://pubmed.ncbi.nlm.nih.gov/23074866 done..
Link:  https://api.semanticscholar.org/CorpusID:9743327 done..
Link:  https://www.oreilly.com/library/view/doing-data-science/9781449363871/ch01.html done..
Link:  https://hbr.org/2022/07/is-data-scientist-still-the-sexiest-job-of-the-21st-century done..
Link:  https://api.semanticscholar.org/CorpusID:207595944 done..
Link:  https://medriscoll.com/post/4740157098/the-three-sexy-skills-of-data-geeks done..
Link:  http://simplystatistics.org/2013/12/12/the-key-word-in-data-science-is-not-data-it-is-science/ done..
Link:  https://www.springer.com/book/9784431702085 done..
Link:  http://courses.csail.mit.edu/18.337/2015/docs/50YearsDataScience.pdf done..
Link:  https://web.archive.org/web/20190620184935/https://magazine.amstat.org/blog/2015/10/01/asa-statement-on-the-role-of-statistics-in-data-science/ done..
Link:  https://magazi

### Step 7: Perform the page indexing in parallel and note the difference in performance.

Remember to include `%%time` at the beginning of the cell so that it measures the time it takes for the cell to run.

In [14]:
import multiprocessing
from multiprocessing import Pool

In [None]:
%%time

print("New working directory: {0}".format(os.getcwd()))
                                
pool = multiprocessing.Pool(2) 
pool.map(index_page, [link for link in final_list2])
pool.join()


New working directory: C:\Users\fcavanagh\Desktop\Ironhack\W10\Day3\lab-parallelization\your-code\wikipedia
