Go to https://en.wikipedia.org/wiki/Main_Page, and find the link "Random article" in the left-hand column. Use this tool to generate a URL for a random page. This will be your seed; add it as the first value to a deque. Write code that:

a) implements a queue (First In First Out) approach to your list of URLs, downloading the oldest entry first

b) for each downloaded page, extracts all links to internal pages in /wiki/

In [1]:
from collections import deque 
import urllib.request
import re
import os
from nltk import flatten

In [2]:
def getlinks(seed):
    #download the page and decode it to text
    with urllib.request.urlopen(seed) as web_page:
        contents = web_page.read().decode(errors="replace")
    #use lookbehind and lookahead to capture the page name that follows /wiki/
    #re.findall already returns a flat list, so no flattening is needed
    return re.findall('(?<=href="/wiki/).+?(?=")', contents, re.DOTALL)
print(getlinks("https://en.wikipedia.org/wiki/Fran_Mullins"))

['File:Fran_Mullins_Giants.jpg', 'Infielder', 'Oakland,_California', '1980_Major_League_Baseball_season', 'Chicago_White_Sox', '1986_Major_League_Baseball_season', 'Cleveland_Indians', 'Batting_average_(baseball)', 'Home_run', 'Run_batted_in', 'Chicago_White_Sox', '1980_Major_League_Baseball_season', 'San_Francisco_Giants', '1984_Major_League_Baseball_season', 'Cleveland_Indians', '1986_Major_League_Baseball_season', 'Professional_baseball', 'Major_League_Baseball', 'Infielder', 'Santa_Clara_University', 'College_baseball', 'Santa_Clara_Broncos', 'Chicago_White_Sox', 'Cincinnati_Reds', 'Steve_Christmas', 'San_Francisco_Giants', 'Rule_5_draft', '1984_Major_League_Baseball_season', 'Cleveland_Indians', 'File:Baseball_(crop).jpg', 'File:Flag_of_the_United_States.svg', 'File:Crystal_Clear_app_Login_Manager_2.png', 'Wikipedia:Stub', 'Template:US-baseball-infielder-stub', 'Template_talk:US-baseball-infielder-stub', 'Help:Category', 'Category:1957_births', 'Category:Living_people', 'Category:
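
As an alternative to the regular expression, step b) could also be done with the standard library's `html.parser`, which only looks at `href` attributes of `<a>` tags. This is a minimal sketch, not the method used above; `WikiLinkParser`, `extract_wiki_links`, and the `sample` string are hypothetical names introduced for illustration.

```python
from html.parser import HTMLParser

class WikiLinkParser(HTMLParser):
    """Collect the targets of <a href="/wiki/..."> tags (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            # keep only internal links and strip the "/wiki/" prefix
            if name == "href" and value and value.startswith("/wiki/"):
                self.links.append(value[len("/wiki/"):])

def extract_wiki_links(html):
    parser = WikiLinkParser()
    parser.feed(html)
    return parser.links

# made-up HTML fragment, used only to show the behaviour
sample = ('<p><a href="/wiki/Infielder">Infielder</a> '
          '<a href="https://example.org/">external</a> '
          '<a class="x" href="/wiki/Home_run">HR</a></p>')
print(extract_wiki_links(sample))  # ['Infielder', 'Home_run']
```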

c) changes these links from relative links to full links (do this by adding the domain of the URL of the page from which you scraped the link -- for this exercise it's http://www.wikipedia.org);

d) adds these links to the frontier/deque (remember, this is a queue, so consider to which side of the deque you'll add them to preserve FIFO order) -- and be sure not to allow duplicates.
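
For step c), one possible way to turn a relative link into a full link is `urllib.parse.urljoin`, which resolves the link against the URL of the page it was scraped from. The sketch below assumes the raw `href` values (with the leading `/wiki/`) are still available; the variable names are illustrative only.

```python
from urllib.parse import urljoin

# URL of the page the links were scraped from
page_url = "https://en.wikipedia.org/wiki/Fran_Mullins"
# raw href values as they appear in the HTML (assumed for this sketch)
relative = ["/wiki/Infielder", "/wiki/Home_run"]

# urljoin keeps the scheme and domain of page_url and swaps in the new path
absolute = [urljoin(page_url, href) for href in relative]
print(absolute)
# ['https://en.wikipedia.org/wiki/Infielder', 'https://en.wikipedia.org/wiki/Home_run']
```

A link that is already absolute is returned unchanged by `urljoin`, so no separate check for the domain is needed in that case.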


#### To keep FIFO order, use popleft() to remove the oldest link from the left end of the deque
#### and use append() or extend() to add new links to the right end; extendleft() would put them at the front, in reversed order
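
A small sketch of how these `deque` methods behave, using placeholder strings instead of real URLs:

```python
from collections import deque

frontier = deque(["A"])          # "A" stands in for the seed URL
frontier.extend(["B", "C"])      # new links are added on the right
first = frontier.popleft()       # the oldest entry leaves from the left
print(first, list(frontier))     # A ['B', 'C']

other = deque(["A"])
other.extendleft(["B", "C"])     # items are pushed to the front one by one
print(list(other))               # ['C', 'B', 'A']
```

With `extend()` plus `popleft()` the entries come out in the order they went in, which is the FIFO behaviour the exercise asks for.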

In [3]:
def convert(seed):
    queue = getlinks(seed)
    #convert relative links to full links
    queue = ["https://en.wikipedia.org/wiki/" + link for link in queue if "https://en.wikipedia.org/wiki/" not in link]
    #remove the duplicates while preserving the order (dict keys keep insertion order)
    queue = list(dict.fromkeys(queue))
    #change list to deque
    return deque(queue)
print(convert("https://en.wikipedia.org/wiki/Fran_Mullins"))

deque(['https://en.wikipedia.org/wiki/File:Fran_Mullins_Giants.jpg', 'https://en.wikipedia.org/wiki/Infielder', 'https://en.wikipedia.org/wiki/Oakland,_California', 'https://en.wikipedia.org/wiki/1980_Major_League_Baseball_season', 'https://en.wikipedia.org/wiki/Chicago_White_Sox', 'https://en.wikipedia.org/wiki/1986_Major_League_Baseball_season', 'https://en.wikipedia.org/wiki/Cleveland_Indians', 'https://en.wikipedia.org/wiki/Batting_average_(baseball)', 'https://en.wikipedia.org/wiki/Home_run', 'https://en.wikipedia.org/wiki/Run_batted_in', 'https://en.wikipedia.org/wiki/San_Francisco_Giants', 'https://en.wikipedia.org/wiki/1984_Major_League_Baseball_season', 'https://en.wikipedia.org/wiki/Professional_baseball', 'https://en.wikipedia.org/wiki/Major_League_Baseball', 'https://en.wikipedia.org/wiki/Santa_Clara_University', 'https://en.wikipedia.org/wiki/College_baseball', 'https://en.wikipedia.org/wiki/Santa_Clara_Broncos', 'https://en.wikipedia.org/wiki/Cincinnati_Reds', 'https://en

In [4]:
#download each page in the queue and save it as an html file
#(the function is defined but not called here, to avoid fetching every link)
def download(queue):
    for each in queue:
        with urllib.request.urlopen(each) as web_page:
            contents = web_page.read().decode(errors="replace")
        #name the file after the last path segment of the URL
        with open("{}.html".format(os.path.basename(each)), "w", encoding="utf-8") as file_out:
            file_out.write(contents)
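
Steps a) to d) can be tied together in a single loop. The following is a sketch under stated assumptions, not the notebook's own implementation: `crawl`, `fetch_links`, `max_pages`, and `fake_web` are hypothetical names, and the link lookup is passed in as a function so the example runs without network access.

```python
from collections import deque

def crawl(seed, fetch_links, max_pages=10):
    """Visit pages in FIFO order, never queueing the same URL twice."""
    frontier = deque([seed])   # URLs waiting to be downloaded
    seen = {seed}              # every URL that was ever queued
    visited = []               # URLs in the order they were processed
    while frontier and len(visited) < max_pages:
        url = frontier.popleft()          # oldest entry first
        visited.append(url)
        for link in fetch_links(url):
            if link not in seen:          # skip duplicates
                seen.add(link)
                frontier.append(link)     # new links go to the right
    return visited

# made-up link graph standing in for real pages
fake_web = {"A": ["B", "C"], "B": ["C", "D"], "C": ["A"], "D": []}
order = crawl("A", lambda url: fake_web.get(url, []))
print(order)  # ['A', 'B', 'C', 'D']
```

In the notebook, a function such as `convert` above could plausibly be passed as `fetch_links`, since it returns the full links found on one page; `max_pages` is only there to keep the crawl from running indefinitely.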