In [1]:
# Import the requests library
import requests

In [2]:
# Assign the web page url to a variable 
wiki_home = "https://en.wikipedia.org/wiki/Main_page"

In [3]:
# use the get method from the requests library to get a response
response = requests.get(wiki_home)

In [4]:
# create a status check function
def status_check(r):
    if r.status_code==200:
        print("Success!")
        return 1
    else:
        print("Failed!")
        return -1

In [5]:
status_check(response)

Success!


1

In [6]:
# return the encoding that requests inferred for the response
def encoding_check(r):
    return r.encoding

In [7]:
encoding_check(response)

'UTF-8'

In [8]:
# decode the raw response bytes into a string using the given encoding
def decode_content(r, encoding):
    return r.content.decode(encoding)

contents = decode_content(response, encoding_check(response))
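
Note that `r.encoding` can be `None` when the server does not declare a charset, in which case `decode(None)` raises a `TypeError`. Below is a minimal sketch of a more defensive decode step. It works on plain bytes so it runs without a network call. The helper name `safe_decode` and the UTF-8 default are assumptions made for illustration and are not part of the notebook above.

```python
def safe_decode(raw, encoding=None, default="utf-8"):
    """Decode bytes to str, falling back to `default` when no encoding is given.

    Bytes that cannot be decoded are replaced with U+FFFD instead of raising.
    """
    return raw.decode(encoding or default, errors="replace")


# hypothetical sample bytes standing in for response.content
sample = "Poděbrady".encode("utf-8")
print(safe_decode(sample, "utf-8"))
print(safe_decode(sample, None))
```

With the objects from the cells above, this could be called as `safe_decode(response.content, response.encoding)`.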

In [9]:
type(contents)

str

In [10]:
len(contents)

79120

In [11]:
# show the first 1000 characters
contents[:1000]

'<!DOCTYPE html>\n<html class="client-nojs" lang="en" dir="ltr">\n<head>\n<meta charset="UTF-8"/>\n<title>Wikipedia, the free encyclopedia</title>\n<script>document.documentElement.className="client-js";RLCONF={"wgBreakFrames":!1,"wgSeparatorTransformTable":["",""],"wgDigitTransformTable":["",""],"wgDefaultDateFormat":"dmy","wgMonthNames":["","January","February","March","April","May","June","July","August","September","October","November","December"],"wgRequestId":"ab9247cb-4b3a-4e81-b11c-ba9cee5502e6","wgCSPNonce":!1,"wgCanonicalNamespace":"","wgCanonicalSpecialPageName":!1,"wgNamespaceNumber":0,"wgPageName":"Main_Page","wgTitle":"Main Page","wgCurRevisionId":969106986,"wgRevisionId":969106986,"wgArticleId":15580374,"wgIsArticle":!0,"wgIsRedirect":!1,"wgAction":"view","wgUserName":null,"wgUserGroups":["*"],"wgCategories":[],"wgPageContentLanguage":"en","wgPageContentModel":"wikitext","wgRelevantPageName":"Main_Page","wgRelevantArticleId":15580374,"wgIsProbablyEditable":!1,"wgRelevant

In [12]:
# import BeautifulSoup and parse the whole HTML string with Python's built-in 'html.parser'
from bs4 import BeautifulSoup
soup = BeautifulSoup(contents, 'html.parser')

In [13]:
txt_dump = soup.text

In [14]:
type(txt_dump)

str

In [15]:
len(txt_dump)

9136

##### The text dump (9,136 characters) is much shorter than the raw HTML string (79,120 characters). BeautifulSoup has parsed the HTML, and `soup.text` returns only the text content of the document, leaving out the tags and attributes.
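
The same effect can be seen on a tiny snippet. The sketch below does not use bs4. It uses the standard library's `HTMLParser` (the parser that the `'html.parser'` argument selects) to collect only the text nodes, so that the size difference is easy to inspect. The `TextCollector` class and the sample HTML are made up for illustration.

```python
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Collect only the text nodes of an HTML document."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        # called for text between tags; tags and attributes never reach here
        self.parts.append(data)


# hypothetical sample markup
html = '<div id="intro" class="box"><p>Welcome to <a href="/wiki">Wikipedia</a>.</p></div>'

collector = TextCollector()
collector.feed(html)
collector.close()
text = "".join(collector.parts)

print(text)
print(len(html), len(text))
```

Most of the characters in the sample are tag names and attributes, which is why the extracted text is so much shorter than the markup.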

In [16]:
# show characters 100 to 999 of this text (the first 100 are skipped)
txt_dump[100:1000]

"ree encyclopedia\n\xa0\xa0(Redirected from Main page)\n\n\nJump to navigation\nJump to search\n\n\n\nWelcome to Wikipedia,\nthe free encyclopedia that anyone can edit.\n6,181,201 articles in English\n\n\nArts\nBiography\nGeography\nHistory\nMathematics\nScience\nSociety\nTechnology\nAll portals\n\n\n\n\n\nFrom today's featured article\n\n\nMounted Cetiosauriscus skeleton\n\nCetiosauriscus was a sauropod dinosaur that lived between 166 and 164\xa0million years ago, during the Middle Jurassic. It was a herbivore with a moderately long tail and long forelimbs, compared to other sauropods. It has been estimated at about 15 metres (49\xa0ft) long and between 4 and 10 tonnes (3.9 and 9.8 long tons; 4.4 and 11.0 short tons) in weight. Its only known fossil includes a hindlimb and most of the rear half of a skeleton. Found in Cambridgeshire, England, in the 1890s, it was described by Arthur Smith Woodward in 1905 as a new specimen of the s"

In [17]:
print(txt_dump[100:1000])

ree encyclopedia
  (Redirected from Main page)


Jump to navigation
Jump to search



Welcome to Wikipedia,
the free encyclopedia that anyone can edit.
6,181,201 articles in English


Arts
Biography
Geography
History
Mathematics
Science
Society
Technology
All portals





From today's featured article


Mounted Cetiosauriscus skeleton

Cetiosauriscus was a sauropod dinosaur that lived between 166 and 164 million years ago, during the Middle Jurassic. It was a herbivore with a moderately long tail and long forelimbs, compared to other sauropods. It has been estimated at about 15 metres (49 ft) long and between 4 and 10 tonnes (3.9 and 9.8 long tons; 4.4 and 11.0 short tons) in weight. Its only known fossil includes a hindlimb and most of the rear half of a skeleton. Found in Cambridgeshire, England, in the 1890s, it was described by Arthur Smith Woodward in 1905 as a new specimen of the s


In [18]:
# extracting text from sections
idx1 = txt_dump.find("From today's featured article")
idx2 = txt_dump.find("Recently featured")

print(txt_dump[idx1 + len("From today's featured article"):idx2])




Mounted Cetiosauriscus skeleton

Cetiosauriscus was a sauropod dinosaur that lived between 166 and 164 million years ago, during the Middle Jurassic. It was a herbivore with a moderately long tail and long forelimbs, compared to other sauropods. It has been estimated at about 15 metres (49 ft) long and between 4 and 10 tonnes (3.9 and 9.8 long tons; 4.4 and 11.0 short tons) in weight. Its only known fossil includes a hindlimb and most of the rear half of a skeleton. Found in Cambridgeshire, England, in the 1890s, it was described by Arthur Smith Woodward in 1905 as a new specimen of the species Cetiosaurus leedsi, which was moved to the new genus Cetiosauriscus in 1927 by Friedrich von Huene. In 1980, Alan Charig proposed the current name Cetiosauriscus stewarti. The fossil was found in the marine deposits of the Oxford Clay Formation alongside many invertebrate groups, marine ichthyosaurs, plesiosaurs and crocodylians, a single pterosaur, and various dinosaurs, including an ankylos
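
One caveat with the slicing approach: `str.find` returns -1 when a marker is missing, so the slice would silently return the wrong span instead of failing. A minimal sketch of a helper that checks for this is shown below. The name `extract_between` and the sample string are assumptions for illustration.

```python
def extract_between(text, start_marker, end_marker):
    """Return the text between two markers, or None if either is missing."""
    start = text.find(start_marker)
    if start == -1:
        return None
    start += len(start_marker)
    end = text.find(end_marker, start)  # search only after the start marker
    if end == -1:
        return None
    return text[start:end].strip()


# hypothetical sample standing in for txt_dump
sample = "Intro\nFrom today's featured article\n\nSome summary.\nRecently featured: A, B"
print(extract_between(sample, "From today's featured article", "Recently featured"))
print(extract_between(sample, "No such heading", "Recently featured"))
```

With the notebook's data this could be called as `extract_between(txt_dump, "From today's featured article", "Recently featured")`.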

In [19]:
# using bs4's search methods to extract relevant text
# find returns the first matching Tag (or None); find_all returns a ResultSet of Tag objects
# each Tag exposes the text it contains through its .text attribute
text_list = []  # empty list to collect the text of each <ul>

# look up the div with id 'mp-otd' directly instead of scanning every div
otd = soup.find('div', id='mp-otd')
if otd is not None:
    for ul in otd.find_all('ul'):
        text_list.append(ul.text)

In [20]:
for i in text_list:
    print(i)
    print("="*100)

1453 – Ladislaus the Posthumous (pictured) was crowned King of Bohemia, although George of Poděbrady remained in control of the government.
1707 – The Hōei earthquake ruptured all segments of the Nankai megathrust simultaneously – the only earthquake known to have done this.
1918 – The Czechoslovak provisional government declared the country's independence from Austria-Hungary, forming the First Czechoslovak Republic in Prague.
1940 – World War II: Italy invaded Greece after Greek prime minister Ioannis Metaxas rejected Benito Mussolini's ultimatum demanding the cession of Greek territory.
1995 – A fire broke out on a Baku Metro train in Azerbaijan's capital, killing 289 people and injuring 270 others in the world's deadliest subway disaster.
Peter Tordenskjold  (b. 1691)Charlotte Turner Smith  (d. 1806)Carlos Guastavino  (d. 2000)
October 27
October 28
October 29
Archive
By email
List of days of the year
