# Lesson 7: Advanced Web Scraping and Data Gathering
## Topic 1: Basics of web-scraping using `requests` and `BeautifulSoup` libraries

### Import `requests` library

In [0]:
!pip install beautifulsoup4

In [0]:
import requests

### Exercise 1: Use `requests` to get a response from the Wikipedia home page

In [0]:
# First assign the URL of Wikipedia home page to a string 
wiki_home = "https://en.wikipedia.org/wiki/Main_Page"

In [0]:
# Use the 'get' method from requests library to get a response
response = requests.get(wiki_home)

In [0]:
# What is this 'response' object anyway?
type(response)

In [0]:
for r in response:  # iterating over a Response yields its raw content in byte chunks
    print(r)

### Exercise 2: Write a small function to check the status of web request

Small helper/utility functions like this one are incredibly useful in complex projects.

Start building **a habit of writing small functions to accomplish small modular tasks**, instead of writing long scripts, which are hard to debug and track.

In [0]:
def status_check(r):
    if r.status_code==200:
        print("Success!")
        return 1
    else:
        print("Failed!")
        return -1

In [0]:
status_check(response)
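
As a hedged alternative to comparing `status_code` by hand, `requests` responses also expose an `ok` property and a `raise_for_status()` method. The sketch below builds `Response` objects offline, purely for illustration, so it runs without a network call. `describe_status` is a hypothetical helper name, not part of the lesson.

```python
import requests

def describe_status(r):
    """Return a short description of a response's status (hypothetical helper)."""
    try:
        # raise_for_status() raises HTTPError for 4xx and 5xx status codes
        r.raise_for_status()
    except requests.exceptions.HTTPError as err:
        return f"Failed: {err}"
    return f"Success: {r.status_code}"

# Build responses by hand so the example needs no network access
good = requests.Response()
good.status_code = 200

bad = requests.Response()
bad.status_code = 404

print(describe_status(good))
print(describe_status(bad))
print(good.ok, bad.ok)
```

Returning a message (or raising) instead of the integers `1` and `-1` is one possible design choice; the lesson's `status_check` works just as well for the exercises that follow.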

### Exercise 3: Write a small function to check the encoding of the web page

In [0]:
def encoding_check(r):
    return (r.encoding)

In [0]:
encoding_check(response)

### Exercise 4: Write a small function to decode the contents of the `response`

In [0]:
def decode_content(r,encoding):
    # r.encoding is None when the server declares no charset, so fall back to UTF-8
    return r.content.decode(encoding or 'utf-8')

In [0]:
contents = decode_content(response,encoding_check(response))
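
Decoding can fail if the declared encoding is missing, unknown, or does not match the bytes. The following is a minimal sketch of a more tolerant decoder that uses only the standard library. `safe_decode` and its `fallback` parameter are hypothetical names introduced for illustration, and the sample bytes are made up.

```python
from typing import Optional

def safe_decode(raw: bytes, encoding: Optional[str], fallback: str = "utf-8") -> str:
    """Decode bytes, tolerating a missing or wrong encoding label (hypothetical helper)."""
    try:
        return raw.decode(encoding or fallback)
    except (LookupError, UnicodeDecodeError):
        # Unknown codec name, or bytes that do not match it: replace bad bytes
        return raw.decode(fallback, errors="replace")

# Made-up sample standing in for response.content
sample = "Wikipédia".encode("utf-8")

print(safe_decode(sample, "utf-8"))
print(safe_decode(sample, None))
print(safe_decode(sample, "no-such-codec"))
```

With a real response, this could be called as `safe_decode(response.content, response.encoding)`. The `response.text` attribute of `requests` performs a similar decoding step for you.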

#### What is the type of the contents?

In [0]:
type(contents)

#### Fantastic! Finally we got a string object. Did you see how easy it was to read text from a popular webpage like Wikipedia?

### Exercise 5: Check the length of the text you got back and print some selected portions

In [0]:
len(contents)

In [0]:
print(contents[:1000])

In [0]:
print(contents[15000:16000])

### Exercise 6: Use the `BeautifulSoup` package to parse the raw HTML text more meaningfully and search for particular text

In [0]:
from bs4 import BeautifulSoup

In [0]:
soup = BeautifulSoup(contents, 'html.parser')

#### What is this new `soup` object?

In [0]:
type(soup)
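
To get a feel for what a `BeautifulSoup` object offers, here is a small sketch that parses a hand-written HTML snippet instead of the live Wikipedia page. The snippet and the names `soup_demo`, `page_title`, and `links` are assumptions made for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML snippet standing in for a downloaded page
html = """
<html>
  <head><title>Demo page</title></head>
  <body>
    <a href="/wiki/Python">Python</a>
    <a href="/wiki/HTML">HTML</a>
  </body>
</html>
"""

soup_demo = BeautifulSoup(html, "html.parser")

# Tags are reachable as attributes; .string gives the text inside a tag
page_title = soup_demo.title.string

# find_all returns every matching tag; .get reads an attribute value
links = [a.get("href") for a in soup_demo.find_all("a")]

print(page_title)
print(links)
```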

### Exercise 7: Can we somehow read intelligible text from this `soup` object?

In [0]:
txt_dump=soup.text

In [0]:
type(txt_dump)

In [0]:
len(txt_dump)

In [0]:
print(txt_dump[0:])
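
The raw `.text` dump runs the text of neighbouring tags together. As a hedged sketch, `get_text()` with its `separator` and `strip` arguments can give a more readable result. The HTML below is a made-up fragment, not the real Wikipedia markup.

```python
from bs4 import BeautifulSoup

# Made-up fragment; the real page is far larger
html = "<div><h2>On this day</h2><ul><li> Event one </li><li>Event two</li></ul></div>"
soup_demo = BeautifulSoup(html, "html.parser")

# .text concatenates every string exactly as it appears in the markup
raw_text = soup_demo.text

# separator puts a newline between strings; strip trims each one
clean_text = soup_demo.get_text(separator="\n", strip=True)

print(repr(raw_text))
print(clean_text)
```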

### Exercise 8: Extract the text from the section *'From today's featured article'*

In [0]:
# First extract the start and end indices of the text we are interested in (find returns -1 if the text is absent)
idx1=txt_dump.find("From today's featured article")
idx2=txt_dump.find("Recently featured")

In [0]:
print(txt_dump[idx1+len("From today's featured article"):idx2])
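
Because `str.find` returns `-1` when a marker is absent, slicing with its result can silently produce the wrong text. The sketch below wraps the same idea in a small function that checks for missing markers. `text_between` is a hypothetical helper, and the sample string is invented for the example.

```python
def text_between(text, start_marker, end_marker):
    """Return the text between two markers, or None if either is missing (hypothetical helper)."""
    start = text.find(start_marker)
    if start == -1:
        return None
    start += len(start_marker)
    # Look for the end marker only after the start marker
    end = text.find(end_marker, start)
    if end == -1:
        return None
    return text[start:end].strip()

# Invented sample standing in for txt_dump
sample = "News ... From today's featured article A short summary. Recently featured: X"

print(text_between(sample, "From today's featured article", "Recently featured"))
print(text_between(sample, "No such heading", "Recently featured"))
```

With the lesson's variables, this could be called as `text_between(txt_dump, "From today's featured article", "Recently featured")`.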

### Exercise 9: Try to extract the important historical events that happened on today's date...

In [0]:
idx3=txt_dump.find("On this day")

In [0]:
print(txt_dump[idx3+len("On this day"):idx3+len("On this day")+1000])

### Exercise 10: Use a more advanced BS4 technique to extract relevant text without guessing or hard-coding where to look

In [0]:
for d in soup.find_all('div'):
    if d.get('id') == 'mp-otd':
        for i in d.find_all('ul'):
            print(i.text)

In [0]:
text_list=[]
for d in soup.find_all('div'):
    if d.get('id') == 'mp-otd':
        for i in d.find_all('ul'):
            text_list.append(i.text)

In [0]:
len(text_list)

In [0]:
for i in text_list:
    print(i)
    print('-'*100)

In [0]:
print(text_list[0])
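
Looping over every `div` works, but `find` can also look up an element by its `id` directly. Here is a minimal sketch on a made-up fragment that imitates the `mp-otd` structure used above. The HTML content and the helper name `first_list_text` are assumptions for illustration.

```python
from bs4 import BeautifulSoup

# Made-up fragment imitating the layout of the page's "On this day" block
html = """
<div id="mp-tfa"><ul><li>Not this list</li></ul></div>
<div id="mp-otd">
  <ul><li>1901: First event</li><li>1950: Second event</li></ul>
  <ul><li>Footer links</li></ul>
</div>
"""

def first_list_text(soup, div_id):
    """Return the items of the first <ul> inside the div with the given id, or None."""
    section = soup.find("div", id=div_id)
    if section is None:
        return None
    first_ul = section.find("ul")
    if first_ul is None:
        return None
    return [li.get_text(strip=True) for li in first_ul.find_all("li")]

soup_demo = BeautifulSoup(html, "html.parser")
print(first_list_text(soup_demo, "mp-otd"))
print(first_list_text(soup_demo, "no-such-id"))
```

Returning `None` when the section is missing lets the caller decide how to react, instead of failing on an empty list.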

### Functionalize this process, i.e., write a compact function to extract the "On this day" text from the Wikipedia home page

In [0]:
def wiki_on_this_day(url="https://en.wikipedia.org/wiki/Main_Page"):
    """
    Extracts the text of the "On this day" section of the Wikipedia home page.
    Accepts the Wikipedia home page URL as a string; a default URL is provided.
    Relies on the decode_content() and encoding_check() helpers defined earlier.
    """
    import requests
    from bs4 import BeautifulSoup
    
    wiki_home = str(url)
    response = requests.get(wiki_home)
    
    def status_check(r):
        if r.status_code==200:
            return 1
        else:
            return -1
    
    status = status_check(response)
    if status==1:
        contents = decode_content(response,encoding_check(response))
    else:
        print("Sorry could not reach the web page!")
        return -1
    
    soup = BeautifulSoup(contents, 'html.parser')
    text_list=[]
    
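    for d in soup.find_all('div'):
        if d.get('id') == 'mp-otd':
            for i in d.find_all('ul'):
                text_list.append(i.text)
    
    # Guard against a missing section so we never index an empty list
    if not text_list:
        print("Sorry could not find the 'On this day' section!")
        return -1
    return (text_list[0])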

In [0]:
print(wiki_on_this_day())

#### A wrong URL produces an error message as expected

In [0]:
print(wiki_on_this_day("https://en.wikipedia.org/wiki/Main_Page1"))
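
A bad status code is only one way a request can fail. Malformed URLs, timeouts, and connection problems raise exceptions instead. The sketch below shows one possible way to catch them. `fetch_html` is a hypothetical helper, and the `User-Agent` string and the 10-second timeout are placeholder assumptions, not values from the lesson.

```python
import requests

def fetch_html(url, timeout=10):
    """Return the page text, or None if the request fails (hypothetical helper)."""
    # Placeholder User-Agent; an assumption, not a required value
    headers = {"User-Agent": "lesson7-scraper-example/0.1"}
    try:
        response = requests.get(url, headers=headers, timeout=timeout)
        # Turn 4xx and 5xx status codes into exceptions as well
        response.raise_for_status()
    except requests.exceptions.RequestException as err:
        # RequestException is the base class for errors raised by requests
        print(f"Request failed: {err}")
        return None
    return response.text

# A URL without a scheme fails before any network traffic is sent
result = fetch_html("not-a-valid-url")
print(result)
```

Keeping the download step in its own function means the parsing code can be exercised on saved or hand-written HTML without touching the network.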