Scraping Wikipedia Page - Import required libraries
In this step we will start by importing the libraries required to scrape the Wikipedia page.

In [1]:
import requests

Scraping Wikipedia Page - Adding browser agent string
In this step we will set the browser agent string, more commonly called the user agent string. It identifies which browser is being used, its version, and the operating system it runs on.

You can get your browser agent string from here:

https://udger.com/resources/online-parser

We set a browser-like user agent so that the server treats our request as if it came from a browser rather than a Python program. That makes it more likely that the page returns the same content to our program that it would return to a web browser.
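To see why the header matters, it can help to look at what requests would announce by default. The following is a minimal sketch that uses `requests.utils.default_user_agent()` and a prepared (unsent) request, so nothing touches the network. The `example.com` URL and the shortened agent string are placeholders, not values from this tutorial.

```python
import requests

# What requests announces itself as when no User-Agent is supplied.
default_ua = requests.utils.default_user_agent()
print(default_ua)  # starts with "python-requests/" followed by the installed version

# Placeholder values for illustration only.
browser_ua = "Mozilla/5.0 (X11; Linux x86_64) ExampleBrowser/1.0"
prepared = requests.Request(
    "GET", "https://example.com/", headers={"User-Agent": browser_ua}
).prepare()

# The prepared request carries the browser-like string instead of the default.
print(prepared.headers["User-Agent"])
```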

In [2]:
user_agent = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.90 Safari/537.36'

Scraping Wikipedia Page - URL to scrape
Now we will define the Wikipedia URL that we will scrape. We will also set the User-Agent request header, a string that lets servers and network peers identify the application, operating system, vendor, and/or version of the requesting user agent.

In [3]:
url = "https://en.wikipedia.org/wiki/List_of_state_and_union_territory_capitals_in_India"
headers = {'User-Agent': user_agent}


Scraping Wikipedia Page - Get URL content
Now we will use the get function from the requests module to request the web page. It sends a GET request to the specified URL and returns a requests.Response object. You can read more about requests at the link below:

https://requests.readthedocs.io/en/master/user/quickstart/
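One detail worth checking is how the headers dictionary reaches `requests.get`. Its signature is `get(url, params=None, **kwargs)`, so a dictionary passed as the second positional argument is treated as query parameters, not as headers. The sketch below illustrates the difference with prepared (unsent) requests; the agent string is a made-up placeholder.

```python
import requests

url = "https://en.wikipedia.org/wiki/List_of_state_and_union_territory_capitals_in_India"
headers = {"User-Agent": "example-agent/1.0"}  # placeholder value

# Passed by keyword: the dictionary ends up in the request headers.
as_headers = requests.Request("GET", url, headers=headers).prepare()

# Passed as params (what a second positional argument to requests.get means):
# the dictionary is encoded into the URL's query string instead.
as_params = requests.Request("GET", url, params=headers).prepare()

print(as_headers.headers["User-Agent"])
print(as_params.url)
print("User-Agent" in as_params.headers)
```

Passing `headers=headers` by keyword avoids the ambiguity.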

Next, we will import the BeautifulSoup class. Beautiful Soup is a Python package for parsing HTML and XML documents. It creates a parse tree for parsed pages that can be used to extract data from HTML, which is useful for web scraping.

To parse a document, pass it into the BeautifulSoup constructor. You can pass in a string or an open file handle. First, the document is converted to Unicode, and HTML entities are converted to Unicode characters. Beautiful Soup then parses the document using the best available parser. It will use an HTML parser unless you specifically tell it to use an XML parser.
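As a small illustration of the parsing step described above, the sketch below feeds a hand-written HTML string (hypothetical, not taken from the Wikipedia page) to the BeautifulSoup constructor with the built-in `html.parser` and reads two elements back from the parse tree. Note how the HTML entities come back as ordinary Unicode characters.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet used only for illustration.
sample_html = (
    "<html><head><title>Caf&eacute; &amp; Bar</title></head>"
    "<body><p>Hello</p></body></html>"
)

soup_demo = BeautifulSoup(sample_html, "html.parser")

# Entities such as &eacute; and &amp; are converted to Unicode characters.
title_text = soup_demo.title.get_text()
print(title_text)  # Café & Bar

# Tags can be reached as attributes of the parse tree.
paragraph_text = soup_demo.p.get_text()
print(paragraph_text)  # Hello
```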

In [4]:
page = requests.get(url, headers=headers)


In [5]:
from bs4 import BeautifulSoup
soup = BeautifulSoup(page.content, 'html.parser')

In [6]:
print(page.content)

b'<!DOCTYPE html>\n<html class="client-nojs" lang="en" dir="ltr">\n<head>\n<meta charset="UTF-8"/>\n<title>List of state and union territory capitals in India - Wikipedia</title>\n<script>document.documentElement.className="client-js";RLCONF={"wgBreakFrames":!1,"wgSeparatorTransformTable":["",""],"wgDigitTransformTable":["",""],"wgDefaultDateFormat":"dmy","wgMonthNames":["","January","February","March","April","May","June","July","August","September","October","November","December"],"wgRequestId":"749a42ee-073f-4ce2-9b17-3ef7e0cd4987","wgCSPNonce":!1,"wgCanonicalNamespace":"","wgCanonicalSpecialPageName":!1,"wgNamespaceNumber":0,"wgPageName":"List_of_state_and_union_territory_capitals_in_India","wgTitle":"List of state and union territory capitals in India","wgCurRevisionId":1012830803,"wgRevisionId":1012830803,"wgArticleId":2371868,"wgIsArticle":!0,"wgIsRedirect":!1,"wgAction":"view","wgUserName":null,"wgUserGroups":["*"],"wgCategories":["Use dmy dates from January 2021","Articles wit

Scraping Wikipedia Page - Get table data
In this final step, we will try to scrape all the data from a table present in the following Wikipedia page:

https://en.wikipedia.org/wiki/List_of_state_and_union_territory_capitals_in_India

To understand how this works, you need to see how the page is formatted in HTML. Below is a screenshot of the page's markup; you can also view it on your own computer by opening the web page in Chrome and pressing Ctrl+Shift+I:



In [7]:
all_tables = soup.find_all('table')
right_table = soup.find('table', class_='wikitable sortable plainrowheaders')
print(right_table)

<table class="wikitable sortable plainrowheaders">
<tbody><tr>
<th scope="col">No.
</th>
<th scope="col">State
</th>
<th scope="col">Administrative / Executive
<p>capital 
</p>
</th>
<th scope="col">Legislative capital
</th>
<th scope="col">Judicial capital
</th>
<th scope="col">Year of establishment
</th>
<th scope="col">Former capital
</th></tr>
<tr>
<td>1
</td>
<th scope="row"><a href="/wiki/Andhra_Pradesh" title="Andhra Pradesh">Andhra Pradesh</a>
</th>
<td>Visakhapatnam
</td>
<td><a href="/wiki/Amaravati" title="Amaravati">Amaravati</a><sup class="reference" id="cite_ref-capitals_3-0"><a href="#cite_note-capitals-3">[3]</a></sup>
</td>
<td><a href="/wiki/Kurnool" title="Kurnool">Kurnool</a><sup class="reference" id="cite_ref-capitals_3-1"><a href="#cite_note-capitals-3">[3]</a></sup>
</td>
<td>1956
</td>
<td><a href="/wiki/Hyderabad" title="Hyderabad">Hyderabad</a><sup class="reference" id="cite_ref-4"><a href="#cite_note-4">[a]</a></sup>(1956–2017)
<p><a href="/wiki/Amaravati" ti
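The lookup above passes the full class string to `class_`. Beautiful Soup compares a multi-word `class_` value against the exact attribute text, so the match depends on the classes appearing in that order. The sketch below shows this on hypothetical markup, alongside a CSS selector, which is one order-independent alternative; it is an illustration, not a claim about how the Wikipedia page will change.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the same three classes, written in two different orders.
sample_html = """
<table class="wikitable sortable plainrowheaders"><tr><td>first</td></tr></table>
<table class="plainrowheaders wikitable sortable"><tr><td>second</td></tr></table>
"""
demo = BeautifulSoup(sample_html, "html.parser")

# A multi-word class_ string is compared against the exact attribute value.
exact = demo.find_all("table", class_="wikitable sortable plainrowheaders")
print(len(exact))  # 1

# A CSS selector matches each class separately, regardless of order.
by_css = demo.select("table.wikitable.sortable.plainrowheaders")
print(len(by_css))  # 2
```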

In [8]:
#Now define various lists to store the data from this table

A = [] # For Number
B = [] # For State/UT
C = [] # For Administrative capitals
D = [] # For Legislative capitals
E = [] # For Judicial capitals
F = [] # For Year capital was established
G = [] # The Former capital

In [9]:
#Finally we will store the data in the respective lists from the Wikipedia table
#find_all and string= replace the deprecated findAll and text= spellings
for row in right_table.find_all("tr"):
    cells = row.find_all('td')
    states = row.find_all('th') #To store second column data (the header row also matches)

    if 0 < len(cells):
        A.append(cells[0].find(string=True).rstrip())
    if 0 < len(states):
        B.append(states[0].find(string=True).rstrip())
    if 1 < len(cells):
        C.append(cells[1].find(string=True).rstrip())
    if 2 < len(cells):
        D.append(cells[2].find(string=True).rstrip())
    if 3 < len(cells):
        E.append(cells[3].find(string=True).rstrip())
    if 4 < len(cells):
        F.append(cells[4].find(string=True).rstrip())
    if 5 < len(cells):
        G.append(cells[5].find(string=True).rstrip())


In [10]:
print(A)

print(B)

['1', '2', '3', '4', '5', '6', '7', '8', '9', '10', '11', '12', '13', '14', '15', '16', '17', '18', '19', '20', '21', '22', '23', '24', '25', '26', '27', '28']
['No.', 'Andhra Pradesh', 'Arunachal Pradesh', 'Assam', 'Bihar', 'Chhattisgarh', 'Goa', 'Gujarat', 'Haryana', 'Himachal Pradesh', 'Jharkhand', 'Karnataka', 'Kerala', 'Madhya Pradesh', 'Maharashtra', 'Manipur', 'Meghalaya', 'Mizoram', 'Nagaland', 'Odisha', 'Punjab', 'Rajasthan', 'Sikkim', 'Tamil Nadu', 'Telangana', 'Tripura', 'Uttar Pradesh', 'Uttarakhand', 'West Bengal']
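In the output above, list B starts with 'No.' and has one more entry than list A, because the header row consists only of `th` cells and its first `th` is picked up as if it were a state. One possible way to keep the lists aligned is to skip rows that have no `td` cells. The sketch below tries this on a small hand-written table that only mimics the shape of the Wikipedia one; the variable names are illustrative.

```python
from bs4 import BeautifulSoup

# Hand-written sample shaped like the target table (not scraped data).
sample_html = """
<table class="wikitable sortable plainrowheaders">
<tr><th scope="col">No.</th><th scope="col">State</th><th scope="col">Capital</th></tr>
<tr><td>1</td><th scope="row"><a href="/wiki/A">Andhra Pradesh</a></th><td>Visakhapatnam</td></tr>
<tr><td>2</td><th scope="row"><a href="/wiki/B">Assam</a></th><td><a href="/wiki/C">Dispur</a><sup>[3]</sup></td></tr>
</table>
"""

table = BeautifulSoup(sample_html, "html.parser").find("table")

numbers, states = [], []
for row in table.find_all("tr"):
    cells = row.find_all("td")
    if not cells:
        # Header row: it contains only <th> cells, so skip it.
        continue
    numbers.append(cells[0].get_text(strip=True))
    states.append(row.find("th").get_text(strip=True))

print(numbers)  # ['1', '2']
print(states)   # ['Andhra Pradesh', 'Assam']
```

With the header row skipped, both lists have the same length and can be zipped into rows.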


Scraping Amazon Product Reviews - Import the libraries
In the previous assessment, we scraped a Wikipedia page using Beautiful Soup. This time, we will use lxml to scrape the reviews from the Amazon page given below:

https://www.amazon.in/Test-Exclusive-558/product-reviews/B077PWJRFH/?pageNumber=2

lxml is a Pythonic binding for the C libraries libxml2 and libxslt. It is one of the fastest and most feature-rich libraries for processing XML and HTML in Python. With the lxml library, XML and HTML documents can be created, parsed, and queried.

You can find out more about lxml from their official page given below:

https://pypi.org/project/lxml/

In [13]:
from lxml import html
import requests
import pandas as pd

In [14]:
user_agent = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.90 Safari/537.36'

In [16]:
amazon_url = 'https://www.amazon.in/Test-Exclusive-558/product-reviews/B077PWJRFH/?pageNumber=2'
headers = {'User-Agent': user_agent}


Scraping Amazon Product Reviews - Get URL content
Now we will use the get function from the requests module to request the web page. It sends a GET request to the specified URL and returns a requests.Response object. You can read more about requests at the link below:

https://requests.readthedocs.io/en/master/user/quickstart/

Next, we will use the fromstring function from lxml.html to parse the HTML from a string directly into an Element, which is the root element of the parsed tree.
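The sketch below shows the same two steps, parsing with `html.fromstring` and querying with XPath, on a hand-written string so that it runs without a network connection. The markup is hypothetical and only imitates the `data-hook` attributes this tutorial relies on; the real Amazon page may be structured differently.

```python
from lxml import html

# Hypothetical markup imitating the attributes used in this tutorial.
sample_html = """
<html><body>
  <div data-hook="review"><span class="a-profile-name">Reader One</span></div>
  <div data-hook="review"><span class="a-profile-name">Reader Two</span></div>
</body></html>
"""

# fromstring returns the root Element of the parsed tree.
tree = html.fromstring(sample_html)
print(tree.tag)  # html

# An absolute XPath selects every review block in the document.
review_nodes = tree.xpath('//div[@data-hook="review"]')

# A relative XPath (leading ".") searches inside one review at a time.
authors = [
    node.xpath('.//span[@class="a-profile-name"]//text()')
    for node in review_nodes
]
print(authors)  # [['Reader One'], ['Reader Two']]
```

Each relative query returns a list of text nodes, which is why the cells of the resulting table hold lists.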

In [17]:
page = requests.get(amazon_url, headers = headers)

In [18]:
parser = html.fromstring(page.content)


In [19]:
reviews_df = pd.DataFrame()

In [20]:
xpath_reviews = '//div[@data-hook="review"]'
reviews = parser.xpath(xpath_reviews)

In [21]:
xpath_rating  = './/i[@data-hook="review-star-rating"]//text()'
xpath_title   = './/a[@data-hook="review-title"]//text()'
xpath_author  = './/span[@class="a-profile-name"]//text()'
xpath_date    = './/span[@data-hook="review-date"]//text()'
xpath_body    = './/span[@data-hook="review-body"]//text()'
xpath_helpful = './/span[@data-hook="helpful-vote-statement"]//text()'

In [22]:
for review in reviews:
    rating  = review.xpath(xpath_rating)
    title   = review.xpath(xpath_title)
    author  = review.xpath(xpath_author)
    date    = review.xpath(xpath_date)
    body    = review.xpath(xpath_body)
    helpful = review.xpath(xpath_helpful)

    review_dict = {'rating': rating,
                   'title': title,
                   'author': author,
                   'date': date,
                   'body': body,
                   'helpful': helpful}

    # DataFrame.append was removed in pandas 2.0; pd.concat adds the row instead
    reviews_df = pd.concat([reviews_df, pd.DataFrame([review_dict])], ignore_index=True)
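Growing a DataFrame one row at a time copies the data on every iteration, and `DataFrame.append` was deprecated in pandas 1.4 and removed in pandas 2.0. A common alternative is to collect the per-review dictionaries in a plain list and build the frame once after the loop. The sketch below uses made-up rows in place of real reviews.

```python
import pandas as pd

# Made-up rows standing in for the per-review dictionaries.
rows = [
    {"author": "Reader One", "rating": "5.0 out of 5 stars"},
    {"author": "Reader Two", "rating": "3.0 out of 5 stars"},
]

# Build the frame once, after the loop, instead of growing it row by row.
demo_df = pd.DataFrame(rows)
print(demo_df.shape)          # (2, 2)
print(list(demo_df.columns))  # ['author', 'rating']
```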

In [23]:
reviews_df.to_csv("amazon_product_reviews.csv", sep='\t', encoding='utf-8')


In [24]:
print(reviews_df)


                author                                               body  \
0          [RaghulJay]  [\n\n\n\n\n\n\n\n\n\n  \n  \n    , \n  After 5...   
1    [Md ali ganiyani]  [\n\n\n\n\n\n\n\n\n\n  \n  \n    , \n  Please🙏...   
2        [Prashant s.]  [\n\n\n\n\n\n\n\n\n\n  \n  \n    , \n  Best ph...   
3  [Anandakrishnan Gs]  [\n\n\n\n\n\n\n\n\n\n  \n  \n    , \n  Do not ...   
4     [Sagar Narkhede]  [\n\n\n\n\n\n\n\n\n\n  \n  \n    , \n  Initial...   
5          [Sathish R]  [\n\n\n\n\n\n\n\n\n\n  \n  \n    , \n  I purch...   
6     [Piyush Mohanta]  [\n\n\n\n\n\n\n\n\n\n  \n  \n    , \n  Worst p...   
7              [Nidhy]  [\n\n\n\n\n\n\n\n\n\n  \n  \n    , \n  Guys ia...   
8        [Arun Pandey]  [\n\n\n\n\n\n\n\n\n\n  \n  \n    , \n  Excelle...   
9            [Dheeraj]  [\n\n\n\n\n\n\n\n\n\n  \n  \n    , \n  ITS HAN...   

                                   date                          helpful  \
0    [Reviewed in India on 1 July 2020]  [284 people found this helpful]   
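The printed frame shows that every cell holds a list of text nodes, and that the body text is padded with newlines and spaces. One possible cleanup, sketched below with a hypothetical helper named `clean_text` and made-up input, is to join the nodes and collapse the whitespace before storing each value.

```python
def clean_text(nodes):
    """Join a list of text nodes and collapse runs of whitespace."""
    return " ".join(" ".join(nodes).split())


# Made-up text nodes shaped like the XPath results printed above.
raw_body = ["\n\n\n  \n  \n    ", "\n  Works well so far \n"]
raw_author = ["Reader One"]

print(clean_text(raw_body))    # Works well so far
print(clean_text(raw_author))  # Reader One
print(repr(clean_text([])))    # ''
```

Applying such a helper to each XPath result inside the loop would give plain strings in the CSV instead of bracketed lists.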
