# 20MAI0038
# Rahul Laxman Vasanad

# Web Scraping

### Introduction:

At a time when the internet holds so much data, and data is often called the new oil, web scraping has become an important and practical tool for many applications. Web scraping means extracting information from a website; it is also referred to as web harvesting or web data extraction. Copying text from a website and pasting it into a local file is technically web scraping too, but it is a manual task. In general, the term refers to extracting data automatically with the help of web crawlers: scripts that connect to the World Wide Web over HTTP and fetch data in an automated manner.
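To make automatic extraction concrete, the following minimal sketch parses a small, made-up HTML fragment and collects the `alt` text of every image, which is the same attribute the scraper in section 2 reads for book titles. It uses Python's standard-library `html.parser` instead of Beautiful Soup so that it runs without extra packages; the HTML string and the class name `AltTextCollector` are illustrative assumptions, not taken from any real page.

```python
from html.parser import HTMLParser


class AltTextCollector(HTMLParser):
    """Collect the alt text of every <img> tag (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.alt_texts = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the tag being opened
        if tag == "img":
            alt = dict(attrs).get("alt")
            if alt:
                self.alt_texts.append(alt)


# Made-up markup standing in for a downloaded page
SAMPLE_HTML = """
<div class="item"><img src="a.jpg" alt="First Book"></div>
<div class="item"><img src="b.jpg" alt="Second Book"></div>
<div class="item"><img src="c.jpg"></div>
"""

collector = AltTextCollector()
collector.feed(SAMPLE_HTML)
print(collector.alt_texts)  # images without alt text are skipped
```

A real crawler would obtain the markup from an HTTP response rather than from a string constant, but the parsing step has the same shape.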

Whether you are a data scientist, an engineer, or anyone who analyzes large datasets, the ability to scrape data from the web is a useful skill. If you find data on the web and there is no direct way to download it, web scraping with Python lets you extract it into a form that can be imported and used in various ways.

### Some of the practical applications of web scraping could be:

- Gathering resumes of candidates with a specific skill
- Extracting tweets with specific hashtags from Twitter
- Lead generation in marketing
- Scraping product details and reviews from e-commerce websites

### Potential Challenges of Web Scraping:

One challenge you will come across while scraping is the variety of website structures. Each site uses its own templates, so a scraper written for one site rarely generalizes to another.

Another challenge is longevity. Because web developers keep updating their websites, you cannot rely on one scraper for long. Even minor modifications to a page can break the code that fetches the data.
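One possible way to soften both problems is to try several extraction strategies in order and fall back to a placeholder when none of them matches. The sketch below illustrates the idea on plain dictionaries; the function `first_available`, the key names, and the sample records are hypothetical stand-ins for real page elements and selectors.

```python
def first_available(record, extractors, default=None):
    """Return the first non-empty value produced by the extractors.

    Each extractor is a callable that takes the record and returns a value
    or None. Trying them in order lets a scraper survive a layout change as
    long as one of the known layouts still matches.
    """
    for extract in extractors:
        value = extract(record)
        if value:
            return value
    return default


# Two hypothetical layouts: an older one using "author_link",
# and a newer one using "author_text"
author_extractors = [
    lambda rec: rec.get("author_link"),
    lambda rec: rec.get("author_text"),
]

old_layout = {"author_link": "Sudha Murty"}
new_layout = {"author_text": "James Clear"}
unknown_layout = {"byline": "Someone"}

print(first_available(old_layout, author_extractors, default="unknown"))
print(first_available(new_layout, author_extractors, default="unknown"))
print(first_available(unknown_layout, author_extractors, default="unknown"))
```

The scraper in section 2 follows the same pattern by hand when it looks for the author first in a link and then in a plain span.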

# 1. Importing necessary libraries

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
import re
import time
from datetime import datetime
import matplotlib.dates as mdates
import matplotlib.ticker as ticker
from urllib.request import urlopen
from bs4 import BeautifulSoup
import requests

# 2. Scraping the Amazon Best Selling Books

In [2]:
no_pages = 2

def get_data(pageNo):  
    headers = {"User-Agent":"Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:66.0) Gecko/20100101 Firefox/66.0", "Accept-Encoding":"gzip, deflate", "Accept":"text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8", "DNT":"1","Connection":"close", "Upgrade-Insecure-Requests":"1"}

    url = 'https://www.amazon.in/gp/bestsellers/books/ref=zg_bs_pg_' + str(pageNo) + '?ie=UTF8&pg=' + str(pageNo)
    r = requests.get(url, headers=headers)
    content = r.content
    # Name the parser explicitly so the result does not depend on which parsers are installed
    soup = BeautifulSoup(content, 'html.parser')
    #print(soup)

    alls = []
    for d in soup.findAll('div', attrs={'class':'a-section a-spacing-none aok-relative'}):
        #print(d)
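        name = d.find('span', attrs={'class':'zg-text-center-align'})
        # Search for the cover image only when the title container exists;
        # calling find_all on None would raise an AttributeError
        n = name.find_all('img', alt=True) if name is not None else []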
        author = d.find('a', attrs={'class':'a-size-small a-link-child'})
        rating = d.find('span', attrs={'class':'a-icon-alt'})
        users_rated = d.find('a', attrs={'class':'a-size-small a-link-normal'})
        price = d.find('span', attrs={'class':'p13n-sc-price'})

        all1=[]

        if name is not None and n:
            #print(n[0]['alt'])
            all1.append(n[0]['alt'])
        else:
            all1.append("unknown-product")

        if author is not None:
            #print(author.text)
            all1.append(author.text)
        else:
            author = d.find('span', attrs={'class':'a-size-small a-color-base'})
            if author is not None:
                all1.append(author.text)
            else:    
                all1.append('0')

        if rating is not None:
            #print(rating.text)
            all1.append(rating.text)
        else:
            all1.append('-1')

        if users_rated is not None:
            #print(users_rated.text)
            all1.append(users_rated.text)
        else:
            all1.append('0')     

        if price is not None:
            #print(price.text)
            all1.append(price.text)
        else:
            all1.append('0')
        alls.append(all1)    
    return alls
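Because `get_data` is called once per page, it can help to pause between requests and to skip a page that fails to download instead of losing the whole run. The sketch below shows one possible shape for such a loop; `fetch_pages`, its parameters, and the `fake_fetch` function are assumptions made for illustration and are not part of the original notebook.

```python
import time


def fetch_pages(fetch, page_numbers, delay_seconds=1.0):
    """Call fetch(page) for each page, pausing between calls.

    fetch is any callable that returns a list of rows for a page, or raises
    an exception when the download fails. Pages that fail are recorded and
    skipped so that one bad response does not discard the rest.
    """
    rows, failed = [], []
    for position, page in enumerate(page_numbers):
        if position > 0:
            time.sleep(delay_seconds)  # be gentle with the server
        try:
            rows.extend(fetch(page))
        except Exception as error:  # broad on purpose: this is only a sketch
            failed.append((page, str(error)))
    return rows, failed


def fake_fetch(page):
    """Stand-in for a real downloader; page 2 simulates a failed request."""
    if page == 2:
        raise RuntimeError("HTTP 503")
    return [["Book %d-%d" % (page, i)] for i in range(2)]


rows, failed = fetch_pages(fake_fetch, [1, 2, 3], delay_seconds=0.01)
print(len(rows), failed)
```

In a real run, `fetch` could be `get_data` itself, provided it raises on a bad response (for example by calling `r.raise_for_status()` right after `requests.get`).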

In [3]:
# Create the CSV file of scraped data from the Amazon best-selling books web page
results = []
for i in range(1, no_pages+1):
    results.append(get_data(i))
flatten = lambda l: [item for sublist in l for item in sublist]
df = pd.DataFrame(flatten(results),columns=['Book Name','Author','Rating','Customers_Rated', 'Price']) 
df.to_csv('amazon_products.csv', index=False, encoding='utf-8') 
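The `flatten` lambda turns a list of per-page lists into one flat list of rows. As an aside, the standard library's `itertools.chain.from_iterable` does the same job; the sample data below is made up for illustration.

```python
from itertools import chain

# Made-up results for two pages, each a list of rows
results_by_page = [
    [["Book A", "Author A"], ["Book B", "Author B"]],
    [["Book C", "Author C"]],
]

# Same idea as the nested list comprehension, written with the standard library
flat_rows = list(chain.from_iterable(results_by_page))
same_rows = [row for page in results_by_page for row in page]

print(flat_rows)
print(flat_rows == same_rows)
```

Either form removes exactly one level of nesting, so each row stays a list and can be passed straight to `pd.DataFrame`.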

# 3. Loading the created file after scraping

In [4]:
df = pd.read_csv("amazon_products.csv")
df.shape #Display the number of records in the above loaded file

(100, 5)

In [5]:
#Displaying first 5 records
df.head()

|  | Book Name | Author | Rating | Customers_Rated | Price |
|---|---|---|---|---|---|
| 0 | My First Book of Patterns Pencil Control: Patt... | Wonder House Books | 4.4 out of 5 stars | 5170 | ₹ 89.00 |
| 1 | Karma: A Yogi's Guide to Crafting Your Destiny... | Sadhguru (Author) | 4.6 out of 5 stars | 254 | ₹ 260.00 |
| 2 | Ikigai: The Japanese secret to a long and happ... | Héctor García | 4.6 out of 5 stars | 14152 | ₹ 329.00 |
| 3 | My First Library: Boxset of 10 Board Books for... | Wonder House Books | 4.5 out of 5 stars | 22878 | ₹ 399.00 |
| 4 | Grandma's Bag of Stories: Collection of 20+ Il... | Sudha Murty | 4.6 out of 5 stars | 5227 | ₹ 198.00 |


In [6]:
#Displaying last 5 records
df.tail()

|  | Book Name | Author | Rating | Customers_Rated | Price |
|---|---|---|---|---|---|
| 95 | Mathematics for Class 9 by R D Sharma (Examina... | R.D. Sharma | 4.6 out of 5 stars | 1056 | ₹ 435.00 |
| 96 | My First Mythology Tale (Illustrated) (Set of ... | Maple Press | 4.4 out of 5 stars | 363 | ₹ 191.00 |
| 97 | Indian Art and Culture for Civil Services and ... | Nitin Singhania | 4.6 out of 5 stars | 2542 | ₹ 550.00 |
| 98 | NCERT textbooks physics chemistry maths and En... | 0 | 4.1 out of 5 stars | 245 | ₹ 800.00 |
| 99 | Autobiography of a Yogi | Paramahansa Yogananda | 4.6 out of 5 stars | 3233 | ₹ 99.00 |


# 4. Data Preprocessing

In [7]:
#Since we know the ratings are out of 5, we can keep only the rating and remove the extra part from it.
#From the customers_rated column, remove the comma.
#From the price column, remove the rupees symbol, comma, and split it by dot.
#Finally, convert all the three columns into integer or float.

df['Rating'] = df['Rating'].apply(lambda x: x.split()[0])
df['Rating'] = pd.to_numeric(df['Rating'])
df["Price"] = df["Price"].str.replace('₹', '')
df["Price"] = df["Price"].str.replace(',', '')
df['Price'] = df['Price'].apply(lambda x: x.split('.')[0])
df['Price'] = df['Price'].astype(int)
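# Cast to str first so .str works even if read_csv already parsed the column as numbers
df["Customers_Rated"] = df["Customers_Rated"].astype(str).str.replace(',', '')
# errors='coerce' turns unparseable values into NaN (errors='ignore' is deprecated in recent pandas)
df['Customers_Rated'] = pd.to_numeric(df['Customers_Rated'], errors='coerce')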
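The cleaning steps in the cell above can be checked on individual strings before applying them to whole columns. The sketch below restates them as three small functions; the function names and the sample values (in particular the price `₹ 1,299.00`) are made up for illustration.

```python
def parse_rating(text):
    """'4.4 out of 5 stars' -> 4.4 (keeps only the leading number)."""
    return float(text.split()[0])


def parse_price(text):
    """'₹ 1,299.00' -> 1299 (drops the symbol, commas and the decimal part)."""
    cleaned = text.replace("₹", "").replace(",", "").strip()
    return int(cleaned.split(".")[0])


def parse_count(text):
    """'5,170' -> 5170 (drops thousands separators)."""
    return int(text.replace(",", ""))


print(parse_rating("4.4 out of 5 stars"))
print(parse_price("₹ 1,299.00"))
print(parse_count("5,170"))
```

Note that splitting the price on the dot truncates the decimal part rather than rounding it, which matches what the notebook does.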

In [8]:
# Top 5 Data after preprocessing 
df.head()

|  | Book Name | Author | Rating | Customers_Rated | Price |
|---|---|---|---|---|---|
| 0 | My First Book of Patterns Pencil Control: Patt... | Wonder House Books | 4.4 | 5170 | 89 |
| 1 | Karma: A Yogi's Guide to Crafting Your Destiny... | Sadhguru (Author) | 4.6 | 254 | 260 |
| 2 | Ikigai: The Japanese secret to a long and happ... | Héctor García | 4.6 | 14152 | 329 |
| 3 | My First Library: Boxset of 10 Board Books for... | Wonder House Books | 4.5 | 22878 | 399 |
| 4 | Grandma's Bag of Stories: Collection of 20+ Il... | Sudha Murty | 4.6 | 5227 | 198 |


In [9]:
#Let's verify the data types
df.dtypes

Book Name           object
Author              object
Rating             float64
Customers_Rated      int64
Price                int32
dtype: object

In [10]:
# Replace the zero placeholders in the DataFrame with NaN
df.replace(str(0), np.nan, inplace=True)
df.replace(0, np.nan, inplace=True)

In [11]:
#Counting the Number of NaNs in the DataFrame
count_nan = len(df) - df.count()
count_nan

Book Name          0
Author             2
Rating             0
Customers_Rated    0
Price              0
dtype: int64

In [12]:
#Let's drop these NaNs
df = df.dropna()
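The three cells above (replace placeholders, count missing values, drop incomplete rows) can be tried on a toy DataFrame to see what each step does. This is a minimal sketch with made-up data; the variable names are assumptions and not part of the original notebook.

```python
import numpy as np
import pandas as pd

# Toy frame: '0' marks a missing author, 0 marks a missing price
toy = pd.DataFrame(
    {"Author": ["Author A", "0", "Author C"], "Price": [100, 250, 0]}
)

# Turn both the string and the numeric placeholder into NaN
cleaned = toy.replace("0", np.nan).replace(0, np.nan)

# isna().sum() gives the same counts as len(df) - df.count()
missing_per_column = cleaned.isna().sum()

# Keep only rows where every column has a value
complete_rows = cleaned.dropna()

print(missing_per_column)
print(complete_rows)
```

One caveat: this treats every zero as missing, which is only appropriate when zero is never a legitimate value in the data.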

In [13]:
# Highest-priced books and their authors
data = df.sort_values(["Price"], axis=0, ascending=False)[:15]
data

|  | Book Name | Author | Rating | Customers_Rated | Price |
|---|---|---|---|---|---|
| 83 | My First Complete Learning Library: Boxset of ... | Wonder House Books | 4.6 | 4368 | 799 |
| 54 | NCERT textbooks physics, chemistry and biology... | NCERT | 4.1 | 787 | 729 |
| 7 | Indian Polity - For Civil Services and Other S... | M. Laxmikanth | 4.6 | 8547 | 632 |
| 74 | Atomic Habits: The life-changing million copy ... | James Clear | 4.6 | 18960 | 623 |
| 53 | A Modern Approach to Verbal & Non-Verbal Reaso... | R.S. Aggarwal | 4.4 | 5046 | 570 |
| 60 | Objective NCERT at your FINGERTIPS for NEET-AI... | MTG Editorial Board | 4.6 | 1460 | 560 |
| 97 | Indian Art and Culture for Civil Services and ... | Nitin Singhania | 4.6 | 2542 | 550 |
| 33 | Stories I Must Tell: The Emotional Life of an ... | Kabir Bedi | 4.4 | 43 | 524 |
| 90 | Indian Economy for Civil Services, Universitie... | Ramesh Singh | 4.3 | 1512 | 522 |
| 75 | The Intelligent Investor (English) Paperback –... | Benjamin Graham | 4.5 | 23273 | 516 |


In [14]:
# Importing Bokeh for drawing interactive plots in the notebook
from bokeh.models import ColumnDataSource
from bokeh.transform import dodge
import math
from bokeh.io import curdoc
curdoc().clear()
from bokeh.io import push_notebook, show, output_notebook
from bokeh.layouts import row
from bokeh.plotting import figure
from bokeh.transform import factor_cmap
from bokeh.models import Legend
output_notebook()

In [20]:
# Top-rated books and authors among those with more than 1,000 customer ratings
data = df[df['Customers_Rated'] > 1000]
data = data.sort_values(['Rating'],axis=0, ascending=False)[:15]
data


|  | Book Name | Author | Rating | Customers_Rated | Price |
|---|---|---|---|---|---|
| 50 | Bhagavad Gita: Yatharoop (Hindi) | A.C. Bhaktivendanta Swami Prabhupada | 4.7 | 6079 | 188 |
| 52 | Death; An Inside Story: A book for all those w... | Sadhguru | 4.7 | 4618 | 206 |
| 34 | Harry Potter and the Philosopher's Stone | J.K. Rowling | 4.7 | 26937 | 280 |
| 31 | Think Like a Monk: The secret of how to harnes... | Jay Shetty | 4.7 | 13181 | 313 |
| 25 | Sapiens: A Brief History of Humankind | Yuval Noah Harari | 4.7 | 32647 | 388 |
| 22 | The Magic of the Lost Temple: Illustrated, eas... | Sudha Murty | 4.7 | 2390 | 148 |
| 20 | Concept of Physics Part-1 (2019-2020 Session) ... | H.C. Verma | 4.6 | 5237 | 347 |
| 2 | Ikigai: The Japanese secret to a long and happ... | Héctor García | 4.6 | 14152 | 329 |
| 51 | Mathematics for Class 10 by R D Sharma (Examin... | R.D. Sharma | 4.6 | 1494 | 455 |
| 60 | Objective NCERT at your FINGERTIPS for NEET-AI... | MTG Editorial Board | 4.6 | 1460 | 560 |
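Several books in the table above are tied on rating, so the order among them is not determined by the sort key alone. A possible refinement, shown here as a sketch on made-up data, is to add the number of customer ratings as a second sort key so that ties are broken in a predictable way.

```python
import pandas as pd

# Made-up data: A and B are tied on rating, D has too few ratings
toy = pd.DataFrame(
    {
        "Book Name": ["A", "B", "C", "D"],
        "Rating": [4.7, 4.7, 4.6, 4.9],
        "Customers_Rated": [2000, 30000, 15000, 40],
    }
)

# Keep only books with more than 1,000 customer ratings
popular = toy[toy["Customers_Rated"] > 1000]

# Sort by rating, then by number of ratings, both descending
ranked = popular.sort_values(["Rating", "Customers_Rated"], ascending=False)

print(ranked["Book Name"].tolist())
```

Book D has the highest rating but is filtered out because only 40 customers rated it, which is the purpose of the threshold in the cell above.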


In [27]:
# Books and authors with the most customer ratings
data = df.sort_values(["Customers_Rated"], axis=0, ascending=False)[:20]
data

|  | Book Name | Author | Rating | Customers_Rated | Price |
|---|---|---|---|---|---|
| 81 | The Silent Patient: The record-breaking, multi... | Alex Michaelides | 4.5 | 68390 | 305 |
| 8 | The Alchemist | Paulo Coelho | 4.6 | 53165 | 250 |
| 35 | Think and Grow Rich | Napoleon Hill | 4.5 | 47253 | 139 |
| 15 | Rich Dad Poor Dad: What the Rich Teach Their K... | Robert T. Kiyosaki | 4.6 | 40450 | 430 |
| 88 | To Kill A Mockingbird: 50th Anniversary Editio... | Harper Lee | 4.5 | 38763 | 325 |
| 25 | Sapiens: A Brief History of Humankind | Yuval Noah Harari | 4.7 | 32647 | 388 |
| 19 | The Power of Your Subconscious Mind | Joseph Murphy | 4.5 | 31055 | 141 |
| 36 | Man's Search For Meaning: The classic tribute ... | Viktor E Frankl | 4.5 | 30436 | 223 |
| 34 | Harry Potter and the Philosopher's Stone | J.K. Rowling | 4.7 | 26937 | 280 |
| 89 | A Man Called Ove: The life-affirming bestselle... | Fredrik Backman | 4.6 | 26093 | 255 |


In [28]:
from bokeh.transform import factor_cmap
from bokeh.palettes import d3

# 20-colour categorical palette used to give each author its own colour
palette = d3['Category20'][20]

In [29]:
# Pass each author once, as a plain list of factors
index_cmap = factor_cmap('Author', palette=palette,
                         factors=sorted(data["Author"].unique()))

In [30]:
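# width/height and legend_field replace plot_width/plot_height and legend, which recent Bokeh releases no longer accept
p = figure(width=700, height=700, title="Top Authors: Rating vs. Customers Rated")
p.scatter('Rating', 'Customers_Rated', source=data, fill_alpha=0.6, fill_color=index_cmap, size=20, legend_field='Author')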
p.xaxis.axis_label = 'RATING'
p.yaxis.axis_label = 'CUSTOMERS RATED'
p.legend.location = 'top_left'




In [31]:
show(p)

### Conclusion:

In this assignment we scraped the Amazon best-selling books listing, covering both the first and the second page. Using HTTP requests and the Beautiful Soup library, we extracted each book's name, author, rating, number of customer ratings, and price, and stored the results in a single CSV file. Finally, we preprocessed the scraped data.
