# Ebooks.com Web Scraping using network traffic

![](https://i.imgur.com/uiLHO7j.jpg)

## 1. Introduction


### 1.1. What is web scraping 

Web scraping is the process of collecting structured web data in an automated fashion. Some of the main use cases of web scraping include price monitoring, price intelligence, news monitoring, lead generation, and market research among many others.

In general, web data extraction is used by people and businesses who want to make use of the vast amount of publicly available web data to make smarter decisions.

There are a number of tools and methods for web scraping. Inspecting network traffic, Scrapy, Selenium and Beautiful Soup are among the most popular.


### 1.2. Problem statement 

Books make up a large part of online retail. The most famous book marketplace is Amazon, but there are other players worth assessing.

Data on the books in the market, such as their average price, the biggest authors and the publishers, are some of the important points to assess.

eBooks.com, launched in 2000, is a leading ebook retailer with a vast range of titles from academic, popular and professional publishers. It sells ebooks direct to consumers around the world through five local sales portals in the US, Canada, UK, Europe and Australia.

That is why we are going to scrape book information from ebooks.com.


### 1.3. Tools used in this project

Most scraping projects parse HTML with Beautiful Soup, but the requests a page makes over the network can also give us the information we need.

In this project, we use Python as our coding language.

In Python, we mainly use the Requests library to get the information from the website.

The scraped information is then turned into a pandas DataFrame and saved to an Excel file.



## 2. Project Steps:

1. Assessing the Computers subject page on ebooks.com: https://www.ebooks.com/en-ca/subjects/computers/
2. Finding the request that returns the book data in the browser's network traffic (Fetch/XHR)
3. Copying that request as cURL and converting it to Python Requests code
4. Extracting the title, subtitle, author, publisher, publication year and price of every book on the first page
5. Saving all the scraped information to an Excel file

In [None]:
# check Inspect -> Network -> Fetch/XHR -> Name -> Preview, then find the info we need

In [None]:
# to bring it into Python: right-click the request we need -> Copy -> Copy as cURL (cmd)

In [None]:
# use an online cURL converter: curl.trillworks.com

### Imports

In [4]:
import requests
import pandas as pd

### Requests and cURL

In [16]:
headers = {
    'sec-ch-ua': '^\\^Chromium^\\^;v=^\\^92^\\^, ^\\^',
    'Referer': 'https://www.ebooks.com/en-ca/subjects/computers/',
    'sec-ch-ua-mobile': '?0',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.159 Safari/537.36',
    'Content-Type': 'application/json',
}

params = (
    ('subjectId', '13'),
    ('pageNumber', '1'),
    ('countryCode', 'CA'),
)

response = requests.get('https://www.ebooks.com/api/search/subject/', headers=headers, params=params)


#NB. Original query string below. It seems impossible to parse and
#reproduce query strings 100% accurately so the one below is given
#in case the reproduced version is not "correct".
#response = requests.get('https://www.ebooks.com/api/search/subject/?subjectId=13&pageNumber=1&countryCode=CA', headers=headers)
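
The `params` tuple and the hand-written query string in the comment should describe the same URL. A minimal sketch of checking that offline with the standard library's `urllib.parse.urlencode`, assuming it encodes these simple string parameters the same way Requests does; the variable names are illustrative:

```python
from urllib.parse import urlencode

BASE_URL = "https://www.ebooks.com/api/search/subject/"

# Same key/value pairs as the params tuple passed to requests.get
params = (
    ("subjectId", "13"),
    ("pageNumber", "1"),
    ("countryCode", "CA"),
)

# urlencode keeps the pairs in order and joins them with "&"
rebuilt_url = f"{BASE_URL}?{urlencode(params)}"
original_url = "https://www.ebooks.com/api/search/subject/?subjectId=13&pageNumber=1&countryCode=CA"

print(rebuilt_url == original_url)
```

If the two strings match, passing `params=params` is a safe replacement for the hard-coded URL, and changing `pageNumber` later only means editing one value.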

### Status Code

In [17]:
response

<Response [200]>
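
Looking at the response by eye works in a notebook, but a script usually needs to stop on a failed request. A minimal sketch, assuming a hypothetical helper named `fetch_json` that accepts any object with a Requests-style `get` method (for example `requests.Session()`); the 10-second timeout is an arbitrary choice:

```python
def fetch_json(session, url, params=None, headers=None, timeout=10):
    """Send a GET request and return the decoded JSON body.

    `session` is anything with a Requests-style `get` method,
    for example `requests.Session()`.
    """
    response = session.get(url, params=params, headers=headers, timeout=timeout)
    # Raise an exception for 4xx/5xx responses instead of failing later on bad data
    response.raise_for_status()
    return response.json()
```

With Requests installed, `fetch_json(requests.Session(), 'https://www.ebooks.com/api/search/subject/', params=params, headers=headers)` would be expected to return the same dictionary as `response.json()` in the next step.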

### Create JSON Object

In [20]:
json_response = response.json()
json_response
# check a couple of books to confirm we have the data

{'pages': [{'number': '1',
   'is_selected': True,
   'show_mobile': True,
   'show_tablet': True,
   'search_url': 'https://www.ebooks.com/en-ca/subjects/computers/'},
  {'number': '2',
   'is_selected': False,
   'show_mobile': True,
   'show_tablet': True,
   'search_url': 'https://www.ebooks.com/en-ca/subjects/computers/?pageNumber=2'},
  {'number': '3',
   'is_selected': False,
   'show_mobile': True,
   'show_tablet': True,
   'search_url': 'https://www.ebooks.com/en-ca/subjects/computers/?pageNumber=3'},
  {'number': '4',
   'is_selected': False,
   'show_mobile': True,
   'show_tablet': True,
   'search_url': 'https://www.ebooks.com/en-ca/subjects/computers/?pageNumber=4'},
  {'number': '5',
   'is_selected': False,
   'show_mobile': True,
   'show_tablet': True,
   'search_url': 'https://www.ebooks.com/en-ca/subjects/computers/?pageNumber=5'},
  {'number': '6',
   'is_selected': False,
   'show_mobile': False,
   'show_tablet': True,
   'search_url': 'https://www.ebooks.com/en

In [21]:
# check what type of object our JSON is

type(json_response)

dict

### Extracting the keys of the JSON object

In [23]:
json_response.keys()

# the same number of keys as shown for this request in the browser's Network tab

# the info we need, as shown on the page, is under the "books" key

dict_keys(['pages', 'previous_page', 'next_page', 'books', 'result_page_range'])
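
Besides `books`, the `pages` key appears to describe the pagination links. A small sketch of reading it, using the first two entries copied from the output above as a stand-in for the live response:

```python
# First two entries of json_response["pages"], copied from the output above
pages = [
    {
        "number": "1",
        "is_selected": True,
        "show_mobile": True,
        "show_tablet": True,
        "search_url": "https://www.ebooks.com/en-ca/subjects/computers/",
    },
    {
        "number": "2",
        "is_selected": False,
        "show_mobile": True,
        "show_tablet": True,
        "search_url": "https://www.ebooks.com/en-ca/subjects/computers/?pageNumber=2",
    },
]

# Page numbers arrive as strings, so convert them before comparing or sorting
page_numbers = [int(page["number"]) for page in pages]

# The entry flagged is_selected is the page the response belongs to
current_page = next(int(page["number"]) for page in pages if page["is_selected"])

print(page_numbers)
print(current_page)
```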

### Extracting data from the JSON object

In [None]:
# the information we are going to get from the JSON is:
# title
# subtitle
# author
# publisher
# publication year
# price

In [26]:
results_json = json_response["books"]
results_json

[{'id': 209755044,
  'book_url': '/en-ca/book/209755044/dark-data/david-j-hand/',
  'image_url': 'https://image.ebooks.com/previews/209/209755/209755044/209755044-sml-1.jpg',
  'image_alt_tag': 'Dark Data: Why What You Don&#x2019;t Know Matters',
  'title': 'Dark Data',
  'edition': '',
  'subtitle': 'Why What You Don’t Know Matters',
  'authors': [{'author_name': 'David J. Hand',
    'author_url': '/en-ca/author/david-j.-hand/165723/'}],
  'num_authors': 1,
  'series': ' Series',
  'series_number': '',
  'has_series': False,
  'series_url': '',
  'publisher': 'Princeton University Press',
  'publication_year': '2020',
  'price': 'CA$37.30',
  'desktop_short_description': 'A practical guide to making good decisions in a world of missing data   In the era of big data, it is easy to imagine that we have all the information we need to make good decisions. But in fact the data we have are never complete, and may be only the tip of the iceberg. Just as much of the universe is composed of da

In [27]:
len(results_json)

# 10 entries, one per book, which suggests we selected the right key

10

In [34]:
# now look at one of the books to see which fields we need
results_json[0]

{'id': 209755044,
 'book_url': '/en-ca/book/209755044/dark-data/david-j-hand/',
 'image_url': 'https://image.ebooks.com/previews/209/209755/209755044/209755044-sml-1.jpg',
 'image_alt_tag': 'Dark Data: Why What You Don&#x2019;t Know Matters',
 'title': 'Dark Data',
 'edition': '',
 'subtitle': 'Why What You Don’t Know Matters',
 'authors': [{'author_name': 'David J. Hand',
   'author_url': '/en-ca/author/david-j.-hand/165723/'}],
 'num_authors': 1,
 'series': ' Series',
 'series_number': '',
 'has_series': False,
 'series_url': '',
 'publisher': 'Princeton University Press',
 'publication_year': '2020',
 'price': 'CA$37.30',
 'desktop_short_description': 'A practical guide to making good decisions in a world of missing data   In the era of big data, it is easy to imagine that we have all the information we need to make good decisions. But in fact the data we have are never complete, and may be only the tip of the iceberg. Just as much of the universe is composed of dark matter, invisib

In [35]:
#title
results_json[0]["title"]

'Dark Data'

In [37]:
# subtitle

results_json[0]["subtitle"]

'Why What You Don’t Know Matters'

In [43]:
# author (only the first one; "authors" is a list and may hold several)
results_json[0]["authors"][0]["author_name"]

'David J. Hand'

In [44]:
# publisher
results_json[0]["publisher"]

'Princeton University Press'

In [55]:
# publication year
results_json[0]["publication_year"]

'2020'

In [46]:
# price
results_json[0]["price"]

'CA$37.30'
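
The six lookups above can be gathered into one function. A minimal sketch, assuming a hypothetical helper named `extract_book`; unlike the single lookups, it joins every name in `authors` rather than keeping only the first, and it uses `.get` so a missing field yields an empty string instead of a `KeyError`:

```python
def extract_book(book):
    """Return the six fields of interest from one entry of the "books" list."""
    # Join every author name instead of keeping only the first one
    authors = ", ".join(a["author_name"] for a in book.get("authors", []))
    return {
        "title": book.get("title", ""),
        "subtitle": book.get("subtitle", ""),
        "author": authors,
        "publisher": book.get("publisher", ""),
        "publication_year": book.get("publication_year", ""),
        "price": book.get("price", ""),
    }


# Trimmed copy of results_json[0] from the output above
sample_book = {
    "id": 209755044,
    "title": "Dark Data",
    "subtitle": "Why What You Don’t Know Matters",
    "authors": [
        {
            "author_name": "David J. Hand",
            "author_url": "/en-ca/author/david-j.-hand/165723/",
        }
    ],
    "publisher": "Princeton University Press",
    "publication_year": "2020",
    "price": "CA$37.30",
}

record = extract_book(sample_book)
print(record)
```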

### Writing a for loop to get the info of all 10 books

In [72]:
title = []
subtitle = []
author = []
publisher = []
publication_year = []
price = []


for result in results_json:
    title.append(result["title"])
    subtitle.append(result["subtitle"])
    author.append(result["authors"][0]["author_name"])  # first listed author only
    publisher.append(result["publisher"])
    publication_year.append(result["publication_year"])
    price.append(result["price"])
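
The loop above covers a single page of results. A hedged sketch of extending it to several pages by varying `pageNumber`: `fetch_page` is a hypothetical callable that returns the decoded JSON for one page number, and stopping on an empty `books` list is an assumption about how the endpoint behaves past the last page.

```python
import time


def collect_books(fetch_page, max_pages=3, delay_seconds=1.0):
    """Gather the "books" entries from consecutive result pages."""
    all_books = []
    for page_number in range(1, max_pages + 1):
        payload = fetch_page(page_number)
        books = payload.get("books", [])
        if not books:
            break  # assumed signal that there are no more results
        all_books.extend(books)
        if page_number < max_pages:
            # Pause between requests to keep the load on the site light
            time.sleep(delay_seconds)
    return all_books
```

Here `fetch_page` could wrap the `requests.get` call shown earlier with `('pageNumber', str(page_number))` in `params`. Keeping the network call outside the loop logic makes the loop easy to try with canned data first.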

In [73]:
len(publication_year)

10

### Turning our data into a pandas DataFrame

In [76]:
books_df = pd.DataFrame({"title": title,
                         "subtitle": subtitle,
                         "author": author,
                         "publisher": publisher,
                         "publication_year": publication_year,
                         "price": price})

In [75]:
books_df

| | title | subtitle | author | publisher | publication_year | price |
|---|---|---|---|---|---|---|
| 0 | Dark Data | Why What You Don’t Know Matters | David J. Hand | Princeton University Press | 2020 | CA$37.30 |
| 1 | HTML and CSS | Design and Build Websites | Jon Duckett | Wiley | 2011 | CA$31.50 |
| 2 | Linux Basics for Hackers | Getting Started with Networking, Scripting, an... | OccupyTheWeb | No Starch Press | 2018 | CA$36.95 |
| 3 | Effective SEO and Content Marketing | The Ultimate Guide for Maximizing Free Web Tra... | Nicholas Papagiannis | Wiley | 2020 | CA$47.50 |
| 4 | The Dragonfly Effect | Quick, Effective, and Powerful Ways To Use Soc... | Jennifer Aaker | Wiley | 2010 | CA$26.99 |
| 5 | Business Analysis | | Debra Paul | BCS Learning & Development Limited | 2014 | CA$57.87 |
| 6 | The Official (ISC)2 Guide to the CISSP CBK Ref... | | John Warsinske | Wiley | 2019 | CA$97.20 |
| 7 | Patterns, Principles, and Practices of Domain-... | | Scott Millett | Wiley | 2015 | CA$63.50 |
| 8 | CompTIA Security+ SY0-601 Exam Cram | | Martin M. Weiss | Pearson Education | 2020 | CA$45.11 |
| 9 | OCA: Oracle Certified Associate Java SE 8 Prog... | Exam 1Z0-808 | Jeanne Boyarsky | Wiley | 2014 | CA$63.50 |
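
The `price` column holds strings such as `CA$37.30`, which cannot be averaged or sorted numerically as they are. A minimal sketch of splitting one into a currency prefix and a number, assuming every price follows the prefix-then-decimal shape seen in the table (no thousands separators); `parse_price` is a hypothetical helper:

```python
import re
from decimal import Decimal

# Non-digit prefix (e.g. "CA$") followed by a plain decimal number
PRICE_PATTERN = re.compile(r"^(?P<currency>[^\d]*)(?P<amount>\d+(?:\.\d+)?)$")


def parse_price(price_text):
    """Split a string such as 'CA$37.30' into (currency prefix, Decimal amount)."""
    match = PRICE_PATTERN.match(price_text.strip())
    if match is None:
        raise ValueError(f"Unrecognized price format: {price_text!r}")
    return match.group("currency"), Decimal(match.group("amount"))


print(parse_price("CA$37.30"))
```

On the DataFrame, something like `books_df["price"].map(lambda p: float(parse_price(p)[1]))` could then produce a numeric column for calculations such as the average price.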


### Storing the data in an Excel file


In [81]:
books_df.to_excel("firstpagebooks.xlsx", index=False)
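
If a plain-text file is preferred over a workbook, the same columns can be written as CSV. A minimal sketch with the standard library's `csv` module, using two rows copied from the table above; `write_books_csv` is a hypothetical helper, and with pandas the closest equivalent would be `books_df.to_csv(..., index=False)`:

```python
import csv
import io

FIELDS = ["title", "subtitle", "author", "publisher", "publication_year", "price"]

# Two rows copied from the table above
rows = [
    {
        "title": "Dark Data",
        "subtitle": "Why What You Don’t Know Matters",
        "author": "David J. Hand",
        "publisher": "Princeton University Press",
        "publication_year": "2020",
        "price": "CA$37.30",
    },
    {
        "title": "HTML and CSS",
        "subtitle": "Design and Build Websites",
        "author": "Jon Duckett",
        "publisher": "Wiley",
        "publication_year": "2011",
        "price": "CA$31.50",
    },
]


def write_books_csv(rows, target):
    """Write the book records to any text file-like object as CSV."""
    writer = csv.DictWriter(target, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(rows)


# Write to an in-memory buffer here; a real file works the same way
buffer = io.StringIO()
write_books_csv(rows, buffer)
print(buffer.getvalue())
```

To write to disk instead, open the target with `open("firstpagebooks.csv", "w", newline="", encoding="utf-8")` (the file name is illustrative) and pass the file object in place of `buffer`.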