<a href="https://colab.research.google.com/github/valentijnbongers/internship-assignment-webscraping/blob/main/internship_assignment_valentijn_karan.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Introduction To Web Scraping**

# Structure


1.   Ethical considerations
2.  Legal aspects
3.   Techical approachs

  *   `BeatifulSoup`
  * `requests`




Link to [Github](https://github.com/valentijnbongers/internship-assignment-webscraping)

---

#### What is web scraping?

Web scraping is a technique used to automatically extract information from websites.


### **Ethical Considerations:**

1. Website Strain: Rate of requests should be limited. Too many requests may violate the website's terms of service and put undue load on the web servers.

2. Privacy issues: Scraping of personal information such as emails or phone numbers is not advised unless explicity allowed.

3. Data usage: Only scrape data you need. Don't sell os misuse scraped data.

### **Legal Aspects:**

1. Respecting Robots.txt: Websites can have robots.txt file that tells bots (including web scrapers) which parts of the site they are not allowed to access. Following robots.txt is considered good practice.

2. Public Vs. Private Data: Generally, scarping publicily available data is okay, as long as you follow the website's terms of service.




## HTML page structure

**Hypertext Markup Language (HTML)** is the standard markup language for documents designed to be displayed in a web browser. HTML describes the structure of a web page and it can be used with **Cascading Style Sheets (CSS)** and a scripting language such as **JavaScript** to create interactive websites. HTML consists of a series of elements that "tell" to the browser how to display the content. Lastly, elements are represented by **tags**.

Here are some tags:
* `<!DOCTYPE html>` declaration defines this document to be HTML5.  
* `<html>` element is the root element of an HTML page.  
* `<div>` tag defines a division or a section in an HTML document. It's usually a container for other elements.
* `<head>` element contains meta information about the document.  
* `<title>` element specifies a title for the document.  
* `<body>` element contains the visible page content.  
* `<h1>` element defines a large heading.  
* `<p>` element defines a paragraph.  
* `<a>` element defines a hyperlink.













































## Web Scraping with `requests` and `BeautifulSoup`



```
#Imports

import requests
from bs4 import BeautifulSoup
```



### What is `BeautifulSoup`?

It is a Python library for pulling data out of HTML and XML files. It provides methods to navigate the document's tree structure that we discussed before and scrape its content.

### What is `requests`?

It is a Python library that allows you to send HTTP requests to websites and retrieve their responses. It simplifies the process of fetching web content compared to using the built-in `urllib` library in Python.

<img src='https://miro.medium.com/v2/resize:fit:1400/1*atk74YuoKuEn_b-Qs8nOgg.png'>

#### Web Parsing Tasks
* Find the _Sign In_ button
* Find the Shopee logo.
* Locate one of products in the main section of the page.
* What is the _heading_ size of the titles in the main section of the page?

# **Website Structure**

# Structure
* Lazada & Shopee
  * Textbox
  * Result Page
  * Visuals of products or services

### Overall Structure


Both websites have a very similar structure on the main page, including a search bar at the top, followed by some promotions and categories of products to chooose from. Smaller details were also similar for both websites. Like a shopping cart in the top right and a logo in the top left.

## Textbox:

<img src='https://blog.leapecommerce.com/content/images/2022/06/Shopee-Auto-Complete-1.png' width = 500>

There is always a textbox at the top of the homepage, which allows website visitors to type in their “search” directly to find what they are looking for.   

## Result page:

<img src='https://thelead.io/wp-content/uploads/2018/11/Screen-Shot-2018-11-06-at-1.19.29-AM.png' width=500>

After you key in what you are looking for, the website will redirect you to a filtered display of what you are looking for according to the keywords you have entered earlier in the textbox.  

## Visuals of products or services:

<img src='https://beebot-sg-knowledgecloud.oss-ap-southeast-1.aliyuncs.com/kc/kc-media/kc-oss-1690256248802-image.png' width=500>

Another key similarity you will see on e-commerce sites, such as Shopee and Lazada, is high-resolution product images from multiple angles, lifestyle photos, infographics, videos, and AR features, all organized in a clean, branded layout to enhance the shopping experience.


## Differences

While both are similar in a lot of ways, there are some noticible differences that allow you to seperate these 2 e-commerce platforms.

1. Homepage Layout:
    * Lazada: Features a more traditional e-commerce layout with a prominent search bar, rotating banners, and categorized product listings. It often highlights major sales and promotions at the top.
    * Shopee: Utilizes a mobile-first design, with a focus on interactive features like Shopee Live, flash sales, and games. The homepage often has more vibrant and dynamic elements.
2. Product Pages:
    * Lazada: Provides detailed product descriptions, specifications, and user reviews. It also includes a comparison feature and related product suggestions.
    * Shopee: Focuses on user-generated content, with a heavy emphasis on customer reviews and ratings. It often includes interactive elements like Q&A sections and short video reviews.
3. Navigation:
    * Lazada: Offers a more structured navigation with clear categories and subcategories. It often includes filters and sorting options for easier product search.
    * Shopee: Emphasizes a simpler navigation structure that relies heavily on search and discovery through personalized recommendations and trending products.
4. HTML elements:
    * The two websites are also different in the code supporting it. For example, they use different tags and attribute names. This will require some changes in the webscraping code for each website.

# **HTML Elements:**

## Product names, descriptions, Prices

### HTML elements are made of 3 components;
* an opening tag,
* the content,
* and a closing tag.

They can be found by 'inspecting' a web page


<img src='https://assets.digitalocean.com/django_gunicorn_nginx_2004/articles/new_learners/html-element-diagram.png' width=250>

The opening tag contains the tag name, the attribute name, and the attribute value.
The content contains what will be written onto the web page.

Example from a title on Lazada:

<img src='https://drive.google.com/uc?export=view&id=1rlBSoCaT0u3-31K1jAnqOUWFD8em3Cv8' height=40 width=500>


On Lazada, all product titles have a tag of `'h1'` and an attribute name of `'pdp-mod-product-badge-title'`. All prices have a tag of `'span'` and an attribute name of `'notranslate pdp-price pdp-price_type_normal pdp-price_color_orange pdp-price_size_xl'`. All descriptions have a tag of 'article and a atterbute name of 'lzd-article'.

On shopee, titles have tags `'div'` and attribute name `'WBVL_7'`. Prices had tags `'div'` and attribute name `'G27FPf'`. Descriptions have tags 'div' and name `'e8lZp3'`. These tags and attribute names can be used to filter out, and print only wanted data. This also allows data to be sorted into titles, prices, and descriptions.

# Scraping data from a single product page

In [None]:
!pip install beautifulsoup4
!pip install requests



In [2]:
from typing import Text
import requests
from bs4 import BeautifulSoup

# Scraping from a whole search page

[Test Page 1 ](https://www.ebay.com.my/sch/Car-Truck-Parts/6030/i.html?_from=R40&LH_BIN=1&_nkw=Mercedes+Benz+-Kaidon&rt=nc&_trkparms=parentrq%3A4d7233ae1900a72e7b702ecafffdd672%7Cpageci%3A508e36ed-32a3-11ef-bac3-6adee66f9ad8%7Cc%3A1%7Ciid%3A1%7Cli%3A8874)

[Test Page 2 ](https://www.ebay.com.my/sch/Car-Truck-Parts/6030/i.html?_from=R40&LH_BIN=1&_nkw=Mercedes+Benz+-Kaidon&rt=nc&_trkparms=parentrq%3A4d7233ae1900a72e7b702ecafffdd672%7Cpageci%3A508e36ed-32a3-11ef-bac3-6adee66f9ad8%7Cc%3A1%7Ciid%3A1%7Cli%3A8874&LH_PrefLoc=2&imm=1&_pgn=2)


[Test Page 3](https://www.ebay.com.my/sch/Car-Truck-Parts/6030/i.html?_from=R40&LH_BIN=1&_nkw=Mercedes+Benz+-Kaidon&rt=nc&_trkparms=parentrq%3A4d7233ae1900a72e7b702ecafffdd672%7Cpageci%3A508e36ed-32a3-11ef-bac3-6adee66f9ad8%7Cc%3A1%7Ciid%3A1%7Cli%3A8874&LH_PrefLoc=2&imm=1&_pgn=3)


....



*etc.*

In [1]:
# Functions for collecting product information

#Product name
def get_name(soup):
  content = soup.find('div', id = 'mainContent')
  product_name = content.find('span', class_ = 'ux-textspans ux-textspans--BOLD').text
  return product_name

#Product price
def get_price(soup):
  product_price = soup.find('div', class_ = 'x-price-primary').text
  return product_price

#Product condition
def get_condition(soup):
  product_condition = soup.find('span', class_ = 'ux-icon-text__text').text
  return product_condition

#Function for scraping the whole page
def scrape_product_page(page_url):
  response = requests.get(page_url)
  soup = BeautifulSoup(response.content,'html.parser')
  name = get_name(soup)
  price = get_price(soup)
  condition = get_condition(soup)
  if condition != '--not specified':
    condition = condition[:int((len(condition)/2))]
  arr = [name, price, condition]
  return arr

In [3]:
#Finds all products with selected key word

def scrape_search_page(url,num_pages):
  items_array = []
  for k in range(1, num_pages + 1):

    paged_url = url + "&_pgn=" + str(k)
    response2 = requests.get(paged_url)
    soup2 = BeautifulSoup(response2.content, 'html.parser')
    all_items = soup2.find('div', class_ = 'srp-river-results clearfix')
    items = all_items.find_all('li', class_ = 's-item s-item__pl-on-bottom')

  #Displays information

    for item in items:
      try:
        #finds element that contains link to product page
        link_element = item.find('a', href = True)

        #finds link to product page
        href = link_element.get('href')
        array_data = scrape_product_page(href)
        items_array.append(array_data)
      except AttributeError:
        pass
      else:
        pass
      finally:
        pass
    return items_array

In [7]:
test = scrape_search_page('https://www.ebay.com.my/sch/i.html?_from=R40&_nkw=iPhone&LH_PrefLoc=2&imm=1', 100)
print(len(test))

71


## Handling pagination:
[Test Page 1:](https://www.ebay.com.my/sch/Car-Truck-Parts/6030/i.html?_from=R40&LH_BIN=1&_nkw=Mercedes+Benz+-Kaidon&rt=nc&_trkparms=parentrq%3A4d7233ae1900a72e7b702ecafffdd672%7Cpageci%3A508e36ed-32a3-11ef-bac3-6adee66f9ad8%7Cc%3A1%7Ciid%3A1%7Cli%3A8874) url + &LH_PrefLoc=2&imm=1&_pgn=1


[Test Page 2:](https://www.ebay.com.my/sch/Car-Truck-Parts/6030/i.html?_from=R40&LH_BIN=1&_nkw=Mercedes+Benz+-Kaidon&rt=nc&_trkparms=parentrq%3A4d7233ae1900a72e7b702ecafffdd672%7Cpageci%3A508e36ed-32a3-11ef-bac3-6adee66f9ad8%7Cc%3A1%7Ciid%3A1%7Cli%3A8874&LH_PrefLoc=2&imm=1&_pgn=2) url + &LH_PrefLoc=2&imm=1&_pgn=2

[Test Page 3:](https://www.ebay.com.my/sch/Car-Truck-Parts/6030/i.html?_from=R40&LH_BIN=1&_nkw=Mercedes+Benz+-Kaidon&rt=nc&_trkparms=parentrq%3A4d7233ae1900a72e7b702ecafffdd672%7Cpageci%3A508e36ed-32a3-11ef-bac3-6adee66f9ad8%7Cc%3A1%7Ciid%3A1%7Cli%3A8874&LH_PrefLoc=2&imm=1&_pgn=3) url + &LH_PrefLoc=2&imm=1&_pgn=3

....


*etc.*

-------

However, `'&LH_PrefLoc=2&imm=1'` is unnecesary and doesnt affect the page it is linked to so subsequent pages can be accessed by appending `'&_pgn=k'` onto the first page url, where k is the page number. (first page can also be represented as `url + '&_pgn=1'`)


In [12]:
import re
import math

In [9]:
#function to extract all integers from a string and concatenate them
def extract_int(string):
  integers = re.findall(r'\d+', string)
  ints = ''.join(integers)
  return ints

In [13]:
string1 = '20,000'
print(extract_int(string1))

20000


In [8]:
def find_num_pages(url):
  response = requests.get(url)
  soup = BeautifulSoup(response.content, 'html.parser')
  num_items_phrase = soup.find('div', class_ = 'srp-controls__control srp-controls__count')
  num_items = num_items_phrase.find('span', class_ = 'BOLD').text
  num_items = int(extract_int(num_items))
  num_of_pages = math.ceil(num_items/60)
  return num_of_pages

In [14]:
test_url = 'https://www.ebay.com.my/sch/i.html?_from=R40&_trksid=m570.l1313&_nkw=iPhone&_sacat=0'
find_num_pages(test_url)

367

## Transferring data to Excel

In [20]:
!pip install pandas
!pip install openpyxl
!pip install xlsxwriter

Collecting xlsxwriter
  Downloading XlsxWriter-3.2.0-py3-none-any.whl (159 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m159.9/159.9 kB[0m [31m3.3 MB/s[0m eta [36m0:00:00[0m
[?25hInstalling collected packages: xlsxwriter
Successfully installed xlsxwriter-3.2.0


In [21]:
import pandas as pd
import xlsxwriter

In [22]:


def export_to_excel(items_array, sheet_number, file_name):
  # Create a DataFrame from the scraped data
  df = pd.DataFrame(items_array, columns=['Name', 'Price', 'Condition'])
  name_of_sheet = 'sheet' + str(sheet_number + 1)
  #create file and transfers first dataframe
  if sheet_number == 0:
    with pd.ExcelWriter(file_name, engine='xlsxwriter') as writer:
        df.to_excel(writer, index= False, sheet_name= name_of_sheet)
  else:
    # Write the DataFrame to an Excel file using xlsxwriter
    with pd.ExcelWriter(file_name, mode= 'a', engine='openpyxl', if_sheet_exists= 'replace') as writer:
      df.to_excel(writer, index= False, sheet_name= name_of_sheet)

  # Read and display the Excel file
  #df_read = pd.read_excel(file_name)
  #print(df_read)

## Code to run:

In [16]:
#array containing all search pages to be scraped
test_urls_arr = ['https://www.ebay.com.my/sch/i.html?_from=R40&_trksid=m570.l1313&_nkw=iPhone&_sacat=0&_odkw=iPhone+11&_osacat=0', 'https://www.ebay.com.my/sch/i.html?_nkw=casio+protrek&_sop=12']

In [23]:
filename = input('Enter file name: ')
for i in range(0, len(test_urls_arr)):
  num_pages = find_num_pages(test_urls_arr[i])
  print(num_pages)
  items_array = scrape_search_page(test_urls_arr[i], num_pages)
  export_to_excel(items_array, i, filename)

Enter file name: scraped_data_result.xlsx
367
62


# Trying Selenium
(doesnt work)



In [None]:
!pip install beautifulsoup4
!pip install requests
!pip install selenium

Collecting selenium
  Downloading selenium-4.22.0-py3-none-any.whl (9.4 MB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m9.4/9.4 MB[0m [31m25.8 MB/s[0m eta [36m0:00:00[0m
Collecting trio~=0.17 (from selenium)
  Downloading trio-0.25.1-py3-none-any.whl (467 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m467.7/467.7 kB[0m [31m28.1 MB/s[0m eta [36m0:00:00[0m
[?25hCollecting trio-websocket~=0.9 (from selenium)
  Downloading trio_websocket-0.11.1-py3-none-any.whl (17 kB)
Collecting outcome (from trio~=0.17->selenium)
  Downloading outcome-1.3.0.post0-py2.py3-none-any.whl (10 kB)
Collecting wsproto>=0.14 (from trio-websocket~=0.9->selenium)
  Downloading wsproto-1.2.0-py3-none-any.whl (24 kB)
Collecting h11<1,>=0.9.0 (from wsproto>=0.14->trio-websocket~=0.9->selenium)
  Downloading h11-0.14.0-py3-none-any.whl (58 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m58.3/58.3 kB[0m [31m6.6 MB/s[0m eta [36m0:00:00[0m
[?25h

In [None]:
!apt-get update
!apt-get install chromium-driver

0% [Working]            Get:1 https://cloud.r-project.org/bin/linux/ubuntu jammy-cran40/ InRelease [3,626 B]
Get:2 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64  InRelease [1,581 B]
Hit:3 http://archive.ubuntu.com/ubuntu jammy InRelease
Get:4 http://security.ubuntu.com/ubuntu jammy-security InRelease [129 kB]
Get:5 http://archive.ubuntu.com/ubuntu jammy-updates InRelease [128 kB]
Get:6 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64  Packages [973 kB]
Hit:7 https://ppa.launchpadcontent.net/c2d4u.team/c2d4u4.0+/ubuntu jammy InRelease
Hit:8 https://ppa.launchpadcontent.net/deadsnakes/ppa/ubuntu jammy InRelease
Hit:9 https://ppa.launchpadcontent.net/graphics-drivers/ppa/ubuntu jammy InRelease
Get:10 http://archive.ubuntu.com/ubuntu jammy-backports InRelease [127 kB]
Hit:11 https://ppa.launchpadcontent.net/ubuntugis/ppa/ubuntu jammy InRelease
Get:12 http://security.ubuntu.com/ubuntu jammy-security/universe amd64 Packages [1,125 kB]

In [None]:
!ls /usr/lib/chromium-browser/chromedriver

ls: cannot access '/usr/lib/chromium-browser/chromedriver': No such file or directory


In [None]:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected
from bs4 import BeautifulSoup
import requests
import time

# Set up Selenium with ChromeDriver
options = webdriver.ChromeOptions()
options.add_argument('--no-sandbox')
options.add_argument('--headless')
options.add_argument('--disable-gpu')
options.add_argument("--window-size=1920, 1200")
options.add_argument('--disable-dev-shm-usage')

# Open the product page
url = 'https://www.ebay.com.my/itm/355691390771?itmmeta=01J16M6TS1SWW3940J3WYD1SPA&hash=item52d0dbe733%3Ag%3Aa3sAAOSwR8xmOEzN&itmprp=enc%3AAQAJAAAA0DRrpJMK90GzYsnXI%2Bv%2BxUHgVsGiJ%2BkPfcZt2ZmSWz8y88vPiUT0UAlKDYT%2FsSckeQKxPmXBRZKQ4%2FIjrQN%2BFRBUrgA10hmCz1xRwSfe3jWO3SIEXJOZ4cWzrOBw1O4RMnK7wFYrScTc6YGbe3BaR%2BJG%2BfFv366CEPO6rvCCvjav7%2BPNshwmGRNlM8r3IvgYWco8E%2BGT0XImkQ6DPEE4Ct87%2FjLZezMeQAV5uJpBbbWMhdxCqP3Wtqq1zq8aMWxuLrTRXtoaqrCjynaPHpXb39M%3D%7Ctkp%3ABk9SR8qsm9SJZA&LH_BIN=1'
driver = webdriver.Chrome(options=options)
driver.get(url)

WebDriverWait(driver, 5).until(EC.presence_of_element_located((By.CSS_SELECTOR,'#root')))
time.sleep(2)

# Get the page source and parse it with BeautifulSoup
page_source = driver.page_source
soup = BeautifulSoup(page_source, 'html.parser')

# Extract product information
product_name = soup.find('h1', class_='attM6y')
price = soup.find('div', class_='G27FPf')
description = soup.find('div', class_='YPqix5')

# Print the extracted information
print("Product Name:", product_name)
print("Price:", price)
print("Description:", description)

# Close the WebDriver
driver.quit()

ImportError: cannot import name 'expected' from 'selenium.webdriver.support' (/usr/local/lib/python3.10/dist-packages/selenium/webdriver/support/__init__.py)

In [None]:

def scrape_ebay_search_results(url):
    headers = {
        'User-Agent': 'Your User Agent String'  # Set a proper user agent to avoid getting blocked
    }
    response = requests.get(url, headers=headers)
    soup = BeautifulSoup(response.content, 'html.parser')

    # Extract product data from the first page
    product_data = parse_product_data(soup)

    return product_data

In [None]:
def scrape_all_pages(url, max_pages=3):
    all_product_data = []

    for page_num in range(1, max_pages + 1):
        page_url = f"{url}&_pgn={page_num}"  # Modify the URL to include the page number
        headers = {
            'User-Agent': 'Your User Agent String'
        }
        response = requests.get(page_url, headers=headers)
        soup = BeautifulSoup(response.content, 'html.parser')

        # Extract product data from this page
        page_product_data = parse_product_data(soup)

        # Add the extracted data to the main list
        all_product_data.extend(page_product_data)

    return all_product_data

In [None]:
def parse_product_data(soup):
    product_data = []
    # Example: Assuming each product has a class "product"
    products = soup.find_all('div', class_='product')

    for product in products:
        # Extract details like title, price, etc.
        title = product.find('h2').text.strip()
        price = product.find('span', class_='price').text.strip()
        # Add more fields as necessary

        # Store the data in a dictionary
        product_info = {
            'title': title,
            'price': price,
            # Add more fields as necessary
        }

        product_data.append(product_info)

    return product_data

In [None]:
if __name__ == "__main__":
    ebay_url = "https://www.ebay.com/sch/i.html?_nkw=laptop"
    all_product_data = scrape_all_pages(ebay_url)

    # Print or process the collected product data
    print(all_product_data)

# using encoding
(doesnt work)

In [None]:
import requests

In [None]:
with open('outputlazada.txt', 'w') as w:
  search = input('Search item on Lazada: ')
  with requests.Session() as c:
    url = 'https://www.lazada.com.my/tag/' + search
    c.get(url)
    search_data = dict(q = search)
    page = c.post(url, data = search_data, headers = {"Referer" : "www.lazada.com.my"})
    page = page.content
    new_url = page.encode('utf-8')

KeyboardInterrupt: Interrupted by user