# **Before you do anything, choose File --> Save a Copy to Drive**

<img src="https://s3.cloud.cmctelecom.vn/tinhte1/2018/03/4267082_CV.jpg" width=browser_width >



# **Tiki Web Scraping with Selenium**


**Overview**: Build a web crawler that takes in a Tiki URL and returns a pandas DataFrame.

**Due Date**: Before Monday next week.

**Requirements** 
1. Your function should take in a URL and return a pandas DataFrame.
2. The final DataFrame should contain the following information:
    * Product Name
    * Price
    * URL of the product image
    * URL of that product page

Try to follow the guidelines below.

# Install resources

In [None]:
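# Install selenium and other resources for crawling data.
# The code below uses the Selenium 3 find_element_by_* API (removed in Selenium 4.3),
# so pin the version shown in the install output.
!pip install selenium==3.141.0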
!apt-get update
!apt install chromium-chromedriver

Collecting selenium
  Downloading selenium-3.141.0-py2.py3-none-any.whl (904 kB)

In [None]:
### IMPORTS ###
import re
import time
import pandas as pd

from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException

# Configuration for Driver and links

In [None]:
###############
### GLOBALS ###
###############

# Header for chromedriver
HEADER = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.164 Safari/537.36'}
# Urls
TIKI                = 'https://tiki.vn'
MAIN_CATEGORIES = [
    {'Name': 'Điện Thoại - Máy Tính Bảng',
     'URL': 'https://tiki.vn/dien-thoai-may-tinh-bang/c1789?src=c.1789.hamburger_menu_fly_out_banner'},

    {'Name': 'Điện Tử - Điện Lạnh',
     'URL': 'https://tiki.vn/tivi-thiet-bi-nghe-nhin/c4221?src=c.4221.hamburger_menu_fly_out_banner'},

    {'Name': 'Phụ Kiện - Thiết Bị Số', 
     'URL': 'https://tiki.vn/thiet-bi-kts-phu-kien-so/c1815?src=c.1815.hamburger_menu_fly_out_banner'},

    {'Name': 'Laptop - Thiết bị IT', 
     'URL': 'https://tiki.vn/laptop-may-vi-tinh/c1846?src=c.1846.hamburger_menu_fly_out_banner'},

    {'Name': 'Máy Ảnh - Quay Phim', 
     'URL': 'https://tiki.vn/may-anh/c1801?src=c.1801.hamburger_menu_fly_out_banner'},

    {'Name': 'Điện Gia Dụng', 
     'URL': 'https://tiki.vn/dien-gia-dung/c1882?src=c.1882.hamburger_menu_fly_out_banner'},

    {'Name': 'Nhà Cửa Đời Sống', 
     'URL': 'https://tiki.vn/nha-cua-doi-song/c1883?src=c.1883.hamburger_menu_fly_out_banner'},

    {'Name': 'Hàng Tiêu Dùng - Thực Phẩm', 
     'URL': 'https://tiki.vn/bach-hoa-online/c4384?src=c.4384.hamburger_menu_fly_out_banner'},

    {'Name': 'Đồ chơi, Mẹ & Bé', 
     'URL': 'https://tiki.vn/me-va-be/c2549?src=c.2549.hamburger_menu_fly_out_banner'},

    {'Name': 'Làm Đẹp - Sức Khỏe', 
     'URL': 'https://tiki.vn/lam-dep-suc-khoe/c1520?src=c.1520.hamburger_menu_fly_out_banner'},

    {'Name': 'Thể Thao - Dã Ngoại', 
     'URL': 'https://tiki.vn/the-thao/c1975?src=c.1975.hamburger_menu_fly_out_banner'},

    {'Name': 'Xe Máy, Ô tô, Xe Đạp', 
     'URL': 'https://tiki.vn/o-to-xe-may-xe-dap/c8594?src=c.8594.hamburger_menu_fly_out_banner'},

    {'Name': 'Hàng quốc tế', 
     'URL': 'https://tiki.vn/hang-quoc-te/c17166?src=c.17166.hamburger_menu_fly_out_banner'},

    {'Name': 'Sách, VPP & Quà Tặng', 
     'URL': 'https://tiki.vn/nha-sach-tiki/c8322?src=c.8322.hamburger_menu_fly_out_banner'},

    {'Name': 'Voucher - Dịch Vụ - Thẻ Cào', 
     'URL': 'https://tiki.vn/voucher-dich-vu/c11312?src=c.11312.hamburger_menu_fly_out_banner'}
]

# Global driver to use throughout the script
DRIVER = None
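
Later cells pick a category by position (`MAIN_CATEGORIES[-2]`, which is 'Sách, VPP & Quà Tặng' in the list above), and that breaks silently if the list is reordered. A minimal sketch of a name-based lookup follows. `category_url` is a hypothetical helper, not part of the assignment, and the two-entry list is only a shortened stand-in for the full list.

```python
# Shortened copy of MAIN_CATEGORIES, for illustration only.
CATEGORIES = [
    {'Name': 'Sách, VPP & Quà Tặng',
     'URL': 'https://tiki.vn/nha-sach-tiki/c8322?src=c.8322.hamburger_menu_fly_out_banner'},
    {'Name': 'Voucher - Dịch Vụ - Thẻ Cào',
     'URL': 'https://tiki.vn/voucher-dich-vu/c11312?src=c.11312.hamburger_menu_fly_out_banner'},
]


def category_url(name, categories=CATEGORIES):
    """Return the URL of the category called `name` (hypothetical helper)."""
    for category in categories:
        if category['Name'] == name:
            return category['URL']
    raise KeyError(f'Unknown category: {name}')


print(category_url('Sách, VPP & Quà Tặng'))
```

Raising `KeyError` on an unknown name makes a typo fail immediately instead of scraping the wrong category.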

# Function to Start and Close Driver

In [None]:
# Function to (re)start driver
def start_driver(force_restart=False):
    global DRIVER
    
    if DRIVER is not None:
        if force_restart:
            DRIVER.quit()  # quit() ends the whole browser session, not just the current window
        else:
            raise RuntimeError('ERROR: cannot overwrite an active driver. Please close the driver before restarting.')
    
    # Setting up the driver
    print('Initiating driver...')
    options = webdriver.ChromeOptions()
    options.add_argument('--headless')  # run Chrome in the background without opening a window
    options.add_argument('--no-sandbox')
    options.add_argument('--disable-dev-shm-usage')

    DRIVER = webdriver.Chrome('chromedriver', options=options)
    print('Finished!')
    
# Wrapper to close the driver if it has been created
def close_driver():
    global DRIVER
    if DRIVER is not None:
        DRIVER.quit()
    DRIVER = None
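
A global driver has to be closed by hand, and an exception in the middle of a scrape can leave a browser process behind. One possible alternative, sketched below, is a context manager that always shuts the driver down. `managed_driver` and `FakeDriver` are hypothetical names introduced for this illustration; `FakeDriver` stands in for a real Selenium driver so the sketch runs without a browser.

```python
from contextlib import contextmanager


@contextmanager
def managed_driver(factory):
    """Yield a driver built by `factory` and always quit it afterwards."""
    driver = factory()
    try:
        yield driver
    finally:
        driver.quit()  # runs even if the scraping code raises


class FakeDriver:
    """Stand-in for a Selenium driver (assumption: only quit() is needed here)."""

    def __init__(self):
        self.closed = False

    def quit(self):
        self.closed = True


with managed_driver(FakeDriver) as driver:
    pass  # scraping calls would go here

print(driver.closed)  # True
```

With Selenium, `factory` could be a small function that builds `webdriver.Chrome` with the same options as `start_driver`.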

# Function to get info from one product

### Function get_product_info_single

In [None]:
#################
### FUNCTIONS ###
#################


# Function to extract product info from the product
def get_product_info_single(product_item):
    product_info = {'name':'',
                    'price':'',
                    'product_url':'',
                    'image':'',
                    'tikinow':'',
                    'free_delivery':'',
                    'discount_percentage':'',
                    'badge_under_price':'',
                    'number_of_sold_units':'',
                    }

    # Get the product name via find_element_by_class_name
    try:
        product_info['name'] = product_item.find_element_by_class_name('name').get_attribute("textContent")
    except NoSuchElementException:
        pass

    # Get the price via find_element_by_class_name; -1 marks a missing price
    try:
        product_info['price'] = product_item.find_element_by_class_name('price-discount__price').get_attribute("textContent")
    except NoSuchElementException:
        product_info['price'] = -1

    # Read the product page URL from the item's own href attribute
    try:
        product_info['product_url'] = product_item.get_attribute('href')
    except NoSuchElementException:
        pass
    
    # Get the thumbnail URL: the src of the last <img> inside the 'thumbnail' element
    try:
        thumbnail = product_item.find_element_by_class_name('thumbnail').find_elements_by_tag_name('img')[-1]
        product_info['image'] = thumbnail.get_attribute('src')
    except (NoSuchElementException, IndexError):  # IndexError: the thumbnail has no <img>
        pass

    # TikiNow: 'Yes' if the service badge contains an image
    try:
        product_item.find_element_by_class_name('badge-service').find_element_by_tag_name('img')
        product_info['tikinow'] = 'Yes'
    except NoSuchElementException:
        product_info['tikinow'] = 'No'

    # Free delivery: 'Yes' only when the thumbnail holds exactly two <img> tags
    try:
        thumbnail_images = product_item.find_element_by_class_name('thumbnail').find_elements_by_tag_name('img')
        product_info['free_delivery'] = 'Yes' if len(thumbnail_images) == 2 else 'No'
    except NoSuchElementException:
        pass
    
    # Discount percentage
    try:
        product_info['discount_percentage'] = product_item.find_element_by_class_name('price-discount__discount').text
    except NoSuchElementException:
        product_info['discount_percentage'] = 'No discount'

    # Badge under price: 'Yes' if the badge contains an image
    try:
        product_item.find_element_by_class_name('badge-under-price').find_element_by_tag_name('img')
        product_info['badge_under_price'] = 'Yes'
    except NoSuchElementException:
        product_info['badge_under_price'] = 'No'

    # number_of_sold_units
    try:
        product_info['number_of_sold_units'] = product_item.find_element_by_css_selector('.info [class|="styles__StyledQtySold-sc"]').get_attribute("textContent")
    except NoSuchElementException:
        pass

    return product_info
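
The function stores the price exactly as displayed, for example `52.000 ₫` in the sample output further down, or `-1` when the element is missing. For sorting or arithmetic a numeric value is more useful. The sketch below assumes the dot is a thousands separator and that prices have no decimal part; `parse_price` is a hypothetical helper.

```python
import re


def parse_price(text):
    """Convert a price string such as '52.000 ₫' to an integer amount.

    Returns None for non-string placeholders (such as -1) and for
    strings that contain no digits.
    """
    if not isinstance(text, str):
        return None
    digits = re.sub(r'\D', '', text)  # drop separators, spaces and the currency sign
    return int(digits) if digits else None


print(parse_price('52.000 ₫'))   # 52000
print(parse_price('186.400 ₫'))  # 186400
print(parse_price(-1))           # None
```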


### Function to get additional info from the product page

In [None]:
# Function to get the author and the number of reviews from a product page

def get_additional_info(product_item):

    additional_info = {'author':'',
                       'reviews':'',
                      }

    # Get the author name
    try:
        additional_info['author'] = product_item.find_element_by_xpath("//span[contains(@class,'brand-and-author')]//a").text
    except NoSuchElementException:
        pass

    # Get the number of reviews: strip the parentheses, then keep the numeric token
    try:
        reviews = product_item.find_element_by_xpath("//div[contains(@class,'below-title')]//a").text
        tokens = reviews.replace('(', ' ').replace(')', ' ').split()
        for token in tokens:
            if token.isdigit():
                additional_info['reviews'] = token
    except NoSuchElementException:
        pass

    return additional_info
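
Splitting on whitespace only finds the review count when the number stands alone as a token. A regular expression is one way to pull the digits out regardless of the surrounding punctuation. The exact link text on Tiki is not shown in this notebook, so the sample strings below are assumptions, and `extract_review_count` is a hypothetical helper.

```python
import re


def extract_review_count(text):
    """Return the first run of digits in `text` as an int, or None if there is none."""
    match = re.search(r'\d+', text)
    return int(match.group()) if match else None


# Assumed sample strings, for illustration only.
print(extract_review_count('(Xem 2287 đánh giá)'))  # 2287
print(extract_review_count('(2287 đánh giá)'))      # 2287
print(extract_review_count('Chưa có đánh giá'))     # None
```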

# Function to scrape info of all products from a Page URL

In [None]:
# Function to scrape all products from a page
def get_product_info_from_page(page_url):
    """ Extract info from all products of a specific page_url on the Tiki website
        Args:
            page_url: (string) url of the page to scrape
        Returns:
            data: list of dictionaries of product info. If no products are shown, return an empty list.
    """
    global DRIVER
    
    results = []
    DRIVER.get(page_url)  # Load the listing page in the driver
    time.sleep(5)  # Wait for the page to finish loading its products

    # Scrape all products listed on the page
    products_all = DRIVER.find_elements_by_class_name('product-item')
    print(f'Found {len(products_all)} products')

    for product in products_all:
        try:
            product_info = get_product_info_single(product)
            results.append(product_info)
        except NoSuchElementException:
            pass

    return results
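
A fixed `time.sleep(5)` waits the full five seconds even when the products appear sooner, and may still be too short on a slow connection. A polling loop is a common alternative: check repeatedly, return as soon as the condition holds, and give up after a timeout. The sketch below uses only the standard library; `wait_until` is a hypothetical helper.

```python
import time


def wait_until(predicate, timeout=10.0, interval=0.5):
    """Call `predicate` until it returns a truthy value or `timeout` seconds pass.

    Returns the truthy value, or None if the timeout is reached first.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = predicate()
        if result:
            return result
        if time.monotonic() >= deadline:
            return None
        time.sleep(interval)
```

With the driver from this notebook, `wait_until(lambda: DRIVER.find_elements_by_class_name('product-item'))` would return the product list as soon as it is non-empty. Selenium also ships `WebDriverWait` in `selenium.webdriver.support.ui` for the same purpose.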

# Start of Web Scraper

In [None]:
######################
### START SCRAPING ###
######################

num_max_page = 2
main_cart_url = MAIN_CATEGORIES[-2]['URL']

#Close & Start DRIVER
close_driver()
start_driver(force_restart=True) 

#CODE TO GET DATA    
prod_data = get_product_info_from_page(main_cart_url)       

# Get the link to page 2 (an XPath starting with // searches the whole document)
page_2 = DRIVER.find_element_by_xpath("//a[@data-view-label='2']").get_attribute('href')

prod_data_next_page = get_product_info_from_page(page_2)
prod_data.extend(prod_data_next_page)
print(f'Found {len(prod_data)} products total on {num_max_page} pages')

df = pd.DataFrame(prod_data, columns=prod_data[0].keys())
df.index += 1 

# Get all product URLs
url = [dic['product_url'] for dic in prod_data]

addition_info_result = []
# Visit each product page from the 2 listing pages and collect its additional info
for product_url in url:
    DRIVER.get(product_url)
    additional_info = get_additional_info(DRIVER)
    addition_info_result.append(additional_info)

#Save to DF:
df_add = pd.DataFrame(addition_info_result, columns=addition_info_result[0].keys())
df_add.index += 1

##Concatenate 2 dataframes, Rename and Sort
df_all = pd.concat([df,df_add],axis=1,sort=False)
df_all.columns = ['Title','Price','Product URL','Image','Tikinow','Free Delivery','Discount %','Badge Under Price','Number of Sold Units','Author','Number of Reviews']
custom_sort = ['Title','Author','Price','Image','Product URL','Discount %','Number of Sold Units','Number of Reviews', 'Tikinow','Free Delivery','Badge Under Price']
df_all = df_all.reindex(custom_sort, axis=1)

#SAVE TO FILE
df_all.to_csv('tiki_products.csv')

Initiating driver...
Finished!
Found 64 products
Found 61 products
Found 125 products total on 2 pages
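
The run above finds page 2 by locating its link on page 1. Another option is to build the page URLs directly. The sketch below assumes that Tiki category pages accept a `page` query parameter, which this notebook does not confirm; `with_page` is a hypothetical helper.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def with_page(url, page):
    """Return `url` with its `page` query parameter set to `page`.

    Assumption: the site paginates through a `page` query parameter.
    """
    parts = urlsplit(url)
    query = dict(parse_qsl(parts.query))  # keep existing parameters such as `src`
    query['page'] = str(page)
    return urlunsplit(parts._replace(query=urlencode(query)))


base = 'https://tiki.vn/nha-sach-tiki/c8322?src=c.8322.hamburger_menu_fly_out_banner'
for page in range(1, 3):
    print(with_page(base, page))
```

If the assumption holds, a loop over `range(1, num_max_page + 1)` would let `num_max_page` control how many pages are scraped.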


In [None]:
df_all.head(10)

Unnamed: 0,Title,Author,Price,Image,Product URL,Discount %,Number of Sold Units,Number of Reviews,Tikinow,Free Delivery,Badge Under Price
1,AdĐời Ngắn Đừng Ngủ Dài (Tái Bản),Robin Sharma,52.000 ₫,https://salt.tikicdn.com/cache/280x280/ts/prod...,https://tka.tiki.vn/pixel?data=djAwMQkd_PgEob_...,-31%,Đã bán 1000+,2287,Yes,Yes,Yes
2,Muôn Kiếp Nhân Sinh 2,Nguyên Phong,186.400 ₫,https://salt.tikicdn.com/cache/280x280/ts/prod...,https://tiki.vn/muon-kiep-nhan-sinh-2-p9417346...,-30%,Đã bán 1000+,3903,Yes,Yes,No
3,Cây Cam Ngọt Của Tôi,,71.700 ₫,https://salt.tikicdn.com/cache/280x280/ts/prod...,https://tiki.vn/cay-cam-ngot-cua-toi-p74021317...,,Đã bán 1000+,2629,Yes,Yes,No
4,"Cân Bằng Cảm Xúc, Cả Lúc Bão Giông",Richard Nicholls,61.500 ₫,https://salt.tikicdn.com/cache/280x280/ts/prod...,https://tiki.vn/can-bang-cam-xuc-ca-luc-bao-gi...,,Đã bán 1000+,3697,Yes,Yes,No
5,AdHọc Viện - The Institute (Stephen King),Stephen King,149.500 ₫,https://salt.tikicdn.com/cache/280x280/ts/prod...,https://tka.tiki.vn/pixel?data=djAwMXck_DLeeqV...,-50%,Đã bán 676,202,Yes,Yes,Yes
6,Muôn Kiếp Nhân Sinh (Many Lives - Many Times),Nguyên Phong,104.800 ₫,https://salt.tikicdn.com/cache/280x280/ts/prod...,https://tiki.vn/muon-kiep-nhan-sinh-many-lives...,-38%,Đã bán 1000+,3915,Yes,Yes,Yes
7,Thay Đổi Cuộc Sống Với Nhân Số Học,David A. Phillips,161.500 ₫,https://salt.tikicdn.com/cache/280x280/ts/prod...,https://tiki.vn/thay-doi-cuoc-song-voi-nhan-so...,,Đã bán 1000+,2759,Yes,Yes,Yes
8,Rèn Luyện Tư Duy Phản Biện,Albert Rutherford,61.200 ₫,https://salt.tikicdn.com/cache/280x280/ts/prod...,https://tiki.vn/ren-luyen-tu-duy-phan-bien-p46...,,Đã bán 1000+,1162,Yes,Yes,No
9,AdCombo 2 Cuốn: Tâm Lý Học Tội Phạm,Stanton E. Samenow,206.300 ₫,https://salt.tikicdn.com/cache/280x280/ts/prod...,https://tka.tiki.vn/pixel?data=djAwMUdZzYx9Xvd...,-28%,Đã bán 1000+,608,Yes,Yes,No
10,Đọc Vị Bất Kỳ Ai (Tái Bản 2019),TS. David J. Lieberman,52.400 ₫,https://salt.tikicdn.com/cache/280x280/ts/prod...,https://tiki.vn/doc-vi-bat-ky-ai-tai-ban-2019-...,-34%,Đã bán 1000+,1762,Yes,Yes,Yes
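
In the rows above, every title that starts with `Ad` also has a product URL on the `tka.tiki.vn` host. That suggests, though the notebook does not confirm it, that these rows are sponsored listings whose "Ad" label was captured together with the name. A rough sketch for flagging such rows and cleaning their titles follows; `is_sponsored` and `clean_title` are hypothetical helpers, and the URLs are placeholders because the real ones are truncated in the table.

```python
from urllib.parse import urlsplit

SPONSORED_HOST = 'tka.tiki.vn'  # assumption drawn from the sample rows above


def is_sponsored(product_url):
    """True when the product link goes through the assumed ad-tracking host."""
    return urlsplit(product_url).hostname == SPONSORED_HOST


def clean_title(title, product_url):
    """Drop the leading 'Ad' label, but only for rows flagged as sponsored."""
    if is_sponsored(product_url) and title.startswith('Ad'):
        return title[2:]
    return title


# Placeholder URLs and one made-up title, for illustration only.
print(clean_title('AdHọc Viện - The Institute (Stephen King)',
                  'https://tka.tiki.vn/pixel?data=PLACEHOLDER'))
print(clean_title('Adobe Photoshop Guide',
                  'https://tiki.vn/placeholder-product'))
```

Checking the host first keeps titles that legitimately begin with "Ad" intact.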


**Extra Optional Requirement**


Bonus information:

* Is it TikiNow (delivery within 2 hours) <img src="https://salt.tikicdn.com/ts/upload/9f/32/dd/8a8d39d4453399569dfb3e80fe01de75.png">?
* Does it have free delivery?
* Number of reviews?
* How many stars, or what percentage of stars?
* Does it have a "badge under price" (Rẻ hơn hoàn tiền, a refund if found cheaper) <img src="https://salt.tikicdn.com/ts/upload/51/ac/cc/528e80fe3f464f910174e2fdf8887b6f.png">?
* Discount percentage?
* Does it have a "shocking price" badge? <img src="https://salt.tikicdn.com/ts/upload/75/34/d2/4a9a0958a782da8930cdad8f08afff37.png">
* Can it be paid for in installments? <img src="https://salt.tikicdn.com/ts/upload/ba/4e/6e/26e9f2487e9f49b7dcf4043960e687dd.png">
* Does it come with free gifts? <img src="https://salt.tikicdn.com/ts/upload/47/35/8c/446f61d046eba9a305d3f39dc0834c4a.png">
    
