
# Tiki Web Scraping with Beautiful Soup

<img src="https://i.imgur.com/S6f1yCQ.jpg" width=600>

**Due Date**: Before Monday next week.
**Overview**: Build a web-crawler that take in a Tiki URL and return a dataframe 

**Libraries:** To complete this project, we need 
- pandas to manage dataframe 
- requests to retrieve the HTML code in our website 
- BeatifulSoup: to parse the HTML code and get relavant information through HTML tags

**Requirements** 
1. Your function should be able to take in an URL and return a pandas dataframe
2. The final dataframe should contain the following informations: 
    * Product ID
    * Seller ID
    * Product title
    * Price
    * URL of the product image
    * URL of that product page

Bonus information:

* Is it TikiNow (delivery within 2 hours) <img src="https://salt.tikicdn.com/ts/upload/9f/32/dd/8a8d39d4453399569dfb3e80fe01de75.png">?
* Is it free delivery?
* Number of reviews?
* How many stars or percentage of stars?
* Does it got "badge under price" (Rẻ hơn hoàn tiền) <img src="https://salt.tikicdn.com/ts/upload/51/ac/cc/528e80fe3f464f910174e2fdf8887b6f.png">?
* Discount percentage?
* Does it got "shocking price" badge ? <img src="https://salt.tikicdn.com/ts/upload/75/34/d2/4a9a0958a782da8930cdad8f08afff37.png">
* Does it allowed to be paid by installments? <img src="https://salt.tikicdn.com/ts/upload/ba/4e/6e/26e9f2487e9f49b7dcf4043960e687dd.png">
* Does it comes with free gifts? <img src="https://salt.tikicdn.com/ts/upload/47/35/8c/446f61d046eba9a305d3f39dc0834c4a.png">
    

<br>

**Here is Sample Result with basic information of products**

![](https://i.imgur.com/QezTlCw.png)



##Below let's assemble the final project 

In [2]:
# # installing  selenium 
!pip install selenium
!apt install chromium-chromedriver
!cp /usr/lib/chromium-browser/chromedriver /usr/bin
!pip install webdriver-manager

# imports
from selenium import webdriver 
from webdriver_manager.chrome import ChromeDriverManager
from bs4 import BeautifulSoup
#setting up the options
options = webdriver.ChromeOptions()
options.add_argument('-headless') # since we run selenium on Google Colab so we don't want a chrome browser opens, so it will run in the background
options.add_argument('-no-sandbox')
options.add_argument('-disable-dev-shm-usage')

Collecting selenium
[?25l  Downloading https://files.pythonhosted.org/packages/80/d6/4294f0b4bce4de0abf13e17190289f9d0613b0a44e5dd6a7f5ca98459853/selenium-3.141.0-py2.py3-none-any.whl (904kB)
[K     |████████████████████████████████| 911kB 6.7MB/s 
Installing collected packages: selenium
Successfully installed selenium-3.141.0
Reading package lists... Done
Building dependency tree       
Reading state information... Done
The following additional packages will be installed:
  chromium-browser chromium-browser-l10n chromium-codecs-ffmpeg-extra
Suggested packages:
  webaccounts-chromium-extension unity-chromium-extension
The following NEW packages will be installed:
  chromium-browser chromium-browser-l10n chromium-chromedriver
  chromium-codecs-ffmpeg-extra
0 upgraded, 4 newly installed, 0 to remove and 30 not upgraded.
Need to get 83.2 MB of archives.
After this operation, 282 MB of additional disk space will be used.
Get:1 http://archive.ubuntu.com/ubuntu bionic-updates/universe amd6

In [3]:


# This cell gets all the product info into a list of lists 
all_pages = []


for i in range(1,22) :
  # our url 
  url = 'https://tiki.vn/o-to-xe-may-xe-dap/c8594?page='+str(i)+'&src=c.8594.hamburger_menu_fly_out_banner'
# get request
  driver = webdriver.Chrome('chromedriver',options=options)
  driver.implicitly_wait(10)
  driver.get(url)
  r = driver.page_source
  driver.close()
  soup = BeautifulSoup(r, 'html.parser')
  products = soup.find_all('a',{'class':'product-item'})
  all_pages.append(products)

In [4]:
# Making it into a single list which will be easier to loop through
all_pages_list = []
for sublist in all_pages:
    for item in sublist:
        all_pages_list.append(item)
all_pages_list

[<a class="product-item" data-view-id="product_list_item" data-view-index="0" href="//tka.tiki.vn/pixel?data=djAwMZUcahaoRYWFK3m5Kc6BVLpG3OMt7NkJcehUKJe-egpqELBvE5egN_Y9aCpf1PiRvFZVpeA1VNYvm9ve2HAz_6WB_C3Dr3ZI1pUGkKW40PAMjzXUoyBfZxoyfAuNe-PNXyD-_nia0o8oZs6Hpj1udQQ_HrFOIX-AvpIq3rfAbA0hPsgbCxzbIcfn8Rh1kUUtXarSC6HGKodNFeMHbJKPvlDtlbWtM3h7YMSh8YaPuj-ZBxSCc7W1fVBd6fLavU0TdTkp0yggzcQ3dzPcQwlLkYgWD_a_9vN89yxC7wJNySlXBL2SKuABKGWlBLTDRhmafQKrKLywgn7ad_fq8vhRXYN6nI6Gsa64WZnGRgiCSdSYlaCrUqctRSyxrp1febXVc7Ap_-XNEeiuoXwgXE7-mzDrOFCLzhXeen7YgngoCNRP5Pych2-Ddapbblt3aKbrIzq8raHEDad1nsT8iOsJTgrezvpk0yFvi7ukQAKx2xJoCnJklnBwP5lgWiuakLSMQiuLwSGXDFR7VH4WnkXJpBVU1gmZkDNSW7wRbQ6YbQGx-AQQ9gfoX9UV3vEcmpSYt1oj2ldVTj-XzmRqhZvsSmYrsC-0FMZTdyBcW1ySAzpzLkGraMQ2Zbvao8bFO1UtyLeXbvkl1iHE6qSvEtLifY4QxWSj-I-zTXbSWx84HwnurGTA9czIUv3TGYYO8UORgGowI-NWZN4CJYEPDFAMGijIWD4jgaeswJWkp9NjdhoSGm04iU7KBzROUZPjHAoLRCBnPelLwW5wPRQaJFKKEo-BHk_yanCZNafjJdAI5rMDSUUoIiZ62Pa-U-8qGQQexio8-CFjxq3c3H4jOXvsa21SZGwiEhFZ7pwplVQqwSeM2A0r3j_eXAI

In [5]:

def find_sku(url):
    product_id_list = []

    driver = webdriver.Chrome('chromedriver',options=options)
    driver.get(url)  
    driver.implicitly_wait(10)
    b =  driver.page_source
    driver.close()
    skusoup = BeautifulSoup(b, 'html.parser') 
    sku = bsoup.find('div', {'class': 'content has-table'}).text[-13:]
    product_id_list.append(sku)
    return  product_id_list


In [6]:
#defining all the other functions required 


def badge_under(prod):
  if len(prod.find('div',{'class':'badge-under-price'})) > 0 :
    return True
  else:
    return False

def installments(prod):
  if len(prod.find('div',{'class':'badge-benefits'})) > 0 :
    return True
  else:
    return False

def free_gift(prod):
  if prod.find('div',{'class': 'freegift-list'}):
    return True
  else: 
    return False

import re 
#checked_correct !
def img_url(prod):    
  imgs = prod.find_all('img')
  pattern = r'.*/280x280/.*'
  product_pic =[]
  img_len = len(imgs)
  for i in range(img_len-1):
    match_obj = re.findall(pattern,imgs[i]['src'])
    if match_obj:
      product_pic.append(match_obj)
  return product_pic
def free_deli (prod) : 
  imgs = prod.find_all('img')
  pattern_fp = r'https://salt.tikicdn.com/ts/upload/f3/74/46/f4c52053d220e94a047410420eaf9faf.png'
  free_pic = []
  img_len = len(imgs)
  for i in range(img_len-1):
    match_fp = re.findall(pattern_fp,imgs[i]['src'])
    if match_fp :
      return True
    return False
  
def tikinow (prod) : 
  imgs = prod.find_all('img')
  pattern_fp = r'https://salt.tikicdn.com/ts/upload/9f/32/dd/8a8d39d4453399569dfb3e80fe01de75.png'
  tiki_pic = []
  img_len = len(imgs)
  for i in range(img_len-1):
    match_fp = re.findall(pattern_fp,imgs[i]['src'])
    if match_fp :
      return True
    return False


In [7]:
# the final function 

def product_info():  
  """ Gets us all the info for the tiki products and puts it into a dictionary
  """
  data = []
  for product in all_pages_list:
    type(product['href'])
    categories = {"Product_ID": "",
                  "Product_title":"",
                  "Price":"",
                  "URL_image":"",
                  "URL_product":"",
                  "TikiNow":"",
                  "free_delivery":"",
                  "Number_of_reviews": "",
                  "rating_average" : "" ,
                  "badge_under_price" : "", 
                  "discount_percent": "",  
                  "paid_by_installments": "",
                  "free_gifts" : "" } 
    try:
      categories["Product_title"]= product.find('div',{'class':'name'}).text 
      categories["Price"]= product.find('div', {"class":'price-discount__price'})  
      categories["URL_image"]= img_url(product)
      categories["URL_product"] = 'https://tiki.vn'+ product['href']  # not sure 
      categories["TikiNow"]= tikinow(product)      # need to finish
      categories["free_delivery"]= free_deli(product) # need to finish 
      categories["Number_of_reviews"]= product.find('div',{'class':'review'}).text  
      categories["rating_average"]= product.find('div',{'class':'rating__average'}).get('style').lstrip('width: ') 
      categories["badge_under_price"]= badge_under(product)
      categories["discount_percent"]= product.find('div', {"class":'price-discount__discount'}).text 
      categories["paid_by_installments"]= installments(product) 
      categories["free_gifts"]= free_gift(product)  
      #categories["Product_ID"] = find_sku('https://tiki.vn'+product['href'])[all_pages_list.index(product)] This one makes it take forever to run 
    except:
      print('there was an error')
    
    data.append(categories)
  
  return data
  
  

In [8]:
data = product_info()

import pandas as pd

product_DF  = pd.DataFrame(data = data, columns = data[0].keys())

there was an error
there was an error
there was an error
there was an error
there was an error
there was an error
there was an error
there was an error
there was an error
there was an error
there was an error
there was an error
there was an error
there was an error
there was an error
there was an error
there was an error
there was an error
there was an error
there was an error
there was an error
there was an error
there was an error
there was an error
there was an error
there was an error
there was an error
there was an error
there was an error
there was an error
there was an error
there was an error
there was an error
there was an error
there was an error
there was an error
there was an error
there was an error
there was an error
there was an error
there was an error
there was an error
there was an error
there was an error
there was an error
there was an error
there was an error
there was an error
there was an error
there was an error
there was an error
there was an error
there was an

In [9]:
product_DF

Unnamed: 0,Product_ID,Product_title,Price,URL_image,URL_product,TikiNow,free_delivery,Number_of_reviews,rating_average,badge_under_price,discount_percent,paid_by_installments,free_gifts
0,,"AdBộ pin sạc dự phòng kích nổ ắc quy xe ô tô, ...",[968.000 ₫],[],https://tiki.vn//tka.tiki.vn/pixel?data=djAwMZ...,False,True,(11),78%;,False,-19%,False,False
1,,Xe máy Honda Air Blade (2021) 125cc CBS,[39.490.000 ₫],[[https://salt.tikicdn.com/cache/280x280/ts/pr...,https://tiki.vn/xe-may-honda-air-blade-2021-12...,False,False,(6),100%;,False,-14%,True,False
2,,Xe máy Honda Air Blade (2021) 125cc Đặc biệt P...,[40.950.000 ₫],[[https://salt.tikicdn.com/cache/280x280/ts/pr...,https://tiki.vn/xe-may-honda-air-blade-2021-12...,False,True,(12),100%;,False,-15%,True,False
3,,Xe máy Honda Air Blade (2021) 150cc ABS,[49.890.000 ₫],[[https://salt.tikicdn.com/cache/280x280/ts/pr...,https://tiki.vn/xe-may-honda-air-blade-2021-15...,False,False,(2),100%,False,-14%,True,False
4,,"AdGối tựa đầu chống mỏi vai, cổ dùng trên xe h...",[510.000 ₫],[],https://tiki.vn//tka.tiki.vn/pixel?data=djAwMY...,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...
1033,,Bơm Xe Đạp Mini Treo Xe Gắn Khung Sườn Có Đế G...,[99.000 ₫],[[https://salt.tikicdn.com/cache/280x280/ts/pr...,https://tiki.vn/bom-xe-dap-mini-treo-xe-gan-kh...,False,False,(12),88%;,False,-45%,False,False
1034,,Đèn pha led 3 chân h4 /e01c (Ánh sáng trắng xanh),[118.990 ₫],[],https://tiki.vn/den-pha-led-3-chan-h4-e01c-anh...,False,True,,,,,,
1035,,Mũ Bảo Hiểm Nửa Đầu RONA Sơn Tem Khủng Long - ...,[300.000 ₫],[[https://salt.tikicdn.com/cache/280x280/ts/pr...,https://tiki.vn/mu-bao-hiem-nua-dau-rona-son-t...,False,False,(9),100%;,False,,,
1036,,Thảm Lót Chân Dành Cho AIR BLADE Cao Su Loại Đ...,[99.887 ₫],[[https://salt.tikicdn.com/cache/280x280/ts/pr...,https://tiki.vn/tham-lot-chan-danh-cho-air-bla...,False,False,(7),92%;,False,-47%,False,False


In [10]:
product_df.to_csv("motorbike.csv", index=False)

NameError: ignored