# Real world (kickass) data scraping
---
### Goal
Save the products, along with their details, from a given website (here, Lazada Singapore).
### Evaluation
We should be able to retrieve Lazada products and their details.
### Methods
Use the Requests library to fetch the page contents, then Beautiful Soup to parse the HTML and extract the contents of specific tags for display.
### Tools
* Python 2.7
* Beautiful Soup
* Requests


In [1]:
import requests
from bs4 import BeautifulSoup
from IPython.display import Image
import csv

#### Creating the category page URL

In [2]:
main_url = 'http://www.lazada.sg'
category = 'computers-laptops'
page_number = 1

category_scrape_url = main_url + '/shop-' + category + '/?page='+str(page_number)
category_webpage = requests.get(category_scrape_url).text
print "Got the category page! ",category_scrape_url

Got the category page!  http://www.lazada.sg/shop-computers-laptops/?page=1
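
The same pattern extends to several pages of a category. As a minimal sketch, the helper below wraps the URL construction from the cell above; `build_category_url` is a hypothetical name introduced only for illustration, and the `/shop-<category>/?page=<n>` layout is assumed from the URL printed above.

```python
def build_category_url(main_url, category, page_number):
    # Mirrors the string concatenation used in the cell above.
    return '{0}/shop-{1}/?page={2}'.format(main_url, category, page_number)


# URLs for the first three pages of the category.
page_urls = [build_category_url('http://www.lazada.sg', 'computers-laptops', n)
             for n in range(1, 4)]
for url in page_urls:
    print(url)
```

Each URL could then be passed to `requests.get` in the same way as `category_scrape_url`.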


#### Fetching the products on that category page (according to the specified page number)

In [4]:
soup = BeautifulSoup(category_webpage, "html.parser")
links = soup.findAll('a', {'class': 'c-product-card__name'})
# enumerate(..., 1) numbers the links starting from 1
for link_number, link in enumerate(links, 1):
    print link_number,") ",link.text.strip()

1 )  Seagate Backup Plus Slim 2TB Portable External Hard Drive with
Mobile Device Backup USB 3.0 (Black)
2 )  Logitech G102 Optical Gaming Mouse (Online Exclusive)
3 )  Xiaomi Mi Notebook Air 13.3″ Silver (EXPORT)
4 )  SanDisk iXpand Mini Flash Drive 32GB USB3.0 for iPhone and iPad
5 )  USB Flash Drive Memory USB Stick U Disk Pen Drive 2TB Pendrive -
intl
6 )  New Asus Zenbook ROSE GOLD UX330CA-FC045T (Intel m3, 4GB RAM, 128
SSD)
7 )  WD RED 4TB NAS Hard Disk
8 )  WD MY Book 4TB Desktop External Hard Drive (WDBBGB0040HBK)
9 )  GIGABYTE GeForce® GTX 1080 Ti Gaming OC 11GB DDR5
10 )  Dell SE2417HG 24" Gaming Monitor
11 )  Dell U2417H 24" InfinityEdge IPS Monitor
12 )  Green alliance usb splitter dragged four more than 7 port computer
usb extension hub usb otg interface hub converter
13 )  Lenovo ideapad 310 (Silver)
14 )  Lenovo ideapad 100S (Silver)
15 )  HP Printer 2130 Color All in One Print Scan Copy Photo Deskjet
16 )  AMD RYZEN 7 1700X 8-Core 3.4 GHz (3.8 GHz Turbo) Socket AM4 95W


#### Let's get a product from the page!

In [5]:
product_number = 1
product_url = main_url + links[product_number-1].get('href')
product_webpage = requests.get(product_url).text
soup = BeautifulSoup(product_webpage, "html.parser")
print "Got product! ",product_url

Got product!  http://www.lazada.sg/seagate-backup-plus-slim-2tb-portable-external-hard-drive-withmobile-device-backup-usb-30-black-10034952.html?ff=1&sc=ETY=
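
Concatenating `main_url` and the `href` works here because the link is site-relative. If a product card ever carried an absolute URL, plain concatenation would produce a broken address. One hedged alternative is the standard library's `urljoin`, sketched below; the two `href` values are made up for illustration, and the import is written to work on both Python 2.7 and Python 3.

```python
try:
    from urllib.parse import urljoin  # Python 3
except ImportError:
    from urlparse import urljoin  # Python 2.7

main_url = 'http://www.lazada.sg'

# Made-up href values: one site-relative, one already absolute.
relative_href = '/some-product-123.html'
absolute_href = 'http://www.lazada.sg/other-product-456.html'

relative_result = urljoin(main_url, relative_href)
absolute_result = urljoin(main_url, absolute_href)
print(relative_result)
print(absolute_result)
```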


#### Display the product details!

In [7]:
product_name = soup.find('h1', {'id': 'prod_title'}).string.strip()
product_cost = float(soup.find('span', {'id':'product_price'}).string)
product_img_url = soup.findAll('img', {'class' : 'itm-img'})[-1].get('src')

print "\n\nName: ",product_name
print "Price: ",product_cost,"SGD"
# Only the last expression of a notebook cell is rendered, so the image goes last.
Image(url=product_img_url)




Name:  Seagate Backup Plus Slim 2TB Portable External Hard Drive with
Mobile Device Backup USB 3.0 (Black)
Price:  129.0 SGD
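
Calling `float()` directly on the tag text works for a value like `129.00`, but it would raise `ValueError` on text with a thousands separator such as `1,299.00`. Whether the site ever renders prices that way is an assumption; the hypothetical helper `parse_price` below is only a sketch of how the conversion could be made more tolerant.

```python
def parse_price(text):
    # Drop surrounding whitespace and thousands separators before converting.
    cleaned = text.strip().replace(',', '')
    return float(cleaned)


print(parse_price('129.00'))
print(parse_price(' 1,299.00 '))
```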


In [8]:
rating = soup.find('span', {'class': 'ratingNumber'}).find('em').string
print rating

4.6


In [11]:
product_details = soup.find('ul',{'class' : 'js-short-description'})
print product_details


<ul class="prd-attributesList ui-listBulleted js-short-description"><li class=""><span>200GB of free OneDrive cloud storage for 2 years is included
when you register a new Backup Plus drive ($95US value)</span></li><li class=""><span>After registering your drive on Seagate.com, a link will be
provided to add 200GB to any new or existing OneDrive account</span></li><li class=""><span>Only one offer can be redeemed per OneDrive account, offers
must be activated by June 30, 2017 and may not be available in all
countries</span></li><li class=""><span>Create easy customized backup plans with included Seagate
Dashboard software</span></li><li class=""><span>Quick file transfer with USB 3.0 connectivity</span></li><li class=""><span>USB powered -no power supply necessary</span></li></ul>


In [20]:
import json
product_details_array = []
for product in product_details:
    product_details_array.append(product.string.strip()) # for csv!
print json.dumps(product_details_array, indent=2)
print '============'
print '\n - '.join(product_details_array)

[
  "200GB of free OneDrive cloud storage for 2 years is included\nwhen you register a new Backup Plus drive ($95US value)", 
  "After registering your drive on Seagate.com, a link will be\nprovided to add 200GB to any new or existing OneDrive account", 
  "Only one offer can be redeemed per OneDrive account, offers\nmust be activated by June 30, 2017 and may not be available in all\ncountries", 
  "Create easy customized backup plans with included Seagate\nDashboard software", 
  "Quick file transfer with USB 3.0 connectivity", 
  "USB powered -no power supply necessary"
]
200GB of free OneDrive cloud storage for 2 years is included
when you register a new Backup Plus drive ($95US value)
 - After registering your drive on Seagate.com, a link will be
provided to add 200GB to any new or existing OneDrive account
 - Only one offer can be redeemed per OneDrive account, offers
must be activated by June 30, 2017 and may not be available in all
countries
 - Create easy customized backup plan
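
The notebook imports `csv` and collects the bullet points "for csv", but never writes a file. The sketch below shows one possible way to do that, under several assumptions: the column layout, the `' | '` separator, the file name, and the helper names (`collapse_whitespace`, `open_csv_for_writing`, `save_product_row`) are all hypothetical. The product name is shortened and the two bullets are copied from the output above. Note that under Python 2.7 the `csv` module does not accept non-ASCII `unicode` values directly, so real scraped text may need encoding to UTF-8 first.

```python
import csv
import os
import sys
import tempfile


def collapse_whitespace(text):
    # Turn the embedded newlines seen in the scraped bullets into single spaces.
    return ' '.join(text.split())


def open_csv_for_writing(path):
    # The csv module expects a binary file on Python 2 and a text file
    # opened with newline='' on Python 3.
    if sys.version_info[0] == 2:
        return open(path, 'wb')
    return open(path, 'w', newline='')


def save_product_row(path, name, price, details):
    # Hypothetical layout: a header row, then one row for the product,
    # with all bullet points joined into a single cell.
    with open_csv_for_writing(path) as handle:
        writer = csv.writer(handle)
        writer.writerow(['name', 'price_sgd', 'details'])
        writer.writerow([name, price,
                         ' | '.join(collapse_whitespace(d) for d in details)])


sample_details = [
    'Create easy customized backup plans with included Seagate\nDashboard software',
    'Quick file transfer with USB 3.0 connectivity',
]
csv_path = os.path.join(tempfile.mkdtemp(), 'products.csv')
save_product_row(csv_path, 'Seagate Backup Plus Slim 2TB', 129.0, sample_details)

# Read the file back to check what was written.
with open(csv_path) as handle:
    saved_rows = list(csv.reader(handle))
print(saved_rows[1])
```

Joining the bullets into one cell keeps one row per product; a separate details file keyed by product would be another reasonable layout.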

In [22]:
product_discount = soup.find('span', {'id' : 'product_saving_percentage'}).string
print product_discount

 28%
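
The discount comes back as text with a leading space and a percent sign. If a numeric value is needed later (for sorting or for a CSV column, say), it could be converted along these lines; `parse_discount` is a hypothetical helper, not something the notebook defines.

```python
def parse_discount(text):
    # ' 28%' -> 0.28: strip whitespace and the percent sign, then scale to a fraction.
    return float(text.strip().rstrip('%')) / 100.0


discount_fraction = parse_discount(' 28%')
print(discount_fraction)
```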


In [23]:
product_reviews = soup.find('ul', {'class' : 'ratRev_reviewList' , 'id' : 'js_reviews_list'})
print product_reviews

<ul class="ratRev_reviewList" id="js_reviews_list">
<li class="ratRev_reviewListRow">
<div class="ratRev_revDetails">
<span class="ratRev_rating-option">
<ul class="ratRev_ratOptions">
<li>
<div class="product-card__rating__stars ">
<div>
<span class="icon-svg product-card__icon-star product-card__rating__icon-star-grey"></span><!--
--><span class="icon-svg product-card__icon-star product-card__rating__icon-star-grey"></span><!--
--><span class="icon-svg product-card__icon-star product-card__rating__icon-star-grey"></span><!--
--><span class="icon-svg product-card__icon-star product-card__rating__icon-star-grey"></span><!--
--><span class="icon-svg product-card__icon-star product-card__rating__icon-star-grey"></span>
</div>
<div style="width: 100%">
<span class="icon-svg product-card__icon-star product-card__rating__icon-star-orange"></span><!--
--><span class="icon-svg product-card__icon-star product-card__rating__icon-star-orange"></span><!--
--><span class="icon-svg product-card__ic

In [30]:
import json
reviews_array = []

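for review in product_reviews:
    # Iterating over the <ul> also yields the whitespace text nodes between rows.
    # A text node returns itself from .string, while an <li> row with several
    # children returns None, so this check keeps only the review rows.
    if review.string is None: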
        title = review.find('span', {'class': 'ratRev_revTitle'})
        detail = review.find('div', {'class': 'ratRev_revDetail'})
        reviews_array.append({
            'title': title.string,
            'detail': detail.string
        })

print json.dumps(reviews_array, indent=2)

[
  {
    "detail": "\n                        It works as it should with no issues; compatible with my Macbook Pro.                    ", 
    "title": "\n                            It Works                        "
  }, 
  {
    "detail": null, 
    "title": "\n                            Seagate backup plus Slim 2TB HDD                        "
  }, 
  {
    "detail": "\n                        Small compact size and yet 2TB.  Can't ask for more.                    ", 
    "title": "\n                            Five Stars                        "
  }, 
  {
    "detail": "\n                        Item received in good condition. So far so gd.                    ", 
    "title": "\n                            Good buy                        "
  }, 
  {
    "detail": "\n", 
    "title": "\n                            Five stars                        "
  }, 
  {
    "detail": "\n                        Item received as described                    ", 
    "title": "\n               