# Collecting unstructured data with web scraping from lazada.sg

* Let's try web scraping the "computers-laptops" category from lazada.sg (http://www.lazada.sg/shop-computers-laptops/).
* Our goal is to get the product links in the "computers-laptops" category.
* Set main_url and category to the appropriate values in main. We will start by scraping only the first page.
* Start by importing the necessary libraries.

In [7]:
import requests # Downloads files and web pages from the Internet.
from bs4 import BeautifulSoup # Parses HTML, the format that web pages are written in.
import csv

**More about the libraries**

**<font color = green>Requests</font>**

The first thing we’ll need to do to scrape a web page is to download the page. We can download pages using the Python requests library. The requests library will make a GET request to a web server, which will download the HTML contents of a given web page for us. There are several different types of requests we can make using requests, of which GET is just one.
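
As a minimal sketch of the download step, the snippet below wraps a GET request in a small helper and then builds (without sending) the request for the first category page. The helper name `fetch_html`, the 10-second timeout, and passing the page number through `params` are illustrative assumptions, not part of the notebook's code.

```python
import requests


def fetch_html(url, timeout=10):
    """Hypothetical helper: download a page and return its HTML text."""
    response = requests.get(url, timeout=timeout)
    response.raise_for_status()  # raise requests.HTTPError on 4xx/5xx responses
    return response.text


# Preparing the request without sending it shows the URL a GET would hit.
prepared = requests.Request(
    "GET",
    "http://www.lazada.sg/shop-computers-laptops/",
    params={"page": 1},
).prepare()
print(prepared.method, prepared.url)
```

Calling `fetch_html(prepared.url)` would perform the actual download; it is left out here so the example runs without network access.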

**<font color=green>Beautiful Soup</font>**

After downloading the HTML document with requests, we can use the BeautifulSoup library to parse it and extract the elements we need, such as the links held in a tags. We first import the library, then create an instance of the BeautifulSoup class to parse the document.
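
A minimal sketch of that parsing step is shown below. The HTML string is made up for illustration; only the class name `c-product-card__name` is taken from the scraping code later in this notebook.

```python
from bs4 import BeautifulSoup

# Made-up HTML standing in for a downloaded category page.
html = """
<html><body>
  <a class="c-product-card__name" href="/example-laptop-a.html">Example Laptop A</a>
  <a class="c-product-card__name" href="/example-laptop-b.html">Example Laptop B</a>
  <a class="other-link" href="/help">Help</a>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")

# Keep only the <a> tags that carry the product-name class.
product_tags = soup.find_all("a", {"class": "c-product-card__name"})
links = [tag.get("href") for tag in product_tags]
names = [tag.get_text(strip=True) for tag in product_tags]

print(links)
print(names)
```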

* The function "get_product_links_from_categories()" will extract the product links.

* Once we have all the links inside the appropriate class, it's time to extract the product information from each page and validate it.

In [8]:
def product_info(product_url):
    '''
    Extracts the fields needed for each product and passes them to csv_writer:
    [id, name, details, rating, store, price, discount, img_url]
    :param product_url: URL of the product page
    '''
    soup = BeautifulSoup(requests.get(product_url).text, "html.parser")
    print('from: ' + product_url)
    
    # Get product ID
    id = product_url.split('-')[-1][:-18] # the last '-'-separated chunk of the URL, minus its trailing 18 characters, is the product ID
    
    # Get product name
    name = soup.find('h1').string.strip() 
    # strip() removes all whitespace at the start and end, including spaces, tabs, newlines and carriage returns. 
    
    # Get product details
    details = ""
    for bullet in soup.find('ul', {'class': 'prd-attributesList ui-listBulleted js-short-description'}).findAll('span'):
        # if there is a string
        if bullet.string:
            details += "--" + bullet.string # Concatenate all details with "--"
            
    # Get product rating
    rating = soup.findAll('div', {'class': 'product-card__rating__stars'})
    if len(rating) != 0:
        rating = int(str(rating[0].findAll('div')[1].get('style')).split()[-1][:-1])
    else:
        rating = 0
    
    # Get store info
    store = soup.findAll('a', {'class': 'product-fulfillment__item__link product__seller__name__anchor'})
    # Products sold and fulfilled by Lazada have no store link
    if len(store) == 0:
        store = 'Lazada'
    else:
        store = store[0].find('span').string.strip()
        
    # Get product price
    price = float(soup.find('span', {'id':'product_price'}).string)
    
    # Get product discount
    discount = soup.findAll('span', {'id': 'product_saving_percentage'})
    if len(discount) != 0:
        discount = float(discount[0].string[:-1])
    else:
        discount = 0
        
    # Get product image
    img_url = soup.findAll('img', {'class' : 'itm-img'})[-1].get('src')
    
    print(name)
    print(rating)
    data = [id, name, details, rating, store, price, discount, img_url]
    # Append the record as one row of result.csv
    csv_writer(data)
    return
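
The rating and discount lines in product_info both take a string, drop its last character, and convert the rest to a number. A more defensive version of that idea might look like the sketch below. The helper name `parse_percent` and the sample strings `"width: 80%"` and `"25%"` are assumptions about what the page contains, not values confirmed by the notebook.

```python
def parse_percent(text, default=0.0):
    """Hypothetical helper: return the number in '25%' or 'width: 80%', else default."""
    if not text:
        return default
    # Take the last whitespace-separated token and strip a trailing '%' or ';'.
    token = text.split()[-1].rstrip("%;")
    try:
        return float(token)
    except ValueError:
        return default


print(parse_percent("width: 80%"))  # assumed shape of the rating style attribute
print(parse_percent("25%"))         # assumed shape of the discount text
print(parse_percent(None))          # missing element falls back to the default
```

Returning a default instead of raising keeps one malformed product page from stopping the whole crawl, at the cost of hiding parsing failures.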

* Now let's go through what needs to be done in get_product_links_from_categories().
* To make it easy to extend the data collection later, we keep the arguments general: (category_url, main_url, num_pages).

In [9]:
def get_product_links_from_categories(category_url, main_url, num_pages):
    '''
    Goes through each page of a category and passes every product URL to product_info.
    :param category_url: e.g. 'http://lazada.sg/shop-category/?page='
    :param main_url: site root that is prepended to each product link
    :param num_pages: number of pages to go through per category
    '''
    for i in range(1,num_pages+1):

        soup = BeautifulSoup(requests.get(category_url+str(i)).text, "html.parser")
        # The findAll method traverses the tree, starting at the given point,
        # and finds all the Tag and NavigableString objects that match the criteria you give.
        # It returns a list, so we loop over the result (or index into it).
        # Here we select the <a> tags that carry the product-name class and read
        # each tag's "href" attribute with get().
        # The href attribute specifies the URL of the page the link goes to.
        
        for link in soup.findAll('a', {'class': 'c-product-card__name'}):
            href = main_url + link.get('href')
            product_info(href)
    return
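
The two URL-building steps in this function can be sketched on their own, without any network access. The helper name `page_urls` and the example product path are hypothetical; `urljoin` is shown as an alternative to plain string concatenation because it also copes with links that are already absolute.

```python
from urllib.parse import urljoin


def page_urls(category_url, num_pages):
    """Hypothetical helper: return the category URLs for pages 1..num_pages."""
    return [category_url + str(i) for i in range(1, num_pages + 1)]


main_url = "http://www.lazada.sg"
cat_url = main_url + "/shop-computers-laptops/?page="

pages = page_urls(cat_url, 2)
print(pages)

# A relative href is resolved against the site root ...
relative = urljoin(main_url, "/example-product.html")
# ... while an absolute href is returned unchanged.
absolute = urljoin(main_url, "http://www.lazada.sg/example-product.html")
print(relative)
print(absolute)
```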

## Store data in a CSV file

In [10]:
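def csv_writer(line):
    # Append one product record to result.csv as a comma-separated row.
    # This is Python 2 code: each value is converted with unicode() and encoded to UTF-8 bytes.
    # unicode() does not exist in Python 3.
    with open("result.csv","a") as csv_file:
        writer = csv.writer(csv_file, delimiter=',')
        writer.writerow([unicode(s).encode("utf-8") for s in line])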
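
The csv_writer cell relies on Python 2's unicode(). Under Python 3, a comparable writer could look like the sketch below, which also writes a header row the first time the file is created. The function name `append_row`, the temporary file location, and the sample records are all made up for illustration; the column names follow the data list built in product_info.

```python
import csv
import os
import tempfile

COLUMNS = ["id", "name", "details", "rating", "store", "price", "discount", "img_url"]


def append_row(path, row):
    """Hypothetical helper: append one record, writing a header if the file is new or empty."""
    is_new = not os.path.exists(path) or os.path.getsize(path) == 0
    # newline="" lets the csv module control line endings; UTF-8 keeps non-ASCII names intact.
    with open(path, "a", newline="", encoding="utf-8") as csv_file:
        writer = csv.writer(csv_file)
        if is_new:
            writer.writerow(COLUMNS)
        writer.writerow(row)


# Write two made-up records to a temporary file and read them back.
path = os.path.join(tempfile.mkdtemp(), "result.csv")
append_row(path, [111, "Example Laptop 13.3″", "--8GB RAM", 80, "Lazada", 999.0, 10.0, "http://example.com/a.jpg"])
append_row(path, [222, "Example Laptop B", "--4GB RAM", 0, "Some Store", 499.0, 0, "http://example.com/b.jpg"])

with open(path, newline="", encoding="utf-8") as csv_file:
    rows = list(csv.reader(csv_file))

print(rows[0])
print(len(rows))
```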

## <font color = red> main </font>

In [None]:
if __name__ == "__main__":
    main_url = 'http://www.lazada.sg'
    category = 'computers-laptops'
    cat_url = main_url + '/shop-' + category + '/?page='    
    get_product_links_from_categories(cat_url, main_url, 1)
    print("") #output not shown as it is too long. 

## Cursory check of the output using pandas

### What is Pandas?
pandas is a Python package providing fast, flexible, and expressive data structures designed to make working with “relational” or “labeled” data both easy and intuitive. It aims to be the fundamental high-level building block for doing practical, real world data analysis in Python. Additionally, it has the broader goal of becoming the most powerful and flexible open source data analysis / manipulation tool available in any language. It is already well on its way toward this goal.
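
One thing to keep in mind when loading result.csv: csv_writer writes no header line, and read_csv by default treats the first line of a file as column names, so the first product would end up as the header. A sketch that supplies the column names explicitly is shown below. The two rows are made-up stand-ins for result.csv, and the column names are taken from the data list in product_info.

```python
import io

import pandas as pd

COLUMNS = ["id", "name", "details", "rating", "store", "price", "discount", "img_url"]

# Made-up rows without a header line, in the same shape csv_writer produces.
sample_csv = io.StringIO(
    "111,Example Laptop A,--8GB RAM,80,Lazada,999.0,10.0,http://example.com/a.jpg\n"
    "222,Example Laptop B,--4GB RAM,0,Some Store,499.0,0,http://example.com/b.jpg\n"
)

# header=None tells pandas that every line is data; names labels the columns.
df = pd.read_csv(sample_csv, header=None, names=COLUMNS)

print(df.shape)
print(df[["id", "name", "price"]])
```

With the real file, the same call would be `pd.read_csv("result.csv", header=None, names=COLUMNS)`.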

In [13]:
import pandas as pd
df1 = pd.read_csv("result.csv")
print(df1) #output not shown as it is too long. 

      8458787       Xiaomi Mi Notebook Air 13.3″ Silver (EXPORT)  \
0    17867726                        Lenovo ideapad 310 (Silver)   
1    17867705                       Lenovo ideapad 100S (Silver)   
2    19158272  GIGABYTE GeForce® GTX 1080 Ti Gaming OC 11GB DDR5   
3    11130118                   Dell SE2417HG 24" Gaming Monitor   
4    11108646  WD MY Book 4TB Desktop External Hard Drive (WD...   
5    10034952  Seagate Backup Plus Slim 2TB Portable External...   
6    16667718  Microsoft Office 2016 Professional Plus Digita...   
7    17867729                        Lenovo ideapad 310 (Silver)   
8    17867727                          Lenovo ideapad 100S (Red)   
9    20355376  New Asus Zenbook ROSE GOLD UX330CA-FC045T (Int...   
10   17598325  Acer Aspire V13 Hello Kitty Limited Edition La...   
11    7954758                           WD RED 4TB NAS Hard Disk   
12   12945473  SanDisk iXpand Mini Flash Drive 32GB USB3.0 fo...   
13   14498643  LZX EVOLVE S3 Gaming Desktop (i5-