# Summer Practicum Program in Data Science and AI 2023

## Challenge Overview:
The challenge requires you to develop a tool to scrape data from a website of your choice from a selected list. You will then need to submit your code and the scraped data (CSV files) to a GitHub repository. Please note that you have the freedom to choose the programming language and tools that best align with your expertise and preferences, and can learn from other online tutorials that might be available (for example https://kienthuclaptrinh.vn/2020/10/22/xay-dung-crawler-sieu-don-gian-voi-java/, https://github.com/hoadh/simple-java-crawler). 

## Instructions:
One of the topics in the training program is to develop a semantic search engine that helps real estate users search for their potential real estate properties in Vietnam. There could be multiple data sources. Select a website from the provided list and scrape relevant data using the tool you develop.
- https://batdongsan.com.vn/
- https://batdongsan24h.com.vn/
- https://nhadat24h.net/
- https://alonhadat.com.vn/
- https://dothi.net/
- http://123nhadat.vn/

- Consider scraping images as well, which could be used for visual search
- Consider scraping timestamp of the posts
- Create a GitHub repository to host your code and data. The more data the better, in terms of both data fields and instances, and you can continue collecting more data and improving the tool afterward.
- Commit and push your code and the scraped data to a GitHub repository.

## Import some important libraries
- We use BeautifulSoup to extract text from HTML pages.
- urllib and requests are imported for reading the HTML page associated with a given URL.
- We choose the first website, https://batdongsan.com.vn/, for crawling data. Since this site is protected by Cloudflare, we also use the cloudscraper library to get past Cloudflare's anti-bot page. This module is simple and useful when crawling a website protected by Cloudflare.
- Pandas is used for preprocessing the obtained raw data.


In [35]:
# -*- coding: utf-8 -*-
import sys
import codecs
import json
from bs4 import BeautifulSoup
import requests
from http.server import BaseHTTPRequestHandler, HTTPServer
import urllib.parse as urlparse
from bs4 import BeautifulSoup as beauty
import cloudscraper
from datetime import datetime
import time
import pandas as pd

The function soup_to_data_dict below extracts useful data from the BeautifulSoup output for an HTML page.
Input: the soup of the HTML page of one real estate post on https://batdongsan.com.vn/
Output: a dictionary corresponding to that post. We take some important features:
- product info: Mức giá (Price), Diện tích (Area), Phòng ngủ (Number of bedrooms)
- product spec: Mặt tiền (Facade), Đường vào (Entrance), Hướng nhà (House direction), Hướng ban công (Balcony direction), Số tầng (Number of floors), Pháp lý (Legal documents)
- product desc: the description written by the seller
- seller info: nameSeller, mobileSeller, emailSeller, userId
- item info: latitude, longitude, productId, count, price, pricePerM2, categoryId, cityCode, districtId, streetId, wardId
- image: links to the images of the post
- timestamp: the time at which the post was scraped

In [2]:
# From soup format of one page to data under dictionary format
def soup_to_data_dict(soup_temp):
    attribute_names = []
    attribute_values = []
    ## Product info
    product_info = soup_temp.find_all('ul', attrs={'class' : 're__product-info'})
    if len(product_info) !=0:
        for el in product_info[0].find_all('span', attrs={'class' : 're__sp1'}):
            attribute_names.append(el.text)
        for el in product_info[0].find_all('span', attrs={'class' : 're__sp3'}):
            attribute_values.append(el.text)  

    ## Product spec
    product_spec = soup_temp.find_all('div', attrs={'class' : 're__pr-specs-content-item'})
    if len(product_spec) !=0:
        for el_spec in product_spec:
            attr = el_spec.find('span', attrs={'class' : 're__pr-specs-content-item-title'}).text
            attribute_names.append(attr)
            val = el_spec.find('span', attrs={'class' : 're__pr-specs-content-item-value'}).text
            attribute_values.append(val) 

    ## Product desc
    product_desc = soup_temp.find_all('div', attrs={'class' : 're__detail-content js__tracking'})
    if len(product_desc) !=0:
        attribute_names.append('Thông tin mô tả')
        attribute_values.append(soup_temp.find_all('div', attrs={'class' : 're__detail-content js__tracking'})[0].text) 

    # Item and User info in script
    string = ''

    for product_script in soup_temp.find_all('script'):
        string = string + ' '.join(product_script)

    if 'function initProductDetails()' in string:
        str_user = string.split('function initProductDetails()')[1].split('});')[0]

        for i in ['nameSeller: ','mobileSeller: ','emailSeller: ','userId: ','productId: ']:
            if i in str_user:
                attribute_names.append(i)
                attribute_values.append(str_user.split(i)[1].split(',')[0])
    
    if 'function initListingHistoryLazy()' in string:
        str_item = string.split('function initListingHistoryLazy()')[1].split('});')[0]
        for i in ['count: ','price: ','pricePerM2: ','categoryId: ','cityCode: ','districtId: ','streetId: ','wardId: ','latitude: ']:
            if i in str_item:
                attribute_names.append(i)
                attribute_values.append(str_item.split(i)[1].split(',')[0])
        if 'longitude: ' in str_item:
            attribute_names.append('longitude: ')
            attribute_values.append(str_item.split('longitude: ')[1].split('\n')[0])

    # img
    number_image = len(soup_temp.find_all('a', attrs={'class' : 're__pr-image-cover lazyload'}))
    if number_image !=0:
        for index in range(number_image):
            attribute_names.append('Img '+ str(index))
            attribute_values.append(soup_temp.find_all('a', attrs={'class' : 're__pr-image-cover lazyload'})[index]['data-bg'])


    # scraped timestamp
    attribute_names.append('timestamp')
    now = datetime.now()
    timestamp = str(now.strftime("%Y%m%d_%H-%M-%S"))
    attribute_values.append(timestamp)

    dist = {}
    for index,names in enumerate(attribute_names):
        dist[names] = attribute_values[index]

    return dist
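
The script-parsing step inside soup_to_data_dict can be tried in isolation. The sketch below applies the same split-based idea to a hand-written string. The helper name `extract_script_fields` and the sample text are illustrative assumptions, not taken from the site, and the real inline script may be formatted differently.

```python
def extract_script_fields(script_text, keys):
    """Return {key: raw value} for each `key: value` pair found in script_text."""
    fields = {}
    for key in keys:
        marker = key + ": "
        if marker in script_text:
            # Take the text after the marker, up to the next comma or line break.
            raw = script_text.split(marker, 1)[1]
            raw = raw.split(",", 1)[0].split("\n", 1)[0]
            # Drop whitespace and the quotes left over from the JavaScript literal.
            fields[key] = raw.strip().strip("'\"")
    return fields


# Hypothetical fragment shaped like the inline script the function searches.
sample = """
function initProductDetails() {
    init({
        nameSeller: 'Nguyen Van A',
        userId: 12345,
        productId: 37505276
    });
}
"""

fields = extract_script_fields(
    sample, ["nameSeller", "userId", "productId", "emailSeller"]
)
print(fields)
```

Keys that are absent (here `emailSeller`) are simply skipped. As with the original split on `','`, a value that itself contains a comma would be cut short, so this is only suitable for simple scalar fields.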

Each listing page of the website has about 19 posts. Each page URL has the form https://batdongsan.com.vn/nha-dat-ban/p followed by the page number.

We can set the number of pages to crawl according to our needs. The page URLs are stored in the list_link_post variable, and each element of list_link_item contains the URL of one post.
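
As a minimal sketch of the URL pattern described above, the page URLs can be generated with a small helper. The names `BASE_LISTING_URL` and `build_page_urls` are hypothetical, and starting at page 2 is an assumption that mirrors the `i+2` offset used in the crawl cell.

```python
BASE_LISTING_URL = "https://batdongsan.com.vn/nha-dat-ban/p"


def build_page_urls(first_page, page_count):
    """Return listing-page URLs for pages first_page .. first_page + page_count - 1."""
    return [
        f"{BASE_LISTING_URL}{page}"
        for page in range(first_page, first_page + page_count)
    ]


page_urls = build_page_urls(first_page=2, page_count=3)
print(page_urls)
```

Keeping the first page and the page count as parameters makes it easy to resume a crawl from a later page.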

In [101]:
number_crawl_pages = 1000
link_post = "https://batdongsan.com.vn/nha-dat-ban/p"

list_link_post = []
for i in range(number_crawl_pages):
    list_link_post.append(link_post+str(i+2))

list_link_item = []
for url in list_link_post:
    scraper = cloudscraper.create_scraper(delay=5, browser='chrome') 
    info = scraper.get(url).text
    soup = beauty(info, "html.parser")
    
    
    productDivs = soup.findAll('div', attrs={'class' : 'js__card js__card-full-web pr-container re__card-full re__vip-diamond'})
    for el in productDivs:
        list_link_item.append('https://batdongsan.com.vn'+el.find('a')['href'])

    productDivs = soup.findAll('div', attrs={'class' : 'js__card js__card-wap re__card re__vip-silver'})
    for el in productDivs:
        list_link_item.append('https://batdongsan.com.vn'+el.find('a')['href'])  
        
    productDivs = soup.findAll('div', attrs={'class' : 'js__card js__card-wap re__card re__vip-normal re__card-col-2'})
    for el in productDivs:
        list_link_item.append('https://batdongsan.com.vn'+el.find('a')['href'])  
    #time.sleep(5)

    
    


['https://batdongsan.com.vn/ban-can-ho-chung-cu-phuong-trung-van-prj-nha-o-xa-hoi-nhs-trung-van/ban-nhanh-goc-69-9m2-76-8m2-88-8m2-tai-du-an-gia-uu-dai-vao-ten-truc-tiep-cdt-pr37505276',
 'https://batdongsan.com.vn/ban-can-ho-chung-cu-duong-nguyen-trai-phuong-thuong-dinh-prj-royal-city/chuyen-nhuong-royalcity-thang-6-2023-tu-van-xem-nha-24-7-khach-mua-lam-viec-truc-tiep-chu-nha-pr37423641',
 'https://batdongsan.com.vn/ban-can-ho-chung-cu-phuong-tay-mo-prj-the-tonkin-vinhomes-smart-city/tonkin2-3pn-s-90m2-chi-4-1-ty-ck-gan-1-ty-tru-vao-gia-nhan-nha-o-luon-co-2-cho-do-o-to-ky-cdt-pr37558003',
 'https://batdongsan.com.vn/ban-can-ho-chung-cu-phuong-long-thanh-my-prj-vinhomes-grand-park/ro-hang-ngop-q9-thich-hop-dau-tu-an-studio-1-2ty-1pn-1-7ty-2pn-1-99ty-pr31392615',
 'https://batdongsan.com.vn/ban-can-ho-chung-cu-duong-mai-chi-tho-phuong-an-phu-prj-the-sun-avenue/nh-chu-gui-ban-3pn-2wc-90m2-full-noi-that-suat-mua-ban-nguoi-nuoc-ngoai-5-1ty-bao-het-phi-pr37557599',
 'https://batdongsan.com

In [102]:
# number of extracted posts
len(list_link_item)

7869
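
This count may include the same post more than once, for example if a post appears on two consecutive listing pages while the crawl is running. The sketch below shows one possible way to build absolute URLs with `urllib.parse.urljoin` and drop repeats while keeping the original order. The helper name and the sample hrefs are made up for illustration.

```python
from urllib.parse import urljoin

SITE_ROOT = "https://batdongsan.com.vn"


def to_unique_post_urls(hrefs):
    """Resolve hrefs against the site root and drop repeats, keeping first-seen order."""
    absolute = (urljoin(SITE_ROOT, href) for href in hrefs)
    # dict.fromkeys keeps only the first occurrence of each key, in insertion order.
    return list(dict.fromkeys(absolute))


hrefs = ["/ban-nha-a-pr1", "/ban-nha-b-pr2", "/ban-nha-a-pr1"]  # made-up hrefs
unique_links = to_unique_post_urls(hrefs)
print(len(hrefs), len(unique_links))
```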

In [29]:
#save list link of item
with open('./text_file/list_link_item.txt', 'w') as f:
    for line in list_link_item:
        f.write(f"{line}\n")

In [5]:
# #open file list_link_item file
# with open('list_link_item.txt') as f:
#     lines = [line.rstrip('\n') for line in f]
# list_link_item = lines



In [6]:
# scrape the first 500 posts and append one dictionary per line to raw_data_list.txt
with open('raw_data_list.txt', 'a', encoding='utf-8') as f:
    for link in list_link_item[:500]:
        scraper = cloudscraper.create_scraper(delay=10, browser='chrome') 
        info = scraper.get(link).text
        soup = beauty(info, "html.parser")
        data_dict = soup_to_data_dict(soup)
        f.write(f"{data_dict}\n")
        time.sleep(10)  # pause between requests to avoid overloading the site
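
A long crawl like this one stops at the first request that raises an exception. One possible safeguard, sketched below under the assumption that the fetcher is any callable taking a URL and returning HTML, is to retry a failed request a few times before giving up. The helper `fetch_with_retry` and the stand-in `flaky_fetch` are hypothetical and not part of the original notebook.

```python
import time


def fetch_with_retry(fetch, url, attempts=3, delay_seconds=10):
    """Call fetch(url); on an exception wait delay_seconds and retry, up to `attempts` times."""
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return fetch(url)
        except Exception as error:  # broad on purpose: the fetcher's error types are unknown here
            last_error = error
            if attempt < attempts:
                time.sleep(delay_seconds)
    raise last_error


# Stand-in fetcher that fails twice and then succeeds, so the sketch runs offline.
calls = []


def flaky_fetch(url):
    calls.append(url)
    if len(calls) < 3:
        raise ConnectionError("temporary failure")
    return "<html>ok</html>"


html = fetch_with_retry(flaky_fetch, "https://example.com/post", delay_seconds=0)
print(html, len(calls))
```

In the crawl itself, the fetcher could be something like `lambda u: scraper.get(u).text`, with a non-zero delay between attempts.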

In [None]:
# load the saved raw data back from raw_data_list.txt
with open('raw_data_list.txt') as f:
    lines = [line.rstrip('\n') for line in f]
raw_data_list = lines

In [31]:
len(raw_data_list)

500

In [28]:
# save the raw data list
with open('./text_file/raw_data_list.txt', 'w') as f:
    for line in raw_data_list:
        f.write(f"{line}\n")
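
Writing each record as plain text works, but reading it back gives strings rather than dictionaries. As an alternative sketch, assuming each record is a dictionary of strings like the output of soup_to_data_dict, the records could be stored as JSON Lines with an explicit UTF-8 encoding so that Vietnamese text survives the round trip. The helper names, the sample records, and the `.jsonl` file name are illustrative.

```python
import json
import os
import tempfile


def save_records(path, records):
    """Write one JSON object per line, keeping non-ASCII characters readable."""
    with open(path, "w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record, ensure_ascii=False) + "\n")


def load_records(path):
    """Read the file back into a list of dictionaries, skipping blank lines."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


# Made-up records using field names that appear in the scraped data.
records = [{"Mức giá": "5 tỷ", "Diện tích": "90 m²"}, {"Mức giá": "9 tỷ"}]
path = os.path.join(tempfile.mkdtemp(), "raw_data_list.jsonl")
save_records(path, records)
restored = load_records(path)
print(restored == records)
```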



In [12]:
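import ast  # lines loaded from the text file are dict strings, so parse them back first
df = pd.DataFrame([ast.literal_eval(r) if isinstance(r, str) else r for r in raw_data_list])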

In [46]:
#df.to_csv('raw_data.csv', sep='\t', encoding='utf-8')

In [48]:
#pd.read_csv('raw_data.csv', sep='\t', encoding='utf-8')

Unnamed: 0.1,Unnamed: 0,Diện tích,Mức giá,Hướng nhà,Hướng ban công,Số phòng ngủ,Số toilet,Pháp lý,Nội thất,productId:,...,Đường vào,Img 14,Img 15,Img 16,Img 17,Img 18,Img 19,Img 20,Img 21,Img 22
0,0,"69,9 m²",35 triệu/m²,Tây - Bắc,Đông - Nam,2 phòng,2 phòng,Hợp đồng mua bán,Cơ bản,37505276.0,...,,,,,,,,,,
1,1,"106,8 m²",5 tỷ,,,2 phòng,2 phòng,Sổ đỏ/ Sổ hồng,Đầy đủ,37423641.0,...,,,,,,,,,,
2,2,90 m²,4 tỷ,,,3 phòng,2 phòng,Sổ đỏ/ Sổ hồng,Đầy đủ,37558003.0,...,,,,,,,,,,
3,3,"30,1 m²","1,19 tỷ",,,1 phòng,1 phòng,Sổ đỏ.,Nội thất chuẩn dòng căn hộ Saphire Thiết bị vệ...,31392615.0,...,,,,,,,,,,
4,4,90 m²,"5,1 tỷ",,,3 phòng,2 phòng,Hợp đồng mua bán,Đầy đủ,37557599.0,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
495,495,,,,,,,,,37544297.0,...,,,,,,,,,,
496,496,,,,,,,,,37480387.0,...,,,,,,,,,,
497,497,75 m²,9 tỷ,,,4 phòng,4 phòng,Sổ đỏ/ Sổ hồng,Đầy đủ,37544272.0,...,,,,,,,,,,
498,498,161 m²,"6,5 tỷ",,,8 phòng,4 phòng,Sổ đỏ/ Sổ hồng,Không nội thất.,37455102.0,...,6 m,,,,,,,,,
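
Area values such as `69,9 m²` are stored as text with a decimal comma, so they need converting before any numeric work. The sketch below is one possible preprocessing step. The helper `parse_area` is hypothetical, and it assumes the comma is always a decimal separator and that no thousands separator is present.

```python
def parse_area(text):
    """Convert a string like '69,9 m²' to a float in square metres, or None if unparseable."""
    if not isinstance(text, str):
        return None  # missing cells (for example NaN) are not strings
    number = text.replace("m²", "").strip().replace(",", ".")
    try:
        return float(number)
    except ValueError:
        return None


areas = [parse_area(v) for v in ["69,9 m²", "90 m²", None, "n/a"]]
print(areas)
```

With pandas, the same function could be applied to a column, for example `df["Diện tích"].map(parse_area)`, assuming the column carries that name.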
