# Scraping Lianjia Data

In our paper "*Unfolding Beijing in a Hedonic Way*", we trained machine learning methods on a Beijing Lianjia transaction dataset to predict housing prices. The paper is available [here](https://github.com/zhentaoshi/Econ5821/blob/main/data_example/2022%20Lin%20Shi%20Wang%20Yan%20Computational_Economics.pdf).

Real transaction records used to be accessible on the Lianjia website. Unfortunately, due to policy restrictions that arose in 2021, just after we finished the paper, Lianjia no longer shows the transaction webpages to the public. As a consequence, the example webpage we provided in the paper, https://bj.lianjia.com/chengjiao/101084782030.html, is no longer available.

Here, for illustration purposes, we show how to scrape the on-sale second-hand property data (https://bj.lianjia.com/ershoufang/) from the Beijing Lianjia website, instead of the sold transaction data (https://bj.lianjia.com/chengjiao/).

First, we need to import some packages, where: 
- `selenium` fetches the webpage source code;
- `BeautifulSoup` parses the fetched HTML and searches it for the information we need.
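
To make the role of `BeautifulSoup` concrete, here is a minimal sketch of searching a parsed page for labeled values. The HTML fragment is made up for illustration and only assumes that each value follows a `<span class="label">` element; the real page layout may differ.

```python
from bs4 import BeautifulSoup

# Made-up HTML fragment for illustration; the real page layout may differ.
html = """
<ul>
  <li><span class="label">房屋户型</span>4室2厅1厨4卫</li>
  <li><span class="label">建筑面积</span>212.87㎡</li>
</ul>
"""

soup = BeautifulSoup(html, "html.parser")

fields = {}
for label in soup.find_all("span", class_="label"):
    # The value is the text node that directly follows the label span.
    value = label.next_sibling
    fields[label.get_text(strip=True)] = str(value).strip()

print(fields)  # {'房屋户型': '4室2厅1厨4卫', '建筑面积': '212.87㎡'}
```

No browser is involved here: `BeautifulSoup` only needs the HTML as a string, which is what `selenium` supplies for a live page.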

In [1]:
from selenium import webdriver
from bs4 import BeautifulSoup
import re
import pandas as pd

A web browser, e.g. **Firefox** or **Chrome**, is needed for `selenium` to open a webpage in the background. To call the browser, we need the corresponding **webdriver**. To download the webdrivers and learn more about them, see:
- Firefox: https://github.com/mozilla/geckodriver/releases; 
- Chrome: https://chromedriver.chromium.org/. 

Here, we use the Firefox webdriver as an example. 

In [2]:
options = webdriver.firefox.options.Options()
options.add_argument('-headless') # run the browser in the background; the `headless` attribute was removed in recent Selenium 4 releases

In [3]:
driver = webdriver.Firefox(options=options) # geckodriver (the Firefox webdriver) must be on the PATH (Windows)
driver.implicitly_wait(10) # wait up to 10 seconds when locating elements on the page

# If you want to use the Chrome webdriver, please uncomment and run the code in the following cell. 

# options = webdriver.chrome.options.Options()
# options.add_argument('--headless')
# driver = webdriver.Chrome(options=options)
# driver.implicitly_wait(10)
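
The driver created above keeps a browser process running until `driver.quit()` is called. As a minimal sketch, the helper below guarantees that call even if scraping fails midway. The name `managed_driver` and its `make_driver` argument are hypothetical and not part of `selenium`.

```python
from contextlib import contextmanager


@contextmanager
def managed_driver(make_driver):
    """Yield a driver built by `make_driver` and always quit it afterwards.

    `managed_driver` and `make_driver` are hypothetical names used for
    illustration; any object with a `quit()` method works.
    """
    driver = make_driver()
    try:
        yield driver
    finally:
        # quit() ends the session and closes the browser process
        driver.quit()


# Intended usage (needs selenium and a browser, so it is left commented out):
# with managed_driver(lambda: webdriver.Firefox(options=options)) as driver:
#     driver.get('https://bj.lianjia.com/ershoufang/')
```

Passing a factory rather than a ready-made driver keeps the browser launch inside the helper, so the launch and the cleanup live in one place.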

See an example target webpage here: https://bj.lianjia.com/ershoufang/101114772718.html. What we want is a program that, given a **url**, returns the information fields shown on the webpage, including the attributes of the house, its price, and its location. Here, we define a function `Scrape_Lianjia` for this purpose.

In [4]:
def Scrape_Lianjia(url, driver):
    
    # Fetch the page
    driver.get(url)
    
    # Load the page into BeautifulSoup
    soup = BeautifulSoup(driver.page_source, 'html.parser')
    
    # Initialize the container as a dictionary
    info = {}
    
    # Basic information
    labels = soup.find_all('span', class_='label')
    for label in labels:
        content = label.next_sibling
        while content.text.strip()=='':
            content = content.next_sibling
        info[label.text.strip()] = content.text.strip().replace('\xa0', ' ').replace('举报', '').replace('㎡', '平米')
    info.pop('风险提示', None) # drop the risk-notice field if present; the default avoids a KeyError when it is missing
    
    # Follower
    follow = soup.find('span', class_='count')
    info['关注人数'] = follow.text.strip()
    
    # Total Price
    price_total = soup.find('span', class_='total')
    price_unit = soup.find('span', class_='unit')
    info['总价'] = price_total.text.strip()+price_unit.text.strip()+'元'
    
    # Unit Price
    unitPrice = soup.find('span', class_='unitPriceValue')
    info['单价'] = unitPrice.text.strip()
    
    # Construction Year
    constructYear = soup.find('div', class_='subInfo noHidden')
    info['建成时间'] = re.findall(r'.+年建', constructYear.text.strip())[0]
    
    # Coordinates and Uniqueness
    scripts = soup.find_all('script')
    for script in scripts:
        isUnique = re.findall(r"isUnique:'\w+'", script.text.strip())
        coord = re.findall(r"resblockPosition:'\d+\.\d+,\d+\.\d+'", script.text.strip())
        if isUnique != []:
            isUnique = re.split("[':]", isUnique[0])
            info['是否唯一'] = isUnique[2]
        if coord != []:
            coord = re.split("[':,]", coord[0])
            info['Longitude'] = coord[2]
            info['Latitude'] = coord[3]
    
    return info
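
The indices `[2]` and `[3]` in the coordinate step come from how `re.split` treats adjacent delimiters. The sketch below reproduces that step on a made-up script string (the real page's JavaScript is longer and may differ), and shows an alternative with capture groups that avoids counting positions.

```python
import re

# Made-up script text imitating the fragments the function searches for.
script_text = "init({isUnique:'唯一住宅', resblockPosition:'116.53916,40.107599'});"

# Same two steps as in Scrape_Lianjia: find the fragment, then split it.
fragment = re.findall(r"resblockPosition:'\d+\.\d+,\d+\.\d+'", script_text)[0]
parts = re.split("[':,]", fragment)
# ':' is immediately followed by "'", which yields an empty string at index 1,
# so longitude and latitude land at indices 2 and 3.
print(parts)  # ['resblockPosition', '', '116.53916', '40.107599', '']

# Alternative sketch: capture groups return the two numbers directly.
match = re.search(r"resblockPosition:'(\d+\.\d+),(\d+\.\d+)'", script_text)
longitude, latitude = match.groups()
print(longitude, latitude)  # 116.53916 40.107599
```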

See how this program works on the example webpage. 

In [5]:
Scrape_Lianjia('https://bj.lianjia.com/ershoufang/101114772718.html', driver)

{'小区名称': '莫奈花园',
 '所在区域': '顺义 后沙峪 五至六环',
 '看房时间': '提前预约随时可看',
 '链家编号': '101114772718',
 '房屋户型': '4室2厅1厨4卫',
 '所在楼层': '中楼层 (共4层)',
 '建筑面积': '212.87平米',
 '户型结构': '复式',
 '套内面积': '194.33平米',
 '建筑类型': '板楼',
 '房屋朝向': '南 北',
 '建筑结构': '混合结构',
 '装修情况': '精装',
 '梯户比例': '一梯九户',
 '供暖方式': '自供暖',
 '配备电梯': '无',
 '挂牌时间': '2022-04-05',
 '交易权属': '商品房',
 '上次交易': '2017-01-23',
 '房屋用途': '普通住宅',
 '房屋年限': '满五年',
 '产权所属': '非共有',
 '抵押信息': '有抵押 350万元 工商银行 客户偿还',
 '房本备件': '已上传房本照片',
 '关注人数': '20',
 '总价': '930万元',
 '单价': '43689元/平米',
 '建成时间': '2005年建',
 '是否唯一': '唯一住宅',
 'Longitude': '116.53916',
 'Latitude': '40.107599'}
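
All values in the returned dictionary are strings, so numeric fields need converting before any modeling. The sketch below is one possible post-processing step, not part of the original notebook. The helper name `leading_number` is hypothetical, and the sample record copies a few values from the output above.

```python
import re


def leading_number(text):
    """Return the first decimal number found in `text` as a float, or None."""
    match = re.search(r"\d+(?:\.\d+)?", text)
    return float(match.group()) if match else None


# A few fields copied from the example output above.
record = {'建筑面积': '212.87平米', '总价': '930万元', '单价': '43689元/平米'}

area_sqm = leading_number(record['建筑面积'])      # floor area in square meters
total_price_wan = leading_number(record['总价'])   # total price in units of 10,000 yuan (万)
unit_price = leading_number(record['单价'])        # yuan per square meter

# Sanity check: total price divided by area should be close to the unit price.
implied_unit_price = total_price_wan * 10_000 / area_sqm
print(round(implied_unit_price), unit_price)
```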

The program works well. Since each url corresponds to one house on sale, the next question is how to get all of these urls. The lists of on-sale houses are shown in order on a series of webpages from https://bj.lianjia.com/ershoufang/pg1/ to https://bj.lianjia.com/ershoufang/pg100/, where `pg` stands for "page"; that is, there are 100 pages of house lists under https://bj.lianjia.com/ershoufang/. Here, as an example, we collect all housing urls on the first 5 pages. 

In [6]:
page_max = 5 # set the number of pages to search here

urls = []
for p in range(1, page_max+1):
    driver.get('https://bj.lianjia.com/ershoufang/pg'+str(p)+'/')
    soup = BeautifulSoup(driver.page_source, 'html.parser')
    items = soup.find_all('a', class_='noresultRecommend img LOGCLICKDATA')
    for item in items:
        urls.append(item['href'])

See what the list `urls` contains. 

In [7]:
urls[0:5]

['https://bj.lianjia.com/ershoufang/101115028158.html',
 'https://bj.lianjia.com/ershoufang/101114580929.html',
 'https://bj.lianjia.com/ershoufang/101115003779.html',
 'https://bj.lianjia.com/ershoufang/101113640862.html',
 'https://bj.lianjia.com/ershoufang/101111602994.html']
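
The list-page urls follow the `pg<number>` pattern described above, so they can be generated up front. The sketch below also removes repeated house urls while keeping their order, on the assumption (not verified in this notebook) that a listing may show up on more than one page if the ordering shifts between requests. Both function names are hypothetical.

```python
def list_page_urls(page_max, base='https://bj.lianjia.com/ershoufang/'):
    """Build the urls of list pages 1..page_max following the pg<number> pattern."""
    return [f'{base}pg{p}/' for p in range(1, page_max + 1)]


def dedupe_keep_order(items):
    """Drop repeated items while preserving first-seen order."""
    return list(dict.fromkeys(items))


print(list_page_urls(2))
# ['https://bj.lianjia.com/ershoufang/pg1/', 'https://bj.lianjia.com/ershoufang/pg2/']

sample = ['a.html', 'b.html', 'a.html']
print(dedupe_keep_order(sample))  # ['a.html', 'b.html']
```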

Finally, we just need to run `Scrape_Lianjia` on every url in the list `urls` and save the result. 

In [8]:
records = []
for url in urls:
    records.append(Scrape_Lianjia(url, driver))

# DataFrame.append was removed in pandas 2.0, so build the frame from the list of dictionaries
lianjia = pd.DataFrame(records)

lianjia.to_csv('lianjia.csv', index=False, encoding='utf-8')
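
A single page with an unexpected layout can raise an exception inside `Scrape_Lianjia` and stop the whole loop. The sketch below is one possible way to keep going: it records failures and pauses between requests. The names `scrape_all`, `scrape_fn`, and `delay_seconds` are hypothetical, and the one-second default is an arbitrary choice rather than a documented requirement of the site.

```python
import time


def scrape_all(urls, scrape_fn, delay_seconds=1.0):
    """Apply `scrape_fn` to each url, collecting results and failures separately."""
    records, failures = [], []
    for url in urls:
        try:
            records.append(scrape_fn(url))
        except Exception as exc:
            # Keep the url and the error so the page can be retried or inspected later.
            failures.append((url, repr(exc)))
        time.sleep(delay_seconds)  # pause between requests
    return records, failures


# Intended usage with the notebook's objects (needs a live driver, so commented out):
# records, failures = scrape_all(urls, lambda u: Scrape_Lianjia(u, driver))
# pd.DataFrame(records).to_csv('lianjia.csv', index=False, encoding='utf-8')
```

Returning the failures alongside the records makes it easy to see how many pages were skipped and why.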

See what we got! 

In [None]:
lianjia.head(5)