<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#前言" data-toc-modified-id="前言-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Preface</a></span></li><li><span><a href="#爬虫" data-toc-modified-id="爬虫-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Crawler</a></span></li><li><span><a href="#数据可视化" data-toc-modified-id="数据可视化-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Data Visualization</a></span><ul class="toc-item"><li><span><a href="#房源面积-总价散点图" data-toc-modified-id="房源面积-总价散点图-3.1"><span class="toc-item-num">3.1&nbsp;&nbsp;</span>Listing Area vs. Total Price Scatter Plot</a></span></li><li><span><a href="#各行政区均价" data-toc-modified-id="各行政区均价-3.2"><span class="toc-item-num">3.2&nbsp;&nbsp;</span>Average Price by District</a></span></li><li><span><a href="#均价最高的10个小区" data-toc-modified-id="均价最高的10个小区-3.3"><span class="toc-item-num">3.3&nbsp;&nbsp;</span>Top 10 Communities by Average Price</a></span></li><li><span><a href="#均价最高的10个地段" data-toc-modified-id="均价最高的10个地段-3.4"><span class="toc-item-num">3.4&nbsp;&nbsp;</span>Top 10 Locations by Average Price</a></span></li><li><span><a href="#户型分布" data-toc-modified-id="户型分布-3.5"><span class="toc-item-num">3.5&nbsp;&nbsp;</span>Layout Distribution</a></span></li><li><span><a href="#词云图" data-toc-modified-id="词云图-3.6"><span class="toc-item-num">3.6&nbsp;&nbsp;</span>Word Cloud</a></span></li></ul></li></ul></div>


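### Preface
**This project has two parts:**
* Crawler: scrapes second-hand housing listings for Shenzhen from Lianjia, 18,906 records in total, saved as shenzhen.csv;

* Data visualization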

### Crawler
* **Scrape the listings of each administrative district;**
* **Store the data in a DataFrame;**

In [12]:
from bs4 import BeautifulSoup  
import pandas as pd
from tqdm import tqdm
import math
import requests  
import lxml
import re
import time

In [13]:
area_dic = {'罗湖区':'luohuqu',
            '福田区':'futianqu',
            '南山区':'nanshanqu',
            '盐田区':'yantianqu',
            '宝安区':'baoanqu',
            '龙岗区':'longgangqu',
            '龙华区':'longhuaqu',
            '坪山区':'pingshanqu'}
# Debug override: crawl only Luohu District (remove this line to crawl all eight districts)
area_dic = {'罗湖区':'luohuqu'}

# Send browser-like headers as a courtesy to the site
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.146 Safari/537.36',
           'Referer': 'https://sz.lianjia.com/ershoufang/'}

# Create a session
sess = requests.session()
sess.get('https://sz.lianjia.com/ershoufang/', headers=headers)

# Example URL: https://sz.lianjia.com/ershoufang/luohuqu/pg2/
url = 'https://sz.lianjia.com/ershoufang/{}/pg{}/'

In [14]:
# Return a default value (errif) when the regular expression does not match
def re_match(re_pattern, string, errif=None):
    try:
        return re.findall(re_pattern, string)[0].strip()
    except IndexError:
        return errif
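
The helper above can be tried without touching the network. The following is a minimal sketch that restates `re_match` and runs it on a made-up HTML fragment; the `sample` string and the `subway` class are assumptions for illustration, not real Lianjia markup.

```python
import re

def re_match(re_pattern, string, errif=None):
    """Return the first match of re_pattern in string (stripped), or errif if there is none."""
    matches = re.findall(re_pattern, string)
    return matches[0].strip() if matches else errif

# Hypothetical fragment, loosely modelled on a listing card
sample = '<span class="taxfree"> 房本满五年 </span><span>45,400元/平</span>'

tax = re_match('class="taxfree">(.*?)</span>', sample)
unit_price = re_match('<span>(.*?)元/平</span>', sample)
# No such element in the sample, so the fallback value is returned
missing = re_match('class="subway">(.*?)</span>', sample, errif='n/a')

print(tax, unit_price, missing)
```

Returning a default instead of raising keeps one malformed listing from aborting a whole page; the caller decides how to treat the missing field.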

In [4]:
import time

# Collect one dict per listing, then build the DataFrame once at the end
rows = []

for key_, value_ in area_dic.items():
    # Get the number of listings in this district
    start_url = 'https://sz.lianjia.com/ershoufang/{}/'.format(value_)
    html = sess.get(start_url, headers=headers).text
    house_num = re.findall('共找到<span> (.*?) </span>套.*二手房', html)[0].strip()
    print('{}: {} second-hand listings in total'.format(key_, house_num))
    time.sleep(1)
    # The site caps each district at 100 pages, i.e. at most 3000 listings
    total_page = int(math.ceil(min(3000, int(house_num)) / 30.0))
    
    for i in tqdm(range(total_page), desc=key_):
        # print('Fetching', url.format(value_, i+1))
        html = sess.get(url.format(value_, i+1), headers=headers).text
        soup = BeautifulSoup(html, 'lxml')
        info_collect = soup.find_all(class_="info clear")
        
        for info in info_collect:
            info_dic = {}
            # District
            info_dic['area'] = key_
            # Listing title
            info_dic['title'] = re_match('target="_blank">(.*?)</a><!--', str(info))
            # Community (housing estate) name
            info_dic['community'] = re_match('xiaoqu.*?target="_blank">(.*?)</a>', str(info))
            # Location
            info_dic['position'] = re_match('<a href.*?target="_blank">(.*?)</a>.*?class="address">', str(info))
            # Tax-related tag, e.g. property certificate held for 5+ years
            info_dic['tax'] = re_match('class="taxfree">(.*?)</span>', str(info))
            # Total price (unit: 万元, i.e. 10k CNY); None if missing or not numeric
            try:
                info_dic['total_price'] = float(re_match('<span class="">(.*?)</span><i>万', str(info)))
            except (TypeError, ValueError):
                info_dic['total_price'] = None
            # Unit price (CNY per m²); None if missing or not numeric
            try:
                info_dic['unit_price'] = float(re_match('<span>(.*?)元/平</span>', str(info)).replace(',',''))
            except (AttributeError, ValueError):
                info_dic['unit_price'] = None
            
            # Match the listing tag line and split it on '|'
            # It holds the layout, floor area, orientation, decoration, etc.
            icons = re.findall('class="houseIcon"></span>(.*?)</div>', str(info))[0].strip().split('|')
            info_dic['hourseType'] = icons[0].strip()
            info_dic['hourseSize'] = float(icons[1].replace('平米', ''))
            info_dic['direction'] = icons[2].strip()
            info_dic['fitment'] = icons[3].strip()
            
            rows.append(info_dic)

# Store everything in a DataFrame
data = pd.DataFrame(rows)

罗湖区: 5372 second-hand listings in total


罗湖区: 100%|█████████████████████████████████████████████████████████████████████████████| 100/100 [01:02<00:00,  1.61it/s]
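
The progress bar stops at 100 pages even though Luohu has 5372 listings, because the crawl loop caps each district at 3000 listings of 30 per page. A small sketch of that arithmetic follows; `page_count` and its parameter names are illustrative and do not appear in the notebook.

```python
import math

def page_count(house_num, per_page=30, max_pages=100):
    """Number of result pages to request for a district, given the site's page cap."""
    # Listings beyond per_page * max_pages are not reachable through pagination
    capped = min(int(house_num), per_page * max_pages)
    return math.ceil(capped / per_page)

print(page_count("5372"))  # large district: limited by the page cap
print(page_count(45))      # small district: a partial last page still counts
```

Anything beyond the cap is silently missed, so the totals for large districts are a sample rather than the full market.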


In [5]:
# Drop the one record whose listed floor area is 10,000 m² or more
data = data[data['hourseSize'] < 10000]
data.head()

Unnamed: 0,area,title,community,position,tax,total_price,unit_price,hourseType,hourseSize,direction,fitment
0,罗湖区,国贸商圈，正规大两房，70年产权，红本满5年，有钥匙,天俊大厦,春风路,房本满五年,381.0,45400.0,2室2厅,83.77,东南,精装
1,罗湖区,中海天钻大社区 绿化好 正对小区园林景观 只交契税,中海天钻,万象城,房本满五年,1039.0,85600.0,3室2厅,121.36,南,精装
2,罗湖区,地铁口物业，精装修，高层采光好，带大露台养花,兴业大厦,春风路,,358.0,42000.0,3室2厅,85.18,东南,精装
3,罗湖区,自如的装修，业主1998年买回来，价格好谈,庐山花园,罗湖口岸,房本满五年,132.0,36700.0,1室0厅,35.85,西南,精装
4,罗湖区,国贸精装一房一厅，业主诚心出售。看房随时欢迎,银座金钻,春风路,房本满五年,183.0,48600.0,1室1厅,37.53,南,精装


In [14]:
# Save the data
# data.to_csv('./shenzhen2.csv')

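### Data Visualization
* Scatter plot of listing area vs. total price
* Average price by district
* Top 10 communities by average price
* Top 10 locations by average price
* Layout distribution
* Word cloud of listing titles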

In [19]:
from pyecharts.charts import Bar, Map, Pie, Scatter, WordCloud
from pyecharts import options as opts
from pyecharts.commons.utils import JsCode
from jieba import posseg as psg
import collections
import pandas as pd

In [20]:
from pyecharts.globals import CurrentConfig, NotebookType
CurrentConfig.NOTEBOOK_TYPE = NotebookType.JUPYTER_NOTEBOOK

In [21]:
data = pd.read_csv('./shenzhen.csv')
data = data.drop(columns=['Unnamed: 0'])
data.head()

Unnamed: 0,area,title,community,position,tax,total_price,unit_price,hourseType,hourseSize,direction,fitment
0,罗湖区,满五红本， 户型方正朝南，自住装修保养好，花园社区,金城华庭,螺岭,房本满五年,710.0,79552.0,3室1厅,89.25,南,精装
1,罗湖区,7号线洪湖站前59万平洪湖公园后京基水贝*2个万象城,洪湖东岸,翠竹,房本满五年,408.0,54736.0,3室1厅,74.54,西,精装
2,罗湖区,《供电南苑。复式三层四房户型》万象城，地理位置优越,供电南苑,万象城,房本满五年,850.0,67649.0,4室1厅,125.65,西,简装
3,罗湖区,不用明额 满两年红本 高层东南三房 有钥匙随时可看,翡翠公寓,翠竹,房本满五年,360.0,60627.0,3室2厅,59.38,南,精装
4,罗湖区,都市名园 2室1厅 370万,都市名园,万象城,,370.0,48259.0,2室1厅,76.67,东北,简装


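#### Listing Area vs. Total Price Scatter Plot
* 💥 The most expensive listing is in Xicheng (曦城), Bao'an District, with a total price of 88 million CNY (8,800万);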

In [22]:
scatter = (Scatter(init_opts=opts.InitOpts(theme='dark'))
           # pyecharts expects native Python types, so convert the pandas Series to lists
           .add_xaxis(data['hourseSize'].tolist())
           .add_yaxis("Price", data['total_price'].tolist())
           .set_series_opts(label_opts=opts.LabelOpts(is_show=False),
                           markpoint_opts=opts.MarkPointOpts(data=[opts.MarkPointItem(type_="max", name="Max"),]))
           .set_global_opts(
               legend_opts=opts.LegendOpts(is_show=False),
               title_opts=opts.TitleOpts(title="Shenzhen second-hand housing: total price vs. area"),
               xaxis_opts=opts.AxisOpts(
                   name='Area (m²)',
                   # Treat the axis as numeric
                   type_="value", 
                   # Hide the split lines
                   splitline_opts=opts.SplitLineOpts(is_show=False)),
               yaxis_opts=opts.AxisOpts(
                   name='Total price (10k CNY)',
                   name_location='middle',
                   # Treat the axis as numeric
                   type_="value",
                   # The default (False) forces the axis to start at 0; True lets it fit the data
                   is_scale=True,
                   splitline_opts=opts.SplitLineOpts(is_show=False),),
               visualmap_opts=opts.VisualMapOpts(is_show=True, type_='color', min_=100, max_=1000)
    ))



scatter.render_notebook() 

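#### Average Price by District
* 🗣**Nanshan District is the most expensive, averaging about 92,000 CNY/m² overall;**

* 🗣 Pingshan District is the cheapest, averaging about 35,000 CNY/m²;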

In [23]:
temp = data.groupby(['area'])['unit_price'].mean().reset_index()
data_pair = [(row['area'], round(row['unit_price']/10000, 1)) for _, row in temp.iterrows()]

map_ = (Map(init_opts=opts.InitOpts(theme='dark'))
        # '深圳' is the map name and must stay in Chinese, as must the district names in data_pair
        .add("Average price", data_pair, '深圳', is_roam=False)
        .set_series_opts(label_opts=opts.LabelOpts(is_show=True))
        .set_global_opts(
            title_opts=opts.TitleOpts(title="Average second-hand housing price by district, Shenzhen"),
            legend_opts=opts.LegendOpts(is_show=False),
            tooltip_opts=opts.TooltipOpts(formatter='{b}: {c} (10k CNY/m²)'),
            visualmap_opts=opts.VisualMapOpts(min_=3, max_=10)
        )
       )

        
# map_.render_notebook()
map_.render('map.html')


#### Top 10 Communities by Average Price

* **🚫 A community is counted only if it has at least 3 listings for sale**

In [24]:
temp = data.groupby(['community'])['unit_price'].agg(['mean', 'count']).reset_index()

# Count a community only if it has at least 3 listings for sale
data_pair = sorted([(row['community'], round(row['mean']/10000, 1))
                    for _, row in temp.iterrows() if row['count'] >= 3],
                   key=lambda x: x[1], reverse=True)[:10]

bar = (Bar(init_opts=opts.InitOpts(theme='dark'))
       # Reverse the order so the most expensive community ends up at the top of the horizontal chart
       .add_xaxis([x[0] for x in data_pair[::-1]])
       .add_yaxis('Average price', [x[1] for x in data_pair[::-1]])
       .set_series_opts(label_opts=opts.LabelOpts(is_show=True, 
                                                       position='insideRight',
                                                       font_style='italic'),
                            itemstyle_opts=opts.ItemStyleOpts(
                                color=JsCode("""new echarts.graphic.LinearGradient(1, 0, 0, 0, 
                                             [{
                                                 offset: 0,
                                                 color: 'rgb(0,206,209)'
                                             }, {
                                                 offset: 1,
                                                 color: 'rgb(218,165,32)'
                                             }])"""))
                            )
       .set_global_opts(
           title_opts=opts.TitleOpts(title="Top 10 communities by average price, Shenzhen second-hand housing"),
           legend_opts=opts.LegendOpts(is_show=False),
           tooltip_opts=opts.TooltipOpts(formatter='{b}: {c} (10k CNY/m²)'),
           xaxis_opts=opts.AxisOpts(min_=14),
       )
       .reversal_axis()
      )

bar.render_notebook()
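
The "at least 3 listings" rule can also be checked without pandas. Below is a minimal standard-library sketch on made-up records; `top_communities`, the community names and the prices are all assumptions for illustration.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical (community, unit price in CNY/m²) records
listings = [
    ("A", 100000), ("A", 110000), ("A", 120000),
    ("B", 200000),                      # only one listing, so it is excluded
    ("C", 90000), ("C", 95000), ("C", 100000),
]

def top_communities(records, min_count=3, top_n=10):
    """Rank communities by mean unit price (in 10k CNY), skipping thinly listed ones."""
    prices = defaultdict(list)
    for community, unit_price in records:
        prices[community].append(unit_price)
    averages = [(name, round(mean(vals) / 10000, 1))
                for name, vals in prices.items() if len(vals) >= min_count]
    return sorted(averages, key=lambda pair: pair[1], reverse=True)[:top_n]

ranking = top_communities(listings)
print(ranking)
```

Without the minimum count, a single luxury listing such as "B" would top the ranking on the strength of one data point.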

#### Top 10 Locations by Average Price

In [55]:
temp = data.groupby(['position'])['unit_price'].mean().reset_index()
data_pair = sorted([(row['position'], round(row['unit_price']/10000, 1))
                    for _, row in temp.iterrows()], key=lambda x: x[1], reverse=True)[:10]

bar = (Bar(init_opts=opts.InitOpts(theme='dark'))
       .add_xaxis([x[0] for x in data_pair])
       .add_yaxis('Average price', [x[1] for x in data_pair])
       .set_series_opts(label_opts=opts.LabelOpts(is_show=True, font_style='italic'),
                            itemstyle_opts=opts.ItemStyleOpts(
                                color=JsCode("""new echarts.graphic.LinearGradient(0, 1, 0, 0, 
                                             [{
                                                 offset: 0,
                                                 color: 'rgb(0,206,209)'
                                             }, {
                                                 offset: 1,
                                                 color: 'rgb(218,165,32)'
                                             }])"""))
                            )
       .set_global_opts(
           title_opts=opts.TitleOpts(title="Top 10 locations by average price, Shenzhen second-hand housing"),
           legend_opts=opts.LegendOpts(is_show=False),
           tooltip_opts=opts.TooltipOpts(formatter='{b}: {c} (10k CNY/m²)'))
      )

bar.render_notebook()

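#### Layout Distribution

* Three-bedroom units remain the mainstay;

* In a city where land is as scarce and expensive as in Shenzhen, small units also make up a large share;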

In [56]:
temp = data.groupby(['hourseType'])['area'].count().reset_index()
data_pair = sorted([(row['hourseType'], row['area'])
                    for _, row in temp.iterrows()], key=lambda x: x[1], reverse=True)[:10]

pie = (Pie(init_opts=opts.InitOpts(theme='dark'))
       .add('', data_pair,
            radius=["30%", "75%"],
            rosetype="radius")
       .set_global_opts(title_opts=opts.TitleOpts(title="Shenzhen second-hand housing: layout distribution"),
                       legend_opts=opts.LegendOpts(is_show=False),)
       .set_series_opts(label_opts=opts.LabelOpts(formatter="{b}: {d}%"))
      )

pie.render_notebook()

#### Word Cloud

* Let's see which words appear most often in listing titles

In [57]:
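word_list = []
# Very common words that carry little information in listing titles
stop_words = ['花园','业主','出售']
# Join all titles into one string, skipping missing values
string = ''.join([i for i in data['title'] if isinstance(i, str)])

# Segment the text into words with part-of-speech tags
words = psg.cut(string)
for x in words:
    # Skip single characters, numerals ('m'), punctuation/non-words ('x') and stop words
    if len(x.word) > 1 and x.flag not in ('m', 'x') and x.word not in stop_words:
        word_list.append(x.word)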



In [58]:
data_pair = collections.Counter(word_list).most_common(100)


wc = (WordCloud()
      .add("", data_pair, word_size_range=[20, 100], shape='triangle')
      .set_global_opts(title_opts=opts.TitleOpts(title="Word cloud of listing titles"))
    )

wc.render_notebook()
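
The filtering and counting steps above can be tried without jieba installed. This sketch feeds hand-written (word, flag) pairs through the same three filters; the pairs and their flags are made up to stand in for `psg.cut` output, and `keep` is an illustrative helper name.

```python
from collections import Counter

# Hypothetical (word, part-of-speech flag) pairs standing in for jieba.posseg output
tagged = [("地铁", "n"), ("口", "n"), ("精装", "v"), ("3", "m"), ("，", "x"),
          ("花园", "n"), ("地铁", "n"), ("精装", "v"), ("地铁", "n")]
stop_words = {"花园", "业主", "出售"}

def keep(word, flag):
    """Same filters as the notebook: no single characters, numerals, punctuation or stop words."""
    return len(word) > 1 and flag not in ("m", "x") and word not in stop_words

counts = Counter(word for word, flag in tagged if keep(word, flag))
# (word, frequency) pairs, the shape WordCloud.add() receives as data_pair
top = counts.most_common(2)
print(top)
```

Using a set for `stop_words` makes each membership test constant time, which matters little here but helps once the stop-word list grows.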