# Chinese Provincial Official Appointments Project
Dec 13, 2023 Yuqi Cheng

* Use BeautifulSoup (garbled-character error?) / **Playwright** to scrape data from multiple web pages and save it to CSV.
* Use regex (or maybe other NLP tools) to **extract geographic info**, then **explode** into multiple rows.
* Use **googletrans** (ReadTimeout error?) to translate the titles from Chinese to English.
* Use tqdm to show a **progress bar**.
* Use pandas to group by province and generate HTML for each province.
* Decide the key value for mapping.
* Merge the properties dataframe with the shape file and save as JSON.
* Modify the HTML template to color by rating.
* And then dive into infinite debugging for **potential improvements**!

## Scraping
Data from: http://district.ce.cn/zt/rwk/rw/sbj/

In [3]:
url = 'http://district.ce.cn/zt/rwk/rw/sbj/'
url_list = [url]
for i in range(33):
    url_list.append(url + f'index_{i+1}.shtml')

In [26]:
from bs4 import BeautifulSoup
import requests
import re
import pandas as pd
import time

In [34]:
from playwright.async_api import async_playwright
playwright = await async_playwright().start()
browser = await playwright.chromium.launch(headless=False)
page = await browser.new_page()

**What I got when using BeautifulSoup:**

"Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER."

See debug.ipynb for details.
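
That warning usually means the raw bytes were decoded with the wrong charset. Below is a minimal sketch of one possible workaround, assuming the page is served in a GB-family encoding (an assumption, not checked against the site): try strict UTF-8 first, then fall back to GB18030, and only then pass the text to BeautifulSoup. `decode_html` is a hypothetical helper, not part of any library.

```python
def decode_html(raw: bytes, encodings=("utf-8", "gb18030")):
    """Return (text, encoding) using the first encoding that decodes cleanly."""
    for encoding in encodings:
        try:
            return raw.decode(encoding), encoding
        except UnicodeDecodeError:
            continue
    # Last resort: keep going, marking undecodable bytes with U+FFFD
    return raw.decode(encodings[0], errors="replace"), encodings[0]


# Simulated response body: a title encoded as GBK rather than UTF-8
sample = "<p>张巍任河南省委常委</p>".encode("gbk")
text, used = decode_html(sample)
print(used, text)
```

With requests, the helper would be applied to `response.content`. Playwright sidesteps the problem because the browser decodes the page before `page.content()` returns it.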

In [35]:
records = []
for page_url in url_list:
    # The requests + BeautifulSoup route produced the decoding warning above:
    # page = requests.get(page_url)
    # doc_soup = BeautifulSoup(page.content, "html.parser")
    await page.goto(page_url)
    html = await page.content()
    # Name the parser explicitly so the result does not depend on which parsers are installed
    doc_soup = BeautifulSoup(html, "html.parser")
    for item in doc_soup.find_all(class_='f1'):
        dic = {}
        dic['title'] = item.a.text
        # Links on the page are relative ('../...'), so rebuild the absolute URL
        dic['link'] = "http://district.ce.cn/" + item.a['href'].replace('../', '')
        dic['date'] = item.find_next_sibling().text.replace('[', '').replace(']', '')
        records.append(dic)
print(len(records))
records[0]

1000


{'title': '张巍任河南省委常委、省纪委书记',
 'link': 'http://district.ce.cn/newarea/sddy/202312/09/t20231209_38823329.shtml',
 'date': '2023/12/09'}

In [38]:
df = pd.DataFrame(records)
df.head()

| | title | link | date |
|---|---|---|---|
| 0 | 张巍任河南省委常委、省纪委书记 | http://district.ce.cn/newarea/sddy/202312/09/t... | 2023/12/09 |
| 1 | 黑龙江省委常委张巍调任河南省委常委 | http://district.ce.cn/newarea/sddy/202312/09/t... | 2023/12/09 |
| 2 | 石谋军任甘肃省委副书记 | http://district.ce.cn/newarea/sddy/202312/08/t... | 2023/12/08 |
| 3 | 刘宇辉任国务院副秘书长 此前担任北京市副市长 | http://district.ce.cn/newarea/sddy/202312/08/t... | 2023/12/08 |
| 4 | 陈辐宽任天津市委副书记 | http://district.ce.cn/newarea/sddy/202312/07/t... | 2023/12/07 |


In [39]:
df.to_csv("Appointment1000.csv", index=False)

## Extract geography info
Why not try some cool Natural Language Processing tools?

Well, I tried, and decided it's fine to build the 34-province list manually.
* translate + nltk: can't recognize Chinese provinces
* HanLP (for Chinese): only for Python 3.7

But if I need to identify cities as well, it's worth another try.

In [1]:
import pandas as pd
df = pd.read_csv("Appointment1000.csv")

In [3]:
chinese_provinces = [
    '北京', '天津', '河北', '山西', '内蒙古',
    '辽宁', '吉林', '黑龙江', '上海', '江苏',
    '浙江', '安徽', '福建', '江西', '山东',
    '河南', '湖北', '湖南', '广东', '广西',
    '海南', '重庆', '四川', '贵州', '云南',
    '西藏', '陕西', '甘肃', '青海', '宁夏',
    '新疆', '台湾', '香港', '澳门'
]

In [4]:
df['province'] = df['title'].apply(lambda x: [province for province in chinese_provinces if province in x])

In [5]:
df_exploded = df.explode('province').reset_index(drop=True)
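
The substring check above can also be written with a regex, as mentioned in the overview. This is a minimal sketch, not the notebook's actual method: `PROVINCES` is shortened for illustration, and `extract_provinces` is a hypothetical helper. Unlike the list comprehension, it returns provinces in the order they appear in the title, and this plain-Python "explode" drops titles with no match, whereas `df.explode` keeps them with a missing province.

```python
import re

# Shortened list for illustration; the notebook uses all 34 names
PROVINCES = ["河南", "河北", "黑龙江", "甘肃", "北京"]

# Longest names first, so a longer name wins over a shorter one sharing a prefix
PATTERN = re.compile(
    "|".join(re.escape(p) for p in sorted(PROVINCES, key=len, reverse=True))
)


def extract_provinces(title):
    """Return the provinces named in a title, in order of appearance, without repeats."""
    return list(dict.fromkeys(PATTERN.findall(title)))


titles = ["张巍任河南省委常委、省纪委书记", "黑龙江省委常委张巍调任河南省委常委"]

# One output row per (title, province) pair, mirroring what explode produces
rows = [{"title": t, "province": p} for t in titles for p in extract_provinces(t)]
for row in rows:
    print(row)
```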

In [6]:
df_exploded.head()

| | title | link | date | province |
|---|---|---|---|---|
| 0 | 张巍任河南省委常委、省纪委书记 | http://district.ce.cn/newarea/sddy/202312/09/t... | 2023/12/09 | 河南 |
| 1 | 黑龙江省委常委张巍调任河南省委常委 | http://district.ce.cn/newarea/sddy/202312/09/t... | 2023/12/09 | 黑龙江 |
| 2 | 黑龙江省委常委张巍调任河南省委常委 | http://district.ce.cn/newarea/sddy/202312/09/t... | 2023/12/09 | 河南 |
| 3 | 石谋军任甘肃省委副书记 | http://district.ce.cn/newarea/sddy/202312/08/t... | 2023/12/08 | 甘肃 |
| 4 | 刘宇辉任国务院副秘书长 此前担任北京市副市长 | http://district.ce.cn/newarea/sddy/202312/08/t... | 2023/12/08 | 北京 |


In [7]:
df_exploded.to_csv("df_exploded.csv", index=False)

In [10]:
df_exploded.shape

(1033, 4)

## Translate to English
Improvement: add a button to switch languages.

Improvement: sometimes the translation of Chinese names is ridiculous!

* Error: ReadTimeout: The read operation timed out

  Processing the titles in separate chunks works, so that's fine.

* Translation is slow

  You may want to add a progress bar using tqdm.
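
Two other ways to soften both problems are to retry a failed call after a pause and to translate each distinct title only once (exploded rows repeat titles). The sketch below shows one way to do that; `with_retry` and `translate_unique` are hypothetical helpers, and the flaky function stands in for the real translator so the example runs offline. Which exception googletrans actually raises on a timeout is not assumed here, so the wrapper retries on any exception by default.

```python
import time


def with_retry(fn, retries=3, base_delay=1.0, retry_on=(Exception,)):
    """Wrap fn so that failures are retried with exponentially growing pauses."""
    def wrapper(text):
        for attempt in range(retries):
            try:
                return fn(text)
            except retry_on:
                if attempt == retries - 1:
                    raise  # out of attempts: surface the original error
                time.sleep(base_delay * 2 ** attempt)
    return wrapper


def translate_unique(titles, translate_fn):
    """Translate each distinct title once and reuse the result for repeats."""
    cache = {}
    for title in titles:
        if title not in cache:
            cache[title] = translate_fn(title)
    return [cache[title] for title in titles]


# Stand-in translator: fails on its first call, then "translates" by upper-casing
calls = {"n": 0}

def flaky_translate(text):
    calls["n"] += 1
    if calls["n"] == 1:
        raise TimeoutError("simulated ReadTimeout")
    return text.upper()


safe_translate = with_retry(flaky_translate, retries=3, base_delay=0)
result = translate_unique(["a", "b", "a"], safe_translate)
print(result, calls["n"])
```

In the notebook, `translate_text` would take the place of `flaky_translate`, with a non-zero `base_delay`.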

In [8]:
from googletrans import Translator

def translate_text(text, target_language='en'):
    translator = Translator()
    translation = translator.translate(text, dest=target_language)
    return translation.text

In [13]:
from tqdm import tqdm

chunks = []
# Translate the first 1,000 rows in 10 chunks of 100 to avoid the ReadTimeout error
for i in tqdm(range(10), desc="Processing"):
    chunk = df_exploded['title'][i*100:i*100+100].apply(lambda x: translate_text(x, target_language='en'))
    chunks.append(chunk)

Processing: 100%|███████████████████████████████| 10/10 [13:52<00:00, 83.20s/it]


In [15]:
# Translate the remaining rows (index 1000 onward)
chunk = df_exploded['title'][1000:].apply(lambda x: translate_text(x, target_language='en'))
chunks.append(chunk)

In [21]:
import numpy as np
# Stitch the translated chunks back into one array, in the original row order
smart_list = np.concatenate(chunks)
len(smart_list)

1033
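
The cells above hard-code ten chunks plus a separate cell for the remainder. A minimal sketch of deriving the chunk bounds from the row count instead, so the same loop covers any table size (`iter_chunks` is a hypothetical helper):

```python
def iter_chunks(n_rows, chunk_size=100):
    """Yield (start, stop) index pairs that cover range(n_rows) in order."""
    for start in range(0, n_rows, chunk_size):
        yield start, min(start + chunk_size, n_rows)


bounds = list(iter_chunks(1033))
print(len(bounds), bounds[0], bounds[-1])
```

Each pair could then be used as `df_exploded['title'][start:stop]` inside the tqdm loop.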

In [22]:
smart_list[0:10]

array(['Zhang Wei is the member of the Standing Committee of the Henan Provincial Party Committee, Secretary of the Provincial Discipline Inspection Commission',
       'Zhang Wei, member of the Standing Committee of Heilongjiang Provincial Party Committee, was transferred to the Standing Committee of the Henan Provincial Party Committee',
       'Zhang Wei, member of the Standing Committee of Heilongjiang Provincial Party Committee, was transferred to the Standing Committee of the Henan Provincial Party Committee',
       'Stone Mouzun as Deputy Secretary of the Gansu Provincial Party Committee',
       'Liu Yuhui served as Deputy Secretary -General of the State Council.',
       'Chen Tiekuan is the deputy secretary of the Tianjin Municipal Party Committee',
       'Cheng Lihua is the secretary of the party group of Chongqing CPPCC',
       'Wu Haitao no longer serves as the deputy governor of Hubei Province as a member of the Standing Committee of the Provincial Party Committee and 

In [24]:
df_exploded['en'] = smart_list

In [25]:
df_exploded.head()

| | title | link | date | province | en |
|---|---|---|---|---|---|
| 0 | 张巍任河南省委常委、省纪委书记 | http://district.ce.cn/newarea/sddy/202312/09/t... | 2023/12/09 | 河南 | Zhang Wei is the member of the Standing Commit... |
| 1 | 黑龙江省委常委张巍调任河南省委常委 | http://district.ce.cn/newarea/sddy/202312/09/t... | 2023/12/09 | 黑龙江 | Zhang Wei, member of the Standing Committee of... |
| 2 | 黑龙江省委常委张巍调任河南省委常委 | http://district.ce.cn/newarea/sddy/202312/09/t... | 2023/12/09 | 河南 | Zhang Wei, member of the Standing Committee of... |
| 3 | 石谋军任甘肃省委副书记 | http://district.ce.cn/newarea/sddy/202312/08/t... | 2023/12/08 | 甘肃 | Stone Mouzun as Deputy Secretary of the Gansu ... |
| 4 | 刘宇辉任国务院副秘书长 此前担任北京市副市长 | http://district.ce.cn/newarea/sddy/202312/08/t... | 2023/12/08 | 北京 | Liu Yuhui served as Deputy Secretary -General ... |


In [26]:
df_exploded.to_csv("df_exploded.csv", index=False)

## Group by provinces, and generate html
Improvement: separate name, province & city, and position.

Improvement: remove duplicate rows.

For example, these two rows report the same appointment:
* **Zhang Wei** is the member of **the Standing Committee of the Henan Provincial Party Committee**, Secretary of the Provincial Discipline Inspection Commission.
* **Zhang Wei**, member of the Standing Committee of Heilongjiang Provincial Party Committee, was transferred to **the Standing Committee of the Henan Provincial Party Committee**.

In [27]:
df2 = df_exploded.groupby('province')[['title', 'en', 'link', 'date']].apply(lambda x: x.to_dict(orient='records')).reset_index(name='content')
df2.head()

| | province | content |
|---|---|---|
| 0 | 上海 | `[{'title': '上海市委副书记吴清兼任政法委书记', 'en': 'Wu Qing,...` |
| 1 | 云南 | `[{'title': '刘非任云南省委组织部部长', 'en': 'Liu Fei is t...` |
| 2 | 内蒙古 | `[{'title': '内蒙古自治区政协副主席杨劼兼任教育厅厅长', 'en': 'Yang...` |
| 3 | 北京 | `[{'title': '刘宇辉任国务院副秘书长 此前担任北京市副市长', 'en': 'Li...` |
| 4 | 吉林 | `[{'title': '韩福春任吉林省委常委', 'en': 'Han Fuchun ser...` |


In [47]:
df2.to_csv("groupby_province.csv",index=False)

In [31]:
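# Translate each province name once; avoid naming the variable `list`, which shadows the built-in
province_names_en = []
for province in df2['province']:
    province_names_en.append(translate_text(province, target_language='en'))

In [32]:
df2['province_en'] = province_names_en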

In [40]:
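def convert_article(items):
    # Build one <p> per appointment: the linked date followed by the English title
    article = ""
    for item in items:
        article += f" <p><a href='{item['link']}'>{item['date']}</a>   {item['en']}</p>"
    return article
# Wrap each province's entries in a div headed by its English name, and close the div
df2['properties.article'] = df2['province_en'].apply(lambda x: f"<div class='appointment'><h1><b>{x}</b></h1>") + df2['content'].apply(convert_article) + "</div>"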

In [41]:
df2.head(1)

| | province | content | province_en | properties.article |
|---|---|---|---|---|
| 0 | 上海 | `[{'title': '上海市委副书记吴清兼任政法委书记', 'en': 'Wu Qing,...` | Shanghai | `<div class='appointment'><h1><b>Shanghai</b></...` |


In [42]:
df2 = df2.rename(columns={'province_en': 'properties.headline'})
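
The article HTML is assembled from scraped and machine-translated text, so characters such as `&` or `<` in a title or link could break the markup. Here is a minimal sketch of the same structure built with the standard library's `html.escape`; `build_article` is a hypothetical helper and the example.com link is a placeholder, not real data.

```python
from html import escape


def build_article(headline, items):
    """Return an HTML fragment: a heading plus one linked paragraph per item."""
    parts = [f"<div class='appointment'><h1><b>{escape(headline)}</b></h1>"]
    for item in items:
        parts.append(
            f"<p><a href='{escape(item['link'], quote=True)}'>{escape(item['date'])}</a> "
            f"{escape(item['en'])}</p>"
        )
    parts.append("</div>")
    return "".join(parts)


items = [
    {"link": "http://example.com/a?x=1&y=2", "date": "2023/12/09", "en": "Zhang Wei <test>"}
]
html_out = build_article("Henan", items)
print(html_out)
```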

## Decide the key value for mapping
I chose to color by rating, where a province's rating is the number of appointments reported there during the time period. The color therefore represents how frequently officials change in a province.

Ratings mostly fall between about 20 and 40, so I set the rating colors in the HTML template to cover that range.
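
To illustrate the idea, here is a sketch of binning a rating into a color. The thresholds, the color values, and the helper `rating_to_color` are all assumptions made up for this example; the actual values live in the HTML template, which is not shown here.

```python
from bisect import bisect_right

# Hypothetical thresholds and colors, chosen to span the 20-40 range
THRESHOLDS = [20, 25, 30, 35]
COLORS = ["#fee5d9", "#fcae91", "#fb6a4a", "#de2d26", "#a50f15"]
NO_DATA_COLOR = "#cccccc"  # for provinces with no appointments in the data


def rating_to_color(rating):
    """Map a rating to a color bucket; None means the province has no data."""
    if rating is None:
        return NO_DATA_COLOR
    # bisect_right counts how many thresholds are <= rating
    return COLORS[bisect_right(THRESHOLDS, rating)]


for rating in (19, 21, 36, None):
    print(rating, rating_to_color(rating))
```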

In [43]:
df2['properties.rating'] = df2['content'].apply(lambda x: len(x))

In [44]:
df2.head()

| | province | content | properties.headline | properties.article | properties.rating |
|---|---|---|---|---|---|
| 0 | 上海 | `[{'title': '上海市委副书记吴清兼任政法委书记', 'en': 'Wu Qing,...` | Shanghai | `<div class='appointment'><h1><b>Shanghai</b></...` | 36 |
| 1 | 云南 | `[{'title': '刘非任云南省委组织部部长', 'en': 'Liu Fei is t...` | Yunnan | `<div class='appointment'><h1><b>Yunnan</b></h1...` | 21 |
| 2 | 内蒙古 | `[{'title': '内蒙古自治区政协副主席杨劼兼任教育厅厅长', 'en': 'Yang...` | Inner Mongolia | `<div class='appointment'><h1><b>Inner Mongolia...` | 23 |
| 3 | 北京 | `[{'title': '刘宇辉任国务院副秘书长 此前担任北京市副市长', 'en': 'Li...` | Beijing | `<div class='appointment'><h1><b>Beijing</b></h...` | 26 |
| 4 | 吉林 | `[{'title': '韩福春任吉林省委常委', 'en': 'Han Fuchun ser...` | Jilin | `<div class='appointment'><h1><b>Jilin</b></h1>...` | 32 |


## Merge with the shape file
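I downloaded the shape file of Chinese provinces from GitHub.

Loading it directly raises:

JSONDecodeError: Expecting value: line 1 column 1 (char 0)

The file is JavaScript rather than pure JSON: it starts with "var infoData = ". Delete that prefix and it will load fine.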
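
Instead of editing the file by hand, the prefix can be stripped in code before parsing. A minimal sketch, where `parse_js_variable` is a hypothetical helper and the sample string stands in for the file contents:

```python
import json


def parse_js_variable(js_text, prefix="var infoData = "):
    """Parse a '<prefix>{...}' JavaScript assignment as JSON."""
    body = js_text.strip()
    if body.startswith(prefix):
        body = body[len(prefix):]
    # Tolerate a trailing semicolon, which JSON does not allow
    return json.loads(body.rstrip().rstrip(";"))


sample = 'var infoData = {"type": "FeatureCollection", "features": []};'
data = parse_js_variable(sample)
print(data["type"])
```

For the real file, the text would come from reading `geo-data.js` as a string.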

In [46]:
import requests
import json
import numpy as np
import pandas as pd
from pandas import json_normalize

In [49]:
with open('geo-data.js') as json_data:
    geometry_data = json.load(json_data)

In [50]:
# json_normalize already returns a DataFrame with flattened 'properties.*' / 'geometry.*' columns
df_geo = json_normalize(geometry_data['features'])

In [51]:
df_geo.shape

(34, 8)

In [52]:
df_geo.head()

| | type | properties.id | properties.size | properties.name | properties.cp | properties.childNum | geometry.type | geometry.coordinates |
|---|---|---|---|---|---|---|---|---|
| 0 | Feature | 65 | 550 | 新疆 | `[84.9023, 42.148]` | 18 | Polygon | `[[[96.416, 42.7588], [96.416, 42.7148], [95.97...` |
| 1 | Feature | 54 | 550 | 西藏 | `[87.8695, 31.6846]` | 7 | Polygon | `[[[79.0137, 34.3213], [79.1016, 34.4531], [79....` |
| 2 | Feature | 15 | 450 | 内蒙古 | `[112.5977, 46.3408]` | 12 | Polygon | `[[[97.207, 42.8027], [99.4922, 42.583], [100.8...` |
| 3 | Feature | 63 | 800 | 青海 | `[95.2402, 35.4199]` | 8 | Polygon | `[[[89.7363, 36.0791], [89.9121, 36.0791], [90,...` |
| 4 | Feature | 51 | 900 | 四川 | `[101.9199, 30.1904]` | 21 | Polygon | `[[[101.7773, 33.5303], [101.8652, 33.5742], [1...` |


In [53]:
df2.head()

| | province | content | properties.headline | properties.article | properties.rating |
|---|---|---|---|---|---|
| 0 | 上海 | `[{'title': '上海市委副书记吴清兼任政法委书记', 'en': 'Wu Qing,...` | Shanghai | `<div class='appointment'><h1><b>Shanghai</b></...` | 36 |
| 1 | 云南 | `[{'title': '刘非任云南省委组织部部长', 'en': 'Liu Fei is t...` | Yunnan | `<div class='appointment'><h1><b>Yunnan</b></h1...` | 21 |
| 2 | 内蒙古 | `[{'title': '内蒙古自治区政协副主席杨劼兼任教育厅厅长', 'en': 'Yang...` | Inner Mongolia | `<div class='appointment'><h1><b>Inner Mongolia...` | 23 |
| 3 | 北京 | `[{'title': '刘宇辉任国务院副秘书长 此前担任北京市副市长', 'en': 'Li...` | Beijing | `<div class='appointment'><h1><b>Beijing</b></h...` | 26 |
| 4 | 吉林 | `[{'title': '韩福春任吉林省委常委', 'en': 'Han Fuchun ser...` | Jilin | `<div class='appointment'><h1><b>Jilin</b></h1>...` | 32 |


Note: the column used for merging doesn't need to be displayed on the map. For example, I merged on the Chinese province names.

It's OK not to have data for every district. However, I want the map itself to be complete (otherwise it's a serious political problem), so the merge below is a left merge that keeps every shape.

In [54]:
df_geo = df_geo.merge(df2, left_on='properties.name', right_on='province', how='left')
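
A left merge keeps every shape, but it silently drops any data row whose key does not match a shape name exactly. Here is a small sketch of a check that could be run on the two key columns before merging; `compare_keys` is a hypothetical helper, and the name lists are made-up examples (including a deliberately misspelled key).

```python
def compare_keys(shape_names, data_names):
    """Return (shapes without data, data names missing from the shape file)."""
    shape_set, data_set = set(shape_names), set(data_names)
    return sorted(shape_set - data_set), sorted(data_set - shape_set)


shape_names = ["新疆", "西藏", "内蒙古", "台湾"]
data_names = ["新疆", "内蒙古", "内蒙"]  # "内蒙" simulates a mismatched key

no_data, unmatched = compare_keys(shape_names, data_names)
print("shapes without data:", no_data)
print("data rows that would be dropped:", unmatched)
```

In the notebook, the inputs would be `df_geo['properties.name']` and `df2['province']`. An empty second list means no appointment data is lost in the merge.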

In [55]:
ok_json = json.loads(df_geo.to_json(orient='records'))

In [56]:
def process_to_geojson(records):
    # Rebuild nested GeoJSON features from the flattened 'properties.*' / 'geometry.*' keys
    geo_data = {"type": "FeatureCollection", "features": []}
    for row in records:
        this_dict = {"type": "Feature", "properties": {}, "geometry": {}}
        for key, value in row.items():
            key_names = key.split('.')
            if key_names[0] == 'geometry':
                this_dict['geometry'][key_names[1]] = value
            elif key_names[0] == 'properties':
                this_dict['properties'][key_names[1]] = value
        geo_data['features'].append(this_dict)
    return geo_data

In [57]:
geo_format = process_to_geojson(ok_json)

In [58]:
# Write the GeoJSON back as a JavaScript variable assignment so the html page can load it
with open('geo-data.js', 'w') as outfile:
    outfile.write("var infoData = ")
    json.dump(geo_format, outfile)

The final version of the JSON adds some other information for grouping.

You will notice that in most provinces, the secretary and the NPC chairman are the same person. However, there are some exceptions: usually in sub-provincial cities and in Xinjiang, Tibet, and Guangzhou, the two positions are held by two different people. There are also some current vacancies.

Scraped from: http://district.ce.cn/zt/rwk/rw/rspd/201302/17/t20130217_766061.shtml