## Text Summarization Task: Data Preprocessing

Data source: [Sogou Labs - Sohu News Data](http://www.sogou.com/labs/resource/cs.php)

In [2]:
import re
from urllib.parse import urlparse 
from collections import Counter

In [2]:
!ls -l

total 5048284
-rw-r--r-- 1 root root      10374 May 12  2017 data.ipynb
-rw-r--r-- 1 root root          0 May 12  2017 dev_contents.txt
-rw-r--r-- 1 root root          0 May 12  2017 dev_titles.txt
-rw-r--r-- 1 root root 1537763850 May 10  2017 news_sohusite_xml.dat
-rw-r--r-- 1 root root     229888 Aug 15  2012 news_sohusite_xml.smarty.dat
-rw-r--r-- 1 root root 1538501754 May 10  2017 news_tensite_xml.dat
-rw-r--r-- 1 root root     218029 May 10  2017 news_tensite_xml.smarty.dat
-rw-r--r-- 1 root root 1878985341 May 10  2017 raw_contents.txt
-rw-r--r-- 1 root root     286060 May 11  2017 raw_contents_s.txt
-rw-r--r-- 1 root root  118695671 May 10  2017 raw_titles.txt
-rw-r--r-- 1 root root      17720 May 11  2017 raw_titles_s.txt
-rw-r--r-- 1 root root   94713250 May 11  2017 raw_urls.txt
-rw-r--r-- 1 root root          0 May 12  2017 test_contents.txt
-rw-r--r-- 1 root root          0 May 12  2017 test_titles.txt
-rw-r--r-- 1 root root          0 May 12  2017 train_co

Convert the encoding on the command line and extract the content, title, and URL lines into separate files:

```
cd \^data
cat news_sohusite_xml.dat | iconv -f gb18030 -t utf-8 | grep "<contenttitle>" > raw_titles.txt
cat news_sohusite_xml.dat | iconv -f gb18030 -t utf-8 | grep "<content>" > raw_contents.txt
cat news_sohusite_xml.dat | iconv -f gb18030 -t utf-8 | grep "<url>" > raw_urls.txt
```
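
Where `iconv` and `grep` are not available, the same extraction can be approximated in Python. This is a minimal sketch, not the method used to produce the files listed above: the helper name `extract_tag_lines` is hypothetical, and `errors='replace'` is an assumption that substitutes undecodable bytes instead of stopping.

```python
def extract_tag_lines(src_path, dst_path, tag, src_encoding='gb18030'):
    """Copy every line containing `<tag>` from src_path to dst_path as UTF-8.

    Hypothetical helper that mirrors `iconv -f gb18030 -t utf-8 | grep "<tag>"`.
    Returns the number of lines written.
    """
    marker = '<' + tag + '>'
    kept = 0
    # Assumption: undecodable bytes are replaced with U+FFFD rather than
    # raising an error partway through the file.
    with open(src_path, encoding=src_encoding, errors='replace') as src, \
         open(dst_path, 'w', encoding='utf-8') as dst:
        for line in src:
            if marker in line:
                dst.write(line)
                kept += 1
    return kept
```

Under these assumptions, `extract_tag_lines('news_sohusite_xml.dat', 'raw_titles.txt', 'contenttitle')` would play the role of the first pipeline, and the return value gives the line count without a second pass over the output.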

In [6]:
with open('raw_titles.txt', encoding='utf-8') as f:
    num_examples = 0
    for line in f:
        num_examples += 1

print(num_examples)

1411996


The Sohu News dataset contains 1411996 examples in total.

### Process the raw text and split it into train/dev/test

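* Strip the XML tags at the start and end of each line
* Replace the full-width space `\u3000` with an ASCII space and drop the special character `\ue40c`
* [x] Convert full-width digits and Latin letters to half-width
* Remove short bracketed text such as (xx) and [xx]
* Filter out examples whose content is too long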
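
The character-level steps in the list above can be sketched with `str.translate`. This is an illustrative alternative under stated assumptions, not the notebook's own implementation: the names `FULL_TO_HALF`, `BRACKETED`, and `normalize` are hypothetical, and the 30-character limit inside brackets is taken from the regular expression used later in this notebook.

```python
import re

# Full-width ASCII variants (U+FF01..U+FF5E) sit exactly 0xFEE0 code points
# above their half-width counterparts.
FULL_TO_HALF = {code: code - 0xFEE0 for code in range(0xFF01, 0xFF5F)}
FULL_TO_HALF[0x3000] = ord(' ')  # full-width (ideographic) space -> ASCII space
FULL_TO_HALF[0xE40C] = None      # private-use character in the corpus -> dropped

# Short bracketed spans such as (xx), [xx] or <tag>, with at most
# 30 characters between the brackets.
BRACKETED = re.compile(r'[<\[(].{0,30}?[>\])]')


def normalize(text):
    """Hypothetical helper: width-normalize, then strip short bracketed spans."""
    return BRACKETED.sub('', text.translate(FULL_TO_HALF))
```

For example, `normalize('<content>ＣＰＵ　２ＧＢ（测试）</content>')` returns `'CPU 2GB'`: the full-width parentheses are converted first, so the bracket pattern also catches them.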

In [7]:
raw_contents_file = 'raw_contents.txt'
raw_titles_file = 'raw_titles.txt'
raw_urls_file = 'raw_urls.txt'

num_train_raw = 1200000
num_dev_raw = 100000
num_test_raw = 100000
assert num_train_raw + num_dev_raw + num_test_raw <= num_examples, 'Not enough examples.'

The news category is determined from the URL. Mapping: http://download.labs.sogou.com/dl/sogoulabdown/categories_2012.txt
(Note: the actual URLs differ from the ones listed at this link.)
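
A minimal sketch of how the host name can be read off one raw URL line. The helper name `netloc_of` is hypothetical, and the pattern `</?url>` is a stricter stand-in for the `<.{3,4}>` pattern used in the cells of this notebook.

```python
import re
from urllib.parse import urlparse


def netloc_of(raw_line):
    """Strip the <url>...</url> wrapper and return the host part of the URL."""
    url = re.sub(r'</?url>', '', raw_line).strip()
    return urlparse(url).netloc
```

Applied to `'<url>http://product.it.sohu.com/detail/188099_price.html</url>'`, this returns `'product.it.sohu.com'`, which is the key used when counting sources.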

In [8]:
cnt = Counter()

with open(raw_urls_file, encoding='utf-8') as f:
    for line in f:
        line = re.sub('<.{3,4}>', '', line)
        netloc = urlparse(line).netloc
        cnt[netloc] += 1

In [14]:
# Corpus sources
cnt.most_common(40)

[('roll.sohu.com', 720957),
 ('product.it.sohu.com', 176727),
 ('news.sohu.com', 70900),
 ('db.auto.sohu.com', 56275),
 ('sports.sohu.com', 38281),
 ('stock.sohu.com', 36968),
 ('pic.yule.sohu.com', 27371),
 ('business.sohu.com', 26179),
 ('dealer.auto.sohu.com', 25663),
 ('saa.auto.sohu.com', 19671),
 ('q.stock.sohu.com', 13814),
 ('yule.sohu.com', 12779),
 ('drug.health.sohu.com', 11480),
 ('haodf.health.sohu.com', 10890),
 ('it.sohu.com', 10797),
 ('pic.news.sohu.com', 9529),
 ('s.sohu.com', 8678),
 ('data.yule.sohu.com', 8071),
 ('money.sohu.com', 7263),
 ('daxue.learning.sohu.com', 7021),
 ('auto.sohu.com', 6843),
 ('vip.book.sohu.com', 6009),
 ('learning.sohu.com', 5570),
 ('zone.it.sohu.com', 5132),
 ('digi.it.sohu.com', 4039),
 ('q.fund.sohu.com', 3883),
 ('picture.auto.sohu.com', 3567),
 ('goche.auto.sohu.com', 3469),
 ('db.money.sohu.com', 2766),
 ('baobao.sohu.com', 2560),
 ('baodian.women.sohu.com', 2534),
 ('club.mil.news.sohu.com', 2346),
 ('t.stock.sohu.com', 1933),
 ('w

`roll.sohu.com` and `news.sohu.com` both carry fairly conventional news articles.

In [29]:
def count_hanzi(text):
    count = 0
    for char in text:
        if '\u4e00' <= char <= '\u9fff':
            count += 1
    return count


def process(text):
    def full2half(text):
        chars = []
        for char in text:
            if char == '\u3000':  # full-width (ideographic) space
                char = ' '
            elif char == '\ue40c':  # private-use character; possibly a full-width line break?
                char = ''
            elif '\uff01' <= char <= '\uff5e':  # full-width ASCII variants
                char = chr(ord(char) - 0xfee0)
            chars.append(char)
        return ''.join(chars)

    text = full2half(text)
    # Remove short bracketed spans; this also strips the XML tags.
    text = re.sub(r'[<\[\(].{0,30}?[>\]\)]', '', text)
    return text


def prepare_data(start_line, num_lines, contents_file, titles_file, max_content_length=200):
    """Write the filtered examples from raw lines [start_line, start_line + num_lines).

    Returns the number of examples written.
    """
    content = title = url = ''
    written = 0  # number of examples written

    with open(raw_contents_file, encoding='utf-8') as contents_reader, \
         open(raw_titles_file, encoding='utf-8') as titles_reader, \
         open(raw_urls_file, encoding='utf-8') as urls_reader, \
         open(contents_file, 'w', encoding='utf-8') as contents_writer, \
         open(titles_file, 'w', encoding='utf-8') as titles_writer:

        # Skip the first start_line lines of all three files in lockstep.
        for _ in range(start_line):
            contents_reader.readline()
            titles_reader.readline()
            urls_reader.readline()

        for _ in range(num_lines):
            content = contents_reader.readline()
            title = titles_reader.readline()
            url = urls_reader.readline().strip()

            # The length is measured on the raw line, XML tags included.
            if len(content) > max_content_length:
                continue

            url = re.sub('<.{3,4}>', '', url)
            netloc = urlparse(url).netloc
            if netloc == 'roll.sohu.com':
                content = process(content)
                title = process(title)
                if len(content) >= 15 and count_hanzi(title) >= 7:
                    contents_writer.write(content)
                    titles_writer.write(title)
                    written += 1

    # Print the last line read to check that the three files are aligned.
    print(content, title, url)

    return written

In [30]:
# Note: the `+ 1` in the dev/test offsets leaves raw line `num_train_raw` (0-based) unused;
# the three splits remain disjoint.
num_train = prepare_data(0, num_train_raw, 'train_contents.txt', 'train_titles.txt')
num_dev = prepare_data(num_train_raw + 1, num_dev_raw, 'dev_contents.txt', 'dev_titles.txt')
num_test = prepare_data(num_train_raw + num_dev_raw + 1, num_test_raw, 'test_contents.txt', 'test_titles.txt')

num_train, num_dev, num_test

<content>产品系列：华硕　Ｎ８０系列屏幕尺寸：１４．１英寸ＣＰＵ型号：Ｉｎｔｅｌ　酷睿２双核　Ｔ６４００ＣＰＵ主频：２０００ＭＨｚ内存容量：２ＧＢ硬盘容量：３２０ＧＢ显卡芯片：ＮＶＩＤＩＡ　ＧｅＦｏｒｃｅ　９３００Ｍ　Ｇ操作系统：Ｗｉｎｄｏｗｓ　Ｖｉｓｔ　更多参数＞＞</content>
 <contenttitle>￥４９００</contenttitle>
 http://product.it.sohu.com/search/subcategoryid=16&manuid=227&seriesid[]=5026&seriesid[]=3937&seriesid[]=7310
<content>经销商　型号　经销商报价　经销商信息万利达　ＭＪＳ－４８Ｅ联系电话：０１０－８２８５２５１４　８２８５１２８５手机号码：１３８１０７９０３１０　１３８１０５０７４４３店铺地址：北京硅谷电脑城５层５１８－５２４室</content>
 <contenttitle>万利达　ＭＪＳ－４８Ｅ</contenttitle>
 http://product.it.sohu.com/detail/188099_price.html
<content>７月１０日，在浙江乐清一趟７路公交车上，一名年轻男子突然癫痫病发作，两眼翻白，手脚抽搐。公交车司机董师傅赶紧将车停在路边，扶着发病男子平躺在公交车过道上。看着躺在地上的男子，乘客似乎有所顾虑。董师傅说：“我车里有监控，说什么都能听到，大家尽管救。”随后，乘客开始上前帮忙。在一车人的全力救护下，司机一路闯红灯，将发病男子送到了医院。感慨善良之心仍在只是环境变了＠近水楼台：说到底，还是这个社会阴暗的一面让人们遇到突发事件变得更加小心，不敢伸手了！＠林朝霞：想做好事，又怕惹祸上身。唉，地球太危险！＠璇：救人之前还要犹豫，因为不知道这一秒救了人，下一秒会不会被告上法庭。＠大粤新闻：请保持自己的善心，请不要对躺在路边的人过于畏惧，如果每个人都如此冷漠，那么，当你自己倒下去的时候，你会发现，这就是自己一手造成的冷漠。＠董应群－比多：善良的心仍在，只是环境变了。＠一片冰心在茶壶：做好事怕带来严重的后果，结果就是导致没人做好事。没人做好事的社会是一个堕落的社会，一个堕落的社会是不能教化人们做好事的！思考健全保障制度提倡

(113925, 9210, 10326)
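
To put these counts in perspective, the share of raw lines that survives filtering can be computed directly. This is a small arithmetic sketch; the dictionary names are hypothetical, and the numbers are the raw split sizes and the counts reported above.

```python
raw_sizes = {'train': 1200000, 'dev': 100000, 'test': 100000}
kept_sizes = {'train': 113925, 'dev': 9210, 'test': 10326}

for split, raw in raw_sizes.items():
    kept = kept_sizes[split]
    # Fraction of raw lines in this split that passed every filter.
    print(f'{split}: kept {kept} of {raw} ({kept / raw:.2%})')
```

Roughly 9 to 10 percent of the raw lines in each split are kept.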

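Selection criteria:
- Content length before processing <= 200 characters
- Content length after processing >= 15 characters
- At least 7 Chinese characters in the processed title
- Only articles from roll.sohu.com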
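
The four criteria above can be restated as a single predicate. This is a sketch for illustration only: the function name `keep_example` and its parameter names are hypothetical, and, as in the notebook code, lengths are measured on the lines as read, so the raw length includes the XML tags and the trailing newline.

```python
def keep_example(raw_content, content, title, netloc,
                 max_content_length=200, min_content_length=15,
                 min_title_hanzi=7):
    """Hypothetical predicate restating the four selection criteria."""
    # Count characters in the basic CJK Unified Ideographs range.
    hanzi_in_title = sum('\u4e00' <= ch <= '\u9fff' for ch in title)
    return (len(raw_content) <= max_content_length
            and len(content) >= min_content_length
            and hanzi_in_title >= min_title_hanzi
            and netloc == 'roll.sohu.com')
```

Keeping the thresholds as keyword arguments makes it easy to try different cut-offs without touching the filtering loop.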

### Also: Sohu "Full-Web News Data"
http://www.sogou.com/labs/resource/ca.php

In [1]:
with open('raw_titles_tensite.txt', encoding='utf-8') as f:
    num_examples_tensite = 0
    for line in f:
        num_examples_tensite += 1

print(num_examples_tensite)

1294233


The full-web news dataset contains 1294233 examples in total.

In [12]:
cnt_ten = Counter()

with open('raw_urls_tensite.txt', encoding='utf-8') as f:
    for line in f:
        line = re.sub('<.{3,4}>', '', line)
        netloc = urlparse(line).netloc
        cnt_ten[netloc] += 1

In [15]:
# Corpus sources
cnt_ten.most_common(50)

[('news.163.com', 209076),
 ('news.sohu.com', 85454),
 ('www.people.com.cn', 71087),
 ('henan.people.com.cn', 57003),
 ('news.cn.yahoo.com', 52484),
 ('ent.cn.yahoo.com', 44567),
 ('sports.cn.yahoo.com', 34435),
 ('finance.people.com.cn', 29574),
 ('world.people.com.cn', 25347),
 ('ha.people.com.cn', 24051),
 ('sports.people.com.cn', 21635),
 ('fujian.people.com.cn', 20973),
 ('haodf.health.people.com.cn', 19378),
 ('pic.news.sohu.com', 19162),
 ('cpc.people.com.cn', 18729),
 ('dfgwy.edu.people.com.cn', 16426),
 ('politics.people.com.cn', 16346),
 ('data.fund.people.com.cn', 15211),
 ('js.people.com.cn', 15098),
 ('biz.cn.yahoo.com', 14984),
 ('hi.people.com.cn', 14877),
 ('cul.cn.yahoo.com', 14077),
 ('legal.people.com.cn', 14048),
 ('game.people.com.cn', 13259),
 ('society.people.com.cn', 12819),
 ('tv.people.com.cn', 12309),
 ('lady.cn.yahoo.com', 11949),
 ('cq.people.com.cn', 10648),
 ('shipin.people.com.cn', 10638),
 ('auto.data.people.com.cn', 10214),
 ('ah.people.com.cn', 10014)