# Data preprocessing: `keras.preprocessing`

- text
- sequence
- image

In [8]:
from keras import preprocessing

## [Text preprocessing](http://keras-cn.readthedocs.io/en/latest/preprocessing/text/)

In [10]:
txt = 'In this paper, we present a simple and modularized neural network architecture, named interleaved group convolutional \
neural networks (IGCNets). The main point lies in a novel building block, a pair of two successive \
interleaved group convolutions: primary group convolution and secondary group convolution.'

In [12]:
out1 = preprocessing.text.text_to_word_sequence(txt)
out1[:6]

['in', 'this', 'paper', 'we', 'present', 'a']

In [13]:
out2 = preprocessing.text.text_to_word_sequence(txt, lower=False)
out2[:6]

['In', 'this', 'paper', 'we', 'present', 'a']

In [14]:
out3 = preprocessing.text.text_to_word_sequence(txt, lower=False, filters='Tha')
out3[:6]

['In', 't', 'is', 'p', 'per,', 'we']
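
The three calls above differ only in `lower` and `filters`. As a rough mental model, the function optionally lowercases the text, turns every character listed in `filters` into the split character, and then splits. The sketch below is a simplification written for illustration, not the Keras source: the helper name `simple_word_sequence` is hypothetical, and `DEFAULT_FILTERS` is an assumption about the library's default punctuation list.

```python
# Simplified sketch of the behaviour shown above; not the Keras implementation.
DEFAULT_FILTERS = '!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n'  # assumed default list


def simple_word_sequence(text, filters=DEFAULT_FILTERS, lower=True, split=" "):
    """Optionally lowercase, turn every filter character into `split`, then split."""
    if lower:
        text = text.lower()
    # Map each filtered character to the separator so it acts as a word boundary.
    table = str.maketrans({ch: split for ch in filters})
    tokens = text.translate(table).split(split)
    # Drop the empty strings left behind by consecutive separators.
    return [tok for tok in tokens if tok]


sample = "In this paper, we present a simple network"
default_tokens = simple_word_sequence(sample)
custom_tokens = simple_word_sequence(sample, lower=False, filters="Tha")
print(default_tokens[:6])
print(custom_tokens[:6])
```

Passing `filters='Tha'` replaces the default list rather than extending it, so the comma is no longer stripped. That is why `'per,'` keeps its punctuation while `T`, `h`, and `a` act as separators.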

### Chinese text processing: [`jieba`](https://github.com/fxsjy/jieba)

`jieba` is a word segmentation library written in Python with strong support for Chinese text.

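#### Advantages of jieba
1. It supports three segmentation modes:
    - a. Precise mode tries to cut the sentence into its most accurate segmentation and is suited to text analysis.
    - b. Full mode scans out every substring of the sentence that can form a word. It is very fast but cannot resolve ambiguity.
    - c. Search engine mode builds on precise mode by splitting long words again, which improves recall and suits search engine tokenization.
2. It supports custom dictionaries.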
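
The difference between the three modes can be sketched on a toy dictionary. This is only an illustration of the dictionary-plus-dynamic-programming idea, not jieba's code: `TOY_FREQ` and its frequencies are invented, and `build_dag`, `full_mode`, `precise_mode`, and `search_mode` are hypothetical helper names.

```python
import math

# Toy dictionary with invented frequencies (an assumption for illustration only).
TOY_FREQ = {"我": 50, "来到": 30, "北京": 40, "清华": 10,
            "清华大学": 5, "华大": 2, "大学": 20}


def build_dag(sentence, freq):
    """For each start index, list the end indices of dictionary words starting there."""
    dag = {}
    for start in range(len(sentence)):
        ends = [end for end in range(start, len(sentence))
                if sentence[start:end + 1] in freq]
        dag[start] = ends or [start]  # unknown character: a one-character word
    return dag


def full_mode(sentence, freq):
    """List every dictionary word found in the sentence; overlaps are kept."""
    dag = build_dag(sentence, freq)
    return [sentence[s:e + 1] for s in range(len(sentence)) for e in dag[s]
            if sentence[s:e + 1] in freq]


def precise_mode(sentence, freq):
    """Pick the one segmentation with the highest total log-probability."""
    dag = build_dag(sentence, freq)
    log_total = math.log(sum(freq.values()))
    n = len(sentence)
    # best[i] = (score of the best split of sentence[i:], end index of its first word)
    best = {n: (0.0, n)}
    for start in range(n - 1, -1, -1):
        best[start] = max(
            (math.log(freq.get(sentence[start:end + 1], 1)) - log_total
             + best[end + 1][0], end)
            for end in dag[start]
        )
    words, i = [], 0
    while i < n:
        end = best[i][1]
        words.append(sentence[i:end + 1])
        i = end + 1
    return words


def search_mode(sentence, freq):
    """Start from precise mode, then also emit shorter dictionary words inside long words."""
    out = []
    for word in precise_mode(sentence, freq):
        for size in (2, 3):
            if len(word) > size:
                out.extend(word[i:i + size] for i in range(len(word) - size + 1)
                           if word[i:i + size] in freq)
        out.append(word)
    return out


sentence = "我来到北京清华大学"
print("/".join(full_mode(sentence, TOY_FREQ)))
print("/".join(precise_mode(sentence, TOY_FREQ)))
print("/".join(search_mode(sentence, TOY_FREQ)))
```

With the real library, the corresponding calls are `jieba.cut(sentence, cut_all=True)`, `jieba.cut(sentence)`, and `jieba.cut_for_search(sentence)`.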

#### Word segmentation

In [16]:
import jieba

In [17]:
# jieba.cut returns a generator of tokens; precise mode is the default
words = jieba.cut("他来到了网易杭研大厦")
print("/".join(words))

Building prefix dict from the default dictionary ...
Dumping model to file cache C:\Users\q7356\AppData\Local\Temp\jieba.cache
Loading model cost 1.575 seconds.
Prefix dict has been built succesfully.


他/来到/了/网易/杭研/大厦


#### Adding a custom dictionary
```py
import jieba

jieba.load_userdict("dict.txt")  # load the custom dictionary shown below
words = jieba.cut("他来到了网易杭研大厦")
print("/".join(words))
print(type(words))  # jieba.cut returns a generator
```
The custom dictionary, `dict.txt`:
```md
杭研大厦 100 n
```
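
The sample entry appears to follow the layout described in the jieba README: one word per line, followed by an optional frequency and an optional part-of-speech tag, separated by spaces. The parser below is a hypothetical helper (`parse_userdict_line` is not part of jieba) that only illustrates how such a line breaks down.

```python
def parse_userdict_line(line):
    """Split one dictionary line into (word, frequency or None, tag or None)."""
    parts = line.strip().split(" ")
    word, freq, tag = parts[0], None, None
    for field in parts[1:]:
        if field.isdigit():
            freq = int(field)   # numeric field: word frequency
        elif field:
            tag = field         # anything else: part-of-speech tag
    return word, freq, tag


entries = [parse_userdict_line(line)
           for line in ["杭研大厦 100 n", "云计算 5", "台中"]]
for entry in entries:
    print(entry)
```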

#### Modifying the dictionary dynamically at run time

In [19]:
words = jieba.cut("我们中出了一个叛徒", HMM=False)
# Uncomment to adjust the word frequency at run time so that '中出' is kept as one word:
# jieba.suggest_freq(('中出'), True)
print('/'.join(words))

我们/中/出/了/一个/叛徒


## [Sequence preprocessing](http://keras-cn.readthedocs.io/en/latest/preprocessing/sequence/)

- Padding sequences: `pad_sequences`
- Skip-grams: `skipgrams`
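
As a rough illustration of what these two utilities compute, here is a pure-Python sketch. The function names `pad_sequences_sketch` and `skipgram_pairs_sketch` are hypothetical. The padding helper assumes the Keras defaults of padding and truncating at the front. The skip-gram helper only produces the positive (target, context) pairs, while the Keras `skipgrams` function additionally draws random negative samples and returns labels, which is left out here to keep the output deterministic.

```python
def pad_sequences_sketch(sequences, maxlen=None, padding="pre",
                         truncating="pre", value=0):
    """Return equal-length lists, padding or truncating each sequence."""
    if maxlen is None:
        maxlen = max(len(seq) for seq in sequences)
    padded = []
    for seq in sequences:
        seq = list(seq)
        if len(seq) > maxlen:
            # "pre" drops items from the front, "post" drops them from the end.
            seq = seq[len(seq) - maxlen:] if truncating == "pre" else seq[:maxlen]
        fill = [value] * (maxlen - len(seq))
        padded.append(fill + seq if padding == "pre" else seq + fill)
    return padded


def skipgram_pairs_sketch(sequence, window_size=2):
    """Return (target, context) pairs for words within `window_size` positions."""
    pairs = []
    for i, target in enumerate(sequence):
        lo = max(0, i - window_size)
        hi = min(len(sequence), i + window_size + 1)
        pairs.extend((target, sequence[j]) for j in range(lo, hi) if j != i)
    return pairs


batch = pad_sequences_sketch([[1, 2, 3], [4, 5], [6]], maxlen=2)
post = pad_sequences_sketch([[1, 2, 3], [4, 5], [6]], padding="post")
pairs = skipgram_pairs_sketch([1, 2, 3], window_size=1)
print(batch)
print(post)
print(pairs)
```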

## [Image preprocessing](http://keras-cn.readthedocs.io/en/latest/preprocessing/image/)

- Image generator: `ImageDataGenerator`