# Part-of-Speech Tagging and Named Entity Recognition

## Part-of-Speech Tagging

### POS Tagging with jieba

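In [6]:
import jieba.posseg as psg
sent = '中文分词是文本处理不可或缺的一步！'
seg_list = psg.cut(sent)  # generator yielding pair(word, flag) objects
print(list(seg_list))
# Alternative "word/tag" rendering (call psg.cut again, since the generator above is already consumed):
# print(' '.join('{0}/{1}'.format(w, t) for w, t in psg.cut(sent)))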

[pair('中文', 'nz'), pair('分词', 'n'), pair('是', 'v'), pair('文本处理', 'n'), pair('不可或缺', 'l'), pair('的', 'uj'), pair('一步', 'm'), pair('！', 'x')]
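The flags in the output above are jieba's POS tags. As a reading aid, the sketch below groups the tokens by tag. The pairs are copied from the output as plain tuples so the snippet runs without jieba, and the glosses in `TAG_GLOSS` are informal descriptions assumed from the ICTCLAS-style tag set, not taken from the original notebook.

```python
from collections import defaultdict

# Informal glosses (an assumption; check the jieba documentation for the full tag set)
TAG_GLOSS = {
    'nz': 'other proper noun',
    'n': 'noun',
    'v': 'verb',
    'l': 'idiomatic phrase',
    'uj': 'structural particle',
    'm': 'numeral',
    'x': 'punctuation / non-word',
}

# (word, flag) pairs copied from the output above
pairs = [('中文', 'nz'), ('分词', 'n'), ('是', 'v'), ('文本处理', 'n'),
         ('不可或缺', 'l'), ('的', 'uj'), ('一步', 'm'), ('！', 'x')]

# Group the words by their POS flag, preserving sentence order within each group
by_tag = defaultdict(list)
for word, tag in pairs:
    by_tag[tag].append(word)

for tag, words in by_tag.items():
    print(tag, TAG_GLOSS.get(tag, 'unknown'), words)
```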


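<b>As an exercise, adapt the earlier code to implement HMM-based POS tagging (the 1998 People's Daily POS-tagged corpus is one possible training set).</b>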
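A minimal sketch of what such an HMM tagger could look like is given here. It is not the notebook's earlier code: the function names (`train_hmm`, `viterbi`), the add-one smoothing, and the three-sentence toy corpus are all assumptions made for illustration. With a real corpus, each sentence would be supplied as a list of `(word, tag)` pairs in the same way.

```python
import math
from collections import defaultdict

def train_hmm(tagged_sents):
    """Collect start, transition and emission counts from (word, tag) sentences."""
    start = defaultdict(int)
    trans = defaultdict(lambda: defaultdict(int))
    emit = defaultdict(lambda: defaultdict(int))
    vocab = set()
    for sent in tagged_sents:
        prev = None
        for word, tag in sent:
            vocab.add(word)
            emit[tag][word] += 1
            if prev is None:
                start[tag] += 1
            else:
                trans[prev][tag] += 1
            prev = tag
    return start, trans, emit, sorted(emit), len(vocab)

def log_prob(count, total, n_outcomes):
    """Add-one smoothed log probability (an assumed smoothing choice)."""
    return math.log((count + 1) / (total + n_outcomes))

def viterbi(words, model):
    """Return the most probable tag sequence for a non-empty list of words."""
    start, trans, emit, tags, vocab_size = model
    n_start = sum(start.values())

    def emission(tag, word):
        # vocab_size + 1 reserves probability mass for unseen words
        return log_prob(emit[tag].get(word, 0), sum(emit[tag].values()), vocab_size + 1)

    # best[i][tag] = (best log score of a path ending in tag at position i, previous tag)
    best = [{t: (log_prob(start.get(t, 0), n_start, len(tags)) + emission(t, words[0]), None)
             for t in tags}]
    for i in range(1, len(words)):
        column = {}
        for t in tags:
            score, prev = max(
                (best[i - 1][p][0]
                 + log_prob(trans[p].get(t, 0), sum(trans[p].values()), len(tags)), p)
                for p in tags)
            column[t] = (score + emission(t, words[i]), prev)
        best.append(column)

    # Backtrack from the best final tag
    path = [max(tags, key=lambda t: best[-1][t][0])]
    for i in range(len(words) - 1, 0, -1):
        path.append(best[i][path[-1]][1])
    return path[::-1]

# Toy corpus (hypothetical); a real run would load a tagged corpus instead
corpus = [
    [('我', 'r'), ('爱', 'v'), ('北京', 'ns')],
    [('他', 'r'), ('去', 'v'), ('上海', 'ns')],
    [('我', 'r'), ('去', 'v'), ('北京', 'ns')],
]
model = train_hmm(corpus)
print(viterbi(['他', '爱', '上海'], model))  # ['r', 'v', 'ns']
```

Working in log space avoids numeric underflow on long sentences, and the smoothing keeps unseen words and transitions from zeroing out every path.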

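## Named Entity Recognition (NER)

### Introduction to Named Entity Recognition

<b>Main difficulties in Chinese named entity recognition:</b>
- Each category contains a very large number of named entities
- Entities are formed according to complex rules
- Nesting is common and complex
- Entity length is not fixed

<b>Three main approaches to named entity recognition:</b>
- Rule-based NER
- Statistical NER: the mainstream method is sequence labeling (for example, with conditional random fields)
- Hybrid methods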
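To make the sequence-labeling formulation concrete: each character receives a tag, and entities are read back from runs of tags. The sketch below assumes the common BIO scheme (`B-` begins an entity, `I-` continues it, `O` is outside), which the original text does not name; `bio_to_spans` and the sample tags are hypothetical.

```python
def bio_to_spans(chars, tags):
    """Recover (entity_text, label) spans from per-character BIO tags."""
    spans = []
    start, label = None, None
    # The trailing 'O' is a sentinel that flushes an entity ending at the last character
    for i, tag in enumerate(list(tags) + ['O']):
        continues = tag.startswith('I-') and label == tag[2:]
        if start is not None and not continues:
            spans.append((''.join(chars[start:i]), label))
            start, label = None, None
        if tag.startswith('B-'):
            start, label = i, tag[2:]
    return spans

chars = list('王小明在北京工作')
tags = ['B-PER', 'I-PER', 'I-PER', 'O', 'B-LOC', 'I-LOC', 'O', 'O']
print(bio_to_spans(chars, tags))  # [('王小明', 'PER'), ('北京', 'LOC')]
```

A statistical tagger such as a CRF predicts the `tags` list; a decoding step like this one then turns the tags into entities.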

### NER with Conditional Random Fields

The concept of conditional random fields (CRFs)
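For reference, the linear-chain form commonly used for sequence labeling can be written as follows. The notation is an assumption introduced here, not taken from the original: $x$ is the input sequence of length $T$, $y$ a tag sequence, $f_k$ are feature functions, and $\lambda_k$ their learned weights.

$$
P(y \mid x) = \frac{1}{Z(x)} \exp\left( \sum_{t=1}^{T} \sum_{k} \lambda_k \, f_k(y_{t-1}, y_t, x, t) \right),
\qquad
Z(x) = \sum_{y'} \exp\left( \sum_{t=1}^{T} \sum_{k} \lambda_k \, f_k(y'_{t-1}, y'_t, x, t) \right)
$$

The normalizer $Z(x)$ sums over all candidate tag sequences $y'$, so the model scores a whole tag sequence at once instead of each position independently.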

### Hands-On 1: Date Recognition

In [2]:
# Import the required libraries
import re
from datetime import datetime, timedelta
from dateutil.parser import parse
import jieba.posseg as psg

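In [44]:
# Check whether an extracted date string is plausible
def check_time_valid(word):
    # Reject bare digit strings of six characters or fewer: too ambiguous to be a date
    if re.match(r"\d+$", word) and len(word) <= 6:
        return None
    # Drop digits trailing a day marker (号/日) and normalize the marker to 日
    word1 = re.sub(r'[号日]\d+$', '日', word)
    if word1 != word:
        return check_time_valid(word1)
    return word1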
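A few sample calls show how the check behaves. The function is restated so the snippet runs on its own, with the regular expressions written as raw strings; the sample inputs are made up for illustration.

```python
import re

def check_time_valid(word):
    # Bare digit strings of six characters or fewer are rejected
    if re.match(r"\d+$", word) and len(word) <= 6:
        return None
    # Digits trailing a day marker are dropped, and 号 is normalized to 日
    word1 = re.sub(r'[号日]\d+$', '日', word)
    if word1 != word:
        return check_time_valid(word1)
    return word1

samples = ['123456', '11月2号', '11月2号12', '26号下午4点']
for s in samples:
    print(repr(s), '->', repr(check_time_valid(s)))
```

Only the pure-digit input is rejected; `'11月2号12'` loses its trailing digits, and the other two strings pass through unchanged because nothing follows their final day marker or hour.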

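In [45]:
# Convert an extracted date string into a formatted timestamp
def parse_datetime(msg):
    if not msg:
        return None
    try:
        # fuzzy=True lets dateutil skip tokens it does not recognize
        dt = parse(msg, fuzzy=True)
        return dt.strftime('%Y-%m-%d %H:%M:%S')
    except Exception:
        # Treat any parsing failure as "no date found"
        return None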
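`dateutil`'s default parser is built around English month names and AM/PM markers, so it has no notion of markers such as 下午. One possible alternative, sketched here with the standard library only, is an explicit regular expression. This is a swapped-in technique and not the notebook's method: `parse_cn_datetime`, its pattern, and the `default` argument (which supplies any missing fields) are hypothetical, and only whole-hour expressions built from 年/月/日(号)/上午/下午/点 are covered.

```python
import re
from datetime import datetime

# Every component is optional; fullmatch ensures nothing else is left over
_CN_DATETIME = re.compile(
    r"(?:(?P<year>\d{4})年)?"
    r"(?:(?P<month>\d{1,2})月)?"
    r"(?:(?P<day>\d{1,2})[号日])?"
    r"(?P<period>上午|下午)?"
    r"(?:(?P<hour>\d{1,2})点)?"
)

def parse_cn_datetime(msg, default):
    """Parse strings like '11月2号' or '26号下午4点'; missing fields come from default."""
    m = _CN_DATETIME.fullmatch(msg)
    if m is None or not any(m.groupdict().values()):
        return None
    g = m.groupdict()
    hour = 0
    if g["hour"]:
        hour = int(g["hour"])
        # Shift afternoon hours onto the 24-hour clock
        if g["period"] == "下午" and hour < 12:
            hour += 12
    try:
        dt = default.replace(
            year=int(g["year"]) if g["year"] else default.year,
            month=int(g["month"]) if g["month"] else default.month,
            day=int(g["day"]) if g["day"] else default.day,
            hour=hour, minute=0, second=0, microsecond=0,
        )
    except ValueError:
        # Out-of-range values such as 2月30日
        return None
    return dt.strftime('%Y-%m-%d %H:%M:%S')

reference = datetime(2019, 10, 20, 9, 30)  # fixed reference date for a repeatable example
print(parse_cn_datetime('26号下午4点', reference))  # 2019-10-26 16:00:00
print(parse_cn_datetime('11月2号', reference))      # 2019-11-02 00:00:00
```

Passing the reference date explicitly, instead of calling `datetime.today()` inside the function, keeps the result reproducible.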

In [46]:
# Extract every date/time expression, joining adjacent time-related tokens
def time_extract(text):
    time_res = []
    word = ''
    keyDate = {'今天': 0, '明天': 1, '后天': 2, '大后天': 3}
    # POS-tag the text and collect tokens tagged m (numeral) or t (time word)
    for k, v in psg.cut(text):
        # A relative-day word from keyDate is replaced by the actual date (format: %Y年%m月%d日)
        if k in keyDate:
            if word != '':
                time_res.append(word)
            word = (datetime.today() + timedelta(days=keyDate.get(k, 0))).strftime('%Y年%m月%d日')
        elif word != '':
            if v in ['m', 't']:
                word = word + k
            else:
                # A non-time token ends the current expression
                time_res.append(word)
                word = ''
        elif v in ['m', 't']:
            word = k
    if word != '':
        time_res.append(word)
    # Drop candidates rejected by check_time_valid, then parse the rest
    result = [w for w in map(check_time_valid, time_res) if w is not None]
    final_res = [parse_datetime(w) for w in result]
    return [x for x in final_res if x is not None]
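The `keyDate` lookup in `time_extract` turns a relative-day word into a concrete date by adding an offset to today's date. The sketch below isolates that step. `resolve_relative_day` is a hypothetical helper; it takes the reference date as an argument so the result is repeatable, and it formats the date with an f-string instead of `strftime`, since non-ASCII characters in a `strftime` format can behave differently across platforms.

```python
from datetime import date, timedelta

# Day offsets for relative-day words, as in time_extract
KEY_DATE = {'今天': 0, '明天': 1, '后天': 2, '大后天': 3}

def resolve_relative_day(word, today):
    """Return the date named by a relative-day word, or None for any other word."""
    offset = KEY_DATE.get(word)
    if offset is None:
        return None
    d = today + timedelta(days=offset)
    return f'{d.year}年{d.month:02d}月{d.day:02d}日'

print(resolve_relative_day('明天', date(2019, 10, 31)))    # 2019年11月01日
print(resolve_relative_day('大后天', date(2019, 12, 30)))  # 2020年01月02日
```

`timedelta` arithmetic handles month and year rollover, which is why no special cases are needed at the end of a month.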

In [51]:
text1 = '我要从26号下午4点住到11月2号'
print(time_extract(text1))

['2019-11-02 00:00:00']


In [53]:
text1 = '我要从26号下午4点住到11月2号'
time_res = []
word = ''
keyDate = {'今天': 0, '明天':1, '后天': 2}
for k, v in psg.cut(text1):
    print(k,v)
    if k in keyDate:
        if word != '':            
            time_res.append(word)
        word = (datetime.today() + timedelta(days=keyDate.get(k, 0))).strftime('%Y年%m月%d日')
    elif word != '':
        if v in ['m', 't']:
            word = word + k
        else:
            time_res.append(word)
            word = ''
    elif v in ['m', 't']:
        word = k
if word != '':
    time_res.append(word)
#result = list(filter(lambda x:x is not None, [check_time_valid(w) for w in time_res]))
#print(result)

m = [re.sub(r"[号日]\d+$", "日", word) for word in time_res]
#dt = [parse(msg,fuzzy=True) for msg in m]
print(time_res,m)    

我 r
要 v
从 p
26 m
号 m
下午 t
4 m
点 m
住 v
到 v
11 m
月 m
2 m
号 m
['26号下午4点', '11月2号'] ['26号下午4点', '11月2号']
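The grouping step traced above amounts to joining each maximal run of tokens tagged `m` or `t`. A compact sketch with `itertools.groupby` follows; the `(word, flag)` pairs are copied from the trace so the snippet runs without jieba. `merge_time_tokens` is a hypothetical helper and deliberately leaves out the `keyDate` handling for relative-day words.

```python
from itertools import groupby

TIME_FLAGS = ('m', 't')  # numeral and time-word tags

def merge_time_tokens(tagged):
    """Join each maximal run of consecutive m/t tokens into one string."""
    spans = []
    for is_time, group in groupby(tagged, key=lambda pair: pair[1] in TIME_FLAGS):
        if is_time:
            spans.append(''.join(word for word, _ in group))
    return spans

# (word, flag) pairs copied from the trace above
tagged = [('我', 'r'), ('要', 'v'), ('从', 'p'), ('26', 'm'), ('号', 'm'),
          ('下午', 't'), ('4', 'm'), ('点', 'm'), ('住', 'v'), ('到', 'v'),
          ('11', 'm'), ('月', 'm'), ('2', 'm'), ('号', 'm')]
print(merge_time_tokens(tagged))  # ['26号下午4点', '11月2号']
```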


In [43]:
print(parse.__doc__)



    Parse a string in one of the supported formats, using the
    ``parserinfo`` parameters.

    :param timestr:
        A string containing a date/time stamp.

    :param parserinfo:
        A :class:`parserinfo` object containing parameters for the parser.
        If ``None``, the default arguments to the :class:`parserinfo`
        constructor are used.

    The ``**kwargs`` parameter takes the following keyword arguments:

    :param default:
        The default datetime object, if this is a datetime object and not
        ``None``, elements specified in ``timestr`` replace elements in the
        default object.

    :param ignoretz:
        If set ``True``, time zones in parsed strings are ignored and a naive
        :class:`datetime` object is returned.

    :param tzinfos:
        Additional time zone names / aliases which may be present in the
        string. This argument maps time zone names (and optionally offsets
        from those time zones) to time zones. This parame

### Hands-On 2: Place Name Recognition