# Chapter 6: Processing English Text

In [1]:
!curl -O http://www.cl.ecei.tohoku.ac.jp/nlp100/data/nlp.txt

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  8594  100  8594    0     0   8594      0  0:00:01 --:--:--  0:00:01 54050


In [2]:
!cat nlp.txt

Natural language processing
From Wikipedia, the free encyclopedia

Natural language processing (NLP) is a field of computer science, artificial intelligence, and linguistics concerned with the interactions between computers and human (natural) languages. As such, NLP is related to the area of humani-computer interaction. Many challenges in NLP involve natural language understanding, that is, enabling computers to derive meaning from human or natural language input, and others involve natural language generation.

History

The history of NLP generally starts in the 1950s, although work can be found from earlier periods. In 1950, Alan Turing published an article titled "Computing Machinery and Intelligence" which proposed what is now called the Turing test as a criterion of intelligence.

The Georgetown experiment in 1954 involved fully automatic translation of more than sixty Russian sentences into English. The authors claimed that within three or five years, machine translatio

## 50. Sentence segmentation

In [3]:
import re

sentences = []
with open('nlp.txt') as f:
    for line in f:
        for it in re.finditer(r'[A-Z].*?[.!;:](?=(\s[A-Z]|$))', line):
            sentences.append(it.group(0))
print(sentences[:4])

['Natural language processing (NLP) is a field of computer science, artificial intelligence, and linguistics concerned with the interactions between computers and human (natural) languages.', 'As such, NLP is related to the area of humani-computer interaction.', 'Many challenges in NLP involve natural language understanding, that is, enabling computers to derive meaning from human or natural language input, and others involve natural language generation.', 'The history of NLP generally starts in the 1950s, although work can be found from earlier periods.']
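
The same splitting rule can also be tried on an inline string, without reading `nlp.txt`. This is a minimal sketch, not the notebook's method: the helper `split_sentences` is hypothetical, and it assumes `?` should count as a sentence terminator as well, which the regex above does not include.

```python
import re

def split_sentences(text):
    # Split at whitespace that follows one of . ; : ? ! and precedes an uppercase letter.
    # The lookbehind/lookahead keep the punctuation and the capital letter in the pieces.
    return re.split(r'(?<=[.;:?!])\s+(?=[A-Z])', text.strip())

sample = "NLP is a field. Is it hard? Yes, it is."
for sentence in split_sentences(sample):
    print(sentence)
```

Like the original pattern, this also splits after abbreviations such as "Dr. Smith", so it is only a rough heuristic.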


## 51. Word extraction

In [4]:
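all_words = []  # list of lists: one list of words per sentence
for sen in sentences:
    # Split on spaces, then drop every non-alphabetic character from each word
    all_words.append([re.sub('[^a-zA-Z]', '', w) for w in sen.split(' ')])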

for words in all_words[:1]:
    for word in words:
        print(word)
    print()

Natural
language
processing
NLP
is
a
field
of
computer
science
artificial
intelligence
and
linguistics
concerned
with
the
interactions
between
computers
and
human
natural
languages



## 52. Stemming

https://en.wikipedia.org/wiki/Stemming

> stemming is the process of reducing inflected (or sometimes derived) words to their word stem, base or root form—generally a written word form.

In short, stemming cuts each English word down to its stem.
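
To get a feel for the idea, here is a deliberately naive suffix stripper. It is not the Porter2 algorithm: the suffix list and the `min_stem` threshold are assumptions chosen for illustration, so its results differ from a real stemmer on many words.

```python
# Hypothetical suffix list, checked in order (longer suffixes first)
SUFFIXES = ('ing', 'ed', 'es', 's')

def naive_stem(word, min_stem=3):
    # Strip the first matching suffix, but only if at least `min_stem` characters remain
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[:-len(suffix)]
    return word

for w in ['processing', 'computers', 'concerned', 'is']:
    print("{}\t{}".format(w, naive_stem(w)))
```

A real stemmer applies many more rules, for example mapping `science` to `scienc`, which this sketch leaves unchanged.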

```
# The module named in the exercise targets Python 2, so install one that looks like it works on Python 3
pip install porter2stemmer
```

In [5]:
import porter2stemmer as p2s

stemmer = p2s.Porter2Stemmer()
for words in all_words[:1]:
    for word in words:
        print("{}\t{}".format(word,stemmer.stem(word)))

Natural	Natur
language	languag
processing	process
NLP	NLP
is	is
a	a
field	field
of	of
computer	comput
science	scienc
artificial	artifici
intelligence	intellig
and	and
linguistics	linguist
concerned	concern
with	with
the	the
interactions	interact
between	between
computers	comput
and	and
human	human
natural	natur
languages	languag


## 53. Tokenization

https://stanfordnlp.github.io/CoreNLP/

I installed the JDK just for this.

In [6]:
with open('for53.txt','w') as f:
    for sentence in sentences:
        print(sentence,file=f)


```
java -cp "corenlp/*" -Xmx3g edu.stanford.nlp.pipeline.StanfordCoreNLP -annotators tokenize,ssplit,pos,lemma,ner,parse,dcoref -file for53.txt
```

In [7]:
!head -n30 for53.txt.xml

<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet href="CoreNLP-to-HTML.xsl" type="text/xsl"?>
<root>
  <document>
    <docId>for53.txt</docId>
    <sentences>
      <sentence id="1">
        <tokens>
          <token id="1">
            <word>Natural</word>
            <lemma>natural</lemma>
            <CharacterOffsetBegin>0</CharacterOffsetBegin>
            <CharacterOffsetEnd>7</CharacterOffsetEnd>
            <POS>JJ</POS>
            <NER>O</NER>
            <Speaker>PER0</Speaker>
          </token>
          <token id="2">
            <word>language</word>
            <lemma>language</lemma>
            <CharacterOffsetBegin>8</CharacterOffsetBegin>
            <CharacterOffsetEnd>16</CharacterOffsetEnd>
            <POS>NN</POS>
            <NER>O</NER>
            <Speaker>PER0</Speaker>
          </token>
          <token id="3">
            <word>processing</word>
            <lemma>processing</lemma>
       

Reading the XML

https://docs.python.jp/3/library/xml.etree.elementtree.html
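
The lookup pattern can be tried without running CoreNLP. The sketch below parses a small hand-written fragment that imitates the layout shown above (it is not real CoreNLP output), and the helper `tokens_of` is a hypothetical name.

```python
import xml.etree.ElementTree as ET

# Hand-written fragment imitating the CoreNLP XML layout (illustrative data only)
SAMPLE = """
<root><document><sentences>
  <sentence id="1"><tokens>
    <token id="1"><word>Computers</word><lemma>computer</lemma><POS>NNS</POS></token>
    <token id="2"><word>learn</word><lemma>learn</lemma><POS>VBP</POS></token>
  </tokens></sentence>
</sentences></document></root>
"""

def tokens_of(xml_root, sentence_id):
    # ElementTree supports a limited XPath subset, including [@attr="value"] predicates
    path = './document/sentences/sentence[@id="{}"]/tokens/token'.format(sentence_id)
    return [(t.findtext('word'), t.findtext('lemma'), t.findtext('POS'))
            for t in xml_root.findall(path)]

sample_root = ET.fromstring(SAMPLE.strip())
for word, lemma, pos in tokens_of(sample_root, 1):
    print(word, lemma, pos, sep='\t')
```

`findtext` returns the text of the first matching child, which saves a `find(...).text` pair per field.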

In [8]:
import xml.etree.ElementTree as ET
tree = ET.parse('for53.txt.xml')
root = tree.getroot()

In [9]:
for token in root.findall('./document/sentences/sentence[@id="1"]/tokens/token'):
    print(token.find('word').text)

Natural
language
processing
-LRB-
NLP
-RRB-
is
a
field
of
computer
science
,
artificial
intelligence
,
and
linguistics
concerned
with
the
interactions
between
computers
and
human
-LRB-
natural
-RRB-
languages
.


## 54. Part-of-speech tagging

In [10]:
for token in root.findall('./document/sentences/sentence[@id="1"]/tokens/token'):
    print("{}\t{}\t{}".format(token.find('word').text,token.find('lemma').text,token.find('POS').text))

Natural	natural	JJ
language	language	NN
processing	processing	NN
-LRB-	-lrb-	-LRB-
NLP	nlp	NN
-RRB-	-rrb-	-RRB-
is	be	VBZ
a	a	DT
field	field	NN
of	of	IN
computer	computer	NN
science	science	NN
,	,	,
artificial	artificial	JJ
intelligence	intelligence	NN
,	,	,
and	and	CC
linguistics	linguistics	NNS
concerned	concern	VBN
with	with	IN
the	the	DT
interactions	interaction	NNS
between	between	IN
computers	computer	NNS
and	and	CC
human	human	JJ
-LRB-	-lrb-	-LRB-
natural	natural	JJ
-RRB-	-rrb-	-RRB-
languages	language	NNS
.	.	.


## 55. Named entity extraction

A person name is a token whose NER tag is "PERSON".

NER stands for named entity recognition.
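
Since NER tags are assigned per token, a full name such as "Alan Turing" comes out as two separate tokens. One possible way to join them is to group consecutive tokens that share the `PERSON` tag. This is a sketch on hand-written `(word, tag)` pairs; `person_names` is a hypothetical helper.

```python
from itertools import groupby

def person_names(tagged):
    # groupby yields runs of consecutive tokens that share the same NER tag
    names = []
    for tag, group in groupby(tagged, key=lambda pair: pair[1]):
        if tag == 'PERSON':
            names.append(' '.join(word for word, _ in group))
    return names

# Illustrative data, not taken from the CoreNLP output
tagged = [('In', 'O'), ('1950', 'DATE'), (',', 'O'),
          ('Alan', 'PERSON'), ('Turing', 'PERSON'), ('published', 'O')]
print(person_names(tagged))
```

Two different people mentioned back to back with no token in between would be merged into one name, so this is only a heuristic.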

https://ja.wikipedia.org/wiki/固有表現抽出

In [11]:
for token in root.findall('./document/sentences/sentence/tokens/token'):
    if token.find('NER').text == "PERSON":
        print(token.find('word').text)

Alan
Turing
Joseph
Weizenbaum
MARGIE
Schank
Wilensky
Meehan
Lehnert
Carbonell
Lehnert
Racter
Jabberwacky
Moore


## 56. Coreference resolution

Also called coreference analysis.

https://en.wikipedia.org/wiki/Coreference

>In linguistics, coreference, sometimes written co-reference, occurs when two or more expressions in a text refer to the same person or thing; they have the same referent, e.g. Bill said he would come; the proper noun Bill and the pronoun he refer to the same person, namely to Bill.

I see. So where does this show up in the XML?

In [12]:
!cat for53.txt.xml | grep -A 40 'coreference' | head -n40

    <coreference>
      <coreference>
        <mention representative="true">
          <sentence>1</sentence>
          <start>24</start>
          <end>25</end>
          <head>24</head>
          <text>computers</text>
        </mention>
        <mention>
          <sentence>3</sentence>
          <start>14</start>
          <end>15</end>
          <head>14</head>
          <text>computers</text>
        </mention>
      </coreference>
      <coreference>
        <mention representative="true">
          <sentence>4</sentence>
          <start>4</start>
          <end>5</end>
          <head>4</head>
          <text>NLP</text>
        </mention>
        <mention>
          <sentence>17</sentence>
          <start>7</start>
          <end>8</end>
          <head>7</head>
          <text>NLP</text>
        </mention>
        <mention>
          <sentence>21</sentence>
          <start>15</start>
          <end>16</e

Each inner `coreference` tag appears to group several `mention` tags that refer to the same thing, and the representative mention is marked with the attribute `representative="true"`. With that in mind:
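
The substitution itself can be sketched on a plain token list. This is one possible approach, not the notebook's: `replace_mentions` is a hypothetical helper that takes 0-based, end-exclusive spans, assumes the spans do not overlap, and replaces the whole span rather than only its first token.

```python
def replace_mentions(tokens, mentions):
    """Replace each (start, end, representative) span with 'representative(mention)'."""
    out = list(tokens)
    # Work from right to left so that replacing a span never shifts
    # the indices of the spans still waiting to be processed.
    for start, end, rep in sorted(mentions, reverse=True):
        mention_text = ' '.join(out[start:end])
        out[start:end] = ['{}({})'.format(rep, mention_text)]
    return out

# Illustrative data: the second "Turing" refers back to "Alan Turing"
tokens = ['Alan', 'Turing', 'proposed', 'the', 'Turing', 'test', '.']
print(' '.join(replace_mentions(tokens, [(4, 5, 'Alan Turing')])))
```

Because slice assignment changes the list length, token positions no longer line up with the CoreNLP ids afterwards, so this variant suits display more than further index-based processing.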

In [13]:
words_in_xml = []
    
for sentence in root.findall('./document/sentences/sentence'):
    tokens = []
    for token in sentence.findall('./tokens/token'):
        tokens.append(token.find('word').text)
    words_in_xml.append(tokens[:])

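def cor_sub(mention,rep):
    """Overwrite the first token of `mention` with 'representative(mention)'."""
    sub = "{}({})".format(rep,mention.find('text').text)
    # CoreNLP sentence and token ids are 1-based; convert them to 0-based list indices
    i = int(mention.find('sentence').text) - 1
    start = int(mention.find('start').text) - 1
    # Only the first token is replaced, so token positions stay aligned with the
    # CoreNLP ids. The remaining tokens of a multi-word mention are left in place,
    # which is why they reappear after the closing parenthesis in the output.
    words_in_xml[i][start]=sub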
    
for coreference in root.findall('./document/coreference/coreference'):
    for mention in coreference.findall('mention'):
        if 'representative' in mention.attrib.keys():
            rep = mention.find('text').text
            continue
        cor_sub(mention,rep)

for sentence in words_in_xml:
    for word in sentence:
        print(word,end=' ')
    print()

Natural language processing -LRB- NLP -RRB- is a field of computer science , artificial intelligence , and linguistics concerned with the interactions between computers and human -LRB- natural -RRB- languages . 
As such , NLP is related to the area of humani-computer interaction . 
Many challenges in NLP involve natural language understanding , that is , enabling computers(computers) to derive meaning from human or natural language input , and others involve natural language generation . 
The history of NLP generally starts in the 1950s , although work can be found from earlier periods . 
In 1950 , Alan Turing published an article titled `` Computing Machinery and Intelligence '' which proposed what is now called the Alan Turing(Turing) test as a criterion of intelligence . 
The Georgetown experiment in 1954 involved fully automatic translation of more than sixty Russian sentences into English . 
The authors claimed that within three or five years , a solved problem(machine translation

Increasingly , however , research has focused on statistical models , which make soft , probabilistic decisions based on attaching real-valued weights to the features making up the input data(statistical models) models , which make soft , probabilistic decisions(soft , probabilistic decisions) , probabilistic decisions based on attaching real-valued weights(real-valued weights) weights to each input feature . 
Such models have the advantage that Some of the earliest-used algorithms , such as decision trees(they) can express the relative certainty of many different possible answers rather than only one , producing more reliable results when such a model is included as a component of a larger system . 
Systems based on machine-learning algorithms have many advantages over hand-produced rules : The learning procedures used during machine learning , especially statistical machine learning(machine learning) learning automatically focus on the most common cases , whereas when writing hard if

## 57. Dependency parsing

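A set of dependency relations is essentially a directed graph whose edges carry an attribute (the type).

An adjacency matrix can be adapted to hold it.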

||idx=0|idx=1|idx=2|idx=3|...|
|-----|---|-----|-----|-----|---|
|idx=0|''|''|''|root|...|
|idx=1|''|''|''|''|...|
|idx=2|''|''|''|''|...|
|idx=3|''|''|nsubj|''|...|

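In this table, for example, rows are governors and columns are dependents, and each cell holds the relation type. The word at idx=3 is the root, and its child with relation nsubj is the word at idx=2.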
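
The matrix idea can be tried on a three-word sentence without CoreNLP or pydot. In this sketch the edges are hand-written, `build_matrix` and `to_dot` are hypothetical helpers, and node labels use the 1-based idx with an explicit `ROOT` label at index 0.

```python
def build_matrix(n_tokens, edges):
    # Row = governor idx, column = dependent idx; idx 0 is the virtual ROOT node
    size = n_tokens + 1
    matrix = [[None] * size for _ in range(size)]
    for governor, dependent, rel in edges:
        matrix[governor][dependent] = rel
    return matrix

def to_dot(words, matrix):
    # Emit one DOT edge per non-empty cell
    labels = ['ROOT'] + list(words)
    lines = ['digraph deps {']
    for g, row in enumerate(matrix):
        for d, rel in enumerate(row):
            if rel is not None:
                lines.append('  "{}:{}" -> "{}:{}" [label="{}"];'.format(
                    g, labels[g], d, labels[d], rel))
    lines.append('}')
    return '\n'.join(lines)

words = ['I', 'like', 'NLP']
edges = [(0, 2, 'root'), (2, 1, 'nsubj'), (2, 3, 'dobj')]  # illustrative edges
matrix = build_matrix(len(words), edges)
print(to_dot(words, matrix))
```

The matrix is mostly empty for real sentences, so a list of `(governor, dependent, type)` tuples would do as well. The matrix form mainly makes it easy to look up all dependents of one governor by row.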

In [14]:
import pydot

for n,dependencies in enumerate(root.findall('.//dependencies[@type="collapsed-dependencies"]')):
    matrix = [[None for i in range(len(words_in_xml[n])+1)] for j in range(len(words_in_xml[n])+1)]
    for dep in dependencies:
        i = int(dep.find('governor').attrib['idx'])
        j = int(dep.find('dependent').attrib['idx'])
        matrix[i][j]=dep.attrib['type']
    s = 'digraph graphname{ '
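    # Row/column 0 is the virtual ROOT node, so token k lives at words_in_xml[n][k-1].
    # For ROOT itself l-1 is -1, which Python resolves to the last token; that is
    # why the root edge is printed as "-1:." in the output.
    for l,line in enumerate(matrix):
        for k,di in enumerate(line):
            if di is not None:
                #s += ' "{}" -> "{}" [label="{}"] '.format(words_in_xml[n][l-1],words_in_xml[n][k-1],di)
                s += ' "{}:{}" -> "{}:{}" [label="{}"] '.format(l-1,words_in_xml[n][l-1],k-1,words_in_xml[n][k-1],di)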
    s += '}'
    print(s)
    g = pydot.graph_from_dot_data(s)
    g[0].write_png('57_out.png')
    break

digraph graphname{  "-1:." -> "8:field" [label="root"]  "2:processing" -> "0:Natural" [label="amod"]  "2:processing" -> "1:language" [label="compound"]  "2:processing" -> "4:NLP" [label="appos"]  "4:NLP" -> "3:-LRB-" [label="punct"]  "4:NLP" -> "5:-RRB-" [label="punct"]  "8:field" -> "2:processing" [label="nsubj"]  "8:field" -> "6:is" [label="cop"]  "8:field" -> "7:a" [label="det"]  "8:field" -> "11:science" [label="nmod:of"]  "8:field" -> "12:," [label="punct"]  "8:field" -> "14:intelligence" [label="conj:and"]  "8:field" -> "15:," [label="punct"]  "8:field" -> "16:and" [label="cc"]  "8:field" -> "17:linguistics" [label="conj:and"]  "8:field" -> "30:." [label="punct"]  "11:science" -> "9:of" [label="case"]  "11:science" -> "10:computer" [label="compound"]  "14:intelligence" -> "13:artificial" [label="amod"]  "17:linguistics" -> "18:concerned" [label="acl"]  "18:concerned" -> "21:interactions" [label="nmod:with"]  "21:interactions" -> "19:with" [label="case"]  "21:interactions" -> "20:

![title](57_out.png)

## 58. Tuple extraction

In [15]:
for n,dependencies in enumerate(root.findall('.//dependencies[@type="collapsed-dependencies"]')):
    matrix = [[None for i in range(len(words_in_xml[n])+1)] for j in range(len(words_in_xml[n])+1)]
    for dep in dependencies:
        i = int(dep.find('governor').attrib['idx'])
        j = int(dep.find('dependent').attrib['idx'])
        matrix[i][j]=dep.attrib['type']
    for l,line in enumerate(matrix):
        if 'nsubj' in line and 'dobj' in line:
            v=words_in_xml[n][l-1]
            s=words_in_xml[n][line.index('nsubj')-1]
            o=words_in_xml[n][line.index('dobj')-1]
            print("{}\t{}\t{}".format(s,v,o))
    

understanding	enabling	computers(computers)
others	involve	generation
Turing	published	article
experiment	involved	translation
ELIZA	provided	interaction
patient	exceeded	base
ELIZA(ELIZA)	provide	response
which	structured	information
underpinnings	discouraged	sort
that	underlies	approach
Some	produced	systems
which	make	decisions
systems	rely	which
that	contains	errors
implementations	involved	coding
algorithms	take	set
Some	produced	systems
which	make	decisions
models	have	advantage
Some of the earliest-used algorithms , such as decision trees(they)	express	certainty
Systems	have	advantages
Automatic	make	use
that	make	decisions


## 59. Parsing S-expressions

https://ja.wikipedia.org/wiki/S式

> An S-expression is a formal notation for binary trees or list structures that was introduced in Lisp and is used mainly in Lisp.
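
As a warm-up, an S-expression can be read into nested Python lists with a small recursive reader. This is a minimal sketch that assumes well-formed input with balanced parentheses; `parse_sexp` is a hypothetical helper.

```python
import re

def parse_sexp(text):
    # Tokens are "(", ")" or any run of characters without whitespace or parentheses
    tokens = re.findall(r'\(|\)|[^\s()]+', text)
    pos = 0

    def read():
        nonlocal pos
        tok = tokens[pos]
        pos += 1
        if tok == '(':
            items = []
            while tokens[pos] != ')':
                items.append(read())
            pos += 1  # consume the closing ")"
            return items
        return tok

    return read()

tree = parse_sexp('(NP (DT a) (NN field))')
print(tree)
```

In the resulting lists the first element of each sublist is the phrase or POS label and the rest are its children.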


In [16]:
# Look at one parse as a trial
parse = root.find('.//parse')
print(parse.text)

(ROOT (S (NP (NP (JJ Natural) (NN language) (NN processing)) (PRN (-LRB- -LRB-) (NP (NN NLP)) (-RRB- -RRB-))) (VP (VBZ is) (NP (NP (NP (DT a) (NN field)) (PP (IN of) (NP (NN computer) (NN science)))) (, ,) (NP (JJ artificial) (NN intelligence)) (, ,) (CC and) (NP (NP (NNS linguistics)) (VP (VBN concerned) (PP (IN with) (NP (NP (DT the) (NNS interactions)) (PP (IN between) (NP (NP (NNS computers)) (CC and) (NP (JJ human) (-LRB- -LRB-) (JJ natural) (-RRB- -RRB-) (NNS languages)))))))))) (. .))) 


The original sentence, for comparison:

Natural language processing (NLP) is a field of computer science, artificial intelligence, and linguistics concerned with the interactions between computers and human (natural) languages.

In [17]:
class Node:
    def __init__(self,type):
        self.type = type
        self.childs = []
        self.parent = None
        self.surface = ""

In [18]:
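matches = re.findall(r'\(.*?(?=\s)|\S*?\)',parse.text)

# Note: from here on `root` is the parse tree, no longer the XML root
root = Node('ROOT')
parent = root
for match in matches[1:]:
    if match[0] == '(': # opening token such as "(NP": add a child and descend
        newnode = Node(match[1:])
        newnode.parent = parent
        parent.childs.append(newnode)
        parent = newnode
    else: # closing token such as "Natural)" or ")": record the word, then ascend
        if len(match) > 1:
            parent.surface = match[:-1]
        parent = parent.parent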

In [19]:
def print_NP(node,flag):
    if node.surface != "" and flag:
        print(node.surface,end=' ')
    nflag = flag
    if node.type == 'NP':
        nflag = True
    for c in node.childs:
        print_NP(c,nflag)
    if not(flag) and nflag:
        print()

print_NP(root,False)

Natural language processing -LRB- NLP -RRB- 
a field of computer science , artificial intelligence , and linguistics concerned with the interactions between computers and human -LRB- natural -RRB- languages 
