CONLL2000のデータ (https://www.clips.uantwerpen.be/conll2000/chunking/train.txt.gz)をダウンロードし，Baseline (最頻の品詞を割りあてる方法)を実装し，POS taggingの精度を評価。

精度を計算する際には，交差検定法 (cross-validation)を使い，品詞の頻度を計算するための訓練データと精度を計算するための評価データを分けてください。

評価結果について各自考察してレポートにまとめる。

In [44]:
import os
import pandas as pd
import sklearn
from collections import Counter
from sklearn import preprocessing
from sklearn import cross_validation
from sklearn.model_selection import cross_val_score
from sklearn.base import BaseEstimator
from tqdm import tqdm

In [None]:
#loading data
os.system("wget https://www.clips.uantwerpen.be/conll2000/chunking/train.txt.gz")
os.system("gunzip train.txt.gz")

In [45]:
df = pd.read_csv("train.txt",sep=" ",names= ["word","POS","boundary"])
df.head()

Unnamed: 0,word,POS,boundary
0,Confidence,NN,B-NP
1,in,IN,B-PP
2,the,DT,B-NP
3,pound,NN,I-NP
4,is,VBZ,B-VP


In [46]:
#The boundary columns will not be used
df = df.drop(columns="boundary")
print(df.shape)
df.head()

(211727, 2)


Unnamed: 0,word,POS
0,Confidence,NN
1,in,IN
2,the,DT
3,pound,NN
4,is,VBZ


### Baseline from the textbook
#### P.207 of "Language processing with Perl and Prolog"

* Tag each word with its most frequent part of speech
* For words that are not found in the training data will be POS tagged as the most frequent POS tag in the corpus

In [47]:
#To use the crossvalidation score function in sklearn
#we need to inherit from BaseEstimator
class POS(BaseEstimator):
    def __init__(self):
        pass
    def fit(self, x, y):
        self.d = {}
        new_df = pd.DataFrame({"word":x,"POS":y})
        for word,tag in tqdm(zip(x,y)):
            self.d[word] = list(Counter(new_df[new_df["word"] == word]["POS"]))[0]
        return self 

    def predict(self, word):
        try:
            return self.d[word]
        except:
            return list(Counter(self.d.values()))[0]
    def score(self, x, y):
        x = x.tolist()
        y = y.tolist()
        total = 0
        for i in tqdm(range(len(y))):
            if self.predict(x[i]) == y[i]:
                total += 1
        return total/len(y)

In [48]:
# Simple 5-cross-validation
# This takes a lot of time ...
POS_tagger = POS()
scores = cross_validation.cross_val_score(POS_tagger,df["word"],df["POS"],cv=5)
print("Validation scores : ",scores)
print()
print("Average score", sum(scores)/5)

169381it [45:18, 62.31it/s]
100%|██████████| 42346/42346 [00:03<00:00, 11341.72it/s]
169381it [53:37, 52.65it/s]
100%|██████████| 42346/42346 [00:03<00:00, 11109.32it/s]
169382it [49:31, 57.01it/s]
100%|██████████| 42345/42345 [00:03<00:00, 10920.61it/s]
169382it [47:40, 59.22it/s]
100%|██████████| 42345/42345 [00:04<00:00, 9888.26it/s] 
169382it [47:48, 59.05it/s]
100%|██████████| 42345/42345 [00:03<00:00, 13324.22it/s]


Validation scores :  [0.87469891 0.88754546 0.88317393 0.88770811 0.89087259]

Average score 0.8847998004752966


### Result analysis

#### First the experiment was conducted using only 1000 data. The abvearge accuracy for this experiment was 60~70% (This can be easily be experimented by changing df \["word"\] to df\["word"\]\[:1000\] and same for the df\["POS"\]).

#### But, after conducting the experiment using the whole data (211727) it can be seen that the mean accuray has changed drastically. (88.4%)

#### Increasing the corpus size, increased the accuracy.

#### Another fact that I was impressed is that this simple model only using the frequency of POS tags for each word, scored accuracy of (88.4%) .