<a href="https://colab.research.google.com/github/tktkbohshi/m1_study_nlp100practices/blob/main/M1_NLP_100practices.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Imports

In [1]:
import pandas as pd
import numpy as np
from tqdm import tqdm
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.metrics import confusion_matrix
from sklearn.metrics import precision_score, recall_score, f1_score
from sklearn.svm import SVC
import plotly.graph_objects as go
import optuna



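# Chapter 6: Machine Learning
In this chapter, we use the News Aggregator Data Set published by Fabio Gasparetti to classify news headlines into the categories "business", "science and technology", "entertainment", and "health" (category classification).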

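## 50. Obtaining and formatting the data
Download the News Aggregator Data Set and create the training data (train.txt), validation data (valid.txt), and test data (test.txt) as follows.
1. Unzip the downloaded zip file and read the explanation in readme.txt.
1. Extract only the examples (articles) whose publisher is "Reuters", "Huffington Post", "Businessweek", "Contactmusic.com", or "Daily Mail".
1. Shuffle the extracted examples randomly.
1. Use 80% of the extracted examples as training data and 10% each as validation data and test data, and save them as train.txt, valid.txt, and test.txt. Write one example per line, with the category name and the headline separated by a tab (these files are reused later in problem 70).

After creating the training data and test data, check the number of examples in each category.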

### Dataset details
+ ID		Numeric ID
+ TITLE		News title 
+ URL		Url
+ PUBLISHER	Publisher name
+ CATEGORY	News category (b = business, t = science and technology, e = entertainment, m = health)
+ STORY		Alphanumeric ID of the cluster that includes news about the same story
+ HOSTNAME	Url hostname
+ TIMESTAMP 	Approximate time the news was published, as the number of milliseconds since the epoch 00:00:00 GMT, January 1, 1970

In [2]:
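# The CSV has no header row, so the column names are supplied explicitly.
columns = ["ID","TITLE","URL","PUBLISHER","CATEGORY","STORY","HOSTNAME","TIMESTAMP"]
df_publisher = pd.read_csv("./data/NewsAggregatorDataset/newsCorpora.csv", names=columns, sep="\t")
# Keep only the five publishers named in the task.
df_publisher = df_publisher[df_publisher["PUBLISHER"].isin(["Reuters", "Huffington Post", "Businessweek", "Contactmusic.com", "Daily Mail"])]
# Shuffle the rows; pass random_state to sample() if a reproducible split is needed.
df_publisher = df_publisher.sample(frac=1).reset_index(drop=True)
# Lowercase the headlines.
df_publisher["TITLE"] = df_publisher["TITLE"].str.lower()
df_publisher.head(5)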

Unnamed: 0,ID,TITLE,URL,PUBLISHER,CATEGORY,STORY,HOSTNAME,TIMESTAMP
0,5258,trading firm virtu financial plans to raise up...,http://in.reuters.com/article/2014/03/11/virtu...,Reuters,b,dHF--A9t5KnFT1MmUAIsHMMu25ZEM,in.reuters.com,1394580780148
1,119643,us airways social media manager speaks out abo...,http://www.dailymail.co.uk/news/article-260536...,Daily Mail,e,dYvnirQ5sLRyjpMNgiKM0uOmgttPM,www.dailymail.co.uk,1397593900286
2,246861,"afte being attached since 2006, director edgar...",http://www.contactmusic.com/article/edgar-wrig...,Contactmusic.com,e,d8zMXjIiRYvwOUMgNFCJRWafcyqIM,www.contactmusic.com,1400935519181
3,281384,"fda should fight products, not food",http://www.huffingtonpost.com/manuel-villacort...,Huffington Post,m,dnjw8SrRqgK2WMMsKbV6Rb46rWZgM,www.huffingtonpost.com,1402690276802
4,230619,"lenovo aims to sell 100 mln smartphones, table...",http://in.reuters.com/article/2014/05/21/idINB...,Reuters,b,dy4ACBFTk1JlJeME2GDvD5jerrUhM,in.reuters.com,1400684221435


In [3]:
# Split the shuffled rows 80% / 10% / 10%.
n_train = int(len(df_publisher) * 0.8)
n_valid = int(len(df_publisher) * 0.9)
df_train = df_publisher[:n_train]
df_valid = df_publisher[n_train:n_valid]
df_test = df_publisher[n_valid:]
# Write one example per line as "category<TAB>headline", without header or index.
for name, df in [("train", df_train), ("valid", df_valid), ("test", df_test)]:
    df[["CATEGORY", "TITLE"]].to_csv(f"./data/outputs/{name}.txt", sep="\t", index=False, header=False)
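
The task also asks for the number of examples per category, which the cell above does not compute. A minimal sketch of the split, the two-column output format, and the per-category counts on a toy frame; the `toy` data, the `split_80_10_10` helper, and the fixed `seed` are hypothetical and not part of the notebook:

```python
import pandas as pd

# Hypothetical toy data standing in for the filtered news corpus.
toy = pd.DataFrame({
    "CATEGORY": ["b", "e", "b", "t", "m", "e", "b", "e", "t", "b"],
    "TITLE": [f"headline {i}" for i in range(10)],
})

def split_80_10_10(df, seed=0):
    """Shuffle with a fixed seed, then cut into 80% / 10% / 10%."""
    shuffled = df.sample(frac=1, random_state=seed).reset_index(drop=True)
    n_train = int(len(shuffled) * 0.8)
    n_valid = int(len(shuffled) * 0.9)
    return shuffled[:n_train], shuffled[n_train:n_valid], shuffled[n_valid:]

train, valid, test = split_80_10_10(toy)

# One example per line: category <TAB> headline, no header and no index.
# Without a path, to_csv returns the text instead of writing a file.
train_tsv = train[["CATEGORY", "TITLE"]].to_csv(sep="\t", index=False, header=False)

# Number of examples per category in the training and test splits.
print(train["CATEGORY"].value_counts())
print(test["CATEGORY"].value_counts())
```

On the real data, the same `value_counts()` call on `df_train["CATEGORY"]` and `df_test["CATEGORY"]` gives the counts the task asks for.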

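## 51. Feature extraction
Extract features from the training data, validation data, and test data, and save them as train.feature.txt, valid.feature.txt, and test.feature.txt.
Design whatever features seem useful for category classification. Converting the headline into a word sequence is the minimal baseline.

tf-idf = tf × idf = (how often a term occurs in a document) × (how rare the term is across documents)

$tf=\frac{\text{occurrences of term X in document A}}{\text{total occurrences of all terms in document A}}$

$idf=\log(\frac{\text{total number of documents}}{\text{number of documents containing term X}})$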
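
The two formulas can be checked by hand on a tiny corpus. The sketch below implements them literally; the three-headline corpus and the helper names `tf`, `idf`, and `tfidf` are made up for illustration:

```python
import math
from collections import Counter

# Hypothetical three-headline corpus, already lowercased and tokenized.
docs = [
    ["fed", "raises", "rates"],
    ["fed", "holds", "rates", "steady"],
    ["ebola", "cases", "rise"],
]

def tf(term, doc):
    """Occurrences of `term` in `doc` divided by the total number of tokens."""
    return Counter(doc)[term] / len(doc)

def idf(term, corpus):
    """log(total documents / documents containing `term`)."""
    n_containing = sum(1 for doc in corpus if term in doc)
    return math.log(len(corpus) / n_containing)

def tfidf(term, doc, corpus):
    return tf(term, doc) * idf(term, corpus)

fed_score = tfidf("fed", docs[0], docs)      # "fed" appears in 2 of 3 documents
ebola_score = tfidf("ebola", docs[2], docs)  # "ebola" appears in 1 of 3 documents
print(fed_score, ebola_score)
```

The rarer term gets the larger score. Note that `TfidfVectorizer` by default uses a smoothed variant of idf and L2-normalizes each row, so its values differ from this textbook form.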



In [4]:
tfidf_vec = TfidfVectorizer()
X_train = tfidf_vec.fit_transform(df_train["TITLE"])
X_test = tfidf_vec.transform(df_test["TITLE"])
X_valid = tfidf_vec.transform(df_valid["TITLE"])

In [11]:
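# X_* are SciPy sparse matrices; convert them to sparse DataFrames before writing (the CSV files are large).
pd.DataFrame.sparse.from_spmatrix(X_train).to_csv("./data/outputs/train.feature.txt")
pd.DataFrame.sparse.from_spmatrix(X_test).to_csv("./data/outputs/test.feature.txt")
pd.DataFrame.sparse.from_spmatrix(X_valid).to_csv("./data/outputs/valid.feature.txt")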

## 52. Training
Train a logistic regression model on the training data built in problem 51.

In [207]:
model = LogisticRegression(random_state=123, max_iter=10000)
model.fit(X_train, df_train["CATEGORY"])

LogisticRegression(max_iter=10000, random_state=123)

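## 53. Prediction
Using the logistic regression model trained in problem 52, implement a program that computes the category of a given headline and its predicted probability.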

In [208]:
pred_train = model.predict(X_train)
pred_test = model.predict(X_test)
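
The cell above returns only the predicted labels, while the task also asks for the predicted probability. A minimal, self-contained sketch using `predict_proba`; the miniature training set and the `predict_with_proba` helper are hypothetical stand-ins for the notebook's `tfidf_vec` and `model`:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Hypothetical miniature training set (headline, category).
titles = [
    "fed raises interest rates", "stocks fall as bank shares slide",
    "kardashian stars in new film", "miley cyrus releases movie soundtrack",
    "ebola cases rise, study says", "fda approves new cancer drug",
    "google unveils new mobile phone", "apple and microsoft update software",
]
categories = ["b", "b", "e", "e", "m", "m", "t", "t"]

vec = TfidfVectorizer()
clf = LogisticRegression(max_iter=1000).fit(vec.fit_transform(titles), categories)

def predict_with_proba(title):
    """Return (predicted category, {category: probability}) for one headline."""
    x = vec.transform([title.lower()])
    proba = clf.predict_proba(x)[0]          # one probability per class
    by_class = dict(zip(clf.classes_, proba))
    return clf.predict(x)[0], by_class

label, proba = predict_with_proba("Fed keeps rates steady")
print(label, proba)
```

With the notebook's objects, the equivalent call would be `model.predict_proba(tfidf_vec.transform([title]))`, whose columns follow `model.classes_`.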

## 54. Measuring accuracy
Measure the accuracy of the logistic regression model trained in problem 52 on the training data and the test data.

Accuracy = (TP+TN)/(TP+TN+FP+FN)

In [209]:
accuracy_train = accuracy_score(df_train["CATEGORY"], pred_train)
accuracy_test = accuracy_score(df_test["CATEGORY"], pred_test)
accuracy_train, accuracy_test

(0.9468703148425787, 0.8958020989505248)

## 55. Creating the confusion matrix
Create the confusion matrix of the logistic regression model trained in problem 52 on the training data and the test data.

In [210]:
labels = df_train["CATEGORY"].unique()
labels

array(['b', 'm', 'e', 't'], dtype=object)

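Rows (index) are the actual labels and columns are the predicted labels. Because `labels` is not passed to `confusion_matrix`, both axes follow the sorted label order (b, e, m, t), not the order of `labels` shown above.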


In [211]:
confusion_matrix(df_train["CATEGORY"], pred_train)

array([[4416,   58,    4,   34],
       [  16, 4157,    0,    4],
       [  77,  116,  534,    3],
       [ 144,  110,    1,  998]], dtype=int64)
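
To make the axis order explicit, `confusion_matrix` accepts a `labels` argument. The sketch below uses hypothetical gold labels and predictions, and also shows how accuracy, precision, and recall can be read off the matrix:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical gold labels and predictions.
y_true = ["b", "b", "e", "e", "m", "m", "t", "t", "b", "e"]
y_pred = ["b", "e", "e", "e", "m", "b", "t", "t", "b", "e"]
order = ["b", "e", "m", "t"]  # explicit row/column order

# Rows are actual labels, columns are predicted labels.
cm = confusion_matrix(y_true, y_pred, labels=order)

tp = np.diag(cm)                  # correct predictions per class
precision = tp / cm.sum(axis=0)   # column sums = TP + FP
recall = tp / cm.sum(axis=1)      # row sums = TP + FN
accuracy = tp.sum() / cm.sum()
print(cm)
print(precision, recall, accuracy)
```

The same call with `df_test["CATEGORY"]` and `pred_test` would give the test-data matrix that the task also requests.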

F1 = 2*(Precision*Recall)/(Precision+Recall)

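## 56. Measuring precision, recall, and F1 score
Measure the precision, recall, and F1 score of the logistic regression model trained in problem 52 on the test data. Compute precision, recall, and F1 score for each category, and aggregate the per-category results with the micro-average and the macro-average.

Precision = TP/(TP+FP)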

In [212]:
precision = precision_score(df_test["CATEGORY"], pred_test, average=None, labels=labels)
precision

array([0.90940767, 0.95555556, 0.88529887, 0.85416667])

Recall = TP/(TP+FN)

In [213]:
recall = recall_score(df_test["CATEGORY"], pred_test, average=None, labels=labels)
recall

array([0.93214286, 0.5       , 0.97857143, 0.640625  ])

In [214]:
f1 = f1_score(df_test["CATEGORY"], pred_test, average=None, labels=labels)
f1

array([0.92063492, 0.65648855, 0.92960136, 0.73214286])

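- The macro-average computes the metric for each class separately and then averages the per-class values.
- The micro-average pools the counts from all classes and computes the metric once.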
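
A small worked example of the difference, using hypothetical imbalanced labels. It also shows why the micro-averaged precision, recall, and F1 in this section all coincide with the test accuracy from problem 54 (0.895802): in single-label multiclass classification, pooling the counts makes the micro-average equal to accuracy.

```python
from sklearn.metrics import accuracy_score, precision_score

# Hypothetical imbalanced labels: six "b" examples, two "m" examples.
y_true = ["b", "b", "b", "b", "b", "b", "m", "m"]
y_pred = ["b", "b", "b", "b", "b", "b", "b", "m"]

per_class = precision_score(y_true, y_pred, average=None, labels=["b", "m"])
macro = precision_score(y_true, y_pred, average="macro")  # mean of per-class values
micro = precision_score(y_true, y_pred, average="micro")  # pooled TP / (TP + FP)
accuracy = accuracy_score(y_true, y_pred)

print(per_class, macro, micro, accuracy)
```

The macro-average weights the small class "m" as heavily as "b", whereas the micro-average is dominated by the frequent class.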

In [215]:
# Per-class scores first, then one row each for the micro- and macro-average.
df_eval = pd.DataFrame({"Precision":precision,"Recall":recall,"F1":f1},index=labels)
df_eval.loc["micro avg"] = [
    precision_score(df_test["CATEGORY"], pred_test, average="micro", labels=labels),
    recall_score(df_test["CATEGORY"], pred_test, average="micro", labels=labels),
    f1_score(df_test["CATEGORY"], pred_test, average="micro", labels=labels)
    ]
df_eval.loc["macro avg"] = [
    precision_score(df_test["CATEGORY"], pred_test, average="macro", labels=labels),
    recall_score(df_test["CATEGORY"], pred_test, average="macro", labels=labels),
    f1_score(df_test["CATEGORY"], pred_test, average="macro", labels=labels)
    ]
df_eval

Unnamed: 0,Precision,Recall,F1
b,0.909408,0.932143,0.920635
m,0.955556,0.5,0.656489
e,0.885299,0.978571,0.929601
t,0.854167,0.640625,0.732143
micro avg,0.895802,0.895802,0.895802
macro avg,0.901107,0.762835,0.809717


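## 57. Inspecting feature weights
In the logistic regression model trained in problem 52, check the top 10 features with the highest weights and the top 10 features with the lowest weights.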

In [216]:
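# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() needs scikit-learn >= 1.0.
feature_names = tfidf_vec.get_feature_names_out()
df_X_train = pd.DataFrame(X_train.toarray(), columns=feature_names)
df_X_test = pd.DataFrame(X_test.toarray(), columns=feature_names)
df_X_train.head(3)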

Unnamed: 0,05,07,08,09,0ff,0ut,10,100,1000,10000,...,zooey,zoosk,zpfa3mqti7qdrpfhqwjm,zuckerberg,zynga,zâ,œf,œlousyâ,œpiece,œwaist
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [217]:
df_weights = pd.DataFrame(model.coef_, index=model.classes_, columns=df_X_train.columns).T
df_weights

Unnamed: 0,b,e,m,t
05,-0.004721,0.008754,-0.001450,-0.002584
07,0.068883,-0.042181,-0.012544,-0.014158
08,0.018594,-0.009093,-0.004317,-0.005184
09,0.029917,-0.014801,-0.006149,-0.008967
0ff,-0.028109,0.061052,-0.017585,-0.015358
...,...,...,...,...
zâ,-0.008806,0.037665,-0.010637,-0.018221
œf,-0.070255,0.122389,-0.017154,-0.034980
œlousyâ,-0.062596,0.118413,-0.023575,-0.032242
œpiece,-0.081245,0.115131,-0.013374,-0.020512


In [218]:
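# Start from empty frames indexed 0-9 (ranks), then merge in one word/weight column pair per label.
df_best10 = pd.DataFrame(index=range(10))
df_worst10 = pd.DataFrame(index=range(10))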
for label in labels:
  df_best10 = pd.merge(
    df_best10,
    pd.DataFrame(df_weights[label].sort_values(ascending=False).head(10).reset_index().set_axis([f"{label}_word",f"{label}_value"], axis=1)),
    left_index=True,
    right_index=True,
    how="outer"
  )
  df_worst10 = pd.merge(
    df_worst10,
    pd.DataFrame(df_weights[label].sort_values(ascending=True).head(10).reset_index().set_axis([f"{label}_word",f"{label}_value"], axis=1)),
    left_index=True,
    right_index=True,
    how="outer"
  )
df_best10

Unnamed: 0,b_word,b_value,m_word,m_value,e_word,e_value,t_word,t_value
0,china,3.486889,ebola,4.581722,kardashian,3.235733,google,5.528656
1,fed,3.483078,study,3.866656,her,2.86497,facebook,4.896001
2,bank,3.406272,cancer,3.838957,chris,2.688136,apple,4.748743
3,stocks,3.226849,fda,3.729969,star,2.57042,climate,3.952941
4,ecb,3.159029,drug,3.485057,she,2.554801,microsoft,3.885825
5,euro,2.966304,mers,3.057071,kim,2.497468,gm,3.363674
6,update,2.69794,health,2.576026,miley,2.478249,tesla,3.097616
7,oil,2.688688,cases,2.42967,cyrus,2.428558,nasa,2.791598
8,ukraine,2.509668,could,2.3082,film,2.37484,mobile,2.673299
9,dollar,2.47146,heart,2.264832,movie,2.323641,comcast,2.612757


In [219]:
df_worst10

Unnamed: 0,b_word,b_value,m_word,m_value,e_word,e_value,t_word,t_value
0,the,-2.108695,gm,-1.169854,update,-3.530433,stocks,-1.460971
1,and,-2.026373,facebook,-1.079749,us,-3.252529,fed,-1.144156
2,ebola,-1.948874,google,-1.053034,google,-2.804136,drug,-1.086358
3,she,-1.830033,apple,-0.971099,china,-2.314718,american,-1.056126
4,her,-1.818175,amazon,-0.958561,says,-2.310574,ecb,-1.023379
5,apple,-1.765809,deal,-0.892157,gm,-2.240274,cancer,-0.965979
6,microsoft,-1.71719,ceo,-0.889915,facebook,-2.184366,day,-0.956874
7,study,-1.677262,fed,-0.885209,study,-2.134899,her,-0.933353
8,google,-1.671486,billion,-0.847226,ceo,-2.127249,kardashian,-0.929507
9,facebook,-1.631886,climate,-0.832703,apple,-2.011835,ebola,-0.913903
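
The same ranking can be obtained without merging DataFrames by sorting each row of the coefficient matrix with `np.argsort`. A minimal sketch; the `classes`, `vocab`, and `coef` arrays are hypothetical stand-ins for `model.classes_`, the vectorizer's vocabulary, and `model.coef_`, and `top_k` is an illustrative helper:

```python
import numpy as np

# Hypothetical stand-ins for model.classes_, the vocabulary, and model.coef_.
classes = ["b", "e"]
vocab = np.array(["bank", "film", "fed", "star", "the"])
coef = np.array([
    [2.0, -1.5, 1.8, -0.9, -0.1],
    [-1.7, 2.2, -1.2, 1.9, 0.0],
])

def top_k(coef_row, vocab, k, largest=True):
    """Return the k (word, weight) pairs with the largest or smallest weights."""
    order = np.argsort(coef_row)                  # ascending by weight
    idx = order[::-1][:k] if largest else order[:k]
    return [(str(vocab[i]), float(coef_row[i])) for i in idx]

for cls, row in zip(classes, coef):
    print(cls, "highest:", top_k(row, vocab, 2), "lowest:", top_k(row, vocab, 2, largest=False))
```

With the trained model, `k=10` and one row of `model.coef_` per class would reproduce the two tables above.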


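## 58. Changing the regularization parameter
When training a logistic regression model, the degree of overfitting can be controlled by adjusting the regularization parameter. Train logistic regression models with different regularization parameters and compute the accuracy on the training data, validation data, and test data. Summarize the results in a graph with the regularization parameter on the horizontal axis and accuracy on the vertical axis.

Regularization adds a penalty on the weights $w$ to the loss; with an L1 penalty, for example:

$\min_{w} f_{loss}(w)+\lambda \sum_{i=1}^{n}|w_{i}|$

scikit-learn's `LogisticRegression` uses an L2 penalty by default and is controlled through `C`, the inverse of the regularization strength, so a larger `C` means weaker regularization.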
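
The effect of the penalty can be seen directly in the size of the learned weights. The sketch below fits the default (L2-penalized) `LogisticRegression` on synthetic data for three values of `C`; the dataset from `make_classification` and the chosen `C` values are assumptions for illustration only:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic binary classification data (not the news data).
X, y = make_classification(n_samples=200, n_features=20, random_state=0)

norms = {}
for C in [0.01, 1.0, 100.0]:
    clf = LogisticRegression(C=C, max_iter=10000).fit(X, y)
    norms[C] = float(np.linalg.norm(clf.coef_))  # overall size of the weight vector
    print(f"C={C:<6} |w|={norms[C]:.3f} train acc={clf.score(X, y):.3f}")
```

A small `C` penalizes the weights heavily and shrinks them toward zero; a large `C` lets them grow, which fits the training data more closely and can overfit.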

In [220]:
# Note: the "lambda" column stores C, the inverse of the regularization strength.
df_regularization = pd.DataFrame(columns=["lambda","train_accuracy","valid_accuracy","test_accuracy"])
for C in tqdm(np.arange(0.1,2.0,0.2)):
    model = LogisticRegression(random_state=123, max_iter=10000,C=C)
    model.fit(X_train, df_train["CATEGORY"])
    pred_train = model.predict(X_train)
    pred_valid = model.predict(X_valid)
    pred_test = model.predict(X_test)
    df_regularization.loc[C] = [
        C,
        accuracy_score(df_train["CATEGORY"], pred_train),
        accuracy_score(df_valid["CATEGORY"], pred_valid),
        accuracy_score(df_test["CATEGORY"], pred_test)
    ]
df_regularization = df_regularization.reset_index(drop=True)
df_regularization

100%|██████████| 10/10 [00:08<00:00,  1.14it/s]


Unnamed: 0,lambda,train_accuracy,valid_accuracy,test_accuracy
0,0.1,0.79048,0.787106,0.797601
1,0.3,0.869753,0.848576,0.852324
2,0.5,0.905079,0.868816,0.877811
3,0.7,0.926349,0.876312,0.889805
4,0.9,0.940499,0.881559,0.895802
5,1.1,0.951087,0.885307,0.89955
6,1.3,0.959052,0.889805,0.901799
7,1.5,0.965986,0.892054,0.904048
8,1.7,0.971608,0.892804,0.905547
9,1.9,0.976199,0.895802,0.907796


In [221]:
fig = go.Figure()
fig.add_trace(go.Scatter(x=df_regularization["lambda"], y=df_regularization["train_accuracy"], name="train_accuracy"))
fig.add_trace(go.Scatter(x=df_regularization["lambda"], y=df_regularization["valid_accuracy"], name="valid_accuracy"))
fig.add_trace(go.Scatter(x=df_regularization["lambda"], y=df_regularization["test_accuracy"], name="test_accuracy"))
# Label the axes as the task requires: regularization parameter vs. accuracy.
fig.update_layout(xaxis_title="C (inverse regularization strength)", yaxis_title="accuracy")
fig.show()

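## 59. Hyperparameter search
Train category classification models while varying the learning algorithm and its parameters. Find the algorithm and parameters that give the highest accuracy on the validation data, and report the accuracy on the test data with that algorithm and those parameters.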

### Logistic Regression

In [222]:
def objective_lg(trial):
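    # suggest_uniform is deprecated in recent Optuna releases; suggest_float samples uniformly from [low, high].
    l1_ratio = trial.suggest_float("l1_ratio", 0, 1)
    C = trial.suggest_float("C", 0.1, 2)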
    model = LogisticRegression(
        random_state=123, 
        max_iter=10000, 
        C=C,
        l1_ratio=l1_ratio, 
        penalty='elasticnet',
        solver='saga'
        )
    model.fit(X_train, df_train["CATEGORY"])
    pred_valid = model.predict(X_valid)
    accuracy_valid = accuracy_score(df_valid["CATEGORY"], pred_valid)
    return accuracy_valid

In [223]:
study = optuna.create_study(direction='maximize')
study.optimize(objective_lg, timeout=120)
trial = study.best_trial
trial

[I 2021-08-04 00:52:10,653] A new study created in memory with name: no-name-9bcf8c16-6c8d-4ff8-9c8e-5591880ec1d4
[I 2021-08-04 00:52:17,112] Trial 0 finished with value: 0.8793103448275862 and parameters: {'l1_ratio': 0.3052599590590047, 'C': 1.3115147477133606}. Best is trial 0 with value: 0.8793103448275862.
[I 2021-08-04 00:52:22,806] Trial 1 finished with value: 0.8748125937031485 and parameters: {'l1_ratio': 0.24522490924595697, 'C': 0.9491009016453744}. Best is trial 0 with value: 0.8793103448275862.
[I 2021-08-04 00:52:27,559] Trial 2 finished with value: 0.863568215892054 and parameters: {'l1_ratio': 0.963534397802527, 'C': 1.3515525314346295}. Best is trial 0 with value: 0.8793103448275862.
[I 2021-08-04 00:52:28,405] Trial 3 finished with value: 0.8103448275862069 and parameters: {'l1_ratio': 0.9543672213270494, 'C': 0.4314029295127988}. Best is trial 0 with value: 0.8793103448275862.
[I 2021-08-04 00:52:4

FrozenTrial(number=10, values=[0.8958020989505248], datetime_start=datetime.datetime(2021, 8, 4, 0, 52, 59, 260650), datetime_complete=datetime.datetime(2021, 8, 4, 0, 53, 29, 482679), params={'l1_ratio': 0.001219332161190051, 'C': 1.9018869495849327}, distributions={'l1_ratio': UniformDistribution(high=1.0, low=0.0), 'C': UniformDistribution(high=2.0, low=0.1)}, user_attrs={}, system_attrs={}, intermediate_values={}, trial_id=10, state=TrialState.COMPLETE, value=None)

In [224]:
l1_ratio = trial.params["l1_ratio"]
C = trial.params["C"]
model = LogisticRegression(
    random_state=123, 
    max_iter=10000, 
    C=C,
    l1_ratio=l1_ratio, 
    penalty='elasticnet',
    solver='saga'
    )
model.fit(X_train, df_train["CATEGORY"])

pred_train = model.predict(X_train)
pred_valid = model.predict(X_valid)
pred_test = model.predict(X_test)

accuracy_train = accuracy_score(df_train["CATEGORY"], pred_train)
accuracy_valid = accuracy_score(df_valid["CATEGORY"], pred_valid)
accuracy_test = accuracy_score(df_test["CATEGORY"], pred_test)
pd.DataFrame([
    accuracy_train,
    accuracy_valid,
    accuracy_test
],index=["train","valid","test"])

Unnamed: 0,0
train,0.976387
valid,0.895802
test,0.907796


### SVM

In [225]:
def objective_svm(trial):
    C = trial.suggest_float("C", 0.1, 2)  # suggest_uniform is deprecated in recent Optuna releases
    model = SVC(
        C=C,
        kernel="linear",
        random_state=None
    )
    model.fit(X_train, df_train["CATEGORY"])
    pred_valid = model.predict(X_valid)
    accuracy_valid = accuracy_score(df_valid["CATEGORY"], pred_valid)
    return accuracy_valid

In [226]:
study = optuna.create_study(direction='maximize')
study.optimize(objective_svm, timeout=120)
trial = study.best_trial
trial

[I 2021-08-04 00:54:43,778] A new study created in memory with name: no-name-777578d3-f854-49a9-9560-edacd8af21f5
[I 2021-08-04 00:54:50,863] Trial 0 finished with value: 0.8950524737631185 and parameters: {'C': 0.7283131870605415}. Best is trial 0 with value: 0.8950524737631185.
[I 2021-08-04 00:54:57,841] Trial 1 finished with value: 0.9115442278860569 and parameters: {'C': 1.4064350016011882}. Best is trial 1 with value: 0.9115442278860569.
[I 2021-08-04 00:55:06,076] Trial 2 finished with value: 0.8598200899550225 and parameters: {'C': 0.22088877655455572}. Best is trial 1 with value: 0.9115442278860569.
[I 2021-08-04 00:55:13,552] Trial 3 finished with value: 0.8785607196401799 and parameters: {'C': 0.43726290405700263}. Best is trial 1 with value: 0.9115442278860569.
[I 2021-08-04 00:55:20,761] Trial 4 finished with value: 0.8943028485757122 and parameters: {'C': 0.6065371270704762}. Best is trial 1 with va

FrozenTrial(number=12, values=[0.9130434782608695], datetime_start=datetime.datetime(2021, 8, 4, 0, 56, 10, 244828), datetime_complete=datetime.datetime(2021, 8, 4, 0, 56, 17, 272022), params={'C': 1.275057539561717}, distributions={'C': UniformDistribution(high=2.0, low=0.1)}, user_attrs={}, system_attrs={}, intermediate_values={}, trial_id=12, state=TrialState.COMPLETE, value=None)

In [227]:
C = trial.params["C"]
model = SVC(
    C=C,
    kernel="linear",
    random_state=None
)
model.fit(X_train, df_train["CATEGORY"])

pred_train = model.predict(X_train)
pred_valid = model.predict(X_valid)
pred_test = model.predict(X_test)

accuracy_train = accuracy_score(df_train["CATEGORY"], pred_train)
accuracy_valid = accuracy_score(df_valid["CATEGORY"], pred_valid)
accuracy_test = accuracy_score(df_test["CATEGORY"], pred_test)
pd.DataFrame([
    accuracy_train,
    accuracy_valid,
    accuracy_test
],index=["train","valid","test"])

Unnamed: 0,0
train,0.98735
valid,0.913043
test,0.928786
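
The task asks for the algorithm with the highest validation accuracy and its test accuracy. A small sketch of that final selection step, using only the accuracies reported in the two result tables of this section; the `results` dictionary and its keys are an illustrative layout, not something the notebook defines:

```python
# Validation and test accuracy reported above for each tuned model.
results = {
    "LogisticRegression (elastic net)": {"valid": 0.895802, "test": 0.907796},
    "SVC (linear kernel)": {"valid": 0.913043, "test": 0.928786},
}

# Select on validation accuracy only; the test score is reported, never used for selection.
best_name = max(results, key=lambda name: results[name]["valid"])
best_test_accuracy = results[best_name]["test"]
print(best_name, best_test_accuracy)
```

Selecting on the validation split keeps the test accuracy an unbiased estimate for the chosen model.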
