<a href="https://colab.research.google.com/github/tktkbohshi/m1_study_nlp100practices/blob/main/M1_NLP_100practices.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Imports

In [119]:
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.metrics import confusion_matrix
from sklearn.metrics import precision_score, recall_score, f1_score

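# Chapter 6: Machine Learning
In this chapter, we use the News Aggregator Data Set published by Fabio Gasparetti to classify news headlines into four categories: business, science and technology, entertainment, and health (category classification).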

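## 50. Obtaining and formatting the data
Download the News Aggregator Data Set and create the training data (train.txt), validation data (valid.txt), and test data (test.txt) as follows.
1. Unzip the downloaded zip file and read the description in readme.txt.
1. Extract only the examples (articles) whose source (publisher) is "Reuters", "Huffington Post", "Businessweek", "Contactmusic.com", or "Daily Mail".
1. Shuffle the extracted examples randomly.
1. Split the extracted examples into 80% training data and 10% each for validation and test data, and save them as train.txt, valid.txt, and test.txt. Write one example per line, in tab-separated format with the category name and the article headline (these files are reused later in Problem 70).

After creating the training and test data, check the number of examples in each category.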

### Dataset details
+ `ID`: Numeric ID
+ `TITLE`: News title
+ `URL`: URL
+ `PUBLISHER`: Publisher name
+ `CATEGORY`: News category (b = business, t = science and technology, e = entertainment, m = health)
+ `STORY`: Alphanumeric ID of the cluster that includes news about the same story
+ `HOSTNAME`: URL hostname
+ `TIMESTAMP`: Approximate time the news was published, as the number of milliseconds since the epoch 00:00:00 GMT, January 1, 1970

In [76]:
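columns = ["ID","TITLE","URL","PUBLISHER","CATEGORY","STORY","HOSTNAME","TIMESTAMP"]
# The file is tab-separated and read without a header row, so column names are supplied here.
df_publisher = pd.read_csv("./data/NewsAggregatorDataset/newsCorpora.csv", names=columns, sep="\t")
# Keep only articles from the five publishers named in the task.
publishers = ["Reuters", "Huffington Post", "Businessweek", "Contactmusic.com", "Daily Mail"]
df_publisher = df_publisher[df_publisher["PUBLISHER"].isin(publishers)]
# Shuffle the rows (unseeded, so the order differs between runs) and renumber the index.
df_publisher = df_publisher.sample(frac=1).reset_index(drop=True)
# Lowercase the headlines before feature extraction.
df_publisher["TITLE"] = df_publisher["TITLE"].str.lower()
df_publisher.head(5)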

Unnamed: 0,ID,TITLE,URL,PUBLISHER,CATEGORY,STORY,HOSTNAME,TIMESTAMP
0,264730,samsung unveils prototype health band with clo...,http://www.businessweek.com/news/2014-05-28/sa...,Businessweek,t,d-nOD2jkbIjxiDMUfJkGOqCOTnm0M,www.businessweek.com,1401360390844
1,97909,'house of cards' in setback as maryland balks ...,http://www.businessweek.com/news/2014-04-08/ho...,Businessweek,t,dGARGeNtJm3oezMFAgrVrMtJiqq2M,www.businessweek.com,1397235728087
2,318824,outrage from viewers as baby eagle is left to ...,http://www.dailymail.co.uk/news/article-266974...,Daily Mail,t,dFH0zcBWg6cjb5MjkQTgE-y9OUmEM,www.dailymail.co.uk,1403787928015
3,298052,"ge said to refine jobs, nuclear plans in alsto...",http://www.businessweek.com/news/2014-06-17/ge...,Businessweek,b,dfNbQTGBfJ6jr7MBhWNlj_D_HhIuM,www.businessweek.com,1403073455171
4,227041,at&t-comcast start of deals forming regulatory...,http://www.businessweek.com/news/2014-05-20/at...,Businessweek,t,dVKMSZMLATiLKpM5kLkclxzxZebVM,www.businessweek.com,1400640960411


In [77]:
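# Split 80% / 10% / 10% into training, validation, and test sets.
n = len(df_publisher)
train_end, valid_end = int(n * 0.8), int(n * 0.9)
df_train = df_publisher[:train_end]
df_valid = df_publisher[train_end:valid_end]
df_test = df_publisher[valid_end:]
# One example per line: category and headline, tab-separated, no header row.
df_train[["CATEGORY", "TITLE"]].to_csv("./data/outputs/train.txt", sep="\t", index=False, header=False)
df_valid[["CATEGORY", "TITLE"]].to_csv("./data/outputs/valid.txt", sep="\t", index=False, header=False)
df_test[["CATEGORY", "TITLE"]].to_csv("./data/outputs/test.txt", sep="\t", index=False, header=False)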
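
The task also asks for the number of examples per category once the splits exist. A minimal sketch of that check, assuming a DataFrame with a `CATEGORY` column like the ones above; `df_toy` and its rows are made up for illustration and are not the real data.

```python
import pandas as pd

# Hypothetical stand-in for df_train / df_test (not the real data).
df_toy = pd.DataFrame({
    "CATEGORY": ["b", "t", "e", "b", "m", "e", "b"],
    "TITLE": ["stocks rise", "new phone", "film opens", "oil falls",
              "flu season", "album drops", "bank merger"],
})

# Number of examples per category, most frequent first.
counts = df_toy["CATEGORY"].value_counts()
print(counts)
```

On the notebook's own data the same call would be `df_train["CATEGORY"].value_counts()` and `df_test["CATEGORY"].value_counts()`.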

## 51. Feature extraction

In [83]:
tfidf_vec = TfidfVectorizer()
X_train = tfidf_vec.fit_transform(df_train["TITLE"])
X_test = tfidf_vec.transform(df_test["TITLE"])
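
The cell above fits the vectorizer on the training headlines only and then reuses it for the test headlines. The following sketch shows what that implies, under the assumption of a few made-up headlines: both matrices share the training vocabulary, and words that appear only at test time are ignored.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up headlines for illustration only.
train_titles = ["stocks rise on earnings", "new phone unveiled", "stocks fall"]
test_titles = ["phone earnings surprise analysts"]

vec = TfidfVectorizer()
X_tr = vec.fit_transform(train_titles)  # learns the vocabulary and IDF weights
X_te = vec.transform(test_titles)       # reuses them; unseen words are dropped

print(sorted(vec.vocabulary_))
print(X_tr.shape, X_te.shape)
```

The validation split can be vectorized the same way, for example `tfidf_vec.transform(df_valid["TITLE"])`, if it is needed for tuning later.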

## 52. Training

In [87]:
model = LogisticRegression(random_state=123, max_iter=10000)
model.fit(X_train, df_train["CATEGORY"])

LogisticRegression(max_iter=10000, random_state=123)

## 53. Prediction

In [95]:
pred_train = model.predict(X_train)
pred_test = model.predict(X_test)

## 54. Measuring accuracy

In [100]:
accuracy_train = accuracy_score(df_train["CATEGORY"], pred_train)
accuracy_test = accuracy_score(df_test["CATEGORY"], pred_test)
accuracy_train, accuracy_test

(0.9460269865067467, 0.8710644677661169)

## 55. Creating the confusion matrix

In [111]:
labels = df_train["CATEGORY"].unique()
labels

array(['t', 'b', 'e', 'm'], dtype=object)

In [104]:
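# Rows are true labels and columns are predictions; without labels=, both follow sorted label order (b, e, m, t).
confusion_matrix(df_train["CATEGORY"], pred_train)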

array([[4411,   52,    5,   30],
       [  18, 4210,    0,    4],
       [  80,  125,  510,    4],
       [ 148,  108,    2,  965]], dtype=int64)
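
Because `labels` from `unique()` is in order of first appearance while `confusion_matrix` defaults to sorted order, the rows above do not line up with that `labels` array. A small sketch with made-up labels shows the difference and how passing `labels=` pins the order; `y_true`, `y_pred`, and `order` are illustrative names.

```python
from sklearn.metrics import confusion_matrix

# Made-up true and predicted labels for illustration.
y_true = ["t", "b", "e", "m", "b", "t"]
y_pred = ["t", "b", "e", "b", "b", "b"]

# Default: rows and columns in sorted label order (b, e, m, t).
cm_sorted = confusion_matrix(y_true, y_pred)

# Explicit order: rows and columns follow the given list (t, b, e, m).
order = ["t", "b", "e", "m"]
cm_pinned = confusion_matrix(y_true, y_pred, labels=order)

print(cm_sorted)
print(cm_pinned)
```

The same call applied to `df_test["CATEGORY"]` and `pred_test` would give the matrix for the test data.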

## 56. Measuring precision, recall, and F1 score

In [112]:
precision_score(df_test["CATEGORY"], pred_test, average=None, labels=labels)

array([0.87719298, 0.88927336, 0.84237288, 0.98076923])

In [122]:
recall_score(df_test["CATEGORY"], pred_test, average=None, labels=labels)

array([0.57142857, 0.93454545, 0.97642436, 0.51      ])

In [123]:
f1_score(df_test["CATEGORY"], pred_test, average=None, labels=labels)

array([0.69204152, 0.91134752, 0.9044586 , 0.67105263])

In [124]:
df_eval = pd.DataFrame()
df_eval
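
The empty `df_eval` above is left unfinished. One way it might be filled in is sketched below: per-category precision, recall, and F1 in a single table, with micro and macro averages appended as extra rows. This is an assumption about the intent, shown on made-up labels; `df_scores` and `order` are illustrative names.

```python
import pandas as pd
from sklearn.metrics import precision_recall_fscore_support

# Made-up true and predicted labels for illustration.
y_true = ["t", "b", "e", "m", "b", "t"]
y_pred = ["t", "b", "e", "b", "b", "b"]
order = ["t", "b", "e", "m"]

# Per-category scores; zero_division=0 avoids a warning for categories never predicted.
p, r, f, _ = precision_recall_fscore_support(
    y_true, y_pred, labels=order, zero_division=0)
df_scores = pd.DataFrame({"precision": p, "recall": r, "f1": f}, index=order)

# Append micro- and macro-averaged scores as extra rows.
for avg in ("micro", "macro"):
    ap, ar, af, _ = precision_recall_fscore_support(
        y_true, y_pred, labels=order, average=avg, zero_division=0)
    df_scores.loc[avg] = [ap, ar, af]

print(df_scores.round(3))
```

With the notebook's variables, `y_true` and `y_pred` would be `df_test["CATEGORY"]` and `pred_test`, and `order` would be `labels`.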

## 57. Checking feature weights

## 58. Changing the regularization parameter
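
This section is also empty. A minimal sketch, assuming the regularization strength is varied through the `C` argument of `LogisticRegression` (smaller `C` means stronger regularization) and that accuracy is recorded on the training and validation data for each value. The data and the `results` list are made up for illustration.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Made-up data for illustration only.
train_titles = ["stocks rise", "bank profit falls", "stocks fall",
                "film opens", "album drops", "film star weds"]
train_cats = ["b", "b", "b", "e", "e", "e"]
valid_titles = ["bank stocks fall", "film album opens"]
valid_cats = ["b", "e"]

vec = TfidfVectorizer()
X_tr = vec.fit_transform(train_titles)
X_va = vec.transform(valid_titles)

results = []
for C in [0.01, 1.0, 100.0]:
    clf = LogisticRegression(C=C, random_state=123, max_iter=10000)
    clf.fit(X_tr, train_cats)
    results.append({
        "C": C,
        "train_acc": accuracy_score(train_cats, clf.predict(X_tr)),
        "valid_acc": accuracy_score(valid_cats, clf.predict(X_va)),
        # Weaker regularization (larger C) lets the weights grow.
        "weight_norm": float(np.linalg.norm(clf.coef_)),
    })

for row in results:
    print(row)
```

Plotting `train_acc` and `valid_acc` against `C` on a logarithmic axis is a common way to read off where the model starts to overfit.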

## 59. Hyperparameter search
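
No code is given for this section. One possible sketch, assuming the search covers both the vectorizer and the classifier and that candidates are compared by validation accuracy: a `Pipeline` is refit for each combination from `ParameterGrid`. The grid values and the data are illustrative assumptions, not tuned choices.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import ParameterGrid
from sklearn.pipeline import Pipeline

# Made-up data for illustration only.
train_titles = ["stocks rise", "bank profit falls", "stocks fall",
                "film opens", "album drops", "film star weds"]
train_cats = ["b", "b", "b", "e", "e", "e"]
valid_titles = ["bank stocks fall", "film album opens"]
valid_cats = ["b", "e"]

pipe = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", LogisticRegression(random_state=123, max_iter=10000)),
])

# Hypothetical search space: n-gram range and regularization strength.
grid = ParameterGrid({
    "tfidf__ngram_range": [(1, 1), (1, 2)],
    "clf__C": [0.1, 1.0, 10.0],
})

best_score, best_params = -1.0, None
for params in grid:
    pipe.set_params(**params).fit(train_titles, train_cats)
    score = pipe.score(valid_titles, valid_cats)  # accuracy on the validation data
    if score > best_score:
        best_score, best_params = score, params

print(best_params, best_score)
```

The test data stays out of the loop, so the final accuracy reported on it remains an unbiased estimate for the selected setting.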