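### Chapter 6: Machine Learning
In this chapter, we use the News Aggregator Data Set published by Fabio Gasparetti to classify news headlines into the categories "business", "science and technology", "entertainment", and "health" (a category classification task).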

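#### 50. Obtaining and formatting the data
Download the News Aggregator Data Set and create the training data (train.txt), validation data (valid.txt), and evaluation data (test.txt) as follows.

1. Unzip the downloaded zip file and read the description in readme.txt.
2. Extract only the examples (articles) whose publisher is "Reuters", "Huffington Post", "Businessweek", "Contactmusic.com", or "Daily Mail".
3. Shuffle the extracted examples randomly.
4. Use 80% of the extracted examples as training data and 10% each of the remainder as validation and evaluation data, and save them as train.txt, valid.txt, and test.txt. Write one example per line, in tab-separated format with the category name and the headline (these files are reused later in problem 70).

After creating the training and evaluation data, check the number of examples in each category.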

In [1]:
import numpy as np
import pandas as pd 

In [2]:
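# Read the tab-separated corpus line by line (the file has no header row)
with open("NewsAggregatorDataset/newsCorpora.csv", 'r',encoding='utf-8') as f:
    text_ori = f.readlines()
# Strip the trailing newline so it does not end up in the TIMESTAMP column
text_list=[text.rstrip("\n").split("\t") for text in text_ori]
text_df=pd.DataFrame(
    text_list,
    columns=["ID","TITLE","URL","PUBLISHER","CATEGORY","STORY","HOSTNAME","TIMESTAMP"]
)
# Keep only the articles from the five target publishers
publishers=["Reuters","Huffington Post","Businessweek","Contactmusic.com","Daily Mail"]
text_5pub_df=text_df[text_df["PUBLISHER"].isin(publishers)].reset_index(drop=True)
# Shuffle the rows (unseeded, so the order differs between runs)
text_5pub_df_shuffled=text_5pub_df.sample(frac=1).reset_index(drop=True)
# Keep only the category and the headline, then format each row as "CATEGORY\tTITLE\n"
text_5pub_df_shuffled_for_save=text_5pub_df_shuffled[["CATEGORY","TITLE"]]
_text_5pub_df_shuffled_for_save=["\t".join(row)+"\n" for row in text_5pub_df_shuffled_for_save.values]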

In [3]:
import math

# 80% training, 10% validation, and the remaining examples for evaluation
len_total=len(_text_5pub_df_shuffled_for_save)
len_train=math.ceil(len_total*0.8)
len_val=math.ceil(len_total*0.1)
train=_text_5pub_df_shuffled_for_save[:len_train]
val=_text_5pub_df_shuffled_for_save[len_train:len_train+len_val]
test=_text_5pub_df_shuffled_for_save[len_train+len_val:]

# Open each file once; mode="w" overwrites an existing file, so no os.remove is needed,
# and the with-statement closes the file automatically
for file,rows in [("train.txt",train),("valid.txt",val),("test.txt",test)]:
    with open(file,mode="w",encoding="utf-8") as f:
        f.writelines(rows)
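
Because the shuffle above is unseeded, the split and the per-category counts change on every run. The following is a minimal sketch of a reproducible 80/10/10 split, assuming scikit-learn's `train_test_split`. The toy DataFrame, the `random_state` value, and the use of `stratify` are illustrative assumptions and are not part of the original notebook.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for the filtered articles (hypothetical data).
df = pd.DataFrame({
    "CATEGORY": ["b", "e", "m", "t"] * 25,
    "TITLE": [f"headline {i}" for i in range(100)],
})

# Carve off 80% for training, then halve the remainder into validation and test.
# stratify keeps the category proportions similar across the splits.
train_df, rest_df = train_test_split(
    df, train_size=0.8, random_state=0, stratify=df["CATEGORY"]
)
valid_df, test_df = train_test_split(
    rest_df, test_size=0.5, random_state=0, stratify=rest_df["CATEGORY"]
)

print(len(train_df), len(valid_df), len(test_df))
```

Fixing `random_state` makes the printed counts repeatable, which helps when results are compared across notebook sessions.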

In [4]:
from collections import Counter

# Slice the DataFrame at the same boundaries used for the saved files
text_5pub_df_shuffled_for_save_train=text_5pub_df_shuffled_for_save.iloc[:len_train].reset_index(drop=True)
text_5pub_df_shuffled_for_save_val=text_5pub_df_shuffled_for_save.iloc[len_train:len_train+len_val].reset_index(drop=True)
text_5pub_df_shuffled_for_save_test=text_5pub_df_shuffled_for_save.iloc[len_train+len_val:].reset_index(drop=True)
# Count the examples per category in each split
count_train=Counter(text_5pub_df_shuffled_for_save_train["CATEGORY"])
count_val=Counter(text_5pub_df_shuffled_for_save_val["CATEGORY"])
count_test=Counter(text_5pub_df_shuffled_for_save_test["CATEGORY"])
print("category train",dict(count_train))
print("category val",dict(count_val))
print("category test",dict(count_test))

category train {'e': 4229, 'b': 4485, 'm': 757, 't': 1214}
category val {'e': 519, 't': 157, 'b': 579, 'm': 81}
category test {'b': 563, 'e': 546, 't': 154, 'm': 72}
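
The counts above show that the categories are imbalanced: `b` and `e` dominate, while `m` and `t` are comparatively rare. As a rough sketch, the share of each category can be computed directly from the reported training counts. The "majority baseline" (always predicting the most frequent category) is an illustrative addition, not something the original notebook computes.

```python
# Category counts copied from the training-split output above.
count_train = {"e": 4229, "b": 4485, "m": 757, "t": 1214}

total = sum(count_train.values())
# Share of each category in the training data.
share = {cat: n / total for cat, n in count_train.items()}
# Accuracy of a trivial classifier that always predicts the most frequent category.
majority_cat = max(count_train, key=count_train.get)
majority_baseline = share[majority_cat]

for cat, s in sorted(share.items()):
    print(f"{cat}: {s:.3f}")
print("majority baseline:", majority_cat, round(majority_baseline, 3))
```

A baseline like this gives a reference point for judging the accuracy of any trained classifier on the same data.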


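#### 51. Feature extraction
Extract features from the training, validation, and evaluation data, and save them as train.feature.txt, valid.feature.txt, and test.feature.txt. Design whatever features seem useful for category classification. Converting each headline into a sequence of words is probably the minimal baseline.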

In [163]:
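from sklearn.feature_extraction.text import CountVectorizer
sample = text_5pub_df_shuffled_for_save.TITLE.values
# Bag-of-words features: one column per word, one count per headline
vec_count = CountVectorizer()
# Note: the vocabulary is learned from all splits here, not from the training data alone
vec_count.fit(sample)
# Vectorize the headlines
X = vec_count.transform(sample)
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
X_df=pd.DataFrame(X.toarray(), columns=vec_count.get_feature_names_out())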
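
The cell above fits the vocabulary on every headline, including those in the validation and evaluation splits. A common alternative is to fit on the training headlines only and reuse that vocabulary for the other splits. The sketch below illustrates the idea with hypothetical headlines; the variable names are assumptions and do not appear in the notebook.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical headlines standing in for the three splits.
train_titles = ["Stocks rally as oil prices fall", "New film tops box office"]
valid_titles = ["Oil prices rally"]
test_titles = ["Film festival opens with unseen premiere"]

vec = CountVectorizer()
# Learn the vocabulary from the training headlines only.
X_train = vec.fit_transform(train_titles)
# Reuse that vocabulary; words unseen in training are silently ignored.
X_valid = vec.transform(valid_titles)
X_test = vec.transform(test_titles)

print(X_train.shape, X_valid.shape, X_test.shape)
```

All three matrices share the same number of columns, so a model trained on `X_train` can score the other splits without seeing their vocabulary in advance.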

In [106]:
X_train_feature=X_df.iloc[:len_train].values
X_val_feature=X_df.iloc[len_train:len_train+len_val].values
X_test_feature=X_df.iloc[len_train+len_val:].values

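# The features are integer word counts, so write them as integers rather than
# in savetxt's default floating-point notation
np.savetxt("train.feature.txt",X_train_feature,fmt="%d")
np.savetxt("valid.feature.txt",X_val_feature,fmt="%d")
np.savetxt("test.feature.txt",X_test_feature,fmt="%d")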
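
Writing a dense bag-of-words matrix as text produces very large files, because almost every entry is zero. If the exact `.txt` format is not required, one option is to keep the matrix sparse and store it with SciPy's `save_npz`. This is a minimal sketch under that assumption; the file name `train.feature.npz` and the toy headlines are hypothetical.

```python
import os
import tempfile

from scipy import sparse
from sklearn.feature_extraction.text import CountVectorizer

titles = ["Stocks rally as oil prices fall", "New film tops box office"]  # hypothetical data
X = CountVectorizer().fit_transform(titles)  # SciPy sparse matrix of word counts

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "train.feature.npz")  # hypothetical file name
    sparse.save_npz(path, X)          # stores only the non-zero entries
    X_loaded = sparse.load_npz(path)  # round-trip back into a sparse matrix

print(X_loaded.shape, X_loaded.nnz)
```

scikit-learn estimators such as `LogisticRegression` accept sparse input directly, so the loaded matrix does not need to be converted with `toarray()`.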

#### 52. Training

In [176]:
label_dict={
    "b":0,
    "e":1,
    "m":2,
    "t":3
}

In [175]:
label_dict["b"]

0

In [179]:
def create_y(df):
    # Map each category letter to its integer label (kept as floats, as before)
    return df.CATEGORY.map(label_dict).to_numpy(dtype=float)

In [180]:
y_train=create_y(text_5pub_df_shuffled_for_save_train)
y_val=create_y(text_5pub_df_shuffled_for_save_val)
y_test=create_y(text_5pub_df_shuffled_for_save_test)

In [181]:
from sklearn.linear_model import LogisticRegression
LR = LogisticRegression().fit(X_train_feature, y_train)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
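
The warning above means the solver hit its iteration limit before converging. As the message suggests, one remedy is to raise `max_iter`. The sketch below shows this on small synthetic data; the value `1000` and the data are illustrative assumptions, and the right limit for the headline features may differ.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Small synthetic count-like features (hypothetical data, not the news headlines).
rng = np.random.default_rng(0)
X_demo = rng.integers(0, 3, size=(200, 20)).astype(float)
y_demo = (X_demo[:, 0] + X_demo[:, 1] > 2).astype(int)

# max_iter defaults to 100; a larger limit gives the solver room to converge.
clf = LogisticRegression(max_iter=1000).fit(X_demo, y_demo)

# n_iter_ holds the number of iterations the solver actually used.
print(clf.n_iter_)
```

If `n_iter_` stays below `max_iter`, the solver stopped because it converged rather than because it ran out of iterations.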


#### 53. Prediction

In [189]:
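# Pick one headline (index 567 of the shuffled data) and vectorize it with the fitted vocabulary
title=np.array([text_5pub_df_shuffled_for_save.TITLE.values[567]])
X_trial=vec_count.transform(title).toarray()
pred=LR.predict(X_trial)
# Invert label_dict to map the predicted integer back to its category letter
label_dict_ops = {v: k for k, v in label_dict.items()}
print(label_dict_ops[pred[0]])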
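
The cell above prints only the predicted category. To also see how confident the model is, `predict_proba` returns one probability per class. The following is a minimal sketch on a hypothetical miniature training set that reuses the notebook's label encoding; the headlines and the `id_to_label` name are assumptions for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Hypothetical miniature training set using the notebook's label encoding.
label_dict = {"b": 0, "e": 1, "m": 2, "t": 3}
titles = [
    "Stocks fall as bank profits drop",
    "Bank shares rise on profits",
    "Singer releases new album",
    "Film star wins award",
    "Study links diet to heart health",
    "New drug cuts heart risk",
    "Phone maker unveils new tablet",
    "Software update fixes phone bug",
]
labels = [0, 0, 1, 1, 2, 2, 3, 3]

vec = CountVectorizer()
clf = LogisticRegression(max_iter=1000).fit(vec.fit_transform(titles), labels)

# predict_proba returns one row per input; columns follow the order of clf.classes_.
proba = clf.predict_proba(vec.transform(["Bank profits rise"]))[0]
id_to_label = {v: k for k, v in label_dict.items()}
for class_id, p in zip(clf.classes_, proba):
    print(id_to_label[class_id], round(p, 3))
```

The probabilities in each row sum to 1, and the class with the largest value is the one `predict` would return.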

#### 54. Measuring accuracy

In [198]:
pred_train=LR.predict(X_train_feature)
pred_val=LR.predict(X_val_feature)
pred_test=LR.predict(X_test_feature)

In [199]:
from sklearn.metrics import accuracy_score
print("train_acc: ",accuracy_score(y_train,pred_train))
print("val_acc: ",accuracy_score(y_val,pred_val))
print("test_acc: ",accuracy_score(y_test,pred_test))

train_acc:  0.9962564342536265
val_acc:  0.9236526946107785
test_acc:  0.900374531835206


#### 55. Creating a confusion matrix

In [223]:
from IPython.display import display
from sklearn.metrics import confusion_matrix
def conf_df(t,y):
    conf=pd.DataFrame(confusion_matrix(t,y,labels=[0,1,2,3]),columns=["b","e","m","t"],index=["b","e","m","t"])
    conf.index.name = "act"
    conf.columns.name = "pred"
    display(conf)

In [227]:
conf_df(y_train,pred_train)
conf_df(y_val,pred_val)
conf_df(y_test,pred_test)

Training data:

| act \ pred | b | e | m | t |
|---|---|---|---|---|
| b | 4477 | 2 | 1 | 5 |
| e | 6 | 4221 | 0 | 2 |
| m | 3 | 3 | 751 | 0 |
| t | 14 | 4 | 0 | 1196 |

Validation data:

| act \ pred | b | e | m | t |
|---|---|---|---|---|
| b | 548 | 16 | 3 | 12 |
| e | 7 | 511 | 0 | 1 |
| m | 7 | 10 | 61 | 3 |
| t | 23 | 17 | 3 | 114 |

Evaluation data:

| act \ pred | b | e | m | t |
|---|---|---|---|---|
| b | 527 | 19 | 0 | 17 |
| e | 7 | 532 | 4 | 3 |
| m | 13 | 12 | 41 | 6 |
| t | 32 | 17 | 3 | 102 |
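
The confusion matrices also allow per-category quality to be read off, which the overall accuracy hides. As a sketch, the snippet below recomputes accuracy and derives precision and recall from the evaluation-data matrix shown above, using plain NumPy. The variable names are illustrative.

```python
import numpy as np

labels = ["b", "e", "m", "t"]
# Evaluation-data confusion matrix copied from the output above
# (rows: actual category, columns: predicted category).
conf_test = np.array([
    [527, 19, 0, 17],
    [7, 532, 4, 3],
    [13, 12, 41, 6],
    [32, 17, 3, 102],
])

accuracy = np.trace(conf_test) / conf_test.sum()
recall = np.diag(conf_test) / conf_test.sum(axis=1)     # correct / actual count
precision = np.diag(conf_test) / conf_test.sum(axis=0)  # correct / predicted count

print("accuracy:", accuracy)
for lbl, p, r in zip(labels, precision, recall):
    print(f"{lbl}: precision={p:.3f} recall={r:.3f}")
```

The accuracy computed this way matches the `test_acc` value reported in problem 54, and the recall values show that the rarer categories `m` and `t` are misclassified far more often than `b` and `e`.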
