# https://nlp100.github.io/ja/ch06.html

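## Chapter 6: Machine Learning
In this chapter, we use the News Aggregator Data Set published by Fabio Gasparetti to classify news headlines into the categories "business", "science and technology", "entertainment", and "health" (category classification).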

In [15]:
!pip install scikit-learn

DEPRECATION: Loading egg at /home/ryu/.venv/lib/python3.12/site-packages/cabocha_python-0.69-py3.12-linux-x86_64.egg is deprecated. pip 25.1 will enforce this behaviour change. A possible replacement is to use pip for package installation. Discussion can be found at https://github.com/pypa/pip/issues/12330
Collecting scikit-learn
  Downloading scikit_learn-1.6.1-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (18 kB)
Collecting scipy>=1.6.0 (from scikit-learn)
  Downloading scipy-1.15.1-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (61 kB)
Collecting joblib>=1.2.0 (from scikit-learn)
  Downloading joblib-1.4.2-py3-none-any.whl.metadata (5.4 kB)
Collecting threadpoolctl>=3.1.0 (from scikit-learn)
  Downloading threadpoolctl-3.5.0-py3-none-any.whl.metadata (13 kB)
Downloading scikit_learn-1.6.1-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (13.1 MB)

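### 50. Obtaining and Formatting the Data
Download the News Aggregator Data Set and create the training data (train.txt), validation data (valid.txt), and evaluation data (test.txt) as follows.

1. Unzip the downloaded zip file and read the description in readme.txt.
2. Extract only the examples (articles) whose source (publisher) is "Reuters", "Huffington Post", "Businessweek", "Contactmusic.com", or "Daily Mail".
3. Shuffle the extracted examples randomly.
4. Split the extracted examples into 80% training data and 10% each of validation and evaluation data, and save them as train.txt, valid.txt, and test.txt. Write one example per line, as the category name and the article headline separated by a tab (these files are reused later in problem 70).

After creating the training and evaluation data, check the number of examples in each category.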

In [3]:
!cat ./datafiles/readme.txt

SUMMARY: Dataset of references (urls) to news web pages

DESCRIPTION: Dataset of references to news web pages collected from an online aggregator in the period from March 10 to August 10 of 2014. The resources are grouped into clusters that represent pages discussing the same news story. The dataset includes also references to web pages that point (has a link to) one of the news page in the collection.

TAGS: web pages, news, aggregator, classification, clustering

LICENSE: Public domain - Due to restrictions on content and use of the news sources, the corpus is limited to web references (urls) to web pages and does not include any text content. The references have been retrieved from the news aggregator through traditional web browsers. 

FILE ENCODING: UTF-8

FORMAT: Tab delimited CSV files. 

DATA SHAPE AND STATS: 422937 news pages and divided up into:

152746 	news of business category
108465 	news of science and technology category
115920 	news of business category
 45615 	news of

In [4]:
import pandas as pd
column_names = ['ID', 'TITLE', 'URL', 'PUBLISHER', 'CATEGORY', 'STORY', 'HOSTNAME', 'TIMESTAMP']
df = pd.read_csv('./datafiles/newsCorpora.csv', sep='\t', header=None, names=column_names)
df.head(5)

Unnamed: 0,ID,TITLE,URL,PUBLISHER,CATEGORY,STORY,HOSTNAME,TIMESTAMP
0,1,"Fed official says weak data caused by weather,...",http://www.latimes.com/business/money/la-fi-mo...,Los Angeles Times,b,ddUyU0VZz0BRneMioxUPQVP6sIxvM,www.latimes.com,1394470370698
1,2,Fed's Charles Plosser sees high bar for change...,http://www.livemint.com/Politics/H2EvwJSK2VE6O...,Livemint,b,ddUyU0VZz0BRneMioxUPQVP6sIxvM,www.livemint.com,1394470371207
2,3,US open: Stocks fall after Fed official hints ...,http://www.ifamagazine.com/news/us-open-stocks...,IFA Magazine,b,ddUyU0VZz0BRneMioxUPQVP6sIxvM,www.ifamagazine.com,1394470371550
3,4,"Fed risks falling 'behind the curve', Charles ...",http://www.ifamagazine.com/news/fed-risks-fall...,IFA Magazine,b,ddUyU0VZz0BRneMioxUPQVP6sIxvM,www.ifamagazine.com,1394470371793
4,5,Fed's Plosser: Nasty Weather Has Curbed Job Gr...,http://www.moneynews.com/Economy/federal-reser...,Moneynews,b,ddUyU0VZz0BRneMioxUPQVP6sIxvM,www.moneynews.com,1394470372027
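
The readme above reports 422937 news pages, so comparing `len(df)` with that figure is a cheap sanity check on the parse. One possible cause of a mismatch, stated here as an assumption and not verified on this file, is a headline containing an unbalanced double quote, which a CSV parser with default quoting treats as the start of a quoted field. The sketch below uses the standard-library `csv` module on a made-up three-line sample (not taken from the data set) to show the effect; `pd.read_csv` accepts the same constant through its `quoting` parameter.

```python
import csv
import io

# Hypothetical tab-separated sample: line 1 has an unbalanced double quote.
raw = (
    '1\t"Quoted start\tb\n'
    '2\tPlain headline\te\n'
    '3\tAnother "one"\tm\n'
)

# Default quoting: the opening quote on line 1 swallows the following lines.
default_rows = list(csv.reader(io.StringIO(raw), delimiter="\t"))

# QUOTE_NONE: double quotes are ordinary characters, one row per line.
plain_rows = list(
    csv.reader(io.StringIO(raw), delimiter="\t", quoting=csv.QUOTE_NONE)
)

print(len(default_rows), len(plain_rows))
```

If the row count of `df` turns out to be lower than the readme figure, rereading the file with `quoting=csv.QUOTE_NONE` is one thing to try.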


In [5]:
df = df[df['PUBLISHER'].isin(['Reuters', 'Huffington Post', 'Businessweek', 'Contactmusic.com', 'Daily Mail'])]
df.head(5)

Unnamed: 0,ID,TITLE,URL,PUBLISHER,CATEGORY,STORY,HOSTNAME,TIMESTAMP
12,13,Europe reaches crunch point on banking union,http://in.reuters.com/article/2014/03/10/eu-ba...,Reuters,b,dPhGU51DcrolUIMxbRm0InaHGA2XM,in.reuters.com,1394470501755
13,14,ECB FOCUS-Stronger euro drowns out ECB's messa...,http://in.reuters.com/article/2014/03/10/ecb-p...,Reuters,b,dPhGU51DcrolUIMxbRm0InaHGA2XM,in.reuters.com,1394470501948
19,20,"Euro Anxieties Wane as Bunds Top Treasuries, S...",http://www.businessweek.com/news/2014-03-10/ge...,Businessweek,b,dPhGU51DcrolUIMxbRm0InaHGA2XM,www.businessweek.com,1394470503148
20,21,Noyer Says Strong Euro Creates Unwarranted Eco...,http://www.businessweek.com/news/2014-03-10/no...,Businessweek,b,dPhGU51DcrolUIMxbRm0InaHGA2XM,www.businessweek.com,1394470503366
29,30,REFILE-Bad loan triggers key feature in ECB ba...,http://in.reuters.com/article/2014/03/10/euroz...,Reuters,b,dPhGU51DcrolUIMxbRm0InaHGA2XM,in.reuters.com,1394470505070


In [6]:
df['PUBLISHER'].value_counts()

PUBLISHER
Reuters             3902
Huffington Post     2455
Businessweek        2395
Contactmusic.com    2334
Daily Mail          2254
Name: count, dtype: int64

In [9]:
# Shuffle the DataFrame
import numpy as np
shuffled_indices = np.random.permutation(df.index)
df = df.loc[shuffled_indices].reset_index(drop=True)

In [18]:
df = df[['CATEGORY', 'TITLE']]
df.head(10)

Unnamed: 0,CATEGORY,TITLE
0,e,Justin Bieber Avoids Felony Charge In Alleged ...
1,b,Lorillard Reaches Record on Fresh Reynolds Tak...
2,b,UPDATE 1-Juniper's revenue rises as telecom cl...
3,e,Emma Stone and Colin Firth in Magic In The Moo...
4,b,UK inflation hits new four-year low in Februar...
5,b,UPDATE 1-California's proposed 2015 Obamacare ...
6,b,UPDATE 2-China PMIs fuel hope economy is stabi...
7,e,Bedridden Miley Cyrus loses her 'brain' in The...
8,e,‘Tammy’ Proves No Match For ‘Transformer...
9,b,Twitter Insiders Plan to Hold Stock Even as Lo...


In [19]:
from sklearn.model_selection import train_test_split

# First, split the data into 80% and 20%
train_data, temp_data = train_test_split(df, test_size=0.2, random_state=42)

# Split the remaining 20% evenly into validation and evaluation data
validation_data, test_data = train_test_split(temp_data, test_size=0.5, random_state=42)
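
The two-stage split yields 80/10/10 because `test_size=0.5` applies only to the 20% held out by the first call. The arithmetic can be sketched as follows, assuming that scikit-learn rounds a fractional `test_size` up to a whole number of rows and gives the remainder to the other side; `split_sizes` is a hypothetical helper, not a library function.

```python
import math


def split_sizes(n_rows, test_size):
    """Return (n_first, n_second) for a fractional test_size (hypothetical helper)."""
    n_second = math.ceil(test_size * n_rows)  # assumed rounding rule
    return n_rows - n_second, n_second


n_total = 13340  # sum of the publisher counts shown earlier
n_train, n_temp = split_sizes(n_total, 0.2)
n_valid, n_test = split_sizes(n_temp, 0.5)
print(n_train, n_valid, n_test)
```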


In [20]:
# Save each data set to a tab-separated file (to_csv also writes a header row by default)
train_data.to_csv("./datafiles/train.txt", sep="\t", index=False)
validation_data.to_csv("./datafiles/valid.txt", sep="\t", index=False)
test_data.to_csv("./datafiles/test.txt", sep="\t", index=False)

In [21]:
# Check the size of each data set
print("Train data size:", len(train_data))
print("Validation data size:", len(validation_data))
print("Test data size:", len(test_data))

Train data size: 10672
Validation data size: 1334
Test data size: 1334
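
The task also asks for the number of examples per category, which the size check above does not show. A minimal sketch using only the standard library follows: `count_categories` is a hypothetical helper that tallies the first tab-separated field of each line and skips the `CATEGORY`/`TITLE` header row that `to_csv` writes by default. The sample lines are made up for illustration.

```python
from collections import Counter


def count_categories(lines):
    """Count category labels in 'category<TAB>headline' lines."""
    counts = Counter()
    for line in lines:
        category, _, _title = line.rstrip("\n").partition("\t")
        # Skip blank lines and the header row.
        if category and category != "CATEGORY":
            counts[category] += 1
    return counts


# Made-up sample in the same layout as train.txt.
sample = [
    "CATEGORY\tTITLE\n",
    "b\tStocks fall after Fed remarks\n",
    "e\tSinger announces world tour\n",
    "b\tBank posts record profit\n",
]
counts = count_categories(sample)
print(counts)

# With the real files (paths as used in this notebook):
# with open("./datafiles/train.txt", encoding="utf-8") as f:
#     print(count_categories(f))
```

Inside the notebook, `train_data['CATEGORY'].value_counts()` gives the same tally directly from the DataFrame.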


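### 51. Feature Extraction
Extract features from the training, validation, and evaluation data, and save them as train.feature.txt, valid.feature.txt, and test.feature.txt. Design whatever features seem useful for category classification. The article headline converted into a word sequence is probably the minimal baseline.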

In [26]:
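# Baseline: convert each headline into a word sequence
def extract_features(data):
    # Build the word sequence by splitting on whitespace; assign() returns a
    # new DataFrame, so the caller's data is not modified in place
    data = data.assign(features=data['TITLE'].str.split())
    return data[['features', 'CATEGORY', 'TITLE']]

# Extract features from the training, validation, and evaluation data
train_features = extract_features(train_data)
validation_features = extract_features(validation_data)
test_features = extract_features(test_data)

# Save each feature set to a file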
train_features.to_csv("./datafiles/train.feature.txt", sep="\t", index=False, header=False)
validation_features.to_csv("./datafiles/valid.feature.txt", sep="\t", index=False, header=False)
test_features.to_csv("./datafiles/test.feature.txt", sep="\t", index=False, header=False)
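
The whitespace split above is the baseline the task describes. As one possible step beyond it, offered as an illustrative assumption and not as part of the original solution, headlines could be lowercased, stripped of punctuation, and extended with word bigrams. `headline_features` and `TOKEN_RE` are hypothetical names.

```python
import re

# Lowercase alphanumeric runs, optionally followed by an apostrophe suffix ("fed's").
TOKEN_RE = re.compile(r"[a-z0-9]+(?:'[a-z]+)?")


def headline_features(title):
    """Return lowercase unigrams followed by adjacent-word bigrams."""
    tokens = TOKEN_RE.findall(title.lower())
    bigrams = [f"{a}_{b}" for a, b in zip(tokens, tokens[1:])]
    return tokens + bigrams


feats = headline_features("Fed's Plosser: Nasty Weather")
print(feats)
```

Whether such features actually help on this data set would need to be checked on the validation data.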

### 52. Training
Train a logistic regression model using the training data built in problem 51.

In [30]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

# Map category labels to integer indices
category_to_index = {category: idx for idx, category in enumerate(df["CATEGORY"].unique())}
index_to_category = {v: k for k, v in category_to_index.items()}  # reverse mapping: index -> category label
train_data["CATEGORY"] = train_data["CATEGORY"].map(category_to_index)
validation_data["CATEGORY"] = validation_data["CATEGORY"].map(category_to_index)
test_data["CATEGORY"] = test_data["CATEGORY"].map(category_to_index)

# TF-IDF vectorization
vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(train_data["TITLE"])
X_validation = vectorizer.transform(validation_data["TITLE"])
X_test = vectorizer.transform(test_data["TITLE"])

y_train = train_data["CATEGORY"]
y_validation = validation_data["CATEGORY"]
y_test = test_data["CATEGORY"]

# Train the logistic regression model
model = LogisticRegression(random_state=42, max_iter=1000)
model.fit(X_train, y_train)

# Evaluate the model
validation_pred = model.predict(X_validation)
test_pred = model.predict(X_test)

# Build the list of category names in index order
target_names = [index_to_category[i] for i in sorted(index_to_category.keys())]

print("Validation Classification Report:")
print(classification_report(y_validation, validation_pred, target_names=target_names))

print("\nTest Classification Report:")
print(classification_report(y_test, test_pred, target_names=target_names))

Validation Classification Report:
              precision    recall  f1-score   support

           e       0.90      0.98      0.93       558
           b       0.89      0.94      0.91       520
           m       0.97      0.61      0.75       109
           t       0.86      0.59      0.70       147

    accuracy                           0.89      1334
   macro avg       0.90      0.78      0.83      1334
weighted avg       0.89      0.89      0.89      1334


Test Classification Report:
              precision    recall  f1-score   support

           e       0.89      0.98      0.93       524
           b       0.88      0.95      0.91       556
           m       0.94      0.46      0.62       100
           t       0.87      0.60      0.71       154

    accuracy                           0.88      1334
   macro avg       0.89      0.75      0.79      1334
weighted avg       0.88      0.88      0.87      1334
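
In both reports the macro-averaged recall (0.78 and 0.75) is well below the weighted average (0.89 and 0.88), because the two smaller classes, m and t, have much lower recall than e and b. The sketch below recomputes both averages from the rounded per-class validation figures, so the results agree with the report only up to rounding.

```python
# Per-class recall and support copied from the validation report above.
recall = {"e": 0.98, "b": 0.94, "m": 0.61, "t": 0.59}
support = {"e": 558, "b": 520, "m": 109, "t": 147}

# Macro average: every class counts equally.
macro = sum(recall.values()) / len(recall)

# Weighted average: each class counts in proportion to its support.
total = sum(support.values())
weighted = sum(recall[c] * support[c] for c in recall) / total

print(round(macro, 2), round(weighted, 2))
```

The gap between the two numbers is a quick indicator of how much the minority classes are being missed.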



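### 53. Prediction
Using the logistic regression model trained in problem 52, implement a program that computes the category of a given article headline and its predicted probability.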
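
A minimal sketch of such a program is given below. To stay self-contained it trains on a handful of made-up headlines instead of train.txt, and it wraps the vectorizer and classifier in a scikit-learn pipeline; `predict_category` is a hypothetical helper name. The probabilities come from `predict_proba`, whose columns follow the order of `classes_`.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy stand-in for train.txt (hypothetical headlines, not from the data set).
titles = [
    "Stocks fall as bank shares slide",
    "Central bank holds interest rates",
    "Singer announces new album and tour",
    "Actor wins award for new film",
    "New phone launch boosts chip maker",
    "Software update fixes browser security bug",
    "Study links diet to heart disease risk",
    "Doctors warn of flu virus outbreak",
]
labels = ["b", "b", "e", "e", "t", "t", "m", "m"]

clf = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
clf.fit(titles, labels)


def predict_category(title):
    """Return (predicted category, {category: probability}) for one headline."""
    probs = clf.predict_proba([title])[0]
    by_category = dict(zip(clf.classes_, probs))
    best = max(by_category, key=by_category.get)
    return best, by_category


category, probabilities = predict_category("Bank shares rise after rates decision")
print(category, probabilities)
```

With the objects from problem 52, the equivalent call would be `model.predict_proba(vectorizer.transform([title]))`, with the resulting column indices mapped back to labels through `index_to_category`.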