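# Chapter 6: Machine Learning

In this chapter, we use the [News Aggregator Data Set](https://archive.ics.uci.edu/ml/datasets/News+Aggregator) published by Fabio Gasparetti to classify news headlines into four categories, "Business", "Science and Technology", "Entertainment", and "Health" (category classification).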

## 50. Obtaining and Formatting the Data

Download the News Aggregator Data Set and create the training data (train.txt), validation data (valid.txt), and evaluation data (test.txt) as follows.

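1. Unzip the downloaded zip file and read the description in readme.txt.
2. Extract only the examples (articles) whose source (publisher) is "Reuters", "Huffington Post", "Businessweek", "Contactmusic.com", or "Daily Mail".
3. Shuffle the extracted examples randomly.
4. Use 80% of the extracted examples as training data and 10% each of the remainder as validation and evaluation data, and save them as train.txt, valid.txt, and test.txt, respectively. Write one example per line, in tab-separated format with the category name and the article headline (these files are reused later in Problem 70).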

After creating the training and evaluation data, check the number of examples in each category.
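Steps 2 to 4 can be sketched on a small in-memory table before touching the real file. This is a minimal sketch under stated assumptions: the `toy` rows are made up for illustration, and the split is done by slicing the shuffled frame by position, which is a simpler stand-in for a library splitter rather than the method used in the cells that follow.

```python
import pandas as pd

# Toy stand-in for newsCorpora.csv; these rows are invented for illustration.
toy = pd.DataFrame({
    "title": [f"headline {i}" for i in range(40)],
    "publisher": ["Reuters", "Daily Mail", "Some Other Site", "Businessweek"] * 10,
    "category": ["b", "e", "t", "m"] * 10,
})

PUBLISHERS = ["Reuters", "Huffington Post", "Businessweek",
              "Contactmusic.com", "Daily Mail"]

# Step 2: keep only the five target publishers.
selected = toy[toy["publisher"].isin(PUBLISHERS)]

# Step 3: shuffle; a fixed seed keeps the split reproducible.
shuffled = selected.sample(frac=1, random_state=0).reset_index(drop=True)

# Step 4: 80% / 10% / 10% by position (integer arithmetic avoids float rounding).
n = len(shuffled)
n_train = n * 8 // 10
n_valid = n * 9 // 10
train = shuffled.iloc[:n_train]
valid = shuffled.iloc[n_train:n_valid]
test = shuffled.iloc[n_valid:]

print(len(train), len(valid), len(test))
```

Because the three slices come from one shuffled frame, no example can land in more than one split.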

In [3]:
# !wget https://archive.ics.uci.edu/ml/machine-learning-databases/00359/NewsAggregatorDataset.zip -P ../data/

--2020-11-01 11:38:06--  https://archive.ics.uci.edu/ml/machine-learning-databases/00359/NewsAggregatorDataset.zip
Resolving archive.ics.uci.edu (archive.ics.uci.edu)... 128.195.10.252
Connecting to archive.ics.uci.edu (archive.ics.uci.edu)|128.195.10.252|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 29224203 (28M) [application/x-httpd-php]
Saving to: ‘../data/NewsAggregatorDataset.zip’


2020-11-01 11:38:15 (11.5 MB/s) - ‘../data/NewsAggregatorDataset.zip’ saved [29224203/29224203]



In [4]:
# !unzip ../data/NewsAggregatorDataset.zip -d ../data/news_aggregator

Archive:  ../data/NewsAggregatorDataset.zip
  inflating: ../data/news_aggregator/2pageSessions.csv  
   creating: ../data/news_aggregator/__MACOSX/
  inflating: ../data/news_aggregator/__MACOSX/._2pageSessions.csv  
  inflating: ../data/news_aggregator/newsCorpora.csv  
  inflating: ../data/news_aggregator/__MACOSX/._newsCorpora.csv  
  inflating: ../data/news_aggregator/readme.txt  
  inflating: ../data/news_aggregator/__MACOSX/._readme.txt  


In [5]:
!cat ../data/news_aggregator/readme.txt

SUMMARY: Dataset of references (urls) to news web pages

DESCRIPTION: Dataset of references to news web pages collected from an online aggregator in the period from March 10 to August 10 of 2014. The resources are grouped into clusters that represent pages discussing the same news story. The dataset includes also references to web pages that point (has a link to) one of the news page in the collection.

TAGS: web pages, news, aggregator, classification, clustering

LICENSE: Public domain - Due to restrictions on content and use of the news sources, the corpus is limited to web references (urls) to web pages and does not include any text content. The references have been retrieved from the news aggregator through traditional web browsers. 

FILE ENCODING: UTF-8

FORMAT: Tab delimited CSV files. 

DATA SHAPE AND STATS: 422937 news pages and divided up into:

152746 	news of business category
108465 	news of science and technology category
115920 	news of business category
 45615 	news of

In [6]:
import pandas as pd

In [11]:
df = pd.read_table('../data/news_aggregator/newsCorpora.csv', header=None)
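One thing worth checking when reading a tab-delimited file with pandas is how double quotes are handled. With the default quoting, a field that begins with a double quote is read up to the next closing quote, so tabs and newlines in between are absorbed into one field. The sketch below shows the effect on a made-up two-row sample. Whether newsCorpora.csv actually contains such titles is an assumption here; comparing `len(df)` with the 422937 pages reported in readme.txt is one way to find out.

```python
import csv
import io

import pandas as pd

# Made-up sample: the quote opens in row 1 and closes in row 2.
sample = '1\t"Quoted start\tReuters\n2\tEnds with quote"\tReuters\n'

# Default quoting: everything between the two quotes becomes a single field.
df_default = pd.read_csv(io.StringIO(sample), sep="\t", header=None)

# QUOTE_NONE: double quotes are treated as ordinary characters.
df_raw = pd.read_csv(io.StringIO(sample), sep="\t", header=None,
                     quoting=csv.QUOTE_NONE)

print(len(df_default), len(df_raw))
```

If the row count of the real file falls short of the figure in readme.txt, passing `quoting=csv.QUOTE_NONE` to `pd.read_table` is a reasonable thing to try.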

In [13]:
df.columns = ['id', 'title', 'url', 'publisher', 'category', 'story', 'hostname', 'timestamp']

In [16]:
news_df = df[df['publisher'].isin(['Reuters', 'Huffington Post', 'Businessweek', 'Contactmusic.com', 'Daily Mail'])]

In [18]:
# Shuffle the rows: frac=1 draws every row once, in random order.
news_df = news_df.sample(frac=1)

In [20]:
from sklearn.model_selection import train_test_split

In [21]:
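# Hold out 20% of the examples; the remaining 80% become the training data.
train_df, test_df = train_test_split(news_df, test_size=.2, random_state=1)

In [26]:
# Split the held-out 20% in half: 10% for evaluation and 10% for validation.
test_df, val_df = train_test_split(test_df, test_size=.5, random_state=1)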

In [37]:
train_df.to_csv('../data/news_aggregator/train.txt', sep='\t', index=False)

In [38]:
val_df.to_csv('../data/news_aggregator/valid.txt', sep='\t', index=False)

In [39]:
test_df.to_csv('../data/news_aggregator/test.txt', sep='\t', index=False)
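The task statement asks for one example per line with only the category name and the headline, separated by a tab. The following is a minimal sketch of that layout, assuming a frame with `category` and `title` columns. The two rows reuse a headline shown later in this notebook plus one invented row, and an in-memory buffer stands in for the output file.

```python
import io

import pandas as pd

# Small illustrative frame; the second row is invented.
toy_train = pd.DataFrame({
    "id": [152453, 1],
    "title": ["Boeing's quarterly revenue rises 8 pct", "Example headline"],
    "category": ["b", "e"],
})

# Write category<TAB>title, with no header row and no index column.
buf = io.StringIO()
toy_train[["category", "title"]].to_csv(buf, sep="\t", index=False, header=False)
print(buf.getvalue())

# Reading it back requires supplying the column names explicitly.
restored = pd.read_csv(io.StringIO(buf.getvalue()), sep="\t",
                       header=None, names=["category", "title"])
```

Replacing `buf` with a path such as `'../data/news_aggregator/train.txt'` would write the same layout to disk.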

In [29]:
train_df.groupby('category').count()

| category | id | title | url | publisher | story | hostname | timestamp |
|---|---|---|---|---|---|---|---|
| b | 4499 | 4499 | 4499 | 4499 | 4499 | 4499 | 4499 |
| e | 4236 | 4236 | 4236 | 4236 | 4236 | 4236 | 4236 |
| m | 739 | 739 | 739 | 739 | 739 | 739 | 739 |
| t | 1198 | 1198 | 1198 | 1198 | 1198 | 1198 | 1198 |


In [30]:
test_df.groupby('category').count()

| category | id | title | url | publisher | story | hostname | timestamp |
|---|---|---|---|---|---|---|---|
| b | 557 | 557 | 557 | 557 | 557 | 557 | 557 |
| e | 532 | 532 | 532 | 532 | 532 | 532 | 532 |
| m | 82 | 82 | 82 | 82 | 82 | 82 | 82 |
| t | 163 | 163 | 163 | 163 | 163 | 163 | 163 |


In [31]:
val_df.groupby('category').count()

| category | id | title | url | publisher | story | hostname | timestamp |
|---|---|---|---|---|---|---|---|
| b | 571 | 571 | 571 | 571 | 571 | 571 | 571 |
| e | 511 | 511 | 511 | 511 | 511 | 511 | 511 |
| m | 89 | 89 | 89 | 89 | 89 | 89 | 89 |
| t | 163 | 163 | 163 | 163 | 163 | 163 | 163 |
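`groupby('category').count()` repeats the same count in every column. When only the per-category totals are needed, `value_counts` on the `category` column gives a single series. The sketch below uses a made-up frame; in the dataset's documentation, the codes `b`, `e`, `m`, and `t` stand for business, entertainment, health, and science and technology.

```python
import pandas as pd

# Made-up frame standing in for train_df, valid_df, or test_df.
toy = pd.DataFrame({
    "category": ["b", "e", "b", "t", "m", "e", "b"],
    "title": [f"headline {i}" for i in range(7)],
})

# One count per category, sorted by category code.
counts = toy["category"].value_counts().sort_index()
print(counts)

# Relative frequencies make class imbalance easier to compare across splits.
ratios = toy["category"].value_counts(normalize=True).sort_index()
```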


## 51. Feature Extraction

In [43]:
train_df = pd.read_table('../data/news_aggregator/train.txt')

In [42]:
val_df = pd.read_table('../data/news_aggregator/valid.txt')

In [40]:
test_df = pd.read_table('../data/news_aggregator/test.txt')

In [45]:
import re

| | id | title | url | publisher | category | story | hostname | timestamp |
|---|---|---|---|---|---|---|---|---|
| 0 | 192681 | Banks are managing lower liquidity on their ow... | http://www.reuters.com/article/2014/05/08/ecb-... | Reuters | b | dj2gaJQ71DfKWyMRvKbZUWIkrAKLM | www.reuters.com | 1399562324095 |
| 1 | 19596 | Miley Cyrus caught on video rapping the saucy ... | http://www.dailymail.co.uk/tvshowbiz/article-2... | Daily Mail | e | dl3DxJI4HM7nbaM0Ly9r0fp5LXOzM | www.dailymail.co.uk | 1395166785128 |
| 2 | 27098 | 27 Things You Need To Know About Happiness | http://www.huffingtonpost.com/2014/03/20/facts... | Huffington Post | e | dVdCQOMmrtem5KMPas9xyT4xvni7M | www.huffingtonpost.com | 1395331102743 |
| 3 | 57111 | UPDATE 1-Euro zone private sector loans contra... | http://in.reuters.com/article/2014/03/27/euroz... | Reuters | b | d_yCfTJxDUFGs_MQrL1DnBRuBd_eM | in.reuters.com | 1396014023049 |
| 4 | 152453 | Boeing's quarterly revenue rises 8 pct | http://in.reuters.com/article/2014/04/23/boein... | Reuters | b | d2aJePDim3-pgUMy9UuD2mMkzof7M | in.reuters.com | 1398272238506 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1329 | 210173 | Miley Cyrus shares bizarre painted face selfie... | http://www.dailymail.co.uk/tvshowbiz/article-2... | Daily Mail | e | denr4gBnqVUfO1MzvZYchw966BhmM | www.dailymail.co.uk | 1399981311319 |
| 1330 | 169152 | Forget 'Batman vs Superman', This Is Amazon vs... | http://www.contactmusic.com/article/netflix-am... | Contactmusic.com | e | dr3jk0vLDaSHQyMw4tJ2lufUc-AgM | www.contactmusic.com | 1398794555702 |
| 1331 | 349240 | CNH Tracker-Offshore yuan deposits to pick up ... | http://in.reuters.com/article/2014/07/03/marke... | Reuters | b | dkpRSIMBCMKcw0MGAdthwBjOXNMEM | in.reuters.com | 1404381641486 |
| 1332 | 196225 | Fed Proposes Rule Limiting Financial Firms' Co... | http://www.businessweek.com/news/2014-05-08/fe... | Businessweek | b | dQpUvgVmlSoB-sMIAoWv66Nmc4BdM | www.businessweek.com | 1399621743515 |
