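In this chapter, we use the News Aggregator Data Set published by Fabio Gasparetti to tackle a category classification task: assigning each news headline to one of four categories, "business", "science and technology", "entertainment", or "health".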

# Processing

### Obtaining and formatting the data

Download the News Aggregator Data Set and create the training data (train.txt), validation data (valid.txt), and evaluation data (test.txt) as follows.

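1. Unzip the downloaded zip file and read the description in readme.txt.
2. Extract only the examples (articles) whose source (publisher) is "Reuters", "Huffington Post", "Businessweek", "Contactmusic.com", or "Daily Mail".
3. Shuffle the extracted examples randomly.
4. Split the extracted examples so that 80% become the training data and the remaining 20% is divided evenly (10% each) into validation data and evaluation data, and save them as train.txt, valid.txt, and test.txt respectively. Write one example per line, in tab-separated format with the category name followed by the article headline (these files are reused later in Problem 70).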

After creating the training data and the evaluation data, check the number of examples in each category.
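
The four steps above can be sketched without any third-party library as follows. This is a minimal illustration on made-up records, not the pipeline used in the cells of this notebook: the `records` list, the `split_80_10_10` helper, and the fixed seed are all assumptions introduced here.

```python
import random

# Publishers to keep, as listed in step 2.
PUBLISHERS = {"Reuters", "Huffington Post", "Businessweek",
              "Contactmusic.com", "Daily Mail"}

# Made-up (publisher, category, title) records standing in for newsCorpora.csv.
records = [("Reuters", "b", f"business headline {i}") for i in range(12)]
records += [("Daily Mail", "e", f"entertainment headline {i}") for i in range(8)]
records += [("Livemint", "b", "dropped: publisher not in the list")]


def split_80_10_10(rows, seed=0):
    """Shuffle rows and cut them into 80% / 10% / 10% parts (hypothetical helper)."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)  # step 3: random order, reproducible via seed
    n_train = len(rows) * 8 // 10
    n_valid = (len(rows) - n_train) // 2
    train = rows[:n_train]
    valid = rows[n_train:n_train + n_valid]
    test = rows[n_train + n_valid:]
    return train, valid, test


# Step 2: keep only the selected publishers.
selected = [r for r in records if r[0] in PUBLISHERS]

# Steps 3 and 4: shuffle and split.
train, valid, test = split_80_10_10(selected)

# One example per line: category, a tab, then the headline.
train_lines = [f"{category}\t{title}" for _, category, title in train]
print(len(train), len(valid), len(test))
print(train_lines[0])
```

Slicing a shuffled list keeps the three parts disjoint by construction, which is the property the later `assert` in this notebook checks by comparing lengths.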

In [56]:
import sys, os
import pandas as pd

In [57]:
path = '../env/ref/NewsAggregatorDataset/'
os.listdir(path)

['2pageSessions.csv', 'newsCorpora.csv', 'readme.txt', '__MACOSX']

In [58]:
with open(path + 'readme.txt', encoding='utf-8') as f:
    readme = f.read()

In [59]:
readme.split('\n')

['SUMMARY: Dataset of references (urls) to news web pages',
 '',
 'DESCRIPTION: Dataset of references to news web pages collected from an online aggregator in the period from March 10 to August 10 of 2014. The resources are grouped into clusters that represent pages discussing the same news story. The dataset includes also references to web pages that point (has a link to) one of the news page in the collection.',
 '',
 'TAGS: web pages, news, aggregator, classification, clustering',
 '',
 'LICENSE: Public domain - Due to restrictions on content and use of the news sources, the corpus is limited to web references (urls) to web pages and does not include any text content. The references have been retrieved from the news aggregator through traditional web browsers. ',
 '',
 'FILE ENCODING: UTF-8',
 '',
 'FORMAT: Tab delimited CSV files. ',
 '',
 'DATA SHAPE AND STATS: 422937 news pages and divided up into:',
 '',
 '152746 \tnews of business category',
 '108465 \tnews of science and techn

In [60]:
# read_csv vs. read_table: they differ in the default delimiter,
# ',' for read_csv and '\t' (tab) for read_table.
# The README above says the data is tab-delimited, so we use read_table.

data = pd.read_table(path+'newsCorpora.csv', header=None)
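
As a small illustration of the delimiter point, the sketch below reads an in-memory tab-delimited string both ways and checks that the results agree. The two sample rows are shortened, made-up stand-ins for the real file, and passing `names=` at read time is only an alternative to assigning `data.columns` afterwards, not what this notebook does.

```python
import io
import pandas as pd

# Two made-up rows in the same 8-column, tab-delimited layout as newsCorpora.csv.
sample = (
    "1\tFed official says weak data caused by weather\thttp://example.com/a\t"
    "Los Angeles Times\tb\tstoryA\texample.com\t1394470370698\n"
    "2\tEurope reaches crunch point on banking union\thttp://example.com/b\t"
    "Reuters\tb\tstoryB\texample.com\t1394470501755\n"
)
columns = ['ID', 'TITLE', 'URL', 'PUBLISHER', 'CATEGORY', 'STORY', 'HOSTNAME', 'TIMESTAMP']

# read_table defaults to sep='\t'; read_csv needs the separator spelled out.
df_table = pd.read_table(io.StringIO(sample), header=None, names=columns)
df_csv = pd.read_csv(io.StringIO(sample), sep='\t', header=None, names=columns)

print(df_table[['PUBLISHER', 'CATEGORY', 'TITLE']])
```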

In [61]:
data.head()

Unnamed: 0,0,1,2,3,4,5,6,7
0,1,"Fed official says weak data caused by weather,...",http://www.latimes.com/business/money/la-fi-mo...,Los Angeles Times,b,ddUyU0VZz0BRneMioxUPQVP6sIxvM,www.latimes.com,1394470370698
1,2,Fed's Charles Plosser sees high bar for change...,http://www.livemint.com/Politics/H2EvwJSK2VE6O...,Livemint,b,ddUyU0VZz0BRneMioxUPQVP6sIxvM,www.livemint.com,1394470371207
2,3,US open: Stocks fall after Fed official hints ...,http://www.ifamagazine.com/news/us-open-stocks...,IFA Magazine,b,ddUyU0VZz0BRneMioxUPQVP6sIxvM,www.ifamagazine.com,1394470371550
3,4,"Fed risks falling 'behind the curve', Charles ...",http://www.ifamagazine.com/news/fed-risks-fall...,IFA Magazine,b,ddUyU0VZz0BRneMioxUPQVP6sIxvM,www.ifamagazine.com,1394470371793
4,5,Fed's Plosser: Nasty Weather Has Curbed Job Gr...,http://www.moneynews.com/Economy/federal-reser...,Moneynews,b,ddUyU0VZz0BRneMioxUPQVP6sIxvM,www.moneynews.com,1394470372027


In [62]:
data.columns = ['ID', 'TITLE', 'URL', 'PUBLISHER', 'CATEGORY', 'STORY', 'HOSTNAME', 'TIMESTAMP']

In [63]:
data.head()

Unnamed: 0,ID,TITLE,URL,PUBLISHER,CATEGORY,STORY,HOSTNAME,TIMESTAMP
0,1,"Fed official says weak data caused by weather,...",http://www.latimes.com/business/money/la-fi-mo...,Los Angeles Times,b,ddUyU0VZz0BRneMioxUPQVP6sIxvM,www.latimes.com,1394470370698
1,2,Fed's Charles Plosser sees high bar for change...,http://www.livemint.com/Politics/H2EvwJSK2VE6O...,Livemint,b,ddUyU0VZz0BRneMioxUPQVP6sIxvM,www.livemint.com,1394470371207
2,3,US open: Stocks fall after Fed official hints ...,http://www.ifamagazine.com/news/us-open-stocks...,IFA Magazine,b,ddUyU0VZz0BRneMioxUPQVP6sIxvM,www.ifamagazine.com,1394470371550
3,4,"Fed risks falling 'behind the curve', Charles ...",http://www.ifamagazine.com/news/fed-risks-fall...,IFA Magazine,b,ddUyU0VZz0BRneMioxUPQVP6sIxvM,www.ifamagazine.com,1394470371793
4,5,Fed's Plosser: Nasty Weather Has Curbed Job Gr...,http://www.moneynews.com/Economy/federal-reser...,Moneynews,b,ddUyU0VZz0BRneMioxUPQVP6sIxvM,www.moneynews.com,1394470372027


In [64]:
data = data[data['PUBLISHER'].isin(["Reuters", "Huffington Post", "Businessweek", "Contactmusic.com",  "Daily Mail"])]
data.head()

Unnamed: 0,ID,TITLE,URL,PUBLISHER,CATEGORY,STORY,HOSTNAME,TIMESTAMP
12,13,Europe reaches crunch point on banking union,http://in.reuters.com/article/2014/03/10/eu-ba...,Reuters,b,dPhGU51DcrolUIMxbRm0InaHGA2XM,in.reuters.com,1394470501755
13,14,ECB FOCUS-Stronger euro drowns out ECB's messa...,http://in.reuters.com/article/2014/03/10/ecb-p...,Reuters,b,dPhGU51DcrolUIMxbRm0InaHGA2XM,in.reuters.com,1394470501948
19,20,"Euro Anxieties Wane as Bunds Top Treasuries, S...",http://www.businessweek.com/news/2014-03-10/ge...,Businessweek,b,dPhGU51DcrolUIMxbRm0InaHGA2XM,www.businessweek.com,1394470503148
20,21,Noyer Says Strong Euro Creates Unwarranted Eco...,http://www.businessweek.com/news/2014-03-10/no...,Businessweek,b,dPhGU51DcrolUIMxbRm0InaHGA2XM,www.businessweek.com,1394470503366
29,30,REFILE-Bad loan triggers key feature in ECB ba...,http://in.reuters.com/article/2014/03/10/euroz...,Reuters,b,dPhGU51DcrolUIMxbRm0InaHGA2XM,in.reuters.com,1394470505070


In [65]:
data.groupby(['PUBLISHER']).size()

PUBLISHER
Businessweek        2395
Contactmusic.com    2334
Daily Mail          2254
Huffington Post     2455
Reuters             3902
dtype: int64

In [66]:
from sklearn.model_selection import train_test_split

In [67]:
# Shuffle all rows; drop=True in reset_index() keeps the old index from being added as a column
data = data.sample(frac=1).reset_index(drop=True)
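
Because `sample(frac=1)` draws a new permutation on every run, the resulting split changes each time the notebook is re-executed. If reproducibility matters, one option is to pass `random_state`. The sketch below uses a toy frame and an arbitrarily chosen seed, both of which are assumptions for illustration.

```python
import pandas as pd

# Toy frame standing in for the filtered data (made-up rows).
toy = pd.DataFrame({
    'CATEGORY': ['b', 'e', 't', 'm', 'b', 'e'],
    'TITLE': [f'headline {i}' for i in range(6)],
})

# Same seed, same permutation; drop=True discards the old index instead of
# keeping it as a new column.
shuffled_a = toy.sample(frac=1, random_state=42).reset_index(drop=True)
shuffled_b = toy.sample(frac=1, random_state=42).reset_index(drop=True)

print(shuffled_a)
```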

In [75]:
data.loc[:,['CATEGORY', 'TITLE']]

Unnamed: 0,CATEGORY,TITLE
0,b,NEW YORK (AP) — A former partial owner of the ...
1,e,"PICTURED: Jessica Simpson throws huge Red, Whi..."
2,e,Madonna reveals a little too much as she poses...
3,t,Huge Google Event Interrupted By Protester: 'D...
4,b,UPDATE 1-BOJ's Iwata signals chance of taperin...
...,...,...
13335,m,US syphilis rate up; mostly gay and bisexual men
13336,b,US Memorial Day Weekend Travel to Reach 9-Year...
13337,b,Yellen Says Financial Instability Shouldn't Pr...
13338,t,The world's fastest animal is Paratarsotomus m...


In [76]:
train_data, other_data = train_test_split(data, train_size=0.8)

In [77]:
valid_data, test_data = train_test_split(other_data, train_size=0.5)
del other_data

In [78]:
print(len(data))
print(len(train_data), len(valid_data), len(test_data) )

13340
10672 1334 1334
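
The sizes printed above follow from applying the two splits in sequence: 80% of 13340 is 10672, and the remaining 2668 examples are halved into 1334 each. The sketch below reproduces that arithmetic with integers only. It mirrors the printed numbers and is not a description of how `train_test_split` rounds internally when a size does not divide evenly.

```python
n_total = 13340

n_train = n_total * 8 // 10   # first split: train_size=0.8
n_other = n_total - n_train   # what is left for validation + test
n_valid = n_other // 2        # second split: train_size=0.5
n_test = n_other - n_valid

print(n_train, n_valid, n_test)  # 10672 1334 1334
```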


In [80]:
assert len(data) == len(train_data) + len(valid_data) + len(test_data), 'ERROR!'

In [81]:
# Write only the category and the headline, tab-separated, one example per line (no header row),
# as the task statement requires
columns = ['CATEGORY', 'TITLE']
train_data[columns].to_csv('../env/ref/ch06_train.txt', header=False, index=False, sep='\t')
valid_data[columns].to_csv('../env/ref/ch06_valid.txt', header=False, index=False, sep='\t')
test_data[columns].to_csv('../env/ref/ch06_test.txt', header=False, index=False, sep='\t')
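
The task statement also asks for the number of examples per category in the training and evaluation data, which the cells above do not show. A minimal sketch using `value_counts()` follows. The toy frames stand in for `train_data` and `test_data`, and the code-to-name mapping (`b`, `t`, `e`, `m`) is taken from the dataset's readme rather than from anything computed here.

```python
import pandas as pd

# Category codes used in the CATEGORY column and their names.
CATEGORY_NAMES = {'b': 'business', 't': 'science and technology',
                  'e': 'entertainment', 'm': 'health'}

# Toy stand-ins for train_data and test_data (made-up rows).
toy_train = pd.DataFrame({'CATEGORY': ['b', 'b', 'e', 't', 'e', 'b', 'm'],
                          'TITLE': [f'train headline {i}' for i in range(7)]})
toy_test = pd.DataFrame({'CATEGORY': ['e', 'b'],
                         'TITLE': ['test headline 0', 'test headline 1']})

# Count examples per category name in each split.
for name, frame in [('train', toy_train), ('test', toy_test)]:
    counts = frame['CATEGORY'].map(CATEGORY_NAMES).value_counts()
    print(name)
    print(counts)

train_counts = toy_train['CATEGORY'].value_counts().to_dict()
```

In the notebook itself, the equivalent calls would be `train_data['CATEGORY'].value_counts()` and `test_data['CATEGORY'].value_counts()`.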
