# Processing

### Obtaining and formatting the data

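In this chapter we use the News Aggregator Data Set published by Fabio Gasparetti to work on a category classification task: assigning each news headline to one of the categories "business", "science and technology", "entertainment", or "health".

Download the News Aggregator Data Set and create the training data (train.txt), validation data (valid.txt), and test data (test.txt) as follows.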

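1. Unzip the downloaded zip file and read the description in readme.txt.
2. Extract only the examples (articles) whose source (publisher) is "Reuters", "Huffington Post", "Businessweek", "Contactmusic.com", or "Daily Mail".
3. Shuffle the extracted examples randomly.
4. Split the extracted examples into 80% training data and 10% each for validation data and test data, and save them as train.txt, valid.txt, and test.txt. Write one example per line, in tab-separated format with the category name followed by the article headline (these files are reused later in problem 70).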

After creating the training data and the test data, check the number of examples in each category.

In [2]:
import sys, os
import pandas as pd

In [3]:
path = '../env/ref/NewsAggregatorDataset/'
os.listdir(path)

['2pageSessions.csv', 'newsCorpora.csv', 'readme.txt', '__MACOSX']

In [4]:
with open(path + 'readme.txt') as f:
    readme = f.read()

In [5]:
readme.split('\n')

['SUMMARY: Dataset of references (urls) to news web pages',
 '',
 'DESCRIPTION: Dataset of references to news web pages collected from an online aggregator in the period from March 10 to August 10 of 2014. The resources are grouped into clusters that represent pages discussing the same news story. The dataset includes also references to web pages that point (has a link to) one of the news page in the collection.',
 '',
 'TAGS: web pages, news, aggregator, classification, clustering',
 '',
 'LICENSE: Public domain - Due to restrictions on content and use of the news sources, the corpus is limited to web references (urls) to web pages and does not include any text content. The references have been retrieved from the news aggregator through traditional web browsers. ',
 '',
 'FILE ENCODING: UTF-8',
 '',
 'FORMAT: Tab delimited CSV files. ',
 '',
 'DATA SHAPE AND STATS: 422937 news pages and divided up into:',
 '',
 '152746 \tnews of business category',
 '108465 \tnews of science and techn

In [6]:
# read_csv vs. read_table:
# they differ in the default delimiter: ',' for read_csv, '\t' (tab) for read_table.
# As the README above says, this data is tab-delimited, so read_table is used here.

data = pd.read_table(path+'newsCorpora.csv', header=None)
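One detail worth checking at this step is how quote characters inside headlines are parsed. By default pandas treats `"` as a quoting character, so a field that opens a double quote without closing it on the same line can absorb the following lines. The sketch below uses a small made-up tab-separated sample (not the real file) to show the effect and how `quoting=csv.QUOTE_NONE` turns it off. Whether newsCorpora.csv is actually affected is an assumption that should be verified by comparing row counts against the readme.

```python
import csv
import io

import pandas as pd

# Made-up sample: the first headline opens a double quote that is only closed on the next line.
sample = 'A\t"Hello\tb\nB\tWorld"\te\nC\tFine\tt\n'

# Default quoting: the quoted region spans a line break, so two lines collapse into one row.
default_df = pd.read_table(io.StringIO(sample), header=None)

# QUOTE_NONE: every physical line becomes one row and the quote characters stay in the text.
raw_df = pd.read_table(io.StringIO(sample), header=None, quoting=csv.QUOTE_NONE)

print(len(default_df), len(raw_df))
```

If the two row counts differ on the real file, reading with `quoting=csv.QUOTE_NONE` is one way to keep every line as its own example.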

In [7]:
data.head()

Unnamed: 0,0,1,2,3,4,5,6,7
0,1,"Fed official says weak data caused by weather,...",http://www.latimes.com/business/money/la-fi-mo...,Los Angeles Times,b,ddUyU0VZz0BRneMioxUPQVP6sIxvM,www.latimes.com,1394470370698
1,2,Fed's Charles Plosser sees high bar for change...,http://www.livemint.com/Politics/H2EvwJSK2VE6O...,Livemint,b,ddUyU0VZz0BRneMioxUPQVP6sIxvM,www.livemint.com,1394470371207
2,3,US open: Stocks fall after Fed official hints ...,http://www.ifamagazine.com/news/us-open-stocks...,IFA Magazine,b,ddUyU0VZz0BRneMioxUPQVP6sIxvM,www.ifamagazine.com,1394470371550
3,4,"Fed risks falling 'behind the curve', Charles ...",http://www.ifamagazine.com/news/fed-risks-fall...,IFA Magazine,b,ddUyU0VZz0BRneMioxUPQVP6sIxvM,www.ifamagazine.com,1394470371793
4,5,Fed's Plosser: Nasty Weather Has Curbed Job Gr...,http://www.moneynews.com/Economy/federal-reser...,Moneynews,b,ddUyU0VZz0BRneMioxUPQVP6sIxvM,www.moneynews.com,1394470372027


In [8]:
data.columns = ['ID', 'TITLE', 'URL', 'PUBLISHER', 'CATEGORY', 'STORY', 'HOSTNAME', 'TIMESTAMP']

In [9]:
data.head()

Unnamed: 0,ID,TITLE,URL,PUBLISHER,CATEGORY,STORY,HOSTNAME,TIMESTAMP
0,1,"Fed official says weak data caused by weather,...",http://www.latimes.com/business/money/la-fi-mo...,Los Angeles Times,b,ddUyU0VZz0BRneMioxUPQVP6sIxvM,www.latimes.com,1394470370698
1,2,Fed's Charles Plosser sees high bar for change...,http://www.livemint.com/Politics/H2EvwJSK2VE6O...,Livemint,b,ddUyU0VZz0BRneMioxUPQVP6sIxvM,www.livemint.com,1394470371207
2,3,US open: Stocks fall after Fed official hints ...,http://www.ifamagazine.com/news/us-open-stocks...,IFA Magazine,b,ddUyU0VZz0BRneMioxUPQVP6sIxvM,www.ifamagazine.com,1394470371550
3,4,"Fed risks falling 'behind the curve', Charles ...",http://www.ifamagazine.com/news/fed-risks-fall...,IFA Magazine,b,ddUyU0VZz0BRneMioxUPQVP6sIxvM,www.ifamagazine.com,1394470371793
4,5,Fed's Plosser: Nasty Weather Has Curbed Job Gr...,http://www.moneynews.com/Economy/federal-reser...,Moneynews,b,ddUyU0VZz0BRneMioxUPQVP6sIxvM,www.moneynews.com,1394470372027


In [10]:
data = data[data['PUBLISHER'].isin(["Reuters", "Huffington Post", "Businessweek", "Contactmusic.com",  "Daily Mail"])]
data.head()

Unnamed: 0,ID,TITLE,URL,PUBLISHER,CATEGORY,STORY,HOSTNAME,TIMESTAMP
12,13,Europe reaches crunch point on banking union,http://in.reuters.com/article/2014/03/10/eu-ba...,Reuters,b,dPhGU51DcrolUIMxbRm0InaHGA2XM,in.reuters.com,1394470501755
13,14,ECB FOCUS-Stronger euro drowns out ECB's messa...,http://in.reuters.com/article/2014/03/10/ecb-p...,Reuters,b,dPhGU51DcrolUIMxbRm0InaHGA2XM,in.reuters.com,1394470501948
19,20,"Euro Anxieties Wane as Bunds Top Treasuries, S...",http://www.businessweek.com/news/2014-03-10/ge...,Businessweek,b,dPhGU51DcrolUIMxbRm0InaHGA2XM,www.businessweek.com,1394470503148
20,21,Noyer Says Strong Euro Creates Unwarranted Eco...,http://www.businessweek.com/news/2014-03-10/no...,Businessweek,b,dPhGU51DcrolUIMxbRm0InaHGA2XM,www.businessweek.com,1394470503366
29,30,REFILE-Bad loan triggers key feature in ECB ba...,http://in.reuters.com/article/2014/03/10/euroz...,Reuters,b,dPhGU51DcrolUIMxbRm0InaHGA2XM,in.reuters.com,1394470505070


In [11]:
data.groupby(['PUBLISHER']).size()

PUBLISHER
Businessweek        2395
Contactmusic.com    2334
Daily Mail          2254
Huffington Post     2455
Reuters             3902
dtype: int64

In [12]:
from sklearn.model_selection import train_test_split

In [13]:
# drop=True in reset_index() discards the old index instead of keeping it as a column
data = data.sample(frac=1).reset_index(drop=True)

In [14]:
data.loc[:,['CATEGORY', 'TITLE']]

Unnamed: 0,CATEGORY,TITLE
0,b,FedEx criminally indicted by Justice Departmen...
1,b,Is an Obamacare bailout worth BILLIONS on the ...
2,e,DETROIT (AP) — An arrest warrant has been issu...
3,e,Megan Fox is stunning as April O'Neil as she s...
4,e,Michelle Fairley Will Not Return To 'Game Of T...
...,...,...
13335,e,Andrew Garfield hangs out with Angelina Jolie'...
13336,t,Oculus Buys Startup RakNet Ahead of Facebook D...
13337,b,India election commission allows cbank to anno...
13338,e,Avicii Drops Out of Ultra Music Festival After...


In [21]:
data = data.loc[:, ['CATEGORY', 'TITLE']]

In [22]:
data.head()

Unnamed: 0,CATEGORY,TITLE
0,b,FedEx criminally indicted by Justice Departmen...
1,b,Is an Obamacare bailout worth BILLIONS on the ...
2,e,DETROIT (AP) — An arrest warrant has been issu...
3,e,Megan Fox is stunning as April O'Neil as she s...
4,e,Michelle Fairley Will Not Return To 'Game Of T...


In [23]:
train_data, other_data = train_test_split(data, train_size=0.8)

In [24]:
valid_data, test_data = train_test_split(other_data, train_size=0.5)
del other_data

In [25]:
print(len(data))
print(len(train_data), len(valid_data), len(test_data) )

13340
10672 1334 1334
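The split above draws different rows on every run because no seed is fixed. As a minimal sketch, assuming reproducibility and similar category proportions across the splits are wanted, `random_state` and `stratify` can be passed to `train_test_split`. The toy frame and the seed value 42 are made up for illustration.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy frame standing in for the filtered headlines (hypothetical data).
toy = pd.DataFrame({
    'CATEGORY': ['b'] * 50 + ['e'] * 30 + ['t'] * 10 + ['m'] * 10,
    'TITLE': [f'title {i}' for i in range(100)],
})

# 80% train; stratify keeps the category proportions similar in each part.
train, rest = train_test_split(
    toy, train_size=0.8, shuffle=True, random_state=42, stratify=toy['CATEGORY'])

# Split the remaining 20% in half: 10% validation, 10% test.
valid, test = train_test_split(
    rest, train_size=0.5, shuffle=True, random_state=42, stratify=rest['CATEGORY'])

print(len(train), len(valid), len(test))
```

Because `train_test_split` shuffles by default, a separate `sample(frac=1)` step is not strictly needed when the split is done this way.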


In [26]:
assert len(data) == len(train_data) + len(valid_data) + len(test_data), 'ERROR!'

In [27]:
# header=False: the task asks for exactly one example per line, with no header row
train_data.to_csv('../env/ref/ch06_train.txt', header=False, index=False, sep='\t')
valid_data.to_csv('../env/ref/ch06_valid.txt', header=False, index=False, sep='\t')
test_data.to_csv('../env/ref/ch06_test.txt', header=False, index=False, sep='\t')
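To confirm that a saved file matches the "category, tab, headline" layout the task describes, it can help to write a tiny frame and read it back. This is a sketch on hypothetical data written to a temporary directory; it assumes the header-less layout from the task statement, and `toy_train.txt` is a made-up file name.

```python
import os
import tempfile

import pandas as pd

# Hypothetical two-row frame in the same CATEGORY/TITLE layout.
toy = pd.DataFrame({'CATEGORY': ['b', 'e'],
                    'TITLE': ['Tycoon buys homeless lunch', 'Megan Fox is stunning']})

with tempfile.TemporaryDirectory() as tmp:
    out_path = os.path.join(tmp, 'toy_train.txt')
    # No header, no index: each line is "<category>\t<title>".
    toy.to_csv(out_path, sep='\t', header=False, index=False)

    with open(out_path, encoding='utf-8') as f:
        lines = f.read().splitlines()

    # Without a header row, the column names must be supplied when reading back.
    restored = pd.read_table(out_path, header=None, names=['CATEGORY', 'TITLE'])

print(lines[0])
```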

In [28]:
train_data.head()

Unnamed: 0,CATEGORY,TITLE
6720,b,Tycoon buys homeless lunch
13224,b,ECB's Constancio sees inflation below 1 pct fo...
4299,e,Drew Barrymore And Husband Will Kopelman Welco...
7596,b,UPDATE 2-Media executives question Comcast-Tim...
1419,m,"Deadly Ebola could affect up to 20000 people, ..."
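The task also asks for the number of examples per category in the training and test data. A minimal sketch of that check with `value_counts()` follows; the miniature frames are hypothetical stand-ins, and in the notebook the same call would be applied to `train_data['CATEGORY']` and `test_data['CATEGORY']`.

```python
import pandas as pd

# Hypothetical miniature splits standing in for train_data and test_data.
toy_train = pd.DataFrame({'CATEGORY': ['b', 'b', 'e', 'm', 't', 'e', 'b']})
toy_test = pd.DataFrame({'CATEGORY': ['e', 'b', 'e']})

# value_counts() returns the number of rows per category, most frequent first.
train_counts = toy_train['CATEGORY'].value_counts()
test_counts = toy_test['CATEGORY'].value_counts()

print(train_counts)
print(test_counts)
```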


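### Feature extraction

Extract features from the training, validation, and test data, and save them as train.feature.txt, valid.feature.txt, and test.feature.txt. Design whatever features seem useful for category classification. Converting each headline into a sequence of words is probably the minimal baseline.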
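As a point of comparison for the word-sequence baseline, scikit-learn's `CountVectorizer` builds bag-of-words count features in a few lines. This is a swapped-in library approach rather than the hand-written one this notebook develops, and the two headlines are hypothetical stand-ins for `train_data['TITLE']`.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical headlines standing in for train_data['TITLE'].
titles = ['Fed official says weak data', 'Weak data caused by weather']

# Default settings lowercase the text and keep tokens of two or more word characters.
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(titles)  # sparse matrix: one row per title

vocab = vectorizer.vocabulary_        # maps word -> column index
print(X.shape)
print(X[1, vocab['weather']])
```

With real data the vectorizer would be fitted on the training headlines only and then applied to the validation and test headlines with `transform`, so that all three share the same columns.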

In [47]:
train_data['TITLE'][:4]

6720                            Tycoon buys homeless lunch
13224    ECB's Constancio sees inflation below 1 pct fo...
4299     Drew Barrymore And Husband Will Kopelman Welco...
7596     UPDATE 2-Media executives question Comcast-Tim...
Name: TITLE, dtype: object

In [83]:
sample_text = train_data.iloc[4,:].iloc[-1]
print(sample_text)

Deadly Ebola could affect up to 20000 people, say world health chiefs as they  ...


In [84]:
import re

In [85]:
re.sub(r'[,.]', '', sample_text.lower())

'deadly ebola could affect up to 20000 people say world health chiefs as they  '

In [118]:
re.sub(r'[,.]', '', sample_text.lower()).strip(' ')

'deadly ebola could affect up to 20000 people say world health chiefs as they'

In [120]:
re.sub(r'[,.]', '', sample_text.lower()).strip(' ').split(' ')

['deadly',
 'ebola',
 'could',
 'affect',
 'up',
 'to',
 '20000',
 'people',
 'say',
 'world',
 'health',
 'chiefs',
 'as',
 'they']
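The three steps tried above (lowercase, remove commas and periods, split into words) can be wrapped in one helper so every later cell tokenizes the same way. This is a sketch; the name `tokenize` is made up here. It calls `split()` with no argument, which differs slightly from `split(' ')`: it splits on any run of whitespace and never yields empty strings.

```python
import re


def tokenize(title):
    """Lowercase, drop commas and periods, and split on any run of whitespace."""
    cleaned = re.sub(r'[,.]', '', title.lower())
    # split() with no argument never returns empty strings,
    # unlike split(' ') when two spaces appear in a row.
    return cleaned.split()


tokens = tokenize('Deadly Ebola could affect up to 20000 people, '
                  'say world health chiefs as they  ...')
print(tokens)
```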

In [99]:
4%3

1

In [102]:
len(data)

13340

In [133]:
_text

'Boy, oh boy! Kristen Stewart and Anne Hathaway dress in drag for new music  ...'

In [134]:
unique_words = []
seen = set()  # set membership is O(1); the list keeps first-seen order
for i in range(len(data)):
    if i % 1000 == 0:
        print(f'{i:> 6} Passed')

    # Select the headline by column name rather than by position
    _text = data['TITLE'].iloc[i]
    _words = re.sub(r'[,.]', '', _text.lower()).strip(' ').split(' ')

    for k in _words:
        if k not in seen:
            seen.add(k)
            unique_words.append(k)

print('Done')

     0 Passed
  1000 Passed
  2000 Passed
  3000 Passed
  4000 Passed
  5000 Passed
  6000 Passed
  7000 Passed
  8000 Passed
  9000 Passed
 10000 Passed
 11000 Passed
 12000 Passed
 13000 Passed
Done


In [135]:
len(unique_words)

20347

In [87]:
len(data)

13340

In [145]:
unique_words[:4]

['fedex', 'criminally', 'indicted', 'by']

In [143]:
# Count how many times each word occurs across all titles
words_dict = {}
for k in unique_words:
    words_dict[k] = 0

for i in range(len(data)):
    if i % 1000 == 0:
        print(f'{i:> 6} titles passed')

    # Select the headline by column name rather than by position
    _text = data['TITLE'].iloc[i]
    _words = re.sub(r'[,.]', '', _text.lower()).strip(' ').split(' ')

    for k in _words:
        words_dict[k] += 1

print('Done')
    


     0 titles passed
  1000 titles passed
  2000 titles passed
  3000 titles passed
  4000 titles passed
  5000 titles passed
  6000 titles passed
  7000 titles passed
  8000 titles passed
  9000 titles passed
 10000 titles passed
 11000 titles passed
 12000 titles passed
 13000 titles passed
Done


In [144]:
words_dict['uk']

102
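The vocabulary pass and the counting pass can also be done in one loop with `collections.Counter` from the standard library. This is a sketch on two hypothetical headlines; in the notebook the loop would run over `data['TITLE']`.

```python
import re
from collections import Counter

# Hypothetical headlines; the notebook would iterate over data['TITLE'] instead.
titles = ['UK stocks rise', 'UK inflation falls, stocks steady']

word_counts = Counter()
for title in titles:
    words = re.sub(r'[,.]', '', title.lower()).split()
    word_counts.update(words)  # adds 1 per occurrence, creating keys on first sight

print(word_counts['uk'])
print(word_counts.most_common(2))
```

A `Counter` keeps its keys in first-seen order, so `list(word_counts)` gives the vocabulary in first-seen order without a separate pass.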

In [147]:
import numpy as np

In [157]:
np.zeros((1,len(unique_words)))

array([[0., 0., 0., ..., 0., 0., 0.]])

In [158]:
unique_words.index('uk')

1656

In [159]:
unique_words[1656]

'uk'

In [179]:
data.head()

Unnamed: 0,CATEGORY,TITLE,WordVec
0,b,FedEx criminally indicted by Justice Departmen...,
1,b,Is an Obamacare bailout worth BILLIONS on the ...,
2,e,DETROIT (AP) — An arrest warrant has been issu...,
3,e,Megan Fox is stunning as April O'Neil as she s...,
4,e,Michelle Fairley Will Not Return To 'Game Of T...,


In [180]:
data.iloc[i,:].iloc['WordVec'] = [1,2,3]

ValueError: Can only index by location with a [integer, integer slice (START point is INCLUDED, END point is EXCLUDED), listlike of integers, boolean array]
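The error comes from passing a column label to `.iloc`, which accepts only integer positions. Labels go through `.loc` (or `.at` for a single cell). A short sketch on a hypothetical two-row frame shows the two routes side by side.

```python
import pandas as pd

# Hypothetical two-row frame with the same column names as the notebook.
toy = pd.DataFrame({'CATEGORY': ['b', 'e'], 'TITLE': ['first title', 'second title']})

# Label-based: row label 1, column label 'TITLE'.
by_label = toy.loc[1, 'TITLE']

# Position-based: translate the column label into its integer position first.
by_position = toy.iloc[1, toy.columns.get_loc('TITLE')]

# .at is the label-based accessor for a single cell; it also supports assignment.
toy.at[0, 'TITLE'] = 'updated title'

print(by_label, by_position, toy.at[0, 'TITLE'])
```

For per-title word vectors, a separate NumPy array with one row per title is usually easier to work with than storing a list inside each DataFrame cell.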

In [177]:
data.iloc[3, :].loc['TITLE']

"Megan Fox is stunning as April O'Neil as she springs into action in new Teenage  ..."

In [None]:
# Count word occurrences per title and store them as vectors.
# Only the first 10 titles are processed while prototyping.
n_titles = 10
word_vecs = np.zeros((n_titles, len(unique_words)))

for i in range(n_titles):
    # Column labels go through [] or .loc; .iloc accepts integer positions only
    _text = data['TITLE'].iloc[i]
    _words = re.sub(r'[,.]', '', _text.lower()).strip(' ').split(' ')

    for k in _words:
        _index = unique_words.index(k)
        # Increment this title's count for the word (words_dict is left untouched)
        word_vecs[i, _index] += 1

print('Done')
    
    

# Convert to a dense matrix
