# Natural Language Processing: Sentiment Analysis

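## 1. The 7 Steps of Natural Language Processing

#### ① Prepare natural-language text
吾輩は猫である。 ("I am a cat.")

#### ② Split the text into words (morphological analysis)
吾輩　／　は　／　猫　／　で　／　ある　／　。

#### ③ Remove particles ("te-ni-wo-ha") and other function words (data cleansing)
吾輩　／　猫　　　※ everything else is removed

#### ④ Reduce each word to its base form, e.g. running → run (stemming)
Not needed for this example

#### ⑤ Count how often each word occurs (BoW)
吾輩: 1, 猫: 1

#### ⑥ Compute the importance of each word (TF-IDF) and choose the feature words
吾輩: 0.3, 猫: 0.56　→　the feature word is 「猫」 (cat)

#### ⑦ Distinguish the text from other texts (classification)
Feature word of Natsume Soseki's 「吾輩は猫である」 (I Am a Cat): 猫 (cat)<br>
Feature word of Mori Ogai's 「舞姫」 (The Dancing Girl): 姫 (princess)

Feed in one sentence from either novel　→　"This sentence comes from a novel by Natsume Soseki"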
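Steps ① to ⑤ can be sketched in a few lines of plain Python. This is only an illustration on an English sentence: the stop-word set `STOP_WORDS` and the helper `crude_stem` are hypothetical stand-ins invented for this sketch, not the NLTK tools used later in the notebook.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stop-word list (illustration only).
STOP_WORDS = {"i", "am", "a", "the", "is", "and"}

def crude_stem(word):
    """Toy suffix stripper (NOT the Porter algorithm): drops a trailing 's'."""
    return word[:-1] if word.endswith("s") and len(word) > 3 else word

text = "I am a cat. The cat likes cats."               # step 1: raw text
tokens = re.findall(r"[a-z]+", text.lower())           # step 2: split into words
content = [t for t in tokens if t not in STOP_WORDS]   # step 3: drop function words
stems = [crude_stem(t) for t in content]               # step 4: reduce to a base form
bow = Counter(stems)                                   # step 5: count occurrences

print(bow)
```

Steps ⑥ and ⑦ (TF-IDF weighting and classification) need more than one document and are covered in the sections that follow.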

## 2. Obtaining the IMDb Dataset

This notebook uses the following dataset.

・[IMDb](http://ai.stanford.edu/~amaas/data/sentiment/): a dataset of movie reviews

In [1]:
import os
import re
import sys
import time
import nltk
import optuna
import pyprind
import numpy as np
import pandas as pd
import lightgbm as lgb

from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from nltk import word_tokenize
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import accuracy_score

In [2]:
pd.set_option("display.max_colwidth", 150)

In [3]:
df = pd.read_csv('../input/movie-datacsv/movie_data.csv')

print(df.shape)
df.head(10)

(50000, 2)


Unnamed: 0,review,sentiment
0,"In 1974, the teenager Martha Moxley (Maggie Grace) moves to the high-class area of Belle Haven, Greenwich, Connecticut. On the Mischief Night, eve...",1
1,OK... so... I really like Kris Kristofferson and his usual easy going delivery of lines in his movies. Age has helped him with his soft spoken low...,0
2,"***SPOILER*** Do not read this, if you think about watching that movie, although it would be a waste of time. (By the way: The plot is so predicta...",0
3,hi for all the people who have seen this wonderful movie im sure thet you would have liked it as much as i. i love the songs once you have seen th...,1
4,"I recently bought the DVD, forgetting just how much I hated the movie version of ""A Chorus Line."" Every change the director Attenborough made to t...",0
5,Leave it to Braik to put on a good show. Finally he and Zorak are living their own lives outside of Spac Ghost Coast To Coast. I have to say that ...,1
6,"Nathan Detroit (Frank Sinatra) is the manager of the New York's longest- established floating craps game, and he needs $1000 to secure a new locat...",1
7,"To understand ""Crash Course"" in the right context, you must understand the 80's in TV. Most TV shows didn't have any point. The sitcom outpopulate...",1
8,"I've been impressed with Chavez's stance against globalisation for sometime now, but it wasn't until I saw the film at the Amsterdam documentary i...",1
9,This movie is directed by Renny Harlin the finnish miracle. Stallone is Gabe Walker. Cat and Mouse on the mountains with ruthless terrorists. Renn...,1


In [4]:
pd.DataFrame([['review', 'Text of the movie review'],
              ['sentiment', 'Sentiment (0: negative, 1: positive)']],
              columns=['Column', 'Meaning'])

Unnamed: 0,Column,Meaning
0,review,Text of the movie review
1,sentiment,"Sentiment (0: negative, 1: positive)"


## 3. BoW Vectors

**Bag of Words (BoW)** is one way to turn text into numbers (vectorize it).

### 3.1 Converting Words into Feature Vectors

In [5]:
docs = np.array(['The sun is shining',
                 'The weather is sweet',
                 'The sun is shining, the weather is sweet, and one and one is two'])

Use CountVectorizer(), implemented in scikit-learn, to count the words and build the BoW vectors.

In [6]:
count = CountVectorizer()
bag = count.fit_transform(docs)

This builds a BoW model of the three sentences above.<br>
Each word is assigned a number (index); check the assignments.

In [7]:
print('word:index', count.vocabulary_)

word:index {'the': 6, 'sun': 4, 'is': 1, 'shining': 3, 'weather': 8, 'sweet': 5, 'and': 0, 'one': 2, 'two': 7}


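Look up the frequency of "the" in the third sentence.<br>
Counting from 0, the row index of that sentence is 2, and the column is selected by passing the word to vocabulary_.

In [8]:
print('Frequency:', bag.toarray()[2][count.vocabulary_['the']])

Frequency: 2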
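To see what the vectorizer is doing, the same vocabulary and counts can be rebuilt by hand. The sketch below assumes CountVectorizer's default settings (lowercasing, and a token pattern that keeps runs of two or more word characters); `tokenize` and `to_bow` are hypothetical helper names used only here.

```python
import re
from collections import Counter

docs = ['The sun is shining',
        'The weather is sweet',
        'The sun is shining, the weather is sweet, and one and one is two']

def tokenize(doc):
    # Lowercase, then keep runs of two or more word characters.
    return re.findall(r"\b\w\w+\b", doc.lower())

# Vocabulary: every distinct token, indexed in alphabetical order.
all_tokens = sorted({t for d in docs for t in tokenize(d)})
vocabulary = {word: i for i, word in enumerate(all_tokens)}

def to_bow(doc):
    counts = Counter(tokenize(doc))
    # One count per vocabulary word, in index order.
    return [counts[word] for word in all_tokens]

bow_rows = [to_bow(d) for d in docs]

print(vocabulary)
print(bow_rows[2][vocabulary['the']])  # count of "the" in the third sentence
```

The third row of `bow_rows` should agree with the lookup in the cell above: "the" occurs twice in the third sentence.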


Next, output the feature vectors.

In [9]:
pd.DataFrame(bag.toarray())

Unnamed: 0,0,1,2,3,4,5,6,7,8
0,0,1,0,1,1,0,1,0,0
1,0,1,0,0,0,1,1,0,1
2,2,3,2,1,1,1,2,1,1


The rows are sentences 1 to 3, and the columns are the index numbers of the words.<br>
Redisplay the table with labels to make it easier to read.

In [10]:
pd.DataFrame(bag.toarray(),
             index=['The sun is shining',
                    'The weather is sweet',
                    'The sun is shining, the weather is sweet, and one and one is two'],
             columns=['and', 'is', 'one', 'shining', 'sun', 'sweet', 'the', 'two', 'weather'])

Unnamed: 0,and,is,one,shining,sun,sweet,the,two,weather
The sun is shining,0,1,0,1,1,0,1,0,0
The weather is sweet,0,1,0,0,0,1,1,0,1
"The sun is shining, the weather is sweet, and one and one is two",2,3,2,1,1,1,2,1,1


## 4. Evaluating Word Relevance with TF-IDF

$\rm{TF}$: how frequently the target word appears in a given document (frequency)<br>
$\rm{IDF}$: how rarely sentences containing the target word appear in the text as a whole (rarity)<br>
$\rm{TF\text{-}IDF}$: short for Term Frequency – Inverse Document Frequency (a weighting measure)

For example, suppose a novel consists of the following.

・$D$ sentences<br>
・$N$ words

If the word $x$ appears $n$ times in the novel, then
$$\begin{eqnarray}
{\rm TF}=\frac{n}{N}
\end{eqnarray}$$
If $d$ sentences in the novel contain the word $x$, then
$$\begin{eqnarray}
{\rm IDF}=-\log_{10}\frac{d}{D}=\log_{10}\frac{D}{d}
\end{eqnarray}$$
Multiplying the two gives
$$\begin{eqnarray}
{\rm TF\text{-}IDF}=\frac{n}{N}\log_{10}\frac{D}{d}
\end{eqnarray}$$

Example

The passage below is an excerpt from Osamu Dazai's 「一歩前進二歩退却」.<br>
Find the TF-IDF of the word 「作家」 ("writer").

-----

　作家は、いよいよ窮屈である。何せ、眼光紙背に徹する読者ばかりを<br>
相手にしているのだから、うっかりできない。あんまり緊張して、ついには<br>
机のまえに端座したまま、そのまま、沈黙は金、という格言を底知れず肯定している。<br>
そんなあわれな作家さえ出て来ぬともかぎらない。<br>
　謙譲を、作家のみ要求し、作家は大いに恐縮し、卑屈なほどへりくだって<br>
そうして読者は旦那である。作家の私生活、底の底まで剥ごうとする。<br>
失敗である。安売りにしていいのは作品である。作家の人間までを売ってはいない。<br>
謙譲は、読者にこそ之を要求したい。

-----

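The following were counted.<br>
<br>
Number of words $N$: 150<br>
Number of distinct words: 48<br>
Number of sentences $D$: 10

Next, focus on the word $x$ = 「作家」.

Occurrences $n$: 6<br>
Sentences containing it $d$: 5

The values are therefore as follows.

$$\begin{eqnarray}
{\rm TF}=\frac{n}{N}=\frac{6}{150}=0.04
\end{eqnarray}$$

$$\begin{eqnarray}
{\rm IDF}=\log_{10}\frac{D}{d}=\log_{10}\frac{10}{5}=0.301
\end{eqnarray}$$

$$\begin{eqnarray}
{\rm TF\text{-}IDF}={\rm TF}\times{\rm IDF}=0.04\times0.301\approx0.0120
\end{eqnarray}$$

In real natural language processing, this calculation is repeated for all 48 distinct words.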
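The arithmetic of the worked example can be checked with a few lines of Python. The function name `tf_idf` is a hypothetical helper introduced for this sketch; it implements the base-10 textbook formula from this section, not scikit-learn's variant.

```python
import math

def tf_idf(n, N, d, D):
    """Textbook TF-IDF: (n / N) * log10(D / d)."""
    tf = n / N
    idf = math.log10(D / d)
    return tf, idf, tf * idf

# Counts for the word 「作家」 in the excerpt above.
tf, idf, score = tf_idf(n=6, N=150, d=5, D=10)
print(round(tf, 2), round(idf, 3), round(score, 4))
```

Calling the same function for each of the 48 distinct words would produce the full set of weights for the excerpt.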

### 4.1 Implementing TF-IDF with scikit-learn

scikit-learn provides a class called TfidfTransformer.<br>
It takes the raw term counts produced by CountVectorizer's fit_transform method as input and converts them to TF-IDF values.

In [11]:
tfidf = TfidfTransformer(use_idf=True, norm='l2', smooth_idf=True)
tfidf_vec = tfidf.fit_transform(count.fit_transform(docs)).toarray()

pd.DataFrame(tfidf_vec.round(2))

Unnamed: 0,0,1,2,3,4,5,6,7,8
0,0.0,0.43,0.0,0.56,0.56,0.0,0.43,0.0,0.0
1,0.0,0.43,0.0,0.0,0.0,0.56,0.43,0.0,0.56
2,0.5,0.45,0.5,0.19,0.19,0.19,0.3,0.25,0.19


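smooth_idf: defaults to True; adds 1 to every document frequency so that division by zero cannot occur<br>
use_idf: whether to apply IDF weighting (defaults to True)<br>
norm: the normalization to apply (defaults to 'l2', which scales each sentence's vector to length 1)<br>
toarray(): converts the sparse result to a dense array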
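With these settings scikit-learn does not use the base-10 textbook formula from the start of this section. Its documentation gives the smoothed IDF as ln((1 + n) / (1 + df)) + 1, multiplied by the raw count and followed by L2 normalization of each row. The sketch below redoes that calculation in plain Python on the BoW counts shown earlier; `tfidf_row` is a hypothetical helper name.

```python
import math

# Raw counts from the BoW table.
# Columns: and, is, one, shining, sun, sweet, the, two, weather
counts = [[0, 1, 0, 1, 1, 0, 1, 0, 0],
          [0, 1, 0, 0, 0, 1, 1, 0, 1],
          [2, 3, 2, 1, 1, 1, 2, 1, 1]]

n_docs = len(counts)
# Document frequency: number of sentences in which each word occurs.
doc_freq = [sum(1 for row in counts if row[j] > 0) for j in range(len(counts[0]))]
# Smoothed IDF with natural log; the trailing +1 keeps ubiquitous words above zero.
idf = [math.log((1 + n_docs) / (1 + df_j)) + 1 for df_j in doc_freq]

def tfidf_row(row):
    raw = [tf * w for tf, w in zip(row, idf)]
    length = math.sqrt(sum(v * v for v in raw))  # L2 norm of the row
    return [round(v / length, 2) for v in raw]

tfidf_rows = [tfidf_row(row) for row in counts]
for row in tfidf_rows:
    print(row)
```

The rounded rows should match the table printed by the cell above, which is why "is" scores 0.45 in the third sentence even though its IDF is the minimum possible value of 1.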

Labeling the rows with their sentences and the columns with their words gives the following.

In [12]:
tfidf_df = pd.DataFrame(tfidf_vec.round(2),
                        index=['The sun is shining',
                               'The weather is sweet',
                               'The sun is shining, the weather is sweet, and one and one is two'],
                        columns=['and', 'is', 'one', 'shining', 'sun', 'sweet', 'the', 'two', 'weather'])

tfidf_df.head()

Unnamed: 0,and,is,one,shining,sun,sweet,the,two,weather
The sun is shining,0.0,0.43,0.0,0.56,0.56,0.0,0.43,0.0,0.0
The weather is sweet,0.0,0.43,0.0,0.0,0.0,0.56,0.43,0.0,0.56
"The sun is shining, the weather is sweet, and one and one is two",0.5,0.45,0.5,0.19,0.19,0.19,0.3,0.25,0.19


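"is" appears in all three sentences, so it is treated as a grammatical word that any sentence needs rather than as a feature word.<br>
Its modest TF-IDF in the third sentence (0.45), even though it occurs three times there, is consistent with this.

In contrast, "one" occurs twice and only in the third sentence,<br>
so its TF-IDF (0.5) is slightly higher.

## 5. Cleansing

First, print the last 50 characters of the first review in the movie review dataset.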

In [13]:
df.loc[0, 'review'][-50:]

'is seven.<br /><br />Title (Brazil): Not Available'

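The output contains a lot of unwanted punctuation and non-alphabetic characters.<br>
Only emoticon-like symbols such as ":)", which may help with sentiment analysis, are kept; all other non-word characters are removed.<br>
Python's regular expression library (re) is used for this.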

In [14]:
def preprocessor(text):
    text = re.sub(r'<[^>]*>', '', text)
    emoticons = re.findall(r'(?::|;|=)(?:-)?(?:\)|\(|D|P)', text)
    text = (re.sub(r'[\W]+', ' ', text.lower()) + ' '.join(emoticons).replace('-', ''))
    return text

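re.sub: replaces the text matched by a regular expression<br>
str.lower(): converts all characters to lowercase

Line 2: removes HTML markup using the regular expression <[^>]*><br>
Line 3: finds emoticons and stores them in emoticons<br>
Line 4: replaces each run of non-word characters with a space using [\W]+, lowercases the text, then appends the emoticons with the "-" inside them removed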
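The three stages are easier to follow when each intermediate value is kept in its own variable. The sketch below applies the same regular expressions as preprocessor to one sample string; the variable names are chosen for this illustration only.

```python
import re

sample = "</a>This :) is :( a test :-)!"

# Stage 1: strip anything that looks like an HTML tag.
no_html = re.sub(r'<[^>]*>', '', sample)
# Stage 2: collect emoticons (eyes, optional nose, mouth) before they are destroyed.
emoticons = re.findall(r'(?::|;|=)(?:-)?(?:\)|\(|D|P)', no_html)
# Stage 3: lowercase and collapse every run of non-word characters into one space.
words = re.sub(r'[\W]+', ' ', no_html.lower())
# Stage 4: append the emoticons, dropping the "-" nose so ":-)" and ":)" coincide.
cleaned = words + ' '.join(emoticons).replace('-', '')

print(repr(no_html))
print(emoticons)
print(repr(words))
print(repr(cleaned))
```

Collecting the emoticons before stage 3 matters, because the non-word substitution would otherwise erase them along with the rest of the punctuation.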

In [15]:
print('Before:', df.loc[0, 'review'][-50:])
print('After:', preprocessor(df.loc[0, 'review'][-50:]))

Before: is seven.<br /><br />Title (Brazil): Not Available
After: is seven title brazil not available


In [16]:
print(preprocessor("</a>This :) is :( a test :-)!"))
print(preprocessor("!\/.i like ;.::python/:]/];/]"))
print(preprocessor('machine\::lear\[:::nig];@[/]'))

this is a test :) :( :)
 i like python 
machine lear nig 


Display the first 10 rows of the movie review dataset again.

In [17]:
print(df.shape)
df.head(10)

(50000, 2)


Unnamed: 0,review,sentiment
0,"In 1974, the teenager Martha Moxley (Maggie Grace) moves to the high-class area of Belle Haven, Greenwich, Connecticut. On the Mischief Night, eve...",1
1,OK... so... I really like Kris Kristofferson and his usual easy going delivery of lines in his movies. Age has helped him with his soft spoken low...,0
2,"***SPOILER*** Do not read this, if you think about watching that movie, although it would be a waste of time. (By the way: The plot is so predicta...",0
3,hi for all the people who have seen this wonderful movie im sure thet you would have liked it as much as i. i love the songs once you have seen th...,1
4,"I recently bought the DVD, forgetting just how much I hated the movie version of ""A Chorus Line."" Every change the director Attenborough made to t...",0
5,Leave it to Braik to put on a good show. Finally he and Zorak are living their own lives outside of Spac Ghost Coast To Coast. I have to say that ...,1
6,"Nathan Detroit (Frank Sinatra) is the manager of the New York's longest- established floating craps game, and he needs $1000 to secure a new locat...",1
7,"To understand ""Crash Course"" in the right context, you must understand the 80's in TV. Most TV shows didn't have any point. The sitcom outpopulate...",1
8,"I've been impressed with Chavez's stance against globalisation for sometime now, but it wasn't until I saw the film at the Amsterdam documentary i...",1
9,This movie is directed by Renny Harlin the finnish miracle. Stallone is Gabe Walker. Cat and Mouse on the mountains with ruthless terrorists. Renn...,1


Apply the preprocessor function created above to these reviews.<br>
The output shows that the extraneous characters and symbols are gone.

In [18]:
df['review'] = df['review'].apply(preprocessor)
df.head(10)

Unnamed: 0,review,sentiment
0,in 1974 the teenager martha moxley maggie grace moves to the high class area of belle haven greenwich connecticut on the mischief night eve of hal...,1
1,ok so i really like kris kristofferson and his usual easy going delivery of lines in his movies age has helped him with his soft spoken low energy...,0
2,spoiler do not read this if you think about watching that movie although it would be a waste of time by the way the plot is so predictable that i...,0
3,hi for all the people who have seen this wonderful movie im sure thet you would have liked it as much as i i love the songs once you have seen the...,1
4,i recently bought the dvd forgetting just how much i hated the movie version of a chorus line every change the director attenborough made to the s...,0
5,leave it to braik to put on a good show finally he and zorak are living their own lives outside of spac ghost coast to coast i have to say that i ...,1
6,nathan detroit frank sinatra is the manager of the new york s longest established floating craps game and he needs 1000 to secure a new location c...,1
7,to understand crash course in the right context you must understand the 80 s in tv most tv shows didn t have any point the sitcom outpopulated the...,1
8,i ve been impressed with chavez s stance against globalisation for sometime now but it wasn t until i saw the film at the amsterdam documentary in...,1
9,this movie is directed by renny harlin the finnish miracle stallone is gabe walker cat and mouse on the mountains with ruthless terrorists renny h...,1


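## 6. Tokenization

Tokenization: splitting a text into individual words, optionally transforming them<br>
Word stemming: reducing a word to its base form (e.g. running → run)

The stemming algorithm used here was developed by Martin F. Porter and is known as the Porter stemming algorithm.<br>
It is implemented in the NLTK library as PorterStemmer.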

In [19]:
porter = PorterStemmer()

def tokenizer(text):
    return text.split()

def tokenizer_porter(text):
    return [porter.stem(word) for word in text.split()]

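split(): splits a string on whitespace<br>
tokenizer_porter(text): takes the words produced by split() one at a time in a list comprehension and stems each one

Now compare plain tokenization with tokenization plus word stemming.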

In [20]:
print(tokenizer('runners like running and thus they run'))
print(tokenizer_porter('runners like running and thus they run'))

['runners', 'like', 'running', 'and', 'thus', 'they', 'run']
['runner', 'like', 'run', 'and', 'thu', 'they', 'run']


Stop words are also removed.<br>
This, too, can be done with the NLTK library.

First, download the stop-word list registered in nltk.<br>
Then apply tokenizer_porter() and keep only the tokens that are not stop words.

In [21]:
nltk.download('stopwords')

stop = stopwords.words('english')
[w for w in tokenizer_porter('a runner likes running and runs a lot')[-10:] if w not in stop]

[nltk_data] Downloading package stopwords to /usr/share/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


['runner', 'like', 'run', 'run', 'lot']

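## 7. Natural Language Processing Demonstration

A machine learning model classifies each review as one of the following.

1: positive review (e.g. "it was fun")<br>
0: negative review (e.g. "it was boring")

First, reload the dataset and split it into training data and test data; the cleansing from section 5 is reapplied further below.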

In [22]:
df = pd.read_csv('../input/movie-datacsv/movie_data.csv')

print(df.shape)
df.head(10)

(50000, 2)


Unnamed: 0,review,sentiment
0,"In 1974, the teenager Martha Moxley (Maggie Grace) moves to the high-class area of Belle Haven, Greenwich, Connecticut. On the Mischief Night, eve...",1
1,OK... so... I really like Kris Kristofferson and his usual easy going delivery of lines in his movies. Age has helped him with his soft spoken low...,0
2,"***SPOILER*** Do not read this, if you think about watching that movie, although it would be a waste of time. (By the way: The plot is so predicta...",0
3,hi for all the people who have seen this wonderful movie im sure thet you would have liked it as much as i. i love the songs once you have seen th...,1
4,"I recently bought the DVD, forgetting just how much I hated the movie version of ""A Chorus Line."" Every change the director Attenborough made to t...",0
5,Leave it to Braik to put on a good show. Finally he and Zorak are living their own lives outside of Spac Ghost Coast To Coast. I have to say that ...,1
6,"Nathan Detroit (Frank Sinatra) is the manager of the New York's longest- established floating craps game, and he needs $1000 to secure a new locat...",1
7,"To understand ""Crash Course"" in the right context, you must understand the 80's in TV. Most TV shows didn't have any point. The sitcom outpopulate...",1
8,"I've been impressed with Chavez's stance against globalisation for sometime now, but it wasn't until I saw the film at the Amsterdam documentary i...",1
9,This movie is directed by Renny Harlin the finnish miracle. Stallone is Gabe Walker. Cat and Mouse on the mountains with ruthless terrorists. Renn...,1


In [23]:
pd.DataFrame([['review', 'Text of the movie review'],
              ['sentiment', 'Sentiment (0: negative, 1: positive)']],
              columns=['Column', 'Meaning'])

Unnamed: 0,Column,Meaning
0,review,Text of the movie review
1,sentiment,"Sentiment (0: negative, 1: positive)"


In [24]:
train = df[:25000]

print(train.shape)
train.head()

(25000, 2)


Unnamed: 0,review,sentiment
0,"In 1974, the teenager Martha Moxley (Maggie Grace) moves to the high-class area of Belle Haven, Greenwich, Connecticut. On the Mischief Night, eve...",1
1,OK... so... I really like Kris Kristofferson and his usual easy going delivery of lines in his movies. Age has helped him with his soft spoken low...,0
2,"***SPOILER*** Do not read this, if you think about watching that movie, although it would be a waste of time. (By the way: The plot is so predicta...",0
3,hi for all the people who have seen this wonderful movie im sure thet you would have liked it as much as i. i love the songs once you have seen th...,1
4,"I recently bought the DVD, forgetting just how much I hated the movie version of ""A Chorus Line."" Every change the director Attenborough made to t...",0


In [25]:
train.loc[0, 'review']

'In 1974, the teenager Martha Moxley (Maggie Grace) moves to the high-class area of Belle Haven, Greenwich, Connecticut. On the Mischief Night, eve of Halloween, she was murdered in the backyard of her house and her murder remained unsolved. Twenty-two years later, the writer Mark Fuhrman (Christopher Meloni), who is a former LA detective that has fallen in disgrace for perjury in O.J. Simpson trial and moved to Idaho, decides to investigate the case with his partner Stephen Weeks (Andrew Mitchell) with the purpose of writing a book. The locals squirm and do not welcome them, but with the support of the retired detective Steve Carroll (Robert Forster) that was in charge of the investigation in the 70\'s, they discover the criminal and a net of power and money to cover the murder.<br /><br />"Murder in Greenwich" is a good TV movie, with the true story of a murder of a fifteen years old girl that was committed by a wealthy teenager whose mother was a Kennedy. The powerful and rich famil

In [26]:
test = df[25000:25010]

print(test.shape)
test.head(10)

(10, 2)


Unnamed: 0,review,sentiment
25000,"There's a part of me that would like to give this movie a high rating. Considering that it was made in 1953, this is a very courageous movie about...",0
25001,"Excellent and moving story of the end of a uniquely intimate affair. Then again, the point of the film, to paraphrase another comment, is that eve...",1
25002,"Surprisingly well-acted, well-written movie about hard rockin'-but-decent young man getting that much-hoped-for ticket to stardom: his favorite he...",1
25003,"What garbage, is there actually no part II? If this movie actually ends the way it did, everyone involved with this movie should be ashamed. This ...",0
25004,"Basically, this was obviously designed to be promotional material for the movie produced by the same horrible director, which happens to be even w...",0
25005,"Robot Jox doesn't suffer from story or bad effects. I mean, this was 1990 if you know what I'm talking about. RoboCop 2 still used the stop animat...",1
25006,"I'm sorry to say this, but the acting in this film is horrible. The dialogue sounds as if they are reading their lines for the first time ever. Pe...",0
25007,"Greystoke: The Legend of Tarzan, Lord of the Apes is based on the classic book Tarzan of the Apes by Edgar Rice Burroughs and is a more faithful a...",1
25008,"I searched for this movie for years, apparently it ain't available here in the States so bought me a copy off Ebay.<br /><br />Four young hunters ...",1
25009,"Once again, Disney manages to make a children's movie which totally ignores its background. About the only thing common with this and the original...",0


In [27]:
test_review = test['review']

In [28]:
data = pd.concat([train, test], axis=0)

print(data.shape)
data.head()

(25010, 2)


Unnamed: 0,review,sentiment
0,"In 1974, the teenager Martha Moxley (Maggie Grace) moves to the high-class area of Belle Haven, Greenwich, Connecticut. On the Mischief Night, eve...",1
1,OK... so... I really like Kris Kristofferson and his usual easy going delivery of lines in his movies. Age has helped him with his soft spoken low...,0
2,"***SPOILER*** Do not read this, if you think about watching that movie, although it would be a waste of time. (By the way: The plot is so predicta...",0
3,hi for all the people who have seen this wonderful movie im sure thet you would have liked it as much as i. i love the songs once you have seen th...,1
4,"I recently bought the DVD, forgetting just how much I hated the movie version of ""A Chorus Line."" Every change the director Attenborough made to t...",0


In [29]:
X = data['review']
y = data['sentiment']

In [30]:
X = X.apply(lambda x: preprocessor(x))

print(X.shape)
pd.DataFrame(X).head()

(25010,)


Unnamed: 0,review
0,in 1974 the teenager martha moxley maggie grace moves to the high class area of belle haven greenwich connecticut on the mischief night eve of hal...
1,ok so i really like kris kristofferson and his usual easy going delivery of lines in his movies age has helped him with his soft spoken low energy...
2,spoiler do not read this if you think about watching that movie although it would be a waste of time by the way the plot is so predictable that i...
3,hi for all the people who have seen this wonderful movie im sure thet you would have liked it as much as i i love the songs once you have seen the...
4,i recently bought the dvd forgetting just how much i hated the movie version of a chorus line every change the director attenborough made to the s...


Compute TF-IDF and tokenize in a single step.

In [31]:
tv = TfidfVectorizer(tokenizer=tokenizer_porter, stop_words='english')
X = tv.fit_transform(X)



In [32]:
X_train = X[:25000]
y_train = y[:25000]
X_test = X[25000:]
y_test = y[25000:]

print(X_test.shape)

(10, 54274)
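In the cells above, the vectorizer is fitted on the training and test reviews together, so its vocabulary and IDF weights also reflect the test text. With only 10 test reviews the effect is presumably small, but a common alternative is to fit on the training text alone and merely transform the test text. The sketch below shows that pattern on a hypothetical two-sentence toy corpus, not on the IMDb data.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus standing in for the cleaned reviews.
train_texts = ["good movie", "bad movie"]
test_texts = ["good film"]

vectorizer = TfidfVectorizer()
# Vocabulary and IDF weights are learned from the training text only.
X_train_toy = vectorizer.fit_transform(train_texts)
# The test text is only transformed; words unseen in training are ignored.
X_test_toy = vectorizer.transform(test_texts)

print(sorted(vectorizer.vocabulary_))
print(X_train_toy.shape, X_test_toy.shape)
```

Here "film" never appears in the training text, so the test row carries a single non-zero entry, the one for "good".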


In [33]:
'''
X_tr, X_val, y_tr, y_val = train_test_split(X_train, y_train, test_size=0.2, random_state=666)

def create_model(trial):
    num_leaves = trial.suggest_int('num_leaves', 2, 30)
    n_estimators = trial.suggest_int('n_estimators', 50, 300)
    learning_rate = trial.suggest_uniform('learning_rate', 0.0001, 0.99)
    max_depth = trial.suggest_int('max_depth', 2, 10)
    min_child_samples = trial.suggest_int('min_child_samples', 100, 1200)
    min_data_in_leaf = trial.suggest_int('min_data_in_leaf', 5, 90)
    bagging_freq = trial.suggest_int('bagging_freq', 1, 7)
    bagging_fraction = trial.suggest_uniform('bagging_fraction', 0.0001, 1.0)
    feature_fraction = trial.suggest_uniform('feature_fraction', 0.0001, 1.0)
    subsample = trial.suggest_uniform('subsample', 0.1, 1.0)
    colsample_bytree = trial.suggest_uniform('colsample_bytree', 0.1, 1.0)
    
    model = lgb.LGBMClassifier(
        num_leaves=num_leaves,
        n_estimators=n_estimators,
        learning_rate=learning_rate,
        max_depth=max_depth, 
        min_child_samples=min_child_samples, 
        min_data_in_leaf=min_data_in_leaf,
        bagging_freq=bagging_freq,
        bagging_fraction=bagging_fraction,
        feature_fraction=feature_fraction,
        subsample=subsample,
        colsample_bytree=colsample_bytree,
        random_state=666)
    return model

def objective(trial):
    model = create_model(trial)
    model.fit(X_tr, y_tr)
    y_pred = model.predict(X_val)
    accuracy = accuracy_score(y_val, y_pred)
    return accuracy

study = optuna.create_study(direction='maximize')
study.optimize(objective, n_trials=40)
params = study.best_params
print(params)
'''

In [34]:
params =  {'num_leaves': 15,
           'n_estimators': 261,
           'learning_rate': 0.20889849934915378,
           'max_depth': 9,
           'min_child_samples': 539,
           'min_data_in_leaf': 51,
           'bagging_freq': 5,
           'bagging_fraction': 0.6471599134667355,
           'feature_fraction': 0.7844921786683485,
           'subsample': 0.8286584823998,
           'colsample_bytree': 0.7804281493256859,
           'random_state': 666}
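One thing worth checking in this dictionary: LightGBM's parameter documentation lists min_child_samples as an alias of min_data_in_leaf, subsample as an alias of bagging_fraction, and colsample_bytree as an alias of feature_fraction. When both names of a pair are passed with different values, only one of them can take effect. The sketch below merely detects such pairs; `ALIAS_PAIRS` and `candidate_params` are names made up for this illustration, and it does not decide which value should be kept.

```python
# (native LightGBM name, scikit-learn-style alias), per LightGBM's parameter docs.
ALIAS_PAIRS = [('min_data_in_leaf', 'min_child_samples'),
               ('bagging_fraction', 'subsample'),
               ('feature_fraction', 'colsample_bytree')]

# Copy of the tuned values from the cell above.
candidate_params = {'num_leaves': 15,
                    'n_estimators': 261,
                    'learning_rate': 0.20889849934915378,
                    'max_depth': 9,
                    'min_child_samples': 539,
                    'min_data_in_leaf': 51,
                    'bagging_freq': 5,
                    'bagging_fraction': 0.6471599134667355,
                    'feature_fraction': 0.7844921786683485,
                    'subsample': 0.8286584823998,
                    'colsample_bytree': 0.7804281493256859,
                    'random_state': 666}

# A conflict is a pair where both names are present with different values.
conflicts = [(native, alias) for native, alias in ALIAS_PAIRS
             if native in candidate_params and alias in candidate_params
             and candidate_params[native] != candidate_params[alias]]

for native, alias in conflicts:
    print(f"{native}={candidate_params[native]} conflicts with {alias}={candidate_params[alias]}")
```

Searching over only one name from each pair would make the tuned values easier to interpret.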

In [35]:
cls = lgb.LGBMClassifier(**params)
cls.fit(X_train, y_train)



LGBMClassifier(bagging_fraction=0.6471599134667355, bagging_freq=5,
               colsample_bytree=0.7804281493256859,
               feature_fraction=0.7844921786683485,
               learning_rate=0.20889849934915378, max_depth=9,
               min_child_samples=539, min_data_in_leaf=51, n_estimators=261,
               num_leaves=15, random_state=666, subsample=0.8286584823998)

In [36]:
y_pred = cls.predict(X_test)
print(X_test.shape)
print(y_pred.shape)

(10, 54274)
(10,)


In [37]:
df = pd.DataFrame(test_review, columns=['review'])
df['answer'] = y_test
df['prediction'] = y_pred

df.head(10)

Unnamed: 0,review,answer,prediction
25000,"There's a part of me that would like to give this movie a high rating. Considering that it was made in 1953, this is a very courageous movie about...",0,0
25001,"Excellent and moving story of the end of a uniquely intimate affair. Then again, the point of the film, to paraphrase another comment, is that eve...",1,1
25002,"Surprisingly well-acted, well-written movie about hard rockin'-but-decent young man getting that much-hoped-for ticket to stardom: his favorite he...",1,0
25003,"What garbage, is there actually no part II? If this movie actually ends the way it did, everyone involved with this movie should be ashamed. This ...",0,0
25004,"Basically, this was obviously designed to be promotional material for the movie produced by the same horrible director, which happens to be even w...",0,0
25005,"Robot Jox doesn't suffer from story or bad effects. I mean, this was 1990 if you know what I'm talking about. RoboCop 2 still used the stop animat...",1,1
25006,"I'm sorry to say this, but the acting in this film is horrible. The dialogue sounds as if they are reading their lines for the first time ever. Pe...",0,0
25007,"Greystoke: The Legend of Tarzan, Lord of the Apes is based on the classic book Tarzan of the Apes by Edgar Rice Burroughs and is a more faithful a...",1,1
25008,"I searched for this movie for years, apparently it ain't available here in the States so bought me a copy off Ebay.<br /><br />Four young hunters ...",1,0
25009,"Once again, Disney manages to make a children's movie which totally ignores its background. About the only thing common with this and the original...",0,0


In [38]:
print('Accuracy:', accuracy_score(y_test, y_pred)*100, '%')

Accuracy: 80.0 %
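The figure can be reproduced directly from the answer and prediction columns of the table above, which also shows where the two errors fall. The sketch below copies those ten labels by hand; the variable names are chosen for this illustration.

```python
# Labels copied from the answer and prediction columns (rows 25000 to 25009).
answers     = [0, 1, 1, 0, 0, 1, 0, 1, 1, 0]
predictions = [0, 1, 0, 0, 0, 1, 0, 1, 0, 0]

pairs = list(zip(answers, predictions))
correct = sum(a == p for a, p in pairs)
accuracy = correct / len(answers)
# Positive reviews predicted as negative, and negative reviews predicted as positive.
false_negatives = sum(a == 1 and p == 0 for a, p in pairs)
false_positives = sum(a == 0 and p == 1 for a, p in pairs)

print(accuracy, false_negatives, false_positives)
```

Both mistakes are positive reviews classified as negative. With only 10 test reviews, each single prediction moves the accuracy by 10 points, so a larger share of the unused 25,000 reviews would give a steadier estimate.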
