### IMDB movie review sentiment analysis
    - GridSearchCV with a Pipeline
    - TfidfVectorizer + Naive Bayes, Logistic Regression

In [5]:
import pandas as pd
import numpy as np

In [6]:
df = pd.read_csv('data/labeledTrainData.tsv', sep='\t', quoting=3)
df.head(3)

|   | id | sentiment | review |
|---|----|-----------|--------|
| 0 | `"5814_8"` | 1 | `"With all this stuff going down at the moment ...` |
| 1 | `"2381_9"` | 1 | `"\"The Classic War of the Worlds\" by Timothy ...` |
| 2 | `"7759_3"` | 0 | `"The film starts with a manager (Nicholas Bell...` |


In [7]:
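# Replace HTML line-break tags with spaces
df.review = df.review.str.replace('<br />', ' ', regex=False)
# Remove punctuation and digits: replace every character that is not an English letter with a space
df.review = df.review.str.replace('[^A-Za-z]', ' ', regex=True)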
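
Whether `Series.str.replace` treats its pattern as a regular expression depends on the `regex` argument, and its default changed from `True` to `False` in pandas 2.0, so passing it explicitly is the safer choice. A minimal sketch on a made-up two-row Series (`sample` and its contents are illustrative, not taken from the dataset):

```python
import pandas as pd

# Hypothetical sample rows, only to illustrate the two cleaning steps
sample = pd.Series(['Good<br />film!', 'Rated 10/10'])

# Step 1: the tag is a literal string, so regex matching is switched off
no_tags = sample.str.replace('<br />', ' ', regex=False)

# Step 2: the character class must be interpreted as a regular expression
letters_only = no_tags.str.replace('[^A-Za-z]', ' ', regex=True)

print(letters_only.tolist())
```

With `regex=False` in the second step, pandas would search for the literal text `[^A-Za-z]` and leave the reviews unchanged.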

In [8]:
# Split the dataset into training and test sets
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
    df.review.values, df.sentiment.values, stratify=df.sentiment.values, test_size=0.2, random_state=2023
)
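
The `stratify` argument keeps the class ratio the same in both splits. A small sketch on made-up labels (the 60/40 toy data below is an assumption for illustration, not the IMDB data):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy labels: 60 negatives and 40 positives (made up for illustration)
y = np.array([0] * 60 + [1] * 40)
X = np.arange(len(y))

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, stratify=y, test_size=0.2, random_state=2023
)

# Size of each split and the number of positives it contains
print(len(y_te), y_te.sum())
print(len(y_tr), y_tr.sum())
```

Both splits keep the original 60/40 ratio, which makes accuracy on the test set comparable to accuracy measured during training.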

#### Pipelining

In [9]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.naive_bayes import MultinomialNB

In [10]:
tvect = TfidfVectorizer(ngram_range=(1,2), stop_words='english')
nb = MultinomialNB()
pipeline = Pipeline([('TVECT', tvect), ('NB', nb)])
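
A `Pipeline` chains the vectorizer and the classifier so that one `fit` or `predict` call runs both steps in order. The sketch below, on a tiny made-up corpus (all documents and labels are illustrative assumptions), compares the pipeline with the equivalent manual steps:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Toy corpus, made up for illustration
docs = ['great film loved it', 'wonderful acting great story',
        'terrible plot boring film', 'awful acting boring story']
labels = [1, 1, 0, 0]
new_docs = ['great story', 'boring plot']

# Pipeline version: vectorizing and classifying happen in one call
pipe = Pipeline([('TVECT', TfidfVectorizer()), ('NB', MultinomialNB())])
pipe.fit(docs, labels)
pipe_pred = pipe.predict(new_docs)

# Manual version: fit_transform on training text, transform on new text
vect = TfidfVectorizer()
clf = MultinomialNB()
clf.fit(vect.fit_transform(docs), labels)
manual_pred = clf.predict(vect.transform(new_docs))

print(pipe_pred, manual_pred)
```

The pipeline also guarantees that new text is only transformed, never refitted, which is easy to get wrong by hand.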

In [11]:
# Train the pipeline
%time pipeline.fit(X_train, y_train)

CPU times: total: 12.7 s
Wall time: 13.1 s


In [13]:
pipeline.score(X_test, y_test)

0.8804
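
For a pipeline that ends in a classifier, `score` returns the mean accuracy of its predictions. A minimal sketch on a made-up corpus (documents and labels are illustrative assumptions):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Toy data, made up for illustration
train_docs = ['great film loved it', 'wonderful acting great story',
              'terrible plot boring film', 'awful acting boring story']
train_labels = [1, 1, 0, 0]
test_docs = ['great story', 'boring plot', 'loved the boring film']
test_labels = [1, 0, 1]

pipe = Pipeline([('TVECT', TfidfVectorizer()), ('NB', MultinomialNB())])
pipe.fit(train_docs, train_labels)

# score() and an explicit accuracy computation give the same number
score = pipe.score(test_docs, test_labels)
manual = accuracy_score(test_labels, pipe.predict(test_docs))
print(score, manual)
```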

In [14]:
from sklearn.linear_model import LogisticRegression
lrc = LogisticRegression(random_state=2023)
pipeline = Pipeline([('TVECT', tvect), ('LRC', lrc)])
%time pipeline.fit(X_train, y_train)

CPU times: total: 1min 4s
Wall time: 31.5 s


In [15]:
pipeline.score(X_test, y_test)

0.8818

- Finding the best parameters

In [16]:
from sklearn.model_selection import GridSearchCV
params = {
    'TVECT__max_df' : [100, 500],
    'LRC__C' : [1, 10]
}
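
Each key in `params` follows the `<step name>__<parameter>` convention, so `TVECT__max_df` addresses `max_df` of the step named `TVECT`. An integer `max_df` is an absolute document count, while a float between 0 and 1 is a proportion of documents. The sketch below rebuilds the pipeline from scratch and assumes 3-fold cross-validation to count the model fits a grid search would need:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import ParameterGrid
from sklearn.pipeline import Pipeline

pipe = Pipeline([
    ('TVECT', TfidfVectorizer(ngram_range=(1, 2), stop_words='english')),
    ('LRC', LogisticRegression(random_state=2023)),
])
params = {'TVECT__max_df': [100, 500], 'LRC__C': [1, 10]}

# The double underscore routes each value to the named step
pipe.set_params(TVECT__max_df=500, LRC__C=10)
print(pipe.named_steps['TVECT'].max_df, pipe.named_steps['LRC'].C)

# Every combination a grid search over params would try
candidates = list(ParameterGrid(params))
n_folds = 3  # assumed number of cross-validation folds
print(len(candidates), 'candidates,', len(candidates) * n_folds, 'cross-validation fits')
```

Because the number of fits grows multiplicatively with each added parameter value, a small grid is a reasonable starting point on a corpus of this size.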

In [17]:
grid_pipe = GridSearchCV(
    pipeline, params, scoring='accuracy', cv=3
)
%time grid_pipe.fit(X_train, y_train)

CPU times: total: 8min 16s
Wall time: 4min 28s


In [19]:
grid_pipe.best_params_

{'LRC__C': 10, 'TVECT__max_df': 500}

In [20]:
best_pipe = grid_pipe.best_estimator_
best_pipe.score(X_test, y_test)

0.89
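
Besides `best_params_`, a fitted `GridSearchCV` keeps the cross-validated score of every candidate in `cv_results_`. Note that `best_score_` is a mean over validation folds of the training data, which is a different quantity from the held-out test accuracy above. A sketch on a made-up corpus (documents, labels, and the reduced grid are illustrative assumptions):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Toy corpus, made up for illustration: 6 positive and 6 negative documents
docs = ['great film loved it', 'wonderful acting great story',
        'loved the story and the cast', 'a great and moving film',
        'wonderful cast loved it', 'moving story great acting',
        'terrible plot boring film', 'awful acting boring story',
        'boring and awful film', 'terrible cast hated it',
        'hated the plot awful acting', 'a boring and terrible story']
labels = [1] * 6 + [0] * 6

pipe = Pipeline([('TVECT', TfidfVectorizer()),
                 ('LRC', LogisticRegression(random_state=2023))])
grid = GridSearchCV(pipe, {'LRC__C': [1, 10]}, scoring='accuracy', cv=3)
grid.fit(docs, labels)

# Mean validation accuracy of each candidate
for candidate, mean_score in zip(grid.cv_results_['params'],
                                 grid.cv_results_['mean_test_score']):
    print(candidate, round(float(mean_score), 3))

# With the default refit=True, best_estimator_ is retrained on all of docs
print(grid.best_params_, grid.best_score_)
```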

- Validating on real reviews

In [21]:
review = ['''
This isn't just a beautifully crafted gangster film. Or an outstanding family portrait, for that matter. An amazing period piece. A character study. A lesson in filmmaking and an inspiration to generations of actors, directors, screenwriters and producers. For me, this is more: this is the definitive film. 10 stars out of 10.
''', '''I continually fail to understand why The Godfather is hailed as "The Greatest Movie of All Time". I've seen it twice--a second time just to make sure--and I have to tell you that I sat there in a stupor, bored out of my mind. And I'm not a teenager raised on MTV; I'm in my 30s and am absolutely devoted to movies--I've seen as many classics (American & foreign) that I can get my hands on. But, for me, The Godfather ranks alongside Singin' in the Rain as the most overrated films of all time.
Singin' in the Rain, at least, I get (it's just my intense dislike for Donald O'Connor that makes me dislike this film). But The Godfather? It's just a bland epic about a bunch of moronic gangsters, with Marlon Brando giving a campy performance, and riddled with repulsive violence. Give me a break. The fact that this movie is so "beloved" has had the direct result that nowadays we got absurdly worse and worse films every year, created by clueless filmmakers.''', '''"The Godfather" was a sickening experience the first time I saw it in its initial release, and it hasn't changed. I thoroughly disliked the film's spirit, finding it very ugly at its heart. I saw little justification for its existence, was sorry I saw it, and have tried to forget it.''']


In [22]:
# Text preprocessing: same letters-only cleanup as the training data
import re
review = [re.sub('[^A-Za-z]', ' ', x) for x in review]
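
One detail worth knowing when cleaning new reviews: `map` returns a one-shot iterator, so cleaned texts held in a `map` object can be consumed only once, while a list can be reused (for example, to call `predict` and then `predict_proba`). A short sketch with made-up strings:

```python
import re

# Hypothetical inputs, only to illustrate the difference
texts = ["It's great!", 'Rated 10/10']

lazy = map(lambda x: re.sub('[^A-Za-z]', ' ', x), texts)
first_pass = list(lazy)   # consumes the iterator
second_pass = list(lazy)  # nothing left to yield

# A list comprehension produces a reusable container
cleaned = [re.sub('[^A-Za-z]', ' ', x) for x in texts]

print(first_pass, second_pass, cleaned)
```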

In [23]:
best_pipe.predict(review)

array([1, 0, 1], dtype=int64)
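
The third review reads as negative, yet it is labeled 1 in the output above. Hard labels hide how close such a call is; a pipeline ending in `LogisticRegression` also exposes `predict_proba`. The sketch below uses a made-up corpus rather than the fitted `best_pipe` (all documents and labels are illustrative assumptions):

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Toy corpus, made up for illustration
docs = ['great film loved it', 'wonderful acting great story',
        'terrible plot boring film', 'awful acting boring story']
labels = [1, 1, 0, 0]
new_docs = ['great story', 'boring plot']

pipe = Pipeline([('TVECT', TfidfVectorizer()),
                 ('LRC', LogisticRegression(random_state=2023))])
pipe.fit(docs, labels)

# One row per document, one column per class in pipe.classes_
proba = pipe.predict_proba(new_docs)
pred = pipe.predict(new_docs)
for text, p, label in zip(new_docs, proba, pred):
    print(text, label, np.round(p, 3))
```

A probability near 0.5 suggests the model is unsure, which is a cue to inspect the review or the preprocessing more closely.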