### IMDB Movie Review Sentiment Analysis
- GridSearchCV with a Pipeline
- TfidfVectorizer + NaiveBayes

In [5]:
import numpy as np
import pandas as pd
df = pd.read_csv('data/labeledTrainData.tsv', sep='\t', quoting=3)      # quoting=3 is csv.QUOTE_NONE: quote characters stay as literal text
df.head()

| | id | sentiment | review |
|---|---|---|---|
| 0 | "5814_8" | 1 | "With all this stuff going down at the moment ... |
| 1 | "2381_9" | 1 | "\"The Classic War of the Worlds\" by Timothy ... |
| 2 | "7759_3" | 0 | "The film starts with a manager (Nicholas Bell... |
| 3 | "3630_4" | 0 | "It must be assumed that those who praised thi... |
| 4 | "9495_8" | 1 | "Superbly trashy and wondrously unpretentious ... |
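The literal quote characters in the `id` and `review` columns above follow from `quoting=3`, which is the value of `csv.QUOTE_NONE` in the standard library. A minimal sketch of that effect, using a made-up one-row sample (not taken from the dataset) and the standard `csv` reader rather than pandas:

```python
import csv
import io

# Hypothetical one-row sample in the same tab-separated layout (not real data).
sample = '"0000_0"\t1\t"Good film"\n'

# QUOTE_NONE (value 3): quote characters are kept as ordinary text.
row_none = next(csv.reader(io.StringIO(sample), delimiter='\t',
                           quoting=csv.QUOTE_NONE))

# QUOTE_MINIMAL (the csv default): surrounding quotes are parsed and stripped.
row_minimal = next(csv.reader(io.StringIO(sample), delimiter='\t',
                              quoting=csv.QUOTE_MINIMAL))

print(csv.QUOTE_NONE)
print(row_none)
print(row_minimal)
```

Disabling quote handling is a common choice for free-text fields that may contain unbalanced quotes; whether that is the reason here is an assumption.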


In [6]:
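# Replace HTML line breaks with a space (literal match, not a regex)
df.review = df.review.str.replace('<br />', ' ', regex=False)
# Replace every non-letter character (digits, punctuation, quotes) with a space
df.review = df.review.str.replace('[^A-Za-z]', ' ', regex=True)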
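To see what this cleanup does to a single string, here is a small sketch with the standard `re` module. The helper name `clean_review` and the sample sentence are hypothetical, introduced only for illustration:

```python
import re

def clean_review(text: str) -> str:
    """Apply the same two replacements as the cell above to one string."""
    text = text.replace('<br />', ' ')       # drop HTML line breaks
    return re.sub('[^A-Za-z]', ' ', text)    # non-letters become spaces

raw = "Great!<br />Isn't it 10/10?"
cleaned = clean_review(raw)

print(repr(cleaned))
print(cleaned.split())
```

Note the side effects: contractions are split ("Isn't" becomes "Isn" and "t") and digits disappear. The vectorizer's default tokenizer ignores single-character tokens, so the stray "t" does not become a feature.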

In [7]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
    df.review.values, df.sentiment.values, stratify=df.sentiment.values,
    test_size=0.2, random_state=2023
)
np.unique(y_train, return_counts=True)

(array([0, 1], dtype=int64), array([10000, 10000], dtype=int64))
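The even 10000/10000 split above comes from `stratify`, which keeps the class ratio the same in both partitions. A minimal sketch on synthetic, deliberately imbalanced labels (an assumption chosen to make the effect visible; the IMDB data itself is balanced):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic labels: 80 negatives and 20 positives (not the IMDB data).
y = np.array([0] * 80 + [1] * 20)
X = np.arange(len(y)).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, stratify=y, test_size=0.2, random_state=2023
)

# Both partitions keep the original 4:1 class ratio.
train_counts = np.bincount(y_tr)
test_counts = np.bincount(y_te)
print(train_counts, test_counts)
```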

##### Pipelining

In [8]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.naive_bayes import MultinomialNB

In [9]:
tvect = TfidfVectorizer(ngram_range=(1, 2), stop_words='english')
nb = MultinomialNB()
pipeline = Pipeline([('TVECT', tvect), ('NB', nb)])
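As a rough illustration of what `ngram_range=(1, 2)` and `stop_words='english'` produce, the sketch below fits the same vectorizer settings on one made-up sentence and lists the learned vocabulary:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

vect = TfidfVectorizer(ngram_range=(1, 2), stop_words='english')
vect.fit(["The acting was great"])   # hypothetical sentence

# vocabulary_ maps each unigram/bigram to its column index.
features = sorted(vect.vocabulary_)
print(features)
```

Stop words are removed before the n-grams are built, so "acting great" appears as a bigram even though "was" sat between the two words in the original sentence.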

In [10]:
# Train: fits the vectorizer, then the classifier on the vectorized text
%time pipeline.fit(X_train, y_train)

CPU times: total: 12.8 s
Wall time: 13.1 s


In [11]:
pipeline.score(X_test, y_test)

0.8804
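For a classifier pipeline, `score` returns mean accuracy, and `fit`/`predict` chain the vectorizer and classifier automatically. A minimal sketch showing that the pipeline matches the manual two-step version, on a made-up toy corpus (the documents and labels are assumptions, not IMDB data):

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Toy corpus for illustration only.
train_docs = ["great wonderful film", "loved this great movie",
              "terrible boring film", "hated this awful movie"]
train_labels = [1, 1, 0, 0]
new_docs = ["wonderful movie", "boring awful film"]
new_labels = [1, 0]

# Pipeline version: one fit call, one predict call.
pipe = Pipeline([('TVECT', TfidfVectorizer()), ('NB', MultinomialNB())])
pipe.fit(train_docs, train_labels)
pipe_pred = pipe.predict(new_docs)

# Manual version: fit_transform on training text, transform on new text.
vect = TfidfVectorizer()
clf = MultinomialNB().fit(vect.fit_transform(train_docs), train_labels)
manual_pred = clf.predict(vect.transform(new_docs))

pipe_score = pipe.score(new_docs, new_labels)
print(pipe_pred, manual_pred, pipe_score)
```

The practical benefit is that the vectorizer is only ever fitted on training text, which matters once cross-validation is involved.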

In [12]:
from sklearn.linear_model import LogisticRegression
lrc = LogisticRegression(random_state=2023)
pipeline = Pipeline([('TVECT', tvect), ('LRC', lrc)])
%time pipeline.fit(X_train, y_train)

CPU times: total: 30.9 s
Wall time: 29.3 s


In [13]:
pipeline.score(X_test, y_test)

0.8818
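The two classifiers can also be compared in a loop. The sketch below is one possible arrangement, not the notebook's method: it builds a fresh `TfidfVectorizer` per pipeline so no fitted state is shared, and it uses a made-up toy corpus, so the printed scores say nothing about IMDB performance:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Toy corpus for illustration only.
train_docs = ["great wonderful film", "loved this great movie",
              "terrible boring film", "hated this awful movie"]
train_labels = [1, 1, 0, 0]
test_docs = ["wonderful great movie", "boring terrible film"]
test_labels = [1, 0]

candidates = {
    'NB': MultinomialNB(),
    'LRC': LogisticRegression(random_state=2023),
}

scores = {}
for name, clf in candidates.items():
    # A new vectorizer per pipeline keeps the two runs independent.
    pipe = Pipeline([('TVECT', TfidfVectorizer()), (name, clf)])
    pipe.fit(train_docs, train_labels)
    scores[name] = pipe.score(test_docs, test_labels)

print(scores)
```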

##### Finding the best parameters

In [14]:
from sklearn.model_selection import GridSearchCV
params = {
    'TVECT__max_df': [100, 500],
    'LRC__C': [1, 10]
}
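The grid keys follow the `<step name>__<parameter>` convention, so `TVECT__max_df` targets `max_df` on the step named `TVECT`. An integer `max_df` is an absolute document count: terms that occur in more documents than that are dropped. A small sketch of both points, with a made-up three-document corpus:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

pipe = Pipeline([('TVECT', TfidfVectorizer()), ('LRC', LogisticRegression())])

# Every tunable name is listed by get_params(); set_params uses the same keys.
param_names = pipe.get_params()
pipe.set_params(TVECT__max_df=500, LRC__C=10)
print(pipe.named_steps['TVECT'].max_df, pipe.named_steps['LRC'].C)

# Integer max_df: "movie" occurs in 3 documents (> 2), so it is dropped.
docs = ["good movie", "bad movie", "great movie"]   # toy corpus
vocab = sorted(TfidfVectorizer(max_df=2).fit(docs).vocabulary_)
print(vocab)
```

A float `max_df` (for example `0.5`) is interpreted as a proportion of documents instead.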

In [15]:
grid_pipe = GridSearchCV(
    pipeline, params, scoring='accuracy', cv=3
)
%time grid_pipe.fit(X_train, y_train)

CPU times: total: 4min 34s
Wall time: 4min 28s
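The run time is easier to interpret once the number of fits is counted. A quick sketch using `ParameterGrid` with the same grid and `cv=3`, assuming the default `refit=True` (one extra fit on the full training set):

```python
from sklearn.model_selection import ParameterGrid

params = {
    'TVECT__max_df': [100, 500],
    'LRC__C': [1, 10],
}
cv = 3

# Every combination of the listed values is one candidate.
candidates = list(ParameterGrid(params))
n_candidates = len(candidates)

# Each candidate is fitted once per fold, plus one final refit.
total_fits = n_candidates * cv + 1
print(n_candidates, total_fits)
```

That is 13 full pipeline fits, each of which re-runs the TF-IDF vectorization, which is broadly in line with the single-fit timing measured earlier.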


In [16]:
grid_pipe.best_params_

{'LRC__C': 10, 'TVECT__max_df': 500}

In [17]:
best_pipe = grid_pipe.best_estimator_
best_pipe.score(X_test, y_test)

0.89
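Beyond `best_params_`, the fitted search object keeps per-candidate scores in `cv_results_`, and `best_estimator_` is the pipeline refitted on all training data. A minimal sketch on a made-up toy corpus; it tunes `ngram_range` instead of `max_df`, an assumption made here only because integer `max_df` thresholds are not meaningful on twelve documents:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Toy corpus for illustration only: six positive, six negative documents.
docs = ["great film", "wonderful movie", "loved it", "excellent acting",
        "great fun", "wonderful story", "terrible film", "boring movie",
        "hated it", "awful acting", "terrible mess", "boring story"]
labels = [1] * 6 + [0] * 6

pipe = Pipeline([('TVECT', TfidfVectorizer()),
                 ('LRC', LogisticRegression(random_state=2023))])
params = {'TVECT__ngram_range': [(1, 1), (1, 2)], 'LRC__C': [1, 10]}

grid = GridSearchCV(pipe, params, scoring='accuracy', cv=3)
grid.fit(docs, labels)

# One entry per candidate, with its mean cross-validated accuracy.
for p, s in zip(grid.cv_results_['params'], grid.cv_results_['mean_test_score']):
    print(p, round(float(s), 3))

# The search object delegates predict to the refitted best pipeline.
same = np.array_equal(grid.predict(docs), grid.best_estimator_.predict(docs))
print(grid.best_params_, same)
```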

- Applying the model to real data

In [18]:
review = '''
First off this movie is for kids and fans of Nintendo and the Mario franchise. 
I still think an adult who isnt a fan could still enjoy it but this movie is so full of fan service 
that it will have you smiling the whole time. 
The voice acting I was skeptical but they all work and work well too. 
Jack Black is the star here. I love how they kept the story simple like all of the games. 
Truly felt like a video game on screen. This movie felt like a beautifully animated amusement park ride. 
The audio in the movie was amazing too. The sounds and the score with reimagined iconic music was perfect. 
Some of the songs in the movie felt unnecessary but they worked. 
I think they should've bumped the run time to 105-120 min. 90 min felt too short as it goes by quick. 
I havent had this much wholesome fun at the movies in a long time. If youre a fan you HAVE to see it.'''

In [19]:
import re
# Apply the same non-letter cleanup that was used on the training reviews
review = re.sub('[^A-Za-z]', ' ', review)

In [21]:
best_pipe.predict([review])

array([1], dtype=int64)
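The cleanup and prediction steps can be wrapped in one helper so new text always receives the same preprocessing as the training reviews. The function name `predict_sentiment` and the toy training corpus below are hypothetical; in the notebook the model argument would be `best_pipe`:

```python
import re
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

def predict_sentiment(model, text):
    """Clean one raw review, then return (label, probability of that label)."""
    cleaned = re.sub('[^A-Za-z]', ' ', text.replace('<br />', ' '))
    label = int(model.predict([cleaned])[0])
    proba = model.predict_proba([cleaned])[0]
    # classes_ gives the column order of predict_proba.
    class_index = list(model.classes_).index(label)
    return label, float(proba[class_index])

# Stand-in model trained on a toy corpus (illustration only).
docs = ["great wonderful film", "loved this great movie",
        "terrible boring film", "hated this awful movie"]
labels = [1, 1, 0, 0]
model = Pipeline([('TVECT', TfidfVectorizer()),
                  ('LRC', LogisticRegression(random_state=2023))])
model.fit(docs, labels)

label, confidence = predict_sentiment(model, "A wonderful, great movie!<br />10/10")
print(label, round(confidence, 3))
```

Returning the probability alongside the label gives a rough sense of how confident the classifier is, which a bare `predict` call does not show.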