### IMDB Movie Review Sentiment Analysis
- GridSearchCV using a Pipeline
- TfidfVectorizer + MultinomialNB, LogisticRegression

In [1]:
import numpy as np
import pandas as pd
df = pd.read_csv('data/labeledTrainData.tsv', sep='\t', quoting=3)

In [2]:
# Replace HTML line breaks, then every non-letter character, with a space.
df.review = df.review.str.replace('<br />', ' ', regex=False)
df.review = df.review.str.replace('[^A-Za-z]', ' ', regex=True)
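
To see what these two replacements do to a single review, here is a minimal sketch that applies the same steps with the standard `re` module. The helper name `clean_review` and the sample sentence are illustrative and not part of the notebook.

```python
import re

def clean_review(text):
    """Mirror the notebook's cleaning: drop '<br />' tags, then non-letters."""
    text = text.replace('<br />', ' ')
    return re.sub('[^A-Za-z]', ' ', text)

# Hypothetical sample review, not taken from the IMDB data.
sample = "Great film!<br />10/10, isn't it?"
cleaned = clean_review(sample)

# Splitting on whitespace shows the tokens that survive the cleaning.
print(cleaned.split())
```

Digits and punctuation disappear entirely, and contractions are split at the apostrophe ("isn't" becomes "isn" and "t"), which is worth keeping in mind when reading the vectorizer's vocabulary.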

In [3]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
    df.review.values, df.sentiment, stratify=df.sentiment,
    test_size=0.2, random_state=2023
)

#### Pipelining

In [6]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.naive_bayes import MultinomialNB

In [7]:
tvect = TfidfVectorizer(ngram_range=(1,2), stop_words='english')
nb = MultinomialNB()
pipline = Pipeline([('TVECT',tvect),('NB',nb)])

In [8]:
%time pipline.fit(X_train,y_train)

CPU times: total: 11.7 s
Wall time: 11.8 s


In [9]:
pipline.score(X_test,y_test)

0.8804

In [10]:
from sklearn.linear_model import LogisticRegression
lrc = LogisticRegression(random_state=2023)
pipline = Pipeline([('TVECT',tvect),('LRC',lrc)])
%time pipline.fit(X_train,y_train)

CPU times: total: 31.1 s
Wall time: 28.5 s


In [12]:
pipline.score(X_test,y_test)

0.8818

#### Finding the Optimal Parameters

In [18]:
from sklearn.model_selection import GridSearchCV
params = {
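    'TVECT__max_df': [100, 500],  # integer max_df: ignore terms found in more than this many documents
    'LRC__C': [1, 10],            # inverse of regularization strength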
}
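
The keys of the grid follow the `<step name>__<parameter>` convention, where the step name is the label given in the `Pipeline` constructor. The sketch below rebuilds a pipeline with the same step names (`TVECT`, `LRC`) so it can run on its own; the variable name `demo_pipe` is an assumption made for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Same step names as the notebook's pipeline, rebuilt so the snippet is self-contained.
demo_pipe = Pipeline([('TVECT', TfidfVectorizer()), ('LRC', LogisticRegression())])

# Every key returned by get_params() is a valid name for a parameter grid.
param_names = sorted(demo_pipe.get_params())
print([name for name in param_names if name in ('TVECT__max_df', 'LRC__C')])

# set_params accepts the same names; a grid search sets each candidate this way.
demo_pipe.set_params(TVECT__max_df=500, LRC__C=10)
print(demo_pipe.named_steps['TVECT'].max_df, demo_pipe.named_steps['LRC'].C)
```

Note that an integer `max_df` is an absolute document count, while a float in the range 0.0 to 1.0 is a proportion of documents.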

In [19]:
grid_pip = GridSearchCV(
    pipline, params, scoring='accuracy',cv=3
)
%time grid_pip.fit(X_train,y_train)

CPU times: total: 4min 22s
Wall time: 4min 8s


In [20]:
best_pipe = grid_pip.best_estimator_
best_pipe.score(X_test,y_test)

0.89
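
With the default `refit=True`, `best_estimator_` is retrained on the whole training set using the best parameter combination, so the score above comes from that refitted pipeline. To see which combination won and how the others compared, `best_params_`, `best_score_`, and `cv_results_` can be inspected. The sketch below does this on a small made-up corpus so that it runs in seconds; the data, the `toy_*` names, and the `ngram_range` grid are assumptions for illustration, not the notebook's setup.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Hypothetical toy corpus (1 = positive, 0 = negative); not the IMDB data.
toy_docs = [
    "great movie loved it", "wonderful acting and great story",
    "loved the film a masterpiece", "brilliant and moving film",
    "great fun wonderful cast", "a moving brilliant story",
    "terrible movie hated it", "boring plot and awful acting",
    "hated the film a disaster", "dull and awful film",
    "terrible waste boring cast", "a dull terrible story",
]
toy_labels = [1] * 6 + [0] * 6

toy_pipe = Pipeline([
    ('TVECT', TfidfVectorizer()),
    ('LRC', LogisticRegression(random_state=2023)),
])
toy_params = {
    'TVECT__ngram_range': [(1, 1), (1, 2)],
    'LRC__C': [1, 10],
}
toy_grid = GridSearchCV(toy_pipe, toy_params, scoring='accuracy', cv=3)
toy_grid.fit(toy_docs, toy_labels)

# Best combination and its mean cross-validated accuracy.
print(toy_grid.best_params_)
print(toy_grid.best_score_)

# Mean CV accuracy of every candidate in the grid.
results = toy_grid.cv_results_
for candidate, score in zip(results['params'], results['mean_test_score']):
    print(candidate, round(float(score), 3))
```

In the notebook, the same attributes are available on `grid_pip` after the fit.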

In [21]:
review = ['''
This isn't just a beautifully crafted gangster film.
Or an outstanding family portrait, for that matter.An amazing period piece.
A character study. A lesson in filmmaking and an inspiration to generations of actors, directors, screenwriters and producers.
For me, this is more: this is the definitive film.
10 stars out of 10.
''',
'''
I follow recommendations on this site highly. 
I rented this movie and wanted my money back. 
Ever been to one of those parties with distant relatives where you don't know anyone there and just sit in the corner waiting for it to end? 
If so, you've seen 90% of this movie. Throw in a few good scenes that happen so far apart, you forget the last one by the time you see the next one.
Might be worth watching once just to say you have, but you'll probably never watch it again. 
Definitely not "best movie ever material."
''']

In [22]:
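import re
# Apply the same letter-only cleaning that was used on the training data.
# A list (rather than a one-shot map iterator) keeps the cleaned text reusable.
review = [re.sub('[^A-Za-z]', ' ', x) for x in review]
best_pipe.predict(review)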

array([1, 1], dtype=int64)
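
Both reviews are labeled 1 (positive), although the second one reads as negative. One way to look closer at such a case is to check the class probabilities instead of the hard labels: a pipeline whose last step is `LogisticRegression` exposes `predict_proba`. The sketch below shows the call on a small made-up corpus; the data and the `toy_*` names are assumptions for illustration, and the numbers it prints say nothing about the IMDB model.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Hypothetical toy corpus (1 = positive, 0 = negative); not the IMDB data.
toy_docs = [
    "great movie loved it", "wonderful acting and great story",
    "brilliant and moving film", "terrible movie hated it",
    "boring plot and awful acting", "dull and awful film",
]
toy_labels = [1, 1, 1, 0, 0, 0]

toy_pipe = Pipeline([
    ('TVECT', TfidfVectorizer()),
    ('LRC', LogisticRegression(random_state=2023)),
])
toy_pipe.fit(toy_docs, toy_labels)

new_docs = ["a great and moving story", "boring but not terrible"]
proba = toy_pipe.predict_proba(new_docs)

# Columns follow the order of classes_ on the final estimator.
classes = toy_pipe.named_steps['LRC'].classes_
print(classes)
print(np.round(proba, 3))
```

In the notebook, `best_pipe.predict_proba(review)` would give the corresponding probabilities for the two reviews; a value close to 0.5 for the second review would suggest a borderline call rather than a confident one.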