### IMDB Movie Review Sentiment Analysis (Binary Classification)
- CountVectorizer + LogisticRegression

In [1]:
import numpy as np
import pandas as pd

##### 1. Data Exploration

In [2]:
df = pd.read_csv('data/labeledTrainData.tsv', sep='\t')
df.head()

       id  sentiment                                             review
0  5814_8          1  With all this stuff going down at the moment w...
1  2381_9          1  \The Classic War of the Worlds\" by Timothy Hi...
2  7759_3          0  The film starts with a manager (Nicholas Bell)...
3  3630_4          0  It must be assumed that those who praised this...
4  9495_8          1  Superbly trashy and wondrously unpretentious 8...


In [3]:
df = pd.read_csv('data/labeledTrainData.tsv', sep='\t', quoting=3)      # 3 is csv.QUOTE_NONE: keep quote characters as literal text
df.head()

         id  sentiment                                             review
0  "5814_8"          1  "With all this stuff going down at the moment ...
1  "2381_9"          1  "\"The Classic War of the Worlds\" by Timothy ...
2  "7759_3"          0  "The film starts with a manager (Nicholas Bell...
3  "3630_4"          0  "It must be assumed that those who praised thi...
4  "9495_8"          1  "Superbly trashy and wondrously unpretentious ...
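The difference between the two `read_csv` calls above comes from how quote characters are handled. A minimal sketch, assuming a hypothetical one-row sample rather than the real file: with `quoting=csv.QUOTE_NONE` (the named constant for `3`), the parser leaves every `"` and `\` in the field exactly as written, which is why the `id` and `review` values keep their surrounding quotes.

```python
import csv
import io

import pandas as pd

# Hypothetical one-row sample in the same tab-separated layout as the real file.
sample = 'id\tsentiment\treview\n"1_1"\t1\t"He said \\"great\\" twice"\n'

# quoting=3 and quoting=csv.QUOTE_NONE are the same setting.
raw_df = pd.read_csv(io.StringIO(sample), sep='\t', quoting=csv.QUOTE_NONE)

print(csv.QUOTE_NONE)           # the integer behind the named constant
print(raw_df.loc[0, 'id'])      # quotes are kept as part of the value
print(raw_df.loc[0, 'review'])  # embedded \" sequences survive untouched
```

Using the named constant instead of the bare `3` makes the intent easier to read. The leftover quote characters are removed later by the letters-only regex in the preprocessing step.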


In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 25000 entries, 0 to 24999
Data columns (total 3 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   id         25000 non-null  object
 1   sentiment  25000 non-null  int64 
 2   review     25000 non-null  object
dtypes: int64(1), object(2)
memory usage: 586.1+ KB


In [5]:
df.review[0][:1000]

'"With all this stuff going down at the moment with MJ i\'ve started listening to his music, watching the odd documentary here and there, watched The Wiz and watched Moonwalker again. Maybe i just want to get a certain insight into this guy who i thought was really cool in the eighties just to maybe make up my mind whether he is guilty or innocent. Moonwalker is part biography, part feature film which i remember going to see at the cinema when it was originally released. Some of it has subtle messages about MJ\'s feeling towards the press and also the obvious message of drugs are bad m\'kay.<br /><br />Visually impressive but of course this is all about Michael Jackson so unless you remotely like MJ in anyway then you are going to hate this and find it boring. Some may call MJ an egotist for consenting to the making of this movie BUT MJ and most of his fans would say that he made it for the fans which if true is really nice of him.<br /><br />The actual feature film bit when it finally

In [6]:
df.isna().sum().sum()

0

##### 2. Text Preprocessing

In [7]:
# Replace <br /> tags with a space (literal match, not a regex)
df.review = df.review.str.replace('<br />', ' ', regex=False)

In [8]:
# Remove punctuation and digits (keep letters only)
df.review = df.review.str.replace('[^A-Za-z]', ' ', regex=True)

In [9]:
df.review[0][:200]

' With all this stuff going down at the moment with MJ i ve started listening to his music  watching the odd documentary here and there  watched The Wiz and watched Moonwalker again  Maybe i just want '
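The two preprocessing steps can also be wrapped in one function so that exactly the same cleaning is applied to the training data and to any new review later on. This is a sketch only; `clean_review` is a hypothetical helper name, not something defined in the notebook.

```python
import re


def clean_review(text):
    """Apply the notebook's two cleaning steps to a single review string."""
    # Step 1: replace the literal <br /> tag with a space.
    text = text.replace('<br />', ' ')
    # Step 2: replace every character that is not an ASCII letter with a space.
    return re.sub('[^A-Za-z]', ' ', text)


cleaned = clean_review("Good<br /><br />film, 10/10! i've seen it")
print(cleaned.split())
```

On a DataFrame this could be applied with `df.review.apply(clean_review)`. Each replaced character becomes one space, so runs of spaces remain; that is harmless here because `CountVectorizer` tokenizes on word boundaries.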

##### 3. Train/Test Split

In [10]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
    df.review.values, df.sentiment.values, stratify=df.sentiment.values,
    test_size=0.2, random_state=2023
)
np.unique(y_train, return_counts=True)

(array([0, 1], dtype=int64), array([10000, 10000], dtype=int64))
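The IMDB labels are already balanced, so the effect of `stratify` is hard to see in the counts above. A small sketch on hypothetical imbalanced labels (80 negatives, 20 positives) shows what the option does: both splits keep the original 4:1 class ratio.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 80 negatives and 20 positives.
y = np.array([0] * 80 + [1] * 20)
X = np.arange(len(y)).reshape(-1, 1)  # placeholder features

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, stratify=y, test_size=0.2, random_state=2023
)

# Per-class counts in each split.
train_counts = np.bincount(y_tr)
test_counts = np.bincount(y_te)
print(train_counts, test_counts)
```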

##### 4. Text Encoding

In [11]:
from sklearn.feature_extraction.text import CountVectorizer
cvect = CountVectorizer(stop_words='english')

In [14]:
# Do not do this: fitting again on the test set builds a different vocabulary
cvect.fit_transform(X_train).shape, cvect.fit_transform(X_test).shape

((20000, 66602), (5000, 37763))

In [15]:
# Do this instead: fit on the training set only, then transform both sets
cvect.fit(X_train)
X_train_cv = cvect.transform(X_train)
X_test_cv = cvect.transform(X_test)
X_train_cv.shape, X_test_cv.shape

((20000, 66602), (5000, 66602))
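The mismatch in column counts (66602 vs. 37763) arises because each `fit` call learns its own vocabulary. The toy corpus below is hypothetical and only meant to illustrate the mechanism: words unseen during `fit` are silently dropped by `transform`, while refitting on the test documents yields columns that no longer line up with the training matrix.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy documents, not taken from the IMDB data.
train_docs = ['good movie', 'bad movie', 'great acting']
test_docs = ['good acting', 'terrible plot']

vect = CountVectorizer()
vect.fit(train_docs)                    # vocabulary comes from training text only
X_train_toy = vect.transform(train_docs)
X_test_toy = vect.transform(test_docs)  # same columns as the training matrix

# Refitting on the test set produces a different, incompatible vocabulary.
X_test_refit = CountVectorizer().fit_transform(test_docs)

print(sorted(vect.vocabulary_))
print(X_train_toy.shape, X_test_toy.shape, X_test_refit.shape)
```

The second test document contains only unseen words, so its row in `X_test_toy` is all zeros. A model trained on `X_train_toy` could not be applied to `X_test_refit` at all, since the number and meaning of the columns differ.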

##### 5. Training and Evaluation

In [17]:
from sklearn.linear_model import LogisticRegression
lrc = LogisticRegression(random_state=2023, max_iter=500)

In [18]:
# This step is slow, so time it with the %time magic command
%time lrc.fit(X_train_cv, y_train)

CPU times: total: 5.36 s
Wall time: 5.31 s
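`%time` is an IPython magic, so it only works inside a notebook or IPython shell. For a plain Python script, a rough equivalent can be sketched with `time.perf_counter`; the `timed` helper below is a hypothetical name introduced for illustration, and it measures wall time only.

```python
import time


def timed(func, *args, **kwargs):
    """Call func and return (result, elapsed wall-clock seconds)."""
    start = time.perf_counter()
    result = func(*args, **kwargs)
    elapsed = time.perf_counter() - start
    return result, elapsed


# Stand-in workload; in the notebook this would be
# timed(lrc.fit, X_train_cv, y_train).
result, seconds = timed(sum, range(1_000_000))
print(f'Wall time: {seconds:.3f} s')
```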


In [19]:
lrc.score(X_test_cv, y_test)

0.8786

##### 6. Bigram

In [20]:
cvect2 = CountVectorizer(stop_words='english', ngram_range=(1, 2))
cvect2.fit(X_train)
X_train_cv2 = cvect2.transform(X_train)
X_test_cv2 = cvect2.transform(X_test)
X_train_cv2.shape, X_test_cv2.shape

((20000, 1455899), (5000, 1455899))

In [21]:
lrc2 = LogisticRegression(random_state=2023, max_iter=500)
%time lrc2.fit(X_train_cv2, y_train)

CPU times: total: 55.5 s
Wall time: 48.1 s


In [23]:
lrc2.score(X_test_cv2, y_test)

0.8896
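Adding bigrams grew the feature space from 66,602 to 1,455,899 columns and raised test accuracy from 0.8786 to 0.8896, at the cost of a much longer fit. A small sketch on one hypothetical sentence shows which features `ngram_range=(1, 2)` produces. It also illustrates a side effect worth knowing: stop words are removed before n-grams are built, and scikit-learn's built-in English list includes "not" (worth confirming on the installed version), so a phrase like "not good" does not survive as a bigram when `stop_words='english'` is set.

```python
from sklearn.feature_extraction.text import CountVectorizer

doc = ['not good at all']  # hypothetical example sentence

# Unigrams and bigrams, no stop-word filtering.
plain = CountVectorizer(ngram_range=(1, 2)).fit(doc)
plain_vocab = sorted(plain.vocabulary_)

# Same setting as the notebook: stop words are dropped first.
filtered = CountVectorizer(stop_words='english', ngram_range=(1, 2)).fit(doc)
filtered_vocab = sorted(filtered.vocabulary_)

print(plain_vocab)
print(filtered_vocab)
```

Whether to keep negations is a design choice; passing a custom stop-word list or `stop_words=None` are two options to compare against the results above.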

##### 7. Saving and Loading the Model

In [24]:
import joblib

In [26]:
# Save the model (the model/ directory must already exist)
joblib.dump(cvect2, 'model/imdb_cvect_2.pkl')
joblib.dump(lrc2, 'model/imdb_lrc2.pkl')

['model/imdb_lrc2.pkl']

In [27]:
# Load the model
new_cvect = joblib.load('model/imdb_cvect_2.pkl')
new_lrc = joblib.load('model/imdb_lrc2.pkl')
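Because the vectorizer and the classifier must always be used together, one alternative is to bundle them into a single scikit-learn `Pipeline` and save that one object. The sketch below is an assumption about how this could look, not the notebook's method: it trains on a hypothetical four-document toy corpus and writes to a temporary directory instead of `model/`.

```python
import os
import tempfile

import joblib
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical toy data standing in for X_train / y_train.
docs = ['great fun movie', 'wonderful acting',
        'boring awful plot', 'terrible waste of time']
labels = [1, 1, 0, 0]

pipe = make_pipeline(
    CountVectorizer(ngram_range=(1, 2)),
    LogisticRegression(random_state=2023, max_iter=500),
)
pipe.fit(docs, labels)  # fits the vectorizer, then the classifier

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'imdb_pipeline.pkl')  # hypothetical file name
    joblib.dump(pipe, path)
    restored = joblib.load(path)

# The restored pipeline accepts raw text and gives the same predictions.
same = bool((restored.predict(docs) == pipe.predict(docs)).all())
print(same)
```

With a single file there is no risk of loading a classifier alongside a vectorizer it was not trained with. As with any pickle-based format, files should only be loaded from trusted sources and ideally with the same scikit-learn version that wrote them.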

##### 8. Verifying with Real Reviews

In [42]:
review = '''
First off this movie is for kids and fans of Nintendo and the Mario franchise. 
I still think an adult who isnt a fan could still enjoy it but this movie is so full of fan service 
that it will have you smiling the whole time. 
The voice acting I was skeptical but they all work and work well too. 
Jack Black is the star here. I love how they kept the story simple like all of the games. 
Truly felt like a video game on screen. This movie felt like a beautifully animated amusement park ride. 
The audio in the movie was amazing too. The sounds and the score with reimagined iconic music was perfect. 
Some of the songs in the movie felt unnecessary but they worked. 
I think they should've bumped the run time to 105-120 min. 90 min felt too short as it goes by quick. 
I havent had this much wholesome fun at the movies in a long time. If youre a fan you HAVE to see it.'''

In [43]:
# Text preprocessing (same letters-only rule as in training)
import re
review = re.sub('[^A-Za-z]', ' ', review)

In [44]:
# Convert to features with the loaded vectorizer
review_cv = new_cvect.transform([review])
review_cv.shape

(1, 1455899)

In [45]:
# Predict
'positive' if new_lrc.predict(review_cv)[0] == 1 else 'negative'

'positive'

In [40]:
review2 = '''
The movie is a huge miss and a disappointment beyond words.
Story, nope. Dialogue, nope. Funny, nope. Fun, not besides seeing the characters we all love. 
It's just a salad of screaming, explosions, slow mo, racing, camera angles. 
Can you do a Super Mario movie without explosions? Probably not. 
Do you have to do one only with explosions? Definitely not. 
Waste of time for nostalgic adults, and definitely not what i want my kids to remember Super Mario for. 
Whatever reminds you of the good things about Super Mario is inserted
just so you can't say it's not there, beyond amateur and kitsch'''

In [41]:
review2 = re.sub('[^A-Za-z]', ' ', review2)
review_cv2 = new_cvect.transform([review2])
'positive' if new_lrc.predict(review_cv2)[0] == 1 else 'negative'

'negative'
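The clean, transform, predict sequence used for both reviews can be folded into one function that also reports how confident the model is. This is a sketch under assumptions: `predict_sentiment` is a hypothetical helper, and the toy vectorizer and model below only stand in for `new_cvect` and `new_lrc` so that the block runs on its own.

```python
import re

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression


def predict_sentiment(text, vectorizer, model):
    """Return (label, probability of the positive class) for one raw review."""
    # Same cleaning as in training: drop <br /> tags, keep letters only.
    text = text.replace('<br />', ' ')
    text = re.sub('[^A-Za-z]', ' ', text)
    features = vectorizer.transform([text])
    # Column 1 of predict_proba corresponds to class 1 when labels are 0/1.
    proba_pos = float(model.predict_proba(features)[0, 1])
    label = 'positive' if model.predict(features)[0] == 1 else 'negative'
    return label, proba_pos


# Hypothetical toy stand-ins for the loaded vectorizer and classifier.
toy_docs = ['great fun movie', 'wonderful acting',
            'boring awful plot', 'terrible waste of time']
toy_labels = [1, 1, 0, 0]
toy_vect = CountVectorizer().fit(toy_docs)
toy_model = LogisticRegression(max_iter=500).fit(
    toy_vect.transform(toy_docs), toy_labels
)

label, proba = predict_sentiment('Great fun!<br />Wonderful acting.',
                                 toy_vect, toy_model)
print(label, round(proba, 3))
```

In the notebook the call would be `predict_sentiment(review, new_cvect, new_lrc)`. A probability close to 0.5 signals a borderline review, which the bare label does not reveal.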