### IMDB movie review sentiment analysis (binary classification)
- CountVectorizer + LogisticRegression

In [1]:
import numpy as np
import pandas as pd

##### 1. Data exploration

In [2]:
df = pd.read_csv('../data/labeledTrainData.tsv', sep='\t')
df.head()

Unnamed: 0,id,sentiment,review
0,5814_8,1,With all this stuff going down at the moment w...
1,2381_9,1,"\The Classic War of the Worlds\"" by Timothy Hi..."
2,7759_3,0,The film starts with a manager (Nicholas Bell)...
3,3630_4,0,It must be assumed that those who praised this...
4,9495_8,1,Superbly trashy and wondrously unpretentious 8...


In [3]:
df = pd.read_csv('../data/labeledTrainData.tsv', sep='\t', quoting=3)   # 3 == csv.QUOTE_NONE: keep quote characters as literal text
df.head()

Unnamed: 0,id,sentiment,review
0,"""5814_8""",1,"""With all this stuff going down at the moment ..."
1,"""2381_9""",1,"""\""The Classic War of the Worlds\"" by Timothy ..."
2,"""7759_3""",0,"""The film starts with a manager (Nicholas Bell..."
3,"""3630_4""",0,"""It must be assumed that those who praised thi..."
4,"""9495_8""",1,"""Superbly trashy and wondrously unpretentious ..."
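The value `quoting=3` is the numeric form of `csv.QUOTE_NONE` from the standard library. The sketch below shows the difference on a made-up one-row TSV; the sample row is hypothetical and not taken from `labeledTrainData.tsv`.

```python
import csv
import io

import pandas as pd

# Hypothetical miniature TSV used only for illustration.
sample = 'id\tsentiment\treview\n"1_8"\t1\t"A great film"\n'

# Default quoting: the double quotes are treated as field delimiters and stripped.
default_df = pd.read_csv(io.StringIO(sample), sep='\t')

# csv.QUOTE_NONE: the double quotes are kept as ordinary characters.
raw_df = pd.read_csv(io.StringIO(sample), sep='\t', quoting=csv.QUOTE_NONE)

print(repr(default_df.id[0]))
print(repr(raw_df.id[0]))
```

Using the named constant instead of the bare `3` makes the intent easier to read. The leftover quote characters do no harm here because the later cleaning step replaces every non-letter with a space.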


In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 25000 entries, 0 to 24999
Data columns (total 3 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   id         25000 non-null  object
 1   sentiment  25000 non-null  int64 
 2   review     25000 non-null  object
dtypes: int64(1), object(2)
memory usage: 586.1+ KB


In [12]:
df.review[0]

'"With all this stuff going down at the moment with MJ i\'ve started listening to his music, watching the odd documentary here and there, watched The Wiz and watched Moonwalker again. Maybe i just want to get a certain insight into this guy who i thought was really cool in the eighties just to maybe make up my mind whether he is guilty or innocent. Moonwalker is part biography, part feature film which i remember going to see at the cinema when it was originally released. Some of it has subtle messages about MJ\'s feeling towards the press and also the obvious message of drugs are bad m\'kay.  Visually impressive but of course this is all about Michael Jackson so unless you remotely like MJ in anyway then you are going to hate this and find it boring. Some may call MJ an egotist for consenting to the making of this movie BUT MJ and most of his fans would say that he made it for the fans which if true is really nice of him.  The actual feature film bit when it finally starts is only on f

In [9]:
df.isna().sum().sum()

0

##### 2. Text preprocessing

In [11]:
# Replace <br /> tags with a space
df.review = df.review.str.replace('<br />', ' ')

In [16]:
# Remove punctuation and digits: replace every non-letter character with a space
df.review = df.review.str.replace('[^A-Za-z]', ' ', regex=True)

In [17]:
df.review[0]

' With all this stuff going down at the moment with MJ i ve started listening to his music  watching the odd documentary here and there  watched The Wiz and watched Moonwalker again  Maybe i just want to get a certain insight into this guy who i thought was really cool in the eighties just to maybe make up my mind whether he is guilty or innocent  Moonwalker is part biography  part feature film which i remember going to see at the cinema when it was originally released  Some of it has subtle messages about MJ s feeling towards the press and also the obvious message of drugs are bad m kay   Visually impressive but of course this is all about Michael Jackson so unless you remotely like MJ in anyway then you are going to hate this and find it boring  Some may call MJ an egotist for consenting to the making of this movie BUT MJ and most of his fans would say that he made it for the fans which if true is really nice of him   The actual feature film bit when it finally starts is only on for 
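The two cleaning steps can be wrapped in one function so that exactly the same cleaning is applied to new text at prediction time. This is a minimal sketch; the name `clean_review` is a hypothetical helper, not part of the original notebook.

```python
import re


def clean_review(text):
    """Replace <br /> tags, then every non-letter character, with a space."""
    text = text.replace('<br />', ' ')
    return re.sub('[^A-Za-z]', ' ', text)


print(clean_review('Great film!<br />10/10').split())

# Order matters: running the regex first would leave a stray "br" token.
print(re.sub('[^A-Za-z]', ' ', 'a<br />b').split())
```

On a DataFrame the helper could be applied with `df.review.map(clean_review)`.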

##### 3. Train/test split

In [19]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
    df.review.values, df.sentiment.values, stratify=df.sentiment.values, 
    test_size=0.2, random_state=2023
)
np.unique(y_train, return_counts=True)

(array([0, 1], dtype=int64), array([10000, 10000], dtype=int64))
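The IMDB labels are balanced, so the effect of `stratify` is hard to see in the output above. The sketch below uses a made-up imbalanced label array (80 negatives, 20 positives) to illustrate that stratification keeps the class ratio in both splits.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 80 zeros and 20 ones.
y = np.array([0] * 80 + [1] * 20)
X = np.arange(len(y))

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, stratify=y, test_size=0.2, random_state=2023
)

# Class counts per split; both keep the 4:1 ratio of the full data.
print(np.bincount(y_tr), np.bincount(y_te))
```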

##### 4. Text Encoding

In [21]:
from sklearn.feature_extraction.text import CountVectorizer
cvect = CountVectorizer(stop_words='english')

In [23]:
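# Do NOT do this: fitting separately on train and test builds two different
# vocabularies, so the same word ends up at a different column index.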
cvect.fit_transform(X_train).shape, cvect.fit_transform(X_test).shape

((20000, 66602), (5000, 37763))

In [24]:
# Do this instead: fit on the training data only, then transform both sets
cvect.fit(X_train)
X_train_cv = cvect.transform(X_train)
X_test_cv = cvect.transform(X_test)
X_train_cv.shape, X_test_cv.shape

((20000, 66602), (5000, 66602))

##### 5. Training and evaluation

In [27]:
from sklearn.linear_model import LogisticRegression
lrc = LogisticRegression(random_state=2023, max_iter=500)

In [29]:
# This step is slow; the %time magic command reports how long it takes
%time lrc.fit(X_train_cv, y_train)

CPU times: total: 5.3 s
Wall time: 5.35 s


In [30]:
lrc.score(X_test_cv, y_test)

0.8786

##### 6. Bigram

In [35]:
cvect2 = CountVectorizer(stop_words='english', ngram_range=(1,2))
cvect2.fit(X_train)
X_train_cv2 = cvect2.transform(X_train)
X_test_cv2 = cvect2.transform(X_test)
X_train_cv2.shape, X_test_cv2.shape

((20000, 1455899), (5000, 1455899))

In [36]:
lrc2 = LogisticRegression(random_state=2023, max_iter=500)
%time lrc2.fit(X_train_cv2, y_train)

CPU times: total: 55.8 s
Wall time: 51.6 s


In [37]:
lrc2.score(X_test_cv2, y_test)

0.8896

##### 7. Saving and loading the model

In [38]:
import joblib

In [39]:
# Save the vectorizer and the model
joblib.dump(cvect2, 'model/imdb_cvect_2.pkl')
joblib.dump(lrc2, 'model/imdb_lrc2.pkl')

['model/imdb_lrc2.pkl']

In [40]:
# Load the vectorizer and the model
new_cvect = joblib.load('model/imdb_cvect_2.pkl')
new_lrc = joblib.load('model/imdb_lrc2.pkl')
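The save cell assumes that a `model/` directory already exists; `joblib.dump` writes to the given path and does not create missing folders. A minimal sketch of creating the folder first, using a temporary directory and a placeholder object so it can run anywhere (the file name `example.pkl` is hypothetical):

```python
import tempfile
from pathlib import Path

import joblib

with tempfile.TemporaryDirectory() as tmp:
    model_dir = Path(tmp) / 'model'
    # Create the target folder before dumping; exist_ok avoids an error on reruns.
    model_dir.mkdir(parents=True, exist_ok=True)

    path = model_dir / 'example.pkl'
    joblib.dump({'vocab_size': 3}, path)   # placeholder object, not a real model
    restored = joblib.load(path)

print(restored)
```

The vectorizer and the classifier must always be saved and loaded as a pair, because the classifier's coefficients are only meaningful for that vectorizer's column order.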


##### 8. Checking on real reviews

In [67]:
review = ['''
I was very much disappointed by this flat action movie and its predictable ending. I am a fan of the old Mission:Impossible series of the 60s and 80s and therefore I think the plot is ridiculous at best. Why should Jim Phelps do what he did? He was always loyal through the many episodes of the series and there he could have gotten much more money if he had betrayed his team.

The reason why Peter Graves (Jim Phelps) did not star in this movie is because he did not agree with what I have just said.

Anyway this movie is NOT for fans of the series because there is nothing left of the teamwork spirit of the series. It is a one man show for Tom Cruise.
''',
'''
This is, without a doubt, one of my favorite films of all time! I'll never forget watching this film for the first time with a good buddy of mine, afterward we couldn't stop talking about it and spent a great deal of time explaining plot points to each other. We finally decided that we just had to see it again, so we did and all of our questions were answered and our theories proven correct.

The story is nothing less than superb! Every time you think you have the movie figured out they throw you for another loop, but not too much as to get you irritated trying to figure out the plot. This is most definitely a film that deserves at least two viewings before you can truly understand and appreciate the story. The characters are all excellent as well, although I was sad to see Jack Harmen (Emilio Estevez) get killed off so quickly, I liked his character.

The cast is extraordinary! Tom Cruise plays Ethan Hunt perfectly! Jon Voight was the perfect choice for Jim Phelps. Emmanuelle Beart was very good in her role. Henry Czerny was superb as Kittridge. Jean Reno was an excellent addition to the cast. Ving Rhames was a very nice touch and really added a lot to the film. Kristin Scott Thomas was lovely as always, although played a somewhat minor role in the greater scheme of things. Vanessa Redgrave was another nice addition to the cast. And finally, Emilio Estevez (as I mentioned above), played a small role, and played it quite well.

I can see why some of the big fans of the show wouldn't like this film due to certain plot points that I can't give away, so if you are a big fan of the show, be forewarned, you may have some issues with the film. Personally, I've never seen a single episode of the old television show, so I had absolutely no frame of reference. Which, I believe, put me in a better position to appreciate the story.

I feel that I have to mention the action scenes in this film! SPECTACULAR!!! The scene where Kittridge and Hunt are talking in the restaurant...just AWESOME! The entire last 20 minutes of the film...UNBELIEVABLE!!! The filming, the action, the special effects and stunts alone make this film worth watching (but luckily, there is so much more to appreciate).

If you are a fan of Tom Cruise, or just crime/mystery/action films in general, be sure to check this one out (at least twice). This is honestly one of my top 20 films of all time, I truly hope that you will enjoy this film. Thanks for reading,

-Chris

'''          
]

In [70]:
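# Text preprocessing: replace every non-letter character with a space, as in step 2
import re
# A list is used instead of map(): a map object is a one-shot iterator
# and would be empty if the following cell were run a second time.
review = [re.sub('[^A-Za-z]', ' ', x) for x in review]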

In [71]:
# Convert the text to count features
review_cv = new_cvect.transform(review)
review_cv.shape

(2, 1455899)

In [72]:
# Predict (0 = negative, 1 = positive)
new_lrc.predict(review_cv)

array([0, 1], dtype=int64)
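The prediction matches the tone of the two reviews: the first is negative (0) and the second positive (1). The cleaning, transform and predict steps can be bundled into one function that also returns the positive-class probability from `predict_proba`. This is a sketch under assumptions: `predict_sentiment` is a hypothetical helper, and the tiny training set exists only to make the example self-contained; in the notebook the loaded `new_cvect` and `new_lrc` would be passed in instead.

```python
import re

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression


def predict_sentiment(texts, vectorizer, model):
    """Clean raw texts, vectorize them, and return (labels, P(positive))."""
    cleaned = [re.sub('[^A-Za-z]', ' ', t.replace('<br />', ' ')) for t in texts]
    features = vectorizer.transform(cleaned)
    return model.predict(features), model.predict_proba(features)[:, 1]


# Hypothetical toy training data standing in for the IMDB reviews.
toy_texts = ['great wonderful film', 'awful boring film',
             'wonderful great acting', 'boring awful plot']
toy_labels = [1, 0, 1, 0]

toy_vec = CountVectorizer().fit(toy_texts)
toy_model = LogisticRegression().fit(toy_vec.transform(toy_texts), toy_labels)

labels, probs = predict_sentiment(['A great, wonderful movie!<br />'], toy_vec, toy_model)
print(labels, probs)
```

The probability gives a rough sense of how confident the model is, which the hard 0/1 label alone does not show.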