<a href="https://colab.research.google.com/github/trietp1253201581/NaturalLanguangeProcessing/blob/main/sentiment_analysis_movie.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Sentiment Analysis of Movie Reviews Using TF-IDF

## Data Exploration

In [1]:
!ls .

kaggle.json  sample_data


In [2]:
import os
IS_KAGGLE = 'KAGGLE_KERNEL_RUN_TYPE' in os.environ

In [6]:
if IS_KAGGLE:
    data_dir = '../input/sentiment-analysis-on-movie-reviews'
    train_fname = data_dir + '/train.tsv'
    test_fname = data_dir + '/test.tsv'
    sample_fname = data_dir + '/sampleSubmission.csv'
else:
    os.environ['KAGGLE_CONFIG_DIR'] = '.'
    !kaggle competitions download -c sentiment-analysis-on-movie-reviews -f train.tsv.zip -p data
    !kaggle competitions download -c sentiment-analysis-on-movie-reviews -f test.tsv.zip -p data
    !kaggle competitions download -c sentiment-analysis-on-movie-reviews -f sampleSubmission.csv -p data
    train_fname = 'data/train.tsv.zip'
    test_fname = 'data/test.tsv.zip'
    sample_fname = 'data/sampleSubmission.csv'

train.tsv.zip: Skipping, found more recently modified local copy (use --force to force download)
test.tsv.zip: Skipping, found more recently modified local copy (use --force to force download)
sampleSubmission.csv: Skipping, found more recently modified local copy (use --force to force download)


In [4]:
import pandas as pd

In [7]:
train_df = pd.read_csv(train_fname, sep='\t')
test_df = pd.read_csv(test_fname, sep='\t')
sample_df = pd.read_csv(sample_fname)

In [8]:
train_df

Unnamed: 0,PhraseId,SentenceId,Phrase,Sentiment
0,1,1,A series of escapades demonstrating the adage ...,1
1,2,1,A series of escapades demonstrating the adage ...,2
2,3,1,A series,2
3,4,1,A,2
4,5,1,series,2
...,...,...,...,...
156055,156056,8544,Hearst 's,2
156056,156057,8544,forced avuncular chortles,1
156057,156058,8544,avuncular chortles,3
156058,156059,8544,avuncular,2


In [9]:
train_df['Sentiment'].value_counts()

Sentiment
2    79582
3    32927
1    27273
4     9206
0     7072
Name: count, dtype: int64
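
The counts above show that the neutral class (2) dominates. As a rough reference point for the accuracy figures later on, a sketch like the following computes each class's share and the accuracy of a baseline that always predicts the most frequent class. The counts are copied from the output above; the variable names are illustrative.

```python
# Class counts copied from the value_counts() output above.
counts = {2: 79582, 3: 32927, 1: 27273, 4: 9206, 0: 7072}

total = sum(counts.values())
shares = {label: n / total for label, n in sorted(counts.items())}

for label, share in shares.items():
    print(f"class {label}: {share:.1%}")

# Always predicting the most frequent class gives this accuracy.
majority_label = max(counts, key=counts.get)
baseline_accuracy = counts[majority_label] / total
print(f"majority baseline (class {majority_label}): {baseline_accuracy:.3f}")
```

The majority baseline comes out at about 51%, so any classifier trained on this data should be judged against that figure rather than against chance.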

In [10]:
test_df

Unnamed: 0,PhraseId,SentenceId,Phrase
0,156061,8545,An intermittently pleasing but mostly routine ...
1,156062,8545,An intermittently pleasing but mostly routine ...
2,156063,8545,An
3,156064,8545,intermittently pleasing but mostly routine effort
4,156065,8545,intermittently pleasing but mostly routine
...,...,...,...
66287,222348,11855,"A long-winded , predictable scenario ."
66288,222349,11855,"A long-winded , predictable scenario"
66289,222350,11855,"A long-winded ,"
66290,222351,11855,A long-winded


## Implementing TF-IDF

In [15]:
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem.snowball import SnowballStemmer
import nltk

In [22]:
nltk.download('punkt')
nltk.download('stopwords')
stop_words = stopwords.words('english')
stemmer = SnowballStemmer('english')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [17]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [18]:
def tokenizer(text):
    return [stemmer.stem(word) for word in word_tokenize(text)]

In [23]:
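# Stop words are removed after tokenization, so the unstemmed NLTK list
# may not match every stemmed token produced by `tokenizer`.
vectorizer = TfidfVectorizer(tokenizer=tokenizer,
                             stop_words=stop_words,
                             lowercase=True,
                             max_features=10000)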

In [24]:
%%time
vectorizer.fit(train_df['Phrase'])



CPU times: user 33.1 s, sys: 59.6 ms, total: 33.1 s
Wall time: 33.4 s
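
To make the weighting more concrete, here is a minimal pure-Python sketch of what `TfidfVectorizer` computes under its default settings, which the cell above does not override: raw term counts, smoothed IDF `ln((1 + n) / (1 + df)) + 1`, and L2 normalization. The helper `tfidf` and the toy phrases are hypothetical and only illustrate the formula; they are not part of the notebook.

```python
import math
from collections import Counter

def tfidf(docs):
    """Return one {term: weight} dict per tokenized document."""
    n_docs = len(docs)
    # Document frequency: number of documents containing each term.
    df = Counter(term for doc in docs for term in set(doc))
    # Smoothed IDF, as in scikit-learn's default: ln((1 + n) / (1 + df)) + 1.
    idf = {t: math.log((1 + n_docs) / (1 + d)) + 1 for t, d in df.items()}

    vectors = []
    for doc in docs:
        tf = Counter(doc)  # raw term counts
        weights = {t: c * idf[t] for t, c in tf.items()}
        norm = math.sqrt(sum(w * w for w in weights.values()))
        # L2-normalize so every non-empty document has unit length.
        vectors.append({t: w / norm for t, w in weights.items()} if norm else {})
    return vectors

# Toy, already-tokenized phrases (illustrative, not taken from the dataset).
docs = [["good", "movi"], ["bad", "movi"], ["good", "good", "stori"]]
for doc, vec in zip(docs, tfidf(docs)):
    print(doc, {t: round(w, 3) for t, w in vec.items()})
```

In the second toy phrase, `bad` receives a higher weight than `movi` because it occurs in fewer documents, which is the effect the IDF term is meant to produce.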


In [25]:
print(len(vectorizer.vocabulary_))
print(vectorizer.vocabulary_)

10000
{'seri': 7745, 'escapad': 2934, 'demonstr': 2312, 'adag': 281, 'good': 3753, 'goos': 3769, 'also': 425, 'gander': 3588, ',': 12, 'occasion': 6037, 'amus': 467, 'none': 5963, 'amount': 463, 'much': 5754, 'stori': 8390, '.': 17, 'quiet': 6951, 'introspect': 4602, 'entertain': 2889, 'independ': 4450, 'worth': 9891, 'seek': 7656, 'even': 2975, 'fan': 3135, 'merchant': 5517, "'s": 9, 'work': 9868, 'suspect': 8594, 'would': 9895, 'hard': 4005, 'time': 8899, 'sit': 7945, 'one': 6098, 'posit': 6668, 'thrill': 8858, 'combin': 1787, 'ethnographi': 2957, 'intrigu': 4598, 'betray': 963, 'deceit': 2239, 'murder': 5776, 'shakespearean': 7787, 'tragedi': 9012, 'juici': 4773, 'soap': 8102, 'opera': 6126, 'aggress': 349, 'self-glorif': 7691, 'manipul': 5377, 'comedy-drama': 1794, 'near': 5857, 'epic': 2906, 'proport': 6835, 'root': 7391, 'sincer': 7930, 'perform': 6433, 'titl': 8923, 'charact': 1517, 'undergo': 9249, 'midlif': 5564, 'crisi': 2072, 'narrat': 5837, 'troubl': 9091, 'everi': 2986, 'd

In [26]:
%%time
vector = vectorizer.transform(train_df['Phrase'])

CPU times: user 38 s, sys: 167 ms, total: 38.2 s
Wall time: 44.7 s


In [36]:
# Keep only the rows whose Phrase is not missing.
test_df = test_df[test_df['Phrase'].notna()]

In [37]:
test_df

Unnamed: 0,PhraseId,SentenceId,Phrase
0,156061,8545,An intermittently pleasing but mostly routine ...
1,156062,8545,An intermittently pleasing but mostly routine ...
2,156063,8545,An
3,156064,8545,intermittently pleasing but mostly routine effort
4,156065,8545,intermittently pleasing but mostly routine
...,...,...,...
66287,222348,11855,"A long-winded , predictable scenario ."
66288,222349,11855,"A long-winded , predictable scenario"
66289,222350,11855,"A long-winded ,"
66290,222351,11855,A long-winded


In [None]:
%%time
test_vector = vectorizer.transform(test_df['Phrase'])

## ML Model

In [39]:
from sklearn.model_selection import train_test_split

In [40]:
train_inputs, val_inputs, train_targets, val_targets = train_test_split(vector,
                                                                        train_df['Sentiment'],
                                                                        test_size=0.2,
                                                                        random_state=42)

In [41]:
train_inputs.shape

(124848, 10000)

In [42]:
from sklearn.linear_model import LogisticRegression

In [49]:
model = LogisticRegression(solver='sag')

In [50]:
%%time
model.fit(train_inputs, train_targets)

CPU times: user 3.26 s, sys: 2.59 ms, total: 3.26 s
Wall time: 3.26 s


In [45]:
from sklearn.metrics import accuracy_score
from sklearn.metrics import f1_score

In [51]:
train_preds = model.predict(train_inputs)
val_preds = model.predict(val_inputs)

In [52]:
accuracy_score(train_targets, train_preds), f1_score(train_targets, train_preds, average='macro')

(0.6712001794181724, 0.5269737782711433)

In [53]:
accuracy_score(val_targets, val_preds), f1_score(val_targets, val_preds, average='macro')

(0.6314558503139818, 0.4720591648244506)

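The model therefore reaches about 67% accuracy on the training set and 63% on the validation set, roughly two-thirds in both cases. The macro F1 scores are lower (0.53 and 0.47), which suggests that the less frequent sentiment classes are predicted less well than the dominant neutral class.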
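
The cells above report both accuracy and macro-averaged F1. The following sketch, on made-up labels, shows how macro F1 is computed and why it can sit well below accuracy when one class dominates. The function `macro_f1` is an illustrative reimplementation, not the scikit-learn code, and the toy labels are not taken from the dataset.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores over the labels that occur."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        # F1 = 2*TP / (2*TP + FP + FN); a class that is never hit scores 0.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)

# Toy labels (illustrative): class 2 is predicted well, the rare classes mostly are not.
y_true = [2, 2, 2, 2, 2, 2, 1, 3, 0, 4]
y_pred = [2, 2, 2, 2, 2, 2, 2, 3, 2, 2]

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(f"accuracy = {accuracy:.2f}, macro F1 = {macro_f1(y_true, y_pred):.2f}")
```

Because every class counts equally in the macro average, three missed minority classes pull the score down even though seven of the ten predictions are correct.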

In [54]:
model.predict(test_vector)

array([3, 3, 2, ..., 1, 1, 1])
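
Because rows with a missing `Phrase` were dropped before prediction, the prediction array can be shorter than the original test file. One possible way to turn it into a submission is sketched below. It assumes the submission format uses the columns `PhraseId` and `Sentiment` (the contents of `sampleSubmission.csv` are not shown in this notebook), and it fills any skipped row with the most frequent class, 2. The function `build_submission` and the toy ids are hypothetical.

```python
import csv
import io

def build_submission(all_ids, predicted_ids, predictions, default_label=2):
    """Map predictions back to every PhraseId, filling skipped rows with a default."""
    by_id = dict(zip(predicted_ids, predictions))
    return [(pid, by_id.get(pid, default_label)) for pid in all_ids]

# Toy values (illustrative): id 156063 stands in for a row dropped before prediction.
all_ids = [156061, 156062, 156063, 156064]
predicted_ids = [156061, 156062, 156064]
predictions = [3, 3, 2]

rows = build_submission(all_ids, predicted_ids, predictions)

buffer = io.StringIO()  # swap for open("submission.csv", "w", newline="") to write a file
writer = csv.writer(buffer)
writer.writerow(["PhraseId", "Sentiment"])  # assumed column names
writer.writerows(rows)
print(buffer.getvalue())
```

In the notebook itself, `predicted_ids` would correspond to `test_df['PhraseId']` after filtering and `predictions` to the output of `model.predict(test_vector)`, while `all_ids` would have to come from the unfiltered test file.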