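# IMDB Movie Review Sentiment Analysis with LSTM and Attention
Goal: implement a sentiment analysis model based on an LSTM and an attention mechanism that reaches an accuracy above 85%.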

## Getting the Dataset
The IMDB dataset can be downloaded from [Kaggle](https://www.kaggle.com/datasets/lakshmi25npathi/imdb-dataset-of-50k-movie-reviews). After downloading, place the `IMDB Dataset.csv` file in the project directory.

In [None]:


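# Install the required libraries (everything imported below, not only scikit-learn)
%pip install numpy pandas matplotlib seaborn scikit-learn gensim tensorflow

# Import the required libraries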
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report
from gensim.models import Word2Vec
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense, Attention, Bidirectional


In [None]:
# Load the IMDB dataset
df = pd.read_csv('IMDB Dataset.csv')
df.head()

In [None]:
# Data preprocessing: encode labels as 1 (positive) / 0 (negative) and hold out 20% for testing
df['sentiment'] = df['sentiment'].map({'positive': 1, 'negative': 0})
X = df['review']
y = df['sentiment']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)


In [None]:
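# Word embeddings (Word2Vec)
# Split reviews with the same rules the Keras Tokenizer uses (lowercasing,
# punctuation filtering) so both vocabularies contain the same tokens.
from tensorflow.keras.preprocessing.text import text_to_word_sequence

sentences = [text_to_word_sequence(review) for review in X_train]
word2vec_model = Word2Vec(sentences, vector_size=100, window=5, min_count=1, workers=4)


In [None]:
# Convert texts to integer sequences and pad them
tokenizer = Tokenizer()
tokenizer.fit_on_texts(X_train)
X_train_seq = tokenizer.texts_to_sequences(X_train)
X_test_seq = tokenizer.texts_to_sequences(X_test)
maxlen = 100
X_train_pad = pad_sequences(X_train_seq, maxlen=maxlen)
X_test_pad = pad_sequences(X_test_seq, maxlen=maxlen)

# Build the embedding matrix in Tokenizer index order, because the padded
# sequences contain Tokenizer ids. Row 0 stays all zeros (padding id).
vocab_size = len(tokenizer.word_index) + 1
embedding_matrix = np.zeros((vocab_size, 100))
for word, i in tokenizer.word_index.items():
    if word in word2vec_model.wv:
        embedding_matrix[i] = word2vec_model.wv[word]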
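
With `maxlen = 100`, `pad_sequences` uses its defaults `padding='pre'` and `truncating='pre'`: short reviews get zeros on the left, and long reviews keep only their last 100 token ids. The following is a minimal pure-Python sketch of that default behaviour for a single sequence. The helper name `pad_pre` is hypothetical and is not part of Keras.

```python
def pad_pre(seq, maxlen, value=0):
    """Mimic the default pad_sequences behaviour for one sequence (illustrative only)."""
    seq = list(seq)
    if len(seq) > maxlen:
        # 'pre' truncation drops ids from the front, keeping the last maxlen ids
        seq = seq[len(seq) - maxlen:]
    # 'pre' padding fills the missing positions on the left with `value`
    return [value] * (maxlen - len(seq)) + seq


short_row = pad_pre([5, 8, 2], maxlen=5)
long_row = pad_pre([1, 2, 3, 4, 5, 6, 7], maxlen=5)
print(short_row)  # [0, 0, 5, 8, 2]
print(long_row)   # [3, 4, 5, 6, 7]
```

If the beginning of a review is expected to matter more than its end, passing `truncating='post'` to `pad_sequences` keeps the first `maxlen` tokens instead.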


In [None]:
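# Build the BiLSTM + attention model. Attention expects a [query, value] list,
# so the functional API is used here instead of Sequential.
from tensorflow.keras import Input, Model
from tensorflow.keras.initializers import Constant
from tensorflow.keras.layers import GlobalAveragePooling1D
inputs = Input(shape=(maxlen,), dtype='int32')
x = Embedding(input_dim=vocab_size, output_dim=100,
              embeddings_initializer=Constant(embedding_matrix), trainable=False)(inputs)
x = Bidirectional(LSTM(64, return_sequences=True))(x)
x = Attention()([x, x])          # self-attention over the BiLSTM states
x = GlobalAveragePooling1D()(x)  # collapse the time axis to one vector per review
outputs = Dense(1, activation='sigmoid')(x)
model = Model(inputs, outputs)
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
model.summary()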
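
To make the role of the `Attention` layer more concrete: by default it computes unscaled dot-product attention, and when the same sequence is used as query and value it acts as self-attention over the LSTM states. The sketch below reproduces that computation in plain Python on a toy sequence. The function names and toy values are illustrative assumptions, and the mean pooling at the end is one possible way to reduce the sequence to a single vector, not something the layer does itself.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot_product_self_attention(states):
    """states: list of T vectors. Returns (contexts, weights), one row per time step."""
    contexts, all_weights = [], []
    for query in states:
        # Score every time step against the current query with a dot product
        scores = [sum(q * k for q, k in zip(query, key)) for key in states]
        weights = softmax(scores)
        # Context vector = attention-weighted sum of all states
        context = [sum(w * v[d] for w, v in zip(weights, states))
                   for d in range(len(query))]
        contexts.append(context)
        all_weights.append(weights)
    return contexts, all_weights


def mean_pool(vectors):
    """Average over the time axis to obtain one vector per sequence."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]


toy_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # 3 time steps, 2 features
contexts, weights = dot_product_self_attention(toy_states)
pooled = mean_pool(contexts)
print(len(contexts), len(contexts[0]), len(pooled))
```

Each row of `weights` sums to 1, so every context vector is a convex combination of the input states.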


In [None]:
# Train the model (20% of the training data is held out for validation)
history = model.fit(X_train_pad, y_train, epochs=5, batch_size=64, validation_split=0.2)


In [None]:
# Evaluate the model: threshold the predicted probabilities at 0.5
y_pred = (model.predict(X_test_pad) > 0.5).astype('int32').ravel()
accuracy = accuracy_score(y_test, y_pred)
print(f'Accuracy: {accuracy * 100:.2f}%')
print(classification_report(y_test, y_pred))
assert accuracy > 0.85, 'Model accuracy did not reach 85%'
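
For readers who want to see what the evaluation cell reports, here is a small pure-Python sketch of how accuracy, precision, and recall for the positive class follow from thresholded predictions. The helper `binary_metrics` and the toy probabilities and labels are made up for illustration; they are not results from this model.

```python
def binary_metrics(y_true, y_pred):
    """Accuracy, precision and recall for the positive class (label 1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return {
        'accuracy': (tp + tn) / len(y_true),
        'precision': tp / (tp + fp) if tp + fp else 0.0,
        'recall': tp / (tp + fn) if tp + fn else 0.0,
    }


toy_probs = [0.9, 0.2, 0.6, 0.4, 0.7, 0.1]     # hypothetical sigmoid outputs
toy_labels = [1, 0, 0, 1, 1, 0]                # hypothetical ground truth
toy_preds = [int(p > 0.5) for p in toy_probs]  # same 0.5 threshold as above
metrics = binary_metrics(toy_labels, toy_preds)
print(metrics)
```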