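#### Text classification is the process of assigning text documents or sentences to predefined categories or labels. The main approaches are listed below, each with a basic code example: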

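1. **Naive Bayes classifier**:
   Naive Bayes is a simple probabilistic classifier based on Bayes' theorem; it is "naive" because it assumes the features are conditionally independent given the class.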

   ``` python
   from sklearn.naive_bayes import MultinomialNB
   from sklearn.feature_extraction.text import TfidfVectorizer

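   # `texts` (a list of strings) and `labels` (one class id per text)
   # are assumed to be defined beforehand.
   vectorizer = TfidfVectorizer()
   X = vectorizer.fit_transform(texts)  # sparse TF-IDF feature matrix
   clf = MultinomialNB()
   clf.fit(X, labels)
   # Predicting on the training data only shows the API; it does not measure generalization.
   predictions = clf.predict(X)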
   ```

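2. **Support vector machine (SVM)**:
   An SVM is a supervised learning algorithm used for classification or regression; for classification it looks for the hyperplane that separates the classes with the largest margin.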

   ``` python
   from sklearn.svm import SVC

   # Reuses X and labels from the Naive Bayes snippet above.
   clf = SVC(kernel="linear")
   clf.fit(X, labels)
   predictions = clf.predict(X)
   ```

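3. **Decision trees and random forests**:
   A decision tree is a tree-structured model; a random forest is an ensemble of decision trees whose predictions are aggregated.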

   ``` python
   from sklearn.ensemble import RandomForestClassifier

   # Reuses X and labels from the Naive Bayes snippet above.
   clf = RandomForestClassifier(n_estimators=100)
   clf.fit(X, labels)
   predictions = clf.predict(X)
   ```

4. **Logistic regression**:
   Despite the name, logistic regression is a classification algorithm.

   ``` python
   from sklearn.linear_model import LogisticRegression

   # Reuses X and labels from the Naive Bayes snippet above.
   clf = LogisticRegression()
   clf.fit(X, labels)
   predictions = clf.predict(X)
   ```

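5. **Deep learning (e.g. CNN, RNN, BERT)**:
   Deep learning methods, in particular convolutional neural networks (CNNs) and recurrent neural networks (RNNs), often perform well on text classification tasks.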

   ``` python
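   # This requires a deep learning library such as TensorFlow or PyTorch.
   # Below is a simple TensorFlow Keras example.
   # X_train, X_val, X_test are assumed to be integer-encoded, padded sequences
   # (word indices below 10000); y_train and y_val are 0/1 labels.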
   from tensorflow.keras.models import Sequential
   from tensorflow.keras.layers import Dense, Embedding, GlobalAveragePooling1D

   model = Sequential()
   model.add(Embedding(input_dim=10000, output_dim=16))
   model.add(GlobalAveragePooling1D())
   model.add(Dense(1, activation='sigmoid'))
   
   model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
   model.fit(X_train, y_train, epochs=10, batch_size=512, validation_data=(X_val, y_val))
   predictions = model.predict(X_test)
   ```
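The scikit-learn snippets above all fit and predict on the same matrix `X`, which shows the API but says nothing about performance on unseen text. A minimal sketch of how the four classifiers might be compared on a held-out split follows. It assumes the six toy sentences used in the SVM walkthrough later in this notebook; the `models` and `results` names, the stratified split, and the `random_state` values are illustrative choices, not part of the original snippets.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# Toy data for illustration only; far too small for meaningful scores.
texts = [
    "I love machine learning",
    "I hate coding bugs",
    "Deep learning is fascinating",
    "Programming is like magic",
    "Bugs can be annoying",
    "Machine learning opens many opportunities",
]
labels = [1, 0, 1, 1, 0, 1]  # 1 = positive sentiment, 0 = negative sentiment

# Hold out two sentences; stratify so both splits contain both classes.
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=2, random_state=0, stratify=labels
)

# Hypothetical registry of the classifiers discussed above.
models = {
    "naive_bayes": MultinomialNB(),
    "svm": SVC(kernel="linear"),
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=0),
    "logistic_regression": LogisticRegression(),
}

results = {}
for name, clf in models.items():
    # The pipeline fits TF-IDF on the training texts only, so no
    # vocabulary or IDF statistics leak in from the held-out texts.
    pipeline = make_pipeline(TfidfVectorizer(), clf)
    pipeline.fit(X_train, y_train)
    results[name] = pipeline.score(X_test, y_test)  # accuracy on held-out data

for name, acc in results.items():
    print(f"{name}: accuracy on {len(y_test)} held-out samples = {acc:.2f}")
```

Wrapping the vectorizer and the classifier in one pipeline keeps the preprocessing and the model together, so swapping the classifier is a one-line change.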



### Support vector machine classification

In [6]:
# Import the required libraries
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Prepare the data
texts = [
    "I love machine learning",
    "I hate coding bugs",
    "Deep learning is fascinating",
    "Programming is like magic",
    "Bugs can be annoying",
    "Machine learning opens many opportunities"
]
labels = [1, 0, 1, 1, 0, 1]  # 1 = positive sentiment, 0 = negative sentiment

# Data preprocessing
# Split the data into a training set and a test set
X_train, X_test, y_train, y_test = train_test_split(texts, labels, test_size=0.2, random_state=42)

# Convert the texts to vectors with TfidfVectorizer
# (fit on the training texts only, then apply the same mapping to the test texts)
vectorizer = TfidfVectorizer()
X_train_vec = vectorizer.fit_transform(X_train)
X_test_vec = vectorizer.transform(X_test)

# Classify with an SVM
# Initialize the SVM classifier
clf = SVC(kernel='linear')  # use a linear kernel

# Train the model
clf.fit(X_train_vec, y_train)

# Predict
y_pred = clf.predict(X_test_vec)

# Print each test sample together with its predicted class
for text, prediction in zip(X_test, y_pred):
    print(f"Text: {text}\nPredicted Class: {'Positive' if prediction == 1 else 'Negative'}\n")

# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy * 100:.2f}%")


Text: I love machine learning
Predicted Class: Positive

Text: I hate coding bugs
Predicted Class: Positive

Accuracy: 50.00%


In [7]:
print(y_pred)

[1 1]
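With `test_size=0.2` on six sentences only two are held out, so the accuracy can only be 0%, 50%, or 100%. In the run above, the two held-out sentences are the first two in `texts`, which leaves four training sentences, only one of them negative ("Bugs can be annoying"). The sketch below is one way to inspect the label balance of each split and compare against a majority-class baseline; `train_counts`, `majority_label`, and `baseline_acc` are illustrative names introduced here.

```python
from collections import Counter

from sklearn.model_selection import train_test_split

texts = [
    "I love machine learning",
    "I hate coding bugs",
    "Deep learning is fascinating",
    "Programming is like magic",
    "Bugs can be annoying",
    "Machine learning opens many opportunities",
]
labels = [1, 0, 1, 1, 0, 1]  # 1 = positive sentiment, 0 = negative sentiment

# Same split parameters as the SVM cell above.
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.2, random_state=42
)

# Count how many examples of each class ended up in each split.
train_counts = Counter(y_train)
test_counts = Counter(y_test)
print("Training label counts:", dict(train_counts))
print("Test label counts:", dict(test_counts))

# Baseline: always predict the most frequent training label.
majority_label = train_counts.most_common(1)[0][0]
baseline_acc = sum(1 for y in y_test if y == majority_label) / len(y_test)
print(f"Majority-class baseline accuracy: {baseline_acc:.2f}")
```

If the classifier does no better than this baseline, as the all-positive `y_pred` above suggests, the likely cause is the tiny, imbalanced training set rather than the SVM itself.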


### Convolutional neural network classification

In [4]:
# Import the required libraries and modules
import tensorflow as tf                  # TensorFlow framework
import numpy as np                       # NumPy, for numerical computation
from tensorflow.keras.preprocessing.text import Tokenizer  # tokenizer for text preprocessing
from tensorflow.keras.preprocessing.sequence import pad_sequences  # pads text sequences to the same length
from tensorflow.keras.models import Sequential  # linear stack of layers for building the model
from tensorflow.keras.layers import Dense, Embedding, Conv1D, MaxPooling1D, GlobalMaxPooling1D  # neural network layers

# Example data
texts = ["I love machine learning", 
         "I hate coding bugs", 
         "Deep learning is fascinating", 
         "Programming is like magic"]  # input texts
labels = [1, 0, 1, 1]  # sentiment labels for texts: 1 = positive, 0 = negative

# Data preprocessing
tokenizer = Tokenizer(num_words=50)     # tokenizer that keeps only word indices below 50
tokenizer.fit_on_texts(texts)            # fit on the input texts to build the word index
sequences = tokenizer.texts_to_sequences(texts)  # convert the texts to sequences of integers
data = pad_sequences(sequences, maxlen=10)  # pad or truncate every sequence to length 10
labels = np.array(labels).reshape(-1, 1)  # shape the labels as (batch_size, 1) for training

# Build the CNN model
model = Sequential()  # initialize a sequential model
model.add(Embedding(50, 32, input_length=10))  # embedding layer: vocabulary size 50, embedding dimension 32, input length 10
model.add(Conv1D(32, 3, activation='relu'))    # 1D convolution: 32 filters, kernel size 3, ReLU activation
model.add(Conv1D(32, 3, activation='relu'))    # a second 1D convolution layer
model.add(GlobalMaxPooling1D())  # global max pooling: take the maximum of each feature map
model.add(Dense(32, activation='relu'))  # fully connected layer: 32 units, ReLU activation
model.add(Dense(1, activation='sigmoid'))  # output layer: 1 unit with sigmoid activation for binary classification

# Compile the model
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])  # Adam optimizer, binary cross-entropy loss, track accuracy

# Train the model
history = model.fit(data, labels, epochs=30, verbose=2)  # train for 30 epochs; verbose=2 prints one log line per epoch

# Predict
predictions = model.predict(data)  # predict on the input data (the same data used for training)
predicted_labels = [1 if p > 0.5 else 0 for p in predictions.ravel()]  # convert probabilities to labels with a 0.5 threshold

predicted_labels  # display the predicted labels



Epoch 1/30
1/1 - 2s - loss: 0.6929 - accuracy: 0.7500 - 2s/epoch - 2s/step
Epoch 2/30
1/1 - 0s - loss: 0.6863 - accuracy: 0.7500 - 4ms/epoch - 4ms/step
Epoch 3/30
1/1 - 0s - loss: 0.6811 - accuracy: 0.7500 - 6ms/epoch - 6ms/step
Epoch 4/30
1/1 - 0s - loss: 0.6758 - accuracy: 0.7500 - 5ms/epoch - 5ms/step
Epoch 5/30
1/1 - 0s - loss: 0.6702 - accuracy: 0.7500 - 4ms/epoch - 4ms/step
Epoch 6/30
1/1 - 0s - loss: 0.6646 - accuracy: 0.7500 - 4ms/epoch - 4ms/step
Epoch 7/30
1/1 - 0s - loss: 0.6590 - accuracy: 0.7500 - 9ms/epoch - 9ms/step
Epoch 8/30
1/1 - 0s - loss: 0.6533 - accuracy: 0.7500 - 5ms/epoch - 5ms/step
Epoch 9/30
1/1 - 0s - loss: 0.6471 - accuracy: 0.7500 - 9ms/epoch - 9ms/step
Epoch 10/30
1/1 - 0s - loss: 0.6406 - accuracy: 0.7500 - 14ms/epoch - 14ms/step
Epoch 11/30
1/1 - 0s - loss: 0.6337 - accuracy: 0.7500 - 9ms/epoch - 9ms/step
Epoch 12/30
1/1 - 0s - loss: 0.6264 - accuracy: 0.7500 - 13ms/epoch - 13ms/step
Epoch 13/30
1/1 - 0s - loss: 0.6185 - accuracy: 0.7500 - 6ms/epoch - 6m

[1, 1, 1, 1]
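To make the convolution and pooling steps in the model above more concrete, here is a minimal NumPy sketch of what two `Conv1D(32, 3, activation='relu')` layers followed by `GlobalMaxPooling1D` do to one embedded sentence of length 10. It assumes the Keras defaults of stride 1 and no padding; the random weights and the helper names `conv1d_relu` and `global_max_pool` are hypothetical stand-ins for illustration, not the trained model or the Keras implementation.

```python
import numpy as np


def conv1d_relu(x, kernels, bias):
    """Valid (unpadded), stride-1 1D convolution followed by ReLU.

    x:       (seq_len, in_channels)
    kernels: (kernel_size, in_channels, n_filters)
    bias:    (n_filters,)
    """
    kernel_size, _, n_filters = kernels.shape
    out_len = x.shape[0] - kernel_size + 1
    out = np.empty((out_len, n_filters))
    for t in range(out_len):
        window = x[t:t + kernel_size]  # (kernel_size, in_channels)
        # Each filter is a weighted sum over the whole window.
        out[t] = np.tensordot(window, kernels, axes=([0, 1], [0, 1])) + bias
    return np.maximum(out, 0.0)  # ReLU


def global_max_pool(feature_maps):
    """Keep the maximum activation of each filter across all positions."""
    return feature_maps.max(axis=0)


rng = np.random.default_rng(0)
embedded = rng.normal(size=(10, 32))           # one sentence: 10 tokens, 32-dim embeddings
kernels_1 = rng.normal(size=(3, 32, 32)) * 0.1  # random stand-in weights for layer 1
kernels_2 = rng.normal(size=(3, 32, 32)) * 0.1  # random stand-in weights for layer 2
bias = np.zeros(32)

h1 = conv1d_relu(embedded, kernels_1, bias)  # sequence shrinks from 10 to 8 positions
h2 = conv1d_relu(h1, kernels_2, bias)        # and from 8 to 6 positions
pooled = global_max_pool(h2)                 # one value per filter

print(h1.shape, h2.shape, pooled.shape)  # (8, 32) (6, 32) (32,)
```

Each unpadded convolution with kernel size 3 shortens the sequence by two positions, and the pooling step collapses whatever length remains into a fixed 32-dimensional vector, which is what lets the dense layers that follow work on a fixed-size input.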