Text classification is one of the most common tasks in NLP. It is the process of assigning a label or category to a given piece of text. For example, we can classify emails as spam or not spam, tweets as positive or negative, and articles as relevant or not relevant to a given topic.

For example, classify users comments on SQO as relevant to price, competitor, product upgrade or other-reasons(such as client budget). 

💥 Transformer是一种基于自注意力机制的深度学习模型，广泛应用于自然语言处理（NLP）任务。
	
💥 与传统的循环神经网络（RNN）不同，Transformer不依赖于序列的顺序处理，而是通过自注意力机制并行处理输入数据的所有位置，捕捉长距离依赖关系。
	
👉 Transformer的核心组件包括编码器（Encoder）和解码器（Decoder），每个组件都由多层自注意力和前馈神经网络组成。
	
👉 这种架构使得Transformer在处理大规模数据时，能更高效地并行计算，从而大大提高了训练速度和性能，成为了包括BERT、GPT等在内的许多先进模型的基础

# 使用PyTorch实现简单文本分类¶
一、准备工作：环境配置，加载数据

In [1]:
# pip install torch

# pip install spacy
spaCy is an open-source software library for advanced natural language processing, written in the programming languages Python and Cython.
Unlike NLTK, which is widely used for teaching and research, spaCy focuses on providing software for production usage.spaCy also supports deep learning workflows that allow connecting statistical models trained by popular machine learning libraries like TensorFlow, PyTorch or MXNet through its own machine learning library Thinc.
Using Thinc as its backend, spaCy features convolutional neural network models for part-of-speech tagging, dependency parsing, text categorization and named entity recognition (NER). Prebuilt statistical neural network models perform these tasks are available for 23 languages, including English, Portuguese, Spanish, Russian and Chinese, and there is also a multi-language NER model. Additional support for tokenization for more than 65 languages allows users to train custom models on their own datasets as well.

# pip install nltk
https://www.nltk.org/howto/wordnet.html
The Natural Language Toolkit, or more commonly NLTK, is a suite of libraries and programs for symbolic and statistical natural language processing (NLP) for English written in the Python programming language. 
It supports classification, tokenization(标记化), stemming（词干提取）, tagging（标记）, parsing（解析）, and semantic reasoning（语义推理） functionalities. 
NLTK includes graphical demonstrations（图形演示） and sample data. 

🤗 Transformers provides thousands of pretrained models to perform tasks on different modalities such as text, vision, and audio. AutoTokenizer is a special class in the Huggingface Transformers library. It helps you choose the right tokenizer for your model without knowing the details.

These models can be applied on:

📝 Text, for tasks like text classification, information extraction, question answering, summarization, translation, and text generation, in over 100 languages.
🖼️ Images, for tasks like image classification, object detection, and segmentation.
🗣️ Audio, for tasks like speech recognition and audio classification.

# pip install transformers
Tokenizers are essential tools in machine learning, especially in natural language processing (NLP). They break down text into smaller units called tokens.

These tokens can be words, subwords, or characters.

For example, let's take the sentence "I love apples." When we tokenize this sentence, it breaks down into three tokens: "I," "love," and "apples."

Tokenization makes it easier for computers to understand and process the text. It’s used for tasks like translation, sentiment analysis, and all of NLP.

AutoTokenizer is a special class in the Huggingface Transformers library. It helps you choose the right tokenizer for your model without knowing the details.

Think of it as a smart assistant that knows which tool to use for the job.

The AutoTokenizer is easy to use. You don’t have to remember which tokenizer goes with which model. It ensures you use the correct tokenizer for the model, reducing errors and improving consistency.

Autotokenizer is flexible. It works with many different models, allowing you to switch models without changing much code.

In [4]:
import pandas as pd 
import numpy as np
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import Dataset, DataLoader
from transformers import AutoTokenizer
import nltk
import spacy
from sklearn.metrics import classification_report
from tqdm import tqdm
import re
import string,time
from collections import Counter
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import TweetTokenizer
print(torch.__version__)
print(spacy.__version__)
print(nltk.__version__)

2.2.2
3.8.3
3.8.1


In [5]:
nltk.data.path

['/Users/georgeli/nltk_data',
 '/Users/georgeli/anaconda3/nltk_data',
 '/Users/georgeli/anaconda3/share/nltk_data',
 '/Users/georgeli/anaconda3/lib/nltk_data',
 '/usr/share/nltk_data',
 '/usr/local/share/nltk_data',
 '/usr/lib/nltk_data',
 '/usr/local/lib/nltk_data']

In [7]:
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('omw-1.4')
nltk.download('punkt')

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/georgeli/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package wordnet to
[nltk_data]     /Users/georgeli/nltk_data...
[nltk_data] Downloading package omw-1.4 to
[nltk_data]     /Users/georgeli/nltk_data...
[nltk_data] Downloading package punkt to /Users/georgeli/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

Look up a word using synsets(); this function has an optional pos argument which lets you constrain the part of speech of the word. 使用 synsets() 查找单词；这个函数有一个可选的 pos 参数，它可以让你限制单词的词性：

WordNet: an NLTK interface（接口)for WordNet. 
WordNet is a lexical database（词汇数据库） of English. 
Using synsets（同义词集）, helps find conceptual relationships between words such as hypernyms（上位词）, hyponyms（下位词）, synonyms（同义词）, antonyms（反义词） etc.

In [17]:
from nltk.corpus import wordnet
# Synset: a set of synonyms that share a common meaning.
print(wordnet.synsets('happy'))
print(wordnet.synsets('dog', pos=wordnet.VERB))

[Synset('happy.a.01'), Synset('felicitous.s.02'), Synset('glad.s.02'), Synset('happy.s.04')]
[Synset('chase.v.01')]


In [18]:
print(wordnet.synset('chase.v.01').definition())

go after with the intent to catch


In [19]:
print(wordnet.synset('dog.n.01').examples()[0])

the dog barked all night


In [25]:
print(wordnet.synset('spy.n.01').lemma_names('jpn'))

['いぬ', 'まわし者', 'スパイ', '回し者', '回者', '密偵', '工作員', '廻し者', '廻者', '探', '探り', '犬', '秘密捜査員', '諜報員', '諜者', '間者', '間諜', '隠密']


In [26]:
#The synonyms of a word are returned as a nested list of synonyms of the different senses of the input word in the given language, since these different senses are not mutual synonyms:
wordnet.synonyms('car')

[['auto', 'automobile', 'machine', 'motorcar'],
 ['railcar', 'railroad_car', 'railway_car'],
 ['gondola'],
 ['elevator_car'],
 ['cable_car']]

# Data Load

In [30]:
df=pd.read_csv("ecommerceDataset.csv",header=None, names=["segment", "text"])
df.head()

Unnamed: 0,segment,text
0,Household,Paper Plane Design Framed Wall Hanging Motivat...
1,Household,"SAF 'Floral' Framed Painting (Wood, 30 inch x ..."
2,Household,SAF 'UV Textured Modern Art Print Framed' Pain...
3,Household,"SAF Flower Print Framed Painting (Synthetic, 1..."
4,Household,Incredible Gifts India Wooden Happy Birthday U...


In [31]:
df.shape

(50425, 2)

In [32]:
df.segment.value_counts()

segment
Household                 19313
Books                     11820
Electronics               10621
Clothing & Accessories     8671
Name: count, dtype: int64

In [33]:
label_column = 'segment'
samples_per_class = 600
balanced_data = df.groupby(label_column).apply(lambda x: x.sample(n=samples_per_class, random_state=42))
balanced_data = balanced_data.reset_index(drop=True)

In [34]:
balanced_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2400 entries, 0 to 2399
Data columns (total 2 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   segment  2400 non-null   object
 1   text     2400 non-null   object
dtypes: object(2)
memory usage: 37.6+ KB


In [40]:
class TextPreprocessor:
    def __init__(self, max_length=128):
        # Bert tokenizer
        self.tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased") #https://huggingface.co/google-bert/bert-base-uncased
#         self.tokenizer = AutoTokenizer.from_pretrained("gpt2")
        self.max_length = max_length
        self.vocab_size = self.tokenizer.vocab_size
    def clean_text(self, text):
        text = text.lower()
        text = re.sub(r'@\w+|#\w+', '', text)
        pattern = re.compile('<.*?>')
        text=pattern.sub(r'', text)
        text = re.sub(r'http\S+|www\S+|https\S+', '', text)
        text = re.sub(r'[^\x00-\x7F]+', '', text)
        text=text.translate(str.maketrans('', '', string.punctuation))
        stop_words = set(stopwords.words('english'))
        tokenizer = TweetTokenizer()
        tokens = tokenizer.tokenize(text)
        tokens = [word for word in tokens if word not in stop_words]
        text = ' '.join(tokens)
        lemmatizer = WordNetLemmatizer()
        tokens = [lemmatizer.lemmatize(word) for word in text.split()]
        text = ' '.join(tokens)
        return text.strip()
    def text_to_sequence(self, text):
        text = self.clean_text(text)
        encoding = self.tokenizer(
            text,
            max_length=self.max_length,
            padding='max_length',
            truncation=True,
            return_tensors='pt'
        )
        
        return {
            'input_ids': encoding['input_ids'].squeeze(0),
            'attention_mask': encoding['attention_mask'].squeeze(0)
        }

In [42]:
prosesor=TextPreprocessor()
test=balanced_data['text'][1]
print(test)
print('*'*50)
print(prosesor.clean_text(test))

Organic Shastra Pav Bhaji Masala 150g Pav bhaji, a popular street food of India, is incomplete without Pav Bhaji masala. It is a blend of spices - red chillies, coriander seeds, cumin seeds, black pepper, cinnamon, clove, black cardamom, dry mango powder, fennel seeds and turmeric powder. Not just in pav bhaji, this aromatic masala can also be used to spice up curries, stir fries, rice preparations etc
**************************************************
organic shastra pav bhaji masala 150g pav bhaji popular street food india incomplete without pav bhaji masala blend spice red chilli coriander seed cumin seed black pepper cinnamon clove black cardamom dry mango powder fennel seed turmeric powder pav bhaji aromatic masala also used spice curry stir fry rice preparation etc


In [44]:
class SentimentDataset(Dataset):
    def __init__(self, texts, labels=None, preprocessor=None):
        self.texts = texts
        self.labels = labels
        self.preprocessor = preprocessor

    def __len__(self):
        return len(self.texts)

    def __getitem__(self, idx):
        text = self.texts[idx]
        encoding = self.preprocessor.text_to_sequence(text)
        
        item = {
            'input_ids': encoding['input_ids'],
            'attention_mask': encoding['attention_mask'],
            'length': torch.sum(encoding['attention_mask'])
        }
        if self.labels is not None:
            item['label'] = torch.tensor(self.labels[idx])
            
        return item

In [45]:
categories = ['Books', 'Clothing & Accessories', 'Electronics', 'Household']

In [46]:
category_map = {category: idx for idx, category in enumerate(categories)}
balanced_data['segment'] = balanced_data['segment'].map(category_map)

In [48]:
from sklearn.model_selection import train_test_split
texts = balanced_data.text.tolist()
labels = balanced_data.segment.tolist()

train_texts, val_texts, train_labels, val_labels = train_test_split(texts, labels, test_size=0.2, random_state=42 )

In [49]:
class LSTMClassifier(nn.Module):
    def __init__(self, vocab_size, embedding_dim, hidden_dim, num_labels, dropout=0.5):
        super(LSTMClassifier, self).__init__()
        self.embedding = nn.Embedding(vocab_size, embedding_dim)
        self.lstm = nn.LSTM(embedding_dim, hidden_dim, batch_first=True,bidirectional=True)
        self.dropout=nn.Dropout(dropout)
        self.classifier = nn.Linear(hidden_dim, num_labels)
    
    def forward(self, input_ids,attention_mask=None):
        embedded = self.embedding(input_ids)
         # Extract the output of the last time step 
        _,(hidden,_)=self.lstm(embedded)
        
        # Shape Hidden:(number of layers* direction , batch,  hidden)
        hidden_last_layer = hidden[-1,:,:]
        hidden_last_layer = self.dropout(hidden_last_layer)

        # Linear classifier layer
        output = self.classifier(hidden_last_layer)
        return output

In [57]:
from sklearn.metrics import accuracy_score, f1_score,classification_report
# Evaluation function
def evaluate(model, dataloader, device):
    model.eval()
    all_predicted_labels = []
    all_true_labels = []
    with torch.no_grad():
        for batch in tqdm(dataloader, desc='eval'):
            input_ids= batch['input_ids'].to(device)
            labels = batch['label'].to(device)

            # forward pass
            outputs = model(input_ids=input_ids)
            _,predicted_labels=torch.max(outputs, dim =1 )


            all_predicted_labels.extend(predicted_labels.cpu().numpy())
            all_true_labels.extend(labels.cpu().numpy())


    # Calculating Metrics using sklearn.metrics
    accuracy = accuracy_score(all_true_labels, all_predicted_labels)
    f1= f1_score(all_true_labels,all_predicted_labels, average = 'weighted')
    
    return accuracy, f1

In [58]:
from torch.utils.data import Dataset, DataLoader
preprocessor = TextPreprocessor()

train_dataset = SentimentDataset(train_texts, train_labels, preprocessor)
train_dataloader = DataLoader(train_dataset, batch_size=2, shuffle=True)

val_dataset = SentimentDataset(val_texts, val_labels, preprocessor)
val_dataloader = DataLoader(val_dataset, batch_size=2, shuffle=False)


# Model Instantiation
vocab_size=preprocessor.vocab_size
embedding_dim = 128
hidden_dim = 256
num_labels = 4
model = LSTMClassifier(vocab_size, embedding_dim, hidden_dim, num_labels)


# --- Training setup ---
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
# Use GPU if available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)


# --- Training loop ---
epochs = 5
for epoch in range(epochs):
    total_loss = 0
    model.train()  # Put the model in training mode
    progress_bar = tqdm(enumerate(train_dataloader), total=len(train_dataloader),desc=f"Epoch : {epoch+1}/{epochs}")
    for idx, batch in progress_bar : 

        input_ids = batch['input_ids'].to(device)
        labels = batch['label'].to(device)

        # Forward Pass
        outputs = model(input_ids)


        # Calculating Loss
        loss = criterion(outputs, labels)
        total_loss += loss.item()
        # Backward Pass & optimization
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        # Update progress bar loss in training
        progress_bar.set_postfix({'loss': f'{loss.item() / (idx + 1) :.4f}'})

    avg_loss = total_loss/ len(train_dataloader)


    # Evaluate every epoch after training.
    val_accuracy,val_f1 = evaluate(model,val_dataloader, device)
    
    print(f"Epoch {epoch+1}/{epochs} , Train Loss: {avg_loss:.4f}, Val_Accuracy: {val_accuracy:.4f} Val_F1:{val_f1:.4f}")

Epoch : 1/5: 100%|███████████████| 960/960 [00:46<00:00, 20.69it/s, loss=0.0003]
eval: 100%|██████████████████████████████████| 240/240 [00:02<00:00, 109.52it/s]


Epoch 1/5 , Train Loss: 0.9428, Val_Accuracy: 0.7771 Val_F1:0.7789


Epoch : 2/5: 100%|███████████████| 960/960 [00:43<00:00, 22.16it/s, loss=0.0016]
eval: 100%|██████████████████████████████████| 240/240 [00:01<00:00, 138.91it/s]


Epoch 2/5 , Train Loss: 0.4076, Val_Accuracy: 0.6917 Val_F1:0.6964


Epoch : 3/5: 100%|███████████████| 960/960 [00:44<00:00, 21.66it/s, loss=0.0000]
eval: 100%|██████████████████████████████████| 240/240 [00:02<00:00, 112.15it/s]


Epoch 3/5 , Train Loss: 0.1804, Val_Accuracy: 0.8583 Val_F1:0.8576


Epoch : 4/5: 100%|███████████████| 960/960 [00:44<00:00, 21.64it/s, loss=0.0000]
eval: 100%|██████████████████████████████████| 240/240 [00:01<00:00, 133.70it/s]


Epoch 4/5 , Train Loss: 0.0708, Val_Accuracy: 0.8438 Val_F1:0.8422


Epoch : 5/5: 100%|███████████████| 960/960 [00:43<00:00, 22.06it/s, loss=0.0000]
eval: 100%|██████████████████████████████████| 240/240 [00:01<00:00, 134.67it/s]


Epoch 5/5 , Train Loss: 0.0387, Val_Accuracy: 0.8771 Val_F1:0.8777


In [59]:
# Inference Example
def predict_sentiment(text, model, preprocessor, device):
    model.eval()
    
    encoding = preprocessor.text_to_sequence(text)
    input_ids = encoding['input_ids'].unsqueeze(0).to(device)
    
    with torch.no_grad():
        outputs = model(input_ids)

    probabilities = torch.nn.functional.softmax(outputs, dim=-1)


    predicted_label = torch.argmax(probabilities).item()
    return predicted_label, probabilities

In [60]:
predict_sentiment(test, model, preprocessor, device)

(0, tensor([[0.8468, 0.0015, 0.0070, 0.1446]]))

In [62]:
num={
    'Books':0,
    'Clothing & Accessories':1,
    'Electronics':2,
    'Household':3
}
df['segment'] = df['segment'].map(num)

In [63]:
random_row = df.sample(n=1)
random_text = random_row['text'].values[0]
random_Sentment = random_row['segment'].values[0]

In [64]:
predict_sentiment(random_text, model, preprocessor, device)

(3, tensor([[1.2434e-04, 1.2626e-04, 7.5909e-04, 9.9899e-01]]))

In [65]:
random_Sentment

3