
## Sentiment classification
Sentiment classification is a natural language processing (NLP) task that determines the emotional tone or opinion expressed in a piece of text. In practice, this means labeling text such as reviews, social media posts, or comments as positive, negative, or neutral, or with more fine-grained sentiment categories.

### Key Points:
### Definition:
- It’s the process of assigning sentiment labels (such as positive, negative, or neutral) to a given text.

### Techniques:

- Lexicon-based Methods: Use predefined dictionaries of words that are associated with specific sentiments.
- Machine Learning Approaches: Train models like Naive Bayes, Support Vector Machines (SVM), or logistic regression on labeled datasets to predict sentiment.
- Deep Learning Methods: Leverage neural networks, such as LSTMs or Transformers, which can capture context and nuances in language for improved performance.
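
As a rough illustration of the lexicon-based approach listed above, the sketch below sums word polarities from a small hand-written dictionary. `LEXICON`, `NEGATIONS`, and both helper functions are hypothetical names introduced here, and the negation rule (flip the next sentiment word) is a deliberate simplification, not how any particular lexicon tool works.

```python
import re

# Hypothetical toy lexicon; real sentiment lexicons contain thousands of entries.
LEXICON = {"love": 1, "fantastic": 1, "good": 1, "great": 1, "worst": -1, "terrible": -1}
NEGATIONS = {"not", "no", "never"}


def lexicon_score(text):
    """Sum word polarities; a negation word flips the next sentiment word."""
    tokens = re.findall(r"[a-z']+", text.lower())
    score = 0
    negate = False
    for token in tokens:
        if token in NEGATIONS:
            negate = True
            continue
        polarity = LEXICON.get(token, 0)
        if polarity != 0:
            score += -polarity if negate else polarity
            negate = False
    return score


def lexicon_label(text):
    """Map the score to the notebook's binary labels: 1 = positive, 0 = otherwise."""
    return 1 if lexicon_score(text) > 0 else 0


print(lexicon_label("I love this product!"))
print(lexicon_label("Not great, but not terrible either."))
```

A fixed dictionary like this cannot learn from data, which is the gap the machine learning and deep learning approaches in the list are meant to close.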

### Applications:

- Customer Feedback: Automatically analyzing reviews and feedback to gauge customer satisfaction.
- Social Media Monitoring: Tracking public sentiment towards brands, events, or political issues.
- Market Research: Helping companies understand public opinion trends and make data-driven decisions.

Overall, sentiment classification helps in summarizing and quantifying subjective information, making it easier for organizations to understand and react to the emotions behind the text data.

### Step 1: Import libraries and read dataset

In [1]:
# Import required libraries
import torch
import numpy as np
from transformers import DistilBertTokenizer, DistilBertForSequenceClassification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report
from torch.utils.data import DataLoader, TensorDataset

In [4]:
# Load Data (Example dataset)
data = [
    ("I love this product!", 1),
    ("This is the worst experience I've had.", 0),
    ("Not great, but not terrible either.", 0),
    ("Absolutely fantastic!", 1),
    ("I wouldn't recommend this to anyone.", 0),
    ("A good choice to buy.", 1),
    ("It is not great but it has some use.", 0),
    ("I did not like this product", 0)
]
texts, labels = zip(*data)
texts, labels

(('I love this product!',
  "This is the worst experience I've had.",
  'Not great, but not terrible either.',
  'Absolutely fantastic!',
  "I wouldn't recommend this to anyone.",
  'A good choice to buy.',
  'It is not great but it has some use.',
  'I did not like this product'),
 (1, 0, 0, 1, 0, 1, 0, 0))

### Step 2: Split data into train and test
- Split into train (80%) and test (20%) sets. With only 8 examples, scikit-learn rounds the test set up to 2 examples, leaving 6 for training.

In [5]:
# Split data into train and test
train_texts, test_texts, train_labels, test_labels = train_test_split(texts, labels, test_size=0.2, random_state=42)
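
With a dataset this small, a random split can leave one class out of the test set entirely. One possible variation, shown here as a sketch and not as what the notebook does, is to pass `stratify=labels` so that class proportions are preserved as closely as possible. The data is repeated so the block runs on its own.

```python
from sklearn.model_selection import train_test_split

texts = [
    "I love this product!",
    "This is the worst experience I've had.",
    "Not great, but not terrible either.",
    "Absolutely fantastic!",
    "I wouldn't recommend this to anyone.",
    "A good choice to buy.",
    "It is not great but it has some use.",
    "I did not like this product",
]
labels = [1, 0, 0, 1, 0, 1, 0, 0]

# stratify=labels asks scikit-learn to keep the class ratio similar in both splits
train_texts, test_texts, train_labels, test_labels = train_test_split(
    texts, labels, test_size=0.2, random_state=42, stratify=labels
)

print(len(train_texts), len(test_texts))
print(sorted(test_labels))
```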

### Step 3: Load Pretrained Model & Tokenizer
- Use DistilBertTokenizer and DistilBertForSequenceClassification from Hugging Face.

In [6]:
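# Load Pretrained Model & Tokenizer
# Note: only the pretrained encoder (model.distilbert) is used for feature extraction later;
# the sequence-classification head is newly initialized and is never trained in this notebook.
tokenizer = DistilBertTokenizer.from_pretrained("distilbert-base-uncased")
model = DistilBertForSequenceClassification.from_pretrained("distilbert-base-uncased")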

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


### Step 4: Tokenization
- Convert texts into tokenized format using DistilBERT tokenizer.
- Apply truncation (max length = 512 tokens) and pad every text to the longest sequence in the same call.
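
To make the effect of padding and truncation concrete without downloading a tokenizer, here is a minimal pure-Python sketch. `pad_batch` is a hypothetical helper, not part of the Hugging Face API, and its truncation simply cuts the list, whereas the real tokenizer also keeps the closing `[SEP]` token. The token ids are copied from this notebook's own tokenizer output.

```python
PAD_ID = 0  # id of [PAD] in distilbert-base-uncased


def pad_batch(sequences, max_length=512, pad_id=PAD_ID):
    """Truncate to max_length, then pad to the longest sequence in the batch."""
    truncated = [seq[:max_length] for seq in sequences]
    longest = max(len(seq) for seq in truncated)
    input_ids = [seq + [pad_id] * (longest - len(seq)) for seq in truncated]
    # 1 marks a real token, 0 marks padding the model should ignore
    attention_mask = [[1] * len(seq) + [0] * (longest - len(seq)) for seq in truncated]
    return input_ids, attention_mask


batch = [
    [101, 1045, 2293, 2023, 4031, 999, 102],  # "I love this product!"
    [101, 7078, 10392, 999, 102],             # "Absolutely fantastic!"
]
input_ids, attention_mask = pad_batch(batch)
print(input_ids[1])       # [101, 7078, 10392, 999, 102, 0, 0]
print(attention_mask[1])  # [1, 1, 1, 1, 1, 0, 0]
```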

In [10]:
# Tokenization
def tokenize_texts(texts):
    return tokenizer(list(texts), padding=True, truncation=True, max_length=512, return_tensors="pt")

tokenized_train = tokenize_texts(train_texts)
tokenized_test = tokenize_texts(test_texts)

print(tokenized_train, "\n")
print(tokenized_test)

{'input_ids': tensor([[  101,  1045,  2293,  2023,  4031,   999,   102,     0,     0,     0,
             0,     0],
        [  101,  1045,  2106,  2025,  2066,  2023,  4031,   102,     0,     0,
             0,     0],
        [  101,  2025,  2307,  1010,  2021,  2025,  6659,  2593,  1012,   102,
             0,     0],
        [  101,  1045,  2876,  1005,  1056, 16755,  2023,  2000,  3087,  1012,
           102,     0],
        [  101,  7078, 10392,   999,   102,     0,     0,     0,     0,     0,
             0,     0],
        [  101,  2009,  2003,  2025,  2307,  2021,  2009,  2038,  2070,  2224,
          1012,   102]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0],
        [1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0],
        [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0],
        [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0],
        [1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0],
        [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])} 

{'input_ids': tensor([[ 101, 2023, 2003, 1996, 5409, 3325, 1045, 100

### Step 5: Convert to Tensors

In [11]:
# Convert to Tensors
train_labels = torch.tensor(train_labels)
test_labels = torch.tensor(test_labels)

train_dataset = TensorDataset(tokenized_train["input_ids"], tokenized_train["attention_mask"], train_labels)
test_dataset = TensorDataset(tokenized_test["input_ids"], tokenized_test["attention_mask"], test_labels)

train_loader = DataLoader(train_dataset, batch_size=2, shuffle=True)
test_loader = DataLoader(test_dataset, batch_size=2, shuffle=False)

### Step 6: Feature Extraction with DistilBERT
- Pass input through DistilBERT model.
- Extract CLS token embeddings as feature vectors.
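
The slicing behind "extract CLS token embeddings" can be sketched with random arrays standing in for the model output. The arrays below are stand-ins, not real DistilBERT activations; only the shapes match, with 768 being the hidden size of `distilbert-base-uncased`.

```python
import numpy as np

hidden_size = 768  # hidden size of distilbert-base-uncased
rng = np.random.default_rng(0)

# Two stand-in batches shaped like outputs.last_hidden_state: (batch, seq_len, hidden)
batches = [
    rng.normal(size=(2, 12, hidden_size)),
    rng.normal(size=(2, 9, hidden_size)),
]

features = []
for last_hidden_state in batches:
    # Position 0 of every sequence holds the [CLS] token
    cls_embeddings = last_hidden_state[:, 0, :]  # shape: (batch, hidden)
    features.extend(cls_embeddings)              # adds one vector per example

features = np.array(features)
print(features.shape)  # (4, 768)
```

Each example ends up as one fixed-length vector regardless of sequence length, which is what lets a classical model such as logistic regression consume the features.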

In [17]:
# Feature Extraction with DistilBERT
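def extract_features(dataloader, model):
    """Return [CLS] embeddings and labels for every example in the dataloader."""
    model.eval()  # disable dropout so the embeddings are deterministic
    features = []
    labels = []
    with torch.no_grad():  # gradients are not needed for feature extraction
        for batch in dataloader:
            input_ids, attention_mask, batch_labels = batch
            # Run only the DistilBERT encoder, skipping the untrained classification head
            outputs = model.distilbert(input_ids, attention_mask=attention_mask)
            cls_embeddings = outputs.last_hidden_state[:, 0, :].cpu().numpy()  # CLS token embedding
            features.extend(cls_embeddings)
            labels.extend(batch_labels.cpu().numpy())
    return np.array(features), np.array(labels)

# Labels are collected in loader order, so they stay aligned with the shuffled training features
train_features, train_labels = extract_features(train_loader, model)
test_features, test_labels = extract_features(test_loader, model)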

print(train_features, train_labels, "\n")
print(test_features, test_labels)

[[ 0.13890916  0.03303748 -0.00232603 ...  0.06351364  0.26527876
   0.2247639 ]
 [ 0.02925154  0.0810848   0.08291865 ... -0.14410505  0.28628007
   0.18182307]
 [-0.04692893  0.04784037  0.0106302  ... -0.08479393  0.2915197
   0.32401773]
 [-0.13017929 -0.02802864  0.23941529 ... -0.16929476  0.25457275
   0.26665863]
 [-0.19441237 -0.03964168  0.1129404  ... -0.11472832  0.10034972
   0.43304315]
 [-0.07417349 -0.19786097 -0.08614597 ...  0.01941669  0.41275784
   0.19150375]] [0 1 0 1 0 0] 

[[-0.06830289  0.09045129  0.03950415 ... -0.05792443  0.25653446
   0.30703673]
 [-0.11665102 -0.11226921  0.07609144 ... -0.0626655   0.1057962
   0.12285985]] [0 1]


### Step 7: Hyperparameter Tuning

In [18]:
# Hyperparameter Tuning (Example: setting C value for Logistic Regression)
C_values = [0.01, 0.1, 1, 10]
best_C = 1  # Placeholder, can be selected via GridSearchCV
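
The placeholder above mentions `GridSearchCV`. A minimal sketch of that selection follows, under the assumption that enough labeled data is available for cross-validation. The six training examples in this notebook are too few for a meaningful 5-fold search, so synthetic features from `make_classification` stand in for the DistilBERT embeddings.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for (train_features, train_labels); not real embeddings
X, y = make_classification(n_samples=200, n_features=20, random_state=42)

param_grid = {"C": [0.01, 0.1, 1, 10]}  # same candidates as C_values
search = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid,
    cv=5,                # 5-fold cross-validation
    scoring="accuracy",
)
search.fit(X, y)

best_C = search.best_params_["C"]
print(best_C, round(search.best_score_, 3))
```

Smaller `C` values mean stronger regularization, which tends to matter when the feature dimension (768 here) far exceeds the number of training examples.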

### Step 8: Train Logistic Regression

In [19]:
# Train Logistic Regression
clf = LogisticRegression(C=best_C, max_iter=1000)
clf.fit(train_features, train_labels)

### Step 9: Evaluate Model

In [20]:
# Evaluate Model
y_pred = clf.predict(test_features)
accuracy = accuracy_score(test_labels, y_pred)
report = classification_report(test_labels, y_pred)


In [21]:
# Print accuracy and the per-class report
print(f"Accuracy: {accuracy:.4f}")
print(report)

Accuracy: 0.5000
              precision    recall  f1-score   support

           0       0.50      1.00      0.67         1
           1       0.00      0.00      0.00         1

    accuracy                           0.50         2
   macro avg       0.25      0.50      0.33         2
weighted avg       0.25      0.50      0.33         2
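
The zeros for class 1 in the report follow directly from the definitions of precision and recall. The sketch below recomputes them by hand. The predictions are not printed in the notebook, so `y_pred = [0, 0]` is inferred from the report (class 0 has recall 1.00 but precision 0.50 on two test examples), and `precision_recall` is a hypothetical helper, not a scikit-learn function.

```python
y_true = [0, 1]  # test labels shown in the feature-extraction output
y_pred = [0, 0]  # inferred from the report: both examples predicted as class 0


def precision_recall(y_true, y_pred, cls):
    """Compute precision and recall for one class, returning 0.0 when undefined."""
    tp = sum(t == cls and p == cls for t, p in zip(y_true, y_pred))
    fp = sum(t != cls and p == cls for t, p in zip(y_true, y_pred))
    fn = sum(t == cls and p != cls for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


print(precision_recall(y_true, y_pred, 0))  # (0.5, 1.0)
print(precision_recall(y_true, y_pred, 1))  # (0.0, 0.0)
```

Because class 1 is never predicted, its precision has a zero denominator and is reported as 0.00. With a two-example test set, the 50% accuracy says very little about the model either way.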

