<a href="https://colab.research.google.com/github/usshaa/SMBDA/blob/main/Sentiment_Analysis.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>


# Sentiment Analysis with Deep Learning

This notebook provides a comprehensive guide to sentiment analysis using deep learning techniques. We will cover the following topics:

1. Introduction to Sentiment Analysis
2. Pre-Processing the Dataset
3. Word Embeddings
4. Build the Network
5. Train the Model
6. Test the Model
7. Apply to a Single Input


### Requirements:
To run this notebook, install the required libraries (scikit-learn is needed for the train/test split):

In [None]:
!pip install pandas numpy nltk scikit-learn keras tensorflow



### Dataset:
Make sure to provide a dataset named `sentiment_data.csv` with columns `text` and `label` where `label` is binary (0 or 1) indicating negative or positive sentiment.
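
If no dataset is at hand, a tiny file in the expected format can be written by hand. The sketch below uses only the five example rows that appear in this notebook's `data.head()` output; the helper name `write_sample_dataset` and the file name `sentiment_data_sample.csv` are illustrative assumptions, not part of the original notebook.

```python
import csv
import os
import tempfile

# The five example rows displayed in this notebook (text, label).
SAMPLE_ROWS = [
    ("I love this product, it's amazing!", 1),
    ("This is the worst experience I've ever had.", 0),
    ("Absolutely fantastic service, will come again!", 1),
    ("I'm not happy with this purchase.", 0),
    ("Best purchase ever! Totally worth it.", 1),
]

def write_sample_dataset(path):
    """Write a small CSV with the 'text' and 'label' columns the notebook expects."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["text", "label"])
        writer.writerows(SAMPLE_ROWS)

# Hypothetical file name, written to a temporary directory for illustration.
sample_path = os.path.join(tempfile.mkdtemp(), "sentiment_data_sample.csv")
write_sample_dataset(sample_path)

# Read the file back to confirm the layout.
with open(sample_path, newline="", encoding="utf-8") as f:
    rows = list(csv.DictReader(f))
print(len(rows), rows[0])
```

Five rows are far too few to train a useful model; the file only demonstrates the expected column layout.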

The pre-processing step also needs the NLTK stop word list, which can be downloaded with:

In [None]:
import nltk
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

## 1. Introduction to Sentiment Analysis

**Sentiment Analysis** is a Natural Language Processing (NLP) technique used to determine the sentiment or emotional tone behind a body of text. It helps businesses and organizations understand opinions, reviews, and customer feedback.

### Example Use Cases:
- Analyzing product reviews
- Monitoring social media sentiment
- Customer feedback analysis

---

## 2. Pre-Processing the Dataset

Before training a model, it's essential to pre-process the text data. This step involves cleaning the text, removing noise, and preparing it for analysis.

### Steps:
- Load the dataset
- Clean the text (strip punctuation and digits, convert to lowercase)
- Tokenize the text
- Remove stop words
- Split into training and testing sets

In [None]:
# Importing necessary libraries
import pandas as pd
import re
from nltk.corpus import stopwords
from sklearn.model_selection import train_test_split

In [None]:
# Load the dataset
# (Assuming the dataset has two columns: 'text' and 'label')
data = pd.read_csv('sentiment_data.csv')

# Display the first few rows of the dataset
data.head()

| | text | label |
|---|---|---|
| 0 | I love this product, it's amazing! | 1 |
| 1 | This is the worst experience I've ever had. | 0 |
| 2 | Absolutely fantastic service, will come again! | 1 |
| 3 | I'm not happy with this purchase. | 0 |
| 4 | Best purchase ever! Totally worth it. | 1 |


In [None]:
# Cleaning the text
def clean_text(text):
    # Remove punctuation and numbers
    text = re.sub(r'[^a-zA-Z\s]', '', text)
    # Lowercase the text
    text = text.lower()
    return text
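
One side effect of deleting every non-letter character is that words joined by punctuation get fused together (`well-made` becomes `wellmade`). A possible variant, shown here only as a sketch, replaces those characters with a space and then collapses repeated whitespace. The name `clean_text_spaced` is hypothetical and not part of the notebook.

```python
import re

def clean_text(text):
    """Notebook behavior: delete non-letters, then lowercase."""
    return re.sub(r'[^a-zA-Z\s]', '', text).lower()

def clean_text_spaced(text):
    """Variant: replace non-letters with a space so joined words stay separate."""
    text = re.sub(r'[^a-zA-Z\s]', ' ', text)
    # Collapse the runs of whitespace introduced by the substitution
    return re.sub(r'\s+', ' ', text).strip().lower()

sample = "Well-made,great value!"
print(clean_text(sample))         # wellmadegreat value
print(clean_text_spaced(sample))  # well made great value
```

The variant also splits contractions (`it's` becomes `it s`), so which version works better depends on the stop word list and the data.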

In [None]:
# Apply cleaning function to the dataset
data['cleaned_text'] = data['text'].apply(clean_text)
data['cleaned_text']

| | cleaned_text |
|---|---|
| 0 | i love this product its amazing |
| 1 | this is the worst experience ive ever had |
| 2 | absolutely fantastic service will come again |
| 3 | im not happy with this purchase |
| 4 | best purchase ever totally worth it |
| 5 | i regret buying this item |
| 6 | the quality is great im very satisfied |
| 7 | terrible customer service very disappointed |
| 8 | i cant recommend this enough |
| 9 | its okay nothing special |


In [None]:
# Tokenization and removing stop words
stop_words = set(stopwords.words('english'))
data['tokenized'] = data['cleaned_text'].apply(lambda x: [word for word in x.split() if word not in stop_words])
data['tokenized']

| | tokenized |
|---|---|
| 0 | [love, product, amazing] |
| 1 | [worst, experience, ive, ever] |
| 2 | [absolutely, fantastic, service, come] |
| 3 | [im, happy, purchase] |
| 4 | [best, purchase, ever, totally, worth] |
| 5 | [regret, buying, item] |
| 6 | [quality, great, im, satisfied] |
| 7 | [terrible, customer, service, disappointed] |
| 8 | [cant, recommend, enough] |
| 9 | [okay, nothing, special] |
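
Row 3 of the output above has lost the word `not`: the stop word list contains negations, so "I'm not happy" and "I'm happy" produce the same tokens. One possible adjustment, sketched here under assumptions, is to exempt a few negation words from removal. The small `stop_words` set is a stand-in so the example runs without NLTK data, and `NEGATIONS` and `tokenize` are illustrative names.

```python
# Small stand-in stop word list so the example runs without NLTK data;
# in the notebook this would be set(stopwords.words('english')).
stop_words = {"i", "this", "is", "the", "with", "not", "no", "it", "a"}

# Words that flip sentiment and are therefore kept (illustrative choice).
NEGATIONS = {"not", "no", "nor", "never"}

def tokenize(cleaned_text, stop_words, keep=frozenset()):
    """Split on whitespace and drop stop words, except those listed in keep."""
    return [w for w in cleaned_text.split() if w not in stop_words or w in keep]

text = "im not happy with this purchase"
print(tokenize(text, stop_words))                  # ['im', 'happy', 'purchase']
print(tokenize(text, stop_words, keep=NEGATIONS))  # ['im', 'not', 'happy', 'purchase']
```

Whether keeping negations helps depends on the model and data, so it is worth comparing both settings on a validation set.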


In [None]:
# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(data['tokenized'], data['label'], test_size=0.2, random_state=42)

## 3. Word Embeddings

**Word embeddings** represent each word as a dense vector in a continuous vector space. Because the vectors are learned from how words are used, words that appear in similar contexts end up with similar vectors, which lets a model exploit semantic relationships between words.
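
As a rough illustration of what it means for vectors to capture meaning, related words tend to point in similar directions, which is commonly measured with cosine similarity. The three-dimensional vectors below are made up for the example; real embeddings have many more dimensions and are learned from data.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Made-up toy vectors, not real embedding values.
toy_vectors = {
    "great": [0.9, 0.8, 0.1],
    "fantastic": [0.8, 0.9, 0.2],
    "invoice": [0.1, -0.2, 0.9],
}
sim_related = cosine_similarity(toy_vectors["great"], toy_vectors["fantastic"])
sim_unrelated = cosine_similarity(toy_vectors["great"], toy_vectors["invoice"])
print(round(sim_related, 3), round(sim_unrelated, 3))
```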

### Popular Word Embedding Techniques:
- Word2Vec
- GloVe
- FastText

For simplicity, we will use `GloVe` embeddings.

This notebook uses the 100-dimensional GloVe 6B vectors and assumes they have been downloaded and extracted to `glove.6B.100d.txt`.

The techniques listed above, along with several more recent ones, all convert words into numerical vectors that a model can process. In more detail:

1. **Word2Vec**:
   - Developed by Google, Word2Vec is a predictive model that uses a neural network to learn word associations from large corpora of text. It can be trained using two architectures:
     - **Continuous Bag of Words (CBOW)**: Predicts a target word based on its context (surrounding words).
     - **Skip-Gram**: Predicts the surrounding context words given a target word.
   - Produces dense vectors, typically a few hundred dimensions, that capture semantic relationships between words.

2. **GloVe (Global Vectors for Word Representation)**:
   - Developed by Stanford, GloVe uses matrix factorization techniques on the word co-occurrence matrix. It focuses on capturing global statistical information about words in a corpus.
   - The objective is to find word vectors whose dot product approximates the logarithm of how often the two words co-occur.

3. **FastText**:
   - Developed by Facebook, FastText extends Word2Vec by representing each word as a bag of character n-grams. This allows it to create embeddings for out-of-vocabulary words and better capture morphological features.
   - Useful for languages with rich morphology or when handling domain-specific terms.

4. **ELMo (Embeddings from Language Models)**:
   - Developed at the Allen Institute for AI and distributed with its AllenNLP library, ELMo generates word embeddings with a deep bidirectional LSTM (Long Short-Term Memory) language model.
   - Unlike static embeddings, ELMo produces contextualized embeddings, meaning the representation of a word can change based on its context in a sentence.

5. **BERT (Bidirectional Encoder Representations from Transformers)**:
   - Developed by Google, BERT uses a transformer architecture and is trained on masked language modeling and next sentence prediction tasks.
   - BERT generates contextual embeddings, allowing the same word to have different representations depending on its usage in context.

6. **Transformer-Based Models (e.g., GPT, RoBERTa, T5)**:
   - Modern transformer models produce contextual token embeddings in their hidden layers. Stacking many attention layers lets them capture longer-range and more complex dependencies than static embeddings can.

7. **Sentence Embeddings**:
   - Techniques like Universal Sentence Encoder and Sentence-BERT extend word embeddings to entire sentences, enabling applications in tasks such as semantic similarity and text classification.

8. **Custom Word Embeddings**:
   - You can also create custom word embeddings using techniques like Autoencoders or Neural Networks on domain-specific data, which can provide better performance in specialized applications.

Each of these techniques has its strengths and is suitable for different NLP tasks, depending on factors like the size of the corpus, the need for contextuality, and computational resources.
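
To make two of the ideas above concrete, the sketch below builds the training examples that Skip-Gram and CBOW would see for a short sentence, and the boundary-marked character n-grams that FastText uses. The function names are illustrative, and the code shows only how the examples are formed, not how the models are trained.

```python
def skipgram_pairs(tokens, window=2):
    """(target, context) pairs: Skip-Gram predicts each context word from the target."""
    pairs = []
    for i, target in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        pairs.extend((target, tokens[j]) for j in range(lo, hi) if j != i)
    return pairs

def cbow_examples(tokens, window=2):
    """(context_words, target) examples: CBOW predicts the target from its context."""
    examples = []
    for i, target in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        context = [tokens[j] for j in range(lo, hi) if j != i]
        examples.append((context, target))
    return examples

def char_ngrams(word, n=3):
    """Character n-grams of a word wrapped in '<' and '>' boundary markers."""
    wrapped = f"<{word}>"
    return [wrapped[i:i + n] for i in range(len(wrapped) - n + 1)]

tokens = ["best", "purchase", "ever"]
print(skipgram_pairs(tokens, window=1))
# [('best', 'purchase'), ('purchase', 'best'), ('purchase', 'ever'), ('ever', 'purchase')]
print(cbow_examples(tokens, window=1)[1])  # (['best', 'ever'], 'purchase')
print(char_ngrams("where"))                # ['<wh', 'whe', 'her', 'ere', 're>']
```

Because an unseen word still shares n-grams with known words, a FastText-style model can assemble a vector for it from those pieces.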

In [None]:
import numpy as np

# Load GloVe embeddings
def load_glove_embeddings(file):
    # Each line holds a word followed by its space-separated vector components
    embeddings = {}
    with open(file, 'r', encoding='utf-8') as f:
        for line in f:
            values = line.split()
            word = values[0]
            vector = np.array(values[1:], dtype='float32')
            embeddings[word] = vector
    return embeddings

In [None]:
# Download and unzip the GloVe embeddings (uncomment if not already downloaded)
# !wget http://nlp.stanford.edu/data/glove.6B.zip
# !unzip glove.6B.zip

# Load the embeddings, ensuring the path is correct
glove_embeddings = load_glove_embeddings('glove.6B.100d.txt')
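
The GloVe text format is simply one word per line followed by its vector components. The sketch below parses a tiny made-up file in that format and adds an optional dimension check that skips malformed lines. The function name `load_glove_embeddings_checked` and its `expected_dim` parameter are hypothetical additions, and the numbers are not real GloVe values.

```python
import os
import tempfile
import numpy as np

def load_glove_embeddings_checked(file, expected_dim=None):
    """Parse 'word v1 v2 ... vn' lines into a {word: vector} dictionary."""
    embeddings = {}
    with open(file, 'r', encoding='utf-8') as f:
        for line in f:
            values = line.rstrip().split(' ')
            vector = np.asarray(values[1:], dtype='float32')
            # Skip lines whose vector length does not match the expected size
            if expected_dim is not None and vector.shape[0] != expected_dim:
                continue
            embeddings[values[0]] = vector
    return embeddings

# Made-up 3-dimensional vectors written in the GloVe text format.
toy_path = os.path.join(tempfile.mkdtemp(), "toy_glove.3d.txt")
with open(toy_path, "w", encoding="utf-8") as f:
    f.write("love 0.1 0.2 0.3\n")
    f.write("worst -0.4 0.0 0.5\n")
    f.write("broken 0.1\n")  # malformed line with the wrong dimension

toy_embeddings = load_glove_embeddings_checked(toy_path, expected_dim=3)
print(sorted(toy_embeddings), toy_embeddings["love"].shape)
```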

In [None]:
# Create an embedding matrix for our vocabulary
def create_embedding_matrix(vocabulary, glove_embeddings, embedding_dim=100):
    embedding_matrix = np.zeros((len(vocabulary), embedding_dim))
    for i, word in enumerate(vocabulary):
        embedding_vector = glove_embeddings.get(word)
        if embedding_vector is not None:
            embedding_matrix[i] = embedding_vector
    return embedding_matrix

In [None]:
# Create a vocabulary from the training set
# (sorted so that every word keeps the same index from run to run)
vocabulary = sorted(set(word for words in X_train for word in words))
embedding_matrix = create_embedding_matrix(vocabulary, glove_embeddings)
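
Padding is normally done with the value 0, so a common arrangement reserves index 0 for padding and numbers real words from 1. The sketch below shows that layout as an alternative, not as what this notebook does. It also reports how many vocabulary words were found in the embeddings. The function names and the two-dimensional toy vectors are assumptions made for illustration.

```python
import numpy as np

def build_word_index(tokenized_texts):
    """Map each word to an integer starting at 1; index 0 is left for padding."""
    vocab = sorted({word for words in tokenized_texts for word in words})
    return {word: i for i, word in enumerate(vocab, start=1)}

def build_embedding_matrix(word_index, embeddings, embedding_dim):
    """Row i holds the vector of word i; row 0 and unknown words stay zero."""
    matrix = np.zeros((len(word_index) + 1, embedding_dim), dtype='float32')
    found = 0
    for word, i in word_index.items():
        vector = embeddings.get(word)
        if vector is not None:
            matrix[i] = vector
            found += 1
    coverage = found / len(word_index) if word_index else 0.0
    return matrix, coverage

# Toy data: made-up 2-dimensional vectors standing in for GloVe.
toy_embeddings = {"love": np.array([0.5, 0.1]), "product": np.array([0.2, 0.3])}
texts = [["love", "product"], ["ive", "love"]]

word_index = build_word_index(texts)
matrix, coverage = build_embedding_matrix(word_index, toy_embeddings, embedding_dim=2)
print(word_index)          # {'ive': 1, 'love': 2, 'product': 3}
print(matrix.shape)        # (4, 2)
print(round(coverage, 2))  # 0.67
```

With this layout the `Embedding` layer's `input_dim` would be `len(word_index) + 1`. A low coverage value is a hint that cleaning produced tokens such as `ive` that the pretrained vocabulary does not contain.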

## 4. Build the Network

Now we build a small LSTM network with Keras. The embedding layer is initialized with the GloVe matrix and frozen (`trainable=False`), so only the LSTM and output layers are trained.

In [None]:
# Importing necessary libraries
from keras.models import Sequential
from keras.layers import Embedding, LSTM, Dense, SpatialDropout1D

In [None]:
# Parameters
embedding_dim = 100
max_length = 100  # Maximum length of input sequences

In [None]:
# Building the model
model = Sequential()
model.add(Embedding(input_dim=len(vocabulary), output_dim=embedding_dim, weights=[embedding_matrix], input_length=max_length, trainable=False))
model.add(SpatialDropout1D(0.2))
model.add(LSTM(100, dropout=0.2, recurrent_dropout=0.2))
model.add(Dense(1, activation='sigmoid'))  # For binary classification

model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
model.summary()
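
The parameter counts reported by `model.summary()` can be cross-checked by hand. The sketch below uses the standard count for an LSTM layer with biases (four gates, each with input weights, recurrent weights, and a bias vector). The value of `vocab_size` is made up, since the real one depends on the dataset.

```python
def embedding_params(vocab_size, embedding_dim):
    """One embedding_dim-sized row per vocabulary entry."""
    return vocab_size * embedding_dim

def lstm_params(input_dim, units):
    """4 gates x (input weights + recurrent weights + bias)."""
    return 4 * (input_dim * units + units * units + units)

def dense_params(input_dim, units):
    """Weight matrix plus one bias per unit."""
    return input_dim * units + units

vocab_size = 30  # hypothetical; use len(vocabulary) in the notebook
print(embedding_params(vocab_size, 100))  # 3000 (frozen, so non-trainable)
print(lstm_params(100, 100))              # 80400
print(dense_params(100, 1))               # 101
```

The `SpatialDropout1D` layer adds no parameters, so the trainable total for this architecture is the LSTM count plus the dense count.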

## 5. Train the Model

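We train the model for five epochs on the training data, using the held-out test split as validation data so that the loss on unseen examples is reported after each epoch.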

In [None]:
# Pad sequences to ensure uniform input size
from keras.preprocessing.sequence import pad_sequences

# Map each word to its row in the embedding matrix (a dict lookup is much
# faster than calling list.index for every word)
vocabulary_list = list(vocabulary)
word_to_index = {word: i for i, word in enumerate(vocabulary_list)}

def to_indices(tokens):
    # Words not seen in the training vocabulary are skipped
    return [word_to_index[word] for word in tokens if word in word_to_index]

# Note: pad_sequences pads with 0, which is also the index of the first
# vocabulary word, so the model cannot tell padding apart from that word.
X_train_padded = pad_sequences(X_train.apply(to_indices), maxlen=max_length, padding='post')
X_test_padded = pad_sequences(X_test.apply(to_indices), maxlen=max_length, padding='post')
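
For readers unfamiliar with `pad_sequences`, the pure-Python sketch below mimics what the call does with `padding='post'` and the default `truncating='pre'`: short sequences get zeros appended, and long ones keep only their last `maxlen` items. The helper `pad_post` is illustrative and is not a replacement for the Keras function.

```python
def pad_post(sequences, maxlen, value=0):
    """Sketch of pad_sequences(..., padding='post') with default truncation."""
    padded = []
    for seq in sequences:
        seq = list(seq)[-maxlen:]  # truncating='pre' keeps the last maxlen items
        padded.append(seq + [value] * (maxlen - len(seq)))  # append padding at the end
    return padded

print(pad_post([[5, 2, 9], [7], [1, 2, 3, 4, 5, 6]], maxlen=4))
# [[5, 2, 9, 0], [7, 0, 0, 0], [3, 4, 5, 6]]
```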

In [None]:
# Train the model
history = model.fit(X_train_padded, y_train, epochs=5, batch_size=64, validation_data=(X_test_padded, y_test), verbose=2)

Epoch 1/5
1/1 - 0s - 245ms/step - accuracy: 0.5000 - loss: 0.7452 - val_accuracy: 0.5000 - val_loss: 0.6937
Epoch 2/5
1/1 - 0s - 213ms/step - accuracy: 0.5000 - loss: 0.7383 - val_accuracy: 0.5000 - val_loss: 0.6951
Epoch 3/5
1/1 - 0s - 166ms/step - accuracy: 0.3750 - loss: 0.8099 - val_accuracy: 0.5000 - val_loss: 0.6964
Epoch 4/5
1/1 - 0s - 312ms/step - accuracy: 0.5000 - loss: 0.6801 - val_accuracy: 0.5000 - val_loss: 0.6958
Epoch 5/5
1/1 - 0s - 162ms/step - accuracy: 0.3750 - loss: 0.6600 - val_accuracy: 0.5000 - val_loss: 0.6959
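
The validation loss in the log above stays near 0.693, which is ln 2: the binary cross-entropy of a model that outputs a probability of 0.5 for every example. The sketch below computes the loss by hand to show where that number comes from. The probabilities are made up for illustration.

```python
import math

def binary_cross_entropy(y_true, y_prob, eps=1e-7):
    """Mean binary cross-entropy, clipping probabilities away from 0 and 1."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1 - eps)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)

print(round(binary_cross_entropy([1, 0], [0.5, 0.5]), 4))  # 0.6931
print(round(binary_cross_entropy([1, 0], [0.9, 0.1]), 4))  # 0.1054
```

A loss that does not move away from 0.693 is therefore a sign that the model has not learned to separate the two classes.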


## 6. Test the Model

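We evaluate the model on the test data to check its performance. With the ten-row example dataset shown earlier, the test split holds only two reviews, so an accuracy of 0.5 is no better than chance; a meaningful evaluation needs a much larger dataset.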

In [None]:
# Evaluate the model
loss, accuracy = model.evaluate(X_test_padded, y_test, verbose=2)
print("Test Accuracy:", accuracy)

1/1 - 0s - 64ms/step - accuracy: 0.5000 - loss: 0.6959
Test Accuracy: 0.5
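
Accuracy alone does not show which kind of mistake a model makes. A small sketch that counts true and false positives and negatives from predicted probabilities follows. The labels and probabilities are made up; in the notebook the probabilities would come from `model.predict(X_test_padded)`.

```python
def confusion_counts(y_true, y_prob, threshold=0.5):
    """Count true/false positives and negatives for binary labels."""
    counts = {"tp": 0, "fp": 0, "tn": 0, "fn": 0}
    for y, p in zip(y_true, y_prob):
        pred = 1 if p > threshold else 0
        if pred == 1:
            counts["tp" if y == 1 else "fp"] += 1
        else:
            counts["tn" if y == 0 else "fn"] += 1
    return counts

# Made-up labels and predicted probabilities, for illustration only.
y_true = [1, 0, 1, 0]
y_prob = [0.8, 0.6, 0.4, 0.2]
counts = confusion_counts(y_true, y_prob)
accuracy = (counts["tp"] + counts["tn"]) / len(y_true)
print(counts, accuracy)  # {'tp': 1, 'fp': 1, 'tn': 1, 'fn': 1} 0.5
```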


## 7. Apply to a Single Input

Finally, we will create a function to apply the trained model to a single input for prediction.

In [None]:
def predict_sentiment(text):
    # Clean the text
    cleaned_text = clean_text(text)
    # Tokenize and remove stop words
    tokenized_text = [word for word in cleaned_text.split() if word not in stop_words]
    # Convert words to indices, skipping words outside the training vocabulary
    word_to_index = {word: i for i, word in enumerate(vocabulary)}
    indices = [word_to_index[word] for word in tokenized_text if word in word_to_index]
    padded_text = pad_sequences([indices], maxlen=max_length, padding='post')

    # Make prediction: the sigmoid output is the predicted probability of the positive class
    prediction = model.predict(padded_text)
    return "Positive" if prediction[0][0] > 0.5 else "Negative"

# Example prediction
print(predict_sentiment("I love this product!"))

1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 220ms/step
Positive


## Conclusion

In this notebook, we covered the basics of sentiment analysis using deep learning techniques. We went through data pre-processing, word embeddings, building a neural network, training the model, evaluating its performance, and applying it to a single input. This is a foundational approach to sentiment analysis and can be extended and modified for more complex tasks.