<a href="https://colab.research.google.com/github/yashchougule19/DELTA_Element3/blob/yash/Twitter_Sentiment_Analysis_with_Augmented_Data_.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Handling class imbalance using DATA AUGMENTATION

Importing the Dependencies

In [None]:
import pandas as pd
import numpy as np
import re  # regular expressions, used for pattern matching and text cleaning
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
from sklearn.feature_extraction.text import TfidfVectorizer  # converts text into numerical features for the ML model
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, accuracy_score, classification_report
from sklearn.naive_bayes import GaussianNB
from sklearn.utils import shuffle
from tqdm.auto import tqdm

In [None]:
import nltk
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [None]:
#print(stopwords.words('english'))

Data Processing

In [None]:
train_data = pd.read_parquet("/content/btc_tweets_train.parquet.gzip")
test_data = pd.read_parquet("/content/btc_tweets_test.parquet.gzip")

In [None]:
train_data.shape, test_data.shape

((1500, 5), (500, 5))

In [None]:
train_data.head()

tweet ID,hashtags,content,username,user_displayname,sentiment
1641579121972236290,"[Bitcoin, Bitcoin, BTC, Bitcoin, BTC, SHIB, HO...","$Bitcoin TO $100,000 SOONER THAN YOU THINK‼️💯🙏...",BezosCrypto,SHIB Bezos,True
1641579176171016194,"[Bitcoin, bitcoinordinals, crypto]",Alright I have my rares. Who else is grabbing ...,spartantc81,SpartanTC,True
1641579486071390208,"[BTC, SHIB, HOGE, SAITAMA, BNB, DOGE, ETH, Bab...","Bitcoin (BTC) Targets Over $100,000 as This Im...",BezosCrypto,SHIB Bezos,True
1641579537103302656,[BTC],📢 Xverse Web-based pool is live:\n\n•Update @x...,godfred_xcuz,Algorithm.btc,True
1641579588399804418,[Bitcoin],"Yesterday, a Bitcoin projection was displayed ...",goddess81oo,she is lucky,True


In [None]:
# Check for missing values
train_data.isnull().sum()

Unnamed: 0,0
hashtags,0
content,0
username,0
user_displayname,0
sentiment,0


In [None]:
# Check the distribution of the target column. If it is imbalanced, we will need to oversample the minority class or undersample the majority class
# True -> positive tweet
# False -> negative tweet
train_data['sentiment'].value_counts()

sentiment,count
True,1220
False,280
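
To make the imbalance concrete, the counts above can be turned into a ratio and a target number of extra negative tweets. This is a small sketch that uses only the figures printed above; the variable names are illustrative.

```python
# Class counts taken from the value_counts() output above
n_positive = 1220
n_negative = 280

# How many positive tweets exist for every negative tweet
imbalance_ratio = n_positive / n_negative

# Extra negative tweets needed for the two classes to be the same size
needed_for_balance = n_positive - n_negative

print(f"Imbalance ratio (positive:negative): {imbalance_ratio:.2f}:1")
print(f"Negative samples needed to balance: {needed_for_balance}")
```

The ratio is roughly 4.36:1, and 940 additional negative tweets would equalize the classes.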


Convert the target from True to 1 and from False to 0

In [None]:
train_data['sentiment'] = train_data['sentiment'].astype(int)
test_data['sentiment'] = test_data['sentiment'].astype(int)

**Data Augmentation**

In [None]:
#!pip install nlpaug



In [None]:
import nlpaug.augmenter.word.context_word_embs as aug

In [None]:
# sample_text = train_data['content'].iloc[100]
# sample_text

In [None]:
augmenter = aug.ContextualWordEmbsAug(model_path='bert-base-uncased', action="insert")



In [None]:
# augmented_sample_text = augmenter.augment(sample_text)

In [None]:
# augmented_sample_text

In [None]:
# for i in range(5):
#     print(augmenter.augment(sample_text))

In [None]:
def augmentMyData(df, augmenter, repetitions=1, samples=1000):
    augmented_texts = []
    # select only the minority class samples (sentiment == 0)
    minority_df = df[df['sentiment'] == 0].reset_index(drop=True)  # reset to a clean 0..n-1 index
    # draw `samples` random minority rows (with replacement)
    for i in tqdm(np.random.randint(0, len(minority_df), samples)):
        # generate `repetitions` augmented texts for each sampled row
        for _ in range(repetitions):
            augmented_text = augmenter.augment(minority_df['content'].iloc[i])
            # augment() may return a list of strings; store a plain string instead
            if isinstance(augmented_text, list):
                augmented_text = ' '.join(augmented_text)
            augmented_texts.append(augmented_text)

    # every augmented tweet belongs to the minority class
    aug_df = pd.DataFrame({'sentiment': 0, 'content': augmented_texts})
    # append the augmented rows and shuffle; the remaining columns are NaN for these rows
    df = shuffle(pd.concat([df, aug_df]).reset_index(drop=True))
    return df
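
The sampling logic of the function above can be tried without downloading BERT. The sketch below is an assumption-laden stand-in: `toy_insert_augment` and `oversample_minority` are hypothetical helpers, and the random filler-word insertion only imitates the "insert" action of the contextual augmenter; it is not how `ContextualWordEmbsAug` chooses words.

```python
import random

def toy_insert_augment(text, filler_words, rng):
    """Hypothetical stand-in for a contextual augmenter: inserts one filler word."""
    tokens = text.split()
    position = rng.randint(0, len(tokens))  # insertion index, may be the very end
    tokens.insert(position, rng.choice(filler_words))
    return " ".join(tokens)

def oversample_minority(texts, labels, minority_label, samples, rng):
    """Draw minority texts with replacement and return augmented copies."""
    minority_texts = [t for t, y in zip(texts, labels) if y == minority_label]
    fillers = ["really", "again", "today"]  # illustrative filler vocabulary
    return [
        toy_insert_augment(rng.choice(minority_texts), fillers, rng)
        for _ in range(samples)
    ]

rng = random.Random(0)  # fixed seed so the sketch is reproducible
texts = ["btc is crashing", "to the moon", "sold everything", "buy the dip"]
labels = [0, 1, 0, 1]

new_texts = oversample_minority(texts, labels, minority_label=0, samples=5, rng=rng)
augmented_texts = texts + new_texts
augmented_labels = labels + [0] * len(new_texts)

print(len(augmented_texts), augmented_labels.count(0), augmented_labels.count(1))
```

Because rows are drawn with replacement, the same minority tweet can be augmented several times, which is also true of the notebook's `np.random.randint` sampling.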

In [None]:
aug_train_data = augmentMyData(train_data, augmenter, samples=1000)

  0%|          | 0/1000 [00:00<?, ?it/s]

In [None]:
aug_train_data.to_csv('aug_train_data.csv', index=False)

In [None]:
# temp = aug_train_data[aug_train_data['sentiment'] == 0]
# temp.head()

Unnamed: 0,hashtags,content,username,user_displayname,sentiment
1780,,[they transferred 9k # btc in one txn ( as 31f...,,,0
803,"[litecoin, bitcoin, ltc]",$kas @KaspaCurrency will eventually replace $l...,plzsats,Captain Sats 𐤊,0
2379,,[maybe if i sell you something fast and accept...,,,0
1709,,"[this is not black mirror, this is their real ...",,,0
1631,,"[when you are in the same markets, tradfi or #...",,,0


**Stemming:**
the process of reducing a word to its root form (its stem), e.g. "chinese" becomes "chines"

In [None]:
port_stem = PorterStemmer()

In [None]:
english_stopwords = set(stopwords.words('english'))  # build the set once instead of reloading the list for every word

def stemming(content):
    stemmed_content = re.sub('[^a-zA-Z]', ' ', content)  # keep letters only; here content is the tweet text
    stemmed_content = stemmed_content.lower()
    stemmed_content = stemmed_content.split()  # split the text into a list of words
    stemmed_content = [port_stem.stem(word) for word in stemmed_content if word not in english_stopwords]
    stemmed_content = ' '.join(stemmed_content)  # recombine the stemmed words into one string
    return stemmed_content
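
To see what the cleaning steps do before stemming is applied, here is a dependency-free sketch of the first three steps on a simplified version of the first training tweet. The tiny `toy_stopwords` set is an assumption for illustration; the notebook uses NLTK's full English list, and the Porter stemming step is left out here.

```python
import re

# Small illustrative stopword set (assumption); NLTK's list is much longer
toy_stopwords = {"to", "than", "you", "the", "is"}

tweet = "$Bitcoin TO $100,000 SOONER THAN YOU THINK!!"

letters_only = re.sub(r"[^a-zA-Z]", " ", tweet)  # digits and symbols become spaces
tokens = letters_only.lower().split()            # lowercase, then split on whitespace
kept = [word for word in tokens if word not in toy_stopwords]

print(tokens)
print(kept)
```

Note that the price ("$100,000") disappears entirely, since every non-letter character is replaced by a space.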

In [None]:
aug_train_data.head(20)

Unnamed: 0,hashtags,content,username,user_displayname,sentiment
39,"[OPTIMUS, FLOKI, DOGE, SHIBARUIM, SHIB, PAW, A...",Crypto market is about to go nuts. This recent...,Khassan03,Khaled hassan,1
878,"[Bitcoin, masnews, economy, developer, BTC, ne...",MAS Network Breaking! News! \nfull video here ...,masbtc21,The MAS Network,1
489,"[Crypto, PiNetwork, ETH, RABBIT2023, VOLT, LTC...",Which of these #Crypto project are you glad th...,SAMHONBD,Saddam Hossain,1
1780,,they transferred 9k # btc in one txn ( as 31f1...,,,0
499,"[Sologenic, Solo, Coreum, Tokenized, BTC, Solo...",#Sologenic #Solo #Coreum Trillions #Tokenized ...,soloscrooge,$OLO $CROOGE,1
803,"[litecoin, bitcoin, ltc]",$kas @KaspaCurrency will eventually replace $l...,plzsats,Captain Sats 𐤊,0
631,"[meme, cronos, Crofam, cro, FFTB, Bnb, Avax, BTC]",Let build the number 1 #meme community on #cro...,michael000Best1,Mike Trollcoin Ambassador,1
2379,,maybe if i sell you something fast and accept ...,,,0
555,"[Bitcoin, SP500, NASDAQ, DXY]","As the Month is about to close, here's a remin...",Washigorira,Titan of Crypto,1
363,"[MATIC, Y00ts, Polygon, Crypto, btc, ETH, 꽃처럼_...","""🚨 Y00ts has officially stepped into the Polyg...",YogiTugi,Tugi 👑,1


In [None]:
type(aug_train_data['content'].iloc[3])

str

In [None]:
# Convert lists to strings in the 'content' column (augmented rows may hold a list instead of a string)
aug_train_data['content'] = aug_train_data['content'].apply(lambda x: ' '.join(x) if isinstance(x, list) else x)

In [None]:
aug_train_data['stemmed content'] = aug_train_data['content'].apply(stemming)

In [None]:
test_data['stemmed content'] = test_data['content'].apply(stemming)

In [None]:
test_data.head()

tweet ID,hashtags,content,username,user_displayname,sentiment,stemmed content
1641861708246552576,"[crypto, btc]",#crypto $crypto #btc \nI am Chinese crypto alp...,huahuayjy,花花研究院 | Crypto Alpha🇨🇳,1,crypto crypto btc chines crypto alpha although...
1641861783898972167,"[Bitcoin, Bitcoin]",#Bitcoin would have to fall another 80% to rea...,luke_broyles,Luke Broyles,0,bitcoin would fall anoth reach low year ago ra...
1641862152532418562,"[Giveaway, BTC, SolanaGiveaways, Giveaway, Air...",#Giveaway $1000 Matic in 3Days\n\n🏆To win\n1️⃣...,cryptomarsdo,Crypto Mars,1,giveaway matic day win follow matic like amp r...
1641862338369183753,"[EOS, USDT, BTC, crypto, Bitcoin, etherium, Bi...",Up or Down?\n\n!!! $EOS #EOS !!!\n\nVS\n\n$USD...,andreyukrnet,Andrey Ukraine,1,eo eo vs usdt usdt btc btc crypto bitcoin ethe...
1641862430434131968,"[BTC, ETH, BSC, GroveToken]",Mid Day Mix-up is LIVE! Never know who might s...,JustAman04,Justin Anderson,1,mid day mix live never know might stop btc eth...


In [None]:
# Separating the features and the labels
X_train = aug_train_data['stemmed content'].values
y_train = aug_train_data['sentiment'].values

X_test = test_data['stemmed content'].values
y_test = test_data['sentiment'].values

The dataset already comes as separate train and test files, so no further split is needed; the original `train_test_split` call is kept below for reference only

In [None]:
#X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, stratify=y, random_state = 19)  # stratify=y keeps the same proportion of 1s and 0s in the train and test sets

In [None]:
# converting the textual data to numerical data

vectorizer = TfidfVectorizer() # weights each word by how often it occurs in a tweet and how rare it is across all tweets

X_train = vectorizer.fit_transform(X_train)
X_test = vectorizer.transform(X_test)
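
The weighting that `TfidfVectorizer` applies can be reproduced by hand on a toy corpus. This is a minimal sketch assuming scikit-learn's default settings (raw term counts, smoothed IDF of the form ln((1 + n) / (1 + df)) + 1, and L2 normalization of each row); the three documents are made up for illustration.

```python
import math
from collections import Counter

docs = ["bitcoin moon", "bitcoin crash", "bitcoin bitcoin moon"]  # toy corpus
tokenized = [doc.split() for doc in docs]
vocabulary = sorted({word for tokens in tokenized for word in tokens})

n_docs = len(docs)
# Document frequency: number of documents that contain each word
doc_freq = {w: sum(w in tokens for tokens in tokenized) for w in vocabulary}
# Smoothed inverse document frequency (assumed default behaviour)
idf = {w: math.log((1 + n_docs) / (1 + doc_freq[w])) + 1 for w in vocabulary}

def tfidf_row(tokens):
    """Term count times IDF for every vocabulary word, scaled to unit length."""
    counts = Counter(tokens)
    raw = [counts[w] * idf[w] for w in vocabulary]
    norm = math.sqrt(sum(v * v for v in raw))
    return [v / norm for v in raw]

for doc, tokens in zip(docs, tokenized):
    print(doc, [round(v, 3) for v in tfidf_row(tokens)])
```

A word that appears in every document ("bitcoin") gets the lowest IDF, while a word confined to one document ("crash") gets the highest, which is the sense in which the vectorizer assigns importance to individual words.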

In [None]:
#print(X_train)

Training the Machine Learning Model (Gaussian Naive Bayes)


In [None]:
model = GaussianNB()

In [None]:
model.fit(X_train.toarray(), y_train)

Model Evaluation

In [None]:
# accuracy score on the training data
X_train_prediction = model.predict(X_train.toarray())

In [None]:
training_data_accuracy = accuracy_score(y_train, X_train_prediction)
print(f'Accuracy score on training data : {training_data_accuracy}')

Accuracy score on training data : 0.972


In [None]:
# accuracy score on the TEST data
X_test_prediction = model.predict(X_test.toarray())

In [None]:
test_data_accuracy = accuracy_score(y_test, X_test_prediction)
print(f'Accuracy score on test data : {test_data_accuracy}')

Accuracy score on test data : 0.704


In [None]:
from sklearn.metrics import f1_score
test_data_f1 = f1_score(y_test, X_test_prediction, average='weighted')
print(f'F1 score on test data : {test_data_f1}')

F1 score on test data : 0.8467573629598575


In [None]:
print(confusion_matrix(y_test, X_test_prediction))
print(accuracy_score(y_test, X_test_prediction))
print(classification_report(y_test, X_test_prediction))


[[ 47  49]
 [ 99 305]]
0.704
              precision    recall  f1-score   support

           0       0.32      0.49      0.39        96
           1       0.86      0.75      0.80       404

    accuracy                           0.70       500
   macro avg       0.59      0.62      0.60       500
weighted avg       0.76      0.70      0.72       500
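
The figures in the report can be re-derived from the confusion matrix alone, which helps when reading the per-class rows. The sketch below uses only the four counts printed above; the helper name `per_class_metrics` is illustrative.

```python
# Confusion matrix printed above: rows = true class, columns = predicted class
confusion = [[47, 49],
             [99, 305]]

def per_class_metrics(matrix, cls):
    """Precision, recall, F1 and support for one class."""
    true_pos = matrix[cls][cls]
    predicted = sum(row[cls] for row in matrix)  # column sum: predicted as cls
    actual = sum(matrix[cls])                    # row sum: truly cls (support)
    precision = true_pos / predicted
    recall = true_pos / actual
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1, actual

total = sum(sum(row) for row in confusion)
accuracy = sum(confusion[i][i] for i in range(2)) / total

weighted_f1 = 0.0
for cls in (0, 1):
    precision, recall, f1, support = per_class_metrics(confusion, cls)
    weighted_f1 += f1 * support / total  # weight each class F1 by its support
    print(f"class {cls}: precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")

print(f"accuracy={accuracy:.3f} weighted_f1={weighted_f1:.2f}")
```

For reference, a classifier that always predicts class 1 would reach 404/500 = 0.808 accuracy on this test set, so the recall for class 0 is the more informative number when judging how well the imbalance was handled.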



Saving the model to be reused on new data

In [None]:
import pickle

filename = 'trained_model.sav'
with open(filename, 'wb') as f:  # 'wb' = write in binary mode; the file is closed automatically
    pickle.dump(model, f)

Using the saved model for future predictions

In [None]:
# loading the saved model
with open('/content/trained_model.sav', 'rb') as f:  # 'rb' = read in binary mode
    loaded_model = pickle.load(f)

In [None]:
# Simulating a prediction with the saved model
index = 100

X_new = X_test[index].toarray()  # GaussianNB expects a dense array, as in the training and evaluation cells
#print(X_new)
print(y_test[index])

prediction = loaded_model.predict(X_new)
print(prediction)

if prediction[0] == 0:
    print('Negative Tweet')
else:
    print('Positive Tweet')

1
[0]
Negative Tweet


SMOTE (Synthetic Minority Over-sampling Technique) is typically designed for numerical data and not directly applicable to text data in its raw form. However, there are ways to adapt SMOTE to text-based tasks, primarily by using it on vectorized or embedded representations of text. Here's how you can handle class imbalance in text data:

### 1. **Vectorizing Text Data for SMOTE:**
   - **Before Applying SMOTE:** You need to convert your text data into a numerical form. Common methods include:
     - **TF-IDF Vectorization:** Converts text into a matrix of TF-IDF features.
     - **Word Embeddings:** Use pre-trained embeddings (e.g., Word2Vec, GloVe) or transform text into embeddings using models like BERT.
   - **Applying SMOTE:** Once the text is vectorized, SMOTE can be applied to the resulting numerical vectors.

### Example of Using SMOTE with TF-IDF:

```python
from imblearn.over_sampling import SMOTE
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
import pandas as pd

def handle_class_imbalance_with_smote(df, content_column, sentiment_column):
    """
    Handles class imbalance using SMOTE on TF-IDF vectorized text data.

    Parameters:
    - df: pd.DataFrame
        The DataFrame containing the dataset.
    - content_column: str
        The name of the column containing the text content to analyze.
    - sentiment_column: str
        The name of the column containing the sentiment labels.

    Returns:
    - pd.DataFrame
        The DataFrame after SMOTE resampling with balanced classes.
    """
    # Vectorize the text content using TF-IDF.
    # Pass only the training split as df so that no test rows influence the vocabulary or SMOTE.
    tfidf = TfidfVectorizer(stop_words='english')
    X = tfidf.fit_transform(df[content_column])
    
    y = df[sentiment_column]
    
    # Apply SMOTE to the vectorized text
    smote = SMOTE(random_state=42)
    X_resampled, y_resampled = smote.fit_resample(X, y)
    
    # Convert the resampled data back to a DataFrame
    df_resampled = pd.DataFrame(X_resampled.toarray(), columns=tfidf.get_feature_names_out())
    df_resampled[sentiment_column] = y_resampled

    return df_resampled
```

### 2. **Limitations of SMOTE on Text:**
   - **Dimensionality:** TF-IDF and embedding representations are high-dimensional, and TF-IDF vectors are also very sparse. SMOTE relies on nearest neighbors, and distances become less informative in such spaces, so the synthetic samples can be poor and the model may overfit to them.
   - **Interpretability:** Applying SMOTE on TF-IDF vectors can result in synthetic samples that may not correspond to any meaningful or interpretable text.
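
The interpretability issue follows directly from how SMOTE creates a sample: it picks a point on the line segment between a minority sample and one of its neighbors. The sketch below shows that single interpolation step in plain Python; the two vectors and the four-word vocabulary are made-up values, and the neighbor search is skipped.

```python
import random

vocabulary = ["bitcoin", "crash", "scam", "sell"]

# Two minority-class TF-IDF-like vectors (illustrative values, not from the dataset)
tweet_a = [0.7, 0.7, 0.0, 0.0]  # roughly "bitcoin crash"
tweet_b = [0.0, 0.0, 0.6, 0.8]  # roughly "scam sell"

def smote_interpolate(sample, neighbor, rng):
    """Return a point on the line segment between a sample and its neighbor."""
    gap = rng.random()  # interpolation factor in [0, 1)
    return [s + gap * (n - s) for s, n in zip(sample, neighbor)]

rng = random.Random(42)  # fixed seed for reproducibility
synthetic = smote_interpolate(tweet_a, tweet_b, rng)

for word, weight in zip(vocabulary, synthetic):
    print(f"{word}: {weight:.3f}")
```

The synthetic vector carries some weight on all four words, a blend that corresponds to neither original tweet and cannot be mapped back to a readable sentence.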

### 3. **Alternatives to SMOTE for Text Data:**
   - **Data Augmentation:** You can use techniques like paraphrasing, back-translation, or synonym replacement to generate additional samples for the minority class.
   - **Undersampling the Majority Class:** Instead of oversampling the minority class, you can undersample the majority class to balance the dataset.
   - **Class Weighting:** When training a model, you can assign higher weights to the minority class in the loss function, making the model more sensitive to the minority class.
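
As an illustration of class weighting, the sketch below computes inverse-frequency weights from the training label counts in this notebook. It assumes the "balanced" heuristic documented by scikit-learn, `n_samples / (n_classes * count)`; the variable names are illustrative.

```python
# Training label counts from the notebook (1 = positive, 0 = negative)
class_counts = {0: 280, 1: 1220}

n_samples = sum(class_counts.values())
n_classes = len(class_counts)

# "Balanced" heuristic: weight is inversely proportional to class frequency
class_weights = {
    label: n_samples / (n_classes * count)
    for label, count in class_counts.items()
}

for label, weight in class_weights.items():
    print(f"class {label}: weight={weight:.3f}")
```

Estimators such as `LogisticRegression` accept a dictionary like this through their `class_weight` parameter, so an error on a negative tweet would cost roughly four times as much as an error on a positive one.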

### Conclusion:
While SMOTE can be adapted for text data by applying it to vectorized representations, it’s important to be cautious of its limitations. Depending on your specific use case, you might want to explore alternative methods like data augmentation or class weighting to handle class imbalance in text data.

### After trying SMOTE, we found that it is not a reliable way to handle class imbalance in our dataset

The actions suggested to handle class imbalance (data augmentation, undersampling, and class weighting) should be performed **after text cleaning**. Here's why:

### 1. **Text Cleaning First:**
   - **Consistency:** Cleaning the text ensures consistency in the data. For example, removing stopwords, punctuation, and applying lemmatization makes the text uniform, which is crucial before generating new samples or modifying the existing dataset.
   - **Quality of Augmentation:** When you perform data augmentation (like paraphrasing or synonym replacement) on cleaned text, the generated samples will be of higher quality and more relevant to the task.
   - **Accurate Balancing:** If you balance the dataset before cleaning, the class counts may shift afterwards, for example when cleaning leaves some tweets empty or identical and those rows are dropped. By cleaning first, you're balancing the dataset as it will appear during model training.

### 2. **Data Augmentation:**
   - **After Cleaning:** Perform text cleaning first, then apply augmentation. This ensures that the augmented samples are consistent with the cleaned data. For instance, if you replace synonyms, you want to ensure that all text has been cleaned and standardized so that the augmentation doesn't introduce unwanted noise.

### 3. **Undersampling the Majority Class:**
   - **After Cleaning:** Undersample after cleaning to ensure that the remaining samples are the most relevant and highest quality. Cleaning the text first will help ensure that you are removing noise and irrelevant data before deciding which samples to remove.
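
Random undersampling itself is a short operation. The following is a minimal sketch in plain Python, assuming rows are dictionaries with a label key; `undersample_majority` is a hypothetical helper and the ten toy rows stand in for cleaned tweets.

```python
import random

def undersample_majority(rows, label_key, rng):
    """Randomly drop rows from larger classes until every class matches the smallest."""
    by_class = {}
    for row in rows:
        by_class.setdefault(row[label_key], []).append(row)
    target = min(len(group) for group in by_class.values())
    balanced = []
    for group in by_class.values():
        balanced.extend(rng.sample(group, target))  # sampling without replacement
    rng.shuffle(balanced)
    return balanced

rng = random.Random(19)  # fixed seed for reproducibility
rows = (
    [{"content": f"positive tweet {i}", "sentiment": 1} for i in range(8)]
    + [{"content": f"negative tweet {i}", "sentiment": 0} for i in range(2)]
)

balanced = undersample_majority(rows, "sentiment", rng)
print(len(balanced), sum(r["sentiment"] for r in balanced))
```

Applied to the 1220 positive and 280 negative training tweets, this would keep 280 of each class and discard 940 positive tweets, which is the main cost of the approach on a dataset this small.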

### 4. **Class Weighting:**
   - **Model Training Step:** Class weighting is applied during the model training phase, not during preprocessing. However, it still assumes that the text has been cleaned and processed appropriately before the model is trained.

### Summary:
- **Text Cleaning:** First step to ensure data quality and consistency.
- **Class Imbalance Handling (Augmentation, Undersampling):** Performed after text cleaning to ensure that the balancing process is effective and relevant to the cleaned dataset.
- **Class Weighting:** Applied during the training phase, after the data has been preprocessed.

Following this order ensures that the data fed into the model is clean, consistent, and balanced, leading to better model performance.