# CSE 572: Lab 3

In this lab, you will practice implementing techniques for handling imbalanced datasets. You will use upsampling, downsampling, and SMOTE to balance a tabular dataset and evaluate how these techniques affect model performance, then apply augmentation techniques to image and text data.

To execute and make changes to this notebook, click File > Save a copy to save your own version in your Google Drive or GitHub. Read the step-by-step instructions below carefully. To execute the code, click on each cell below and press SHIFT+ENTER, or click the Play button.

When you finish executing all code/exercises, save your notebook, then download a copy (.ipynb file). Submit 1) a link to your Colab notebook, 2) the .ipynb file, and **3) a PDF of the executed notebook** on Canvas.

To generate a pdf of the notebook, click File > Print > Save as PDF.

# **PUT YOUR GROUP INFO HERE**

| Group number | August Group XXX |            |
|--------------|------------------|------------|
| Member 1     | NAME             | ASURITE ID |
| Member 2     |                  |            |
| Member 3     |                  |            |
| Member 4     |                  |            |

In [None]:
# Importing Libraries
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.utils import resample
from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification


In [None]:
# Step 1: Load Dataset
# Generate a synthetic imbalanced dataset with scikit-learn (about 90% class 0, 10% class 1)
X, y = make_classification(n_classes=2, class_sep=2, weights=[0.9, 0.1],
                           n_informative=3, n_redundant=1, flip_y=0,
                           n_features=5, n_clusters_per_class=1,
                           n_samples=1000, random_state=42)
data = pd.DataFrame(X, columns=[f'feature{i}' for i in range(1, X.shape[1] + 1)])
data['label'] = y

print("Original Dataset Class Distribution:")
print(data['label'].value_counts())


In [None]:
# Step 2: Train-Test Split
X = data.drop('label', axis=1)
y = data['label']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
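
The split above is purely random, so the minority-class share in the test set can drift away from the 10% in the full dataset. scikit-learn's `train_test_split` accepts a `stratify=y` argument to keep the class ratio the same in both splits. The sketch below shows the idea with NumPy only; the function name `stratified_split_indices` and the toy labels `y_demo` are made up for illustration and are not part of the lab code.

```python
import numpy as np

def stratified_split_indices(y, test_size=0.3, seed=42):
    """Return (train_idx, test_idx) so each class contributes test_size of its rows to the test set."""
    rng = np.random.default_rng(seed)
    y = np.asarray(y)
    test_parts = []
    for cls in np.unique(y):
        cls_idx = np.flatnonzero(y == cls)   # row positions belonging to this class
        rng.shuffle(cls_idx)                 # shuffle within the class
        n_test = int(round(test_size * len(cls_idx)))
        test_parts.append(cls_idx[:n_test])
    test_idx = np.sort(np.concatenate(test_parts))
    train_idx = np.setdiff1d(np.arange(len(y)), test_idx)
    return train_idx, test_idx

# Toy labels with the same 90/10 imbalance as the lab dataset
y_demo = np.array([0] * 900 + [1] * 100)
train_idx, test_idx = stratified_split_indices(y_demo)
print("Minority share in train:", y_demo[train_idx].mean())
print("Minority share in test: ", y_demo[test_idx].mean())
```

Both splits keep the 10% minority share, which makes the test-set metrics easier to compare across balancing methods.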


# Balancing Techniques

In [None]:
## Upsampling Minority Class
train_data = pd.concat([X_train, y_train], axis=1)
minority = train_data[train_data['label'] == 1]
majority = train_data[train_data['label'] == 0]
minority_upsampled = resample(minority, replace=True, n_samples=len(majority), random_state=42)
upsampled_data = pd.concat([majority, minority_upsampled])

In [None]:
print("Upsampled Dataset Class Distribution:")
# Your Code Here


In [None]:
## Downsampling Majority Class
majority_downsampled = resample(majority, replace=False, n_samples=len(minority), random_state=42)
downsampled_data = pd.concat([majority_downsampled, minority])

In [None]:
print("Downsampled Dataset Class Distribution:")
# Your Code Here


In [None]:
## SMOTE
smote = SMOTE(random_state=42)
X_smote, y_smote = smote.fit_resample(X_train, y_train)
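
Unlike upsampling, which duplicates existing minority rows, SMOTE creates new minority points by interpolating between a minority sample and one of its nearest minority neighbors. The sketch below illustrates that interpolation step with NumPy; it is a simplified illustration, not the imbalanced-learn implementation, and the function name `smote_like_samples` and the toy points are assumptions made for this example.

```python
import numpy as np

def smote_like_samples(X_min, n_new, k=5, seed=42):
    """Create n_new synthetic points by interpolating between minority samples and their neighbors."""
    rng = np.random.default_rng(seed)
    X_min = np.asarray(X_min, dtype=float)
    # Pairwise Euclidean distances between all minority samples
    dists = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)              # a point is not its own neighbor
    neighbors = np.argsort(dists, axis=1)[:, :k]  # indices of the k nearest neighbors per point
    synthetic = np.empty((n_new, X_min.shape[1]))
    for i in range(n_new):
        base = rng.integers(len(X_min))            # pick a random minority sample
        neighbor = neighbors[base, rng.integers(k)]  # pick one of its k neighbors
        gap = rng.random()                         # position along the connecting segment, in [0, 1)
        synthetic[i] = X_min[base] + gap * (X_min[neighbor] - X_min[base])
    return synthetic

# Four toy minority points on the corners of the unit square
X_min = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
new_points = smote_like_samples(X_min, n_new=6, k=2)
print(new_points)
```

Every synthetic point lies on a line segment between two real minority points, so the new samples stay inside the region the minority class already occupies.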

In [None]:
print("SMOTE Dataset Class Distribution:")
# Your Code Here

In [None]:
# Step 4: Train Classifiers and Evaluate

## Helper function to train and evaluate a model
def train_and_evaluate(X_train, y_train, X_test, y_test, method_name):
    print(f"\nResults for {method_name}:")
    model = LogisticRegression(random_state=42)
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    print(classification_report(y_test, y_pred))
    print("Confusion Matrix:")
    print(confusion_matrix(y_test, y_pred))

# Train and evaluate on original data
train_and_evaluate(X_train, y_train, X_test, y_test, "Original Dataset")

# Train and evaluate on upsampled data
train_and_evaluate(upsampled_data.drop('label', axis=1), upsampled_data['label'], X_test, y_test, "Upsampled Dataset")

# Train and evaluate on downsampled data
train_and_evaluate(downsampled_data.drop('label', axis=1), downsampled_data['label'], X_test, y_test, "Downsampled Dataset")

# Train and evaluate on SMOTE data
train_and_evaluate(X_smote, y_smote, X_test, y_test, "SMOTE Dataset")
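
When comparing the methods, it helps to see how the minority-class numbers in the classification report follow from the confusion matrix. The sketch below computes precision, recall, and F1 for class 1 by hand. The counts in `example_cm` are made up for illustration and are not results from this lab; the function name `minority_metrics` is also hypothetical.

```python
import numpy as np

def minority_metrics(cm):
    """Precision, recall, and F1 for class 1 from a 2x2 confusion matrix.

    Rows are true labels and columns are predicted labels,
    the same layout that sklearn's confusion_matrix returns.
    """
    tn, fp, fn, tp = np.asarray(cm).ravel()
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return precision, recall, f1

# Made-up counts: 270 majority rows and 30 minority rows in the test set
example_cm = [[265, 5],
              [10, 20]]
precision, recall, f1 = minority_metrics(example_cm)
accuracy = (265 + 20) / 300
print(f"accuracy={accuracy:.3f} precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

In this made-up case accuracy is 0.95 even though a third of the minority samples are missed, which is why the minority-class precision, recall, and F1 are more informative than accuracy on imbalanced data.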

**Question 1: Analyze and compare the results from the different balancing techniques. Which method provided the best balance between precision, recall, and F1-score?**

**Answer:**

YOUR ANSWER HERE


# Image Data Augmentation



## Installs

In [None]:
#@title Install Dependencies
!pip install imgaug --quiet
!pip install albumentations --quiet
!pip install torchvision --quiet
!pip install opencv-python --quiet

## Env Config

In [None]:
#@title Imports
import imgaug.augmenters as iaa
import cv2
import matplotlib.pyplot as plt
import requests
import numpy as np

In [None]:
#@title Env params

IMG_PATH = 'https://www.gstatic.com/webp/gallery/1.jpg' #@param

In [None]:


# Fetch the image from the URL
response = requests.get(IMG_PATH, stream=True).raw
image = np.asarray(bytearray(response.read()), dtype="uint8")
image = cv2.imdecode(image, cv2.IMREAD_COLOR)

# Convert from BGR to RGB for matplotlib
image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)

plt.imshow(image)
plt.title("Original Image")
plt.show()

## Image Augmentation Options

In [None]:
#@title Option 1: Flipping

import imgaug.augmenters as iaa

# flip horizontally
flip_aug = iaa.Fliplr(1.0)  # 1.0 means always flip

augmented_image = flip_aug.augment_image(image)

plt.imshow(augmented_image)
plt.title("Horizontally Flipped Image")
plt.show()


In [None]:
#@title Option 2: Rotation
rotate_aug = iaa.Affine(rotate=90)  # rotate by a fixed 90 degrees; pass a tuple such as (-45, 45) to sample a random angle from that range

augmented_image = rotate_aug.augment_image(image)

plt.imshow(augmented_image)
plt.title("Rotated Image")
plt.show()

In [None]:
#@title Option 3: Brightness Adjustment

brightness_aug = iaa.Multiply(0.5)  # multiply every pixel value by 0.5 (darker); factors above 1.0 brighten

augmented_image = brightness_aug.augment_image(image)

plt.imshow(augmented_image)
plt.title("Brightness Adjusted Image")
plt.show()
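
`iaa.Multiply` scales every pixel value by the given factor. The NumPy sketch below shows roughly the same operation on a one-pixel toy image, including the clipping needed to stay in the valid 0 to 255 range. The function name `scale_brightness` and the toy pixel are assumptions for illustration, and imgaug's internal handling may differ in detail.

```python
import numpy as np

def scale_brightness(img, factor):
    """Multiply pixel values by factor and clip back into the uint8 range."""
    scaled = img.astype(np.float32) * factor
    return np.clip(scaled, 0, 255).astype(np.uint8)

# A single RGB pixel so the effect is easy to read
toy = np.array([[[200, 100, 50]]], dtype=np.uint8)
darker = scale_brightness(toy, 0.5)
brighter = scale_brightness(toy, 1.5)
print("darker:  ", darker[0, 0])
print("brighter:", brighter[0, 0])
```

Note that brightening can saturate channels: the red value 200 scaled by 1.5 would be 300, so it is clipped to 255.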

In [None]:
#@title Option 4: Random Noise

noise_aug = iaa.AdditiveGaussianNoise(scale=100)  # noise standard deviation fixed at 100 (on the 0-255 pixel scale); pass a tuple to sample it from a range

augmented_image = noise_aug.augment_image(image)

plt.imshow(augmented_image)
plt.title("Image with Gaussian Noise")
plt.show()
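
Additive Gaussian noise draws a random offset for each pixel value from a normal distribution with mean 0 and adds it to the image. A minimal NumPy sketch of the idea follows; the function name `add_gaussian_noise` and the flat gray toy image are assumptions for illustration, not imgaug's implementation.

```python
import numpy as np

def add_gaussian_noise(img, std, seed=0):
    """Add zero-mean Gaussian noise with the given standard deviation, then clip to uint8."""
    rng = np.random.default_rng(seed)
    noise = rng.normal(0.0, std, size=img.shape)
    noisy = img.astype(np.float32) + noise
    return np.clip(noisy, 0, 255).astype(np.uint8)

# Flat mid-gray toy image
toy = np.full((32, 32, 3), 128, dtype=np.uint8)
noisy = add_gaussian_noise(toy, std=100)
print("fraction of values clipped to 0 or 255:", np.mean((noisy == 0) | (noisy == 255)))
```

With a standard deviation as large as 100, a noticeable share of values is pushed outside the valid range and clipped, so smaller values are often used when the image content needs to stay recognizable.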

In [None]:
#@title Option 5: Distortions (Elastic Transformation)

elastic_aug = iaa.ElasticTransformation(alpha=100, sigma=3) # alpha controls the strength of the distortion; sigma controls how smooth the displacement field is (larger values give smoother distortions)

augmented_image = elastic_aug.augment_image(image)

plt.imshow(augmented_image)
plt.title("Elastic Transformed Image")
plt.show()

# NLP Data Augmentation

## Installs

In [None]:
#@title Install Dependencies
!pip install --upgrade gensim --quiet
!pip install transformers --quiet
!pip install sacremoses --quiet
!pip install nlpaug --quiet

In [None]:
#@title Download Models
from nlpaug.util.file.download import DownloadUtil

# Uncomment to download the word2vec GoogleNews vectors instead:
# DownloadUtil.download_word2vec(dest_dir = '.')

# Possible fastText model_name values are 'wiki-news-300d-1M', 'wiki-news-300d-1M-subword', 'crawl-300d-2M' and 'crawl-300d-2M-subword'

DownloadUtil.download_fasttext(dest_dir = '.', model_name = 'crawl-300d-2M')

# Optional: GloVe vectors, another embedding model that WordEmbsAug can use
# DownloadUtil.download_glove(dest_dir = '.', model_name = 'glove.6B')

## Config Env

In [None]:
#@title Imports (restart the runtime first if gensim was upgraded)
import gensim
print(gensim.__version__)

import transformers

import sacremoses # for back translation tokenizer

import nlpaug.augmenter.char as nac
import nlpaug.augmenter.word as naw
import nlpaug.augmenter.sentence as nas
import nlpaug.flow as nafc
from nlpaug.util import Action


In [None]:
#@title Example Sentence for Augmentation

TEXT = "I have been riding my scooter to the store everyday to get bananas." #@param

## Options for NLP Data Augmentation

In [None]:
#@title Option 1: Replace words with other most similar words

# Uses the fastText vectors fetched in the "Download Models" cell
aug = naw.WordEmbsAug(
  model_type = "fasttext",
  model_path = "crawl-300d-2M.vec",
  action = "substitute",  # "insert" is another option; it keeps the original words
  aug_p = 0.25 # fraction of tokens selected for replacement
  )

## Other models you can use (download the matching file first)
# aug = naw.WordEmbsAug(
#   model_type = "word2vec",
#   model_path = "GoogleNews-vectors-negative300.bin",
#   action = "substitute"
#   )

# aug = naw.WordEmbsAug(
#   model_type = 'glove',
#   model_path = 'glove.6B.300d.txt',
#   action = "substitute"
#   )


# Augment the text
augmented_text = aug.augment(TEXT)
print(f"Original:         {TEXT}\n")
print(f"Augmented Text:   {augmented_text}")
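
"Most similar" here means closest in embedding space, usually measured by cosine similarity between word vectors. The sketch below shows that lookup with a handful of made-up 3-dimensional vectors; the vocabulary, the vector values, and the function name `most_similar` are all hypothetical and only stand in for a real embedding model with hundreds of dimensions.

```python
import numpy as np

# Made-up toy embeddings; real models use vectors with hundreds of dimensions
toy_vectors = {
    "scooter": np.array([0.90, 0.10, 0.00]),
    "moped":   np.array([0.85, 0.15, 0.05]),
    "store":   np.array([0.10, 0.90, 0.00]),
    "shop":    np.array([0.15, 0.85, 0.05]),
    "bananas": np.array([0.00, 0.10, 0.90]),
}

def most_similar(word, vectors):
    """Return the other vocabulary word with the highest cosine similarity to `word`."""
    query = vectors[word]
    best_word, best_score = None, -1.0
    for other, vec in vectors.items():
        if other == word:
            continue
        score = float(query @ vec / (np.linalg.norm(query) * np.linalg.norm(vec)))
        if score > best_score:
            best_word, best_score = other, score
    return best_word

sentence = "I ride my scooter to the store"
augmented = " ".join(
    most_similar(tok, toy_vectors) if tok in toy_vectors else tok
    for tok in sentence.split()
)
print(augmented)
```

This sketch replaces every in-vocabulary word, whereas the augmenter above only replaces a random fraction of tokens controlled by `aug_p`.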

In [None]:
#@title Option 2: Insert context words predicted by a language model (BERT)

aug = naw.ContextualWordEmbsAug(
  model_path = 'bert-base-uncased',
  action = "insert",
  aug_p = 0.25
  )

augmented_text = aug.augment(TEXT)

print(f"Original:         {TEXT}\n")
print(f"Augmented Text:   {augmented_text}")

In [None]:
#@title Option 3: Synonym Replacement

aug = naw.SynonymAug(
    aug_src = "wordnet",
    aug_max = 3 # aug_p works here too, but aug_max caps how many words are changed
    )
augmented_text = aug.augment(TEXT)

print(f"Original:         {TEXT}\n")
print(f"Augmented Text:   {augmented_text}")
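
The effect of capping the number of replacements can be sketched without WordNet. The example below uses a small hand-written synonym table and replaces at most `aug_max` words. The table `TOY_SYNONYMS` and the function `replace_synonyms` are made up for illustration and are not how nlpaug selects or looks up synonyms internally.

```python
import random

# Hand-written stand-in for a WordNet lookup (hypothetical entries)
TOY_SYNONYMS = {
    "riding": ["driving"],
    "store": ["shop", "market"],
    "everyday": ["daily"],
    "get": ["buy", "fetch"],
}

def replace_synonyms(text, synonyms, aug_max=3, seed=0):
    """Replace up to aug_max words that have an entry in the synonym table."""
    rng = random.Random(seed)
    tokens = text.split()
    # Positions of words we know synonyms for
    candidates = [i for i, tok in enumerate(tokens) if tok.lower() in synonyms]
    chosen = rng.sample(candidates, k=min(aug_max, len(candidates)))
    for i in chosen:
        tokens[i] = rng.choice(synonyms[tokens[i].lower()])
    return " ".join(tokens)

text = "I have been riding my scooter to the store everyday to get bananas."
augmented = replace_synonyms(text, TOY_SYNONYMS, aug_max=2)
print(augmented)
```

Four words in the sentence have entries in the table, but with `aug_max=2` only two of them are changed, so the augmented sentence stays close to the original.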

In [None]:
#@title Option 4: Translations (and back translation)

back_translation_aug = naw.BackTranslationAug(
    from_model_name = 'facebook/wmt19-en-de',
    to_model_name = 'facebook/wmt19-de-en'
)

augmented_text = back_translation_aug.augment(TEXT)

print(f"Original:         {TEXT}\n")
print(f"Augmented Text:   {augmented_text}")