In [1]:
import pandas as pd

df = pd.read_csv('/content/spam.csv', encoding='latin-1')
df.head()


Unnamed: 0,v1,v2,Unnamed: 2,Unnamed: 3,Unnamed: 4
0,ham,"Go until jurong point, crazy.. Available only ...",,,
1,ham,Ok lar... Joking wif u oni...,,,
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...,,,
3,ham,U dun say so early hor... U c already then say...,,,
4,ham,"Nah I don't think he goes to usf, he lives aro...",,,


In [2]:
df = df[['v1', 'v2']]
df.columns = ['label', 'message']
df.head()


Unnamed: 0,label,message
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro..."


In [3]:
df['label'].value_counts()

label,count
ham,4825
spam,747


In [4]:
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC

# class_weight='balanced' weights each class inversely to its frequency
# (roughly 87% ham / 13% spam in this dataset)
log_model = LogisticRegression(class_weight='balanced')
rf_model = RandomForestClassifier(class_weight='balanced')
svm_model = SVC(class_weight='balanced')

# Then, train your model as usual:
# log_model.fit(X_train, y_train)
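
To make the effect of `class_weight='balanced'` concrete, here is a minimal sketch of the weighting formula that scikit-learn documents for this option, `n_samples / (n_classes * class_count)`. The helper name `balanced_class_weights` is hypothetical, and the label list is rebuilt from the class counts shown above rather than read from `df`.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


# Class counts taken from the value_counts() output above (4825 ham, 747 spam).
labels = ["ham"] * 4825 + ["spam"] * 747
weights = balanced_class_weights(labels)

for cls, weight in sorted(weights.items()):
    print(f"{cls}: {weight:.3f}")
```

With these counts, `ham` gets a weight of about 0.577 and `spam` about 3.730, so each misclassified spam message costs the model roughly 6.5 times as much as a misclassified ham message.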


### 1. Data Preparation: Text to Numerical Features

First, we need to convert the text messages into numerical features. We'll use the `TfidfVectorizer` from `sklearn.feature_extraction.text`.

**Note**: This step uses the original `df`, because balancing should be applied only to the training set, after the data has been split into training and testing sets. If you have already created `df_balanced` with simple oversampling, you could vectorize `df_balanced` instead, but duplicated rows may then land in both the training and test sets and inflate the test scores. The rest of this walkthrough works with the original `df` to show the full process with SMOTE.

In [7]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split

# Define features (X) and target (y)
X = df['message']
y = df['label']

# Convert 'spam'/'ham' labels to numerical (0/1)
y = y.map({'ham': 0, 'spam': 1})

# Split data into training and testing sets
# It's crucial to balance only the training data to avoid data leakage
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

# Initialize TF-IDF Vectorizer
tfidf_vectorizer = TfidfVectorizer(stop_words='english', max_features=5000)

# Fit and transform the training data
X_train_tfidf = tfidf_vectorizer.fit_transform(X_train)

# Transform the test data (do not fit on test data)
X_test_tfidf = tfidf_vectorizer.transform(X_test)

print("Shape of X_train_tfidf after vectorization:", X_train_tfidf.shape)
print("Shape of y_train:", y_train.shape)

Shape of X_train_tfidf after vectorization: (4457, 5000)
Shape of y_train: (4457,)
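
To show what the vectorizer computes, here is a rough pure-Python sketch of TF-IDF weighting using the smoothed IDF formula that scikit-learn documents as its default, `ln((1 + n) / (1 + df)) + 1`, followed by L2 normalization. It is a simplification: the whitespace tokenizer, the helper name `tfidf`, and the three toy messages are assumptions for illustration, and it skips stop-word removal and `max_features`.

```python
import math
from collections import Counter


def tfidf(docs):
    """Return one {term: weight} dict per document, L2-normalized."""
    # Simplified tokenizer (assumption): lowercase and split on whitespace.
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)

    # Document frequency: number of documents that contain each term.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))

    # Smoothed inverse document frequency.
    idf = {
        term: math.log((1 + n_docs) / (1 + count)) + 1
        for term, count in doc_freq.items()
    }

    vectors = []
    for tokens in tokenized:
        term_freq = Counter(tokens)
        raw = {term: freq * idf[term] for term, freq in term_freq.items()}
        norm = math.sqrt(sum(w * w for w in raw.values()))
        vectors.append({term: w / norm for term, w in raw.items()})
    return vectors


# Hypothetical toy messages, not taken from the dataset.
toy_docs = ["free entry win cup", "ok see you later", "win free prize now"]
vectors = tfidf(toy_docs)

for doc, vec in zip(toy_docs, vectors):
    print(doc, "->", {term: round(w, 3) for term, w in vec.items()})
```

Terms that appear in fewer documents (such as `entry`) receive a higher weight than terms shared across documents (such as `free`), which is the property that lets the classifier focus on distinctive words.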


### 2. Applying SMOTE (Upsampling)

Now that we have numerical features (`X_train_tfidf`) and numerical labels (`y_train`), we can apply SMOTE to the *training data* to balance the classes. You'll need to install `imbalanced-learn` if you haven't already (`%pip install imbalanced-learn`).

In [8]:
# Install imbalanced-learn if not already installed
%pip install imbalanced-learn

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from imblearn.over_sampling import SMOTE

# Redefine features (X) and target (y) based on the global 'df'
X = df['message']
y = df['label']

# Convert 'spam'/'ham' labels to numerical (0/1)
y = y.map({'ham': 0, 'spam': 1})

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

# Initialize TF-IDF Vectorizer
tfidf_vectorizer = TfidfVectorizer(stop_words='english', max_features=5000)

# Fit and transform the training data
X_train_tfidf = tfidf_vectorizer.fit_transform(X_train)

# Apply SMOTE to the training data
smote = SMOTE(random_state=42)
X_train_smote, y_train_smote = smote.fit_resample(X_train_tfidf, y_train)

print("Class distribution before SMOTE (training data):")
display(y_train.value_counts())

print("Class distribution after SMOTE:")
display(pd.Series(y_train_smote).value_counts())

print("Shape of X_train_smote after SMOTE:", X_train_smote.shape)

Class distribution before SMOTE (training data):


label,count
0,3859
1,598


Class distribution after SMOTE:


label,count
0,3859
1,3859


Shape of X_train_smote after SMOTE: (7718, 5000)
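
The core idea behind SMOTE can be sketched in a few lines: pick a minority sample, pick one of its nearest minority neighbours, and create a new point somewhere on the line segment between them. The sketch below is a simplified illustration, not the `imbalanced-learn` implementation; the function name `smote_like_sample`, the value of `k`, and the toy 2-D points are assumptions.

```python
import random


def smote_like_sample(minority, k, rng):
    """Create one synthetic point between a minority sample and a near neighbour."""
    base = rng.choice(minority)

    # k nearest minority neighbours of `base` by squared Euclidean distance.
    others = [p for p in minority if p is not base]
    others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)))
    neighbour = rng.choice(others[:k])

    # Interpolate: new = base + gap * (neighbour - base), with gap in [0, 1).
    gap = rng.random()
    return [a + gap * (b - a) for a, b in zip(base, neighbour)]


# Hypothetical minority-class feature vectors (2-D for readability).
minority = [[0.0, 0.2], [0.1, 0.4], [0.3, 0.1], [0.5, 0.5]]
rng = random.Random(42)
synthetic = [smote_like_sample(minority, k=2, rng=rng) for _ in range(5)]

for point in synthetic:
    print([round(v, 3) for v in point])
```

Because every synthetic point lies between two real minority samples, the new rows are feature vectors rather than readable messages. This is why SMOTE is applied after TF-IDF vectorization and cannot operate on the raw text.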


### Which approach is better for your project?

Let's discuss the options:

1.  **Simple Oversampling (duplicating minority-class rows):**
    *   **Pros:** Easy to implement. Increases the number of minority samples.
    *   **Cons:** Can lead to overfitting, as the model sees exact duplicates of minority samples. It doesn't add new information to the dataset.

2.  **Downsampling (randomly removing majority-class rows):**
    *   **Pros:** Reduces the total dataset size, which can speed up training. Helpful when you have a very large majority class and computational resources are limited.
    *   **Cons:** You discard potentially valuable information from the majority class, which can lead to a less robust model.

3.  **Upsampling with SMOTE (implemented above):**
    *   **Pros:** Generates synthetic samples for the minority class by interpolating between existing minority samples, rather than duplicating them. This adds new, but similar, data points, which can help the model generalize better and reduce the risk of overfitting compared to simple oversampling.
    *   **Cons:** Can introduce noise if the minority class is already very noisy or ill-defined. It requires numerical features, hence the need for TF-IDF vectorization first. It also increases training time because the training set grows (here from 4457 to 7718 rows).
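
For comparison, the first two options can be sketched with the standard library alone. This is a minimal illustration under stated assumptions: the helper names `random_oversample` and `random_undersample` and the ten toy rows are hypothetical, and a real project would apply either step to the training split only.

```python
import random
from collections import Counter


def group_by_label(rows):
    """Group (label, message) rows by their label."""
    groups = {}
    for row in rows:
        groups.setdefault(row[0], []).append(row)
    return groups


def random_oversample(rows, rng):
    """Duplicate rows of smaller classes until every class matches the largest."""
    groups = group_by_label(rows)
    target = max(len(group) for group in groups.values())
    balanced = []
    for group in groups.values():
        balanced.extend(group)
        # Sampling with replacement produces exact duplicates.
        balanced.extend(rng.choices(group, k=target - len(group)))
    return balanced


def random_undersample(rows, rng):
    """Drop rows of larger classes until every class matches the smallest."""
    groups = group_by_label(rows)
    target = min(len(group) for group in groups.values())
    balanced = []
    for group in groups.values():
        # Sampling without replacement discards the remaining rows.
        balanced.extend(rng.sample(group, target))
    return balanced


# Hypothetical toy data: 8 ham rows and 2 spam rows.
rows = [("ham", f"ham message {i}") for i in range(8)]
rows += [("spam", f"spam message {i}") for i in range(2)]

rng = random.Random(42)
oversampled = random_oversample(rows, rng)
undersampled = random_undersample(rows, rng)

print("oversampled:", Counter(label for label, _ in oversampled))
print("undersampled:", Counter(label for label, _ in undersampled))
```

The oversampled set keeps all 8 ham rows and repeats the 2 spam rows until there are 8 of them, while the undersampled set keeps only 2 rows per class. That makes the trade-off visible: one approach repeats information, the other throws it away.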

**Recommendation for Spam Classification:**

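For text classification tasks like spam detection, **SMOTE is often preferred over simple oversampling or downsampling**, although it is worth comparing the results against a model trained with `class_weight='balanced'` on the unmodified training set.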

*   **Why not simple oversampling?** It's prone to overfitting by just repeating existing spam messages.
*   **Why not downsampling?** 'Ham' messages contain valuable linguistic patterns that help distinguish them from 'spam'. Randomly removing 'ham' messages might lead to a loss of crucial information, potentially making your model less effective at identifying legitimate messages.
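*   **Why SMOTE?** SMOTE does not generate new message text. It creates synthetic 'spam' feature vectors by interpolating between existing spam samples and their nearest neighbours in TF-IDF space. This helps the model learn more diverse patterns within the minority class without just memorizing existing examples or losing majority-class information, which often leads to a more balanced and robust classifier.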

Therefore, I recommend proceeding with **SMOTE** as demonstrated above, after converting your text data into numerical features.

### Current Balanced Training Dataset

After TF-IDF vectorization and applying SMOTE, our training data is now represented by `X_train_smote` (features) and `y_train_smote` (labels).

In [9]:
import pandas as pd

print("Shape of X_train_smote (features) after SMOTE:", X_train_smote.shape)
print("Shape of y_train_smote (labels) after SMOTE:", y_train_smote.shape)

print("\nFirst 5 rows of X_train_smote (numerical features, sparse representation):")
# Convert to dense array for display, only for a few rows
display(pd.DataFrame(X_train_smote[:5].toarray()))

print("\nFirst 5 rows of y_train_smote (balanced labels):")
display(y_train_smote.head())

print("\nClass distribution of y_train_smote (balanced labels):")
display(y_train_smote.value_counts())

Shape of X_train_smote (features) after SMOTE: (7718, 5000)
Shape of y_train_smote (labels) after SMOTE: (7718,)

First 5 rows of X_train_smote (numerical features, sparse representation):


Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,4990,4991,4992,4993,4994,4995,4996,4997,4998,4999
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0



First 5 rows of y_train_smote (balanced labels):


Unnamed: 0,label
0,0
1,0
2,0
3,0
4,0



Class distribution of y_train_smote (balanced labels):


label,count
0,3859
1,3859


As you can see, `X_train_smote` has `7718` samples (rows) and `5000` features (columns), and `y_train_smote` also has `7718` labels, with an equal count of `0` (ham) and `1` (spam), demonstrating that the dataset is now balanced for training.

Learned how to work with an imbalanced dataset.
Looking forward to using ML in .NET.
https://developers.google.com/machine-learning/crash-course/overfitting/imbalanced-datasets
https://medium.com/codex/handling-imbalanced-data-upsampling-and-downsampling-in-machine-learning-10f33ff0620b
https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html
