Model Selection - Multinomial Naive Bayes:

Multinomial Naive Bayes is a good baseline for text classification: it is simple, fast to train, and effective on high-dimensional, sparse data such as vectorized text.
It can work well even with small amounts of data and is fairly robust to noise.
It does not inherently handle class imbalance, but combined with a suitable sampling strategy it can still perform well.
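
To make the idea concrete, here is a minimal from-scratch sketch of multinomial Naive Bayes with Laplace smoothing. It is an illustration only, not how scikit-learn's `MultinomialNB` is implemented: the function names, the toy documents, and the `alpha` default are assumptions, and it works on raw word counts rather than TF-IDF weights.

```python
import math
from collections import Counter, defaultdict

def train_multinomial_nb(docs, labels, alpha=1.0):
    """Fit log priors and Laplace-smoothed log likelihoods from tokenized docs."""
    vocab = {w for doc in docs for w in doc}
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    for doc, label in zip(docs, labels):
        word_counts[label].update(doc)
    model = {}
    for label, n_docs in class_counts.items():
        total = sum(word_counts[label].values())
        log_prior = math.log(n_docs / len(labels))
        # Smoothing keeps words unseen in a class from getting zero probability
        log_likelihood = {
            w: math.log((word_counts[label][w] + alpha) / (total + alpha * len(vocab)))
            for w in vocab
        }
        model[label] = (log_prior, log_likelihood)
    return model

def predict_multinomial_nb(model, doc):
    """Return the class with the highest log posterior; out-of-vocabulary words are skipped."""
    scores = {
        label: log_prior + sum(ll[w] for w in doc if w in ll)
        for label, (log_prior, ll) in model.items()
    }
    return max(scores, key=scores.get)

# Toy data, made up for illustration
docs = [
    ["sql", "index", "query"],
    ["query", "table", "join"],
    ["neural", "network", "training"],
    ["training", "model", "gradient"],
]
labels = ["DB", "DB", "AI", "AI"]
model = train_multinomial_nb(docs, labels)
prediction = predict_multinomial_nb(model, ["join", "query", "index"])
print(prediction)
```

Working in log space avoids numerical underflow when many small word probabilities are multiplied together.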

Sampling - SMOTE (Synthetic Minority Over-sampling Technique):

SMOTE is an oversampling technique that creates synthetic samples of the minority class(es) by interpolating between an existing sample and one of its nearest neighbors in the same class.
Unlike undersampling, it balances the class distribution without discarding any existing samples, which matters when the dataset is imbalanced.
In the pipeline, SMOTE is applied only during the training phase (the imblearn pipeline skips samplers at prediction time), so the model learns from a balanced class distribution while the test data stays untouched.
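
The interpolation step can be sketched in a few lines of plain Python. This is a simplified illustration, not imblearn's implementation: the function name `smote_sketch`, the 2-D toy points, and the choice of `k=2` are assumptions made for the example.

```python
import random

def smote_sketch(minority, n_synthetic, k=2, seed=7):
    """Create synthetic points on the segment between a sample and one of its k nearest neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_synthetic):
        base = rng.choice(minority)
        # k nearest minority-class neighbours of base (squared Euclidean distance), excluding base
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: sum((a - b) ** 2 for a, b in zip(p, base)),
        )[:k]
        neighbour = rng.choice(neighbours)
        lam = rng.random()  # interpolation factor in [0, 1)
        synthetic.append(tuple(b + lam * (n - b) for b, n in zip(base, neighbour)))
    return synthetic

# Toy minority-class samples, made up for illustration
minority = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.0), (3.0, 3.0)]
new_points = smote_sketch(minority, n_synthetic=5)
print(new_points)
```

Every synthetic point lies on a line segment between two real minority samples, so the new data stays inside the region the minority class already occupies.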

Pipeline Usage:

The pipeline ensures that vectorization and SMOTE are fitted on the training data only, which prevents data leakage. If the pipeline is cross-validated, resampling happens only on the data used to train the model in each fold, never on the held-out fold.
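
The fold-level behaviour can be illustrated with a small stdlib-only sketch. It is an assumption-laden stand-in for what the pipeline does: random duplication replaces SMOTE for brevity, the helper names `k_fold_indices` and `oversample` are hypothetical, and the labels are toy data.

```python
import random
from collections import Counter

def k_fold_indices(n_samples, n_folds):
    """Yield (train_idx, valid_idx) pairs for plain k-fold cross-validation."""
    indices = list(range(n_samples))
    for fold in range(n_folds):
        valid = indices[fold::n_folds]
        valid_set = set(valid)
        train = [i for i in indices if i not in valid_set]
        yield train, valid

def oversample(indices, labels, rng):
    """Randomly duplicate minority-class indices until every class matches the largest one."""
    by_class = {}
    for i in indices:
        by_class.setdefault(labels[i], []).append(i)
    target = max(len(members) for members in by_class.values())
    balanced = []
    for members in by_class.values():
        balanced.extend(members)
        balanced.extend(rng.choices(members, k=target - len(members)))
    return balanced

labels = ["business"] * 16 + ["arch"] * 4  # toy imbalanced labels
rng = random.Random(7)
fold_summaries = []
for train_idx, valid_idx in k_fold_indices(len(labels), n_folds=4):
    # Resampling sees the training rows only; the validation fold keeps its natural distribution
    train_balanced = oversample(train_idx, labels, rng)
    fold_summaries.append({
        "train": Counter(labels[i] for i in train_balanced),
        "valid": Counter(labels[i] for i in valid_idx),
        "leak": bool(set(train_balanced) & set(valid_idx)),
    })
print(fold_summaries[0])
```

Resampling before splitting would instead let copies (or interpolations) of validation rows end up in the training data, which inflates the validation scores.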

Stratified Split:

The train-test split is stratified, so both the training and test sets keep the original class proportions.
Even though the classes are imbalanced, each one is represented proportionally in both sets, which is especially important for evaluating model performance accurately.
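
As a rough sketch of what stratification means, the split can be done per class so that each class contributes the same fraction to the test set. This is not how `train_test_split` is implemented internally; the function `stratified_split` and the toy label counts are assumptions for illustration.

```python
import random
from collections import Counter, defaultdict

def stratified_split(labels, test_size=0.2, seed=7):
    """Return (train_idx, test_idx) with roughly the same class proportions in both parts."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, label in enumerate(labels):
        by_class[label].append(i)
    train_idx, test_idx = [], []
    for members in by_class.values():
        rng.shuffle(members)
        # Take the same fraction from every class
        n_test = round(len(members) * test_size)
        test_idx.extend(members[:n_test])
        train_idx.extend(members[n_test:])
    return train_idx, test_idx

# Toy imbalanced labels, made up for illustration
labels = ["business"] * 50 + ["AI"] * 40 + ["arch"] * 10
train_idx, test_idx = stratified_split(labels)
train_counts = Counter(labels[i] for i in train_idx)
test_counts = Counter(labels[i] for i in test_idx)
print(train_counts, test_counts)
```

With a purely random split, a rare class such as `arch` could end up with very few or no test samples, which would make its per-class metrics unreliable.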

In [11]:
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import make_pipeline
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import classification_report, accuracy_score
from sklearn.preprocessing import LabelEncoder

from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords
from nltk.tokenize import RegexpTokenizer
import nltk

In [3]:
df = pd.read_csv('/content/unbalanced_description&category.csv')

In [8]:
df.head()

Unnamed: 0,description,category_verification,category_encoded
0,"Today, software engineers need to know not onl...",arch,4
1,As the digital economy changes the rules of th...,arch,4
2,Software architecture metrics are key to the m...,arch,4
3,ONE-VOLUME INTRODUCTION TO QUANTUM COMPUTING C...,arch,4
4,What will you learn from this book? If you're ...,arch,4


In [20]:
# Download NLTK resources (if not already installed)
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('omw-1.4')

# Initialize stop words and lemmatizer
stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()

# Build the tokenizer once instead of on every call
word_tokenizer = RegexpTokenizer(r'\w+')

# Define a function for preprocessing the text
def lemmatize_text(text):
    """Lowercase, tokenize, drop stop words, and lemmatize; return a space-joined string."""
    # Tokenize the lowercased text; the pattern also strips special characters
    tokens = word_tokenizer.tokenize(text.lower())
    # Lemmatize tokens and remove stop words
    return ' '.join(lemmatizer.lemmatize(w) for w in tokens if w not in stop_words)

# Creating a train pipeline
def train_pipeline(df, description_column, category_column):
    # Encoding target label
    label_encoder = LabelEncoder()
    df['category_encoded'] = label_encoder.fit_transform(df[category_column])

    # Splitting the dataset into the Training set and Test set
    X_train, X_test, y_train, y_test = train_test_split(
        df[description_column], df['category_encoded'],
        test_size=0.2, random_state=7,
        stratify=df['category_encoded']
    )

    # Convert the set of stop words to a list
    stop_words_list = list(stop_words)

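    # Creating a pipeline with text vectorization, SMOTE resampling, and Naive Bayes classifier.
    # lemmatize_text returns a string, so it is passed as the preprocessor; used as the
    # tokenizer, its string result would be iterated character by character.
    pipeline = make_pipeline(
        TfidfVectorizer(preprocessor=lemmatize_text, stop_words=stop_words_list),
        SMOTE(random_state=7),
        MultinomialNB()
    )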

    # Training the pipeline
    pipeline.fit(X_train, y_train)

    # Predicting the Test set results
    y_pred = pipeline.predict(X_test)

    # Generating the classification report
    report = classification_report(y_test, y_pred, target_names=label_encoder.classes_)

    return pipeline, label_encoder, X_test, y_test, report, y_pred

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to /root/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!


In [7]:
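def predict_category(pipeline, label_encoder, description):
    """Predict the category name for a single description string."""
    # The pipeline expects an iterable of documents, so wrap the string in a list
    prediction_encoded = pipeline.predict([description])
    # Map the encoded class index back to its original label
    prediction = label_encoder.inverse_transform(prediction_encoded)
    return prediction[0]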

In [22]:
# Train the pipeline
pipeline, label_encoder, X_test, y_test, report, y_pred = train_pipeline(df, 'description', 'category_verification')



In [23]:
print(report)

              precision    recall  f1-score   support

          AI       0.35      0.05      0.09      1240
          DB       0.19      0.18      0.19       294
Program_lang       0.35      0.29      0.32       549
       UI_UX       0.10      0.44      0.16       257
        arch       0.02      0.08      0.03        52
    business       0.27      0.68      0.38      1295
    comp_eng       0.13      0.21      0.16       200
info_mindful       0.80      0.08      0.15      1404
  networking       0.18      0.08      0.11       593
          os       0.12      0.27      0.16       149
 programming       0.24      0.04      0.07       814

    accuracy                           0.23      6847
   macro avg       0.25      0.22      0.17      6847
weighted avg       0.37      0.23      0.19      6847



In [24]:
accuracy = accuracy_score(y_test, y_pred)
print(f'Accuracy: {accuracy:.2f}')

Accuracy: 0.23
