Task: Classifying textual data using a multi-label approach.

Stack: TF-IDF, Random Forest

Steps:
1. Data loading and normalization
2. Label binarization and title vectorization
3. Splitting the data into training and test sets (80/20)
4. Model training
5. Model evaluation and results

Result: Achieved a subset accuracy (exact match on all labels of a sample) of 89%, which is 9% higher than the result obtained with the Logistic Regression model.
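
For multi-label targets, scikit-learn's accuracy_score reports subset accuracy: a sample counts as correct only when every one of its labels is predicted correctly. The sketch below illustrates the difference from a per-label match rate. The toy arrays are made up for illustration and are not taken from the dataset.

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Toy label-indicator matrices (3 samples, 2 labels); illustrative only.
y_true = np.array([[1, 0], [0, 1], [1, 1]])
y_pred = np.array([[1, 0], [0, 1], [1, 0]])  # the last row misses one label

# Subset accuracy: share of rows where all labels match.
subset_accuracy = accuracy_score(y_true, y_pred)

# Per-label match rate: share of individual cells that match.
per_label_accuracy = (y_true == y_pred).mean()

print(subset_accuracy, per_label_accuracy)
```

Subset accuracy is the stricter of the two, so it is usually the lower number.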

Difficulties:
- The Owner label scores 0 for precision, recall and F1 (only 2 samples in the test set)
- Rare classes are handled poorly
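
A quick way to spot rare classes before training is to count how often each label occurs. This is a minimal sketch: the label lists are hypothetical stand-ins for the Combined_Labels column, and the min_count threshold is an assumption chosen for illustration.

```python
from collections import Counter

# Hypothetical label lists, standing in for data["Combined_Labels"].
label_lists = [
    ["Director"],
    ["Manager"],
    ["Owner", "Chief Officer"],
    ["Director"],
    ["Individual Contributor/Staff"],
]

# Count every label across all rows.
label_counts = Counter(label for labels in label_lists for label in labels)

# Flag labels that occur fewer than min_count times (illustrative threshold).
min_count = 2
rare_labels = [label for label, count in label_counts.items() if count < min_count]

print(label_counts)
print(rare_labels)
```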

Solutions undertaken:
- Used NLTK for noise cleanup (stop-word removal)
- Added lemmatization
- Tested different values of the parameters n_estimators and test_size
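
Trying several n_estimators values can be automated with a cross-validated grid search. The sketch below uses synthetic data from make_multilabel_classification as a stand-in for the TF-IDF features and binarized labels. The candidate values and the f1_micro scoring choice are assumptions, not the settings tried in this notebook. test_size is a split parameter rather than a model parameter, so it is not part of the grid.

```python
from sklearn.datasets import make_multilabel_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic multi-label data; a stand-in for X_tfidf and y_binarized.
X_demo, Y_demo = make_multilabel_classification(
    n_samples=200, n_features=20, n_classes=4, random_state=1
)

# Candidate values are illustrative assumptions.
param_grid = {"n_estimators": [10, 50]}

search = GridSearchCV(
    RandomForestClassifier(random_state=1),
    param_grid,
    scoring="f1_micro",  # micro-averaged F1 accepts label-indicator targets
    cv=3,
)
search.fit(X_demo, Y_demo)

print(search.best_params_, round(search.best_score_, 3))
```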

Opportunities for optimization:
- Balancing classes to improve results on rare classes such as Owner
- Changing the vectorization tool (e.g. Word2Vec)
- Changing the model (e.g. MLN, BERT)
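
One possible way to balance classes without resampling is the class_weight="balanced" option of RandomForestClassifier, which reweights samples inversely to class frequency in each label column. This is a sketch on synthetic data with one deliberately rare label. Whether it helps on the real dataset would need to be checked, and with only 2 Owner samples in the test set any measured change would be very noisy.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(1)
X_demo = rng.normal(size=(200, 5))

# Two synthetic labels: one common, one rare (10 positives out of 200).
y_common = (X_demo[:, 0] > 0).astype(int)
y_rare = (np.arange(200) % 20 == 0).astype(int)
X_demo[:, 1] += 3 * y_rare  # make the rare label learnable from feature 1
Y_demo = np.column_stack([y_common, y_rare])

# "balanced" weights samples inversely to class frequency, per label column.
balanced_model = RandomForestClassifier(
    n_estimators=100, class_weight="balanced", random_state=1
)
balanced_model.fit(X_demo, Y_demo)

Y_demo_pred = balanced_model.predict(X_demo)
print(Y_demo_pred.shape)
```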

In [None]:
My results:

Classification Report:
                              precision    recall  f1-score   support

               Chief Officer       0.96      0.60      0.74        40
                    Director       0.93      0.93      0.93        97
Individual Contributor/Staff       0.97      0.98      0.97       226
                     Manager       0.85      0.53      0.65        32
                       Owner       0.00      0.00      0.00         2
              Vice President       0.93      0.93      0.93        67

                   micro avg       0.95      0.89      0.92       464
                   macro avg       0.77      0.66      0.70       464
                weighted avg       0.94      0.89      0.91       464
                 samples avg       0.92      0.91      0.91       464

Total model accuracy:
0.8928571428571429
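
The gap between the micro and macro averages comes from how they are computed: the macro average is the unweighted mean of the per-class scores, so the zeros for Owner count as much as the scores for the 226 Individual Contributor/Staff samples. The sketch below recomputes the macro row from the rounded per-class values in the report above, so the result is approximate.

```python
# Per-class (precision, recall, f1) copied from the classification report.
per_class = {
    "Chief Officer": (0.96, 0.60, 0.74),
    "Director": (0.93, 0.93, 0.93),
    "Individual Contributor/Staff": (0.97, 0.98, 0.97),
    "Manager": (0.85, 0.53, 0.65),
    "Owner": (0.00, 0.00, 0.00),
    "Vice President": (0.93, 0.93, 0.93),
}

# Macro average: plain mean over classes, ignoring support.
macro = [sum(column) / len(per_class) for column in zip(*per_class.values())]
macro_precision, macro_recall, macro_f1 = (round(value, 2) for value in macro)

print(macro_precision, macro_recall, macro_f1)
```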

In [10]:
from google.colab import drive
import nltk
import pandas
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, accuracy_score
from sklearn.preprocessing import MultiLabelBinarizer
from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords

# NLTK Resources
nltk.download("stopwords")
nltk.download("punkt")
nltk.download("wordnet")
nltk.download("punkt_tab")


# 1. Data pre-processing

# Loading the dataset from Google Drive
drive.mount("/content/drive")
data_path = "/content/drive/My Drive/dataset_path/dataset_name.xlsx"

data = pandas.read_excel(data_path)

data = data.fillna("")  # Replace NaN with empty strings

# Labels processing
data["Combined_Labels"] = data[["Column 1", "Column 2", "Column 3", "Column 4"]].values.tolist()  # Combine the label columns into one list per row
data["Combined_Labels"] = data["Combined_Labels"].apply(lambda x: list(filter(None, x)))  # Remove empty values

# Build the stop-word set and the lemmatizer once instead of on every call
STOP_WORDS = set(stopwords.words("english"))
LEMMATIZER = WordNetLemmatizer()

def text_cleaner(text):
    """Lowercase, tokenize, remove stop words, and lemmatize a title."""
    tokens = nltk.word_tokenize(text.lower())  # Lowercasing and tokenization
    tokens = [word for word in tokens if word not in STOP_WORDS]  # Stop-word removal
    tokens = [LEMMATIZER.lemmatize(word) for word in tokens]  # Lemmatization
    return " ".join(tokens)

# Title processing
data["Processed_Title"] = data["Title"].apply(text_cleaner)

# Features X and labels Y
X_titles = data["Processed_Title"]
Y_labels = data["Combined_Labels"]

# Labels to binary
binar = MultiLabelBinarizer()
y_binarized = binar.fit_transform(Y_labels)


# 2. TF-IDF vectorization
vector = TfidfVectorizer()
X_tfidf = vector.fit_transform(X_titles)


# 3. 80/20 data split
X_train, X_test, Y_train, Y_test = train_test_split(X_tfidf, y_binarized, test_size=0.20, random_state=1)


# 4. Training
model = RandomForestClassifier(random_state=1, n_estimators=100)
model.fit(X_train, Y_train)

# 5. Result
Y_pred = model.predict(X_test)

results = classification_report(Y_test, Y_pred, target_names=binar.classes_, zero_division=0)
accuracy = accuracy_score(Y_test, Y_pred)

# Print metrics
print(f"Classification Report:\n{results}")
print(f"Total model accuracy:\n{accuracy}")
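
The cell above fits TfidfVectorizer on all titles before the split, so the IDF statistics also see the test titles. A possible alternative, sketched below, is to split the raw titles first and wrap the vectorizer and the model in a scikit-learn Pipeline, so that the vectorizer is fitted on the training part only. The titles and labels here are hypothetical examples, not rows from the dataset.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MultiLabelBinarizer

# Hypothetical cleaned titles and their label lists (illustrative only).
demo_titles = [
    "chief executive officer", "vice president sale", "software engineer",
    "marketing manager", "director engineering", "owner founder",
    "senior director finance", "account manager",
]
demo_labels = [
    ["Chief Officer"], ["Vice President"], ["Individual Contributor/Staff"],
    ["Manager"], ["Director"], ["Owner", "Chief Officer"],
    ["Director"], ["Manager"],
]

demo_Y = MultiLabelBinarizer().fit_transform(demo_labels)

# Split the raw text first, so the test titles stay unseen by TF-IDF.
titles_train, titles_test, demo_Y_train, demo_Y_test = train_test_split(
    demo_titles, demo_Y, test_size=0.25, random_state=1
)

pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("forest", RandomForestClassifier(n_estimators=100, random_state=1)),
])
pipeline.fit(titles_train, demo_Y_train)  # TF-IDF is fitted on training titles only

demo_Y_pred = pipeline.predict(titles_test)
print(demo_Y_pred.shape)
```

With this layout, scores on the real data may differ slightly from the ones reported here, because the vocabulary and IDF weights would no longer include the test titles.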

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!


Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
