# Multi-Label Text Classification with TFIDF features and Logistic Regression

In this notebook, I have implemented a Logistic Regression model for multi-label classification using TfIdf features extracted from text. 

Since the TFIDF vectorization is a very sparse matrix, linear classifiers are known to work well with them. Also the available data and the number of features extracted from them can although be large, linear methods can use them to train a model. 

I have therefore made the choice to use a Logistic regression model. 

I use a combination of main_product names and sub_product names as the tags after doing a number of text cleaning steps. 

In [21]:
# Imports
import re
import nltk
import pickle
import tqdm
import sklearn
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from collections import defaultdict
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize, TreebankWordTokenizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import Pipeline
from sklearn.metrics import classification_report, confusion_matrix, average_precision_score, f1_score, accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.linear_model import LogisticRegression

## Loading and preparing data

Below I prepare the data required for fitting Logistic Regression model. The main fields I use are the "main_product" as the class label and the "complaint_text" as the input text. 

After loading the data from tables, I merge the products and complaints tables to get the data into one frame. 

Later, I undersample the majority classes in order to reduce the class imbalance. 

In [22]:
# Input data into dataframes 
complaints_users = pd.read_csv('../data/complaints_users.csv')
products = pd.read_csv('../data/products.csv')

# Merge tables to create a unified dataset with predictors and response 
df = pd.merge(complaints_users, products, left_on="PRODUCT_ID", right_on="PRODUCT_ID", how="left")

# Drop columns that are not required
df = df[["COMPLAINT_TEXT", "PRODUCT_ID", "MAIN_PRODUCT", "SUB_PRODUCT"]]
df = df.drop_duplicates()
df = df.reset_index()

# groupby "main_products" and perform majaority undersampling 
grouped_complaints = df.groupby("MAIN_PRODUCT")
new_df = pd.DataFrame()
for name, group in grouped_complaints:
    if group.shape[0] > 10000:
        chosen_records = group.sample(n=10000, axis=0, random_state=9)
    elif group.shape[0] > 5000 and group.shape[0] < 10000:
        chosen_records = group
    else:
        pass
    new_df = pd.concat([new_df, chosen_records])
    
# the new_df is ready
new_df = new_df.reset_index()

## Text Cleaning 

Text tidying such as removal of punctuation, lemmatization of words after tokenization are performed below. 

In [23]:
# Some basic text tidy job is done here 

# regex to remove anything other than word and space - i.e, punctuations 
remove_punctuation = re.compile('[^\w\s]')

# regex to remove xxxx usually credit card entries - do not use
remove_xxxx = re.compile('\sx+x')

# regex to remove digits - do not use
remove_digits = re.compile('\d')

# stopwords corpora 
stopwords = set(stopwords.words('english'))

# this is a good lemmatizer that reduces nouns to their correct root form but leaves the verbs out
stemmer = WordNetLemmatizer()

# this tokenizer splits not only on space but on punctuation too
tokenizer = TreebankWordTokenizer()

# function to clean the text
def text_cleaning(text):
    text = text.lower()
    text = remove_punctuation.sub('', text)
    #text = remove_xxxx.sub('', text)
    #text = remove_digits.sub('', text)
    text = tokenizer.tokenize(text)
    text = ' '.join(stemmer.lemmatize(word) for word in text if word not in stopwords)
    return text

# Using apply to apply the above function on the COMPLAINT_TEXT series 
new_df["COMPLAINT_TEXT"] = new_df["COMPLAINT_TEXT"].apply(text_cleaning)

Some classes are merged due to the names being same or substrings of one another. 

In [24]:
# Merging classes
new_df.loc[new_df["MAIN_PRODUCT"]=="Credit card", "MAIN_PRODUCT"] = "Credit card or prepaid card"
new_df.loc[new_df["MAIN_PRODUCT"]=="Prepaid card", "MAIN_PRODUCT"] = "Credit card or prepaid card"
new_df.loc[new_df["MAIN_PRODUCT"]=="Payday loan", "MAIN_PRODUCT"] = "Payday loan, title loan, or personal loan"
new_df.loc[new_df["MAIN_PRODUCT"]=="Money transfers", "MAIN_PRODUCT"] = "Money transfer, virtual currency, or money service"
new_df.loc[new_df["MAIN_PRODUCT"]=="Virtual currency", "MAIN_PRODUCT"] = "Money transfer, virtual currency, or money service"
new_df.loc[new_df["MAIN_PRODUCT"]=="Credit reporting", "MAIN_PRODUCT"] = "Credit reporting, credit repair services, or other personal consumer reports"

'new_df.loc[new_df["MAIN_PRODUCT"]=="Credit card", "MAIN_PRODUCT"] = "Credit card or prepaid card"\nnew_df.loc[new_df["MAIN_PRODUCT"]=="Prepaid card", "MAIN_PRODUCT"] = "Credit card or prepaid card"\nnew_df.loc[new_df["MAIN_PRODUCT"]=="Payday loan", "MAIN_PRODUCT"] = "Payday loan, title loan, or personal loan"\nnew_df.loc[new_df["MAIN_PRODUCT"]=="Money transfers", "MAIN_PRODUCT"] = "Money transfer, virtual currency, or money service"\nnew_df.loc[new_df["MAIN_PRODUCT"]=="Virtual currency", "MAIN_PRODUCT"] = "Money transfer, virtual currency, or money service"\nnew_df.loc[new_df["MAIN_PRODUCT"]=="Credit reporting", "MAIN_PRODUCT"] = "Credit reporting, credit repair services, or other personal consumer reports'

## Partition data and extract TFIDF features

In this section, I will creates train/val/test partitions from data and then transform them to TFIDF features. 

X is the cleaned complained text from new_df. 
y (target) will be the tags obtained by concatenating the "main_product" and "sub_product" for each record. 

Sklearn train_test_split is used twice to progressively partition data. 

In [25]:
# Text on which feature extraction will be applied
X = new_df["COMPLAINT_TEXT"]

# Target labels
new_df["MAIN_PRODUCT"] = new_df["MAIN_PRODUCT"].fillna('')
new_df["SUB_PRODUCT"] = new_df["SUB_PRODUCT"].fillna('')
tags = pd.Series(zip(new_df["MAIN_PRODUCT"], new_df["SUB_PRODUCT"])).map(list)

In [26]:
# Two step application of data partitioning - the final data partition ratio will be 0.7:0.15:0.15

X_train, X_val, y_train, y_val = train_test_split(X, tags, test_size=0.3)

X_val, X_test, y_val, y_test = train_test_split(X_val, y_val, test_size=0.5)

In [27]:
# This is a function where I define a TFIDF vectorizer and fit it with train partition first and transform the two more partitions for use later in the pipeline.

def extract_tfidf_features(X_train, X_test, X_val):
    tfidf_vectorizer = TfidfVectorizer(sublinear_tf=True, min_df=5, max_df=0.5, norm='l2', ngram_range=(1, 2), stop_words="english", max_features = 2**16)
    tfidf_vectorizer.fit(X_train)
    X_train = tfidf_vectorizer.transform(X_train)
    X_val = tfidf_vectorizer.transform(X_val)
    X_test = tfidf_vectorizer.transform(X_test)
    return X_train, X_val, X_test

'def extract_tfidf_features(X):\n    tfidf_vectorizer = TfidfVectorizer(sublinear_tf=True, min_df=5, max_df=0.5, norm=\'l2\', ngram_range=(1, 2), stop_words="english", max_features=10000)\n    return tfidf_vectorizer.fit_transform(X)'

In [28]:
# Apply the above function to transform all three partitions to their feature forms 
X_train_tfidf, X_val_tfidf, X_test_tfidf = extract_tfidf_features(X_train, X_val, X_test)

'X_train, X_val, y_train, y_val = train_test_split(X_tfidf, tags, test_size=0.3)\n\nX_val, X_test, y_val, y_test = train_test_split(X_val, y_val, test_size=0.5)'

## Model fitting

In this section I will define a Logistic Regression classifier which can predict tags for each text. 

Since each text snippet can have multiple labels assigned, during the prediction also there can be multiple tags predicted for each input text. 

To deal with this type of prediction of multiple tags per text, we need to prepare the "target" as a binary variable where 1 indicates the presence of the tag and 0 indicates it's absence. For each unique tag, there can be two possibilities - 1/0 for every text. 

Sklearn's MultiLabelBinarizer can encode the tags for each text in binary format as described above. 

In [29]:
# Let the MLB know of all the unique possible tags in the corpus 

# generate unique tags in the corpus
# class tags counts
class_tags_counts = {}
for tag in tags:
    for item in tag:
        if item!="":
            if item in class_tags_counts:
                class_tags_counts[item] += 1
            else:
                class_tags_counts[item] = 1
    

# create an instance of MLB with the unique tags we created above
mlb = MultiLabelBinarizer(classes=sorted(class_tags_counts.keys()))


In [30]:
# Here I transform the targets in train and val data into their binary counterparts - I have kept the "test" targets aside for now. 
y_train_tfidf = mlb.fit_transform(y_train)
y_val_tfidf = mlb.fit_transform(y_val)

In [31]:
# The function below defines a classifier and fits it for training partition

def train_classifier(X_train, y_train):
    """
      X_train, y_train — training data
      
      return: trained classifier
    """
    
    # Create and fit LogisticRegression wraped into OneVsRestClassifier.

    clf = OneVsRestClassifier(LogisticRegression(multi_class='ovr', penalty='l2', C=10)).fit(X_train, y_train)

    return clf

In [32]:
# I call the above function to fit the classifier with the training partition

# Notice that the train X and train targets are tfidf transformed and binarized respectively
classifier_tfidf = train_classifier(X_train_tfidf, y_train_tfidf)

## Evaluation 

In this section, I will evaluate the classifier on validation set and generate evaluation metrics

In [33]:
# I will predict the tags on the validation set and check if it performs okay. This is used to choose parameters of the Log reg model above.
y_val_predicted_labels = classifier_tfidf.predict(X_val_tfidf)
y_val_predicted_scores = classifier_tfidf.decision_function(X_val_tfidf)

In [34]:
# I will reverse transform the predictions to check if they are good
y_val_pred = mlb.inverse_transform(y_val_predicted_labels)

In [38]:
# I will check a few example predictions 

for i in range(10):
    print("complaint text: {}".format(X_val.iloc[i]))
    print("True labels: {}".format(y_val.iloc[i]))
    print("Predicted labels: {} \n".format(y_val_pred[i]))

complaint text: complaint regarding xxxx xxxx fka xxxx acctnoxxxx reporting experian xxxx xxxx credit bureausit special significance account disputed twice within last 30 day bureau bureau updated verified information time evidence attached 3rd party credit pull pulled today xxxx2019after multiple dispute verified bureau information match correct even though bureau say information correcti would like specious creditor bought account bank simply delete account solution ongoing problem proof clearly readable understandable bureau act unilaterally delete account creditor posse record original creditor merely guessing response dispute
True labels: ['Credit reporting, credit repair services, or other personal consumer reports', 'Credit reporting']
Predicted labels: ('Conventional adjustable mortgage (ARM)', 'Mortgage') 

complaint text: xxxxxxxx mailed xxxx xxxx check 55000 cover cost xxxx xxxx xxxx credit card account xxxx xxxx xxxx processed check yet send receipt send 3 credit reporting 

In [36]:
# Function that prints performance measures
def print_evaluation_scores(y_val, predicted):
    
    print("accuracy: {}".format(accuracy_score(y_val, predicted)))
    print("f1 score: {}".format(f1_score(y_val, predicted,average="weighted")))
    print("average precision score: {}".format(average_precision_score(y_val, predicted)))

In [37]:
print_evaluation_scores(y_val_tfidf, y_val_predicted_labels)

accuracy: 0.033134143828559656
f1 score: 0.09520909055244269
average precision score: nan
