<font size = 5 color = lightblue> <b> NLP - Streamlining the Customer Grievance Process

Objective:

* Apply NLP techniques, such as text classification and sentiment analysis, to customer complaints
* Efficiently gain insight into the underlying causes of customer grievances
* Improve our grievance redressal process

In [1]:
!pip install nltk==3.8.1
!pip install vaderSentiment

import nltk
import pandas as pd
from string import punctuation
from nltk.corpus import stopwords, wordnet
from nltk.stem import WordNetLemmatizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, accuracy_score, precision_score, recall_score, f1_score, confusion_matrix, ConfusionMatrixDisplay
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

# NLTK resources for stopword removal, POS tagging and lemmatization
nltk.download('punkt')
nltk.download('words')
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('averaged_perceptron_tagger')
nltk.download('averaged_perceptron_tagger_eng')


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package words to /root/nltk_data...
[nltk_data]   Package words is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package averaged_perceptron_tagger_eng to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger_eng is already up-to-
[nltk_data]       date!


<font size = 3 color = Magenta> <b> Read data from Excel

In [2]:
def nlpDataSetInfo():
  # DataFrame.info() prints its summary itself and returns None, so call it directly
  bankingComplaints.info()
  print("========================================================\n")
  print(f"columns : {bankingComplaints.columns}")
  print("========================================================\n")
  print(f"{bankingComplaints.head()}")

In [3]:
bankingComplaints = pd.read_excel("/content/sample_data/banking_complaints_2023.xlsx")
nlpDataSetInfo()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7011 entries, 0 to 7010
Data columns (total 9 columns):
 #   Column                 Non-Null Count  Dtype         
---  ------                 --------------  -----         
 0   Complaint ID           7011 non-null   object        
 1   Date Received          7011 non-null   datetime64[ns]
 2   Banking Product        7011 non-null   object        
 3   Department             7011 non-null   object        
 4   Issue ID               7011 non-null   object        
 5   Complaint Description  7011 non-null   object        
 6   State                  6984 non-null   object        
 7   ZIP                    6981 non-null   object        
 8   Bank Response          7011 non-null   object        
dtypes: datetime64[ns](1), object(8)
memory usage: 493.1+ KB

columns : Index(['Complaint ID', 'Date Received', 'Banking Product', 'Department',
       'Issue ID', 'Complaint Description', 'State', 'ZIP', 'Bank Response'],
      dtype

<font size = 3 color = Magenta> <b> Prepare text data

In [4]:
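# Map a Penn Treebank POS tag (as returned by nltk.pos_tag) to the WordNet POS
# constant the lemmatizer expects; unknown tags fall back to noun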
def pos_tagger(nltk_tag):
    if nltk_tag.startswith('J'):
        return wordnet.ADJ
    elif nltk_tag.startswith('V'):
        return wordnet.VERB
    elif nltk_tag.startswith('N'):
        return wordnet.NOUN
    elif nltk_tag.startswith('R'):
        return wordnet.ADV
    else:
        return wordnet.NOUN
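
As a quick check of the mapping above, the sketch below reproduces the same first-letter rule without needing the WordNet corpus. It assumes the single-letter values NLTK uses for `wordnet.ADJ`, `wordnet.VERB`, `wordnet.NOUN` and `wordnet.ADV` (`'a'`, `'v'`, `'n'`, `'r'`); the function name `penn_to_wordnet` and the sample tags are illustrative only.

```python
# Hypothetical stand-ins for the WordNet POS constants (wordnet.ADJ == 'a', etc.),
# so the example runs without downloading any NLTK data.
ADJ, VERB, NOUN, ADV = 'a', 'v', 'n', 'r'

# First letter of a Penn Treebank tag -> WordNet POS
_FIRST_LETTER_TO_POS = {'J': ADJ, 'V': VERB, 'N': NOUN, 'R': ADV}

def penn_to_wordnet(penn_tag):
    """Map a Penn Treebank tag (e.g. 'VBD') to a WordNet POS, defaulting to noun."""
    return _FIRST_LETTER_TO_POS.get(penn_tag[:1], NOUN)

sample_tags = ['JJ', 'VBD', 'NNS', 'RB', 'IN']
mapped = [penn_to_wordnet(tag) for tag in sample_tags]
print(mapped)  # ['a', 'v', 'n', 'r', 'n']
```

Tags with no WordNet counterpart, such as the preposition tag `IN`, fall back to noun, which is also the lemmatizer's default.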

In [5]:
def preprocessing(text):
  """Clean one complaint and return its lemmatized tokens as a list."""
  text = text.lower()   # lowercase
  # Remove punctuation and digits (this iterates over characters, not words)
  text = ''.join([ch for ch in text if (ch not in punctuation and not ch.isdigit())])

  # Obtain the stopwords from the corpus (a set gives fast membership tests)
  stp_wrds_eng = set(stopwords.words('english'))
  text = ' '.join([word for word in text.split() if word not in stp_wrds_eng])  # stopword removal

  # Remove the x-runs left behind by masked values such as "XX/XX/XX22".
  # Digits are already stripped at this point, so only pure x-runs can occur.
  string_to_remove = ['xx','xxxx','xxxxxx','xxxxxxxx','xxxxxxxxxx','xxxxxxxxxxxx']

  # Keep every token that is not one of the mask strings
  text = ' '.join([word for word in text.split() if word not in string_to_remove])

  # Lemmatization: POS-tag the tokens, then lemmatize each one with its WordNet POS
  tagged = nltk.pos_tag(text.split())
  # Create lemmatizer object
  lemmatizer = WordNetLemmatizer()
  lemma_sent = []
  for word, tag in tagged:
    new_tag = pos_tagger(tag)
    lemma = lemmatizer.lemmatize(word, new_tag)
    lemma_sent.append(lemma)
  # Return the token list; it is joined back into a sentence in a later cell
  return lemma_sent
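
The fixed list of x-strings in `preprocessing` only catches masks of the listed lengths. A minimal alternative sketch, assuming masked values always reduce to runs of the letter x once punctuation and digits are stripped, drops any token made of two or more x's with a regular expression. The helper name `strip_masked_tokens` is hypothetical and the sample sentence is made up.

```python
import re
from string import punctuation

# A token consisting solely of two or more x's, e.g. 'xx' or 'xxxxxx'
_MASK_TOKEN = re.compile(r'^x{2,}$')

def strip_masked_tokens(text):
    """Lowercase, drop punctuation and digits, then remove x-only mask tokens."""
    text = text.lower()
    # Character-level filter: removes '/', '.', digits and so on
    text = ''.join(ch for ch in text if ch not in punctuation and not ch.isdigit())
    # Token-level filter: keeps real words such as 'xerox' or 'tax'
    tokens = [tok for tok in text.split() if not _MASK_TOKEN.match(tok)]
    return ' '.join(tokens)

print(strip_masked_tokens("On XX/XX/XX22 I paid XXXX 1234 in tax."))
# on i paid in tax
```

Because the pattern is anchored at both ends, ordinary words that merely contain an x are left alone.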

In [6]:
def cropText(text,numCrop):
  """Drop the first numCrop characters of an ID string and return the rest as an int."""
  text = text[numCrop:]
  return int(text)

In [7]:
def clean_text():
  bankingComplaints.rename(columns = {'Complaint ID':'ComplaintID','Date Received':'DateReceived','Banking Product':'BankingProduct','Issue ID':'IssueID','Complaint Description':'ComplaintDescription','Bank Response':'BankResponse'}, inplace = True)
  bankingComplaints['State'] = bankingComplaints['State'].fillna(bankingComplaints['State'].mode()[0])
  bankingComplaints['ZIP'] = bankingComplaints['ZIP'].str.replace('XX','00')

  # Convert to numeric first, coercing errors to NaN, then fill NaNs, then convert to int
  bankingComplaints['ZIP'] = pd.to_numeric(bankingComplaints['ZIP'], errors='coerce').fillna(bankingComplaints['ZIP'].mode()[0]).astype(int)

  #apply preprocessing steps
  bankingComplaints['ComplaintDescription_cleaned'] = bankingComplaints['ComplaintDescription'].apply(lambda x: preprocessing(x))

  #format the datatype of the columns
  bankingComplaints['ComplaintID'] = bankingComplaints['ComplaintID'].apply(lambda x : cropText(x,3))
  bankingComplaints['IssueID'] = bankingComplaints['IssueID'].apply(lambda x : cropText(x,2))
  bankingComplaints['DateReceived'] = pd.to_datetime(bankingComplaints['DateReceived'])
  # Double quotes outside so the single-quoted column names parse on Python versions before 3.12
  print(f"date range Min : {bankingComplaints['DateReceived'].min()}, Date range Max: {bankingComplaints['DateReceived'].max()}")
  return

clean_text()

date range Min : 2023-01-01 00:00:00, Date range Max: 2023-10-21 00:00:00


In [8]:
bankingComplaints.head()

Unnamed: 0,ComplaintID,DateReceived,BankingProduct,Department,IssueID,ComplaintDescription,State,ZIP,BankResponse,ComplaintDescription_cleaned
0,76118977,2023-01-01,Checking or savings account,CASA,3510635,on XX/XX/XX22 I opened a safe balance account ...,California,30000,Closed with monetary relief,"[open, safe, balance, account, online, use, pa..."
1,98703933,2023-01-01,"Credit reporting, credit repair services, or o...",Credit Reports,3798538,There is an item from Bank of ABC on my credit...,California,30000,Closed with explanation,"[item, bank, abc, credit, report, belong, must..."
2,52036665,2023-01-01,Checking or savings account,CASA,3648593,On XX/XX/XX22 I found out that my account was ...,New York,30000,Closed with monetary relief,"[find, account, frozen, apparent, reason, go, ..."
3,62581335,2023-01-01,Credit card or prepaid card,Credit Cards,6999080,I've had a credit card for years with Bank of ...,California,30000,Closed with monetary relief,"[ive, credit, card, year, bank, abc, pay, bala..."
4,65731164,2023-01-01,Checking or savings account,CASA,3648593,This issue has to do with the way that Bank of...,New Jersey,30000,Closed with explanation,"[issue, way, bank, abc, account, link, bill, p..."


In [9]:
# Join each token list back into a single cleaned sentence string
bankingComplaints['ComplaintDescription_cleaned_str'] = bankingComplaints['ComplaintDescription_cleaned'].apply(lambda x: ' '.join(x))

<font size = 3 color = Magenta> <b> Convert the pre-processed text into a matrix of TF-IDF features for downstream modelling.

In [10]:
# Split Data into Training and Testing Sets
dataSet_TF_IDF = list(bankingComplaints['ComplaintDescription_cleaned_str'])
y = bankingComplaints['Department']
X_train, X_test, y_train, y_test = train_test_split(dataSet_TF_IDF, y, test_size=0.2, random_state=42)


#vectorizer object
tfidf_vectorizer = TfidfVectorizer(max_features=5000, lowercase=True, analyzer='word')

# Fit the vectorizer (vocabulary and IDF weights) on the training data only, then transform both splits
X_train_tfidf = tfidf_vectorizer.fit_transform(X_train)
X_test_tfidf = tfidf_vectorizer.transform(X_test)

vectorizedData = pd.DataFrame(X_train_tfidf.toarray(), columns= list(tfidf_vectorizer.get_feature_names_out()), index = [f'sent_{i}' for i in range(1,len(X_train)+1)])
vectorizedData

Unnamed: 0,aaa,aba,abandon,abc,abcc,abcmerill,abcn,abcns,abcs,abcxxxx,...,youll,young,youre,youve,yr,yrs,zelle,zero,zip,zone
sent_1,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
sent_2,0.0,0.0,0.0,0.048555,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
sent_3,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
sent_4,0.0,0.0,0.0,0.030261,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
sent_5,0.0,0.0,0.0,0.168590,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
sent_5604,0.0,0.0,0.0,0.053926,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
sent_5605,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
sent_5606,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
sent_5607,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
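
To make the weights in the matrix above less opaque, here is a minimal pure-Python sketch of the calculation. It assumes the scheme `TfidfVectorizer` applies by default (raw term counts, smoothed IDF `ln((1 + n) / (1 + df)) + 1`, then L2 normalisation of each row), but it tokenises on whitespace only, so it is not a drop-in replacement. The function name `tfidf_rows` and the three toy documents are made up.

```python
import math
from collections import Counter

def tfidf_rows(documents):
    """Return (vocabulary, rows) with smoothed-IDF, L2-normalised TF-IDF weights."""
    tokenised = [doc.split() for doc in documents]
    vocabulary = sorted({tok for doc in tokenised for tok in doc})
    n_docs = len(documents)
    # Document frequency: number of documents containing each term
    doc_freq = Counter(tok for doc in tokenised for tok in set(doc))
    # Smoothed IDF: ln((1 + n) / (1 + df)) + 1
    idf = {t: math.log((1 + n_docs) / (1 + doc_freq[t])) + 1 for t in vocabulary}

    rows = []
    for doc in tokenised:
        counts = Counter(doc)
        raw = [counts[t] * idf[t] for t in vocabulary]
        norm = math.sqrt(sum(w * w for w in raw)) or 1.0  # avoid dividing by zero
        rows.append([w / norm for w in raw])
    return vocabulary, rows

docs = ["account close fee", "credit card fee", "credit report error"]
vocab, rows = tfidf_rows(docs)
for doc, row in zip(docs, rows):
    weights = {t: round(w, 3) for t, w in zip(vocab, row) if w > 0}
    print(doc, '->', weights)
```

In the first toy document, `account` and `close` occur in one document each and so receive a higher weight than `fee`, which occurs in two.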


<font size = 3 color = Magenta> <b> Categorise each complaint so it can be passed on to the concerned product department. Department is the target variable for the classification model.

In [11]:
#  Train a Multinomial Naive Bayes Classifier
# Initialize MultinomialNB
naive_bayes_classifier = MultinomialNB()

# Train the classifier
naive_bayes_classifier.fit(X_train_tfidf, y_train)

# Make Predictions
y_pred = naive_bayes_classifier.predict(X_test_tfidf)

# Evaluate the Model
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy}")

# Example of predicting a new document
new_document = ["I have received multiple mail letters stating that my bank account was closed and checks have bounce in account I do not have open"]
new_document_tfidf = tfidf_vectorizer.transform(new_document)
prediction = naive_bayes_classifier.predict(new_document_tfidf)
print(f"Prediction for '{new_document[0]}': {prediction[0]}")

Accuracy: 0.6764076977904491
Prediction for 'I have received multiple mail letters stating that my bank account was closed and checks have bounce in account I do not have open': CASA
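
A single accuracy figure does not show which departments the classifier confuses. The sketch below computes precision and recall per class from two label lists in plain Python; scikit-learn's `classification_report`, already imported at the top of the notebook, reports the same kind of per-class breakdown. The helper name `per_class_precision_recall` is hypothetical and the labels are made up for illustration, not taken from the model's predictions.

```python
from collections import Counter

def per_class_precision_recall(y_true, y_pred):
    """Return {label: (precision, recall)} computed from two equal-length label lists."""
    true_pos = Counter()
    predicted = Counter(y_pred)   # how often each label was predicted
    actual = Counter(y_true)      # how often each label really occurs
    for t, p in zip(y_true, y_pred):
        if t == p:
            true_pos[t] += 1
    scores = {}
    for label in sorted(set(y_true) | set(y_pred)):
        precision = true_pos[label] / predicted[label] if predicted[label] else 0.0
        recall = true_pos[label] / actual[label] if actual[label] else 0.0
        scores[label] = (precision, recall)
    return scores

# Made-up labels for illustration only
y_true_demo = ['CASA', 'CASA', 'CASA', 'Credit Cards', 'Credit Cards', 'Credit Reports']
y_pred_demo = ['CASA', 'CASA', 'CASA', 'CASA', 'Credit Cards', 'CASA']

accuracy_demo = sum(t == p for t, p in zip(y_true_demo, y_pred_demo)) / len(y_true_demo)
demo_scores = per_class_precision_recall(y_true_demo, y_pred_demo)
print(f"accuracy: {accuracy_demo:.2f}")
for label, (precision, recall) in demo_scores.items():
    print(f"{label}: precision={precision:.2f} recall={recall:.2f}")
```

In this toy case the accuracy is 0.67, yet the `Credit Reports` class is never predicted, which the overall figure alone would hide.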


<font size = 4 color = Magenta> <b> Lexicon-based Sentiment Analysis (VADER)

In [12]:
#The compound score is a normalized value between -1 (most extreme negative) and +1 (most extreme positive), providing an overall sentiment rating
def getAnalysis(score):
    if score < 0:
        return 'Negative'
    elif score == 0:
        return 'Neutral'
    else:
        return 'Positive'


# Use VADER's SentimentIntensityAnalyzer to predict a sentiment label for each complaint.
# Create the analyzer once; building it loads the lexicon, so doing that per row is wasteful.
analyzer = SentimentIntensityAnalyzer()

def SentimentAnalyzer(text1):
  # obtain the polarity scores
  vs = analyzer.polarity_scores(text1)
  sentiment = getAnalysis(vs['compound'])

  return sentiment
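
`getAnalysis` treats any non-zero compound score as Positive or Negative. The VADER documentation suggests a neutral band of ±0.05 instead; a minimal sketch of that variant follows. The function name `label_from_compound` and its `threshold` parameter are hypothetical, and the scores in the loop are made up.

```python
def label_from_compound(score, threshold=0.05):
    """Label a VADER compound score using a symmetric neutral band."""
    if score >= threshold:
        return 'Positive'
    if score <= -threshold:
        return 'Negative'
    return 'Neutral'

for score in (-0.62, -0.03, 0.0, 0.04, 0.05, 0.71):
    print(score, label_from_compound(score))
```

With this rule, weakly scored complaints such as -0.03 or 0.04 are labelled Neutral rather than being pushed into a polar class.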

In [13]:
bankingComplaints['Sentiments'] = bankingComplaints['ComplaintDescription_cleaned_str'].apply(lambda x : SentimentAnalyzer(x))
# Drop the token-list column; columns= is required, otherwise drop() looks for a row label
bankingComplaints = bankingComplaints.drop(columns=['ComplaintDescription_cleaned'])

In [14]:
bankingComplaints.head()

Unnamed: 0,ComplaintID,DateReceived,BankingProduct,Department,IssueID,ComplaintDescription,State,ZIP,BankResponse,ComplaintDescription_cleaned_str,Sentiments
0,76118977,2023-01-01,Checking or savings account,CASA,3510635,on XX/XX/XX22 I opened a safe balance account ...,California,30000,Closed with monetary relief,open safe balance account online use payroll c...,Negative
1,98703933,2023-01-01,"Credit reporting, credit repair services, or o...",Credit Reports,3798538,There is an item from Bank of ABC on my credit...,California,30000,Closed with explanation,item bank abc credit report belong must remove...,Positive
2,52036665,2023-01-01,Checking or savings account,CASA,3648593,On XX/XX/XX22 I found out that my account was ...,New York,30000,Closed with monetary relief,find account frozen apparent reason go boa bra...,Neutral
3,62581335,2023-01-01,Credit card or prepaid card,Credit Cards,6999080,I've had a credit card for years with Bank of ...,California,30000,Closed with monetary relief,ive credit card year bank abc pay balance auto...,Negative
4,65731164,2023-01-01,Checking or savings account,CASA,3648593,This issue has to do with the way that Bank of...,New Jersey,30000,Closed with explanation,issue way bank abc account link bill pay part ...,Positive


# Summary

The Multinomial Naive Bayes classifier, trained on TF-IDF features, reached an accuracy of about 68% (0.676) on the held-out 20% test split when predicting the department. The sentiment labels come from VADER's SentimentIntensityAnalyzer, which is a lexicon- and rule-based model rather than a Transformer. The dataset contains no ground-truth sentiment labels, so no accuracy can be reported for the sentiment step.

# Conclusion

The preprocessing steps (text cleaning, stopword removal and lemmatization) and TF-IDF vectorization prepared the complaint text for the department classifier. The classifier assigns roughly two out of three test complaints to the correct department, which leaves clear room for improvement. The sentiment step tags every complaint as Positive, Neutral or Negative from VADER's compound score, giving a quick view of customer tone across departments, but these labels have not been validated against human judgments.