## Problem Statement

You need to build a model that is able to classify customer complaints based on the products/services. By doing so, you can segregate these tickets into their relevant categories and, therefore, help in the quick resolution of the issue.

You will be doing topic modelling on the <b>.json</b> data provided by the company. Since this data is not labelled, you need to apply NMF to analyse patterns and classify tickets into the following five clusters based on their products/services:

* Credit card / Prepaid card

* Bank account services

* Theft/Dispute reporting

* Mortgages/loans

* Others


With the help of topic modelling, you will be able to map each ticket onto its respective department/category. You can then use this data to train any supervised model such as logistic regression, decision tree or random forest. Using this trained model, you can classify any new customer complaint support ticket into its relevant department.

## Pipeline steps to be performed:

You need to perform the following eight major tasks to complete the assignment:

1. Data loading

2. Text preprocessing

3. Exploratory data analysis (EDA)

4. Feature extraction

5. Topic modelling

6. Model building using supervised learning

7. Model training and evaluation

8. Model inference

## Importing the necessary libraries

In [None]:
import json
import numpy as np
import pandas as pd
import re, nltk, spacy, string
import en_core_web_sm
nlp = en_core_web_sm.load()
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

from plotly.offline import plot
import plotly.graph_objects as go
import plotly.express as px

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from pprint import pprint

## Loading the data

The data is in JSON format and we need to convert it to a dataframe.
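
As a minimal sketch of what that conversion does, the snippet below flattens two made-up records with `pd.json_normalize`. The record contents are hypothetical and only mimic the nested `_source` layout seen in this dataset; nested keys become dotted column names.

```python
import pandas as pd

# Hypothetical records that imitate the nested layout of the complaints file;
# the values are invented for illustration only.
records = [
    {"_id": "1", "_source": {"product": "Mortgage", "complaint_what_happened": ""}},
    {"_id": "2", "_source": {"product": "Debt collection", "complaint_what_happened": "sample text"}},
]

# json_normalize flattens nested dictionaries into columns such as "_source.product"
flat = pd.json_normalize(records)
print(flat.columns.tolist())
print(flat.shape)
```

This is why the real dataframe ends up with column names prefixed by `_source.`.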

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [None]:
!ls  "/content/drive/My Drive/Automatic_Ticket_Classification_Project/Project_data/"

complaints-2021-05-14_08_16.json


In [None]:
# Path to the JSON data file
file_path = "/content/drive/My Drive/Automatic_Ticket_Classification_Project/Project_data/complaints-2021-05-14_08_16.json"

# Open the file and parse its JSON content
with open(file_path, "r") as f:
    data = json.load(f)

# Flatten the nested records into a dataframe
df = pd.json_normalize(data)


## Data preparation

In [None]:
# Inspect the dataframe to understand the given data.

df.head()

Unnamed: 0,_index,_type,_id,_score,_source.tags,_source.zip_code,_source.complaint_id,_source.issue,_source.date_received,_source.state,...,_source.company_response,_source.company,_source.submitted_via,_source.date_sent_to_company,_source.company_public_response,_source.sub_product,_source.timely,_source.complaint_what_happened,_source.sub_issue,_source.consumer_consent_provided
0,complaint-public-v2,complaint,3211475,0.0,,90301,3211475,Attempts to collect debt not owed,2019-04-13T12:00:00-05:00,CA,...,Closed with explanation,JPMORGAN CHASE & CO.,Web,2019-04-13T12:00:00-05:00,,Credit card debt,Yes,,Debt is not yours,Consent not provided
1,complaint-public-v2,complaint,3229299,0.0,Servicemember,319XX,3229299,Written notification about debt,2019-05-01T12:00:00-05:00,GA,...,Closed with explanation,JPMORGAN CHASE & CO.,Web,2019-05-01T12:00:00-05:00,,Credit card debt,Yes,Good morning my name is XXXX XXXX and I apprec...,Didn't receive enough information to verify debt,Consent provided
2,complaint-public-v2,complaint,3199379,0.0,,77069,3199379,"Other features, terms, or problems",2019-04-02T12:00:00-05:00,TX,...,Closed with explanation,JPMORGAN CHASE & CO.,Web,2019-04-02T12:00:00-05:00,,General-purpose credit card or charge card,Yes,I upgraded my XXXX XXXX card in XX/XX/2018 and...,Problem with rewards from credit card,Consent provided
3,complaint-public-v2,complaint,2673060,0.0,,48066,2673060,Trouble during payment process,2017-09-13T12:00:00-05:00,MI,...,Closed with explanation,JPMORGAN CHASE & CO.,Web,2017-09-14T12:00:00-05:00,,Conventional home mortgage,Yes,,,Consent not provided
4,complaint-public-v2,complaint,3203545,0.0,,10473,3203545,Fees or interest,2019-04-05T12:00:00-05:00,NY,...,Closed with explanation,JPMORGAN CHASE & CO.,Referral,2019-04-05T12:00:00-05:00,,General-purpose credit card or charge card,Yes,,Charged too much interest,


In [None]:
#print the column names
df.columns

Index(['_index', '_type', '_id', '_score', '_source.tags', '_source.zip_code',
       '_source.complaint_id', '_source.issue', '_source.date_received',
       '_source.state', '_source.consumer_disputed', '_source.product',
       '_source.company_response', '_source.company', '_source.submitted_via',
       '_source.date_sent_to_company', '_source.company_public_response',
       '_source.sub_product', '_source.timely',
       '_source.complaint_what_happened', '_source.sub_issue',
       '_source.consumer_consent_provided'],
      dtype='object')

In [None]:
#Assign new column names


In [None]:
print(df[["_source.complaint_what_happened", "_source.product"]])


                         _source.complaint_what_happened  \
0                                                          
1      Good morning my name is XXXX XXXX and I apprec...   
2      I upgraded my XXXX XXXX card in XX/XX/2018 and...   
3                                                          
4                                                          
...                                                  ...   
78308                                                      
78309  On Wednesday, XX/XX/XXXX I called Chas, my XXX...   
78310  I am not familiar with XXXX pay and did not un...   
78311  I have had flawless credit for 30 yrs. I've ha...   
78312  Roughly 10+ years ago I closed out my accounts...   

                   _source.product  
0                  Debt collection  
1                  Debt collection  
2      Credit card or prepaid card  
3                         Mortgage  
4      Credit card or prepaid card  
...                            ...  
78308  Checking or s

It appears that the customer complaints are in the column "_source.complaint_what_happened" and the product each complaint relates to is in the column "_source.product". The other columns do not look relevant to this exercise, so we will rename only these two columns.



In [None]:
df.rename(columns={'_source.complaint_what_happened':'complaint_what_happened', '_source.product':'tag'}, inplace=True)

In [None]:
#Assign nan in place of blanks in the complaints column
df['complaint_what_happened'] = df['complaint_what_happened'].replace("", np.nan)

In [None]:
#Remove all rows where complaints column is nan
df.dropna(subset=['complaint_what_happened'], inplace=True)

In [None]:
# Checking the shape of the dataframe again
df.shape

(21072, 22)

Conclusion: The number of rows has dropped from 78313 to 21072 because the rows with blank customer complaints were removed.

## Prepare the text for topic modeling

Once you have removed all the blank complaints, you need to:

* Make the text lowercase
* Remove text in square brackets
* Remove punctuation
* Remove words containing numbers


Once you have done these cleaning operations you need to perform the following:
* Lemmatize the texts
* Extract the POS tags of the lemmatized text and remove all words whose tag is not NN (that is, keep only tokens with tag == "NN").


In [None]:
# Write your function here to clean the text and remove all the unnecessary elements.


def preprocess_text(text):
    """
    Cleans the input text by:
    - Converting it to lowercase
    - Removing text within square brackets
    - Eliminating punctuation
    - Filtering out words that contain numbers
    """

    # Convert text to lowercase
    cleaned_text = text.lower()

    # Remove content inside square brackets
    cleaned_text = re.sub(r'\[.*?\]', '', cleaned_text)

    # Remove punctuation marks
    cleaned_text = re.sub(f"[{re.escape(string.punctuation)}]", '', cleaned_text)

    # Remove words containing numbers
    cleaned_text = re.sub(r'\b\w*\d\w*\b', '', cleaned_text)

    return cleaned_text


In [None]:
# Load the English NLP model
nlp = spacy.load("en_core_web_sm")

def extract_nouns(text):
    """
    This function:
    - Lemmatizes the text
    - Keeps only tokens whose coarse POS tag is NOUN (singular and plural nouns)
    """

    # Process text with spaCy
    doc = nlp(text)

    # Extract lemmas of tokens tagged as nouns (pos_ == "NOUN")
    noun_lemmas = [token.lemma_ for token in doc if token.pos_ == "NOUN"]

    # Return the processed text as a string
    return " ".join(noun_lemmas)


In [None]:
#Create a dataframe('df_clean') that will have only the complaints and the lemmatized complaints

# Create a new DataFrame for cleaned complaints
df_clean = pd.DataFrame()

# Apply the text preprocessing function
df_clean["complaint_what_happened"] = df["complaint_what_happened"].dropna().apply(preprocess_text)

# Apply the lemmatization + noun extraction function
df_clean["lemmatized_complaint"] = df_clean["complaint_what_happened"].apply(extract_nouns)

# Display the first few rows of the new DataFrame
df_clean.head()



Unnamed: 0,complaint_what_happened,lemmatized_complaint
1,good morning my name is xxxx xxxx and i apprec...,morning name stop bank cardmember service debt...
2,i upgraded my xxxx xxxx card in and was told ...,xxxx card agent anniversary date agent informa...
10,chase card was reported on however fraudulent...,card application identity consent service cred...
11,on while trying to book a xxxx xxxx ticket ...,xxxx ticket offer ticket reward card informati...
14,my grand son give me check for i deposit it i...,son check chase account fund chase bank accoun...


In [None]:
df_clean

Unnamed: 0,complaint_what_happened,lemmatized_complaint
1,good morning my name is xxxx xxxx and i apprec...,morning name stop bank cardmember service debt...
2,i upgraded my xxxx xxxx card in and was told ...,xxxx card agent anniversary date agent informa...
10,chase card was reported on however fraudulent...,card application identity consent service cred...
11,on while trying to book a xxxx xxxx ticket ...,xxxx ticket offer ticket reward card informati...
14,my grand son give me check for i deposit it i...,son check chase account fund chase bank accoun...
...,...,...
78303,after being a chase card customer for well ove...,chase card customer decade solicitation credit...
78309,on wednesday xxxxxxxx i called chas my xxxx xx...,xxxx credit card provider claim purchase prote...
78310,i am not familiar with xxxx pay and did not un...,xxxx pay risk consumer chase bank app chase ye...
78311,i have had flawless credit for yrs ive had ch...,credit yrs chase credit card chase freedom pro...


In [1]:
#Write your function to extract the POS


import spacy
import pandas as pd

# Load the spaCy model
nlp = spacy.load("en_core_web_sm")

def extract_singular_nouns(text):
    """
    Extracts only singular nouns (NN) from the given text using spaCy.
    """

    # Process the text with spaCy
    doc = nlp(text)

    # Keep only words tagged as singular nouns (NN)
    noun_only_text = " ".join([token.text for token in doc if token.tag_ == "NN"])

    return noun_only_text

# Apply the function to filter only singular nouns and store in a new column
df_clean["complaint_POS_removed"] = df_clean["lemmatized_complaint"].apply(extract_singular_nouns)

# Display the first few rows of the updated DataFrame
df_clean.head()



NameError: name 'df_clean' is not defined

In [2]:
#The clean dataframe should now contain the raw complaint, lemmatized complaint and the complaint after removing POS tags.
df_clean

NameError: name 'df_clean' is not defined

## Exploratory data analysis to get familiar with the data.

Write the code in this task to perform the following:

*   Visualise the data according to the 'Complaint' character length
*   Using a word cloud, find the top 40 words by frequency among all the complaints after processing the text
*   Find the top unigrams, bigrams and trigrams by frequency among all the complaints after processing the text.




In [None]:
# Write your code here to visualise the data according to the 'Complaint' character length

import matplotlib.pyplot as plt
import seaborn as sns

# Calculate character length of each complaint
df_clean["complaint_length"] = df_clean["complaint_POS_removed"].str.len()

# Plot distribution
plt.figure(figsize=(12,6))
sns.histplot(df_clean["complaint_length"], bins=30, kde=True, color="purple")
plt.title("Distribution of Complaint Lengths", fontsize=14)
plt.xlabel("Number of Characters", fontsize=12)
plt.ylabel("Frequency", fontsize=12)
plt.show()


#### Find the top 40 words by frequency among all the complaints after processing the text.

In [None]:
#Using a word cloud, find the top 40 words by frequency among all the complaints after processing the text
import matplotlib.pyplot as plt
from wordcloud import WordCloud, STOPWORDS

# Define a set of words to exclude
stopwords = set(STOPWORDS)

# Create a WordCloud with a limit of 40 words
wordcloud = WordCloud(
    background_color="white",
    stopwords=stopwords,
    max_words=40,
    max_font_size=40,
    random_state=42
).generate(" ".join(df_clean["complaint_POS_removed"]))

# Plot and display the word cloud
plt.figure(figsize=(12, 8))
plt.imshow(wordcloud, interpolation="bilinear")
plt.axis("off")  # Hide axis for cleaner look
plt.title("Word Cloud of Top 40 Words", fontsize=16)
plt.show()


In [None]:
#Removing -PRON- from the text corpus
df_clean['Complaint_clean'] = df_clean['complaint_POS_removed'].str.replace('-PRON-', '', regex=False)

In [None]:
df_clean.head()

#### Find the top unigrams,bigrams and trigrams by frequency among all the complaints after processing the text.

In [None]:
#Write your code here to find the top 30 unigram frequency among the complaints in the cleaned dataframe (df_clean).


def get_top_n_grams(text_data, n=1, top_k=10):
    """
    Extracts the top-k most frequent n-grams (unigrams, bigrams, or trigrams) from the given text.

    Parameters:
    -----------
    text_data : list or Pandas Series
        The corpus of texts from which to extract n-grams.
    n : int
        Size of the n-gram (1 = unigrams, 2 = bigrams, 3 = trigrams, etc.).
    top_k : int
        Number of top n-grams to retrieve.

    Returns:
    --------
    list of tuples (ngram, frequency)
        Each tuple contains the n-gram and its overall frequency in the corpus.
    """

    vectorizer = CountVectorizer(ngram_range=(n, n), stop_words='english')
    transformed_data = vectorizer.fit_transform(text_data)

    # Sum the counts of each n-gram across all documents
    word_counts = transformed_data.sum(axis=0)

    # Map each n-gram to its frequency
    freq_map = [
        (ngram, word_counts[0, idx])
        for ngram, idx in vectorizer.vocabulary_.items()
    ]

    # Sort by frequency in descending order
    sorted_freq_map = sorted(freq_map, key=lambda x: x[1], reverse=True)

    # Return the top_k n-grams
    return sorted_freq_map[:top_k]

# Example usage: extracting the top 30 unigrams, bigrams, and trigrams
top_30_unigrams = get_top_n_grams(df_clean["complaint_POS_removed"], n=1, top_k=30)
top_30_bigrams = get_top_n_grams(df_clean["complaint_POS_removed"], n=2, top_k=30)
top_30_trigrams = get_top_n_grams(df_clean["complaint_POS_removed"], n=3, top_k=30)

# Converting results to DataFrames for easy viewing
df_unigrams = pd.DataFrame(top_30_unigrams, columns=["Unigram", "Frequency"])
df_bigrams = pd.DataFrame(top_30_bigrams, columns=["Bigram", "Frequency"])
df_trigrams = pd.DataFrame(top_30_trigrams, columns=["Trigram", "Frequency"])

print("Top 30 Unigrams:\n", df_unigrams)
print("\nTop 30 Bigrams:\n", df_bigrams)
print("\nTop 30 Trigrams:\n", df_trigrams)



In [None]:
#Print the top 10 words in the unigram frequency

# Display the top 10 unigrams
print("Top 10 Unigrams:")
print(df_unigrams.head(10))


In [None]:
#Write your code here to find the top 30 bigram frequency among the complaints in the cleaned dataframe (df_clean).
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

def get_top_bigrams(text_data, top_k=30):
    """
    Returns the most common bigrams (two-word sequences) in the given text corpus.

    Parameters:
    -----------
    text_data : list or pd.Series
        Collection of documents (strings) to analyze.
    top_k : int
        The number of top bigrams to retrieve.

    Returns:
    --------
    list of tuples [(bigram, frequency), ...]
        Sorted in descending order by frequency.
    """

    # Initialize CountVectorizer for bigrams
    vectorizer = CountVectorizer(ngram_range=(2, 2), stop_words="english")
    transformed_data = vectorizer.fit_transform(text_data)

    # Sum the counts of each bigram
    counts = transformed_data.sum(axis=0)

    # Create a list of (bigram, frequency) pairs
    bigram_freqs = [
        (bigram, counts[0, idx])
        for bigram, idx in vectorizer.vocabulary_.items()
    ]

    # Sort in descending order by frequency
    sorted_bigram_freqs = sorted(bigram_freqs, key=lambda x: x[1], reverse=True)

    return sorted_bigram_freqs[:top_k]

# Use the function to get the top 30 bigrams in the processed text
bigrams_top_30 = get_top_bigrams(df_clean["complaint_POS_removed"].values.astype("U"), top_k=30)

# Convert the result into a DataFrame
df_bigrams = pd.DataFrame(bigrams_top_30, columns=["Bigram", "Frequency"])

# Display the top 30 bigrams
print(df_bigrams)


In [None]:
#Print the top 10 words in the bigram frequency
print("\nTop 10 Bigram Words:")
print(df_bigrams["Bigram"].head(10))

In [None]:
#Write your code here to find the top 30 trigram frequency among the complaints in the cleaned dataframe (df_clean).
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

def get_top_trigrams(corpus, top_k=30):
    """
    Extracts the top_k most frequent trigrams from a given text corpus.

    Parameters:
      corpus (list or pd.Series): The text data to analyze.
      top_k (int): The number of top trigrams to return.

    Returns:
      list of tuples: Each tuple contains a trigram and its frequency.
    """
    # Initialize CountVectorizer for trigrams (3-word sequences)
    vectorizer = CountVectorizer(ngram_range=(3, 3), stop_words='english')
    X = vectorizer.fit_transform(corpus)

    # Sum up the counts for each trigram
    trigram_counts = X.sum(axis=0)

    # Map trigrams to their frequencies
    trigram_freqs = [
        (trigram, trigram_counts[0, idx])
        for trigram, idx in vectorizer.vocabulary_.items()
    ]

    # Sort the list by frequency in descending order
    trigram_freqs_sorted = sorted(trigram_freqs, key=lambda x: x[1], reverse=True)

    return trigram_freqs_sorted[:top_k]

# Apply the function to the processed text column in df_clean
top_30_trigrams = get_top_trigrams(df_clean['complaint_POS_removed'].values.astype('U'), top_k=30)

# Convert the results to a DataFrame for easy viewing
df_trigrams = pd.DataFrame(top_30_trigrams, columns=['trigram', 'count'])

# Print the top 30 trigrams
print(df_trigrams)


In [None]:
#Print the top 10 words in the trigram frequency

print("Top 10 Trigrams with Frequency:")
print(df_trigrams.head(10))


## The personal details of customers have been masked in the dataset with xxxx. Let's remove the masked text, as it is of no use for our analysis

In [None]:
df_clean['Complaint_clean'] = df_clean['Complaint_clean'].str.replace('xxxx','')

In [None]:
#All masked text has been removed
df_clean

## Feature Extraction
Convert the raw texts to a matrix of TF-IDF features

**max_df** is used for removing terms that appear too frequently, also known as "corpus-specific stop words"
max_df = 0.95 means "ignore terms that appear in more than 95% of the complaints"

**min_df** is used for removing terms that appear too infrequently
min_df = 2 means "ignore terms that appear in fewer than 2 complaints"
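
The effect of the two thresholds can be seen on a tiny made-up corpus. This is only an illustrative sketch: `toy_corpus` and `toy_tfidf` are hypothetical names, and the four short documents stand in for real complaints.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical mini corpus: "account" occurs in every document,
# "card" and "loan" in two documents each, all other terms in one.
toy_corpus = [
    "account card payment",
    "account loan mortgage",
    "account card fraud",
    "account loan dispute",
]

# max_df=0.95 drops terms present in more than 95% of documents ("account");
# min_df=2 drops terms present in fewer than 2 documents.
toy_tfidf = TfidfVectorizer(max_df=0.95, min_df=2)
toy_tfidf.fit(toy_corpus)

kept = sorted(toy_tfidf.vocabulary_)
all_terms = {word for doc in toy_corpus for word in doc.split()}
dropped = sorted(all_terms - set(kept))

print("kept:", kept)
print("dropped:", dropped)
```

Only the mid-frequency terms survive, which is the behaviour wanted before topic modelling: very common terms do not separate topics and very rare ones add noise.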

In [None]:
#Initialise the TfidfVectorizer with the max_df and min_df values described above
tfidf = TfidfVectorizer(max_df=0.95, min_df=2)



#### Create a document term matrix using fit_transform

Each stored entry of the document term matrix is a (complaint_id, token_id) pair together with its tf-idf score.
Pairs that are not stored have a tf-idf score of 0.
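
A rough way to look at those stored entries, sketched on a hypothetical three-document corpus (the names `toy_corpus`, `toy_tfidf` and `toy_dtm` are illustrative and not part of the assignment):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

toy_corpus = ["card payment card", "loan payment", "mortgage loan"]

toy_tfidf = TfidfVectorizer()
toy_dtm = toy_tfidf.fit_transform(toy_corpus)  # sparse matrix: documents x terms

# Invert the vocabulary so that a column index can be shown as a term
id_to_term = {idx: term for term, idx in toy_tfidf.vocabulary_.items()}

# Walk over the stored (row, column, value) triples only
coo = toy_dtm.tocoo()
for doc_id, token_id, score in zip(coo.row, coo.col, coo.data):
    print((int(doc_id), int(token_id)), id_to_term[int(token_id)], round(float(score), 3))

print("shape:", toy_dtm.shape, "stored entries:", toy_dtm.nnz)
```

Every (document, term) pair that is not printed is an implicit zero, which is what keeps the matrix small in memory.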

In [None]:
#Create the Document Term Matrix by transforming the cleaned complaints column present in df_clean.
dtm = tfidf.fit_transform(df_clean['Complaint_clean'])


## Topic Modelling using NMF

Non-Negative Matrix Factorization (NMF) is an unsupervised technique, so the model is not trained on any topic labels. NMF decomposes (or factorizes) the high-dimensional document term matrix into two lower-dimensional matrices: one holding document-topic weights and one holding topic-term weights. All values in both matrices are non-negative.
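
To make the factorization concrete, here is a small sketch on an invented 4 x 5 matrix. The matrix `V` and the choice of 2 components are assumptions for illustration; in the assignment the input is the TF-IDF document term matrix.

```python
import numpy as np
from sklearn.decomposition import NMF

# Invented non-negative "document term" matrix: 4 documents x 5 terms
V = np.array([
    [3, 2, 0, 0, 1],
    [2, 3, 0, 0, 0],
    [0, 0, 4, 3, 1],
    [0, 0, 3, 4, 0],
], dtype=float)

toy_nmf = NMF(n_components=2, random_state=40, max_iter=500)
W = toy_nmf.fit_transform(V)  # document-topic weights
H = toy_nmf.components_       # topic-term weights

print("W shape:", W.shape, "H shape:", H.shape)
print("all non-negative:", bool((W >= 0).all() and (H >= 0).all()))
print("approximate reconstruction W @ H:")
print(np.round(W @ H, 1))
```

The product `W @ H` only approximates `V`; the rows of `H` are what get read as topics.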

In this task you have to perform the following:

* Find the best number of clusters
* Apply the best number to create word clusters
* Inspect & validate the correctness of each cluster with respect to the complaints
* Correct the labels if needed
* Map the clusters to topics/cluster names
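
The steps above can be sketched end to end on a toy corpus. Everything here is hypothetical (four invented documents, 2 topics, top 3 words instead of 15); it only shows one common way to list the strongest words per topic and pick the dominant topic per document.

```python
from sklearn.decomposition import NMF
from sklearn.feature_extraction.text import TfidfVectorizer

# Invented documents with two clearly separated themes
toy_corpus = [
    "credit card charge fee card",
    "card reward credit card",
    "mortgage loan payment home",
    "loan mortgage escrow payment",
]

toy_tfidf = TfidfVectorizer()
toy_dtm = toy_tfidf.fit_transform(toy_corpus)

toy_nmf = NMF(n_components=2, random_state=40, max_iter=500)
doc_topic = toy_nmf.fit_transform(toy_dtm)  # document-topic weights

# Terms ordered by their column index in the matrix
terms = sorted(toy_tfidf.vocabulary_, key=toy_tfidf.vocabulary_.get)

# Strongest words of each topic, read from the topic-term weights
for topic_id, weights in enumerate(toy_nmf.components_):
    top_ids = weights.argsort()[::-1][:3]
    print("topic", topic_id, [terms[i] for i in top_ids])

# Dominant topic of each document = index of its largest weight
toy_topics = doc_topic.argmax(axis=1)
print(toy_topics)
```

Which integer ends up representing which theme is arbitrary, so the word lists have to be inspected before any names are attached.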

In [None]:
from sklearn.decomposition import NMF

## Manual Topic Modeling
You need to take a trial-and-error approach to find the best number of topics for your NMF model.

The only parameter that is required is the number of components i.e. the number of topics we want. This is the most crucial step in the whole topic modeling process and will greatly affect how good your final topics are.
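
One possible way to organise the trial and error is to fit several models and compare them. The sketch below uses a random matrix as a stand-in for the TF-IDF matrix and the fitted model's `reconstruction_err_` as a rough signal; both the candidate range and the use of this metric are assumptions, not a prescribed method.

```python
import numpy as np
from sklearn.decomposition import NMF

# Random non-negative stand-in for a TF-IDF document term matrix
rng = np.random.default_rng(40)
toy_dtm = rng.random((30, 12))

errors = {}
for k in range(2, 7):
    model = NMF(n_components=k, random_state=40, max_iter=500)
    model.fit(toy_dtm)
    # Frobenius norm of the difference between the data and W @ H
    errors[k] = model.reconstruction_err_

for k, err in errors.items():
    print(k, round(float(err), 3))
```

The error usually keeps falling as more components are added, so it cannot pick the number of topics on its own; reading the top words of each candidate model remains the deciding check.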

In [None]:
#Load your nmf_model with the n_components i.e 5
num_topics = 5

#keep the random_state =40
nmf_model = NMF(n_components=num_topics, random_state=40)

In [None]:
nmf_model.fit(dtm)
# get_feature_names() was removed in newer scikit-learn releases; use get_feature_names_out()
len(tfidf.get_feature_names_out())

In [None]:
#Print the Top15 words for each of the topics


In [None]:
#Create the best topic for each complaint in terms of integer value 0,1,2,3 & 4



In [None]:
#Assign the best topic to each of the complaints in Topic Column

df_clean['Topic'] = nmf_model.transform(dtm).argmax(axis=1)

In [None]:
df_clean.head()

In [None]:
#Print the first 5 complaints for each of the Topics
#Use a separate variable so that df_clean keeps all of its rows
df_sample = df_clean.groupby('Topic').head(5)
df_sample.sort_values('Topic')

#### After evaluating the mapping, if the topics assigned are correct then assign these names to the relevant topic:
* Bank Account services
* Credit card or prepaid card
* Theft/Dispute Reporting
* Mortgage/Loan
* Others
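
A minimal sketch of the mapping step follows. The pairing of numbers to names in `example_topic_names` is purely hypothetical; the real pairing has to be read off the top words of each topic in your own NMF run.

```python
import pandas as pd

# Hypothetical pairing of topic numbers and names, for illustration only
example_topic_names = {
    0: "Bank Account services",
    1: "Credit card or prepaid card",
    2: "Others",
    3: "Theft/Dispute Reporting",
    4: "Mortgage/Loan",
}

toy = pd.DataFrame({"Topic": [0, 3, 4, 1, 2]})
toy["Topic_name"] = toy["Topic"].map(example_topic_names)
print(toy)

# Any topic number missing from the dictionary would become NaN
unmapped = int(toy["Topic_name"].isna().sum())
print("unmapped rows:", unmapped)
```

Note that mapping with an empty dictionary turns every value into NaN, so the dictionary has to be filled in before the mapping cell is run.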

In [None]:
#Create the dictionary of Topic names and Topics

Topic_names = {   }
#Replace Topics with Topic Names
df_clean['Topic'] = df_clean['Topic'].map(Topic_names)

In [None]:
df_clean

## Supervised model to predict any new complaints to the relevant Topics.

You have now built the model that creates the topic for each complaint. In the section below, you will use these topics to classify any new complaints.

Since you will be using a supervised learning technique, we have to convert the topic names to numbers (the models are trained on numeric class labels)

In [None]:
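#Create the dictionary again of Topic names and Topics
#(the numbers are arbitrary labels; the names must match those assigned above)
Topic_names = {'Bank Account services': 0, 'Credit card or prepaid card': 1, 'Theft/Dispute Reporting': 2, 'Mortgage/Loan': 3, 'Others': 4}
#Replace Topic Names with Topic numbers
df_clean['Topic'] = df_clean['Topic'].map(Topic_names)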

In [None]:
df_clean

In [None]:
#Keep the columns "complaint_what_happened" & "Topic" only in the new dataframe --> training_data
training_data = df_clean[['complaint_what_happened', 'Topic']]

In [None]:
training_data

#### Apply the supervised models on the training data created. In this process, you have to do the following:
* Create the vector counts using CountVectorizer
* Transform the word vector to tf-idf
* Create the train & test data using the train_test_split on the tf-idf & topics
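
The three steps above can be sketched as follows. The texts and labels are invented, and the 75/25 split with `random_state=40` is an assumption chosen for the example rather than a requirement of the assignment.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.model_selection import train_test_split

# Invented complaints with two numeric topic labels
toy_texts = [
    "card charged twice", "credit card fee dispute",
    "card reward missing", "late fee on card",
    "mortgage payment escrow", "loan interest rate",
    "home loan modification", "mortgage statement error",
]
toy_labels = [0, 0, 0, 0, 1, 1, 1, 1]

# Step 1: raw term counts
count_vect = CountVectorizer()
X_counts = count_vect.fit_transform(toy_texts)

# Step 2: re-weight the counts to tf-idf
tfidf_transformer = TfidfTransformer()
X_tfidf = tfidf_transformer.fit_transform(X_counts)

# Step 3: stratified train/test split on the tf-idf features and labels
X_train, X_test, y_train, y_test = train_test_split(
    X_tfidf, toy_labels, test_size=0.25, random_state=40, stratify=toy_labels
)
print(X_train.shape, X_test.shape)
```

Fitting the vectoriser before splitting follows the order given in the list; fitting it on the training split only (for example inside a scikit-learn `Pipeline`) is a stricter alternative that keeps test-set statistics out of the features.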


In [None]:

#Write your code to get the Vector count


#Write your code here to transform the word vector to tf-idf

You have to try at least 3 models on the train & test data from these options:
* Logistic regression
* Decision Tree
* Random Forest
* Naive Bayes (optional)

**Using the required evaluation metrics, judge the models you tried and select the best-performing one**
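
A comparison loop might look like the sketch below. It runs on synthetic five-class data from `make_classification` instead of the complaint features, and accuracy and macro F1 are assumed as metrics because the notebook does not name the required ones.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the tf-idf features and the five topic labels
X, y = make_classification(
    n_samples=300, n_features=20, n_informative=8,
    n_classes=5, n_clusters_per_class=1, random_state=40,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=40, stratify=y
)

models = {
    "Logistic regression": LogisticRegression(max_iter=1000),
    "Decision tree": DecisionTreeClassifier(random_state=40),
    "Random forest": RandomForestClassifier(n_estimators=100, random_state=40),
}

scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    preds = model.predict(X_test)
    scores[name] = {
        "accuracy": accuracy_score(y_test, preds),
        "macro_f1": f1_score(y_test, preds, average="macro"),
    }
    print(name, {k: round(v, 3) for k, v in scores[name].items()})

# Pick the model with the highest macro F1 on the held-out data
best_model_name = max(scores, key=lambda n: scores[n]["macro_f1"])
print("best:", best_model_name)
```

Macro F1 weights all five classes equally, which matters if some topics have far fewer complaints than others; the same fitted model can then be used with `predict` on the vectorised text of a new complaint.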

In [None]:
# Write your code here to build any 3 models and evaluate them using the required metrics



