## 📊 Customer Complaint Intelligence: NLP & Trend Analysis

Project Overview:

In the Fintech industry, understanding why customers are calling support is crucial for reducing churn and improving products. This project analyzes a dataset of customer complaints, call logs, and user information to identify the root causes of dissatisfaction.

Objective:

The goal is to move beyond simple ticket counting and use Natural Language Processing (NLP) to extract specific recurring phrases (e.g., "hidden charges" vs. just "charges"). This allows us to pinpoint exactly what is going wrong and when.


Technical Workflow

- Data Ingestion: Connecting to a local SQL database using `sqlalchemy` to retrieve live complaint data.
- Data Transformation: Merging three distinct tables (`complaints`, `call_logs`, `user_info`) into a single analytical dataset.
- NLP & Text Mining:
  - Removing "stop words" (common noise words like "the", "is", "and").
  - Generating bigrams (2-word combinations) to capture context (e.g., turning "payment" into "receive payment").
- Trend Analysis: Grouping data by month to see how specific issues evolve over time.

In [43]:
!pip install matplotlib



In [42]:
import pandas as pb 
import numpy as nu
import sqlalchemy 
from collections import Counter
import nltk
import matplotlib.pyplot as plt

from nltk.corpus import stopwords
nltk.download('stopwords')


from urllib.parse import quote_plus

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\yashp\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [2]:
password = quote_plus("Yash@1234")
engine = sqlalchemy.create_engine(f"mysql+pymysql://root:{password}@127.0.0.1:3306/project_1")

First, I clean the data by replacing the null values.

In [3]:
fd = pb.read_sql_table("complaints" , engine , schema="project_1")

fd = fd.fillna("unknown_issue")
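
One caveat with a blanket `fillna("unknown_issue")` is that the placeholder is written into every column that contains nulls, including numeric ones. The sketch below uses a made-up two-row frame (hypothetical columns and values, not the project database, with pandas imported under the conventional alias `pd`) to show a per-column fill via a dict instead:

```python
import pandas as pd

# Hypothetical toy frame; columns and values are for illustration only.
toy = pd.DataFrame({
    "case_id": ["c1", "c2"],
    "issue_text": ["hidden charges", None],
    "call_duration": [45.5, None],
})

# Fill only the text column; numeric nulls stay NaN, so the dtype stays float.
filled = toy.fillna({"issue_text": "unknown_issue"})
print(filled)
print(filled.dtypes)
```

Whether numeric gaps should stay as NaN or get their own placeholder depends on how the column is used later.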

In [4]:
pf = pb.read_sql_table("call_logs" , engine , schema="project_1")
pf = pf.fillna("issue not given")


In [5]:
df = pb.read_sql_table("user_info" , engine , schema="project_1")
df = df.drop_duplicates(subset=['customer_id'])


## Merging the three tables into one

In [6]:
temp = fd.merge(df , how = "outer")

In [7]:
final_table = temp.merge(pf , how = "outer")
final_table

Unnamed: 0,case_id,customer_id,issue_text,created_at,category,name,city,call_id,issue_summary,agent_id,call_duration
0,4-938000036198,36198,unknown_issue,2024-01-15 09:23:45,Authentication Failure,Aarav Sharma,Mumbai,p1000000000993233001,issue not given,101,45.50
1,4-938000041257,41257,unknown_issue,2024-01-16 14:35:12,Authentication Failure,Priya Patel,Delhi,p1000000000993233002,issue not given,102,32.25
2,4-938000052189,52189,unknown_issue,2024-01-17 11:47:33,Disconnected Call,Rohan Kumar,Bangalore,p1000000000993233003,issue not given,103,28.75
3,4-938000067423,67423,unknown_issue,2024-01-18 16:52:19,Authentication Failure,Ananya Gupta,Chennai,p1000000000993233004,issue not given,104,51.20
4,4-938000078365,78365,unknown_issue,2024-01-19 10:15:27,Disconnected Call,Vikram Singh,Kolkata,p1000000000993233005,issue not given,105,36.80
...,...,...,...,...,...,...,...,...,...,...,...
92,4-938000958714,958714,issue with business account,2024-04-17 10:35:53,Business Account,Neha Sharma,Bhopal,p1000000000993233093,Business account administrative control panel ...,193,407.35
93,4-938000969825,969825,made a wrong payment,2024-04-18 13:50:29,Transaction Issue,Vikram Patel,Indore,p1000000000993233094,Incorrect payment transfer with funds sent to ...,194,268.00
94,4-938000970936,970936,money debited but bill not paid,2024-04-19 08:05:46,Transaction Issue,Anjali Mehta,Thane,p1000000000993233095,Bill payment completed but utility service not...,195,311.65
95,4-938000981047,981047,unable to make payment,2024-04-20 15:25:32,Payment Issue,Raj Malhotra,Navi Mumbai,p1000000000993233096,External application compatibility issue not r...,196,183.30


The data is already clean, so nothing else needs to change here, although data cleaning remains an important step. The code above merges the three tables into one. I used `merge` because it is easy to use.
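
When `merge` is called without `on`, pandas joins on every column name the two frames share. Below is a hedged sketch on made-up rows (the values are illustrative only) that names the key explicitly and uses `indicator=True` to show which side each row came from:

```python
import pandas as pd

# Made-up stand-ins for the complaints and user tables.
complaints = pd.DataFrame({
    "customer_id": [1, 2, 3],
    "issue_text": ["hidden charges", "cancel autopay", "bank account"],
})
users = pd.DataFrame({
    "customer_id": [1, 2, 4],
    "city": ["Mumbai", "Delhi", "Pune"],
})

# Explicit join key; the `_merge` column reports both / left_only / right_only.
merged = complaints.merge(users, on="customer_id", how="outer", indicator=True)
print(merged)
```

Rows marked `left_only` or `right_only` are the ones an outer join pads with nulls, which makes unmatched customers easy to spot.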

Below, I identify the issues most customers are facing, to get an idea of where the problems lie.

The data in the table is real data from a top fintech company in India. The issue texts are not the original ones; I modified them for this project.

These are the terms most commonly used by customers:


### What we'll do next
We'll extract the most frequently used *single words* and *bigrams* from `issue_text` after removing stop words, explain what each cell does and then visualize the results (top keywords, top bigrams).
- Step 1: Remove stopwords and get single-word counts.
- Step 2: Build bigrams (pairs of adjacent meaningful words) and compute counts.
- Step 3: Visualize both counts and later explore categories/monthly top problems.
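
Step 1 can be sketched with only the standard library. The stop-word set below is a small hand-written stand-in (an assumption for illustration; the notebook uses NLTK's English list plus custom terms), and the sample texts are taken from the merged table shown earlier:

```python
from collections import Counter

# Tiny hypothetical stop-word set, not the notebook's full list.
STOP_WORDS = {"to", "the", "a", "not", "but", "with",
              "unable", "make", "made", "wrong", "issue"}

# Sample complaint texts from the merged table.
issues = [
    "issue with business account",
    "made a wrong payment",
    "unable to make payment",
]

# Count every word that survives stop-word removal.
word_counts = Counter(
    word
    for text in issues
    for word in text.lower().split()
    if word not in STOP_WORDS
)
print(word_counts.most_common(3))
```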

In [8]:
# Start from NLTK's English stop words and add domain-specific filler terms
stop_words_list = stopwords.words('english')
stop_words_list.extend(["issue", "problem", "error", "make", "add", "made", "open", "paid", "go", "unable", "money", "wrong"])
# Split each complaint into words and stack them into one long Series
all_word = final_table["issue_text"].str.split(" ", expand=True).stack()
clean_word = all_word[~all_word.isin(stop_words_list)]
Most_Common_terms = clean_word.value_counts()

# Rename the columns
Most_Common_terms = Most_Common_terms.reset_index(name='frequency')
Most_Common_terms = Most_Common_terms.rename(columns={'index': 'Keyword'})
Most_Common_terms


Unnamed: 0,Keyword,frequency
0,payment,35
1,account,20
2,receive,14
3,hidden,10
4,charges,10
5,cancel,10
6,autopay,10
7,bank,10
8,business,10
9,application,9


I am creating a pattern recognition system that analyzes the data in pairs rather than just looking at isolated words. First, I defined a custom function that strips away 'noise' (words like 'the', 'is', or 'at') so that only the meaningful keywords remain. It then links these keywords together into adjacent pairs (for example, connecting 'server' and 'down' to make 'server down'). I applied this logic to the entire dataset and expanded the results so that every specific pair gets its own row. Finally, I counted the frequency of these pairs to create a ranked leaderboard. This tells us exactly which two-word combinations (like 'login failure' or 'network timeout') occur most often, giving us much better context than single words alone.

In [9]:
def get_clean_word(text):
    """Return the bigrams of `text` after stop words have been removed."""
    words = text.split()
    meaningful_words = [w for w in words if w.lower() not in stop_words_list]
    # Pair each remaining word with the word that follows it
    return [" ".join(pair) for pair in zip(meaningful_words, meaningful_words[1:])]

# One row per bigram, then count how often each bigram occurs
clean_bigrams = final_table["issue_text"].apply(get_clean_word).explode()
Most_Common_pairs = clean_bigrams.value_counts()

Most_Common_pairs = Most_Common_pairs.reset_index(name='Frequency')
print(Most_Common_pairs)

          issue_text  Frequency
0    receive payment         14
1     hidden charges         10
2     cancel autopay         10
3       bank account         10
4   business account         10
5  debited merchants          4
6  merchants receive          4
7       debited bill          4
8       debit device          2


Row 8 ("debit device") is not meaningful on its own, so we remove it.

In [10]:
Most_Common_pairs = Most_Common_pairs.drop(Most_Common_pairs.index[8])
Most_Common_pairs

Unnamed: 0,issue_text,Frequency
0,receive payment,14
1,hidden charges,10
2,cancel autopay,10
3,bank account,10
4,business account,10
5,debited merchants,4
6,merchants receive,4
7,debited bill,4
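
A pair like "debited bill" is never adjacent in the raw text. Because stop words are dropped before pairing, words that were separated by noise words end up joined. The small sketch below traces one complaint from the merged table; the stop-word set is a reduced, hand-written stand-in for the notebook's list, and the function name is hypothetical:

```python
def bigrams_after_stopword_removal(text, stop_words):
    """Drop stop words, then pair each remaining word with its successor."""
    kept = [w for w in text.split() if w.lower() not in stop_words]
    return [" ".join(pair) for pair in zip(kept, kept[1:])]

# Reduced stand-in for the notebook's stop-word list (assumption for illustration).
stop_words = {"money", "but", "not", "paid"}

pairs = bigrams_after_stopword_removal("money debited but bill not paid", stop_words)
print(pairs)  # ['debited bill']
```

This is worth keeping in mind when reading the leaderboard: a frequent pair signals that two keywords co-occur in order, not necessarily that customers typed them side by side.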


In [None]:

def clean_data(Most_Common_pairs):
    # Title-case each word in column: 'issue_text'
    Most_Common_pairs['issue_text'] = Most_Common_pairs['issue_text'].str.title()
    return Most_Common_pairs

Most_Common_pairs_clean = clean_data(Most_Common_pairs.copy())
Most_Common_pairs_clean.head()

Unnamed: 0,issue_text,Frequency
0,receive payment,14
1,hidden charges,10
2,cancel autopay,10
3,bank account,10
4,business account,10


In [11]:
Issue = final_table.groupby("category").aggregate(count_of_customer_issue = ("case_id", 'count'))
Issue


category,count_of_customer_issue
Account Issue,20
Authentication Failure,3
Business Account,10
Disconnected Call,2
Payment Issue,31
Technical Issue,9
Transaction Issue,20
Unrelated Issue,2
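
The category counts lend themselves to a simple bar chart. Below is a minimal matplotlib sketch with the counts copied from the table above; the output file name `category_counts.png` and the non-interactive `Agg` backend are arbitrary choices for running outside a notebook:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so this runs without a display
import matplotlib.pyplot as plt

# Counts copied from the groupby output above, sorted from largest to smallest.
category_counts = {
    "Payment Issue": 31,
    "Account Issue": 20,
    "Transaction Issue": 20,
    "Business Account": 10,
    "Technical Issue": 9,
    "Authentication Failure": 3,
    "Disconnected Call": 2,
    "Unrelated Issue": 2,
}

fig, ax = plt.subplots(figsize=(8, 4))
bars = ax.barh(list(category_counts), list(category_counts.values()))
ax.invert_yaxis()  # put the largest category at the top
ax.set_xlabel("Number of complaints")
ax.set_title("Complaints by category")
fig.tight_layout()
fig.savefig("category_counts.png")
```

Inside the notebook itself, `plt.show()` with the default backend would display the figure inline instead of saving it.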


Next, we find the most common issue users faced in each month of the year.

In [None]:
final_table['month'] = final_table['created_at'].dt.month

In [27]:
monthly_issue = final_table.groupby(['month'])


In [35]:
final_table['issue_pairs'] = final_table['issue_text'].apply(get_clean_word)
df_explore = final_table.explode('issue_pairs')

monthly_counts = df_explore.groupby(['month', 'issue_pairs']).size().reset_index(name='count')


monthly_counts = monthly_counts.sort_values(['month', 'count'], ascending=[True, False])


top_issue_per_month = monthly_counts.groupby('month').head(1)


print(top_issue_per_month)

    month      issue_pairs  count
0       1     bank account      1
8       2   cancel autopay      4
22      3  receive payment      6
30      4  receive payment      3
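
Note that `head(1)` keeps a single row per month, so ties are dropped silently; a top count of 1, as in January, may well be shared by several pairs. The sketch below, on made-up counts in the same shape as `monthly_counts`, shows one way to keep every pair that reaches the monthly maximum:

```python
import pandas as pd

# Hypothetical monthly counts; values are for illustration only.
monthly_counts = pd.DataFrame({
    "month": [1, 1, 2, 2],
    "issue_pairs": ["bank account", "hidden charges", "cancel autopay", "bank account"],
    "count": [1, 1, 4, 2],
})

# Compare each row's count with its month's maximum and keep all matches.
monthly_max = monthly_counts.groupby("month")["count"].transform("max")
top_with_ties = monthly_counts[monthly_counts["count"] == monthly_max]
print(top_with_ties)
```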
