`Analyzing Customer Support Calls`

`November 2025`

This project applies speech recognition and natural language processing (NLP) to customer support calls. The goal is to extract insights that can help improve customer service quality: the sentiment of each call, the entities customers mention most often, and the calls most similar to a given complaint.

`Any questions, please reach out!`

Chiawei Wang, PhD\
Data & Product Analyst\
<chiawei.w@outlook.com>

`*` Note that the table of contents and other links may not work directly on GitHub.

[Table of contents](#table-of-contents)
1. [Executive summary](#executive-summary)
   - [Challenge](#challenge)
   - [Research questions](#research-questions)
   - [Data overview](#data-overview)
   - [Approach](#approach)
   - [Results](#results)
   - [Conclusion](#conclusion)
2. [Exploratory data analysis](#exploratory-data-analysis)

# Executive summary

## Challenge

We aim to enhance customer support services with speech recognition and natural language processing. We transcribe a sample customer audio call, then analyze a dataset of pre-transcribed customer calls using sentiment analysis, named entity recognition, and text similarity.

## Research questions

1. Is the audio compatible with future speech recognition modeling?
2. How many calls labeled positive are also predicted positive (true positives)?
3. What is the most frequent named entity across all of the transcriptions?
4. Which call is the most similar to 'wrong package delivery'?

## Data overview

| Index | Column            | Type   | Description                                        |
| ----- | ----------------- | ------ | -------------------------------------------------- |
| 0     | `text`            | object | Transcription of the customer call                 |
| 1     | `sentiment_label` | object | Whether the call is positive, neutral, or negative |

## Approach

1. Implement speech recognition and calculate audio statistics
2. Perform sentiment analysis
3. Run named entity recognition
4. Find most similar texts

## Results

- The audio file has 1 channel and a frame rate of 44100 Hz.
- True positive sentiment count: 2
- Most frequent entity: yesterday
- Most similar text to 'wrong package delivery': 'wrong package delivered'

## Conclusion

We successfully transcribed a customer support call and analyzed a dataset of transcriptions to extract valuable insights. The findings can help improve customer service quality by addressing common issues and understanding customer sentiment better.

# Exploratory data analysis

In [1]:
# Import necessary libraries
import pandas as pd
import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer
import speech_recognition as sr
from pydub import AudioSegment
import spacy

In [2]:
# Read in the CSV as a DataFrame
df = pd.read_csv('customer_sentiment.csv')

# Preview the data
print(df.shape)
df.head()

(102, 2)


|     | text                                               | sentiment_label |
| --- | -------------------------------------------------- | --------------- |
| 0   | how's it going Arthur I just placed an order w...  | negative        |
| 1   | yeah hello I'm just wondering if I can speak t...  | neutral         |
| 2   | hey I receive my order but it's the wrong size...  | negative        |
| 3   | hi David I just placed an order online and I w...  | neutral         |
| 4   | hey I bought something from your website the o...  | negative        |


In [3]:
# Speech to Text: Convert the sample audio call, customer_call.wav, to text and store the result in transcribed_text
# Define a recognizer object
recognizer = sr.Recognizer()

# Convert the audio file to audio data
transcribe_audio_file = sr.AudioFile('customer_call.wav')
with transcribe_audio_file as source:
    transcribe_audio = recognizer.record(source)

# Convert the audio data to text
transcribed_text = recognizer.recognize_google(transcribe_audio)

# Review transcribed text
print('Transcribed text:', transcribed_text)

Transcribed text: hello I'm experiencing an issue with your product I'd like to speak to someone about a replacement


In [4]:
# Speech to Text: Store a few statistics of the audio file, namely the number of channels and the frame rate
# Read the number of channels and the frame rate from the audio file
audio_segment = AudioSegment.from_file('customer_call.wav')
number_channels = audio_segment.channels
frame_rate = audio_segment.frame_rate

print('Number of channel(s):', number_channels)
print('Frame rate:', frame_rate)

Number of channel(s): 1
Frame rate: 44100
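The comment in the cell above also mentions sample width, which the cell does not read. As a minimal sketch, assuming an uncompressed PCM WAV file, the same statistics can be read with Python's standard-library `wave` module. The helper name `wav_stats` and the in-memory silent clip are illustrative assumptions, not part of the original notebook; the clip only stands in for `customer_call.wav` so the example runs on its own.

```python
import io
import wave


def wav_stats(source):
    """Return basic statistics for a PCM WAV file (path or file-like object)."""
    with wave.open(source, 'rb') as wav_file:
        n_frames = wav_file.getnframes()
        frame_rate = wav_file.getframerate()
        return {
            'channels': wav_file.getnchannels(),
            'sample_width': wav_file.getsampwidth(),  # bytes per sample
            'frame_rate': frame_rate,                 # Hz
            'duration_seconds': n_frames / frame_rate,
        }


# Hypothetical stand-in for customer_call.wav: one second of mono silence
buffer = io.BytesIO()
with wave.open(buffer, 'wb') as wav_file:
    wav_file.setnchannels(1)
    wav_file.setsampwidth(2)       # 2 bytes = 16-bit samples
    wav_file.setframerate(44100)
    wav_file.writeframes(b'\x00\x00' * 44100)
buffer.seek(0)

stats = wav_stats(buffer)
print(stats)
```

On the real file, `wav_stats('customer_call.wav')` should report the same channel count and frame rate as the `pydub` cell, provided the file is plain PCM.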


In [5]:
# Sentiment Analysis: Use the VADER module from NLTK to predict the sentiment of each text in customer_sentiment.csv from its compound score, and store the predictions in a new sentiment_predicted column
# Download the VADER lexicon if it is not already available
nltk.download('vader_lexicon', quiet=True)

# Initialize SentimentIntensityAnalyzer
sid = SentimentIntensityAnalyzer()

# Classify sentiment from the compound score generated by VADER's SentimentIntensityAnalyzer
def find_sentiment(text):
    scores = sid.polarity_scores(text)
    compound_score = scores['compound']

    if compound_score >= 0.05:
        return 'positive'
    elif compound_score <= -0.05:
        return 'negative'
    else:
        return 'neutral'

df['sentiment_predicted'] = df['text'].apply(find_sentiment)

# Sentiment Analysis: Count the texts labeled positive that are also predicted positive
true_positive = len(df.loc[(df['sentiment_predicted'] == df['sentiment_label']) &
                (df['sentiment_label'] == 'positive')])

print('True positives:', true_positive)

True positives: 2
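A true-positive count is easier to interpret next to the false positives and false negatives for the same class. The sketch below applies the same ±0.05 compound-score thresholds in plain Python and derives precision and recall for the positive class. The scores and labels are hypothetical values chosen for illustration, not taken from the dataset, and the helper names are assumptions.

```python
def label_from_compound(compound_score, threshold=0.05):
    """Map a VADER-style compound score to a sentiment label."""
    if compound_score >= threshold:
        return 'positive'
    if compound_score <= -threshold:
        return 'negative'
    return 'neutral'


def positive_class_counts(actual, predicted, target='positive'):
    """Count true positives, false positives, and false negatives for one class."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == target and p == target)
    fp = sum(1 for a, p in pairs if a != target and p == target)
    fn = sum(1 for a, p in pairs if a == target and p != target)
    return tp, fp, fn


# Hypothetical labels and compound scores, for illustration only
actual_labels = ['positive', 'negative', 'neutral', 'positive', 'negative']
compound_scores = [0.62, -0.40, 0.01, -0.10, 0.30]

predicted_labels = [label_from_compound(score) for score in compound_scores]
tp, fp, fn = positive_class_counts(actual_labels, predicted_labels)

precision = tp / (tp + fp) if (tp + fp) else 0.0
recall = tp / (tp + fn) if (tp + fn) else 0.0
print('TP:', tp, 'FP:', fp, 'FN:', fn)
print('Precision:', precision, 'Recall:', recall)
```

With the notebook's DataFrame, the two lists would correspond to `df['sentiment_label']` and `df['sentiment_predicted']`.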


In [6]:
# Named Entity Recognition: Find named entities for each text in the df object and store entities in a named_entities column
# Load spaCy English Language model
nlp = spacy.load('en_core_web_sm')

# NER using spaCy
def extract_entities(text):
    doc = nlp(text)
    entities = [ent.text for ent in doc.ents]
    return entities

# Apply NER to the entire text column
df['named_entities'] = df['text'].apply(extract_entities)

# Flatten the list of named entities
all_entities = [ent for entities in df['named_entities'] for ent in entities]

# Create a DataFrame with the counts
entities_df = pd.DataFrame(all_entities, columns=['entity'])
entities_counts = entities_df['entity'].value_counts().reset_index()
entities_counts.columns = ['entity', 'count']

# Extract most frequent named entity
most_freq_ent = entities_counts['entity'].iloc[0]
print('Most frequent entity:', most_freq_ent)

Most frequent entity: yesterday
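The top entity here is a time expression rather than a person, product, or organization. spaCy typically tags words like this with the `DATE` label, so it can be useful to count entities together with their labels and optionally exclude some label types. The sketch below shows that counting step with `collections.Counter`. The entity list is hypothetical, and `most_common_entity` is an illustrative helper; in the notebook, the pairs could come from `[(ent.text, ent.label_) for ent in doc.ents]`.

```python
from collections import Counter

# Hypothetical (text, label) pairs per call, standing in for spaCy output
extracted = [
    [('yesterday', 'DATE'), ('Arthur', 'PERSON')],
    [('yesterday', 'DATE')],
    [('David', 'PERSON'), ('Arthur', 'PERSON')],
    [('yesterday', 'DATE')],
]


def most_common_entity(rows, exclude_labels=()):
    """Return the most frequent entity text and its count, or None if nothing is left."""
    counts = Counter(
        text
        for row in rows
        for text, label in row
        if label not in exclude_labels
    )
    return counts.most_common(1)[0] if counts else None


top_overall = most_common_entity(extracted)
top_without_dates = most_common_entity(extracted, exclude_labels={'DATE', 'TIME'})
print('Most frequent entity:', top_overall)
print('Most frequent non-date entity:', top_without_dates)
```

Whether to exclude date and time entities depends on the question being asked; the original analysis counts all entity types.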


In [7]:
# Find most similar text: Rank the customer calls by their similarity to the 'wrong package delivery' query using the medium spaCy English model
# Load the medium spaCy English model, which includes the word vectors that similarity() relies on
nlp = spacy.load('en_core_web_md')

# Process the text column
df['processed_text'] = df['text'].apply(nlp)

# Input query
input_query = 'wrong package delivery'
processed_query = nlp(input_query)

# Calculate similarity scores and sort dataframe with respect to similarity scores
df['similarity'] = df['processed_text'].apply(processed_query.similarity)
df = df.sort_values(by='similarity', ascending=False)

# Find the most similar text
most_similar_text = df['text'].iloc[0]
print('Most similar text:', most_similar_text)

# Print the top entries and their scores
print()
print('Top similar texts:')
print(df[['text', 'similarity']].head())

Most similar text: wrong package delivered

Top similar texts:
                                                 text  similarity
81                            wrong package delivered    0.890197
45  hi Jacob I just placed an order with you guys ...    0.852101
50  hi just about to order these shoes online I wa...    0.823615
75  I was looking online it says that you're only ...    0.823007
58  I purchase something from your online store ye...    0.817316
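By default, spaCy's `Doc.similarity` is the cosine similarity between document vectors, and a document vector is the average of its token vectors. The sketch below reproduces that calculation in plain Python. The three-dimensional word vectors are made-up values for illustration only (the real `en_core_web_md` vectors are much larger), and the helper names are assumptions.

```python
import math

# Hypothetical 3-dimensional word vectors, for illustration only
toy_vectors = {
    'wrong': [0.9, 0.1, 0.0],
    'package': [0.1, 0.9, 0.1],
    'delivery': [0.0, 0.8, 0.3],
    'delivered': [0.0, 0.7, 0.4],
    'refund': [0.2, 0.1, 0.9],
    'please': [0.1, 0.2, 0.1],
}


def embed(text):
    """Average the word vectors of a text, mirroring a default document vector."""
    vectors = [toy_vectors[word] for word in text.split()]
    return [sum(component) / len(vectors) for component in zip(*vectors)]


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


query_vector = embed('wrong package delivery')
candidates = ['refund please', 'wrong package delivered']

# Rank candidates from most to least similar to the query
ranked = sorted(
    candidates,
    key=lambda text: cosine_similarity(query_vector, embed(text)),
    reverse=True,
)
print('Most similar text:', ranked[0])
```

Because the document vector is an average, long transcriptions that touch on many topics can still score fairly high, which may explain why the lower-ranked calls in the output above have scores close to the top match.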
