<a href="https://colab.research.google.com/github/sergiomar73/nlp-google-colab/blob/main/nlp_poc_01_classify_transcript.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Setup

In [62]:
transcript = "Hello, how are you doing today? I'm here to tell you a little bit about, uh, quantified communications and the quantified platform and how it impacts organizations, who it helps and how it works. So I'll get started off by telling you just a little bit about a high level about, um, the quantified platform. Oh, so the quantified platform is one of the most advanced communication intelligence in AI powered coaching systems. And what does that really mean? So, um, communication coaching is something that is typically delivered one on one between a communication coach who has a, uh, a doctorate or a, um, background and experience in teaching people how to be better communicators and how to express themselves effectively. Um, those coaches would work one-on-one with individuals, um, maybe put their information in front of audiences and see how well they respond. And that can be a very costly process as well as a time consuming. And, um, not always backed by the science of what really drives great communication. So the quantified platform allows all of that to be automated through our AI coaching system. Um, our system empowers, um, uh, is empowered by behavioral science in order to be able to take videos into the system and be able to render exactly how a audience is going to perceive you and provide the communication feedback that you need in order to be, become a better communicator. So that helps you sell more, that helps drive better experiences and improves your external communication with your clients. So how does it work? Um, I touched on that a little bit, um, but let me kind of unpack exactly the science behind it. Um, so we started off with a, a large swath of videos of from fantastic communicators, all the people that you would idolize and wanna be like, and we took those videos and we put them in front of panelists and we scored them to see exactly how well they would perform in front of an audience. 
So how likable, uh, uh, was that speaker, um, how effective were they at communicating their ideas? You know, were they persuasive? Would you actually buy something from them? Did you wanna listen to them longer? Um, did you find them engaging? These things are innately human in their, um, in how communication elicits a response from us? Those are the types of things that we actually measured and built an algorithm around. So the way that the system works is it, um, uh, you are allowed to record yourself inside the application. Um, we also embed into video conferencing platforms as well. So you can invite a bot into your live meeting conversations if you wish. And we have other integration options as well, including having a role playing conversations inside the application. Um, once we have that, the system analyzes, uh, the video content for the words that you say, so your sentence structure, phrases, um, sentiment analysis, pronoun usage, ver burb tone and usage. Um, how you conduct your face, the microexpressions that you have, um, do you appear happy, calm, angry, and your gestures? Um, you know, part of being an appealing, um, conversationalist is being engaging and have people want to, <laugh> want to listen to you. And, and so all of these things all come together into really, um, defining what makes a great piece of communication. And we use that as our benchmark of how to define that inside of our platform. So when you go into the platform, you're really being measured against the best communicators, as well as our entire community of people using the quantified platforms. So you can see where you are against other, um, roles, similar to yours, other people, similar career paths, and see how you grow and, um, get better from there and to optimize your behavior. So who does it help? 
Um, it can help everybody, everybody can improve their lives, their personal lives, their professional lives, um, their business contacts, their ability to be able to sell and deliver products, um, through having better communications and being able to effectively deliver your message. This is fantastic for people in leadership programs who are looking to accelerate to senior executive executive positions, uh, who are looking to improve their status inside of an organization, their ability to be a leader and be inspiring, um, as well as entry level people who really want to represent their brand well, they wanna have a great impact on their external customer experience, as well as their internal ones. And this, this whole system can be tailored specifically for an organization so that we identify the key characteristics of your top sales leaders, your top performers, your top leaders, and replicate that across the rest of your organization. So it doesn't come with a one size fits all. It is very specific on thinking about the characteristics and the behavioral patterns and the communication styles of those who are already effective inside your organization and creating the patterns to duplicate that. So depending on, on your brand presence and what you value inside of your organization, that can be replicated at scale. So who's it gonna have the greatest impact on, um, those participating in customer experiences, those communicating with customers directly, um, uh, spending time with members of your team, inspiring them, providing leadership guidance, visibility into the overall vision of an organization. Um, there are so much science out there that says really effective leaders lead from great communication. Um, and we wanna remake those people remarkably better. Everybody can improve their communication and everybody deserves to be a great communicator. Um, we see growth early on in the process. 
So, um, as people participate in the program, they usually, uh, get about 30% better within their first six weeks to 12 weeks. So there's a huge uptick in ability to be able to become more trustworthy, authentic, credible, um, have better collective performances across your team and across your organization and have that individual growth as a team leader. Um, this is all based on evidence based research and a ton of analytics, which we're all very, very proud of. Uh, so I hope that explains our quantified platform. And I look forward to talking to you again soon. Thank you very much."

# Read saved Categories and Phrases

In [63]:
!pip install pickle5



In [64]:
import numpy as np
import pandas as pd
import pickle5 as pickle

In [65]:
df_phrases_path = './df_phrases.pkl'
df_phrases = pd.read_pickle(df_phrases_path)
print(df_phrases.shape)
df_phrases.head(3)

(64, 4)


Unnamed: 0,category,label,example,embedding
0,What is Quantified,What,most advanced conversation intelligence and AI...,"[-0.007960937917232513, 0.0075285546481609344,..."
1,What is Quantified,What,a software platform that helps people reach th...,"[-0.00591889675706625, 2.6476920538698323e-05,..."
2,What is Quantified,What,for communicating and connecting,"[-0.005206338595598936, 0.0007997832144610584,..."


# OpenAI

In [66]:
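# openai.Embedding.create (used later in this notebook) belongs to the pre-1.0 client API.
!pip install "openai<1.0"
import os
import openai

# Credentials are read from environment variables when they are set (the variable
# names are an assumption); otherwise replace the placeholders by hand.
openai.organization = os.getenv("OPENAI_ORGANIZATION", "org-XXXXXXXXXXXXXXXXXXXXX")
openai.api_key = os.getenv("OPENAI_API_KEY", "sk-XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX")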



# Process Transcript

In [67]:
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp(transcript)
sentences = [sentence.text for sentence in doc.sents]
sentences[:3]

['Hello, how are you doing today?',
 "I'm here to tell you a little bit about, uh, quantified communications and the quantified platform and how it impacts organizations, who it helps and how it works.",
 "So I'll get started off by telling you just a little bit about a high level about, um, the quantified platform."]

In [68]:
import time

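def calculate_embeddings_with_gpt3(sentence, engine="text-similarity-davinci-001", interval=1.5, verbose=True):
  """Return the embedding vector for one sentence using the given OpenAI engine."""
  if verbose:
    print(f'Calculating embedding for {sentence}...')
  # Pause before every request as a simple guard against API rate limits.
  time.sleep(interval)
  response = openai.Embedding.create(
    input=sentence,
    engine=engine
  )
  # The response holds one item per input; a single sentence was sent.
  return response['data'][0]['embedding']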
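
The fixed `time.sleep(interval)` pause is a simple guard against rate limits. As a hedged alternative, the sketch below retries a failing call with exponentially growing delays. `call_with_backoff` and `flaky_embedding` are hypothetical names introduced for illustration; they are not part of the `openai` library, and the stand-in function makes no API call.

```python
import time

def call_with_backoff(func, *args, retries=3, base_delay=1.0, sleep=time.sleep, **kwargs):
    """Call func, retrying on any exception with exponentially growing delays."""
    for attempt in range(retries):
        try:
            return func(*args, **kwargs)
        except Exception:
            # Give up once the last allowed attempt has failed.
            if attempt == retries - 1:
                raise
            # Wait base_delay, then 2 * base_delay, then 4 * base_delay, ...
            sleep(base_delay * 2 ** attempt)

# Hypothetical stand-in for an embedding request: fails twice, then succeeds.
calls = {"count": 0}

def flaky_embedding(sentence):
    calls["count"] += 1
    if calls["count"] < 3:
        raise RuntimeError("simulated rate limit")
    return [0.0, 1.0]

# Record the requested delays instead of actually sleeping.
delays = []
result = call_with_backoff(flaky_embedding, "hello", sleep=delays.append)
print(result, delays)
```

In the notebook, the real embedding function could be passed in place of `flaky_embedding`. Whether a blanket `except Exception` is appropriate depends on which errors the client raises, so treat that as an assumption to narrow down.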

In [69]:
# Collect the rows in a list and build the DataFrame once
# (DataFrame.append was removed in pandas 2.0).
rows = []

for idx, sentence in enumerate(sentences, start=1):
  embedding = calculate_embeddings_with_gpt3(sentence)
  rows.append({
    'line': idx,
    'sentence': sentence,
    'embedding': embedding
  })

df_sentences = pd.DataFrame(rows, columns=['line', 'sentence', 'embedding'])

print(df_sentences.shape)
df_sentences.head()

Calculating embedding for Hello, how are you doing today?...
Calculating embedding for I'm here to tell you a little bit about, uh, quantified communications and the quantified platform and how it impacts organizations, who it helps and how it works....
Calculating embedding for So I'll get started off by telling you just a little bit about a high level about, um, the quantified platform....
Calculating embedding for Oh, so the quantified platform is one of the most advanced communication intelligence in AI powered coaching systems....
Calculating embedding for And what does that really mean?...
Calculating embedding for So, um, communication coaching is something that is typically delivered one on one between a communication coach who has a, uh, a doctorate or a, um, background and experience in teaching people how to be better communicators and how to express themselves effectively....
Calculating embedding for Um, those coaches would work one-on-one with individuals, um, maybe put t

Unnamed: 0,line,sentence,embedding
0,1,"Hello, how are you doing today?","[-0.003950469195842743, 0.015000976622104645, ..."
1,2,"I'm here to tell you a little bit about, uh, q...","[-0.009600087068974972, 0.002050349721685052, ..."
2,3,So I'll get started off by telling you just a ...,"[-0.009551599621772766, 0.0032257316634058952,..."
3,4,"Oh, so the quantified platform is one of the m...","[-0.003871886758133769, 0.004901846870779991, ..."
4,5,And what does that really mean?,"[-0.009942175820469856, 0.01378173753619194, -..."


# Similarity

In [70]:
# Stack the stored phrase embeddings into a (n_phrases, n_dimensions) matrix.
targets = np.array(df_phrases["embedding"].tolist())
print(f"targets:{targets.shape}")

targets:(64, 12288)


In [71]:
rows = []
# The phrase norms do not depend on the sentence, so compute them once.
target_norms = np.linalg.norm(targets, axis=1)

for _, row in df_sentences.iterrows():
    source = np.array(row["embedding"])
    # Cosine similarity between this sentence and every phrase embedding.
    cosine = np.dot(targets, source) / (target_norms * np.linalg.norm(source))
    # One column per phrase: Cosine01, Cosine02, ...
    new_row = {"line": row["line"]}
    new_row.update({f"Cosine{key:02}": value for key, value in enumerate(cosine, 1)})
    rows.append(new_row)

# Build the DataFrame once (DataFrame.append was removed in pandas 2.0).
df_cosines = pd.DataFrame(rows)
df_cosines['line'] = df_cosines['line'].astype('int')
print(df_cosines.shape)
df_cosines.head(3)

(51, 65)


Unnamed: 0,line,Cosine01,Cosine02,Cosine03,Cosine04,Cosine05,Cosine06,Cosine07,Cosine08,Cosine09,...,Cosine55,Cosine56,Cosine57,Cosine58,Cosine59,Cosine60,Cosine61,Cosine62,Cosine63,Cosine64
0,1,0.64784,0.685653,0.689358,0.72486,0.681512,0.672155,0.687316,0.688927,0.668197,...,0.646963,0.694437,0.646987,0.695505,0.703807,0.678319,0.711509,0.685593,0.682147,0.723876
1,2,0.793354,0.746575,0.688302,0.694458,0.67256,0.716549,0.663016,0.705523,0.741473,...,0.639954,0.657171,0.743381,0.686107,0.697144,0.661339,0.67207,0.721339,0.705404,0.670754
2,3,0.770353,0.75764,0.676912,0.702108,0.688316,0.687752,0.6635,0.713817,0.722872,...,0.670474,0.690668,0.722508,0.701332,0.716044,0.686732,0.689009,0.716929,0.714853,0.675136
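
The loop computes one row of cosine similarities per sentence. As a minimal sketch, assuming all embeddings fit in memory, the whole matrix can also be obtained in one step by scaling every vector to unit length and taking a matrix product. `cosine_matrix` is a hypothetical helper, and the toy vectors are made up for illustration; they are not real embeddings.

```python
import numpy as np

def cosine_matrix(sources, targets):
    """Return an (n_sources, n_targets) matrix of cosine similarities."""
    sources = np.asarray(sources, dtype=float)
    targets = np.asarray(targets, dtype=float)
    # Scale every row to unit length so that a dot product equals the cosine.
    sources_unit = sources / np.linalg.norm(sources, axis=1, keepdims=True)
    targets_unit = targets / np.linalg.norm(targets, axis=1, keepdims=True)
    return sources_unit @ targets_unit.T

# Toy 2-dimensional vectors standing in for sentence and phrase embeddings.
toy_sources = np.array([[1.0, 0.0], [1.0, 1.0]])
toy_targets = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]])

toy_cosines = cosine_matrix(toy_sources, toy_targets)
print(toy_cosines.shape)
```

With the notebook's data, the first argument would be the stacked sentence embeddings and the second would be `targets`.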


In [81]:
df_comparison = df_cosines #[(df_cosines.filter(regex='Cosine') > threshold).any(axis=1)]
print(df_comparison.shape)
df_comparison.head(3)

(51, 65)


Unnamed: 0,line,Cosine01,Cosine02,Cosine03,Cosine04,Cosine05,Cosine06,Cosine07,Cosine08,Cosine09,...,Cosine55,Cosine56,Cosine57,Cosine58,Cosine59,Cosine60,Cosine61,Cosine62,Cosine63,Cosine64
0,1,0.64784,0.685653,0.689358,0.72486,0.681512,0.672155,0.687316,0.688927,0.668197,...,0.646963,0.694437,0.646987,0.695505,0.703807,0.678319,0.711509,0.685593,0.682147,0.723876
1,2,0.793354,0.746575,0.688302,0.694458,0.67256,0.716549,0.663016,0.705523,0.741473,...,0.639954,0.657171,0.743381,0.686107,0.697144,0.661339,0.67207,0.721339,0.705404,0.670754
2,3,0.770353,0.75764,0.676912,0.702108,0.688316,0.687752,0.6635,0.713817,0.722872,...,0.670474,0.690668,0.722508,0.701332,0.716044,0.686732,0.689009,0.716929,0.714853,0.675136


In [99]:
rows = []

for _, row in df_comparison.iterrows():
  for n in range(1, len(df_phrases) + 1):
    col = f"Cosine{n:02}"
    # if row[col] > threshold:
    rows.append({
      'line': row["line"],
      'sentence': df_sentences.at[int(row["line"]) - 1, "sentence"],
      'phrase': df_phrases.at[n - 1, "example"],
      'category': df_phrases.at[n - 1, "category"],
      'tag': df_phrases.at[n - 1, "label"],
      'similarity': row[col]
    })

df_results = pd.DataFrame(rows, columns=['line', 'sentence', 'phrase', 'category', 'tag', 'similarity'])
# iterrows() upcasts the line numbers to float; cast them back to integers.
df_results['line'] = df_results['line'].astype('int')
print(df_results.shape)
df_results.head(3)

(3264, 6)


Unnamed: 0,line,sentence,phrase,category,tag,similarity
0,1,"Hello, how are you doing today?",most advanced conversation intelligence and AI...,What is Quantified,What,0.64784
1,1,"Hello, how are you doing today?",a software platform that helps people reach th...,What is Quantified,What,0.685653
2,1,"Hello, how are you doing today?",for communicating and connecting,What is Quantified,What,0.689358
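
The nested loop turns the wide table (one `CosineNN` column per phrase) into one row per sentence/phrase pair. As a hedged sketch of the same reshaping without explicit loops, `DataFrame.melt` followed by a merge on the phrase index gives an equivalent long table. The two toy frames below are made-up stand-ins for `df_cosines` and `df_phrases`, and the names `long_form` and `phrase_idx` are introduced only for this example.

```python
import pandas as pd

# Toy stand-ins: two sentences scored against two phrases.
toy_cosines = pd.DataFrame({
    "line": [1, 2],
    "Cosine01": [0.9, 0.2],
    "Cosine02": [0.1, 0.8],
})
toy_phrases = pd.DataFrame({
    "example": ["phrase a", "phrase b"],
    "label": ["What", "How"],
})

# Wide to long: one row per (line, CosineNN) pair.
long_form = toy_cosines.melt(id_vars="line", var_name="column", value_name="similarity")
# Recover the zero-based phrase index from the column name, e.g. "Cosine02" -> 1.
long_form["phrase_idx"] = long_form["column"].str.replace("Cosine", "", regex=False).astype(int) - 1
# Attach the phrase text and label by matching phrase_idx against the phrase index.
toy_results = long_form.merge(toy_phrases, left_on="phrase_idx", right_index=True)
toy_results = toy_results.sort_values(["line", "phrase_idx"]).reset_index(drop=True)
print(toy_results[["line", "example", "label", "similarity"]])
```

The sentence text could be attached the same way with a second merge on `line`.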


# Setting the threshold

In [230]:
threshold = 0.8

In [231]:
df_summary = df_results.groupby(['tag'])['similarity'].agg('max').to_frame()
df_summary['ok'] = df_summary['similarity'] > threshold
df_summary

tag,similarity,ok
How,0.867479,True
Impact,0.879954,True
What,0.883998,True
Who,0.78315,False


In [232]:
import plotly.express as px

fig = px.bar(
  df_summary,
  y='similarity',
  color='ok',
  color_discrete_map={ True: px.colors.qualitative.Plotly[2], False: px.colors.qualitative.Set2[7] },
  text='similarity',
  text_auto='.3f',
  labels={'tag': 'Category', 'similarity': 'Similarity'},
  title = f"{transcript[:200]}..."
)
fig.update_yaxes(
    range=[0, 1]
)
fig.add_shape( # add a horizontal "target" line
    type="line", line_color="salmon", line_width=3, opacity=1, line_dash="dot",
    x0=0, x1=1, xref="paper", y0=threshold, y1=threshold, yref="y"
)
fig.show()

# Top 3 sentences by Category

In [233]:
(
  df_results
    .drop(columns='line')
    .sort_values(['tag', 'similarity'], ascending=[True, False])
    .groupby('tag')
    .head(3)
    .reset_index(drop=True)
)

Unnamed: 0,sentence,phrase,category,tag,similarity
0,"Um, we also embed into video conferencing plat...",integrated into video conference platforms,How does it work?,How,0.867479
1,So the quantified platform allows all of that ...,coach you using artificial intelligence,How does it work?,How,0.825322
2,"Hello, how are you doing today?",relay to you how you're doing,How does it work?,How,0.814374
3,"Um, and we wanna remake those people remarkabl...",we want to make you remarkably better,How can it have the greatest impact?,Impact,0.879954
4,"Um, and we wanna remake those people remarkabl...",we want to make you extraordinary at that beha...,How can it have the greatest impact?,Impact,0.810032
5,Those are the types of things that we actually...,measured by evidence-based research,How can it have the greatest impact?,Impact,0.802939
6,"Oh, so the quantified platform is one of the m...",most advanced conversation intelligence and AI...,What is Quantified,What,0.883998
7,So the quantified platform allows all of that ...,most advanced conversation intelligence and AI...,What is Quantified,What,0.852535
8,"Um, and we wanna remake those people remarkabl...",help them deliver better experiences,What is Quantified,What,0.818323
9,"These things are innately human in their, um, ...","words that they use, the way that they present...",Who does it help?,Who,0.78315
