# Comparing your transcripts

**NOTE:** Click [here](https://colab.research.google.com/github/senthilchandrasegaran/IDEM105-EDI/blob/main/04-comparing-transcripts.ipynb) to open the file in Colab.

In this notebook, you will compare your own transcript with all the transcripts from the post-it exercise (download the combined transcript from BrightSpace) using the dictionary category of your choice. Make sure you download the appropriate dictionary category from BrightSpace. 

You will also compare your transcript to a non-design transcript to examine any differences in the score, and use KWIC to see if the context of the utterances align with the description of your dictionary category.

## Load Your Transcript
Let's load your Transcript file from BrightSpace. At this point you would already have the file on your after the exercise from the last class.
If you are using Colab, you would need to first upload the file to Google Drive and then specify the link in the `read_excel` command below.

Since this is an excel file, you will need to use a python library called *pandas* to read and process files as tables or "DataFrames".

In [1]:
import pandas as pd  
# Use 'pd' as a shortcut for 'pandas' as it saves you the effort of typing 'pandas' every time.

It is convention to add a `_df` suffix to all variables that represent dataframes. So we load the transcript into a variable called `transcript_df`.

In [2]:
transcript_df = pd.read_excel('./data/sample_transcript.xlsx')
transcript_df.astype({'utterance': 'str'}).dtypes
# Print a random sample of the dataframe, showing 5 rows.
transcript_df.sample(5)

Unnamed: 0,timestamp,speaker,utterance
620,00:44:19,Charlie,I found the glue.
47,00:02:48,Bravo,That we can also do.
517,00:35:46,Bravo,Yes it does.
461,00:30:37,Bravo,I can see it sinking.
145,00:07:05,Alpha,Ah.


# Load the reference dataset of transcripts

You can download the reference dataset of transcripts that we have prepared for you (download from BrightSpace). This is an aggregation of all your transcripts, anonymized to a large extent. Load this into a separate dataframe.

In [3]:
all_df = pd.read_excel('./data/all-transcripts.xlsx')
all_df['utterance'] = all_df['utterance'].astype(str)
all_df.sample(5)

Unnamed: 0,timestamp,speaker,utterance,group
2614,00:19:28,speaker4,It's interesting.,8
2430,00:03:42,unclear,(Confusion),8
1519,,speaker1,Making people happy and now we are ...,5
32,00:02:17,speaker2,I think I like the 3d thing.,1
1204,,speaker4,"Yeah, you didn’t ?",5


Youn can see that the group ID is mentioned in an extra column. This is in case you want to try some group-level comparison, but for now let's ignore the column.

In [4]:
with open('./data/dictionaries/EDI-insight.txt', 'r') as fo:
    dictionary_terms_list = fo.readlines()

# We get rid of extraneous carriage return (\n) characters from the text
dictionary_terms_list = [w.strip('\n') for w in dictionary_terms_list]
print("Number of terms in dictionary:", len(dictionary_terms_list))

Number of terms in dictionary: 383


## Finding matches between dictionary and text
The next step is to find how many terms from the text match the terms in the dictionary category, and to count every match. Note that I use the word "term" and not "word", since there are a number of multi-word terms in the dictionary, such as `realize that`. 

There are also some wildcards, indicated by `*`. A wildcard character indicates a general pattern. For instance, `option*` will return a match to `option`, `options`, `optional`, and `optionally`. 

Due to these wildcards and multi-word terms, we cannot simply use a token-by-token match to perform dictionary term matching. Instead, we will have to find patterns in the original text that match the patterns indicated in the dictionary entries. This includes single- and multi-word terms as well as terms that use wildcards. To achieve this, we will use a concept called [**regular expressions**](https://en.wikipedia.org/wiki/Regular_expression). In python, regular expressions are largely implemented using the ["`re`" library](https://docs.python.org/3/howto/regex.html#regex-howto).

In [5]:
import re

def count_matches(text, pattern):
    if pattern.startswith('*') :
        pattern = r"[A-Za-z]*" + pattern[1:]
        
    if pattern.endswith('*') :
        pattern = pattern[:-1] + r"[A-Za-z]*'"
    
    m = r"\b" + pattern + r"\b"
    matches = re.findall(pattern, text)
    return len(matches)

## Compute dictionary category score for the entire transcript
In the last workshop we computed the dictionary category score for individual turns. This time since we are comparing transcripts, let's perform an aggregate score.

In [6]:
import nltk
# nltk.download('punkt_tab')  # comment this line after the first time you run this code.
from nltk import word_tokenize

# Make a single string combining all the utterances
transcript_utterances = ' '.join(transcript_df['utterance'].to_list())

def get_category_score(utterance_string, category_term_list):
    # Count the total number of times any word from the dictionary appears in the transcript
    term_counts = 0
    for dict_term in category_term_list :
        term_counts += count_matches(utterance_string.lower(), dict_term)
    
    # Count the total words in the transcript
    tokens = word_tokenize(utterance_string)
    word_count = len(tokens)
    
    # Compute dictionary category score
    category_score = term_counts/word_count

    # print results
    print('#####################################')
    print("Total number of matches for the dictionary category:", term_counts)
    print("Total number of words in the transcript:", word_count)
    print(f'Dictionary category score for the transcript: {category_score: .4f}')
    print('#####################################')

In [7]:
print('#####################################')
print('Computing score for YOUR transcript:')
get_category_score(transcript_utterances, dictionary_terms_list)

#####################################
Computing score for YOUR transcript:
#####################################
Total number of matches for the dictionary category: 145
Total number of words in the transcript: 6444
Dictionary category score for the transcript:  0.0225
#####################################


## Perform same analysis on aggregate transcript
Let's compare this with the aggregate transcript.

In [8]:
all_utterances = ' '.join(all_df['utterance'].to_list())

print('#####################################')
print('Computing score for ALL transcripts:')
get_category_score(all_utterances, dictionary_terms_list)

#####################################
Computing score for ALL transcripts:
#####################################
Total number of matches for the dictionary category: 1417
Total number of words in the transcript: 50609
Dictionary category score for the transcript:  0.0280
#####################################


# Import a non-design transcript
Download from BrightSpace the copy of a non-design transcript (in this case, the dataset consists of transcripts from [post-match tennis interviews](https://www.cs.cornell.edu/~liye/tennis.html)).

In [9]:
with open('./data/tennis_finals_interview.txt', 'r') as fo:
    tennis_utterance_str = fo.read()
    

In [10]:
print('#####################################')
print('Computing score for NON-DESIGN transcript:')
get_category_score(tennis_utterance_str, dictionary_terms_list)

#####################################
Computing score for NON-DESIGN transcript:
#####################################
Total number of matches for the dictionary category: 34031
Total number of words in the transcript: 884140
Dictionary category score for the transcript:  0.0385
#####################################


## Concordance Analysis

Since dictionary-based scores are not sensitive to the contexts of word use, it might be a good idea to examine the contexts of word use.

For this purpose, we use a KWIC or KeyWord In Context view that shows all occurrences of a word of interest in the context of its surrounding text.

In [11]:
from nltk.text import Text

# enter the variable corresponding to the text you plan to examine
text_to_examine = transcript_utterances

tokens = word_tokenize(text_to_examine.lower())
textList = Text(tokens)

for term in dictionary_terms_list :
    if term in text_to_examine.lower() :
        print("Looking for occurrences of", term, "...")
        textList.concordance(term, width=85, lines=25)

Looking for occurrences of believe ...
Displaying 1 of 1 matches:
ting . does n't it do time lapse but i believe if you like ask your phone or whatever
Looking for occurrences of choose ...
Displaying 1 of 1 matches:
st make it from the back then , we can choose which way it goes to the camera . fina
Looking for occurrences of concentrate ...
no matches
Looking for occurrences of decide ...
Displaying 1 of 1 matches:
ee people . yeah . yeah… we can always decide to make it flat [ [ slams table ] ] [ 
Looking for occurrences of explain ...
Displaying 3 of 3 matches:
? [ '00:11:03 ' , 'alpha ' ] should we explain what we 're doing ? is that in the des
ap this couple as well . always when i explain to people like oh , i do ide , yes , v
 . yeah no and then i end up having to explain that i fold a duck today . folded a du
Looking for occurrences of feel ...
Displaying 1 of 1 matches:
regular video around . yeah exactly . i feel like halfway through it was the best . 
Looking for occurrence