# Comparing your transcripts

**NOTE:** Click [here](https://colab.research.google.com/github/senthilchandrasegaran/IDEM105-EDI/blob/main/insights-analysis.ipynb) to open the file in Colab.

In this notebook, you will upload the transcript of the songwriting video that you were asked to watch for homework, and examine how your identification of insight-related terms performs when generalised to the transcript.

## Load Your Transcript
Download the [transcript from BrightSpace](https://brightspace.tudelft.nl/d2l/le/content/767143/viewContent/4644865/View) based on the instructions given. Note the file extension. This is a comma-separated values file, or `.csv` file. A CSV file is a text version of a spreadsheet, with each column separation shown using a comma (",") or other special character. In our case, since the transcript text itself can have commas, we use a semicolon (";") as the separator.
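
To see why the separator matters, here is a minimal sketch using Python's built-in `csv` module. The two rows are made up for illustration; the column names (`Timestamp`, `Speaker`, `Utterance`) are assumed to match the transcript file.

```python
import csv
import io

# Hypothetical sample: made-up rows, not taken from the real transcript.
sample_csv = (
    "Timestamp;Speaker;Utterance\n"
    "00:00:05;Emily;Okay, so where do we start?\n"
    "00:00:09;Christiana;Well, I had this idea, actually.\n"
)

# Splitting on semicolons keeps each utterance in one piece.
rows_semicolon = list(csv.reader(io.StringIO(sample_csv), delimiter=';'))
# Splitting on commas cuts the utterance wherever the speaker paused.
rows_comma = list(csv.reader(io.StringIO(sample_csv), delimiter=','))

print(rows_semicolon[1])
print(rows_comma[1])
```

With the semicolon separator the first data row has three fields. With the comma separator it breaks in the middle of the utterance.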

#### NOTE:
Most modern file navigators keep the file extension hidden. This is not ideal for our work, and can create a number of errors in later exercises. So, to prevent this issue, please do the following:

**Mac:** Open "Finder" on your Mac, choose Finder > Settings, then click Advanced. Select “Show all filename extensions.”

**Windows:** Open "File Explorer" on your PC and choose "View". Go to the "Show/hide" section, and make sure "File name extensions" is checked.

Since this is a CSV file, you will need a Python library called *pandas*, which reads and processes files as tables or "DataFrames".

### Upload the transcript file here.
Use the "upload" button below to upload your transcript.

In [None]:
## Import some necessary libraries that we will use for this analysis.
## Libraries are tools to make programs easier to write. 
## They provide pre-written, reusable chunks of code for particular tasks.
import io
import ipywidgets as widgets
from google.colab import files


uploaded = files.upload()

## Uncomment the code below and comment the code above if running in Jupyter Notebook/Lab
# uploader = widgets.FileUpload(
#     accept='.csv',  # Accepted file extension e.g. '.txt', '.pdf', 'image/*', 'image/*,.pdf'
#     multiple=False  # True to accept multiple files upload else False
# )
# display(uploader)

It is a common convention to add a `_df` suffix to variables that represent dataframes.
So we save the transcript contents into a variable called `transcript_df`.

In [None]:
import pandas as pd  
# Use 'pd' as a shortcut for 'pandas' as it saves you the effort of typing 'pandas' every time.

transcript_df = pd.read_csv('/content/Morning_Writes_with_Christiana.csv', sep=';')

## Uncomment the code below and comment the code above if running in Jupyter Notebook/Lab
# uploaded_file = uploader.value[0]
# transcript_df = pd.read_csv(io.BytesIO(uploaded_file.content), sep=';')

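# Make sure the 'Utterance' column is stored as text (strings).
# astype() returns a new dataframe, so we assign the result back.
transcript_df = transcript_df.astype({'Utterance': 'str'})
# Print a random sample of the dataframe, showing 5 rows.
transcript_df.sample(5)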

Note the columns shown in the above 'sampling' of the table. This is what it would look like as an Excel (or Numbers) file.

## Define a list of insight-related words
Go back to your notes from your preliminary analysis of transcripts. What words or phrases did you associate with insights?

Write the terms down in the text box below, with each term separated by a comma from the previous term.

In [None]:
text =  widgets.Textarea(
    layout={'height': '100%'},
    value='wow, yeah, see',
    width=100,
    placeholder='term1, term2, term3',
    description='Enter terms:',
    disabled=False
)

text

In [None]:
insight_terms = [x.lower().strip() for x in text.value.split(',')]
print(len(insight_terms), "insight terms entered. Listing...")
print(insight_terms)

## Finding matches between dictionary and text
The next step is to find how many terms from the text match the terms in the dictionary category, and to count every match. Note that I use the word "term" and not "word", since there are a number of multi-word terms in the dictionary, such as `"I see"`. 

Since we will be using this counting part of the code a lot, it might be good to write it up as a function that we can use multiple times.

In [None]:
import re

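def count_matches(text, pattern):
    # A leading '*' is a wildcard: allow any letters before the term.
    if pattern.startswith('*'):
        pattern = r"[A-Za-z]*" + pattern[1:]

    # A trailing '*' is a wildcard: allow any letters after the term.
    if pattern.endswith('*'):
        pattern = pattern[:-1] + r"[A-Za-z]*"

    # Add word boundaries so that "see" does not match inside "seen".
    m = r"\b" + pattern + r"\b"
    matches = re.findall(m, text)
    return len(matches)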
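
A small illustration of how word boundaries (`\b`) and the `*` wildcard change what gets counted. The sentence is made up for this example, and the patterns are written out by hand rather than produced by `count_matches`.

```python
import re

# Hypothetical sentence, already lowercased.
sentence = "oh i see, it seems you have seen this before. i see!"

# Without word boundaries, "see" also matches inside "seems" and "seen".
loose = re.findall(r"see", sentence)

# With \b on both sides, only the standalone word "see" is counted.
strict = re.findall(r"\bsee\b", sentence)

# A term written as "see*" allows extra letters after "see".
wildcard = re.findall(r"\bsee[A-Za-z]*\b", sentence)

# Multi-word terms such as "i see" work the same way.
phrase = re.findall(r"\bi see\b", sentence)

print(len(loose), len(strict), wildcard, len(phrase))
```

The loose pattern finds four matches, while the whole-word pattern finds only the two genuine uses of "see".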

## Count terms for the entire transcript
Let's start by looking at the entire transcript as a whole.

In [None]:
import nltk
# nltk.download('punkt_tab')  # uncomment this line the first time you run this code, then comment it again.
from nltk import word_tokenize

# Make a single string combining all the utterances
transcript_utterances = ' '.join(transcript_df['Utterance'].to_list())

def get_insight_counts(utterance_string, list_of_terms):
    # Count the total number of times any word from the dictionary appears in the transcript
    term_counts = 0
    for term in list_of_terms :
        term_counts += count_matches(utterance_string.lower(), term)
    
    tokens = word_tokenize(utterance_string)
    word_count = len(tokens)
    
    # Avoid dividing by zero when the text contains no words
    normalized_count = term_counts / word_count if word_count > 0 else 0
    return term_counts, word_count, normalized_count


In [None]:
insight_counts, word_count, insight_counts_normalized = get_insight_counts(transcript_utterances, insight_terms)
# print results
print('Counting insight terms for your transcript...')
print('------------------------------------------------')
print("Total number of insight terms in transcript:", insight_counts)
print("Total number of words in transcript:", word_count)
print(f'Fraction of insight terms for the transcript: {insight_counts_normalized: .4f}')
print('------------------------------------------------')
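
To make the normalised count concrete, here is a worked example on a made-up sentence. It uses a simple regular expression as a stand-in for `word_tokenize` (an assumption for this sketch). The numbers from `word_tokenize` will differ slightly, because it also counts punctuation marks as tokens.

```python
import re

# Hypothetical utterance and term list, for illustration only.
example_text = "Wow, that works. Yeah, I see what you mean."
example_terms = ["wow", "yeah", "see"]

lowered = example_text.lower()

# Stand-in tokenizer: runs of letters and apostrophes count as words.
words = re.findall(r"[a-z']+", lowered)

# Count whole-word matches for every term and add them up.
term_total = sum(
    len(re.findall(r"\b" + re.escape(term) + r"\b", lowered))
    for term in example_terms
)

fraction = term_total / len(words)
print(term_total, "insight terms in", len(words), "words")
print(f"Fraction: {fraction:.4f}")
```

Here 3 of the 9 words are insight terms, so the fraction is one third. Dividing by the word count lets you compare long and short texts on the same scale.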

## Concordance Analysis / KeyWord in Context (KWIC) visualization

To verify if your insight terms indeed appear in the context of insights, it might be a good idea to examine the contexts of word use.

For this purpose, we use a KWIC or KeyWord In Context view that shows all occurrences of a word of interest in the context of its surrounding text.

In [None]:
from nltk.text import Text

# enter the variable corresponding to the text you plan to examine
text_to_examine = transcript_utterances

tokens = word_tokenize(text_to_examine.lower())
textList = Text(tokens)

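# Pick the term to examine by its position in the list (0 is the first term).
# Index 2 is the third term, so this needs at least three terms in the list.
term_for_kwic = insight_terms[2]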

if term_for_kwic in text_to_examine.lower() :
    print("looking for occurrences of", term_for_kwic, "...")
    textList.concordance(term_for_kwic, width=125, lines=25)
else :
    print(term_for_kwic, "is not found in text.")
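
To show what a KWIC view does under the hood, here is a minimal sketch in plain Python. The function name `kwic`, its `window` parameter, and the sample sentence are made up for this illustration. This is not how NLTK implements `concordance`.

```python
def kwic(tokens, keyword, window=3):
    """Return each occurrence of keyword with up to `window` tokens on either side."""
    lines = []
    for i, token in enumerate(tokens):
        if token == keyword:
            left = ' '.join(tokens[max(0, i - window):i])
            right = ' '.join(tokens[i + 1:i + 1 + window])
            lines.append(f"{left} [{keyword}] {right}".strip())
    return lines

# Hypothetical text, split on spaces as a stand-in for word_tokenize.
sample_tokens = "oh i see what you mean now let me see that again".split()

for line in kwic(sample_tokens, "see"):
    print(line)
```

Reading the two lines side by side shows that the first "see" signals understanding, while the second is a request to look at something.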

## Analyse transcript by speech turns

We analysed the transcript as a single unit of text, but can we analyse it over time to see when insights occur, and by whom?


In [None]:
transcript_df[["Insight Count", "Word Count", "Norm Insight Count"]]  = transcript_df.apply(lambda row: get_insight_counts(row["Utterance"], insight_terms), axis='columns', result_type='expand') 
transcript_df[0:5]

In [None]:
def convert_to_seconds(timestamp_str):
    hms = [int(x) for x in timestamp_str.split(":")]
    seconds = hms[0] * 3600 + hms[1] * 60 + hms[2]
    return seconds
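
The conversion is plain arithmetic: hours × 3600 + minutes × 60 + seconds. The sketch below works through a few made-up timestamps. The helper `to_seconds` is a hypothetical variant that also accepts `mm:ss`, in case a transcript leaves out the hours.

```python
def to_seconds(timestamp_str):
    """Convert 'hh:mm:ss' or 'mm:ss' to a number of seconds."""
    seconds = 0
    # Each step moves one unit to the right, so multiply the running total by 60.
    for part in timestamp_str.split(":"):
        seconds = seconds * 60 + int(part)
    return seconds

print(to_seconds("00:10:05"))  # 0*3600 + 10*60 + 5
print(to_seconds("01:02:03"))  # 1*3600 + 2*60 + 3
print(to_seconds("10:05"))     # 10*60 + 5
```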

In [None]:
transcript_df["Timestamp (sec)"] = transcript_df["Timestamp"].apply(convert_to_seconds)
transcript_df.to_excel("transcript_with_measures.xlsx")
transcript_df[0:5]
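
Once every speech turn has its own counts, one way to see who uses insight terms is to add the counts up per speaker. The sketch below does this on a tiny made-up table. The numbers are hypothetical, and the column names are assumed to match the ones created above.

```python
import pandas as pd

# Hypothetical speech turns, for illustration only.
toy_df = pd.DataFrame({
    "Speaker": ["Emily", "Christiana", "Emily", "Michael"],
    "Insight Count": [1, 2, 0, 1],
    "Word Count": [10, 8, 5, 4],
})

# Add up the counts for each speaker across all of their turns.
per_speaker = toy_df.groupby("Speaker")[["Insight Count", "Word Count"]].sum()

# Normalise after summing, so long and short turns are weighted fairly.
per_speaker["Norm Insight Count"] = (
    per_speaker["Insight Count"] / per_speaker["Word Count"]
)
print(per_speaker)
```

Running the same `groupby` on `transcript_df` would give one row per speaker in your transcript.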

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt
import matplotlib.ticker as ticker

plt.figure(figsize=[20,4], dpi=600)
nonzero_df = transcript_df[transcript_df["Norm Insight Count"] > 0]
g = sns.scatterplot(data=nonzero_df, x="Timestamp (sec)", y="Norm Insight Count", alpha=0.5)

# Custom format x-ticks every 600 seconds (10 mins)
space = 600
g.xaxis.set_major_locator(ticker.MultipleLocator(space))

# Convert labels from seconds to minutes for easier readability
xticklabels = g.get_xticks()
label_str_list = [''] + [str(int(label/60)) for label in xticklabels[1:-1]] + ['']
g.xaxis.set_ticks(xticklabels)
g.set_xticklabels(label_str_list)
g.set_xlabel("Timestamp (min)")

# Save the figure
plt.savefig("insight_plot.pdf", bbox_inches = "tight")

### Plotting insights separately for each speaker
We exclude Michael and Preston from the speakers and focus only on the songwriters, Christiana and Emily.

In [None]:
# plt.figure(figsize=[20,12], dpi=600)
# Focus on Emily & Christiana
singers_df = nonzero_df[(nonzero_df['Speaker'] == 'Christiana') | (nonzero_df['Speaker'] == 'Emily') ]
g = sns.FacetGrid(singers_df, row="Speaker", height=2, aspect=10, hue="Speaker", palette=sns.color_palette("Set2"))
g.map(sns.scatterplot, "Timestamp (sec)", "Norm Insight Count", alpha=0.8)

# Convert labels from seconds to minutes for easier readability
# Custom format x-ticks every 600 seconds (10 mins)
space = 600
g.axes[1][0].xaxis.set_major_locator(ticker.MultipleLocator(space))
xticklabels = g.axes[1][0].xaxis.get_ticklabels()
label_str_list = [str(int(float(label.get_text())/60)) for label in xticklabels[1:-1]]
g.axes[1][0].xaxis.set_ticks([int(float(x.get_text())) for x in xticklabels[1:-1]])
g.axes[1][0].set_xticklabels(label_str_list)
g.axes[1][0].set_xlabel("Timestamp (min)")

plt.savefig("insight_plot_speakers.pdf", bbox_inches = "tight")