<a href="https://colab.research.google.com/github/snedmagdous/Maya-M/blob/main/a0.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## [A0] Getting (to know) the Kardashians: From raw transcripts to information!

INFO/CS 4300, spring 2024

<br/>

<div align="center">
    <img src="https://catandgirl.com/wp-content/uploads/2010/06/2010-06-04-cgtalk.gif" width="600"/>
    <br/>
    Source: <a href="https://catandgirl.com/silent-spring/">
"<i>Silent Spring</i>" (Cat and Girl) by Dorothy Gambrell</a>
    <br/>
    (Distributed under CC BY-NC-SA 2.5 US.)
</div>

<br/>

No part (code, documentation, comments, etc.) of this notebook or of any assignment-related artefact was generated, created, refined, or modified using generative AI tools such as ChatGPT. Cite this notebook as:
> Tushaar Gangavarapu and Cristian Danescu-Niculescu-Mizil. 2024. INFO/CS 4300 Sp'24 A0: Getting (to know) the Kardashians: From raw transcripts to information! GitHub. https://github.coecis.cornell.edu/cs4300-sp24-public/a0/.

__Acknowledgments.__ This work is inspired by the assignment "Getting (to know) the Kardashians" developed, tested, and updated by the course staff from previous runs of the course.

---

__Deadlines__

Follow [Ed #4](https://edstem.org/us/courses/53550/discussion/4170989) for all updates on the assignment; it can be misleading to just follow the "git commit" trail. A few notes:
* Assignment submission deadline: <font color="red">January 26, 2024</font> (Friday), 11:59 pm on the submission site(s).
* This is an _individual_ component (i.e., not to be done in teams), and the use of generative AI tools is prohibited.

__Documentation.__ For your convenience, we maintain documentation for all the modules and scripts used in this assignment at: https://pages.github.coecis.cornell.edu/cs4300/a0/.

__Learning outcomes__

The goal of this assignment is to familiarize yourself with the structure of the data you will be analyzing in the upcoming assignments. To this end, you will:
* (understand how to work with Colab, GitHub, and other tools,)
* process raw transcript data to enable meaningful analyses, and
* draw basic inferences from the processed data.



__Policies.__ All the policies described on the course website apply as is (including the policy on academic integrity and the use of generative AI tools); for more information, see: https://canvas.cornell.edu/courses/62833/.

---

<a name="outline"></a>__Assignment outline__

* [[$\ast$] Attributions](#attr)
* [[0] Imports and installs!](#sec0)
* [[1] Data processing: Extracting information from raw transcripts](#sec1)
  * [[1.1] Preprocessing for "valid" dialogue](#sec11)
  * [[1.2] Preprocessing transcripts for analysis](#sec12)
* [[2] Well, _how much_ does a Kardashian talk!?](#sec2)
* [[$\ast$] Final submission](#final) ← Gradescope autograder provided! <a name="footnote1"></a>[<sup>[1]</sup>](#autograder)

> <a name="autograder"></a><sup>[1] </sup>Passing the autograder doesn't guarantee the full correctness of the tested components (you should be writing your own test cases to ensure that!). Post final submission, your code will be tested on several additional test cases. [↩︎](#footnote1)

---
<a name="attr"></a>
### [$\ast$] Attributions [↩︎](#outline)

Use the space provided below to acknowledge (by name/source) all the resources you consulted in solving this assignment. (Please beware that this assignment is an _individual_ component, i.e., collaboration with other students in the class constitutes a violation of academic integrity.)

_Attributions (if any) go here._

---
<a name="sec0"></a>
### [0] Imports and installs! [↩︎](#outline)

Assuming that you've followed [setup.ipynb](https://github.coecis.cornell.edu/cs4300-sp24-public/a0/blob/main/notebooks/setup.ipynb) and successfully set up the `CS4300/a0` folder, the following code will install any external libraries and needed packages to run the assignment (takes ~1 minute). Before proceeding, be sure to run the second code cell to ensure that the installation is successful.

> __Tip__. We're using Colab to conveniently avoid installing standard packages; you don't need GPUs for this assignment, so to avoid accidentally "running out of GPU cycles," please change your runtime type accordingly.

In [1]:
from google.colab import drive
drive.mount("/content/drive")

%cd "/content/drive/MyDrive/CS4300/a0"

from colab.file_utils import load_required
load_required()

Mounted at /content/drive
/content/drive/MyDrive/CS4300/a0


In [2]:
from IPython.display import display

try:
    from src.utils.utils import success, colored
    display(success())
except ImportError:
    print("\033[31mInstallation failed, please retrace your steps ...")

Success!

Let's import a few packages and methods that are used throughout this notebook; in this notebook, you are free to import and/or install packages (a lot of the packages you may need should already be available) as you see fit. <font color="red">That said, you are __not__ allowed to modify the imports in any of the Python source files; furthermore, please do not modify (delete lines, change method signatures, etc.) above or below the `TODO` placeholders within the Python source files.</font>

In [3]:
import os
import pprint as pp
import random
from glob import glob
from time import process_time

import bs4
from IPython.display import HTML

from src.data_processing.analysis import replace_speaker_name, num_episodes, num_speaker_utterances
from src.data_processing.transcript_parser import KardashiansTranscriptParser
from src.utils.utils import save_dict_to_json, load_dict_from_json

Next, let's set up a few filepaths: for convenience, we will redirect all the output artefacts to the `CS4300/a0/artefacts` folder; this includes processed data, submission .zip files, and others.

In [4]:
BASE_DIR = os.path.abspath(".")

DATASET_DIR = os.path.join(BASE_DIR, "dataset")
ARTEFACTS_DIR = os.path.join(BASE_DIR, "artefacts")
SCRIPTS_DIR = os.path.join(BASE_DIR, "scripts")

Finally, set your net ID below as a string (e.g., "tg352"); we'll use the `net_id` variable to auto-populate any information required in making the submission. Please __don't__ include any special characters (e.g., using "\<tg352\>" or "[tg352]" will result in processing errors).

In [5]:
net_id = "mmm443"

if net_id is None:
    raise ValueError("net-ID not set; set it above")
print(f"{net_id=} set")

net_id='mmm443' set
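
If you'd like to catch a malformed ID before it reaches the submission script, a minimal sketch of a format check follows. It assumes net IDs consist of lowercase letters followed by digits (as in "tg352"); `looks_like_net_id` is a hypothetical helper for illustration and is not part of the provided code.

```python
import re


def looks_like_net_id(candidate):
    """Return True if `candidate` is lowercase letters followed by digits.

    The letters-then-digits format is an assumption based on the "tg352" example.
    """
    return isinstance(candidate, str) and re.fullmatch(r"[a-z]+[0-9]+", candidate) is not None


# The two malformed examples mentioned above are rejected.
checks = {s: looks_like_net_id(s) for s in ["tg352", "<tg352>", "[tg352]"]}
print(checks)
```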


---

<a name="sec1"></a>
### [1] Data processing: Extracting information from raw transcripts <small>[↩︎](#outline)</small>

> __Note.__ There's no code to be written/edited by you in this section (we've done all the work for you!). That said, please peruse this section carefully; you'll need to understand what's done in this section to solve [the next section](#sec2).

Transcripts of the TV show [Keeping Up with the Kardashians](https://en.wikipedia.org/wiki/Keeping_Up_with_the_Kardashians) are available online, and have been downloaded and provided to you in .html format (under the `dataset/` folder). However, in their "raw" format, they're quite unusable; let's see how to process them for analysis!

We will be using the [BeautifulSoup](https://beautiful-soup-4.readthedocs.io/en/latest/) Python library to work with .html files. Let us choose a random transcript and see what it looks like:

In [6]:
random_transcript_filepath = random.choice(glob(f"{DATASET_DIR}/**/*"))
with open(random_transcript_filepath, "r") as fp:
    transcript_bsoup = bs4.BeautifulSoup(fp, "html.parser")

print(transcript_bsoup)

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">

<html>
<head>
<meta content="text/html; charset=utf-8" http-equiv="Content-Type"/>
<title>Keeping Up With the Kardashians - A New Perspective in New Orleans - Friday, Nov 19, 2010 - mReplay Livedash TV Transcript - Livedash - Search what is being mentioned across national TV</title>
<meta content="mReplay Livedash TV Transcript - Keeping Up With the Kardashians - A New Perspective in New Orleans.  Aired on EP, Friday, Nov 19, 2010 at 05:30 PM" name="description"/>
<meta content="mReplay, Livedash, TV, Transcript, Keeping Up With the Kardashians - A New Perspective in New Orleans, EP, Friday, Nov 19, 2010, 05:30 PM" name="keywords"/>
<meta content="IE=EmulateIE7" http-equiv="X-UA-Compatible"/>
<style type="text/css">
</style>
<script type="text/javascript">
    var tabberOptions = {manualStartup:true};
    </script>
<script>
    var RecaptchaOptions = {
       theme : 'white'
    };
 

Unsurprisingly, that's a lot of HTML code, and we need to somehow process it to be able to make meaningful analyses. Before proceeding, let's render the above HTML code to see what data might be relevant to us.

In [7]:
# Render the HTML code in a readable format.
HTML(f"<div style='font-family:serif'>{str(transcript_bsoup)}</div>")

[Rendered HTML output: page header "Keeping Up With the Kardashians - A New Perspective in New Orleans" (aired Friday, Nov 19, 2010, at 05:30 PM), followed by navigation links ("View other episodes", "View more from this channel", "View all transcripts from (11/19/2010)"), ad scripts, and the mReplay Livedash site footer.]

From the above HTML rendering, we make the following observations (feel free to re-run the above cells to randomly choose a different .html file):

* each episode includes a _title_ (e.g., "Keeping Up With the Kardashians - Shape Up or Ship Out"), and

* each _utterance_ in the transcript begins with a timestamp; the timestamp is followed by ">>" only when there's a transition between speakers. For example,
> 00:00:09 >> KENDALL: Right. <br/>
> 00:00:11 <font color="green">>></font> KRIS: Right. <font color="green">← indicates reply</font>

 indicates that KRIS replied to KENDALL; vs.

 > 00:00:15 >> KHLOE: Mom, that is ridiculous. <font color="green">← first utterance</font> <br/>
 > 00:00:20 Are you kidding? <font color="green">← second utterance</font>

 indicates that the character (here, KHLOE) has already begun speaking, and is continuing their chain of thought. (There are a few irregularities, which we will handle later.)
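
The line format described above can be sketched with a regular expression. This is only an illustration of the format, not the parser used in this assignment; the pattern `UTTERANCE` and the helper `parse_line` are hypothetical names, and the speaker pattern assumes names are written in capitals as in the examples.

```python
import re

# HH:MM:SS, then an optional ">> SPEAKER:" marker, then the spoken text.
UTTERANCE = re.compile(
    r"^(?P<timestamp>\d{2}:\d{2}:\d{2})\s+"
    r"(?:>>\s*(?P<speaker>[A-Z][A-Z .'-]*):\s*)?"
    r"(?P<text>.*)$"
)


def parse_line(line):
    """Split one rendered transcript line into (timestamp, speaker, text).

    `speaker` is None when the line continues the previous speaker's turn.
    Returns None if the line does not start with a timestamp.
    """
    match = UTTERANCE.match(line.strip())
    if match is None:
        return None
    return match.group("timestamp"), match.group("speaker"), match.group("text")


new_turn = parse_line("00:00:15 >> KHLOE: Mom, that is ridiculous.")
continuation = parse_line("00:00:20 Are you kidding?")
print(new_turn)
print(continuation)
```

A `None` speaker marks a continuation line, so a caller would carry the most recent speaker forward.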



From the HTML code, we can extract the _title_ of the transcript by looking for the HTML element with the `"id"` attribute set to `"title"`. (As a reminder: this easy "lookup" is facilitated through the use of BeautifulSoup; so, from here on, when we refer to `transcript_bsoup` within code, we're referring to the associated BeautifulSoup object.)

In [8]:
# Retrieve the "title" of the episode.
title = transcript_bsoup.find(attrs={"id": "title"})
print(title.get_text() if title is not None else colored("something went wrong!", "red"))

Keeping Up With the Kardashians - A New Perspective in New Orleans


Similarly, we can extract the conversation in the transcript by querying for `"tr"`s (table rows); each utterance in the conversation is a table row with exactly two cells, one containing the timestamp, the other the text.

> __Tip.__ `&gt;&gt;` ("gt" for greater than) in HTML code is rendered as ">>".

In [9]:
print(colored("(for brevity, logging only 5 entries below)", "red"), "\n")
# `find_all` is the current name for the older `findAll` alias.
pp.pprint(transcript_bsoup.find_all("tr")[100:105])

(for brevity, logging only 5 entries below)

[]
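
To see the two-cell row layout without depending on which transcript was picked, here's a minimal, self-contained sketch on a made-up HTML snippet built from the example lines quoted earlier. It deliberately uses the standard library's `html.parser` instead of BeautifulSoup so that it runs without extra installs; `RowCollector` is a hypothetical helper, not part of the provided code.

```python
from html.parser import HTMLParser


class RowCollector(HTMLParser):
    """Collect the text of every <td> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()  # character references such as &gt; are decoded by default
        self.rows = []
        self._row = None   # cells of the row being read, or None outside a row
        self._cell = None  # text chunks of the cell being read, or None outside a cell

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td" and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag == "td" and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


# Made-up sample mimicking the (timestamp, text) row layout.
sample_html = """
<table>
  <tr><td>00:00:09</td><td>&gt;&gt; KENDALL: Right.</td></tr>
  <tr><td>00:00:11</td><td>&gt;&gt; KRIS: Right.</td></tr>
</table>
"""

collector = RowCollector()
collector.feed(sample_html)
collector.close()

# Keep only rows with exactly two cells: (timestamp, text).
utterance_rows = [tuple(row) for row in collector.rows if len(row) == 2]
print(utterance_rows)
```

Note how `&gt;&gt;` in the HTML source comes out as ">>" in the extracted text.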


<a name="sec11"></a>
#### [1.1] Preprocessing for "valid" dialogue [↩︎](#outline)

Recall the earlier remark that "there are a few irregularities, which we will handle later"; well, _later_ is now! From our inspection, the transcripts contain the following irregularities:

* A few utterances are prepended with information about the actions of some characters, and we plan to remove such actions for simplicity. For example,
> <font color="green">~(Kourtney and Khloe laughing)~</font> >> BRUCE: Sometimes, I can get so disappointed with these girls.

* When characters take turns in quick succession, we may have multiple characters speaking in the same line; here, we wish to break the utterances onto different lines. For example,
> \>> SCOTT: Three? <font color="green">>> KOURTNEY: Yeah.</font>

* You may have noticed that the character names are all capitalized (e.g., KHLOE); when this isn't the case, we ignore that particular utterance.

* Next, we also want to ensure that the utterance starts with a valid character (a letter, a digit, or one of `. ? ! $ " '`); if not, we ignore that utterance.

* Finally, for completeness, we also wish to ensure that an utterance is "valid": an utterance is invalid when no speaker has been marked yet, but the utterance appears to be a continuing conversation. For instance, consider a transcript that _starts_ as follows:
>
> 00:00:00 I am. <font color="green">← first utterance in the transcript: ill-formatted, missing ">>" and speaker name</font> <br/>
> 00:00:20 >> SCOTT: Three?

We already provide you with the code to preprocess a given .html transcript file so that the above irregularities are normalized or removed: see the `KardashiansTranscriptParser` class in `src/data_processing/transcript_parser.py`. (We strongly recommend looking at the class and its methods, or at the very least the documentation provided for each class method.)
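
To make the rules above concrete, here is a rough, unofficial sketch of how they could be applied to the text of each line (timestamps already removed). It is not the implementation in `KardashiansTranscriptParser`, and its output format is an assumption; `split_turns` and `normalize` are hypothetical names.

```python
import re

ACTION = re.compile(r"^\s*\([^)]*\)\s*")          # leading "(... laughing)" action note
SPEAKER_TURN = re.compile(r">>\s*([^:>]+):\s*")    # ">> NAME: " marker
VALID_START = re.compile(r"""^[A-Za-z0-9.?!$"']""")  # letter, digit, or . ? ! $ " '


def split_turns(line):
    """Return (speaker, text) pairs for one line; speaker is None for a continuation."""
    line = ACTION.sub("", line)
    if not line.startswith(">>"):
        return [(None, line.strip())]
    # Splitting on the marker yields ['', NAME, text, NAME, text, ...].
    parts = SPEAKER_TURN.split(line)
    turns = []
    for name, text in zip(parts[1::2], parts[2::2]):
        name, text = name.strip(), text.strip()
        # Ignore turns whose speaker is not capitalized or whose text starts oddly.
        if name.isupper() and VALID_START.match(text):
            turns.append((name, text))
    return turns


def normalize(lines):
    """Apply the rules to a list of lines, dropping continuations with no speaker yet."""
    current_speaker, utterances = None, []
    for line in lines:
        for speaker, text in split_turns(line):
            if speaker is None:
                if current_speaker is None or not VALID_START.match(text):
                    continue  # orphan continuation, or invalid first character
                speaker = current_speaker
            current_speaker = speaker
            utterances.append((speaker, text))
    return utterances


toy_lines = [
    "I am.",  # ill-formatted: no speaker has been marked yet
    "(Kourtney and Khloe laughing) >> BRUCE: Sometimes, I can get so disappointed with these girls.",
    ">> SCOTT: Three? >> KOURTNEY: Yeah.",
    "Right.",
]
normalized = normalize(toy_lines)
print(normalized)
```

In this sketch, the orphan first line is dropped, the action note is stripped, the two-speaker line is split into two utterances, and the final continuation is attributed to the most recent speaker.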

Let's run the `KardashiansTranscriptParser` for the sample transcript file; the parser returns a tuple of the transcript unique identifier (UID) string, transcript title string, and the transcript content (conversation). Observe the format of the returned transcript content.

In [10]:
transcript_parser = KardashiansTranscriptParser()
random_transcript_uid, random_transcript_title, random_transcript_convo = transcript_parser.parse(
    transcript_filepath=random_transcript_filepath
)

# The transcript UID is extracted from the filepath.
print(
    f"{colored('filepath', attrs=['underline'])}: "
    f"{random_transcript_filepath.split('_')[0]}_{colored(random_transcript_filepath.split('_')[1][:-5], 'blue')}.html"
    f"\n{colored('uid', attrs=['underline'])}: {colored(random_transcript_uid, 'blue')}"
)

# The transcript title is extracted as shown before.
print(f"\n{colored('title', attrs=['underline'])}: {random_transcript_title}")

# The transcript conversation is processed to normalize for irregularities.
print(
    f"\n{colored('transcript content', attrs=['underline'])}:\n"
    f"{colored('(for brevity, only the first 10 entries are shown below.)', 'red')}\n"
)
pp.pprint(random_transcript_convo[:10])

filepath: /content/drive/MyDrive/CS4300/a0/dataset/livedash_kardashians3/514592.html
uid: kardashians3/514592

title: Keeping Up With the Kardashians - A New Perspective in New Orleans

transcript content:
(for brevity, only the first 10 entries are shown below.)

[]
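
As a small illustration of the filepath-to-UID relationship printed above, the sketch below derives the UID from a path. The rule (drop the extension, then take everything after the last underscore) is inferred only from the single example shown and may not match the parser's actual logic; `uid_from_filepath` is a hypothetical helper.

```python
import os


def uid_from_filepath(filepath):
    """Derive a transcript UID from a path like '.../livedash_<uid>.html' (assumed layout)."""
    stem, _extension = os.path.splitext(filepath)
    # Split on the LAST underscore so underscores earlier in the path are harmless.
    return stem.rsplit("_", 1)[1]


example_path = "/content/drive/MyDrive/CS4300/a0/dataset/livedash_kardashians3/514592.html"
example_uid = uid_from_filepath(example_path)
print(example_uid)
```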


<a name="sec12"></a>
#### [1.2] Preprocessing transcripts for analysis [↩︎](#outline)

Now that we have our transcript parser (a.k.a., `KardashiansTranscriptParser`), we can preprocess all the transcript files provided in the `dataset/` folder. To this end, we will build two dictionaries:
* `titles` that maps a `transcript_uid` (unique identifier) to the associated transcript title, and
* `transcripts` that maps a `transcript_uid` to the parsed transcript content.

> __Note.__ The following cell takes less than 2 minutes to complete processing all 294 transcript files.

In [11]:
titles, transcripts = {}, {}
transcript_parser = KardashiansTranscriptParser()

_start_time = process_time()
for filepath in glob(f"{DATASET_DIR}/**/*"):
    if os.path.splitext(filepath)[1].lower() == ".html":
        uid, title, transcript = transcript_parser.parse(transcript_filepath=filepath)
        titles[uid] = title
        transcripts[uid] = transcript
time_taken = process_time() - _start_time

print(f"{len(titles.keys())} transcript files processed in {round(time_taken / 60, 2)} minutes")

294 transcript files processed in 0.95 minutes


In future assignments, we will be analyzing the _language_ used by the central characters in the show. It turns out that one of the characters is referred to by two different names, Rob and Robert. We provide a helper function `replace_speaker_name` (in `src/data_processing/analysis.py`), which can be used to replace a specified name with a new one.

Run the cell below to replace all occurrences of "ROB" with "ROBERT".

In [12]:
total_transcript_keys_before_processing = len(transcripts.keys())

# Replace the speaker name "ROB" with "ROBERT"
transcripts = replace_speaker_name(input_transcripts=transcripts, original_name="ROB", replacement_name="ROBERT")

assert len(transcripts.keys()) == total_transcript_keys_before_processing

[DEBUG]	replace_speaker_name ran in: 0.03023 seconds


Let's go ahead and save the `titles` and `transcripts` dictionaries as .json files (so we don't waste compute preprocessing the data again). For your convenience, we provide the `save_dict_to_json` and `load_dict_from_json` helper methods in the `src/utils/utils.py` file.

Running the following cell will save the `titles` dictionary to `artefacts/processed_data/titles.json` and `transcripts` dictionary to `artefacts/processed_data/transcripts.json`.

In [13]:
save_dict_to_json(dict_to_save=titles, filepath=os.path.join(ARTEFACTS_DIR, "processed_data/titles.json"))
save_dict_to_json(dict_to_save=transcripts, filepath=os.path.join(ARTEFACTS_DIR, "processed_data/transcripts.json"))

dict saved to: /content/drive/MyDrive/CS4300/a0/artefacts/processed_data/titles.json
dict saved to: /content/drive/MyDrive/CS4300/a0/artefacts/processed_data/transcripts.json


Let's re-load the dictionaries from the saved .json files and ensure all is as expected.

In [14]:
titles = load_dict_from_json(filepath=os.path.join(ARTEFACTS_DIR, "processed_data/titles.json"))
transcripts = load_dict_from_json(filepath=os.path.join(ARTEFACTS_DIR, "processed_data/transcripts.json"))

assert titles is not None
assert transcripts is not None
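
The asserts above only confirm that something was loaded. A slightly stronger check is round-trip equality: what you load should equal what you saved. The sketch below demonstrates the idea with the standard `json` module on a toy dictionary; `save_json` and `load_json` are hypothetical stand-ins, and the provided helpers may be implemented differently.

```python
import json
import os
import tempfile


def save_json(data, filepath):
    """Write `data` to `filepath` as UTF-8 JSON, creating parent folders if needed."""
    os.makedirs(os.path.dirname(filepath), exist_ok=True)
    with open(filepath, "w", encoding="utf-8") as fp:
        json.dump(data, fp, ensure_ascii=False, indent=2)


def load_json(filepath):
    """Read a UTF-8 JSON file back into Python objects."""
    with open(filepath, "r", encoding="utf-8") as fp:
        return json.load(fp)


toy_titles = {
    "kardashians3/514592": "Keeping Up With the Kardashians - A New Perspective in New Orleans"
}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "processed_data", "titles.json")
    save_json(toy_titles, path)
    reloaded = load_json(path)

roundtrip_ok = reloaded == toy_titles
print(roundtrip_ok)
```

Keep in mind that JSON has no tuple type and only string keys, so tuples come back as lists and non-string keys come back as strings.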

Finally, before proceeding, let's marvel at the scale we're dealing with: run the cell below to count the total number of messages stored in the transcripts.

In [15]:
num_messages = sum(len(transcript) for transcript in transcripts.values())
print(f"{len(transcripts.keys())} transcript files w/ {colored(str(num_messages) + ' messages', 'blue')}!")

294 transcript files w/ 202457 messages!


---

<a name="sec2"></a>
### [2] Well, _how much_ does a Kardashian talk!? <small>[↩︎](#outline)</small>

> <font color="orange">File to be edited: `src/data_processing/analysis.py`.</font>

Upon inspection, we noticed that a single episode (indicated by the title of the episode) could be transcribed over multiple transcript files, resulting in data duplicates. To this end, we wish to determine the total number of distinct episodes in the given files.

Please complete <font color="orange">`TODO-2.1`</font> in the `num_episodes` method within `src/data_processing/analysis.py`. Upon completion, run the cell below to count the total number of distinct episodes across all the transcript files.

In [16]:
total_episodes = num_episodes(input_titles=titles)

# Check that `num_episodes` returns the expected output.
assert total_episodes == 56, f"{total_episodes} titles != 56"

[DEBUG]	num_episodes ran in: 0.00003 seconds


We're often interested in the number of times a specific character speaks across episodes. To this end, complete <font color="orange">`TODO-2.2`</font> in the `num_speaker_utterances` method within `src/data_processing/analysis.py`. Upon completion, run the cell below to count the total number of utterances by the speaker "ROBERT" in all the transcript files.

(As an aside: such analysis has broader implications beyond this assignment; for example, analyzing the frequency of native vs. non-native speaker utterances in a controlled setting could be extremely insightful!)

In [17]:
num_robert_utterances = num_speaker_utterances(input_transcripts=transcripts, speaker="ROBERT")

# Check that `num_speaker_utterances` returns a count within the expected range.
assert num_robert_utterances >= 18117, f"{num_robert_utterances} utterances < 18,117"
assert num_robert_utterances <= 18591, f"{num_robert_utterances} utterances > 18,591"

[DEBUG]	num_speaker_utterances ran in: 0.02572 seconds


---

<a name="final"></a>
### [$\ast$] Final submission <small>[↩︎](#outline)</small>

Hurray! Now that we've successfully completed the code for analysis, let's bundle everything up and make a submission on Gradescope. Running the cell below will generate `a0_submission.zip` in the `CS4300/a0/artefacts` folder.

<font color="red">Caution: the script will overwrite any existing file named `a0_submission.zip` in the `CS4300/a0/artefacts` folder.</font>

> __Tip.__ For any of the scripts provided, you can run `!<command-name> --help` to see the arguments of the command! <br/>
(Replace the `<command-name>` accordingly.)

You will need to submit the `a0_submission.zip` and a .pdf version of this notebook file to Gradescope as follows:

*  Upload `artefacts/a0_submission.zip` to the Gradescope assignment [A0 code](https://www.gradescope.com/courses/709539/assignments/3988439). Upon submission, the autograder will automatically run "public" tests; however, as noted earlier, for your final autograder score, we will run your code through several additional "hidden" test cases.

* Submit a .pdf version of this notebook file (_don't_ clear out the run outputs) to the [A0 notebook [.pdf file]](https://www.gradescope.com/courses/709539/assignments/3988480) Gradescope assignment.

Note: the .pdf of this notebook you submit will only be used for record-keeping; you will _not_ be graded on any code in this notebook file.

In [18]:
!make_submission.py \
    --basepath-to-store-submission={ARTEFACTS_DIR} \
    --net-id={net_id}

if os.path.isfile(os.path.join(ARTEFACTS_DIR, "a0_submission.zip")):
    display(success())
else:
    print(colored("Oops, something went wrong!", "red"))

submission stored at: /content/drive/MyDrive/CS4300/a0/artefacts/a0_submission.zip

Success!