<a href="https://colab.research.google.com/github/yurisugano/ObjectEllicitationNLP/blob/main/2023_ObjectEllicitationAnalysis.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Section 1: Data Cleaning

First, we will read the `Transcripts.docx`, make necessary changes, and save as a new file.

1. Read the `.docx` file into a `doc` object.
2. Each paragraph in the `doc` object corresponds to one sentence stated by a subject
3. We need to format so all subject and object notation is consistent:
  - Enclose all three digit numbers that are not enclosed in curly braces
  - Remove spaces and dashes

---
ℹ️ Packages are aggregates of objects and functions that are used all the time, so they are organized and distributed so others can use. For instance, we imported some packages with `import`. To some of them, we gave them nicknames to make typing easier, so `pandas` can be accessed with `pd`. From other packages we only needed a single function, so we only load that. We used the `Document` function from the `docx` package which can read a `.docx` document. All other package will be explained as they are used.

---

In [1]:
# Load necessary packages
!pip install docx

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import re
import json
import nltk
from docx import Document
plt.style.use('ggplot')

# Read the transcript from the cloud
doc = Document('/content/drive/My Drive/ObjElc Collab/Transcripts.docx')


# Loop over each paragraph in the document
for each_paragraph in doc.paragraphs:

    # Use regex to find three-digit numbers not surrounded by square brackets or curly braces
    numbers = re.findall(r'(?<![\[{])\b(\d{3})\b(?![\]}])', each_paragraph.text)

    # Iterate over the numbers and add curly braces
    for number in numbers:
        transformed_number = '{' + number + '}'
        each_paragraph.text = re.sub(r'\b' + number + r'\b',
                                     transformed_number, each_paragraph.text)


    # Use regex to find numbers inside square brackets with optional spaces and dashes
    matches = re.findall(r'\[([\d\s,-]+)\]', each_paragraph.text)

    # Iterate over the matches
    for match in matches:

        # Split the matched numbers by comma or space and handle ranges indicated by dashes
        numbers = []
        for num_range in re.split(r',\s*|\s+', match):
            num_range = num_range.strip()
            if '-' in num_range:
                start, end = num_range.split('-')
                numbers.extend(range(int(start), int(end) + 1))
            else:
                numbers.append(int(num_range))

        # Construct the transformed string
        transformed = '[' + ']['.join(map(str, numbers)) + ']'

        # Replace the original match with the transformed string
        each_paragraph.text = each_paragraph.text.replace('[' + match + ']', transformed)

PackageNotFoundError: ignored

# Section 2: Extracting Subjects and Sentences

Here, we go through each paragraph and identify the subject, the sentence, and the objects that are referred to

1. We will loop through each paragraph again, now extracting
  - `speaker` with all three digit numbers surrounded by `{ }`
  - `sentence` for the entire string after `:`
  - `objects` for all three digit numbers surrounded by `[ ]`

2. Then we take all sentences by the same speaker and concatenate them in a single string.
3. Lastly, we create a Pandas data frame named `sentence_data`.
  - In order to make each speaker a row and `speaker`, `sentence` and `objects` as columns, we transpose the data frame with `.T`.

4. When you want to inspect the data, you can use the function `head()`, which display the first few rows.

---
ℹ️ A dictionary is an object, a way to store data. We are most familiar with data frames (a format similar to Excel, where each row is an observation and each column is a variable), so we will use data frames to deal with the data, but dictionaries have their own benefits.

In Python, the package Pandas is a particular good way to deal with data frames (here accessed with `pd`).

The `.` notation is confusing. `head()` is a method, which you can think of as a function that an object can perform. Thus the syntax `sentence_data.head()` means "from the object `sentence_data`, perform the function `head()`." Similarly, above, from the object `pd.DataFrame(data)`, we performed the function `.T`

---

In [None]:
# Initialize an empty dictionary to store the data
data = {}

# Loop over each paragraph in the document
for each_paragraph in doc.paragraphs:
    # Use regex to extract the speaker, the object, and the sentence
    speaker_match = re.search(r'\{(\d{3})\}', each_paragraph.text)
    sentence_match = re.search(r': (.*)', each_paragraph.text)
    objects_match = re.findall(r'\[(\d{3})\]', each_paragraph.text)

    if speaker_match and sentence_match:
        speaker = speaker_match.group(1)
        sentence = sentence_match.group(1)
        objects = objects_match

        # Concatenate sentences for each speaker
        if speaker in data:
            data[speaker]['sentence'] += ' ' + sentence
        else:
            data[speaker] = {'speaker': speaker,
                             'sentence': sentence,
                             'objects': objects}

sentence_data = pd.DataFrame(data).T

sentence_data.head()

Unnamed: 0,speaker,sentence,objects
0,0,"OK. It is July 26, 4:20 PM and I'm here with p...",[]
104,104,sounds good I didn’t even see that bag. this...,[]
105,105,OK. Is my bag and stuff OK? [throws {203} at ...,[]
106,106,OK no. all right. right off the bat they are...,[]
107,107,OK I don't think so they're [201][202][203][2...,[]


# Section 3: tokens? Initial steps to data analysis

The first step is to separate the sentence in its individual components. These components are called **tokens**. Note that tokens include individual words, but also commas and punctiation. Let's grab the third subject as an example

1. Get the data from the column named `sentence` for subject with index `[2]`.


In [None]:
example = sentence_data['sentence'][2]

#single object with the entire sentence
print(example)

OK. Is my bag and stuff OK?  [throws {203} at the wall and it falls off] OK so it's like sticks to like hard surfaces and like I said not always the best but when I was a kid and I would play with them and I was always--I would always like to do this [rolls ball on table so it sticks and unsticks, it makes popping sound] and like make popcorn noises. so it's also different colors red blue and green my hair is in it now. it's OK. but this is cool too. it's also a sphere but in like a subtle way because the middle is like a hard little circle and then this forms a circle but it's like all these little things. OK. Ooh. this [202] reminds me of like a pencil eraser almost, just in the way it looks and like the way it feels. I'm not super familiar with a bouncy ball like this, it's kind of like softer than a normal bouncy ball, like I feel like if this hit me it like wouldn't hurt as much. I don't know if that's true but I feel that way but yeah. it's cute. has the little barcode on it. so 


2. Use the package `nltk` to convert the sentence to tokens using the `word_tokenize()` method.


In [None]:
example_tokenized = nltk.word_tokenize(example)

# Tokenized version is an n x 1 array with as many objects as there are tokens
print(example_tokenized)

['OK.', 'Is', 'my', 'bag', 'and', 'stuff', 'OK', '?', '[', 'throws', '{', '203', '}', 'at', 'the', 'wall', 'and', 'it', 'falls', 'off', ']', 'OK', 'so', 'it', "'s", 'like', 'sticks', 'to', 'like', 'hard', 'surfaces', 'and', 'like', 'I', 'said', 'not', 'always', 'the', 'best', 'but', 'when', 'I', 'was', 'a', 'kid', 'and', 'I', 'would', 'play', 'with', 'them', 'and', 'I', 'was', 'always', '--', 'I', 'would', 'always', 'like', 'to', 'do', 'this', '[', 'rolls', 'ball', 'on', 'table', 'so', 'it', 'sticks', 'and', 'unsticks', ',', 'it', 'makes', 'popping', 'sound', ']', 'and', 'like', 'make', 'popcorn', 'noises', '.', 'so', 'it', "'s", 'also', 'different', 'colors', 'red', 'blue', 'and', 'green', 'my', 'hair', 'is', 'in', 'it', 'now', '.', 'it', "'s", 'OK.', 'but', 'this', 'is', 'cool', 'too', '.', 'it', "'s", 'also', 'a', 'sphere', 'but', 'in', 'like', 'a', 'subtle', 'way', 'because', 'the', 'middle', 'is', 'like', 'a', 'hard', 'little', 'circle', 'and', 'then', 'this', 'forms', 'a', 'circl

3. use the `pos_tag()` method which returns the pos (part-of-speech) tag.
---
ℹ️ Python starts counting with 0. So the first object is in position `[0]`, and the third object is in position `[2]`

---

In [None]:
example_postag = nltk.pos_tag(example)


TypeError: ignored

In [None]:

!pip install --target=$nb_path docx

Collecting docx
  Using cached docx-0.2.4-py3-none-any.whl
Collecting lxml (from docx)
  Downloading lxml-4.9.2-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.manylinux_2_24_x86_64.whl (7.1 MB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m7.1/7.1 MB[0m [31m63.0 MB/s[0m eta [36m0:00:00[0m
[?25hCollecting Pillow>=2.0 (from docx)
  Downloading Pillow-9.5.0-cp310-cp310-manylinux_2_28_x86_64.whl (3.4 MB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m3.4/3.4 MB[0m [31m86.8 MB/s[0m eta [36m0:00:00[0m
[?25h

NLTK Downloader
---------------------------------------------------------------------------
    d) Download   l) List    u) Update   c) Config   h) Help   q) Quit
---------------------------------------------------------------------------
Downloader> d

Download which package (l=list; x=cancel)?
  Identifier> l
Packages:
  [ ] abc................. Australian Broadcasting Commission 2006
  [ ] alpino.............. Alpino Dutch Treebank
  [ ] averaged_perceptron_tagger Averaged Perceptron Tagger
  [ ] averaged_perceptron_tagger_ru Averaged Perceptron Tagger (Russian)
  [ ] basque_grammars..... Grammars for Basque
  [ ] bcp47............... BCP-47 Language Tags
  [ ] biocreative_ppi..... BioCreAtIvE (Critical Assessment of Information
                           Extraction Systems in Biology)
  [ ] bllip_wsj_no_aux.... BLLIP Parser: WSJ Model
  [ ] book_grammars....... Grammars from NLTK Book
  [ ] brown............... Brown Corpus
  [ ] brown_tei........... Brown Corpus (TEI XML Version)


    Downloading package punkt to /root/nltk_data...
      Unzipping tokenizers/punkt.zip.



---------------------------------------------------------------------------
    d) Download   l) List    u) Update   c) Config   h) Help   q) Quit
---------------------------------------------------------------------------


KeyboardInterrupt: ignored