# Week 1: Introduction

In [None]:
__author__ = "Rose E. Wang, Dorottya Demszky"
__version__ = "CS293/EDUC473, Stanford, Fall 2023"

## Table of Contents

* [Overview](#overview)
    * [Deliverables](#your-deliverables)
    * [Linking Colab Notebooks to your GDrive](#linking-this-colab-to-your-gdrive)
* [Downloading the NCTE dataset](#downloading-the-dataset)
* [Exploring the NCTE dataset](#exploring-the-ncte-dataset)
    * [Transcript Data](#exploring-the-transcripts)
    * [Metadata](#exploring-the-metadata)
    * [Student Reasoning Data](#exploring-the-student-reasoning-data)
* [Text Analysis Tools](#text-analysis-tools)
    * [Word Frequencies](#word-frequencies)
    * [Log-Odds Analysis](#log-odds-analysis)
    * [Topic Modeling](#topic-modeling)
    * [Clustering](#clustering)
* [Assignment](#assignments)
* [Extra Assignment](#extra-assignments)

## Overview

The purpose of this notebook is to introduce the dataset we'll be working with throughout the quarter and to perform some simple data analysis on it.


The dataset we'll be using this quarter is called the [NCTE dataset](https://arxiv.org/pdf/2211.11772.pdf) [1], which contains classroom transcripts of 4th and 5th grade mathematics classes.
The dataset was collected by the National Center for Teacher Effectiveness (NCTE) between 2010 and 2013. It contains anonymized transcripts representing data from 317 teachers across 4 school districts that serve largely historically marginalized students. The transcripts come with rich metadata, including turn-level annotations for dialogic discourse moves, classroom observation scores, demographic information, survey responses, and student test scores. For more details on the dataset, please refer to reference [1].

[1] Demszky, D., & Hill, H. (2022). The NCTE transcripts: A dataset of elementary math classroom transcripts. arXiv preprint arXiv:2211.11772.

### Your Deliverables

To receive credit for this assignment, please upload the PDF version of your ENTIRE Colab that includes all your code and written responses to Gradescope.

### Note

**We are assuming you are running everything within [Colab](https://colab.research.google.com/).** We are not responsible for bugs created outside of the Colab environment.


## Linking this Colab to your GDrive

Please download the repository from [https://github.com/rosewang2008/cs293_empowering_educators_via_language_technology](https://github.com/rosewang2008/cs293_empowering_educators_via_language_technology).

Then, upload the repository to your Google Drive and link the directory to all future colabs. The following command should link your GDrive to this colab.



In [None]:
from google.colab import drive
drive.mount('/content/drive')

Specify the path to the directory where you've uploaded the GitHub repository.
The directory you specify should contain items such as the `data` folder and the Python notebooks.

In [None]:
DIR_PATH = "/content/drive/your_path_to_the_folder_REPLACE_ME/"

## Downloading the NCTE dataset

1. Fill out the Google form in [https://github.com/ddemszky/classroom-transcript-analysis](https://github.com/ddemszky/classroom-transcript-analysis).
2. That form should share with you a Google Drive folder containing the files `ncte_single_utterances.csv`, `student_reasoning.csv`, and `paired_annotations.csv`.
3. Download these files and put them in your Google Drive folder for the course, e.g., if `DIR_PATH = /content/drive/empowering_educators_via_language_technology/`, then you should put the CSV files in the existing `data` folder: `/content/drive/empowering_educators_via_language_technology/data/`

The final output should be:
```
/content/drive/empowering_educators_via_language_technology/data/
    ├── ncte_single_utterances.csv
    ├── student_reasoning.csv
    └── paired_annotations.csv
```
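
Before moving on, it can help to confirm that the three files landed in the right place. The following is a minimal sketch, not part of the official course code: `missing_data_files` and `EXPECTED_FILES` are hypothetical names introduced here, and the file list is taken from the steps above.

```python
import os

# File names taken from the download instructions above
EXPECTED_FILES = [
    "ncte_single_utterances.csv",
    "student_reasoning.csv",
    "paired_annotations.csv",
]

def missing_data_files(data_dir, expected=EXPECTED_FILES):
    """Return the expected file names that are not present in data_dir."""
    return [
        name for name in expected
        if not os.path.isfile(os.path.join(data_dir, name))
    ]

# Example usage (DIR_PATH is the variable defined earlier in this notebook):
# print(missing_data_files(os.path.join(DIR_PATH, "data")))
```

An empty list means all three files were found.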

In [None]:
import os
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np

# Pretty plotting
sns.set_style('whitegrid')
sns.set_context("paper", font_scale=1.5, rc={"lines.linewidth": 2.5})


## Exploring the NCTE dataset

### Exploring the transcripts
Let's explore the NCTE transcripts, `ncte_single_utterances.csv`, which contains all the utterances from the transcript dataset.
The `OBSID` column is the unique ID for the transcript and the `NCTETID` column is the teacher ID; both can be mapped to the metadata.
`comb_idx` is a unique ID for each utterance (the concatenation of `OBSID` and `turn_idx`), which can be mapped to the turn-level annotations.

The other two files `paired_annotations.csv` and `student_reasoning.csv` are derivatives of the utterance csv.
We will explore the `student_reasoning.csv` file later.

Now, let's take a closer look at `ncte_single_utterances.csv` and perform some basic checks of the data.

In [None]:
utterances = pd.read_csv(os.path.join(DIR_PATH,'data/ncte_single_utterances.csv'))

In [None]:
print(f"Columns: {utterances.columns}")

utterances

The columns in `utterances` are:
- `speaker`: The speaker of the utterance.
- `text`: The utterance text.
- `year`: The school year in which the transcript was recorded. 1 = 2010-11, 2 = 2011-12, 3 = 2012-13 school year.
- `OBSID`: The unique ID for the transcript.
- `video_id`: The unique ID of the video from which the transcript was taken.
- `cleaned_text`: The cleaned version of `text` with removed punctuation and lower casing.
- `num_words`: Number of words in the utterance text.
- `turn_idx`: The utterance turn number in the transcript.
- `comb_idx`: The concatenation of `OBSID` and `turn_idx`, i.e., `comb_idx = <OBSID>_<turn_idx>`.
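
To make the `comb_idx` definition concrete, here is a small sketch on a hypothetical toy dataframe (the IDs are made up, and the exact string formatting of the real IDs is an assumption based on the `<OBSID>_<turn_idx>` pattern above).

```python
import pandas as pd

# Hypothetical toy data: two transcripts with a few turns each
toy = pd.DataFrame({
    "OBSID": [17, 17, 42],
    "turn_idx": [0, 1, 0],
})

# Concatenate the transcript ID and the turn number with an underscore
toy["comb_idx"] = toy["OBSID"].astype(str) + "_" + toy["turn_idx"].astype(str)
print(toy["comb_idx"].tolist())  # ['17_0', '17_1', '42_0']
```

Because each (`OBSID`, `turn_idx`) pair occurs once, the resulting string can serve as a join key against the turn-level annotation files.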

Let's take a look at an example of a classroom transcript. We'll pick one of the transcript IDs and print the transcript.

In [None]:
# Get the unique transcript IDs, sort, and pick the first ID to print transcript for.
transcript_id = sorted(list(utterances['OBSID'].unique()))[0] # 3
transcript_df = utterances[utterances['OBSID'] == transcript_id]

This code will take a df that contains all the utterances from the same classroom session, sort the utterances by a column indicating utterance order, and print each utterance as: `[speaker_name]: [utterance]\n`.

In [None]:
def construct_transcript_from_df(transcript_df, sort_by, speaker_column, text_column, print_transcript=False):
    transcript_text = ""
    # Sort the transcript by the sort_by column
    transcript_df = transcript_df.sort_values(by=sort_by)
    # Add to the transcript_text as `speaker: text` for each row in the dataframe
    for _, row in transcript_df.iterrows():
        # The f-string also handles missing (NaN) speakers or text without raising a TypeError
        transcript_text += f"{row[speaker_column]}: {row[text_column]}\n"
    # Print the transcript text if print_transcript is True
    if print_transcript:
        print(transcript_text)
    return transcript_text

In [None]:
transcript_text = construct_transcript_from_df(
    transcript_df=transcript_df,
    sort_by='turn_idx',
    speaker_column='speaker',
    text_column='text',
    print_transcript=False
)

print(f"Transcript:\n{transcript_text}\n")


Before performing any analysis on the transcript _text_, let's look at the column information already provided in the dataframe!
This is good practice to get a sense of numbers and scale in a new dataset.

For example, we can look at the **number of total transcripts, and number of transcripts per school year**...

In [None]:
# Count total number of transcripts
total_transcripts = len(utterances['OBSID'].unique())
print(f"Total number of transcripts: {total_transcripts}")
# Count number of unique `OBSID`s per `year`
transcripts_per_year = utterances.groupby('year')['OBSID'].nunique()
print(f"Transcripts per year:\n{transcripts_per_year}")

# Plot transcripts per year
sns.barplot(x=transcripts_per_year.index, y=transcripts_per_year.values)
plt.xlabel('Year')
plt.ylabel('Number of transcripts')
plt.title('Number of transcripts per year')
plt.show()

We can also take a look at some statistics regarding the transcript utterances.

For example, we can look at the **average number of utterances per transcript**.

In [None]:
# Average number of utterance turns `turn_idx` across transcripts `OBSID`
avg_turns = utterances.groupby('OBSID')['turn_idx'].max().mean()
print(f"Average number of utterance turns across transcripts: {avg_turns:.2f}")

We might also be interested in the average number of utterances spoken by the teacher vs. the students in the classroom:

In [None]:
# Average number of utterance turns per speaker `speaker` across all transcripts `OBSID`:
unique_speakers = utterances['speaker'].unique()
for speaker in unique_speakers:
    if pd.isnull(speaker):
        continue
    speaker_df = utterances[utterances['speaker'] == speaker]
    # Count number of turns per transcript (one count per OBSID)
    turns_per_transcript = speaker_df.groupby('OBSID')['turn_idx'].count()
    # Average the number of turns across transcripts
    avg_turns_per_speaker = turns_per_transcript.mean()
    print(f"Average # turns for speaker {speaker}: {avg_turns_per_speaker:.2f}")
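
The per-transcript averaging can also be written as a two-step `groupby`, without a Python loop. The sketch below uses a hypothetical toy dataframe rather than the real data, so the numbers are purely illustrative.

```python
import pandas as pd

# Hypothetical toy data: two transcripts (OBSID 1 and 2)
toy = pd.DataFrame({
    "OBSID":    [1, 1, 1, 1, 2, 2],
    "speaker":  ["teacher", "student", "teacher", "teacher", "teacher", "student"],
    "turn_idx": [0, 1, 2, 3, 0, 1],
})

# Step 1: number of turns for each (speaker, transcript) pair
turns = toy.groupby(["speaker", "OBSID"]).size()
# Step 2: average those counts over transcripts, separately per speaker
avg_turns = turns.groupby(level="speaker").mean()
print(avg_turns)
```

Note that a transcript in which a speaker never talks contributes no row in step 1, so it is left out of that speaker's average rather than counted as zero.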

Finally, let's look at the **average number of words each speaker uses per utterance**.

In [None]:
# Average number of words per speaker
for speaker in unique_speakers:
    if pd.isnull(speaker):
        continue
    speaker_df = utterances[utterances['speaker'] == speaker]
    avg_words = speaker_df['num_words'].mean()
    print(f"Average # words for {speaker}: {avg_words:.2f}")

**Things to think about**:
- The "average # turns" cell shows that the `teacher` and `student` speakers have a similar number of turns across transcripts, i.e., ~239 teacher utterances compared to ~211 student utterances.
- However, the average # of words indicates a noticeable difference in their utterance lengths, i.e., ~29 words in the teachers' utterances compared to ~4 words in the students' utterances.
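
One way to combine these two observations is to look at each speaker's share of all words spoken. The following is a minimal sketch on hypothetical toy numbers (not the real data); on the real dataframe, `toy` would be replaced by `utterances`.

```python
import pandas as pd

# Hypothetical toy data: same number of turns, very different turn lengths
toy = pd.DataFrame({
    "speaker":   ["teacher", "student", "teacher", "student"],
    "num_words": [29, 4, 31, 6],
})

# Total words per speaker, divided by the total number of words overall
words_per_speaker = toy.groupby("speaker")["num_words"].sum()
word_share = words_per_speaker / words_per_speaker.sum()
print(word_share.round(3))
```

In this toy example both speakers take two turns each, yet the teacher accounts for 60 of the 70 words.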

#### Manually checking the (quality of the) text

Before even running any kind of "smart" analysis methods on the text, it's always good to manually look at examples.
Previously we printed out an example of a transcript.
Here, we are going to print out a few examples of the teacher utterances and student utterances.

In [None]:
# Helpful utility function!
def print_line_separated(df, speaker_column, text_column):
    for speaker, text in zip(df[speaker_column], df[text_column]):
        print(f"{speaker}: {text}")

In [None]:
NUM_SAMPLES = 15
# Seed the random number generator to get the same results
np.random.seed(42)

# Get NUM_SAMPLES of the teacher utterances
teacher_df = utterances[utterances['speaker'] == 'teacher'].sample(NUM_SAMPLES)
print("Teacher utterances:")
print_line_separated(df=teacher_df, speaker_column='speaker', text_column='text')
print()

# Get NUM_SAMPLES of the student utterances
student_df = utterances[utterances['speaker'] == 'student'].sample(NUM_SAMPLES)
print("Student utterances:")
print_line_separated(df=student_df, speaker_column='speaker', text_column='text')



In [None]:
""">>>Expected result:

Teacher utterances:
teacher: Okay, I can double it by doing 15 times 4?  Does everybody agree with that?  So this is doubling it?
teacher: Okay, go get paper, pencils.  This group, you're gonna work over here together.
teacher: How they've shown it on here, every part of what they did they're able to show on here and you're able to see, right. That's the way it needs to be when you're showing your work. Let's see.  Student A, come here.  You can come up.  Can you tell us how this side works?
teacher: Because that’s not really modeling the tenth.  It’s all got to be [inaudible].  Nope.
teacher: Student B's answering.
teacher: All your pages are filled up?
teacher: I love what you just said.  Did everyone hear that?
teacher: Keep going.
teacher: Add a 0.  Now let's look at the numbers.  The percent – does that make sense?  Are these two numbers pretty close to each other?
teacher: To ten, right?
teacher: That’s a good strategy.  So you’re imagining – one second. You’re imagining the two zeros being in place if it’s a tenth [Inaudible]?  Good.  Make sure you share that out later on.
teacher: Huh?
teacher: Okay.  Boards down.  We’re going to try again because there was some noise.
teacher: We’re going to measure this height, right?  So we’re doing L, right?
teacher: Oh, you’re fine, you can keep writing because I’ve got to do this again.  It didn't like me before.  How many people are still writing?  Okay.

Student utterances:
student: They’re all wearing two –
student: I know the answer for [inaudible].
student: So we’re making a story problem like the other one?
student: Mrs. H, can you put this inside the bag?
student: You divide nine fifteenths.
student: Yes.
student: Okay, you get like [inaudible].
student: A part of a whole.
student: Never mind.
student: Subtract.
student: A triangle, a star, a circle or a square.
student: First I did this.  [Inaudible].
student: Improper fraction.
student: 14.
student: Eight thousand.

"""

It's always good to make observations about the type of texts or patterns in the data.

**Things to think about:** Manually looking at the _teacher_ utterances, we notice a few things.
- Some of the utterances are math content related (e.g., "Add a 0.  Now let's look at the numbers.  The percent – does that make sense?  Are these two numbers pretty close to each other?") and others are not (e.g., "Okay, go get paper, pencils.  This group, you're gonna work over here together.")
- Some of the utterances are about class management such as calling everyone's attention, "I love what you just said.  Did everyone hear that?"
- Some of the utterances are about supporting the student's thinking e.g., "Okay, I can double it by doing 15 times 4?  Does everybody agree with that?  So this is doubling it?"


**Things to think about:** Manually looking at the _student_ utterances, we notice a few things.
- The utterances are shorter than the teacher's utterances.
- The utterances seem to be short answers to questions e.g., "14.", "Yes.", "Subtract."

### Exploring the metadata

Now let's switch gears to explore the metadata folder `data/ICPSR_36095`!
This folder contains many subfolders `DS00##`, each of which contains a `tsv` file and PDF documenting the contents of the `tsv` file.
Each subfolder contains different types of metadata.
For example, `DS0001` contains the metadata and documentation for class observations (a rubric on the teacher's class management and behavior), and `DS0006` contains metadata and documentation on a teacher background questionnaire.


In this section we're going to use `DS0006` and compute the general statistics on the teacher's questionnaire metadata.

In particular, we are going to determine:
- Number of teachers
- % Male
- % Black
- % Asian
- % Hispanic
- % White
- Avg number of years of teaching experience
- % BA in education?


In [None]:
# Load metadata
fpath = os.path.join(DIR_PATH,'data/ICPSR_36095/DS0006/36095-0006-Data.tsv')
teacher_metadata = pd.read_csv(fpath, sep='\t')
print(teacher_metadata.columns)
teacher_metadata.head()

Using the documentation `data/ICPSR_36095/DS0006/36095-0006-Codebook.pdf`, we can easily compute the desired statistics:

In [None]:
# Number of teachers
num_teachers = teacher_metadata['NCTETID'].nunique()
print(f"Number of teachers = {num_teachers}")

# Percentage Male
# Drop non 0/1 values
teacher_metadata = teacher_metadata[teacher_metadata['MALE'].isin([0,1])]
num_male = teacher_metadata['MALE'].sum()
perc_male = ( num_male / num_teachers ) * 100
# Round
perc_male = round(perc_male, 2)
print(f"Percentage male = {perc_male}")

# Percentage Black
num_black = teacher_metadata['BLACK'].sum()
perc_black = ( num_black / num_teachers ) * 100
perc_black = round(perc_black, 2)
print(f"Percentage black = {perc_black}")

# Percentage Asian
num_asian = teacher_metadata['ASIAN'].sum()
perc_asian = ( num_asian / num_teachers ) * 100
perc_asian = round(perc_asian, 2)
print(f"Percentage asian = {perc_asian}")

# Percentage Hispanic
num_hispanic = teacher_metadata['HISP'].sum()
perc_hispanic = ( num_hispanic / num_teachers ) * 100
perc_hispanic = round(perc_hispanic, 2)
print(f"Percentage hispanic = {perc_hispanic}")

# Percentage White
num_white = teacher_metadata['WHITE'].sum()
perc_white = ( num_white / num_teachers ) * 100
perc_white = round(perc_white, 2)
print(f"Percentage white = {perc_white}")

# Average number of years of teaching experience
# Blank entries (e.g., ' ') are converted to NaN, which .mean() ignores.
# Working on a separate Series avoids dropping rows needed for the statistics below.
experience = pd.to_numeric(teacher_metadata['EXPERIENCE'], errors='coerce')
avg_years = experience.mean()
avg_years = round(avg_years, 2)
print(f"Average number of years of teaching experience = {avg_years}")

# BA in education
num_ba = teacher_metadata['EDBACHELORS'].sum()
perc_ba = ( num_ba / num_teachers ) * 100
perc_ba = round(perc_ba, 2)
print(f"Percentage with BA in education = {perc_ba}")
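
The percentage computations above all follow the same pattern, so they could be factored into a helper. The sketch below is one possible version, not the course's reference solution: `percent_of_flag` is a hypothetical name, and unlike the cell above it divides by the number of valid 0/1 responses in that column rather than by `num_teachers`.

```python
import pandas as pd

def percent_of_flag(df, column):
    """Percentage of rows where a 0/1 indicator column equals 1.

    Values other than 0 or 1 (e.g., blanks) are ignored.
    """
    values = pd.to_numeric(df[column], errors="coerce")
    valid = values[values.isin([0, 1])]
    return round(valid.mean() * 100, 2)

# Hypothetical toy data with one blank response
toy = pd.DataFrame({"MALE": [0, 1, 0, 0, " "]})
print(percent_of_flag(toy, "MALE"))  # 25.0
```

Which denominator is appropriate depends on how you want to treat missing responses, so it is worth stating the choice when reporting the numbers.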

### Exploring the student reasoning data

Finally, we explore the student reasoning data `data/student_reasoning.csv`.
The file contains utterances under the column `text` labeled for whether there is student reasoning under the column `student_reasoning`.

The data is labeled by humans who saw the context for the utterance (the preceding utterance) and labeled whether the utterance is the student expressing their reasoning.

Again, we might want to perform some preliminary checks of the data. In this example, we will look at the number of utterances labeled as student reasoning (or not), and the length of these utterances.

In [None]:
student_reasoning_fpath = os.path.join(DIR_PATH,'data/student_reasoning.csv')
student_reasoning = pd.read_csv(student_reasoning_fpath)

# Plot the number of utterances labeled as `reasoning`.
sns.barplot(x=student_reasoning['student_reasoning'].value_counts().index, y=student_reasoning['student_reasoning'].value_counts().values)
plt.xlabel('Is Student Reasoning?')
plt.ylabel('Number of utterances')
plt.title('Number of utterances labeled as `reasoning`')
plt.show()


This dataset is imbalanced, meaning that one class has many more instances than the other.
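
To quantify the imbalance rather than read it off a plot, the class proportions can be computed directly. A minimal sketch on hypothetical labels (on the real data, `labels` would be `student_reasoning['student_reasoning']`):

```python
import pandas as pd

# Hypothetical labels: three "not reasoning" utterances and one "reasoning" utterance
labels = pd.Series([0, 0, 0, 1], name="student_reasoning")

# normalize=True returns fractions instead of raw counts
proportions = labels.value_counts(normalize=True)
print(proportions)
```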

In [None]:
# Calculate the length of `text` and plot the average length of `text` per `student_reasoning`
# We define length as the number of whitespace-separated words in `text`
# (split() without arguments does not count the empty strings produced by double spaces)
student_reasoning['text_len'] = student_reasoning['text'].apply(lambda x: len(str(x).split()))
sns.barplot(x=student_reasoning['student_reasoning'], y=student_reasoning['text_len'])
plt.xlabel('Is Student Reasoning?')
plt.ylabel('Average length of text')
plt.title('Average length of text per `student_reasoning`')
plt.show()
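
The word count depends on how the text is split. The transcripts often contain two spaces after a period (see the sample utterances printed earlier), which affects a split on a single space. A small illustration using one of those utterances:

```python
text = "Okay.  Boards down."  # two spaces after the first period

# Splitting on a single space produces an empty string between the two spaces
print(text.split(' '))       # ['Okay.', '', 'Boards', 'down.']
print(len(text.split(' ')))  # 4

# Splitting on any run of whitespace gives the intuitive word count
print(len(text.split()))     # 3
```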

Generally, utterances that are marked as `student_reasoning` are longer than ones that are not student reasoning.

We can take a look at a few examples as well to better understand the data!

In [None]:
NUM_EXAMPLES = 10

# Get NUM_EXAMPLES of the student utterances labeled as `reasoning`
reasoning_df = student_reasoning[student_reasoning['student_reasoning'] == 1].sample(NUM_EXAMPLES)

# Get NUM_EXAMPLES of the student utterances labeled as `not_reasoning`
not_reasoning_df = student_reasoning[student_reasoning['student_reasoning'] == 0].sample(NUM_EXAMPLES)

print("Reasoning utterances:")
print_line_separated(df=reasoning_df, speaker_column='comb_idx', text_column='text')

print("\nNot reasoning utterances:")
print_line_separated(df=not_reasoning_df, speaker_column='comb_idx', text_column='text')

## Text Analysis Tools

### Word Frequencies

This section will perform some simple text analysis on the transcripts from above.
Previously, we manually checked the quality of the teacher and student utterances. Now, let's look at what kinds of words the students are saying compared to the teachers.


As a simple analysis, we're going to look at **high-frequency words** for the teachers and students.
To do this, we're going to be using a library called [NLTK](https://www.nltk.org/).
This is a really helpful library for doing basic analysis, and we'll be returning to it again in the Assignments.


First, let's import NLTK and download the resources we need.

In [None]:
import nltk
# For tokenizing and tagging
nltk.download('punkt')
nltk.download('punkt_tab')  # tokenizer data that newer NLTK releases look for
nltk.download('averaged_perceptron_tagger')

In [None]:
# We're going to downsample the data to make it easier to work with
NUM_SAMPLES=100
teacher_df = utterances[utterances['speaker'] == 'teacher'].sample(NUM_SAMPLES)
student_df = utterances[utterances['speaker'] == 'student'].sample(NUM_SAMPLES)

In [None]:
# \n join the teacher utterances, and tokenize the utterances on word level
joined_teacher_utterances = '\n'.join(teacher_df['text'].tolist())
teacher_tokens = nltk.word_tokenize(joined_teacher_utterances)
# Case-insensitive frequency distribution of the teacher utterances
teacher_dist = nltk.FreqDist([w.lower() for w in teacher_tokens])
# Print the 10 most frequent words
print(teacher_dist.most_common(10))

Note how the word-level frequencies don't seem to show anything interesting or related to math content, unlike the examples we saw earlier.
Many of the top tokens are punctuation symbols and reference words like "that" or "this".

To filter out these common but less interesting tokens, we can apply some basic filters.
One filter removes tokens that contain non-alphabetic characters, such as punctuation or numbers.
The other removes stopwords, which are commonly used words like "the" or "a".


In [None]:
nltk.download('stopwords')
stopwords = nltk.corpus.stopwords.words('english')

# w.isalpha() returns True if all characters in w are alphabetic
teacher_dist = nltk.FreqDist([
    w.lower() for w in teacher_tokens
    if w.isalpha() # remove non-alphabetic characters
    and w.lower() not in stopwords # remove stopwords e.g., the, a, an, in, of, etc.
])
# Print the 10 most frequent words
print(teacher_dist.most_common(10))
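
To see what the two filters do on a tiny input, here is a sketch that uses `collections.Counter` in place of `nltk.FreqDist` and a hypothetical two-word stopword set, so it runs without any downloads. NLTK's English stopword list is much longer.

```python
from collections import Counter

# Hypothetical pre-tokenized utterances
tokens = ["The", "answer", "is", "14", ".", "the", "answer", "?"]
toy_stopwords = {"the", "is"}  # stand-in for the NLTK stopword list

filtered = [
    w.lower() for w in tokens
    if w.isalpha()                       # drops "14", "." and "?"
    and w.lower() not in toy_stopwords   # drops "The", "the" and "is"
]
print(Counter(filtered).most_common(1))  # [('answer', 2)]
```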



### Log-Odds Analysis

Going beyond just counting frequent words in the students' and teachers' utterances, we might be interested in how much more likely a word is to occur in one group's text than in the other's.

This gets us to a [log-odds analysis](https://en.wikipedia.org/wiki/Odds_ratio) on the words.

Let's first define some utility functions to calculate the log odds of the words.

In [None]:
from collections import defaultdict
import math

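def get_counts(token_lists, vocab):
    """Count unigram and bigram occurrences of the entries in `vocab`.

    `token_lists` is an iterable of token lists (one list per utterance).
    """
    counts = {w: 0 for w in vocab}
    for tokens in token_lists:
        prev = None
        for w in tokens:
            if w == '':
                continue
            if w in vocab:
                counts[w] += 1
            # Count the bigram formed with the previous token, if it is in the vocabulary
            if prev is not None:
                bigram = prev + ' ' + w
                if bigram in vocab:
                    counts[bigram] += 1
            prev = w
    return counts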

def log_odds(counts1, counts2, prior, zscore = True):
    # code from Dan Jurafsky
    # note: counts1 will be positive and counts2 will be negative

    sigmasquared = defaultdict(float)
    sigma = defaultdict(float)
    delta = defaultdict(float)

    n1 = sum(counts1.values())
    n2 = sum(counts2.values())

    # since we use the sum of counts from the two groups as a prior, this is equivalent to a simple log odds ratio
    nprior = sum(prior.values())
    for word in prior.keys():
        if prior[word] == 0:
            delta[word] = 0
            continue
        l1 = float(counts1[word] + prior[word]) / (( n1 + nprior ) - (counts1[word] + prior[word]))
        l2 = float(counts2[word] + prior[word]) / (( n2 + nprior ) - (counts2[word] + prior[word]))
        sigmasquared[word] = 1/(float(counts1[word]) + float(prior[word])) + 1/(float(counts2[word]) + float(prior[word]))
        sigma[word] = math.sqrt(sigmasquared[word])
        delta[word] = (math.log(l1) - math.log(l2))
        if zscore:
            delta[word] /= sigma[word]
    return delta


def get_log_odds_values(group1_df, group2_df, text_column, words2idx):
    # get counts
    counts1 = get_counts(group1_df[text_column], words2idx)
    counts2 = get_counts(group2_df[text_column], words2idx)
    prior = {}
    for k, v in counts1.items():
        prior[k] = v + counts2[k]

    # get log odds
    # note: we z-score the log odds (zscore=True), i.e., each value is divided by its estimated standard deviation.
    # z-scoring changes the magnitude of the values but not their sign, so a positive value still means the word
    # is more associated with group 1 and a negative value with group 2.
    delta = log_odds(counts1, counts2, prior, True)
    return prior, counts1, counts2, delta
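
In equation form, the `log_odds` function above computes the following for each word $w$. The notation is introduced here for illustration and does not appear in the code: $y^{(i)}_w$ is the count of $w$ in group $i$ (`counts1`, `counts2`), $n^{(i)}$ is the total count in group $i$ (`n1`, `n2`), $\alpha_w$ is the prior count of $w$ (`prior[word]`), and $\alpha_0$ is the sum of all prior counts (`nprior`).

$$
\delta_w = \log\frac{y^{(1)}_w + \alpha_w}{n^{(1)} + \alpha_0 - y^{(1)}_w - \alpha_w} - \log\frac{y^{(2)}_w + \alpha_w}{n^{(2)} + \alpha_0 - y^{(2)}_w - \alpha_w}
$$

$$
\sigma_w^2 = \frac{1}{y^{(1)}_w + \alpha_w} + \frac{1}{y^{(2)}_w + \alpha_w},
\qquad
z_w = \frac{\delta_w}{\sigma_w}
$$

With `zscore=True` the function returns $z_w$, otherwise it returns $\delta_w$. Words whose prior count is zero are assigned a value of 0.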


In [None]:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import re
import nltk
import string
import json

# Download nltk data
nltk.download('stopwords')
nltk.download('punkt')

sno = nltk.stem.SnowballStemmer('english')
punct_chars = list((set(string.punctuation) | {'’', '‘', '–', '—', '~', '|', '“', '”', '…', "'", "`", '_'}) - set(['#']))
punct_chars.sort()
punctuation = ''.join(punct_chars)
replace = re.compile('[%s]' % re.escape(punctuation))

INPUT_DIR = os.path.join(DIR_PATH, "data/text_processing/")

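# Read the course-provided stopword list (one word per line); the file is closed automatically
with open(os.path.join(INPUT_DIR, 'stopwords.txt'), 'r') as f:
    stopwords = set(f.read().splitlines())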

def clean_text_to_words(text, keep_stopwords, stem):
    if not keep_stopwords:
        stop = stopwords
    # lower case
    text = text.lower()
    # eliminate urls
    text = re.sub(r'http\S*|\S*\.com\S*|\S*www\S*', ' ', text)
    # eliminate @mentions
    text = re.sub(r'\s@\S+', ' ', text)
    # substitute all other punctuation with whitespace
    text = replace.sub(' ', text)
    # replace all whitespace with a single space
    text = re.sub(r'\s+', ' ', text)
    # strip off spaces on either end
    text = text.strip()
    # split into words (stopword removal and stemming are applied below, if requested)
    words = text.split()
    if not keep_stopwords:
        words = [w for w in words if w not in stop]
    if stem:
        words = [sno.stem(w) for w in words]
    return words


In [None]:
# Let's first clean up the text data a bit more. You can check out how we process the data in utils/text_processing.py.
# The functions we used are essentially ones we've already seen in prior steps, but we've added a few more steps to clean up the text data. We encourage you to look at the documentation for new functions you haven't seen before!
import sys
import os
sys.path.append(os.getcwd())
CLEAN_TEXT_COLUMN = 'cleaned_text'
TEXT_COLUMN = 'text'

def plot_log_odds(group1, group2, logodds_factor=1.5):
  # Let's build a dictionary that maps all words from both groups (e.g., teacher and student) to unique IDs
  words = set(group1[CLEAN_TEXT_COLUMN].sum() + group2[CLEAN_TEXT_COLUMN].sum())
  words2idx = {w: i for i, w in enumerate(words)}

  # Take a look at the first 10 words in the dictionary
  print(list(words2idx.items())[:10])

  # Run log odds
  _, _, _, log_odds = get_log_odds_values(
    group1_df=group1,
    group2_df=group2,
    text_column=CLEAN_TEXT_COLUMN,
    words2idx=words2idx
  )

  # Show a few of the log odds values
  print(list(log_odds.items())[:10])

  # Let's create a dataframe with the log odds values, and then plot the top and bottom 10 words in a barplot.
  log_odds_df = pd.DataFrame.from_dict(log_odds, orient='index', columns=['log_odds'])
  log_odds_df = log_odds_df.sort_values(by='log_odds', ascending=False)
  # Plot the words factor*std above and below 0.
  mean = 0
  std = log_odds_df['log_odds'].std()
  factor = logodds_factor
  top_bottom_df = pd.concat([log_odds_df[log_odds_df['log_odds'] >= mean + factor * std], log_odds_df[log_odds_df['log_odds'] <= mean - factor * std]])
  # x-axis is log odds, y-axis is words
  plt.figure(figsize=(8, 15))
  sns.barplot(x=top_bottom_df['log_odds'], y=top_bottom_df.index)
  plt.xlabel('Log odds')
  plt.ylabel('Words')

  # Put text on the left and right of the x-axis (more likely to be teacher or student)
  x_min, x_max = plt.xlim()
  y_min, y_max = plt.ylim()
  plt.text(x_min, y_min, 'More likely to be student', ha='left', va='center')
  plt.text(x_max, y_min, 'More likely to be teacher', ha='right', va='center')

  plt.title('Words by log odds')
  plt.show()


Let's look at the log-odds analysis where we keep the stopwords.

In [None]:
NUM_SAMPLES = 100
teacher_df = utterances[utterances['speaker'] == 'teacher'].sample(NUM_SAMPLES)
student_df = utterances[utterances['speaker'] == 'student'].sample(NUM_SAMPLES)
teacher_df[CLEAN_TEXT_COLUMN] = teacher_df[TEXT_COLUMN].apply(lambda x: clean_text_to_words(x, keep_stopwords=True, stem=False))
student_df[CLEAN_TEXT_COLUMN] = student_df[TEXT_COLUMN].apply(lambda x: clean_text_to_words(x, keep_stopwords=True, stem=False))
plot_log_odds(teacher_df, student_df, logodds_factor=2)

Now, let's compare it to when we don't keep the stopwords:

In [None]:
teacher_df = utterances[utterances['speaker'] == 'teacher'].sample(NUM_SAMPLES)
student_df = utterances[utterances['speaker'] == 'student'].sample(NUM_SAMPLES)
teacher_df[CLEAN_TEXT_COLUMN] = teacher_df[TEXT_COLUMN].apply(lambda x: clean_text_to_words(x, keep_stopwords=False, stem=False))
student_df[CLEAN_TEXT_COLUMN] = student_df[TEXT_COLUMN].apply(lambda x: clean_text_to_words(x, keep_stopwords=False, stem=False))
plot_log_odds(teacher_df, student_df, logodds_factor=2)

As you can see, the log-odds analysis changes depending on how we process the text. It's important to take your use case into account when making these decisions.

### Topic Modeling

Next, let's try out topic modeling on the data. For now, we'll just use the teachers' utterances.

Reference for running MALLET in Colab: https://colab.research.google.com/github/aiforpeople-git/First-AI4People-Workshop/blob/master/NLP_AI/Topic_Modeling.ipynb#scrollTo=fgt9WNMgNvqQ

In [None]:
!pip install --upgrade gensim

import os
def install_java():
  !apt-get install -y openjdk-8-jdk-headless -qq > /dev/null      #install openjdk
  os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"     #set environment variable
  !java -version
install_java()


In [None]:
!pip install little_mallet_wrapper

In [None]:
# Following the example in https://github.com/maria-antoniak/little-mallet-wrapper/blob/master/demo.ipynb

import little_mallet_wrapper as lmw


In [None]:
path_to_mallet = os.path.join(DIR_PATH, 'mallet-2.0.8/bin/mallet')
print(path_to_mallet)

In [None]:
training_data = [lmw.process_string(text) for text in teacher_df[TEXT_COLUMN].tolist()]
training_data = [d for d in training_data if d.strip()]

print(f"Number of training documents: {len(training_data)}")

In [None]:
# Training a topic model

NUM_TOPICS = 5
OUTPUT_DIR = os.path.join(DIR_PATH, 'output')

topic_keys, topic_distributions = lmw.quick_train_topic_model(
    path_to_mallet, OUTPUT_DIR, num_topics=NUM_TOPICS, training_data=training_data)

In [None]:
for i, t in enumerate(topic_keys):
    print(i, '\t', ' '.join(t[:10]))

### Clustering


Clustering methods are another class of unsupervised methods; here we apply them at the _sentence level_.
Note that before, we ran unsupervised methods at the _word level_. Different units of analysis can yield different insights, depending on what we are interested in finding!

Again, we will run this analysis on the sample of teacher's utterances.

An extremely useful library is https://www.sbert.net/. We will be using this in the following sections.

You may find the complete resource on clustering here: https://www.sbert.net/examples/applications/clustering/README.html

In [None]:
!pip install sentence-transformers
from sentence_transformers import SentenceTransformer


In [None]:
from sentence_transformers import SentenceTransformer
from sklearn.cluster import AgglomerativeClustering

model = SentenceTransformer('all-MiniLM-L6-v2')
NUM_CLUSTERS=10

sentences = teacher_df[TEXT_COLUMN].tolist()

# We are going to be using Agglomerative Clustering to cluster the sentences

corpus_embeddings = model.encode(sentences)

# Normalize the embeddings to unit length
corpus_embeddings = corpus_embeddings /  np.linalg.norm(corpus_embeddings, axis=1, keepdims=True)

# Perform agglomerative clustering (the number of clusters is determined by distance_threshold)
clustering_model = AgglomerativeClustering(distance_threshold=1.5, n_clusters=None)
clustering_model.fit(corpus_embeddings)
cluster_assignment = clustering_model.labels_

clustered_sentences = {}
for sentence_id, cluster_id in enumerate(cluster_assignment):
    if cluster_id not in clustered_sentences:
        clustered_sentences[cluster_id] = []

    clustered_sentences[cluster_id].append(sentences[sentence_id])

for i, cluster in clustered_sentences.items():
    print("Cluster ", i+1)
    print(cluster)
    print("")
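
Why normalize the embeddings to unit length before clustering? For unit vectors, the squared Euclidean distance is a simple function of cosine similarity: $\lVert a - b \rVert^2 = 2 - 2\cos(a, b)$. A distance threshold on normalized embeddings therefore behaves like a threshold on cosine similarity. The sketch below checks this identity numerically on two random vectors, which stand in for sentence embeddings (no model is loaded).

```python
import numpy as np

rng = np.random.default_rng(0)
a, b = rng.normal(size=(2, 8))  # two hypothetical "embeddings"

# Normalize each vector to unit length
a = a / np.linalg.norm(a)
b = b / np.linalg.norm(b)

cosine = float(a @ b)
squared_dist = float(np.sum((a - b) ** 2))
print(np.isclose(squared_dist, 2 - 2 * cosine))  # True
```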

## Assignments

Now it's your turn to analyze the data!

### 1: Compute top 10 words for the student.

Your task is to plot a histogram of the top 10 words (alphabetic strings, non-stop words) uttered by students in the entire transcript dataset.
The x-axis should be the words, and the y-axis should be the number of occurrences.

**IMPORTANT FOR GRADING**:
- Complete the code for computing the top 10 words.
- Plot the results when you upload the PDF of the colab.

In [None]:
# YOUR CODE HERE

### 2: Plot student statistics

Use the documentation in the metadata to compute and report the following:

- Number of students
- % Male
- % Black
- % Asian
- % Hispanic
- % White
- % Free/Reduced Lunch
- % Special Education status
- % Limited English proficiency


**IMPORTANT FOR GRADING:**
- Complete the code for computing the student statistics.
- Have your answers for the statistics above printed out when you upload the PDF version of the colab.


In [None]:
# YOUR CODE HERE
num_students = None # REPLACE ME!
print(f"Number of students = {num_students}")

perc_male = None # REPLACE ME!
print(f"Percentage male = {perc_male}")

perc_black = None # REPLACE ME!
print(f"Percentage black = {perc_black}")

perc_asian = None # REPLACE ME!
print(f"Percentage asian = {perc_asian}")

perc_hispanic = None # REPLACE ME!
print(f"Percentage hispanic = {perc_hispanic}")

perc_white = None # REPLACE ME!
print(f"Percentage white = {perc_white}")

perc_free_lunch = None # REPLACE ME!
print(f"Percentage free lunch = {perc_free_lunch}")

perc_special_ed = None # REPLACE ME!
print(f"Percentage special ed = {perc_special_ed}")

perc_limited_english = None # REPLACE ME!
print(f"Percentage limited english = {perc_limited_english}")

### 3: Log odds of transcripts from teachers with highest vs. lowest MQI5 scores

This question will involve multiple steps.

1. Identify the metadata corresponding to the MQI5 scores.
2. Load the corresponding metadata file.
3. Identify the top 10 and bottom 10 transcript IDs with the highest and lowest MQI5 scores.
4. Load their transcripts and run the log-odds analysis on the teacher's word level.
5. Write up a short analysis and summary of your observations.


##### Step 1: Identify the metadata corresponding to the MQI5 scores.

In [None]:
# YOUR ANSWER HERE (PUT IN THE DIRECTORY AND NAME OF THE OUTCOME COLUMN)

##### Step 2: Load the metadata file.

In [None]:
# YOUR CODE HERE

##### Step 3: Identify the top 10 and bottom 10 transcript IDs with the highest and lowest MQI5 scores.

In [None]:
# YOUR CODE HERE

##### Step 4: Load their transcripts and run the log-odds analysis on the teacher's word level.

In [None]:
# YOUR CODE HERE

##### Step 5: Write up a short analysis and summary of your observations.

In [None]:
"""
YOUR ANALYSIS AND OBSERVATIONS HERE

"""

## Extra Assignments

You have the choice to do extra credit work on this assignment. Below are some ideas of what you can do. If you have other ideas, make sure you check with Rose and/or Dora that it's reasonable to do.

1. Play around with an unsupervised algorithm (e.g., clustering) to identify different pedagogy categories. Keep a log of things that you try out, things that do and don't work.

2. Perform the same log-odds analysis except with value-added scores as the outcome measure. What do you observe?

