# Lab 3: CSS 100

In this lab, you'll get more hands-on practice:

- Using the `nltk` library to process text.  
- Working with *corpora*.
- Computing correlations and regressions. 

All of this will be in the context of building an algorithm to predict the **readability** of text. We will be working with the [CLEAR corpus](https://link.springer.com/article/10.3758/s13428-022-01802-x), which contains a human measure of readability (`BT_easiness`) and several automated measures (see Q1). We'll also build our own measure.

In [1]:
### Run this code!
import nltk
from nltk import word_tokenize, sent_tokenize
import scipy.stats as ss

import pandas as pd

In [2]:
### Run this code!
df_clear = pd.read_csv("data/CLEAR_corpus_final.csv")
df_clear.shape

(4724, 28)

## Part 1: Existing metrics

In this section, you'll compare the utility of existing *metrics* of readability.

### Q1. Use existing readability metrics

The *CLEAR corpus* (`df_clear`) has several **automated measures** of text difficulty:

- `Flesch-Kincaid-Grade-Level`
- `Automated Readability Index`
- `SMOG Readability`

It also has a human "gold standard" measure (`BT_easiness`). For this problem, calculate the **Pearson's correlation** between each of these measures and `BT_easiness`. Name them as follows:

- `Flesch-Kincaid-Grade-Level` --> `fk_corr`
- `Automated Readability Index` --> `ari_corr`
- `SMOG Readability` --> `smog_corr`

**Note**: You can use `scipy.stats.pearsonr` (`ss`) to calculate $r$; make sure to select the correlation value, not the p-value.

**Note 2**: You can read more about how the measures all work [here](https://en.wikipedia.org/wiki/Readability#Popular_Readability_Formulas).
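
To make the quantity concrete, here is a minimal sketch of what Pearson's $r$ computes, written in plain Python on made-up numbers. The helper `pearson_r` and the toy lists are illustrative assumptions, not part of the lab; in your solution you should still use `ss.pearsonr`.

```python
import math

def pearson_r(xs, ys):
    """Pearson's r: co-variation of xs and ys, scaled by their spreads."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    ss_x = sum((x - mean_x) ** 2 for x in xs)
    ss_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(ss_x * ss_y)

# Hypothetical toy data: a higher grade level goes with lower easiness.
grade_level = [2.0, 4.0, 6.0, 8.0]
easiness = [1.0, 0.5, -0.5, -1.0]

r = pearson_r(grade_level, easiness)
print(round(r, 2))  # strongly negative, close to -1
```

A negative $r$ here means that as the difficulty measure goes up, easiness goes down, which is the direction you should expect for the three measures above.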

In [3]:
### BEGIN SOLUTION
# pearsonr returns the correlation first, then the p-value; [0] keeps only r
fk_corr = ss.pearsonr(df_clear['BT_easiness'],
                      df_clear['Flesch-Kincaid-Grade-Level'])[0]
ari_corr = ss.pearsonr(df_clear['BT_easiness'],
                       df_clear['Automated Readability Index'])[0]
smog_corr = ss.pearsonr(df_clear['BT_easiness'],
                        df_clear['SMOG Readability'])[0]
### END SOLUTION

In [4]:
assert fk_corr
assert ari_corr
assert smog_corr

assert fk_corr < -.5
assert ari_corr < -.4
assert smog_corr < -.5

## Part 2: Build your own measure!

In this section, you'll build your own, progressively more sophisticated, measure of readability. Then you can compare this measure to the others to see which best predicts human judgments.

### Q2. A simple, naive measure

Write a function called `naive_difficulty`, which tries to estimate the *difficulty* of reading a text. This function should:

- Calculate the average sentence length (`num_words`/`num_sentences`).  
- Calculate the average word length, in characters (`num_characters`/`num_words`).  

These should be added together to produce a final estimate.

Don't hesitate to use `word_tokenize` and `sent_tokenize`! See the `assert` statements for guidance on how the function should work.
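
As a worked example of the arithmetic, the sketch below walks through the first test text by hand. The token and sentence lists are written out manually and are assumed to match what `word_tokenize` and `sent_tokenize` return for this text; the character count is taken as the length of the raw string, which is what the `assert` values imply.

```python
text = "Hello world. Hello!"

# Assumed tokenizer output, written out by hand for illustration.
words = ["Hello", "world", ".", "Hello", "!"]  # punctuation counts as tokens
sentences = ["Hello world.", "Hello!"]

num_words = len(words)          # 5
num_sentences = len(sentences)  # 2
num_characters = len(text)      # 19, including spaces and punctuation

avg_sentence_length = num_words / num_sentences   # 2.5
avg_word_length = num_characters / num_words      # 3.8
estimate = avg_sentence_length + avg_word_length  # about 6.3

print(round(estimate))  # 6
```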

In [5]:
### BEGIN SOLUTION
def naive_difficulty(text):
    num_words = len(word_tokenize(text))
    num_sentences = len(sent_tokenize(text))
    num_characters = len(text)
    
    return num_words/num_sentences + num_characters/num_words
    
### END SOLUTION

In [6]:
assert naive_difficulty

text1 = "Hello world. Hello!"
assert round(naive_difficulty(text1)) == 6

text2 = "This is a test. This is only a test."
assert round(naive_difficulty(text2)) == 9

text3 = "He had come very far and was excited for the day. This is not to say he was not nervous at all. He was quite anxious."
assert round(naive_difficulty(text3)) == 14

### Q3. Evaluate your `naive_difficulty` measure

Now, `.apply` this measure to every `Excerpt` in `df_clear`. Call the new column `naive_difficulty`.

Then calculate Pearson's $r$ with `BT_easiness` and call this `naive_corr`.

In [7]:
### BEGIN SOLUTION
df_clear['naive_difficulty'] = df_clear['Excerpt'].apply(naive_difficulty)
naive_corr = ss.pearsonr(df_clear['BT_easiness'],
                         df_clear['naive_difficulty'])[0]
### END SOLUTION

In [8]:
assert 'naive_difficulty' in df_clear.columns
assert naive_corr

assert naive_corr < 0
assert naive_corr < -.3

### Q4. Which measure is best?

Compare the magnitudes (absolute values) of your correlations. Which one is *most strongly correlated* with `BT_easiness`? Set `ans4` to one of "A", "B", "C", or "D".

A. `fk_corr`   
B. `smog_corr`  
C. `ari_corr`  
D. `naive_corr`  
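
Comparing magnitudes means ignoring the sign. A minimal sketch of that comparison, using hypothetical measure names and made-up values (not the correlations from this lab):

```python
# Hypothetical correlation values, for illustration only.
corrs = {"measure_a": -0.42, "measure_b": 0.35, "measure_c": -0.58}

# The sign gives the direction of the relationship; abs() gives its strength.
strongest = max(corrs, key=lambda name: abs(corrs[name]))
print(strongest)  # measure_c
```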

In [14]:
### BEGIN SOLUTION
ans4 = "B"
### END SOLUTION

In [15]:
assert ans4 in ['A', 'B', 'C', 'D']
assert ans4 == 'B'

### Q5. Using *Age of Acquisition*

Some psycholinguists have hypothesized that words that are **learned earlier** are easier to read. This suggests that the **average age of acquisition of words** in a text should be correlated with the readability of that text. 

Build a new function called `calculate_avg_aoa(text)`, which looks up each word of a text in `word_to_aoa` and returns the average age of acquisition of the words it finds. `word_to_aoa` is a dictionary mapping words to their age of acquisition, built from a database (see the code below).

Desired function behavior:

- Passage text should first be made *lowercase*.  
- For each *word* in text, check whether it has an *exact match* in `word_to_aoa`.  
  - If so, track the `AoA` of that word.
  - If not, ignore it.
- Calculate the average `AoA` of words you found.
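
The steps above can be sketched on a toy dictionary. Everything here is an illustrative assumption: `toy_word_to_aoa` and `toy_avg_aoa` are hypothetical names, only the value for "dog" comes from the lab data, and `str.split()` stands in for `word_tokenize` so that the sketch has no dependencies.

```python
# Toy lookup table; the lab's word_to_aoa is built from data/AoA.csv.
toy_word_to_aoa = {"the": 3.0, "dog": 2.8, "barked": 5.2}

def toy_avg_aoa(text, lookup):
    # Lowercase first so that "The" can match the key "the".
    words = text.lower().split()
    # Keep only exact matches; words missing from the lookup are ignored.
    found = [lookup[w] for w in words if w in lookup]
    # Guard against dividing by zero when nothing matched.
    return sum(found) / len(found) if found else 0

result = toy_avg_aoa("The dog barked loudly", toy_word_to_aoa)
print(round(result, 2))  # 3.67, the mean of 3.0, 2.8 and 5.2
```

Note that "loudly" is silently skipped, so the average is taken over three words rather than four.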

In [98]:
### Run this code!
df_aoa = pd.read_csv("data/AoA.csv")
word_to_aoa = dict(zip(df_aoa['Word'], df_aoa['AoA']))
word_to_aoa['dog']

2.8

In [99]:
### BEGIN SOLUTION
def calculate_avg_aoa(text):
    # Convert the text to lowercase to match the case of dictionary keys
    text = text.lower()
    # Tokenize the text into words
    words = word_tokenize(text)
    
    # List to hold the AoA of words found in the dictionary
    aoas = []
    
    # Check each word against the dictionary
    for word in words:
        if word in word_to_aoa:
            aoas.append(word_to_aoa[word])
    
    # Calculate the average AoA if there are any valid AoA values
    if aoas:
        avg_aoa = sum(aoas) / len(aoas)
    else:
        avg_aoa = 0  # Return 0 or an appropriate value if no words match
    
    return avg_aoa
### END SOLUTION

In [100]:
assert calculate_avg_aoa


text1 = "Hello world. Hello!"
assert round(calculate_avg_aoa(text1)) == 4

text2 = "This is a test. This is only a test."
assert round(calculate_avg_aoa(text2)) == 5

text3 = "He had come very far and was excited for the day. This is not to say he was not nervous at all. He was quite anxious."
assert round(calculate_avg_aoa(text3)) == 5

### Q6. Evaluate the predictive value of `AoA`

Now, `.apply` `calculate_avg_aoa` to every `Excerpt` in `df_clear`. Call the new column `avg_aoa`.

Then calculate Pearson's $r$ with `BT_easiness` and call this `aoa_corr`.

In [103]:
### BEGIN SOLUTION
df_clear['avg_aoa'] = df_clear['Excerpt'].apply(calculate_avg_aoa)
aoa_corr = ss.pearsonr(df_clear['BT_easiness'],
                       df_clear['avg_aoa'])[0]
### END SOLUTION

In [104]:
assert 'avg_aoa' in df_clear.columns

assert aoa_corr
assert aoa_corr < -.5

### Q7. Combine `AoA` with `naive_difficulty`

Now, let's see how much we can improve our measures by *combining* them. A principled way to do this is to build a *linear regression* model, which estimates how each measure relates to `BT_easiness` while accounting for the other.  

Use the `statsmodels` package (imported below as `sm`) to `.fit` an `ols` model predicting `BT_easiness ~ avg_aoa + naive_difficulty`. Save the params in `combined_params`.

**Note**: This is review from CSS 2. If you're a little rusty using `statsmodels`, check out the [lecture from CSS 2](https://ucsd-css2.github.io/ucsd-css2-website/lectures/13-regression-intro.html#building-a-regression-model).
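
Written out, the formula `BT_easiness ~ avg_aoa + naive_difficulty` is assumed to correspond to the linear model below, where $Y$ is `BT_easiness`, $\beta_0$ is the intercept that the formula interface adds by default, and $\epsilon$ is the error term:

$Y = \beta_0 + \beta_{aoa} * X_{aoa} + \beta_{naive} * X_{naive} + \epsilon$

The fitted values of $\beta_0$, $\beta_{aoa}$, and $\beta_{naive}$ are what `.params` returns.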

In [105]:
import statsmodels.formula.api as sm

In [106]:
### BEGIN SOLUTION
mod_combined = sm.ols(data = df_clear, formula = 'BT_easiness ~ avg_aoa + naive_difficulty').fit()
combined_params = mod_combined.params
### END SOLUTION

In [107]:
assert 'avg_aoa' in combined_params
assert 'naive_difficulty' in combined_params

assert round(combined_params['avg_aoa'], 2) == -1
assert round(combined_params['naive_difficulty'], 2) == -.03

### Q8. Use `combined_params` to build a new algorithm

We can use the **coefficients** from Q7 to build a new algorithm that estimates readability from `naive_difficulty` and `avg_aoa`. Specifically, you can think of these coefficients as *weights*, which you'll multiply by the respective measures:

$Y = \beta_{aoa} * X_{aoa} + \beta_{naive}*X_{naive}$

Write a function called `calculate_combined_readability(Excerpt)`, which:

- First calculates `naive_difficulty` and `avg_aoa`.  
- Then produces a **weighted** sum of these measures, using the coefficients of Q7.

You can use the *rounded* versions of the coefficients available in the `assert` test for Q7. Specifically, $\beta_{aoa}=-1$ and $\beta_{naive}=-.03$.
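
As a worked example of the weighted sum, the sketch below plugs made-up inputs into the formula. The input values are hypothetical and chosen only to show the arithmetic.

```python
# Rounded coefficients from Q7.
beta_aoa = -1
beta_naive = -0.03

# Hypothetical inputs for a single text (not taken from the corpus).
x_aoa = 4.0
x_naive = 6.3

y = beta_aoa * x_aoa + beta_naive * x_naive
print(round(y, 3))  # -4.189
```

The intercept is left out of this sum. Adding the same constant to every score would not change its Pearson correlation with `BT_easiness`, so it does not affect the evaluation that follows.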

In [114]:
### BEGIN SOLUTION
def calculate_combined_readability(text):
    β_aoa = -1
    β_naive = -.03
    
    X_aoa = calculate_avg_aoa(text)
    X_naive = naive_difficulty(text)
    
    Y = β_aoa * X_aoa + β_naive * X_naive
    return Y
### END SOLUTION

In [119]:
assert calculate_combined_readability

text1 = "Hello world. Hello!"
assert round(calculate_combined_readability(text1)) == -4

text2 = "This is a test. This is only a test."
assert round(calculate_combined_readability(text2)) == -5

text3 = "He had come very far and was excited for the day. This is not to say he was not nervous at all. He was quite anxious."
assert round(calculate_combined_readability(text3)) == -5

### Q9. Evaluate the predictive value of this new measure

Now, `.apply` `calculate_combined_readability` to every `Excerpt` in `df_clear`. Call the new column `combined_readability`.

Then calculate Pearson's $r$ with `BT_easiness` and call this `combined_corr`.

In [120]:
### BEGIN SOLUTION
df_clear['combined_readability'] = df_clear['Excerpt'].apply(calculate_combined_readability)
combined_corr = ss.pearsonr(df_clear['BT_easiness'],
                            df_clear['combined_readability'])[0]
### END SOLUTION

In [123]:
assert 'combined_readability' in df_clear.columns

assert combined_corr
assert combined_corr > 0.6

## Submit!

Congratulations! You just built an **automated measure** of readability that outperforms many of the basic measures.

By combining psycholinguistic variables with *automated variables*, we can estimate readability more accurately.

Don't forget to 'validate' and then 'submit'!