# Practice: Analyzing Polysemy and Semantic Change Using the COHA Corpus

## Overview
This practical session demonstrates how to analyze the polysemy of the word "mouse" over time using COHA (Corpus of Historical American English) data. Specifically, we will explore:

1. **Morphological Analysis**: Tokenizing and part-of-speech tagging each sentence, then keeping the content words (nouns and verbs) as context words.
2. **Window Collocations**: Calculating co-occurrence frequencies within a defined context window.
3. **Collocation Measures**: Using statistical metrics (Expected Frequency, MI, LL, Z-Score, and T-Score) to evaluate collocates.
4. **Semantic Change Detection**: Demonstrating the shift in the meaning of "mouse" from "animal" to "computer device" between the 1850s and 2000s.

## Step 1: Data Preparation
**Objective**: Ensure COHA data for the 1850s and 2000s is available, then preprocess it by tokenizing sentences into words and filtering for content words.


In [None]:
!pip install nltk



In [None]:
# Mount Google Drive to this Notebook instance.
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
# fileDir = "drive/My Drive/corpus/coha_2000s.txt"
fileDir = "drive/My Drive/corpus/coha_1850s.txt"
# Read all lines of the decade file; the context manager closes it afterwards.
with open(fileDir, 'r', encoding='utf-8') as fr:
    contents = fr.readlines()

textList = []
AllWordsList = []

num = 0
for content in contents:
    content = content.replace("\n", "")
    if num < 100:
        print(content)
    textList.append(content.lower())

    words = content.lower().strip().split(" ")
    for word in words:
        AllWordsList.append(word)
    num += 1

Sentence
184 poor and proud produced by charles keller
html version by al haines
poor and proud or the fortunes of katy redburn a story for young folks by oliver optic to alice marie adams this book is affectionately dedicated by her father
poor and proud
preface
bobby bright and harry west whose histories were contained in the last two volumes of the library for young folks were both smart boys
the author very grateful for the genial welcome extended to these young gentlemen begs leave to introduce to his juvenile friends a smart girl miss katy redburn whose fortunes he hopes will prove sufficiently interesting to secure their attention
if any of my adult readers are disposed to accuse me of being a little extravagant i fear i shall have to let the case go by default but i shall plead in extenuation that i have tried to be reasonable even where a few grains of the romantic element were introduced for my shelf in boyhood and i may possibly have imbibed some of their peculiar spirit
but

### Extract sentences that contain the word "mouse."

In [None]:
targetWord = "mouse"

In [None]:
wordsInCorpus = 0
mouseText = []
AllText = ""
for each in textList:
    words = each.split(" ")
    for word in words:
        wordsInCorpus += 1

    if " " + targetWord + " " in each:
        mouseText.append(each)
        AllText = AllText + " " + each

AllText = AllText.strip()
print(wordsInCorpus)
print(len(AllWordsList))

15552035
15552035
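
The substring test `" " + targetWord + " "` only matches when the target has a space on both sides, so a sentence that begins or ends with "mouse" would be skipped. A minimal sketch of a token-based alternative follows, assuming sentences are already lowercased and space-separated as in `textList`. The helper name `contains_token` and the sample sentences are illustrative and not part of the notebook.

```python
def contains_token(sentence, target):
    """Return True if `target` occurs as a whole space-separated token."""
    return target in sentence.split(" ")

# Illustrative sentences (not taken from COHA).
samples = [
    "mouse in the trap",          # target at the start
    "the cat chased the mouse",   # target at the end
    "a mouse ran by",             # target in the middle
    "the mousetrap was empty",    # substring only, not a token
]

substring_hits = [s for s in samples if " mouse " in s]
token_hits = [s for s in samples if contains_token(s, "mouse")]

print(len(substring_hits))  # 1
print(len(token_hits))      # 3
```

Switching the notebook to the token-based test would change the set of extracted sentences, so the outputs shown in this section reflect the substring version.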


### Check the concordance table for sentences containing "mouse."

In [None]:
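from nltk import Text
from nltk.tokenize import RegexpTokenizer

# A raw string avoids an invalid-escape warning for \w.
retokenize = RegexpTokenizer(r"[\w]+")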
text = Text(retokenize.tokenize(AllText))
text.concordance(targetWord)

Displaying 25 of 109 matches:
                           ketched a mouse hev ye that ai nt a mouse said andy 
 ketched a mouse hev ye that ai nt a mouse said andy as captain lemuel gulliver
the world whose size was just that a mouse had just been caught in the trap and
st and the door was locked and not a mouse to be heard and it s been just so si
whale was but a species of magnified mouse or at least waterrat requiring only 
er to kill and boil from the crushed mouse 1851 i bear the trouble in my heart 
nceived in its crevice caught a tiny mouse lay pressed and lifeless who mid the
n and the oppress d and uncomplaints mouse find some oasis where the savory che
ains and no can be near let s go and mouse round their stoppingplace a little b
 for the most part was as still as a mouse glanced round at these words one of 
 by her cousin i shall not said this mouse waste the time of i view it the sche
ut doubt if indeed we could find any mouse who would do it a lion with the heat
but whilst

### Perform morphological analysis: Extract content words.

In [None]:
import nltk
from nltk.tag import pos_tag
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords

nltk.download('averaged_perceptron_tagger')
nltk.download('punkt')
nltk.download('punkt_tab')
nltk.download('averaged_perceptron_tagger_eng')
nltk.download('stopwords')

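# Keep nouns and verbs (predicates) as content words.
# Note: the substring test on the tag also matches "IN" (prepositions),
# so a few function words that are not in the stop list can pass the filter.
def tokenizerPOScontent(doc):
    stop_words = set(stopwords.words('english'))
    tagged_list = pos_tag(word_tokenize(doc))
    # Tags containing "N" or "V", e.g. NN, NNS, VB, VBD.
    contentWords_list = [t[0] for t in tagged_list if "N" in t[1]  or "V" in t[1]]
    # Drop English stop words.
    filtered_contentWords = [w for w in contentWords_list if w.lower() not in stop_words]
    # Drop tokens of one or two characters.
    length_contentWords = [w for w in filtered_contentWords if len(w) > 2]
    return length_contentWords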

mousePOSText = []
for each in mouseText:
    # Tokenize and filter once, then print and store the same result.
    contentWords = " ".join(tokenizerPOScontent(each.strip()))
    print(contentWords)
    mousePOSText.append(contentWords)

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger_eng to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger_eng is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


ketched mouse hev
mouse said andy
captain lemuel gulliver discovered island lilliput isaac know men world size mouse caught trap miller found
went make bed breakfast door locked mouse heard since
lost sense reverence marvels bulk ways anything like apprehension danger encountering opinion whale species mouse waterrat requiring circumvention application time trouble order kill boil
crushed mouse bear trouble heart hath extinguished life sin malice preconceived
crevice caught mouse lay pressed lifeless
mid wonders age seen water moving thought darting zone crowns fall anointed brows time world history seals turn oppress uncomplaints mouse find oasis savory swell like mountains near
let mouse round stoppingplace bart
elizabeth part mouse glanced round words secondings anything said cousin
said mouse waste time view scheme succeed without doubt find mouse
lion heat oppress day composed rest whilst dozed intended mouse royal ascended thought harm esop tells mistaking something travelled rou
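
`tokenizerPOScontent` keeps a token when its tag contains "N" or "V". In the Penn Treebank tagset, the preposition tag `IN` also contains "N", which may explain why words such as "upon" and "without" show up among the collocates later on. A hedged sketch of a stricter prefix test follows. It works on hand-tagged `(word, tag)` pairs so that it runs without downloading NLTK models; the function name `keep_nouns_and_verbs` and the tiny stop list are assumptions for illustration.

```python
def keep_nouns_and_verbs(tagged, stop_words, min_len=3):
    """Keep tokens whose tag starts with N (nouns) or V (verbs)."""
    kept = []
    for word, tag in tagged:
        if not (tag.startswith("N") or tag.startswith("V")):
            continue
        if word.lower() in stop_words or len(word) < min_len:
            continue
        kept.append(word)
    return kept

# Hand-tagged example in the (word, tag) format returned by nltk.pos_tag.
tagged = [("the", "DT"), ("mouse", "NN"), ("ran", "VBD"),
          ("upon", "IN"), ("the", "DT"), ("floor", "NN")]
stop_words = {"the"}  # tiny illustrative stop list

# Substring test, as in the notebook: "IN" passes because it contains "N".
substring_filter = [w for w, t in tagged if "N" in t or "V" in t]
# Prefix test: only NN* and VB* tags pass.
prefix_filter = keep_nouns_and_verbs(tagged, stop_words)

print(substring_filter)  # ['mouse', 'ran', 'upon', 'floor']
print(prefix_filter)     # ['mouse', 'ran', 'floor']
```

Adopting the prefix test would change the collocate lists, so the outputs in this notebook correspond to the substring version.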

## Step 2: Extract Context Words
**Objective**: Use window collocations to identify collocates of "mouse" in the corpora.
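
The cells in this step treat each whole sentence as the context window. A fixed-size window is a common alternative; the sketch below counts words within `span` tokens on either side of the target. The function name `window_cooccurrences`, the `span` parameter, and the sample sentence are assumptions for illustration and do not come from the notebook.

```python
from collections import Counter

def window_cooccurrences(tokens, target, span=2):
    """Count tokens within `span` positions of each occurrence of `target`."""
    counts = Counter()
    for i, token in enumerate(tokens):
        if token != target:
            continue
        left = tokens[max(0, i - span):i]
        right = tokens[i + 1:i + 1 + span]
        # Skip other occurrences of the target itself, as the notebook does.
        counts.update(w for w in left + right if w != target)
    return counts

tokens = "the cat caught a mouse in the old trap".split(" ")
counts = window_cooccurrences(tokens, "mouse", span=2)
print(sorted(counts.items()))
# [('a', 1), ('caught', 1), ('in', 1), ('the', 1)]
```

A narrow window tends to pick up grammatical neighbours, while a sentence-wide window favours topical words, so the choice of `span` affects which collocates surface.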

### Apply window collocations to calculate co-occurrence frequencies and individual word frequencies.

In [None]:
wordType = set()
for each in mousePOSText:
    eachSplit = each.split(" ")
    for word in eachSplit:
        if word != targetWord:
            wordType.add(word)

wordList = list(wordType)

print(wordList[0:10])
print(len(wordType))

['left', 'hear', 'lock', 'size', 'harm', 'rickety', 'horse', 'jogg', 'tremont', 'hand']
1005


### Calculate co-occurrence frequencies and keep the top 30 words.

In [None]:
wordFreqDict = {}
for each in wordList:
    wordFreqDict[each] = 0

for each in mousePOSText:
    eachSplit = each.split(" ")
    for word in eachSplit:
        if word != targetWord:
            wordFreqDict[word] = wordFreqDict.get(word) + 1

wordFreqDicSorted = dict(sorted(wordFreqDict.items(), key=lambda x: x[1], reverse=True))

CoOccurringWords = []
collocations = []
# Keep only the 30 most frequent co-occurring words.
for key, value in list(wordFreqDicSorted.items())[:30]:
    CoOccurringWords.append(key)
    collocations.append(value)
    # print(key, value)


# for key, value in wordFreqDicSorted.items():
#     if value > 5:
#         CoOccurringWords.append(key)
#         collocations.append(value)
print(len(CoOccurringWords))
print(CoOccurringWords[0:10])
print(collocations[0:10])

30
['rat', 'cat', 'said', 'made', 'nothing', 'like', 'upon', 'time', 'room', 'house']
[12, 11, 10, 9, 8, 8, 8, 8, 7, 7]


### Calculate individual frequencies of co-occurring words.

In [None]:
CoOccurringWordsFreqDict = {}
for each in CoOccurringWords:
    # Count once and reuse the result; list.count scans the whole token list.
    count = AllWordsList.count(each)
    CoOccurringWordsFreqDict[each] = count
    print(each, count)

rat 80
cat 335
said 41908
made 17960
nothing 9325
like 19180
upon 33929
time 22423
room 5374
house 9067
door 5044
without 14053
eyes 9764
species 1607
lion 400
thought 10782
found 10439
man 24071
deer 503
animals 988
went 6752
within 5612
world 9394
play 1484
caught 1197
seen 7133
eye 4590
heard 6349
hole 431
country 9279
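
Calling `list.count` once per word rescans the full token list each time. A single-pass frequency table gives the same counts; the sketch below uses `collections.Counter` on a toy token list, with `all_words` and `co_words` standing in for `AllWordsList` and `CoOccurringWords`.

```python
from collections import Counter

# Toy stand-ins for AllWordsList and CoOccurringWords.
all_words = "the cat saw a rat and the rat saw the cat".split(" ")
co_words = ["rat", "cat", "lion"]

# Build the frequency table once, then look up each word.
freq = Counter(all_words)
co_freq = {w: freq[w] for w in co_words}

print(co_freq)  # {'rat': 2, 'cat': 2, 'lion': 0}

# Same result as counting each word separately.
assert co_freq == {w: all_words.count(w) for w in co_words}
```

A `Counter` returns 0 for words it has never seen, so no special handling is needed for missing keys.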


In [None]:
freqCoWord = []
freqTargetWord = []
targetword = []
wordsInCorpusList = []
targetwordCount = AllWordsList.count(targetWord)

for key, value in CoOccurringWordsFreqDict.items():
    freqCoWord.append(value)
    freqTargetWord.append(targetwordCount)
    targetword.append(targetWord)
    wordsInCorpusList.append(wordsInCorpus)

print(freqCoWord[0:10])

[80, 335, 41908, 17960, 9325, 19180, 33929, 22423, 5374, 9067]


### Generate the final dataset.

In [None]:
import pandas as pd
collocationTable = pd.DataFrame({'Co-occurring words':CoOccurringWords,'Target word':targetword,'Collocations':collocations,'Freq co-occurring word':freqCoWord,'Freq target word':freqTargetWord,'Words in corpus':wordsInCorpusList})
print(collocationTable)

   Co-occurring words Target word  Collocations  Freq co-occurring word  \
0                 rat       mouse            12                      80   
1                 cat       mouse            11                     335   
2                said       mouse            10                   41908   
3                made       mouse             9                   17960   
4             nothing       mouse             8                    9325   
5                like       mouse             8                   19180   
6                upon       mouse             8                   33929   
7                time       mouse             8                   22423   
8                room       mouse             7                    5374   
9               house       mouse             7                    9067   
10               door       mouse             7                    5044   
11            without       mouse             7                   14053   
12               eyes    

In [None]:
collocationTable[0:30]

| | Co-occurring words | Target word | Collocations | Freq co-occurring word | Freq target word | Words in corpus |
|---|---|---|---|---|---|---|
| 0 | rat | mouse | 12 | 80 | 134 | 15552035 |
| 1 | cat | mouse | 11 | 335 | 134 | 15552035 |
| 2 | said | mouse | 10 | 41908 | 134 | 15552035 |
| 3 | made | mouse | 9 | 17960 | 134 | 15552035 |
| 4 | nothing | mouse | 8 | 9325 | 134 | 15552035 |
| 5 | like | mouse | 8 | 19180 | 134 | 15552035 |
| 6 | upon | mouse | 8 | 33929 | 134 | 15552035 |
| 7 | time | mouse | 8 | 22423 | 134 | 15552035 |
| 8 | room | mouse | 7 | 5374 | 134 | 15552035 |
| 9 | house | mouse | 7 | 9067 | 134 | 15552035 |


## Step 3: Collocation measures
**Objective**: Apply the learned collocation measures to the dataset.

### Expected Frequency
The formula for Expected Frequency is:

$$E = \frac{f(w_1) \cdot f(w_2)}{N}$$

Where:
- $f(w_1)$: Frequency of the target word.
- $f(w_2)$: Frequency of the co-occurring word.
- $N$: Total number of tokens in the corpus.
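
As a worked check, the values for "rat" in the table above (word frequency 80, target frequency 134, corpus size 15552035) can be plugged into the formula. The helper name `expected_frequency` is illustrative.

```python
def expected_frequency(freq_target, freq_coword, corpus_size):
    """E = f(w1) * f(w2) / N."""
    return freq_target * freq_coword / corpus_size

e_rat = expected_frequency(134, 80, 15552035)
print(round(e_rat, 6))  # 0.000689
```

An expected frequency far below 1 means that, if the two words were independent, they would almost never co-occur in a corpus of this size.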

In [None]:
collocationTable['ExpectedFreq'] = (collocationTable['Freq co-occurring word']*collocationTable['Freq target word'])/collocationTable['Words in corpus']
collocationTable.sort_values(by='ExpectedFreq', ascending=False)[0:30]

| | Co-occurring words | Target word | Collocations | Freq co-occurring word | Freq target word | Words in corpus | ExpectedFreq |
|---|---|---|---|---|---|---|---|
| 2 | said | mouse | 10 | 41908 | 134 | 15552035 | 0.361089 |
| 6 | upon | mouse | 8 | 33929 | 134 | 15552035 | 0.29234 |
| 17 | man | mouse | 6 | 24071 | 134 | 15552035 | 0.207401 |
| 7 | time | mouse | 8 | 22423 | 134 | 15552035 | 0.193202 |
| 5 | like | mouse | 8 | 19180 | 134 | 15552035 | 0.165259 |
| 3 | made | mouse | 9 | 17960 | 134 | 15552035 | 0.154748 |
| 11 | without | mouse | 7 | 14053 | 134 | 15552035 | 0.121084 |
| 15 | thought | mouse | 6 | 10782 | 134 | 15552035 | 0.0929 |
| 16 | found | mouse | 6 | 10439 | 134 | 15552035 | 0.089945 |
| 12 | eyes | mouse | 7 | 9764 | 134 | 15552035 | 0.084129 |


### Z-Score
The formula for Z-Score is:

$$Z = \frac{f(w_1, w_2) - E}{\sqrt{E}}$$

Where:
- $f(w_1, w_2)$: Observed frequency of the collocation.
- $E$: Expected frequency.
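
A worked check for "rat", using the observed count of 12 and the frequencies from the collocation table. The helper name `z_score` is illustrative.

```python
import math

def z_score(observed, expected):
    """Z = (O - E) / sqrt(E)."""
    return (observed - expected) / math.sqrt(expected)

expected = 134 * 80 / 15552035   # expected frequency for "rat"
z_rat = z_score(12, expected)
print(round(z_rat, 1))  # 457.0
```

Because $E$ sits in the denominator, pairs with a very small expected frequency receive very large Z-scores even for modest observed counts.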

In [None]:
import numpy as np
collocationTable['Z-score'] = (collocationTable['Collocations'] - collocationTable['ExpectedFreq'])/np.sqrt(collocationTable['ExpectedFreq'])
collocationTable.sort_values(by='Z-score', ascending=False)[0:30]

| | Co-occurring words | Target word | Collocations | Freq co-occurring word | Freq target word | Words in corpus | ExpectedFreq | Z-score |
|---|---|---|---|---|---|---|---|---|
| 0 | rat | mouse | 12 | 80 | 134 | 15552035 | 0.000689 | 457.038219 |
| 1 | cat | mouse | 11 | 335 | 134 | 15552035 | 0.002886 | 204.690421 |
| 14 | lion | mouse | 6 | 400 | 134 | 15552035 | 0.003446 | 102.144016 |
| 18 | deer | mouse | 6 | 503 | 134 | 15552035 | 0.004334 | 91.074051 |
| 28 | hole | mouse | 4 | 431 | 134 | 15552035 | 0.003714 | 65.578153 |
| 19 | animals | mouse | 5 | 988 | 134 | 15552035 | 0.008513 | 54.099431 |
| 13 | species | mouse | 6 | 1607 | 134 | 15552035 | 0.013846 | 50.872273 |
| 24 | caught | mouse | 5 | 1197 | 134 | 15552035 | 0.010314 | 49.132333 |
| 23 | play | mouse | 5 | 1484 | 134 | 15552035 | 0.012786 | 44.104431 |
| 10 | door | mouse | 7 | 5044 | 134 | 15552035 | 0.04346 | 33.369288 |
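
The overview also names MI and T-Score, which this notebook does not compute. The sketch below uses their commonly cited definitions, $MI = \log_2(O/E)$ and $T = (O - E)/\sqrt{O}$, where $O$ is the observed co-occurrence count. These formulas are an assumption here, since the notebook does not state which variants it intends, and the function names are illustrative.

```python
import math

def mutual_information(observed, expected):
    """MI = log2(O / E) (assumed variant)."""
    return math.log2(observed / expected)

def t_score(observed, expected):
    """T = (O - E) / sqrt(O) (assumed variant)."""
    return (observed - expected) / math.sqrt(observed)

# (word, observed co-occurrences, word frequency) from the 1850s table.
rows = [("rat", 12, 80), ("said", 10, 41908)]
target_freq, corpus_size = 134, 15552035

scores = {}
for word, observed, freq in rows:
    expected = target_freq * freq / corpus_size
    scores[word] = (mutual_information(observed, expected),
                    t_score(observed, expected))
    print(word, round(scores[word][0], 2), round(scores[word][1], 2))
```

On these two rows MI separates "rat" from "said" sharply, while the two T-scores are close to each other, so the measures can rank the same collocates quite differently.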


## Step 4: Demonstrating Semantic Change
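**Objective**: Create tables showing the shift in collocates between the 1850s and 2000s. The 2000s table is produced by re-running the cells above with `fileDir` set to the `coha_2000s.txt` path that is commented out in Step 1.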

### In the 1850s, "mouse" is associated with collocates like "rat," "house," and "animals."

In [None]:
collocationTable.sort_values(by='Z-score', ascending=False)[0:30]

| | Co-occurring words | Target word | Collocations | Freq co-occurring word | Freq target word | Words in corpus | ExpectedFreq | Z-score |
|---|---|---|---|---|---|---|---|---|
| 0 | rat | mouse | 12 | 80 | 134 | 15552035 | 0.000689 | 457.038219 |
| 1 | cat | mouse | 11 | 335 | 134 | 15552035 | 0.002886 | 204.690421 |
| 14 | lion | mouse | 6 | 400 | 134 | 15552035 | 0.003446 | 102.144016 |
| 18 | deer | mouse | 6 | 503 | 134 | 15552035 | 0.004334 | 91.074051 |
| 28 | hole | mouse | 4 | 431 | 134 | 15552035 | 0.003714 | 65.578153 |
| 19 | animals | mouse | 5 | 988 | 134 | 15552035 | 0.008513 | 54.099431 |
| 13 | species | mouse | 6 | 1607 | 134 | 15552035 | 0.013846 | 50.872273 |
| 24 | caught | mouse | 5 | 1197 | 134 | 15552035 | 0.010314 | 49.132333 |
| 23 | play | mouse | 5 | 1484 | 134 | 15552035 | 0.012786 | 44.104431 |
| 10 | door | mouse | 7 | 5044 | 134 | 15552035 | 0.04346 | 33.369288 |


### By the 2000s, "mouse" is more commonly found with "mickey," "computer," and "click."

In [None]:
collocationTable.sort_values(by='Z-score', ascending=False)[0:30]

| | Co-occurring words | Target word | Collocations | Freq co-occurring word | Freq target word | Words in corpus | ExpectedFreq | Z-score |
|---|---|---|---|---|---|---|---|---|
| 1 | mickey | mouse | 25 | 282 | 564 | 28132499 | 0.005654 | 332.415936 |
| 28 | antibodies | mouse | 10 | 114 | 564 | 28132499 | 0.002285 | 209.12835 |
| 9 | mice | mouse | 15 | 317 | 564 | 28132499 | 0.006355 | 188.079816 |
| 12 | click | mouse | 14 | 564 | 564 | 28132499 | 0.011307 | 131.553434 |
| 3 | cat | mouse | 24 | 1876 | 564 | 28132499 | 0.03761 | 123.560119 |
| 4 | cells | mouse | 20 | 1488 | 564 | 28132499 | 0.029831 | 115.623169 |
| 7 | computer | mouse | 17 | 3001 | 564 | 28132499 | 0.060164 | 69.062263 |
| 0 | like | mouse | 51 | 66617 | 564 | 28132499 | 1.335537 | 42.975191 |
| 27 | field | mouse | 10 | 4619 | 564 | 28132499 | 0.092602 | 32.55744 |
| 17 | front | mouse | 11 | 9820 | 564 | 28132499 | 0.196871 | 24.347726 |
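
One simple way to summarize the shift is to compare the top-ranked collocates of the two decades as sets. The sketch below hard-codes the ten words shown in each of the two Z-score tables in this step; the variable names are illustrative.

```python
# Top collocates by Z-score, copied from the two tables in this step.
top_1850s = ["rat", "cat", "lion", "deer", "hole",
             "animals", "species", "caught", "play", "door"]
top_2000s = ["mickey", "antibodies", "mice", "click", "cat",
             "cells", "computer", "like", "field", "front"]

# Words ranked highly in both decades, and words unique to each decade.
shared = sorted(set(top_1850s) & set(top_2000s))
only_1850s = [w for w in top_1850s if w not in top_2000s]
only_2000s = [w for w in top_2000s if w not in top_1850s]

print(shared)      # ['cat']
print(only_1850s)
print(only_2000s)
```

Only "cat" survives in both lists, while "click" and "computer" appear only in the 2000s, which is consistent with the "computer device" sense emerging alongside the animal sense.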
