# Practice: Analyzing Polysemy and Semantic Change Using the COHA Corpus
<br>

<a href="https://colab.research.google.com/github/seongmin-mun/Courses/blob/master/2024/CorpusLinguistics/Polysemy and Semantic Change/code/Code_Polysemy and Semantic Change.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<br>
## Overview
This practical session demonstrates how to analyze the polysemy of the word "mouse" over time using COHA (Corpus of Historical American English) data. Specifically, we will explore:

1. **Morphological Analysis**: Tokenizing sentences, tagging parts of speech, and extracting content words (nouns and verbs).
2. **Window Collocations**: Calculating co-occurrence frequencies within a defined context window.
3. **Collocation Measures**: Using statistical metrics—Expected Frequency, MI, LL, Z-Score, and T-Score—to evaluate collocates.
4. **Semantic Change Detection**: Demonstrating the shift in the meaning of "mouse" from "animal" to "computer device" between the 1850s and 2000s.

## Step 1: Data Preparation
**Objective**: Load the COHA data for the 1850s and 2000s, then preprocess it by tokenizing each sentence and keeping only content words. The cells below process one decade at a time; switch the `fileDir` line to run the other decade.


In [21]:
# fileDir = "../data/coha_1850s.txt"
fileDir = "../data/coha_2000s.txt"
with open(fileDir, 'r', encoding='utf-8') as fr:
    contents = fr.readlines()

textList = []
AllWordsList = []

num = 0
for content in contents:
    content = content.replace("\n", "")
    if num < 100:
        print(content)
    textList.append(content.lower())
    
    words = content.lower().strip().split(" ")
    for word in words:
        AllWordsList.append(word)
    num += 1

Sentence
13048 time late 1840 s the low down riverfront side of new orleans
a pretty black eyed olive skin white woman about 30 dressed as an africamerican slave performs menial household and yard tasks
her clothing is period urban slave
in a genuine mississippi delta style she variously sings hums hollers slurs a traditional slow tempo africamerican spiritual how long watchman how long
her singing matches her work pace
it is throaty lusty fervent but remains an accompaniment to the performance of her chores
she speaks and sings in a genuine black dialect
for a moment she stops singing as she silently remembers something and moves to a musical sound in her head something german a popular dance song circa 1818 perhaps some joyous music from a long ago wedding reception now a dim scarce trace of a memory that periodically careens through the unconcious like a comet and possesses her soul
the comet disappears shift to indicate an internal moment a soliloquy
the following is actually a cal
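
Because the cell above switches decades by commenting one `fileDir` line in or out, only one decade is in memory at a time. A minimal sketch of a reusable loader follows. The function name `load_sentences` is hypothetical, and it assumes the same one-sentence-per-line layout as the text files used here; the demonstration writes a throwaway file rather than reading real COHA data.

```python
import os
import tempfile

def load_sentences(path):
    """Return (sentences, tokens) for a one-sentence-per-line text file."""
    sentences, tokens = [], []
    with open(path, "r", encoding="utf-8") as fr:
        for line in fr:
            sentence = line.rstrip("\n").lower()
            sentences.append(sentence)
            tokens.extend(sentence.strip().split(" "))
    return sentences, tokens

# Demonstration on a throwaway file (not real COHA data).
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False, encoding="utf-8") as tmp:
    tmp.write("The Mouse ran\nA cat slept\n")
    demo_path = tmp.name

demo_sentences, demo_tokens = load_sentences(demo_path)
os.remove(demo_path)
print(demo_sentences)
print(len(demo_tokens))
```

Calling the function once per decade would keep both sentence lists available for the comparison in Step 4.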

### Extract sentences that contain the word "mouse."

In [22]:
wordsInCorpus = 0
mouseText = []
AllText = ""
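for each in textList:
    words = each.split(" ")
    wordsInCorpus += len(words)  # count every token in the corpus

    # Note: the padded pattern " mouse " skips lines that begin or end with the bare word.
    if " mouse " in each:
        mouseText.append(each)
        AllText = AllText + " " + each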

AllText = AllText.strip()
print(wordsInCorpus)
print(len(AllWordsList))

28132499
28132499
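
The test `" mouse " in each` only fires when the word has a space on both sides, so a sentence that starts or ends with "mouse" is skipped. A minimal sketch of a whole-word test with the standard `re` module is shown here as an alternative. The helper name `contains_word` and the sample sentences are made up for illustration, and swapping it in would change the counts reported in this session.

```python
import re

def contains_word(sentence, word="mouse"):
    """True if `word` occurs as a whole token anywhere in `sentence`."""
    pattern = r"\b" + re.escape(word) + r"\b"
    return re.search(pattern, sentence) is not None

samples = [
    "mouse runs corner",          # sentence-initial
    "crane drops the mouse",      # sentence-final
    "the mouse in the pantry",    # sentence-medial
    "mousetrap in the corner",    # different word, should not match
]
for s in samples:
    # columns: whole-word test | padded substring test | sentence
    print(contains_word(s), "|", " mouse " in s, "|", s)
```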


### Check the concordance table for sentences containing "mouse."

In [23]:
from nltk.text import Text
from nltk.tokenize import RegexpTokenizer
retokenize = RegexpTokenizer(r"[\w]+")
text = Text(retokenize.tokenize(AllText))
text.concordance("mouse")

Displaying 25 of 450 matches:
t a creature was stirring not even a mouse the stockings were hung etc lights s
 lights slip up on mature ernie dead mouse just like he s been for twenty years
e s been for twenty years he s blind mouse like i said i sort of lost my head f
 the string and beads go flying into mouse holes in the corners the finance dir
orners the finance director used the mouse and expanded a window i want a drink
ight of drew s nose twitching like a mouse s but douggie s as disdainful as a c
ainful as a cat s ida there s been a mouse in the pantry she said shaking the c
he handeye coordination of using the mouse and learning how to type p i crept l
to type p i crept like a little damn mouse up the last steps and looked over th
 the edge of the bed she can see the mouse rolling that pearl to its hole in th
wn the aisle or the tiny feet of the mouse that might be chasing it p the mouse
mouse that might be chasing it p the mouse in the darkened grand isle church ro
 toward it
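
`Text.concordance` prints a keyword-in-context (KWIC) view: each hit is shown with a fixed amount of context on either side. The idea can be sketched in plain Python. Here `kwic` and its `width` parameter are hypothetical names, and this version measures context in tokens, whereas NLTK's `width` is a character count.

```python
def kwic(tokens, keyword, width=4):
    """Return (left, keyword, right) context tuples for each hit of `keyword`."""
    hits = []
    for i, tok in enumerate(tokens):
        if tok == keyword:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            hits.append((left, tok, right))
    return hits

demo_tokens = "the finance director used the mouse and expanded a window".split()
for left, kw, right in kwic(demo_tokens, "mouse"):
    print(f"{left:>30} [{kw}] {right}")
```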

### Perform morphological analysis: Extract content words.

In [24]:
from nltk.tag import pos_tag
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords

# Keep nouns and verbs (the substring test below also admits the tag "IN").
# Requires the NLTK tokenizer, tagger, and stopwords resources (see nltk.download).
def tokenizerPOScontent(doc):
    stop_words = set(stopwords.words('english'))
    tagged_list = pos_tag(word_tokenize(doc))
    contentWords_list = [t[0] for t in tagged_list if "N" in t[1] or "V" in t[1]]
    filtered_contentWords = [w for w in contentWords_list if w.lower() not in stop_words]
    length_contentWords = [w for w in filtered_contentWords if len(w) > 2]
    return length_contentWords

mousePOSText = []
for each in mouseText:
    # Tag each sentence once and reuse the result for printing and storage.
    contentLine = " ".join(tokenizerPOScontent(each.strip()))
    print(contentLine)
    mousePOSText.append(contentLine)

offstage twas night christmas house creature stirring mouse stockings lights slip mature ernie
mouse like years
mouse like said sort lost head
let get generations dagenhams spinning graves rip string beads flying mouse holes corners
finance director used mouse expanded window
want drink gem announced side wall interrupting reverie sight twitching like mouse cat
ida mouse pantry said shaking cookie bag
cohen hoped day know machine connect internet read logic machine language coordination using mouse learning type
crept like damn mouse steps looked edge bed
see mouse rolling pearl hole sacristy shelf father keeps hosts
silence ship cuts water hear pop monstrance roll aisle feet mouse chasing
mouse isle church rolls pearl toward hole
field mouse
scene killing discovered lines poem tattooed armpit victim send heart unarmed walk like girl prostitute rim grownup girl minnie mouse shoes
mouse hiding escaped boy foot
mouse time let things fly mouth
called mouse algernon
operashun works ill sho

drops sweat peered like eyes field mouse mustache
sober church mouse clearspoken valedictorian
cleaning garage resting place sorts animals months uncovered mouse middle giving birth
mouse eyes unlike ginger whiskers extended moving air trying size situation trying decide flight response though condition choice
ginger stopped talking mouse ducked built birth placed garage closed door
look sasha said pointed glove mickey mouse moving float
corner mouse body trap desiccated fur mistaking
uses polyethylene glove coated cocoa butter hand runs mouse left
holds mouse tail waves front glass
cat knows mouse knows control happens crane places cover aquarium turns valve canister connected cover
crane drops mouse aquarium
mouse runs corner cat drops prowls
mouse turns squeals cat retracts fur spiked begins ramming glass side desperate escape
crane takes syringe presses drop liquid onto mouse neck
mouse pauses stands legs hisses charges cat throat
minute squealing hissing mouse soaked cat
alfred wo

study university minnesota researchers isolated cells adult mice grew dishes injected mouse embryos developed nerve liver types cells
study scientists institutes health work cells mouse embryos developed brain cells produce dopamine used treat disease
piece spyware watch shoulder browse web record mouse clicks broadcast information computer marketing purposes
facing cutrate prices chains expected get gobbled mouse snake pit
photograph cotton mouse danger extinction cat predation
seem like deal lose florida ponce beach mouse vanished estate development predation cats iucn list put
danger pondering questions coming solutions opt play game cat mouse beattheheat guide looking head toe photograph face weather changes routine sunblock day
goal engineer mouse immune system used generate antibodies
biologists known decades immunize mouse cancer cell protein cell mouse respond generating antibodies fight foreigner effect anticancer antibody
inject cancer cell mouse generate antibodies specializ
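
The filter in `tokenizerPOScontent` keeps any tag that contains the letter "N" or "V". In the Penn Treebank tagset used by NLTK's default tagger, that also lets through `IN` (preposition or subordinating conjunction), which is one way a word such as "like" can reach the output. A minimal sketch of a stricter prefix test follows. It works on already-tagged `(word, tag)` pairs so it runs without any NLTK downloads; the function name and the hand-tagged sample are illustrative assumptions, and stop-word removal is left out for brevity.

```python
def content_words(tagged, min_len=3):
    """Keep nouns (NN*) and verbs (VB*) from a list of (word, tag) pairs."""
    return [
        word for word, tag in tagged
        if tag.startswith(("NN", "VB")) and len(word) >= min_len
    ]

# Hand-tagged sample in Penn Treebank style (not tagger output).
tagged_demo = [
    ("crept", "VBD"), ("like", "IN"), ("a", "DT"),
    ("little", "JJ"), ("mouse", "NN"), ("up", "IN"), ("steps", "NNS"),
]

# The substring rule from the cell above, applied to the same sample.
substring_rule = [w for w, t in tagged_demo if "N" in t or "V" in t]
print(substring_rule)
print(content_words(tagged_demo))
```

Switching to the prefix test would change which collocates survive, so the frequency tables later in this session would need to be regenerated.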

## Step 2: Extract Context Words
**Objective**: Use window collocation to identify the collocates of "mouse" in the corpus. Here the window is the whole sentence that contains "mouse."

### Apply window collocations to calculate co-occurrence frequencies and individual word frequencies.

In [25]:
wordType = set()
for each in mousePOSText:
    eachSplit = each.split(" ")
    for word in eachSplit:
        if word != "mouse":
            wordType.add(word)

wordList = list(wordType)

print(wordList[0:10])
print(len(wordType))

['works', 'analyze', 'disk', 'generate', 'desperate', 'vice', 'mischief', 'needed', 'read', 'hell']
2550


### Calculate co-occurrence frequencies and keep words that co-occur more than 5 times.

In [26]:
wordFreqDict = {}
for each in wordList:
    wordFreqDict[each] = 0
    
for each in mousePOSText:
    eachSplit = each.split(" ")
    for word in eachSplit:
        if word != "mouse":
            wordFreqDict[word] = wordFreqDict.get(word) + 1
        
wordFreqDicSorted = dict(sorted(wordFreqDict.items(), key=lambda x: x[1], reverse=True))

CoOccurringWords = []
collocations = []
for key, value in wordFreqDicSorted.items():
    if value > 5:
        CoOccurringWords.append(key)
        collocations.append(value)
print(len(CoOccurringWords))
print(CoOccurringWords[0:10])
print(collocations[0:10])

85
['like', 'said', 'mickey', 'cat', 'cells', 'time', 'see', 'computer', 'get', 'years']
[51, 25, 25, 24, 20, 19, 18, 17, 16, 15]
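
The cells above count every content word in the same sentence as "mouse", so the sentence itself serves as the window. If a fixed span is wanted instead, co-occurrences can be restricted to a few tokens on either side of the target. A minimal sketch, assuming space-separated sentences like those in `mousePOSText`; `window_counts` and the span of 2 are hypothetical choices, not values used in this session.

```python
from collections import Counter

def window_counts(sentences, target="mouse", span=2):
    """Count words within `span` tokens to the left or right of `target`."""
    counts = Counter()
    for sentence in sentences:
        tokens = sentence.split(" ")
        for i, tok in enumerate(tokens):
            if tok != target:
                continue
            window = tokens[max(0, i - span):i] + tokens[i + 1:i + 1 + span]
            counts.update(w for w in window if w != target)
    return counts

demo = [
    "finance director used mouse expanded window",
    "crane drops mouse aquarium",
]
print(window_counts(demo).most_common())
```

In the first demo sentence, "finance" sits three tokens away from "mouse" and is therefore not counted.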


### Calculate individual frequencies of co-occurring words.

In [27]:
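from collections import Counter

# Count all tokens in one pass instead of rescanning AllWordsList for every word.
allWordCounts = Counter(AllWordsList)
CoOccurringWordsFreqDict = {}
for each in CoOccurringWords:
    CoOccurringWordsFreqDict[each] = allWordCounts[each]
    print(each, allWordCounts[each])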

like 66617
said 82556
mickey 282
cat 1876
cells 1488
time 46556
see 30031
computer 3001
get 35091
years 30281
man 27737
mice 317
click 564
house 17781
mother 17251
way 30753
around 24755
made 23898
people 33115
work 19069
make 24359
day 23424
know 34855
front 9820
hand 15580
looked 16567
used 11202
field 4619
trying 8273
hole 1890
antibodies 114
put 13807
face 16348
father 15703
came 15589
screen 2415
told 15159
let 14909
night 15382
eyes 19315
called 12430
cancer 2405
clicks 165
behind 12137
keyboard 241
button 928
room 16219
find 12461
clicked 301
using 4775
bed 6595
everyone 6349
scientists 1746
got 22345
tale 907
friend 5939
though 14908
desktop 219
door 13992
name 10515
gene 604
place 14773
feeling 4252
antibody 46
disney 414
thought 18131
thing 13074
since 14693
air 8791
went 13643
hear 7049
stem 562
cell 2354
born 3275
use 11048
world 19641
run 7434
head 18096
game 6869
take 20000
films 944
water 12512
move 6034
money 9402
corner 3379


In [28]:
freqCoWord = []
freqTargetWord = []
targetword = []
wordsInCorpusList = []
targetwordCount = AllWordsList.count("mouse")

for key, value in CoOccurringWordsFreqDict.items():
    freqCoWord.append(value)
    freqTargetWord.append(targetwordCount)
    targetword.append("mouse")
    wordsInCorpusList.append(wordsInCorpus)
    
print(freqCoWord[0:10])

[66617, 82556, 282, 1876, 1488, 46556, 30031, 3001, 35091, 30281]


### Generate the final dataset.

In [29]:
import pandas as pd
collocationTable = pd.DataFrame({'Co-occurring words':CoOccurringWords,'Target word':targetword,'Collocations':collocations,'Freq co-occurring word':freqCoWord,'Freq target word':freqTargetWord,'Words in corpus':wordsInCorpusList})
print(collocationTable)

   Co-occurring words Target word  Collocations  Freq co-occurring word  \
0                like       mouse            51                   66617   
1                said       mouse            25                   82556   
2              mickey       mouse            25                     282   
3                 cat       mouse            24                    1876   
4               cells       mouse            20                    1488   
..                ...         ...           ...                     ...   
80              films       mouse             6                     944   
81              water       mouse             6                   12512   
82               move       mouse             6                    6034   
83              money       mouse             6                    9402   
84             corner       mouse             6                    3379   

    Freq target word  Words in corpus  
0                564         28132499  
1                56

In [30]:
collocationTable[0:30]

| | Co-occurring words | Target word | Collocations | Freq co-occurring word | Freq target word | Words in corpus |
|---|---|---|---|---|---|---|
| 0 | like | mouse | 51 | 66617 | 564 | 28132499 |
| 1 | said | mouse | 25 | 82556 | 564 | 28132499 |
| 2 | mickey | mouse | 25 | 282 | 564 | 28132499 |
| 3 | cat | mouse | 24 | 1876 | 564 | 28132499 |
| 4 | cells | mouse | 20 | 1488 | 564 | 28132499 |
| 5 | time | mouse | 19 | 46556 | 564 | 28132499 |
| 6 | see | mouse | 18 | 30031 | 564 | 28132499 |
| 7 | computer | mouse | 17 | 3001 | 564 | 28132499 |
| 8 | get | mouse | 16 | 35091 | 564 | 28132499 |
| 9 | years | mouse | 15 | 30281 | 564 | 28132499 |


## Step 3: Collocation Measures
**Objective**: Compute the Expected Frequency and Z-Score for each collocate in the dataset.

### Expected Frequency
The formula for Expected Frequency is:

$$E = \frac{f(w_1) \cdot f(w_2)}{N}$$

Where:
- $f(w_1)$: Frequency of the target word.
- $f(w_2)$: Frequency of the co-occurring word.
- $N$: Total number of tokens in the corpus.
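
As a check on the formula, the expected frequency for "mickey" can be worked by hand from the 2000s counts reported above (282 occurrences of "mickey", 564 of "mouse", 28,132,499 tokens). The variable names below are illustrative.

```python
f_target = 564          # frequency of "mouse" in the 2000s data
f_collocate = 282       # frequency of "mickey" in the 2000s data
n_tokens = 28132499     # total tokens in the 2000s data

expected = f_target * f_collocate / n_tokens
print(round(expected, 6))
```

The result agrees with the `ExpectedFreq` value shown for the "mickey" row: under independence, "mickey" and "mouse" would be expected to co-occur far less than once in a corpus of this size.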

In [31]:
collocationTable['ExpectedFreq'] = (collocationTable['Freq co-occurring word']*collocationTable['Freq target word'])/collocationTable['Words in corpus']

In [32]:
collocationTable[0:30]

| | Co-occurring words | Target word | Collocations | Freq co-occurring word | Freq target word | Words in corpus | ExpectedFreq |
|---|---|---|---|---|---|---|---|
| 0 | like | mouse | 51 | 66617 | 564 | 28132499 | 1.335537 |
| 1 | said | mouse | 25 | 82556 | 564 | 28132499 | 1.655082 |
| 2 | mickey | mouse | 25 | 282 | 564 | 28132499 | 0.005654 |
| 3 | cat | mouse | 24 | 1876 | 564 | 28132499 | 0.03761 |
| 4 | cells | mouse | 20 | 1488 | 564 | 28132499 | 0.029831 |
| 5 | time | mouse | 19 | 46556 | 564 | 28132499 | 0.933354 |
| 6 | see | mouse | 18 | 30031 | 564 | 28132499 | 0.602061 |
| 7 | computer | mouse | 17 | 3001 | 564 | 28132499 | 0.060164 |
| 8 | get | mouse | 16 | 35091 | 564 | 28132499 | 0.703504 |
| 9 | years | mouse | 15 | 30281 | 564 | 28132499 | 0.607073 |


In [33]:
collocationTable.sort_values(by='ExpectedFreq', ascending=False)[0:30]

| | Co-occurring words | Target word | Collocations | Freq co-occurring word | Freq target word | Words in corpus | ExpectedFreq |
|---|---|---|---|---|---|---|---|
| 1 | said | mouse | 25 | 82556 | 564 | 28132499 | 1.655082 |
| 0 | like | mouse | 51 | 66617 | 564 | 28132499 | 1.335537 |
| 5 | time | mouse | 19 | 46556 | 564 | 28132499 | 0.933354 |
| 8 | get | mouse | 16 | 35091 | 564 | 28132499 | 0.703504 |
| 22 | know | mouse | 11 | 34855 | 564 | 28132499 | 0.698773 |
| 18 | people | mouse | 11 | 33115 | 564 | 28132499 | 0.663889 |
| 15 | way | mouse | 12 | 30753 | 564 | 28132499 | 0.616536 |
| 9 | years | mouse | 15 | 30281 | 564 | 28132499 | 0.607073 |
| 6 | see | mouse | 18 | 30031 | 564 | 28132499 | 0.602061 |
| 10 | man | mouse | 15 | 27737 | 564 | 28132499 | 0.556071 |


### Z-Score
The formula for Z-Score is:

$$Z = \frac{f(w_1, w_2) - E}{\sqrt{E}}$$

Where:
- $f(w_1, w_2)$: Observed frequency of the collocation.
- $E$: Expected frequency.
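
To see the formula at work, the Z-score for "mickey" can be computed directly from the 2000s counts (25 co-occurrences, 282 occurrences of "mickey", 564 of "mouse", 28,132,499 tokens). This is only a hand check with illustrative variable names.

```python
import math

observed = 25                      # co-occurrences of "mickey" with "mouse"
expected = 564 * 282 / 28132499    # expected frequency under independence

z_score = (observed - expected) / math.sqrt(expected)
print(round(z_score, 2))
```

Because $\sqrt{E}$ is in the denominator, collocates with a very small expected frequency receive very large Z-scores, so rare words can rank high on the strength of a handful of co-occurrences.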

In [34]:
import numpy as np
collocationTable['Z-score'] = (collocationTable['Collocations'] - collocationTable['ExpectedFreq'])/np.sqrt(collocationTable['ExpectedFreq'])

In [35]:
collocationTable[0:30]

| | Co-occurring words | Target word | Collocations | Freq co-occurring word | Freq target word | Words in corpus | ExpectedFreq | Z-score |
|---|---|---|---|---|---|---|---|---|
| 0 | like | mouse | 51 | 66617 | 564 | 28132499 | 1.335537 | 42.975191 |
| 1 | said | mouse | 25 | 82556 | 564 | 28132499 | 1.655082 | 18.146072 |
| 2 | mickey | mouse | 25 | 282 | 564 | 28132499 | 0.005654 | 332.415936 |
| 3 | cat | mouse | 24 | 1876 | 564 | 28132499 | 0.03761 | 123.560119 |
| 4 | cells | mouse | 20 | 1488 | 564 | 28132499 | 0.029831 | 115.623169 |
| 5 | time | mouse | 19 | 46556 | 564 | 28132499 | 0.933354 | 18.700547 |
| 6 | see | mouse | 18 | 30031 | 564 | 28132499 | 0.602061 | 22.422163 |
| 7 | computer | mouse | 17 | 3001 | 564 | 28132499 | 0.060164 | 69.062263 |
| 8 | get | mouse | 16 | 35091 | 564 | 28132499 | 0.703504 | 18.237222 |
| 9 | years | mouse | 15 | 30281 | 564 | 28132499 | 0.607073 | 18.472625 |


In [36]:
collocationTable.sort_values(by='Z-score', ascending=False)[0:30]

| | Co-occurring words | Target word | Collocations | Freq co-occurring word | Freq target word | Words in corpus | ExpectedFreq | Z-score |
|---|---|---|---|---|---|---|---|---|
| 2 | mickey | mouse | 25 | 282 | 564 | 28132499 | 0.005654 | 332.415936 |
| 63 | antibody | mouse | 7 | 46 | 564 | 28132499 | 0.000922 | 230.476425 |
| 30 | antibodies | mouse | 10 | 114 | 564 | 28132499 | 0.002285 | 209.12835 |
| 11 | mice | mouse | 15 | 317 | 564 | 28132499 | 0.006355 | 188.079816 |
| 42 | clicks | mouse | 8 | 165 | 564 | 28132499 | 0.003308 | 139.037838 |
| 12 | click | mouse | 14 | 564 | 564 | 28132499 | 0.011307 | 131.553434 |
| 3 | cat | mouse | 24 | 1876 | 564 | 28132499 | 0.03761 | 123.560119 |
| 4 | cells | mouse | 20 | 1488 | 564 | 28132499 | 0.029831 | 115.623169 |
| 44 | keyboard | mouse | 8 | 241 | 564 | 28132499 | 0.004832 | 115.022738 |
| 57 | desktop | mouse | 7 | 219 | 564 | 28132499 | 0.004391 | 105.576705 |
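
The overview also lists MI and T-Score. Under their commonly cited definitions, $MI = \log_2(O/E)$ and $T = (O - E)/\sqrt{O}$ with $O$ the observed co-occurrence count, both can be computed from the same two numbers as the Z-score. The sketch below assumes those definitions (the course materials may use a variant) and uses plain Python on three rows copied from the 2000s data; with the DataFrame, the same expressions could be applied column-wise.

```python
import math

# (collocate, observed co-occurrences, frequency of collocate) from the 2000s data
rows = [("mickey", 25, 282), ("like", 51, 66617), ("computer", 17, 3001)]
f_target, n_tokens = 564, 28132499

results = {}
for word, observed, f_collocate in rows:
    expected = f_target * f_collocate / n_tokens
    mi = math.log2(observed / expected)                     # assumed MI definition
    t_score = (observed - expected) / math.sqrt(observed)   # assumed T-score definition
    results[word] = (mi, t_score)
    print(f"{word:10s} MI={mi:6.2f}  T={t_score:5.2f}")
```

On these rows the two measures disagree about ranking: MI puts the rare "mickey" above the frequent "like", while the T-score reverses the order.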


## Step 4: Demonstrating Semantic Change
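**Objective**: Create tables showing the shift in collocates between the 1850s and 2000s. The 1850s table comes from rerunning the cells above with `fileDir` set to `../data/coha_1850s.txt`.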

### In the 1850s, "mouse" is associated with animal and household collocates such as "rat," "cat," and "house."

In [17]:
collocationTable.sort_values(by='Z-score', ascending=False)[0:30]

| | Co-occurring words | Target word | Collocations | Freq co-occurring word | Freq target word | Words in corpus | ExpectedFreq | Z-score |
|---|---|---|---|---|---|---|---|---|
| 0 | rat | mouse | 12 | 80 | 134 | 15552035 | 0.000689 | 457.038219 |
| 1 | cat | mouse | 11 | 335 | 134 | 15552035 | 0.002886 | 204.690421 |
| 18 | lion | mouse | 6 | 400 | 134 | 15552035 | 0.003446 | 102.144016 |
| 16 | deer | mouse | 6 | 503 | 134 | 15552035 | 0.004334 | 91.074051 |
| 17 | species | mouse | 6 | 1607 | 134 | 15552035 | 0.013846 | 50.872273 |
| 9 | door | mouse | 7 | 5044 | 134 | 15552035 | 0.04346 | 33.369288 |
| 12 | room | mouse | 7 | 5374 | 134 | 15552035 | 0.046304 | 32.315293 |
| 6 | nothing | mouse | 8 | 9325 | 134 | 15552035 | 0.080346 | 27.93978 |
| 8 | house | mouse | 7 | 9067 | 134 | 15552035 | 0.078123 | 24.764711 |
| 11 | eyes | mouse | 7 | 9764 | 134 | 15552035 | 0.084129 | 23.843731 |


### By the 2000s, "mouse" also collocates strongly with "mickey," "click," "keyboard," and "desktop," while animal collocates such as "cat" remain.

In [37]:
collocationTable.sort_values(by='Z-score', ascending=False)[0:30]

| | Co-occurring words | Target word | Collocations | Freq co-occurring word | Freq target word | Words in corpus | ExpectedFreq | Z-score |
|---|---|---|---|---|---|---|---|---|
| 2 | mickey | mouse | 25 | 282 | 564 | 28132499 | 0.005654 | 332.415936 |
| 63 | antibody | mouse | 7 | 46 | 564 | 28132499 | 0.000922 | 230.476425 |
| 30 | antibodies | mouse | 10 | 114 | 564 | 28132499 | 0.002285 | 209.12835 |
| 11 | mice | mouse | 15 | 317 | 564 | 28132499 | 0.006355 | 188.079816 |
| 42 | clicks | mouse | 8 | 165 | 564 | 28132499 | 0.003308 | 139.037838 |
| 12 | click | mouse | 14 | 564 | 564 | 28132499 | 0.011307 | 131.553434 |
| 3 | cat | mouse | 24 | 1876 | 564 | 28132499 | 0.03761 | 123.560119 |
| 4 | cells | mouse | 20 | 1488 | 564 | 28132499 | 0.029831 | 115.623169 |
| 44 | keyboard | mouse | 8 | 241 | 564 | 28132499 | 0.004832 | 115.022738 |
| 57 | desktop | mouse | 7 | 219 | 564 | 28132499 | 0.004391 | 105.576705 |
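
Because both decades are run through the same cells, `collocationTable` holds only one decade at a time. One way to make the shift explicit is to save the top-ranked collocates from each run and compare them as sets. The sketch below hard-codes the ten collocates shown in the two tables above; the variable names are illustrative.

```python
# Top ten collocates by Z-score, copied from the two tables in this step.
top_1850s = ["rat", "cat", "lion", "deer", "species",
             "door", "room", "nothing", "house", "eyes"]
top_2000s = ["mickey", "antibody", "antibodies", "mice", "clicks",
             "click", "cat", "cells", "keyboard", "desktop"]

shared = sorted(set(top_1850s) & set(top_2000s))
only_1850s = [w for w in top_1850s if w not in top_2000s]
only_2000s = [w for w in top_2000s if w not in top_1850s]

print("shared:    ", shared)
print("1850s only:", only_1850s)
print("2000s only:", only_2000s)
```

Only "cat" appears in both lists. The 2000s-only items fall into a computing group ("click," "keyboard," "desktop") and a laboratory group ("antibody," "cells"), neither of which is visible in the 1850s list.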
