# Practice: Analyzing Polysemy and Semantic Change

## Overview

1. **Morphological Analysis**: Breaking sentences into morphemes and extracting meaningful context words.
2. **Window Collocations**: Calculating co-occurrence frequencies within a defined context window.
3. **Collocation Measures**: Using statistical metrics (Expected Frequency, Mutual Information (MI), Log-Likelihood (LL), Z-Score, and T-Score) to evaluate collocates.

## Step 1: Data Preparation


In [1]:
!pip install nltk



In [2]:
# Mount Google Drive to this Notebook instance.
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


- Corpus source (French Wikipedia dump, 2025-02-01): https://dumps.wikimedia.org/frwiki/20250201/

In [4]:
import re

fileDir = "drive/My Drive/corpus/frwiki-20250201.txt"
# Read the whole dump; the context manager closes the file automatically.
with open(fileDir, 'r', encoding='utf-8') as fr:
    contents = fr.readlines()

textList = []       # cleaned lines
AllWordsList = []   # every token in the corpus, in order

for num, content in enumerate(contents):
    # Keep ASCII letters only; accented letters are replaced by spaces too.
    content = re.sub(r"[^a-zA-Z]+", " ", content)
    content = re.sub(r"\s+", " ", content)
    content = content.strip().lower()
    if num < 100:  # preview the first 100 cleaned lines
        print(content)
    textList.append(content)
    AllWordsList.extend(content.split(" "))


mediawiki xmlns http www mediawiki org xml export xmlns xsi http www w org xmlschema instance xsi schemalocation http www mediawiki org xml export http www mediawiki org xml export xsd version xml lang fr
siteinfo
sitename wikip dia sitename
dbname frwiki dbname
base https fr wikipedia org wiki wikip c a dia accueil principal base
generator mediawiki wmf generator
case first letter case
namespaces
namespace key case first letter m dia namespace
namespace key case first letter sp cial namespace
namespace key case first letter
namespace key case first letter discussion namespace
namespace key case first letter utilisateur namespace
namespace key case first letter discussion utilisateur namespace
namespace key case first letter wikip dia namespace
namespace key case first letter discussion wikip dia namespace
namespace key case first letter fichier namespace
namespace key case first letter discussion fichier namespace
namespace key case first letter mediawiki namespace
namespace key case 
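
The preview above shows accented letters lost during cleaning ("wikip dia", "sp cial"), because the pattern `[^a-zA-Z]+` keeps ASCII letters only. As a minimal sketch of one possible alternative, the hypothetical helper `clean_line` below removes non-word characters, digits, and underscores instead, so accented letters survive. It is an illustration only and was not used to produce the outputs in this notebook.

```python
import re

def clean_line(line: str) -> str:
    """Lowercase a line and keep letters only, including accented ones."""
    # [\W\d_]+ matches runs of non-word characters, digits, and underscores,
    # so Unicode letters such as é, ç, or œ are left untouched.
    line = re.sub(r"[\W\d_]+", " ", line)
    return re.sub(r"\s+", " ", line).strip().lower()

print(clean_line("Wikipédia : Accueil_principal 2025"))
# wikipédia accueil principal
```

Switching to such a pattern would change every token count reported below, so the whole pipeline would need to be rerun.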

### Extract sentences that contain the word "souris."

In [5]:
targetWord = "souris"

In [6]:
wordsInCorpus = 0
mouseText = []
for each in textList:
    wordsInCorpus += len(each.split(" "))

    # Matches the target only when it has a space on both sides, so lines
    # that start or end with the target word are skipped.
    if " " + targetWord + " " in each:
        mouseText.append(each)

AllText = " ".join(mouseText)
print(wordsInCorpus)
print(len(AllWordsList))

15802187
15802187


### Check the concordance table for sentences containing "souris."

In [7]:
from nltk.text import Text
from nltk.tokenize import RegexpTokenizer

retokenize = RegexpTokenizer(r"\w+")
text = Text(retokenize.tokenize(AllText))
text.concordance(targetWord)

Displaying 25 of 490 matches:
dinateur avec l interface graphique souris de qui fit la fortune d apple avec 
r keyboard video mouse clavier cran souris voir commutateur kvm lt souris rat 
cran souris voir commutateur kvm lt souris rat l atari st apporte un espace de
ant une utilisation intensive de la souris l atari st coupl avec un chantillon
iel de composition musicale cran et souris atari st jeudi mars h lt ref gt deu
i mars h lt ref gt deux connecteurs souris joystick d sub de m le situ s sous 
fichier atari st mouse jpg vignette souris standard atari st le syst me d expl
 de g rer les fen tres les menus la souris etc en se servant de la vdi la vdi 
cr ments secs d oiseau ou de chauve souris notamment au chili au p rou en inde
es g nes de cellules pulmonaires de souris et pourrait causer des changements 
s lt ref gt un test effectu sur des souris de laboratoire souris a permis de m
fectu sur des souris de laboratoire souris a permis de montrer que les strog n
des cultures de tissu 

### Perform morphological analysis: Extract content words.

In [9]:
import nltk
from nltk.tag import pos_tag
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords

nltk.download('averaged_perceptron_tagger')
nltk.download('punkt')
nltk.download('punkt_tab')
nltk.download('averaged_perceptron_tagger_fr')
nltk.download('averaged_perceptron_tagger_eng')
nltk.download('stopwords')

# Keep nouns and verbs ("Noun or Predicate").
# pos_tag() defaults to NLTK's English tagger, so tags on French text are approximate.
stop_words = set(stopwords.words('french'))  # built once, reused for every line

def tokenizerPOScontent(doc):
    tagged_list = pos_tag(word_tokenize(doc))
    contentWords_list = [t[0] for t in tagged_list if "N" in t[1] or "V" in t[1]]
    filtered_contentWords = [w for w in contentWords_list if w.lower() not in stop_words]
    length_contentWords = [w for w in filtered_contentWords if len(w) > 2]  # drop 1-2 character tokens
    return length_contentWords

mousePOSText = []
for each in mouseText:
    contentLine = " ".join(tokenizerPOScontent(each.strip()))
    print(contentLine)
    mousePOSText.append(contentLine)

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!
[nltk_data] Error loading averaged_perceptron_tagger_fr: Package
[nltk_data]     'averaged_perceptron_tagger_fr' not found in index
[nltk_data] Downloading package averaged_perceptron_tagger_eng to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger_eng.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


refnec lisa tait premier ordinateur interface souris fortune apple macintosh lanc sans trop cher ferm apple tenta server commercialisation limit nuit beaucoup diffusion malgr attrait interface attira attention peu succ estime date janvier
keyboard video mouse clavier cran souris voir commutateur kvm
souris rat
atari apporte espace inou performant musiciens poque cran haute solution interface graphique graphics environment manager gem autorisant atari coupl chantillonneur sampler sonne glas toutes res configurations informatiques musicales jusqu alors musiciens fortun comme fairlight cmi prix drastiquement inf rieur ref franck ernould ans home studio recording ref multitude ordinateur couvrir divers besoins quenceur diteur partition musique partition diteur apprentissage etc importe quel musicien peut dor concerts ter chez aliser maquettes enregistrer album complet
fin ann patrick bruel invite chez musicien alain installer mat riel composition musicale studio parisien chanteur travaille
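
One detail of the filter in `tokenizerPOScontent` is worth checking: the test `"N" in t[1]` is a substring match, so in the Penn Treebank tagset it also accepts `IN` (preposition), not only the noun tags. The sketch below compares it with a prefix match on a hand-made list of (token, tag) pairs. The pairs are hypothetical and are not output from the tagger.

```python
# Hypothetical (token, tag) pairs in the Penn Treebank style returned by pos_tag.
tagged = [("souris", "NN"), ("dans", "IN"), ("mange", "VBZ"), ("la", "DT")]

# Substring match, as in the notebook: "IN" contains "N", so "dans" is kept.
substring_match = [w for w, tag in tagged if "N" in tag or "V" in tag]

# Prefix match: keeps only tags that start with N (nouns) or V (verbs).
prefix_match = [w for w, tag in tagged if tag.startswith(("N", "V"))]

print(substring_match)  # ['souris', 'dans', 'mange']
print(prefix_match)     # ['souris', 'mange']
```

Whether the stricter filter is preferable depends on the analysis; it would change the word lists and counts shown in this notebook.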

## Step 2: Extract Context Words
**Objective**: Use window collocations to identify collocates of "souris" in the corpus.
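
The cells in this step count every content word on a line that contains "souris", so the window is effectively the whole line. A fixed-span window is a common alternative: only words within a few tokens of the target are counted. The sketch below illustrates that idea on a toy sentence; the function name `window_cooccurrences` and the `span` parameter are assumptions made for illustration and do not appear in the notebook.

```python
from collections import Counter

def window_cooccurrences(tokens, target, span=3):
    """Count words within `span` tokens on either side of each `target`."""
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok != target:
            continue
        left = max(0, i - span)
        # Tokens before and after the target, excluding the target itself.
        window = tokens[left:i] + tokens[i + 1:i + 1 + span]
        counts.update(w for w in window if w != target)
    return counts

toy = "le clavier et la souris de l ordinateur sont sur le bureau".split()
result = window_cooccurrences(toy, "souris", span=2)
print(result)
```

With `span=2`, only "et", "la", "de", and "l" are counted, which shows why a narrow window is usually combined with stop-word filtering.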

### Apply window collocations to calculate co-occurrence frequencies and individual word frequencies.

In [10]:
wordType = set()
for each in mousePOSText:
    eachSplit = each.split(" ")
    for word in eachSplit:
        if word != targetWord:
            wordType.add(word)

wordList = list(wordType)

print(wordList[0:10])
print(len(wordType))

['nient', 'identique', 'southwest', 'alt', 'vaisseau', 'chlorocebus', 'blocage', 'comptant', 'soui', 'premenopausal']
4475


### Calculate co-occurrence frequencies and keep the top 30 words.

In [11]:
wordFreqDict = {each: 0 for each in wordList}

for each in mousePOSText:
    for word in each.split(" "):
        if word != targetWord:
            wordFreqDict[word] += 1

# Sort by co-occurrence count, highest first, and keep the top 30.
wordFreqDicSorted = dict(sorted(wordFreqDict.items(), key=lambda x: x[1], reverse=True))
top30 = list(wordFreqDicSorted.items())[:30]

CoOccurringWords = [word for word, _ in top30]
collocations = [count for _, count in top30]
print(len(CoOccurringWords))
print(CoOccurringWords[0:10])
print(collocations[0:10])

30
['ref', 'quot', 'name', 'date', 'langue', 'titre', 'com', 'consult', 'www', 'http']
[289, 166, 102, 80, 75, 65, 61, 59, 58, 55]


### Calculate individual frequencies of co-occurring words.

In [12]:
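from collections import Counter
# Count every token once instead of rescanning the full token list per word.
corpusCounts = Counter(AllWordsList)

CoOccurringWordsFreqDict = {}
for each in CoOccurringWords:
    CoOccurringWordsFreqDict[each] = corpusCounts[each]
    print(each, corpusCounts[each])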

ref 198172
quot 158780
name 38027
date 56747
langue 48982
titre 78112
com 23827
consult 28946
www 42149
http 31202
lien 36738
comme 23271
apple 1864
ordinateur 1021
web 30997
tre 19936
https 29598
informatique 3289
cran 774
res 14571
autres 12960
cette 16267
peut 8883
site 27302
nom 41719
unit 22550
article 23972
sans 6989
deux 15919
interface 572


In [13]:
freqCoWord = []
freqTargetWord = []
targetword = []
wordsInCorpusList = []
targetwordCount = AllWordsList.count(targetWord)

for key, value in CoOccurringWordsFreqDict.items():
    freqCoWord.append(value)
    freqTargetWord.append(targetwordCount)
    targetword.append(targetWord)
    wordsInCorpusList.append(wordsInCorpus)

print(freqCoWord[0:10])

[198172, 158780, 38027, 56747, 48982, 78112, 23827, 28946, 42149, 31202]


### Generate the final dataset.

In [14]:
import pandas as pd
collocationTable = pd.DataFrame({
    'Co-occurring words': CoOccurringWords,
    'Target word': targetword,
    'Collocations': collocations,
    'Freq co-occurring word': freqCoWord,
    'Freq target word': freqTargetWord,
    'Words in corpus': wordsInCorpusList,
})
print(collocationTable)

   Co-occurring words Target word  Collocations  Freq co-occurring word  \
0                 ref      souris           289                  198172   
1                quot      souris           166                  158780   
2                name      souris           102                   38027   
3                date      souris            80                   56747   
4              langue      souris            75                   48982   
5               titre      souris            65                   78112   
6                 com      souris            61                   23827   
7             consult      souris            59                   28946   
8                 www      souris            58                   42149   
9                http      souris            55                   31202   
10               lien      souris            55                   36738   
11              comme      souris            53                   23271   
12              apple    

In [15]:
collocationTable[0:30]

| | Co-occurring words | Target word | Collocations | Freq co-occurring word | Freq target word | Words in corpus |
|---|---|---|---|---|---|---|
| 0 | ref | souris | 289 | 198172 | 527 | 15802187 |
| 1 | quot | souris | 166 | 158780 | 527 | 15802187 |
| 2 | name | souris | 102 | 38027 | 527 | 15802187 |
| 3 | date | souris | 80 | 56747 | 527 | 15802187 |
| 4 | langue | souris | 75 | 48982 | 527 | 15802187 |
| 5 | titre | souris | 65 | 78112 | 527 | 15802187 |
| 6 | com | souris | 61 | 23827 | 527 | 15802187 |
| 7 | consult | souris | 59 | 28946 | 527 | 15802187 |
| 8 | www | souris | 58 | 42149 | 527 | 15802187 |
| 9 | http | souris | 55 | 31202 | 527 | 15802187 |


## Step 3: Collocation measures
**Objective**: Apply the learned collocation measures to the dataset.

### Expected Frequency
The formula for Expected Frequency is:

$$E = \frac{f(w_1) \cdot f(w_2)}{N}$$

Where:
- $f(w_1)$: Frequency of the target word.
- $f(w_2)$: Frequency of the co-occurring word.
- $N$: Total number of tokens in the corpus.
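
As a quick check of the formula, the sketch below recomputes the expected frequency for "ordinateur" from the counts reported above (target frequency 527, co-occurring word frequency 1021, corpus size 15802187). The helper name `expected_frequency` is hypothetical and only serves this illustration.

```python
def expected_frequency(freq_target, freq_coword, corpus_size):
    """E = f(w1) * f(w2) / N."""
    return freq_target * freq_coword / corpus_size

# Counts for "souris" and "ordinateur" taken from the tables in this notebook.
e = expected_frequency(527, 1021, 15802187)
print(round(e, 5))
```

The value is far below 1, which means that under independence "ordinateur" would almost never be expected to appear alongside "souris" in a corpus of this size.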

In [16]:
collocationTable['ExpectedFreq'] = (collocationTable['Freq co-occurring word']*collocationTable['Freq target word'])/collocationTable['Words in corpus']
collocationTable.sort_values(by='ExpectedFreq', ascending=False)[0:30]

| | Co-occurring words | Target word | Collocations | Freq co-occurring word | Freq target word | Words in corpus | ExpectedFreq |
|---|---|---|---|---|---|---|---|
| 0 | ref | souris | 289 | 198172 | 527 | 15802187 | 6.608999 |
| 1 | quot | souris | 166 | 158780 | 527 | 15802187 | 5.295283 |
| 5 | titre | souris | 65 | 78112 | 527 | 15802187 | 2.605021 |
| 3 | date | souris | 80 | 56747 | 527 | 15802187 | 1.892502 |
| 4 | langue | souris | 75 | 48982 | 527 | 15802187 | 1.633541 |
| 8 | www | souris | 58 | 42149 | 527 | 15802187 | 1.405661 |
| 24 | nom | souris | 33 | 41719 | 527 | 15802187 | 1.391321 |
| 2 | name | souris | 102 | 38027 | 527 | 15802187 | 1.268193 |
| 10 | lien | souris | 55 | 36738 | 527 | 15802187 | 1.225205 |
| 9 | http | souris | 55 | 31202 | 527 | 15802187 | 1.040581 |


### Z-Score
The formula for Z-Score is:

$$Z = \frac{f(w_1, w_2) - E}{\sqrt{E}}$$

Where:
- $f(w_1, w_2)$: Observed frequency of the collocation.
- $E$: Expected frequency.
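
To make the formula concrete, the sketch below works through the Z-Score for "ordinateur", using the observed co-occurrence count of 49 and the expected frequency derived from the counts reported above. The helper name `z_score` is hypothetical.

```python
import math

def z_score(observed, expected):
    """Z = (O - E) / sqrt(E)."""
    return (observed - expected) / math.sqrt(expected)

# Expected frequency for "ordinateur": f(w1) * f(w2) / N.
expected = 527 * 1021 / 15802187
z = z_score(49, expected)
print(round(z, 2))
```

Because the denominator is the square root of a very small expected frequency, rare words with only a few dozen co-occurrences can receive very large scores.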

In [17]:
import numpy as np
collocationTable['Z-score'] = (collocationTable['Collocations'] - collocationTable['ExpectedFreq'])/np.sqrt(collocationTable['ExpectedFreq'])
collocationTable.sort_values(by='Z-score', ascending=False)[0:30]

| | Co-occurring words | Target word | Collocations | Freq co-occurring word | Freq target word | Words in corpus | ExpectedFreq | Z-score |
|---|---|---|---|---|---|---|---|---|
| 13 | ordinateur | souris | 49 | 1021 | 527 | 15802187 | 0.03405 | 265.359477 |
| 18 | cran | souris | 37 | 774 | 527 | 15802187 | 0.025813 | 230.13435 |
| 29 | interface | souris | 30 | 572 | 527 | 15802187 | 0.019076 | 217.070241 |
| 12 | apple | souris | 51 | 1864 | 527 | 15802187 | 0.062164 | 204.301157 |
| 17 | informatique | souris | 38 | 3289 | 527 | 15802187 | 0.109688 | 114.406195 |
| 0 | ref | souris | 289 | 198172 | 527 | 15802187 | 6.608999 | 109.845684 |
| 2 | name | souris | 102 | 38027 | 527 | 15802187 | 1.268193 | 89.448669 |
| 1 | quot | souris | 166 | 158780 | 527 | 15802187 | 5.295283 | 69.836749 |
| 27 | sans | souris | 33 | 6989 | 527 | 15802187 | 0.233082 | 67.87055 |
| 6 | com | souris | 61 | 23827 | 527 | 15802187 | 0.794626 | 67.538882 |
