# Project: Building an instance of the *NYT Connections* game with NLP (the `nltk` and `gensim` libraries)

## Introduction

*NYT Connections* is one of the games the *New York Times* offers its readers. Most of these games are linguistic in nature, the best known being *Wordle*. *Connections* is the one I personally find most interesting. The player is given 16 words and must sort them into 4 groups of 4 words, where all the words in a group have something in common. That common thread can be meaning or use in the same idioms, but it can also be something visual, e.g. *superheroes, judges, boxers and vampires all wear capes*. Each group is harder than the previous one; in order of increasing difficulty the groups are yellow, green, blue and purple. The goal of this project is to build one instance of such a game using techniques learned in the course *Računalno jezikoslovlje* (Computational Linguistics), together with some exploration of my own into what the `nltk` and `gensim` libraries offer, particularly regarding the technique of training the `Word2Vec` model.

## An example *Connections* game

The images below show one instance of the original *Connections*, together with its correct solutions. The instance shown is the puzzle from 9 July 2025.

![Connections](connections.png)

![Connections solutions](connections_rjesenja.png)

First, let's install the packages and download the data used throughout the project, and set up a few values we will need along the way.

In [1]:
%pip install nltk


In [2]:
%pip install gensim


In [3]:
%pip install pandas


In [4]:
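import nltk
from nltk.corpus import stopwords
import gensim.downloader as api

# English stop words, used later to filter the candidate seed words
nltk.download('stopwords')
stop_words = set(stopwords.words('english'))

# pretrained 300-dimensional Word2Vec vectors (Google News); a large download on first use
model = api.load("word2vec-google-news-300")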

[nltk_data] Downloading package stopwords to C:\Users\Zvonimir
[nltk_data]     Šego\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


Now let's build the set of words from which we will pick the central (seed) words.

In [5]:
import random
from nltk.corpus import brown

sentences = brown.sents()[:10000]

filtered_words = [
    word.lower() for sent in sentences for word in sent if word.isalnum() and word.lower() not in stop_words
]
random.shuffle(filtered_words)
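
To make the filter above concrete, here is a minimal sketch that applies the same list comprehension to two toy sentences. The names `toy_sentences` and `toy_stop_words` are hypothetical stand-ins for `brown.sents()` and the NLTK stop-word list, so the snippet runs without any downloads.

```python
# Toy stand-ins (hypothetical) for brown.sents() and the NLTK stop-word list.
toy_sentences = [
    ["The", "Fulton", "County", "Grand", "Jury", "said", "Friday", "."],
    ["It", "was", "a", "well-run", "election", "in", "1961", "!"],
]
toy_stop_words = {"the", "it", "was", "a", "in"}

# Same filter as in the cell above: lowercase, alphanumeric only, no stop words.
candidates = [
    word.lower()
    for sent in toy_sentences
    for word in sent
    if word.isalnum() and word.lower() not in toy_stop_words
]
print(candidates)
```

Two things are worth noticing in this sketch: `str.isalnum()` rejects punctuation and hyphenated tokens such as `well-run`, but it lets pure numbers such as `1961` through, so numeric tokens can end up among the candidate seed words.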

In [6]:
model.similarity("alice", "wonderland")

0.3020006

In [7]:
# vec = model['rome'] - model['italy'] + model['france']
# model.most_similar(vec, topn = 5)
model.most_similar(positive=["rome", "france"], negative=["italy"], topn = 5)

[('paris', 0.47690874338150024),
 ('charles', 0.474092036485672),
 ('albert', 0.46750128269195557),
 ('alors', 0.4648790955543518),
 ('georgia', 0.4646168351173401)]
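
The two cells above rely on cosine similarity and on adding and subtracting word vectors. The following is only a sketch of that idea on hypothetical 2-dimensional toy vectors; the values in `toy` and the helper names `cosine` and `analogy` are made up for illustration. It uses raw vector sums and ignores any normalization or weighting the library may apply internally.

```python
import math

def cosine(u, v):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical embeddings: axis 0 ~ "is a capital city", axis 1 ~ "which country".
toy = {
    "rome":    [1.0, 1.0],
    "italy":   [0.0, 1.0],
    "paris":   [1.0, -1.0],
    "france":  [0.0, -1.0],
    "berlin":  [1.0, 0.2],
    "germany": [0.0, 0.2],
}

def analogy(positive, negative, vectors):
    """Return the word closest to sum(positive) - sum(negative), skipping the inputs."""
    dim = len(next(iter(vectors.values())))
    query = [0.0] * dim
    for word in positive:
        query = [q + x for q, x in zip(query, vectors[word])]
    for word in negative:
        query = [q - x for q, x in zip(query, vectors[word])]
    excluded = set(positive) | set(negative)
    return max(
        (w for w in vectors if w not in excluded),
        key=lambda w: cosine(query, vectors[w]),
    )

print(analogy(["rome", "france"], ["italy"], toy))
```

In this toy space `rome - italy + france` lands exactly on the vector for `paris`, which is why it wins; in a real 300-dimensional model the result is only approximately there, so the scores are well below 1.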

## Main program

### Helper functions

Let's first define the helper functions we will use while generating one instance of the *Connections* game.

In [8]:
# Ratio of the longest common subsequence (letters shared in the same order) to the
# length of the longer word. Used to reject near-duplicates such as typos.
def LCS_similarity(x : str, y : str):
    n = len(x)
    m = len(y)
    # table[i][j] = LCS length of x[:i] and y[:j]; row 0 and column 0 stay 0
    table = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if x[i-1] == y[j-1]:
                table[i][j] = table[i-1][j-1] + 1
            else:
                table[i][j] = max(table[i-1][j], table[i][j-1])
    return table[n][m] / max(n, m)

In [9]:
LCS_similarity('saturday', 'sunday')
# the longest common subsequence is "suday" (5 letters), so the score is 5 / max(8, 6) = 5/8

0.625
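
To show how such a score can be used to drop near-duplicate candidates, here is a self-contained sketch. The helpers `lcs_ratio` and `drop_near_duplicates` are hypothetical names introduced only for this example; `lcs_ratio` computes the same ratio as `LCS_similarity` but keeps only two rows of the table, and it assumes non-empty words. The 0.75 threshold matches the one used later in the main program.

```python
def lcs_ratio(x, y):
    """LCS length divided by the longer word's length (assumes non-empty words)."""
    prev = [0] * (len(y) + 1)          # LCS lengths for the previous prefix of x
    for cx in x:
        cur = [0]                      # column 0 is always 0
        for j, cy in enumerate(y, start=1):
            if cx == cy:
                cur.append(prev[j - 1] + 1)
            else:
                cur.append(max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1] / max(len(x), len(y))

def drop_near_duplicates(candidates, threshold=0.75):
    """Keep a word only if it is not too similar to any word kept before it."""
    kept = []
    for word in candidates:
        if all(lcs_ratio(word, k) < threshold for k in kept):
            kept.append(word)
    return kept

print(lcs_ratio("saturday", "sunday"))
print(drop_near_duplicates(["farm", "farms", "farmer", "agriculture"]))
```

Here `farms` is dropped (4/5 = 0.8 against `farm`), while `farmer` survives (4/6 is below 0.75), which illustrates that a purely letter-based threshold catches plurals and typos but not every related form.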

In [10]:
import pandas as pd

# an auxiliary function which generates a similarity matrix
def make_sim_matrix(words):
    return [[model.similarity(w1, w2) for w2 in words] for w1 in words]

# the function which displays the similarity matrix
def display_similarity_matrix(words):
    similarity_matrix = make_sim_matrix(words)
    df = pd.DataFrame(similarity_matrix, index=words, columns=words)
    print(df.round(2).to_string())

### Main program

Let's now move on to the main program.

In [11]:
# Generates one Connections instance: four clusters of four words, easiest first.

from nltk.stem import WordNetLemmatizer
from nltk.corpus import wordnet
from nltk import pos_tag

lemmatizer = WordNetLemmatizer()

# an auxiliary function which maps universal POS tags to WordNet tags (default: noun)
def get_wordnet_pos(tag):
    mapping = {'ADJ': wordnet.ADJ, 'VERB': wordnet.VERB, 'NOUN': wordnet.NOUN, 'ADV': wordnet.ADV}
    return mapping.get(tag, wordnet.NOUN)

# One entry per cluster: (minimum similarity of the seed's nearest neighbour,
# exclusive upper bound on an accepted neighbour's similarity, or None for no bound).
# The earlier four-branch version also tested 0.6 <= sim < 0.8 and 0.5 <= sim < 0.6
# for the first word of clusters 2 and 3, but those tests were OR-ed with a weaker
# condition and never rejected a word, so they are omitted here.
TIERS = [(0.8, None), (0.6, None), (0.5, None), (0.0, 0.5)]

# Tries to build a cluster of 4 words from the neighbours of a seed word.
# Returns None if the neighbours run out (this used to raise an IndexError on "j").
def build_cluster(most_sim, max_sim, used_words):
    cluster = []
    for candidate, sim in most_sim:
        word, pos = pos_tag([candidate.lower()], tagset='universal')[0]
        word = lemmatizer.lemmatize(word, get_wordnet_pos(pos))
        # skip near-duplicates (typos, inflected forms) of words already in the cluster
        if any(LCS_similarity(word, w) >= 0.75 for w in cluster):
            continue
        if max_sim is not None and sim >= max_sim:
            continue
        if word.isalnum() and word not in used_words:
            cluster.append(word)
            if len(cluster) == 4:
                return cluster
    return None

clusters = []
words = []
solutions = {}

# the main program which generates one Connections instance
while len(clusters) < 4:
    main_word = random.choice(filtered_words).lower()
    if main_word not in model:
        continue
    most_sim = model.most_similar(positive=[main_word], topn = 500)
    min_top_sim, max_sim = TIERS[len(clusters)]
    # the seed must be close enough to its nearest neighbour for the current tier
    if most_sim[0][1] < min_top_sim:
        continue
    cluster = build_cluster(most_sim, max_sim, words)
    if cluster is None:
        continue
    clusters.append(cluster)
    words.extend(cluster)
    solutions[main_word] = cluster

for clus in clusters:
    clus.sort()

random.shuffle(words)

word_table = [[words[4*i+j] for j in range(4)] for i in range(4)]
df = pd.DataFrame(word_table)
df.columns = [''] * df.shape[1]
df.index = [''] * df.shape[0]
print(df.to_string())

                                                 
  communion           me  sacrament     ourselves
         ag  agriculture       true      actually
         we         kind   definite  agribusiness
  eucharist         them       farm       liturgy


In [12]:
# How to play:
# Write your guess in the input field as 4 words, separated by spaces.
# Extra spaces are ignored.
# If you type in 'hint', the similarity matrix of the remaining words will show up.

import time

attempts_left = 4
solved = False
guessed = 0
hints_used = 0

# working copy of the board; correctly guessed words are removed from it
wordss = list(words)

for key, value in solutions.items():
    value.sort()

while attempts_left > 0 and not solved:
    wordss_table = [[wordss[4*i+j] for j in range(4)] for i in range(len(wordss) // 4)]
    wt = pd.DataFrame(wordss_table)
    wt.columns = [''] * wt.shape[1]
    wt.index = [''] * wt.shape[0]
    print(wt.to_string())
    print(f"Attempts left: {attempts_left}")
    # split() without arguments drops leading, trailing and repeated whitespace
    attempt = sorted(input().lower().split())
    if attempt == ['hint']:
        display_similarity_matrix(wordss)
        hints_used += 1
        continue
    time.sleep(1)
    # a guess only counts if it is a cluster whose words are still on the board
    if attempt in clusters and all(word in wordss for word in attempt):
        guessed += 1
        print("Correct! You've guessed a cluster!")
        for key, value in solutions.items():
            if value == attempt:
                print(f"{key} : {value}")
        for word in attempt:
            wordss.remove(word)
        time.sleep(1)
        random.shuffle(wordss)
    else:
        print("Wrong! Try again!")
        attempts_left -= 1
        time.sleep(1)
    if len(wordss) == 0 or guessed == 4:
        solved = True
    print()
print("GAME OVER!")
if solved:
    print("Congrats! You're so good at this!")
else:
    print("Better luck next time!")
print()
# reveal all clusters together with the seed word each one was built from
for key, cluster in solutions.items():
    for word in cluster:
        print(word, end=" ")
    print(f"- {key}")
print(f"Hints used: {hints_used}")

                                                 
  communion           me  sacrament     ourselves
         ag  agriculture       true      actually
         we         kind   definite  agribusiness
  eucharist         them       farm       liturgy
Attempts left: 4
Correct! You've guessed a cluster!
communion : ['communion', 'eucharist', 'liturgy', 'sacrament']

                                       
  agribusiness  true     actually  farm
          kind    ag    ourselves  them
      definite    me  agriculture    we
Attempts left: 4
Correct! You've guessed a cluster!
agricultural : ['ag', 'agribusiness', 'agriculture', 'farm']

                                 
        we  kind   actually    me
  definite  them  ourselves  true
Attempts left: 4
             we  kind  actually    me  definite  them  ourselves  true
we         1.00  0.45      0.45  0.45      0.26  0.46       0.66  0.28
kind       0.45  1.00      0.50  0.47      0.33  0.31       0.35  0.34
actually   0.45  0.50      1