Using co-occurrence to calculate word vectors means representing each word by the contexts in which it appears in a corpus of text. The approach is based on the distributional hypothesis, which suggests that words occurring in similar contexts tend to have similar meanings. Here is a simplified, step-by-step explanation of how to use co-occurrence to calculate word vectors:

Choose Your Corpus: The first step is to select a corpus of text. The choice of corpus is crucial as it determines the context in which words are used. The corpus should be large and diverse enough to capture a wide range of word contexts.

In [1]:
docs = [
    "Zhangsan likes computer!",
    "Lisi loves food.",
    "food is his thing?",
    "Wangwu loves food;"
]

Preprocess the Corpus: Remove punctuation, convert all words to lowercase, and tokenize each string into words.

In [2]:
punctuations = ".,;:!?"

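def remove_punctuations(docs):
    # Build a new list so the original docs are left unmodified
    cleaned = []
    for doc in docs:
        for punctuation in punctuations:
            doc = doc.replace(punctuation, "")
        cleaned.append(doc)
    return cleaned

def tokenize(docs):
    docs_tokenized = []
    for doc in docs:
        # split() with no argument also handles repeated whitespace
        docs_tokenized.append(doc.lower().split())
    return docs_tokenized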

docs_repunc = remove_punctuations(docs)

docs_tokenized = tokenize(docs_repunc)

docs_tokenized

[['zhangsan', 'likes', 'computer'],
 ['lisi', 'loves', 'food'],
 ['food', 'is', 'his', 'thing'],
 ['wangwu', 'loves', 'food']]

Define the Context of a Word: Decide on the size of the context window, that is, the number of words before and after a target word that count as its context. A smaller window tends to capture more syntactic relationships, while a larger window tends to capture more semantic ones. In this example the whole document serves as the context, so the cell below only builds the vocabulary that maps each word to a column index.

In [3]:
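# Context here is the whole document, so only a word-to-index mapping is needed.

def build_corpus(docs_tokenized):
    # Collect the unique words (the vocabulary) across all documents
    words_set = set()
    for doc in docs_tokenized:
        for word in doc:
            words_set.add(word)
    # Assign each word a column index; set order is arbitrary, so the
    # indices can differ between runs
    res = {}
    for i, word in enumerate(words_set):
        res[word] = i
    return res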

corpus = build_corpus(docs_tokenized)
corpus

{'thing': 0,
 'loves': 1,
 'is': 2,
 'wangwu': 3,
 'likes': 4,
 'computer': 5,
 'food': 6,
 'his': 7,
 'zhangsan': 8,
 'lisi': 9}
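
The cell above treats a whole document as the context. For comparison, a window-based context could be sketched as follows. The helper name `context_windows` and its `window` parameter are illustrative assumptions and are not part of the original notebook.

```python
def context_windows(tokens, window=1):
    """Yield (target, context_words) for each position in a token list."""
    for i, target in enumerate(tokens):
        left = tokens[max(0, i - window):i]    # up to `window` words before
        right = tokens[i + 1:i + 1 + window]   # up to `window` words after
        yield target, left + right

sentence = ["food", "is", "his", "thing"]
for target, context in context_windows(sentence, window=1):
    print(target, context)
```

With `window=1`, the word "is" gets the context `['food', 'his']`; a larger window would also pull in "thing".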

Create the Co-occurrence Matrix: Construct a matrix where each row represents a target word and each column represents a context word (or vice versa). The value at position (i, j) is the frequency with which word i occurs in the context of word j. This frequency can be a raw count, or it can be weighted to reduce the impact of very common words or to emphasize closer word pairs within the context window. Because the context in this example is the whole document, the cell below builds a document-by-word matrix instead: each row is a document, each column is a word, and each value is the word's term frequency (its count divided by the document length).

In [7]:
import numpy as np

def get_tf(docs, corpus):
    res = []
    for doc in docs:
        freq = [0] * len(corpus)
        for word in doc:
            freq[corpus[word]] += 1
        total_count = sum(freq)
        for i in range(len(freq)):
            freq[i] /= total_count
        res.append(freq)
    return res

ans = get_tf(docs_tokenized, corpus)
print(np.array(ans))  # np.array is preferred over the legacy np.matrix

[[0.         0.         0.         0.         0.33333333 0.33333333
  0.         0.         0.33333333 0.        ]
 [0.         0.33333333 0.         0.         0.         0.
  0.33333333 0.         0.         0.33333333]
 [0.25       0.         0.25       0.         0.         0.
  0.25       0.25       0.         0.        ]
 [0.         0.33333333 0.         0.33333333 0.         0.
  0.33333333 0.         0.         0.        ]]
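
The matrix above has one row per document. A word-by-word co-occurrence matrix, as described in this step, could be sketched as below. This is a minimal illustration under stated assumptions: the function name `cooccurrence_matrix`, the symmetric window, and the alphabetical vocabulary order are choices made for this sketch, not part of the original notebook.

```python
def cooccurrence_matrix(docs_tokenized, window=1):
    """Count how often each pair of words appears within `window` positions."""
    vocab = sorted({word for doc in docs_tokenized for word in doc})
    index = {word: i for i, word in enumerate(vocab)}
    matrix = [[0] * len(vocab) for _ in vocab]
    for doc in docs_tokenized:
        for i, target in enumerate(doc):
            start = max(0, i - window)
            stop = min(len(doc), i + window + 1)
            for j in range(start, stop):
                if j != i:  # a word is not its own context
                    matrix[index[target]][index[doc[j]]] += 1
    return vocab, matrix

example_docs = [
    ["lisi", "loves", "food"],
    ["wangwu", "loves", "food"],
]
vocab, matrix = cooccurrence_matrix(example_docs, window=1)
for word, row in zip(vocab, matrix):
    print(word, row)
```

Because the window is symmetric, the resulting matrix is symmetric too: the count for ("loves", "food") equals the count for ("food", "loves").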


Transform the Co-occurrence Matrix: Raw co-occurrence frequencies are often transformed to improve the quality of the resulting vectors. Techniques such as Positive Pointwise Mutual Information (PPMI) or TF-IDF (Term Frequency-Inverse Document Frequency) adjust the values in the matrix to better capture meaningful associations between words. The cell below applies TF-IDF with a smoothed IDF, 1 + ln((N + 1) / (1 + df)), where N is the number of documents and df is the number of documents containing the word, and then scales each document vector to unit length.

In [11]:
from collections import Counter
import numpy as np

def normalize(vector):
    norm_constant = np.sqrt(sum(num ** 2 for num in vector))
    for i in range(len(vector)):
        vector[i] /= norm_constant
    return vector

def count(docs):
    # Document frequency: count each word at most once per document
    counter = Counter()
    for doc in docs:
        counter.update(set(doc))
    return counter

def get_tfidf(docs, corpus):
    tf = get_tf(docs, corpus)
    idf = get_idf(docs, corpus)
    for i, doc in enumerate(tf):
        for j in range(len(doc)):
            doc[j] *= idf[j]
        tf[i] = normalize(doc)
    return tf

def get_idf(docs, corpus):
    n_docs = len(docs)
    counter = count(docs)
    res = [0] * len(corpus)
    for word, i in corpus.items():
        res[i] = 1 + np.log((n_docs + 1) / (1 + counter[word]))
    return res

tf_idf = get_tfidf(docs_tokenized, corpus)
for doc in tf_idf:
    print(doc)

[0.0, 0.0, 0.0, 0.0, 0.5773502691896257, 0.5773502691896257, 0.0, 0.0, 0.5773502691896257, 0.0]
[0.0, 0.5534923152870045, 0.0, 0.0, 0.0, 0.0, 0.4480997313625987, 0.0, 0.0, 0.7020348194149619]
[0.5417361046803605, 0.0, 0.5417361046803605, 0.0, 0.0, 0.0, 0.3457831381910465, 0.5417361046803605, 0.0, 0.0]
[0.0, 0.5534923152870044, 0.0, 0.7020348194149618, 0.0, 0.0, 0.4480997313625986, 0.0, 0.0, 0.0]
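
The cell above implements TF-IDF. The other transformation mentioned in this step, PPMI, could be sketched as follows for a word-by-word count matrix. The function name `ppmi`, the base-2 logarithm, and the small `counts` matrix are assumptions made for illustration only.

```python
import math

def ppmi(counts):
    """Convert a matrix of co-occurrence counts to PPMI values."""
    total = sum(sum(row) for row in counts)
    row_sums = [sum(row) for row in counts]
    col_sums = [sum(col) for col in zip(*counts)]
    result = []
    for i, row in enumerate(counts):
        out_row = []
        for j, count in enumerate(row):
            if count == 0:
                out_row.append(0.0)  # unseen pairs get 0 instead of -infinity
                continue
            # PMI = log2( P(w, c) / (P(w) * P(c)) )
            pmi = math.log2(count * total / (row_sums[i] * col_sums[j]))
            out_row.append(max(0.0, pmi))  # clip negative values to 0
        result.append(out_row)
    return result

# Illustrative counts for three words (made-up values)
counts = [
    [0, 2, 1],
    [2, 0, 0],
    [1, 0, 0],
]
for row in ppmi(counts):
    print(row)
```

PPMI rewards pairs that co-occur more often than their individual frequencies would predict, and it sets pairs that co-occur less often than expected to zero.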


Reduce Dimensionality: The co-occurrence matrix can be very large and sparse, especially for a large corpus with a vast vocabulary. Dimensionality reduction techniques, such as Singular Value Decomposition (SVD), Principal Component Analysis (PCA), or non-negative matrix factorization (NMF), can be applied to the matrix to produce a more compact, dense representation of word vectors. This step reduces the size of the vector space while preserving as much of the significant structural information as possible.

In [12]:
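def svd(co_occurrence_matrix):
    # Rows of the input are documents here, so each output row is a document vector
    U, sigma, Vt = np.linalg.svd(co_occurrence_matrix, full_matrices=False)
    n_components = 2  # Number of components to keep
    # Scale the leading left singular vectors by their singular values
    reduced_matrix_svd = U[:, :n_components] * sigma[:n_components]
    return reduced_matrix_svd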

In [14]:
svd(tf_idf)

array([[ 0.        ,  1.        ],
       [-0.83528054,  0.        ],
       [-0.43968363,  0.        ],
       [-0.83528054,  0.        ]])
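
The result above has one row per document, because the input matrix is document-by-word. To obtain one vector per word from the same decomposition, one option is to use the right singular vectors. The sketch below assumes a small hand-written matrix; `word_vectors_from_svd` and `cosine_similarity` are hypothetical helper names, not functions from the original notebook.

```python
import numpy as np

def word_vectors_from_svd(doc_word_matrix, n_components=2):
    """Return one n_components-dimensional vector per column (word)."""
    matrix = np.asarray(doc_word_matrix, dtype=float)
    U, sigma, Vt = np.linalg.svd(matrix, full_matrices=False)
    # Rows of Vt.T correspond to words; scale each axis by its singular value
    return Vt[:n_components].T * sigma[:n_components]

def cosine_similarity(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy document-by-word matrix: 3 documents, 4 words (illustrative values)
toy = [
    [1.0, 1.0, 0.0, 0.0],
    [1.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 1.0],
]
vectors = word_vectors_from_svd(toy, n_components=2)
print(vectors.shape)  # (4, 2)
print(cosine_similarity(vectors[0], vectors[1]))  # words 0 and 1 share documents
print(cosine_similarity(vectors[0], vectors[2]))  # words 0 and 2 never co-occur
```

Words that appear in the same documents end up pointing in the same direction, while words that never share a document end up orthogonal.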

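A simpler implementation uses TfidfVectorizer from the scikit-learn toolkit. It orders the columns alphabetically by word, so the values match the manual result above but appear in different columns.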

In [18]:
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer()
sklearn_tfidf = vectorizer.fit_transform(docs).toarray()

for doc in sklearn_tfidf:
    print(doc)

[0.57735027 0.         0.         0.         0.57735027 0.
 0.         0.         0.         0.57735027]
[0.         0.44809973 0.         0.         0.         0.70203482
 0.55349232 0.         0.         0.        ]
[0.         0.34578314 0.5417361  0.5417361  0.         0.
 0.         0.5417361  0.         0.        ]
[0.         0.44809973 0.         0.         0.         0.
 0.55349232 0.         0.70203482 0.        ]
