We use CountVectorizer from scikit-learn for sparse vector representations and the Word2Vec model from the gensim library for dense vector representations.

Sparse Vector Space Model
A sparse vector space model represents each text as a high-dimensional vector in which every dimension corresponds to one word in the vocabulary. Because a single document uses only a small part of the vocabulary, most entries of the vector are zero. We'll use CountVectorizer from sklearn to create a sparse representation.
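
To make the idea concrete, here is a minimal pure-Python sketch of a bag-of-words count matrix. It only approximates CountVectorizer's default behaviour (lowercasing, keeping tokens of two or more word characters); the helper names `tokenize` and `bag_of_words` and the two-sentence corpus are made up for illustration and are not part of sklearn.

```python
import re
from collections import Counter

def tokenize(doc):
    # Lowercase and keep tokens of 2+ word characters, which approximates
    # CountVectorizer's default token pattern (an assumption of this sketch)
    return re.findall(r"\b\w\w+\b", doc.lower())

def bag_of_words(corpus):
    # Vocabulary: every distinct token, sorted alphabetically
    vocab = sorted({tok for doc in corpus for tok in tokenize(doc)})
    rows = []
    for doc in corpus:
        counts = Counter(tokenize(doc))
        # One column per vocabulary word; absent words count as 0
        rows.append([counts[word] for word in vocab])
    return vocab, rows

corpus = [
    "I love machine learning",
    "Machine learning is a field of artificial intelligence",
]
vocab, rows = bag_of_words(corpus)
print(vocab)
for row in rows:
    print(row)

# Count the non-zero entries; as the vocabulary grows, this fraction shrinks,
# which is why such matrices are usually stored in a sparse format
total = len(rows) * len(vocab)
nonzero = sum(1 for row in rows for value in row if value != 0)
print(f"{nonzero} of {total} entries are non-zero")
```

Single-character tokens such as "I" and "a" are dropped by this pattern, so they do not appear as columns.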

Dense Vector Space Model
A dense vector space model represents words in a lower-dimensional continuous vector space, where words that occur in similar contexts have similar vectors. We'll use Word2Vec from gensim to create a dense representation.
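
Similarity between dense vectors is commonly measured with cosine similarity. The sketch below computes it by hand for three made-up 3-dimensional vectors; the words and values are invented for illustration and do not come from any trained model.

```python
import math

def cosine(u, v):
    # cos(u, v) = (u . v) / (|u| * |v|)
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical embeddings, chosen by hand for this example
embeddings = {
    "cat": [0.9, 0.1, 0.0],
    "dog": [0.8, 0.2, 0.1],
    "car": [0.0, 0.1, 0.9],
}

sim_cat_dog = cosine(embeddings["cat"], embeddings["dog"])
sim_cat_car = cosine(embeddings["cat"], embeddings["car"])
print(f"cat vs dog: {sim_cat_dog:.3f}")
print(f"cat vs car: {sim_cat_car:.3f}")
```

Vectors that point in nearly the same direction score close to 1, while nearly orthogonal vectors score close to 0.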


In [11]:
from sklearn.feature_extraction.text import CountVectorizer
import pandas as pd
import numpy as np

# Sample corpus
corpus = [
    "I love machine learning, it should be taught widely",
    "Natural language processing is fun but not as fun as I think it should be",
    "Machine learning is a field of artificial intelligence"
]

# Initialize CountVectorizer
vectorizer = CountVectorizer()

# Fit and transform the corpus
X_sparse = vectorizer.fit_transform(corpus)

# Convert to DataFrame for better visualization
df_sparse = pd.DataFrame(X_sparse.toarray(), columns=vectorizer.get_feature_names_out())
print("Bag of Words Representation (Sparse Vector Space Model):")
print(df_sparse)

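# One-Hot Encoding
# Each document becomes a binary presence/absence vector over the vocabulary.
import re

def one_hot_encode(corpus, vocab):
    # Map each vocabulary word to its column index for fast lookup
    vocab_index = {word: i for i, word in enumerate(vocab)}
    one_hot_vectors = []
    for doc in corpus:
        vector = np.zeros(len(vocab_index))
        # Lowercase and tokenize like CountVectorizer's default pattern, so that
        # "Machine" and "learning," match the vocabulary entries
        for word in re.findall(r"\b\w\w+\b", doc.lower()):
            if word in vocab_index:
                vector[vocab_index[word]] = 1
        one_hot_vectors.append(vector)
    return np.array(one_hot_vectors)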

# Create vocabulary
vocab = vectorizer.get_feature_names_out()

# One-hot encode the corpus
one_hot_vectors = one_hot_encode(corpus, vocab)

# Convert to DataFrame for better visualization
df_one_hot = pd.DataFrame(one_hot_vectors, columns=vocab)
print("\nOne-Hot Encoding Representation:")
print(df_one_hot)


Bag of Words Representation (Sparse Vector Space Model):
   artificial  as  be  but  field  fun  intelligence  is  it  language  ...  \
0           0   0   1    0      0    0             0   0   1         0  ...   
1           0   2   1    1      0    2             0   1   1         1  ...   
2           1   0   0    0      1    0             1   1   0         0  ...   

   love  machine  natural  not  of  processing  should  taught  think  widely  
0     1        1        0    0   0           0       1       1      0       1  
1     0        0        1    1   0           1       1       0      1       0  
2     0        1        0    0   1           0       0       0      0       0  

[3 rows x 21 columns]

One-Hot Encoding Representation:
   artificial   as   be  but  field  fun  intelligence   is   it  language  \
0         0.0  0.0  1.0  0.0    0.0  0.0           0.0  0.0  1.0       0.0   
1         0.0  1.0  1.0  1.0    0.0  1.0           0.0  1.0  1.0       1.0   
2         1.0  

In [9]:
from gensim.models import Word2Vec
from gensim.utils import simple_preprocess
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np
import pandas as pd

# Sample corpus
corpus = [
    "I love machine learning",
    "Natural language processing is fun",
    "Machine learning is a field of artificial intelligence"
]

# Step 1: Preprocess the corpus
corpus_processed = [simple_preprocess(doc) for doc in corpus]
print("Preprocessed Corpus:")
print(corpus_processed)

# Step 2: Initialize and train Word2Vec model
model = Word2Vec(sentences=corpus_processed, vector_size=10, window=5, min_count=1, workers=4)
print("\nWord2Vec Model Trained.")

# Step 3: Retrieve the dense vector for a specific word
word = 'machine'
vector_dense = model.wv[word]
print(f"\nDense Vector for the word '{word}':")
print(vector_dense)

# Step 4: Retrieve vectors for all words in the vocabulary
vocabulary = model.wv.key_to_index
vectors_dense = {word: model.wv[word] for word in vocabulary}
print("\nDense Vectors for the entire vocabulary:")
for word, vector in vectors_dense.items():
    print(f"{word}: {vector}")

# Step 5: Compute cosine similarity between two words
word1 = 'machine'
word2 = 'learning'
vector1 = model.wv[word1].reshape(1, -1)
vector2 = model.wv[word2].reshape(1, -1)

cos_sim = cosine_similarity(vector1, vector2)
print(f"\nCosine Similarity between '{word1}' and '{word2}': {cos_sim[0][0]}")

# Step 6: Compute cosine similarity matrix for the entire vocabulary
vocab_vectors = np.array([model.wv[word] for word in vocabulary])
cos_sim_matrix = cosine_similarity(vocab_vectors)

# Convert to DataFrame for better visualization
df_cos_sim = pd.DataFrame(cos_sim_matrix, index=list(vocabulary), columns=list(vocabulary))
print("\nCosine Similarity Matrix for the entire vocabulary:")
print(df_cos_sim)


Preprocessed Corpus:
[['love', 'machine', 'learning'], ['natural', 'language', 'processing', 'is', 'fun'], ['machine', 'learning', 'is', 'field', 'of', 'artificial', 'intelligence']]

Word2Vec Model Trained.

Dense Vector for the word 'machine':
[ 0.07311766  0.05070262  0.06757693  0.00762866  0.06350891 -0.03405366
 -0.00946401  0.05768573 -0.07521638 -0.03936104]

Dense Vectors for the entire vocabulary:
is: [-0.00536227  0.00236431  0.0510335   0.09009273 -0.0930295  -0.07116809
  0.06458873  0.08972988 -0.05015428 -0.03763372]
learning: [ 0.07381326 -0.01533646 -0.04537105  0.0655477  -0.04860705 -0.01816211
  0.02876889  0.00991992 -0.08286119 -0.09449868]
machine: [ 0.07311766  0.05070262  0.06757693  0.00762866  0.06350891 -0.03405366
 -0.00946401  0.05768573 -0.07521638 -0.03936104]
intelligence: [-0.07511582 -0.00930042  0.09538119 -0.07319167 -0.02333769 -0.01937741
  0.08077437 -0.05930896  0.00045162 -0.04753734]
artificial: [-0.0960355   0.05007293 -0.08759586 -0.04391825