Word embedding, or word vectorization, is an NLP technique that maps words or phrases from a vocabulary to
vectors of real numbers. These vectors are used for tasks such as word prediction and measuring word
similarity or semantic relatedness. The process of converting words into numbers is called vectorization.

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

import warnings
warnings.filterwarnings('ignore')
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

In [2]:
df = pd.read_csv('data.csv')

In [3]:
df.head()

Unnamed: 0,test,class
0,I love Bangladesh,1
1,Could you give me an iphone?,0
2,Hello how are you?,1
3,I want to talk you.,1


# Count Vectorizer

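CountVectorizer is a tool provided by the scikit-learn library in Python. It transforms a collection of texts
into a matrix of token counts: each row is a document, each column is a word in the learned vocabulary, and
each entry is the number of times that word occurs in that document. By default it lowercases the text and
ignores single-character tokens such as "I".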
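To make the counting step concrete, here is a minimal pure-Python sketch of the same idea. It is not scikit-learn's implementation; the helper names `tokenize` and `count_vector` are hypothetical, and the tokenization rule (lowercase, tokens of two or more word characters) is chosen to mirror CountVectorizer's defaults.

```python
import re
from collections import Counter

# Toy corpus copied from the DataFrame shown above.
docs = [
    "I love Bangladesh",
    "Could you give me an iphone?",
    "Hello how are you?",
    "I want to talk you.",
]

def tokenize(text):
    # Hypothetical helper: lowercase the text and keep tokens made of
    # two or more word characters, so "I" and punctuation are dropped.
    return re.findall(r"\b\w\w+\b", text.lower())

# Vocabulary: the sorted unique tokens; a token's position is its column index.
vocabulary = sorted({tok for doc in docs for tok in tokenize(doc)})

def count_vector(text):
    # Hypothetical helper: one count per vocabulary term, in column order.
    counts = Counter(tokenize(text))
    return [counts[term] for term in vocabulary]

matrix = [count_vector(doc) for doc in docs]

print(vocabulary)
for row in matrix:
    print(row)
```

Each row of `matrix` corresponds to one sentence and each column to one vocabulary term, which is the same layout that `x.toarray()` produces in the cells that follow.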

In [4]:
cv = CountVectorizer()

In [5]:
x = cv.fit_transform(df['test'])

In [6]:
x

<4x14 sparse matrix of type '<class 'numpy.int64'>'
	with 16 stored elements in Compressed Sparse Row format>

In [7]:
x.toarray()

array([[0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0],
       [1, 0, 0, 1, 1, 0, 0, 1, 0, 1, 0, 0, 0, 1],
       [0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1]], dtype=int64)

In [9]:
df3 = df.copy()

In [10]:
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out().
df2 = pd.DataFrame(x.toarray(), index=df['test'], columns=cv.get_feature_names_out())

In [None]:
df2.head()

In [None]:
columns = cv.get_feature_names_out()

In [None]:
columns

# TF-IDF

In [None]:
idf = TfidfVectorizer()

In [None]:
x = idf.fit_transform(df3['test'])

In [None]:
x.toarray()

In [None]:
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out().
df4 = pd.DataFrame(x.toarray(), index=df['test'], columns=idf.get_feature_names_out())
df4
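The weights in this table can be reproduced by hand. The sketch below is a pure-Python approximation, assuming TfidfVectorizer's default settings: raw term counts, smoothed idf computed as `ln((1 + n) / (1 + df)) + 1`, and L2 normalization of each row. The helper names `tokenize` and `tfidf_vector` are hypothetical and the corpus is retyped from the DataFrame above.

```python
import math
import re
from collections import Counter

# Toy corpus retyped from the DataFrame used in this notebook.
docs = [
    "I love Bangladesh",
    "Could you give me an iphone?",
    "Hello how are you?",
    "I want to talk you.",
]

def tokenize(text):
    # Hypothetical helper: lowercase, keep tokens of two or more word characters.
    return re.findall(r"\b\w\w+\b", text.lower())

tokenized = [tokenize(doc) for doc in docs]
vocabulary = sorted({tok for toks in tokenized for tok in toks})
n_docs = len(docs)

# Document frequency: the number of documents that contain each term.
doc_freq = Counter(tok for toks in tokenized for tok in set(toks))

# Smoothed inverse document frequency (assumed default formula).
idf = {t: math.log((1 + n_docs) / (1 + doc_freq[t])) + 1 for t in vocabulary}

def tfidf_vector(tokens):
    # Hypothetical helper: term count times idf, then scale the row to unit length.
    counts = Counter(tokens)
    raw = [counts[t] * idf[t] for t in vocabulary]
    norm = math.sqrt(sum(v * v for v in raw))
    return [v / norm if norm else 0.0 for v in raw]

tfidf = [tfidf_vector(toks) for toks in tokenized]

for term in ("you", "love"):
    print(term, round(idf[term], 3))
```

A term that appears in several sentences, such as "you", receives a lower idf than a term that appears in only one, such as "love", so common words contribute less to each row.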

# Word2Vec

In [None]:
!pip install gensim

In [None]:
from gensim.models import Word2Vec, KeyedVectors

In [None]:
import nltk

In [None]:
nltk.download('punkt')

In [None]:
nltk.download('wordnet')

In [None]:
df = pd.read_csv('./DataSet/data.csv')

In [None]:
df.head()

In [None]:
from nltk.tokenize import word_tokenize

In [None]:
# Tokenize each sentence with the word_tokenize function imported above.
text_vsc = [word_tokenize(text) for text in df['test']]

In [None]:
text_vsc

In [None]:
model = Word2Vec(text_vsc, min_count=1)

In [None]:
model

In [None]:
model.wv.most_similar('Hello')
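`most_similar` ranks the other words in the vocabulary by cosine similarity to the query word's vector. The sketch below illustrates that ranking in plain Python. The three-dimensional vectors are made up for illustration and are not output from the model above, and the helper names are hypothetical. With a corpus of only four sentences, the neighbours returned by a trained model are unlikely to be meaningful.

```python
import math

def cosine_similarity(u, v):
    # Cosine of the angle between two vectors: dot product over the product of norms.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical 3-dimensional embeddings, invented for this example only.
toy_vectors = {
    "hello": [0.9, 0.1, 0.0],
    "hi": [0.8, 0.2, 0.1],
    "iphone": [0.0, 0.1, 0.9],
}

def most_similar(word, vectors, topn=2):
    # Hypothetical helper: score every other word against the query and sort descending.
    query = vectors[word]
    scores = [
        (other, cosine_similarity(query, vec))
        for other, vec in vectors.items()
        if other != word
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]

ranked = most_similar("hello", toy_vectors)
print(ranked)
```

In these toy vectors "hi" points in nearly the same direction as "hello", so it ranks first, while "iphone" is almost orthogonal and ranks last.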