# WORD EMBEDDING MODELS:

# BOW (Bag Of Words):
-- the Bag of Words (BoW) model, a fundamental approach in natural language processing. BoW simplifies text by counting word occurrences, disregarding their order
![image.png](attachment:a8cadf7d-70f5-4ba0-a1d3-a284a2094b68.png)

# TFIDF: 
-- It measures how frequently a term appears in a document. The assumption is that the more frequently a word appears, the more important it is within that specific document.
![image.png](attachment:1137bf6b-4951-43a0-a7a2-5322eef66035.png)


In [1]:
# Import Libraries

import nltk
import re # deals with regex
from nltk.corpus import stopwords # will remove the regular commonly used words

from nltk.stem import WordNetLemmatizer


In [2]:
paragraph =  """Inception (2010), directed by Christopher Nolan, follows Dom Cobb (Leonardo DiCaprio), a skilled thief who specializes in extracting secrets from people's subconscious during dreams. 
Cobb is offered a chance to have his criminal record erased if he can perform an "inception"—the act of planting an idea in someone's mind without them realizing it.
To do so, he assembles a team to navigate multiple layers of dreams, each more complex than the last. As the mission unfolds, Cobb must confront his own past and the blurred boundaries between dream and reality. 
The film explores themes of memory, guilt, and the nature of reality, leaving viewers questioning what is real until the very end."""


In [3]:
paragraph

'Inception (2010), directed by Christopher Nolan, follows Dom Cobb (Leonardo DiCaprio), a skilled thief who specializes in extracting secrets from people\'s subconscious during dreams. \nCobb is offered a chance to have his criminal record erased if he can perform an "inception"—the act of planting an idea in someone\'s mind without them realizing it.\nTo do so, he assembles a team to navigate multiple layers of dreams, each more complex than the last. As the mission unfolds, Cobb must confront his own past and the blurred boundaries between dream and reality. \nThe film explores themes of memory, guilt, and the nature of reality, leaving viewers questioning what is real until the very end.'

In [5]:
# text clean process
# create a portstem to get root of tokens

# create lemma for proper wording from sentences
lmt = WordNetLemmatizer()
# convert paragraph into tokens
sentence = nltk.sent_tokenize(paragraph)
print(sentence)

["Inception (2010), directed by Christopher Nolan, follows Dom Cobb (Leonardo DiCaprio), a skilled thief who specializes in extracting secrets from people's subconscious during dreams.", 'Cobb is offered a chance to have his criminal record erased if he can perform an "inception"—the act of planting an idea in someone\'s mind without them realizing it.', 'To do so, he assembles a team to navigate multiple layers of dreams, each more complex than the last.', 'As the mission unfolds, Cobb must confront his own past and the blurred boundaries between dream and reality.', 'The film explores themes of memory, guilt, and the nature of reality, leaving viewers questioning what is real until the very end.']


In [6]:
# create a empty list store the clean data
corpus = []


# try to apply any regex in documnet remove using re library
for i in range(len(sentence)):
    review = re.sub('[^a-zA-Z]','',sentence[i]) # removes non-alphabatical characters from document and replaces with space
    review = review.lower() # convert the context into lower other wise it will create count of vector for same token in Uppercase
    review = review.split() # splits the sentences to individual token
    
    # using lemmatization
    review = [lmt.lemmatize(word) for word in review if not word in set(stopwords.words('english'))]   
    review = ' '.join(review)
    
    # add the cleaned data into list
    corpus.append(review)


    

In [7]:
from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer()
X_bow = cv.fit_transform(corpus).toarray()
 



In [8]:
from sklearn.feature_extraction.text import TfidfVectorizer
tf = TfidfVectorizer()
X_tf = tf.fit_transform(corpus).toarray()

In [9]:
print("\n Bag of words Matrix:")
print(X_bow)


 Bag of words Matrix:
[[0 0 1 0 0]
 [0 1 0 0 0]
 [0 0 0 0 1]
 [1 0 0 0 0]
 [0 0 0 1 0]]


In [10]:
print("\n TFIDF of words Matrix:")
print(X_tf)


 TFIDF of words Matrix:
[[0. 0. 1. 0. 0.]
 [0. 1. 0. 0. 0.]
 [0. 0. 0. 0. 1.]
 [1. 0. 0. 0. 0.]
 [0. 0. 0. 1. 0.]]
