In [1]:
import warnings

# Disable future warnings
warnings.simplefilter(action='ignore', category=FutureWarning)

**Word tokenization is the process of breaking a text or sentence into individual words or tokens**

- It splits a sentence into its constituent words, punctuation marks, and other meaningful units.
- Punctuation is separated from the word it is attached to and kept as its own token, so the result is a flat list of tokens.
- Tokenization is an essential first step in natural language processing because later steps analyze text at the token level.
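
To see what tokenization adds over plain whitespace splitting, here is a minimal sketch that uses only Python's standard `re` module. The helper name `simple_tokenize` and its regular expression are illustrative assumptions. This is not how NLTK's `word_tokenize` works internally, and it ignores many cases (contractions, abbreviations, and so on).

```python
import re

def simple_tokenize(text):
    """Hypothetical helper: split text into word and punctuation tokens."""
    # \w+ matches a run of word characters; [^\w\s] matches one punctuation mark
    return re.findall(r"\w+|[^\w\s]", text)

text = "Hello! How are you doing today?"

# Whitespace splitting leaves punctuation glued to the neighboring word
whitespace_tokens = text.split()
print(whitespace_tokens)

# The regex version emits each punctuation mark as a separate token
regex_tokens = simple_tokenize(text)
print(regex_tokens)
```

With whitespace splitting, `Hello!` and `today?` stay as single tokens, while the regex version separates the punctuation from the words.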

In [2]:
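import nltk
from nltk.tokenize import word_tokenize

# word_tokenize relies on NLTK's Punkt tokenizer data. If it is missing, run
# nltk.download('punkt') once (recent NLTK releases ask for 'punkt_tab' instead).

def tokenize_text(text):
    """Split text into a list of word and punctuation tokens."""
    return word_tokenize(text)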

# Example usage
text = "Hello! How are you doing today?"
tokens = tokenize_text(text)
print(tokens)

['Hello', '!', 'How', 'are', 'you', 'doing', 'today', '?']


### Stemming is the process of reducing words to their base or root form, known as the stem
- It strips affixes, mainly suffixes, from words while keeping the stem intact.
1. Stemming simplifies words by reducing them to a core form and disregarding grammatical variations. The resulting stem is not always a dictionary word.
2. It allows different word forms derived from the same root to be treated as a single entity, improving information retrieval and text analysis.
3. Stemming is helpful in natural language processing tasks such as search, text classification, and information retrieval.
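
The idea of suffix stripping can be sketched in a few lines of plain Python. The rule list `SUFFIXES` and the function `toy_stem` below are hypothetical and deliberately crude. Established stemmers such as the Porter stemmer (available as `nltk.stem.PorterStemmer`) apply far more elaborate rules, so treat this only as an illustration of the principle.

```python
# Toy suffix-stripping stemmer (an illustrative assumption, not the Porter algorithm)
SUFFIXES = ("ing", "ly", "ed", "es", "s")

def toy_stem(word, min_stem_len=3):
    """Strip the first matching suffix, keeping at least min_stem_len characters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_len:
            stem = word[: -len(suffix)]
            # Collapse a doubled final consonant, e.g. "runn" -> "run"
            if len(stem) >= 2 and stem[-1] == stem[-2] and stem[-1] not in "aeiou":
                stem = stem[:-1]
            return stem
    return word

words = ["running", "jumps", "jumping", "easily"]
stems = [toy_stem(word) for word in words]
for original, stem in zip(words, stems):
    print(f"Original: {original} \t Stem: {stem}")
```

Here "jumps" and "jumping" collapse to the same stem, while "easily" becomes "easi", which shows that a stem need not be a real word.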

In [4]:
from nltk.stem import WordNetLemmatizer

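# Initialize the lemmatizer. Note that lemmatization is related to, but distinct
# from, stemming: it maps each word to a dictionary form (lemma) using WordNet,
# which requires the 'wordnet' corpus (nltk.download('wordnet')).
lemmatizer = WordNetLemmatizer()

# Lemmatization example words
words = ["running", "jumps", "jumping", "running", "easily"]

# lemmatize() treats each word as a noun unless a part of speech is passed,
# so verb forms such as "running" are returned unchanged here
lemmatized_words = [lemmatizer.lemmatize(word) for word in words]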

# Print the lemmatized words
for original, lemmatized in zip(words, lemmatized_words):
    print(f"Original: {original} \t Lemmatized: {lemmatized}")

Original: running 	 Lemmatized: running
Original: jumps 	 Lemmatized: jump
Original: jumping 	 Lemmatized: jumping
Original: running 	 Lemmatized: running
Original: easily 	 Lemmatized: easily


In [5]:
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

# Download the stopwords corpus (if not already downloaded)
nltk.download('stopwords')

# Set the language for stopwords
stopwords_language = 'english'

# Get the list of stopwords for the specified language
stopwords_list = stopwords.words(stopwords_language)

# Example sentence
sentence = "This is an example sentence with some stopwords."

# Tokenize the sentence into words
words = word_tokenize(sentence)

# Remove stopwords from the words
filtered_words = [word for word in words if word.lower() not in stopwords_list]

# Join the filtered words back into a sentence
filtered_sentence = ' '.join(filtered_words)

# Print the filtered sentence
print(filtered_sentence)

example sentence stopwords .


[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Admin\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [6]:
from sklearn.feature_extraction.text import CountVectorizer

# Example documents
documents = [
    "This is the first document.",
    "This document is the second document.",
    "And this is the third one.",
    "Is this the first document?"
]

# Create an instance of CountVectorizer
vectorizer = CountVectorizer()

# Fit the vectorizer to the documents and transform the documents into a matrix
X = vectorizer.fit_transform(documents)

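# Get the feature names (unique words); get_feature_names() was removed in
# scikit-learn 1.2 in favor of get_feature_names_out()
feature_names = vectorizer.get_feature_names_out().tolist()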

# Print the feature names
print("Feature names:")
print(feature_names)

# Print the bag-of-words representation for each document
for i in range(len(documents)):
    document_vector = X[i]
    print(f"\nDocument {i+1} - Bag-of-Words representation:")
    # indices holds the vocabulary positions of the nonzero counts stored in data
    for feature_index, count in zip(document_vector.indices, document_vector.data):
        print(f"Word: {feature_names[feature_index]}, Count: {count}")

Feature names:
['and', 'document', 'first', 'is', 'one', 'second', 'the', 'third', 'this']

Document 1 - Bag-of-Words representation:
Word: this, Count: 1
Word: is, Count: 1
Word: the, Count: 1
Word: first, Count: 1
Word: document, Count: 1

Document 2 - Bag-of-Words representation:
Word: this, Count: 1
Word: is, Count: 1
Word: the, Count: 1
Word: document, Count: 2
Word: second, Count: 1

Document 3 - Bag-of-Words representation:
Word: this, Count: 1
Word: is, Count: 1
Word: the, Count: 1
Word: and, Count: 1
Word: third, Count: 1
Word: one, Count: 1

Document 4 - Bag-of-Words representation:
Word: this, Count: 1
Word: is, Count: 1
Word: the, Count: 1
Word: first, Count: 1
Word: document, Count: 1


In [7]:
from textblob import TextBlob

def analyze_sentiment(text):
    blob = TextBlob(text)
    sentiment = blob.sentiment.polarity
    if sentiment > 0:
        return "Positive"
    elif sentiment < 0:
        return "Negative"
    else:
        return "Neutral"

# Example usage
text = "I really enjoyed the movie. It was fantastic!"
sentiment = analyze_sentiment(text)
print(sentiment)

Positive


#### TF-IDF (Term Frequency-Inverse Document Frequency) is a numerical weight that reflects the importance of a word in a document relative to a collection of documents.
#### It considers both the frequency of a word in a document (term frequency) and the rarity of the word across the document collection (inverse document frequency).
1. TF-IDF assigns higher weights to words that appear frequently in a document but in few documents of the collection, thus capturing their significance.
2. Words that occur in most documents, such as "the" or "is", receive low weights because they do not help distinguish one document from another.

- Combining term frequency and inverse document frequency highlights the terms that are most distinctive for each individual document.
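
To make the weighting concrete, the sketch below computes TF-IDF by hand with the standard library only, using the four example documents of this section. It assumes the smoothed formula `idf(t) = ln((1 + n) / (1 + df(t))) + 1` followed by L2 normalization, which to my understanding matches the defaults of scikit-learn's `TfidfVectorizer`. The helper names `tokenize` and `tfidf` are illustrative.

```python
import math
import re
from collections import Counter

documents = [
    "I enjoy playing tennis.",
    "I love to watch movies.",
    "Tennis is a popular sport.",
    "Movies are a great source of entertainment.",
]

def tokenize(text):
    # Lowercase and keep tokens of two or more word characters (so "I" is dropped)
    return re.findall(r"\b\w\w+\b", text.lower())

tokenized = [tokenize(doc) for doc in documents]
n_docs = len(tokenized)

# Document frequency: the number of documents that contain each term
doc_freq = Counter(term for tokens in tokenized for term in set(tokens))

def tfidf(tokens):
    """Return L2-normalized TF-IDF weights for one tokenized document."""
    counts = Counter(tokens)
    # Raw term count times smoothed inverse document frequency
    weights = {
        term: count * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
        for term, count in counts.items()
    }
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {term: w / norm for term, w in weights.items()}

scores = tfidf(tokenized[0])
for term in sorted(scores):
    print(f"{term}: {scores[term]:.3f}")
```

For the first document, "tennis" receives a lower weight than "enjoy" and "playing" because it also appears in the third document, which lowers its inverse document frequency.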

In [8]:
from sklearn.feature_extraction.text import TfidfVectorizer

# Collection of documents
documents = [
    "I enjoy playing tennis.",
    "I love to watch movies.",
    "Tennis is a popular sport.",
    "Movies are a great source of entertainment."
]

# Create TF-IDF vectorizer
vectorizer = TfidfVectorizer()

# Compute TF-IDF scores
tfidf_scores = vectorizer.fit_transform(documents)

# Get feature names (terms); get_feature_names() was removed in
# scikit-learn 1.2 in favor of get_feature_names_out()
feature_names = vectorizer.get_feature_names_out()

# Print TF-IDF scores for each document
for i, doc in enumerate(documents):
    print(f"Document {i+1}:")
    for j, term in enumerate(feature_names):
        score = tfidf_scores[i, j]
        if score > 0:
            print(f"{term}: {score:.3f}")
    print()

Document 1:
enjoy: 0.618
playing: 0.618
tennis: 0.487

Document 2:
love: 0.525
movies: 0.414
to: 0.525
watch: 0.525

Document 3:
is: 0.525
popular: 0.525
sport: 0.525
tennis: 0.414

Document 4:
are: 0.422
entertainment: 0.422
great: 0.422
movies: 0.333
of: 0.422
source: 0.422



#### Word2Vec is a popular word embedding technique used in natural language processing to represent words as dense, low-dimensional vectors.
#### It captures semantic and syntactic relationships between words based on the contexts in which they appear in a large text corpus.
1. Word2Vec takes a large text corpus as input and learns to represent each word as a numerical vector in a continuous vector space.

2. The resulting word vectors capture the meanings of words and the relationships between them, enabling computations such as word similarity, analogy, and clustering.

**By representing words as vectors, Word2Vec allows algorithms to perform mathematical operations on words and extract meaningful insights from text data. It has proven useful in various NLP tasks, including language modeling, sentiment analysis, and information retrieval.**
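
Word similarity between two vectors is commonly measured with cosine similarity, which is also what gensim's `most_similar` ranks by. The sketch below shows the computation in plain Python. The three-dimensional vectors in `toy_vectors` are made up for illustration and are not trained embeddings.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Made-up 3-dimensional vectors for illustration only (not trained embeddings)
toy_vectors = {
    "tennis": [0.9, 0.1, 0.0],
    "sport": [0.8, 0.2, 0.1],
    "movies": [0.1, 0.9, 0.2],
}

sim_sport = cosine_similarity(toy_vectors["tennis"], toy_vectors["sport"])
sim_movies = cosine_similarity(toy_vectors["tennis"], toy_vectors["movies"])
print(f"tennis vs sport:  {sim_sport:.3f}")
print(f"tennis vs movies: {sim_movies:.3f}")
```

Vectors that point in nearly the same direction score close to 1, while unrelated directions score close to 0, so "sport" ranks above "movies" for these toy values.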

In [9]:
from gensim.models import Word2Vec
sentences = [
    ['I', 'enjoy', 'playing', 'tennis'],
    ['I', 'love', 'to', 'watch', 'movies'],
    ['Tennis', 'is', 'a', 'popular', 'sport'],
    ['Movies', 'are', 'a', 'great', 'source', 'of', 'entertainment']
]

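# Train Word2Vec model (min_count=1 keeps words that occur only once).
# Note: a four-sentence corpus is far too small to learn meaningful vectors;
# this cell only demonstrates the API.
model = Word2Vec(sentences, min_count=1)

# Get the word vector for a specific word. Tokens are case-sensitive here,
# so 'tennis' and 'Tennis' are separate vocabulary entries.
word = 'tennis'
vector = model.wv[word]

# Find similar words to a given word
similar_words = model.wv.most_similar(word)

print(f"Word Vector for '{word}':\n{vector}")
print(f"\nSimilar words to '{word}':")
# Use a separate loop variable so the query word is not overwritten
for similar_word, similarity in similar_words:
    print(f"{similar_word}: {similarity}")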

Word Vector for 'tennis':
[-8.7274835e-03  2.1301603e-03 -8.7354420e-04 -9.3190884e-03
 -9.4281435e-03 -1.4107180e-03  4.4324086e-03  3.7040710e-03
 -6.4986944e-03 -6.8730689e-03 -4.9994136e-03 -2.2868442e-03
 -7.2502876e-03 -9.6033188e-03 -2.7436304e-03 -8.3628418e-03
 -6.0388758e-03 -5.6709289e-03 -2.3441387e-03 -1.7069983e-03
 -8.9569995e-03 -7.3519943e-04  8.1525063e-03  7.6904297e-03
 -7.2061159e-03 -3.6668323e-03  3.1185509e-03 -9.5707225e-03
  1.4764380e-03  6.5244650e-03  5.7464195e-03 -8.7630628e-03
 -4.5171450e-03 -8.1401607e-03  4.5955181e-05  9.2636319e-03
  5.9733056e-03  5.0673080e-03  5.0610616e-03 -3.2429171e-03
  9.5521836e-03 -7.3564244e-03 -7.2703888e-03 -2.2653891e-03
 -7.7856064e-04 -3.2161046e-03 -5.9258699e-04  7.4888230e-03
 -6.9751980e-04 -1.6249418e-03  2.7443981e-03 -8.3591007e-03
  7.8558037e-03  8.5361032e-03 -9.5840879e-03  2.4462652e-03
  9.9049713e-03 -7.6658037e-03 -6.9669201e-03 -7.7365185e-03
  8.3959224e-03 -6.8133592e-04  9.1444086e-03 -8.1582209e-0

#### The Bag-of-Words (BoW) model is a simple representation of text in natural language processing.
#### It treats a document as an unordered collection or "bag" of words, ignoring grammar and word order, and focuses on how often each word occurs in the document.

1. The Bag-of-Words model involves creating a vocabulary of unique words present in a given corpus of documents.
2. Each document is then represented by a numerical vector, where the elements correspond to the counts or frequencies of the words from the vocabulary in that particular document.
3. The resulting vector representation can be used for various purposes, such as text classification, sentiment analysis, and information retrieval.

- The Bag-of-Words model simplifies text by considering only word frequencies and discarding the order and structure of the words. While it loses contextual information, it remains widely used for text-based tasks due to its simplicity and effectiveness.
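
The vocabulary and count-vector steps can be written out by hand with the standard library. This is a minimal sketch: the helper names `tokenize` and `to_vector` are illustrative, and the tokenization rule (lowercase, tokens of two or more word characters) is an assumption chosen to resemble the default behavior of `CountVectorizer`.

```python
import re
from collections import Counter

documents = [
    "This is the first document.",
    "This document is the second document.",
    "And this is the third one.",
    "Is this the first document?",
]

def tokenize(text):
    # Lowercase and keep tokens of two or more word characters
    return re.findall(r"\b\w\w+\b", text.lower())

tokenized = [tokenize(doc) for doc in documents]

# Step 1: build a sorted vocabulary of the unique words in the corpus
vocabulary = sorted({term for tokens in tokenized for term in tokens})

# Step 2: represent each document as a vector of word counts
def to_vector(tokens):
    counts = Counter(tokens)
    return [counts[term] for term in vocabulary]

vectors = [to_vector(tokens) for tokens in tokenized]

print(vocabulary)
for vector in vectors:
    print(vector)
```

The first and fourth documents contain the same words in a different order, so they map to identical vectors, which illustrates how the model discards word order.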