### Text Preprocessing
Preprocessing involves cleaning and preparing text data for analysis. Common techniques include:
- **Tokenization**: Splitting text into words or sentences.
- **Stopword Removal**: Removing common words (e.g., "and", "the") that may not add value.
- **Stemming**: Reducing words to their base or root form (e.g., "running" to "run").
- **Lemmatization**: Similar to stemming but considers the context of the word (e.g., "better" to "good").

![Text Processing](https://miro.medium.com/v2/resize:fit:1400/format:webp/1*pzjECYWP8WOWhwfCjebZVw.png)
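The difference between stemming and lemmatization can be illustrated with a toy sketch. This is not the Porter algorithm or WordNet: the suffix rules (`SUFFIXES`) and the lookup table (`LEMMA_TABLE`) are hypothetical stand-ins, chosen only to show that a stemmer applies string rules while a lemmatizer consults a dictionary.

```python
# Hypothetical suffix rules; the Porter algorithm uses many more, applied in ordered phases.
SUFFIXES = ("ing", "ed", "s")

def crude_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            stem = word[: -len(suffix)]
            # Undo a doubled final consonant, e.g. "runn" -> "run".
            if len(stem) >= 2 and stem[-1] == stem[-2] and stem[-1] not in "aeiou":
                stem = stem[:-1]
            return stem
    return word

# Hypothetical lookup table standing in for a real dictionary such as WordNet.
LEMMA_TABLE = {"better": "good", "running": "run", "cats": "cat", "mice": "mouse"}

def lookup_lemma(word):
    """Return the dictionary form if known, otherwise the word unchanged."""
    return LEMMA_TABLE.get(word, word)

for w in ["running", "cats", "better", "mice"]:
    print(w, "| stem:", crude_stem(w), "| lemma:", lookup_lemma(w))
```

Rule-based stripping handles regular forms such as "running" and "cats", but only a dictionary lookup can map irregular forms such as "better" or "mice" to their base words.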

In the code below:
1. We use NLTK (Natural Language Toolkit) for tokenization, stopword removal, stemming, and lemmatization.
2. Tokenization splits the text into individual words.
3. Stopwords are filtered out to keep only meaningful words.
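4. Stemming and lemmatization both reduce words to a root form, though they operate differently: the Porter stemmer strips suffixes by rule, which can produce non-words such as "lazi", while the WordNet lemmatizer looks up dictionary forms and treats every word as a noun unless a part-of-speech tag is passed, which is why "running" and "jumping" are left unchanged in the output.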


In [1]:
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer

# Download necessary NLTK resources
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')

text = "The running cats are jumping over the lazy dog."

# Tokenization
tokens = word_tokenize(text)

# Stopword removal
stop_words = set(stopwords.words('english'))
filtered_tokens = [word for word in tokens if word.lower() not in stop_words]

# Stemming
stemmer = PorterStemmer()
stemmed_tokens = [stemmer.stem(word) for word in filtered_tokens]

# Lemmatization
lemmatizer = WordNetLemmatizer()
lemmatized_tokens = [lemmatizer.lemmatize(word) for word in filtered_tokens]

print("Tokens:", tokens)
print("Filtered Tokens:", filtered_tokens)
print("Stemmed Tokens:", stemmed_tokens)
print("Lemmatized Tokens:", lemmatized_tokens)


[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\rajka\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\rajka\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\rajka\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


Tokens: ['The', 'running', 'cats', 'are', 'jumping', 'over', 'the', 'lazy', 'dog', '.']
Filtered Tokens: ['running', 'cats', 'jumping', 'lazy', 'dog', '.']
Stemmed Tokens: ['run', 'cat', 'jump', 'lazi', 'dog', '.']
Lemmatized Tokens: ['running', 'cat', 'jumping', 'lazy', 'dog', '.']
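To make the tokenization and stopword steps concrete without any NLTK downloads, here is a minimal standard-library sketch. The regular expression and the tiny `STOP_WORDS` set are assumptions made for illustration; NLTK's tokenizer and English stopword list are considerably more thorough.

```python
import re

# Hypothetical, deliberately tiny stopword list; NLTK's English list is much longer.
STOP_WORDS = {"the", "are", "over", "and", "a", "an", "is"}

def tokenize(text):
    """Split text into word tokens and standalone punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)

def remove_stopwords(tokens):
    """Drop tokens whose lowercase form is in the stopword list."""
    return [t for t in tokens if t.lower() not in STOP_WORDS]

text = "The running cats are jumping over the lazy dog."
tokens = tokenize(text)
filtered = remove_stopwords(tokens)

print("Tokens:", tokens)
print("Filtered Tokens:", filtered)
```

For this sentence the sketch yields the same tokens and filtered tokens as the NLTK output above. Note that the comparison uses `t.lower()`, so the capitalized "The" is removed as well.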


### Bag of Words (BoW)
The Bag of Words model represents each document as an unordered collection (a "bag") of its words, ignoring grammar and word order. Each document is transformed into a vector of word counts over a vocabulary shared by all documents.

![BOW](https://alasheep.com/assets/static/bow_representation.6acf7b4.1e3b4dbd9fb1beacb8309a733db4f517.jpeg)

In the code below:
1. We use `CountVectorizer` from `sklearn` to convert text data into a BoW representation.
2. Each document is transformed into a vector, where each element counts the occurrences of a specific word.
3. The resulting array shows the word counts for each document.


In [4]:
from sklearn.feature_extraction.text import CountVectorizer

documents = ["Cats are great pets.", "Dogs are great pets too."]
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(documents)

# Convert to array for better visualization
boW_array = X.toarray()
print("BoW Array:\n", boW_array)
print("Feature Names:", vectorizer.get_feature_names_out())


BoW Array:
 [[1 1 0 1 1 0]
 [1 0 1 1 1 1]]
Feature Names: ['are' 'cats' 'dogs' 'great' 'pets' 'too']


### Interpretation of Bag of Words (BoW) Result

- **BoW Array**: Each row corresponds to a document, and each column represents the count of a word in that document.

  - **Document 1** (`"Cats are great pets."`):
    - The words "are", "great", "pets", and "cats" each appear once, so they have counts of `1` in this row.
    - The words "dogs" and "too" do not appear in Document 1, so their counts are `0`.

  - **Document 2** (`"Dogs are great pets too."`):
    - The words "dogs", "are", "great", "pets", and "too" each appear once, so they have counts of `1`.
    - The word "cats" does not appear in Document 2, so its count is `0`.

- **Feature Names**: These are the unique words (vocabulary) extracted from both documents. Each word aligns with a column in the BoW array, allowing us to interpret the counts in each document.

This BoW representation is useful for basic text analysis, but it does not capture word importance across documents (like TF-IDF does) or semantic relationships between words.
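The counting behind the BoW array can be sketched with the standard library alone. This is an illustrative re-implementation, not scikit-learn's code. It assumes the `CountVectorizer` defaults: lowercasing, and tokens made of two or more word characters, which is why punctuation disappears.

```python
import re
from collections import Counter

documents = ["Cats are great pets.", "Dogs are great pets too."]

def tokenize(doc):
    # Lowercase and keep tokens of two or more word characters,
    # mirroring CountVectorizer's default token pattern.
    return re.findall(r"\b\w\w+\b", doc.lower())

tokenized = [tokenize(doc) for doc in documents]

# The vocabulary is the sorted set of all tokens; each term becomes one column.
vocabulary = sorted({token for tokens in tokenized for token in tokens})

def count_vector(tokens):
    """Count how often each vocabulary term occurs in one document."""
    counts = Counter(tokens)
    return [counts[term] for term in vocabulary]

bow = [count_vector(tokens) for tokens in tokenized]

print("Feature Names:", vocabulary)
print("BoW Array:", bow)
```

The resulting vocabulary and counts match the `CountVectorizer` output shown above, with one row per document and one column per vocabulary term.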


### TF-IDF (Term Frequency-Inverse Document Frequency)
TF-IDF improves upon the BoW model by considering the importance of words in the context of the entire corpus.

![TF-IDF](https://www.romainberg.com/wp-content/uploads/TF_IDF-final-980x381.png)

In the code below:
1. We use `TfidfVectorizer` from `sklearn` to create a TF-IDF matrix for the documents.
2. Term Frequency (TF) measures how often a word appears in a document, while Inverse Document Frequency (IDF) reduces the weight of words that appear in many documents.
3. The resulting matrix provides a more informative representation of word importance across the documents.

![formula](https://ptime.s3.ap-northeast-1.amazonaws.com/media/natural_language_processing/text_feature_Engineering/tf-idf-formula.PNG)

In [8]:
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf_vectorizer = TfidfVectorizer()
tfidf_matrix = tfidf_vectorizer.fit_transform(documents)

# Convert to array for visualization
tfidf_array = tfidf_matrix.toarray()
print("TF-IDF Array:\n", tfidf_array)
print("Feature Names:", tfidf_vectorizer.get_feature_names_out())


TF-IDF Array:
 [[0.44832087 0.63009934 0.         0.44832087 0.44832087 0.        ]
 [0.37930349 0.         0.53309782 0.37930349 0.37930349 0.53309782]]
Feature Names: ['are' 'cats' 'dogs' 'great' 'pets' 'too']


### Interpretation of TF-IDF Result

- **TF-IDF Array**: Each row in the TF-IDF array corresponds to a document, and each column represents the TF-IDF score of a specific word in that document. These scores indicate the relative importance of each word in each document based on how often it appears in this document compared to the whole document set.

- **Document 1** (`"Cats are great pets."`):
  - "cats" receives the highest score (0.630) because it appears only in this document. "are", "great", and "pets" appear in both documents, so they share a lower score (0.448).

- **Document 2** (`"Dogs are great pets too."`):
  - "dogs" and "too" receive the highest scores (0.533) because they appear only in this document, while "are", "great", and "pets" again share a lower score (0.379). The scores are lower overall than in Document 1 because each row is normalized to unit length and Document 2 contains more words.

- **Feature Names**: These are the unique vocabulary terms identified from the documents. They align with the columns in the TF-IDF array, helping us match each score in the array to the corresponding word.
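The scores above can be reproduced by hand. The sketch below assumes the `TfidfVectorizer` defaults: raw term counts as TF, the smoothed IDF `ln((1 + n) / (1 + df)) + 1`, and L2 normalization of each row. It is a standard-library illustration of that formula rather than scikit-learn's implementation.

```python
import math
import re
from collections import Counter

documents = ["Cats are great pets.", "Dogs are great pets too."]

def tokenize(doc):
    # Same assumption as CountVectorizer: lowercase, tokens of 2+ word characters.
    return re.findall(r"\b\w\w+\b", doc.lower())

tokenized = [tokenize(doc) for doc in documents]
vocabulary = sorted({token for tokens in tokenized for token in tokens})
n_docs = len(documents)

# Document frequency: the number of documents that contain each term.
df = Counter(term for tokens in tokenized for term in set(tokens))

# Smoothed inverse document frequency.
idf = {term: math.log((1 + n_docs) / (1 + df[term])) + 1 for term in vocabulary}

def tfidf_vector(tokens):
    """Weight term counts by IDF, then scale the row to unit (L2) length."""
    counts = Counter(tokens)
    weights = [counts[term] * idf[term] for term in vocabulary]
    norm = math.sqrt(sum(w * w for w in weights))
    return [w / norm for w in weights]

tfidf = [tfidf_vector(tokens) for tokens in tokenized]

for row in tfidf:
    print([round(value, 8) for value in row])
```

Terms found in both documents get an IDF of exactly 1, while terms found in only one get `ln(1.5) + 1`, roughly 1.405. That difference, followed by the row normalization, accounts for every value in the array.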


### Word Embeddings
Word embeddings like Word2Vec or GloVe convert words into dense vectors, capturing semantic relationships.

In the code below:
1. We use the `Word2Vec` model from the `gensim` library to create embeddings for each word.
2. The model learns the vector representation based on word co-occurrences within a defined window.
3. The output shows the vector for a sample word, representing its learned semantic meaning in multi-dimensional space.



In [12]:
from gensim.models import Word2Vec

# Sample sentences for training
sentences = [["cats", "are", "great"], ["dogs", "are", "also", "great"]]
model = Word2Vec(sentences, vector_size=10, window=2, min_count=1, workers=4)

# Get word vectors
cat_vector = model.wv['cats']
print("Word Vector for 'cats':", cat_vector)


Word Vector for 'cats': [-0.0960355   0.05007293 -0.08759586 -0.04391825 -0.000351   -0.00296181
 -0.0766124   0.09614743  0.04982058  0.09233143]


### Interpretation of Word2Vec Result

- **Word Vector for 'cats'**: The output is a 10-dimensional vector that represents "cats" in the semantic space learned by the Word2Vec model.
  - Individual elements are not interpretable on their own; the vector as a whole encodes the contexts in which the word appeared in the training sentences.
  
- **Semantic Similarity**: Similar words (words that appear in similar contexts) will have similar vector representations.
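  - For example, "cats" and "dogs" might have similar vectors since they share the context of "are" and "great" in the training sentences. With only two short training sentences, however, the model has far too little data to learn reliable relationships, so a much larger corpus is needed before similarities become meaningful.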

- **Use of Word Vectors**: Word vectors generated by Word2Vec are useful for capturing semantic relationships, allowing words with similar meanings or contexts to be compared or clustered together.
  
This is beneficial for various NLP tasks, such as identifying synonyms, grouping related words, or analyzing word similarity in a specific context.
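Similarity between word vectors is commonly measured with cosine similarity. The sketch below implements it with the standard library. The three vectors are hypothetical values invented for illustration, not output from the Word2Vec model above.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors (1.0 = same direction)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical 3-dimensional vectors, invented for illustration only.
cat = [0.9, 0.1, 0.3]
dog = [0.8, 0.2, 0.35]
car = [-0.2, 0.9, -0.4]

print("cat vs dog:", cosine_similarity(cat, dog))
print("cat vs car:", cosine_similarity(cat, car))
```

With a trained gensim model, the same comparison is available as `model.wv.similarity('cats', 'dogs')`.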


### Named Entity Recognition (NER)
NER is used to identify and classify named entities in text (e.g., people, organizations).

In the code below:
1. We use the `spaCy` library to detect named entities in a sentence.
2. Entities like organizations, locations, and monetary values are extracted and labeled.
3. This technique is helpful for extracting structured information from unstructured text.


In [24]:
import spacy as spc

# Load the English NLP model
nlp = spc.load("en_core_web_sm")
doc = nlp("Apple is looking at buying U.K. startup for $1 billion")

# Identify named entities
for ent in doc.ents:
    print(ent.text, ent.label_)


AttributeError: module 'spacy' has no attribute 'load'
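The `AttributeError` above does not come from the code itself, since `spacy.load` exists in a working spaCy installation. A likely cause, stated here as an assumption about the environment, is that Python imported something else named `spacy`, such as a local `spacy.py` file or a `spacy` folder in the working directory, or that the installation is incomplete. A small standard-library check can show where a module name resolves to; `locate_module` is a helper name introduced here for illustration.

```python
import importlib.util

def locate_module(name):
    """Return the file Python would import `name` from, or None if it cannot be found."""
    spec = importlib.util.find_spec(name)
    if spec is None:
        return None
    # For a folder without __init__.py (a namespace package), origin is None as well.
    return spec.origin

print("spacy resolves to:", locate_module("spacy"))
```

If the printed path points into the project folder rather than `site-packages`, renaming the shadowing file or folder should let the original cell run. If it prints `None`, spaCy is either not installed or is being shadowed by a plain folder. Once spaCy imports correctly, the model can be installed with `python -m spacy download en_core_web_sm`.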