# Natural Language Processing

# 1. Text Preprocessing

# 💻 ✂️ strip (1/2)

In [2]:
texts = [
    '   Bonjour, comment ca va ?     ',
    '    Heyyyyy, how are you doing ?   ',
    '        Hallo, wie gehts ?     '
]
texts

['   Bonjour, comment ca va ?     ',
 '    Heyyyyy, how are you doing ?   ',
 '        Hallo, wie gehts ?     ']

In [3]:
[text.strip() for text in texts]

['Bonjour, comment ca va ?',
 'Heyyyyy, how are you doing ?',
 'Hallo, wie gehts ?']

# 💻 ✂️ strip (2/2)

In [4]:
text = "abcd Who is abcd ? That's not a real name!!! abcd"
text

"abcd Who is abcd ? That's not a real name!!! abcd"

Here, `strip()` is given an argument: the set of characters to remove from both ends of the string. It keeps removing 'b', 'd', 'a', or 'c' (in any order) from the beginning and from the end until it meets a character outside that set, which is why the "abcd" in the middle of the sentence is left untouched.

In [5]:
text.strip('bdac')

" Who is abcd ? That's not a real name!!! "

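As a small sketch of this behaviour (the variable names `left_only` and `right_only` are only illustrative), `lstrip()` and `rstrip()` accept the same kind of character set but trim a single side:

```python
text = "abcd Who is abcd ? That's not a real name!!! abcd"

# The argument is a set of characters, so its order does not matter
same_result = text.strip('bdac') == text.strip('abcd')

# lstrip() trims only the left side, rstrip() only the right side
left_only = text.lstrip('abcd')
right_only = text.rstrip('abcd')

print(same_result)
print(repr(left_only))
print(repr(right_only))
```
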
# 💻 👥 replace

In [6]:
text = "I love koalas, koalas are the cutest animals on Earth."
text

'I love koalas, koalas are the cutest animals on Earth.'

In [7]:
text.replace("koala", "panda")

'I love pandas, pandas are the cutest animals on Earth.'

# 💻 🪚 split

In [8]:
text = "linkin park / metallica /red hot chili peppers"

In [9]:
text.split("/")

['linkin park ', ' metallica ', 'red hot chili peppers']
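
The pieces above still carry the spaces that surrounded each "/". A minimal sketch that combines `split` with `strip` to clean them (the name `bands` is only illustrative):

```python
text = "linkin park / metallica /red hot chili peppers"

# Split on "/" and trim the whitespace left around each piece
bands = [band.strip() for band in text.split("/")]
print(bands)
```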

# 💻 🔡 Lowercase

In [10]:
text = "i LOVE football sO mUch. FOOTBALL is my passion. Who else loves fOOtBaLL ?"
text

'i LOVE football sO mUch. FOOTBALL is my passion. Who else loves fOOtBaLL ?'

In [11]:
text.lower() 

'i love football so much. football is my passion. who else loves football ?'

In [12]:
text.upper() 

'I LOVE FOOTBALL SO MUCH. FOOTBALL IS MY PASSION. WHO ELSE LOVES FOOTBALL ?'

# 💻 🔢 Numbers

Removing numbers during text preprocessing is often beneficial, especially for tasks like text clustering and keyphrase extraction. Here's why:

- Text Clustering: Clustering algorithms, such as K-means or hierarchical clustering, group similar documents based on their features. Numbers often add noise because they rarely carry semantic meaning or contribute to the similarity between documents. Removing them lets the algorithm focus on the meaningful textual content, which tends to give cleaner clusters.

- Collecting Keyphrases: Keyphrase extraction identifies the phrases or terms that best capture a document's main topics or concepts. Numbers are rarely informative in this context and can produce irrelevant or nonsensical keyphrases. Removing them helps keep the extracted keyphrases relevant and representative of the content.

In [13]:
text = "i do not recommend this restaurant, we waited for so long, like 30 minutes, this is ridiculous"
text

'i do not recommend this restaurant, we waited for so long, like 30 minutes, this is ridiculous'

In [14]:
cleaned_text = ''.join(char for char in text if not char.isdigit())
cleaned_text

'i do not recommend this restaurant, we waited for so long, like  minutes, this is ridiculous'

In [15]:
print('a'.isdigit())
print('5'.isdigit())

False
True
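
Dropping the digits character by character leaves a double space where "30" used to be. As an alternative sketch (not the approach used in this notebook), the standard `re` module can remove runs of digits and then collapse repeated whitespace:

```python
import re

text = "i do not recommend this restaurant, we waited for so long, like 30 minutes, this is ridiculous"

# Remove every run of digits
no_digits = re.sub(r'\d+', '', text)

# Collapse the leftover run of spaces into a single space
cleaned_text = re.sub(r'\s+', ' ', no_digits).strip()
print(cleaned_text)
```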


# 💻 ❗️❓Punctuation and Symbols

#### Warning: you might want to keep punctuation and symbols for authorship attribution!

Punctuation and symbols play a crucial role in shaping an author's writing style. Factors such as the frequency and placement of commas, dashes, exclamation marks, and other symbols can be distinctive features of an author's writing.
Retaining punctuation and symbols allows the model to capture these nuances in writing style accurately, improving the accuracy of authorship attribution.

In [16]:
text = "I love bubble tea! OMG so #tasty @channel XOXO @$ ^_^ "
text

'I love bubble tea! OMG so #tasty @channel XOXO @$ ^_^ '

In [17]:
import string # the "string" module is part of Python's standard library
string.punctuation

'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'

`string.punctuation` is a string containing all ASCII punctuation characters. 
<br>
These characters include common symbols such as exclamation marks, double quotes, hash symbols, percent signs, ampersands, apostrophes, parentheses, asterisks, plus signs, commas, hyphens, periods, slashes, colons, semicolons, less than signs, equals signs, greater than signs, question marks, at symbols, square brackets, backslashes, circumflex accents, underscores, grave accents, curly braces, vertical bars, and tildes.

In [18]:
for punctuation in string.punctuation:
    text = text.replace(punctuation, '') 
    
text

'I love bubble tea OMG so tasty channel XOXO   '

In [19]:
text.strip()

'I love bubble tea OMG so tasty channel XOXO'
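
The loop above calls `replace` once per punctuation character. A common alternative, shown here as a sketch, is `str.translate` with a table built by `str.maketrans`, which removes all of them in a single pass:

```python
import string

text = "I love bubble tea! OMG so #tasty @channel XOXO @$ ^_^ "

# Third argument of str.maketrans: the characters to delete
remove_punctuation = str.maketrans('', '', string.punctuation)

cleaned_text = text.translate(remove_punctuation).strip()
print(cleaned_text)
```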

# 💻 💪 Combo: strip + lowercase + numbers + punctuation/symbols

In [20]:
sentences = [
    "   I LOVE Pizza 999 @^_^", 
    "  Le Wagon is amazing, take care - 666"
]

In [21]:
def basic_cleaning(sentence):
    sentence = sentence.lower()
    sentence = ''.join(char for char in sentence if not char.isdigit())
    
    for punctuation in string.punctuation:
        sentence = sentence.replace(punctuation, '') 
    
    sentence = sentence.strip()
    
    return sentence

In [22]:
cleaned = [basic_cleaning(sentence) for sentence in sentences]
cleaned

['i love pizza', 'le wagon is amazing take care']

#### Join


The line `''.join(char for char in sentence if not char.isdigit())` from `basic_cleaning` removes every digit (numeric character) from the variable sentence. Here is how it works, piece by piece.

Iteration over Characters:

`for char in sentence`:
This part of the code iterates over each character in the sentence string. It uses a loop to go through every character, one by one.

Conditional Filtering:
`if not char.isdigit()`
Inside the loop, each character (char) is tested with isdigit(), a built-in string method that returns True if all characters in the string are digits (here the string is a single character) and False otherwise. The not keyword negates the result, so the condition is True when the character is not a digit.

Joining Characters:`''.join(...)` The join() method is then used to concatenate the characters back together into a new string. It takes an iterable (in this case, a generator expression) as input and joins the elements together using the specified separator. In this case, the separator is an empty string '', which means the characters will be joined together without any separation between them.

Generator Expression:
`(char for char in sentence if not char.isdigit())`
Inside the join() method, there's a generator expression. It iterates over each character in the sentence string and yields only those characters that are not digits, effectively filtering out the digits from the original string.

Result:
The final result of this line of code is a new string containing only the characters from the original sentence string that are not digits. Essentially, it removes all numeric characters from the sentence.

In [23]:
sentence = ''.join(char for char in sentences[0] if not char.isdigit())
sentence

'   I LOVE Pizza  @^_^'
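
For comparison, here is a sketch of the same filtering written as an explicit loop; it is meant to produce the same string as the one-line `join` version (the names `kept_chars` and `no_digits` are only illustrative):

```python
sentence = "   I LOVE Pizza 999 @^_^"

kept_chars = []
for char in sentence:
    # Keep the character only if it is not a digit
    if not char.isdigit():
        kept_chars.append(char)

# Glue the kept characters back together with no separator
no_digits = ''.join(kept_chars)
print(repr(no_digits))
```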

# 💻 🔍 Removing Tags with RegEx

In [24]:
import re

text = """<head><body>Hello Le Wagon!</body></head>"""
cleaned_text = re.sub('<[^<]+?>','', text)

print(cleaned_text)

Hello Le Wagon!


In [25]:
import re

txt = '''
    This is a random text, authored by darkvader@gmail.com 
    and batman@outlook.com, WOW!
'''

# Raw string (r'...') so that \w and \. reach the regex engine unchanged
re.findall(r'[\w.+-]+@[\w-]+\.[\w.-]+', txt)

['darkvader@gmail.com', 'batman@outlook.com']
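
The email pattern reads as: one or more word characters, dots, pluses, or hyphens; an `@`; a domain name; a literal dot; then the rest of the domain. As a sketch, the same pattern can be passed to `re.sub` to mask addresses instead of listing them (the `<EMAIL>` placeholder is an arbitrary choice for this example):

```python
import re

txt = '''
    This is a random text, authored by darkvader@gmail.com
    and batman@outlook.com, WOW!
'''

email_pattern = r'[\w.+-]+@[\w-]+\.[\w.-]+'

# Replace every address with a placeholder token
masked = re.sub(email_pattern, '<EMAIL>', txt)
print(masked)
```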

# 💻 Cleaning with NLTK

Natural Language Toolkit (NLTK) is an NLP library that provides preprocessing and modeling tools for text data.

# 💻 🌲 Tokenizing

In [26]:
text = 'It is during our darkest moments that we must focus to see the light'

text

'It is during our darkest moments that we must focus to see the light'

In [27]:
from nltk.tokenize import word_tokenize
import nltk
# nltk.download('punkt')

word_tokens = word_tokenize(text)
print(word_tokens) # print displays the words in one line

['It', 'is', 'during', 'our', 'darkest', 'moments', 'that', 'we', 'must', 'focus', 'to', 'see', 'the', 'light']


# 💻 🛑 Stopwords

In [28]:
from nltk.corpus import stopwords 
import nltk
# nltk.download('stopwords')
# set(...): The list of stopwords is converted into a set. Using a set ensures that duplicate
# stopwords are removed, and it allows for efficient membership checks.
stop_words = set(stopwords.words('english')) # you can also choose other languages
stop_words

{'a',
 'about',
 'above',
 'after',
 'again',
 'against',
 'ain',
 'all',
 'am',
 'an',
 'and',
 'any',
 'are',
 'aren',
 "aren't",
 'as',
 'at',
 'be',
 'because',
 'been',
 'before',
 'being',
 'below',
 'between',
 'both',
 'but',
 'by',
 'can',
 'couldn',
 "couldn't",
 'd',
 'did',
 'didn',
 "didn't",
 'do',
 'does',
 'doesn',
 "doesn't",
 'doing',
 'don',
 "don't",
 'down',
 'during',
 'each',
 'few',
 'for',
 'from',
 'further',
 'had',
 'hadn',
 "hadn't",
 'has',
 'hasn',
 "hasn't",
 'have',
 'haven',
 "haven't",
 'having',
 'he',
 'her',
 'here',
 'hers',
 'herself',
 'him',
 'himself',
 'his',
 'how',
 'i',
 'if',
 'in',
 'into',
 'is',
 'isn',
 "isn't",
 'it',
 "it's",
 'its',
 'itself',
 'just',
 'll',
 'm',
 'ma',
 'me',
 'mightn',
 "mightn't",
 'more',
 'most',
 'mustn',
 "mustn't",
 'my',
 'myself',
 'needn',
 "needn't",
 'no',
 'nor',
 'not',
 'now',
 'o',
 'of',
 'off',
 'on',
 'once',
 'only',
 'or',
 'other',
 'our',
 'ours',
 'ourselves',
 'out',
 'over',
 'own',
 'r

🕺🏻 Here is an example of a tokenized sentence:

In [29]:
tokens = ["i", "am", "going", "to", "go", "to", "the", 
        "club", "and", "party", "all", "night", "long"]

#### ❓ What stopwords could be removed ❓

In [30]:
stopwords_removed = [w for w in tokens if w in stop_words] 
stopwords_removed

['i', 'am', 'to', 'to', 'the', 'and', 'all']

❓ What are the meaningful words in this sentence ❓
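
To answer that question in code, flip the condition and keep the tokens that are not stopwords. This sketch is self-contained, so it uses a small hand-written set (`some_stop_words`, an assumption standing in for NLTK's full English list) containing only the stopwords detected above:

```python
tokens = ["i", "am", "going", "to", "go", "to", "the",
          "club", "and", "party", "all", "night", "long"]

# Stand-in for set(stopwords.words('english')), limited to the
# stopwords detected in the sentence above
some_stop_words = {"i", "am", "to", "the", "and", "all"}

meaningful_words = [w for w in tokens if w not in some_stop_words]
print(meaningful_words)
```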

#### 👉 What if you are not going to the party?

😱 "not" is also considered as a stopword

We have to be careful with stopword removal when it comes to Sentiment Analysis:

Stopword removal is often harmless for sentiment analysis, and it can even help by reducing noise and focusing on sentiment-carrying words. However, its impact depends on the specific context and on the sentiment lexicon used. Some stopwords carry sentiment themselves (e.g., "not"): dropping "not" turns "not good" into "good". So it is essential to check which stopwords are removed and how that affects the results.

For authorship attribution, stopword removal also deserves some care. Some stopwords carry author-specific stylistic signal, so whether removing them hurts depends on the model and on the features it uses. Authorship attribution also relies on other stylistic features, such as vocabulary choice, sentence structure, and punctuation patterns, which stopword removal leaves untouched.
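
One way to keep negations is to take them out of the stopword set before filtering. The sketch below assumes a small stand-in set (`some_stop_words`) instead of NLTK's full list, and the `negations` set is a hypothetical choice to adapt to the task:

```python
tokens = ["i", "am", "not", "going", "to", "the", "party"]

# Stand-in for set(stopwords.words('english'))
some_stop_words = {"i", "am", "not", "to", "the", "and", "all"}

# Words we want to keep because they flip the meaning
negations = {"not", "no", "nor"}
custom_stop_words = some_stop_words - negations

kept = [w for w in tokens if w not in custom_stop_words]
print(kept)
```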

# 💻 🌐 Lemmatizing

#### 👇 Look at the following sentence:

In [31]:
sentence = 'He was RUNNING and EATING at the same time =[. He has a bad habit of swimming after playing 3 hours in the Sun =/'
sentence

'He was RUNNING and EATING at the same time =[. He has a bad habit of swimming after playing 3 hours in the Sun =/'

#### 🧹 Step 1: Basic Cleaning (the method we created)

In [32]:
sentence

'He was RUNNING and EATING at the same time =[. He has a bad habit of swimming after playing 3 hours in the Sun =/'

In [33]:
cleaned_sentence = basic_cleaning(sentence)
cleaned_sentence

'he was running and eating at the same time  he has a bad habit of swimming after playing  hours in the sun'

# 🎄 Step 2 : Tokenize

So what is tokenizing again folks?

In [34]:
tokenized_sentence = word_tokenize(cleaned_sentence)
print(tokenized_sentence)

['he', 'was', 'running', 'and', 'eating', 'at', 'the', 'same', 'time', 'he', 'has', 'a', 'bad', 'habit', 'of', 'swimming', 'after', 'playing', 'hours', 'in', 'the', 'sun']


# 🛑 Step 3: Remove Stopwords

In [35]:
tokenized_sentence_no_stopwords = [w for w in tokenized_sentence if w not in stop_words]
print(tokenized_sentence_no_stopwords)

['running', 'eating', 'time', 'bad', 'habit', 'swimming', 'playing', 'hours', 'sun']


# 🌐 Step 4: Lemmatizing

What does lemmatizing do exactly?

It reduces words to their base or canonical form, known as the lemma. The lemma represents the dictionary form or root word of a given word, which allows different inflected forms of the word to be treated as a single item. For example, the lemma of "running" is "run," and the lemma of "better" is "good."

In [36]:
from nltk.stem import WordNetLemmatizer
import pandas as pd
import nltk
# nltk.download('wordnet')
# nltk.download('omw-1.4')

# Lemmatizing the verbs
verb_lemmatized = [
    WordNetLemmatizer().lemmatize(word, pos="v")  # v --> verbs
    for word in tokenized_sentence_no_stopwords
]

Here, a list comprehension is used to iterate over each word in the tokenized_sentence_no_stopwords list.
For each word, the WordNetLemmatizer().lemmatize(word, pos="v") method is called. This method lemmatizes the word with the specified part-of-speech (POS) tag, which in this case is "v" indicating a verb.
The lemmatized verbs are stored in the verb_lemmatized list.

In the same way, another list comprehension is used to iterate over each word in the verb_lemmatized list, which contains the lemmatized verbs.
For each word, the WordNetLemmatizer().lemmatize(word, pos="n") method is called with the POS tag "n", indicating a noun this time.
The lemmatized nouns are stored in the noun_lemmatized list.

In [37]:
# Lemmatizing the nouns
noun_lemmatized = [
    WordNetLemmatizer().lemmatize(word, pos="n")  # n --> nouns
    for word in verb_lemmatized
]

Here I create a DataFrame with one column for the original words, one for the words after verb lemmatization, and one for the words after noun lemmatization. Displaying it lets us inspect the original and lemmatized forms side by side.

In [38]:
# Create a DataFrame
df = pd.DataFrame({
    'Original Verb': tokenized_sentence_no_stopwords,
    'Verb Lemmatized': verb_lemmatized,
    'Noun Lemmatized': noun_lemmatized
})

# Display the DataFrame
df

| | Original Verb | Verb Lemmatized | Noun Lemmatized |
|---|---|---|---|
| 0 | running | run | run |
| 1 | eating | eat | eat |
| 2 | time | time | time |
| 3 | bad | bad | bad |
| 4 | habit | habit | habit |
| 5 | swimming | swim | swim |
| 6 | playing | play | play |
| 7 | hours | hours | hour |
| 8 | sun | sun | sun |


# 🤖 Machine Learning algorithms cannot process raw text, as it needs to be converted into numbers first

- Vectorize Texts: Use techniques such as TF-IDF (Term Frequency-Inverse Document Frequency) or word embeddings (e.g., Word2Vec, GloVe) to convert the text data into numerical vectors. TF-IDF represents each document as a vector of weighted word frequencies, while word embeddings represent each word as a dense vector in a continuous space. You can use libraries like scikit-learn (for TF-IDF) or TensorFlow/Keras (for word embeddings) to perform text vectorization.

- Encode Target Variable: Encode the target variable (e.g., "normal email" vs. "spam") into numerical labels. For binary classification tasks like spam detection, you can use label encoding to convert categorical labels into numerical values (e.g., 0 for "normal email" and 1 for "spam").

- Train the Model: Fit a classifier on the vectorized texts and the encoded labels.

- Make Predictions: Once the model is trained, use it to make predictions on new text data. Vectorize the new text data with the same fitted vectorizer used during training, then feed the vectorized data into the trained model. The predictions are numerical labels representing the predicted class (e.g., 0 for "normal email" and 1 for "spam").
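
The steps above can be sketched end to end with scikit-learn. Everything in this example is an assumption made for illustration: the four toy emails, their labels, and the choice of `MultinomialNB` as the classifier.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy data (invented for this sketch): 1 = spam, 0 = normal email
emails = [
    "win free money now",
    "free prize click now",
    "meeting agenda for monday",
    "please review the project report",
]
labels = [1, 1, 0, 0]

# The pipeline vectorizes the texts, then trains the classifier
model = make_pipeline(TfidfVectorizer(), MultinomialNB())
model.fit(emails, labels)

# New texts go through the same fitted vectorizer before prediction
predictions = model.predict(["free money prize", "project meeting monday"])
print(predictions)
```

Wrapping both steps in a pipeline is one way to guarantee that new texts are transformed with the vocabulary learned during training.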

# 💻 CountVectorizer

#### 👇 Look at the following sentences:

In [39]:
texts = [
    'the young dog is running with the cat',
    'running is good for your health',
    'your cat is young',
    'young young young young young cat cat cat'
]

#### 👩🏻‍🔬 Let's apply the CountVectorizer to generate a Bag-of-Words representation of these four sentences

In [40]:
from sklearn.feature_extraction.text import CountVectorizer

count_vectorizer = CountVectorizer()
X = count_vectorizer.fit_transform(texts)
X.toarray()

array([[1, 1, 0, 0, 0, 1, 1, 2, 1, 1, 0],
       [0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 1],
       [1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1],
       [3, 0, 0, 0, 0, 0, 0, 0, 0, 5, 0]])

Code Explanation:

`count_vectorizer = CountVectorizer()`
CountVectorizer is a class provided by scikit-learn that converts a collection of text documents into a matrix of token counts. Each row of the matrix represents a document, and each column represents a unique word (or token) in the entire corpus of documents.
Calling CountVectorizer() creates a vectorizer object with default parameters. You can customize parameters such as tokenization rules, stopword removal, and n-gram range, but in this case we keep the default settings.

In the context of natural language processing (NLP) and text analysis, a document is a single unit of text data. Here, each of the four sentences is one document.

The `toarray()` method converts the sparse matrix X into a dense array format. Sparse matrices store only non-zero entries, which is efficient for memory usage when dealing with large matrices where most entries are zero. However, dense arrays store all entries, including zeros, which makes them more memory-intensive but easier to work with for certain operations.
By calling `toarray()`, the sparse matrix X is converted into a 2D NumPy array, where each row corresponds to a document and each column corresponds to a word, with the entry representing the count of that word in the document.

🤔 Can you guess which column represents which word?

# 🔥 As soon as the CountVectorizer is fitted to the text, you can retrieve all the words seen with get_feature_names_out():

In [41]:
count_vectorizer.get_feature_names_out()

array(['cat', 'dog', 'for', 'good', 'health', 'is', 'running', 'the',
       'with', 'young', 'your'], dtype=object)

In [42]:
# here we turn our results into a DataFrame
import pandas as pd

vectorized_texts = pd.DataFrame(
    X.toarray(), 
    columns = count_vectorizer.get_feature_names_out(),
    index = texts
)

vectorized_texts

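| | cat | dog | for | good | health | is | running | the | with | young | your |
|---|---|---|---|---|---|---|---|---|---|---|---|
| the young dog is running with the cat | 1 | 1 | 0 | 0 | 0 | 1 | 1 | 2 | 1 | 1 | 0 |
| running is good for your health | 0 | 0 | 1 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 1 |
| your cat is young | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 1 |
| young young young young young cat cat cat | 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 5 | 0 |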


# Be aware that there are some limitations when it comes to the bag-of-words representation

While Bag-of-Words (BoW) representation is effective in capturing the frequency of individual words in a document, it lacks the ability to capture the context or the sequential relationship between words. This limitation can be addressed by using N-grams.

N-grams are contiguous sequences of n items (words in the context of NLP), where n refers to the number of words in the sequence. By considering sequences of words instead of individual words, N-grams capture more contextual information from the text data.
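
As a minimal sketch of this definition, word n-grams can be built by sliding a window of n words over the tokens (the helper `word_ngrams` is hypothetical, written only for this example):

```python
def word_ngrams(text, n):
    """Return the list of contiguous n-word sequences in text."""
    words = text.split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

sentence = "the young dog is running with the cat"
bigrams = word_ngrams(sentence, 2)
print(bigrams)
```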

# 2.2. Tf-idf Representation

# Term Frequency (tf) & CountVectorizer

In [43]:
vectorized_texts

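| | cat | dog | for | good | health | is | running | the | with | young | your |
|---|---|---|---|---|---|---|---|---|---|---|---|
| the young dog is running with the cat | 1 | 1 | 0 | 0 | 0 | 1 | 1 | 2 | 1 | 1 | 0 |
| running is good for your health | 0 | 0 | 1 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 1 |
| your cat is young | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 1 |
| young young young young young cat cat cat | 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 5 | 0 |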


Here we calculate the term frequency (TF) of the word "young" in the fourth document ('young young young young young cat cat cat'). Term frequency measures how often a term (word) appears in a document relative to the total number of words in that document.

The word "young" appears 5 times in the document.

The total number of words in the document is 8.

To calculate the term frequency (TF) for "young", you divide the number of occurrences of "young" by the total number of words in the document:

TF("young") = (Number of occurrences of "young") / (Total number of words in the document)
= 5 / 8
= 0.625

So, the term frequency (TF) for the word "young" in the document is 0.625. This means that "young" accounts for 62.5% of the total words in the document. TF is often used as a feature in text analysis tasks such as information retrieval, document classification, and sentiment analysis.
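
The same calculation can be sketched in plain Python with `collections.Counter` (splitting on whitespace is a simplifying assumption that works for this sentence):

```python
from collections import Counter

document = "young young young young young cat cat cat"
words = document.split()

counts = Counter(words)

# Term frequency = occurrences of the term / total number of words
tf_young = counts["young"] / len(words)
print(tf_young)
```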

# Document Frequency (df)

In [44]:
vectorized_texts

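| | cat | dog | for | good | health | is | running | the | with | young | your |
|---|---|---|---|---|---|---|---|---|---|---|---|
| the young dog is running with the cat | 1 | 1 | 0 | 0 | 0 | 1 | 1 | 2 | 1 | 1 | 0 |
| running is good for your health | 0 | 0 | 1 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 1 |
| your cat is young | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 1 |
| young young young young young cat cat cat | 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 5 | 0 |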
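
Document frequency (df) counts how many documents contain a term at least once, regardless of how many times it appears in each one. In the table above, "cat" has a non-zero count in three of the four documents, so its df is 3. A sketch in plain Python, assuming whitespace tokenization:

```python
from collections import Counter

texts = [
    'the young dog is running with the cat',
    'running is good for your health',
    'your cat is young',
    'young young young young young cat cat cat'
]

document_frequency = Counter()
for text in texts:
    # set() makes each word count once per document
    document_frequency.update(set(text.split()))

print(document_frequency["cat"], document_frequency["dog"])
```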


# Summarizing

TF-IDF (Term Frequency-Inverse Document Frequency) is a numerical statistic that reflects how important a term (word) is to a document relative to a collection of documents (the corpus). It multiplies the term frequency (TF), which measures how often a term appears in a document, by the inverse document frequency (IDF), which measures how rare the term is across the entire corpus. TF-IDF therefore assigns higher weights to terms that are frequent within a document but rare across the corpus, highlighting terms that are both relevant and discriminative for that document. Because it combines local (within-document) and global (corpus-wide) information, TF-IDF is widely used in information retrieval, text mining, and natural language processing for document analysis, search, and classification.

# 2.3. 💻 TfidfVectorizer

The TfidfVectorizer is a feature extraction class provided by the scikit-learn library in Python, which converts a collection of raw documents into a matrix of TF-IDF (Term Frequency-Inverse Document Frequency) features. It combines the functionality of CountVectorizer and TfidfTransformer in a single step, which makes it convenient for turning text data into a numerical representation suitable for machine learning algorithms.

In [45]:
texts

['the young dog is running with the cat',
 'running is good for your health',
 'your cat is young',
 'young young young young young cat cat cat']

In [46]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [47]:
# Instantiating the TfidfVectorizer
tf_idf_vectorizer = TfidfVectorizer()

# Training it on the texts
weighted_words = pd.DataFrame(tf_idf_vectorizer.fit_transform(texts).toarray(),
                 columns = tf_idf_vectorizer.get_feature_names_out())

weighted_words

| | cat | dog | for | good | health | is | running | the | with | young | your |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.227904 | 0.357056 | 0.0 | 0.0 | 0.0 | 0.227904 | 0.281507 | 0.714112 | 0.357056 | 0.227904 | 0.0 |
| 1 | 0.0 | 0.0 | 0.463709 | 0.463709 | 0.463709 | 0.29598 | 0.365594 | 0.0 | 0.0 | 0.0 | 0.365594 |
| 2 | 0.470063 | 0.0 | 0.0 | 0.0 | 0.0 | 0.470063 | 0.0 | 0.0 | 0.0 | 0.470063 | 0.580622 |
| 3 | 0.514496 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.857493 | 0.0 |


This code uses scikit-learn's TfidfVectorizer to transform a collection of raw text documents into a matrix of TF-IDF (Term Frequency-Inverse Document Frequency) features, then converts it into a DataFrame for further analysis.

`weighted_words = pd.DataFrame(tf_idf_vectorizer.fit_transform(texts).toarray(),`<br>
`columns = tf_idf_vectorizer.get_feature_names_out())`

We call the fit_transform() method of the TfidfVectorizer object on the texts input. This method tokenizes the input texts, calculates the TF-IDF scores for each word in each document, and returns a sparse matrix representation of the TF-IDF features.
We convert the sparse matrix to a dense array format using .toarray().
Then, we create a DataFrame named weighted_words from the dense array. Each column in the DataFrame corresponds to a unique word (feature) extracted from the texts, and each row represents a document. The cell values are the corresponding TF-IDF scores for each word in each document.
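
To see where these scores come from, the weights can be reproduced by hand. The sketch below assumes scikit-learn's default settings: raw counts for tf, a smoothed idf of ln((1 + n) / (1 + df)) + 1, and L2 normalization of each row. Whitespace tokenization is a simplification that happens to match on these sentences, and the helper `tfidf_row` is hypothetical.

```python
import math
from collections import Counter

texts = [
    'the young dog is running with the cat',
    'running is good for your health',
    'your cat is young',
    'young young young young young cat cat cat'
]
tokenized = [text.split() for text in texts]
n_docs = len(tokenized)

# Document frequency: number of documents containing each word
df = Counter(word for words in tokenized for word in set(words))

# Smoothed inverse document frequency (assumed scikit-learn default)
idf = {word: math.log((1 + n_docs) / (1 + count)) + 1
       for word, count in df.items()}

def tfidf_row(words):
    """TF-IDF weights of one document, L2-normalized."""
    raw = {word: count * idf[word] for word, count in Counter(words).items()}
    norm = math.sqrt(sum(weight ** 2 for weight in raw.values()))
    return {word: weight / norm for word, weight in raw.items()}

last_row = tfidf_row(tokenized[3])
print(round(last_row["cat"], 6), round(last_row["young"], 6))
```

The two values printed for the last document can be compared with the "cat" and "young" cells of row 3 in the DataFrame.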

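The Curse of Dimensionality refers to the challenges that arise when working with high-dimensional data in machine learning and data analysis. In a bag-of-words or TF-IDF representation, every unique word in the corpus becomes a feature (a column), so a real corpus can easily produce thousands of dimensions. As the number of dimensions grows, the volume of the feature space grows exponentially and the data becomes sparse, which makes models slower to train and more prone to overfitting.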

# How to use max_df and min_df parameters in practice?

- max_df (Maximum Document Frequency): This parameter sets an upper threshold on the document frequency of terms. Terms that appear in more documents than the threshold are ignored. If max_df is a float between 0.0 and 1.0, it is read as a proportion of documents: max_df = 0.5 means to ignore terms that appear in more than 50% of the documents. If max_df is an integer, it is read as an absolute count of documents: max_df = 20 means to ignore terms that appear in more than 20 documents.

- min_df (Minimum Document Frequency): This parameter specifies the threshold for the minimum document frequency of terms. Terms that appear in fewer documents than the specified threshold will be ignored. If min_df is a float between 0.0 and 1.0, it represents the proportion of documents in which a term must appear in order to be considered. For example, min_df = 0.1 means to ignore terms that appear in less than 10% of the documents. If min_df is an integer, it represents the absolute count of documents. For example, min_df = 5 means to ignore terms that appear in fewer than 5 documents.

- Defaults:
    - By default, max_df = 1.0, meaning no "frequent" word will be removed.
    - By default, min_df = 1, meaning no "infrequent" word will be removed.

In [48]:
from sklearn.feature_extraction.text import TfidfVectorizer

# Example usage with specified max_df and min_df
tfidf_vectorizer = TfidfVectorizer(max_df=0.5, min_df=2)

In [49]:
tfidf_vectorizer

This creates a TfidfVectorizer object with a maximum document frequency of 50% and a minimum document frequency of 2 documents.

Adjusting these parameters allows you to control the size and quality of the vocabulary used for text analysis tasks.

In [50]:
# TF-IDF weight of each word in each document
weighted_words

| | cat | dog | for | good | health | is | running | the | with | young | your |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.227904 | 0.357056 | 0.0 | 0.0 | 0.0 | 0.227904 | 0.281507 | 0.714112 | 0.357056 | 0.227904 | 0.0 |
| 1 | 0.0 | 0.0 | 0.463709 | 0.463709 | 0.463709 | 0.29598 | 0.365594 | 0.0 | 0.0 | 0.0 | 0.365594 |
| 2 | 0.470063 | 0.0 | 0.0 | 0.0 | 0.0 | 0.470063 | 0.0 | 0.0 | 0.0 | 0.470063 | 0.580622 |
| 3 | 0.514496 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.857493 | 0.0 |


The next cell filters the vocabulary with max_df and displays the counts as a DataFrame:

- X.toarray() converts the sparse document-term matrix X (here, word counts) into a dense NumPy array, a format that can easily be turned into a DataFrame.
- columns=count_vectorizer.get_feature_names_out() retrieves the feature names (i.e., the terms or words) from the CountVectorizer object count_vectorizer. These feature names are used as column labels in the DataFrame.
- index=texts sets the index of the DataFrame to the texts variable, which holds the original sentences. Each row in the DataFrame corresponds to a document, and the index labels each row with the corresponding raw text.

In [51]:
# Instantiate the CountVectorizer with max_df = 2
count_vectorizer = CountVectorizer(max_df = 2) # removing "cat", "is", "young"

# Train it
X = count_vectorizer.fit_transform(texts)
X = pd.DataFrame(
    # A sparse matrix is a matrix that contains a large number of zero elements relative
    # to its total size. In other words, most of the entries in a sparse matrix are zero.
    X.toarray(),
    columns = count_vectorizer.get_feature_names_out(),
    index = texts
)

X

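| | dog | for | good | health | running | the | with | your |
|---|---|---|---|---|---|---|---|---|
| the young dog is running with the cat | 1 | 0 | 0 | 0 | 1 | 2 | 1 | 0 |
| running is good for your health | 0 | 1 | 1 | 1 | 1 | 0 | 0 | 1 |
| your cat is young | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| young young young young young cat cat cat | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |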


In [52]:
texts

['the young dog is running with the cat',
 'running is good for your health',
 'your cat is young',
 'young young young young young cat cat cat']

# 💻 max_features

# How to use "max_features" in practice?

max_features limits the vocabulary to the N terms with the highest total count across the corpus; every other term is dropped.

Here, count_vectorizer is an instance of scikit-learn's CountVectorizer class. The fit_transform(texts) method learns the vocabulary from texts and returns the document-term matrix: each row corresponds to a document, each column to a word of the vocabulary, and each value is the count of that word in that document.

As before, X.toarray() converts the sparse matrix into a dense NumPy array, get_feature_names_out() supplies the column labels, and index=texts labels each row with its original sentence.

In [53]:
# CountVectorizer with the 3 most frequent words
count_vectorizer = CountVectorizer(max_features = 3)

X = count_vectorizer.fit_transform(texts)
X = pd.DataFrame(
    X.toarray(),
     columns = count_vectorizer.get_feature_names_out(),
     index = texts
)

X

| | cat | is | young |
|---|---|---|---|
| the young dog is running with the cat | 1 | 1 | 1 |
| running is good for your health | 0 | 1 | 0 |
| your cat is young | 1 | 1 | 1 |
| young young young young young cat cat cat | 3 | 0 | 5 |


# 2.4. N-grams

N-grams are contiguous sequences of n items (or words) from a given text or speech sample. These items can be characters, syllables, words, or even other linguistic units like morphemes or phonemes. N-grams are widely used in natural language processing (NLP) and text analysis tasks to capture local patterns and dependencies between adjacent elements in a sequence of text.

In [54]:
actors_movie = [
    "I like the movie but NOT the actors",
    "I like the actors but NOT the movie"
]

In [55]:
# Vectorize the sentences
count_vectorizer = CountVectorizer()
actors_movie_vectorized = count_vectorizer.fit_transform(actors_movie)

# Show the representations in a nice DataFrame
actors_movie_vectorized = pd.DataFrame(
    actors_movie_vectorized.toarray(),
    columns = count_vectorizer.get_feature_names_out(),
    index = actors_movie
)

# Show the vectorized movies
actors_movie_vectorized

| | actors | but | like | movie | not | the |
|---|---|---|---|---|---|---|
| I like the movie but NOT the actors | 1 | 1 | 1 | 1 | 1 | 2 |
| I like the actors but NOT the movie | 1 | 1 | 1 | 1 | 1 | 2 |


# 😥 With a unigram vectorization, we couldn't distinguish two sentences with the same words.

In [56]:
actors_movie_vectorized

| | actors | but | like | movie | not | the |
|---|---|---|---|---|---|---|
| I like the movie but NOT the actors | 1 | 1 | 1 | 1 | 1 | 2 |
| I like the actors but NOT the movie | 1 | 1 | 1 | 1 | 1 | 2 |


While unigram vectorization is useful for capturing the occurrence of individual words in each document, it does not consider the order or sequence of words within the document. Therefore, two sentences with the same words but in different orders will have identical unigram representations.

For example, consider the following two sentences:

"The quick brown fox jumps over the lazy dog."<br>
"The lazy dog jumps over the quick brown fox."
<br>
If we use unigram vectorization, both sentences will have the same vector representation because they contain the same words, regardless of the word order. This lack of consideration for word order means that unigram vectorization cannot distinguish between sentences that have the same words but different meanings or contexts.

# 👩🏻‍🔬 What about a bigram vectorization?

In [57]:
# Vectorize the sentences
count_vectorizer_n_gram = CountVectorizer(ngram_range = (2,2)) # BI-GRAMS
actors_movie_vectorized_n_gram = count_vectorizer_n_gram.fit_transform(actors_movie)

# Show the representations in a nice DataFrame
actors_movie_vectorized_n_gram = pd.DataFrame(
    actors_movie_vectorized_n_gram.toarray(),
    columns = count_vectorizer_n_gram.get_feature_names_out(),
    index = actors_movie
)

# Show the vectorized movies with bigrams
actors_movie_vectorized_n_gram

| | actors but | but not | like the | movie but | not the | the actors | the movie |
|---|---|---|---|---|---|---|---|
| I like the movie but NOT the actors | 0 | 1 | 1 | 1 | 1 | 1 | 1 |
| I like the actors but NOT the movie | 1 | 1 | 1 | 0 | 1 | 1 | 1 |


😄 The two sentences are now distinguishable

To overcome this limitation and capture the sequence of words, we can use techniques such as bigram or n-gram vectorization, which consider sequences of words (e.g., pairs of consecutive words). By incorporating the order of words into the vectorization process, these techniques can capture more detailed information about the structure and semantics of the text, allowing for better differentiation between sentences with similar word compositions but different meanings.

In [59]:
import pandas as pd

data = pd.read_csv("https://wagon-public-datasets.s3.amazonaws.com/05-Machine-Learning/10-Natural-Language-Processing/ham_spam_emails.csv")
data.head()

| | text | spam |
|---|---|---|
| 0 | Subject: naturally irresistible your corporate... | 1 |
| 1 | Subject: the stock trading gunslinger fanny i... | 1 |
| 2 | Subject: unbelievable new homes made easy im ... | 1 |
| 3 | Subject: 4 color printing special request add... | 1 |
| 4 | Subject: do not have money , get software cds ... | 1 |
