# Introduction

This notebook collects text preprocessing steps that are common across a variety of NLP tasks. The list is not exhaustive.

It contains:
 - Text Cleaning steps
 - Text Normalization steps
 - Sources of information

It does NOT contain:
 - Word Embeddings
 - Bag of Words
 - TF-IDF

In [46]:
! pip install -q unidecode pyspellchecker autocorrect



1. **What does the code do?**
   - This code installs three Python packages: `unidecode`, `pyspellchecker`, and `autocorrect` using pip. These packages are commonly used for text preprocessing and spell-checking tasks in natural language processing (NLP) applications.

2. **Why is the `-q` flag used in the `pip install` command?**
   - The `-q` flag stands for "quiet" mode, which suppresses most of the output generated by pip during the installation process. This is useful for keeping the output clean and concise, especially when running scripts or commands in automated environments.

3. **What is the purpose of the `unidecode` package?**
   - The `unidecode` package is used for transliterating Unicode characters into ASCII characters. It's commonly used to remove accents and diacritics from text, making it easier to process and analyze text data that contains non-ASCII characters.

4. **What functionality does the `pyspellchecker` package provide?**
   - The `pyspellchecker` package provides tools for spell checking in Python. It allows users to identify and correct misspelled words in text data. This package is useful for tasks such as text correction, typo detection, and improving the accuracy of natural language processing models.

5. **How does the `autocorrect` package differ from `pyspellchecker`?**
   - The `autocorrect` package is another Python library for correcting spelling errors in text. Both libraries offer similar functionality. In this notebook, `autocorrect` is used through a `Speller` object that is called on a whole string and returns the corrected text, whereas `pyspellchecker` is applied word by word. The two libraries may rely on different algorithms and word lists, so their corrections can differ.

# Table of Contents

1. [Lowercasing](#lowercase)
2. [Punctuation Removal](#removal)
    - [str.translate()](#rmvl-translate)
    - [Regular Expressions](#rmvl-re)
      - [re.compile()](#rmvl-re-compile)
      - [re.escape()](#rmvl-re-escape)
      - [Remove not words and not whitespaces](#rmvl-re-not-words)
    - [Excluding Punctuation](#rmvl-exclude)
    - [String Replace](#rmvl-str-replace)
3. [Numbers Removal](#removal-nb)
    - [str.translate()](#rmvl-nb-tr)
    - [Regular Expressions](#rmvl-nb-re)
    - [join with not digit](#rmvl-nb-not-digit)
    - [join with not alpha](#rmvl-nb-not-alpha)
4. [HTML Tags Removal](#html-rmvl)
    - [Regular Expressions](#html-rmvl-re)
    - [Beautiful Soup](#html-rmvl-bs)
5. [URL Removal](#url-rmvl)
    - [Regular Expressions](#url-rmvl-re)
6. [Newlines / spaces / tabs Removal](#spaces-rmvl)
    - [String split](#spaces-rmvl-split)
    - [Regular Expressions](#spaces-rmvl-re)
        - [re.sub()](#spaces-rmvl-re)
        - [re.findall()](#spaces-rmvl-re-findall)
7. [Emojis Removal](#emoji-rmvl)
    - [Regular Expressions](#emoji-rmvl-re)
8. [Replacing accented characters](#accented)
    - [Unidecode](#accented-unidecode)
9. [Spelling corrections](#spell)
    - [SpellChecker](#spell-checker)
    - [Autocorrect](#spell-autocorrect)
    - [TextBlob](#spell-tb)
10. [Tokenization](#tknz)
    - [Regular Expressions](#tknz-re)
    - [NLTK](#tknz-nltk)
    - [spaCy](#tknz-spacy)
    - [Gensim](#tknz-gensim)
    - [Comparison of the methods](#tknz-compare)
11. [Sentence Tokenization](#tknz-sents)
    - [Regular Expressions](#tknz-sents-re)
    - [NLTK](#tknz-sents-nltk)
    - [spaCy](#tknz-sents-spacy)
12. [Stopwords Removal](#stopwords)
    - [NLTK](#stopwords-nltk)
    - [spaCy](#stopwords-spacy)
    - [Gensim](#stopwords-gensim)
    - [Comparison of the methods](#stopwords-compare)
    - [Comparison of stopwords lists](#stopwords-compare2)
13. [Lemmatization](#lemma)
    - [NLTK](#lemma-nltk)
    - [spaCy](#lemma-spacy)
    - [TextBlob](#lemma-tb)
    - [Adding Tags](#lemma-tags)
        - [NLTK](#lemma-tags-nltk)
        - [TextBlob](#lemma-tags-tb)
    - [Comparison of the methods](#lemma-compare)
14. [Stemming](#stem)
    - [PorterStemmer](#stem-ps)
    - [SnowballStemmer](#stem-sno)


In [47]:
import string
import re
import nltk
import spacy



1. **What is the purpose of importing the `string` module in this code?**
   - The `string` module provides a collection of string constants and helpers for working with strings in Python. It includes constants such as `string.ascii_letters`, `string.digits`, and `string.punctuation`. Importing the `string` module here suggests that these constants will be used for string manipulation or processing.

2. **Why is the `re` module imported?**
   - The `re` module in Python provides support for regular expressions, which are a powerful tool for pattern matching and text manipulation. Its usage in this code indicates that the script may involve tasks such as pattern matching, substitution, or extraction based on specific patterns within text data.

3. **What is the purpose of importing `nltk`?**
   - `nltk` stands for Natural Language Toolkit, which is a leading platform for building Python programs to work with human language data. Importing `nltk` suggests that the script may utilize various NLP functionalities provided by the NLTK library, such as tokenization, stemming, lemmatization, part-of-speech tagging, and more.

4. **Why is `spacy` imported?**
   - `spacy` is another popular library for natural language processing in Python, known for its efficiency and ease of use. Importing `spacy` indicates that the script may employ spaCy's capabilities for tasks such as tokenization, named entity recognition, dependency parsing, and text classification.

5. **How does `nltk` differ from `spacy` in terms of functionality?**
   - While both `nltk` and `spacy` are widely used in NLP tasks, they have different design philosophies and functionalities. `nltk` provides a wide range of basic NLP tools and algorithms, making it suitable for educational purposes and research. On the other hand, `spacy` focuses more on efficiency and ease of use, providing pre-trained models and streamlined APIs for common NLP tasks, making it more suitable for production-level applications.

## Lowercasing <a name="lowercase"></a>

In [48]:
text_with_upper = "Hello, POLAND is a Very BeautiFul CountrY!"
text_with_upper.lower()

'hello, poland is a very beautiful country!'
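For case-insensitive comparison of non-English text, `str.casefold()` can be a useful alternative to `lower()`. The sketch below is only an illustration; the helper name `normalize_case` is hypothetical and not part of the notebook.

```python
def normalize_case(text: str) -> str:
    """Return a case-insensitive normal form of the text.

    casefold() is more aggressive than lower(): for example, it
    expands the German sharp s into 'ss'.
    """
    return text.casefold()


print("Straße".lower())           # lower() keeps the sharp s
print(normalize_case("Straße"))   # casefold() expands it to 'ss'
print(normalize_case("STRASSE") == normalize_case("Straße"))
```

For plain ASCII text such as the example above, both methods give the same result.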

## Punctuation Removal <a name="removal"></a>

### **Most of them use `string.punctuation`**



Based on this [Stack Overflow question](https://stackoverflow.com/questions/265960/best-way-to-strip-punctuation-from-a-string).

The solutions are ordered from the fastest to the slowest.

In [49]:
puncts_text = "HI!!! I overuse ... !(@*punctuations +-*/ and*(& other signs __*(&!!"

### using `str.translate()` <a name="rmvl-translate"></a>
Probably the fastest way

In [50]:
puncts_text.translate(str.maketrans('', '', string.punctuation))

'HI I overuse  punctuations  and other signs '

### Using `re` <a name="rmvl-re"></a>

`re` stands for Regular Expressions

#### Method 1 with string.punctuation <a name="rmvl-re-compile"></a>

In [51]:
regex = re.compile('[%s]' % re.escape(string.punctuation))
regex.sub('', puncts_text)

'HI I overuse  punctuations  and other signs '

#### Method 1 in a single line <a name="rmvl-re-escape"></a>

In [52]:
re.sub(f'[{re.escape(string.punctuation)}]','', puncts_text)

'HI I overuse  punctuations  and other signs '

#### Method 2 <a name="rmvl-re-not-words"></a>

Removes **not words** and **not spaces**

Note: `\w` also matches the underscore, so `_` is kept and the result differs from the methods above.

In [53]:
re.sub(r'[^\w\s]','', puncts_text)

'HI I overuse  punctuations  and other signs __'

### Excluding string.punctuation <a name="rmvl-exclude"></a>

In [54]:
# a slower solution
exclude = set(string.punctuation)
"".join(ch for ch in puncts_text if ch not in exclude)

'HI I overuse  punctuations  and other signs '

### Using `str.replace()` <a name="rmvl-str-replace"></a>

In [55]:
clean_text = puncts_text
for c in string.punctuation:
    clean_text = clean_text.replace(c, "")
clean_text

'HI I overuse  punctuations  and other signs '
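The speed ordering of the methods above depends on the Python version and the input, so it is worth measuring. The following is a minimal sketch using `timeit`; the wrapper function names are made up for this example, and the timings it prints will vary between machines.

```python
import re
import string
import timeit

puncts_text = "HI!!! I overuse ... !(@*punctuations +-*/ and*(& other signs __*(&!!"

# Build the helpers once so that only the removal itself is timed.
table = str.maketrans('', '', string.punctuation)
regex = re.compile('[%s]' % re.escape(string.punctuation))
exclude = set(string.punctuation)


def with_translate(text):
    return text.translate(table)


def with_regex(text):
    return regex.sub('', text)


def with_set(text):
    return "".join(ch for ch in text if ch not in exclude)


def with_replace(text):
    for c in string.punctuation:
        text = text.replace(c, "")
    return text


methods = [with_translate, with_regex, with_set, with_replace]
for func in methods:
    seconds = timeit.timeit(lambda: func(puncts_text), number=10_000)
    print(f"{func.__name__:15s} {seconds:.4f} s")
```

All four functions return the same string, so the choice between them comes down to speed and readability.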

Regular expressions, often abbreviated as regex or regexp, provide a powerful and flexible way to search, match, and manipulate text strings based on specific patterns. They are widely used in various programming languages and text processing tools for tasks such as data validation, search and replace operations, and text extraction.

Here's a more detailed explanation of regex:

1. **Pattern Matching**: Regex allows you to define a pattern, which is a sequence of characters that describes a set of strings. For example, the pattern `[0-9]+` matches one or more digits.

2. **Metacharacters**: Regex uses special characters called metacharacters to represent different types of characters or character classes. For example:
   - `.` matches any single character except newline.
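   - `\d` matches any digit (equivalent to `[0-9]` for ASCII text; in Python 3 string patterns it also matches other Unicode decimal digits unless `re.ASCII` is set).
   - `\w` matches any word character: letters, digits, and the underscore (equivalent to `[a-zA-Z0-9_]` under `re.ASCII`; by default Python 3 also matches Unicode letters and digits).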
   - `\s` matches any whitespace character.
   - `[ ]` defines a character class, matching any single character within the brackets.

3. **Quantifiers**: Quantifiers specify the number of occurrences of a pattern. For example:
   - `*` matches zero or more occurrences.
   - `+` matches one or more occurrences.
   - `?` matches zero or one occurrence.
   - `{n}` matches exactly n occurrences.
   - `{n,}` matches n or more occurrences.
   - `{n,m}` matches between n and m occurrences.

4. **Anchors**: Anchors are used to specify the position in the string where a match should occur. For example:
   - `^` matches the start of the string.
   - `$` matches the end of the string.
   - `\b` matches a word boundary.

5. **Grouping and Capturing**: Parentheses `()` are used to group patterns together. They also create capturing groups, which can be referenced later. For example:
   - `(abc)+` matches one or more occurrences of the sequence "abc".
   - `(a|b)` matches either "a" or "b".

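6. **Modifiers**: Modifiers are options or flags that change how a pattern is matched. In Python they are passed through the `flags` argument. For example:
   - `re.IGNORECASE` (`re.I`) makes the pattern case-insensitive.
   - `re.MULTILINE` (`re.M`) enables multi-line mode, where `^` and `$` match the start and end of each line.
   - Python has no `g` (global) flag as in JavaScript or Perl: `re.findall()` and `re.sub()` already process all matches, while `re.search()` returns only the first one.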

Regex provides a concise and expressive way to describe complex text patterns, but it can also be challenging to master due to its compact syntax and wide range of features. However, once you become familiar with regex, it becomes an invaluable tool for text processing and manipulation tasks.
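The concepts listed above can be tried out with Python's `re` module. This is a small sketch on a made-up sample string; the variable names are only illustrative.

```python
import re

sample = "Order 66 shipped.\norder 7 pending."

# Character class + quantifier: one or more digits.
digits = re.findall(r"[0-9]+", sample)

# Anchor + flags: with re.MULTILINE, ^ matches at the start of each line;
# re.IGNORECASE makes the match case-insensitive.
line_starts = re.findall(r"^order", sample, flags=re.IGNORECASE | re.MULTILINE)

# Capturing group: findall returns only the content of the group.
order_ids = re.findall(r"order (\d+)", sample, flags=re.IGNORECASE)

# Word boundary followed by a literal prefix and any word characters.
ship_words = re.findall(r"\bship\w*", sample)

# re.sub replaces every match, not only the first one.
masked = re.sub(r"\d+", "#", sample)

print(digits)       # ['66', '7']
print(line_starts)  # ['Order', 'order']
print(order_ids)    # ['66', '7']
print(ship_words)   # ['shipped']
print(masked)
```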



1. **What is the purpose of the `text_with_upper` variable and the subsequent `.lower()` method call?**
   - `text_with_upper` contains a string with mixed uppercase and lowercase characters. The `.lower()` method is called on this string to convert all characters to lowercase. This operation is commonly used for case normalization in text processing tasks to ensure consistent comparison and analysis.

2. **What does the `puncts_text` variable represent, and why are various methods applied to remove punctuation?**
   - `puncts_text` contains a string with various punctuation marks. Several methods are applied to remove punctuation from this string. This task is often performed in text preprocessing to eliminate noise and simplify text data for further analysis or processing.

3. **How do the different methods (`translate`, `regex`, `re.sub`, manual replacement) achieve punctuation removal?**
   - The methods shown (`translate`, `regex`, `re.sub`, manual replacement) all achieve the same goal of removing punctuation from the `puncts_text` string but using different approaches. They utilize string translation, regular expressions, and manual character replacement techniques. Each method has its advantages in terms of performance, readability, and flexibility.

4. **Why is there a comment mentioning a "slower solution"?**
   - The comment sits on the set-based method, which builds a set from `string.punctuation` and uses a generator expression to keep each character that is not in it. Looping over characters in Python code is generally slower than `str.translate()` or a compiled regex, and the gap matters most for long strings or large volumes of text.

5. **What is the purpose of the `clean_text` variable, and how does it differ from the original `puncts_text`?**
   - The `clean_text` variable stores the result of removing punctuation from the `puncts_text` string using a manual replacement method. It differs from the original `puncts_text` by containing the same text but with all punctuation characters removed. This cleaned version of the text may be easier to process or analyze in certain text-based applications.

## Numbers Removal <a name="removal-nb"></a>



In [56]:
text_numbers = '12abcd405'

### Using `str.translate` with `string.digits` <a name="rmvl-nb-tr"></a>

In [57]:
text_numbers.translate(str.maketrans('', '', string.digits))

'abcd'

### Using `re` <a name="rmvl-nb-re"></a>

In [58]:
re.sub(r'\d+', '', text_numbers)

'abcd'

In [59]:
re.sub(r'[0-9]+', '', text_numbers)

'abcd'

### Using `join()` and NOT `isdigit()` <a name="rmvl-nb-not-digit"></a>

In [60]:
"".join(i for i in text_numbers if not i.isdigit())

'abcd'

### Using `join()` and `isalpha()` <a name="rmvl-nb-not-alpha"></a>

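Strictly speaking, this is not equivalent to removing digits: `isalpha()` is True only for letters, so this also drops punctuation, whitespace, and every other non-letter character. It gives the same result here only because the sample contains nothing but letters and digits.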

In [61]:
"".join(i for i in text_numbers if i.isalpha())

'abcd'
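The difference between the two `join()` variants becomes visible once the input contains something other than letters and digits. A short sketch with a made-up sample string:

```python
mixed = "Room 12, floor 3!"

# Drops digits only; spaces and punctuation survive.
no_digits = "".join(ch for ch in mixed if not ch.isdigit())

# Keeps letters only; digits, spaces, and punctuation are all removed.
only_letters = "".join(ch for ch in mixed if ch.isalpha())

print(repr(no_digits))     # 'Room , floor !'
print(repr(only_letters))  # 'Roomfloor'
```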



1. **What does the `text_numbers` variable represent?**
   - `text_numbers` contains a string that includes both alphabetic characters and numerical digits. This string serves as the input text from which we aim to remove numerical digits.

2. **Why are multiple methods used to remove numbers from the text?**
   - The code demonstrates various methods to achieve the same task, which is removing numbers from the text. This variety showcases different approaches using string manipulation and regular expressions, allowing users to choose the method that best fits their requirements or preferences.

3. **What is the purpose of the `str.maketrans('', '', string.digits)` method used in the `translate` function?**
   - This method constructs a translation table to be used with the `translate` function. It maps each digit character to `None`, effectively removing all digits from the input string. This method is a straightforward and efficient way to perform character-based replacements in Python.

4. **How do regular expressions (`re.sub`) facilitate number removal in the text?**
   - Regular expressions provide a flexible and powerful way to match patterns within strings. The `re.sub` function with the pattern `r'\d+'` searches for one or more consecutive digits (`\d+`) in the text and replaces them with an empty string, effectively removing all numerical digits from the text.

5. **What is the purpose of the last two methods using generator expressions?**
   - The last two methods pass a generator expression to `join()` to iterate over each character in the `text_numbers` string. The first drops digits (`if not i.isdigit()`) and keeps every other character. The second keeps only alphabetic characters (`if i.isalpha()`), which removes digits but would also remove punctuation and whitespace. These methods demonstrate alternative approaches using Python's built-in string methods for character filtering.
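To see what `str.maketrans()` actually builds, the translation table can be inspected directly. The sketch below is illustrative; the `swap` table and its sample input are made up.

```python
import string

# The three-argument form with two empty strings only deletes characters.
table = str.maketrans('', '', string.digits)

# The table is a dict that maps code points to None (meaning "delete").
print(table[ord('0')])  # None
print(len(table))       # 10, one entry per digit

# The first two arguments map characters one-to-one,
# while the third still lists characters to delete.
swap = str.maketrans('ab', 'xy', string.digits)
print("a1b2c3".translate(swap))  # 'xyc'
```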

## HTML Tags Removal <a name="html-rmvl"></a>

Solutions from [Stack Overflow](https://stackoverflow.com/questions/753052/strip-html-from-strings-in-python)

In [62]:
html_text = """<tr class="color-5 negri a-bottom">
<td class="a-center" width="11%"><div style="min-width: 80px">3-Pointers</div></td>
<td><div class="left" style="min-width: 120px; max-width:175px; width: 57%">
<div class="left margen-l2">Player</div>
<div class="right"> Team</div>
</div>
</td>
<td><div style="min-width: 60px; ">Season</div></td>
<td><div class="">W/L Game</div>
</td>
</tr>"""


### Using `re` <a name="html-rmvl-re"></a>

 - begins with the tag opening `<`
 - then one or more characters that are not `<` (`[^<]+?`, matched lazily)
 - ends with the tag closing `>`

In [63]:
re.sub('<[^<]+?>', '', html_text)

'\n3-Pointers\n\nPlayer\n Team\n\n\nSeason\nW/L Game\n\n'

### Using `BeautifulSoup` <a name="html-rmvl-bs"></a>

 - `get_text()` removes HTML tags
 - `strip=True` strips leading and trailing whitespace (including newlines) from each text fragment

In [64]:
from bs4 import BeautifulSoup

In [65]:
soup = BeautifulSoup(html_text, 'html.parser')
soup.get_text(",", strip=True)

'3-Pointers,Player,Team,Season,W/L Game'

### Extreme example

In [66]:
html_comment = "<img<!-- --> src=x onerror=alert(1);//><!-- -->"

Both solutions fail

In [67]:
re.sub('<[^<]+?>', '', html_comment)

'<img src=x onerror=alert(1);//>'

In [68]:
soup = BeautifulSoup(html_comment, 'html.parser')
soup.get_text(",", strip=True)

'src=x onerror=alert(1);//>'

In [69]:
tag_re = re.compile(r'(<!--.*?-->|<[^>]*>)')
no_tags = tag_re.sub('', html_comment)
no_tags

' src=x onerror=alert(1);//>'

In [70]:
import html

html.escape(no_tags)

' src=x onerror=alert(1);//&gt;'



1. **What does the provided `html_text` variable contain?**
   - `html_text` is a string containing HTML markup representing a table row (`<tr>`) with various attributes and nested elements such as table cells (`<td>`) and div containers (`<div>`). The text inside these HTML tags may include content like player names, team names, and other information related to basketball statistics.

2. **How does the `re.sub('<[^<]+?>', '', html_text)` line remove HTML tags from the `html_text` variable?**
   - This line uses a regular expression pattern (`<[^<]+?>`) to match any HTML tag (`<...>`) and replace it with an empty string. This effectively removes all HTML tags and leaves only the text content within the HTML markup.

3. **What is the purpose of using BeautifulSoup in the code?**
   - BeautifulSoup is a Python library used for parsing HTML and XML documents, extracting data, and navigating the document tree. In this code, BeautifulSoup is used to parse the `html_text` string and retrieve the text content while discarding the HTML tags. This is achieved using the `get_text()` method with appropriate parameters.

4. **How does the `tag_re.sub('', html_comment)` line remove HTML comments from the `html_comment` variable?**
   - This line utilizes a regular expression (`tag_re`) to match HTML comments (`<!-- ... -->`) and HTML tags (`<...>`), then replaces them with an empty string. This effectively removes both HTML comments and tags from the `html_comment` string.

5. **Why is `html.escape(no_tags)` used in the code?**
   - The `html.escape()` function is used to escape special characters in the `no_tags` string, ensuring that the resulting text is safe to display as plain text in HTML. This helps prevent security vulnerabilities such as cross-site scripting (XSS) attacks by converting characters like `<`, `>`, `&`, and `"` into HTML entities such as `&lt;` and `&gt;`.
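As a standard-library alternative to the regex and BeautifulSoup approaches, text can also be collected with `html.parser.HTMLParser`. The sketch below is a minimal illustration; the class name `TextExtractor` and the helper `strip_tags` are hypothetical, and this is not a sanitizer for untrusted markup.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects text nodes; tags and comments are ignored."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        # Called for every run of text between tags.
        self.parts.append(data)


def strip_tags(markup: str) -> str:
    parser = TextExtractor()
    parser.feed(markup)
    parser.close()  # flush any text still buffered by the parser
    return "".join(parser.parts)


print(strip_tags("<p>Hello <b>world</b></p>"))
```

If the extracted text is later placed back into an HTML page, it should still be passed through `html.escape()`.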

## URL Removal <a name="url-rmvl"></a>

From this [StackOverflow Question](https://stackoverflow.com/questions/11331982/how-to-remove-any-url-within-a-string-in-python/11332580)

In [71]:
text_url = """text1
text2
http://url.com/bla1/blah1/
text3
text4
http://url.com/bla2/blah2/
text5
text6
http://url.com/bla3/blah3/"""

### Using `re` <a name="url-rmvl-re"></a>

In [72]:
re.sub(r'http\S+', '', text_url)

'text1\ntext2\n\ntext3\ntext4\n\ntext5\ntext6\n'
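The pattern `http\S+` is simple, but it matches any token that merely starts with `http` and misses links without a scheme. The following sketch shows one possible, slightly stricter variant; the pattern and the helper name `remove_urls` are assumptions for illustration and will not cover every URL form.

```python
import re

# Match tokens starting with http://, https://, or www.
URL_PATTERN = re.compile(r"(?:https?://|www\.)\S+", flags=re.IGNORECASE)


def remove_urls(text: str) -> str:
    return URL_PATTERN.sub("", text)


sample = "Docs at https://example.com/a?b=1 and www.example.org, see also httpd config."
print(remove_urls(sample))

# For comparison, the simpler pattern also removes the word 'httpd'.
print(re.sub(r"http\S+", "", "see httpd config"))
```

Note that `\S+` also consumes punctuation directly attached to the link, such as the trailing comma in the sample.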

## Newlines, spaces and tabs removal <a name="spaces-rmvl"></a>

### Using `str.split()` and `join()` <a name="spaces-rmvl-split"></a>

In [73]:
my_str="I want to Remove all white \t\n\n\r spaces, new lines \n and tabs \t"
" ".join(my_str.split())

'I want to Remove all white spaces, new lines and tabs'

### Using `re` <a name="spaces-rmvl-re"></a>

 - **`\s` stands for whitespace character, equivalent to `[ \t\n\r\f\v]` for ASCII text** (Python 3 string patterns also match other Unicode whitespace)
 - **`\S` stands for not whitespace character, equivalent to `[^\s]`**

In [74]:
re.sub(r'\s+', ' ', my_str)


'I want to Remove all white spaces, new lines and tabs '

In [75]:
re.sub(r'[^\S]+', ' ', my_str)


'I want to Remove all white spaces, new lines and tabs '

In [76]:
re.sub('[\t\n\r\f ]+', ' ', my_str)

'I want to Remove all white spaces, new lines and tabs '

### Using `re.findall()` <a name="spaces-rmvl-re-findall"></a>

Taken from [StackOverflow](https://stackoverflow.com/questions/4697882/how-can-i-find-all-matches-to-a-regular-expression-in-python)

NOTE: I believe this approach is slower than the one with `re.sub()`.

In [77]:
match = re.findall(r'[\w]+ ', my_str)
"".join(match)


'I want to Remove all white new lines and tabs '



1. **What is the purpose of the `text_url` variable?**
   - `text_url` is a multiline string containing text along with URLs interspersed within it. These URLs may represent links to various web pages or resources.

2. **How does the `re.sub(r'http\S+', '', text_url)` line remove URLs from the `text_url` variable?**
   - This line uses a regular expression pattern (`http\S+`) to match any URL starting with `http` followed by non-whitespace characters (`\S+`). It then replaces all matched URLs with an empty string, effectively removing them from the text.

3. **What is the purpose of using `str.split()` and `join()` to remove white spaces, new lines, and tabs?**
   - These methods are used to remove white spaces, new lines, and tabs from the given string `my_str`. By splitting the string into a list of words using `split()`, and then rejoining these words with a single space separator using `join()`, all consecutive whitespace characters are effectively replaced by a single space.

4. **How do the different regular expressions (`\s+`, `[^\S]+`, `[\t\n\r\f ]+`) achieve whitespace removal in the text?**
   - Each of these regular expressions matches one or more whitespace characters, including spaces, tabs, newlines, and carriage returns. They then replace these matches with a single space, effectively collapsing multiple consecutive whitespace characters into a single space.

5. **What does the `re.findall('[\w]+ ', my_str)` line accomplish in the context of whitespace removal?**
   - This line uses the pattern `[\w]+ ` to match one or more word characters (letters, digits, and underscores) followed by a single space, and `re.findall()` returns all such matches. Joining them with `"".join(match)` rebuilds the sentence with single spaces, but the approach is lossy: a word that is not directly followed by a space is dropped, which is why `spaces,` is missing from the output above.
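The whitespace approaches above differ in small ways that are easy to overlook. This sketch compares them on the same string used in the notebook:

```python
import re

my_str = "I want to Remove all white \t\n\n\r spaces, new lines \n and tabs \t"

# split() without arguments discards leading and trailing whitespace too.
joined = " ".join(my_str.split())

# re.sub() collapses each whitespace run, so a trailing run becomes one space.
subbed = re.sub(r"\s+", " ", my_str)

# findall() keeps only words that are directly followed by a space.
found = "".join(re.findall(r"\w+ ", my_str))

print(repr(joined))
print(repr(subbed))
print(repr(found))
```

Calling `.strip()` on the `re.sub()` result makes it identical to the `split()` and `join()` result.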

## Emojis Removal <a name="emoji-rmvl"></a>

Found [here](https://gist.github.com/slowkow/7a7f61f495e3dbb7e3d767f97bd7304b#gistcomment-3315605)

In [78]:
text_emojis = u"Hi 🤔 How is your 🙈 and 😌. Have a nice weekend 💕👭👙\U0001F600\U0001F300"
text_emojis

'Hi 🤔 How is your 🙈 and 😌. Have a nice weekend 💕👭👙😀🌀'

### Using `re` <a name="emoji-rmvl-re"></a>

In [79]:
emojis_pattern = re.compile(pattern="["
                    u"\U0001F600-\U0001F64F"  # emoticons
                    u"\U0001F300-\U0001F5FF"  # symbols & pictographs
                    u"\U0001F680-\U0001F6FF"  # transport & map symbols
                    u"\U0001F1E0-\U0001F1FF"  # regional indicators (flags)
                    u"\U00002500-\U00002BEF"  # box drawing through miscellaneous symbols and arrows
                    u"\U00002702-\U000027B0"  # dingbats
                    u"\U000024C2-\U0001F251"  # very broad: also covers CJK and most of the BMP above U+24C2
                    u"\U0001f926-\U0001f937"
                    u"\U00010000-\U0010ffff"  # every supplementary-plane character
                    u"\u2640-\u2642"
                    u"\u2600-\u2B55"
                    u"\u200d"  # zero-width joiner
                    u"\u23cf"
                    u"\u23e9"
                    u"\u231a"
                    u"\ufe0f"  # variation selector-16
                    u"\u3030"
                "]+", flags = re.UNICODE)

emojis_pattern.sub(r'', text_emojis)

'Hi  How is your  and . Have a nice weekend '
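Some ranges in the pattern above are very wide and also match characters that are not emojis, for example CJK text. The sketch below contrasts two of those wide ranges with a narrower, hand-picked set of blocks; the names `BROAD` and `NARROW` and the choice of blocks are assumptions for illustration, and the narrow set is not a complete list of emoji code points.

```python
import re

# Two of the wide ranges from the pattern above.
BROAD = re.compile("[\U000024C2-\U0001F251\U00010000-\U0010FFFF]+")

# A narrower, hand-picked selection of emoji-related blocks.
NARROW = re.compile(
    "["
    "\U0001F1E0-\U0001F1FF"  # regional indicator symbols (flags)
    "\U0001F300-\U0001F5FF"  # miscellaneous symbols and pictographs
    "\U0001F600-\U0001F64F"  # emoticons
    "\U0001F680-\U0001F6FF"  # transport and map symbols
    "\U0001F900-\U0001F9FF"  # supplemental symbols and pictographs
    "\u2600-\u27BF"          # miscellaneous symbols and dingbats
    "\u200d\ufe0f"           # zero-width joiner, variation selector-16
    "]+"
)

# A thinking-face emoji followed by three Japanese characters.
sample = "Hi \U0001F914 \u65e5\u672c\u8a9e"

print(repr(BROAD.sub("", sample)))   # the Japanese text is removed as well
print(repr(NARROW.sub("", sample)))  # only the emoji is removed
```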

## Replacing Accented Characters <a name="accented"></a>

### Using `unidecode` <a name="accented-unidecode"></a>

In [80]:
import unidecode
text_accented = "Málaga, àéêöhello. Polish: ńŃćĆśŚęąóżŻźŹ letters. German üöäöß letters"

unidecode.unidecode(text_accented)

'Malaga, aeeohello. Polish: nNcCsSeaozZzZ letters. German uoaoss letters'



1. **What does the `text_emojis` variable contain, and how are emojis represented in Python strings?**
   - `text_emojis` contains a string with various emojis encoded as Unicode characters, such as 🤔, 🙈, and 😌. Emojis in Python strings are represented using Unicode characters, which allow for the inclusion of diverse symbols and characters from different languages and character sets.

2. **What is the purpose of the `emojis_pattern` variable and the associated regular expression pattern?**
   - `emojis_pattern` defines a regular expression pattern that matches a wide range of Unicode characters representing emojis, symbols, and pictographs. This pattern is designed to identify and extract emojis from text data, enabling their removal or processing separately from other text content.

3. **How does the `emojis_pattern.sub(r'', text_emojis)` line remove emojis from the `text_emojis` variable?**
   - This line uses the `sub()` method of the `emojis_pattern` regex object to substitute all matches of the emoji pattern with an empty string (`r''`). As a result, all emojis identified by the regex pattern are effectively removed from the original text.

4. **What does the `unidecode.unidecode(text_accented)` function accomplish?**
   - The `unidecode.unidecode()` function is used to remove diacritics and accents from accented characters in the `text_accented` string. It transliterates accented characters into their ASCII equivalents, making the text more suitable for processing and comparison in contexts where accents are not significant.

5. **Why is handling accented characters important in text processing tasks?**
   - Accented characters can pose challenges in text processing and analysis tasks due to variations in character encodings and representations across different systems and languages. Removing accents using methods like `unidecode` helps standardize text data, ensuring consistent and accurate processing across different environments and applications.
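When installing `unidecode` is not an option, a rough approximation is possible with the standard library's `unicodedata` module. This is a sketch under the assumption that only combining accents need to be removed; the helper name `strip_accents` is hypothetical, and unlike `unidecode` it leaves characters without a decomposition (such as `ß`) unchanged.

```python
import unicodedata


def strip_accents(text: str) -> str:
    # NFKD splits an accented letter into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFKD", text)
    # Drop the combining marks and keep everything else.
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


print(strip_accents("Málaga"))
print(strip_accents("ńęż"))
print(strip_accents("ß"))  # unchanged: no decomposition exists
```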

## Spelling corrections <a name="spell"></a>

Solutions from [StackOverflow](https://stackoverflow.com/questions/13928155/spell-checker-for-python)

### Using `spellchecker` <a name="spell-checker"></a>

In [81]:
from spellchecker import SpellChecker

spell = SpellChecker()

def correct_spelling(text):
    corrected_text = list()
    misspelled_words = spell.unknown(text.split())
    for word in text.split():
        next_word = word
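        if word in misspelled_words:
            # correction() may return None in recent pyspellchecker versions
            # when no candidate is found, so fall back to the original word
            next_word = spell.correction(word) or word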
        corrected_text.append(next_word)

    return " ".join(corrected_text)

text_misspelled = "I realli needt smoe corection. This sentnce has mispelled wirds"
correct_spelling(text_misspelled)

'I really need some corrections This sentence has misspelled words'

### Using `autocorrect`<a name="spell-autocorrect"></a>

In [82]:
from autocorrect import Speller

speller = Speller(lang='en')

print(speller(text_misspelled))

I really need some correction. This sentence has misspelled words


### Using `textblob` <a name="spell-tb"></a>

Found [here](https://www.geeksforgeeks.org/python-textblob-correct-method/)

In [83]:
from textblob import TextBlob

TextBlob(text_misspelled).correct()

TextBlob("I really need some correction. His sentence has dispelled words")



1. **What is the purpose of the `correct_spelling` function in the code?**
   - The `correct_spelling` function takes a string of text as input and corrects the misspelled words using a spellchecker. It splits the text into words, identifies misspelled words, and replaces them with their corrected versions using the `SpellChecker` from the `spellchecker` library.

2. **How does the `SpellChecker` from the `spellchecker` library work in the context of spelling correction?**
   - The `SpellChecker` object from the `spellchecker` library provides methods to identify misspelled words in a text and suggest corrections for them. It compares candidate words within a small edit distance of the input against a word-frequency list and returns the most likely match. Each word is checked in isolation, without using the surrounding context.

3. **What is the purpose of the `autocorrect.Speller` object from the `autocorrect` library?**
   - The `autocorrect.Speller` object provides a similar functionality to the `SpellChecker`, but it may use a different algorithm or dictionary for spell correction. In this notebook it is called on the whole string and returns the corrected text directly.

4. **How does the `TextBlob` object from the `textblob` library handle spelling correction?**
   - The `TextBlob` object provides a range of natural language processing (NLP) functionalities, including spelling correction. When the `correct()` method is called on a `TextBlob` object, it corrects the spelling of words using an approach based on Peter Norvig's spelling corrector. The output above shows that it can also change correctly spelled words (`This` became `His`).

5. **What are the differences between the `SpellChecker`, `autocorrect.Speller`, and `TextBlob` approaches to spelling correction?**
   - While all three approaches aim to correct misspelled words in a text, they may use different algorithms, dictionaries, and language models. The effectiveness of each approach may vary depending on factors such as the quality of the dictionary, the complexity of the language, and the context of the text. Users may choose the most suitable approach based on their specific requirements and preferences.
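The general idea behind frequency-based spelling correction can be sketched in a few lines. This is a simplified illustration in the spirit of Peter Norvig's corrector, not the implementation of any of the three libraries; the vocabulary and its counts in `WORD_COUNTS` are made up, and only candidates at edit distance 1 are considered.

```python
from collections import Counter

# Hypothetical toy vocabulary with made-up frequencies.
# Real libraries ship large word-frequency lists.
WORD_COUNTS = Counter({
    "really": 50, "need": 80, "some": 120, "correction": 10,
    "sentence": 30, "misspelled": 5, "words": 60,
})

LETTERS = "abcdefghijklmnopqrstuvwxyz"


def edits1(word):
    """All strings that are one edit away from `word`."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [left + right[1:] for left, right in splits if right]
    transposes = [left + right[1] + right[0] + right[2:]
                  for left, right in splits if len(right) > 1]
    replaces = [left + c + right[1:]
                for left, right in splits if right for c in LETTERS]
    inserts = [left + c + right for left, right in splits for c in LETTERS]
    return set(deletes + transposes + replaces + inserts)


def correct(word):
    """Return the most frequent known word within one edit, else the input."""
    if word in WORD_COUNTS:
        return word
    candidates = [w for w in edits1(word) if w in WORD_COUNTS]
    if not candidates:
        return word
    return max(candidates, key=lambda w: WORD_COUNTS[w])


for typo in ["realli", "needt", "smoe", "sentnce", "wirds"]:
    print(typo, "->", correct(typo))
```

Because the choice depends only on word frequency and edit distance, a corrector of this kind can replace a valid but rare word with a more frequent neighbour.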

### Create a longer text

Beginning of the Metamorphosis by Franz Kafka

In [84]:
text_lines = """One morning, when Gregor Samsa woke from troubled dreams,
 he found himself transformed in his bed into a horrible vermin.  He lay
 on his armour-like back, and if he lifted his head a little he could see his
 brown belly, slightly domed and divided by arches into stiff sections.
 The bedding was hardly able to cover it and seemed ready to slide off any
 moment.  His many legs, pitifully thin compared with the size of the rest of
 him, waved about helplessly as he looked.  "What's happened to me?" he thought.
 It wasn't a dream.  His room, a proper human room although a little too small,
 lay peacefully between its four familiar walls.  A collection of textile
 samples lay spread out on the table - Samsa was a travelling salesman - and
 above it there hung a picture that he had recently cut out of an illustrated
 magazine and housed in a nice, gilded frame.  It showed a lady fitted out with
 a fur hat and fur boa who sat upright, raising a heavy fur muff that covered
 the whole of her lower arm towards the viewer.  Gregor then turned to look out
 the window at the dull weather."""
text_lines

'One morning, when Gregor Samsa woke from troubled dreams,\n he found himself transformed in his bed into a horrible vermin.  He lay\n on his armour-like back, and if he lifted his head a little he could see his\n brown belly, slightly domed and divided by arches into stiff sections.\n The bedding was hardly able to cover it and seemed ready to slide off any\n moment.  His many legs, pitifully thin compared with the size of the rest of\n him, waved about helplessly as he looked.  "What\'s happened to me?" he thought.\n It wasn\'t a dream.  His room, a proper human room although a little too small,\n lay peacefully between its four familiar walls.  A collection of textile\n samples lay spread out on the table - Samsa was a travelling salesman - and\n above it there hung a picture that he had recently cut out of an illustrated\n magazine and housed in a nice, gilded frame.  It showed a lady fitted out with\n a fur hat and fur boa who sat upright, raising a heavy fur muff that covered\n t

## Tokenize Words <a name="tknz"></a>

### Using `re` <a name="tknz-re"></a>

Return the list of words

In [85]:
re_tokens = re.findall(r'[\w]+', text_lines)  # raw string avoids the invalid-escape warning for \w
print(len(re_tokens))
print(re_tokens)

200
['One', 'morning', 'when', 'Gregor', 'Samsa', 'woke', 'from', 'troubled', 'dreams', 'he', 'found', 'himself', 'transformed', 'in', 'his', 'bed', 'into', 'a', 'horrible', 'vermin', 'He', 'lay', 'on', 'his', 'armour', 'like', 'back', 'and', 'if', 'he', 'lifted', 'his', 'head', 'a', 'little', 'he', 'could', 'see', 'his', 'brown', 'belly', 'slightly', 'domed', 'and', 'divided', 'by', 'arches', 'into', 'stiff', 'sections', 'The', 'bedding', 'was', 'hardly', 'able', 'to', 'cover', 'it', 'and', 'seemed', 'ready', 'to', 'slide', 'off', 'any', 'moment', 'His', 'many', 'legs', 'pitifully', 'thin', 'compared', 'with', 'the', 'size', 'of', 'the', 'rest', 'of', 'him', 'waved', 'about', 'helplessly', 'as', 'he', 'looked', 'What', 's', 'happened', 'to', 'me', 'he', 'thought', 'It', 'wasn', 't', 'a', 'dream', 'His', 'room', 'a', 'proper', 'human', 'room', 'although', 'a', 'little', 'too', 'small', 'lay', 'peacefully', 'between', 'its', 'four', 'familiar', 'walls', 'A', 'collection', 'of', 'textile



### Using `nltk` <a name="tknz-nltk"></a>

In [86]:
import nltk
from nltk.tokenize import word_tokenize

# Download both tokenizer resources
nltk.download('punkt')
nltk.download('punkt_tab')

text_lines = "This is a sample sentence. Let's test tokenization!"
nltk_tokens = word_tokenize(text_lines)
print(len(nltk_tokens), nltk_tokens)


11 ['This', 'is', 'a', 'sample', 'sentence', '.', 'Let', "'s", 'test', 'tokenization', '!']


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!


### Using `spaCy` <a name="tknz-spacy"></a>

In [87]:
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp(text_lines)
spacy_tokens = list(doc)  # iterating a Doc yields Token objects
print(len(spacy_tokens))
print(spacy_tokens)

11
[This, is, a, sample, sentence, ., Let, 's, test, tokenization, !]


### Using `gensim` <a name="tknz-gensim"></a>

In [88]:
!pip install gensim

from gensim.utils import tokenize

gensim_tokens = list(tokenize(text_lines))
print(len(gensim_tokens))
print(gensim_tokens)

9
['This', 'is', 'a', 'sample', 'sentence', 'Let', 's', 'test', 'tokenization']




1. **What does the `text_lines` variable contain, and what is its significance in the code?**
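   - `text_lines` initially contains a multi-line string with a passage from Franz Kafka's novella "The Metamorphosis". This text is the input for the `re` tokenization, where the goal is to split the text into individual words or tokens. Note that the `nltk` cell rebinds `text_lines` to a short sample sentence, so the `nltk`, `spaCy`, and `gensim` outputs above are for that sentence.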

2. **How does the `re.findall('[\w]+', text_lines)` line tokenize the text using regular expressions?**
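   - This line uses a regular expression pattern (`[\w]+`) to match one or more word characters (letters, digits, and underscores). `re.findall` returns all such matches in the `text_lines` string, which tokenizes the text into words and drops punctuation. The square brackets are redundant here, and in Python source the pattern is best written as a raw string (`r'[\w]+'`), because `\w` inside a normal string literal is an invalid escape sequence. This method splits contractions (*wasn't* becomes *wasn* and *t*) and hyphenated words (*armour-like* becomes *armour* and *like*).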

3. **What is the purpose of the `nltk.tokenize.word_tokenize()` function call?**
   - The `word_tokenize()` function from the NLTK library tokenizes text with a more sophisticated approach than a single regular expression. It first splits the text into sentences with the pre-trained Punkt model (which is why the `punkt` resources are downloaded) and then applies a rule-based, Treebank-style word tokenizer to each sentence. This keeps punctuation marks as separate tokens and splits contractions such as *Let's* into *Let* and *'s*.

4. **How does the `spacy.load("en_core_web_sm")` and subsequent tokenization using spaCy work?**
   - This code loads the small English pipeline (`en_core_web_sm`) provided by spaCy, which bundles a tokenizer with trained components for linguistic annotation. Calling the resulting `nlp` object on the `text_lines` string creates a spaCy `Doc` object representing the analyzed text. Tokenization is the first step of the pipeline, and the resulting `Doc` contains the tokens as individual `Token` objects.

5. **What is the purpose of the `gensim.utils.tokenize()` function for text tokenization?**
   - The `tokenize()` function from the Gensim library is a simple, regular-expression-based tokenizer: it yields runs of alphabetic characters, so punctuation and digits are dropped. The tokens are returned as a generator, so they can be consumed one at a time without first building the full token list in memory. Optional arguments control lowercasing and accent removal.
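
A generator-based tokenizer in the spirit of the last answer can be sketched with the standard library alone. The pattern and the name `iter_tokens` are assumptions made for this illustration, not Gensim's actual implementation; the sketch only shows how lazily yielding matches avoids materializing the whole token list.

```python
import re

# Assumed pattern for illustration: runs of word characters that are not digits.
_NON_DIGIT_WORD_RUN = re.compile(r"(?:(?!\d)\w)+")


def iter_tokens(text, lowercase=False):
    """Yield tokens one at a time instead of returning a list."""
    for match in _NON_DIGIT_WORD_RUN.finditer(text):
        token = match.group()
        yield token.lower() if lowercase else token


sample = "This is a sample sentence. Let's test tokenization!"
tokens = list(iter_tokens(sample))
print(len(tokens), tokens)
```

Wrapping the call in `list(...)` is only needed for printing; in a pipeline the generator can be passed straight to the next step.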

### Tokenization Comparison <a name="tknz-compare"></a>

These observations were made on the Kafka passage; the short sample sentence used in the `nltk`, `spaCy`, and `gensim` cells above does not contain *wasn't* or *armour-like*.

 - `gensim` appears to use a regular expression much like the `re` approach shown above. Both return only words, without punctuation.
 - `nltk` and `spacy` also return punctuation marks as tokens.
 - `spacy` treats whitespace as a token when there is a double space. In the Kafka text each sentence-ending dot is followed by two spaces. We could clean this up, but it shows that `spacy` behaves differently.
 - The **not / n't** contraction gives different results. `spacy` and `nltk` split *wasn't* into *was* and *n't*, whereas `re` and `gensim` split it into *wasn* and *t*.
 - When there is a hyphen between words, we get three different results. Our example is *armour-like*. `nltk` returns **a single token**: *armour-like*. `re` and `gensim` return **two tokens**: *armour* and *like*. `spacy` returns **three tokens**: *armour*, *-*, and *like*.
 - `nltk` converts double quotation marks. Opening quote: **``**, closing quote: **''** (two single quotes).
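
If splitting *wasn't* and *armour-like* is unwanted, one option is a slightly richer pattern that allows a single apostrophe or hyphen between word characters. This is only a sketch, not what any of the libraries above do, and the name `WORD_WITH_JOINERS` is made up for this example.

```python
import re

PLAIN_WORD = re.compile(r"\w+")
# Word characters, optionally continued by an apostrophe or hyphen plus more word characters.
WORD_WITH_JOINERS = re.compile(r"\w+(?:['-]\w+)*")

sample = "It wasn't a dream. He lay on his armour-like back."

plain = PLAIN_WORD.findall(sample)
joined = WORD_WITH_JOINERS.findall(sample)
print(len(plain), plain)
print(len(joined), joined)
```

A hyphen surrounded by spaces, as in "the table - Samsa", is still dropped, because the joiner must sit directly between two word characters.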

## Tokenize Sentences <a name="tknz-sents"></a>

### Using `re` <a name="tknz-sents-re"></a>

In [89]:
re_sentences = re.compile('[.?!]').split(text_lines)
len(re_sentences), re_sentences

(3, ['This is a sample sentence', " Let's test tokenization", ''])

### Using `nltk` <a name="tknz-sents-nltk"></a>

In [90]:
from nltk.tokenize import sent_tokenize

nltk_sentences = sent_tokenize(text_lines)
print(len(nltk_sentences), nltk_sentences)

2 ['This is a sample sentence.', "Let's test tokenization!"]


### Using `spaCy` <a name="tknz-sents-spacy"></a>

In [91]:
nlp = spacy.load("en_core_web_sm")
doc = nlp(text_lines)
spacy_sentences = list(doc.sents)  # doc.sents yields one Span per sentence
len(spacy_sentences), spacy_sentences

(2, [This is a sample sentence., Let's test tokenization!])



1. **What does the `re.compile('[.?!]').split(text_lines)` line accomplish for sentence tokenization?**
   - This line uses a regular expression pattern (`[.?!]`) to split the `text_lines` string into a list of substrings at each occurrence of a period, question mark, or exclamation mark. Each substring represents a sentence, effectively tokenizing the text into sentences. However, this method may not handle all cases accurately, such as abbreviations or ellipses within sentences.

2. **How does the `nltk.tokenize.sent_tokenize()` function tokenize the text into sentences?**
   - The `sent_tokenize()` function from the NLTK library is specifically designed for sentence tokenization. It uses the pre-trained Punkt model, which has learned typical abbreviations and sentence starters, to decide whether a period, question mark, or exclamation mark really ends a sentence. Unlike the `re` split, it keeps the closing punctuation attached to each sentence.

3. **What is the purpose of loading the spaCy English language model (`en_core_web_sm`) in the code?**
   - spaCy performs sentence segmentation as part of its processing pipeline. Loading `en_core_web_sm` provides the trained pipeline components that set sentence boundaries when the text is processed.

4. **How does the `doc.sents` attribute in spaCy tokenize the text into sentences?**
   - Calling `nlp(text_lines)` creates a `Doc` object representing the analyzed text. The `doc.sents` attribute returns an iterator over the sentence spans detected in the text. Each item is a `Span` covering one sentence, which allows easy iteration over the sentences.

5. **What are the advantages of using spaCy for sentence tokenization compared to other methods?**
   - With a trained pipeline such as `en_core_web_sm`, spaCy by default derives sentence boundaries from the dependency parse rather than from punctuation alone, so it can cope with abbreviations and other cases where a simple split on `.`, `?`, or `!` fails. The trade-off is that running the full pipeline is slower than a regular expression. Sentence boundaries also come together with the other annotations on the same `Doc` (tokens, lemmas, POS tags), which is convenient when later steps need them.
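
The limits of a purely regular-expression-based splitter can be seen in a short sketch. It improves on the `re` cell above by splitting on the whitespace after the punctuation, so the punctuation stays attached, but it still breaks on abbreviations. The function name `split_sentences` and the second example sentence are made up for this illustration.

```python
import re

# Split on whitespace that follows ., ? or !, so the punctuation stays attached.
SENTENCE_BREAK = re.compile(r"(?<=[.?!])\s+")


def split_sentences(text):
    """Naive sentence splitter; it has no notion of abbreviations."""
    return [part for part in SENTENCE_BREAK.split(text.strip()) if part]


clean = split_sentences("This is a sample sentence. Let's test tokenization!")
broken = split_sentences("Dr. Samsa woke up. He was late.")
print(len(clean), clean)
print(len(broken), broken)
```

The second call returns three pieces because "Dr." is treated as a sentence end, which is the kind of case the trained `nltk` and `spaCy` segmenters are meant to handle.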

## Stopwords Removal <a name="stopwords"></a>

We'll ignore punctuation; the `re` tokenization step already gives us tokens without it. `re_tokens` still holds the 200 tokens of the Kafka passage.

### Using `nltk` <a name="stopwords-nltk"></a>

In [92]:
from nltk.corpus import stopwords
nltk.download('stopwords')

stop_words_nltk = stopwords.words('english')
# print(len(stop_words_nltk),stop_words_nltk)

filtered_nltk = [word for word in re_tokens if word.lower() not in stop_words_nltk]
print(len(filtered_nltk), filtered_nltk)

106 ['One', 'morning', 'Gregor', 'Samsa', 'woke', 'troubled', 'dreams', 'found', 'transformed', 'bed', 'horrible', 'vermin', 'lay', 'armour', 'like', 'back', 'lifted', 'head', 'little', 'could', 'see', 'brown', 'belly', 'slightly', 'domed', 'divided', 'arches', 'stiff', 'sections', 'bedding', 'hardly', 'able', 'cover', 'seemed', 'ready', 'slide', 'moment', 'many', 'legs', 'pitifully', 'thin', 'compared', 'size', 'rest', 'waved', 'helplessly', 'looked', 'happened', 'thought', 'dream', 'room', 'proper', 'human', 'room', 'although', 'little', 'small', 'lay', 'peacefully', 'four', 'familiar', 'walls', 'collection', 'textile', 'samples', 'lay', 'spread', 'table', 'Samsa', 'travelling', 'salesman', 'hung', 'picture', 'recently', 'cut', 'illustrated', 'magazine', 'housed', 'nice', 'gilded', 'frame', 'showed', 'lady', 'fitted', 'fur', 'hat', 'fur', 'boa', 'sat', 'upright', 'raising', 'heavy', 'fur', 'muff', 'covered', 'whole', 'lower', 'arm', 'towards', 'viewer', 'Gregor', 'turned', 'look', 'w

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


### Using `spaCy` <a name="stopwords-spacy"></a>

In [93]:
stop_words_spacy = nlp.Defaults.stop_words

filtered_spacy = [word for word in re_tokens if word.lower() not in stop_words_spacy]
print(len(filtered_spacy), filtered_spacy)

99 ['morning', 'Gregor', 'Samsa', 'woke', 'troubled', 'dreams', 'found', 'transformed', 'bed', 'horrible', 'vermin', 'lay', 'armour', 'like', 'lifted', 'head', 'little', 'brown', 'belly', 'slightly', 'domed', 'divided', 'arches', 'stiff', 'sections', 'bedding', 'hardly', 'able', 'cover', 'ready', 'slide', 'moment', 'legs', 'pitifully', 'thin', 'compared', 'size', 'rest', 'waved', 'helplessly', 'looked', 's', 'happened', 'thought', 'wasn', 't', 'dream', 'room', 'proper', 'human', 'room', 'little', 'small', 'lay', 'peacefully', 'familiar', 'walls', 'collection', 'textile', 'samples', 'lay', 'spread', 'table', 'Samsa', 'travelling', 'salesman', 'hung', 'picture', 'recently', 'cut', 'illustrated', 'magazine', 'housed', 'nice', 'gilded', 'frame', 'showed', 'lady', 'fitted', 'fur', 'hat', 'fur', 'boa', 'sat', 'upright', 'raising', 'heavy', 'fur', 'muff', 'covered', 'lower', 'arm', 'viewer', 'Gregor', 'turned', 'look', 'window', 'dull', 'weather']


### Using `gensim` <a name="stopwords-gensim"></a>

In [94]:
!pip install gensim

from gensim.parsing.preprocessing import STOPWORDS

stop_words_gensim = STOPWORDS

filtered_gensim = [word for word in re_tokens if word.lower() not in stop_words_gensim]
print(len(filtered_gensim), filtered_gensim)

97 ['morning', 'Gregor', 'Samsa', 'woke', 'troubled', 'dreams', 'transformed', 'bed', 'horrible', 'vermin', 'lay', 'armour', 'like', 'lifted', 'head', 'little', 'brown', 'belly', 'slightly', 'domed', 'divided', 'arches', 'stiff', 'sections', 'bedding', 'hardly', 'able', 'cover', 'ready', 'slide', 'moment', 'legs', 'pitifully', 'compared', 'size', 'rest', 'waved', 'helplessly', 'looked', 's', 'happened', 'thought', 'wasn', 't', 'dream', 'room', 'proper', 'human', 'room', 'little', 'small', 'lay', 'peacefully', 'familiar', 'walls', 'collection', 'textile', 'samples', 'lay', 'spread', 'table', 'Samsa', 'travelling', 'salesman', 'hung', 'picture', 'recently', 'cut', 'illustrated', 'magazine', 'housed', 'nice', 'gilded', 'frame', 'showed', 'lady', 'fitted', 'fur', 'hat', 'fur', 'boa', 'sat', 'upright', 'raising', 'heavy', 'fur', 'muff', 'covered', 'lower', 'arm', 'viewer', 'Gregor', 'turned', 'look', 'window', 'dull', 'weather']


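Another method from gensim is the `remove_stopwords` function, which takes a raw string and returns a string. Note that `text_lines` is the short sample sentence at this point, and `len` of the result counts characters, not tokens.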

In [95]:
from gensim.parsing.preprocessing import remove_stopwords

filtered_sentence_gensim = remove_stopwords(text_lines)
print(len(filtered_sentence_gensim), filtered_sentence_gensim)

46 This sample sentence. Let's test tokenization!




1. **What is the purpose of loading stopwords from the NLTK corpus using `stopwords.words('english')`?**
   - The NLTK library provides a collection of stopwords for various languages, including English. By loading the stopwords for English using `stopwords.words('english')`, the code retrieves a list of common words that are considered non-informative or irrelevant for text analysis tasks.

2. **How does the `nlp.Defaults.stop_words` attribute in spaCy provide stopwords for English?**
   - Each spaCy language class ships with a default set of stopwords. Accessing `nlp.Defaults.stop_words` on the loaded English pipeline returns that set, which can be used to filter out non-essential words from text data.

3. **What is the significance of `STOPWORDS` from Gensim's preprocessing module?**
   - Gensim provides a set of stopwords through the `STOPWORDS` constant in its preprocessing module. These stopwords are commonly used in text processing tasks and can be employed to filter out unimportant words from text data.

4. **How does the list comprehension `[word for word in re_tokens if word.lower() not in stop_words]` remove stopwords from the text?**
   - This list comprehension iterates over each word token in the `re_tokens` list and checks if its lowercase version is not present in the `stop_words` set or list (depending on the library). If the word is not a stopword, it is included in the `filtered` list, effectively removing stopwords from the text data.

5. **What does the `remove_stopwords(text_lines)` function from Gensim accomplish?**
   - The `remove_stopwords()` function from Gensim's preprocessing module removes stopwords from the given text string (`text_lines`). It automatically filters out common stopwords based on the predefined list provided by Gensim's `STOPWORDS` constant, resulting in text with stopwords removed.
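
The filtering step described in these answers can be wrapped in a small helper. The sketch below uses a tiny placeholder stopword set so that it runs without any downloads; `EXAMPLE_STOPWORDS` and `remove_stopwords_from_tokens` are hypothetical names, and a real run would pass one of the library lists instead.

```python
import re

# Placeholder stopword set for the example; real lists come from nltk, spaCy or gensim.
EXAMPLE_STOPWORDS = frozenset({"a", "from", "he", "his", "in", "into", "it", "one", "when"})


def remove_stopwords_from_tokens(tokens, stop_words):
    """Keep tokens whose lowercase form is not in stop_words."""
    stop_set = set(stop_words)  # set lookup is fast even if a list was passed in
    return [token for token in tokens if token.lower() not in stop_set]


tokens = re.findall(r"\w+", "One morning, when Gregor Samsa woke from troubled dreams")
filtered = remove_stopwords_from_tokens(tokens, EXAMPLE_STOPWORDS)
print(len(filtered), filtered)
```

Converting the stopword collection to a set matters mostly for the `nltk` list, since membership tests on a list scan it element by element.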

### Comparing results <a name="stopwords-compare"></a>

**`nltk` vs `spacy`**

In [96]:
list(set(filtered_nltk) ^ set(filtered_spacy))

['seemed',
 't',
 'back',
 'many',
 's',
 'whole',
 'although',
 'One',
 'four',
 'could',
 'wasn',
 'see',
 'towards']

**`nltk` vs `gensim`**

In [97]:
list(set(filtered_nltk) ^ set(filtered_gensim))

['seemed',
 't',
 'back',
 'found',
 'many',
 's',
 'whole',
 'although',
 'One',
 'four',
 'could',
 'thin',
 'wasn',
 'see',
 'towards']

**`gensim` vs `spacy`**

In [98]:
list(set(filtered_gensim) ^ set(filtered_spacy))

['found', 'thin']
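
The symmetric difference (`^`) shows which words differ but not which side kept them. A directional comparison makes that visible. The two lists below are short excerpts copied by hand from the filtered outputs above, and `directional_diff` is a hypothetical helper name.

```python
# Short excerpts of the filtered outputs shown above, copied by hand for illustration.
filtered_spacy_excerpt = ["dreams", "found", "transformed", "pitifully", "thin", "compared"]
filtered_gensim_excerpt = ["dreams", "transformed", "pitifully", "compared"]


def directional_diff(left, right):
    """Return (only_in_left, only_in_right) as sorted lists."""
    left_set, right_set = set(left), set(right)
    return sorted(left_set - right_set), sorted(right_set - left_set)


only_spacy, only_gensim = directional_diff(filtered_spacy_excerpt, filtered_gensim_excerpt)
print("kept only after spaCy filtering:", only_spacy)
print("kept only after gensim filtering:", only_gensim)
```

Words that survive only the spaCy filter are words that the gensim list treats as stopwords and the spaCy list does not.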

### Comparing stopwords lists <a name="stopwords-compare2"></a>

Lists of words

In [99]:
# print("NLTK stopwords",stop_words_nltk)
# print("Spacy stopwords",stop_words_spacy)
# print("Gensim stopwords",stop_words_gensim)

Comparing length

In [100]:
print("NLTK stopwords len", len(stop_words_nltk))
print("Spacy stopwords len", len(stop_words_spacy))
print("Gensim stopwords len", len(stop_words_gensim))

NLTK stopwords len 198
Spacy stopwords len 326
Gensim stopwords len 337


## Lemmatization <a name="lemma"></a>

For lemmatization we'll ignore punctuation; the `re` tokenization step already gives us tokens without it.

### Using `nltk` <a name="lemma-nltk"></a>

In [101]:
from nltk.stem import WordNetLemmatizer
nltk.download('wordnet')

nltk_lemmatizer = WordNetLemmatizer()
nltk_lemmas = [nltk_lemmatizer.lemmatize(w) for w in re_tokens]
print(nltk_lemmas)

[nltk_data] Downloading package wordnet to /root/nltk_data...


['One', 'morning', 'when', 'Gregor', 'Samsa', 'woke', 'from', 'troubled', 'dream', 'he', 'found', 'himself', 'transformed', 'in', 'his', 'bed', 'into', 'a', 'horrible', 'vermin', 'He', 'lay', 'on', 'his', 'armour', 'like', 'back', 'and', 'if', 'he', 'lifted', 'his', 'head', 'a', 'little', 'he', 'could', 'see', 'his', 'brown', 'belly', 'slightly', 'domed', 'and', 'divided', 'by', 'arch', 'into', 'stiff', 'section', 'The', 'bedding', 'wa', 'hardly', 'able', 'to', 'cover', 'it', 'and', 'seemed', 'ready', 'to', 'slide', 'off', 'any', 'moment', 'His', 'many', 'leg', 'pitifully', 'thin', 'compared', 'with', 'the', 'size', 'of', 'the', 'rest', 'of', 'him', 'waved', 'about', 'helplessly', 'a', 'he', 'looked', 'What', 's', 'happened', 'to', 'me', 'he', 'thought', 'It', 'wasn', 't', 'a', 'dream', 'His', 'room', 'a', 'proper', 'human', 'room', 'although', 'a', 'little', 'too', 'small', 'lay', 'peacefully', 'between', 'it', 'four', 'familiar', 'wall', 'A', 'collection', 'of', 'textile', 'sample', 

text_lines = """One morning, when Gregor Samsa woke from troubled dreams,
 he found himself transformed in his bed into a horrible vermin.  He lay
 on his armour-like back, and if he lifted his head a little he could see his
 brown belly, slightly domed and divided by arches into stiff sections.
 The bedding was hardly able to cover it and seemed ready to slide off any
 moment.  His many legs, pitifully thin compared with the size of the rest of
 him, waved about helplessly as he looked.  "What's happened to me?" he thought.
 It wasn't a dream.  His room, a proper human room although a little too small,
 lay peacefully between its four familiar walls.  A collection of textile
 samples lay spread out on the table - Samsa was a travelling salesman - and
 above it there hung a picture that he had recently cut out of an illustrated
 magazine and housed in a nice, gilded frame.  It showed a lady fitted out with
 a fur hat and fur boa who sat upright, raising a heavy fur muff that covered
 the whole of her lower arm towards the viewer.  Gregor then turned to look out
 the window at the dull weather."""

### Using `spaCy` <a name="lemma-spacy"></a>

[spacy documentation](https://spacy.io/api/lemmatizer)

In [102]:
# spacy_text = " ".join([token.text for token in spacy_tokens])
text_from_re_tokens = " ".join(re_tokens)
# spacy_lemmas = [word.lemma_ for word in nlp(text_lines)]
spacy_lemmas = [word.lemma_ for word in nlp(text_from_re_tokens)]
print(len(spacy_lemmas), spacy_lemmas)

200 ['one', 'morning', 'when', 'Gregor', 'Samsa', 'wake', 'from', 'troubled', 'dream', 'he', 'find', 'himself', 'transform', 'in', 'his', 'bed', 'into', 'a', 'horrible', 'vermin', 'he', 'lie', 'on', 'his', 'armour', 'like', 'back', 'and', 'if', 'he', 'lift', 'his', 'head', 'a', 'little', 'he', 'could', 'see', 'his', 'brown', 'belly', 'slightly', 'dome', 'and', 'divide', 'by', 'arch', 'into', 'stiff', 'section', 'the', 'bedding', 'be', 'hardly', 'able', 'to', 'cover', 'it', 'and', 'seem', 'ready', 'to', 'slide', 'off', 'any', 'moment', 'his', 'many', 'leg', 'pitifully', 'thin', 'compare', 'with', 'the', 'size', 'of', 'the', 'rest', 'of', 'he', 'wave', 'about', 'helplessly', 'as', 'he', 'look', 'what', 's', 'happen', 'to', 'I', 'he', 'think', 'it', 'wasn', 't', 'a', 'dream', 'his', 'room', 'a', 'proper', 'human', 'room', 'although', 'a', 'little', 'too', 'small', 'lie', 'peacefully', 'between', 'its', 'four', 'familiar', 'wall', 'a', 'collection', 'of', 'textile', 'sample', 'lie', 'sprea

### Using `TextBlob` <a name="lemma-tb"></a>

In [103]:
from textblob import TextBlob, Word

# create a TextBlob for our sentence
sent_tb = TextBlob(text_from_re_tokens)

# lemmatize each word
blob_lemmas = [word.lemmatize() for word in sent_tb.words]
print(len(blob_lemmas), blob_lemmas)

200 ['One', 'morning', 'when', 'Gregor', 'Samsa', 'woke', 'from', 'troubled', 'dream', 'he', 'found', 'himself', 'transformed', 'in', 'his', 'bed', 'into', 'a', 'horrible', 'vermin', 'He', 'lay', 'on', 'his', 'armour', 'like', 'back', 'and', 'if', 'he', 'lifted', 'his', 'head', 'a', 'little', 'he', 'could', 'see', 'his', 'brown', 'belly', 'slightly', 'domed', 'and', 'divided', 'by', 'arch', 'into', 'stiff', 'section', 'The', 'bedding', 'wa', 'hardly', 'able', 'to', 'cover', 'it', 'and', 'seemed', 'ready', 'to', 'slide', 'off', 'any', 'moment', 'His', 'many', 'leg', 'pitifully', 'thin', 'compared', 'with', 'the', 'size', 'of', 'the', 'rest', 'of', 'him', 'waved', 'about', 'helplessly', 'a', 'he', 'looked', 'What', 's', 'happened', 'to', 'me', 'he', 'thought', 'It', 'wasn', 't', 'a', 'dream', 'His', 'room', 'a', 'proper', 'human', 'room', 'although', 'a', 'little', 'too', 'small', 'lay', 'peacefully', 'between', 'it', 'four', 'familiar', 'wall', 'A', 'collection', 'of', 'textile', 'sampl



1. **What is the purpose of the `WordNetLemmatizer` from NLTK, and how does it work?**
   - The `WordNetLemmatizer` from NLTK is used for lemmatizing words, which means reducing words to their base or dictionary form (the lemma). It uses WordNet, a lexical database of the English language, to find the base form. The `lemmatize()` method takes an optional part-of-speech (POS) argument; when it is omitted, as in this cell, every word is treated as a noun, which is why verbs such as *woke* and *found* are left unchanged.

2. **How does the list comprehension `[nltk_lemmatizer.lemmatize(w) for w in re_tokens]` lemmatize words using NLTK?**
   - This list comprehension iterates over each word token (`w`) in the `re_tokens` list and applies the `lemmatize()` method of the `WordNetLemmatizer` to obtain the lemma of each word. The resulting list contains the lemmatized versions of the words in the `re_tokens` list.

3. **What is the purpose of using spaCy for lemmatization in the code?**
   - spaCy provides lemmatization as part of its processing pipeline, and the `lemma_` attribute of each token holds the lemma. Because the pipeline also assigns POS tags, the lemmatizer can use them, which is why *woke* becomes *wake* and *found* becomes *find* in the output above without any extra work.

4. **How does TextBlob facilitate lemmatization, and what does the `TextBlob(text_from_re_tokens)` object represent?**
   - TextBlob provides a high-level interface for processing text data, including lemmatization. The `TextBlob(text_from_re_tokens)` object represents a text blob created from the string `text_from_re_tokens`. TextBlob automatically tokenizes the text and provides methods for various text processing tasks, including lemmatization. The `lemmatize()` method is applied to each word in the `TextBlob` object to obtain the lemmatized form.

5. **What are the differences between NLTK, spaCy, and TextBlob in terms of lemmatization?**
   - NLTK provides a lemmatizer based on WordNet that treats every word as a noun unless a POS is passed explicitly. TextBlob offers a simplified interface, and its `lemmatize()` method builds on the same WordNet lemmatizer, which is consistent with the two outputs above looking the same. spaCy's lemmatizer is integrated into its processing pipeline and uses the POS tags assigned there, so verbs are reduced to their base form by default. Which library fits best depends on whether you need a full pipeline (spaCy) or a lightweight, per-word lookup (NLTK, TextBlob).

### Adding POS Tags to `nltk` and `TextBlob` <a name="lemma-tags"></a>

By default, `nltk` and `TextBlob` treat every word as a noun. This is why words like "woke", "found", or "transformed" don't change after the lemmatization step. We can provide more information by adding the corresponding Part of Speech for each token.
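
Why the part of speech matters can be illustrated with a toy lemmatizer. This is a simplified sketch under loud assumptions: the vocabulary, the exception table, and the suffix rules are all invented for the example, and the code is not how NLTK or WordNet are implemented. It only mimics the general idea of looking up irregular forms and suffix rules per POS.

```python
# Toy data invented for this example; it is not WordNet's vocabulary.
TOY_VOCAB = {
    "n": {"dream", "leg", "wa"},
    "v": {"lift", "look", "seem"},
}
TOY_EXCEPTIONS = {
    "n": {},
    "v": {"was": "be", "woke": "wake", "found": "find"},
}
TOY_SUFFIX_RULES = {
    "n": [("s", "")],
    "v": [("ed", ""), ("s", "")],
}


def toy_lemmatize(word, pos="n"):
    """Look up irregular forms first, then try the suffix rules for the given POS."""
    if word in TOY_EXCEPTIONS[pos]:
        return TOY_EXCEPTIONS[pos][word]
    for suffix, replacement in TOY_SUFFIX_RULES[pos]:
        if word.endswith(suffix):
            candidate = word[: len(word) - len(suffix)] + replacement
            if candidate in TOY_VOCAB[pos]:
                return candidate
    return word  # unknown form: return it unchanged


for w in ["dreams", "was", "woke", "lifted"]:
    print(w, "-> noun:", toy_lemmatize(w), "| verb:", toy_lemmatize(w, "v"))
```

Treated as a noun, *was* loses its final *s* in this toy setup, which mirrors the `'wa'` seen in the outputs above; treated as a verb, it maps to *be*.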

### Using `nltk` <a name="lemma-tags-nltk"></a>

 - we use `pos_tag()` to get the tokens along with their tags
 - we call the `lemmatize()` function with a second parameter, the POS tag

[Source](https://www.guru99.com/stemming-lemmatization-python-nltk.html)

In [104]:
import nltk
from nltk.corpus import wordnet as wn
from nltk import pos_tag
from collections import defaultdict
from nltk.stem import WordNetLemmatizer

# Download both taggers (old and new)
nltk.download('averaged_perceptron_tagger')
nltk.download('averaged_perceptron_tagger_eng')
nltk.download('wordnet')

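# Example setup
# Note: this rebinds re_tokens to a three-word demo list, so every later cell
# that reads re_tokens (including the stemming cells) works on these words only.
nltk_lemmatizer = WordNetLemmatizer()
re_tokens = ["running", "quickly", "dogs"]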

# POS tag mapping
tag_map_nltk = defaultdict(lambda : wn.NOUN)
tag_map_nltk['J'] = wn.ADJ
tag_map_nltk['V'] = wn.VERB
tag_map_nltk['R'] = wn.ADV

# Lemmatization
nltk_lemmas2 = [nltk_lemmatizer.lemmatize(token, tag_map_nltk[tag[0]]) for token, tag in pos_tag(re_tokens)]
print(nltk_lemmas2)


[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.
[nltk_data] Downloading package averaged_perceptron_tagger_eng to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger_eng.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


['run', 'quickly', 'dog']
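
The tag mapping in the cell above relies on the first letter of each Penn Treebank tag. The sketch below isolates that step without calling a tagger: the `(token, tag)` pairs are written by hand for illustration (a real tagger may assign different tags), and the WordNet POS constants are spelled out as the plain strings they stand for.

```python
from collections import defaultdict

# WordNet POS codes as plain strings: 'n' noun, 'v' verb, 'a' adjective, 'r' adverb.
tag_map = defaultdict(lambda: "n")  # anything unmapped falls back to noun
tag_map.update({"J": "a", "V": "v", "R": "r"})

# Hand-written (token, Penn Treebank tag) pairs; a real tagger may assign other tags.
tagged = [("running", "VBG"), ("quickly", "RB"), ("dogs", "NNS"), ("stiff", "JJ")]

# tag[0] is the first letter of the Penn tag, e.g. 'V' for VBG, VBD, VBZ.
wordnet_pos = [(token, tag_map[tag[0]]) for token, tag in tagged]
print(wordnet_pos)
```

Using a `defaultdict` means tags such as `NNS`, `DT`, or `IN` all map to the noun code, which matches the lemmatizer's own default.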


### Using `TextBlob` <a name="lemma-tags-tb"></a>

 - we call `TextBlob(text).tags` to get tokens and tags
 - we call `word.lemmatize()` with the tag parameter

[Source](https://www.machinelearningplus.com/nlp/lemmatization-examples-python/#textbloblemmatizer)

In [105]:
tag_map_tb = {  "J": 'a', # adjectives
                "N": 'n', # nouns
                "V": 'v', # verbs
                "R": 'r'} # adverbs

words_and_tags = [(w, tag_map_tb.get(pos[0], 'n')) for w, pos in sent_tb.tags]
blob_lemmas2 = [word.lemmatize(tag) for word, tag in words_and_tags]
print(len(blob_lemmas2), blob_lemmas2)

200 ['One', 'morning', 'when', 'Gregor', 'Samsa', 'wake', 'from', 'troubled', 'dream', 'he', 'find', 'himself', 'transform', 'in', 'his', 'bed', 'into', 'a', 'horrible', 'vermin', 'He', 'lay', 'on', 'his', 'armour', 'like', 'back', 'and', 'if', 'he', 'lift', 'his', 'head', 'a', 'little', 'he', 'could', 'see', 'his', 'brown', 'belly', 'slightly', 'domed', 'and', 'divide', 'by', 'arch', 'into', 'stiff', 'section', 'The', 'bedding', 'be', 'hardly', 'able', 'to', 'cover', 'it', 'and', 'seem', 'ready', 'to', 'slide', 'off', 'any', 'moment', 'His', 'many', 'leg', 'pitifully', 'thin', 'compare', 'with', 'the', 'size', 'of', 'the', 'rest', 'of', 'him', 'wave', 'about', 'helplessly', 'a', 'he', 'look', 'What', 's', 'happen', 'to', 'me', 'he', 'think', 'It', 'wasn', 't', 'a', 'dream', 'His', 'room', 'a', 'proper', 'human', 'room', 'although', 'a', 'little', 'too', 'small', 'lay', 'peacefully', 'between', 'it', 'four', 'familiar', 'wall', 'A', 'collection', 'of', 'textile', 'sample', 'lay', 'spre

### Comparing Lemmatizers <a name="lemma-compare"></a>

In [106]:
list(set(nltk_lemmas) ^ set(blob_lemmas))

[]

In [107]:
list(set(nltk_lemmas2) ^ set(blob_lemmas2))

['low',
 'from',
 't',
 'size',
 'wake',
 'at',
 'familiar',
 'her',
 'who',
 'domed',
 'above',
 'muff',
 'compare',
 'little',
 'spread',
 'recently',
 'off',
 'window',
 'able',
 'bedding',
 'collection',
 'himself',
 'seem',
 'and',
 'a',
 'cut',
 'dream',
 'section',
 'Gregor',
 'gild',
 'he',
 'wall',
 'transform',
 'What',
 'stiff',
 'textile',
 'on',
 'ready',
 'dog',
 'belly',
 'Samsa',
 's',
 'to',
 'slightly',
 'that',
 'look',
 'lay',
 'It',
 'quickly',
 'too',
 'weather',
 'picture',
 'although',
 'One',
 'leg',
 'A',
 'there',
 'when',
 'morning',
 'four',
 'moment',
 'like',
 'cover',
 'between',
 'his',
 'house',
 'thin',
 'arm',
 'vermin',
 'wasn',
 'see',
 'upright',
 'the',
 'dull',
 'small',
 'peacefully',
 'pitifully',
 'arch',
 'boa',
 'human',
 'think',
 'lady',
 'in',
 'hardly',
 'raise',
 'into',
 'He',
 'His',
 'slide',
 'back',
 'any',
 'by',
 'travelling',
 'head',
 'table',
 'turn',
 'salesman',
 'lift',
 'brown',
 'divide',
 'heavy',
 'many',
 'about',
 'f

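Without POS tags, `nltk` and `TextBlob` return identical results: the first difference above is empty. The second difference is not a like-for-like comparison, because `nltk_lemmas2` was computed from the three-word demo list while `blob_lemmas2` covers the full passage.

Let's see what changed after applying POS tags: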

In [108]:
sorted(list(set(nltk_lemmas) ^ set(nltk_lemmas2)))

['A',
 'Gregor',
 'He',
 'His',
 'It',
 'One',
 'Samsa',
 'The',
 'What',
 'a',
 'able',
 'about',
 'above',
 'although',
 'an',
 'and',
 'any',
 'arch',
 'arm',
 'armour',
 'at',
 'back',
 'bed',
 'bedding',
 'belly',
 'between',
 'boa',
 'brown',
 'by',
 'collection',
 'compared',
 'could',
 'cover',
 'covered',
 'cut',
 'divided',
 'dog',
 'domed',
 'dream',
 'dull',
 'familiar',
 'fitted',
 'found',
 'four',
 'frame',
 'from',
 'fur',
 'gilded',
 'had',
 'happened',
 'hardly',
 'hat',
 'he',
 'head',
 'heavy',
 'helplessly',
 'her',
 'him',
 'himself',
 'his',
 'horrible',
 'housed',
 'human',
 'hung',
 'if',
 'illustrated',
 'in',
 'into',
 'it',
 'lady',
 'lay',
 'leg',
 'lifted',
 'like',
 'little',
 'look',
 'looked',
 'lower',
 'magazine',
 'many',
 'me',
 'moment',
 'morning',
 'muff',
 'nice',
 'of',
 'off',
 'on',
 'out',
 'peacefully',
 'picture',
 'pitifully',
 'proper',
 'quickly',
 'raising',
 'ready',
 'recently',
 'rest',
 'room',
 'run',
 's',
 'salesman',
 'sample',

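With POS tags, `nltk` reduces inflected verb forms to their base form, for example *running* becomes *run*. Most of the other entries in the difference above appear only because `nltk_lemmas2` was computed from the three-word demo list rather than from the full passage.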

## Stemming with `nltk` <a name="stem"></a>

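We compare two stemmers. The outputs below are for the three-word demo list that `re_tokens` was rebound to in the POS-tagging cell. [Source](https://stackoverflow.com/questions/24647400/what-is-the-best-stemming-method-in-python)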

### Using `PorterStemmer()` <a name="stem-ps"></a>

In [109]:
from nltk.stem import PorterStemmer

ps = PorterStemmer()

ps_stemms = [ps.stem(w) for w in re_tokens]
print(ps_stemms)

['run', 'quickli', 'dog']




**1. What is stemming and how does the `PorterStemmer` from NLTK perform stemming?**

Stemming is the process of reducing words to their root or base form, typically by removing suffixes. The `PorterStemmer` algorithm, developed by Martin Porter, is one of the most widely used stemming algorithms. It applies a series of heuristic rules to remove common suffixes from words, aiming to produce the stem or root form.

**2. How does the list comprehension `[ps.stem(w) for w in re_tokens]` perform stemming using the `PorterStemmer`?**

This list comprehension iterates over each word token (`w`) in the `re_tokens` list and applies the `stem()` method of the `PorterStemmer` (`ps`) to obtain the stemmed form of each word. The resulting list (`ps_stemms`) contains the stemmed versions of the words in the `re_tokens` list.

**3. What are some examples of stemming using the Porter Stemmer?**

- "Troubled" stems to "troubl"
- "Divided" stems to "divid"
- "Sections" stems to "section"
- "Stiff" stems to "stiff"
- "Ready" stems to "readi"

**4. How does stemming differ from lemmatization, and when might one be preferred over the other?**

Stemming and lemmatization both aim to reduce words to their base forms, but they use different techniques and have different levels of accuracy. Stemming is a simpler and faster process as it applies heuristic rules to chop off suffixes, often resulting in stems that are not actual words. On the other hand, lemmatization involves dictionary lookup to find the lemma or base form of words, resulting in more accurate but potentially slower processing.

Stemming might be preferred in applications where speed is crucial, such as information retrieval or text indexing, while lemmatization might be preferred in tasks where accuracy is paramount, such as natural language understanding or text analysis.
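
To make the idea of heuristic suffix stripping concrete, here is a deliberately tiny stemmer. It is not the Porter algorithm, which applies several ordered phases with conditions on the remaining stem; the rule list, the minimum stem length, and the name `toy_stem` are all assumptions made for this sketch.

```python
# Ordered (suffix, replacement) rules; a deliberately tiny, made-up rule set.
TOY_RULES = [("sses", "ss"), ("ies", "i"), ("ed", ""), ("ing", ""), ("ly", ""), ("s", "")]
MIN_STEM_LENGTH = 3  # do not strip a suffix if fewer characters would remain


def toy_stem(word):
    """Apply the first matching suffix rule that leaves a long enough stem."""
    word = word.lower()
    for suffix, replacement in TOY_RULES:
        if word.endswith(suffix) and len(word) - len(suffix) >= MIN_STEM_LENGTH:
            return word[: len(word) - len(suffix)] + replacement
    return word


for w in ["Divided", "sections", "stiff", "looked", "was"]:
    print(w, "->", toy_stem(w))
```

No dictionary is consulted at any point, which is why a result such as *divid* is not a real word, and also why this kind of processing is fast.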

### Using `SnowballStemmer` <a name="stem-sno"></a>

In [110]:
from nltk.stem import SnowballStemmer

sno = SnowballStemmer('english')

sno_stemms = [sno.stem(w) for w in re_tokens]
print(sno_stemms)

['run', 'quick', 'dog']




**1. What is the Snowball Stemmer and how does it differ from the Porter Stemmer?**

Snowball is a family of stemming algorithms, also designed by Martin Porter, that covers multiple languages. Its English stemmer, also known as the Porter2 stemmer, revises the rules of the original Porter Stemmer. In NLTK, each stemmer is selected by language name, for example `SnowballStemmer('english')`.

**2. How does the `SnowballStemmer` from NLTK perform stemming?**

The `SnowballStemmer` in NLTK takes a language name when it is created and applies that language's suffix-stripping rules to reduce each word to its stem. Like the Porter Stemmer, it is rule-based, so the resulting stem is not guaranteed to be a dictionary word.

**3. What does the list comprehension `[sno.stem(w) for w in re_tokens]` accomplish?**

This list comprehension iterates over each word token (`w`) in the `re_tokens` list and applies the `stem()` method of the `SnowballStemmer` (`sno`) to obtain the stemmed form of each word. The resulting list (`sno_stemms`) contains the stemmed versions of the words in the `re_tokens` list.

**4. How does stemming with Snowball Stemmer differ from stemming with Porter Stemmer?**

The original Porter Stemmer targets English only, whereas Snowball provides stemmers for many languages. For English, the Snowball (Porter2) stemmer refines several of the Porter rules. For example, it strips the adverbial `-ly` suffix, so the token that Porter stems to `quickli` becomes `quick`.

**5. In which scenarios might Snowball Stemmer be preferred over Porter Stemmer?**

Snowball Stemmer is the natural choice when the text is in a language other than English, since the Porter Stemmer's rules are written for English. For English text, it might also be preferred when cleaner stems are desired, as in the `quick` versus `quickli` outputs above.
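
As a minimal sketch of the multi-language support, the snippet below lists the language names NLTK accepts and stems a few German tokens. It assumes NLTK is installed; `german_tokens`, `german_stems`, and `unsupported_error` are hypothetical names used only for this illustration, and the German words are not taken from the notebook's data.

```python
from nltk.stem import SnowballStemmer

# Languages are selected by name; the available names are listed here
print(SnowballStemmer.languages)

# Illustrative German tokens (hypothetical input, not from the notebook)
german_tokens = ["Häuser", "laufen", "schnell"]
german = SnowballStemmer("german")
german_stems = [german.stem(w) for w in german_tokens]
print(german_stems)

# An unsupported language name raises ValueError
try:
    SnowballStemmer("klingon")
except ValueError as err:
    unsupported_error = str(err)
    print(unsupported_error)
```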

### Comparing Stemmers <a name="stem-compare"></a>

In [111]:
# Symmetric difference: stems produced by one stemmer but not the other
sorted(set(ps_stemms) ^ set(sno_stemms))

['quick', 'quickli']

`PorterStemmer` and `SnowballStemmer` handle adverbs differently: Porter leaves the `-li` ending in place (`quickli`), while Snowball strips it (`quick`).
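
The symmetric difference shows which stems differ but not which token produced them. A per-token comparison, sketched below, keeps that link. It assumes NLTK is installed; the `tokens` list is a hypothetical stand-in for the notebook's `re_tokens`, and `stem_diff` is a name introduced here.

```python
from nltk.stem import PorterStemmer, SnowballStemmer

ps = PorterStemmer()
sno = SnowballStemmer("english")

# Hypothetical input tokens; the notebook's own `re_tokens` is defined earlier
tokens = ["running", "quickly", "dogs"]

# Keep only the tokens whose stems differ, with both results side by side
stem_diff = {
    t: (ps.stem(t), sno.stem(t))
    for t in tokens
    if ps.stem(t) != sno.stem(t)
}
print(stem_diff)
```

Each entry maps a token to a `(porter_stem, snowball_stem)` pair, which makes it easier to spot patterns such as the `-ly` handling.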