### Text Preprocessing

This chapter covers the text preprocessing steps required before applying machine learning algorithms.
The steps include cleaning, normalization, vectorization, and more.

1. NLP pipeline
2. Text Preprocessing:
   - with Pandas
   - with spaCy
3. Vectorization

The NLP pipeline is the standard natural language processing workflow. It is similar to the data science workflow, with one extra step for text.

Vectorization converts text data into a numeric format that a computer can process.

#### NLP Pipeline

Since the NLP pipeline is similar to a data science workflow, we will first explore its steps.

Step 1: Scoping a project (goal and what questions we want to answer)

Step 2: Gathering Data

Step 3: Cleaning Data

Step 4: Exploring Data

Step 5: Modeling Data

Step 6: Sharing Insights

The NLP pipeline adds one more step between Cleaning Data and Exploring Data, called "Text Preprocessing."
This step ensures the data is ready to explore and model.
With the new step included, Steps 2 to 5 are known as the NLP pipeline.

#### Text Preprocessing

Text preprocessing is about preparing raw data for analysis and modeling.

In the text preprocessing step, we usually start with cleaning and normalization (with Pandas or spaCy):
    
    Cleaning: Remove unnecessary text
    
    Normalization: Make text consistent

    Concepts:
    Lowercasing - Convert all text to lowercase
    Special characters - Remove punctuation and special characters using regular expressions
    Tokenization - Split text into smaller units (words/sentences)
    Stemming/lemmatization - Reduce words to their root or base form
    Stop words - Remove common, non-essential words
    Part-of-speech (POS) tagging - Identify grammatical roles of words (nouns, verbs, etc.)
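
The first few concepts above can be sketched with the standard library alone. This is a minimal illustration rather than the Pandas or spaCy workflow used later in the chapter; `STOP_WORDS` and `preprocess` are hypothetical names invented for this example, and real stop word lists are much longer.

```python
import re

# Hypothetical, deliberately tiny stop word list (an assumption for illustration)
STOP_WORDS = {"a", "an", "the", "to", "for", "at", "you", "is"}

def preprocess(text):
    """Lowercase, strip special characters, tokenize, and drop stop words."""
    text = text.lower()                    # lowercasing
    text = re.sub(r"[^\w\s]", "", text)    # remove punctuation and symbols
    tokens = text.split()                  # whitespace tokenization
    return [t for t in tokens if t not in STOP_WORDS]

tokens = preprocess("She bought 2 lemons for $1 at Maven Market.")
print(tokens)
```

Stemming, lemmatization, and POS tagging are left out here because they need linguistic resources that a few lines of plain Python cannot provide.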

Next comes vectorization (turning text into a matrix of numbers).
    
    Here, each document (text) is represented by a vector of counts or TF-IDF (Term Frequency-Inverse Document Frequency) values.
    
    This can be done with scikit-learn.

    Concepts:
    Document-Term Matrix (DTM) - Represent each document by its word counts (the Bag of Words approach)
    TF-IDF - Extension of the DTM that weights each word by how important it is to a document relative to the whole collection
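
To make the two concepts concrete, here is a small sketch that builds a document-term matrix and TF-IDF weights by hand. It is an illustration under stated assumptions, not the scikit-learn implementation: the three documents are made up for the example, and the weighting is the plain `tf * log(N / df)` form. scikit-learn's `TfidfVectorizer` applies smoothing and normalization by default, so its numbers will differ.

```python
import math
from collections import Counter

# Made-up, already-cleaned documents (assumption for illustration)
docs = [
    "lemon lemon lemons",
    "iced tea is my favorite",
    "half lemonade half iced tea",
]

tokenized = [d.split() for d in docs]
vocab = sorted({w for doc in tokenized for w in doc})

# Document-term matrix: one row per document, one column per vocabulary word
dtm = []
for doc in tokenized:
    counts = Counter(doc)
    dtm.append([counts[w] for w in vocab])

# Document frequency: in how many documents each word appears
n_docs = len(docs)
doc_freq = {w: sum(w in doc for doc in tokenized) for w in vocab}

def tfidf(doc):
    """Plain tf * log(N / df) weights for one tokenized document."""
    counts = Counter(doc)
    return {w: counts[w] * math.log(n_docs / doc_freq[w]) for w in counts}

weights = tfidf(tokenized[2])
print(vocab)
print(dtm[0])
print(weights)
```

In the third document, "half" gets the largest weight because it is frequent there and appears nowhere else, while "iced" and "tea" are discounted because another document also contains them.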

Note: The order of steps may vary depending on the project.

Before moving forward, we will create a new environment called 'nlp_basics' and install the following:

    jupyter notebook
    matplotlib
    pandas
    python
    scikit-learn
    spacy

#### Text Preprocessing with Pandas

Pandas can be used for simple text cleaning and normalization.

The following are the functions we will be using.

    str.lower() - Make all text lowercase
    str.replace() - Replace special characters (punctuation/numbers/...)

Note: Regular expressions (regex) let you search for patterns in text, like a more powerful version of Ctrl+F.
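
As a quick illustration of pattern matching, the sketch below uses Python's built-in `re` module on one of the sentences from this chapter. The three patterns are only examples chosen for this note.

```python
import re

text = "She bought 2 lemons for $1 at Maven Market."

# \d+ matches one or more digits in a row
numbers = re.findall(r"\d+", text)

# [A-Z]\w* matches a capital letter followed by any word characters
capitalized = re.findall(r"[A-Z]\w*", text)

# re.sub replaces every match; here each run of digits becomes "#"
masked = re.sub(r"\d+", "#", text)

print(numbers)
print(capitalized)
print(masked)
```

The Pandas string methods used below accept the same pattern syntax when called with `regex=True`.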

##### Create Data Sets

In [2]:
import pandas as pd

In [3]:
data = [
    "When life gives you lemons, make lemonade! 🙂",
    "She bought 2 lemons for $1 at Maven Market.",
    "A dozen lemons will make a gallon of lemonade. [AllRecipes]",
    "lemon, lemon, lemons, lemon, lemon, lemons",
    "He's running to the market to get a lemon — there's a great sale today.",
    "Does Maven Market carry Eureka lemons or Meyer lemons?",
    "An Arnold Palmer is half lemonade, half iced tea. [Wikipedia]",
    "iced tea is my favorite"
]

In [4]:
data

['When life gives you lemons, make lemonade! 🙂',
 'She bought 2 lemons for $1 at Maven Market.',
 'A dozen lemons will make a gallon of lemonade. [AllRecipes]',
 'lemon, lemon, lemons, lemon, lemon, lemons',
 "He's running to the market to get a lemon — there's a great sale today.",
 'Does Maven Market carry Eureka lemons or Meyer lemons?',
 'An Arnold Palmer is half lemonade, half iced tea. [Wikipedia]',
 'iced tea is my favorite']

In [6]:
### It is a list. We first convert it to a DataFrame.

pd.DataFrame(data)

Unnamed: 0,0
0,"When life gives you lemons, make lemonade! 🙂"
1,She bought 2 lemons for $1 at Maven Market.
2,A dozen lemons will make a gallon of lemonade....
3,"lemon, lemon, lemons, lemon, lemon, lemons"
4,He's running to the market to get a lemon — th...
5,Does Maven Market carry Eureka lemons or Meyer...
6,"An Arnold Palmer is half lemonade, half iced t..."
7,iced tea is my favorite


In [7]:
pd.DataFrame(data, columns=['sentence'])

Unnamed: 0,sentence
0,"When life gives you lemons, make lemonade! 🙂"
1,She bought 2 lemons for $1 at Maven Market.
2,A dozen lemons will make a gallon of lemonade....
3,"lemon, lemon, lemons, lemon, lemon, lemons"
4,He's running to the market to get a lemon — th...
5,Does Maven Market carry Eureka lemons or Meyer...
6,"An Arnold Palmer is half lemonade, half iced t..."
7,iced tea is my favorite


In [8]:
data_df = pd.DataFrame(data, columns=['sentence'])
data_df

### Note that some of the sentences are cut off. We can fix this by changing Pandas options.

Unnamed: 0,sentence
0,"When life gives you lemons, make lemonade! 🙂"
1,She bought 2 lemons for $1 at Maven Market.
2,A dozen lemons will make a gallon of lemonade....
3,"lemon, lemon, lemons, lemon, lemon, lemons"
4,He's running to the market to get a lemon — th...
5,Does Maven Market carry Eureka lemons or Meyer...
6,"An Arnold Palmer is half lemonade, half iced t..."
7,iced tea is my favorite


In [9]:
pd.set_option('display.max_colwidth', None)

### Note that this option applies to every DataFrame displayed afterward in this session.

In [10]:
data_df = pd.DataFrame(data, columns=['sentence'])
data_df

Unnamed: 0,sentence
0,"When life gives you lemons, make lemonade! 🙂"
1,She bought 2 lemons for $1 at Maven Market.
2,A dozen lemons will make a gallon of lemonade. [AllRecipes]
3,"lemon, lemon, lemons, lemon, lemon, lemons"
4,He's running to the market to get a lemon — there's a great sale today.
5,Does Maven Market carry Eureka lemons or Meyer lemons?
6,"An Arnold Palmer is half lemonade, half iced tea. [Wikipedia]"
7,iced tea is my favorite


In [11]:
# create a test data set

test = [
    "We're going to start this course with traditional NLP applications.",
    "Then we'll move on to modern NLP theory.",
    "Finally, we'll wrap things up with modern NLP applications."
]

In [12]:
test_series = pd.Series(test)
test_series

### Note: Here we are using a Series (just a single column). DataFrames are more flexible, and we will be adding more columns to data_df.

0    We're going to start this course with traditional NLP applications.
1                               Then we'll move on to modern NLP theory.
2            Finally, we'll wrap things up with modern NLP applications.
dtype: object

In [13]:
### Text Preprocessing with pandas

In [14]:
data_df

### We will add a column that shows a cleaned version of the sentences.

Unnamed: 0,sentence
0,"When life gives you lemons, make lemonade! 🙂"
1,She bought 2 lemons for $1 at Maven Market.
2,A dozen lemons will make a gallon of lemonade. [AllRecipes]
3,"lemon, lemon, lemons, lemon, lemon, lemons"
4,He's running to the market to get a lemon — there's a great sale today.
5,Does Maven Market carry Eureka lemons or Meyer lemons?
6,"An Arnold Palmer is half lemonade, half iced tea. [Wikipedia]"
7,iced tea is my favorite


In [16]:
### To keep the original data, we will first make a copy.

df = data_df.copy()
df

Unnamed: 0,sentence
0,"When life gives you lemons, make lemonade! 🙂"
1,She bought 2 lemons for $1 at Maven Market.
2,A dozen lemons will make a gallon of lemonade. [AllRecipes]
3,"lemon, lemon, lemons, lemon, lemon, lemons"
4,He's running to the market to get a lemon — there's a great sale today.
5,Does Maven Market carry Eureka lemons or Meyer lemons?
6,"An Arnold Palmer is half lemonade, half iced tea. [Wikipedia]"
7,iced tea is my favorite


In [None]:
### Step 1: Make text lower case.

In [17]:
df['sentence'].str.lower()

0                               when life gives you lemons, make lemonade! 🙂
1                                she bought 2 lemons for $1 at maven market.
2                a dozen lemons will make a gallon of lemonade. [allrecipes]
3                                 lemon, lemon, lemons, lemon, lemon, lemons
4    he's running to the market to get a lemon — there's a great sale today.
5                     does maven market carry eureka lemons or meyer lemons?
6              an arnold palmer is half lemonade, half iced tea. [wikipedia]
7                                                    iced tea is my favorite
Name: sentence, dtype: object

In [18]:
df['sentence_clean'] = df['sentence'].str.lower()
df

Unnamed: 0,sentence,sentence_clean
0,"When life gives you lemons, make lemonade! 🙂","when life gives you lemons, make lemonade! 🙂"
1,She bought 2 lemons for $1 at Maven Market.,she bought 2 lemons for $1 at maven market.
2,A dozen lemons will make a gallon of lemonade. [AllRecipes],a dozen lemons will make a gallon of lemonade. [allrecipes]
3,"lemon, lemon, lemons, lemon, lemon, lemons","lemon, lemon, lemons, lemon, lemon, lemons"
4,He's running to the market to get a lemon — there's a great sale today.,he's running to the market to get a lemon — there's a great sale today.
5,Does Maven Market carry Eureka lemons or Meyer lemons?,does maven market carry eureka lemons or meyer lemons?
6,"An Arnold Palmer is half lemonade, half iced tea. [Wikipedia]","an arnold palmer is half lemonade, half iced tea. [wikipedia]"
7,iced tea is my favorite,iced tea is my favorite


In [None]:
### Step 2: Remove text in square brackets.

In [19]:
df['sentence_clean'].str.replace('[wikipedia]', '', regex=False)

### This works. However, it is very manual. Instead, we will match the pattern with regex.

0                               when life gives you lemons, make lemonade! 🙂
1                                she bought 2 lemons for $1 at maven market.
2                a dozen lemons will make a gallon of lemonade. [allrecipes]
3                                 lemon, lemon, lemons, lemon, lemon, lemons
4    he's running to the market to get a lemon — there's a great sale today.
5                     does maven market carry eureka lemons or meyer lemons?
6                         an arnold palmer is half lemonade, half iced tea. 
7                                                    iced tea is my favorite
Name: sentence_clean, dtype: object
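
The literal replacement above only behaves as intended when the search text is not interpreted as a pattern. The sketch below uses Python's built-in `re` module to show the difference. The pandas `str.replace` method has a `regex` parameter whose default has differed between pandas versions, so passing it explicitly is a safe habit.

```python
import re

s = "an arnold palmer is half lemonade, half iced tea. [wikipedia]"

# Literal replacement: the brackets are ordinary characters
literal = s.replace("[wikipedia]", "")

# As a regex, [wikipedia] is a character class: any ONE of w, i, k, p, e, d, a
as_regex = re.sub("[wikipedia]", "", s)

# re.escape builds a pattern that matches the text literally
escaped = re.sub(re.escape("[wikipedia]"), "", s)

print(literal)
print(as_regex)
print(escaped)
```

Treated as a pattern, the unescaped text strips individual letters throughout the sentence and leaves the brackets behind, which is why square brackets must be escaped in the regex version that follows.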

In [20]:
df['sentence_clean'].str.replace(r'\[.*?\]', '', regex=True)

0                               when life gives you lemons, make lemonade! 🙂
1                                she bought 2 lemons for $1 at maven market.
2                            a dozen lemons will make a gallon of lemonade. 
3                                 lemon, lemon, lemons, lemon, lemon, lemons
4    he's running to the market to get a lemon — there's a great sale today.
5                     does maven market carry eureka lemons or meyer lemons?
6                         an arnold palmer is half lemonade, half iced tea. 
7                                                    iced tea is my favorite
Name: sentence_clean, dtype: object

In [21]:
df['sentence_clean'] = df['sentence_clean'].str.replace(r'\[.*?\]', '', regex=True)
df

Unnamed: 0,sentence,sentence_clean
0,"When life gives you lemons, make lemonade! 🙂","when life gives you lemons, make lemonade! 🙂"
1,She bought 2 lemons for $1 at Maven Market.,she bought 2 lemons for $1 at maven market.
2,A dozen lemons will make a gallon of lemonade. [AllRecipes],a dozen lemons will make a gallon of lemonade.
3,"lemon, lemon, lemons, lemon, lemon, lemons","lemon, lemon, lemons, lemon, lemon, lemons"
4,He's running to the market to get a lemon — there's a great sale today.,he's running to the market to get a lemon — there's a great sale today.
5,Does Maven Market carry Eureka lemons or Meyer lemons?,does maven market carry eureka lemons or meyer lemons?
6,"An Arnold Palmer is half lemonade, half iced tea. [Wikipedia]","an arnold palmer is half lemonade, half iced tea."
7,iced tea is my favorite,iced tea is my favorite
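
The pattern `\[.*?\]` reads as: a literal `[`, then as few characters as possible (`.*?`), then a literal `]`. The sketch below, using Python's `re` module on a made-up string, shows why the `?` matters when a sentence contains more than one bracketed tag.

```python
import re

# Made-up string with two bracketed tags (assumption for illustration)
s = "[a] keep this [b]"

# Lazy: each match stops at the first closing bracket
lazy = re.sub(r"\[.*?\]", "", s)

# Greedy: a single match runs from the first "[" to the last "]"
greedy = re.sub(r"\[.*\]", "", s)

print(repr(lazy))
print(repr(greedy))
```

The greedy version deletes the text between the two tags as well, which is rarely what is wanted when cleaning citations.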


In [None]:
### Step 3: Remove punctuation.

In [22]:
df['sentence_clean'].str.replace(r'[^\w\s]', '', regex=True)

0                              when life gives you lemons make lemonade 
1                              she bought 2 lemons for 1 at maven market
2                         a dozen lemons will make a gallon of lemonade 
3                                  lemon lemon lemons lemon lemon lemons
4    hes running to the market to get a lemon  theres a great sale today
5                  does maven market carry eureka lemons or meyer lemons
6                       an arnold palmer is half lemonade half iced tea 
7                                                iced tea is my favorite
Name: sentence_clean, dtype: object

In [23]:
df['sentence_clean'] = df['sentence_clean'].str.replace(r'[^\w\s]', '', regex=True)
df

Unnamed: 0,sentence,sentence_clean
0,"When life gives you lemons, make lemonade! 🙂",when life gives you lemons make lemonade
1,She bought 2 lemons for $1 at Maven Market.,she bought 2 lemons for 1 at maven market
2,A dozen lemons will make a gallon of lemonade. [AllRecipes],a dozen lemons will make a gallon of lemonade
3,"lemon, lemon, lemons, lemon, lemon, lemons",lemon lemon lemons lemon lemon lemons
4,He's running to the market to get a lemon — there's a great sale today.,hes running to the market to get a lemon theres a great sale today
5,Does Maven Market carry Eureka lemons or Meyer lemons?,does maven market carry eureka lemons or meyer lemons
6,"An Arnold Palmer is half lemonade, half iced tea. [Wikipedia]",an arnold palmer is half lemonade half iced tea
7,iced tea is my favorite,iced tea is my favorite
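
The pattern `[^\w\s]` matches any character that is neither a word character nor whitespace, so it removes emoji and symbols such as `$` along with punctuation. Removing a character that sat between two spaces leaves a double space, and removing one at the end of a sentence can leave a trailing space. The sketch below shows one possible tidy-up step using Python's `re` module; the sample string is made up for the example.

```python
import re

# Made-up sample with a spaced hyphen and a trailing emoji (assumption)
raw = "get a lemon - there's a sale today! 🙂"

# Same pattern as in the step above
no_punct = re.sub(r"[^\w\s]", "", raw.lower())

# Collapse runs of whitespace to one space and trim both ends
tidy = re.sub(r"\s+", " ", no_punct).strip()

print(repr(no_punct))
print(repr(tidy))
```

On a Pandas Series, the equivalent would be `.str.replace(r'\s+', ' ', regex=True).str.strip()`.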


In [24]:
### Pro Tip:

### The steps above can be wrapped in a function so that they can be applied to a different dataset without repeating code.

# df['sentence_clean'] = df['sentence'].str.lower()
# df['sentence_clean'] = df['sentence_clean'].str.replace(r'\[.*?\]', '', regex=True)
# df['sentence_clean'] = df['sentence_clean'].str.replace(r'[^\w\s]', '', regex=True)

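def lower_replace(series):
    """Lowercase a Series of strings, then remove bracketed text and punctuation."""
    output = series.str.lower()
    output = output.str.replace(r'\[.*?\]', '', regex=True)  # remove [bracketed] text
    output = output.str.replace(r'[^\w\s]', '', regex=True)  # remove punctuation and symbols
    return output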

In [25]:
### Let's try this with test data.

test_series

0    We're going to start this course with traditional NLP applications.
1                               Then we'll move on to modern NLP theory.
2            Finally, we'll wrap things up with modern NLP applications.
dtype: object

In [26]:
lower_replace(test_series)

0    were going to start this course with traditional nlp applications
1                               then well move on to modern nlp theory
2             finally well wrap things up with modern nlp applications
dtype: object

In [27]:
### We can also try this with our original data.

df.sentence

0                               When life gives you lemons, make lemonade! 🙂
1                                She bought 2 lemons for $1 at Maven Market.
2                A dozen lemons will make a gallon of lemonade. [AllRecipes]
3                                 lemon, lemon, lemons, lemon, lemon, lemons
4    He's running to the market to get a lemon — there's a great sale today.
5                     Does Maven Market carry Eureka lemons or Meyer lemons?
6              An Arnold Palmer is half lemonade, half iced tea. [Wikipedia]
7                                                    iced tea is my favorite
Name: sentence, dtype: object

In [28]:
lower_replace(df.sentence)

0                              when life gives you lemons make lemonade 
1                              she bought 2 lemons for 1 at maven market
2                         a dozen lemons will make a gallon of lemonade 
3                                  lemon lemon lemons lemon lemon lemons
4    hes running to the market to get a lemon  theres a great sale today
5                  does maven market carry eureka lemons or meyer lemons
6                       an arnold palmer is half lemonade half iced tea 
7                                                iced tea is my favorite
Name: sentence, dtype: object

In [30]:
### Next Objective:

### We have been given a flat file of the top 100 children's books over the past century.
### Use Pandas string functions to clean and normalize the text (lowercase, remove special characters, remove punctuation).

df = pd.read_csv('Chapter2_childrens_books.csv')

df.head()

Unnamed: 0,Ranking,Title,Author,Year,Rating,Description
0,1,Where the Wild Things Are,Maurice Sendak,1963,4.25,"Where the Wild Things Are follows Max, a young boy who, after being sent to his room for misbehaving, imagines sailing to an island filled with wild creatures. As their king, Max tames the beasts and eventually returns home to find his supper waiting for him. This iconic book explores themes of imagination, adventure, and the complex emotions of childhood, all captured through Sendak's whimsical illustrations and story."
1,2,The Very Hungry Caterpillar,Eric Carle,1969,4.34,"The Very Hungry Caterpillar tells the story of a caterpillar who eats through a variety of foods before eventually becoming a butterfly. Eric Carle’s use of colorful collage illustrations and rhythmic text has made this book a beloved classic for young readers. The simple, engaging story introduces children to days of the week, counting, and the concept of metamorphosis. It’s a staple in early childhood education."
2,3,The Giving Tree,Shel Silverstein,1964,4.38,"The Giving Tree is a touching and bittersweet story about a tree that gives everything it has to a boy over the course of his life. As the boy grows up, he takes more from the tree, and the tree continues to give, even when it has little left. Silverstein’s minimalist text and illustrations convey deep themes of unconditional love, selflessness, and the passage of time. It has sparked much discussion about relationships and sacrifice."
3,4,Green Eggs and Ham,Dr. Seuss,1960,4.31,"In Green Eggs and Ham, Sam-I-Am tries to convince a reluctant character to try a dish of green eggs and ham, despite his resistance. Through repetition and rhyme, Dr. Seuss’s classic story about being open to new experiences encourages children to be adventurous and try things outside their comfort zone. The playful illustrations and humorous dialogue make it a fun and educational read for young readers."
4,5,Goodnight Moon,Margaret Wise Brown,1947,4.31,"Goodnight Moon is a gentle, rhythmic bedtime story where a little bunny says goodnight to everything in his room, from the moon to the ""quiet old lady whispering hush."" Its repetitive structure and comforting tone make it ideal for young children. The simple illustrations by Clement Hurd complement the soothing nature of the story, making it a beloved classic for sleep-time reading."


In [38]:
### Step 1: Lower case letters.

df['Description_Clean'] = df.Description.str.lower()
df.head(2)

### Note: df['Description'] and df.Description work the same way here. However, the bracket form is more robust: it also works when a column name contains spaces or clashes with an existing DataFrame attribute.

Unnamed: 0,Ranking,Title,Author,Year,Rating,Description,Description_Clean
0,1,Where the Wild Things Are,Maurice Sendak,1963,4.25,"Where the Wild Things Are follows Max, a young boy who, after being sent to his room for misbehaving, imagines sailing to an island filled with wild creatures. As their king, Max tames the beasts and eventually returns home to find his supper waiting for him. This iconic book explores themes of imagination, adventure, and the complex emotions of childhood, all captured through Sendak's whimsical illustrations and story.","where the wild things are follows max, a young boy who, after being sent to his room for misbehaving, imagines sailing to an island filled with wild creatures. as their king, max tames the beasts and eventually returns home to find his supper waiting for him. this iconic book explores themes of imagination, adventure, and the complex emotions of childhood, all captured through sendak's whimsical illustrations and story."
1,2,The Very Hungry Caterpillar,Eric Carle,1969,4.34,"The Very Hungry Caterpillar tells the story of a caterpillar who eats through a variety of foods before eventually becoming a butterfly. Eric Carle’s use of colorful collage illustrations and rhythmic text has made this book a beloved classic for young readers. The simple, engaging story introduces children to days of the week, counting, and the concept of metamorphosis. It’s a staple in early childhood education.","the very hungry caterpillar tells the story of a caterpillar who eats through a variety of foods before eventually becoming a butterfly. eric carle’s use of colorful collage illustrations and rhythmic text has made this book a beloved classic for young readers. the simple, engaging story introduces children to days of the week, counting, and the concept of metamorphosis. it’s a staple in early childhood education."
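
The difference between the two access styles shows up with awkward column names. The sketch below uses a small made-up DataFrame (the `books` frame and its columns are hypothetical, chosen only to trigger the two problem cases).

```python
import pandas as pd

# Hypothetical columns: one clashes with a DataFrame method, one contains a space
books = pd.DataFrame({
    "Title": ["A", "B"],
    "count": [1, 2],
    "Year Published": [1963, 1969],
})

same = books["Title"].equals(books.Title)  # both forms work for a simple name
method = books.count                       # the DataFrame.count method, not the column
column = books["count"]                    # the actual column
spaced = books["Year Published"]           # no attribute form exists for this name

print(same)
print(callable(method))
print(column.tolist())
print(spaced.tolist())
```

Bracket access is also required when creating a new column, as in the step above that adds `Description_Clean`.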


In [39]:
### Step 2: Replace special characters.

df['Description'].iloc[0]

### There is a special space (a non-breaking space), and it shows up as the characters \xa0.

"Where the Wild Things Are\xa0follows Max, a young boy who, after being sent to his room for misbehaving, imagines sailing to an island filled with wild creatures. As their king, Max tames the beasts and eventually returns home to find his supper waiting for him. This iconic book explores themes of imagination, adventure, and the complex emotions of childhood, all captured through Sendak's whimsical illustrations and story."

In [41]:
# df['Description_Clean'] = df['Description_Clean'].str.replace('\xa0', '')
# df.head()

### Note that we need a space between the words. So the correct command is:

df['Description_Clean'] = df['Description_Clean'].str.replace('\xa0', ' ') # Replace with a single space (note the space between the quotation marks)
df.head(2)

Unnamed: 0,Ranking,Title,Author,Year,Rating,Description,Description_Clean
0,1,Where the Wild Things Are,Maurice Sendak,1963,4.25,"Where the Wild Things Are follows Max, a young boy who, after being sent to his room for misbehaving, imagines sailing to an island filled with wild creatures. As their king, Max tames the beasts and eventually returns home to find his supper waiting for him. This iconic book explores themes of imagination, adventure, and the complex emotions of childhood, all captured through Sendak's whimsical illustrations and story.","where the wild things are follows max, a young boy who, after being sent to his room for misbehaving, imagines sailing to an island filled with wild creatures. as their king, max tames the beasts and eventually returns home to find his supper waiting for him. this iconic book explores themes of imagination, adventure, and the complex emotions of childhood, all captured through sendak's whimsical illustrations and story."
1,2,The Very Hungry Caterpillar,Eric Carle,1969,4.34,"The Very Hungry Caterpillar tells the story of a caterpillar who eats through a variety of foods before eventually becoming a butterfly. Eric Carle’s use of colorful collage illustrations and rhythmic text has made this book a beloved classic for young readers. The simple, engaging story introduces children to days of the week, counting, and the concept of metamorphosis. It’s a staple in early childhood education.","the very hungry caterpillar tells the story of a caterpillar who eats through a variety of foods before eventually becoming a butterfly. eric carle’s use of colorful collage illustrations and rhythmic text has made this book a beloved classic for young readers. the simple, engaging story introduces children to days of the week, counting, and the concept of metamorphosis. it’s a staple in early childhood education."
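
Replacing `\xa0` by hand works when that is the only unusual character. As an alternative sketch, Unicode NFKC normalization from Python's built-in `unicodedata` module maps the non-breaking space to a regular space. Note that NFKC also rewrites other compatibility characters (ligatures and full-width letters, for example), so whether it suits a given dataset is an assumption to check.

```python
import unicodedata

text = "Where the Wild Things Are\xa0follows Max"

# NFKC maps the non-breaking space (U+00A0) to a regular space (U+0020)
normalized = unicodedata.normalize("NFKC", text)

print(repr(text))
print(repr(normalized))
```

Pandas exposes the same operation on a Series as `.str.normalize('NFKC')`.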


In [42]:
### Step 3: Remove punctuation.

df['Description_Clean'] = df['Description_Clean'].str.replace(r'[^\w\s]', '', regex=True) # the r prefix marks a raw string, so backslashes reach the regex engine unchanged
df.head()

Unnamed: 0,Ranking,Title,Author,Year,Rating,Description,Description_Clean
0,1,Where the Wild Things Are,Maurice Sendak,1963,4.25,"Where the Wild Things Are follows Max, a young boy who, after being sent to his room for misbehaving, imagines sailing to an island filled with wild creatures. As their king, Max tames the beasts and eventually returns home to find his supper waiting for him. This iconic book explores themes of imagination, adventure, and the complex emotions of childhood, all captured through Sendak's whimsical illustrations and story.",where the wild things are follows max a young boy who after being sent to his room for misbehaving imagines sailing to an island filled with wild creatures as their king max tames the beasts and eventually returns home to find his supper waiting for him this iconic book explores themes of imagination adventure and the complex emotions of childhood all captured through sendaks whimsical illustrations and story
1,2,The Very Hungry Caterpillar,Eric Carle,1969,4.34,"The Very Hungry Caterpillar tells the story of a caterpillar who eats through a variety of foods before eventually becoming a butterfly. Eric Carle’s use of colorful collage illustrations and rhythmic text has made this book a beloved classic for young readers. The simple, engaging story introduces children to days of the week, counting, and the concept of metamorphosis. It’s a staple in early childhood education.",the very hungry caterpillar tells the story of a caterpillar who eats through a variety of foods before eventually becoming a butterfly eric carles use of colorful collage illustrations and rhythmic text has made this book a beloved classic for young readers the simple engaging story introduces children to days of the week counting and the concept of metamorphosis its a staple in early childhood education
2,3,The Giving Tree,Shel Silverstein,1964,4.38,"The Giving Tree is a touching and bittersweet story about a tree that gives everything it has to a boy over the course of his life. As the boy grows up, he takes more from the tree, and the tree continues to give, even when it has little left. Silverstein’s minimalist text and illustrations convey deep themes of unconditional love, selflessness, and the passage of time. It has sparked much discussion about relationships and sacrifice.",the giving tree is a touching and bittersweet story about a tree that gives everything it has to a boy over the course of his life as the boy grows up he takes more from the tree and the tree continues to give even when it has little left silversteins minimalist text and illustrations convey deep themes of unconditional love selflessness and the passage of time it has sparked much discussion about relationships and sacrifice
3,4,Green Eggs and Ham,Dr. Seuss,1960,4.31,"In Green Eggs and Ham, Sam-I-Am tries to convince a reluctant character to try a dish of green eggs and ham, despite his resistance. Through repetition and rhyme, Dr. Seuss’s classic story about being open to new experiences encourages children to be adventurous and try things outside their comfort zone. The playful illustrations and humorous dialogue make it a fun and educational read for young readers.",in green eggs and ham samiam tries to convince a reluctant character to try a dish of green eggs and ham despite his resistance through repetition and rhyme dr seusss classic story about being open to new experiences encourages children to be adventurous and try things outside their comfort zone the playful illustrations and humorous dialogue make it a fun and educational read for young readers
4,5,Goodnight Moon,Margaret Wise Brown,1947,4.31,"Goodnight Moon is a gentle, rhythmic bedtime story where a little bunny says goodnight to everything in his room, from the moon to the ""quiet old lady whispering hush."" Its repetitive structure and comforting tone make it ideal for young children. The simple illustrations by Clement Hurd complement the soothing nature of the story, making it a beloved classic for sleep-time reading.",goodnight moon is a gentle rhythmic bedtime story where a little bunny says goodnight to everything in his room from the moon to the quiet old lady whispering hush its repetitive structure and comforting tone make it ideal for young children the simple illustrations by clement hurd complement the soothing nature of the story making it a beloved classic for sleeptime reading
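
In the output above, deleting punctuation outright fuses hyphenated words ("sam-i-am" becomes "samiam", "sleep-time" becomes "sleeptime"). One possible variation, sketched here with Python's `re` module on a made-up string, is to replace punctuation with a space and then collapse the extra whitespace. Neither choice is right for every project; this variant also splits possessives such as "sendak's" into "sendak s".

```python
import re

# Made-up sample containing hyphenated words (assumption for illustration)
s = "sam-i-am reads at sleep-time"

# Delete punctuation: hyphenated parts are fused together
removed = re.sub(r"[^\w\s]", "", s)

# Replace punctuation with a space, then collapse repeated whitespace
spaced = re.sub(r"\s+", " ", re.sub(r"[^\w\s]", " ", s)).strip()

print(removed)
print(spaced)
```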


#### Text Preprocessing with spaCy

The spaCy library can handle many NLP tasks, including tokenization, lemmatization, stop words, and more.

With spaCy, the first step is to turn the text into a spaCy Doc object, which requires specifying the language of the text.

Tokenization: Breaks text up into smaller units (tokens, usually words). spaCy first splits on whitespace and then applies language-specific rules to separate punctuation and contractions.

Lemmatization: Reduces words to their base form. This is a common way to normalize text. spaCy uses a combination of linguistic rules and statistical models to lemmatize text. (Examples: i -> I, lemons -> lemon, selling -> sell)

Stop words: Common words that carry little meaning on their own (the, are, for, ...). They are often removed before analysis.

Parts of speech (POS) tagging: Labels nouns, verbs, and other grammatical roles within text data. This step is optional. It is sometimes used as a filtering technique, for example to look only at nouns and pronouns for analysis.
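
To give a feel for the first two ideas without installing a language model, here is a toy sketch in plain Python. It is not how spaCy works internally: the regex tokenizer and the `LEMMAS` lookup table are hypothetical stand-ins built from the examples above, whereas spaCy relies on language-specific rules and trained models.

```python
import re

# Toy lemma table taken from the examples in the text (assumption; not spaCy's data)
LEMMAS = {"i": "I", "lemons": "lemon", "selling": "sell"}

def tokenize(text):
    """Split text into word tokens and single punctuation tokens."""
    return re.findall(r"\w+|[^\w\s]", text)

def lemmatize(tokens):
    """Look up each lowercased token; fall back to the token itself."""
    return [LEMMAS.get(t.lower(), t) for t in tokens]

tokens = tokenize("i am selling lemons, today.")
lemmas = lemmatize(tokens)

print(tokens)
print(lemmas)
```

Unlike a plain `str.split()`, the tokenizer keeps the comma and the period as separate tokens, which mirrors the idea that tokenization goes beyond splitting on whitespace.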