### Text Preprocessing

This chapter covers the text preprocessing steps required before applying machine learning algorithms.
These steps include cleaning, normalization, and vectorization, among others.

1. NLP pipeline
2. Text Preprocessing:
   - with Pandas
   - with spaCy
3. Vectorization

The NLP pipeline is the standard natural language processing workflow. It is similar to the data science workflow, with one extra step.

Vectorization formats the text data in a way that a computer can process.

#### NLP Pipeline

Since the NLP pipeline is similar to a data science workflow, we will first go through the steps of that workflow.

Step 1: Scoping a Project (the goal and the questions we want to answer)

Step 2: Gathering Data

Step 3: Cleaning Data

Step 4: Exploring Data

Step 5: Modeling Data

Step 6: Sharing Insights

The NLP pipeline adds one more step between Cleaning Data and Exploring Data, called "Text Preprocessing."
This step ensures the data is ready to explore and model.
Including the new step, Steps 2 to 5 are known as the NLP pipeline.

#### Text Preprocessing

Text preprocessing is about preparing raw data for analysis and modeling.

In the text preprocessing step, we typically start with cleaning and normalization (with Pandas or spaCy):
    
    Cleaning: Remove unnecessary text
    
    Normalization: Make text consistent

    Concepts:
    Lowercasing - Convert all text to lowercase
    Special characters - Remove punctuation and special characters using regular expressions
    Tokenization - Split text into smaller units (words/sentences)
    Stemming/lemmatization - Reduce words to their root or base form
    Stop words - Remove common, non-essential words
    Parts of Speech (POS) tagging - Identify grammatical roles of words (nouns, verbs, etc.)
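
As a rough illustration of the first few concepts, the sketch below lowercases a sentence, strips special characters with a regular expression, splits the result into word tokens, and drops stop words, using only the standard library. The three-word `stop_words` set is a hypothetical stand-in for a real stop list, and stemming/lemmatization and POS tagging are left out because they need a library such as spaCy.

```python
import re

text = "She bought 2 lemons for $1 at Maven Market."

# Lowercasing: make the text case-consistent.
lowered = text.lower()

# Special characters: keep only lowercase letters, digits, and whitespace.
no_special = re.sub(r"[^a-z0-9\s]", "", lowered)

# Tokenization: split on whitespace into word tokens.
tokens = no_special.split()

# Stop words: a tiny, made-up stop list for illustration only.
stop_words = {"she", "for", "at"}
filtered = [token for token in tokens if token not in stop_words]

print(filtered)  # ['bought', '2', 'lemons', '1', 'maven', 'market']
```

Whether digits such as `2` and `1` should be kept depends on the project; here the pattern keeps them.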

Then comes vectorization (turning text into a matrix of numbers).
    
    Here, each document (text) is represented by a vector of counts or TF-IDF (Term Frequency-Inverse Document Frequency) values.
    
    This can be done with scikit-learn.

    Concepts:
    Document-Term Matrix (DTM) - Represent text by word frequency (known as Bag of Words)
    TF-IDF - Extension of DTM that weights words based on their importance.

Note: The order of steps may vary depending on the project.
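
To make the document-term matrix concrete, here is a minimal pure-Python sketch, assuming whitespace tokenization and three short made-up documents. In practice scikit-learn's `CountVectorizer` builds this matrix (and `TfidfVectorizer` the TF-IDF variant); the code below only illustrates the idea and is not how those classes are implemented.

```python
from collections import Counter

# Three made-up, already-cleaned documents.
docs = [
    "lemon lemon lemons",
    "iced tea is my favorite",
    "half lemonade half iced tea",
]

# Tokenize on whitespace.
tokenized = [doc.split() for doc in docs]

# The vocabulary is the sorted set of all terms; each term becomes a column.
vocab = sorted({term for tokens in tokenized for term in tokens})

# Each row holds one document's term counts, in vocabulary order.
dtm = []
for tokens in tokenized:
    counts = Counter(tokens)
    dtm.append([counts[term] for term in vocab])

print(vocab)
for row in dtm:
    print(row)
```

Each row is one document and each column is one vocabulary term, so the first row has a 2 in the `lemon` column and a 1 in the `lemons` column. Note that `lemon` and `lemons` are counted separately here, which is one reason stemming or lemmatization usually comes before vectorization.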

Before moving forward, we will create a new environment called 'nlp_basics' and install the following:

    jupyter notebook
    matplotlib
    pandas
    python
    scikit-learn
    spacy

#### Text Preprocessing with Pandas

Pandas can be used for simple text cleaning and normalization.

The following are the functions we will be using.

    str.lower() - Make all text lowercase
    str.replace() - Replace special characters (punctuation/numbers/...)

Note: Regular expressions (regex) let you search for patterns in text, much like a more powerful version of Ctrl+F.
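
As a quick preview of how these two functions behave, here is a small sketch on a made-up two-row Series. The patterns are illustrative choices rather than the exact ones used later, and `regex=True` is passed explicitly because its default differs between pandas versions.

```python
import pandas as pd

# Hypothetical sample, separate from the data set built below.
sample = pd.Series([
    "A dozen lemons. [AllRecipes]",
    "She bought 2 lemons for $1.",
])

# Make all text lowercase.
lowered = sample.str.lower()

# Remove anything inside square brackets, brackets included.
no_brackets = lowered.str.replace(r"\[.*?\]", "", regex=True)

# Remove every character that is not a word character or whitespace.
no_punct = no_brackets.str.replace(r"[^\w\s]", "", regex=True)

# Trim the whitespace left behind at the ends.
cleaned = no_punct.str.strip()

print(cleaned.tolist())  # ['a dozen lemons', 'she bought 2 lemons for 1']
```

The `.str` methods return a new Series each time, so the steps can also be chained into one expression.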

##### Create Data Sets

In [2]:
import pandas as pd

In [3]:
data = [
    "When life gives you lemons, make lemonade! 🙂",
    "She bought 2 lemons for $1 at Maven Market.",
    "A dozen lemons will make a gallon of lemonade. [AllRecipes]",
    "lemon, lemon, lemons, lemon, lemon, lemons",
    "He's running to the market to get a lemon — there's a great sale today.",
    "Does Maven Market carry Eureka lemons or Meyer lemons?",
    "An Arnold Palmer is half lemonade, half iced tea. [Wikipedia]",
    "iced tea is my favorite"
]

In [4]:
data

['When life gives you lemons, make lemonade! 🙂',
 'She bought 2 lemons for $1 at Maven Market.',
 'A dozen lemons will make a gallon of lemonade. [AllRecipes]',
 'lemon, lemon, lemons, lemon, lemon, lemons',
 "He's running to the market to get a lemon — there's a great sale today.",
 'Does Maven Market carry Eureka lemons or Meyer lemons?',
 'An Arnold Palmer is half lemonade, half iced tea. [Wikipedia]',
 'iced tea is my favorite']

In [6]:
### It is a list. We first convert it to a DataFrame.

pd.DataFrame(data)

,0
0,"When life gives you lemons, make lemonade! 🙂"
1,She bought 2 lemons for $1 at Maven Market.
2,A dozen lemons will make a gallon of lemonade....
3,"lemon, lemon, lemons, lemon, lemon, lemons"
4,He's running to the market to get a lemon — th...
5,Does Maven Market carry Eureka lemons or Meyer...
6,"An Arnold Palmer is half lemonade, half iced t..."
7,iced tea is my favorite


In [7]:
pd.DataFrame(data, columns=['sentence'])

,sentence
0,"When life gives you lemons, make lemonade! 🙂"
1,She bought 2 lemons for $1 at Maven Market.
2,A dozen lemons will make a gallon of lemonade....
3,"lemon, lemon, lemons, lemon, lemon, lemons"
4,He's running to the market to get a lemon — th...
5,Does Maven Market carry Eureka lemons or Meyer...
6,"An Arnold Palmer is half lemonade, half iced t..."
7,iced tea is my favorite


In [8]:
data_df = pd.DataFrame(data, columns=['sentence'])
data_df

### Note that some of the sentences are cut off. We can fix this by changing pandas options.

,sentence
0,"When life gives you lemons, make lemonade! 🙂"
1,She bought 2 lemons for $1 at Maven Market.
2,A dozen lemons will make a gallon of lemonade....
3,"lemon, lemon, lemons, lemon, lemon, lemons"
4,He's running to the market to get a lemon — th...
5,Does Maven Market carry Eureka lemons or Meyer...
6,"An Arnold Palmer is half lemonade, half iced t..."
7,iced tea is my favorite


In [9]:
pd.set_option('display.max_colwidth', None)

### Note that this option applies to every DataFrame displayed for the rest of the session.
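
Because `pd.set_option` changes the setting for the rest of the session, it can be handy to widen columns only temporarily. The sketch below uses `pd.option_context`, which restores the previous value when the `with` block exits; the one-row `demo_df` is a made-up example, not part of the data set above.

```python
import pandas as pd

# Hypothetical one-row frame whose text is longer than the default column width.
demo_df = pd.DataFrame(
    {"sentence": ["A dozen lemons will make a gallon of lemonade. [AllRecipes]"]}
)

width_before = pd.get_option("display.max_colwidth")

# Inside the block, long strings are shown in full.
with pd.option_context("display.max_colwidth", None):
    width_inside = pd.get_option("display.max_colwidth")
    full_repr = repr(demo_df)

# On exit, the previous setting is back in place.
width_after = pd.get_option("display.max_colwidth")

print(full_repr)
print(width_before == width_after)  # True
```

For a notebook that displays long text throughout, the global `pd.set_option` call is simpler; the context manager is mainly useful when only one or two outputs need the wider view.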

In [10]:
data_df = pd.DataFrame(data, columns=['sentence'])
data_df

,sentence
0,"When life gives you lemons, make lemonade! 🙂"
1,She bought 2 lemons for $1 at Maven Market.
2,A dozen lemons will make a gallon of lemonade. [AllRecipes]
3,"lemon, lemon, lemons, lemon, lemon, lemons"
4,He's running to the market to get a lemon — there's a great sale today.
5,Does Maven Market carry Eureka lemons or Meyer lemons?
6,"An Arnold Palmer is half lemonade, half iced tea. [Wikipedia]"
7,iced tea is my favorite


In [11]:
# create a test data set

test = [
    "We're going to start this course with traditional NLP applications.",
    "Then we'll move on to modern NLP theory.",
    "Finally, we'll wrap things up with modern NLP applications."
]

In [12]:
test_series = pd.Series(test)
test_series

### Note: Here we are using a Series (just a single column). DataFrames are more flexible, and we will be adding more columns to data_df.

0    We're going to start this course with traditional NLP applications.
1                               Then we'll move on to modern NLP theory.
2            Finally, we'll wrap things up with modern NLP applications.
dtype: object

In [13]:
### Text Preprocessing with pandas

In [14]:
data_df

### We will add a column that shows a cleaned version of the sentences.

,sentence
0,"When life gives you lemons, make lemonade! 🙂"
1,She bought 2 lemons for $1 at Maven Market.
2,A dozen lemons will make a gallon of lemonade. [AllRecipes]
3,"lemon, lemon, lemons, lemon, lemon, lemons"
4,He's running to the market to get a lemon — there's a great sale today.
5,Does Maven Market carry Eureka lemons or Meyer lemons?
6,"An Arnold Palmer is half lemonade, half iced tea. [Wikipedia]"
7,iced tea is my favorite


In [16]:
### To keep the original data, we will first make a copy.

df = data_df.copy()
df

,sentence
0,"When life gives you lemons, make lemonade! 🙂"
1,She bought 2 lemons for $1 at Maven Market.
2,A dozen lemons will make a gallon of lemonade. [AllRecipes]
3,"lemon, lemon, lemons, lemon, lemon, lemons"
4,He's running to the market to get a lemon — there's a great sale today.
5,Does Maven Market carry Eureka lemons or Meyer lemons?
6,"An Arnold Palmer is half lemonade, half iced tea. [Wikipedia]"
7,iced tea is my favorite
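
The reason for calling `.copy()` rather than writing `df = data_df` is that plain assignment only creates a second name for the same DataFrame. The sketch below shows the difference on a made-up two-row frame; the variable names are illustrative only.

```python
import pandas as pd

original = pd.DataFrame({"sentence": ["Iced Tea", "Lemonade"]})

alias = original                # a second name for the same object
independent = original.copy()   # a separate DataFrame with the same data

# Adding a column to the copy leaves the original untouched.
independent["sentence_clean"] = independent["sentence"].str.lower()
print(list(original.columns))  # ['sentence']

# Adding a column through the alias changes the original as well.
alias["length"] = alias["sentence"].str.len()
print(list(original.columns))  # ['sentence', 'length']
```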


In [None]:
### Step 1: Make text lowercase.

In [17]:
df['sentence'].str.lower()

0                               when life gives you lemons, make lemonade! 🙂
1                                she bought 2 lemons for $1 at maven market.
2                a dozen lemons will make a gallon of lemonade. [allrecipes]
3                                 lemon, lemon, lemons, lemon, lemon, lemons
4    he's running to the market to get a lemon — there's a great sale today.
5                     does maven market carry eureka lemons or meyer lemons?
6              an arnold palmer is half lemonade, half iced tea. [wikipedia]
7                                                    iced tea is my favorite
Name: sentence, dtype: object

In [18]:
df['sentence_clean'] = df['sentence'].str.lower()
df

,sentence,sentence_clean
0,"When life gives you lemons, make lemonade! 🙂","when life gives you lemons, make lemonade! 🙂"
1,She bought 2 lemons for $1 at Maven Market.,she bought 2 lemons for $1 at maven market.
2,A dozen lemons will make a gallon of lemonade. [AllRecipes],a dozen lemons will make a gallon of lemonade. [allrecipes]
3,"lemon, lemon, lemons, lemon, lemon, lemons","lemon, lemon, lemons, lemon, lemon, lemons"
4,He's running to the market to get a lemon — there's a great sale today.,he's running to the market to get a lemon — there's a great sale today.
5,Does Maven Market carry Eureka lemons or Meyer lemons?,does maven market carry eureka lemons or meyer lemons?
6,"An Arnold Palmer is half lemonade, half iced tea. [Wikipedia]","an arnold palmer is half lemonade, half iced tea. [wikipedia]"
7,iced tea is my favorite,iced tea is my favorite


In [None]:
### Step 2: Remove text in square brackets.