# Exploring and Processing Text Data
In this chapter, we cover various methods and techniques for preprocessing
text data, along with exploratory data analysis.
We discuss the following recipes under text preprocessing
and exploratory data analysis.
 
    Recipe 1. Lowercasing

    Recipe 2. Punctuation removal

    Recipe 3. Stop words removal
 
    Recipe 4. Text standardization

    Recipe 5. Spelling correction

    Recipe 6. Tokenization

    Recipe 7. Stemming

    Recipe 8. Lemmatization

    Recipe 9. Exploratory data analysis

    Recipe 10. End-to-end processing pipeline

# Recipe 1. Lowercasing

# Recipe 2-1. Converting Text Data to Lowercase

# Problem
   
    You want to lowercase the text data.
    
# Solution

    The simplest way to do this is to use Python's built-in lower() string method. It returns a new string in which all uppercase characters are converted to lowercase; the original string is left unchanged.
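
As a quick illustration of that behavior, here is a minimal sketch (the sample string and variable names are illustrative and not from the original data). It shows that lower() returns a new string, leaves the original untouched, and does not alter digits or punctuation.

```python
# Strings are immutable, so lower() returns a new string
# instead of modifying the original in place.
original = "Python 3 Is GREAT!"
lowered = original.lower()

print(original)  # Python 3 Is GREAT!
print(lowered)   # python 3 is great!
```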

# How It Works
    
    Let’s follow the steps in this section to lowercase a given text or document using Python.

In [2]:
text=['This is introduction to NLP','It is likely to be useful,to people ',
      'Machine learning is the new electrcity','There would be less hype around AI and more action going forward',
      'python is the best tool!','R is good langauage','I like this book','I want more books like this']

In [3]:
text

['This is introduction to NLP',
 'It is likely to be useful,to people ',
 'Machine learning is the new electrcity',
 'There would be less hype around AI and more action going forward',
 'python is the best tool!',
 'R is good langauage',
 'I like this book',
 'I want more books like this']

In [4]:
# Convert the list to a DataFrame
import pandas as pd
df = pd.DataFrame({'tweet':text})
print(df)

                                               tweet
0                        This is introduction to NLP
1               It is likely to be useful,to people 
2             Machine learning is the new electrcity
3  There would be less hype around AI and more ac...
4                           python is the best tool!
5                                R is good langauage
6                                   I like this book
7                        I want more books like this


In [5]:
df

Unnamed: 0,tweet
0,This is introduction to NLP
1,"It is likely to be useful,to people"
2,Machine learning is the new electrcity
3,There would be less hype around AI and more ac...
4,python is the best tool!
5,R is good langauage
6,I like this book
7,I want more books like this


# Step 1-2 Execute the lower() function on the text data

In [7]:
x="LOWER"

In [9]:
x.lower()

'lower'

In [15]:
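# Lowercase each tweet; x is already a string, so no join is needed
df["lower_tweet"]=df["tweet"].apply(lambda x: x.lower())
# Alternative that also collapses extra whitespace between words:
#df["lower_tweet"]=df['tweet'].apply(lambda x: " ".join(x.lower()for x in x.split()))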
df

Unnamed: 0,tweet,lower_tweet
0,This is introduction to NLP,this is introduction to nlp
1,"It is likely to be useful,to people","it is likely to be useful,to people"
2,Machine learning is the new electrcity,machine learning is the new electrcity
3,There would be less hype around AI and more ac...,there would be less hype around ai and more ac...
4,python is the best tool!,python is the best tool!
5,R is good langauage,r is good langauage
6,I like this book,i like this book
7,I want more books like this,i want more books like this
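
The same column can also be lowercased with the vectorized pandas string accessor, which passes missing values through instead of raising an error (calling `.lower()` on a missing value inside `apply` would fail). A minimal sketch, assuming a small hypothetical DataFrame named `sample` that is not part of the original data:

```python
import pandas as pd

# Hypothetical sample with one missing value (None).
sample = pd.DataFrame({"tweet": ["This is NLP", None, "R is GOOD"]})

# Series.str.lower() lowercases each string and keeps
# missing values as missing instead of raising an error.
sample["lower_tweet"] = sample["tweet"].str.lower()

print(sample)
```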


# Recipe 2-2. Removing Punctuation

In [16]:
import string

In [18]:
string.punctuation

'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'

# Problem

    You want to remove punctuation from the text data.

# Solution
    
    The simplest way to do this is to use a regular expression, either with re.sub() on a single string or with the pandas str.replace() method on a DataFrame column.
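
To make the pattern used in this recipe concrete: `[^\w\s]` matches any single character that is neither a word character (letter, digit, or underscore) nor whitespace. The sketch below uses a made-up sample string to list what the pattern matches. Note that the underscore is not matched, because `\w` includes it.

```python
import re

# Hypothetical sample string for illustration.
sample = "Hello, world! snake_case costs $5."

# Characters that are neither word characters nor whitespace.
matched = re.findall(r"[^\w\s]", sample)
print(matched)  # [',', '!', '$', '.']

# Removing them leaves letters, digits, underscores, and whitespace.
cleaned = re.sub(r"[^\w\s]", "", sample)
print(cleaned)  # Hello world snake_case costs 5
```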

# How It Works
    
    Let’s follow the steps in this section to remove punctuation from the text data.

In [19]:
text=['This is introduction to NLP','It is likely to be useful,to people ',
      'Machine learning is the new electrcity','There would be less hype around AI and more action going forward',
      'python is the best tool!','R is good langauage','I like this book','I want more books like this']
# Convert the list to a DataFrame
import pandas as pd
df1 = pd.DataFrame({'tweets':text})
print(df1)

                                              tweets
0                        This is introduction to NLP
1               It is likely to be useful,to people 
2             Machine learning is the new electrcity
3  There would be less hype around AI and more ac...
4                           python is the best tool!
5                                R is good langauage
6                                   I like this book
7                        I want more books like this


# Step 2-2 Execute the following function on the text data

In [20]:
import re
s = "I. like. This book!"

In [23]:
s1=re.sub(r"[^\w\s]","",s)
s1

'I like This book'
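
As an alternative sketch (not the method used in this recipe), the `string.punctuation` constant shown earlier can be combined with `str.maketrans()` and `str.translate()`. This removes only the ASCII punctuation characters listed in that constant, so its behavior differs slightly from the regex.

```python
import string

# Translation table that deletes every character in string.punctuation.
table = str.maketrans("", "", string.punctuation)

s = "I. like. This book!"
print(s.translate(table))  # I like This book

# The underscore is part of string.punctuation, so it is removed here,
# whereas the regex [^\w\s] would keep it.
print("snake_case!".translate(table))  # snakecase
```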

In [24]:
df1["lower_tweet"]=df1['tweets'].apply(lambda x: " ".join(x.lower()for x in x.split()))
df1

Unnamed: 0,tweets,lower_tweet
0,This is introduction to NLP,this is introduction to nlp
1,"It is likely to be useful,to people","it is likely to be useful,to people"
2,Machine learning is the new electrcity,machine learning is the new electrcity
3,There would be less hype around AI and more ac...,there would be less hype around ai and more ac...
4,python is the best tool!,python is the best tool!
5,R is good langauage,r is good langauage
6,I like this book,i like this book
7,I want more books like this,i want more books like this


In [26]:
# regex=True is needed: pandas 2.0 and later treat the pattern as a literal string by default
df1["remove_punc"]=df1['lower_tweet'].str.replace(r"[^\w\s]","",regex=True)


In [27]:
df1

Unnamed: 0,tweets,lower_tweet,remove_punc
0,This is introduction to NLP,this is introduction to nlp,this is introduction to nlp
1,"It is likely to be useful,to people","it is likely to be useful,to people",it is likely to be usefulto people
2,Machine learning is the new electrcity,machine learning is the new electrcity,machine learning is the new electrcity
3,There would be less hype around AI and more ac...,there would be less hype around ai and more ac...,there would be less hype around ai and more ac...
4,python is the best tool!,python is the best tool!,python is the best tool
5,R is good langauage,r is good langauage,r is good langauage
6,I like this book,i like this book,i like this book
7,I want more books like this,i want more books like this,i want more books like this
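
Row 1 of the output shows a side effect of deleting punctuation outright: `useful,to` becomes `usefulto`, merging two words. One possible workaround, offered as a sketch and not as part of the original recipe, is to replace punctuation with a space and then collapse repeated whitespace. The helper name `strip_punctuation` is hypothetical.

```python
import re

def strip_punctuation(text: str) -> str:
    """Replace punctuation with spaces, then normalize whitespace."""
    # Swap every non-word, non-space character for a single space.
    spaced = re.sub(r"[^\w\s]", " ", text)
    # split() with no arguments drops leading, trailing, and repeated whitespace.
    return " ".join(spaced.split())

print(strip_punctuation("it is likely to be useful,to people"))
# it is likely to be useful to people
print(strip_punctuation("python is the best tool!"))
# python is the best tool
```

It could be applied to the column with `df1['lower_tweet'].apply(strip_punctuation)`. The trade-off is that contractions such as `don't` are split into two tokens, so which variant is preferable depends on the downstream task.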
