### What is Natural Language Processing?

##### Natural language processing (NLP) is a subfield of linguistics, computer science, and artificial intelligence concerned with the interactions between computers and human language, in particular how to program computers to process and analyze large amounts of natural language data. 

The goal is a computer capable of "understanding" the contents of documents, including the contextual nuances of the language within them. The technology can then accurately extract information and insights contained in the documents as well as categorize and organize the documents themselves.

### Common Uses of NLP

###### 1. Search Autocomplete

Search autocomplete is a type of NLP that many people use every day and have almost come to expect when searching. This is thanks in large part to pioneers like Google, which has offered the feature in its search engine for years. Autocomplete is just as helpful on a company's own website.

<img src="search_autocomplete.PNG">
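As a rough illustration of the idea (not how Google implements it), autocomplete can be sketched as a prefix lookup over past queries. The query list and the `autocomplete` helper below are hypothetical names introduced only for this example.

```python
from bisect import bisect_left

def autocomplete(prefix, sorted_queries, limit=3):
    """Return up to `limit` queries that start with `prefix`."""
    prefix = prefix.lower()
    # Binary search for the first query that is >= the prefix.
    start = bisect_left(sorted_queries, prefix)
    matches = []
    for query in sorted_queries[start:]:
        if not query.startswith(prefix):
            break  # the list is sorted, so no later query can match
        matches.append(query)
        if len(matches) == limit:
            break
    return matches

# Hypothetical query log, lowercased and sorted once up front.
queries = sorted([
    "natural language processing",
    "natural gas",
    "nltk tutorial",
    "neural network",
    "nlp chatbot",
])

print(autocomplete("nat", queries))  # ['natural gas', 'natural language processing']
```

A production system would typically also rank suggestions by popularity; this sketch only returns them in alphabetical order.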

###### 2. Search Autocorrect

When we type, we often make mistakes without realizing it. If a website's search engine doesn't catch those mistakes and instead shows no results, potential buyers might assume you don't have the information or answers they are looking for and go to a competitor instead.

We see this when we mistype a query and Google's search engine autocorrects it and returns results for the intended topic.


<img src="search_autocorrect.PNG">
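A minimal sketch of autocorrect, assuming a small known vocabulary: Python's standard `difflib.get_close_matches` picks the vocabulary word most similar to the typed word. The vocabulary, the `autocorrect` helper, and the 0.7 cutoff are assumptions made for illustration; real search engines use far richer signals.

```python
from difflib import get_close_matches

# Hypothetical vocabulary of terms the site knows about.
vocabulary = ["language", "processing", "translation", "chatbot", "tokenization"]

def autocorrect(word, vocabulary, cutoff=0.7):
    """Return the closest vocabulary word, or the word itself if nothing is close."""
    if word in vocabulary:
        return word
    candidates = get_close_matches(word, vocabulary, n=1, cutoff=cutoff)
    return candidates[0] if candidates else word

print(autocorrect("langauge", vocabulary))  # language
```

Raising `cutoff` makes the correction more conservative; lowering it corrects more aggressively but risks changing words that were typed correctly.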

###### 3. Machine Translation

Suppose you're in China, you don't know Chinese, and you need to ask local people for directions, but they only speak Chinese. In that situation, machine translation is a lifesaver. One of the best-known machine translation tools is Google Translate; its results are usually good enough to get the meaning across, though not always correct.


<img src="machine_translation.PNG">

###### 4. Natural Language Processing Chatbot

An NLP-based chatbot is a computer program that communicates with a customer through text or voice. Such programs are often designed to support clients on websites or over the phone.


<img src="chat_bot.png">

### How to Approach an NLP Problem?

##### Text preprocessing

###### Text preprocessing is a method to clean text data and make it ready to feed to a model.

Text data contains noise in various forms, such as emoticons, punctuation, numbers, abbreviations, and inconsistent letter case. In human language there are also many different ways to say the same thing. On top of that, machines do not understand words; they need numbers, so we have to convert text to numbers efficiently.
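To make "converting text to numbers" concrete, here is a minimal sketch of one simple scheme, a word-count vector over a vocabulary built from the documents. The helper names `build_vocabulary` and `to_count_vector` are hypothetical, and this is only one of many possible encodings.

```python
from collections import Counter

def build_vocabulary(documents):
    """Assign each distinct lowercase word an index, in order of first appearance."""
    vocab = {}
    for doc in documents:
        for word in doc.lower().split():
            vocab.setdefault(word, len(vocab))
    return vocab

def to_count_vector(document, vocab):
    """Count how often each vocabulary word occurs in the document."""
    counts = Counter(document.lower().split())
    return [counts.get(word, 0) for word in vocab]

docs = ["the weather is cloudy", "the rain is heavy"]
vocab = build_vocabulary(docs)
vectors = [to_count_vector(doc, vocab) for doc in docs]

print(vocab)    # {'the': 0, 'weather': 1, 'is': 2, 'cloudy': 3, 'rain': 4, 'heavy': 5}
print(vectors)  # [[1, 1, 1, 1, 0, 0], [1, 0, 1, 0, 1, 1]]
```

Every noisy variant of a word ("Cloudy", "cloudy!", "cloudy") would get its own column here, which is why the cleaning steps that follow matter.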

###### String Operations

In [2]:
# basic string
a = "FIS Global"

# Indexing - get the character at position 4 (indexing starts from 0)
print(a[4])
print("-" * 50)
# Slicing - get the characters from position 3 up to (but not including) position 6
print(a[3:6])
print("-" * 50)

# Negative indexing - slice from position -6 up to (but not including) position -2
print(a[-6:-2])
print("-" * 50)

# Slice from position 2 up to (but not including) position 6, with a step of 2
print(a[2:6:2])

# strip() removes leading and trailing whitespace
a = " FIS Global "
print(a.strip())

# split() with no argument splits on whitespace and returns a list of words
a = "FIS Global"
print(a.split())

# split(sep) splits the string at every occurrence of the separator
a = "FIS Global"
print(a.split("G"))

# lower() converts all uppercase characters to lowercase
a = "FIS Global"
print(a.lower())

# upper() converts all lowercase characters to uppercase
a = "FIS Global"
print(a.upper())

# replace() replaces every occurrence of one substring with another
a = "FIS Global"
print(a.replace("FIS", "FNIS"))

# String concatenation
a = "FIS"
b = "Global"
print(a + " " + b)

G
--------------------------------------------------
 Gl
--------------------------------------------------
Glob
--------------------------------------------------
SG
FIS Global
['FIS', 'Global']
['FIS ', 'lobal']
fis global
FIS GLOBAL
FNIS Global
FIS Global


In [5]:
# import necessary libraries 
import nltk
import string
import re

###### Text lowercase
We lowercase the text to reduce the size of the vocabulary of our text data.

In [6]:
def lowercase_text(text): 
    return text.lower() 
  
input_str = "Weather is too Cloudy.Possiblity of Rain is High,Today!!"
lowercase_text(input_str)

'weather is too cloudy.possiblity of rain is high,today!!'
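A small sketch of why lowercasing shrinks the vocabulary: the made-up sentence below contains the same words in several casings, and the set of distinct words is counted before and after lowercasing.

```python
# Made-up sample in which the same words appear with different casing.
words = "The weather is cloudy the Weather is cold THE WEATHER IS windy".split()

raw_vocab = set(words)                           # case-sensitive vocabulary
lower_vocab = {word.lower() for word in words}   # vocabulary after lowercasing

print(len(raw_vocab))    # 11
print(len(lower_vocab))  # 6
```

The trade-off is that lowercasing also erases distinctions that can matter, for example between "US" and "us".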

###### Remove numbers
We should either remove numbers or convert them into their textual representations. Here we use regular expressions (re) to remove them.

In [7]:
def remove_num(text): 
    result = re.sub(r'\d+', '', text) 
    return result 
  
input_s = "You bought 6 candies from shop, and 4 candies are in home."
remove_num(input_s)

'You bought  candies from shop, and  candies are in home.'
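The other option mentioned above, converting numbers to text, could be sketched as follows. This is a deliberately simple version that spells out each digit separately (so "42" becomes "four two"); the `DIGIT_WORDS` table and `spell_out_digits` helper are hypothetical names for this example.

```python
import re

DIGIT_WORDS = {
    "0": "zero", "1": "one", "2": "two", "3": "three", "4": "four",
    "5": "five", "6": "six", "7": "seven", "8": "eight", "9": "nine",
}

def spell_out_digits(text):
    """Replace every run of ASCII digits with the digits spelled out one by one."""
    def replace(match):
        return " ".join(DIGIT_WORDS[digit] for digit in match.group())
    # [0-9] is used instead of \d so that only the ASCII digits in the table match.
    return re.sub(r"[0-9]+", replace, text)

input_s = "You bought 6 candies from shop, and 4 candies are in home."
print(spell_out_digits(input_s))
# You bought six candies from shop, and four candies are in home.
```

Reading multi-digit numbers as full words ("forty-two") needs more logic than this sketch contains.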

###### Remove Punctuation

We remove punctuation so that we don't end up with different forms of the same word. If we don't remove punctuation, then "been", "been," and "been!" will be treated as three different words.

In [8]:
# let's remove punctuation 
def rem_punct(text): 
    translator = str.maketrans('', '', string.punctuation) 
    return text.translate(translator) 
  
input_str = "Hey, Are you excited??, After a week, we will be in Shimla!!!"
rem_punct(input_str)

'Hey Are you excited After a week we will be in Shimla'
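The three cleaning steps so far (lowercasing, removing numbers, removing punctuation) can be chained into one function. The sketch below is one possible arrangement, with an extra whitespace-collapsing step added as an assumption so that removed numbers do not leave double spaces behind; `preprocess` is a name chosen for this example.

```python
import re
import string

def preprocess(text):
    """Lowercase, drop digits, drop punctuation, and collapse extra whitespace."""
    text = text.lower()
    text = re.sub(r"\d+", "", text)
    text = text.translate(str.maketrans("", "", string.punctuation))
    # split() + join() collapses runs of whitespace into single spaces.
    return " ".join(text.split())

print(preprocess("Hey, we bought 6 candies!!  Are you excited??"))
# hey we bought candies are you excited
```

Note that deleting punctuation outright fuses words that were separated only by punctuation: "cloudy.possiblity" in the earlier example would become a single token. Replacing punctuation with a space instead of an empty string is one way around that.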

### Tokenization

Tokenization is a common task in Natural Language Processing (NLP). It’s a fundamental step in both traditional NLP pipelines and advanced deep-learning-based architectures.

###### Tokens are the building blocks of Natural Language.

Tokenization is a way of separating a piece of text into smaller units called tokens. Here, tokens can be either words, characters, or subwords. Hence, tokenization can be broadly classified into 3 types – word, character, and subword (n-gram characters) tokenization.
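The three types can be sketched in plain Python. The subword version below uses overlapping character trigrams as a simple stand-in; the `char_ngrams` helper and the choice of n = 3 are assumptions for illustration, and learned subword tokenizers work differently.

```python
text = "unhappy cats"

# Word tokenization: split on whitespace.
word_tokens = text.split()

# Character tokenization: every non-space character is a token.
char_tokens = list(text.replace(" ", ""))

def char_ngrams(word, n=3):
    """Return the overlapping character n-grams of a word."""
    if len(word) <= n:
        return [word]
    return [word[i:i + n] for i in range(len(word) - n + 1)]

# Subword tokenization, approximated here with character trigrams.
subword_tokens = [gram for word in word_tokens for gram in char_ngrams(word)]

print(word_tokens)     # ['unhappy', 'cats']
print(char_tokens)     # ['u', 'n', 'h', 'a', 'p', 'p', 'y', 'c', 'a', 't', 's']
print(subword_tokens)  # ['unh', 'nha', 'hap', 'app', 'ppy', 'cat', 'ats']
```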

In [16]:
### using simple text split
text = "Hello everyone. Welcome to FIS Global."
text = text.split()
text

['Hello', 'everyone.', 'Welcome', 'to', 'FIS', 'Global.']

In [12]:
from nltk.tokenize import word_tokenize
  
text = "Hello everyone. Welcome to FIS Global."
word_tokenize(text)

['Hello', 'everyone', '.', 'Welcome', 'to', 'FIS', 'Global', '.']

In [14]:
from nltk.tokenize import sent_tokenize
  
text = "Hello everyone. Welcome to FIS Global. You are part of AI/ML COE team"
sent_tokenize(text)

['Hello everyone.', 'Welcome to FIS Global.', 'You are part of AI/ML COE team']

#### Remove default stopwords:

Stopwords are very common words, such as "is" and "the", that usually contribute little to the meaning of a sentence, so they can often be removed without changing what the sentence is about. The NLTK (Natural Language Toolkit) library ships a set of stopwords that we can use to remove stopwords from our text and return a list of word tokens.

In [9]:
# importing nltk library
from nltk.corpus import stopwords 
from nltk.tokenize import word_tokenize 

nltk.download('stopwords')
nltk.download('punkt')
  
# remove stopwords function 
def rem_stopwords(text): 
    stop_words = set(stopwords.words("english")) 
    word_tokens = word_tokenize(text) 
    filtered_text = [word for word in word_tokens if word not in stop_words] 
    return filtered_text 
  
ex_text = "Data is the new oil. A.I is the last invention"
rem_stopwords(ex_text)

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\ADMIN\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\ADMIN\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


['Data', 'new', 'oil', '.', 'A.I', 'last', 'invention']
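Two details are worth a closer look. The NLTK list is all lowercase, so a case-sensitive comparison keeps capitalized stopwords such as a sentence-initial "The", and the list includes negations like "not", whose removal can flip the meaning of a sentence. The sketch below shows both effects using a small hand-picked stopword set (a hypothetical stand-in for the NLTK list, so the example runs without downloads).

```python
# Small hand-picked stopword set, standing in for the NLTK list.
STOP_WORDS = {"is", "the", "a", "of", "not", "this"}

def remove_stopwords(tokens, stop_words=STOP_WORDS):
    """Drop stopwords, comparing case-insensitively."""
    return [token for token in tokens if token.lower() not in stop_words]

tokens = "The movie is not good".split()

print(remove_stopwords(tokens))
# ['movie', 'good']  <- the negation is gone

# Keep negations by taking them out of the stopword set.
keep_negation = STOP_WORDS - {"not"}
print(remove_stopwords(tokens, keep_negation))
# ['movie', 'not', 'good']
```

For tasks such as sentiment analysis, customizing the stopword set this way is often safer than using the default list unchanged.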

In [23]:
### stopwords list

print(stopwords.words('english'))

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', '

###### Stemming

Stemming is the process of reducing a word to its root form. The root, or stem, is the part to which affixes (such as -ed, -ize, etc.) are added. We create the stem by removing the prefix or suffix of a word, so stemming may not produce an actual word.

For Example: studies ---> studi

             studying ---> study
             
             going ---> go
             
             
If our text is not yet tokenized, we first need to convert it into tokens. Once the text is split into word tokens, we can convert those tokens into their root form. NLTK provides several stemmers, including the Porter stemmer, the Snowball stemmer, and the Lancaster stemmer. The Porter stemmer is the one most commonly used.

In [10]:
#importing nltk's porter stemmer 
from nltk.stem.porter import PorterStemmer 
from nltk.tokenize import word_tokenize 
stem1 = PorterStemmer() 
  
# stem words in the list of tokenised words 
def s_words(text): 
    word_tokens = word_tokenize(text) 
    stems = [stem1.stem(word) for word in word_tokens] 
    return stems 
  
text = 'Data is the new revolution in the World, in a day one individual would generate terabytes of data.'
s_words(text)

['data',
 'is',
 'the',
 'new',
 'revolut',
 'in',
 'the',
 'world',
 ',',
 'in',
 'a',
 'day',
 'one',
 'individu',
 'would',
 'gener',
 'terabyt',
 'of',
 'data',
 '.']
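To see why stems are often not real words, here is a toy suffix-stripping stemmer. It is not the Porter algorithm, only a much cruder illustration of the same idea; the suffix list, the `naive_stem` name, and the minimum stem length are all assumptions made for this example.

```python
# Suffixes are checked in this order, longest first.
SUFFIXES = ("ization", "ation", "ing", "ies", "ed", "es", "s")

def naive_stem(word, min_stem=2):
    """Strip the first matching suffix, provided at least `min_stem` letters remain."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[:-len(suffix)]
    return word

for word in ["studies", "studying", "going", "terabytes", "is"]:
    print(word, "--->", naive_stem(word))
# studies ---> stud
# studying ---> study
# going ---> go
# terabytes ---> terabyt
# is ---> is
```

Because the rules only look at spelling, related forms can end up with different stems ("stud" and "study" here). Real stemmers add many more rules to reduce such mismatches, but they still do not guarantee dictionary words.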

###### Lemmatization

Lemmatization does the same job as stemming, with one difference: lemmatization ensures that the root word belongs to the language, so we always get valid words. In NLTK (Natural Language Toolkit), we use WordNetLemmatizer to get the lemmas of words. We also need to provide context for the lemmatization, so we pass the part of speech (pos) as a parameter.

In [15]:
from nltk.stem import wordnet 
from nltk.tokenize import word_tokenize 
lemma = wordnet.WordNetLemmatizer()
nltk.download('wordnet')
# lemmatize string 
def lemmatize_word(text): 
    word_tokens = word_tokenize(text) 
    # provide context i.e. part-of-speech(pos)
    lemmas = [lemma.lemmatize(word, pos ='v') for word in word_tokens] 
    return lemmas 
  
text = 'Data is the new revolution in the World, in a day one individual would generate terabytes of data.'
lemmatize_word(text)

[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\ADMIN\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


['Data',
 'be',
 'the',
 'new',
 'revolution',
 'in',
 'the',
 'World',
 ',',
 'in',
 'a',
 'day',
 'one',
 'individual',
 'would',
 'generate',
 'terabytes',
 'of',
 'data',
 '.']
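The role of the part-of-speech argument can be illustrated with a toy dictionary-based lemmatizer. This is not how WordNetLemmatizer works internally; the tiny `LEMMAS` table and the `toy_lemmatize` helper are hypothetical and exist only to show that the same word can have different lemmas depending on its part of speech.

```python
# Toy lookup table keyed by (word, part of speech): 'v' = verb, 'n' = noun, 'a' = adjective.
LEMMAS = {
    ("is", "v"): "be",
    ("was", "v"): "be",
    ("saw", "v"): "see",
    ("saw", "n"): "saw",
    ("terabytes", "n"): "terabyte",
    ("better", "a"): "good",
}

def toy_lemmatize(word, pos="n"):
    """Look up the lemma for (word, pos); return the word unchanged if it is unknown."""
    return LEMMAS.get((word.lower(), pos), word)

print(toy_lemmatize("saw", pos="v"))        # see
print(toy_lemmatize("saw", pos="n"))        # saw
print(toy_lemmatize("terabytes", pos="v"))  # terabytes (no verb entry, so unchanged)
print(toy_lemmatize("terabytes", pos="n"))  # terabyte
```

The same effect is visible in the NLTK output above: with pos='v' every token is treated as a verb, so 'is' becomes 'be' while the plural noun 'terabytes' comes back unchanged.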

###### Additional Packages to perform text preprocessing

##### Beautiful Soup
Beautiful Soup is a Python library used for web scraping, that is, for pulling data out of HTML and XML files. It creates a parse tree from the page source that can be used to extract data in a hierarchical and more readable manner.
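In a preprocessing pipeline, the typical use is stripping markup so that only the visible text remains. The sketch below shows that idea with Python's built-in `html.parser` as a dependency-free stand-in, not with Beautiful Soup itself; the `TextExtractor` class and `html_to_text` helper are names made up for this example.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect the text found between tags, ignoring the tags themselves."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        # Called for each run of text between tags; skip whitespace-only runs.
        if data.strip():
            self.chunks.append(data.strip())

def html_to_text(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)

html = "<html><body><h1>NLP</h1><p>Text <b>preprocessing</b> matters.</p></body></html>"
print(html_to_text(html))  # NLP Text preprocessing matters.
```

With Beautiful Soup installed, the comparable call would be `BeautifulSoup(html, "html.parser").get_text()`, which additionally copes well with malformed markup.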
