# Data preprocessing in NLP

If we want to make inferences from data, we first need to preprocess it. The preprocessing steps differ depending on the type of problem we are trying to solve. For example, preprocessing image data is very different from preprocessing natural-language text.

Generally speaking, most NLP datasets can be preprocessed in six steps:
    1. Loading the dataset
    2. (Optional) Clean HTML tags from the dataset (only if it was fetched from a website)
    3. Normalization
    4. Tokenization
    5. Remove stop words
    6. Stemming and Lemmatization
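
As a rough preview, steps 3 to 5 can be sketched in plain Python. This is only a minimal sketch under stated assumptions: `preprocess` and `TOY_STOP_WORDS` are hypothetical names, the stop-word list is deliberately tiny, and `str.split` stands in for a real tokenizer.

```python
import re

# Hypothetical, deliberately tiny stop-word list (a real list is much longer)
TOY_STOP_WORDS = {"a", "an", "and", "the", "is", "of", "in", "on", "or", "from"}

def preprocess(text):
    """Normalize, tokenize, and filter a raw string (steps 3 to 5)."""
    text = text.lower()                       # 3a. lowercase
    text = re.sub(r"[^a-z0-9]", " ", text)    # 3b. replace punctuation with spaces
    tokens = text.split()                     # 4. naive whitespace tokenization
    return [t for t in tokens if t not in TOY_STOP_WORDS]  # 5. drop stop words

print(preprocess("A hieroglyph can represent a word, a sound, or a silent determinative."))
# ['hieroglyph', 'can', 'represent', 'word', 'sound', 'silent', 'determinative']
```

Note that "can" survives here only because it is missing from the toy list.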

## 1. Loading the data

In the first step, we load our data. Datasets come in many different formats, so there are many ways to load them. In this example the text is stored in a local file, so we simply read it from disk.

In [17]:
# Load a dataset
with open("data/hieroglyph.txt", "r") as f:
    text = f.read()
    
print(text)

Hieroglyphic writing dates from c. 3000 BC, and is composed of hundreds of symbols. A hieroglyph can represent a word, a sound, or a silent determinative; and the same symbol can serve different purposes in different contexts. Hieroglyphs were a formal script, used on stone monuments and in tombs, that could be as detailed as individual works of art.



## 2. (Optional) Clean the dataset

This step is needed only when the text was fetched from a website and still contains HTML tags that we do not want. This particular dataset is plain text, so we can skip it.

When you do fetch a website, you can use the <strong>BeautifulSoup</strong> library to remove the HTML tags and keep the actual text.
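
To show what tag removal does, here is a dependency-free sketch that uses Python's built-in `html.parser` as a stand-in for BeautifulSoup. The names `TextExtractor`, `strip_tags`, and `sample` are hypothetical. With BeautifulSoup installed, `BeautifulSoup(html, "html.parser").get_text()` should give a comparable result.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collects the text nodes of an HTML document and ignores the tags."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        # Called for every run of text found between tags
        self.chunks.append(data)

def strip_tags(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()  # flush any text still buffered by the parser
    return "".join(parser.chunks)

sample = "<p>Hieroglyphs were a <b>formal</b> script.</p>"
print(strip_tags(sample))  # Hieroglyphs were a formal script.
```

This sketch also keeps the contents of `<script>` and `<style>` elements, so a real page would likely need extra filtering.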

## 3. Normalization

To normalize the text, we take the following steps:
    1. Lowercase the words
    2. Remove punctuation (such as ".", "?", "!")

In [18]:
# Lowercase the text
text = text.lower()

In [19]:
print(text)

hieroglyphic writing dates from c. 3000 bc, and is composed of hundreds of symbols. a hieroglyph can represent a word, a sound, or a silent determinative; and the same symbol can serve different purposes in different contexts. hieroglyphs were a formal script, used on stone monuments and in tombs, that could be as detailed as individual works of art.



In [20]:
# Replace every character that is not a letter or digit with a space
import re
text = re.sub(r"[^a-zA-Z0-9]", " ", text)

In [21]:
print(text)

hieroglyphic writing dates from c  3000 bc  and is composed of hundreds of symbols  a hieroglyph can represent a word  a sound  or a silent determinative  and the same symbol can serve different purposes in different contexts  hieroglyphs were a formal script  used on stone monuments and in tombs  that could be as detailed as individual works of art  


## 4. Tokenization

In the fourth step, we split the text into individual words (tokens).

In [31]:
# Tokenizing the text
from nltk.tokenize import word_tokenize
words = word_tokenize(text)

In [32]:
print(words)

['hieroglyphic', 'writing', 'dates', 'from', 'c', '3000', 'bc', 'and', 'is', 'composed', 'of', 'hundreds', 'of', 'symbols', 'a', 'hieroglyph', 'can', 'represent', 'a', 'word', 'a', 'sound', 'or', 'a', 'silent', 'determinative', 'and', 'the', 'same', 'symbol', 'can', 'serve', 'different', 'purposes', 'in', 'different', 'contexts', 'hieroglyphs', 'were', 'a', 'formal', 'script', 'used', 'on', 'stone', 'monuments', 'and', 'in', 'tombs', 'that', 'could', 'be', 'as', 'detailed', 'as', 'individual', 'works', 'of', 'art']


## 5. Remove stop words

In this step, we remove all stop words (such as "is", "our", "the", "in", "at", etc.) that do not add a lot of meaning to a sentence.

In [35]:
# Removing stop words
from nltk.corpus import stopwords

# Build the stop-word set once; set membership tests are fast
stop_words = set(stopwords.words("english"))
words_ = [w for w in words if w not in stop_words]

In [36]:
print(words_)

['hieroglyphic', 'writing', 'dates', 'c', '3000', 'bc', 'composed', 'hundreds', 'symbols', 'hieroglyph', 'represent', 'word', 'sound', 'silent', 'determinative', 'symbol', 'serve', 'different', 'purposes', 'different', 'contexts', 'hieroglyphs', 'formal', 'script', 'used', 'stone', 'monuments', 'tombs', 'could', 'detailed', 'individual', 'works', 'art']


## 6. Stemming and Lemmatization

<strong>Stemming</strong> is the process of reducing a word to its stem or root form. For example, the words "branching", "branches", and "branched" can all be reduced to "branch". This helps reduce complexity while retaining the essence of the meaning carried by the words.

In [37]:
# Stemming
from nltk.stem.porter import PorterStemmer
# Create the stemmer once and reuse it for every word
stemmer = PorterStemmer()
stemmed = [stemmer.stem(w) for w in words_]

In [38]:
print(stemmed)

['hieroglyph', 'write', 'date', 'c', '3000', 'bc', 'compos', 'hundr', 'symbol', 'hieroglyph', 'repres', 'word', 'sound', 'silent', 'determin', 'symbol', 'serv', 'differ', 'purpos', 'differ', 'context', 'hieroglyph', 'formal', 'script', 'use', 'stone', 'monument', 'tomb', 'could', 'detail', 'individu', 'work', 'art']
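
Stems such as 'hundr' and 'compos' are not dictionary words because a stemmer applies suffix rules rather than looking words up. The sketch below illustrates that idea with a hypothetical `toy_stem` function. It is a drastic simplification, not the Porter algorithm, which applies many more rules in several phases.

```python
def toy_stem(word, suffixes=("ing", "ed", "es", "s")):
    """Strip the first matching suffix; NOT the Porter algorithm."""
    for suffix in suffixes:
        # Keep at least three characters so short words survive intact
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

print([toy_stem(w) for w in ["branching", "branches", "branched"]])
# ['branch', 'branch', 'branch']
```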


<strong>Lemmatization</strong> is another technique for reducing words to a normalized form, but here a dictionary is used for the transformation. For example, the words "is", "was", and "were" are converted to "be" when the lemmatizer treats them as verbs.

In [39]:
# Lemmatizing
from nltk.stem.wordnet import WordNetLemmatizer
# Create the lemmatizer once and reuse it for every word
lemmatizer = WordNetLemmatizer()
lemmed = [lemmatizer.lemmatize(w) for w in words_]

In [40]:
print(lemmed)

['hieroglyphic', 'writing', 'date', 'c', '3000', 'bc', 'composed', 'hundred', 'symbol', 'hieroglyph', 'represent', 'word', 'sound', 'silent', 'determinative', 'symbol', 'serve', 'different', 'purpose', 'different', 'context', 'hieroglyph', 'formal', 'script', 'used', 'stone', 'monument', 'tomb', 'could', 'detailed', 'individual', 'work', 'art']
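
Words such as 'used' and 'composed' are unchanged in the output above because NLTK's `lemmatize` treats every word as a noun unless its `pos` argument says otherwise. The toy lookup below sketches why the part of speech matters. `TOY_LEMMAS` and `toy_lemmatize` are hypothetical, and the table is a tiny stand-in for a real dictionary such as WordNet.

```python
# Hypothetical miniature lemma table keyed by (word, part of speech)
TOY_LEMMAS = {
    ("is", "v"): "be",
    ("was", "v"): "be",
    ("were", "v"): "be",
    ("used", "v"): "use",
    ("symbols", "n"): "symbol",
}

def toy_lemmatize(word, pos="n"):
    """Return the table entry for (word, pos), or the word unchanged."""
    return TOY_LEMMAS.get((word, pos), word)

print(toy_lemmatize("symbols"))        # symbol
print(toy_lemmatize("were"))           # were (no noun entry)
print(toy_lemmatize("were", pos="v"))  # be
```

With the real lemmatizer, passing `pos="v"` for verbs should similarly reduce 'used' to 'use'.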
