
# Text Preprocessing Techniques

This notebook explores text preprocessing techniques commonly used to prepare text data for Natural Language Processing (NLP) tasks: tokenization, stemming, lemmatization, and named entity recognition (NER).

In [1]:
# Importing necessary libraries
import nltk
from nltk.tokenize import word_tokenize, sent_tokenize
from nltk.stem import PorterStemmer, WordNetLemmatizer
from nltk.corpus import stopwords
from nltk import ne_chunk, pos_tag

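# Download necessary NLTK data files
# (newer NLTK releases may instead ask for 'punkt_tab', 'averaged_perceptron_tagger_eng',
# and 'maxent_ne_chunker_tab'; download whichever resource name a LookupError reports)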
nltk.download('punkt')
nltk.download('wordnet')
nltk.download('averaged_perceptron_tagger')
nltk.download('maxent_ne_chunker')
nltk.download('words')

# Sample text
text = "Apple is looking at buying U.K. startup for $1 billion. This move aims to expand their operations in Europe."



## 1. Tokenization

Tokenization is the process of splitting text into individual tokens, which can be words, punctuation marks, sentences, or other meaningful units. It is usually the first preprocessing step, because stemming, lemmatization, and tagging all operate on tokens.

- **Word Tokenization**: Splitting a sentence into words.
- **Sentence Tokenization**: Splitting a paragraph into sentences.

In the output below, note that `word_tokenize` keeps "U.K." as a single token but splits "$1" into "$" and "1", and that `sent_tokenize` does not end the first sentence at the abbreviation "U.K.".
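To see why a dedicated tokenizer is worth using, here is a minimal sketch of two naive alternatives built only on the standard library. It is an illustration, not how NLTK tokenizes; the variable names are made up for this example.

```python
import re

sample = ("Apple is looking at buying U.K. startup for $1 billion. "
          "This move aims to expand their operations in Europe.")

# Whitespace splitting leaves punctuation attached to the neighbouring word.
whitespace_tokens = sample.split()

# A simple regex: runs of word characters, or any single non-space symbol.
# This separates punctuation, but it also breaks "U.K." into four pieces.
regex_tokens = re.findall(r"\w+|[^\w\s]", sample)

# Splitting after every period treats the abbreviation "U.K." as a sentence end.
naive_sentences = re.split(r"(?<=\.)\s+", sample)

print(whitespace_tokens[:10])
print(regex_tokens[:9])
print(len(naive_sentences), "sentences found by the naive split")
```

Both shortcuts mishandle the abbreviation, which is the kind of case the NLTK tokenizers are designed to handle.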

In [2]:
# Tokenization

# Word tokenization
word_tokens = word_tokenize(text)
print("Word Tokens:\n", word_tokens)

# Sentence tokenization
sentence_tokens = sent_tokenize(text)
print("\nSentence Tokens:\n", sentence_tokens)

Word Tokens:
 ['Apple', 'is', 'looking', 'at', 'buying', 'U.K.', 'startup', 'for', '$', '1', 'billion', '.', 'This', 'move', 'aims', 'to', 'expand', 'their', 'operations', 'in', 'Europe', '.']

Sentence Tokens:
 ['Apple is looking at buying U.K. startup for $1 billion.', 'This move aims to expand their operations in Europe.']


## 2. Stemming

Stemming is the process of reducing words to a common stem by stripping suffixes according to fixed rules. This helps normalize the text by grouping related word forms together.

- **Example**: The words "running" and "runs" are both reduced to the stem "run". Irregular forms such as "ran" are left unchanged, because no suffix rule applies to them.

Stemming is a rule-based process that often produces non-dictionary forms (for example, "appl" and "thi" in the output below), and NLTK's Porter stemmer also lowercases its input by default. It is useful for applications where speed and simplicity are more important than precision.
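A quick way to check what a stemmer does with related word forms is to run it on a small hand-picked list. The sketch below uses the same `PorterStemmer` class as this notebook; the word list is chosen only for illustration.

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

# Related forms of "run"; the last two show the limits of suffix stripping.
words = ["running", "runs", "runner", "ran"]

stems = {word: stemmer.stem(word) for word in words}
for word, stem in stems.items():
    print(f"{word:>8} -> {stem}")
```

Regular inflections collapse to "run", while "runner" and the irregular past tense "ran" are not mapped to it. Handling such cases is one motivation for lemmatization.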

In [3]:
# Stemming

# Initializing the PorterStemmer
stemmer = PorterStemmer()

# Applying stemming to word tokens
stemmed_tokens = [stemmer.stem(word) for word in word_tokens]
print("\nStemmed Tokens:\n", stemmed_tokens)


Stemmed Tokens:
 ['appl', 'is', 'look', 'at', 'buy', 'u.k.', 'startup', 'for', '$', '1', 'billion', '.', 'thi', 'move', 'aim', 'to', 'expand', 'their', 'oper', 'in', 'europ', '.']


## 3. Lemmatization

Lemmatization is the process of reducing words to their base or dictionary form (the lemma). Unlike stemming, it takes the word's part of speech into account and aims to return a valid dictionary word.

- **Example**: When treated as verbs, "running" and "ran" are both converted to "run".

Lemmatization uses a vocabulary and morphological analysis of words, which makes it more accurate than stemming but typically slower. It is particularly useful when the meaning of the words matters.

In [4]:
# Lemmatization

# Initializing the WordNetLemmatizer
lemmatizer = WordNetLemmatizer()

# Applying lemmatization to word tokens
# (lemmatize() defaults to pos='n', so verb forms such as 'looking' are left unchanged)
lemmatized_tokens = [lemmatizer.lemmatize(word) for word in word_tokens]
print("\nLemmatized Tokens:\n", lemmatized_tokens)


Lemmatized Tokens:
 ['Apple', 'is', 'looking', 'at', 'buying', 'U.K.', 'startup', 'for', '$', '1', 'billion', '.', 'This', 'move', 'aim', 'to', 'expand', 'their', 'operation', 'in', 'Europe', '.']
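Because `lemmatize` treats every word as a noun unless told otherwise, a common pattern is to convert each token's Penn Treebank tag into the one-letter part-of-speech code that WordNet expects. The helper below is a minimal sketch of that mapping; `penn_to_wordnet` is a hypothetical name introduced here, and the tagged pairs are copied from the POS-tagging output in section 4.

```python
def penn_to_wordnet(tag):
    """Map a Penn Treebank tag to the one-letter POS code WordNet uses."""
    if tag.startswith("JJ"):
        return "a"  # adjective
    if tag.startswith("VB"):
        return "v"  # verb
    if tag.startswith("RB"):
        return "r"  # adverb
    return "n"      # noun, which is also the lemmatizer's default


# A few (token, tag) pairs taken from this notebook's POS-tagging output.
tagged = [("is", "VBZ"), ("looking", "VBG"), ("buying", "VBG"),
          ("aims", "VBZ"), ("operations", "NNS")]

wordnet_pos = [(word, penn_to_wordnet(tag)) for word, tag in tagged]
print(wordnet_pos)
```

Each pair can then be passed on as `lemmatizer.lemmatize(word, pos=code)`. With the verb code, the lemmatizer should return "look" for "looking" and "buy" for "buying" instead of leaving them unchanged.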


## 4. Named Entity Recognition (NER)

Named Entity Recognition (NER) is a technique used to identify and classify named entities (such as names of people, organizations, locations, dates, etc.) in the text. This helps in extracting important information from unstructured text.

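- **Example**: In the sentence "Apple is looking at buying U.K. startup for $1 billion", an ideal NER system would identify "Apple" as an organization, "U.K." as a location, and "$1 billion" as a monetary value. NLTK's default chunker is less precise: in the output below it labels "Apple" as a geopolitical entity (GPE) and leaves "U.K." and "$1 billion" untagged.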

NER is useful for information retrieval, question answering, and summarization tasks, as it helps in identifying and categorizing key information from the text.

In [5]:
# Named Entity Recognition (NER)

# ne_chunk expects POS-tagged tokens, so tag the words first
pos_tags = pos_tag(word_tokens)
print("\nPart-of-Speech Tags:\n", pos_tags)

# Applying NER
named_entities = ne_chunk(pos_tags)
print("\nNamed Entities:\n", named_entities)


Part-of-Speech Tags:
 [('Apple', 'NNP'), ('is', 'VBZ'), ('looking', 'VBG'), ('at', 'IN'), ('buying', 'VBG'), ('U.K.', 'NNP'), ('startup', 'NN'), ('for', 'IN'), ('$', '$'), ('1', 'CD'), ('billion', 'CD'), ('.', '.'), ('This', 'DT'), ('move', 'NN'), ('aims', 'VBZ'), ('to', 'TO'), ('expand', 'VB'), ('their', 'PRP$'), ('operations', 'NNS'), ('in', 'IN'), ('Europe', 'NNP'), ('.', '.')]

Named Entities:
 (S
  (GPE Apple/NNP)
  is/VBZ
  looking/VBG
  at/IN
  buying/VBG
  U.K./NNP
  startup/NN
  for/IN
  $/$
  1/CD
  billion/CD
  ./.
  This/DT
  move/NN
  aims/VBZ
  to/TO
  expand/VB
  their/PRP$
  operations/NNS
  in/IN
  (GPE Europe/NNP)
  ./.)
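The result of `ne_chunk` is a tree, so the entities still have to be pulled out of it before they can be used. The sketch below shows one way to do that. To stay runnable without the chunker models, it rebuilds a shortened copy of the tree by hand, which is an assumption of this example; in the notebook, `named_entities` would take the place of `tree`.

```python
from nltk.tree import Tree

# A shortened, hand-built copy of the chunker output shown in this notebook.
tree = Tree("S", [
    Tree("GPE", [("Apple", "NNP")]),
    ("is", "VBZ"),
    ("looking", "VBG"),
    ("in", "IN"),
    Tree("GPE", [("Europe", "NNP")]),
    (".", "."),
])

entities = []
for node in tree:
    # Entity chunks are nested Tree objects; ordinary tokens are (word, tag) tuples.
    if isinstance(node, Tree):
        phrase = " ".join(word for word, tag in node.leaves())
        entities.append((phrase, node.label()))

print(entities)
```

Collecting `(phrase, label)` pairs this way gives a flat list that is easier to filter or count than the nested tree.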
