# <b>Overview</b>

This example demonstrates the basic steps of raw-text preprocessing in natural language processing (NLP). The code imports the required Python libraries, loads a text file, and tokenizes it into words. It then corrects spelling, tags parts of speech, removes stop words, and applies stemming and lemmatization. Finally, it attempts sentence tokenization on the corrected text.



In [None]:
!pip install autocorrect # Install the autocorrect package

import nltk # Import the nltk package
nltk.download('punkt') # Download the Punkt tokenizer
nltk.download('punkt_tab') # Download the Punkt data in the format used by newer NLTK releases
from nltk.tokenize import word_tokenize # Import the word_tokenize function
nltk.download('averaged_perceptron_tagger') # Download the POS tagger
nltk.download('stopwords') # Download the stopwords
nltk.download('wordnet') # Download the WordNet lemmatizer
from nltk import word_tokenize   # Same function via the top-level package (redundant with the import above)
from nltk.stem.wordnet import WordNetLemmatizer   # Import the WordNet lemmatizer
from nltk.corpus import stopwords   # Import the stopwords
from autocorrect import Speller   # Import the speller checker
from nltk.wsd import lesk   # Import the Lesk algorithm
from nltk.tokenize import sent_tokenize   # Import the sentence tokenizer
import string   # Import the string module



[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


## <b>Libraries Info</b>

  <ol type="1">
    <li>autocorrect: This package provides automatic spelling correction for words. It is useful in text preprocessing to correct common spelling mistakes.</li>
    <li>nltk (Natural Language Toolkit): A popular Python library for NLP tasks such as tokenization, lemmatization, part-of-speech tagging, and more.</li>
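    <li>nltk.download('punkt'): Downloads the Punkt models, a pre-trained sentence tokenizer. word_tokenize relies on it to split text into sentences before splitting each sentence into words.</li>
    <li>nltk.download('punkt_tab'): Downloads the same Punkt models in the tabular data format that newer NLTK releases load.</li>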
    <li>word_tokenize (from nltk.tokenize): A function that tokenizes (splits) a given text into words.</li>
    <li>nltk.download('averaged_perceptron_tagger'): Downloads the averaged perceptron tagger, which is used for part-of-speech (POS) tagging.</li>
    <li>nltk.download('stopwords'): Downloads a predefined list of stopwords (common words like "the," "is," etc.) that are often removed in text processing.</li>
    <li>nltk.download('wordnet'): Downloads the WordNet lexical database, which is used for word sense disambiguation and lemmatization.</li>
    <li>word_tokenize (from nltk): The same word_tokenize function, exposed through the top-level nltk package. This second import is redundant.</li>
    <li>WordNetLemmatizer (from nltk.stem.wordnet): A lemmatizer that reduces words to their base or root form using the WordNet database.</li>
    <li>stopwords (from nltk.corpus): Provides a predefined list of common stopwords that can be filtered out from text during preprocessing.</li>
    <li>Speller (from autocorrect): A spell-checker class. Calling Speller(lang='en') returns a callable object that takes a word and returns its corrected spelling.</li>
    <li>lesk (from nltk.wsd): Implements the Lesk algorithm, a word sense disambiguation technique that determines the correct meaning of a word based on its surrounding context.</li>
    <li>sent_tokenize (from nltk.tokenize): A function that splits text into individual sentences.</li>
    <li>string: A built-in Python module that provides tools for working with textual data, including string manipulation and punctuation removal.</li>
  </ol>
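
The `string` module's `punctuation` constant is a single string of ASCII punctuation characters, so an expression such as `token in string.punctuation` is a substring test rather than a per-token check. The following is a minimal sketch of a stricter check; the helper name `is_punctuation` is hypothetical and is not part of the notebook or of NLTK.

```python
import string

def is_punctuation(token):
    """Return True when every character of a non-empty token is ASCII punctuation."""
    return bool(token) and all(ch in string.punctuation for ch in token)

# Multi-character tokens such as "..." or "``" are not substrings of
# string.punctuation, so a plain `in` test would let them through.
tokens = ["Hello", ",", "world", "...", "``", "NLP", "!"]
kept = [t for t in tokens if not is_punctuation(t)]
print(kept)  # ['Hello', 'world', 'NLP']
```
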

In [None]:
with open("/content/file.txt", "r") as f:   # Open the text file; it is closed automatically
  sentence = f.read()   # Read the whole file into one string

In [None]:
print(sentence)

In this book authored by Sohom Ghosh and Dwight Gunning, we shall learnning how to pracess Natueral Language and extract insights from it. The first four chapter will introduce you to the basics of NLP. Later chapters will describe how to deal with complex NLP prajects. If you want to get early access of it, you should book your order now.



In [None]:
words = word_tokenize(sentence)   # Tokenize the words

In [None]:
print(words[0:10])   # Print the first 10 words

['In', 'this', 'book', 'authored', 'by', 'Sohom', 'Ghosh', 'and', 'Dwight', 'Gunning']


In [None]:
spell = Speller(lang='en')   # Create the spell checker once, outside the loop
corrected_sentences = ""
corrected_word_list = []
for word in words:
  if word not in string.punctuation:   # Skip punctuation tokens
    word_crctd = spell(word)
    if word_crctd != word:
      print(word+" has been corrected to: "+word_crctd)
    corrected_sentences = corrected_sentences+" "+word_crctd   # Equals the original word when nothing changed
    corrected_word_list.append(word_crctd)

Sohom has been corrected to: Show
Ghosh has been corrected to: Ghost
Dwight has been corrected to: Right
Gunning has been corrected to: Running
learnning has been corrected to: learning
pracess has been corrected to: process
Natueral has been corrected to: Natural
NLP has been corrected to: LP
NLP has been corrected to: LP
prajects has been corrected to: projects
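
The log above shows that the spell checker also rewrites proper nouns (Sohom, Ghosh, Dwight, Gunning) and the acronym NLP. One possible guard, sketched here under assumptions, is to pass a set of protected tokens that are never sent to the corrector. The function `correct_tokens`, the `protected` parameter, and the stand-in corrector are all hypothetical; in the notebook the corrector would be the object returned by `Speller(lang='en')`.

```python
def correct_tokens(tokens, correct, protected=frozenset()):
    """Apply `correct` to each token unless the token is in `protected`."""
    return [tok if tok in protected else correct(tok) for tok in tokens]

# Stand-in corrector for illustration only; it mimics a few corrections
# from the log above instead of calling a real spell checker.
fake_fixes = {"Ghosh": "Ghost", "NLP": "LP", "pracess": "process"}

def fake_correct(word):
    return fake_fixes.get(word, word)

tokens = ["Ghosh", "will", "pracess", "NLP"]
result = correct_tokens(tokens, fake_correct, protected={"Ghosh", "NLP"})
print(result)  # ['Ghosh', 'will', 'process', 'NLP']
```

An explicit set is used rather than a capitalization heuristic because a misspelled capitalized word such as "Natueral" should still be corrected.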


In [None]:
corrected_sentences

' In this book authored by Show Ghost and Right Running we shall learning how to process Natural Language and extract insights from it The first four chapter will introduce you to the basics of LP Later chapters will describe how to deal with complex LP projects If you want to get early access of it you should book your order now'

In [None]:
print(corrected_word_list[0:20])

['In', 'this', 'book', 'authored', 'by', 'Show', 'Ghost', 'and', 'Right', 'Running', 'we', 'shall', 'learning', 'how', 'to', 'process', 'Natural', 'Language', 'and', 'extract']


In [None]:
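import nltk   # Already imported above; repeated so this cell can run on its own
nltk.download('averaged_perceptron_tagger')   # POS tagger model
nltk.download('averaged_perceptron_tagger_eng')   # English POS tagger model, under the name newer NLTK releases look for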

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package averaged_perceptron_tagger_eng to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger_eng.zip.


True

In [None]:
print(nltk.pos_tag(corrected_word_list))

[('In', 'IN'), ('this', 'DT'), ('book', 'NN'), ('authored', 'VBN'), ('by', 'IN'), ('Show', 'NNP'), ('Ghost', 'NNP'), ('and', 'CC'), ('Right', 'NNP'), ('Running', 'NNP'), ('we', 'PRP'), ('shall', 'MD'), ('learning', 'VB'), ('how', 'WRB'), ('to', 'TO'), ('process', 'VB'), ('Natural', 'NNP'), ('Language', 'NNP'), ('and', 'CC'), ('extract', 'JJ'), ('insights', 'NNS'), ('from', 'IN'), ('it', 'PRP'), ('The', 'DT'), ('first', 'JJ'), ('four', 'CD'), ('chapter', 'NN'), ('will', 'MD'), ('introduce', 'VB'), ('you', 'PRP'), ('to', 'TO'), ('the', 'DT'), ('basics', 'NNS'), ('of', 'IN'), ('LP', 'NNP'), ('Later', 'NNP'), ('chapters', 'NNS'), ('will', 'MD'), ('describe', 'VB'), ('how', 'WRB'), ('to', 'TO'), ('deal', 'VB'), ('with', 'IN'), ('complex', 'JJ'), ('LP', 'NNP'), ('projects', 'NNS'), ('If', 'IN'), ('you', 'PRP'), ('want', 'VBP'), ('to', 'TO'), ('get', 'VB'), ('early', 'JJ'), ('access', 'NN'), ('of', 'IN'), ('it', 'PRP'), ('you', 'PRP'), ('should', 'MD'), ('book', 'NN'), ('your', 'PRP$'), ('ord

In [None]:
stop_words = stopwords.words('english')
corrected_word_list_without_stopwords = []
for word in corrected_word_list:
  if word not in stop_words:
    corrected_word_list_without_stopwords.append(word)
print(corrected_word_list_without_stopwords[:14])   # Print the first 14 words once, after the loop

['In', 'book', 'authored', 'Show', 'Ghost', 'Right', 'Running', 'shall', 'learning', 'process', 'Natural', 'Language', 'extract', 'insights']
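
The filtered output above still contains 'In'. NLTK's English stop-word list is lowercase and the membership test is case-sensitive, so capitalized stop words are kept. A minimal sketch of a case-insensitive filter follows. The small `stop_words` set is only a stand-in for `stopwords.words('english')`, and the helper name `remove_stopwords` is hypothetical.

```python
# Small stand-in for stopwords.words('english'); the real list is much longer.
stop_words = {"in", "this", "by", "and", "we", "how", "to"}

def remove_stopwords(tokens, stop_words):
    """Drop tokens whose lowercase form is a stop word; keep the casing of the rest."""
    return [tok for tok in tokens if tok.lower() not in stop_words]

tokens = ["In", "this", "book", "authored", "by", "Show", "Ghost"]
filtered = remove_stopwords(tokens, stop_words)
print(filtered)  # ['book', 'authored', 'Show', 'Ghost']
```

Storing the stop words in a set also makes each membership test constant-time rather than a scan of a list.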

Stemmer

In [None]:
stemmer = nltk.stem.PorterStemmer()
corrected_word_list_without_stopwords_stemmed = []
for word in corrected_word_list_without_stopwords:
  corrected_word_list_without_stopwords_stemmed.append(stemmer.stem(word))
print(corrected_word_list_without_stopwords_stemmed[:14])   # Print the first 14 stems once, after the loop

['in', 'book', 'author', 'show', 'ghost', 'right', 'run', 'shall', 'learn', 'process', 'natur', 'languag', 'extract', 'insight']

Lemmatization

In [None]:
lemmatizer = WordNetLemmatizer()
corrected_word_list_without_stopwords_lemmatized = []
for word in corrected_word_list_without_stopwords:
  corrected_word_list_without_stopwords_lemmatized.append(lemmatizer.lemmatize(word))   # No pos argument, so each word is treated as a noun
print(corrected_word_list_without_stopwords_lemmatized[:14])   # Print the first 14 lemmas once, after the loop

['In', 'book', 'authored', 'Show', 'Ghost', 'Right', 'Running', 'shall', 'learning', 'process', 'Natural', 'Language', 'extract', 'insight']
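
In the lemmatized output, 'authored' and 'learning' are unchanged because `lemmatize` treats every word as a noun unless a `pos` argument is given. A common workaround, sketched here as an assumption rather than as the notebook's method, is to map the Penn Treebank tags returned by `nltk.pos_tag` to the single-letter WordNet categories. The helper name `penn_to_wordnet` is hypothetical.

```python
def penn_to_wordnet(tag):
    """Map a Penn Treebank tag to a WordNet POS letter ('a', 'v', 'r' or 'n')."""
    if tag.startswith("J"):
        return "a"  # adjective
    if tag.startswith("V"):
        return "v"  # verb
    if tag.startswith("R"):
        return "r"  # adverb (rough: the particle tag RP also lands here)
    return "n"      # everything else falls back to noun, lemmatize()'s default

# Tags taken from the pos_tag output earlier in the notebook.
tagged = [("authored", "VBN"), ("insights", "NNS"), ("early", "JJ")]
mapped = [(word, penn_to_wordnet(tag)) for word, tag in tagged]
print(mapped)  # [('authored', 'v'), ('insights', 'n'), ('early', 'a')]
```

With such a mapping, each `(word, tag)` pair could be lemmatized as `lemmatizer.lemmatize(word, pos=penn_to_wordnet(tag))`.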

In [None]:
print(sent_tokenize(corrected_sentences))

[' In this book authored by Show Ghost and Right Running we shall learning how to process Natural Language and extract insights from it The first four chapter will introduce you to the basics of LP Later chapters will describe how to deal with complex LP projects If you want to get early access of it you should book your order now']
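
`sent_tokenize` returns a single sentence here because the correction loop dropped every punctuation token, so no sentence boundaries remain in `corrected_sentences`. A minimal sketch of one way to keep them follows: correct only the word tokens and pass punctuation through unchanged. The helper names and the stand-in corrector are hypothetical, and the final regex split is only a naive substitute for `sent_tokenize`, used to keep the example self-contained.

```python
import re
import string

def correct_keep_punctuation(tokens, correct):
    """Correct word tokens but pass punctuation tokens through unchanged."""
    out = []
    for tok in tokens:
        is_punct = all(ch in string.punctuation for ch in tok)
        out.append(tok if is_punct else correct(tok))
    return out

def join_tokens(tokens):
    """Join tokens with spaces, then remove the space before closing punctuation."""
    return re.sub(r"\s+([.,!?;:])", r"\1", " ".join(tokens))

tokens = ["We", "pracess", "text", ".", "Later", "chapters", "follow", "."]
# Stand-in corrector; in the notebook this would be Speller(lang='en').
fixed = correct_keep_punctuation(tokens, lambda w: {"pracess": "process"}.get(w, w))
text = join_tokens(fixed)
print(text)  # We process text. Later chapters follow.

# Naive stand-in for sent_tokenize: split after ., ! or ? followed by whitespace.
sentences = re.split(r"(?<=[.!?])\s+", text)
print(sentences)  # ['We process text.', 'Later chapters follow.']
```
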
