Tokenization: Tokenization is the process of splitting a text document into individual tokens, such as words and punctuation marks.

POS Tagging: POS (Part-of-Speech) tagging is the process of assigning a grammatical category (noun, verb, adjective, etc.) to each word in a sentence.

Stop words: Stop words are commonly used words (like "the", "is", "and", etc.) that are often removed from text data because they carry little meaning on their own.

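Stemming and Lemmatization: Both techniques reduce words to a base form. Stemming is the simpler process and chops off suffixes by rule, so the result may not be a real word. Lemmatization uses a vocabulary and the word's part of speech to reduce each word to its dictionary form (lemma).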

Term Frequency (TF): TF measures how often a term occurs in a document. It is calculated as the number of times the term occurs divided by the total number of terms in the document.
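
As a quick illustration of that ratio, the sketch below computes TF for a tiny hand-made token list. The function name `term_frequency` and the sample tokens are made up for this example and are not part of the notebook.

```python
from collections import Counter

def term_frequency(tokens):
    """Return {term: count / total number of tokens} for one document."""
    counts = Counter(tokens)
    total = len(tokens)
    return {term: count / total for term, count in counts.items()}

# Hypothetical toy document, already tokenized and lowercased
doc_tokens = ["stop", "words", "are", "common", "words"]
tf = term_frequency(doc_tokens)
print(tf["words"])  # 2 occurrences out of 5 tokens
print(tf["stop"])   # 1 occurrence out of 5 tokens
```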

Inverse Document Frequency (IDF): IDF measures how rare, and therefore how informative, a term is across a collection of documents. It is calculated as the logarithm of the ratio of the total number of documents to the number of documents containing the term, so a term that appears in every document gets an IDF of zero.
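
The definition can be sketched in a few lines of plain Python. The function name `inverse_document_frequency` and the three-document corpus are assumptions made for illustration; the natural logarithm is used here, and a different base would only rescale the values.

```python
import math

def inverse_document_frequency(term, documents):
    """log(N / df), where df is the number of documents containing the term."""
    n_docs = len(documents)
    doc_freq = sum(1 for doc in documents if term in doc)
    if doc_freq == 0:
        raise ValueError(f"{term!r} does not occur in any document")
    return math.log(n_docs / doc_freq)

# Hypothetical corpus: each document is represented by the set of its tokens
corpus = [
    {"tokenization", "splits", "text"},
    {"stemming", "reduces", "text"},
    {"tagging", "labels", "text"},
]
print(inverse_document_frequency("text", corpus))      # occurs in all 3 documents
print(inverse_document_frequency("stemming", corpus))  # occurs in 1 of 3 documents
```

A term found in every document ("text") scores zero, while a term found in only one document scores log(3).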

In [1]:
!pip install PyPDF2






In [2]:
!pip install python-docx






## Extracting a sample document

In [3]:
with open('sample.txt', 'r', encoding='utf-8') as file:
    sample_document = file.read()

In [4]:
sample_document

"Tokenization is the process of splitting a text document into individual words or tokens. \nIt is an important step in natural language processing. \nPOS tagging assigns grammatical information to each word in a sentence. \nStop words are commonly used words that are often removed from text data because they don't carry significant meaning.\nStemming and lemmatization are techniques used to reduce words to their base or root form."

## Tokenization

In [5]:
import nltk
from nltk.tokenize import word_tokenize  

In [6]:
# Tokenizer models; newer NLTK releases (3.9+) look for 'punkt_tab' instead
nltk.download('punkt')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\Samruddhi\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [7]:
tokens = word_tokenize(sample_document)

In [8]:
tokens

['Tokenization',
 'is',
 'the',
 'process',
 'of',
 'splitting',
 'a',
 'text',
 'document',
 'into',
 'individual',
 'words',
 'or',
 'tokens',
 '.',
 'It',
 'is',
 'an',
 'important',
 'step',
 'in',
 'natural',
 'language',
 'processing',
 '.',
 'POS',
 'tagging',
 'assigns',
 'grammatical',
 'information',
 'to',
 'each',
 'word',
 'in',
 'a',
 'sentence',
 '.',
 'Stop',
 'words',
 'are',
 'commonly',
 'used',
 'words',
 'that',
 'are',
 'often',
 'removed',
 'from',
 'text',
 'data',
 'because',
 'they',
 'do',
 "n't",
 'carry',
 'significant',
 'meaning',
 '.',
 'Stemming',
 'and',
 'lemmatization',
 'are',
 'techniques',
 'used',
 'to',
 'reduce',
 'words',
 'to',
 'their',
 'base',
 'or',
 'root',
 'form',
 '.']
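
Two things stand out in this output: punctuation becomes its own token, and the contraction "don't" is split into "do" and "n't". The sketch below imitates both behaviors with a regular expression. It is only a rough approximation written for this example, not the algorithm `word_tokenize` uses, and the name `simple_tokenize` is hypothetical.

```python
import re

# Order matters: try the "n't" fragment first, then a word that is
# immediately followed by "n't", then any other word, then punctuation.
TOKEN_PATTERN = re.compile(r"n't|\w+(?=n't)|\w+|[^\w\s]")

def simple_tokenize(text):
    """Split text into word, contraction, and punctuation tokens."""
    return TOKEN_PATTERN.findall(text)

print(simple_tokenize("they don't carry significant meaning."))
# ['they', 'do', "n't", 'carry', 'significant', 'meaning', '.']
```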

## POS Tagging

In [9]:
# Tagger model; newer NLTK releases (3.9+) look for 'averaged_perceptron_tagger_eng' instead
nltk.download('averaged_perceptron_tagger')

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\Samruddhi\AppData\Roaming\nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


True

In [10]:
pos = nltk.pos_tag(tokens)

In [11]:
pos

[('Tokenization', 'NN'),
 ('is', 'VBZ'),
 ('the', 'DT'),
 ('process', 'NN'),
 ('of', 'IN'),
 ('splitting', 'VBG'),
 ('a', 'DT'),
 ('text', 'NN'),
 ('document', 'NN'),
 ('into', 'IN'),
 ('individual', 'JJ'),
 ('words', 'NNS'),
 ('or', 'CC'),
 ('tokens', 'NNS'),
 ('.', '.'),
 ('It', 'PRP'),
 ('is', 'VBZ'),
 ('an', 'DT'),
 ('important', 'JJ'),
 ('step', 'NN'),
 ('in', 'IN'),
 ('natural', 'JJ'),
 ('language', 'NN'),
 ('processing', 'NN'),
 ('.', '.'),
 ('POS', 'NNP'),
 ('tagging', 'VBG'),
 ('assigns', 'RB'),
 ('grammatical', 'JJ'),
 ('information', 'NN'),
 ('to', 'TO'),
 ('each', 'DT'),
 ('word', 'NN'),
 ('in', 'IN'),
 ('a', 'DT'),
 ('sentence', 'NN'),
 ('.', '.'),
 ('Stop', 'VB'),
 ('words', 'NNS'),
 ('are', 'VBP'),
 ('commonly', 'RB'),
 ('used', 'VBN'),
 ('words', 'NNS'),
 ('that', 'WDT'),
 ('are', 'VBP'),
 ('often', 'RB'),
 ('removed', 'VBN'),
 ('from', 'IN'),
 ('text', 'NN'),
 ('data', 'NNS'),
 ('because', 'IN'),
 ('they', 'PRP'),
 ('do', 'VBP'),
 ("n't", 'RB'),
 ('carry', 'VB'),
 ('si
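
The tags come from the Penn Treebank tag set (for example `NN` for a singular noun and `VBZ` for a third-person singular present verb). As a small sketch of how tagged output can be used, the code below counts the tags in the first sentence and pulls out its nouns. The pairs are copied by hand from the output above so the example runs without NLTK, and `TAG_NAMES` is a partial glossary written for this example.

```python
from collections import Counter

# First sentence of the tagger output above, copied by hand
tagged = [
    ("Tokenization", "NN"), ("is", "VBZ"), ("the", "DT"), ("process", "NN"),
    ("of", "IN"), ("splitting", "VBG"), ("a", "DT"), ("text", "NN"),
    ("document", "NN"), ("into", "IN"), ("individual", "JJ"),
    ("words", "NNS"), ("or", "CC"), ("tokens", "NNS"), (".", "."),
]

# Partial glossary of Penn Treebank tags (only the ones used here)
TAG_NAMES = {
    "NN": "noun, singular or mass",
    "NNS": "noun, plural",
    "VBZ": "verb, 3rd person singular present",
    "VBG": "verb, gerund or present participle",
    "DT": "determiner",
    "IN": "preposition or subordinating conjunction",
    "JJ": "adjective",
    "CC": "coordinating conjunction",
    ".": "sentence-final punctuation",
}

# Count how often each tag occurs
tag_counts = Counter(tag for _, tag in tagged)
for tag, count in tag_counts.most_common():
    print(f"{tag:4} {TAG_NAMES.get(tag, 'other'):42} {count}")

# Keep only the words whose tag starts with NN (singular and plural nouns)
nouns = [word for word, tag in tagged if tag.startswith("NN")]
print(nouns)
```

The tagger is a statistical model and can mislabel words: in the output above, "assigns" is tagged `RB` (adverb) although it is used as a verb, and "Stop" in "Stop words" is tagged `VB` (verb).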

## Stop Words Removal

In [12]:
from nltk.corpus import stopwords

In [13]:
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Samruddhi\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [14]:
stop_words = set(stopwords.words('english'))


In [15]:
stop_words

{'a',
 'about',
 'above',
 'after',
 'again',
 'against',
 'ain',
 'all',
 'am',
 'an',
 'and',
 'any',
 'are',
 'aren',
 "aren't",
 'as',
 'at',
 'be',
 'because',
 'been',
 'before',
 'being',
 'below',
 'between',
 'both',
 'but',
 'by',
 'can',
 'couldn',
 "couldn't",
 'd',
 'did',
 'didn',
 "didn't",
 'do',
 'does',
 'doesn',
 "doesn't",
 'doing',
 'don',
 "don't",
 'down',
 'during',
 'each',
 'few',
 'for',
 'from',
 'further',
 'had',
 'hadn',
 "hadn't",
 'has',
 'hasn',
 "hasn't",
 'have',
 'haven',
 "haven't",
 'having',
 'he',
 'her',
 'here',
 'hers',
 'herself',
 'him',
 'himself',
 'his',
 'how',
 'i',
 'if',
 'in',
 'into',
 'is',
 'isn',
 "isn't",
 'it',
 "it's",
 'its',
 'itself',
 'just',
 'll',
 'm',
 'ma',
 'me',
 'mightn',
 "mightn't",
 'more',
 'most',
 'mustn',
 "mustn't",
 'my',
 'myself',
 'needn',
 "needn't",
 'no',
 'nor',
 'not',
 'now',
 'o',
 'of',
 'off',
 'on',
 'once',
 'only',
 'or',
 'other',
 'our',
 'ours',
 'ourselves',
 'out',
 'over',
 'own',
 'r

In [16]:
filtered_tokens = [token for token in tokens if token.lower() not in stop_words]

In [17]:
filtered_tokens

['Tokenization',
 'process',
 'splitting',
 'text',
 'document',
 'individual',
 'words',
 'tokens',
 '.',
 'important',
 'step',
 'natural',
 'language',
 'processing',
 '.',
 'POS',
 'tagging',
 'assigns',
 'grammatical',
 'information',
 'word',
 'sentence',
 '.',
 'Stop',
 'words',
 'commonly',
 'used',
 'words',
 'often',
 'removed',
 'text',
 'data',
 "n't",
 'carry',
 'significant',
 'meaning',
 '.',
 'Stemming',
 'lemmatization',
 'techniques',
 'used',
 'reduce',
 'words',
 'base',
 'root',
 'form',
 '.']
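
The filtered list still contains the punctuation token '.' and the contraction fragment "n't", because neither appears in the stop-word list. If those are unwanted, one option is to keep only alphabetic tokens. The sketch below uses a small hand-written stop-word set (`sample_stop_words`, an assumption for this example) so that it runs without NLTK data; in the notebook, the `stop_words` set from above would take its place.

```python
# Small stand-in for the NLTK list so the example runs without downloads
sample_stop_words = {"they", "do", "the", "is", "are", "that"}

sample_tokens = ["they", "do", "n't", "carry", "significant", "meaning", "."]

# Keep a token only if it is purely alphabetic and not a stop word
cleaned = [
    token.lower()
    for token in sample_tokens
    if token.isalpha() and token.lower() not in sample_stop_words
]
print(cleaned)  # ['carry', 'significant', 'meaning']
```

Dropping "n't" also drops the negation, which may matter for tasks such as sentiment analysis, so whether to apply this filter depends on the use case.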

## Stemming and Lemmatization

In [18]:
nltk.download('wordnet')

[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\Samruddhi\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [19]:
from nltk.stem import PorterStemmer, WordNetLemmatizer

# Stemming
stemmer = PorterStemmer()
stemmed_tokens = [stemmer.stem(token) for token in filtered_tokens]
print("Stemmed Tokens:", stemmed_tokens)

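# Lemmatization (no POS tag is passed, so every token is treated as a noun;
# verb forms such as 'splitting' or 'used' are therefore left unchanged)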
lemmatizer = WordNetLemmatizer()
lemmatized_tokens = [lemmatizer.lemmatize(token) for token in filtered_tokens]
print("Lemmatized Tokens:", lemmatized_tokens)


Stemmed Tokens: ['token', 'process', 'split', 'text', 'document', 'individu', 'word', 'token', '.', 'import', 'step', 'natur', 'languag', 'process', '.', 'po', 'tag', 'assign', 'grammat', 'inform', 'word', 'sentenc', '.', 'stop', 'word', 'commonli', 'use', 'word', 'often', 'remov', 'text', 'data', "n't", 'carri', 'signific', 'mean', '.', 'stem', 'lemmat', 'techniqu', 'use', 'reduc', 'word', 'base', 'root', 'form', '.']
Lemmatized Tokens: ['Tokenization', 'process', 'splitting', 'text', 'document', 'individual', 'word', 'token', '.', 'important', 'step', 'natural', 'language', 'processing', '.', 'POS', 'tagging', 'assigns', 'grammatical', 'information', 'word', 'sentence', '.', 'Stop', 'word', 'commonly', 'used', 'word', 'often', 'removed', 'text', 'data', "n't", 'carry', 'significant', 'meaning', '.', 'Stemming', 'lemmatization', 'technique', 'used', 'reduce', 'word', 'base', 'root', 'form', '.']


## TF-IDF Representation

In [21]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [22]:
filtered_document = ' '.join(filtered_tokens)

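# Calculate TF-IDF representation.
# Only one document is passed, so IDF is identical for every term and the
# scores reduce to L2-normalized term counts.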
vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform([filtered_document])

# Print TF-IDF representation
print("TF-IDF Representation:")
print(tfidf_matrix.toarray())

# Get feature names (terms)
feature_names = vectorizer.get_feature_names_out()
print("\nFeature Names:")
print(feature_names)

TF-IDF Representation:
[[0.13245324 0.13245324 0.13245324 0.13245324 0.13245324 0.13245324
  0.13245324 0.13245324 0.13245324 0.13245324 0.13245324 0.13245324
  0.13245324 0.13245324 0.13245324 0.13245324 0.13245324 0.13245324
  0.13245324 0.13245324 0.13245324 0.13245324 0.13245324 0.13245324
  0.13245324 0.13245324 0.13245324 0.13245324 0.13245324 0.13245324
  0.26490647 0.13245324 0.13245324 0.26490647 0.13245324 0.52981294]]

Feature Names:
['assigns' 'base' 'carry' 'commonly' 'data' 'document' 'form'
 'grammatical' 'important' 'individual' 'information' 'language'
 'lemmatization' 'meaning' 'natural' 'often' 'pos' 'process' 'processing'
 'reduce' 'removed' 'root' 'sentence' 'significant' 'splitting' 'stemming'
 'step' 'stop' 'tagging' 'techniques' 'text' 'tokenization' 'tokens'
 'used' 'word' 'words']
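
Because `fit_transform` received a single document, the scores above are just term counts scaled to unit length: 33 terms occur once, two occur twice, and one occurs four times, so the vector length is sqrt(33 + 8 + 16) = sqrt(57) and a count of 1 becomes 1/sqrt(57). The sketch below re-implements, in plain Python, the weighting that scikit-learn documents for its default settings (`smooth_idf=True`, `norm='l2'`), namely idf(t) = ln((1 + n) / (1 + df(t))) + 1. The function name `tfidf` and the small corpus are assumptions made for illustration.

```python
import math
from collections import Counter

def tfidf(documents):
    """TF-IDF with smoothed IDF and L2 normalization, one dict per document."""
    n_docs = len(documents)
    # Document frequency: in how many documents does each term occur?
    doc_freq = Counter(term for doc in documents for term in set(doc))
    idf = {t: math.log((1 + n_docs) / (1 + df)) + 1 for t, df in doc_freq.items()}
    rows = []
    for doc in documents:
        weights = {t: count * idf[t] for t, count in Counter(doc).items()}
        # Guard against an empty document, whose norm would be zero
        norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
        rows.append({t: w / norm for t, w in weights.items()})
    return rows

# Hypothetical three-document corpus, already tokenized
corpus = [
    ["text", "tokenization"],
    ["text", "stemming"],
    ["text", "tagging"],
]
rows = tfidf(corpus)
for row in rows:
    print({t: round(w, 3) for t, w in row.items()})

# With a single document the IDF factor is 1 for every term,
# so only the normalized counts remain
single = tfidf([["words", "words", "text"]])[0]
print(single)
print(round(1 / math.sqrt(57), 8))  # the value of a count-1 term in the output above
```

With several documents, the term shared by all of them ("text") receives a lower weight than the term unique to each document, which is the effect IDF is meant to produce.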
