<a href="https://colab.research.google.com/github/vaishnashan/NLP/blob/main/Tokenization.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [4]:
# Import necessary libraries
import re
import nltk
from nltk.stem import PorterStemmer, WordNetLemmatizer
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import CountVectorizer

# Download required NLTK packages
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('wordnet')
nltk.download('punkt_tab')

# Initialize Stemmer and Lemmatizer
stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()


[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.


In [2]:
# Input paragraph
paragraph = """In 2003, word n-gram model, at the time the best statistical algorithm, was outperformed by a multi-layer perceptron (with a single hidden layer and context length of several words trained on up to 14 million of words with a CPU cluster in language modelling) by Yoshua Bengio with co-authors.[9]

In 2010, Tomáš Mikolov (then a PhD student at Brno University of Technology) with co-authors applied a simple recurrent neural network with a single hidden layer to language modelling,[10] and in the following years he went on to develop Word2vec. In the 2010s, representation learning and deep neural network-style (featuring many hidden layers) machine learning methods became widespread in natural language processing. That popularity was due partly to a flurry of results showing that such techniques[11][12] can achieve state-of-the-art results in many natural language tasks, e.g., in language modeling[13] and parsing.[14][15] This is increasingly important in medicine and healthcare, where NLP helps analyze notes and text in electronic health records that would otherwise be inaccessible for study when seeking to improve care[16] or protect patient privacy.[17]"""



In [5]:
# Tokenize the paragraph into sentences
sentences = nltk.sent_tokenize(paragraph)

# Preprocessing: cleaning, lowercasing, stopword removal, lemmatization
# Build the stopword set once instead of rebuilding it for every word
stop_words = set(stopwords.words('english'))

corpus = []
for sentence in sentences:
    # Replace everything except ASCII letters with spaces, then lowercase.
    # Note: this also drops digits and accented letters ("Tomáš" -> "tom").
    words = re.sub('[^a-zA-Z]', ' ', sentence).lower().split()

    # Remove stopwords and lemmatize the remaining words (default POS: noun)
    words = [lemmatizer.lemmatize(word) for word in words if word not in stop_words]

    # Rejoin words into a single string; a citation-only fragment becomes ''
    corpus.append(' '.join(words))

# Output the cleaned corpus
print("Cleaned Corpus:")
print(corpus)

Cleaned Corpus:
['word n gram model time best statistical algorithm outperformed multi layer perceptron single hidden layer context length several word trained million word cpu cluster language modelling yoshua bengio co author', 'tom mikolov phd student brno university technology co author applied simple recurrent neural network single hidden layer language modelling following year went develop word vec', 'representation learning deep neural network style featuring many hidden layer machine learning method became widespread natural language processing', 'popularity due partly flurry result showing technique achieve state art result many natural language task e g language modeling parsing', 'increasingly important medicine healthcare nlp help analyze note text electronic health record would otherwise inaccessible study seeking improve care protect patient privacy', '']
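Two details of the output above are worth a closer look: "Tomáš" was reduced to `tom` because the pattern `[^a-zA-Z]` treats accented letters as non-alphabetic, and the trailing `''` most likely comes from the citation marker `[17]` being split off as its own sentence. The following is a minimal sketch of one way to handle both, using only the standard library. The helper names `fold_accents` and `clean` and the sample `fragments` list are hypothetical and not part of the notebook.

```python
import re
import unicodedata

def fold_accents(text):
    """Decompose accented letters and drop the combining marks ('á' -> 'a')."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def clean(text):
    """Fold accents, keep only ASCII letters, lowercase, collapse whitespace."""
    folded = fold_accents(text)
    return " ".join(re.sub(r"[^a-zA-Z]", " ", folded).lower().split())

# Hypothetical fragments: a name with diacritics and a citation-only leftover
fragments = ["Tomáš Mikolov", "[17]"]

cleaned = [clean(fragment) for fragment in fragments]

# Drop documents that became empty after cleaning
non_empty = [doc for doc in cleaned if doc]

print(cleaned)
print(non_empty)
```

Whether to fold accents or to keep the original characters depends on the downstream task; folding is only one option, shown here because it keeps the rest of the ASCII-only pipeline unchanged.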


In [6]:
# Perform CountVectorization
cv = CountVectorizer()
x = cv.fit_transform(corpus)

# Display the vocabulary (term -> column index) and the count vector for the first sentence
print("Vocabulary:", cv.vocabulary_)
print("Feature Array for First Sentence:", x[0].toarray())

Vocabulary: {'word': 86, 'gram': 22, 'model': 41, 'time': 79, 'best': 8, 'statistical': 71, 'algorithm': 1, 'outperformed': 51, 'multi': 44, 'layer': 32, 'perceptron': 55, 'single': 69, 'hidden': 26, 'context': 13, 'length': 34, 'several': 66, 'trained': 81, 'million': 40, 'cpu': 14, 'cluster': 11, 'language': 31, 'modelling': 43, 'yoshua': 89, 'bengio': 7, 'co': 12, 'author': 5, 'tom': 80, 'mikolov': 39, 'phd': 56, 'student': 72, 'brno': 9, 'university': 82, 'technology': 77, 'applied': 3, 'simple': 68, 'recurrent': 62, 'neural': 47, 'network': 46, 'following': 21, 'year': 88, 'went': 84, 'develop': 16, 'vec': 83, 'representation': 63, 'learning': 33, 'deep': 15, 'style': 74, 'featuring': 19, 'many': 36, 'machine': 35, 'method': 38, 'became': 6, 'widespread': 85, 'natural': 45, 'processing': 59, 'popularity': 57, 'due': 17, 'partly': 53, 'flurry': 20, 'result': 64, 'showing': 67, 'technique': 76, 'achieve': 0, 'state': 70, 'art': 4, 'task': 75, 'modeling': 42, 'parsing': 52, 'increasi
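To make the vocabulary above easier to read: with default settings, `CountVectorizer` lowercases the text, keeps tokens of two or more word characters (which is why the single letters `n`, `e`, and `g` from the corpus do not appear), and assigns column indices in alphabetical order. The sketch below is a pure-Python approximation of that behaviour for illustration, not scikit-learn's implementation. The function `bag_of_words` and the sample `docs` are hypothetical.

```python
import re
from collections import Counter

# Same regular expression as CountVectorizer's default token_pattern
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")

def bag_of_words(docs):
    """Return (vocabulary, count matrix) for a list of strings."""
    tokenized = [TOKEN_PATTERN.findall(doc.lower()) for doc in docs]

    # Alphabetically sorted terms get consecutive column indices
    terms = sorted({token for tokens in tokenized for token in tokens})
    vocabulary = {term: index for index, term in enumerate(terms)}

    # One row per document, one column per term
    matrix = []
    for tokens in tokenized:
        counts = Counter(tokens)
        matrix.append([counts[term] for term in terms])
    return vocabulary, matrix

# Hypothetical mini-corpus; the last entry mimics an empty cleaned sentence
docs = ["word n gram model", "single hidden layer language model", ""]
vocabulary, matrix = bag_of_words(docs)

print(vocabulary)
for row in matrix:
    print(row)
```

An empty document yields an all-zero row, which is what the `''` entry in the cleaned corpus contributes to the feature matrix.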

In [8]:
# Compare stemming and lemmatization on the same word
sample_word = "history"
print("Stemmed Word:", stemmer.stem(sample_word))
print("Lemmatized Word:", lemmatizer.lemmatize(sample_word))

Stemmed Word: histori
Lemmatized Word: history
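A single word shows only part of the difference. As a small sketch, the Porter stemmer can be run over a few inflected forms to show that it maps related forms to a shared stem that need not be a dictionary word. The word list is a hypothetical sample, and the lemmatizer is left out here only because it needs the WordNet data download.

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

# Hypothetical sample of inflected forms
words = ["history", "histories", "running", "studies"]

# Map each word to its Porter stem
stems = {word: stemmer.stem(word) for word in words}

for word, stem in stems.items():
    print(f"{word} -> {stem}")
```

For comparison, `lemmatizer.lemmatize(word)` treats its input as a noun unless a part of speech is passed through the `pos` argument, so verb forms generally need `pos="v"` to be reduced to their base form.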
