![Tokenization-in-NLP.png](attachment:1920e5e9-ecbe-460d-9324-fbd53e40d913.png)

![image.png](attachment:c5a36d6f-8add-4639-bf85-2410a8f22726.png)

# STOPWORDS 
![image.png](attachment:314c00fe-90bf-4b2f-a827-52b1051928ef.png)
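
Stopwords are very common words (such as "the", "are", "of") that are usually filtered out before further processing. The sketch below shows the idea in plain Python. The tiny stopword set and the helper name `remove_stopwords` are assumptions made for illustration; NLTK's English list, used later in this notebook, is much longer. That list is all lowercase, which is why capitalized words such as `But` and `The` survive in the lemmatized output at the end of the notebook; lowercasing each token before the lookup is one way to avoid that.

```python
# Tiny illustrative stopword set (an assumption; NLTK's English list is far longer)
stop_words = {"but", "there", "are", "the", "in", "of", "to", "by"}

def remove_stopwords(tokens, stop_words):
    # Compare the lowercased token so that 'But' and 'but' are treated alike
    return [tok for tok in tokens if tok.lower() not in stop_words]

tokens = ["But", "there", "are", "distinctions", "."]
filtered = remove_stopwords(tokens, stop_words)
print(filtered)  # ['distinctions', '.']
```

Punctuation is kept here because it is not in the stopword set; whether to drop it as well depends on the task.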

In [2]:
# libraries
import nltk
from nltk.tokenize import sent_tokenize
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer
from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords



In [9]:
# Sample paragraph used in the tokenization examples
paragraph = """AI, machine learning and deep learning are common terms in enterprise 
                IT and sometimes used interchangeably, especially by companies in their marketing materials. 
                But there are distinctions. The term AI, coined in the 1950s, refers to the simulation of human 
                intelligence by machines. It covers an ever-changing set of capabilities as new technologies 
                are developed. Technologies that come under the umbrella of AI include machine learning and 
                deep learning. Machine learning enables software applications to become more accurate at 
                predicting outcomes without being explicitly programmed to do so. Machine learning algorithms 
                use historical data as input to predict new output values. This approach became vastly more 
                effective with the rise of large data sets to train on. Deep learning, a subset of machine 
                learning, is based on our understanding of how the brain is structured. Deep learning's 
                use of artificial neural networks structure is the underpinning of recent advances in AI, 
                including self-driving cars and ChatGPT."""

In [10]:
# tokenizing paragraph into sentences
sent_token = sent_tokenize(paragraph)
sent_token

['AI, machine learning and deep learning are common terms in enterprise \n                IT and sometimes used interchangeably, especially by companies in their marketing materials.',
 'But there are distinctions.',
 'The term AI, coined in the 1950s, refers to the simulation of human \n                intelligence by machines.',
 'It covers an ever-changing set of capabilities as new technologies \n                are developed.',
 'Technologies that come under the umbrella of AI include machine learning and \n                deep learning.',
 'Machine learning enables software applications to become more accurate at \n                predicting outcomes without being explicitly programmed to do so.',
 'Machine learning algorithms \n                use historical data as input to predict new output values.',
 'This approach became vastly more \n                effective with the rise of large data sets to train on.',
 'Deep learning, a subset of machine \n                learning, is

In [11]:
len(sent_token)

10

In [12]:
# tokenization to words
word_tokens = word_tokenize(paragraph)
word_tokens

['AI',
 ',',
 'machine',
 'learning',
 'and',
 'deep',
 'learning',
 'are',
 'common',
 'terms',
 'in',
 'enterprise',
 'IT',
 'and',
 'sometimes',
 'used',
 'interchangeably',
 ',',
 'especially',
 'by',
 'companies',
 'in',
 'their',
 'marketing',
 'materials',
 '.',
 'But',
 'there',
 'are',
 'distinctions',
 '.',
 'The',
 'term',
 'AI',
 ',',
 'coined',
 'in',
 'the',
 '1950s',
 ',',
 'refers',
 'to',
 'the',
 'simulation',
 'of',
 'human',
 'intelligence',
 'by',
 'machines',
 '.',
 'It',
 'covers',
 'an',
 'ever-changing',
 'set',
 'of',
 'capabilities',
 'as',
 'new',
 'technologies',
 'are',
 'developed',
 '.',
 'Technologies',
 'that',
 'come',
 'under',
 'the',
 'umbrella',
 'of',
 'AI',
 'include',
 'machine',
 'learning',
 'and',
 'deep',
 'learning',
 '.',
 'Machine',
 'learning',
 'enables',
 'software',
 'applications',
 'to',
 'become',
 'more',
 'accurate',
 'at',
 'predicting',
 'outcomes',
 'without',
 'being',
 'explicitly',
 'programmed',
 'to',
 'do',
 'so',
 '.',

In [13]:
len(word_tokens)

174
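
To build some intuition for what word tokenization does, here is a rough regex-based approximation using only the standard library. It is not the algorithm behind `word_tokenize`, and the helper name `simple_word_tokenize` is hypothetical. It keeps hyphenated words together and splits punctuation into separate tokens, as in the output above, but it will differ from NLTK on other inputs (for example, NLTK splits a trailing `'s` into its own token, while this pattern does not).

```python
import re

def simple_word_tokenize(text):
    # A word is a run of letters/digits, optionally joined by hyphens or apostrophes;
    # any other non-space character becomes a token of its own
    return re.findall(r"[A-Za-z0-9]+(?:[-'][A-Za-z0-9]+)*|[^\sA-Za-z0-9]", text)

sample = "The term AI, coined in the 1950s, covers an ever-changing set."
tokens = simple_word_tokenize(sample)
print(tokens)
```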

# STEMMING

In [30]:
# Stemming reduces each word to its stem, which is not always a dictionary word
stem_pst = PorterStemmer()
stop_words = set(stopwords.words('english'))  # build the set once instead of once per word
# Store results in a new list so that re-running the cell never stems already-stemmed text
stemmed_sentences = []
for sent in sent_token:
    words = nltk.word_tokenize(sent)
    # The stopword check is case-sensitive, so capitalized words such as 'But' are kept
    words = [stem_pst.stem(word) for word in words if word not in stop_words]
    stemmed_sentences.append(' '.join(words))  # join with a space to keep the stems separate

In [24]:
stemmed_sentences

['ai , machin learn deep learn common term enterpris it sometim use interchang , especi compani market materi .',
 'but distinct .',
 'the term ai , coin 1950 , refer simul human intellig machin .',
 'it cover ever-chang set capabl new technolog develop .',
 'technolog come umbrella ai includ machin learn deep learn .',
 'machin learn enabl softwar applic becom accur predict outcom without explicitli program .',
 'machin learn algorithm use histor data input predict new output valu .',
 'thi approach becam vastli effect rise larg data set train .',
 'deep learn , subset machin learn , base understand brain structur .',
 "deep learn 's use artifici neural network structur underpin recent advanc ai , includ self-driv car chatgpt ."]

# Lemmatization
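
As a rough contrast with stemming: a stemmer strips suffixes by rule and may produce non-words, while a lemmatizer returns dictionary forms. The toy sketch below illustrates the difference in plain Python. The suffix list and the small lookup table are invented for this example and are far cruder than `PorterStemmer` and `WordNetLemmatizer`, so the results will not match NLTK's in general.

```python
def toy_stem(word):
    # Strip the first matching suffix, keeping at least three characters of stem
    for suffix in ("ies", "es", "ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

# Hypothetical lookup table standing in for a real dictionary such as WordNet
TOY_LEMMAS = {"companies": "company", "technologies": "technology", "values": "value"}

def toy_lemmatize(word):
    # Return the dictionary form if known, otherwise the word unchanged
    return TOY_LEMMAS.get(word, word)

for w in ["companies", "technologies", "values"]:
    print(w, "| stem:", toy_stem(w), "| lemma:", toy_lemmatize(w))
```

The stems (`compan`, `technolog`, `valu`) are truncated fragments, whereas the lemmas are readable words.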

In [29]:
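# Split the paragraph into sentences again so this cell starts from the original text
# Lemmatization maps each word to its dictionary form (lemma)
sentence = sent_tokenize(paragraph)
lemma = WordNetLemmatizer()
stop_words = set(stopwords.words('english'))  # build the set once instead of once per word

for i in range(len(sentence)):
    words = nltk.word_tokenize(sentence[i])
    # With no part-of-speech argument, lemmatize() treats every word as a noun
    words = [lemma.lemmatize(word) for word in words if word not in stop_words]
    sentence[i] = ' '.join(words)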

In [27]:
sentence

['AI , machine learning deep learning common term enterprise IT sometimes used interchangeably , especially company marketing material .',
 'But distinction .',
 'The term AI , coined 1950s , refers simulation human intelligence machine .',
 'It cover ever-changing set capability new technology developed .',
 'Technologies come umbrella AI include machine learning deep learning .',
 'Machine learning enables software application become accurate predicting outcome without explicitly programmed .',
 'Machine learning algorithm use historical data input predict new output value .',
 'This approach became vastly effective rise large data set train .',
 'Deep learning , subset machine learning , based understanding brain structured .',
 "Deep learning 's use artificial neural network structure underpinning recent advance AI , including self-driving car ChatGPT ."]