# Natural Language Processing (NLP) Techniques

This notebook demonstrates various Natural Language Processing (NLP) techniques, including tokenization, stemming, lemmatization, and feature extraction using Bag of Words and TF-IDF. The steps are as follows:

1. Tokenization
2. Stemming
3. Lemmatization
4. Removing Stopwords
5. Bag of Words
6. TF-IDF


Together, these steps show how to clean raw text and turn it into numeric features for further analysis or modeling.

In [12]:
%pip install nltk


In [13]:
paragraph = """
Spotify is a digital music, podcast, and video service that gives you access to millions of songs and other content from creators all over the world.

Basic functions such as playing music are totally free, but you can also choose to upgrade to Spotify Premium.

Whether you have Premium or not, you can.

Get recommendations based on your taste.
Build collections of music and podcasts And more!
Spotify is available across a range of devices, including computers, phones, tablets, speakers, TVs, and cars, and you can easily transition from one to another with Spotify Connect.

Can I keep music from Spotify?
Spotify only gives access to music and podcasts through our apps. Our licensing means there's no way to export our content outside of the app.
"""

paragraph

"\nSpotify is a digital music, podcast, and video service that gives you access to millions of songs and other content from creators all over the world.\n\nBasic functions such as playing music are totally free, but you can also choose to upgrade to Spotify Premium.\n\nWhether you have Premium or not, you can.\n\nGet recommendations based on your taste.\nBuild collections of music and podcasts And more!\nSpotify is available across a range of devices, including computers, phones, tablets, speakers, TVs, and cars, and you can easily transition from one to another with Spotify Connect.\n\nCan I keep music from Spotify?\nSpotify only gives access to music and podcasts through our apps. Our licensing means there's no way to export our content outside of the app.\n"

In [14]:
import nltk 
from nltk.stem import PorterStemmer
from nltk.corpus import stopwords

In [15]:
nltk.download('punkt')
Sentences = nltk.sent_tokenize(paragraph)

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [16]:
Sentences

['\nSpotify is a digital music, podcast, and video service that gives you access to millions of songs and other content from creators all over the world.', 'Basic functions such as playing music are totally free, but you can also choose to upgrade to Spotify Premium.', 'Whether you have Premium or not, you can.', 'Get recommendations based on your taste.', 'Build collections of music and podcasts And more!', 'Spotify is available across a range of devices, including computers, phones, tablets, speakers, TVs, and cars, and you can easily transition from one to another with Spotify Connect.', 'Can I keep music from Spotify?', 'Spotify only gives access to music and podcasts through our apps.', "Our licensing means there's no way to export our content outside of the app."]


In [17]:
stemmer = PorterStemmer()
stemmer.stem('paragraphs')

'paragraph'

In [18]:
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()

In [19]:
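# Downloads every NLTK package, which is slow and large. The cells in this
# notebook rely on the 'punkt', 'stopwords', and 'wordnet' resources
# (some NLTK versions may also need 'omw-1.4' for the lemmatizer).
nltk.download('all')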

[nltk_data] Downloading collection 'all'
[nltk_data]    | 
[nltk_data]    | Downloading package abc to /root/nltk_data...
[nltk_data]    |   Unzipping corpora/abc.zip.
[nltk_data]    | Downloading package alpino to /root/nltk_data...
[nltk_data]    |   Unzipping corpora/alpino.zip.
[nltk_data]    | Downloading package averaged_perceptron_tagger to
[nltk_data]    |     /root/nltk_data...
[nltk_data]    |   Unzipping taggers/averaged_perceptron_tagger.zip.
[nltk_data]    | Downloading package averaged_perceptron_tagger_ru to
[nltk_data]    |     /root/nltk_data...
[nltk_data]    |   Unzipping
[nltk_data]    |       taggers/averaged_perceptron_tagger_ru.zip.
[nltk_data]    | Downloading package basque_grammars to
[nltk_data]    |     /root/nltk_data...
[nltk_data]    |   Unzipping grammars/basque_grammars.zip.
[nltk_data]    | Downloading package bcp47 to /root/nltk_data...
[nltk_data]    | Downloading package biocreative_ppi to
[nltk_data]    |     /root/nltk_data...
[nltk_data]    |   U

True

In [20]:
lemmatizer.lemmatize('goes')

'go'

In [21]:
len(Sentences)

9

In [22]:
import re  # regular expressions
corpus = []
for sentence in Sentences:
    # Replace every non-letter character with a space, then lowercase.
    review = re.sub('[^a-zA-Z]', ' ', sentence)
    review = review.lower()
    corpus.append(review)


In [23]:
corpus

[' spotify is a digital music  podcast  and video service that gives you access to millions of songs and other content from creators all over the world ',
 'basic functions such as playing music are totally free  but you can also choose to upgrade to spotify premium ',
 'whether you have premium or not  you can ',
 'get recommendations based on your taste ',
 'build collections of music and podcasts and more ',
 'spotify is available across a range of devices  including computers  phones  tablets  speakers  tvs  and cars  and you can easily transition from one to another with spotify connect ',
 'can i keep music from spotify ',
 'spotify only gives access to music and podcasts through our apps ',
 'our licensing means there s no way to export our content outside of the app ']
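
The output above still contains double spaces and split contractions ("there s"), because every non-letter character, including commas and apostrophes, is replaced by a space. The sketch below shows one possible variant that also collapses the repeated whitespace. The helper name `clean_sentence` is hypothetical and is not part of the notebook.

```python
import re

def clean_sentence(text):
    """Replace non-letters with spaces, lowercase, and collapse repeated whitespace."""
    letters_only = re.sub(r'[^a-zA-Z]', ' ', text)
    # str.split() with no argument drops leading, trailing, and repeated whitespace.
    return ' '.join(letters_only.lower().split())

sample = "\nSpotify is a digital music, podcast, and video service."
cleaned = clean_sentence(sample)
print(cleaned)

contraction = clean_sentence("Our licensing means there's no way.")
print(contraction)  # the apostrophe becomes a space, so "there's" splits in two
```

Splitting on whitespace later in the notebook has the same effect, so this variant mainly matters when the cleaned strings are inspected or stored directly.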

## Stemming

In [24]:
stopwords.words('english')

['i',
 'me',
 'my',
 'myself',
 'we',
 'our',
 'ours',
 'ourselves',
 'you',
 "you're",
 "you've",
 "you'll",
 "you'd",
 'your',
 'yours',
 'yourself',
 'yourselves',
 'he',
 'him',
 'his',
 'himself',
 'she',
 "she's",
 'her',
 'hers',
 'herself',
 'it',
 "it's",
 'its',
 'itself',
 'they',
 'them',
 'their',
 'theirs',
 'themselves',
 'what',
 'which',
 'who',
 'whom',
 'this',
 'that',
 "that'll",
 'these',
 'those',
 'am',
 'is',
 'are',
 'was',
 'were',
 'be',
 'been',
 'being',
 'have',
 'has',
 'had',
 'having',
 'do',
 'does',
 'did',
 'doing',
 'a',
 'an',
 'the',
 'and',
 'but',
 'if',
 'or',
 'because',
 'as',
 'until',
 'while',
 'of',
 'at',
 'by',
 'for',
 'with',
 'about',
 'against',
 'between',
 'into',
 'through',
 'during',
 'before',
 'after',
 'above',
 'below',
 'to',
 'from',
 'up',
 'down',
 'in',
 'out',
 'on',
 'off',
 'over',
 'under',
 'again',
 'further',
 'then',
 'once',
 'here',
 'there',
 'when',
 'where',
 'why',
 'how',
 'all',
 'any',
 'both',
 'each

In [25]:
stop_words = set(stopwords.words('english'))  # build the set once, not once per word
for sentence in corpus:
    for word in nltk.word_tokenize(sentence):
        if word not in stop_words:
            print(stemmer.stem(word))

spotifi
digit
music
podcast
video
servic
give
access
million
song
content
creator
world
basic
function
play
music
total
free
also
choos
upgrad
spotifi
premium
whether
premium
get
recommend
base
tast
build
collect
music
podcast
spotifi
avail
across
rang
devic
includ
comput
phone
tablet
speaker
tv
car
easili
transit
one
anoth
spotifi
connect
keep
music
spotifi
spotifi
give
access
music
podcast
app
licens
mean
way
export
content
outsid
app


## Lemmatization

In [26]:
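stop_words = set(stopwords.words('english'))  # build the set once, not once per word
for sentence in corpus:
    for word in nltk.word_tokenize(sentence):
        if word not in stop_words:
            # lemmatize() treats words as nouns by default, so verb forms
            # such as 'playing' and 'based' are left unchanged.
            print(lemmatizer.lemmatize(word))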

spotify
digital
music
podcast
video
service
give
access
million
song
content
creator
world
basic
function
playing
music
totally
free
also
choose
upgrade
spotify
premium
whether
premium
get
recommendation
based
taste
build
collection
music
podcasts
spotify
available
across
range
device
including
computer
phone
tablet
speaker
tv
car
easily
transition
one
another
spotify
connect
keep
music
spotify
spotify
give
access
music
podcasts
apps
licensing
mean
way
export
content
outside
app


## Remove Stopwords and Lemmatize

In [37]:
import re  # regular expressions
stop_words = set(stopwords.words('english'))  # build the set once
corpus = []
for sentence in Sentences:
    review = re.sub('[^a-zA-Z]', ' ', sentence)  # keep letters only
    review = review.lower().split()
    # Drop stopwords and lemmatize the remaining words.
    review = [lemmatizer.lemmatize(word) for word in review if word not in stop_words]
    corpus.append(' '.join(review))
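
The stopword filter in the cell above is a set-membership test on each lowercased word. The sketch below isolates that step using only the standard library. `STOP_WORDS` is a small hand-picked subset of the list printed earlier, standing in for `stopwords.words('english')`, and the function name `remove_stopwords` is hypothetical. Lemmatization is left out here.

```python
import re

# Hand-picked subset of the NLTK stopword list printed earlier (illustration only).
STOP_WORDS = {'i', 'you', 'your', 'is', 'a', 'the', 'and', 'of', 'to', 'from', 'on'}

def remove_stopwords(sentence, stop_words=STOP_WORDS):
    """Lowercase, keep letters only, and drop words found in stop_words."""
    words = re.sub(r'[^a-zA-Z]', ' ', sentence).lower().split()
    # Set lookup is constant time on average, so the set is built once and reused.
    return [word for word in words if word not in stop_words]

kept = remove_stopwords('Get recommendations based on your taste.')
print(kept)
```

With the full NLTK list the result for this sentence should be the same, since "on" and "your" are the only stopwords it contains.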


## Bag of Words

In [43]:
from sklearn.feature_extraction.text import CountVectorizer
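# binary=True records presence/absence instead of counts;
# ngram_range=(3, 3) makes each feature a three-word sequence (trigram).
cv = CountVectorizer(binary=True, ngram_range=(3, 3))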

In [44]:
X=cv.fit_transform(corpus)
X

<9x50 sparse matrix of type '<class 'numpy.int64'>'
	with 50 stored elements in Compressed Sparse Row format>

In [45]:
cv.vocabulary_

{'spotify digital music': 41,
 'digital music podcast': 15,
 'music podcast video': 28,
 'podcast video service': 34,
 'video service give': 48,
 'service give access': 37,
 'give access million': 21,
 'access million song': 0,
 'million song content': 27,
 'song content creator': 38,
 'content creator world': 12,
 'basic function playing': 6,
 'function playing music': 19,
 'playing music totally': 33,
 'music totally free': 30,
 'totally free also': 44,
 'free also choose': 18,
 'also choose upgrade': 3,
 'choose upgrade spotify': 9,
 'upgrade spotify premium': 47,
 'get recommendation based': 20,
 'recommendation based taste': 36,
 'build collection music': 7,
 'collection music podcasts': 10,
 'spotify available across': 40,
 'available across range': 5,
 'across range device': 2,
 'range device including': 35,
 'device including computer': 14,
 'including computer phone': 23,
 'computer phone tablet': 11,
 'phone tablet speaker': 32,
 'tablet speaker tv': 43,
 'speaker tv car': 39

In [46]:
corpus[0]

'spotify digital music podcast video service give access million song content creator world'

In [47]:
X[0].toarray()

array([[1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1,
        0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 0,
        0, 0, 0, 0, 1, 0]])
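
To make the row above easier to read, here is a minimal standard-library sketch of a binary trigram bag of words over two of the cleaned sentences. It is not how `CountVectorizer` is implemented: the helper `word_ngrams` is hypothetical, and the vocabulary covers only these two sentences rather than the full corpus.

```python
def word_ngrams(text, n):
    """Return the list of n-word sequences in a whitespace-separated string."""
    words = text.split()
    return [' '.join(words[i:i + n]) for i in range(len(words) - n + 1)]

docs = [
    'spotify digital music podcast video service give access '
    'million song content creator world',   # corpus[0] from the notebook
    'keep music spotify',                    # the cleaned "Can I keep music from Spotify?"
]

doc_trigrams = [word_ngrams(doc, 3) for doc in docs]

# Sorted vocabulary over all documents, as a column order for the matrix.
vocabulary = sorted({gram for grams in doc_trigrams for gram in grams})

# Binary rows: 1 if the trigram occurs in the document, else 0.
rows = [[1 if term in grams else 0 for term in vocabulary] for grams in doc_trigrams]

print(len(vocabulary))
print(sum(rows[0]), sum(rows[1]))
```

The first sentence has 13 words and therefore 11 trigrams, which agrees with the eleven 1s in `X[0]` above.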

## TF-IDF (Term Frequency - Inverse Document Frequency)

In [65]:
from sklearn.feature_extraction.text import TfidfVectorizer
# Unigrams only; max_features=3 keeps the three most frequent terms across the corpus.
cv = TfidfVectorizer(ngram_range=(1, 1), max_features=3)
X = cv.fit_transform(corpus)

In [66]:
corpus[0]

'spotify digital music podcast video service give access million song content creator world'

In [67]:
X[0].toarray()

array([[0.71799085, 0.49218347, 0.49218347]])
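
The three weights above can be checked by hand. The sketch below assumes scikit-learn's default settings (smoothed IDF, `ln((1 + n) / (1 + df)) + 1`, followed by L2 normalization). It also assumes the three retained features are `access`, `music`, and `spotify`: the notebook does not print the feature names, so this is inferred from the lemmatized corpus, where the first column must be a term that appears in 2 of the 9 sentences and the other two appear in 5.

```python
import math

n_docs = 9  # number of sentences in the corpus

# Assumed features with their document frequencies (sentences containing the term).
doc_freq = {'access': 2, 'music': 5, 'spotify': 5}
# Each of these terms occurs once in corpus[0].
term_counts = {'access': 1, 'music': 1, 'spotify': 1}

def smoothed_idf(df, n):
    """Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1."""
    return math.log((1 + n) / (1 + df)) + 1

features = sorted(doc_freq)  # alphabetical column order
raw = [term_counts[t] * smoothed_idf(doc_freq[t], n_docs) for t in features]

# L2-normalize the row so that its Euclidean length is 1.
norm = math.sqrt(sum(value * value for value in raw))
weights = [value / norm for value in raw]

print(features)
print(weights)
```

The rarer term gets the largest weight, and the two terms with equal document frequency get equal weights, which is the pattern seen in `X[0]`.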