**NLTK vs spaCy**

* NLTK was built by scholars and researchers as a tool to help you create complex NLP functions. It almost acts as a toolbox of NLP algorithms. In contrast, spaCy is similar to a service: it helps you get specific tasks done.

* A core difference between NLTK and spaCy stems from the way in which these libraries were built. NLTK is essentially a string processing library, where each function takes strings as input and returns a processed string. In contrast, spaCy takes an object-oriented approach.

* spaCy is designed specifically for production use and helps you build applications that process and “understand” large volumes of text. It can be used to build information extraction or natural language understanding systems, or to pre-process text for deep learning.

* Because spaCy ships current, optimized algorithms, it usually performs better than NLTK.

* spaCy has support for word vectors whereas NLTK does not.

**Corpus** - A text corpus is a large and unstructured set of texts. It is made up of documents (docs).

**Tokenization** - Tokenization is the act of breaking up a string of text into pieces such as words, keywords, phrases, symbols and other elements called tokens. Depending on the tokenizer, some characters such as punctuation marks may be discarded; spaCy keeps punctuation marks as separate tokens.

**Stopwords** - Stopwords are common words that do not add much meaning to a sentence. They can safely be ignored without sacrificing the meaning of the sentence. Examples in English are the, he and have.

**Stemming and Lemmatization** - Stemming and lemmatization both reduce inflected words to a root form. The difference is that a stem might not be an actual word, whereas a lemma is always an actual word of the language.
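
To make the stem/lemma distinction concrete, here is a toy sketch in plain Python. The suffix list and the lemma lookup table are made up for illustration; real stemmers and lemmatizers use far richer rules and vocabularies, so treat this as a picture of the idea rather than of any library's algorithm.

```python
# Toy illustration only: SUFFIXES and LEMMAS are hypothetical, hand-written data.
SUFFIXES = ("ing", "ies", "ed", "es", "s")
LEMMAS = {"studies": "study", "studying": "study", "ran": "run"}

def crude_stem(word):
    """Chop the first matching suffix; the result may not be a real word."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def lookup_lemma(word):
    """Return the dictionary form if it is known, else the word unchanged."""
    return LEMMAS.get(word, word)

for w in ["studies", "studying", "ran"]:
    print(w, crude_stem(w), lookup_lemma(w))
```

The crude stemmer turns "studies" into "stud", which is not a word, while the lookup returns "study". It also leaves the irregular form "ran" untouched, where the lookup returns "run".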


### spaCy examples

In [None]:
# Tokenization
import spacy
nlp = spacy.load('en_core_web_sm')

txt = 'This tutorial is about Natural Language Processing in Spacy.'
doc = nlp(txt)
# Extract tokens
print ([token.text for token in doc])


['This', 'tutorial', 'is', 'about', 'Natural', 'Language', 'Processing', 'in', 'Spacy', '.']


In [None]:
# Reading from a txt file
file_name = 'intro.txt'
# The with block closes the file once it has been read
with open(file_name) as fl:
    doc = nlp(fl.read())
# Extract tokens
print ([token.text for token in doc])

['This', 'tutorial', 'is', 'about', 'Natural', 'Language', 'Processing', 'in', 'Spacy', '.']


In [None]:
# Sentence detection
txt = 'Jack fell down. And broke his crown.'
doc = nlp(txt)
sentences = list(doc.sents)
print(len(sentences))

for sentence in sentences:
    print (sentence)

2
Jack fell down.
And broke his crown.


In [None]:
# Extract tokens and index
txt = 'This tutorial is about Natural Language Processing in Spacy.'
doc = nlp(txt)

for token in doc:
  print (token, token.idx)

This 0
tutorial 5
is 14
about 17
Natural 23
Language 31
Processing 40
in 51
Spacy 54
. 59


In [None]:
# Stop words
from spacy.lang.en.stop_words import STOP_WORDS

spacy_stopwords = STOP_WORDS
print(len(spacy_stopwords))

# STOP_WORDS is a set, so which 10 words get printed can vary between runs
for stop_word in list(spacy_stopwords)[:10]:
  print(stop_word)

326
could
whatever
beyond
afterwards
’ll
often
therein
give
of
everything


In [None]:
# Omitting stop words
txt = 'This tutorial is about Natural Language Processing in Spacy.'
doc = nlp(txt)
new_doc = [token for token in doc if not token.is_stop]
print (new_doc)

[tutorial, Natural, Language, Processing, Spacy, .]


In [None]:
# Lemmatization
# spaCy doesn't contain any function for stemming as it relies on lemmatization only.
txt = 'He ran while running for a run'
doc = nlp(txt)

for token in doc:
  print (token, token.lemma_)

He he
ran run
while while
running run
for for
a a
run run


**Word Frequency**
* Word frequencies give insight into word patterns, such as the most common words or the unique words in a text.

In [None]:
from collections import Counter

txt='Stopwords are the English words which does not add much meaning \
to a sentence. They can safely be ignored without sacrificing the \
meaning of the sentence. For example, the words like the, he, have etc.'

doc = nlp(txt)
# Remove stop words and punctuation symbols
words = [token.text for token in doc
         if not token.is_stop and not token.is_punct]
word_freq = Counter(words)
print(word_freq)

# 3 commonly occurring words with their frequencies
common_words = word_freq.most_common(3)
print(common_words)

# Unique words
unique_words = [word for (word, freq) in word_freq.items() if freq == 1]
print(unique_words)

Counter({'words': 2, 'meaning': 2, 'sentence': 2, 'Stopwords': 1, 'English': 1, 'add': 1, 'safely': 1, 'ignored': 1, 'sacrificing': 1, 'example': 1, 'like': 1, 'etc': 1})
[('words', 2), ('meaning', 2), ('sentence', 2)]
['Stopwords', 'English', 'add', 'safely', 'ignored', 'sacrificing', 'example', 'like', 'etc']


**Part of speech tagging**
* It is the process of assigning a POS tag to each token depending on its usage in the sentence. POS tags are useful for assigning a syntactic category like noun or verb to each word.

* Using POS tags, you can extract a particular category of words. You can use this to derive insights, remove the most common nouns, or see which adjectives are used for a particular noun.

In [None]:
txt = "Jack's fell down. And broke his crown."
doc = nlp(txt)
for token in doc:
  print (token, token.tag_, token.pos_, spacy.explain(token.tag_))

Jack NNP PROPN noun, proper singular
's POS PART possessive ending
fell VBD VERB verb, past tense
down RP ADP adverb, particle
. . PUNCT punctuation mark, sentence closer
And CC CCONJ conjunction, coordinating
broke VBD VERB verb, past tense
his PRP$ PRON pronoun, possessive
crown NN NOUN noun, singular or mass
. . PUNCT punctuation mark, sentence closer
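
As a small sketch of extracting one category of words, the snippet below filters (token, coarse POS) pairs copied from the tagger output above. The helper name `words_with_pos` is made up for this example.

```python
# (token, coarse POS) pairs copied from the tagger output shown above.
tagged = [
    ("Jack", "PROPN"), ("'s", "PART"), ("fell", "VERB"), ("down", "ADP"),
    (".", "PUNCT"), ("And", "CCONJ"), ("broke", "VERB"), ("his", "PRON"),
    ("crown", "NOUN"), (".", "PUNCT"),
]

def words_with_pos(pairs, wanted):
    """Keep the tokens whose coarse POS tag is in `wanted`."""
    return [text for text, pos in pairs if pos in wanted]

verbs = words_with_pos(tagged, {"VERB"})
nouns = words_with_pos(tagged, {"NOUN", "PROPN"})
print(verbs)
print(nouns)
```

With a spaCy doc, the same filter can be written directly as a list comprehension over the tokens, testing token.pos_ against the wanted tags.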


* **Dependency parsing** is the process of extracting the dependency parse of a sentence to represent its grammatical structure. It defines the dependency relationship between headwords and their dependents.

* Dependency parsing helps you know what role a word plays in the text and how different words relate to each other.

* It’s also used in named entity recognition.

* Learn more at https://nlp.stanford.edu/software/dependencies_manual.pdf

In [None]:
txt = 'Jack fell down. And broke his crown.'
doc = nlp(txt)
for token in doc:
  print (token.text, token.tag_, token.head.text, token.dep_)

Jack NNP fell nsubj
fell VBD fell ROOT
down RP fell prt
. . fell punct
And CC broke cc
broke VBD broke ROOT
his PRP$ crown poss
crown NN broke dobj
. . broke punct


**Named Entity Recognition (NER)**

* It is the process of locating named entities in unstructured text and then classifying them into pre-defined categories, such as person names, organizations, locations, monetary values, percentages, time expressions, and so on.

* You could use it to populate tags for a set of documents in order to improve the keyword search. You could also use it to categorize customer support tickets into relevant categories.

* You can also mask names and other sensitive entities by replacing them with placeholders.

In [None]:
txt = 'Jack fell down. And broke his crown at Delhi. Got it repaired at 10 AM'
doc = nlp(txt)
for ent in doc.ents:
  print(ent.text, ent.start_char, ent.end_char,
        ent.label_, spacy.explain(ent.label_))

Jack 0 4 PERSON People, including fictional
Delhi 39 44 GPE Countries, cities, states
10 AM 65 70 TIME Times smaller than a day
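
The masking idea can be sketched with the character offsets printed above. The helper `mask_entities` and the bracketed placeholder format are assumptions made for this example, not part of spaCy.

```python
text = 'Jack fell down. And broke his crown at Delhi. Got it repaired at 10 AM'
# (start_char, end_char, label) triples copied from the NER output above.
entities = [(0, 4, 'PERSON'), (39, 44, 'GPE'), (65, 70, 'TIME')]

def mask_entities(text, entities):
    """Replace each entity span with its label in square brackets."""
    # Work from the rightmost span so the earlier offsets stay valid.
    for start, end, label in sorted(entities, reverse=True):
        text = text[:start] + f'[{label}]' + text[end:]
    return text

masked = mask_entities(text, entities)
print(masked)
```

Replacing from right to left matters: a placeholder usually has a different length from the text it replaces, which would shift every offset to its right.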


**Text vectorization**

* Since the beginning of the brief history of Natural Language Processing (NLP), there has been the need to transform text into something a machine can understand. That is, transforming text into a meaningful vector (or array) of numbers. The de-facto standard way of doing this in the pre-deep learning era was to use a **bag of words** approach.

* The idea behind this method is straightforward, though very powerful. First, we define a fixed length vector where each entry corresponds to a word in our pre-defined dictionary of words. The size of the vector equals the size of the dictionary. Then, for representing a text using this vector, we count how many times each word of our dictionary appears in the text and we put this number in the corresponding vector entry.

* Bag of words indicates the count of each word in the document. This simple model is used, for example, in naive Bayes classifiers.

* In text mining, it is important to create the **document-term matrix (DTM)** of the corpus we are interested in. A DTM is basically a matrix, with documents designated by rows and words by columns, whose elements are the counts or the weights (usually tf-idf).

* Note that each row in a DTM is the bag-of-words vector of one document.

* To improve this representation, you can use some more advanced techniques like removing stopwords, lemmatizing words, using n-grams or using tf-idf instead of counts.

* The problem with this method is that it doesn’t capture the meaning of the text, or the context in which words appear, even when using n-grams.

* Learn more at https://www.mygreatlearning.com/blog/bag-of-words/
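
The bag-of-words construction described above can be sketched in a few lines of plain Python. The three-document corpus is made up for illustration, and splitting on whitespace stands in for a real tokenizer.

```python
from collections import Counter

# Hypothetical mini-corpus, made up for illustration.
docs = ["free entry win prize", "ok see you later", "win free prize now"]

tokenized = [doc.split() for doc in docs]
# The dictionary: every distinct word, in a fixed (sorted) order.
vocabulary = sorted({word for tokens in tokenized for word in tokens})

def bag_of_words(tokens, vocabulary):
    """Count how often each vocabulary word occurs in one document."""
    counts = Counter(tokens)
    return [counts[word] for word in vocabulary]

# Document-term matrix: one row per document, one column per word.
dtm = [bag_of_words(tokens, vocabulary) for tokens in tokenized]

print(vocabulary)
for row in dtm:
    print(row)
```

Every row has the same length as the vocabulary, and most entries are zero because each document uses only a few of the words.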



---




**TF-IDF**

* It is a statistical measure that evaluates how relevant a word is to a document in a collection of documents. This is done by multiplying two metrics: how many times a word appears in a document, and the inverse document frequency of the word across a set of documents.

* TF-IDF scores are word frequency scores that try to highlight the more interesting words, e.g. those frequent in one document but rare across documents. The TfidfVectorizer from scikit-learn will tokenize documents, learn the vocabulary and inverse document frequency weightings, and allow you to encode new documents.
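
As a rough sketch of the multiplication described above, the snippet below computes the textbook score, raw count times log(N / df), where N is the number of documents and df is the number of documents containing the word. The mini-corpus is made up. scikit-learn's TfidfVectorizer uses a smoothed inverse document frequency and normalizes each row by default, so its numbers will differ from these.

```python
import math
from collections import Counter

# Hypothetical mini-corpus, made up for illustration.
docs = ["free prize call now", "call me later", "free free prize"]
tokenized = [doc.split() for doc in docs]
n_docs = len(tokenized)

# Document frequency: in how many documents does each word occur?
doc_freq = Counter(word for tokens in tokenized for word in set(tokens))

def tf_idf(tokens):
    """Textbook TF-IDF for one document: raw count * log(N / df)."""
    counts = Counter(tokens)
    return {word: count * math.log(n_docs / doc_freq[word])
            for word, count in counts.items()}

scores = [tf_idf(tokens) for tokens in tokenized]
for doc, row in zip(docs, scores):
    print(doc, {word: round(score, 3) for word, score in row.items()})
```

In the first document, "now" occurs in no other document and therefore scores higher than "free", which also appears in the third one.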

**Sparse vs Dense matrices**

* Matrices that contain mostly zero values are called sparse, distinct from matrices where most of the values are non-zero, called dense.


* It is computationally expensive to represent and work with sparse matrices as though they were dense. Much better performance can be achieved by using representations and operations that store and process only the non-zero values.

* Learn more at https://www.geeksforgeeks.org/sparse-matrix-representation/
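
A minimal sketch of the idea, assuming a simple dictionary of column index to value as the sparse format. Library formats such as SciPy's compressed sparse row matrices use a more compact array layout, but the principle of storing only the non-zero entries is the same. The row values are made up.

```python
# A mostly-zero row, like one row of a document-term matrix (values made up).
dense_row = [0.0, 0.0, 0.31, 0.0, 0.0, 0.0, 0.27, 0.0, 0.0, 0.0]

def to_sparse(row):
    """Keep only the non-zero entries as {column_index: value}."""
    return {col: value for col, value in enumerate(row) if value != 0}

def to_dense(sparse, length):
    """Rebuild the full row, filling the missing columns with zeros."""
    return [sparse.get(col, 0.0) for col in range(length)]

def sparse_dot(a, b):
    """Dot product that touches only the columns stored in both rows."""
    return sum(value * b[col] for col, value in a.items() if col in b)

sparse_row = to_sparse(dense_row)
print(sparse_row)
print(len(sparse_row), "stored values instead of", len(dense_row))
```

The saving grows with the number of columns: a row with thousands of columns but only a handful of non-zero weights needs only that handful of entries.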

### Spam classification using Naive Bayes

* Naive Bayes relies on class-conditional probabilities rather than on explicitly fitting a separating hyperplane, and it often works well on high-dimensional text features such as word counts.

In [None]:
# Get the data from https://www.kaggle.com/uciml/sms-spam-collection-dataset
# SMS spam classification
import pandas as pd
from nltk.corpus import stopwords
import nltk
nltk.download('stopwords')

smsDF=pd.read_csv("https://github.com/mohitgupta-1O1/Kaggle-SMS-Spam-Collection-Dataset-/raw/refs/heads/master/spam.csv", encoding='latin-1')
smsDF.head()

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


Unnamed: 0,v1,v2,Unnamed: 2,Unnamed: 3,Unnamed: 4
0,ham,"Go until jurong point, crazy.. Available only ...",,,
1,ham,Ok lar... Joking wif u oni...,,,
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...,,,
3,ham,U dun say so early hor... U c already then say...,,,
4,ham,"Nah I don't think he goes to usf, he lives aro...",,,


In [None]:
smsDF = smsDF.drop(columns=['Unnamed: 2', 'Unnamed: 3', 'Unnamed: 4'])

smsDF.head()

Unnamed: 0,v1,v2
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro..."


In [None]:
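sw = set(stopwords.words('english'))  # a set makes membership tests fast

# Named remove_stopwords so it does not shadow the imported nltk stopwords module
def remove_stopwords(text):
    text = [word.lower() for word in text.split() if word.lower() not in sw]
    return " ".join(text)

print(smsDF.head())
smsDF['v2'] = smsDF['v2'].apply(remove_stopwords)
smsDF.head()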

     v1                                                 v2
0   ham  Go until jurong point, crazy.. Available only ...
1   ham                      Ok lar... Joking wif u oni...
2  spam  Free entry in 2 a wkly comp to win FA Cup fina...
3   ham  U dun say so early hor... U c already then say...
4   ham  Nah I don't think he goes to usf, he lives aro...


Unnamed: 0,v1,v2
0,ham,"go jurong point, crazy.. available bugis n gre..."
1,ham,ok lar... joking wif u oni...
2,spam,free entry 2 wkly comp win fa cup final tkts 2...
3,ham,u dun say early hor... u c already say...
4,ham,"nah think goes usf, lives around though"


In [None]:
from nltk.stem.snowball import SnowballStemmer
stemmer = SnowballStemmer("english")

def stemming(text):
    text = [stemmer.stem(word) for word in text.split()]
    return " ".join(text)

print(smsDF.head())
smsDF['v2'] = smsDF['v2'].apply(stemming)
print(smsDF.head())

     v1                                                 v2
0   ham  go jurong point, crazy.. available bugis n gre...
1   ham                      ok lar... joking wif u oni...
2  spam  free entry 2 wkly comp win fa cup final tkts 2...
3   ham          u dun say early hor... u c already say...
4   ham            nah think goes usf, lives around though
     v1                                                 v2
0   ham  go jurong point, crazy.. avail bugi n great wo...
1   ham                        ok lar... joke wif u oni...
2  spam  free entri 2 wkli comp win fa cup final tkts 2...
3   ham          u dun say earli hor... u c alreadi say...
4   ham              nah think goe usf, live around though


In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

# Create the TF-IDF vectorizer
tfid_vectorizer = TfidfVectorizer()

# Learn the vocabulary and build the TF-IDF matrix of the messages
tfid_matrix = tfid_vectorizer.fit_transform(smsDF['v2'])
print(type(tfid_matrix))
print(tfid_matrix[0])
print(tfid_matrix.shape)


<class 'scipy.sparse._csr.csr_matrix'>
<Compressed Sparse Row sparse matrix of dtype 'float64'
	with 18 stored elements and shape (1, 8672)>
  Coords	Values
  (0, 3550)	0.1481298737377147
  (0, 8030)	0.22998520738984352
  (0, 4350)	0.3264252905795869
  (0, 5920)	0.2553151503985779
  (0, 2327)	0.25279391746019725
  (0, 1303)	0.24415547176756056
  (0, 5537)	0.15618023117358304
  (0, 4087)	0.10720385321563428
  (0, 1751)	0.2757654045621182
  (0, 3634)	0.1803175103691124
  (0, 8489)	0.22080132794235655
  (0, 4476)	0.2757654045621182
  (0, 1749)	0.3116082237740733
  (0, 2048)	0.2757654045621182
  (0, 7645)	0.15566431601878158
  (0, 3594)	0.15318864840197105
  (0, 1069)	0.3264252905795869
  (0, 8267)	0.18238655630689804
(5572, 8672)


In [None]:
print(tfid_matrix[300])

<Compressed Sparse Row sparse matrix of dtype 'float64'
	with 9 stored elements and shape (1, 8158)>
  Coords	Values
  (0, 3417)	0.20475211837619658
  (0, 5998)	0.29791514266610997
  (0, 3080)	0.26778946875694515
  (0, 7251)	0.40950423675239317
  (0, 4288)	0.2561210498967912
  (0, 1302)	0.34125451489376435
  (0, 6619)	0.34878383940180097
  (0, 3651)	0.36858834742637536
  (0, 2489)	0.4363004076017161


In [None]:
# Convert the sparse matrix into a dense one
array = tfid_matrix.todense()
print(type(array))
print(array[0])
print(array.shape)

<class 'numpy.matrix'>
[[0. 0. 0. ... 0. 0. 0.]]
(5572, 8672)


In [None]:
df = pd.DataFrame(array)
df[(df[5] != 0)].head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,8662,8663,8664,8665,8666,8667,8668,8669,8670,8671
5456,0.0,0.0,0.0,0.0,0.0,0.304677,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [None]:
df['v1']=smsDF['v1']

In [None]:
from sklearn.model_selection import train_test_split

features = df.columns.tolist()
features.remove('v1')

X_train, X_test, y_train, y_test = train_test_split(df[features], df['v1'], test_size=0.33)

* **Bernoulli Naive Bayes:** It assumes that all our features are binary.

* **Multinomial Naive Bayes:** It is used when we have discrete data (e.g. movie ratings from 1 to 5, where each rating occurs with a certain frequency). In text learning, we use the count of each word to predict the class or label.

* **Gaussian Naive Bayes:** Because it assumes that features follow a normal distribution, Gaussian Naive Bayes is used when all our features are continuous.
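
To show what the Bernoulli variant computes, here is a from-scratch sketch on binary features with Laplace smoothing. The function names and the toy data are made up for illustration, and this is not scikit-learn's implementation. Note that scikit-learn's BernoulliNB binarizes its input by default (threshold 0.0), so TF-IDF weights are reduced to "word present or absent".

```python
import math

def fit_bernoulli_nb(X, y, alpha=1.0):
    """Estimate log priors and per-class feature probabilities.

    X is a list of 0/1 feature lists, y the list of class labels,
    alpha the Laplace smoothing constant.
    """
    model = {}
    n_features = len(X[0])
    for label in set(y):
        rows = [x for x, target in zip(X, y) if target == label]
        # P(feature_j = 1 | class), smoothed so it is never exactly 0 or 1.
        probs = [(sum(row[j] for row in rows) + alpha) / (len(rows) + 2 * alpha)
                 for j in range(n_features)]
        model[label] = (math.log(len(rows) / len(X)), probs)
    return model

def predict_bernoulli_nb(model, x):
    """Return the class with the highest log posterior for one sample."""
    def log_posterior(label):
        log_prior, probs = model[label]
        return log_prior + sum(
            math.log(p if value else 1.0 - p) for value, p in zip(x, probs))
    return max(model, key=log_posterior)

# Hypothetical toy data: the columns mean "contains 'free'", "contains 'win'",
# "contains 'ok'"; the labels are made up for illustration.
X = [[1, 1, 0], [1, 0, 0], [0, 1, 0], [0, 0, 1], [0, 0, 1], [1, 0, 1]]
y = ["spam", "spam", "spam", "ham", "ham", "ham"]

model = fit_bernoulli_nb(X, y)
print(predict_bernoulli_nb(model, [1, 1, 0]))
print(predict_bernoulli_nb(model, [0, 0, 1]))
```

Unlike the multinomial variant, the Bernoulli model also uses the absence of a word as evidence: each missing feature contributes a factor of 1 - p to the class score.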

In [None]:
from sklearn.naive_bayes import GaussianNB, BernoulliNB, MultinomialNB
from sklearn.metrics import accuracy_score

#model=GaussianNB() # .89
model=BernoulliNB() # .97
#model=MultinomialNB() # .95

model.fit(X_train,y_train)
y_pred=model.predict(X_test)
acc = accuracy_score(y_test, y_pred)
print('Accuracy: %.3f' % acc)


Accuracy: 0.979


### Using Keras

In [None]:
# Get the data from https://www.kaggle.com/uciml/sms-spam-collection-dataset
# SMS spam classification
import pandas as pd
from nltk.corpus import stopwords
import nltk
nltk.download('stopwords')
from nltk.stem.snowball import SnowballStemmer

smsDF=pd.read_csv("https://github.com/mohitgupta-1O1/Kaggle-SMS-Spam-Collection-Dataset-/raw/refs/heads/master/spam.csv", encoding='latin-1')
smsDF = smsDF.drop(columns=['Unnamed: 2', 'Unnamed: 3', 'Unnamed: 4'])

sw = stopwords.words('english')
def remove_stopwords(text):
    text = [word.lower() for word in text.split() if word.lower() not in sw]
    return " ".join(text)

smsDF['v2'] = smsDF['v2'].apply(remove_stopwords)

stemmer = SnowballStemmer("english")

def stemming(text):
    text = [stemmer.stem(word) for word in text.split()]
    return " ".join(text)

smsDF['v2'] = smsDF['v2'].apply(stemming)

print("smsDF loaded and preprocessed successfully:")
print(smsDF.head())

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


smsDF loaded and preprocessed successfully:
     v1                                                 v2
0   ham  go jurong point, crazy.. avail bugi n great wo...
1   ham                        ok lar... joke wif u oni...
2  spam  free entri 2 wkli comp win fa cup final tkts 2...
3   ham          u dun say earli hor... u c alreadi say...
4   ham              nah think goe usf, live around though


In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense

# Vectorize the text data using TF-IDF
tfid_vectorizer = TfidfVectorizer()
tfid_matrix = tfid_vectorizer.fit_transform(smsDF['v2'])

# Convert the sparse matrix to a dense NumPy array for the neural network
tfid_array = tfid_matrix.toarray()

# Encode the target variable 'v1' to numerical labels (0 for ham, 1 for spam)
label_encoder = LabelEncoder()
y_encoded = label_encoder.fit_transform(smsDF['v1'])

# Split the data into training and testing sets
X_train, X_test, y_train_encoded, y_test_encoded = train_test_split(
    tfid_array, y_encoded, test_size=0.33, random_state=42
)

# Define the neural network model
model = Sequential()
# Declare the input shape explicitly instead of passing input_dim to Dense
model.add(tf.keras.Input(shape=(X_train.shape[1],)))
model.add(Dense(128, activation='relu'))
model.add(Dense(64, activation='relu'))
model.add(Dense(1, activation='sigmoid')) # Output layer for binary classification

# Compile the model
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

# Train the model
model.fit(X_train, y_train_encoded, epochs=5, batch_size=32, verbose=1)

# Evaluate the model
loss, accuracy = model.evaluate(X_test, y_test_encoded, verbose=0)
print(f'Neural Network Accuracy: {accuracy:.3f}')



Epoch 1/5
117/117 ━━━━━━━━━━━━━━━━━━━━ 4s 15ms/step - accuracy: 0.8743 - loss: 0.4432
Epoch 2/5
117/117 ━━━━━━━━━━━━━━━━━━━━ 2s 14ms/step - accuracy: 0.9940 - loss: 0.0558
Epoch 3/5
117/117 ━━━━━━━━━━━━━━━━━━━━ 2s 15ms/step - accuracy: 0.9990 - loss: 0.0057
Epoch 4/5
117/117 ━━━━━━━━━━━━━━━━━━━━ 3s 21ms/step - accuracy: 1.0000 - loss: 0.0015
Epoch 5/5
117/117 ━━━━━━━━━━━━━━━━━━━━ 2s 19ms/step - accuracy: 1.0000 - loss: 6.4051e-04
Neural Network Accuracy: 0.980
