<a href="https://colab.research.google.com/github/shivanswamynathan/NLP/blob/main/NLP_Tutorial.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
!python -m spacy download en_core_web_sm

Collecting en-core-web-sm==3.7.1
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.7.1/en_core_web_sm-3.7.1-py3-none-any.whl (12.8 MB)
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')
⚠ Restart to reload dependencies
If you are in a Jupyter or Colab notebook, you may need to restart Python in
order to load all the package's dependencies. You can do this by selecting the
'Restart kernel' or 'Restart runtime' option.


In [2]:
import spacy

In [3]:
nlp = spacy.load("en_core_web_sm")

In [4]:
with open("students.txt") as f:
  text = f.read()

In [5]:
doc = nlp(text)
doc

Dayton high school, 8th grade students information

Name	birth day   	email
-----	------------	------
Virat   5 June, 1882    virat@kohli.com
Maria	12 April, 2001  maria@sharapova.com
Serena  24 June, 1998   serena@williams.com 
Joe      1 May, 1997    joe@root.com




Tokenization

In [6]:
email = []
for token in doc:
  if token.like_email:
    email.append(token.text)
email

['virat@kohli.com',
 'maria@sharapova.com',
 'serena@williams.com',
 'joe@root.com']
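
For comparison, the same extraction can be approximated without spaCy. The sketch below is a simplified stand-in and not how `like_email` is implemented: it splits on whitespace and applies a deliberately loose regular expression to each piece. The names `sample_text`, `EMAIL_RE` and `found_emails` are hypothetical, and the rows are copied from the file contents printed above.

```python
import re

# Rows copied from students.txt as printed above.
sample_text = """Virat   5 June, 1882    virat@kohli.com
Maria 12 April, 2001  maria@sharapova.com
Serena  24 June, 1998   serena@williams.com
Joe      1 May, 1997    joe@root.com"""

# Simplified pattern (an assumption): something@something.something, no spaces.
EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")

# Keep only the whitespace-separated pieces that look like an address.
found_emails = [piece for piece in sample_text.split() if EMAIL_RE.match(piece)]
print(found_emails)
```

A pattern this loose accepts many strings that are not valid addresses, so it is only suitable for quick exploration.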

In [7]:
tokens = [token.text for token in doc[:10]]
tokens
nlp1 = spacy.blank("en")


In [8]:
from spacy.symbols import ORTH
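# Special case: split "Dayton" into "Day" + "ton" (the ORTH pieces must concatenate to the original string)
nlp1.tokenizer.add_special_case("Dayton", [{ORTH: "Day"}, {ORTH: "ton"}])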



In [9]:
for sent in doc.sents:
  print(sent)

Dayton high school, 8th grade students information

Name	birth day   	email
-----	------------	------
Virat   5 June, 1882    virat@kohli.com

Maria	12 April, 2001  maria@sharapova.com
Serena  24 June, 1998   serena@williams.com 

Joe      1 May, 1997    joe@root.com






In [10]:
sentence = list(doc.sents)[1]
sentence

Maria	12 April, 2001  maria@sharapova.com
Serena  24 June, 1998   serena@williams.com 

In [11]:
text1='''
Look for data to help you address the question. Governments are good
sources because data from public research is often freely available. Good
places to start include http://www.data.gov/, and http://www.science.
gov/, and in the United Kingdom, http://data.gov.uk/.
Two of my favorite data sets are the General Social Survey at http://www3.norc.org/gss+website/,
and the European Social Survey at http://www.europeansocialsurvey.org/.
'''

In [12]:
doc1 = nlp(text1)
doc1


Look for data to help you address the question. Governments are good
sources because data from public research is often freely available. Good
places to start include http://www.data.gov/, and http://www.science.
gov/, and in the United Kingdom, http://data.gov.uk/.
Two of my favorite data sets are the General Social Survey at http://www3.norc.org/gss+website/,
and the European Social Survey at http://www.europeansocialsurvey.org/.

In [13]:
url = []
for token in doc1:
  if token.like_url:
    url.append(token.text)
url


['http://www.data.gov/',
 'http://www.science',
 'http://data.gov.uk/.',
 'http://www3.norc.org/gss+website/',
 'http://www.europeansocialsurvey.org/.']
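
Two of the matches above keep a sentence-final period, and `http://www.science` is cut short because that URL is broken across two lines in `text1`. A minimal cleanup sketch for the trailing punctuation follows, assuming no wanted URL legitimately ends in `.` or `,`. The list is copied from the output above; `raw_urls` and `cleaned_urls` are hypothetical names. It does not repair the truncated URL.

```python
# URLs exactly as returned by the like_url loop above.
raw_urls = [
    "http://www.data.gov/",
    "http://www.science",
    "http://data.gov.uk/.",
    "http://www3.norc.org/gss+website/",
    "http://www.europeansocialsurvey.org/.",
]

# Strip trailing periods and commas picked up from the surrounding sentence.
cleaned_urls = [u.rstrip(".,") for u in raw_urls]
print(cleaned_urls)
```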

Extract every transaction from this text, with its amount and currency

In [14]:
transactions = "Tony gave two $ to Peter, Bruce gave 500 € to Steve"
doc3 = nlp(transactions)


In [15]:
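for token in doc3[:-1]:
  # Pair each number-like token with the next token when that token is a currency symbol.
  # Stopping one token early avoids an IndexError if the text ends with a number.
  next_token = doc3[token.i + 1]
  if token.like_num and next_token.is_currency:
    print(token.text, next_token)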

two $
500 €


Lemmatization

In [16]:
for token in doc1[:10]:
  print(token.text, "-->", token.lemma_)



 --> 

Look --> look
for --> for
data --> datum
to --> to
help --> help
you --> you
address --> address
the --> the
question --> question


In [17]:
nlp.pipe_names

['tok2vec', 'tagger', 'parser', 'attribute_ruler', 'lemmatizer', 'ner']

Named Entity Recognition


In [18]:
text2 = """Kiran want to know the famous foods in each state of India. So, he opened Google and search for this question. Google showed that
in Delhi it is Chaat, in Gujarat it is Dal Dhokli, in Tamilnadu it is Pongal, in Andhrapradesh it is Biryani, in Assam it is Papaya Khar,
in Bihar it is Litti Chowkha and so on for all other states"""

doc4 = nlp(text2)

In [19]:
from spacy import displacy
displacy.render(doc4,style="ent")

In [20]:
gpe = []
for ent in doc4.ents:
  if ent.label_ == 'GPE':
    gpe.append(ent)
gpe

[India, Delhi, Gujarat, Tamilnadu, Pongal, Andhrapradesh, Assam, Bihar]

In [21]:
text = """Sachin Tendulkar was born on 24 April 1973, Virat Kholi was born on 5 November 1988, Dhoni was born on 7 July 1981
and finally Ricky ponting was born on 19 December 1974."""

doc = nlp(text)

In [22]:
displacy.render(doc,style="ent")

In [23]:
date = []
for ent in doc.ents:
  if ent.label_ == 'DATE':
    date.append(ent)
date


[24 April 1973, 5 November 1988, 7 July 1981, 19 December 1974]
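
The `DATE` entities above are spaCy spans, so they are still text. If real date objects are needed, one option is the standard library's `datetime.strptime`. The sketch below assumes every entity follows the day, month name, year pattern seen in this output; the strings are copied from it (in the notebook they would come from `ent.text`), and `date_strings` and `parsed_dates` are hypothetical names.

```python
from datetime import datetime

# Entity texts copied from the output above.
date_strings = ["24 April 1973", "5 November 1988", "7 July 1981", "19 December 1974"]

# %d = day of month, %B = full month name, %Y = four-digit year.
parsed_dates = [datetime.strptime(s, "%d %B %Y").date() for s in date_strings]
print([d.isoformat() for d in parsed_dates])
```

`strptime` raises `ValueError` on any string that does not fit the format, so entities such as "last week" would need separate handling.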

Bag Of Words (BOW)

In [24]:
import numpy as np
import pandas as pd

In [25]:
data = pd.read_csv("spam.csv")
data.head()

Unnamed: 0,Category,Message
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro..."


In [27]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5572 entries, 0 to 5571
Data columns (total 2 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   Category  5572 non-null   object
 1   Message   5572 non-null   object
dtypes: object(2)
memory usage: 87.2+ KB


In [28]:
data.groupby("Category").describe()

Category,Message count,Message unique,Message top,Message freq
ham,4825,4516,"Sorry, I'll call later",30
spam,747,641,Please call our customer service representativ...,4


In [29]:
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
# Encode the string labels in 'Category' as integers in a new lowercase 'category' column
data['category'] = le.fit_transform(data['Category'])
data.head()

Unnamed: 0,Category,Message,category
0,ham,"Go until jurong point, crazy.. Available only ...",0
1,ham,Ok lar... Joking wif u oni...,0
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...,1
3,ham,U dun say so early hor... U c already then say...,0
4,ham,"Nah I don't think he goes to usf, he lives aro...",0
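
The mapping shown in the new `category` column (ham to 0, spam to 1) is what sorting the distinct labels and numbering them in order produces. A plain-Python sketch of that idea follows; it is an illustration rather than scikit-learn's code, and `labels`, `class_to_id` and `encoded` are hypothetical names.

```python
# First five Category values from the table above.
labels = ["ham", "ham", "spam", "ham", "ham"]

# Sort the distinct labels and number them in that order.
classes = sorted(set(labels))
class_to_id = {name: i for i, name in enumerate(classes)}

# Replace each label with its integer id.
encoded = [class_to_id[name] for name in labels]
print(class_to_id, encoded)
```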


In [30]:
X = data.iloc[:,-2]  # 'Message' column (raw text)
y = data.iloc[:,-1]  # 'category' column (0 = ham, 1 = spam)


In [31]:
from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.2,random_state=42)
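
With `test_size=0.2` and the 5572 rows reported by `data.info()`, the split sizes can be checked by hand. The arithmetic below assumes the test count is rounded up, which is consistent with the 4457 training rows and 1115 test rows that appear elsewhere in this notebook; `n_samples`, `n_test` and `n_train` are hypothetical names.

```python
import math

n_samples = 5572   # rows reported by data.info()
test_size = 0.2

# Assumption: the test share is rounded up and the rest goes to training.
n_test = math.ceil(test_size * n_samples)
n_train = n_samples - n_test
print(n_train, n_test)
```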

In [32]:
from sklearn.feature_extraction.text import CountVectorizer
v = CountVectorizer()
X_train_cv = v.fit_transform(X_train.values)
X_train_cv

<4457x7701 sparse matrix of type '<class 'numpy.int64'>'
	with 59275 stored elements in Compressed Sparse Row format>

In [33]:
X_train_cv.toarray()

array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]])
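
The dense array above is mostly zeros because each message uses only a handful of the 7701 vocabulary words. To see what the counts mean, here is a small plain-Python sketch of a bag-of-words matrix. It assumes lowercasing and a tokenizer that keeps words of two or more characters; the two messages are made up, and `docs`, `vocab` and `matrix` are hypothetical names.

```python
import re
from collections import Counter

docs = ["Free entry to win", "Ok, free to call later"]  # toy messages (made up)

TOKEN_RE = re.compile(r"\b\w\w+\b")  # words of two or more characters

def tokenize(text):
    """Lowercase the text and return its word tokens."""
    return TOKEN_RE.findall(text.lower())

# Vocabulary: every distinct token, in sorted order (one column per word).
vocab = sorted({tok for d in docs for tok in tokenize(d)})

# One row per document, holding the count of each vocabulary word.
matrix = []
for d in docs:
    counts = Counter(tokenize(d))
    matrix.append([counts[tok] for tok in vocab])

print(vocab)
print(matrix)
```

Word order is discarded: only how often each vocabulary word occurs in a message is kept.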

In [34]:
from sklearn.naive_bayes import MultinomialNB
model = MultinomialNB()
model.fit(X_train_cv,y_train)
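
To make the classifier less of a black box, the sketch below implements the multinomial Naive Bayes idea in plain Python: a class prior plus per-word likelihoods with add-one (Laplace) smoothing, compared in log space. It is a teaching sketch, not scikit-learn's implementation; the training messages are made up, and `train`, `log_posterior` and `predict` are hypothetical names.

```python
import math
from collections import Counter

# Toy tokenized training set (made up): (tokens, label), with 1 = spam, 0 = ham.
train = [
    (["free", "prize", "call"], 1),
    (["free", "entry", "win"], 1),
    (["call", "me", "later"], 0),
    (["see", "you", "later"], 0),
]
alpha = 1.0  # add-one smoothing

vocab = sorted({tok for toks, _ in train for tok in toks})
classes = sorted({label for _, label in train})

# Per-class word counts and per-class document counts.
word_counts = {c: Counter() for c in classes}
doc_counts = Counter()
for toks, label in train:
    word_counts[label].update(toks)
    doc_counts[label] += 1

def log_posterior(tokens, c):
    """Unnormalized log P(c | tokens) = log prior + sum of log word likelihoods."""
    total = sum(word_counts[c].values())
    score = math.log(doc_counts[c] / len(train))
    for tok in tokens:
        if tok in vocab:  # words never seen in training are skipped
            score += math.log((word_counts[c][tok] + alpha) / (total + alpha * len(vocab)))
    return score

def predict(tokens):
    """Return the class with the highest log posterior."""
    return max(classes, key=lambda c: log_posterior(tokens, c))

print(predict(["free", "win", "now"]), predict(["see", "you", "later"]))
```

Smoothing keeps a word that never appeared in one class from forcing that class's probability to zero.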

In [35]:
from sklearn.metrics import classification_report
y_pre = model.predict(v.transform(X_test))
report_str = classification_report(y_test, y_pre)

# Split the report into lines
report_lines = report_str.split('\n')
report_lines

['              precision    recall  f1-score   support',
 '',
 '           0       0.99      1.00      1.00       966',
 '           1       1.00      0.94      0.97       149',
 '',
 '    accuracy                           0.99      1115',
 '   macro avg       1.00      0.97      0.98      1115',
 'weighted avg       0.99      0.99      0.99      1115',
 '']
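
In the report above, class 1 (spam) has precision 1.00 and recall 0.94: after rounding, essentially nothing was wrongly flagged as spam, but about 6% of the real spam was missed. The sketch below shows how these three numbers are derived from true positives, false positives and false negatives. The label lists are made up for illustration and `precision_recall_f1` is a hypothetical helper.

```python
def precision_recall_f1(y_true, y_pred, positive=1):
    """Compute precision, recall and F1 for one class from two label lists."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Toy labels (made up): one spam message is missed, nothing is wrongly flagged.
y_true_toy = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred_toy = [1, 1, 1, 0, 0, 0, 0, 0]
p, r, f = precision_recall_f1(y_true_toy, y_pred_toy)
print(p, r, round(f, 3))
```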

In [36]:
emails = [
    'Hey mohan, can we get together to watch footbal game tomorrow?',
    'Upto 20% discount on parking, exclusive offer just for you. Dont miss this reward!',
    'fldc.vk d.kdsjf k vfdsk.fdksd.lds l.ksd. zxkl lsd/ /xcl/ c;'
]
emailcv = v.transform(emails)
model.predict( emailcv )

array([0, 1, 0])

In [37]:
from sklearn.pipeline import Pipeline
clf = Pipeline([
    ('vectorizer', CountVectorizer()),
    ('nb', MultinomialNB())
])

In [38]:
clf.fit(X_train, y_train)

In [39]:
print(classification_report(y_test, clf.predict(X_test)))

              precision    recall  f1-score   support

           0       0.99      1.00      1.00       966
           1       1.00      0.94      0.97       149

    accuracy                           0.99      1115
   macro avg       1.00      0.97      0.98      1115
weighted avg       0.99      0.99      0.99      1115

TF-IDF


In [40]:
data = pd.read_csv("Ecommerce_data.csv")
data.head()

Unnamed: 0,Text,label
0,Urban Ladder Eisner Low Back Study-Office Comp...,Household
1,"Contrast living Wooden Decorative Box,Painted ...",Household
2,IO Crest SY-PCI40010 PCI RAID Host Controller ...,Electronics
3,ISAKAA Baby Socks from Just Born to 8 Years- P...,Clothing & Accessories
4,Indira Designer Women's Art Mysore Silk Saree ...,Clothing & Accessories


In [41]:
data['Label'] = le.fit_transform(data['label'])
data.head()

Unnamed: 0,Text,label,Label
0,Urban Ladder Eisner Low Back Study-Office Comp...,Household,3
1,"Contrast living Wooden Decorative Box,Painted ...",Household,3
2,IO Crest SY-PCI40010 PCI RAID Host Controller ...,Electronics,2
3,ISAKAA Baby Socks from Just Born to 8 Years- P...,Clothing & Accessories,1
4,Indira Designer Women's Art Mysore Silk Saree ...,Clothing & Accessories,1


In [43]:
X = data.iloc[: , 0]
y = data.iloc[:, -1]

In [None]:
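from sklearn.model_selection import train_test_split
# Run this cell before vectorizing: otherwise X_train/X_test still hold the spam.csv split from In [31].
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.2,random_state=42)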

In [48]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import classification_report

vectorizer = TfidfVectorizer()
X_train_tfidf = vectorizer.fit_transform(X_train)
X_test_tfidf = vectorizer.transform(X_test)
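
TF-IDF reweights raw counts so that words appearing in many documents count for less. The plain-Python sketch below assumes the smoothed formula idf = ln((1 + n) / (1 + df)) + 1 followed by L2 normalization of each row, which matches `TfidfVectorizer`'s default settings as far as I know. The product texts are made up, and `docs`, `idf` and `tfidf_vector` are hypothetical names.

```python
import math
import re
from collections import Counter

docs = ["wooden box", "wooden chair", "baby socks"]  # toy product texts (made up)

def tokenize(text):
    """Lowercase the text and keep words of two or more characters."""
    return re.findall(r"\b\w\w+\b", text.lower())

tokenized = [tokenize(d) for d in docs]
vocab = sorted({t for toks in tokenized for t in toks})
n_docs = len(docs)

# Document frequency: number of documents that contain each word.
df = Counter(t for toks in tokenized for t in set(toks))

# Smoothed inverse document frequency (assumed formula, see lead-in).
idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocab}

def tfidf_vector(tokens):
    """Term count times idf for each vocabulary word, scaled to unit length."""
    tf = Counter(tokens)
    raw = [tf[t] * idf[t] for t in vocab]
    norm = math.sqrt(sum(x * x for x in raw))
    return [x / norm if norm else 0.0 for x in raw]

vectors = [tfidf_vector(toks) for toks in tokenized]
print(vocab)
print([round(x, 3) for x in vectors[0]])
```

In the first document, "box" gets a larger weight than "wooden" because "wooden" also occurs in the second document.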


In [50]:
Knn = KNeighborsClassifier(n_neighbors=5, metric='minkowski')
Knn.fit(X_train_tfidf, y_train)
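
`KNeighborsClassifier` stores the training vectors at fit time and, at prediction time, lets the nearest `n_neighbors` points vote. The Minkowski metric with its default power of 2 is ordinary Euclidean distance. A plain-Python sketch of that voting rule on made-up 2-D points follows; `train_points`, `train_labels` and `knn_predict` are hypothetical names, and real TF-IDF vectors would have thousands of dimensions.

```python
import math
from collections import Counter

# Toy 2-D points with labels (made up): two loose clusters.
train_points = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (1.0, 1.0), (0.9, 1.1), (1.1, 0.9)]
train_labels = [0, 0, 0, 1, 1, 1]

def knn_predict(query, k=5):
    """Return the majority label among the k training points nearest to query."""
    dists = sorted(
        (math.dist(query, point), label)
        for point, label in zip(train_points, train_labels)
    )
    votes = Counter(label for _, label in dists[:k])
    return votes.most_common(1)[0][0]

print(knn_predict((0.1, 0.1)), knn_predict((1.0, 1.0)))
```

Because the vote is a simple majority, a class that is rare in the training data can be outvoted even near its own examples.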

In [51]:
y_pre = Knn.predict(X_test_tfidf)
report_str = classification_report(y_test, y_pre)

# Split the report into lines
report_lines = report_str.split('\n')
report_lines

['              precision    recall  f1-score   support',
 '',
 '           0       0.91      1.00      0.96       966',
 '           1       1.00      0.40      0.57       149',
 '',
 '    accuracy                           0.92      1115',
 '   macro avg       0.96      0.70      0.76      1115',
 'weighted avg       0.93      0.92      0.90      1115',
 '']