
In [3]:
!python -m spacy download en_core_web_sm

Collecting en-core-web-sm==3.7.1
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.7.1/en_core_web_sm-3.7.1-py3-none-any.whl (12.8 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 12.8/12.8 MB 47.8 MB/s eta 0:00:00
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')
⚠ Restart to reload dependencies
If you are in a Jupyter or Colab notebook, you may need to restart Python in
order to load all the package's dependencies. You can do this by selecting the
'Restart kernel' or 'Restart runtime' option.


In [1]:
import spacy

In [4]:
nlp = spacy.load("en_core_web_sm")

In [9]:
with open("students.txt") as f:
  text = f.read()

In [12]:
doc = nlp(text)
doc

Dayton high school, 8th grade students information

Name	birth day   	email
-----	------------	------
Virat   5 June, 1882    virat@kohli.com
Maria	12 April, 2001  maria@sharapova.com
Serena  24 June, 1998   serena@williams.com 
Joe      1 May, 1997    joe@root.com




Tokenization

In [13]:
email = []
for token in doc:
  if token.like_email:
    email.append(token.text)
email

['virat@kohli.com',
 'maria@sharapova.com',
 'serena@williams.com',
 'joe@root.com']

In [23]:
tokens = [token.text for token in doc[:10]]
tokens
nlp1 = spacy.blank("en")


In [24]:
from spacy.symbols import ORTH

# Split "Dayton" into two tokens; the ORTH pieces must concatenate to the original string.
nlp1.tokenizer.add_special_case("Dayton", [{ORTH: "Day"}, {ORTH: "ton"}])
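
A tokenizer special case only affects the pipeline it was added to, and the notebook never runs text through that pipeline. A minimal sketch of how one might check such a rule, assuming a fresh blank English pipeline (`demo_nlp` and `demo_tokens` are names introduced here for illustration) and the word "Dayton" from students.txt:

```python
import spacy
from spacy.symbols import ORTH

# Hypothetical pipeline for this sketch; a blank pipeline has only a tokenizer.
demo_nlp = spacy.blank("en")

# The ORTH pieces must join back into the exact original string ("Day" + "ton").
demo_nlp.tokenizer.add_special_case("Dayton", [{ORTH: "Day"}, {ORTH: "ton"}])

# Tokenize a short phrase and collect the token texts for inspection.
demo_tokens = [token.text for token in demo_nlp("Dayton high school")]
print(demo_tokens)
```

The match is case-sensitive and applies to the whole whitespace-delimited string, so "dayton" or "Daytona" would not be split by this rule.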



In [27]:
for sent in doc.sents:
  print(sent)

Dayton high school, 8th grade students information

Name	birth day   	email
-----	------------	------
Virat   5 June, 1882    virat@kohli.com

Maria	12 April, 2001  maria@sharapova.com
Serena  24 June, 1998   serena@williams.com 

Joe      1 May, 1997    joe@root.com






In [30]:
sentence = list(doc.sents)[1]
sentence

Maria	12 April, 2001  maria@sharapova.com
Serena  24 June, 1998   serena@williams.com 

In [32]:
text1='''
Look for data to help you address the question. Governments are good
sources because data from public research is often freely available. Good
places to start include http://www.data.gov/, and http://www.science.
gov/, and in the United Kingdom, http://data.gov.uk/.
Two of my favorite data sets are the General Social Survey at http://www3.norc.org/gss+website/,
and the European Social Survey at http://www.europeansocialsurvey.org/.
'''

In [33]:
doc1 = nlp(text1)
doc1


Look for data to help you address the question. Governments are good
sources because data from public research is often freely available. Good
places to start include http://www.data.gov/, and http://www.science.
gov/, and in the United Kingdom, http://data.gov.uk/.
Two of my favorite data sets are the General Social Survey at http://www3.norc.org/gss+website/, 
and the European Social Survey at http://www.europeansocialsurvey.org/.

In [34]:
url = []
for token in doc1:
  if token.like_url:
    url.append(token.text)
url


['http://www.data.gov/',
 'http://www.science',
 'http://data.gov.uk/.',
 'http://www3.norc.org/gss+website/',
 'http://www.europeansocialsurvey.org/.']
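
Two of the matches above keep a trailing full stop from the sentence (`http://data.gov.uk/.`), and `http://www.science` is cut short because that URL is split across a line break in `text1`. A minimal sketch of a cleanup pass, assuming that stripping trailing `.`, `,`, `;` and `:` is acceptable for these URLs; `strip_trailing_punct` is a helper name introduced here, and it does not repair the truncated URL:

```python
def strip_trailing_punct(urls, chars=".,;:"):
    """Remove sentence punctuation left at the end of each URL string."""
    return [u.rstrip(chars) for u in urls]

# The list printed by the cell above, copied verbatim.
raw_urls = [
    "http://www.data.gov/",
    "http://www.science",
    "http://data.gov.uk/.",
    "http://www3.norc.org/gss+website/",
    "http://www.europeansocialsurvey.org/.",
]

clean_urls = strip_trailing_punct(raw_urls)
print(clean_urls)
```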

Extract every transaction from the text below, with its amount and currency

In [38]:
transactions = "Tony gave two $ to Peter, Bruce gave 500 € to Steve"
doc3 = nlp(transactions)


In [39]:
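for token in doc3[:-1]:  # skip the last token so token.i + 1 is always a valid index
  # Pair each number-like token with the currency symbol that directly follows it.
  next_token = doc3[token.i + 1]
  if token.like_num and next_token.is_currency:
    print(token.text, next_token)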

two $
500 €


Lemmatization

In [44]:
for token in doc1[:10]:
  print(token.text, "-->", token.lemma_)



 --> 

Look --> look
for --> for
data --> datum
to --> to
help --> help
you --> you
address --> address
the --> the
question --> question


In [45]:
nlp.pipe_names

['tok2vec', 'tagger', 'parser', 'attribute_ruler', 'lemmatizer', 'ner']

Named Entity Recognition


In [46]:
text2 = """Kiran want to know the famous foods in each state of India. So, he opened Google and search for this question. Google showed that
in Delhi it is Chaat, in Gujarat it is Dal Dhokli, in Tamilnadu it is Pongal, in Andhrapradesh it is Biryani, in Assam it is Papaya Khar,
in Bihar it is Litti Chowkha and so on for all other states"""

doc4 = nlp(text2)

In [47]:
from spacy import displacy
displacy.render(doc4,style="ent")

In [52]:
gpe = []
for ent in doc4.ents:
  if ent.label_ == 'GPE':
    gpe.append(ent)
gpe

[India, Delhi, Gujarat, Tamilnadu, Pongal, Andhrapradesh, Assam, Bihar]
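
Note that Pongal, which the input text names as a food, appears in the GPE list, so the model's labels are worth inspecting before use. Filtering one label at a time also becomes repetitive when several entity types matter. A sketch of grouping by label in a single pass, written over plain `(text, label)` pairs so it runs without a model; `group_by_label` is a hypothetical helper, and the sample pairs are three GPE entries from the output above plus one assumed `PERSON` entry that is not taken from the model's output. In the notebook the pairs would come from `[(ent.text, ent.label_) for ent in doc4.ents]`.

```python
from collections import defaultdict

def group_by_label(pairs):
    """Map each entity label to the list of entity texts carrying it."""
    grouped = defaultdict(list)
    for text, label in pairs:
        grouped[label].append(text)
    return dict(grouped)

# Sample input: three GPE entries from the output above, plus one
# hypothetical PERSON entry that is NOT taken from the model's output.
sample_pairs = [
    ("India", "GPE"),
    ("Delhi", "GPE"),
    ("Kiran", "PERSON"),  # assumed for illustration only
    ("Gujarat", "GPE"),
]

grouped = group_by_label(sample_pairs)
print(grouped)
```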

In [53]:
text = """Sachin Tendulkar was born on 24 April 1973, Virat Kholi was born on 5 November 1988, Dhoni was born on 7 July 1981
and finally Ricky ponting was born on 19 December 1974."""

doc = nlp(text)

In [54]:
displacy.render(doc,style="ent")

In [59]:
date = []
for ent in doc.ents:
  if ent.label_ == 'DATE':
    date.append(ent)
date


[24 April 1973, 5 November 1988, 7 July 1981, 19 December 1974]
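
The entities collected above are spaCy `Span` objects, not date values. A minimal sketch of converting their text to `datetime.date`, assuming every string follows the day, full month name, year pattern seen in the output and that the process locale uses English month names; `parse_dmy` is a name introduced here:

```python
from datetime import datetime

def parse_dmy(text):
    """Parse a string such as '24 April 1973' into a datetime.date."""
    return datetime.strptime(text, "%d %B %Y").date()

# The strings printed above; in the notebook: [ent.text for ent in date].
date_strings = ["24 April 1973", "5 November 1988", "7 July 1981", "19 December 1974"]

birthdays = [parse_dmy(s) for s in date_strings]
print(min(birthdays))  # earliest date in the list
```

A string in any other layout would raise `ValueError`, so real DATE entities (for example "last week") would need separate handling.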