In [0]:
!pip list

In [0]:
!python -m spacy download en_core_web_sm

In [0]:
# Import spaCy and load the small English model, en_core_web_sm.
# The resulting `nlp` object bundles the processing pipeline for English,
# and we call it on text to run every task below.
import spacy
nlp = spacy.load("en_core_web_sm")

In [22]:
# the pipeline components, in the order spaCy applies them
nlp.pipeline

[('tagger', <spacy.pipeline.pipes.Tagger at 0x7efcda3ef0f0>),
 ('parser', <spacy.pipeline.pipes.DependencyParser at 0x7efc86020dc8>),
 ('ner', <spacy.pipeline.pipes.EntityRecognizer at 0x7efc86020e28>)]

In [0]:
# The sentence we will work on for this first test.
# spaCy expects a unicode string (the default str type in Python 3), not bytes.
doc = nlp(u'I have eaten more foods than you ever ate')

In [4]:
# A Doc object holds tokens, the individual pieces (words, punctuation) of the sentence.
# It is iterable, so we can loop over it and print each token's part of speech.
# token.pos is an integer ID for the POS; token.pos_ gives its readable name.
for token in doc:
  print(token, token.pos_)

I PRON
have VERB
eaten VERB
more ADJ
foods NOUN
than ADP
you PRON
ever ADV
ate VERB


In [5]:
# Breaking text into tokens is called tokenization.
# Each token carries its own part of speech and other information.
# This time we also print token.dep_, the token's syntactic dependency label.
# We will learn a lot more about it later on!
sentence = u"Say hello to my little 3kg friend!"
doc2 = nlp(sentence)
print([(token, token.pos_, token.dep_) for token in doc2])

[(Say, 'VERB', 'ROOT'), (hello, 'INTJ', 'dobj'), (to, 'ADP', 'prep'), (my, 'DET', 'poss'), (little, 'ADJ', 'amod'), (3, 'NUM', 'nummod'), (kg, 'NOUN', 'compound'), (friend, 'NOUN', 'pobj'), (!, 'PUNCT', 'punct')]


In [0]:
# we can also visualize the doc object
from spacy import displacy
displacy.render(doc2, style="dep", jupyter=True)

In [6]:
# an extra space is also treated as a token by spaCy
string = u"Hey  brother!"
doc3 = nlp(string)
for token in doc3:
  print(token, token.pos_)

Hey INTJ
  SPACE
brother NOUN
! PUNCT


In [14]:
# the Doc object also supports indexing (returns a Token) and slicing [start:stop] (returns a Span)
docObject = doc3[3]
print(docObject.text,'->' ,docObject.pos_)

! -> PUNCT


### Tokens have other attributes besides `dep_` and `pos_`. Here is a list of important ones, each of which we will cover later.

|Attribute|Description|Example for the token `Tesla`|
|:-------|:------:|:------|
|`.text`|The original word text|`Tesla`|
|`.lemma_`|The base form of the word|`tesla`|
|`.pos_`|The simple part-of-speech tag|`PROPN`/`proper noun`|
|`.tag_`|The detailed part-of-speech tag|`NNP`/`noun, proper singular`|
|`.shape_`|The word shape: capitalization, punctuation, digits|`Xxxxx`|
|`.is_alpha`|Does the token consist of alphabetic characters?|`True`|
|`.is_stop`|Is the token part of a stop list, i.e. the most common words of the language?|`False`|
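
To make the `.shape_` row less mysterious, here is a minimal plain-Python sketch of how a word shape could be computed. It is an approximation written for this notebook, not spaCy's own code: the function name `word_shape_sketch` and the `max_run` parameter are hypothetical, and the rule that runs of the same shape character are cut off after four is an assumption based on how spaCy's shapes usually look.

```python
def word_shape_sketch(text: str, max_run: int = 4) -> str:
    """Approximate a word shape: X = uppercase, x = lowercase, d = digit.

    Hypothetical helper for illustration; not part of spaCy.
    """
    shape = []
    last = ""   # previous shape character
    run = 0     # length of the current run of identical shape characters
    for char in text:
        if char.isalpha():
            shape_char = "X" if char.isupper() else "x"
        elif char.isdigit():
            shape_char = "d"
        else:
            shape_char = char  # punctuation and symbols are kept unchanged
        if shape_char == last:
            run += 1
        else:
            last = shape_char
            run = 1
        # Assumption: long runs are truncated so similar words share one shape
        if run <= max_run:
            shape.append(shape_char)
    return "".join(shape)


for word in ["Tesla", "3kg", "U.S.", "friendship"]:
    print(word, "->", word_shape_sketch(word))
```

With these rules `Tesla` maps to `Xxxxx`, matching the table, while a long lowercase word such as `friendship` collapses to `xxxx`.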

In [17]:
# More on the Span object.
# We said that a Doc supports token-wise slicing; it does so in a special way:
# a slice of a document is returned as a separate Span object.

string2 = u"Attitude is a choice. Happiness is a choice. Optimism is a choice. Kindness is a choice. Giving is a choice. Respect is a choice. Whatever choice you make makes you. Choose wisely."
doc4 = nlp(string2)

happiness_span_from_doc4 = doc4[5:9]
print(happiness_span_from_doc4)
type(happiness_span_from_doc4)

Happiness is a choice


spacy.tokens.span.Span

In [18]:
# spaCy also supports sentence segmentation: the Doc object has a `sents` attribute
# that yields the sentences one by one
doc5 = nlp(u'First sentence is this. Last sentence is this.')
for sent in doc5.sents:
  print(sent)

First sentence is this.
Last sentence is this.


In [21]:
# each token knows whether it starts a sentence: token 5 ("Last") opens the second one
print(doc5[5].text, doc5[5].is_sent_start)

Last True


## TOKENIZATION IN DETAIL

### Tokenization is the process of breaking up the original text into component pieces known as "tokens". spaCy does this by first splitting on whitespace and then applying prefix, suffix, infix, and exception rules to each chunk. <br><br>

<img src="https://miro.medium.com/max/1090/1*NoNGMFUyb4Hxo622VKNwDQ.png"><br><br>

*prefix*: characters at the beginning <br>
*suffix*: characters at the end<br>
*infix*: characters in between<br>
*exception*: special-case rule to split a string into several tokens **or** prevent a token from being split when punctuation rules are applied
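
The four rule types above can be sketched as a toy tokenizer in plain Python. This is only an illustration of the idea, not spaCy's implementation: the names `toy_tokenize` and `split_chunk`, the tiny regular expressions, and the `EXCEPTIONS` table are all made up for this example, and spaCy's real rule sets are much larger and language-specific.

```python
import re

# Hypothetical, deliberately tiny rule sets (assumptions for illustration only)
PREFIX_RE = re.compile(r"^[(\"']")                            # opening punctuation
SUFFIX_RE = re.compile(r"(?:[)\"'.,!?]|(?<=\d)(?:kg|km))$")   # closing punctuation, units after digits
INFIX_RE = re.compile(r"([-~])")                              # separators inside a chunk
EXCEPTIONS = {"don't": ["do", "n't"], "U.S.": ["U.S."]}       # special cases


def split_chunk(chunk):
    """Split one whitespace-delimited chunk into tokens."""
    prefixes, suffixes = [], []
    # Peel off prefixes and suffixes until none are left or an exception matches
    while chunk and chunk not in EXCEPTIONS:
        match = PREFIX_RE.search(chunk)
        if match:
            prefixes.append(match.group())
            chunk = chunk[match.end():]
            continue
        match = SUFFIX_RE.search(chunk)
        if match:
            suffixes.insert(0, match.group())
            chunk = chunk[:match.start()]
            continue
        break
    if chunk in EXCEPTIONS:
        middle = list(EXCEPTIONS[chunk])  # exception decides the split (or prevents one)
    else:
        middle = [piece for piece in INFIX_RE.split(chunk) if piece]
    return prefixes + middle + suffixes


def toy_tokenize(text):
    """Split on whitespace first, then apply the rules to each chunk."""
    tokens = []
    for chunk in text.split():
        tokens.extend(split_chunk(chunk))
    return tokens


print(toy_tokenize("Say hello to my little 3kg friend!"))
print(toy_tokenize('"We don\'t ship to the U.S. (yet)!"'))
```

On the first sentence the sketch splits `3kg` into `3` and `kg` and separates the final `!`, the same token boundaries spaCy produced earlier in this notebook. The second sentence shows an exception working both ways: `don't` is split into `do` and `n't`, while `U.S.` is kept whole even though `.` is a suffix.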

