# Contents
 * Installation and importing
 * Reading a document as a spaCy nlp object
 * Sentence tokens
 * Part-of-speech tagging
 * Visualization with displaCy
 * Lemmatization
 * Named entity recognition

###  Installation and importing 

In [5]:
import spacy

In [None]:
## Installation 

In [13]:
# pip install spacy
# python -m spacy download en_core_web_sm

In [10]:
nlp = spacy.load('en_core_web_sm')

In [15]:
doc = nlp(u"Hello world this is nlp spacy") 
doc.text

'Hello world this is nlp spacy'

### Reading a document or text 

In [16]:
docs = nlp('spacy is a cool tool') 

In [19]:
docs2 = nlp(u'spacy is an amazing tool like nltk') 

In [28]:
with open('sample documents\\machine learning.txt') as f:
    file = f.read()  # the context manager closes the file handle after reading

In [30]:
nlp_file = nlp(file)
nlp_file

Machine learning is an important component of the growing field of data science. Through the use of statistical methods, algorithms are trained to make classifications or predictions, uncovering key insights within data mining projects. These insights subsequently drive decision making within applications and businesses, ideally impacting key growth metrics. As big data continues to expand and grow, the market demand for data scientists will increase, requiring them to assist in the identification of the most relevant business questions and subsequently the data to answer them.
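The path used above has a Windows-style backslash separator. Below is a minimal sketch of a more portable way to read the same kind of file with `pathlib`. The UTF-8 encoding is an assumption (the notebook does not state it), `read_text_file` is a hypothetical helper, and a temporary stand-in file is created so the snippet runs anywhere.

```python
from pathlib import Path
import tempfile


def read_text_file(path: Path) -> str:
    """Read a whole text file; UTF-8 is assumed, not confirmed by the notebook."""
    return path.read_text(encoding="utf-8")


with tempfile.TemporaryDirectory() as tmp:
    # Build the path from parts so the separator matches the current OS.
    sample = Path(tmp) / "sample documents" / "machine learning.txt"
    sample.parent.mkdir(parents=True)
    sample.write_text(
        "Machine learning is an important component of data science.",
        encoding="utf-8",
    )
    text = read_text_file(sample)

print(text)
```

The returned string can then be passed to `nlp(...)` exactly as `file` is above.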

### Sentence Tokens 

In [35]:
for index,sentence in enumerate(nlp_file.sents):
    print(f'{index}: {sentence}')

0: Machine learning is an important component of the growing field of data science.
1: Through the use of statistical methods, algorithms are trained to make classifications or predictions, uncovering key insights within data mining projects.
2: These insights subsequently drive decision making within applications and businesses, ideally impacting key growth metrics.
3: As big data continues to expand and grow, the market demand for data scientists will increase, requiring them to assist in the identification of the most relevant business questions and subsequently the data to answer them.


In [38]:
print([token for token in nlp_file])

[Machine, learning, is, an, important, component, of, the, growing, field, of, data, science, ., Through, the, use, of, statistical, methods, ,, algorithms, are, trained, to, make, classifications, or, predictions, ,, uncovering, key, insights, within, data, mining, projects, ., These, insights, subsequently, drive, decision, making, within, applications, and, businesses, ,, ideally, impacting, key, growth, metrics, ., As, big, data, continues, to, expand, and, grow, ,, the, market, demand, for, data, scientists, will, increase, ,, requiring, them, to, assist, in, the, identification, of, the, most, relevant, business, questions, and, subsequently, the, data, to, answer, them, .]


In [41]:
for word in nlp_file:
    print(word.text, word.shape_)

Machine Xxxxx
learning xxxx
is xx
an xx
important xxxx
component xxxx
of xx
the xxx
growing xxxx
field xxxx
of xx
data xxxx
science xxxx
. .
Through Xxxxx
the xxx
use xxx
of xx
statistical xxxx
methods xxxx
, ,
algorithms xxxx
are xxx
trained xxxx
to xx
make xxxx
classifications xxxx
or xx
predictions xxxx
, ,
uncovering xxxx
key xxx
insights xxxx
within xxxx
data xxxx
mining xxxx
projects xxxx
. .
These Xxxxx
insights xxxx
subsequently xxxx
drive xxxx
decision xxxx
making xxxx
within xxxx
applications xxxx
and xxx
businesses xxxx
, ,
ideally xxxx
impacting xxxx
key xxx
growth xxxx
metrics xxxx
. .
As Xx
big xxx
data xxxx
continues xxxx
to xx
expand xxxx
and xxx
grow xxxx
, ,
the xxx
market xxxx
demand xxxx
for xxx
data xxxx
scientists xxxx
will xxxx
increase xxxx
, ,
requiring xxxx
them xxxx
to xx
assist xxxx
in xx
the xxx
identification xxxx
of xx
the xxx
most xxxx
relevant xxxx
business xxxx
questions xxxx
and xxx
subsequently xxxx
the xxx
data xxxx
to xx
answer xxxx
them xxxx
. .


In [42]:
ex_text = nlp('Hello HELLO heLLo') 
for word in ex_text:
    print(word.text, word.shape_, word.is_alpha, word.is_stop)

Hello Xxxxx True False
HELLO XXXX True False
heLLo xxXXx True False
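
The `shape_` values above follow a simple pattern: uppercase letters become `X`, lowercase letters become `x`, other characters are kept, and long runs of the same symbol are cut off after four. The pure-Python sketch below approximates that behaviour so the output is easier to read. `word_shape` is a hypothetical helper written for illustration, not spaCy's own code, and mapping digits to `d` is how I understand spaCy to behave rather than something shown in this notebook.

```python
def word_shape(text: str, max_run: int = 4) -> str:
    """Approximate a spaCy-style word shape for a single token."""
    shape = []
    last = ""
    run = 0
    for ch in text:
        if ch.isalpha():
            mapped = "X" if ch.isupper() else "x"
        elif ch.isdigit():
            mapped = "d"  # assumption: digits map to 'd'
        else:
            mapped = ch  # punctuation and symbols are kept as-is
        # Count how many times the same shape symbol repeats in a row.
        run = run + 1 if mapped == last else 1
        last = mapped
        if run <= max_run:
            shape.append(mapped)
    return "".join(shape)


for word in ["Machine", "learning", "HELLO", "heLLo", "."]:
    print(word, word_shape(word))
```

This explains why `important` and `learning` both print as `xxxx` even though they have different lengths.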


### Part of speech tagging 

In [51]:
text1 = nlp('he drinks a drink')
for word in text1:
    print(word.text, word.pos_, word.tag_, word.dep_)

he PRON PRP nsubj
drinks VERB VBZ ROOT
a DET DT det
drink NOUN NN dobj


In [47]:
spacy.explain('PRP')

'pronoun, personal'

In [50]:
for token in nlp_file:
    print((token.text,spacy.explain(token.tag_)))

('Machine', 'noun, singular or mass')
('learning', 'noun, singular or mass')
('is', 'verb, 3rd person singular present')
('an', 'determiner')
('important', 'adjective (English), other noun-modifier (Chinese)')
('component', 'noun, singular or mass')
('of', 'conjunction, subordinating or preposition')
('the', 'determiner')
('growing', 'verb, gerund or present participle')
('field', 'noun, singular or mass')
('of', 'conjunction, subordinating or preposition')
('data', 'noun, plural')
('science', 'noun, singular or mass')
('.', 'punctuation mark, sentence closer')
('Through', 'conjunction, subordinating or preposition')
('the', 'determiner')
('use', 'noun, singular or mass')
('of', 'conjunction, subordinating or preposition')
('statistical', 'adjective (English), other noun-modifier (Chinese)')
('methods', 'noun, plural')
(',', 'punctuation mark, comma')
('algorithms', 'noun, plural')
('are', 'verb, non-3rd person singular present')
('trained', 'verb, past participle')
('to', 'infinitival

### Visualizing dependency using displacy 

In [56]:
from spacy import displacy 
print(nlp_file[0:10])
displacy.render(nlp_file[0:10], style='dep', jupyter=True)

Machine learning is an important component of the growing field


### Lemmatizing 

In [60]:
document = nlp_file[0:55]
document

Machine learning is an important component of the growing field of data science. Through the use of statistical methods, algorithms are trained to make classifications or predictions, uncovering key insights within data mining projects. These insights subsequently drive decision making within applications and businesses, ideally impacting key growth metrics.

In [61]:
for word in document:
    print(word.text, word.lemma_,word.pos_) 

Machine machine NOUN
learning learning NOUN
is be AUX
an an DET
important important ADJ
component component NOUN
of of ADP
the the DET
growing grow VERB
field field NOUN
of of ADP
data datum NOUN
science science NOUN
. . PUNCT
Through through ADP
the the DET
use use NOUN
of of ADP
statistical statistical ADJ
methods method NOUN
, , PUNCT
algorithms algorithm NOUN
are be AUX
trained train VERB
to to PART
make make VERB
classifications classification NOUN
or or CCONJ
predictions prediction NOUN
, , PUNCT
uncovering uncover VERB
key key ADJ
insights insight NOUN
within within ADP
data datum NOUN
mining mining NOUN
projects project NOUN
. . PUNCT
These these DET
insights insight NOUN
subsequently subsequently ADV
drive drive VERB
decision decision NOUN
making making NOUN
within within ADP
applications application NOUN
and and CCONJ
businesses business NOUN
, , PUNCT
ideally ideally ADV
impacting impact VERB
key key ADJ
growth growth NOUN
metrics metric NOUN
. . PUNCT
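
One practical reason to lemmatize is that counts by lemma merge inflected forms of the same word. The sketch below illustrates this with a few `(text, lemma)` pairs copied from the output above. It is plain Python with no spaCy call, and the variable names are illustrative only.

```python
from collections import Counter

# (token text, lemma) pairs taken from the printed output above.
pairs = [
    ("is", "be"),
    ("are", "be"),
    ("growing", "grow"),
    ("data", "datum"),
    ("data", "datum"),
    ("insights", "insight"),
    ("insights", "insight"),
    ("methods", "method"),
]

surface_counts = Counter(text for text, _ in pairs)
lemma_counts = Counter(lemma for _, lemma in pairs)

# 'is' and 'are' are separate surface forms but share the lemma 'be'.
print(surface_counts["is"], surface_counts["are"], lemma_counts["be"])
```

With a real `Doc`, the same idea would be `Counter(token.lemma_ for token in document)`.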


### Named entity recognition or detection 
Typical uses:
 * classifying content by extracting relevant tags
 * improving search algorithms
 * powering content recommendations
 * extracting information

In [78]:
doc = nlp("Apple is looking at buying U.K. startup for $1 billion")

for ent in doc.ents:
    print(ent.text, ent.start_char, ent.end_char, ent.label_)

Apple 0 5 ORG
U.K. 27 31 GPE
$1 billion 44 54 MONEY
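
The `start_char` and `end_char` values are character offsets into the original string, so each entity can be recovered by slicing. The sketch below reuses the offsets printed above. `mark_entities` is a hypothetical helper for illustration and is not part of spaCy.

```python
text = "Apple is looking at buying U.K. startup for $1 billion"

# (start_char, end_char, label) triples copied from the output above.
entities = [(0, 5, "ORG"), (27, 31, "GPE"), (44, 54, "MONEY")]


def mark_entities(text: str, entities: list[tuple[int, int, str]]) -> str:
    """Wrap each entity span in brackets together with its label."""
    pieces = []
    cursor = 0
    for start, end, label in sorted(entities):
        pieces.append(text[cursor:start])              # text before the entity
        pieces.append(f"[{text[start:end]}|{label}]")  # the entity itself
        cursor = end
    pieces.append(text[cursor:])                       # trailing text, if any
    return "".join(pieces)


marked = mark_entities(text, entities)
print(marked)
```

This is a plain-text analogue of what `displacy.render(doc, style='ent')` highlights visually.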


In [79]:
displacy.render(doc, style='ent')

In [80]:
spacy.explain('ORG')

'Companies, agencies, institutions, etc.'