# About spaCy

spaCy is **a free, open-source library** for **advanced Natural Language Processing (NLP)** in Python.

If you’re working with a lot of text, you’ll eventually want to know more about it. For example:
- What’s it about?
- What do the words mean in context?
- Who is doing what to whom?
- What companies and products are mentioned?
- Which texts are similar to each other?

spaCy is designed specifically for production use and helps you build applications that process and “understand” large volumes of text. It can be used to build information extraction or natural language understanding systems, or to pre-process text for deep learning.

# What spaCy isn’t

**spaCy is not a platform** or **“an API”**. Unlike a platform, spaCy does not provide a software as a service, or a web application. It’s an open-source library designed to help you build NLP applications, not a consumable service.

**spaCy is not an out-of-the-box chat bot engine**. While spaCy can be used to power conversational applications, it’s not designed specifically for chat bots, and only provides the underlying text processing capabilities.

**spaCy is not research software**. It’s built on the latest research, but it’s designed to get things done. This leads to fairly different design decisions than NLTK or CoreNLP, which were created as platforms for teaching and research. The main difference is that spaCy is integrated and opinionated. spaCy tries to avoid asking the user to choose between multiple algorithms that deliver equivalent functionality. Keeping the menu small lets spaCy deliver generally better performance and developer experience.

**spaCy is not a company**. It’s an open-source library. Our company publishing spaCy and other software is called Explosion AI.

[Reference: spacy.io](https://spacy.io/usage/spacy-101)

# spaCy Library Architecture
The central data structures in spaCy are the **Doc** and the Vocab. The **Doc** object owns the sequence of tokens and all their annotations. The Vocab object owns a set of look-up tables that make common information available across documents. By centralizing strings, word vectors and lexical attributes, we avoid storing multiple copies of this data. This saves memory, and ensures there’s a single source of truth.

Text annotations are also designed to allow a single source of truth: the **Doc** object owns the data, and Span and Token are views that point into it. The Doc object is constructed by the Tokenizer, and then modified in place by the components of the pipeline. The Language object coordinates these components. It takes raw text and sends it through the pipeline, returning an annotated document. It also orchestrates training and serialization.

![architecture](https://i.ibb.co/7pPtMcf/sp.png)
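
To make the ownership idea concrete, here is a toy sketch in plain Python. The classes `ToyVocab`, `ToyDoc` and `ToyToken` are hypothetical names invented for this illustration and are not spaCy's implementation; they only mimic the design described above, where a shared vocabulary stores each string once, the document owns the annotations, and a token is a lightweight view into the document.

```python
# Toy illustration only: these classes are hypothetical, not spaCy's real code.

class ToyVocab:
    """Shared look-up table: each distinct string is stored once, keyed by an integer ID."""

    def __init__(self):
        self._ids = {}       # string -> ID
        self._strings = []   # ID -> string

    def add(self, text):
        if text not in self._ids:
            self._ids[text] = len(self._strings)
            self._strings.append(text)
        return self._ids[text]

    def __getitem__(self, string_id):
        return self._strings[string_id]

    def __len__(self):
        return len(self._strings)


class ToyDoc:
    """Owns the token IDs and every annotation attached to them."""

    def __init__(self, vocab, words):
        self.vocab = vocab
        self.word_ids = [vocab.add(word) for word in words]
        self.ent_types = [""] * len(words)

    def __getitem__(self, i):
        return ToyToken(self, i)


class ToyToken:
    """A view: it stores only a reference to the doc and an index, never the data."""

    def __init__(self, doc, i):
        self.doc = doc
        self.i = i

    @property
    def text(self):
        return self.doc.vocab[self.doc.word_ids[self.i]]

    @property
    def ent_type(self):
        return self.doc.ent_types[self.i]


vocab = ToyVocab()
doc_a = ToyDoc(vocab, ["spaCy", "is", "fast"])
doc_b = ToyDoc(vocab, ["spaCy", "is", "free"])

token = doc_a[0]
doc_a.ent_types[0] = "ORG"          # hand-assigned label, standing in for a pipeline component
print(token.text, token.ent_type)   # the existing view sees the in-place change
print(len(vocab))                   # distinct strings shared by both docs
```

Because the token holds no data of its own, the annotation written to `doc_a` after the view was created is still visible through it, and the two documents share four stored strings instead of six.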

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed.
# It is defined by the kaggle/python docker image: https://github.com/kaggle/docker-python

import os
from collections import Counter

import numpy as np  # linear algebra
import pandas as pd  # data processing, CSV file I/O (e.g. pd.read_csv)
import spacy
from spacy import displacy
import en_core_web_sm

# Load spaCy's small English pipeline.
nlp = en_core_web_sm.load()

# Input data files are available in the "../input/" directory.
# Running this cell (by clicking run or pressing Shift+Enter) lists all files under the input directory.
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# Any results you write to the current directory are saved as output.

In [None]:
df = pd.read_csv('/kaggle/input/jigsaw-unintended-bias-in-toxicity-classification/train.csv')

In [None]:
df.head()

In [None]:
df['comment_text'][8]

# Token
Below I explain token-level entity annotation, which uses the **BILUO** tagging scheme to describe entity boundaries. The code cells that follow print the closely related IOB tags through `Token.ent_iob_`.
![](https://i.ibb.co/sJ3rcpc/spacy.png)

In [None]:
import pprint

In [None]:
doc = nlp('The ranchers seem motivated by mostly by greed; no one should have the right to allow their animals destroy public land.')



In [None]:
pprint.pprint([(X, X.ent_iob_, X.ent_type_) for X in doc])

In [None]:
doc1 = nlp("My Name is Nikki Sharma. I love Natural Language Processing.")
for ent in doc1.ents:
    print(ent)


## Explanation 
**"B"** means the token begins an entity, **"I"** means it is inside an entity, **"O"** means it is outside an entity, and **""** means no entity tag is set.
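
As a rough sketch of how these tags encode entity boundaries, the two helpers below work on plain Python data rather than on a spaCy `Doc`. The function names `iob_to_spans` and `iob_to_biluo` are hypothetical, and the tagged tokens are written by hand for illustration; they are not output from the model.

```python
def iob_to_spans(tagged_tokens):
    """Group (text, iob, ent_type) triples into (entity_text, ent_type) pairs."""
    spans = []
    words, label = [], None
    for text, iob, ent_type in tagged_tokens:
        # A stray "I" with no open entity is treated as the start of one.
        starts_entity = iob == "B" or (iob == "I" and not words)
        if starts_entity or iob not in ("B", "I"):
            if words:  # close the entity that was open, if any
                spans.append((" ".join(words), label))
            words, label = ([text], ent_type) if starts_entity else ([], None)
        else:
            words.append(text)  # "I": extend the open entity
    if words:
        spans.append((" ".join(words), label))
    return spans


def iob_to_biluo(iob_tags):
    """Convert IOB tags to BILUO: L marks the last token, U a single-token entity."""
    biluo = []
    for i, tag in enumerate(iob_tags):
        next_tag = iob_tags[i + 1] if i + 1 < len(iob_tags) else "O"
        continues = next_tag == "I"
        if tag == "B":
            biluo.append("B" if continues else "U")
        elif tag == "I":
            biluo.append("I" if continues else "L")
        else:  # "O" or "" (no tag set)
            biluo.append("O")
    return biluo


# Hand-written tags in the shape produced by (X, X.ent_iob_, X.ent_type_).
tagged = [
    ("My", "O", ""), ("Name", "O", ""), ("is", "O", ""),
    ("Nikki", "B", "PERSON"), ("Sharma", "I", "PERSON"), (".", "O", ""),
]

print(iob_to_spans(tagged))
print(iob_to_biluo([iob for _, iob, _ in tagged]))
```

Joining the words with a single space is a simplification; a real `Span` keeps the original whitespace.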

# Extracting named entities from an article

Now let’s do some serious stuff with spaCy and extract named entities from toxic comments.

In [None]:
text = df['comment_text'][19]
text

In [None]:
article = nlp(text)
len(article.ents)

There are 3 entities in the article and they are represented as 3 unique labels:

In [None]:
labels = [x.label_ for x in article.ents]
Counter(labels)

The following are the three entities and their labels.

In [None]:
doc = nlp(text)
pprint.pprint([(X.text, X.label_) for X in doc.ents])

Racists is NORP (nationalities or religious or political groups), 150 is a number and hence CARDINAL, and here is the funny part: Math is labeled as PERSON :(
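
A small sketch of how such `(text, label)` pairs can be grouped and annotated with a description. The pairs are transcribed by hand from the paragraph above rather than read from the model, and `LABEL_DESCRIPTIONS` is a hypothetical hand-written glossary; inside the notebook, `spacy.explain(label)` should return similar descriptions.

```python
from collections import defaultdict

# Hand-written glossary for the labels mentioned in this notebook (an assumption,
# kept here so the example runs without loading a model).
LABEL_DESCRIPTIONS = {
    "NORP": "Nationalities or religious or political groups",
    "CARDINAL": "Numerals that do not fall under another type",
    "PERSON": "People, including fictional",
    "ORG": "Companies, agencies, institutions, etc.",
}

# Pairs in the shape of [(X.text, X.label_) for X in doc.ents], typed by hand.
entities = [("Racists", "NORP"), ("150", "CARDINAL"), ("Math", "PERSON")]

# Collect entity texts under their label.
by_label = defaultdict(list)
for text, label in entities:
    by_label[label].append(text)

for label, texts in sorted(by_label.items()):
    description = LABEL_DESCRIPTIONS.get(label, "unknown label")
    print(f"{label:<9} {description}: {texts}")
```

Seeing the description next to each entity makes mislabels such as Math tagged as PERSON easier to spot.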

In [None]:
sentence = df['comment_text'][119]
sentence

In [None]:
displacy.render(nlp(str(sentence)), jupyter=True, style='ent')

Excellent classification! Minorities is labeled PERSON and Police is labeled ORGANIZATION (ORG). Natives literally means a group of local residents, which is somewhat similar to ORG, and the last one is numerical, which is CARDINAL. Great!

Let's explore some other sentences for more fun.

In [None]:
sentence_1 = df['comment_text'][350]
sentence_1

In [None]:
displacy.render(nlp(str(sentence_1)), jupyter=True, style='ent')

In [None]:
sentence_2 = df['comment_text'][970]
sentence_2

In [None]:
displacy.render(nlp(str(sentence_2)), jupyter=True, style='ent')

Using spaCy’s built-in **displaCy visualizer**, here’s what the earlier sentence (`sentence`) and its dependencies look like:

In [None]:
displacy.render(nlp(str(sentence)), style='dep', jupyter=True, options={'distance': 120})

The dependency visualizer, **dep**, shows **part-of-speech tags** and **syntactic dependencies**.

The argument **options** lets you specify a dictionary of settings to customize the layout.
For a list of all available options, see the [displacy API documentation](https://spacy.io/api/top-level#displacy_options).
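
As a hedged sketch, option dictionaries for the two visualizer styles might look like the following. The variable names `dep_options` and `ent_options` are my own, and the colour values are arbitrary choices; the keys are ones listed in the displaCy documentation, but check them against your installed version. The render calls are left as comments because they rely on the `nlp` and `sentence` objects defined in this notebook.

```python
# Layout settings for the dependency visualizer (style='dep').
dep_options = {
    "distance": 120,     # horizontal distance between words, in pixels
    "compact": True,     # square arrows that take up less space
    "bg": "#fafafa",     # background colour
    "color": "#000000",  # text colour
    "font": "Arial",     # font name
}

# Settings for the entity visualizer (style='ent').
ent_options = {
    "ents": ["PERSON", "ORG", "NORP", "CARDINAL"],      # only highlight these labels
    "colors": {"PERSON": "#ffd966", "ORG": "#9fc5e8"},  # custom colours per label
}

# Inside the notebook these would be passed as:
# displacy.render(nlp(str(sentence)), style='dep', jupyter=True, options=dep_options)
# displacy.render(nlp(str(sentence)), style='ent', jupyter=True, options=ent_options)
```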

Next, we extract the verbatim text, part-of-speech tag and lemma of each token in this sentence, skipping stop words and punctuation.


In [None]:
[(token.orth_, token.pos_, token.lemma_)
 for token in nlp(str(sentence))
 if not token.is_stop and token.pos_ != 'PUNCT']

In [None]:
dict([(str(x), x.label_) for x in nlp(str(sentence_2)).ents])

### Finally, I am going to explore the entities in an entire NY Times article

In [None]:
article = nlp('By the time Prime Minister Boris Johnson finished taking questions in Parliament on Wednesday, he had ushered in a new season of political mayhem in Britain, one in which the voters are now as likely as their feuding leaders to resolve the questions over how and when Britain should leave the European Union. The raucous spectacle in the House of Commons illustrated the obstacles Mr. Johnson will face as he tries to lead Britain out of the European Union next month. On Wednesday, Parliament handed the prime minister two stinging defeats. It first blocked his plans to leave the union with or without an agreement. And it then stymied his bid, at least for the moment, to call an election for Oct. 15, out of fear he could secure a new majority in favor of breaking with Europe, deal or no deal.')

In [None]:
len(article.ents)

In [None]:
sentences = [x for x in article.sents]

In [None]:
print(sentences)

In [None]:
# Render the already-parsed Doc; str(sentences) would re-parse the list's printed form, brackets included.
displacy.render(article, jupyter=True, style='ent')
