# Introduction to spaCy

### The `nlp` object

At the center of spaCy is the object containing the processing pipeline. We usually call this variable `nlp`. For example, to create an English `nlp` object, you can import the `English` language class from `spacy.lang.en` and instantiate it:

In [1]:
# Import the English language class
from spacy.lang.en import English

In [2]:
# Create the nlp object
nlp = English()

You can use the `nlp` object like a function to analyze text. 

- It contains all the different components in the pipeline.
- It also includes all language-specific rules used for tokenizing the text into words and punctuation.
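
As a rough illustration of the two points above, the sketch below inspects a blank English `nlp` object. It assumes that `pipe_names` lists the pipeline components, and that a blank `English()` pipeline has none, so only the tokenizer with its language-specific rules runs. The example sentence is made up for illustration.

```python
from spacy.lang.en import English

# Create a blank English pipeline: tokenizer only, no trained components
nlp = English()

# The names of the components currently in the pipeline
print(nlp.pipe_names)

# Calling nlp like a function applies the English tokenization rules,
# which split punctuation from the neighboring words
doc = nlp("Hello, world!")
tokens = [token.text for token in doc]
print(tokens)
```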

### The `Doc` object

When you process a text with the `nlp` object, spaCy creates a `Doc` object, short for "document".

In [3]:
# Created by processing a string of text with the nlp object
doc = nlp("Hello world!")

The `doc` lets you access information about the text in a structured way, and no information is lost. The `doc` also behaves like a normal Python sequence: you can iterate over its tokens or get a token by its index.

In [4]:
# Iterate over tokens in a Doc
for token in doc:
    print(token.text)

Hello
world
!
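
The claim that no information is lost can be checked with a small sketch. It assumes the `text_with_ws` token attribute (the token text plus any trailing whitespace), which is not used elsewhere in this notebook, and an example sentence chosen for illustration.

```python
from spacy.lang.en import English

nlp = English()
text = "Hello world! How are you?"
doc = nlp(text)

# The Doc keeps the original string
print(doc.text == text)

# Each token stores its trailing whitespace, so the text can be rebuilt
rebuilt = "".join(token.text_with_ws for token in doc)
print(rebuilt == text)

# Sequence behavior: length and negative indexing work as for a list
print(len(doc), doc[-1].text)
```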


To get a token at a specific position, you can index into the `doc`:

In [6]:
# Index into the Doc to get a single Token
token = doc[1]

`Token` objects also provide various attributes that give you access to more information about the tokens. For example, the `.text` attribute returns the verbatim token text.

In [7]:
# Get the token text via the .text attribute
print(token.text)

world


### The `Span` object

A `Span` object is a slice of the document consisting of one or more tokens. It is only a view of the document and doesn't contain any data itself.

To create a span, you can use Python's slice notation. For example:

In [21]:
doc = nlp("Hi this is vaishnavi")

In [22]:

# Slice from token 1 up to, but not including, token 4
span = doc[1:4]

# Get the span text via the .text attribute
print(span.text)

this is vaishnavi
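
To make the "view" idea more concrete, here is a minimal sketch that looks at a span's `start`, `end` and `doc` attributes. These attributes are not used elsewhere in this notebook, so treat the snippet as an illustration rather than a complete description of `Span`.

```python
from spacy.lang.en import English

nlp = English()
doc = nlp("Hi this is vaishnavi")

# A span covering tokens 1, 2 and 3 (the end index is exclusive)
span = doc[1:4]

# The span records where it starts and ends in the parent Doc
print(span.start, span.end, len(span))

# It points back to the Doc it was sliced from
print(span.doc is doc)

# Tokens reached through the span keep their Doc-level index
indexed = [(token.i, token.text) for token in span]
print(indexed)
```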


### Lexical attributes
- `.i` - gives the index of the token within the doc
- `.text` - gives the verbatim text of the token
- `.is_punct`, `.is_alpha`, `.like_num` - boolean flags describing the token



In [25]:
doc = nlp("I'am working at Data Labs since a year!")
print('Index: ', [token.i for token in doc])

Index:  [0, 1, 2, 3, 4, 5, 6, 7, 8]


In [26]:
for token in doc:
    print(token.i)

0
1
2
3
4
5
6
7
8


`.text` returns the text:

In [27]:
print('Text: ', [token.text for token in doc])

Text:  ["I'am", 'working', 'at', 'Data', 'Labs', 'since', 'a', 'year', '!']


`.is_alpha`, `.is_punct` and `.like_num` return boolean values indicating whether the token consists of alphabetic characters, whether it is punctuation, or whether it resembles a number. For example, both "10" and "ten" resemble a number.

In [28]:
print('is_alpha: ', [token.is_alpha for token in doc])
print('is_punct: ', [token.is_punct for token in doc])
print('like_num: ', [token.like_num for token in doc])

is_alpha:  [False, True, True, True, True, True, True, True, False]
is_punct:  [False, False, False, False, False, False, False, False, True]
like_num:  [False, False, False, False, False, False, False, False, False]
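
Since the sentence above contains no numbers, every `.like_num` value is `False`. The following sketch uses a different, made-up sentence to show these flags used as filters.

```python
from spacy.lang.en import English

nlp = English()
doc = nlp("I have ten apples and 3 pears.")

# Collect tokens that look like numbers, whether spelled out or in digits
numbers = [token.text for token in doc if token.like_num]
print(numbers)

# Keep only alphabetic tokens, dropping digits and punctuation
words = [token.text for token in doc if token.is_alpha]
print(words)
```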


### Language support in spaCy

spaCy supports 30+ languages. See https://spacy.io/usage/models#languages for the full list.
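
As a small sketch of working with another language, the snippet below creates a blank German pipeline. It assumes the `spacy.blank` helper, which builds a tokenizer-only pipeline from a language code and is not used elsewhere in this notebook. The German sentence is only an example.

```python
import spacy

# Create a blank German pipeline from its language code
nlp_de = spacy.blank("de")
doc = nlp_de("Hallo Welt!")

# The language code of the pipeline
print(nlp_de.lang)

# Tokenization follows the rules defined for that language
tokens = [token.text for token in doc]
print(tokens)
```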