# spaCy vs NLTK

## spaCy

Reported to be over 400 times faster than NLTK (see References)

State-of-the-art accuracy

Tokenizer maintains alignment

Powerful, concise API

Integrated word vectors


## NLTK

Slower than spaCy

Lower accuracy than spaCy

Tokens do not align to the original string

Models return lists of strings

No word vector support
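
The alignment point in the comparison above can be illustrated without either library. The sketch below is a minimal standard-library approximation, not spaCy's or NLTK's tokenizer: `tokenize_with_offsets` is a hypothetical helper and the sample sentence is made up. It keeps the character offsets of each token, which is the idea behind spaCy exposing a character offset on every token (`token.idx`), and contrasts that with a plain list of strings.

```python
import re

def tokenize_with_offsets(text):
    """Return (token, start, end) triples; offsets index into the original text."""
    # \w+ matches a run of word characters, [^\w\s] a single punctuation mark.
    return [(m.group(), m.start(), m.end())
            for m in re.finditer(r"\w+|[^\w\s]", text)]

# Note the double space after "Francisco" and the comma attached to "USA".
text = "San Francisco  is part of the USA, where many people live."
triples = tokenize_with_offsets(text)

# Aligned: every token can be located again in the original string.
aligned = all(text[start:end] == token for token, start, end in triples)

# Strings only: joining the tokens back cannot reproduce the original spacing.
strings_only = [token for token, _, _ in triples]
rebuilt = " ".join(strings_only)

print(aligned)          # True
print(rebuilt == text)  # False
```

With offsets, a downstream result (for example a highlighted entity) can be mapped back to the exact characters of the input. With a bare list of strings that mapping has to be guessed.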

## References:

https://spacy.io/docs/api/

https://www.quora.com/What-are-the-advantages-of-Spacy-vs-NLTK

https://spacy.io/docs/api/language-models


# Tokenization

In [2]:
# unicode_literals makes every string literal in this cell unicode under Python 2,
# so text can be passed to spaCy (which expects unicode) without the u"" prefix
from __future__ import unicode_literals
import spacy
import string
# spaCy 1.x import path (the module is spacy.lang.en in spaCy 2 and later)
from spacy.en import English

nlp = spacy.load('en')  # load the English language model

In [3]:
parser = English()

In [4]:
class Tokenize:
    def __init__(self,parser):
        self.parser = parser
        self.punctuations = list(string.punctuation)
        
    def word_tokenizer(self,text):
        list_of_tokenized_words = []
        lowercase_words = []
        # parse the data
        parsedData = self.parser(text)
        for token in parsedData:
            # each token exposes its attributes in two forms:
            # with a trailing underscore (token.orth_) the value is a string
            # without it (token.orth) the value is an integer ID into spaCy's vocabulary
            lower_case = self.convert_characters_lower_case(token)
            token = token.orth_
            # skip tokens that are a single punctuation character
            if token in self.punctuations:
                continue
            else:
                lowercase_words.append(lower_case)
                list_of_tokenized_words.append(token)
        print "Original text: \n ", text
        print "========================================================="
        print "List of tokenized words without punctuation: \n " + str(list_of_tokenized_words)
        print "========================================================="
        print "Length of words: ", len(list_of_tokenized_words)
        print "========================================================="
        print "Lowercase words: \n " + str(lowercase_words)
        
    
    def convert_characters_lower_case(self,token):
        return token.lower_
        
    def sent_tokenizer(self,sentences):
        # note: this uses the module-level nlp pipeline loaded above, not self.parser
        document = nlp(sentences)
        list_of_tokenized_sentence = []
        for sentence in document.sents:
            list_of_tokenized_sentence.append(sentence)
        
        print "Original text: \n ", sentences
        print "=================================================================================="
        print "List of tokenized sentences are: \n ", list_of_tokenized_sentence
        print "=================================================================================="
        print "Number of sentences are: \n", len(list_of_tokenized_sentence)
        

In [5]:
tokenize = Tokenize(parser)
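
The punctuation check in `word_tokenizer` compares each token against `list(string.punctuation)`, which holds single characters only. The sketch below shows what that implies for multi-character tokens such as an ellipsis. It is standalone and does not call spaCy: `drop_punctuation` is a hypothetical helper and the token list is made up for illustration.

```python
import string

PUNCTUATION = set(string.punctuation)

def drop_punctuation(tokens):
    """Drop tokens made up entirely of punctuation characters.

    Empty strings are dropped as well, because all() of an empty
    sequence is True.
    """
    return [t for t in tokens if not all(ch in PUNCTUATION for ch in t)]

tokens = ["Wait", "...", "is", "this", "USA", ",", "--", "really", "?"]

# Same membership test as the class above: only single characters match.
single_char_filter = [t for t in tokens if t not in PUNCTUATION]

# Character-wise test: multi-character punctuation is removed too.
full_filter = drop_punctuation(tokens)

print(single_char_filter)
print(full_filter)
```

The first list still contains `...` and `--`, while the second keeps only the five word tokens. Whether that matters depends on how the tokenizer in use splits punctuation runs.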

# Word Tokenizer

## Issues 

"San Francisco" names a single place, but the tokenizer splits it into two tokens ("San" and "Francisco").

In [10]:
text ="San Francisco is part of the USA, where many peoples are living."

tokenize.word_tokenizer(text)

Original text: 
  San Francisco is part of the USA, where many peoples are living.
List of tokenized words without punctuation: 
 [u'San', u'Francisco', u'is', u'part', u'of', u'the', u'USA', u'where', u'many', u'peoples', u'are', u'living']
Length of words:  12
Lowercase words: 
 [u'san', u'francisco', u'is', u'part', u'of', u'the', u'usa', u'where', u'many', u'peoples', u'are', u'living']
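
One way to work around the "San Francisco" issue is to merge known multi-word names after tokenization. The sketch below is a minimal standard-library illustration under stated assumptions: `merge_phrases` is a hypothetical helper and the phrase list is supplied by hand. It is not spaCy's entity recognizer, which would be the more general way to find such names.

```python
def merge_phrases(tokens, phrases):
    """Greedily merge runs of tokens that match a known multi-word phrase."""
    phrase_set = set(tuple(p.split()) for p in phrases)
    max_len = max([len(p) for p in phrase_set] or [1])
    merged = []
    i = 0
    while i < len(tokens):
        # Try the longest candidate first so longer phrases win.
        for n in range(min(max_len, len(tokens) - i), 1, -1):
            if tuple(tokens[i:i + n]) in phrase_set:
                merged.append(" ".join(tokens[i:i + n]))
                i += n
                break
        else:
            # No phrase starts at this position: keep the token as-is.
            merged.append(tokens[i])
            i += 1
    return merged

tokens = ["San", "Francisco", "is", "part", "of", "the", "USA"]
merged = merge_phrases(tokens, ["San Francisco"])

print(merged)
print(len(merged))  # 6
```

The match is exact and case-sensitive, so the phrase list has to use the same spelling as the tokens.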


# Sentence Tokenizer

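In this example the sentence boundaries fall at periods (.), exclamation marks (!) and question marks (?). spaCy does not split on punctuation alone, though: by default it derives sentence boundaries from the dependency parse, so results can differ on less regular text.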

In [11]:
multipleSentence = "This is first sentence. This is second sentence! Let's try to tokenize the sentences. how are you? I am doing good"
tokenize.sent_tokenizer(multipleSentence)

Original text: 
  This is first sentence. This is second sentence! Let's try to tokenize the sentences. how are you? I am doing good
List of tokenized sentences are: 
  [This is first sentence., This is second sentence!, Let's try to tokenize the sentences., how are you?, I am doing good]
Number of sentences are: 
5
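
For comparison, a splitter that relies only on punctuation can be sketched with the standard library. This is a naive approximation and not how spaCy segments sentences: `naive_sentence_split` is a hypothetical helper. On the example text it gives the same five sentences, but it breaks on abbreviations.

```python
import re

def naive_sentence_split(text):
    """Split after '.', '!' or '?' when the mark is followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]

text = ("This is first sentence. This is second sentence! "
        "Let's try to tokenize the sentences. how are you? I am doing good")
sentences = naive_sentence_split(text)
print(len(sentences))  # 5

# A period inside an abbreviation is mistaken for a sentence boundary.
abbreviation = naive_sentence_split("Dr. Smith arrived.")
print(abbreviation)  # ['Dr.', 'Smith arrived.']
```

The abbreviation case is the kind of input where a rule based purely on punctuation and a parser-based segmenter can disagree.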
