# Tutorial Part 1: scikit-learn, nltk, spaCy 

In this tutorial, you'll get familiar with basic machine learning and NLP libraries needed to process text.

## Scikit-learn

https://scikit-learn.org/stable/

Scikit-learn is a general-purpose library that supports many different machine learning algorithms, including tools for processing natural language text, such as the bag-of-words vectorizer used below.

In [2]:
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

In [3]:
vectorizer = CountVectorizer()

In [3]:
corpus = [
    'Hello there',
    'How are you?',
    'Hello! Hello!',
]

In [4]:
vectorizer.fit(corpus)

CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,
        strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
        tokenizer=None, vocabulary=None)

In [5]:
feature_names = vectorizer.get_feature_names()  # on scikit-learn >= 1.0, use get_feature_names_out() (get_feature_names was removed in 1.2)

In [6]:
feature_names

['are', 'hello', 'how', 'there', 'you']

In [7]:
X = vectorizer.transform(corpus)

In [8]:
X

<3x5 sparse matrix of type '<class 'numpy.int64'>'
	with 6 stored elements in Compressed Sparse Row format>

In [9]:
X.toarray()

array([[0, 1, 0, 1, 0],
       [1, 0, 1, 0, 1],
       [0, 2, 0, 0, 0]])
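
The "6 stored elements" reported above are the non-zero counts of this matrix. As a rough illustration of what Compressed Sparse Row (CSR) storage keeps, here is a minimal plain-Python sketch. The helper `to_csr` is hypothetical and written only for this example; it is not how SciPy is implemented, it just produces the same three arrays (`data`, `indices`, `indptr`) that the format is built on.

```python
def to_csr(dense):
    """Convert a dense list-of-lists matrix to CSR arrays (illustrative helper)."""
    data = []      # the non-zero values, row by row
    indices = []   # the column index of each stored value
    indptr = [0]   # indptr[i]:indptr[i + 1] is the slice of data belonging to row i
    for row in dense:
        for col, value in enumerate(row):
            if value != 0:
                data.append(value)
                indices.append(col)
        indptr.append(len(data))
    return data, indices, indptr


dense = [
    [0, 1, 0, 1, 0],
    [1, 0, 1, 0, 1],
    [0, 2, 0, 0, 0],
]
data, indices, indptr = to_csr(dense)
print(data)     # [1, 1, 1, 1, 1, 2]
print(indices)  # [1, 3, 0, 2, 4, 1]
print(indptr)   # [0, 2, 5, 6]
```

Only 6 of the 15 cells are stored. On a realistic corpus with tens of thousands of vocabulary terms, nearly all cells are zero, which is why `transform` returns a sparse matrix rather than a dense array.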

In [10]:
vectorizer.transform(['Are you doing fine?']).toarray()

array([[1, 0, 0, 0, 1]])
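
To make the `fit` and `transform` steps less of a black box, here is a minimal plain-Python sketch of the same bag-of-words idea. It assumes the default settings visible in the printed repr above (`lowercase=True` and the token pattern `(?u)\b\w\w+\b`). The function names `tokenize`, `fit_vocabulary`, and `transform` are made up for this example and are not part of the scikit-learn API.

```python
import re

# Same pattern as CountVectorizer's default: tokens of two or more word characters.
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")


def tokenize(text):
    """Lowercase the text and extract tokens."""
    return TOKEN_PATTERN.findall(text.lower())


def fit_vocabulary(docs):
    """Assign a column index to every distinct token, in alphabetical order."""
    tokens = sorted({tok for doc in docs for tok in tokenize(doc)})
    return {tok: i for i, tok in enumerate(tokens)}


def transform(docs, vocabulary):
    """Count vocabulary tokens per document; tokens not in the vocabulary are ignored."""
    rows = []
    for doc in docs:
        row = [0] * len(vocabulary)
        for tok in tokenize(doc):
            if tok in vocabulary:
                row[vocabulary[tok]] += 1
        rows.append(row)
    return rows


corpus = ['Hello there', 'How are you?', 'Hello! Hello!']
vocab = fit_vocabulary(corpus)
print(sorted(vocab))             # ['are', 'hello', 'how', 'there', 'you']
print(transform(corpus, vocab))  # [[0, 1, 0, 1, 0], [1, 0, 1, 0, 1], [0, 2, 0, 0, 0]]
print(transform(['Are you doing fine?'], vocab))  # [[1, 0, 0, 0, 1]]
```

The last line shows why `doing` and `fine` leave no trace in the output: the vocabulary is fixed at fit time, so words that were not seen then have no column to be counted in.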

The `CountVectorizer` class accepts many parameters. For example, `min_df=2` keeps only the terms that occur in at least two documents.

In [11]:
vectorizer = CountVectorizer(min_df=2)

In [12]:
corpus = [
    'Hello there',
    'How are you?',
    'Hello! Hello!',
]

In [13]:
vectorizer.fit(corpus)

CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=2,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,
        strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
        tokenizer=None, vocabulary=None)

In [14]:
vectorizer.get_feature_names()  # on scikit-learn >= 1.0, use get_feature_names_out() (get_feature_names was removed in 1.2)

['hello']

In [15]:
vectorizer.transform(corpus).toarray()

array([[1],
       [0],
       [2]])
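
The result above depends on document frequency, not on the total number of occurrences: `hello` is kept because it appears in two different documents, even though the third document repeats it. The following plain-Python sketch illustrates that distinction. The helpers `tokenize` and `document_frequency` are hypothetical names used only here, and the tokenization again assumes the default token pattern and lowercasing.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase and split into tokens of two or more word characters."""
    return re.findall(r"(?u)\b\w\w+\b", text.lower())


def document_frequency(docs):
    """Count, for each token, the number of documents that contain it."""
    df = Counter()
    for doc in docs:
        df.update(set(tokenize(doc)))  # set(): a document counts once per token
    return df


corpus = ['Hello there', 'How are you?', 'Hello! Hello!']
df = document_frequency(corpus)
kept = sorted(tok for tok, n in df.items() if n >= 2)  # the effect of min_df=2

print(df['hello'])  # 2
print(df['there'])  # 1
print(kept)         # ['hello']
```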

# NLTK

http://www.nltk.org/

The NLTK library contains NLP-specific algorithms for many practical tasks, including word and sentence tokenization.

In [16]:
from nltk import word_tokenize

In [17]:
doc1 = 'I have a cat.'
doc2 = "He doesn't have a cat"

In [18]:
word_tokenize(doc1)

['I', 'have', 'a', 'cat', '.']

In [19]:
word_tokenize(doc2)

['He', 'does', "n't", 'have', 'a', 'cat']
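
For comparison, here is what naive whitespace splitting produces on the same two strings. This is a plain-Python baseline, not an NLTK feature; it is included only to show what `word_tokenize` adds.

```python
doc1 = 'I have a cat.'
doc2 = "He doesn't have a cat"

# str.split() only breaks on whitespace, so punctuation stays glued to words.
naive1 = doc1.split()
naive2 = doc2.split()

print(naive1)  # ['I', 'have', 'a', 'cat.']
print(naive2)  # ['He', "doesn't", 'have', 'a', 'cat']
```

With whitespace splitting, `cat.` and `cat` would be treated as two different words, and the negation in `doesn't` stays hidden inside a single token. The `word_tokenize` output above separates the final period and splits the contraction into `does` and `n't`.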

# spaCy

https://spacy.io/

spaCy is a modern, industrial-strength natural language processing library that has a convenient interface and supports many standard tasks, such as parsing, tagging, and named entity recognition.

We are going to use a [pretrained spaCy model](https://spacy.io/models/en#en_core_web_lg) for several different tasks. It includes the aforementioned algorithms as well as word vectors trained on a large collection of documents from the web ([the Common Crawl](http://commoncrawl.org)).

In [34]:
import spacy

In [35]:
nlp = spacy.load('en_core_web_lg')

In [36]:
doc1 = nlp('I have a cat')

In [37]:
doc1

I have a cat

In [38]:
doc1[1]

have

In [39]:
[w.text for w in doc1]

['I', 'have', 'a', 'cat']

In [40]:
[w.lemma_ for w in doc1]

['-PRON-', 'have', 'a', 'cat']

In [41]:
doc2 = nlp("He doesn't have a cat :(")

In [42]:
[w.text for w in doc2]

['He', 'does', "n't", 'have', 'a', 'cat', ':(']

In [44]:
[w.lemma_ for w in doc2]

['-PRON-', 'do', 'not', 'have', 'a', 'cat', ':(']

## Word vectors

spaCy also associates a word vector with each word. This vector, and its position relative to other vectors, reflects the meaning of the word: similar words have similar vectors.

In [45]:
doc1 = nlp('I have a dog')
doc2 = nlp('I have a cat')
doc3 = nlp('I have a banana')
doc4 = nlp('Congress voted to reopen the government')

In [46]:
doc1[3]

dog

In [47]:
doc3[3].vector.shape

(300,)

In [48]:
doc3[3].vector

array([ 2.0228e-01, -7.6618e-02,  3.7032e-01,  3.2845e-02, -4.1957e-01,
        7.2069e-02, -3.7476e-01,  5.7460e-02, -1.2401e-02,  5.2949e-01,
       -5.2380e-01, -1.9771e-01, -3.4147e-01,  5.3317e-01, -2.5331e-02,
        1.7380e-01,  1.6772e-01,  8.3984e-01,  5.5107e-02,  1.0547e-01,
        3.7872e-01,  2.4275e-01,  1.4745e-02,  5.5951e-01,  1.2521e-01,
       -6.7596e-01,  3.5842e-01, -4.0028e-02,  9.5949e-02, -5.0690e-01,
       -8.5318e-02,  1.7980e-01,  3.3867e-01,  1.3230e-01,  3.1021e-01,
        2.1878e-01,  1.6853e-01,  1.9874e-01, -5.7385e-01, -1.0649e-01,
        2.6669e-01,  1.2838e-01, -1.2803e-01, -1.3284e-01,  1.2657e-01,
        8.6723e-01,  9.6721e-02,  4.8306e-01,  2.1271e-01, -5.4990e-02,
       -8.2425e-02,  2.2408e-01,  2.3975e-01, -6.2260e-02,  6.2194e-01,
       -5.9900e-01,  4.3201e-01,  2.8143e-01,  3.3842e-02, -4.8815e-01,
       -2.1359e-01,  2.7401e-01,  2.4095e-01,  4.5950e-01, -1.8605e-01,
       -1.0497e+00, -9.7305e-02, -1.8908e-01, -7.0929e-01,  4.01

We can calculate the similarity between the individual words...

In [58]:
doc1[3], doc2[3], doc3[3]

(dog, cat, banana)

In [52]:
doc1[3].similarity(doc2[3])

0.80168545

In [53]:
doc1[3].similarity(doc3[3])

0.24327643

...or between whole documents (sentences).

In [57]:
doc1, doc2, doc3

(I have a dog, I have a cat, I have a banana)

In [54]:
doc1.similarity(doc2)

0.9681672529980867

In [55]:
doc1.similarity(doc3)

0.8753348768953094

In [56]:
doc1.similarity(doc4)

0.5484653936168645
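
By default, spaCy computes these scores as the cosine similarity between two vectors, and a document's vector is the average of its token vectors. The sketch below shows both steps in plain Python. The three-dimensional vectors are made-up toy values chosen for illustration, not real spaCy vectors (which have 300 dimensions here), and `cosine_similarity` and `mean_vector` are hypothetical helper names.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors: dot(u, v) / (|u| * |v|)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def mean_vector(vectors):
    """Element-wise average of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


# Toy vectors (assumed values, for illustration only).
toy = {
    'have':   [0.5, 0.5, 0.5],
    'dog':    [0.9, 0.8, 0.1],
    'cat':    [0.8, 0.9, 0.2],
    'banana': [0.1, 0.2, 0.9],
}

# Word-level similarity.
print(cosine_similarity(toy['dog'], toy['cat']) > cosine_similarity(toy['dog'], toy['banana']))  # True

# Document-level similarity: average the token vectors first.
doc_dog = mean_vector([toy['have'], toy['dog']])
doc_banana = mean_vector([toy['have'], toy['banana']])
print(cosine_similarity(doc_dog, doc_banana) > cosine_similarity(toy['dog'], toy['banana']))  # True
```

The second comparison mirrors what happens in the outputs above: the sentences share the words `I have a`, so averaging pulls their vectors together and the document scores (0.97, 0.88) are much closer to each other than the word scores (0.80, 0.24).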

## Named Entity Recognition (NER)

Because the model we are using was also trained to perform NER, we can easily find all mentions of entities in a document, each tagged with its type (organization, location, amount of money, and so on).

In [60]:
doc = nlp(
    'Apple is looking at buying U.K. startup for $1 billion. '
    'Microsoft acquires Citus Data, creators of a cloud-friendly version of the PostgreSQL database. '
)

In [61]:
doc.ents

(Apple, U.K., $1 billion, Microsoft, Citus Data, PostgreSQL)

In [62]:
for e in doc.ents:
    print(e, '->', e.label_)

Apple -> ORG
U.K. -> GPE
$1 billion -> MONEY
Microsoft -> ORG
Citus Data -> ORG
PostgreSQL -> ORG


`GPE` stands for geopolitical entity (countries, cities, states). The full list of available labels is given in the documentation: https://spacy.io/api/annotation#named-entities

In [63]:
doc.ents[0].sent

Apple is looking at buying U.K. startup for $1 billion.
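
A common next step is to group the recognized entities by label. The sketch below hard-codes the (text, label) pairs printed earlier so that it runs without loading the model; with a real `doc` you would build the same pairs from `doc.ents` using `e.text` and `e.label_`. The helper `group_by_label` is a hypothetical name used only for this example.

```python
from collections import defaultdict

# (text, label) pairs copied from the NER output shown earlier.
entities = [
    ('Apple', 'ORG'),
    ('U.K.', 'GPE'),
    ('$1 billion', 'MONEY'),
    ('Microsoft', 'ORG'),
    ('Citus Data', 'ORG'),
    ('PostgreSQL', 'ORG'),
]


def group_by_label(pairs):
    """Collect entity texts under their label, preserving document order."""
    grouped = defaultdict(list)
    for text, label in pairs:
        grouped[label].append(text)
    return dict(grouped)


by_label = group_by_label(entities)
print(by_label['ORG'])    # ['Apple', 'Microsoft', 'Citus Data', 'PostgreSQL']
print(by_label['GPE'])    # ['U.K.']
print(by_label['MONEY'])  # ['$1 billion']
```

Grouping like this also makes the model's mistakes easier to spot: `PostgreSQL` is a database product rather than an organization, yet it was labeled `ORG` in the output above.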