# spaCy

<div class="alert-success">
modern-day version of NLTK</div>

### Models

spaCy comes with a variety of models for each language. For instance, the models for English are available [here](https://spacy.io/models/en). You'll need to download each model separately:

```bash
python3 -m spacy download en_core_web_sm
python3 -m spacy download en_core_web_md
```

<div class="alert-success">
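<p>sm: small, md: medium, lg: large
<p>sm does not ship with static word vectors; its <i>token.vector</i> falls back to context-sensitive tensors, so similarity scores are only rough estimates
<p>md ships with true pretrained word embeddings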
</div>

In [1]:
!python3 -m spacy download en_core_web_sm

Collecting en-core-web-sm==3.2.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.2.0/en_core_web_sm-3.2.0-py3-none-any.whl (13.9 MB)
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')


In [2]:
!python3 -m spacy download en_core_web_md

Collecting en-core-web-md==3.2.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_md-3.2.0/en_core_web_md-3.2.0-py3-none-any.whl (45.7 MB)
Installing collected packages: en-core-web-md
Successfully installed en-core-web-md-3.2.0
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_md')


## Pattern Matching Using Spacy

The code and example below are from Ashiq KS's article [Rule-Based Matching with spacy](https://medium.com/@ashiqgiga07/rule-based-matching-with-spacy-295b76ca2b68):

In [None]:
# The input text; it is converted to a Doc object in the next cell
text = '''
Computer programming is the process of writing instructions that get executed by computers. 
The instructions, also known as code, are written in a programming language which the computer 
can understand and use to perform a task or solve a problem. Basic computer programming involves 
the analysis of a problem and development of a logical sequence of instructions to solve it. 
There can be numerous paths to a solution and the computer programmer seeks to design and 
code that which is most efficient. Among the programmer’s tasks are understanding requirements, 
determining the right programming language to use, designing or architecting the solution, coding, 
testing, debugging and writing documentation so that the solution can be easily
understood by other programmers.Computer programming is at the heart of computer science. It is the 
implementation portion of software development, application development 
and software engineering efforts, transforming ideas and theories into actual, working solutions.
'''

In [None]:
import spacy
from spacy.matcher import Matcher  # rule-based token matcher
# the Span class is used to extract the matched words from the Doc object
from spacy.tokens import Span

# load the English pipeline 'en_core_web_sm'
nlp = spacy.load("en_core_web_sm")

doc = nlp(text)  # convert the string above to a Doc

# instantiate a new Matcher that shares the pipeline's vocabulary
matcher = Matcher(nlp.vocab)

### Define the Target Pattern

The `pattern` object that you define should be a list of dictionaries, each describing one token to match.

Here, we match `computer` used as a `NOUN`, followed by any token that is not a `VERB`.

In [None]:
#define the pattern
pattern = [{'LOWER': 'computer', 'POS': 'NOUN'},
             {'POS':{'NOT_IN': ['VERB']}}]


### Load the Pattern into the Matcher

In [None]:
# add the pattern to the previously created matcher object
# (spaCy v3 signature: a key plus a list of patterns)
matcher.add("Matching", [pattern])
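
The cells above register the pattern but never run the matcher. A minimal sketch of that remaining step follows. So that it runs without a downloaded model, it assumes a blank English pipeline and a hand-tagged toy sentence; `toy_doc`, `words`, and `pos` are illustrative names that are not part of the original example. With a real pipeline you would call `matcher(doc)` on the `doc` created earlier.

```python
import spacy
from spacy.matcher import Matcher
from spacy.tokens import Doc

# Assumption: a blank pipeline and hand-written POS tags stand in for
# en_core_web_sm so the sketch runs without downloading a model.
nlp = spacy.blank("en")
words = ["Basic", "computer", "programming", "helps", "a", "computer", "run", "code"]
pos = ["ADJ", "NOUN", "NOUN", "VERB", "DET", "NOUN", "VERB", "NOUN"]
toy_doc = Doc(nlp.vocab, words=words, pos=pos)

# Same pattern as above: "computer" as a NOUN, followed by a non-verb token.
pattern = [{"LOWER": "computer", "POS": "NOUN"},
           {"POS": {"NOT_IN": ["VERB"]}}]

matcher = Matcher(nlp.vocab)
matcher.add("Matching", [pattern])  # spaCy v3: a key plus a list of patterns

# Each match is a (match_id, start, end) triple of token indices.
matched_spans = []
for match_id, start, end in matcher(toy_doc):
    matched_spans.append(toy_doc[start:end].text)
    print(nlp.vocab.strings[match_id], start, end, toy_doc[start:end].text)
```

Only the first `computer` is reported, because the second one is followed by a token tagged `VERB`.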

## Using Regular Expressions in Spacy

The example below can be found at https://spacy.io/usage/rule-based-matching. It uses the `re.finditer()` function to
iterate over all the matches found in the text.

<div class="alert-success">
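<p>A spaCy <i>Doc</i> can be combined with regular expressions: match on <i>doc.text</i>, then map the character offsets back to tokens with <i>doc.char_span</i>.
<p><i>finditer</i> returns an iterator over match objects</div>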

In [3]:
import spacy
import re
nlp = spacy.load("en_core_web_sm")
doc = nlp("The United States of America (USA) are commonly known as the United States (U.S. or US) or America.")

expression = r"\b[Uu](nited|\.?) ?[Ss](tates|\.?)\b"
for match in re.finditer(expression, doc.text):
    start, end = match.span()
    span = doc.char_span(start, end)
    # This is a Span object or None if match doesn't map to valid token sequence
    if span is not None:
        print("Found match:", span.text)



Found match: United States
Found match: United States
Found match: US
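
The output lists only three matches even though the sentence also contains `U.S.`. The sketch below uses only the standard library (no spaCy) to show what the regular expression finds on its own. The trailing `\b` cannot match after the final period, so the regex settles for `U.S`. That stops in the middle of what spaCy presumably treats as the single token `U.S.`, which would explain why `doc.char_span` returns `None` for it. The tokenization is an assumption inferred from the output; the sketch does not check it.

```python
import re

text = ("The United States of America (USA) are commonly known as "
        "the United States (U.S. or US) or America.")
expression = r"\b[Uu](nited|\.?) ?[Ss](tates|\.?)\b"

# Collect every raw regex match together with its character offsets.
raw_matches = [(m.group(), m.span()) for m in re.finditer(expression, text)]
for matched_text, (start, end) in raw_matches:
    print(f"{matched_text!r} at characters {start}-{end}")
```

If such partial matches should be kept, recent spaCy versions accept an `alignment_mode` argument on `doc.char_span` (for example `"expand"`) that snaps the character range to whole tokens.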


## Part of Speech Tagging

In [None]:
# !python3 -m spacy download en_core_web_sm
# !python3 -m spacy download en_core_web_md

In [4]:
import spacy

# the md pipeline ships with word vectors
nlp = spacy.load('en_core_web_md')

In [5]:
import pandas as pd
rows = []
doc = nlp(u"Steve Jobs and Apple is looking at buying U.K. startup for $1 billion")
for token in doc:
    rows.append((token.text, token.lemma_, token.pos_, token.tag_, token.dep_,
            token.shape_, token.is_alpha, token.is_stop))
    
data = pd.DataFrame(rows, columns=["text", "lemma", "part_of_speech", "tag", "dependency", "shape", "is_alpha", "is_stopword"])
data.head()

,text,lemma,part_of_speech,tag,dependency,shape,is_alpha,is_stopword
0,Steve,Steve,PROPN,NNP,compound,Xxxxx,True,False
1,Jobs,Jobs,PROPN,NNP,nsubj,Xxxx,True,False
2,and,and,CCONJ,CC,cc,xxx,True,True
3,Apple,Apple,PROPN,NNP,conj,Xxxxx,True,False
4,is,be,AUX,VBZ,aux,xx,True,True


<div class="alert-success">
spaCy runs these standard analyses automatically when you call <i>nlp</i> on a text.</div>

### Named Entity Recognition

In [6]:
# example from spacy docs (`nlp` is the pipeline loaded in the cells above)
doc = nlp("Steve Jobs and Apple is looking at buying U.K. startup for $1 billion")

for ent in doc.ents:
    print(ent.text, ent.start_char, ent.end_char, ent.label_)

Steve Jobs 0 10 PERSON
Apple 15 20 ORG
U.K. 42 46 GPE
$1 billion 59 69 MONEY


In [7]:
# visualize this using displacy:
from spacy import displacy
displacy.render(doc, style="ent", jupyter=True)

# Word Embeddings (word2vec Introduction) from Intro to Algorithmic Marketing

<div class="alert-success">
Recommended textbook: <b>Introduction to Algorithmic Marketing</b>, which covers business applications</div>

## Continuous Bag of Words (Use Context to Predict Target Word)
ref: `Introduction to Algorithmic Marketing`
![Continuous bag-of-words (CBOW) diagram](https://raw.githubusercontent.com/ychennay/dso-560-nlp-text-analytics/main/images/word2vec_cbow.png)

## Softmax

![Softmax illustration](https://raw.githubusercontent.com/ychennay/dso-560-nlp-text-analytics/main/images/softmax.png)
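
As a small worked illustration, softmax maps a vector of raw scores $z$ to probabilities via $\mathrm{softmax}(z)_i = e^{z_i} / \sum_j e^{z_j}$. In word2vec the scores are typically one per vocabulary word. The sketch below is a plain-Python version with made-up scores; it is not taken from the textbook.

```python
import math

def softmax(scores):
    """Turn a list of raw scores into probabilities that sum to 1."""
    # Subtracting the maximum keeps exp() from overflowing; the result is unchanged.
    largest = max(scores)
    exps = [math.exp(s - largest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical scores for a three-word vocabulary.
probs = softmax([2.0, 1.0, 0.1])
print(probs)
```

The highest score receives the largest probability, and the ordering of the scores is preserved.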

## Skipgram
![Skip-gram diagram](https://raw.githubusercontent.com/ychennay/dso-560-nlp-text-analytics/main/images/skipgram.png)
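
The difference between the two architectures is easiest to see in the training examples they consume: CBOW uses the context words to predict the target, while skip-gram uses the target to predict each context word. The sketch below generates both kinds of pairs from a toy sentence. The function name `training_pairs` and the window size are illustrative assumptions, not part of any library.

```python
def training_pairs(tokens, window=2):
    """Return (cbow_pairs, skipgram_pairs) for a list of tokens."""
    cbow, skipgram = [], []
    for i, target in enumerate(tokens):
        # Context = up to `window` words on each side of the target.
        context = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
        cbow.append((context, target))                       # context -> target
        skipgram.extend((target, word) for word in context)  # target -> each context word
    return cbow, skipgram

cbow_pairs, skipgram_pairs = training_pairs(["the", "quick", "brown", "fox"], window=1)
print(cbow_pairs)
print(skipgram_pairs)
```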

![Word embedding clusters](https://raw.githubusercontent.com/ychennay/dso-560-nlp-text-analytics/main/images/wordembedding_cluster.png)

<div class="alert-success">
word2vec: places words that appear in similar contexts (e.g. synonyms) close to each other in vector space</div>

In [8]:
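import en_core_web_sm
import en_core_web_md
import spacy
from scipy.spatial.distance import cosine

# NOTE: the sm pipeline has no static word vectors, so the scores below come
# from context-sensitive tensors; load en_core_web_md for true embeddings
nlp = en_core_web_sm.load()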

In [9]:
tokens = nlp(u'dog cat Beijing sad depressed couch sofa canine China Chinese France Paris banana')

for token1 in tokens:
    for token2 in tokens:
        if token1 != token2:
            print(f" {token1} - {token2}: {1 - cosine(token1.vector, token2.vector)}")

 dog - cat: 0.6064711213111877
 dog - Beijing: 0.2668081223964691
 dog - sad: 0.23861637711524963
 dog - depressed: 0.15567229688167572
 dog - couch: 0.40754202008247375
 dog - sofa: 0.28589534759521484
 dog - canine: 0.40234482288360596
 dog - China: 0.3280991315841675
 dog - Chinese: 0.22344091534614563
 dog - France: 0.45519715547561646
 dog - Paris: 0.48964983224868774
 dog - banana: 0.3760344386100769
 cat - dog: 0.6064711213111877
 cat - Beijing: 0.3463219106197357
 cat - sad: 0.24762609601020813
 cat - depressed: 0.08602728694677353
 cat - couch: 0.2503412663936615
 cat - sofa: 0.25963476300239563
 cat - canine: 0.3891853988170624
 cat - China: 0.28076720237731934
 cat - Chinese: 0.11958665400743484
 cat - France: 0.3794781267642975
 cat - Paris: 0.39501631259918213
 cat - banana: 0.46230363845825195
 Beijing - dog: 0.2668081223964691
 Beijing - cat: 0.3463219106197357
 Beijing - sad: 0.27167046070098877
 Beijing - depressed: 0.22452890872955322
 Beijing - couch: 0.1336926072835
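
SciPy's `cosine` returns the cosine *distance*, which is why the loop prints `1 - cosine(...)` to obtain a similarity. A minimal plain-Python sketch of the underlying formula, $\cos(u, v) = \frac{u \cdot v}{\lVert u \rVert \, \lVert v \rVert}$, is shown below with made-up vectors.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical 3-dimensional "word vectors", for illustration only.
same_direction = cosine_similarity([1.0, 2.0, 0.0], [2.0, 4.0, 0.0])
orthogonal = cosine_similarity([1.0, 0.0, 0.0], [0.0, 1.0, 0.0])
print(same_direction, orthogonal)
```

Vectors pointing in the same direction score close to 1 regardless of their length, and orthogonal vectors score 0.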

<div class="alert-success">
<p>use embeddings to <b>find similar docs:</b> simply average the token vectors (i.e. a bag of words)
<p> cons: word order and context are lost, and averaging many words pulls the result toward the mean $0$ (the prior over words)
    <p>variation #1: weight words by their TF-IDF scores
    <p>variation #2: use the word embeddings as inputs to a sequential model
</div>
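
The averaging idea and variation #1 can be sketched in a few lines. The vectors and TF-IDF weights below are made up for illustration, and `document_vector` is a hypothetical helper, not a spaCy or scikit-learn function.

```python
def document_vector(words, vectors, weights=None):
    """Weighted average of the word vectors for the words in a document."""
    dim = len(next(iter(vectors.values())))
    total, weight_sum = [0.0] * dim, 0.0
    for word in words:
        if word not in vectors:
            continue  # skip out-of-vocabulary words
        w = 1.0 if weights is None else weights.get(word, 0.0)
        total = [t + w * x for t, x in zip(total, vectors[word])]
        weight_sum += w
    return [t / weight_sum for t in total] if weight_sum else total

# Hypothetical 2-d vectors and TF-IDF-style weights, for illustration only.
vectors = {"the": [0.0, 0.0], "dog": [1.0, 0.0], "barks": [0.0, 1.0]}
tfidf = {"the": 0.0, "dog": 2.0, "barks": 1.0}
doc_words = ["the", "dog", "barks"]

plain = document_vector(doc_words, vectors)            # plain bag-of-words average
weighted = document_vector(doc_words, vectors, tfidf)  # variation #1
print(plain, weighted)
```

With the weights applied, the uninformative word `the` no longer dilutes the document vector.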

# Finding Most Similar Words (Using Our Old Methods)

In [None]:
from sklearn.feature_extraction.text import CountVectorizer

# inspect the default settings for CountVectorizer
CountVectorizer()

In [None]:
# read one review per line
with open("poor_amazon_toy_reviews.txt") as f:
    reviews = f.readlines()

vectorizer = CountVectorizer(ngram_range=(1, 1),
                             stop_words="english",
                             max_features=500,
                             token_pattern=r'(?u)\b[a-zA-Z][a-zA-Z]+\b')
X = vectorizer.fit_transform(reviews)

# get_feature_names_out() needs scikit-learn >= 1.0; older versions use get_feature_names()
data = pd.DataFrame(X.toarray(), columns=vectorizer.get_feature_names_out())
data.head()

In [None]:
from sklearn.metrics.pairwise import cosine_similarity

# create similarity matrix (word-by-word, from the transposed document-term matrix)
feature_names = vectorizer.get_feature_names_out()  # scikit-learn >= 1.0
similarity_matrix = pd.DataFrame(cosine_similarity(data.T.values),
                                 columns=feature_names,
                                 index=feature_names)

In [None]:
# unstack matrix into table
similarity_table = similarity_matrix.rename_axis(None).rename_axis(None, axis=1).stack().reset_index()

In [None]:
# rename columns
similarity_table.columns = ["word1", "word2", "similarity"]
similarity_table.shape

In [None]:
# drop self-pairs, whose similarity is (numerically) 1
similarity_table = similarity_table[similarity_table["similarity"] < 0.99]
similarity_table.shape

In [None]:
# mirrored pairs (a, b) and (b, a) share a score, so dropping duplicate scores keeps one of each
similarity_table.sort_values(by="similarity", ascending=False).drop_duplicates(
    subset="similarity", keep="first").head(10)

In [None]:
top_500_words = list(vectorizer.get_feature_names_out())  # scikit-learn >= 1.0

# Exercise: Similar Words Using Word Embeddings

In [None]:
# load your top 500 words into spaCy
# use the md pipeline, which ships with word vectors (sm does not)
nlp = spacy.load("en_core_web_md")
tokens = nlp(" ".join(top_500_words))

In [None]:
from itertools import product
# create a list of similarity tuples

similarity_tuples = []

for token1, token2 in product(tokens, repeat=2):
    # store the words as strings rather than Token objects
    similarity_tuples.append((token1.text, token2.text, token1.similarity(token2)))

similarities = pd.DataFrame(similarity_tuples, columns=["word1","word2", "score"])


In [None]:
# find similar words
similarities[similarities["score"] < 1].sort_values(
    by="score", ascending=False).drop_duplicates(
    subset="score", keep="first").head(5)
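
`product(tokens, repeat=2)` scores every ordered pair, including each word with itself and both orderings of every pair, which is why the cell above filters on `score < 1` and drops duplicate scores. Dropping by score can also discard unrelated pairs that happen to tie. One possible alternative, sketched below with a toy similarity function standing in for `token1.similarity(token2)`, is to generate each unordered pair only once with `itertools.combinations`. The helper names are hypothetical.

```python
from itertools import combinations

def unique_pairs(words, similarity):
    """Score each unordered pair of distinct words exactly once."""
    scored = [(w1, w2, similarity(w1, w2)) for w1, w2 in combinations(words, 2)]
    # Highest score first.
    return sorted(scored, key=lambda row: row[2], reverse=True)

# Hypothetical stand-in for token1.similarity(token2): shared-letter overlap.
def toy_similarity(w1, w2):
    a, b = set(w1), set(w2)
    return len(a & b) / len(a | b)

pairs = unique_pairs(["toy", "toys", "box"], toy_similarity)
print(pairs)
```

For 500 words this scores 124,750 pairs instead of 250,000 and needs no de-duplication afterwards.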

# CW for week4:

1. Which pair of words from the list of four below would have the closest similarity score using word2vec? Explain why based on your understanding of word2vec. *happy, hoppy, cheerful, derecha (Spanish word for right)*
- (happy, cheerful). word2vec is based on the assumption that the meaning of a word is largely determined by its context, so it returns similar vectors for synonyms, which are used in similar contexts.

2. Identify all the named entities in the following document:
Obama will return to the White House for the first time as Democrats look ahead to midterm elections
- Obama, the White House, Democrats, midterm elections.

3. Write a named or unnamed capture group to extract the user names of email addresses (the part before the @).
- r'(\w+)@\w+\.(?:com|net)'

## verProf
2. "midterm elections" may not be a named entity
3. Don't forget the word boundary (`\b`)
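
The effect of the word-boundary feedback on question 3 can be checked with a short sketch. The sample text is made up, and the named group `user` is one possible way to satisfy the "named capture group" part of the question. It is a sketch of the idea, not a complete email validator.

```python
import re

text = "Contact alice@example.com or bob@site.network, not carol@test.org"

# Original answer: no word boundary after the top-level domain.
without_boundary = re.findall(r"(\w+)@\w+\.(?:com|net)", text)

# With \b on both sides and a named group for the user name.
pattern = re.compile(r"\b(?P<user>\w+)@\w+\.(?:com|net)\b")
with_boundary = [m.group("user") for m in pattern.finditer(text)]

print(without_boundary)
print(with_boundary)
```

Without the trailing `\b`, the pattern accepts `bob@site.net` as a prefix of `bob@site.network`; with it, only `alice` is extracted.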