# Spacy

### Models

Spacy comes with a variety of models that can be used for each language. For instance, the models for English are available [here](https://spacy.io/models/en). You'll need to download each model separately:

```bash
python3 -m spacy download en_core_web_sm
python3 -m spacy download en_core_web_md
```

## Pattern Matching Using Spacy

The code and example below are from Ashiq KS's article [Rule-Based Matching with spacy](https://medium.com/@ashiqgiga07/rule-based-matching-with-spacy-295b76ca2b68):

In [None]:
#The input text string is converted to a Document object
text = '''
Computer programming is the process of writing instructions that get executed by computers. 
The instructions, also known as code, are written in a programming language which the computer 
can understand and use to perform a task or solve a problem. Basic computer programming involves 
the analysis of a problem and development of a logical sequence of instructions to solve it. 
There can be numerous paths to a solution and the computer programmer seeks to design and 
code that which is most efficient. Among the programmer’s tasks are understanding requirements, 
determining the right programming language to use, designing or architecting the solution, coding, 
testing, debugging and writing documentation so that the solution can be easily
understood by other programmers. Computer programming is at the heart of computer science. It is the 
implementation portion of software development, application development 
and software engineering efforts, transforming ideas and theories into actual, working solutions.
'''

In [None]:
import spacy
from spacy.matcher import Matcher  # import the Matcher class from spacy
# import the Span class to extract the words from the document object
from spacy.tokens import Span

#Language class with the English model 'en_core_web_sm' is loaded
nlp = spacy.load("en_core_web_sm")

doc = nlp(text) # convert the string above to a document

#instantiate a new Matcher class object 
matcher = Matcher(nlp.vocab)

### Define the Target Pattern

The `pattern` object that you define should be a list of dictionaries, each describing one token to match.

Here, the first token must be `computer` (case-insensitive) tagged as a `NOUN`, and the token that follows it must not be a `VERB`.

In [None]:
#define the pattern
pattern = [{'LOWER': 'computer', 'POS': 'NOUN'},
             {'POS':{'NOT_IN': ['VERB']}}]


### Load the Pattern into the Matcher

In [None]:
#add the pattern to the previously created matcher object
#(spaCy v3 signature; spaCy v2 used matcher.add("Matching", None, pattern))
matcher.add("Matching", [pattern])
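
The cells above build the matcher but never run it. As a minimal sketch of that last step, assuming spaCy v3 (or v2.2.2+): calling the matcher on a `Doc` returns `(match_id, start, end)` tuples, where `start` and `end` are token indices. To stay runnable without a model download, the sketch uses a blank English pipeline and a simplified, hypothetical pattern on `LOWER` only, since `POS` needs a trained tagger.

```python
import spacy
from spacy.matcher import Matcher

# Blank pipeline: tokenizer only, so no model download is needed.
nlp = spacy.blank("en")
doc = nlp("Computer programming is at the heart of computer science.")

matcher = Matcher(nlp.vocab)
# Simplified pattern (an assumption for this sketch): "computer" followed by "programming".
pattern = [{"LOWER": "computer"}, {"LOWER": "programming"}]
matcher.add("COMPUTER_PROGRAMMING", [pattern])

# Each match is (match_id, start, end); start and end are token indices.
matched_spans = []
for match_id, start, end in matcher(doc):
    label = nlp.vocab.strings[match_id]  # look up the rule name from its hash
    matched_spans.append((label, doc[start:end].text))

print(matched_spans)
```

With the notebook's own `nlp`, `doc`, and `matcher`, the same loop should work unchanged, because `en_core_web_sm` supplies the `POS` tags the original pattern relies on.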

## Using Regular Expressions in Spacy

The example below comes from https://spacy.io/usage/rule-based-matching. It uses `re.finditer()` to iterate over every regex match in the raw text, then `doc.char_span()` to map each match's character offsets back to a token span.

In [None]:
import spacy
import re
nlp = spacy.load("en_core_web_sm")
doc = nlp("The United States of America (USA) are commonly known as the United States (U.S. or US) or America.")

expression = r"\b[Uu](nited|\.?) ?[Ss](tates|\.?)\b"
for match in re.finditer(expression, doc.text):
    start, end = match.span()
    span = doc.char_span(start, end)
    # This is a Span object or None if match doesn't map to valid token sequence
    if span is not None:
        print("Found match:", span.text)

## Part of Speech Tagging

In [None]:
!python3 -m spacy download en_core_web_sm
!python3 -m spacy download en_core_web_md

In [None]:
import spacy
from scipy.spatial.distance import cosine

nlp = spacy.load('en_core_web_md')

In [None]:
import pandas as pd
rows = []
doc = nlp(u"Steve Jobs and Apple is looking at buying U.K. startup for $1 billion")
for token in doc:
    rows.append((token.text, token.lemma_, token.pos_, token.tag_, token.dep_,
            token.shape_, token.is_alpha, token.is_stop))
    
# token.is_alpha is True only for purely alphabetic tokens, so the column is named is_alpha
data = pd.DataFrame(rows, columns=["text", "lemma", "part_of_speech", "tag", "dependency", "shape", "is_alpha", "is_stopword"])
data.head()

### Named Entity Recognition

In [None]:
# example from spacy docs
import en_core_web_sm

# load the model first, then parse the sentence with it
nlp = en_core_web_sm.load()
doc = nlp(u"Steve Jobs and Apple is looking at buying U.K. startup for $1 billion")

for ent in doc.ents:
    print(ent.text, ent.start_char, ent.end_char, ent.label_)

In [None]:
# visualize this using displacy:
from spacy import displacy
displacy.render(doc, style="ent", jupyter=True)

# Word Embeddings (word2vec Introduction) from Intro to Algorithmic Marketing

## Continuous Bag of Words (Use Context to Predict Target Word)
![Continuous bag of words (CBOW) architecture](https://raw.githubusercontent.com/ychennay/dso-560-nlp-text-analytics/main/images/word2vec_cbow.png)

## Softmax
![Softmax function](https://raw.githubusercontent.com/ychennay/dso-560-nlp-text-analytics/main/images/softmax.png)
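
In the basic word2vec formulation, the output layer turns one raw score per vocabulary word into a probability using softmax. A minimal sketch in plain Python, with made-up scores; the helper name `softmax` and the three example scores are assumptions for illustration only.

```python
import math

def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    # Subtracting the max first avoids overflow in exp() and does not change the result.
    max_score = max(scores)
    exps = [math.exp(s - max_score) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical scores for three candidate target words.
probs = softmax([2.0, 1.0, 0.1])
print(probs)
```

The largest score receives the largest probability, and the ordering of the scores is preserved.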

## Skipgram
![Skip-gram architecture](https://raw.githubusercontent.com/ychennay/dso-560-nlp-text-analytics/main/images/skipgram.png)
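
CBOW and skip-gram differ mainly in how training examples are built from a sliding window: CBOW predicts the target word from its context words, while skip-gram predicts each context word from the target. A rough sketch of that pair construction, assuming a symmetric window; the function name `training_pairs` and the toy sentence are hypothetical.

```python
def training_pairs(tokens, window=2):
    """Build (context, target) examples for CBOW and (target, context_word) pairs for skip-gram."""
    cbow, skipgram = [], []
    for i, target in enumerate(tokens):
        # Words within `window` positions on either side of the target.
        context = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
        cbow.append((context, target))
        skipgram.extend((target, c) for c in context)
    return cbow, skipgram

cbow, skipgram = training_pairs(["the", "cat", "sat", "down"], window=1)
print(cbow)
print(skipgram)
```

This only prepares the training data; the neural network that learns the embeddings from these pairs is not shown.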

## Word Embedding Clusters
![Clusters of related words in embedding space](https://raw.githubusercontent.com/ychennay/dso-560-nlp-text-analytics/main/images/wordembedding_cluster.png)

In [None]:
import en_core_web_md
import spacy
from scipy.spatial.distance import cosine
# use the medium model: en_core_web_sm does not ship with static word vectors
nlp = en_core_web_md.load()

In [None]:
tokens = nlp(u'dog cat Beijing sad depressed couch sofa canine China Chinese France Paris banana')

for token1 in tokens:
    for token2 in tokens:
        if token1 != token2:
            print(f" {token1} - {token2}: {1 - cosine(token1.vector, token2.vector)}")
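
Note that SciPy's `cosine` returns a cosine *distance*, which is why the loop above prints `1 - cosine(...)` to get a similarity. A minimal sketch of the same quantity in plain Python, using toy vectors rather than real embeddings; the helper name `cosine_similarity_manual` is hypothetical.

```python
import math

def cosine_similarity_manual(u, v):
    """Cosine similarity: dot(u, v) / (|u| * |v|)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy 3-dimensional vectors (made up for illustration, not real embeddings).
same_direction = cosine_similarity_manual([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
orthogonal = cosine_similarity_manual([1.0, 0.0, 0.0], [0.0, 1.0, 0.0])
print(same_direction, orthogonal)
```

Vectors pointing in the same direction score close to 1 regardless of their length, and perpendicular vectors score 0.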

# Finding Most Similar Words (Using Our Old Methods)

In [None]:
from sklearn.feature_extraction.text import CountVectorizer

# inspect the default settings for CountVectorizer
CountVectorizer()

In [None]:
# read one review per line, closing the file afterwards
with open("poor_amazon_toy_reviews.txt") as f:
    reviews = f.readlines()

vectorizer = CountVectorizer(ngram_range=(1, 1),
                             stop_words="english",
                             max_features=500,
                             token_pattern=r'(?u)\b[a-zA-Z][a-zA-Z]+\b')
X = vectorizer.fit_transform(reviews)

# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
data = pd.DataFrame(X.toarray(), columns=vectorizer.get_feature_names_out())
data.head()

In [None]:
from sklearn.metrics.pairwise import cosine_similarity

# create similarity matrix (word-by-word, so compare the columns of `data`)
vocabulary = vectorizer.get_feature_names_out()
similarity_matrix = pd.DataFrame(cosine_similarity(data.T.values),
                                 columns=vocabulary,
                                 index=vocabulary)

In [None]:
# unstack matrix into table
similarity_table = similarity_matrix.rename_axis(None).rename_axis(None, axis=1).stack().reset_index()

In [None]:
# rename columns
similarity_table.columns = ["word1", "word2", "similarity"]
similarity_table.shape

In [None]:
# drop self-pairs (similarity 1.0) and any other pairs scoring 0.99 or above
similarity_table = similarity_table[similarity_table["similarity"] < 0.99]
similarity_table.shape

In [None]:
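# each pair appears twice, as (word1, word2) and (word2, word1); dropping duplicate
# scores keeps one of them (it also drops distinct pairs that share the same score)
similarity_table.sort_values(by="similarity", ascending=False).drop_duplicates(
    subset="similarity", keep="first").head(10)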

In [None]:
top_500_words = vectorizer.get_feature_names_out()

# Exercise: Similar Words Using Word Embeddings

In [None]:
# load into spacy your top 500 words
tokens = nlp(" ".join(top_500_words))

In [None]:
from itertools import product
# create a list of similarity tuples

similarity_tuples = []

for token1, token2 in product(tokens, repeat=2):
    similarity_tuples.append((token1, token2, token1.similarity(token2)))

similarities = pd.DataFrame(similarity_tuples, columns=["word1","word2", "score"])


In [None]:
# find similar words
similarities[similarities["score"] < 1].sort_values(
    by="score", ascending=False).drop_duplicates(
    subset="score", keep="first").head(5)

# CW for week4:

1. Which pair of words from the list of four below would have the closest similarity score using word2vec? Explain why based on your understanding of word2vec. *happy, hoppy, cheerful, derecha (Spanish word for right)*
- (happy, cheerful). word2vec learns a word's vector from the contexts it appears in, so synonyms, which are used in similar contexts, end up with similar vectors.

2. Identify all the named entities in the following document:
Obama will return to the White House for the first time as Democrats look ahead to midterm elections
- Obama, the White House, Democrats, midterm elections.

3. Write a named or unnamed capture group to extract the user names of email addresses (the part before the @).
- `r'(\w+)@\w+\.(?:com|net)'`

## verProf
2. `midterm elections` may not be a named entity.
3. Don't forget the word boundary (`\b`).
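
Following the feedback on question 3, one possible version of the answer with word boundaries added is sketched below, in both unnamed and named form. The sample text and the group name `user` are made up for illustration, and `\w+` is kept from the original answer, so user names containing dots or hyphens would only be partly captured.

```python
import re

text = "Contact alice@example.com or bob@mail.net today."

# Unnamed capture group, with \b so the match starts and ends on a word boundary.
unnamed = re.compile(r"\b(\w+)@\w+\.(?:com|net)\b")
# Named capture group: the same pattern, with the group labelled "user".
named = re.compile(r"\b(?P<user>\w+)@\w+\.(?:com|net)\b")

unnamed_users = unnamed.findall(text)
named_users = [m.group("user") for m in named.finditer(text)]
print(unnamed_users, named_users)
```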