# spaCy

## Models

spaCy ships several pretrained models per language. For instance, the English models are listed [here](https://spacy.io/models/en). Each model must be downloaded separately:

```bash
python3 -m spacy download en_core_web_sm
python3 -m spacy download en_core_web_md
```

## Pattern Matching Using Spacy

The code and example below are from Ashiq KS's article [Rule-Based Matching with spacy](https://medium.com/@ashiqgiga07/rule-based-matching-with-spacy-295b76ca2b68):

In [8]:
#The input text string is converted to a Document object
text = '''
Computer programming is the process of writing instructions that get executed by computers. 
The instructions, also known as code, are written in a programming language which the computer 
can understand and use to perform a task or solve a problem. Basic computer programming involves 
the analysis of a problem and development of a logical sequence of instructions to solve it. 
There can be numerous paths to a solution and the computer programmer seeks to design and 
code that which is most efficient. Among the programmer’s tasks are understanding requirements, 
determining the right programming language to use, designing or architecting the solution, coding, 
testing, debugging and writing documentation so that the solution can be easily
understood by other programmers. Computer programming is at the heart of computer science. It is the 
implementation portion of software development, application development 
and software engineering efforts, transforming ideas and theories into actual, working solutions.
'''

In [10]:
import spacy
from spacy.matcher import Matcher  # rule-based token matcher
# import the Span class to extract the words from the document object
from spacy.tokens import Span

# load the English pipeline 'en_core_web_sm'
nlp = spacy.load("en_core_web_sm")

doc = nlp(text) # convert the string above to a document

#instantiate a new Matcher class object 
matcher = Matcher(nlp.vocab)

### Define the Target Pattern

The `pattern` object should be a list of dictionaries, each describing one token to match.

Here, we match the word `computer` (compared in lowercase and tagged as a noun), followed by any token that is not a verb.

In [None]:
# define the pattern: "computer" as a noun, then any non-verb token
pattern = [{'LOWER': 'computer', 'POS': 'NOUN'},
           {'POS': {'NOT_IN': ['VERB']}}]


### Load the Pattern into the Matcher

In [None]:
# add the pattern to the previously created matcher object
# (list-of-patterns form used by spaCy v3; spaCy v2 used matcher.add("Matching", None, pattern))
matcher.add("Matching", [pattern])
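
The snippet above registers the pattern but never applies it. As a minimal sketch of that last step, the example below runs a `Matcher` over a short sentence. To stay runnable without a downloaded model it assumes a blank English pipeline (`spacy.blank("en")`), which has no part-of-speech tagger, so the pattern here uses `LOWER` and `IS_PUNCT` rather than `POS`. The pattern name `ComputerPhrase` and the sentence are illustrative, and the list-of-patterns form of `Matcher.add` is assumed.

```python
import spacy
from spacy.matcher import Matcher

# Blank pipeline: tokenizer only, no tagger (assumption made to avoid a model download)
nlp = spacy.blank("en")
matcher = Matcher(nlp.vocab)

# "computer" in any casing, followed by any token that is not punctuation
pattern = [{"LOWER": "computer"}, {"IS_PUNCT": False}]
matcher.add("ComputerPhrase", [pattern])

doc = nlp("Computer programming is at the heart of computer science.")

# Each match is a (match_id, start_token, end_token) tuple; sort by start token
matches = sorted(matcher(doc), key=lambda m: m[1])
matched_texts = [doc[start:end].text for _, start, end in matches]
print(matched_texts)
```

With the tagged `en_core_web_sm` pipeline, the same loop works unchanged for the `POS`-based pattern defined earlier.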

## Using Regular Expressions in Spacy

The example below can be found at https://spacy.io/usage/rule-based-matching. It uses `re.finditer()` to iterate over every regex match in the document text, then maps each match's character offsets back to a token span with `doc.char_span()`.

In [7]:
import spacy
import re
nlp = spacy.load("en_core_web_sm")
doc = nlp("The United States of America (USA) are commonly known as the United States (U.S. or US) or America.")

expression = r"[Uu](nited|\.?) ?[Ss](tates|\.?)"
for match in re.finditer(expression, doc.text):
    start, end = match.span()
    span = doc.char_span(start, end)
    # This is a Span object or None if match doesn't map to valid token sequence
    if span is not None:
        print("Found match:", span.text)

Found match: United States
Found match: United States
Found match: U.S.
Found match: US
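
To see what `doc.char_span()` filters out, here is the same expression run with `re` alone, without spaCy. This is only an illustrative sketch. The regex also matches the `US` inside `USA`. That match is missing from the output above, presumably because it ends in the middle of a token, so `doc.char_span()` returns `None` for it.

```python
import re

text = ("The United States of America (USA) are commonly known as "
        "the United States (U.S. or US) or America.")
expression = r"[Uu](nited|\.?) ?[Ss](tates|\.?)"

# Collect every raw regex match together with its character offsets
raw_matches = [(m.group(), m.span()) for m in re.finditer(expression, text)]
for matched_text, (start, end) in raw_matches:
    print(f"{start:3d}-{end:3d}: {matched_text}")
```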


## Part of Speech Tagging

In [None]:
!python3 -m spacy download en_core_web_sm
!python3 -m spacy download en_core_web_md

In [None]:
import spacy

nlp = spacy.load('en_core_web_md')

In [3]:
import pandas as pd

rows = []
doc = nlp(u"Steve Jobs and Apple is looking at buying U.K. startup for $1 billion")
for token in doc:
    rows.append((token.text, token.lemma_, token.pos_, token.tag_, token.dep_,
                 token.shape_, token.is_alpha, token.is_stop))

# is_alpha is True when the token consists of alphabetic characters only
data = pd.DataFrame(rows, columns=["text", "lemma", "part_of_speech", "tag", "dependency", "shape", "is_alpha", "is_stopword"])
data.head()

| | text | lemma | part_of_speech | tag | dependency | shape | is_alpha | is_stopword |
|---|---|---|---|---|---|---|---|---|
| 0 | Steve | steve | PROPN | NNP | compound | Xxxxx | True | False |
| 1 | Jobs | jobs | PROPN | NNP | nsubj | Xxxx | True | False |
| 2 | and | and | CCONJ | CC | cc | xxx | True | True |
| 3 | Apple | apple | PROPN | NNP | conj | Xxxxx | True | False |
| 4 | is | be | VERB | VBZ | aux | xx | True | True |


## Named Entity Recognition

In [4]:
# reuse the `nlp` pipeline loaded above
doc = nlp(u"Steve Jobs and Apple is looking at buying U.K. startup for $1 billion")

for ent in doc.ents:
    print(ent.text, ent.start_char, ent.end_char, ent.label_)

Steve Jobs 0 10 PERSON
Apple 15 20 ORG
U.K. 42 46 GPE
$1 billion 59 69 MONEY


In [5]:
# visualize this using displacy:
from spacy import displacy
displacy.render(doc, style="ent", jupyter=True)

# Word Embeddings (word2vec Introduction)

## Continuous Bag of Words (Use Context to Predict Target Word)
![Continuous bag-of-words architecture](images/word2vec_cbow.png)

## Softmax
![Softmax function](images/softmax.png)
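
For reference, softmax turns a vector of raw scores into a probability distribution. In word2vec it is applied over the vocabulary to score candidate words. A minimal pure-Python sketch follows; the function name `softmax` and the sample scores are illustrative and not taken from the original notebook.

```python
import math

def softmax(scores):
    """Convert raw scores into probabilities that sum to 1."""
    # Subtract the largest score first for numerical stability
    max_score = max(scores)
    exps = [math.exp(s - max_score) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

probabilities = softmax([2.0, 1.0, 0.1])
print(probabilities)
```

Larger scores receive larger probabilities, and equal scores receive equal shares.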

## Skip-gram (Use Target Word to Predict Context)
![Skip-gram architecture](images/skipgram.png)
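
The difference between the two architectures shows up in how training examples are built from a sentence. The sketch below is only an illustration of that pairing step, not of the networks themselves. The helper names `cbow_pairs` and `skipgram_pairs`, the window size, and the sentence are all made up for this example.

```python
def cbow_pairs(tokens, window=2):
    """Yield (context_words, target_word) pairs, as CBOW sees them."""
    for i, target in enumerate(tokens):
        context = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
        yield context, target

def skipgram_pairs(tokens, window=2):
    """Yield (target_word, context_word) pairs, as skip-gram sees them."""
    for context, target in cbow_pairs(tokens, window):
        for word in context:
            yield target, word

sentence = "the quick brown fox jumps".split()
cbow = list(cbow_pairs(sentence, window=1))
skipgram = list(skipgram_pairs(sentence, window=1))
print(cbow[1])       # (['the', 'brown'], 'quick')
print(skipgram[:3])  # [('the', 'quick'), ('quick', 'the'), ('quick', 'brown')]
```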

## Word Embedding Clusters
![Clusters of related words in embedding space](images/wordembedding_cluster.png)

In [50]:
import en_core_web_sm
from scipy.spatial.distance import cosine

# note: the small (sm) models ship without static word vectors;
# en_core_web_md or en_core_web_lg include them
nlp = en_core_web_sm.load()

In [51]:
tokens = nlp(u'dog cat Beijing sad depressed couch sofa canine China Chinese France Paris banana')

for token1 in tokens:
    for token2 in tokens:
        if token1 != token2:
            print(f" {token1} - {token2}: {1 - cosine(token1.vector, token2.vector)}")

 dog - cat: 0.4564264118671417
 dog - Beijing: 0.1571345329284668
 dog - sad: 0.3079860210418701
 dog - depressed: 0.11385080963373184
 dog - couch: 0.5404482483863831
 dog - sofa: 0.33240464329719543
 dog - canine: 0.4633784294128418
 dog - China: 0.0019485866650938988
 dog - Chinese: 0.021737948060035706
 dog - France: 0.1857185661792755
 dog - Paris: 0.11601343750953674
 dog - banana: 0.3103766441345215
 cat - dog: 0.4564264118671417
 cat - Beijing: 0.25583046674728394
 cat - sad: 0.06742441654205322
 cat - depressed: 0.11650095880031586
 cat - couch: 0.37735462188720703
 cat - sofa: 0.414833128452301
 cat - canine: 0.45437631011009216
 cat - China: 0.14348067343235016
 cat - Chinese: 0.03203266113996506
 cat - France: 0.26350462436676025
 cat - Paris: 0.1825326830148697
 cat - banana: 0.4973468482494354
 Beijing - dog: 0.1571345329284668
 Beijing - cat: 0.25583046674728394
 Beijing - sad: 0.16756749153137207
 Beijing - depressed: 0.020596839487552643
 Beijing - couch: 0.17647489905
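
scipy's `cosine` returns a *distance*, which is why the loop above computes `1 - cosine(...)` to get a similarity. A minimal pure-Python sketch of the same quantity follows; the function name `cosine_sim` and the toy vectors are made up for illustration.

```python
import math

def cosine_sim(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

same_direction = cosine_sim([1.0, 2.0], [2.0, 4.0])  # parallel vectors
orthogonal = cosine_sim([1.0, 0.0], [0.0, 1.0])      # perpendicular vectors
print(same_direction, orthogonal)
```

Values near 1 mean the vectors point the same way, and values near 0 mean they are unrelated in direction.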

# Finding Most Similar Words (Using Our Old Methods)

In [53]:
from sklearn.feature_extraction.text import CountVectorizer

# inspect the default settings for CountVectorizer
CountVectorizer()

CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,
        strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
        tokenizer=None, vocabulary=None)

In [54]:
with open("poor_amazon_toy_reviews.txt") as f:
    reviews = f.readlines()

vectorizer = CountVectorizer(ngram_range=(1, 1),
                             stop_words="english",
                             max_features=500,
                             token_pattern=r'(?u)\b[a-zA-Z][a-zA-Z]+\b')
X = vectorizer.fit_transform(reviews)

# on scikit-learn >= 1.0, use vectorizer.get_feature_names_out() instead
data = pd.DataFrame(X.toarray(), columns=vectorizer.get_feature_names())
data.head()

Unnamed: 0,able,absolutely,actual,actually,adult,advertised,ago,air,amazon,apart,...,working,works,worse,worst,worth,wouldn,wrong,year,years,zero
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
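
Conceptually, `CountVectorizer` tokenizes each document and counts how often each vocabulary word occurs in it. Below is a rough pure-Python sketch of that idea, assuming the same letters-only token pattern as above. It ignores stop words and `max_features`, and the helper name `bag_of_words` and the two sample reviews are made up.

```python
import re
from collections import Counter

# Same idea as the token_pattern above: words of two or more letters
TOKEN_PATTERN = re.compile(r"\b[a-zA-Z][a-zA-Z]+\b")

def bag_of_words(documents):
    """Return (vocabulary, rows), where each row holds per-word counts."""
    counts = [Counter(TOKEN_PATTERN.findall(doc.lower())) for doc in documents]
    vocabulary = sorted(set().union(*counts))
    rows = [[c[word] for word in vocabulary] for c in counts]
    return vocabulary, rows

vocabulary, rows = bag_of_words(["Broke fast, very fast", "Broke after a week"])
print(vocabulary)  # ['after', 'broke', 'fast', 'very', 'week']
print(rows)        # [[0, 1, 2, 1, 0], [1, 1, 0, 0, 1]]
```

Each row corresponds to one document and each column to one word, which is the layout of the `data` table above.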


In [55]:
from sklearn.metrics.pairwise import cosine_similarity

# create similarity matrix: cosine similarity between word columns
similarity_matrix = pd.DataFrame(cosine_similarity(data.T.values),
                                 columns=vectorizer.get_feature_names(),
                                 index=vectorizer.get_feature_names())

In [56]:
# unstack matrix into table
similarity_table = similarity_matrix.rename_axis(None).rename_axis(None, axis=1).stack().reset_index()

In [57]:
# rename columns
similarity_table.columns = ["word1", "word2", "similarity"]
similarity_table.shape

(250000, 3)

In [58]:
# drop self-pairs, whose similarity is 1.0 (the 500 diagonal entries)
similarity_table = similarity_table[similarity_table["similarity"] < 0.99]
similarity_table.shape

(249500, 3)

In [59]:
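# mirrored pairs (a, b) and (b, a) share a score, so deduplicating on the score
# keeps one of each (it also drops unrelated pairs that happen to tie exactly)
similarity_table.sort_values(by="similarity", ascending=False).drop_duplicates(
    subset="similarity", keep="first").head(10)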

| | word1 | word2 | similarity |
|---|---|---|---|
| 144497 | old | year | 0.754569 |
| 194593 | service | customer | 0.734095 |
| 237276 | waste | money | 0.655483 |
| 245419 | working | stopped | 0.589414 |
| 74650 | figure | figures | 0.553856 |
| 5949 | arm | train | 0.500379 |
| 172826 | quality | poor | 0.469081 |
| 179083 | remote | control | 0.45693 |
| 210121 | store | dollar | 0.426935 |
| 4648 | apart | fell | 0.385922 |


In [60]:
top_500_words = vectorizer.get_feature_names()

# Exercise: Similar Words Using Word Embeddings

In [61]:
# load your top 500 words into spaCy as a single space-separated string
tokens = nlp(" ".join(top_500_words))

In [62]:
from itertools import product

# create a list of (word1, word2, score) tuples for every ordered pair of tokens
similarity_tuples = []

for token1, token2 in product(tokens, repeat=2):
    similarity_tuples.append((token1.text, token2.text, token1.similarity(token2)))

similarities = pd.DataFrame(similarity_tuples, columns=["word1", "word2", "score"])


In [63]:
# find similar words
similarities[similarities["score"] < 1].sort_values(
    by="score", ascending=False).drop_duplicates(
    subset="score", keep="first").head(5)

| | word1 | word2 | score |
|---|---|---|---|
| 108082 | inches | figures | 0.833207 |
| 76607 | figures | packs | 0.825721 |
| 211253 | stop | start | 0.816715 |
| 48523 | damaged | popped | 0.812956 |
| 108233 | inches | packs | 0.812329 |
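
Both similarity tables above are built from all ordered pairs and then filtered to remove self-pairs and mirrored duplicates. One alternative, sketched here under the assumption that the similarity measure is symmetric, is to enumerate unordered pairs up front with `itertools.combinations`. The word list and the `toy_similarity` scoring function are placeholders standing in for the real tokens and `token1.similarity(token2)`.

```python
from itertools import combinations, product

words = ["dog", "cat", "sofa", "couch"]

def toy_similarity(a, b):
    """Placeholder score: share of distinct letters the two words have in common."""
    letters_a, letters_b = set(a), set(b)
    return len(letters_a & letters_b) / len(letters_a | letters_b)

# product() includes (a, a) and both (a, b) and (b, a)
ordered_pairs = list(product(words, repeat=2))
# combinations() yields each distinct pair exactly once
unordered_pairs = list(combinations(words, 2))

scored = sorted(((a, b, toy_similarity(a, b)) for a, b in unordered_pairs),
                key=lambda row: row[2], reverse=True)
print(len(ordered_pairs), len(unordered_pairs))  # 16 6
print(scored[0])
```

For 500 words this gives 124,750 pairs instead of 250,000, and no deduplication on the score is needed afterwards.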
