https://spacy.io/usage/models

In [1]:
!python -m spacy download en_core_web_lg

Collecting en-core-web-lg==3.2.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_lg-3.2.0/en_core_web_lg-3.2.0-py3-none-any.whl (777.4 MB)
     ------------------------------------ 777.4/777.4 MB 806.5 kB/s eta 0:00:00
Installing collected packages: en-core-web-lg
Successfully installed en-core-web-lg-3.2.0
[+] Download and installation successful
You can now load the package via spacy.load('en_core_web_lg')




Word2vec is a two-layer neural network that processes text. Its input is a text corpus and its output is a set of vectors: one feature vector for each word in that corpus.<br>

The purpose of Word2vec is to place the vectors of similar words close together in vector space, so that similarity between words can be measured mathematically.<br>

It trains each word against the words that neighbour it in the input corpus, either by using the context to predict a target word (continuous bag of words, CBOW) or by using a word to predict its surrounding context (skip-gram).
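To make the two training setups concrete, here is a minimal sketch that only builds the (input, label) examples each variant would train on; it does not train a network. The function names and the `window` parameter are illustrative assumptions, not part of any library.

```python
def context_windows(tokens, window=2):
    """Yield (target, context_words) for every position in the token list."""
    for i, target in enumerate(tokens):
        left = tokens[max(0, i - window):i]
        right = tokens[i + 1:i + 1 + window]
        yield target, left + right


def cbow_examples(tokens, window=2):
    # CBOW: the surrounding context is the input, the centre word is the label.
    return [(context, target) for target, context in context_windows(tokens, window)]


def skipgram_examples(tokens, window=2):
    # Skip-gram: the centre word is the input, each context word is a label.
    return [(target, c)
            for target, context in context_windows(tokens, window)
            for c in context]


sentence = "the quick brown fox jumps".split()
print(cbow_examples(sentence, window=1)[:2])
print(skipgram_examples(sentence, window=1)[:3])
```

CBOW produces one example per position (many context words, one label), while skip-gram produces one example per (centre word, context word) pair.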

Once we have the word vectors, we can use cosine similarity to measure how similar two word vectors are to each other.
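Cosine similarity is the dot product of two vectors divided by the product of their lengths, so it depends only on the angle between them. A minimal pure-Python sketch (the name `cosine_sim` is chosen here for illustration):

```python
import math


def cosine_sim(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


print(cosine_sim([1.0, 0.0], [2.0, 0.0]))   # same direction
print(cosine_sim([1.0, 0.0], [0.0, 3.0]))   # orthogonal
print(cosine_sim([1.0, 0.0], [-1.0, 0.0]))  # opposite direction
```

The three calls give 1.0, 0.0 and -1.0 respectively. The function is undefined for an all-zero vector (division by zero), which is what an out-of-vocabulary word has.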

With these vectors we can then perform vector arithmetic, e.g.:<br>
- v = king - man + woman
- this creates a vector v, and we then look for the vector most similar to it, which (apart from the input words themselves) we expect to be queen
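The arithmetic above can be tried on hand-made vectors before using real embeddings. In this sketch the four 2-d vectors are invented for illustration (axis 0 standing for "royalty", axis 1 for "gender"); they are not taken from any trained model.

```python
import math

# Hypothetical 2-d "embeddings", made up for this example only.
toy = {
    "king":  [1.0,  1.0],
    "queen": [1.0, -1.0],
    "man":   [0.0,  1.0],
    "woman": [0.0, -1.0],
}


def cosine_sim(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))


# v = king - man + woman, computed component by component
v = [k - m + w for k, m, w in zip(toy["king"], toy["man"], toy["woman"])]

# Rank every word by similarity to v, most similar first
ranked = sorted(toy, key=lambda word: cosine_sim(v, toy[word]), reverse=True)
print(v)
print(ranked)
```

Here v lands exactly on the toy "queen" vector, so "queen" ranks first. With real embeddings the match is only approximate, and the input words often rank highly too.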

In [2]:
import spacy

In [3]:
nlp = spacy.load('en_core_web_lg')

# needed to force all the lexemes to be loaded into the vocab
for s in nlp.vocab.vectors:
    _ = nlp.vocab[s]

In [5]:
nlp(u'lion').vector # Doc vector (the average of its token vectors); for a single word this is that word's vector

array([ 1.8963e-01, -4.0309e-01,  3.5350e-01, -4.7907e-01, -4.3311e-01,
        2.3857e-01,  2.6962e-01,  6.4332e-02,  3.0767e-01,  1.3712e+00,
       -3.7582e-01, -2.2713e-01, -3.5657e-01, -2.5355e-01,  1.7543e-02,
        3.3962e-01,  7.4723e-02,  5.1226e-01, -3.9759e-01,  5.1333e-03,
       -3.0929e-01,  4.8911e-02, -1.8610e-01, -4.1702e-01, -8.1639e-01,
       -1.6908e-01, -2.6246e-01, -1.5983e-02,  1.2479e-01, -3.7276e-02,
       -5.7125e-01, -1.6296e-01,  1.2376e-01, -5.5464e-02,  1.3244e-01,
        2.7519e-02,  1.2592e-01, -3.2722e-01, -4.9165e-01, -3.5559e-01,
       -3.0630e-01,  6.1185e-02, -1.6932e-01, -6.2405e-02,  6.5763e-01,
       -2.7925e-01, -3.0450e-03, -2.2400e-02, -2.8015e-01, -2.1975e-01,
       -4.3188e-01,  3.9864e-02, -2.2102e-01, -4.2693e-02,  5.2748e-02,
        2.8726e-01,  1.2315e-01, -2.8662e-02,  7.8294e-02,  4.6754e-01,
       -2.4589e-01, -1.1064e-01,  7.2250e-02, -9.4980e-02, -2.7548e-01,
       -5.4097e-01,  1.2823e-01, -8.2408e-02,  3.1035e-01, -6.33

In [37]:
tokens = nlp(u'lion cat pet')

for token1 in tokens:
    for token2 in tokens:
        print(token1.text, token2.text, token1.similarity(token2))

lion lion 1.0
lion cat 0.5265438
lion pet 0.39923766
cat lion 0.5265438
cat cat 1.0
cat pet 0.7505457
pet lion 0.39923766
pet cat 0.7505457
pet pet 1.0


In [38]:
tokens = nlp(u'like love hate') # love and hate come out as "similar" because they occur in similar contexts

for token1 in tokens:
    for token2 in tokens:
        print(token1.text, token2.text, token1.similarity(token2))

like like 1.0
like love 0.657904
like hate 0.65746516
love like 0.657904
love love 1.0
love hate 0.63930994
hate like 0.65746516
hate love 0.63930994
hate hate 1.0


In [14]:
tokens = nlp(u"dog cat nargle")

for token in tokens:
    print(token.text, token.has_vector, token.vector_norm, token.is_oov) # oov is out of vocab

dog True 7.0336733 False
cat True 6.6808186 False
nargle False 0.0 True


In [8]:
from scipy import spatial

def cosine_similarity(vec1, vec2):
    # scipy returns the cosine *distance*, so subtract it from 1 to get similarity
    return 1 - spatial.distance.cosine(vec1, vec2)

In [10]:
king = nlp.vocab['king'].vector
man = nlp.vocab['man'].vector
woman = nlp.vocab['woman'].vector

new_v = king - man + woman

similar = []
for word in nlp.vocab:
    # keep only lowercase, alphabetic lexemes that actually have a vector
    if word.has_vector and word.is_lower and word.is_alpha:
        s = cosine_similarity(new_v, word.vector)
        similar.append((word, s))

In [42]:
similar = sorted(similar, key = lambda item: -item[1]) # descending order on index 1 (most similar first)

In [43]:
print([t[0].text for t in similar[:10]])

['king', 'queen', 'prince', 'kings', 'princess', 'royal', 'throne', 'queens', 'monarch', 'kingdom']


SENTIMENT<br>

VADER (Valence Aware Dictionary and sEntiment Reasoner) is a model that is sensitive to both the polarity (+/-) and the intensity of emotion.<br>

VADER uses a dictionary which maps lexical features to emotion intensities<br>

It also takes the context of a word into consideration, e.g. "not love" is negative. It takes capitalisation into account as well, so "LOVE" is more positive than "love". Sarcasm, however, is difficult for it to detect.

In [44]:
import nltk 
nltk.download('vader_lexicon')

[nltk_data] Downloading package vader_lexicon to
[nltk_data]     C:\Users\nross\AppData\Roaming\nltk_data...


True

In [45]:
from nltk.sentiment.vader import SentimentIntensityAnalyzer

In [46]:
sid = SentimentIntensityAnalyzer()

In [48]:
a = "This is a good movie"
sid.polarity_scores(a) # neg/neu/pos are proportions of the text (each at most 1.0); compound is the normalised sum of the word valence scores, from -1 to 1

{'neg': 0.0, 'neu': 0.508, 'pos': 0.492, 'compound': 0.4404}
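As a rough sketch of where the compound value comes from: to my understanding, VADER sums the valence scores of the words it recognises and squashes that sum into the range -1 to 1 with `x / sqrt(x^2 + alpha)`, using `alpha = 15`. The snippet below assumes that "good" is the only scored word in the sentence and that its lexicon valence is 1.9; both are assumptions for illustration rather than values read from the notebook.

```python
import math


def normalize(score, alpha=15):
    """Squash an unbounded valence sum into the open interval (-1, 1)."""
    return score / math.sqrt(score * score + alpha)


# Assumption: the only scored word is "good", with valence 1.9.
compound = normalize(1.9)
print(round(compound, 4))
```

Under those assumptions this prints 0.4404, the compound score reported for "This is a good movie".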

In [49]:
b = "This was the best, most awesome movie EVER MADE!!!!!"
sid.polarity_scores(b)

{'neg': 0.0, 'neu': 0.418, 'pos': 0.582, 'compound': 0.8948}

In [50]:
c = "I hated this movie, it was dreadful!!"
sid.polarity_scores(c)

{'neg': 0.658, 'neu': 0.342, 'pos': 0.0, 'compound': -0.8264}

In [51]:
d = "The actors were good but the story line was poorly made and the end was disappointing"
sid.polarity_scores(d)

{'neg': 0.212, 'neu': 0.691, 'pos': 0.096, 'compound': -0.5187}

In [52]:
e = "Cambridge is at the heart of the high-technology Silicon Fen with industries such as software and bioscience and many start-up companies born out of the university. Over 40 per cent of the workforce have a higher education qualification, more than twice the national average. The Cambridge Biomedical Campus, one of the largest biomedical research clusters in the world includes the headquarters of AstraZeneca, a hotel, and the relocated Royal Papworth Hospital"
sid.polarity_scores(e)

{'neg': 0.0, 'neu': 1.0, 'pos': 0.0, 'compound': 0.0}

In [53]:
import pandas as pd

In [56]:
df = pd.read_csv('amazonreviews.tsv', sep='\t') # sep='\t' tells pandas the file is tab-separated

In [57]:
df.head()

Unnamed: 0,label,review
0,pos,Stuning even for the non-gamer: This sound tra...
1,pos,The best soundtrack ever to anything.: I'm rea...
2,pos,Amazing!: This soundtrack is my favorite music...
3,pos,Excellent Soundtrack: I truly like this soundt...
4,pos,"Remember, Pull Your Jaw Off The Floor After He..."


In [58]:
df['label'].value_counts()

neg    5097
pos    4903
Name: label, dtype: int64

In [60]:
# clean the data
df.dropna(inplace=True)

blanks = []
for i, lb, rv in df.itertuples():
    # (index, label, review)
    if isinstance(rv, str) and rv.isspace():
        blanks.append(i)
            
df.drop(blanks, inplace=True)

In [62]:
print(df.iloc[0]['review'])
sid.polarity_scores(df.iloc[0]['review'])

Stuning even for the non-gamer: This sound track was beautiful! It paints the senery in your mind so well I would recomend it even to people who hate vid. game music! I have played the game Chrono Cross but out of all of the games I have ever played it has the best music! It backs away from crude keyboarding and takes a fresher step with grate guitars and soulful orchestras. It would impress anyone who cares to listen! ^_^


{'neg': 0.088, 'neu': 0.669, 'pos': 0.243, 'compound': 0.9454}

In [63]:
df['scores'] = df['review'].apply(lambda review: sid.polarity_scores(review))

In [64]:
df['compound'] = df['scores'].apply(lambda d: d['compound'])

In [65]:
df.head()

Unnamed: 0,label,review,scores,compound
0,pos,Stuning even for the non-gamer: This sound tra...,"{'neg': 0.088, 'neu': 0.669, 'pos': 0.243, 'co...",0.9454
1,pos,The best soundtrack ever to anything.: I'm rea...,"{'neg': 0.018, 'neu': 0.837, 'pos': 0.145, 'co...",0.8957
2,pos,Amazing!: This soundtrack is my favorite music...,"{'neg': 0.04, 'neu': 0.692, 'pos': 0.268, 'com...",0.9858
3,pos,Excellent Soundtrack: I truly like this soundt...,"{'neg': 0.09, 'neu': 0.615, 'pos': 0.295, 'com...",0.9814
4,pos,"Remember, Pull Your Jaw Off The Floor After He...","{'neg': 0.0, 'neu': 0.746, 'pos': 0.254, 'comp...",0.9781


In [66]:
df['comp_score'] = df['compound'].apply(lambda score: 'pos' if score >= 0 else 'neg')

In [69]:
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

accuracy_score(df['label'], df['comp_score'])

0.7097

In [70]:
print(classification_report(df['label'], df['comp_score']))

              precision    recall  f1-score   support

         neg       0.86      0.52      0.64      5097
         pos       0.64      0.91      0.75      4903

    accuracy                           0.71     10000
   macro avg       0.75      0.71      0.70     10000
weighted avg       0.75      0.71      0.70     10000



In [71]:
print(confusion_matrix(df['label'], df['comp_score']))

[[2629 2468]
 [ 435 4468]]
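The accuracy, precision and recall figures in the report can be recomputed by hand from the confusion matrix printed above. In scikit-learn's layout the rows are the true labels and the columns are the predicted labels, both in sorted order (neg, pos). A short sketch in plain Python:

```python
# Confusion matrix from the cell above: rows = true label, columns = prediction.
cm = [[2629, 2468],
      [435, 4468]]
labels = ["neg", "pos"]

total = sum(sum(row) for row in cm)
accuracy = sum(cm[i][i] for i in range(len(cm))) / total
print(f"accuracy: {accuracy:.4f}")

for i, label in enumerate(labels):
    predicted_as_label = sum(row[i] for row in cm)  # column sum
    actually_label = sum(cm[i])                     # row sum
    precision = cm[i][i] / predicted_as_label
    recall = cm[i][i] / actually_label
    print(f"{label}: precision={precision:.2f} recall={recall:.2f}")
```

The low recall for neg (0.52) comes from the 2468 negative reviews that were scored as positive, which suggests that thresholding the compound score at 0 leans towards the positive class on this data.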


In [73]:
df2 = pd.read_csv('moviereviews.tsv', sep = '\t')
df2.head()

Unnamed: 0,label,review
0,neg,how do films like mouse hunt get into theatres...
1,neg,some talented actresses are blessed with a dem...
2,pos,this has been an extraordinary year for austra...
3,pos,according to hollywood movies made in last few...
4,neg,my first press screening of 1998 and already i...


In [74]:
# clean the data
df2.dropna(inplace=True)

blanks = []
for i, lb, rv in df2.itertuples():
    # (index, label, review)
    if isinstance(rv, str) and rv.isspace():
        blanks.append(i)
            
df2.drop(blanks, inplace=True)

In [79]:
df2['label'].value_counts()

neg    969
pos    969
Name: label, dtype: int64

In [80]:
sid = SentimentIntensityAnalyzer()

In [81]:
df2['scores'] = df2['review'].apply(lambda review: sid.polarity_scores(review))

In [83]:
df2['compound'] = df2['scores'].apply(lambda d: d['compound'])

In [85]:
df2['comp_score'] = df2['compound'].apply(lambda score: 'pos' if score >= 0 else 'neg')

In [88]:
accuracy_score(df2['label'], df2['comp_score'])

0.6357069143446853

In [89]:
print(classification_report(df2['label'], df2['comp_score']))

              precision    recall  f1-score   support

         neg       0.72      0.44      0.55       969
         pos       0.60      0.83      0.70       969

    accuracy                           0.64      1938
   macro avg       0.66      0.64      0.62      1938
weighted avg       0.66      0.64      0.62      1938

