In this notebook, we will explore the [spaCy](https://spacy.io/) library using the Fake News dataset.

Let us import spaCy and load its English language model.

In [None]:
import numpy as np
import pandas as pd
import spacy

# Load the small English model ('en' was a spaCy v2 shortcut for it; spaCy v3 needs the full name)
nlp = spacy.load('en_core_web_sm')

Let us look at the number of rows and columns present in the dataset.

In [None]:
df = pd.read_csv("../input/fake.csv")
df.shape

The columns are described as follows:

* uuid - Unique identifier
* ord_in_thread
* author - author of story
* published - date published
* title - title of the story
* text - text of story
* language - data from webhose.io
* crawled - date the story was archived
* site_url - site URL from BS detector
* country - data from webhose.io
* domain_rank - data from webhose.io
* thread_title
* spam_score - data from webhose.io
* main_img_url - image from story
* replies_count - number of replies
* participants_count - number of participants
* likes - number of Facebook likes
* comments - number of Facebook comments
* shares - number of Facebook shares
* type - type of website (label from BS detector)

Now let us look at the top few rows of the dataset to gain some more understanding.

In [None]:
df.head()

The columns "title", "text" and "thread_title" contain textual data. For this introduction, let us concentrate on the "title" column, so let us look at the top few rows of that column alone.

In [None]:
df["title"].head()

**Word-Level Attributes:**

Just calling "nlp" on a piece of text gets us a lot of information. Let us take an example row from the dataset and apply it.

In [None]:
txt = df["title"][1009]
txt

In [None]:
doc = nlp(txt)    
olist = []
for token in doc:
    l = [token.text,
        token.idx,
        token.lemma_,
        token.is_punct,
        token.is_space,
        token.shape_,
        token.pos_,
        token.tag_]
    olist.append(l)
    
odf = pd.DataFrame(olist)
odf.columns= ["Text", "StartIndex", "Lemma", "IsPunctuation", "IsSpace", "WordShape", "PartOfSpeech", "POSTag"]
odf

So using "nlp" we got a lot of information. The details are as follows:

* Text - Tokenized word
* StartIndex - Index at which the word starts in the sentence
* Lemma - Lemma of the word (we need not do lemmatization separately)
* IsPunctuation - Whether the given token is punctuation or not
* IsSpace - Whether the given token is just white space or not
* WordShape - Gives information about the shape of the word (upper-case letters become X, lower-case letters become x and digits become d; runs of the same character are capped at four, so an all-upper-case word gives XXXX, an all-lower-case word gives xxxx, a capitalized word gives Xxxxx and so on)
* PartOfSpeech - Coarse-grained part of speech of the word (`token.pos_`)
* POSTag - Fine-grained part-of-speech tag of the word (`token.tag_`)
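
To make the WordShape column more concrete, here is a small pure-Python sketch of how such a shape can be derived from a string. It is a simplified approximation written for this notebook, not spaCy's own implementation, and the function name `word_shape` is an assumption.

```python
def word_shape(text):
    """Approximate a spaCy-style word shape for a single token string."""
    shape = []
    last = ""  # shape character produced for the previous input character
    run = 0    # length of the current run of identical shape characters
    for char in text:
        if char.isalpha():
            shape_char = "X" if char.isupper() else "x"
        elif char.isdigit():
            shape_char = "d"
        else:
            shape_char = char  # punctuation and symbols are kept as they are
        run = run + 1 if shape_char == last else 1
        last = shape_char
        if run <= 4:  # runs of the same shape character are capped at four
            shape.append(shape_char)
    return "".join(shape)


for word in ["Queen", "NASA", "election", "2016", "U.S."]:
    print(word, "->", word_shape(word))
```

The cap on repeated characters is why long lower-case words all share the same shape, which makes the shape useful as a coarse feature.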

**Named Entity Recognition:**

A named entity is a "real-world object" that's assigned a name – for example, a person, a country, a product or a book title. 

We also get named entity recognition as part of the spaCy package. It is built into the English language model, and we can also train it on our own entity types if needed.

In [None]:
doc = nlp(txt)
olist = []
for ent in doc.ents:
    olist.append([ent.text, ent.label_])
    
odf = pd.DataFrame(olist)
odf.columns = ["Text", "EntityType"]
odf

The complete list of different entity types can be seen [here](https://spacy.io/usage/linguistic-features#entity-types)
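
Short of training a model, one lightweight way to add our own entities is a rule-based matcher. The sketch below uses spaCy's entity ruler on a blank English pipeline, so it does not touch the `nlp` object loaded earlier. It assumes spaCy v3 (where components are added by name), and the example sentence, the patterns and the names `nlp_rules` and `doc_rules` are made up for illustration. Note that this is pattern matching, not statistical training.

```python
import spacy

# A blank English pipeline: tokenizer only, no trained components,
# so no model download is needed for this sketch.
nlp_rules = spacy.blank("en")

# Assumes spaCy v3, where built-in components are added by their string name.
ruler = nlp_rules.add_pipe("entity_ruler")
ruler.add_patterns([
    # A phrase pattern: matches the exact token sequence.
    {"label": "ORG", "pattern": "BS Detector"},
    # A token pattern: matches case-insensitively, one dict per token.
    {"label": "GPE", "pattern": [{"LOWER": "great"}, {"LOWER": "britain"}]},
])

# Hypothetical sentence, not taken from the dataset.
doc_rules = nlp_rules("BS Detector flagged a story about Great Britain.")
custom_ents = [(ent.text, ent.label_) for ent in doc_rules.ents]
print(custom_ents)
```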

spaCy also includes the displaCy visualizer, which has Jupyter notebook support. It can be used to visualize the named entity recognition results.

In [None]:
from spacy import displacy
displacy.render(doc, style='ent', jupyter=True)

Wow. This one looks cool. Let us take one more example and, this time, highlight only the GPE and NORP entities with custom colors.

In [None]:
txt = df["title"][3003]
doc = nlp(txt)
colors = {'GPE': 'lightblue', 'NORP':'lightgreen'}
options = {'ents': ['GPE', 'NORP'], 'colors': colors}
displacy.render(doc, style='ent', jupyter=True, options=options)

**Noun Phrase Chunking:**

Noun chunks are "base noun phrases" – flat phrases that have a noun as their head. You can think of noun chunks as a noun plus the words describing the noun – for example, "the lavish green grass" or "the world’s largest tech fund". 

Now let us look at how to do noun phrase chunking using spaCy. In addition to the noun phrases themselves, spaCy also gives us the root word of each phrase.

In [None]:
txt = df["title"][2012]
print(txt)

In [None]:
doc = nlp(txt)
olist = []
for chunk in doc.noun_chunks:
    olist.append([chunk.text, chunk.label_, chunk.root.text])
odf = pd.DataFrame(olist)
odf.columns = ["NounPhrase", "Label", "RootWord"]
odf

**Dependency Parser:**

A dependency parser analyzes the grammatical structure of a sentence, establishing relationships between "head" words and words which modify those heads - [Stanford NLP](https://nlp.stanford.edu/software/nndep.html)

spaCy can be used to produce these dependency parses, which are useful in a variety of tasks.

In [None]:
doc = nlp(df["title"][1009])
olist = []
for token in doc:
    olist.append([token.text, token.dep_, token.head.text, token.head.pos_,
          [child for child in token.children]])
odf = pd.DataFrame(olist)
odf.columns = ["Text", "Dep", "Head text", "Head POS", "Children"]
odf

The columns are described as follows:
* Text: The original token text.
* Dep: The syntactic relation connecting child to head.
* Head text: The original text of the token head.
* Head POS: The part-of-speech tag of the token head.
* Children: The immediate syntactic dependents of the token.
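
The relationship between the "Head text" and "Children" columns can be sketched without a parser: if every token stores the index of its head, the children of a token are simply the tokens that point back at it. The parse below is hand-annotated for a made-up sentence (it is not spaCy output), and the helper name `children_of` is an assumption.

```python
# Hand-annotated toy parse: (text, dependency label, index of the head token).
# Following spaCy's convention, the root token is its own head.
tokens = [
    ("The", "det", 1),
    ("queen", "nsubj", 2),
    ("owns", "ROOT", 2),
    ("dolphins", "dobj", 2),
]


def children_of(tokens):
    """Map each token index to the indices of its immediate dependents."""
    children = {i: [] for i in range(len(tokens))}
    for i, (_, _, head) in enumerate(tokens):
        if head != i:  # skip the root, which points at itself
            children[head].append(i)
    return children


children = children_of(tokens)
for i, (text, dep, head) in enumerate(tokens):
    child_texts = [tokens[c][0] for c in children[i]]
    print(f"{text:<10}{dep:<7}{tokens[head][0]:<10}{child_texts}")
```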

The best way to understand the dependency parse is to visualize it.

In [None]:
displacy.render(doc, style='dep', jupyter=True, options={'distance': 90})

In [None]:
doc = nlp(df["title"][3012])
displacy.render(doc, style='dep', jupyter=True, options={'distance': 60})

**Word Similarity:**

spaCy also ships with word vectors, so we can use them to find similar words. The list of available models can be seen [here](https://spacy.io/models/).

For our case, let us use the 'en_core_web_lg' model available in spaCy (more details about the model can be found at this [link](https://spacy.io/models/en#en_core_web_lg)). The first step is to load the model.

In [None]:
nlp = spacy.load('en_core_web_lg')

Now we can use the cosine similarity to find the words that are similar to the word "Queen".
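
As a reminder of what the measure computes, cosine similarity is the dot product of two vectors divided by the product of their lengths, so it depends only on the angle between them. Here is a minimal pure-Python sketch; the three-dimensional vectors are made up for illustration and are not real word vectors, and the name `cosine_sim` is an assumption.

```python
import math


def cosine_sim(x, y):
    """Cosine similarity: dot(x, y) / (|x| * |y|)."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    if norm_x == 0 or norm_y == 0:
        return 0.0  # convention chosen for this sketch: zero vectors match nothing
    return dot / (norm_x * norm_y)


# Made-up toy vectors, for illustration only.
toy_vectors = {
    "queen": [0.9, 0.8, 0.1],
    "king": [0.8, 0.9, 0.1],
    "dolphin": [0.1, 0.2, 0.9],
}

# Rank the other words by their similarity to "queen", most similar first.
ranked = sorted(
    ((word, cosine_sim(toy_vectors["queen"], vec))
     for word, vec in toy_vectors.items() if word != "queen"),
    key=lambda item: -item[1],
)
print(ranked)
```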

In [None]:
from scipy import spatial
cosine_similarity = lambda x, y: 1 - spatial.distance.cosine(x, y)

queen = nlp.vocab['Queen'].vector
computed_similarities = []
for word in nlp.vocab:
    # Ignore words without vectors
    if not word.has_vector:
        continue
    similarity = cosine_similarity(queen, word.vector)
    computed_similarities.append((word, similarity))

computed_similarities = sorted(computed_similarities, key=lambda item: -item[1])
print([w[0].text for w in computed_similarities[:10]])

Different versions of king and queen came out as the top similar words. Now let us take the other important words from the sentence, "Elizabeth", "Britain" and "Dolphin", along with "King", and check their similarity to "Queen".

In [None]:
queen = nlp.vocab['Queen']
elizabeth = nlp.vocab['Elizabeth']
britain = nlp.vocab['Britain']
dolphin = nlp.vocab['Dolphin']
king = nlp.vocab['King']
 
print("Word similarity score between Queen and Elizabeth : ",queen.similarity(elizabeth))
print("Word similarity score between Queen and Britain : ",queen.similarity(britain))
print("Word similarity score between Queen and Dolphin : ",queen.similarity(dolphin))
print("Word similarity score between Queen and King : ",queen.similarity(king))

"King" is the most similar word followed by "Elizabeth" and "Britain".

**References:**
1. [Complete Guide to Spacy](https://nlpforhackers.io/complete-guide-to-spacy/)
2. [Spacy documentation](https://spacy.io/)

**More to come. Stay tuned!**