## NLP Using spaCy

In my time working on NLP applications, I have found that spaCy makes most kinds of NLP work noticeably easier.<br>

This notebook is meant to get you started with spaCy.


So, here are some of the things that you can do when you use spaCy.<br>

<h1>Table of Contents</h1>

1. [Tokenisation in spaCy](https://www.kaggle.com/charlessamuel/nlp-using-spacy-a-simple-explanation#Tokenisation)
2. [Removing stop words using spaCy](https://www.kaggle.com/charlessamuel/nlp-using-spacy-a-simple-explanation#Removing-Stopwords-in-Sentences)
3. [Normalisation in spaCy](https://www.kaggle.com/charlessamuel/nlp-using-spacy-a-simple-explanation#Normalisation)
4. [NER in spaCy](https://www.kaggle.com/charlessamuel/nlp-using-spacy-a-simple-explanation#NER-in-spaCy)

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load
# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 5GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

In [None]:
import spacy
import pandas as pd
import numpy as np

nlp = spacy.load('en_core_web_sm')  # full package name; the 'en' shortcut was removed in spaCy 3

This cell loads the basic English model. spaCy also ships models for other languages, identified by language codes such as:

* French (fr)
* Chinese (zh)
* German (de)

And many more. Head over [here](https://spacy.io/models) to check them all out.<br>

If you want a more comprehensive model, run the following cell to download it.

In [None]:
!python -m spacy download en_core_web_md

Here, 'md' stands for medium. The other sizes are 'sm' (small, already present in Kaggle) and 'lg' (large).
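
Downloading a model does not switch the notebook over to it; it still has to be loaded by name. Below is a minimal sketch of loading whichever model is installed, assuming the package names en_core_web_md and en_core_web_sm. The helper load_pipeline and the variable nlp_md are hypothetical names used only for illustration, not part of spaCy or of this notebook.

```python
import spacy

def load_pipeline(preferred="en_core_web_md", fallback="en_core_web_sm"):
    """Return the first installed model, or a blank English pipeline."""
    for name in (preferred, fallback):
        # is_package checks whether a model package is installed via pip
        if spacy.util.is_package(name):
            return spacy.load(name)
    # No trained model found: tokeniser only, no trained components
    return spacy.blank("en")

nlp_md = load_pipeline()
print(nlp_md.pipe_names)  # components available in the loaded pipeline
```

The fallback to a blank pipeline is a design choice for this sketch; in a real project you may prefer to fail loudly when the model is missing.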

In [None]:
df = pd.read_csv('../input/newyork-room-rentalads/room-rental-ads.csv')
df.head()

We now have the rental ads loaded into a DataFrame, which we will clean up later in this notebook.

# Tokenisation

One of the basic spaCy operations is tokenisation. This is where a sentence is split up into individual units called tokens. For instance, if you have the sentence:<br>

I like planes and tanks <br>

The sentence is split into:<br>

I<br>
like<br>
planes<br>
and<br>
tanks<br>

It's one of the easiest things to implement in spaCy.

In [None]:
def tokenise(msg):
    doc = nlp(msg)
    
    for token in doc:
        print(token)

In [None]:
tokenise("I like planes and tanks")

Tokenisation is the basis of all other spaCy operations, since they all work on the tokens produced here.
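
If you want the tokens back as data instead of printed output, a list comprehension is enough. This is a small sketch that uses a blank English pipeline (tokeniser only), so it runs without any model download; blank_nlp and token_texts are names made up for this example.

```python
import spacy

# A blank English pipeline has the tokeniser but no trained components
blank_nlp = spacy.blank("en")

def token_texts(msg):
    """Return the token texts of msg as a list of strings."""
    return [token.text for token in blank_nlp(msg)]

print(token_texts("I like planes and tanks"))
print(token_texts("I don't like delays, though!"))
```

In the second sentence, punctuation marks and the contraction are split off as their own tokens, which is why tokenising is more than splitting on spaces.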

# Removing Stopwords in Sentences

Stopwords are very common words (such as "the", "a" or "for") that contribute little to the overall meaning of a sentence. Removing them is a big part of cleaning text data, and it is once again really simple to do by building on our tokenisation function from earlier. <br>

token.is_stop is True if the token is a stop word and False otherwise. <br>

The following cell shows the stop words that are available for English. The list is defined per language, so it is the same for every model size.

In [None]:
from spacy.lang.en.stop_words import STOP_WORDS

print(STOP_WORDS)

## Stopwords Function

In [None]:
def stop_remove(msg):
    doc = nlp(msg)
    
    for token in doc:
        # Print only the tokens that are not stop words
        if not token.is_stop:
            print(token)

In [None]:
stop_remove("I have liked reading books for a good few years now")

Stop words successfully filtered out.
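
For further processing it is usually handier to collect the remaining tokens than to print them. A minimal sketch, again on a blank English pipeline since is_stop does not need a trained model; stop_nlp, without_stop_words and kept are hypothetical names for this example.

```python
import spacy

# is_stop is a lexical attribute, so the tokeniser alone is enough
stop_nlp = spacy.blank("en")

def without_stop_words(msg):
    """Return the texts of all tokens in msg that are not stop words."""
    return [token.text for token in stop_nlp(msg) if not token.is_stop]

kept = without_stop_words("I have liked reading books for a good few years now")
print(kept)
```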

# Normalisation

This is something I personally use to clean up the text in the data. It is a really simple function that also includes our stop_word condition along with a few more conditions:

* token.is_digit - Used to remove tokens that consist only of digits
* token.is_punct - Used to remove punctuation
* token.is_oov - Used to remove out-of-vocabulary words, i.e. words the model has no word vector for

In [None]:
def normalize(msg):
    
    doc = nlp(msg)
    res=[]
    
    for token in doc:
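        # is_oov relies on word vectors: with a model that has none (such as 'sm' in recent spaCy versions), every token counts as out-of-vocabulary
        if(token.is_stop or token.is_digit or token.is_punct or token.is_oov):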
            pass
        else:
            res.append(token.lemma_.lower()) #Lower case of the token lemma is added
    
    return " ".join(res)
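
To see what the conditions above do, it helps to print the flags for each token of a sample sentence. This sketch uses a blank English pipeline and leaves out is_oov, because that flag depends on the word vectors of the loaded model; flag_nlp, sample_doc and flags are names invented for the example.

```python
import spacy

flag_nlp = spacy.blank("en")
sample_doc = flag_nlp("I paid 1200 dollars, really!")

# Map each token text to its (is_stop, is_digit, is_punct) flags
flags = {
    token.text: (token.is_stop, token.is_digit, token.is_punct)
    for token in sample_doc
}

for text, (stop, digit, punct) in flags.items():
    print(f"{text:10} stop={stop!s:6} digit={digit!s:6} punct={punct}")
```

A token is dropped by the normalisation function as soon as any one of its flags is True.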

We can apply our normalisation function to the Description column in the dataset. This dataset in particular has special characters which can be removed using RegEx. The function that I used in this notebook can be found [here](https://www.kaggle.com/charlessamuel/rentals-in-the-big-apple-xgboost#NLP-Work)

In [None]:
import re

def normalize_1(msg):
    
    msg = re.sub('[^A-Za-z]+', ' ', str(msg)) #Replace special characters and integers with a space
    doc = nlp(msg)
    res=[]
    for token in doc:
        #Remove stop words, punctuation, currency symbols, spaces and tokens of one or two characters
        if(token.is_stop or token.is_punct or token.is_currency or token.is_space or len(token.text) <= 2):
            pass
        else:
            res.append(token.lemma_.lower())
    return res
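
The regular expression step can be tried on its own to see what it leaves behind. The sketch below isolates it in a hypothetical helper called letters_only; the sample strings are made up and do not come from the dataset.

```python
import re

def letters_only(msg):
    """Replace every run of non-letter characters with a single space."""
    return re.sub('[^A-Za-z]+', ' ', str(msg))

print(letters_only("Rent: $1,200/month!! Call 555-0199"))
print(letters_only(float("nan")))  # a missing value becomes the text "nan"
```

Because of str(msg), a missing description (NaN) turns into the word "nan" and passes through the cleaning like any other word. If the column has missing values, filling them with an empty string first, for example with df['Description'].fillna(''), would avoid that.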

In [None]:
df['Description'] = df['Description'].apply(normalize_1)
df.head()

The result is kept as a list of tokens so that word counts are much easier to compute.
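
As a rough sketch of such a count, collections.Counter can tally the words across all token lists. The lists below are hypothetical stand-ins shaped like the output of normalize_1, not real rows from the dataset.

```python
from collections import Counter

# Hypothetical token lists, shaped like the output of normalize_1
descriptions = [
    ["sunny", "room", "near", "subway"],
    ["large", "room", "quiet", "street"],
    ["room", "near", "park"],
]

# Flatten the lists and count how often each word occurs
word_counts = Counter(token for tokens in descriptions for token in tokens)
print(word_counts.most_common(2))  # [('room', 3), ('near', 2)]
```

With the notebook's DataFrame, the same Counter could be fed from the Description column after the apply step.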

# NER in spaCy

This is Named Entity Recognition (NER) using the basic English model. displaCy is spaCy's built-in visualiser; calling displacy.render with style='ent' highlights the entities that are found by the model.

In [None]:
from spacy import displacy

def displacify(msg):
    
    doc = nlp(msg)
    
    return displacy.render(doc, style='ent')

In [None]:
displacify("spaCy is an open-source software library for advanced natural language processing, written in the programming languages Python and Cython. The library is published under the MIT license and its main developers are Matthew Honnibal and Ines Montani, the founders of the software company Explosion.")
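
Besides rendering them, the entities can be read programmatically from doc.ents. So that it runs without a trained model, the sketch below assigns the entities by hand on a blank pipeline; with the loaded model, doc.ents would hold the model's own predictions instead. The names ent_nlp, ent_doc and entities are made up for the example.

```python
import spacy
from spacy.tokens import Span

# No trained NER here: the entities are set by hand below
ent_nlp = spacy.blank("en")
ent_doc = ent_nlp("Matthew Honnibal and Ines Montani founded Explosion.")

# Hand-labelled spans (token start, token end) standing in for model predictions
ent_doc.ents = [
    Span(ent_doc, 0, 2, label="PERSON"),
    Span(ent_doc, 3, 5, label="PERSON"),
    Span(ent_doc, 6, 7, label="ORG"),
]

# Each entity exposes its text and its label
entities = [(ent.text, ent.label_) for ent in ent_doc.ents]
print(entities)
```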

These are some of the basic things I have done so far in spaCy. <br>


## Things I am planning to add:

1. Sentiment detection (have to look into this)

Anything else I can add? Let me know in the comments. Upvote if you liked this notebook :)