

![](https://i.pinimg.com/originals/40/b1/3d/40b13d00e57b21a195217db15e03403e.png)

This kernel introduces the basics of text processing with spaCy. If you are a beginner like me, you should find this guide helpful.

The idea is to learn basic preprocessing with spaCy, an industrial-strength Natural Language Processing library, and then level up based on the problems we are trying to solve. As this is my first kernel, I hope I don't mess up too much.

For more detail, you can also take spaCy's own course: [Advanced NLP with spaCy](https://course.spacy.io/)

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load in 

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
from bs4 import BeautifulSoup

from spacy.lang.en import English
nlp = English()

# Input data files are available in the "../input/" directory.
# For example, running this (by clicking run or pressing Shift+Enter) will list the files in the input directory

import os
print(os.listdir("../input"))

# Any results you write to the current directory are saved as output.

We will use the News Headlines Dataset for Sarcasm Detection.
You can find the dataset here: [News Headlines Dataset For Sarcasm Detection](https://www.kaggle.com/rmisra/news-headlines-dataset-for-sarcasm-detection)

In [None]:
# Load Data
print("Loading data...")
train = pd.read_json("../input/news-headlines-dataset-for-sarcasm-detection/Sarcasm_Headlines_Dataset.json", lines=True)
train = train.drop(['article_link'], axis=1)

print("Train shape:", train.shape)
train.head()

# Data Cleaning

Before we can apply machine learning algorithms we have to preprocess our text data. Here's how to clean your text data.

- Remove all irrelevant characters, such as any non-alphanumeric characters
- Tokenize your text by separating it into individual words
- Remove words that are not relevant, such as “@” Twitter mentions or URLs (if any)
- Convert all characters to lowercase (**case folding**) so that words such as “hello”, “Hello”, and “HELLO” are treated the same
- Consider combining misspelled or alternately spelled words into a single representation (e.g. “cool”/“kewl”/“cooool”)
- Consider lemmatization (reducing words such as “am”, “are”, and “is” to a common form such as “be”)
- Consider removing stopwords (such as a, an, the, be, etc.)

**Note** : For tasks like speech recognition and information retrieval, everything is mapped to lower case. For sentiment analysis and other text classification tasks, information extraction, and machine translation, by contrast, case is quite helpful and case folding is generally not done (losing the difference, for example, between **US the country and us the pronoun** can outweigh the advantage in generality that case folding provides)
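
To make the steps above concrete, here is a minimal sketch that uses only the Python standard library. The `basic_clean` function and the tiny `STOPWORDS` set are hypothetical names made up for illustration; this is not the cleaning function used later in this kernel, and it always lowercases, which (as the note says) is not right for every task.

```python
import re

# Tiny illustrative stop-word list (an assumption; real lists are much longer)
STOPWORDS = {"a", "an", "the", "be", "is", "are", "to", "of"}

def basic_clean(text):
    """Apply the cleaning steps above using only the standard library."""
    text = text.lower()                              # case folding
    text = re.sub(r"https?://\S+|@\w+", " ", text)   # drop URLs and @mentions
    text = re.sub(r"[^a-z0-9\s]", " ", text)         # drop non-alphanumeric characters
    tokens = text.split()                            # simple whitespace tokenization
    return [tok for tok in tokens if tok not in STOPWORDS]

print(basic_clean("Hello @user, the BEST guide is at https://example.com !!!"))
# ['hello', 'best', 'guide', 'at']
```

Splitting on whitespace is the crudest possible tokenizer; spaCy's tokenizer, used below, handles punctuation and contractions much more carefully.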

In [None]:
# Check the first headline

print('The first headline is:\n\n', train["headline"][0])

Let's create a cleanData function. 
This function removes stopwords and punctuation, converts text to lowercase, and performs lemmatization.

**Stopwords**: Stop words are words that are particularly common in a text corpus and thus considered as rather un-informative.

**Lemmatization**: Lemmatization removes inflectional endings and returns the base or dictionary form of a word, which is known as the lemma.
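
As a small illustration of the stopword and punctuation checks, the sketch below uses spaCy's `is_stop` and `is_punct` token flags on a blank English pipeline. The sentence and the names `nlp_blank`, `demo_doc` and `kept` are made up for this example. A blank pipeline only tokenizes, so lemmatization is deliberately left out here.

```python
import spacy

# A blank English pipeline: tokenizer and lexical attributes only, no trained model
nlp_blank = spacy.blank("en")
demo_doc = nlp_blank("the cats are sitting on the mat.")

# Keep tokens that are neither stop words nor punctuation
kept = [token.text for token in demo_doc if not token.is_stop and not token.is_punct]
print(kept)
# ['cats', 'sitting', 'mat']
```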



In [None]:
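# Function to clean data

def cleanData(doc, stemming=False):
    # `stemming` is unused; it is kept so existing calls keep working
    # Lowercase first so that stop-word matching is case-insensitive
    doc = nlp(doc.lower())
    # Drop stop words and punctuation
    tokens = [token for token in doc if not token.is_stop and not token.is_punct]
    # Use each token's lemma; fall back to the text if the pipeline
    # has no lemmatizer (lemma_ is then an empty string)
    final_token = [token.lemma_ or token.text for token in tokens]

    return " ".join(final_token)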

In [None]:
clean_review = cleanData(train['headline'][0])
clean_review

In [None]:
# Clean the headlines
print("Cleaning train data...\n")
train["headline"] = train["headline"].map(cleanData)

# Basics of spaCy

**spaCy** is a free, open-source library for advanced Natural Language Processing (NLP) in Python. 

spaCy is designed specifically for **production use** and helps you build applications that process and “understand” large volumes of text. It can be used to build information extraction or natural language understanding systems, or to **pre-process text for deep learning**.

Let's get started

In [None]:
sample_text = """When Sebastian Thrun started working on self-driving cars at Google in 2007, few people outside of the company took him seriously. “I can tell you very senior CEOs of major American car companies would shake my hand and turn away because I wasn’t worth talking to,” said Thrun, now the co-founder and CEO of online higher education startup Udacity, in an interview with Recode earlier this week.

A little less than a decade later, dozens of self-driving startups have cropped up while automakers around the world clamor, wallet in hand, to secure their place in the fast-moving world of fully automated transportation."""
doc = nlp(sample_text)

What is really happening here is that we pass a string of text to the nlp object and receive a Doc object.

![](https://course.spacy.io/pipeline.png)

During processing, spaCy first tokenizes the text, i.e. segments it into words, punctuation and so on. After tokenization, spaCy can parse and tag a given Doc, provided the pipeline includes those components (the blank `English()` object used so far only tokenizes).
There are several preprocessing tasks, which we will go through one by one.
The best thing about the spaCy pipeline is that you can always **add custom functions** to it depending on your problem.

Some of the basic functions are:

* Tokenization
* Part-of-speech (POS) Tagging
* Named Entity Recognition (NER) etc.

# Tokenization

> Tokenization is the process of converting a sequence of characters into a sequence of tokens.

![](https://raw.githubusercontent.com/theainerd/MLInterview/master/images/Screenshot%20from%202018-10-04%2014-14-08.png)
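
Each token also carries lexical attributes that need no trained model. The sketch below uses a blank English pipeline and a made-up sentence to show a few of them; `nlp_blank` and `demo_doc` are names chosen for this example only.

```python
import spacy

# Blank pipeline: tokenization and lexical attributes, no statistical model
nlp_blank = spacy.blank("en")
demo_doc = nlp_blank("It costs $5.")

for token in demo_doc:
    # Index, text, and a few boolean lexical attributes
    print(token.i, token.text, token.is_alpha, token.is_punct, token.like_num)
```

Note that the currency symbol and the final period become tokens of their own, which a plain `str.split()` would not give you.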



In [None]:
# print each token
for token in doc:
    print(token.text)

# Parts-of-Speech (POS) tagging

> *Part-of-speech tagging (POS tagging)* is the task of tagging a word in a text with its part of speech. A part of speech is a category of words with similar grammatical properties. Common English parts of speech are noun, verb, adjective, adverb, pronoun, preposition, conjunction, etc.

The *"en_core_web_sm"* package is a small English model that supports all core capabilities and is trained on web text. The package provides the **binary weights** that enable spaCy to make predictions.
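
The tags stored in `token.pos_` are short codes. If a code is unfamiliar, `spacy.explain` returns a brief description of it. The list of tags below is just a small sample picked for illustration.

```python
import spacy

# Look up short descriptions for a few part-of-speech tags
for tag in ["NOUN", "VERB", "PROPN", "ADP"]:
    print(tag, "->", spacy.explain(tag))
```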

In [None]:
import spacy
nlp = spacy.load('en_core_web_sm')
doc = nlp(sample_text)

In [None]:
for token in doc:
    # Print the text and the predicted part-of-speech tag
    print(token.text, token.pos_)

# Named Entity Recognition (NER)

> In the Named Entity Recognition (NER) task, systems are required to recognize the Named Entities occurring in the text. More specifically, the task is to find Person (PER), Organization (ORG), Location (LOC) and Geo-Political Entities (GPE). For instance, in the statement “Shyam lives in India”, an NER system extracts Shyam, which refers to the name of a person, and India, which refers to the name of a country.
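
To see what an entity looks like as a data structure, the sketch below annotates the example sentence by hand on a blank pipeline. The spans and labels are written manually here, which is an assumption made purely for illustration; a trained model predicts them instead. It uses the label `PERSON` (the name spaCy's English models use) rather than `PER`.

```python
import spacy
from spacy.tokens import Span

nlp_blank = spacy.blank("en")
demo_doc = nlp_blank("Shyam lives in India")

# Hand-annotated spans: (doc, start token, end token exclusive, label)
demo_doc.ents = [
    Span(demo_doc, 0, 1, label="PERSON"),
    Span(demo_doc, 3, 4, label="GPE"),
]

for ent in demo_doc.ents:
    # Entity text, label, and character offsets in the original string
    print(ent.text, ent.label_, ent.start_char, ent.end_char)
```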



In [None]:
# Iterate over the predicted entities
for ent in doc.ents:
    # Print the entity text and its label
    print(ent.text, ent.label_)

In [None]:
# Get quick definitions of the most common tags and labels

print(spacy.explain('GPE'))
print(spacy.explain('ORG'))

In [None]:
from spacy import displacy

displacy.render(doc, style='ent', jupyter=True)

# Adding custom functions to pipelines

Custom components can be added with the `nlp.add_pipe` method. You can also specify where in the pipeline the component should go using the `first`, `last`, `before`, and `after` arguments.

Let's add a component that prints the number of tokens in each document.

In [None]:
# Define a custom component
def custom_component(doc):
    # Print the doc's length
    print('Doc length:', len(doc))
    # Return the doc object
    return doc

# Add the component first in the pipeline
nlp.add_pipe(custom_component, first=True)

# Print the pipeline component names
print('Pipeline:', nlp.pipe_names)

Now every time you process a text with this nlp object, it prints the document length (the number of tokens).
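
Passing the function object directly to `nlp.add_pipe` is the spaCy v2 style used in this kernel. In spaCy v3 and later, components are registered under a name and added by that name. The sketch below assumes spaCy v3 or newer; the component name `doc_length_component` and the variable `nlp_v3` are made up for this example.

```python
import spacy
from spacy.language import Language

# Register the function under a string name (spaCy v3+ only)
@Language.component("doc_length_component")
def doc_length_component(doc):
    # Print the number of tokens and pass the doc on unchanged
    print("Doc length:", len(doc))
    return doc

nlp_v3 = spacy.blank("en")
# Add the component by its registered name, first in the pipeline
nlp_v3.add_pipe("doc_length_component", first=True)
print("Pipeline:", nlp_v3.pipe_names)

demo_doc = nlp_v3("This is a sentence.")
```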

In [None]:
# Process a text
doc = nlp(sample_text)

# Document similarity

spaCy can compare two objects and predict how similar they are. To use similarity, you need a larger spaCy model that includes word vectors.

The medium and large English models include them, but the small one does not. So if you want to use vectors, always go with a model whose name ends in "md" or "lg". 
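
By default, spaCy's similarity score is the cosine similarity between two vectors, and a document's vector defaults to the average of its token vectors. The sketch below computes cosine similarity in plain Python on three toy vectors. The vectors are made-up numbers for illustration, not real word vectors, and `cosine_similarity` is a helper written for this example.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

vec_a = [1.0, 2.0, 0.0]
vec_b = [2.0, 4.0, 0.0]   # same direction as vec_a
vec_c = [0.0, 0.0, 3.0]   # orthogonal to vec_a

print(cosine_similarity(vec_a, vec_b))  # close to 1: same direction
print(cosine_similarity(vec_a, vec_c))  # 0: nothing in common
```

Because only the direction of the vectors matters, two texts can score as highly similar even when they differ in length.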

In [None]:
nlp = spacy.load('en_core_web_lg')

In [None]:
doc1 = nlp("My name is shyam")
doc2 = nlp("My name is Ram")

In [None]:
print("The document similarity is:", doc1.similarity(doc2))

This was a very basic guide to getting started with NLP for beginners. We covered data cleaning for text data and how to use spaCy for different Natural Language Processing tasks.

References:

[Advanced NLP with spaCy](https://course.spacy.io/)

Kernels you can explore:

[Hitchhiker's Guide to NLP in spaCy](https://www.kaggle.com/nirant/hitchhiker-s-guide-to-nlp-in-spacy/)