### Syntax

Syntax is the structure of a language, which is governed by its grammar. Not every ordering of words forms a valid sentence, so natural language processing needs syntactic analysis.

## Table of Contents

* [Parts of Speech Tagging](#pos)
* [Dependency Parsing](#parsing)
* [Named Entity Recognition](#ner)

<a id='pos'></a>

### Parts of Speech Tagging

Parts of speech (POS) are specific lexical categories to which words are assigned, based on their syntactic context and role. Usually, words can fall into one of the following major categories.

* <strong>Nouns</strong>
* <strong>Verbs</strong>
* <strong>Adjectives</strong>
* <strong>Adverbs</strong>

Besides these four major categories, other parts of speech occur frequently in English, including pronouns, prepositions, interjections, conjunctions, and determiners. The process of classifying words and labeling them with POS tags is called parts of speech tagging, or POS tagging. POS tags annotate each word with its part of speech, which is helpful for specific analyses such as narrowing down to nouns and seeing which ones are the most prominent, word sense disambiguation, and grammar analysis.
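
As a minimal sketch of the "most prominent nouns" idea, assume the tokens have already been tagged. The `tagged_tokens` list and the `most_common_by_pos` helper below are hypothetical and written by hand for illustration; they are not produced by a tagger.

```python
from collections import Counter

# Hypothetical, hand-tagged (token, POS) pairs standing in for tagger output.
tagged_tokens = [
    ("ice", "NOUN"), ("machine", "NOUN"), ("stop", "VERB"),
    ("product", "NOUN"), ("buy", "VERB"), ("product", "NOUN"),
    ("warranty", "NOUN"), ("honor", "VERB"), ("warranty", "NOUN"),
    ("product", "NOUN"),
]

def most_common_by_pos(pairs, pos, n=2):
    """Return the n most frequent tokens that carry the given POS tag."""
    counts = Counter(token.lower() for token, tag in pairs if tag == pos)
    return counts.most_common(n)

top_nouns = most_common_by_pos(tagged_tokens, "NOUN")
print(top_nouns)  # [('product', 3), ('warranty', 2)]
```

The same helper can be pointed at any other tag, for example `"VERB"`, to compare which actions dominate a review.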


In [1]:
import pandas as pd

# Show full cell contents; None replaces the deprecated -1 in recent pandas versions
pd.options.display.max_colwidth = None

data = pd.read_csv('amazon_reviews_cleaned.csv')  # for an Excel file, use pd.read_excel

In [2]:
import nltk
import spacy
import en_core_web_sm
import re

# The tagger, parser, and NER components are enabled by default,
# so no extra keyword arguments are needed
nlp = spacy.load('en_core_web_sm')

In [3]:
sample_text = data.clean_reviewtext.iloc[10]
print (sample_text)

would give less 1 star possible DONT buy product ice machine stop work four hour use first time notify New Air state would honor one year warranty authorize dealer sell buy product Amazon never even think would cross check purchase manufacturer NewAir stand product use method get honor warranty 200 piece junk never buy another NewAir product


In [4]:
text_tokenized = nlp(sample_text)

for token in text_tokenized:
    print ("{} ---> {}".format(token,token.pos_))

would ---> VERB
give ---> VERB
less ---> ADJ
1 ---> NUM
star ---> NOUN
possible ---> ADJ
DONT ---> VERB
buy ---> VERB
product ---> NOUN
ice ---> NOUN
machine ---> NOUN
stop ---> VERB
work ---> NOUN
four ---> NUM
hour ---> NOUN
use ---> NOUN
first ---> ADJ
time ---> NOUN
notify ---> VERB
New ---> PROPN
Air ---> PROPN
state ---> NOUN
would ---> VERB
honor ---> VERB
one ---> NUM
year ---> NOUN
warranty ---> NOUN
authorize ---> VERB
dealer ---> NOUN
sell ---> NOUN
buy ---> VERB
product ---> NOUN
Amazon ---> PROPN
never ---> ADV
even ---> ADV
think ---> VERB
would ---> VERB
cross ---> VERB
check ---> NOUN
purchase ---> NOUN
manufacturer ---> NOUN
NewAir ---> PROPN
stand ---> VERB
product ---> NOUN
use ---> NOUN
method ---> NOUN
get ---> VERB
honor ---> NOUN
warranty ---> NOUN
200 ---> NUM
piece ---> NOUN
junk ---> NOUN
never ---> ADV
buy ---> VERB
another ---> DET
NewAir ---> PROPN
product ---> NOUN


POS tags are typically used for analysis, feature engineering, or feature selection. In this analysis, let us keep only the words tagged as nouns (including proper nouns), verbs, numerals, and adjectives.

In [5]:
print (data.shape)
data = data.dropna(subset=['clean_reviewtext'])
print (data.shape)

(2277, 13)
(2272, 13)


In [6]:
def get_selected_pos(text):
    text_tokenized = nlp(text)
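    # token.text is used because token.string is not available in spaCy v3
    selected_pos = {'NOUN', 'PROPN', 'NUM', 'ADJ', 'VERB'}
    selected_words = [token.text for token in text_tokenized if token.pos_ in selected_pos]
    return re.sub(' +', ' ', " ".join(selected_words))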

In [7]:
data.clean_reviewtext = data.clean_reviewtext.apply(get_selected_pos)

<a id='parsing'></a>

### Dependency Parsing

In dependency parsing, we use dependency-based grammars to analyze and infer both the structure and the semantic dependencies between tokens in a sentence. The basic principle behind a dependency grammar is that in any sentence, every word except one depends on some other word in the sentence. The word that has no dependency is called the root of the sentence; in most cases, the verb is taken as the root. All other words are connected to the root, directly or indirectly, by links, which are the dependencies.
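
To make the root and link idea concrete, here is a small sketch that stores a parse as head indices and walks each word up to the root. The sentence, its parse, and the helper names `find_root` and `path_to_root` are hand-constructed assumptions for illustration, not parser output. The only convention borrowed from spaCy is that the root token is its own head.

```python
# Hand-constructed parse of "The cat chased a mouse" (illustrative assumption).
# Each entry is (token, index of its head, relation label).
# Following spaCy's convention, the root token is its own head.
parse = [
    ("The", 1, "det"),
    ("cat", 2, "nsubj"),
    ("chased", 2, "ROOT"),
    ("a", 4, "det"),
    ("mouse", 2, "dobj"),
]

def find_root(parse):
    """Return the index of the single token that is its own head."""
    roots = [i for i, (_, head, _) in enumerate(parse) if head == i]
    if len(roots) != 1:
        raise ValueError("expected exactly one root, found {}".format(len(roots)))
    return roots[0]

def path_to_root(parse, index):
    """Follow head links from a token up to the root, returning the tokens visited."""
    path = [parse[index][0]]
    while parse[index][1] != index:
        index = parse[index][1]
        path.append(parse[index][0])
    return path

root_index = find_root(parse)
print("root:", parse[root_index][0])
for i in range(len(parse)):
    print(" -> ".join(path_to_root(parse, i)))
```

Every path ends at the verb `chased`, which is what "directly or indirectly linked to the root" means in practice.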

In [8]:
from spacy import displacy

print (data.clean_reviewtext.iloc[0])

text_tokenized = nlp(data.clean_reviewtext.iloc[0])

displacy.render(text_tokenized, jupyter=True, 
                options={'distance': 110,
                         'arrow_stroke': 2,
                         'arrow_width': 8})

vent something keep house warm winter sand paint color house look great


<a id='ner'></a>

### Named Entity Recognition (NER)

In any text document, there are particular terms that represent specific entities, are more informative, and have a unique context. These are known as named entities: terms that represent real-world objects such as people, places, and organizations, and that are often denoted by proper names. A naive approach is to find them by looking at the noun phrases in the text. Named entity recognition (NER), also known as entity chunking or entity extraction, is a popular information extraction technique that identifies and segments named entities and classifies them into predefined classes.
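
The naive approach can be sketched with POS tags alone. The version below simplifies "noun phrases" to runs of consecutive proper nouns, which is an assumption made for brevity. The `tagged` list is a subset of the (token, POS) pairs from the tagger output shown earlier, and `proper_noun_runs` is a hypothetical helper.

```python
# A subset of (token, POS) pairs taken from the tagger output shown earlier.
tagged = [
    ("notify", "VERB"), ("New", "PROPN"), ("Air", "PROPN"), ("state", "NOUN"),
    ("buy", "VERB"), ("product", "NOUN"), ("Amazon", "PROPN"), ("never", "ADV"),
    ("manufacturer", "NOUN"), ("NewAir", "PROPN"), ("stand", "VERB"),
]

def proper_noun_runs(pairs):
    """Group consecutive PROPN tokens into candidate entity strings."""
    candidates, current = [], []
    for token, pos in pairs:
        if pos == "PROPN":
            current.append(token)
        elif current:
            candidates.append(" ".join(current))
            current = []
    if current:  # flush a run that ends the sequence
        candidates.append(" ".join(current))
    return candidates

print(proper_noun_runs(tagged))  # ['New Air', 'Amazon', 'NewAir']
```

This finds candidate spans but assigns no class to them, which is the part a trained NER model adds.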

In [9]:
sample_text = data.clean_reviewtext.iloc[10]
print (sample_text)

text_tokenized = nlp(sample_text)

for ent in text_tokenized.ents:
    print ("{} ---> {}".format(ent.text, ent.label_))

would give less 1 star possible DONT buy product ice machine stop work four hour use first time notify New Air state would honor one year warranty authorize dealer sell buy product Amazon think would cross check purchase manufacturer NewAir stand product use method get honor warranty 200 piece junk buy NewAir product
1 ---> CARDINAL
four hour ---> TIME
first ---> ORDINAL
New Air ---> LOC
one year ---> DATE
Amazon ---> ORG
NewAir ---> ORG
warranty 200 ---> CARDINAL
NewAir ---> ORG


In [10]:
sample_text = data.clean_reviewtext.iloc[10].lower()

text_tokenized = nlp(sample_text)

for ent in text_tokenized.ents:
    print ("{} ---> {}".format(ent.text, ent.label_))

1 ---> CARDINAL
four hour ---> TIME
first ---> ORDINAL
one year ---> DATE
warranty 200 ---> CARDINAL


When the proper nouns are converted to lowercase, spaCy's NER component no longer detects them as entities: New Air, Amazon, and NewAir all disappear from the output. This shows the need for more sophisticated NER engines.
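
One simple way to reduce this sensitivity to casing, not part of the analysis above, is a case-insensitive dictionary lookup (a gazetteer) for names that are known in advance. The sketch below is only an illustration: the `GAZETTEER` entries, their labels, and the `gazetteer_entities` helper are hypothetical, and the input is a shortened, lowercased fragment of the sample review.

```python
import re

# Hypothetical gazetteer: names and labels chosen for illustration only.
GAZETTEER = {"newair": "ORG", "new air": "ORG", "amazon": "ORG"}

def gazetteer_entities(text, gazetteer=GAZETTEER):
    """Return (matched text, label) pairs in order of appearance, ignoring case."""
    # Put longer names first so multi-word names win over shorter alternatives.
    names = sorted(gazetteer, key=len, reverse=True)
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(name) for name in names) + r")\b",
        re.IGNORECASE,
    )
    return [(m.group(0), gazetteer[m.group(0).lower()]) for m in pattern.finditer(text)]

lowered = ("notify new air state would honor warranty buy product amazon "
           "manufacturer newair stand product")
print(gazetteer_entities(lowered))
# [('new air', 'ORG'), ('amazon', 'ORG'), ('newair', 'ORG')]
```

A lookup like this only recovers names that were listed beforehand, so it complements a statistical NER model and does not replace one.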