# NLP with spaCy

### Natural Language Processing (NLP)

Natural language processing (NLP) is a subfield of linguistics, computer science, and artificial intelligence concerned with the interactions between computers and human language, in particular how to program computers to process and analyze large amounts of natural language data.

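NLP is the technology of using computers to analyze and process human natural language. Its component techniques include natural language analysis, understanding, and generation, and it is applied in areas such as information retrieval, machine translation, and question answering.

#### [Link to the spaCy documentation](https://spacy.io/usage/linguistic-features)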


In [None]:
# pip install -U spacy
# conda install -c conda-forge spacy

In [None]:
import pandas as pd
import numpy as np
import spacy
from spacy import displacy
from spacy.util import minibatch, compounding
import explacy

import matplotlib.pyplot as plt
%matplotlib inline

In [None]:
# Load the food reviews dataset; Reviews.csv is expected in the working directory
food_reviews_df = pd.read_csv('Reviews.csv')
food_reviews_df.shape

In [None]:
food_reviews_df.head()

### Tokenization
The first step in any NLP pipeline is tokenizing the text, i.e. breaking paragraphs down into sentences and then sentences into words, punctuation marks, and so on.

* The task of separating linguistic units (tokens) from a corpus.
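
As a minimal sketch of tokenization on its own, the snippet below uses `spacy.blank("en")`, which builds an English pipeline that contains only the tokenizer, so no trained model has to be downloaded. That choice is an assumption made for illustration; the notebook itself loads `en_core_web_sm` further down. The sentence is the one passed to `explacy` later in the notebook.

```python
import spacy

# Assumption: a blank English pipeline (tokenizer only) is enough to show
# tokenization; it needs spaCy installed but no trained model.
nlp = spacy.blank("en")

text = "The salad was surprisingly tasty."
doc = nlp(text)

# Each Token keeps its text and its character offset into the original string.
tokens = [token.text for token in doc]
offsets = [token.idx for token in doc]
print(tokens)
print(offsets)

# Tokenization is non-destructive: joining text_with_ws restores the input.
restored = "".join(token.text_with_ws for token in doc)
print(restored == text)
```

Because the original string can always be rebuilt from the tokens, later steps can map any token back to its exact position in the review text.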


### Lemmatisation
Lemmatisation in linguistics is the process of grouping together the inflected forms of a word so they can be analysed as a single item, identified by the word's lemma, or dictionary form.

* The task of finding the dictionary base form (lemma) of a word.
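
To make the idea of grouping inflected forms concrete, here is a toy sketch in plain Python. The `LEMMA_TABLE` dictionary and the `lemmatize` helper are hypothetical and hand-written for illustration; they are not spaCy's lemmatizer, whose result is exposed as `token.lemma_` in the cells below.

```python
from collections import defaultdict

# Hypothetical hand-written lookup table: inflected form -> lemma.
LEMMA_TABLE = {
    "was": "be", "is": "be", "were": "be",
    "tasted": "taste", "tastes": "taste", "tasting": "taste",
    "salads": "salad",
}


def lemmatize(word):
    """Return the lemma from the toy table, or the lowercased word itself."""
    lowered = word.lower()
    return LEMMA_TABLE.get(lowered, lowered)


words = ["Salads", "tasted", "tastes", "was", "is", "salad"]

# Group the surface forms under their shared lemma.
groups = defaultdict(list)
for word in words:
    groups[lemmatize(word)].append(word)

print(dict(groups))
```

A real lemmatizer also has to handle words missing from any table and forms whose lemma depends on the part of speech, which is why it usually runs after tagging.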

In [None]:
sample_review = food_reviews_df.Text[54]
sample_review

In [None]:
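# Load the small English pipeline (see https://spacy.io/models/en).
# It must be installed first, e.g. with: python -m spacy download en_core_web_sm
spacy_tok = spacy.load('en_core_web_sm')

# Run the pipeline on the sample review to get a parsed Doc object
parsed_review = spacy_tok(sample_review)
parsed_review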

In [None]:
explacy.print_parse_info(spacy_tok, 'The salad was surprisingly tasty.')

In [None]:
explacy.print_parse_info(spacy_tok, food_reviews_df.Text[0])

### Part-of-speech (POS) Tagging
After tokenization, spaCy can parse and tag a given Doc. This is where the trained pipeline and its statistical models come in, which enable spaCy to make predictions of which tag or label most likely applies in this context. A trained component includes binary data that is produced by showing a system enough examples for it to make predictions that generalize across the language – for example, a word following “the” in English is most likely a noun.

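* After a sentence has been tokenized, the task of assigning a part of speech (such as noun, verb, pronoun, adjective, adverb, conjunction, preposition, or article) to each token.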

In [None]:
# Collect one record of attributes per token, then build the DataFrame in one step.
tokenized_text = pd.DataFrame(
    [
        {
            'text': token.text,
            'lemma': token.lemma_,
            'pos': token.pos_,    # coarse-grained part of speech
            'tag': token.tag_,    # fine-grained part-of-speech tag
            'dep': token.dep_,    # dependency relation to the head token
            'shape': token.shape_,
            'is_alpha': token.is_alpha,
            'is_stop': token.is_stop,
            'is_punctuation': token.is_punct,
        }
        for token in parsed_review
    ]
)

tokenized_text[:20]
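
The values in the `pos` and `tag` columns are short codes. As a small sketch, `spacy.explain` can look up a human-readable description for a code. The labels queried here are picked for illustration, and `spacy.explain` returns `None` for a label it does not know.

```python
import spacy

# Labels chosen for illustration: two coarse-grained POS codes (token.pos_)
# and two fine-grained tags (token.tag_).
labels = ["NOUN", "ADJ", "NN", "VBZ"]

descriptions = {label: spacy.explain(label) for label in labels}
for label, description in descriptions.items():
    print(f"{label:5} {description}")
```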

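### Named Entity Recognition (NER)
A named entity is a real-world object, such as a person or an organization, that is referred to by name.

* The task of identifying the text spans that name objects in categories such as people, places, organizations, and dates.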
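
The sketch below shows how entities appear on a `Doc`: each one is a `Span` with a label, reachable through `doc.ents`. To keep it runnable without a trained model, the two entity spans are set by hand. That is an assumption made for illustration, as is the example sentence. With a trained pipeline such as `en_core_web_sm`, the `ner` component fills `doc.ents` with its own predictions instead.

```python
import spacy
from spacy.tokens import Span

# Assumption: blank pipeline, so nothing is predicted; entities are hand-set.
nlp = spacy.blank("en")
doc = nlp("Amazon delivered the coffee on Monday.")

# Token indices: Amazon=0, delivered=1, the=2, coffee=3, on=4, Monday=5, .=6
doc.ents = [
    Span(doc, 0, 1, label="ORG"),   # "Amazon"
    Span(doc, 5, 6, label="DATE"),  # "Monday"
]

for ent in doc.ents:
    # label_ is the short code; spacy.explain gives a description of it.
    print(ent.text, ent.start_char, ent.end_char, ent.label_, spacy.explain(ent.label_))

entities = [(ent.text, ent.label_) for ent in doc.ents]
```

The same loop works unchanged on a `Doc` produced by a trained pipeline, which is what the `displacy` entity view renders.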

In [None]:
spacy.displacy.render(spacy_tok(food_reviews_df.Text[0]), style='ent', jupyter=True)

In [None]:
spacy.explain('')

### Dependency parsing
Syntactic parsing, or dependency parsing, is the process of identifying sentences and assigning a syntactic structure to each one, for example by linking a verb to its subject and object. spaCy provides a parse tree that can be used to recover this structure.

* The task of identifying the dependency relations between a word that is modified (the head, or governor) and the words that modify it (its dependents, or modifiers).
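
As a sketch of what a dependency parse looks like as data, the example below builds a `Doc` with hand-written heads and dependency labels for a sentence used earlier in this notebook. It assumes spaCy v3, where the `Doc` constructor accepts `heads` (absolute token indices) and `deps`. The annotations are written by hand for illustration and are not the output of a trained parser.

```python
import spacy
from spacy.tokens import Doc

nlp = spacy.blank("en")  # only the vocab is needed here

words = ["The", "salad", "was", "surprisingly", "tasty", "."]
# Hand-written annotations (assumption): heads[i] is the index of token i's
# head; the root token points at itself.
heads = [1, 2, 2, 4, 2, 2]
deps = ["det", "nsubj", "ROOT", "advmod", "acomp", "punct"]

doc = Doc(nlp.vocab, words=words, heads=heads, deps=deps)

# Each arc links a dependent to its head with a labelled relation.
arcs = [(token.text, token.dep_, token.head.text) for token in doc]
for dependent, relation, head in arcs:
    print(f"{dependent:13} --{relation}--> {head}")

# The root is the token that is its own head.
root = [token for token in doc if token.head.i == token.i][0]
root_children = [child.text for child in root.children]
print("root:", root.text, "| children:", root_children)
```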

#### Sentence Boundary Detection
Figuring out where each sentence starts and ends is an important part of NLP.
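
The cells below take sentence boundaries from the parsed `Doc` via `.sents`. As a lighter-weight sketch, the rule-based `sentencizer` component can split on sentence-final punctuation without a trained model. This is a swapped-in alternative for illustration (assuming spaCy v3's `nlp.add_pipe("sentencizer")`), not the parser-based segmentation used in this notebook, and the example text is made up.

```python
import spacy

# Assumption: blank English pipeline plus the rule-based sentencizer (spaCy v3 API).
nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")

doc = nlp("I loved the salad. The soup was cold! Would I order it again?")

# doc.sents yields one Span per detected sentence.
sentences = [sent.text for sent in doc.sents]
for sentence in sentences:
    print(sentence)
```

Punctuation rules are fast but can mis-split on abbreviations, which is one reason to prefer the parser-based boundaries when a trained pipeline is available.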



In [None]:
sentence_spans = list(parsed_review.sents)

In [None]:
print(sentence_spans)

#### Visualising Dependencies

In [None]:
displacy.render(parsed_review, style='dep', jupyter=True, options={'distance': 140})

In [None]:
spacy.explain('det')

#### Processing Noun Chunks

In [None]:
# Collect one record per noun chunk, then build the DataFrame in one step.
# chunk.root is a Token object, so its text is stored rather than the object itself.
noun_chunks_df = pd.DataFrame(
    [
        {
            'text': chunk.text,
            'root.text': chunk.root.text,
            'root.dep_': chunk.root.dep_,
            'root.head.text': chunk.root.head.text,
        }
        for chunk in parsed_review.noun_chunks
    ]
)

noun_chunks_df[:20]