# Common Tasks in NLP

- text summarization
- text classification
- language modeling
- speech recognition / correction
- translation
- chatbots, and so on

### Why PyTorch is a good choice

- The barrier to entry is relatively low, and the API feels like ordinary Python.
- Graph creation is fairly easy. (Dynamic computation graphs are said to matter for NLP.)
- Because it behaves like Python, debugging is easy.
- documentation
- device configuration (switching between CPU and GPU)
- data loading: the data-loading API is fairly easy to use and is parallelized.
- extensions
- community
- interoperability
- deployment (production-ready platform)

### Projects
- sentiment analyzer
- neural machine translation

Sentiment analysis {
    - text classification problem
    - output classes: positive / negative sentiment
    - IMDb movie reviews as the dataset
    - data cleaning / preprocessing with NLTK and spaCy
    - One-hot encoding creates too many dimensions, which hurts prediction accuracy, so word embeddings are built with Gensim instead.
    - A recurrent NN will be used.
    - The model is then improved with an LSTM.
}
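
To make the dimensionality point concrete, here is a minimal sketch in plain Python. The three-sentence corpus and the helper name `one_hot` are made up for illustration; the Gensim embeddings used in the project are not shown here.

```python
# Toy corpus (hypothetical); a real review vocabulary is far larger.
corpus = [
    "this movie is good",
    "this movie is bad",
    "not a good movie",
]

# Build a vocabulary: one index per distinct word.
vocab = sorted({word for sentence in corpus for word in sentence.split()})
word_to_index = {word: i for i, word in enumerate(vocab)}


def one_hot(word):
    """Return a one-hot vector whose length equals the vocabulary size."""
    vector = [0] * len(vocab)
    vector[word_to_index[word]] = 1
    return vector


vec = one_hot("good")
print(len(vocab), vec)
```

Every new word adds one dimension, and each vector is almost entirely zeros. A dense embedding keeps the vector length fixed no matter how large the vocabulary grows.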

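Neural machine translation {
    - Takes text in one language and translates it into another; Google Translate is the best-known example. It uses a neural network rather than word-by-word or sentence-by-sentence mapping.
    - A sequence-to-sequence model will be used.
    - English -> French translation.
}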

In [1]:
import nltk
nltk.download("stopwords")

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/inspirit941/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [2]:
import spacy
!python -m spacy download en

In [3]:
from nltk.corpus import stopwords
stop_words = stopwords.words('english')

In [6]:
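# The 'en' shortcut works in spaCy v2; spaCy v3+ requires the full model name (e.g. 'en_core_web_sm') in spacy.load.
nlp = spacy.load('en')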

Movie review text.

In [11]:
from nltk.tokenize import sent_tokenize
nltk.download('punkt')

[nltk_data] Downloading package punkt to
[nltk_data]     /Users/inspirit941/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

In [12]:
sent_tokenize('hello world, we are trying out nltk. this would be awesome.')
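# sent_tokenize normally splits text into sentences, but the movie reviews used here are separated by \n, one review per line.
# So the example data is read with Python's built-in file functions instead.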

['hello world, we are trying out nltk.', 'this would be awesome.']

In [14]:
# Read one review per line, stripping the trailing newline.
with open("./Hands-on-NLP-with-PyTorch-master/data/reviews.txt", 'r') as reviewFile:
    reviews = [line.rstrip("\n") for line in reviewFile]

# Labels are stored the same way, one per line.
with open("./Hands-on-NLP-with-PyTorch-master/data/labels.txt", 'r') as labelFile:
    labels = [line.rstrip("\n") for line in labelFile]

In [20]:
print(len(reviews), len(labels))
set(labels) # There are exactly two classes: positive and negative.

25000 25000


{'negative', 'positive'}

In [19]:
reviews[0]

'bromwell high is a cartoon comedy . it ran at the same time as some other programs about school life  such as  teachers  . my   years in the teaching profession lead me to believe that bromwell high  s satire is much closer to reality than is  teachers  . the scramble to survive financially  the insightful students who can see right through their pathetic teachers  pomp  the pettiness of the whole situation  all remind me of the schools i knew and their students . when i saw the episode in which a student repeatedly tried to burn down the school  i immediately recalled . . . . . . . . . at . . . . . . . . . . high . a classic line inspector i  m here to sack one of your teachers . student welcome to bromwell high . i expect that many adults of my age think that bromwell high is far fetched . what a pity that it isn  t   '

In [21]:
nltk.word_tokenize(reviews[0])
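# Run as-is, this also produces many tokens we do not need.
# The question to think about: which kinds of tokens actually carry meaning?
# If sentence meaning is the focus, runs of dots (ellipses) are useless, but they might matter for emotion analysis.
# The tokens we need have to be selected with a regex.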

['bromwell',
 'high',
 'is',
 'a',
 'cartoon',
 'comedy',
 '.',
 'it',
 'ran',
 'at',
 'the',
 'same',
 'time',
 'as',
 'some',
 'other',
 'programs',
 'about',
 'school',
 'life',
 'such',
 'as',
 'teachers',
 '.',
 'my',
 'years',
 'in',
 'the',
 'teaching',
 'profession',
 'lead',
 'me',
 'to',
 'believe',
 'that',
 'bromwell',
 'high',
 's',
 'satire',
 'is',
 'much',
 'closer',
 'to',
 'reality',
 'than',
 'is',
 'teachers',
 '.',
 'the',
 'scramble',
 'to',
 'survive',
 'financially',
 'the',
 'insightful',
 'students',
 'who',
 'can',
 'see',
 'right',
 'through',
 'their',
 'pathetic',
 'teachers',
 'pomp',
 'the',
 'pettiness',
 'of',
 'the',
 'whole',
 'situation',
 'all',
 'remind',
 'me',
 'of',
 'the',
 'schools',
 'i',
 'knew',
 'and',
 'their',
 'students',
 '.',
 'when',
 'i',
 'saw',
 'the',
 'episode',
 'in',
 'which',
 'a',
 'student',
 'repeatedly',
 'tried',
 'to',
 'burn',
 'down',
 'the',
 'school',
 'i',
 'immediately',
 'recalled',
 '.',
 '.',
 '.',
 '.',
 '.',

In [22]:
from nltk.tokenize import RegexpTokenizer
tokenizer = RegexpTokenizer(r"\w+")  # raw string, so \w is not treated as a string escape

In [23]:
tokenizer.tokenize("I didn't like the movie.")
# didn't is split into two tokens. Adjusting the pattern fixes this:

['I', 'didn', 't', 'like', 'the', 'movie']

In [24]:
tokenizer = RegexpTokenizer(r"\w+'?\w+|\w+")
tokenizer.tokenize("I didn't like the movie.")
# Run the reviews again with this rule.

['I', "didn't", 'like', 'the', 'movie']
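
The pattern can be checked with the standard `re` module alone. This is a sketch that assumes `RegexpTokenizer` in its default mode returns the regex matches, the way `re.findall` does. The third pattern, which also keeps a run of dots as one token, is an illustrative extension and is not used anywhere else in these notes.

```python
import re

sentence = "I didn't like the movie."

# \w+ alone splits at the apostrophe.
plain = re.findall(r"\w+", sentence)

# \w+'?\w+ allows one optional apostrophe inside a word;
# the trailing |\w+ still catches single-character words such as "I".
with_apostrophe = re.findall(r"\w+'?\w+|\w+", sentence)

# Hypothetical extension: also keep runs of two or more dots as one token,
# in case ellipses carry emotional signal.
with_ellipsis = re.findall(r"\w+'?\w+|\w+|\.{2,}", "well... i didn't like it")

print(plain)
print(with_apostrophe)
print(with_ellipsis)
```

In this dataset the ellipses are written as dots separated by spaces (`. . . .`), so the extended pattern would need further adjustment before it could capture them.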

In [25]:
tokenizer.tokenize(reviews[0])

['bromwell',
 'high',
 'is',
 'a',
 'cartoon',
 'comedy',
 'it',
 'ran',
 'at',
 'the',
 'same',
 'time',
 'as',
 'some',
 'other',
 'programs',
 'about',
 'school',
 'life',
 'such',
 'as',
 'teachers',
 'my',
 'years',
 'in',
 'the',
 'teaching',
 'profession',
 'lead',
 'me',
 'to',
 'believe',
 'that',
 'bromwell',
 'high',
 's',
 'satire',
 'is',
 'much',
 'closer',
 'to',
 'reality',
 'than',
 'is',
 'teachers',
 'the',
 'scramble',
 'to',
 'survive',
 'financially',
 'the',
 'insightful',
 'students',
 'who',
 'can',
 'see',
 'right',
 'through',
 'their',
 'pathetic',
 'teachers',
 'pomp',
 'the',
 'pettiness',
 'of',
 'the',
 'whole',
 'situation',
 'all',
 'remind',
 'me',
 'of',
 'the',
 'schools',
 'i',
 'knew',
 'and',
 'their',
 'students',
 'when',
 'i',
 'saw',
 'the',
 'episode',
 'in',
 'which',
 'a',
 'student',
 'repeatedly',
 'tried',
 'to',
 'burn',
 'down',
 'the',
 'school',
 'i',
 'immediately',
 'recalled',
 'at',
 'high',
 'a',
 'classic',
 'line',
 'inspecto

In [27]:
reviews_tokenize = list(map(lambda x: tokenizer.tokenize(x.lower()), reviews))

In [28]:
len(reviews_tokenize)

25000

In [30]:
reviews_tokenize[0]

['bromwell',
 'high',
 'is',
 'a',
 'cartoon',
 'comedy',
 'it',
 'ran',
 'at',
 'the',
 'same',
 'time',
 'as',
 'some',
 'other',
 'programs',
 'about',
 'school',
 'life',
 'such',
 'as',
 'teachers',
 'my',
 'years',
 'in',
 'the',
 'teaching',
 'profession',
 'lead',
 'me',
 'to',
 'believe',
 'that',
 'bromwell',
 'high',
 's',
 'satire',
 'is',
 'much',
 'closer',
 'to',
 'reality',
 'than',
 'is',
 'teachers',
 'the',
 'scramble',
 'to',
 'survive',
 'financially',
 'the',
 'insightful',
 'students',
 'who',
 'can',
 'see',
 'right',
 'through',
 'their',
 'pathetic',
 'teachers',
 'pomp',
 'the',
 'pettiness',
 'of',
 'the',
 'whole',
 'situation',
 'all',
 'remind',
 'me',
 'of',
 'the',
 'schools',
 'i',
 'knew',
 'and',
 'their',
 'students',
 'when',
 'i',
 'saw',
 'the',
 'episode',
 'in',
 'which',
 'a',
 'student',
 'repeatedly',
 'tried',
 'to',
 'burn',
 'down',
 'the',
 'school',
 'i',
 'immediately',
 'recalled',
 'at',
 'high',
 'a',
 'classic',
 'line',
 'inspecto

## Stop words


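In movie reviews, words such as this, is, are, and an only carry grammatical structure; they say nothing about the movie itself.

Removing them during preprocessing is said to improve performance.

Loading the stop words from NLTK and from spaCy gives: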

In [32]:
from nltk.corpus import stopwords
stop_words = stopwords.words('english')
print(stop_words)

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', '

In [33]:
from spacy.lang.en.stop_words import STOP_WORDS
print(STOP_WORDS)

{'top', 'afterwards', 'somehow', 'he', 'except', 'while', 'those', 'less', 'whereby', 'whom', 'no', 'yet', 'perhaps', 'everything', 'ten', 'go', 'other', 'bottom', 'five', 'nothing', 'neither', 'besides', 'thereupon', 'off', 'anywhere', 'thence', 'up', 'into', 'along', 'see', 'what', 'regarding', 'would', 'hence', 'whence', 'beforehand', 'mine', 'however', 'some', 'least', 'about', 'many', 'since', 'as', 'here', 'do', 'whose', 'call', 'enough', 'was', 'when', 'due', 'until', 'now', 'towards', 'hereby', 'made', 'and', 'move', 'in', 'hers', 'whereupon', 'eight', 'over', 'too', 'whole', 'twelve', 'whenever', 'if', 'among', 'toward', 'eleven', 'herein', 'not', 'any', 'yours', 'cannot', 'something', 'its', 'just', 'should', 'were', 'rather', 'out', 'nobody', 'ourselves', 'to', 'yourselves', 'so', 'ever', 'hereafter', 'am', 'further', 'even', 'through', 'why', 'also', 'below', 'who', 'never', 'whither', 'former', 'almost', 'after', 'they', 'therein', 'doing', 'seems', 'empty', 'latterly', 'a

In [34]:
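# Membership tests are fast on a set, while iteration is said to be faster on a list.
# Merge the two stop-word collections into one set.
# Note: this rebinds the name stopwords, which until now referred to the corpus object imported from nltk.corpus.
stopwords = set(stop_words).union(STOP_WORDS)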
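
The set-versus-list claim can be checked with `timeit`. This is a rough sketch: the 1,000 synthetic words stand in for a stop-word list, and the exact timings will differ from machine to machine.

```python
import timeit

words = [f"word{i}" for i in range(1000)]  # stand-in for a stop-word list
as_list = list(words)
as_set = set(words)
probe = "word999"  # worst case for the list: the last element

# A list is scanned element by element; a set uses a hash lookup.
list_time = timeit.timeit(lambda: probe in as_list, number=2000)
set_time = timeit.timeit(lambda: probe in as_set, number=2000)

print(f"list: {list_time:.4f}s, set: {set_time:.4f}s")
```

Stop-word filtering performs one membership test per token, so across 25,000 reviews the lookup cost matters far more than the iteration cost.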

In [37]:
review_test = "this is a good movie".split(" ")
[token for token in review_test if token not in stopwords]

['good', 'movie']

In [38]:
review_test = "this is not a good movie".split(" ")
[token for token in review_test if token not in stopwords]

['good', 'movie']

In [39]:
# NLTK's stop-word list contains "not" and related words, so negative expressions disappear.
# These words therefore need to be treated as exceptions.
exceptionStopWords = {
    'again',
    'against',
    'ain',
    'almost',
    'among',
    'amongst',
    'amount',
    'anyhow',
    'anyway',
    'aren',
    "aren't",
    'below',
    'bottom',
    'but',
    'cannot',
    'couldn',
    "couldn't",
    'didn',
    "didn't",
    'doesn',
    "doesn't",
    'don',
    "don't",
    'done',
    'down',
    'except',
    'few',
    'hadn',
    "hadn't",
    'hasn',
    "hasn't",
    'haven',
    "haven't",
    'however',
    'isn',
    "isn't",
    'least',
    'mightn',
    "mightn't",
    'must',
    'mustn',
    "mustn't",
    'needn',
    "needn't",
    'neither',
    'never',
    'nevertheless',
    'no',
    'nobody',
    'none',
    'noone',
    'nor',
    'not',
    'nothing',
    'should',
    "should've",
    'shouldn',
    "shouldn't",
    'too',
    'top',
    'up',
    'wasn',
    "wasn't",
    'well',
    'weren',
    "weren't",
    'won',
    "won't",
    'wouldn',
    "wouldn't",
}
# Remove the exception words using set difference.
finalstop = stopwords-exceptionStopWords

In [40]:
# Wrap this in a function so it can be reused.
def remove_stopwords(review):
    return [token for token in review if token not in finalstop]
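
A quick check that the exception set restores the negation, sketched with tiny stand-in sets so it runs without NLTK or spaCy. The names `demo_stop_words`, `demo_keep_words`, and `filter_tokens` are hypothetical.

```python
# Tiny stand-in sets (hypothetical); the notes use the NLTK + spaCy union.
demo_stop_words = {"this", "is", "a", "not", "the"}
demo_keep_words = {"not"}  # negations that should survive filtering

demo_final = demo_stop_words - demo_keep_words  # set difference


def filter_tokens(tokens):
    """Drop every token that is still in the final stop-word set."""
    return [token for token in tokens if token not in demo_final]


filtered = filter_tokens("this is not a good movie".split())
print(filtered)
```

With the exception applied, "not" stays in the output, so the review no longer collapses into the same tokens as "this is a good movie".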

In [44]:
# Run the preprocessing.
reviews = list(map(lambda x: remove_stopwords(x), reviews_tokenize))
reviews[0]

['bromwell',
 'high',
 'cartoon',
 'comedy',
 'ran',
 'time',
 'programs',
 'school',
 'life',
 'teachers',
 'years',
 'teaching',
 'profession',
 'lead',
 'believe',
 'bromwell',
 'high',
 'satire',
 'closer',
 'reality',
 'teachers',
 'scramble',
 'survive',
 'financially',
 'insightful',
 'students',
 'right',
 'pathetic',
 'teachers',
 'pomp',
 'pettiness',
 'situation',
 'remind',
 'schools',
 'knew',
 'students',
 'saw',
 'episode',
 'student',
 'repeatedly',
 'tried',
 'burn',
 'down',
 'school',
 'immediately',
 'recalled',
 'high',
 'classic',
 'line',
 'inspector',
 'sack',
 'teachers',
 'student',
 'welcome',
 'bromwell',
 'high',
 'expect',
 'adults',
 'age',
 'think',
 'bromwell',
 'high',
 'far',
 'fetched',
 'pity',
 'isn']

## Lemmatization

A process from linguistics:
- groups the inflected forms of a word into a single item,
- identified by the word's lemma, or dictionary form.

For example, is, are, and be all become be, and walk, walking, and walked all become walk; in other words, each word is reduced to its base form.

This reduces the feature space and improves performance while more or less preserving the meaning of the review.

spaCy provides this feature, and it is said to work quite well for English.

Tokenization is already done, so all that remains is to convert these tokens to their base forms.
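
The idea can be sketched with a plain lookup table before bringing in spaCy. `LEMMA_TABLE` and `toy_lemmatize` are hypothetical; spaCy relies on much richer rules and data than this.

```python
# Hypothetical lookup table mapping inflected forms to their dictionary form.
LEMMA_TABLE = {
    "is": "be", "are": "be", "was": "be",
    "walking": "walk", "walked": "walk", "walks": "walk",
}


def toy_lemmatize(tokens):
    """Map each token to its dictionary form, leaving unknown words as-is."""
    return [LEMMA_TABLE.get(token, token) for token in tokens]


tokens = ["he", "walked", "she", "walks", "they", "are", "walking"]
lemmas = toy_lemmatize(tokens)

print(lemmas)
print(len(set(tokens)), "->", len(set(lemmas)))  # distinct words before and after
```

The number of distinct words drops after lemmatization, which is the feature-space reduction described above.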

In [45]:
import spacy
nlp = spacy.load('en')

In [57]:
doc = nlp('walked')
for token in doc:
    print(token.lemma_)
doc = nlp("isn't")
for token in doc:
    print(token.lemma_)

walk
be
not


In [58]:
def lemmatization(review):
    lemma_result=[]
    
    for words in review:
        doc = nlp(words)
        for token in doc:
            lemma_result.append(token.lemma_)
    return lemma_result

In [59]:
lemmatization("this isn't good".split())

['this', 'be', 'not', 'good']

In [60]:
# Running this function over every review takes a very long time.
# Calling nlp runs three components by default: parser, tagger, and ner.
# They are useful for part-of-speech tagging and named entity recognition, but are not needed here.
# Even with them disabled, it still takes a long time.
nlp = spacy.load('en', disable=['parser','tagger','ner'])

In [61]:
reviews = list(map(lambda x: lemmatization(x), reviews))
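
One possible way to cut the running time is to cache the result per distinct word. This relies on the fact that `lemmatization` processes each word on its own, so the same word always gives the same result. In the sketch below, `slow_lemma` is a made-up stand-in for the expensive `nlp(word)` call and `reviews_demo` is toy data.

```python
from functools import lru_cache

call_count = 0


def slow_lemma(word):
    """Stand-in for an expensive call such as nlp(word); the rule is hypothetical."""
    global call_count
    call_count += 1
    return word[:-1] if word.endswith("s") else word


@lru_cache(maxsize=None)
def cached_lemma(word):
    """Compute the lemma once per distinct word, then reuse it."""
    return slow_lemma(word)


reviews_demo = [
    ["teachers", "school", "teachers"],
    ["school", "students", "teachers"],
]
lemmatized = [[cached_lemma(word) for word in review] for review in reviews_demo]

print(lemmatized)
print("expensive calls:", call_count)
```

Six tokens trigger only three expensive calls here. With the real function, the cached value would have to be a tuple of lemmas, because one word can yield several tokens (for example "isn't" gives "be" and "not").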

In [56]:
reviews

[['bromwell',
  'high',
  'cartoon',
  'comedy',
  'ran',
  'time',
  'programs',
  'school',
  'life',
  'teachers',
  'years',
  'teaching',
  'profession',
  'lead',
  'believe',
  'bromwell',
  'high',
  'satire',
  'closer',
  'reality',
  'teachers',
  'scramble',
  'survive',
  'financially',
  'insightful',
  'students',
  'right',
  'pathetic',
  'teachers',
  'pomp',
  'pettiness',
  'situation',
  'remind',
  'schools',
  'knew',
  'students',
  'saw',
  'episode',
  'student',
  'repeatedly',
  'tried',
  'burn',
  'down',
  'school',
  'immediately',
  'recalled',
  'high',
  'classic',
  'line',
  'inspector',
  'sack',
  'teachers',
  'student',
  'welcome',
  'bromwell',
  'high',
  'expect',
  'adults',
  'age',
  'think',
  'bromwell',
  'high',
  'far',
  'fetched',
  'pity',
  'isn'],
 ['story',
  'man',
  'unnatural',
  'feelings',
  'pig',
  'starts',
  'opening',
  'scene',
  'terrific',
  'example',
  'absurd',
  'comedy',
  'formal',
  'orchestra',
  'audience'

## Pipeline

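Turn the steps so far into a chain that runs one process after another, so that data flows through and the desired result comes out at the end.

If a step turns out to be needed (or unneeded) along the way, it can easily be added or deleted.

spaCy provides this feature.

The intent seems to be to rebuild everything step by step as functions first; the lecturer's heavy accent made this part hard to follow.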
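
The "easy to add or delete a step" idea can be sketched by keeping the steps in a list and folding the data through them. All function names below are hypothetical, and this is plain Python rather than spaCy's own pipeline mechanism.

```python
from functools import reduce


def lowercase(text):
    return text.lower()


def split_words(text):
    return text.split()


def drop_short(tokens):
    # Hypothetical step: discard one-character tokens.
    return [token for token in tokens if len(token) > 1]


def run_pipeline(data, steps):
    """Feed data through each step in order; each output is the next input."""
    return reduce(lambda value, step: step(value), steps, data)


steps = [lowercase, split_words, drop_short]
result = run_pipeline("This is A Good Movie", steps)
print(result)
```

Removing a step is just a matter of changing the list, for example `run_pipeline(text, steps[:2])` skips `drop_short`.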

In [62]:
def load_data():
    # Read one review per line, stripping the trailing newline.
    with open("./Hands-on-NLP-with-PyTorch-master/data/reviews.txt", 'r') as reviewsFile:
        reviews = [line.rstrip("\n") for line in reviewsFile]

    # Labels are stored the same way, one per line.
    with open("./Hands-on-NLP-with-PyTorch-master/data/labels.txt", 'r') as labelsFile:
        labels = [line.rstrip("\n") for line in labelsFile]

    return reviews, labels

In [63]:
reviews, labels = load_data()

In [66]:
from nltk.tokenize import RegexpTokenizer
tokenizer = RegexpTokenizer(r"\w+'?\w+|\w+")

from nltk.corpus import stopwords
stop_words = stopwords.words('english')

from spacy.lang.en.stop_words import STOP_WORDS

exceptionStopWords = {
    'again',
    'against',
    'ain',
    'almost',
    'among',
    'amongst',
    'amount',
    'anyhow',
    'anyway',
    'aren',
    "aren't",
    'below',
    'bottom',
    'but',
    'cannot',
    'couldn',
    "couldn't",
    'didn',
    "didn't",
    'doesn',
    "doesn't",
    'don',
    "don't",
    'done',
    'down',
    'except',
    'few',
    'hadn',
    "hadn't",
    'hasn',
    "hasn't",
    'haven',
    "haven't",
    'however',
    'isn',
    "isn't",
    'least',
    'mightn',
    "mightn't",
    'move',
    'much',
    'must',
    'mustn',
    "mustn't",
    'needn',
    "needn't",
    'neither',
    'never',
    'nevertheless',
    'no',
    'nobody',
    'none',
    'noone',
    'nor',
    'not',
    'nothing',
    'should',
    "should've",
    'shouldn',
    "shouldn't",
    'too',
    'top',
    'up',
    'wasn',
    "wasn't",
    'well',
    'weren',
    "weren't",
    'won',
    "won't",
    'wouldn',
    "wouldn't",
}

stop_words = set(stop_words).union(STOP_WORDS)

final_stop_words = stop_words-exceptionStopWords

import spacy
nlp = spacy.load("en",disable=['parser', 'tagger', 'ner'])

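def make_token(review):
    # Lowercase first, as in the earlier tokenizing cell, so stop-word matching does not depend on case.
    return tokenizer.tokenize(str(review).lower())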

def remove_stopwords(review):
    return [token for token in review if token not in final_stop_words]

def lemmatization(review):
    lemma_result = []
    
    for words in review:
        doc = nlp(words)
        for token in doc:
            lemma_result.append(token.lemma_)
    return lemma_result

def pipeline(review):
    review = make_token(review)
    review = remove_stopwords(review)
    return lemmatization(review)

# %%time
reviews = list(map(lambda review: pipeline(review),reviews))