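# **Chapter 9. Natural Language Preprocessing**

# 9.1 What Is Natural Language Processing?

Natural language processing (NLP) : the process of analyzing the meaning of the language we use in everyday life so that a computer can process it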

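    - Corpus : the data used to train a model in natural language processing

    - Token : the unit into which a document is split for natural language processing
        + Tokenizing : the task of splitting a string into tokens
        + Tokenizer function : a function that splits a string into tokens

    - Tokenization : splitting text into sentences or words

    - Stop words : words that occur frequently but contribute little to the analysis, so they are removed in advance to improve performance
        (ex: 'the', 'she', 'a', 'he', ...)

    - Stemming : reducing a word to its base form
        (ex: 'consign', 'consigned', 'consigning', 'consignment' => all unified as 'consign')

    - Part-of-speech tagging : attaching a tag (identifying information) to each word to mark its part of speech
        (Det: determiner, Noun: noun, Verb: verb, Prep: preposition)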
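
The terms above can be tied together in a small pure-Python sketch. The stop-word set and the suffix list below are made up for illustration; they are not NLTK's stop-word list or the Porter stemmer, and a real stemmer applies many more rules.

```python
import re

# Hypothetical mini stop-word list, for illustration only.
STOP_WORDS = {"the", "she", "a", "he"}

# Naive suffix list, for illustration only; real stemmers use far more rules.
SUFFIXES = ("ment", "ing", "ed")


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def remove_stop_words(tokens):
    """Drop every token that appears in the stop-word set."""
    return [t for t in tokens if t not in STOP_WORDS]


def naive_stem(token):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


tokens = tokenize("She consigned the consignment, consigning a parcel")
filtered = remove_stop_words(tokens)
stems = [naive_stem(t) for t in filtered]
print(stems)  # ['consign', 'consign', 'consign', 'parcel']
```

The three inflected forms collapse to the same stem, which is the effect stemming is meant to have on the vocabulary size.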

In [1]:
pip install nltk



In [2]:
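# Tokenize a sentence into words so that it can be POS-tagged
import nltk
nltk.download()  # called with no arguments, this opens the interactive NLTK downloader
text = nltk.word_tokenize("Is it possible distinguishing cats and dogs")
print(text)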

['Is', 'it', 'possible', 'distinguishing', 'cats', 'and', 'dogs']


In [4]:
# Download the resource needed for tagging
nltk.download('averaged_perceptron_tagger')

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


True

In [5]:
# POS tagging
nltk.pos_tag(text)

[('Is', 'VBZ'),
 ('it', 'PRP'),
 ('possible', 'JJ'),
 ('distinguishing', 'VBG'),
 ('cats', 'NNS'),
 ('and', 'CC'),
 ('dogs', 'NNS')]
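
The tags in this output follow the Penn Treebank tag set. As a reading aid, the sketch below attaches a short description to each tag that appears above and pulls out the nouns. The `TAG_DESCRIPTIONS` dictionary and the helper names are written for this note and cover only the six tags seen here.

```python
# Descriptions of the Penn Treebank tags that appear in the output above.
TAG_DESCRIPTIONS = {
    "VBZ": "verb, 3rd person singular present",
    "PRP": "personal pronoun",
    "JJ": "adjective",
    "VBG": "verb, gerund or present participle",
    "NNS": "noun, plural",
    "CC": "coordinating conjunction",
}

# Tagged tokens copied from the nltk.pos_tag output above.
tagged = [
    ("Is", "VBZ"), ("it", "PRP"), ("possible", "JJ"),
    ("distinguishing", "VBG"), ("cats", "NNS"), ("and", "CC"), ("dogs", "NNS"),
]


def describe(tagged_tokens):
    """Attach a human-readable description to every (word, tag) pair."""
    return [(w, t, TAG_DESCRIPTIONS.get(t, "unknown tag")) for w, t in tagged_tokens]


def nouns(tagged_tokens):
    """Keep only the words whose tag starts with 'NN' (the noun tags)."""
    return [w for w, t in tagged_tokens if t.startswith("NN")]


for word, tag, desc in describe(tagged):
    print(f"{word:16}{tag:5}{desc}")
print(nouns(tagged))  # ['cats', 'dogs']
```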

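**The Natural Language Processing Pipeline**

1. Natural language input
2. Natural language preprocessing

    (1) Tokenization
    (2) Stop-word removal
    (3) Stemming
    (4) Normalization
3. Embedding : converting words into vectors
4. Applying a model -> performing classification and prediction on the data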
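
To make step 3 concrete, here is a minimal sketch that preprocesses two sentences and turns each into a vector. It uses a plain word-count vector as the simplest possible stand-in for an embedding; learned embeddings such as word2vec work differently. The stop-word set and function names are assumptions made for this example.

```python
from collections import Counter


def preprocess(text, stop_words):
    """Normalize (lowercase), tokenize on whitespace, and remove stop words."""
    tokens = text.lower().split()
    return [t for t in tokens if t not in stop_words]


def build_vocab(documents):
    """Give each distinct token an integer index in order of first appearance."""
    vocab = {}
    for tokens in documents:
        for token in tokens:
            vocab.setdefault(token, len(vocab))
    return vocab


def to_count_vector(tokens, vocab):
    """Count how often each vocabulary word occurs in one document."""
    counts = Counter(tokens)
    return [counts.get(word, 0) for word in vocab]  # dicts keep insertion order


stop_words = {"is", "my"}  # hypothetical stop-word list
sentences = ["my favorite subject is math", "my favorite subject is computer science"]
docs = [preprocess(s, stop_words) for s in sentences]
vocab = build_vocab(docs)
vectors = [to_count_vector(d, vocab) for d in docs]

print(vocab)    # {'favorite': 0, 'subject': 1, 'math': 2, 'computer': 3, 'science': 4}
print(vectors)  # [[1, 1, 1, 0, 0], [1, 1, 0, 1, 1]]
```

A model in step 4 would then consume these fixed-length numeric vectors instead of raw text.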

In [6]:
# Import NLTK and define the sentences

import nltk
nltk.download('punkt')  # download the resource needed to split sentences into words
string1 = 'my favorite subject is math'
string2 = 'my favorite subject is math, english, economic and computer science'

print(nltk.word_tokenize(string1))
print(nltk.word_tokenize(string2))

['my', 'favorite', 'subject', 'is', 'math']
['my', 'favorite', 'subject', 'is', 'math', ',', 'english', ',', 'economic', 'and', 'computer', 'science']


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [10]:
!pip install konlpy

Collecting konlpy
  Downloading konlpy-0.6.0-py2.py3-none-any.whl.metadata (1.9 kB)
Collecting JPype1>=0.7.0 (from konlpy)
  Downloading jpype1-1.5.2-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (4.9 kB)
Downloading konlpy-0.6.0-py2.py3-none-any.whl (19.4 MB)
Downloading jpype1-1.5.2-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (494 kB)
Installing collected packages: JPype1, konlpy
Successfully installed JPype1-1.5.2 konlpy-0.6.0


In [12]:
# Import the library
from konlpy.tag import Komoran

# Split a sentence into morphemes
komoran = Komoran()
print(komoran.morphs('딥러닝이 쉽나요? 어렵나요?'))

['딥러닝이', '쉽', '나요', '?', '어렵', '나요', '?']


In [13]:
# POS tagging
print(komoran.pos('소파 위에 있는 것이 고양이인가요? 강아지인가요?'))

[('소파', 'NNP'), ('위', 'NNG'), ('에', 'JKB'), ('있', 'VV'), ('는', 'ETM'), ('것', 'NNB'), ('이', 'JKS'), ('고양이', 'NNG'), ('이', 'VCP'), ('ㄴ가요', 'EF'), ('?', 'SF'), ('강아지', 'NNG'), ('이', 'VCP'), ('ㄴ가요', 'EF'), ('?', 'SF')]
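
A common next step with `(morpheme, tag)` pairs like these is to keep only the nouns and to see which tags dominate. The sketch below works directly on the pairs copied from the output above, so it runs without KoNLPy. It assumes that noun tags in this tag set start with `NN`, which holds for the `NNP`, `NNG`, and `NNB` tags seen here.

```python
from collections import Counter

# (morpheme, tag) pairs copied from the komoran.pos output above.
tagged = [
    ("소파", "NNP"), ("위", "NNG"), ("에", "JKB"), ("있", "VV"), ("는", "ETM"),
    ("것", "NNB"), ("이", "JKS"), ("고양이", "NNG"), ("이", "VCP"), ("ㄴ가요", "EF"),
    ("?", "SF"), ("강아지", "NNG"), ("이", "VCP"), ("ㄴ가요", "EF"), ("?", "SF"),
]

# Keep morphemes whose tag starts with 'NN' (assumed here to mark nouns).
nouns = [morph for morph, tag in tagged if tag.startswith("NN")]

# Count how many morphemes received each tag.
tag_counts = Counter(tag for _, tag in tagged)

print(nouns)                      # ['소파', '위', '것', '고양이', '강아지']
print(tag_counts.most_common(3))
```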


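# 9.2 Preprocessing

Preprocessing steps:

sentence -> **missing-value check, tokenization** -> word index -> **stop-word removal** -> reduced word index -> **stemming**

**9.2.1 Checking for Missing Values**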

In [14]:
# Load the data to check for missing values
import pandas as pd
df = pd.read_csv('/content/drive/MyDrive/pytorch_ex/chap09/data/class2.csv')
print(df)

   Unnamed: 0      id tissue class class2      x      y      r
0           0  mdb000      C  CIRC      N  535.0  475.0  192.0
1           1  mdb001      A  CIRA      N  433.0  268.0   58.0
2           2  mdb002      A  CIRA      I    NaN    NaN    NaN
3           3  mdb003      C  CIRC      B    NaN    NaN    NaN
4           4  mdb004      F  CIRF      I  488.0  145.0   29.0
5           5  mdb005      F  CIRF      B  544.0  178.0   26.0


In [15]:
# Count the missing values in each column
df.isnull().sum()

Unnamed: 0    0
id            0
tissue        0
class         0
class2        0
x             2
y             2
r             2


In [16]:
# Ratio of missing values to the total number of rows
df.isnull().sum() / len(df)

Unnamed: 0    0.0
id            0.0
tissue        0.0
class         0.0
class2        0.0
x             0.333333
y             0.333333
r             0.333333
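
The same ratio can be obtained in one step, because the mean of a boolean column equals the share of `True` values. The sketch below rebuilds only the `id`, `x`, `y`, and `r` columns by hand from the table printed earlier, so it does not need the CSV file; the variable names are chosen for this example.

```python
import numpy as np
import pandas as pd

# Columns rebuilt by hand from the table printed earlier (not read from the CSV).
df = pd.DataFrame({
    "id": ["mdb000", "mdb001", "mdb002", "mdb003", "mdb004", "mdb005"],
    "x": [535.0, 433.0, np.nan, np.nan, 488.0, 544.0],
    "y": [475.0, 268.0, np.nan, np.nan, 145.0, 178.0],
    "r": [192.0, 58.0, np.nan, np.nan, 29.0, 26.0],
})

ratio_by_division = df.isnull().sum() / len(df)  # the form used in the cell above
ratio_by_mean = df.isna().mean()                 # mean of booleans = fraction of NaN

# Index labels of the rows that contain at least one missing value.
rows_with_missing = df.index[df.isna().any(axis=1)].tolist()

print(ratio_by_mean)
print(rows_with_missing)  # [2, 3]
```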


In [17]:
# Drop rows in which every element is missing
# (no row here is entirely NaN, so the DataFrame is unchanged)
df = df.dropna(how='all')
df

   Unnamed: 0      id tissue class class2      x      y      r
0           0  mdb000      C  CIRC      N  535.0  475.0  192.0
1           1  mdb001      A  CIRA      N  433.0  268.0   58.0
2           2  mdb002      A  CIRA      I    NaN    NaN    NaN
3           3  mdb003      C  CIRC      B    NaN    NaN    NaN
4           4  mdb004      F  CIRF      I  488.0  145.0   29.0
5           5  mdb005      F  CIRF      B  544.0  178.0   26.0
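
Because the dataset has no row that is entirely empty, `how='all'` removes nothing here. The difference between `how='all'` and `how='any'` is easier to see on a toy frame; the data below is hypothetical and exists only to show the two behaviors side by side.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: row 1 is entirely missing, row 2 is partly missing.
toy = pd.DataFrame({
    "x": [535.0, np.nan, np.nan, 488.0],
    "y": [475.0, np.nan, 268.0, 145.0],
})

dropped_all = toy.dropna(how="all")  # removes only rows where every value is NaN
dropped_any = toy.dropna(how="any")  # removes rows with at least one NaN (the default)

print(list(dropped_all.index))  # [0, 2, 3]
print(list(dropped_any.index))  # [0, 3]
```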


In [18]:
# Drop rows that contain at least one missing value
df1 = df.dropna()  # any row with a NaN in any column is removed
print(df1)

   Unnamed: 0      id tissue class class2      x      y      r
0           0  mdb000      C  CIRC      N  535.0  475.0  192.0
1           1  mdb001      A  CIRA      N  433.0  268.0   58.0
4           4  mdb004      F  CIRF      I  488.0  145.0   29.0
5           5  mdb005      F  CIRF      B  544.0  178.0   26.0


In [19]:
# Fill missing values with 0
df2 = df.fillna(0)
print(df2)

   Unnamed: 0      id tissue class class2      x      y      r
0           0  mdb000      C  CIRC      N  535.0  475.0  192.0
1           1  mdb001      A  CIRA      N  433.0  268.0   58.0
2           2  mdb002      A  CIRA      I    0.0    0.0    0.0
3           3  mdb003      C  CIRC      B    0.0    0.0    0.0
4           4  mdb004      F  CIRF      I  488.0  145.0   29.0
5           5  mdb005      F  CIRF      B  544.0  178.0   26.0


In [20]:
# Fill missing values with the column mean
# Assign the result back instead of calling fillna(..., inplace=True) on df['x'];
# pandas warns that the chained in-place form will stop working in pandas 3.0
df['x'] = df['x'].fillna(df['x'].mean())  # fill NaN in column x with the mean of x
print(df)

   Unnamed: 0      id tissue class class2      x      y      r
0           0  mdb000      C  CIRC      N  535.0  475.0  192.0
1           1  mdb001      A  CIRA      N  433.0  268.0   58.0
2           2  mdb002      A  CIRA      I  500.0    NaN    NaN
3           3  mdb003      C  CIRC      B  500.0    NaN    NaN
4           4  mdb004      F  CIRF      I  488.0  145.0   29.0
5           5  mdb005      F  CIRF      B  544.0  178.0   26.0
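
The same idea extends to several columns at once by passing a `{column: value}` mapping to `DataFrame.fillna`, which also avoids the chained in-place call. This is a sketch on a hand-built copy of the numeric columns from the table above, not the notebook's own code; `numeric_cols`, `col_means`, and `df_filled` are names chosen for this example.

```python
import numpy as np
import pandas as pd

# Columns rebuilt by hand from the table printed above (not read from the CSV).
df = pd.DataFrame({
    "id": ["mdb000", "mdb001", "mdb002", "mdb003", "mdb004", "mdb005"],
    "x": [535.0, 433.0, np.nan, np.nan, 488.0, 544.0],
    "y": [475.0, 268.0, np.nan, np.nan, 145.0, 178.0],
    "r": [192.0, 58.0, np.nan, np.nan, 29.0, 26.0],
})

numeric_cols = ["x", "y", "r"]

# mean() skips NaN by default, so each mean is taken over the four observed rows.
col_means = df[numeric_cols].mean()

# fillna with a {column: value} mapping returns a new frame; df itself is untouched.
df_filled = df.fillna(col_means.to_dict())

print(df_filled)
```

Mean imputation keeps every row but shrinks the variance of the filled columns, so whether it is preferable to dropping rows depends on the downstream model.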


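**9.2.2 Tokenization**

: Splitting a given text into smaller units such as sentences, words, or characters

- Sentence tokenization : splits text into sentences at sentence-ending marks (period, exclamation mark, question mark, etc.)
- Word tokenization : splits text into words, using whitespace as the delimiter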
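
The two rules above can be approximated in a few lines of standard-library Python. This is a deliberately naive sketch, not what NLTK's `sent_tokenize` or `word_tokenize` do internally: it would wrongly split after abbreviations such as "Dr.", and it leaves punctuation attached to words.

```python
import re


def naive_sent_tokenize(text):
    """Split after '.', '!' or '?' whenever whitespace follows."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def naive_word_tokenize(sentence):
    """Split on runs of whitespace only; punctuation stays attached."""
    return sentence.split()


text = "Is NLP hard? It can be! This book is for deep learning learners."
sentences = naive_sent_tokenize(text)
words = naive_word_tokenize(sentences[-1])

print(sentences)  # ['Is NLP hard?', 'It can be!', 'This book is for deep learning learners.']
print(words)      # ['This', 'book', 'is', 'for', 'deep', 'learning', 'learners.']
```

Note the trailing period in `'learners.'`: handling punctuation is one reason to prefer a dedicated tokenizer over a plain `split()`.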

In [22]:
# Sentence tokenization (splits at . ! ?)
from nltk import sent_tokenize
text_sample = 'Natural Language Processing, or NLP, is the process of extracting the meaning, or intent, behind human language. In the field of Conversational artificial intelligence (AI), NLP allows machines and applications to understand the intent of human language inputs, and then generate appropriate responses, resulting in a natural conversation flow.'

tokenized_sentences = sent_tokenize(text_sample)
print(tokenized_sentences)

['Natural Language Processing, or NLP, is the process of extracting the meaning, or intent, behind human language.', 'In the field of Conversational artificial intelligence (AI), NLP allows machines and applications to understand the intent of human language inputs, and then generate appropriate responses, resulting in a natural conversation flow.']


In [24]:
# Word tokenization (whitespace-based)
from nltk import word_tokenize
sentence = 'This book is for deep learning learners'
words = word_tokenize(sentence)
print(words)

['This', 'book', 'is', 'for', 'deep', 'learning', 'learners']


In [25]:
# Word tokenization of a sentence that contains apostrophes (')
from nltk.tokenize import WordPunctTokenizer
sentence = "It's nothing that you don't already know except most people aren't aware of how their inner world works."
words = WordPunctTokenizer().tokenize(sentence)
print(words)

['It', "'", 's', 'nothing', 'that', 'you', 'don', "'", 't', 'already', 'know', 'except', 'most', 'people', 'aren', "'", 't', 'aware', 'of', 'how', 'their', 'inner', 'world', 'works', '.']
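
`WordPunctTokenizer` is documented as splitting on the regular expression `\w+|[^\w\s]+`, which is why every apostrophe becomes its own token. The sketch below reproduces that split with `re.findall` and contrasts it with an alternative pattern that keeps contractions together. The second pattern and both function names are assumptions made for this example, not part of NLTK.

```python
import re

sentence = ("It's nothing that you don't already know except most people "
            "aren't aware of how their inner world works.")


def punct_split(text):
    """Runs of word characters, or runs of punctuation, as separate tokens."""
    return re.findall(r"\w+|[^\w\s]+", text)


def keep_contractions(text):
    """Like punct_split, but an apostrophe inside a word stays with the word."""
    return re.findall(r"\w+(?:'\w+)?|[^\w\s]+", text)


split_tokens = punct_split(sentence)
joined_tokens = keep_contractions(sentence)

print(split_tokens[:3])   # ['It', "'", 's']
print(joined_tokens[:3])  # ["It's", 'nothing', 'that']
```

Which behavior is preferable depends on the task: separate pieces give a smaller vocabulary, while intact contractions preserve negation such as "don't" as a single unit.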


In [27]:
pip install gensim

Collecting gensim
  Downloading gensim-4.3.3-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (8.1 kB)
Collecting numpy<2.0,>=1.18.5 (from gensim)
  Downloading numpy-1.26.4-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (61 kB)
Collecting scipy<1.14.0,>=1.7.0 (from gensim)
  Downloading scipy-1.13.1-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (60 kB)
Downloading gensim-4.3.3-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (26.7 MB)
Downloading numpy-1.26.4-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (18.3 MB)

In [28]:
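# Korean tokenization example

# Import libraries and load the dataset
import csv
from konlpy.tag import Okt
from gensim.models import word2vec

# Read the tab-separated file; the with-block closes it automatically
with open(r'/content/drive/MyDrive/pytorch_ex/chap09/data/ratings_train.txt', 'r', encoding='utf-8') as f:
    rdr = csv.reader(f, delimiter='\t')
    rdw = list(rdr)  # materialize every row as a list of strings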

ValueError: numpy.dtype size changed, may indicate binary incompatibility. Expected 96 from C header, got 88 from PyObject
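
The tab-separated reading step from the cell above can be tried without the dataset or gensim. The sketch below parses a small in-memory sample with `csv.reader`. The sample rows are made up, and the `id`, `document`, `label` column layout is an assumption about the file, not something shown in the notebook.

```python
import csv
import io

# Made-up stand-in for the file contents; the column names are an assumption.
sample = (
    "id\tdocument\tlabel\n"
    "1\tgreat movie\t1\n"
    "2\t\t0\n"
    "3\ttoo long and boring\t0\n"
)

with io.StringIO(sample) as f:
    reader = csv.reader(f, delimiter="\t")
    header = next(reader)  # skip the header row
    # Keep only well-formed rows whose text field is not empty.
    rows = [row for row in reader if len(row) == 3 and row[1].strip()]

documents = [row[1] for row in rows]
labels = [int(row[2]) for row in rows]

print(header)     # ['id', 'document', 'label']
print(documents)  # ['great movie', 'too long and boring']
print(labels)     # [1, 0]
```

Filtering empty text fields at read time plays the same role for text data that the missing-value check in 9.2.1 plays for numeric columns.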