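1. Case normalization
2. Removing digits, punctuation, and special characters
3. Stopword removal
4. Stemming (unifying words that share a root)
5. N-gram
6. POS tagging
7. Vectorize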

### 1. Case normalization

In [1]:
s = 'Hello World'

In [2]:
s.lower()

'hello world'

In [3]:
s.upper()

'HELLO WORLD'
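
When the goal is case-insensitive matching rather than display, `str.casefold()` is a slightly more aggressive alternative to `lower()`. The sketch below is an addition to the notebook; the helper name `normalize_case` and the sample tokens are hypothetical.

```python
def normalize_case(tokens):
    """Return the tokens with case differences removed."""
    # casefold() also handles characters that lower() leaves alone,
    # such as the German sharp s ('ß' becomes 'ss').
    return [t.casefold() for t in tokens]

# Hypothetical sample tokens, not taken from the notebook
tokens = ['Hello', 'WORLD', 'Straße']
normalized = normalize_case(tokens)
print(normalized)
```

For plain English text the result is the same as calling `lower()` on every token.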

### 2. Removing digits, punctuation, and special characters

In [4]:
import re  # regular expression module
p = re.compile('[0-9]+')

In [5]:
# The sample sentence means "Seoul real estate prices have risen by an average of 30% this year."
p.sub('a', '서울 부동산 가격이 올해 들어 평균 30% 상승했습니다.')

'서울 부동산 가격이 올해 들어 평균 a% 상승했습니다.'

In [6]:
p = re.compile(r'\W+')  # matches punctuation and special characters; '_' counts as a word character and is not matched

In [7]:
p.sub('s', '주제_1 : *서울 부동산 가격이 올해 들어 평균 30% 상승했습니다.')

'주제_1s서울s부동산s가격이s올해s들어s평균s30s상승했습니다s'

In [8]:
p = re.compile('_')

In [9]:
p.sub('s', '주제_1 : *서울 부동산 가격이 올해 들어 평균 30% 상승했습니다.')

'주제s1 : *서울 부동산 가격이 올해 들어 평균 30% 상승했습니다.'
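
The three patterns above can also be merged into a single character class. The following is a minimal sketch, assuming that every run of digits, punctuation, and underscores should collapse into one space; the names `NON_TEXT` and `clean_text` are hypothetical and not part of the original notebook.

```python
import re

# One character class covering non-word characters (\W), underscores, and digits
NON_TEXT = re.compile(r'[\W_0-9]+')

def clean_text(text):
    """Replace each run of digits, punctuation, or underscores with one space."""
    return NON_TEXT.sub(' ', text).strip()

cleaned = clean_text('주제_1 : *서울 부동산 가격이 올해 들어 평균 30% 상승했습니다.')
print(cleaned)
```

Substituting a space instead of an empty string keeps neighbouring words from being glued together when the text is tokenized later.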

### 3. Stopword removal

In [10]:
# Sample Korean tokens (for example, '연휴' means holiday and '교통사고' means traffic accident)
words_korean = ['추적','연휴','민족','대이동','시작','늘어','교통량','교통사고','특히','자동차',
                '고창','상당수','차지','나타','것','기자']

In [11]:
stopwords_korean = ['가다','늘어','나타','것','기자']  # named so it is not shadowed by the nltk.corpus.stopwords import

In [12]:
[i for i in words_korean if i not in stopwords_korean]

['추적', '연휴', '민족', '대이동', '시작', '교통량', '교통사고', '특히', '자동차', '고창', '상당수', '차지']

In [13]:
from nltk.corpus import stopwords

In [14]:
import nltk
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\user\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [15]:
words_english = ['chief','justice','roberts','president','carter','clinton','bush','president','obama',
                'fellow',',','american','and','people','of',',','the','world','thank','you']

In [16]:
english_stopwords = set(stopwords.words('english'))  # build the set once instead of on every iteration
[w for w in words_english if w not in english_stopwords]

['chief',
 'justice',
 'roberts',
 'president',
 'carter',
 'clinton',
 'bush',
 'president',
 'obama',
 'fellow',
 ',',
 'american',
 'people',
 ',',
 'world',
 'thank']
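
The filtered list above still contains the `','` tokens, because punctuation is not in the stopword list. One way to drop those as well is sketched below. To keep the snippet runnable without NLTK data, it uses a small hand-written set, `SAMPLE_STOPWORDS`, as a stand-in for `stopwords.words('english')`; that set and the helper `remove_stopwords` are assumptions made for illustration.

```python
import string

# Hypothetical stand-in for stopwords.words('english')
SAMPLE_STOPWORDS = {'and', 'of', 'the', 'you'}

def remove_stopwords(tokens, stopword_set=SAMPLE_STOPWORDS):
    """Drop stopwords and tokens that consist only of ASCII punctuation."""
    return [
        t for t in tokens
        if t not in stopword_set
        and not all(ch in string.punctuation for ch in t)
    ]

tokens = ['fellow', ',', 'american', 'and', 'people', 'of', ',',
          'the', 'world', 'thank', 'you']
filtered = remove_stopwords(tokens)
print(filtered)
```

Using a set for the stopwords makes each membership test constant time, which matters once the token list grows beyond a toy example.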

### 4. Stemming (unifying words that share a root)

In [17]:
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize

In [18]:
ps_stemmer = PorterStemmer()

In [21]:
new_text = 'It is important to be immersed while you are pythoning with python. All pythoners have\
 phythoned poorly at least once.'

In [22]:
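nltk.download('punkt')  # tokenizer data for word_tokenize; newer NLTK releases may need nltk.download('punkt_tab') instead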

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\user\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [23]:
words = word_tokenize(new_text)
print(words)

['It', 'is', 'important', 'to', 'be', 'immersed', 'while', 'you', 'are', 'pythoning', 'with', 'python', '.', 'All', 'pythoners', 'have', 'phythoned', 'poorly', 'at', 'least', 'once', '.']


In [26]:
for w in words:
    print(ps_stemmer.stem(w), end = ' ')

It is import to be immers while you are python with python . all python have phython poorli at least onc . 

In [27]:
from nltk.stem.lancaster import LancasterStemmer

In [28]:
LS_stemmer = LancasterStemmer()

In [29]:
for w in words:
    print(LS_stemmer.stem(w), end = ' ')

it is import to be immers whil you ar python with python . al python hav phython poor at least ont . 

In [30]:
from nltk.stem.regexp import RegexpStemmer

In [31]:
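RS_stemmer = RegexpStemmer('Python')  # removes substrings matching the regex; 'Python' is case-sensitive, so none of the lowercase tokens below change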

In [32]:
for w in words:
    print(RS_stemmer.stem(w), end = ' ')

It is important to be immersed while you are pythoning with python . All pythoners have phythoned poorly at least once . 
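
`RegexpStemmer` deletes whatever its pattern matches, so in practice the pattern is usually a list of suffixes anchored with `$`. The stdlib-only sketch below mimics that idea with `re`. The suffix list, the `min_length` cutoff, and the function name `strip_suffix` are assumptions chosen for this example, not NLTK defaults.

```python
import re

# Hypothetical suffix list; '$' anchors the match to the end of the word
SUFFIXES = re.compile(r'(ing|ers|ed|ly)$')

def strip_suffix(word, min_length=4):
    """Remove one matching suffix from words of at least min_length characters."""
    if len(word) < min_length:
        return word  # leave short words such as 'It' untouched
    return SUFFIXES.sub('', word)

words = ['pythoning', 'python', 'pythoners', 'phythoned', 'poorly', 'It']
stems = [strip_suffix(w) for w in words]
print(stems)
```

Unlike the Porter and Lancaster stemmers, this approach applies no linguistic rules, so it only helps when the suffixes of interest are known in advance.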

### 5. N-gram

In [33]:
from nltk import ngrams

In [34]:
sentence = 'Chief Justice Roberts, President Carter, President Clinton, President Bush, President Obama,\
 fellow Americans and people of the world, thank you. We, the citizens of America are now joined in a\
  great national effort to rebuild our country and restore its promise for all of our people. Together\
  , we will determine the course of America and the world for many, many years to come. We will face \
  challenges. We will confront hardships, but we will get the job done.'

In [35]:
grams = ngrams(sentence.split(), 2)
for gram in grams:
    print(gram, end = ' ')

('Chief', 'Justice') ('Justice', 'Roberts,') ('Roberts,', 'President') ('President', 'Carter,') ('Carter,', 'President') ('President', 'Clinton,') ('Clinton,', 'President') ('President', 'Bush,') ('Bush,', 'President') ('President', 'Obama,') ('Obama,', 'fellow') ('fellow', 'Americans') ('Americans', 'and') ('and', 'people') ('people', 'of') ('of', 'the') ('the', 'world,') ('world,', 'thank') ('thank', 'you.') ('you.', 'We,') ('We,', 'the') ('the', 'citizens') ('citizens', 'of') ('of', 'America') ('America', 'are') ('are', 'now') ('now', 'joined') ('joined', 'in') ('in', 'a') ('a', 'great') ('great', 'national') ('national', 'effort') ('effort', 'to') ('to', 'rebuild') ('rebuild', 'our') ('our', 'country') ('country', 'and') ('and', 'restore') ('restore', 'its') ('its', 'promise') ('promise', 'for') ('for', 'all') ('all', 'of') ('of', 'our') ('our', 'people.') ('people.', 'Together') ('Together', ',') (',', 'we') ('we', 'will') ('will', 'determine') ('determine', 'the') ('the', 'course

In [36]:
grams = ngrams(sentence.split(), 3)
for gram in grams:
    print(gram, end = ' ')

('Chief', 'Justice', 'Roberts,') ('Justice', 'Roberts,', 'President') ('Roberts,', 'President', 'Carter,') ('President', 'Carter,', 'President') ('Carter,', 'President', 'Clinton,') ('President', 'Clinton,', 'President') ('Clinton,', 'President', 'Bush,') ('President', 'Bush,', 'President') ('Bush,', 'President', 'Obama,') ('President', 'Obama,', 'fellow') ('Obama,', 'fellow', 'Americans') ('fellow', 'Americans', 'and') ('Americans', 'and', 'people') ('and', 'people', 'of') ('people', 'of', 'the') ('of', 'the', 'world,') ('the', 'world,', 'thank') ('world,', 'thank', 'you.') ('thank', 'you.', 'We,') ('you.', 'We,', 'the') ('We,', 'the', 'citizens') ('the', 'citizens', 'of') ('citizens', 'of', 'America') ('of', 'America', 'are') ('America', 'are', 'now') ('are', 'now', 'joined') ('now', 'joined', 'in') ('joined', 'in', 'a') ('in', 'a', 'great') ('a', 'great', 'national') ('great', 'national', 'effort') ('national', 'effort', 'to') ('effort', 'to', 'rebuild') ('to', 'rebuild', 'our') ('r
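
Because `sentence.split()` splits only on whitespace, tokens such as `'Roberts,'` keep their punctuation in the n-grams above. The sketch below is an addition to the notebook: it lowercases and strips punctuation first, builds the n-grams with plain `zip`, and counts them. The helper `make_ngrams` is hypothetical, and the short sample text is the last sentence of `sentence`.

```python
import re
from collections import Counter

def make_ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    # zip stops at the shortest slice, so incomplete runs at the end are dropped
    return list(zip(*(tokens[i:] for i in range(n))))

text = 'We will confront hardships, but we will get the job done.'
tokens = re.findall(r'\w+', text.lower())  # lowercase and keep word characters only
bigrams = make_ngrams(tokens, 2)
counts = Counter(bigrams)
print(counts.most_common(1))
```

Counting n-grams after normalization lets `('we', 'will')` be recognized as a repeated pair even though one occurrence is capitalized in the text.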

### 6. POS tagging