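- If each document is converted into a fixed-length vector, documents can be compared with one another by comparing their vectors.
- How do we convert a document into a document vector?
    - Existing packages such as Doc2Vec or Sent2Vec can be used.
    - The simplest method is to take the **average of the word vectors in the document**.
- Let's **convert each word in a document into a word vector with Word2Vec**, **average those vectors into a document vector**, and build a simple recommender system that finds books similar to one the user likes.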
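
The averaging idea can be sketched with a few made-up vectors before any model is trained. Everything in this block is an assumption for illustration: `toy_vectors`, `document_vector`, and `cosine` are hypothetical names, and the 3-dimensional values stand in for what a real Word2Vec model would supply.

```python
import numpy as np

# Hypothetical 3-dimensional word vectors; a real Word2Vec model would supply these.
toy_vectors = {
    "power": np.array([1.0, 0.0, 0.0]),
    "war": np.array([0.8, 0.2, 0.0]),
    "love": np.array([0.0, 1.0, 0.0]),
    "romance": np.array([0.0, 0.9, 0.1]),
}

def document_vector(tokens, vectors, dim=3):
    """Average the vectors of the tokens that have one; zeros if none do."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return np.zeros(dim)
    return np.mean(known, axis=0)

def cosine(a, b):
    """Cosine similarity between two 1-D vectors (0.0 if either is all zeros)."""
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom else 0.0

doc_a = document_vector(["power", "war", "unknown"], toy_vectors)  # "unknown" is skipped
doc_b = document_vector(["love", "romance"], toy_vectors)
doc_c = document_vector(["war", "power"], toy_vectors)

print(cosine(doc_a, doc_c))  # same known words, so the similarity is 1 up to rounding
print(cosine(doc_a, doc_b))  # different topics, so the similarity is much lower
```

Words without a vector are simply skipped, which is one common choice; a document with no known words falls back to a zero vector here.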

# Data Loading

In [1]:
import urllib.request
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import requests
import re
import nltk
from nltk.tokenize import RegexpTokenizer
from nltk.corpus import stopwords
from gensim.models import Word2Vec
from gensim.models import KeyedVectors
from sklearn.metrics.pairwise import cosine_similarity

# Image (PIL) and BytesIO are presumably for opening cover images fetched from image_link with requests.
from PIL import Image
from io import BytesIO

In [4]:
df = pd.read_csv('book_data.csv')
print('Total number of documents:', len(df))
df.head()

Total number of documents: 2382


Unnamed: 0.2,Unnamed: 0,Desc,Unnamed: 0.1,author,genre,image_link,rating,title
0,0,We know that power is shifting: From West to E...,0.0,Moisés Naím,Business,https://i.gr-assets.com/images/S/compressed.ph...,3.63,The End of Power: From Boardrooms to Battlefie...
1,1,Following the success of The Accidental Billio...,1.0,Blake J. Harris,Business,https://i.gr-assets.com/images/S/compressed.ph...,3.94,"Console Wars: Sega, Nintendo, and the Battle t..."
2,2,How to tap the power of social software and ne...,2.0,Chris Brogan,Business,https://i.gr-assets.com/images/S/compressed.ph...,3.78,Trust Agents: Using the Web to Build Influence...
3,3,William J. Bernstein is an American financial ...,3.0,William J. Bernstein,Business,https://i.gr-assets.com/images/S/compressed.ph...,4.2,The Four Pillars of Investing
4,4,Amazing book. And I joined Steve Jobs and many...,4.0,Akio Morita,Business,https://i.gr-assets.com/images/S/compressed.ph...,4.05,Made in Japan: Akio Morita and Sony


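    There are unnecessary columns named Unnamed: 0, Unnamed: 0.1 and Unnamed: 0.2.
    
    The Desc column, which holds each book's description, is the important one.
    Preprocessing and tokenization are needed, so let's wrap each step in a function. (This seems like a good approach.)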
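
The leftover index columns could be dropped in one step. This is a minimal sketch on a made-up frame (`toy` and its values are hypothetical), not something the notebook itself runs:

```python
import pandas as pd

# Toy frame mimicking the column layout above; the values are made up.
toy = pd.DataFrame({
    "Unnamed: 0": [0, 1],
    "Desc": ["first plot", "second plot"],
    "Unnamed: 0.1": [0.0, 1.0],
    "title": ["Book A", "Book B"],
})

# Keep only the columns whose name does not start with "Unnamed".
toy = toy.loc[:, ~toy.columns.str.startswith("Unnamed")]
print(list(toy.columns))  # ['Desc', 'title']
```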

In [7]:
# Remove characters outside the ASCII range (code points >= 128).
def removeNonAscii(s):
    return ''.join(i for i in s if ord(i) < 128)

# Convert uppercase to lowercase.
def makeLower(text):
    return text.lower()

# Remove stop words. Requires the NLTK stop-word corpus: nltk.download('stopwords').
def removeStopWords(text):
    text = text.split()  # split on whitespace
    stops = set(stopwords.words('english'))  # load the English stop-word list
    text = [w for w in text if w not in stops]  # keep only words that are not stop words
    text = ' '.join(text)  # join the remaining words back into one string
    return text

# Remove HTML tags.
def removeHTML(text):
    htmlPattern = re.compile('<.*?>')  # anything enclosed in angle brackets is treated as a tag
    return htmlPattern.sub(r'', text)

# Remove punctuation and other special characters.
def removePunct(text):
    # keep only runs of English letters (digits are dropped too)
    tokenizer = RegexpTokenizer(r'[a-zA-Z]+')
    text = tokenizer.tokenize(text)
    text = ' '.join(text)  # rejoin with single spaces
    return text

In [8]:
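# Add the preprocessed text as a new column named cleaned.
# Caveat: removePunct runs before removeHTML here, so angle brackets are already gone
# and removeHTML matches nothing; stop words with punctuation attached (e.g. "and,")
# also slip past removeStopWords.
df['cleaned'] = df['Desc'].apply(removeNonAscii)
df['cleaned'] = df['cleaned'].apply(makeLower)
df['cleaned'] = df['cleaned'].apply(removeStopWords)
df['cleaned'] = df['cleaned'].apply(removePunct)
df['cleaned'] = df['cleaned'].apply(removeHTML)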
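
The order of the steps matters. As a sketch of one possible ordering (strip tags first, remove stop words last), the block below uses only the standard library. `clean` and `STOPS` are hypothetical names, the stop-word set is a tiny stand-in for the NLTK list, and tags are replaced with a space rather than an empty string so that neighbouring words do not fuse.

```python
import re

# Small hypothetical stop-word set so the sketch runs without NLTK data.
STOPS = {"and", "the", "a", "of", "is", "but"}

def clean(text):
    # 1. Strip HTML tags while the angle brackets still exist.
    text = re.sub(r"<.*?>", " ", text)
    # 2. Drop non-ASCII characters.
    text = "".join(ch for ch in text if ord(ch) < 128)
    # 3. Lowercase and keep alphabetic runs only (this also removes punctuation).
    tokens = re.findall(r"[a-z]+", text.lower())
    # 4. Remove stop words last, once no punctuation is attached to them.
    return " ".join(t for t in tokens if t not in STOPS)

print(clean("Nimble <b>startups</b> and, slowly but surely, the giants."))
# nimble startups slowly surely giants
```

With this ordering "and," is reduced to "and" before the stop-word check, so it is removed.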

In [9]:
df.head()

Unnamed: 0.2,Unnamed: 0,Desc,Unnamed: 0.1,author,genre,image_link,rating,title,cleaned
0,0,We know that power is shifting: From West to E...,0.0,Moisés Naím,Business,https://i.gr-assets.com/images/S/compressed.ph...,3.63,The End of Power: From Boardrooms to Battlefie...,know power shifting west east north south pres...
1,1,Following the success of The Accidental Billio...,1.0,Blake J. Harris,Business,https://i.gr-assets.com/images/S/compressed.ph...,3.94,"Console Wars: Sega, Nintendo, and the Battle t...",following success accidental billionaires mone...
2,2,How to tap the power of social software and ne...,2.0,Chris Brogan,Business,https://i.gr-assets.com/images/S/compressed.ph...,3.78,Trust Agents: Using the Web to Build Influence...,tap power social software networks build busin...
3,3,William J. Bernstein is an American financial ...,3.0,William J. Bernstein,Business,https://i.gr-assets.com/images/S/compressed.ph...,4.2,The Four Pillars of Investing,william j bernstein american financial theoris...
4,4,Amazing book. And I joined Steve Jobs and many...,4.0,Akio Morita,Business,https://i.gr-assets.com/images/S/compressed.ph...,4.05,Made in Japan: Akio Morita and Sony,amazing book joined steve jobs many akio morit...


In [10]:
df['cleaned'].head()

0    know power shifting west east north south pres...
1    following success accidental billionaires mone...
2    tap power social software networks build busin...
3    william j bernstein american financial theoris...
4    amazing book joined steve jobs many akio morit...
Name: cleaned, dtype: object

In [11]:
df['cleaned'][0]

'know power shifting west east north south presidential palaces public squares formidable corporate behemoths nimble startups and slowly surely men women power merely shifting dispersing also decaying power today constrained risk losing ever before end power award winning columnist former foreign policy editor moiss nam illuminates struggle once dominant megaplayers new micropowers challenging every field human endeavor drawing provocative original research nam shows antiestablishment drive micropowers topple tyrants dislodge monopolies open remarkable new opportunities also lead chaos paralysis nam deftly covers seismic changes underway business religion education within families matters war peace examples abound walks life eighty nine countries ruled autocrats today half world s population lives democracies ceo s constrained shorter tenures predecessors modern tools war cheaper accessible make possible groups like hezbollah afford drones second half top ten hedge funds earned world s

In [13]:
# Check whether any row has a missing value
df['cleaned'].isna().sum()

0

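`isna()` reports 0 because an empty string `''` is not the same as NA; empty strings have to be converted to NaN before they can be filtered out.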

In [14]:
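# Turn empty strings into NaN so that notna() can filter them out.
# Assigning the result back avoids relying on inplace=True on a column selection.
df['cleaned'] = df['cleaned'].replace('', np.nan)
df = df[df['cleaned'].notna()]
print('Total number of documents:', len(df))

Total number of documents: 2381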


One row apparently ended up with an empty string after cleaning: the document count dropped by one, from 2382 to 2381.
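
The difference between an empty string and a missing value can be checked on a toy Series (the values here are made up for illustration):

```python
import numpy as np
import pandas as pd

s = pd.Series(["some text", "", None])

print(s.isna().sum())                      # 1: only None counts as missing
print(s.replace("", np.nan).isna().sum())  # 2: the empty string is now NaN as well
```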

Tokenize each cleaned description and store the tokens in a list named corpus.  <br>
Word2Vec will be trained on this corpus list.

In [15]:
corpus = []
for words in df['cleaned']:
    corpus.append(words.split())

# Pre-trained word embedding

<br>

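- Word2Vec can be trained from scratch, but when there is not enough data,
- using pre-trained word embeddings as the initial word vectors can improve performance.
- Let's use a pre-trained Word2Vec model to set the initial word vectors.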

In [25]:
urllib.request.urlretrieve("https://s3.amazonaws.com/dl4j-distribution/GoogleNews-vectors-negative300.bin.gz",
                           filename='GoogleNews-vectors-negative300.bin.gz')

KeyboardInterrupt: 

    The file is about 1.5 GB, so I interrupted the download.

Let's just train Word2Vec from scratch and see how it goes.

In [26]:
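word2vec_model = Word2Vec(
    vector_size = 300,
    window = 5,
    min_count = 2,
    # Caveat: gensim does not treat -1 as "use all cores". No worker threads are
    # started, so train() processes nothing and returns (0, 0).
    # A positive integer (e.g. workers = 4) is needed for training to run.
    workers = -1
)
# Build vocabulary from a sequence of sentences (can be a once-only generator stream)
word2vec_model.build_vocab(corpus)
# train() returns (effective words trained, raw words seen) summed over all epochs
word2vec_model.train(corpus, total_examples = word2vec_model.corpus_count, epochs = 20)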

(0, 0)

In [28]:
word2vec_model.wv.most_similar('behemoths')

[('geopolitical', 0.2552624046802521),
 ('troubles', 0.21435749530792236),
 ('function', 0.19762557744979858),
 ('infancy', 0.19695225358009338),
 ('educators', 0.19205407798290253),
 ('sea', 0.191654771566391),
 ('victim', 0.19070063531398773),
 ('indicating', 0.1876041293144226),
 ('despite', 0.18610814213752747),
 ('discrimination', 0.1846054196357727)]

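    The neighbors above look unrelated to 'behemoths', with similarities of only about 0.2; since train() returned (0, 0), the vectors were never actually updated.
    It would be better to download the pre-trained file beforehand and use that. This takes too long.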