### Week 14 Lab - CBOW and skip-gram ###
Source: 딥 러닝을 이용한 자연어 처리 입문 (Introduction to Natural Language Processing Using Deep Learning), 유원준 and 안상준, https://wikidocs.net/50739

### 1. Preprocessing the TED data

In [1]:
import urllib.request
import zipfile
from lxml import etree
import re

import nltk
from nltk.tokenize import word_tokenize, sent_tokenize
nltk.download('punkt')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\tomat\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [2]:
# Download the data
urllib.request.urlretrieve("https://raw.githubusercontent.com/ukairia777/tensorflow-nlp-tutorial/main/09.%20Word%20Embedding/dataset/ted_en-20160408.xml", filename="ted_en-20160408.xml")

# Parse the XML file (the context manager closes the file handle afterwards)
with open('ted_en-20160408.xml', 'r', encoding='utf-8') as targetXML:
    target_text = etree.parse(targetXML)

In [3]:
# Inspect the first 5000 characters of the XML
print(etree.tostring(target_text, pretty_print=True).decode('utf-8')[0:5000])

<xml language="en"><file id="1">
  <head>
    <url>http://www.ted.com/talks/knut_haanaes_two_reasons_companies_fail_and_how_to_avoid_them</url>
    <pagesize>72832</pagesize>
    <dtime>Fri Apr 01 00:57:03 CEST 2016</dtime>
    <encoding>UTF-8</encoding>
    <content-type>text/html; charset=utf-8</content-type>
    <keywords>talks, business, creativity, curiosity, goal-setting, innovation, motivation, potential, success, work</keywords>
    <speaker>Knut Haanaes</speaker>
    <talkid>2470</talkid>
    <videourl>http://download.ted.com/talks/KnutHaanaes_2015S.mp4</videourl>
    <videopath>talks/KnutHaanaes_2015S.mp4</videopath>
    <date>2015/06/30</date>
    <title>Knut Haanaes: Two reasons companies fail -- and how to avoid them</title>
    <description>TED Talk Subtitles and Transcript: Is it possible to run a company and reinvent it at the same time? For business strategist Knut Haanaes, the ability to innovate after becoming successful is the mark of a great organization. He shares

In [4]:
# Extract the text between the <content> tags of the XML
parse_text = '\n'.join(target_text.xpath('//content/text()'))

# Remove parenthesized spans (e.g., descriptions of background sounds)
content_text = re.sub(r'\([^)]*\)', '', parse_text)
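
As a small illustration of the two steps in this cell, the sketch below runs the same join-and-strip logic on a made-up XML snippet. It uses the standard library's `xml.etree.ElementTree` in place of `lxml` so that it runs without extra installs; the sample XML, the `<talks>` root tag, and the `sample_*` variable names are assumptions for illustration and are not part of the TED file.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical miniature of the data: two <file> entries, each with a <content> tag.
sample_xml = """<talks>
<file id="1"><content>Hello there. (Laughter) Thank you.</content></file>
<file id="2"><content>(Applause) Second talk.</content></file>
</talks>"""

root = ET.fromstring(sample_xml)

# Similar in spirit to xpath('//content/text()'): collect the text of every <content> element.
sample_parse_text = '\n'.join(el.text for el in root.iter('content'))

# Same regex as in the cell above: drop everything from '(' up to the next ')'.
sample_content_text = re.sub(r'\([^)]*\)', '', sample_parse_text)

print(repr(sample_content_text))
```

The removal leaves the surrounding spaces behind. That is harmless here, because the normalization step later replaces every run of non-alphanumeric characters with a single space.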

In [5]:
# Inspect the first 1000 characters of the text
content_text[0:1000]

"Here are two reasons companies fail: they only do more of the same, or they only do what's new.\nTo me the real, real solution to quality growth is figuring out the balance between two activities: exploration and exploitation. Both are necessary, but it can be too much of a good thing.\nConsider Facit. I'm actually old enough to remember them. Facit was a fantastic company. They were born deep in the Swedish forest, and they made the best mechanical calculators in the world. Everybody used them. And what did Facit do when the electronic calculator came along? They continued doing exactly the same. In six months, they went from maximum revenue ... and they were gone. Gone.\nTo me, the irony about the Facit story is hearing about the Facit engineers, who had bought cheap, small electronic calculators in Japan that they used to double-check their calculators.\n\nFacit did too much exploitation. But exploration can go wild, too.\nA few years back, I worked closely alongside a European bio

In [6]:
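# Split the text into sentences
sent_text = sent_tokenize(content_text)

# Lowercase, then replace every run of characters other than a-z and 0-9 with a space
normalized_text = []
for string in sent_text:
    tokens = re.sub(r"[^a-z0-9]+", " ", string.lower())
    normalized_text.append(tokens)

# Split each sentence into words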
result = [word_tokenize(sentence) for sentence in normalized_text]
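
To see what the normalization does to a single sentence, here is a minimal sketch. The helper name `normalize_and_split` is hypothetical, and `str.split()` stands in for `word_tokenize` so the example runs without NLTK data; the two can differ on a few forms that NLTK splits specially.

```python
import re

def normalize_and_split(sentence):
    """Lowercase, replace runs of non-alphanumerics with a space, split on whitespace."""
    cleaned = re.sub(r"[^a-z0-9]+", " ", sentence.lower())
    # str.split() is a stand-in for word_tokenize (an assumption of this sketch).
    return cleaned.split()

example = "Here are two reasons companies fail: they only do what's new."
print(normalize_and_split(example))
```

Because the apostrophe is replaced by a space, a contraction such as "what's" becomes the two tokens `what` and `s`. This is why single-letter tokens like `s` and `m` show up in the tokenized sentences.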

In [7]:
print('Total number of sentences : {}'.format(len(result)))

# Print the first few tokenized sentences
for line in result[:5]:
    print(line)

Total number of sentences : 273424
['here', 'are', 'two', 'reasons', 'companies', 'fail', 'they', 'only', 'do', 'more', 'of', 'the', 'same', 'or', 'they', 'only', 'do', 'what', 's', 'new']
['to', 'me', 'the', 'real', 'real', 'solution', 'to', 'quality', 'growth', 'is', 'figuring', 'out', 'the', 'balance', 'between', 'two', 'activities', 'exploration', 'and', 'exploitation']
['both', 'are', 'necessary', 'but', 'it', 'can', 'be', 'too', 'much', 'of', 'a', 'good', 'thing']
['consider', 'facit']
['i', 'm', 'actually', 'old', 'enough', 'to', 'remember', 'them']


### 2. Building word embeddings

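Gensim: a Python library used for natural language processing and information retrieval.

Word2Vec class: trains CBOW and skip-gram word embeddings. Its main parameters are:
- vector_size: dimensionality of the embedding vectors
- window: context window size, i.e. the maximum distance between the current word and a context word
- min_count: minimum word frequency; words that occur fewer times are ignored
- workers: number of worker threads used for training
- sg: training algorithm (0: CBOW, 1: skip-gram)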
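
The roles of `window` and `sg` can be illustrated by listing the training examples each setting produces from one sentence. This is a simplified sketch with hypothetical helper names: it always uses the full window, whereas a real implementation such as Gensim may shrink the effective window and applies further sampling tricks.

```python
def context_window(tokens, i, window):
    """Return the tokens within `window` positions of index i, excluding tokens[i]."""
    left = tokens[max(0, i - window):i]
    right = tokens[i + 1:i + 1 + window]
    return left + right

def skipgram_pairs(tokens, window):
    """skip-gram (sg=1): one (center, context) example per context word."""
    return [(tokens[i], ctx)
            for i in range(len(tokens))
            for ctx in context_window(tokens, i, window)]

def cbow_examples(tokens, window):
    """CBOW (sg=0): one (context words, center) example per position."""
    return [(context_window(tokens, i, window), tokens[i])
            for i in range(len(tokens))]

toy_sentence = ['both', 'are', 'necessary', 'but']
print(skipgram_pairs(toy_sentence, window=1))
print(cbow_examples(toy_sentence, window=1))
```

skip-gram predicts each context word from the center word, so it yields one pair per neighbor. CBOW predicts the center word from all of its neighbors at once, so it yields one example per position.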

In [10]:
import gensim
from gensim.models import Word2Vec
from gensim.models import KeyedVectors

# Train the Word2Vec model (sg=1 selects skip-gram)
model = Word2Vec(sentences=result, vector_size=100, window=5, min_count=5, workers=4, sg=1)

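ModuleNotFoundError: No module named 'gensim'

This error means gensim is not installed in the active environment. Install it (for example, with `pip install gensim`) and rerun the cell before continuing; the cells below need a trained `model`.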
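
The effect of `min_count` in the training call can be sketched without Gensim: words whose total frequency falls below the threshold are dropped from the vocabulary. The helper `build_vocab` and the toy corpus below are hypothetical and only mimic that filtering rule.

```python
from collections import Counter

def build_vocab(sentences, min_count):
    """Keep only words whose total frequency across all sentences is at least min_count."""
    counts = Counter(word for sentence in sentences for word in sentence)
    return {word for word, n in counts.items() if n >= min_count}

# Made-up tokenized sentences in the same list-of-lists shape as `result`.
toy_corpus = [
    ['facit', 'was', 'a', 'fantastic', 'company'],
    ['consider', 'facit'],
    ['facit', 'did', 'too', 'much', 'exploitation'],
]
toy_vocab = build_vocab(toy_corpus, min_count=2)
print(toy_vocab)
```

Only `facit` occurs at least twice, so it is the only word that survives. A word filtered out this way has no vector, which is why a lookup should first check membership with `in model.wv`.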

In [None]:
# Look up the embedding vector of a word
word_to_lookup = "woman"

if word_to_lookup in model.wv:
    vector_for_word = model.wv.get_vector(word_to_lookup)
    print(f"Vector for '{word_to_lookup}': {vector_for_word}")
else:
    print(f"'{word_to_lookup}' not found in the vocabulary.")

Vector for 'woman': [-0.40105575 -1.9523422   0.6010022  -1.1534516   1.2500429   0.43475133
 -0.54875875  0.5031962  -1.6090009   0.5067355  -0.9001674  -1.1807873
  0.42813724  0.3682448   0.15507914 -0.71756834 -0.25210974 -1.0072099
  0.04452446 -0.04126176  0.9736489   2.2508304   0.36223066 -0.15979382
  1.1075225  -0.36000058 -0.3220777  -0.4598301   0.93577653 -2.146943
  0.11464866 -0.21335709  0.44883585  0.776743   -1.0548542  -1.0393814
 -1.0187848   0.35771662 -2.1171849   0.29087767  0.12248597 -1.1697259
 -1.4297632   1.6655838  -0.9858602   0.06607041 -0.9100405  -1.5996414
 -0.480018    0.48989853 -0.77850324 -2.000727    0.7876283   1.8373076
 -0.5669436  -0.9938785  -0.24849814  0.27585444 -1.5640423  -0.81301117
 -0.6485581  -0.1363179   0.19228472 -0.51926804 -1.0830201   0.22246948
  0.5918681  -0.79827404 -0.96797955 -0.34656867  0.9246383   0.25893635
  0.9112023  -2.0943859   0.6851555  -1.8688266  -0.3170995  -0.18229042
 -1.427857    1.199572    2.1767716   1.2567209  -1.205488    0.6402748
 -0.51675993 -0.24697746 -0.74058664  0.82939965 -2.0690477  -0.77335393
  0.32260522  0.67955697  0.45374754  0.47575012  2.2876542   0.806388
  0.36289242 -0.21004926 -3.7385774   3.3191776 ]

In [None]:
# Find the words whose embeddings are most similar
model_result = model.wv.most_similar("man")
model_result
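
`most_similar` ranks vocabulary words by the cosine similarity between their vectors and the query word's vector. The sketch below reproduces that ranking in plain Python; the function names and the 3-dimensional toy vectors are made up for illustration, and the real embeddings here have 100 dimensions.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def most_similar_sketch(word, vectors, topn=2):
    """Rank every other word by cosine similarity to `word`, highest first."""
    query = vectors[word]
    scores = [(other, cosine_similarity(query, vec))
              for other, vec in vectors.items() if other != word]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]

# Hypothetical vectors, not taken from the trained model.
toy_vectors = {
    'man': [1.0, 0.0, 1.0],
    'woman': [1.0, 0.1, 0.9],
    'guy': [0.8, 0.0, 1.2],
    'calculator': [-1.0, 1.0, 0.0],
}
ranked = most_similar_sketch('man', toy_vectors)
print(ranked)
```

With these toy vectors, `woman` ranks first and `guy` second, while `calculator` points in a different direction and is left out of the top two.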