# fastText

https://github.com/facebookresearch/fastText

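**fastText** is an open-source library for natural language processing (NLP) that provides fast, efficient tools for text classification and word embedding. It was developed by the Facebook AI Research team and is designed to train quickly even on large text corpora. Its main characteristics are as follows: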


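**Key features**
1. **Word vector learning (Word Embeddings)**  
   - fastText learns a word embedding model that converts each word into a fixed-size vector. Word meaning is mapped into a vector space so that similar words end up with nearby vectors.
   - It is similar to Word2Vec, but fastText additionally breaks each word into **subword** units.

2. **Subword-based model**  
   - Each word is decomposed into character n-grams, with `<` and `>` marking the word boundaries (e.g. with n = 3, 'apple' → ['<ap', 'app', 'ppl', 'ple', 'le>']). This makes the model robust to **rare words** and **misspellings**.
   - Beyond whole words, this lets the model pick up finer-grained information such as spelling patterns.

3. **Text classification**  
   - fastText is optimized for classifying documents or sentences quickly.
   - Training is fast, the resulting models are small, and accuracy is often competitive with heavier models.

4. **Efficient implementation**  
   - fastText is designed to perform well on CPU alone and runs quickly even on large datasets.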

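**How fastText works**
1. **Word representation**  
   - Each word is split into character n-gram subwords, and a vector is learned for each subword.
   - For example, with n-gram lengths 3 to 6, "cat" is wrapped as `<cat>` and yields '<ca', 'cat', 'at>', '<cat', 'cat>', and '<cat>', plus the whole word itself.
   - The word vector is then the sum of these subword vectors.

2. **Model architecture**  
   - fastText builds on the skip-gram or CBOW model.
   - Unlike the original models, it learns from subwords in addition to the word itself.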
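
The decomposition described above can be sketched in plain Python. This is an illustration only, not fastText's implementation: `char_ngrams` and `toy_word_vector` are hypothetical helpers, the subword vectors are random placeholders rather than trained values, and the real library hashes n-grams into a fixed number of buckets instead of keeping one entry per n-gram.

```python
import random

def char_ngrams(word, minn=3, maxn=6):
    """Return character n-grams of `word` wrapped in '<' and '>' boundary markers."""
    wrapped = f"<{word}>"
    grams = []
    for start in range(len(wrapped)):
        for n in range(minn, maxn + 1):
            if start + n <= len(wrapped):
                grams.append(wrapped[start:start + n])
    return grams

def toy_word_vector(word, dim=4, seed=0):
    """Sum placeholder vectors for the whole word and each of its n-grams."""
    rng = random.Random(seed)
    # The whole word and its n-grams are separate units, even if the strings match
    units = [("word", word)] + [("ngram", g) for g in char_ngrams(word)]
    table = {unit: [rng.uniform(-1, 1) for _ in range(dim)] for unit in units}
    return [sum(vec[i] for vec in table.values()) for i in range(dim)]

print(char_ngrams("cat"))   # ['<ca', '<cat', '<cat>', 'cat', 'cat>', 'at>']
print(toy_word_vector("cat"))
```

Because an unseen word still shares many n-grams with known words, a vector can be assembled for it from pieces the model has already learned.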

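**Advantages of fastText**
1. **Handling rare words**  
   - Thanks to the subword-based approach, it generalizes better to rare or previously unseen words.
2. **Fast training**  
   - A simple model architecture and an optimized implementation make training very fast.
3. **Support for many languages**  
   - It works across many languages and is particularly effective for morphologically rich (inflected) languages.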

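**Use cases**
1. **Word embeddings**  
   - Computing similarity between words, learning sentence representations.
2. **Text classification**  
   - Spam filtering, sentiment analysis, news categorization.
3. **Multilingual support**  
   - Fast training and inference on multilingual datasets.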
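
For the text classification use case, the `fasttext` package's `train_supervised` reads a plain-text file in which each line carries a label prefixed with `__label__` (the default prefix), followed by the text. A minimal sketch of preparing such lines follows. The examples, the `to_fasttext_line` helper, and the file name `train.txt` are all hypothetical, and the training call is left commented out because it needs the package installed.

```python
def to_fasttext_line(label, text):
    """Format one example as '__label__<label> <text>' on a single line."""
    single_line = " ".join(text.split())  # collapse newlines and repeated spaces
    return f"__label__{label} {single_line}"

# Hypothetical labeled examples
examples = [
    ("spam", "Win a free prize now"),
    ("ham", "Are we still meeting\nat noon?"),
]

lines = [to_fasttext_line(label, text) for label, text in examples]
print(lines)

# After writing `lines` to a UTF-8 file such as train.txt (one example per line),
# training with the fasttext package would look like:
# import fasttext
# clf = fasttext.train_supervised(input="train.txt")
# clf.predict("free prize")
```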

### gensim FastText

In [5]:
from gensim.models import FastText
from lxml import etree
import re
from nltk.tokenize import word_tokenize, sent_tokenize
import pandas as pd

In [4]:
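# Parse the XML file; the context manager closes the file handle afterwards
with open('ted_en.xml', 'r', encoding='UTF-8') as f:
    xml = etree.parse(f)

# Join the text of every <content> element and drop parenthesized spans
corpus = '\n'.join(xml.xpath('//content/text()'))
corpus = re.sub(r'\([^)]*\)', '', corpus)

# Sentence splitting (NLTK's Punkt tokenizer data must be downloaded beforehand)
sentences = sent_tokenize(corpus)
preprocessed_sentences = []

for sentence in sentences:
    sentence = sentence.lower()
    # Replace everything except letters and digits with a space
    sentence = re.sub(r'[^0-9a-zA-Z]', ' ', sentence)
    tokens = word_tokenize(sentence)
    preprocessed_sentences.append(tokens)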
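
To see what the two regular expressions above do, here is a small standalone sketch on a made-up sentence. The sample text is an assumption for illustration, and `str.split` stands in for NLTK's `word_tokenize` so the snippet runs without extra downloads.

```python
import re

# Hypothetical sample, not taken from the corpus
sample = "Thank you so much. (Laughter) It's 2024, and we're here!"

# Drop parenthesized annotations such as "(Laughter)"
no_annotations = re.sub(r'\([^)]*\)', '', sample)

# Lowercase, then replace every non-alphanumeric character with a space
cleaned = re.sub(r'[^0-9a-zA-Z]', ' ', no_annotations.lower())

# Whitespace split as a stand-in for word_tokenize
tokens = cleaned.split()
print(tokens)
# ['thank', 'you', 'so', 'much', 'it', 's', '2024', 'and', 'we', 're', 'here']
```

Note that replacing apostrophes with spaces splits contractions ("it's" becomes "it" and "s"), so short fragments like "s" and "re" end up in the vocabulary.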

In [6]:
from gensim.models import Word2Vec

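w2v_model = Word2Vec(
    sentences=preprocessed_sentences,
    vector_size=100,  # embedding dimension
    window=5,         # context window size
    min_count=5,      # ignore words seen fewer than 5 times
    sg=0              # 0 = CBOW, 1 = skip-gram
)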

In [12]:
w2v_model.wv.vectors.shape

(21613, 100)

In [7]:
w2v_df = pd.DataFrame(w2v_model.wv.vectors, index=w2v_model.wv.index_to_key)
w2v_df.head(10)

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,...,60,61,62,63,64,65,66,67,68,69,70,71,72,73,74,75,76,77,78,79,80,81,82,83,84,85,86,87,88,89,90,91,92,93,94,95,96,97,98,99
the,-0.527831,-0.991119,-0.431061,0.176533,-0.798458,-0.454664,-0.503759,-0.818328,1.077504,-0.204905,0.787259,0.065678,-0.566567,0.331894,-1.506041,0.41646,0.637067,-0.140049,-1.024785,-1.400214,-0.188335,0.16542,-0.731407,-0.537187,0.175498,-2.920455,0.735696,1.161644,1.056228,0.863677,0.85769,-1.039752,0.739269,-0.893393,0.447687,0.314165,0.095119,-1.09651,0.334622,-0.955146,...,0.127239,1.281829,0.784616,-1.516122,-0.044071,-0.554164,-0.239096,1.442343,0.822801,0.388586,-0.800053,1.439184,-0.254261,-1.069053,0.704364,0.024723,1.45402,0.027857,0.34507,0.083318,-0.242658,-0.665069,1.401302,-0.07175,-2.473154,-0.588112,-0.043829,0.632793,0.060005,-0.29908,0.122602,-0.101336,1.064751,-0.888132,1.613576,0.272793,-0.823178,-0.386726,0.300896,0.301727
and,-0.955321,0.340457,-0.720046,0.142199,1.164295,-0.586238,-1.415029,-1.227834,-1.152483,0.371843,-0.713591,-0.482881,1.027496,0.956733,0.718449,0.810937,0.120779,0.083824,-0.081408,-0.552842,-0.651948,-0.5625,-0.558641,-0.387008,0.227421,-1.079575,-0.562385,-0.163353,-0.501923,0.553619,1.170733,0.84781,-0.944636,-1.37147,-0.306146,0.66518,1.199554,-0.039544,1.602312,1.909751,...,0.230516,0.035026,0.518992,-0.714637,-0.162442,1.466909,0.17088,-0.74904,0.957487,0.247446,0.277451,0.268122,0.039791,-0.504575,0.01943,0.347457,0.078383,-0.306975,-0.022689,0.08281,-0.135304,0.712904,-0.160595,0.52976,-0.879576,-0.539555,0.824874,0.171551,-0.513973,-0.521128,-1.367024,-0.651905,0.843698,-0.630076,-0.266087,-0.498182,0.469074,0.38064,0.748283,-0.562316
to,0.441249,0.76666,0.180415,-1.934559,1.242931,0.119064,-3.839187,-0.789677,-1.335547,0.818646,-2.382058,2.193699,0.354098,-0.141712,-0.381386,0.309622,-0.27073,1.277246,-1.425298,0.892333,-2.921926,-1.364358,-2.589902,-3.046213,0.110581,-1.214204,-0.409328,1.37031,1.0978,0.000701,2.552547,-1.184461,-1.588248,0.771678,-1.031563,-1.551031,-0.30784,-0.00957,0.585124,-1.071169,...,0.986762,-0.842128,0.612385,0.217199,0.974182,1.132147,2.037469,-2.353202,1.167348,0.584368,0.051552,1.034137,-0.965185,-0.139477,-1.740345,1.358064,1.242742,0.885954,0.091414,1.412347,1.295005,-0.336771,0.118551,2.347224,1.131476,-0.869251,-1.735349,-1.343007,0.582638,1.037885,-0.942522,1.847417,1.937669,-0.21398,2.455755,-0.373544,0.275315,0.510297,4.095771,0.307599
of,-2.122652,1.479303,0.42192,-0.703881,0.148725,-1.367172,-1.289011,0.52593,0.026986,-0.529832,1.957227,1.86219,0.600555,0.423755,-0.346979,-0.517078,0.862744,-1.219983,-2.282036,2.060056,0.742639,-0.657751,1.485813,-0.156859,-1.065384,-0.244538,-1.167387,1.742927,0.778673,-1.265691,0.267227,-1.155191,0.322691,-1.1578,1.088727,1.201165,-0.431566,-0.370211,0.970789,0.5748,...,0.605144,1.238028,-0.183468,0.413502,0.475517,-2.137532,-2.192112,1.143657,-0.535472,-1.046255,1.190112,0.818156,0.327329,1.59154,1.821494,0.157637,-1.08235,0.482613,-0.547505,0.061584,-0.420299,1.818647,0.116801,0.77821,-1.701144,-0.988859,1.769987,0.798443,-2.500652,-1.095607,-0.694992,1.014789,1.861125,-0.905103,1.5464,-1.782633,1.673942,-0.042565,-0.033746,2.074891
a,-0.154135,-1.265388,0.237353,-0.831103,0.750391,2.263263,-0.248927,-0.316375,0.931026,0.724209,-1.723238,0.483752,-1.765673,0.988921,-2.115136,-1.231515,0.908662,0.472532,1.437909,-1.364891,0.263535,1.399412,2.633103,-0.74469,0.043935,-2.063401,0.871124,0.705477,0.004184,-0.135314,2.428367,-0.048338,0.562036,-1.870849,-0.751135,-1.557101,-2.123344,-0.640661,-1.648788,-0.365693,...,-2.046162,0.21285,0.088539,0.203269,-1.066005,-0.108915,2.173506,1.443727,0.658747,1.045622,-2.171579,2.873612,-1.666081,0.679269,1.063456,0.006573,-1.066204,0.853445,0.24342,0.615274,-1.394237,-0.82733,0.644393,0.723838,-2.086857,-1.060111,-0.144608,-0.266131,-2.500905,-1.026584,0.004368,1.081071,1.818447,0.061441,0.703274,-1.345361,0.705217,0.551396,2.424098,3.000943
that,-0.629701,0.006731,-1.782054,-0.61706,0.239269,0.048871,-1.752848,-1.092032,-1.016261,1.908733,-0.110462,0.448769,0.26695,0.597585,-1.877244,-0.704651,0.425686,-0.341707,1.071681,0.470554,-1.444661,1.046365,-0.750877,0.239317,-0.84762,-1.076124,-1.35065,-0.353285,-0.315551,-0.29015,1.069038,-0.535816,-0.017178,-0.436965,0.109813,0.659505,0.57099,-0.207363,1.391301,-1.174286,...,1.796832,-0.14711,-0.11685,-0.668011,-1.996393,0.767274,-0.067685,-1.19662,-1.27432,-1.469162,-0.633338,1.271284,-1.138406,1.81398,0.384498,-0.704534,-0.00468,0.257123,-0.178863,0.431226,-0.77855,1.267313,1.279761,0.834015,0.0404,-2.090657,0.801683,0.213536,-1.075105,0.346809,0.050977,0.184547,1.536635,0.09491,1.037666,-0.950664,-0.990728,1.352528,2.244245,0.477056
i,0.455519,0.860838,1.358174,1.540772,0.537108,-0.120438,-2.212554,1.182185,-0.353883,-0.9631,0.660591,2.61432,-0.298268,2.746932,-0.90638,-0.663598,2.539464,-0.401434,1.084808,-1.378873,0.557069,0.590877,-0.695871,-1.043036,0.42722,-0.354091,-2.467513,-0.28164,1.273458,-0.454542,1.005999,0.941599,-2.802638,0.922892,-1.110628,-0.691425,-0.993794,-0.596508,-0.81409,-1.10647,...,0.75573,-0.460687,1.707562,1.83286,-1.794242,1.799519,-0.374503,-3.513063,1.427954,0.044265,-1.107983,3.686952,-2.5451,1.80764,0.626371,-2.246068,-0.037194,-0.255794,-0.868887,2.194964,3.895068,-0.821489,1.0778,1.143421,0.593428,-0.831259,2.751821,0.384485,-1.190298,3.32925,-2.436102,0.950984,0.790232,-0.378234,-2.09186,-1.223745,0.926949,0.202015,1.16767,2.123801
in,-0.627337,0.02147,2.260724,0.615792,0.972528,1.549737,-0.987746,0.617182,-1.04449,-0.9709,0.870994,1.746341,1.204435,1.29713,0.287519,0.464372,1.018827,0.471089,-1.480158,0.799112,1.587865,1.808118,0.408905,-0.941231,-0.73115,0.019314,1.58024,3.437203,-0.214192,0.523926,0.585417,0.101527,0.443936,0.121557,-0.378598,0.240469,0.152706,-0.78876,-1.492728,0.352991,...,3.34945,-0.075585,-0.78415,0.822069,0.677816,0.073635,-1.725859,1.503806,-0.791743,-0.943963,-1.147732,-0.416007,1.158077,1.521883,-2.121241,-0.078801,1.136749,-3.334992,0.779132,-1.415731,-1.282516,0.594459,0.21181,0.368138,-0.173545,-1.008354,0.579193,2.014713,0.097249,-1.196857,-0.421713,0.680081,0.088759,1.991052,0.352285,-1.388717,-0.805531,-0.039199,0.533705,-0.693541
it,-0.384179,0.190671,-0.355408,-1.224103,0.420592,1.005954,-0.707762,-0.082503,-0.907533,1.14718,0.815864,-0.339972,0.697326,0.419063,-1.18095,-0.697964,0.777681,0.414199,1.625229,0.408929,0.071588,0.218936,-1.288399,-0.206165,0.135642,-1.280761,-0.237504,-1.4935,-0.498078,-0.744426,0.358023,-0.62622,1.046292,0.137596,-0.848642,0.025504,-0.768695,-0.698594,0.713974,-1.257636,...,-0.436037,-0.893278,0.835602,1.082458,-0.975986,1.829976,0.99229,-1.142131,-1.082778,-0.40787,-1.017131,0.726795,-0.748043,0.890596,0.065908,-2.375075,1.602134,-1.270247,0.757231,0.610336,-0.855597,0.786037,0.992262,2.448284,-0.901411,-0.922121,0.078558,-1.551828,-2.769822,2.858369,0.513159,-0.975459,0.414586,0.693176,2.100041,1.908936,-0.302949,1.403589,1.903111,2.428267
you,-0.554841,0.781258,0.004477,-0.970952,-1.872005,-1.386823,-2.426647,0.651405,-0.474897,-1.25044,0.483392,0.372315,0.408063,1.961433,-2.789055,-0.462391,0.599496,-1.906003,1.063386,-2.049342,-1.232907,0.427507,1.212424,-1.970914,-0.926085,-1.358877,-2.975024,0.587187,-0.22156,-0.248863,0.141711,-0.198252,-1.626099,3.640931,-1.651351,-0.283344,0.602658,0.077876,0.487463,1.762809,...,0.676159,1.753308,0.503612,1.178722,-1.342377,0.19158,1.773611,-3.309163,2.172834,-1.07348,0.260338,2.769606,-1.563843,0.318363,0.671811,-1.773057,-1.245502,-0.079024,3.631496,3.009385,1.639559,-0.807223,-0.579926,2.713413,-0.68612,-0.656348,0.131806,-0.130557,0.10871,2.948233,-1.608242,2.632559,-1.007921,0.241322,-0.863136,-1.976516,-1.720505,0.650658,1.229968,0.229845


In [10]:
w2v_model.wv.most_similar('father')
# 'luckyfather' is not in the vocabulary, so Word2Vec would raise a KeyError here:
# w2v_model.wv.most_similar('luckyfather')

[('husband', 0.9318971037864685),
 ('son', 0.927040159702301),
 ('mother', 0.9079253673553467),
 ('daughter', 0.8966442346572876),
 ('brother', 0.8913626074790955),
 ('dad', 0.8911597728729248),
 ('sister', 0.8905789256095886),
 ('wife', 0.8808419704437256),
 ('grandmother', 0.8764315247535706),
 ('mom', 0.8763014078140259)]
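
The scores returned by `most_similar` are cosine similarities between word vectors. A minimal sketch of that computation, using made-up 3-dimensional vectors rather than values from the trained model (`cosine_similarity` is a helper defined here, not a gensim function):

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical toy vectors, not taken from the model above
father = [0.9, 0.1, 0.3]
mother = [0.8, 0.2, 0.4]
banana = [-0.2, 0.9, -0.5]

print(cosine_similarity(father, mother))  # close to 1
print(cosine_similarity(father, banana))  # much lower
```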

In [11]:
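# Train FastText with the same hyperparameters as the Word2Vec model for comparison
from gensim.models import FastText

fasttext_model = FastText(
    sentences=preprocessed_sentences,
    vector_size=100,
    window=5,
    min_count=5,
    sg=0  # CBOW; the character n-gram range is left at the gensim defaults
)
fasttext_model.wv.vectors.shape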

(21613, 100)

In [14]:
fasttext_df = \
    pd.DataFrame(fasttext_model.wv.vectors, index=fasttext_model.wv.index_to_key)
fasttext_df.head(10)

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,...,60,61,62,63,64,65,66,67,68,69,70,71,72,73,74,75,76,77,78,79,80,81,82,83,84,85,86,87,88,89,90,91,92,93,94,95,96,97,98,99
the,1.377874,0.468153,-2.003119,-0.190203,-1.492787,2.879536,-2.805759,2.687111,-0.736107,2.553334,1.14233,4.646501,2.431187,3.203297,-3.244699,2.910411,-4.558417,1.249897,0.902412,1.084771,2.168221,1.354713,-0.513644,1.220402,1.544486,0.638001,-0.436747,-1.365161,-0.300813,3.118076,-2.086008,0.899245,1.779906,-0.093494,-1.387843,2.078522,-1.575623,-1.620673,0.583503,-0.185198,...,3.012525,-1.261372,1.355153,-0.017684,0.304464,3.563114,-0.451461,-3.199232,1.041802,-1.26378,0.764663,2.682312,-1.805383,-2.213557,1.674062,-0.930981,1.926677,0.237142,-2.183252,0.468222,-0.216402,-2.854967,-1.912987,0.739761,-0.520495,1.138058,0.479088,0.781859,1.689963,3.563392,-1.506091,-2.00939,2.430433,0.671135,0.200777,-2.086873,0.929412,-0.526385,-0.134678,1.694684
and,-1.622916,1.147473,-0.537926,-1.049441,-0.052231,-0.628562,-4.14539,-0.169191,-2.636476,2.343396,-2.059203,0.735139,-0.250555,-1.950201,-2.42551,-2.098824,-2.218978,-1.262239,-1.405609,-0.512323,-0.558714,-1.907204,-0.168659,0.198178,2.324103,-1.778295,-1.506242,1.815349,-0.542935,0.465458,0.150366,-0.763013,0.108948,-0.5837,-0.681327,0.628183,0.218478,1.50655,-0.481171,0.786663,...,1.63779,-1.070101,0.595319,0.262377,-1.191031,0.878815,1.578964,-0.622281,1.711415,-0.092627,-1.632096,1.197443,0.35186,-0.75126,-0.627876,-0.185318,-1.98416,0.674448,-1.206628,1.296025,-0.250202,-0.304545,0.851864,-0.340031,0.04071,-0.554614,0.38192,1.753387,-0.443277,-0.500079,0.866797,0.570839,2.287233,1.398848,0.420064,0.223484,-1.50837,-1.569376,1.181037,-1.007198
to,1.134272,4.828455,-2.187743,-1.509375,-4.338516,-3.899843,-2.902852,1.104264,-1.21116,1.436514,-6.618051,-2.865725,3.125999,-1.793485,0.458396,-7.157339,-5.376697,0.22114,4.651585,2.866665,1.465603,1.572662,-0.412727,-7.450743,-3.59359,1.35468,-0.312778,-3.602789,1.397854,-1.239146,1.022574,0.502716,-3.1144,2.344236,0.535748,6.95881,-3.929124,2.128888,0.079776,0.776114,...,1.185043,-6.947286,4.763322,2.132139,5.581911,-0.805267,4.504787,0.964982,-1.138876,4.120871,0.048315,0.855838,-6.908513,-4.526392,-9.079096,0.992992,-6.771884,-1.639058,3.206917,1.86067,-7.667994,-3.083079,4.245511,-0.322625,0.715489,6.62859,1.23214,0.641612,3.990431,-4.455183,-0.997958,-0.509327,1.534217,2.766307,4.46266,0.93034,-0.410479,2.737226,2.050124,-2.416155
of,-3.00036,-0.38168,-0.392379,-0.587583,-0.669065,4.46465,-1.077141,3.139119,1.719383,-3.005946,3.21403,6.379323,7.023457,-1.090387,-2.877302,5.459604,-1.692151,-1.518646,-5.162414,-0.784937,3.337531,-1.608997,-7.530288,0.53077,-0.559126,-1.442906,-6.735584,-2.243682,-1.499559,3.603215,-3.599008,0.356827,1.147024,-2.410646,1.657362,-0.443276,4.566418,-0.460654,-4.853464,-7.601526,...,-1.845475,-0.852075,2.825541,0.757303,-1.565141,-0.580772,-2.966674,2.840778,0.836582,-0.263056,2.08178,-0.652789,-2.875643,-1.096983,0.869202,-2.366234,-3.075619,6.093274,6.957736,-6.266261,-2.880261,3.682036,-0.993209,-0.139566,-0.09567,0.840709,1.607021,3.39227,2.415264,-0.451096,-0.559225,-2.071574,-6.000182,-1.399313,-2.007997,1.173105,-0.48768,-8.743364,1.431875,-1.755706
a,8.871532,1.013761,-2.342971,0.614927,-2.04263,12.753938,-3.9749,2.423313,2.36047,-0.210002,1.518399,1.585896,-3.114969,9.287522,-4.856328,-2.380473,6.088399,13.042816,0.673826,-2.789187,-5.361286,1.946605,5.368293,6.897269,-2.750194,0.335223,-4.323215,-4.279906,-4.304938,7.324235,1.744864,5.991812,0.471652,-3.48155,-5.090742,-1.028513,-5.711629,1.312809,9.147606,-3.858868,...,-2.007067,-6.953296,2.219402,10.173272,2.88741,3.382102,4.736368,7.014721,2.817507,3.380248,-4.226091,-5.001102,-1.696727,-2.91508,1.429159,4.851408,0.976779,5.446826,-6.994893,-1.177296,-9.44136,1.671674,-5.815738,-3.374095,-9.287489,-1.530306,1.495933,-5.228634,1.027024,2.819239,-0.362102,-8.409935,-5.263332,-3.740083,3.226045,1.08811,2.256011,2.427722,-1.054306,0.156988
that,1.490676,1.105358,-0.416748,1.011285,-1.282413,0.117933,-2.600906,-1.018762,-1.96032,0.98247,-2.600682,1.663014,-0.510356,0.095727,-3.381122,0.483146,-1.603157,0.2605,1.64987,-0.098416,0.820085,-0.11268,-0.857767,2.16262,-0.346555,1.03029,-1.939065,2.434827,-0.775783,1.053134,0.059759,0.004675,-0.837702,1.982743,1.46039,1.257543,0.043018,-1.098239,0.20775,-0.987391,...,1.442583,-0.373703,0.782577,-0.996664,2.106037,1.630659,0.116106,-1.973591,0.543328,-2.591171,-1.846793,0.869473,-1.494224,0.657953,1.041658,1.007434,-0.91055,1.196032,0.128466,-0.468164,-2.420242,0.592614,-1.228671,0.364483,1.645172,-1.218663,-0.829736,0.200716,2.032936,1.083155,0.759024,-0.496062,1.876807,-0.903592,0.162303,-3.039562,0.748821,1.076104,1.623173,-0.102278
i,2.880567,1.032765,-7.716611,-5.740876,9.710812,5.530775,-10.426237,-10.063011,-4.358613,4.020981,4.610197,-1.380988,-16.572163,-5.174395,-8.48159,-6.397494,1.717517,5.565747,11.482553,2.315569,6.027496,1.175645,2.606643,5.780694,-2.218083,-2.677105,3.963229,0.193911,5.443459,3.569869,1.914212,-2.934721,-7.449457,0.445452,5.823866,4.412278,-5.39998,2.101076,5.077994,5.969984,...,-0.226817,-7.635606,1.837013,-4.918573,2.308897,-8.504686,-6.634487,-12.996136,-3.678484,0.970645,-1.242985,2.461218,4.453411,2.906365,-3.042515,1.493754,2.267012,4.362629,6.166944,-1.427953,5.476775,5.95728,3.298986,8.179225,5.715386,8.972958,5.135445,-2.506586,4.403453,-1.091497,7.523407,-1.91588,11.378318,-6.892385,-1.098249,2.922935,-4.162537,10.452902,0.847161,-12.655406
in,-3.919905,-0.06853,1.220638,-4.473778,5.293237,3.544967,1.424714,3.957442,-1.394956,-1.909927,5.886707,0.072757,2.963343,1.329085,-1.465205,3.517431,0.461965,-1.893425,-4.374436,-3.777164,0.091184,3.406722,-3.410224,-5.242613,-1.191051,3.259661,-0.654013,4.610467,0.30396,-0.368888,-10.195038,5.388124,-2.691312,-3.382575,-1.366691,1.026566,2.045068,-3.979584,-1.487524,3.497853,...,2.339128,-2.044535,1.481824,3.328491,4.417562,2.163779,5.42986,-3.100556,0.41607,-6.321592,3.795887,-2.347981,1.436648,-4.752098,-3.16626,-3.675223,-2.093637,-0.359196,0.802712,-1.083741,-0.733175,4.057523,0.410562,0.230401,0.138625,-5.733574,0.446285,-0.207499,6.558749,-1.248592,-0.610787,-1.478663,-1.180642,4.192719,3.382943,1.805365,-4.096996,-5.523595,-0.876197,-0.499477
it,2.740747,-1.437987,-1.232582,4.777555,-1.48588,-1.777776,1.573349,-0.864027,-6.059803,2.092599,-2.45372,2.164049,0.690942,3.609122,-0.363019,-2.201227,-3.057371,4.672037,3.909593,2.183423,-2.940165,-2.206738,-0.745435,7.228295,-4.92578,-3.632478,0.802484,-1.215089,-1.859228,3.02815,0.13527,0.018061,3.641402,3.365059,3.991227,1.122584,-1.851389,-0.703158,2.814427,4.156867,...,0.939504,-1.912577,-0.489151,0.049198,0.493011,2.162342,-2.878947,-0.062811,-5.109236,-2.676585,-5.391191,-0.993233,-0.228315,5.167516,0.86871,1.646949,3.714179,1.912213,0.390104,0.277414,-4.64599,2.710757,3.167863,-2.092793,2.208032,-2.85279,-1.882744,-5.287745,2.368277,3.886752,0.331094,-2.58661,2.42468,-0.590555,-3.569338,-2.656489,-0.944517,5.96429,2.751074,2.215691
you,0.440146,1.222946,-3.643459,0.668428,-3.664704,-0.644275,-4.40216,-4.980962,-1.685166,1.854585,1.618184,1.023265,-2.277097,-3.08168,0.071781,-0.825162,-3.002413,4.758995,6.090196,2.179303,6.128154,-0.036701,1.390526,1.123819,1.708737,1.688167,1.154237,2.181783,-0.342247,1.192011,-1.700529,-1.803051,-5.39934,4.517856,1.088958,-1.92169,-0.723434,-2.266074,3.806852,4.227902,...,2.204078,-2.156252,1.970955,-1.820304,4.558099,-1.55947,-0.170712,-3.666578,0.343809,-2.057641,-4.406912,2.211897,3.926432,2.653824,-0.865128,2.481119,-4.285754,-0.406589,2.849388,1.395581,1.230102,1.694418,3.524389,4.223559,1.199167,3.007519,-0.110575,-3.145842,2.354708,1.230054,2.317105,-2.653279,0.048245,-5.310089,-0.990711,-2.425966,-0.858359,2.453079,3.437063,-3.725536


In [18]:
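# A notebook cell displays only its last expression, so the output below is for
# the out-of-vocabulary word 'luckyfather' ('father' itself appears as a neighbor).
fasttext_model.wv.most_similar('father')
fasttext_model.wv.most_similar('luckyfather')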

[('father', 0.9550184607505798),
 ('godfather', 0.9402095079421997),
 ('grandfather', 0.920823872089386),
 ('mother', 0.892403781414032),
 ('grandmother', 0.8900907039642334),
 ('luther', 0.8887468576431274),
 ('stepfather', 0.8883705735206604),
 ('brother', 0.877119243144989),
 ('feather', 0.8711089491844177),
 ('slaughter', 0.8525289297103882)]
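
One way to see why neighbors such as 'feather' and 'slaughter' show up is to count the character n-grams each shares with the query. The sketch below is illustrative: `ngram_set` is a hypothetical helper, and the 3 to 6 range is assumed to match the gensim defaults (`min_n=3`, `max_n=6`). The model's actual ranking also depends on the learned vectors, not just on overlap counts.

```python
def ngram_set(word, minn=3, maxn=6):
    """Set of character n-grams of `word`, with '<' and '>' marking word boundaries."""
    w = f"<{word}>"
    return {w[i:i + n] for n in range(minn, maxn + 1) for i in range(len(w) - n + 1)}

query = ngram_set("luckyfather")
for candidate in ["father", "godfather", "feather", "slaughter", "husband"]:
    shared = query & ngram_set(candidate)
    print(f"{candidate:10s} shares {len(shared):2d} n-grams, e.g. {sorted(shared)[:3]}")
```

Words ending in "-ther" or "-er" share suffix n-grams with the query, which is consistent with purely orthographic neighbors appearing alongside semantically related ones.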

In [19]:
# Out-of-vocabulary word: the vector is assembled from its character n-grams
fasttext_model.wv['luckyfather']

array([ 0.52951795,  0.17600404, -0.3634852 , -0.5151461 , -0.24303114,
       -0.5838578 ,  0.7655095 ,  0.2661119 , -0.33740398, -0.5161004 ,
       -0.77591646, -0.42840827, -0.14740628, -0.5141961 , -0.73358196,
       -0.3103012 , -0.23767571,  0.8081848 , -0.29916894,  0.07813474,
        0.7972376 , -0.00497292,  0.26302746,  0.7603544 , -0.0917149 ,
        0.66599125, -0.05170941,  1.0597012 ,  0.07625537, -0.108527  ,
       -0.04097013,  0.02126335, -0.4441494 , -0.31167516,  0.48212337,
        1.022242  , -0.85066295, -0.4424005 ,  0.59140223,  0.83461595,
        0.14614472, -1.2749021 , -0.791914  ,  1.4543985 ,  0.00329887,
       -0.37440434,  0.97431195, -1.3950533 , -0.8794698 , -0.03731766,
        0.03960813, -0.8155881 ,  0.24428885,  0.3336134 , -0.361567  ,
        0.16862   ,  0.5095138 , -0.08963376, -0.519758  ,  0.67095006,
        0.99115473, -0.27967358,  0.15237865, -0.7988661 ,  0.6865865 ,
       -0.16717745,  0.04186538, -1.0009664 , -0.05760446, -0.44

### Installing the fasttext package

In [20]:
!pip install fasttext-wheel

Collecting fasttext-wheel
  Downloading fasttext_wheel-0.9.2-cp312-cp312-win_amd64.whl.metadata (16 kB)
Collecting pybind11>=2.2 (from fasttext-wheel)
  Downloading pybind11-2.13.6-py3-none-any.whl.metadata (9.5 kB)
Downloading fasttext_wheel-0.9.2-cp312-cp312-win_amd64.whl (234 kB)
Downloading pybind11-2.13.6-py3-none-any.whl (243 kB)
Installing collected packages: pybind11, fasttext-wheel
Successfully installed fasttext-wheel-0.9.2 pybind11-2.13.6


In [21]:
import fasttext
import fasttext.util

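model = fasttext.train_unsupervised(
    'naver_movie_ratings.txt',
    model='skipgram',  # 'skipgram' or 'cbow'
    minCount=1,        # keep every word, even those seen only once
    dim=100,           # embedding dimension
    minn=3,            # shortest character n-gram
    maxn=5             # longest character n-gram
)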

In [22]:
model.get_word_vector('극장')

array([ 4.97961402e-01, -4.09789503e-01, -5.47236621e-01,  8.38761985e-01,
       -2.97477484e-01,  1.61565840e-02, -1.98367149e-01,  2.59191543e-03,
        1.03078291e-01,  6.34898663e-01,  7.19406307e-02,  4.25291479e-01,
        4.83332813e-01, -3.70605797e-01, -6.93864107e-01, -1.05472672e+00,
        2.25807745e-02, -1.46036780e+00, -3.87844115e-01, -1.97107151e-01,
        5.28702617e-01,  3.80765975e-01,  2.56852627e-01, -1.16306685e-01,
        5.41756034e-01, -9.46432829e-01, -2.01850414e-01,  5.02840519e-01,
       -8.32938075e-01,  3.97156000e-01, -2.65158951e-01,  5.07263660e-01,
        1.67569891e-02,  3.28071207e-01, -6.37806892e-01,  2.76154786e-01,
        3.62869725e-02,  3.28003407e-01, -3.88301432e-01, -2.84384400e-01,
        1.15324676e-01,  3.19033742e-01,  3.71708214e-01, -2.88371086e-01,
        1.08095074e+00, -1.66034356e-01,  2.08173454e-01, -4.57889915e-01,
       -8.50110203e-02,  2.84243494e-01,  2.60663509e-01, -8.75886142e-01,
       -8.67269635e-02,  

In [23]:
model.get_subwords('영화관')

(['영화관', '<영화', '<영화관', '<영화관>', '영화관', '영화관>', '화관>'],
 array([   2062, 1921845, 1442415, 1378913, 2245977, 1515139, 1352938]))

In [None]:
model.get_subwords('특선영화')

(['특선영화',
  '<특선',
  '<특선영',
  '<특선영화',
  '특선영',
  '특선영화',
  '특선영화>',
  '선영화',
  '선영화>',
  '영화>'],
 array([  54542,  989150,  929201, 1543251, 2496531,  878545, 1046555,
        2645177, 2342883, 2504929]))
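
The integer array returned by `get_subwords` holds the row index used for each entry. For n-grams, fastText does not keep a dictionary. Instead it hashes each n-gram string into one of `bucket` slots (2,000,000 by default) stored after the vocabulary rows. The sketch below shows the idea with a standard 32-bit FNV-1a hash. It is a simplified illustration: `fnv1a_32` and `bucket_index` are names chosen here, the vocabulary size is a made-up number, and the library's own C++ hash differs in byte-handling details, so the indices printed here are not expected to match the IDs above.

```python
def fnv1a_32(text):
    """Standard 32-bit FNV-1a hash over the UTF-8 bytes of `text`."""
    h = 2166136261                       # FNV offset basis
    for byte in text.encode("utf-8"):
        h ^= byte
        h = (h * 16777619) & 0xFFFFFFFF  # multiply by the FNV prime, keep 32 bits
    return h

def bucket_index(ngram, vocab_size, bucket=2_000_000):
    """Row index for an n-gram: hashed buckets sit after the vocabulary rows."""
    return vocab_size + fnv1a_32(ngram) % bucket

# N-grams copied from the get_subwords('영화관') output above
ngrams = ['<영화', '<영화관', '<영화관>', '영화관', '영화관>', '화관>']
vocab_size = 100_000  # hypothetical; the real value depends on the trained model
for g in ngrams:
    print(g, bucket_index(g, vocab_size))
```

Because distinct n-grams can land in the same bucket, they may share a vector. This trades a small amount of precision for a bounded memory footprint.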