In [1]:
text = '''A kite is traditionally a tethered heavier-than-air craft with wing surfaces that react
against the air to create lift and drag. A kite consists of wings, tethers, and anchors.
Kites often have a bridle to guide the face of the kite at the correct angle so the wind
can lift it. A kite’s wing also may be so designed so a bridle is not needed; when
kiting a sailplane for launch, the tether meets the wing at a single point. A kite may
have fixed or moving anchors. Untraditionally in technical kiting, a kite consists of
tether-set-coupled wing sets; even in technical kiting, though, a wing in the system is
still often called the kite.
The lift that sustains the kite in flight is generated when air flows around the kite’s
surface, producing low pressure above and high pressure below the wings. The
interaction with the wind also generates horizontal drag along the direction of the
wind. The resultant force vector from the lift and drag force components is opposed
by the tension of one or more of the lines or tethers to which the kite is attached. The
anchor point of the kite line may be static or moving (such as the towing of a kite by
a running person, boat, free-falling anchors as in paragliders and fugitive parakites
or vehicle).
The same principles of fluid flow apply in liquids and kites are also used under water.
A hybrid tethered craft comprising both a lighter-than-air balloon as well as a kite
lifting surface is called a kytoon.
Kites have a long and varied history and many different types are flown
individually and at festivals worldwide. Kites may be flown for recreation, art or
other practical uses. Sport kites can be flown in aerial ballet, sometimes as part of a
competition. Power kites are multi-line steerable kites designed to generate large forces
which can be used to power activities such as kite surfing, kite landboarding, kite
fishing, kite buggying and a new trend snow kiting. Even Man-lifting kites have
been made.'''

In [2]:
from nltk.tokenize import TreebankWordTokenizer

In [3]:
tokenizer = TreebankWordTokenizer()

In [4]:
token_list = tokenizer.tokenize(text.lower())

In [5]:
token_list

['a',
 'kite',
 'is',
 'traditionally',
 'a',
 'tethered',
 'heavier-than-air',
 'craft',
 'with',
 'wing',
 'surfaces',
 'that',
 'react',
 'against',
 'the',
 'air',
 'to',
 'create',
 'lift',
 'and',
 'drag.',
 'a',
 'kite',
 'consists',
 'of',
 'wings',
 ',',
 'tethers',
 ',',
 'and',
 'anchors.',
 'kites',
 'often',
 'have',
 'a',
 'bridle',
 'to',
 'guide',
 'the',
 'face',
 'of',
 'the',
 'kite',
 'at',
 'the',
 'correct',
 'angle',
 'so',
 'the',
 'wind',
 'can',
 'lift',
 'it.',
 'a',
 'kite’s',
 'wing',
 'also',
 'may',
 'be',
 'so',
 'designed',
 'so',
 'a',
 'bridle',
 'is',
 'not',
 'needed',
 ';',
 'when',
 'kiting',
 'a',
 'sailplane',
 'for',
 'launch',
 ',',
 'the',
 'tether',
 'meets',
 'the',
 'wing',
 'at',
 'a',
 'single',
 'point.',
 'a',
 'kite',
 'may',
 'have',
 'fixed',
 'or',
 'moving',
 'anchors.',
 'untraditionally',
 'in',
 'technical',
 'kiting',
 ',',
 'a',
 'kite',
 'consists',
 'of',
 'tether-set-coupled',
 'wing',
 'sets',
 ';',
 'even',
 'in',
 'tech

In [6]:
len(token_list)

361
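
Several tokens in the output above, such as 'drag.' and 'anchors.', keep their trailing period because the whole text was tokenized as one string. A minimal sketch of splitting the text into sentences first and tokenizing each one separately follows. The helper name sentence_token_lists and the regex-based sentence rule are assumptions for illustration, not NLTK functionality, and str.split stands in for the word tokenizer so the sketch runs without any third-party package.

```python
import re

def sentence_token_lists(text, tokenize=str.split):
    """Split text into sentences, then tokenize each sentence separately.

    Hypothetical helper: the sentence rule is a naive regex, not NLTK's
    sentence tokenizer, and str.split is only a stand-in word tokenizer.
    """
    # Naive rule: a sentence ends at '.', '!' or '?' followed by whitespace.
    sentences = re.split(r'(?<=[.!?])\s+', text.strip())
    return [tokenize(sentence.lower()) for sentence in sentences if sentence]

sample = ("A kite consists of wings, tethers, and anchors. "
          "A kite may have fixed or moving anchors.")
corpus = sentence_token_lists(sample)
print(len(corpus))   # number of sentences found
print(corpus[1])     # tokens of the second sentence
```

Passing tokenizer.tokenize in place of the default should let the Treebank tokenizer see one sentence at a time, which is the input it is designed for. The result is a list of token lists, one per sentence.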

In [7]:
# TRAIN YOUR DOMAIN-SPECIFIC WORD2VEC MODEL

In [8]:
from gensim.models.word2vec import Word2Vec



In [9]:
# Parameters to control Word2vec model training

In [10]:
# Number of vector elements (dimensions)
# to represent the word vector

In [11]:
num_features = 300

In [12]:
# Minimum number of times a word must occur
# to be included in the Word2vec model. If your
# corpus is small, reduce the min count. If you’re
# training with a large corpus, increase the min count.

In [13]:
min_word_count = 3
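
The effect of the minimum count can be previewed before training by counting tokens directly. This is a minimal sketch in plain Python; tokens_meeting_min_count and sample_tokens are hypothetical names used only for illustration, and the filter approximates the idea of a minimum count rather than reproducing gensim's vocabulary code.

```python
from collections import Counter

def tokens_meeting_min_count(tokens, min_count):
    """Return the set of tokens that occur at least min_count times."""
    counts = Counter(tokens)
    return {token for token, count in counts.items() if count >= min_count}

# Toy token list (hypothetical); in the notebook you could pass token_list.
sample_tokens = ['a', 'kite', 'is', 'a', 'kite', 'and', 'a', 'kite', 'flies']
kept = tokens_meeting_min_count(sample_tokens, min_count=3)
print(sorted(kept))
```

On a corpus as small as the kite text, a check like this shows how few distinct words remain once rare words are filtered out.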

In [14]:
# Number of CPU cores used for the
# training. To set the number of
# cores dynamically, import the
# multiprocessing module and use:
# num_workers = multiprocessing.cpu_count()

In [15]:
num_workers = 2
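
The dynamic alternative mentioned in the comment can be sketched with the standard library as follows. Using every available core is an assumption about what suits your machine; you may prefer to leave one core free for other work.

```python
import multiprocessing

# Use as many worker threads as the machine reports CPU cores.
num_workers = multiprocessing.cpu_count()
print(num_workers)
```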

In [16]:
# Context window size

In [17]:
window_size = 6
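
To make the window parameter concrete, the sketch below lists the (center, context) word pairs that a fixed window produces. The function context_pairs is a hypothetical helper, not part of gensim, and real implementations may vary the effective window from word to word, so treat it as an illustration of the idea only.

```python
def context_pairs(tokens, window):
    """Pair every token with each neighbor at most `window` positions away."""
    pairs = []
    for i, center in enumerate(tokens):
        start = max(0, i - window)
        end = min(len(tokens), i + window + 1)
        for j in range(start, end):
            if j != i:  # a word is not its own context
                pairs.append((center, tokens[j]))
    return pairs

pairs = context_pairs(['a', 'kite', 'is', 'tethered'], window=1)
print(pairs)
```

A larger window adds more distant neighbors to each word's context, which increases the number of training pairs.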

In [18]:
# Subsampling rate for frequent terms

In [19]:
subsampling = 1e-3
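
As a rough illustration of what the subsampling rate controls, the sketch below uses the discard probability given in the original word2vec paper (Mikolov et al., 2013), 1 - sqrt(t / f), where t is the threshold and f is a word's relative frequency. The function name discard_probability is hypothetical, and library implementations may use a slightly different expression, so this is not a description of gensim's internals.

```python
import math

def discard_probability(relative_frequency, threshold=1e-3):
    """Chance of dropping one occurrence of a word during training."""
    if relative_frequency <= threshold:
        return 0.0  # words at or below the threshold are always kept
    return 1.0 - math.sqrt(threshold / relative_frequency)

p_common = discard_probability(0.05)    # a word making up 5% of all tokens
p_rare = discard_probability(0.0005)    # a word below the threshold
print(p_common, p_rare)
```

Under this formula, lowering the threshold discards frequent words more aggressively, while rare words are unaffected.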

In [20]:
# Instantiating a Word2vec model

In [21]:
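model = Word2Vec(
                 # Word2Vec expects an iterable of sentences (lists of tokens),
                 # so wrap the flat token list as a single-sentence corpus
                 [token_list],
                 workers=num_workers,
                 size=num_features,  # named vector_size in gensim 4.0 and later
                 min_count=min_word_count,
                 window=window_size,
                 sample=subsampling)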
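
Word2Vec treats its first argument as an iterable of sentences, each of which is a list of tokens. The sketch below mimics that iteration in plain Python, without calling gensim, to show how a flat list of strings would be read compared with a list of token lists. The variable names are hypothetical.

```python
token_sample = ['a', 'kite', 'is']

# A flat list: each string is taken as a "sentence" and iterating over a
# string yields its characters, so the "words" become single letters.
as_flat_input = [list(sentence) for sentence in token_sample]

# A list of token lists: each inner list is one sentence of whole words.
as_corpus = [list(sentence) for sentence in [token_sample]]

print(as_flat_input)
print(as_corpus)
```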

In [22]:
'''Word2vec models can consume quite a bit of memory. But remember that only the
weight matrix for the hidden layer is of interest. Once you’ve trained your word
model, you can reduce the memory footprint by about half if you freeze your model
and discard the unnecessary information. The following command will discard the
unneeded output weights of your neural network:'''

'Word2vec models can consume quite a bit of memory. But remember that only the\nweight matrix for the hidden layer is of interest. Once you’ve trained your word\nmodel, you can reduce the memory footprint by about half if you freeze your model\nand discard the unnecessary information. The following command will discard the\nunneeded output weights of your neural network:'

In [23]:
model.init_sims(replace=True)

In [24]:
'''The init_sims method will freeze the model, storing the weights of the hidden layer
and discarding the output weights that predict word co-occurrences. The output
weights aren’t part of the vector used for most Word2vec applications. But the model
cannot be trained further once the weights of the output layer have been discarded.'''

'The init_sims method will freeze the model, storing the weights of the hidden layer\nand discarding the output weights that predict word co-occurrences. The output\nweights aren’t part of the vector used for most Word2vec applications. But the model\ncannot be trained further once the weights of the output layer have been discarded.'

In [25]:
# You can save the trained model with the following command and preserve it for
# later use:

In [26]:
model_name = "my_domain_specific_word2vec_model"

In [27]:
model.save(model_name)

In [28]:
# Loading a saved Word2vec model

In [29]:
from gensim.models.word2vec import Word2Vec

In [30]:
model_name = "my_domain_specific_word2vec_model"

In [31]:
model = Word2Vec.load(model_name)