# Word2Vec

Instead of relying on pre-computed co-occurrence counts, Word2Vec takes raw text as input and learns word vectors by predicting 
the surrounding context of a word (the skip-gram model) or by predicting a word from its surrounding context 
(the CBOW, or continuous bag-of-words, model). In both cases the vectors are randomly initialized and trained with gradient descent.
As an example, let us look at the latter case applied to the following sentence.

"The fox jumped over the lazy dog"

Now we want to learn the word vector for 'over' from its surrounding context. Word2Vec keeps two different vectors for each word, depending on whether it is the word we are conditioning on (the input, $v_{in}$) or the word we are trying to predict (the output, $v_{out}$). In the CBOW case the context words are the input and 'over' is the output. The probability we are trying to maximize is then $P(v_{out} \mid v_{in})$, where $v_{out}$ is the output word and $v_{in}$ the input.
The algorithm then moves over each word in the corpus and repeats the training step in an online fashion. 
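
The sliding-window step and the probability above can be sketched in plain Python. This is only an illustration under stated assumptions, not the gensim implementation: the window size of 2, the helper names `cbow_pairs` and `cbow_probabilities`, the 8-dimensional vectors, and the full softmax are all choices made for this example (real implementations usually replace the full softmax with negative sampling or hierarchical softmax).

```python
import math
import random

sentence = "the fox jumped over the lazy dog".split()

def cbow_pairs(tokens, window=2):
    """Yield (context_words, target_word) for every position in tokens."""
    for pos, target in enumerate(tokens):
        left = tokens[max(0, pos - window):pos]
        right = tokens[pos + 1:pos + 1 + window]
        yield left + right, target

pairs = list(cbow_pairs(sentence))
context, target = pairs[3]  # the training example for 'over'
print(context, "->", target)  # ['fox', 'jumped', 'the', 'lazy'] -> over

# Randomly initialized input and output vectors (hypothetical size: 8 dims).
rng = random.Random(0)
dim = 8
vocab = sorted(set(sentence))
v_in = {w: [rng.uniform(-0.5, 0.5) for _ in range(dim)] for w in vocab}
v_out = {w: [rng.uniform(-0.5, 0.5) for _ in range(dim)] for w in vocab}

def cbow_probabilities(context_words):
    """Return P(word | context) for every vocabulary word via a softmax."""
    # CBOW averages the input vectors of the context words.
    hidden = [sum(v_in[w][k] for w in context_words) / len(context_words)
              for k in range(dim)]
    # Score each candidate output word by a dot product with its output vector.
    scores = {w: sum(h * o for h, o in zip(hidden, v_out[w])) for w in vocab}
    norm = sum(math.exp(s) for s in scores.values())
    return {w: math.exp(s) / norm for w, s in scores.items()}

probs = cbow_probabilities(context)
print(round(sum(probs.values()), 6))  # 1.0
```

A training step would then nudge the vectors by gradient descent so that `probs[target]` increases, and the same procedure is repeated for the next pair.
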
An interesting property of word vectors obtained this way is that they encode not only syntactic but also semantic
relationships between words. Not only are similar words close to each other in the vector space 
(typically measured by cosine similarity), but word analogies are also reflected in the differences between word vectors. This property is
referred to as 'additive compositionality' in the literature (Mikolov) and refers to the linear structure of the vector
space that allows analogical reasoning. Word vectors can be seen as representing the distribution of the contexts in which
a word appears, and the sum of two vectors roughly acts as an AND over those contexts. So if, for instance, 'Volga River' frequently appears in 
the same context as the words 'river' and 'Russian', the sum of these two word vectors will be close to the 
vector of 'Volga River'. 
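
Both effects can be illustrated with a toy example. The 4-dimensional vectors below are hand-made for this sketch (they are not learned embeddings), and the helper names `cosine` and `nearest` are likewise hypothetical; the point is only to show how differences and sums of vectors are compared by cosine similarity.

```python
import math

# Hypothetical toy vectors; axes loosely mean (Russia, Germany, water, capital).
vectors = {
    "russian":     [0.9, 0.0, 0.1, 0.1],
    "german":      [0.0, 0.9, 0.1, 0.1],
    "moscow":      [0.9, 0.1, 0.0, 0.9],
    "berlin":      [0.1, 0.9, 0.0, 0.9],
    "river":       [0.1, 0.1, 0.9, 0.0],
    "volga_river": [0.8, 0.0, 0.9, 0.1],
    "rhine":       [0.0, 0.8, 0.9, 0.1],
}

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def nearest(query, exclude=()):
    """Return the word whose vector is most similar to query."""
    candidates = (w for w in vectors if w not in exclude)
    return max(candidates, key=lambda w: cosine(query, vectors[w]))

# Analogy via vector differences: moscow - russian + german
analogy = [m - r + g for m, r, g in
           zip(vectors["moscow"], vectors["russian"], vectors["german"])]
print(nearest(analogy, exclude={"moscow", "russian", "german"}))  # berlin

# Composition via vector sums: russian + river
composed = [r + v for r, v in zip(vectors["russian"], vectors["river"])]
print(nearest(composed, exclude={"russian", "river"}))  # volga_river
```
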

## Implementation 

In [114]:
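import re
import nltk
from gensim.models import Word2Vec
from nltk.corpus import stopwords

# One-time download of the NLTK data used below (newer NLTK releases may also need 'punkt_tab')
for resource in ('punkt', 'stopwords', 'wordnet'):
    nltk.download(resource, quiet=True)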

# Steve Jobs' Speech In Stanford

In [115]:
para = """Job’s first story was about connecting the dots.
He talked to the graduates about dropping out of college and “dropping in” on courses he wanted to take, like calligraphy, which, at the time, had nothing to do with what he wanted to do with his life. A decade later, he incorporated what he had learned into the design of the Macintosh. It was the first computer with a beautiful typograhy. In the movie Jobs, Steve was so pedantic about typography that he insisted on it being part of the Macintosh design. When his best engineer on the team thought the idea was silly, he was fired on the spot.
Typography made me fall in love with Apple products, fonts are everything. Jobs once said, “the design of the iPhone’s buttons has to be so good that users would want to lick them.” And he was right.
In Stoicism, everything is opportunity. Sometimes we look back at missed opportunities with regret, but we have to have faith that the dots will connect in the future somehow. Nothing is ever wasted.
Roger Federer is one the greatest tennis players in the world. Many people don’t know that he didn’t really like tennis growing up, in fact, his mom was a tennis coach and didn’t want to coach him since he was bad at it. He went on to win the Spanish inter-league championship as a striker, playing football. He also played basketball, badminton, and cricket. Federer didn’t focus on tennis until the age of 12, seven years before winning his first grand slam title at the 2005 French Open.
Unlike Tiger Woods who has been playing golf since he was a baby, Roger tried almost any sport that involved a ball and credits all of those sports for enhancing his coordination. These are dots connecting.
Love and Loss
Jobs’ second story was about love and loss.
He recalled falling in love with computers, meeting Steve Woz, building Apple, and getting fired by the Board of Directors. He also recalled how getting fired was the best thing that happened to him at the time, and the only thing that kept him going was loving what he did.
He made a huge impact at Apple that, after an eleven year absence, his philosophy still echoed within its corridors. “You’ve got to find what you love.” he said. And let’s not forget the prolific thoughts of Maya Angelou who said, “… pursue the things you love doing and then do them so well that people can’t take their eyes off you.”
In hindsight, loving your work doesn’t guarantee impact or success.
A study done by Forbes Insights, found that all cases of work being studied shared a single intention — the work was focused on making a difference that someone else would love, instead of the person performing the work. They were focused on the recipient of their work — their customer, their colleague who depends on them, their leader who trusts in them, the community who expects their support, or others who benefit from their work.
This is the philosophy that makes Apple a great company, it is customer focused. As was Steve Jobs.
Work is love made visible.
Death
Jobs’ third and final story was about death.
“Remembering that are you going to die is the best way I know to avoid the trap thinking you have something to lose. You are already naked. There is no reason not to follow your heart.”
I have been mediating on death a lot lately. Not in a sad and gloomy way, but as a reminder that our time on earth is limited. We don’t know when will our lights go out. More important, how will we be remembered.
Death calibrates so many things in life. It reminds us that nothing is permanent, that we spend a great deal of time on things that don’t matter. After Jobs learned of his cancer, everything changed. His focus shifted. The first thing he did after returning to Apple was to kill all unnecessary projects. Those who worked with him recall how he was brutally honest, a jerk sometimes, and how focussed his was.
Apple products are not only known for their sleek designs but Jobs had a philosophy, which drives Apple to this day, of simplicity and minimalism. When cellphone makers were battling it out with keyboard, iPhone came out with one button. Minimalism can also be seen in how Jobs communicated on emails. No fluff. Say your piece and keep your peace. Nothing more.
Steve Jobs saves his call to attention for the end of the speech: “Stay Hungry. Stay Foolish. I have always wished that for myself. And now, as you graduate to begin anew, I wish that for you. Stay Hungry. Stay Foolish.”."""

In [116]:
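# Preprocessing the data
corpus = re.sub(r'\[[0-9]*\]', ' ', para)  # drop bracketed reference markers such as [12]
corpus = corpus.lower()                    # lowercase everything
corpus = re.sub(r'\d', ' ', corpus)        # replace digits with spaces
corpus = re.sub(r'\s+', ' ', corpus)       # collapse whitespace runs (including newlines)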
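
To see what each substitution does, the same steps can be applied to a short string. The sample text below is made up for illustration. Note the side effect on tokens that mix letters and digits: 'word2vec' is split into 'word vec'.

```python
import re

# Made-up sample containing reference markers, digits, and irregular whitespace.
sample = "Word2Vec [1] was introduced in 2013.\nIt   has 2 variants [23]."

step1 = re.sub(r'\[[0-9]*\]', ' ', sample)  # drop bracketed reference markers
step2 = step1.lower()                        # lowercase everything
step3 = re.sub(r'\d', ' ', step2)            # replace every remaining digit with a space
cleaned = re.sub(r'\s+', ' ', step3)         # collapse whitespace runs, including newlines

print(cleaned)  # word vec was introduced in . it has variants .
```
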

In [117]:
corpus

'job’s first story was about connecting the dots. he talked to the graduates about dropping out of college and “dropping in” on courses he wanted to take, like calligraphy, which, at the time, had nothing to do with what he wanted to do with his life. a decade later, he incorporated what he had learned into the design of the macintosh. it was the first computer with a beautiful typograhy. in the movie jobs, steve was so pedantic about typography that he insisted on it being part of the macintosh design. when his best engineer on the team thought the idea was silly, he was fired on the spot. typography made me fall in love with apple products, fonts are everything. jobs once said, “the design of the iphone’s buttons has to be so good that users would want to lick them.” and he was right. in stoicism, everything is opportunity. sometimes we look back at missed opportunities with regret, but we have to have faith that the dots will connect in the future somehow. nothing is ever wasted. ro

In [118]:
## sentence tokenizer...
sent= nltk.sent_tokenize(corpus)
sent


['job’s first story was about connecting the dots.',
 'he talked to the graduates about dropping out of college and “dropping in” on courses he wanted to take, like calligraphy, which, at the time, had nothing to do with what he wanted to do with his life.',
 'a decade later, he incorporated what he had learned into the design of the macintosh.',
 'it was the first computer with a beautiful typograhy.',
 'in the movie jobs, steve was so pedantic about typography that he insisted on it being part of the macintosh design.',
 'when his best engineer on the team thought the idea was silly, he was fired on the spot.',
 'typography made me fall in love with apple products, fonts are everything.',
 'jobs once said, “the design of the iphone’s buttons has to be so good that users would want to lick them.” and he was right.',
 'in stoicism, everything is opportunity.',
 'sometimes we look back at missed opportunities with regret, but we have to have faith that the dots will connect in the futur

In [119]:
sent = [nltk.word_tokenize(i) for i in sent]

In [120]:
sent[:2]

[['job',
  '’',
  's',
  'first',
  'story',
  'was',
  'about',
  'connecting',
  'the',
  'dots',
  '.'],
 ['he',
  'talked',
  'to',
  'the',
  'graduates',
  'about',
  'dropping',
  'out',
  'of',
  'college',
  'and',
  '“',
  'dropping',
  'in',
  '”',
  'on',
  'courses',
  'he',
  'wanted',
  'to',
  'take',
  ',',
  'like',
  'calligraphy',
  ',',
  'which',
  ',',
  'at',
  'the',
  'time',
  ',',
  'had',
  'nothing',
  'to',
  'do',
  'with',
  'what',
  'he',
  'wanted',
  'to',
  'do',
  'with',
  'his',
  'life',
  '.']]

In [121]:
## remove stopwords
stop_words = set(stopwords.words('english'))  # build the set once instead of once per word
for i in range(len(sent)):
    sent[i] = [word for word in sent[i] if word not in stop_words]

In [122]:
sent[:2]

[['job', '’', 'first', 'story', 'connecting', 'dots', '.'],
 ['talked',
  'graduates',
  'dropping',
  'college',
  '“',
  'dropping',
  '”',
  'courses',
  'wanted',
  'take',
  ',',
  'like',
  'calligraphy',
  ',',
  ',',
  'time',
  ',',
  'nothing',
  'wanted',
  'life',
  '.']]
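
The filtered sentences still contain punctuation tokens such as '’', ',' and '.', which would end up as vocabulary entries of their own. One possible extra cleaning step, which is not part of the original notebook, is to keep only tokens that contain at least one letter or digit. The helper name `drop_punctuation` is hypothetical.

```python
def drop_punctuation(tokens):
    """Keep only tokens that contain at least one letter or digit."""
    return [tok for tok in tokens if any(ch.isalnum() for ch in tok)]

tokens = ['job', '’', 'first', 'story', 'connecting', 'dots', '.']
print(drop_punctuation(tokens))  # ['job', 'first', 'story', 'connecting', 'dots']
```

This simple filter keeps tokens such as 'you.' unchanged, because they still contain letters; stripping trailing punctuation would need a further step.
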

In [123]:
## Lemmatization...
from nltk.stem import WordNetLemmatizer
lemmatizer= WordNetLemmatizer()

for i in range(len(sent)):
    sent[i] = [lemmatizer.lemmatize(word) for word in sent[i]]

In [124]:
sent[1:3]

[['talked',
  'graduate',
  'dropping',
  'college',
  '“',
  'dropping',
  '”',
  'course',
  'wanted',
  'take',
  ',',
  'like',
  'calligraphy',
  ',',
  ',',
  'time',
  ',',
  'nothing',
  'wanted',
  'life',
  '.'],
 ['decade',
  'later',
  ',',
  'incorporated',
  'learned',
  'design',
  'macintosh',
  '.']]

In [125]:
w2v_model = Word2Vec(sent, min_count=1)  # min_count=1 keeps every word, even those that occur only once
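
The effect of `min_count` can be sketched without gensim: words whose total frequency in the corpus is below the threshold are dropped from the vocabulary before training. The sentences and the helper name `build_vocab` below are hypothetical and only mimic that pruning rule. gensim's default threshold is 5, which would discard almost every word of a corpus as small as this one, hence `min_count=1` above.

```python
from collections import Counter

# Hypothetical tokenized sentences standing in for `sent`.
sentences = [
    ['design', 'macintosh', 'typography'],
    ['typography', 'love', 'apple'],
    ['love', 'design', 'apple', 'design'],
]

def build_vocab(sentences, min_count):
    """Return {word: frequency} for words that occur at least min_count times."""
    counts = Counter(word for sentence in sentences for word in sentence)
    return {word: n for word, n in counts.items() if n >= min_count}

print(build_vocab(sentences, min_count=1))
# {'design': 3, 'macintosh': 1, 'typography': 2, 'love': 2, 'apple': 2}
print(build_vocab(sentences, min_count=2))
# {'design': 3, 'typography': 2, 'love': 2, 'apple': 2}
```
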

In [126]:
words = w2v_model.wv.key_to_index  # gensim >= 4.0; on gensim 3.x use w2v_model.wv.vocab

In [127]:
len(words)

285

In [128]:
# Finding Word Vectors
vector = w2v_model.wv['decade']
vector  # 100 dimensions (the default vector size)

array([-2.58538779e-03, -2.11863942e-03, -3.70551297e-03,  1.39855198e-03,
        4.70542861e-03,  7.59585004e-04, -4.60192841e-03, -3.40852188e-03,
        2.58458033e-03,  2.42529204e-03, -2.93863053e-03,  4.58041113e-03,
        1.78027840e-03, -1.76313391e-04,  3.77977313e-03, -1.59371455e-04,
       -3.04354960e-03, -2.21502967e-03, -4.08910401e-03, -2.05423939e-03,
        2.50410638e-03, -2.47312011e-03,  4.21752082e-03, -1.18382368e-03,
       -8.16729560e-04,  4.64993855e-03,  3.54476739e-03, -3.39434342e-03,
       -1.26178667e-03,  3.71288275e-03, -3.04459641e-03,  2.79450393e-03,
        2.64187669e-03,  4.16286429e-03,  3.14399973e-03, -3.69941653e-03,
        4.05399315e-03, -1.94639631e-03, -3.88599979e-03,  3.71220085e-04,
        2.51882221e-03, -7.32240325e-04, -2.10951269e-03, -4.58331604e-04,
       -4.03513387e-03, -4.58720559e-03, -8.49693373e-04, -2.65549636e-03,
        1.07225194e-03, -4.05139569e-03, -3.49623756e-03,  2.52786279e-03,
       -3.49534466e-03,  

In [129]:
similar_words = w2v_model.wv.most_similar('decade')

In [130]:
similar_words

[('follow', 0.29708123207092285),
 ('user', 0.2543463706970215),
 ('macintosh', 0.24056144058704376),
 ('button', 0.23258930444717407),
 ('love', 0.2320091426372528),
 ('worked', 0.21800382435321808),
 ('performing', 0.21093666553497314),
 ('fluff', 0.20520809292793274),
 ('else', 0.18994814157485962),
 (',', 0.18918977677822113)]

In [132]:
similar_words = w2v_model.wv.most_similar('learned')
similar_words

[('typography', 0.2253848910331726),
 ('simplicity', 0.21713969111442566),
 ('job', 0.2087312787771225),
 ('others', 0.20427215099334717),
 ('study', 0.1959155797958374),
 ('you.', 0.19486233592033386),
 ('already', 0.18627208471298218),
 ('day', 0.18533962965011597),
 ('absence', 0.17194335162639618),
 ('dot', 0.16743800044059753)]