<small><i>This notebook was put together by [Abel Meneses-Abad](http://www.menesesabad.com) for PyData 2018. Source and license info is on [GitHub](https://github.com/sorice/wordembeded/).</i></small>

# An Introduction to Word Embedding Models in Python

Thanks to Jake Vanderplas for his innumerable contributions to the Python world and to science with Python. I reuse part of the structure I learned from him to build this tutorial.

## Goals of this Tutorial

- **Introduce the basics of Word Embedding Text Representation models**, and some useful code snippets in practice.
- **Apply Semantic Text Similarity using these models**, so you can see the available tools and an example application of this knowledge.
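
As a rough illustration of the second goal, the sketch below compares sentences by the cosine of the angle between their vectors. It is a simplification that uses only the standard library and plain word counts; the helper names `bag_of_words` and `cosine_similarity` are chosen here for illustration and do not come from the notebooks. Word embedding models replace the count vectors with dense learned vectors, but a comparison step of this kind stays much the same.

```python
import math
from collections import Counter


def bag_of_words(text):
    """Lowercase, split on whitespace, and count tokens (a deliberately naive tokenizer)."""
    return Counter(text.lower().split())


def cosine_similarity(vec_a, vec_b):
    """Cosine of the angle between two sparse vectors stored as dicts."""
    dot = sum(weight * vec_b.get(token, 0) for token, weight in vec_a.items())
    norm_a = math.sqrt(sum(w * w for w in vec_a.values()))
    norm_b = math.sqrt(sum(w * w for w in vec_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty sentence is similar to nothing
    return dot / (norm_a * norm_b)


s1 = bag_of_words("the cat sat on the mat")
s2 = bag_of_words("the cat lay on the mat")
s3 = bag_of_words("stock prices fell sharply today")

print(cosine_similarity(s1, s2))  # high: the sentences share most of their words
print(cosine_similarity(s1, s3))  # zero: the sentences share no words
```

Plain counts cannot see that "sat" and "lay" are related words, which is one motivation for the embedding models covered in the notebooks.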

## Notebooks Index

The following links point to the notebooks that contain the tutorial materials.
Note that many of them require files from the directory structure of the [GitHub repository](https://github.com/sorice/wordembeded/) in which they are contained.
Many resources for this tutorial must be downloaded from their original sources, such as the NLTK text corpora used here and the Wikipedia dump.
I strongly recommend working through the tutorial at home first and generating the data from the Wikipedia dump with the Gensim package beforehand. This can take a long time, which you do not want to lose during the live presentation.

### 1. Preliminaries

  + [01-Preliminaries.ipynb](01-Preliminaries.ipynb)
  
### 2. Word Embedding Models

  + [02.1 Term Frequency - Inverse Document Frequency Model](02.1-TfIdf.ipynb)
  + [02.2 TfIdf with Wikipedia Corpus Notebook](2.2-TfIdf-Wikipedia.ipynb)
  + [02.3 GloVe Model](2.3-GloVe.ipynb)
  + [02.4 GloVe with Wikipedia Corpus](2.4-GlobVe-Wikipedia.ipynb)
  + [02.5 Word2Vec Model](2.5-word2vec.ipynb)
  + [02.6 Word2Vec with Wikipedia Corpus](2.6-word2vec-Wikipedia.ipynb)
  + [02.7 LSI Model](2.7-LSI.ipynb)
  + [02.8 LSI with Wikipedia Corpus](2.6-word2vec-Wikipedia.ipynb)
  + [02.9 Paragraph2Vec Model](2.9-Paragraph2Vec.ipynb)
  + [2.10 P2V Model with Wikipedia Corpus](2.10-Paragraph2Vec-Wikipedia.ipynb)
  
  
### 3. Full Example Using Word Embeddings in Paraphrase Recognition

  + [3.1 Playful Experiments with MSRP Corpus](3.1-Playfull-Experiments-with-MSRPC.ipynb)
  
### 4. Future Work

   + [Sense2Vec Model](4.1-Sense2Vec.ipynb)
   + [Does SVD over the TF matrix have any effect on Paraphrase Recognition?](4.2-Tf_applying_SVD.ipynb)
   + [Paragram2Vec Model]()

## Preliminaries

This tutorial requires the following packages:

- Python version 2.6-2.7 or 3.3-3.4
- `jupyter` version 1.0 or later, with notebook support: http://jupyter.org
- `notebook` version 5.4 or later: http://jupyter.org
- `numpy` version 1.14 or later: http://www.numpy.org/
- `scipy` version 1.0 or later: http://www.scipy.org/
- `scikit-learn` (imported as `sklearn`) version 0.19 or later: http://scikit-learn.org

- `gensim` version 3.3 or later: https://radimrehurek.com/gensim/index.html
- `nltk` version 3.2.5 or later: http://nltk.org

The easiest way to get these is to use the [conda](http://store.continuum.io/) environment manager.
I suggest downloading and installing [miniconda](http://conda.pydata.org/miniconda.html).

The following command will install all required packages:
```
$ conda install numpy scipy scikit-learn jupyter notebook gensim nltk
```

Alternatively, you can download and install with:
```
$ pip install numpy scipy scikit-learn jupyter notebook gensim nltk
```

### Checking your installation

You can run the following code to check the versions of the packages on your system:

(In a Jupyter notebook, press `shift` and `return` together to execute the contents of a cell.)

```
# Keeps print() working the same way on Python 2 and Python 3.
from __future__ import print_function

import jupyter
print('jupyter:', jupyter.__version__)

import notebook
print('notebook:', notebook.__version__)

import numpy
print('numpy:', numpy.__version__)

import scipy
print('scipy:', scipy.__version__)

import sklearn
print('scikit-learn:', sklearn.__version__)

import gensim
print('gensim:', gensim.__version__)

import nltk
print('nltk:', nltk.__version__)
```

```
jupyter: 1.0.0
notebook: 5.4.0
numpy: 1.14.0
scipy: 1.0.0
scikit-learn: 0.19.1
gensim: 3.3.0
nltk: 3.2.5
```
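
The cell above stops at the first package that fails to import. As an alternative, here is a minimal sketch (Python 3) that reports every package in one pass and compares each installed version with the minimum from the requirements list. The names `MINIMUM_VERSIONS`, `version_tuple`, and `check_package` are introduced here for illustration, and the version comparison is deliberately simple: it only looks at the leading numeric parts.

```python
import importlib

# Minimum versions copied from the requirements list of this tutorial.
# Keys are import names, so scikit-learn appears as "sklearn".
MINIMUM_VERSIONS = {
    "notebook": "5.4",
    "numpy": "1.14",
    "scipy": "1.0",
    "sklearn": "0.19",
    "gensim": "3.3",
    "nltk": "3.2.5",
}


def version_tuple(version):
    """Turn '1.14.0' into (1, 14, 0); stop at the first non-numeric part."""
    parts = []
    for piece in version.split("."):
        if not piece.isdigit():
            break
        parts.append(int(piece))
    return tuple(parts)


def check_package(name, minimum):
    """Return 'ok', 'too old', 'unknown' or 'not importable' for one package."""
    try:
        module = importlib.import_module(name)
    except Exception:
        # Any failure while importing counts as "not importable" here.
        return "not importable"
    installed = getattr(module, "__version__", None)
    if installed is None:
        return "unknown"
    if version_tuple(installed) >= version_tuple(minimum):
        return "ok"
    return "too old"


for name, minimum in sorted(MINIMUM_VERSIONS.items()):
    print("{:10s} >= {:6s} {}".format(name, minimum, check_package(name, minimum)))
```

A package reported as `not importable` or `too old` can then be installed or upgraded with the `conda` or `pip` commands shown earlier.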


## Data

For the paraphrase recognition examples, the Microsoft Research Paraphrase Corpus (MSRP) [(Dolan2004)](#Dolan2004) will be used.
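
To give an idea of how sentence pairs from such a corpus can be read, here is a minimal sketch. It assumes the tab-separated layout commonly distributed for this corpus: a header row followed by five columns (`Quality`, `#1 ID`, `#2 ID`, `#1 String`, `#2 String`). Check your own copy before relying on it. The function name `read_msrp` is hypothetical, and the sample rows are made up for illustration; they are not taken from the corpus.

```python
import csv
import io


def read_msrp(handle):
    """Yield (is_paraphrase, sentence_1, sentence_2) from an MSRP-style TSV file."""
    # QUOTE_NONE: sentences may contain bare double quotes that are not CSV quoting.
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    next(reader)  # skip the header row
    for row in reader:
        if len(row) != 5:
            continue  # ignore blank or malformed lines
        quality, _id_1, _id_2, sentence_1, sentence_2 = row
        yield quality == "1", sentence_1, sentence_2


# Made-up rows in the assumed layout; these are NOT real corpus entries.
sample = io.StringIO(
    "Quality\t#1 ID\t#2 ID\t#1 String\t#2 String\n"
    "1\t101\t102\tThe company said profits rose.\tProfits rose, the company said.\n"
    "0\t103\t104\tShe called it \"a big win\".\tThe vote is due on Friday.\n"
)

pairs = list(read_msrp(sample))
print(len(pairs), "pairs;", sum(label for label, _, _ in pairs), "labeled as paraphrases")
```

With a real copy of the corpus, the same function can be given a file object opened in text mode instead of the in-memory sample.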

## Useful Resources

- **Gutenberg corpus NLTK selections:** https://raw.githubusercontent.com/nltk/nltk_data/gh-pages/packages/corpora/gutenberg.zip 
- **scikit-learn:** http://scikit-learn.org (see especially the narrative documentation)
- **matplotlib:** http://matplotlib.org (see especially the gallery section)
- **IPython:** http://ipython.org (also check out http://nbviewer.ipython.org)

<a id='References'></a>
# References


[1] *[Dolan2004]* Dolan, Bill & Quirk, Chris & Brockett, Chris (2004). 
**Unsupervised Construction of Large Paraphrase Corpora: Exploiting Massively Parallel News Sources.** 
In Proceedings of the 20th International Conference on Computational Linguistics (COLING 2004). ACM. 
<a id='Dolan2004'></a>