This project aims to generate complex word embeddings for Out-Of-Vocabulary entities. After completion, the package will be able to generate pre-trained embeddings, improve them by generating embeddings on-the-fly, evaluate the benchmarks using the pre-trained embeddings, and make the same evaluations on the improved embeddings.

### Getting started

Here is how you can start training and testing the model yourself. First, clone the repository to your local machine. I assume you are familiar with Git and have it installed on your system.

```shell
$ git clone git@github.com:tramplingWillow/embeddings.git
$ cd embeddings/gsoc2018-bharat
```

### Creating python virtual environment

Next, we need to set up the Python environment with all the libraries used in this project. The project uses Python 3, so make sure you have it installed; you can check the installation procedure and download the latest version [here](https://www.python.org/downloads/).

```shell
$ python3 -m venv venv
$ source venv/bin/activate
$ pip install --upgrade pip
$ pip install -r requirements.txt
```

### Downloading and cleaning wiki dump

First, download `enwik9`, the first 1 billion bytes of the English Wikipedia XML dump, from Matt Mahoney's site.

```shell
$ wget -c http://mattmahoney.net/dc/enwik9.zip -P data
$ unzip data/enwik9.zip -d data
```

This is a raw Wikipedia dump and needs to be cleaned because it contains a lot of HTML/XML markup. There are two ways to pre-process it: the [wikifil.pl](https://github.com/tramplingWillow/ComplexEmbeddings/blob/master/src/package/wikifil.pl) script bundled with FastText (originally written by Matt Mahoney), or the [WikiExtractor](https://github.com/attardi/wikiextractor/blob/master/WikiExtractor.py) script. Here, I am using WikiExtractor to pre-process it.

```shell
$ wget -c https://raw.githubusercontent.com/attardi/wikiextractor/master/WikiExtractor.py -P src
$ python src/WikiExtractor.py data/enwik9 -l -o data/output
$ python src/WikiExtractor.py data/enwik9 -o data/text
```
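If you want to sanity-check the extraction, WikiExtractor writes sharded files (e.g. `data/text/AA/wiki_00`) in which each article is wrapped in a `<doc id="..." url="..." title="...">` tag. Here is a minimal sketch for iterating over the output; the `iter_docs` helper is illustrative, not part of this repo:

```python
import os
import re

# Matches one article block in WikiExtractor's output.
DOC_RE = re.compile(r'<doc id="[^"]*" url="[^"]*" title="([^"]*)">(.*?)</doc>', re.S)

def iter_docs(root):
    """Yield (title, text) pairs from every shard under `root`."""
    for dirpath, _, filenames in os.walk(root):
        for name in sorted(filenames):
            with open(os.path.join(dirpath, name), encoding="utf-8") as f:
                for match in DOC_RE.finditer(f.read()):
                    yield match.group(1), match.group(2).strip()

# Print the first extracted article title as a quick check.
for title, text in iter_docs("data/text"):
    print(title, text[:80])
    break
```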

## Pre-Training and Evaluating Analogy using Google Dataset
Now we have the extracted text from the XML dump, both with and without HTML links. Next, we need to generate surface forms from it and combine the files into a single file for training the FastText model. While doing so, the description for each entity is also extracted.

```shell
$ python src/surface_forms.py data/output
$ python src/check_person.py data/text data/Genders.csv
$ python src/mention_extractor.py data/output data/AnchorDictionary.csv data/Genders.csv
$ python src/combine.py data/output data/text8 data/descriptions.json
```
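The exact behaviour of these scripts is specific to this repo, but the core idea of mention extraction with an anchor dictionary can be sketched roughly as follows; the dictionary entries and the `link_mentions` helper are illustrative only:

```python
# Rough illustration (not the repo's code): rewrite anchor-text mentions as
# canonical entity tokens so that FastText later learns one vector per entity.
anchor_dict = {"Barack Obama": "Barack_Obama", "Obama": "Barack_Obama"}

def link_mentions(sentence, dictionary):
    # Try longer surface forms first so "Barack Obama" wins over "Obama".
    for surface, entity in sorted(dictionary.items(), key=lambda kv: -len(kv[0])):
        sentence = sentence.replace(surface, entity)
    return sentence

print(link_mentions("Obama spoke in Berlin.", anchor_dict))
# -> "Barack_Obama spoke in Berlin."
```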

### Pre-Training using FastText

The script, [pre-train.py](https://github.com/tramplingWillow/ComplexEmbeddings/blob/master/src/pre-train.py), takes the following arguments:
- Files

```shell
$ mkdir model
$ python src/pre-train.py -i data/text8 -o model/entity_fasttext_n100 -m fasttext -s 100 -sg 1 -hs 1 -e 10
```
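Under the hood, a gensim-based trainer corresponding to these flags might look roughly like the sketch below. This assumes gensim ≥ 4.0 and is not the repo's exact implementation:

```python
from gensim.models import FastText

# Mirror the command-line flags: -s 100, -sg 1, -hs 1, -e 10.
model = FastText(
    corpus_file="data/text8",  # single pre-processed corpus file
    vector_size=100,           # -s 100: embedding dimensionality
    sg=1,                      # -sg 1: skip-gram
    hs=1,                      # -hs 1: hierarchical softmax
    epochs=10,                 # -e 10: training epochs
)
model.save("model/entity_fasttext_n100")
```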

### Training the LSTM model

```shell
$ python src/train_lstm.py
```
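The script takes no arguments here. Conceptually, given the project goal of generating embeddings on-the-fly for OOV entities, the model learns to map an entity's description to its pre-trained vector. A conceptual Keras sketch under that assumption, not the repo's actual architecture:

```python
import tensorflow as tf

VOCAB, DIM = 50_000, 100  # assumed vocabulary size and embedding dimension

# Read a tokenized entity description and regress onto the 100-d FastText
# vector, so an unseen entity can be embedded from its description alone.
model = tf.keras.Sequential([
    tf.keras.layers.Embedding(VOCAB, 128),
    tf.keras.layers.LSTM(256),
    tf.keras.layers.Dense(DIM),  # predicted entity embedding
])
model.compile(optimizer="adam", loss="mse")
```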

Next, we try to see how these pre-trained embeddings perform on the [Google Analogy Task](http://download.tensorflow.org/data/questions-words.txt). For this, we have the [analogy.py](https://github.com/tramplingWillow/ComplexEmbeddings/blob/master/src/analogy.py) script.
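For reference, gensim ships an analogy evaluator that a script like this might wrap; a minimal sketch, assuming the questions file has been downloaded to `data/`:

```python
from gensim.models import FastText

# Load the pre-trained model and score it on the Google analogy questions.
model = FastText.load("model/entity_fasttext_n100")
score, sections = model.wv.evaluate_word_analogies("data/questions-words.txt")
print(f"overall analogy accuracy: {score:.3f}")
```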