RWET Final Project

Scripts:

  • prepare_virtualenv.sh: creates a virtualenv environment and installs the dependencies
  • prepare_data.sh: downloads the word2vec project to obtain a small word vector database
  • prepare_data_big.sh: prompts the user to download GoogleNews-vectors-negative300.bin.gz and converts it to a very large (3.7 gigabyte) word vector database

Five programs:

Generated poems.

distance.py

distance.py is the main program. It takes a couplet from a source text as input and blends each pair of words, one chosen from each line of the couplet, into a grid of new words that forms a poem. Words are blended through vector addition, using word vector data from the open source word2vec project.
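The grid layout described above can be sketched as follows. This is only an illustration, not the code in distance.py: the names `blend_grid`, `format_grid`, and `toy_blend` are hypothetical, words are assumed to be split on whitespace, and `toy_blend` is a stand-in for the real vector-addition lookup.

```python
from typing import Callable, List


def blend_grid(first_line: str, second_line: str,
               blend: Callable[[str, str], str]) -> List[List[str]]:
    """Build a grid with one column per word of the first line and
    one row per word of the second line (assumed whitespace split)."""
    columns = first_line.split()
    rows = second_line.split()
    grid = [[""] + columns]  # header row: empty corner, then column words
    for row_word in rows:
        cells = [blend(row_word, column_word) for column_word in columns]
        grid.append([row_word] + cells)
    return grid


def format_grid(grid: List[List[str]]) -> List[str]:
    """Pad the first column to the widest row label, then join the cells."""
    width = max(len(row[0]) for row in grid)
    return [row[0].ljust(width) + " " + " ".join(row[1:]) for row in grid]


def toy_blend(a: str, b: str) -> str:
    # Placeholder only: the real program blends word vectors instead.
    return a[0] + b[0]


grid = blend_grid("Hello world.", "Where are you?", toy_blend)
lines = format_grid(grid)
print("\n".join(lines))
```

Passing the blending step in as a function keeps the grid logic separate from the word vector lookup, so either part can be swapped out.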

$ ./distance.py --help
usage: distance.py [-h] [--number NUMBER] --vocabulary VOCABULARY --vectors
                   VECTORS [-p PROBABILITY] [--html]

Find the nearest words to sums of pairs of words from a couplet of an original
text.

optional arguments:
  -h, --help            show this help message and exit
  --number NUMBER       the number of similar words
  --vocabulary VOCABULARY
                        the input vocabulary text file
  --vectors VECTORS     the input numpy vector binary file
  -p PROBABILITY, --probability PROBABILITY
                        the probability parameter for the geometric
                        distribution for choosing words
  --html                whether to output html

Example

$ cat | ./distance.py --vocabulary vocabulary.txt --vectors vectors.dat
Hello world.
Where are you?
Hello world.
Where are you?

      Hello world.
Where when when
are   meleon these
you?  yourself myself
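The help text mentions a probability parameter for a geometric distribution used when choosing words. One plausible reading, sketched below, is that candidates are ranked from nearest to farthest and the rank is drawn geometrically, so the nearest word is picked with probability p. The function names and the clamping of out-of-range ranks are assumptions, not taken from distance.py.

```python
import random
from typing import Optional, Sequence


def geometric_rank(p: float, rng: random.Random) -> int:
    """Count failures before the first success; returns 0 with probability p."""
    if not 0.0 < p <= 1.0:
        raise ValueError("p must be in (0, 1]")
    rank = 0
    while rng.random() >= p:
        rank += 1
    return rank


def choose_word(candidates: Sequence[str], p: float,
                rng: Optional[random.Random] = None) -> str:
    """Pick from candidates ordered nearest first.

    Ranks past the end of the list are clamped to the last candidate
    (an assumption; resampling would be another option).
    """
    if not candidates:
        raise ValueError("no candidates to choose from")
    rng = rng or random.Random()
    rank = geometric_rank(p, rng)
    return candidates[min(rank, len(candidates) - 1)]


ranked = ["when", "where", "what"]
print(choose_word(ranked, 0.5, random.Random(0)))
```

A larger p keeps the choice close to the nearest word, while a smaller p lets more distant words through.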

pair.py

pair.py prints two neighboring non-empty lines chosen at random from standard input.

$ ./pair.py --help
usage: pair.py [-h]

Chooses two neighboring non-empty lines at random.

optional arguments:
  -h, --help  show this help message and exit

Example

$ cat | ./pair.py 
One.
Two.
Three.
Four.
Three.
Four.
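A minimal sketch of this selection, assuming empty lines are dropped before neighbors are paired (pair.py may treat them differently). The function name `choose_pair` is hypothetical.

```python
import random
from typing import Optional, Tuple


def choose_pair(text: str, rng: Optional[random.Random] = None) -> Tuple[str, str]:
    """Return two neighboring non-empty lines chosen uniformly at random."""
    rng = rng or random.Random()
    # Assumption: blank lines are removed first, so lines separated only
    # by blank lines count as neighbors.
    lines = [line for line in text.splitlines() if line.strip()]
    if len(lines) < 2:
        raise ValueError("need at least two non-empty lines")
    start = rng.randrange(len(lines) - 1)
    return lines[start], lines[start + 1]


pair = choose_pair("One.\nTwo.\n\nThree.\nFour.\n", random.Random(0))
print("\n".join(pair))
```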

prepare_data.py

prepare_data.py reads a word2vec vector database and writes two files: a text file listing the database's vocabulary, one word per line, and a binary file holding a NumPy 2D array of the vectors that represent each word.

$ ./prepare_data.py --help
usage: prepare_data.py [-h] --input INPUT --vocabulary VOCABULARY --vectors
                       VECTORS

Converts word2vec data to a vocabulary text file and a numpy vector binary
file.

optional arguments:
  -h, --help            show this help message and exit
  --input INPUT         the word2vec data file
  --vocabulary VOCABULARY
                        the output vocabulary text file
  --vectors VECTORS     the output numpy vector binary file

Example

$ ./prepare_data.py --input vectors.bin --vocabulary vocabulary.txt --vectors vectors.dat
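The conversion can be sketched as below, assuming the input uses the binary layout written by the word2vec tool: a text header with the vocabulary size and vector dimension, then for each word its bytes, a space, and the vector as 32-bit little-endian floats. The function name is hypothetical, and how prepare_data.py actually stores the array in vectors.dat is not shown here.

```python
import io
import struct
from typing import BinaryIO, List, Tuple

import numpy as np


def read_word2vec_binary(stream: BinaryIO) -> Tuple[List[str], np.ndarray]:
    """Parse a word2vec binary stream into a word list and a 2D array."""
    header = stream.readline().split()
    vocab_size, dim = int(header[0]), int(header[1])
    words = []
    vectors = np.empty((vocab_size, dim), dtype=np.float32)
    for row in range(vocab_size):
        chars = []
        while True:
            ch = stream.read(1)
            if ch == b"":
                raise EOFError("unexpected end of file while reading a word")
            if ch == b" ":
                break
            if ch != b"\n":  # skip the newline that follows the previous vector
                chars.append(ch)
        words.append(b"".join(chars).decode("utf-8", errors="replace"))
        data = stream.read(4 * dim)
        if len(data) != 4 * dim:
            raise EOFError("unexpected end of file while reading a vector")
        vectors[row] = np.frombuffer(data, dtype="<f4")
    return words, vectors


# Tiny in-memory example with two 3-dimensional vectors.
payload = (b"2 3\n"
           + b"red " + struct.pack("<3f", 1.0, 0.0, 0.5) + b"\n"
           + b"fruit " + struct.pack("<3f", 0.0, 2.0, -1.0) + b"\n")
words, vectors = read_word2vec_binary(io.BytesIO(payload))
vocabulary_text = "\n".join(words) + "\n"  # one word per line
print(vocabulary_text, vectors.shape)
```

Row i of the array corresponds to line i of the vocabulary file, which lets a later lookup go from a word to its vector by line number.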

query.py

query.py reads lines of words and, for each line, prints the words whose vectors most closely match the sum of the vectors of the words on that line.

$ ./query.py --help
usage: query.py [-h] [--number NUMBER] --vocabulary VOCABULARY --vectors
                VECTORS

Find the nearest words.

optional arguments:
  -h, --help            show this help message and exit
  --number NUMBER       the number of similar words
  --vocabulary VOCABULARY
                        the input vocabulary text file
  --vectors VECTORS     the input numpy vector binary file

Example

$ ./query.py --vectors vectors.dat --vocabulary vocabulary.txt
red fruit
red fruit flowers yellow white green dried purple colored chocolate
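The lookup can be sketched as follows. This is a hedged illustration rather than the code in query.py: it assumes closeness is measured by cosine similarity, that unknown words are skipped, and that the query words themselves are not excluded from the results. The name `nearest_words` and the toy data are hypothetical.

```python
from typing import List

import numpy as np


def nearest_words(query: str, vocabulary: List[str], vectors: np.ndarray,
                  number: int = 10) -> List[str]:
    """Return the words closest to the sum of the query words' vectors."""
    index = {word: i for i, word in enumerate(vocabulary)}
    known = [index[word] for word in query.split() if word in index]
    if not known:
        return []
    target = vectors[known].sum(axis=0)  # vector addition of the query words
    norms = np.linalg.norm(vectors, axis=1) * np.linalg.norm(target)
    norms[norms == 0] = 1.0  # avoid dividing by zero for null vectors
    similarity = (vectors @ target) / norms  # assumed metric: cosine similarity
    best = np.argsort(-similarity, kind="stable")[:number]
    return [vocabulary[i] for i in best]


# Toy two-dimensional vectors, made up for illustration.
vocabulary = ["red", "fruit", "apple", "car"]
vectors = np.array([[1.0, 0.2],
                    [0.1, 1.0],
                    [1.0, 1.0],
                    [-1.0, 0.0]])
result = nearest_words("red fruit", vocabulary, vectors, number=3)
print(" ".join(result))
```

With real word2vec data the same idea applies, only with a much larger vocabulary and higher-dimensional vectors.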

sentences.py

sentences.py reads text from standard input and uses TextBlob to print each sentence, in order, on its own line.

$ ./sentences.py --help
usage: sentences.py [-h]

Prints the sentences in the text on standard input.

optional arguments:
  -h, --help  show this help message and exit

Example

$ cat | ./sentences.py
I am going to the store. Would you like to come with me? Okay.
I am going to the store.
Would you like to come with me?
Okay.