PCML Project 2

Loïc Ottet, Baptiste Raemy, Alexis Semple

Initial data

In folder twitter-datasets/

pos_train.txt
neg_train.txt

The steps described below are analogous for the full dataset (pos_train_full.txt and neg_train_full.txt). All scripts act on the reduced datasets by default, but they accept full as their last argument to act on the full dataset.

Preprocessing steps (execute while in `twitter-datasets/` folder)

Build the vocabulary (build_vocab.sh). Yields vocab.txt
Cut the vocabulary (cut_vocab.sh). Yields vocab_cut.txt
Convert the vocabulary to a python dictionary mapping words to ids (pickle_vocab.py). Yields vocab.pkl
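
The three preprocessing steps above can be sketched in Python as follows. This is a minimal illustration, not the actual scripts: the function names, the whitespace tokenization, the ordering of the cut vocabulary, and the `min_count=5` threshold are all assumptions made for the example.

```python
import pickle
from collections import Counter


def build_vocab(tweets):
    """Count how often each whitespace-separated token occurs."""
    counts = Counter()
    for tweet in tweets:
        counts.update(tweet.split())
    return counts


def cut_vocab(counts, min_count=5):
    """Keep tokens seen at least min_count times, most frequent first.

    The threshold value is an assumption, not taken from cut_vocab.sh.
    """
    kept = [(word, n) for word, n in counts.items() if n >= min_count]
    kept.sort(key=lambda pair: (-pair[1], pair[0]))
    return [word for word, _ in kept]


def vocab_to_ids(words):
    """Map each word to its position in the cut vocabulary."""
    return {word: idx for idx, word in enumerate(words)}


def save_vocab(vocab, path):
    """Pickle the word-to-id dictionary (the role vocab.pkl plays here)."""
    with open(path, "wb") as f:
        pickle.dump(vocab, f, pickle.HIGHEST_PROTOCOL)
```

Keeping the three stages as separate functions mirrors the three separate scripts, so each intermediate result can be inspected on its own.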

Word embedding computation (execute while in `twitter-datasets/` folder)

Provided GloVe algorithm (`glove-basic`)

Compute the co-occurrence matrix (cooc.py). Yields cooc.pkl
Compute the GloVe matrix (glove_solution.py). Yields embeddings_glove-basic.npy
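
As a rough illustration of the co-occurrence step, the sketch below counts how often two vocabulary words appear in the same tweet. It assumes that the context is the whole tweet and that a word is also paired with itself; neither detail is stated in this README, and `count_cooccurrences` is a hypothetical name.

```python
from collections import Counter
from itertools import product


def count_cooccurrences(tweets, vocab):
    """Count ordered pairs of in-vocabulary word ids that share a tweet.

    Assumption: the context is the whole tweet, and a token is paired
    with every token in the tweet, including itself.
    """
    cooc = Counter()
    for tweet in tweets:
        ids = [vocab[token] for token in tweet.split() if token in vocab]
        for i, j in product(ids, repeat=2):
            cooc[(i, j)] += 1
    return cooc
```

The resulting dictionary of `(row, column) -> count` entries is sparse, which matters because most word pairs never occur together.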

Stanford GloVe vectors (`glove`)

In what follows, ** represents the number of features of the word vector (25, 50, 100 or 200)

The vectors are in twitter-datasets/glove.twitter.27B.**d.txt
Compute the word embedding for a given dimension (filterVocab.py **). Yields embeddings_glove**.npy
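
Filtering the pretrained vectors down to our vocabulary could look like the sketch below. It assumes each line of the vector file has the form `word v1 v2 ...`, that vocabulary ids run from 0 to `len(vocab) - 1`, and that words missing from the pretrained file get an all-zero row; `filter_vectors` is a hypothetical name, not the content of filterVocab.py.

```python
def filter_vectors(lines, vocab, dim):
    """Build one embedding row per vocabulary id from 'word v1 ... v_dim' lines.

    Assumption: words absent from the pretrained file keep an all-zero row.
    """
    embeddings = [[0.0] * dim for _ in range(len(vocab))]
    for line in lines:
        parts = line.rstrip().split(" ")
        word, values = parts[0], parts[1:]
        # Skip words outside our vocabulary and malformed lines.
        if word in vocab and len(values) == dim:
            embeddings[vocab[word]] = [float(v) for v in values]
    return embeddings
```

Converting the result with `numpy.array` and writing it with `numpy.save` would give a `.npy` file like the one named above.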

FastText (`fasttext`)

Compute the fastText vectors (./fasttext skipgram -input data.txt -output model). Yields model.vec (data.txt is a concatenation of the positive and negative train sets)
Compute the word embedding for our vocabulary (filterVocabFastText.py). Yields embeddings_fasttext.npy
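
The input file data.txt mentioned above can be produced by concatenating the two training sets. A minimal sketch, assuming both files end with a newline so that no two tweets are merged onto one line (`concatenate` is a hypothetical helper):

```python
import shutil


def concatenate(paths, out_path):
    """Write the contents of each input file, in order, to out_path."""
    with open(out_path, "wb") as out:
        for path in paths:
            with open(path, "rb") as src:
                shutil.copyfileobj(src, out)
```

For the reduced dataset this would be called as `concatenate(["pos_train.txt", "neg_train.txt"], "data.txt")`.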

Network training (execute while in main folder)

To apply trainTensorflow.py and predic.py to the full dataset, use --full

Load and pad the training data (loadData.py). Yields x_train_padded.npy and y_train.npy
Train the neural network (trainTensorflow.py --embeddings=***), where *** is the name of the chosen embedding (glove-basic, glove** or fasttext). Yields detailed run data in uns/***_****/, where **** is the timestamp of the run
Generate predictions from the test set (predic.py --name=***_****). Yields predictions.csv
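
The "load and pad" step turns each tweet into a fixed-length sequence of word ids so that the network receives rectangular input. The sketch below shows one way to do this; the function name, the choice to drop out-of-vocabulary tokens, right-padding, and the padding id of 0 are assumptions, not a description of loadData.py.

```python
def tweets_to_padded_ids(tweets, vocab, pad_id=0):
    """Convert tweets to word-id lists and right-pad them to equal length.

    Assumptions: unknown tokens are dropped, and pad_id is not used as a
    real word id (otherwise padding and that word are indistinguishable).
    """
    sequences = [
        [vocab[token] for token in tweet.split() if token in vocab]
        for tweet in tweets
    ]
    max_len = max((len(seq) for seq in sequences), default=0)
    return [seq + [pad_id] * (max_len - len(seq)) for seq in sequences]
```

If the vocabulary assigns id 0 to a real word, a dedicated padding id (for example `len(vocab)`) avoids the collision.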

External libraries used

External datasets used

Stanford GloVe Twitter data
