REPO UNDER CONSTRUCTION

This repo implements a word-level language model.

This code is meant to work with very large datasets (>500 GB of text), so the data loading and processing are optimized for that scale. See the lazy contiguous dataset (lazyContiguousDataset.py).
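To illustrate what lazy, contiguous loading can look like, here is a minimal standard-library sketch. It is not the code in lazyContiguousDataset.py: the function names `lazy_token_stream` and `contiguous_windows`, the whitespace tokenization, and the chunk size are all assumptions made for the example. The idea is to read the corpus in small chunks and emit consecutive (input, target) windows without ever holding the whole file in memory.

```python
def lazy_token_stream(path, chunk_chars=1 << 20):
    """Yield whitespace-separated tokens, reading the file chunk by chunk.

    Hypothetical helper: only `chunk_chars` characters are held in memory
    at a time, plus a possible partial word carried over between chunks.
    """
    leftover = ""
    with open(path, "r", encoding="utf-8") as f:
        while True:
            chunk = f.read(chunk_chars)
            if not chunk:
                break
            chunk = leftover + chunk
            pieces = chunk.split()
            # If the chunk ends mid-word, hold the last piece back so it can
            # be completed by the next chunk.
            if pieces and not chunk[-1].isspace():
                leftover = pieces.pop()
            else:
                leftover = ""
            yield from pieces
    if leftover:
        yield leftover


def contiguous_windows(tokens, seq_len):
    """Yield (input, target) windows where target is input shifted by one.

    Consecutive windows are contiguous: the next input starts at the token
    that ended the previous target.
    """
    buffer = []
    for tok in tokens:
        buffer.append(tok)
        if len(buffer) == seq_len + 1:
            yield buffer[:-1], buffer[1:]
            buffer = buffer[-1:]  # keep the last token as the next start
```

Because both functions are generators, they can be chained (`contiguous_windows(lazy_token_stream(path), seq_len=35)`) and consumed batch by batch. A real pipeline would also map tokens to vocabulary indices before building tensors.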

Throughout the code you will see checks such as:

if os.name == 'nt':
  ...

This statement checks whether the code is running on a local Windows machine (for debugging) or on a Linux cluster.
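As a rough illustration of how such a check can drive configuration, the sketch below switches between a small debug setup and a cluster setup. The function name `pick_settings`, the keys, the paths, and the values are all hypothetical and do not come from this repo.

```python
import os


def pick_settings():
    """Return run settings depending on the platform (illustrative values)."""
    if os.name == 'nt':
        # Local Windows machine: small data and no worker processes, which
        # keeps debugging simple.
        return {"data_dir": "data/sample", "batch_size": 8, "num_workers": 0}
    # Anything else is assumed to be the Linux cluster.
    return {"data_dir": "/data/full", "batch_size": 256, "num_workers": 8}
```

Keeping the branch in one helper, rather than scattering `os.name` checks through the code, is one possible design choice; the repo itself uses inline checks.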

The small dataset used to debug and optimize this project is available at: https://www.kaggle.com/c/asap-sas/data

Large vocabulary

Because the vocabulary is large (~100K to 1M words, similar to the 1B Word benchmark), we implemented an efficient softmax + cross-entropy based on: https://arxiv.org/abs/1609.04309
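The core idea of the referenced paper (adaptive softmax) is a two-level factorization: frequent words sit in a shortlist scored directly by a small "head" softmax, while rare words are grouped into clusters, and the head only scores one entry per cluster. The sketch below shows that factorization with plain Python lists. It is not the implementation in this repo: the function names, the `cutoffs` convention, and working on precomputed logits instead of hidden states are assumptions made for the example.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def adaptive_log_prob(word, head_logits, tail_logits, cutoffs):
    """Log-probability of `word` under a two-level (head + tail) softmax.

    cutoffs: increasing boundaries, e.g. [4, 7, 10] means words 0..3 are the
        shortlist, 4..6 form tail cluster 0, 7..9 form tail cluster 1, and
        the vocabulary size is 10.
    head_logits: cutoffs[0] shortlist scores followed by one score per cluster.
    tail_logits: one list of scores per tail cluster.
    """
    if not 0 <= word < cutoffs[-1]:
        raise ValueError("word index outside vocabulary")
    shortlist = cutoffs[0]
    head = log_softmax(head_logits)
    if word < shortlist:
        # Frequent word: a single small softmax is enough.
        return head[word]
    for c in range(len(cutoffs) - 1):
        low, high = cutoffs[c], cutoffs[c + 1]
        if low <= word < high:
            # Rare word: log p(cluster) + log p(word | cluster).
            tail = log_softmax(tail_logits[c])
            return head[shortlist + c] + tail[word - low]
```

The negative log-probability of the target word is the cross-entropy loss for that position. The saving comes from evaluating only the head and at most one tail cluster per target instead of the full vocabulary; the paper additionally reduces the dimension used for tail clusters, which is omitted here. PyTorch ships a layer built on the same paper, `torch.nn.AdaptiveLogSoftmaxWithLoss`; whether this repo uses it or a custom version is not stated here.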

.spyproject
TF
local_models
losses
preProcess
stats
.gitignore
README.md
arguments.py
infoToTrack.py
lazyContiguousDataset.py
torch_LM.py
visualisation.py


simon555/LM_word
