simon555/LM_word

  • Repo under construction

This repo implements a word-level language model.

This code is meant to work with large datasets (>500 GB of text), so we optimized the data loading and processing. See Lazy contiguous dataset.
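The lazy-loading idea behind such a dataset can be sketched with a memory-mapped file. The file name, dtype, and helper names below are illustrative assumptions, not the repo's actual API:

```python
import numpy as np

def make_corpus(path="tokens.bin", dtype=np.int32):
    # Hypothetical example: token ids stored as a flat binary file.
    # np.memmap maps the file into virtual memory; pages are read from
    # disk only when a slice is accessed, so a corpus far larger than
    # RAM can still be indexed like an ordinary array.
    return np.memmap(path, dtype=dtype, mode="r")

def get_batch(corpus, offset, seq_len):
    # A contiguous slice touches only the pages it covers, which is why
    # keeping the dataset contiguous on disk makes loading cheap.
    return np.asarray(corpus[offset:offset + seq_len])
```

Since slicing is lazy, batches can be drawn from arbitrary offsets without ever materializing the full corpus in memory.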

Throughout the code you will see checks such as

import os

if os.name == 'nt':
    # running on a local Windows machine (debugging)
    ...
else:
    # running on a Linux cluster
    ...

which detect whether the code is running on a local Windows machine (for debugging) or on a Linux cluster.

The small dataset used to debug and optimize this project can be found at: https://www.kaggle.com/c/asap-sas/data

  • Large vocabulary

Since the vocabulary is large (~100K to 1M words, comparable to the One Billion Word benchmark), we implemented an efficient softmax + cross-entropy loss based on the adaptive softmax of https://arxiv.org/abs/1609.04309
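The adaptive softmax from that paper partitions the vocabulary into a small head of frequent words and larger, cheaper tail clusters. As a sketch of the technique (not necessarily how this repo implements it), PyTorch ships a built-in version; the sizes and cutoffs below are illustrative:

```python
import torch
import torch.nn as nn

vocab_size = 100_000
hidden_dim = 256

# Words with ids < 2_000 live in the head cluster (computed exactly on
# every step); rarer words fall into tail clusters whose projections are
# smaller, which is what makes the softmax cheap for huge vocabularies.
criterion = nn.AdaptiveLogSoftmaxWithLoss(
    in_features=hidden_dim,
    n_classes=vocab_size,
    cutoffs=[2_000, 20_000],
)

hidden = torch.randn(32, hidden_dim)           # batch of RNN hidden states
targets = torch.randint(0, vocab_size, (32,))  # gold next-word ids
out = criterion(hidden, targets)
loss = out.loss  # scalar negative log-likelihood, ready for backward()
```

This assumes the word ids are sorted by frequency (most frequent first), which is what makes the head/tail split effective.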
