Implemented a Bidirectional Long Short-Term Memory (BiLSTM) network to build a future-word-prediction model. The project involved training the model on large textual datasets and tuning hyperparameters to optimize its accuracy.


Intro to Natural Language Processing

Directories:

  • All the datasets are in the ./dataSet folder
  • All the perplexity results are in the ./results folder
  • The model checkpoints are saved in ./SavedModel
  • All the code is in ./sourceCode
  • The split datasets are in ./sourceCode/split

Tokenization

For tokenization and cleaning, the following normalization steps are applied:

  • Lower-casing
  • Hashtags
  • Mentions
  • Numbers
  • Currency amounts
  • URLs
  • Percentages
  • Punctuation
  • Removing extra spaces
  • Removal of '\n'

These steps are clearly marked in the code with comments; a sketch of such a pipeline is shown below.
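As a rough illustration, here is a minimal sketch of how such a cleaning pipeline could look. The placeholder tokens (<URL>, <HASHTAG>, etc.), the regexes, and the clean_text function are assumptions for illustration, not the repository's exact implementation in ./sourceCode:

```python
import re

# Illustrative placeholder-based cleaning; the exact tokens, patterns,
# and ordering used in sourceCode may differ.
def clean_text(text: str) -> str:
    text = text.lower()                                              # lower case
    text = re.sub(r"https?://\S+|www\.\S+", "<URL>", text)           # URLs
    text = re.sub(r"#\w+", "<HASHTAG>", text)                        # hashtags
    text = re.sub(r"@\w+", "<MENTION>", text)                        # mentions
    text = re.sub(r"[$€£₹]\s?\d+(?:[.,]\d+)?", "<CURRENCY>", text)   # currency
    text = re.sub(r"\d+(?:\.\d+)?%", "<PERCENTAGE>", text)           # percentages
    text = re.sub(r"\d+", "<NUM>", text)                             # bare numbers
    text = re.sub(r"[^\w<>\s]", " ", text)                           # punctuation
    text = text.replace("\n", " ")                                   # newlines
    return re.sub(r"\s+", " ", text).strip()                         # extra spaces

print(clean_text("Check https://example.com @user won $50 (25%) #lucky\n"))
# -> check <URL> <MENTION> won <CURRENCY> <PERCENTAGE> <HASHTAG>
```

Note that the order of substitutions matters: currency and percentage patterns must run before the bare-number rule, or their digits would be consumed first.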

Input Format:

  • language_model.py, e.g.:
    $ python3 language_model.py k ./corpus.txt

  • neural_language_model.py, e.g.:
    $ python3 neural_language_model.py ../models/trained_nlm.pth

A prompt will then ask for a sentence as input. After you enter it, the script prints the perplexity of that sentence.

NOTE: neural_language_model.py only accepts sentences of length less than 34.

Also, neural_language_model.py works on Kaggle, but I haven't tested it outside Kaggle, as I was unable to install PyTorch correctly on my Mac M1.
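For reference, a minimal sketch of how sentence perplexity is computed from per-token probabilities. The probability values in the example are made up; in the repository these probabilities come from the trained n-gram or neural model:

```python
import math

def perplexity(token_probs):
    """Perplexity = exp of the average negative log-probability.

    token_probs holds p(w_i | context) for each token of the sentence,
    as assigned by the language model.
    """
    n = len(token_probs)
    avg_neg_log_prob = -sum(math.log(p) for p in token_probs) / n
    return math.exp(avg_neg_log_prob)

# Example with made-up probabilities for a 4-token sentence:
print(perplexity([0.1, 0.25, 0.05, 0.2]))  # ~7.95
```

Lower perplexity means the model found the sentence more predictable; a model that assigned every token probability 1 would score exactly 1.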

SavedModel

  • FirstSave.pth: checkpoint trained on the Pride-and-Prejudice-Jane-Austen corpus
  • FirstSave2.pth: checkpoint trained on the Ulysses-James-Joyce corpus
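A minimal sketch of loading one of these checkpoints, assuming the .pth file stores a state_dict. The BiLSTM class below is a toy stand-in; the real architecture and hyperparameters live in ./sourceCode and must match the checkpoint:

```python
import torch
import torch.nn as nn

# Toy stand-in for the repository's model; layer sizes are assumptions.
class BiLSTM(nn.Module):
    def __init__(self, vocab_size=10000, emb_dim=128, hidden_dim=256):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)
        self.lstm = nn.LSTM(emb_dim, hidden_dim, batch_first=True,
                            bidirectional=True)
        self.out = nn.Linear(2 * hidden_dim, vocab_size)

    def forward(self, x):
        h, _ = self.lstm(self.emb(x))
        return self.out(h)

model = BiLSTM()
# map_location lets a GPU-trained checkpoint load on a CPU-only machine.
state = torch.load("./SavedModel/FirstSave.pth", map_location="cpu")
model.load_state_dict(state)
model.eval()  # disable dropout etc. before computing perplexity
```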
