NLP - Serverless deployment of word embeddings and retrieval of most similar words using KMeans, AWS Aurora Serverless and AWS Lambda
Refer to this Medium Post for a detailed explanation of this repository.
This repository contains all the code needed to load a word embeddings file (.vec) into a database and run a most_similar function in a serverless manner, retrieving the most similar words for a given query word.
The file sample_word2vec.vec contains the word embeddings used as an example to demonstrate the functionality. Its format matches that of real word2vec files loaded by gensim.
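For reference, a file in this format can be read with gensim's KeyedVectors. This is only a minimal illustrative sketch; the file name sample_word2vec.vec comes from this repository, and the word "cat" is just an example query.

```python
from gensim.models import KeyedVectors

# The .vec file is a plain-text word2vec file, so binary=False.
# Its first line holds the vocabulary size and the vector dimensionality.
vectors = KeyedVectors.load_word2vec_format("sample_word2vec.vec", binary=False)

print(vectors["cat"][:5])           # first 5 dimensions of the "cat" vector
print(vectors.most_similar("cat"))  # gensim's in-memory most_similar, for comparison
```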
The file create_clusters_insert_in_database.py contains all the code to -
- Load the word2vec file
- Create clusters using scikit-learn's MiniBatchKMeans.
- Merge clusters containing only a few elements into the nearest larger cluster.
- Upload the word, word embedding and cluster_id to an Amazon Aurora Serverless database (a condensed sketch of these steps appears below).
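The sketch below condenses those steps, under some stated assumptions: the table name words, its columns (word, embedding stored as a comma-separated string, cluster_id), the cluster count, the minimum cluster size and the connection placeholders are all illustrative, and the actual script in the repository may differ in these details.

```python
import numpy as np
import pymysql
from sklearn.cluster import MiniBatchKMeans

# 1. Load the word2vec file (plain-text .vec format: header line, then "word v1 v2 ...").
words, vecs = [], []
with open("sample_word2vec.vec") as f:
    next(f)  # skip the "<vocab_size> <dimensions>" header
    for line in f:
        parts = line.rstrip().split(" ")
        words.append(parts[0])
        vecs.append([float(x) for x in parts[1:]])
embeddings = np.array(vecs)

# 2. Cluster the embeddings with MiniBatchKMeans.
kmeans = MiniBatchKMeans(n_clusters=50, batch_size=1000, random_state=0)
cluster_ids = kmeans.fit_predict(embeddings)

# 3. Merge clusters with fewer than MIN_SIZE members into the nearest larger cluster
#    (nearest measured by distance between cluster centroids).
MIN_SIZE = 5
counts = np.bincount(cluster_ids, minlength=kmeans.n_clusters)
small = np.where(counts < MIN_SIZE)[0]
large = np.where(counts >= MIN_SIZE)[0]
if large.size:
    for c in small:
        dists = np.linalg.norm(kmeans.cluster_centers_[large] - kmeans.cluster_centers_[c], axis=1)
        cluster_ids[cluster_ids == c] = large[np.argmin(dists)]

# 4. Upload word, embedding and cluster_id to the Aurora Serverless database.
conn = pymysql.connect(host="<aurora-endpoint>", user="<user>",
                       password="<password>", db="<database>")
with conn.cursor() as cur:
    for word, vec, cid in zip(words, embeddings, cluster_ids):
        cur.execute(
            "INSERT INTO words (word, embedding, cluster_id) VALUES (%s, %s, %s)",
            (word, ",".join(map(str, vec)), int(cid)),
        )
conn.commit()
conn.close()
```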
The file lambda_getsimilarwords_deployment_package/aws_lambda_deployment_getsimilarwords.py contains the code that is deployed on AWS Lambda. The Lambda function can be called with an input of the form: { "word": "cat" }
The returned output has the following format: { "similar": ["kitten", "dog", ...] }
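A minimal sketch of what such a handler can look like is shown below. It assumes the words table layout from the sketch above and database credentials passed through environment variables; the actual handler in aws_lambda_deployment_getsimilarwords.py may differ in these details.

```python
import os

import numpy as np
import pymysql


def lambda_handler(event, context):
    query_word = event["word"]

    conn = pymysql.connect(host=os.environ["DB_HOST"], user=os.environ["DB_USER"],
                           password=os.environ["DB_PASSWORD"], db=os.environ["DB_NAME"])
    with conn.cursor() as cur:
        # Look up the query word's embedding and cluster id.
        cur.execute("SELECT embedding, cluster_id FROM words WHERE word = %s", (query_word,))
        row = cur.fetchone()
        if row is None:
            return {"similar": []}
        query_vec = np.array([float(x) for x in row[0].split(",")])
        cluster_id = row[1]

        # Fetch every other word in the same cluster.
        cur.execute("SELECT word, embedding FROM words WHERE cluster_id = %s AND word != %s",
                    (cluster_id, query_word))
        rows = cur.fetchall()
    conn.close()

    # Rank cluster members by cosine similarity to the query vector.
    def cosine(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

    scored = [(w, cosine(query_vec, np.array([float(x) for x in e.split(",")]))) for w, e in rows]
    scored.sort(key=lambda x: x[1], reverse=True)

    return {"similar": [w for w, _ in scored[:10]]}
```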
Note: The Lambda function requires pymysql to connect to the database and numpy to run cosine similarity within the same cluster and find the word vectors closest to a given word vector. pymysql is already added to the Lambda deployment package, while numpy is obtained by attaching the predefined Lambda layer for Sklearn and Python 3.6 that AWS already provides.
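Once deployed, the function can be invoked programmatically, for example with boto3. The function name getsimilarwords below is an assumption; substitute the name you gave the deployed function.

```python
import json

import boto3

client = boto3.client("lambda")

response = client.invoke(
    FunctionName="getsimilarwords",            # assumed name of the deployed function
    Payload=json.dumps({"word": "cat"}),
)
print(json.loads(response["Payload"].read()))  # e.g. {"similar": ["kitten", "dog", ...]}
```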