
notebooks

Hamel Husain

Merge pull request #36 from ZhaoYi1031/zhaoyi1031-patch

Mar 25, 2020

9455a7d · Mar 25, 2020

Name	Last commit message	Last commit date
diagram	Merge pull request #7 from hamelsmu/edit-part-1	May 23, 2018
1 - Preprocess Data.ipynb	fix folder path not exist bug	Jan 23, 2019
2 - Train Function Summarizer With Keras + TF.ipynb	fix typo in processor keyword argument.	Jul 10, 2019
3 - Train Language Model Using FastAI.ipynb	added recommendation for tensorboard in nb3	May 28, 2018
4 - Train Model To Map Code Embeddings to Language Embeddings.ipynb	proofread notebooks	May 27, 2018
5 - Build Search Index.ipynb	updated nb 5	May 27, 2018
README.md	renamed	May 28, 2018
fastai	fix symlink	May 15, 2018
general_utils.py	updated general utils	May 25, 2018
lang_model_utils.py	Merge branch 'master' into edit-part-1	May 26, 2018
seq2seq_utils.py	removed whitespace	May 17, 2018

README.md

Table of Contents

Each step in the above diagram corresponds to a Jupyter notebook in this repo. Below is a high-level description of each step:

1 - Preprocess Data: Get Python files from BigQuery, then use Python's ast module to clean the code and extract docstrings.

2 - Train Function Summarizer: Build a sequence-to-sequence model that predicts a docstring given a Python function or method. The primary purpose of this model is transfer learning: later steps reuse it to extract features from code.

3 - Train Language Model: Build a language model using fastai on a corpus of docstrings. We will use this model for transfer learning to encode short phrases or sentences, such as docstrings and search queries.

4 - Train Code2Emb Model: Fine-tune the model from step 2 to predict vectors instead of docstrings. This model will be used to represent code in the same vector space as the sentence embeddings produced in step 3.

5 - Build Search Engine: Use the assets you created in steps 3 and 4 to create a semantic search tool.
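
To make step 1 more concrete, here is a minimal sketch of pulling (function name, docstring) pairs out of Python source with the standard-library ast module. It is not the notebook's actual code: the function name `extract_function_docstring_pairs` and the `SAMPLE` string are hypothetical, and the notebook's own cleaning steps may differ.

```python
import ast

# Hypothetical sample input; the real pipeline reads Python files from BigQuery.
SAMPLE = '''
def add(a, b):
    """Return the sum of a and b."""
    return a + b

def undocumented(x):
    return x
'''


def extract_function_docstring_pairs(source):
    """Return a list of (function_name, docstring) for documented functions."""
    pairs = []
    tree = ast.parse(source)
    for node in ast.walk(tree):
        # Consider both regular and async function definitions.
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            docstring = ast.get_docstring(node)  # None if there is no docstring
            if docstring:
                pairs.append((node.name, docstring))
    return pairs


pairs = extract_function_docstring_pairs(SAMPLE)
print(pairs)  # [('add', 'Return the sum of a and b.')]
```

Functions without a docstring are skipped here because they provide no target text for a summarizer; whether the notebook makes the same choice is an assumption.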
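
The idea behind step 5 can be illustrated with a brute-force nearest-neighbor lookup: once code and queries live in the same vector space, search reduces to ranking code vectors by similarity to the query vector. The sketch below uses cosine similarity over made-up three-dimensional vectors. The names `CODE_INDEX`, `cosine_similarity`, and `search` are hypothetical, the vectors are not model outputs, and a real index over many functions would likely rely on an approximate nearest-neighbor library instead.

```python
import math

# Hypothetical toy index: function name -> embedding (values are made up).
CODE_INDEX = {
    "read_csv_file": [0.9, 0.1, 0.0],
    "plot_histogram": [0.0, 0.2, 0.9],
    "parse_json": [0.7, 0.6, 0.1],
}


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # treat a zero vector as matching nothing
    return dot / (norm_u * norm_v)


def search(query_vec, code_index, k=2):
    """Return the names of the k entries most similar to query_vec."""
    scored = [
        (cosine_similarity(query_vec, vec), name)
        for name, vec in code_index.items()
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [name for _, name in scored[:k]]


# Stand-in for the embedding of a search query such as "load a csv".
query = [1.0, 0.2, 0.0]
print(search(query, CODE_INDEX))  # ['read_csv_file', 'parse_json']
```

In the full pipeline, the query vector would come from the language model of step 3 and the indexed vectors from the code model of step 4.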