
Table of Contents

Alt Text

Each step in the above diagram corresponds to a Jupyter notebook in this repo. Below is a high-level description of each step:

1 - Preprocess Data: Describes how to get Python files from BigQuery and use the `ast` module to clean the code and extract docstrings.

2 - Train Function Summarizer: Build a sequence-to-sequence model to predict a docstring given a Python function or method. The primary purpose of this model is transfer learning: later steps reuse it to extract features from code.

3 - Train Language Model: Build a language model using fastai on a corpus of docstrings. We will use this model for transfer learning to encode short phrases or sentences, such as docstrings and search queries.

4 - Train Code2Emb Model: Fine-tune the model from step 2 to predict vectors instead of docstrings. This model will be used to represent code in the same vector space as the sentence embeddings produced in step 3.

5 - Build Search Engine: Use the assets you created in steps 3 and 4 to create a semantic search tool.
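
To make step 1 more concrete, here is a minimal sketch of how Python's standard `ast` module can pair each function with its docstring. This is not the code from the notebook: the function name `extract_function_docstring_pairs` and the sample source are hypothetical, and the sketch assumes Python 3.9 or later (for `ast.unparse`).

```python
import ast


def extract_function_docstring_pairs(source):
    """Return (name, docstring, code_without_docstring) for each documented function.

    Hypothetical helper for illustration only; the notebook may do this differently.
    """
    pairs = []
    tree = ast.parse(source)
    for node in ast.walk(tree):
        if not isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            continue
        docstring = ast.get_docstring(node)
        if not docstring:
            # Skip functions without a docstring: they give no (code, text) pair.
            continue
        # The docstring is the first statement in the body; drop it so the
        # remaining code does not leak the target text.
        node.body = node.body[1:] or [ast.Pass()]
        # ast.unparse regenerates source from the tree, which also discards
        # comments and normalizes formatting.
        pairs.append((node.name, docstring, ast.unparse(node)))
    return pairs


# Hypothetical sample input: one documented and one undocumented function.
sample = '''
def add(a, b):
    """Return the sum of a and b."""
    return a + b  # trailing comment is discarded

def undocumented(x):
    return x
'''

pairs = extract_function_docstring_pairs(sample)
for name, docstring, code in pairs:
    print(name, "|", docstring)
    print(code)
```

Pairs like these are the kind of (code, docstring) training data that the model in step 2 consumes; undocumented functions are simply skipped in this sketch.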