# Chapter 3 - GitHub Embeddings

In this notebook we go beyond the pre-trained embeddings and models we download from the internet and start creating our own secondary models, which can improve the primary model through transfer learning. We will train text and code embeddings on GitHub's [CodeSearchNet](https://github.com/rjurney/CodeSearchNet) datasets, which pair docstrings with code for about 2 million functions. The CodeSearchNet authors use the data to map text search queries to code; we will instead use it to create separate BERT embeddings for text and for code to drive our Stack Overflow tagger.

The paper for CodeSearchNet is on arXiv at [CodeSearchNet Challenge: Evaluating the State of Semantic Code Search](https://arxiv.org/abs/1909.09436).

In [12]:
import gc
from pathlib import Path

import pandas as pd

In [15]:
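# Load all Gzipped JSON Lines files in the data directory
frames = []
for filename in Path('../data/CodeSearchNet').glob('**/*.jsonl.gz'):
    frames.append(pd.read_json(filename, lines=True))

# Concatenate once: calling pd.concat inside the loop re-copies every row on each pass.
# ignore_index=True gives the combined frame a single unique 0..n-1 index.
df = pd.concat(frames, ignore_index=True)

# Release the per-file frames now that their rows have been copied into df
del frames
gc.collect()

df.head()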

| | code | code_tokens | docstring | docstring_tokens | func_name | language | original_string | partition | path | repo | sha | url |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | protected final void fastPathOrderedEmit(U val... | [protected, final, void, fastPathOrderedEmit, ... | Makes sure the fast-path emits in order.\n@par... | [Makes, sure, the, fast, -, path, emits, in, o... | QueueDrainObserver.fastPathOrderedEmit | java | protected final void fastPathOrderedEmit(U val... | test | src/main/java/io/reactivex/internal/observers/... | ReactiveX/RxJava | ac84182aa2bd866b53e01c8e3fe99683b882c60e | https://github.com/ReactiveX/RxJava/blob/ac841... |
| 1 | @CheckReturnValue\n @NonNull\n @Schedule... | [@, CheckReturnValue, @, NonNull, @, Scheduler... | Mirrors the one ObservableSource in an Iterabl... | [Mirrors, the, one, ObservableSource, in, an, ... | Observable.amb | java | @CheckReturnValue\n @NonNull\n @Schedule... | test | src/main/java/io/reactivex/Observable.java | ReactiveX/RxJava | ac84182aa2bd866b53e01c8e3fe99683b882c60e | https://github.com/ReactiveX/RxJava/blob/ac841... |
| 2 | @SuppressWarnings("unchecked")\n @CheckRetu... | [@, SuppressWarnings, (, "unchecked", ), @, Ch... | Mirrors the one ObservableSource in an array o... | [Mirrors, the, one, ObservableSource, in, an, ... | Observable.ambArray | java | @SuppressWarnings("unchecked")\n @CheckRetu... | test | src/main/java/io/reactivex/Observable.java | ReactiveX/RxJava | ac84182aa2bd866b53e01c8e3fe99683b882c60e | https://github.com/ReactiveX/RxJava/blob/ac841... |
| 3 | @SuppressWarnings({ "unchecked", "rawtypes" })... | [@, SuppressWarnings, (, {, "unchecked", ,, "r... | Concatenates elements of each ObservableSource... | [Concatenates, elements, of, each, ObservableS... | Observable.concat | java | @SuppressWarnings({ "unchecked", "rawtypes" })... | test | src/main/java/io/reactivex/Observable.java | ReactiveX/RxJava | ac84182aa2bd866b53e01c8e3fe99683b882c60e | https://github.com/ReactiveX/RxJava/blob/ac841... |
| 4 | @SuppressWarnings({ "unchecked", "rawtypes" })... | [@, SuppressWarnings, (, {, "unchecked", ,, "r... | Returns an Observable that emits the items emi... | [Returns, an, Observable, that, emits, the, it... | Observable.concat | java | @SuppressWarnings({ "unchecked", "rawtypes" })... | test | src/main/java/io/reactivex/Observable.java | ReactiveX/RxJava | ac84182aa2bd866b53e01c8e3fe99683b882c60e | https://github.com/ReactiveX/RxJava/blob/ac841... |
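
The `language` and `partition` columns in the preview make it easy to check how the rows are spread across languages and splits before training anything. The following is a minimal sketch of that check, run on a tiny stand-in frame rather than the full dataset. The helper name `count_by_language_and_partition` and the rows in `sample_df` are hypothetical, introduced only for illustration.

```python
import pandas as pd


def count_by_language_and_partition(frame: pd.DataFrame) -> pd.DataFrame:
    """Return a language x partition table of row counts."""
    return (
        frame.groupby(["language", "partition"])
        .size()                 # number of rows in each (language, partition) group
        .unstack(fill_value=0)  # partitions become columns; missing combinations count as 0
    )


# Hypothetical stand-in rows; on the real data, pass `df` instead
sample_df = pd.DataFrame({
    "language": ["java", "java", "python", "python", "go"],
    "partition": ["test", "train", "train", "train", "valid"],
})

counts = count_by_language_and_partition(sample_df)
print(counts)
```

Running the same helper on the full `df` would show whether any language dominates the corpus, which is worth knowing before deciding how to sample training data.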


In [19]:
# Count the rows loaded across all files
len(df)

2070536
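
Because the goal is to train separate text and code embeddings, the `docstring` and `code` columns eventually need to be pulled apart into two corpora. One way this might look is sketched below. It assumes that rows with a missing or blank docstring or code body should be dropped, which is a choice made here for illustration rather than something the dataset requires. `split_corpora` and `sample_df` are hypothetical names.

```python
from typing import List, Tuple

import pandas as pd


def split_corpora(frame: pd.DataFrame) -> Tuple[List[str], List[str]]:
    """Return (docstrings, code) lists that stay aligned row by row."""
    pairs = frame[["docstring", "code"]].dropna()
    # Assumption: a pair is only useful if both sides contain non-whitespace text
    non_empty = (pairs["docstring"].str.strip() != "") & (pairs["code"].str.strip() != "")
    pairs = pairs[non_empty]
    return pairs["docstring"].tolist(), pairs["code"].tolist()


# Hypothetical stand-in rows; on the real data, pass `df` instead
sample_df = pd.DataFrame({
    "docstring": ["Adds two numbers.", "   ", None],
    "code": ["def add(a, b):\n    return a + b", "def noop(): pass", "x = 1"],
})

docstrings, code = split_corpora(sample_df)
print(len(docstrings), len(code))
```

Keeping the two lists aligned means the docstring at position `i` still describes the code at position `i`, so the pairing remains available even though each corpus feeds its own model.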