# Data Exploration

This notebook explores the pre-processed data, and shows some basic statistics that may be useful.  

In [13]:
import json
import pandas as pd
from pathlib import Path
pd.set_option('max_colwidth',300)
from pprint import pprint

## Part 1: Preview The Dataset
    
Before downloading the entire dataset, it may be useful to explore a small sample in order to understand the format and structure of the data.  While the full dataset can be automatically downloaded with the `/script/setup` script located in this repo, we can alternatively download a subset of the data from S3.  

The S3 links follow this pattern:

> https://s3.amazonaws.com/code-search-net/CodeSearchNet/v2/{python,java,go,php,ruby,javascript}.zip

For example, the link for `python` is:

> https://s3.amazonaws.com/code-search-net/CodeSearchNet/v2/python.zip
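As a small illustration, the pattern above can be expanded into one URL per language. This is only a sketch: the names `BASE_URL`, `LANGUAGES`, and `dataset_url` are made up for this example and are not part of the repository.

```python
# Hypothetical helper names; only the URL pattern itself comes from the text above.
BASE_URL = "https://s3.amazonaws.com/code-search-net/CodeSearchNet/v2"
LANGUAGES = ["python", "java", "go", "php", "ruby", "javascript"]


def dataset_url(language):
    """Return the archive URL for one of the six supported languages."""
    if language not in LANGUAGES:
        raise ValueError(f"unknown language: {language!r}")
    return f"{BASE_URL}/{language}.zip"


# One URL per language, in the order listed above.
all_urls = [dataset_url(lang) for lang in LANGUAGES]
```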

First we download and decompress this dataset:

In [14]:
#!wget https://s3.amazonaws.com/code-search-net/CodeSearchNet/v2/python.zip

In [15]:
# `unzip` does not work on Windows-based Anaconda systems
#!unzip java.zip
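Where the `unzip` shell command is unavailable, the standard-library `zipfile` module may serve as a portable alternative. The sketch below assumes the archive has already been downloaded. `extract_zip` is a hypothetical helper name, and the demo builds a tiny throwaway archive rather than touching the real dataset.

```python
import tempfile
import zipfile
from pathlib import Path


def extract_zip(zip_path, target_dir):
    """Extract every member of zip_path into target_dir; return the member names."""
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(target)
        return archive.namelist()


# Demo on a throwaway archive (the file names here are illustrative only).
with tempfile.TemporaryDirectory() as tmp:
    archive_path = Path(tmp) / "sample.zip"
    with zipfile.ZipFile(archive_path, "w") as archive:
        archive.writestr("java/final/jsonl/test/sample.txt", "hello")
    extracted = extract_zip(archive_path, Path(tmp) / "out")
    extracted_text = (Path(tmp) / "out" / extracted[0]).read_text()
```

For the real data, a call such as `extract_zip("java.zip", ".")` would play the role of the commented-out `unzip` line.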

Finally, we can inspect `python/final/jsonl/test/python_test_0.jsonl.gz` to see its contents:

In [16]:
# `gzip` is likewise unavailable by default on Windows
# Decompress this gzip file
#!gzip -d ../resources/data/java/final/jsonl/test/python_test_0.jsonl.gz
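The same decompression can be done without the `gzip` command line tool by using Python's `gzip` and `shutil` modules. This is a minimal sketch. `gunzip` is a hypothetical helper, and unlike `gzip -d` it leaves the original `.gz` file in place.

```python
import gzip
import shutil
import tempfile
from pathlib import Path


def gunzip(gz_path):
    """Decompress name.ext.gz next to itself as name.ext and return the new path."""
    gz_path = Path(gz_path)
    out_path = gz_path.with_suffix("")  # drops only the trailing ".gz"
    with gzip.open(gz_path, "rb") as src, open(out_path, "wb") as dst:
        shutil.copyfileobj(src, dst)  # streams in chunks, so large files are fine
    return out_path


# Demo on a throwaway file (the content is illustrative only).
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "sample_0.jsonl.gz"
    with gzip.open(sample, "wt", encoding="utf-8") as f:
        f.write('{"language": "java"}\n')
    restored = gunzip(sample)
    restored_name = restored.name
    restored_text = restored.read_text(encoding="utf-8")
```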

Read in the file and display the first row.  The data is stored in [JSON Lines](http://jsonlines.org/) format.

In [17]:
with open('../resources/data/java_codesearchnet/java/final/jsonl/test/java_test_0.jsonl', 'r') as f:
    sample_file = f.readlines()
sample_file[0]

'{"repo": "ReactiveX/RxJava", "path": "src/main/java/io/reactivex/internal/observers/QueueDrainObserver.java", "func_name": "QueueDrainObserver.fastPathOrderedEmit", "original_string": "protected final void fastPathOrderedEmit(U value, boolean delayError, Disposable disposable) {\\n        final Observer<? super V> observer = downstream;\\n        final SimplePlainQueue<U> q = queue;\\n\\n        if (wip.get() == 0 && wip.compareAndSet(0, 1)) {\\n            if (q.isEmpty()) {\\n                accept(observer, value);\\n                if (leave(-1) == 0) {\\n                    return;\\n                }\\n            } else {\\n                q.offer(value);\\n            }\\n        } else {\\n            q.offer(value);\\n            if (!enter()) {\\n                return;\\n            }\\n        }\\n        QueueDrainHelper.drainLoop(q, observer, delayError, disposable, this);\\n    }", "language": "java", "code": "protected final void fastPathOrderedEmit(U value, boolean d

Because each line in the file is valid JSON, we can parse the first row and display it in a more human-readable form:

In [18]:
pprint(json.loads(sample_file[0]))
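Since `readlines()` loads the whole file into memory, it can be convenient to read only the first few records instead. The sketch below is one possible approach, not something the repository provides. `iter_jsonl` and `head` are hypothetical helpers that work on either a plain `.jsonl` file or a `.jsonl.gz` file.

```python
import gzip
import json
import tempfile
from itertools import islice
from pathlib import Path


def iter_jsonl(path):
    """Yield one dict per non-empty line of a .jsonl or .jsonl.gz file."""
    path = Path(path)
    opener = gzip.open if path.suffix == ".gz" else open
    with opener(path, "rt", encoding="utf-8") as f:
        for line in f:
            if line.strip():
                yield json.loads(line)


def head(path, n=1):
    """Return the first n records without reading the rest of the file."""
    records = iter_jsonl(path)
    try:
        return list(islice(records, n))
    finally:
        records.close()  # closes the underlying file promptly


# Demo on a throwaway file with two illustrative records.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "sample_0.jsonl.gz"
    with gzip.open(sample, "wt", encoding="utf-8") as f:
        f.write(json.dumps({"repo": "a/b", "language": "java"}) + "\n")
        f.write(json.dumps({"repo": "c/d", "language": "java"}) + "\n")
    first_rows = head(sample, n=1)
```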

{'code': 'protected final void fastPathOrderedEmit(U value, boolean '
         'delayError, Disposable disposable) {\n'
         '        final Observer<? super V> observer = downstream;\n'
         '        final SimplePlainQueue<U> q = queue;\n'
         '\n'
         '        if (wip.get() == 0 && wip.compareAndSet(0, 1)) {\n'
         '            if (q.isEmpty()) {\n'
         '                accept(observer, value);\n'
         '                if (leave(-1) == 0) {\n'
         '                    return;\n'
         '                }\n'
         '            } else {\n'
         '                q.offer(value);\n'
         '            }\n'
         '        } else {\n'
         '            q.offer(value);\n'
         '            if (!enter()) {\n'
         '                return;\n'
         '            }\n'
         '        }\n'
         '        QueueDrainHelper.drainLoop(q, observer, delayError, '
         'disposable, this);\n'
         '    }',
 'code_tokens': ['pr

Definitions of each of the above fields are located in the README.md file in the root of this repository.

## Part 2: Exploring The Full Dataset

You will need to complete the setup steps in the README.md file located in the root of this repository before proceeding.

The training data is located in `/resources/data`, which contains approximately 2 million (code, comment) pairs across the train, validation, and test partitions.  You can learn more about the directory structure and associated files by viewing `/resources/README.md`.

The preprocessed data are stored in [JSON Lines](http://jsonlines.org/) format.  First, we can get a list of all these files for further inspection:

In [19]:
java_train_files = sorted(Path('../resources/data/java_codesearchnet/java/final/jsonl/train').glob('**/*.gz'))
java_test_files =  sorted(Path('../resources/data/java_codesearchnet/java/final/jsonl/test').glob('**/*.gz'))
java_valid_files = sorted(Path('../resources/data/java_codesearchnet/java/final/jsonl/valid').glob('**/*.gz'))
all_files = java_train_files + java_test_files + java_valid_files

# To match all files, in all directories (from the base directory and deeper)

# **/*.nupkg
# Will match

# sample.nupkg
# sample-2.nupkg
# tmp/sample.nupkg
# tmp/other.nupkg
# other/new/sample.nupkg

# python_files = sorted(Path('../resources/data/python/').glob('**/*.gz'))
# java_files = sorted(Path('../resources/data/java/').glob('**/*.gz'))
# go_files = sorted(Path('../resources/data/go/').glob('**/*.gz'))
# php_files = sorted(Path('../resources/data/php/').glob('**/*.gz'))
# javascript_files = sorted(Path('../resources/data/javascript/').glob('**/*.gz'))
# ruby_files = sorted(Path('../resources/data/ruby/').glob('**/*.gz'))
# all_files = python_files + go_files + java_files + php_files + javascript_files + ruby_files
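To make the behaviour of the recursive `**/*.gz` pattern concrete, the following sketch builds a small temporary directory tree and compares it with the non-recursive `*.gz` pattern. The file names are invented for the demonstration.

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    base = Path(tmp)
    # Create a few empty files at different depths (illustrative names only).
    for rel in ["a.jsonl.gz", "train/b.jsonl.gz", "train/deep/c.jsonl.gz", "train/notes.txt"]:
        p = base / rel
        p.parent.mkdir(parents=True, exist_ok=True)
        p.touch()

    # "**/" matches the base directory itself and every directory below it.
    recursive = sorted(p.relative_to(base).as_posix() for p in base.glob("**/*.gz"))
    # Without "**/", only files directly inside the base directory match.
    top_level = sorted(p.relative_to(base).as_posix() for p in base.glob("*.gz"))
```

Here `recursive` contains all three `.gz` files, while `top_level` contains only `a.jsonl.gz`.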

In [20]:
# 18 Java files in total: 16 train files plus 1 test file and 1 validation file
print(f'Total number of files: {len(all_files):,}')
print(len(java_test_files))   # number of test files
print(len(java_valid_files))  # number of validation files

Total number of files: 18
1
1


To make analysis of this dataset easier, we can load all of the data into a pandas dataframe: 

In [21]:
columns_long_list = ['repo', 'path', 'url', 'code', 
                     'code_tokens', 'docstring', 'docstring_tokens', 
                     'language', 'partition']

columns_short_list = ['code_tokens', 'docstring_tokens', 
                      'language', 'partition']

def jsonl_list_to_dataframe(file_list, columns=columns_long_list):
    """Load a list of jsonl.gz files into a pandas DataFrame."""
    return pd.concat([pd.read_json(f, 
                                   orient='records', 
                                   compression='gzip',
                                   lines=True)[columns] 
                      for f in file_list], sort=False)
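One detail worth knowing: `pd.concat` keeps each file's own row labels by default, so when several files are combined the index values repeat. The sketch below is a possible variant, not the notebook's function. It passes `ignore_index=True` to get a single continuous index, and the helper name `load_jsonl_gz` and the tiny demo files are assumptions made for illustration.

```python
import gzip
import json
import tempfile
from pathlib import Path

import pandas as pd


def load_jsonl_gz(file_list, columns):
    """Load several .jsonl.gz files into one DataFrame with a continuous index."""
    frames = [
        pd.read_json(f, orient="records", compression="gzip", lines=True)[columns]
        for f in file_list
    ]
    return pd.concat(frames, sort=False, ignore_index=True)


# Demo: two throwaway files with two illustrative records each.
with tempfile.TemporaryDirectory() as tmp:
    paths = []
    for i in range(2):
        path = Path(tmp) / f"sample_{i}.jsonl.gz"
        with gzip.open(path, "wt", encoding="utf-8") as f:
            for j in range(2):
                record = {"code_tokens": ["a", "b"], "language": "java", "extra": j}
                f.write(json.dumps(record) + "\n")
        paths.append(path)
    demo_df = load_jsonl_gz(paths, ["code_tokens", "language"])
```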

This is what the Java dataset looks like:

In [23]:
#jvdf_train = jsonl_list_to_dataframe(java_train_files)
jvdf_valid = jsonl_list_to_dataframe(java_valid_files)
jvdf_test = jsonl_list_to_dataframe(java_test_files)
#pydf = jsonl_list_to_dataframe(python_files)
print(jvdf_test.shape)
print(jvdf_valid.shape)

(26909, 9)
(15328, 9)


In [25]:
jvdf_valid.head(3)
#pydf.head(3)

index,repo,path,url,code,code_tokens,docstring,docstring_tokens,language,partition
0,google/guava,guava/src/com/google/common/escape/ArrayBasedUnicodeEscaper.java,https://github.com/google/guava/blob/7155d12b70a2406fa84d94d4b8b3bc108e89abfd/guava/src/com/google/common/escape/ArrayBasedUnicodeEscaper.java#L142-L154,@Override\n public final String escape(String s) {\n checkNotNull(s); // GWT specific check (do not optimize)\n for (int i = 0; i < s.length(); i++) {\n char c = s.charAt(i);\n if ((c < replacementsLength && replacements[c] != null)\n || c > safeMaxChar\n || c ...,"[@, Override, public, final, String, escape, (, String, s, ), {, checkNotNull, (, s, ), ;, // GWT specific check (do not optimize), for, (, int, i, =, 0, ;, i, <, s, ., length, (, ), ;, i, ++, ), {, char, c, =, s, ., charAt, (, i, ), ;, if, (, (, c, <, replacementsLength, &&, replacements, [, c,...",/*\nThis is overridden to improve performance. Rough benchmarking shows that this almost doubles\nthe speed when processing strings that do not require any escaping.,"[/, *, This, is, overridden, to, improve, performance, ., Rough, benchmarking, shows, that, this, almost, doubles, the, speed, when, processing, strings, that, do, not, require, any, escaping, .]",java,valid
1,google/guava,guava/src/com/google/common/escape/ArrayBasedUnicodeEscaper.java,https://github.com/google/guava/blob/7155d12b70a2406fa84d94d4b8b3bc108e89abfd/guava/src/com/google/common/escape/ArrayBasedUnicodeEscaper.java#L161-L173,@Override\n protected final char[] escape(int cp) {\n if (cp < replacementsLength) {\n char[] chars = replacements[cp];\n if (chars != null) {\n return chars;\n }\n }\n if (cp >= safeMin && cp <= safeMax) {\n return null;\n }\n return escapeUnsafe(cp);\...,"[@, Override, protected, final, char, [, ], escape, (, int, cp, ), {, if, (, cp, <, replacementsLength, ), {, char, [, ], chars, =, replacements, [, cp, ], ;, if, (, chars, !=, null, ), {, return, chars, ;, }, }, if, (, cp, >=, safeMin, &&, cp, <=, safeMax, ), {, return, null, ;, }, return, esca...",Escapes a single Unicode code point using the replacement array and safe range values. If the\ngiven character does not have an explicit replacement and lies outside the safe range then\n{@link #escapeUnsafe} is called.,"[Escapes, a, single, Unicode, code, point, using, the, replacement, array, and, safe, range, values, ., If, the, given, character, does, not, have, an, explicit, replacement, and, lies, outside, the, safe, range, then, {]",java,valid
2,google/guava,guava/src/com/google/common/escape/ArrayBasedUnicodeEscaper.java,https://github.com/google/guava/blob/7155d12b70a2406fa84d94d4b8b3bc108e89abfd/guava/src/com/google/common/escape/ArrayBasedUnicodeEscaper.java#L176-L188,"@Override\n protected final int nextEscapeIndex(CharSequence csq, int index, int end) {\n while (index < end) {\n char c = csq.charAt(index);\n if ((c < replacementsLength && replacements[c] != null)\n || c > safeMaxChar\n || c < safeMinChar) {\n break;\n ...","[@, Override, protected, final, int, nextEscapeIndex, (, CharSequence, csq, ,, int, index, ,, int, end, ), {, while, (, index, <, end, ), {, char, c, =, csq, ., charAt, (, index, ), ;, if, (, (, c, <, replacementsLength, &&, replacements, [, c, ], !=, null, ), ||, c, >, safeMaxChar, ||, c, <, sa...",/* Overridden for performance.,"[/, *, Overridden, for, performance, .]",java,valid


In [24]:
jvdf_test.head(3)
#pydf.head(3)

index,repo,path,url,code,code_tokens,docstring,docstring_tokens,language,partition
0,ReactiveX/RxJava,src/main/java/io/reactivex/internal/observers/QueueDrainObserver.java,https://github.com/ReactiveX/RxJava/blob/ac84182aa2bd866b53e01c8e3fe99683b882c60e/src/main/java/io/reactivex/internal/observers/QueueDrainObserver.java#L88-L108,"protected final void fastPathOrderedEmit(U value, boolean delayError, Disposable disposable) {\n final Observer<? super V> observer = downstream;\n final SimplePlainQueue<U> q = queue;\n\n if (wip.get() == 0 && wip.compareAndSet(0, 1)) {\n if (q.isEmpty()) {\n ...","[protected, final, void, fastPathOrderedEmit, (, U, value, ,, boolean, delayError, ,, Disposable, disposable, ), {, final, Observer, <, ?, super, V, >, observer, =, downstream, ;, final, SimplePlainQueue, <, U, >, q, =, queue, ;, if, (, wip, ., get, (, ), ==, 0, &&, wip, ., compareAndSet, (, 0, ...","Makes sure the fast-path emits in order.\n@param value the value to emit or queue up\n@param delayError if true, errors are delayed until the source has terminated\n@param disposable the resource to dispose if the drain terminates","[Makes, sure, the, fast, -, path, emits, in, order, .]",java,test
1,ReactiveX/RxJava,src/main/java/io/reactivex/Observable.java,https://github.com/ReactiveX/RxJava/blob/ac84182aa2bd866b53e01c8e3fe99683b882c60e/src/main/java/io/reactivex/Observable.java#L118-L124,"@CheckReturnValue\n @NonNull\n @SchedulerSupport(SchedulerSupport.NONE)\n public static <T> Observable<T> amb(Iterable<? extends ObservableSource<? extends T>> sources) {\n ObjectHelper.requireNonNull(sources, ""sources is null"");\n return RxJavaPlugins.onAssembly(new Obser...","[@, CheckReturnValue, @, NonNull, @, SchedulerSupport, (, SchedulerSupport, ., NONE, ), public, static, <, T, >, Observable, <, T, >, amb, (, Iterable, <, ?, extends, ObservableSource, <, ?, extends, T, >, >, sources, ), {, ObjectHelper, ., requireNonNull, (, sources, ,, ""sources is null"", ), ;,...","Mirrors the one ObservableSource in an Iterable of several ObservableSources that first either emits an item or sends\na termination notification.\n<p>\n<img width=""640"" height=""385"" src=""https://raw.github.com/wiki/ReactiveX/RxJava/images/rx-operators/amb.png"" alt="""">\n<dl>\n<dt><b>Scheduler:</...","[Mirrors, the, one, ObservableSource, in, an, Iterable, of, several, ObservableSources, that, first, either, emits, an, item, or, sends, a, termination, notification, ., <p, >, <img, width, =, 640, height, =, 385, src, =, https, :, //, raw, ., github, ., com, /, wiki, /, ReactiveX, /, RxJava, /,...",java,test
2,ReactiveX/RxJava,src/main/java/io/reactivex/Observable.java,https://github.com/ReactiveX/RxJava/blob/ac84182aa2bd866b53e01c8e3fe99683b882c60e/src/main/java/io/reactivex/Observable.java#L144-L158,"@SuppressWarnings(""unchecked"")\n @CheckReturnValue\n @NonNull\n @SchedulerSupport(SchedulerSupport.NONE)\n public static <T> Observable<T> ambArray(ObservableSource<? extends T>... sources) {\n ObjectHelper.requireNonNull(sources, ""sources is null"");\n int len = sources...","[@, SuppressWarnings, (, ""unchecked"", ), @, CheckReturnValue, @, NonNull, @, SchedulerSupport, (, SchedulerSupport, ., NONE, ), public, static, <, T, >, Observable, <, T, >, ambArray, (, ObservableSource, <, ?, extends, T, >, ..., sources, ), {, ObjectHelper, ., requireNonNull, (, sources, ,, ""s...","Mirrors the one ObservableSource in an array of several ObservableSources that first either emits an item or sends\na termination notification.\n<p>\n<img width=""640"" height=""385"" src=""https://raw.github.com/wiki/ReactiveX/RxJava/images/rx-operators/amb.png"" alt="""">\n<dl>\n<dt><b>Scheduler:</b><...","[Mirrors, the, one, ObservableSource, in, an, array, of, several, ObservableSources, that, first, either, emits, an, item, or, sends, a, termination, notification, ., <p, >, <img, width, =, 640, height, =, 385, src, =, https, :, //, raw, ., github, ., com, /, wiki, /, ReactiveX, /, RxJava, /, im...",java,test


In [26]:
#we save the dataframe into a csv so we don't need to create it again
#jvdf_train.to_csv("jvdf_train.csv")
jvdf_test.to_csv("jvdf_test.csv")
jvdf_valid.to_csv("jvdf_valid.csv")

In [30]:
# Training set: 454,451 rows, all expected to be non-null
# Validation set: 15,328 rows; every column has the same number of non-null entries
# Test set: 26,909 rows; same as above
# Total: 496,688 rows, matching the Java row count computed below
jvdf_test.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 26909 entries, 0 to 26908
Data columns (total 9 columns):
repo                26909 non-null object
path                26909 non-null object
url                 26909 non-null object
code                26909 non-null object
code_tokens         26909 non-null object
docstring           26909 non-null object
docstring_tokens    26909 non-null object
language            26909 non-null object
partition           26909 non-null object
dtypes: object(9)
memory usage: 1.8+ MB


Two columns that will be heavily used in this dataset are `code_tokens` and `docstring_tokens`, which represent a parallel corpus that can be used for interesting tasks like information retrieval (for example, trying to retrieve a code snippet using its docstring).  You can find more information regarding the definition of the above columns in the README of this repo. 
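To give a rough feel for what retrieval over such a parallel corpus means, here is a deliberately naive sketch that ranks code snippets by token overlap (Jaccard similarity) with a query. It is a toy baseline chosen for illustration, not the approach used in this repository. The functions `jaccard` and `retrieve` and the two-entry corpus are all hypothetical.

```python
def jaccard(a, b):
    """Jaccard similarity between two token collections (0.0 if both are empty)."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def retrieve(query_tokens, corpus):
    """Return the index of the code-token list that overlaps most with the query."""
    query = [t.lower() for t in query_tokens]
    scores = [jaccard(query, [t.lower() for t in code]) for code in corpus]
    return max(range(len(corpus)), key=scores.__getitem__)


# Toy corpus of two hand-written code token lists (illustrative only).
toy_corpus = [
    ["public", "int", "add", "(", "int", "a", ",", "int", "b", ")"],
    ["public", "String", "escape", "(", "String", "s", ")"],
]
best = retrieve(["escape", "string"], toy_corpus)
```

In this toy case the query shares two tokens with the second entry and none with the first, so `best` points at the second entry.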

Next, we will read in all of the data for a limited subset of these columns into memory so we can compute summary statistics.  **Warning:** This step takes ~ 20 minutes.

In [34]:
#all_df = jsonl_list_to_dataframe(all_files, columns_short_list)
#gets list of all comments, but some may be empty
#docstring_list_train = jvdf_train[["docstring"]].values.tolist()
docstring_list_test = jvdf_test[["docstring"]].values.tolist()
docstring_list_valid = jvdf_valid[["docstring"]].values.tolist()
docstring_list_valid[0]

['/*\nThis is overridden to improve performance. Rough benchmarking shows that this almost doubles\nthe speed when processing strings that do not require any escaping.']
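As the output above shows, selecting with double brackets (`df[["docstring"]]`) yields a list of one-element lists. If a flat list of strings is preferred, selecting the column as a Series is one option. The sketch below uses a made-up three-row frame and also filters out empty docstrings. The variable names are illustrative.

```python
import pandas as pd

# Toy frame standing in for jvdf_valid (contents are invented).
toy = pd.DataFrame({"docstring": ["Adds two numbers.", "", "Escapes a string."]})

nested = toy[["docstring"]].values.tolist()  # one single-element list per row
flat = toy["docstring"].tolist()             # one string per row
non_empty = [d for d in flat if d.strip()]   # drop empty or whitespace-only docstrings
```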

In [35]:
# Run the dump_data step only once
import pickle

# filename = "docstrings_list_train_java.pkl"
# dump_data = docstring_list

def pkldump(filename, dump_data):
    """Serialize dump_data to filename with pickle."""
    with open(filename, "wb") as f:
        pickle.dump(dump_data, f)

def pklload(filename):
    """Load and return the object pickled in filename."""
    with open(filename, "rb") as f:
        return pickle.load(f)

In [41]:
#run the dump_data step only once
filename = "docstrings_list_valid_java.pkl"
dump_data = docstring_list_valid
pkldump(filename, dump_data)
docstring_list = pklload(filename)

In [42]:
len(docstring_list)

15328
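The "run the dump step only once" pattern can also be wrapped in a small caching helper, so that re-running the cell reuses the pickle when it already exists. This is a sketch under assumptions: `load_or_compute` is a hypothetical name, and the demo uses a stand-in function instead of the real docstring extraction. Pickle files should only be loaded when they come from a trusted source.

```python
import pickle
import tempfile
from pathlib import Path


def load_or_compute(filename, compute):
    """Return the pickled object in filename, creating it with compute() if missing."""
    path = Path(filename)
    if path.exists():
        with path.open("rb") as f:
            return pickle.load(f)
    data = compute()
    with path.open("wb") as f:
        pickle.dump(data, f)
    return data


calls = []  # records how often the expensive step actually runs


def expensive_step():
    """Stand-in for building the docstring list."""
    calls.append(1)
    return ["doc one", "doc two"]


with tempfile.TemporaryDirectory() as tmp:
    cache_file = Path(tmp) / "docstrings_demo.pkl"
    first = load_or_compute(cache_file, expensive_step)   # computes and writes
    second = load_or_compute(cache_file, expensive_step)  # reads from the pickle
```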

## Summary Statistics

### Row Counts

By Partition

In [13]:
all_df.partition.value_counts()

train    1880853
test      100529
valid      89154
Name: partition, dtype: int64

By Language

In [14]:
all_df.language.value_counts()

php           578118
java          496688
python        457461
go            346365
javascript    138625
ruby           53279
Name: language, dtype: int64

By Partition & Language

In [15]:
all_df.groupby(['partition', 'language'])['code_tokens'].count()

partition  language  
test       go             14291
           java           26909
           javascript      6483
           php            28391
           python         22176
           ruby            2279
train      go            317832
           java          454451
           javascript    123889
           php           523712
           python        412178
           ruby           48791
valid      go             14242
           java           15328
           javascript      8253
           php            26015
           python         23107
           ruby            2209
Name: code_tokens, dtype: int64
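The three tables above can be cross-checked against each other with plain arithmetic. The sketch below copies the per-partition, per-language counts from the output above into a dictionary and recomputes the totals and each language's training share. The variable names are illustrative.

```python
# Counts copied from the "By Partition & Language" output above.
counts = {
    "test":  {"go": 14291, "java": 26909, "javascript": 6483,
              "php": 28391, "python": 22176, "ruby": 2279},
    "train": {"go": 317832, "java": 454451, "javascript": 123889,
              "php": 523712, "python": 412178, "ruby": 48791},
    "valid": {"go": 14242, "java": 15328, "javascript": 8253,
              "php": 26015, "python": 23107, "ruby": 2209},
}

# Row totals per partition and per language.
partition_totals = {part: sum(langs.values()) for part, langs in counts.items()}
language_totals = {
    lang: sum(counts[part][lang] for part in counts) for lang in counts["train"]
}
grand_total = sum(partition_totals.values())

# Fraction of each language's rows that fall in the training partition.
train_share = {
    lang: counts["train"][lang] / language_totals[lang] for lang in language_totals
}

for lang, share in sorted(train_share.items()):
    print(f"{lang:<12}{language_totals[lang]:>10,}  train share: {share:.1%}")
```

The recomputed totals agree with the "By Partition" and "By Language" outputs, which is a quick sanity check that no rows were lost when grouping.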

### Token Lengths By Language

In [16]:
all_df['code_len'] = all_df.code_tokens.apply(len)
all_df['query_len'] = all_df.docstring_tokens.apply(len)

#### Code Length Percentile By Language

For example, the 80th-percentile code length for Python is 72 tokens.

In [17]:
code_len_summary = all_df.groupby('language')['code_len'].quantile([.5, .7, .8, .9, .95])
display(pd.DataFrame(code_len_summary))

language,quantile,code_len
go,0.5,61.0
go,0.7,100.0
go,0.8,138.0
go,0.9,217.0
go,0.95,319.0
java,0.5,66.0
java,0.7,104.0
java,0.8,142.0
java,0.9,224.0
java,0.95,331.0
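To show how the grouped quantile call behaves, here is a sketch on a tiny made-up frame where the lengths are easy to check by hand. With pandas' default linear interpolation, the median of the lengths 1 to 5 is 3 and the 0.9 quantile is 4.6. The frame `toy` and its contents are invented for the example.

```python
import pandas as pd

# Toy data: token lists of known lengths (go: 1..5, java: 2, 4, 6, 8, 10).
toy = pd.DataFrame({
    "language": ["go"] * 5 + ["java"] * 5,
    "code_tokens": [["tok"] * n for n in [1, 2, 3, 4, 5, 2, 4, 6, 8, 10]],
})
toy["code_len"] = toy.code_tokens.apply(len)

# Result is a Series indexed by (language, quantile), as in the output above.
summary = toy.groupby("language")["code_len"].quantile([0.5, 0.9])

# Optional: one row per language, one column per quantile.
wide = summary.unstack()
```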


#### Query Length Percentile By Language

For example, the 80th-percentile query (docstring) length for Python is 19 tokens.

In [18]:
query_len_summary = all_df.groupby('language')['query_len'].quantile([.5, .7, .8, .9, .95])
display(pd.DataFrame(query_len_summary))

language,quantile,query_len
go,0.5,12.0
go,0.7,19.0
go,0.8,28.0
go,0.9,49.0
go,0.95,92.0
java,0.5,11.0
java,0.7,18.0
java,0.8,25.0
java,0.9,39.0
java,0.95,61.0


#### Query Length All Languages

In [19]:
query_len_summary = all_df['query_len'].quantile([.5, .7, .8, .9, .95])
display(pd.DataFrame(query_len_summary))

quantile,query_len
0.5,10.0
0.7,15.0
0.8,20.0
0.9,32.0
0.95,50.0
