# Data Exploration

This notebook explores the pre-processed data, and shows some basic statistics that may be useful.  

In [1]:
import json
import pandas as pd
from pathlib import Path
pd.set_option('max_colwidth',300)
from pprint import pprint
import os
import pickle
import time

## Part 1: Preview The Dataset
    
Before downloading the entire dataset, it may be useful to explore a small sample in order to understand the format and structure of the data.  While the full dataset can be automatically downloaded with the `/script/setup` script located in this repo, we can alternatively download a subset of the data from S3.  

The S3 links follow this pattern:

> https://s3.amazonaws.com/code-search-net/CodeSearchNet/v2/{python,java,go,php,ruby,javascript}.zip

For example, the link for `python` is:

> https://s3.amazonaws.com/code-search-net/CodeSearchNet/v2/python.zip

Similarly, the link for `java`, which is the language we are using, is:

> https://s3.amazonaws.com/code-search-net/CodeSearchNet/v2/java.zip
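
The link pattern above can also be built in code. The following is a minimal sketch; the names `LANGUAGES`, `BASE_URL`, and `dataset_url` are made up for illustration and are not part of the repository.

```python
# The six language names listed in the URL pattern above.
LANGUAGES = ("python", "java", "go", "php", "ruby", "javascript")
BASE_URL = "https://s3.amazonaws.com/code-search-net/CodeSearchNet/v2"


def dataset_url(language: str) -> str:
    """Return the S3 download link for one language's archive (hypothetical helper)."""
    if language not in LANGUAGES:
        raise ValueError(f"unknown language: {language!r}")
    return f"{BASE_URL}/{language}.zip"


print(dataset_url("java"))
```

Rejecting unknown names early avoids requesting an archive that the pattern does not list.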

First we download and decompress this dataset:

In [2]:
# Does not work on a Windows-based Anaconda system.
# Linux is recommended; otherwise, use the Windows equivalent of each command.
# The file is large: approximately 1011 MB.
!wget https://s3.amazonaws.com/code-search-net/CodeSearchNet/v2/java.zip

--2020-04-26 16:39:52--  https://s3.amazonaws.com/code-search-net/CodeSearchNet/v2/java.zip
Resolving s3.amazonaws.com (s3.amazonaws.com)... 52.216.20.75
Connecting to s3.amazonaws.com (s3.amazonaws.com)|52.216.20.75|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1060569153 (1011M) [application/zip]
Saving to: ‘java.zip.1’

java.zip.1            0%[                    ] 534.64K   131KB/s    eta 2h 21m ^C


In [3]:
# The error below occurs because the Java zip file was only partially downloaded.
# After unzipping, move the extracted java folder to CodeSearchNet/resources/data.
!unzip java.zip

Archive:  java.zip
  End-of-central-directory signature not found.  Either this file is not
  a zipfile, or it constitutes one disk of a multi-part archive.  In the
  latter case the central directory and zipfile comment will be found on
  the last disk(s) of this archive.
note:  java.zip may be a plain executable, not an archive
unzip:  cannot find zipfile directory in one of java.zip or
        java.zip.zip, and cannot find java.zip.ZIP, period.
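
The failure above comes from an incomplete archive. As a rough, platform-independent sketch, the standard-library `zipfile` module can check that a file is a complete zip before extracting it. The helper name `safe_extract` and the throwaway file names are assumptions for illustration; the demonstration simulates an interrupted download by truncating a small archive.

```python
import tempfile
import zipfile
from pathlib import Path


def safe_extract(archive: Path, dest: Path) -> bool:
    """Extract `archive` into `dest` only if it is a readable zip file.

    Returns True on success and False if the file is missing or incomplete.
    """
    if not archive.is_file() or not zipfile.is_zipfile(archive):
        return False
    with zipfile.ZipFile(archive) as zf:
        zf.extractall(dest)
    return True


# Demonstration on throwaway files (hypothetical names, not the real dataset).
with tempfile.TemporaryDirectory() as tmp_name:
    tmp = Path(tmp_name)
    good = tmp / "good.zip"
    with zipfile.ZipFile(good, "w") as zf:
        zf.writestr("java/hello.txt", "hello")

    # Simulate an interrupted download by keeping only the first half of the bytes.
    partial = tmp / "partial.zip"
    data = good.read_bytes()
    partial.write_bytes(data[: len(data) // 2])

    good_ok = safe_extract(good, tmp / "out")
    partial_ok = safe_extract(partial, tmp / "out_partial")

print(good_ok, partial_ok)
```

The check works because a zip file keeps its directory at the end of the file, which is exactly the part a partial download is missing.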


Finally, we can inspect `java/final/jsonl/test/java_test_0.jsonl.gz` to see its contents:

In [5]:
!gzip -d ../resources/data/java/final/jsonl/test/java_test_0.jsonl.gz

gzip: ../resources/data/java/final/jsonl/test/java_test_0.jsonl already exists; do you wish to overwrite (y or n)? ^C
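
As an alternative to decompressing on disk, a `.jsonl.gz` file can be read directly with the standard-library `gzip` module, which avoids the overwrite prompt shown above. This is a minimal sketch on a tiny stand-in file; the helper name `read_first_record` and the two records are made up for illustration.

```python
import gzip
import json
import tempfile
from pathlib import Path


def read_first_record(gz_path):
    """Return the first JSON object from a gzip-compressed JSON Lines file."""
    # Mode "rt" decompresses on the fly and yields text lines.
    with gzip.open(gz_path, "rt", encoding="utf-8") as f:
        return json.loads(next(f))


# Demonstration on a stand-in file (made-up records, not real dataset rows).
with tempfile.TemporaryDirectory() as tmp_name:
    demo_path = Path(tmp_name) / "demo.jsonl.gz"
    with gzip.open(demo_path, "wt", encoding="utf-8") as f:
        f.write(json.dumps({"language": "java", "func_name": "Demo.first"}) + "\n")
        f.write(json.dumps({"language": "java", "func_name": "Demo.second"}) + "\n")
    first = read_first_record(demo_path)

print(first["func_name"])
```

Because only one line is read, this stays fast even for large files.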


Read in the file and display the first row.  The data is stored in [JSON Lines](http://jsonlines.org/) format.

In [7]:
with open('../resources/data/java/final/jsonl/test/java_test_0.jsonl', 'r') as f:
    sample_file = f.readlines()
sample_file[0]

'{"repo": "ReactiveX/RxJava", "path": "src/main/java/io/reactivex/internal/observers/QueueDrainObserver.java", "func_name": "QueueDrainObserver.fastPathOrderedEmit", "original_string": "protected final void fastPathOrderedEmit(U value, boolean delayError, Disposable disposable) {\\n        final Observer<? super V> observer = downstream;\\n        final SimplePlainQueue<U> q = queue;\\n\\n        if (wip.get() == 0 && wip.compareAndSet(0, 1)) {\\n            if (q.isEmpty()) {\\n                accept(observer, value);\\n                if (leave(-1) == 0) {\\n                    return;\\n                }\\n            } else {\\n                q.offer(value);\\n            }\\n        } else {\\n            q.offer(value);\\n            if (!enter()) {\\n                return;\\n            }\\n        }\\n        QueueDrainHelper.drainLoop(q, observer, delayError, disposable, this);\\n    }", "language": "java", "code": "protected final void fastPathOrderedEmit(U value, boolean d

We can use the fact that each line in the file is valid JSON to display the first row in a more human-readable form:

In [8]:
pprint(json.loads(sample_file[0]))

{'code': 'protected final void fastPathOrderedEmit(U value, boolean '
         'delayError, Disposable disposable) {\n'
         '        final Observer<? super V> observer = downstream;\n'
         '        final SimplePlainQueue<U> q = queue;\n'
         '\n'
         '        if (wip.get() == 0 && wip.compareAndSet(0, 1)) {\n'
         '            if (q.isEmpty()) {\n'
         '                accept(observer, value);\n'
         '                if (leave(-1) == 0) {\n'
         '                    return;\n'
         '                }\n'
         '            } else {\n'
         '                q.offer(value);\n'
         '            }\n'
         '        } else {\n'
         '            q.offer(value);\n'
         '            if (!enter()) {\n'
         '                return;\n'
         '            }\n'
         '        }\n'
         '        QueueDrainHelper.drainLoop(q, observer, delayError, '
         'disposable, this);\n'
         '    }',
 'code_tokens': ['pr

Definitions of each of the above fields are located in the README.md file in the root of this repository.

## Part 2: Exploring The Full Dataset

The original instructions ask you to complete the setup steps in the README.md file located in the root of this repository before proceeding.  Those steps apply to all of the datasets; since we are only working with Java, you can skip them.

The training data is located in `/resources/data`, which contains approximately 3.2 million (code, comment) pairs across the train, validation, and test partitions.  You can learn more about the directory structure and associated files by viewing `/resources/README.md`, which is already included in the repository.

The preprocessed data are stored in [JSON Lines](http://jsonlines.org/) format.  First, we can get a list of all these files for further inspection:

In [11]:
#This is only for Java
java_train_files = sorted(Path('../resources/data/java/final/jsonl/train').glob('**/*.gz'))
java_test_files =  sorted(Path('../resources/data/java/final/jsonl/test').glob('**/*.gz'))
java_valid_files = sorted(Path('../resources/data/java/final/jsonl/valid').glob('**/*.gz'))
all_files = java_train_files + java_test_files + java_valid_files

# To match all files, in all directories (from the base directory and deeper)

# **/*.nupkg
# Will match

# sample.nupkg
# sample-2.nupkg
# tmp/sample.nupkg
# tmp/other.nupkg
# other/new/sample.nupkg

# python_files = sorted(Path('../resources/data/python/').glob('**/*.gz'))
# java_files = sorted(Path('../resources/data/java/').glob('**/*.gz'))
# go_files = sorted(Path('../resources/data/go/').glob('**/*.gz'))
# php_files = sorted(Path('../resources/data/php/').glob('**/*.gz'))
# javascript_files = sorted(Path('../resources/data/javascript/').glob('**/*.gz'))
# ruby_files = sorted(Path('../resources/data/ruby/').glob('**/*.gz'))
# all_files = python_files + go_files + java_files + php_files + javascript_files + ruby_files
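
The recursive `**/*.gz` pattern used above can be tried on a throwaway directory. This sketch uses a made-up layout that only loosely mirrors the dataset folders, and compares the recursive pattern with the non-recursive `*.gz`.

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp_name:
    base = Path(tmp_name)
    # Hypothetical layout, loosely mirroring the train/test folders.
    for rel in ("train/a_0.jsonl.gz", "train/a_1.jsonl.gz",
                "test/b_0.jsonl.gz", "test/notes.txt", "top.jsonl.gz"):
        target = base / rel
        target.parent.mkdir(parents=True, exist_ok=True)
        target.touch()

    # "**/*.gz" matches .gz files in `base` itself and in every subdirectory.
    recursive = sorted(p.relative_to(base).as_posix() for p in base.glob("**/*.gz"))
    # "*.gz" matches only files directly inside `base`.
    top_level = sorted(p.relative_to(base).as_posix() for p in base.glob("*.gz"))

print(recursive)
print(top_level)
```

Wrapping the result in `sorted`, as the cell above does, keeps the file order stable between runs.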

In [13]:
#Count should be:
# Total number of files: 18
# No of test files: 1
# No of valid files: 1
# No of train files: 16
print(f'Total number of files: {len(all_files):,}')
print("No of test files: " + str(len(java_test_files)))
print("No of valid files: " + str(len(java_valid_files)))
print("No of train files: " + str(len(java_train_files)))

Total number of files: 18
No of test files: 1
No of valid files: 1
No of train files: 16


To make analysis of this dataset easier, we can load all of the data into a pandas dataframe: 

In [14]:
columns_long_list = ['repo', 'path', 'url', 'code', 
                     'code_tokens', 'docstring', 'docstring_tokens', 
                     'language', 'partition']

columns_short_list = ['code_tokens', 'docstring_tokens', 
                      'language', 'partition']

def jsonl_list_to_dataframe(file_list, columns=columns_long_list):
    """Load a list of jsonl.gz files into a pandas DataFrame."""
    return pd.concat([pd.read_json(f, 
                                   orient='records', 
                                   compression='gzip',
                                   lines=True)[columns] 
                      for f in file_list], sort=False)

This is what the Java dataset looks like:

In [16]:
# jvdf is short for Java dataframe (a pandas DataFrame).
# This will take about a minute.
# Expected output (train, test, valid):
# (454451, 9)
# (26909, 9)
# (15328, 9)
# All entries are expected to be non-null:
# training dataset: 454,451 rows
# validation dataset: 15,328 rows
# testing dataset: 26,909 rows
# The total is 496,688 rows, matching the Java count in the summary statistics below.
jvdf_train = jsonl_list_to_dataframe(java_train_files)
jvdf_valid = jsonl_list_to_dataframe(java_valid_files)
jvdf_test = jsonl_list_to_dataframe(java_test_files)
print(jvdf_train.shape)
print(jvdf_test.shape)
print(jvdf_valid.shape)

(454451, 9)
(26909, 9)
(15328, 9)


In [21]:
#To make sure the frames have been loaded correctly
jvdf_train.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 454451 entries, 0 to 29999
Data columns (total 9 columns):
 #   Column            Non-Null Count   Dtype 
---  ------            --------------   ----- 
 0   repo              454451 non-null  object
 1   path              454451 non-null  object
 2   url               454451 non-null  object
 3   code              454451 non-null  object
 4   code_tokens       454451 non-null  object
 5   docstring         454451 non-null  object
 6   docstring_tokens  454451 non-null  object
 7   language          454451 non-null  object
 8   partition         454451 non-null  object
dtypes: object(9)
memory usage: 34.7+ MB


In [22]:
jvdf_valid.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 15328 entries, 0 to 15327
Data columns (total 9 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   repo              15328 non-null  object
 1   path              15328 non-null  object
 2   url               15328 non-null  object
 3   code              15328 non-null  object
 4   code_tokens       15328 non-null  object
 5   docstring         15328 non-null  object
 6   docstring_tokens  15328 non-null  object
 7   language          15328 non-null  object
 8   partition         15328 non-null  object
dtypes: object(9)
memory usage: 1.1+ MB


In [23]:
jvdf_test.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 26909 entries, 0 to 26908
Data columns (total 9 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   repo              26909 non-null  object
 1   path              26909 non-null  object
 2   url               26909 non-null  object
 3   code              26909 non-null  object
 4   code_tokens       26909 non-null  object
 5   docstring         26909 non-null  object
 6   docstring_tokens  26909 non-null  object
 7   language          26909 non-null  object
 8   partition         26909 non-null  object
dtypes: object(9)
memory usage: 1.8+ MB


In [25]:
#we save the dataframe into a csv so we don't need to create it again
#we save them in the resources/data/java
#This also takes a minute
sp = "../resources/data/java/"
jvdf_train.to_csv(sp + "jvdf_train.csv")
jvdf_test.to_csv(sp + "jvdf_test.csv")
jvdf_valid.to_csv(sp + "jvdf_valid.csv")
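
One caveat when saving to CSV: list-valued columns such as `code_tokens` and `docstring_tokens` are stored as text, so they are expected to come back as strings rather than lists when the file is read again. The sketch below shows the effect with the standard-library `csv` module on a made-up row, and recovers the list with `ast.literal_eval`. It illustrates the pitfall and is not a step of the original notebook.

```python
import ast
import csv
import io

# A made-up row with a list-valued column, like `code_tokens`.
row = {"language": "java", "code_tokens": ["return", "x", ";"]}

# Writing: the csv module stores the list as its string representation.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=list(row))
writer.writeheader()
writer.writerow(row)

# Reading: the cell comes back as a plain string, not a list.
buffer.seek(0)
loaded = next(csv.DictReader(buffer))
raw_cell = loaded["code_tokens"]

# ast.literal_eval safely parses the Python literal back into a list.
tokens = ast.literal_eval(raw_cell)
print(type(raw_cell).__name__, tokens)
```

The `code` and `docstring` columns are plain strings, so they are not affected by this.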

Two columns that will be heavily used in this dataset are `code_tokens` and `docstring_tokens`, which represent a parallel corpus that can be used for interesting tasks like information retrieval (for example, trying to retrieve a code snippet using the docstring).  You can find more information regarding the definition of the above columns in the README of this repo.

Next, we will read in all of the data for a limited subset of these columns into memory so we can compute summary statistics.  **Warning:** This step takes ~ 20 minutes.

In [26]:
#gets list of all comments, note that we target docstring and not docstring tokens. Note that there are no empty elements
docstring_list_train = jvdf_train[["docstring"]].values.tolist()
docstring_list_test = jvdf_test[["docstring"]].values.tolist()
docstring_list_valid = jvdf_valid[["docstring"]].values.tolist()
print(docstring_list_train[0])
print(docstring_list_valid[0])
print(docstring_list_test[0])

['Bind indexed elements to the supplied collection.\n@param name the name of the property to bind\n@param target the target bindable\n@param elementBinder the binder to use for elements\n@param aggregateType the aggregate type, may be a collection or an array\n@param elementType the element type\n@param result the destination for results']
['/*\nThis is overridden to improve performance. Rough benchmarking shows that this almost doubles\nthe speed when processing strings that do not require any escaping.']
['Makes sure the fast-path emits in order.\n@param value the value to emit or queue up\n@param delayError if true, errors are delayed until the source has terminated\n@param disposable the resource to dispose if the drain terminates']


In [27]:
#run the dump_data step only once
# filename = "docstrings_list_train_java.pkl"
# dump_data = docstring_list
def pkldump(filename, dump_data):
    with open(filename, "wb") as f:
        pickle.dump(dump_data, f)
def pklload(filename):
    with open(filename, "rb") as f:
        out_file = pickle.load(f)
        return out_file

In [30]:
#run the dump_data step only once
p = "/mnt/c/Users/Akhil Chandra/Desktop/BTP/CodeSearchNet/resources/data/java/docstring_lists/"
os.makedirs(p, exist_ok=True)  # creates the folder (and any missing parents) if needed
filename = p+"docstrings_list_train_java.pkl"
dump_data = docstring_list_train
pkldump(filename, dump_data)
filename = p+"docstrings_list_valid_java.pkl"
dump_data = docstring_list_valid
pkldump(filename, dump_data)
filename = p+"docstrings_list_test_java.pkl"
dump_data = docstring_list_test
pkldump(filename, dump_data)

In [31]:
#output should be 15328
docstring_list = pklload(p+"docstrings_list_valid_java.pkl")
len(docstring_list)

15328

In [33]:
def lol_to_list(lol):
    a = []
    for i in lol:
        for j in i:
            a.append(j)
    return a
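
The same one-level flattening can be written with `itertools.chain.from_iterable` from the standard library. This is only an equivalent sketch; `flatten` is a hypothetical name and the nested list is made-up sample data.

```python
from itertools import chain


def flatten(list_of_lists):
    """Flatten one level of nesting, e.g. [[a], [b, c]] -> [a, b, c]."""
    return list(chain.from_iterable(list_of_lists))


# Same shape as the output of df[["docstring"]].values.tolist(): one-item lists.
nested = [["first docstring"], ["second docstring"]]
flat = flatten(nested)
print(flat)
```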

In [34]:
## To get individual code files
# Reads the code column back from the CSV files and writes each snippet to its own .java file.
# This takes a long time (hours).
stime = time.time()
rp = "../resources/data/java/"  # relative path
for val in ["train", "test", "valid"]:
    df = pd.read_csv(rp + "jvdf_" + val + ".csv")
    lol = df[["code"]].values.tolist()  # each element of lol is a one-item list holding a code snippet
    l = lol_to_list(lol)   # each element of l is a code snippet, i.e. l is a list of code snippets
    # The preprocess script takes a folder of project folders, with the java files inside each project folder.
    p = os.path.join(rp, val + "_code_codesearchnet", "Proj1")
    os.makedirs(p, exist_ok=True)  # also creates the parent folder if it is missing
    # One file per code snippet, named by its index in the list (list indexes are stable, unlike dictionaries).
    for count, snippet in enumerate(l):
        # "code" marks a code snippet; "doc" will be used for docs.
        with open(os.path.join(p, "code" + str(count) + ".java"), "w", encoding="utf8") as f:
            f.write(str(snippet))
        if (count + 1) % 10000 == 0:
            print(count + 1)
            print(time.time() - stime)
# There is no guarantee that this will run on code2vec, but the preprocess tests seem to work.
# Writing everything into a single java file was rejected because it would be difficult to handle.

KeyboardInterrupt: 

## Summary Statistics
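This section can be skipped for the Java-only workflow, since the Java row counts have already been verified above. The cells below use `all_df`, a dataframe covering all six languages, which this notebook does not build.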

### Row Counts

By Partition

In [13]:
all_df.partition.value_counts()

train    1880853
test      100529
valid      89154
Name: partition, dtype: int64

By Language

In [14]:
all_df.language.value_counts()

php           578118
java          496688
python        457461
go            346365
javascript    138625
ruby           53279
Name: language, dtype: int64

By Partition & Language

In [15]:
all_df.groupby(['partition', 'language'])['code_tokens'].count()

partition  language  
test       go             14291
           java           26909
           javascript      6483
           php            28391
           python         22176
           ruby            2279
train      go            317832
           java          454451
           javascript    123889
           php           523712
           python        412178
           ruby           48791
valid      go             14242
           java           15328
           javascript      8253
           php            26015
           python         23107
           ruby            2209
Name: code_tokens, dtype: int64

### Token Lengths By Language

In [16]:
all_df['code_len'] = all_df.code_tokens.apply(lambda x: len(x))
all_df['query_len'] = all_df.docstring_tokens.apply(lambda x: len(x))

#### Code Length Percentile By Language

For example, the 80th percentile length for python tokens is 72

In [17]:
code_len_summary = all_df.groupby('language')['code_len'].quantile([.5, .7, .8, .9, .95])
display(pd.DataFrame(code_len_summary))

| language | quantile | code_len |
|---|---|---|
| go | 0.5 | 61.0 |
| go | 0.7 | 100.0 |
| go | 0.8 | 138.0 |
| go | 0.9 | 217.0 |
| go | 0.95 | 319.0 |
| java | 0.5 | 66.0 |
| java | 0.7 | 104.0 |
| java | 0.8 | 142.0 |
| java | 0.9 | 224.0 |
| java | 0.95 | 331.0 |
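
To make the percentile figures concrete, here is a small pure-Python sketch of a quantile with linear interpolation between neighbouring values, which is the default interpolation in pandas' `quantile`. The function name `quantile` and the token lists are made up for illustration; the real numbers come from the pandas call above.

```python
import math


def quantile(values, q):
    """Quantile of `values` at fraction `q` (0..1), using linear interpolation."""
    if not values:
        raise ValueError("values must not be empty")
    ordered = sorted(values)
    pos = q * (len(ordered) - 1)  # fractional index into the sorted list
    lower = math.floor(pos)
    upper = math.ceil(pos)
    fraction = pos - lower
    return ordered[lower] + (ordered[upper] - ordered[lower]) * fraction


# Made-up token lists standing in for the `code_tokens` column.
token_lists = [["a"], ["a", "b"], ["a", "b", "c"],
               ["a", "b", "c", "d"], ["a", "b", "c", "d", "e"]]
lengths = [len(tokens) for tokens in token_lists]

median_len = quantile(lengths, 0.5)
p80_len = quantile(lengths, 0.8)
print(median_len, p80_len)
```

Read this way, an 80th percentile of 142 for Java means that about 80% of the Java snippets have at most 142 code tokens.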


#### Query Length Percentile By Language

For example, the 80th percentile length for python tokens is 19

In [18]:
query_len_summary = all_df.groupby('language')['query_len'].quantile([.5, .7, .8, .9, .95])
display(pd.DataFrame(query_len_summary))

| language | quantile | query_len |
|---|---|---|
| go | 0.5 | 12.0 |
| go | 0.7 | 19.0 |
| go | 0.8 | 28.0 |
| go | 0.9 | 49.0 |
| go | 0.95 | 92.0 |
| java | 0.5 | 11.0 |
| java | 0.7 | 18.0 |
| java | 0.8 | 25.0 |
| java | 0.9 | 39.0 |
| java | 0.95 | 61.0 |


#### Query Length All Languages

In [19]:
query_len_summary = all_df['query_len'].quantile([.5, .7, .8, .9, .95])
display(pd.DataFrame(query_len_summary))

| quantile | query_len |
|---|---|
| 0.5 | 10.0 |
| 0.7 | 15.0 |
| 0.8 | 20.0 |
| 0.9 | 32.0 |
| 0.95 | 50.0 |
