# Data Exploration

This notebook explores the preprocessed data and shows some basic summary statistics.

In [116]:
import pandas as pd
from pathlib import Path
pd.set_option('display.max_colwidth', 300)
import json
from pprint import pprint

## Part 1: Preview The Dataset
    
Before downloading the entire dataset, it can be useful to explore a small sample to understand the format and structure of the data. The full dataset can be downloaded automatically with the `/script/setup` script located in this repo, but we can also download a subset of the data directly from S3.

The S3 links follow this pattern:

> https://s3.amazonaws.com/code-search-net/CodeSearchNet/{python,java,csharp}/{train,valid,test,holdout}.zip

For example, the link for the `python` test partition is:

> https://s3.amazonaws.com/code-search-net/CodeSearchNet/python/test.zip
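
The pattern above can also be expanded programmatically. The following is a minimal sketch; the names `BASE_URL`, `LANGUAGES`, `PARTITIONS`, and `partition_url` are illustrative and not part of this repository, and the sketch only builds the URL strings without checking that each archive exists.

```python
from itertools import product

# Values taken from the URL pattern shown above.
BASE_URL = "https://s3.amazonaws.com/code-search-net/CodeSearchNet"
LANGUAGES = ("python", "java", "csharp")
PARTITIONS = ("train", "valid", "test", "holdout")


def partition_url(language: str, partition: str) -> str:
    """Build the download URL for one language/partition pair (hypothetical helper)."""
    if language not in LANGUAGES:
        raise ValueError(f"unknown language: {language!r}")
    if partition not in PARTITIONS:
        raise ValueError(f"unknown partition: {partition!r}")
    return f"{BASE_URL}/{language}/{partition}.zip"


# Every combination implied by the pattern: 3 languages x 4 partitions.
all_urls = [partition_url(lang, part) for lang, part in product(LANGUAGES, PARTITIONS)]

print(partition_url("python", "test"))
```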

First we download and decompress this dataset:

In [101]:
! wget https://s3.amazonaws.com/code-search-net/CodeSearchNet/python/test.zip

--2019-03-03 21:48:05--  https://s3.amazonaws.com/code-search-net/CodeSearchNet/python/test.zip
Resolving s3.amazonaws.com (s3.amazonaws.com)... 52.216.100.189
Connecting to s3.amazonaws.com (s3.amazonaws.com)|52.216.100.189|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 56867215 (54M) [application/zip]
Saving to: ‘test.zip’


2019-03-03 21:48:07 (87.1 MB/s) - ‘test.zip’ saved [56867215/56867215]



In [103]:
! unzip test.zip

Archive:  test.zip
   creating: test/
  inflating: test/codedata_00004.jsonl.gz  
  inflating: test/codedata_00005.jsonl.gz  
  inflating: test/codedata_00030.jsonl.gz  
  inflating: test/codedata_00031.jsonl.gz  
  inflating: test/codedata_00013.jsonl.gz  
  inflating: test/codedata_00012.jsonl.gz  
  inflating: test/codedata_00027.jsonl.gz  
  inflating: test/codedata_00026.jsonl.gz  
  inflating: test/codedata_00019.jsonl.gz  
  inflating: test/codedata_00018.jsonl.gz  
  inflating: test/codedata_00014.jsonl.gz  
  inflating: test/codedata_00015.jsonl.gz  
  inflating: test/codedata_00020.jsonl.gz  
  inflating: test/codedata_00021.jsonl.gz  
  inflating: test/codedata_00003.jsonl.gz  
  inflating: test/codedata_00002.jsonl.gz  
  inflating: test/codedata_00009.jsonl.gz  
  inflating: test/codedata_00008.jsonl.gz  
  inflating: test/codedata_00024.jsonl.gz  
  inflating: test/codedata_00025.jsonl.gz  
  inflating: test/codedata_00010.jsonl.gz  
  inflating: test/codedata_00011.jsonl

Finally, we can inspect `test/codedata_00000.jsonl.gz` to see its contents:

In [108]:
# decompress this gzip file
! gzip -d test/codedata_00000.jsonl.gz

Read in the file and display the first row.  The data is stored in [JSON Lines](http://jsonlines.org/) format.

In [123]:
with open('test/codedata_00000.jsonl', 'r') as f:
    sample_file = f.readlines()
sample_file[0]

'{"repo": "samluescher/django-media-tree", "path": "media_tree/templatetags/media_tree_tags.py", "lineno": 21, "func_name": "file_links", "original_string": "def file_links(items, opts=None):\\n    \\"\\"\\"\\n\\tTurns a (optionally nested) list of FileNode objects into a list of \\n\\tstrings, linking to the associated files.\\n\\t\\"\\"\\"\\n    result = []\\n    kwargs = get_kwargs_for_file_link(opts)\\n    for item in items:\\n        if isinstance(item, FileNode):\\n            result.append(get_file_link(item, **kwargs))\\n        else:\\n            result.append(file_links(item, kwargs))\\n    return result\\n", "language": "python", "code": "def file_links(items, opts=None):\\n    \\"\\"\\"\\"\\"\\"\\n    result = []\\n    kwargs = get_kwargs_for_file_link(opts)\\n    for item in items:\\n        if isinstance(item, FileNode):\\n            result.append(get_file_link(item, **kwargs))\\n        else:\\n            result.append(file_links(item, kwargs))\\n    return result\\n"

Because each line in the file is valid JSON, we can parse the first row and display it in a more human-readable form:

In [124]:
pprint(json.loads(sample_file[0]))

{'code': 'def file_links(items, opts=None):\n'
         '    """"""\n'
         '    result = []\n'
         '    kwargs = get_kwargs_for_file_link(opts)\n'
         '    for item in items:\n'
         '        if isinstance(item, FileNode):\n'
         '            result.append(get_file_link(item, **kwargs))\n'
         '        else:\n'
         '            result.append(file_links(item, kwargs))\n'
         '    return result\n',
 'code_tokens': ['def',
                 'file_links',
                 'items',
                 'opts',
                 'None',
                 '""""""',
                 'result',
                 'kwargs',
                 'get_kwargs_for_file_link',
                 'opts',
                 'for',
                 'item',
                 'in',
                 'items',
                 'if',
                 'isinstance',
                 'item',
                 'FileNode',
                 'result',
                 'append',
                 'g

Definitions of each of the above fields are located in the README.md file in the root of this repository.
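
The decompress-then-read steps above can also be combined, so that the `.jsonl.gz` file is parsed without first writing an uncompressed copy to disk. The sketch below assumes only the standard library; `iter_jsonl_gz` is a hypothetical helper, and a tiny synthetic file stands in for the real data so the example runs on its own.

```python
import gzip
import json
import tempfile
from pathlib import Path


def iter_jsonl_gz(path):
    """Yield one parsed record per non-empty line of a gzip-compressed JSON Lines file."""
    with gzip.open(path, "rt", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:  # skip blank lines
                yield json.loads(line)


# Synthetic stand-in for a file such as test/codedata_00000.jsonl.gz.
demo_rows = [
    {"func_name": "file_links", "language": "python"},
    {"func_name": "other_func", "language": "python"},
]
demo_path = Path(tempfile.mkdtemp()) / "demo.jsonl.gz"
with gzip.open(demo_path, "wt", encoding="utf-8") as f:
    for row in demo_rows:
        f.write(json.dumps(row) + "\n")

# Read only the first record; the rest of the file is never parsed.
first_record = next(iter_jsonl_gz(demo_path))
print(first_record["func_name"])
```

Because the helper is a generator, it reads records lazily, which can be convenient when only a few rows are needed for a preview.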

## Part 2: Exploring The Full Dataset

You will need to complete the setup steps in the README.md file located in the root of this repository before proceeding.

The training data is located in `/resources/data`, which contains approximately 3.1 million (code, comment) pairs across the train, validation, and test partitions.  You can learn more about the directory structure and associated files by viewing `/resources/README.md`.

The preprocessed data are stored in [JSON Lines](http://jsonlines.org/) format.  First, we can get a list of all these files for further inspection:

In [86]:
pythonfiles = sorted(Path('../resources/data/python/').glob('**/*.gz'))
csharpfiles = sorted(Path('../resources/data/csharp/').glob('**/*.gz'))
javafiles = sorted(Path('../resources/data/java/').glob('**/*.gz'))
allfiles = pythonfiles + csharpfiles + javafiles

In [87]:
print(f'Total number of files: {len(allfiles):,}')

Total number of files: 3,192


To make analysis of this dataset easier, we can load all of the data into a pandas dataframe: 

In [80]:
columns_long_list = ['repo', 'path', 'lineno', 'code', 
                     'code_tokens', 'docstring', 'docstring_tokens', 
                     'language', 'partition']

columns_short_list = [ 'code_tokens', 'docstring_tokens', 
                        'language', 'partition']

def jsonl_list_to_dataframe(file_list, columns=columns_long_list):
    "Load a list of jsonl.gz files into a pandas DataFrame."
    return pd.concat([pd.read_json(f, 
                                   orient='records', 
                                   compression='gzip',
                                   lines=True)[columns] 
                      for f in file_list], sort=False)

This is what the python dataset looks like:

In [82]:
pydf = jsonl_list_to_dataframe(pythonfiles)

In [84]:
pydf.head(3)

Unnamed: 0,repo,path,lineno,code,code_tokens,docstring,docstring_tokens,language,partition
0,samluescher/django-media-tree,media_tree/templatetags/media_tree_tags.py,21,"def file_links(items, opts=None):\n """"""""""""\n result = []\n kwargs = get_kwargs_for_file_link(opts)\n for item in items:\n if isinstance(item, FileNode):\n result.append(get_file_link(item, **kwargs))\n else:\n result.append(file_links(item, kwargs)...","[def, file_links, items, opts, None, """""""""""", result, kwargs, get_kwargs_for_file_link, opts, for, item, in, items, if, isinstance, item, FileNode, result, append, get_file_link, item, kwargs, else, result, append, file_links, item, kwargs, return, result]","Turns a (optionally nested) list of FileNode objects into a list of \nstrings, linking to the associated files.","[turns, a, (, optionally, nested, ), list, of, filenode, objects, into, a, list, of, strings, linking, to, the, associated, files, .]",python,test
1,MediaKraken/MediaKraken_Deployment,source/common/common_metadata.py,28,"def com_meta_calc_trailer_weight(trailer_file_list, title_name, title_year):\n """"""""""""\n old_weight = 0\n best_match = None\n weight = 0\n for file_name in trailer_file_list:\n weight = 0\n if file_name.lower().find('official trailer') != -1:\n weight += 3\...","[def, com_meta_calc_trailer_weight, trailer_file_list, title_name, title_year, """""""""""", old_weight, best_match, None, weight, for, file_name, in, trailer_file_list, weight, if, file_name, lower, find, 'official trailer', weight, if, file_name, lower, find, title_name, lower, weight, if, file_name...","Determine ""weight"" of file to download for trailer","[determine, weight, of, file, to, download, for, trailer]",python,test
2,jptomo/ansible-connection-nsenter,nsenter.py,44,"def _get_container_env(self):\n """"""""""""\n env_path = '/proc/{}/environ'.format(self._extract_var('Leader'))\n env_str = self._exec_command('cat ' + env_path)[2]\n proc_envs = env_str.split('\x00')\n proc_envs = dict([x.split('=') for x in proc_envs if x])\n return proc_envs\n","[def, _get_container_env, self, """""""""""", env_path, '/proc/{}/environ', format, self, _extract_var, 'Leader', env_str, self, _exec_command, 'cat ', env_path, proc_envs, env_str, split, '\x00', proc_envs, dict, x, split, '=', for, x, in, proc_envs, if, x, return, proc_envs]",return container env dict,"[return, container, env, dict]",python,test


Two columns that will be heavily used in this dataset are `code_tokens` and `docstring_tokens`, which form a parallel corpus that can be used for tasks such as information retrieval (for example, retrieving a code snippet using its docstring as the query).  You can find more information about the definitions of the above columns in the README of this repo.
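
To make the retrieval idea concrete, here is a deliberately simple sketch that ranks code snippets by token overlap with a query. This is a toy baseline chosen for illustration, not the method used in this repository; `jaccard` and `retrieve` are hypothetical helpers, and the miniature corpus is made up (loosely modelled on the rows shown above).

```python
def jaccard(a, b):
    """Jaccard similarity between two token lists, treated as sets."""
    set_a, set_b = set(a), set(b)
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 0.0


def retrieve(query_tokens, corpus):
    """Return the index of the snippet whose tokens overlap most with the query."""
    scores = [jaccard(query_tokens, code_tokens) for code_tokens in corpus]
    return max(range(len(corpus)), key=scores.__getitem__)


# Hypothetical miniature corpus of code_tokens lists.
corpus = [
    ["def", "file_links", "items", "opts", "result", "return", "result"],
    ["def", "_get_container_env", "self", "env", "dict", "return", "proc_envs"],
]

# Query built from docstring_tokens, as in the parallel corpus described above.
query = ["return", "container", "env", "dict"]

best = retrieve(query, corpus)
print(best, corpus[best][1])
```

Exact token matching like this ignores identifier sub-words (for example, `container` inside `_get_container_env`), which is one reason a learned model is usually preferred for this task.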

Next, we will read a limited subset of these columns for all of the data into memory so that we can compute summary statistics.  **Warning:** This step takes roughly 20 minutes.

In [88]:
alldf = jsonl_list_to_dataframe(allfiles, columns_short_list)
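
If only token lengths are needed, one alternative is to stream the files and keep just the lengths instead of the token lists themselves. The sketch below is an assumption-laden illustration, not part of this repository: `token_lengths_by_language` is a hypothetical helper, it has not been benchmarked against the DataFrame approach, and a synthetic file is used so the example runs on its own.

```python
import gzip
import json
import tempfile
from collections import defaultdict
from pathlib import Path


def token_lengths_by_language(file_list):
    """Stream jsonl.gz files and collect (code_len, query_len) tuples per language."""
    lengths = defaultdict(list)
    for path in file_list:
        with gzip.open(path, "rt", encoding="utf-8") as f:
            for line in f:
                if not line.strip():
                    continue  # skip blank lines
                row = json.loads(line)
                # Keep only the two lengths; the token lists are discarded.
                lengths[row["language"]].append(
                    (len(row["code_tokens"]), len(row["docstring_tokens"]))
                )
    return dict(lengths)


# Synthetic two-row file standing in for the real data.
tmp_dir = Path(tempfile.mkdtemp())
rows = [
    {"language": "python",
     "code_tokens": ["def", "f", "return", "x"],
     "docstring_tokens": ["return", "x"]},
    {"language": "java",
     "code_tokens": ["int", "f"],
     "docstring_tokens": ["returns", "an", "int"]},
]
with gzip.open(tmp_dir / "demo.jsonl.gz", "wt", encoding="utf-8") as f:
    for row in rows:
        f.write(json.dumps(row) + "\n")

lengths = token_lengths_by_language(sorted(tmp_dir.glob("**/*.gz")))
print(lengths)
```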

## Summary Statistics

### Row Counts

By Partition

In [90]:
alldf.partition.value_counts()

train    2060644
valid     519851
test      516239
Name: partition, dtype: int64

By Language

In [96]:
alldf.language.value_counts()

java      1808209
python     690299
csharp     598226
Name: language, dtype: int64

By Partition & Language

In [94]:
alldf.groupby(['partition', 'language'])['code_tokens'].count()

partition  language
test       csharp       100181
           java         301584
           python       114474
train      csharp       396736
           java        1203977
           python       459931
valid      csharp       101309
           java         302648
           python       115894
Name: code_tokens, dtype: int64
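
As a sanity check, the per-pair counts above should add up to the per-partition and per-language totals reported earlier. The following sketch does that arithmetic in plain Python, with the figures copied from the outputs in this notebook; the variable names are illustrative.

```python
from collections import defaultdict

# Counts copied from the "By Partition & Language" output above.
pair_counts = {
    ("test", "csharp"): 100181,
    ("test", "java"): 301584,
    ("test", "python"): 114474,
    ("train", "csharp"): 396736,
    ("train", "java"): 1203977,
    ("train", "python"): 459931,
    ("valid", "csharp"): 101309,
    ("valid", "java"): 302648,
    ("valid", "python"): 115894,
}

by_partition = defaultdict(int)
by_language = defaultdict(int)
for (partition, language), n in pair_counts.items():
    by_partition[partition] += n
    by_language[language] += n

total = sum(pair_counts.values())

print(dict(by_partition))
print(dict(by_language))
print(f"{total:,}")
```

The marginal sums match the "By Partition" and "By Language" outputs, which confirms that every row carries both a partition and a language label.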

### Token Lengths By Language

In [97]:
alldf['code_len'] = alldf.code_tokens.apply(len)
alldf['query_len'] = alldf.docstring_tokens.apply(len)

#### Code Length Percentile By Language

For example, the 80th-percentile code length for Python is 72 tokens.

In [98]:
code_len_summary = alldf.groupby('language')['code_len'].quantile([.5, .7, .8, .9, .95])
display(pd.DataFrame(code_len_summary))

| language | quantile | code_len |
|----------|----------|----------|
| csharp   | 0.5      | 77.0     |
| csharp   | 0.7      | 120.0    |
| csharp   | 0.8      | 164.0    |
| csharp   | 0.9      | 273.0    |
| csharp   | 0.95     | 479.0    |
| java     | 0.5      | 76.0     |
| java     | 0.7      | 118.0    |
| java     | 0.8      | 158.0    |
| java     | 0.9      | 240.0    |
| java     | 0.95     | 347.0    |
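
For readers unfamiliar with how these values are derived: by default, pandas computes a quantile by linear interpolation between the two nearest ranks of the sorted data. The sketch below reimplements that idea in plain Python on a made-up list of lengths; `quantile` here is a hypothetical helper written for illustration, not the pandas implementation itself.

```python
import math


def quantile(values, q):
    """q-th quantile (0 <= q <= 1) using linear interpolation between ranks."""
    if not values:
        raise ValueError("values must be non-empty")
    if not 0 <= q <= 1:
        raise ValueError("q must be between 0 and 1")
    ordered = sorted(values)
    pos = q * (len(ordered) - 1)  # fractional rank into the sorted list
    lo = math.floor(pos)
    hi = math.ceil(pos)
    frac = pos - lo
    # Interpolate between the two neighbouring order statistics.
    return ordered[lo] + (ordered[hi] - ordered[lo]) * frac


# Hypothetical token counts for five functions.
code_lens = [10, 20, 30, 40, 100]

median = quantile(code_lens, 0.5)  # falls exactly on the middle element
p80 = quantile(code_lens, 0.8)     # falls between 40 and 100

print(median, round(p80, 2))
```

Note how a single long function (100 tokens) pulls the upper quantiles up sharply, which mirrors the long tail visible between the 0.9 and 0.95 rows above.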


#### Query Length Percentile By Language

For example, the 80th-percentile query length for Python is 19 tokens.

In [99]:
query_len_summary = alldf.groupby('language')['query_len'].quantile([.5, .7, .8, .9, .95])
display(pd.DataFrame(query_len_summary))

| language | quantile | query_len |
|----------|----------|-----------|
| csharp   | 0.5      | 8.0       |
| csharp   | 0.7      | 12.0      |
| csharp   | 0.8      | 16.0      |
| csharp   | 0.9      | 20.0      |
| csharp   | 0.95     | 27.0      |
| java     | 0.5      | 11.0      |
| java     | 0.7      | 16.0      |
| java     | 0.8      | 22.0      |
| java     | 0.9      | 34.0      |
| java     | 0.95     | 48.0      |


#### Query Length All Languages

In [100]:
query_len_summary = alldf['query_len'].quantile([.5, .7, .8, .9, .95])
display(pd.DataFrame(query_len_summary))

| quantile | query_len |
|----------|-----------|
| 0.5      | 10.0      |
| 0.7      | 15.0      |
| 0.8      | 20.0      |
| 0.9      | 30.0      |
| 0.95     | 43.0      |
