# Data Exploration

This notebook explores the pre-processed data, and shows some basic statistics that may be useful.  

In [6]:
import json

import pandas as pd
from pathlib import Path
pd.set_option('max_colwidth',300)
from pprint import pprint

## Part 1: Preview The Dataset
    
Before downloading the entire dataset, it may be useful to explore a small sample in order to understand the format and structure of the data.  While the full dataset can be automatically downloaded with the `/script/setup` script located in this repo, we can alternatively download a subset of the data from S3.  

The s3 links follow this pattern:

> https://s3.amazonaws.com/code-search-net/CodeSearchNet/v2/{python,java,go,php,ruby,javascript}.zip

For example, the link for the `python` is:

> https://s3.amazonaws.com/code-search-net/CodeSearchNet/v2/python.zip

First we download and decompress this dataset:

In [2]:
!wget https://s3.amazonaws.com/code-search-net/CodeSearchNet/v2/python.zip

--2019-06-14 01:05:08--  https://s3.amazonaws.com/code-search-net/CodeSearchNet/v2/python.zip
Resolving s3.amazonaws.com (s3.amazonaws.com)... 52.216.184.77
Connecting to s3.amazonaws.com (s3.amazonaws.com)|52.216.184.77|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 218813834 (209M) [application/zip]
Saving to: ‘python.zip’


2019-06-14 01:05:11 (63.9 MB/s) - ‘python.zip’ saved [218813834/218813834]



In [3]:
!unzip python.zip

Archive:  python.zip
   creating: python/
   creating: python/final/
   creating: python/final/jsonl/
   creating: python/final/jsonl/valid/
  inflating: python/final/jsonl/valid/python_valid_0.jsonl.gz  
   creating: python/final/jsonl/test/
  inflating: python/final/jsonl/test/python_test_0.jsonl.gz  
   creating: python/final/jsonl/train/
  inflating: python/final/jsonl/train/python_train_7.jsonl.gz  
  inflating: python/final/jsonl/train/python_train_6.jsonl.gz  
  inflating: python/final/jsonl/train/python_train_12.jsonl.gz  
  inflating: python/final/jsonl/train/python_train_13.jsonl.gz  
  inflating: python/final/jsonl/train/python_train_0.jsonl.gz  
  inflating: python/final/jsonl/train/python_train_1.jsonl.gz  
  inflating: python/final/jsonl/train/python_train_4.jsonl.gz  
  inflating: python/final/jsonl/train/python_train_5.jsonl.gz  
  inflating: python/final/jsonl/train/python_train_9.jsonl.gz  
  inflating: python/final/jsonl/train/python_train_8.jsonl.gz  
  inflating: p

Finally, we can inspect `python/final/jsonl/test/python_test_0.jsonl.gz` to see its contents:

In [3]:
# decompress this gzip file

!gzip -d java/final/jsonl/test/java_test_0.jsonl.gz

In [32]:
!gzip -r java/final/jsonl/test/java_test_0.jsonl

Read in the file and display the first row.  The data is stored in [JSON Lines](http://jsonlines.org/) format.

In [9]:
with open('java/final/jsonl/test/java_test_0.jsonl', 'r') as f:
    sample_file = f.readlines()
sample_file[0]
print(len(sample_file))

26909


We can utilize the fact that each line in the file is valid json, and display the first row in a more human readable form:

In [6]:
pprint(json.loads(sample_file[0]))

{'code': 'protected final void fastPathOrderedEmit(U value, boolean '
         'delayError, Disposable disposable) {\n'
         '        final Observer<? super V> observer = downstream;\n'
         '        final SimplePlainQueue<U> q = queue;\n'
         '\n'
         '        if (wip.get() == 0 && wip.compareAndSet(0, 1)) {\n'
         '            if (q.isEmpty()) {\n'
         '                accept(observer, value);\n'
         '                if (leave(-1) == 0) {\n'
         '                    return;\n'
         '                }\n'
         '            } else {\n'
         '                q.offer(value);\n'
         '            }\n'
         '        } else {\n'
         '            q.offer(value);\n'
         '            if (!enter()) {\n'
         '                return;\n'
         '            }\n'
         '        }\n'
         '        QueueDrainHelper.drainLoop(q, observer, delayError, '
         'disposable, this);\n'
         '    }',
 'code_tokens': ['pr

Definitions of each of the above fields are located in the  in the README.md file in the root of this repository.

## Part 2: Exploring The Full Dataset

You will need to complete the setup steps in the README.md file located in the root of this repository before proceeding.

The training data is located in `/resources/data`, which contains approximately 3.2 Million code, comment pairs across the train, validation, and test partitions.  You can learn more about the directory structure and associated files by viewing `/resources/README.md`.

The preprocessed data re stored in [json lines](http://jsonlines.org/) format.  First, we can get a list of all these files for further inspection:

In [7]:

java_files = sorted(Path('java/').glob('**/*.gz'))

In [8]:
print(f'Total number of files: {len(java_files):,}')
for file in java_files:
    print(file.name)

Total number of files: 18
java_test_0.jsonl.gz
java_train_0.jsonl.gz
java_train_1.jsonl.gz
java_train_10.jsonl.gz
java_train_11.jsonl.gz
java_train_12.jsonl.gz
java_train_13.jsonl.gz
java_train_14.jsonl.gz
java_train_15.jsonl.gz
java_train_2.jsonl.gz
java_train_3.jsonl.gz
java_train_4.jsonl.gz
java_train_5.jsonl.gz
java_train_6.jsonl.gz
java_train_7.jsonl.gz
java_train_8.jsonl.gz
java_train_9.jsonl.gz
java_valid_0.jsonl.gz


To make analysis of this dataset easier, we can load all of the data into a pandas dataframe: 

In [9]:
columns_long_list = ['repo', 'path', 'url', 'code', 
                     'code_tokens', 'docstring', 'docstring_tokens', 
                     'language', 'partition']

columns_short_list = ['repo', 'path', 
                      'code', 'func_name','docstring']

def jsonl_list_to_dataframe(file_list, columns=columns_long_list):
    """Load a list of jsonl.gz files into a pandas DataFrame."""
    return pd.concat([pd.read_json(f, 
                                   orient='records', 
                                   compression='gzip',
                                   lines=True)[columns] 
                      for f in file_list], sort=False)

This is what the python dataset looks like:

In [10]:
javadf = jsonl_list_to_dataframe(java_files)

In [20]:
javadf.head(3)

Unnamed: 0,repo,path,url,code,code_tokens,docstring,docstring_tokens,language,partition
0,ReactiveX/RxJava,src/main/java/io/reactivex/internal/observers/QueueDrainObserver.java,https://github.com/ReactiveX/RxJava/blob/ac84182aa2bd866b53e01c8e3fe99683b882c60e/src/main/java/io/reactivex/internal/observers/QueueDrainObserver.java#L88-L108,"protected final void fastPathOrderedEmit(U value, boolean delayError, Disposable disposable) {\n final Observer<? super V> observer = downstream;\n final SimplePlainQueue<U> q = queue;\n\n if (wip.get() == 0 && wip.compareAndSet(0, 1)) {\n if (q.isEmpty()) {\n ...","[protected, final, void, fastPathOrderedEmit, (, U, value, ,, boolean, delayError, ,, Disposable, disposable, ), {, final, Observer, <, ?, super, V, >, observer, =, downstream, ;, final, SimplePlainQueue, <, U, >, q, =, queue, ;, if, (, wip, ., get, (, ), ==, 0, &&, wip, ., compareAndSet, (, 0, ...","Makes sure the fast-path emits in order.\n@param value the value to emit or queue up\n@param delayError if true, errors are delayed until the source has terminated\n@param disposable the resource to dispose if the drain terminates","[Makes, sure, the, fast, -, path, emits, in, order, .]",java,test
1,ReactiveX/RxJava,src/main/java/io/reactivex/Observable.java,https://github.com/ReactiveX/RxJava/blob/ac84182aa2bd866b53e01c8e3fe99683b882c60e/src/main/java/io/reactivex/Observable.java#L118-L124,"@CheckReturnValue\n @NonNull\n @SchedulerSupport(SchedulerSupport.NONE)\n public static <T> Observable<T> amb(Iterable<? extends ObservableSource<? extends T>> sources) {\n ObjectHelper.requireNonNull(sources, ""sources is null"");\n return RxJavaPlugins.onAssembly(new Obser...","[@, CheckReturnValue, @, NonNull, @, SchedulerSupport, (, SchedulerSupport, ., NONE, ), public, static, <, T, >, Observable, <, T, >, amb, (, Iterable, <, ?, extends, ObservableSource, <, ?, extends, T, >, >, sources, ), {, ObjectHelper, ., requireNonNull, (, sources, ,, ""sources is null"", ), ;,...","Mirrors the one ObservableSource in an Iterable of several ObservableSources that first either emits an item or sends\na termination notification.\n<p>\n<img width=""640"" height=""385"" src=""https://raw.github.com/wiki/ReactiveX/RxJava/images/rx-operators/amb.png"" alt="""">\n<dl>\n<dt><b>Scheduler:</...","[Mirrors, the, one, ObservableSource, in, an, Iterable, of, several, ObservableSources, that, first, either, emits, an, item, or, sends, a, termination, notification, ., <p, >, <img, width, =, 640, height, =, 385, src, =, https, :, //, raw, ., github, ., com, /, wiki, /, ReactiveX, /, RxJava, /,...",java,test
2,ReactiveX/RxJava,src/main/java/io/reactivex/Observable.java,https://github.com/ReactiveX/RxJava/blob/ac84182aa2bd866b53e01c8e3fe99683b882c60e/src/main/java/io/reactivex/Observable.java#L144-L158,"@SuppressWarnings(""unchecked"")\n @CheckReturnValue\n @NonNull\n @SchedulerSupport(SchedulerSupport.NONE)\n public static <T> Observable<T> ambArray(ObservableSource<? extends T>... sources) {\n ObjectHelper.requireNonNull(sources, ""sources is null"");\n int len = sources...","[@, SuppressWarnings, (, ""unchecked"", ), @, CheckReturnValue, @, NonNull, @, SchedulerSupport, (, SchedulerSupport, ., NONE, ), public, static, <, T, >, Observable, <, T, >, ambArray, (, ObservableSource, <, ?, extends, T, >, ..., sources, ), {, ObjectHelper, ., requireNonNull, (, sources, ,, ""s...","Mirrors the one ObservableSource in an array of several ObservableSources that first either emits an item or sends\na termination notification.\n<p>\n<img width=""640"" height=""385"" src=""https://raw.github.com/wiki/ReactiveX/RxJava/images/rx-operators/amb.png"" alt="""">\n<dl>\n<dt><b>Scheduler:</b><...","[Mirrors, the, one, ObservableSource, in, an, array, of, several, ObservableSources, that, first, either, emits, an, item, or, sends, a, termination, notification, ., <p, >, <img, width, =, 640, height, =, 385, src, =, https, :, //, raw, ., github, ., com, /, wiki, /, ReactiveX, /, RxJava, /, im...",java,test


Two columns that will be heavily used in this dataset are `code_tokens` and `docstring_tokens`, which represent a parallel corpus that can be used for interesting tasks like information retrieval (for example trying to retrieve a codesnippet using the docstring.).  You can find more information regarding the definition of the above columns in the README of this repo. 

Next, we will read in all of the data for a limited subset of these columns into memory so we can compute summary statistics.  **Warning:** This step takes ~ 20 minutes.

In [67]:
all_df = jsonl_list_to_dataframe(java_files, columns_short_list)
all_df.head(5)

Unnamed: 0,repo,path,code,func_name,docstring
0,ReactiveX/RxJava,src/main/java/io/reactivex/internal/observers/QueueDrainObserver.java,"protected final void fastPathOrderedEmit(U value, boolean delayError, Disposable disposable) {\n final Observer<? super V> observer = downstream;\n final SimplePlainQueue<U> q = queue;\n\n if (wip.get() == 0 && wip.compareAndSet(0, 1)) {\n if (q.isEmpty()) {\n ...",QueueDrainObserver.fastPathOrderedEmit,"Makes sure the fast-path emits in order.\n@param value the value to emit or queue up\n@param delayError if true, errors are delayed until the source has terminated\n@param disposable the resource to dispose if the drain terminates"
1,ReactiveX/RxJava,src/main/java/io/reactivex/Observable.java,"@CheckReturnValue\n @NonNull\n @SchedulerSupport(SchedulerSupport.NONE)\n public static <T> Observable<T> amb(Iterable<? extends ObservableSource<? extends T>> sources) {\n ObjectHelper.requireNonNull(sources, ""sources is null"");\n return RxJavaPlugins.onAssembly(new Obser...",Observable.amb,"Mirrors the one ObservableSource in an Iterable of several ObservableSources that first either emits an item or sends\na termination notification.\n<p>\n<img width=""640"" height=""385"" src=""https://raw.github.com/wiki/ReactiveX/RxJava/images/rx-operators/amb.png"" alt="""">\n<dl>\n<dt><b>Scheduler:</..."
2,ReactiveX/RxJava,src/main/java/io/reactivex/Observable.java,"@SuppressWarnings(""unchecked"")\n @CheckReturnValue\n @NonNull\n @SchedulerSupport(SchedulerSupport.NONE)\n public static <T> Observable<T> ambArray(ObservableSource<? extends T>... sources) {\n ObjectHelper.requireNonNull(sources, ""sources is null"");\n int len = sources...",Observable.ambArray,"Mirrors the one ObservableSource in an array of several ObservableSources that first either emits an item or sends\na termination notification.\n<p>\n<img width=""640"" height=""385"" src=""https://raw.github.com/wiki/ReactiveX/RxJava/images/rx-operators/amb.png"" alt="""">\n<dl>\n<dt><b>Scheduler:</b><..."
3,ReactiveX/RxJava,src/main/java/io/reactivex/Observable.java,"@SuppressWarnings({ ""unchecked"", ""rawtypes"" })\n @CheckReturnValue\n @NonNull\n @SchedulerSupport(SchedulerSupport.NONE)\n public static <T> Observable<T> concat(Iterable<? extends ObservableSource<? extends T>> sources) {\n ObjectHelper.requireNonNull(sources, ""sources is nul...",Observable.concat,"Concatenates elements of each ObservableSource provided via an Iterable sequence into a single sequence\nof elements without interleaving them.\n<p>\n<img width=""640"" height=""380"" src=""https://raw.github.com/wiki/ReactiveX/RxJava/images/rx-operators/concat.png"" alt="""">\n<dl>\n<dt><b>Scheduler:</..."
4,ReactiveX/RxJava,src/main/java/io/reactivex/Observable.java,"@SuppressWarnings({ ""unchecked"", ""rawtypes"" })\n @CheckReturnValue\n @NonNull\n @SchedulerSupport(SchedulerSupport.NONE)\n public static <T> Observable<T> concat(ObservableSource<? extends ObservableSource<? extends T>> sources, int prefetch) {\n ObjectHelper.requireNonNull(so...",Observable.concat,"Returns an Observable that emits the items emitted by each of the ObservableSources emitted by the source\nObservableSource, one after the other, without interleaving them.\n<p>\n<img width=""640"" height=""380"" src=""https://raw.github.com/wiki/ReactiveX/RxJava/images/rx-operators/concat.png"" alt=""..."


In [62]:
all_df.head(5)
# 假设你有一个名为df的DataFrame
row_count = len(all_df)

print("DataFrame的行数为:", row_count)

DataFrame的行数为: 496688


In [23]:
method_code = all_df.iloc[0,2]
print(type(method_code))
print(all_df.iloc[0,2])

<class 'str'>
protected final void fastPathOrderedEmit(U value, boolean delayError, Disposable disposable) {
        final Observer<? super V> observer = downstream;
        final SimplePlainQueue<U> q = queue;

        if (wip.get() == 0 && wip.compareAndSet(0, 1)) {
            if (q.isEmpty()) {
                accept(observer, value);
                if (leave(-1) == 0) {
                    return;
                }
            } else {
                q.offer(value);
            }
        } else {
            q.offer(value);
            if (!enter()) {
                return;
            }
        }
        QueueDrainHelper.drainLoop(q, observer, delayError, disposable, this);
    }


In [13]:
import os
import subprocess

def clone_repository(repository):
    """
    从GitHub上克隆仓库

    Parameters:
        repo_name (str): 仓库的名称，例如 "username/repository"

    Returns:
        str: 本地克隆仓库的路径
    """
    # 指定克隆仓库的目标文件夹
    target_folder = '/Users/zhanghanxiao/data_process/repo/'

    # 提取仓库名和拼接克隆命令
    repo_name = repository.split('/')[1]
    if os.path.exists(os.path.join(target_folder,repo_name)):
        return os.path.join(target_folder,repo_name)
    else:
        # clone_command = f'git clone https://github.com/{repository}.git {os.path.join(target_folder, repo_name)}'
        clone_command = f'git clone git@github.com:{repository}.git {os.path.join(target_folder, repo_name)}'
        # 执行克隆命令
        subprocess.run(clone_command, shell=True)
        local_path = os.path.join(target_folder,repo_name)
        print(local_path)
        return local_path
        
    

In [14]:
from tree_sitter import Parser, Language
from java_parser import JavaParser

Language.build_library(
  # Store the library in the `build` directory
  'my-languages.so',

  # Include one or more languages
  [
    'tree-sitter-java',
  ]
)

False

In [15]:
JAVAPARSER = Parser()
JAVAPARSER.set_language(Language("my-languages.so", "java"))
jp = JavaParser(parser=JAVAPARSER)

PARSER = JAVAPARSER
sp_parser = jp

In [46]:
import re
from difflib import SequenceMatcher

def preprocess_code(code):
    # 使用正则表达式替换不可见字符为空字符串
    return re.sub(r'\s+', '', code)

def find_java_file(project_dir, file_name):
    for root, _, files in os.walk(project_dir):
        for file in files:
            if file == file_name:
                return os.path.join(root, file)
    return None

# 修剪语法树，只保留某个类中方法的签名，去除方法体
# def prune_method_bodies(node):
   
    # # 去掉构造方法方法体
    # if node.type == "constructor_delcaration":
    #     children_copy = node.children[:]  # 创建原始 children 的副本
    #     for child in children_copy:
    #         if child.type == "constructor_body":
    #             method_positions.append((child.start_byte, child.end_byte))
    #             break
    
    # # 去掉普通方法方法体
    # if node.type == "method_declaration":
    #      children_copy = node.children[:]  # 创建原始 children 的副本
    #      for child in children_copy:
    #         if child.type == "block":
    #             # node.children.remove(child)  # 移除该节点
    #             # print(child.text.decode('utf-8'))
    #             # Remove the method body by replacing it with an empty string
    #             method_positions.append((child.start_byte, child.end_byte))
    #             break
    # for child in node.children:
    #     prune_method_bodies(child)

# 修剪语法树，去掉包声明语句
# def delete_package_declaration(node):
#     if node.type == 'package_declaration':
#         method_positions.append((node.start_byte, node.end_byte))
#     for child in node.children:
#         delete_package_declaration(child)

# def get_icontext(code,modified_code_parts,last_pos):
#     for start_pos, end_pos in method_positions:
#         modified_code_parts.append(code[last_pos:start_pos])
#         last_pos = end_pos
#     modified_code_parts.append(code[last_pos:])
#     modified_code = ''.join(modified_code_parts)
#     return modified_code

def extract_in_file_code(df):
    """
    提取函数信息和代码上下文的函数

    Parameters:
        df (DataFrame): 包含函数信息的DataFrame

    Returns:
        list of dict: 包含提取的函数信息和代码上下文的列表
    """

    # 查找方法节点
    def dfs_find_method_node(node):
        if node.type == "method_declaration" :
            # print(node)
            file_code = file_content[node.start_byte: node.end_byte]
            # print(SequenceMatcher(a=file_content[node.start_byte: node.end_byte],b=method_code).ratio())
            if preprocess_code(file_code) == preprocess_code(function_code):
                method_positions.append((node.start_byte,node.end_byte))
                # print(SequenceMatcher(a=preprocess_code(file_code),b=preprocess_code(method_code)).ratio())
                return node
         # 递归查找子节点
        for child in node.children:
            result = dfs_find_method_node(child)
            if result:
                return result
        return None
    
    # 修剪语法树，只保留某个类中方法的签名，去除方法体
    def prune_method_bodies(node):
        # 去掉构造方法方法体
        if node.type == "constructor_delcaration":
            children_copy = node.children[:]  # 创建原始 children 的副本
            for child in children_copy:
                if child.type == "constructor_body":
                    method_positions.append((child.start_byte, child.end_byte))
                    break
    
        # 去掉普通方法方法体
        if node.type == "method_declaration":
            children_copy = node.children[:]  # 创建原始 children 的副本
            for child in children_copy:
                if child.type == "block":
                    # node.children.remove(child)  # 移除该节点
                    # print(child.text.decode('utf-8'))
                    # Remove the method body by replacing it with an empty string
                    method_positions.append((child.start_byte, child.end_byte))
                    break
        for child in node.children:
            prune_method_bodies(child)

    # 修剪语法树，去掉包声明语句
    def delete_package_declaration(node):
        if node.type == 'package_declaration':
            method_positions.append((node.start_byte, node.end_byte))
        for child in node.children:
            delete_package_declaration(child)
    
    function_info_list = []
    subset_df = df.head(10)
    
    for index, row in subset_df.iterrows():
        
        method_positions = []

        repo = row['repo']  # 函数所在的仓库名
        dfile_path = row['path']  # 函数所在的项目文件的路径【检查了一下发现原数据集给的文件路径有问题】
        function_code = row['code']  # 函数源代码

        file_name = dfile_path.split("/")[-1]

        # 从GitHub上克隆仓库并获取本地克隆仓库的路径
        local_repo_path = clone_repository(repo)
    

        # 文件的绝对路径
        file_path = find_java_file(local_repo_path,file_name)

        # 使用文件路径读取对应的文件内容，获取函数所在文件的代码内容
        with open(file_path,'r') as file:
            file_content = file.read()
        # 解析Java代码
        tree = PARSER.parse(file_content.encode())
        
        # 获取上下文
        java_code = file_content

        # 获取函数所在文件的代码修改位置
        method_node = dfs_find_method_node(tree.root_node)
        prune_method_bodies(tree.root_node)
        delete_package_declaration(tree.root_node)
        
        modified_code_parts = []
        last_pos = 0
        modified_code = []
        print(method_positions)
        # 对列表进行排序，按照元组中的第一个元素进行升序排序
        method_positions = sorted(method_positions)
        
        for start_pos, end_pos in method_positions:
            print(start_pos,end_pos)
            modified_code_parts.append(java_code[last_pos:start_pos])
            last_pos = end_pos
        modified_code_parts.append(java_code[last_pos:])
        modified_code = ''.join(modified_code_parts)
        code_context = modified_code
        

        # 构建函数信息字典并添加到列表中
        function_info = {
            'function_code': function_code,
            'code_context': code_context
        }
        function_info_list.append(function_info)

    return function_info_list

# 获取DataFrame的大小
rows, cols = all_df.shape
print("all_df有{}行和{}列。".format(rows, cols))

# 获取文件内上下文

function_context = extract_in_file_code(all_df)
print(len(function_context))
print(type(function_context))

import jsonlines

# 指定要保存的JSON文件路径
json_file_path = "test1.jsonl"

# 将列表中每个元素保存为 jsonl 格式的文件
with jsonlines.open(json_file_path, mode='w') as writer:
    for data in function_context:
        writer.write(data)



all_df有496688行和5列。
[(2806, 3489), (1758, 1791), (1839, 1867), (1916, 1966), (2055, 2528), (2899, 3489), (3540, 3569), (3619, 3659), (3726, 3763), (605, 653)]
605 653
1758 1791
1839 1867
1916 1966
2055 2528
2806 3489
2899 3489
3540 3569
3619 3659
3726 3763
[(6171, 6315), (7967, 8293), (8753, 8798), (11273, 11343), (14041, 14452), (16885, 16960), (19592, 20076), (22176, 22491), (24830, 25214), (27819, 28272), (31118, 31640), (34714, 35305), (38662, 39322), (42910, 43639), (47467, 48265), (50712, 50797), (53537, 54019), (56588, 56668), (59439, 59849), (60854, 61032), (62162, 62215), (63610, 63870), (65051, 65224), (66593, 66835), (68355, 68666), (69655, 70040), (70935, 71311), (72557, 72634), (74388, 74522), (75864, 75951), (77801, 77934), (79066, 79189), (80355, 80424), (82110, 82406), (83668, 83740), (85612, 85749), (87072, 87144), (89066, 89178), (90526, 90608), (92566, 92702), (94124, 94206), (96212, 96340), (98705, 98843), (100536, 100679), (101622, 101712), (102886, 103029), (104156

In [47]:
with open('test1.jsonl', 'r') as f:
    sample_file = f.readlines()
pprint(json.loads(sample_file[0]))

{'code_context': '/*\n'
                 ' * Copyright (c) 2016-present, RxJava Contributors.\n'
                 ' *\n'
                 ' * Licensed under the Apache License, Version 2.0 (the '
                 '"License"); you may not use this file except in\n'
                 ' * compliance with the License. You may obtain a copy of the '
                 'License at\n'
                 ' *\n'
                 ' * http://www.apache.org/licenses/LICENSE-2.0\n'
                 ' *\n'
                 ' * Unless required by applicable law or agreed to in '
                 'writing, software distributed under the License is\n'
                 ' * distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR '
                 'CONDITIONS OF ANY KIND, either express or implied. See\n'
                 ' * the License for the specific language governing '
                 'permissions and limitations under the License.\n'
                 ' */\n'
                 '\n'
                 '\n'


In [11]:
local_repo_path = '/Users/zhanghanxiao/dataset/2/springboot_swagger_example/src/main/java/guru/springframework/domain/Product.java'
with open(local_repo_path, 'r') as file:
            file_content = file.read()

# 使用javalang解析Java代码
tree = javalang.parse.parse(file_content)

type(tree)

tree.children
tree.types[0].bod


['package', 'imports', 'types']

[PackageDeclaration(annotations=None, documentation=None, modifiers=None, name=guru.springframework.domain),
 [Import(path=io.swagger.annotations.ApiModelProperty, static=False, wildcard=False),
  Import(path=javax.persistence, static=False, wildcard=True),
  Import(path=java.math.BigDecimal, static=False, wildcard=False)],
 [ClassDeclaration(annotations=[Annotation(element=None, name=Entity)], body=[FieldDeclaration(annotations=[Annotation(element=None, name=Id), Annotation(element=[ElementValuePair(name=strategy, value=MemberReference(member=AUTO, postfix_operators=[], prefix_operators=[], qualifier=GenerationType, selectors=[]))], name=GeneratedValue), Annotation(element=[ElementValuePair(name=notes, value=Literal(postfix_operators=[], prefix_operators=[], qualifier=None, selectors=[], value="The database generated product ID"))], name=ApiModelProperty)], declarators=[VariableDeclarator(dimensions=[], initializer=None, name=id)], documentation=None, modifiers={'private'}, type=Refer

In [32]:
with open('QueueDrainObserver.java', 'r') as file:
            file_content = file.read()


# 解析Java代码
tree = PARSER.parse(file_content.encode())

tree.root_node.children



[<Node type=block_comment, start_point=(0, 0), end_point=(11, 3)>,
 <Node type=package_declaration, start_point=(13, 1), end_point=(13, 49)>,
 <Node type=import_declaration, start_point=(15, 1), end_point=(15, 50)>,
 <Node type=import_declaration, start_point=(17, 1), end_point=(17, 43)>,
 <Node type=import_declaration, start_point=(18, 1), end_point=(18, 52)>,
 <Node type=import_declaration, start_point=(19, 1), end_point=(19, 45)>,
 <Node type=import_declaration, start_point=(20, 1), end_point=(20, 56)>,
 <Node type=block_comment, start_point=(22, 1), end_point=(29, 4)>,
 <Node type=class_declaration, start_point=(30, 1), end_point=(119, 2)>,
 <Node type=line_comment, start_point=(121, 1), end_point=(121, 71)>,
 <Node type=line_comment, start_point=(122, 1), end_point=(122, 24)>,
 <Node type=line_comment, start_point=(123, 1), end_point=(123, 70)>,
 <Node type=block_comment, start_point=(125, 1), end_point=(125, 47)>,
 <Node type=class_declaration, start_point=(126, 1), end_point=(12

In [61]:
root_node = tree.root_node

# Find all import statements
import_nodes = []
import_stats = []
def traverse(node):
    if node.type == 'import_declaration':
        import_nodes.append(node)
    for child in node.children:
        traverse(child)

traverse(root_node)

for node in import_nodes:
    for child in node.children:
        if child.type == "scoped_identifier":
            import_stats.append(child.text.decode('utf-8'))
            break
print(import_stats)

for import_stat in import_stats:
    if import_stat.startwith('java'):
        continue
    file_name  = 




['java.util.concurrent.atomic.AtomicInteger', 'io.reactivex.rxjava3.core.Observer', 'io.reactivex.rxjava3.disposables.Disposable', 'io.reactivex.rxjava3.internal.util', 'io.reactivex.rxjava3.operators.SimplePlainQueue']


In [39]:
method_code

'protected final void fastPathOrderedEmit(U value, boolean delayError, Disposable disposable) {\n        final Observer<? super V> observer = downstream;\n        final SimplePlainQueue<U> q = queue;\n\n        if (wip.get() == 0 && wip.compareAndSet(0, 1)) {\n            if (q.isEmpty()) {\n                accept(observer, value);\n                if (leave(-1) == 0) {\n                    return;\n                }\n            } else {\n                q.offer(value);\n            }\n        } else {\n            q.offer(value);\n            if (!enter()) {\n                return;\n            }\n        }\n        QueueDrainHelper.drainLoop(q, observer, delayError, disposable, this);\n    }'

In [43]:
java_code = file_content

method_positions = []

# 修剪语法树，只保留某个类中方法的签名，去除方法体
def prune_method_bodies(node):

    # 去掉构造方法方法体
    if node.type == "constructor_delcaration":
        children_copy = node.children[:]  # 创建原始 children 的副本
        for child in children_copy:
            if child.type == "constructor_body":
                method_positions.append((child.start_byte, child.end_byte))
                break
    
    # 去掉普通方法方法体
    if node.type == "method_declaration":
         children_copy = node.children[:]  # 创建原始 children 的副本
         for child in children_copy:
            if child.type == "block":
                # node.children.remove(child)  # 移除该节点
                # print(child.text.decode('utf-8'))
                # Remove the method body by replacing it with an empty string
                method_positions.append((child.start_byte, child.end_byte))
                break
    for child in node.children:
        prune_method_bodies(child)

def delete_package_declaration(node):
    if node.type == 'package_declaration':
        method_positions.append((node.start_byte, node.end_byte))
    for child in node.children:
        delete_package_declaration(child)

prune_method_bodies(tree.root_node)
print(method_positions)
delete_package_declaration(tree.root_node)
print(method_positions)

# 对列表进行排序，按照元组中的第一个元素进行升序排序
method_positions = sorted(method_positions)

# Generate modified source code without method bodies
modified_code_parts = []
last_pos = 0
for start_pos, end_pos in method_positions:
    modified_code_parts.append(java_code[last_pos:start_pos])
    last_pos = end_pos
modified_code_parts.append(java_code[last_pos:])
modified_code = ''.join(modified_code_parts)

print(modified_code)


        

[(1790, 1825), (1876, 1906), (1958, 2010), (2101, 2590), (2969, 3579), (3633, 3664), (3717, 3759), (3829, 3868)]
[(1790, 1825), (1876, 1906), (1958, 2010), (2101, 2590), (2969, 3579), (3633, 3664), (3717, 3759), (3829, 3868), (606, 654)]
/*
 * Copyright (c) 2016-present, RxJava Contributors.
 *
 * Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in
 * compliance with the License. You may obtain a copy of the License at
 *
 * http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software distributed under the License is
 * distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See
 * the License for the specific language governing permissions and limitations under the License.
 */

 

 import java.util.concurrent.atomic.AtomicInteger;
 
 import io.reactivex.rxjava3.core.Observer;
 import io.reactivex.rxjava3.disposables.Disposable;
 imp

In [51]:
import sys

cursor = tree.walk()
method_contexts = []

# 查找方法节点
def dfs_find_method_node(node):
    if node.type == "method_declaration" :
        # print(node)
        file_code = file_content[node.start_byte: node.end_byte]
        # print(SequenceMatcher(a=file_content[node.start_byte: node.end_byte],b=method_code).ratio())
        if preprocess_code(file_code) == preprocess_code(function_code):
            method_positions.append((node.start_byte,node.end_byte))
            # print(SequenceMatcher(a=preprocess_code(file_code),b=preprocess_code(method_code)).ratio())
            return node
    # 递归查找子节点
    for child in node.children:
        result = dfs_find_method_node(child)
        if result:
            return result
    return None

# 遍历语法树并查找方法节点
method_node = dfs_find_method_node(tree.root_node)
if method_node is None:
    print("method_node is None")
else: print(method_node)

# 遍历语法树，查找导入语句并添加到上下文中
cursor.goto_first_child()
while True:
    if cursor.node.type == "import_declaration":
        import_code = file_content[cursor.node.start_byte: cursor.node.end_byte]
        method_contexts.append(import_code)
    if not cursor.goto_next_sibling():
        break

# # 遍历语法树，查找类声明语句并添加到上下文列表
# cursor.goto_parent()
# cursor.goto_first_child()
# while True:
#     if cursor.node.type == "class_declaration" and cursor.node.start_byte <= method_node.start_byte and cursor.node.end_byte >= method_node.end_byte:
#         class_declaration = file_content[cursor.node.start_byte: cursor.node.end_byte]
#         method_contexts.append(class_declaration)
#         break
#     if not cursor.goto_next_sibling():
#         break

# 将上下文信息按原文件中的顺序写入一个新的文件
with open("context.txt", "w") as output_file:
    for context in method_contexts:
        output_file.write(context + "\n")

<Node type=method_declaration, start_point=(44, 5), end_point=(47, 6)>
0.06266318537859007
<Node type=method_declaration, start_point=(49, 5), end_point=(52, 6)>
0.05555555555555555
<Node type=method_declaration, start_point=(54, 5), end_point=(57, 6)>
0.08472400513478819
<Node type=method_declaration, start_point=(59, 5), end_point=(75, 6)>
0.8701195219123506
<Node type=method_declaration, start_point=(83, 5), end_point=(103, 6)>
0.961038961038961
1.0
<Node type=method_declaration, start_point=(83, 5), end_point=(103, 6)>


In [None]:
import_stat = []
import_node = []
for child in tree.root_node.children:
    if child.type == "import_declaration":
        import_stat.append(child)
        import_stat.append(child)


In [None]:
def get_java_file_context(file_name, project_dir):
    
    # 使用文件路径读取对应的文件内容，获取函数所在文件的代码内容
    with open(file_name,'r') as file:
        file_content = file.read()
    # 解析Java代码
    tree = PARSER.parse(file_content.encode())

    # Find all import statements
    import_nodes = []

    def traverse(node):
        if node.type == 'import_declaration':
            import_nodes.append(node)

        for child in node.children:
            traverse(child)

    traverse(tree.root_node)

    # Get the project context for each Java file
    java_files_context = []

    for import_node in import_nodes:
        import_stat = import_node.child_by_field_name(' scoped_identifier')
        # package_name = import_statement.child_by_field_name('package_name').string

        # Build the Java file path using package name and file name
        java_file_path = os.path.join(project_dir, package_name.replace('.', os.sep), file_name + '.java')

        # Get the project context for the Java file
        context = get_project_context(java_file_path)

        java_files_context.append((java_file_path, context))

    return java_files_context

def get_project_context(java_file_path):
    # Add your code here to get the project context for the Java file
    # You can read the Java file, analyze its contents, and extract the necessary project context
    # For example, you can use the JavaParser library or other parsing techniques to analyze the Java file
    # Return the project context for the Java file
    return "ProjectContext"  # Replace this with the actual project context

# Test code
if __name__ == "__main__":
    java_code = """
    import java.util.ArrayList;
    import java.util.List;

    public class MyClass {
        // class code here
    }
    """

    project_dir = "/path/to/your/project"  # Replace with your actual project directory

    java_files_context = get_java_file_context(java_code, project_dir)
    print(java_files_context)






## Summary Statistics

### Row Counts

By repo

In [63]:
all_df.repo.value_counts()

repo
aws/aws-sdk-java                            22028
OpenLiberty/open-liberty                    19469
alkacon/opencms-core                        10507
Azure/azure-sdk-for-java                    10240
google/j2objc                                8046
                                            ...  
balysv/material-menu                            1
electrumpayments/service-interface-base         1
akquinet/android-archetypes                     1
relops/snowflake                                1
elastic/elasticsearch-mapper-attachments        1
Name: count, Length: 4769, dtype: int64

In [66]:
# 使用条件选择来筛选出指定的行
filtered_df = all_df[all_df['repo'] == 'deeplearning4j/deeplearning4j']
print(filtered_df.head(5))

                                repo  \
16221  deeplearning4j/deeplearning4j   
16222  deeplearning4j/deeplearning4j   
16223  deeplearning4j/deeplearning4j   
16224  deeplearning4j/deeplearning4j   
16225  deeplearning4j/deeplearning4j   

                                                                                                         path  \
16221  nd4j/nd4j-backends/nd4j-api-parent/nd4j-api/src/main/java/org/nd4j/autodiff/samediff/ops/SDRandom.java   
16222  nd4j/nd4j-backends/nd4j-api-parent/nd4j-api/src/main/java/org/nd4j/autodiff/samediff/ops/SDRandom.java   
16223  nd4j/nd4j-backends/nd4j-api-parent/nd4j-api/src/main/java/org/nd4j/autodiff/samediff/ops/SDRandom.java   
16224  nd4j/nd4j-backends/nd4j-api-parent/nd4j-api/src/main/java/org/nd4j/autodiff/samediff/ops/SDRandom.java   
16225  nd4j/nd4j-backends/nd4j-api-parent/nd4j-api/src/main/java/org/nd4j/autodiff/samediff/ops/SDRandom.java   

                                                                                

In [None]:
result_series = all_df.repo.value_counts()
type(result_series)
result_df = result_series.reset_index()
result_df.columns = ['repo','count']

result_df.to_csv('repo.csv', index=False)

In [None]:
# 获取'repo'列的唯一值
unique_repos = all_df['repo'].unique()

# 打印'repo'列的唯一值数组
len(unique_repos)

4769

By Language

In [None]:
all_df.language.value_counts()

AttributeError: 'DataFrame' object has no attribute 'language'

By Partition & Language

In [None]:
all_df.groupby(['partition', 'language'])['code_tokens'].count()

partition  language  
test       go             14291
           java           26909
           javascript      6483
           php            28391
           python         22176
           ruby            2279
train      go            317832
           java          454451
           javascript    123889
           php           523712
           python        412178
           ruby           48791
valid      go             14242
           java           15328
           javascript      8253
           php            26015
           python         23107
           ruby            2209
Name: code_tokens, dtype: int64

### Token Lengths By Language

In [None]:
all_df['code_len'] = all_df.code_tokens.apply(lambda x: len(x))
all_df['query_len'] = all_df.docstring_tokens.apply(lambda x: len(x))

#### Code Length Percentile By Language

For example, the 80th percentile length for python tokens is 72

In [None]:
code_len_summary = all_df.groupby('language')['code_len'].quantile([.5, .7, .8, .9, .95])
display(pd.DataFrame(code_len_summary))

Unnamed: 0_level_0,Unnamed: 1_level_0,code_len
language,Unnamed: 1_level_1,Unnamed: 2_level_1
go,0.5,61.0
go,0.7,100.0
go,0.8,138.0
go,0.9,217.0
go,0.95,319.0
java,0.5,66.0
java,0.7,104.0
java,0.8,142.0
java,0.9,224.0
java,0.95,331.0


#### Query Length Percentile By Language

For example, the 80th percentile length for python tokens is 19

In [None]:
query_len_summary = all_df.groupby('language')['query_len'].quantile([.5, .7, .8, .9, .95])
display(pd.DataFrame(query_len_summary))

Unnamed: 0_level_0,Unnamed: 1_level_0,query_len
language,Unnamed: 1_level_1,Unnamed: 2_level_1
go,0.5,12.0
go,0.7,19.0
go,0.8,28.0
go,0.9,49.0
go,0.95,92.0
java,0.5,11.0
java,0.7,18.0
java,0.8,25.0
java,0.9,39.0
java,0.95,61.0


#### Query Length All Languages

In [None]:
query_len_summary = all_df['query_len'].quantile([.5, .7, .8, .9, .95])
display(pd.DataFrame(query_len_summary))

Unnamed: 0,query_len
0.5,10.0
0.7,15.0
0.8,20.0
0.9,32.0
0.95,50.0
