# Java Method Name prediction using Code2Vec

I use the original authors' implementation of code2vec, which was trained on a 32GB dataset (about 14M examples). Training on a dataset this large is extremely time-consuming, so I use their pre-trained model for prediction. The authors also provide a trainable model for further training, but continuing to train such a large model would again be time-consuming.

I have already downloaded and unzipped the required data and models (pre-trained and trainable) into my Google Drive. See https://github.com/tech-srl/code2vec for detailed information on how to get these resources.

**Note:** There is a mistake in the instructions.

The command they provide for extracting the trainable model is:
```
tar -xvzf java14m_model_trainable.tar
```
The archive name is missing its `.gz` extension. The command should be:
```
tar -xvzf java14m_model_trainable.tar.gz
```
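
If you would rather unpack the archive from a notebook cell than from the shell, the same extraction can be sketched with Python's standard `tarfile` module. The helper name `extract_archive` and the default destination directory are my own choices for illustration; only the archive name comes from the command above.

```python
import tarfile
from pathlib import Path


def extract_archive(archive_path, dest_dir="."):
    """Extract a gzip-compressed tar archive and return the member names."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    # "r:gz" raises an error if the file is not gzip-compressed,
    # which plays the role of the -z flag in `tar -xvzf`.
    with tarfile.open(archive_path, mode="r:gz") as archive:
        names = archive.getnames()
        archive.extractall(dest)
    return names


if __name__ == "__main__":
    archive = Path("java14m_model_trainable.tar.gz")
    if archive.exists():
        for name in extract_archive(archive):
            print(name)
    else:
        print(f"{archive} not found; download it first")
```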

## Setting up drive

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [2]:
%cd 'drive/Shareddrives/With Myself/huge-datasets-for-colab/code2vec'

/content/drive/Shareddrives/With Myself/huge-datasets-for-colab/code2vec


## Checking Requirements

In [3]:
import tensorflow as tf

In [4]:
!python3 --version

Python 3.7.10


In [5]:
tf.__version__

'2.4.1'

In [6]:
!nvcc --version

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2020 NVIDIA Corporation
Built on Wed_Jul_22_19:09:09_PDT_2020
Cuda compilation tools, release 11.0, V11.0.221
Build cuda_11.0_bu.TC445_37.28845127_0


## Prediction using pre-trained model

### Prediction on the whole test dataset

In [7]:
!python3 code2vec.py --load models/java14_model/saved_model_iter8.release --test data/java14m/java14m.test.c2v

2021-04-28 17:12:58.927807: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
2021-04-28 17:13:00.654471: I tensorflow/compiler/jit/xla_cpu_device.cc:41] Not creating XLA devices, tf_xla_enable_xla_devices not set
2021-04-28 17:13:00.655243: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcuda.so.1
2021-04-28 17:13:00.687463: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:941] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-04-28 17:13:00.688073: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1720] Found device 0 with properties: 
pciBusID: 0000:00:04.0 name: Tesla T4 computeCapability: 7.5
coreClock: 1.59GHz coreCount: 40 deviceMemorySize: 14.75GiB deviceMemoryBandwidth: 298.08GiB/s
2021-04-28 17:13:00.688110: I tensorflow/stream_executor/platform/default/dso_loade

In [8]:
!ls

build_extractor.sh		    keras_topk_word_predictions_layer.py
code2vec.py			    keras_word_prediction_layer.py
common.py			    keras_words_subtoken_metrics.py
config.py			    LICENSE
CSharpExtractor			    log.txt
data				    model_base.py
extractor.py			    models
images				    path_context_reader.py
__init__.py			    preprocess_csharp.sh
Input.java			    preprocess.py
interactive_predict.py		    preprocess.sh
java14m_data.tar.gz		    __pycache__
java14m_model.tar.gz		    README.md
java14m_model_trainable.tar.gz	    requirements.txt
JavaExtractor			    tensorflow_model.py
keras_attention_layer.py	    train.sh
keras_checkpoint_saver_callback.py  vocabularies.py
keras_model.py


Running the test command above (cell 7) creates a file named `log.txt`, which records the name of each test example together with the model's prediction. The test set contains about 0.3M examples, so the log file is very large.

See the outputs [here](https://drive.google.com/file/d/1QosEoEgx-zaor0qltI1rSXVXo2iJQB59/view?usp=sharing).

In many cases the model does not produce any prediction.
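
To put a number on this, the log can be scanned for the lines that mark a missing prediction. The sketch below assumes that such lines start with the text `No results for predicting`. That marker is an assumption about the log format, and the helper name `count_missing_predictions` is hypothetical, so adjust both to match what `log.txt` actually contains.

```python
import os


def count_missing_predictions(lines, marker="No results for predicting"):
    """Return (missing, total) over the non-empty lines of a prediction log.

    `marker` is an ASSUMED prefix for lines that report no prediction.
    """
    missing = 0
    total = 0
    for line in lines:
        text = line.strip()
        if not text:
            continue  # ignore blank lines
        total += 1
        if text.startswith(marker):
            missing += 1
    return missing, total


if __name__ == "__main__":
    if os.path.exists("log.txt"):
        # Iterate over the file object so the large log is streamed line by line.
        with open("log.txt", encoding="utf-8", errors="replace") as log_file:
            missing, total = count_missing_predictions(log_file)
        print(f"{missing} of {total} log lines report no prediction")
```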

### Prediction on input data

In [10]:
!python3 code2vec.py --load models/java14_model/saved_model_iter8.release --predict

2021-04-28 17:16:11.269278: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
2021-04-28 17:16:13.042920: I tensorflow/compiler/jit/xla_cpu_device.cc:41] Not creating XLA devices, tf_xla_enable_xla_devices not set
2021-04-28 17:16:13.045426: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcuda.so.1
2021-04-28 17:16:13.074513: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:941] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-04-28 17:16:13.075119: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1720] Found device 0 with properties: 
pciBusID: 0000:00:04.0 name: Tesla T4 computeCapability: 7.5
coreClock: 1.59GHz coreCount: 40 deviceMemorySize: 14.75GiB deviceMemoryBandwidth: 298.08GiB/s
2021-04-28 17:16:13.075153: I tensorflow/stream_executor/platform/default/dso_loade

After a couple of trial-and-error runs, it seems the model predicts only a default result for input functions. My guess is that this is caused by the *Jupyter Notebook* interface, but I could be wrong. I need to run the model offline in the future, since my current configuration is not good enough to run such models.
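
Another possible explanation is worth ruling out first. According to the code2vec README, the interactive predictor takes the method to name from the file `Input.java` in the repository root, so if that file is never edited, every run would score the same bundled example. The sketch below overwrites `Input.java` with a method of your choice before the `--predict` cell is run. The helper name `write_input_method` and the sample Java method are hypothetical, and I am assuming the predictor is launched from the directory that contains `Input.java`.

```python
from pathlib import Path

# Hypothetical sample method; replace it with the code you want named.
SAMPLE_METHOD = """\
int f(int[] values) {
    int total = 0;
    for (int v : values) {
        total += v;
    }
    return total;
}
"""


def write_input_method(java_source, repo_dir="."):
    """Write `java_source` to Input.java inside `repo_dir` and return the path."""
    target = Path(repo_dir) / "Input.java"
    target.write_text(java_source, encoding="utf-8")
    return target


# Example (run from the code2vec directory, before the --predict cell):
# write_input_method(SAMPLE_METHOD)
```

If the prediction changes after editing the file, the notebook interface is probably not the cause.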

### Checking the embeddings created by the model

In [None]:
from gensim.models import KeyedVectors as word2vec
vectors_text_path = 'models/java14_model/targets.txt' # or: 'models/java14_model/tokens.txt'
model = word2vec.load_word2vec_format(vectors_text_path, binary=False)
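
`load_word2vec_format` with `binary=False` expects the plain-text word2vec layout: a header line with the vocabulary size and the vector dimension, followed by one line per word holding the word and its space-separated components. The sketch below parses that layout by hand on a tiny made-up sample, mainly to show what `targets.txt` and `tokens.txt` are expected to look like. The function name `read_word2vec_text` and the sample values are hypothetical; for real use, gensim's loader is the better choice.

```python
import io


def read_word2vec_text(handle):
    """Parse word2vec text format: a 'count dim' header, then 'word v1 ... vD' lines."""
    count, dim = (int(part) for part in handle.readline().split())
    vectors = {}
    for line in handle:
        parts = line.split()
        if len(parts) != dim + 1:
            continue  # skip blank or malformed lines
        vectors[parts[0]] = [float(value) for value in parts[1:]]
    if len(vectors) != count:
        raise ValueError(f"header promised {count} vectors, found {len(vectors)}")
    return vectors


# Tiny made-up file contents, for illustration only.
sample = io.StringIO("2 3\nget|name 0.1 0.2 0.3\nset|name -0.1 0.0 0.5\n")
embeddings = read_word2vec_text(sample)
print(sorted(embeddings))
```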

In [16]:
model.most_similar(positive=['equals', 'to|lower']) # or: 'tolower', if using the downloaded embeddings

[('equals|ignore|case', 0.42434296011924744),
 ('to|lower|case', 0.3609984219074249),
 ('to|upper', 0.35575786232948303),
 ('camel|case', 0.35217034816741943),
 ('equals|normalized', 0.3509151041507721),
 ('standard|equals', 0.33294883370399475),
 ('equal|match', 0.3233056664466858),
 ('is|upper|case', 0.31760913133621216),
 ('get|color|array', 0.31566670536994934),
 ('test|equals|equivalent', 0.3095047175884247)]

In [17]:
model.most_similar(positive=['download', 'send'], negative=['receive'])

[('send|message', 0.2544481158256531),
 ('wait|for|completion', 0.22676804661750793),
 ('checkout', 0.22608843445777893),
 ('display', 0.22561199963092804),
 ('create|surface', 0.221379816532135),
 ('request|url', 0.22023360431194305),
 ('play', 0.21058186888694763),
 ('subscribe', 0.2058645784854889),
 ('replace', 0.20528222620487213),
 ('href', 0.2050342857837677)]
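
For readers unfamiliar with what `most_similar` computes: as far as I understand gensim's implementation, it length-normalizes each query vector, averages the positive vectors with weight +1 and the negative ones with weight -1, and ranks the rest of the vocabulary by similarity to that mean. The sketch below reproduces the idea in plain Python on a made-up three-dimensional vocabulary. The vectors and the function name `most_similar_sketch` are illustrative only and unrelated to the real code2vec embeddings, and the exact score scaling may differ between gensim versions.

```python
import math


def _unit(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def most_similar_sketch(vectors, positive, negative=(), topn=3):
    """Rank words by cosine similarity to the signed mean of the query vectors."""
    unit = {word: _unit(vec) for word, vec in vectors.items()}
    dim = len(next(iter(unit.values())))
    signed = [(w, 1.0) for w in positive] + [(w, -1.0) for w in negative]
    mean = [
        sum(sign * unit[w][i] for w, sign in signed) / len(signed)
        for i in range(dim)
    ]
    query = _unit(mean)
    # Score every word that is not part of the query itself.
    scores = [
        (word, sum(a * b for a, b in zip(query, vec)))
        for word, vec in unit.items()
        if word not in positive and word not in negative
    ]
    return sorted(scores, key=lambda item: item[1], reverse=True)[:topn]


# Made-up toy vectors, for illustration only.
toy = {
    "send": [1.0, 0.0, 0.0],
    "receive": [-1.0, 0.0, 0.2],
    "download": [0.0, 1.0, 0.0],
    "upload": [0.7, 0.7, 0.0],
    "parse": [0.0, 0.0, 1.0],
}
print(most_similar_sketch(toy, positive=["download", "send"], negative=["receive"]))
```

With these toy vectors, `upload` ranks first because it points in both the `send` and `download` directions.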

In [18]:
vectors_text_path = 'models/java14_model/tokens.txt'
model = word2vec.load_word2vec_format(vectors_text_path, binary=False)

In [20]:
model.most_similar(positive=['equals', 'tolower']) # token keys have no '|' separator; use 'to|lower' with the target embeddings

[('equalsignorecase', 0.528664231300354),
 ('startvertexendpos', 0.4887391924858093),
 ('myequals', 0.48323819041252136),
 ('invariantlocalecollator', 0.4712514579296112),
 ('currentlocalecollator', 0.46890103816986084),
 ('bequals', 0.4582321345806122),
 ('emptymapsshouldbeequal', 0.45157575607299805),
 ('instanceswithdifferentdefaultvaluesarenotequal', 0.4490710198879242),
 ('assertrejects', 0.44240081310272217),
 ('slicekindconfiguration', 0.4329659342765808)]

In [21]:
model.most_similar(positive=['download', 'send'], negative=['receive'])

[('filepicturemanager', 0.4243929088115692),
 ('resumeonretry', 0.40389949083328247),
 ('deviceofflinedownload', 0.40292447805404663),
 ('downloadableartifacts', 0.39464467763900757),
 ('minstallbuttonguard', 0.3908683657646179),
 ('repeatcontext', 0.38964638113975525),
 ('holotrack', 0.3889748752117157),
 ('modelclipchanged', 0.3857502341270447),
 ('budwhitelapdcom', 0.38103920221328735),
 ('fixeddownloadlimitforapiroutine', 0.37883850932121277)]
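
The two vocabularies spell names differently, as the results above show: target keys are lowercase subtokens joined with `|` (for example `equals|ignore|case`), while token keys are lowercase with no separator (for example `equalsignorecase`). The sketch below converts a Java identifier into both spellings so that a query can be built for either file. The splitting rules are my own approximation, not the code2vec extractor's exact normalization, and the function names are hypothetical.

```python
import re


def split_subtokens(identifier):
    """Split a camelCase or snake_case identifier into lowercase subtokens."""
    # Break between a lowercase letter or digit and a following capital,
    # then between an acronym and a following capitalized word.
    spaced = re.sub(r"([a-z0-9])([A-Z])", r"\1 \2", identifier)
    spaced = re.sub(r"([A-Z]+)([A-Z][a-z])", r"\1 \2", spaced)
    return [part.lower() for part in re.split(r"[\s_]+", spaced) if part]


def to_target_key(identifier):
    """Spelling assumed for targets.txt: subtokens joined with '|'."""
    return "|".join(split_subtokens(identifier))


def to_token_key(identifier):
    """Spelling assumed for tokens.txt: subtokens concatenated."""
    return "".join(split_subtokens(identifier))


print(to_target_key("equalsIgnoreCase"))
print(to_token_key("equalsIgnoreCase"))
```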