# Sentence-transformers

Sentence transformers are a class of natural language processing (NLP) models trained to map a sentence or short text to a fixed-size vector (an embedding) that captures its semantic meaning. These embeddings can then be used for tasks such as text classification, semantic similarity, and text clustering.
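To make "semantic similarity" concrete, here is a minimal sketch of cosine similarity, a common way to compare two embeddings. The three 4-dimensional vectors are hypothetical values invented for illustration; real embeddings from a BERT-base-sized model have 768 dimensions.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


# Hypothetical embeddings, invented for illustration only:
# two similar snippets and one unrelated snippet.
emb_loop_a = [0.9, 0.1, 0.3, 0.0]
emb_loop_b = [0.8, 0.2, 0.4, 0.1]
emb_css = [0.0, 0.9, 0.1, 0.7]

sim_related = cosine_similarity(emb_loop_a, emb_loop_b)
sim_unrelated = cosine_similarity(emb_loop_a, emb_css)
print(f"related: {sim_related:.3f}, unrelated: {sim_unrelated:.3f}")
```

Vectors pointing in similar directions score close to 1, while unrelated ones score closer to 0, which is what lets a downstream classifier or clustering algorithm separate them.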

`microsoft/codebert-base` is a pre-trained language model for programming and natural languages. It uses the same architecture as RoBERTa-base, itself a variant of the original BERT model, which was pre-trained on a large corpus of text data. CodeBERT was pre-trained on the CodeSearchNet corpus: about 2.1 million functions paired with documentation and 6.4 million code-only functions across six programming languages (Python, Java, JavaScript, PHP, Ruby, and Go), all sourced from open-source software projects.

There are several reasons to use `microsoft/codebert-base` as the base model for sentence-transformers when the input is source code.

Firstly, because it was pre-trained on source code, its representations reflect the syntactic and structural patterns of code, which makes it better suited to code-analysis tasks than a model trained on natural language alone. This is useful for developers who want to automate parts of their coding process.

Secondly, the CodeBERT authors report strong results after fine-tuning the model on programming-related tasks such as natural-language code search and code summarization (documentation generation). The `microsoft/codebert-base` checkpoint itself is only pre-trained, not fine-tuned, but these results suggest that it encodes knowledge of the programming domain that transfers to related tasks.

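Finally, because part of its pre-training data pairs code with natural-language documentation, `microsoft/codebert-base` can also encode natural-language text such as comments and docstrings. Note that it is not published as a sentence-transformers model, so the library wraps it with a mean-pooling layer when loading it, and the resulting embeddings have not been tuned specifically for sentence classification or semantic similarity.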

In [1]:
from sentence_transformers import SentenceTransformer

In [2]:
model = SentenceTransformer('microsoft/codebert-base')

No sentence-transformers model found with name /home/qnbhd/.cache/torch/sentence_transformers/microsoft_codebert-base. Creating a new one with MEAN pooling.
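
The message above says the library falls back to MEAN pooling: the per-token vectors produced by the transformer are averaged into one vector per input, with padding positions excluded. The following is a plain-Python sketch of that idea only, not the library's actual (PyTorch-based) implementation; the function name `mean_pool` and the toy 3-dimensional vectors are hypothetical.

```python
def mean_pool(token_embeddings, attention_mask):
    """Average the token vectors whose mask entry is 1; masked (padding) tokens are ignored."""
    if len(token_embeddings) != len(attention_mask):
        raise ValueError("need exactly one mask entry per token")
    kept = [vec for vec, m in zip(token_embeddings, attention_mask) if m == 1]
    if not kept:
        raise ValueError("at least one token must be unmasked")
    dim = len(kept[0])
    return [sum(vec[i] for vec in kept) / len(kept) for i in range(dim)]


# Hypothetical token vectors for one input; the last position is padding.
tokens = [
    [1.0, 2.0, 3.0],
    [3.0, 4.0, 5.0],
    [9.0, 9.0, 9.0],  # padding token, excluded by the mask
]
mask = [1, 1, 0]

sentence_vector = mean_pool(tokens, mask)
print(sentence_vector)  # [2.0, 3.0, 4.0]
```

Because the average runs over all real tokens, every part of a snippet contributes equally to its embedding, which is a reasonable default for a model that was not trained with a dedicated sentence-level objective.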


In [3]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier

In [4]:
train = pd.read_csv('train-preprocessed.csv')
test = pd.read_csv('test-preprocessed.csv')

In [5]:
X_train_embeddings = model.encode(train['code'].values)

In [6]:
X_test_embeddings = model.encode(test['code'].values)

In [7]:
from lightgbm import LGBMClassifier

In [8]:
clf = LGBMClassifier(random_state=42).fit(X_train_embeddings, train['language'])

In [9]:
from sklearn.metrics import classification_report

In [10]:
print(classification_report(test['language'], clf.predict(X_test_embeddings)))

              precision    recall  f1-score   support

           c       0.86      0.93      0.89       229
         cpp       0.88      0.82      0.85       230
         css       0.96      0.98      0.97       220
     haskell       0.89      0.91      0.90       221
        html       0.94      0.95      0.95       240
        java       0.94      0.99      0.96       216
  javascript       0.95      0.95      0.95       219
         lua       0.90      0.89      0.89       218
        objc       0.97      0.94      0.95       243
        perl       0.96      0.95      0.95       225
         php       0.97      0.94      0.95       237
      python       0.96      0.94      0.95       225
           r       0.91      0.94      0.92       214
        ruby       0.91      0.92      0.92       208
       scala       0.89      0.90      0.89       198
      sqlite       0.95      0.93      0.94       209
       swift       0.97      0.91      0.94       210

    accuracy              