**The approach and the model prepared in this notebook were used in the final, deployed project UI.**

Project Web App: https://code-search-dot-code-crafts-1477836554331.el.r.appspot.com/

<hr>

The approach in the [other notebook: 02-data-ops-classification-experiment.ipynb](./02-data-ops-classification-experiment.ipynb) did not yield satisfactory performance on the classification task. Instead, we use automatic labelling and train an SVM classifier, which performed very well on the downstream task.

In this notebook, we again use the pre-processed dataset prepared in the [previous notebook: 01-data-ops-preprocessing.ipynb](./01-data-ops-preprocessing.ipynb) and perform the following steps:

1. Each shell script or batch file is labelled according to the source category from which it was collected: `*.sh` and `Dockerfile` files are labelled as originating from Linux, and `*.bat` files as originating from Windows.
2. We frame the task as a two-class classification problem that distinguishes 'linux' documents from 'windows' documents.
3. We split the dataset in a 70:30 ratio for training and testing.
4. We train the classifiers with a pipeline of bag-of-words tokenization followed by TF-IDF weighting, and compare the results of an SVM classifier and a Naive Bayes classifier.
5. The SVM classifier performed better than Naive Bayes, so we store it to disk. The web app reuses the saved model at inference time to classify the input that the user provides in the web UI.

As the classification dataset is imbalanced, we compare the classifiers using measures suited to imbalanced data: precision, recall and F1-score.
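To make these measures concrete, here is a minimal sketch that computes precision, recall and F1 for a single class from raw counts. The helper name `prf1` and the counts are hypothetical and chosen only for illustration; they are not taken from this dataset.

```python
def prf1(tp, fp, fn):
    """Return (precision, recall, f1) for one class from raw counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Hypothetical minority class: 80 documents found, 20 missed, 5 false alarms.
p, r, f1 = prf1(tp=80, fp=5, fn=20)
print(f"precision={p:.2f} recall={r:.2f} f1={f1:.2f}")
```

Unlike accuracy, these per-class scores drop sharply when the minority class is mostly missed, which is why we rely on them here.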

In [1]:
import pandas as pd
from sklearn.utils import shuffle
import metapy
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
from sklearn.pipeline import Pipeline
import joblib

In [2]:
df = pd.read_json('./web/data/dataset.json')
df.head()

Unnamed: 0,id,size,content,sample_repo_name,sample_path,original_content,source_category
0,c9cf4a93718cce2c5bcfd5caf5ecbd8f0c1ae4a6,221,#!/bin/bash\nset -e\n\nenv | sed 's/^/export /...,SpisTresci/SpisTresci,compose/django/cron.sh,,.sh
1,c9aaa28acde02cba18658aa4f2eb335fb4cf20b7,265,"#!/bin/sh -f\nxv_path=""/home/huchao/vivado/Viv...",chaohu/Daily-Learning,Verilog/lab2/lab2_1/lab1_2_2/lab1_2_2.sim/sim_...,,.sh
2,2f1c35188379847bdb4d907196bb7f2dd7a515f7,376,#!/bin/bash\n#--------------------------------...,BeeeOn/android,tests/monkey/kill-test.sh,,.sh
3,04fe6cd6b11a81f12a91bd6effd318d2483a17d9,89,#!/bin/bash\nrabbitmq-plugins enable rabbitmq_...,oscm/shell,mq/rabbitmq/enable.rabbitmq_management.sh,,.sh
4,b46ba78936c1ec41fbf4aed15a876d858432db88,1905,{% if cluster.type == 'ec2' -%}\n#$ -q all.q@@...,Kitware/HPCCloud,server/taskflows/hpccloud/taskflow/pvw.sh,,.sh


In [3]:
df_req = df.copy().drop(columns=['id', 'sample_repo_name', 'sample_path', 'original_content', 'size'])
df_req.head()

Unnamed: 0,content,source_category
0,#!/bin/bash\nset -e\n\nenv | sed 's/^/export /...,.sh
1,"#!/bin/sh -f\nxv_path=""/home/huchao/vivado/Viv...",.sh
2,#!/bin/bash\n#--------------------------------...,.sh
3,#!/bin/bash\nrabbitmq-plugins enable rabbitmq_...,.sh
4,{% if cluster.type == 'ec2' -%}\n#$ -q all.q@@...,.sh


In [4]:
df_req = shuffle(df_req)
df_req.head()

Unnamed: 0,content,source_category
11477,#!/bin/bash\nif [[ $TRAVIS_PULL_REQUEST == fal...,.sh
10984,#!/bin/sh\n\ntest_description='git p4 locked f...,.sh
14862,"#!/bin/bash\n# Copyright 2015, Google Inc.\n# ...",.sh
8011,#!/bin/bash -e\n#\n# Capture ESXi traffic and ...,.sh
12195,#!/bin/sh\n\n# I use this to build static ucli...,.sh


In [5]:
def label(source_category):
    if source_category == '.sh' or source_category == 'Dockerfile':
        return 'linux'
    elif source_category == '.bat':
        return 'windows'
    return ''

In [6]:
df_req['label'] = df_req.source_category.apply(label)

In [7]:
df_req.head()

Unnamed: 0,content,source_category,label
11477,#!/bin/bash\nif [[ $TRAVIS_PULL_REQUEST == fal...,.sh,linux
10984,#!/bin/sh\n\ntest_description='git p4 locked f...,.sh,linux
14862,"#!/bin/bash\n# Copyright 2015, Google Inc.\n# ...",.sh,linux
8011,#!/bin/bash -e\n#\n# Capture ESXi traffic and ...,.sh,linux
12195,#!/bin/sh\n\n# I use this to build static ucli...,.sh,linux


In [8]:
X = np.array(df_req.content)
y = np.array(df_req.label)

# replace missing (None) contents with an empty string so the vectorizer can process them
for i in range(len(X)):
    if X[i] is None:
        X[i] = ''

# train test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=10)
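The split above is a plain random split. Because the classes are imbalanced, one optional variation (not used in this notebook) is a stratified split, which keeps the class ratio the same in the training and test sets. The sketch below illustrates the `stratify` argument of `train_test_split` on hypothetical toy data; `X_toy` and `y_toy` are made-up names.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 90 'linux' documents and 10 'windows' documents.
X_toy = np.array([f"doc {i}" for i in range(100)])
y_toy = np.array(['linux'] * 90 + ['windows'] * 10)

# stratify=y_toy preserves the 9:1 class ratio in both partitions.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=10, stratify=y_toy)

print('test windows:', (y_te == 'windows').sum())
print('test linux:', (y_te == 'linux').sum())
```

With stratification, the 30 test samples contain 3 'windows' and 27 'linux' documents regardless of the random seed.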

SVM Classifier.

In [9]:
svm_clf = Pipeline([
    ('vect', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('clf', SVC()),
])

svm_clf.fit(X_train, y_train)

Pipeline(steps=[('vect', CountVectorizer()), ('tfidf', TfidfTransformer()),
                ('clf', SVC())])

In [10]:
y_pred = svm_clf.predict(X_test)

print('SVM Classification Report:')
print(classification_report(y_test, y_pred))

SVM Classification Report:
              precision    recall  f1-score   support

       linux       0.97      1.00      0.98      5103
     windows       0.97      0.79      0.87       670

    accuracy                           0.97      5773
   macro avg       0.97      0.89      0.93      5773
weighted avg       0.97      0.97      0.97      5773



In [11]:
# sample user queries

queries = [
    'sudo apt-get install git', # ground truth: ubuntu/debian
    'cls rem rmdir', # windows
    'sudo dnf install -y code', # rhel/fedora
    'apk add golang' # unknown/others
]

In [12]:
clfe = svm_clf.predict(np.array(queries))
print([{q: i} for q, i in zip(queries, clfe)])

[{'sudo apt-get install git': 'linux'}, {'cls rem rmdir': 'windows'}, {'sudo dnf install -y code': 'linux'}, {'apk add golang': 'linux'}]


Naive Bayes classifier.

In [13]:
nb_clf = Pipeline([
    ('vect', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('clf', MultinomialNB()),
])

nb_clf.fit(X_train, y_train)

Pipeline(steps=[('vect', CountVectorizer()), ('tfidf', TfidfTransformer()),
                ('clf', MultinomialNB())])

In [14]:
y_pred = nb_clf.predict(X_test)

print('Naive Bayes Classification Report:')
print(classification_report(y_test, y_pred))

Naive Bayes Classification Report:
              precision    recall  f1-score   support

       linux       0.90      1.00      0.95      5103
     windows       1.00      0.15      0.27       670

    accuracy                           0.90      5773
   macro avg       0.95      0.58      0.61      5773
weighted avg       0.91      0.90      0.87      5773



In [15]:
# sample user queries
clfe = nb_clf.predict(np.array(queries))
print([{q: i} for q, i in zip(queries, clfe)])

[{'sudo apt-get install git': 'linux'}, {'cls rem rmdir': 'windows'}, {'sudo dnf install -y code': 'linux'}, {'apk add golang': 'linux'}]


The SVM classifier works better than the Naive Bayes classifier (recall on the 'windows' class is 0.79 versus 0.15), so we persist the SVM model, together with its bag-of-words and TF-IDF preprocessing pipeline, to disk so that the web app can reuse it later.

In [16]:
joblib.dump(svm_clf, './web/data/model.joblib')

['./web/data/model.joblib']
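As a rough sketch of how a saved pipeline can be reused at inference time, the example below trains a tiny stand-in pipeline on hypothetical toy data, saves it to a temporary file, reloads it with `joblib.load`, and classifies one query. The `classify` helper, the toy documents and the temporary path are assumptions made for illustration; the actual web app code is not shown in this notebook.

```python
import os
import tempfile

import joblib
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

# Tiny stand-in for the trained pipeline (hypothetical toy data).
toy_docs = ['sudo apt-get install git', 'chmod +x run.sh',
            'echo off rem cls', 'rmdir /s /q build']
toy_labels = ['linux', 'linux', 'windows', 'windows']
pipeline = Pipeline([
    ('vect', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('clf', SVC()),
])
pipeline.fit(toy_docs, toy_labels)

# Save the whole pipeline, then load it back as a web app might at startup.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'model.joblib')
    joblib.dump(pipeline, path)
    restored = joblib.load(path)


def classify(model, text):
    """Return the predicted label for a single raw-text query."""
    # The pipeline expects an iterable of documents, so wrap the string.
    return model.predict([text])[0]


print(classify(restored, 'sudo apt-get install curl'))
```

Because the vectorizer and TF-IDF steps are stored inside the pipeline, the caller passes raw text and needs no separate preprocessing. A model saved this way should generally be loaded with the same scikit-learn version that produced it.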