**Alert: \[Experimental Approach\]** This approach was not used in the final deployed project UI. For the final approach, see the [other notebook: 03-data-ops-classification-final.ipynb](./03-data-ops-classification-final.ipynb) instead.

<hr>

This approach did not yield good classification performance on the actual downstream task. The root cause of the misclassification was identified as the manually created labels, which were not robust enough to classify the data properly.

In this notebook, we use the pre-processed dataset prepared in the [previous notebook: 01-data-ops-preprocessing.ipynb](./01-data-ops-preprocessing.ipynb) and try to use it to do the following:

1. Manually label each shell script / batch file based on a set of gathered keywords.
2. Score each file using unigram Bag of Words tokens, awarding one reward point per keyword matched under each label.
3. Aggregate the rewards per label and assign the file the label with the highest total; ties fall back to `windows` for `.bat` files and to `others/unknown` otherwise.
4. Split the dataset in a 70:30 ratio for training and testing.
5. Train a classifier using Bag of Words tokenization + TF-IDF preprocessing in a pipeline, then compare the results of an SVM classifier and a Naive Bayes classifier.
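
Steps 1 to 3 can be sketched on a toy scale as follows. This is a simplified illustration, not the notebook's implementation: the trimmed `TOY_KEYWORDS` lists and the `toy_label` helper are hypothetical, a plain regex split stands in for the ICU tokenizer, and every tie falls back to a single default label (the notebook additionally checks the file extension).

```python
import re

# Hypothetical, trimmed keyword lists for illustration only.
TOY_KEYWORDS = {
    "ubuntu/debian": ["apt", "dpkg", "ubuntu"],
    "rhel/fedora": ["dnf", "yum", "fedora"],
    "windows": ["cls", "rmdir", "ipconfig"],
}


def toy_label(content, fallback="others/unknown"):
    # Assumption: a simple regex split replaces the ICU tokenizer.
    tokens = set(re.findall(r"[a-z0-9]+", content.lower()))
    # One reward point per keyword that is present in the document.
    scores = {
        cat: sum(kw in tokens for kw in kws)
        for cat, kws in TOY_KEYWORDS.items()
    }
    best = max(scores.values())
    winners = [cat for cat, score in scores.items() if score == best]
    # A tie (including the case where nothing matched) yields the fallback.
    return winners[0] if len(winners) == 1 else fallback


print(toy_label("sudo apt-get install git"))
print(toy_label("cls rem rmdir"))
print(toy_label("echo hello"))
```

Note that a document matching no keyword at all is a tie at zero, so under this scheme it is labelled `others/unknown`.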

**Challenges (with RCA)**: Because the labels were created manually, the labelling was not robust, which caused classification failures for almost all queries. In addition, the 'unknown' class dominated the dataset; because of this class imbalance, the classifiers almost always predict 'unknown', which is not a useful result.

In [1]:
import pandas as pd
from sklearn.utils import shuffle
import metapy
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
from sklearn.pipeline import Pipeline

In [2]:
df = pd.read_json('./web/data/dataset.json')
df.head()

Unnamed: 0,id,size,content,sample_repo_name,sample_path,original_content,source_category
0,c9cf4a93718cce2c5bcfd5caf5ecbd8f0c1ae4a6,221,#!/bin/bash\nset -e\n\nenv | sed 's/^/export /...,SpisTresci/SpisTresci,compose/django/cron.sh,,.sh
1,c9aaa28acde02cba18658aa4f2eb335fb4cf20b7,265,"#!/bin/sh -f\nxv_path=""/home/huchao/vivado/Viv...",chaohu/Daily-Learning,Verilog/lab2/lab2_1/lab1_2_2/lab1_2_2.sim/sim_...,,.sh
2,2f1c35188379847bdb4d907196bb7f2dd7a515f7,376,#!/bin/bash\n#--------------------------------...,BeeeOn/android,tests/monkey/kill-test.sh,,.sh
3,04fe6cd6b11a81f12a91bd6effd318d2483a17d9,89,#!/bin/bash\nrabbitmq-plugins enable rabbitmq_...,oscm/shell,mq/rabbitmq/enable.rabbitmq_management.sh,,.sh
4,b46ba78936c1ec41fbf4aed15a876d858432db88,1905,{% if cluster.type == 'ec2' -%}\n#$ -q all.q@@...,Kitware/HPCCloud,server/taskflows/hpccloud/taskflow/pvw.sh,,.sh


In [3]:
df_req = df.copy().drop(columns=['id', 'sample_repo_name', 'sample_path', 'original_content', 'size'])
df_req.head()

Unnamed: 0,content,source_category
0,#!/bin/bash\nset -e\n\nenv | sed 's/^/export /...,.sh
1,"#!/bin/sh -f\nxv_path=""/home/huchao/vivado/Viv...",.sh
2,#!/bin/bash\n#--------------------------------...,.sh
3,#!/bin/bash\nrabbitmq-plugins enable rabbitmq_...,.sh
4,{% if cluster.type == 'ec2' -%}\n#$ -q all.q@@...,.sh


In [4]:
df_req = shuffle(df_req)
df_req.head()

Unnamed: 0,content,source_category
13699,#!/bin/bash\n# -------------------------------...,.sh
9141,#!/bin/sh\n\n### BEGIN INIT INFO\n# Provides: ...,.sh
3863,"curl -i -X POST -H ""Content-Type: application/...",.sh
5406,"#!/bin/sh\ncd `dirname ""$0""`\nvalgrind --error...",.sh
5728,#!/bin/bash\n#\n# This script assumes a linux ...,.sh


In [5]:
keywords = {
    'rhel/fedora': [
        'systemctl',
        'useradd', 'passwd', 'userdel',
        'os', 'fedora', 'rhel', 'uname', 'cat',
        'podman', 'oc', 'yum', 'dnf',
        'microdnf', 'subscription', 'manager',
        'ubi', 'ip', 'journalctl', 'selinux'
    ],
    'ubuntu/debian': [
        'apt', 'get', 'dpkg',
        'debian', 'ubuntu', 'systemctl',
        'useradd', 'passwd', 'userdel',
        'lsb', 'uname', 'cat', 'lts', 'ifconfig',
        'ip'
    ],
    'windows': [
        'systeminfo', 'ver', 'choco',
        'sc', 'net', 'ipconfig', 'reg',
        'cmd', 'powershell', 'msiexec', 'chdir',
        'call', 'rmdir', 'move', 'cls', 'assoc', 'tasklist'
    ],
    'others/unknown': [
        'apk', 'pacman', 'alpine', 'arch',
    ]
}

In [6]:
token_set = set()

def preprocess(content):
    tok = metapy.analyzers.ICUTokenizer(suppress_tags=True)
    ngram = metapy.analyzers.NGramWordAnalyzer(1, tok)
    
    try:
        doc = metapy.index.Document()
        doc.content(content)
        unigrams = ngram.analyze(doc)
        
        tok.set_content(doc.content())
    except Exception:
        # tokenization failed (e.g. the content is not a string); treat as no tokens
        return {}
    
    tokens_counts = {}
    for token, count in unigrams.items():
        tokens_counts[token] = count
        token_set.add(token)
    
    return tokens_counts

# manually label a document by rewarding each matched keyword per category
def label(contents, source_category):
    tc = preprocess(contents)
    ctr = {}
    for key in keywords:
        ctr[key] = 0
    
    for cat in keywords:
        for kw in keywords[cat]:
            if kw in tc:
                ctr[cat] += 1
    
    cat_names = np.array([key for key in ctr])
    cats = np.array([ctr[key] for key in cat_names])
    i0 = np.argmax(cats)
    
    cats1 = cats.copy()
    cats1[i0] = -1
    i1 = np.argmax(cats1)
    if ctr[cat_names[i0]] == ctr[cat_names[i1]]:
        if source_category == '.bat':
            return 'windows'
        else:
            return 'others/unknown'
    
    cat = cat_names[i0]
    return cat
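
One possible reason the keyword labels are fragile is that several keywords are listed under both Linux labels, so a match adds a point to both and cannot separate them. The sketch below simply intersects the two lists copied from the `keywords` dict in cell 5; the variable names are illustrative.

```python
# Keyword lists copied from the `keywords` dict defined in cell 5.
rhel_fedora = {
    'systemctl', 'useradd', 'passwd', 'userdel',
    'os', 'fedora', 'rhel', 'uname', 'cat',
    'podman', 'oc', 'yum', 'dnf',
    'microdnf', 'subscription', 'manager',
    'ubi', 'ip', 'journalctl', 'selinux',
}
ubuntu_debian = {
    'apt', 'get', 'dpkg',
    'debian', 'ubuntu', 'systemctl',
    'useradd', 'passwd', 'userdel',
    'lsb', 'uname', 'cat', 'lts', 'ifconfig',
    'ip',
}

# Keywords that reward both labels equally and so cannot tell them apart.
shared = sorted(rhel_fedora & ubuntu_debian)
print(shared)
print(len(shared), 'of', len(ubuntu_debian), 'ubuntu/debian keywords are shared')
```

A script that only matches shared keywords ties between the two Linux labels and is therefore sent to the tie fallback in `label`.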

In [7]:
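labels = []
# NOTE: this iterates over `df` in its original row order, whereas X in cell 9
# is taken from the shuffled `df_req`, so labels and contents are not row-aligned.
# Iterating over `df_req` here instead would keep each label next to its content.
for _, row in df.iterrows():
    labels.append(label(row.content, row.source_category))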

print('total tokens in corpus:', len(token_set))

ctr = {k: 0 for k in keywords}
for lbl in labels:  # `lbl` avoids shadowing the label() function defined above
    ctr[lbl] += 1

print('dataset distribution:', ctr)

total tokens in corpus: 258421
dataset distribution: {'rhel/fedora': 478, 'ubuntu/debian': 2480, 'windows': 2674, 'others/unknown': 13611}
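
To make the imbalance concrete, the following sketch computes each label's share from the distribution printed above. The counts are copied from that output; the variable names are illustrative.

```python
# Counts copied from the 'dataset distribution' output of cell 7.
distribution = {
    'rhel/fedora': 478,
    'ubuntu/debian': 2480,
    'windows': 2674,
    'others/unknown': 13611,
}

total = sum(distribution.values())
shares = {name: count / total for name, count in distribution.items()}
majority_label = max(distribution, key=distribution.get)

for name, share in shares.items():
    print(f'{name}: {share:.1%}')

# Accuracy of a trivial model that always predicts the majority label.
print('majority baseline:', majority_label, f'{shares[majority_label]:.2f}')
```

A model that always answers `others/unknown` would already be right on roughly 71% of the documents, so accuracy alone says little about the minority labels.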


In [8]:
cats = {'rhel/fedora': 0, 'ubuntu/debian': 1, 'windows': 2, 'others/unknown': 3}
rev_cats = {0: 'rhel/fedora', 1: 'ubuntu/debian', 2: 'windows', 3: 'others/unknown'}
label_cats = [cats[label] for label in labels]

In [9]:
X = np.array(df_req.content)
y = np.array(label_cats)

# replace missing (None) contents with empty strings
for i in range(len(X)):
    if X[i] is None:
        X[i] = ''

# train test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=10)

SVM classifier

In [10]:
svm_clf = Pipeline([
    ('vect', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('clf', SVC()),
])

svm_clf.fit(X_train, y_train)

Pipeline(steps=[('vect', CountVectorizer()), ('tfidf', TfidfTransformer()),
                ('clf', SVC())])

In [11]:
y_pred = svm_clf.predict(X_test)

print('SVM Classification Report:')
print(classification_report(y_test, y_pred))

SVM Classification Report:
              precision    recall  f1-score   support

           0       0.00      0.00      0.00       145
           1       0.33      0.00      0.00       737
           2       0.00      0.00      0.00       811
           3       0.71      1.00      0.83      4080

    accuracy                           0.71      5773
   macro avg       0.26      0.25      0.21      5773
weighted avg       0.54      0.71      0.59      5773
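
The gap between the macro and weighted averages in this report follows from how they are computed. The sketch below recomputes them from the per-class rows above; because the inputs are the rounded values printed in the report, the results are only expected to agree to two decimals.

```python
# Per-class values copied from the SVM report above (rounded to two decimals).
precision = [0.00, 0.33, 0.00, 0.71]
recall = [0.00, 0.00, 0.00, 1.00]
support = [145, 737, 811, 4080]


def macro(values):
    # Unweighted mean: every class counts the same, however rare it is.
    return sum(values) / len(values)


def weighted(values, weights):
    # Mean weighted by class support: the dominant class drives the result.
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


macro_p = macro(precision)
weighted_p = weighted(precision, support)
macro_r = macro(recall)
weighted_r = weighted(recall, support)

print(f'macro precision:    {macro_p:.2f}')
print(f'weighted precision: {weighted_p:.2f}')
print(f'macro recall:       {macro_r:.2f}')
print(f'weighted recall:    {weighted_r:.2f}')
```

The support-weighted recall equals the overall accuracy, which is why both stay near 0.71 even though three of the four classes have zero recall; the macro averages expose that failure.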





In [12]:
# sample user queries

queries = [
    'sudo apt-get install git', # ground truth: ubuntu/debian
    'cls rem rmdir', # windows
    'sudo dnf install -y code', # rhel/fedora
    'apk add golang' # others/unknown
]

In [13]:
clfe = svm_clf.predict(np.array(queries))
print([{q: rev_cats[i]} for q, i in zip(queries, clfe)])

[{'sudo apt-get install git': 'others/unknown'}, {'cls rem rmdir': 'others/unknown'}, {'sudo dnf install -y code': 'others/unknown'}, {'apk add golang': 'others/unknown'}]


Naive Bayes classifier

In [14]:
nb_clf = Pipeline([
    ('vect', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('clf', MultinomialNB()),
])

nb_clf.fit(X_train, y_train)

Pipeline(steps=[('vect', CountVectorizer()), ('tfidf', TfidfTransformer()),
                ('clf', MultinomialNB())])

In [15]:
y_pred = nb_clf.predict(X_test)

print('Naive Bayes Classification Report:')
print(classification_report(y_test, y_pred))

Naive Bayes Classification Report:
              precision    recall  f1-score   support

           0       0.00      0.00      0.00       145
           1       0.00      0.00      0.00       737
           2       0.00      0.00      0.00       811
           3       0.71      1.00      0.83      4080

    accuracy                           0.71      5773
   macro avg       0.18      0.25      0.21      5773
weighted avg       0.50      0.71      0.59      5773





In [16]:
# sample user queries
clfe = nb_clf.predict(np.array(queries))
print([{q: rev_cats[i]} for q, i in zip(queries, clfe)])

[{'sudo apt-get install git': 'others/unknown'}, {'cls rem rmdir': 'others/unknown'}, {'sudo dnf install -y code': 'others/unknown'}, {'apk add golang': 'others/unknown'}]
