# Cluster errors to identify the type of errors that can appear in solver reports 

# Table of Contents

1. [Introduction](#Introduction)
2. [Import Packages](#Import_packages)
3. [Load the clean solver data saved by 'PreprocessSolverErrorData' notebook](#load_clean_data)
4. [Filter data using Solver / datetime](#filter)
5. [Word to Vector Conversion using Continuous Bag of Words model (CBOW)](#word2vec)
6. [Sentence (error message) to vector conversion](#sent2vec)
7. [Clustering using DBScan](#clustering)
8. [Get cluster statistics such as : "pattern", "mean_length", "mean_similarity"](#cluster_stats)
9. [Save clustered data to Ceph](#save_to_ceph)
10. [View data from each cluster](#view_data)
 1. [Cluster No. 0: FileNotFoundError](#c0)
 2. [Cluster No. 1: UnableToExecuteGccError](#c1)
 3. [Cluster No. 3: NoMatchingDistributionFoundError](#c3)
11. [Clusters with more than one error](#clusters_with_more_than_one_error)
 1. [Cluster No. 10: ImportError, HTTPError](#c10)
 2. [Cluster No. 106: CalledProcessError, FileNotFoundError, KeyError, RuntimeError](#c106)
 3. [Cluster No. 116: ConnectionError, OSError, MaxRetryError, DistutilsError, ResponseError](#c116)
 4. [Cluster No. 7: CheckTheLogsError (needs further exploration)](#c7)

## Introduction  <a id='Introduction'></a>

The purpose of this notebook is to cluster solver errors so that we can understand why dependencies cannot be resolved and better advise users on why a package cannot be used.

#### Summary:
- Data preprocessed by the [PreprocessSolverErrorData](./PreprocessSolverErrorData.ipynb) notebook is loaded.
- Each word is converted into a vector using [Word2Vec](https://radimrehurek.com/gensim/models/word2vec.html) (Continuous Bag of Words model).
- Each error message is then converted into a vector (sentence2vec, built from the Word2Vec word vectors).
- Clustering is done using [DBSCAN](https://scikit-learn.org/stable/modules/generated/sklearn.cluster.DBSCAN.html).
- Cluster statistics such as "pattern", "mean_length" and "mean_similarity" are calculated.
- An error class is defined for each cluster and added to the dataframe.
- The classified error data is saved to Ceph.

## Import packages <a id='Import_packages'></a>

In [1]:
import pandas as pd
import multiprocessing
import pickle
import numpy as np
import difflib
import regex as re
import boto3
import os

from math import sqrt
from sklearn.cluster import DBSCAN
from sklearn.neighbors import NearestNeighbors
from gensim.models import Word2Vec
from kneed import KneeLocator
from string import punctuation    

In [2]:
pd.set_option('max_colwidth', 2600)
pd.set_option('display.max_rows', 200)

In [3]:
cpu_number = multiprocessing.cpu_count()
w2v_window= 7

## Load the clean solver data saved by 'PreprocessSolverErrorData' notebook <a id='load_clean_data'></a>

In [4]:
preprocessed_filename = 'error-clean-data.csv'

In [6]:
entire_error_df = pd.read_csv(preprocessed_filename)

In [7]:
len(entire_error_df)

93532

## Filter data using Solver / datetime <a id='filter'></a>

In [8]:
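def filter_data(entire_error_df, solver_name=None, start_date='2019-12-27', end_date='2020-01-14', mode='solver'):
    """Select rows by solver name ('solver'), by date range ('datetime'), or keep all rows ('all')."""
    if mode == 'solver':
        error_df = entire_error_df.loc[entire_error_df['solver'] == solver_name]
    elif mode == 'datetime':
        mask = (entire_error_df['datetime'] >= start_date) & (entire_error_df['datetime'] <= end_date)
        error_df = entire_error_df.loc[mask]
    elif mode == 'all':
        error_df = entire_error_df
    else:
        # Previously an unknown mode failed later with an UnboundLocalError.
        raise ValueError(f"Unknown mode: {mode!r}")
    # Return a copy so that columns added later do not modify a slice of the original frame.
    return error_df.copy()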
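
The 'datetime' mode can be illustrated with a minimal sketch of the same comparison on plain strings. It assumes the `datetime` column holds ISO 8601 strings, which is what `pd.read_csv` yields unless dates are parsed explicitly. The timestamps below are made up for illustration.

```python
# Hypothetical timestamps in ISO 8601 format, as they might appear in the 'datetime' column.
timestamps = [
    "2019-12-26 23:59:59",
    "2019-12-27 00:00:00",
    "2020-01-10 08:30:00",
    "2020-01-14 00:00:00",
    "2020-01-14 10:15:00",
]
start_date, end_date = "2019-12-27", "2020-01-14"

# Same comparison as the mask in filter_data, applied to plain strings.
selected = [t for t in timestamps if start_date <= t <= end_date]

# Comparing only the date part (first 10 characters) keeps the whole end date.
inclusive = [t for t in timestamps if start_date <= t[:10] <= end_date]

print(selected)
print(inclusive)
```

With string comparison, any full timestamp on the end date sorts after the bare date `'2020-01-14'` and is excluded. Comparing only the date part keeps those rows, if that is the intended behavior.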

In [9]:
entire_error_df['solver'].unique()

array(['solver-fedora-31-py37', 'solver-fedora-31-py38',
       'solver-fedora-32-py37', 'solver-fedora-32-py38',
       'solver-rhel-8-py36'], dtype=object)

In [10]:
#error_df = filter_data(entire_error_df, solver_name = 'solver-fedora-31-py37', mode='solver')
#error_df = filter_data(entire_error_df, start_date='2019-12-24',end_date='2020-01-14', mode='datetime')
error_df = filter_data(entire_error_df, mode = 'all')

In [11]:
len(error_df)

93532

### Extract tokenized_clustering_data for clustering

In [12]:
clean_clustering_data = error_df['tokenized_clustering_data']

## Word to Vector Conversion using Continuous Bag of Words model (CBOW) <a id='word2vec'></a>

In [13]:
print('Number of rows in training data :', len(clean_clustering_data))

Number of rows in training data : 93532


In [14]:
def detect_embedding_size(tokens):
    flat_list = [item for row in tokens for item in row]
    vocab = set(flat_list)
    embedding_size = round(len(vocab) ** (2/3))
    if embedding_size >= 400:
        embedding_size = 400
    return embedding_size

w2v_size = detect_embedding_size(clean_clustering_data)
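
As a rough illustration of the rule used above (vocabulary size to the power 2/3, capped at 400), the sketch below applies it to a few made-up vocabulary sizes. The helper name `embedding_size_for` is hypothetical and only mirrors the arithmetic of `detect_embedding_size`.

```python
def embedding_size_for(vocab_size, cap=400):
    """Vocabulary size to the power 2/3, rounded and capped at `cap`."""
    return min(round(vocab_size ** (2 / 3)), cap)

# Made-up vocabulary sizes, chosen only to show the growth and the cap.
for vocab_size in (27, 1000, 20000):
    print(vocab_size, "->", embedding_size_for(vocab_size))
```

Small vocabularies get small embeddings, and anything above roughly 8000 distinct tokens hits the cap of 400.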

In [15]:
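def tokens_vectorization(clustering_data, w2v_size, w2v_window, cpu_number, model_name):
    """Train a Word2Vec model (CBOW by default) on the tokenized messages and save it."""
    iterations = 100
    # Note: `size` and `iter` are the gensim 3.x parameter names;
    # gensim 4 renamed them to `vector_size` and `epochs`.
    word2vec = Word2Vec(clustering_data,
                        size=w2v_size,
                        window=w2v_window,
                        min_count=1,
                        workers=cpu_number,
                        iter=iterations)
    word2vec.save(model_name)
    return word2vec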

In [16]:
word2vec = tokens_vectorization(clean_clustering_data, 
                                 w2v_size = w2v_size, 
                                 w2v_window= w2v_window, 
                                 cpu_number = cpu_number, 
                                 model_name='../models/word2vec.model')

## Sentence (error message) to vector conversion <a id='sent2vec'></a>

Each error message is represented by the mean of its word vectors: sum the vectors of all words in the message and divide by the number of words.

In [17]:
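def sentence_vectorization(clustering_data, word2vec):
    """Represent each message by the mean of the vectors of its in-vocabulary words."""
    sent2vec = []
    for sent in clustering_data:
        # Look up vectors through `.wv`; words missing from the vocabulary are skipped.
        word_vectors = [word2vec.wv[w] for w in sent if w in word2vec.wv]
        if word_vectors:
            sent2vec.append(np.mean(word_vectors, axis=0))
        else:
            # No known words: use a zero vector instead of dividing by zero.
            sent2vec.append(np.zeros(word2vec.wv.vector_size))
    return np.vstack(sent2vec)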

In [18]:
sent2vec = sentence_vectorization(clean_clustering_data, word2vec)
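
The averaging step can be sketched without a trained model. The 2-dimensional vectors below are hypothetical stand-ins for what Word2Vec would learn, and the zero-vector fallback for messages with no known words is an assumption of this sketch.

```python
# Hypothetical 2-dimensional word vectors; a real Word2Vec model learns these from the data.
toy_vectors = {
    "file": [1.0, 0.0],
    "not": [0.0, 1.0],
    "found": [2.0, 2.0],
}

def mean_vector(tokens, vectors, size=2):
    """Average the vectors of the known tokens; return a zero vector if none are known."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return [0.0] * size
    # zip(*known) groups the values of each dimension across all word vectors.
    return [sum(component) / len(known) for component in zip(*known)]

print(mean_vector(["file", "not", "found"], toy_vectors))
print(mean_vector(["unknown"], toy_vectors))
```

Every message maps to a vector of the same length regardless of how many words it contains, which is what allows the messages to be stacked into one matrix for clustering.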

  


## Clustering using DBScan  <a id='clustering'></a>

Given a set of points, DBSCAN groups together points that are close to each other, based on a distance threshold (epsilon) and a minimum number of points per neighborhood. Points that lie in low-density regions are marked as outliers.
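
To make the roles of epsilon and the minimum number of points concrete, here is a minimal pure-Python sketch of the DBSCAN idea. It is not scikit-learn's implementation: `simple_dbscan` is a hypothetical helper, it uses a brute-force neighbor search, and the toy points are made up.

```python
from math import dist

def simple_dbscan(points, eps, min_samples):
    """Label points with cluster ids (0, 1, ...) or -1 for noise."""
    n = len(points)
    # A point's neighbourhood includes the point itself, as in scikit-learn.
    neighbours = [
        [j for j in range(n) if dist(points[i], points[j]) <= eps]
        for i in range(n)
    ]
    is_core = [len(nb) >= min_samples for nb in neighbours]
    labels = [-1] * n
    cluster_id = 0
    for i in range(n):
        if labels[i] != -1 or not is_core[i]:
            continue
        # Grow a new cluster outward from this unlabelled core point.
        labels[i] = cluster_id
        frontier = [i]
        while frontier:
            current = frontier.pop()
            for j in neighbours[current]:
                if labels[j] == -1:
                    labels[j] = cluster_id
                    # Only core points keep expanding the cluster.
                    if is_core[j]:
                        frontier.append(j)
        cluster_id += 1
    return labels

# Two tight groups and one isolated point (made-up coordinates).
toy_points = [(0.0, 0.0), (0.0, 0.5), (0.5, 0.0), (5.0, 5.0), (5.0, 5.4), (10.0, 10.0)]
labels_min2 = simple_dbscan(toy_points, eps=1.0, min_samples=2)
labels_min1 = simple_dbscan(toy_points, eps=1.0, min_samples=1)
print(labels_min2)
print(labels_min1)
```

With `min_samples=2` the isolated point is labeled as noise (`-1`). With `min_samples=1` every point is a core point, so the isolated point forms its own cluster and nothing is discarded.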

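Find the averaged nearest-neighbor distances (`avg_distances`) between the data points using NearestNeighbors, with the number of neighbors set to the square root of the number of points.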

In [19]:
def kneighbors(sent2vec):
    k = round(sqrt(len(sent2vec)))
    neigh = NearestNeighbors(n_neighbors=k)
    nbrs = neigh.fit(sent2vec)
    distances, indices = nbrs.kneighbors(sent2vec)
    distances = [np.mean(d) for d in np.sort(distances, axis=0)]
    return distances

avg_distances = kneighbors(sent2vec)

Calculate epsilon, the maximum distance between two points for them to be considered neighbors. It is taken from the elbow of the sorted distance curve.

In [20]:
def epsilon_search(distances):
    kneedle = KneeLocator(distances, list(range(len(distances))))
    epsilon = max(kneedle.all_elbows) if (len(kneedle.all_elbows) > 0) else 1
    return epsilon

In [21]:
epsilon = epsilon_search(avg_distances)
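
The idea behind picking epsilon at the elbow can be sketched with a much simpler heuristic than the one `kneed` implements: take the point of the sorted curve that lies farthest from the straight line joining its two ends. The function `knee_by_chord_distance` and the distances below are hypothetical and serve only as an illustration.

```python
def knee_by_chord_distance(values):
    """Return the value of a sorted curve that lies farthest from the straight
    line joining its first and last points (a simple knee heuristic)."""
    n = len(values)
    x0, y0, x1, y1 = 0, values[0], n - 1, values[-1]
    best_index, best_gap = 0, -1.0
    for x, y in enumerate(values):
        # Unnormalised distance from (x, y) to the chord; the constant
        # denominator does not change which point is farthest.
        gap = abs((y1 - y0) * x - (x1 - x0) * y + x1 * y0 - y1 * x0)
        if gap > best_gap:
            best_index, best_gap = x, gap
    return values[best_index]

# Made-up sorted neighbour distances: mostly small, then a sharp rise.
sorted_distances = [0.1, 0.12, 0.15, 0.18, 0.2, 0.25, 0.3, 1.5, 3.0]
epsilon_estimate = knee_by_chord_distance(sorted_distances)
print(epsilon_estimate)
```

The selected value is the last distance before the curve rises sharply. Points closer than this are treated as belonging to a dense region, while larger distances indicate sparse regions.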

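Run DBSCAN clustering with the computed epsilon and min_samples set to 1. With min_samples=1 every point counts as a core point, so no message is labeled as noise.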

In [22]:
def dbscan(epsilon, min_samples, cpu_number, sent2vec):
    cluster_labels = DBSCAN(eps=epsilon,
                            min_samples= min_samples,
                            n_jobs=cpu_number).fit_predict(sent2vec)
    return cluster_labels

In [None]:
#cluster_labels = hierarchical(epsilon, sent2vec)
cluster_labels = dbscan(epsilon, 1, cpu_number, sent2vec)

In [None]:
len(cluster_labels)

In [None]:
error_df['cluster_no.'] = cluster_labels

## Get cluster statistics such as : "pattern", "mean_length", "mean_similarity" <a id='cluster_stats'></a>

In [None]:
def clustered_output(error_df, mode='INDEX'):
    groups, unique_rows = {}, {}
    # Group by a scalar key so that `key` is the cluster number rather than a 1-tuple.
    for key, value in error_df.groupby('cluster_no.'):
        unique_rows[str(key)] = set(value['clustering_data'])
        if mode == 'ALL':
            groups[str(key)] = value.to_dict(orient='records')
        elif mode == 'Tokenized':
            groups[str(key)] = value['tokenized_clustering_data'].values.tolist()
        elif mode == 'CLEANED':
            groups[str(key)] = value['clustering_data'].values.tolist()
    return groups, unique_rows

In [None]:
table = str.maketrans(punctuation, ' '*len(punctuation))

def find_matching_blocks(strings):
    curr = strings[0]
    curr = curr.replace('ERROR', '')
    curr = curr.replace('Command exited with non-zero status code (1):', '')
    if len(strings) == 1:
        #return curr
        return curr.translate(table).strip()
    else:
        cnt = 1
        for i in range(cnt, len(strings)):
            matches = difflib.SequenceMatcher(None, curr, strings[i])
            common = []
            for match in matches.get_matching_blocks():
                common.append(curr[match.a:match.a + match.size])
            curr = ''.join(common)
            cnt = cnt + 1
            if cnt == len(strings):
                break
        if curr == '':
            return 'NO COMMON PATTERNS HAVE BEEN FOUND'
        #return curr
        return curr.translate(table).strip()

def get_similarity(rows):
    s = []
    for i in range(0, len(rows)):
        s.append(difflib.SequenceMatcher(None, rows[0], rows[i]).ratio() * 100)
    return s
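
The two `difflib` operations used above can be tried on a few made-up messages. The helper `common_pattern` is a hypothetical, condensed version of the matching-block loop. The similarity is `SequenceMatcher.ratio()`, which is defined as 2*M/T (M matched characters, T total characters in both strings) and is not a Levenshtein distance.

```python
import difflib

def common_pattern(strings):
    """Keep only the character blocks that every string shares, in order."""
    current = strings[0]
    for other in strings[1:]:
        matcher = difflib.SequenceMatcher(None, current, other)
        current = "".join(
            current[block.a:block.a + block.size]
            for block in matcher.get_matching_blocks()
        )
    return current.strip()

# Made-up messages that differ only in the package name.
messages = [
    "No matching distribution found for foo",
    "No matching distribution found for bar",
    "No matching distribution found for baz",
]
pattern = common_pattern(messages)
similarity = difflib.SequenceMatcher(None, messages[0], messages[1]).ratio() * 100

print(pattern)
print(round(similarity, 1))
```

The shared prefix survives as the cluster pattern, while the varying package name is dropped.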

In [None]:
STATISTICS = ["cluster_name", "cluster_size", "pattern", 'CLASS', "mean_similarity"]

def statistics(error_df, output_mode='frame'):
    """
    Returns statistics for all clusters
    "cluster_name" - name of a cluster
    "cluster_size" - number of log messages in the cluster
    "pattern" - all common substrings of the messages in the cluster
    "CLASS" - all common substrings of the tokenized messages in the cluster
    "mean_similarity" - average similarity of log messages in the cluster
    (calculated as the difflib.SequenceMatcher ratio between the 1st message and each message in the cluster)
    :param error_df: dataframe with a 'cluster_no.' column
    :param output_mode: frame | dict
    :return: DataFrame if output_mode is 'frame', otherwise a list of dicts
    """
    clusters = []
    clustered_df, unique_rows = clustered_output(error_df, mode='CLEANED')
    clustered_df_class, unique_rows = clustered_output(error_df, mode='Tokenized')
    for item in clustered_df:
        row = clustered_df[item]
        matcher = find_matching_blocks(row)
        class_matcher = find_matching_blocks(clustered_df_class[item])
        similarity = get_similarity(row)
        clusters.append([item,
                         len(row),
                         matcher,
                         class_matcher,
                         #unique_rows[item],
                         #np.mean(lengths),
                         np.mean(similarity)])
    df = pd.DataFrame(clusters, columns=STATISTICS).round(2).sort_values(by='cluster_size', ascending=False)
    if output_mode == 'frame':
        return df
    else:
        return df.to_dict(orient='records')

In [None]:
stat = statistics(error_df, output_mode='frame')
stat_df = pd.DataFrame.from_dict(stat)

In [None]:
print('Number of clusters : ', len(stat_df))

Generate CLASS label

In [None]:
def get_class_label(stat_df):
    class_labels = []
    number_of_errors = []
    MachineDefinedError = []
    for item in stat_df['CLASS']:
        if "Error" in item.split():
            item = item.replace('Error', '')
        row = item.split()
        #if len(row) > 1 and len(re.findall(r'Error', str(row))) < 2:
        if not re.search(r'(\w\w*Error)', item):
            MachineDefinedError.append('NO')
            item = ''
            for word in row:
                item += word[0].upper() + word[1:]
            item += "Error"
        else:
            if len(re.findall(r'Error', str(row))) > 1:
                item = ', '.join(row)
            else:
                item = ''.join(row)
            MachineDefinedError.append('YES')
        class_labels.append(item)
        number_of_errors.append(len(re.findall(r'Error', str(item))))
    return class_labels, number_of_errors, MachineDefinedError
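
The labeling idea can be summarized in a simplified sketch. The helper `label_from_pattern` is hypothetical and is not a drop-in replacement for `get_class_label`. It only shows the two cases: a pattern that already names exception classes, and a pattern from which a CamelCase label has to be built.

```python
import re

def label_from_pattern(pattern):
    """Return (label, machine_defined) for a whitespace-separated pattern."""
    named_errors = re.findall(r"\w+Error\b", pattern)
    if named_errors:
        # The pattern already names one or more exception classes.
        return ", ".join(named_errors), True
    # Otherwise build a CamelCase label from the words and append "Error".
    words = pattern.split()
    return "".join(w[0].upper() + w[1:] for w in words) + "Error", False

print(label_from_pattern("unable to execute gcc"))
print(label_from_pattern("ImportError HTTPError"))
```

Labels built from plain words correspond to `MachineDefinedError?` being 'NO', and labels taken from exception names correspond to 'YES'.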

In [None]:
class_labels, number_of_errors, MachineDefinedError = get_class_label(stat_df)

In [None]:
stat_df['number_of_errors'] = number_of_errors
stat_df['MachineDefinedError?'] = MachineDefinedError
stat_df['CLASS'] = class_labels

In [None]:
stat_df.sort_values(by='cluster_size', ascending=False)

In [None]:
error_df['CLASS'] = error_df['cluster_no.'].map(stat_df['CLASS'])
error_df['number_of_errors'] = error_df['cluster_no.'].map(stat_df['number_of_errors'])
error_df['MachineDefinedError?'] = error_df['cluster_no.'].map(stat_df['MachineDefinedError?'])

## Save clustered data to Ceph <a id='save_to_ceph'></a>

In [None]:
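import os
# AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY must already be defined in the session;
# their values are intentionally not part of this notebook.
os.environ['THOTH_S3_ENDPOINT_URL'] = 'https://s3.upshift.redhat.com/'
os.environ['AWS_ACCESS_KEY_ID'] = AWS_ACCESS_KEY_ID
os.environ['AWS_SECRET_ACCESS_KEY'] = AWS_SECRET_ACCESS_KEY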

In [None]:
from io import StringIO

def store_csv_to_ceph(error_df):
    csv_buffer = StringIO()
    error_df = error_df.drop(columns =['index', 'message','split_message', 'Error_info', 'command_info', 
                                       'cwd', 'Complete_output','ERROR', 'Exception', 'specific_error'])
    error_df.to_csv(csv_buffer, header=False, sep ='`', index=False)
    bucket = 'DH-PLAYPEN'
    s3_resource = boto3.resource('s3',
                        endpoint_url= os.environ['THOTH_S3_ENDPOINT_URL'],
                        aws_access_key_id = os.environ["AWS_ACCESS_KEY_ID"],
                        aws_secret_access_key= os.environ['AWS_SECRET_ACCESS_KEY'])
    s3_resource.Object(bucket, 'thoth/data/solver-error-context/solver-error-context.csv').put(Body=csv_buffer.getvalue())

In [None]:
store_csv_to_ceph(error_df)
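
Because the file is written with a backtick separator and without a header row, anyone reading it back has to supply both. The sketch below uses only the standard library. The rows and the column names are hypothetical, and the real column set is whatever remains after the `drop()` call above.

```python
import csv
from io import StringIO

# Hypothetical rows standing in for the exported dataframe.
rows = [
    ["tensorflow", "2.0.0", "solver-fedora-31-py37", "UnableToExecuteGccError"],
    ["numpy", "1.18.0", "solver-rhel-8-py36", "ImportError, HTTPError"],
]

buffer = StringIO()
csv.writer(buffer, delimiter="`").writerows(rows)

# Reading back: the delimiter must match, and column names must be supplied
# separately because no header row was written.
column_names = ["package_name", "package_version", "solver", "CLASS"]
buffer.seek(0)
records = [dict(zip(column_names, row)) for row in csv.reader(buffer, delimiter="`")]

print(records[1]["CLASS"])
```

A backtick is unlikely to occur in error messages, so class labels that contain commas stay in a single field.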

## View data from each cluster <a id='view_data'></a>

In [None]:
def get_data_from_cluster(df_processed, clusters, cluster_number):
    indices = [i for i, x in enumerate(clusters) if x == cluster_number]
    df_grouped = df_processed.iloc[indices]
    print(len(df_grouped))
    return df_grouped

def split_log(log_messages):
    log_messages = log_messages.split('\n')
    return log_messages

### Cluster No. 0: FileNotFoundError <a id='c0'></a>

In [None]:
get_data_from_cluster(error_df, cluster_labels, 0)[['package_name', 'package_version', 'solver','message', 
                                                    'specific_error', 'CLASS', 'MachineDefinedError?']]

### Cluster No. 1: UnableToExecuteGccError	<a id='c1'></a>

In [None]:
get_data_from_cluster(error_df, cluster_labels, 1)[['package_name', 'package_version', 'solver','message', 
                                                    'specific_error', 'CLASS', 'MachineDefinedError?']]

### Cluster No. 3: NoMatchingDistributionFoundError <a id='c3'></a>

In [None]:
get_data_from_cluster(error_df, cluster_labels, 3)[['package_name', 'package_version', 'solver','message', 
                                                    'ERROR', 'CLASS', 'MachineDefinedError?']]

## Clusters with more than one error  <a id='clusters_with_more_than_one_error'></a>

### Cluster No. 10: ImportError, HTTPError <a id='c10'></a>

In [None]:
get_data_from_cluster(error_df, cluster_labels, 10)[['package_name', 'package_version', 'solver','message', 
                                                    'specific_error', 'CLASS', 'MachineDefinedError?']]

### An example of log from cluster 10

In [None]:
split_log(error_df['message'][33])

### Cluster No. 106: CalledProcessError, FileNotFoundError, KeyError, RuntimeError <a id='c106'></a>

In [None]:
get_data_from_cluster(error_df, cluster_labels, 106)[['package_name', 'package_version', 'solver','message', 
                                                    'specific_error', 'CLASS', 'MachineDefinedError?']]

### An example of log from cluster 106

In [None]:
split_log(error_df['message'][31434])

### Cluster No. 116: ConnectionError, OSError, MaxRetryError, DistutilsError, ResponseError <a id='c116'></a>

In [None]:
get_data_from_cluster(error_df, cluster_labels, 116)[['package_name', 'package_version', 'solver','message', 
                                                    'specific_error', 'CLASS', 'MachineDefinedError?']]

### An example of log from cluster 116

In [None]:
split_log(error_df['message'][71802])

### Cluster No. 7: CheckTheLogsError (needs further exploration) <a id='c7'></a>

In [None]:
get_data_from_cluster(error_df, cluster_labels, 7)[['package_name', 'package_version', 'solver','message', 
                                                    'ERROR', 'CLASS', 'MachineDefinedError?']]

In [None]:
split_log(error_df['message'][18])

##### Missing gcc (the key information to take from this log)