# Pre-process the solver dataset and output a clean dataset as input for clustering

# Table of Contents

1. [Introduction](#Introduction)
2. [Import Packages](#Import_packages)
3. [Set environment variables to retrieve data from Ceph and map to dataframe](#retrieve_data_from_Ceph)
4. [Load Error data from csv file created in the above step](#Load_data)
5. [Split the error components from the log message](#split_error)
6. [Example of error logs](#example)
7. [Prepare the data for clustering](#prepare_data)
8. [Cleaning clustering data](#clean_data)
9. [Tokenization](#tokenization)
10. [Save the cleaned data for clustering](#save_data)

## Introduction  <a id='Introduction'></a>

The purpose of this notebook is to preprocess solver data: extract error data from solver reports, prepare the data for clustering, clean and tokenize it, and save the clean dataset for the [ClusterErrors](./ClusterErrors.ipynb) notebook.


Currently, Grafana metrics from the SolverResultsStore show over 700k solver results. This notebook filters for documents with solver errors; each solver has around 20k - 30k results with solver errors.\
The results currently come from five different solvers:
   - solver-rhel-8-py36
   - solver-fedora-31-py38
   - solver-fedora-31-py37
   - solver-fedora-32-py38
   - solver-fedora-32-py37

## Import packages <a id='Import_packages'></a>

In [1]:
import pandas as pd
import regex as re
import pickle
import nltk

from thoth.lab import solver
from nltk.tokenize import TreebankWordTokenizer
from string import punctuation  

In [2]:
pd.set_option('max_colwidth', 1000)

## Set environment variables to retrieve data from Ceph and map to dataframe  <a id='retrieve_data_from_Ceph'></a>
### Do NOT run this every time.

### All the results in a solver report are described below:

- **environment,** information about the environment in which the package was solved;
- **environment_packages,** information about external packages installed on the environment;
- **errors,** if the installation of a package was not successful, information is stored for each package error;
    - **details,**
        - **command**,
        - **message**,
        - **return_code**,
        - **stderr**,
        - **stdout**,
        - **timeout**,
    - **index_url,** from where the package was downloaded;
    - **package_name;**
    - **package_version;**
    - **is_provided_package,** flag for storing package;
    - **is_provided_package_version,** flag for storing package;
    - **type,** error type;
- **tree**, all the packages installed in the dependency tree and information about them;
    - **dependencies**
    - **metadata** of the package as taken from importlib_metadata;
    - **index_url** from where the package was downloaded;
    - **package_name;**
    - **package_version;**
    - **sha256;**
- **unparsed,** if there are packages in the tree that could not be parsed;
- **unresolved,** if there are packages in the tree that could not be solved;
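
To make the layout above concrete, here is a minimal sketch of one report as a plain Python dict, plus a helper that flattens the nested `details` block into one row per error. Only the key names come from the list above. Every value (package name, version, messages, the `type` string, the content of `environment`) is an invented placeholder, `flatten_errors` is a hypothetical helper, and the exact nesting in real documents may differ.

```python
# Hypothetical solver report: key names follow the list above, values are made up.
sample_report = {
    "environment": {"python_version": "3.8"},  # placeholder content
    "environment_packages": [],
    "errors": [
        {
            "details": {
                "command": "pip install example-package==1.0.0",
                "message": "Command exited with non-zero status code (1):",
                "return_code": 1,
                "stderr": "",
                "stdout": "",
                "timeout": None,
            },
            "index_url": "https://pypi.org/simple",
            "package_name": "example-package",
            "package_version": "1.0.0",
            "is_provided_package": True,
            "is_provided_package_version": True,
            "type": "command_error",  # placeholder error type
        }
    ],
    "tree": [],
    "unparsed": [],
    "unresolved": [],
}


def flatten_errors(report):
    """Return one flat dict per error, merging the nested 'details' keys."""
    rows = []
    for error in report.get("errors", []):
        row = {k: v for k, v in error.items() if k != "details"}
        row.update(error.get("details", {}))
        rows.append(row)
    return rows


flat = flatten_errors(sample_report)
print(flat[0]["package_name"], flat[0]["return_code"])
```

A flat row of this shape is what the later cells rely on when they read columns such as `message`, `command`, and `package_name`.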

In [3]:
get_fresh_data = True

In [6]:
import os
os.environ['THOTH_S3_ENDPOINT_URL'] = 'https://s3.upshift.redhat.com/'
os.environ['THOTH_CEPH_BUCKET'] = 'thoth'
os.environ['THOTH_CEPH_HOST'] = 'https://s3.upshift.redhat.com/'
os.environ['THOTH_CEPH_BUCKET_PREFIX'] = 'data/thoth'
os.environ['THOTH_DEPLOYMENT_NAME'] = 'thoth-psi-stage'

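# THOTH_CEPH_KEY_ID and THOTH_CEPH_SECRET_KEY must be defined beforehand;
# the credentials themselves are not stored in this notebook.
os.environ['THOTH_CEPH_KEY_ID'] = THOTH_CEPH_KEY_ID
os.environ['THOTH_CEPH_SECRET_KEY'] = THOTH_CEPH_SECRET_KEY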

In [7]:
if get_fresh_data:
    # Connect to thoth storage
    from thoth.storages import SolverResultsStore
    store = SolverResultsStore(region="us-east-1")
    store.connect()
    
    solver_reports_extracted_data = []
    solver_errors = []

    # Filter for documents with solver errors: retrieve solver reports from Ceph
    # and keep only the entries listed under "errors" in each result.
    for document_id in store.get_document_listing():
        try:
            solver_document = store.retrieve_document(document_id=document_id)
            solver_report_extracted_data = solver.extract_data_from_solver_metadata(solver_document["metadata"])
            errors = solver.extract_errors_from_solver_result(solver_document["result"]["errors"])   
            for error in errors:
                error['document_id'] = solver_report_extracted_data['document_id']
                error['datetime'] = solver_report_extracted_data['datetime']
                error['analyzer_version'] = solver_report_extracted_data['analyzer_version']
                error['environment'] = solver_document["result"]["environment"]
                solver_errors.append(error)
        except Exception as e:
            print(document_id, e)
            
    solver_total = pd.DataFrame(solver_errors).reset_index()

    solver_total['solver'] = solver_total['document_id'].str.rsplit("-", n=1).str[0]
    solver_total['solver'] = solver_total['solver'].replace('solver-rhel-8.0-py36', 'solver-rhel-8-py36')
    
    solver_total.to_csv('error_data.csv', index=False)
    print(len(solver_total))
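
The solver name is derived from the document ID by dropping everything after the last hyphen and then normalising the RHEL name. The sketch below shows the same idea on plain strings. The document IDs are hypothetical (the suffix after the last hyphen is made up), and `solver_name_from_document_id` is an illustrative helper, not part of `thoth.lab`.

```python
def solver_name_from_document_id(document_id):
    """Drop the trailing '-<suffix>' and normalise the RHEL solver name."""
    name = document_id.rsplit("-", 1)[0]
    if name == "solver-rhel-8.0-py36":
        name = "solver-rhel-8-py36"
    return name


# Hypothetical document IDs, for illustration only.
fedora = solver_name_from_document_id("solver-fedora-31-py38-0a1b2c3d")
rhel = solver_name_from_document_id("solver-rhel-8.0-py36-4e5f6a7b")
print(fedora, rhel)
```

This only works if the suffix itself contains no hyphen, which is an assumption about the ID format.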

## Load Error data from csv file created in the above step <a id='Load_data'></a>

In [None]:
if get_fresh_data:
    solver_total_errors_df = solver_total
else:
    solver_total_errors_df = pd.read_csv('error_data.csv')

In [None]:
solver_total_errors_df['solver'] = solver_total_errors_df['solver'].replace('solver-rhel-8.0-py36', 'solver-rhel-8-py36')

In [None]:
solver_total_errors_df['solver'].unique()

In [None]:
solver_total_errors_df.head(5)

## Split the error components from the log message <a id='split_error'></a>

In [None]:
def split_log(log_messages):
    log_messages = log_messages.split('\n')
    return log_messages

In [None]:
error_df = pd.DataFrame()
error_df['index'] = solver_total_errors_df['index']
error_df['document_id'] = solver_total_errors_df['document_id']
error_df['command'] = solver_total_errors_df['command']
error_df['package_name'] = solver_total_errors_df['package_name']
error_df['package_version'] = solver_total_errors_df['package_version']
error_df['solver'] = solver_total_errors_df['solver']
error_df['datetime'] = solver_total_errors_df['datetime']
error_df['environment'] = solver_total_errors_df['environment']
error_df['analyzer_version'] = solver_total_errors_df['analyzer_version']

In [None]:
error_df['message'] = solver_total_errors_df['message']
error_df['split_message']= solver_total_errors_df.apply(lambda row: split_log(row.message),axis = 1)

### Example of error logs: <a id='example'></a>

In [None]:
solver_total_errors_df['message'][999]

In [None]:
error_df['split_message'][999]

In [None]:
solver_total_errors_df['message'][19]

In [None]:
error_df['split_message'][19]

In [None]:
solver_total_errors_df['message'][226]

In [None]:
error_df['split_message'][226]

In [None]:
def split_error_components(log_messages):
    ids_with_different_log_pattern = []
    Error_info, command_info, cwd, Complete_output, ERROR, specific_error, exception = {}, {}, {}, {}, {}, {}, {}
    for idx, msg in enumerate(log_messages):
        msg = msg.replace("Error:\n", "Error:")
        sentences = [x.strip() for x in msg.split('\n')]
        for id, sent in enumerate(sentences):
            if re.match(r".*WARNING.*|.*warnings.*|.*except .*|^copying.*|^checking.*", sent):
                pass
            elif re.match(r"^command.*", sent):
                if idx in command_info.keys() and sent not in command_info[idx]:
                    command_info[idx].append(sent)
                else:
                    command_info[idx] = [sent]
            elif re.match(r"^cwd.*", sent):
                if idx in cwd.keys() and sent not in cwd[idx]:
                    cwd[idx].append(sent)
                else:
                    cwd[idx] = [sent]
            elif re.match(r"^Complete output.*", sent):
                number_of_lines = re.findall(r'\d+', sent) 
                if idx in Complete_output.keys() and sent not in Complete_output[idx]:
                    pass
                else:
                    Complete_output[idx] = sentences[id:id+int(number_of_lines[0])+1]
                    for txt in Complete_output[idx]:
                        if re.match(r"^Exception.*", txt):
                            exception[idx] = [txt]
            elif re.match(r".*unable to execute.*", sent):
                specific_error[idx] = [sent]
            elif re.compile(r"(\w\w*Error)").findall(sent):
                if re.match('^from .* import .*', sent):
                    pass
                elif idx in specific_error.keys():
                    if sent in specific_error[idx]:
                        pass  # skip lines that were already recorded
                    elif not re.match(r".*unable to execute.*", specific_error[idx][0]):
                        specific_error[idx].extend([sent])
                else:
                    specific_error[idx] = [sent]
            elif re.match(r"^Error.*", sent):
                if idx not in specific_error.keys():
                    specific_error[idx] = [sent]
            elif re.match(r"^ERROR: .*", sent):
                if idx in ERROR.keys() and sent not in ERROR[idx]:
                    if not re.match(r".*Failed.*", ERROR[idx][0]):
                        ERROR[idx].extend([sent])
                else:
                    ERROR[idx] = [sent]   
            elif re.match(r"^Command.*", sent):
                Error_info[idx] = sent
    return Error_info, command_info, cwd, Complete_output, ERROR, specific_error, exception

print(len(solver_total_errors_df['message']))

In [None]:
Error_info, command_info, cwd, Complete_output, ERROR, specific_error, exception = split_error_components(solver_total_errors_df['message'])
print(len(Error_info), len(command_info), len(cwd), len(Complete_output), len(ERROR), len(specific_error), len(exception))

In [None]:
error_df['Error_info']= error_df['index'].map(Error_info)
error_df['command_info']= error_df['index'].map(command_info)
error_df['cwd']= error_df['index'].map(cwd)
error_df['Complete_output']= error_df['index'].map(Complete_output)
error_df['ERROR']= error_df['index'].map(ERROR)
error_df['Exception']= error_df['index'].map(exception)
error_df['specific_error']= error_df['index'].map(specific_error)

In [None]:
error_df.head(10)
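
The splitting step is easier to follow on a single message. The sketch below is a simplified, assumed version that applies a subset of the same regular expressions, in the same order, and returns one label per line. It uses the standard-library `re` module rather than `regex`; the log text is invented for illustration and `classify_line` is a hypothetical helper.

```python
import re


def classify_line(line):
    """Return a coarse label for one log line, using a subset of the rules above."""
    line = line.strip()
    if re.match(r".*WARNING.*|^copying.*|^checking.*", line):
        return "ignored"
    if re.match(r"^command.*", line):
        return "command_info"
    if re.match(r"^cwd.*", line):
        return "cwd"
    if re.search(r"\w+Error", line):
        return "specific_error"
    if re.match(r"^ERROR: .*", line):
        return "ERROR"
    if re.match(r"^Command.*", line):
        return "Error_info"
    return "unmatched"


# Hypothetical log text, written for illustration only.
sample_log = (
    "Command exited with non-zero status code (1):\n"
    "ERROR: Command errored out with exit status 1:\n"
    "command: /usr/bin/python3 -c 'import setuptools'\n"
    "cwd: /tmp/pip-install-abc/\n"
    "ModuleNotFoundError: No module named 'numpy'\n"
)

labels = [classify_line(line) for line in sample_log.splitlines()]
print(labels)
```

The matching is case-sensitive, so `Command ...` and `command: ...` land in different buckets, and `ERROR:` does not count as a word ending in `Error`.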

## Prepare the data for clustering <a id='prepare_data'></a>

Look for data in the 'specific_error' column of the dataframe first. If it is empty, fall back to the 'ERROR' column, and then to the 'Error_info' column. If all the extracted columns are empty, take the error data from the 'message' column.
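
The priority order can be sketched as a small fallback function. This is a simplified illustration, not the cell that follows: it operates on plain dicts, treats a missing key as an empty cell (the dataframe uses NaN instead), and leaves out the edge-case handling for specific error names. `pick_clustering_text` and the sample rows are hypothetical.

```python
def pick_clustering_text(row):
    """Return the first non-empty field, in the priority order described above."""
    for column in ("specific_error", "ERROR", "Error_info", "message"):
        value = row.get(column)
        if value:  # skips None, empty strings and empty lists
            return column, value
    return None, None


# Hypothetical rows; a missing key stands in for an empty cell.
rows = [
    {"specific_error": ["ValueError: bad value"], "message": "full log"},
    {"ERROR": ["ERROR: Failed building wheel"], "message": "full log"},
    {"message": "full log"},
]
for row in rows:
    print(pick_clustering_text(row))
```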

In [None]:
clustering_data = {}
context_message = {}
for idx, error in enumerate(error_df['specific_error']):
    if type(error) != list:
        #if type(error_df.iloc[idx]['Exception']) == list:
            #clustering_data[idx] = error_df.iloc[idx]['Exception']
        if pd.isnull(error) and  type(error_df.iloc[idx]['ERROR']) == list:
            context_message[idx] = error_df.iloc[idx]['ERROR']
            r = re.compile(".*Failed.*|.*CUDA.*")
            match = list(filter(r.match, error_df.iloc[idx]['ERROR']))    
            if len(error_df.iloc[idx]['ERROR']) == 1 or not match:
                clustering_data[idx] = [error_df.iloc[idx]['ERROR'][0]]
            else:
                clustering_data[idx] = match
        elif type(error_df.iloc[idx]['ERROR']) != list:
            if pd.isnull(error_df.iloc[idx]['ERROR']):
                context_message[idx] = error_df.iloc[idx]['Error_info']
                clustering_data[idx] = [error_df.iloc[idx]['Error_info']]
                if type(error_df.iloc[idx]['Error_info']) != list:
                    if pd.isnull(error_df.iloc[idx]['Error_info']):
                        context_message[idx] = error_df.iloc[idx]['message']
                        clustering_data[idx] = [error_df.iloc[idx]['message']]        
    else:
        context_message[idx] = error_df.iloc[idx]['specific_error']     
        # Extracting just the words with 'Error' as suffix to it and handling some edge cases. 
        r = re.compile(r"(\w\w*Error)")
        string = ' '.join(error)
        match = re.findall(r, string)
        if match:
            if len(match) == 1:
                clustering_data[idx] = match
            else:
                if 'SyntaxError' in match:
                    for err in match:
                        if re.match(rf'.*{err},|.*{err}\),', string):
                            clustering_data[idx] = ['SyntaxError']
                            break
                    if idx not in clustering_data.keys():
                        clustering_data[idx] = list(set(match))
                elif {'OSError', 'EnvironmentError'} == set(match):
                    clustering_data[idx] = ['OSError']
                elif {'Error_GetLastError', 'AttributeError'} == set(match):
                    clustering_data[idx] = ['AttributeError']
                elif {'GrabNetworkError', 'GrabError', 'ModuleNotFoundError'} == set(match):
                    clustering_data[idx] = ['ModuleNotFoundError']
                elif {'ReadTimeoutError', 'EnvironmentError'} == set(match):
                    clustering_data[idx] = ['ReadTimeoutError']
                elif {'NewConnectionError', 'EnvironmentError'} == set(match):
                    clustering_data[idx] = ['NewConnectionError']
                else:
                    clustering_data[idx] = list(set(match))
        else:
            clustering_data[idx] = error
    if clustering_data[idx] == ['Command exited with non-zero status code (-9):']:
        clustering_data[idx] = ['Error Info not Available']
    string = ' '.join(clustering_data[idx])
    if re.match(".*make: .*|.*CMake .*|.*zError.*|.*SWIG_Python_TypeError.*|.*CatchlibLZMAError.*|.*PyExc_MemoryError.*|.*ViZDoomError.*|.*PyExc_KeyError.*|.*icsneoGetError.*|.*LapackSrcNotFoundError.*|.*PyFrame_FastToLocalsWithError.*", string):
    #if re.match(".*zError.*|.*SWIG_Python_TypeError.*|.*CatchlibLZMAError.*|.*PyExc_KeyError.*|.*PyFrame_FastToLocalsWithError.*", clustering_data[idx][0]):
        clustering_data[idx] = error_df.iloc[idx]['ERROR']

In [None]:
len(clustering_data), len(context_message)

## Cleaning clustering data <a id='clean_data'></a>

In [None]:
_line_number = r'(at line[:]*\s*\d+)'
_url = r'(http[s]|root|srm|file)*:(//|/)(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+'
_filepath = r"(/[a-zA-Z\./]*[\s]?)"
path_regex = re.compile(r'(\b\w+://)\S+(?=\s)')
file_regex = re.compile(r'(\b[f|F]ile( exists)?:?\s?)/\S+(?=\s)')
py_regex = re.compile(r'/?\b[-./_a-zA-Z0-9]+\.py\b')
long_regex = re.compile(r'[-/_a-zA-Z0-9]{25,}')

In [None]:
def remove_whitespaces(sentence):
    return " ".join(sentence.split())

def substitute_path(string):
    string = path_regex.sub(r'\1', string)
    string = py_regex.sub(r' ', string)
    string = file_regex.sub(r'\1', string)
    string = long_regex.sub(r'', string)
    return string

def cleaner(log_messages):
    clean_log = {}
    for key in log_messages:
        for item in log_messages[key]:
            item = re.sub(_line_number, "at line *", item)
            item = re.sub(_url, " ", item)
            item = re.sub(_filepath, " ", item)
            item = substitute_path(item)
            if key in clean_log.keys() and item not in clean_log[key]:
                clean_log[key] = clean_log[key] + ' ' + remove_whitespaces(item)
            else:
                clean_log[key] = remove_whitespaces(item)
    return clean_log

In [None]:
clustering_data = cleaner(clustering_data)

In [None]:
len(clustering_data)
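
The goal of the cleaning step is that two messages differing only in line numbers, file paths, or long identifiers end up identical. The sketch below shows that effect with three of the patterns defined above (line number, `.py` path, long token), using the standard-library `re` module. The input messages are invented, and `normalise` is a hypothetical stand-in for the fuller `cleaner` function.

```python
import re

line_number = re.compile(r"(at line[:]*\s*\d+)")
py_file = re.compile(r"/?\b[-./_a-zA-Z0-9]+\.py\b")
long_token = re.compile(r"[-/_a-zA-Z0-9]{25,}")


def normalise(message):
    """Mask line numbers, drop .py paths and long tokens, squeeze whitespace."""
    message = line_number.sub("at line *", message)
    message = py_file.sub(" ", message)
    message = long_token.sub("", message)
    return " ".join(message.split())


# Hypothetical messages that differ only in line number and path.
a = normalise("error at line 12: cannot compile /tmp/build/setup.py today")
b = normalise("error at line 98: cannot compile /opt/src/setup.py today")
c = normalise("checksum 0123456789abcdef0123456789abcdef mismatch")
print(a)
print(c)
```

After normalisation `a` and `b` are the same string, so they would fall into the same cluster.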

## Tokenization <a id='tokenization'></a>

In [None]:
stemmer = nltk.PorterStemmer()
stop = punctuation + "``" + "''" + '""' + "/"
table = str.maketrans(stop, ' '*len(stop))
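
The translation table maps every punctuation character to a space, so that a later split yields bare words. A minimal sketch of that behaviour on an invented message follows. Note that the backtick, both quote characters, and `/` are already part of `string.punctuation`, so appending them again does not change the result.

```python
from string import punctuation

# Same construction as in the cell above.
stop = punctuation + "``" + "''" + '""' + "/"
table = str.maketrans(stop, " " * len(stop))

# Hypothetical error message, for illustration only.
sample = "ModuleNotFoundError: No module named 'numpy.core'"
cleaned = sample.translate(table)
words = cleaned.split()
print(words)
```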

In [None]:
def tokenization(log_messages):
    tokenized = []
    for key, item in log_messages.items():
        item = item.replace(' Error','Error').strip()
        item = item.replace('Errno','').strip()
        item = item.replace('Error :','Error:').strip()
        item = item.replace('Exception:','').strip()
        item = item.replace('Exception','').strip()
        item = item.replace('\'','').strip()
        item = item.replace('=',' ').strip()
        if error_df['package_name'][key] in item.split():
            item = item.replace(error_df['package_name'][key],'').strip()
        if "interpreter:" in item.split():
            item = item.split("interpreter",1)[1]
        if " Please" in item:
            item = item.split("Please",1)[0]
        if "for"  in item.split():
            item = item.split("for",1)[0] 
        if "in" in item.split():
            if "in an" in item:
                item = item.split("in an",1)[1]
            else:
                item = item.split("in",1)[0]
        if "on" in item.split():
            item = re.split(" on ",item)[0] 
        if re.match(r"^ERROR:.* |^Error:.* |^error:.*", item):
            item = re.split("^ERROR: |^error: |^Error: |^error:Error:",item)[1].split('.')[0]
        if item.strip() == '':
            item = error_df.iloc[key]['ERROR'][0].translate(table).strip()
        if "Command" in item:
            if "ERROR:" in item:
                item = item.split("ERROR:",1)[1] 
            if re.match(r".*Command errored out with exit status.*", item):
                item = "Check the logs"
        if "not found" in item:
            words = [stemmer.stem(word) for word in item.split()] 
            item = ''
            for word in words:
                item += word[0].upper() + word[1:]
            item += "Error"
        if ":" in item and item[0] != ":":
            item = item.split(":",1)[0] 
        if "," in item:
            item = item.split(",",1)[0] 
        if "JAVA_HOME" in item:
            item = 'JAVA HOME not set to a path containing the JDK'
        if re.match(r"(\w\w* error)", item):
            item = item.split()[0] + "Error"
        if "XML" in item:
            item = "cannot get pre-processor and compiler flags"
        item = item.translate(table).strip()
        tokenized.append(TreebankWordTokenizer().tokenize(item))
    cleaned_tokens = []
    for id, row in enumerate(tokenized):
        cleaned_tokens.append(list(filter(None, [i
                                                 for i in row 
                                                 if i != error_df['package_name'][id]
                                                 and not i.lower().isnumeric()])))
    return cleaned_tokens

In [None]:
tokenized_clustering_data = tokenization(clustering_data)

In [None]:
tokenized_clustering_data

## Save the cleaned data for clustering <a id='save_data'></a>

In [None]:
error_df['clustering_data'] = error_df['index'].map(clustering_data)

In [None]:
error_df['tokenized_clustering_data'] = tokenized_clustering_data

In [None]:
error_df['context_message'] = error_df['index'].map(context_message)

In [None]:
error_df.head(20)

In [None]:
error_df.to_csv('error-clean-data.csv', index=False)

In [None]:
len(error_df)
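
One thing to keep in mind when reading `error-clean-data.csv` back: CSV stores every cell as text, so list-valued columns such as `tokenized_clustering_data` come back as strings like `"['ModuleNotFoundError']"`. The sketch below shows the round trip and one way to restore the lists with `ast.literal_eval`. It uses the standard-library `csv` module as a stand-in for pandas, and the row content is invented.

```python
import ast
import csv
import io

# Hypothetical row with a list-valued cell.
rows = [{"index": 0, "tokenized_clustering_data": ["ModuleNotFoundError"]}]

# Write to an in-memory buffer instead of a file on disk.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["index", "tokenized_clustering_data"])
writer.writeheader()
writer.writerows(rows)

# Read it back: the list is now its string representation.
buffer.seek(0)
loaded = list(csv.DictReader(buffer))
raw = loaded[0]["tokenized_clustering_data"]

# Parse the string back into a Python list.
tokens = ast.literal_eval(raw)
print(type(raw).__name__, tokens)
```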