Building an AI model to classify logs into P1, P2, P3, and P4.

### Data Loading and Exploration 

#### Efficient Data Loading
- ***Chunk Loading:*** Because of the dataset's size, data is read in chunks to limit peak memory use.
- ***PySpark Integration:*** For scalability, PySpark is used for parallel data processing.
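
Chunked reading only lowers peak memory when each chunk is reduced before the next one is read. Concatenating every chunk back into one DataFrame still holds the full dataset in memory. The following is a minimal sketch of per-chunk aggregation, assuming a tiny in-memory CSV with hypothetical rows in place of the real file and a chunk size of 2 chosen purely for illustration.

```python
import io
from collections import Counter

import pandas as pd

# Tiny in-memory stand-in for the real CSV file (hypothetical rows).
sample_csv = io.StringIO(
    "timestamp,level,message,source\n"
    "Thu Jun 09 06:07:04 2005,notice,LDAP: SSL support unavailable,Apache\n"
    "Thu Jun 09 06:07:05 2005,notice,Digest: done,Apache\n"
    "Thu Jun 09 06:07:06 2005,error,File does not exist,Apache\n"
)

level_counts = Counter()
rows_seen = 0

# Each chunk is an ordinary DataFrame holding at most `chunksize` rows.
for chunk in pd.read_csv(sample_csv, chunksize=2):
    rows_seen += len(chunk)
    # Reduce the chunk to a small summary before the next chunk is read.
    level_counts.update(chunk["level"].str.upper().value_counts().to_dict())

print(rows_seen, dict(level_counts))
```

Only the running counters survive between iterations, so memory use is bounded by the chunk size rather than by the file size.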

Library imports

In [43]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

Configurations

In [44]:
%matplotlib inline
nltk.download('stopwords')
nltk.download('wordnet')

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/rshekar/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /Users/rshekar/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

Loading dataset in chunks

In [45]:
chunk_size = 500000
chunks = pd.read_csv('../datasets/processed_logs.csv', chunksize=chunk_size)
logs_df = pd.concat(chunks, ignore_index=True)



In [46]:
print(logs_df.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8505002 entries, 0 to 8505001
Data columns (total 4 columns):
 #   Column     Dtype 
---  ------     ----- 
 0   timestamp  object
 1   level      object
 2   message    object
 3   source     object
dtypes: object(4)
memory usage: 259.6+ MB
None


#### Initial Data Inspection
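- ***Schema:*** The dataset has four columns: timestamp, level, message, and source, all stored as object dtype.
- ***Null Values:*** Missing values are counted per column. The timestamp column has 2,500,000 nulls and message has 101, while level and source have none.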

In [47]:
print(logs_df.head())

print(logs_df.isnull().sum())

                  timestamp   level  \
0  Thu Jun 09 06:07:04 2005  notice   
1  Thu Jun 09 06:07:04 2005  notice   
2  Thu Jun 09 06:07:04 2005  notice   
3  Thu Jun 09 06:07:05 2005  notice   
4  Thu Jun 09 06:07:05 2005  notice   

                                             message  source  
0                 LDAP: Built with OpenLDAP LDAP SDK  Apache  
1                      LDAP: SSL support unavailable  Apache  
2  suEXEC mechanism enabled (wrapper: /usr/sbin/s...  Apache  
3  Digest: generating secret for digest authentic...  Apache  
4                                       Digest: done  Apache  
timestamp    2500000
level              0
message          101
source             0
dtype: int64
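
One possible way to handle the missing and malformed entries counted above is sketched here. It assumes the timestamp format seen in the head() output (for example "Thu Jun 09 06:07:04 2005") and uses a hypothetical three-row frame. Other sources in the dataset may use different formats, in which case their timestamps would also be coerced to NaT by this sketch.

```python
import pandas as pd

# Hypothetical rows mirroring two of the columns shown by logs_df.head().
df = pd.DataFrame({
    "timestamp": ["Thu Jun 09 06:07:04 2005", None, "not a timestamp"],
    "message": ["LDAP: SSL support unavailable", None, "Digest: done"],
})

# Count missing values per column, as isnull().sum() does.
missing_before = df.isnull().sum()

# Missing or unparseable timestamps become NaT instead of raising an error.
df["parsed_timestamp"] = pd.to_datetime(
    df["timestamp"], format="%a %b %d %H:%M:%S %Y", errors="coerce"
)

# Replace missing messages with a placeholder string (an illustrative choice).
df["message"] = df["message"].fillna("No message provided")

print(missing_before)
print(df)
```

Comparing the NaT count after parsing with the null count before parsing separates entries that were missing from entries that were present but malformed.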


#### Exploratory Data Analysis (EDA) and Cleaning

In [48]:
print(logs_df['level'].value_counts())

level
INFO                                5954567
WARN                                 833297
ERROR                                625355
CRITICAL                             624647
FATAL                                321471
                                     ...   
GoogleSoftwareUpdateAgent[35089]          1
netbiosd[35901]                           1
netbiosd[31279]                           1
helpd[36107]                              1
GoogleSoftwareUpdateAgent[33940]          1
Name: count, Length: 1040, dtype: int64


The level column has 1,040 unique values, but standard log levels like INFO, WARN, ERROR, CRITICAL, and FATAL dominate the dataset. The remaining values appear to be process names or non-standard log levels, such as GoogleSoftwareUpdateAgent[35089], netbiosd[35901], etc.

Separating standard log levels and process names.

In [49]:
standard_levels = ['INFO', 'WARN', 'ERROR', 'CRITICAL', 'FATAL', 'NOTICE', 'DEBUG']

logs_df["cleaned_level"] = logs_df["level"].apply(lambda x: x.upper() if x.upper() in standard_levels else 'OTHER')

print(logs_df["cleaned_level"].value_counts())

cleaned_level
INFO        5954567
WARN         833465
ERROR        663436
CRITICAL     624647
FATAL        321471
OTHER         93661
NOTICE        13755
Name: count, dtype: int64


Extracting log source (process names) if 'level' is not standard

In [None]:
# Keep the raw level value as the log source when it is not a standard level; otherwise use "SYSTEM"
logs_df["log_source"] = logs_df["level"].where(logs_df["cleaned_level"] == "OTHER", "SYSTEM")

print(logs_df["log_source"].value_counts())
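
Values such as GoogleSoftwareUpdateAgent[35089] and GoogleSoftwareUpdateAgent[33940] are counted as distinct sources, even though they appear to be the same process with different bracketed numbers. The sketch below shows one way to collapse them. It assumes the bracketed number is a process ID, and the helper name split_process_and_pid is hypothetical.

```python
import re

# Matches values such as "netbiosd[35901]": a name followed by digits in brackets.
PROCESS_WITH_PID = re.compile(r"^(?P<name>[^\[\]]+)\[(?P<pid>\d+)\]$")


def split_process_and_pid(level_value):
    """Return (name, pid) if the value looks like name[pid], else (value, None)."""
    match = PROCESS_WITH_PID.match(level_value)
    if match is None:
        return level_value, None
    return match.group("name"), int(match.group("pid"))


examples = [
    "GoogleSoftwareUpdateAgent[35089]",
    "GoogleSoftwareUpdateAgent[33940]",
    "netbiosd[35901]",
    "INFO",
]
names = {split_process_and_pid(value)[0] for value in examples}
print(sorted(names))
```

Applied to the level column, this would reduce the number of distinct log_source values by grouping entries per process rather than per process instance.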

#### Text Preprocessing (NLP)
- ***Lowercasing:*** Converted all messages to lowercase.
- ***Punctuation Removal:*** Removed every character that is not a letter, digit, or whitespace.
- ***Stopword Removal:*** Eliminated common stopwords using NLTK.
- ***Lemmatization:*** Reduced words to their base forms for uniformity.

In [None]:
stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()

# Function to clean log messages
def preprocess_log_message(message):
    # Replace NaN or other non-string values with an empty string
    if not isinstance(message, str):
        message = ''
    message = message.lower()
    message = re.sub(r'[^a-zA-Z0-9\s]', '', message)
    message = ' '.join([lemmatizer.lemmatize(word) for word in message.split() if word not in stop_words])
    return message

# logs_df['message'] = logs_df['message'].fillna('No message provided')
# logs_df['cleaned_message'] = logs_df["message"].apply(preprocess_log_message)
# print(logs_df["message"].head(10))
print(logs_df['message'].isnull().sum())


0
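
To make the effect of each preprocessing step concrete, here is a dependency-free sketch that traces the same pipeline on a few messages. It assumes a deliberately tiny, hypothetical stopword set and an identity function in place of the WordNet lemmatizer, so its output will differ from the NLTK-based function wherever NLTK removes more stopwords or changes a word form. The file path in the last example is also hypothetical.

```python
import re

# Hypothetical, deliberately tiny stopword set; the notebook uses NLTK's English list.
SAMPLE_STOP_WORDS = {"for", "the", "with", "is"}


def preprocess_sketch(message, stop_words=SAMPLE_STOP_WORDS, lemmatize=lambda w: w):
    """Lowercase, strip punctuation, drop stopwords, then lemmatize each token."""
    if not isinstance(message, str):
        return ""
    message = message.lower()
    message = re.sub(r"[^a-z0-9\s]", "", message)
    tokens = [lemmatize(t) for t in message.split() if t not in stop_words]
    return " ".join(tokens)


print(preprocess_sketch("LDAP: SSL support unavailable"))
print(preprocess_sketch("LDAP: Built with OpenLDAP LDAP SDK"))
print(preprocess_sketch("Cannot open /var/log/app.log"))  # hypothetical message
print(repr(preprocess_sketch(float("nan"))))
```

Because punctuation is deleted rather than replaced with a space, the path in the third message collapses into the single token varlogapplog. Substituting a space in the re.sub call would keep path components as separate tokens, if that suits the classifier better.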
