<a href="https://colab.research.google.com/github/slrico/Log-clickstream-Analysis/blob/main/Apache_Spark_Log_Analysis.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

---

### Dataset Overview:

| Column Name    | Description                                                                 | Example                  |
|----------------|-----------------------------------------------------------------------------|--------------------------|
| **LineId**     | A unique sequential identifier for each log entry.                         | 1, 2, 3, ...           |
| **Time**       | The timestamp at which the log entry was recorded, in the format `Day Mon DD HH:MM:SS YYYY`. | Sun Dec 04 04:47:44 2005  |
| **Level**      | The severity or importance level of the log message (e.g., `notice`, `error`). | notice, error          |
| **Content**    | The main log message or content describing the event or status. Includes some dynamic variables indicated by placeholders `<*>`. | `workerEnv.init() ok /etc/httpd/conf/workers2.p...`  |
| **EventId**    | An identifier for the event type; in the sample rows, entries that share an `EventTemplate` also share an `EventId`. | E2, E3, E1             |
| **EventTemplate** | A generalized template of the log message, where specific variable content is replaced with placeholders, capturing the message structure. | `workerEnv.init() ok <*>` |

---

### Explanation:

- **LineId** and **Time** provide a unique identifier and timestamp for each log entry, facilitating chronological analysis.
- **Level** classifies logs by severity: `notice` for normal events, `error` for issues.
- **Content** contains the detailed message, which often includes variable data such as paths, IDs, or specific values.
- **EventId** labels the event type, so entries with the same `EventId` can be grouped and counted together.
- **EventTemplate** abstracts the `Content` by replacing variable elements with placeholders (`<*>`), enabling pattern recognition and grouping similar logs based on structure.
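
Whether each `EventId` really corresponds to a single `EventTemplate` can be checked directly. The following is a minimal sketch on a hand-made sample that mirrors the first rows of the dataset; `sample`, `templates_per_event`, and `one_to_one` are illustrative names, not part of the notebook.

```python
import pandas as pd

# Hand-made sample mirroring the first rows of the dataset (illustrative only)
sample = pd.DataFrame({
    'EventId': ['E2', 'E3', 'E1', 'E1', 'E1'],
    'EventTemplate': [
        'workerEnv.init() ok <*>',
        'mod_jk child workerEnv in error state <*>',
        'jk2_init() Found child <*> in scoreboard slot <*>',
        'jk2_init() Found child <*> in scoreboard slot <*>',
        'jk2_init() Found child <*> in scoreboard slot <*>',
    ],
})

# Count the distinct templates seen for each EventId
templates_per_event = sample.groupby('EventId')['EventTemplate'].nunique()

# True when every EventId is tied to exactly one template
one_to_one = bool((templates_per_event == 1).all())
print(templates_per_event)
print("One template per EventId:", one_to_one)
```

To run the same check on the full dataset, replace `sample` with the loaded DataFrame.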

---

### Example insights:

- Entries with the same **EventTemplate** share an event structure; only the variable parts (paths, IDs), which the template replaces with `<*>`, differ between them.
- Analyzing **EventTemplate** helps identify common log message patterns, which can be useful for understanding system behavior, troubleshooting, or log aggregation.
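
The relationship between `Content` and `EventTemplate` can also be used in reverse: a template can be turned into a regular expression that recovers the variable parts of a message. This is a sketch under the assumption that every `<*>` stands for a non-empty run of characters; `template_to_regex` is a hypothetical helper, not something the dataset or notebook provides.

```python
import re

def template_to_regex(template):
    """Turn an EventTemplate into a regex with one capture group per <*>."""
    # Escape the literal pieces and join them with a lazy capture group
    literal_parts = [re.escape(part) for part in template.split('<*>')]
    return re.compile('^' + '(.+?)'.join(literal_parts) + '$')

template = 'jk2_init() Found child <*> in scoreboard slot <*>'
content = 'jk2_init() Found child 6725 in scoreboard slot 10'

match = template_to_regex(template).match(content)
variables = match.groups() if match else ()
print(variables)  # ('6725', '10')
```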

---


In [1]:
import pandas as pd

df = pd.read_csv('./Apache_2k.log_structured.csv')
df.head()


Unnamed: 0,LineId,Time,Level,Content,EventId,EventTemplate
0,1,Sun Dec 04 04:47:44 2005,notice,workerEnv.init() ok /etc/httpd/conf/workers2.p...,E2,workerEnv.init() ok <*>
1,2,Sun Dec 04 04:47:44 2005,error,mod_jk child workerEnv in error state 6,E3,mod_jk child workerEnv in error state <*>
2,3,Sun Dec 04 04:51:08 2005,notice,jk2_init() Found child 6725 in scoreboard slot 10,E1,jk2_init() Found child <*> in scoreboard slot <*>
3,4,Sun Dec 04 04:51:09 2005,notice,jk2_init() Found child 6726 in scoreboard slot 8,E1,jk2_init() Found child <*> in scoreboard slot <*>
4,5,Sun Dec 04 04:51:09 2005,notice,jk2_init() Found child 6728 in scoreboard slot 6,E1,jk2_init() Found child <*> in scoreboard slot <*>


---

### Log Template Extraction Workflow

This code processes each log entry in the `'Content'` column of the DataFrame:

- It defines a function `extract_template()` that standardizes logs by replacing variable components (file paths, IP addresses, numeric IDs, timestamps, and `<*>` placeholders) with fixed tokens (`<PATH>`, `<IP>`, `<ID>`, `<DATETIME>`, `<PLACEHOLDER>`). It also strips the empty parentheses after function names and collapses whitespace, producing a consistent template.

- The function is applied to all log entries, creating a new `'Template'` column. This enables grouping similar logs based on their core structure, abstracted from variable details.

- The code then performs frequency analysis with `value_counts()` to identify the most common templates, or groups logs by `'Template'` for further analysis.

---


In [2]:
# df is the DataFrame loaded above; its 'Content' column holds the log messages

# Define extract_template(), which turns a log message into a generic template
import re

def extract_template(content):
    # Replace file paths
    content = re.sub(r'/[^ ]+', '<PATH>', content)
    # Replace IP addresses
    content = re.sub(r'\b\d{1,3}(?:\.\d{1,3}){3}\b', '<IP>', content)
    # Replace dates/times first; once bare numbers are replaced this pattern can no longer match
    content = re.sub(r'(?:Sun|Mon|Tue|Wed|Thu|Fri|Sat)\s+[A-Za-z]{3}\s+\d{1,2}\s+\d{2}:\d{2}:\d{2}\s+\d{4}', '<DATETIME>', content)
    # Replace all remaining numbers
    content = re.sub(r'\b\d+\b', '<ID>', content)
    # Replace placeholders
    content = re.sub(r'<\*>', '<PLACEHOLDER>', content)
    # Keep function names, remove parentheses
    content = re.sub(r'(\b[A-Za-z_][A-Za-z0-9_]*)(\(\))', r'\1', content)
    # Normalize spaces
    content = re.sub(r'\s+', ' ', content).strip()
    return content

# Apply to create templates
df['Template'] = df['Content'].apply(extract_template)

# Now you can group by templates
template_counts = df['Template'].value_counts()

# Or get all logs belonging to a specific template
grouped = df.groupby('Template')

- This cell keeps a copy of the original `Content` in `Original_Content`, applies a placeholder `clean_content()` function (currently only `strip()`), and stores the result in `Cleaned_Content`.

- To make the effect of cleaning visible, it uses `difflib.ndiff` to build a line-by-line diff between the original and cleaned text and stores it in a new `Difference` column. In the sample printed below, no line carries a `-` or `+` marker, so `strip()` left those messages unchanged.

In [3]:
import difflib

# Define your cleaning function
def clean_content(content):
    # Example cleaning (replace with your actual cleaning steps)
    # For example: strip whitespace, lower case, remove special characters, etc.
    cleaned = content.strip()
    # Add any other cleaning steps here
    return cleaned

# Store original content before cleaning
df['Original_Content'] = df['Content']

# Reapply your cleaning function
df['Cleaned_Content'] = df['Content'].apply(clean_content)

# Function to generate a detailed diff
def generate_diff(orig, cleaned):
    diff = difflib.ndiff(orig.splitlines(), cleaned.splitlines())
    return '\n'.join(diff)

# Create a column with the diff
df['Difference'] = df.apply(
    lambda row: generate_diff(row['Original_Content'], row['Cleaned_Content']), axis=1
)

# Show a sample of differences
print(df[['LineId', 'Difference']].head(10))

   LineId                                         Difference
0       1    workerEnv.init() ok /etc/httpd/conf/workers2...
1       2            mod_jk child workerEnv in error state 6
2       3    jk2_init() Found child 6725 in scoreboard sl...
3       4    jk2_init() Found child 6726 in scoreboard sl...
4       5    jk2_init() Found child 6728 in scoreboard sl...
5       6    workerEnv.init() ok /etc/httpd/conf/workers2...
6       7    workerEnv.init() ok /etc/httpd/conf/workers2...
7       8    workerEnv.init() ok /etc/httpd/conf/workers2...
8       9            mod_jk child workerEnv in error state 6
9      10            mod_jk child workerEnv in error state 6
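
Since the sample above shows no changes, it may help to see what the diff looks like when cleaning does alter a message. The sketch below reuses the same `ndiff` approach on a made-up message with stray whitespace; the input string is hypothetical and does not come from the dataset.

```python
import difflib

def generate_diff(orig, cleaned):
    # Same approach as the cell above: compare the two texts line by line
    return '\n'.join(difflib.ndiff(orig.splitlines(), cleaned.splitlines()))

# Hypothetical message with stray whitespace, so that strip() changes it
original = '   mod_jk child workerEnv in error state 6  '
cleaned = original.strip()

diff_text = generate_diff(original, cleaned)
print(diff_text)

# Lines starting with '- ' come from the original, '+ ' from the cleaned text
changed = any(line.startswith(('- ', '+ ')) for line in diff_text.splitlines())
print("Content changed:", changed)
```

A boolean flag like `changed` is often easier to filter on than the full diff text when looking for the rows that cleaning actually touched.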


- This code searches for file paths, IP addresses, numeric IDs, and timestamps based on specific patterns and stores them in a dictionary.

- The function returns these extracted elements, which can be applied to each row of a DataFrame to create a new column containing the parsed variables.

In [4]:
import re

def extract_variables(content):
    variables = {}

    # Extract file paths (re.findall always returns a list, empty if nothing matches)
    variables['paths'] = re.findall(r'/[^ ]+', content)

    # Extract IP addresses
    variables['ips'] = re.findall(r'\b\d{1,3}(?:\.\d{1,3}){3}\b', content)

    # Extract numeric IDs (note: this also picks up the individual octets of IP addresses)
    variables['ids'] = re.findall(r'\b\d+\b', content)

    # Extract timestamps; the weekday group is non-capturing so that
    # findall returns the full timestamp rather than only the day name
    variables['timestamps'] = re.findall(r'(?:Sun|Mon|Tue|Wed|Thu|Fri|Sat)\s+[A-Za-z]{3}\s+\d{1,2}\s+\d{2}:\d{2}:\d{2}\s+\d{4}', content)

    # You can add more regexes for other variable parts as needed

    return variables

# Apply this to your DataFrame:
df['Extracted_Variables'] = df['Content'].apply(extract_variables)

In [5]:
from collections import Counter

# Count most common file paths
all_paths = sum(df['Extracted_Variables'].apply(lambda x: x['paths']), [])
common_paths = Counter(all_paths).most_common(10)
print("Most common paths:", common_paths)

Most common paths: [('/etc/httpd/conf/workers2.properties', 569), ('/var/www/html/', 32)]
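
Flattening with `sum(lists, [])` builds a new list at every step, so its cost grows quadratically with the number of rows. A possible alternative, sketched here on hypothetical data shaped like `df['Extracted_Variables']`, is `itertools.chain.from_iterable`, which walks the lists in a single pass.

```python
from collections import Counter
from itertools import chain

# Hypothetical per-row extraction results, shaped like df['Extracted_Variables']
extracted = [
    {'paths': ['/etc/httpd/conf/workers2.properties']},
    {'paths': []},
    {'paths': ['/var/www/html/']},
    {'paths': ['/etc/httpd/conf/workers2.properties']},
]

# Flatten the per-row lists in a single pass instead of repeated list concatenation
all_paths = chain.from_iterable(row['paths'] for row in extracted)
common_paths = Counter(all_paths).most_common(10)
print("Most common paths:", common_paths)
```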


- This code extracts the full timestamp from the `Time` column with a regular expression and stores each match in a new `Parsed_Time` column. Values for which the extraction fails are dropped before counting; the DataFrame rows themselves are kept.
- Using `Counter`, it then tallies how often each timestamp occurs and prints the ten most frequent ones with their counts.

In [6]:
import re
from collections import Counter

# Define a function to extract full timestamp strings from the 'Time' column
def extract_full_time_stamp(time_str):
    pattern = r'[A-Za-z]{3}\s+[A-Za-z]{3}\s+\d{1,2}\s+\d{2}:\d{2}:\d{2}\s+\d{4}'
    match = re.findall(pattern, time_str)
    return match[0] if match else None

# Apply the extraction function to the 'Time' column
df['Parsed_Time'] = df['Time'].apply(extract_full_time_stamp)

# Drop any rows where extraction failed (None)
times = df['Parsed_Time'].dropna()

# Count the frequency of each timestamp
counter = Counter(times)

# Get and display the 10 most common timestamps
most_common_times = counter.most_common(10)

print("Most common timestamps:")
for time_str, count in most_common_times:
    print(f"{time_str}: {count} times")

Most common timestamps:
Mon Dec 05 07:57:02 2005: 18 times
Sun Dec 04 05:04:04 2005: 14 times
Sun Dec 04 07:18:00 2005: 14 times
Sun Dec 04 17:43:12 2005: 14 times
Mon Dec 05 04:14:00 2005: 14 times
Mon Dec 05 10:59:29 2005: 14 times
Sun Dec 04 20:47:17 2005: 12 times
Mon Dec 05 11:06:52 2005: 12 times
Sun Dec 04 16:41:22 2005: 10 times
Sun Dec 04 16:48:01 2005: 10 times
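
Counting exact timestamp strings only groups entries logged in the same second. If a coarser view is wanted, the timestamps can be parsed and truncated before counting. The following is a minimal sketch using a few timestamps from the output above; the hourly bucket size and the `hour_bucket` helper are assumptions made for illustration.

```python
from collections import Counter
from datetime import datetime

TIME_FORMAT = '%a %b %d %H:%M:%S %Y'  # matches e.g. 'Sun Dec 04 04:47:44 2005'

# A few timestamps taken from the output above
raw_times = [
    'Mon Dec 05 07:57:02 2005',
    'Sun Dec 04 05:04:04 2005',
    'Sun Dec 04 16:41:22 2005',
    'Sun Dec 04 16:48:01 2005',
]

def hour_bucket(time_str):
    # Parse the timestamp and truncate it to the start of its hour
    parsed = datetime.strptime(time_str, TIME_FORMAT)
    return parsed.replace(minute=0, second=0)

per_hour = Counter(hour_bucket(t) for t in raw_times)
for bucket, count in sorted(per_hour.items()):
    print(bucket.strftime('%Y-%m-%d %H:00'), count)
```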


In [7]:
import re
import pandas as pd
from collections import Counter

# 1. Define your extract_template() function
def extract_template(content):
    # Replace file paths
    content = re.sub(r'/[^ ]+', '<PATH>', content)
    # Replace IP addresses
    content = re.sub(r'\b\d{1,3}(?:\.\d{1,3}){3}\b', '<IP>', content)
    # Replace dates/times first; once bare numbers are replaced this pattern can no longer match
    content = re.sub(r'(?:Sun|Mon|Tue|Wed|Thu|Fri|Sat)\s+[A-Za-z]{3}\s+\d{1,2}\s+\d{2}:\d{2}:\d{2}\s+\d{4}', '<DATETIME>', content)
    # Replace all remaining numbers
    content = re.sub(r'\b\d+\b', '<ID>', content)
    # Replace placeholders
    content = re.sub(r'<\*>', '<PLACEHOLDER>', content)
    # Keep function names
    content = re.sub(r'(\b[A-Za-z_][A-Za-z0-9_]*)(\(\))', r'\1', content)
    # Normalize spaces
    content = re.sub(r'\s+', ' ', content).strip()
    return content

# 2. Apply extract_template() to create 'Template' column
df['Template'] = df['Content'].apply(extract_template)

# 3. Now, extract variables for each content
def extract_variables(content):
    variables = {}
    variables['paths'] = re.findall(r'/[^ ]+', content)
    variables['ips'] = re.findall(r'\b\d{1,3}(?:\.\d{1,3}){3}\b', content)
    variables['ids'] = re.findall(r'\b\d+\b', content)
    return variables

# 4. Apply variable extraction on original content
df['Variables'] = df['Content'].apply(extract_variables)

# 5. Group by 'Template'
grouped = df.groupby('Template')

# 6. Collect the variables once. These tallies cover the whole dataset, not a single
#    template; use group['Variables'] inside the loop to restrict them to one template.
all_paths = sum(df['Variables'].apply(lambda x: x['paths']).tolist(), [])
all_ips = sum(df['Variables'].apply(lambda x: x['ips']).tolist(), [])
all_ids = sum(df['Variables'].apply(lambda x: x['ids']).tolist(), [])

# 7. Report each group alongside the overall tallies
for template, group in grouped:
    print(f"\nTemplate: {template}")
    print(f"Number of messages: {len(group)}")

    print('Overall most common paths:', Counter(all_paths).most_common(10))
    print('Overall most common IPs:', Counter(all_ips).most_common(10))
    print('Overall most common IDs:', Counter(all_ids).most_common(10))



Template: [client <IP>] Directory index forbidden by rule: <PATH>
Number of messages: 32
Overall most common paths: [('/etc/httpd/conf/workers2.properties', 569), ('/var/www/html/', 32)]
Overall most common IPs: [('222.166.160.184', 1), ('63.13.186.196', 1), ('147.31.138.75', 1), ('207.203.80.15', 1), ('218.76.139.20', 1), ('24.147.151.74', 1), ('211.141.93.88', 1), ('216.127.124.16', 1), ('208.51.151.210', 1), ('65.68.235.27', 1)]
Overall most common IDs: [('6', 558), ('7', 296), ('8', 238), ('9', 181), ('10', 80), ('11', 16), ('1', 12), ('2', 12), ('12', 6), ('218', 6)]

Template: jk2_init Can't find child <ID> in scoreboard
Number of messages: 12
Overall most common paths: [('/etc/httpd/conf/workers2.properties', 569), ('/var/www/html/', 32)]
Overall most common IPs: [('222.166.160.184', 1), ('63.13.186.196', 1), ('147.31.138.75', 1), ('207.203.80.15', 1), ('218.76.139.20', 1), ('24.147.151.74', 1), ('211.141.93.88', 1), ('216.127.124.16', 1), ('208.51.151.210', 1), ('65.68.235.27'
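
The output above repeats the same dataset-wide tallies for every template. If per-template tallies are wanted instead, the counting can be restricted to each group's own rows. This is a sketch on a small hand-made sample whose `Template` and `Variables` columns mimic the ones built in the cell above; `sample` and `per_template_ids` are illustrative names.

```python
import re
from collections import Counter
from itertools import chain

import pandas as pd

# Small hand-made sample; 'Template' and 'Variables' mimic the columns built above
sample = pd.DataFrame({
    'Template': [
        'jk2_init Found child <ID> in scoreboard slot <ID>',
        'jk2_init Found child <ID> in scoreboard slot <ID>',
        'mod_jk child workerEnv in error state <ID>',
    ],
    'Content': [
        'jk2_init() Found child 6725 in scoreboard slot 10',
        'jk2_init() Found child 6726 in scoreboard slot 8',
        'mod_jk child workerEnv in error state 6',
    ],
})
sample['Variables'] = sample['Content'].apply(
    lambda text: {'ids': re.findall(r'\b\d+\b', text)}
)

per_template_ids = {}
for template, group in sample.groupby('Template'):
    # Use the group's own rows, not the whole DataFrame
    ids = chain.from_iterable(v['ids'] for v in group['Variables'])
    per_template_ids[template] = Counter(ids)
    print(template, per_template_ids[template].most_common(3))
```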

In [10]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import LabelEncoder
from datetime import datetime

# Example data loading
# df = pd.read_csv('your_data.csv')

# 1. Extract timestamp features
def extract_time_features(time_str):
    # Parse the date string
    dt = datetime.strptime(time_str, '%a %b %d %H:%M:%S %Y')
    return {
        'year': dt.year,
        'month': dt.month,
        'day': dt.day,
        'hour': dt.hour,
        'minute': dt.minute,
        'second': dt.second
    }

time_features = df['Time'].apply(extract_time_features)
time_df = pd.DataFrame(list(time_features))
df = pd.concat([df, time_df], axis=1)

# 2. Encode categorical columns
label_encoders = {}
for col in ['Level', 'EventId', 'EventTemplate']:
    le = LabelEncoder()
    df[col + '_Encoded'] = le.fit_transform(df[col])
    label_encoders[col] = le

# 3. Encode `Content` using pre-trained embeddings
# Using SentenceTransformers for better semantic capture
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('all-MiniLM-L6-v2')  # choose your preferred model
# Pass a plain list of strings to encode(); it returns one embedding per message
df['Content_Embedding'] = list(model.encode(df['Content'].tolist(), show_progress_bar=True))

# 4. Prepare feature matrix
# Combine numerical timestamp features, categorical encodings, and content embeddings
X = np.hstack([
    df[['year', 'month', 'day', 'hour', 'minute', 'second']].values,
    df[[col + '_Encoded' for col in ['Level', 'EventId', 'EventTemplate']]].values,
    np.vstack(df['Content_Embedding'])
])

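# 5. Prepare target variable
# Note: X above still contains 'Level_Encoded'. If 'Level' is the target, drop that
# column from X before training, otherwise the label leaks into the features.
y = df['Level_Encoded'].values  # numeric labels for 'Level'; or use your desired target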



In [11]:
# Display the first few rows of the DataFrame with all added features
print(df.head())

   LineId                      Time   Level  \
0       1  Sun Dec 04 04:47:44 2005  notice   
1       2  Sun Dec 04 04:47:44 2005   error   
2       3  Sun Dec 04 04:51:08 2005  notice   
3       4  Sun Dec 04 04:51:09 2005  notice   
4       5  Sun Dec 04 04:51:09 2005  notice   

                                             Content EventId  \
0  workerEnv.init() ok /etc/httpd/conf/workers2.p...      E2   
1            mod_jk child workerEnv in error state 6      E3   
2  jk2_init() Found child 6725 in scoreboard slot 10      E1   
3   jk2_init() Found child 6726 in scoreboard slot 8      E1   
4   jk2_init() Found child 6728 in scoreboard slot 6      E1   

                                       EventTemplate  \
0                            workerEnv.init() ok <*>   
1          mod_jk child workerEnv in error state <*>   
2  jk2_init() Found child <*> in scoreboard slot <*>   
3  jk2_init() Found child <*> in scoreboard slot <*>   
4  jk2_init() Found child <*> in scoreboard slot <*>

In [13]:
print(df.columns)

Index(['LineId', 'Time', 'Level', 'Content', 'EventId', 'EventTemplate',
       'Template', 'Original_Content', 'Cleaned_Content', 'Difference',
       'Extracted_Variables', 'Parsed_Time', 'Variables',
       'Cleaned_Content_Encoded', 'year', 'month', 'day', 'hour', 'minute',
       'second', 'Level_Encoded', 'EventId_Encoded', 'EventTemplate_Encoded',
       'Content_Embedding'],
      dtype='object')


In [12]:
# Check the shape
print("Shape of feature matrix X:", X.shape)

# Inspect a sample row
print("Sample feature vector:", X[0])

Shape of feature matrix X: (2000, 393)
Sample feature vector: [ 2.00500000e+03  1.20000000e+01  4.00000000e+00  4.00000000e+00
  4.70000000e+01  4.40000000e+01  1.00000000e+00  1.00000000e+00
  5.00000000e+00 -1.05698615e-01  2.79850625e-02 -5.73457330e-02
 -4.93196724e-03  2.52824295e-02 -7.84836337e-02 -1.17702503e-02
 -4.64029126e-02 -1.04454339e-01 -5.84492506e-03  1.39556825e-02
  3.04687619e-02 -2.18127370e-02  2.00901181e-02 -1.86533947e-02
  5.12557030e-02  6.94829896e-02 -6.42733201e-02  2.93499362e-02
 -1.41004019e-03  1.94235099e-03  1.86477341e-02  7.05204718e-03
  2.39634868e-02  1.74772437e-03 -1.85269918e-02 -3.43218632e-02
  5.77332377e-02 -1.69798546e-02  1.21299094e-02  4.01175208e-02
 -8.17891303e-03 -9.14918929e-02  1.06372833e-02  4.91540320e-02
  1.15131818e-01  9.20082405e-02  6.43925834e-03 -8.54848400e-02
  4.63875905e-02  1.25864828e-02 -3.52856852e-02  2.81110464e-04
 -7.55568519e-02 -4.12710905e-02  7.09811822e-02 -2.25118501e-03
 -8.20461214e-02  2.80345138
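
The sample vector mixes raw calendar values such as `2005` with embedding components around `0.01`, which can matter for scale-sensitive models. One possible remedy is to standardize the non-embedding columns. The sketch below does this with NumPy on a made-up matrix; `standardize_columns` and the toy values are illustrative assumptions, not part of the notebook.

```python
import numpy as np

def standardize_columns(X, columns):
    """Return a copy of X with the given columns scaled to zero mean, unit variance."""
    X = X.astype(float).copy()
    mean = X[:, columns].mean(axis=0)
    std = X[:, columns].std(axis=0)
    # Constant columns (e.g. a single year) have std 0; leave them centred at 0
    std[std == 0] = 1.0
    X[:, columns] = (X[:, columns] - mean) / std
    return X

# Toy matrix: [year, hour, embedding component]
X_toy = np.array([
    [2005, 4, 0.10],
    [2005, 5, -0.05],
    [2005, 6, 0.02],
])
X_scaled = standardize_columns(X_toy, columns=[0, 1])
print(X_scaled)
```

For the real feature matrix, the first nine columns (six time features and three label encodings) would be the candidates for scaling, while the embedding columns could be left as they are.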

In [8]:
from sklearn.preprocessing import LabelEncoder

label_encoder = LabelEncoder()
df['Cleaned_Content_Encoded'] = label_encoder.fit_transform(df['Cleaned_Content'])
#print(df[['LineId', 'Cleaned_Content', 'Cleaned_Content_Encoded']])


In [9]:
print(df[['LineId', 'Cleaned_Content', 'Cleaned_Content_Encoded']])

      LineId                                    Cleaned_Content  \
0          1  workerEnv.init() ok /etc/httpd/conf/workers2.p...   
1          2            mod_jk child workerEnv in error state 6   
2          3  jk2_init() Found child 6725 in scoreboard slot 10   
3          4   jk2_init() Found child 6726 in scoreboard slot 8   
4          5   jk2_init() Found child 6728 in scoreboard slot 6   
...      ...                                                ...   
1995    1996            mod_jk child workerEnv in error state 6   
1996    1997   jk2_init() Found child 6791 in scoreboard slot 8   
1997    1998   jk2_init() Found child 6790 in scoreboard slot 7   
1998    1999  workerEnv.init() ok /etc/httpd/conf/workers2.p...   
1999    2000            mod_jk child workerEnv in error state 6   

      Cleaned_Content_Encoded  
0                         885  
1                         881  
2                         823  
3                         825  
4                         826  
...

In [None]:
import re

def clean_event_template(template):
    # Remove placeholders like <*>
    cleaned = re.sub(r'<\*>', '', template)

    # Replace numeric identifiers with <ID>
    cleaned = re.sub(r'\b\d+\b', '<ID>', cleaned)

    # Remove specific unwanted keywords while preserving function/command patterns
    cleaned = re.sub(r'\b(state)\b', '', cleaned, flags=re.IGNORECASE)
    cleaned = re.sub(r'\s+', ' ', cleaned).strip()  # Normalize whitespace

    return cleaned

# Apply the cleaning function to the EventTemplate column
df['Cleaned_EventTemplate'] = df['EventTemplate'].apply(clean_event_template)

# Show the resulting DataFrame
print(df[['LineId', 'EventTemplate', 'Cleaned_EventTemplate']])

      LineId                                      EventTemplate  \
0          1                            workerEnv.init() ok <*>   
1          2          mod_jk child workerEnv in error state <*>   
2          3  jk2_init() Found child <*> in scoreboard slot <*>   
3          4  jk2_init() Found child <*> in scoreboard slot <*>   
4          5  jk2_init() Found child <*> in scoreboard slot <*>   
...      ...                                                ...   
1995    1996          mod_jk child workerEnv in error state <*>   
1996    1997  jk2_init() Found child <*> in scoreboard slot <*>   
1997    1998  jk2_init() Found child <*> in scoreboard slot <*>   
1998    1999                            workerEnv.init() ok <*>   
1999    2000          mod_jk child workerEnv in error state <*>   

                          Cleaned_EventTemplate  
0                           workerEnv.init() ok  
1               mod_jk child workerEnv in error  
2     jk2_init() Found child in scoreboard sl

In [None]:
# Apply the cleaning function to the EventTemplate column
df['Cleaned_EventTemplate'] = df['EventTemplate'].apply(clean_event_template)

# Create LabelEncoder instance
label_encoder = LabelEncoder()

# Label encode the Cleaned_EventTemplate
df['Cleaned_EventTemplate_Encoded'] = label_encoder.fit_transform(df['Cleaned_EventTemplate'])

# Show the resulting DataFrame
print(df[['LineId', 'EventTemplate', 'Cleaned_EventTemplate', 'Cleaned_EventTemplate_Encoded']])

      LineId                                      EventTemplate  \
0          1                            workerEnv.init() ok <*>   
1          2          mod_jk child workerEnv in error state <*>   
2          3  jk2_init() Found child <*> in scoreboard slot <*>   
3          4  jk2_init() Found child <*> in scoreboard slot <*>   
4          5  jk2_init() Found child <*> in scoreboard slot <*>   
...      ...                                                ...   
1995    1996          mod_jk child workerEnv in error state <*>   
1996    1997  jk2_init() Found child <*> in scoreboard slot <*>   
1997    1998  jk2_init() Found child <*> in scoreboard slot <*>   
1998    1999                            workerEnv.init() ok <*>   
1999    2000          mod_jk child workerEnv in error state <*>   

                          Cleaned_EventTemplate  Cleaned_EventTemplate_Encoded  
0                           workerEnv.init() ok                              5  
1               mod_jk child work

In [None]:
# have a timeseries plot
# the target is the level
# find large models for classification task

In [None]:
!pip install transformers==4.31.0 sentencepiece

from transformers import XLNetForSequenceClassification, XLNetTokenizer

# Number of target classes; the target is the 'Level' column of df
num_labels = df['Level'].nunique()

model_name = "xlnet-base-cased"  # Choose an XLNet variant
tokenizer = XLNetTokenizer.from_pretrained(model_name)  # needs the sentencepiece package
model = XLNetForSequenceClassification.from_pretrained(model_name, num_labels=num_labels)


In [None]:
!pip install transformers==4.31.0

from transformers import ElectraForSequenceClassification, ElectraTokenizer

model_name = "google/electra-base-discriminator"  # Choose an ELECTRA variant
tokenizer = ElectraTokenizer.from_pretrained(model_name)
# One output class per distinct value of the target column 'Level'
model = ElectraForSequenceClassification.from_pretrained(model_name, num_labels=df['Level'].nunique())

In [None]:
!pip install transformers==4.31.0

from transformers import DebertaForSequenceClassification, DebertaTokenizer

model_name = "microsoft/deberta-base"  # Choose a DeBERTa variant
tokenizer = DebertaTokenizer.from_pretrained(model_name)
# One output class per distinct value of the target column 'Level'
model = DebertaForSequenceClassification.from_pretrained(model_name, num_labels=df['Level'].nunique())

In [None]:
!pip install transformers==4.31.0

from transformers import LongformerForSequenceClassification, LongformerTokenizer

model_name = "allenai/longformer-base-4096"  # Choose a Longformer variant
tokenizer = LongformerTokenizer.from_pretrained(model_name)
# One output class per distinct value of the target column 'Level'
model = LongformerForSequenceClassification.from_pretrained(model_name, num_labels=df['Level'].nunique())


In [None]:
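# Apache Spark Models
from pyspark.ml.classification import GBTClassifier

# spark_df is assumed to be a Spark DataFrame (not the pandas df used above) with a
# vector column 'features' and a numeric column 'label'.
# Note: GBTClassifier supports binary labels only.
gbt = GBTClassifier(featuresCol='features', labelCol='label')
gbt_model = gbt.fit(spark_df)

# Make predictions (here on the training data itself; use a held-out split for evaluation)
predictions = gbt_model.transform(spark_df)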


# Better Visualizations and Storytelling