# **GUIDEWIRE DEVTrails University Hackathon**
## **Phase 1: AI/ML Model for Predicting Kubernetes Issues**

# Google 2019 Cluster Sample Dataset

This dataset, provided by [Derrick Mwiti](https://www.kaggle.com/derrickmwiti), is a sample of the Google 2019 cluster data. It can be used for various data analysis and machine learning projects, including clustering, anomaly detection, and performance analysis.

## Dataset Overview

- **Source:** [Kaggle - Google 2019 Cluster Sample](https://www.kaggle.com/datasets/derrickmwiti/google-2019-cluster-sample)
- **Description:** A sample of Google’s 2019 cluster trace data, covering scheduling events, resource requests, and resource usage for workloads, plus a `failed` label used here as the prediction target.
- **Potential Uses:** Data exploration, clustering algorithms, performance analytics, etc.

## Data Attributes

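The columns used in this notebook include:

- **time, start_time, end_time:** Integer timestamps, converted to datetimes during cleaning.
- **collection_id, instance_index, machine_id:** Identifiers for the collection, the instance within it, and the machine it ran on.
- **priority, scheduling_class:** Scheduling attributes of the collection.
- **resource_request:** Dict-like string holding the requested `cpus` and `memory`.
- **average_usage, maximum_usage, random_sample_usage:** Dict-like strings holding `cpus` and `memory` usage.
- **assigned_memory, page_cache_memory:** Memory metrics.
- **cycles_per_instruction, memory_accesses_per_instruction:** Performance-counter metrics; roughly 31% of rows are missing them.
- **cpu_usage_distribution, tail_cpu_usage_distribution:** Bracketed, space-separated lists of CPU usage values stored as strings.
- **event:** Event label such as `FAIL`, `SCHEDULE`, or `FINISH`.
- **failed:** Binary target (1 = failed, 0 = not failed).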

## Preprocessing Notes

Below are some suggested preprocessing steps for this dataset:

1. **Data Cleaning:** Handle missing values, correct data types, and remove outliers.
2. **Feature Engineering:** Create additional features if needed based on the dataset’s structure.
3. **Scaling & Normalization:** Apply scaling methods to numerical features if you plan on using machine learning models.
4. **Exploratory Data Analysis (EDA):** Generate visualizations to understand the distribution and relationships of the features.
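
For step 3, one common precaution is to fit the scaler on the training split only and reuse it on the test split, so that test-set statistics do not leak into training. The sketch below uses an invented toy frame; the column subset and split sizes are assumptions for illustration, not settings taken from this notebook.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Toy stand-in for the real features; all values are invented.
toy = pd.DataFrame({
    "duration": [0.3, 0.001, 0.3, 0.002, 0.25, 0.1],
    "assigned_memory": [0.014, 0.0, 0.010, 0.0003, 0.041, 0.02],
    "failed": [1, 1, 0, 0, 1, 0],
})
features = ["duration", "assigned_memory"]

# Stratify on the target so both splits contain both classes.
train, test = train_test_split(
    toy, test_size=0.33, random_state=0, stratify=toy["failed"]
)

scaler = MinMaxScaler()
train_scaled = scaler.fit_transform(train[features])  # learn min/max from train only
test_scaled = scaler.transform(test[features])        # reuse those bounds on test

print(train_scaled.min(axis=0), train_scaled.max(axis=0))
```

Test values can fall slightly outside [0, 1] with this approach, which is expected because the bounds come from the training rows.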

# **Importing The Dataset**

In [None]:
import pandas as pd
import numpy as np
import ast
from sklearn.preprocessing import StandardScaler, LabelEncoder

# Load the dataset
file_path = "/content/google-cluster-dataset.csv"
df = pd.read_csv(file_path)

In [None]:
df.head()

Unnamed: 0.1,Unnamed: 0,time,instance_events_type,collection_id,scheduling_class,collection_type,priority,alloc_collection_id,instance_index,machine_id,...,assigned_memory,page_cache_memory,cycles_per_instruction,memory_accesses_per_instruction,sample_rate,cpu_usage_distribution,tail_cpu_usage_distribution,cluster,event,failed
0,0,0,2,94591244395,3,1,200,0,144,168846390496,...,0.014435,0.000415,,,1.0,[0.00314331 0.00381088 0.00401306 0.00415039 0...,[0.00535583 0.00541687 0.00548553 0.00554657 0...,7,FAIL,1
1,1,2517305308183,2,260697606809,2,0,360,221495397286,335,85515092,...,0.0,0.0,,,1.0,[1.23977661e-05 1.23977661e-05 1.23977661e-05 ...,[1.23977661e-05 1.23977661e-05 1.23977661e-05 ...,7,FAIL,1
2,2,195684022913,6,276227177776,2,0,103,0,376,169321752432,...,0.010422,0.000235,0.939919,0.001318,1.0,[0.01344299 0.01809692 0.0201416 0.02246094 0...,[0.02902222 0.02929688 0.0295105 0.0296936 0...,7,SCHEDULE,0
3,3,0,2,10507389885,3,0,200,0,1977,178294817221,...,0.041626,0.000225,1.359102,0.007643,1.0,[0.03704834 0.04125977 0.04290771 0.04425049 0...,[0.05535889 0.05584717 0.05633545 0.05718994 0...,8,FAIL,1
4,4,1810627494172,3,25911621841,2,0,0,0,3907,231364893292,...,0.000272,1e-05,,,1.0,[0. 0. 0. 0. 0...,[0.00041485 0.00041485 0.00041485 0.00041485 0...,2,FINISH,0


# **Data Cleaning**

In [None]:
# Drop unnecessary column (if 'Unnamed: 0' exists due to index)
if 'Unnamed: 0' in df.columns:
    df.drop(columns=['Unnamed: 0'], inplace=True)

# Convert 'time', 'start_time', and 'end_time' to datetime format.
# unit='ns' is an assumption; if the raw timestamps are in microseconds,
# use unit='us' instead (durations would then be 1000x larger).
df['time'] = pd.to_datetime(df['time'], unit='ns')
df['start_time'] = pd.to_datetime(df['start_time'], unit='ns')
df['end_time'] = pd.to_datetime(df['end_time'], unit='ns')

# Calculate duration (how long the instance was running)
df['duration'] = (df['end_time'] - df['start_time']).dt.total_seconds()

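# Drop invalid durations (negative values; rows with a missing duration are dropped too).
# .copy() avoids SettingWithCopyWarning when new columns are added later.
df = df[df['duration'] >= 0].copy()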
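
The conversion above assumes nanoseconds. To my understanding, the public Google cluster trace documentation describes timestamps in microseconds, so the unit is worth verifying. A minimal sketch on toy values (the numbers below are invented, not taken from the dataset) shows how the choice changes `duration` by a factor of 1000:

```python
import pandas as pd

# Toy interval: 300,000,000 raw units apart (hypothetical values).
toy = pd.DataFrame({"start_time": [0], "end_time": [300_000_000]})

def duration_seconds(frame, unit):
    """Interpret raw integer timestamps with the given unit and return seconds."""
    start = pd.to_datetime(frame["start_time"], unit=unit)
    end = pd.to_datetime(frame["end_time"], unit=unit)
    return (end - start).dt.total_seconds()

dur_ns = duration_seconds(toy, "ns")  # raw values read as nanoseconds
dur_us = duration_seconds(toy, "us")  # raw values read as microseconds

print(dur_ns.iloc[0], dur_us.iloc[0])
```

If the microsecond reading is correct, a `duration` of 0.3 in the output that follows would correspond to a 300-second window.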

In [None]:
df.head()

Unnamed: 0,time,instance_events_type,collection_id,scheduling_class,collection_type,priority,alloc_collection_id,instance_index,machine_id,resource_request,...,page_cache_memory,cycles_per_instruction,memory_accesses_per_instruction,sample_rate,cpu_usage_distribution,tail_cpu_usage_distribution,cluster,event,failed,duration
0,1970-01-01 00:00:00.000000000,2,94591244395,3,1,200,0,144,168846390496,"{'cpus': 0.020660400390625, 'memory': 0.014434...",...,0.000415,,,1.0,[0.00314331 0.00381088 0.00401306 0.00415039 0...,[0.00535583 0.00541687 0.00548553 0.00554657 0...,7,FAIL,1,0.3
1,1970-01-01 00:41:57.305308183,2,260697606809,2,0,360,221495397286,335,85515092,"{'cpus': 0.00724029541015625, 'memory': 0.0013...",...,0.0,,,1.0,[1.23977661e-05 1.23977661e-05 1.23977661e-05 ...,[1.23977661e-05 1.23977661e-05 1.23977661e-05 ...,7,FAIL,1,0.001
2,1970-01-01 00:03:15.684022913,6,276227177776,2,0,103,0,376,169321752432,"{'cpus': 0.048583984375, 'memory': 0.004165649...",...,0.000235,0.939919,0.001318,1.0,[0.01344299 0.01809692 0.0201416 0.02246094 0...,[0.02902222 0.02929688 0.0295105 0.0296936 0...,7,SCHEDULE,0,0.3
3,1970-01-01 00:00:00.000000000,2,10507389885,3,0,200,0,1977,178294817221,"{'cpus': 0.0704345703125, 'memory': 0.04162597...",...,0.000225,1.359102,0.007643,1.0,[0.03704834 0.04125977 0.04290771 0.04425049 0...,[0.05535889 0.05584717 0.05633545 0.05718994 0...,8,FAIL,1,0.3
4,1970-01-01 00:30:10.627494172,3,25911621841,2,0,0,0,3907,231364893292,"{'cpus': 0.00244903564453125, 'memory': 0.0002...",...,1e-05,,,1.0,[0. 0. 0. 0. 0...,[0.00041485 0.00041485 0.00041485 0.00041485 0...,2,FINISH,0,0.002


# **Handling Missing Values**

In [None]:
# Check percentage of missing values
missing_values = df.isnull().sum() / len(df) * 100
print("\nMissing Values (%):\n", missing_values[missing_values > 0])

# Fill missing numerical values with median (robust to outliers)
num_cols = ['vertical_scaling', 'scheduler', 'cycles_per_instruction', 'memory_accesses_per_instruction']
df[num_cols] = df[num_cols].fillna(df[num_cols].median())

# Fill missing categorical values with mode
cat_cols = ['resource_request']
df[cat_cols] = df[cat_cols].fillna(df[cat_cols].mode().iloc[0])


Missing Values (%):
 resource_request                    0.190690
vertical_scaling                    0.236269
scheduler                           0.236269
cycles_per_instruction             30.719350
memory_accesses_per_instruction    30.719350
dtype: float64
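
About 31% of `cycles_per_instruction` and `memory_accesses_per_instruction` is missing, so median filling replaces a large share of those columns with a single value. One option, sketched here on toy data, is to record a missingness flag before imputing. The `_missing` suffix is a naming assumption and is not part of the original notebook.

```python
import numpy as np
import pandas as pd

# Toy frame with gaps; values are invented.
toy = pd.DataFrame({
    "cycles_per_instruction": [0.94, np.nan, 1.36, np.nan],
    "memory_accesses_per_instruction": [0.0013, np.nan, 0.0076, 0.0100],
})

heavy_missing = ["cycles_per_instruction", "memory_accesses_per_instruction"]
for col in heavy_missing:
    # Flag rows that were originally missing (1) before overwriting them.
    toy[f"{col}_missing"] = toy[col].isna().astype(int)

# Median imputation, as in the cell above.
toy[heavy_missing] = toy[heavy_missing].fillna(toy[heavy_missing].median())
print(toy)
```

The flag columns let a model distinguish measured values from imputed ones.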


In [None]:
df.head()

Unnamed: 0,time,instance_events_type,collection_id,scheduling_class,collection_type,priority,alloc_collection_id,instance_index,machine_id,resource_request,...,page_cache_memory,cycles_per_instruction,memory_accesses_per_instruction,sample_rate,cpu_usage_distribution,tail_cpu_usage_distribution,cluster,event,failed,duration
0,1970-01-01 00:00:00.000000000,2,94591244395,3,1,200,0,144,168846390496,"{'cpus': 0.020660400390625, 'memory': 0.014434...",...,0.000415,1.918681,0.009506,1.0,[0.00314331 0.00381088 0.00401306 0.00415039 0...,[0.00535583 0.00541687 0.00548553 0.00554657 0...,7,FAIL,1,0.3
1,1970-01-01 00:41:57.305308183,2,260697606809,2,0,360,221495397286,335,85515092,"{'cpus': 0.00724029541015625, 'memory': 0.0013...",...,0.0,1.918681,0.009506,1.0,[1.23977661e-05 1.23977661e-05 1.23977661e-05 ...,[1.23977661e-05 1.23977661e-05 1.23977661e-05 ...,7,FAIL,1,0.001
2,1970-01-01 00:03:15.684022913,6,276227177776,2,0,103,0,376,169321752432,"{'cpus': 0.048583984375, 'memory': 0.004165649...",...,0.000235,0.939919,0.001318,1.0,[0.01344299 0.01809692 0.0201416 0.02246094 0...,[0.02902222 0.02929688 0.0295105 0.0296936 0...,7,SCHEDULE,0,0.3
3,1970-01-01 00:00:00.000000000,2,10507389885,3,0,200,0,1977,178294817221,"{'cpus': 0.0704345703125, 'memory': 0.04162597...",...,0.000225,1.359102,0.007643,1.0,[0.03704834 0.04125977 0.04290771 0.04425049 0...,[0.05535889 0.05584717 0.05633545 0.05718994 0...,8,FAIL,1,0.3
4,1970-01-01 00:30:10.627494172,3,25911621841,2,0,0,0,3907,231364893292,"{'cpus': 0.00244903564453125, 'memory': 0.0002...",...,1e-05,1.918681,0.009506,1.0,[0. 0. 0. 0. 0...,[0.00041485 0.00041485 0.00041485 0.00041485 0...,2,FINISH,0,0.002


# **Feature Engineering**

In [None]:
def parse_list_column_alt(data):
    """
    Parse a string such as '[0.1 0.2 0.3]' (space- or comma-separated)
    and return the mean of its values, or NaN if it cannot be parsed.
    """
    if not isinstance(data, str):
        # Missing values (NaN/None) have nothing to parse
        return np.nan
    try:
        # Strip whitespace and the surrounding brackets, then treat commas as spaces
        cleaned = data.strip().lstrip('[').rstrip(']').replace(',', ' ')
        # split() with no argument already discards empty strings
        values = [float(x) for x in cleaned.split()]
        return np.mean(values) if values else np.nan
    except ValueError as e:
        print("Error parsing data:", data, "Error:", e)
        return np.nan

# Apply the alternative parser to both columns
df['avg_cpu_usage_distribution'] = df['cpu_usage_distribution'].apply(parse_list_column_alt)
df['avg_tail_cpu_usage_distribution'] = df['tail_cpu_usage_distribution'].apply(parse_list_column_alt)

# Print the first few values to verify
print(df[['avg_cpu_usage_distribution', 'avg_tail_cpu_usage_distribution']].head())


   avg_cpu_usage_distribution  avg_tail_cpu_usage_distribution
0                    0.005054                         0.006783
1                    0.000012                         0.000012
2                    0.023960                         0.029945
3                    0.048234                         0.059367
4                    0.000207                         0.000415
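
The mean is only one summary of each distribution string. The sketch below reuses the same parsing idea to return several statistics at once. `summarize_distribution` and the output column names are hypothetical, and the toy strings are invented.

```python
import numpy as np
import pandas as pd

def summarize_distribution(text):
    """Parse a string such as '[0.1 0.2 0.3]' and return (mean, max, std)."""
    if not isinstance(text, str):
        return np.nan, np.nan, np.nan
    cleaned = text.strip().strip("[]").replace(",", " ")
    values = np.array([float(tok) for tok in cleaned.split()])
    if values.size == 0:
        return np.nan, np.nan, np.nan
    return values.mean(), values.max(), values.std()

toy = pd.DataFrame({"cpu_usage_distribution": ["[0.1 0.2 0.3]", "[]", None]})

# Build the three statistics as a separate frame, then attach them by index.
stats = toy["cpu_usage_distribution"].map(summarize_distribution).tolist()
stats_df = pd.DataFrame(
    stats, columns=["cpu_dist_mean", "cpu_dist_max", "cpu_dist_std"], index=toy.index
)
toy = toy.join(stats_df)
print(toy)
```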


In [None]:
# Parse the dict-like string columns: average_usage, maximum_usage, random_sample_usage

def parse_usage_col(value):
    """
    value: string that looks like {'cpus': 0.00466, 'memory': 0.0059}
    returns a tuple (cpu_val, mem_val); (NaN, NaN) if parsing fails
    """
    try:
        parsed = ast.literal_eval(value)  # convert string to dict
        cpu_val = parsed.get('cpus', np.nan)
        mem_val = parsed.get('memory', np.nan)
        return cpu_val, mem_val
    except (ValueError, SyntaxError, AttributeError, TypeError):
        # value is missing, malformed, or not a dict
        return np.nan, np.nan

# For each dict-like column, create new columns for CPU and memory usage:
df[['avg_usage_cpu', 'avg_usage_mem']] = df['average_usage'].apply(lambda x: pd.Series(parse_usage_col(x)))
df[['max_usage_cpu', 'max_usage_mem']] = df['maximum_usage'].apply(lambda x: pd.Series(parse_usage_col(x)))
df[['rand_usage_cpu','rand_usage_mem']] = df['random_sample_usage'].apply(lambda x: pd.Series(parse_usage_col(x)))

# 'assigned_memory' should already be numeric; coerce it in case it was read as text
# (unparsable entries become NaN)
df['assigned_memory'] = pd.to_numeric(df['assigned_memory'], errors='coerce')
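
The `df.head()` output that follows lists `cpu_request` and `memory_request`, but the cell that creates them is not shown. A plausible reconstruction, assuming they come from `resource_request` with the same dictionary layout as the usage columns (this is a guess, not the author's confirmed code, and `parse_request` is a hypothetical helper):

```python
import ast
import numpy as np
import pandas as pd

def parse_request(value):
    """Return (cpus, memory) from a dict-like string, or NaNs if unparsable."""
    try:
        parsed = ast.literal_eval(value)
        return parsed.get("cpus", np.nan), parsed.get("memory", np.nan)
    except (ValueError, SyntaxError, AttributeError, TypeError):
        return np.nan, np.nan

# Toy rows: one valid dict string, one missing value, one malformed string.
toy = pd.DataFrame({
    "resource_request": ["{'cpus': 0.02, 'memory': 0.014}", None, "not a dict"],
})

pairs = toy["resource_request"].map(parse_request).tolist()
requests = pd.DataFrame(
    pairs, columns=["cpu_request", "memory_request"], index=toy.index
)
toy = toy.join(requests)
print(toy)
```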

In [None]:
df.head()

Unnamed: 0,time,instance_events_type,collection_id,scheduling_class,collection_type,priority,alloc_collection_id,instance_index,machine_id,resource_request,...,tail_cpu_usage_distribution,cluster,event,failed,duration,cpu_request,memory_request,avg_cpu_usage_distribution,avg_tail_cpu_usage_distribution,event_duration
0,1970-01-01 00:00:00.000000000,2,94591244395,3,1,200,0,144,168846390496,"{'cpus': 0.020660400390625, 'memory': 0.014434...",...,[0.00535583 0.00541687 0.00548553 0.00554657 0...,7,FAIL,1,0.3,0.02066,0.014435,0.005054,0.006783,0.0
1,1970-01-01 00:41:57.305308183,2,260697606809,2,0,360,221495397286,335,85515092,"{'cpus': 0.00724029541015625, 'memory': 0.0013...",...,[1.23977661e-05 1.23977661e-05 1.23977661e-05 ...,7,FAIL,1,0.001,0.00724,0.001303,1.2e-05,1.2e-05,2517.305308
2,1970-01-01 00:03:15.684022913,6,276227177776,2,0,103,0,376,169321752432,"{'cpus': 0.048583984375, 'memory': 0.004165649...",...,[0.02902222 0.02929688 0.0295105 0.0296936 0...,7,SCHEDULE,0,0.3,0.048584,0.004166,0.02396,0.029945,-2321.621285
3,1970-01-01 00:00:00.000000000,2,10507389885,3,0,200,0,1977,178294817221,"{'cpus': 0.0704345703125, 'memory': 0.04162597...",...,[0.05535889 0.05584717 0.05633545 0.05718994 0...,8,FAIL,1,0.3,0.070435,0.041626,0.048234,0.059367,-195.684023
4,1970-01-01 00:30:10.627494172,3,25911621841,2,0,0,0,3907,231364893292,"{'cpus': 0.00244903564453125, 'memory': 0.0002...",...,[0.00041485 0.00041485 0.00041485 0.00041485 0...,2,FINISH,0,0.002,0.002449,0.000232,0.000207,0.000415,1810.627494


# **Categorical Encoding**

In [None]:
# Encode 'event' and 'scheduling_class' as numerical values
label_encoders = {}
for col in ['event', 'scheduling_class']:
    le = LabelEncoder()
    df[col] = le.fit_transform(df[col])
    label_encoders[col] = le  # Save encoder for later use

# Map 'failed' column (target variable) as binary (0 or 1)
df['failed'] = df['failed'].astype(int)
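
Keeping the fitted encoders makes it possible to translate codes back to the original labels later. A short sketch on a toy `event` column (the labels mirror those visible in the earlier output; the frame itself is invented):

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

toy = pd.DataFrame({"event": ["FAIL", "SCHEDULE", "FINISH", "FAIL"]})

encoder = LabelEncoder()
toy["event_code"] = encoder.fit_transform(toy["event"])

# classes_ is sorted, so codes follow the alphabetical order of the labels.
mapping = {label: int(code) for code, label in enumerate(encoder.classes_)}
decoded = encoder.inverse_transform(toy["event_code"])

print(mapping)
print(list(decoded))
```

Since every `FAIL` row shown earlier also has `failed` = 1, `event` may encode the target directly; it is worth checking that before using it as a model feature.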

In [None]:
df.head()

Unnamed: 0,time,instance_events_type,collection_id,scheduling_class,collection_type,priority,alloc_collection_id,instance_index,machine_id,resource_request,...,tail_cpu_usage_distribution,cluster,event,failed,duration,cpu_request,memory_request,avg_cpu_usage_distribution,avg_tail_cpu_usage_distribution,event_duration
0,1970-01-01 00:00:00.000000000,2,94591244395,3,1,200,0,144,168846390496,"{'cpus': 0.020660400390625, 'memory': 0.014434...",...,[0.00535583 0.00541687 0.00548553 0.00554657 0...,7,2,1,0.3,0.02066,0.014435,0.005054,0.006783,0.0
1,1970-01-01 00:41:57.305308183,2,260697606809,2,0,360,221495397286,335,85515092,"{'cpus': 0.00724029541015625, 'memory': 0.0013...",...,[1.23977661e-05 1.23977661e-05 1.23977661e-05 ...,7,2,1,0.001,0.00724,0.001303,1.2e-05,1.2e-05,2517.305308
2,1970-01-01 00:03:15.684022913,6,276227177776,2,0,103,0,376,169321752432,"{'cpus': 0.048583984375, 'memory': 0.004165649...",...,[0.02902222 0.02929688 0.0295105 0.0296936 0...,7,7,0,0.3,0.048584,0.004166,0.02396,0.029945,-2321.621285
3,1970-01-01 00:00:00.000000000,2,10507389885,3,0,200,0,1977,178294817221,"{'cpus': 0.0704345703125, 'memory': 0.04162597...",...,[0.05535889 0.05584717 0.05633545 0.05718994 0...,8,2,1,0.3,0.070435,0.041626,0.048234,0.059367,-195.684023
4,1970-01-01 00:30:10.627494172,3,25911621841,2,0,0,0,3907,231364893292,"{'cpus': 0.00244903564453125, 'memory': 0.0002...",...,[0.00041485 0.00041485 0.00041485 0.00041485 0...,2,3,0,0.002,0.002449,0.000232,0.000207,0.000415,1810.627494


# **Data Normalization**

In [None]:
df.columns


Index(['time', 'instance_events_type', 'collection_id', 'scheduling_class',
       'collection_type', 'priority', 'alloc_collection_id', 'instance_index',
       'machine_id', 'resource_request', 'constraint',
       'collections_events_type', 'user', 'collection_name',
       'collection_logical_name', 'start_after_collection_ids',
       'vertical_scaling', 'scheduler', 'start_time', 'end_time',
       'average_usage', 'maximum_usage', 'random_sample_usage',
       'assigned_memory', 'page_cache_memory', 'cycles_per_instruction',
       'memory_accesses_per_instruction', 'sample_rate',
       'cpu_usage_distribution', 'tail_cpu_usage_distribution', 'cluster',
       'event', 'failed', 'duration', 'avg_usage_cpu', 'avg_usage_mem',
       'max_usage_cpu', 'max_usage_mem', 'rand_usage_cpu', 'rand_usage_mem'],
      dtype='object')

In [None]:
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Select numeric columns to normalize
cols_to_normalize = [
    'avg_usage_cpu', 'avg_usage_mem', 'max_usage_cpu', 'max_usage_mem',
    'rand_usage_cpu', 'rand_usage_mem', 'assigned_memory',
    'page_cache_memory', 'cycles_per_instruction',
    'memory_accesses_per_instruction', 'duration', 'sample_rate'
]

# Choose one of the scalers:
scaler = MinMaxScaler()  # Rescale each column to the range [0, 1]
# scaler = StandardScaler()  # Standardize each column to mean=0, std=1

# Apply normalization
df[cols_to_normalize] = scaler.fit_transform(df[cols_to_normalize])

print(df.head())  # Check the results


                           time  instance_events_type  collection_id  \
0 1970-01-01 00:00:00.000000000                     2    94591244395   
1 1970-01-01 00:41:57.305308183                     2   260697606809   
2 1970-01-01 00:03:15.684022913                     6   276227177776   
3 1970-01-01 00:00:00.000000000                     2    10507389885   
4 1970-01-01 00:30:10.627494172                     3    25911621841   

   scheduling_class  collection_type  priority  alloc_collection_id  \
0                 3                1       200                    0   
1                 2                0       360         221495397286   
2                 2                0       103                    0   
3                 3                0       200                    0   
4                 2                0         0                    0   

   instance_index    machine_id  \
0             144  168846390496   
1             335      85515092   
2             376  169321752432   


  return xp.asarray(numpy.nanmin(X, axis=axis))
  return xp.asarray(numpy.nanmax(X, axis=axis))


In [None]:
# Drop 'rand_usage_mem'; the nanmin/nanmax warnings above suggest it holds no usable values
df.drop(columns=['rand_usage_mem'], inplace=True)
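
Dropping `rand_usage_mem` after scaling suggests the column carried no usable values. A sketch of screening for empty or constant columns before fitting the scaler follows. `find_uninformative` is a hypothetical helper and the toy values are invented.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

def find_uninformative(frame, cols):
    """Return columns that are entirely NaN or hold a single distinct value."""
    return [c for c in cols if frame[c].nunique(dropna=True) <= 1]

toy = pd.DataFrame({
    "avg_usage_cpu": [0.1, 0.4, 0.2],
    "rand_usage_mem": [np.nan, np.nan, np.nan],  # all missing
    "sample_rate": [1.0, 1.0, 1.0],              # constant
})

cols = list(toy.columns)
skip = find_uninformative(toy, cols)
keep = [c for c in cols if c not in skip]

# Scale only the columns that actually vary.
toy[keep] = MinMaxScaler().fit_transform(toy[keep])
print(skip)
print(toy)
```

Screening first avoids fitting the scaler on a column with no finite values and keeps constant columns, which carry no signal, out of the feature set.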

# **Save Final Data**

In [None]:
# Save cleaned dataset
df.to_csv("cleaned_k8s_data.csv", index=False)

print("\n✅ Data preprocessing complete! Cleaned dataset saved as 'cleaned_k8s_data.csv'.")


✅ Data preprocessing complete! Cleaned dataset saved as 'cleaned_k8s_data.csv'.



