Objectives:
This notebook addresses feedback on the data preparation steps.

Key Steps Covered:

1. Filtering Small Commands:
    
    Commands with memory usage less than 1 MB are filtered out.

2. Including Outliers:
    
    The top 0.1% of high-memory usage outliers (previously excluded) are now retained to observe their influence during training and evaluation.

3. Log Transformation:
    
    The MAX_MEM_USAGE_MB column is log-transformed (using np.log10) to reduce skew and compress the range.

4. Binning for Stratified Downsampling:
    
    Two temporary columns are created: log_max_usage (log-transformed memory usage) and bin (using pd.cut()).
    These bins let us stratify the data during downsampling: each bin is capped at a fixed number of rows, so the full memory range, including the sparse high-memory tail, stays represented.
    After sampling, the temporary columns are dropped.

5. Visualisation:
    
    Plots of the memory usage distribution use a log-scaled x-axis.
    The downsampled dataset appears more balanced, with visible high-memory outliers at the tail retained for analysis.

6. Saving Intermediate Results:
    
    Two dataframes, one cleaned (saved as its row index) and one downsampled (saved as JSON Lines), are written to disk for use in later training and testing notebooks.
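
Steps 3 and 4 above can be sketched on synthetic data. This is a minimal illustration, not the notebook's actual pipeline: the `demo` frame, the lognormal parameters, the 20 bins and the per-bin `cap` of 200 are all assumptions chosen to keep the example small.

```python
import numpy as np
import pandas as pd

# Synthetic, right-skewed memory usage in MB (illustrative data, not the real dataset)
rng = np.random.default_rng(0)
demo = pd.DataFrame({"MAX_MEM_USAGE_MB": rng.lognormal(mean=3.0, sigma=1.5, size=20_000)})
demo = demo[demo["MAX_MEM_USAGE_MB"] >= 1.0].copy()

# Temporary helper columns: log10 of the usage and an equal-width bin label
demo["log_max_usage"] = np.log10(demo["MAX_MEM_USAGE_MB"])
demo["bin"] = pd.cut(demo["log_max_usage"], bins=20)

cap = 200  # hypothetical per-bin cap for this toy example

# Sample at most `cap` rows from every non-empty bin
parts = [
    group.sample(n=min(len(group), cap), random_state=42)
    for _, group in demo.groupby("bin", observed=True)
]
sampled = pd.concat(parts)

# Count rows per bin before the helper columns are dropped
per_bin = sampled["bin"].value_counts()
sampled = sampled.drop(columns=["bin", "log_max_usage"]).reset_index(drop=True)

print(len(demo), len(sampled), per_bin.max())
```

Dense bins are cut down to the cap while sparse bins keep every row, which is why the sampled data looks flatter than the original.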




In [None]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

In [None]:
df = pd.read_json("/Users/dn10/Downloads/Bsub_dataset/filtered_under_5GB.jsonl", lines=True)
len(df)

In [None]:
# Filter jobs with low memory
print(f"length of df before any filtering: {len(df)}")
df_low = df[df["MAX_MEM_USAGE_MB"] < 1.0].copy()
df_filtered = df[df["MAX_MEM_USAGE_MB"] >= 1.0].copy()
print(f"length of df after initial filtering: {len(df_filtered)}")

All the removed rows have a memory usage of 0, so they are safe to drop.
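
One way to confirm that every removed row really reports zero usage is sketched below. The `toy` frame and its values are made up for illustration; only the column name comes from the notebook.

```python
import pandas as pd

# Toy frame standing in for df (values are invented for illustration)
toy = pd.DataFrame({"MAX_MEM_USAGE_MB": [0.0, 0.0, 3.5, 120.0, 0.0, 2048.0]})

low = toy[toy["MAX_MEM_USAGE_MB"] < 1.0]
kept = toy[toy["MAX_MEM_USAGE_MB"] >= 1.0]

# True only if every dropped row reports exactly zero memory usage
all_zero = bool((low["MAX_MEM_USAGE_MB"] == 0).all())
print(len(low), len(kept), all_zero)
```

On the notebook's data, the equivalent check would be `(df_low["MAX_MEM_USAGE_MB"] == 0).all()`.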

In [None]:
df1 = df_filtered.copy()
df1['log_max_usage'] = np.log10(df_filtered["MAX_MEM_USAGE_MB"])

In [None]:
df1['log_max_usage'].hist()
plt.show()


The log-transformed data is much less skewed and provides a good distribution to sample from for model training.
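
The reduction in skew can also be put into numbers with `Series.skew()`. The sketch below uses synthetic lognormal values as a stand-in for MAX_MEM_USAGE_MB; the `usage` series and its parameters are assumptions, not the real data.

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed usage values in MB (a stand-in for the real column)
rng = np.random.default_rng(1)
usage = pd.Series(rng.lognormal(mean=3.0, sigma=1.5, size=10_000))
usage = usage[usage >= 1.0]

raw_skew = usage.skew()            # skewness of the raw values
log_skew = np.log10(usage).skew()  # skewness after the log10 transform

print(round(raw_skew, 2), round(log_skew, 2))
```

A skewness close to 0 after the transform indicates a roughly symmetric distribution, which is what the histogram above suggests visually.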

In [None]:
# Bin the data
df2 = df1.copy()
df2['bin'] = pd.cut(df1["log_max_usage"], bins=100)

In [None]:
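# Sample at most 1000 rows from each bin; observed=True skips empty bins
df3 = (df2
        .groupby('bin', observed=True, group_keys=False)
        .apply(lambda x: x.sample(min(len(x), 1000), random_state=42))
        # errors='ignore' because newer pandas versions may already exclude 'bin' in apply
        .drop(columns=['bin', 'log_max_usage'], errors='ignore')
        .reset_index(drop=True)
        )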

In [None]:
df3["MAX_MEM_USAGE_MB"].hist(bins=50)
plt.xscale('log')
plt.show()

Save df2 (the dataframe without downsampling, stored as its row index) and df3 (the dataframe with downsampling).

In [None]:
import json
df2_index = df2.index.to_list()
with open('/Users/dn10/Downloads/Bsub_dataset/df_without_downsampling.json', 'w') as f:
    json.dump(df2_index, f)

In [None]:
# Reload the saved index and use it to select the same rows from the original df
with open('/Users/dn10/Downloads/Bsub_dataset/df_without_downsampling.json', 'r') as f:
    df4 = pd.Index(json.load(f))
index_from_json = df.loc[df4]

In [None]:
df3.to_json('/Users/dn10/Downloads/Bsub_dataset/df_with_downsampling.json', orient='records', lines=True)
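
A later notebook could read both artefacts back roughly as follows. This is a sketch under assumptions: the `toy` frame, its `job` column and the temporary file names are hypothetical, and a temporary directory stands in for the real output paths.

```python
import json
import tempfile
from pathlib import Path

import pandas as pd

# Small stand-in frame with a non-contiguous index, as left behind by filtering
toy = pd.DataFrame(
    {"MAX_MEM_USAGE_MB": [3.5, 120.0, 2048.0], "job": ["a", "b", "c"]},
    index=[2, 5, 9],
)

with tempfile.TemporaryDirectory() as tmp:
    # Index round trip (mirrors how the df2 index is saved and reloaded)
    index_path = Path(tmp) / "index.json"
    index_path.write_text(json.dumps(toy.index.to_list()))
    restored_index = pd.Index(json.loads(index_path.read_text()))

    # JSON Lines round trip (mirrors how df3 is saved)
    records_path = str(Path(tmp) / "records.jsonl")
    toy.reset_index(drop=True).to_json(records_path, orient="records", lines=True)
    restored = pd.read_json(records_path, lines=True)

print(restored_index.equals(toy.index), restored.shape)
```

Saving only the index keeps the file small, but the reader then needs the original dataframe to recover the rows. The JSON Lines file is self-contained.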