# Demo Notebook for ICER_Data_Package 
This notebook demonstrates the methods of the DataAnalyzer class on two kinds of ICER data files: GPFS and SLURM.

In [6]:
# Import class
from ICER_Data_Package.DataAnalyzer import DataAnalyzer

## Step 1: Prepare Your Data Files
1. GPFS Data File: A space-separated text file containing the columns described in the read_gpfs method, such as "Inode (file unique ID)", "KB Allocated", and "File Size".
2. Slurm Data File: A text file with pipe-separated (|) values, in the format specified by the read_slurm method.
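
To make the two file layouts concrete, here is a minimal sketch of how such files could be parsed with pandas. It is not the package's read_gpfs or read_slurm implementation. The sample rows are invented, the GPFS file is assumed to have no header row, and only the three GPFS column names quoted above are used.

```python
import io

import pandas as pd

# Invented sample rows standing in for the real files (assumption).
gpfs_text = "1001 4 2048\n1002 8 8192\n"
slurm_text = "JobID|User|State\n123|user_a|COMPLETED\n123.batch||COMPLETED\n"

# Space-separated GPFS listing; assumed to have no header row,
# so the column names are supplied explicitly.
gpfs_df = pd.read_csv(
    io.StringIO(gpfs_text),
    sep=r"\s+",
    header=None,
    names=["Inode (file unique ID)", "KB Allocated", "File Size"],
)

# Pipe-separated Slurm export; JobID is read as text so that
# step suffixes such as ".batch" are preserved.
slurm_df = pd.read_csv(io.StringIO(slurm_text), sep="|", dtype={"JobID": str})

print(gpfs_df)
print(slurm_df)
```

For real files, replace the io.StringIO objects with the file paths.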

## Step 2: Initialize the DataAnalyzer Class

You'll need to provide paths to your GPFS and Slurm data files when initializing the DataAnalyzer class. If you have only one of the two files, pass an empty string for the path you don't have.

In [16]:
analyzer = DataAnalyzer(
    gpfspath="/mnt/research/CMSE495-SS24-ICER/file_system_usage/gpfs-stats/inode-size-age-jan-23",
    slurmpath="/mnt/research/CMSE495-SS24-ICER/slurm_usage/DID_FINAL_SLURM_OCT_2023.csv",
)


gpfs path= /mnt/research/CMSE495-SS24-ICER/file_system_usage/gpfs-stats/inode-size-age-jan-23
slurm path= /mnt/research/CMSE495-SS24-ICER/slurm_usage/DID_FINAL_SLURM_OCT_2023.csv


## Step 3: Use the DataAnalyzer Methods

With your data loaded into the DataAnalyzer instance, you can now use its methods to analyze the data.


1.  To aggregate Slurm data:
  
  This method aggregates the Slurm data by job, excluding certain jobs according to the logic in AggSLURMDat.
  The result (aggregated_slurm_data) gives a simplified view of the jobs, combining batch and extern records into a single record where applicable.

In [8]:
aggregated_slurm_data = analyzer.AggSLURMDat(analyzer.slurm_data)

In [14]:
# Inspect the DataFrame returned by AggSLURMDat
aggregated_slurm_data.head()
# After AggSLURMDat has run, the same data is also available as analyzer.slurm_agg
analyzer.slurm_agg.head()

Unnamed: 0.2,Unnamed: 0.1,Unnamed: 0,JobID,User,Group,Submit,Start,End,Elapsed,State,...,CPUTimeRAW,ReqCPUS,AllocCPUS,ReqMem,MaxRSS,ReqNodes,NNodes,NodeList,ReqTRES,AllocTRES
790,790,790,90361456,user_606,group_000,2023-09-25T14:22:02,2023-10-01T12:17:05,2023-10-01T23:22:50,11:05:45,COMPLETED,...,798900,20,20,160G,126751612K,1,1,lac-230,"billing=24903,cpu=20,mem=160G,node=1","billing=24903,cpu=20,mem=160G,node=1"
682,682,682,90361456,user_606,group_000,2023-09-25T14:22:02,2023-09-30T19:11:29,2023-10-02T09:48:17,1-14:36:48,COMPLETED,...,2780160,20,20,160G,128816740K,1,1,skl-143,"billing=24903,cpu=20,mem=160G,node=1","billing=24903,cpu=20,mem=160G,node=1"
736,736,736,90361456,user_606,group_000,2023-09-25T14:22:02,2023-10-01T02:43:47,2023-10-01T13:13:42,10:29:55,COMPLETED,...,755900,20,20,160G,129609204K,1,1,acm-064,"billing=24903,cpu=20,mem=160G,node=1","billing=24903,cpu=20,mem=160G,node=1"
691,691,691,90361456,user_606,group_000,2023-09-25T14:22:02,2023-09-30T22:05:41,2023-10-01T09:25:17,11:19:36,COMPLETED,...,815520,20,20,160G,132740564K,1,1,amr-070,"billing=24903,cpu=20,mem=160G,node=1","billing=24903,cpu=20,mem=160G,node=1"
802,802,802,90361456,user_606,group_000,2023-09-25T14:22:02,2023-10-01T13:15:18,2023-10-01T23:57:16,10:41:58,COMPLETED,...,770360,20,20,160G,132093732K,1,1,skl-145,"billing=24903,cpu=20,mem=160G,node=1","billing=24903,cpu=20,mem=160G,node=1"
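
The idea of combining batch and extern records into one record per job can be sketched as follows. This illustrates the general technique only and is not the actual logic inside AggSLURMDat. The sample rows, the BaseJobID helper column, and the MaxRSS_KB column are hypothetical.

```python
import pandas as pd

# Invented job-step rows (assumption): one parent row per job plus
# its ".batch" and ".extern" step rows.
steps = pd.DataFrame(
    {
        "JobID": ["100", "100.batch", "100.extern", "101", "101.batch", "101.extern"],
        "User": ["user_a", None, None, "user_b", None, None],
        "MaxRSS_KB": [None, 5000.0, 100.0, None, 7000.0, 120.0],
    }
)

# Hypothetical helper column: strip the step suffix to get the parent job ID.
steps["BaseJobID"] = steps["JobID"].str.split(".").str[0]

# Collapse the rows of each job into one record: take the first
# non-null user and the largest memory value reported by any step.
jobs = steps.groupby("BaseJobID", as_index=False).agg(
    User=("User", "first"),
    MaxRSS_KB=("MaxRSS_KB", "max"),
)

print(jobs)
```

Which aggregation to use per column (first, max, sum) is a design choice that depends on what each Slurm field means; the two shown here are only examples.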


2.  To identify users with many files in the GPFS data:

  Set file_limit (100 in the call below) to whatever file count you're interested in.
  The method returns a DataFrame (users_with_many_files) listing the users who own more files than that limit, which can be useful for identifying heavy users or potential optimizations.

In [11]:
users_with_many_files = analyzer.UsersWithManyFiles(file_limit=100)

In [12]:
users_with_many_files.head()

Unnamed: 0,UID numeric ID for the owner of the file,Number of files
5,762231,20534
10,881083,14576
6,785573,10379
0,500120,9029
11,1000000092,8658
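
A per-user file count with a threshold can be sketched in pandas as shown here. This is a minimal illustration, not the package's UsersWithManyFiles implementation. The column names are taken from the output above. The sample data and the users_with_many_files function are hypothetical, and sorting by file count is an assumption based on the order of that output.

```python
import pandas as pd

UID_COL = "UID numeric ID for the owner of the file"

# Invented GPFS-like rows (assumption): one row per file, keyed by owner UID.
files = pd.DataFrame({UID_COL: [1, 1, 1, 2, 2, 3]})


def users_with_many_files(df, file_limit):
    """Return owners with more than file_limit files, largest count first."""
    counts = df.groupby(UID_COL).size().reset_index(name="Number of files")
    over_limit = counts[counts["Number of files"] > file_limit]
    return over_limit.sort_values("Number of files", ascending=False)


result = users_with_many_files(files, file_limit=1)
print(result)
```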


3. To group users based on their utilization patterns of the HPCC