## Analysing the Dataset

# 📊 CPU Usage Analysis

## 📌 Overview
This project analyzes CPU resource allocation and utilization based on the **DatasetFinal.csv** dataset. The key insights include over-provisioning trends, underutilization, priority-based CPU usage, scheduling class impact, hourly trends, and wasted CPU resources.

---
## 📂 Dataset Summary
- **Total Entries:** 100,000
- **Missing Values:** `Timestamp` had missing or unparseable values, which were forward-filled. `resource_request` has 195 missing values (99,805 non-null), which are left unfilled.
- **Key Columns:**
  - `resource_request` → CPU resources requested
  - `average_usage` → Actual CPU usage
  - `maximum_usage` → Peak CPU usage
  - `priority` → Task priority level
  - `scheduling_class` → Scheduling classification
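
One way to confirm which columns have gaps is to count nulls per column before and after the forward fill. The sketch below uses a small hypothetical frame in place of `DatasetFinal.csv`: the column names follow the list above, but every value is made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical sample standing in for DatasetFinal.csv (illustrative values only)
sample = pd.DataFrame({
    "Timestamp": ["01-05-2019 09:00", None, "01-05-2019 09:10", "not a date"],
    "resource_request": [8.0, np.nan, 4.0, 16.0],
    "average_usage": [1.0, 0.5, 0.0, 7.0],
})

# Strings that do not match the format become NaT because of errors="coerce"
sample["Timestamp"] = pd.to_datetime(
    sample["Timestamp"], format="%d-%m-%Y %H:%M", errors="coerce"
)

# Count missing values per column before any filling
missing_before = sample.isna().sum()

# Forward-fill only the timestamps; resource_request gaps are left untouched
sample["Timestamp"] = sample["Timestamp"].ffill()
missing_after = sample.isna().sum()

print(missing_before)
print(missing_after)
```

Forward-filling assumes rows are in time order, so a filled timestamp is only an approximation of when the row was recorded.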

---
## 📊 Key Insights

### **General Statistics**
| Metric | Resource Request | Average Usage | Maximum Usage |
|--------|-----------------|---------------|---------------|
| Count  | 99805           | 100000        | 100000        |
| Mean   | 15.32           | 7.41          | 25.21        |
| Std Dev| 28.53           | 18.34         | 52.18        |
| Min    | 0.00            | 0.00          | 0.00         |
| 25%    | 4.05            | 0.20          | 0.79         |
| 50%    | 8.10            | 1.03          | 5.00         |
| 75%    | 15.93           | 7.28          | 29.66        |
| Max    | 583.00          | 538.08        | 1271.48      |

### **Utilization Analysis**
- 🔍 **Over-Provisioned Entries (<50% Utilized):** **63,734** (63.7% of all rows)
- 💤 **Underutilized Entries (Zero Usage):** **11,862** (11.9% of all rows)
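
The utilization ratio divides usage by the request, so rows with a zero or missing request need care. The following is a minimal sketch on hypothetical data, not the project's script: it masks zero requests to `NaN` before dividing. Because `inf` and `NaN` both fail the `< 0.5` comparison, the masking should not change the under-50% count. It only keeps `inf` out of later statistics.

```python
import numpy as np
import pandas as pd

# Hypothetical rows, including a zero request and a missing request
sample = pd.DataFrame({
    "resource_request": [10.0, 10.0, 0.0, np.nan, 4.0],
    "average_usage": [2.0, 8.0, 1.0, 3.0, 0.0],
})

# Treat a zero request as "no valid ratio" instead of letting it become inf
valid_request = sample["resource_request"].where(sample["resource_request"] > 0)
sample["utilization_ratio"] = sample["average_usage"] / valid_request

# NaN ratios compare as False, so they are excluded from the over-provisioned set
over_provisioned = sample[sample["utilization_ratio"] < 0.5]
zero_usage = sample[sample["average_usage"] == 0]

# Share of all rows, rounded to one decimal place
over_pct = round(100 * len(over_provisioned) / len(sample), 1)
zero_pct = round(100 * len(zero_usage) / len(sample), 1)
print(len(over_provisioned), over_pct, len(zero_usage), zero_pct)
```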

### **CPU Usage by Priority**
The table lists selected priority levels; `analysis.py` prints the mean for every level.

| Priority | Avg CPU Usage |
|----------|--------------|
| 0        | 2.40         |
| 25       | 0.83         |
| 100      | 5.79         |
| 101      | 19.85        |
| 118      | 34.44        |
| 119      | 13.98        |
| 199      | 35.00        |
| 210      | **0.00** ⚠️ (Needs Investigation) |
| 450      | 10.44        |
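
A zero mean for a priority level is easier to interpret alongside the number of rows behind it and the peak usage. The sketch below shows one possible way to check this with a named aggregation; the data is hypothetical, and the output names `rows`, `mean_usage`, and `peak_usage` are chosen here for illustration.

```python
import pandas as pd

# Hypothetical rows; only the column names come from the dataset
sample = pd.DataFrame({
    "priority": [0, 0, 210, 210, 210, 118],
    "average_usage": [2.0, 3.0, 0.0, 0.0, 0.0, 34.0],
    "maximum_usage": [5.0, 6.0, 0.0, 0.0, 0.0, 60.0],
})

# Row count, mean usage, and peak usage per priority level
by_priority = sample.groupby("priority").agg(
    rows=("average_usage", "size"),
    mean_usage=("average_usage", "mean"),
    peak_usage=("maximum_usage", "max"),
)

# Priority levels whose mean usage is exactly zero
idle = by_priority[by_priority["mean_usage"] == 0]
print(idle)
```

If a level has only a handful of rows, a zero mean may reflect a small sample rather than a genuinely idle priority class.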

### **CPU Usage by Scheduling Class**
| Scheduling Class | Avg CPU Usage |
|------------------|--------------|
| 0               | 5.09         |
| 1               | 4.93         |
| 2               | 8.44         |
| 3               | 15.66        |

### **Hourly CPU Usage Trends**
- Peak hours: **9 AM, 3 AM**
- Lowest usage: **11 AM, 10 PM**
- Hourly averages show no extreme fluctuations, but the pattern across hours is identifiable
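
Peak and quiet hours can be read off the hourly means with `nlargest` and `nsmallest`. This is a sketch on hypothetical timestamps and usage values, chosen only so the result is easy to follow.

```python
import pandas as pd

# Hypothetical timestamps and usage values
sample = pd.DataFrame({
    "Timestamp": pd.to_datetime([
        "2019-05-01 03:10", "2019-05-01 03:40", "2019-05-01 09:05",
        "2019-05-01 11:20", "2019-05-01 22:15", "2019-05-02 09:30",
    ]),
    "average_usage": [9.0, 11.0, 12.0, 1.0, 2.0, 14.0],
})

# Mean usage per hour of day (0-23)
hourly = sample.groupby(sample["Timestamp"].dt.hour)["average_usage"].mean()
hourly.index.name = "Hour"

# Two busiest and two quietest hours by mean usage
peak_hours = hourly.nlargest(2).index.tolist()
quiet_hours = hourly.nsmallest(2).index.tolist()
print(peak_hours, quiet_hours)
```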

### **Wasted CPU Resources**
| Metric | Value |
|--------|-------|
| Count  | 99,805 |
| Mean   | 7.92  |
| Std Dev| 20.31 |
| Min    | -112.6 ⚠️ (Usage exceeded request) |
| 25%    | 0.80  |
| 50%    | 4.33  |
| 75%    | 9.20  |
| Max    | 578.93 |
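
Negative values of `wasted_cpu` mean that average usage exceeded the request. The sketch below, on hypothetical data, isolates those rows. It also shows why `describe()` can report a count below the row total: rows with a missing `resource_request` produce `NaN` and are skipped.

```python
import numpy as np
import pandas as pd

# Hypothetical rows, one with usage above the request and one with no request
sample = pd.DataFrame({
    "resource_request": [10.0, 4.0, np.nan, 8.0],
    "average_usage": [2.0, 6.5, 1.0, 8.0],
})

sample["wasted_cpu"] = sample["resource_request"] - sample["average_usage"]

# Rows where usage exceeded the request (negative waste)
over_request = sample[sample["wasted_cpu"] < 0]

# describe() ignores NaN, so count reflects only rows with a request
stats = sample["wasted_cpu"].describe()
print(over_request)
print(stats["count"], stats["min"])
```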

---
## 📌 Next Steps
🔹 **Investigate Priority 210** (Zero CPU usage issue)  
🔹 **Analyze negative wasted CPU values** (Possible auto-scaling/misconfigurations)  
🔹 **Visualize hourly trends using a heatmap or line chart**  
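
For the heatmap step, one possible starting point is a day-by-hour grid of mean usage built with `pivot_table`. The sketch below uses hypothetical data and stops at the grid, to keep the dependencies to pandas. The `Day` and `Hour` columns are helper names introduced here.

```python
import pandas as pd

# Hypothetical timestamps and usage values
sample = pd.DataFrame({
    "Timestamp": pd.to_datetime([
        "2019-05-01 03:10", "2019-05-01 09:05", "2019-05-02 03:20",
        "2019-05-02 09:30", "2019-05-02 09:45",
    ]),
    "average_usage": [10.0, 12.0, 8.0, 14.0, 16.0],
})

# Helper columns: day of month and hour of day
sample["Day"] = sample["Timestamp"].dt.day
sample["Hour"] = sample["Timestamp"].dt.hour

# Rows are days, columns are hours, cells are mean usage
grid = sample.pivot_table(
    index="Day", columns="Hour", values="average_usage", aggfunc="mean"
)
print(grid)
```

The resulting grid can then be handed to a plotting library, for example matplotlib's `imshow` or seaborn's `heatmap`.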

---
## 📂 Files
- **DatasetFinal.csv** → Raw dataset
- **analysis.py** → Python script for analysis
- **README.md** → Documentation

---
## 🔧 Setup & Execution
```sh
pip install pandas numpy
python analysis.py
```

---
## 📬 Contact
For queries or collaboration, feel free to reach out! 🚀


```python
import pandas as pd
import numpy as np

# Load dataset
df = pd.read_csv("DatasetFinal.csv")

# Convert Timestamp to datetime; values that do not match the format become NaT
df["Timestamp"] = pd.to_datetime(df["Timestamp"], format="%d-%m-%Y %H:%M", errors="coerce")

# Fill missing timestamps by forward fill
# (assignment replaces the deprecated fillna(method="ffill", inplace=True))
df["Timestamp"] = df["Timestamp"].ffill()

# General statistics
summary_stats = df[["resource_request", "average_usage", "maximum_usage"]].describe()

# CPU utilization analysis
# Note: a zero or missing resource_request yields inf or NaN, which fails the < 0.5 test
df["utilization_ratio"] = df["average_usage"] / df["resource_request"]
over_provisioned = df[df["utilization_ratio"] < 0.5]  # Less than 50% utilization
underutilized = df[df["average_usage"] == 0]  # Completely unused allocations

# Priority and scheduling class impact
priority_usage = df.groupby("priority")["average_usage"].mean()
scheduling_class_usage = df.groupby("scheduling_class")["average_usage"].mean()

# Peak and off-peak usage (hourly trends)
df["Hour"] = df["Timestamp"].dt.hour  # Extract hour after fixing Timestamp
hourly_usage = df.groupby("Hour")["average_usage"].mean()

# Efficiency analysis: requested minus actually used (negative means usage exceeded request)
df["wasted_cpu"] = df["resource_request"] - df["average_usage"]
wasted_cpu_stats = df["wasted_cpu"].describe()

# Display insights
print("📊 General Statistics:\n", summary_stats)
print("\n🔍 Over-Provisioned Entries (<50% Utilized):", len(over_provisioned))
print("\n💤 Underutilized Entries (Zero CPU Usage):", len(underutilized))
print("\n📌 Avg CPU Usage by Priority:\n", priority_usage)
print("\n📌 Avg CPU Usage by Scheduling Class:\n", scheduling_class_usage)
print("\n⏳ Hourly CPU Usage Trend:\n", hourly_usage)
print("\n⚠️ Wasted CPU Resources:\n", wasted_cpu_stats)
```


```text
📊 General Statistics:
        resource_request  average_usage  maximum_usage
count      99805.000000  100000.000000  100000.000000
mean          15.323945       7.412663      25.212272
std           28.533821      18.342905      52.187262
min            0.000000       0.000000       0.000000
25%            4.051208       0.200272       0.794411
50%            8.102417       1.035690       5.004883
75%           15.930176       7.286072      29.663086
max          583.007812     538.085938    1271.484375

🔍 Over-Provisioned Entries (<50% Utilized): 63734

💤 Underutilized Entries (Zero CPU Usage): 11862

📌 Avg CPU Usage by Priority:
 priority
0.0       2.401346
25.0      0.835525
100.0     5.790075
101.0    19.851960
103.0     5.482175
105.0     2.577361
107.0     4.161291
114.0     7.363143
115.0     1.627170
116.0    10.626646
117.0     4.583187
118.0    34.440866
119.0    13.985375
199.0    35.009695
200.0    10.384887
201.0     0.160917
205.0     0.301726
210.0     0.000000
214.0    
```

