# Analytics of ML Features Usage in IDEs



## 1. Import Required Libraries and Load Data

**Name:** Wassim Mezgahnni   

**Goal:** Analyze user activity from March to May 2025 across LLM models, features, and license types in JetBrains IDEs, and produce exploratory data analysis (EDA), charts, statistical tests, aggregated metrics, and actionable recommendations.


In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots
import warnings
from datetime import datetime

In [5]:
file_path = "/Users/wassim/Analytics-of-ML-Features-Usage-in-IDEs/da_internship_task_dataset.csv"
df = pd.read_csv(file_path)

print("Dataset loaded successfully!")
print(f"Shape: {df.shape}")

df.head(10)

Dataset loaded successfully!
Shape: (122746, 7)


| | uuid | day_id | license | model | feature | requests_cnt | spent_amount |
|---|---|---|---|---|---|---|---|
| 0 | user_920 | 2025-05-01 | Premium | Model_A | Feature_1 | 44.0 | 16.38 |
| 1 | user_717 | 2025-03-04 | Premium | Model_B | Feature_2 | 72.0 | 27.92 |
| 2 | user_610 | 2025-05-08 | Premium | Model_A | Feature_2 | 27.0 | 9.87 |
| 3 | user_94 | 2025-03-11 | Basic | Model_D | Feature_1 | 76.0 | 14.67 |
| 4 | user_920 | 2025-05-14 | Premium | Model_E | Feature_3 | 47.0 | 9.88 |
| 5 | user_97 | 2025-05-14 | Basic | Model_A | Feature_1 | 10.0 | 3.01 |
| 6 | user_338 | 2025-04-19 | Basic | Model_A | Feature_3 | 13.0 | 5.29 |
| 7 | user_433 | 2025-05-28 | Enterprise | Model_E | Feature_1 | 81.0 | 14.19 |
| 8 | user_81 | 2025-03-19 | Standard | Model_E | Feature_3 | 59.0 | 12.92 |
| 9 | user_311 | 2025-03-07 | Standard | Model_D | Feature_1 | 56.0 | 12.12 |
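`pd.read_csv` does not parse dates unless asked to, so `day_id` arrives as plain text. The following is a minimal sketch of how the column could be converted and checked against the March to May 2025 window named in the goal. The three-row `sample` frame and the helper name `check_date_window` are hypothetical and only illustrate the idea; they are not part of the original notebook.

```python
import pandas as pd

# Hypothetical three-row sample that mirrors the day_id format of the preview.
sample = pd.DataFrame({
    "uuid": ["user_920", "user_717", "user_433"],
    "day_id": ["2025-05-01", "2025-03-04", "2025-05-28"],
})

def check_date_window(frame, start="2025-03-01", end="2025-05-31"):
    """Parse day_id and count rows that are unparseable or outside [start, end]."""
    # errors="coerce" turns malformed dates into NaT instead of raising.
    days = pd.to_datetime(frame["day_id"], format="%Y-%m-%d", errors="coerce")
    # between() is inclusive on both ends by default; NaT compares as False.
    in_window = days.between(pd.Timestamp(start), pd.Timestamp(end))
    return {
        "unparseable": int(days.isna().sum()),
        "out_of_window": int((~in_window & days.notna()).sum()),
        "first_day": days.min(),
        "last_day": days.max(),
    }

report = check_date_window(sample)
print(report)
```

Applied to the full data, the same helper would be called as `check_date_window(df)`; non-zero counts would point to rows worth inspecting before any time-based aggregation.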


## 2. Initial Data Exploration and Cleaning
- Structure
- Data types
- Data quality

In [13]:
# Dataset information
print("DATASET INFORMATION")

print(f"\nDataset Shape: {df.shape[0]} rows × {df.shape[1]} columns")
print("\nColumn Names and Data Types:")
display(df.dtypes)

# Check for missing values
print("\nMissing Values in Each Column:")
display(df.isna().sum())

# Data quality check: summary statistics for the numeric columns
display(df[['requests_cnt','spent_amount']].describe())

print('Unique licenses:', df['license'].nunique(), 'examples ->', df['license'].unique()[:10])
print('Unique models:', df['model'].nunique(), 'examples ->', df['model'].unique()[:10])
print('Unique features:', df['feature'].nunique(), 'examples ->', df['feature'].unique()[:10])

# Check for duplicates
duplicates = df.duplicated().sum()
print(f"\nDuplicate rows: {duplicates}")


DATASET INFORMATION

Dataset Shape: 122746 rows × 7 columns

Column Names and Data Types:


uuid             object
day_id           object
license          object
model            object
feature          object
requests_cnt    float64
spent_amount    float64
dtype: object


Missing Values in Each Column:


uuid            0
day_id          0
license         0
model           0
feature         0
requests_cnt    0
spent_amount    0
dtype: int64

| | requests_cnt | spent_amount |
|---|---|---|
| count | 122746.0 | 122746.0 |
| mean | 51.260742 | 12.227798 |
| std | 167.97738 | 37.588494 |
| min | 1.0 | 0.27 |
| 25% | 18.0 | 5.14 |
| 50% | 32.0 | 8.02 |
| 75% | 54.0 | 13.06 |
| max | 12900.0 | 2599.0 |


Unique licenses: 4 examples -> ['Premium' 'Basic' 'Enterprise' 'Standard']
Unique models: 5 examples -> ['Model_A' 'Model_B' 'Model_D' 'Model_E' 'Model_C']
Unique features: 5 examples -> ['Feature_1' 'Feature_2' 'Feature_3' 'Feature_5' 'Feature_4']

Duplicate rows: 0
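
The summary statistics show maxima (12900 requests, 2599 spent) far above the 75th percentiles (54 and 13.06), which suggests a heavy right tail. One common way to flag such rows is Tukey's rule, an upper fence at Q3 + 1.5 × IQR. With the quartiles reported above, that fence would be 54 + 1.5 × (54 − 18) = 108 for `requests_cnt` and 13.06 + 1.5 × (13.06 − 5.14) = 24.94 for `spent_amount`. The sketch below illustrates the rule on made-up values; the helper `iqr_upper_fence`, the `toy` frame, and the choice of the 1.5 multiplier are assumptions for illustration, not steps taken in the original notebook.

```python
import pandas as pd

def iqr_upper_fence(series, k=1.5):
    """Return Q3 + k * IQR, the upper fence of Tukey's outlier rule."""
    q1, q3 = series.quantile([0.25, 0.75])
    return q3 + k * (q3 - q1)

# Hypothetical toy values, not rows from the dataset.
toy = pd.DataFrame(
    {"requests_cnt": [10.0, 18.0, 27.0, 32.0, 44.0, 54.0, 60.0, 12900.0]}
)

fence = iqr_upper_fence(toy["requests_cnt"])
# Flag rows strictly above the fence; they are marked, not dropped.
toy["is_outlier"] = toy["requests_cnt"] > fence

print(f"Upper fence: {fence}")
print(toy[toy["is_outlier"]])
```

Flagging rather than deleting keeps the decision reversible: whether extreme rows are data errors or genuine heavy users is a judgment the later analysis would need to make.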
