# Week 7 — Applied SaaS Notebook

This notebook is part of the 'Applied ML Foundations for SaaS Analytics' course. It offers conversational, mentor-style guidance throughout.

In [None]:
from IPython.display import HTML
HTML('''
<style>
details {
  margin: 10px 0;
  padding: 8px 12px;
  border: 1px solid #d9e2ec;
  border-radius: 8px;
  background: #f9fbfd;
}
details summary {
  font-weight: 600;
  color: #0056b3;
  cursor: pointer;
}
details[open] {
  background: #f1f7ff;
  border-color: #c3d4f0;
}
details pre {
  background: #f8f9fa;
  padding: 8px;
  border-radius: 6px;
}
</style>
''')

## Scenario — Predict which trial users convert to paid

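We will construct features from usage and event data, train a classifier, and discuss evaluation metrics that matter for SaaS: precision at top-N (the share of actual converters among the N highest-scored users) and lift (that precision relative to the overall conversion rate).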
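
To make the two SaaS metrics concrete, here is a minimal sketch of precision at top-N (written precision@k in the code) and lift. The function names `precision_at_k` and `lift_at_k` and the toy arrays are illustrative assumptions, not part of the course code. Lift is taken here as precision@k divided by the overall conversion rate, which is one common definition.

```python
import numpy as np

def precision_at_k(y_true, scores, k):
    """Share of actual converters among the k highest-scored users."""
    y_true = np.asarray(y_true)
    scores = np.asarray(scores)
    top_k = np.argsort(-scores, kind="stable")[:k]  # indices of the k highest scores
    return float(y_true[top_k].mean())

def lift_at_k(y_true, scores, k):
    """Precision@k divided by the overall conversion rate (1.0 = no better than random)."""
    base_rate = float(np.mean(y_true))  # assumes at least one converter in y_true
    return precision_at_k(y_true, scores, k) / base_rate

# Toy data (made up for illustration): 10 trial users, 3 of whom converted.
y_true = np.array([1, 0, 0, 1, 0, 0, 0, 1, 0, 0])
scores = np.array([0.9, 0.8, 0.1, 0.7, 0.2, 0.3, 0.05, 0.4, 0.15, 0.25])

p_at_3 = precision_at_k(y_true, scores, k=3)
lift_3 = lift_at_k(y_true, scores, k=3)
print(f"precision@3 = {p_at_3:.3f}, lift@3 = {lift_3:.2f}")
```

In this toy example two of the three highest-scored users converted, so precision@3 is about 0.667 against a 0.3 base rate, a lift of roughly 2.2.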


## Hands-on

Try different feature sets and evaluate ROC AUC and precision@k.


<details>
<summary>💡 Hint</summary>

Try breaking the problem into smaller steps. For example, if you need to aggregate per-user metrics, first compute a grouped table, then convert to NumPy arrays for vectorized ops. Think about edge cases: missing users, zero counts, or extreme values.

</details>

<details>
<summary>✅ Solution (example)</summary>

```python
# Example solution snippet — adapt to your dataset & question.
import pandas as pd
import numpy as np

# Load data (adjust path as needed)
df = pd.read_csv('../data/feature_usage.csv', parse_dates=['date'], low_memory=False)

# Example: compute total usage per user and return top users
user_usage = df.groupby('user_id')['usage_count'].sum().reset_index(name='total_usage')
top_users = user_usage.sort_values('total_usage', ascending=False).head(10)
top_users
```

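**Why this works:** `groupby` aggregates the usage rows by `user_id`, and sorting the totals in descending order surfaces the heaviest users. The snippet stays in pandas; for heavier numeric work, converting a column to a NumPy array (for example with `.to_numpy()`) can speed up numeric-only operations.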
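
The edge cases named in the hint (missing users, zero counts, extreme values) can be handled right after the grouping step. The following is a minimal sketch on made-up rows; the `events` frame and the `all_users` index are hypothetical stand-ins, not the course files.

```python
import numpy as np
import pandas as pd

# Toy event rows (made up): user 3 has no usage rows at all.
events = pd.DataFrame({"user_id": [1, 1, 2, 2, 2],
                       "usage_count": [3, 0, 500, 20, 1]})
all_users = pd.Index([1, 2, 3], name="user_id")  # hypothetical full user list

# Grouped table first; reindex so users with no events appear with a zero total.
totals = (events.groupby("user_id")["usage_count"].sum()
                .reindex(all_users, fill_value=0))

# NumPy array for vectorized math; log1p compresses extreme values and maps 0 to 0.
usage = totals.to_numpy(dtype=float)
log_usage = np.log1p(usage)

print(totals)
print(log_usage)
```

Whether a log transform is appropriate depends on the model; tree-based classifiers are insensitive to monotonic rescaling, so this matters mostly for linear models.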

</details>

In [None]:

import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Load small samples to build prototype features
subs = pd.read_csv('../data/subscriptions.csv')
fu = pd.read_csv('../data/feature_usage.csv')

# Simple feature: total usage_count per user
user_usage = fu.groupby('user_id')['usage_count'].sum().reset_index()
# Left merge keeps users with no usage rows; fillna(0) sets their usage to 0
# (it also zero-fills any other missing values in the frame)
df = subs.merge(user_usage, on='user_id', how='left').fillna(0)
# Label: a user counts as paid when MRR is positive
df['is_paid'] = (df['mrr'] > 0).astype(int)

# Prototype on the first 2,000 rows only
X = df[['usage_count', 'tenure_days']].values[:2000]
y = df['is_paid'].values[:2000]
# Stratify so train and test keep the same share of paid users
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

clf = RandomForestClassifier(n_estimators=50, random_state=42)
clf.fit(X_train, y_train)
# clf.score reports plain accuracy
print('test accuracy', clf.score(X_test, y_test))
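
The cell above reports accuracy only. A rough sketch of the ranking-based evaluation the Hands-on section asks for (ROC AUC and precision@k) might look like the following. It uses synthetic data from `make_classification` as a stand-in for the course files, and the cutoff `k = 40` is an arbitrary assumption for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the course data: roughly 10% positives ("converters").
X, y = make_classification(n_samples=2000, n_features=5, n_informative=3,
                           weights=[0.9, 0.1], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

clf = RandomForestClassifier(n_estimators=50, random_state=42)
clf.fit(X_train, y_train)

# Probability of the positive class, used to rank users rather than hard-label them.
proba = clf.predict_proba(X_test)[:, 1]
auc = roc_auc_score(y_test, proba)

k = 40  # hypothetical outreach budget: the 40 highest-scored users
top_k = np.argsort(-proba)[:k]
precision_k = y_test[top_k].mean()
base_rate = y_test.mean()

print(f"ROC AUC      : {auc:.3f}")
print(f"precision@{k} : {precision_k:.3f}")
print(f"base rate    : {base_rate:.3f}")
```

Swapping in the notebook's own `X` and `y` should work the same way, since only `predict_proba` and the true labels are needed.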


## Reflection

Why might accuracy be misleading when few users convert?
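
One way to explore this question is to score a baseline that never predicts a conversion. The sketch below uses made-up labels with a 5% conversion rate, which is an assumption for illustration and not a figure from the course data.

```python
import numpy as np

# Made-up labels for illustration: 1,000 trial users, 50 of whom convert (5%).
y_true = np.zeros(1000, dtype=int)
y_true[:50] = 1

# A "model" that predicts that nobody converts.
y_pred = np.zeros_like(y_true)

accuracy = (y_pred == y_true).mean()
recall = y_pred[y_true == 1].mean()  # share of real converters the model catches

print(f"accuracy = {accuracy:.2f}, recall on converters = {recall:.2f}")
```

The baseline is right 95% of the time yet finds none of the users the business cares about, which is worth keeping in mind when reading the accuracy printed by the prototype cell.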
