# Data Understanding & Validation (SaaS Trial-to-Paid Conversion)

## Objective
Validate that the dataset is analytics-ready and understand how each field maps to
trial user behavior and conversion outcomes.

## Dataset
- Source: Kaggle (sample dataset, ~1,000 rows)
- Use case: exploratory product analytics (trial → paid conversion)

## What this notebook covers
1. Load dataset (reproducible path)
2. Schema overview + business mapping
3. Data quality validation (missing values, duplicates, ranges)
4. Target distribution (conversion rate baseline)


In [14]:
# Importing Libraries and Display Settings
import pandas as pd
import numpy as np
from pathlib import Path

pd.set_option("display.max_columns",100)
pd.set_option("display.width",100)

In [24]:
import os
print(os.getcwd())


C:\Users\Sandeep


In [34]:
# Loading the Dataset
DATA_PATH = Path("data/user_behavior_dataset.csv")

if not DATA_PATH.exists():
    raise FileNotFoundError(
        f"Dataset not found at {DATA_PATH}. "
        "Place the CSV in a /data folder and update DATA_PATH if needed."
    )

df = pd.read_csv(DATA_PATH)
df.head()

,age,income,score,height,weight,visits,clicks,time_spent,target
0,39,59796,0.435,154.8,45.0,8,4,191,1
1,33,48031,0.709,164.3,63.9,5,4,366,0
2,41,45971,0.316,168.3,68.5,9,3,209,1
3,50,48756,0.508,189.0,94.7,9,5,314,0
4,32,35634,0.371,177.0,60.6,11,5,384,1
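
The relative `DATA_PATH` above resolves against the current working directory, so the load only succeeds when the notebook is started from the folder that contains `data/`. One possible way to make the path less dependent on the launch location is to search upward through parent directories. This is a minimal sketch; `find_data_file` is a hypothetical helper, not part of the original notebook.

```python
from pathlib import Path


def find_data_file(relative_path, start=None):
    """Search `start` and each of its parents for `relative_path`.

    Returns the first existing match, or raises FileNotFoundError.
    """
    start = Path(start).resolve() if start is not None else Path.cwd()
    for base in [start, *start.parents]:
        candidate = base / relative_path
        if candidate.exists():
            return candidate
    raise FileNotFoundError(
        f"{relative_path} not found in {start} or any parent directory"
    )
```

In the notebook this could replace the hard-coded assignment, for example `DATA_PATH = find_data_file("data/user_behavior_dataset.csv")`, while keeping the same `FileNotFoundError` behavior when the file is absent.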


## Column Mapping (Business Meaning)

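The schema summary below lists each field with its type, completeness, and cardinality,
as a starting point for mapping dataset fields to SaaS product analytics concepts.
The field names are generic, so confirm each description against the dataset's
documentation before relying on it.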
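
One way to record the mapping is a plain dictionary keyed by column name. The descriptions below are assumptions inferred from the field names and this notebook's trial-to-paid framing, not definitions taken from the dataset's documentation. `height` and `weight` have no obvious product meaning and are flagged for review.

```python
# Hypothetical business mapping: every description is an assumption
# inferred from the column name, not a documented definition.
COLUMN_MAPPING = {
    "age": "User age (demographic attribute)",
    "income": "User income (demographic attribute; currency and period unknown)",
    "score": "Unspecified score (meaning and scale not documented)",
    "height": None,  # no obvious product meaning
    "weight": None,  # no obvious product meaning
    "visits": "Number of visits during the trial",
    "clicks": "Number of in-product clicks during the trial",
    "time_spent": "Time spent in the product (unit unknown)",
    "target": "Conversion outcome: 1 = converted to paid, 0 = did not convert",
}

# Columns without a mapping need a decision (document, drop, or ignore)
# before they are used in later analysis.
unmapped = [col for col, meaning in COLUMN_MAPPING.items() if meaning is None]
print("Columns needing review:", unmapped)
```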


In [37]:
# Auto schema summary table
schema = pd.DataFrame({
    "column": df.columns,
    "dtype": df.dtypes.astype(str),
    "non_null_count": df.notna().sum().values,
    "null_count": df.isna().sum().values,
    "unique_count": df.nunique().values
}).sort_values("column")

schema


,column,dtype,non_null_count,null_count,unique_count
age,age,int64,1000,0,46
clicks,clicks,int64,1000,0,17
height,height,float64,1000,0,349
income,income,int64,1000,0,988
score,score,float64,1000,0,466
target,target,int64,1000,0,2
time_spent,time_spent,int64,1000,0,461
visits,visits,int64,1000,0,22
weight,weight,float64,1000,0,434


In [39]:
# Getting the Dataset Shape and Column Names
print(f"Rows: {df.shape[0]:,}")
print(f"Columns: {df.shape[1]:,}")
df.columns.tolist()

Rows: 1,000
Columns: 9


['age',
 'income',
 'score',
 'height',
 'weight',
 'visits',
 'clicks',
 'time_spent',
 'target']
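
A lightweight guard can confirm that a reloaded file still has the columns listed above. This is only a sketch: `check_columns` is a hypothetical helper, and the expected names are copied from the output above.

```python
# Expected column names, copied from the notebook output.
EXPECTED_COLUMNS = (
    "age", "income", "score", "height", "weight",
    "visits", "clicks", "time_spent", "target",
)


def check_columns(actual, expected=EXPECTED_COLUMNS):
    """Return (missing, unexpected) column names, each sorted alphabetically."""
    actual_set, expected_set = set(actual), set(expected)
    missing = sorted(expected_set - actual_set)
    unexpected = sorted(actual_set - expected_set)
    return missing, unexpected


# Example: a CSV saved with its index gains an "Unnamed: 0" column,
# and here the target column has also been dropped.
missing, unexpected = check_columns(["Unnamed: 0", *EXPECTED_COLUMNS[:-1]])
print("Missing:", missing)
print("Unexpected:", unexpected)
```

In the notebook, `check_columns(df.columns)` should return two empty lists for this dataset.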

## Data Types & Completeness


In [19]:
# Getting Data Types & Non-Null Counts
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 9 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   age         1000 non-null   int64  
 1   income      1000 non-null   int64  
 2   score       1000 non-null   float64
 3   height      1000 non-null   float64
 4   weight      1000 non-null   float64
 5   visits      1000 non-null   int64  
 6   clicks      1000 non-null   int64  
 7   time_spent  1000 non-null   int64  
 8   target      1000 non-null   int64  
dtypes: float64(3), int64(6)
memory usage: 70.4 KB


In [42]:
# Data quality checks: missing values and duplicate rows
missing = df.isna().sum().sort_values(ascending=False)
dup_rows = df.duplicated().sum()

print(f"Duplicate rows: {dup_rows}")
print("\nTop missing-value columns (should be all zeros for this dataset):")
missing.head(10)


Duplicate rows: 0

Top missing-value columns (should be all zeros for this dataset):


age           0
income        0
score         0
height        0
weight        0
visits        0
clicks        0
time_spent    0
target        0
dtype: int64

## Sanity Checks (Ranges & Categories)

Quick validation that numeric columns have reasonable ranges and categorical columns
have expected levels. All nine columns in this dataset are numeric, so the categorical
check below produces no output.


In [45]:
# Numeric sanity checks
num_cols = df.select_dtypes(include=[np.number]).columns.tolist()
df[num_cols].describe().T

,count,mean,std,min,25%,50%,75%,max
age,1000.0,34.812,9.462991,18.0,28.0,35.0,41.0,65.0
income,1000.0,42433.235,16343.081256,8000.0,31268.25,40328.5,51475.75,112113.0
score,1000.0,0.472451,0.148494,0.0,0.362,0.4715,0.57625,0.862
height,1000.0,171.846,9.200383,150.0,165.375,172.0,178.0,200.0
weight,1000.0,71.5213,14.64087,45.0,61.5,71.4,81.7,128.7
visits,1000.0,9.256,3.595747,1.0,7.0,9.0,11.0,22.0
clicks,1000.0,5.561,2.755335,0.0,4.0,5.0,7.0,16.0
time_spent,1000.0,371.404,145.022123,30.0,271.75,354.0,463.25,954.0
target,1000.0,0.29,0.453989,0.0,0.0,0.0,1.0,1.0
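
The summary statistics can be turned into explicit range checks that fail loudly when a reloaded file violates them. The sketch below uses bounds chosen for illustration (they are assumptions, not documented constraints of the dataset) and runs on a subset of columns from the five preview rows shown earlier so that it is self-contained. `RANGE_RULES` and `find_range_violations` are hypothetical names.

```python
import pandas as pd

# Preview rows copied from the df.head() output so the example runs standalone.
sample = pd.DataFrame({
    "age": [39, 33, 41, 50, 32],
    "score": [0.435, 0.709, 0.316, 0.508, 0.371],
    "visits": [8, 5, 9, 9, 11],
    "clicks": [4, 4, 3, 5, 5],
    "time_spent": [191, 366, 209, 314, 384],
    "target": [1, 0, 1, 0, 1],
})

# Assumed inclusive (min, max) bounds; None means no bound on that side.
RANGE_RULES = {
    "age": (0, 120),
    "score": (0.0, 1.0),
    "visits": (0, None),
    "clicks": (0, None),
    "time_spent": (0, None),
    "target": (0, 1),
}


def find_range_violations(df, rules):
    """Return {column: number of rows outside the inclusive bounds}."""
    violations = {}
    for col, (low, high) in rules.items():
        out_of_range = pd.Series(False, index=df.index)
        if low is not None:
            out_of_range |= df[col] < low
        if high is not None:
            out_of_range |= df[col] > high
        if out_of_range.any():
            violations[col] = int(out_of_range.sum())
    return violations


violations = find_range_violations(sample, RANGE_RULES)
assert not violations, f"Range violations found: {violations}"

# A min/max check alone does not guarantee a binary target, so test membership too.
assert sample["target"].isin([0, 1]).all()
```

Passing the full `df` instead of `sample` would apply the same rules to all 1,000 rows.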


In [51]:
# Categorical level checks
cat_cols = df.select_dtypes(exclude=[np.number]).columns.tolist()

for col in cat_cols:
    print(f"\n{col} (unique={df[col].nunique()}):")
    display(df[col].value_counts(dropna=False).head(10))

In [53]:
# Target distribution (conversion baseline)
target_col = "target"  # keep as-is if the dataset uses 'target'

counts = df[target_col].value_counts(dropna=False)
# Use dropna=False here as well so counts and rates share the same index
rates = df[target_col].value_counts(normalize=True, dropna=False).mul(100).round(2)

target_summary = pd.DataFrame({
    "count": counts,
    "percent": rates.astype(str) + "%"
})

target_summary


target,count,percent
0,710,71.0%
1,290,29.0%


**Interpretation:**  
The baseline conversion rate is 29.0%: 290 of 1,000 trial users converted and 710 did not.
In later notebooks, engagement and segment analysis should be interpreted relative to
this baseline.
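
Because the baseline is estimated from 1,000 users, it carries sampling uncertainty that matters when segments are compared against it. As a rough illustration (not part of the original analysis), the sketch below computes a 95% confidence interval with the normal approximation, using the counts from the table above. The value `z = 1.96` is the usual approximate critical value.

```python
import math

# Counts taken from the target distribution table.
converted = 290
total = 1000

baseline = converted / total

# Normal-approximation (Wald) interval for a proportion.
standard_error = math.sqrt(baseline * (1 - baseline) / total)
z = 1.96  # approximate 95% critical value
lower = baseline - z * standard_error
upper = baseline + z * standard_error

print(f"Baseline conversion: {baseline:.1%} (95% CI {lower:.1%} to {upper:.1%})")
```

On this sample the interval spans roughly 26% to 32%, so small differences between a segment and the baseline may reflect noise rather than a real effect.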


In [64]:
# Examining the data types
df.dtypes

age             int64
income          int64
score         float64
height        float64
weight        float64
visits          int64
clicks          int64
time_spent      int64
target          int64
dtype: object

In [80]:
# Analyzing the Target Variable
df["target"].value_counts(dropna=False), df["target"].value_counts(normalize=True)

(target
 0    710
 1    290
 Name: count, dtype: int64,
 target
 0    0.71
 1    0.29
 Name: proportion, dtype: float64)