In [9]:
import pandas as pd
import numpy as np
from statsmodels.formula.api import ols

In [2]:
data = pd.read_csv("user_data.csv")

In [3]:
data.shape

(500, 9)

In [4]:
data.head()

Unnamed: 0,user_id,baseline_watch_time,age,region,device_type,subscription_status,session_count_last_week,homepage_version,post_watch_time
0,0,138.730361,38,US,Desktop,Premium,9,0,76.332293
1,1,71.647308,49,US,Tablet,Free,8,1,38.604333
2,2,74.154847,44,EU,Mobile,Free,6,0,16.122792
3,3,57.810941,25,EU,Tablet,Premium,10,0,37.294173
4,4,115.962229,61,US,Tablet,Free,6,0,30.219337


#### Assigning the independent variable of choice (grouping variable) `homepage_version` through a binomial distribution
Each user must be assigned to either the control group (0) or the treatment group (1), so we draw one value per user from a binomial distribution with `n=1` and `p=0.5` (a Bernoulli trial, i.e. a fair coin flip).

In [5]:
# The loaded CSV already contains `homepage_version`, so the assignment is left commented out.
# data['homepage_version'] = np.random.binomial(n=1, p=0.5, size=data.shape[0])

In [7]:
data['homepage_version'].value_counts()

homepage_version
1    265
0    235
Name: count, dtype: int64
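
The assignment step can be made reproducible with a seeded generator, and the observed 265/235 split can be checked against what a fair coin would produce. This is a minimal sketch: the seed value and the variable names are illustrative assumptions, not part of the original notebook.

```python
import math

import numpy as np

# Hypothetical seed, chosen only so the draw is reproducible.
rng = np.random.default_rng(2025)

n_users = 500
# One Bernoulli draw per user: 0 = control, 1 = new homepage.
assignment = rng.binomial(n=1, p=0.5, size=n_users)

# Sanity check on the split reported above (265 treated, 235 control).
observed_treated = 265
expected_treated = n_users * 0.5
sd_treated = math.sqrt(n_users * 0.5 * 0.5)  # binomial standard deviation
z = (observed_treated - expected_treated) / sd_treated

print(f"simulated treated count: {assignment.sum()}")
print(f"z-score of observed split: {z:.2f}")
```

A z-score with absolute value below 1.96 means the observed imbalance is within what random assignment at `p=0.5` typically produces.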

#### Writing the linear expression for ANCOVA

$$
y = \beta_{0} + \beta_{1} \cdot \text{groupingVariable} + \beta_{2} \cdot x_{1} + \beta_{3} \cdot x_{2} + \dots + \beta_{k+1} \cdot x_{k} + \epsilon
$$
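
To make the linear expression concrete, the sketch below simulates data from that equation with one covariate and recovers the coefficients by ordinary least squares. It uses plain `numpy.linalg.lstsq` rather than statsmodels, and every number in it (sample size, true coefficients, noise level, seed) is a hypothetical choice for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)  # hypothetical seed
n = 2000

# Grouping variable (0 = control, 1 = treatment) and one covariate.
group = rng.binomial(1, 0.5, n)
baseline = rng.normal(90, 30, n)
noise = rng.normal(0, 5, n)

# y = beta0 + beta1 * group + beta2 * baseline + epsilon
y = 5.0 + 10.0 * group + 0.4 * baseline + noise

# Design matrix: intercept column, grouping variable, covariate.
X = np.column_stack([np.ones(n), group, baseline])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
b0, b1, b2 = beta

print(f"intercept={b0:.2f}, treatment effect={b1:.2f}, baseline slope={b2:.3f}")
```

The coefficient on the grouping variable is the covariate-adjusted difference between groups, which is the quantity ANCOVA reports as the treatment effect.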


In [10]:
formula = "post_watch_time ~ homepage_version + baseline_watch_time + age + session_count_last_week + C(subscription_status) + C(device_type) + C(region)"

In [14]:
model = ols(formula=formula, data=data).fit()

In [15]:
model.summary()

Dep. Variable:,post_watch_time,R-squared:,0.715
Model:,OLS,Adj. R-squared:,0.71
Method:,Least Squares,F-statistic:,136.8
Date:,"Sat, 10 May 2025",Prob (F-statistic):,1.25e-127
Time:,16:37:11,Log-Likelihood:,-1854.9
No. Observations:,500,AIC:,3730.0
Df Residuals:,490,BIC:,3772.0
Df Model:,9,,
Covariance Type:,nonrobust,,

,coef,std err,t,P>|t|,[0.025,0.975]
Intercept,-3.6494,2.621,-1.392,0.164,-8.800,1.501
C(subscription_status)[T.Premium],14.5143,1.032,14.067,0.000,12.487,16.542
C(device_type)[T.Mobile],-4.6814,1.000,-4.683,0.000,-6.646,-2.717
C(device_type)[T.Tablet],-0.5912,1.446,-0.409,0.683,-3.433,2.251
C(region)[T.EU],-0.2572,1.171,-0.220,0.826,-2.559,2.044
C(region)[T.US],2.5332,1.082,2.341,0.020,0.407,4.660
homepage_version,10.4485,0.903,11.565,0.000,8.673,12.224
baseline_watch_time,0.3840,0.015,25.269,0.000,0.354,0.414
age,-0.0899,0.034,-2.671,0.008,-0.156,-0.024

Omnibus:,0.397,Durbin-Watson:,2.046
Prob(Omnibus):,0.82,Jarque-Bera (JB):,0.238
Skew:,-0.013,Prob(JB):,0.888
Kurtosis:,3.104,Cond. No.,632.0
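
Instead of reading the treatment row off the full summary, the same quantities can be pulled from the fitted result directly. The sketch below fits a reduced model on a small synthetic frame (the `toy` data, its coefficients, and the seed are assumptions made for illustration) and extracts the estimate, standard error, p-value, and confidence interval for `homepage_version`.

```python
import numpy as np
import pandas as pd
from statsmodels.formula.api import ols

rng = np.random.default_rng(0)  # hypothetical seed
n = 400

# Synthetic stand-in for the real data; column names mirror the notebook.
toy = pd.DataFrame({
    "homepage_version": rng.binomial(1, 0.5, n),
    "baseline_watch_time": rng.normal(90, 30, n),
})
toy["post_watch_time"] = (
    5 + 10 * toy["homepage_version"]
    + 0.4 * toy["baseline_watch_time"]
    + rng.normal(0, 5, n)
)

toy_model = ols(
    "post_watch_time ~ homepage_version + baseline_watch_time", data=toy
).fit()

term = "homepage_version"
effect = toy_model.params[term]                  # point estimate
std_err = toy_model.bse[term]                    # standard error
p_value = toy_model.pvalues[term]                # two-sided p-value
ci_low, ci_high = toy_model.conf_int().loc[term]  # 95% CI by default

print(f"{term}: {effect:.2f} (SE {std_err:.2f}), "
      f"95% CI [{ci_low:.2f}, {ci_high:.2f}], p={p_value:.3g}")
```

The same attribute lookups should work on the notebook's `model` object, since it is also a fitted `ols` result.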


#### Interpretation:

***For `homepage_version`***

| Coef | Std Err | P-value | 95% CI |
|---|---|---|---|
| 10.45 | 0.90 | < 0.001 | [8.67, 12.22] |

- Estimated treatment effect: the new homepage increases post-period watch time by ~10.45 minutes, holding the other covariates constant.
- The p-value is below 0.001, so an effect this large would be very unlikely if the true effect were zero.
- The 95% CI is narrow and contains the simulated true effect (10), so the model recovers the treatment effect well.
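
The t-statistic and confidence interval for `homepage_version` can be reproduced by hand from the coefficient and standard error in the summary. This sketch uses the normal critical value as an approximation; statsmodels uses the t distribution with 490 residual degrees of freedom, so the bounds differ slightly in the third decimal.

```python
from statistics import NormalDist

# Values copied from the homepage_version row of the summary.
coef = 10.4485
std_err = 0.903

# t-statistic: estimate divided by its standard error.
t_stat = coef / std_err

# Normal approximation to the 97.5th percentile (about 1.96).
z_crit = NormalDist().inv_cdf(0.975)
ci_low = coef - z_crit * std_err
ci_high = coef + z_crit * std_err

# The simulated true effect stated in the interpretation.
true_effect = 10
covers_truth = ci_low <= true_effect <= ci_high

print(f"t = {t_stat:.2f}")
print(f"approx. 95% CI = [{ci_low:.2f}, {ci_high:.2f}]")
print(f"interval covers the simulated effect: {covers_truth}")
```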

***Other Variables***:

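- Baseline watch time: strong positive effect; each additional minute of baseline watch time is associated with ~0.38 more minutes of post watch time.
- Age: small negative effect (~-0.09 minutes per year).
- Premium users: watch ~14.5 minutes more than Free users (the reference level).
- Mobile users: watch ~4.7 minutes less than Desktop users (the reference level).
- Session count: strongly predictive (1.47 per session).
- Tablet and EU are not significant (p > 0.05) relative to their reference levels; US is significant at the 5% level (coef ~2.53, p = 0.020).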
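
Categorical coefficients are always differences from a reference level, which is why Desktop never appears as a row in the summary. The sketch below uses `pd.get_dummies(..., drop_first=True)` as an analogue of the treatment coding applied by `C(device_type)`; it is not the code statsmodels runs internally, and the five-row series is made up for illustration.

```python
import pandas as pd

# Hypothetical sample of device types.
devices = pd.Series(
    ["Desktop", "Tablet", "Mobile", "Tablet", "Desktop"], name="device_type"
)

# Levels are sorted alphabetically and the first one (Desktop) is dropped,
# so it becomes the reference level absorbed into the intercept.
dummies = pd.get_dummies(devices, drop_first=True, dtype=int)
print(list(dummies.columns))

# Coefficients from the summary, each measured against Desktop.
coef_mobile = -4.6814
coef_tablet = -0.5912

# Contrasts between two non-reference levels are differences of coefficients.
tablet_vs_mobile = coef_tablet - coef_mobile
print(f"Tablet vs Mobile: {tablet_vs_mobile:.2f} minutes")
```

A Desktop row is encoded as all zeros, so its expected watch time is carried by the intercept and the other covariates.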

***Model Quality***:

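- R-squared = 0.715: the model explains 71.5% of the variance in post watch time, which is high for behavioral data; the adjusted R-squared (0.710) is nearly identical, so the covariates are not inflating the fit.
- Durbin-Watson = 2.046 (close to 2): no evidence of first-order autocorrelation in the residuals.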
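
The fit statistics in the summary header follow from a few inputs: the number of observations, the model degrees of freedom, R-squared, and the log-likelihood. As a rough check, the sketch below recomputes adjusted R-squared, AIC, and BIC from the reported values using the standard formulas, counting the intercept as one extra parameter.

```python
import math

# Values copied from the summary header.
n_obs = 500
df_model = 9        # slope parameters, excluding the intercept
r_squared = 0.715
log_lik = -1854.9

k = df_model + 1            # total parameters including the intercept
df_resid = n_obs - k        # residual degrees of freedom

# Adjusted R-squared penalizes R-squared for the number of predictors.
adj_r_squared = 1 - (1 - r_squared) * (n_obs - 1) / df_resid

# Information criteria: lower values indicate a better trade-off
# between fit and complexity.
aic = 2 * k - 2 * log_lik
bic = k * math.log(n_obs) - 2 * log_lik

print(f"Df Residuals = {df_resid}")
print(f"Adj. R-squared = {adj_r_squared:.3f}")
print(f"AIC = {aic:.1f}, BIC = {bic:.1f}")
```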