# <center><font color='#F1B03D'>**Revenue Intelligence Enhancement for BrokerChooser - Inferential Regression Analysis**</font></center>
### <center><font color='#F1B03D'>Central European University, 2024-2025</font></center>
### <center><font color='#F1B03D'>CEU Capstone Project</font></center>

### <left><font color='#F1B03D'>Author: Péter Bence Török (torokpe@gmail.com)</font></left>
### <left><font color='#F1B03D'>BrokerChooser Contact Person: Zoltán Molnár (zoltan.molnar@brokerchooser.com)</font></left>

---
<p style="font-size:22px;">This notebook fits a logistic (logit) regression to a pre-processed dataset for inferential purposes. The model estimates how session-level variables relate to the likelihood that a session generates revenue. The output is a summary table listing each variable's coefficient, standard error, 95% confidence interval, and a statistical-significance indicator, giving an overview of which features are associated with the target outcome and how precisely each association is estimated.</p>

In [None]:
# Import necessary libraries
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

import warnings
warnings.filterwarnings("ignore")

In [None]:
# Load the sampled dataset from a CSV file
df = pd.read_csv('path/to/file_location/data_sample.csv')

In [None]:
# Create 0/1 dummy variables for the categorical columns (dtype=int avoids boolean columns)
categorical_cols = ['country', 'device', 'traffic_name', 'traffic_medium', 'browser',
                    'op_system', 'day_of_month', 'start_event_hour', 'day_of_week']
df = pd.get_dummies(df, columns=categorical_cols, drop_first=False, dtype=int)

In [None]:
# Drop one reference category per categorical variable to avoid the dummy-variable trap
cols_to_drop = ['country_other', 'device_other', 'traffic_name_other',
                'traffic_medium_other', 'browser_other', 'op_system_other',
                'day_of_month_1', 'start_event_hour_0', 'day_of_week_0']

# errors='ignore' skips any listed column that is absent from the data
df = df.drop(columns=cols_to_drop, errors='ignore')
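
To illustrate the encoding and reference-category steps on data small enough to inspect, the following sketch encodes one categorical column and then removes its reference level. The `toy` DataFrame and its `device` values are made up for the example and are not taken from the project data.

```python
import pandas as pd

# Hypothetical toy data: three sessions with a single categorical column
toy = pd.DataFrame({
    'generated_revenue': [0, 1, 0],
    'device': ['mobile', 'desktop', 'other'],
})

# One 0/1 column per category level
toy = pd.get_dummies(toy, columns=['device'], drop_first=False, dtype=int)

# Remove the reference level; the remaining dummy coefficients are then
# interpreted relative to sessions in the 'other' category
toy = toy.drop(columns=['device_other'])

print(toy.columns.tolist())
```

Choosing the reference level explicitly, rather than relying on `drop_first=True`, keeps the baseline category under the analyst's control.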

In [None]:
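# Build the formula from every column except the target.
# Note: this assumes all column names are valid Python identifiers; names with
# spaces or symbols would need quoting, e.g. with patsy's Q("...").
all_vars = df.columns.difference(['generated_revenue'])  # sorted, target excluded
formula = 'generated_revenue ~ ' + ' + '.join(all_vars)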

# Fit the model
model = smf.logit(formula=formula, data=df)
result = model.fit()

# View summary
print(result.summary())
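
Because warnings are silenced at the top of the notebook, a failed optimisation would not be reported when the model is fitted. One way to check is to inspect the fit's return values, sketched here on simulated data. The variable names and the data-generating process below are illustrative assumptions, not the project dataset.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Simulated data (assumption): one binary predictor and a binary outcome
rng = np.random.default_rng(0)
n = 1000
is_mobile = rng.integers(0, 2, size=n)
p = 1 / (1 + np.exp(-(-1.0 + 0.8 * is_mobile)))
sim = pd.DataFrame({
    'generated_revenue': rng.binomial(1, p),
    'is_mobile': is_mobile,
})

# disp=False suppresses the optimiser's printed iteration report
sim_result = smf.logit('generated_revenue ~ is_mobile', data=sim).fit(disp=False)

# mle_retvals holds the optimiser's diagnostics, including a 'converged' flag
converged = sim_result.mle_retvals['converged']
print(f"Converged: {converged}")
```

The same check could be applied to the fitted `result` above. If it reports non-convergence, raising `maxiter` in `fit()` is a common first step.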

In [None]:
# McFadden's pseudo-R²
llf = result.llf           # Log-likelihood of the fitted model
llnull = result.llnull     # Log-likelihood of the null model

pseudo_r2 = 1 - (llf / llnull)
print(f"McFadden's R²: {pseudo_r2:.4f}")
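
statsmodels also exposes this quantity directly as the `prsquared` attribute of a fitted logit result, which can serve as a cross-check on the manual calculation. The sketch below compares the two on simulated data. The data and variable names are assumptions made for illustration.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Simulated data (assumption): one continuous predictor, binary outcome
rng = np.random.default_rng(1)
n = 500
x = rng.normal(size=n)
y = rng.binomial(1, 1 / (1 + np.exp(-(0.5 * x - 0.5))))
demo = pd.DataFrame({'y': y, 'x': x})

demo_result = smf.logit('y ~ x', data=demo).fit(disp=False)

# McFadden's pseudo-R²: 1 - (log-likelihood of fitted model / log-likelihood of null model)
manual_r2 = 1 - demo_result.llf / demo_result.llnull
print(f"Manual:   {manual_r2:.6f}")
print(f"Built-in: {demo_result.prsquared:.6f}")
```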

In [None]:
# Getting coefficient table
summary_table = result.summary2().tables[1]

# Adding significance stars
def significance_stars(p):
    if p < 0.001: return '***'
    elif p < 0.01: return '**'
    elif p < 0.05: return '*'
    elif p < 0.1: return '.'
    else: return ''

summary_table['Significance'] = summary_table['P>|z|'].apply(significance_stars)

# Renaming and selecting columns
export_df = summary_table.reset_index()[['index', 'Coef.', 'Std.Err.', '[0.025', '0.975]', 'Significance']]
export_df.columns = ['Variable', 'Coefficient', 'Std. Error', 'CI Lower', 'CI Upper', 'Significance']

# Export to Excel
export_df.to_excel("path/to/file_location/logit_summary_export.xlsx", index=False)