
# Which user groups perform what action
To find which user segments (e.g., male vs female, age groups, hometowns) are more likely to perform certain actions (like posting content, buying, etc.), you can apply a mix of exploratory data analysis, statistical testing, and predictive modeling.

Below is a comprehensive Python example that walks through these steps using a synthetic dataset. It includes:
* Exploratory analysis (group-by & action rates)
* Statistical significance testing (Chi-square for gender & action)
* Predictive modeling (logistic regression + feature importance)
* Interpretation notes and comments at every step

## Imports & Sample Data Creation

In [33]:
import pandas as pd
import numpy as np
from scipy.stats import chi2_contingency, ttest_ind, fisher_exact
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
import statsmodels.api as sm

# Create a synthetic user dataset with demographics and binary action flag
np.random.seed(42)

n = 1000
df = pd.DataFrame({
    'user_id': range(1, n+1),
    'gender': np.random.choice(['Male', 'Female'], n, p=[0.5, 0.5]),
    'age': np.random.randint(18, 60, n),
    'hometown': np.random.choice(['TownA', 'TownB', 'TownC'], n, p=[0.4, 0.35, 0.25]),
})

# Create a binary outcome 'action_flag' influenced by gender and age
# (e.g., females and younger users slightly more likely to perform the action)
df['action_flag'] = np.random.binomial(
    1,
    p=0.1 + 0.1*(df['gender'] == 'Female') + 0.002*(60 - df['age']),
    size=n
)

# Show head of dataset
df.head()

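|   | user_id | gender | age | hometown | action_flag |
|---|---------|--------|-----|----------|-------------|
| 0 | 1 | Male | 29 | TownB | 0 |
| 1 | 2 | Female | 33 | TownB | 1 |
| 2 | 3 | Female | 41 | TownA | 0 |
| 3 | 4 | Female | 36 | TownA | 0 |
| 4 | 5 | Male | 25 | TownC | 0 |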


## Exploratory Analysis: Action Rates by Segment

In [34]:
# Calculate action rate by gender
action_rate_gender = df.groupby('gender')['action_flag'].mean().reset_index()
print("Action Rate by Gender:\n", action_rate_gender)

# Calculate action rate by age groups (create bins)
df['age_group'] = pd.cut(df['age'], bins=[17, 25, 35, 45, 60], labels=['18-25','26-35','36-45','46-60'])
action_rate_age = df.groupby('age_group', observed=False)['action_flag'].mean().reset_index()
print("\nAction Rate by Age Group:\n", action_rate_age)

# Calculate action rate by hometown
action_rate_home = df.groupby('hometown')['action_flag'].mean().reset_index()
print("\nAction Rate by Hometown:\n", action_rate_home)

Action Rate by Gender:
    gender  action_flag
0  Female     0.265594
1    Male     0.139165

Action Rate by Age Group:
   age_group  action_flag
0     18-25     0.239583
1     26-35     0.219124
2     36-45     0.197248
3     46-60     0.171091

Action Rate by Hometown:
   hometown  action_flag
0    TownA     0.190000
1    TownB     0.213675
2    TownC     0.204819


In [35]:
df.head()

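|   | user_id | gender | age | hometown | action_flag | age_group |
|---|---------|--------|-----|----------|-------------|-----------|
| 0 | 1 | Male | 29 | TownB | 0 | 26-35 |
| 1 | 2 | Female | 33 | TownB | 1 | 26-35 |
| 2 | 3 | Female | 41 | TownA | 0 | 36-45 |
| 3 | 4 | Female | 36 | TownA | 0 | 36-45 |
| 4 | 5 | Male | 25 | TownC | 0 | 18-25 |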


### Interpretation
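This gives a first impression of where action rates differ. Here females act at about 26.6% versus 13.9% for males, a gap large enough to warrant a formal test. Rates decline steadily with age (24.0% for 18-25 down to 17.1% for 46-60), while hometowns differ only slightly (19.0% to 21.4%), which matches how the synthetic data was generated.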
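
Raw rates are easier to judge alongside segment sizes and an uncertainty range. The following is a minimal sketch, not part of the original notebook: it uses a small stand-in DataFrame named `demo` (hypothetical, with the same `gender` and `action_flag` columns) and the `proportion_confint` function from statsmodels to attach a 95% Wilson interval to each rate.

```python
import numpy as np
import pandas as pd
from statsmodels.stats.proportion import proportion_confint

# Hypothetical stand-in for the notebook's df (same column names, fresh random draw)
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "gender": rng.choice(["Male", "Female"], 500),
    "action_flag": rng.binomial(1, 0.2, 500),
})

# Count actions and users per segment, then derive the action rate
summary = demo.groupby("gender")["action_flag"].agg(actions="sum", users="count")
summary["rate"] = summary["actions"] / summary["users"]

# 95% Wilson confidence interval for each segment's rate
low, high = proportion_confint(
    summary["actions"].to_numpy(),
    summary["users"].to_numpy(),
    alpha=0.05,
    method="wilson",
)
summary["ci_low"] = low
summary["ci_high"] = high
print(summary)
```

If the intervals of two segments overlap heavily, the difference in rates may simply be noise; a formal test is still the safer way to decide.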

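## Statistical Significance Testing: Gender vs Action

Since both variables are categorical (gender and binary action), the **Chi-square test of independence is the standard** approach. Note that `chi2_contingency` applies Yates' continuity correction to 2×2 tables by default (`correction=True`).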

In [36]:
# Create contingency table
contingency_table = pd.crosstab(df['gender'], df['action_flag'])
print("\nContingency Table (Gender vs Action):\n", contingency_table)

# Chi-square test of independence
chi2, p_value, dof, expected = chi2_contingency(contingency_table)
print(f"\nChi-square test results:\nChi2 = {chi2:.3f}, p-value = {p_value:.5f}")

if p_value < 0.05:
    print("=> Reject null hypothesis: Gender and action are statistically dependent.")
else:
    print("=> Fail to reject null hypothesis: No significant dependency found.")


Contingency Table (Gender vs Action):
 action_flag    0    1
gender               
Female       365  132
Male         433   70

Chi-square test results:
Chi2 = 24.011, p-value = 0.00000
=> Reject null hypothesis: Gender and action are statistically dependent.


### When to use Chi-square, T-test, or Fisher’s Exact Test

| Situation | Recommended Test | Why? |
|-----------|------------------|------|
| Two categorical variables (large sample) | Chi-square test | Tests independence between categories |
| Categorical grouping variable vs continuous (or interval) outcome | T-test (two groups) / ANOVA (three or more groups) | Compares means between groups |
| Two categorical variables (small sample or low expected counts) | Fisher’s Exact Test | Exact test, valid for small samples |
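
The notebook's imports include `ttest_ind` and `fisher_exact`, although only the Chi-square test is run. As an illustrative sketch of the other two rows in the table, the example below uses made-up numbers (the small 2×2 table and the simulated ages are assumptions, not results from the dataset above). The t-test shown is Welch's variant (`equal_var=False`), which does not assume equal variances.

```python
import numpy as np
import pandas as pd
from scipy.stats import fisher_exact, ttest_ind

# Fisher's exact test: a small 2x2 table with low cell counts (made-up numbers)
small_table = pd.DataFrame(
    [[8, 2], [3, 7]],
    index=["Female", "Male"],
    columns=["acted", "did_not_act"],
)
odds_ratio, p_fisher = fisher_exact(small_table.to_numpy())
print(f"Fisher's exact test: odds ratio = {odds_ratio:.2f}, p-value = {p_fisher:.4f}")

# Welch's t-test: compare the mean age of users who acted vs. those who did not
rng = np.random.default_rng(1)
age_acted = rng.normal(32, 8, 60)        # simulated ages, assumed for illustration
age_not_acted = rng.normal(36, 8, 140)   # simulated ages, assumed for illustration
t_stat, p_t = ttest_ind(age_acted, age_not_acted, equal_var=False)
print(f"Welch's t-test: t = {t_stat:.3f}, p-value = {p_t:.4f}")
```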

#### Side Note: Contingency table
A contingency table summarizes the relationship between categorical variables by **counting how many times each combination of categories occurs**. It is fundamental in categorical data analysis.
* Typically, one categorical variable is displayed along the rows and another along the columns.
* Each cell in the table represents the count (or sometimes proportion) of observations that fall into that combination of categories.


In [37]:
import pandas as pd

# Example data
data = {
    'Gender': ['Male', 'Female', 'Male', 'Female', 'Male'],
    'Fruit': ['Apple', 'Banana', 'Apple', 'Orange', 'Banana']
}

df_cont = pd.DataFrame(data)

# Create a contingency table
contingency = pd.crosstab(df_cont['Gender'], df_cont['Fruit'])
print(contingency)

Fruit   Apple  Banana  Orange
Gender                       
Female      0       1       1
Male        2       1       0


## Predictive Modeling: Logistic Regression
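We model the probability of the action based on gender, age, and hometown. This gives effect sizes and feature importance. Note that the feature matrix below contains both the continuous `age` and the `age_group` dummies derived from it; these are strongly correlated, so their individual coefficients should be read with care.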

In [38]:
# Prepare features - encode categorical variables
df_model = df.copy()
df_model = pd.get_dummies(df_model, columns=['gender', 'hometown', 'age_group'], drop_first=True)

# Define X and y
X = df_model.drop(columns=['user_id', 'action_flag'])
y = df_model['action_flag']

# Add intercept for statsmodels logistic regression
X_sm = sm.add_constant(X)

# Convert all boolean columns to integers
bool_cols = X_sm.select_dtypes(include=['bool']).columns
X_sm[bool_cols] = X_sm[bool_cols].astype(int)

# Fit logistic regression with statsmodels for interpretability
logit_model = sm.Logit(y, X_sm)
result = logit_model.fit()

print(result.summary())

Optimization terminated successfully.
         Current function value: 0.487362
         Iterations 6
                           Logit Regression Results                           
Dep. Variable:            action_flag   No. Observations:                 1000
Model:                          Logit   Df Residuals:                      992
Method:                           MLE   Df Model:                            7
Date:                Sun, 10 Aug 2025   Pseudo R-squ.:                 0.03140
Time:                        13:42:11   Log-Likelihood:                -487.36
converged:                       True   LL-Null:                       -503.16
Covariance Type:            nonrobust   LLR p-value:                 4.816e-05
                      coef    std err          z      P>|z|      [0.025      0.975]
-----------------------------------------------------------------------------------
const              -0.1148      0.603     -0.190      0.849      -1.297       1.068
age           

### Interpretation
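* **Coefficients show how each feature increases or decreases the log odds of performing the action.**
* Because `drop_first=True` drops the first level of each category, Female is the baseline and the model contains `gender_Male`. A negative coefficient for `gender_Male` means males are less likely than females to perform the action, holding the other features fixed.
* P-values show the statistical significance of each predictor.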

## Model Evaluation & Feature Importance (scikit-learn)

In [39]:
# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=42)

# Logistic regression with sklearn
clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, y_train)

# Predict and classification report
y_pred = clf.predict(X_test)
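# zero_division=0 reports undefined precision (a class that is never predicted) as 0.0 and suppresses sklearn's warning
print("\nClassification Report:\n", classification_report(y_test, y_pred, zero_division=0))

# Feature importance (coefficients); features are unscaled, so magnitudes are not directly comparable between age and the dummy variables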
feature_importance = pd.Series(clf.coef_[0], index=X.columns).sort_values(ascending=False)
print("\nFeature Importance (logistic regression coefficients):\n", feature_importance)


Classification Report:
               precision    recall  f1-score   support

           0       0.80      1.00      0.89       200
           1       0.00      0.00      0.00        50

    accuracy                           0.80       250
   macro avg       0.40      0.50      0.44       250
weighted avg       0.64      0.80      0.71       250


Feature Importance (logistic regression coefficients):
 hometown_TownB     0.250804
age_group_46-60    0.179527
hometown_TownC     0.167558
age_group_36-45    0.155321
age_group_26-35    0.129470
age               -0.019596
gender_Male       -0.623933
dtype: float64
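
### Interpretation
In the report above, the classifier never predicts class 1 (recall 0.00), so the 0.80 accuracy only reflects the share of non-actors in the test set. One common remedy, sketched here under assumptions and not taken from the original notebook, is to reweight the classes with `class_weight="balanced"` and to report a threshold-free metric such as ROC AUC. The stand-in data (`X_demo`, `y_demo`) is hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, roc_auc_score
from sklearn.model_selection import train_test_split

# Hypothetical stand-in data with an imbalanced binary outcome
rng = np.random.default_rng(3)
n = 1000
X_demo = pd.DataFrame({
    "age": rng.integers(18, 60, n),
    "gender_Male": rng.integers(0, 2, n),
})
p = 0.1 + 0.1 * (1 - X_demo["gender_Male"]) + 0.002 * (60 - X_demo["age"])
y_demo = rng.binomial(1, p.to_numpy())

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, stratify=y_demo, random_state=42
)

# Balanced class weights penalize errors on the rare class more heavily
balanced = LogisticRegression(max_iter=1000, class_weight="balanced")
balanced.fit(X_tr, y_tr)
pred = balanced.predict(X_te)
print(classification_report(y_te, pred, zero_division=0))

# ROC AUC is computed from predicted probabilities, so it does not depend on the 0.5 cutoff
auc = roc_auc_score(y_te, balanced.predict_proba(X_te)[:, 1])
print(f"ROC AUC: {auc:.3f}")
```

Reweighting usually trades precision for recall on the rare class; whether that trade is worthwhile depends on the cost of missing a likely actor versus wrongly flagging a non-actor.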


# Advanced Deep Dive

## Interaction Terms (Gender × Age)

Adding interaction terms lets you see whether the effect of one feature depends on another (e.g., does the effect of gender vary with age?).

How to do it:
* In logistic regression, add a new feature that is the product of the gender indicator and age (with the encoding used above, `gender_Male` × `age`), or of the gender indicator and each `age_group` dummy.
* Check whether the interaction coefficient is significant.