# Gamma Regression Model 

In [1]:
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf
import matplotlib.pyplot as plt
import seaborn as sns
import bambi as bmb



In [3]:
data = bmb.load_data("carclaims")

In [4]:
# Filter for claims where claimcst0 > 0
claims_data = data[data["claimcst0"] > 0].copy()

In [5]:
claims_data['gender'] = claims_data['gender'].astype('category')
claims_data['area'] = claims_data['area'].astype('category')
claims_data['agecat'] = claims_data['agecat'].astype('category')

In [6]:
formula = "claimcst0 ~ veh_value + veh_age + C(gender) + C(area) + C(agecat)"

In [7]:
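# Gamma family with a log link; `links.Log` is the current class name
# (the lowercase `links.log` alias is deprecated in recent statsmodels).
model = smf.glm(
    formula=formula,
    data=claims_data,
    family=sm.families.Gamma(link=sm.families.links.Log()),
)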

# Fit the model
results = model.fit()



In [8]:
print(results.summary())

                 Generalized Linear Model Regression Results                  
Dep. Variable:              claimcst0   No. Observations:                 4624
Model:                            GLM   Df Residuals:                     4610
Model Family:                   Gamma   Df Model:                           13
Link Function:                    log   Scale:                          2.9597
Method:                          IRLS   Log-Likelihood:                -40459.
Date:                Wed, 23 Jul 2025   Deviance:                       7223.5
Time:                        02:20:37   Pearson chi2:                 1.36e+04
No. Iterations:                    20   Pseudo R-squ. (CS):            0.01136
Covariance Type:            nonrobust                                         
                     coef    std err          z      P>|z|      [0.025      0.975]
----------------------------------------------------------------------------------
Intercept          7.6001      0.138     55.

The gamma regression model was fitted to the Australian insurance claims dataset, restricted to policies with a positive claim amount (`claimcst0 > 0`). The model uses a log link, a common choice for gamma regression (the canonical link is the inverse), so each predictor acts multiplicatively on the expected claim amount.
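To make the multiplicative interpretation explicit, the log link can be written out as below. The symbols $\mu_i$ (expected claim amount for policy $i$), $x_{ij}$ (predictor values), and $\beta_j$ (coefficients) are notation introduced here for illustration.

$$
\log \mu_i = \beta_0 + \sum_j \beta_j x_{ij}
\quad\Longrightarrow\quad
\mu_i = e^{\beta_0} \prod_j e^{\beta_j x_{ij}}
$$

A one-unit increase in $x_{ij}$, holding the other predictors fixed, therefore multiplies $\mu_i$ by $e^{\beta_j}$. This is why the coefficients are exponentiated when they are interpreted.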

*   **Intercept (7.6001):** This is the log of the expected claim amount when all categorical predictors are at their reference levels (likely female gender, Area A, and age category 1) and `veh_value` and `veh_age` are zero. Exponentiating gives the expected claim amount on the original scale: `exp(7.6001)` ≈ 1,998.

*   **C(gender)[T.M] (0.1620):** This coefficient is positive and statistically significant (p-value = 0.002). Holding all other variables constant, the expected claim amount for male policyholders is `exp(0.1620)` ≈ 1.18 times that of the reference gender (likely female), i.e., roughly 18% higher.

*   **C(area):** The coefficients for areas B, C, D, E, and F give the difference in the log of the expected claim amount relative to the reference area (likely Area A). Only `C(area)[T.F]` (0.3701) is statistically significant (p-value = 0.002): all else being equal, the expected claim amount in Area F is `exp(0.3701)` ≈ 1.45 times that of the reference area.

*   **C(agecat):** The coefficients for age categories 2 through 6 are all negative and statistically significant (p-values < 0.05). Holding other variables constant, policyholders in each of these categories have lower expected claim amounts than those in the reference category (likely age category 1). Each coefficient is a contrast with the reference category only, so the results show that categories 2-6 cost less than category 1, not necessarily that cost falls steadily with age.

*   **veh_value (-0.0018):** This coefficient is very close to zero and not statistically significant (p-value = 0.947). Within this model, vehicle value shows no detectable association with the log of the expected claim amount once the other predictors are accounted for.

*   **veh_age (0.0512):** This coefficient is positive but significant only at the 10% level (p-value = 0.080). It corresponds to a factor of `exp(0.0512)` ≈ 1.05 in the expected claim amount per one-unit increase in `veh_age`. This suggests a weak positive relationship, and the evidence is less robust than for gender and age category.
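The conversions from log-scale coefficients to multiplicative effects used in the list above can be sketched as follows. The coefficient values are copied by hand from the discussion; the dictionary name `coefs` and the helper `multiplicative_effect` are hypothetical names chosen for this illustration, not part of the fitted model.

```python
import math

# Log-scale coefficients quoted in the interpretation above (copied by hand).
coefs = {
    "Intercept": 7.6001,
    "C(gender)[T.M]": 0.1620,
    "C(area)[T.F]": 0.3701,
    "veh_value": -0.0018,
    "veh_age": 0.0512,
}


def multiplicative_effect(beta):
    """Factor applied to the expected claim amount under a log link."""
    return math.exp(beta)


effects = {name: multiplicative_effect(beta) for name, beta in coefs.items()}

for name, ratio in effects.items():
    if name == "Intercept":
        # exp(intercept) is the expected claim at the reference levels.
        print(f"{name}: baseline expected claim of about {ratio:,.0f}")
    else:
        # Other terms are ratios relative to the reference level or per unit.
        print(f"{name}: x{ratio:.3f} ({(ratio - 1) * 100:+.1f}%)")
```

With the fitted `results` object from the cells above, `np.exp(results.params)` and `np.exp(results.conf_int())` (after `import numpy as np`) should give the same ratios for every term, along with their confidence intervals.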


## Conclusion

The gamma regression model identifies gender, one geographical area, and age category as statistically significant predictors of insurance claim amounts. Male policyholders and those in Area F are associated with higher claim costs, while policyholders in age categories 2-6 are associated with lower claim costs than those in age category 1. Vehicle value is not a significant predictor, and vehicle age has only a weak positive association. The low pseudo R-squared (about 0.011) and the large estimated dispersion (scale of about 2.96) indicate that, although the model identifies some significant relationships, most of the variability in claim amounts remains unexplained. Additional predictors or a more flexible model structure might improve predictive power.
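The variability noted in the conclusion can be checked from the figures printed in the summary table. This is a minimal sketch: it assumes the reported scale is the Pearson chi-square divided by the residual degrees of freedom, and it uses the rounded Pearson value from the summary, so the result only approximates the printed scale of 2.9597. The variable names are chosen for this illustration.

```python
import math

# Values read from the printed summary (the Pearson chi2 is rounded there).
pearson_chi2 = 1.36e4
df_resid = 4610

# Pearson-based dispersion estimate for the gamma GLM.
dispersion = pearson_chi2 / df_resid

# For a gamma GLM, Var(Y) = dispersion * mu**2, so the implied
# coefficient of variation of a claim around its mean is sqrt(dispersion).
coef_of_variation = math.sqrt(dispersion)

print(f"dispersion ~ {dispersion:.2f}")
print(f"implied coefficient of variation ~ {coef_of_variation:.2f}")
```

A coefficient of variation well above 1 means individual claims scatter widely around their predicted mean, which is consistent with the low pseudo R-squared.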