# Fitting a linear model predicting rings using only the variables: Sex (as a factor) and Whole Weight 

## Preparing the Data


In [3]:
import pandas as pd

data = pd.read_csv('../data/processed/abalone.csv')



In [4]:
#Converting Sex to Categorical Variable
data['Sex'] = data['Sex'].map({0 : 'M', 1 : 'F', 2 : 'I'})


data['Sex'] = data['Sex'].astype('category')
print(data['Sex'].unique())

['M', 'F', 'I']
Categories (3, object): ['F', 'I', 'M']
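
The category order printed above matters for the model that follows: the levels are sorted, so `F` comes first and acts as the reference level under treatment (dummy) coding. The sketch below mimics that encoding in plain Python. The helper `treatment_code` is a hypothetical name used only for illustration and is not part of pandas or statsmodels.

```python
# Illustrative only: mimic treatment (dummy) coding with the first
# sorted level as the reference. `treatment_code` is a hypothetical helper.

def treatment_code(values):
    """Return (levels, rows); each row has one 0/1 indicator per non-reference level."""
    levels = sorted(set(values))      # ['F', 'I', 'M'] for the abalone Sex column
    non_reference = levels[1:]        # the first level is absorbed by the intercept
    rows = [[int(v == level) for level in non_reference] for v in values]
    return levels, rows

levels, rows = treatment_code(['M', 'F', 'I'])
print(levels)   # ['F', 'I', 'M']
print(rows)     # [[0, 1], [0, 0], [1, 0]] -> columns correspond to [T.I] and [T.M]
```

A female row is all zeros, which is why the `Intercept` and the plain `Whole weight` slope in the summary describe females.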


## Fitting the Linear Model

In [5]:
# Fitting the Linear Model

import statsmodels.formula.api as smf
model_interact = smf.ols(formula='Rings ~ C(Sex) * Q("Whole weight")', data=data).fit()

print(model_interact.summary())

                            OLS Regression Results                            
Dep. Variable:                  Rings   R-squared:                       0.352
Model:                            OLS   Adj. R-squared:                  0.351
Method:                 Least Squares   F-statistic:                     452.8
Date:                Tue, 01 Apr 2025   Prob (F-statistic):               0.00
Time:                        17:08:12   Log-Likelihood:                -9895.5
No. Observations:                4172   AIC:                         1.980e+04
Df Residuals:                    4166   BIC:                         1.984e+04
Df Model:                           5                                         
Covariance Type:            nonrobust                                         
                                    coef    std err          t      P>|t|      [0.025      0.975]
-------------------------------------------------------------------------------------------------
Intercept     

## Comments

### Key Statistics
- R-squared is 0.352, meaning the model explains about 35.2% of the variation in Rings.
- The F-statistic is 452.8 with p < 0.001, indicating that the model as a whole is statistically significant.
- Kurtosis of 6.92 and skew of 1.62 show that the residuals are not perfectly normal, but with a sample this large that is not too critical.
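
The reported F-statistic and adjusted R-squared can be cross-checked from R-squared, the number of observations, and the model degrees of freedom. This is a minimal sketch, assuming the standard OLS formulas and using the rounded values printed in the summary, so small discrepancies from the reported figures are expected.

```python
# Recompute summary statistics from the rounded values printed in the summary.
n = 4172      # No. Observations
p = 5         # Df Model (terms excluding the intercept)
r2 = 0.352    # R-squared, rounded to three decimals in the summary

df_resid = n - p - 1                          # residual degrees of freedom
adj_r2 = 1 - (1 - r2) * (n - 1) / df_resid    # penalises extra predictors
f_stat = (r2 / p) / ((1 - r2) / df_resid)     # overall F-test of the model

print(df_resid)            # 4166, matches Df Residuals
print(round(adj_r2, 3))    # 0.351, matches Adj. R-squared
print(round(f_stat, 1))    # near the reported 452.8; r2 is rounded, so not exact
```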

### Coefficients
Baseline (`Females`)
- `Intercept` = `9.1216`
- `Slope` = `1.9201`

`Infants` (I)
- `Intercept` = Baseline + Baseline Shift (C(Sex)[T.I]) = 9.1216 - 3.8558 = `5.2658`
- `Slope` = Baseline Slope + (C(Sex)[T.I]:Whole weight = 4.1720) = 1.9201 + 4.1720 = `6.0921`
    Infants gain about 6.09 rings for each 1-unit increase in whole weight, significantly higher than females

`Males` (M)
- `Intercept` = Baseline + Baseline Shift (C(Sex)[T.M]) = 9.1216 - 0.7999 = `8.3217`
- `Slope` = Baseline Slope + (C(Sex)[T.M]:Whole weight = 0.4866) = 1.9201 + 0.4866 = `2.4067`
    Males gain about 2.41 rings for each 1-unit increase in whole weight, significantly higher than females but far less steep than infants
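
The arithmetic above can be reproduced in a few lines. This is a sketch that hard-codes the coefficient values quoted in this section rather than reading them from the fitted model. The dictionary keys and the helper `group_line` are hypothetical names chosen for illustration.

```python
# Coefficients as quoted in the text (female is the reference level).
coef = {
    "intercept": 9.1216,
    "slope": 1.9201,
    "shift_I": -3.8558, "slope_shift_I": 4.1720,
    "shift_M": -0.7999, "slope_shift_M": 0.4866,
}

def group_line(sex):
    """Return (intercept, slope) of the fitted line for sex in {'F', 'I', 'M'}."""
    if sex == "F":
        return coef["intercept"], coef["slope"]
    return (coef["intercept"] + coef[f"shift_{sex}"],
            coef["slope"] + coef[f"slope_shift_{sex}"])

for sex in ("F", "I", "M"):
    intercept, slope = group_line(sex)
    print(sex, round(intercept, 4), round(slope, 4))
# F 9.1216 1.9201
# I 5.2658 6.0921
# M 8.3217 2.4067
```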


### Significance
- The p-values of all terms (main effects and interactions) are below 0.05, meaning that they are statistically significant

### Interpretation
- Sex matters: both the intercepts and the slopes differ significantly across the sex categories.
- Whole weight is a strong predictor: higher whole weight generally corresponds to more rings, but the effect size varies across sex categories.
- Rings increase fastest for Infants, whose slope is much steeper than that of Males or Females.


### Conclusion
Overall, this regression model explains about 35% of the variation in Rings, indicating that both Sex and Whole weight, along with their interaction, play a significant role in predicting abalone age. Infants begin with a lower baseline number of Rings but show a steep increase in Rings as Whole weight grows, while Males have a moderate slope and Females have the lowest slope but the highest baseline. All coefficients are statistically significant, showing the importance of including Sex and its interaction with Whole weight when modeling Rings.
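
Since infants start lower but climb faster, their fitted line must cross the female line at some whole weight. The sketch below solves for that crossing using the group intercepts and slopes from the Coefficients section. The helper names are hypothetical, and whether the crossing weights fall inside the range actually observed for each group would need to be checked against the data.

```python
# Group-specific (intercept, slope) pairs from the Coefficients section.
lines = {"F": (9.1216, 1.9201), "I": (5.2658, 6.0921), "M": (8.3217, 2.4067)}

def predict(sex, weight):
    """Predicted Rings for a given sex and whole weight."""
    intercept, slope = lines[sex]
    return intercept + slope * weight

def crossing_weight(a, b):
    """Whole weight at which groups a and b have equal predicted Rings."""
    (ia, sa), (ib, sb) = lines[a], lines[b]
    return (ia - ib) / (sb - sa)

w_if = crossing_weight("F", "I")
w_mf = crossing_weight("F", "M")
print(round(w_if, 3))   # 0.924
print(round(w_mf, 3))   # 1.644
```

Below these weights females have the higher predicted Rings. Above them the steeper infant and male lines overtake the female line.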

# Fitting a linear model with all the original variables (without transforming predictors and without interactions)

In [6]:
import statsmodels.formula.api as smf


#Our Predictors are Sex, Length, Diameter, Height, Whole weight, Shucked weight, Viscera weight and Shell weight
# Fit the full model using all original predictors
model_full = smf.ols(
    formula='Rings ~ C(Sex) + Length + Diameter + Height + Q("Whole weight") + Q("Shucked weight") + Q("Viscera weight") + Q("Shell weight")',
    data=data
).fit()

# Display the model summary
print(model_full.summary())

                            OLS Regression Results                            
Dep. Variable:                  Rings   R-squared:                       0.543
Model:                            OLS   Adj. R-squared:                  0.542
Method:                 Least Squares   F-statistic:                     549.8
Date:                Tue, 01 Apr 2025   Prob (F-statistic):               0.00
Time:                        17:08:12   Log-Likelihood:                -9166.7
No. Observations:                4172   AIC:                         1.835e+04
Df Residuals:                    4162   BIC:                         1.842e+04
Df Model:                           9                                         
Covariance Type:            nonrobust                                         
                          coef    std err          t      P>|t|      [0.025      0.975]
---------------------------------------------------------------------------------------
Intercept               3.5967    

### Key Statistics
- `R-squared` is 0.543, meaning the model explains about 54.3% of the variation in Rings; `Adjusted R-squared` is 0.542, slightly lower because of the penalty for multiple predictors.
- The `F-statistic` is 549.8 with a reported p-value of 0.00, indicating that the model as a whole is statistically significant.
- `Kurtosis` of 6.03 and `Skew` of 1.192 show that the residuals are not perfectly normal; however, with this large sample size that is not too critical.
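
To put the skew and kurtosis in context, they can be combined into a Jarque-Bera statistic. The sketch below assumes the reported kurtosis is the non-excess kind (a normal distribution scores 3) and uses the rounded values quoted above. The summary's own Jarque-Bera figure is not shown in the excerpt, so this is only a reconstruction.

```python
# Jarque-Bera statistic from the rounded skew and kurtosis quoted above.
# Assumption: kurtosis is non-excess (Pearson), so normal residuals give 3.
n = 4172
skew = 1.192
kurtosis = 6.03

jb = n / 6 * (skew ** 2 + (kurtosis - 3) ** 2 / 4)
print(round(jb))   # 2584
```

Under normality this statistic follows a chi-squared distribution with 2 degrees of freedom, whose 5% critical value is about 5.99. The departure from normality is therefore clear, even if the large sample keeps coefficient inference reasonably robust.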

### Coefficient Observations
- `Length` has a negative coefficient, likely due to its multicollinearity with `Diameter` (found earlier in the analysis).
- The `C(Sex)[T.M]` coefficient is very small: `Males` are predicted to have only 0.0554 more rings than `Females`, holding the other predictors fixed.
- The p-values for `Length` and `C(Sex)[T.M]` are high (p > 0.05) and their `t` statistics are close to 0, so neither term is statistically significant.
- `Shucked Weight` and `Viscera Weight` have negative coefficients while `Whole Weight` has a positive one; this is likely due to multicollinearity (found earlier in the analysis).
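
One common way to quantify multicollinearity is the variance inflation factor (VIF). The sketch below uses made-up length and diameter values, not rows from `abalone.csv`, and covers only the two-predictor case, where each predictor's VIF reduces to 1 / (1 - r^2) with r the Pearson correlation. The helper `pearson_r` is written out by hand for illustration.

```python
# Hypothetical measurements, NOT taken from abalone.csv:
# two nearly proportional predictors.
length   = [0.35, 0.44, 0.53, 0.58, 0.62, 0.70]
diameter = [0.27, 0.35, 0.42, 0.45, 0.49, 0.55]

def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

r = pearson_r(length, diameter)
vif = 1 / (1 - r ** 2)   # two-predictor case only
print(round(r, 3), round(vif, 1))
```

A VIF well above the rule-of-thumb range of 5 to 10 means the coefficient's variance is heavily inflated. That is consistent with signs and magnitudes that are hard to interpret individually.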

### Conclusion

The multiple regression model fits the data reasonably well, explaining around 54% of the variation in abalone age. Key predictors such as `Height`, `Diameter`, and `Whole Weight` showed strong relationships with age, in line with the findings from earlier exploration. However, not all coefficients were intuitive. Some predictors, such as `Length`, `Shucked Weight`, and `Viscera Weight`, had **negative or unexpected coefficients**. This is likely due to **multicollinearity**: the predictors are highly correlated with each other, which makes it **difficult for the model to accurately isolate their individual effects**. We also found that `Males` had only 0.0554 more rings than `Females`, a difference that is not statistically significant. This differs markedly from the earlier model (*Fitting Model using `Sex` and `Whole Weight`*), where we found significant differences in slopes and intercepts. It suggests that when other physical measurements are included (e.g. Length, Diameter), **sex of the abalone adds little additional predictive value**, and `Males` and `Females` **would have similar predicted rings**.
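
The two fits can also be compared with the information criteria printed in their summaries. The sketch below recomputes AIC and BIC from the reported log-likelihoods, assuming the parameter count is Df Model plus one for the intercept. The dictionary labels are hypothetical names for the two models.

```python
import math

n = 4172  # observations used in both fits

# (log-likelihood, parameter count = Df Model + intercept), read off the two summaries
models = {
    "sex_x_weight": (-9895.5, 6),
    "full":         (-9166.7, 10),
}

def aic(loglik, k):
    """Akaike information criterion: lower is better."""
    return 2 * k - 2 * loglik

def bic(loglik, k):
    """Bayesian information criterion: penalises parameters more heavily."""
    return k * math.log(n) - 2 * loglik

for name, (loglik, k) in models.items():
    print(name, round(aic(loglik, k), 1), round(bic(loglik, k), 1))
# AIC values: 19803.0 (sex_x_weight) and 18353.4 (full)
```

These agree with the rounded `1.980e+04` and `1.835e+04` in the summaries. The full model has the lower value on both criteria despite its four extra parameters.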

