# Multinomial Regression

In [1]:
import numpy as np
import pandas as pd
import statsmodels.api as sm

## Loading Data

In this activity, we will work with the abalone dataset, which is read from a local CSV file (`abalone.csv`) using `pandas`.

Once the dataset is loaded, we inspect the first few rows of the resulting DataFrame, `data`. We will then split the data into our independent variables (X) and dependent variable (y).

In [2]:

data = pd.read_csv("abalone.csv")
data.head()

| |Sex|Length|Diameter|Height|Whole_weight|Shucked_weight|Viscera_weight|Shell_weight|Rings|
| ------------- |-------------|-------------|-------------|-------------|-------------|-------------|-------------|-------------|-------------|
|0|M|0.455|0.365|0.095|0.514|0.2245|0.101|0.15|15|
|1|M|0.35|0.265|0.09|0.2255|0.0995|0.0485|0.07|7|
|2|F|0.53|0.42|0.135|0.677|0.2565|0.1415|0.21|9|
|3|M|0.44|0.365|0.125|0.516|0.2155|0.114|0.155|10|
|4|I|0.33|0.255|0.08|0.205|0.0895|0.0395|0.055|7|

## Fitting a Multinomial Regression Model

In [3]:
# Predictors: every column except the target, 'Sex'
X = data.drop(columns=['Sex'])
# Target: the three-level categorical variable (F, I, M)
y = data['Sex']

In [4]:
# add_constant adds an intercept column named 'const' to X
mn = sm.MNLogit(y, sm.add_constant(X))

We print the shape of X, and inspect the top 5 rows.

In [6]:
print(X.shape)
X.head(5)

(4177, 8)


| |Length|Diameter|Height|Whole_weight|Shucked_weight|Viscera_weight|Shell_weight|Rings|
| ------------- |-------------|-------------|-------------|-------------|-------------|-------------|-------------|-------------|
|0|0.455|0.365|0.095|0.514|0.2245|0.101|0.15|15|
|1|0.35|0.265|0.09|0.2255|0.0995|0.0485|0.07|7|
|2|0.53|0.42|0.135|0.677|0.2565|0.1415|0.21|9|
|3|0.44|0.365|0.125|0.516|0.2155|0.114|0.155|10|
|4|0.33|0.255|0.08|0.205|0.0895|0.0395|0.055|7|


In [7]:
model = mn.fit()
print_model = model.summary()
print(print_model)

Optimization terminated successfully.
         Current function value: 0.854590
         Iterations 8
                          MNLogit Regression Results                          
Dep. Variable:                    Sex   No. Observations:                 4177
Model:                        MNLogit   Df Residuals:                     4159
Method:                           MLE   Df Model:                           16
Date:                Sun, 07 Apr 2024   Pseudo R-squ.:                  0.2204
Time:                        02:05:21   Log-Likelihood:                -3569.6
converged:                       True   LL-Null:                       -4578.9
Covariance Type:            nonrobust   LLR p-value:                     0.000
         Sex=I       coef    std err          z      P>|z|      [0.025      0.975]
----------------------------------------------------------------------------------
const              2.8410      0.505      5.627      0.000       1.851       3.831
Length           

The statsmodels output contains only K-1 sets of coefficients (two here, since `Sex` has three levels), each estimated relative to a reference group. In the abalone example, the reference group is female. Each coefficient describes the effect of a predictor on the log of the ratio between two probabilities: the probability of belonging to the group of interest and the probability of belonging to the reference group. The first set of coefficients, marked Sex=I, therefore models infant versus female; the second set, marked as Male, models male versus female.
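Written out, the two sets of coefficients correspond to the log-odds equations below. The symbols $\beta_{I,j}$ and $\beta_{M,j}$ are notation introduced here for the coefficients in the infant and male blocks, and $x_1, \dots, x_8$ stand for the eight predictors (Length through Rings).

$$
\ln\frac{P(\text{Sex}=I \mid x)}{P(\text{Sex}=F \mid x)} = \beta_{I,0} + \sum_{j=1}^{8} \beta_{I,j}\,x_j,
\qquad
\ln\frac{P(\text{Sex}=M \mid x)}{P(\text{Sex}=F \mid x)} = \beta_{M,0} + \sum_{j=1}^{8} \beta_{M,j}\,x_j
$$

Because the three class probabilities sum to one, these two equations are enough to recover all three of them.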

## Accessing Model Parameters

In statsmodels, the `fit()` method returns a results object. The model coefficients, standard errors, p-values, etc., are all available from this object.

Conveniently, these are stored as pandas DataFrames with the parameter name as the index.

In [8]:
model.params

| |0|1|
| ------------- |-------------|-------------|
|const|2.840995|2.521223|
|Length|17.681684|-1.015661|
|Diameter|-13.048868|-4.962977|
|Height|-8.107566|-3.176845|
|Whole_weight|-6.399546|-0.131395|
|Shucked_weight|5.246396|3.048409|
|Viscera_weight|-13.217943|-2.166477|
|Shell_weight|5.595137|0.474669|
|Rings|-0.196681|0.005943|
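To make the two coefficient columns concrete, the sketch below turns them into class probabilities for the first abalone row. It is a hand calculation under stated assumptions, not part of the original notebook: the coefficients are copied as displayed above, column 0 is taken to be the infant (I) set and column 1 the male (M) set, following the order described for the summary, and the reference class (F) is given a log odds of zero.

```python
import numpy as np

# Coefficients copied from model.params above (as displayed, so rounded).
# Assumption: column 0 is Sex=I and column 1 is Sex=M.
params = np.array([
    [2.840995, 2.521223],      # const
    [17.681684, -1.015661],    # Length
    [-13.048868, -4.962977],   # Diameter
    [-8.107566, -3.176845],    # Height
    [-6.399546, -0.131395],    # Whole_weight
    [5.246396, 3.048409],      # Shucked_weight
    [-13.217943, -2.166477],   # Viscera_weight
    [5.595137, 0.474669],      # Shell_weight
    [-0.196681, 0.005943],     # Rings
])

# First row of X, with a leading 1.0 for the intercept.
x0 = np.array([1.0, 0.455, 0.365, 0.095, 0.514, 0.2245, 0.101, 0.15, 15.0])

# Log odds of I vs. F and of M vs. F for this row.
log_odds = x0 @ params

# The reference class F has log odds 0 relative to itself.
scores = np.concatenate(([0.0], log_odds))

# Normalising the exponentiated scores gives probabilities that sum to 1.
probs = np.exp(scores) / np.exp(scores).sum()

labels = ["F", "I", "M"]
for label, p in zip(labels, probs):
    print(f"P(Sex={label}) = {p:.3f}")

predicted = labels[int(np.argmax(probs))]
print("Predicted class:", predicted)
```

On the fitted model itself, `model.predict()` should return the same kind of per-class probabilities for every row, without the rounding introduced by copying coefficients by hand.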


Here are some of the relevant attributes and methods of the results object for a logistic regression.


|Attr/func|Description|
| ------------- |-------------|
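|params|Estimated model parameters. Appears as coef when calling summary() on a fitted model|
|bse|Standard errors of the estimated parameters. Appears as std err in summary()|
|tvalues|Each coefficient's test statistic (coefficient divided by its standard error). Appears as z in summary() for this model|
|pvalues|Each coefficient's two-sided p-value. Appears as P>\|z\| in summary()|
|conf_int(alpha)|Method that calculates the confidence interval for the estimated parameters. To call: model.conf_int(0.05) for a 95% interval|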



## Evaluating Multinomial Regression Model

There are two ways to evaluate the model:

1. Examine the model output
2. Use the `.pred_table()` method

In [9]:
model.pred_table()

array([[ 449.,  215.,  643.],
       [  69., 1108.,  165.],
       [ 385.,  351.,  792.]])
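The array returned by `pred_table()` is a confusion matrix: per the statsmodels documentation, entry `[i, j]` counts the observations of class `i` that the model predicted as class `j`. As a rough sketch, the counts printed above can be summarised into an overall accuracy and per-class rates. The label order F, I, M is an assumption here (alphabetical, with F as the reference group); it is not shown in the output itself.

```python
import numpy as np

# Counts copied from the pred_table() output above.
# Rows are observed classes, columns are predicted classes.
table = np.array([
    [449.0, 215.0, 643.0],
    [69.0, 1108.0, 165.0],
    [385.0, 351.0, 792.0],
])

# Assumed class order (not printed by pred_table itself).
labels = ["F", "I", "M"]

# Overall accuracy: correctly classified (diagonal) over all observations.
accuracy = np.trace(table) / table.sum()

# Recall: share of each observed class that was predicted correctly.
recall = np.diag(table) / table.sum(axis=1)

# Precision: share of each predicted class that was actually that class.
precision = np.diag(table) / table.sum(axis=0)

print(f"accuracy = {accuracy:.3f}")
for label, r, pr in zip(labels, recall, precision):
    print(f"{label}: recall = {r:.3f}, precision = {pr:.3f}")
```

Under that assumed ordering, the middle class (infant) is recovered far more reliably than the other two, which the model frequently confuses with each other.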