## Logistic Regression Example: The Donner Party

### Background 
- **Source:** Chapter 9, Section 5 of OpenIntro Statistics.
- In 1846, the Donner family left Springfield, Illinois, for California.
- The group became stranded in the eastern Sierra Nevada mountains when the region was hit by heavy snows in late October.
- By the time the last survivor was rescued, 40 of the 87 members had died from famine and exposure to extreme cold.
- How can we predict the probability of survival from the `age` and `gender` variables using a logistic regression model?
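
As a sketch of what the question above asks for, a two-predictor logistic regression model can be written as follows. The symbols are assumptions introduced here for illustration: $p_i$ is the probability of the modelled outcome for person $i$, and $\text{male}_i$ is a 0/1 indicator that equals 1 when `gender` is `male` (the actual reference level depends on how the software codes the variable).

$$
\log\left(\frac{p_i}{1 - p_i}\right) = \beta_0 + \beta_1\,\text{age}_i + \beta_2\,\text{male}_i
\qquad\Longleftrightarrow\qquad
p_i = \frac{1}{1 + e^{-(\beta_0 + \beta_1\,\text{age}_i + \beta_2\,\text{male}_i)}}
$$

The left-hand form says the log-odds are linear in the predictors; the right-hand form is the same model solved for the probability.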

In [1]:
import warnings
warnings.filterwarnings("ignore")

import numpy as np
import pandas as pd
import io
import requests

# so that we can see all the columns
pd.set_option('display.max_columns', None) 

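# Workaround for environments where HTTPS certificate verification fails:
# fall back to an unverified SSL context for urllib-based downloads.
import os, ssl
if (not os.environ.get('PYTHONHTTPSVERIFY', '') and
    getattr(ssl, '_create_unverified_context', None)): 
    ssl._create_default_https_context = ssl._create_unverified_context

# download the CSV file and load it into a data frame
df_url = 'https://raw.githubusercontent.com/vaksakalli/datasets/master/donner_party.csv'
response = requests.get(df_url)
response.raise_for_status()  # stop early if the download failed
df = pd.read_csv(io.StringIO(response.content.decode('utf-8')))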

In [2]:
print(f"df shape: {df.shape}")
df.sample(10, random_state=999)

df shape: (45, 3)


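|    | age | gender | status   |
|----|-----|--------|----------|
| 20 | 20  | male   | died     |
| 10 | 25  | male   | survived |
| 35 | 32  | male   | died     |
| 1  | 40  | male   | died     |
| 3  | 30  | female | survived |
| 24 | 25  | female | died     |
| 12 | 28  | female | survived |
| 15 | 23  | male   | died     |
| 42 | 23  | female | died     |
| 38 | 25  | female | survived |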


In [3]:
df.to_csv('donner_party.csv', index=False)
df.isna().sum()

age       0
gender    0
status    0
dtype: int64

In [4]:
df.dtypes

age        int64
gender    object
status    object
dtype: object

In [5]:
categoricalColumns = df.columns[df.dtypes==object].tolist()

for col in categoricalColumns:
    print('Unique values and counts for ' + col)
    print(df[col].unique())
    print(df[col].value_counts())
    print('')

Unique values and counts for gender
['female' 'male']
female    30
male      15
Name: gender, dtype: int64

Unique values and counts for status
['survived' 'died']
survived    25
died        20
Name: status, dtype: int64



### Logistic Regression with One Variable

Let's fit a *logistic regression model* to the data using only the `age` variable.

In [6]:
import statsmodels.api as sm
import statsmodels.formula.api as smf

# age-only model; the Binomial family uses the logit link by default
model_age = smf.glm(formula='status ~ age', 
                    data=df, 
                    family=sm.families.Binomial())

model_age_fitted = model_age.fit()
print(model_age_fitted.summary())

                          Generalized Linear Model Regression Results                           
Dep. Variable:     ['status[died]', 'status[survived]']   No. Observations:                   45
Model:                                              GLM   Df Residuals:                       43
Model Family:                                  Binomial   Df Model:                            1
Link Function:                                    logit   Scale:                          1.0000
Method:                                            IRLS   Log-Likelihood:                -28.145
Date:                                  Wed, 25 Aug 2021   Deviance:                       56.291
Time:                                          17:00:39   Pearson chi2:                     43.2
No. Iterations:                                       4                                         
Covariance Type:                              nonrobust                                         
                 coef    std e

### Logistic Regression with Two Variables

Let's fit a *logistic regression model* to the data using both the `age` and `gender` variables.

In [7]:
import statsmodels.api as sm
import statsmodels.formula.api as smf

model_full = smf.glm(formula='status ~ age + gender', 
                     data=df, 
                     family=sm.families.Binomial())

model_full_fitted = model_full.fit()
print(model_full_fitted.summary())

                          Generalized Linear Model Regression Results                           
Dep. Variable:     ['status[died]', 'status[survived]']   No. Observations:                   45
Model:                                              GLM   Df Residuals:                       42
Model Family:                                  Binomial   Df Model:                            2
Link Function:                                    logit   Scale:                          1.0000
Method:                                            IRLS   Log-Likelihood:                -25.628
Date:                                  Wed, 25 Aug 2021   Deviance:                       51.256
Time:                                          17:00:39   Pearson chi2:                     44.4
No. Iterations:                                       5                                         
Covariance Type:                              nonrobust                                         
                     coef    s
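
Once the coefficients of a two-variable model are known, a predicted probability follows from the inverse logit of the linear predictor. The sketch below shows the arithmetic only. The coefficient values (`intercept`, `beta_age`, `beta_male`) are made-up placeholders, not the fitted values, since the coefficient table above is cut off; the helper names are also hypothetical.

```python
import math

def inverse_logit(log_odds):
    """Map a log-odds value to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-log_odds))

# Hypothetical coefficients for illustration only (NOT the fitted values)
intercept = 1.5
beta_age = -0.05
beta_male = -1.0

def predicted_probability(age, is_male):
    """Probability of the modelled event for a given age and gender."""
    log_odds = intercept + beta_age * age + beta_male * (1 if is_male else 0)
    return inverse_logit(log_odds)

p_female_30 = predicted_probability(30, is_male=False)
p_male_30 = predicted_probability(30, is_male=True)

# exp(coefficient) is the multiplicative change in the odds
odds_ratio_male = math.exp(beta_male)

print(f"P(event | age 30, female) = {p_female_30:.3f}")
print(f"P(event | age 30, male)   = {p_male_30:.3f}")
print(f"odds ratio, male vs. female = {odds_ratio_male:.3f}")
```

Which event the probability refers to (survival or death) depends on how the response was coded when the model was fitted.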

### Analysis Results

Detailed analysis results for the two models above can be found in Chapter 9, Section 5 of our textbook, OpenIntro Statistics.
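
As a rough supplementary check, the two models can be compared with a drop-in-deviance (likelihood-ratio) test using the deviances reported in the summaries above (56.291 for `status ~ age`, 51.256 for `status ~ age + gender`). This is a sketch that assumes the usual chi-square approximation with one degree of freedom, which may be coarse for 45 observations; it is not taken from the textbook analysis.

```python
import math

# Deviances copied from the two GLM summaries above
deviance_age_only = 56.291        # status ~ age
deviance_age_gender = 51.256      # status ~ age + gender

# Drop in deviance when gender is added (one extra parameter)
lr_stat = deviance_age_only - deviance_age_gender

# For a chi-square variable with 1 degree of freedom,
# P(X > x) = erfc(sqrt(x / 2))
p_value = math.erfc(math.sqrt(lr_stat / 2.0))

print(f"drop in deviance = {lr_stat:.3f}")
print(f"approximate p-value (1 df) = {p_value:.4f}")
```

A small p-value here would suggest that adding `gender` improves the fit beyond what `age` alone provides.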