# Fitting a Logistic Regression Model - Lab

## Introduction

In the last lesson you were given a broad overview of logistic regression, including an introduction to two separate packages for building logistic regression models. In this lab, you'll practice fitting logistic regressions with `statsmodels`. For your first foray into logistic regression, you will build a model that classifies whether an individual survived the [Titanic](https://www.kaggle.com/c/titanic/data) shipwreck (yes, it's a bit morbid).


## Objectives

In this lab you will: 

* Implement logistic regression with `statsmodels` 
* Interpret the statistical results associated with model parameters

## Import the data

Import the data stored in the file `'titanic.csv'` and print the first five rows of the DataFrame to check its contents. 

In [1]:
import pandas as pd

# Import the data
df = pd.read_csv('titanic.csv')

# Print the first five rows of the DataFrame
print(df.head())


   Unnamed: 0  PassengerId  Survived Pclass  \
0           0            1         0      3   
1           1            2         1      1   
2           2            3         1      3   
3           3            4         1      1   
4           4            5         0      3   

                                                Name     Sex   Age  SibSp  \
0                            Braund, Mr. Owen Harris    male  22.0      1   
1  Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0      1   
2                             Heikkinen, Miss. Laina  female  26.0      0   
3       Futrelle, Mrs. Jacques Heath (Lily May Peel)  female  35.0      1   
4                           Allen, Mr. William Henry    male  35.0      0   

   Parch            Ticket     Fare Cabin Embarked  
0      0         A/5 21171   7.2500   NaN        S  
1      0          PC 17599  71.2833   C85        C  
2      0  STON/O2. 3101282   7.9250   NaN        S  
3      0            113803  53.1000  C123 

## Define independent and target variables

Your target variable is in the column `'Survived'`. A `0` indicates that the passenger didn't survive the shipwreck. Print the total number of people who didn't survive the shipwreck. How many people survived?

In [2]:
# Total number of people who survived/didn't survive
# (`df` was loaded in the previous cell, so the file is not read again)

# Count the passengers who didn't survive (Survived == 0)
num_did_not_survive = (df['Survived'] == 0).sum()

# Count the passengers who survived (Survived == 1)
num_survived = (df['Survived'] == 1).sum()

# Print the results
print(f"Total number of people who didn't survive: {num_did_not_survive}")
print(f"Total number of people who survived: {num_survived}")


Total number of people who didn't survive: 549
Total number of people who survived: 342


Only consider the columns specified in `relevant_columns` when building your model. The next step is to create dummy variables from categorical variables. Remember to drop the first level for each categorical column and make sure all the values are of type `float`: 

In [4]:
import pandas as pd

# Import the data
df = pd.read_csv('titanic.csv')

# Filter the relevant columns
relevant_columns = ['Pclass', 'Age', 'SibSp', 'Fare', 'Sex', 'Embarked', 'Survived']
filtered_df = df[relevant_columns]

# Handle missing values (for simplicity, we can drop rows with missing values)
filtered_df = filtered_df.dropna()

# Create dummy variables for categorical columns, dropping the first level
# of each and storing the indicators as floats
dummy_dataframe = pd.get_dummies(filtered_df, columns=['Sex', 'Embarked'], drop_first=True, dtype=float)

# Print the shape of the resulting DataFrame
print(dummy_dataframe.shape)


(712, 8)
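
To see what `drop_first=True` and `dtype=float` do, it can help to try them on a tiny DataFrame first. The sketch below uses made-up rows (the `toy` DataFrame is hypothetical and is not taken from `titanic.csv`):

```python
import pandas as pd

# Toy data for illustration only (hypothetical values, not from titanic.csv)
toy = pd.DataFrame({
    'Sex': ['male', 'female', 'female', 'male'],
    'Embarked': ['S', 'C', 'Q', 'S'],
    'Fare': [7.25, 71.28, 7.92, 8.05],
})

# drop_first=True removes the first (alphabetically sorted) level of each
# categorical column; dtype=float stores the indicators as 0.0 / 1.0
toy_dummies = pd.get_dummies(toy, columns=['Sex', 'Embarked'],
                             drop_first=True, dtype=float)

print(toy_dummies.columns.tolist())
# ['Fare', 'Sex_male', 'Embarked_Q', 'Embarked_S']
```

The dropped levels (`female` and `C` here) become the reference categories: a row with zeros in all of a variable's dummy columns belongs to the dropped level.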


Did you notice above that the DataFrame contains missing values? To keep things simple, delete all rows with missing values. 

> NOTE: You can use the [`.dropna()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.dropna.html) method to do this. 

In [5]:
import pandas as pd

# Import the data
df = pd.read_csv('titanic.csv')

# Filter the relevant columns
relevant_columns = ['Pclass', 'Age', 'SibSp', 'Fare', 'Sex', 'Embarked', 'Survived']
filtered_df = df[relevant_columns]

# Drop rows with missing values
filtered_df = filtered_df.dropna()

# Create dummy variables for categorical columns, stored as floats
dummy_dataframe = pd.get_dummies(filtered_df, columns=['Sex', 'Embarked'], drop_first=True, dtype=float)

# Print the shape of the resulting DataFrame
print(dummy_dataframe.shape)


(712, 8)


Finally, assign the independent variables to `X` and the target variable to `y`: 

In [6]:
import pandas as pd

# Import the data
df = pd.read_csv('titanic.csv')

# Filter the relevant columns
relevant_columns = ['Pclass', 'Age', 'SibSp', 'Fare', 'Sex', 'Embarked', 'Survived']
filtered_df = df[relevant_columns]

# Drop rows with missing values
filtered_df = filtered_df.dropna()

# Create dummy variables for categorical columns, stored as floats
dummy_dataframe = pd.get_dummies(filtered_df, columns=['Sex', 'Embarked'], drop_first=True, dtype=float)

# Assign the target variable
y = dummy_dataframe['Survived']

# Assign the independent variables
X = dummy_dataframe.drop('Survived', axis=1)

# Print the shapes of X and y to verify
print(f"Shape of X: {X.shape}")
print(f"Shape of y: {y.shape}")


Shape of X: (712, 7)
Shape of y: (712,)


## Fit the model

Now with everything in place, you can build a logistic regression model using `statsmodels` (make sure you create an intercept term as we showed in the previous lesson).  

> Warning: Did you receive an error of the form "LinAlgError: Singular matrix"? This means that `statsmodels` could not fit the model because a matrix it needs to invert is not full rank. In other words, at least one column is redundant: it can be written as a linear combination of the other columns. Try removing some features from the model and running it again.
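
One way to check for this kind of redundancy before fitting is to compare the rank of the feature matrix with its number of columns. The sketch below uses a small made-up matrix (`X_demo` and its column names are hypothetical) in which one column is the sum of two others:

```python
import numpy as np
import pandas as pd

# Hypothetical design matrix: 'both' is the sum of 'a' and 'b',
# so it carries no information beyond those two columns
X_demo = pd.DataFrame({
    'const': [1.0, 1.0, 1.0, 1.0, 1.0],
    'a': [0.0, 1.0, 2.0, 3.0, 5.0],
    'b': [1.0, 0.0, 1.0, 4.0, 2.0],
})
X_demo['both'] = X_demo['a'] + X_demo['b']

# A full-rank matrix has rank equal to its number of columns
rank = np.linalg.matrix_rank(X_demo.to_numpy())
n_cols = X_demo.shape[1]
print(f"rank = {rank}, columns = {n_cols}")
# rank = 3, columns = 4

# Dropping the redundant column restores full rank
rank_after = np.linalg.matrix_rank(X_demo.drop(columns='both').to_numpy())
print(f"rank after dropping 'both' = {rank_after}")
# rank after dropping 'both' = 3
```

A rank lower than the column count signals that at least one feature should be removed.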

In [11]:
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.tools.sm_exceptions import PerfectSeparationError

# Import the data. Some cells in the file hold the string "?" instead of a
# number; here "?" is assumed to mark a missing value
df = pd.read_csv('titanic.csv', na_values='?')

# Filter the relevant columns
relevant_columns = ['Pclass', 'Age', 'SibSp', 'Fare', 'Sex', 'Embarked', 'Survived']
filtered_df = df[relevant_columns]

# Drop rows with missing values (including the former "?" entries)
filtered_df = filtered_df.dropna()

# Create dummy variables for categorical columns
dummy_dataframe = pd.get_dummies(filtered_df, columns=['Sex', 'Embarked'], drop_first=True, dtype=float)

# Ensure every column is of type float
dummy_dataframe = dummy_dataframe.astype(float)

# Assign the target variable
y = dummy_dataframe['Survived']

# Assign the independent variables
X = dummy_dataframe.drop('Survived', axis=1)

# Add an intercept term
X = sm.add_constant(X)

# Fit the logistic regression model
try:
    logit_model = sm.Logit(y, X).fit()
    print(logit_model.summary())
except PerfectSeparationError:
    print("Perfect separation detected, results not available")
except np.linalg.LinAlgError as e:
    print(f"LinAlgError: {e}")
    print("Trying to remove some features and fit the model again...")

    # Remove a feature to handle multicollinearity (the intercept is kept)
    X_reduced = X.drop('SibSp', axis=1)  # Example of removing a feature
    logit_model = sm.Logit(y, X_reduced).fit()
    print(logit_model.summary())


## Analyze results

Generate the summary table for your model. Then, comment on the p-values associated with the various features you chose.

In [12]:
# Summary table
print(logit_model.summary())


In [None]:
# Your comments here


## Level up (Optional)

Create a new model, this time only using those features you determined were influential based on your analysis of the results above. How does this model perform?

In [None]:
# Your code here


In [None]:
# Your comments here

## Summary 

Well done! In this lab, you practiced using `statsmodels` to build a logistic regression model. You then interpreted the results, drawing on the same statistical ideas you used for linear regression. Continue on to take a look at building logistic regression models in scikit-learn!