# Fitting a Logistic Regression Model - Lab

## Introduction

In the last lesson you were given a broad overview of logistic regression. This included an introduction to two separate packages for creating logistic regression models. In this lab, you'll be investigating fitting logistic regressions with `statsmodels`. For your first foray into logistic regression, you are going to attempt to build a model that classifies whether an individual survived the [Titanic](https://www.kaggle.com/c/titanic/data) shipwreck or not (yes, it's a bit morbid).


## Objectives

In this lab you will: 

* Implement logistic regression with `statsmodels` 
* Interpret the statistical results associated with model parameters

## Import the data

Import the data stored in the file `'titanic.csv'` and print the first five rows of the DataFrame to check its contents. 

In [1]:
import pandas as pd

# Import the data and display it
# (pandas truncates a long DataFrame to its first and last rows)
df = pd.read_csv('titanic.csv')
df



Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.2500,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.9250,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1000,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.0500,,S
...,...,...,...,...,...,...,...,...,...,...,...,...
886,887,0,2,"Montvila, Rev. Juozas",male,27.0,0,0,211536,13.0000,,S
887,888,1,1,"Graham, Miss. Margaret Edith",female,19.0,0,0,112053,30.0000,B42,S
888,889,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",female,,1,2,W./C. 6607,23.4500,,S
889,890,1,1,"Behr, Mr. Karl Howell",male,26.0,0,0,111369,30.0000,C148,C


## Define independent and target variables

Your target variable is in the column `'Survived'`. A `0` indicates that the passenger didn't survive the shipwreck. Print the total number of people who didn't survive the shipwreck. How many people survived?

In [2]:
# Total number of people who didn't survive the shipwreck
not_survived_count = df[df['Survived'] == 0]['Survived'].count()

# Total number of people who survived the shipwreck
survived_count = df[df['Survived'] == 1]['Survived'].count()

print("Total number of people who didn't survive the shipwreck:", not_survived_count)
print("Total number of people who survived the shipwreck:", survived_count)


Total number of people who didn't survive the shipwreck: 549
Total number of people who survived the shipwreck: 342


Only consider the columns specified in `relevant_columns` when building your model. The next step is to create dummy variables from categorical variables. Remember to drop the first level for each categorical column and make sure all the values are of type `float`: 

In [3]:
# Define the relevant columns
relevant_columns = ['Pclass', 'Age', 'SibSp', 'Fare', 'Sex', 'Embarked', 'Survived']

# Create the dummy variables
dummy_dataframe = pd.get_dummies(df[relevant_columns], drop_first=True, dtype=float)

# Check the shape of the dummy DataFrame
print(dummy_dataframe.shape)


(891, 8)
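
To see what `drop_first=True` does, here is a minimal sketch on a three-row toy DataFrame. The values are made up for illustration and are not taken from `titanic.csv`. Numeric columns pass through unchanged, while each categorical column loses its first (alphabetically sorted) level, which becomes the baseline.

```python
import pandas as pd

# Hypothetical toy data, not rows from the lab's dataset
toy = pd.DataFrame({
    "Age": [22.0, 38.0, 26.0],
    "Sex": ["male", "female", "female"],
    "Embarked": ["S", "C", "Q"],
})

# 'Sex' has levels female/male -> 'female' is dropped, 'Sex_male' remains
# 'Embarked' has levels C/Q/S  -> 'C' is dropped, 'Embarked_Q' and 'Embarked_S' remain
encoded = pd.get_dummies(toy, drop_first=True, dtype=float)

print(encoded.columns.tolist())
print(encoded)
```

A row with `Sex_male == 0` is a female passenger, and a row with both `Embarked_Q == 0` and `Embarked_S == 0` embarked at `C`. The same logic explains why seven input columns become eight in the lab: `Sex` contributes one dummy and `Embarked` contributes two.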


Did you notice above that the DataFrame contains missing values? To keep things simple, delete all rows with missing values. 

> NOTE: You can use the [`.dropna()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.dropna.html) method to do this. 

In [4]:
# Drop rows with missing values
dummy_dataframe = dummy_dataframe.dropna()

# Check the shape of the dummy DataFrame
print(dummy_dataframe.shape)


(714, 8)
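
One subtlety is worth checking when `.dropna()` runs after `pd.get_dummies()`: by default, a missing value in a categorical column is encoded as all-zero dummies rather than as `NaN`, so `.dropna()` no longer sees it. The sketch below uses hypothetical toy data to show the effect. Whether it matters for this lab depends on whether `'Embarked'` has missing entries in your copy of the file.

```python
import pandas as pd

# Hypothetical toy data: one missing Age and one missing Embarked
toy = pd.DataFrame({
    "Age": [22.0, None, 30.0],
    "Embarked": ["S", "C", None],
})

encoded = pd.get_dummies(toy, drop_first=True, dtype=float)

# The missing Embarked (row 2) became 0.0, indistinguishable from baseline 'C'
print(encoded)

# Only the row with the missing Age (row 1) is removed
cleaned = encoded.dropna()
print(cleaned.shape)

# Counting missing values on the raw data shows what was there before encoding
print(toy.isna().sum())
```

If you want such rows removed as well, one option is to call `.dropna()` on the selected columns before creating the dummy variables.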


Finally, assign the independent variables to `X` and the target variable to `y`: 

In [5]:
# Split the data into X and y
y = dummy_dataframe['Survived']
X = dummy_dataframe.drop(columns=['Survived'])

## Fit the model

Now with everything in place, you can build a logistic regression model using `statsmodels` (make sure you create an intercept term as we showed in the previous lesson).  

> Warning: Did you receive an error of the form "LinAlgError: Singular matrix"? This means that `statsmodels` could not fit the model because the matrix of independent variables was not full rank, so a matrix it needs to invert during fitting was singular. In other words, at least one feature was redundant: it could be reconstructed exactly from the other columns. Try removing some features from the model and running it again.
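
A common way to end up with a rank-deficient matrix is to keep every dummy level alongside an intercept. The sketch below uses a made-up six-passenger example with NumPy to show that adding a `female` column to `const` and `male` adds a column without adding any rank, which is why the lab asks you to drop the first level.

```python
import numpy as np

# Hypothetical indicator for six passengers: 1 = male, 0 = female
male = np.array([0, 1, 1, 0, 1, 0], dtype=float)
female = 1.0 - male
const = np.ones_like(male)

# All dummy levels plus an intercept: const == male + female, so one column is redundant
X_redundant = np.column_stack([const, male, female])

# First level dropped: the remaining columns are linearly independent
X_full_rank = np.column_stack([const, male])

rank_redundant = np.linalg.matrix_rank(X_redundant)
rank_full = np.linalg.matrix_rank(X_full_rank)

print(X_redundant.shape[1], "columns, rank", rank_redundant)
print(X_full_rank.shape[1], "columns, rank", rank_full)
```

Comparing `np.linalg.matrix_rank(X)` with the number of columns of your own `X` is a quick diagnostic when the singular-matrix error appears.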

In [6]:
import statsmodels.api as sm

# Add an intercept term to X
X_with_intercept = sm.add_constant(X)

# Build the logistic regression model
logit_model = sm.Logit(y, X_with_intercept)

# Fit the model
result = logit_model.fit()

# Print the summary of the model
print(result.summary())



Optimization terminated successfully.
         Current function value: 0.443267
         Iterations 6
                           Logit Regression Results                           
Dep. Variable:               Survived   No. Observations:                  714
Model:                          Logit   Df Residuals:                      706
Method:                           MLE   Df Model:                            7
Date:                Mon, 07 Aug 2023   Pseudo R-squ.:                  0.3437
Time:                        18:58:35   Log-Likelihood:                -316.49
converged:                       True   LL-Null:                       -482.26
Covariance Type:            nonrobust   LLR p-value:                 1.103e-67
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
const          5.6503      0.633      8.921      0.000       4.409       6.892
Pclass        -1.2118      0.

## Analyze results

Generate the summary table for your model. Then, comment on the p-values associated with the various features you chose.

The p-value for the constant term (intercept) is displayed as 0.000, which means it is below 0.0005 after rounding rather than exactly zero. The intercept is therefore significantly different from zero. It represents the log-odds of survival when every independent variable equals zero.

For 'Pclass', 'Age', 'SibSp', and 'Sex_male', the p-values are less than 0.05, so their coefficients are statistically significant at the 5% level. Each of these variables is associated with the target variable 'Survived' after accounting for the others.

The 'Fare', 'Embarked_Q', and 'Embarked_S' variables have p-values greater than 0.05, indicating that they are not statistically significant in predicting 'Survived' at the 5% significance level.

In short, 'Pclass', 'Age', 'SibSp', and 'Sex_male' are statistically significant predictors of survival on the Titanic, while 'Fare', 'Embarked_Q', and 'Embarked_S' are not. You might consider removing the non-significant variables to simplify the model. Keep in mind that dropping predictors can never raise the in-sample log-likelihood, so the gain is in simplicity and interpretability rather than in fit.
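
The `z` and `P>|z|` columns follow directly from the first two columns: `z` is the coefficient divided by its standard error, and the p-value is the two-sided tail probability of a standard normal at that `z`. The sketch below reproduces the `const` row using the rounded values printed in the summary, so the result only approximately matches the reported `z` of 8.921. The helper name `two_sided_p_value` is introduced here for illustration.

```python
import math

def two_sided_p_value(z):
    """Two-sided p-value for a standard normal test statistic."""
    return math.erfc(abs(z) / math.sqrt(2))

# Rounded values from the 'const' row of the summary table
coef, std_err = 5.6503, 0.633

z = coef / std_err
p = two_sided_p_value(z)

print(f"z = {z:.3f}")
print(f"p = {p:.3e}")           # tiny, but not zero
print(f"rounded p = {p:.3f}")   # what the summary table displays
```

This is why a displayed p-value of 0.000 should be read as "smaller than 0.0005", not as zero.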

## Level up (Optional)

Create a new model, this time only using those features you determined were influential based on your analysis of the results above. How does this model perform?

In [8]:
import statsmodels.api as sm

# Select only the influential features
influential_columns = ['Pclass', 'Age', 'SibSp', 'Sex_male']
X_influential = X[influential_columns]

# Add an intercept term to X_influential
X_influential_with_intercept = sm.add_constant(X_influential)

# Build the logistic regression model with influential features
logit_model_influential = sm.Logit(y, X_influential_with_intercept)

# Fit the model
result_influential = logit_model_influential.fit()

# Print the summary of the new model
print(result_influential.summary())



Optimization terminated successfully.
         Current function value: 0.445882
         Iterations 6
                           Logit Regression Results                           
Dep. Variable:               Survived   No. Observations:                  714
Model:                          Logit   Df Residuals:                      709
Method:                           MLE   Df Model:                            4
Date:                Mon, 07 Aug 2023   Pseudo R-squ.:                  0.3399
Time:                        19:01:00   Log-Likelihood:                -318.36
converged:                       True   LL-Null:                       -482.26
Covariance Type:            nonrobust   LLR p-value:                 1.089e-69
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
const          5.6008      0.543     10.306      0.000       4.536       6.666
Pclass        -1.3174      0.

The new logistic regression model, which includes only the influential features 'Pclass', 'Age', 'SibSp', and 'Sex_male', has the following performance:

- **Pseudo R-squared:** The pseudo R-squared of approximately 0.3399 is McFadden's measure, computed as 1 - (Log-Likelihood / LL-Null) = 1 - (-318.36 / -482.26). It is the proportional improvement in log-likelihood over the intercept-only model, not the share of variance explained, so it should not be read like the R-squared of a linear regression. It is slightly lower than the full model's 0.3437, which is expected after dropping three predictors.

- **Coefficients:** Each coefficient is the change in the log-odds of 'Survived' for a one-unit increase in the corresponding independent variable, all else being equal. For example, for every one-unit increase in 'Pclass', the log-odds of survival decrease by approximately 1.3174.

- **P-values:** All the p-values for the coefficients are less than 0.05, indicating that all the selected features ('Pclass', 'Age', 'SibSp', and 'Sex_male') are statistically significant in predicting 'Survived' at the 5% level. This is consistent with the analysis of the full model.

- **LLR p-value:** The LLR (log-likelihood ratio) p-value is approximately 1.089e-69, which is extremely small. This indicates that the model fits significantly better than the null model (an intercept-only model with no predictors) and suggests that the selected features collectively provide a meaningful improvement in predicting 'Survived'.
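
The figures in this list can be checked by hand from the values printed in the summary. The sketch below recomputes the pseudo R-squared, the likelihood-ratio statistic and its p-value, and converts the 'Pclass' coefficient to an odds ratio. It uses the rounded numbers from the output, so results agree with the summary only to about three significant figures. The p-value uses the closed-form chi-square tail for 4 degrees of freedom, `exp(-x/2) * (1 + x/2)`, to avoid extra dependencies.

```python
import math

# Rounded values from the reduced model's summary
llf = -318.36         # Log-Likelihood
llnull = -482.26      # LL-Null (intercept-only model)
pclass_coef = -1.3174

# McFadden's pseudo R-squared
pseudo_r2 = 1 - llf / llnull

# Likelihood-ratio statistic against the null model (Df Model = 4)
lr_stat = 2 * (llf - llnull)

# Chi-square survival function, valid for exactly 4 degrees of freedom
lr_p_value = math.exp(-lr_stat / 2) * (1 + lr_stat / 2)

# Exponentiating a log-odds coefficient gives an odds ratio
pclass_odds_ratio = math.exp(pclass_coef)

print(f"pseudo R-squared = {pseudo_r2:.4f}")
print(f"LR statistic     = {lr_stat:.2f}")
print(f"LLR p-value      = {lr_p_value:.3e}")
print(f"Pclass odds ratio = {pclass_odds_ratio:.3f}")
```

An odds ratio of roughly 0.27 means that moving one class down (for example from first to second) multiplies the odds of survival by about 0.27, holding the other features fixed. Odds ratios are often easier to communicate than changes in log-odds.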

Thus, the new model with influential features fits reasonably well in terms of statistical significance and pseudo R-squared. However, these are in-sample measures, and the model's predictive power may vary depending on the specific data and context. It's good practice to evaluate the model's performance on a separate test dataset and to consider additional model evaluation metrics.
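
Two additional checks need no new fit, only the log-likelihoods already printed for the two models. The sketch below compares them with AIC (`2k - 2 * log-likelihood`, where `k` counts the intercept too) and with a likelihood-ratio test of the three dropped features. The parameter counts come from the `Df Model` rows (7 and 4, plus one intercept each), and 7.815 is the standard 5% critical value of a chi-square distribution with 3 degrees of freedom. The function name `aic` is introduced here for illustration.

```python
# Rounded log-likelihoods from the two summaries
full_llf, full_k = -316.49, 8        # 7 features + intercept
reduced_llf, reduced_k = -318.36, 5  # 4 features + intercept

def aic(llf, k):
    """Akaike information criterion: lower is better."""
    return 2 * k - 2 * llf

aic_full = aic(full_llf, full_k)
aic_reduced = aic(reduced_llf, reduced_k)

# Likelihood-ratio test: do the 3 dropped features improve the fit?
lr_stat = 2 * (full_llf - reduced_llf)
dropped = full_k - reduced_k
CHI2_CRITICAL_DF3_5PCT = 7.815

print(f"AIC full = {aic_full:.2f}, AIC reduced = {aic_reduced:.2f}")
print(f"LR statistic = {lr_stat:.2f} on {dropped} degrees of freedom")
print("dropped features jointly significant:", lr_stat > CHI2_CRITICAL_DF3_5PCT)
```

Under these numbers the reduced model has the lower AIC and the dropped features are not jointly significant, which supports the simpler model. `statsmodels` also exposes AIC directly as the `aic` attribute of a fitted result.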




## Summary 

Well done! In this lab, you practiced using `statsmodels` to build a logistic regression model. You then interpreted the results, building on what you already know about reading linear regression output. Continue on to take a look at building logistic regression models in scikit-learn!