# Logistic Regression in scikit-learn - Lab

## Introduction 

In this lab, you will fit a logistic regression model to a heart disease dataset. The `'target'` column records whether or not a patient has heart disease: 1 indicates heart disease and 0 indicates no heart disease.

## Objectives

In this lab you will: 

- Fit a logistic regression model using scikit-learn 


## Let's get started!

Run the following cells to import the necessary functions and load the dataset: 

In [1]:
# Import necessary functions
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
import pandas as pd
import numpy as np

In [2]:
# Import data
df = pd.read_csv('heart.csv')
df.head()

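|   | age | sex | cp | trestbps | chol | fbs | restecg | thalach | exang | oldpeak | slope | ca | thal | target |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 63 | 1 | 3 | 145 | 233 | 1 | 0 | 150 | 0 | 2.3 | 0 | 0 | 1 | 1 |
| 1 | 37 | 1 | 2 | 130 | 250 | 0 | 1 | 187 | 0 | 3.5 | 0 | 0 | 2 | 1 |
| 2 | 41 | 0 | 1 | 130 | 204 | 0 | 0 | 172 | 0 | 1.4 | 2 | 0 | 2 | 1 |
| 3 | 56 | 1 | 1 | 120 | 236 | 0 | 1 | 178 | 0 | 0.8 | 2 | 0 | 2 | 1 |
| 4 | 57 | 0 | 0 | 120 | 354 | 0 | 1 | 163 | 1 | 0.6 | 2 | 0 | 2 | 1 |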


## Define appropriate `X` and `y` 

Recall that the `'target'` column indicates whether or not a patient has heart disease. With that in mind, define appropriate `X` (predictors) and `y` (target) to model whether or not a patient has heart disease.

In [3]:
# Split the data into target and predictors
y = df['target']
X = df.drop('target', axis=1)


## Train-test split 

- Split the data into training and test sets 
- Assign 25% to the test set 
- Set the `random_state` to 0 

N.B. To avoid possible data leakage, it is best to split the data first, and then normalize.
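
The note above can be illustrated with a small sketch. It uses synthetic data rather than `heart.csv`, and every name in it (`X_demo`, `scaler_demo`, and so on) is hypothetical. The point is only that the scaler learns its mean and standard deviation from the training rows, so the test rows never influence the transformation.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the predictors (hypothetical data, not heart.csv)
rng = np.random.default_rng(0)
X_demo = rng.normal(loc=50, scale=10, size=(200, 3))
y_demo = rng.integers(0, 2, size=200)

# Split first ...
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0
)

# ... then learn the scaling statistics from the training rows only
scaler_demo = StandardScaler().fit(X_tr)
X_tr_scaled = scaler_demo.transform(X_tr)
X_te_scaled = scaler_demo.transform(X_te)

# Training columns are centered by construction; test columns are not
print("Training column means:", X_tr_scaled.mean(axis=0).round(3))
print("Test column means:", X_te_scaled.mean(axis=0).round(3))
```

The training columns come out with mean 0 because the scaler was fit on them. The test columns will generally be near 0 but not exactly 0, which is expected when the test set plays no part in fitting the scaler.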

In [4]:
# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

# Print the shapes of the splits to verify
print(f"Shape of X_train: {X_train.shape}")
print(f"Shape of X_test: {X_test.shape}")
print(f"Shape of y_train: {y_train.shape}")
print(f"Shape of y_test: {y_test.shape}")

Shape of X_train: (227, 13)
Shape of X_test: (76, 13)
Shape of y_train: (227,)
Shape of y_test: (76,)


## Normalize the data 

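Normalize the predictors (`X`) prior to fitting the model. Here that means standardizing each feature to zero mean and unit variance, with the scaler fit on the training set only and then applied to both sets. 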

In [6]:
from sklearn.preprocessing import StandardScaler

# Initialize the StandardScaler
scaler = StandardScaler()

# Fit the scaler on the training data and transform both training and test data
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# Convert the normalized data back to DataFrame for easier handling
X_train = pd.DataFrame(X_train, columns=X.columns)
X_test = pd.DataFrame(X_test, columns=X.columns)

# Print the first few rows of the normalized training data to verify
print(X_train.head())

        age       sex        cp  trestbps      chol       fbs   restecg  \
0  0.352565  0.702439  0.987029  0.020206 -0.435970 -0.426956 -0.982565   
1 -0.310686  0.702439 -0.919827 -1.140980 -0.325394 -0.426956  0.891740   
2 -0.089602  0.702439  0.987029  1.065272 -0.288536 -0.426956 -0.982565   
3  0.463107  0.702439  1.940457  2.690932  0.411776 -0.426956 -0.982565   
4  1.347442  0.702439 -0.919827 -0.676506 -0.343823 -0.426956 -0.982565   

    thalach     exang   oldpeak     slope        ca      thal  
0  1.011893 -0.723526  1.724840  0.962226  1.227233  1.121359  
1  0.453640 -0.723526 -0.923487  0.962226  0.259935 -0.459688  
2  0.668352 -0.723526  0.400676  0.962226 -0.707364  1.121359  
3 -0.190498 -0.723526  2.552442 -2.273704 -0.707364  1.121359  
4 -0.877579  1.382120  1.228278 -0.655739  1.227233  1.121359  


## Fit a model

- Instantiate `LogisticRegression`
  - Make sure you don't include the intercept  
  - Set `C` to a very large number such as `1e12` 
  - Use the `'liblinear'` solver 
- Fit the model to the training data 
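
In scikit-learn, `C` is the inverse of the regularization strength, so a very large value effectively switches the default L2 penalty off. The sketch below is one way to see this on synthetic data; the dataset and the names `weak_penalty` and `strong_penalty` are assumptions made for illustration and are not part of the lab.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic binary classification data (hypothetical stand-in for the heart data)
X_demo, y_demo = make_classification(
    n_samples=300, n_features=5, n_informative=3, n_redundant=0, random_state=0
)

# Large C: the L2 penalty is effectively switched off
weak_penalty = LogisticRegression(fit_intercept=False, C=1e12, solver='liblinear')
weak_penalty.fit(X_demo, y_demo)

# Small C: the L2 penalty is strong and shrinks the coefficients toward zero
strong_penalty = LogisticRegression(fit_intercept=False, C=0.01, solver='liblinear')
strong_penalty.fit(X_demo, y_demo)

print("Coefficient norm, C=1e12:", np.linalg.norm(weak_penalty.coef_))
print("Coefficient norm, C=0.01:", np.linalg.norm(strong_penalty.coef_))

# With fit_intercept=False the intercept is fixed at zero
print("Intercept:", weak_penalty.intercept_)
```

The coefficient norm should be noticeably larger with `C=1e12` than with `C=0.01`, since the smaller value penalizes large coefficients much more heavily.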

In [7]:
# Instantiate the model
logreg = LogisticRegression(fit_intercept=False, C=1e12, solver='liblinear')

# Fit the model to the training data
logreg.fit(X_train, y_train)


## Predict

Generate predictions for the training and test sets. 

In [8]:
# Generate predictions for the training set
y_hat_train = logreg.predict(X_train)

# Generate predictions for the test set
y_hat_test = logreg.predict(X_test)

# Print the first few predictions to verify
print("Training set predictions:", y_hat_train[:10])
print("Test set predictions:", y_hat_test[:10])


Training set predictions: [0 1 1 0 0 0 0 1 0 0]
Test set predictions: [0 1 1 0 0 0 0 0 0 0]
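
The 0/1 labels returned by `.predict()` come from thresholding the model's estimated probability of class 1 at 0.5. A minimal sketch of that relationship, using synthetic data and hypothetical names (`model`, `proba_class_1`) rather than the lab's fitted `logreg`:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic binary classification data (hypothetical, not heart.csv)
X_demo, y_demo = make_classification(
    n_samples=200, n_features=4, n_informative=2, n_redundant=0, random_state=0
)
model = LogisticRegression(solver='liblinear').fit(X_demo, y_demo)

# Hard class labels
labels = model.predict(X_demo)

# Estimated probability of class 1 (second column of predict_proba)
proba_class_1 = model.predict_proba(X_demo)[:, 1]

# Thresholding the probabilities at 0.5 reproduces the hard labels
labels_from_proba = (proba_class_1 > 0.5).astype(int)

print("Labels:", labels[:5])
print("P(class 1):", proba_class_1[:5].round(3))
print("Same labels:", np.array_equal(labels, labels_from_proba))
```

Looking at the probabilities as well as the labels can show how confident the model is about each prediction, which the 0/1 output alone hides.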


## How many times was the classifier correct on the training set?

In [9]:
# Calculate the number of correct predictions
correct_predictions = np.sum(y_hat_train == y_train)

# Print the number of correct predictions
print("Number of correct predictions on the training set:", correct_predictions)


Number of correct predictions on the training set: 195


## How many times was the classifier correct on the test set?

In [10]:
# Calculate the number of correct predictions on the test set
correct_predictions_test = np.sum(y_hat_test == y_test)

# Print the number of correct predictions on the test set
print("Number of correct predictions on the test set:", correct_predictions_test)


Number of correct predictions on the test set: 63


## Analysis

Describe how well you think this initial model is performing based on its training and test performance. In your description, note how you evaluated performance here compared with your previous work on regression.
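
One possible starting point is to turn the counts reported above into accuracy, the fraction of correct predictions. The sketch below uses the counts and split sizes printed earlier in this lab; the toy label arrays at the end are made up purely to show that `accuracy_score` computes the same quantity.

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Counts reported above: correct predictions and split sizes
correct_train, n_train = 195, 227
correct_test, n_test = 63, 76

# Accuracy is the fraction of predictions that match the true labels
train_accuracy = correct_train / n_train
test_accuracy = correct_test / n_test
print(f"Training accuracy: {train_accuracy:.3f}")  # 0.859
print(f"Test accuracy: {test_accuracy:.3f}")  # 0.829

# The same quantity computed from label arrays (toy values, for illustration only)
y_true_toy = np.array([1, 0, 1, 1, 0])
y_pred_toy = np.array([1, 0, 0, 1, 0])
toy_accuracy = accuracy_score(y_true_toy, y_pred_toy)
print("Toy accuracy:", toy_accuracy)  # 0.8
```

Read this way, the model is right on about 86% of the training rows and about 83% of the test rows. The gap between the two is small, which may suggest the model is not badly overfit, though accuracy alone says nothing about which kinds of errors it makes. Note the contrast with regression, where performance is usually measured by the size of the errors (for example, RMSE) rather than by counting exact matches.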

## Summary

In this lab, you practiced a standard data science pipeline: importing data, splitting it into training and test sets, normalizing the predictors, and fitting a logistic regression model. In the upcoming labs and lessons, you'll continue to investigate how to analyze and tune these models for various scenarios.