# Logistic Regression in scikit-learn - Lab

## Introduction 

In this lab, you will fit a logistic regression model to a heart disease dataset. The `'target'` column indicates whether a patient has heart disease: 1 means heart disease is present, and 0 means it is absent.

## Objectives

In this lab you will: 

- Fit a logistic regression model using scikit-learn 


## Let's get started!

Run the following cells to import the necessary functions and load the dataset:

In [4]:
# Import necessary functions
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
import pandas as pd
import numpy as np

In [5]:
# Import data
df = pd.read_csv('heart.csv')
df.head()

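|  | age | sex | cp | trestbps | chol | fbs | restecg | thalach | exang | oldpeak | slope | ca | thal | target |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 63 | 1 | 3 | 145 | 233 | 1 | 0 | 150 | 0 | 2.3 | 0 | 0 | 1 | 1 |
| 1 | 37 | 1 | 2 | 130 | 250 | 0 | 1 | 187 | 0 | 3.5 | 0 | 0 | 2 | 1 |
| 2 | 41 | 0 | 1 | 130 | 204 | 0 | 0 | 172 | 0 | 1.4 | 2 | 0 | 2 | 1 |
| 3 | 56 | 1 | 1 | 120 | 236 | 0 | 1 | 178 | 0 | 0.8 | 2 | 0 | 2 | 1 |
| 4 | 57 | 0 | 0 | 120 | 354 | 0 | 1 | 163 | 1 | 0.6 | 2 | 0 | 2 | 1 |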


## Define appropriate `X` and `y` 

Recall that the `'target'` column indicates whether or not a patient has heart disease. With that in mind, define appropriate `X` (predictors) and `y` (target) to model whether a patient has heart disease.

In [6]:
# Split the data into target and predictors
y = df['target']
X = df.drop('target', axis=1)
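
The split into predictors and target can be sanity-checked before modeling. The sketch below uses a made-up four-row DataFrame (the name `toy` and its values are hypothetical, not taken from `heart.csv`) to confirm that the target column is absent from the predictors and to inspect the class balance.

```python
import pandas as pd

# Hypothetical miniature stand-in for the heart disease data (values are made up)
toy = pd.DataFrame({
    "age": [63, 37, 41, 56],
    "chol": [233, 250, 204, 236],
    "target": [1, 1, 0, 0],
})

# Same pattern as the lab: the target column becomes y, everything else becomes X
y_toy = toy["target"]
X_toy = toy.drop("target", axis=1)

# The target must not leak into the predictors, and row counts must match
print("target" in X_toy.columns)
print(len(X_toy) == len(y_toy))

# Share of each class; a strong imbalance would affect how accuracy is read later
class_share = y_toy.value_counts(normalize=True)
print(class_share)
```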

## Normalize the data 

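Normalize the data (`X`) prior to fitting the model. Here, normalizing means standardizing each predictor to zero mean and unit variance with `StandardScaler`.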

In [7]:
# Create a StandardScaler object
scaler = StandardScaler()

# Fit and transform the data (X) using the scaler
X = scaler.fit_transform(X)

# Convert the normalized X back to a DataFrame
X = pd.DataFrame(X, columns=df.drop('target', axis=1).columns)

X.head()


|  | age | sex | cp | trestbps | chol | fbs | restecg | thalach | exang | oldpeak | slope | ca | thal |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 0.952197 | 0.681005 | 1.973123 | 0.763956 | -0.256334 | 2.394438 | -1.005832 | 0.015443 | -0.696631 | 1.087338 | -2.274579 | -0.714429 | -2.148873 |
| 1 | -1.915313 | 0.681005 | 1.002577 | -0.092738 | 0.072199 | -0.417635 | 0.898962 | 1.633471 | -0.696631 | 2.122573 | -2.274579 | -0.714429 | -0.512922 |
| 2 | -1.474158 | -1.468418 | 0.032031 | -0.092738 | -0.816773 | -0.417635 | -1.005832 | 0.977514 | -0.696631 | 0.310912 | 0.976352 | -0.714429 | -0.512922 |
| 3 | 0.180175 | 0.681005 | 0.032031 | -0.663867 | -0.198357 | -0.417635 | 0.898962 | 1.239897 | -0.696631 | -0.206705 | 0.976352 | -0.714429 | -0.512922 |
| 4 | 0.290464 | -1.468418 | -0.938515 | -0.663867 | 2.08205 | -0.417635 | 0.898962 | 0.583939 | 1.435481 | -0.379244 | 0.976352 | -0.714429 | -0.512922 |
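
To make the scaled values above less opaque, here is a minimal sketch of what `StandardScaler` computes. The array `raw` holds made-up numbers for two predictors, not rows from the lab's dataset. Each column has its mean subtracted and is divided by its population standard deviation.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up values for two predictors on different scales (illustrative only)
raw = np.array([
    [63.0, 233.0],
    [37.0, 250.0],
    [41.0, 204.0],
    [56.0, 236.0],
])

# What the lab does: fit the scaler and transform in one step
scaled = StandardScaler().fit_transform(raw)

# The same computation by hand: z = (x - column mean) / column std (ddof=0)
manual = (raw - raw.mean(axis=0)) / raw.std(axis=0)

# The two results agree up to floating-point error
print(np.allclose(scaled, manual))

# After scaling, every column has mean ~0 and standard deviation ~1
print(scaled.mean(axis=0).round(6), scaled.std(axis=0).round(6))
```

Putting predictors on a common scale makes the fitted coefficients easier to compare with one another.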


## Train-test split

- Split the data into training and test sets 
- Assign 25% to the test set 
- Set the `random_state` to 0 

In [8]:
# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)
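
The effect of `test_size` and `random_state` can be seen on synthetic data. The sketch below assumes a hypothetical 100-row array (`X_demo`, `y_demo`) rather than the lab's data. It also shows a common variant, which is not what this lab does, in which the scaler is fit on the training rows only so that no information from the test set influences the scaling.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Hypothetical synthetic data: 100 rows, 3 predictors, binary labels
rng = np.random.default_rng(0)
X_demo = rng.normal(loc=50, scale=10, size=(100, 3))
y_demo = rng.integers(0, 2, size=100)

# 25% of the rows go to the test set, the rest to the training set
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0
)
print(X_tr.shape, X_te.shape)

# Reusing the same random_state reproduces exactly the same split
X_tr2, X_te2, y_tr2, y_te2 = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0
)
same_split = np.array_equal(X_te, X_te2)
print(same_split)

# Variant (an assumption, not the lab's approach): learn the scaling
# parameters from the training rows only, then apply them to both sets
scaler_demo = StandardScaler().fit(X_tr)
X_tr_scaled = scaler_demo.transform(X_tr)
X_te_scaled = scaler_demo.transform(X_te)
```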

## Fit a model

- Instantiate `LogisticRegression`
  - Make sure you don't include the intercept (`fit_intercept=False`)
  - Set `C` to a very large number such as `1e12`, which makes the regularization penalty negligible
  - Use the `'liblinear'` solver
- Fit the model to the training data

In [9]:
# Instantiate the model without the intercept and with a very large C
logreg = LogisticRegression(fit_intercept=False, C=1e12, solver='liblinear')

# Fit the model to the training data
logreg.fit(X_train, y_train)
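
To illustrate what these settings do, the sketch below fits the same configuration on synthetic data from `make_classification` (the data and the names `X_syn`, `y_syn`, `weak_penalty`, and `strong_penalty` are assumptions for illustration). Because `C` is the inverse of the regularization strength, a very large `C` leaves the coefficients almost unpenalized, while a small `C` shrinks them toward zero.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic, noisy binary classification data (not the heart disease data)
X_syn, y_syn = make_classification(
    n_samples=200, n_features=5, n_informative=3, n_redundant=0,
    flip_y=0.1, random_state=0,
)

# Same settings as the lab: no intercept, negligible regularization
weak_penalty = LogisticRegression(fit_intercept=False, C=1e12, solver="liblinear")
weak_penalty.fit(X_syn, y_syn)

# For contrast: a small C means a strong penalty on the coefficients
strong_penalty = LogisticRegression(fit_intercept=False, C=0.01, solver="liblinear")
strong_penalty.fit(X_syn, y_syn)

# One coefficient per predictor; the intercept stays at zero
print(weak_penalty.coef_.shape)
print(weak_penalty.intercept_)

# Stronger regularization gives a smaller coefficient vector
weak_norm = np.linalg.norm(weak_penalty.coef_)
strong_norm = np.linalg.norm(strong_penalty.coef_)
print(weak_norm, strong_norm)
```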

## Predict
Generate predictions for the training and test sets. 

In [10]:
# Generate predictions for the training set
y_hat_train = logreg.predict(X_train)

# Generate predictions for the test set
y_hat_test = logreg.predict(X_test)
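
For a binary problem, `predict` returns hard class labels, while `predict_proba` returns the estimated probability of each class. The following sketch, on synthetic data with hypothetical names, shows that the labels correspond to thresholding the positive-class probability at 0.5.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic binary classification data (illustrative only)
X_syn, y_syn = make_classification(n_samples=200, n_features=5, random_state=0)
clf = LogisticRegression(solver="liblinear").fit(X_syn, y_syn)

# Hard labels, one per row
labels = clf.predict(X_syn)

# Probabilities: one column per class, ordered as in clf.classes_ (here 0, then 1)
proba = clf.predict_proba(X_syn)

# Thresholding the probability of class 1 at 0.5 reproduces the hard labels
manual_labels = (proba[:, 1] > 0.5).astype(int)
print(np.array_equal(labels, manual_labels))
```

Keeping the probabilities around is useful when a threshold other than 0.5 is more appropriate, for example when missing a positive case is costlier than a false alarm.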


## How many times was the classifier correct on the training set?

In [11]:
# Calculate the number of correct predictions on the training set
correct_predictions_train = sum(y_hat_train == y_train)

print("Number of correct predictions on the training set:", correct_predictions_train)

Number of correct predictions on the training set: 192


## How many times was the classifier correct on the test set?

In [12]:
# Calculate the number of correct predictions on the test set
correct_predictions_test = sum(y_hat_test == y_test)

print("Number of correct predictions on the test set:", correct_predictions_test)


Number of correct predictions on the test set: 63
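
A count of correct predictions is easier to interpret once it is divided by the number of observations, which gives the accuracy. The sketch below uses two short, made-up label arrays to show the count, the hand-computed accuracy, and scikit-learn's `accuracy_score`, which returns the same proportion.

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Made-up true labels and predictions (illustrative only)
y_true = np.array([1, 0, 1, 1, 0])
y_pred = np.array([1, 0, 0, 1, 0])

# Count of correct predictions, as in the lab
n_correct = int((y_pred == y_true).sum())

# Accuracy is the count divided by the number of observations
accuracy = n_correct / len(y_true)

print(n_correct, accuracy)
print(accuracy_score(y_true, y_pred))
```

Applied to the lab, the same division would use `len(y_train)` and `len(y_test)` as denominators, which makes the training and test results directly comparable despite the different set sizes.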


## Analysis
Describe how well you think this initial model is performing based on the training and test performance. Within your description, make note of how you evaluated performance as compared to your previous work with regression.

The initial logistic regression model made 192 correct predictions on the training set and 63 on the test set. Raw counts are hard to interpret on their own, so each should be divided by the size of its set to give a proportion of correct predictions (accuracy) that can be compared across the two sets.

### Training Performance:
The number of correct predictions on the training set indicates how well the model fits the training data. A higher number suggests that the model has captured patterns in the data it was trained on. However, a high number of correct predictions on the training set alone does not guarantee good generalization to unseen data, because the model may have overfit.

### Test Performance:
The number of correct predictions on the test set is a more critical measure of the model's performance. It indicates how well the model can generalize to new, unseen data. A high number of correct predictions on the test set demonstrates the model's ability to make accurate predictions on new instances, which is essential for a good predictive model.

### Comparison to Previous Work with Regression:
In this logistic regression scenario, we are dealing with a classification problem (predicting whether a patient has heart disease or not) rather than regression (predicting continuous values). Unlike regression, where we evaluate performance using metrics such as Mean Squared Error (MSE) or R-squared, in classification, we assess performance using metrics like accuracy, precision, recall, F1-score, etc.

To make a comprehensive evaluation of the model's performance, we would need to calculate other classification metrics such as accuracy, precision, recall, and F1-score for both the training and test sets. These metrics provide a better understanding of the model's overall performance and its ability to correctly classify positive and negative cases.
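
As a rough sketch of how these metrics could be computed, the example below applies scikit-learn's metric functions to two made-up label arrays (not the lab's predictions). The confusion matrix lays out true negatives, false positives, false negatives, and true positives, from which precision, recall, and F1 are derived.

```python
import numpy as np
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    f1_score,
    precision_score,
    recall_score,
)

# Made-up true labels and predictions (illustrative only)
y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_pred = np.array([1, 0, 0, 1, 0, 1, 1, 0])

# Rows are true classes, columns are predicted classes:
# [[TN, FP],
#  [FN, TP]]
cm = confusion_matrix(y_true, y_pred)
print(cm)

acc = accuracy_score(y_true, y_pred)     # (TP + TN) / all predictions
prec = precision_score(y_true, y_pred)   # TP / (TP + FP)
rec = recall_score(y_true, y_pred)       # TP / (TP + FN)
f1 = f1_score(y_true, y_pred)            # harmonic mean of precision and recall

print(acc, prec, rec, f1)
```

In a medical setting such as heart disease screening, recall often deserves particular attention, since a false negative means a patient with the disease goes undetected.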

In conclusion, counting correct predictions gives a first impression of the model's performance, and the test set count is the more informative of the two because it reflects how well the model generalizes to new, unseen data. A fuller evaluation would add the classification metrics listed above.


## Summary

In this lab, you practiced a standard data science pipeline: importing data, splitting it into training and test sets, and fitting a logistic regression model. In the upcoming labs and lessons, you'll continue to investigate how to analyze and tune these models for various scenarios.