# Unit 1 Logistic Regression Basics

## Introduction to Logistic Regression

Logistic regression is a classification algorithm that predicts the probability that an example belongs to a class. Unlike linear regression, which predicts a continuous number, logistic regression predicts a discrete outcome: yes or no, true or false, or in data terms, class 0 or class 1. The same idea extends to more than two classes, as in the wine dataset used later in this lesson.

For example, imagine you have data on whether an email is spam or not. Logistic regression can help you predict whether a new email is spam based on its content.

You'll use logistic regression when you need to classify data into categories. Real-life examples include:

  * Predicting if a student will pass or fail
  * Determining whether an email is spam
  * Diagnosing whether a patient has a certain disease

### How Logistic Regression Works

Logistic regression works by fitting a logistic function (also known as the sigmoid function) to the data. The sigmoid function is defined as:

$$\sigma(z) = \frac{1}{1 + e^{-z}}$$

where $z$ is the linear combination of input features:

$$z = w_0 + w_1x_1 + w_2x_2 + \dots + w_nx_n$$

Here, $w_0$ is the intercept (bias term), and $w_1, w_2, \dots, w_n$ are the weights (coefficients) associated with the input features $x_1, x_2, \dots, x_n$.

The output of the sigmoid function is a probability between 0 and 1. If this probability is greater than a chosen threshold (commonly 0.5), the predicted outcome is class 1 (e.g., the email is spam). Otherwise, it is class 0 (e.g., the email is not spam). During training, the model adjusts the weights to minimize the difference between the predicted probabilities and the actual class labels.
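The two formulas and the threshold rule can be sketched in a few lines of plain Python. This is only an illustration: the helper names `sigmoid` and `predict_class`, the weights, and the feature values are made up for the example and are not learned from any dataset.

```python
import math

def sigmoid(z):
    """Map a real number z to a value strictly between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_class(features, weights, intercept, threshold=0.5):
    """Return (label, probability) for one example with a binary model."""
    # z = w_0 + w_1*x_1 + ... + w_n*x_n
    z = intercept + sum(w * x for w, x in zip(weights, features))
    probability = sigmoid(z)
    label = 1 if probability > threshold else 0
    return label, probability

# Hypothetical weights and features, chosen only for illustration
weights = [1.5, -2.0]
intercept = 0.25
label, probability = predict_class([2.0, 0.5], weights, intercept)
print(label, round(probability, 3))
```

Here $z = 0.25 + 1.5 \cdot 2.0 - 2.0 \cdot 0.5 = 2.25$, so the probability is well above 0.5 and the example is assigned to class 1.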

[Figure: a sigmoid curve fitted to binary classification data points.]

### Example of Loading a Dataset

Before we can train a logistic regression model, we'll need some data. Scikit-Learn, a popular Python library for machine learning, provides many built-in datasets. For this lesson, we'll use the `wine` dataset, which helps predict the class of wine based on its chemical properties.

```python
from sklearn.datasets import load_wine

# Load real dataset
X, y = load_wine(return_X_y=True)
print(X.shape, y.shape)  # (178, 13) (178,)
```

Here, `X` represents the features (input data), and `y` represents the labels (output data) we want to predict. The dataset contains 178 samples, each described by 13 chemical measurements and labeled with one of three wine classes.
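Before modeling, it can help to look at what the dataset contains. The sketch below calls `load_wine()` without `return_X_y=True`, which returns an object that also carries the feature and class names, and it uses NumPy's `bincount` to count the samples per class.

```python
import numpy as np
from sklearn.datasets import load_wine

# Without return_X_y=True, load_wine returns a Bunch with extra metadata
wine = load_wine()
X, y = wine.data, wine.target

print(wine.feature_names[:3])  # names of the first three features
print(wine.target_names)       # names of the classes
print(np.bincount(y))          # number of samples in each class
```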

### Splitting the Dataset

If we trained our model on all the data, we wouldn't know how well it performs on unseen data. So, we'll split the data into training and testing sets using the `train_test_split` function from Scikit-Learn.

```python
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Standardize the features to help the solver converge
X = StandardScaler().fit_transform(X)

# Splitting the dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=42)
print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)  # (106, 13) (72, 13) (106,) (72,)
```

In this example:

  * `test_size=0.4` means 40% of the data will be used for testing, and 60% for training.
  * `random_state=42` ensures that we get the same split each time we run the code, making our results reproducible.

Splitting the data this way ensures that we can test how well our model generalizes to new data.
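The snippet above standardizes `X` before splitting, so the scaler also sees the test rows. A common alternative, and the ordering the practice exercise later in this unit uses, is to split first and fit the scaler on the training set only. A minimal sketch, reusing the same `test_size` and `random_state`:

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)

# Split first, so the test rows play no part in computing the scaling statistics
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, random_state=42
)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learn mean and std from training data
X_test_scaled = scaler.transform(X_test)        # reuse the training statistics

print(X_train_scaled.shape, X_test_scaled.shape)
```

With this ordering the test set stays truly unseen until evaluation, which usually gives a more honest estimate of how the model generalizes.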

### Training the Logistic Regression Model

Next, we'll train our model on the training data using the `fit` method. Training a model simply means finding the best parameters (weights) that map inputs to outputs accurately.

```python
from sklearn.linear_model import LogisticRegression

# Training the logistic regression model
log_reg = LogisticRegression(max_iter=1000)
log_reg.fit(X_train, y_train)
```

The `max_iter` parameter sets the maximum number of iterations the solver may run while searching for the best coefficients. If the solver reaches this limit before converging, Scikit-Learn emits a convergence warning, so raising `max_iter` above its default of 100 is a common fix. In practice, start with the default and increase it only if you see that warning.
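One way to check whether `max_iter` was generous enough is to inspect the fitted model's `n_iter_` attribute, which records how many iterations the solver actually ran. The sketch below fits on the full scaled dataset purely to illustrate the attribute. The variable name `model` is chosen for this example.

```python
from sklearn.datasets import load_wine
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)
X = StandardScaler().fit_transform(X)

model = LogisticRegression(max_iter=1000)
model.fit(X, y)

# n_iter_ holds the number of iterations the solver used
print(model.n_iter_)
```

If the printed value is below `max_iter`, the solver stopped on its own because it converged.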

During training, the model learns to distinguish between the three wine classes, which Scikit-Learn labels 0, 1, and 2.
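With three classes, the model outputs one probability per class instead of a single sigmoid value, and the predicted label is the class with the highest probability. The sketch below illustrates this with `predict_proba`. It retrains the model so that it runs on its own, and the variable names `proba` and `predicted` are chosen for this example.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)
X = StandardScaler().fit_transform(X)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, random_state=42
)

log_reg = LogisticRegression(max_iter=1000)
log_reg.fit(X_train, y_train)

# One row per sample, one column per class; each row sums to 1
proba = log_reg.predict_proba(X_test[:5])
print(log_reg.classes_)
print(proba.round(3))

# Picking the most probable class reproduces what predict() returns
predicted = log_reg.classes_[np.argmax(proba, axis=1)]
print(predicted)
```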

### Making Predictions and Calculating Accuracy

Once the model is trained, we can use it to make predictions on the test data and evaluate its performance. We'll use the `predict` method to make predictions and the `accuracy_score` function to calculate the accuracy of our model.

```python
# Making predictions on the test data
y_pred = log_reg.predict(X_test)

# Calculating the accuracy of the model
from sklearn.metrics import accuracy_score

accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy of the Logistic Regression model: {accuracy:.2f}")  # 0.99
```

The `accuracy_score` function compares the true labels (`y_test`) with the predicted labels (`y_pred`) and returns the fraction of correct predictions. This tells us how well the model generalizes to new, unseen data. Accuracy is the simplest classification metric. We will use it in this course and cover more specialized metrics in the next one.
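To make "fraction of correct predictions" concrete, the sketch below computes accuracy by hand and compares it with `accuracy_score`. The label arrays `y_true` and `y_hat` are made up for illustration and do not come from the wine model.

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Made-up labels for illustration: 5 of the 6 predictions match
y_true = np.array([0, 1, 2, 2, 1, 0])
y_hat = np.array([0, 1, 1, 2, 1, 0])

# Accuracy is the mean of the element-wise "prediction equals truth" checks
manual_accuracy = np.mean(y_true == y_hat)
library_accuracy = accuracy_score(y_true, y_hat)

print(f"{manual_accuracy:.2f} {library_accuracy:.2f}")
```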

## Lesson Summary

Congratulations! You've just taken your first step into the world of logistic regression. Here's a quick recap of what we've covered:

  * Logistic regression is used for classification tasks.
  * Logistic regression works by fitting a logistic (sigmoid) function to the data.
  * We loaded a dataset from Scikit-Learn.
  * We split the dataset into training and testing sets.
  * We initialized and trained a logistic regression model.
  * We made predictions and calculated the model's accuracy.

Up next, you'll get hands-on experience with logistic regression in CodeSignal's environment. You'll practice loading data, splitting it, training a logistic regression model yourself, making predictions, and evaluating its performance. This hands-on practice will solidify your understanding and prepare you for more advanced topics. Happy coding!

## Diagnosing Diseases with Logistic Regression

Hey Space Explorer!

Check out how we use logistic regression to predict the class of wine. This code loads and scales the data using `StandardScaler` from `sklearn`, trains the model, and calculates the accuracy. Ready to see it in action? Click Run to find out!

```python
from sklearn.datasets import load_wine
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import StandardScaler

# Load the dataset
X, y = load_wine(return_X_y=True)

# Scale the data
scaler = StandardScaler()
X = scaler.fit_transform(X)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Train the model
log_reg = LogisticRegression(max_iter=1000)  # set max_iter to avoid convergence warning
log_reg.fit(X_train, y_train)

# Predict and calculate accuracy
y_pred = log_reg.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.2f}")

```

Great work, Space Explorer!

The code demonstrates a clear and effective application of Logistic Regression to the `wine` dataset. It correctly handles data loading, scaling (which is crucial for many machine learning algorithms, including logistic regression, for better convergence and performance), splitting into training and testing sets, model training, and finally, evaluation using accuracy.

The use of `StandardScaler` is a good practice as it standardizes features by removing the mean and scaling to unit variance, which helps the optimization algorithm converge more efficiently. Setting `max_iter=1000` is also a wise choice to prevent convergence warnings, especially with larger or more complex datasets.
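"Removing the mean and scaling to unit variance" can be checked by hand. The sketch below standardizes each column manually with NumPy and compares the result with `StandardScaler`. The variable name `X_manual` is chosen for this example.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.preprocessing import StandardScaler

X, _ = load_wine(return_X_y=True)

# Standardize by hand: subtract each column's mean, divide by its standard deviation
X_manual = (X - X.mean(axis=0)) / X.std(axis=0)

X_scaled = StandardScaler().fit_transform(X)

print(np.allclose(X_manual, X_scaled))
print(X_scaled.mean(axis=0).round(6)[:3], X_scaled.std(axis=0).round(6)[:3])
```

After scaling, every feature has a mean of roughly 0 and a standard deviation of roughly 1, so no single feature dominates the optimization just because of its units.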


## Complete the Logistic Regression Model

You're almost there, Space Explorer! Fill in the missing line of code to train your logistic regression model using the training data.


```python
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import StandardScaler

# Step 1: Load the dataset
X, y = load_wine(return_X_y=True)

# Step 2: Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=42)

# Step 3: Scale the features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# TODO: Initialize and train the logistic regression model

# TODO: Make predictions and calculate accuracy

```

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import StandardScaler

# Step 1: Load the dataset
X, y = load_wine(return_X_y=True)

# Step 2: Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=42)

# Step 3: Scale the features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Initialize and train the logistic regression model
# It's good practice to set max_iter to a sufficiently high value to ensure convergence
log_reg_model = LogisticRegression(max_iter=1000, random_state=42)
log_reg_model.fit(X_train_scaled, y_train) # Train the model on the scaled training data

# Make predictions and calculate accuracy
y_pred = log_reg_model.predict(X_test_scaled) # Make predictions on the scaled test data
accuracy = accuracy_score(y_test, y_pred) # Calculate the accuracy of the predictions

print(f"Accuracy of the Logistic Regression model: {accuracy:.2f}")
```

## Feature Scaling for Logistic Regression

Galactic Pioneer! Let's put your classification skills to the test. In this task, you'll:

  * Load the dataset.
  * Train a Linear Regression model and a Logistic Regression model.
  * Compare their accuracies.

Complete the missing pieces in the code and see if you can improve your classifier by switching from linear regression to logistic regression! You got this!

Happy coding!

```python
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.datasets import load_wine

X, y = load_wine(return_X_y=True)

# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=42)

# Initialize the Linear Regression model
linear_model = LinearRegression()

# Train the Linear Regression model
linear_model.fit(X_train, y_train)

# Make predictions with Linear Regression
# (thresholding at 0.5 yields only labels 0 and 1, so class 2 is never predicted)
lin_predictions = (linear_model.predict(X_test) > 0.5).astype(int)

# TODO: Initialize the Logistic Regression model

# TODO: Train the Logistic Regression model

# TODO: Make predictions with Logistic Regression

# Calculate accuracy for both models
lin_accuracy = accuracy_score(y_test, lin_predictions)
# TODO: calculate the logistic regression accuracy

print(f'Linear Regression Accuracy: {lin_accuracy:.2f}')
# TODO: print the logistic regression accuracy

```

```python
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.datasets import load_wine

X, y = load_wine(return_X_y=True)

# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=42)

# Initialize the Linear Regression model
linear_model = LinearRegression()

# Train the Linear Regression model
linear_model.fit(X_train, y_train)

# Make predictions with Linear Regression
# (thresholding at 0.5 yields only labels 0 and 1, so class 2 is never predicted)
lin_predictions = (linear_model.predict(X_test) > 0.5).astype(int)

# Initialize the Logistic Regression model
# (a higher max_iter and the liblinear solver are set to help convergence on unscaled features)
logistic_model = LogisticRegression(max_iter=1000, solver='liblinear')

# Train the Logistic Regression model
logistic_model.fit(X_train, y_train)

# Make predictions with Logistic Regression
log_predictions = logistic_model.predict(X_test)

# Calculate accuracy for both models
lin_accuracy = accuracy_score(y_test, lin_predictions)
log_accuracy = accuracy_score(y_test, log_predictions)

print(f'Linear Regression Accuracy: {lin_accuracy:.2f}')
print(f'Logistic Regression Accuracy: {log_accuracy:.2f}')
```

## Logistic Regression Wine Classification

Space Voyager, your task is to write a logistic regression model to classify wine types. Load and split the dataset, train the model, make predictions, and calculate the accuracy. Good luck!

It is vital to write the full code yourself to gain confidence!

```python
# TODO: make necessary imports

# TODO: Load the wine dataset and split it into features and target

# TODO: Split the dataset into training and testing sets

# TODO: Initialize and train the Logistic Regression model

# TODO: Make predictions and calculate accuracy

# TODO: Print the accuracy of the model
```


```python
# Import necessary libraries
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Load the wine dataset and split it into features (X) and target (y)
X, y = load_wine(return_X_y=True)

# Split the dataset into training and testing sets
# Using test_size=0.3 to allocate 30% of the data for testing
# random_state for reproducibility
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Initialize and train the Logistic Regression model
# max_iter is set to 1000 to ensure convergence for some datasets, as default might be too low.
# solver='liblinear' is a good choice for smaller datasets and handles L1/L2 regularization.
logistic_model = LogisticRegression(max_iter=1000, solver='liblinear')
logistic_model.fit(X_train, y_train)

# Make predictions on the test set
y_pred = logistic_model.predict(X_test)

# Calculate accuracy of the model
accuracy = accuracy_score(y_test, y_pred)

# Print the accuracy of the model
print(f'Accuracy of the Logistic Regression model: {accuracy:.2f}')

```

This code loads the wine dataset, splits it into training and testing sets, trains a Logistic Regression model, makes predictions, and then calculates and prints the accuracy. The `max_iter` and `solver` parameters are set to help the model converge on the unscaled features.