# Logistic regression tutorial

Classification Task: Predicting Soil Behavior (Binary Classification)

The task is a binary classification problem: predict whether a soil sample will exhibit contractive or dilative behavior based on the `state` parameter. If `state >= -0.05`, the soil tends to be contractive; if `state < -0.05`, it tends to be dilative. We first train the model on the training examples (feature-label pairs), then test it on unseen data to assess whether it correctly classifies the soil behavior (contractive or dilative).

## Data loading


Load the dataset and display its first few rows and information.



In [None]:
import pandas as pd
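
The dataset file is not named in this tutorial, so the following is only a minimal sketch of the loading step. It reads a small in-memory CSV in place of the real file; the column names `void_ratio` and `confining_pressure` and all values are hypothetical placeholders, and only `state` comes from the task description.

```python
import io
import pandas as pd

# Hypothetical stand-in for the real dataset file (column names and values are assumptions)
csv_text = """void_ratio,confining_pressure,state
0.85,100,0.03
0.62,200,-0.10
0.78,150,-0.02
0.55,300,-0.12
"""

# With a real file this would be pd.read_csv("<path to your dataset>")
df = pd.read_csv(io.StringIO(csv_text))

# Display the first few rows
print(df.head())

# Display column dtypes, non-null counts, and memory usage
df.info()
```
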


## Data exploration

Explore the dataset by visualizing the relationships between the features and the target variable 'state' using pair plots, a correlation matrix heatmap, and the distributions of the features.



In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

# Determine the shape of the DataFrame


In [None]:
# Generate a pair plot


In [None]:
# Calculate and display the correlation matrix as a heatmap


In [None]:
# Describe the distributions of each feature


In [None]:
# Analyze the correlation between features and the target variable 'state'
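
One possible way to rank features by their correlation with 'state' is sketched below. The DataFrame is synthetic and the feature names `feature_a` and `feature_b` are hypothetical; with the real data, only the last two statements would be needed.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in data (assumption): feature_a drives 'state', feature_b is noise
rng = np.random.default_rng(0)
n = 200
feature_a = rng.normal(size=n)
feature_b = rng.normal(size=n)
state = 0.8 * feature_a + 0.1 * rng.normal(size=n)
df = pd.DataFrame({"feature_a": feature_a, "feature_b": feature_b, "state": state})

# Pearson correlation of every feature with the target, sorted by absolute strength
target_corr = df.corr()["state"].drop("state").sort_values(key=abs, ascending=False)
print(target_corr)
```
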


## Data preparation


Remove the "gray zone" data points where 'state' is 0 and create a new binary target variable 'binary_state' based on the provided conditions. Then, drop the original 'state' column and verify the changes.



In [None]:
# Remove rows where 'state' is 0


# Create the 'binary_state' column using .loc for correct assignment

# Drop the original 'state' column

# Verify the changes
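
The preparation steps above can be sketched as follows, assuming the "provided conditions" are the thresholds from the task description (`state >= -0.05` is contractive, `state < -0.05` is dilative). The sample data, the `feature_a` column, and the 1 = contractive / 0 = dilative encoding are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical sample data; 'feature_a' is a placeholder feature
df = pd.DataFrame({
    "feature_a": [0.85, 0.62, 0.78, 0.55, 0.70],
    "state": [0.03, -0.10, 0.0, -0.12, -0.02],
})

# Remove the "gray zone" rows where 'state' is 0; .copy() avoids chained-assignment warnings
df = df[df["state"] != 0].copy()

# Create the binary target: 1 = contractive, 0 = dilative (encoding is an assumption)
df.loc[:, "binary_state"] = (df["state"] >= -0.05).astype(int)

# Drop the original 'state' column so it cannot leak into the features
df = df.drop(columns="state")

# Verify the changes
print(df.head())
print(df["binary_state"].value_counts())
```
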


## Data splitting

Split the prepared data into training, testing, and validation sets.


In [None]:
from sklearn.model_selection import train_test_split

# Define features (X) and target variable (y)

# First split: combined training/validation and testing sets

# Second split: training and validation sets
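
A minimal sketch of the two-stage split is shown below. The 60/20/20 proportions, the `random_state` values, the use of `stratify`, and the synthetic DataFrame are all assumptions; the tutorial does not prescribe them.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the prepared data (feature names are hypothetical)
rng = np.random.default_rng(42)
df = pd.DataFrame({
    "feature_a": rng.normal(size=100),
    "feature_b": rng.normal(size=100),
    "binary_state": np.tile([0, 1], 50),
})

# Define features (X) and target variable (y)
X = df.drop(columns="binary_state")
y = df["binary_state"]

# First split: hold out 20% of the data as the test set
X_train_val, X_test, y_train_val, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Second split: 25% of the remaining 80% becomes the validation set (20% overall)
X_train, X_val, y_train, y_val = train_test_split(
    X_train_val, y_train_val, test_size=0.25, random_state=42, stratify=y_train_val
)

print(len(X_train), len(X_val), len(X_test))  # 60 20 20
```

Stratifying on the target keeps the contractive/dilative ratio roughly equal across the three sets, which matters when one class is rare.
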


## Model training

Train a logistic regression model using scikit-learn.


In [None]:
from sklearn.linear_model import LogisticRegression

# Initialize the logistic regression model

# Train the model
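
As a rough illustration of this step, the sketch below fits a baseline model on synthetic data generated with `make_classification`, which stands in for the soil dataset. The variable name `logreg_model` and the settings `max_iter=1000` and `random_state=0` are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic binary classification data standing in for the soil features and labels
X, y = make_classification(
    n_samples=300, n_features=4, n_informative=2, n_redundant=0, random_state=0
)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

# Initialize the logistic regression model (max_iter raised to help convergence)
logreg_model = LogisticRegression(max_iter=1000, random_state=0)

# Train the model
logreg_model.fit(X_train, y_train)

# Quick sanity check on the validation set
val_accuracy = logreg_model.score(X_val, y_val)
print(f"Validation accuracy: {val_accuracy:.3f}")
```
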


## Model optimization

Optimize the logistic regression model using GridSearchCV with k-fold cross-validation.


In [None]:
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import LogisticRegression

# Define the parameter grid

# Create a logistic regression model

# Create the GridSearchCV object

# Fit the GridSearchCV object to the training data

# Print the best hyperparameters

# Get the best estimator
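
The grid search could look roughly like the sketch below. The grid over the regularization strength `C`, the choice of 5 folds, and accuracy as the scoring metric are assumptions, and the training data is synthetic. The name `best_logreg_model` matches the one used in the evaluation cells.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic training data standing in for X_train / y_train
X_train, y_train = make_classification(
    n_samples=300, n_features=4, n_informative=2, n_redundant=0, random_state=0
)

# Define the parameter grid (inverse regularization strength; values are assumptions)
param_grid = {"C": [0.01, 0.1, 1, 10, 100]}

# Create a logistic regression model
logreg = LogisticRegression(max_iter=1000)

# Create the GridSearchCV object with 5-fold cross-validation
grid_search = GridSearchCV(logreg, param_grid, cv=5, scoring="accuracy")

# Fit the GridSearchCV object to the training data
grid_search.fit(X_train, y_train)

# Print the best hyperparameters
print("Best hyperparameters:", grid_search.best_params_)

# Get the best estimator (refit on the full training data by default)
best_logreg_model = grid_search.best_estimator_
```
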


## Model evaluation

Evaluate the best model on the test set by generating a classification report, a confusion matrix, the ROC curve, and the AUC.



In [None]:
from sklearn.metrics import classification_report, confusion_matrix, roc_curve, auc
import matplotlib.pyplot as plt
import seaborn as sns

# Predict on the test set

# Generate the classification report
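
The cells that follow rely on `y_pred`, `y_test`, and `best_logreg_model`. A minimal sketch of how the prediction and report might be produced is given below; the data is synthetic, the model is a plain `LogisticRegression` standing in for the tuned estimator, and the class labels in `target_names` assume the encoding 0 = dilative, 1 = contractive.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

# Synthetic data and a stand-in model (assumptions; use the tuned model in practice)
X, y = make_classification(
    n_samples=300, n_features=4, n_informative=2, n_redundant=0, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)
best_logreg_model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Predict on the test set
y_pred = best_logreg_model.predict(X_test)

# Generate the classification report (precision, recall, F1, and support per class)
report = classification_report(
    y_test, y_pred, target_names=["dilative (0)", "contractive (1)"]
)
print(report)
```
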


In [None]:
# Compute and plot the confusion matrix
cm = confusion_matrix(y_test, y_pred)
plt.figure(figsize=(6, 6))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', cbar=False,
            xticklabels=['Predicted 0', 'Predicted 1'],
            yticklabels=['Actual 0', 'Actual 1'])
plt.title('Confusion Matrix')
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.show()

In [None]:
# Calculate the ROC curve and AUC
y_prob = best_logreg_model.predict_proba(X_test)[:, 1]
fpr, tpr, thresholds = roc_curve(y_test, y_prob)
roc_auc = auc(fpr, tpr)

# Print the AUC value
print(f"AUC: {roc_auc}")

In [None]:
# Plot the ROC curve
plt.figure(figsize=(6, 6))
plt.plot(fpr, tpr, color='darkorange', lw=2, label=f'ROC curve (area = {roc_auc:.2f})')
plt.plot([0, 1], [0, 1], color='navy', lw=2, linestyle='--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver Operating Characteristic (ROC) Curve')
plt.legend(loc='lower right')
plt.show()