ML Assignment - Theory

1. What is Logistic Regression, and how does it differ from Linear Regression?

Logistic Regression is a classification algorithm used for binary or multi-class classification problems. Linear Regression predicts a continuous value directly. Logistic Regression instead passes a linear combination of the features through the sigmoid function to obtain a probability between 0 and 1, then applies a threshold (commonly 0.5) to assign a class.


2. What is the mathematical equation of Logistic Regression?

The equation for Logistic Regression is:
P(Y=1 | X) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 X_1 + \beta_2 X_2 + ... + \beta_n X_n)}}
where \beta_0 is the intercept and \beta_1, \beta_2, ..., \beta_n are the coefficients.
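
Rearranging the equation above shows why the model counts as linear: the log-odds (logit) of the predicted probability is a linear function of the features. A short derivation follows, where p and z are shorthand introduced here for P(Y=1 | X) and the linear combination respectively.

```latex
\begin{aligned}
p &= \frac{1}{1 + e^{-z}}, \qquad z = \beta_0 + \beta_1 X_1 + \dots + \beta_n X_n \\
1 - p &= \frac{e^{-z}}{1 + e^{-z}} \\
\frac{p}{1 - p} &= e^{z} \\
\log\frac{p}{1 - p} &= \beta_0 + \beta_1 X_1 + \dots + \beta_n X_n
\end{aligned}
```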

3. Why do we use the sigmoid function in Logistic Regression?

The sigmoid function maps any real number to a value between 0 and 1, which makes it ideal for probability estimation. It helps in converting linear predictions into a probability score for classification.
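
As a minimal sketch of this mapping, the snippet below evaluates the sigmoid on a few scores and applies a 0.5 threshold. The function name, the sample scores, and the threshold are illustrative choices, not part of the assignment.

```python
import math

def sigmoid(z):
    """Logistic function, written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

# Illustrative linear scores (hypothetical values)
scores = [-10.0, -1.0, 0.0, 1.0, 10.0]
probabilities = [sigmoid(z) for z in scores]

# Convert probabilities to class labels with a 0.5 threshold
labels = [1 if p >= 0.5 else 0 for p in probabilities]

for z, p, label in zip(scores, probabilities, labels):
    print(f"z={z:6.1f}  p={p:.4f}  label={label}")
```

A score of 0 maps to exactly 0.5, and large positive or negative scores saturate towards 1 or 0.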

4. What is the cost function of Logistic Regression?

Logistic Regression uses the Log Loss (Binary Cross-Entropy) cost function:
J(\theta) = -\frac{1}{m} \sum_{i=1}^{m} \left[ y_i \log h_\theta(x_i) + (1 - y_i) \log (1 - h_\theta(x_i)) \right]
where h_\theta(x_i) is the predicted probability for sample i, y_i is its true label (0 or 1), and m is the number of samples.
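
The cost function can be evaluated by hand on a tiny example. The sketch below assumes four made-up labels and two sets of made-up predicted probabilities. The clipping constant eps is an implementation detail added here to avoid log(0).

```python
import math

def log_loss(y_true, y_prob, eps=1e-15):
    """Binary cross-entropy averaged over all samples."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)  # clip to keep log() finite
        total += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return -total / len(y_true)

y_true = [1, 0, 1, 0]                 # hypothetical labels
good_probs = [0.9, 0.1, 0.8, 0.2]     # confident and correct
bad_probs = [0.1, 0.9, 0.2, 0.8]      # confident and wrong

good_loss = log_loss(y_true, good_probs)
bad_loss = log_loss(y_true, bad_probs)
print(f"Loss for good predictions: {good_loss:.4f}")
print(f"Loss for bad predictions:  {bad_loss:.4f}")
```

Confident wrong predictions are penalized far more heavily than confident correct ones are rewarded, which is the behaviour the formula is designed to produce.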

5. What is Regularization in Logistic Regression? Why is it needed?

Regularization is used to prevent overfitting by adding a penalty term to the cost function. It discourages overly complex models by shrinking coefficient values.

6. Explain the difference between Lasso, Ridge, and Elastic Net regression.

Lasso (L1 Regularization): Shrinks some coefficients to zero, performing feature selection.

Ridge (L2 Regularization): Shrinks all coefficients but does not set them to zero.

Elastic Net: A combination of L1 and L2 regularization, balancing both penalties.
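
To make the three penalties concrete, here is a small sketch that computes each penalty term for a fixed coefficient vector. The coefficient values and the mixing weight l1_ratio are hypothetical, and the scaling convention (no extra factor of 1/2 on the L2 term) is one of several in use.

```python
def l1_penalty(coefs):
    """Lasso penalty: sum of absolute coefficient values."""
    return sum(abs(b) for b in coefs)

def l2_penalty(coefs):
    """Ridge penalty: sum of squared coefficient values."""
    return sum(b * b for b in coefs)

def elastic_net_penalty(coefs, l1_ratio):
    """Weighted mix of the L1 and L2 penalties (l1_ratio in [0, 1])."""
    return l1_ratio * l1_penalty(coefs) + (1.0 - l1_ratio) * l2_penalty(coefs)

coefs = [0.5, -2.0, 0.0]  # hypothetical coefficients

l1 = l1_penalty(coefs)
l2 = l2_penalty(coefs)
en = elastic_net_penalty(coefs, l1_ratio=0.5)
print("L1 penalty:", l1)
print("L2 penalty:", l2)
print("Elastic Net penalty (l1_ratio=0.5):", en)
```

Setting l1_ratio to 1 recovers the Lasso penalty and setting it to 0 recovers the Ridge penalty.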

7. When should we use Elastic Net instead of Lasso or Ridge?

Elastic Net is preferred when there are many correlated features. Lasso tends to keep one feature from a correlated group and set the others to zero, whereas Elastic Net tends to keep or drop the group together while still producing a sparse model.

8. What is the impact of the regularization parameter (λ) in Logistic Regression?

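A high λ means strong regularization: it reduces overfitting but increases bias and can lead to underfitting.
A low λ allows more flexibility but may cause overfitting.
Note that scikit-learn's LogisticRegression takes C, the inverse of the regularization strength (C = 1/λ), so a small C means strong regularization.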
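
For intuition about how λ shrinks coefficients, the sketch below borrows the closed-form updates from the simplest setting (least squares with an orthonormal design), where the L1 solution is a soft-threshold and the L2 solution is a rescaling. Logistic Regression has no such closed form, so this is an analogy rather than the actual fitting procedure. The starting coefficients are made up.

```python
def lasso_shrink(b, lam):
    """Soft-thresholding: move b towards zero by lam, stopping at zero."""
    if b > lam:
        return b - lam
    if b < -lam:
        return b + lam
    return 0.0

def ridge_shrink(b, lam):
    """Proportional shrinkage: never exactly zero unless b is zero."""
    return b / (1.0 + lam)

unregularized = [3.0, -0.5, 1.0]  # hypothetical coefficients

results = {}
for lam in (0.0, 1.0, 4.0):
    lasso = [lasso_shrink(b, lam) for b in unregularized]
    ridge = [ridge_shrink(b, lam) for b in unregularized]
    results[lam] = (lasso, ridge)
    print(f"lambda={lam}: lasso={lasso}, ridge={ridge}")
```

As λ grows, the L1 update sets more coefficients to exactly zero, while the L2 update makes all of them smaller without eliminating any.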

9. What are the key assumptions of Logistic Regression?

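Little or no multicollinearity among the independent variables.
Independent variables should be meaningful and relevant to the outcome.
A linear relationship between the independent variables and the log-odds of the outcome.
No strong outliers or highly influential points distorting the fit.
Observations are independent of one another.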


10. What are some alternatives to Logistic Regression for classification tasks?

Decision Trees
Random Forest
Support Vector Machines (SVM)
Naïve Bayes
Neural Networks


11. What are Classification Evaluation Metrics?

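Accuracy: The proportion of all predictions that are correct.
Precision: The proportion of predicted positives that are actually positive.
Recall: The proportion of actual positives that the model identifies.
F1 Score: The harmonic mean of precision and recall.
ROC-AUC: The area under the ROC curve, which plots the true positive rate against the false positive rate across thresholds.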
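
The threshold-based metrics can all be computed from the four cells of a binary confusion matrix. The counts below are invented purely to show the arithmetic.

```python
# Hypothetical confusion-matrix counts
tp, fp, fn, tn = 40, 10, 20, 30

accuracy = (tp + tn) / (tp + fp + fn + tn)
precision = tp / (tp + fp)   # of predicted positives, how many are correct
recall = tp / (tp + fn)      # of actual positives, how many are found
f1 = 2 * precision * recall / (precision + recall)

print(f"Accuracy:  {accuracy:.4f}")
print(f"Precision: {precision:.4f}")
print(f"Recall:    {recall:.4f}")
print(f"F1 Score:  {f1:.4f}")
```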


12. How does class imbalance affect Logistic Regression?

In cases of class imbalance, Logistic Regression may predict the majority class more often, so a high accuracy can be misleading. Solutions include resampling, using weighted loss functions, or relying on alternative evaluation metrics such as precision, recall, F1 Score, or the precision-recall curve.
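
A small numeric sketch shows why accuracy misleads here. It assumes a made-up dataset with 95 negatives and 5 positives and a model that always predicts the majority class. The weight formula n_samples / (n_classes * class_count) is the heuristic behind class_weight='balanced' in scikit-learn.

```python
# Hypothetical imbalanced labels: 95 negatives, 5 positives
y_true = [0] * 95 + [1] * 5
y_pred = [0] * 100  # a model that always predicts the majority class

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
recall = tp / (tp + fn)

# "Balanced" class weights: rarer classes get proportionally larger weights
n_samples = len(y_true)
classes = sorted(set(y_true))
class_weight = {c: n_samples / (len(classes) * y_true.count(c)) for c in classes}

print("Accuracy:", accuracy)
print("Recall on the minority class:", recall)
print("Balanced class weights:", class_weight)
```

The model reaches 95% accuracy without finding a single positive, which is why recall or a weighted loss is the more informative lens.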

13. What is Hyperparameter Tuning in Logistic Regression?

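Hyperparameter tuning means searching for the settings that are fixed before training, such as the regularization strength (λ, or C = 1/λ in scikit-learn), the penalty type, and the solver. Techniques like Grid Search and Random Search evaluate candidate settings with cross-validation and keep the best-performing combination.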

14. What are different solvers in Logistic Regression? Which one should be used?

lbfgs: The scikit-learn default; suitable for small to medium-sized datasets; supports L2 or no penalty.
saga: Supports L1, L2, and Elastic Net regularization.
liblinear: Good for small datasets with L1 or L2 regularization.
newton-cg: Works well with L2 regularization but does not support L1.
For large datasets, saga is a good choice, provided the features are on a similar scale.
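
Because not every solver accepts every penalty, it can help to filter combinations before running a search. The lookup table below reflects the penalty support described in the scikit-learn documentation for these four solvers as I understand it. Treat it as an assumption and check it against the installed version. The helper names are hypothetical.

```python
# Penalties each solver accepts (None means no regularization)
SUPPORTED_PENALTIES = {
    "lbfgs": {"l2", None},
    "liblinear": {"l1", "l2"},
    "newton-cg": {"l2", None},
    "saga": {"elasticnet", "l1", "l2", None},
}

def is_valid(solver, penalty):
    """Return True if the solver accepts the given penalty."""
    return penalty in SUPPORTED_PENALTIES[solver]

def valid_combinations(solvers, penalties):
    """Keep only the (solver, penalty) pairs that can be fitted."""
    return [(s, p) for s in solvers for p in penalties if is_valid(s, p)]

combos = valid_combinations(["liblinear", "lbfgs"], ["l1", "l2"])
print(combos)
```

A search grid built from every solver crossed with every penalty would include pairs such as lbfgs with L1, which cannot be fitted.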

15. How is Logistic Regression extended for multiclass classification?

Logistic Regression can be extended using:
One-vs-Rest (OvR): Trains a separate classifier for each class against the rest.
Softmax Regression (Multinomial Logistic Regression): Computes probabilities for all classes at once.
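
The two approaches turn per-class scores into probabilities differently. In the sketch below, the same made-up scores are passed through both schemes for comparison only. A real One-vs-Rest model would get its scores from separately trained classifiers, and normalizing the per-class sigmoids is one common way to obtain a distribution.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def softmax(scores):
    """Exponentiate and normalize all scores jointly."""
    shift = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def ovr_probabilities(scores):
    """Score each class independently with a sigmoid, then normalize."""
    raw = [sigmoid(s) for s in scores]
    total = sum(raw)
    return [r / total for r in raw]

scores = [2.0, 1.0, 0.1]  # hypothetical per-class linear scores

softmax_probs = softmax(scores)
ovr_probs = ovr_probabilities(scores)
print("Softmax:", [round(p, 3) for p in softmax_probs])
print("OvR:    ", [round(p, 3) for p in ovr_probs])
```

Both schemes rank the classes in the same order here, but softmax concentrates more probability on the top class because the scores compete directly.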

16. What are the advantages and disadvantages of Logistic Regression?

Advantages:
Simple and interpretable.
Works well with small datasets.
Provides probabilistic interpretation.
Disadvantages:
Assumes linear relationship in log-odds.
Sensitive to multicollinearity.
Can struggle with complex, nonlinear relationships.

17. What are some use cases of Logistic Regression?

Fraud detection
Customer churn prediction
Disease diagnosis (e.g., cancer detection)
Spam email classification

18. What is the difference between Softmax Regression and Logistic Regression?

Logistic Regression is used for binary classification.
Softmax Regression generalizes Logistic Regression for multiclass classification by computing probabilities for multiple classes.
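
One way to see the generalization is to check numerically that a two-class softmax with scores (z, 0) gives the same probability as the sigmoid of z. The test scores below are arbitrary.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def softmax(scores):
    shift = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Largest gap between the two formulas over a few arbitrary scores
max_gap = 0.0
for z in (-3.0, -0.5, 0.0, 0.5, 3.0):
    two_class = softmax([z, 0.0])[0]   # probability of class 1
    binary = sigmoid(z)
    max_gap = max(max_gap, abs(two_class - binary))
    print(f"z={z:5.1f}  softmax={two_class:.6f}  sigmoid={binary:.6f}")
```

The two columns agree to floating-point precision, so binary Logistic Regression is the special case of Softmax Regression with two classes.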

19. How do we choose between One-vs-Rest (OvR) and Softmax for multiclass classification?

OvR: Suitable for sparse datasets or when training time is a concern.
Softmax: Preferred when all classes are mutually exclusive, providing a more holistic probability distribution.

20. How do we interpret coefficients in Logistic Regression?

Each coefficient represents the change in log-odds for a one-unit increase in the corresponding feature while keeping other features constant. The odds ratio is e^\beta: a one-unit increase in the feature multiplies the odds of the event by e^\beta, so values above 1 increase the odds and values below 1 decrease them.
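
The interpretation can be checked numerically. The sketch below assumes a one-feature model with made-up values for the intercept and coefficient, and compares the odds before and after a one-unit increase in the feature.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def odds(p):
    """Convert a probability into odds."""
    return p / (1.0 - p)

beta0, beta1 = -2.0, 0.7  # hypothetical intercept and coefficient
x = 1.0

odds_before = odds(sigmoid(beta0 + beta1 * x))
odds_after = odds(sigmoid(beta0 + beta1 * (x + 1.0)))
odds_ratio = odds_after / odds_before

print(f"Odds at x={x}:      {odds_before:.4f}")
print(f"Odds at x={x + 1}:  {odds_after:.4f}")
print(f"Observed odds ratio: {odds_ratio:.4f}")
print(f"exp(beta1):          {math.exp(beta1):.4f}")
```

The observed ratio matches exp(beta1) regardless of the starting value of x, which is what makes the odds ratio a convenient summary of a coefficient.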

ML Assignment - Practical

1.

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.datasets import load_iris

# Load dataset (Example: Iris dataset)
data = load_iris()
X, y = data.data, data.target

# Split into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Apply Logistic Regression
model = LogisticRegression()
model.fit(X_train, y_train)

# Make predictions
y_pred = model.predict(X_test)

# Print model accuracy
print("Model Accuracy:", accuracy_score(y_test, y_pred))


2.

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.datasets import load_iris

# Load dataset (Example: Iris dataset)
data = load_iris()
X, y = data.data, data.target

# Split into training and testing sets
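X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Assumption: the original listing is truncated at this point. By analogy with
# exercises 3 and 4, the remainder is taken to apply L1 (Lasso) regularization.
model_l1 = LogisticRegression(penalty='l1', solver='saga', max_iter=5000)  # 'saga' supports L1 penalty
model_l1.fit(X_train, y_train)

# Make predictions
y_pred_l1 = model_l1.predict(X_test)

# Print model accuracy
print("L1 Regularization Model Accuracy:", accuracy_score(y_test, y_pred_l1))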


3.

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.datasets import load_iris

# Load dataset
data = load_iris()
X, y = data.data, data.target

# Split into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Apply Logistic Regression with L2 Regularization (Ridge)
model_l2 = LogisticRegression(penalty='l2', solver='lbfgs')  # 'lbfgs' supports L2 penalty
model_l2.fit(X_train, y_train)

# Make predictions
y_pred_l2 = model_l2.predict(X_test)

# Print model accuracy and coefficients
print("L2 Regularization Model Accuracy:", accuracy_score(y_test, y_pred_l2))
print("Model Coefficients:", model_l2.coef_)



4.

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.datasets import load_iris

# Load dataset
data = load_iris()
X, y = data.data, data.target

# Split into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Apply Logistic Regression with Elastic Net Regularization
# 'saga' supports elastic net; max_iter is raised so it can converge on unscaled features
model_en = LogisticRegression(penalty='elasticnet', solver='saga', l1_ratio=0.5, max_iter=5000)
model_en.fit(X_train, y_train)

# Make predictions
y_pred_en = model_en.predict(X_test)

# Print model accuracy
print("Elastic Net Regularization Model Accuracy:", accuracy_score(y_test, y_pred_en))



5.

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.metrics import accuracy_score
from sklearn.datasets import load_iris

# Load dataset
data = load_iris()
X, y = data.data, data.target

# Split into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Apply Logistic Regression for multiclass classification (One-vs-Rest).
# The multi_class='ovr' argument is deprecated in recent scikit-learn releases,
# so the estimator is wrapped in OneVsRestClassifier instead.
model_ovr = OneVsRestClassifier(LogisticRegression(solver='lbfgs'))
model_ovr.fit(X_train, y_train)

# Make predictions
y_pred_ovr = model_ovr.predict(X_test)

# Print model accuracy
print("OvR (One-vs-Rest) Multiclass Model Accuracy:", accuracy_score(y_test, y_pred_ovr))



6.

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import load_iris

# Load dataset
data = load_iris()
X, y = data.data, data.target

# Split into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

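# Define parameter grid as one dict per solver, because 'lbfgs' supports only the L2 penalty.
# 'saga' replaces 'liblinear' here because it handles multiclass targets natively.
param_grid = [
    {'solver': ['saga'], 'penalty': ['l1', 'l2'], 'C': [0.1, 1, 10]},
    {'solver': ['lbfgs'], 'penalty': ['l2'], 'C': [0.1, 1, 10]},
]

# Apply GridSearchCV (max_iter is raised so that 'saga' can converge on unscaled features)
grid_search = GridSearchCV(LogisticRegression(max_iter=5000), param_grid, cv=5)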
grid_search.fit(X_train, y_train)

# Print best parameters and accuracy
print("Best Parameters:", grid_search.best_params_)
print("Best Cross-Validation Accuracy:", grid_search.best_score_)



7.

import numpy as np
import pandas as pd
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import load_iris

# Load dataset
data = load_iris()
X, y = data.data, data.target

# Stratified K-Fold Cross-Validation
skf = StratifiedKFold(n_splits=5)
model = LogisticRegression()

cv_scores = cross_val_score(model, X, y, cv=skf)

# Print average cross-validation accuracy
print("Average CV Accuracy:", np.mean(cv_scores))


8.

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Load dataset from CSV file
df = pd.read_csv("data.csv")  # Replace with your actual CSV file

# Assume the last column is the target variable
X = df.iloc[:, :-1]  # Features
y = df.iloc[:, -1]   # Target variable

# Split into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Apply Logistic Regression
model = LogisticRegression()
model.fit(X_train, y_train)

# Make predictions
y_pred = model.predict(X_test)

# Print model accuracy
print("Model Accuracy:", accuracy_score(y_test, y_pred))


9.

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split, RandomizedSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import load_iris

# Load dataset
data = load_iris()
X, y = data.data, data.target

# Split into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

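# Define parameter distributions as one dict per valid penalty/solver combination:
# 'elasticnet' requires solver='saga' and an l1_ratio (illustrative values below).
# 'saga' replaces 'liblinear' here because it handles multiclass targets natively.
C_values = [0.01, 0.1, 1, 10, 100]
param_dist = [
    {'solver': ['saga'], 'penalty': ['l1', 'l2'], 'C': C_values},
    {'solver': ['saga'], 'penalty': ['elasticnet'], 'l1_ratio': [0.25, 0.5, 0.75], 'C': C_values},
]

# Apply RandomizedSearchCV (max_iter is raised so that 'saga' can converge)
random_search = RandomizedSearchCV(LogisticRegression(max_iter=5000), param_dist, cv=5, n_iter=10, random_state=42)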
random_search.fit(X_train, y_train)

# Print best parameters and accuracy
print("Best Parameters:", random_search.best_params_)
print("Best Cross-Validation Accuracy:", random_search.best_score_)


10.

from sklearn.multiclass import OneVsOneClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Load dataset
data = load_iris()
X, y = data.data, data.target

# Split into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Apply One-vs-One (OvO) Logistic Regression
model_ovo = OneVsOneClassifier(LogisticRegression())
model_ovo.fit(X_train, y_train)

# Make predictions
y_pred_ovo = model_ovo.predict(X_test)

# Print model accuracy
print("OvO Multiclass Model Accuracy:", accuracy_score(y_test, y_pred_ovo))


11.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.datasets import make_classification

# Generate synthetic binary classification dataset
X, y = make_classification(n_samples=500, n_features=10, random_state=42)

# Split into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train Logistic Regression model
model = LogisticRegression()
model.fit(X_train, y_train)

# Make predictions
y_pred = model.predict(X_test)

# Compute confusion matrix
cm = confusion_matrix(y_test, y_pred)

# Visualize confusion matrix
plt.figure(figsize=(5, 4))
sns.heatmap(cm, annot=True, fmt="d", cmap="Blues", xticklabels=["Negative", "Positive"], yticklabels=["Negative", "Positive"])
plt.xlabel("Predicted")
plt.ylabel("Actual")
plt.title("Confusion Matrix")
plt.show()


12.

from sklearn.metrics import precision_score, recall_score, f1_score
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import make_classification

# Generate synthetic binary classification dataset
X, y = make_classification(n_samples=500, n_features=10, random_state=42)

# Split into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train Logistic Regression model
model = LogisticRegression()
model.fit(X_train, y_train)

# Make predictions
y_pred = model.predict(X_test)

# Evaluate Precision, Recall, and F1-Score
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)

print(f"Precision: {precision:.2f}")
print(f"Recall: {recall:.2f}")
print(f"F1-Score: {f1:.2f}")


13.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report

# Generate an imbalanced dataset
X, y = make_classification(n_samples=1000, n_features=10, weights=[0.9, 0.1], random_state=42)

# Split into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train Logistic Regression with class weights
model = LogisticRegression(class_weight='balanced')
model.fit(X_train, y_train)

# Make predictions
y_pred = model.predict(X_test)

# Evaluate model performance
print("Accuracy:", accuracy_score(y_test, y_pred))
print("Classification Report:\n", classification_report(y_test, y_pred))



14.

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score

# Load Titanic dataset
df = pd.read_csv("titanic.csv")  # Replace with your actual dataset

# Select features and target variable
X = df[['Pclass', 'Age', 'SibSp', 'Parch', 'Fare']]
y = df['Survived']

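# Split into training and testing sets first, so preprocessing is fit on training data only
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Handle missing values (the means come from the training set)
imputer = SimpleImputer(strategy='mean')
X_train = imputer.fit_transform(X_train)
X_test = imputer.transform(X_test)

# Standardize features (the scaling statistics come from the training set)
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)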

# Train Logistic Regression model
model = LogisticRegression()
model.fit(X_train, y_train)

# Make predictions
y_pred = model.predict(X_test)

# Print accuracy
print("Model Accuracy:", accuracy_score(y_test, y_pred))


15.
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score
from sklearn.datasets import make_classification

# Generate synthetic dataset
X, y = make_classification(n_samples=500, n_features=10, random_state=42)

# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train without scaling
model_no_scaling = LogisticRegression()
model_no_scaling.fit(X_train, y_train)
accuracy_no_scaling = accuracy_score(y_test, model_no_scaling.predict(X_test))

# Apply standardization
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Train with scaling
model_scaled = LogisticRegression()
model_scaled.fit(X_train_scaled, y_train)
accuracy_scaled = accuracy_score(y_test, model_scaled.predict(X_test_scaled))

print(f"Accuracy without Scaling: {accuracy_no_scaling:.2f}")
print(f"Accuracy with Scaling: {accuracy_scaled:.2f}")


16.

from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import make_classification

# Generate synthetic dataset
X, y = make_classification(n_samples=500, n_features=10, random_state=42)

# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train Logistic Regression
model = LogisticRegression()
model.fit(X_train, y_train)

# Predict probabilities
y_prob = model.predict_proba(X_test)[:, 1]

# Compute ROC-AUC score
roc_auc = roc_auc_score(y_test, y_prob)

print("ROC-AUC Score:", roc_auc)


17.

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score

# Generate synthetic dataset
X, y = make_classification(n_samples=500, n_features=10, random_state=42)

# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train Logistic Regression with a custom regularization strength
# (C=0.5; C is the inverse of the regularization strength, not a learning rate)
model = LogisticRegression(C=0.5)
model.fit(X_train, y_train)

# Predict and evaluate accuracy
y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)

print("Model Accuracy with C=0.5:", accuracy)



18.

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import make_classification

# Generate synthetic dataset with feature names
X, y = make_classification(n_samples=500, n_features=10, random_state=42)
feature_names = [f'Feature {i}' for i in range(X.shape[1])]

# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train Logistic Regression
model = LogisticRegression()
model.fit(X_train, y_train)

# Use coefficients as a rough measure of feature importance
# (magnitudes are comparable only when features are on similar scales)
coefficients = model.coef_[0]

# Create a DataFrame ranked by absolute coefficient size
importance_df = pd.DataFrame({'Feature': feature_names, 'Coefficient': coefficients})
importance_df = importance_df.sort_values(by='Coefficient', key=abs, ascending=False)

print("Feature Importance in Logistic Regression:")
print(importance_df)


19.

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import make_classification
from sklearn.metrics import cohen_kappa_score

# Generate synthetic dataset
X, y = make_classification(n_samples=500, n_features=10, random_state=42)

# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train Logistic Regression
model = LogisticRegression()
model.fit(X_train, y_train)

# Make predictions
y_pred = model.predict(X_test)

# Compute Cohen's Kappa Score
kappa_score = cohen_kappa_score(y_test, y_pred)

print("Cohen's Kappa Score:", kappa_score)


20.

import matplotlib.pyplot as plt
from sklearn.metrics import precision_recall_curve
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import make_classification

# Generate synthetic dataset
X, y = make_classification(n_samples=500, n_features=10, random_state=42)

# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train Logistic Regression
model = LogisticRegression()
model.fit(X_train, y_train)

# Predict probabilities
y_prob = model.predict_proba(X_test)[:, 1]

# Compute Precision-Recall curve
precision, recall, _ = precision_recall_curve(y_test, y_prob)

# Plot Precision-Recall curve
plt.plot(recall, precision, marker='.')
plt.xlabel('Recall')
plt.ylabel('Precision')
plt.title('Precision-Recall Curve')
plt.show()


21.

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score

# Generate synthetic dataset
X, y = make_classification(n_samples=500, n_features=10, random_state=42)

# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# List of solvers to compare
solvers = ['liblinear', 'saga', 'lbfgs']

for solver in solvers:
    model = LogisticRegression(solver=solver, max_iter=500)
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)
    print(f"Solver: {solver}, Accuracy: {accuracy:.4f}")


22.

from sklearn.metrics import matthews_corrcoef
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import make_classification

# Generate synthetic dataset
X, y = make_classification(n_samples=500, n_features=10, random_state=42)

# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train Logistic Regression
model = LogisticRegression()
model.fit(X_train, y_train)

# Make predictions
y_pred = model.predict(X_test)

# Compute MCC
mcc = matthews_corrcoef(y_test, y_pred)

print("Matthews Correlation Coefficient (MCC):", mcc)


23.

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score
from sklearn.datasets import make_classification

# Generate synthetic dataset
X, y = make_classification(n_samples=500, n_features=10, random_state=42)

# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train Logistic Regression on raw data
model_raw = LogisticRegression()
model_raw.fit(X_train, y_train)
y_pred_raw = model_raw.predict(X_test)
accuracy_raw = accuracy_score(y_test, y_pred_raw)

# Standardize features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Train Logistic Regression on standardized data
model_scaled = LogisticRegression()
model_scaled.fit(X_train_scaled, y_train)
y_pred_scaled = model_scaled.predict(X_test_scaled)
accuracy_scaled = accuracy_score(y_test, y_pred_scaled)

print(f"Accuracy on Raw Data: {accuracy_raw:.4f}")
print(f"Accuracy on Standardized Data: {accuracy_scaled:.4f}")


24.

from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import make_classification
import numpy as np

# Generate synthetic dataset
X, y = make_classification(n_samples=500, n_features=10, random_state=42)

# Define range of C values to test
C_values = [0.001, 0.01, 0.1, 1, 10, 100]

# Perform cross-validation for each C value
best_C = None
best_score = 0

for C in C_values:
    model = LogisticRegression(C=C)
    scores = cross_val_score(model, X, y, cv=5, scoring='accuracy')
    avg_score = np.mean(scores)
    print(f"C={C}, Average Accuracy={avg_score:.4f}")

    if avg_score > best_score:
        best_score = avg_score
        best_C = C

print(f"Best C: {best_C} with Accuracy: {best_score:.4f}")


25.

import joblib
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score

# Generate synthetic dataset
X, y = make_classification(n_samples=500, n_features=10, random_state=42)

# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train Logistic Regression
model = LogisticRegression()
model.fit(X_train, y_train)

# Save the model
joblib.dump(model, 'logistic_regression_model.pkl')

# Load the model
loaded_model = joblib.load('logistic_regression_model.pkl')

# Make predictions
y_pred = loaded_model.predict(X_test)

# Evaluate accuracy
accuracy = accuracy_score(y_test, y_pred)
print("Loaded Model Accuracy:", accuracy)
