# Logistic Regression with Scikit-learn

## Task
Use scikit-learn's logistic regression to predict whether a bank customer will subscribe to a term deposit.

## Dataset
Bank Marketing Dataset (UCI Repository), loaded from the semicolon-delimited file `bank-full.csv`

## Steps
1. Load the dataset
2. Preprocess the data
3. Apply logistic regression
4. Evaluate the model

In [1]:
## 1. Load the dataset
import pandas as pd

# Load the dataset
file_path = 'bank-full.csv'
data = pd.read_csv(file_path, delimiter=';')

# Display the first few rows of the dataset
data.head()

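|   | age | job | marital | education | default | balance | housing | loan | contact | day | month | duration | campaign | pdays | previous | poutcome | y |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 58 | management | married | tertiary | no | 2143 | yes | no | unknown | 5 | may | 261 | 1 | -1 | 0 | unknown | no |
| 1 | 44 | technician | single | secondary | no | 29 | yes | no | unknown | 5 | may | 151 | 1 | -1 | 0 | unknown | no |
| 2 | 33 | entrepreneur | married | secondary | no | 2 | yes | yes | unknown | 5 | may | 76 | 1 | -1 | 0 | unknown | no |
| 3 | 47 | blue-collar | married | unknown | no | 1506 | yes | no | unknown | 5 | may | 92 | 1 | -1 | 0 | unknown | no |
| 4 | 33 | unknown | single | unknown | no | 1 | no | no | unknown | 5 | may | 198 | 1 | -1 | 0 | unknown | no |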


In [2]:
## 2. Preprocess the data

# Check for missing values
missing_values = data.isnull().sum()
missing_values

age          0
job          0
marital      0
education    0
default      0
balance      0
housing      0
loan         0
contact      0
day          0
month        0
duration     0
campaign     0
pdays        0
previous     0
poutcome     0
y            0
dtype: int64
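
The zero counts above only cover true `NaN` values. In the rows shown by `data.head()`, missing information appears as the string `unknown` (for example in `contact` and `poutcome`), which `isnull()` does not flag. The sketch below counts those placeholders on a small hand-built frame; `sample` is a hypothetical stand-in copied from the first five rows, not the full file.

```python
import pandas as pd

# Hypothetical stand-in: four columns from the first rows shown by data.head()
sample = pd.DataFrame({
    'job': ['management', 'technician', 'entrepreneur', 'blue-collar', 'unknown'],
    'education': ['tertiary', 'secondary', 'secondary', 'unknown', 'unknown'],
    'contact': ['unknown'] * 5,
    'poutcome': ['unknown'] * 5,
})

# isnull() finds no NaN values, because 'unknown' is an ordinary string
nan_counts = sample.isnull().sum()

# Count the 'unknown' placeholder per column instead
unknown_counts = (sample == 'unknown').sum()
print(unknown_counts)
```

On the full dataset the same expression would be `(data == 'unknown').sum()`. Keeping `unknown` as its own category, as the encoding step in this notebook does, is one reasonable choice among several.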

In [3]:
### Encode categorical variables

# One-hot encode categorical variables
categorical_vars = ['job', 'marital', 'education', 'default', 'housing', 'loan', 'contact', 'month', 'poutcome']
data_encoded = pd.get_dummies(data, columns=categorical_vars, drop_first=True)

# Encode the target variable 'y'
data_encoded['y'] = data_encoded['y'].map({'no': 0, 'yes': 1})

# Display the first few rows of the encoded dataset
data_encoded.head()

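|   | age | balance | day | duration | campaign | pdays | previous | y | job_blue-collar | job_entrepreneur | ... | month_jul | month_jun | month_mar | month_may | month_nov | month_oct | month_sep | poutcome_other | poutcome_success | poutcome_unknown |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 58 | 2143 | 5 | 261 | 1 | -1 | 0 | 0 | False | False | ... | False | False | False | True | False | False | False | False | False | True |
| 1 | 44 | 29 | 5 | 151 | 1 | -1 | 0 | 0 | False | False | ... | False | False | False | True | False | False | False | False | False | True |
| 2 | 33 | 2 | 5 | 76 | 1 | -1 | 0 | 0 | False | True | ... | False | False | False | True | False | False | False | False | False | True |
| 3 | 47 | 1506 | 5 | 92 | 1 | -1 | 0 | 0 | True | False | ... | False | False | False | True | False | False | False | False | False | True |
| 4 | 33 | 1 | 5 | 198 | 1 | -1 | 0 | 0 | False | False | ... | False | False | False | True | False | False | False | False | False | True |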
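
To make the effect of `drop_first=True` concrete, here is a minimal sketch on a toy column. The frame `toy` is hypothetical, and the category `divorced` is an assumption added for illustration (it does not appear in the rows shown above). The dropped category becomes the implicit baseline: a row with all remaining indicators set to `False` belongs to it.

```python
import pandas as pd

# Hypothetical toy column with three categories
toy = pd.DataFrame({'marital': ['married', 'single', 'divorced', 'married']})

# Without drop_first: one indicator column per category
full = pd.get_dummies(toy, columns=['marital'])

# With drop_first=True: the first category in sorted order ('divorced') is dropped
reduced = pd.get_dummies(toy, columns=['marital'], drop_first=True)

print(list(full.columns))
print(list(reduced.columns))
```

Dropping one level per variable removes a redundant column, since the full set of indicators for a variable always sums to one.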


In [4]:
## 3. Apply logistic regression
### Split the dataset into training and testing sets

from sklearn.model_selection import train_test_split

# Split the dataset into features (X) and target (y)
X = data_encoded.drop('y', axis=1)
y = data_encoded['y']

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Display the shapes of the training and testing sets
(X_train.shape, X_test.shape, y_train.shape, y_test.shape)

((36168, 42), (9043, 42), (36168,), (9043,))
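
The split above is purely random. When the positive class is a minority (in this dataset `yes` makes up roughly 12% of rows, a figure not shown in this notebook), the positive rate can drift between the training and test sets. The sketch below shows a stratified split on synthetic data; `X_demo` and `y_demo` are hypothetical stand-ins, not the bank data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 1000 rows with 10% positives (not the bank data)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(1000, 3))
y_demo = np.array([1] * 100 + [0] * 900)

# stratify=y_demo keeps the positive rate the same in both splits
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

print(y_tr.mean(), y_te.mean())
```

Applying `stratify=y` in the notebook itself would produce a different test set, so the metrics reported later would no longer match.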

In [5]:
### Train the logistic regression model

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Train on the unscaled features (this triggers the convergence warning shown below)
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# Predict on the testing set
y_pred = model.predict(X_test)

# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)

(accuracy, precision, recall, f1)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


(0.8960521950680084,
 0.6295025728987993,
 0.3363886342804766,
 0.43847072879330945)
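
All four scores above derive from the confusion-matrix counts. The sketch below shows the relationship on a tiny hand-made example; the label lists `y_true_demo` and `y_pred_demo` are hypothetical and unrelated to the bank data.

```python
from sklearn.metrics import (accuracy_score, confusion_matrix,
                             precision_score, recall_score)

# Hypothetical labels: 1 = subscribed, 0 = did not subscribe
y_true_demo = [0, 0, 0, 0, 1, 1, 1, 0]
y_pred_demo = [0, 0, 1, 0, 1, 0, 0, 0]

# For binary labels, ravel() yields the counts in the order tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true_demo, y_pred_demo).ravel()

# Precision: share of predicted positives that are correct
precision_demo = tp / (tp + fp)
# Recall: share of actual positives that were found
recall_demo = tp / (tp + fn)
# Accuracy: share of all predictions that are correct
accuracy_demo = (tp + tn) / (tp + tn + fp + fn)

print(tn, fp, fn, tp)
print(precision_demo, recall_demo, accuracy_demo)
```

The hand-computed values agree with `precision_score`, `recall_score`, and `accuracy_score` on the same labels.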

In [6]:
### Scale the data and retrain the model
from sklearn.preprocessing import StandardScaler

# Scale the data
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Retrain the logistic regression model with increased iterations
model = LogisticRegression(max_iter=2000)
model.fit(X_train_scaled, y_train)

# Predict on the testing set
y_pred = model.predict(X_test_scaled)

# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)

(accuracy, precision, recall, f1)

(0.8988167643481145,
 0.6543859649122807,
 0.34188817598533455,
 0.44912703190848885)
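
The scale-then-fit steps above can also be bundled into a single estimator. This is a minimal sketch using scikit-learn's `make_pipeline` on synthetic data from `make_classification`. The names `X_demo`, `y_demo`, and `clf` are hypothetical, and this is an alternative arrangement, not what the notebook ran.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, imbalanced stand-in for the encoded bank data
X_demo, y_demo = make_classification(
    n_samples=2000, n_features=10, weights=[0.9, 0.1], random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# fit() learns the scaling statistics from the training data only,
# then trains the classifier on the scaled features
clf = make_pipeline(StandardScaler(), LogisticRegression(max_iter=2000))
clf.fit(X_tr, y_tr)

# predict() applies the fitted scaler before the model
y_hat = clf.predict(X_te)
print(clf.score(X_te, y_te))
```

A pipeline makes it harder to fit the scaler on test data by accident, and it can be passed directly to utilities such as `cross_val_score`.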

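## 4. Evaluation Results
Scores for the scaled model on the held-out test set (9,043 rows):
- **Accuracy**: 0.8988
- **Precision**: 0.6544
- **Recall**: 0.3419
- **F1 Score**: 0.4491

Scaling removed the convergence warning but changed the scores only slightly (accuracy from 0.8961 to 0.8988, F1 from 0.4385 to 0.4491). Recall remains low: the model identifies only about a third of the customers who actually subscribe, so accuracy alone overstates how useful it is for finding subscribers.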
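
As a consistency check, the F1 score can be recomputed from the reported precision and recall, since F1 is their harmonic mean. The sketch below uses the unrounded values printed for the scaled model.

```python
# Precision and recall as printed for the scaled model
precision = 0.6543859649122807
recall = 0.34188817598533455

# F1 is the harmonic mean of precision and recall
f1 = 2 * precision * recall / (precision + recall)
print(round(f1, 4))
```

The harmonic mean sits closer to the smaller of the two values, which is why F1 is well below precision here.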