# Build a Simple Machine Learning Classification Model Using Scikit-Learn

# Introduction

In this lab, we will walk through the process of building a simple machine learning classification model using the popular Python library **scikit-learn**. The dataset, **'loan_small.csv'**, is used to demonstrate the complete workflow, from data preprocessing through model evaluation.

This lab is designed to provide a step-by-step guide to understanding the core concepts and techniques involved in building a machine learning model for classification tasks. We will cover the following steps:

1. **Loading and exploring the dataset:** Understand the structure and contents of the data.
2. **Preprocessing the data:** Handle missing values and encode categorical variables for model compatibility.
3. **Splitting the data:** Divide the dataset into training and testing sets for evaluation.
4. **Scaling features:** Normalize feature values to improve model performance.
5. **Training a Logistic Regression model:** Build and train a logistic regression model to classify loan approvals.
6. **Saving the model:** Use `joblib` to save the trained model for future reuse.
7. **Making predictions:** Load the saved model and predict loan approvals for the test set.
8. **Evaluating performance:** Assess the model using metrics such as accuracy, precision, recall, and F1-score.

By the end of this lab, you will have a clear understanding of how to use **scikit-learn** to build and evaluate a machine learning classification model. Additionally, you will gain insights into the importance of data preprocessing, feature scaling, and model evaluation in the machine learning pipeline.



# Step-by-Step Guide

## Step 1: Load the Data

The first step in any machine learning project is loading the dataset. This involves reading the data file into a suitable format, such as a DataFrame. Inspecting the structure of the dataset is critical for understanding the data types, the presence of missing values, and overall data quality.

- **Objective:** Understand the dataset's structure and identify initial issues (e.g., missing values or categorical data).
- **Key Considerations:** Ensure that the data is loaded without errors and explore its columns, data types, and summary statistics.


In [None]:
import pandas as pd

# Load the dataset
data = pd.read_csv('loan_small.csv')

# Drop the Loan_ID column (not useful for prediction)
data = data.drop(columns=['Loan_ID'])

# Display the first few rows
print(data.head())


   Gender  ApplicantIncome  CoapplicantIncome  LoanAmount   Area Loan_Status
0     NaN           5849.0                0.0         NaN  urban           Y
1    Male           4583.0                NaN       128.0   semi           N
2    Male           3000.0                0.0        66.0    NaN           Y
3  Female           2583.0             2358.0       120.0   semi         NaN
4    Male              NaN                0.0       141.0  urban           Y
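
The `head()` output already shows a gap in every column. As a minimal sketch of the exploration described in this step, the snippet below rebuilds only the five rows printed above (typed in by hand, so it does not need the CSV file) and inspects data types, missing-value counts, and summary statistics. The variable names `sample` and `missing_per_column` are illustrative and not part of the lab.

```python
import numpy as np
import pandas as pd

# The five rows printed by data.head() above, typed in by hand
sample = pd.DataFrame({
    "Gender": [np.nan, "Male", "Male", "Female", "Male"],
    "ApplicantIncome": [5849.0, 4583.0, 3000.0, 2583.0, np.nan],
    "CoapplicantIncome": [0.0, np.nan, 0.0, 2358.0, 0.0],
    "LoanAmount": [np.nan, 128.0, 66.0, 120.0, 141.0],
    "Area": ["urban", "semi", np.nan, "semi", "urban"],
    "Loan_Status": ["Y", "N", "Y", np.nan, "Y"],
})

# Column data types: text columns vs. numeric columns
print(sample.dtypes)

# Number of missing values in each column
missing_per_column = sample.isna().sum()
print(missing_per_column)

# Summary statistics for the numeric columns
print(sample.describe())
```

On the full dataset, the same three calls can be run on `data` directly.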


## Step 2: Data Preprocessing

Preprocessing prepares the data for model training by addressing inconsistencies or inadequacies in raw data.

1. **Handle Missing Values:**  
   Missing data can degrade model performance, and most scikit-learn estimators reject `NaN` values outright. Imputation fills these gaps by replacing missing values with the column's mean, median, or mode. scikit-learn's `SimpleImputer` implements these strategies.

2. **Encode Categorical Variables:**  
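   Most machine learning models require numerical input. Text-based categorical columns, such as "Gender" or "Area," must be converted into numerical form using techniques like label encoding or one-hot encoding. Label encoding assigns an arbitrary integer to each category, which a linear model reads as an ordering; one-hot encoding avoids this for features with more than two unordered categories.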

3. **Feature Engineering (if required):**  
   Derive new features or transform existing ones to improve model performance.

- **Objective:** Ensure the data is clean, consistent, and fully numeric for modeling.
- **Key Considerations:** Choose imputation strategies and encoding methods appropriate for the dataset.


In [None]:
from sklearn.preprocessing import LabelEncoder
from sklearn.impute import SimpleImputer

# Impute missing values with the most frequent value
imputer = SimpleImputer(strategy="most_frequent")
data_imputed = pd.DataFrame(imputer.fit_transform(data), columns=data.columns)

# Encode categorical variables
label_encoders = {}
for col in ['Gender', 'Area', 'Loan_Status']:
    label_encoders[col] = LabelEncoder()
    data_imputed[col] = label_encoders[col].fit_transform(data_imputed[col])

# Convert numerical columns back to float
data_imputed[["ApplicantIncome", "CoapplicantIncome", "LoanAmount"]] = data_imputed[
    ["ApplicantIncome", "CoapplicantIncome", "LoanAmount"]
].astype(float)

# Display the cleaned dataset
print(data_imputed.head())

   Gender  ApplicantIncome  CoapplicantIncome  LoanAmount  Area  Loan_Status
0       1           5849.0                0.0        17.0     2            1
1       1           4583.0                0.0       128.0     1            0
2       1           3000.0                0.0        66.0     1            1
3       0           2583.0             2358.0       120.0     1            1
4       1           1299.0                0.0       141.0     2            1
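
The cell above applies a single strategy (`most_frequent`) to every column, including the target `Loan_Status`. One possible alternative, shown here only as a sketch on the five rows from Step 1, is to drop rows with a missing label, impute numeric columns with the median and categorical columns with the mode, and one-hot encode the categorical features with `pd.get_dummies`. The choice of median imputation and the `TotalIncome` feature are assumptions made for illustration, not part of the original lab.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# The five rows shown in Step 1, typed in by hand
sample = pd.DataFrame({
    "Gender": [np.nan, "Male", "Male", "Female", "Male"],
    "ApplicantIncome": [5849.0, 4583.0, 3000.0, 2583.0, np.nan],
    "CoapplicantIncome": [0.0, np.nan, 0.0, 2358.0, 0.0],
    "LoanAmount": [np.nan, 128.0, 66.0, 120.0, 141.0],
    "Area": ["urban", "semi", np.nan, "semi", "urban"],
    "Loan_Status": ["Y", "N", "Y", np.nan, "Y"],
})
numeric_cols = ["ApplicantIncome", "CoapplicantIncome", "LoanAmount"]
categorical_cols = ["Gender", "Area"]

# A row without a label cannot be used for supervised training
sample = sample.dropna(subset=["Loan_Status"]).reset_index(drop=True)

# Median for numeric columns, most frequent value for categorical columns
num_imputer = SimpleImputer(strategy="median")
cat_imputer = SimpleImputer(strategy="most_frequent")
sample[numeric_cols] = num_imputer.fit_transform(sample[numeric_cols])
sample[categorical_cols] = cat_imputer.fit_transform(sample[categorical_cols])

# One-hot encode the categorical features (one indicator column per category)
encoded = pd.get_dummies(sample, columns=categorical_cols)

# Hypothetical engineered feature: combined household income
encoded["TotalIncome"] = encoded["ApplicantIncome"] + encoded["CoapplicantIncome"]

print(encoded)
```

In a real project the imputers would be fitted on the training split only, so that statistics from the test set do not influence preprocessing.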


## Step 3: Train-Test Split

Dividing the dataset into training and testing sets is essential for evaluating a model's generalization capability. A common split ratio is 70:30 or 80:20.

- **Training Set:** Used to train the model.
- **Test Set:** Used to evaluate how well the model performs on unseen data.

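- **Objective:** Estimate how well the model generalizes by measuring its performance on data it never saw during training, which makes overfitting visible.
- **Key Considerations:** Choose an appropriate split ratio, shuffle the data before splitting (the default in `train_test_split`), and stratify on the target when classes are imbalanced so that both sets keep similar class proportions.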


In [None]:
from sklearn.model_selection import train_test_split

# Define features (X) and target (y)
X = data_imputed.drop(columns=["Loan_Status"])
y = data_imputed["Loan_Status"]

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Check the shapes of the splits
print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)


(12, 5) (4, 5) (12,) (4,)
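
To illustrate what `stratify=y` does, here is a small sketch on synthetic labels. The 16-row size matches the shapes printed above, but the 12-to-4 class ratio and the names `X_demo` and `y_demo` are assumptions chosen for the example, not values taken from `loan_small.csv`.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for a 16-row dataset: 12 approvals (1) and 4 rejections (0)
X_demo = np.arange(16).reshape(-1, 1)
y_demo = np.array([1] * 12 + [0] * 4)

# Same settings as the lab's split; stratify keeps the class ratio in both parts
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# np.bincount returns [count of class 0, count of class 1]
print("train class counts:", np.bincount(y_tr))
print("test class counts: ", np.bincount(y_te))
```

Both parts keep the 1:3 ratio of the full label vector. Without `stratify`, a test set this small could easily end up with only one class.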


## Step 4: Scale the Data

Feature scaling puts all input features on a comparable scale so that no feature dominates simply because of its units. Algorithms like regularized logistic regression and support vector machines are sensitive to feature magnitudes.

- **Standard Scaling:** Centers the data to a mean of 0 with a standard deviation of 1.
- **Min-Max Scaling:** Scales data to a fixed range, usually [0, 1].

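- **Objective:** Improve model convergence and performance by bringing feature values onto a common scale.
- **Key Considerations:** Fit the scaler on the training set only and reuse it to transform the test set, so that no test-set statistics leak into training. Scaling is normally applied to numeric features, not categorical data; for simplicity, the cell below scales every column, including the label-encoded `Gender` and `Area`.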


In [None]:
from sklearn.preprocessing import StandardScaler

# Scale the features: fit on the training set only, then reuse the fitted scaler on the test set
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
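
The key consideration about scaling only numeric features can be sketched with `ColumnTransformer`, which applies `StandardScaler` to the selected columns and passes the rest through unchanged. The five rows below are the cleaned rows printed in Step 2; the names `X_demo` and `scale_numeric` are illustrative. The snippet also checks the standard-scaling formula, z = (x - mean) / std, by hand.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler

# Feature columns of the five cleaned rows printed in Step 2
X_demo = pd.DataFrame({
    "Gender": [1, 1, 1, 0, 1],
    "ApplicantIncome": [5849.0, 4583.0, 3000.0, 2583.0, 1299.0],
    "CoapplicantIncome": [0.0, 0.0, 0.0, 2358.0, 0.0],
    "LoanAmount": [17.0, 128.0, 66.0, 120.0, 141.0],
    "Area": [2, 1, 1, 1, 2],
})
numeric_cols = ["ApplicantIncome", "CoapplicantIncome", "LoanAmount"]

# Scale the numeric columns; leave the encoded categorical columns untouched
scale_numeric = ColumnTransformer(
    transformers=[("num", StandardScaler(), numeric_cols)],
    remainder="passthrough",
)
# Output column order: the three scaled numeric columns, then Gender, then Area
X_demo_scaled = scale_numeric.fit_transform(X_demo)

# Standard scaling by hand: z = (x - mean) / std, using the population std (ddof=0)
manual = (X_demo[numeric_cols] - X_demo[numeric_cols].mean()) / X_demo[numeric_cols].std(ddof=0)

print(np.round(X_demo_scaled, 2))
print(np.allclose(manual.to_numpy(), X_demo_scaled[:, :3]))
```

As with `StandardScaler` on its own, the transformer would be fitted on the training split and then applied to the test split with `transform`.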

## Step 5: Train Logistic Regression Model

After preparing the data, the next step is to train a logistic regression model. Logistic regression is a simple yet effective algorithm for binary classification tasks. It calculates probabilities for each class and assigns the input data to the class with the highest probability.

- **Objective:** Build a model that learns patterns from the training data and generalizes well to unseen data.
- **Key Considerations:** Help the solver converge by training on scaled data and, if scikit-learn emits a `ConvergenceWarning`, by raising `max_iter` (default 100).


In [None]:
from sklearn.linear_model import LogisticRegression

# Initialize the model
model = LogisticRegression(random_state=42)

# Train the model
model.fit(X_train_scaled, y_train)
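
To make the "probability, then class" description concrete, the sketch below fits a logistic regression on a tiny synthetic one-feature dataset (not the loan data) and reproduces `predict_proba` by applying the sigmoid function to the model's linear score. The names `X_toy`, `y_toy`, and `clf` are illustrative, and `max_iter=200` is an arbitrary value chosen for the example.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Tiny synthetic dataset: negative values belong to class 0, positive to class 1
X_toy = np.array([[-2.0], [-1.0], [-0.5], [0.5], [1.0], [2.0]])
y_toy = np.array([0, 0, 0, 1, 1, 1])

clf = LogisticRegression(max_iter=200)
clf.fit(X_toy, y_toy)

# Linear score w * x + b for each row
scores = clf.decision_function(X_toy)

# The sigmoid turns the score into the probability of class 1
proba_manual = 1.0 / (1.0 + np.exp(-scores))
proba_sklearn = clf.predict_proba(X_toy)[:, 1]

# The predicted label is the class whose probability exceeds 0.5
labels_manual = (proba_manual > 0.5).astype(int)

print(np.round(proba_sklearn, 3))
print(labels_manual)
```

For a binary problem, `predict` is equivalent to checking whether the class-1 probability exceeds 0.5.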

## Step 6: Save the Model

Once the model is trained, it can be saved for future use without needing to retrain it. Tools like `joblib` or `pickle` in Python are commonly used to serialize and save models.

- **Objective:** Save the trained model to reuse it for predictions or deployment without retraining.
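- **Key Considerations:** Store the model in a location where it can be easily loaded for future predictions, and save any fitted preprocessing objects (such as the scaler) alongside it, since new data must be transformed the same way before prediction. Only load model files from trusted sources, because unpickling can execute arbitrary code.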


In [None]:
import joblib

# Save the model to a file
joblib.dump(model, 'logistic_model.pkl')
print("Model saved successfully.")

Model saved successfully.
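
The saved file above contains only the classifier, so the fitted scaler has to be kept separately. One possible way to keep both together, sketched here on synthetic data, is to wrap the scaler and the model in a scikit-learn pipeline and save that single object. The data, the file name `loan_pipeline.joblib`, and the variable names are assumptions for illustration only.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, unscaled features loosely shaped like an income and a loan amount
rng = np.random.default_rng(0)
X_raw = rng.normal(loc=[4000.0, 120.0], scale=[1500.0, 40.0], size=(40, 2))
y_raw = (X_raw[:, 0] > 4000.0).astype(int)

# The pipeline scales the data and then fits the classifier in one step
pipeline = make_pipeline(StandardScaler(), LogisticRegression())
pipeline.fit(X_raw, y_raw)

# Save and reload the whole pipeline (a temporary directory keeps the demo tidy)
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "loan_pipeline.joblib")
    joblib.dump(pipeline, path)
    restored = joblib.load(path)

# The restored pipeline accepts raw features and applies the stored scaling itself
same_predictions = np.array_equal(restored.predict(X_raw), pipeline.predict(X_raw))
print(same_predictions)
```

With this design, prediction code never has to remember to call the scaler, which removes one common source of mismatched preprocessing.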


## Step 7: Make Predictions

Using the saved model, predictions can be made on the test data. This involves loading the model, applying it to new inputs, and evaluating its predictions.

- **Objective:** Use the trained model to predict outcomes on unseen data and assess its performance.
- **Key Considerations:** Ensure that the test data undergoes the same preprocessing steps as the training data.


In [None]:
from sklearn.metrics import accuracy_score

# Load the saved model
loaded_model = joblib.load('logistic_model.pkl')

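# Predict on the test set. The model was trained on scaled features, so it
# must be given X_test_scaled; the raw X_test holds values far outside the
# range the model was trained on.
# (The recorded output below predicts class 0 for every row, which is
# consistent with a run on the unscaled X_test, so a rerun may differ.)
y_pred = loaded_model.predict(X_test_scaled)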

# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)

# Display predictions and accuracy
predictions = pd.DataFrame({"Actual": y_test.values, "Predicted": y_pred})
print(predictions)
print(f"Accuracy: {accuracy:.2f}")


   Actual  Predicted
0       0          0
1       1          0
2       1          0
3       1          0
Accuracy: 0.25




## Step 8: Evaluation Report

Evaluating the model involves calculating metrics like accuracy, precision, recall, and F1-score to understand how well the model is performing. Tools like `classification_report` from `sklearn.metrics` provide detailed performance metrics for each class.

- **Objective:** Analyze the model's strengths and weaknesses to identify areas for improvement.
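- **Key Considerations:** Use multiple metrics to get a comprehensive understanding of model performance, especially with imbalanced datasets. With only four test samples, as in this lab, each metric is very coarse and should be read as an illustration rather than a reliable estimate.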


In [None]:
from sklearn.metrics import classification_report

# Example usage of classification_report
print("Classification Report:")
print(classification_report(y_test, y_pred))


Classification Report:
              precision    recall  f1-score   support

           0       0.25      1.00      0.40         1
           1       0.00      0.00      0.00         3

    accuracy                           0.25         4
   macro avg       0.12      0.50      0.20         4
weighted avg       0.06      0.25      0.10         4
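
To show where the numbers in the report come from, the sketch below recomputes them from the four actual and predicted labels printed in Step 7. The variable names are illustrative; `zero_division=0` is set explicitly so that the undefined precision for class 1 (no row was predicted as 1) is reported as 0 without a warning.

```python
from sklearn.metrics import confusion_matrix, precision_recall_fscore_support

# Actual and predicted labels copied from the Step 7 output
y_true = [0, 1, 1, 1]
y_hat = [0, 0, 0, 0]

# Rows are actual classes, columns are predicted classes
cm = confusion_matrix(y_true, y_hat)
print(cm)

# Metrics for class 0, treating class 0 as the "positive" class
tp = cm[0, 0]  # actual 0, predicted 0
fp = cm[1, 0]  # actual 1, predicted 0
fn = cm[0, 1]  # actual 0, predicted 1
precision_0 = tp / (tp + fp)
recall_0 = tp / (tp + fn)
f1_0 = 2 * precision_0 * recall_0 / (precision_0 + recall_0)
print(precision_0, recall_0, round(f1_0, 2))

# The same per-class metrics straight from scikit-learn
precision, recall, f1, support = precision_recall_fscore_support(
    y_true, y_hat, zero_division=0
)
print(precision, recall, f1, support)
```

The macro average in the report is the plain mean of the per-class values, while the weighted average weights each class by its support.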

