# Logistic Regression

Logistic Regression is used as a simple, interpretable baseline model that estimates the probability that a vehicle service is justified, based on features including the service price, vehicle make, model, year, and vehicle category.

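In this project, Logistic Regression models the probability of a justified service by applying the logistic (sigmoid) function to a linear combination of the input features:

$$
P(y = 1 \mid \mathbf{x}) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x_1 + \dots + \beta_n x_n)}}
$$

where $y = 1$ denotes a justified service, $x_1, \dots, x_n$ are the encoded input features, and $\beta_0, \dots, \beta_n$ are the fitted coefficients.
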
## Logistic Regression Significance
- Interpretability: 
    - Coefficients indicate how each feature (price, make, model, year, category) influences the likelihood of service justification.
- Confidence Scores: 
    - Outputs a probability between 0 and 1 that can serve as a confidence score for decision-making; how well calibrated it is should be checked on held-out data.
- Baseline Benchmark: 
    - Serves as a simple benchmark to compare against more complex models.
- Efficiency: 
    - Fast and scalable even with multiple categorical and numerical features.

## Important Considerations for the Design of the Model
- Handling Categorical Features: 
    - Vehicle make, model, and category must be encoded (e.g., via one-hot encoding or target encoding) to be usable by the model.
- Feature Scaling: 
    - Although Logistic Regression does not strictly require feature scaling, scaling numeric features like price and year helps the solver converge and keeps the default regularization from penalizing features unevenly.
- Linear Assumption: 
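    - Assumes a linear relationship between the features and the log-odds of the outcome, which may be an oversimplification, for example when justification depends on how far a price sits from a typical value rather than on the price itself.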
- Sensitivity to Outliers: 
    - Outliers, particularly in service price or vehicle year, can distort the fitted coefficients and degrade model performance.
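
One common way to limit the influence of extreme prices is to clip them to quantile bounds learned from the training split only. The sketch below is one possible approach, not the method used in this notebook. The helper `clip_to_quantiles`, the column name `price`, and the bounds are hypothetical choices for illustration.

```python
import pandas as pd

def clip_to_quantiles(train, test, column, lower=0.01, upper=0.99):
    """Clip `column` in both frames to quantile bounds learned from `train` only."""
    lo, hi = train[column].quantile([lower, upper])
    # assign() returns new frames, so the inputs are left unmodified.
    train = train.assign(**{column: train[column].clip(lo, hi)})
    test = test.assign(**{column: test[column].clip(lo, hi)})
    return train, test, (lo, hi)

# Tiny hypothetical example; the wide bounds are chosen only so that
# clipping is visible on five rows.
train_demo = pd.DataFrame({"price": [100.0, 120.0, 130.0, 150.0, 5000.0]})
test_demo = pd.DataFrame({"price": [90.0, 140.0, 9000.0]})
train_clipped, test_clipped, bounds = clip_to_quantiles(
    train_demo, test_demo, "price", lower=0.0, upper=0.75
)
print(bounds)
print(test_clipped)
```

Learning the bounds from the training rows alone avoids leaking information about the test set into preprocessing.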

---





In [2]:
# Imports for data loading, preprocessing, and modelling
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer

In [22]:
data = pd.read_csv(r'C:\Users\jackl\Downloads\t316_completed_fleet_tix_20250424.csv')


# Calculate the median price for each combination of TaskName, VMakeModel, and VYear
median_prices = data.groupby(['TaskName', 'VMakeModel', 'VYear'])['PriceIncGSTRaw'].median().reset_index()
median_prices = median_prices.rename(columns={'PriceIncGSTRaw': 'MedianPrice'})

# Merge the median prices back with the original data
data = pd.merge(data, median_prices, on=['TaskName', 'VMakeModel', 'VYear'], how='left')
print(data.head())

# Justified if the price is within 10% of the median price for that task, make/model, and year
lower_bound = data['MedianPrice'] * 0.90
upper_bound = data['MedianPrice'] * 1.1

# Create the "Justified" column (1 = price inside the band, 0 = outside)
data['Justified'] = ((data['PriceIncGSTRaw'] >= lower_bound) & (data['PriceIncGSTRaw'] <= upper_bound)).astype(int)

print(data.head())




   FCID  BookingID BCreatedDateAEST  BTicketID BTicketType  \
0     1     463259       17/06/2021     708763      Capped   
1     2    1360052       11/01/2024    2122072      Capped   
2     1    1058706       19/10/2022    1633633      Repair   
3     2    1078043       11/11/2022    1664447     Logbook   
4     2    1868175       30/07/2024    3101426      Capped   

               TaskName  IsCustomService  IsCustomRepair  PriceIncGSTRaw  \
0    Capped Price - 30K                0               0          180.00   
1    Capped Price - 50K                0               0          315.90   
2  Replace Wiper Blades                0               0          120.00   
3   Logbook - 60K / 48m                0               0          462.10   
4    Capped Price - 30K                0               0          359.21   

                  VYMM      VMakeModel       VMake  VYear  BShopID  \
0  2019 TOYOTA COROLLA  TOYOTA COROLLA      TOYOTA   2019    17885   
1      2021 MAZDA CX-5      MA

In [23]:
# Select features
features = ['PriceIncGSTRaw', 'VMakeModel', 'VYear']
target = 'Justified'

X = data[features]
y = data[target]

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Define transformers
categorical_features = ['VMakeModel']
numerical_features = ['PriceIncGSTRaw', 'VYear']

categorical_transformer = OneHotEncoder(handle_unknown='ignore')
numerical_transformer = StandardScaler()

preprocessor = ColumnTransformer(
    transformers=[
        ('num', numerical_transformer, numerical_features),
        ('cat', categorical_transformer, categorical_features)
    ])

# Create pipeline
pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', LogisticRegression(solver='liblinear'))
])

# Train the pipeline
pipeline.fit(X_train, y_train)

# Evaluate
print(f"Training Accuracy: {pipeline.score(X_train, y_train):.2f}")
print(f"Test Accuracy: {pipeline.score(X_test, y_test):.2f}")

# Predicted probability of the positive class (Justified = 1), used as a confidence score
confidence_scores = pipeline.predict_proba(X_test)[:, 1]
print("\nExample Confidence Scores:")
print(confidence_scores[:10])



Training Accuracy: 0.59
Test Accuracy: 0.59

Example Confidence Scores:
[0.54000526 0.45405482 0.59096275 0.57025494 0.6735494  0.55469473
 0.44722101 0.49611969 0.59764256 0.47318586]
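
An accuracy of 0.59 is hard to judge without the class balance: if roughly 59% of records share one label, always predicting that label would score about the same. The sketch below shows one way to make that comparison using scikit-learn's `DummyClassifier`. It runs on synthetic data, so the features, the label rule, and the names `X_demo` and `y_demo` are assumptions for illustration and say nothing about the actual fleet data.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (hypothetical): two numeric features and a label
# that is only weakly related to the first feature.
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(1000, 2))
y_demo = (0.5 * X_demo[:, 0] + rng.normal(size=1000) > -0.3).astype(int)

Xtr, Xte, ytr, yte = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# Baseline that always predicts the most frequent training label.
baseline = DummyClassifier(strategy="most_frequent").fit(Xtr, ytr)
logreg = LogisticRegression().fit(Xtr, ytr)

baseline_acc = baseline.score(Xte, yte)
logreg_acc = logreg.score(Xte, yte)
print(f"Share of positive labels (train): {ytr.mean():.2f}")
print(f"Majority-class accuracy (test):   {baseline_acc:.2f}")
print(f"Logistic Regression accuracy:     {logreg_acc:.2f}")
```

Running the same comparison on the notebook's own train/test split would show whether the pipeline improves on the majority-class rate at all.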
