### Business Intelligence and analysis

**The aim:** predict whether a vehicle will fail, using the following features: model, year, price, transmission, mileage, fuel type, tax, MPG, engine size, and manufacturer. Considerations for each feature:

- **model**: This might introduce high cardinality (too many unique values), as the dataset contains many different models. It is handled by one-hot encoding.

- **year**: This is likely very relevant as it can be correlated with the likelihood of failure (newer cars might be less likely to fail).

- **price**: Not a direct technical feature, but price can indirectly indicate the car's quality or maintenance history. It is relevant, as shown in `top_correlations_bar_plot.png`.

- **transmission**: This is relevant as the type of transmission could affect the likelihood of failure.

- **mileage**: Highly relevant, as higher mileage generally increases the likelihood of failure.

- **fuel type**: Relevant, as different fuel types (e.g., diesel, petrol, electric) can have different failure rates.

- **tax**: Might be less directly relevant but can correlate with vehicle condition or category. Evaluate its importance using feature importance methods.

- **MPG (Miles Per Gallon)**: Relevant, as it might indicate the condition of the engine and overall vehicle health.

- **engine size**: Relevant, as larger or smaller engines might have different failure characteristics.

- **manufacturer**: This can be relevant as some manufacturers have better reliability records than others.
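
The cardinality concern for the categorical features can be checked directly. The sketch below is illustrative only: it uses the five rows that `df.head()` prints later in this notebook, and the names `sample`, `cardinality`, and `encoded` are introduced here for the example. On the full dataset the counts will be much larger.

```python
import pandas as pd

# Categorical columns from the five rows shown by df.head() in this notebook.
sample = pd.DataFrame({
    "model": ["I10", "Polo", "2 Series", "Yeti Outdoor", "Fiesta"],
    "transmission": ["Manual", "Manual", "Semi-Auto", "Manual", "Manual"],
    "fueltype": ["Petrol", "Petrol", "Diesel", "Diesel", "Petrol"],
    "manufacturer": ["hyundi", "volkswagen", "BMW", "skoda", "ford"],
})

# Number of distinct values per categorical column.
cardinality = sample.nunique()

# One-hot encoding creates one column per distinct value,
# so the encoded width is the sum of the cardinalities.
encoded = pd.get_dummies(sample)
n_encoded_columns = encoded.shape[1]

print(cardinality.to_dict())
print(n_encoded_columns)
```

Running `df[categorical_features].nunique()` on the real data would show how many columns the `model` feature alone adds after encoding.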




#### Baseline model selection reasons

Based on the considerations outlined, a suitable **baseline model** for predicting car failure would be Logistic Regression. Here's why:

**Nature of the Data:** The dataset comprises structured features, making Logistic Regression an appropriate choice as it can handle tabular data effectively.

**Interpretability Requirements:** Logistic Regression offers good interpretability, providing insights into how each feature contributes to the prediction of car failure.

**Performance Metrics:** For binary classification tasks like predicting car failure, metrics such as precision, recall, and F1-score are crucial. Logistic Regression can be evaluated using these metrics to assess its performance.

**Business Context:** Logistic Regression is computationally efficient and can be deployed for real-time predictions or as part of a decision-support system, depending on the business requirements.

**Model Evaluation and Validation:** After training, the Logistic Regression model can be evaluated with cross-validation to check that its performance is stable across different splits of the data.

**Final Model Selection:** As a baseline model, Logistic Regression provides a solid starting point for predicting car failure. Subsequent model iterations can build upon this baseline by incorporating more complex algorithms or ensemble methods if necessary.

In summary, **Logistic Regression** serves as a suitable baseline model for predicting car failure due to its interpretability, computational efficiency, and effectiveness in handling structured data. It provides a solid foundation for further model development and refinement based on specific business requirements and performance evaluation.
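
The interpretability argument can be made concrete with the standard form of the model. The symbols $\beta_0$, $\beta_j$, and $\sigma$ are notation introduced here and do not appear elsewhere in the notebook:

$$
P(\text{fail} = 1 \mid x) = \sigma\!\left(\beta_0 + \sum_{j} \beta_j x_j\right), \qquad \sigma(z) = \frac{1}{1 + e^{-z}}
$$

Equivalently, the log-odds of failure are linear in the features:

$$
\log \frac{P(\text{fail} = 1 \mid x)}{1 - P(\text{fail} = 1 \mid x)} = \beta_0 + \sum_{j} \beta_j x_j
$$

Increasing $x_j$ by one unit while holding the other features fixed multiplies the odds of failure by $e^{\beta_j}$. For numerical features that have been standardized, one unit corresponds to one standard deviation.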

In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder, LabelEncoder
from sklearn.linear_model import LogisticRegression
from imblearn.pipeline import Pipeline as ImbPipeline
from imblearn.over_sampling import SMOTE
from sklearn.compose import ColumnTransformer
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score


In [2]:
# Step 1: Data Preparation
# Load the data
df = pd.read_csv('/Users/kanayojustice/Documents/Data_scientist_projects/AutoPredict/research/data/EDA_cars_data.csv')

In [3]:
df.columns

Index(['model', 'year', 'price', 'transmission', 'mileage', 'fueltype', 'tax',
       'mpg', 'enginesize', 'manufacturer', 'fail'],
      dtype='object')

#### Independent and dependent feature selection

In [4]:
# Encode the target variable
label_encoder = LabelEncoder()
df['fail'] = label_encoder.fit_transform(df['fail'])

In [5]:
# Separate features (X) and target variable (y)
X = df[['model', 'year', 'price', 'transmission', 'mileage', 'fueltype', 'tax',
       'mpg', 'enginesize', 'manufacturer']]
y = df['fail']

In [6]:
# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y, random_state=42)
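
The `stratify=y` argument keeps the share of failures the same in the training and test sets, which matters when failures are rare. A minimal sketch on hypothetical labels (`X_demo` and `y_demo` are made up for illustration and are not the notebook's data):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 60 failures among 1,000 vehicles (6%).
y_demo = np.array([1] * 60 + [0] * 940)
X_demo = np.arange(len(y_demo)).reshape(-1, 1)  # placeholder feature

_, _, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=42
)

# With stratification, each split keeps roughly the overall failure rate.
overall_rate = y_demo.mean()
train_rate = y_tr.mean()
test_rate = y_te.mean()
print(f"overall={overall_rate:.3f} train={train_rate:.3f} test={test_rate:.3f}")
```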


In [7]:
print(X.shape, y.shape)

(97712, 10) (97712,)


In [8]:
# Step 2: Preprocessing
# Define numerical and categorical features
numerical_features = ['year', 'price', 'mileage', 'tax', 'mpg', 'enginesize']
categorical_features = ['model', 'transmission', 'fueltype', 'manufacturer']


In [9]:
df.head()

|   | model | year | price | transmission | mileage | fueltype | tax | mpg | enginesize | manufacturer | fail |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | I10 | 2017 | 7495 | Manual | 11630 | Petrol | 145 | 60.1 | 1.0 | hyundi | 0 |
| 1 | Polo | 2017 | 10989 | Manual | 9200 | Petrol | 145 | 58.9 | 1.0 | volkswagen | 0 |
| 2 | 2 Series | 2019 | 27990 | Semi-Auto | 1614 | Diesel | 145 | 49.6 | 2.0 | BMW | 0 |
| 3 | Yeti Outdoor | 2017 | 12495 | Manual | 30960 | Diesel | 150 | 62.8 | 2.0 | skoda | 0 |
| 4 | Fiesta | 2017 | 7999 | Manual | 19353 | Petrol | 125 | 54.3 | 1.2 | ford | 0 |


In [10]:
# Create preprocessing pipelines for numerical and categorical data
numerical_pipeline = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())
])

In [11]:
categorical_pipeline = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])

In [12]:
# ColumnTransformer to apply different preprocessing to numerical and categorical features
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numerical_pipeline, numerical_features),
        ('cat', categorical_pipeline, categorical_features)
    ]
)

In [13]:
# Create a pipeline that includes preprocessing and the classifier
pipeline = ImbPipeline(steps=[
    ('preprocessor', preprocessor),
    ('smote', SMOTE(random_state=42)),
    ('classifier', LogisticRegression(solver='liblinear'))  # 'liblinear' supports both the 'l1' and 'l2' penalties searched below
])
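
Because SMOTE sits inside an imblearn pipeline, it is applied only while fitting, so test data is never resampled. SMOTE creates synthetic minority samples by interpolating between a minority point and one of its nearest minority neighbours. The NumPy sketch below illustrates only that interpolation idea; it is a simplified stand-in, not imblearn's implementation, and `smote_like` and `X_min` are hypothetical names.

```python
import numpy as np

def smote_like(X_minority, n_new, k=3, seed=0):
    """Create n_new synthetic rows by interpolating between minority neighbours."""
    rng = np.random.default_rng(seed)
    X_minority = np.asarray(X_minority, dtype=float)

    # Pairwise Euclidean distances between minority samples.
    diffs = X_minority[:, None, :] - X_minority[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=2))
    np.fill_diagonal(dists, np.inf)  # a point is not its own neighbour
    neighbours = np.argsort(dists, axis=1)[:, :k]

    synthetic = np.empty((n_new, X_minority.shape[1]))
    for i in range(n_new):
        base = rng.integers(len(X_minority))
        neighbour = neighbours[base, rng.integers(k)]
        lam = rng.random()  # position along the segment, in [0, 1)
        synthetic[i] = X_minority[base] + lam * (X_minority[neighbour] - X_minority[base])
    return synthetic

# Hypothetical minority-class points in two dimensions.
X_min = np.array([[1.0, 1.0], [1.5, 1.2], [2.0, 0.8], [1.2, 1.9]])
X_new = smote_like(X_min, n_new=6)
print(X_new.shape)
```

Each synthetic row lies on a line segment between two real minority rows, so it stays inside the region the minority class already occupies.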

In [14]:
# Define the parameter grid for GridSearchCV
param_grid = {
    'preprocessor__num__imputer__strategy': ['mean', 'median'],
    'classifier__C': [0.1, 1.0, 10.0],  # Regularization strength
    'classifier__penalty': ['l1', 'l2'],  # Penalty type
    'classifier__max_iter': [100, 200]  # Number of iterations
}


In [15]:
# Create the GridSearchCV object
grid_search = GridSearchCV(pipeline, param_grid, cv=3, scoring='accuracy', n_jobs=-1)

# Fit the grid search
grid_search.fit(X_train, y_train)

# Get the best parameters and best score
best_params = grid_search.best_params_
best_score = grid_search.best_score_
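
The cost of this search follows from the grid size. The sketch below counts the candidate combinations for the grid defined above. `cv_folds` is a name introduced here for the `cv=3` argument.

```python
from itertools import product

# Same grid as defined in the notebook.
param_grid = {
    'preprocessor__num__imputer__strategy': ['mean', 'median'],
    'classifier__C': [0.1, 1.0, 10.0],
    'classifier__penalty': ['l1', 'l2'],
    'classifier__max_iter': [100, 200],
}
cv_folds = 3

# Every combination of values is one candidate model.
candidates = list(product(*param_grid.values()))
n_candidates = len(candidates)

# Each candidate is fitted once per cross-validation fold.
n_cv_fits = n_candidates * cv_folds
print(n_candidates, n_cv_fits)
```

With the default `refit=True`, `GridSearchCV` then fits the best candidate once more on the whole training set.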

In [16]:
print("Best parameters found: ", best_params)
print("Best cross-validation accuracy: ", best_score)


Best parameters found:  {'classifier__C': 10.0, 'classifier__max_iter': 100, 'classifier__penalty': 'l1', 'preprocessor__num__imputer__strategy': 'mean'}
Best cross-validation accuracy:  0.9793012391957898


In [17]:
# Evaluate the best model on the test set
best_model = grid_search.best_estimator_
y_pred = best_model.predict(X_test)
y_pred_proba = best_model.predict_proba(X_test)[:, 1]

# Calculate evaluation metrics
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred, pos_label=1)
recall = recall_score(y_test, y_pred, pos_label=1)
f1 = f1_score(y_test, y_pred, pos_label=1)
roc_auc = roc_auc_score(y_test, y_pred_proba)

print(f"Accuracy: {accuracy:.4f}")
print(f"Precision: {precision:.4f}")
print(f"Recall: {recall:.4f}")
print(f"F1 Score: {f1:.4f}")
print(f"ROC AUC: {roc_auc:.4f}")

Accuracy: 0.9793
Precision: 0.7273
Recall: 0.9848
F1 Score: 0.8367
ROC AUC: 0.9981


### Insights
##### Confusion Matrix:

- True Negatives (TN): 22,571
- False Positives (FP): 458
- False Negatives (FN): 21
- True Positives (TP): 1,378


The model correctly classified 22,571 vehicles that did not fail and 1,378 that did. It wrongly flagged 458 non-failures as failures and missed 21 actual failures. These counts total 24,428 instances, more than the 20% test split used above (about 19,543 of 97,712 rows), so the figures in this section appear to come from a separate evaluation and differ slightly from the metrics printed above.
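
The metrics discussed in this section can be recomputed from the four confusion-matrix counts. A short worked example in plain Python, where the variable names are chosen for illustration:

```python
# Confusion-matrix counts reported above.
tn, fp, fn, tp = 22571, 458, 21, 1378
total = tn + fp + fn + tp

accuracy = (tp + tn) / total          # share of all predictions that are correct
precision = tp / (tp + fp)            # share of predicted failures that are real
recall = tp / (tp + fn)               # share of real failures that are caught
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean

# Accuracy of always predicting "not failed", for comparison.
majority_baseline = (tn + fp) / total

print(f"Accuracy:  {accuracy:.4f}")
print(f"Precision: {precision:.4f}")
print(f"Recall:    {recall:.4f}")
print(f"F1 Score:  {f1:.4f}")
print(f"Baseline:  {majority_baseline:.4f}")
```

The baseline line shows why accuracy alone says little here: the negative class dominates, so precision and recall carry more information about failures.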

##### Accuracy: 0.9804

Accuracy is the share of correctly predicted instances (true positives plus true negatives) among all instances. At 98.04% the model is highly accurate overall, although only about 5.7% of instances belong to the failed class, so always predicting "not failed" would already reach about 94.3%.

##### Precision: 0.7505

Precision measures the accuracy of the positive predictions: of all instances predicted as failed, 75.05% had actually failed, so roughly one in four failure predictions is a false alarm. Higher precision means fewer false positives.

##### Recall: 0.9850

Recall measures the ability of the model to capture all actual positive instances, i.e., out of all actual failed instances, 98.50% were correctly identified. High recall means few false negatives.

##### F1-Score: 0.8519

The F1-Score is the harmonic mean of precision and recall. A score of 85.19% reflects very high recall offset by moderate precision.

##### ROC AUC: 0.9981

The ROC AUC score is a measure of the model's ability to discriminate between the positive and negative classes. A score of 0.9981 indicates near-perfect discrimination.
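
One way to read the ROC AUC is as the probability that a randomly chosen failed vehicle receives a higher predicted score than a randomly chosen non-failed one. The sketch below computes it that way on a handful of made-up scores. `pairwise_auc` is a hypothetical helper written for this example, not a scikit-learn function.

```python
def pairwise_auc(pos_scores, neg_scores):
    """Share of (positive, negative) pairs ranked correctly; ties count as half."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))

# Made-up predicted probabilities for illustration.
failed_scores = [0.9, 0.8, 0.4]
not_failed_scores = [0.7, 0.3, 0.2, 0.1]

# 11 of the 12 pairs are ranked correctly (only 0.4 vs 0.7 is not).
auc = pairwise_auc(failed_scores, not_failed_scores)
print(f"{auc:.4f}")
```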

##### Conclusion

These metrics suggest that the Logistic Regression model performs well as a baseline:

- High Accuracy: the model is correct most of the time.
- Moderate Precision: about three in four predicted failures are actual failures.
- High Recall: the model identifies nearly all actual failures.
- Solid F1-Score: very high recall is offset by moderate precision.
- High ROC AUC: the model separates the two classes very well.

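
The hyperparameters reported for this model are:

- Classifier C: 22.0
- Max Iterations: 200
- Penalty: L1
- Imputer Strategy: Mean

These values differ from the grid-search output above (C = 10.0, max_iter = 100), and C = 22.0 lies outside the searched grid, so they likely belong to a different run.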