### **AUTOMOTIVE PREDICTOR**

#### **Life Cycle of a Machine Learning Project**

- Understanding the Problem Statement

- Data Collection 

- Data Checks to perform

- Exploratory data analysis

- Data Pre-Processing

- Model Training

- Choose best model

## **Business Intelligence and Business Analysis** 
The automotive industry continually strives to enhance vehicle reliability and minimize unexpected breakdowns. Traditional maintenance schedules often fail to account for the unique usage patterns and conditions each vehicle experiences, leading to inefficiencies and potential safety risks.

In the context of automotive components, there is a need for a predictive maintenance system that can anticipate failures and maintenance requirements based on historical data. This project aims to leverage a comprehensive dataset of over 90,000 used cars, spanning from 1970 to 2024, to develop a predictive model that identifies vehicles at high risk of failure. By predicting these maintenance needs, AutoPredict seeks to reduce downtime, optimize maintenance schedules, and ultimately improve vehicle safety and performance.

#### **Problem Statement** 
The challenge is to develop a predictive maintenance model that accurately forecasts automotive component failures using historical car data. This model aims to help in proactively addressing maintenance needs, thereby enhancing vehicle reliability, safety, and operational efficiency.



### **Goals** 
**Predict Failures:** Create a supervised learning model to predict the likelihood of automotive component failures based on features such as age, mileage, engine size, and fuel efficiency.

**Optimize Maintenance:** Use predictive insights to optimize maintenance schedules, reducing unnecessary maintenance actions and preventing unexpected breakdowns.

**Enhance Safety and Performance:** Improve overall vehicle safety and performance by ensuring timely maintenance interventions.



### **Creating a Synthetic Failure Label** 
The dataset has no column recording whether a vehicle actually failed. To address this, I created a synthetic failure label based on business knowledge and two thresholds:

**Age:** Vehicles older than 10 years are more likely to experience failures.

**Mileage:** Vehicles with mileage greater than 100,000 miles are more prone to maintenance needs.

**Combined Criteria:** Vehicles meeting either the age or the mileage threshold were labeled as likely to fail.



### **Features Included**

**Model:** The model of the car.

**Year:** The manufacturing year of the car.

**Price:** The price of the car.

**Transmission:** The type of transmission used in the car.

**Mileage:** The mileage of the car.

**FuelType:** The type of fuel used by the car.

**Tax:** The tax rate applicable to the car.

**MPG:** The miles per gallon efficiency of the car.

**EngineSize:** The size of the car's engine.

**Manufacturer:** The manufacturer of the car.

### Data collection
**Dataset source:** https://www.kaggle.com/datasets/meruvulikith/90000-cars-data-from-1970-to-2024

The data consists of 97,712 rows and 10 columns; one more column (the failure label) is added later.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from datetime import datetime
%matplotlib inline

In [2]:
df = pd.read_csv("/Users/kanayojustice/Documents/Data_scientist_projects/AutoPredict/research/data/CarsData.csv")

### Exploratory data analysis (EDA)

In [3]:
df.head()

| | model | year | price | transmission | mileage | fuelType | tax | mpg | engineSize | Manufacturer |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | I10 | 2017 | 7495 | Manual | 11630 | Petrol | 145 | 60.1 | 1.0 | hyundi |
| 1 | Polo | 2017 | 10989 | Manual | 9200 | Petrol | 145 | 58.9 | 1.0 | volkswagen |
| 2 | 2 Series | 2019 | 27990 | Semi-Auto | 1614 | Diesel | 145 | 49.6 | 2.0 | BMW |
| 3 | Yeti Outdoor | 2017 | 12495 | Manual | 30960 | Diesel | 150 | 62.8 | 2.0 | skoda |
| 4 | Fiesta | 2017 | 7999 | Manual | 19353 | Petrol | 125 | 54.3 | 1.2 | ford |


In [4]:
df.shape

(97712, 10)

In [5]:
df.describe()

| | year | price | mileage | tax | mpg | engineSize |
|---|---|---|---|---|---|---|
| count | 97712.0 | 97712.0 | 97712.0 | 97712.0 | 97712.0 | 97712.0 |
| mean | 2017.066502 | 16773.487555 | 23219.475499 | 120.142408 | 55.205623 | 1.664913 |
| std | 2.118661 | 9868.552222 | 21060.882301 | 63.35725 | 16.181659 | 0.558574 |
| min | 1970.0 | 450.0 | 1.0 | 0.0 | 0.3 | 0.0 |
| 25% | 2016.0 | 9999.0 | 7673.0 | 125.0 | 47.1 | 1.2 |
| 50% | 2017.0 | 14470.0 | 17682.5 | 145.0 | 54.3 | 1.6 |
| 75% | 2019.0 | 20750.0 | 32500.0 | 145.0 | 62.8 | 2.0 |
| max | 2024.0 | 159999.0 | 323000.0 | 580.0 | 470.8 | 6.6 |
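
The summary above shows a minimum `engineSize` of 0.0 and a maximum `mpg` of 470.8, values worth inspecting before modelling. A minimal sketch of such a data check follows. It runs on toy rows rather than the real file, and the `MAX_PLAUSIBLE_MPG` limit is a hypothetical threshold chosen only for illustration. A zero engine size may be legitimate (for example for electric cars), so flagged rows are candidates for review rather than confirmed errors.

```python
import pandas as pd

# Toy rows (illustrative, not taken from CarsData.csv)
toy = pd.DataFrame({
    "engineSize": [1.0, 0.0, 2.0],
    "mpg": [60.1, 470.8, 49.6],
})

# Hypothetical plausibility limit; adjust using domain knowledge
MAX_PLAUSIBLE_MPG = 200

# Flag rows with a zero engine size or an implausibly high mpg
suspect = toy[(toy["engineSize"] == 0) | (toy["mpg"] > MAX_PLAUSIBLE_MPG)]
print(len(suspect), "suspect rows")
```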


In [6]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 97712 entries, 0 to 97711
Data columns (total 10 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   model         97712 non-null  object 
 1   year          97712 non-null  int64  
 2   price         97712 non-null  int64  
 3   transmission  97712 non-null  object 
 4   mileage       97712 non-null  int64  
 5   fuelType      97712 non-null  object 
 6   tax           97712 non-null  int64  
 7   mpg           97712 non-null  float64
 8   engineSize    97712 non-null  float64
 9   Manufacturer  97712 non-null  object 
dtypes: float64(2), int64(4), object(4)
memory usage: 7.5+ MB


In [7]:
df.isnull().sum()

model           0
year            0
price           0
transmission    0
mileage         0
fuelType        0
tax             0
mpg             0
engineSize      0
Manufacturer    0
dtype: int64

#### **Insights** 

- The dataset has 4 columns of object type (`model`, `transmission`, `fuelType`, `Manufacturer`) that need to be one-hot encoded (OHE)
- The dataset has no missing values
- The dataset contains 97,712 rows and 10 columns
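
Before one-hot encoding, it can help to count the distinct values in each categorical column, since every distinct value becomes its own column. The sketch below does this on a toy frame; the rows are illustrative and not taken from the real file.

```python
import pandas as pd

# Toy frame standing in for the real data (values are illustrative only)
toy = pd.DataFrame({
    "model": ["I10", "Polo", "Fiesta", "Polo"],
    "fuelType": ["Petrol", "Petrol", "Diesel", "Petrol"],
    "price": [7495, 10989, 7999, 9500],
})

categorical_columns = ["model", "fuelType"]

# Number of distinct values per categorical column
cardinality = toy[categorical_columns].nunique()
print(cardinality)

# Each distinct value becomes one dummy column after encoding
print("dummy columns to expect:", int(cardinality.sum()))
```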

### Feature Engineering

##### - Adding a new target column named **"Fail"**

In [8]:
# Current year
current_year = datetime.now().year

##### **Note:** Creating the "Fail" Column: The apply function is used to iterate over each row. If a car is older than 10 years or has mileage greater than 100,000 miles, it's labeled as "Yes"; otherwise, it's labeled as "No".

In [9]:
# Add 'Fail' column based on age and mileage criteria
df['Fail'] = df.apply(lambda row: 'Yes' if (current_year - row['year'] > 10 or row['mileage'] > 100000) else 'No', axis=1)
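
A row-wise `apply` works, but it calls a Python function once per row. The sketch below expresses the same rule with vectorized comparisons and `np.where`. The function name `add_fail_label` and its parameters are hypothetical, introduced only for this example. It also takes the reference year as an argument, on the assumption that a fixed year is preferable to `datetime.now().year` when the labels need to be reproducible across runs.

```python
import numpy as np
import pandas as pd

def add_fail_label(df, reference_year, max_age=10, max_mileage=100_000):
    """Return a copy of df with a 'Fail' column built from the age/mileage rule."""
    age = reference_year - df["year"]
    # Same rule as the apply version: older than max_age OR above max_mileage
    is_risky = (age > max_age) | (df["mileage"] > max_mileage)
    out = df.copy()
    out["Fail"] = np.where(is_risky, "Yes", "No")
    return out

# Toy rows (not taken from CarsData.csv) that exercise each branch of the rule
toy = pd.DataFrame({"year": [2017, 2005, 2019], "mileage": [11630, 40000, 150000]})
labeled = add_fail_label(toy, reference_year=2024)
print(labeled["Fail"].tolist())
```

With `reference_year=2024`, the second toy row is flagged by age and the third by mileage.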

In [10]:
df.head()

| | model | year | price | transmission | mileage | fuelType | tax | mpg | engineSize | Manufacturer | Fail |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | I10 | 2017 | 7495 | Manual | 11630 | Petrol | 145 | 60.1 | 1.0 | hyundi | No |
| 1 | Polo | 2017 | 10989 | Manual | 9200 | Petrol | 145 | 58.9 | 1.0 | volkswagen | No |
| 2 | 2 Series | 2019 | 27990 | Semi-Auto | 1614 | Diesel | 145 | 49.6 | 2.0 | BMW | No |
| 3 | Yeti Outdoor | 2017 | 12495 | Manual | 30960 | Diesel | 150 | 62.8 | 2.0 | skoda | No |
| 4 | Fiesta | 2017 | 7999 | Manual | 19353 | Petrol | 125 | 54.3 | 1.2 | ford | No |


In [11]:
# Count the occurrences of 'Yes' and 'No' in the 'Fail' column
fail_counts = df['Fail'].value_counts()
print(fail_counts)

Fail
No     92458
Yes     5254
Name: count, dtype: int64


#### **Insights** 
- The **Fail** column contains 92,458 rows labeled "No" and 5,254 rows labeled "Yes".
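
To make the class balance explicit, the counts can be turned into percentages. The sketch below rebuilds the two counts from the `value_counts()` output above instead of reloading the data.

```python
import pandas as pd

# Counts copied from the value_counts() output above
fail_counts = pd.Series({"No": 92458, "Yes": 5254}, name="count")

# Share of each class as a percentage of all rows
fail_share = (fail_counts / fail_counts.sum() * 100).round(2)
print(fail_share)
```

Roughly 5.4% of rows are labeled "Yes", so the target is imbalanced and plain accuracy can look high even for a model that always predicts "No".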

#### **Note:** 
- Remember that the categorical columns have not yet been one-hot encoded

In [12]:
df.columns

Index(['model', 'year', 'price', 'transmission', 'mileage', 'fuelType', 'tax',
       'mpg', 'engineSize', 'Manufacturer', 'Fail'],
      dtype='object')

In [13]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 97712 entries, 0 to 97711
Data columns (total 11 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   model         97712 non-null  object 
 1   year          97712 non-null  int64  
 2   price         97712 non-null  int64  
 3   transmission  97712 non-null  object 
 4   mileage       97712 non-null  int64  
 5   fuelType      97712 non-null  object 
 6   tax           97712 non-null  int64  
 7   mpg           97712 non-null  float64
 8   engineSize    97712 non-null  float64
 9   Manufacturer  97712 non-null  object 
 10  Fail          97712 non-null  object 
dtypes: float64(2), int64(4), object(5)
memory usage: 8.2+ MB


##### One-hot encode all object columns

In [14]:
# Encode categorical variables using pd.get_dummies with dtype to ensure 0 and 1 encoding
categorical_columns = ['model', 'transmission', 'fuelType', 'Manufacturer']
df = pd.get_dummies(df, columns=categorical_columns, dtype=int)
df.head()

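| | year | price | mileage | tax | mpg | engineSize | Fail | model_ 1 Series | model_ 2 Series | model_ 3 Series | ... | fuelType_Petrol | Manufacturer_Audi | Manufacturer_BMW | Manufacturer_ford | Manufacturer_hyundi | Manufacturer_merc | Manufacturer_skoda | Manufacturer_toyota | Manufacturer_vauxhall | Manufacturer_volkswagen |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2017 | 7495 | 11630 | 145 | 60.1 | 1.0 | No | 0 | 0 | 0 | ... | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| 1 | 2017 | 10989 | 9200 | 145 | 58.9 | 1.0 | No | 0 | 0 | 0 | ... | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 2 | 2019 | 27990 | 1614 | 145 | 49.6 | 2.0 | No | 0 | 1 | 0 | ... | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 2017 | 12495 | 30960 | 150 | 62.8 | 2.0 | No | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| 4 | 2017 | 7999 | 19353 | 125 | 54.3 | 1.2 | No | 0 | 0 | 0 | ... | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |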
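
The dummy names above (for example `model_ 1 Series`) contain a space after the underscore, which suggests the raw `model` values carry leading whitespace. A minimal sketch of stripping that whitespace before encoding is shown below. It uses toy rows and assumes the whitespace carries no meaning.

```python
import pandas as pd

# Toy rows with a leading space in 'model', mimicking the pattern seen above
toy = pd.DataFrame({
    "model": [" I10", " Polo", " 2 Series"],
    "transmission": ["Manual", "Manual", "Semi-Auto"],
    "price": [7495, 10989, 27990],
})

categorical_columns = ["model", "transmission"]

# Strip surrounding whitespace so dummy names read 'model_I10', not 'model_ I10'
for col in categorical_columns:
    toy[col] = toy[col].str.strip()

encoded = pd.get_dummies(toy, columns=categorical_columns, dtype=int)
print(encoded.columns.tolist())
```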


In [15]:
# Convert 'Fail' column to integers: 'Yes' -> 1, 'No' -> 0
df['Fail'] = df['Fail'].map({'Yes': 1, 'No': 0})

In [None]:
# Move 'Fail' column to the end
columns = [col for col in df if col != 'Fail'] + ['Fail']
df = df[columns]
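
An alternative way to move a column to the end is `DataFrame.pop`, sketched here on a toy frame. Checking the column order afterwards confirms that the move took effect.

```python
import pandas as pd

# Toy frame with the target column in the middle
toy = pd.DataFrame({"year": [2017, 2019], "Fail": [0, 0], "price": [7495, 27990]})

# pop() removes the column and returns it; assigning it back appends it last
toy["Fail"] = toy.pop("Fail")
print(toy.columns.tolist())
```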

In [16]:
df.head()

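| | year | price | mileage | tax | mpg | engineSize | Fail | model_ 1 Series | model_ 2 Series | model_ 3 Series | ... | fuelType_Petrol | Manufacturer_Audi | Manufacturer_BMW | Manufacturer_ford | Manufacturer_hyundi | Manufacturer_merc | Manufacturer_skoda | Manufacturer_toyota | Manufacturer_vauxhall | Manufacturer_volkswagen |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2017 | 7495 | 11630 | 145 | 60.1 | 1.0 | 0 | 0 | 0 | 0 | ... | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| 1 | 2017 | 10989 | 9200 | 145 | 58.9 | 1.0 | 0 | 0 | 0 | 0 | ... | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 2 | 2019 | 27990 | 1614 | 145 | 49.6 | 2.0 | 0 | 0 | 1 | 0 | ... | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 2017 | 12495 | 30960 | 150 | 62.8 | 2.0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| 4 | 2017 | 7999 | 19353 | 125 | 54.3 | 1.2 | 0 | 0 | 0 | 0 | ... | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |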


In [18]:
df.columns

Index(['year', 'price', 'mileage', 'tax', 'mpg', 'engineSize', 'Fail',
       'model_ 1 Series', 'model_ 2 Series', 'model_ 3 Series',
       ...
       'fuelType_Petrol', 'Manufacturer_Audi', 'Manufacturer_BMW',
       'Manufacturer_ford', 'Manufacturer_hyundi', 'Manufacturer_merc',
       'Manufacturer_skoda', 'Manufacturer_toyota', 'Manufacturer_vauxhall',
       'Manufacturer_volkswagen'],
      dtype='object', length=221)