<h2>Classifying Warranty Claims with Symptom Class Names Using Machine Learning</h2><br><br>
by Daniel J. Kim

At Honda Market Quality (MQ), we are responsible for identifying vehicle quality and safety problems.  The primary source of market or field information is warranty claims data.  This data represents the voice of our customers.  The data contains several attributes such as part number, part cost, days to failure, miles to failure, customer's complaint, etc.  Over the years, Honda has accumulated several million warranty claims.  In order to efficiently identify market problems, methods have to be employed to "classify" or group similar claims together so that analysis can efficiently find trends, track problems, and ensure problems are fixed or counter-measured.

Today, warranty claims data is classified using several hard-coded algorithms that require extensive maintenance.  The jobs that our IT runs to complete the classification take several hours overnight.  Due to recent advancements in and accessibility of [machine learning](https://en.wikipedia.org/wiki/Machine_learning) (ML) methodologies, I believe MQ and our Honda IT professionals should investigate how ML can be used to improve the warranty claims classification process and extend its usage to other applicable areas of business.  Furthermore, I strongly believe MQ needs to develop in-house capability and knowledge in machine learning.  Unfortunately, the associates at MQ, myself included, have little or no knowledge of ML.  But we can change that, and hopefully we can discover the benefits of applying machine learning to enhance MQ's business.

The following example is an attempt at a proof-of-concept of how machine learning can be used to classify warranty claims without hard-coded algorithms and is not meant to be representative of a "production" application.  The programming language used to employ the machine learning algorithm is Python using the [scikit-learn](http://scikit-learn.org/stable/) machine learning library.  This document is a [Jupyter](http://jupyter.org/) web notebook which allows me to document my process so that perhaps others can duplicate or understand my process as well.

### Library Imports

In [2]:
from sklearn.naive_bayes import GaussianNB
import pandas as pd
import numpy as np

### Data Ingestion

Due to confidentiality, the raw data will not be available.  Instead, the sample data came from an Excel file, which I copied and pasted from my computer's clipboard.

In [3]:
df = pd.read_clipboard()
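
For readers without the clipboard contents, a similar load can be sketched from CSV text.  The sample rows and the `dtype` choice below are assumptions for illustration, not the actual source data.  Reading the part number as a string keeps leading zeros such as `06200` intact.

```python
import io

import pandas as pd

# Hypothetical sample rows standing in for the confidential source data
csv_text = """FAIL_SHORT_PARTNO,PART_COST_USD,DAYS_TO_FAIL_MINZERO,MILES_TO_FAIL
06200,"$2,000.00",0,5
04823,$0.00,0,207
"""

# dtype=str on the part number column stops pandas from parsing it as an integer
sample_df = pd.read_csv(io.StringIO(csv_text), dtype={'FAIL_SHORT_PARTNO': str})
print(sample_df.dtypes)
```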

### Data Preparation: Data Cleansing and Transformation

Source data had dollar sign and comma in the part cost amounts.  So we need to remove them and ensure the part cost is a numeric (float) value.

In [4]:
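# regex=False treats '$' as a literal character rather than a regular expression anchor
df['PART_COST_USD'] = df['PART_COST_USD'].str.replace('$', '', regex=False).str.replace(',', '', regex=False)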

In [5]:
df['PART_COST_USD'] = df['PART_COST_USD'].astype(float)
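
The two cleansing cells above could also be combined into one small helper.  This is only a sketch: the function name `clean_currency` and the sample values are made up, and `errors='coerce'` is my assumption about how unparseable values should be handled (they become NaN instead of raising an error).

```python
import pandas as pd


def clean_currency(series):
    """Strip '$' and ',' from a string Series and convert it to float."""
    stripped = (series.str.replace('$', '', regex=False)
                      .str.replace(',', '', regex=False))
    # errors='coerce' turns unparseable values into NaN instead of raising
    return pd.to_numeric(stripped, errors='coerce')


# Made-up sample values, including one that cannot be parsed
costs = pd.Series(['$458.91', '$1,234.50', 'N/A'])
cleaned = clean_currency(costs)
print(cleaned.tolist())
```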

Confirm data type of the source data:

In [6]:
df.dtypes

FAIL_SHORT_PARTNO        object
PART_COST_USD           float64
DAYS_TO_FAIL_MINZERO      int64
MILES_TO_FAIL             int64
TEXT_CLUSTER_FAMILY      object
SYMP_CLASS_NM            object
dtype: object

Below are the first 5 rows of the training data set we will use.  The feature columns that we will use are the first 5 columns, and the target or label data will be the last column ("SYMP_CLASS_NM").

Basically, we want to label or classify future claims based on part #, part cost, DTF, MTF, and symptom text cluster family to their appropriate symptom class name.

**Let's view our sample data:**

In [7]:
df.head()

| | FAIL_SHORT_PARTNO | PART_COST_USD | DAYS_TO_FAIL_MINZERO | MILES_TO_FAIL | TEXT_CLUSTER_FAMILY | SYMP_CLASS_NM |
|---|---|---|---|---|---|---|
| 0 | 30 | 0.0 | 0 | 2 | NOISE/VIBRATION | BRAKE JUDDER |
| 1 | 1469 | 458.91 | 0 | 11 | FUNCTION ISSUE | BRAKE PEDAL SOFT |
| 2 | 1469 | 455.32 | 0 | 10 | FUNCTION ISSUE | MASTER CYLINDER/BOOSTER/POWER ASSY/FUNCTION ISSUE |
| 3 | 1611 | 0.0 | 0 | 5 | COSMETIC ISSUE | SIDE PANEL / FENDER/FENDER (FRONT)/COSMETIC ISSUE |
| 4 | 4110 | 2.62 | 0 | 20887 | FUNCTION ISSUE | BULBS (INTERIOR)/04110/FUNCTION ISSUE |


Most machine learning libraries require that the input data be numeric rather than text/string data.  Since part #s can contain string values, I resorted to creating a Python dictionary that maps a part5 (first 5 digits of a part number) to a unique number.  I did the same for symptom text cluster family and symptom class name since they are also text/string data.

In [8]:
# Create Python list containing unique part #s and their corresponding numeric value
part5_unique = sorted(df.FAIL_SHORT_PARTNO.unique().tolist())
part5_index = list(range(len(part5_unique)))

# Now create Python dictionaries that map part5 to integer value and vice versa
part5_to_int_mapper = dict(zip(part5_unique, part5_index))
int_to_part5_mapper = dict(zip(part5_index, part5_unique))

In [9]:
# Create Python list containing unique text cluster values and their corresponding numeric value
symptoms_unique = sorted(df.TEXT_CLUSTER_FAMILY.unique().tolist())
symptoms_index = list(range(len(symptoms_unique)))

# Now create Python dictionaries that map symptom text to integer value and vice versa
symptom_to_int_mapper = dict(zip(symptoms_unique, symptoms_index))
int_to_symptom_mapper = dict(zip(symptoms_index, symptoms_unique))

In [10]:
# Create Python list containing unique symptom class values and their corresponding numeric value
symp_class_unique = sorted(df.SYMP_CLASS_NM.unique().tolist())
symp_class_index = list(range(len(symp_class_unique)))

# Now create Python dictionaries that map symptom class to integer value and vice versa
symp_class_to_int_mapper = dict(zip(symp_class_unique, symp_class_index))
int_to_symp_class_mapper = dict(zip(symp_class_index, symp_class_unique))

As an example, here is what my **symptom_to_int_mapper** looks like:

In [17]:
symptom_to_int_mapper

{'COSMETIC ISSUE': 0,
 'FUNCTION ISSUE': 1,
 'LEAK': 2,
 'NOISE/VIBRATION': 3,
 'NOT APPL': 4,
 'ODOR': 5,

The above maps "COSMETIC ISSUE" to the numeric value of 0, "FUNCTION ISSUE" to 1, etc.

Here it is in action.  You provide a symptom text cluster, and it returns the corresponding numeric value.  For example:

In [19]:
symptom_to_int_mapper['FUNCTION ISSUE']

1

We can go the other direction with my int_to_symptom_mapper:

In [20]:
int_to_symptom_mapper[1]

'FUNCTION ISSUE'
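
The three mapping cells above follow the same pattern, so they could be folded into one helper.  This is a minimal sketch; `build_mappers` is a hypothetical name that does not appear in the notebook, and the input list is made up.

```python
def build_mappers(values):
    """Return (value_to_int, int_to_value) dictionaries for the unique values."""
    unique_values = sorted(set(values))
    value_to_int = {value: index for index, value in enumerate(unique_values)}
    int_to_value = {index: value for value, index in value_to_int.items()}
    return value_to_int, int_to_value


# Made-up input with one duplicate to show that duplicates collapse to one code
to_int, to_text = build_mappers(['LEAK', 'COSMETIC ISSUE', 'FUNCTION ISSUE', 'LEAK'])
print(to_int)
print(to_text[1])
```

scikit-learn's `LabelEncoder` offers a similar round trip through `fit_transform` and `inverse_transform`, which may be worth trying instead of hand-built dictionaries.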

### We need to be able to re-use these data structures for un-classified data later on

**Data Structure Persistence using Python's pickle library:**

In [11]:
import pickle

# Save Python data structures to disk/file; each "with" block closes its file after writing
with open(r'D:\jupyter\machine_learning\part5_to_int_mapper.sk', 'wb') as f:
    pickle.dump(part5_to_int_mapper, f)
with open(r'D:\jupyter\machine_learning\symptom_to_int_mapper.sk', 'wb') as f:
    pickle.dump(symptom_to_int_mapper, f)
with open(r'D:\jupyter\machine_learning\symp_class_to_int_mapper.sk', 'wb') as f:
    pickle.dump(symp_class_to_int_mapper, f)
with open(r'D:\jupyter\machine_learning\int_to_symp_class_mapper.sk', 'wb') as f:
    pickle.dump(int_to_symp_class_mapper, f)

**NOTE**-For the sake of simplicity, I resorted to saving the mappings using Python's pickle object serialization library.  In a production environment, it would be more suitable to use a relational database to store the mappings in a table instead.
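
As a rough idea of what the database approach might look like, here is a sketch using Python's built-in `sqlite3` module.  The table name `symptom_map`, its column names, and the in-memory database are all assumptions for illustration.  A production system would use whatever relational database and schema IT provides.

```python
import sqlite3

# Example mapping; the table and column names below are hypothetical
symptom_to_int = {'COSMETIC ISSUE': 0, 'FUNCTION ISSUE': 1, 'LEAK': 2}

conn = sqlite3.connect(':memory:')  # in-memory database, for illustration only
conn.execute('CREATE TABLE symptom_map (symptom TEXT PRIMARY KEY, code INTEGER NOT NULL)')
conn.executemany('INSERT INTO symptom_map (symptom, code) VALUES (?, ?)',
                 symptom_to_int.items())
conn.commit()

# Rebuild the dictionary from the table, as a later session would
loaded = dict(conn.execute('SELECT symptom, code FROM symptom_map'))
conn.close()
print(loaded == symptom_to_int)
```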

### Now use our Python dictionaries to map text value to their respective integer value

Below creates a new column called "PART5" which is the numeric representation of the "FAIL_SHORT_PARTNO" column:

In [12]:
df['PART5'] = df.FAIL_SHORT_PARTNO.map(part5_to_int_mapper).astype(np.int64)

In [13]:
df.dtypes

FAIL_SHORT_PARTNO        object
PART_COST_USD           float64
DAYS_TO_FAIL_MINZERO      int64
MILES_TO_FAIL             int64
TEXT_CLUSTER_FAMILY      object
SYMP_CLASS_NM            object
PART5                     int64
dtype: object

### Now do the same for text cluster and symptom class name

In [14]:
df['TEXT_CLUSTER'] = df.TEXT_CLUSTER_FAMILY.map(symptom_to_int_mapper)
df['SYMP_CLASS'] = df.SYMP_CLASS_NM.map(symp_class_to_int_mapper)

Below is the list of final data columns and their respective data types:

In [15]:
df.dtypes

FAIL_SHORT_PARTNO        object
PART_COST_USD           float64
DAYS_TO_FAIL_MINZERO      int64
MILES_TO_FAIL             int64
TEXT_CLUSTER_FAMILY      object
SYMP_CLASS_NM            object
PART5                     int64
TEXT_CLUSTER              int64
SYMP_CLASS                int64
dtype: object
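
One thing to watch for when mapping future claims: `map()` returns NaN for any value that is not a key in the dictionary.  The sketch below uses made-up data (the symptom value `RATTLE` is invented) to show one way such rows could be found before they reach the model.

```python
import pandas as pd

symptom_to_int = {'COSMETIC ISSUE': 0, 'FUNCTION ISSUE': 1}

# 'RATTLE' is a made-up value that is absent from the mapping
new_claims = pd.DataFrame({'TEXT_CLUSTER_FAMILY': ['FUNCTION ISSUE', 'RATTLE']})
codes = new_claims['TEXT_CLUSTER_FAMILY'].map(symptom_to_int)

# map() leaves NaN where no dictionary key matched
unmapped = new_claims[codes.isna()]
print(unmapped['TEXT_CLUSTER_FAMILY'].tolist())
```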

### Now we are ready to create our features input data

Our features data will consist of: part5, part cost, DTF, MTF, and symptom text cluster (all represented with numeric values thanks to my mappings made earlier):

In [17]:
features = df[['PART5','PART_COST_USD','DAYS_TO_FAIL_MINZERO','MILES_TO_FAIL','TEXT_CLUSTER']].values.tolist()

Now our features data does not contain text/string data.  Let's look at the first 10 rows of data:

In [20]:
features[:10]

[[0.0, 0.0, 0.0, 2.0, 3.0],
 [6.0, 458.91, 0.0, 11.0, 1.0],
 [6.0, 455.32, 0.0, 10.0, 1.0],
 [7.0, 0.0, 0.0, 5.0, 0.0],
 [11.0, 2.62, 0.0, 20887.0, 1.0],
 [11.0, 4.37, 0.0, 11849.0, 1.0],
 [12.0, 3.04, 0.0, 3.0, 6.0],
 [13.0, 0.0, 0.0, 5.0, 4.0],
 [19.0, 0.0, 0.0, 11.0, 0.0],
 [19.0, 0.0, 0.0, 14.0, 0.0]]

### Now create our target/label data

In [21]:
labels = df.SYMP_CLASS.tolist()

Let's look at the first 10 labels:

In [22]:
labels[:10]

[161, 162, 1708, 2280, 248, 248, 1247, 2343, 2293, 2293]

### Now use scikit-learn's Gaussian Naive Bayes classification algorithm

There are several ML classification algorithms to choose from.  Since I do not know the internal implementation of each and every algorithm, I had to resort to trial and error to find the algorithm that gave me the best results; quite frankly, I am not knowledgeable on how to perform proper model validation.  The model that gave me the best results on a large data set was the Naive Bayes classification algorithm.  With smaller data sets, the decision tree and random forest algorithms gave me good results.
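
As a starting point for more systematic model comparison, one common approach is to hold out part of the data and score each candidate model on rows it has not seen.  The sketch below uses synthetic data, not warranty claims.  The split size, random seeds, and choice of models are assumptions for illustration only.

```python
import numpy as np
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data: 200 rows, 5 numeric features, 3 classes
rng = np.random.RandomState(0)
y = rng.randint(0, 3, size=200)
X = rng.normal(size=(200, 5)) + y[:, None]  # shift features by class so they are learnable

# Hold out 25% of the rows; the models never see them during fitting
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

scores = {}
for name, model in [('naive_bayes', GaussianNB()),
                    ('decision_tree', DecisionTreeClassifier(random_state=0))]:
    model.fit(X_train, y_train)
    scores[name] = accuracy_score(y_test, model.predict(X_test))
    print(name, scores[name])
```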

Fit the training data and target/label data.  Again, the training data is the part #, part cost, DTF, MTF, and symptom text cluster.  The label data is the symptom class name.

In [23]:
clf = GaussianNB()
clf.fit(features, labels)

GaussianNB()

**Predict just one record.**  Here I will feed the prediction model a transmission part # (06200), an arbitrary part cost ($2000), arbitrary DTF (0), MTF (5), and symptom text cluster ("WARNING LIGHT ON").

In [72]:
# criteria = [part5, part cost, dtf, mtf, symptom]
criteria = [part5_to_int_mapper['06200'], 2000, 0, 5, symptom_to_int_mapper['WARNING LIGHT ON']]
int_to_symp_class_mapper[clf.predict([criteria])[0]]



Comparing this to the source data, this is correct and makes sense!

**Predicting 10 records:**

In [25]:
test_data = features[:10]
test_data

[[0.0, 0.0, 0.0, 2.0, 3.0],
 [6.0, 458.91, 0.0, 11.0, 1.0],
 [6.0, 455.32, 0.0, 10.0, 1.0],
 [7.0, 0.0, 0.0, 5.0, 0.0],
 [11.0, 2.62, 0.0, 20887.0, 1.0],
 [11.0, 4.37, 0.0, 11849.0, 1.0],
 [12.0, 3.04, 0.0, 3.0, 6.0],
 [13.0, 0.0, 0.0, 5.0, 4.0],
 [19.0, 0.0, 0.0, 11.0, 0.0],
 [19.0, 0.0, 0.0, 14.0, 0.0]]

In [26]:
for item in test_data:
    print(int_to_symp_class_mapper[clf.predict([item])[0]], item[3])

VSA LIGHT ON (NO PARTS REPLACED) 2.0
MASTER CYLINDER/BOOSTER/POWER ASSY/FUNCTION ISSUE 11.0
MASTER CYLINDER/BOOSTER/POWER ASSY/FUNCTION ISSUE 10.0
RELAY/FUSE/FUSE (15A)/COSMETIC ISSUE 5.0
BULBS (INTERIOR)/04110/FUNCTION ISSUE 20887.0
BULBS (INTERIOR)/04110/FUNCTION ISSUE 11849.0
SRS/CONNECTOR/FUNCTION ISSUE 5.0
SIDE PANEL / FENDER/SIDE PANEL (RIGHT SIDE)/COSMETIC ISSUE 11.0
SIDE PANEL / FENDER/SIDE PANEL (RIGHT SIDE)/COSMETIC ISSUE 14.0


Comparing the above output to the source data, many of the predictions match the original symptom class name, but not all: the first record, for example, is BRAKE JUDDER in the source and VSA LIGHT ON in the prediction.  This is still pretty good considering that symptom class name assignments are generally not completely accurate anyway and were not expected to be.  Warranty data is inherently dirty, and that is reflected in the not-so-perfect symptom class names.  So the performance of my ML classification attempt looks encouraging.
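
The comparison against the source data can also be done in code rather than by eye.  The sketch below uses a tiny made-up data set, not the warranty data.  It predicts all rows in a single call and computes the fraction that agree with the original labels.  Scoring rows that were also used for training tends to give an optimistic number, so this is a sanity check rather than a true test.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

# Tiny made-up data set: two features, two well-separated classes
features = [[0.0, 1.0], [0.2, 0.9], [5.0, 6.0], [5.1, 6.2]]
labels = [0, 0, 1, 1]

clf = GaussianNB().fit(features, labels)

# predict() accepts many rows at once, so no Python loop is needed
predicted = clf.predict(features)

# Fraction of rows where the prediction equals the original label
agreement = np.mean(predicted == np.array(labels))
print(agreement)
```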

# Re-Using the Machine Learning Model to Classify Future Claims

### Persist the model so that we can re-use it without having to retrain

In [48]:
import pickle

# Save the fitted model ("clf" from the training step above)
with open(r'D:\jupyter\machine_learning\nbayes.sk', 'wb') as f:
    pickle.dump(clf, f)

### Re-use the Model and Load Helper Data Structures

In [54]:
# Load the model
clf2 = pickle.load(open(r'D:\jupyter\machine_learning\nbayes.sk','rb'))

# Load helper data structures that we made earlier
part5_to_int_mapper = pickle.load(open(r'D:\jupyter\machine_learning\part5_to_int_mapper.sk', 'rb'))
symptom_to_int_mapper = pickle.load(open(r'D:\jupyter\machine_learning\symptom_to_int_mapper.sk', 'rb'))
int_to_symp_class_mapper = pickle.load(open(r'D:\jupyter\machine_learning\int_to_symp_class_mapper.sk', 'rb'))

Again, in a production environment, it is probably best to load the mappings from a relational database instead of using Python's pickle.

### Test a single observation using the model

In [55]:
# criteria = [part5, part cost, dtf, mtf, symptom]
criteria = [part5_to_int_mapper['04823'], 0, 0, 207, symptom_to_int_mapper['COSMETIC ISSUE']]
int_to_symp_class_mapper[clf2.predict([criteria])[0]]

'SEAT BELTS/REAR/COSMETIC ISSUE'
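
For classifying many future claims, the lookup and prediction steps could be wrapped in a function.  This is only a sketch: `classify_claim` and `StubModel` are hypothetical names, the stub stands in for the loaded model, and returning `None` for unseen part numbers or symptoms is my assumption about how such cases might be handled.

```python
def classify_claim(model, part5, part_cost, dtf, mtf, symptom,
                   part5_map, symptom_map, class_names):
    """Return the predicted class name, or None if an input has no mapping."""
    if part5 not in part5_map or symptom not in symptom_map:
        return None  # unseen part number or symptom text; cannot encode
    criteria = [part5_map[part5], part_cost, dtf, mtf, symptom_map[symptom]]
    return class_names[model.predict([criteria])[0]]


class StubModel:
    """Stand-in for a fitted classifier; always predicts class 0."""
    def predict(self, rows):
        return [0 for _ in rows]


# Made-up mappings and class name, for illustration only
result = classify_claim(StubModel(), '04823', 0, 0, 207, 'COSMETIC ISSUE',
                        {'04823': 0}, {'COSMETIC ISSUE': 0}, {0: 'EXAMPLE CLASS'})
print(result)
```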

**That's it!**

Instead of symptom class names, we could classify warranty claims with other types of classification labels, so this example can be extended to any other classification we can come up with.

I have not tested this model extensively with larger test data, but so far I have been impressed with it.

# Conclusion

This small-scale example shows that a machine learning classification algorithm was able to classify warranty claims without hard-coded algorithms.  It was "trained" solely from the training data consisting of just some of the attributes of the warranty claims data.