<h2>Classifying Warranty Claims with Symptom Class Names Using Machine Learning</h2><br><br>
by Daniel J. Kim

At Honda Market Quality (MQ), we are responsible for identifying vehicle quality and safety problems.  The primary source of market or field information is warranty claims data, which represents the voice of our customers.  The data contains attributes such as part number, part cost, days to failure, miles to failure, and the customer's complaint.  Over the years, Honda has accumulated several million warranty claims.  To identify market problems efficiently, we need methods that "classify" or group similar claims together so that we can find trends, track problems, and ensure problems are fixed or counter-measured.

Today, warranty claims data is classified using several hard-coded algorithms that require extensive maintenance.  The jobs that our IT runs to complete the classification take several hours overnight.  Given recent advances in, and the accessibility of, [machine learning](https://en.wikipedia.org/wiki/Machine_learning) (ML) methodologies, I believe MQ and our Honda IT professionals should investigate how ML can improve the warranty claims classification process and extend its usage to other applicable areas of the business.  Furthermore, I strongly believe MQ needs to develop in-house capability and knowledge in machine learning.  Unfortunately, MQ's associates currently have little or no knowledge of ML, and this includes me.  But we can change that, and hopefully we can discover the benefits of applying machine learning to enhance MQ's business.

The following example is an attempt at a proof-of-concept of how machine learning can be used to classify warranty claims without hard-coded algorithms and is not meant to be representative of a "production" application.  The programming language used to employ the machine learning algorithm is Python using the [scikit-learn](http://scikit-learn.org/stable/) machine learning library.  This document is a [Jupyter](http://jupyter.org/) web notebook which allows me to document my process so that perhaps others can duplicate or understand my process as well.

### Library Imports

In [1]:
from sklearn.naive_bayes import GaussianNB
from sklearn import preprocessing
import pandas as pd
import numpy as np

### Data Ingestion

Due to confidentiality, the raw data is not available here.  The sample data came from an Excel file, which I "copy/pasted" from my computer's "clipboard"; the cell below reads it from a CSV file.

In [2]:
df = pd.read_csv('/home/pybokeh/Downloads/sample_data.csv')

### Data Preparation: Data Cleansing and Transformation

Source data had dollar sign and comma in the part cost amounts.  So we need to remove them and ensure the part cost is a numeric (float) value.

In [3]:
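# regex=False treats '$' as a literal character rather than a regular-expression anchor
df['PART_COST_USD'] = df['PART_COST_USD'].str.replace('$', '', regex=False).str.replace(',', '', regex=False)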

In [4]:
df['PART_COST_USD'] = df['PART_COST_USD'].astype(float)
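
The cleansing step can be tried on a small made-up frame. This is only a sketch: the three values below are illustrative, not Honda data. Passing `regex=False` makes pandas treat `$` as a literal character instead of the regular-expression end-of-string anchor.

```python
import pandas as pd

# Hypothetical sample values standing in for the confidential part costs
toy = pd.DataFrame({'PART_COST_USD': ['$1,234.50', '$0.00', '$12.75']})

# Strip the literal "$" and "," characters, then convert the strings to floats
toy['PART_COST_USD'] = (
    toy['PART_COST_USD']
    .str.replace('$', '', regex=False)
    .str.replace(',', '', regex=False)
    .astype(float)
)

print(toy['PART_COST_USD'].tolist())  # [1234.5, 0.0, 12.75]
```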

Confirm data type of the source data:

In [5]:
df.dtypes

FAIL_SHORT_PARTNO        object
PART_COST_USD           float64
DAYS_TO_FAIL_MINZERO      int64
MILES_TO_FAIL             int64
TEXT_CLUSTER_FAMILY      object
SYMP_CLASS_NM            object
dtype: object

Below are the first 5 rows of the training data set we will use.  The feature columns that we will use are the first 5 columns, and the target or label data will be the last column ("SYMP_CLASS_NM").

Basically, we want to label or classify future claims based on part #, part cost, DTF, MTF, and symptom text cluster family to their appropriate symptom class name.

**Let's view our sample data:**

In [6]:
df.head()

|   | FAIL_SHORT_PARTNO | PART_COST_USD | DAYS_TO_FAIL_MINZERO | MILES_TO_FAIL | TEXT_CLUSTER_FAMILY | SYMP_CLASS_NM |
|---|---|---|---|---|---|---|
| 0 | 30 | 0.0 | 0 | 2 | NOISE/VIBRATION | BRAKE JUDDER |
| 1 | 1469 | 458.91 | 0 | 11 | FUNCTION ISSUE | BRAKE PEDAL SOFT |
| 2 | 1469 | 455.32 | 0 | 10 | FUNCTION ISSUE | MASTER CYLINDER/BOOSTER/POWER ASSY/FUNCTION ISSUE |
| 3 | 1611 | 0.0 | 0 | 5 | COSMETIC ISSUE | SIDE PANEL / FENDER/FENDER (FRONT)/COSMETIC ISSUE |
| 4 | 4110 | 2.62 | 0 | 20887 | FUNCTION ISSUE | BULBS (INTERIOR)/04110/FUNCTION ISSUE |


Most machine learning algorithms require numeric input, so text/string columns have to be converted first.  We can use scikit-learn's LabelEncoder class to convert text/string values to integers.

In [8]:
from sklearn import preprocessing

part5_encoder = preprocessing.LabelEncoder()
text_cluster_encoder = preprocessing.LabelEncoder()
symp_class_encoder = preprocessing.LabelEncoder()

part5_encoder.fit(df.FAIL_SHORT_PARTNO)
text_cluster_encoder.fit(df.TEXT_CLUSTER_FAMILY)
symp_class_encoder.fit(df.SYMP_CLASS_NM)

LabelEncoder()
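
To make the encoding concrete, here is a minimal sketch on a made-up list of symptom families (the list is hypothetical; the real column has more categories). It shows the three calls the rest of this notebook relies on: fitting, transforming text to integers, and transforming integers back to text.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical symptom families, not the real data
families = ['FUNCTION ISSUE', 'COSMETIC ISSUE', 'NOISE/VIBRATION', 'FUNCTION ISSUE']

encoder = LabelEncoder()
codes = encoder.fit_transform(families)

# Categories are stored in sorted order; the position is the integer code
print(list(encoder.classes_))  # ['COSMETIC ISSUE', 'FUNCTION ISSUE', 'NOISE/VIBRATION']
print(codes.tolist())          # [1, 0, 2, 1]

# inverse_transform maps the integer codes back to the original text
decoded = list(encoder.inverse_transform(codes))
print(decoded)
```

Two caveats worth keeping in mind: `transform` raises a `ValueError` for a value the encoder did not see during `fit`, and the scikit-learn documentation describes LabelEncoder as intended for target labels, since integer codes on input features imply an ordering that the categories may not really have.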

### Let's create new columns containing the integer version of the columns that contain text/string data

In [9]:
df['PART5'] = part5_encoder.transform(df.FAIL_SHORT_PARTNO)

In [10]:
df['TEXT_CLUSTER'] = text_cluster_encoder.transform(df.TEXT_CLUSTER_FAMILY)

In [11]:
df['SYMP_CLASS'] = symp_class_encoder.transform(df.SYMP_CLASS_NM)

In [12]:
df.head()

|   | FAIL_SHORT_PARTNO | PART_COST_USD | DAYS_TO_FAIL_MINZERO | MILES_TO_FAIL | TEXT_CLUSTER_FAMILY | SYMP_CLASS_NM | PART5 | TEXT_CLUSTER | SYMP_CLASS |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 30 | 0.0 | 0 | 2 | NOISE/VIBRATION | BRAKE JUDDER | 0 | 3 | 161 |
| 1 | 1469 | 458.91 | 0 | 11 | FUNCTION ISSUE | BRAKE PEDAL SOFT | 6 | 1 | 162 |
| 2 | 1469 | 455.32 | 0 | 10 | FUNCTION ISSUE | MASTER CYLINDER/BOOSTER/POWER ASSY/FUNCTION ISSUE | 6 | 1 | 1708 |
| 3 | 1611 | 0.0 | 0 | 5 | COSMETIC ISSUE | SIDE PANEL / FENDER/FENDER (FRONT)/COSMETIC ISSUE | 7 | 0 | 2280 |
| 4 | 4110 | 2.62 | 0 | 20887 | FUNCTION ISSUE | BULBS (INTERIOR)/04110/FUNCTION ISSUE | 11 | 1 | 248 |


### We need to save the encoders for use later on un-classified data

**Data Structure Persistence using Python's pickle library:**

In [13]:
import pickle

# Save the fitted encoders to disk; "with" closes each file once it has been written
with open(r'C:\Users\pybokeh\Dropbox\python\jupyter_notebooks\machine_learning\part5_encoder.sk', 'wb') as f:
    pickle.dump(part5_encoder, f)
with open(r'C:\Users\pybokeh\Dropbox\python\jupyter_notebooks\machine_learning\text_cluster_encoder.sk', 'wb') as f:
    pickle.dump(text_cluster_encoder, f)
with open(r'C:\Users\pybokeh\Dropbox\python\jupyter_notebooks\machine_learning\symp_class_encoder.sk', 'wb') as f:
    pickle.dump(symp_class_encoder, f)

**NOTE:** For the sake of simplicity, I saved the encoders (the text-to-integer mappings) using Python's pickle object serialization library.  In a production environment, it would be more suitable to store the mappings in a relational database table instead.
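
As a rough idea of what the database alternative could look like, the sketch below stores a text-to-integer mapping in a table and looks one value up. Everything here is an assumption for illustration: it uses Python's built-in `sqlite3` with an in-memory database, and the table name `text_cluster_map` and the mapping values are made up.

```python
import sqlite3

# Hypothetical mapping, in the same text -> integer form that an encoder produces
mapping = {'COSMETIC ISSUE': 0, 'FUNCTION ISSUE': 1, 'NOISE/VIBRATION': 2}

conn = sqlite3.connect(':memory:')  # in-memory database, for illustration only
conn.execute(
    'CREATE TABLE text_cluster_map (label TEXT PRIMARY KEY, code INTEGER NOT NULL)'
)
conn.executemany(
    'INSERT INTO text_cluster_map (label, code) VALUES (?, ?)', mapping.items()
)
conn.commit()

# Look up the integer code for a single label
row = conn.execute(
    'SELECT code FROM text_cluster_map WHERE label = ?', ('FUNCTION ISSUE',)
).fetchone()
code = row[0]
print(code)  # 1
```

A table like this can be inspected and updated with ordinary SQL, which a pickle file does not allow.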

### Now we are ready to create our features input data

Our features data will consist of: part5, part cost, DTF, MTF, and symptom text cluster (all represented with numeric values thanks to my mappings made earlier):

In [13]:
features = df[['PART5','PART_COST_USD','DAYS_TO_FAIL_MINZERO','MILES_TO_FAIL','TEXT_CLUSTER']].values.tolist()

Now our features data does not contain text/string data.  Let's look at the first 10 rows of data:

In [14]:
features[:10]

[[0.0, 0.0, 0.0, 2.0, 3.0],
 [6.0, 458.91, 0.0, 11.0, 1.0],
 [6.0, 455.32, 0.0, 10.0, 1.0],
 [7.0, 0.0, 0.0, 5.0, 0.0],
 [11.0, 2.62, 0.0, 20887.0, 1.0],
 [11.0, 4.37, 0.0, 11849.0, 1.0],
 [12.0, 3.04, 0.0, 3.0, 6.0],
 [13.0, 0.0, 0.0, 5.0, 4.0],
 [19.0, 0.0, 0.0, 11.0, 0.0],
 [19.0, 0.0, 0.0, 14.0, 0.0]]

Number of rows in our features data set:

In [15]:
len(features)

81403

### Now create our target/label data

In [16]:
labels = df.SYMP_CLASS.tolist()

Let's look at the first 10 labels:

In [17]:
labels[:10]

[161, 162, 1708, 2280, 248, 248, 1247, 2343, 2293, 2293]

Number of rows in our label data:

In [18]:
len(labels)

81403

## Partitioning the Data Sets

In [19]:
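# train_test_split lives in sklearn.model_selection (the old sklearn.cross_validation module was removed in scikit-learn 0.20)
from sklearn.model_selection import train_test_split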

features_train, features_test, labels_train, labels_test = train_test_split(features, labels, test_size=0.2, random_state=0)

Our features training data should contain 80% of our original complete data:

In [21]:
len(features_train)

65122
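
The row count above can be checked independently of the confidential data. This sketch feeds placeholder arrays of the same length (81403) through `train_test_split` with the same `test_size`; it assumes a scikit-learn version that provides `sklearn.model_selection`.

```python
import numpy as np
from sklearn.model_selection import train_test_split

n_rows = 81403                        # number of rows reported above
X = np.arange(n_rows).reshape(-1, 1)  # placeholder features
y = np.zeros(n_rows, dtype=int)       # placeholder labels

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# The test share is rounded up, and the training set gets the remainder
print(len(X_train), len(X_test))  # 65122 16281
```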

In [22]:
features_train[0]

[415.0, 81.22, 891.0, 1887.0, 6.0]

## Features Scaling

In [23]:
from sklearn.preprocessing import StandardScaler

stdsc = StandardScaler()
features_train_std = stdsc.fit_transform(features_train)
features_test_std = stdsc.transform(features_test)
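
For readers new to scaling, the sketch below shows what `StandardScaler` computes, using made-up part costs. Each value becomes `(x - mean) / std`, where the mean and standard deviation are learned from the training rows only and then reused for the test rows.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical part costs for a tiny training and test split
train = np.array([[0.0], [10.0], [20.0]])
test = np.array([[10.0], [40.0]])

scaler = StandardScaler()
train_std = scaler.fit_transform(train)  # learns mean and std from the training rows
test_std = scaler.transform(test)        # reuses the training mean and std

# The same result computed by hand: z = (x - mean) / std
manual = (test - train.mean(axis=0)) / train.std(axis=0)

print(train_std.ravel())
print(test_std.ravel())
```

`scaler.inverse_transform` goes the other way, from standardized values back to the original units.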

In [24]:
features_train_std[:20]

array([[ -0.86131084,  -0.09973194,   1.05144666,  -0.93794294,
          1.81290566],
       [  0.9911888 ,  -0.1773592 ,   0.04145509,   1.25145102,
         -1.14746417],
       [  1.38262607,   0.09687569,  -0.22257824,  -0.31504457,  -0.6540692 ],
       [  1.86964685,   0.18940672,   0.7002567 ,   0.28702583,
         -1.14746417],
       [ -0.38111744,   1.42289478,   1.57438645,   0.45636622,
          0.82611572],
       [ -1.7261141 ,   0.20117431,   1.86661752,   2.95268154,
         -1.14746417],
       [ -1.1389582 ,  15.87116858,   0.118358  ,   0.5200144 ,
         -0.16067423],
       [  0.71809303,  -0.22817634,   0.63617094,   1.45384448,
         -1.14746417],
       [  1.77633913,  -0.11768862,  -0.49942872,  -0.03554885,
         -1.14746417],
       [ -1.26640289,  -0.2644505 ,   1.69486768,   0.12311882,
          1.81290566],
       [ -0.86131084,  -0.13039984,   1.45903209,   0.71212324,  -0.6540692 ],
       [ -0.86131084,  -0.09973194,  -0.16618277,  -0.65650...
       ... (output truncated)

### Now use scikit-learn's NaiveBayes Gaussian classification algorithm

There are several ML classification algorithms to choose from.  Since I do not know the internal implementation of every algorithm, I resorted to trial and error to find the one that gave me the best results; quite frankly, I am not yet knowledgeable about how to perform proper model validation.  The model that gave me the best results on the large data set was the Naive Bayes classification algorithm.  With smaller data sets, the decision tree and random forest algorithms gave me good results.
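
One common way to make the trial-and-error comparison more systematic is k-fold cross-validation. The sketch below is not what was run for this notebook; it uses synthetic data from `make_classification` as a stand-in for the warranty data, and the two candidate models and their settings are assumptions chosen for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB

# Synthetic stand-in data: 600 rows, 5 features, 3 classes (not warranty data)
X, y = make_classification(n_samples=600, n_features=5, n_informative=3,
                           n_redundant=0, n_classes=3, random_state=0)

candidates = {
    'GaussianNB': GaussianNB(),
    'RandomForest': RandomForestClassifier(n_estimators=50, random_state=0),
}

scores = {}
for name, model in candidates.items():
    # 5-fold cross-validation: each fold is held out once and used for scoring
    fold_scores = cross_val_score(model, X, y, cv=5)
    scores[name] = fold_scores.mean()
    print(name, round(scores[name], 3))
```

Each model is scored only on rows it was not trained on, so the averaged scores give a fairer basis for choosing between algorithms than checking a handful of records by eye.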

Fit the training data and target/label data.  Again, the training data is the part #, part cost, DTF, MTF, and symptom text cluster.  The label data is the symptom class name.

In [27]:
clf = GaussianNB()
clf.fit(features_train_std, labels_train)

GaussianNB()

In [57]:
from sklearn.linear_model import SGDClassifier

classes = np.unique(labels_train)

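# Logistic-regression loss; it is named "log_loss" since scikit-learn 1.1 (older versions used "log")
sgdc = SGDClassifier(loss="log_loss", penalty="l2")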
# sgdc.partial_fit(features_train_std, labels_train, classes=classes)
sgdc.fit(features_train_std, labels_train)

SGDClassifier(alpha=0.0001, average=False, class_weight=None, epsilon=0.1,
       eta0=0.0, fit_intercept=True, l1_ratio=0.15,
       learning_rate='optimal', loss='log', n_iter=5, n_jobs=1,
       penalty='l2', power_t=0.5, random_state=None, shuffle=True,
       verbose=0, warm_start=False)
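
The commented-out `partial_fit` call above hints at incremental training, which can help when the data does not fit in memory. A minimal sketch on synthetic data follows; the batch size and the default hinge loss are assumptions for this example, not settings used in this notebook.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier

# Synthetic stand-in data (not warranty data)
X, y = make_classification(n_samples=1000, n_features=5, n_informative=3,
                           n_redundant=0, n_classes=3, random_state=0)
classes = np.unique(y)  # partial_fit needs the full set of labels up front

model = SGDClassifier(random_state=0)  # default hinge loss, assumed for this sketch
batch_size = 200
for start in range(0, len(X), batch_size):
    stop = start + batch_size
    # classes is required on the first call and may be repeated on later calls
    model.partial_fit(X[start:stop], y[start:stop], classes=classes)

print(model.classes_)
```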

In [25]:
from sklearn.ensemble import RandomForestClassifier

rfc = RandomForestClassifier(n_jobs=1)
rfc.fit(features_train_std, labels_train)

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=1,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False)

**Predict just one record.**  Here I will feed the prediction model a transmission part # (06200), an arbitrary part cost ($2000), arbitrary DTF (0), MTF (5), and symptom text cluster ("WARNING LIGHT ON").

In [29]:
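# criteria = [part5, part cost, dtf, mtf, symptom]
# transform() returns an array, so take element [0] to get the integer code
criteria = [part5_encoder.transform(['06200'])[0], 2000, 0, 5, text_cluster_encoder.transform(['WARNING LIGHT ON'])[0]]
# Scale the raw values the same way as the training data (transform, not inverse_transform)
criteria_std = stdsc.transform([criteria])
symp_class_encoder.inverse_transform(rfc.predict(criteria_std))[0]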



Comparing this to the source data, this is correct and makes sense!
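
The full path for a single record is: encode the text fields, scale the raw values with the scaler fitted on the training data, predict, then decode the predicted integer back to a name. The sketch below walks through that path on four made-up rows with only two features; all names and values are hypothetical, and `bootstrap=False` is set only so that every tree sees all four toy rows.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import LabelEncoder, StandardScaler

# Hypothetical training rows: part number, part cost, and the class name to learn
parts = ['06200', '04110', '06200', '04110']
costs = [2000.0, 2.5, 1900.0, 3.0]
names = ['WARNING LIGHT', 'BULB', 'WARNING LIGHT', 'BULB']

part_enc = LabelEncoder().fit(parts)
name_enc = LabelEncoder().fit(names)

X_raw = np.column_stack([part_enc.transform(parts), costs])
scaler = StandardScaler().fit(X_raw)

model = RandomForestClassifier(n_estimators=10, bootstrap=False, random_state=0)
model.fit(scaler.transform(X_raw), name_enc.transform(names))

# New record: encode the text, scale with transform (raw -> standardized), predict, decode
new_raw = [[part_enc.transform(['06200'])[0], 2000.0]]
new_std = scaler.transform(new_raw)
predicted = name_enc.inverse_transform(model.predict(new_std))[0]
print(predicted)  # WARNING LIGHT
```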

**Predicting 10 records:**

In [26]:
test_data = features_train_std[:10]
test_data

array([[ -0.86131084,  -0.09973194,   1.05144666,  -0.93794294,
          1.81290566],
       [  0.9911888 ,  -0.1773592 ,   0.04145509,   1.25145102,
         -1.14746417],
       [  1.38262607,   0.09687569,  -0.22257824,  -0.31504457,  -0.6540692 ],
       [  1.86964685,   0.18940672,   0.7002567 ,   0.28702583,
         -1.14746417],
       [ -0.38111744,   1.42289478,   1.57438645,   0.45636622,
          0.82611572],
       [ -1.7261141 ,   0.20117431,   1.86661752,   2.95268154,
         -1.14746417],
       [ -1.1389582 ,  15.87116858,   0.118358  ,   0.5200144 ,
         -0.16067423],
       [  0.71809303,  -0.22817634,   0.63617094,   1.45384448,
         -1.14746417],
       [  1.77633913,  -0.11768862,  -0.49942872,  -0.03554885,
         -1.14746417],
       [ -1.26640289,  -0.2644505 ,   1.69486768,   0.12311882,
          1.81290566]])

In [49]:
for name in symp_class_encoder.inverse_transform(labels_train[:10]):
    print(name)

DEAD BATTERY (BATTERY ONLY REPL)
TAILGATE / TRUNK/GARNISH/COSMETIC ISSUE
HVAC/BLOWER MOTOR/FUNCTION ISSUE
DOORS (FRONT)/LINER/COSMETIC ISSUE
HVAC/COMPRESSOR/FUNCTION ISSUE
OTHER/04746/COSMETIC ISSUE
OTHER/1A001 MOTOR   TRANSMISSION/LEAK
DOORS (REAR)/SEAL/COSMETIC ISSUE
SUNVISOR/VISOR/COSMETIC ISSUE
EVAP


In [43]:
for item in test_data:
    print(symp_class_encoder.inverse_transform(clf.predict([item]))[0])

TAILGATE / TRUNK/GARNISH/COSMETIC ISSUE
HVAC/BLOWER MOTOR/FUNCTION ISSUE
FRONT WINDSHIELD/COWL/COSMETIC ISSUE
HVAC/COMPRESSOR/FUNCTION ISSUE
OTHER/04746/COSMETIC ISSUE
OTHER/1A001 MOTOR   TRANSMISSION/LEAK
FRONT WINDSHIELD/COWL/COSMETIC ISSUE
FRONT WINDSHIELD/COWL/COSMETIC ISSUE


In [27]:
for item in test_data:
    print(symp_class_encoder.inverse_transform(rfc.predict([item]))[0])

DEAD BATTERY (BATTERY ONLY REPL)
TAILGATE / TRUNK/GARNISH/COSMETIC ISSUE
HVAC/BLOWER MOTOR/FUNCTION ISSUE
DOORS (FRONT)/LINER/COSMETIC ISSUE
HVAC/COMPRESSOR/FUNCTION ISSUE
OTHER/04746/COSMETIC ISSUE
OTHER/1A001 MOTOR   TRANSMISSION/LEAK
DOORS (REAR)/SEAL/COSMETIC ISSUE
SUNVISOR/VISOR/COSMETIC ISSUE
EVAP


In [58]:
for item in test_data:
    print(symp_class_encoder.inverse_transform(sgdc.predict([item]))[0])

TPMS LIGHT ON
AC LOW FILL / TEST
HVAC/BLOWER MOTOR/FUNCTION ISSUE
HVAC/BLOWER MOTOR/FUNCTION ISSUE
DEAD BATTERY (BATTERY ONLY REPL)
DEAD BATTERY (BATTERY ONLY REPL)
AUDIO SYSTEM/HEAD UNIT/FUNCTION ISSUE
AC LOW FILL / TEST
TAILGATE / TRUNK/GARNISH/COSMETIC ISSUE
TPMS LIGHT ON


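Comparing the above output to the source data, all but 1 of the records were classified the same as in the original source.  Note that these 10 records come from the training data (`features_train_std[:10]`), so this is a quick sanity check rather than a measure of how the models handle unseen claims.  The match is still pretty good considering that symptom class name assignments are generally not completely accurate anyway and were not expected to be.  Warranty data is inherently dirty, and that is reflected in the not-so-perfect symptom class names.  So the performance of my ML classification attempt looks very good.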
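
A check on the held-out 20% would give a better picture than looking at training records. The sketch below illustrates the idea on synthetic data (not warranty data), with an assumed random forest configuration: it compares accuracy on rows the model has already seen with accuracy on rows it has not.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (not warranty data)
X, y = make_classification(n_samples=1000, n_features=5, n_informative=3,
                           n_redundant=0, n_classes=3, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

model = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_train, y_train)

# Accuracy on rows used for training versus rows held out from training
train_acc = accuracy_score(y_train, model.predict(X_train))
test_acc = accuracy_score(y_test, model.predict(X_test))
print(round(train_acc, 3), round(test_acc, 3))
```

With the variables from this notebook, the equivalent check would be `accuracy_score(labels_test, rfc.predict(features_test_std))`.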

# Re-Using the Machine Learning Model to Classify Future Claims

### Persist the model so that we can re-use it without having to retrain

In [48]:
import pickle

# clf is the GaussianNB model fitted above
with open(r'D:\jupyter\machine_learning\nbayes.sk', 'wb') as f:
    pickle.dump(clf, f)

### Re-use the Model and Load Helper Data Structures

In [54]:
# Load the model
clf2 = pickle.load(open(r'D:\jupyter\machine_learning\nbayes.sk','rb'))

# Load helper data structures that we made earlier
part5_to_int_mapper = pickle.load(open(r'D:\jupyter\machine_learning\part5_to_int_mapper.sk', 'rb'))
symptom_to_int_mapper = pickle.load(open(r'D:\jupyter\machine_learning\symptom_to_int_mapper.sk', 'rb'))
int_to_symp_class_mapper = pickle.load(open(r'D:\jupyter\machine_learning\int_to_symp_class_mapper.sk', 'rb'))

Again, in a production environment, it is probably best to load the mappings from a relational database instead of using Python's pickle.

### Test a single observation using the model

In [55]:
# criteria = [part5, part cost, dtf, mtf, symptom]
criteria = [part5_to_int_mapper['04823'], 0, 0, 207, symptom_to_int_mapper['COSMETIC ISSUE']]
int_to_symp_class_mapper[clf2.predict([criteria])[0]]

'SEAT BELTS/REAR/COSMETIC ISSUE'
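
If the model was trained on standardized features, the scaler has to be persisted and reapplied as well, or the reloaded model will see values on a different scale than it was trained on. One possible way to keep the pieces together, sketched here on made-up data with a hypothetical file name, is to pickle them in a single dictionary.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.naive_bayes import GaussianNB
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data: two well-separated classes
X = np.array([[0.0, 1.0], [0.2, 0.9], [5.0, 6.0], [5.1, 6.2]])
y = np.array([0, 0, 1, 1])

scaler = StandardScaler().fit(X)
model = GaussianNB().fit(scaler.transform(X), y)

# Save the scaler and the model together so they cannot get out of step
path = os.path.join(tempfile.mkdtemp(), 'bundle.pkl')  # hypothetical file name
with open(path, 'wb') as f:
    pickle.dump({'scaler': scaler, 'model': model}, f)

# Later, possibly in a new session: reload, scale the raw record, and predict
with open(path, 'rb') as f:
    bundle = pickle.load(f)
prediction = bundle['model'].predict(bundle['scaler'].transform([[5.0, 6.1]]))[0]
print(prediction)  # 1
```

The fitted encoders could be added to the same dictionary so that a single file holds everything needed to classify a new claim.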

**That's it!**

Instead of symptom class names, we can classify warranty claims with other types of labels, so this example can be extended to any other classification scheme we come up with.

I have not tested this model extensively with larger test data, but so far I have been impressed with it.

# Conclusion

This small-scale example shows that a machine learning classification algorithm was able to classify warranty claims without hard-coded algorithms.  It was "trained" solely from the training data consisting of just some of the attributes of the warranty claims data.