<h2>Classifying Warranty Claims with Symptom Class Names Using Machine Learning</h2><br><br>
by Daniel J. Kim

At Honda Market Quality (MQ), we are responsible for identifying vehicle quality and safety problems.  Our primary source of market (field) information is warranty claims data, which represents the voice of our customers.  Each claim carries attributes such as part number, part cost, days to failure, miles to failure, and the customer's complaint.  Over the years, Honda has accumulated several million warranty claims.  To identify market problems efficiently, we need methods that "classify" (group) similar claims together so that we can find trends, track problems, and confirm that problems are fixed or countermeasured.

Today, warranty claims data is classified using several hard-coded algorithms that require extensive maintenance.  The jobs that our IT department runs to complete the classification take several hours overnight.  Given recent advances in, and the accessibility of, [machine learning](https://en.wikipedia.org/wiki/Machine_learning) (ML) methods, I believe MQ and our Honda IT professionals should investigate how ML can improve the warranty claims classification process and whether it can be extended to other areas of the business.  Furthermore, I strongly believe MQ needs to develop in-house capability and knowledge in machine learning.  Unfortunately, associates at MQ currently have little or no knowledge of ML, and that includes me.  But we can change that, and hopefully discover how machine learning can benefit MQ's business.

The following example is a proof-of-concept of how machine learning can classify warranty claims without hard-coded algorithms.  It is not meant to be representative of a "production" application.  The example is written in Python and uses the [scikit-learn](http://scikit-learn.org/stable/) machine learning library.  This document is a [Jupyter](http://jupyter.org/) notebook, which lets me document my process so that others can reproduce or follow it.

### Library Imports

In [1]:
import pandas as pd
import numpy as np

### Data Ingestion

Due to confidentiality, the raw data is not available.  The sample data originally came from an Excel file that I copied and pasted via my computer's clipboard; the cell below reads it from a CSV file.

In [2]:
df = pd.read_csv('/home/pybokeh/Downloads/fit_data.csv')

### Data Preparation: Data Cleansing and Transformation

The source data had dollar signs and commas in the part cost amounts, so we need to remove them and convert the part cost to a numeric (float) value.

In [4]:
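# regex=False treats '$' as a literal character rather than a regex end-of-string anchor
df['PART_COST_USD'] = df['PART_COST_USD'].str.replace('$', '', regex=False).str.replace(',', '', regex=False)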

In [5]:
df['PART_COST_USD'] = df['PART_COST_USD'].astype(float)
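
The two cleansing steps above can be tried on a few made-up values. This is a minimal sketch; the sample strings are hypothetical and only shaped like the raw `PART_COST_USD` values.

```python
import pandas as pd

# Hypothetical sample values shaped like the raw PART_COST_USD strings
raw = pd.Series(['$81.22', '$1,103.74', '$0.00'])

# regex=False treats '$' and ',' as literal characters, not regex patterns
cleaned = (raw.str.replace('$', '', regex=False)
              .str.replace(',', '', regex=False)
              .astype(float))

print(cleaned.tolist())  # [81.22, 1103.74, 0.0]
```

Passing `regex=False` explicitly keeps the behavior the same across pandas versions, since `$` has a special meaning when the pattern is interpreted as a regular expression.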

Confirm the data type of each column:

In [6]:
df.dtypes

FAIL_SHORT_PARTNO        object
PART_COST_USD           float64
DAYS_TO_FAIL_MINZERO      int64
MILES_TO_FAIL             int64
TEXT_CLUSTER_FAMILY      object
PRI_LAB_OPRTN_CD         object
MTC_MODEL                object
MTC_TYPE                 object
SYMP_CLASS_NM            object
dtype: object

Below are the first 5 rows of the training data set we will use.  The feature columns are the first 8 columns, and the target (label) data is the last column ("SYMP_CLASS_NM").

Basically, we want to classify future claims into their appropriate symptom class name based on part #, part cost, days to failure (DTF), miles to failure (MTF), symptom text cluster family, labor operation code, MTC model, and MTC type.

**Let's view our sample data:**

In [7]:
df.head()

Unnamed: 0,FAIL_SHORT_PARTNO,PART_COST_USD,DAYS_TO_FAIL_MINZERO,MILES_TO_FAIL,TEXT_CLUSTER_FAMILY,PRI_LAB_OPRTN_CD,MTC_MODEL,MTC_TYPE,SYMP_CLASS_NM
0,31500,81.22,689,4381,FUNCTION ISSUE,710100,FT5R,AC8,DEAD BATTERY (BATTERY ONLY REPL)
1,14310,103.74,692,31434,NOISE/VIBRATION,110153,FT5R,AB9,TIMING COMPONENTS/VTC ACTUATOR/NOISE/VIBRATION
2,14310,181.38,734,55236,NOISE/VIBRATION,110153,FT5R,AC8,TIMING COMPONENTS/VTC ACTUATOR/NOISE/VIBRATION
3,4770,0.0,1693,46707,FUNCTION ISSUE,752097,CTK6,AB5,OTHER/04770/FUNCTION ISSUE
4,16010,393.45,704,35763,WARNING LIGHT ON,121110,FT5R,AB9,FUEL SYSTEM/FUEL INJECTOR/WARNING LIGHT ON


Most machine learning algorithms require numeric input, so columns containing text/string data must be converted first.  We can use scikit-learn's LabelEncoder class to convert text/string values to integers.

In [13]:
from sklearn import preprocessing

part5_encoder = preprocessing.LabelEncoder()
text_cluster_encoder = preprocessing.LabelEncoder()
laborop_encoder = preprocessing.LabelEncoder()
mtc_model_encoder = preprocessing.LabelEncoder()
mtc_type_encoder = preprocessing.LabelEncoder()
symp_class_encoder = preprocessing.LabelEncoder()

part5_encoder.fit(df.FAIL_SHORT_PARTNO)
text_cluster_encoder.fit(df.TEXT_CLUSTER_FAMILY)
laborop_encoder.fit(df.PRI_LAB_OPRTN_CD)
mtc_model_encoder.fit(df.MTC_MODEL)
mtc_type_encoder.fit(df.MTC_TYPE)
symp_class_encoder.fit(df.SYMP_CLASS_NM)

LabelEncoder()
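
To see what a fitted encoder actually does, here is a small sketch on made-up symptom values (the list below is hypothetical, not taken from the claims data). It shows the three calls used in this notebook: `fit`, `transform`, and `inverse_transform`.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical symptom text values, similar to TEXT_CLUSTER_FAMILY
symptoms = ['NOISE/VIBRATION', 'FUNCTION ISSUE', 'WARNING LIGHT ON', 'FUNCTION ISSUE']

encoder = LabelEncoder()
encoder.fit(symptoms)

# classes_ holds the sorted unique values; each value's position is its integer code
print(list(encoder.classes_))

codes = encoder.transform(symptoms)
print(codes.tolist())  # [1, 0, 2, 0]

# inverse_transform maps integer codes back to the original strings
print(list(encoder.inverse_transform(codes)))
```

The integer codes follow the sorted order of the values and carry no real ranking.  The scikit-learn documentation describes LabelEncoder as intended for target labels and offers OrdinalEncoder as the counterpart for feature columns; for a proof-of-concept like this one, either yields integer codes.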

### Let's create new columns containing the integer version of the columns that contain text/string data

In [14]:
df['PART5'] = part5_encoder.transform(df.FAIL_SHORT_PARTNO)
df['TEXT_CLUSTER'] = text_cluster_encoder.transform(df.TEXT_CLUSTER_FAMILY)
df['LABOROP'] = laborop_encoder.transform(df.PRI_LAB_OPRTN_CD)
df['MTCMODEL'] = mtc_model_encoder.transform(df.MTC_MODEL)
df['MTCTYPE'] = mtc_type_encoder.transform(df.MTC_TYPE)
df['SYMP_CLASS'] = symp_class_encoder.transform(df.SYMP_CLASS_NM)

In [15]:
df.head()

Unnamed: 0,FAIL_SHORT_PARTNO,PART_COST_USD,DAYS_TO_FAIL_MINZERO,MILES_TO_FAIL,TEXT_CLUSTER_FAMILY,PRI_LAB_OPRTN_CD,MTC_MODEL,MTC_TYPE,SYMP_CLASS_NM,PART5,TEXT_CLUSTER,MTCMODEL,MTCTYPE,SYMP_CLASS,LABOROP
0,31500,81.22,689,4381,FUNCTION ISSUE,710100,FT5R,AC8,DEAD BATTERY (BATTERY ONLY REPL),388,1,5,9,498,795
1,14310,103.74,692,31434,NOISE/VIBRATION,110153,FT5R,AB9,TIMING COMPONENTS/VTC ACTUATOR/NOISE/VIBRATION,138,3,5,4,2546,126
2,14310,181.38,734,55236,NOISE/VIBRATION,110153,FT5R,AC8,TIMING COMPONENTS/VTC ACTUATOR/NOISE/VIBRATION,138,3,5,9,2546,126
3,4770,0.0,1693,46707,FUNCTION ISSUE,752097,CTK6,AB5,OTHER/04770/FUNCTION ISSUE,39,1,1,2,1807,1045
4,16010,393.45,704,35763,WARNING LIGHT ON,121110,FT5R,AB9,FUEL SYSTEM/FUEL INJECTOR/WARNING LIGHT ON,165,6,5,4,1073,250


### We need to save the encoders for use later on un-classified data

**Data Structure Persistence using Python's pickle library:**

In [16]:
import os
import pickle

encoder_dir = '/home/pybokeh/Dropbox/python/jupyter_notebooks/machine_learning'
encoders = {
    'part5_encoder': part5_encoder,
    'text_cluster_encoder': text_cluster_encoder,
    'laborop_encoder': laborop_encoder,
    'mtc_model_encoder': mtc_model_encoder,
    'mtc_type_encoder': mtc_type_encoder,
    'symp_class_encoder': symp_class_encoder,
}

# Encoders to disk; the with-statement closes each file after writing
for name, encoder in encoders.items():
    with open(os.path.join(encoder_dir, name + '.sk'), 'wb') as f:
        pickle.dump(encoder, f)

**NOTE:** For the sake of simplicity, I saved the fitted encoders (the text-to-integer mappings) using Python's pickle object serialization library.  In a production environment, it would be more suitable to store the mappings in a relational database table instead.
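
As a quick check that a pickled encoder comes back unchanged, the sketch below saves and reloads one encoder.  The temporary directory, file name, and the three sample codes are assumptions for illustration only.

```python
import os
import pickle
import tempfile

from sklearn.preprocessing import LabelEncoder

# Hypothetical values resembling MTC_TYPE codes
encoder = LabelEncoder().fit(['AB9', 'AC8', 'AB5'])

# Hypothetical file location in a temporary directory
path = os.path.join(tempfile.mkdtemp(), 'mtc_type_encoder.sk')

# The with-statement closes the file handle even if pickling fails
with open(path, 'wb') as f:
    pickle.dump(encoder, f)

with open(path, 'rb') as f:
    restored = pickle.load(f)

# The restored encoder produces the same integer code as the original
print(restored.transform(['AC8']).tolist())  # [2]
```

A pickle file should only be loaded from a trusted source, and it is generally safest to load it with the same scikit-learn version that created it.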

### Now we are ready to create our features input data

Our features data will consist of part5, part cost, DTF, MTF, labor operation code, symptom text cluster, MTC model, and MTC type (all represented as numeric values thanks to the encoders fitted earlier):

In [17]:
features = df[['PART5',
               'PART_COST_USD',
               'DAYS_TO_FAIL_MINZERO',
               'MILES_TO_FAIL',
               'LABOROP',
               'TEXT_CLUSTER',
               'MTCMODEL',
               'MTCTYPE'
              ]].values.tolist()

Now our features data does not contain text/string data.  Let's look at the first 10 rows of data:

In [18]:
features[:10]

[[388.0, 81.22, 689.0, 4381.0, 795.0, 1.0, 5.0, 9.0],
 [138.0, 103.74, 692.0, 31434.0, 126.0, 3.0, 5.0, 4.0],
 [138.0, 181.38, 734.0, 55236.0, 126.0, 3.0, 5.0, 9.0],
 [39.0, 0.0, 1693.0, 46707.0, 1045.0, 1.0, 1.0, 2.0],
 [165.0, 393.45, 704.0, 35763.0, 250.0, 6.0, 5.0, 4.0],
 [531.0, 0.0, 196.0, 1873.0, 291.0, 6.0, 8.0, 8.0],
 [1205.0, 7.63, 27.0, 2061.0, 983.0, 1.0, 8.0, 8.0],
 [1205.0, 3.98, 328.0, 13322.0, 983.0, 4.0, 7.0, 13.0],
 [1061.0, 514.61, 221.0, 7984.0, 1318.0, 0.0, 8.0, 3.0],
 [556.0, 185.0, 0.0, 3.0, 97.0, 1.0, 6.0, 8.0]]

Number of rows in our features data set:

In [19]:
len(features)

63157

### Now create our target/label data

In [20]:
labels = df.SYMP_CLASS.tolist()

Let's look at the first 10 labels:

In [21]:
labels[:10]

[498, 2546, 2546, 1807, 1073, 769, 2799, 2797, 1020, 1585]

Number of rows in our label data:

In [22]:
len(labels)

63157

## Partitioning the Data Sets

In [23]:
# train_test_split lives in sklearn.model_selection (sklearn.cross_validation was removed)
from sklearn.model_selection import train_test_split

features_train, features_test, labels_train, labels_test = train_test_split(features, labels, test_size=0.2, random_state=0)

Our features training data should contain 80% of our original complete data:

In [25]:
len(features_train)

50525
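
The row count above can be reproduced without the confidential data.  This sketch uses stand-in arrays (an assumption; only the row count matches the real data set) to show how `test_size=0.2` divides 63,157 rows.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in arrays with the same number of rows as the data set above
n_rows = 63157
X = np.arange(n_rows).reshape(-1, 1)
y = np.arange(n_rows)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

print(len(X_train), len(X_test))  # 50525 12632
```

Because 20% of 63,157 is not a whole number, the test partition is rounded up to 12,632 rows and the remaining 50,525 rows go to training.  Fixing `random_state` makes the shuffle repeatable between runs.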

## Features Scaling

In [26]:
from sklearn.preprocessing import StandardScaler

stdsc = StandardScaler()
features_train_std = stdsc.fit_transform(features_train)
features_test_std = stdsc.transform(features_test)
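
The pattern above (fit the scaler on the training rows, then reuse it on the test rows) can be illustrated with a tiny made-up example.  The numbers below are hypothetical.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical training and test rows: [part cost, miles to failure]
train = np.array([[10.0, 1000.0], [20.0, 3000.0], [30.0, 5000.0]])
test = np.array([[20.0, 1000.0]])

scaler = StandardScaler()
train_std = scaler.fit_transform(train)  # learns mean and std from training rows only
test_std = scaler.transform(test)        # reuses the training mean and std

print(train_std.mean(axis=0))  # approximately 0 for each column
print(train_std.std(axis=0))   # approximately 1 for each column
print(test_std)
```

Fitting on the training rows only keeps information about the test rows out of the model.  Tree-based models such as random forests are generally insensitive to this kind of per-feature rescaling, so the step matters more if a different algorithm is swapped in later.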

In [27]:
features_train_std[:20]

array([[ -4.79126454e-01,  -3.12521367e-01,   6.74557502e-02,
         -6.71951272e-02,  -1.23044825e+00,   1.84463187e+00,
          9.37355527e-01,   3.10333886e-02],
       [ -8.26223612e-01,  -1.05993171e-01,  -2.18889670e-01,
         -7.46796923e-01,   5.67593357e-02,   1.84463187e+00,
          9.37355527e-01,   8.73118221e-01],
       [  1.11072416e+00,  -3.12521367e-01,   2.84384099e-01,
          8.48955131e-01,   1.09573802e+00,   3.58977908e-01,
         -4.67051681e-01,   3.11728333e-01],
       [ -8.26223612e-01,  -1.06400023e-01,   1.91279290e+00,
          4.64626620e-01,   5.67593357e-02,  -6.31458068e-01,
         -1.40332315e+00,  -1.09174639e+00],
       [ -8.26223612e-01,  -1.34091380e-01,  -1.93155893e-02,
          3.39504745e-01,   5.67593357e-02,  -6.31458068e-01,
         -1.40332315e+00,   3.11728333e-01],
       [  1.05247010e+00,  -1.77548249e-01,  -5.22589358e-01,
         -5.05946276e-01,   1.82251130e+00,  -1.12667606e+00,
          9.37355527e-01,  -5.3

## Fit the model with training data

We'll use the random forest classification algorithm:

In [28]:
from sklearn.ensemble import RandomForestClassifier

rfc = RandomForestClassifier(n_jobs=1)
rfc.fit(features_train_std, labels_train)

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=1,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False)

**Predicting 10 records:**

In [29]:
test_data = features_test_std[:10]
test_data

array([[-1.15390275,  0.79607378,  2.88173953,  1.86936167, -1.51962212,
        -0.13624008, -1.40332315,  0.31172833],
       [ 0.52332897, -0.18520215, -1.21097532, -0.94649   ,  0.93451719,
        -1.12667606,  0.93735553,  0.87311822],
       [-0.59563459, -0.29367904,  0.00382343,  0.51037319,  0.25636608,
        -0.63145807, -0.46705168, -1.09174639],
       [ 0.12283225, -0.31252137, -1.10974209, -0.97868468, -0.58556493,
        -1.12667606, -1.40332315,  0.31172833],
       [ 0.09855973, -0.31252137,  0.09059477,  0.53848079, -0.54717902,
         0.35897791,  0.93735553,  1.995898  ],
       [-0.82622361, -0.13409138, -0.5168046 , -0.72478408,  0.05675934,
        -0.63145807, -0.46705168,  0.31172833],
       [ 0.64469162, -0.31252137,  0.32487739, -0.88790859,  1.30302195,
         0.35897791,  0.93735553,  0.87311822],
       [ 1.16412373, -0.2945436 , -0.74241008, -0.6251885 ,  0.50715404,
        -0.63145807,  0.93735553,  1.995898  ],
       [-0.82622361, -0.10342492

In [31]:
for name in symp_class_encoder.inverse_transform(labels_test[:10]):
    print(name)

HVAC/RADIATOR/LEAK
BUMPERS/FRONT/COSMETIC ISSUE
SWITCHES (DOOR)/SWITCH, DOOR/FUNCTION ISSUE
DAMPERS/DAMPER, FR/COSMETIC ISSUE
SUSPENSION/ARM, FR/NOISE/VIBRATION
DEAD BATTERY (BATTERY ONLY REPL)
DOOR WINDOW SYSTEM (FRONT)/RUN CHANNEL/NOISE/VIBRATION
WIPERS (FRONT)/BLADE/FUNCTION ISSUE
DEAD BATTERY (BATTERY ONLY REPL)
HVAC/HEATER SUB-ASSY/ODOR


In [32]:
for item in test_data:
    print(symp_class_encoder.inverse_transform(rfc.predict([item]))[0])

HVAC/RADIATOR/LEAK
BUMPERS/FRONT/COSMETIC ISSUE
SWITCHES (DOOR)/SWITCH, DOOR/FUNCTION ISSUE
STEERING SYSTEM/END, TIE ROD/COSMETIC ISSUE
SUSPENSION/LINK, FR/NOISE/VIBRATION
DEAD BATTERY (BATTERY ONLY REPL)
DOOR WINDOW SYSTEM (FRONT)/RUN CHANNEL/NOISE/VIBRATION
WIPERS (FRONT)/BLADE/FUNCTION ISSUE
DEAD BATTERY (BATTERY ONLY REPL)
ABS/VSA SYSTEM/MODULATOR ASSY/FUNCTION ISSUE


Comparing the predicted names to the actual names listed before them, 7 of the 10 records were classified correctly (70%); the 4th, 5th, and 10th were misclassified.  But this was a sample of only 10 observations.  We can use sklearn's accuracy score on a larger sample.

In [33]:
from sklearn.metrics import accuracy_score

accuracy_score(labels_test[:1000], rfc.predict(features_test_std[:1000]))

0.77100000000000002
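
The same function can be applied to the two ten-name lists printed earlier.  The sketch below copies those names by hand and compares a manual count of matches with `accuracy_score`.

```python
from sklearn.metrics import accuracy_score

# Actual symptom class names for the ten test records shown earlier
actual = [
    'HVAC/RADIATOR/LEAK',
    'BUMPERS/FRONT/COSMETIC ISSUE',
    'SWITCHES (DOOR)/SWITCH, DOOR/FUNCTION ISSUE',
    'DAMPERS/DAMPER, FR/COSMETIC ISSUE',
    'SUSPENSION/ARM, FR/NOISE/VIBRATION',
    'DEAD BATTERY (BATTERY ONLY REPL)',
    'DOOR WINDOW SYSTEM (FRONT)/RUN CHANNEL/NOISE/VIBRATION',
    'WIPERS (FRONT)/BLADE/FUNCTION ISSUE',
    'DEAD BATTERY (BATTERY ONLY REPL)',
    'HVAC/HEATER SUB-ASSY/ODOR',
]

# Names predicted by the model for the same ten records
predicted = [
    'HVAC/RADIATOR/LEAK',
    'BUMPERS/FRONT/COSMETIC ISSUE',
    'SWITCHES (DOOR)/SWITCH, DOOR/FUNCTION ISSUE',
    'STEERING SYSTEM/END, TIE ROD/COSMETIC ISSUE',
    'SUSPENSION/LINK, FR/NOISE/VIBRATION',
    'DEAD BATTERY (BATTERY ONLY REPL)',
    'DOOR WINDOW SYSTEM (FRONT)/RUN CHANNEL/NOISE/VIBRATION',
    'WIPERS (FRONT)/BLADE/FUNCTION ISSUE',
    'DEAD BATTERY (BATTERY ONLY REPL)',
    'ABS/VSA SYSTEM/MODULATOR ASSY/FUNCTION ISSUE',
]

# Manual count of exact matches
correct = sum(a == p for a, p in zip(actual, predicted))
print(correct, 'of', len(actual))

# accuracy_score returns the same fraction of exact matches
print(accuracy_score(actual, predicted))
```

Accuracy counts only exact matches, so a near miss such as SUSPENSION/LINK versus SUSPENSION/ARM is scored the same as a completely unrelated class.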

# Re-Using the Machine Learning Model to Classify Future Claims

### Persist the model so that we can re-use it without having to retrain

In [35]:
import pickle

pickle.dump(rfc, open('/home/pybokeh/Dropbox/python/jupyter_notebooks/machine_learning/randomforest.sk','wb'))

MemoryError: 
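
The `pickle.dump` call above ended in a `MemoryError`.  One option that is sometimes used for large scikit-learn models is `joblib` with compression, possibly combined with fewer or shallower trees.  Whether that avoids the error for this particular model is untested; the sketch below only demonstrates the round trip on synthetic data, and the file name, `max_depth=5`, and `compress=3` are illustrative assumptions.

```python
import os
import tempfile

import joblib
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data with eight features (not the warranty claims data)
X, y = make_classification(n_samples=200, n_features=8, n_informative=5,
                           n_classes=3, random_state=0)

# Fewer, shallower trees give a smaller model to serialize
model = RandomForestClassifier(n_estimators=10, max_depth=5, random_state=0)
model.fit(X, y)

# Hypothetical file location in a temporary directory
path = os.path.join(tempfile.mkdtemp(), 'randomforest.joblib')

# compress=3 trades some write time for a smaller file
joblib.dump(model, path, compress=3)
restored = joblib.load(path)

# The restored model should make the same predictions as the original
print((restored.predict(X) == model.predict(X)).all())
```

Limiting tree depth changes the model, so accuracy would need to be re-checked after any such change.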

### Re-use the Model and Load Helper Data Structures

In [54]:
# Load the model
rfc2 = pickle.load(open('/home/pybokeh/Dropbox/python/jupyter_notebooks/machine_learning/randomforest.sk','rb'))

# Load helper data structures that we made earlier
part5_to_int_mapper = pickle.load(open(r'D:\jupyter\machine_learning\part5_to_int_mapper.sk', 'rb'))
symptom_to_int_mapper = pickle.load(open(r'D:\jupyter\machine_learning\symptom_to_int_mapper.sk', 'rb'))
int_to_symp_class_mapper = pickle.load(open(r'D:\jupyter\machine_learning\int_to_symp_class_mapper.sk', 'rb'))

Again, in a production environment, it is probably best to load the mappings from a relational database instead of using Python's pickle.

### Test a single observation using the model

In [55]:
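# criteria = [part5, part cost, dtf, mtf, symptom]
# Note: this cell passes five unscaled features, whereas the model fitted above
# uses eight standardized features; the input must match the loaded model's features.
criteria = [part5_to_int_mapper['04823'], 0, 0, 207, symptom_to_int_mapper['COSMETIC ISSUE']]
int_to_symp_class_mapper[rfc2.predict([criteria])[0]]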

'SEAT BELTS/REAR/COSMETIC ISSUE'
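
When the model is trained on encoded and standardized features, as in the training cells earlier, a new claim has to go through the same fitted encoders and the same fitted scaler, in the same column order, before prediction.  The sketch below shows that flow end to end on a tiny made-up data set.  It uses three features instead of eight to stay short, and every value, shortened class name, and the `to_features` helper is hypothetical.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import LabelEncoder, StandardScaler

# Hypothetical training claims; values are made up for illustration
train = pd.DataFrame({
    'FAIL_SHORT_PARTNO': ['31500', '14310', '14310', '31500'],
    'PART_COST_USD': [81.22, 103.74, 181.38, 79.50],
    'TEXT_CLUSTER_FAMILY': ['FUNCTION ISSUE', 'NOISE/VIBRATION',
                            'NOISE/VIBRATION', 'FUNCTION ISSUE'],
    'SYMP_CLASS_NM': ['DEAD BATTERY', 'VTC ACTUATOR NOISE',
                      'VTC ACTUATOR NOISE', 'DEAD BATTERY'],
})

# One encoder per text column, fitted on the training data
part_enc = LabelEncoder().fit(train['FAIL_SHORT_PARTNO'])
text_enc = LabelEncoder().fit(train['TEXT_CLUSTER_FAMILY'])
label_enc = LabelEncoder().fit(train['SYMP_CLASS_NM'])


def to_features(part, cost, text):
    """Encode one raw claim in the same column order used for training."""
    return [part_enc.transform([part])[0], cost, text_enc.transform([text])[0]]


X = [to_features(p, c, t) for p, c, t in zip(train['FAIL_SHORT_PARTNO'],
                                             train['PART_COST_USD'],
                                             train['TEXT_CLUSTER_FAMILY'])]
y = label_enc.transform(train['SYMP_CLASS_NM'])

# Fit the scaler and the model on the training rows
scaler = StandardScaler().fit(X)
model = RandomForestClassifier(n_estimators=50, random_state=0)
model.fit(scaler.transform(X), y)

# A new, unclassified claim goes through the same encoders and scaler
new_claim = to_features('14310', 150.00, 'NOISE/VIBRATION')
pred = model.predict(scaler.transform([new_claim]))
result = label_enc.inverse_transform(pred)[0]
print(result)
```

A value the encoder has never seen (for example, a new part number) raises an error in `transform`, so a production version would need a policy for unseen categories.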

**That's it!**

Symptom class names are only one possible label.  The same approach can be extended to any other classification scheme we define for warranty claims.

I have not yet tested this model extensively on larger test data, but so far I have been impressed with it.

# Conclusion

This small-scale example shows that a machine learning classification algorithm was able to classify warranty claims without hard-coded algorithms, reaching about 77% accuracy on 1,000 held-out claims.  It was "trained" solely on training data consisting of just some of the attributes of the warranty claims data.