## Classifying Brake Warranty Claims as Either Hard Brake Pedal Problem or Not

\#imbalanced_data, \#tfidf, \#label_encode, \#one_hot_encode

**BACKGROUND:** Brake analysts manually classify vehicle brake warranty claims by reviewing the part # and the customer's complaint. Based on these two features, the analyst labels each warranty claim as a particular brake problem (1) or not (0).

**GOAL:** Use machine learning classification instead of the manual process above.

In [1]:
import pandas as pd
import numpy as np
import pickle
from scipy import sparse
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, OneHotEncoder, LabelBinarizer
from sklearn.metrics import accuracy_score
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.naive_bayes import MultinomialNB

### Raw Data Ingestion

In [2]:
df = pd.read_csv('book1.csv')
df.head(n=5)

Unnamed: 0,Part5,Labor_Cost_USD,Part_Cost_USD,Days_To_Fail_MinZero,Miles_To_Fail,Customer_Complaint,PROBLEM,Target
0,81690,97.33,252.88,483,13949,CK FOR THE DRIVERS SIDE 2ND ROW SEAT WILL NOT ...,Unconfirmed,0
1,57455,436.82,192.64,1265,46722,CONTINUATION OF FIRST LINE. ADDED FOR TECHNICI...,Unconfirmed,0
2,57306,535.5,1293.16,274,17819,C/S BRAKE SYSTEM WARNING LIGHT & OTHER WARNING...,Unconfirmed,0
3,57111,1261.5,610.51,5,205,CLIENTS STATES THERE IS A WARNING LIGHT ON AND...,Unconfirmed,0
4,57111,1009.87,463.58,467,30655,VERIFIER PEDALE DE FREIN . DESCEND AU FOND ET ...,Unconfirmed,0


In [3]:
df.dtypes

Part5                     int64
Labor_Cost_USD          float64
Part_Cost_USD           float64
Days_To_Fail_MinZero      int64
Miles_To_Fail             int64
Customer_Complaint       object
PROBLEM                  object
Target                    int64
dtype: object

In [6]:
df['Part5'] = df['Part5'].astype('category')
df.dtypes

Part5                   category
Labor_Cost_USD           float64
Part_Cost_USD            float64
Days_To_Fail_MinZero       int64
Miles_To_Fail              int64
Customer_Complaint        object
PROBLEM                   object
Target                     int64
dtype: object

In [8]:
data = df[['Part5','Customer_Complaint','Target']]

In [9]:
data.head()

Unnamed: 0,Part5,Customer_Complaint,Target
0,81690,CK FOR THE DRIVERS SIDE 2ND ROW SEAT WILL NOT ...,0
1,57455,CONTINUATION OF FIRST LINE. ADDED FOR TECHNICI...,0
2,57306,C/S BRAKE SYSTEM WARNING LIGHT & OTHER WARNING...,0
3,57111,CLIENTS STATES THERE IS A WARNING LIGHT ON AND...,0
4,57111,VERIFIER PEDALE DE FREIN . DESCEND AU FOND ET ...,0


#### Data is imbalanced:

In [6]:
data['Target'].value_counts()

0    1514
1     158
Name: Target, dtype: int64

#### Upsample the minority class data

In [7]:
from sklearn.utils import resample

df_Upsmpl = resample(data.query("Target == 1"), 
                                    replace=True, 
                                    n_samples = data.query("Target == 0").shape[0], 
                                    random_state = 321)

df_Upsmpl.head()

Unnamed: 0,Fail_Short_PartNo,Customer_Contention_Text,Target
45,46402,REPAIR WHEN CAR IS STARTED COLD BRAKE PEDAL WA...,1
680,1469,CUSTOMER STATES: CHECK BRAKE SYSTEM LIGHT IS O...,1
50,46402,HAVE TO PRESS HARD ON BRAKE PEDAL,1
60,46402,CUST STATES BRAKE PEDAL HARD ON FIRST APPLICAT...,1
701,1469,DURING PDI- BRAKES WILL LOCK ON WHEN DRIVING,1


#### Now data is balanced

In [8]:
# Use `data` (the 3-column subset) so both frames share the same columns
df_balanced = pd.concat([data.query("Target == 0"), df_Upsmpl])

df_balanced["Target"].value_counts()

1    1514
0    1514
Name: Target, dtype: int64
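
The upsampling step can be tried in isolation on a tiny table. This is a minimal sketch, not the notebook's data: the `toy` frame and its values are made up for illustration, and only `resample` and `pd.concat` are taken from the cells above.

```python
import pandas as pd
from sklearn.utils import resample

# Toy stand-in for the claims table (values are illustrative only)
toy = pd.DataFrame({
    "Target": [0, 0, 0, 0, 0, 0, 1, 1],
    "Customer_Complaint": ["a", "b", "c", "d", "e", "f", "g", "h"],
})

majority = toy[toy["Target"] == 0]
minority = toy[toy["Target"] == 1]

# Draw minority rows with replacement until both classes are the same size
minority_upsampled = resample(
    minority,
    replace=True,
    n_samples=len(majority),
    random_state=321,
)

toy_balanced = pd.concat([majority, minority_upsampled])
counts = toy_balanced["Target"].value_counts()
print(counts.to_dict())
```

Because the sampling is done with replacement, the upsampled rows are repeats of the original minority rows rather than new information.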

## Encode Features Data

Instantiate the encoders and vectorizers:

In [9]:
enc_label = LabelEncoder()
enc_onehot = OneHotEncoder()
enc_labelbinarizer = LabelBinarizer()
count_vect = CountVectorizer()
tfidf_transformer = TfidfTransformer()

### Encode the part # column

Label encode it first:

In [10]:
X_partno_labelencoded = enc_label.fit_transform(df_balanced.Fail_Short_PartNo.values)

In [11]:
X_partno_labelencoded

array([31, 29, 28, ...,  2, 18, 18])

Then onehot encode it:

In [12]:
X_partno_onehot = enc_onehot.fit_transform(X_partno_labelencoded.reshape(-1,1))

In [13]:
X_partno_onehot.shape

(3028, 32)
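
In recent scikit-learn releases (0.20 and later, as far as I know), `OneHotEncoder` accepts string or integer categories directly, so the separate `LabelEncoder` pass may not be needed. The sketch below uses a few made-up part numbers; `handle_unknown="ignore"` is an optional setting, not something the notebook uses, that encodes an unseen part number as an all-zero row instead of raising an error.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Illustrative part numbers, shaped as a single-column 2D array
parts = np.array(["46402", "1469", "46402", "57111"]).reshape(-1, 1)

# handle_unknown="ignore": unseen categories become all-zero rows at transform time
encoder = OneHotEncoder(handle_unknown="ignore")
parts_onehot = encoder.fit_transform(parts)  # sparse matrix by default

print(parts_onehot.shape)   # 4 rows, one column per distinct part number

unseen = encoder.transform(np.array([["99999"]]))
print(unseen.nnz)           # number of nonzero entries for the unseen part
```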

Alternatively, you can use LabelBinarizer to label encode and one-hot encode in a single step. By default it returns a dense matrix, whereas the one-hot encoder returns a sparse one. To get a sparse matrix instead, pass ```sparse_output=True``` to the ```LabelBinarizer``` constructor:

In [14]:
X_partno_onehot_lb_dense = enc_labelbinarizer.fit_transform(df_balanced.Fail_Short_PartNo.values)

In [15]:
X_partno_onehot_lb_dense

array([[0, 0, 0, ..., 0, 0, 1],
       [0, 0, 0, ..., 1, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ..., 
       [0, 0, 1, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]])

In [16]:
enc_labelbinarizer_sparse = LabelBinarizer(sparse_output=True)
X_partno_onehot_lb_sparse = enc_labelbinarizer_sparse.fit_transform(df_balanced.Fail_Short_PartNo.values)

In [17]:
X_partno_onehot_lb_sparse

<3028x32 sparse matrix of type '<class 'numpy.int64'>'
	with 3028 stored elements in Compressed Sparse Row format>

In [18]:
X_partno_onehot_lb_sparse.data

array([1, 1, 1, ..., 1, 1, 1])

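So which output should you use, dense or sparse? For a large data set, sparse is the better choice: a one-hot matrix is almost entirely zeros, and the sparse format stores only the nonzero entries (here, 3028 stored elements for a 3028 x 32 matrix).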
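
To get a rough feel for the difference, the sketch below builds a random one-hot matrix with the same shape as the part # matrix (3028 x 32) and compares the bytes used by each representation. The column assignments are random and purely illustrative, and the exact sparse size depends on the index dtype SciPy picks.

```python
import numpy as np
from scipy import sparse

n_rows, n_cols = 3028, 32
rng = np.random.default_rng(0)
col_idx = rng.integers(0, n_cols, size=n_rows)  # one random category per row

# One-hot matrix in CSR format: exactly one stored 1 per row
onehot_sparse = sparse.csr_matrix(
    (np.ones(n_rows, dtype=np.int64), (np.arange(n_rows), col_idx)),
    shape=(n_rows, n_cols),
)
onehot_dense = onehot_sparse.toarray()

dense_bytes = onehot_dense.nbytes
# CSR stores three arrays: the values, their column indices, and row pointers
sparse_bytes = (
    onehot_sparse.data.nbytes
    + onehot_sparse.indices.nbytes
    + onehot_sparse.indptr.nbytes
)
print(dense_bytes, sparse_bytes)
```

The gap widens as the number of categories grows, since the dense size scales with rows times columns while the sparse size scales with the number of nonzero entries.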

### Encode the customer contention text column

First, run it through `CountVectorizer`:

In [19]:
X_complaint_counts = count_vect.fit_transform(df_balanced.Customer_Contention_Text.values)
X_complaint_counts

<3028x3054 sparse matrix of type '<class 'numpy.int64'>'
	with 47948 stored elements in Compressed Sparse Row format>

Then, TF-IDF transform it:

In [20]:
X_complaint_tfidf = tfidf_transformer.fit_transform(X_complaint_counts)
X_complaint_tfidf.shape

(3028, 3054)
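
The two steps above can also be collapsed into one with `TfidfVectorizer`, which with default settings should give the same matrix as `CountVectorizer` followed by `TfidfTransformer`. A small check on three made-up complaints (not the notebook's data):

```python
import numpy as np
from sklearn.feature_extraction.text import (
    CountVectorizer,
    TfidfTransformer,
    TfidfVectorizer,
)

# Illustrative complaints only
complaints = [
    "BRAKE PEDAL HARD ON FIRST APPLICATION",
    "HAVE TO PRESS HARD ON BRAKE PEDAL",
    "BRAKE SYSTEM WARNING LIGHT IS ON",
]

# Two-step route, as in the cells above
term_counts = CountVectorizer().fit_transform(complaints)
two_step = TfidfTransformer().fit_transform(term_counts)

# One-step route
one_step = TfidfVectorizer().fit_transform(complaints)

same = np.allclose(two_step.toarray(), one_step.toarray())
print(two_step.shape, one_step.shape, same)
```

Keeping the two steps separate is still useful when you want to inspect the raw counts.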

### Combine the encoded part # and encoded customer contention text together to make final matrix

In [21]:
X_final = sparse.hstack((X_partno_onehot, X_complaint_tfidf), format='csr')

In [22]:
X_final.shape

(3028, 3086)

Do the dimensions look right?  We know X_final should have 3028 rows, but what about the number of columns?  Let's check:

In [23]:
X_partno_onehot.shape

(3028, 32)

In [24]:
X_complaint_tfidf.shape

(3028, 3054)

3054 + 32 = 3086, so we know our X_final has the right number of columns.
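
The same column arithmetic can be seen on a toy pair of matrices. In this sketch the shapes (4 x 3 and 4 x 5) are arbitrary; it shows that `sparse.hstack` keeps the row count and places the left block's columns first.

```python
import numpy as np
from scipy import sparse

left = sparse.csr_matrix(np.eye(4, 3))      # 4 rows, 3 columns
right = sparse.csr_matrix(np.ones((4, 5)))  # 4 rows, 5 columns

combined = sparse.hstack((left, right), format="csr")
print(combined.shape)  # rows unchanged, columns added together

# The first 3 columns of the result are the left block, unchanged
left_preserved = np.array_equal(combined[:, :3].toarray(), left.toarray())
print(left_preserved)
```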

Now, let's create our y_final variable containing our label data.  Since it is already numeric (0 or 1), there is no need for additional processing or encoding of our label data.

In [25]:
y_final = df_balanced.Target.values

In [26]:
y_final.shape

(3028,)

Now, we have what we need to partition our data into training and test sets:

### Partition the data into training and test data sets

In [27]:
X_train, X_test, y_train, y_test = train_test_split(X_final, y_final, test_size = 0.8, random_state = 12)


As a sanity check, let's confirm the training and test sets have the expected number of rows and columns. With `test_size = 0.8`, about 80% of the 3028 rows should go to the test set:

In [28]:
3028 * 0.8

2422.4

In [29]:
X_train.shape

(605, 3086)

In [30]:
X_test.shape

(2423, 3086)
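
Since 2422.4 is not a whole number, the split has to round. The shapes above suggest the test set is rounded up and the training set gets the remainder. A quick check on a plain index array (no real data involved):

```python
import math
import numpy as np
from sklearn.model_selection import train_test_split

n_samples, test_size = 3028, 0.8
rows = np.arange(n_samples)  # stand-in for the 3028 rows of X_final

train_rows, test_rows = train_test_split(
    rows, test_size=test_size, random_state=12
)

expected_test = math.ceil(n_samples * test_size)
print(len(train_rows), len(test_rows), expected_test)
```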

### Fit the model to the training data

In [31]:
clf = MultinomialNB().fit(X_train, y_train)

In [32]:
clf.score(X_test, y_test)

0.85967808501857201
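
One thing to keep in mind when reading this score: the minority class was upsampled with replacement before the split, so copies of the same claim can end up in both the training and test sets, which may make the score look better than it would on new claims. A common alternative, sketched here on made-up data and not what this notebook does, is to split first and upsample only the training portion.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

# Illustrative data: 16 majority rows, 4 minority rows
toy = pd.DataFrame({
    "Customer_Complaint": [f"complaint {i}" for i in range(20)],
    "Target": [0] * 16 + [1] * 4,
})

# 1. Split first, stratifying so both sets keep some minority rows
train, test = train_test_split(
    toy, test_size=0.5, stratify=toy["Target"], random_state=12
)

# 2. Upsample the minority class inside the training set only
train_major = train[train["Target"] == 0]
train_minor = train[train["Target"] == 1]
train_minor_up = resample(
    train_minor, replace=True, n_samples=len(train_major), random_state=321
)
train_balanced = pd.concat([train_major, train_minor_up])

# No original row appears in both the balanced training set and the test set
overlap = set(train_balanced.index) & set(test.index)
print(len(overlap), train_balanced["Target"].value_counts().to_dict())
```

The test set keeps its original class proportions, so per-class metrics may be more informative than accuracy alone in that setup.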

### Test on unseen sample data

This should return 1:

In [33]:
part_test = np.array(['57111'])
complaint_test = np.array(['BRAKE PEDAL IS HARD'])

X_new_part_labelencoded = enc_label.transform(part_test)
X_new_part_onehot = enc_onehot.transform(X_new_part_labelencoded.reshape(-1,1))

X_new_complaint_counts = count_vect.transform(complaint_test)
X_new_complaint_tfidf = tfidf_transformer.transform(X_new_complaint_counts)

# Horizontally stack together the 2 sparse matrices
X_new_combined_tfidf = sparse.hstack((X_new_part_onehot, X_new_complaint_tfidf), format='csr')

try:
    predicted = clf.predict(X_new_combined_tfidf)
    print(predicted)
except Exception:  # fall back to class 0 if prediction fails
    print(0)

[1]


This should return 0:

In [34]:
part_test = np.array(['57111'])
complaint_test = np.array(['BRAKE PEDAL IS SOFT'])

X_new_part_labelencoded = enc_label.transform(part_test)
X_new_part_onehot = enc_onehot.transform(X_new_part_labelencoded.reshape(-1,1))

X_new_complaint_counts = count_vect.transform(complaint_test)
X_new_complaint_tfidf = tfidf_transformer.transform(X_new_complaint_counts)

# Horizontally stack together the 2 sparse matrices
X_new_combined_tfidf = sparse.hstack((X_new_part_onehot, X_new_complaint_tfidf), format='csr')

try:
    predicted = clf.predict(X_new_combined_tfidf)
    print(predicted)
except Exception:  # fall back to class 0 if prediction fails
    print(0)

[0]


Of course this should return 1:

In [35]:
part_test = np.array(['57111']) 
complaint_test = np.array(['OMG! MY BRAKE PEDAL DOES NOT WORK!'])

X_new_part_labelencoded = enc_label.transform(part_test)
X_new_part_onehot = enc_onehot.transform(X_new_part_labelencoded.reshape(-1,1))

X_new_complaint_counts = count_vect.transform(complaint_test)
X_new_complaint_tfidf = tfidf_transformer.transform(X_new_complaint_counts)

# Horizontally stack together the 2 sparse matrices
X_new_combined_tfidf = sparse.hstack((X_new_part_onehot, X_new_complaint_tfidf), format='csr')

try:
    predicted = clf.predict(X_new_combined_tfidf)
    print(predicted)
except Exception:  # fall back to class 0 if prediction fails
    print(0)

[1]
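
The three cells above repeat the same encode, stack, and predict steps. One way to avoid the repetition is to bundle the steps into a scikit-learn `Pipeline`. This is a minimal sketch under stated assumptions: the four training rows, the column names `part` and `complaint`, and the helper `predict_claim` are all hypothetical, and `OneHotEncoder` plus `TfidfVectorizer` stand in for the separate encoders used earlier.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Tiny illustrative training set (not the notebook's data)
train = pd.DataFrame({
    "part": ["46402", "46402", "81690", "57306"],
    "complaint": [
        "BRAKE PEDAL HARD ON FIRST APPLICATION",
        "HAVE TO PRESS HARD ON BRAKE PEDAL",
        "SECOND ROW SEAT WILL NOT FOLD",
        "WARNING LIGHT ON DASH",
    ],
    "Target": [1, 1, 0, 0],
})

# One-hot encode the part column and TF-IDF encode the text column,
# then stack the two blocks side by side
features = ColumnTransformer([
    ("part", OneHotEncoder(handle_unknown="ignore"), ["part"]),
    ("text", TfidfVectorizer(), "complaint"),
])
model = Pipeline([("features", features), ("clf", MultinomialNB())])
model.fit(train[["part", "complaint"]], train["Target"])


def predict_claim(part, complaint):
    """Return the predicted class (0 or 1) for a single claim."""
    row = pd.DataFrame({"part": [part], "complaint": [complaint]})
    return int(model.predict(row)[0])


hard_pedal = predict_claim("46402", "BRAKE PEDAL IS HARD")
seat_issue = predict_claim("99999", "SEAT WILL NOT FOLD")  # unseen part number
print(hard_pedal, seat_issue)
```

With `handle_unknown="ignore"`, a part number the model has never seen contributes no part feature, so the prediction rests on the complaint text alone.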
