## Classifying Brake Warranty Claims as Either a Hard Brake Pedal Problem or Not

\#imbalanced_data, \#tfidf, \#label_encode, \#one_hot_encode

**BACKGROUND:** Brake analysts manually classify vehicle brake warranty claims by reviewing the part # and the customer's complaint.  Based on these 2 features, the analyst labels each warranty claim as a particular brake problem (1) or not (0).

**GOAL:** Use machine learning classification instead of the manual process above.

In [1]:
import pandas as pd
import numpy as np
import pickle
from scipy import sparse
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.naive_bayes import MultinomialNB

from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
# CategoricalEncoder is not imported: it is not available in the released 0.20
from sklearn.preprocessing import StandardScaler, OneHotEncoder, LabelEncoder, LabelBinarizer
from sklearn.model_selection import train_test_split, GridSearchCV

### Raw Data Ingestion

In [2]:
df = pd.read_csv('book1.csv')
df.head()

| | Part5 | Labor_Cost_USD | Part_Cost_USD | Days_To_Fail_MinZero | Miles_To_Fail | Customer_Complaint | PROBLEM | Target |
|---|---|---|---|---|---|---|---|---|
| 0 | 81690 | 97.33 | 252.88 | 483 | 13949 | CK FOR THE DRIVERS SIDE 2ND ROW SEAT WILL NOT ... | Unconfirmed | 0 |
| 1 | 57455 | 436.82 | 192.64 | 1265 | 46722 | CONTINUATION OF FIRST LINE. ADDED FOR TECHNICI... | Unconfirmed | 0 |
| 2 | 57306 | 535.5 | 1293.16 | 274 | 17819 | C/S BRAKE SYSTEM WARNING LIGHT & OTHER WARNING... | Unconfirmed | 0 |
| 3 | 57111 | 1261.5 | 610.51 | 5 | 205 | CLIENTS STATES THERE IS A WARNING LIGHT ON AND... | Unconfirmed | 0 |
| 4 | 57111 | 1009.87 | 463.58 | 467 | 30655 | VERIFIER PEDALE DE FREIN . DESCEND AU FOND ET ... | Unconfirmed | 0 |


In [3]:
df.describe()

| | Part5 | Labor_Cost_USD | Part_Cost_USD | Days_To_Fail_MinZero | Miles_To_Fail | Target |
|---|---|---|---|---|---|---|
| count | 1672.0 | 1672.0 | 1672.0 | 1672.0 | 1672.0 | 1672.0 |
| mean | 18764.279306 | 170.994222 | 158.293086 | 345.457536 | 12164.698565 | 0.094498 |
| std | 21892.799342 | 151.72258 | 306.786571 | 353.99464 | 13344.865461 | 0.292607 |
| min | 1469.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 25% | 1469.0 | 90.0 | 18.07 | 62.0 | 2000.75 | 0.0 |
| 50% | 1469.0 | 154.95 | 65.79 | 224.0 | 7424.5 | 0.0 |
| 75% | 46101.0 | 213.0375 | 182.085 | 543.5 | 18675.75 | 0.0 |
| max | 81690.0 | 2056.2 | 2461.5 | 1724.0 | 92423.0 | 1.0 |


In [4]:
df.dtypes

Part5                     int64
Labor_Cost_USD          float64
Part_Cost_USD           float64
Days_To_Fail_MinZero      int64
Miles_To_Fail             int64
Customer_Complaint       object
PROBLEM                  object
Target                    int64
dtype: object

#### Based on prior business knowledge, Part5 needs to be of type `category` since the part # can contain letters:

In [5]:
df['Part5'] = df['Part5'].astype('category')
df.dtypes

Part5                   category
Labor_Cost_USD           float64
Part_Cost_USD            float64
Days_To_Fail_MinZero       int64
Miles_To_Fail              int64
Customer_Complaint        object
PROBLEM                   object
Target                     int64
dtype: object
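
In this file every part number happens to be numeric, so pandas parsed the column as int64 before the conversion. If part numbers with letters or leading zeros are expected, it may be safer to declare the dtype at read time. A minimal sketch, using a made-up inline CSV (the values `A1469` and `01469` are hypothetical and do not come from book1.csv):

```python
import io
import pandas as pd

# Hypothetical sample: one part number has a letter, one has a leading zero.
csv_text = "Part5,Target\n57111,0\nA1469,1\n01469,0\n"

# Declaring the dtype up front keeps the raw text of every part number,
# so '01469' is not silently turned into the integer 1469.
toy = pd.read_csv(io.StringIO(csv_text), dtype={"Part5": "category"})

print(toy["Part5"].dtype)                    # category
print(sorted(toy["Part5"].cat.categories))   # ['01469', '57111', 'A1469']
```

With this approach the categories are strings rather than integers, so any later lookup has to pass part numbers as strings too.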

In [6]:
data = df[['Part5','Customer_Complaint','Target']]

In [7]:
data.head()

| | Part5 | Customer_Complaint | Target |
|---|---|---|---|
| 0 | 81690 | CK FOR THE DRIVERS SIDE 2ND ROW SEAT WILL NOT ... | 0 |
| 1 | 57455 | CONTINUATION OF FIRST LINE. ADDED FOR TECHNICI... | 0 |
| 2 | 57306 | C/S BRAKE SYSTEM WARNING LIGHT & OTHER WARNING... | 0 |
| 3 | 57111 | CLIENTS STATES THERE IS A WARNING LIGHT ON AND... | 0 |
| 4 | 57111 | VERIFIER PEDALE DE FREIN . DESCEND AU FOND ET ... | 0 |


#### Data is imbalanced:

In [8]:
data['Target'].value_counts()

0    1514
1     158
Name: Target, dtype: int64

In [9]:
data['Target'].value_counts(normalize=True)

0    0.905502
1    0.094498
Name: Target, dtype: float64

In [10]:
data.shape

(1672, 3)

Our data is imbalanced: only about 9% of the claims are labeled 1.  A first thought is to upsample the data right away.  BUT, I've read that you should ONLY upsample the training data; otherwise copies of the same claim can land in both the training and test sets and inflate the test score.  So to proceed, I will fit the encoders for Part5 and the customer complaint text, split the data into training and test sets, upsample the training set, transform the training data with the fitted encoders, and then fit the classification model to the training data.
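
To illustrate why the order matters, here is a small sketch on made-up row ids (not the warranty data). It counts how many original rows end up on both sides of the split when upsampling happens before versus after `train_test_split`:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

# Toy stand-in: 20 minority-class row ids (hypothetical).
minority_ids = np.arange(20)

# Upsample first, then split: copies of one row can land on both sides.
upsampled = resample(minority_ids, replace=True, n_samples=200, random_state=0)
train_a, test_a = train_test_split(upsampled, test_size=0.5, random_state=0)
leaked_a = set(train_a) & set(test_a)

# Split first, then upsample only the training part: the test rows stay unseen.
train_b, test_b = train_test_split(minority_ids, test_size=0.5, random_state=0)
train_b_up = resample(train_b, replace=True, n_samples=100, random_state=0)
leaked_b = set(train_b_up) & set(test_b)

print("rows shared when upsampling first:", len(leaked_a))
print("rows shared when splitting first: ", len(leaked_b))
```

In the second ordering the shared count is zero by construction, which is the property the rest of this notebook relies on.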

## Encode Features Data

Instantiate the encoders:

In [11]:
enc_label = LabelEncoder()
enc_onehot = OneHotEncoder(categories='auto')
enc_labelbinarizer = LabelBinarizer()
#enc_categorical = CategoricalEncoder() # existed only in 0.20dev; not available in the final 0.20 release
count_vect = CountVectorizer()
tfidf_transformer = TfidfTransformer()

### Encode the part # column (prior to version 0.20)

Since Part5 is a categorical, nominal feature, we can't stop at label encoding (text to integers).  We must also one-hot encode it, since we do not want the machine learning classifier to treat the arbitrary integer codes as ordered values.
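
A small sketch of the difference, using three hypothetical part labels ("A", "B", "C"): integer codes make some parts look closer together than others, while one-hot rows are all equally far apart.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

parts = np.array(["A", "B", "C"])  # hypothetical part numbers

# Label encoding assigns integers in sorted order of the labels.
codes = LabelEncoder().fit_transform(parts)
print(codes)  # [0 1 2]

# The gap between codes depends only on that arbitrary ordering.
print(abs(codes[0] - codes[1]), abs(codes[0] - codes[2]))  # 1 2

# One-hot encoding gives each part its own column instead.
onehot = OneHotEncoder().fit_transform(parts.reshape(-1, 1)).toarray()


def row_distance(i, j):
    """Euclidean distance between two one-hot rows."""
    return np.linalg.norm(onehot[i] - onehot[j])


# Every pair of parts is the same distance apart.
print(row_distance(0, 1) == row_distance(0, 2) == row_distance(1, 2))  # True
```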

Label encode it first:

In [12]:
X_partno_labelencoded = enc_label.fit_transform(data['Part5'])

In [13]:
data['Part5']

0       81690
1       57455
2       57306
3       57111
4       57111
5       57111
6       57110
7       57110
8       57110
9       57110
10      57110
11      57100
12      53601
13      52611
14      51360
15      46600
16      46600
17      46600
18      46468
19      46402
20      46402
21      46402
22      46402
23      46402
24      46402
25      46402
26      46402
27      46402
28      46402
29      46402
        ...  
1642     1469
1643     1469
1644     1469
1645     1469
1646     1469
1647     1469
1648     1469
1649     1469
1650     1469
1651     1469
1652     1469
1653     1469
1654     1469
1655     1469
1656     1469
1657     1469
1658     1469
1659     1469
1660     1469
1661     1469
1662     1469
1663     1469
1664     1469
1665     1469
1666     1469
1667     1469
1668     1469
1669     1469
1670     1469
1671     1469
Name: Part5, Length: 1672, dtype: category
Categories (32, int64): [1469, 4816, 6462, 10002, ..., 57111, 57306, 57455, 81690]

In [14]:
X_partno_labelencoded

array([31, 30, 29, ...,  0,  0,  0])

In [15]:
X_partno_labelencoded.ndim

1

In [16]:
len(X_partno_labelencoded)

1672

Then one-hot encode it, since we want the data to be treated as nominal, not ordinal.

The scikit-learn API requires a 2-D array here, so we also need to call .reshape(-1, 1):

In [17]:
X_partno_onehot = enc_onehot.fit_transform(X_partno_labelencoded.reshape(-1, 1))

In [18]:
X_partno_onehot

<1672x32 sparse matrix of type '<class 'numpy.float64'>'
	with 1672 stored elements in Compressed Sparse Row format>

In [19]:
X_partno_onehot.toarray()

array([[0., 0., 0., ..., 0., 0., 1.],
       [0., 0., 0., ..., 0., 1., 0.],
       [0., 0., 0., ..., 1., 0., 0.],
       ...,
       [1., 0., 0., ..., 0., 0., 0.],
       [1., 0., 0., ..., 0., 0., 0.],
       [1., 0., 0., ..., 0., 0., 0.]])

### Encode the part # column with OneHotEncoder (version 0.20 and later)

With version 0.20, we can use the one-hot encoder alone, since it now accepts string and categorical input directly.  So essentially you can label encode and one-hot encode in just one step.

But scikit-learn has the dreaded 2-D "gotcha": most encoders expect a 2-D array, and a pandas Series is 1-D.  We can either reshape the values into a 2-D array or select an actual one-column DataFrame by putting double square brackets around the column name.
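
The shape difference can be checked on a tiny hypothetical frame (the four rows below are made up for illustration):

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

toy = pd.DataFrame({"Part5": ["57111", "46402", "57111", "1469"]})  # hypothetical rows

series_1d = toy["Part5"]    # single brackets -> 1-D Series
frame_2d = toy[["Part5"]]   # double brackets -> one-column DataFrame

print(series_1d.shape)  # (4,)
print(frame_2d.shape)   # (4, 1)

# The one-column DataFrame can go straight into the encoder.
from_frame = OneHotEncoder().fit_transform(frame_2d).toarray()
print(from_frame.shape)  # (4, 3): four rows, three distinct parts

# Reshaping the Series values to 2-D gives the same encoding.
from_reshape = OneHotEncoder().fit_transform(series_1d.values.reshape(-1, 1)).toarray()
print(np.array_equal(from_frame, from_reshape))  # True
```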

In [None]:
part5 = data[['Part5']]  # make a 2-D data

In [None]:
X_partno_onehot = enc_onehot.fit_transform(part5)

In [None]:
X_partno_onehot

In [None]:
X_partno_onehot.toarray()

This gives the same result as the 2-step process of LabelEncoder and OneHotEncoder.

In [None]:
X_partno_onehot.shape

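Alternatively, you can use LabelBinarizer to label encode and one-hot encode all in one step.  Note that it is designed for 1-D label data and returns a dense array by default, so we pass it the column as a Series.

In [None]:
X_partno_onehot_categorical = enc_labelbinarizer.fit_transform(data['Part5'])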

In [None]:
X_partno_onehot_categorical

In [None]:
X_partno_onehot_categorical.shape
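
As a quick check on made-up part numbers, the sketch below compares LabelBinarizer with OneHotEncoder. It also shows one caveat worth knowing: with only two classes, LabelBinarizer returns a single column rather than two.

```python
import numpy as np
from sklearn.preprocessing import LabelBinarizer, OneHotEncoder

parts = np.array(["57111", "46402", "57111", "1469"])  # hypothetical part numbers

lb_out = LabelBinarizer().fit_transform(parts)                 # takes 1-D input
ohe_out = OneHotEncoder().fit_transform(parts.reshape(-1, 1))  # takes 2-D input

print(type(lb_out).__name__)                      # ndarray (dense, not sparse)
print(np.array_equal(lb_out, ohe_out.toarray()))  # True

# Caveat: a two-class input collapses to one 0/1 column.
two_class = LabelBinarizer().fit_transform(np.array(["a", "b", "a"]))
print(two_class.shape)  # (3, 1)
```

Part5 has 32 categories here, so the two encoders should agree on shape, but the dense output may matter for memory on wider data.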

### Encode the customer contention text column

First, convert the text to token counts with CountVectorizer:

In [20]:
X_complaint_counts = count_vect.fit_transform(data['Customer_Complaint'])
X_complaint_counts

<1672x3054 sparse matrix of type '<class 'numpy.int64'>'
	with 25298 stored elements in Compressed Sparse Row format>

Then, TF-IDF transform the counts:

In [21]:
X_complaint_tfidf = tfidf_transformer.fit_transform(X_complaint_counts)
X_complaint_tfidf.shape

(1672, 3054)
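
As an aside, scikit-learn's TfidfVectorizer combines these two steps. The sketch below checks this on three short complaints borrowed from elsewhere in this notebook; it is an illustration only and is not used in the rest of the workflow.

```python
import numpy as np
from sklearn.feature_extraction.text import (
    CountVectorizer,
    TfidfTransformer,
    TfidfVectorizer,
)

docs = [
    "BRAKE PEDAL FEELS HARD",
    "BRAKE PEDAL IS SOFT",
    "BRAKE LIGHT FLASHES WHILE DRIVING",
]

# Two steps: raw token counts, then TF-IDF weighting.
two_step = TfidfTransformer().fit_transform(CountVectorizer().fit_transform(docs))

# One step: TfidfVectorizer does both with the same default settings.
one_step = TfidfVectorizer().fit_transform(docs)

print(one_step.shape)                                       # (3, 10)
print(np.allclose(two_step.toarray(), one_step.toarray()))  # True
```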

### Split the original data into training and testing data sets without separating features from label data:

In [22]:
df_train, df_test = train_test_split(data[['Part5', 'Customer_Complaint', 'Target']],
                                     test_size=0.5, random_state=12)

In [23]:
df_train.head()

| | Part5 | Customer_Complaint | Target |
|---|---|---|---|
| 892 | 1469 | COMPLAINT;CUSTOMER REQUESTS BRAKE LININGS BE I... | 0 |
| 1528 | 1469 | CUST STATES BRAKES MAKEING A POPPING NOISE WHE... | 0 |
| 324 | 46402 | CUSTOMER STATES THAT WHEN VEHICLE SITS OVER NI... | 0 |
| 783 | 1469 | CUSTOMER STATES AFTER SITTING OVERNIGHT THE BR... | 0 |
| 440 | 46101 | CUST. STATES:BRAKE LIGHT FLASHES WHILE DRIVING | 0 |


In [24]:
df_train.Target.value_counts()

0    756
1     80
Name: Target, dtype: int64

Our training data is imbalanced, so we need to balance the data somehow.  One strategy is to upsample the minority class.

#### Upsample the minority class ONLY from the training data:

In [25]:
from sklearn.utils import resample

df_train_upsampled = resample(df_train.query("Target == 1"),  # filter to minority class
                              replace=True,  # sample with replacement
                              n_samples=df_train.query("Target == 0").shape[0],  # match the majority count
                              random_state=321)

df_train_upsampled.Target.value_counts()

1    756
Name: Target, dtype: int64

In [26]:
df_train_upsampled.head()

| | Part5 | Customer_Complaint | Target |
|---|---|---|---|
| 69 | 46402 | C/S BRAKES FAIL IN COLD WEATHER | 1 |
| 74 | 46402 | 0 CUSTOMER STATES BRAKES ARE HARD TO PRESS AND... | 1 |
| 62 | 46402 | LA P DALE DE FREIN FORCE VERRE LE HAUT QUAND L... | 1 |
| 682 | 1469 | CUSTOMER STATES BRAKE PEDAL IS REALLY HARD AND... | 1 |
| 22 | 46402 | CUSTOMER STATED STRANGE NOISE HEARD FROM FRONT... | 1 |


In [27]:
df_train_balanced = pd.concat([df_train.query("Target == 0"), df_train_upsampled])

In [28]:
df_train_balanced.Target.value_counts()

1    756
0    756
Name: Target, dtype: int64

Now our training data has an equal number of rows for each target class (756 each).

In [29]:
df_train_balanced.shape

(1512, 3)

### Encode training data

#### Encode the part5 training data

In [30]:
df_train_part5_label_encoded = enc_label.transform(df_train_balanced.Part5)
df_train_part5_onehot_encoded = enc_onehot.transform(df_train_part5_label_encoded.reshape(-1, 1))

In [31]:
df_train_part5_onehot_encoded.shape

(1512, 32)

In [32]:
df_train_part5_onehot_encoded

<1512x32 sparse matrix of type '<class 'numpy.float64'>'
	with 1512 stored elements in Compressed Sparse Row format>

#### Encode the contention text training data

In [33]:
df_train_contention_count_vectorized = count_vect.transform(df_train_balanced.Customer_Complaint)
df_train_contention_tfidf = tfidf_transformer.transform(df_train_contention_count_vectorized)

In [34]:
df_train_contention_tfidf.shape

(1512, 3054)

In [35]:
df_train_contention_tfidf

<1512x3054 sparse matrix of type '<class 'numpy.float64'>'
	with 23929 stored elements in Compressed Sparse Row format>

#### Combine the encoded part # and contention text training data together to create our final X / Features matrix:

In [36]:
X_train = sparse.hstack((df_train_part5_onehot_encoded, df_train_contention_tfidf), format='csr')

In [37]:
X_train.shape

(1512, 3086)
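
The column count is simply 32 + 3054 = 3086, with the part columns first. A small sketch with random sparse stand-ins (hypothetical data, 4 rows only) shows how `sparse.hstack` lays the blocks side by side:

```python
import numpy as np
from scipy import sparse

# Hypothetical stand-ins with the same column counts as the real blocks.
part_block = sparse.random(4, 32, density=0.1, format="csr", random_state=0)
text_block = sparse.random(4, 3054, density=0.01, format="csr", random_state=1)

# Both blocks must have the same number of rows.
combined = sparse.hstack((part_block, text_block), format="csr")
print(combined.shape)  # (4, 3086)

# The first 32 columns are the part block, unchanged.
print(np.array_equal(combined[:, :32].toarray(), part_block.toarray()))  # True
```

Because the column order is fixed by the order of the tuple, the test data and any new samples have to be stacked in the same order.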

### Define our y / target 1-D array:

In [38]:
y_train = df_train_balanced.Target.values

In [39]:
y_train.shape

(1512,)

### Fit our model to the training data

In [40]:
clf = MultinomialNB().fit(X_train, y_train)

### Encode test data

#### Encode the part5 test data

In [41]:
df_test_part5_label_encoded = enc_label.transform(df_test.Part5)
df_test_part5_onehot_encoded = enc_onehot.transform(df_test_part5_label_encoded.reshape(-1, 1))

In [42]:
df_test_part5_onehot_encoded.shape

(836, 32)

#### Encode the contention text test data

In [43]:
df_test_contention_count_vectorized = count_vect.transform(df_test.Customer_Complaint)
df_test_contention_tfidf = tfidf_transformer.transform(df_test_contention_count_vectorized)

In [44]:
df_test_contention_tfidf.shape

(836, 3054)

#### Combine the encoded part # and contention text test data together

In [45]:
X_test = sparse.hstack((df_test_part5_onehot_encoded, df_test_contention_tfidf), format='csr')

In [46]:
X_test.shape

(836, 3086)

In [47]:
y_test = df_test.Target.values

In [48]:
y_test.shape

(836,)

### Now let's see how well our classifier performs with our test data:

In [50]:
print("model score: %.3f" % clf.score(X_test, y_test))

model score: 0.817


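Our model performed well, but not great.  Keep in mind that the test set is still imbalanced: given the counts above, 758 of its 836 rows are class 0, so a model that always predicts 0 would score about 0.907.  Accuracy alone is therefore a weak measure here; recall and precision on class 1 say more about how well the hard brake pedal claims are caught.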
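
To put the score in context, the sketch below evaluates a dummy "always predict 0" baseline on labels with the same class counts as this test set (1514 - 756 = 758 zeros and 158 - 80 = 78 ones). It does not use the trained classifier; on the real model, the same metric functions could be called with y_test and clf.predict(X_test).

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix, recall_score

# Labels with the test-set class counts implied by the splits above.
y_true = np.array([0] * 758 + [1] * 78)

# A baseline that never flags a hard brake pedal claim.
y_pred = np.zeros_like(y_true)

print("accuracy: %.3f" % accuracy_score(y_true, y_pred))  # accuracy: 0.907
print("recall:   %.3f" % recall_score(y_true, y_pred))    # recall:   0.000

# Rows are true classes, columns are predicted classes.
print(confusion_matrix(y_true, y_pred))
```

The baseline reaches high accuracy while catching none of the class 1 claims, which is why recall on class 1 is worth reporting alongside accuracy.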

### Test on unseen sample data

This should return 1:

In [59]:
part_test = np.array(['57111'])
complaint_test = np.array(['BRAKE PEDAL FEELS HARD'])

X_new_part_labelencoded = enc_label.transform(part_test)
X_new_part_onehot = enc_onehot.transform(X_new_part_labelencoded.reshape(-1,1))

X_new_complaint_counts = count_vect.transform(complaint_test)
X_new_complaint_tfidf = tfidf_transformer.transform(X_new_complaint_counts)

# Horizontally stack together the 2 sparse matrices
X_new_combined_tfidf = sparse.hstack((X_new_part_onehot, X_new_complaint_tfidf), format='csr')

predicted = clf.predict(X_new_combined_tfidf)
print(predicted)

[1]


This should return 0:

In [54]:
part_test = np.array(['57111'])
complaint_test = np.array(['BRAKE PEDAL IS SOFT'])

X_new_part_labelencoded = enc_label.transform(part_test)
X_new_part_onehot = enc_onehot.transform(X_new_part_labelencoded.reshape(-1,1))

X_new_complaint_counts = count_vect.transform(complaint_test)
X_new_complaint_tfidf = tfidf_transformer.transform(X_new_complaint_counts)

# Horizontally stack together the 2 sparse matrices
X_new_combined_tfidf = sparse.hstack((X_new_part_onehot, X_new_complaint_tfidf), format='csr')

predicted = clf.predict(X_new_combined_tfidf)
print(predicted)

[0]


### Follow-up activity:
- upsample with SMOTE and then compare accuracy
- use Pipeline to streamline the data transformation workflow
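
For the Pipeline item, one possible shape is sketched below. It is not the workflow used above: the toy rows are made up, TfidfVectorizer stands in for the CountVectorizer plus TfidfTransformer pair, and the encoders are fit only on the rows passed to `fit` (hence `handle_unknown="ignore"` for part numbers never seen in training). Upsampling would still be applied to the training frame before calling `fit`.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy rows standing in for the balanced training frame.
toy_train = pd.DataFrame({
    "Part5": ["46402", "46402", "1469", "57111"],
    "Customer_Complaint": [
        "BRAKE PEDAL IS REALLY HARD",
        "BRAKES ARE HARD TO PRESS",
        "BRAKES MAKING A POPPING NOISE",
        "BRAKE PEDAL IS SOFT",
    ],
    "Target": [1, 1, 0, 0],
})

preprocess = ColumnTransformer([
    # List selector -> 2-D input for the encoder; unseen parts encode as all zeros.
    ("part", OneHotEncoder(handle_unknown="ignore"), ["Part5"]),
    # String selector -> 1-D input, which the text vectorizer expects.
    ("text", TfidfVectorizer(), "Customer_Complaint"),
])

model = Pipeline([("prep", preprocess), ("clf", MultinomialNB())])
model.fit(toy_train[["Part5", "Customer_Complaint"]], toy_train["Target"])

# New claims go through the same transformations automatically.
new_claim = pd.DataFrame({
    "Part5": ["57111"],
    "Customer_Complaint": ["BRAKE PEDAL FEELS HARD"],
})
predicted = model.predict(new_claim)
print(predicted)
```

The main benefit is that encoding, stacking, and prediction for new samples collapse into a single `predict` call.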