## Classifying Brake Warranty Claims as Either Hard Brake Pedal Problem or Not

\#imbalanced_data, \#tfidf, \#label_encode, \#one_hot_encode

In [2]:
import pandas as pd
import numpy as np
import pickle
from scipy import sparse
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, OneHotEncoder, LabelBinarizer
from sklearn.metrics import accuracy_score
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.naive_bayes import MultinomialNB

### Raw Data Ingestion

In [3]:
df = pd.read_csv('book1.csv')
df.head()

| | Part5 | Labor_Cost_USD | Part_Cost_USD | Days_To_Fail_MinZero | Miles_To_Fail | Customer_Complaint | PROBLEM | Target |
|---|---|---|---|---|---|---|---|---|
| 0 | 81690 | 97.33 | 252.88 | 483 | 13949 | CK FOR THE DRIVERS SIDE 2ND ROW SEAT WILL NOT ... | Unconfirmed | 0 |
| 1 | 57455 | 436.82 | 192.64 | 1265 | 46722 | CONTINUATION OF FIRST LINE. ADDED FOR TECHNICI... | Unconfirmed | 0 |
| 2 | 57306 | 535.5 | 1293.16 | 274 | 17819 | C/S BRAKE SYSTEM WARNING LIGHT & OTHER WARNING... | Unconfirmed | 0 |
| 3 | 57111 | 1261.5 | 610.51 | 5 | 205 | CLIENTS STATES THERE IS A WARNING LIGHT ON AND... | Unconfirmed | 0 |
| 4 | 57111 | 1009.87 | 463.58 | 467 | 30655 | VERIFIER PEDALE DE FREIN . DESCEND AU FOND ET ... | Unconfirmed | 0 |


In [4]:
df['Part5'] = df['Part5'].astype(str)
df.dtypes

Part5                    object
Labor_Cost_USD          float64
Part_Cost_USD           float64
Days_To_Fail_MinZero      int64
Miles_To_Fail             int64
Customer_Complaint       object
PROBLEM                  object
Target                    int64
dtype: object

In [5]:
data = df[['Part5','Customer_Complaint','Target']]

In [6]:
data.head()

| | Part5 | Customer_Complaint | Target |
|---|---|---|---|
| 0 | 81690 | CK FOR THE DRIVERS SIDE 2ND ROW SEAT WILL NOT ... | 0 |
| 1 | 57455 | CONTINUATION OF FIRST LINE. ADDED FOR TECHNICI... | 0 |
| 2 | 57306 | C/S BRAKE SYSTEM WARNING LIGHT & OTHER WARNING... | 0 |
| 3 | 57111 | CLIENTS STATES THERE IS A WARNING LIGHT ON AND... | 0 |
| 4 | 57111 | VERIFIER PEDALE DE FREIN . DESCEND AU FOND ET ... | 0 |


#### Data is imbalanced:

In [7]:
data['Target'].value_counts()

0    1514
1     158
Name: Target, dtype: int64

In [8]:
data.shape

(1672, 3)
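
With roughly nine negatives for every positive, raw accuracy is a weak yardstick for this data. The short sketch below uses only the class counts printed above to work out the accuracy of a hypothetical classifier that always predicts 0; any model has to be judged against that baseline.

```python
# Class counts copied from the value_counts() output above
n_negative = 1514  # Target == 0
n_positive = 158   # Target == 1
n_total = n_negative + n_positive

# Share of the minority (hard brake pedal) class
positive_share = n_positive / n_total

# Accuracy of a hypothetical model that always predicts 0
baseline_accuracy = n_negative / n_total

print(f"positive share:    {positive_share:.3f}")
print(f"baseline accuracy: {baseline_accuracy:.3f}")
```

The baseline comes out at about 90%, so an accuracy figure near or below that says little about how well hard-pedal claims are detected.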

## Encode Features Data

Instantiate the encoders and text transformers:

In [9]:
enc_label = LabelEncoder()
enc_onehot = OneHotEncoder()
enc_labelbinarizer = LabelBinarizer()
count_vect = CountVectorizer()
tfidf_transformer = TfidfTransformer()

### Encode the part # column

Label encode it first:

In [10]:
X_partno_labelencoded = enc_label.fit_transform(data.Part5)

In [11]:
X_partno_labelencoded

array([31, 29, 28, ...,  2,  2,  2])

Then onehot encode it:

In [12]:
X_partno_onehot = enc_onehot.fit_transform(X_partno_labelencoded.reshape(-1, 1))

In [13]:
X_partno_onehot.shape

(1672, 32)
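
In scikit-learn 0.20 and later, `OneHotEncoder` accepts string categories directly, so the separate `LabelEncoder` pass is optional. Here is a minimal sketch on a handful of part numbers from the data preview. The variable names, the made-up part number `'99999'`, and the `handle_unknown='ignore'` setting are illustrative choices, not part of the original notebook.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# A few part numbers from the data preview, as a single-column 2-D array
parts = np.array(['81690', '57455', '57306', '57111', '57111']).reshape(-1, 1)

# handle_unknown='ignore' encodes unseen part numbers as an all-zero row
# instead of raising an error at transform time
encoder = OneHotEncoder(handle_unknown='ignore')
parts_onehot = encoder.fit_transform(parts)  # sparse matrix by default

print(parts_onehot.shape)      # one column per distinct part number
print(encoder.categories_[0])  # learned categories, in sorted order

# A part number that was never seen during fit (hypothetical value)
unseen_onehot = encoder.transform(np.array([['99999']]))
print(unseen_onehot.toarray())
```

The `handle_unknown` setting matters when scoring new claims: with the default setting, a part number absent from the fitted data raises an error.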

Alternatively, ```LabelBinarizer``` label-encodes and one-hot encodes in a single step. By default it returns a dense matrix, unlike ```OneHotEncoder```, which returns a sparse one. To get a sparse matrix instead, pass ```sparse_output=True``` to the ```LabelBinarizer``` constructor:

In [14]:
X_partno_onehot_lb_dense = enc_labelbinarizer.fit_transform(data.Part5)

In [15]:
X_partno_onehot_lb_dense

array([[0, 0, 0, ..., 0, 0, 1],
       [0, 0, 0, ..., 1, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ..., 
       [0, 0, 1, ..., 0, 0, 0],
       [0, 0, 1, ..., 0, 0, 0],
       [0, 0, 1, ..., 0, 0, 0]])

In [16]:
enc_labelbinarizer_sparse = LabelBinarizer(sparse_output=True)
X_partno_onehot_lb_sparse = enc_labelbinarizer_sparse.fit_transform(data.Part5)

In [17]:
X_partno_onehot_lb_sparse

<1672x32 sparse matrix of type '<class 'numpy.int64'>'
	with 1672 stored elements in Compressed Sparse Row format>

In [18]:
X_partno_onehot_lb_sparse.data

array([1, 1, 1, ..., 1, 1, 1])

So which output should you use, dense or sparse? For a large data set with many categories, prefer sparse: a one-hot matrix is almost entirely zeros, and the sparse format stores only the non-zero entries.
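
To put a rough number on the dense-versus-sparse difference, the sketch below builds a simulated one-hot matrix with the same shape as the part-number matrix (1672 x 32) and compares the memory used by each representation. The column positions are random placeholders, not the real part numbers.

```python
import numpy as np
from scipy import sparse

n_rows, n_cols = 1672, 32  # same shape as the one-hot part matrix

# Simulate a one-hot matrix: exactly one non-zero entry per row
rng = np.random.default_rng(0)
col_idx = rng.integers(0, n_cols, size=n_rows)
row_idx = np.arange(n_rows)
onehot_sparse = sparse.csr_matrix(
    (np.ones(n_rows, dtype=np.int64), (row_idx, col_idx)),
    shape=(n_rows, n_cols),
)
onehot_dense = onehot_sparse.toarray()

# CSR stores three arrays: the values, their column indices, and row pointers
sparse_bytes = (onehot_sparse.data.nbytes
                + onehot_sparse.indices.nbytes
                + onehot_sparse.indptr.nbytes)
dense_bytes = onehot_dense.nbytes

print(f"dense:  {dense_bytes} bytes")
print(f"sparse: {sparse_bytes} bytes")
```

The gap widens as the number of categories grows, because the dense size scales with rows times columns while the sparse size scales with the number of non-zero entries.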

### Encode the customer contention text column

First, convert it to token counts with `CountVectorizer`:

In [19]:
X_complaint_counts = count_vect.fit_transform(data.Customer_Complaint)
X_complaint_counts

<1672x3054 sparse matrix of type '<class 'numpy.int64'>'
	with 25298 stored elements in Compressed Sparse Row format>

Then apply the TF-IDF transform:

In [20]:
X_complaint_tfidf = tfidf_transformer.fit_transform(X_complaint_counts)
X_complaint_tfidf.shape

(1672, 3054)
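
scikit-learn also provides `TfidfVectorizer`, which combines `CountVectorizer` and `TfidfTransformer` in one object. The sketch below checks, on three shortened complaint strings, that both routes give the same matrix when default settings are used.

```python
import numpy as np
from sklearn.feature_extraction.text import (
    CountVectorizer, TfidfTransformer, TfidfVectorizer)

# Shortened complaint strings, for illustration only
docs = [
    'CUSTOMER STATES BRAKE PEDAL IS REALLY HARD',
    'C/S BRAKES FAIL IN COLD WEATHER',
    'CUST. STATES:BRAKE LIGHT FLASHES WHILE DRIVING',
]

# Two-step route: token counts first, then TF-IDF weighting
counts = CountVectorizer().fit_transform(docs)
tfidf_two_step = TfidfTransformer().fit_transform(counts)

# One-step route
tfidf_one_step = TfidfVectorizer().fit_transform(docs)

same = np.allclose(tfidf_two_step.toarray(), tfidf_one_step.toarray())
print(same)
```

The two-step route is still useful when the raw counts are needed separately, for example to inspect the vocabulary before weighting.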

### Split the original data into training and testing data sets without separating features from label data:

In [21]:
df_train, df_test = train_test_split(data, test_size=0.5, random_state=12)

In [22]:
df_train.head()

| | Part5 | Customer_Complaint | Target |
|---|---|---|---|
| 892 | 1469 | COMPLAINT;CUSTOMER REQUESTS BRAKE LININGS BE I... | 0 |
| 1528 | 1469 | CUST STATES BRAKES MAKEING A POPPING NOISE WHE... | 0 |
| 324 | 46402 | CUSTOMER STATES THAT WHEN VEHICLE SITS OVER NI... | 0 |
| 783 | 1469 | CUSTOMER STATES AFTER SITTING OVERNIGHT THE BR... | 0 |
| 440 | 46101 | CUST. STATES:BRAKE LIGHT FLASHES WHILE DRIVING | 0 |


In [23]:
df_train.Target.value_counts()

0    756
1     80
Name: Target, dtype: int64
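
The random split above happened to keep the class ratio nearly intact (80 positives in training, which leaves 78 in test). Passing `stratify` to `train_test_split` makes that deliberate rather than lucky. The sketch below uses synthetic labels with the same class counts as the full data set; the feature column is a placeholder.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic labels with the same class counts as the full data set
y = np.array([0] * 1514 + [1] * 158)
X = np.arange(len(y)).reshape(-1, 1)  # placeholder feature column

# stratify=y keeps the proportion of each class the same in both halves
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.5, random_state=12, stratify=y)

print(np.bincount(y_tr), np.bincount(y_te))
```

With a minority class this small, an unstratified split can leave one half noticeably short of positives, which makes the test score noisier.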

Additional rows that upsampling will create (majority count minus minority count in the training set):

In [26]:
756-80

676

Total training rows after upsampling:

In [28]:
836+676

1512

#### Upsample ONLY the training data:

In [31]:
from sklearn.utils import resample

df_train_upsampled = resample(df_train.query("Target == 1"), 
                                    replace=True, 
                                    n_samples = df_train.query("Target == 0").shape[0], 
                                    random_state = 321)

df_train_upsampled.Target.value_counts()

1    756
Name: Target, dtype: int64

In [32]:
df_train_upsampled.head()

| | Part5 | Customer_Complaint | Target |
|---|---|---|---|
| 69 | 46402 | C/S BRAKES FAIL IN COLD WEATHER | 1 |
| 74 | 46402 | 0 CUSTOMER STATES BRAKES ARE HARD TO PRESS AND... | 1 |
| 62 | 46402 | LA P DALE DE FREIN FORCE VERRE LE HAUT QUAND L... | 1 |
| 682 | 1469 | CUSTOMER STATES BRAKE PEDAL IS REALLY HARD AND... | 1 |
| 22 | 46402 | CUSTOMER STATED STRANGE NOISE HEARD FROM FRONT... | 1 |


In [33]:
df_train_balanced = pd.concat([df_train.query("Target == 0"), df_train_upsampled])

In [34]:
df_train_balanced.Target.value_counts()

1    756
0    756
Name: Target, dtype: int64

In [35]:
df_train_balanced.shape

(1512, 3)
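
The order of operations matters here: upsampling duplicates minority rows, so if it ran before the split, copies of the same claim could land in both the training and test sets and inflate the test score. The sketch below repeats the split-then-upsample sequence on a small hypothetical frame and checks that no original row ends up on both sides.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

# Toy frame: 20 negatives and 4 positives (hypothetical data)
toy = pd.DataFrame({'Target': [0] * 20 + [1] * 4})

# 1) Split first, so the test rows stay untouched
toy_train, toy_test = train_test_split(
    toy, test_size=0.5, random_state=0, stratify=toy['Target'])

# 2) Upsample the minority class of the training half only
minority = toy_train[toy_train['Target'] == 1]
majority = toy_train[toy_train['Target'] == 0]
minority_upsampled = resample(
    minority, replace=True, n_samples=len(majority), random_state=0)
toy_train_balanced = pd.concat([majority, minority_upsampled])

print(toy_train_balanced['Target'].value_counts())

# No original row index appears in both the training and test data
overlap = set(toy_train_balanced.index) & set(toy_test.index)
print(len(overlap))
```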

### Encode the part5 training data

In [36]:
df_train_part5_label_encoded = enc_label.transform(df_train_balanced.Part5)
df_train_part5_onehot_encoded = enc_onehot.transform(df_train_part5_label_encoded.reshape(-1, 1))

In [37]:
df_train_part5_onehot_encoded.shape

(1512, 32)

In [38]:
df_train_part5_onehot_encoded

<1512x32 sparse matrix of type '<class 'numpy.float64'>'
	with 1512 stored elements in Compressed Sparse Row format>

### Encode the contention text training data

In [39]:
df_train_contention_count_vectorized = count_vect.transform(df_train_balanced.Customer_Complaint)
df_train_contention_tfidf = tfidf_transformer.transform(df_train_contention_count_vectorized)

In [40]:
df_train_contention_tfidf.shape

(1512, 3054)

In [41]:
df_train_contention_tfidf

<1512x3054 sparse matrix of type '<class 'numpy.float64'>'
	with 23929 stored elements in Compressed Sparse Row format>

### Encode the part5 test data

In [42]:
df_test_part5_label_encoded = enc_label.transform(df_test.Part5)
df_test_part5_onehot_encoded = enc_onehot.transform(df_test_part5_label_encoded.reshape(-1, 1))

In [43]:
df_test_part5_onehot_encoded.shape

(836, 32)

### Encode the contention text test data

In [44]:
df_test_contention_count_vectorized = count_vect.transform(df_test.Customer_Complaint)
df_test_contention_tfidf = tfidf_transformer.transform(df_test_contention_count_vectorized)

In [45]:
df_test_contention_tfidf.shape

(836, 3054)

### Combine the encoded part # and contention text test data together

In [46]:
X_test = sparse.hstack((df_test_part5_onehot_encoded, df_test_contention_tfidf), format='csr')

In [47]:
X_test.shape

(836, 3086)
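
`sparse.hstack` places the blocks side by side, so the row counts must match and the column counts add up (32 + 3054 = 3086 here). A minimal sketch with two tiny made-up blocks:

```python
import numpy as np
from scipy import sparse

# Two small sparse blocks with the same number of rows (made-up values)
part_block = sparse.csr_matrix(np.array([[1, 0],
                                         [0, 1],
                                         [0, 1]]))
text_block = sparse.csr_matrix(np.array([[0.0, 0.5, 0.5],
                                         [1.0, 0.0, 0.0],
                                         [0.0, 0.0, 1.0]]))

# Columns of part_block come first, followed by columns of text_block
combined = sparse.hstack((part_block, text_block), format='csr')

print(combined.shape)
print(combined.toarray())
```

Because column order is fixed by the order of the tuple, the training, test, and new-sample matrices must all be stacked in the same order.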

In [48]:
y_test = df_test.Target.values

In [49]:
y_test.shape

(836,)

### Combine the encoded part # and contention text training data together

In [50]:
X_train = sparse.hstack((df_train_part5_onehot_encoded, df_train_contention_tfidf), format='csr')

In [51]:
X_train.shape

(1512, 3086)

Do the dimensions look right? X_train should have 1512 rows (756 per class after upsampling) and 3086 columns (32 columns from part5 and 3054 columns from contention).

Now, let's create our y_train variable containing our label data. Since it is already numeric (0 or 1), there is no need for additional processing or encoding of our label data.

In [52]:
y_train = df_train_balanced.Target.values

In [53]:
y_train.shape

(1512,)

### Fit the model to the training data

In [54]:
clf = MultinomialNB().fit(X_train, y_train)

In [55]:
clf.score(X_test, y_test)

0.81698564593301437
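
`clf.score` reports plain accuracy. The test set is still imbalanced (758 negatives and 78 positives, from 1514 - 756 and 158 - 80), so always predicting 0 would score about 0.907, and accuracy alone does not show how many hard-pedal claims the model catches. The sketch below shows the per-class metrics that answer that question, using made-up labels and predictions rather than the notebook's results.

```python
from sklearn.metrics import (accuracy_score, confusion_matrix,
                             precision_score, recall_score)

# Hypothetical labels and predictions, for illustration only
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 0, 0, 0, 1, 1, 1, 0]

acc = accuracy_score(y_true, y_pred)
cm = confusion_matrix(y_true, y_pred)   # rows: true class, columns: predicted
prec = precision_score(y_true, y_pred)  # of the predicted 1s, share that are right
rec = recall_score(y_true, y_pred)      # of the true 1s, share that are found

print(acc, prec, rec)
print(cm)
```

In the notebook, the same calls would take `y_test` and `clf.predict(X_test)` as arguments.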

### Test on unseen sample data

This should return 1:

In [56]:
part_test = np.array(['57111'])
complaint_test = np.array(['BRAKE PEDAL FEELS VERY HARD'])

X_new_part_labelencoded = enc_label.transform(part_test)
X_new_part_onehot = enc_onehot.transform(X_new_part_labelencoded.reshape(-1,1))

X_new_complaint_counts = count_vect.transform(complaint_test)
X_new_complaint_tfidf = tfidf_transformer.transform(X_new_complaint_counts)

# Horizontally stack together the 2 sparse matrices
X_new_combined_tfidf = sparse.hstack((X_new_part_onehot, X_new_complaint_tfidf), format='csr')

predicted = clf.predict(X_new_combined_tfidf)
print(predicted)

[1]


This should return 0:

In [57]:
part_test = np.array(['57111'])
complaint_test = np.array(['BRAKE PEDAL IS SOFT'])

X_new_part_labelencoded = enc_label.transform(part_test)
X_new_part_onehot = enc_onehot.transform(X_new_part_labelencoded.reshape(-1,1))

X_new_complaint_counts = count_vect.transform(complaint_test)
X_new_complaint_tfidf = tfidf_transformer.transform(X_new_complaint_counts)

# Horizontally stack together the 2 sparse matrices
X_new_combined_tfidf = sparse.hstack((X_new_part_onehot, X_new_complaint_tfidf), format='csr')

predicted = clf.predict(X_new_combined_tfidf)
print(predicted)

[0]
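
The two cells above repeat the same encode-stack-predict steps by hand. One way to package those steps is a scikit-learn `Pipeline` built around a `ColumnTransformer`, which also fits the encoders on the training data only. The sketch below is an alternative arrangement under stated assumptions, not the notebook's method: the six training rows are made up, `TfidfVectorizer` stands in for the `CountVectorizer` plus `TfidfTransformer` pair, and `handle_unknown='ignore'` is an added choice.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Tiny made-up training set, for illustration only
toy = pd.DataFrame({
    'Part5': ['57111', '57111', '46402', '46402', '1469', '1469'],
    'Customer_Complaint': [
        'BRAKE PEDAL IS VERY HARD TO PRESS',
        'BRAKE WARNING LIGHT IS ON',
        'BRAKE PEDAL HARD IN COLD WEATHER',
        'BRAKES MAKE A POPPING NOISE',
        'PEDAL FEELS HARD AFTER SITTING OVERNIGHT',
        'BRAKE PEDAL IS SOFT',
    ],
    'Target': [1, 0, 1, 0, 1, 0],
})

# One-hot encode the part number, TF-IDF encode the complaint text,
# and stack the two blocks side by side
features = ColumnTransformer([
    ('part', OneHotEncoder(handle_unknown='ignore'), ['Part5']),
    ('text', TfidfVectorizer(), 'Customer_Complaint'),
])
model = Pipeline([('features', features), ('clf', MultinomialNB())])
model.fit(toy[['Part5', 'Customer_Complaint']], toy['Target'])

# Scoring a new claim is a single call on a one-row frame
new_claim = pd.DataFrame({
    'Part5': ['57111'],
    'Customer_Complaint': ['BRAKE PEDAL FEELS VERY HARD'],
})
pred = model.predict(new_claim)
proba = model.predict_proba(new_claim)  # columns follow model.classes_

print(pred, proba)
```

`predict_proba` is worth printing alongside the label, since it shows how close a claim sits to the decision boundary.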
