<h3>Library Imports</h3>

In [3]:
import numpy as np
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from string import punctuation
from gensim.models import Word2Vec, KeyedVectors
from sklearn.metrics import accuracy_score
import pandas as pd
from sklearn.model_selection import train_test_split

<h3>Loading text data of quality defect comments and symptoms</h3>

In [12]:
df = pd.read_csv('data_type.csv')
df = df.astype('string')

This is the overall dataset available for training and testing the model. It is not large enough to train a model that classifies all symptom types: many classes have only a few samples, and training on them would bias the model towards the three largest classes, especially the largest one, **"5 - Quality Issue - Assembly".**

When not enough training data is available, Data Augmentation can be used to create additional training samples for NLP models from a small dataset. Some of the techniques are:
1. Synonym Replacement
2. Back Translation
3. Bigram Flipping
4. Replacing Entities
5. Adding Noise to Data
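As a rough illustration of two of the techniques listed above (synonym replacement and bigram flipping), here is a minimal sketch. The synonym table, the sample tokens, and both function names are made up for this example and are not part of the notebook's pipeline.

```python
import random

# Hypothetical, hand-written synonym table (an assumption for this sketch);
# a real project might derive one from a thesaurus or embedding neighbours.
SYNONYMS = {
    "loose": ["slack", "unsecured"],
    "scratch": ["scuff", "mark"],
    "broken": ["damaged", "cracked"],
}

def synonym_replace(tokens, rng, prob=0.5):
    """Copy tokens, swapping known words for a random synonym with probability prob."""
    out = []
    for token in tokens:
        choices = SYNONYMS.get(token)
        if choices and rng.random() < prob:
            out.append(rng.choice(choices))
        else:
            out.append(token)
    return out

def bigram_flip(tokens, rng):
    """Copy tokens with one randomly chosen adjacent pair swapped."""
    if len(tokens) < 2:
        return list(tokens)
    i = rng.randrange(len(tokens) - 1)
    out = list(tokens)
    out[i], out[i + 1] = out[i + 1], out[i]
    return out

rng = random.Random(0)  # fixed seed so the sketch is repeatable
original = ["bolt", "loose", "on", "door", "panel"]
augmented_syn = synonym_replace(original, rng, prob=1.0)
augmented_flip = bigram_flip(original, rng)
print(augmented_syn)
print(augmented_flip)
```

Each augmented token list keeps the label of the comment it was derived from, so minority classes can be enlarged without collecting new data.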

For this project, only the three classes with the most samples are used for training, to keep things simple.

In [26]:
df['Symptom Type'].value_counts()

5 - Quality Issue - Assembly      5011
2 - Quality Issue - Appearance    1594
4 - Quality Issue - Functional    1016
Name: Symptom Type, dtype: Int64
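To put the imbalance in numbers, the sketch below turns the counts printed above into class shares. It also computes the accuracy of a trivial classifier that always predicts the largest class, which is a useful reference point when reading accuracy figures. The counts are copied from the output; the variable names are only illustrative.

```python
# Class counts copied from the value_counts() output above.
counts = {
    "5 - Quality Issue - Assembly": 5011,
    "2 - Quality Issue - Appearance": 1594,
    "4 - Quality Issue - Functional": 1016,
}

total = sum(counts.values())
shares = {label: n / total for label, n in counts.items()}

# A classifier that always answers with the largest class gets this accuracy.
majority_label = max(counts, key=counts.get)
majority_baseline = counts[majority_label] / total

for label, share in shares.items():
    print(f"{label}: {share:.1%}")
print(f"Majority-class baseline accuracy: {majority_baseline:.3f}")
```

Roughly two thirds of the samples belong to the 'assembly' class, so an accuracy near 0.66 can be reached without learning anything from the text.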

In [27]:
df.dropna(inplace=True)
df = df[df['Symptom Type'].isin(['5 - Quality Issue - Assembly','2 - Quality Issue - Appearance','4 - Quality Issue - Functional'])]
x = df['Reporter Comment']
y = df['Symptom Type']

<h3>Text Pre-Processing</h3>
To prepare the text for training and testing the model, the following steps are performed:

1. Tokenization: splitting the sentences or text into words.
2. Lowercasing the words.
3. Removing stopwords.
4. Removing punctuation and digits.

In [28]:
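def text_processor(texts):
    """Tokenize each text, lowercase it, and drop stopwords, digits, and punctuation."""
    mystopwords = set(stopwords.words("english"))
    def remove_stpwrd_punc_dig(tokens):
        # Keep a lowercased token unless it is a stopword, a digit string, or punctuation.
        return [token.lower() for token in tokens
                if token.lower() not in mystopwords
                and not token.isdigit()
                and token not in punctuation]
    return [remove_stpwrd_punc_dig(word_tokenize(text)) for text in texts]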

processed_x = text_processor(x)

Here is the breakdown of the processed data, before it is split into training and test sets:

In [29]:
print('Size of processed text: ',len(processed_x), '\nSize of classes: ', len(y))
print('\nClasses and their sizes:\n', y.value_counts())

Size of processed text:  7621 
Size of classes:  7621

Classes and their sizes:
 5 - Quality Issue - Assembly      5011
2 - Quality Issue - Appearance    1594
4 - Quality Issue - Functional    1016
Name: Symptom Type, dtype: Int64


<h3>Feature Engineering/Text Representation with Word Embeddings</h3>

The pre-processed text must be represented in numerical form before it can be fed to a machine learning (ML) algorithm; this step is referred to as text representation in NLP. In NLP, a good text representation often yields far better results, even when paired with an ordinary ML algorithm.

There are different techniques available for text representation in NLP. The ones I am familiar with are:
1. Basic Vectorization of Text
2. Word Embeddings

One way to perform text representation is through basic vector representations of the pre-processed text, such as One-Hot encoding, Bag of Words, Bag of N-Grams, and Term Frequency-Inverse Document Frequency (TF-IDF). However, basic vectorization of text comes with drawbacks:
- words are treated as atomic units, so relationships among them cannot be established
- feature vector size and sparsity increase with vocabulary size, which makes the NLP model computationally expensive and can cause overfitting
- the NLP model cannot handle out-of-vocabulary words
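The drawbacks listed above can be seen in a few lines. The following is a minimal, dependency-free Bag of Words sketch on made-up comments (the documents and function names are assumptions, not the notebook's data): the vectors are mostly zeros, and a word that was never seen during training leaves no trace in the representation.

```python
def build_vocabulary(tokenized_docs):
    """Map each distinct token to a column index, in first-seen order."""
    vocab = {}
    for tokens in tokenized_docs:
        for token in tokens:
            vocab.setdefault(token, len(vocab))
    return vocab

def bag_of_words(tokens, vocab):
    """Count vector for one document; tokens missing from vocab are dropped."""
    vector = [0] * len(vocab)
    for token in tokens:
        index = vocab.get(token)
        if index is not None:
            vector[index] += 1
    return vector

# Toy, made-up comments standing in for real reporter comments.
train_docs = [
    ["door", "panel", "loose"],
    ["paint", "scratch", "door"],
    ["motor", "noise"],
]
vocab = build_vocabulary(train_docs)
vectors = [bag_of_words(doc, vocab) for doc in train_docs]

# Sparsity: share of zero entries across all document vectors.
zeros = sum(v.count(0) for v in vectors)
sparsity = zeros / (len(vectors) * len(vocab))

# "hinge" never appeared in training, so it is silently ignored.
unseen = bag_of_words(["door", "hinge"], vocab)
print(len(vocab), round(sparsity, 2), unseen)
```

Every new word adds a column, so the vector length grows with the vocabulary while each document still fills only a handful of entries.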

Another way is to use Word Embeddings, which utilize the concept of Distributional Representations. Here the goal is to come up with a text representation that enables the model to derive the meaning of words from their contexts. A Data Scientist can train their own word embeddings for text representation or load existing, pre-trained word embeddings. Some popular pre-trained word embeddings are Word2Vec by Google, fastText embeddings by Facebook, and GloVe word embeddings by Stanford University.

For this task, I will use a pre-trained Word2Vec model. Its authors proposed two architectures that can be used to train the model: Continuous Bag of Words (CBOW) and SkipGram. The architectures are similar. In both, the basic idea is to learn a small numerical feature vector (between 25 and 600 dimensions) for each word in the corpus, and to use those feature vectors to compare words with one another. Cosine similarity is generally used for this comparison. The main conceptual difference is that in SkipGram the center word is used to predict its surrounding words, whereas in CBOW the surrounding (context) words are used to predict the center word.
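The difference between the two architectures is easiest to see in the training examples each one is built from. The sketch below only generates those (input, target) pairs for a toy sentence with a window of one word on each side; it does not train anything, and the function names are illustrative rather than taken from any library.

```python
def skipgram_pairs(tokens, window=1):
    """(center, context) pairs: the center word is the input, each neighbour a target."""
    pairs = []
    for i, center in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

def cbow_pairs(tokens, window=1):
    """(context_words, center) pairs: the neighbours are the input, the center the target."""
    pairs = []
    for i, center in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        context = [tokens[j] for j in range(lo, hi) if j != i]
        pairs.append((context, center))
    return pairs

sentence = ["door", "panel", "loose"]
sg = skipgram_pairs(sentence)
cb = cbow_pairs(sentence)
print(sg)
print(cb)
```

SkipGram yields one example per (center, neighbour) combination, while CBOW yields one example per position with all neighbours grouped as the input.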

Word Embeddings, whether pre-trained or self-trained, also come with drawbacks. Two common ones are:
1. The embeddings model is large, which makes the NLP model difficult to deploy. As a student of Computer Science, I would find it a challenge to deploy an NLP model that uses word2vec embeddings on Heroku's free tier.
2. The out-of-vocabulary issue mentioned above.

The pre-trained word2vec model is loaded below.

In [30]:
w2v_model = KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin.gz', binary=True)

Below, the words most similar to 'tesla', together with their cosine similarities, are displayed using the word2vec model. As can be seen, the nearest neighbours of 'tesla' relate to physics or electricity. This suggests that, for this token, the corpus used to build the Word2Vec embeddings contained little or no information about the car manufacturer Tesla. Note that the vocabulary is case-sensitive (the output contains entries such as 'MeV' and 'SQUIDs'), so the observation applies to the lowercase token 'tesla' only.

**Note-** All words shown here, including 'tesla', are present in the model's vocabulary, which means they appeared in the corpus that was used to create the Word2Vec embeddings model.

In [40]:
w2v_model.most_similar('tesla')

[('gauss', 0.6623970866203308),
 ('FT_ICR', 0.5639052391052246),
 ('MeV', 0.5619181990623474),
 ('keV', 0.5605965256690979),
 ('superconducting_magnet', 0.5567352175712585),
 ('electron_volt', 0.5503560900688171),
 ('SQUIDs', 0.5393732786178589),
 ('nT', 0.5386143326759338),
 ('electronvolts', 0.5377055406570435),
 ('kelvin', 0.5367920994758606)]
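The scores in the output above are cosine similarities between word vectors. As a minimal sketch of how such a score is computed, the example below uses three-dimensional toy vectors that are invented for illustration; they are not values from the Word2Vec model.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Made-up 3-dimensional vectors; real Word2Vec vectors have 300 dimensions.
toy_vectors = {
    "tesla": [1.0, 0.0, 1.0],
    "gauss": [1.0, 0.0, 0.5],
    "banana": [0.0, 1.0, 0.0],
}

sim_related = cosine_similarity(toy_vectors["tesla"], toy_vectors["gauss"])
sim_unrelated = cosine_similarity(toy_vectors["tesla"], toy_vectors["banana"])
print(round(sim_related, 4), round(sim_unrelated, 4))
```

Vectors pointing in similar directions score close to 1, and orthogonal vectors score 0, regardless of their lengths.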

**If a word is not present in the Word2Vec model's vocabulary, a KeyError ('Key ... not present') is raised!**

In [38]:
w2v_model.most_similar('AWordNotPresentInTheModelCorpus')

KeyError: "Key 'AWordNotPresentInTheModelCorpus' not present"
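One way to avoid this error is to test for membership before looking a word up, which is the same `in` check the feature function in this notebook relies on. The sketch below uses a plain dictionary as a stand-in for the loaded model so that it runs without the large embeddings file; `safe_vector` is a hypothetical helper, not a gensim function.

```python
def safe_vector(model, word, default=None):
    """Return the vector for word, or default if the word is out of vocabulary."""
    return model[word] if word in model else default

# Stand-in for the loaded embeddings (an assumption for this sketch):
# any object supporting `in` and `[]` lookups behaves the same way here.
toy_model = {
    "tesla": [0.1, 0.2, 0.3],
    "gauss": [0.2, 0.1, 0.4],
}

known = safe_vector(toy_model, "tesla")
missing = safe_vector(toy_model, "AWordNotPresentInTheModelCorpus")
print(known, missing)
```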

The feature vector of any word in the model's vocabulary can be retrieved. The code below shows an example for the word 'tesla'. The vector size was set to 300 when Word2Vec was trained, so every word in the vocabulary has a feature vector of length 300.

In [41]:
print(len(w2v_model['tesla']))
w2v_model['tesla']

300


array([-1.74804688e-01,  1.25000000e-01,  5.66406250e-01,  2.58789062e-02,
        5.12695312e-03,  1.27929688e-01,  3.02734375e-01, -7.03125000e-01,
       -9.27734375e-02, -1.30859375e-01, -4.80957031e-02,  1.18164062e-01,
        2.83203125e-01,  1.40625000e-01,  1.29882812e-01,  4.19921875e-01,
       -3.32031250e-01,  2.96875000e-01, -2.41088867e-03, -1.51367188e-02,
       -3.02734375e-01, -1.54296875e-01,  2.38037109e-02, -3.24707031e-02,
       -2.09960938e-01, -3.35937500e-01, -3.06640625e-01, -5.95092773e-03,
       -3.75000000e-01, -2.91015625e-01, -7.03125000e-02, -1.39648438e-01,
       -2.73437500e-01,  9.76562500e-03, -1.70898438e-01, -8.05664062e-02,
       -3.10546875e-01,  3.32031250e-01,  7.62939453e-03,  2.10937500e-01,
        3.08593750e-01,  2.75390625e-01,  1.31835938e-01,  2.45117188e-01,
       -6.95800781e-03, -1.35742188e-01,  1.48437500e-01,  2.91748047e-02,
       -3.18359375e-01, -2.75878906e-02, -8.59375000e-02, -2.27539062e-01,
       -1.18408203e-02,  

<h4>Function to acquire feature vectors of words in the pre-processed training corpus</h4>

The **embedding_features** function acquires feature vectors for the words in the pre-processed training corpus. It goes through each word in the corpus and checks whether the word is present in the embeddings model. If the word exists in w2v_model, its feature vector is looked up. For each sentence, the feature vectors of all its known words are averaged, so that every sentence in the training corpus is represented by a single average feature vector.

As mentioned above, one drawback of the Word2Vec model is the out-of-vocabulary issue. In the function below, any word in the training corpus that is not present in w2v_model is discarded. Therefore, when using a pre-trained embeddings model, it is important to keep in mind the domain on which it was trained. Word2Vec was trained on a large corpus from Google News, whereas my training dataset comes from manufacturing, which is not ideal. A word embeddings model pre-trained on a manufacturing text corpus would likely yield better results.

In [43]:
def embedding_features(list_of_lists):
    """Return one averaged word vector per tokenized sentence."""
    DIMENSION = 300
    features = []
    for tokens in list_of_lists:
        count_for_this = 0
        feat_for_this = np.zeros(DIMENSION)
        for token in tokens:
            # Skip out-of-vocabulary tokens; only known words contribute.
            if token in w2v_model:
                feat_for_this += w2v_model[token]
                count_for_this += 1
        if count_for_this != 0:
            features.append(feat_for_this / count_for_this)
        else:
            # No known tokens in this sentence: fall back to an all-zero vector.
            features.append(np.zeros(DIMENSION))
    return features

features = embedding_features(processed_x)

Although processed_x (shown a few cells above) and features have the same length, 7621, some words within the pre-processed sentences of processed_x did not contribute to the features, because of the **domain discrepancy** between manufacturing and news.
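How much text is lost to this discrepancy can be measured by counting the tokens that the embeddings model recognises. The following is a small sketch of such a check; the toy vocabulary and sentences are invented, and `vocabulary_coverage` is a hypothetical helper. With the real data, `processed_x` and `w2v_model` could be passed in instead.

```python
def vocabulary_coverage(tokenized_docs, model):
    """Return (share of tokens found in model, set of tokens that were not found)."""
    total = 0
    kept = 0
    dropped = set()
    for tokens in tokenized_docs:
        for token in tokens:
            total += 1
            if token in model:
                kept += 1
            else:
                dropped.add(token)
    coverage = kept / total if total else 0.0
    return coverage, dropped

# Toy stand-ins (assumptions): a tiny "vocabulary" and two tokenized comments.
toy_model = {"door", "panel", "loose", "scratch"}
toy_docs = [["door", "panel", "misaligned"], ["scratch", "b-pillar"]]

coverage, dropped = vocabulary_coverage(toy_docs, toy_model)
print(f"coverage: {coverage:.0%}, dropped: {sorted(dropped)}")
```

A low coverage value, or a dropped set full of domain terms, would indicate that the averaged vectors are missing much of what distinguishes the comments.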

In [45]:
features = np.array(features)
features.shape

(7621, 300)

<h3>Random Forest Classifier for Prediction</h3>

In [46]:
# Train a RandomForestClassifier on a random train/test split, then predict on the test set.
classifier = RandomForestClassifier()
train_data, test_data, train_cats, test_cats = train_test_split(features, y)
classifier.fit(train_data, train_cats)
preds = classifier.predict(test_data)
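Because the three classes differ widely in size, one common option (not used in this notebook, and untested on this data) is to weight classes inversely to their frequency. scikit-learn's `class_weight='balanced'` setting, which `RandomForestClassifier` accepts, uses the formula `n_samples / (n_classes * class_count)`. The sketch below reproduces that arithmetic with the overall class counts from earlier in the notebook; the training split would have slightly different counts.

```python
# Overall class counts from the notebook; the training split will differ slightly.
counts = {
    "5 - Quality Issue - Assembly": 5011,
    "2 - Quality Issue - Appearance": 1594,
    "4 - Quality Issue - Functional": 1016,
}

n_samples = sum(counts.values())
n_classes = len(counts)

# Same formula as scikit-learn's 'balanced' heuristic.
weights = {label: n_samples / (n_classes * n) for label, n in counts.items()}

for label, weight in weights.items():
    print(f"{label}: {weight:.3f}")
```

With these weights, each class contributes the same total weight, so errors on 'functional' samples count about five times as much as errors on 'assembly' samples.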

<h3>Model Evaluation</h3>

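The model's f1-score for **5 - Quality Issue - Assembly** is the highest of the three categories, as expected given the imbalanced sample sizes. Note that classification_report expects its arguments in the order (y_true, y_pred), whereas the call below passes the predictions first. As a result, the columns labelled precision and recall in the output are interchanged, and the support column counts predicted labels rather than true ones; the f1-scores and the accuracy are unaffected. Read correctly, the classifier recovered about 96 of every 100 true 'assembly' samples (recall of 0.96), and about 80 of every 100 samples it predicted as 'assembly' were correct (precision of 0.80). For 'functional', recall was only 0.29, meaning most true 'functional' samples were assigned to another class.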

In [48]:
print(classification_report(preds, test_cats))

                                precision    recall  f1-score   support

2 - Quality Issue - Appearance       0.68      0.89      0.77       309
4 - Quality Issue - Functional       0.29      0.80      0.42        89
  5 - Quality Issue - Assembly       0.96      0.80      0.87      1508

                      accuracy                           0.82      1906
                     macro avg       0.64      0.83      0.69      1906
                  weighted avg       0.89      0.82      0.84      1906
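When reading a report like the one above, the argument order matters: scikit-learn's `classification_report` takes `(y_true, y_pred)`, and passing the predictions first interchanges precision and recall. The dependency-free sketch below demonstrates the effect on made-up labels; `precision_recall` is a hypothetical helper written for this illustration.

```python
def precision_recall(y_true, y_pred, label):
    """Precision and recall for one label, given true and predicted labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
    predicted = sum(1 for p in y_pred if p == label)
    actual = sum(1 for t in y_true if t == label)
    precision = tp / predicted if predicted else 0.0
    recall = tp / actual if actual else 0.0
    return precision, recall

# Made-up labels: two 'functional' samples are misclassified as 'assembly'.
y_true = ["assembly"] * 4 + ["functional"] * 4
y_pred = ["assembly"] * 4 + ["assembly", "assembly", "functional", "functional"]

correct_order = precision_recall(y_true, y_pred, "assembly")
swapped_order = precision_recall(y_pred, y_true, "assembly")
print(correct_order, swapped_order)
```

The two results contain the same pair of numbers in opposite positions, which is why the f1-score, being symmetric in precision and recall, does not change.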

