# <center> Classifying Amazon product reviews with logistic regression
## <center> Two levels of structured classes
    
We are faced with a simple NLP problem: classifying Amazon product reviews. The classes, however, are organized into a hierarchy, as in this picture.

<img src="https://habrastorage.org/webt/nf/en/j7/nfenj7gktep6dtbrtzgijcsdzwy.png" width=40%/>

That raises the question of how best to approach this hierarchical text classification problem.

Here we present a basic tf-idf + logistic regression baseline. The taxonomy in our data has 3 levels, but we disregard the 3rd one here.

**Idea**

Each review has 3 labels, one per level of the taxonomy, e.g.:

> 'The description and photo on this product needs to be changed to indicate this product is the BuffalOs version of this beef jerky.'

> Category 1: `grocery gourmet food` 

> Category 2: `meat poultry`

> Category 3: `jerky`

First, we concatenate the Category 1 and Category 2 classes for each sample, e.g. `grocery gourmet food/meat poultry`. Then we train the model on these concatenated targets and measure the F1 score for Category 2.

Then we split the predicted string to get the prediction for Category 1:

-  Category 2 prediction is `grocery gourmet food/meat poultry` --> Category 1 prediction is `grocery gourmet food`

After that we measure the F1 score for Category 1.

**Results:**

F1 micro (=accuracy):
- Category 1: **0.948**
- Category 2: **0.889**

P.S. We use "level" and "category" interchangeably here.

## Reading and analyzing the data

In [None]:
import numpy as np
import pandas as pd
from pathlib import Path

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, f1_score

from matplotlib import pyplot as plt
%config InlineBackend.figure_format = 'retina'

In [None]:
PATH_TO_DATA = Path('../input/hierarchical-text-classification/')

In [None]:
train_df = pd.read_csv(PATH_TO_DATA / 'train_40k.csv').fillna(' ')
valid_df = pd.read_csv(PATH_TO_DATA / 'val_10k.csv').fillna(' ')

In [None]:
train_df.head()

Fields:

* productId - the product that the review is about
* Title - title of a review as given by the author
* userId - user ID of the author of the review
* Helpfulness - whether the review is found helpful by other users
* Score - score of a review as rated by other users
* Time - timestamp of the review
* Text - text of a review

In [None]:
train_df.info()

Example of a review

In [None]:
train_df.loc[0, 'Text']

In [None]:
train_df.loc[0, 'Cat1'], train_df.loc[0, 'Cat2'], train_df.loc[0, 'Cat3']

Distribution of level 1 classes

In [None]:
train_df['Cat1'].value_counts()

We concatenate the level 1 and level 2 classes and train the model on these combined targets. It is important that predictions respect the class taxonomy, and with combined targets the model can never predict contradictory level 1 and level 2 classes (e.g. 'pet supplies' as L1 and 'meat poultry' as L2, when 'meat poultry' is actually a sub-level of 'grocery gourmet food').

In [None]:
train_df['Cat1_Cat2'] = train_df['Cat1'] + '/' + train_df['Cat2']
valid_df['Cat1_Cat2'] = valid_df['Cat1'] + '/' + valid_df['Cat2']
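Before relying on the concatenated targets, it can be worth checking two things: that every level 2 class has a single level 1 parent, and that no label contains the separator `/` (otherwise splitting the prediction later would cut a label in the wrong place). Below is a minimal sketch of such a check. The helper `check_taxonomy` and the toy pairs are hypothetical; on the real data one would pass something like `zip(train_df['Cat1'], train_df['Cat2'])`.

```python
from collections import defaultdict


def check_taxonomy(pairs, sep="/"):
    """Return (ambiguous_children, labels_with_sep) for (level1, level2) pairs.

    ambiguous_children maps each level 2 label that appears under more than
    one level 1 parent to the sorted list of those parents.
    labels_with_sep lists the labels that contain the separator.
    """
    parents = defaultdict(set)
    bad_labels = set()
    for level1, level2 in pairs:
        parents[level2].add(level1)
        for label in (level1, level2):
            if sep in label:
                bad_labels.add(label)
    ambiguous = {child: sorted(ps) for child, ps in parents.items() if len(ps) > 1}
    return ambiguous, sorted(bad_labels)


# Toy data for illustration only (not taken from the dataset).
toy_pairs = [
    ("grocery gourmet food", "meat poultry"),
    ("pet supplies", "dogs"),
    ("pet supplies", "cats"),
    ("grocery gourmet food", "meat poultry"),
]
ambiguous, bad_labels = check_taxonomy(toy_pairs)
print(ambiguous, bad_labels)
```

Two empty results mean the toy taxonomy is a proper tree and the labels are safe to join and split on `/`.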

Now we have 64 classes

In [None]:
train_df['Cat1_Cat2'].nunique()

Most popular ones (at level 2) are:

In [None]:
train_df['Cat1_Cat2'].value_counts().head()

We'll be training the model on review titles only; the reason is given in the next section.

## Training the model

We train our model on review titles only: a couple of experiments showed that this works better than training on the review text.

In [None]:
# put a limit on maximal number of features and minimal word frequency
tf_idf = TfidfVectorizer(max_features=50000, min_df=2)
# multinomial logistic regression a.k.a softmax classifier
logit = LogisticRegression(C=1e2, n_jobs=4, solver='lbfgs', 
                           random_state=17, verbose=0, 
                           multi_class='multinomial',
                           fit_intercept=True)
# sklearn's pipeline
tfidf_logit_pipeline = Pipeline([('tf_idf', tf_idf), 
                                 ('logit', logit)])

In [None]:
%%time
tfidf_logit_pipeline.fit(train_df['Title'], train_df['Cat1_Cat2'])

In [None]:
%%time
valid_pred_level_2 = tfidf_logit_pipeline.predict(valid_df['Title'])

That was the level 2 model. To predict level 1 as well, we simply take the first part of the level1/level2 prediction. E.g. if 'health personal care/health care' is predicted, then the level 1 prediction is 'health personal care'.

In [None]:
valid_pred_level_1 = [el.split('/')[0] for el in valid_pred_level_2]

For evaluation, let's look at the F1 score (micro and weighted) at Level 1 and Level 2 separately. Note that in a single-label multiclass setting, the F1 score with micro averaging is the same as accuracy.

In [None]:
# format with {:.3f} instead of calling .round(), so that this works whether
# f1_score returns a NumPy scalar or a plain Python float
print("Level 1:\n\tF1 micro (=accuracy): {:.3f}\n\tF1 weighted:\t      {:.3f}".format(
    f1_score(y_true=valid_df['Cat1'], y_pred=valid_pred_level_1, average='micro'),
    f1_score(y_true=valid_df['Cat1'], y_pred=valid_pred_level_1, average='weighted')
    )
)

In [None]:
print("Level 2:\n\tF1 micro (=accuracy): {:.3f}\n\tF1 weighted:\t      {:.3f}".format(
    f1_score(y_true=valid_df['Cat1_Cat2'], y_pred=valid_pred_level_2, average='micro'),
    f1_score(y_true=valid_df['Cat1_Cat2'], y_pred=valid_pred_level_2, average='weighted')
    )
)
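
The remark above that micro-averaged F1 equals accuracy can be verified by hand. In single-label multiclass data, every misclassified sample counts once as a false positive (for the predicted class) and once as a false negative (for the true class), so the pooled precision and recall are both equal to accuracy. Here is a small pure-Python sketch on made-up labels; the helper `micro_f1_and_accuracy` is illustrative and not part of scikit-learn.

```python
def micro_f1_and_accuracy(y_true, y_pred):
    """Compute micro-averaged F1 and accuracy for single-label multiclass data."""
    classes = set(y_true) | set(y_pred)
    tp = fp = fn = 0
    # Pool true positives, false positives and false negatives over all classes.
    for c in classes:
        for t, p in zip(y_true, y_pred):
            if p == c and t == c:
                tp += 1
            elif p == c and t != c:
                fp += 1
            elif t == c and p != c:
                fn += 1
    accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
    if tp == 0:
        # No correct predictions: both scores are zero.
        return 0.0, accuracy
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return f1, accuracy


# Made-up labels: 4 of 6 predictions are correct.
y_true = ["a", "a", "b", "c", "c", "c"]
y_pred = ["a", "b", "b", "c", "a", "c"]
f1, acc = micro_f1_and_accuracy(y_true, y_pred)
print(round(f1, 3), round(acc, 3))
```

Both numbers come out as 4/6. The weighted F1 generally differs, because it averages per-class F1 scores weighted by class support.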

## Explaining model predictions

In [None]:
def plot_confusion_matrix(y_true, y_pred, classes,
                          normalize=False,
                          title='Confusion matrix', figsize=(7,7),
                          cmap=plt.cm.Blues, path_to_save_fig=None):
    """
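    Plot the confusion matrix with predicted labels on the rows
    and true labels on the columns.
    Normalization can be applied by setting `normalize=True`.
    """
    import itertools
    # pass `labels` so that the row/column order matches the tick labels;
    # transpose so that rows are predictions and columns are true labels
    cm = confusion_matrix(y_true, y_pred, labels=classes).T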
    if normalize:
        cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
    
    plt.figure(figsize=figsize)
    plt.imshow(cm, interpolation='nearest', cmap=cmap)
    plt.title(title)
    tick_marks = np.arange(len(classes))
    plt.xticks(tick_marks, classes, rotation=90)
    plt.yticks(tick_marks, classes)

    fmt = '.2f' if normalize else 'd'
    thresh = cm.max() / 2.
    for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
        plt.text(j, i, format(cm[i, j], fmt),
                 horizontalalignment="center",
                 color="white" if cm[i, j] > thresh else "black")

    plt.tight_layout()
    plt.ylabel('Predicted label')
    plt.xlabel('True label')
    
    if path_to_save_fig:
        plt.savefig(path_to_save_fig, dpi=300, bbox_inches='tight')

The confusion matrix for level 1, plotted below, is quite balanced.

In [None]:
plot_confusion_matrix(
    y_true=valid_df['Cat1'],
    y_pred=valid_pred_level_1, 
    classes=sorted(train_df['Cat1'].unique()),
    figsize=(8, 8)
)
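
A note on `normalize=True` in the function above: because the matrix is transposed, each row corresponds to a predicted label, so dividing by the row sums puts per-class precision (not recall) on the diagonal. Here is a minimal pure-Python sketch on made-up labels. The helper name `transposed_normalized_cm` is ours, and unlike the NumPy version it returns zeros, not NaNs, for classes that are never predicted.

```python
def transposed_normalized_cm(y_true, y_pred, classes):
    """Row-normalized confusion matrix with predictions on rows, truth on columns."""
    index = {c: i for i, c in enumerate(classes)}
    n = len(classes)
    cm = [[0] * n for _ in range(n)]
    for t, p in zip(y_true, y_pred):
        cm[index[p]][index[t]] += 1  # row = predicted, column = true
    normalized = []
    for row in cm:
        total = sum(row)  # number of samples predicted as this class
        normalized.append([v / total if total else 0.0 for v in row])
    return normalized


# Made-up labels for illustration only.
y_true = ["a", "a", "b", "c", "c", "c"]
y_pred = ["a", "b", "b", "c", "a", "c"]
normalized = transposed_normalized_cm(y_true, y_pred, classes=["a", "b", "c"])
for row in normalized:
    print(row)
```

The diagonal reads 0.5, 0.5, 1.0: half of the samples predicted as `a` or `b` are correct, and every sample predicted as `c` is correct.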

We can explore the words/n-grams that are most indicative of each class. With 64 classes it might be a bit overwhelming, though.

In [None]:
%%capture
import eli5

In [None]:
eli5.show_weights(
    estimator=tfidf_logit_pipeline.named_steps['logit'],
    vec=tfidf_logit_pipeline.named_steps['tf_idf'])
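
To keep the output manageable, one option is to look at the largest coefficients for a handful of classes only. The sketch below does this in plain Python on a made-up coefficient matrix; `top_features` is a hypothetical helper. With the fitted pipeline, the inputs would presumably be `logit.coef_` (one row per class), `logit.classes_`, and the vocabulary from `tf_idf.get_feature_names_out()` (available in scikit-learn 1.0 and later).

```python
def top_features(coefs, feature_names, class_names, targets, k=3):
    """Return the k features with the largest weights for each target class."""
    class_names = list(class_names)
    result = {}
    for target in targets:
        row = coefs[class_names.index(target)]
        # Sort (weight, feature) pairs by weight, largest first.
        ranked = sorted(zip(row, feature_names), reverse=True)[:k]
        result[target] = [(name, weight) for weight, name in ranked]
    return result


# Made-up weights for illustration only (not taken from the trained model).
coefs = [
    [0.1, 2.0, -1.0, 0.5],   # grocery gourmet food/meat poultry
    [1.5, -0.3, 0.8, 0.0],   # pet supplies/dogs
]
feature_names = ["dog", "jerky", "leash", "beef"]
class_names = ["grocery gourmet food/meat poultry", "pet supplies/dogs"]

top = top_features(coefs, feature_names, class_names,
                   targets=["pet supplies/dogs"], k=2)
print(top)
```

For the toy weights this lists `dog` and `leash` as the strongest indicators of `pet supplies/dogs`.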