### Use Pairwise Ranking to rank reviews according to their usefulness

In [1]:
import pandas as pd
import numpy as np
from joblib import load, dump
from copy import deepcopy
from statistics import mean

from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from collections import Counter

In [2]:
df = pd.read_csv('data/Features.csv')

In [3]:
df

Unnamed: 0,product,answer_option,label,review_len,Rn,Rp,Rs,Rc,Rd,Rsc
0,Accucheck,Fast and accurate delivery,0,4,0.232928,0.30,0.616667,0.005413,1.0,0.0000
1,Accucheck,Expected a longer expire date. Your Product Li...,0,14,1.017529,-0.10,0.400000,0.017591,1.0,0.0000
2,Accucheck,I liked the prompt service,0,5,0.319756,0.60,0.800000,0.006766,1.0,0.4215
3,Accucheck,Good product,0,2,0.544303,0.70,0.600000,0.002706,0.0,0.4404
4,Accucheck,I not needed,0,3,0.000000,0.00,0.000000,0.004060,0.0,0.0000
...,...,...,...,...,...,...,...,...,...,...
1654,shampoo,Liked it very nicely working now my scalp is a...,1,11,0.166374,0.69,0.900000,0.026253,0.0,0.5709
1655,shampoo,its my regular choice,0,4,0.500000,0.00,0.076923,0.009547,0.0,0.0000
1656,shampoo,Works well with my hair oil to decrease dandruff,1,9,0.564857,0.00,0.000000,0.021480,0.0,0.2732
1657,shampoo,"Best therapy of dandruff, I like it.",0,7,0.538680,1.00,0.300000,0.016706,0.0,0.7717


## Ranking 
Ranking is a hard problem, even for humans. It is easy to decide whether a review is useful (informative) or not. Ordering reviews by how useful they are, however, is a much more complex task.

### Pairwise ranking approach
In this project, pairwise ranking is applied to rank reviews in a semi-supervised setting. The pairwise approach looks at one pair of documents at a time in the loss function and predicts their relative order. The objective is not to determine an absolute relevance score but to decide which of the two documents is more relevant. Here, relevance expresses the preference of one review over another.
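
To make the idea concrete, here is a minimal sketch of how pairwise preferences can be turned into a ranking: every item is compared with every other item, and items are sorted by the number of comparisons they win. The `prefer` callback is a hypothetical stand-in for a trained pairwise classifier; the toy comparator below simply prefers the longer text and is for illustration only.

```python
from typing import Callable, List, Tuple


def rank_by_wins(items: List[str], prefer: Callable[[str, str], int]) -> List[Tuple[str, int]]:
    """Rank items by how many pairwise comparisons each one wins.

    prefer(a, b) must return 1 if a is preferred over b, else 0.
    """
    wins = {i: 0 for i in range(len(items))}
    for i, a in enumerate(items):
        for j, b in enumerate(items):
            if i != j:
                wins[i] += prefer(a, b)
    # Highest win count first; ties keep their original order.
    order = sorted(wins, key=lambda i: wins[i], reverse=True)
    return [(items[i], wins[i]) for i in order]


# Hypothetical comparator (assumption): prefers the review with more words.
def longer_is_better(a: str, b: str) -> int:
    return int(len(a.split()) > len(b.split()))


reviews = [
    "Good product",
    "Fast delivery",
    "Works well with my hair oil to decrease dandruff",
]
ranking = rank_by_wins(reviews, longer_is_better)
print(ranking[0])
```

In the notebook itself, the role of `prefer` is played by the classifier trained on review pairs.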

### Review Segregation: 
We split the reviews into two sets, on which we train our model.
+ Set 0 contains reviews with label 0, i.e., ones that are not informative. These include reviews about delivery, customer support, packaging, etc. These reviews do not describe the product.
+ Set 1 contains reviews with label 1, i.e., reviews that are informative and are better than all reviews of Set 0.

#### How we segregated and determined labels for reviews:
Our entire review ranking system is based on the idea that it is easier for humans to classify reviews into two classes, which we call Set 0 and Set 1, than to rank them.

For each product ('Accucheck', 'Becadexamin', 'Evion', 'Neurobion', 'SevenseascodLiverOil', 'Shelcal', 'Supradyn', 'shampoo'), we asked 10 different people to label each review as 1 (informative) or 0 (not informative). Several participants were asked to label in order to reduce individual bias, so that the model learns from cleaner labels.

In [4]:
data_split = pd.crosstab(df['product'],df['label'])
data_split

product,label 0,label 1
Accucheck,311,85
Becadexamin,53,27
Evion,89,33
Neurobion,283,135
SevenseascodLiverOil,60,22
Shelcal,259,125
Supradyn,50,23
shampoo,55,49


## Building the training set:
#### We compared each review of Set 1 pairwise with every review of Set 0, and vice versa
+ (Rx, Ry, 1) where x∈Set1 and y∈Set0 → Rx is better than Ry
+ (Ry, Rx, 0) where x∈Set1 and y∈Set0 → Ry is worse than Rx
<br>
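
The pair construction can be illustrated on a toy frame. This is a sketch under stated assumptions, not the notebook's exact code: the toy column `review_len` and its values are made up, and `how="cross"` (available in pandas 1.2 and later) is used here in place of a constant join key.

```python
import pandas as pd

# Toy data (assumption): two informative reviews (label 1) and one non-informative (label 0).
toy = pd.DataFrame({
    "review_len": [9, 11, 2],
    "label": [1, 1, 0],
})

set1 = toy[toy["label"] == 1].drop(columns="label")
set0 = toy[toy["label"] == 0].drop(columns="label")

# (Rx, Ry, 1): informative review on the left, non-informative on the right.
better = set1.merge(set0, how="cross", suffixes=("_x", "_y")).assign(target=1)
# (Ry, Rx, 0): the same pairs with the sides swapped.
worse = set0.merge(set1, how="cross", suffixes=("_x", "_y")).assign(target=0)

pairs = pd.concat([better, worse], ignore_index=True)
print(pairs)
```

Each informative review contributes one row per non-informative review in both orderings, so the two classes of the resulting classification problem are balanced by construction.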

#### This now becomes a classification problem.

<hr>

![PairwiseRanking](Photos/PairwiseRanking.png)

In [5]:
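def building_training_data(df):
    # Set 1 (informative) and Set 0 (non-informative). The constant 'join' key
    # makes each merge pair every row of one set with every row of the other.
    A = df[df['label']==1].assign(join='j')
    B = df[df['label']==0].assign(join='j')
    trainset1 = pd.merge(A,B,how='outer',on='join')   # (Rx, Ry): label_x = 1
    trainset2 = pd.merge(B,A,how='outer',on ='join')  # (Ry, Rx): label_x = 0

    # Combine both orderings into a single training frame.
    trainset = pd.merge(trainset1,trainset2,how='outer')
    return trainset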

In [17]:
building_training_data(df)

Unnamed: 0,product_x,answer_option_x,label_x,review_len_x,Rn_x,Rp_x,Rs_x,Rc_x,Rd_x,Rsc_x,...,product_y,answer_option_y,label_y,review_len_y,Rn_y,Rp_y,Rs_y,Rc_y,Rd_y,Rsc_y
0,Accucheck,The reading is very accurate,1,5,0.544334,0.52,0.823333,0.006766,0.0,0.0000,...,Accucheck,Fast and accurate delivery,0,4,0.232928,0.300000,0.616667,0.005413,1.0,0.0000
1,Accucheck,The reading is very accurate,1,5,0.544334,0.52,0.823333,0.006766,0.0,0.0000,...,Accucheck,Expected a longer expire date. Your Product Li...,0,14,1.017529,-0.100000,0.400000,0.017591,1.0,0.0000
2,Accucheck,The reading is very accurate,1,5,0.544334,0.52,0.823333,0.006766,0.0,0.0000,...,Accucheck,I liked the prompt service,0,5,0.319756,0.600000,0.800000,0.006766,1.0,0.4215
3,Accucheck,The reading is very accurate,1,5,0.544334,0.52,0.823333,0.006766,0.0,0.0000,...,Accucheck,Good product,0,2,0.544303,0.700000,0.600000,0.002706,0.0,0.4404
4,Accucheck,The reading is very accurate,1,5,0.544334,0.52,0.823333,0.006766,0.0,0.0000,...,Accucheck,Good product,0,2,0.544303,0.700000,0.600000,0.002706,0.0,0.4404
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1157675,shampoo,It is better but not a permanent solution for ...,0,10,0.418615,0.50,0.500000,0.023866,0.0,-0.1263,...,shampoo,It really does help but the bottle is only apt...,1,13,0.702325,0.266667,0.733333,0.031026,0.0,0.2475
1157676,shampoo,It is better but not a permanent solution for ...,0,10,0.418615,0.50,0.500000,0.023866,0.0,-0.1263,...,shampoo,This Shampoo is good quality n anti dandruff s...,1,15,0.856367,0.700000,0.600000,0.035800,0.0,0.1531
1157677,shampoo,It is better but not a permanent solution for ...,0,10,0.418615,0.50,0.500000,0.023866,0.0,-0.1263,...,shampoo,No use... Wash ur hair with water.. that much ...,1,10,0.475916,0.500000,0.500000,0.023866,0.0,0.1779
1157678,shampoo,It is better but not a permanent solution for ...,0,10,0.418615,0.50,0.500000,0.023866,0.0,-0.1263,...,shampoo,Liked it very nicely working now my scalp is a...,1,11,0.166374,0.690000,0.900000,0.026253,0.0,0.5709


In [6]:
import warnings
warnings.filterwarnings("ignore")
product_list = df['product'].unique()
data_stack = []
for product in product_list:
    # Keep the label and feature columns only (drop product name and review text).
    temp = deepcopy(df[df['product']==product].iloc[:,2:])
    build_data = building_training_data(temp)
    print(product, len(temp), len(build_data))
    build_data.drop(columns = ['join','label_y'],inplace=True)
    # label_x (first column) becomes the target; the rest are the pair's features.
    data = build_data.iloc[:,1:].copy()
    data['target'] = build_data.iloc[:,0]
    data_stack.append(data)

Accucheck 396 52870
Becadexamin 80 2862
Evion 122 5874
Neurobion 418 76410
SevenseascodLiverOil 82 2640
Shelcal 384 64750
Supradyn 73 2300
shampoo 104 5390
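
The pair counts printed above can be checked against the label counts: a product with n0 non-informative and n1 informative reviews should yield 2 · n0 · n1 pairs, since every (Set 1, Set 0) combination appears in both orderings. The short check below uses only numbers copied from the outputs in this notebook.

```python
# (n0, n1) per product, copied from the crosstab output.
label_counts = {
    "Accucheck": (311, 85),
    "Becadexamin": (53, 27),
    "Evion": (89, 33),
    "Neurobion": (283, 135),
    "SevenseascodLiverOil": (60, 22),
    "Shelcal": (259, 125),
    "Supradyn": (50, 23),
    "shampoo": (55, 49),
}

# Pair counts printed by the training-set construction loop.
reported_pairs = {
    "Accucheck": 52870,
    "Becadexamin": 2862,
    "Evion": 5874,
    "Neurobion": 76410,
    "SevenseascodLiverOil": 2640,
    "Shelcal": 64750,
    "Supradyn": 2300,
    "shampoo": 5390,
}

for product, (n0, n1) in label_counts.items():
    expected = 2 * n0 * n1  # both orderings of every (Set 1, Set 0) pair
    assert expected == reported_pairs[product], product

total_pairs = sum(reported_pairs.values())
print(total_pairs)  # 213096
```

The total also matches the later train/test split sizes (170476 + 42620).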


In [22]:
# data_stack

In [7]:
train = pd.concat(data_stack).reset_index(drop = True)

In [8]:
train

Unnamed: 0,review_len_x,Rn_x,Rp_x,Rs_x,Rc_x,Rd_x,Rsc_x,review_len_y,Rn_y,Rp_y,Rs_y,Rc_y,Rd_y,Rsc_y,target
0,5,0.544334,0.52,0.823333,0.006766,0.0,0.0000,4,0.232928,0.300000,0.616667,0.005413,1.0,0.0000,1
1,5,0.544334,0.52,0.823333,0.006766,0.0,0.0000,14,1.017529,-0.100000,0.400000,0.017591,1.0,0.0000,1
2,5,0.544334,0.52,0.823333,0.006766,0.0,0.0000,5,0.319756,0.600000,0.800000,0.006766,1.0,0.4215,1
3,5,0.544334,0.52,0.823333,0.006766,0.0,0.0000,2,0.544303,0.700000,0.600000,0.002706,0.0,0.4404,1
4,5,0.544334,0.52,0.823333,0.006766,0.0,0.0000,2,0.544303,0.700000,0.600000,0.002706,0.0,0.4404,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
213091,10,0.418615,0.50,0.500000,0.023866,0.0,-0.1263,13,0.702325,0.266667,0.733333,0.031026,0.0,0.2475,0
213092,10,0.418615,0.50,0.500000,0.023866,0.0,-0.1263,15,0.856367,0.700000,0.600000,0.035800,0.0,0.1531,0
213093,10,0.418615,0.50,0.500000,0.023866,0.0,-0.1263,10,0.475916,0.500000,0.500000,0.023866,0.0,0.1779,0
213094,10,0.418615,0.50,0.500000,0.023866,0.0,-0.1263,11,0.166374,0.690000,0.900000,0.026253,0.0,0.5709,0


In [9]:
X = train.iloc[:,:-1].values
y = train.iloc[:,-1].values

from sklearn.model_selection import train_test_split
X_train,X_test, y_train, y_test = train_test_split(X,y,test_size = 0.2,shuffle = True, stratify = y) 
print("Test Len:",len(X_test)," ",len(y_test))

Test Len: 42620   42620


# Spot Checking
+ Linear Model
+ Non-Linear Model
+ Ensemble Model
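
Spot checking amounts to fitting one representative of each model family on the same split and comparing a common metric. The sketch below shows that loop in compact form. It runs on synthetic data from `make_classification` (an assumption made so the block is self-contained), and the `_demo` variable names are hypothetical; the cells that follow apply the same idea to the real pairwise feature matrix.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption), with 14 features like the pairwise matrix.
X_demo, y_demo = make_classification(n_samples=500, n_features=14, random_state=0)
Xtr, Xte, ytr, yte = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=0
)

models = {
    "linear": LogisticRegression(max_iter=1000),
    "non-linear": DecisionTreeClassifier(random_state=0),
    "ensemble": RandomForestClassifier(n_estimators=50, random_state=0),
}

results = {}
for name, model in models.items():
    model.fit(Xtr, ytr)
    results[name] = accuracy_score(yte, model.predict(Xte))
    print(f"{name}: test accuracy = {results[name]:.3f}")
```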

<hr>

## Linear Model: Logistic Regression

In [11]:
from sklearn.linear_model import LogisticRegression
classifier = LogisticRegression()
classifier.fit(X_train,y_train)
print("Training Accuracy\n", accuracy_score(y_train,classifier.predict(X_train)))
print("Test Accuracy\n", accuracy_score(y_test,classifier.predict(X_test)))

print('CLASSIFICATION REPORT')
print("Training\n", classification_report(y_train,classifier.predict(X_train)))
print("Test \n", classification_report(y_test,classifier.predict(X_test)))

Training Accuracy
 0.8484302775757291
Test Accuracy
 0.8488503050211168
CLASSIFICATION REPORT
Training
               precision    recall  f1-score   support

           0       0.85      0.85      0.85     85238
           1       0.85      0.85      0.85     85238

    accuracy                           0.85    170476
   macro avg       0.85      0.85      0.85    170476
weighted avg       0.85      0.85      0.85    170476

Test 
               precision    recall  f1-score   support

           0       0.85      0.85      0.85     21310
           1       0.85      0.85      0.85     21310

    accuracy                           0.85     42620
   macro avg       0.85      0.85      0.85     42620
weighted avg       0.85      0.85      0.85     42620



### Accuracy: 85%
### F1-score: 85%

## Non-Linear Model: DecisionTree

In [12]:
# Decision Tree
from sklearn.tree import DecisionTreeClassifier

classifier = DecisionTreeClassifier()
classifier.fit(X_train,y_train)

print("Training Accuracy\n", accuracy_score(y_train,classifier.predict(X_train)))
print("Test Accuracy\n", accuracy_score(y_test,classifier.predict(X_test)))

print('CLASSIFICATION REPORT')
print("Training\n", classification_report(y_train,classifier.predict(X_train)))
print("Test \n", classification_report(y_test,classifier.predict(X_test)))

Training Accuracy
 0.9969027898355194
Test Accuracy
 0.9838808071328015
CLASSIFICATION REPORT
Training
               precision    recall  f1-score   support

           0       0.99      1.00      1.00     85238
           1       1.00      0.99      1.00     85238

    accuracy                           1.00    170476
   macro avg       1.00      1.00      1.00    170476
weighted avg       1.00      1.00      1.00    170476

Test 
               precision    recall  f1-score   support

           0       0.98      0.98      0.98     21310
           1       0.98      0.98      0.98     21310

    accuracy                           0.98     42620
   macro avg       0.98      0.98      0.98     42620
weighted avg       0.98      0.98      0.98     42620



## Ensemble Model: RandomForest

In [13]:
from sklearn.ensemble import RandomForestClassifier

classifier = RandomForestClassifier(n_estimators=50, n_jobs = -1, oob_score = True,random_state=42)
classifier.fit(X_train,y_train)

print("Training Accuracy\n", accuracy_score(y_train,classifier.predict(X_train)))
print("Test Accuracy\n", accuracy_score(y_test,classifier.predict(X_test)))

print('CLASSIFICATION REPORT')
print("Training\n", classification_report(y_train,classifier.predict(X_train)))
print("Test \n", classification_report(y_test,classifier.predict(X_test)))

print("Test\nConfusion Matrix: \n", confusion_matrix(y_test, classifier.predict(X_test)))

Training Accuracy
 0.9969027898355194
Test Accuracy
 0.9894650398873768
CLASSIFICATION REPORT
Training
               precision    recall  f1-score   support

           0       1.00      1.00      1.00     85238
           1       1.00      1.00      1.00     85238

    accuracy                           1.00    170476
   macro avg       1.00      1.00      1.00    170476
weighted avg       1.00      1.00      1.00    170476

Test 
               precision    recall  f1-score   support

           0       0.99      0.99      0.99     21310
           1       0.99      0.99      0.99     21310

    accuracy                           0.99     42620
   macro avg       0.99      0.99      0.99     42620
weighted avg       0.99      0.99      0.99     42620

Test
Confusion Matrix: 
 [[21090   220]
 [  229 21081]]


In [14]:
## Score of the training dataset obtained using an out-of-bag estimate. This attribute exists only when oob_score is True.
classifier.oob_score_

0.9887432835120487

In [15]:
feature_importances = pd.DataFrame(classifier.feature_importances_,
                                   index = train.iloc[:,:-1].columns,
                                    columns=['importance']).sort_values('importance',ascending=False)
feature_importances

Unnamed: 0,importance
Rd_x,0.190409
Rd_y,0.188564
review_len_x,0.1121
review_len_y,0.09094
Rc_y,0.088746
Rc_x,0.073322
Rsc_x,0.041877
Rn_y,0.040544
Rn_x,0.040385
Rsc_y,0.040031


In [16]:
dump(classifier, 'randomforest.joblib', compress = 2)

['randomforest.joblib']

## RandomForest Classifier Saved
### Test accuracy: 0.99
### oob_score: 0.99

+ Note: if the dataset in your use case is too small for a train-test split, you can train the model on the entire dataset and use the out-of-bag score as the performance estimate.
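
As a rough illustration of that note, the sketch below trains a forest on all available rows of a small synthetic dataset and reads the out-of-bag estimate instead of a held-out test score. The data and the variable names (`X_small`, `y_small`, `forest`) are assumptions for the example only.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Small synthetic dataset (assumption) standing in for a case with too little data to split.
X_small, y_small = make_classification(n_samples=200, n_features=14, random_state=0)

# oob_score=True scores each sample using only the trees that did not see it during training.
forest = RandomForestClassifier(n_estimators=50, oob_score=True, random_state=0)
forest.fit(X_small, y_small)

print("OOB accuracy estimate:", forest.oob_score_)
```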

<hr>

## PART 2. Model Ranking Metric

### Accuracy of Ranking Methodology
+ After sorting the reviews by review score, we want all reviews of Set 1 to rank above all reviews of Set 0.
+ To test this, we defined the following ranking metric.
+ Let x be the number of reviews labeled 1 for a given product.
### `Ranking Accuracy on Single Product = Number of 1s found in first x positions / x`
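
The metric can be written as a small function. This is a minimal sketch; the function name `ranking_accuracy` and the toy label list are hypothetical and only illustrate the formula above.

```python
from typing import Sequence


def ranking_accuracy(labels_in_ranked_order: Sequence[int]) -> float:
    """Fraction of the top-x positions that hold label-1 reviews,
    where x is the total number of label-1 reviews."""
    x = sum(labels_in_ranked_order)
    if x == 0:
        raise ValueError("no label-1 reviews to rank")
    return sum(labels_in_ranked_order[:x]) / x


# Labels listed from highest to lowest review score (toy example).
ranked_labels = [1, 1, 0, 1, 0, 0]
print(ranking_accuracy(ranked_labels))  # 2 of the top 3 positions are 1s
```

A perfect ranking, with every label-1 review ahead of every label-0 review, gives a value of 1.0.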

In [17]:
classifier = load('randomforest.joblib')

In [18]:
product_list = df['product'].unique()
df['win']=0
df['lose']=0
df['review_score'] = 0.0
df.reset_index(inplace = True, drop = True)


def score_giver(C,D):
    E = pd.merge(C,D,how='outer',on='j')
    E.drop(columns=['j'],inplace = True)
    q= classifier.predict(E.values)
    return Counter(q)

for product in product_list:
    data = df[df['product']==product]
    for indx in data.index:
        review = df.iloc[indx, 3:-3]
        review['j'] = 'jn'
        C = pd.DataFrame([review])
        D = data[data.index!=indx].iloc[:,3:-3]
        D['j'] = 'jn'
        score = score_giver(C,D)
        # Count predicted wins (class 1) and losses (class 0) against the other reviews.
        wins = score.get(1, 0)
        df.at[indx, 'win'] = wins
        df.at[indx, 'lose'] = score.get(0, 0)
        # Review score: wins normalized by the number of reviews for this product.
        df.at[indx, 'review_score'] = wins / len(data)

df = df.sort_values(by = ['product','review_score'], ascending = False)

r_accuracy =[]
for product in product_list:
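    # x = number of label-1 reviews for this product.
    x = int(data_split.loc[product, 1])
    # Default to 0 so that a product with no 1s in its top x does not raise a TypeError.
    number_of_1_in_x = Counter(df[df['product']==product].iloc[:x, ]['label']).get(1, 0)
    rank_accuracy = number_of_1_in_x / x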
    print("Product: {} | Rank Accuracy: {}".format(product, rank_accuracy))
    r_accuracy.append(rank_accuracy)
print("Mean Rank Accuracy: {}".format(mean(r_accuracy)))

Product: Accucheck | Rank Accuracy: 0.9647058823529412
Product: Becadexamin | Rank Accuracy: 1.0
Product: Evion | Rank Accuracy: 1.0
Product: Neurobion | Rank Accuracy: 0.9117647058823529
Product: SevenseascodLiverOil | Rank Accuracy: 1.0
Product: Shelcal | Rank Accuracy: 0.9435483870967742
Product: Supradyn | Rank Accuracy: 1.0
Product: shampoo | Rank Accuracy: 1.0
Mean Rank Accuracy: 0.9775023719165086


In [19]:
df

Unnamed: 0,product,answer_option,label,review_len,Rn,Rp,Rs,Rc,Rd,Rsc,win,lose,review_score
1564,shampoo,Wash your head within 3 days for sometimes Or ...,1,39,1.162791,0.15000,0.300000,0.073986,1.0,-0.1823,104,0,0.990476
1615,shampoo,I was diagnosed with Seborrheic Dermatitis a d...,1,79,0.924148,0.00875,0.510000,0.136038,0.0,0.7184,103,1,0.980952
1550,shampoo,Like product is as expected Dislike- no immedi...,1,12,0.681312,-0.10000,0.400000,0.026253,0.0,0.7506,102,2,0.971429
1551,shampoo,It really helps to relieve dandruff and itching,1,8,0.000000,0.20000,0.200000,0.019093,0.0,0.6865,101,3,0.961905
1568,shampoo,At first the Dandruff disappears - - but it re...,1,33,0.271280,0.25000,0.333333,0.066826,0.0,-0.6010,96,8,0.914286
...,...,...,...,...,...,...,...,...,...,...,...,...,...
195,Accucheck,Prompt Delivery,0,2,0.364110,0.00000,0.000000,0.002710,1.0,0.0000,3,391,0.007595
223,Accucheck,courier service,0,2,0.000000,0.00000,0.000000,0.002710,1.0,0.0000,3,391,0.007595
43,Accucheck,The time/days took to deliver is really bad,0,8,0.369768,-0.70000,0.666667,0.010840,1.0,-0.5849,0,394,0.000000
311,Accucheck,The product was delivered after days and too o...,0,19,0.482174,-1.00000,1.000000,0.025745,1.0,-0.5423,0,394,0.000000


In [18]:
df.iloc[:, [0,1,-1]].to_csv('data/train_ranked_output.csv',index = False)

In [2]:
!ls

1. Data Analysis and Preprocessing.ipynb
2. Feature Engineering.ipynb
3.Model Training.ipynb
Photos
data
datapipeline.py
feature_analysis.html
randomforest.joblib
requirements.txt
utils


In [19]:
t = pd.read_csv('data/test.csv')