In [None]:
import numpy as np #Math operations
import seaborn as sns #Figures and graphics
import matplotlib.pyplot as plt #Figure and graphics
plt.rcParams["figure.figsize"] = (10,5)
import pandas as pd #Data analysis
from spacytextblob.spacytextblob import SpacyTextBlob  # Gives us the subjectivity of a text
import scipy.stats as stats #statistics tools
import warnings
warnings.filterwarnings("ignore")
from pysentimiento import create_analyzer #Give us the sentiment of a text
import spacy
from statsmodels.stats import multitest
from textblob import TextBlob
from textblob.sentiments import NaiveBayesAnalyzer
from langdetect import detect

In [None]:
from test_utils import preproc, permutation #We use the module that we've created to simulate and pre-process the data

# Task two: Is there any difference in review quality between the two types of users, incentivized and organic?
Our goal is to define the quality of the reviews and then compare it between incentivized and organic groups. To do this we must find the parameters that define quality and perform statistical tests that allow us to decide if both samples belong to the same population.

We have modularized the data pre-processing. It removes the rows written in languages other than English to make the analysis more accurate. Only 35 of the 2150 reviews are not in English, so removing them has little effect on the sample size.
Furthermore, we fill NaN values and add new columns to the original dataset. The new columns are:
- 'label': contains a 1 if the review is organic and a 0 if it is not.
- 'all_quest_answ': contains a 1 if all the questions are answered within the review and 0 otherwise.
- 'review_text': this column contains the full review text, concatenating all 4 questions.
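
To make the steps above concrete, here is a minimal sketch of what such a pre-processing function could look like. It is not the actual `preproc` from `test_utils`: the function name `preproc_sketch`, the raw column `is_organic`, and the toy language detector are all assumptions made for illustration (in practice a callable such as `langdetect.detect` could be passed instead).

```python
import pandas as pd

QUESTION_COLS = ["question1", "question2", "question3", "question4"]


def preproc_sketch(df, detect_lang, organic_col="is_organic"):
    """Illustrative stand-in for preproc; not the actual test_utils code.

    detect_lang: callable mapping a text to a language code such as "en".
    organic_col: hypothetical name of the raw column flagging organic reviews.
    """
    out = df.copy()
    # Fill missing answers with empty strings so they can be concatenated.
    out[QUESTION_COLS] = out[QUESTION_COLS].fillna("")
    # 1 if every question has a non-blank answer, 0 otherwise.
    answered = out[QUESTION_COLS].apply(lambda col: col.str.strip() != "")
    out["all_quest_answ"] = answered.all(axis=1).astype(int)
    # Full review text: the non-empty answers joined by a space.
    out["review_text"] = out[QUESTION_COLS].apply(
        lambda row: " ".join(text for text in row if text), axis=1
    )
    # 1 for organic reviews, 0 for incentivized ones.
    out["label"] = out[organic_col].astype(int)
    # Keep only the reviews detected as English.
    is_english = out["review_text"].map(detect_lang) == "en"
    return out[is_english].reset_index(drop=True)


def toy_detect(text):
    # Toy stand-in for a real language detector, used only to keep the example self-contained.
    return "es" if "muy" in text.lower() else "en"


raw = pd.DataFrame(
    {
        "question1": ["Great tool", "Muy bueno", "Easy to use"],
        "question2": ["Nothing", "Nada", None],
        "question3": ["Try it", "Pruébalo", "Worth a look"],
        "question4": ["Task tracking", "Tareas", "Planning"],
        "is_organic": [True, False, False],
    }
)
clean = preproc_sketch(raw, toy_detect)
print(clean[["all_quest_answ", "label", "review_text"]])
```

The completeness flag is computed before any row is dropped, so a review with a missing answer keeps `all_quest_answ = 0` even though its NaN has been replaced by an empty string.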


In [None]:
#Load the dataset
df = pd.read_excel('formatted_review_Asana.xlsx', index_col = 0)


In [None]:
#Apply the pre-process function to our dataset
df = preproc(df)

In [None]:
#View of the first 5 rows of df
df.head(5)

## Sentiment analysis

We are going to extract the following features related to text sentiment.
- Positivity/negativity: this is a floating-point estimate of the probability that the sentiment of the text is positive/negative. We obtain this metric using the pysentimiento library, which uses a BERT-based model to classify the sentiment of texts. BERT is an open-source ML framework for natural language processing (NLP) that has the advantage of using the surrounding text to establish context. The original BERT model was pre-trained on text from Wikipedia and BooksCorpus, while the English models shipped with pysentimiento are trained on social-media text.

- Subjectivity: this is a measure of the degree of the author's personal opinion in the text, represented by a floating-point number between 0 (completely objective text) and 1 (completely subjective). We use spacytextblob for this.

In [None]:
analyzer = create_analyzer(task="sentiment", lang="en")
sp = spacy.load('en_core_web_sm')
sp.add_pipe('spacytextblob')

Example: this review clearly has a positive sentiment, which should correlate with a high positivity probability, and a certain degree of personal opinion, which should result in a non-negligible subjectivity.

In [None]:
ex_review = df.loc[5,'question1']
print(ex_review)

In [None]:
print(f'Review sentiment probabilities: {analyzer.predict(ex_review).probas}')

As expected, this sentence has a very high probability of being classified as Positive (POS), and negligible probabilities of being Neutral (NEU) or Negative (NEG).

In [None]:
print(f'Review subjectivity: {sp(ex_review)._.blob.subjectivity}')

The subjectivity is 0.64, which indicates the presence of the author's personal opinion.

We now compute these indicators for each of the text columns:

In [None]:
from tqdm import tqdm
lista=[[],[],[],[]]

for q in range(1,5):
    for row in tqdm(df['question'+str(q)]):
        lista[q-1].append(analyzer.predict(row).probas)
        
    

In [None]:
l_rev=[]
for row in tqdm(df['review_text']):
    l_rev.append(analyzer.predict(row).probas)

In [None]:
lista_sub=[[],[],[],[]]
for q in range(1,5):
    for row in tqdm(df['question'+str(q)]):
        lista_sub[q-1].append(sp(row)._.blob.subjectivity)


In [None]:
l_rev_s=[]
for row in tqdm(df['review_text']):
    l_rev_s.append(sp(row)._.blob.subjectivity)

In [None]:
data = pd.DataFrame()
for i in range(0, 4):
    df_sent_q = pd.DataFrame(lista[i])
    df_sent_q.columns = [
        "NEG_q" + str(i + 1),
        "NEU_q" + str(i + 1),
        "POS_q" + str(i + 1),
    ]

    df_subg_q = pd.DataFrame(lista_sub[i], columns=["subj_q" + str(i + 1)])
    data = pd.concat([data, df_sent_q, df_subg_q], axis=1)


df_sent_rev = pd.DataFrame(l_rev)
df_sent_rev.columns = ["NEG_rev", "NEU_rev", "POS_rev"]

df_subj_rev = pd.DataFrame(l_rev_s, columns=["subj_rev"])

data = pd.concat([data, df_sent_rev, df_subj_rev], axis=1)
data["all_quest_answ"] = df.loc[:, "all_quest_answ"]
data["star"] = df.loc[:, "star"]
data["label"] = df.loc[:, "label"]

In [None]:
data

We have saved the data with the parameters that define review quality in an external file. Loading it makes it faster to get to the final results, although running all the cells of this notebook produces the same data.

In [None]:
data = pd.read_csv("quality_reviews.csv", index_col=0)

In [None]:
#We separate the data with two boolean masks
org = data['label']==1 #organic
inc = data['label']==0 #incentivized

### Definition of quality: We define the quality of a review based on 4 indicators: positivity, negativity, subjectivity and whether the user answered all the questions.

The code below shows the distributions of positivity, negativity and subjectivity for the organic and incentivized groups. Finally, we also show the distribution of values of the 'all_quest_answ' column. The figures are named rev_pos, rev_neg, rev_subj and questions_answ respectively.

In [None]:
g = sns.FacetGrid(data, col="label", height=5, aspect=1.5)
g.map(sns.histplot, "POS_rev",  stat="probability", bins=20)
g.set_axis_labels("Review positivity", "Probability")  # y-axis shows the share of reviews per bin
#plt.savefig("rev_pos.pdf")
plt.show()


In [None]:
g = sns.FacetGrid(data, col="label", height=5, aspect=1.5)
g.map(sns.histplot, "NEG_rev", stat="probability", bins=15)
g.set_axis_labels("Review negativity", "Probability")  # y-axis shows the share of reviews per bin
#plt.savefig("rev_neg.pdf")
plt.show()

In [None]:
g = sns.FacetGrid(data, col="label", height=5, aspect=1.5)
g.map(sns.histplot, "subj_rev", stat="probability", bins=20)
g.set_axis_labels("Review subjectivity", "Probability")  # label both facets, not only the last one
#plt.savefig("rev_subj.pdf")
plt.show()

## Statistical tests

For all tests, the null hypothesis is that both samples come from the same population with respect to the parameter being tested.

We conduct a permutation test to assess whether there is a statistically significant difference between the group means of each of the previously mentioned features. The tests use the features computed from the dataset column that contains the concatenated text of all the questions.
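
As a rough illustration of the idea behind a two-sided permutation test on a difference of means, the sketch below uses only the standard library. It is not the `permutation` function from `test_utils`, whose internals are not shown here; the name `permutation_pvalue`, its signature, and the sample values are assumptions made for this example.

```python
import random


def permutation_pvalue(sample_a, sample_b, n_iters=2000, seed=0):
    """Two-sided permutation test for a difference in means (illustrative sketch)."""
    rng = random.Random(seed)
    n_a, n_b = len(sample_a), len(sample_b)
    observed = sum(sample_a) / n_a - sum(sample_b) / n_b
    pooled = list(sample_a) + list(sample_b)
    extreme = 0
    for _ in range(n_iters):
        # Shuffling the pooled values simulates the null hypothesis
        # that the group labels carry no information.
        rng.shuffle(pooled)
        diff = sum(pooled[:n_a]) / n_a - sum(pooled[n_a:]) / n_b
        # Count resamples at least as extreme as the observed difference.
        if abs(diff) >= abs(observed):
            extreme += 1
    # Add-one correction so the estimated p-value is never exactly 0.
    return observed, (extreme + 1) / (n_iters + 1)


# Made-up scores for two clearly separated groups.
group_a = [0.90, 0.80, 0.85, 0.95, 0.90, 0.88]
group_b = [0.10, 0.20, 0.15, 0.05, 0.12, 0.18]
obs_diff, p_value = permutation_pvalue(group_a, group_b)
print(f"Observed difference: {obs_diff:.3f}, p-value: {p_value:.4f}")
```

The p-value is the share of shuffled datasets whose mean difference is at least as large, in absolute value, as the observed one, which is why both the observed difference and its negative are relevant when reading the resampled distribution.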

In order of appearance, the figures below are saved as sdist_positivity, sdist_negativity and sdist_subjectivity.

In [None]:
n_iters = 100000
mean_dif = data.loc[org,'POS_rev'].mean() - data.loc[~org,'POS_rev'].mean()

dist, p_pos, c_l, c_h = permutation(
    mean_dif,
    data.loc[org,'POS_rev'], 
    data.loc[~org,'POS_rev'], 
    niters=n_iters, 
    dist=True
)

sns.histplot(data=dist, stat = 'probability', bins=100)
plt.axvline(x = -mean_dif , color = 'red', label='Negative observed difference')
plt.axvline(x = mean_dif, color = 'green', label='Observed difference')
plt.title(f'Distribution of differences of means after {n_iters} resamples')
plt.xlabel('Sample mean difference')
plt.legend()
#plt.savefig("sdist_positivity.pdf")
plt.show()

print(f'P-value for review positivity is {p_pos} with a confidence interval ({c_l}, {c_h})')

In [None]:
n_iters = 100000
mean_dif = data.loc[org,'NEG_rev'].mean() - data.loc[~org,'NEG_rev'].mean()

dist, p_neg, c_l, c_h = permutation(
    mean_dif, 
    data.loc[org,'NEG_rev'], 
    data.loc[~org,'NEG_rev'], 
    niters=n_iters, 
    dist=True
)

sns.histplot(data=dist, stat = 'probability', bins=100)
plt.axvline(x = -mean_dif , color = 'red', label='Negative observed difference')
plt.axvline(x = mean_dif, color = 'green', label='Observed difference')
plt.title(f'Distribution of differences of means after {n_iters} resamples')
plt.xlabel('Sample mean difference')
plt.legend()
#plt.savefig("sdist_negativity.pdf")
plt.show()

print(f'P-value for review negativity is {p_neg} with a confidence interval ({c_l}, {c_h})')

In [None]:
n_iters = 100000
mean_dif = data.loc[org,'subj_rev'].mean() - data.loc[~org,'subj_rev'].mean()

dist, p_subj, c_l, c_h = permutation(
    mean_dif, 
    data.loc[org,'subj_rev'], 
    data.loc[~org,'subj_rev'], 
    niters=n_iters, 
    dist=True
)

sns.histplot(data=dist, stat = 'probability', bins=100)
plt.axvline(x = -mean_dif , color = 'red', label='Negative observed difference')
plt.axvline(x = mean_dif, color = 'green', label='Observed difference')
plt.title(f'Distribution of differences of means after {n_iters} resamples')
plt.xlabel('Sample mean difference')
plt.legend()
#plt.savefig("sdist_subjectivity.pdf")
plt.show()

print(f'P-value for review subjectivity is {p_subj} with a confidence interval ({c_l}, {c_h})')


## Number of questions answered

Another interesting question is whether the user left any of the four questions unanswered. We can see at first glance that the two classes differ in this regard, and we think it could be another useful parameter.

In [None]:
data["all_quest_answ_cat"] = 'Yes'
data.loc[data["all_quest_answ"]!=1,"all_quest_answ_cat"] = "No"

g = sns.FacetGrid(data, col="label", height=5, aspect=1.2)
g.map(sns.histplot, "all_quest_answ_cat", stat="probability", discrete=True)
g.set_axis_labels("Have all questions been answered?", "Probability")  # y-axis shows the share of reviews
#plt.savefig("questions_answ.pdf")
plt.show()


In this case, we use a chi-squared test to compare, between both groups, the proportion of reviews that answer all four questions.
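
For readers who want to see what such a test computes, here is a small standard-library sketch of the Pearson chi-squared statistic for a 2x2 table without continuity correction. The function name `chi2_2x2` and the counts are made up for illustration and are not taken from the dataset.

```python
import math


def chi2_2x2(table):
    """Pearson chi-squared test for a 2x2 table, without continuity correction."""
    (a, b), (c, d) = table
    n = a + b + c + d
    row_totals = (a + b, c + d)
    col_totals = (a + c, b + d)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count if group and answer status were independent.
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed - expected) ** 2 / expected
    # With 1 degree of freedom, the chi-squared survival function
    # reduces to erfc(sqrt(chi2 / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


# Hypothetical counts: rows are groups, columns are (all answered, not all answered).
example_table = [[30, 10], [20, 20]]
chi2_stat, chi2_p = chi2_2x2(example_table)
print(f"chi2 = {chi2_stat:.3f}, p-value = {chi2_p:.4f}")
```

The closed-form p-value only holds for a 2x2 table (one degree of freedom); larger tables need the general chi-squared distribution.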

In [None]:
# 2x2 contingency table: rows are organic / incentivized,
# columns are all questions answered / not all answered
answered = data['all_quest_answ'] == 1
T = np.array(
    [[(org & answered).sum(), (org & ~answered).sum()],
     [(~org & answered).sum(), (~org & ~answered).sum()]]
)

pval_chi = stats.chi2_contingency(T,correction=False)[1]

print(f"P-value for chi-squared test is {pval_chi}")

# Final result

Finally, we correct the p-values to account for the number of tests conducted, using an FDR correction. When several tests are performed, the probability that at least one of them reports a difference between the two populations purely by chance increases, which means a higher probability of type 1 errors (rejecting the null hypothesis when it is true). The significance level that we use is $\alpha=0.05$.
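
The sketch below illustrates the Benjamini-Hochberg procedure, a standard FDR correction for independent tests. The function name `benjamini_hochberg` and the input p-values are made up for illustration; they are not the values obtained in this notebook.

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Benjamini-Hochberg FDR correction (illustrative sketch)."""
    m = len(p_values)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    # A hypothesis is rejected when its adjusted p-value is at most alpha.
    rejected = [p <= alpha for p in adjusted]
    return rejected, adjusted


# Made-up p-values, in the order the tests were run.
example_pvals = [0.01, 0.04, 0.03, 0.20]
rejected, adjusted = benjamini_hochberg(example_pvals)
print(rejected)
print([round(p, 4) for p in adjusted])
```

Note that a raw p-value below $\alpha$ (0.03 or 0.04 in this toy input) can end up above $\alpha$ after the correction, which is the price paid for controlling the false discovery rate across all tests.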

In [None]:
#This call returns two arrays: the first tells us whether each null hypothesis is rejected or not,
#the second contains the corrected p-values.
multitest.fdrcorrection([p_subj,pval_chi,p_neg, p_pos], alpha=0.05, method='indep')

The statistical tests on the parameters that define review quality show significant differences between the samples in negativity and in whether the user answered all the questions. On the other hand, the results of the positivity and subjectivity tests indicate that we cannot reject the null hypothesis. For one random seed of the permutation tests, the results were as follows:


$p_{subj}=0.094 >\alpha=0.05$: we cannot reject the null hypothesis (subjectivity parameter).

$p_{\chi^2}=0.009 <\alpha=0.05$: we reject the null hypothesis (parameter that records whether all questions were answered).

$p_{neg}=0.018 <\alpha=0.05$: we reject the null hypothesis (negativity parameter).

$p_{pos}=0.064 >\alpha=0.05$: we cannot reject the null hypothesis (positivity parameter).



### Final Answer: There is a difference between the quality of reviews written by incentivized and non-incentivized users. This difference can be observed in the review negativity and in whether all the questions were answered. For positivity and subjectivity, the statistical analysis does not show a significant difference.