# Sentiment Analysis

![Sentiment analysis](https://github.com/senolcemhan98/templates/blob/main/sentiment.png?raw=true)

Model (HuggingFace) : https://huggingface.co/pysentimiento/robertuito-sentiment-analysis
- The model supports several languages (es, en, it, pt). This notebook uses pt (Portuguese).
- Base model : BERT
- pysentimiento is an **open-source** library

In [1]:
from pysentimiento import create_analyzer
analyzer = create_analyzer(task="sentiment", lang="pt") 

def analyze_sentiment(text: str) -> float:
    # Class probabilities for POS, NEU and NEG
    probs = analyzer.predict(text).probas
    # Weighted average with weights POS=+1, NEU=0, NEG=-1, so the score lies in [-1, 1]
    return probs['POS'] * 1 + probs['NEU'] * 0 + probs['NEG'] * -1
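
To make the scoring rule concrete, here is a minimal sketch that applies the same weighted average to hand-written probability dictionaries. The helper name `sentiment_score` and the three probability sets are hypothetical; they are not produced by the model.

```python
def sentiment_score(probs: dict) -> float:
    """Collapse class probabilities into one score, i.e. P(POS) - P(NEG)."""
    return probs["POS"] * 1 + probs["NEU"] * 0 + probs["NEG"] * -1

# Hypothetical probability outputs (not produced by the model)
mostly_positive = {"POS": 0.75, "NEU": 0.125, "NEG": 0.125}
mostly_neutral = {"POS": 0.125, "NEU": 0.75, "NEG": 0.125}
mostly_negative = {"POS": 0.0, "NEU": 0.25, "NEG": 0.75}

print(sentiment_score(mostly_positive))  # 0.625
print(sentiment_score(mostly_neutral))   # 0.0
print(sentiment_score(mostly_negative))  # -0.75
```

Because the neutral weight is 0, a comment the model considers neutral lands near 0 regardless of how confident the model is.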





In [2]:
import pandas as pd
from scipy import stats
from scipy.stats import spearmanr, shapiro

data = pd.read_csv('./S_Data/order_reviews.csv')
# Keep only the reviews that have a written comment
data = data[data['review_comment_message'].notna()]
data = data[['review_score', 'review_comment_message']]
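
As a quick check of the filtering step, the sketch below runs the same idea on a hypothetical three-row frame (the `toy` data is made up for illustration). It shows that masking with `notna()` and calling `dropna(subset=...)` keep the same rows.

```python
import pandas as pd

# Hypothetical toy data mirroring the two columns used in this notebook
toy = pd.DataFrame({
    "review_score": [5, 1, 4],
    "review_comment_message": ["Adorei", None, "Boa"],
})

by_mask = toy[toy["review_comment_message"].notna()]
by_dropna = toy.dropna(subset=["review_comment_message"])

print(by_mask.equals(by_dropna))  # True
print(list(by_mask.index))        # [0, 2]
```

Note that the original row labels are kept, which is why the index in the outputs of this notebook is not consecutive.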

In [3]:
data['sentiment_score'] = data['review_comment_message'].apply(analyze_sentiment)

In [4]:
data.head()

| | review_score | review_comment_message | sentiment_score |
|---|---|---|---|
| 3 | 5 | Recebi bem antes do prazo estipulado. | 0.031753 |
| 4 | 5 | Parabéns lojas lannister adorei comprar pela I... | 0.986891 |
| 9 | 4 | aparelho eficiente. no site a marca do aparelh... | -0.597534 |
| 12 | 4 | Mas um pouco ,travando...pelo valor ta Boa.\r\n | -0.578317 |
| 15 | 5 | Vendedor confiável, produto ok e entrega antes... | 0.056698 |


In [5]:
data.describe()

| | review_score | sentiment_score |
|---|---|---|
| count | 41753.0 | 41753.0 |
| mean | 3.640409 | 0.253674 |
| std | 1.626383 | 0.700631 |
| min | 1.0 | -0.991368 |
| 25% | 2.0 | -0.260423 |
| 50% | 4.0 | 0.260692 |
| 75% | 5.0 | 0.977939 |
| max | 5.0 | 0.992501 |


In [6]:
# Show the review with the minimum sentiment_score
pd.set_option('display.max_colwidth', None)
min_sentiment_score = data['sentiment_score'].min()
is_min = data['sentiment_score'] == min_sentiment_score

print(f"Comment : {data.loc[is_min, 'review_comment_message']}")
print(f"Review Score : {data.loc[is_min, 'review_score']}")
print(f"Sentiment Score : {data.loc[is_min, 'sentiment_score']}")

Comment : 41817    Saca rolhas de plástico, EXTREMAMENTE FRACO, que não seria capaz de abrir nem uma mamadeira, quanto mais uma garrafa de vinho. Quebrou no 1° uso! DINHEIRO TOTALMENTE JOGADO FORA! PÉSSIMO! Loja targaryen.
Name: review_comment_message, dtype: object
Review Score : 41817    1
Name: review_score, dtype: int64
Sentiment Score : 41817   -0.991368
Name: sentiment_score, dtype: float64


In [7]:
# Show the review with the maximum sentiment_score
max_sentiment_score = data['sentiment_score'].max()
is_max = data['sentiment_score'] == max_sentiment_score

print(f"Comment : {data.loc[is_max, 'review_comment_message']}")
print(f"Review Score : {data.loc[is_max, 'review_score']}")
print(f"Sentiment Score : {data.loc[is_max, 'sentiment_score']}")

Comment : 8425    Adorei a cauterização da trivitt quero pra vida inteira😍😍
Name: review_comment_message, dtype: object
Review Score : 8425    4
Name: review_score, dtype: int64
Sentiment Score : 8425    0.992501
Name: sentiment_score, dtype: float64


<img src="https://github.com/senolcemhan98/templates/blob/main/reviews.gif?raw=true" width="800" />

# Calculate Correlation

The **Spearman rank-order correlation coefficient** is a nonparametric measure of the monotonicity of the relationship between two datasets. Like other correlation coefficients, this one varies between -1 and +1 with **0 implying no correlation**. Correlations of **-1 or +1 imply an exact monotonic relationship**. Positive correlations imply that as x increases, so does y. Negative correlations imply that as x increases, y decreases.

In [8]:
correlation, _ = spearmanr(data['sentiment_score'], data['review_score'])
print(f"Spearman's correlation coefficient: {correlation}")

Spearman's correlation coefficient: 0.7283938241629598
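
To illustrate what "monotonic" means here, the following sketch uses hypothetical toy data where `y` grows with `x` but not linearly. Spearman's coefficient is 1 because the ranks agree perfectly, while Pearson's coefficient stays below 1. It also shows that Spearman's coefficient is Pearson's coefficient computed on the ranks.

```python
import numpy as np
from scipy.stats import pearsonr, rankdata, spearmanr

# Hypothetical toy data: y grows monotonically, but not linearly, with x
x = np.array([1, 2, 3, 4, 5, 6])
y = x ** 3

rho, _ = spearmanr(x, y)
r, _ = pearsonr(x, y)
# Spearman's rho equals Pearson's r computed on the ranks
rho_from_ranks, _ = pearsonr(rankdata(x), rankdata(y))

print(round(float(rho), 6))             # 1.0
print(bool(r < 1.0))                    # True
print(round(float(rho_from_ranks), 6))  # 1.0
```

This is why Spearman fits the comparison between an ordinal review score (1 to 5) and a continuous sentiment score: only the ordering matters, not the spacing between values.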


# Conclusion

With a Spearman coefficient of about 0.73, there is a strong positive monotonic relationship between our sentiment scores and the review scores in the data.

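Note: The model needs further training. It sometimes cannot distinguish between positive and negative comments, and it may predict a comment as neutral even though it is actually positive from the customer's point of view. For example, in the `data.head()` output the comment "Recebi bem antes do prazo estipulado." ("I received it well before the stipulated deadline.") has a review score of 5 but a sentiment score of only 0.031753.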
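
One way to look for such cases is to filter for rows where the review score is high but the sentiment score is not. The sketch below does this on the five rows from the `data.head()` output (comments omitted). The thresholds `>= 4` and `< 0.1` are hypothetical choices for illustration, not values used elsewhere in this notebook.

```python
import pandas as pd

# The five rows shown in the data.head() output (comments omitted)
sample = pd.DataFrame(
    {
        "review_score": [5, 5, 4, 4, 5],
        "sentiment_score": [0.031753, 0.986891, -0.597534, -0.578317, 0.056698],
    },
    index=[3, 4, 9, 12, 15],
)

# Hypothetical thresholds: "high" review is >= 4, "non-positive" sentiment is < 0.1
high_review = sample["review_score"] >= 4
low_sentiment = sample["sentiment_score"] < 0.1
mismatch = sample[high_review & low_sentiment]

print(list(mismatch.index))  # [3, 9, 12, 15]
```

The same mask could be applied to the full `data` frame to collect candidate comments for a manual check.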

# Extra

I also want to double-check whether there is a statistically significant difference between groups. I will only compare the reviews with a score of 4 against the reviews with a score of 5, and test whether their sentiment scores differ.

In [14]:
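def hyphothesis_test(dataframe, group, target):
    # Compare `target` between the rows where `group` is 4 and where it is 5
    import scipy.stats as stats
    from scipy.stats import shapiro
    import numpy as np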
    
    # Split A/B
    groupA = dataframe[dataframe[group] == 4][target]
    groupB = dataframe[dataframe[group] == 5][target]
    
    # Assumption: Normality (Shapiro-Wilk test)
    # H0: Distribution is Normal!     -> flag is False (p >= 0.05)
    # H1: Distribution is not Normal! -> flag is True  (p < 0.05)
    ntA = shapiro(groupA)[1] < 0.05
    ntB = shapiro(groupB)[1] < 0.05
    
    if (ntA == False) & (ntB == False): # "H0: Normal Distribution"
        # Parametric Test
        # Assumption: Homogeneity of variances
        leveneTest = stats.levene(groupA, groupB)[1] < 0.05
        # H0: Homogeneity: False
        # H1: Heterogeneous: True
        
        if leveneTest == False:
            # Homogeneity
            ttest = stats.ttest_ind(groupA, groupB, equal_var=True)[1]
            # H0: M1 == M2 - False
            # H1: M1 != M2 - True
        else:
            # Heterogeneous
            ttest = stats.ttest_ind(groupA, groupB, equal_var=False)[1]
            # H0: M1 == M2 - False
            # H1: M1 != M2 - True
    else:
        # Non-Parametric Test (Mann-Whitney U); `ttest` holds its p-value
        ttest = stats.mannwhitneyu(groupA, groupB)[1]
        # H0: both groups follow the same distribution - False
        # H1: the distributions differ - True
        
    # Result
    temp = pd.DataFrame({
        "Test Hypothesis":[ttest < 0.05], 
        "p-value":[ttest]
    })
    temp["Test Type"] = np.where((ntA == False) & (ntB == False), "Parametric", "Non-Parametric")
    temp["Test Hypothesis"] = np.where(temp["Test Hypothesis"] == False, "Fail to Reject H0", "Reject H0")
    temp["Comment"] = np.where(temp["Test Hypothesis"] == "Fail to Reject H0", "Test groups are similar!", "Test groups are not similar!")
    
    # Columns
    if (ntA == False) & (ntB == False):
        temp["Homogeneity"] = np.where(leveneTest == False, "Yes", "No")
        temp = temp[["Test Type", "Homogeneity","Test Hypothesis", "p-value", "Comment"]]
    else:
        temp = temp[["Test Type","Test Hypothesis", "p-value", "Comment"]]
    
    # Print Hypothesis
    print("# A/B Testing Hypothesis")
    print("H0: A == B")
    print("H1: A != B", "\n")
    
    return temp
    
    
    
# Apply A/B Testing
hyphothesis_test(dataframe=data, group = "review_score", target = "sentiment_score")

# A/B Testing Hypothesis
H0: A == B
H1: A != B 





| | Test Type | Test Hypothesis | p-value | Comment |
|---|---|---|---|---|
| 0 | Non-Parametric | Reject H0 | 0.0 | Test groups are not similar! |


The sentiment scores of the two groups (review scores 4 and 5) are significantly different (p < 0.05).
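
The non-parametric path taken here can be reproduced on synthetic data. The sketch below is only an illustration under stated assumptions: `group_a` and `group_b` are hypothetical skewed samples drawn with a fixed seed, not the review data. It runs the Shapiro-Wilk normality check first and then the Mann-Whitney U test.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Hypothetical, skewed (non-normal) samples; group_b is shifted upward by 1
group_a = rng.exponential(scale=1.0, size=500)
group_b = rng.exponential(scale=1.0, size=500) + 1.0

# Step 1: normality check; a small p-value means "not normal"
not_normal_a = stats.shapiro(group_a).pvalue < 0.05
not_normal_b = stats.shapiro(group_b).pvalue < 0.05
print(bool(not_normal_a), bool(not_normal_b))  # True True

# Step 2: at least one group is not normal, so use Mann-Whitney U
_, p_value = stats.mannwhitneyu(group_a, group_b, alternative="two-sided")
print(bool(p_value < 0.05))  # True
```

With samples as large as the ones in this notebook, even a small difference can yield a p-value that prints as 0.0, so the test tells us the groups differ but not by how much.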