### Part 3 - Testing whether crypto and normal app users prioritise different features

The test looks at the p-values of the coefficients in each regression model, and in particular at the p-values of the interaction terms with `is_crypto`.
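
As a sketch of the specification that the cells below appear to fit (the symbols are notation introduced here, not names from the notebook), with six topic polarities and the `is_crypto` dummy:

$$
\text{rating}_i = \beta_0 + \sum_{k=1}^{6} \beta_k\,\text{pol}_{ik} + \gamma\,\text{crypto}_i + \sum_{k=1}^{6} \delta_k\,(\text{pol}_{ik}\cdot\text{crypto}_i) + \varepsilon_i
$$

Under this reading the slope of topic $k$ is $\beta_k$ for normal app users and $\beta_k + \delta_k$ for crypto app users, so the p-value on $\delta_k$ tests whether the two groups weight topic $k$ differently. The 6 + 1 + 6 = 13 regressors match the 13 columns kept further down.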

[Polynomial Transformation](https://stackoverflow.com/questions/45828964/how-to-add-interaction-term-in-python-sklearn)  
[Example Linear regression in SKlearn](https://scikit-learn.org/stable/auto_examples/linear_model/plot_ols.html)  
[Selecting columns in 2D arrays](https://stackoverflow.com/questions/41659535/valueerror-x-and-y-must-be-the-same-size)  

In [1]:
import pandas as pd
from sklearn.preprocessing import PolynomialFeatures
from sklearn import linear_model
import matplotlib.pyplot as plt

In [2]:
# Load in the CSVs
df_crypto = pd.read_csv("sent_crypto_bow.csv", index_col=0)
df_normal = pd.read_csv("sent_normal_bow.csv", index_col=0)

# Create table with the interaction terms
df_crypto["is_crypto"] = 1
df_normal["is_crypto"] = 0

In [3]:
df_crypto

Unnamed: 0,reviewId,topic,sent_pol,sent_sub,rating,review_len,sentence_len,is_crypto
0,40e56de6-c266-446a-89a6-5191a72324e8,N,0.016667,0.166667,4.0,154,150,1
1,7bbae22c-e255-478e-aa79-078104b23046,fee,0.000000,0.000000,1.0,485,244,1
2,7bbae22c-e255-478e-aa79-078104b23046,service,-0.025000,0.125000,1.0,485,102,1
3,7bbae22c-e255-478e-aa79-078104b23046,trust,0.700000,0.600000,1.0,485,135,1
4,c592afe5-b785-49f7-a760-c663f516d303,N,0.000000,0.000000,1.0,109,97,1
...,...,...,...,...,...,...,...,...
12461,2ba1ad26-be94-462a-a5df-8d62224abe73,N,-0.166667,0.166667,1.0,110,45,1
12462,2ba1ad26-be94-462a-a5df-8d62224abe73,transaction,0.000000,0.000000,1.0,110,34,1
12463,ec4bae64-7537-4654-a08b-28f5f3f2f925,N,0.034524,0.394048,5.0,478,239,1
12464,ec4bae64-7537-4654-a08b-28f5f3f2f925,transaction,0.000000,0.000000,5.0,478,64,1


In [4]:
df_normal

Unnamed: 0,reviewId,topic,sent_pol,sent_sub,rating,review_len,sentence_len,is_crypto
0,748c2355-d884-463b-8c9a-46d9e8cfa1ea,fee,0.369333,0.671111,1.0,496,495,0
1,40fe5012-ca9d-4aef-bfd5-d1a2b4de3be8,N,0.081724,0.350551,5.0,408,254,0
2,40fe5012-ca9d-4aef-bfd5-d1a2b4de3be8,trust,0.500000,0.350000,5.0,408,150,0
3,7444c5dc-8395-4890-9c47-5f690fb6f69b,N,0.000000,0.050000,1.0,441,90,0
4,7444c5dc-8395-4890-9c47-5f690fb6f69b,trust,0.000000,0.100000,1.0,441,215,0
...,...,...,...,...,...,...,...,...
11233,35bc4a50-a347-44d8-bdcb-0e259b024e5b,N,0.000000,0.000000,4.0,91,26,0
11234,35bc4a50-a347-44d8-bdcb-0e259b024e5b,transaction,0.700000,0.600000,4.0,91,64,0
11235,dd441a32-1fc6-47a9-b5de-adaa7901f1fa,fee,0.000000,0.000000,1.0,308,99,0
11236,dd441a32-1fc6-47a9-b5de-adaa7901f1fa,N,0.432552,0.408333,1.0,308,201,0


In [5]:
df_all = pd.concat([df_normal, df_crypto], ignore_index=True)
df_all



Unnamed: 0,reviewId,topic,sent_pol,sent_sub,rating,review_len,sentence_len,is_crypto
0,748c2355-d884-463b-8c9a-46d9e8cfa1ea,fee,0.369333,0.671111,1.0,496,495,0
1,40fe5012-ca9d-4aef-bfd5-d1a2b4de3be8,N,0.081724,0.350551,5.0,408,254,0
2,40fe5012-ca9d-4aef-bfd5-d1a2b4de3be8,trust,0.500000,0.350000,5.0,408,150,0
3,7444c5dc-8395-4890-9c47-5f690fb6f69b,N,0.000000,0.050000,1.0,441,90,0
4,7444c5dc-8395-4890-9c47-5f690fb6f69b,trust,0.000000,0.100000,1.0,441,215,0
...,...,...,...,...,...,...,...,...
23699,2ba1ad26-be94-462a-a5df-8d62224abe73,N,-0.166667,0.166667,1.0,110,45,1
23700,2ba1ad26-be94-462a-a5df-8d62224abe73,transaction,0.000000,0.000000,1.0,110,34,1
23701,ec4bae64-7537-4654-a08b-28f5f3f2f925,N,0.034524,0.394048,5.0,478,239,1
23702,ec4bae64-7537-4654-a08b-28f5f3f2f925,transaction,0.000000,0.000000,5.0,478,64,1


In [6]:
df_all["topic"].value_counts()

N               13689
fee              3079
app              3064
service          1796
transaction      1076
trust             584
verification      416
Name: topic, dtype: int64

In [7]:
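# Keys of new_frame name the per-topic polarity columns built below
new_frame = {"reviewId":[], "service_pol":[], "app_pol":[], "transaction_pol":[], "fee_pol":[], "trust_pol":[], "verification_pol": []}

default_sent = 0
# Mean polarity per topic (not used further in this section; a candidate imputation value instead of 0)
avg = dict.fromkeys(new_frame, 0)
avg.pop("reviewId")
for key in avg.keys():
    d_frame = df_all.loc[df_all["topic"] == key[:-4]]  # key[:-4] strips the "_pol" suffix
    avg[key] = d_frame["sent_pol"].mean()

def row_apply(row):
    # Copy the sentence polarity into its own topic column; every other topic column gets 0
    for key in avg.keys():
        row[key] = row["sent_pol"] if row["topic"] == key[:-4] else 0
    return row

new_frame = df_all.apply(row_apply, axis=1)
# Keep only sentences that were assigned to a topic
new_frame = new_frame.loc[new_frame["topic"] != "N"]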
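
The row-wise `apply` above can be slow on ~23k rows. A minimal vectorised sketch of the same idea follows; the toy frame and the names `toy`, `topics` and `classified` are made up for illustration and are not part of the notebook.

```python
import pandas as pd

# Toy rows standing in for df_all (values are invented for illustration).
toy = pd.DataFrame({
    "topic": ["fee", "N", "trust", "app"],
    "sent_pol": [0.4, 0.1, -0.2, 0.3],
    "rating": [1.0, 5.0, 2.0, 4.0],
})

topics = ["service", "app", "transaction", "fee", "trust", "verification"]

# Drop unclassified sentences first, then fill one polarity column per topic:
# the sentence's polarity where the topic matches, 0 everywhere else.
classified = toy.loc[toy["topic"] != "N"].copy()
for topic in topics:
    classified[f"{topic}_pol"] = classified["sent_pol"].where(
        classified["topic"] == topic, 0.0
    )

print(classified[["topic", "fee_pol", "trust_pol", "app_pol"]])
```

`Series.where` keeps the value where the condition holds and substitutes the second argument elsewhere, which is the same rule `row_apply` applies one row at a time.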

In [8]:
#! X_i = feature vector of sample i from the full feature set X, e.g. X_i = [pol_i, is_crypto_i]
#! When you add more 0s for each var, explainability for e.g. D_pol goes down
X = new_frame.loc[:, ["transaction_pol", "app_pol", "service_pol", "fee_pol", "trust_pol", "verification_pol", "is_crypto"]]
Y = new_frame.loc[:, ["rating"]]

In [9]:
X

Unnamed: 0,transaction_pol,app_pol,service_pol,fee_pol,trust_pol,verification_pol,is_crypto
0,0.0,0.000000,0.000000,0.369333,0.0,0.0,0
2,0.0,0.000000,0.000000,0.000000,0.5,0.0,0
4,0.0,0.000000,0.000000,0.000000,0.0,0.0,0
5,0.0,0.000000,-0.403333,0.000000,0.0,0.0,0
8,0.0,0.000000,0.000000,0.000000,0.0,0.0,0
...,...,...,...,...,...,...,...
23697,0.0,0.000000,0.000000,0.273333,0.0,0.0,1
23698,0.0,0.352083,0.000000,0.000000,0.0,0.0,1
23700,0.0,0.000000,0.000000,0.000000,0.0,0.0,1
23702,0.0,0.000000,0.000000,0.000000,0.0,0.0,1


In [10]:
Y

Unnamed: 0,rating
0,1.0
2,5.0
4,1.0
5,1.0
8,1.0
...,...
23697,5.0
23698,5.0
23700,1.0
23702,5.0


In [11]:
# Create degree-2 interaction variables between the 7 vars: 7C2 + 7 = 21 + 7 = 28 columns
poly = PolynomialFeatures(degree=2, interaction_only=True, include_bias=False)
X_tr = poly.fit_transform(X)
X_tr[0] ## first row; column order follows poly.get_feature_names(X.columns)

array([0.        , 0.        , 0.        , 0.36933333, 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        ])
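
The array above has 28 entries, and the only non-zero one (index 3) is `fee_pol`, as expected for the first row. As a quick check on the column count, here is a small sketch; the helper name `n_interaction_only_features` is hypothetical.

```python
from math import comb

def n_interaction_only_features(n_inputs: int) -> int:
    """Column count for degree=2, interaction_only=True, include_bias=False:
    the n original columns plus one product per unordered pair."""
    return n_inputs + comb(n_inputs, 2)

print(n_interaction_only_features(7))  # 7 + 21 = 28
```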

In [12]:
poly.get_feature_names(X.columns)



['transaction_pol',
 'app_pol',
 'service_pol',
 'fee_pol',
 'trust_pol',
 'verification_pol',
 'is_crypto',
 'transaction_pol app_pol',
 'transaction_pol service_pol',
 'transaction_pol fee_pol',
 'transaction_pol trust_pol',
 'transaction_pol verification_pol',
 'transaction_pol is_crypto',
 'app_pol service_pol',
 'app_pol fee_pol',
 'app_pol trust_pol',
 'app_pol verification_pol',
 'app_pol is_crypto',
 'service_pol fee_pol',
 'service_pol trust_pol',
 'service_pol verification_pol',
 'service_pol is_crypto',
 'fee_pol trust_pol',
 'fee_pol verification_pol',
 'fee_pol is_crypto',
 'trust_pol verification_pol',
 'trust_pol is_crypto',
 'verification_pol is_crypto']

In [13]:
feature_names = poly.get_feature_names(X.columns)

def get_is_crypto_interactions(feature_names):
    """Return indices of main effects and of interactions that involve is_crypto."""
    indices = []
    for i, name in enumerate(feature_names):
        word_list = name.split(" ")
        is_main_effect = len(word_list) == 1
        is_crypto_interaction = len(word_list) > 1 and "is_crypto" in word_list
        if is_main_effect or is_crypto_interaction:
            indices.append(i)
    return indices

label_indices = get_is_crypto_interactions(feature_names)
X_tr = X_tr[:, label_indices]

In [14]:
print(len(X_tr[0]))

13


In [15]:
# Fit scikit-learn's LinearRegression on the interaction features (it reports no p-values; those come from the statsmodels fit below)
model = linear_model.LinearRegression()
model.fit(X_tr, Y)
Y_pred = model.predict(X_tr)

In [16]:
X_tr

array([[0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       ...,
       [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.        , 0.28854167, 0.        , ..., 0.        , 0.        ,
        0.        ]])

In [17]:

import statsmodels.api as sm

X2 = sm.add_constant(X_tr)
est = sm.OLS(Y, X2)
est2 = est.fit()
# Coefficient labels: the constant first, then the kept feature names in column order
xlabel = ["const"] + [feature_names[index] for index in label_indices]

# print(len(xlabel))
est2.summary(xname=xlabel)

#! Interaction model should include everything that is relevant to the dependent variable
#! i.e. include all 6 things that you use to drive the rating, and see how this differs across two apps
#! for all other online reviews that don't have a sentiment score for a given topic, put 0 to assume neutral sentiment on that particular topic.
#! Make sure to include interaction variables between all 6 topics, i.e. 6C2 interaction variables.
# TODO: Include OLS regression results for the combined regression model.
# TODO: have list of words associated with each topic and use NB to associate 
# TODO: How to convert LDA into a text classifier? https://medium.com/analytics-vidhya/text-classification-using-lda-35d5b98d4f05
#! I think the r^2 is lower than before because this time we are considering 7.7k sentences in the sample. whereas in prev we were just considering 64, which we all had data for.
# -> Imputation: use the mean? or 0? 
# TODO: try removing the subjectivity restriction + improving the text classifier so you don't get as much variation which is uncontrolled for 

# TODO: 15/7

# TODO: Look at other papers that predict mobile ratings and see what kind of control variable in those papers. e.g time, app downloads
# TODO: Consider other more rigorous methods of text classification e.g. you use the word list (freq(word) in word list) and use each as a feature, and still label (do maybe 500 reviews 5 in 1 minute = 100 minutes)
    # -> Way to inject human intelligence into model i.e. "supervise" the model by guiding it towards the more important features i.e. words in word list.
    # TODO: try using k-means to cluster topics together and see if you can get more discernable topics -> nah lda is still better
    # TODO: then use k-nn to classify topics? -> nah, does about the same as naive bayes, but could try using lda + manual clustering to label
# TODO: How to explain that LDA has 6 topics, e.g. by trial and error we found that 6 gives the most interpretable results, OR a quantitative approach, e.g. coherence in LDA (cluster quality)
    # TODO: OR generate 50 topics, you can then manually group topics together. avoids multiple topics in same cluster problem
    # TODO: NEED TO SHOW WHY 6 topics using LDA. e.g. 6 from grouping 50 subtopics, or coherence metric
# TODO: Add more words to trust, and use larger lda topic selection.

# 18/7

# Maybe try using random forests for classifying the texts? Leverage the benefit of uncorrelated but good decision trees. -> no benefit, it just classifies everything into the same class



#! Based on results below it seems that the difference between the two. 
#! I might need to increase the sample size.
#! To address underfitting - look at features/model (from book)
#! Or maybe I'm underfitting, e.g. could use the sigmoid function to fit the scatter plot better.
    #! Rationale: a shift between slightly negative and slightly positive has a significant impact on rating.
#! Or is it because I should use a two-tailed t-test instead? With just one tail the p-value is 0.052 -> no, use two tails for the beta of a regression, i.e. hypothesis = any relation at all


## https://www.statisticssolutions.com/should-you-use-a-one-tailed-test-or-a-two-tailed-test-for-your-data-analysis/



0,1,2,3
Dep. Variable:,rating,R-squared:,0.273
Model:,OLS,Adj. R-squared:,0.272
Method:,Least Squares,F-statistic:,288.6
Date:,"Tue, 19 Jul 2022",Prob (F-statistic):,0.0
Time:,13:30:26,Log-Likelihood:,-18026.0
No. Observations:,10015,AIC:,36080.0
Df Residuals:,10001,BIC:,36180.0
Df Model:,13,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
const,2.3973,0.024,100.621,0.000,2.351,2.444
transaction_pol,1.9375,0.254,7.626,0.000,1.440,2.436
app_pol,3.8776,0.101,38.429,0.000,3.680,4.075
service_pol,1.9690,0.134,14.746,0.000,1.707,2.231
fee_pol,2.1252,0.131,16.282,0.000,1.869,2.381
trust_pol,1.9889,0.235,8.463,0.000,1.528,2.450
verification_pol,0.3274,0.678,0.483,0.629,-1.002,1.657
is_crypto,-0.3454,0.032,-10.778,0.000,-0.408,-0.283
transaction_pol is_crypto,0.2515,0.329,0.765,0.444,-0.393,0.896

0,1,2,3
Omnibus:,759.054,Durbin-Watson:,1.576
Prob(Omnibus):,0.0,Jarque-Bera (JB):,493.526
Skew:,0.424,Prob(JB):,6.79e-108
Kurtosis:,2.318,Cond. No.,77.2


[Interpreting the OLS Regression Results Sheet](https://www.youtube.com/watch?v=U7D1h5bbpcs)
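
To make the reading of an interaction coefficient concrete, here is a small sketch on synthetic data using the statsmodels formula API rather than `PolynomialFeatures`. All numbers and the names `demo`, `fit`, `slope_normal` and `slope_crypto` are invented for illustration; only `app_pol`, `is_crypto` and `rating` echo the notebook's column names.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n = 2000
demo = pd.DataFrame({
    "app_pol": rng.uniform(-1, 1, n),
    "is_crypto": rng.integers(0, 2, n),
})
# Synthetic ratings with a known interaction of 0.5 (made-up coefficients).
demo["rating"] = (2.5 + 1.5 * demo["app_pol"] - 0.3 * demo["is_crypto"]
                  + 0.5 * demo["app_pol"] * demo["is_crypto"]
                  + rng.normal(0, 0.5, n))

# "a * b" expands to a + b + a:b, so the interaction term is named "app_pol:is_crypto".
fit = smf.ols("rating ~ app_pol * is_crypto", data=demo).fit()

slope_normal = fit.params["app_pol"]                                    # slope when is_crypto == 0
slope_crypto = fit.params["app_pol"] + fit.params["app_pol:is_crypto"]  # slope when is_crypto == 1
print(fit.params.round(2))
print("p-value of interaction:", fit.pvalues["app_pol:is_crypto"])
```

The p-value reported for the interaction term is two-sided by default, which matches the "any relation at all" hypothesis noted above.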

### Ordinary t-test on average sentiment

In [18]:
# T-test Sentiment comparison
from statsmodels.stats.weightstats import ttest_ind

df_crypto_all_classified = df_crypto.loc[df_crypto["topic"] != "N"]
df_normal_all_classified = df_normal.loc[df_normal["topic"] != "N"]

results_df = {"topic": [], "tstat": [], "pvalue": [], "df": [], "is_significant": []}

for topic in df_crypto_all_classified["topic"].unique():
    (tstat, pvalue, df) = ttest_ind(df_crypto_all_classified.loc[df_crypto_all_classified["topic"] == topic]["sent_pol"], 
                                    df_normal_all_classified.loc[df_normal_all_classified["topic"] == topic]["sent_pol"])
    results_df["topic"].append(topic)
    results_df["tstat"].append(tstat)
    results_df["pvalue"].append(pvalue)
    results_df["df"].append(df)
    is_sig = pvalue < 0.05
    results_df["is_significant"].append(is_sig)

results_df = pd.DataFrame(results_df)
results_df

Unnamed: 0,topic,tstat,pvalue,df,is_significant
0,fee,-4.523476,6.313283e-06,3077.0,True
1,service,0.259339,0.7954032,1794.0,False
2,trust,-1.213379,0.2254773,582.0,False
3,transaction,-1.744908,0.08128688,1074.0,False
4,app,-10.888372,4.08548e-27,3062.0,True
5,verification,-0.408251,0.6833005,414.0,False
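
Six t-tests are run side by side here, so one optional extra check is a multiple-comparison adjustment. The sketch below applies a plain Bonferroni correction to the p-values from the table above; the correction is an addition for illustration and is not something the notebook itself applies.

```python
# Two-sided p-values copied from the table above.
pvalues = {
    "fee": 6.313283e-06,
    "service": 0.7954032,
    "trust": 0.2254773,
    "transaction": 0.08128688,
    "app": 4.08548e-27,
    "verification": 0.6833005,
}

m = len(pvalues)  # number of tests
# Bonferroni: multiply each p-value by the number of tests, capped at 1.
adjusted = {topic: min(p * m, 1.0) for topic, p in pvalues.items()}
significant = [topic for topic, p in adjusted.items() if p < 0.05]
print(significant)  # ['fee', 'app']
```

The conclusion does not change: `fee` and `app` stay significant at the 0.05 level and the other four topics do not.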


In [19]:
df_normal_all_classified

Unnamed: 0,reviewId,topic,sent_pol,sent_sub,rating,review_len,sentence_len,is_crypto
0,748c2355-d884-463b-8c9a-46d9e8cfa1ea,fee,0.369333,0.671111,1.0,496,495,0
2,40fe5012-ca9d-4aef-bfd5-d1a2b4de3be8,trust,0.500000,0.350000,5.0,408,150,0
4,7444c5dc-8395-4890-9c47-5f690fb6f69b,trust,0.000000,0.100000,1.0,441,215,0
5,7444c5dc-8395-4890-9c47-5f690fb6f69b,service,-0.403333,0.822222,1.0,441,133,0
8,3ced6d9c-be58-4213-bdf3-ec12a750c288,app,0.000000,0.000000,1.0,452,81,0
...,...,...,...,...,...,...,...,...
11226,2ba187a8-f33f-42a6-9b72-949d406bed59,app,0.368750,0.625000,4.0,136,95,0
11228,3f2b7de9-8745-48de-a047-08555ae5d99a,service,0.187500,0.250000,1.0,189,187,0
11231,b28ce657-a318-454e-abd7-06d6db719655,app,0.511111,0.644444,5.0,69,69,0
11234,35bc4a50-a347-44d8-bdcb-0e259b024e5b,transaction,0.700000,0.600000,4.0,91,64,0


In [20]:
# T-test Ratings comparison
results_df = {"tstat": [], "pvalue": [], "df": [], "is_significant": []}

(tstat, pvalue, df) = ttest_ind(df_crypto_all_classified["rating"], 
                                    df_normal_all_classified["rating"])

results_df["tstat"].append(tstat)
results_df["pvalue"].append(pvalue)
results_df["df"].append(df) 
is_sig = pvalue < 0.05
results_df["is_significant"].append(is_sig)

results_df = pd.DataFrame(results_df)
results_df

Unnamed: 0,tstat,pvalue,df,is_significant
0,-17.762106,1.6093040000000002e-69,10013.0,True


### Share of the review spent on each topic vs. rating (Y)

In [21]:
df_all["%_of_review"] = df_all["sentence_len"]/df_all["review_len"]
df_classified_only = df_all[df_all["topic"] != "N"]
df_classified_only = df_classified_only.copy()
df_classified_only.head(100)

Unnamed: 0,reviewId,topic,sent_pol,sent_sub,rating,review_len,sentence_len,is_crypto,%_of_review
0,748c2355-d884-463b-8c9a-46d9e8cfa1ea,fee,0.369333,0.671111,1.0,496,495,0,0.997984
2,40fe5012-ca9d-4aef-bfd5-d1a2b4de3be8,trust,0.500000,0.350000,5.0,408,150,0,0.367647
4,7444c5dc-8395-4890-9c47-5f690fb6f69b,trust,0.000000,0.100000,1.0,441,215,0,0.487528
5,7444c5dc-8395-4890-9c47-5f690fb6f69b,service,-0.403333,0.822222,1.0,441,133,0,0.301587
8,3ced6d9c-be58-4213-bdf3-ec12a750c288,app,0.000000,0.000000,1.0,452,81,0,0.179204
...,...,...,...,...,...,...,...,...,...
185,a47c56b7-9c30-40fd-8c0f-d5bc8f1a367c,app,0.563333,1.000000,5.0,70,29,0,0.414286
188,ea7dee83-f625-4f66-9bc4-58696ee6e2b4,app,0.250000,0.333333,5.0,97,26,0,0.268041
190,8b161207-04f7-4011-a9b0-b14b3f885044,app,0.587500,0.750000,4.0,120,68,0,0.566667
195,de92bac5-754c-46bb-9352-41fd6502a7df,app,0.133333,0.450000,5.0,84,47,0,0.559524


In [22]:
# Create 1,0 flags for whether topic belongs to a certain class
unique_topics = df_classified_only.topic.unique()
def add_topic_flags(row):
    for topic in unique_topics:
        topic_flag_col = "is_" + topic
        row[topic_flag_col] = 1 if row["topic"] == topic else 0
    return row

df_classified_only = df_classified_only.apply(add_topic_flags, axis="columns")
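
The same 1/0 flags can also be produced without a row-wise `apply`. A minimal sketch with `pd.get_dummies` on a made-up frame (`toy`, `flags` and `toy_flagged` are illustrative names):

```python
import pandas as pd

toy = pd.DataFrame({"topic": ["fee", "app", "fee", "trust"],
                    "rating": [1.0, 5.0, 2.0, 4.0]})

# prefix="is" with the default "_" separator yields is_app, is_fee, is_trust;
# dtype=int gives 1/0 instead of booleans.
flags = pd.get_dummies(toy["topic"], prefix="is", dtype=int)
toy_flagged = pd.concat([toy, flags], axis=1)
print(toy_flagged)
```

One difference to keep in mind: `get_dummies` orders the flag columns alphabetically, whereas the loop above follows the order of `unique_topics`.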

In [24]:
X = df_classified_only.loc[:, ["is_" + topic for topic in unique_topics] + ["%_of_review", "is_crypto", "review_len"]]
Y = df_classified_only["rating"]

In [29]:
# Create degree-2 interaction variables between the 9 vars: 9C2 + 9 = 36 + 9 = 45 columns
poly = PolynomialFeatures(degree=2, interaction_only=True, include_bias=False)
X_tr = poly.fit_transform(X)
poly.get_feature_names(X.columns)



['is_fee',
 'is_trust',
 'is_service',
 'is_app',
 'is_verification',
 'is_transaction',
 '%_of_review',
 'is_crypto',
 'review_len',
 'is_fee is_trust',
 'is_fee is_service',
 'is_fee is_app',
 'is_fee is_verification',
 'is_fee is_transaction',
 'is_fee %_of_review',
 'is_fee is_crypto',
 'is_fee review_len',
 'is_trust is_service',
 'is_trust is_app',
 'is_trust is_verification',
 'is_trust is_transaction',
 'is_trust %_of_review',
 'is_trust is_crypto',
 'is_trust review_len',
 'is_service is_app',
 'is_service is_verification',
 'is_service is_transaction',
 'is_service %_of_review',
 'is_service is_crypto',
 'is_service review_len',
 'is_app is_verification',
 'is_app is_transaction',
 'is_app %_of_review',
 'is_app is_crypto',
 'is_app review_len',
 'is_verification is_transaction',
 'is_verification %_of_review',
 'is_verification is_crypto',
 'is_verification review_len',
 'is_transaction %_of_review',
 'is_transaction is_crypto',
 'is_transaction review_len',
 '%_of_review is

In [30]:
# Get only desired interaction features
feature_names = poly.get_feature_names(X.columns)
control_vars = ["is_crypto", "review_len", "%_of_review"]

def get_is_crypto_interactions(feature_names):
    """Return indices of main effects and of interactions that involve a control variable."""
    indices = []
    for i, name in enumerate(feature_names):
        word_list = name.split(" ")
        is_main_effect = len(word_list) == 1
        has_control = len(word_list) > 1 and any(item in control_vars for item in word_list)
        if is_main_effect or has_control:
            indices.append(i)
    return indices

label_indices = get_is_crypto_interactions(feature_names)
X_tr = X_tr[:, label_indices]

# TODO: Redo this change X = ["fee_coverage", "service_coverage" e.t.c.]

In [31]:
X2 = sm.add_constant(X_tr)
model = sm.OLS(Y, X2)
est = model.fit()
xlabel = ["const"] 
for index in label_indices:
    xlabel.append(poly.get_feature_names(X.columns)[index])

# print(len(xlabel))
est.summary(xname=xlabel)

0,1,2,3
Dep. Variable:,rating,R-squared:,0.218
Model:,OLS,Adj. R-squared:,0.216
Method:,Least Squares,F-statistic:,107.4
Date:,"Tue, 19 Jul 2022",Prob (F-statistic):,0.0
Time:,13:32:04,Log-Likelihood:,-18387.0
No. Observations:,10015,AIC:,36830.0
Df Residuals:,9988,BIC:,37020.0
Df Model:,26,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
const,2.2058,0.093,23.765,0.000,2.024,2.388
is_fee,0.3454,0.112,3.075,0.002,0.125,0.566
is_trust,-0.3271,0.212,-1.539,0.124,-0.744,0.089
is_service,0.3603,0.136,2.651,0.008,0.094,0.627
is_app,2.0092,0.110,18.307,0.000,1.794,2.224
is_verification,-0.7906,0.294,-2.690,0.007,-1.367,-0.214
is_transaction,0.6085,0.171,3.566,0.000,0.274,0.943
%_of_review,0.4410,0.111,3.964,0.000,0.223,0.659
is_crypto,-0.4489,0.098,-4.566,0.000,-0.642,-0.256

0,1,2,3
Omnibus:,1042.501,Durbin-Watson:,1.497
Prob(Omnibus):,0.0,Jarque-Bera (JB):,634.047
Skew:,0.487,Prob(JB):,2.08e-138
Kurtosis:,2.245,Cond. No.,1.17e+16
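
The condition number of 1.17e+16 and the Df Model of 26 in the summary above usually signal exact collinearity. One likely cause is that every classified sentence has exactly one topic flag set, so the six `is_<topic>` columns sum to the constant (and their products with `%_of_review`, `is_crypto` and `review_len` sum to those columns in the same way). The sketch below shows the effect on a toy design matrix; the arrays are invented for illustration.

```python
import numpy as np

# Four toy sentences, each assigned to exactly one of three topics.
flags = np.array([[1, 0, 0],
                  [0, 1, 0],
                  [0, 0, 1],
                  [1, 0, 0]], dtype=float)
const = np.ones((flags.shape[0], 1))

with_all_flags = np.hstack([const, flags])        # constant + every topic flag
with_baseline = np.hstack([const, flags[:, 1:]])  # first topic dropped as the baseline

# Rank below the column count means the columns are linearly dependent.
print(np.linalg.matrix_rank(with_all_flags), with_all_flags.shape[1])
print(np.linalg.matrix_rank(with_baseline), with_baseline.shape[1])
```

Dropping one topic flag as a reference category is the usual remedy; the remaining coefficients are then read relative to that topic.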


In [None]:
# Regression with single 

# TODOS to potentially improve the p-value
# -> Investigate text classification algorithm to determine if p-value is problematic
# -> Increase sample size to get more texts that are classified into Deposits/withdrawals

#1
# TODO: look at average sentiment for fee and compare this to average sentiment for fee in non-crypto apps, and see if there is a significant difference. -> link
# TODO: Ordinary t-test on sentiment and rating (see in general are crypto users more positive about trust)
    # Completed -> but crypto users usually have lower ratings -> will this squash the beta and thus is why interaction coef is negative?
# TODO: Maybe consider the amount of discussion for each topic and weigh the sentiment of those sentences more. -> calculate % of words on this topic
# TODO: need to have something that crypto users care more about and care less about, for an interesting discussion or conference paper.
#! I re-split all reviews into individual sentences but the correlation became even weaker, so I think it might not be so good to use that. I think the text classifier and sentiment analyser struggle with shorter sentences.


#2
# Block of text - 3 sentences e.g. sent 1 (app) (words/total words = 20%), sent 2 (trust) (40%), sent 3 (N) (40%)
# Do this for every review
# Run regression, but instead of using the polarity as X, use the coverage (% of words) as X; Y is still rating.
# use the total words of the review as a control variable.
# Indicates: How much people talk about a particular topic and the relationship with Y

#3
# do same interaction test 
# Look at magazine articles about Coinbase and other similar crypto apps, not necessarily journal articles because they don't exist. -> look at characteristics.
    # E.g. articles may say that crypto app users care more about the service than normal finance app users, but you refute it using a negative coef in the interaction model for service_pol.

#4
# maybe review lda for 1,2,3 star apps and 4,5 star apps.
# 

# contribution in conference paper - theory e.g. different topics for crypto vs non-crypto, practical e.g. 
# Try a different angle; usually it will result in different findings.
# Maybe add more app reviews
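
The coverage idea in note #2 above (share of each review's words spent on a topic, with review length as a control) can be sketched as follows. The toy rows mirror the 20% / 40% / 40% example from the note; `toy`, `coverage`, `review_level` and `design` are illustrative names, and the `<topic>_coverage` column naming is an assumption.

```python
import pandas as pd

toy = pd.DataFrame({
    "reviewId": ["r1", "r1", "r1", "r2"],
    "topic": ["app", "trust", "N", "fee"],
    "sentence_len": [20, 40, 40, 50],
    "review_len": [100, 100, 100, 50],
    "rating": [4.0, 4.0, 4.0, 1.0],
})

toy["coverage"] = toy["sentence_len"] / toy["review_len"]

# One row per review, one coverage column per topic; topics not mentioned get 0.
coverage = (toy[toy["topic"] != "N"]
            .pivot_table(index="reviewId", columns="topic",
                         values="coverage", aggfunc="sum", fill_value=0)
            .add_suffix("_coverage"))

# Attach the review-level rating and length (the length acts as the control variable).
review_level = toy.groupby("reviewId")[["rating", "review_len"]].first()
design = coverage.join(review_level)
print(design)
```

The resulting frame has one row per review, so it could be passed to the same OLS setup with the coverage columns as X and `rating` as Y.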