Incorrect polarity calculation #21

Closed
swsankar opened this issue Jul 27, 2016 · 8 comments

@swsankar

I'm finding this strange. The sentence "Crashing tv isn't showing" yields a sentiment score of 0.5.

Sentiment for "Crashing TV" yields -0.70.
Sentiment for "isn't showing" yields 0.
Sentiment for "isn't" yields 0 - this is surprising because I have "isn't" as a negator in my valence table.

There were only a couple of additions to the valence table and the polarity table, and none of them should have any impact in the context of this sentence.

Any idea what is wrong?

sentiment_by("Crashing tv isn't showing", by = NULL, polarity_dt = pk_table,

  •          valence_shifters_dt = vs_table)
    
    element_id word_count sd ave_sentiment
    1: 1 4 NA 0.5

    sentiment_by("Crashing tv", by = NULL, polarity_dt = pk_table,

  •          valence_shifters_dt = vs_table)
    
    element_id word_count sd ave_sentiment
    1: 1 2 NA -0.7071068

    sentiment_by("isn't showing", by = NULL, polarity_dt = pk_table,

  •          valence_shifters_dt = vs_table)
    
    element_id word_count sd ave_sentiment
    1: 1 2 NA 0
    sentiment_by("isn't", by = NULL, polarity_dt = pk_table,
  •          valence_shifters_dt = vs_table)
    
    element_id word_count sd ave_sentiment
    1: 1 1 NA 0
@trinker
Owner

trinker commented Jul 27, 2016

Thanks for trying sentimentr.

It's hard to discuss this without a reproducible example. I believe I know where you are getting tripped up, but I will wait until you post a reproducible example so I can see your process. Please use markdown formatting to display inline and block code so that it's easy to read and grab.

@swsankar
Author

swsankar commented Jul 28, 2016

I'm not sure I understand the ask.
I am trying to evaluate/debug the anomalies I am getting in the sentiment scores and eventually improve my dictionary. One such example is the sentence "Crashing tv isn't showing".

All I am doing is running the sentiment_by() function for the above sentence in RStudio, as it appears above:

sentiment_by("Crashing tv isn't showing", by = NULL, polarity_dt = pk_table, valence_shifters_dt = vs_table)

Below is what I use to update the polarity table and Valence table.

vs_table <- sentimentr::valence_shifters_table
vs_table <- update_key(vs_table, drop = NULL,
    x = data.frame(x = c("especially", "most", "more", "bigger"), y = c(2, 4),
        stringsAsFactors = FALSE),
    comparison = sentimentr::polarity_table, sentiment = FALSE)

pk_table <- sentimentr::polarity_table
pk_table <- update_key(pk_table,
    x = data.frame(x = c("used to", "outdated", "restarts", "reboot", "i wish"),
        y = c(rep(-2, 5))))

@swsankar
Author

Here are more examples where I am getting a positive score instead of a negative one.

Horrible can't even watch the game and it's football season, this app needs a face lift.

Looks like half the channels from basic cable lineup are missing! (I tried adding "looks like" as a de-amplifier, but I get a duplication error even though neither the polarity nor the valence table contains it.)

It crashes every time I use it. They marketed it like it was as good or better then the Netflix app... Please. Don't even bother with this.

@trinker
Owner

trinker commented Jul 28, 2016

Let's start with this:

x = data.frame(x = c("especially", "most", "more", "bigger"), y = c(2,4)

These two vectors are not equal in length, so R invokes the recycling rule to make the data.frame. Is that really what you want? Also, what is the 4 for? Its use isn't documented, so I'm wondering what you are using it for.
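
For illustration, this is what base R's recycling rule produces with those lengths (plain R behavior, nothing specific to sentimentr):

data.frame(x = c("especially", "most", "more", "bigger"), y = c(2, 4))

gives:

           x y
1 especially 2
2       most 4
3       more 2
4     bigger 4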

Realize that negators ("isn't" in this case) before or after a polarized word can flip its polarity. The default is that a negator up to 2 words after a polarized word flips the sign. You can tone this down, but that may affect other statements the opposite way.

sentiment_by("Crashing tv isn't showing", by = NULL, polarity_dt = pk_table, valence_shifters_dt = vs_table, n.after = 1)

gives:

   element_id word_count sd ave_sentiment
1:          1          4 NA          -0.5

In your original post you wrote:

Sentiment for "isn't showing" yields 0
Sentiment for "isn't " yields 0 - This is surprising coz I have "isn't" as negator in my valence table

In "isn't showing", "showing" is not a polarized word, so it's not surprising that this is considered neutral. Your second statement has me believing you don't understand the difference between a negative word and a negator. Negative words make polarity negative. Negators flip the sign of the polarity. A negator has no polarity of its own; it can only affect polarized words.
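
A minimal sketch of the distinction, using the default tables rather than your custom pk_table/vs_table (the exact scores depend on the dictionary, but the signs illustrate the point):

sentiment("showing")     # no polarized word, so the score is 0
sentiment("bad")         # a negative word: the score is negative
sentiment("isn't bad")   # a negator flips the polarized word: the score turns positive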

@trinker
Copy link
Owner

trinker commented Jul 28, 2016

The sentences you are showing are not surprising to me. Here are a few things to note:

  1. There are no claims that sentimentr is 100% accurate. Even the best taggers, such as Stanford's, do not come close to 100% accuracy. See the comparison between a few taggers here: https://github.com/trinker/sentimentr#comparing-sentimentr-syuzhet-and-stanford
  2. The sentiment_by function averages the sentiments for each sentence using a simple mean. So if you have a combination of negative and positive sentences, the mean smooths that out and may not be what you want. Use sentiment and figure out how to handle the differences between sentences yourself (see the sketch after this list).
  3. The tagger requires properly formatted sentences, as the tagger is based on a model of how English works. The sentence "Horrible can't even watch the game and it's football season, this app needs a face lift." in particular breaks this model. This is actually 2 sentences, not one. There should be a period after the word "Horrible". Instead, the word "can't" negates "Horrible", which is not what you want.
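
A minimal sketch of point 2, using the default dictionaries rather than your custom tables:

sentiment("I love this app. It crashes every time I use it.")      # one row per sentence
sentiment_by("I love this app. It crashes every time I use it.")   # one averaged row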

Also realize that update_key protects you from adding words to a key when they are found in the other key. In this case it won't let you add "isn't" to the sentiment key because it's in the valence key. You'll need to update the valence key first, using the drop argument. This is why you're getting warnings. The keys are data.table objects, so you can see if your added words made it in by looking at the key.
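
A minimal sketch of that drop-first workflow (assuming, purely for illustration, that you wanted "isn't" as a polarized word; normally you'd leave it as a negator):

## 1. Drop "isn't" from the valence key first
vs2 <- update_key(sentimentr::valence_shifters_table, drop = "isn't",
    comparison = sentimentr::polarity_table, sentiment = FALSE)

## 2. Now it can be added to the polarity key without tripping the comparison check
pk2 <- update_key(sentimentr::polarity_table,
    x = data.frame(x = "isn't", y = -1, stringsAsFactors = FALSE),
    comparison = vs2)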

The act of making dictionaries is important, and the format in sentimentr was designed to be mutable, but it requires attention to detail. As you go through this process, if you have ideas to make the UX of dictionary updating smoother, please share.

@swsankar
Author

Thank you for the detailed insights.

  1. x = data.frame(x = c("especially", "most", "more", "bigger"), y = c(2,4))
    My bad - I intended to use c(rep(2, 4)). I have now corrected that.
  2. I now get why "isn't" and "isn't showing" yield 0, and it makes perfect sense. For my domain, I am going to try n.after = 1 to see how it does overall.
  3. I completely understand that 100% accuracy is impossible, and I have seen your comparison as well. It's just that sentences like "Crashing TV isn't showing" and "Don't even bother with this" kept making me wonder whether I was doing something wrong. Now I recall how the valence shifter context results in this.
  4. Lastly, on update_key: I understand the comparison check before adding a new word, but I am curious about my specific example. When I tried to add "looks like" to the valence table, it did not allow me. I verified that it does not exist in either the polarity or the valence table, yet I kept getting the error. One thing to note: the word "like" already exists in the polarity table. I was wondering whether the logic checks every individual word of an n-gram for duplication when an n-gram is added to a table?

And once again, thanks a lot for building an excellent sentiment analysis tool.

@trinker
Owner

trinker commented Nov 24, 2016

I will check into this.

@trinker trinker reopened this Nov 24, 2016
@trinker
Owner

trinker commented Nov 25, 2016

Can you show me the code you tried? It works for me. Here's my code and output:

update_key(
    valence_shifters_table, 
    x = data.frame(x = c("Looks like"), y = c(3)), 
    comparison = sentimentr::polarity_table
)

Output:

                x y
 1:         acute 2
 2:       acutely 2
 3:         ain't 1
 4:      although 4
 5:        aren't 1
 6:        barely 3
 7:           but 4
 8:         can't 1
 9:        cannot 1
10:       certain 2
11:     certainly 2
12:      colossal 2
13:    colossally 2
14:      couldn't 1
15:          deep 2
16:        deeply 2
17:      definite 2
18:    definitely 2
19:        didn't 1
20:       doesn't 1
21:         don't 1
22:      enormous 2
23:    enormously 2
24:       extreme 2
25:     extremely 2
26:       faintly 3
27:           few 3
28:       greatly 2
29:        hardly 3
30:        hasn't 1
31:       haven't 1
32:       heavily 2
33:         heavy 2
34:          high 2
35:        highly 2
36:       however 4
37:          huge 2
38:        hugely 2
39:       immense 2
40:     immensely 2
41:  incalculable 2
42:  incalculably 2
43:         isn't 1
44:         least 3
45:        little 3
46:    looks like 3
47:       massive 2
48:     massively 2
49:      mightn't 1
50:          more 2
51:          much 2
52:       mustn't 1
53:       neither 1
54:         never 1
55:            no 1
56:        nobody 1
57:          none 1
58:           nor 1
59:           not 1
60:          only 3
61:    particular 2
62:  particularly 2
63:       purpose 2
64:     purposely 2
65:         quite 2
66:        rarely 3
67:          real 2
68:        really 2
69:        seldom 3
70:       serious 2
71:     seriously 2
72:        severe 2
73:      severely 2
74:        shan't 1
75:     shouldn't 1
76:   significant 2
77: significantly 2
78:      slightly 3
79:      sparesly 3
80:  sporadically 3
81:          sure 2
82:        surely 2
83:       totally 2
84:          true 2
85:         truly 2
86:          vast 2
87:        vastly 2
88:          very 2
89:      very few 3
90:   very little 3
91:        wasn't 1
92:       weren't 1
93:         won't 1
94:      wouldn't 1
                x y
Warning message:
In update_key(valence_shifters_table, x = data.frame(x = c("Looks like"),  :
  One or more terms in the first column contain capital letters. Capitals are ignored.
  I found the following suspects:

   * Looks like

These terms have been lower cased.
