This repository has been archived by the owner on Sep 11, 2020. It is now read-only.

Unexpected (and impossible?) peak of classifications #141

Open
AnLand opened this issue Dec 22, 2016 · 10 comments

Comments

@AnLand

AnLand commented Dec 22, 2016

Yesterday the Chimp&See status page showed an unexpected, very high peak of classifications in the 2 p.m. time window (European time). One volunteer apparently made 1,337 classifications. Even relying mainly on the previews and hitting a lot of empty videos, it is in our opinion impossible to make that many classifications. So we wonder whether one can find out if these are legitimate classifications, or what else it could be. The science team has no explanation either.

This has happened several times before, but we never reported it. If it is a bot or a misclassification of any sort, it might influence our classification results. Of course, we hope that somebody is just extremely efficient, or that it is a group working together from the same IP address.

I attach the status screenshot with the peak at 2 p.m. Thank you!

[Attached: status screenshot showing the 2 p.m. peak]

@srallen
Contributor

srallen commented Dec 22, 2016

A school or library would appear to come from a single IP address, so it's possible a group was doing classifications that way. Further analysis of the classification data from this time window would be needed to determine whether the classifications are valid.

@AnLand
Author

AnLand commented Jul 7, 2017

Hi! Yesterday (July 6) and last week (June 30) we again had two very high classification events, with more than 4,000 classifications within one hour yesterday. Even for a school class or a course with 40 participants, that would be a very high throughput for Chimp&See. If I remember the numbers correctly, unexpectedly many videos were retired. Of course, all of that is possible ... Nevertheless, we would like to ask you again to look into this issue in case there is a problem. Thank you!

[Attached: status screenshot, 2017-07-06]

@srallen
Contributor

srallen commented Jul 12, 2017

@astopy Any chance you'll have time in the next couple of weeks to help take a look at this? We'll probably need to look at the classifications in this time range to check whether they're real dups, whether they look like they were made by a bot, or whether they look like they were caused by a front-end bug. I'd think that if it were a front-end bug, this would have been happening since the project launched, so I have my doubts about that.

@adammcmaster
Contributor

I'm thinking this is some kind of bot, deliberately (I assume) submitting bogus "nothing here" classifications. I've just been manually spot-checking some of the classifications from that time, and in just the first few I checked there are several videos which they classified as empty but which actually contain obvious animals that would be impossible to miss.

I'm not sure if we can really do anything to stop this, but that's not necessarily a problem since it's going to be fairly easy to detect it when you analyse the data. There are two scenarios I can think of:

  • If they're submitting a few thousand classifications for different subjects, so each subject only gets one classification from them, then we don't really need to worry about it. Genuine classifications from other users will still reach agreement (we expect that users will be wrong sometimes anyway after all).
  • If they're concentrating their classifications on a smaller number of subjects to get them retired, it shouldn't be hard to spot subjects that received a lot of classifications in a short period of time, and then just reactivate them with a higher retirement limit to get some more classifications later.
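The second scenario can be checked mechanically. Below is a minimal sketch, in plain Python with synthetic data, of flagging subjects that received an implausible number of classifications within a short sliding window. The function name, threshold, and input shape are assumptions for illustration; a real check would read subject IDs and `created_at` timestamps from the Panoptes classification export.

```python
from collections import defaultdict
from datetime import datetime, timedelta

def flag_burst_subjects(classifications, threshold=10, window=timedelta(minutes=30)):
    """Return subject IDs that received at least `threshold`
    classifications within any sliding time `window`."""
    by_subject = defaultdict(list)
    for subject_id, timestamp in classifications:
        by_subject[subject_id].append(timestamp)

    flagged = set()
    for subject_id, times in by_subject.items():
        times.sort()
        # Compare each timestamp with the one (threshold - 1)
        # positions later: if they fall inside the window, that's
        # `threshold` classifications in a burst.
        for i in range(len(times) - threshold + 1):
            if times[i + threshold - 1] - times[i] <= window:
                flagged.add(subject_id)
                break
    return flagged

# Synthetic example: subject "s1" gets 10 classifications in 5 minutes,
# "s2" gets 10 spread over 9 hours.
base = datetime(2017, 7, 6, 14, 0)
data = [("s1", base + timedelta(seconds=30 * i)) for i in range(10)]
data += [("s2", base + timedelta(hours=i)) for i in range(10)]
print(flag_burst_subjects(data))  # → {'s1'}
```

Flagged subjects could then be reactivated with a higher retirement limit, as suggested above.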

@AnLand
Author

AnLand commented Dec 6, 2019

Hi!

Chimp&See is experiencing another unexpected spike of classifications today. We are already at over 10,000 classifications today (roughly five times our usual daily classification count). There are also not massively many people visibly online. Are these real classifications, or just a counter problem? And can we find out whether they are valid, i.e., actual people classifying? Even for a group event, 10,000 classifications of videos is quite a lot.

I'm pinging @adammcmaster, as I am not sure whether this is still monitored after the switch to Panoptes. The problem is in the new interface: https://www.zooniverse.org/projects/sassydumbledore/chimp-and-see

Thanks for looking into it!

@trouille
Member

trouille commented Dec 12, 2019

@adammcmaster or @zwolf Please do look into the above. Additional info from Colleen, one of the lead researchers:

There is indeed weird stuff happening, but I'm not sure what's happening exactly.

The spike in classifications is due to a single "user name" at a single IP address (not-logged-in-940ce270d0a4e33f5535 at 940ce270d0a4e33f5535). There are 8,885 classifications from 8 a.m. to 5 p.m. UTC so far. These classifications are relentless, and the system shows them occurring throughout the day with no breaks in activity. The gaps between classifications are tiny: mostly under 5 seconds, sometimes even 0 seconds, up to a minute or so.

The IP address associated with the spike is not associated only with this user name; there are a dozen other properly registered users that have contributed from it today (AndrewDShone, ChrisHall1973, danby86, DanCarr, evan97, karla.halliday, mgarsidezoo, Moinina, ScottAC, scuzbag683). These people have only contributed today under those user names, while the not-logged-in-940ce270d0a4e33f5535 user name has contributed about 80 classifications between September 2 and now. We could try asking them what their deals are.

The weird thing is that for the classifications by not-logged-in-940ce270d0a4e33f5535 that I've looked into by hand so far, all of them seem legit when I look at the responses. This isn't like last time where it was clearly a bug or a bot or something that was spamming "nothing here" 100x in a minute. All parts of the response made sense, the species, the count, and the behavior.

One other thing is that in the metadata, the information about the type of browser being used (I think that's what it's trying to show) lists 3 different browsers, which obviously isn't possible. And there seems to be a session identifier, which I would imagine means the period of time one person on one computer is classifying before leaving the website, and there are different session IDs that are spaced out in a reasonable manner. This is something I just discovered ten seconds ago, so I have to look into it more and see if I can find a definition of what the session ID means.

I ran this by Nick, who knows more about this kind of stuff, and he thinks it's suggestive of the possible use of a single VPN by multiple people. So I think there's at least the possibility that there's, like, a teacher who spent a lazy Friday with their class doing some of this stuff for a few periods in a row, and the school has its own VPN so all the computers show up with the same IP address, and most of the students don't feel like signing up properly, so they all get labeled as the same user because of the IP? Maybe that's not how it works, I really don't know.

Maybe it's someone doing a Zamba-like thing that interacts with our platform, and it's some university computer science project. It wouldn't surprise me if there were some legit reason like this, with a bunch of responses flooding in, and Zooniverse's servers can't process them fast enough, so there is a backlog which gets processed every 5 seconds or so throughout the day to catch up. It also wouldn't surprise me if some wannabe hacker jag is trying to mess with us. But there are definitely some real-looking classifications in the bunch, which makes me hesitate to label it as a mistake or malicious right away. Plus there are even some responses for the MonkeySee and Trotters ID workflows, like someone was trying them out.

[Previously linked] are the classifications in question, in case you want to see anything specific. The last column is the time in seconds between that classification and the one before it. When I download the classifications again tomorrow, after the 24 hours are up, I can look again and see if the problem persists.

If there's anything that you'd like to see that I can summarize or calculate or whatnot, just let me know. Thanks!
Colleen
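The inter-arrival-time column Colleen describes (seconds between each classification and the previous one) is straightforward to compute from the export. A minimal sketch in plain Python with synthetic data; the function names and the "relentless" heuristic thresholds are assumptions for illustration, not the team's actual analysis:

```python
from datetime import datetime, timedelta

def inter_arrival_seconds(timestamps):
    """Given one user's classification timestamps, return the gap
    in seconds between each classification and the previous one."""
    ts = sorted(timestamps)
    return [(b - a).total_seconds() for a, b in zip(ts, ts[1:])]

def looks_relentless(gaps, max_gap=60, min_count=100):
    """Heuristic: a long run of classifications with no gap over
    `max_gap` seconds is suspicious for a single human volunteer."""
    return len(gaps) >= min_count and max(gaps) <= max_gap

# Synthetic example: 201 classifications, one every 4 seconds,
# mimicking the "under 5 seconds between most" pattern described.
base = datetime(2019, 12, 6, 8, 0)
stamps = [base + timedelta(seconds=4 * i) for i in range(201)]
gaps = inter_arrival_seconds(stamps)
print(looks_relentless(gaps))  # → True
```

Grouping the export by user name (or session ID) before running this would distinguish one relentless stream from many ordinary volunteers behind a shared IP.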

@trouille
Member

The key point from her comment above is that when she looks at the answers, they do seem legit. So perhaps it is just a school with a bunch of non-logged-in users, all sharing the same IP, who did 8K+ classifications in one day? It would just be good to get your expert opinion/input. Thanks!

It's also important to note that the classification rate has dropped back down to typical levels since that one date. https://www.zooniverse.org/projects/sassydumbledore/chimp-and-see/stats

@adammcmaster
Contributor

It certainly sounds like it was just some kind of group classification session with a lot of people sharing an IP. I'm not sure what else we can do really, but if there's a concern about these classifications they could be excluded from any analysis later. The affected subjects could also be added back to the workflow later with a higher retirement limit to get more classifications if needed.

@zwolf
Member

zwolf commented Dec 13, 2019

Could this be related to the code.org group classification event that was discussed earlier this month? It was supposed to be for Snapshot Serengeti, but maybe they switched?

@trouille
Member

The code.org activity prompted them towards snapshotSafari.org projects, but it's a good question to ask. It's very possible a teacher involved with that found Chimp&See on their own, liked it, and spread it through the school's activities.
