get_all_comments does not return max results #52

Open
Roechiiii opened this issue Mar 17, 2018 · 15 comments

@Roechiiii

Roechiiii commented Mar 17, 2018

Hi, first of all, thank you for your awesome R package for scraping YouTube comments. I am using your package to analyse some comments, but I ran into the problem that not all comments are collected. I think the issue has been mentioned in other issues (get_comment_threads) as well, but this report focuses on the function "get_all_comments". The original video has 3040 comments, but the function returns only 2335 records, so roughly 23% are lost. The bigger problem, in my opinion, is how replies are returned. Looking at the user in the "top comment" category, the original video shows 34 different replies, but the function returns only 5, so the conversation between different users is lost.

comments <- get_all_comments(video_id = "zz-RpiUFY-I")

@Roechiiii Roechiiii changed the title get_comment_all does not return max results get_all_comments does not return max results Mar 17, 2018
@soodoku soodoku self-assigned this Mar 19, 2018
@soodoku soodoku added the bug label Mar 19, 2018
@soodoku
Member

soodoku commented Mar 19, 2018

Thanks for diagnosing this. Need to investigate why I am not getting more than 5 replies. Super weird. But confirmed that it is true. Aargh!

When YouTube counts total comments, it also counts replies. So the discrepancy is very likely just driven by not fetching more than 5 replies per comment.

@Roechiiii
Author

You are welcome! Yes, absolutely, I was wondering why the number of replies is always between 0 and 5. The following example on https://stackoverflow.com/questions/29692972/youtube-comment-scraper-returns-limited-results/29871427#29871427 describes a similar problem. In this example, the previous request's nextPageToken is passed as the pageToken of the current request to continue the session. I am not sure if you have already implemented this in your package, but maybe it will help you.
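
For readers unfamiliar with the token mechanism mentioned above, here is a minimal sketch of a token-based pagination loop. It is not tuber's actual code: `fetch_all_pages`, `fetch_page`, and the fake data are hypothetical names made up for illustration, and the fetcher is an offline stand-in, so nothing here touches the real API.

```r
# Generic token-based pagination loop (illustrative, not tuber's implementation).
# `fetch_page` is a hypothetical function(page_token) returning a list with
# `items` (a list of results) and, if more pages exist, `nextPageToken`.
fetch_all_pages <- function(fetch_page, max_pages = 1000) {
  items <- list()
  token <- NULL                      # NULL requests the first page
  for (i in seq_len(max_pages)) {
    page  <- fetch_page(token)
    items <- c(items, page$items)
    token <- page$nextPageToken      # token for the following request
    if (is.null(token)) break        # no token means this was the last page
  }
  items
}

# Offline stand-in for the API: three pages of made-up comment ids.
fake_pages <- list(
  first = list(items = list("c1", "c2"), nextPageToken = "p2"),
  p2    = list(items = list("c3", "c4"), nextPageToken = "p3"),
  p3    = list(items = list("c5"))
)
fake_fetch <- function(page_token) {
  key <- if (is.null(page_token)) "first" else page_token
  fake_pages[[key]]
}

all_items <- fetch_all_pages(fake_fetch)
length(all_items)
```

The `max_pages` cap is only a safety net against a fetcher that never stops returning tokens.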

@soodoku
Member

soodoku commented Mar 19, 2018

The function iterates over pages of results. So that isn't a problem. There may be an issue with basically getting replies of replies. Will investigate this.

@rangaro

rangaro commented May 2, 2018

I stumbled upon the same issue. Interestingly, other tools like Webometric Analyst and YouTube Data Tools also do not return every comment, but the discrepancy is much smaller (e.g., 2,276 of 2,287 comments).

So I am curious if there will be any fix soon?

@rainbowfan

Hi, I would like to follow up on the issue above and see whether there's a solution. I encountered three different scenarios:

1. As discussed above, replies are not extracted completely by the get_all_comments function.
2. I found a YouTube video (id: 49Ilvc8WiG8) that shows 1 comment in total but has no visible comment. This is not a bug in the package, but if someone can give me a hint, that would be greatly appreciated.
3. Some hidden comments are being extracted. For example, for video bK6DVXty0gQ I extracted two more comments that were hidden on the video page. Does this mean the channel's author deleted those comments, or that someone reported them?

Thank you!

@chspoerlein

Hi soodoku, thanks for your amazing work and service to the community! I found the same issue as Roechiiii, where a maximum of 5 replies per comment gets extracted. I remember a wrapper function for extracting reddit comments where one had to explicitly code that the function presses "show more". Could this be an issue here?

@leftyveggie

Hi Soodoku - first of all - thank you to you and your co-contributors for making this excellent package. I think I have a possible workaround for the replies issue (up to 100 replies). If you do not unlist totalReplyCount (lines 63-66), then we can more easily identify those comments with replies (e.g. df[!(df$totalReplyCount=="0"),]) and use your get_comments(filter = c(parent_id = x)) to get the (in most cases) complete threads.

Would this possibly be something that you might be able to do? Or is there a more important reason why the total reply count is deleted?
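
The workaround described above could be sketched roughly as follows. This is an illustration under assumptions, not tested against the live API: `complete_replies` and `fetch_replies` are hypothetical names, and the `id` and `totalReplyCount` columns are assumed to be present in the thread-level data frame. The demonstration uses made-up data and an offline fetcher.

```r
# Sketch: for every thread whose totalReplyCount is non-zero, fetch the
# replies separately by parent id and stack them into one data frame.
# `fetch_replies` is a hypothetical function(parent_id) returning a data frame.
complete_replies <- function(threads, fetch_replies) {
  # totalReplyCount may arrive as character, so coerce before comparing
  has_replies <- as.integer(as.character(threads$totalReplyCount)) > 0
  parent_ids  <- threads$id[has_replies]
  replies <- lapply(parent_ids, fetch_replies)
  if (length(replies) == 0) return(data.frame())
  do.call(rbind, replies)
}

# With tuber, the fetcher could plausibly be (assumption, requires yt_oauth()):
# fetch_replies <- function(id) tuber::get_comments(filter = c(parent_id = id))

# Offline demonstration with made-up data.
threads <- data.frame(
  id = c("t1", "t2", "t3"),
  totalReplyCount = c("0", "7", "2"),
  stringsAsFactors = FALSE
)
fake_fetch_replies <- function(parent_id) {
  n <- c(t2 = 7, t3 = 2)[[parent_id]]
  data.frame(parentId = rep(parent_id, n),
             reply = seq_len(n),
             stringsAsFactors = FALSE)
}

replies <- complete_replies(threads, fake_fetch_replies)
nrow(replies)
```

As noted in the comment above, a single call per parent would still be limited to 100 replies unless the reply pages are iterated as well.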

@soodoku
Member

soodoku commented Aug 2, 2018

Thanks for the hint @leftyveggie! Will try it over the weekend if that works for you.

@leftyveggie

@soodoku I think maybe I have slightly misunderstood - perhaps it is not at this part of the code - I meant if you could change the output of the data frame to include totalReplyCount as a column! Thank you for the quick reply!

@voskresenskiy

Hi! Thanks for the great package!
Was the reason for not downloading all comments identified?
Don't we get some replies?
Sorry for disturbing)

@orientune

I tried different videos, and here are my results.
If an author has only replies (no top-level comment on the video), the behavior depends on the number of replies. If the author has no more than 5 replies, none of them are scraped. But if the author has more than 5 replies, some of them are scraped.
And if an author has both top-level comments and replies, more comments are scraped than for the kind of author described above.

@rangaro

rangaro commented Feb 10, 2020

On the 22nd of February 2019, we did a test run with the Dayum video (DcJFdCmN98s). The video page informed us to expect 47,163 comments. YouTube Data Tools from Bernhard Rieder extracted 47,153 comments (10 missing). However, tuber extracted 44,810 comments, and Webometric Analyst from Mike Thelwall extracted 44,828.
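
To put the figures above side by side, here is a small calculation of the shortfall per tool. It uses only the counts reported in this comment; the variable names are arbitrary.

```r
# Counts as reported in the comment above (test run of 22 February 2019).
expected  <- 47163
extracted <- c("YouTube Data Tools" = 47153,
               "tuber"              = 44810,
               "Webometric Analyst" = 44828)

missing  <- expected - extracted              # comments not retrieved
coverage <- round(100 * extracted / expected, 2)

data.frame(tool         = names(extracted),
           extracted    = unname(extracted),
           missing      = unname(missing),    # 10, 2353, 2335
           coverage_pct = unname(coverage))
```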

Webometric Analyst only retrieves five follow-up comments because it does not take the comment pagination into account. The tuber results are pretty close. I think the iteration through the replies pages is not working correctly in tuber. Maybe Bernhard Rieder can be asked how he solved the problem in his tool. https://twitter.com/riederb

@rangaro

rangaro commented Mar 15, 2021

Is there any update on this? The bug report is now nearly 3 years old.

@balthasars
Contributor

balthasars commented Apr 13, 2021

Correct me if I'm wrong but to me it appears that get_all_comments() only implements the query to the commentThreads resource: https://developers.google.com/youtube/v3/docs/commentThreads

The resource states

A commentThread resource contains information about a YouTube comment thread, which comprises a top-level comment and replies, if any exist, to that comment.

So far so good, but it then goes on to say

The commentThread resource does not necessarily (!) contain all replies to a comment, and you need to use the comments.list method if you want to retrieve all replies for a particular comment. Also note that some comments do not have replies.

So I believe additional queries to this resource need to be implemented to retrieve the replies to all comments. I don't see any other GET queries in the source code apart from those to the commentThreads resource.

So this is just a wild guess (to be completely honest, I don't fully understand the code of process_page() yet), but could this be the issue here?

balthasars added a commit to balthasars/tuber that referenced this issue Apr 20, 2021
…ermediate fix for `get_all_comments()` that doesn't seem to be working properly (see gojiplus#52)
@hamaer0214

I ran into the same problem today. I use the YouTube API to get all comments, but I only get 0-5 replies per comment at most.
And I didn't find any clue on the YouTube API page.


10 participants