
Allow user to copy tweets ordered by date #51

Closed
lightnin opened this issue Jul 25, 2021 · 10 comments · Fixed by #61

@lightnin
Contributor

Upon executing pleroma-bot to copy tweets with a certain hashtag from my Twitter account, I noticed that the order in which the tweets are copied over is seemingly random (when no start date is specified).

For example, when copying tweets with the hashtag #tinkering, like these:
https://twitter.com/search?q=%40amoslightnin%20%23tinkering&src=typed_query&f=live

One of the first statuses copied over was this one:
https://masto.amosamos.net/notice/A9e0sWGhO6Da69j7rc

That status corresponds to a tweet from 2018:
https://twitter.com/AmosLightnin/status/1020098097110806529?s=20

But there are tweets with the #tinkering hashtag from earlier years on my Twitter timeline.

Again - this is fantastic software already, and I recognize that the user base this would be useful for may be fairly small. But then again, I think it's nice if we can provide ways for people to migrate their content from the big platforms onto their own little pleroma / activitypub instances, especially when they've used twitter as a repository for documentation, notes, or experiments, as I have.
Thanks again for making and sharing this software!

@robertoszek
Owner

robertoszek commented Jul 25, 2021

My hunch is that this behaviour is a bug in the implementation I introduced for hashtag handling, compounded by the fact that the number of tweets with the hashtags exceeds the value of max_tweets.

Would you mind providing the value of max_tweets in your config so I can try to reproduce and fix the bug?

@robertoszek added the bug label on Jul 25, 2021
@lightnin
Contributor Author

Yes - I believe when I ran this small test it was something like 40. I then tried to set it to 3199, based on the documentation indicating that the Twitter API would tolerate no more than 3200. But I got an error message from pleroma-bot stating that the max was 100, so that's where things stand at the moment.

You are right - there are more than 100 total tweets with the hashtags I'm using. My planned workaround, assuming that pleroma-bot does them sequentially, is to do a run of 100, find the last tweet copied over, and then use its timestamp +1 minute as the start time for the next run, and that way get through all of them. But if they aren't done sequentially that won't work, hence the issue report. I'm open to other workarounds or suggestions as to how to do it if you have any!
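
For reference, a minimal sketch of the timestamp arithmetic behind that workaround (the timestamp value here is illustrative; pass the result wherever the config expects the start date):

```python
from datetime import datetime, timedelta

# Illustrative: created_at of the last tweet copied over, in the
# ISO 8601 format Twitter's API uses
last_copied = "2018-07-20T03:15:00.000Z"

# Parse it, add one minute, and format it back for the next run's start time
parsed = datetime.strptime(last_copied, "%Y-%m-%dT%H:%M:%S.%fZ")
next_start = parsed + timedelta(minutes=1)
print(next_start.strftime("%Y-%m-%dT%H:%M:%S.000Z"))  # 2018-07-20T03:16:00.000Z
```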

@lightnin
Contributor Author

lightnin commented Dec 5, 2021

Just noticed that the readme indicates the following:
"If the --noProfile argument is passed, only new tweets will be posted. The profile picture, banner, display name and bio will not be updated on the Fediverse account."

I believe I had been using that switch in my previous tests - perhaps this had something to do with the order being off? I had been using the --noProfile switch because I don't want my profile updated from my birdsite profile, but it would be easy to set it back to the original values after the pleroma-bot import. Seems like this is worth trying out again!

@robertoszek
Owner

Hmmm... I think it has more to do with that section of the README being poorly worded.
The --noProfile flag should not interfere with the order or number of tweets gathered and posted.

It really does what you initially assumed: if the flag is present, the profile isn't updated (profile image, banner image, display name, etc.). The tweets are unaffected either way by the --noProfile argument.

The help message maybe does a better job trying to explain what it's supposed to do:

```
-n, --noProfile       skips Fediverse profile update (no background image,
                      profile image, bio text, etc.)
```

Perhaps the use of "new" on the README is a bit misleading:
"If the --noProfile argument is passed, only new tweets will be posted."

And rewording it to something along the lines of this would be more appropriate:
"If the --noProfile argument is passed, the profile picture, banner, display name and bio will not be updated on the Fediverse account. However, it will still gather and post the tweets following the config parameters."

In any case, do let me know if somehow you find it actually changes (for better or worse) how it behaves in relation to your issue.

@robertoszek
Owner

Hi,
I've reworked the pagination and how the processing is performed on the gathered tweets in my test branch, and I was also trying to reproduce your issue to see how we can approach it.
I verified that the order of the tweets and the hashtag handling are working on the test branch, so no issue there. The exception raised when the max_tweets value was higher than 100 was an issue with the pagination; sorry about that.

Regarding the earliest tweet retrieved for your hashtag only being from 2018:
Unfortunately, Twitter only makes the full-archive search endpoints available to projects with the Academic Research access level (even worse, the publicly available /search/recent endpoint only goes back 1 week).

So the workaround we use is to get them from the /2/tweets endpoint, which itself is capped at 3200 total tweets (100 per page).

The issue here is that we cannot use this endpoint with a query to filter tweets (that have a specific hashtag, for example), so we can only retrieve the latest 3200 tweets for the user and then check locally which of those to keep.
We cannot even get them from a start date onwards: if we provide a start date, Twitter's response goes from latest to oldest until it reaches that start date (or goes over the 3200 cap).
So what we do goes something like this:

  • Get the latest 3200 tweets' metadata for that Twitter user
  • Remove the tweets we don't want (replies, RTs, tweets not including the hashtags, etc., when the config says so)
  • Download related media for the tweets we keep

I'm sure you see the issue here: if there are tweets older than the 3200th latest which meet the criteria (have a specific hashtag), we won't be able to fetch them without Premium/Enterprise or Academic Research account access levels.
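
This isn't pleroma-bot's actual code, but a minimal sketch of those steps (minus the media download) against Twitter API v2; the bearer token and user ID are placeholders:

```python
import requests

BEARER_TOKEN = "..."   # assumption: a valid Twitter API v2 bearer token
USER_ID = "123456"     # assumption: the account's numeric user ID

def fetch_all_tweets(user_id):
    """Page through /2/users/:id/tweets; Twitter caps this at ~3200 tweets."""
    url = f"https://api.twitter.com/2/users/{user_id}/tweets"
    headers = {"Authorization": f"Bearer {BEARER_TOKEN}"}
    params = {"max_results": 100, "tweet.fields": "created_at,entities"}
    tweets = []
    while True:
        page = requests.get(url, headers=headers, params=params).json()
        tweets.extend(page.get("data", []))
        next_token = page.get("meta", {}).get("next_token")
        if not next_token:           # no more pages (or the 3200 cap was hit)
            return tweets
        params["pagination_token"] = next_token

def keep_with_hashtag(tweets, hashtag):
    """Filter locally, since the endpoint accepts no hashtag query."""
    return [
        t for t in tweets
        if any(h["tag"].lower() == hashtag.lower()
               for h in t.get("entities", {}).get("hashtags", []))
    ]

tinkering = keep_with_hashtag(fetch_all_tweets(USER_ID), "tinkering")
print(len(tinkering), "tweets matched")
```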

Testing locally, the oldest tweet I was able to get with the hashtags you wanted was from 2018-08-17T06:01:47.000Z. I retrieved 3200 total tweets for the Twitter account, of which only 194 were kept because they matched the criteria.

It looks pretty grim; I don't see any possible workaround/solution for this specific case.
It's a limitation imposed by Twitter for Essential and Elevated access levels.

@robertoszek self-assigned this on Dec 13, 2021
@lightnin
Contributor Author

First - thank you @robertoszek for diving so deeply into this and explaining it to me. That's some hairy stuff - and rather disappointing on Twitter's part, but I'm sure they have their reasons. Since I don't think I'll be able to get Essential or Elevated access levels, the next thing to try is an import without any start date to get the parts of the corpus stretching back to 2018 - which is still something! Tbh, I'm not so keen on Twitter anymore, so if I can find some utility for mass deletion of tweets, then I will probably get what I can down to 2018, delete all tweets down to the last one gotten in 2018, and then get the rest if I can... That being a big "if": that I can find such a utility and that the strategy works. Either way - I'm super grateful! Do you have a Patreon or something I can "buy you a beer" through? (And I guess I should close this now?)

@robertoszek
Owner

No problem @lightnin, happy to help!
Twitter's stance on this is rather disappointing, I agree.

Hmm... That's a good idea. I actually don't know if it would work; I'd need to check if removing a tweet lets you retrieve the latest 3200 (not counting removed tweets) once you try to fetch them again.

If that's the case, I could see myself adding some archival capabilities for people that are in the same boat as you. A way to download the 3200 latest tweets locally (with metadata, dates, media and so on), then automatically remove them from Twitter and continue the archival process with the next batch until the tweets run out (and maybe zip them up, letting you keep them as a backup).
You would have to be fully committed at that point, there's no going back after that haha.
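
A rough sketch of what that loop could look like (every helper here is hypothetical; nothing like this exists in pleroma-bot yet):

```python
# Hypothetical archive-then-delete loop; these helpers are placeholders.
def save_locally(batch):
    ...  # write tweets, dates and media to disk (and maybe zip them up)

def delete_tweet(tweet_id):
    ...  # DELETE /2/tweets/:id, respecting the rate limits

def archive_and_wipe(user_id):
    while True:
        batch = fetch_all_tweets(user_id)  # latest <=3200 tweets (sketch above)
        if not batch:
            break                          # the tweets ran out
        save_locally(batch)
        for tweet in batch:
            delete_tweet(tweet["id"])      # destructive: no going back!
```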

Lastly, just as a clarification: you should have at the very least Essential access if you were able to use the bot (and Elevated if you created the tokens before Nov 15th):

"If you’re already using the Twitter API v2, you’ll automatically see your Projects upgraded to Elevated access today."

https://blog.twitter.com/developer/en_us/topics/tools/2021/build-whats-next-with-the-new-twitter-developer-platform

The issue is that the full search API endpoint is only available for Academic Research accounts.

I'd leave this issue open until I create a new release version from the test branch (which includes some pagination fixes) if you don't mind.

Oh, and I don't have a Patreon, Subscribestar or anything like that. I guess I have a PayPal link if you feel inclined to donate but please, don't feel pressured into it. The fact you and other people find this software useful is reward enough.
https://paypal.me/robertoszek

@robertoszek
Owner

I can confirm that the /2/tweets endpoint does not seem to include deleted tweets; next I need to verify that they don't count towards the 3200 limit either.

Regarding the mass delete, Twitter rate limits the delete endpoint to 50 requests per 15-minute window:
https://developer.twitter.com/en/docs/twitter-api/tweets/manage-tweets/api-reference/delete-tweets-id
And even worse, to 1000 successful requests per 24-hour window per user:
https://developer.twitter.com/en/docs/twitter-api/rate-limits

So deleting 3200 tweets would take a very long time (73 hrs) based on rate limits alone.
I'd like to adapt pleroma-bot to allow running it as a daemon/service, so it's feasible to leave it running in the background doing these types of tasks.
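
For reference, the back-of-the-envelope math behind that 73-hour figure:

```python
# The 1000-per-24h cap dominates the 50-per-15min one for large batches.
tweets = 3200
per_day = 1000                # successful deletes per 24-hour window
per_hour = 50 * 4             # 50 per 15-minute window -> 200 per hour

full_days = tweets // per_day  # 3 full days (72 h) for the first 3000
remainder = tweets % per_day   # 200 tweets left over
print(full_days * 24 + remainder / per_hour)  # 73.0 hours
```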

The alternative would be using the official Twitter archive files as an input to get the tweets to delete (and possibly also using them to post to the Fediverse).
I found a project that uses it to bulk-delete tweets:
https://github.com/koenrh/delete-tweets

It would be nice if pleroma-bot could use that official archive to post those tweets to the Fediverse too.
I've opened an issue to keep track of it here: #59
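
For the curious, a minimal sketch of reading the tweet file inside an official Twitter archive (the file is a JavaScript assignment wrapping a JSON array; the exact path and variable prefix vary by archive vintage):

```python
import json

def load_archive_tweets(path):
    """Parse tweet.js / tweets.js from an official Twitter archive."""
    with open(path, encoding="utf-8") as f:
        raw = f.read()
    # Strip the "window.YTD.tweet.part0 = " prefix before the JSON array
    return json.loads(raw[raw.index("["):])

tweets = load_archive_tweets("data/tweet.js")  # path assumed; varies by export
print(len(tweets), "tweets found in the archive")
```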

I've also added some donation links so it's easier for me to point people to them in case I get asked in the future 😅
https://robertoszek.github.io/pleroma-bot/#funding

@robertoszek
Owner

Hey @lightnin,
I just wanted to let you know v1.0.0 is officially out and it includes some support for using Twitter's archives. It allows the bot to get tweets older than 2010 (and more than 3200 of them); hopefully you find it useful!

@lightnin
Contributor Author

Fantastic! I will give it a shot, hopefully this weekend.
