
Questions, Feedback, and Suggestions #4 #5262

Open · mikf opened this issue Mar 1, 2024 · 91 comments
@mikf (Owner) commented Mar 1, 2024

Continuation of the previous issue as a central place for any sort of question or suggestion that doesn't deserve its own separate issue.

Links to older issues: #11, #74, #146.

@BakedCookie

For most sites I'm able to sort files into year/month folders like this:

"directory": ["{category}", "{search_tags}", "{date:%Y}", "{date:%m}"]

However, for redgifs there doesn't seem to be a date keyword available for directory, only for filename. Is this an oversight?

@mikf (Owner) commented Mar 2, 2024

Yep, that's a mistake that happened when adding support for galleries in 5a6fd80.
Will be fixed with the next git push.

edit: 82c73c7

@taskhawk commented Mar 6, 2024

There's a typo in extractor.reddit.client-id & .user-agent:

"I'm not a rebot"

@the-blank-x (Contributor)

There's also another typo in extractor.reddit.client-id & .user-agent, "reCATCHA"

@biggestsonicfan

Can you grab all the media from quoted tweets? Example.

mikf added a commit that referenced this issue Mar 7, 2024
#5262 (comment)

It's implemented as a search for 'quoted_tweet_id:…' on Twitter.
mikf added a commit that referenced this issue Mar 7, 2024
#5262 (comment)

This one was on the same line as the previous one ... (9fd851c)
@mikf (Owner) commented Mar 7, 2024

Regarding typos, thanks for pointing them out.
I would be surprised if there aren't at least 10 more somewhere in this file.

@biggestsonicfan
This is implemented as a search for quoted_tweet_id:… on Twitter's end.
I've added an extractor for it similar to the hashtags one (40c0553), but it only does said search under the hood.

@BakedCookie commented Mar 7, 2024

Normally, %-encoded characters in the URL get converted nicely when running gallery-dl, e.g.

https://gelbooru.com/index.php?page=post&s=list&tags=nighthawk_%28circle%29
gives me a nighthawk_(circle) folder

but for this url:
https://gelbooru.com/index.php?page=post&s=list&tags=shin%26%23039%3Bya_%28shin%26%23039%3Byanchi%29

I'm getting a shin&#039;ya_(shin&#039;yanchi) folder. Shouldn't I be getting a shin'ya_(shin'yanchi) folder instead?

EDIT: Actually, I think there's just something wrong with that URL. I had it saved for a long time and searching that tag normally gives a different URL (https://gelbooru.com/index.php?page=post&s=list&tags=shin%27ya_%28shin%27yanchi%29). I still got valid posts from the weird URL so I didn't think much of it.

@mikf (Owner) commented Mar 7, 2024

%28 and so on are URL-escaped values, which do get resolved.
&#039; is the HTML-escaped value for '.

You could use {search_tags!U} to convert them.
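For example, swapping it into the directory format from earlier in the thread (an untested sketch):

"directory": ["{category}", "{search_tags!U}", "{date:%Y}", "{date:%m}"]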

@taskhawk commented Mar 8, 2024

Is there support to remove metadata like this?

gallery-dl -K https://www.reddit.com/r/carporn/comments/axo236/mean_ctsv/

...
preview['images'][N]['resolutions'][N]['height']
  144
preview['images'][N]['resolutions'][N]['url']
  https://preview.redd.it/mcerovafack21.jpg?width=108&crop=smart&auto=webp&s=f8516c60ad7fa17c84143d549c070738b8bcc989
preview['images'][N]['resolutions'][N]['width']
  108
...

Post-processor:

"filter-metadata":
    {
      "name": "metadata",
      "mode": "delete",
      "event": "prepare",
      "fields": ["preview[images][0][resolutions]"]
    }

I've tried a few variations but no dice.

"fields": ["preview[images][][resolutions]"]
"fields": ["preview[images][N][resolutions]"]
"fields": ["preview['images'][0]['resolutions']"]

@YuanGYao commented Mar 8, 2024

Hello, I left a comment in #4168. Does the _pagination method of the WeiboExtractor class in weibo.py return when data["list"] is an empty list?
When I used gallery-dl to batch-download Weibo album pages, the downloads also came out incomplete.
Testing on the web page, I found that Weibo's getImageWall API sometimes returns an empty list before all images have loaded. I think this may be what causes gallery-dl to terminate the download early.

@mikf (Owner) commented Mar 8, 2024

@taskhawk
fields selectors are quite limited and can't really handle lists.
You might want to use a python post processor (example) and write some code that does this.

def remove_resolutions(metadata):
    for image in metadata["preview"]["images"]:
        del image["resolutions"]

(untested, might need some check whether preview and/or images exists)
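A slightly hardened variant of that sketch (also untested) that tolerates posts without preview data:

def remove_resolutions(metadata):
    # iterate over the preview images, if any, and drop their resolution lists
    for image in metadata.get("preview", {}).get("images", ()):
        image.pop("resolutions", None)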

@YuanGYao
Yes, the code currently stops when Weibo's API returns no more results (empty list).
This is probably not ideal, as I've hinted at in #4168 (comment)
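Schematically, the behavior being described looks like this (an illustrative sketch, not the actual weibo.py source; api_request is a stand-in name):

def _pagination(self, endpoint, params):
    while True:
        data = self.api_request(endpoint, params)["data"]
        if not data["list"]:
            # an empty list is treated as "no more results", which ends
            # the download early if the API returned a transient empty
            # page while images were still loading (see #4168)
            return
        yield from data["list"]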

@YuanGYao commented Mar 9, 2024

@mikf
Well, I think for Weibo's album page, since_id should be used to determine whether all images have been loaded.
I updated my comment in #4168 (comment) and attached the response returned by Weibo's getImageWall API.
I think this should help solve this problem.

@BakedCookie

Not sure if I'm missing something, but are directory-specific configurations exclusive to running gallery-dl via the executable?

Basically, I have a directory for regular tags, and a directory for artist tags. For regular tags I use "directory": ["{category}", "{search_tags}", "{date:%Y}", "{date:%m}"] since the tag number is manageable. For artist tags though, there's way more of them so this "directory": ["{category}", "{search_tags[0]!u}", "{search_tags}", "{date:%Y}", "{date:%m}"] makes more sense.

So right now the only way I know to get this per-directory configuration to work is to copy the gallery-dl executable everywhere I want to use a master configuration override. Am I missing something? It feels like there should be a better way.

@Hrxn (Contributor) commented Mar 11, 2024

Huh? No, the configuration always works the same way. You're simply using different configuration files?

@BakedCookie

@Hrxn

From the readme:

When run as executable, gallery-dl will also look for a gallery-dl.conf file in the same directory as said executable.

It is possible to use more than one configuration file at a time. In this case, any values from files after the first will get merged into the already loaded settings and potentially override previous ones.

I want to override my master configuration %APPDATA%\gallery-dl\config.json in specific directories with a local gallery-dl.conf but it seems like that's only possible with the standalone executable.

@taskhawk commented Mar 11, 2024

You can load additional configuration files from the console with:

-c, --config FILE           Additional configuration files

You just need to specify the path to the file and any options there will overwrite your main configuration file.

Edit: From my understanding, yeah, automatic loading of local config files in each directory is only possible by having the standalone executable in each directory. Are different directory options the only thing you need?
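For example, assuming an override file next to your URL lists (file name illustrative):

gallery-dl -c ./artist-dirs.conf "https://gelbooru.com/index.php?page=post&s=list&tags=TAG"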

@BakedCookie

@taskhawk

Thanks, that's exactly what I was looking for! Guess I didn't read the documentation thoroughly enough.

For now the only thing I'd want to override is the directory structure for artist tags. I don't think it's possible to determine from the metadata alone if a given tag is the name of an artist or not, so I thought the best way to go about it is to just have a separate directory for artists, and use a configuration override. So yeah, loading that override with the -c flag works great for that purpose, thanks again!

@taskhawk commented Mar 11, 2024

You kinda can, but you need to enable tags for Gelbooru in your configuration to get them, which will require an additional request:

    "gelbooru": {
      "directory": {
        "search_tags in tags_artists": ["{category}", "{search_tags[0]!u}", "{search_tags}", "{date:%Y}", "{date:%m}"],
        ""                           : ["{category}", "{search_tags}", "{date:%Y}", "{date:%m}"]
      },
      "tags": true
    },

Set "tags": true in your config and run a test with gallery-dl -K "https://gelbooru.com/index.php?page=post&s=list&tags=TAG" so you can see the tags_* keywords.

Of course, this depends on the artists being correctly tagged. Not sure if it happens on Gelbooru, but at least on other boorus and booru-like sites I've come across posts with the artist tagged as a general tag instead of an artist tag. Another limitation is that your search tag can only include one artist at a time; handling more would require a more complex expression to check that all tags are present in tags_artists.

What I do instead is that I inject a keyword to influence where it will be saved, like this:

gallery-dl -o keywords='{"search_tags_type":"artists"}' "https://gelbooru.com/index.php?page=post&s=list&tags=ARTIST"

And in my config I have

    "gelbooru": {
      "directory": ["boorus", "{search_tags_type}", "{search_tags}"]
    },

You can have:

    "gelbooru": {
      "directory": {
        "search_tags_type == 'artists'": ["{category}", "{search_tags[0]!u}", "{search_tags}", "{date:%Y}", "{date:%m}"],
        ""                             : ["{category}", "{search_tags}", "{date:%Y}", "{date:%m}"]
      }
    },

You can do this for other tag types, like general, copyright, characters, etc.

Because it's a chore to type that option every time, I made a wrapper script, so I just call it like this because artists is my default:

~/script.sh "TAG"

For other tag types I can do:

~/script.sh --copyright "TAG"
~/script.sh --characters "TAG"
~/script.sh --general "TAG"

@BakedCookie

Thanks for pointing out there's a tags option available for the gelbooru extractor. I already used it in the kemono extractor to get the name of the artist, but it didn't occur to me that gelbooru might also have such an option (and just accepted that the tags aren't categorized).

For artists I store all the URLs in their respective gelbooru.txt, rule34.txt, etc. files like so:

https://gelbooru.com/index.php?page=post&s=list&tags=john_doe
https://gelbooru.com/index.php?page=post&s=list&tags=blue-senpai
https://gelbooru.com/index.php?page=post&s=list&tags=kaneru
...

And then just run gallery-dl -c gallery-dl.conf -i gelbooru.txt. Since the search_tags ends up being the artist anyway, getting tags_artists is probably not worth the extra request. Same for general tags, and copyright tags, in their respective directories. With this workflow I can't immediately see where I'd be able to utilize keyword injection, but it's definitely a useful feature that I'll keep in mind.

@Wiiplay123 (Contributor)

When I'm making an extractor, what do I do if the site doesn't have different URL patterns for different page types? Every single page is just a numerical ID that could be a forum post, image, blog post, or something completely different.

@mikf (Owner) commented Mar 19, 2024

@Wiiplay123 You handle everything with a single extractor and decide what type of result to return on the fly. The gofile code is a good example for this I think, or aryion.
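As a rough sketch of that pattern (names and helper methods here are illustrative only; the real gofile/aryion code differs):

from .common import Extractor

class ExamplesiteExtractor(Extractor):
    """Extractor for a site where every page is just /<numeric id>"""
    category = "examplesite"
    pattern = r"(?:https?://)?examplesite\.net/(\d+)"

    def items(self):
        page = self.request(self.url).text
        # decide on the fly what kind of page this ID points to
        if 'class="forum-post"' in page:
            yield from self._forum_items(page)
        elif 'class="image-page"' in page:
            yield from self._image_items(page)
        else:
            yield from self._blog_items(page)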

@I-seah commented Mar 20, 2024

Hi, what options should I use in my config file to change the format of dates in metadata files? I would like to use "%Y-%m-%dT%H:%M:%S%z" for the values of "date" and "published" (from coomer/kemono downloads).

And would it also be possible to do this for json files that ytdl creates? I downloaded some videos with gallery-dl but the dates got saved as "upload_date": "20230910" and "timestamp": 1694344011, so I think it might be better to convert the timestamp to a date to get a more precise upload time, but I'm not sure if it's possible to do that either.

JackTildeD added a commit to JackTildeD/gallery-dl-forked that referenced this issue Apr 24, 2024
* save cookies to tempfile, then rename

avoids wiping the cookies file if the disk is full

* [deviantart:stash] fix 'index' metadata (mikf#5335)

* [deviantart:stash] recognize 'deviantart.com/stash/…' URLs

* [gofile] fix extraction

* [kemonoparty] add 'revision_count' metadata field (mikf#5334)

* [kemonoparty] add 'order-revisions' option (mikf#5334)

* Fix imagefap extractor

* [twitter] add 'birdwatch' metadata field (mikf#5317)

should probably get a better name,
but this is what it's called internally by Twitter

* [hiperdex] update URL patterns & fix 'manga' metadata (mikf#5340)

* [flickr] add 'contexts' option (mikf#5324)

* [tests] show full path for nested values

'user.name' instead of just 'name' when testing for
"user": { … , "name": "…", … }

* [bluesky] add 'instance' metadata field (mikf#4438)

* [vipergirls] add 'like' option (mikf#4166)

* [vipergirls] add 'domain' option (mikf#4166)

* [gelbooru] detect returned favorites order (mikf#5220)

* [gelbooru] add 'date_favorited' metadata field

* Update fapello.py

get full-size image instead of resized

* fapello.py full-size image

removing ".md" and ".th" in the image URL makes it download the full-size images

* [formatter] fix local DST datetime offsets for ':O'

'O' would get the *current* local UTC offset and apply it to all
'datetime' objects it gets applied to.
This would result in a wrong offset if the current offset includes
DST and the target 'datetime' does not or vice-versa.

'O' now determines the correct local UTC offset while respecting DST for
each individual 'datetime'.

* [subscribestar] fix 'date' metadata

* [idolcomplex] support new pool URLs

* [idolcomplex] fix metadata extraction

- replace legacy 'id' values with alphanumeric ones, since the former are
  no longer available
- approximate 'vote_average', since the real value is no longer
  available
- fix 'vote_count'

* [bunkr] remove 'description' metadata

album descriptions are no longer available on album pages
and the previous code erroneously returned just '0'

* [deviantart] improve 'index' extraction for stash files (mikf#5335)

* [kemonoparty] fix exception for '/revision/' URLs

caused by 03a9ce9

* [steamgriddb] raise proper exception for deleted assets

* [tests] update extractor results

* [pornhub:gif] extract 'viewkey' and 'timestamp' metadata (mikf#4463)

mikf#4463 (comment)

* [tests] use 'datetime.timezone.utc' instead of 'datetime.UTC'

'datetime.UTC' was added in Python 3.11
and is not defined in older versions.

* [gelbooru] add 'order-posts' option for favorites (mikf#5220)

* [deviantart] handle CloudFront blocks in general (mikf#5363)

This was already done for non-OAuth requests (mikf#655)
but CF is now blocking OAuth API requests as well.

* release version 1.26.9

* [kemonoparty] fix KeyError for empty files (mikf#5368)

* [twitter] fix pattern for single tweet (mikf#5371)

- Add optional slash
- Update tests to include some non-standard tweet URLs

* [kemonoparty:favorite] support 'sort' and 'order' query params (mikf#5375)

* [kemonoparty] add 'announcements' option (mikf#5262)

mikf#5262 (comment)

* [wikimedia] suppress exception for entries without 'imageinfo' (mikf#5384)

* [docs] update defaults of 'sleep-request', 'browser', 'tls12'

* [docs] complete Authentication info in supportedsites.md

* [twitter] prevent crash when extracting 'birdwatch' metadata (mikf#5403)

* [workflows] build complete docs Pages only on gdl-org/docs

deploy only docs/oauth-redirect.html on mikf.github.io/gallery-dl

* [docs] document 'actions' (mikf#4543)

or at least attempt to

* store 'match' and 'groups' in Extractor objects

* [foolfuuka] improve 'board' pattern & support pages (mikf#5408)

* [reddit] support comment embeds (mikf#5366)

* [build] add minimal pyproject.toml

* [build] generate sdist and wheel packages using 'build' module

* [build] include only the latest CHANGELOG entries

The CHANGELOG is now at a size where it takes up roughly 50kB or 10% of
an sdist or wheel package.

* [oauth] use Extractor.request() for HTTP requests (mikf#5433)

Enables using proxies and general network options.

* [kemonoparty] fix crash on posts with missing datetime info (mikf#5422)

* restore LD_LIBRARY_PATH for PyInstaller builds (mikf#5421)

* remove 'contextlib' imports

* [pp:ugoira] log errors for general exceptions

* [twitter] match '/photo/' Tweet URLs (mikf#5443)

fixes regression introduced in 40c0553

* [pp:mtime] do not overwrite '_mtime' for None values (mikf#5439)

* [wikimedia] fix exception for files with empty 'metadata'

* [wikimedia] support wiki.gg wikis

* [pixiv:novel] add 'covers' option (mikf#5373)

* [tapas] add 'creator' extractor (mikf#5306)

* [twitter] implement 'relogin' option (mikf#5445)

* [docs] update docs/configuration links (mikf#5059, mikf#5369, mikf#5423)

* [docs] replace AnchorJS with custom script

use it in rendered .rst documents as well as in .md ones

* [text] catch general Exceptions

* compute tempfile path only once

* Add warnings flag

This commit adds a warnings flag

It can be combined with -q / --quiet to display warnings.
The intent is to provide a silent option that still surfaces
warning and error messages so that they are visible in logs.

* re-order verbose and warning options

* [gelbooru] improve pagination logic for meta tags (mikf#5478)

similar to 494acab

* [common] add Extractor.input() method

* [twitter] improve username & password login procedure (mikf#5445)

- handle more subtasks
- support 2FA
- support email verification codes

* [common] update Extractor.wait() message format

* [common] simplify 'status_code' check in Extractor.request()

* [common] add 'sleep-429' option (mikf#5160)

* [common] fix NameError in Extractor.request()

… when accessing 'code' after a requests exception was raised.

Caused by the changes in 566472f

* [common] show full URL in Extractor.request() error messages

* [hotleak] download files with 404 status code (mikf#5395)

* [pixiv] change 'sanity_level' debug message to a warning (mikf#5180)

* [twitter] handle missing 'expanded_url' fields (mikf#5463, mikf#5490)

* [tests] allow filtering extractor result tests by URL or comment

python test_results.py twitter:+/i/web/
python test_results.py twitter:~twitpic

* [exhentai] detect CAPTCHAs during login (mikf#5492)

* [output] extend 'output.colors' (mikf#2566)

allow specifying ANSI colors for all loglevels
(debug, info, warning, error)

* [output] enable colors by default

* add '--no-colors' command-line option

---------

Co-authored-by: Luc Ritchie <luc.ritchie@gmail.com>
Co-authored-by: Mike Fährmann <mike_faehrmann@web.de>
Co-authored-by: Herp <asdf@qwer.com>
Co-authored-by: wankio <31354933+wankio@users.noreply.github.com>
Co-authored-by: fireattack <human.peng@gmail.com>
Co-authored-by: Aidan Harris <me@aidanharr.is>
@gwttk commented Apr 24, 2024

How do I put the artist name in the file path for e-hentai? The "artist:xxx" value is inside tags, and I can't find a variable for "directory": ["{artist}"].

@mikf (Owner) commented Apr 25, 2024

@taskhawk
I slightly modified the Danbooru extractor to have it go through all ugoira posts uploaded there (https://danbooru.donmai.us/posts?tags=ugoira), and none of them had .png frames.
I'm aware that this is just a small subset, but at least its data can be accessed a lot faster than on Pixiv itself.

@throwaway26425
Using the same --user-agent string as the browser you got your cookies from might help.
Updating the HTTP headers sent during API requests is also something that needs to be done again ...

@Immueggpain
See #2117

@throwaway26425 commented Apr 26, 2024

Using the same --user-agent string as the browser you got your cookies from might help.

I'm using -o browser=firefox, is that the same?

or, do I need to use both?

-o browser=firefox
--user-agent "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:125.0) Gecko/20100101 Firefox/125.0"

Updating the HTTP headers sent during API requests is also something that needs to be done again.

I don't understand this, can you please explain it better? :(

@mikf (Owner) commented Apr 26, 2024

I'm using -o browser=firefox, is that the same?

browser=firefox overrides your user agent to Firefox 115 ESR, regardless of your --user-agent setting. It also sets a bunch of extra HTTP headers and TLS cipher suites to somewhat mimic a real browser, but maybe you're better off without this option.
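If you drop browser=firefox, you could instead pass the exact UA string of the browser your cookies came from, e.g.:

gallery-dl --user-agent "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:125.0) Gecko/20100101 Firefox/125.0" URL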

I don't understand this, can you please explain it better? :(

The instagram code sends specific HTTP headers when making API requests, which might now be out-of-date, meaning I should update them again. The last time I did this was October 2023 (969be65).

@fireattack (Contributor) commented May 3, 2024

I'm pretty sure this has been asked before but can't find it.

My goal is to run gallery-dl as a module to download, while also getting a record of processed posts (URLs, post IDs) so I can use that info in some custom functions.

I've read #642, but I still don't quite get it. It looks like you have to use DownloadJob for downloading, but in parallel use DataJob (or even a customized Job) to get the data?

My current code is pretty simple, just

def load_config():
    ....
def set_config(user_id):
    ....

def update(user_id):
    load_config()
    profile_url = set_config(user_id)
    job.DownloadJob(profile_url).run()

I tried to patch DownloadJob's handle_url so I can save the URLs and metadata into something like self.mydata, but that isn't enough: in handle_queue, it creates a new job with job = self.__class__(extr, self) for the actual downloading, which makes it more complicated than I'd like to pass the data back to the "parent" instance.

So I'm curious whether there is an easier way to do this, other than writing a whole new Job? Thanks in advance.

@climbTheStairs

I have a suggestion, though I'm not sure how feasible or practical it would be.

Current behavior:

  • twitter num starts at 1 for all posts
  • pixiv num starts at 0 for all posts
  • reddit num starts at 1 for posts containing multiple images and is 0 for posts containing one

Could the behavior for indices be made consistent across all sites?

@Vetches commented May 14, 2024

Hi! Is it possible to provide a "best practices" of sorts for using gallery-dl with Instagram? Things like using the same user-agent as that of the browser cookies are extracted from (source), which values to use for sleep and sleep-request, the best parameters for the config, etc.

To that end, are there any plans on updating the HTTP headers that are sent for Instagram API calls? Is this something an end-user could update, and if so, where could we find such headers if we wanted to change this for ourselves?

Unrelatedly, is there any way to set up the running of gallery-dl to stop whenever an error occurs? I know there are errors and warnings when gallery-dl is run, so I'm wondering if there's a way, either via Bash or perhaps Python, to stop when it encounters an error (applicable when I'm passing a list of URLs).

Thank you so, so much for taking the time to read this!

@biggestsonicfan

Is there a flag you can set in a "prepare" post-processor to stop a different "prepare" post-processor from occurring?

@fbck commented May 19, 2024

Is it possible to download video thumbnails/preview pictures for twitter (and perhaps other sites too)?

@fireattack (Contributor)

Is it possible to download video thumbnails/preview pictures for twitter (and perhaps other sites too)?

There are extractor.instagram.previews and extractor.artstation.previews, but I can't seem to find a way for Twitter.

@noshii117

I'm using --sleep-request 8.0 and --sleep 1.0-2.0 to download posts from instagram profiles. My account still got flagged by instagram for scraping, despite me doing just one profile every other day or 2 days. Should I increase sleep-request, sleep, or both?

@Vetches commented May 22, 2024

I'm using --sleep-request 8.0 and --sleep 1.0-2.0 to download posts from instagram profiles. My account still got flagged by instagram for scraping, despite me doing just one profile every other day or 2 days. Should I increase sleep-request, sleep, or both?

I use a JSON config rather than the command-line, but when I use "sleep-request": [15,45] and "sleep": [2,10], I generally don't get marked for spam super often and can download a few accounts within a day with no issue. I got these numbers from a different IG issue on here per a soft recommend by mikf.

Now IG still does detect automation afoot even with these numbers, but you can generally get around that by just switching accounts.
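In config form, those numbers would look something like this (a sketch, placed under the instagram extractor section):

    "instagram": {
        "sleep-request": [15, 45],
        "sleep": [2, 10]
    }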

@noshii117 commented May 22, 2024

I'm using --sleep-request 8.0 and --sleep 1.0-2.0 to download posts from instagram profiles. My account still got flagged by instagram for scraping, despite me doing just one profile every other day or 2 days. Should I increase sleep-request, sleep, or both?

I use a JSON config rather than the command-line, but when I use "sleep-request": [15,45] and "sleep": [2,10], I generally don't get marked for spam super often and can download a few accounts within a day with no issue. I got these numbers from a different IG issue on here per a soft recommend by mikf.

Now IG still does detect automation afoot even with these numbers, but you can generally get around that by just switching accounts.

Thanks. Another question: can I download/extract my followed-users list? If so, how?

Edit: I asked this because I'm going to make another two accounts following the same people, and just to have a backup list of who I follow on there.

@Vetches commented May 23, 2024

I'm using --sleep-request 8.0 and --sleep 1.0-2.0 to download posts from instagram profiles. My account still got flagged by instagram for scraping, despite me doing just one profile every other day or 2 days. Should I increase sleep-request, sleep, or both?

I use a JSON config rather than the command-line, but when I use "sleep-request": [15,45] and "sleep": [2,10], I generally don't get marked for spam super often and can download a few accounts within a day with no issue. I got these numbers from a different IG issue on here per a soft recommend by mikf.
Now IG still does detect automation afoot even with these numbers, but you can generally get around that by just switching accounts.

Thanks. Another question: can I download/extract my followed-users list? If so, how?

Edit: I asked this because I'm going to make another two accounts following the same people, and just to have a backup list of who I follow on there.

I'm not aware of a way to do that; I just have a list of IG accounts that I read from and append new accounts to. What you could do is visit your following list and scroll down until there aren't any more accounts to render, then read the HTML and extract all of the profile links that way.

@pt3rrorduck
Hello, can someone please explain how to filter out values from lists, for example tags[]?
I tried something like this, but it didn't work:
"filter": "any(tag in 'tags' for tag in ['tag1', 'tag2', 'tag3'])"

@Vrihub (Contributor) commented May 25, 2024

Hello, can someone please explain how to filter out values from lists, for example tags[]? I tried something like this, but it didn't work: "filter": "any(tag in 'tags' for tag in ['tag1', 'tag2', 'tag3'])"

EDIT: My original suggestion was unnecessarily complicated; also, as pointed out by mikf, the correct parameter is "image-filter".

"image-filter": "any(tag in tags for tag in ['tag1', 'tag2', 'tag3'])"

should work, provided that tags is the name of the metadata field containing tags for the extractor you are using (check with gallery-dl -K).

@mikf (Owner) commented May 25, 2024

@fireattack
I'd create a new Job class that inherits from DownloadJob, extends dispatch() to store all messages, and overrides handle_queue() to collect the data gathered by children.

There really isn't an easier way, but you can more or less just copy-paste the relevant code parts and delete whatever you don't really need.
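A minimal sketch of that approach (untested; assumes the Job(url, parent) signature and the Message constants as currently found in gallery_dl):

from gallery_dl import job
from gallery_dl.extractor.message import Message

class RecordingJob(job.DownloadJob):
    """DownloadJob that also records every URL message it processes."""

    def __init__(self, url, parent=None):
        super().__init__(url, parent)
        # share one list between this job and the child jobs that
        # handle_queue() spawns via self.__class__(extr, self)
        self.records = parent.records if parent is not None else []

    def dispatch(self, msg):
        if msg[0] == Message.Url:
            self.records.append((msg[1], msg[2]))  # (url, metadata)
        super().dispatch(msg)

# usage: run the job, then inspect what was processed
# j = RecordingJob(profile_url); j.run(); print(j.records)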

@climbTheStairs
I do plan on doing this for v2.0 as well as adding options that control enumeration behavior.

@Vetches
I did look into IG's headers some time ago, and at least for the API endpoints used by gallery-dl, nothing seems to have changed. The potential problem is that IG now uses a different set of endpoints, with query parameters whose meaning I don't know ...

You can find endpoints, headers, and parameters by opening your browser's dev tools (F12), selecting XHR in its network monitor, and browsing the site.

Stopping gallery-dl on errors is possible with the actions option:

"actions": {"error": "exit"}

@mikf (Owner) commented May 25, 2024

@biggestsonicfan
There isn't, but couldn't you use "event": "init" instead? It triggers only once before the first file.

@noshii117
You can get a list of an account's followed users with

gallery-dl -g https://www.instagram.com/USER/following

where USER is your account's name. You can write them to a file by redirecting stdout with >.

gallery-dl -g https://www.instagram.com/USER/following > followed_users.txt

@pt3rrorduck
The config file name for --filter is image-filter for ... reasons.
Also, 'tags' should probably be a variable name instead of a string.

"image-filter": "any(tag in tags for tag in ['tag1', 'tag2', 'tag3'])"

@pt3rrorduck

Thank you,
"image-filter": "any(tag in 'tags' for tag in ['tag1', 'tag2', 'tag3'])" solved it,
but it only works with ' ' around 'tags'. Without it I get the following error:
FilterError: Evaluating filter expression failed (NameError: name 'tags' is not defined)

@Vetches commented May 25, 2024

Thank you so much as always for the incredibly helpful reply!

I did look into IG's headers some time ago and at least for the API endpoints used by gallery-dl, nothing seems to have changed. The potential problem is that IG now uses a different set of endpoints with query parameters I have no idea what they mean ...

You can find endpoints, headers, and parameters by opening your browser's dev tools (F12), selecting XHR in its network monitor, and browsing the site.

Oh wow, that's quite interesting! So how can gallery-dl function if IG uses different endpoints with unknown query parameters? Or do you mean that there are effectively two sets of endpoints, the ones used by gallery-dl and the ones with the unknown query parameters? Does this potentially mean that the gallery-dl endpoints could become deprecated at some point, at which point we'd have to figure out what those query parameters do?

Stopping gallery-dl on errors is possible with the actions option:

Amazing, this is just what I was looking for, thank you so much!

@mikf (Owner) commented May 25, 2024

@pt3rrorduck
What site are you trying to use this on? Only some provide a list of tags, for some "tags" are available under a different name, and most of the time there are no tags available at all and you'd end up with a NameError exception when trying to access an undefined tags value.

With ' ' around 'tags', this only checks if any of your tags can be found inside the word tags.

@Vetches
Yep, IG has multiple ways / API endpoints to access its data, which does mean that the ones currently used by gallery-dl could get deprecated or outright removed.

@biggestsonicfan

Hmm, alright. How about passing the entire JSON dump to a pre/post processor, is that possible?

@pt3rrorduck

@mikf
I tried Redgifs and Newgrounds.
Here is an example from metadata JSON:
"tags": [ "arlong", "banbuds", "east-blue", "eastblue", "fanart", "krieg", "kuro", "luffy", "morgan", "one-piece" ],

@mikf (Owner) commented May 26, 2024

@biggestsonicfan
Do you mean all collected metadata of every file/post? What exactly do you want to accomplish?

Whatever it is, you're most likely best off using a python post processor to run a custom Python function, where you could theoretically also set a flag that could prevent any further post processors from running.
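For the flag idea, a minimal sketch (option and file names here are illustrative; the python post processor's "function" option takes a "module:function" path):

"python-flag":
    {
      "name": "python",
      "event": "prepare",
      "function": "myhooks:check_post"
    }

with a myhooks.py along the lines of:

seen_ids = set()

def check_post(metadata):
    # record a decision that a later filter or hook could consult
    metadata["_duplicate"] = metadata.get("id") in seen_ids
    seen_ids.add(metadata.get("id"))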

@pt3rrorduck
As it turns out, there has been a similar issue in the past (#2446) where I've added a contains() function as a workaround:

"image-filter": "contains(tags, ['tag1', 'tag2', 'tag3'])"

Python can't see local variables like tags inside a generator expression like any(…), it seems.
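The underlying CPython behavior is easy to reproduce outside gallery-dl: names supplied through eval's namespace are visible to the outermost iterable of a generator expression, but not inside its body:

ns = {"tags": ["tag1"]}
# works: 'tags' is the genexpr's outer iterable, looked up in eval's own frame
eval("any(t == 'tag1' for t in tags)", {}, ns)
# NameError: inside the genexpr body, 'tags' is looked up as a global
eval("any('tag1' in tags for t in [0])", {}, ns)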

@biggestsonicfan commented May 26, 2024

@biggestsonicfan Do you mean all collected metadata of every file/post? What exactly do you want to accomplish?

Whatever it is, you're most likely best off using a python post processor to run a custom Python function, where you could theoretically also set a flag that could prevent any further post processors from running.

Yes, that's what I intend to do. I have a lot of kemono data without metadata, and a lot of incorrectly named files because of that. Currently I download the metadata for all revisions of a post (which can reach upwards of 20,000 JSON files per post) and then try to match the partial filename and hash against the JSON metadata. I then delete the 19,999 unused JSON metadata files. It would be nice to just send the JSON to Python, check if it's the file I am looking for, and if so, rename it and dump the JSON to the folder. If not, it doesn't download the JSON in the first place.

EDIT: I'm just passing data to a python script via a command post-processor but I can't seem to find the right formatted string. Would it be something like {extractor.post}?

EDIT2: I can see the json metadata can be output to stdout, but I don't see how this can be combined with an exec name.

@throwaway26425

For Instagram, is there a way to use -o include=all but at the same time exclude tagged posts?

@fireattack (Contributor) commented May 28, 2024

Just use include=posts,reels,stories,highlights,avatar? I agree it's weird to include tagged in all, but it's just a hardcoded list, so it's very easy to work around.
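That is, something like:

gallery-dl -o include=posts,reels,stories,highlights,avatar "https://www.instagram.com/USER/"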

@throwaway26425

So, there's no exclude option?

Btw, is this the right order that all uses?

include=avatar,stories,highlights,posts,reels,tagged?

@fireattack (Contributor)

It's in this order:

return self._dispatch_extractors((
    (InstagramAvatarExtractor    , base + "avatar/"),
    (InstagramStoriesExtractor   , stories),
    (InstagramHighlightsExtractor, base + "highlights/"),
    (InstagramPostsExtractor     , base + "posts/"),
    (InstagramReelsExtractor     , base + "reels/"),
    (InstagramTaggedExtractor    , base + "tagged/"),
), ("posts",))
