A command-line tool and Python interface to Archive.org.


A Python and Command-Line Interface to Archive.org

This package installs a command-line tool named ia for using Archive.org from the command line. It also installs the internetarchive Python module for programmatic access to Archive.org. Please report all bugs and issues on GitHub.


You can install this module via pip:

$ pip install internetarchive

Binaries of the command-line tool are also available:

$ curl -LO https://archive.org/download/ia-pex/ia
$ chmod +x ia
$ ./ia help


You can configure both the ia command-line tool and the Python interface from the command-line:

$ ia configure

You will be prompted to enter your Archive.org login credentials. If authorization succeeds, a config file containing your Archive.org S3 keys is saved on your computer. These keys are used for uploading files and modifying metadata.

Command-Line Usage

Help is available via ia help. You can also get help on a specific command: ia help <command>. Available commands:

help      Retrieve help for subcommands.
configure Configure `ia`.
metadata  Retrieve and modify metadata for items on Archive.org.
upload    Upload items to Archive.org.
download  Download files from Archive.org.
delete    Delete files from Archive.org.
search    Search items on Archive.org.
tasks     Retrieve information about your Archive.org catalog tasks.
list      List files in a given item.


You can use ia to read and write metadata from Archive.org. To retrieve all of an item's metadata in JSON, run:

$ ia metadata TripDown1905

You can also modify metadata after configuring ia.

$ ia metadata <identifier> --modify="foo:bar" --modify="baz:foooo"

See ia help metadata for more details.


ia can also be used to upload items to Archive.org. After configuring ia, you can upload files like so:

$ ia upload <identifier> file1 file2 --metadata="title:foo" --metadata="blah:arg"

You can upload files from stdin:

$ curl http://dumps.wikimedia.org/kywiki/20130927/kywiki-20130927-pages-logging.xml.gz \
  | ia upload <identifier> - --remote-name=kywiki-20130927-pages-logging.xml.gz --metadata="title:Uploaded from stdin."

You can use the --retries parameter to retry on errors (e.g. if IA-S3 is overloaded):

$ ia upload <identifier> file1 --retries 10
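
When uploading from Python rather than from the command line, a similar retry loop can be written by hand. The following is a minimal sketch, not the way ia implements --retries: the helper name call_with_retries, the doubling wait between attempts, and the assumption that a failed upload raises an exception are all illustrative.

```python
import time


def call_with_retries(func, retries=10, delay=1.0, sleep=time.sleep):
    """Call func() until it succeeds, retrying up to `retries` times on error.

    The wait doubles after each failed attempt. Once the retries are used
    up, the last error is re-raised to the caller.
    """
    attempt = 0
    while True:
        try:
            return func()
        except Exception:
            if attempt >= retries:
                raise
            # Hypothetical backoff policy: delay, 2*delay, 4*delay, ...
            sleep(delay * (2 ** attempt))
            attempt += 1
```

With a configured library, an upload could then be wrapped as call_with_retries(lambda: item.upload('/path/to/image.jpg'), retries=10). The sleep argument exists only so the wait can be replaced when experimenting.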

See ia help upload for more details.


Download an entire item:

$ ia download TripDown1905

Download specific files from an item:

$ ia download TripDown1905 TripDown1905_512kb.mp4 TripDown1905.ogv

Download specific files matching a glob pattern:

$ ia download TripDown1905 --glob='*.mp4'
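
To preview which file names a glob pattern would select, the matching can be tried out with Python's standard fnmatch module. This is only an illustration of shell-style glob matching; whether ia uses the same matching rules internally is an assumption, and select_files is a hypothetical helper.

```python
from fnmatch import fnmatchcase


def select_files(names, pattern):
    """Return the file names that match a shell-style glob pattern."""
    # fnmatchcase compares case-sensitively on every platform.
    return [name for name in names if fnmatchcase(name, pattern)]


names = ["TripDown1905_512kb.mp4", "TripDown1905.ogv"]
print(select_files(names, "*.mp4"))  # prints ['TripDown1905_512kb.mp4']
```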

Download only files of a specific format:

$ ia download TripDown1905 --format='512Kb MPEG4'

You can list the formats available in a given item like so:

$ ia metadata --formats TripDown1905

Download an entire collection:

$ ia download --search 'collection:freemusicarchive'

Download from an itemlist:

$ ia download --itemlist itemlist.txt
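
The same loop can be sketched in Python with get_item. The sketch assumes the itemlist is a plain text file with one identifier per line; read_itemlist and download_itemlist are hypothetical helper names, not part of the library.

```python
def read_itemlist(path):
    """Return the identifiers listed in an itemlist file, skipping blank lines."""
    with open(path) as fh:
        return [line.strip() for line in fh if line.strip()]


def download_itemlist(path):
    """Download every item named in the itemlist at `path`."""
    # Imported here so read_itemlist() works without the library installed.
    from internetarchive import get_item

    for identifier in read_itemlist(path):
        get_item(identifier).download()
```

Calling download_itemlist('itemlist.txt') downloads the items one after another, each into the location item.download() uses by default.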

See ia help download for more details.


You can use ia to delete files from Archive.org items:

$ ia delete <identifier> <file>

Delete a file and all files derived from the specified file:

$ ia delete <identifier> <file> --cascade

Delete all files in an item:

$ ia delete <identifier> --all

See ia help delete for more details.


ia can also be used for retrieving Archive.org search results in JSON:

$ ia search 'subject:"market street" collection:prelinger'

By default, ia search attempts to return all items matching the search criteria, sorted by item identifier. To select only the top n items, specify the page and rows parameters. For example, to get the top 20 items matching the search 'dogs':

$ ia search --parameters="page:1;rows:20" "dogs"
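
When the command is generated from a script, the value of --parameters can be built from a dict. The sketch below only reproduces the "key:value;key:value" shape shown above; format_parameters is a hypothetical helper, and any parameter names other than page and rows would be assumptions.

```python
def format_parameters(params):
    """Join parameters as "key:value" pairs separated by semicolons."""
    return ";".join("{}:{}".format(key, value) for key, value in params.items())


print(format_parameters({"page": 1, "rows": 20}))  # prints page:1;rows:20
```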

You can use ia search to create an itemlist:

$ ia search 'collection:freemusicarchive' --itemlist > itemlist.txt

You can pipe your itemlist into a GNU Parallel command to download items concurrently:

$ ia search 'collection:freemusicarchive' --itemlist | parallel 'ia download {}'
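
If GNU Parallel is not available, a comparable effect can be sketched in Python with a thread pool from the standard library. This is an illustration under assumptions: run_concurrently and download_item are hypothetical helpers, the pool size of 4 is arbitrary, and whether the library is safe to use from several threads at once has not been verified here (GNU Parallel itself runs separate processes).

```python
from concurrent.futures import ThreadPoolExecutor


def download_item(identifier):
    """Download one item; requires the internetarchive library."""
    # Imported lazily so run_concurrently() can be tried with other workers.
    from internetarchive import get_item

    get_item(identifier).download()
    return identifier


def run_concurrently(identifiers, worker=download_item, max_workers=4):
    """Apply `worker` to each identifier using a pool of threads.

    Results are returned in the same order as `identifiers`. An exception
    raised by a worker is re-raised when its result is collected.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(worker, identifiers))
```

For example, run_concurrently(['TripDown1905', 'stairs']) would download both items with up to four downloads in flight.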

See ia help search for more details.


You can also use ia to retrieve information about your catalog tasks, after configuring ia. To retrieve the task history for an item, simply run:

$ ia tasks <identifier>

View all of your queued and running Archive.org tasks:

$ ia tasks

See ia help tasks for more details.


You can list files in an item like so:

$ ia list goodytwoshoes00newyiala

See ia help list for more details.

Python module usage

Below is a brief overview of the internetarchive Python library. Please refer to the API documentation for more specific details.

Downloading from Python

The Internet Archive stores data in items. You can query the archive using an item identifier:

>>> from internetarchive import get_item
>>> item = get_item('stairs')
>>> print(item.metadata)

Items contain files. You can download the entire item:

>>> item.download()

or you can download just a particular file:

>>> f = item.get_file('glogo.png')
>>> f.download()
>>> f.download('/foo/bar/some_other_name.png')

Uploading from Python

You can use the IA's S3-like interface to upload files to an item after configuring the internetarchive library.

>>> from internetarchive import get_item
>>> item = get_item('new_identifier')
>>> md = dict(mediatype='image', creator='Jake Johnson')
>>> item.upload('/path/to/image.jpg', metadata=md)

Item-level metadata must be supplied with the first file uploaded to an item.

You can upload additional files to an existing item:

>>> item = get_item('existing_identifier')
>>> item.upload(['/path/to/image2.jpg', '/path/to/image3.jpg'])

You can also upload file-like objects:

>>> from io import BytesIO
>>> fh = BytesIO(b'hello world')
>>> fh.name = 'hello_world.txt'
>>> item.upload(fh)

Modifying Metadata from Python

You can modify metadata for existing items using the item.modify_metadata() method. This uses the IA Metadata API under the hood and requires your IA-S3 credentials, so make sure the internetarchive library is configured first.

>>> from internetarchive import get_item
>>> item = get_item('my_identifier')
>>> md = dict(blah='one', foo=['two', 'three'])
>>> item.modify_metadata(md)
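
If the new values arrive as "key:value" strings, as they do with --modify on the command line, they can be turned into a dict of this shape first. The following is a minimal sketch: parse_pairs is a hypothetical helper, and collecting repeated keys into a list is an assumption made to match the foo=['two', 'three'] example above, not documented library behavior.

```python
def parse_pairs(pairs):
    """Turn ["key:value", ...] strings into a metadata dict.

    A key that appears more than once collects its values in a list.
    A string without a colon raises ValueError.
    """
    metadata = {}
    for pair in pairs:
        # Split on the first colon only, so values may contain colons.
        key, value = pair.split(":", 1)
        if key in metadata:
            existing = metadata[key]
            if not isinstance(existing, list):
                existing = [existing]
            metadata[key] = existing + [value]
        else:
            metadata[key] = value
    return metadata


md = parse_pairs(["blah:one", "foo:two", "foo:three"])
print(md)  # prints {'blah': 'one', 'foo': ['two', 'three']}
```

The resulting dict can be passed to item.modify_metadata(md) as in the example above.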

Searching from Python

You can search for items using the archive.org advanced search engine:

>>> from internetarchive import search_items
>>> search = search_items('collection:nasa')
>>> print(search.num_found)

You can iterate over your results:

>>> for result in search:
...     print(result['identifier'])
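
The identifiers can also be saved to a file, similar to what ia search --itemlist prints on the command line. This is a minimal sketch assuming each result behaves like a dict with an 'identifier' key, as in the loop above; write_itemlist is a hypothetical helper, not a library function.

```python
def write_itemlist(results, path):
    """Write the identifier of each search result to `path`, one per line.

    `results` is any iterable of dict-like results with an 'identifier' key.
    Returns the number of identifiers written.
    """
    count = 0
    with open(path, "w") as fh:
        for result in results:
            fh.write(result["identifier"] + "\n")
            count += 1
    return count
```

With the library configured, write_itemlist(search_items('collection:nasa'), 'itemlist.txt') would produce a file with one identifier per line.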