Skip to content

Commit

Permalink
#1 Cleaned up for initial release.
Browse files Browse the repository at this point in the history
  • Loading branch information
roskakori committed Apr 11, 2020
1 parent 814ba1c commit ff7eb2a
Show file tree
Hide file tree
Showing 3 changed files with 142 additions and 8 deletions.
145 changes: 140 additions & 5 deletions README.md
Original file line number Diff line number Diff line change
@@ -1,10 +1,16 @@
# pimdb

Pimdb is a python package and collection of command line utilities to maintain
a local copy of the essential parts of the
[Internet Movie Database](https://imdb.com) (IMDb) based in the CSV files
Pimdb is a python package and command line utility to maintain a local copy of
the essential parts of the
[Internet Movie Database](https://imdb.com) (IMDb) based in the TSV files
available from [IMDb datasets](https://www.imdb.com/interfaces/).


## License

The [IMDb datasets](https://www.imdb.com/interfaces/) are only available for
personal and non-commercial use. For details refer to the previous link.

Pimdb is open source and distributed under the
[BSD license](https://opensource.org/licenses/BSD-3-Clause). The source
code is available from https://github.com/roskakori/pimdb.
Expand All @@ -19,13 +25,142 @@ installed using:
$ pip install pimdb
```


## Quick start

TODO

### Downloading datasets

To download the current IMDb datsets to the current folder, run:

```bash
pimdb download all
```

(This downloads about 1 GB of data and might take a couple of minutes).


### Transferring datasets into tables

To import them in a local SQLite database `pimdb.db` located in the current
folder, run:

```bash
pimdb transfer all
```

(This will take a while. On a reasonably modern laptop with a local database
you can expect about 2 hours).

The resulting database contains one tables for each dataset. The table names
are PascalCase variants of the dataset name. For example, the date from the
dataset `title.basics` are stored in the table `TitleBasics`. The column names
in the table match the names from the datasets, for example
`TitleBasics.primaryTitle`. A short description of all the datasets and
columns can be found at the download page for the
[IMDb datasets](https://www.imdb.com/interfaces/).


### Querying tables

To query the tables, you can use any database tool that supports SQLite, for
example the freely available and platform independent community edition of
[DBeaver](https://dbeaver.io/) or the
[command line shell for SQLite](https://sqlite.org/cli.html).


### Databases other than SQLite

Optionally you can specify a different database using the `--database` option
with an
[SQLAlchemy engine configuration](https://docs.sqlalchemy.org/en/13/core/engines.html),
which generally uses the template
"dialect+driver://username:password@host:port/database". SQLAlchemy supports
several SQL [dialects](https://docs.sqlalchemy.org/en/13/dialects/index.html)
out of the box, and there are external dialects available for other
SQL databases and other forms of tabular data such as
[pydruid](https://github.com/druid-io/pydruid) (for Pandas),
[PyHive](https://github.com/dropbox/PyHive#sqlalchemy) (for Presto and Hive)
or [Solr](https://github.com/aadel/sqlalchemy-solr) (for the Solr search
platform).

Here's an example for using a [PostgreSQL](https://www.postgresql.org/)
database:

```bash
pimdb transfer --database "postgresql://user:password@localhost:5432/mydatabase" all
```


### Building normalized tables

The tables so far are almost verbatim copies of the IMDb datasets with the
exception that possible duplicate rows have been removed. This means that
`NameBasics.nconst` and `TitleBasics.tconst` are unique, which sadly is not
always (but still sometimes) the case for the datasets in the `.tsv.gz` files.

This data model already allows to perform several kinds of queries quite
easily and efficiently.

However, the IMDb datasets do not offer a simple way to query N:M relations.
For example, the column `NameBasics.knownForTitles` contains a comma separated
list of tconsts like "tt2076794,tt0116514,tt0118577,tt0086491".

To perform such queries efficiently you can build strictly normalized tables
derived from the dataset tables by running:

```bash
pimdb build
```
If you did specify a `--database` for the `transfer` command before, you have to
specify the same value for `build` in order to find the source data. These tables
generally use snake_case names for both tables and columns, for example
`title_allias.is_original`.


## Querying normalized tables

N:M relations are stored in tables using the naming template `some_to_other`,
for example `name_to_known_for_title`. These relation tables contain only the
numeric ID's to the respective actual data and a numeric column `ordering` to
remember the sort order of the comma separated list in the IMDb dataset column.

For example, here is an SQL query to list the titles Alan Smithee is known
for:

```sql
select
title.primary_title,
title.start_year
from
name_to_known_for_title
join name on
name.id = name_to_known_for_title.name_id
join title on
title.id = name_to_known_for_title.title_id
where
name.primary_name = 'Alan Smithee'
```


## Reference

To get an overview of general command line option and available commands run:

```bash
pimdb --help
```

To learn the available command line options for a specific command run for
example:

```bash
pimdb transfer --help
```


## Changes

Version 0.0.1, 2020-04-xx
Version 0.1.0, 2020-04-11

* Initial public release.
2 changes: 1 addition & 1 deletion pimdb/__init__.py
Original file line number Diff line number Diff line change
@@ -1 +1 @@
__version__ = "0.0.1"
__version__ = "0.1.0"
3 changes: 1 addition & 2 deletions setup.cfg
Original file line number Diff line number Diff line change
Expand Up @@ -5,7 +5,6 @@ max-line-length = 120
[tool:pytest]
addopts = --cov=pimdb --cov=tests


[metadata]
license_file = LICENSE.txt
name = pimdb
Expand All @@ -19,7 +18,7 @@ author = Thomas Aglassinger
author_email = roskakori@users.sourceforge.net
license = BSD
classifiers =
Development Status :: 4 - Beta
Development Status :: 3 - Alpha
Environment :: Console
Intended Audience :: Developers
Intended Audience :: Science/Research
Expand Down

0 comments on commit ff7eb2a

Please sign in to comment.