Add TypeTester to tutorial and API docs. Closes #213.
onyxfish committed Sep 6, 2015
1 parent 145c0f5 commit 79d3770
Showing 9 changed files with 133 additions and 93 deletions.
1 change: 1 addition & 0 deletions .gitignore
@@ -1,3 +1,4 @@
.DS_Store
*.pyc
*.swp
*.swo
1 change: 0 additions & 1 deletion agate/__init__.py
@@ -5,7 +5,6 @@
from agate.column_types import *
from agate.computations import *
from agate.exceptions import *
from agate.inference import TypeTester
from agate.table import Table
from agate.tableset import TableSet

83 changes: 80 additions & 3 deletions agate/column_types/__init__.py
@@ -1,17 +1,94 @@
#!/usr/bin/env python

"""
This module contains the :class:`ColumnType` class and its subclasses. These
This module contains the :class:`.ColumnType` class and its subclasses. These
types define how data should be converted during the creation of a
:class:`.Table`. Each subclass of :class:`ColumnType` is associated with a
:class:`.Table`. Each subclass of :class:`.ColumnType` is associated with a
subclass of :class:`.Column`. For instance, specifying that data is of
:class:`NumberType` will cause a :class:`.NumberColumn` to be created on the
:class:`.NumberType` will cause a :class:`.NumberColumn` to be created on the
table.
A :class:`TypeTester` class is also included which can be used to infer column
types from data.
"""

from copy import copy

try:
    from collections import OrderedDict
except ImportError: # pragma: no cover
    from ordereddict import OrderedDict

from agate.column_types.base import *
from agate.column_types.boolean import *
from agate.column_types.date_time import *
from agate.column_types.number import *
from agate.column_types.text import *
from agate.column_types.time_delta import *

from agate.exceptions import *

class TypeTester(object):
    """
    Infer types for the columns in a given set of data.

    :param force: A dictionary where each key is a column name and each
        value is a :class:`.ColumnType` instance that overrides inference.
    """
    def __init__(self, force={}):
        self._force = force

        # In order of preference
        self._possible_types = [
            BooleanType(),
            NumberType(),
            TimeDeltaType(),
            DateTimeType(),
            TextType()
        ]

    def run(self, rows, column_names):
        """
        Apply inference to the provided data and return an array of
        :code:`(column_name, column_type)` tuples suitable as an argument to
        :class:`.Table`.

        :param rows: The data as a sequence of any sequences: tuples, lists,
            etc.
        :param column_names: A sequence of column names.
        """
        num_columns = len(column_names)
        hypotheses = [set(self._possible_types) for i in range(num_columns)]

        force_indices = [column_names.index(name) for name in self._force.keys()]

        for row in rows:
            for i in range(num_columns):
                if i in force_indices:
                    continue

                h = hypotheses[i]

                if len(h) == 1:
                    continue

                for column_type in copy(h):
                    if not column_type.test(row[i]):
                        h.remove(column_type)

        column_types = []

        for i in range(num_columns):
            if i in force_indices:
                column_types.append(self._force[column_names[i]])
                continue

            h = hypotheses[i]

            # Select in order of preference
            for t in self._possible_types:
                if t in h:
                    column_types.append(t)
                    break

        return zip(column_names, column_types)
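
The elimination strategy in `TypeTester.run` can be illustrated without agate. The following is a minimal sketch, not the library's code: the `is_boolean`, `is_number` and `is_text` predicates, the `CANDIDATES` list and the `infer_types` function are hypothetical stand-ins for the `ColumnType.test` methods and use plain strings as type labels. Each column starts with every candidate type, each row removes the candidates that fail, and the first survivor in preference order is chosen.

```python
from copy import copy


def is_boolean(value):
    # Hypothetical test: accepts a few common boolean spellings and blanks.
    return value.strip().lower() in ('true', 'false', 'yes', 'no', '')


def is_number(value):
    # Hypothetical test: blank or anything float() can parse.
    if value.strip() == '':
        return True
    try:
        float(value)
        return True
    except ValueError:
        return False


def is_text(value):
    # Text accepts everything, so it is always the fallback.
    return True


# Candidate types in order of preference (most specific first).
CANDIDATES = [('boolean', is_boolean), ('number', is_number), ('text', is_text)]


def infer_types(rows, column_names, force=None):
    force = force or {}
    tests = dict(CANDIDATES)
    hypotheses = [set(tests) for _ in column_names]

    for row in rows:
        for i, name in enumerate(column_names):
            # Forced columns are never tested; settled columns are skipped.
            if name in force or len(hypotheses[i]) == 1:
                continue
            for candidate in copy(hypotheses[i]):
                if not tests[candidate](row[i]):
                    hypotheses[i].discard(candidate)

    result = []
    for i, name in enumerate(column_names):
        if name in force:
            result.append((name, force[name]))
            continue
        # Pick the first surviving candidate in preference order.
        for candidate, _ in CANDIDATES:
            if candidate in hypotheses[i]:
                result.append((name, candidate))
                break
    return result


rows = [('Ann', '34', 'true'), ('Bob', '7.5', 'no')]
names = ['name', 'age', 'exonerated']
inferred = infer_types(rows, names)
forced = infer_types(rows, names, force={'age': 'text'})
print(inferred)
print(forced)
```

Because the text test accepts every value, each column always keeps at least one hypothesis, so the selection loop always finds a type.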
76 changes: 0 additions & 76 deletions agate/inference.py

This file was deleted.

2 changes: 1 addition & 1 deletion agate/table.py
@@ -35,9 +35,9 @@

from agate.aggregations import Sum, Mean, Median, StDev, MAD
from agate.columns.base import ColumnMapping
from agate.column_types import TypeTester
from agate.computations import Computation
from agate.exceptions import ColumnDoesNotExistError, RowDoesNotExistError
from agate.inference import TypeTester
from agate.rows import RowSequence, Row
from agate.tableset import TableSet
from agate.utils import NullOrder, memoize
3 changes: 1 addition & 2 deletions agate/tableset.py
@@ -34,9 +34,8 @@
from ordereddict import OrderedDict

from agate.aggregations import Aggregation
from agate.column_types import *
from agate.column_types import TextType, TypeTester
from agate.exceptions import ColumnDoesNotExistError
from agate.inference import TypeTester
from agate.rows import RowSequence

class TableMethodProxy(object):
25 changes: 25 additions & 0 deletions docs/api/column_types.rst
@@ -5,3 +5,28 @@ agate.column_types
.. automodule:: agate.column_types
    :members:
    :undoc-members:

.. automodule:: agate.column_types.base
    :members:
    :undoc-members:
    :show-inheritance:

.. automodule:: agate.column_types.boolean
    :members:
    :undoc-members:
    :show-inheritance:

.. automodule:: agate.column_types.date_time
    :members:
    :undoc-members:
    :show-inheritance:

.. automodule:: agate.column_types.number
    :members:
    :undoc-members:
    :show-inheritance:

.. automodule:: agate.column_types.text
    :members:
    :undoc-members:
    :show-inheritance:
34 changes: 25 additions & 9 deletions docs/tutorial.rst
@@ -57,21 +57,27 @@ Now let's import our dependencies:
Defining the columns
====================

agate requires us to give it some information about each column in our dataset. No effort is made to determine these types automatically, however, :class:`.TextType` is always a safe choice if you aren't sure what kind of data is in a column.
There are two ways to specify column types in agate. You can specify each column's type individually, which gives you complete control over how the data is processed, or you can use agate's :class:`.TypeTester` to infer types from the data. The latter is more convenient, but it is imperfect, so it's wise to check that the types it infers are reasonable. (For instance, some date formats look exactly like numbers and some numbers are really text.)
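
The caveat about inference is easy to reproduce with the standard library alone. This is only an illustration, not agate code, and the sample values are made up: a ZIP code and a date written without separators both parse as valid numbers, so a number test cannot tell them apart from real quantities.

```python
from decimal import Decimal

# A ZIP code is an identifier, but it parses cleanly as a number,
# and the leading zero is lost in the process.
zip_code = '02134'
zip_as_number = Decimal(zip_code)

# A date written without separators is also a valid number.
compact_date = '20150828'
date_as_number = Decimal(compact_date)

print(zip_as_number)
print(date_as_number)
```

Columns like these are good candidates for specifying a type explicitly rather than relying on inference.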

First we create instances of the column types we will be using:
You can create a :class:`.TypeTester` like this:

.. code-block:: python

    tester = agate.TypeTester()

If you prefer to specify your columns manually you will need to create instances of each type that you are using:

.. code-block:: python

    text_type = agate.TextType()
    number_type = agate.NumberType()
    boolean_type = agate.BooleanType()

Then we define the names and types of the columns that are in our dataset:
Then you define the names and types of the columns that are in our dataset as a sequence of pairs. For the exonerations dataset, you would define:

.. code-block:: python

    COLUMNS = (
    columns = (
        ('last_name', text_type),
        ('first_name', text_type),
        ('age', number_type),
@@ -92,20 +98,30 @@ Then we define the names and types of the columns that are in our dataset:
        ('inadequate_defense', boolean_type),
    )

You'll notice here that we define the names and types as pairs (tuples), which is what the :class:`.Table` constructor will expect in the next step.

.. note::

The column names defined here do not necessarily need to match those found in your CSV file. I've kept them consistent in this example for clarity.
If you specify column names manually, they do not need to match those found in your CSV file. I've kept them consistent in this example for clarity. If you use :class:`.TypeTester`, column names will be inferred from the headers of your CSV.

Loading data from a CSV
=======================

The :class:`.Table` is the basic class in agate. A time-saving method is included to load table data from CSV:
The :class:`.Table` is the basic class in agate. A time-saving method is included to create a table from CSV. To infer column types automatically while reading the data:

.. code-block:: python

    exonerations = agate.Table.from_csv('exonerations-20150828.csv', COLUMNS)
    exonerations = agate.Table.from_csv('exonerations-20150828.csv', tester)

.. note::

The :class:`.TypeTester` can be slow to evaluate the data. It's best to use it with a tool such as `proof <http://proof.readthedocs.org/en/latest/>`_ so you don't have to run it every time you work with your data.

Or, to use the column types we created manually:

.. code-block:: python

    exonerations = agate.Table.from_csv('exonerations-20150828.csv', columns)

In either case the ``exonerations`` variable will now be an instance of :class:`.Table`.

.. note::

1 change: 0 additions & 1 deletion tests/test_inference.py → tests/test_column_types.py
@@ -6,7 +6,6 @@
import unittest

from agate.column_types import *
from agate.inference import TypeTester
from agate.table import Table
from agate.tableset import TableSet
