Skip to content

Commit

Permalink
Merge git://git.kernel.org/pub/scm/linux/kernel/git/agk/linux-2.6-dm
Browse files Browse the repository at this point in the history
* git://git.kernel.org/pub/scm/linux/kernel/git/agk/linux-2.6-dm: (48 commits)
  dm mpath: change to be request based
  dm: disable interrupt when taking map_lock
  dm: do not set QUEUE_ORDERED_DRAIN if request based
  dm: enable request based option
  dm: prepare for request based option
  dm raid1: add userspace log
  dm: calculate queue limits during resume not load
  dm log: fix create_log_context to use logical_block_size of log device
  dm target:s introduce iterate devices fn
  dm table: establish queue limits by copying table limits
  dm table: replace struct io_restrictions with struct queue_limits
  dm table: validate device logical_block_size
  dm table: ensure targets are aligned to logical_block_size
  dm ioctl: support cookies for udev
  dm: sysfs add suspended attribute
  dm table: improve warning message when devices not freed before destruction
  dm mpath: add service time load balancer
  dm mpath: add queue length load balancer
  dm mpath: add start_io and nr_bytes to path selectors
  dm snapshot: use barrier when writing exception store
  ...
  • Loading branch information
torvalds committed Jun 24, 2009
2 parents ea94b50 + f40c67f commit c3cb5e1
Show file tree
Hide file tree
Showing 35 changed files with 3,993 additions and 435 deletions.
54 changes: 54 additions & 0 deletions Documentation/device-mapper/dm-log.txt
@@ -0,0 +1,54 @@
Device-Mapper Logging
=====================
The device-mapper logging code is used by some of the device-mapper
RAID targets to track regions of the disk that are not consistent.
A region (or portion of the address space) of the disk may be
inconsistent because a RAID stripe is currently being operated on or
a machine died while the region was being altered. In the case of
mirrors, a region would be considered dirty/inconsistent while you
are writing to it because the writes need to be replicated for all
the legs of the mirror and may not reach the legs at the same time.
Once all writes are complete, the region is considered clean again.

There is a generic logging interface that the device-mapper RAID
implementations use to perform logging operations (see
dm_dirty_log_type in include/linux/dm-dirty-log.h). Various different
logging implementations are available and provide different
capabilities. The list includes:

Type Files
==== =====
disk drivers/md/dm-log.c
core drivers/md/dm-log.c
userspace drivers/md/dm-log-userspace* include/linux/dm-log-userspace.h

The "disk" log type
-------------------
This log implementation commits the log state to disk. This way, the
logging state survives reboots/crashes.

The "core" log type
-------------------
This log implementation keeps the log state in memory. The log state
will not survive a reboot or crash, but there may be a small boost in
performance. This method can also be used if no storage device is
available for storing log state.

The "userspace" log type
------------------------
This log type simply provides a way to export the log API to userspace,
so log implementations can be done there. This is done by forwarding most
logging requests to userspace, where a daemon receives and processes the
request.

The structure used for communication between kernel and userspace are
located in include/linux/dm-log-userspace.h. Due to the frequency,
diversity, and 2-way communication nature of the exchanges between
kernel and userspace, 'connector' is used as the interface for
communication.

There are currently two userspace log implementations that leverage this
framework - "clustered_disk" and "clustered_core". These implementations
provide a cluster-coherent log for shared-storage. Device-mapper mirroring
can be used in a shared-storage environment when the cluster log implementations
are employed.
39 changes: 39 additions & 0 deletions Documentation/device-mapper/dm-queue-length.txt
@@ -0,0 +1,39 @@
dm-queue-length
===============

dm-queue-length is a path selector module for device-mapper targets,
which selects a path with the least number of in-flight I/Os.
The path selector name is 'queue-length'.

Table parameters for each path: [<repeat_count>]
<repeat_count>: The number of I/Os to dispatch using the selected
path before switching to the next path.
If not given, internal default is used. To check
the default value, see the activated table.

Status for each path: <status> <fail-count> <in-flight>
<status>: 'A' if the path is active, 'F' if the path is failed.
<fail-count>: The number of path failures.
<in-flight>: The number of in-flight I/Os on the path.


Algorithm
=========

dm-queue-length increments/decrements 'in-flight' when an I/O is
dispatched/completed respectively.
dm-queue-length selects a path with the minimum 'in-flight'.


Examples
========
In case that 2 paths (sda and sdb) are used with repeat_count == 128.

# echo "0 10 multipath 0 0 1 1 queue-length 0 2 1 8:0 128 8:16 128" \
dmsetup create test
#
# dmsetup table
test: 0 10 multipath 0 0 1 1 queue-length 0 2 1 8:0 128 8:16 128
#
# dmsetup status
test: 0 10 multipath 2 0 0 0 1 1 E 0 2 1 8:0 A 0 0 8:16 A 0 0
91 changes: 91 additions & 0 deletions Documentation/device-mapper/dm-service-time.txt
@@ -0,0 +1,91 @@
dm-service-time
===============

dm-service-time is a path selector module for device-mapper targets,
which selects a path with the shortest estimated service time for
the incoming I/O.

The service time for each path is estimated by dividing the total size
of in-flight I/Os on a path with the performance value of the path.
The performance value is a relative throughput value among all paths
in a path-group, and it can be specified as a table argument.

The path selector name is 'service-time'.

Table parameters for each path: [<repeat_count> [<relative_throughput>]]
<repeat_count>: The number of I/Os to dispatch using the selected
path before switching to the next path.
If not given, internal default is used. To check
the default value, see the activated table.
<relative_throughput>: The relative throughput value of the path
among all paths in the path-group.
The valid range is 0-100.
If not given, minimum value '1' is used.
If '0' is given, the path isn't selected while
other paths having a positive value are available.

Status for each path: <status> <fail-count> <in-flight-size> \
<relative_throughput>
<status>: 'A' if the path is active, 'F' if the path is failed.
<fail-count>: The number of path failures.
<in-flight-size>: The size of in-flight I/Os on the path.
<relative_throughput>: The relative throughput value of the path
among all paths in the path-group.


Algorithm
=========

dm-service-time adds the I/O size to 'in-flight-size' when the I/O is
dispatched and substracts when completed.
Basically, dm-service-time selects a path having minimum service time
which is calculated by:

('in-flight-size' + 'size-of-incoming-io') / 'relative_throughput'

However, some optimizations below are used to reduce the calculation
as much as possible.

1. If the paths have the same 'relative_throughput', skip
the division and just compare the 'in-flight-size'.

2. If the paths have the same 'in-flight-size', skip the division
and just compare the 'relative_throughput'.

3. If some paths have non-zero 'relative_throughput' and others
have zero 'relative_throughput', ignore those paths with zero
'relative_throughput'.

If such optimizations can't be applied, calculate service time, and
compare service time.
If calculated service time is equal, the path having maximum
'relative_throughput' may be better. So compare 'relative_throughput'
then.


Examples
========
In case that 2 paths (sda and sdb) are used with repeat_count == 128
and sda has an average throughput 1GB/s and sdb has 4GB/s,
'relative_throughput' value may be '1' for sda and '4' for sdb.

# echo "0 10 multipath 0 0 1 1 service-time 0 2 2 8:0 128 1 8:16 128 4" \
dmsetup create test
#
# dmsetup table
test: 0 10 multipath 0 0 1 1 service-time 0 2 2 8:0 128 1 8:16 128 4
#
# dmsetup status
test: 0 10 multipath 2 0 0 0 1 1 E 0 2 2 8:0 A 0 0 1 8:16 A 0 0 4


Or '2' for sda and '8' for sdb would be also true.

# echo "0 10 multipath 0 0 1 1 service-time 0 2 2 8:0 128 2 8:16 128 8" \
dmsetup create test
#
# dmsetup table
test: 0 10 multipath 0 0 1 1 service-time 0 2 2 8:0 128 2 8:16 128 8
#
# dmsetup status
test: 0 10 multipath 2 0 0 0 1 1 E 0 2 2 8:0 A 0 0 2 8:16 A 0 0 8
30 changes: 30 additions & 0 deletions drivers/md/Kconfig
Expand Up @@ -231,6 +231,17 @@ config DM_MIRROR
Allow volume managers to mirror logical volumes, also
needed for live data migration tools such as 'pvmove'.

config DM_LOG_USERSPACE
tristate "Mirror userspace logging (EXPERIMENTAL)"
depends on DM_MIRROR && EXPERIMENTAL && NET
select CONNECTOR
---help---
The userspace logging module provides a mechanism for
relaying the dm-dirty-log API to userspace. Log designs
which are more suited to userspace implementation (e.g.
shared storage logs) or experimental logs can be implemented
by leveraging this framework.

config DM_ZERO
tristate "Zero target"
depends on BLK_DEV_DM
Expand All @@ -249,6 +260,25 @@ config DM_MULTIPATH
---help---
Allow volume managers to support multipath hardware.

config DM_MULTIPATH_QL
tristate "I/O Path Selector based on the number of in-flight I/Os"
depends on DM_MULTIPATH
---help---
This path selector is a dynamic load balancer which selects
the path with the least number of in-flight I/Os.

If unsure, say N.

config DM_MULTIPATH_ST
tristate "I/O Path Selector based on the service time"
depends on DM_MULTIPATH
---help---
This path selector is a dynamic load balancer which selects
the path expected to complete the incoming I/O in the shortest
time.

If unsure, say N.

config DM_DELAY
tristate "I/O delaying target (EXPERIMENTAL)"
depends on BLK_DEV_DM && EXPERIMENTAL
Expand Down
5 changes: 5 additions & 0 deletions drivers/md/Makefile
Expand Up @@ -8,6 +8,8 @@ dm-multipath-y += dm-path-selector.o dm-mpath.o
dm-snapshot-y += dm-snap.o dm-exception-store.o dm-snap-transient.o \
dm-snap-persistent.o
dm-mirror-y += dm-raid1.o
dm-log-userspace-y \
+= dm-log-userspace-base.o dm-log-userspace-transfer.o
md-mod-y += md.o bitmap.o
raid456-y += raid5.o
raid6_pq-y += raid6algos.o raid6recov.o raid6tables.o \
Expand Down Expand Up @@ -36,8 +38,11 @@ obj-$(CONFIG_BLK_DEV_DM) += dm-mod.o
obj-$(CONFIG_DM_CRYPT) += dm-crypt.o
obj-$(CONFIG_DM_DELAY) += dm-delay.o
obj-$(CONFIG_DM_MULTIPATH) += dm-multipath.o dm-round-robin.o
obj-$(CONFIG_DM_MULTIPATH_QL) += dm-queue-length.o
obj-$(CONFIG_DM_MULTIPATH_ST) += dm-service-time.o
obj-$(CONFIG_DM_SNAPSHOT) += dm-snapshot.o
obj-$(CONFIG_DM_MIRROR) += dm-mirror.o dm-log.o dm-region-hash.o
obj-$(CONFIG_DM_LOG_USERSPACE) += dm-log-userspace.o
obj-$(CONFIG_DM_ZERO) += dm-zero.o

quiet_cmd_unroll = UNROLL $@
Expand Down
19 changes: 18 additions & 1 deletion drivers/md/dm-crypt.c
Expand Up @@ -1132,6 +1132,7 @@ static int crypt_ctr(struct dm_target *ti, unsigned int argc, char **argv)
goto bad_crypt_queue;
}

ti->num_flush_requests = 1;
ti->private = cc;
return 0;

Expand Down Expand Up @@ -1189,6 +1190,13 @@ static int crypt_map(struct dm_target *ti, struct bio *bio,
union map_info *map_context)
{
struct dm_crypt_io *io;
struct crypt_config *cc;

if (unlikely(bio_empty_barrier(bio))) {
cc = ti->private;
bio->bi_bdev = cc->dev->bdev;
return DM_MAPIO_REMAPPED;
}

io = crypt_io_alloc(ti, bio, bio->bi_sector - ti->begin);

Expand Down Expand Up @@ -1305,9 +1313,17 @@ static int crypt_merge(struct dm_target *ti, struct bvec_merge_data *bvm,
return min(max_size, q->merge_bvec_fn(q, bvm, biovec));
}

static int crypt_iterate_devices(struct dm_target *ti,
iterate_devices_callout_fn fn, void *data)
{
struct crypt_config *cc = ti->private;

return fn(ti, cc->dev, cc->start, data);
}

static struct target_type crypt_target = {
.name = "crypt",
.version= {1, 6, 0},
.version = {1, 7, 0},
.module = THIS_MODULE,
.ctr = crypt_ctr,
.dtr = crypt_dtr,
Expand All @@ -1318,6 +1334,7 @@ static struct target_type crypt_target = {
.resume = crypt_resume,
.message = crypt_message,
.merge = crypt_merge,
.iterate_devices = crypt_iterate_devices,
};

static int __init dm_crypt_init(void)
Expand Down
26 changes: 23 additions & 3 deletions drivers/md/dm-delay.c
Expand Up @@ -197,6 +197,7 @@ static int delay_ctr(struct dm_target *ti, unsigned int argc, char **argv)
mutex_init(&dc->timer_lock);
atomic_set(&dc->may_delay, 1);

ti->num_flush_requests = 1;
ti->private = dc;
return 0;

Expand Down Expand Up @@ -278,8 +279,9 @@ static int delay_map(struct dm_target *ti, struct bio *bio,

if ((bio_data_dir(bio) == WRITE) && (dc->dev_write)) {
bio->bi_bdev = dc->dev_write->bdev;
bio->bi_sector = dc->start_write +
(bio->bi_sector - ti->begin);
if (bio_sectors(bio))
bio->bi_sector = dc->start_write +
(bio->bi_sector - ti->begin);

return delay_bio(dc, dc->write_delay, bio);
}
Expand Down Expand Up @@ -316,16 +318,34 @@ static int delay_status(struct dm_target *ti, status_type_t type,
return 0;
}

static int delay_iterate_devices(struct dm_target *ti,
iterate_devices_callout_fn fn, void *data)
{
struct delay_c *dc = ti->private;
int ret = 0;

ret = fn(ti, dc->dev_read, dc->start_read, data);
if (ret)
goto out;

if (dc->dev_write)
ret = fn(ti, dc->dev_write, dc->start_write, data);

out:
return ret;
}

static struct target_type delay_target = {
.name = "delay",
.version = {1, 0, 2},
.version = {1, 1, 0},
.module = THIS_MODULE,
.ctr = delay_ctr,
.dtr = delay_dtr,
.map = delay_map,
.presuspend = delay_presuspend,
.resume = delay_resume,
.status = delay_status,
.iterate_devices = delay_iterate_devices,
};

static int __init dm_delay_init(void)
Expand Down
2 changes: 1 addition & 1 deletion drivers/md/dm-exception-store.c
Expand Up @@ -216,7 +216,7 @@ int dm_exception_store_create(struct dm_target *ti, int argc, char **argv,
return -EINVAL;
}

type = get_type(argv[1]);
type = get_type(&persistent);
if (!type) {
ti->error = "Exception store type not recognised";
r = -EINVAL;
Expand Down
2 changes: 1 addition & 1 deletion drivers/md/dm-exception-store.h
Expand Up @@ -156,7 +156,7 @@ static inline void dm_consecutive_chunk_count_inc(struct dm_snap_exception *e)
*/
static inline sector_t get_dev_size(struct block_device *bdev)
{
return bdev->bd_inode->i_size >> SECTOR_SHIFT;
return i_size_read(bdev->bd_inode) >> SECTOR_SHIFT;
}

static inline chunk_t sector_to_chunk(struct dm_exception_store *store,
Expand Down

0 comments on commit c3cb5e1

Please sign in to comment.