
$$ \def\empty{\mathcal E} $$

CS206

GNU General Public License v3.0 licensed. Source available on github.com/zifeo/EPFL.

Spring 2015: Concurrency

[TOC]

Introduction

  • Moore's law : 2x transistors every 24 months
    • hardware got faster, and software grew to match
    • but since 2005 : chips get more transistors but not much faster (15% per year instead of 50%)
    • reason : keep power at 100W
  • power : proportional to operating voltage squared times clock frequency ($P \propto V^2 f$)
    • mobile cpu : few W
    • laptop : 10s of W (can burn you)
    • server : 100 W (cooling)
    • datacenter : 20 MW
  • voltages are not going down enough (lower voltage would allow higher clock frequency)
  • new Moore's law : more cores
    • simpler cores (Prius instead of Audi)
    • each core : fewer joules/op
    • need for parallelism/concurrency : software can no longer just wait for chips to get faster
  • scaling process : not intuitive
    • requires great care : synchronization
  • historical view
    • I/O : message passing, hadoop, sql
    • memory : shared memory, java threads
    • processor : dataflow, GPU, CUDA

Basic architecture

  • model
    • multiple threads (sometimes called processes)
    • objects live in shared memory
    • unpredictable asynchronous delays
      • cache misses (short)
      • page faults (long)
      • scheduling quantum used up (really long)
  • multicore architecture
    • processors : on the same chip
      • SMP : symmetric multiprocessor (no longer these days, but for this course)
      • NUMA : non-uniform memory access (organized in a grid)
      • CC-NUMA : cache-coherent
      • bottleneck : bus
      • 1 cycle : fetch and execute one instruction
    • threads
      • execution of a sequential program
      • does I/O
      • spinning : busy waiting
    • interconnect
      • broadcast medium
      • connects : processor to processor and processor to memory
      • network : tiny LAN
      • finite resource : avoid using too much bandwidth (if >50%, waiting is exponential)
    • memory
      • 100s of cycles : access
      • far apart
    • caches
      • keep memory close
      • avoid use of interconnect
      • replacement policy
      • types
        • fully associative : any line can be anywhere in the cache (flexible replace, hard to find)
        • direct mapped cache : every address has exactly 1 slot (fixed replace, easy to find)
        • K-way set associative : each slot holds k lines (flexible replace, easy to find)
  • memory hierarchies
    • L1 cache : small & fast (1-2 cycles)
    • L2 cache : larger & slower (10s of cycles ~ 128 byte block)
      • lower level can be more associative (> 16 ways) : share sets
    • latency : average memory access time
      • $\text{latency} = \text{cache time} \times (1 - \text{miss rate}) + \text{memory latency} \times \text{miss rate}$
    • misses
      • compulsory
      • conflict
      • capacity
      • coherence : read/write from multiple processors (need protocol)
        • true sharing : read/write same variable, communicate unique value in block from one thread to another
        • false sharing : read/write distinct variables that reside in the same cache block, causes extra misses and hurts performance; worse with larger (more spatial) blocks
        • when writing software, lay out data structures to avoid false sharing (see the sketch after this list)
    • write policy
      • write-through : memory up-to-date, cache always agrees, too much bus traffic (especially for unshared data such as a loop counter)
      • write-back : accumulate changes, write when the block is evicted (what happens if another processor wants it : need invalidate)
      • need for write buffer : absorbing / batching
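  • false sharing sketch (hypothetical Java example, not from the course; a real measurement would use a benchmark harness such as JMH) :
// two threads hammer two long counters; when both counters share one
// 64-byte cache block, coherence misses slow both threads down
public class FalseSharingDemo {
	static final long STEPS = 100_000_000L;
	static long run(int stride) throws InterruptedException {
		long[] c = new long[1 + stride]; // counters at slots 0 and stride
		Thread a = new Thread(() -> { for (long i = 0; i < STEPS; i++) c[0]++; });
		Thread b = new Thread(() -> { for (long i = 0; i < STEPS; i++) c[stride]++; });
		long t = System.nanoTime();
		a.start(); b.start(); a.join(); b.join();
		return (System.nanoTime() - t) / 1_000_000; // elapsed ms
	}
	public static void main(String[] args) throws InterruptedException {
		System.out.println("same cache block : " + run(1) + " ms"); // false sharing
		System.out.println("one block apart  : " + run(8) + " ms"); // 8 longs = 64 bytes
	}
}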

Performance and efficiency

  • parallel thinking
    • maximize parallelism
    • reduce overhead
    • galactic motion : simulation of n bodies can be approximated
    • ocean simulation : grid of elements with velocity, compute impact and iterate
    • deep learning
    • concurrency : two or more threads make forward progress together
    • parallelism : two or more threads execute at the same time
    • all parallel threads are concurrent but not the converse
    • task : piece of work
    • thread : performs tasks
    • processors : execute threads
  • forms of parallelism
    • throughput parallelism : perform many identical sequential tasks
    • functional or task parallelism : perform tasks that are functionally different
    • pipeline parallelism : perform tasks that are different in a particular order
    • data parallelism : perform the same task on different data
  • division of work
    • balance workload
    • reduce communication : overhead
    • reduce extra work : thread creations, scheduling
    • threads require resources
      • memory for stacks
      • a copy of the register file
      • program state : program counter, stack pointer
      • setup
      • short-lived : bad work/overhead ratio
  • performance
    • Amdahl's law : if you speed up only a fraction of the execution time of a program, the speedup you achieve on the whole application is limited (worked example after this list)
      • $\text{speedup} = \frac{1}{(1 - f) + f/s}$ where $f$ is the fraction enhanced and $s$ the speedup of that fraction (e.g. the number of cores)
    • parallel execution is not ideal : 10 processors rarely get a speedup of 10
      • load imbalance
      • thread start/join overhead
      • communication overhead
      • even if the enhanced fraction is close to 1 : the achieved speedup stays well below the number of processors
    • best efficiency : concurrency ~ parallelism (1 thread per processor)
    • prime finding :
      • split work rejected : unbalanced workload
      • need a shared counter : ticketing
      • but counter not atomic : need for mutual exclusion (synchronization)
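  • worked example (illustrative numbers) : with $f = 0.9$ of the program enhanced on $s = 10$ cores, $\text{speedup} = \frac{1}{(1-0.9) + 0.9/10} = \frac{1}{0.19} \approx 5.3$, far from 10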

Mutual exclusion

  • share items
    • cannot just see and check : not atomic
    • need explicit communication for coordination
    • communication must be : persistent (like writing) and not transient (like speaking)
  • protocols
    • cans (interrupt) : fallen = ready, upright = not ready
      • finite resources
      • what if long reset
    • flag : raise own flag, wait until other's flag is down, enter (unleash pet), lower flag on return
      • danger at start, but once started it is good
    • flag 2 : same, but Alice always goes first
      • proof : derive a contradiction by reasoning backwards
        • assume Alice was the last to look
        • therefore her flag is down
        • if her flag is down Bob can go
        • but Bob only goes with his flag up
        • thus Alice will wait until he lowers his flag
  • properties
    • correctness
      • safety : nothing bad happens ever
        • never both inside simultaneously
      • liveness : something good happens eventually
        • no deadlock
        • if one wants in, it gets in
        • if both want in, one gets in
    • quality
      • fairness : both are equally getting in
      • performance (waiting) : stuck if the thread inside is stuck
  • common problems
    • producer-consumer
      • cannot meet
      • cyclic dependence : don't overfill, don't leave the place empty
      • implies some waiting (idle) : problematic if one side is delayed, but common
      • safe : mutual exclusion
      • starvation : if producer is willing, consumer will always buy (liveness)
    • reader-writer
      • multiple read
      • if one write, no one reads : no mixed message
      • mutual exclusion : requires waiting but reduce performance
      • solvable without mutual exclusion : add validity flag (semaphore)
        • Alice only writes to invalid billboards (validate them after)
        • Bob only reads from valid billboards (invalidate them after)
        • Alice waits only if all billboards are written
        • Bob only waits if there is no message
        • read/write in parallel (impossible with locks)
  • mutual exclusion
    • formal definition
      • time
      • event $a_0$ : instantaneous, no simultaneous events
      • thread $a_0,a_1,\ldots$ : sequence of events, state machines (events are transitions)
      • states : thread state (program counter, local variables), system state (shared variables, union of thread states)
      • critical section $CS_i^k$ (thread $i$, $k$th execution) : piece of code that accesses a shared resource (data structure or device) that must not be concurrently accessed by more than one thread of execution
      • interleavings : events of two or more threads (not necessarily independent)
      • interval $A=(a_0,a_1)$ : time between events (may overlap or disjoint)
      • precedence $A_0\to B_0$ : end event of $A_0$ before start event of $B_0$
        • irreflexive : never true $A\to A$
        • antisymmetric : if $A\to B$ then not true that $B\to A$
        • transitive : if $A\to B \wedge B\to C$ then $A\to C$
      • $k$th occurrence $a_0^k$
      • locks : acquire (lock) and release (unlock) methods
    • properties
      • deadlock-free : system as a whole makes progress even if individuals starve
      • starvation-free : individual threads make progress
    • variables should be volatile : no optimization is done (i.e. not reorderable, not cached in a register)
    • solutions for 2
      • lockOne : raise flag, idle until other flag is down
        • source : write A (flag A = true) -> read A (flag B = false) -> CS A
        • source : write B (flag B = true) -> read B (flag A = false) -> CS B
        • assumption : read A (flag B = false) -> write B (flag B = true)
        • assumption : read B (flag A = false) -> write A (flag A = true)
        • the orderings form a cycle : contradiction, so mutual exclusion holds
        • but fails deadlock-freedom (both entering concurrently spin forever)
      • lockTwo : set itself as victim, idle until the victim is itself
        • satisfy mutual exclusion
        • not deadlock-free (a thread running alone is stuck waiting for the other)
      • Peterson's algorithm : combine both : flag[i] = true; victim = i; while (flag[j] && victim == i) {}; (Java sketch at the end of this section)
        • source : write B (flag B = true) -> write B (victim = B)
        • source : write A (victim = A) -> read A (flag B) -> read A (victim)
        • assumption : write B (victim = B) -> write A (victim = A)
        • observation : write B (flag B = true) -> write B (victim = B) -> write A (victim = A) -> read A (flag B) -> read A (victim)
        • last two steps : A reads flag B = true and victim = A, so A could not have entered the CS
        • deadlock free
        • starvation-free : lets the other go first, but not twice in a row
    • solutions for n
      • filter algorithm : n-1 waiting rooms (levels)
        • at each level : at least one enters the level and at least one is blocked if many try
        • for (int L = 1; L < n; L++) { level[i] = L; victim[L] = i; while ((∃ k != i level[k] >= L) && victim[L] == i ); }
        • claim : start at level 0, at most n-L threads enter level L
        • no starvation
        • fairness : threads can be overtaken by others
      • bounded waiting : stronger fairness guarantees
        • divide lock into 2 parts : doorway interval (finite) and waiting interval (unbounded)
      • Bakery algorithm : wait until lower numbers are served
        • lexicographic order with tickets and thread id
        • First come first served (FIFO)
        • flag[i] = true; label[i] = max(label[0], …,label[n-1])+1; while (∃k flag[k] && (label[i],i) > (label[k],k));
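  • Peterson's algorithm in Java (a sketch; AtomicBoolean stands in for the flags since plain Java array slots cannot be volatile) :
import java.util.concurrent.atomic.AtomicBoolean;
class Peterson {
	private final AtomicBoolean[] flag = { new AtomicBoolean(), new AtomicBoolean() };
	private volatile int victim;
	public void lock(int i) { // i is 0 or 1
		int j = 1 - i;
		flag[i].set(true); // I am interested
		victim = i;        // you go first
		while (flag[j].get() && victim == i) {} // spin
	}
	public void unlock(int i) {
		flag[i].set(false);
	}
}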

Synchronization constructs

  • synchronization hardware
    • disable interrupts : too inefficient on multiprocessors
    • atomics : non-interruptible
      • test memory words and set value
        • boolean Test&Set (boolean *target) { boolean rv = *target; *target = TRUE; return rv; }
        • while (TestAndSet(&lock)); /* critical section */ lock = FALSE;
      • swap memory words
        • void Swap (boolean *a, boolean *b){ boolean temp = *a; *a = *b; *b = temp; }
        • key = TRUE; while (key) Swap(&lock, &key); /* critical section */ lock = FALSE;
      • not perfect : threads wait spinning, huge traffic on the bus, not fair
      • 50% bus utilization : exponential response time

      • add cache test before
        • Test&Test&Set : do { while (lock); /* spin until lock reads 0 */ } while (TestAndSet(&lock)); /* critical section */ lock = FALSE;
        • Test&Swap : do { key = TRUE; while (lock); Swap(&lock, &key); } while (key); /* critical section */ lock = FALSE;
    • transactions
      • atomic instruction sequence
      • all memory changes visible before/after but not during
  • sophisticated primitives
    • semaphores S : atomic integer variable
      • wait : while (S <= 0); S--;
      • signal : S++
      • mutex : binary semaphore (initialized to 1) provide mutual exclusion
      • each wait : there has to be a corresponding signal
      • busy waiting : repeatedly checking a condition
      • without busy waiting : implement a waiting queue
        • wait : S->value--; if (S->value < 0) { add this thread to S->list; block(); }
        • signal : S->value++; if (S->value <= 0) { remove a thread P from S->list; wakeup(P); }
      • deadlock & starvation : possible if two threads acquire mutexes in opposite order at the same time
      • priority inversion : a lower-priority thread holds a lock needed by a higher-priority one (solvable by the priority-inheritance protocol)
    • monitors (synchronized method in Java)
      • abstract data type, variables visible only inside
      • only one thread may be active within the monitor at a time
    • condition variables : condition queues inside monitors
      • on a condition variable
        • wait suspends the thread until signal
        • signal resumes a thread (if any) that invoked wait
      • options
        • signal & wait : P waits until Q leaves or waits for another condition
        • signal & continue : Q waits until P leaves the monitor or waits for another condition
  • common problems
    • bounded buffer (semaphore sketch at the end of this section)
      • 1 buffer holds N items
      • 1 mutex at 1
      • semaphore full at 0
      • semaphore empty at N
    • reader/writer
      • mutex at 1
      • semaphore write at 1
      • readcount : incremented by each reader
        • first reader in acquires write (waits until the writer is out)
        • last reader out releases write (lets the writer back in)
    • dining philosophers
      • 5 philosophers : either thinking or eating
      • 5 chopsticks : need 2 for eating
      • semaphore chopstick[5] at 1
      • deadlock : each philosopher grabs the left chopstick (avoidable if everyone grabs the lower-numbered chopstick first)
      • solution : monitor + states
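  • bounded buffer sketch in Java with java.util.concurrent.Semaphore, following the counts above (class and field names are illustrative) :
import java.util.concurrent.Semaphore;
class BoundedBuffer<T> {
	private final Object[] items;
	private int putIdx = 0, takeIdx = 0;
	private final Semaphore mutex = new Semaphore(1);
	private final Semaphore full = new Semaphore(0); // filled slots
	private final Semaphore empty;                   // free slots, starts at N
	BoundedBuffer(int n) { items = new Object[n]; empty = new Semaphore(n); }
	void put(T x) throws InterruptedException {
		empty.acquire(); // wait for a free slot
		mutex.acquire();
		items[putIdx] = x; putIdx = (putIdx + 1) % items.length;
		mutex.release();
		full.release();  // one more filled slot
	}
	@SuppressWarnings("unchecked")
	T take() throws InterruptedException {
		full.acquire();  // wait for a filled slot
		mutex.acquire();
		T x = (T) items[takeIdx]; takeIdx = (takeIdx + 1) % items.length;
		mutex.release();
		empty.release(); // one more free slot
		return x;
	}
}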

Concurrent data structures

  • contention : when many threads compete for a lock
  • concurrent object : should increase throughput without contention
  • coarse-grained synchronization : every method locks (with queue)
    • sequential bottleneck : stand in line
    • does not improve throughput
  • fine-grained synchronization : split object into independently-synchronized components
    • only conflict : same component at same time
  • finally : always release the lock in a finally block wrapping the try
  • traffic jam : any concurrent data structure based on mutual exclusion has a big weakness (if one has cache fault, every others have to wait on it)
  • optimistic strategies
    • optimistic traversal : traverse without locking, then lock and validate (the traversal itself has no guarantees)
    • lazy synchronization : break into two part, logical remove (marks component) and physical removal
  • lock-free data structures
    • CAS : compare against an expected value; if it matches, set to the new value, all atomically
  • linked list : using a list-based set
    • set : add, remove, contains
    • node : item, key (hashcode), next
    • invariant : tail reachable from head, sorted, no duplicates
    • coarse-grained locking : one lock to rule them all
    • fine-grained locking : each piece has own lock
      • hand-over-hand locking : hold two consecutive locks while traversing (inefficient)
    • lock-free : CAS alone is not enough, need logical removal (AtomicMarkableReference, sketched after the code below)
      • remove : set mark bit in next field, redirect predecessor's pointer
      • get(boolean[] markHolder) : returns the reference and extracts the mark at array index 0
      • public boolean compareAndSet(Object expectedRef, Object updateRef, boolean expectedMark, boolean updateMark);
      • public boolean attemptMark(Object expectedRef, boolean updateMark);
      • timing : advanced pointer $T_p$, perform CAS $T_C$, remove $T_r$
  • concurrent hash tables
    • bucket : avoid collision
    • methods (from lock-free list) : add (9%), remove (1%), contains (90%)
    • resize : grows array, adjust hash function
    • when to resize
      • bucket threshold : number in one exceed
      • global threshold : quarter of buckets exceed this value
    • coarse-grained locking : no concurrency
    • striped locks : each lock associated with one bucket
      • array of locks
      • acquire locks in ascending order : once all locks are held, check the table reference didn't change (concurrent resize)
      • allocate a new, larger table
      • resize : locks the whole array; afterwards each lock is associated with two or more buckets
    • fine-grained locking
      • resize : exclusive mode
      • add, remove, contains : shared mode
      • read/write locks : two lock modes (read and write)
        • read : locks out writers, allow concurrent readers
        • write : locks out writers and readers
        • safe but liveness hard (continual stream of readers)
        • FIFO R/W lock : as soon as writer requests a lock, no more readers accepted
public boolean add(Object key) {
	// lock the stripe for this key, then insert into the current table
	int keyHash = key.hashCode() % lock.length;
	synchronized (lock[keyHash]) {
		int tabHash = key.hashCode() % table.length;
		return table[tabHash].add(key);
	}
}
private void resize(int depth, List[] oldTab) {
	// recursively acquire every stripe lock in ascending order, then
	// resize only if no concurrent resize replaced the table meanwhile
	synchronized (lock[depth]) {
		if (oldTab == this.table) {
			int next = depth + 1;
			if (next < lock.length) resize(next, oldTab);
			else sequentialResize();
		}
	}
}
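  • back to the lock-free list : the AtomicMarkableReference calls above can be used as follows (a simplified sketch; the real remove retries in a loop and revalidates the traversal)
import java.util.concurrent.atomic.AtomicMarkableReference;
class Node {
	final int key;
	final AtomicMarkableReference<Node> next;
	Node(int key, Node succ) {
		this.key = key;
		next = new AtomicMarkableReference<>(succ, false);
	}
}
// remove victim (the successor of pred) : logical removal first, then physical
static boolean remove(Node pred, Node victim) {
	Node succ = victim.next.getReference();
	// step 1 : set the mark bit in victim's next field, only if not already marked
	if (!victim.next.compareAndSet(succ, succ, false, true))
		return false; // concurrent change : caller retries
	// step 2 : swing pred's pointer past victim (may fail harmlessly,
	// a later traversal will finish the physical removal)
	pred.next.compareAndSet(victim, succ, false, false);
	return true;
}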
  • lock-free : don't move items, move buckets instead (all items in a single lock-free list, buckets are shortcut)
    • quick access, reasonable load balance
    • split-order : $2^i$ key sorted in reverse LSB order (Least Significant Bit, allow recursive split)
    • resize : split in two or more the buckets ($b=k\mod 2^{i+1}$ stay, $b+2^i=k\mod 2^{i+1}$ move, determined by $i+1$ bit counting backwards)
    • sentinel nodes : remove node pointed by shortcut by adding some
      • regular keys : set MSB to 1 before the bit-reversal (sentinel keys keep it at 0)
    • split a bucket : add new sentinel, point from the bucket to the sentinel
    • list : add takes a key argument, the constructor takes the previous list and a key
  • split-ordered set
    • set size / number of buckets ratio with threshold
    • initialize buckets : originally null, initialized lazily on first access
public class SOSet {
	protected LockFreeList[] table;    // buckets are shortcuts into one lock-free list
	protected AtomicInteger tableSize; // number of usable buckets
	protected AtomicInteger setSize;   // number of items, drives resizing
	public SOSet(int capacity) {
		table = new LockFreeList[capacity];
		table[0] = new LockFreeList();
		tableSize = new AtomicInteger(2);
		setSize = new AtomicInteger(0);
	}
}
public boolean add(Object object) {
	int hash = object.hashCode();
	int bucket = hash % tableSize.get();
	int key = makeRegularKey(hash);            // MSB set, then bit-reversed
	LockFreeList list = getBucketList(bucket); // initializes the bucket if needed
	if (!list.add(object, key))
		return false;
	resizeCheck();
	return true;
}
void initializeBucket(int bucket) {
	int parent = getParent(bucket); // bucket index with the top set bit turned off
	if (table[parent] == null)
		initializeBucket(parent);
	int key = makeSentinelKey(bucket); // bit-reversed, MSB clear
	table[bucket] = new LockFreeList(table[parent], key); // insert sentinel, keep the shortcut
}

Scheduling & work distribution

  • threads
    • overhead : memory for stacks, setup, teardown, scheduler
    • short-lived thread (~200 $\mu$s) : bad ratio work/overhead
    • pool abstraction (long-lived) : run task, rejoin pool, wait for next
      • ExecutorService
      • task : Runnable (calls run) or Callable (returns call)
      • submit task to executor (~$\mu$s) : future in return
      • futures : get the result (blocking)
      • efficiency : ratio of task length to submission overhead
class FibTask implements Callable<Integer> {
	static ExecutorService exec = Executors.newCachedThreadPool();
	int arg;
	public FibTask(int n) {
		arg = n;
	}
	public Integer call() {
		if (arg > 2) {
			Future<Integer> left = exec.submit(new FibTask(arg-1));
			Future<Integer> right = exec.submit(new FibTask(arg-2));
			return left.get() + right.get();
		} else {
			return 1;
		}
	}
}
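  • usage sketch : submitting the root task yields a future whose get() blocks until the whole task tree is done (the cached pool keeps creating threads, so blocked parents cannot starve their children)
Future<Integer> fib10 = FibTask.exec.submit(new FibTask(10));
System.out.println(fib10.get()); // prints 55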
  • Blumofe-Leiserson DAG model : multithreaded program is Directed Acyclic Graph that unfolds dynamically
  • timing
    • time on $P$ processor : $T_p$
    • work : $T_1$
    • critical path : $T_\infty$ (longest dependency path)
    • parallelism : $T_1/T_\infty$
    • work law : $T_P\ge T_1/P$ (cannot do more than $P$ work in one step)
    • critical path law : $T_P\ge T_\infty$ (finite resources)
    • speedup on $P$ processors : ratio $T_1/T_P$ (linear $\propto P$, max $T_1/T_\infty$)
  • work dealing : everyone passes extra work to others
    • lock-free work stealing : each thread has a pool, remove without synchronization, steal someone at random
    • double-ended queue ideal : lock-free, constant time, linearizable
      • pushBottom, popBottom, popTop
      • need to make concurrent steals safe : stamped top + bottom index (worker loop sketch at the end of this section)
public class BDEQueue {
	AtomicStampedReference<Integer> top; // index of the next steal + stamp against ABA
	volatile int bottom;                 // index of the next local push
	Runnable[] tasks;
}
// owner adds work at the bottom (no synchronization needed)
void pushBottom(Runnable r){
	tasks[bottom] = r;
	bottom++;
}
// thieves take from the top, racing through CAS on the stamped top
public Runnable popTop() {
	int[] stamp = new int[1];
	int oldTop = top.get(stamp), newTop = oldTop + 1;
	int oldStamp = stamp[0], newStamp = oldStamp + 1;
	if (bottom <= oldTop) // empty
		return null;
	Runnable r = tasks[oldTop];
	if (top.compareAndSet(oldTop, newTop, oldStamp, newStamp))
		return r;
	return null; // lost the race against another thief or the owner
}
// owner takes from the bottom, synchronizing only on the last task
Runnable popBottom() {
	if (bottom == 0) return null;
	bottom--;
	Runnable r = tasks[bottom];
	int[] stamp = new int[1];
	int oldTop = top.get(stamp), newTop = 0;
	int oldStamp = stamp[0], newStamp = oldStamp + 1;
	if (bottom > oldTop) return r; // no conflict with thieves
	if (bottom == oldTop){         // last task : race against thieves
		bottom = 0;
		if (top.compareAndSet(oldTop, newTop, oldStamp, newStamp))
			return r;
	}
	top.set(newTop, newStamp);     // queue is empty : reset
	bottom = 0;
	return null;
}
  • work balancing : stealing one task at a time is costly; move several tasks at once to randomly chosen partners
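  • worker loop sketch (hypothetical Worker class, assuming one BDEQueue per thread as above) :
import java.util.concurrent.ThreadLocalRandom;
class Worker extends Thread {
	final BDEQueue[] queues; // one deque per worker, shared by all
	final int me;
	Worker(BDEQueue[] queues, int me) { this.queues = queues; this.me = me; }
	public void run() {
		while (true) {
			Runnable task = queues[me].popBottom(); // local work first
			if (task == null) {
				int victim = ThreadLocalRandom.current().nextInt(queues.length);
				if (victim != me)
					task = queues[victim].popTop(); // steal from the top
			}
			if (task != null) task.run();
		}
	}
}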

Data parallel computing

  • perform same task on different data
  • map reduce
    • divide up the data among servers
    • compute the stats
    • reduce
  • vector processors : pipelined execution, single instruction multiple data (SIMD)
    • reduce ops by 1.2x
    • reduce instructions by 20x
    • automatic code vectorization with some compilers (but for (i = 0; i < N; i++) a[i] = a[b[i]] * fade; is not vectorizable : indirect addressing)
    • ISA SIMD support (x86 also) : streaming SIMD extensions (SSE)
      • vector register with : integer (16 bytes, 8 shorts, 4 int), single precision (4), double precision (2)
      • instruction : add, sub, mul, div, sqrt, max, min, and, or, xor, andn, cmpps
      • in ARM (NEON) : 32x64bits registers
    • micro architecture : functional units & vector registers
  • graphic processing units (GPU) : dense grid of alu, single instruction multiple threads (SIMT)
    • thousands of tiny cores : little cache, mostly ALUs
    • integrated GPU (AMD) : shared cache hierarchy, one memory
    • discrete GPU (nVidia) : specialized memory, must move data
    • CPU/GPU connections
      • CPU <-> memory : 30 GB/s
      • GPU <-> GPU memory : 300 GB/s
      • CPU <-> GPU : 3 GB/s
    • GPU thread : little overhead (microsecond), full hardware, thousands
    • help : parallel computation, memory overlap
    • execution timeline : copy data to GPU, execution configuration launch kernel, synchronize, copy data to CPU
    • CUDA (nVidia, market leader) vs OpenCL (open standard, similar model, newer)

CUDA

  • no execution order
  • computation partitioning
    • threads grouped into blocks, blocks into a grid
    • threads within the same block can communicate/synchronize
    • grid : machine dependent
    • threads per block : max 1024
  • memory model
    • thread : local memory, local register, all within block have shared memory (on-chip)
    • block : global memory (may be cached), constant memory (read-only), texture memory (read-only)
  • CUDA function qualifiers : where it executes and who may call it
    • __device__ : device from device
    • __global__ : device from host (return void)
    • __host__ : host from host
    • __device__ and __host__ can be combined (compiled for both)
  • CUDA memory copy : cudaMemcpy( void *dst, void *src, size_t nbytes, enum cudaMemcpyKind direction)
    • cudaMemcpyHostToDevice
    • cudaMemcpyDeviceToHost
    • cudaMemcpyDeviceToDevice
__global__ void fade (int *a, int fade, int N) {
	// global thread coordinates within the 2D grid
	int x = blockDim.x * blockIdx.x + threadIdx.x;
	int y = blockDim.y * blockIdx.y + threadIdx.y;
	int offset = y * (blockDim.x * gridDim.x) + x;
	if (offset >= N) return; // guard threads past the end of the array
	int v = a[offset];
	v = v * fade;
	a[offset] = v;
}
cudaMalloc((void **) &da, sizeof(float) * N);
int threads_block = 64;
int blocks = (N + threads_block - 1) / threads_block; // round up
arradd<<<blocks, threads_block>>>(da, 10.0f, N);
cudaDeviceSynchronize(); // forces CPU to wait
cudaMemcpy((void *) ha, (void *) da, sizeof(float) * N, cudaMemcpyDeviceToHost); // dest, source
cudaFree(da);
  • FLOPS : floating-point operations per second
  • shared memory : fastest computation
  • tiled kernel : matrix multiply
    • __syncthreads required : no thread may compute before the tile is fully loaded, nor overwrite it before all are done
dim3 dimBlock(TILE_WIDTH, TILE_WIDTH);
dim3 dimGrid(Width / TILE_WIDTH, Width / TILE_WIDTH);
__global__ void MatrixMulKernel(float* Md, float* Nd, float* Pd, int Width) {
	__shared__ float Mds[TILE_WIDTH][TILE_WIDTH];
	__shared__ float Nds[TILE_WIDTH][TILE_WIDTH];
	int bx = blockIdx.x; int by = blockIdx.y; int tx = threadIdx.x; int ty = threadIdx.y;
	int Row = by * TILE_WIDTH + ty; int Col = bx * TILE_WIDTH + tx; float Pvalue = 0;
	for (int m = 0; m < Width/TILE_WIDTH; ++m) {
		// Collaborative loading of Md and Nd tiles into shared memory
		Mds[ty][tx] = Md[Row*Width + (m*TILE_WIDTH + tx)];
		Nds[ty][tx] = Nd[Col + (m*TILE_WIDTH + ty)*Width];
		__syncthreads();
		for (int k = 0; k < TILE_WIDTH; ++k)
			Pvalue += Mds[ty][k] * Nds[k][tx];
		__syncthreads();
	}
	Pd[Row*Width+Col] = Pvalue;
}
  • warp : subgroup of a block whose threads execute in lockstep
  • banks : memory division to achieve high bandwidth (1 access per 2 cycles)
    • serialization : threads accessing different words in same bank (sequential)
    • multicasting : threads accessing same word in same bank (parallel)
    • the fewer the bank conflicts, the faster the access (cost = max simultaneous accesses to a single bank)
    • a stride is conflict-free if it shares no common factor with the number of banks (worked example at the end of this section)
  • reduce operations
    • pairwise : $\log_2(n)$ steps
    • strategy
      • interleaved accesses : shared memory, each step reduce by 2
      • non-divergent threads : group all active threads together
      • thread-sequential accesses : 2-way conflict at every step
      • read two elements and do the first step
      • unrolling the last 6 iterations
__global__ void reduce0(int *g_idata, int *g_odata, int n) {
	extern __shared__ int sdata[];
	// each thread loads one element from global to shared mem
	unsigned int tid = threadIdx.x;
	unsigned int i = blockIdx.x * blockDim.x * 2 + threadIdx.x;
	sdata[tid] = g_idata[i] + g_idata[i + blockDim.x];
	__syncthreads();
	for (unsigned int s = blockDim.x/2; s > 32; s /= 2) {
		if (tid < s) {
			sdata[tid] += sdata[tid + s];
		}
		__syncthreads();
	}
	// unrolled last 6 iterations : relies on warp-synchronous execution
	// (sdata should be declared volatile for this part on newer GPUs)
	if (tid < 32) {
		sdata[tid] += sdata[tid + 32];
		sdata[tid] += sdata[tid + 16];
		sdata[tid] += sdata[tid + 8];
		sdata[tid] += sdata[tid + 4];
		sdata[tid] += sdata[tid + 2];
		sdata[tid] += sdata[tid + 1];
	}
	if (tid == 0) g_odata[blockIdx.x] = sdata[0];
}
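  • worked example for banks (assuming 32 banks) : stride 2 gives $\gcd(2, 32) = 2$, so pairs of threads collide in each bank (2-way conflict, cost 2); stride 3 gives $\gcd(3, 32) = 1$, conflict-free (cost 1)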