XMPP_CoreData

robbiehanson edited this page May 1, 2012 · 4 revisions

There are several modules within the xmpp framework that require some kind of storage mechanism. The perfect example is the roster. This module handles fetching the roster (buddy list), and keeping track of who's online / offline. In order to accomplish its task, it needs to store all the user information somewhere. The trouble is that different applications have very different use cases and requirements. Perhaps the roster contains thousands of users, so caching the information to disk is best. Or perhaps the roster is small, and users go online/offline at a rapid pace, so storing the information in memory is preferred. The upshot is that the framework allows you to choose which storage mechanism best suits your needs. To accomplish this, storage is often abstracted within a module. The module provides one or two storage implementations, and specifies a protocol for implementing your own custom storage (if needed). For example, the XMPPRoster module comes with XMPPRosterMemoryStorage and XMPPRosterCoreDataStorage. Plus it specifies the XMPPRosterStorage protocol in case you'd like to implement your own.

Now after implementing CoreDataStorage for several modules within the xmpp framework, we noticed several patterns emerge. Some of them were obvious, such as the standard boilerplate code that's required to set up a core data stack. Others only became apparent with experience, such as optimizing disk IO, and properly managing core data in a multithreaded context.

Core Data Overview

There are some technologies that Apple creates that almost seem magical. They take a complicated thing, and make it so simple that it "just works" without much effort. Core Data is NOT one of these technologies. Over the years I've seen many developers jump into core data without bothering to first research the technology. Their app becomes buggy or slow and they throw their hands in the air, declare that core data sucks, and move on to some other storage mechanism. In reality, core data is a wonderful technology, but requires a deeper understanding of how it works to properly use it.

Oftentimes when people complain about core data, I ask them what they want. Here's what they say:

First, this magical new storage technology should be built on top of SQLite. This database is incredibly powerful and fast, has been around for years, and benefits from extensive use, debugging and optimizations. In other words, it's an industry standard. There's no use reinventing the wheel here.

What else?

The magical new storage technology should let me use objects, just like I do now. If I make changes to the object, it should automatically save to the database. And when I fetch from the database, I want it to give me objects back. I don't want to have to deal with database rows and columns. I don't want to have to create my objects from a database row. Just have it give me objects.

Sounds pretty cool. I bet you could write something like this yourself, even without being an SQLite expert. Now what are you going to do when you write version 2.0 of your app, and your objects don't look the same?

Oh, well for that I'll cook in some kind of clever versioning. Yeah... And it will be automatic if the changes are obvious. Otherwise I'll add hooks so I can manually implement versioning myself.

Sounds fancy. And what about thread-safety? I hear those new iPhones and iPads even come with multi-core CPUs these days, and everyone is writing more multithreaded code. What happens when you change an object on a background thread, and the main thread is also using that object?

Yeah, thread-safety... Um... I suppose I may not want to block the main thread with disk access all the time, so interacting with the database on background threads would be pretty cool. So um...

And now we've circled back to explaining what core data is. It's a mature implementation of all the features listed above.

Core Data can be layered atop SQLite. It provides a base class called NSManagedObject. Create your own objects atop this, and Core Data can manage most of the data storage for you. It has powerful support for versioning, oftentimes doing everything for you automatically. And it has well-defined thread-safety rules, and even provides change notifications to sync objects when they're changed on another thread. It even provides various rules to deal with conflicts. It's data storage, all grown up. But it's complicated, and not without caveats (just like grown-up TV shows).

I'm not going to try to teach core data here. Apple provides incredible documentation on the subject. But I'd like to highlight one thing you must be aware of when using core data within the xmpp framework.

Core Data is primarily broken down into 2 components:

  • The persistent store component, which manages reading/writing to the database (NSPersistentStoreCoordinator). This component is thread-safe. Internally it uses locks and serializes access.
  • The per-thread caches (NSManagedObjectContext). This component is obviously not thread-safe. It implements a smart cache, including background pre-fetching, and maintains an in-memory array of objects that have been added, modified or deleted.

The important concept to understand is that every core data object (NSManagedObject) is tied to a particular per-thread cache (NSManagedObjectContext). Thus these objects are not thread-safe. So, for example, you should not take a managed object, and pass it to a different thread. (There are other ways to easily accomplish this task.)
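One common way to hand an object to another thread is to pass its NSManagedObjectID (which is safe to share) and re-fetch the object in the receiving thread's context. The following is only a minimal sketch of that idea: the helper name ObjectInContext is hypothetical, and it assumes the object has already been saved so that its ID is permanent.

```objc
#import <CoreData/CoreData.h>

// Hypothetical helper (not part of the xmpp framework).
// Looks up the object identified by `objectID` in `targetContext`.
// Must be called on the thread/queue that owns `targetContext`.
// Assumes the object was saved at least once: IDs of unsaved objects are temporary.
static NSManagedObject *ObjectInContext(NSManagedObjectID *objectID,
                                        NSManagedObjectContext *targetContext)
{
    NSError *error = nil;
    NSManagedObject *object = [targetContext existingObjectWithID:objectID
                                                            error:&error];
    if (object == nil) {
        // The object may have been deleted, or the ID may belong to another store.
        NSLog(@"Could not load %@: %@", objectID, error);
    }
    return object;
}
```

On the sending side you would capture `someObject.objectID` on the thread that owns `someObject`, pass only that ID along (for example inside a dispatch_async block), and call the helper on the receiving side with that side's own context.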

Now the xmpp framework is massively parallel. The modules and storage mechanisms are running in their own dedicated GCD queues (background threads in the thread-pool managed by grand central dispatch). So if these xmpp modules are running in different threads, and they're using core data, and there are components of core data that aren't thread safe... how do I use it? And this is where a better understanding of core data comes into play.

XMPPCoreData and the ManagedObjectContext

Recall that every thread has a "per-thread cache" in the form of the NSManagedObjectContext. XMPP modules run in background threads/queues, and thus have their own internal managedObjectContext. Since the "per-thread caches" aren't thread safe, you can't use a module's internal managedObjectContext from your own thread. However, the storage classes supply various mechanisms to help you set up your own managedObjectContext. And even better, they automatically supply a "mainThreadManagedObjectContext" for this common case. So if you're working on the main thread, you can simply access this context through the property, and use it to do any fetching etc. that you need to do.

Many xmpp modules also provide various getters and fetchers that return managed objects. Recall that these managed objects are not thread-safe, and are implicitly tied to a managed object context. Thus, you'll notice that these methods take, as a parameter, an NSManagedObjectContext instance. That way the returned object will be safe for you to use.

For example, XMPPRosterCoreDataStorage has this public method:

- (XMPPUserCoreDataStorageObject *)userForJID:(XMPPJID *)jid
                                   xmppStream:(XMPPStream *)stream
                         managedObjectContext:(NSManagedObjectContext *)moc;

The returned object (XMPPUserCoreDataStorageObject) is a managed object, and is tied to a particular managedObjectContext. Thus, the third parameter is the context that should be used to fetch the user. This way the result is thread-safe.
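As a rough illustration, a main-thread caller might combine this method with the mainThreadManagedObjectContext property like so. The helper name DisplayNameForJID is hypothetical, and the header names and the displayName property on the user object are assumptions about the framework's headers rather than something shown on this page.

```objc
#import <CoreData/CoreData.h>
#import "XMPPRosterCoreDataStorage.h"      // assumed header name
#import "XMPPUserCoreDataStorageObject.h"  // assumed header name

// Hypothetical helper, intended to be called on the main thread only.
static NSString *DisplayNameForJID(XMPPJID *jid,
                                   XMPPStream *stream,
                                   XMPPRosterCoreDataStorage *storage)
{
    // Use the context that belongs to the calling (main) thread.
    NSManagedObjectContext *moc = [storage mainThreadManagedObjectContext];

    // The returned user is tied to `moc`, so it is safe to use on the main thread.
    XMPPUserCoreDataStorageObject *user = [storage userForJID:jid
                                                   xmppStream:stream
                                         managedObjectContext:moc];

    // `displayName` is assumed to be a property of the user object.
    return user.displayName;
}
```

If you call the same method from another thread, pass that thread's own context instead of the main thread context.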

So how do changes work?

The xmpp module will be updating the data store in the background as users go online / offline, etc. So how does the main thread (or any other thread for that matter) get notified of changes? It's mostly handled by core data. Here's how it works:

Any changes made to a managed object are automatically noted by the managed object context. The context (per-thread cache) keeps the change list in memory until we're ready to commit the changes by invoking the save operation on the context. At that point, the context flushes the changes to disk, and then packages up an NSManagedObjectContextDidSaveNotification. This notification gets broadcast so that the other managedObjectContexts associated with the same database can merge it. Now all these other contexts have some objects loaded into memory (the per-thread cache). Obviously they don't have the entire database in memory, but they have some. The DidSave notification includes the change-set, so when another context merges it, that context simply goes through the objects it has in memory, and sees if any of the changes affect them. (It's probably a bit more complicated than this, but you get the idea.) It also checks to see if any of the changes might affect any existing long-lived queries against the data store.

If you use the mainThreadManagedObjectContext, all this change notification, and merging of changes is automatically handled for you.
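If you set up your own managedObjectContext instead, you would need to wire up the merge yourself. The following is a minimal sketch of what that might look like using the standard Core Data notification. The helper name ObserveSaves is hypothetical, and this is not necessarily how the framework does it internally.

```objc
#import <CoreData/CoreData.h>

// Hypothetical helper (not part of the xmpp framework).
// Merges saves made by other contexts into `myContext`.
// `queue` should be the operation queue on which `myContext` may be used
// (e.g. [NSOperationQueue mainQueue] for a main-thread context). Passing nil
// runs the block on whatever thread posted the notification.
// Returns a token to pass to -[NSNotificationCenter removeObserver:] when done.
static id ObserveSaves(NSManagedObjectContext *myContext, NSOperationQueue *queue)
{
    return [[NSNotificationCenter defaultCenter]
        addObserverForName:NSManagedObjectContextDidSaveNotification
                    object:nil
                     queue:queue
                usingBlock:^(NSNotification *note) {
        NSManagedObjectContext *sender = note.object;

        // Ignore our own saves.
        if (sender == myContext) return;

        // Ignore saves that belong to a different database.
        if (sender.persistentStoreCoordinator != myContext.persistentStoreCoordinator) return;

        // Refresh any affected objects that are loaded in `myContext`.
        [myContext mergeChangesFromContextDidSaveNotification:note];
    }];
}
```

The two early returns matter: the notification is posted for every context in the process, so an observer has to filter out its own saves and saves from unrelated stores.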

XMPPCoreData Disk IO Optimizations

Using SQLite as the data storage mechanism makes a lot of sense. You can store a lot of data without using up lots of memory (critical for apps running on mobile devices). You can persist data between app launches, possibly reducing the amount of data you have to fetch over the network every time. And we all know SQLite is incredibly fast and efficient. But ultimately storing anything to the disk means... writing to the disk. And writing to the disk is many many times slower than other in-memory operations, so it can be a bottleneck. How does XMPPCoreDataStorage help optimize this automatically?

There are multiple considerations for the design of XMPPCoreDataStorage (the base class for all core data storage implementations within the xmpp framework):

  • The class should consolidate multiple changes into a single write.
  • The class cannot delay writes too long, or it risks delaying the propagation of changes to other threads such as the main thread.
  • The class cannot allow the in-memory change-set to grow too big, or it risks growing the memory footprint exceedingly large.
  • The class should have some clue as to what operations it needs to do next. This way it won't delay upcoming tasks since it can delay disk IO until after immediate requests have been satisfied.

The XMPPCoreDataStorage architecture meets all these needs via several clever optimizations.

First, the storage module runs in its own dedicated GCD queue. This is needed to serialize access to the thread-unsafe NSManagedObjectContext. But it's also used to serialize all incoming requests, either from the xmpp stream, or from your application asking for something. Before a request is added to the queue, an internal "pendingRequests" integer gets atomically incremented. And after the request is processed, the "pendingRequests" integer gets decremented. This way the storage module knows if there are other pending requests that it needs to process. Thus it may delay an expensive write operation to first handle an incoming change from the xmpp stream. Further, it can use this information to flush changes as soon as possible. If a small change is processed by the storage module, and there aren't any other pending requests, it can immediately flush changes without delaying any pending requests. Thus the architecture allows it to optimize a burst of changes from the stream, as well as the single-change scenario.

In addition to this it monitors the "length of the change-set". That is, the number of objects that have changed and need to be saved to disk. There is a public property named "saveThreshold" that is used to trigger a save if the list gets too long. This prevents the memory footprint from growing exceedingly large. So, for example, if one's roster contains thousands of people, the storage module might automatically split this into 2 separate save operations so your app doesn't crash.
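The two mechanisms described above (the pending request counter and the save threshold) can be illustrated with a small standalone sketch. This is not the framework's code: the class name WriteCoalescer, its methods, and the default threshold of 500 are all hypothetical, and the "save" is simulated by a counter instead of a real managed object context. ARC is assumed.

```objc
#import <Foundation/Foundation.h>
#import <stdatomic.h>

// Hypothetical sketch (not the framework's implementation).
@interface WriteCoalescer : NSObject
@property (atomic) NSUInteger saveThreshold;          // flush once this many changes are buffered
@property (atomic, readonly) NSUInteger saveCount;    // number of simulated saves so far
@property (atomic, readonly) NSUInteger unsavedCount; // changes buffered but not yet saved
- (void)scheduleChange;  // asynchronously record one change
- (void)waitUntilIdle;   // block until everything scheduled so far has run
@end

@implementation WriteCoalescer {
    dispatch_queue_t _queue;      // serial queue: serializes all requests
    atomic_int _pendingRequests;  // requests enqueued but not yet finished
}

- (instancetype)init {
    if ((self = [super init])) {
        _queue = dispatch_queue_create("WriteCoalescer", DISPATCH_QUEUE_SERIAL);
        atomic_init(&_pendingRequests, 0);
        _saveThreshold = 500;     // arbitrary illustrative default
    }
    return self;
}

- (void)scheduleChange {
    // Count the request BEFORE it is enqueued, so requests already running
    // on the queue can see that more work is coming.
    atomic_fetch_add(&_pendingRequests, 1);
    dispatch_async(_queue, ^{
        self->_unsavedCount++;    // stand-in for modifying managed objects

        // atomic_fetch_sub returns the old value, so subtract 1 to get the new one.
        int remaining = atomic_fetch_sub(&self->_pendingRequests, 1) - 1;

        // Save when nothing else is waiting, or when the change-set is too long.
        if (remaining == 0 || self->_unsavedCount >= self.saveThreshold) {
            [self flush];
        }
    });
}

- (void)flush {
    // Stand-in for [managedObjectContext save:&error]; only called on _queue.
    _unsavedCount = 0;
    _saveCount++;
}

- (void)waitUntilIdle {
    // An empty synchronous block returns only after all earlier blocks have run.
    dispatch_sync(_queue, ^{});
}
@end
```

With this shape, a single isolated change is saved immediately, while a burst of changes is coalesced into fewer saves, with the threshold capping how many changes any one save may contain.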

More information on these optimizations (and how to use them if you're extending XMPPCoreDataStorage) can be found in the XMPPCoreDataStorageProtected header file.

XMPPCoreData plus In-Memory Storage

Did you know that CoreData can be used for in-memory-only storage? It doesn't have to be written to disk.

So if you're thinking, "Right now my application doesn't require a database. The amount of data I'm handling can easily be stored in memory. And when it grows in the next version... well I'll cross that bridge when I get there," you're in luck. You can set up core data to use an InMemoryStore now, and benefit from the pure speed. And if your dataset grows in the future, and you need to move to an on-disk database, you can essentially flip a switch.

This is supported by XMPPCoreDataStorage. Just use whichever init method suits your needs:

/**
 * Initializes a core data storage instance, backed by SQLite, with the given database store filename.
 * [...]
**/
- (id)initWithDatabaseFilename:(NSString *)databaseFileName;

/**
 * Initializes a core data storage instance, backed by an in-memory store.
**/
- (id)initWithInMemoryStore;
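
For instance, choosing between the two might look something like the sketch below. It assumes XMPPRosterCoreDataStorage inherits these initializers from XMPPCoreDataStorage. The function name MakeRosterStorage, the persistToDisk flag, the header name and the filename "MyRoster.sqlite" are all hypothetical choices for illustration.

```objc
#import "XMPPRosterCoreDataStorage.h"  // assumed header name

// Hypothetical factory (not part of the xmpp framework).
static XMPPRosterCoreDataStorage *MakeRosterStorage(BOOL persistToDisk)
{
    if (persistToDisk) {
        // SQLite-backed store; the filename here is just an example.
        return [[XMPPRosterCoreDataStorage alloc]
                   initWithDatabaseFilename:@"MyRoster.sqlite"];
    }

    // In-memory store: nothing is written to disk.
    return [[XMPPRosterCoreDataStorage alloc] initWithInMemoryStore];
}
```

Because both initializers return the same class, the rest of your code doesn't need to know which backing store was chosen.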