## How to create DStreams?

There are many ways to create DStreams, but fundamentally every input DStream, except for the file stream, is associated with a receiver. A receiver is the object that receives data from the source and stores it in Spark's memory for processing.

Some sources are available directly from the StreamingContext API, such as file systems, TCP socket connections, and queues of RDDs. Others, including Kafka and Flume, require additional dependencies.

Note: when specifying the number of cores for Spark, always use more than one. A receiver occupies one core (or one thread when running locally), so at least one more is needed to process the received data.

For the receivers available from the StreamingContext API, we start by importing 'StreamingContext', which is the prerequisite for all of Spark's streaming functionality. We then create a local StreamingContext with, for example, 10 execution threads and a batch interval given in seconds.
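
As a minimal sketch of that setup, assuming a local run: the helper names `local_master` and `create_local_streaming_context`, the application name, and the 5-second interval are illustrative choices, not part of Spark. The sketch builds a `local[n]` master URL, rejects fewer than two threads, and returns the context.

```python
def local_master(num_threads):
    """Build a local master URL such as 'local[10]'.

    Hypothetical helper: rejects fewer than 2 threads, because a receiver
    occupies one thread and at least one more is needed for processing.
    """
    if num_threads < 2:
        raise ValueError("use at least 2 threads: one is taken by the receiver")
    return "local[{}]".format(num_threads)


def create_local_streaming_context(num_threads=10, batch_seconds=5,
                                   app_name="DStreamExample"):
    """Create a local StreamingContext (requires pyspark to be installed)."""
    # Imported here so that local_master() stays usable without Spark.
    from pyspark import SparkContext
    from pyspark.streaming import StreamingContext

    sc = SparkContext(master=local_master(num_threads), appName=app_name)
    # batchDuration is the batch interval in seconds.
    return StreamingContext(sc, batchDuration=batch_seconds)
```

Calling `create_local_streaming_context(10, 5)` would give a context with 10 local threads that groups incoming data into 5-second batches.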

### Example : Twitter as a DStream Source.

In [9]:
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

# (1)SparkContext     :  Main entry point for Spark functionality.
#                        A SparkContext represents the connection to a Spark cluster,
#                        and can be used to create RDDs and broadcast
#                        variables on that cluster.
# (2)StreamingContext :  Main entry point for Spark Streaming functionality.
#                        A StreamingContext represents the connection to a Spark cluster,
#                        and can be used to create DStreams from various input sources.
# (3)socketTextStream :  Create an input stream from the TCP source hostname:port.
#                        Data is received using a TCP socket, and the received bytes are
#                        interpreted as UTF-8 encoded, '\n'-delimited lines.


# Set up a Twitter stream, and create the Spark socket stream
# with the following lines.
sc     = SparkContext()
ssc    = StreamingContext(sc, batchDuration=10)  # batch interval of 10 seconds
socket_stream = ssc.socketTextStream('127.0.0.1', 555)
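
The socket stream connects to the given host and port as a client, so something has to be listening there and sending newline-delimited UTF-8 text. The sketch below is a hypothetical stand-in for the Twitter feed, using only the Python standard library. The function name `serve_lines` and the sample lines are assumptions for illustration.

```python
import socket
import threading


def serve_lines(lines, host="127.0.0.1", port=0):
    """Start a one-shot TCP server that sends `lines` to the first client.

    Hypothetical stand-in for a real feed. Port 0 asks the OS for a free
    port; the port actually bound is returned together with the thread.
    """
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    server.bind((host, port))
    server.listen(1)
    bound_port = server.getsockname()[1]

    def handle():
        conn, _ = server.accept()
        with conn:
            for line in lines:
                # One record per line: UTF-8 text terminated by '\n'.
                conn.sendall((line + "\n").encode("utf-8"))
        server.close()

    thread = threading.Thread(target=handle, daemon=True)
    thread.start()
    return bound_port, thread
```

With `port, _ = serve_lines(["first tweet", "second tweet"])`, a stream created by `ssc.socketTextStream('127.0.0.1', port)` should receive one record per line once the context is started.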

### Example : text files as a Stream

textFileStream(<i>dataDirectory</i>)
: Creates an input stream from new text files that enter a specific directory.

 * dataDirectory : path to a folder that receives new files after the stream has started
   * Only text files are supported in PySpark.


In [17]:
def simple_text_to_stream(ssc):
    # Print the first elements of each batch of lines read from new files in /data.
    ssc.textFileStream('/data').pprint()
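
The Spark Streaming guide recommends moving or renaming finished files into the monitored directory rather than writing them there in place, so the stream never sees a half-written file. A minimal sketch of that pattern follows. The helper name `publish_text_file` and its staging-directory default are assumptions, not part of Spark.

```python
import os
import tempfile


def publish_text_file(data_dir, filename, text, staging_dir=None):
    """Write `text` to a staging file, then move it into `data_dir`.

    Hypothetical helper. The staging directory must be on the same file
    system as `data_dir` so that os.replace() is an atomic rename.
    """
    os.makedirs(data_dir, exist_ok=True)
    # Default: stage next to the watched directory, not inside it.
    parent = staging_dir or os.path.dirname(os.path.abspath(data_dir))
    fd, tmp_path = tempfile.mkstemp(dir=parent, suffix=".tmp")
    with os.fdopen(fd, "w", encoding="utf-8") as handle:
        handle.write(text)
    final_path = os.path.join(data_dir, filename)
    os.replace(tmp_path, final_path)  # the file appears fully written
    return final_path
```

For the stream above, `publish_text_file('/data', 'batch1.txt', 'line one\nline two\n')` would drop a complete file into the watched folder, provided the process can write to the staging location.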

### Example : Queue of RDDs as a Stream

queueStream(rdds, oneAtATime=True, default=None)

Creates an input stream from a queue of RDDs or lists. In each batch, it processes either one or all of the RDDs in the queue.

 * rdds       : queue of RDDs
 * oneAtATime : pick one RDD per batch, or pick all of them at once
 * default    : the default RDD to use when rdds is empty

In [18]:
def simple_queue_one_at_a_time(ssc):
    # Each list is converted to an RDD; one RDD is consumed per batch.
    ssc.queueStream([range(5), ['a', 'b'], ['c']],
                    oneAtATime=True).pprint()
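
To make the effect of the three parameters concrete, here is a plain-Python model of which items each batch would contain. It is not Spark code: lists stand in for RDDs, the function name `model_queue_batches` is hypothetical, and it only mirrors the parameter descriptions above.

```python
def model_queue_batches(rdds, num_batches, one_at_a_time=True, default=None):
    """Plain-Python model of the contents of each batch.

    Lists stand in for RDDs; this mirrors the documented parameters only
    and does not call Spark.
    """
    queue = list(rdds)
    batches = []
    for _ in range(num_batches):
        if queue and one_at_a_time:
            # One queued RDD is consumed per batch.
            batch = list(queue.pop(0))
        elif queue:
            # All queued RDDs are combined into a single batch.
            batch = [item for rdd in queue for item in rdd]
            queue.clear()
        else:
            # Queue exhausted: fall back to the default, or an empty batch.
            batch = list(default) if default is not None else []
        batches.append(batch)
    return batches
```

With the data from the example, `model_queue_batches([range(5), ['a', 'b'], ['c']], 3)` gives one list per batch, while `one_at_a_time=False` puts all eight items into the first batch. In Spark itself, the `pprint()` output only appears once `ssc.start()` has been called.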