# Twitter Streaming

We start by creating a streaming context.

In [None]:
# Create a local StreamingContext with a batch interval of 10 seconds
from pyspark.streaming import StreamingContext
ssc = StreamingContext(sc, 10)

Now we create a stream that reads from a network socket. This will read data that we send by hand from a terminal later.

We need to specify a PORTNUMBER, which has to be unique for each socket and server.

In [None]:
# Create a DStream that will connect to hostname:port, here localhost:6789
lines = ssc.socketTextStream("localhost", 6789)

The stream produces a sequence of RDDs, each of which is referred to as a batch. We can therefore apply RDD operations such as flatMap or map to every batch. Finally, we convert each batch to a DataFrame and register it as a temporary table.
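
To make the individual steps concrete, here is a minimal sketch of the same hashtag-counting logic applied to a single batch in plain Python, without Spark. The function name `top_hashtags` and the sample `batch` are made up for illustration; the comments name the RDD operation each line stands in for.

```python
from collections import Counter, namedtuple

Tweet = namedtuple("Tweet", ("tag", "count"))

def top_hashtags(lines, n=10):
    """Count hashtags in one batch of text lines and return the n most frequent."""
    # flatMap: split every line into words
    words = (word for text in lines for word in text.split(" "))
    # filter + map: keep only hashtags and lower-case them
    tags = (word.lower() for word in words if word.startswith("#"))
    # reduceByKey: sum up the occurrences per tag
    counts = Counter(tags)
    # sort descending by count + limit: keep the n most frequent tags
    return [Tweet(tag, count) for tag, count in counts.most_common(n)]

# Hypothetical sample batch, standing in for the lines received in one interval
batch = ["#Spark is fun", "I like #spark and #python", "no tags here"]
print(top_hashtags(batch))
```

In the streaming version, this computation is repeated for every batch, so the counts cover one batch interval at a time rather than the whole stream.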

In [None]:
from collections import namedtuple # Gives each element named fields
from pyspark.sql.functions import desc # Needed for the descending sort below
fields = ("tag", "count" )
Tweet = namedtuple( 'Tweet', fields )

( lines.flatMap( lambda text: text.split( " " ) ) # Splits each line into a list of words
  .filter( lambda word: word.lower().startswith("#") ) # Keeps only hashtags
  .map( lambda word: ( word.lower(), 1 ) ) # Lower-cases the tag and pairs it with a count of 1
  .reduceByKey( lambda a, b: a + b ) # Sums the counts per tag
  .map( lambda rec: Tweet( rec[0], rec[1] ) ) # Stores each (tag, count) pair in a Tweet object
  .foreachRDD( lambda rdd: rdd.toDF().sort( desc("count") ) # Sorts in descending order by count
  .limit(10).registerTempTable("tweets") ) ) # Keeps the top 10 tags of each batch and registers them as the table "tweets"

Now we have set up the stream, but it is not yet running. For that we need to start the StreamingContext (ssc).

**Before starting the streaming context in the next cell, you need to start the server that it will read from:**

1) open a terminal, just like the one you used to start the jupyter notebook, via SSH.

2) for testing, use the file `RandomRead.py`. It contains a PORTNUMBER which you need to adjust to the one used above (3 cells before this one). You can then run the server with `python3 RandomRead.py`.

3) then start the cell below, it should show arriving words like `Lorem ipsum dolor sit amet [...]`

4) once that works, stop the server in the terminal with Ctrl-C, and start `TweetRead.py`. Here you will also need to adjust the PORTNUMBER. I have set up a Twitter account called "City University Big Data Module". Please don't post tweets there, and if you are going to run it for more than testing, please get your own credentials from [https://apps.twitter.com](https://apps.twitter.com).

Watch the output for any message saying that the port number is already taken. If you get strange behaviour, try restarting and using PORTNUMBER + 1.
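
For orientation, a server of this kind can be quite small. The following is a minimal sketch and not the contents of `RandomRead.py`, which are not shown here. The function names `open_server`, `send_lines` and `serve_in_background` are hypothetical, and the default port 6789 is simply the one used with `socketTextStream` above. The sketch sends newline-terminated UTF-8 text, which is the format `socketTextStream` reads.

```python
import socket
import threading

def open_server(host="localhost", port=6789):
    """Open a listening TCP socket (hypothetical helper, not from RandomRead.py)."""
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    # Allow quick restarts on the same port after the server was stopped
    server.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    server.bind((host, port))
    server.listen(1)
    return server

def send_lines(server, lines):
    """Wait for one client (e.g. the Spark receiver) and send it each line."""
    conn, _ = server.accept()  # Blocks until a client connects
    with conn:
        for line in lines:
            # One newline-terminated UTF-8 line per message
            conn.sendall((line + "\n").encode("utf-8"))

def serve_in_background(server, lines):
    """Run send_lines in a daemon thread so the caller is not blocked."""
    thread = threading.Thread(target=send_lines, args=(server, lines), daemon=True)
    thread.start()
    return thread
```

Calling `send_lines(open_server(), ["hello #spark"])` in a script run from the terminal would wait for the streaming context to connect and then send the given lines. If the port is already in use, `bind` raises an `OSError`, which is the situation where switching to PORTNUMBER + 1 helps.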

In [None]:
%time ssc.start()             # Start the computation
# after this, the stream will still run but it won't block the notebook any more

When you've tested the stream processing, you can stop the streaming context with the last cell of this notebook. After that, the streaming context is no longer usable, and you need to restart this notebook to start another stream. The next cell imports the libraries needed to plot the results.

In [None]:
import time
from IPython import display # Enables us to show stuff in the notebook
import matplotlib.pyplot as plt #Visualization library
import seaborn as sns # Visualization library
# Only works for Jupyter Notebooks!
# The following code enables us to view the bar plot within a cell in the jupyter notebook
%matplotlib inline

The code below will fail if there is no data actually coming in.
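
One way to soften that failure is to retry the query a few times before giving up. The following is a minimal sketch, assuming a hypothetical helper `fetch_with_retry` that is not part of Spark or of this notebook. It catches any exception, since the exact error raised while the `tweets` table does not exist yet depends on the Spark version.

```python
import time

def fetch_with_retry(fetch, attempts=5, delay=3.0):
    """Call fetch() until it returns without raising, or the attempts run out."""
    last_error = None
    for _ in range(attempts):
        try:
            return fetch()
        except Exception as error:  # e.g. the "tweets" table is not registered yet
            last_error = error
            time.sleep(delay)  # Wait before trying again
    raise RuntimeError("no data arrived after %d attempts" % attempts) from last_error
```

In the plotting loop, the query could then be wrapped as `fetch_with_retry(lambda: sqlContext.sql('Select tag, count from tweets').toPandas())`, so that a late-starting stream delays the plot instead of aborting the cell.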

In [None]:
count = 0
while count < 10:

    time.sleep( 3 ) # Wait a little so that new data can arrive
    top_10_tweets = sqlContext.sql( 'Select tag, count from tweets' )
    top_10_df = top_10_tweets.toPandas() # Converts to a pandas DataFrame
    display.clear_output(wait=True) # Clears the output, if a plot exists.
    plt.figure( figsize = ( 10, 8 ) ) # Use matplotlib directly; sns.plt is not available in newer seaborn versions
    sns.barplot( x="count", y="tag", data=top_10_df)
    plt.show()
    count = count + 1


At the end, stop the stream. You cannot restart it in the same process, so you have to restart the kernel to start again.

In [None]:
ssc.stop()