# Suspicious Purchase Amounts with Window Operations

In this lab, you will see how to use Spark Streaming's support for windowed operations to identify suspicious activity from a stream of purchase data.

## Objectives

Use Spark Streaming's support for sliding windows of data to watch an incoming stream of transactions to determine the presence of anomalies.


## Instructions

We're going to take a look (ok, a *naive* look) at how Spark Streaming can be used to watch for fraudulent financial activity.  To do so, we're going to stream simplified transaction data to our program, which will, in turn, continuously calculate statistical information based on time-delineated, sliding windows of data about the stream and use it to identify potentially fraudulent activity.

## Create Spark Session

In [None]:
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("Python Streaming").getOrCreate()


## Create Streaming Context with Batch Interval Duration

In [None]:
from pyspark.streaming import StreamingContext

# the duration is in seconds
intervalDuration = 2

ssc = StreamingContext(spark.sparkContext, intervalDuration)

# checkpointing must be enabled: reduceByKeyAndWindow with an inverse function requires it
ssc.checkpoint("checkpoint")


## Transaction Anomalies

We'll define a transaction anomaly as a debit whose amount exceeds the average debit in the current streaming window by a given factor (the `fraud_factor` set later in this lab).

To determine the average, we reduce the debits by calling the `reduceByKeyAndWindow` function:

```python
reduceByKeyAndWindow(func, invFunc, windowLength, slideInterval)
```

where the reduce value of each window is calculated incrementally using the reduce values of the previous window. 

This is done by reducing the new data that enters the sliding window, and "inverse reducing" the old data that leaves the window. 

`windowLength` is the length of the streaming window in seconds; it must be a multiple of `intervalDuration`.

`slideInterval` is the interval, in seconds, at which the window operation is performed; it must also be a multiple of `intervalDuration`.
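
To make the incremental idea concrete, here is a minimal plain-Python sketch (no Spark involved) of a window that slides one batch at a time. The function name `windowed_sums` and the sample amounts are made up for illustration; the sketch only mimics the bookkeeping described above and is not how Spark implements it internally.

```python
from collections import deque

def windowed_sums(batches, window_batches):
    """Yield (incremental, recomputed) sums for a window sliding one batch at a time."""
    window = deque()   # per-batch sums currently inside the window
    running = 0        # reduce value carried over from the previous window
    for batch in batches:
        batch_sum = sum(batch)
        window.append(batch_sum)
        running += batch_sum               # "reduce" the data entering the window
        if len(window) > window_batches:
            running -= window.popleft()    # "inverse reduce" the data leaving it
        yield running, sum(window)         # second value is recomputed from scratch

# Hypothetical debit amounts, one inner list per batch interval.
batches = [[-10, -20], [-5], [-40, -15], [-30], []]
results = list(windowed_sums(batches, window_batches=3))
for incremental, recomputed in results:
    print(incremental, recomputed)
```

Both columns agree on every line, which is the point: the incremental update touches only the batch that enters and the batch that leaves, instead of re-reducing the whole window.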


## Define "reduce" and "inverse reduce" Functions

In [None]:
# values are (count, amount, mean) tuples; the mean is always recomputed from count and amount
# amount reduce function for the transactions that entered the window
def add(r, c):
   count  = r[0]+c[0]
   amount = r[1]+c[1]
   mean   = amount/count if count != 0 else 0

   return (count, amount, mean)

# inverse amount reduce function for the transactions that left the window
def sub(r, c):
   count  = r[0]-c[0]
   amount = r[1]-c[1]
   mean   = amount/count if count != 0 else 0

   return (count, amount, mean)
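
As a quick sanity check, the two functions can be exercised outside of Spark. The sketch below repeats their definitions so that it runs on its own; the three debit values are hypothetical.

```python
def add(r, c):
    count = r[0] + c[0]
    amount = r[1] + c[1]
    mean = amount / count if count != 0 else 0
    return (count, amount, mean)

def sub(r, c):
    count = r[0] - c[0]
    amount = r[1] - c[1]
    mean = amount / count if count != 0 else 0
    return (count, amount, mean)

# Hypothetical debits, each encoded as (count, amount, mean).
a = (1, -10.0, -10.0)
b = (1, -30.0, -30.0)
c = (1, -50.0, -50.0)

window = add(add(a, b), c)                 # all three debits are in the window
after_a_leaves = sub(window, a)            # the oldest debit slides out
emptied = sub(sub(after_a_leaves, b), c)   # nothing left: the guard avoids dividing by zero

print(window)          # (3, -90.0, -30.0)
print(after_a_leaves)  # (2, -80.0, -40.0)
print(emptied)         # (0, 0.0, 0)
```

Note that the third element of each operand is never read; only the count and the amount are combined, and the mean is derived from them afterwards.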


## Create Socket Stream

In [None]:
hostname = "nc"
port = 9999

# create a DStream that will connect to hostname:port
lines = ssc.socketTextStream(hostname, port)


## Parse Transactions and Identify Debits

In [None]:
# parsing transactions
rawTxns = lines.map(lambda st: st.split(",")).map(lambda el: (el[0], el[1], float(el[2])))

# filtering transactions for purchases only
debitTxns = rawTxns.filter(lambda s: s[2] < 0)
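
The same parsing and filtering can be tried on a few lines in plain Python before wiring it into the stream. The sample lines and the "date, description, amount" column meaning are assumptions made for this sketch; all the lab code relies on is three comma-separated fields with a numeric third field.

```python
# Hypothetical input lines; the real tx.csv layout may differ.
sample_lines = [
    "2024-01-05,Coffee Shop,-4.50",
    "2024-01-05,Payroll,1500.00",
    "2024-01-06,Electronics Store,-899.99",
]

def parse(line):
    """Split a CSV line and convert the third field to a float."""
    el = line.split(",")
    return (el[0], el[1], float(el[2]))

raw_txns = [parse(line) for line in sample_lines]

# Debits (purchases) carry a negative amount, credits a positive one.
debit_txns = [t for t in raw_txns if t[2] < 0]
print(debit_txns)
```

Only the two negative-amount lines survive the filter, mirroring what `debitTxns` holds for each batch.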


We need to add a key to each transaction so that it can be joined with the windowed average. Here every transaction gets the same constant key, `1`.

In a real application, it would be more natural to use the account id as the key.


In [None]:
keyedTxns = debitTxns.map(lambda s: (1, s))


## Reduce Debit Amounts in Window

In [None]:
windowDuration   = 3*intervalDuration  # window length in seconds (3 batches)
slidingDuration  = 1*intervalDuration  # slide interval in seconds (1 batch)

# each debit starts as (count, amount, mean) = (1, amount, amount) under the constant key 1
amounts = debitTxns.map(lambda s: (1, (1, s[2], s[2])))
meanAmount = amounts.reduceByKeyAndWindow(add, sub, windowDuration, slidingDuration)


## Join Transactions and Reduced Debits Streams

In [None]:
joinedTxns = keyedTxns.join(meanAmount)


## Identify Suspicious Transactions

In [None]:
fraud_factor = 1.33

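# debits are negative, so "larger than the average" means "more negative":
# keep the transactions whose amount is below mean * fraud_factor
suspiciousTxns = joinedTxns.map(lambda v: v[1]).filter(lambda t: t[0][2] < t[1][2]*fraud_factor).map(lambda t: t[0])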

suspiciousTxns.pprint()
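
Because debits are negative numbers, the `<` in the filter can look backwards at first. The plain-Python sketch below walks through the shape of a joined record and the comparison using hypothetical transactions. For simplicity it computes the statistics from the same three transactions it tests; in the streaming job the statistics cover the whole window, while the transactions being tested come from the current batch.

```python
fraud_factor = 1.33

# Hypothetical debits: (date, description, amount).
txns = [
    ("2024-01-05", "Coffee Shop", -4.50),
    ("2024-01-05", "Groceries", -60.00),
    ("2024-01-06", "Electronics Store", -899.99),
]

count = len(txns)
amount = sum(t[2] for t in txns)
window_stats = (count, amount, amount / count)   # same shape as the meanAmount values

# After the join, each record is (key, (transaction, window statistics)).
joined = [(1, (t, window_stats)) for t in txns]

threshold = window_stats[2] * fraud_factor       # a negative number
suspicious = [t for _, (t, stats) in joined if t[2] < stats[2] * fraud_factor]

print(round(threshold, 2))
print(suspicious)
```

Only the electronics purchase falls below the threshold, so it is the only one reported as suspicious.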


## Start Streaming

In [None]:
ssc.start()             # starting the computation
ssc.awaitTermination()  # waiting for the computation to terminate


### Simulate transaction source

We will be receiving transaction data on port 9999.  There's a convenient utility in most Unix-like systems called `nc` ("netcat") that will take data from `stdin` and pipe it to a designated socket.

> Note:  Windows systems should have a similar command called `ncat`.

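For our application, the socket is port `9999` on the machine where `nc` runs. The `hostname` variable in the Create Socket Stream cell must resolve to that machine (`localhost` if everything runs on one host).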

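Open a new terminal and, if you're on Linux, issue the command

``` 
nc -lk 9999
```

so that `nc` listens on the port and Spark's socket stream can connect to it (traditional netcat builds may need `nc -l -p 9999` instead).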

If you're on Mac, issue the command

``` 
nc -lk 9999
```

Check your documentation if you're on Windows (I'll avoid the obligatory snide "Windoze" or other remark here).

If there's an error, diagnose & correct it.  Otherwise, you should see the cursor on the next line, waiting for input on `stdin`.  Leave that process be for the moment; we're going to return to our program now and come back to it when we're ready to stream data.

Once the streaming computation is running, copy a large batch of transaction lines from the `tx.csv` file and paste them into the `nc` terminal.
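
If `nc` isn't available, a few lines of Python can play the same role. This is only a sketch under assumptions: the helper names `open_listener` and `serve_lines` are invented here, the server handles a single client, and the `hostname` used by the notebook must point at the machine running it.

```python
import socket
import time

def open_listener(host="localhost", port=9999):
    """Create a TCP socket listening on host:port (port 0 picks a free port)."""
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    server.bind((host, port))
    server.listen(1)
    return server

def serve_lines(server, lines, delay=0.0):
    """Wait for one client (e.g. the Spark receiver) and send it each line."""
    conn, _ = server.accept()
    with conn:
        for line in lines:
            # make sure every record ends with exactly one newline
            conn.sendall((line.rstrip("\n") + "\n").encode("utf-8"))
            time.sleep(delay)

# Example usage, left commented out because it blocks until a client connects:
# with open("tx.csv") as f:
#     serve_lines(open_listener("localhost", 9999), f, delay=0.1)
```

The optional `delay` spreads the lines over several batch intervals, which makes the sliding window easier to observe than a single large paste.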

## Conclusion

In this lab, you saw how Spark Streaming's support for window operations can be used to process continuously streamed data and handle it not only with ease, but also with nearly the same API and programming concepts as batch data!

## Complete Solution

In [None]:
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("Python Streaming").getOrCreate()

from pyspark.streaming import StreamingContext

# the duration is in seconds
intervalDuration = 2

ssc = StreamingContext(spark.sparkContext, intervalDuration)

# checkpointing must be enabled: reduceByKeyAndWindow with an inverse function requires it
ssc.checkpoint("checkpoint")


# amount reduce function for the transactions that entered the window
def add(r, c):
   count  = r[0]+c[0]
   amount = r[1]+c[1]
   mean   = amount/count if count != 0 else 0

   return (count, amount, mean)

# inverse amount reduce function for the transactions that left the window
def sub(r, c):
   count  = r[0]-c[0]
   amount = r[1]-c[1]
   mean   = amount/count if count != 0 else 0

   return (count, amount, mean)


hostname = "nc"
port = 9999

# create a DStream that will connect to hostname:port
lines = ssc.socketTextStream(hostname, port)

# parsing transactions
rawTxns = lines.map(lambda st: st.split(",")).map(lambda el: (el[0], el[1], float(el[2])))

# filtering transactions for purchases only
debitTxns = rawTxns.filter(lambda s: s[2] < 0)

# add a constant key to each transaction so it can be joined with the windowed mean
# in a real application it would be more natural to use the account id as the key
keyedTxns = debitTxns.map(lambda s: (1, s))

windowDuration   = 3*intervalDuration  # window length in seconds (3 batches)
slidingDuration  = 1*intervalDuration  # slide interval in seconds (1 batch)

# each debit starts as (count, amount, mean) = (1, amount, amount) under the constant key 1
amounts = debitTxns.map(lambda s: (1, (1, s[2], s[2])))
meanAmount = amounts.reduceByKeyAndWindow(add, sub, windowDuration, slidingDuration)

fraud_factor = 1.33

# joining two streams with the purchase transactions and the mean
joinedTxns = keyedTxns.join(meanAmount)

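# getting suspicious purchases: debits are negative, so a debit is suspicious
# when its amount is below (more negative than) mean * fraud_factor
suspiciousTxns = joinedTxns.map(lambda v: v[1]).filter(lambda t: t[0][2] < t[1][2]*fraud_factor).map(lambda t: t[0])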

suspiciousTxns.pprint()

ssc.start()             # starting the computation
ssc.awaitTermination()  # waiting for the computation to terminate
