# Display Advertising: Ad fraud detection

## Introduction
Ad fraud is a big, costly problem in our industry. To fight it, you have to understand the different forms it can take. Brand safety experts typically divide ad fraud into three categories: non-human traffic (i.e., bots); ads with zero chance of being seen (i.e., zero-percent viewability); and intentional misrepresentation. The fraudsters responsible are savvy, and they continually find new and more sophisticated ways to make money by defrauding advertisers.

The scale of online ad fraud has a significant impact on advertising ROI and advertiser confidence because all those falsified impressions and clicks cost money without yielding conversions or revenue. It’s estimated that fraud consumes 1 of every 3 dollars spent on digital advertising. In 2018, advertisers lost an estimated \$51 million every day to fraud, a figure that is expected to more than double by 2022. Time and time again, advertisers unwittingly reinvest in fraudulent inventory because it appears on reports to be driving results. Worst of all, ad fraud is not technically illegal, so there is minimal risk for bad actors.


Here’s a closer look at some of the most common types of fraud:

#### 1. Bot basics

* **General invalid traffic (GIVT)** comes from simple scripts that run on a server at Amazon Web Services or another hosting provider. As the name implies, these bots are usually easy to identify because they have a static IP address, user agent, and cookie ID. You can fingerprint them using DSP auction logs or even web server logs, looking for the abnormally high clickthrough rates (CTR) or unexpected traffic spikes that are the signatures of simple bots.

* **Sophisticated invalid traffic (SIVT)** is not as easy to identify. These bots rotate user agents, use random proxies to rotate IP addresses, and mimic normal “human” CTRs, so they are more challenging to detect. They are also now capable of completing complicated tasks like filling out forms or watching videos to completion. Sophisticated bots can even put items in shopping carts and visit multiple sites to generate histories and cookies, which makes them look attractive to advertisers and publishers.
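The GIVT signature described above (one static IP and user agent producing an unusual volume of requests) can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the log format, the function name `flag_static_sources`, and the threshold of 100 requests are hypothetical and would need tuning against real DSP or web server logs.

```python
from collections import Counter

def flag_static_sources(requests, min_requests=100):
    """Return (ip, user_agent) pairs seen at least `min_requests` times.

    `requests` is an iterable of dicts with "ip" and "user_agent" keys
    (a hypothetical, simplified log format).
    """
    counts = Counter((r["ip"], r["user_agent"]) for r in requests)
    return sorted(key for key, n in counts.items() if n >= min_requests)

# Synthetic log: one scripted client hammering from a fixed IP and user agent,
# plus twenty ordinary visitors with one request each.
log = [{"ip": "203.0.113.7", "user_agent": "python-requests/2.21"}] * 150
log += [{"ip": "198.51.100.%d" % i, "user_agent": "Mozilla/5.0"} for i in range(20)]

suspects = flag_static_sources(log)
print(suspects)
```

A fixed count threshold like this would likely miss SIVT, which rotates IPs and user agents precisely to stay under such limits.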



#### 2. The unviewable
Ad stacking is a common way that fraudulent publishers get credit for running an ad that is actually hidden behind other ads and not viewable. The publisher can thereby generate multiple impressions for a single page view, even when only the top ad in the “stack” is ever seen.
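To make the idea of a "stack" concrete, here is a minimal sketch that checks whether an ad slot is completely covered by another slot rendered above it. It assumes page geometry is available as rectangles with a z-order, which is a simplification; the `AdSlot` type and both helper functions are hypothetical names, not part of any verification vendor's API.

```python
from typing import NamedTuple

class AdSlot(NamedTuple):
    name: str
    x: int
    y: int
    width: int
    height: int
    z_index: int  # higher values are rendered on top

def covers(top, bottom):
    """True if `top` is rendered above `bottom` and fully contains its rectangle."""
    return (top.z_index > bottom.z_index
            and top.x <= bottom.x
            and top.y <= bottom.y
            and top.x + top.width >= bottom.x + bottom.width
            and top.y + top.height >= bottom.y + bottom.height)

def hidden_slots(slots):
    """Names of slots that are fully hidden behind at least one other slot."""
    return [s.name for s in slots
            if any(covers(t, s) for t in slots if t is not s)]

# Three ads stacked in the same 300x250 position, plus an unrelated sidebar ad.
page = [
    AdSlot("top", 0, 0, 300, 250, 3),
    AdSlot("middle", 0, 0, 300, 250, 2),
    AdSlot("bottom", 0, 0, 300, 250, 1),
    AdSlot("sidebar", 400, 0, 160, 600, 1),
]
print(hidden_slots(page))
```

Only the slot with the highest z-index in the stack is viewable, yet each of the three would register an impression.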

#### 3. Site scams

* **Domain spoofing** is a scheme employed by deceitful publishers, ad exchanges, or networks to disguise their traffic so that it appears to come from legitimate websites. For example, an advertiser might sign off on a contract to run a campaign on a legitimate entertainment website with very high monthly traffic, but instead its ads end up on an unknown site. This practice is most prevalent in the programmatic space, where publishers are sometimes allowed to declare their own domains and label their own site IDs. Spoofed domains are not just fake website addresses; they are also banner farms that contain bad content.

* **Ghost sites** are among the most difficult fraud methods for advertisers to spot. Fraudsters create content farms and use bots to mimic human traffic. The sites may then be introduced to a legitimate ad exchange, where ad impressions are made available for advertisers to buy programmatically. Exchanges usually spot these schemes quickly, but even a short lifespan can be profitable to the ghost site creators.

* **Zero-ad sites** are those where advertising is forbidden, such as government or educational sites. But fraudsters still find ways to inject ads into them when a user downloads and installs a browser extension or app (such as a free PDF converter or browser toolbar) bundled with software that quietly injects unwanted ads into the user’s browser.


## Ad fraud detection companies
Ad fraud detection companies (commonly known as “ad verification” vendors in the ad tech space) automate most of the manual work of checking ad campaigns for fraud. Manually checking makes sense if you have the resources, but it simply doesn’t scale for larger advertising operations. In those cases, automated ad verification vendors play a helpful role in complementing manual analysis.

Here are some of the most notable players:

- Integral Ad Science
- DoubleVerify
- Forensiq
- White Ops
- Fraudlogix
- Trust Metrics
- comScore
- Oxford BioChronometrics
- Pixalate


<img src="../../resources/adfraud-forensiq.png" alt="Intellinum Bootcamp" style="width: 800px; height: 500px">


__Reference__

- https://marketingland.com/ad-fraud-detection-guide-marketers-214928
- https://www.sizmek.com/blog/types-of-ad-fraud-and-what-you-can-do-about-them/
- https://clearcode.cc/blog/rtb-online-advertising-fraud/
- https://impact.com/ad-fraud-detection/

## Project goal:


Display advertising is a billion-dollar effort and one of the central uses of machine learning on the Internet. However, its data and methods are usually kept under lock and key. In this research project, you will work on three hours' worth of [Real-Time Bidding](https://arxiv.org/pdf/1610.03013.pdf) [bid request data](https://docs.bidswitch.com/standards/bid-request-examples.html) and develop models that identify `sophisticated invalid traffic (SIVT)` generated by bots.

The requirements for this project come from the ad fraud detection product of a renowned [DSP](https://en.wikipedia.org/wiki/Demand-side_platform).




In [33]:
#MODE = "LOCAL"
MODE = "CLUSTER"

# Standard library
import sys
import os
import json
import math
import numbers

# Plotting and numerics
import numpy as np
import matplotlib.pyplot as plt
from matplotlib import interactive
interactive(True)
%matplotlib inline
import plotly
plotly.offline.init_notebook_mode(connected=True)

# Spark
from pyspark import SparkConf
from pyspark.sql import SparkSession, Row
from pyspark.sql.types import *
from pyspark.sql import functions as F
from pyspark.storagelevel import StorageLevel

sys.path.insert(0,'../../src')
from settings import *

# Download the packaged Python environment if it is not already present locally
if not os.path.exists('../../libs/pyspark24_py36.zip'):
    !aws s3 cp s3://yuan.intellinum.co/bins/pyspark24_py36.zip ../../libs/pyspark24_py36.zip
        
try:
    spark.stop()
    print("Stopped a SparkSession")
except Exception as e:
    print("No existing SparkSession")

SPARK_DRIVER_MEMORY= "1G"
SPARK_DRIVER_CORE = "1"
SPARK_EXECUTOR_MEMORY= "1G"
SPARK_EXECUTOR_CORE = "1"
SPARK_EXECUTOR_INSTANCES = 6



conf = None
if MODE == "LOCAL":
    os.environ["PYSPARK_PYTHON"] = "/home/yuan/anaconda3/envs/pyspark24_py36/bin/python"
    conf = SparkConf().\
            setAppName("adtech_bots_detection").\
            setMaster('local[*]').\
            set('spark.driver.maxResultSize', '0').\
            set('spark.jars', '../../libs/mysql-connector-java-5.1.45-bin.jar').\
            set('spark.jars.packages','net.java.dev.jets3t:jets3t:0.9.0,com.google.guava:guava:16.0.1,com.amazonaws:aws-java-sdk:1.7.4,org.apache.hadoop:hadoop-aws:2.7.1')
else:
    os.environ["PYSPARK_PYTHON"] = "./MN/pyspark24_py36/bin/python"
    conf = SparkConf().\
            setAppName("adtech_bots_detection").\
            setMaster('yarn-client').\
            set('spark.executor.cores', SPARK_EXECUTOR_CORE).\
            set('spark.executor.memory', SPARK_EXECUTOR_MEMORY).\
            set('spark.driver.cores', SPARK_DRIVER_CORE).\
            set('spark.driver.memory', SPARK_DRIVER_MEMORY).\
            set("spark.executor.instances", SPARK_EXECUTOR_INSTANCES).\
            set('spark.sql.files.ignoreCorruptFiles', 'true').\
            set('spark.yarn.dist.archives', '../../libs/pyspark24_py36.zip#MN').\
            set('spark.sql.shuffle.partitions', '5000').\
            set('spark.default.parallelism', '5000').\
            set('spark.driver.maxResultSize', '0').\
            set('spark.jars.packages','net.java.dev.jets3t:jets3t:0.9.0,com.google.guava:guava:16.0.1,com.amazonaws:aws-java-sdk:1.7.4,org.apache.hadoop:hadoop-aws:2.7.1').\
            set('spark.jars', 's3://yuan.intellinum.co/bins/mysql-connector-java-5.1.45-bin.jar')
        

spark = SparkSession.builder.\
    config(conf=conf).\
    getOrCreate()


sc = spark.sparkContext

sc.addPyFile('../../src/settings.py')

hadoop_conf = sc._jsc.hadoopConfiguration()
hadoop_conf.set("fs.s3.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
hadoop_conf.set("mapreduce.fileoutputcommitter.algorithm.version", "2")

print("-----------------DONE-----------------")

def display(df, limit=10):
    return df.limit(limit).toPandas()

def dfTest(id, expected, result):
    assert str(expected) == str(result), "{} does not equal expected {}".format(result, expected)

Stopped a SparkSession



In [2]:
import plotly.graph_objs as go

def plot_barchart(dataframes, xValues, yValues, xLabels, yLabels, plotNames, plotTitles):
    """
    Plot one bar chart per Spark DataFrame.

    dataframes = a list of Spark DataFrames to be plotted
    xValues = a list with the x-value column name for each DataFrame in dataframes
    yValues = a list with the y-value column name for each DataFrame in dataframes
    xLabels = a list with the x-axis label for each DataFrame in dataframes
    yLabels = a list with the y-axis label for each DataFrame in dataframes
    plotNames = a list with the plot (trace) name for each DataFrame in dataframes
    plotTitles = a list with the plot title for each DataFrame in dataframes
    """
    no_of_plots = len(dataframes)
    plots = []
    layouts = []
    
    # Creating the plots from the dataframe and x-y values
    print("Creating the plots from the dataframe and x-y values . . . .")
    for i in range(no_of_plots):
        print("Converting the  "+str(dataframes[i])+" to Pandas series")
        dataframe = dataframes[i].toPandas()
        xValue = dataframe[xValues[i]]
        yValue = dataframe[yValues[i]]
        plotName = plotNames[i]
        plotData = go.Bar(
                    x = xValue,
                    y = yValue,
                    name = plotName
                )
        plots.append(plotData)
    print("---------------Done------------- \n")
    
    # Creating the layouts for each plot using x-y labels
    print("Creating the layouts for each plot using x-y labels . . . .")
    for i in range(no_of_plots):
        xLabel = xLabels[i]
        yLabel = yLabels[i]
        plotTitle = plotTitles[i]
        
        layout = go.Layout(
                    title=plotTitle,
                    xaxis={
                        "title" : xLabel,
                        "tickfont" : {
                            "size" : 14,
                        }
                    },
                    yaxis={
                        "title" : yLabel,
                        "tickfont" : {
                            "size" : 10,
                        }
                    },
                    legend = {
                        "x" : 1,
                        "y" : 1
                    },
                    bargap=0.15,
                )
        
        layouts.append(layout)
    print("---------------Done------------- \n")
    
    # Actually plotting all the plots
    print("Actually plotting all the plots")
    for i in range(no_of_plots):
        fig = go.Figure(data=[plots[i]], layout=layouts[i])
        plotly.offline.iplot(fig)

## Load RTB data

The data for this project only contains four fields: 
* user_id: Anonymised [mobile device ID](https://www.aerserv.com/blog/mobile-device-identifiers/)
* ip: device IP address
* timestamp: Timestamp of the [impression](https://www.youtube.com/watch?v=rTg9l4d8MU4)
* bundle_id: The app bundle ID of a mobile app that generated the impressions or clicks, or the domain name if the click came from web inventory.


In [3]:
rtb_bids_stream_df = spark.read.parquet("s3a://data.intellinum.co/project/bots_detection/")

In [4]:
rtb_bids_stream_df.count()

3722923

In [5]:
# Create a random sample of the original data

sample_rtb_bids_stream_df = rtb_bids_stream_df.sample(False, 0.2, 99)

In [6]:
sample_rtb_bids_stream_df.count()

744402

### First Order Differences 

Looking at the first-order differences (FOD) between consecutive timestamps for a user, grouped by IP, can give us insights such as:
- The activity of the user on that IP over a period of time (which, in the case of bots, should exhibit some kind of pattern)
- Visualizing users against their history of interactions, based on the time differences, can help separate legitimate users from bots
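As a minimal sketch of the idea, assuming events arrive as `(user_id, ip, timestamp)` tuples, the first-order differences can be computed in plain Python as follows. The function name `first_order_differences` and the short user labels are hypothetical; the IPs and timestamps are taken from sample rows of this dataset, and the leading `0` mirrors how the Spark cells replace a missing previous timestamp with 0.

```python
from collections import defaultdict
from datetime import datetime

def first_order_differences(events):
    """Map each (user_id, ip) pair to the gaps, in seconds, between its
    consecutive timestamps. The first entry is always 0 (no previous event)."""
    grouped = defaultdict(list)
    for user_id, ip, ts in events:
        grouped[(user_id, ip)].append(datetime.strptime(ts, "%Y-%m-%d %H:%M:%S"))

    result = {}
    for key, stamps in grouped.items():
        stamps.sort()  # differences only make sense in time order
        gaps = [int((b - a).total_seconds()) for a, b in zip(stamps, stamps[1:])]
        result[key] = [0] + gaps
    return result

events = [
    ("user_a", "24.63.138.33", "2019-05-01 01:45:50"),
    ("user_a", "24.63.138.33", "2019-05-01 01:45:49"),
    ("user_b", "166.137.14.113", "2019-05-01 03:33:49"),
    ("user_b", "166.137.14.113", "2019-05-01 03:40:17"),
]
fod = first_order_differences(events)
print(fod)
```

The Spark cells below compute the same quantity at scale with a window function partitioned by `user_id` and `ip`.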

In [7]:
# Count bid requests per user. The count is used later to reduce the search space
# by filtering out seemingly legitimate, low-activity users; a user with more than
# 150 requests within two hours, for example, is a bit suspicious.

# USER_ACTIVITY_THRESHOLD = '50'

userCountDF = (sample_rtb_bids_stream_df.alias("a").groupBy("user_id")
                                                .agg(F.count("user_id").alias("user_id_count"))
                                                )

In [8]:
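# Attach the per-user request count to every row.
# Alias-qualified columns avoid an ambiguous condition when joining a DataFrame
# with an aggregate derived from itself.
mostActiveUserDF = (sample_rtb_bids_stream_df.alias("a").join(userCountDF.alias("b"), 
                       F.col("a.user_id") == F.col("b.user_id"))
                       .select("a.user_id", "a.ip","a.timestamp", "a.bundle_id", "b.user_id_count"))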
display(mostActiveUserDF)

| | user_id | ip | timestamp | bundle_id | user_id_count |
|---|---|---|---|---|---|
| 0 | 002f353573ea0ba669e60d99144ec384b37c429e070c7a... | 24.63.138.33 | 2019-05-01 01:45:50 | 207946492c0309c739dec11a00f6b5889f65c181c46a57... | 2 |
| 1 | 002f353573ea0ba669e60d99144ec384b37c429e070c7a... | 24.63.138.33 | 2019-05-01 01:45:49 | 5191f38a8770a2bafe7da861940387eecb69af8f50d494... | 2 |
| 2 | 0718eb4ea4330960226fac0e14b630a0c73c7eb0241eb7... | 69.124.99.123 | 2019-05-01 00:34:27 | b229af038b1d7bcace1a046756c0bfd962a7159bd0de67... | 1 |
| 3 | 0812d9ffbac8cf3963c731e3ce3ea69534df3d0096060f... | 73.119.82.95 | 2019-05-01 02:58:03 | f2d54bdc4e8293b82581cdb315600b439ba0326715ffc3... | 1 |
| 4 | 09269d493d8c3401ef9b9acdd54c164cdb69b30d1a255b... | 100.12.181.251 | 2019-05-01 01:47:38 | 902bd6a84758de8b9c644162a1352f983a8c5adc192af3... | 1 |
| 5 | 09b054c6afb596b2c0089481921a4580d4d85cb2495ac8... | 96.59.56.216 | 2019-05-01 00:54:36 | 7ff7b4bf65abcf58bdb6f73beacad68b4307a61d4a9cba... | 1 |
| 6 | 0a28f0b098234193450aa45abbded8bfd2cfdec9d769fb... | 166.137.14.113 | 2019-05-01 03:33:49 | 297d6cd809fbcf2637f340ead233cb43040c2094ecad7e... | 2 |
| 7 | 0a28f0b098234193450aa45abbded8bfd2cfdec9d769fb... | 166.137.14.113 | 2019-05-01 03:40:17 | 297d6cd809fbcf2637f340ead233cb43040c2094ecad7e... | 2 |
| 8 | 1bf8a63780946ef50cd2ed389fec3c94dbb30ccda495d8... | 166.137.126.24 | 2019-05-01 01:41:54 | 12813e8e8367762b61a1ac2cd7138634d709a13817d9aa... | 1 |
| 9 | 2bb275399b0dc926397e371ee3bf8821b418dc59f1e32c... | 2.50.182.78 | 2019-05-01 04:04:52 | 971d676c98e18e49a59077f79b4bc059a99440145eb2d7... | 1 |


In [None]:
from pyspark.sql.window import Window
from pyspark.sql.functions import when, isnull, unix_timestamp, lag, desc


In [None]:
window = Window.partitionBy("user_id", "ip").orderBy("timestamp")

timestampPrevDF = (mostActiveUserDF
                       .withColumn("prev", 
                                  lag(mostActiveUserDF.timestamp)
                                  .over(window)))

In [None]:
timeDiff = (unix_timestamp("timestamp")
            - unix_timestamp("prev"))

timestampDeltaDF = timestampPrevDF.withColumn("timeDelta", 
                                              when(isnull(timeDiff), 0)
                              .otherwise(timeDiff))

In [None]:
display(timestampDeltaDF)

| | user_id | ip | timestamp | bundle_id | user_id_count | prev | timeDelta |
|---|---|---|---|---|---|---|---|
| 0 | 002f353573ea0ba669e60d99144ec384b37c429e070c7a... | 24.63.138.33 | 2019-05-01 01:45:49 | 5191f38a8770a2bafe7da861940387eecb69af8f50d494... | 2 | | 0 |
| 1 | 002f353573ea0ba669e60d99144ec384b37c429e070c7a... | 24.63.138.33 | 2019-05-01 01:45:50 | 207946492c0309c739dec11a00f6b5889f65c181c46a57... | 2 | 2019-05-01 01:45:49 | 1 |
| 2 | 0718eb4ea4330960226fac0e14b630a0c73c7eb0241eb7... | 69.124.99.123 | 2019-05-01 00:34:27 | b229af038b1d7bcace1a046756c0bfd962a7159bd0de67... | 1 | | 0 |
| 3 | 0812d9ffbac8cf3963c731e3ce3ea69534df3d0096060f... | 73.119.82.95 | 2019-05-01 02:58:03 | f2d54bdc4e8293b82581cdb315600b439ba0326715ffc3... | 1 | | 0 |
| 4 | 09269d493d8c3401ef9b9acdd54c164cdb69b30d1a255b... | 100.12.181.251 | 2019-05-01 01:47:38 | 902bd6a84758de8b9c644162a1352f983a8c5adc192af3... | 1 | | 0 |
| 5 | 09b054c6afb596b2c0089481921a4580d4d85cb2495ac8... | 96.59.56.216 | 2019-05-01 00:54:36 | 7ff7b4bf65abcf58bdb6f73beacad68b4307a61d4a9cba... | 1 | | 0 |
| 6 | 0a28f0b098234193450aa45abbded8bfd2cfdec9d769fb... | 166.137.14.113 | 2019-05-01 03:33:49 | 297d6cd809fbcf2637f340ead233cb43040c2094ecad7e... | 2 | | 0 |
| 7 | 0a28f0b098234193450aa45abbded8bfd2cfdec9d769fb... | 166.137.14.113 | 2019-05-01 03:40:17 | 297d6cd809fbcf2637f340ead233cb43040c2094ecad7e... | 2 | 2019-05-01 03:33:49 | 388 |
| 8 | 1bf8a63780946ef50cd2ed389fec3c94dbb30ccda495d8... | 166.137.126.24 | 2019-05-01 01:41:54 | 12813e8e8367762b61a1ac2cd7138634d709a13817d9aa... | 1 | | 0 |
| 9 | 2bb275399b0dc926397e371ee3bf8821b418dc59f1e32c... | 2.50.182.78 | 2019-05-01 04:04:52 | 971d676c98e18e49a59077f79b4bc059a99440145eb2d7... | 1 | | 0 |


In [None]:
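# Keep users with more than 5 requests and collect their deltas per (user_id, ip).
# Note: collect_list does not guarantee element order after a shuffle, so the
# lists are not guaranteed to be in timestamp order.
firstOrderTsDiffDF = (timestampDeltaDF
                        .filter("user_id_count > 5")
                        .groupby('user_id', 'ip')
                        .agg(F.collect_list("timeDelta")
                        .alias("first_order_ts_diff")))
display(firstOrderTsDiffDF)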

| | user_id | ip | first_order_ts_diff |
|---|---|---|---|
| 0 | 447f3eef27865de54c79a9efebcae58a3c552bd7cea148... | 104.162.71.56 | [0, 37, 43, 53, 297] |
| 1 | 447f3eef27865de54c79a9efebcae58a3c552bd7cea148... | 208.54.37.206 | [0, 80, 827, 1, 75] |
| 2 | b5933d1e45bdecb4a99fd711101048359c69f1912717a1... | 172.58.172.238 | [0, 51, 69, 60, 10, 14, 2] |
| 3 | 105d74f3fd246c22fd762fc1e76be1c842265a1720660a... | 71.233.99.6 | [0, 15, 23, 71, 178, 359] |
| 4 | 3e0da3c473cb4310b65952c3713dcbc9cde4042930f0d8... | 100.34.60.228 | [0, 149, 222, 78, 697, 12, 247, 66] |
| 5 | 3f4846bb853bb184c0c52adca6b1e1055e9b37ffc66112... | 66.177.29.215 | [0, 2480, 2, 46, 24, 64, 1132, 10993, 2106] |
| 6 | 43a07c4ab103802760cff41539a5f774e5964c34a57aad... | 78.95.215.115 | [0, 35, 99, 537, 488, 566] |
| 7 | 53154c87c5fa246db317e67d3f47407d6340ad19afcf09... | 110.141.224.79 | [0, 90, 240, 219, 45, 45, 135, 45, 18, 42, 135... |
| 8 | c754115549affe5932658254fbe264a58f40c032eda3e6... | 101.190.36.202 | [0, 184, 134, 178, 202, 9, 648, 109, 61, 180, ... |
| 9 | 27cc3e1957b871ff172f7276c7c4bc248eb2ccd1612ce8... | 68.194.185.45 | [0, 557, 1288, 711, 432, 615] |


In [None]:
firstOrderTsDiffDF.count()

23322

In [None]:
df1 = firstOrderTsDiffDF.toPandas()

### Plotting the first-order time differences for each user

In [25]:
traces = []

for idx, row in df1.iterrows():
    if len(row.first_order_ts_diff) > 50:
        trace = go.Scatter(
                        x = [idx+1]*len(row.first_order_ts_diff),
                        y = row.first_order_ts_diff,
                        mode = 'markers'
                        )
        traces.append(trace)

    
layout = go.Layout(
    title='First Order Differences vs Users',
    yaxis=dict(
        title='First Order Differences'
    ),
    xaxis=dict(
        title='Users'
    )
)
fig = go.Figure(data=traces, layout=layout)
plotly.offline.iplot(fig, filename='multiple-axes-double')


### Second Order Differences
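A second-order difference is the change between two consecutive first-order differences, so it shows whether the gaps between a user's requests are growing, shrinking, or staying constant. The sketch below reproduces the convention used in the cells that follow: previous delta minus current delta, with a missing previous value replaced by 0. The function name `second_order_differences` is hypothetical; the input lists are first-order differences taken from this dataset.

```python
def second_order_differences(first_order):
    """Previous delta minus current delta for each position.

    The first position has no previous delta and is set to 0. Note the sign:
    a gap that grows from one request to the next gives a negative value.
    """
    previous = [None] + list(first_order[:-1])
    return [0 if p is None else p - d for p, d in zip(previous, first_order)]

sod_a = second_order_differences([0, 37, 43, 53, 297])
sod_b = second_order_differences([0, 80, 827, 1, 75])
print(sod_a)  # [0, -37, -6, -10, -244]
print(sod_b)  # [0, -80, -747, 826, -74]
```

Because the first-order list starts with a placeholder 0, the second entry is simply the negated first real gap and carries no extra information.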

In [26]:
window = Window.partitionBy("user_id", "ip").orderBy("timestamp")

timestampSODPrevDF = (timestampDeltaDF
                       .withColumn("sod_prev", 
                                  lag(timestampDeltaDF.timeDelta, 1)
                                  .over(window)))
display(timestampSODPrevDF, 1000)

| | user_id | ip | timestamp | bundle_id | user_id_count | prev | timeDelta | sod_prev |
|---|---|---|---|---|---|---|---|---|
| 0 | 002f353573ea0ba669e60d99144ec384b37c429e070c7a... | 24.63.138.33 | 2019-05-01 01:45:49 | 5191f38a8770a2bafe7da861940387eecb69af8f50d494... | 2 | | 0 | |
| 1 | 002f353573ea0ba669e60d99144ec384b37c429e070c7a... | 24.63.138.33 | 2019-05-01 01:45:50 | 207946492c0309c739dec11a00f6b5889f65c181c46a57... | 2 | 2019-05-01 01:45:49 | 1 | 0.0 |
| 2 | 0718eb4ea4330960226fac0e14b630a0c73c7eb0241eb7... | 69.124.99.123 | 2019-05-01 00:34:27 | b229af038b1d7bcace1a046756c0bfd962a7159bd0de67... | 1 | | 0 | |
| 3 | 0812d9ffbac8cf3963c731e3ce3ea69534df3d0096060f... | 73.119.82.95 | 2019-05-01 02:58:03 | f2d54bdc4e8293b82581cdb315600b439ba0326715ffc3... | 1 | | 0 | |
| 4 | 09269d493d8c3401ef9b9acdd54c164cdb69b30d1a255b... | 100.12.181.251 | 2019-05-01 01:47:38 | 902bd6a84758de8b9c644162a1352f983a8c5adc192af3... | 1 | | 0 | |
| 5 | 09b054c6afb596b2c0089481921a4580d4d85cb2495ac8... | 96.59.56.216 | 2019-05-01 00:54:36 | 7ff7b4bf65abcf58bdb6f73beacad68b4307a61d4a9cba... | 1 | | 0 | |
| 6 | 0a28f0b098234193450aa45abbded8bfd2cfdec9d769fb... | 166.137.14.113 | 2019-05-01 03:33:49 | 297d6cd809fbcf2637f340ead233cb43040c2094ecad7e... | 2 | | 0 | |
| 7 | 0a28f0b098234193450aa45abbded8bfd2cfdec9d769fb... | 166.137.14.113 | 2019-05-01 03:40:17 | 297d6cd809fbcf2637f340ead233cb43040c2094ecad7e... | 2 | 2019-05-01 03:33:49 | 388 | 0.0 |
| 8 | 1bf8a63780946ef50cd2ed389fec3c94dbb30ccda495d8... | 166.137.126.24 | 2019-05-01 01:41:54 | 12813e8e8367762b61a1ac2cd7138634d709a13817d9aa... | 1 | | 0 | |
| 9 | 2bb275399b0dc926397e371ee3bf8821b418dc59f1e32c... | 2.50.182.78 | 2019-05-01 04:04:52 | 971d676c98e18e49a59077f79b4bc059a99440145eb2d7... | 1 | | 0 | |


In [27]:
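# Second-order difference: previous delta minus current delta.
# Note the sign: a gap that grows from one request to the next gives a negative value.
# Rows without a previous delta (the first row of each partition) are set to 0.
sodDiff = timestampSODPrevDF.sod_prev - timestampSODPrevDF.timeDelta

timestampSODDeltaDF = timestampSODPrevDF.withColumn("timeSODDelta", 
                                              when(isnull(sodDiff), 0)
                              .otherwise(sodDiff))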

In [28]:
display(timestampSODDeltaDF, 1000)

| | user_id | ip | timestamp | bundle_id | user_id_count | prev | timeDelta | sod_prev | timeSODDelta |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 002f353573ea0ba669e60d99144ec384b37c429e070c7a... | 24.63.138.33 | 2019-05-01 01:45:49 | 5191f38a8770a2bafe7da861940387eecb69af8f50d494... | 2 | | 0 | | 0 |
| 1 | 002f353573ea0ba669e60d99144ec384b37c429e070c7a... | 24.63.138.33 | 2019-05-01 01:45:50 | 207946492c0309c739dec11a00f6b5889f65c181c46a57... | 2 | 2019-05-01 01:45:49 | 1 | 0.0 | -1 |
| 2 | 0718eb4ea4330960226fac0e14b630a0c73c7eb0241eb7... | 69.124.99.123 | 2019-05-01 00:34:27 | b229af038b1d7bcace1a046756c0bfd962a7159bd0de67... | 1 | | 0 | | 0 |
| 3 | 0812d9ffbac8cf3963c731e3ce3ea69534df3d0096060f... | 73.119.82.95 | 2019-05-01 02:58:03 | f2d54bdc4e8293b82581cdb315600b439ba0326715ffc3... | 1 | | 0 | | 0 |
| 4 | 09269d493d8c3401ef9b9acdd54c164cdb69b30d1a255b... | 100.12.181.251 | 2019-05-01 01:47:38 | 902bd6a84758de8b9c644162a1352f983a8c5adc192af3... | 1 | | 0 | | 0 |
| 5 | 09b054c6afb596b2c0089481921a4580d4d85cb2495ac8... | 96.59.56.216 | 2019-05-01 00:54:36 | 7ff7b4bf65abcf58bdb6f73beacad68b4307a61d4a9cba... | 1 | | 0 | | 0 |
| 6 | 0a28f0b098234193450aa45abbded8bfd2cfdec9d769fb... | 166.137.14.113 | 2019-05-01 03:33:49 | 297d6cd809fbcf2637f340ead233cb43040c2094ecad7e... | 2 | | 0 | | 0 |
| 7 | 0a28f0b098234193450aa45abbded8bfd2cfdec9d769fb... | 166.137.14.113 | 2019-05-01 03:40:17 | 297d6cd809fbcf2637f340ead233cb43040c2094ecad7e... | 2 | 2019-05-01 03:33:49 | 388 | 0.0 | -388 |
| 8 | 1bf8a63780946ef50cd2ed389fec3c94dbb30ccda495d8... | 166.137.126.24 | 2019-05-01 01:41:54 | 12813e8e8367762b61a1ac2cd7138634d709a13817d9aa... | 1 | | 0 | | 0 |
| 9 | 2bb275399b0dc926397e371ee3bf8821b418dc59f1e32c... | 2.50.182.78 | 2019-05-01 04:04:52 | 971d676c98e18e49a59077f79b4bc059a99440145eb2d7... | 1 | | 0 | | 0 |


In [29]:
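# Keep users with more than 5 requests and collect their second-order
# differences per (user_id, ip). collect_list does not guarantee element order.
secondOrderTsDiffDF = (timestampSODDeltaDF
                        .filter("user_id_count > 5")
                        .groupby('user_id', 'ip')
                        .agg(F.collect_list("timeSODDelta")
                        .alias("second_order_ts_diff")))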

In [30]:
display(secondOrderTsDiffDF, 1000)

| | user_id | ip | second_order_ts_diff |
|---|---|---|---|
| 0 | 447f3eef27865de54c79a9efebcae58a3c552bd7cea148... | 104.162.71.56 | [0, -37, -6, -10, -244] |
| 1 | 447f3eef27865de54c79a9efebcae58a3c552bd7cea148... | 208.54.37.206 | [0, -80, -747, 826, -74] |
| 2 | b5933d1e45bdecb4a99fd711101048359c69f1912717a1... | 172.58.172.238 | [0, -51, -18, 9, 50, -4, 12] |
| 3 | 105d74f3fd246c22fd762fc1e76be1c842265a1720660a... | 71.233.99.6 | [0, -15, -8, -48, -107, -181] |
| 4 | 3e0da3c473cb4310b65952c3713dcbc9cde4042930f0d8... | 100.34.60.228 | [0, -149, -73, 144, -619, 685, -235, 181] |
| 5 | 3f4846bb853bb184c0c52adca6b1e1055e9b37ffc66112... | 66.177.29.215 | [0, -2480, 2478, -44, 22, -40, -1068, -9861, 8... |
| 6 | 43a07c4ab103802760cff41539a5f774e5964c34a57aad... | 78.95.215.115 | [0, -35, -64, -438, 49, -78] |
| 7 | 53154c87c5fa246db317e67d3f47407d6340ad19afcf09... | 110.141.224.79 | [0, -90, -150, 21, 174, 0, -90, 90, 27, -24, -... |
| 8 | c754115549affe5932658254fbe264a58f40c032eda3e6... | 101.190.36.202 | [0, -184, 50, -44, -24, 193, -639, 539, 48, -1... |
| 9 | 27cc3e1957b871ff172f7276c7c4bc248eb2ccd1612ce8... | 68.194.185.45 | [0, -557, -731, 577, 279, -183] |


In [31]:
df = secondOrderTsDiffDF.toPandas()

In [32]:
traces = []

for idx, row in df.iterrows():
    if len(row.second_order_ts_diff) > 50:
        trace = go.Scatter(
                        x = [idx+1]*len(row.second_order_ts_diff),
                        y = row.second_order_ts_diff,
                        mode = 'markers'
                        )
        traces.append(trace)

    
layout = go.Layout(
    title='Second Order Differences vs Users',
    yaxis=dict(
        title='Second Order Differences'
    ),
    xaxis=dict(
        title='Users'
    )
)
fig = go.Figure(data=traces, layout=layout)
plotly.offline.iplot(fig, filename='multiple-axes-double')

## Identify bots traffic

It is typical for a contemporary RTB system to receive millions of bid requests per second at peak time. The system caches 2 hours' worth of bid request data in memory, and your job is to implement a machine learning model that identifies bot IPs or user IDs using pattern recognition.

Spend some time looking for patterns in the data first. You are free to use any python libraries for this project. 
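One possible starting point, offered only as an illustrative heuristic and not as the method this project prescribes, is to turn each user's list of inter-request gaps into a single regularity feature. The sketch below uses the coefficient of variation (standard deviation divided by mean): scripted traffic that fires at near-constant intervals would score close to 0, while bursty human browsing tends to score higher. The function name `regularity_score` is hypothetical, and any cut-off would have to be validated against labelled data.

```python
from statistics import mean, pstdev

def regularity_score(deltas):
    """Coefficient of variation of the gaps between requests.

    `deltas` is a first-order difference list whose first entry is the
    placeholder 0, so it is skipped. Lower scores mean more regular timing.
    Returns None when there are fewer than two real gaps.
    """
    gaps = deltas[1:]
    if len(gaps) < 2:
        return None
    average = mean(gaps)
    if average == 0:
        return 0.0  # every request landed in the same second
    return pstdev(gaps) / average

# A perfectly periodic (synthetic) sequence versus one taken from the dataset.
periodic = regularity_score([0, 45, 45, 45, 45])
irregular = regularity_score([0, 2480, 2, 46, 24, 64, 1132, 10993, 2106])
print(periodic, irregular)
```

A feature like this could be combined with others (distinct IPs per user, distinct bundle IDs, request counts) before being fed to a clustering or classification model.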

In [None]:

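# Count bid requests per user (F.count is used because `count` is not imported yet at this point)
useridGroupCountDF = sample_rtb_bids_stream_df.groupBy("user_id").agg(F.count("user_id").alias("userVisits"))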

In [None]:
# `max` is deliberately not imported here, so Python's built-in max() is not shadowed
from pyspark.sql.functions import col, desc, lag, count

# Attach the per-user visit count to every row, most active users first
newRTBBidsDF = (sample_rtb_bids_stream_df.alias("a").join(useridGroupCountDF.alias("b"), 
                col("a.user_id") == col("b.user_id"))
                .select("a.user_id", "a.ip", "a.bundle_id", "a.timestamp","b.userVisits")
                .orderBy(desc("b.userVisits")))

In [None]:
display(newRTBBidsDF)

In [None]:
from pyspark.sql.window import Window

window = Window.partitionBy("user_id", "ip", "bundle_id").orderBy("timestamp")

timestampPrevDF = (newRTBBidsDF
                       .withColumn("prev", 
                                  lag(newRTBBidsDF.timestamp)
                                  .over(window)))

In [None]:
display(timestampPrevDF, 1000)

In [None]:
from pyspark.sql.functions import when, isnull, unix_timestamp

timeDiff = (unix_timestamp("timestamp")
            - unix_timestamp("prev"))

timestampDeltaDF = timestampPrevDF.withColumn("timeDelta", 
                                              when(isnull(timeDiff), 0)
                              .otherwise(timeDiff))

In [None]:
# The count column in this section is named userVisits
display(timestampDeltaDF.filter("userVisits > 10"), 1000)


&copy; 2019 [Intellinum Analytics, Inc](http://www.intellinum.co). All rights reserved.<br/>