Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

IAB Robots & Spiders enrichment #3750

Closed
wants to merge 8 commits into from
Closed

IAB Robots & Spiders enrichment #3750

wants to merge 8 commits into from

Conversation

rzats
Copy link
Contributor

@rzats rzats commented May 2, 2018

Closes #937.

Still needs some work:

  • test this in an actual pipeline
  • figure out how to determine when basic HTTP authentication is necessary and apply the username/password

but should be ready for an initial review.

@rzats rzats requested review from chuwy and BenFradet May 2, 2018 13:02
Copy link
Contributor

@BenFradet BenFradet left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good overall, please introduce more tests 👍

@@ -168,6 +169,8 @@ object EnrichmentRegistry {
SqlQueryEnrichmentConfig.parse(enrichmentConfig, schemaKey).map((nm, _).some)
} else if (nm == "pii_enrichment_config") {
PiiPseudonymizerEnrichment.parse(enrichmentConfig, schemaKey).map((nm, _).some)
} else if (nm == "iab_spiders_and_robots_enrichment") {

This comment was marked as resolved.

This comment was marked as resolved.

This comment was marked as resolved.

* @return Option boxing the IabEnrichment instance
*/
def getIabEnrichment: Option[IabEnrichment] =
getEnrichment[IabEnrichment]("iab_spiders_and_robots_enrichment")

This comment was marked as resolved.

"iab_spiders_and_robots_enrichment",
"jsonschema",
1,
0)

This comment was marked as resolved.


// Scalaz
import scalaz.Scalaz._
import scalaz._
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the imports are:

import scalaz._
import Scalaz._

* @return a configured IabEnrichment instance
*/
def parse(config: JValue, schemaKey: SchemaKey, localMode: Boolean): ValidatedNelMessage[IabEnrichment] =
isParseable(config, schemaKey).flatMap(conf => {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

unnecessary parens


private type DbEntry = Option[(Option[URI], String)]
private val schemaUri =
"iglu:com.snowplowanalytics.snowplow.enrichments/iab_spiders_and_robots_enrichment/jsonschema/1-0-0"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

seems to be the wrong coordinates here too

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This one should actually be pointing to the response POJO, will fix

private val iabClient = {
def file(db: DbEntry): File = new File(db.get._2)

new IabClient(file(dbs._1), file(dbs._2), file(dbs._3))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it's worth introducing a test which references not existing files

private[enrichments] def performCheck(userAgent: String,
ipAddress: String,
accurateAt: DateTime): Validation[Throwable, IabResponse] =
Validation.fromTryCatch(iabClient.checkAt(userAgent, InetAddress.getByName(ipAddress), accurateAt.toDate))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is unsafe with the scalaz version we're using as it will also catch exceptions that are not NonFatal

try {
getIab(userAgent, ipAddress, accurateAt).map(addSchema)
} catch {
case NonFatal(exc) => exc.toString.fail
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

getMessage

(check.getPrimaryImpact.toString must_== expectedPrimaryImpact)
}
}
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is too light testing

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will fix this. Is there a way to do integration tests that interact directly with EnrichmentManager and send in Snowplow events? (instead of method params)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

integration tests are usually done in stream enrich or spark enrich

@alexanderdean
Copy link
Member

Thanks @BenFradet - Rostyslav please action Ben's feedback...

@rzats
Copy link
Contributor Author

rzats commented May 9, 2018

@BenFradet @chuwy just finished this round of QA, should be ready for another review

Copy link
Contributor

@BenFradet BenFradet left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we're getting there 👍 , what I would do is log a separate ticket to handle integration tests in spark enrich for example

Option(event.user_ipaddress),
Option(event.derived_tstamp).map(EventEnrichments.fromTimestamp))
.map(_.some)
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no need for curlies

(uri.toValidationNel |@| db.toValidationNel) { (uri, db) =>
for {
u <- getDatabaseUri(uri, db).toValidationNel: ValidatedNelMessage[URI]
} yield (name, u, db)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(uri.toValidationNel |@| db.toValidationNel) { (uri, db) => getDatabaseUri(uri, db).toValidationNel.map(u => (name, u, db)) }

modulo type inference

u <- getDatabaseUri(uri, db).toValidationNel: ValidatedNelMessage[URI]
} yield (name, u, db)

}.flatMap(x => x).some
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ah yes, you can use identity instead of x => x

*/
private def getDatabaseUri(uri: String, database: String): ValidatedMessage[URI] =
ConversionUtils
.stringToUri(uri + "/" + database)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would shy away from doing path handling using string concatenation, what if uri has a / at the end? does ConversionUtils handle it?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this example doesn't have two trailing slashes afaict

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems to work fine (the multiple slashes are simply added to the URI with no errors). I've added a normalize call here just in case - that should remove the extra slashes.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

normalize doesn't deal with the case I meant, doing string concatenation on paths can result in:

val uri = "http://google.com/"
val databse = "database"
uri + "/" + database
// http://google.com//database

ConversionUtils.stringToUri(uri + "/" + database)
// Success(Some(http://google.com//database))

new URI("http://google.com/").resolve("/" + database)
// http://google.com/database

Do you see what I mean now?

def db(dbPath: FileTuple): DbEntry = dbPath.map {
case (name, uri, file) =>
if (localMode) {
(None, getClass.getResource(file).toURI.getPath)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

mmmh even if this is only for tests maybe we should make it safer wrt to npes

*/
private[enrichments] def performCheck(userAgent: String,
ipAddress: String,
accurateAt: DateTime): Validation[Throwable, IabEnrichmentResponse] =
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should this be occuredAt?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

occurred even

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These are the parameter names used by the IAB client: https://github.com/snowplow/iab-spiders-and-robots-java-client/blob/master/src/main/java/com/snowplowanalytics/iab/spidersandrobotsclient/IabClient.java#L77

I could of course rename this but that would be a bit inconsistent.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

then I think it's a typo in the client since accurateAt being a DateTime doesn't make sense to me?

ipAddress: String,
accurateAt: DateTime): Validation[Throwable, IabEnrichmentResponse] =
try {
val result = iabClient.checkAt(userAgent, InetAddress.getByName(ipAddress), accurateAt.toDate)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what if it's not an ip address? I'd rather we correctly type our failure states instead of blanket wrapping in try/catch

IabEnrichmentResponse(result.isSpiderOrRobot,
result.getCategory.toString,
result.getReason.toString,
result.getPrimaryImpact.toString).success
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we keep the types further? i.e. no toString here

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The reason this is converted into a scala case class is because those types of Java enums seem to have some issues when they're serialized via Extraction.decompose(). I could convert the IabResponse into an IabEnrichmentResponse in getIab instead but I don't think that would make much of a difference.

getIab(userAgent, ipAddress, accurateAt).map(addSchema)
} catch {
case NonFatal(exc) => s"${exc.getMessage}".fail
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same here, this is our code, we shouldn't need try catch at this level

* Type alias for an Option-boxed
* Tuple3 of a file location
*/
type FileTuple = Option[(String, URI, String)]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This type alias doesn't provide any benefits as we can't infer what's inside from the name

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should I use a case class then?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes

@BenFradet
Copy link
Contributor

BenFradet commented May 10, 2018

Also since this enrichment relies on local files, they'll need to be cached, e.g. for spark enrich: https://github.com/snowplow/snowplow/blob/master/3-enrich/spark-enrich/src/main/scala/com.snowplowanalytics.snowplow.enrich.spark/EnrichJob.scala#L179-L183

Copy link
Contributor

@chuwy chuwy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey @rzats, LGTM! Well done.

result.getReason.toString,
result.getPrimaryImpact.toString).success
} catch {
case NonFatal(exc) => exc.fail
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

.fail is deprecated, try to stick with .failure (few more places)

case obj: JObject => obj.success
case _ => s"Couldn't transform IAB response $response into JSON".fail
}
case Failure(throwable) => s"${throwable.getMessage}".fail
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

throwable.getMessage.failure (quotes and curlies are redundant)

* @param context IAB context as JSON Object
* @return JSON Object wrapped as Self-describing JSON
*/
private def addSchema(context: JObject): JObject =
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm kind of surprised we don't have this method defined somewhere.

|}""".format(database, database match {
case "ipFile" => "ip_exclude_current_cidr.txt"
case "excludeUseragentFile" => "exclude_current.txt"
case "includeUseragentFile" => "include_current.txt"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

weird indent

for (idx <- IabEnrichmentCfLineSpec.expected.indices) {
actual(idx) must BeFieldEqualTo(IabEnrichmentCfLineSpec.expected(idx), idx)
}
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could we have a test that is a bot?

*/
private def getDatabaseUri(uri: String, database: String): ValidatedMessage[URI] =
ConversionUtils
.stringToUri(uri + "/" + database)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

normalize doesn't deal with the case I meant, doing string concatenation on paths can result in:

val uri = "http://google.com/"
val databse = "database"
uri + "/" + database
// http://google.com//database

ConversionUtils.stringToUri(uri + "/" + database)
// Success(Some(http://google.com//database))

new URI("http://google.com/").resolve("/" + database)
// http://google.com/database

Do you see what I mean now?

"parameters": {
"ipFile": {
"database": "ip_exclude_current_cidr.txt",
"uri": "https://example.com"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

let's add a test incorporating my remark above

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

that was meant to add a / at the end of uri to check uri resolving works properly

|"embedded": {
|"path": "/iglu"
|}
|}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please remove before release

"not write any bad rows" in {
dirs.badRows must beEmptyDir
}
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we have the two iab specs turned into one spec? No need to have two different files for the same enrichment.

Copy link
Contributor

@BenFradet BenFradet left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One important q regarding anon ip and we're good to go I think 👍

"uri": "s3://my-private-bucket/iab"
},
"httpUsername": "USERNAME_PLACEHOLDER",
"httpPassword": "PASSWORD_PLACEHOLDER"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: I'd remove basic auth or change the uris to be http to make the example coherent

Option(event.derived_tstamp).map(EventEnrichments.fromTimestamp))
.map(_.some)
case None => None.success
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One thing I missed before, at this point event.user_ipaddress may have been modified by anon ip, no?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point, I'll move this step to be executed just after the IP lookups enrichment - that shouldn't be affected by anonymization

case None => Nil
}
ipLookupFiles ++ iabFiles
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this should be defined in the enrichment registry, I logged #3789. Of course, it's out of scope for your release.

"parameters"
],
"additionalProperties": false
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

don't forget to rm

@alexanderdean
Copy link
Member

Remember @rzats that once you get the green ticket, you need to start work on assembling the R107 PR (have you come up with an archeological site yet?)

@rzats rzats force-pushed the iab-enrichment branch 3 times, most recently from 30db0b2 to f247ecd Compare June 6, 2018 20:49
Copy link
Contributor

@BenFradet BenFradet left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think the enrichment can work in stream enrich as it is.

Please try to test it before asking for a review, that will speed things up.

unitSuccess
}
case f => f
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is just map, no?

*
* @param config The iab_spiders_and_robots_enrichment JSON
* @param name The name of the lookup:
* "geo", "isp", "organization", "domain"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

?

"parameters": {
"ipFile": {
"database": "ip_exclude_current_cidr.txt",
"uri": "https://example.com"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

that was meant to add a / at the end of uri to check uri resolving works properly

@@ -42,7 +42,7 @@ object Dependencies {
val json4s = "3.2.11"
val pureconfig = "0.8.0"
val snowplowRawEvent = "0.1.0"
val snowplowCommonEnrich = "0.32.0"
val snowplowCommonEnrich = "0.34.0"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for spotting - those changes were apparently lost after rebasing, will fix. I'll also be doing some smoke testing today which should help prevent issues like these

@BenFradet
Copy link
Contributor

@rzats feel free to ping me once you've resolved the points I raised during the last review

@rzats rzats force-pushed the iab-enrichment branch 2 times, most recently from bf7fb0d to 8586726 Compare June 11, 2018 20:44
@alexanderdean
Copy link
Member

What's the hold-up here @rzats ?

@BenFradet
Copy link
Contributor

superseded by #3799, closing

@BenFradet BenFradet closed this Jun 13, 2018
@BenFradet BenFradet deleted the iab-enrichment branch June 13, 2018 12:38
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants