-
Notifications
You must be signed in to change notification settings - Fork 1.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
IAB Robots & Spiders enrichment #3750
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good overall, please introduce more tests 👍
@@ -168,6 +169,8 @@ object EnrichmentRegistry { | |||
SqlQueryEnrichmentConfig.parse(enrichmentConfig, schemaKey).map((nm, _).some) | |||
} else if (nm == "pii_enrichment_config") { | |||
PiiPseudonymizerEnrichment.parse(enrichmentConfig, schemaKey).map((nm, _).some) | |||
} else if (nm == "iab_spiders_and_robots_enrichment") { |
This comment was marked as resolved.
This comment was marked as resolved.
Sorry, something went wrong.
This comment was marked as resolved.
This comment was marked as resolved.
Sorry, something went wrong.
This comment was marked as resolved.
This comment was marked as resolved.
Sorry, something went wrong.
* @return Option boxing the IabEnrichment instance | ||
*/ | ||
def getIabEnrichment: Option[IabEnrichment] = | ||
getEnrichment[IabEnrichment]("iab_spiders_and_robots_enrichment") |
This comment was marked as resolved.
This comment was marked as resolved.
Sorry, something went wrong.
"iab_spiders_and_robots_enrichment", | ||
"jsonschema", | ||
1, | ||
0) |
This comment was marked as resolved.
This comment was marked as resolved.
Sorry, something went wrong.
|
||
// Scalaz | ||
import scalaz.Scalaz._ | ||
import scalaz._ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the imports are:
import scalaz._
import Scalaz._
* @return a configured IabEnrichment instance | ||
*/ | ||
def parse(config: JValue, schemaKey: SchemaKey, localMode: Boolean): ValidatedNelMessage[IabEnrichment] = | ||
isParseable(config, schemaKey).flatMap(conf => { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
unnecessary parens
|
||
private type DbEntry = Option[(Option[URI], String)] | ||
private val schemaUri = | ||
"iglu:com.snowplowanalytics.snowplow.enrichments/iab_spiders_and_robots_enrichment/jsonschema/1-0-0" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
seems to be the wrong coordinates here too
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This one should actually be pointing to the response POJO, will fix
private val iabClient = { | ||
def file(db: DbEntry): File = new File(db.get._2) | ||
|
||
new IabClient(file(dbs._1), file(dbs._2), file(dbs._3)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it's worth introducing a test which references not existing files
private[enrichments] def performCheck(userAgent: String, | ||
ipAddress: String, | ||
accurateAt: DateTime): Validation[Throwable, IabResponse] = | ||
Validation.fromTryCatch(iabClient.checkAt(userAgent, InetAddress.getByName(ipAddress), accurateAt.toDate)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is unsafe with the scalaz version we're using as it will also catch exceptions that are not NonFatal
try { | ||
getIab(userAgent, ipAddress, accurateAt).map(addSchema) | ||
} catch { | ||
case NonFatal(exc) => exc.toString.fail |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
getMessage
(check.getPrimaryImpact.toString must_== expectedPrimaryImpact) | ||
} | ||
} | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is too light testing
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will fix this. Is there a way to do integration tests that interact directly with EnrichmentManager and send in Snowplow events? (instead of method params)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
integration tests are usually done in stream enrich or spark enrich
Thanks @BenFradet - Rostyslav please action Ben's feedback... |
@BenFradet @chuwy just finished this round of QA, should be ready for another review |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we're getting there 👍 , what I would do is log a separate ticket to handle integration tests in spark enrich for example
Option(event.user_ipaddress), | ||
Option(event.derived_tstamp).map(EventEnrichments.fromTimestamp)) | ||
.map(_.some) | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
no need for curlies
(uri.toValidationNel |@| db.toValidationNel) { (uri, db) => | ||
for { | ||
u <- getDatabaseUri(uri, db).toValidationNel: ValidatedNelMessage[URI] | ||
} yield (name, u, db) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(uri.toValidationNel |@| db.toValidationNel) { (uri, db) => getDatabaseUri(uri, db).toValidationNel.map(u => (name, u, db)) }
modulo type inference
u <- getDatabaseUri(uri, db).toValidationNel: ValidatedNelMessage[URI] | ||
} yield (name, u, db) | ||
|
||
}.flatMap(x => x).some |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ah yes, you can use identity
instead of x => x
*/ | ||
private def getDatabaseUri(uri: String, database: String): ValidatedMessage[URI] = | ||
ConversionUtils | ||
.stringToUri(uri + "/" + database) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would shy away from doing path handling using string concatenation, what if uri
has a /
at the end? does ConversionUtils
handle it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this example doesn't have two trailing slashes afaict
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seems to work fine (the multiple slashes are simply added to the URI with no errors). I've added a normalize call here just in case - that should remove the extra slashes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
normalize doesn't deal with the case I meant, doing string concatenation on paths can result in:
val uri = "http://google.com/"
val databse = "database"
uri + "/" + database
// http://google.com//database
ConversionUtils.stringToUri(uri + "/" + database)
// Success(Some(http://google.com//database))
new URI("http://google.com/").resolve("/" + database)
// http://google.com/database
Do you see what I mean now?
def db(dbPath: FileTuple): DbEntry = dbPath.map { | ||
case (name, uri, file) => | ||
if (localMode) { | ||
(None, getClass.getResource(file).toURI.getPath) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
mmmh even if this is only for tests maybe we should make it safer wrt to npes
*/ | ||
private[enrichments] def performCheck(userAgent: String, | ||
ipAddress: String, | ||
accurateAt: DateTime): Validation[Throwable, IabEnrichmentResponse] = |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should this be occuredAt
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
occurred even
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These are the parameter names used by the IAB client: https://github.com/snowplow/iab-spiders-and-robots-java-client/blob/master/src/main/java/com/snowplowanalytics/iab/spidersandrobotsclient/IabClient.java#L77
I could of course rename this but that would be a bit inconsistent.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
then I think it's a typo in the client since accurateAt
being a DateTime
doesn't make sense to me?
ipAddress: String, | ||
accurateAt: DateTime): Validation[Throwable, IabEnrichmentResponse] = | ||
try { | ||
val result = iabClient.checkAt(userAgent, InetAddress.getByName(ipAddress), accurateAt.toDate) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what if it's not an ip address? I'd rather we correctly type our failure states instead of blanket wrapping in try/catch
IabEnrichmentResponse(result.isSpiderOrRobot, | ||
result.getCategory.toString, | ||
result.getReason.toString, | ||
result.getPrimaryImpact.toString).success |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can we keep the types further? i.e. no toString
here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The reason this is converted into a scala case class is because those types of Java enums seem to have some issues when they're serialized via Extraction.decompose()
. I could convert the IabResponse
into an IabEnrichmentResponse
in getIab
instead but I don't think that would make much of a difference.
getIab(userAgent, ipAddress, accurateAt).map(addSchema) | ||
} catch { | ||
case NonFatal(exc) => s"${exc.getMessage}".fail | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same here, this is our code, we shouldn't need try catch at this level
* Type alias for an Option-boxed | ||
* Tuple3 of a file location | ||
*/ | ||
type FileTuple = Option[(String, URI, String)] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This type alias doesn't provide any benefits as we can't infer what's inside from the name
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should I use a case class then?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes
Also since this enrichment relies on local files, they'll need to be cached, e.g. for spark enrich: https://github.com/snowplow/snowplow/blob/master/3-enrich/spark-enrich/src/main/scala/com.snowplowanalytics.snowplow.enrich.spark/EnrichJob.scala#L179-L183 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hey @rzats, LGTM! Well done.
result.getReason.toString, | ||
result.getPrimaryImpact.toString).success | ||
} catch { | ||
case NonFatal(exc) => exc.fail |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
.fail
is deprecated, try to stick with .failure
(few more places)
case obj: JObject => obj.success | ||
case _ => s"Couldn't transform IAB response $response into JSON".fail | ||
} | ||
case Failure(throwable) => s"${throwable.getMessage}".fail |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
throwable.getMessage.failure
(quotes and curlies are redundant)
* @param context IAB context as JSON Object | ||
* @return JSON Object wrapped as Self-describing JSON | ||
*/ | ||
private def addSchema(context: JObject): JObject = |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm kind of surprised we don't have this method defined somewhere.
6ad4061
to
04784b5
Compare
|}""".format(database, database match { | ||
case "ipFile" => "ip_exclude_current_cidr.txt" | ||
case "excludeUseragentFile" => "exclude_current.txt" | ||
case "includeUseragentFile" => "include_current.txt" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
weird indent
for (idx <- IabEnrichmentCfLineSpec.expected.indices) { | ||
actual(idx) must BeFieldEqualTo(IabEnrichmentCfLineSpec.expected(idx), idx) | ||
} | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
could we have a test that is a bot?
*/ | ||
private def getDatabaseUri(uri: String, database: String): ValidatedMessage[URI] = | ||
ConversionUtils | ||
.stringToUri(uri + "/" + database) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
normalize doesn't deal with the case I meant, doing string concatenation on paths can result in:
val uri = "http://google.com/"
val databse = "database"
uri + "/" + database
// http://google.com//database
ConversionUtils.stringToUri(uri + "/" + database)
// Success(Some(http://google.com//database))
new URI("http://google.com/").resolve("/" + database)
// http://google.com/database
Do you see what I mean now?
"parameters": { | ||
"ipFile": { | ||
"database": "ip_exclude_current_cidr.txt", | ||
"uri": "https://example.com" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
let's add a test incorporating my remark above
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
that was meant to add a / at the end of uri
to check uri resolving works properly
|"embedded": { | ||
|"path": "/iglu" | ||
|} | ||
|} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
please remove before release
"not write any bad rows" in { | ||
dirs.badRows must beEmptyDir | ||
} | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can we have the two iab specs turned into one spec? No need to have two different files for the same enrichment.
c5cd408
to
710bc9d
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One important q regarding anon ip and we're good to go I think 👍
"uri": "s3://my-private-bucket/iab" | ||
}, | ||
"httpUsername": "USERNAME_PLACEHOLDER", | ||
"httpPassword": "PASSWORD_PLACEHOLDER" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: I'd remove basic auth or change the uris to be http to make the example coherent
Option(event.derived_tstamp).map(EventEnrichments.fromTimestamp)) | ||
.map(_.some) | ||
case None => None.success | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One thing I missed before, at this point event.user_ipaddress
may have been modified by anon ip, no?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point, I'll move this step to be executed just after the IP lookups enrichment - that shouldn't be affected by anonymization
case None => Nil | ||
} | ||
ipLookupFiles ++ iabFiles | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this should be defined in the enrichment registry, I logged #3789. Of course, it's out of scope for your release.
"parameters" | ||
], | ||
"additionalProperties": false | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
don't forget to rm
Remember @rzats that once you get the green ticket, you need to start work on assembling the R107 PR (have you come up with an archeological site yet?) |
30db0b2
to
f247ecd
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think the enrichment can work in stream enrich as it is.
Please try to test it before asking for a review, that will speed things up.
unitSuccess | ||
} | ||
case f => f | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is just map, no?
* | ||
* @param config The iab_spiders_and_robots_enrichment JSON | ||
* @param name The name of the lookup: | ||
* "geo", "isp", "organization", "domain" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
?
"parameters": { | ||
"ipFile": { | ||
"database": "ip_exclude_current_cidr.txt", | ||
"uri": "https://example.com" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
that was meant to add a / at the end of uri
to check uri resolving works properly
@@ -42,7 +42,7 @@ object Dependencies { | |||
val json4s = "3.2.11" | |||
val pureconfig = "0.8.0" | |||
val snowplowRawEvent = "0.1.0" | |||
val snowplowCommonEnrich = "0.32.0" | |||
val snowplowCommonEnrich = "0.34.0" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for spotting - those changes were apparently lost after rebasing, will fix. I'll also be doing some smoke testing today which should help prevent issues like these
@rzats feel free to ping me once you've resolved the points I raised during the last review |
…s & Bots List (close #937)
bf7fb0d
to
8586726
Compare
What's the hold-up here @rzats ? |
superseded by #3799, closing |
Closes #937.
Still needs some work:
but should be ready for an initial review.