# Bug 1710854: cpu_arch doesn't work with exists/does-not-exist filter

This is a brief investigation using JupyterLab into why `cpu_arch` doesn't
work with the exists/does-not-exist filter.



In [1]:
import pandas as pd
import requests

HOST = "https://crash-stats.mozilla.org"

We want to know whether the exists/does-not-exist filters work.

Let's do a search for crash reports for all products between 2021-05-05 and 2021-05-13, then one
where `cpu_arch` exists and then one where `cpu_arch` does not exist.

In [2]:
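def fetch_supersearch(params):
    """Run a SuperSearch API query and return the decoded JSON response."""
    resp = requests.get(HOST + "/api/SuperSearch/", params=params)
    resp.raise_for_status()  # fail loudly on HTTP errors instead of parsing an error body
    return resp.json()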

results = {}

params = {
    "date": [">=2021-05-05", "<2021-05-13"],
}

results["total"] = fetch_supersearch(params)["total"]
results["does exist"] = fetch_supersearch(dict(params, cpu_arch="!__null__"))["total"]
results["does not exist"] = fetch_supersearch(dict(params, cpu_arch="__null__"))["total"]

pd.DataFrame([results], columns=["total", "does not exist", "does exist"])

,total,does not exist,does exist
0,1284170,0,1284170


That seems puzzling. That suggests that every crash report has a `cpu_arch`
value.

Let's grab a facet on `cpu_arch` for crash reports for all products
submitted between 2021-05-05 and 2021-05-13.

In [3]:
params = {
    "date": [">=2021-05-05", "<2021-05-13"],
    "_facets": "cpu_arch",
    "_results_number": 0,
}
data = fetch_supersearch(params)["facets"]

pd.DataFrame(data["cpu_arch"], columns=["term", "count"])

,term,count
0,x86,424865
1,amd64,350935
2,,347140
3,arm,89674
4,arm64,71556


The table shows that about 27% of the crash reports (347,140 of 1,284,170)
have a `cpu_arch` that's the empty string.
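As a quick sanity check, the facet counts should add up to the total from the first search, and the empty-string share falls out of the same numbers. A minimal sketch using the counts copied from the table above (the variable names are just for illustration):

```python
# Facet counts copied from the cpu_arch facet table above.
facet_counts = {
    "x86": 424865,
    "amd64": 350935,
    "": 347140,  # the empty-string bucket
    "arm": 89674,
    "arm64": 71556,
}

# Total across all buckets; this should match the "total" from the first search.
total = sum(facet_counts.values())

# Fraction of crash reports whose cpu_arch is the empty string.
empty_share = facet_counts[""] / total

print(total, f"{empty_share:.1%}")
```

The buckets sum to the same total as the unfiltered search, so the facet accounts for every crash report in the date range.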

In Crash Stats, it's very difficult and/or impossible to search for
the empty string value. So this is a bit of a drag.

But knowing the facet, we can search for all the crash reports that don't
have one of the known values.

In [4]:
params = {
    "date": [">=2021-05-05", "<2021-05-13"],
    "cpu_arch": ["!x86", "!amd64", "!arm", "!arm64"],
}

fetch_supersearch(params)["total"]


347140

So that works: the count matches the empty-string bucket in the facet above. Let's look at 10 of them.

In [5]:
params = {
    "date": [">=2021-05-05", "<2021-05-13"],
    "cpu_arch": ["!x86", "!amd64", "!arm", "!arm64"],
    "_columns": ["uuid", "product", "signature"],
    "_results_number": 10,
}

data = fetch_supersearch(params)
pd.DataFrame(data["hits"], columns=["uuid", "product", "signature"])

,uuid,product,signature
0,f0937412-6115-4391-bc1c-3dc980210505,Fenix,android.database.sqlite.SQLiteDiskIOException:...
1,c13b1a16-4887-491e-a261-7d79c0210505,Fenix,mozilla.appservices.logins.InvalidKeyException...
2,2ffc7595-8221-4fc8-80e2-ccfc50210505,Fenix,mozilla.appservices.logins.InvalidKeyException...
3,0b15ec5a-0544-48bb-b981-21bf80210505,Fenix,mozilla.appservices.logins.InvalidKeyException...
4,15681abf-9201-4dd0-8644-8555a0210505,Fenix,EMPTY: no crashing thread identified; ERROR_NO...
5,60aa2ddd-7a00-40a7-9b67-c1ada0210505,Fenix,java.lang.OutOfMemoryError: at java.util.Array...
6,b40b84c4-3e4c-4fcd-8d86-b2eea0210505,Fenix,[INFO] MalformedMessage(message=parsing encryp...
7,069d171b-b6b9-4862-9cf5-1d65b0210505,Fenix,EMPTY: no crashing thread identified; ERROR_NO...
8,f9903afa-961e-431f-a7a7-398110210505,Fenix,EMPTY: no crashing thread identified; ERROR_NO...
9,f3816ea3-5f09-4abc-a377-f517b0210505,Fenix,java.lang.IllegalStateException: at mozilla.co...


All of those are Fenix crash reports. I wonder whether this affects other products.

In [6]:
params = {
    "date": [">=2021-05-05", "<2021-05-13"],
    "cpu_arch": ["!x86", "!amd64", "!arm", "!arm64"],
    "_results_number": 0,
    "_facets": "product"
}

data = fetch_supersearch(params)
pd.DataFrame(data["facets"]["product"], columns=["term", "count"])

,term,count
0,Fenix,314708
1,Firefox,23372
2,Focus,5005
3,Thunderbird,3894
4,FirefoxReality,112
5,SeaMonkey,48
6,ReferenceBrowser,1


So it does affect other products, though Fenix accounts for about 91% of them
(314,708 of 347,140).

At this point, we look at the code:

https://github.com/mozilla-services/socorro/blob/c276752b9f4b6e767e4bfbfea4d0ac1ad7e30398/socorro/processor/rules/general.py#L87-L116

The `cpu_arch` field is populated by the `CPUInfoRule` in the Socorro
processor, which pulls it from the `system_info` section of the
minidump-stackwalk output.

It seems that if a crash report has no minidump or has an unparseable minidump,
it ends up with an empty string for `cpu_arch`. The majority of these are
Fenix crash reports that are Java crashes.

Crash reports from Fenix have an `Android_CPU_ABI` crash annotation. What
does that data look like?

In [7]:
params = {
    "date": [">=2021-05-05", "<2021-05-13"],
    "cpu_arch": ["!x86", "!amd64", "!arm", "!arm64"],
    "product": "Fenix",
    "_results_number": 0,
    "_facets": "android_cpu_abi"
}

data = fetch_supersearch(params)
pd.DataFrame(data["facets"]["android_cpu_abi"], columns=["term", "count"])

,term,count
0,arm64-v8a,137743
1,armeabi-v7a,137260
2,x86,20137
3,x86_64,2290
4,x86\r\n,13
5,armeabi-v7a\r\n,7


There's some junk in there: a handful of values carry a trailing `\r\n`. Fun.
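One way to fold those stray values back in is to strip whitespace before counting. This is a minimal sketch, assuming the `\r\n` shown in the table is a literal carriage return plus newline at the end of the value; `merge_abi_counts` is a hypothetical helper, not something from Socorro.

```python
from collections import Counter

# Facet rows copied from the android_cpu_abi table above.
# Assumption: the "\r\n" suffixes are literal CR/LF characters.
abi_facet = [
    {"term": "arm64-v8a", "count": 137743},
    {"term": "armeabi-v7a", "count": 137260},
    {"term": "x86", "count": 20137},
    {"term": "x86_64", "count": 2290},
    {"term": "x86\r\n", "count": 13},
    {"term": "armeabi-v7a\r\n", "count": 7},
]


def merge_abi_counts(rows):
    """Sum counts per ABI after stripping surrounding whitespace."""
    merged = Counter()
    for row in rows:
        merged[row["term"].strip()] += row["count"]
    return dict(merged)


merged = merge_abi_counts(abi_facet)
print(merged)
```

After merging, only the four clean ABI values remain.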

# Summary

We looked at `cpu_arch` values to figure out why the exists/does-not-exist
filter wasn't working. Because the default value is the empty string, there's
always a value.

Then we looked at which cases produce an empty string. The bulk of them
are Fenix Java crashes.

For those, we can map the `Android_CPU_ABI` field to `cpu_arch` values
and fill in the `cpu_arch` field accordingly.

For the rest of the crash reports where this doesn't work, we should use
a better "has no value" value. For this, we're going to go with "unknown"
since it's clearer what it is.
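The plan above could look something like the following. This is a rough sketch and not the actual Socorro change: the `ABI_TO_CPU_ARCH` table and the `fill_cpu_arch` helper are hypothetical, and the ABI-to-architecture pairs are an assumption based on the values seen in the two facets.

```python
# Assumed mapping from Android_CPU_ABI values to existing cpu_arch values.
ABI_TO_CPU_ARCH = {
    "arm64-v8a": "arm64",
    "armeabi-v7a": "arm",
    "x86": "x86",
    "x86_64": "amd64",
}


def fill_cpu_arch(cpu_arch, android_cpu_abi=None):
    """Return a cpu_arch value that is never the empty string.

    Keeps an existing cpu_arch, otherwise falls back to the Android ABI
    mapping, and finally to "unknown".
    """
    if cpu_arch:
        return cpu_arch
    # Strip stray whitespace such as a trailing "\r\n" before the lookup.
    abi = (android_cpu_abi or "").strip()
    return ABI_TO_CPU_ARCH.get(abi, "unknown")


print(fill_cpu_arch("", "arm64-v8a"))
print(fill_cpu_arch("", None))
```

With something like this in place, a search for `cpu_arch` equal to "unknown" would find the crash reports that are currently hidden behind the empty string.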