## Analyzing Spot Checks
We spot-checked 741 images of stained search pages. The images were created by running the `abstract_paining()` function from web assay using screenshots and spatial metadata parsed from web pages.

We used the Prodigy.ai annotation software to record errors. We removed some fields from the annotation output, including the base64 encoding of the image.

For images with errors, we estimated the number of pixels over- or under-counted for each category.

In [3]:
import unicodedata
from tqdm import tqdm
import pandas as pd

In [4]:
# these files contain the estimation of pixels over or under-counted
fn_pixel_1 = '../data/error_analysis/leon-pixel-errors.csv'
fn_pixel_2 = '../data/error_analysis/adrianne-pixel-errors.csv'

# contains certain fields from prodigy for data points we spot checked.
fn_spotcheck_1 = '../data/error_analysis/leon-annotations.csv.gz'
fn_spotcheck_2 = '../data/error_analysis/adrianne-annotations.csv.gz'

In [5]:
# read the files
pix1 = pd.read_csv(fn_pixel_1, index_col=0)
pix2 = pd.read_csv(fn_pixel_2, index_col=0)

spot1 = pd.read_csv(fn_spotcheck_1, compression='gzip')
spot2 = pd.read_csv(fn_spotcheck_2, compression='gzip')

In [6]:
pix1.loc[:, "annotator"] = 'L'
pix2.loc[:, "annotator"] = 'A'

spot1.loc[:, "annotator"] = 'L'
spot2.loc[:, "annotator"] = 'A'

In [7]:
# combine the two annotators' files
# (pd.concat replaces DataFrame.append, which was removed in pandas 2.0)
pixels = pd.concat([pix1, pix2])
spot = pd.concat([spot1, spot2])

In [9]:
# how many we checked, and how many errors we found
len(spot), len(pixels)

(741, 74)

In [10]:
# what is our total error rate?
len(pixels) / len(spot)

0.09986504723346828
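
As a quick check of the figure above: the rate is the number of pages with a recorded pixel error divided by the number of pages reviewed. A minimal sketch using the two counts printed earlier (the helper name `error_rate` is ours and is not part of the notebook):

```python
def error_rate(n_errors: int, n_checked: int) -> float:
    """Share of spot-checked pages that had at least one pixel error."""
    if n_checked <= 0:
        raise ValueError("n_checked must be positive")
    return n_errors / n_checked


# counts taken from the cell output above: 741 pages checked, 74 with errors
rate = error_rate(74, 741)
print(f"{rate:.2%}")
```

In other words, roughly one in ten spot-checked pages had an error of some kind.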

In [11]:
# net pixels over/undercounted per category
pixels.sum()

amp           1.023895e+06
google        6.522807e+05
ads           4.758840e+05
non-google    5.194243e+05
dtype: float64

In [12]:
# how many errors are from AMP?
len(pixels[(pixels.amp > 1)])

23

Analyze errors for each "type" of error.

In [13]:
classification_errors = spot[(spot['missing classification'] == True) | (spot['wrong classification'] == True)].text
classification_errors.nunique()

14

In [14]:
measurement_errors = spot[(spot['area is overestimated'] == True) | (spot['area is underestimated'] == True)].text
measurement_errors.nunique()

65

In [15]:
classification_errors.nunique() / len(spot), measurement_errors.nunique() / len(spot)

(0.018893387314439947, 0.08771929824561403)

We can refer to the spatial metadata for each of the files in the spot check sample:

In [16]:
files = spot['text'].apply(
    lambda x: unicodedata.normalize(
        "NFC",
        x.replace('data/', '../data/intermediary/google_search/')
         .replace('/png/abstract_painting_img.png', '/json/parsed_meta.jsonl'),
    )
).to_list()
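
The path rewrite above can be tried in isolation. The sketch below wraps it in a hypothetical helper, `image_path_to_meta_path`, and uses made-up paths (the real directory layout may differ). It also shows what the NFC normalization does: presumably it is there so that search terms with accented characters produce the same path however they were encoded.

```python
import unicodedata

IMG_SUFFIX = '/png/abstract_painting_img.png'
META_SUFFIX = '/json/parsed_meta.jsonl'


def image_path_to_meta_path(image_path: str) -> str:
    """Map a spot-check image path to its metadata path, NFC-normalized."""
    meta_path = (
        image_path
        .replace('data/', '../data/intermediary/google_search/')
        .replace(IMG_SUFFIX, META_SUFFIX)
    )
    return unicodedata.normalize("NFC", meta_path)


# hypothetical paths: the same search term, encoded two different ways
composed = 'data/caf\u00e9' + IMG_SUFFIX      # "é" as a single code point
decomposed = 'data/cafe\u0301' + IMG_SUFFIX   # "e" plus a combining accent

print(composed == decomposed)  # False: the raw strings differ
print(image_path_to_meta_path(composed) == image_path_to_meta_path(decomposed))  # True
```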

In [17]:
data = []
for fn in tqdm(files):
    _df = pd.read_json(unicodedata.normalize("NFC", fn), 
                       orient='records',
                       lines=True)
    _df["fn"] = fn
    data.extend(_df.to_dict(orient='records'))

100%|██████████| 741/741 [00:10<00:00, 73.17it/s]


In [18]:
df = pd.DataFrame(data)

In [19]:
cat2label = {
    'organic' : 'non-google',
    'link' : 'google',
    'answer' : 'google',
    'ads' : 'ads'
}

In [20]:
def assign_label(row):
    cat = row['category'].split('-')[0]
    return cat2label.get(cat, cat)

In [21]:
df['label'] = df.apply(assign_label, axis=1)
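
To make the labeling rule concrete: only the part of `category` before the first hyphen is looked up, and any prefix missing from `cat2label` is kept as its own label. The sketch below restates that rule on plain strings. The category values are invented for illustration and may not match the values in the parsed metadata.

```python
cat2label = {
    'organic': 'non-google',
    'link': 'google',
    'answer': 'google',
    'ads': 'ads',
}


def label_for(category: str) -> str:
    """Return the real-estate label for a category string."""
    prefix = category.split('-')[0]
    # prefixes without a mapping (for example 'amp') fall through unchanged
    return cat2label.get(prefix, prefix)


# hypothetical category strings
for category in ['organic-search_result', 'answer-panel', 'ads-text', 'amp-card']:
    print(category, '->', label_for(category))
```

The fallback is why `amp` appears as its own label later on even though it is not a key in `cat2label`.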

In [22]:
# Calculate real estate with corrections from errors
corrected_perc = (pixels.sum() + df.groupby('label')['area_page'].sum() )/ df['area_page'].sum()

In [23]:
# what is the un-corrected real estate?
og_perc = (df.groupby('label')['area_page'].sum() )/ df['area_page'].sum()

In [24]:
# how much do corrections alter results?
(corrected_perc - og_perc) * 100

ads           0.046351
amp           0.099728
google        0.063533
non-google    0.050592
dtype: float64

No category's share of real estate changed by more than one-tenth of a percentage point when we look across the entire spot-checked sample.
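
The comparison above reduces to dividing each category's net pixel correction by the total page area, because both shares use the same uncorrected denominator. A small sketch with invented numbers (none of these values come from the data):

```python
def share(area_by_label: dict, total: float) -> dict:
    """Fraction of the total area taken up by each label."""
    return {label: area / total for label, area in area_by_label.items()}


# hypothetical pixel areas per label, and hypothetical net corrections
original = {'amp': 50.0, 'google': 400.0, 'ads': 50.0, 'non-google': 500.0}
corrections = {'amp': 5.0, 'google': 10.0, 'ads': 0.0, 'non-google': -2.0}

# as in the notebook, the denominator is the uncorrected total area
total = sum(original.values())

og = share(original, total)
corrected = share({k: original[k] + corrections[k] for k in original}, total)

# change in percentage points for each label
diff_pp = {k: (corrected[k] - og[k]) * 100 for k in original}
print(diff_pp)
```

With these toy numbers, a net correction of 10 pixels on a 1,000-pixel sample moves the `google` share by about one percentage point.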

What was the impact of these errors on the real estate of individual pages?

In [25]:
def get_search_term_img(fn):
    return unicodedata.normalize("NFC",fn.split('/png/')[0].split('/')[-1])

def get_search_term_json(fn):
    return unicodedata.normalize("NFC",fn.split('/json/')[0].split('/')[-1])
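
Both helpers take the directory name that sits just before `/png/` or `/json/`, which gives the image paths and the metadata paths a shared key. A sketch on invented paths (the directory layout here is an assumption, and `get_search_term` is a hypothetical generalization of the two helpers):

```python
import unicodedata


def get_search_term(fn: str, marker: str) -> str:
    """Return the NFC-normalized directory name directly before `marker`."""
    return unicodedata.normalize("NFC", fn.split(marker)[0].split('/')[-1])


# hypothetical paths for the same search
img_fn = 'data/some-dir/best running shoes/png/abstract_painting_img.png'
json_fn = ('../data/intermediary/google_search/some-dir/'
           'best running shoes/json/parsed_meta.jsonl')

term_from_img = get_search_term(img_fn, '/png/')
term_from_json = get_search_term(json_fn, '/json/')
print(term_from_img, term_from_img == term_from_json)
```

Matching keys are what allow `pixels.index` and `df.search_term` to be compared in the cells that follow.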

In [26]:
pixels.index = pixels.index.map(get_search_term_img)

In [27]:
df['search_term'] = df.fn.apply(get_search_term_json)

In [28]:
len(pixels)

74

In [29]:
df_e = df[df.search_term.isin(pixels.index)]

In [30]:
# get the sum of pixels for each category for each search with an error.
term2pixels = {}
for _term, df_ in tqdm(df_e.groupby('search_term')):
    row = df_.groupby('label').area_page.sum().to_dict()
    row['all_area'] = df_.area_page.sum()
    term2pixels[_term] = row

100%|██████████| 74/74 [00:00<00:00, 1030.56it/s]
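
For readers less familiar with nested `groupby` calls, the loop above builds one row per search term holding the summed area of each label plus an `all_area` total. The same aggregation can be sketched with plain dictionaries. The records below are invented for illustration.

```python
from collections import defaultdict

# hypothetical element records: one per parsed page element
records = [
    {'search_term': 'term a', 'label': 'google', 'area_page': 100.0},
    {'search_term': 'term a', 'label': 'google', 'area_page': 50.0},
    {'search_term': 'term a', 'label': 'ads', 'area_page': 25.0},
    {'search_term': 'term b', 'label': 'non-google', 'area_page': 80.0},
]

# search term -> label -> summed area, with an extra 'all_area' total
term2pixels = defaultdict(lambda: defaultdict(float))
for rec in records:
    per_term = term2pixels[rec['search_term']]
    per_term[rec['label']] += rec['area_page']
    per_term['all_area'] += rec['area_page']

for term, areas in term2pixels.items():
    print(term, dict(areas))
```

A label that never appears for a search term is simply absent here, whereas the notebook fills those gaps with 0 via `fillna(0)` in the next cell.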


In [31]:
og_pixel = pd.DataFrame(term2pixels).T
og_pixel.fillna(0, inplace=True)

In [32]:
def stats(row):
    """
    Calculates the total pixels mis-counted as well as 
    the percentage of area that was mis-counted
    """
    # original (uncorrected) pixel areas for this search term
    correspondant = og_pixel.loc[row.name]
    for cat in ['amp', 'google', 'non-google', 'ads']:
        row[f"{cat}_perc_cat_area"] = (row[cat] / correspondant[cat]) * 100
        row[f"{cat}_perc_total_area"] = (row[cat] / correspondant['all_area']) * 100
    return row

In [33]:
results = pd.DataFrame(pixels.apply(stats, axis=1))



In [34]:
# From classification errors
results[results.index.isin([get_search_term_img(f) for f in classification_errors])].sum()

amp                            46404.706410
google                       -166396.646223
ads                           210799.617048
non-google                    127963.936080
annotator                          0.000000
amp_perc_cat_area                 41.148044
amp_perc_total_area                3.530988
google_perc_cat_area             -31.273392
google_perc_total_area           -10.452977
non-google_perc_cat_area          24.214813
non-google_perc_total_area        12.401348
ads_perc_cat_area                182.329968
ads_perc_total_area               13.304421
dtype: float64

In [35]:
# from measurement errors
results[results.index.isin([get_search_term_img(f) for f in measurement_errors])].sum()

amp                           9.774901e+05
google                        6.468972e+05
ads                           2.713937e+05
non-google                    6.065320e+05
amp_perc_cat_area             8.076707e+02
amp_perc_total_area           8.492652e+01
google_perc_cat_area          1.059609e+02
google_perc_total_area        4.715500e+01
non-google_perc_cat_area      1.106186e+02
non-google_perc_total_area    4.986764e+01
ads_perc_cat_area                      inf
ads_perc_total_area           1.947207e+01
dtype: float64
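
The `inf` for `ads_perc_cat_area` most likely comes from pages where the original ads area was 0 (after `fillna(0)`) while a nonzero ads correction was recorded, so the per-category percentage divides by zero. One way to keep such pages from swamping a sum is to return NaN for them instead. This is a sketch of an alternative, using a hypothetical helper `pct_of`, and is not what the notebook does:

```python
import math


def pct_of(part: float, whole: float) -> float:
    """Percentage of `whole` represented by `part`; NaN when `whole` is zero."""
    if whole == 0:
        return math.nan
    return part / whole * 100


# hypothetical values: a 5-pixel correction against 200 original pixels,
# and the same correction on a page with no original area in that category
print(pct_of(5.0, 200.0))
print(pct_of(5.0, 0.0))
```

When such values are collected into a pandas Series, `sum()` skips NaN by default, so the remaining pages would still contribute to the total.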

In [36]:
# the min, average, and max percentage of area mis-counted per category:
for col in ['amp_perc_total_area', 'ads_perc_total_area', 'non-google_perc_total_area', 'google_perc_total_area']:
    print(results[results[col] != 0][col].describe())
    print()

count    23.000000
mean      3.845978
std       2.950401
min       1.609328
25%       2.065514
50%       2.487839
75%       4.161377
max      12.656428
Name: amp_perc_total_area, dtype: float64

count    13.000000
mean      2.472851
std       1.996503
min      -0.316017
25%       1.382312
50%       1.756720
75%       2.921937
max       7.075561
Name: ads_perc_total_area, dtype: float64

count    20.000000
mean      2.327070
std       3.861660
min     -12.207536
25%       1.945556
50%       2.665215
75%       3.545442
max       7.129694
Name: non-google_perc_total_area, dtype: float64

count    37.000000
mean      1.303300
std       2.506333
min      -1.756720
25%       0.557909
50%       1.006652
75%       1.772622
max      11.532905
Name: google_perc_total_area, dtype: float64

