# Analytics for Image Suggestions Notifications

[T292316](https://phabricator.wikimedia.org/T292316)

# Purpose

More experienced contributors are interested in the specific articles that they watch and edit. The MVP of Image Suggestions recommends images to them for articles they are watching or have recently edited. Image suggestion notifications launched on July 20, 2022 on three test wikis: idwiki, ptwiki and ruwiki.

The Structured Data team wants to understand how the image suggestions notifications feature is being used and how it relates to other notification behavior, so that it can decide how to build or change the feature going forward and determine whether it is successful.

In this analysis, we look at data from 2022-07-20 to 2022-08-31 to answer the following questions:

- Number of notifications sent (to get a sense of the scope of our work and whether we should increase or decrease it)
- Number of opt-outs (to see whether users are annoyed by our notifications, not interested in them, etc.)
- How many images on average are added by a user per month
- Number of suggested images that are added to the matched article (to see whether the notifications are successful)
- Number of suggested images not reverted from their matched article (a low revert rate shows that the matches are good)
- Number of images added to unillustrated articles by experienced editors

# Data Preparation

In [3]:
import re

from wmfdata import hive, mariadb, spark
import wmfdata 

import math
import pandas as pd
import numpy as np

from datetime import datetime, timedelta, date

In [4]:
spark_session = wmfdata.spark.get_session(app_name='pyspark regular; media-changes',
                                  type='yarn-large', # local, yarn-regular, yarn-large
                                 )  

PySpark executors will use /usr/lib/anaconda-wmf/bin/python3.


## Parameters

In [5]:
mw_snapshot = '2022-08'  
wiki_dbs = ('ptwiki', 'ruwiki', 'idwiki')
wiki_db_str = "('" + "','".join(wiki_dbs) + "')"  # otherwise single wiki leads to confusing syntax errors
wiki_db_list = list(wiki_dbs)

#notification ts
n_start_timestamp = 20220720000000
n_end_timestamp = 20220901000000

##edits ts
start_timestamp = '2022-07-20' 
end_timestamp = '2022-09-01'

media_list_table = 'cchen.media_jul_aug_2022'
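
The `wiki_db_str` parameter matters when only one wiki is analysed: Python renders a one-element tuple with a trailing comma, which is not a valid SQL `IN` list. A minimal sketch of the difference, where `sql_in_list` is a hypothetical helper name that only wraps the expression used above, and the values are assumed to be trusted wiki names (no quote escaping is done):

```python
def sql_in_list(values):
    """Format an iterable of strings as a SQL IN list, e.g. ('a','b')."""
    return "('" + "','".join(values) + "')"

single = ('ptwiki',)

naive = str(single)             # tuple repr keeps the trailing comma
explicit = sql_in_list(single)  # quoted, comma-separated, no trailing comma

print(naive)
print(explicit)
```

Passing the pre-formatted string to `.format()` therefore keeps the queries valid for any number of wikis.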

## Notification Data

In [6]:
notification_query = """

SELECT 
   notification_event,
   notification_user,
   notification_timestamp,
   notification_read_timestamp,
   event_extra,
   event_page_id
FROM echo_notification
JOIN echo_event on event_id = notification_event
WHERE notification_timestamp >= {start_timestamp}
  AND notification_timestamp < {end_timestamp}
  AND event_type = 'image-suggestions'

"""

In [7]:
notification_frames = []

for wiki_db in wiki_db_list:

    print('getting data for %s' % wiki_db)

    noti_result = mariadb.run(
        notification_query.format(
            start_timestamp = n_start_timestamp,
            end_timestamp = n_end_timestamp
        ), wiki_db, 'wikishared', 'pandas'
    )

    noti_result.insert(0, 'wiki_db', wiki_db)

    notification_frames.append(noti_result)

# DataFrame.append was removed in pandas 2.0; concatenate once instead
notification_data = pd.concat(notification_frames)
    

getting data for ptwiki
getting data for ruwiki
getting data for idwiki


In [8]:
notification_data['event_extra'] = notification_data['event_extra'].astype(str)
notification_data['suggested_image'] = notification_data['event_extra'].str.extract(r'(?<=File:)(.*)(?=\")')
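
The extraction above relies on a lookbehind for `File:` and a lookahead for a closing double quote. The sketch below applies the same pattern with the standard `re` module to show how the greedy `.*` behaves. The payload strings are hypothetical (the real `event_extra` layout may differ), and the non-greedy variant is included only for comparison, not as the notebook's method.

```python
import re

# Pattern copied from the cell above.
GREEDY = re.compile(r'(?<=File:)(.*)(?=\")')
# Non-greedy variant, shown only for comparison.
LAZY = re.compile(r'(?<=File:)(.*?)(?=\")')

# Hypothetical payloads; the real event_extra layout may differ.
one_field = '{"suggested-image":"File:Example one.jpg"}'
two_fields = '{"suggested-image":"File:Example one.jpg","other":"x"}'


def first_group(pattern, text):
    """Return the first capture group of the first match, or None."""
    match = pattern.search(text)
    return match.group(1) if match else None


greedy_one = first_group(GREEDY, one_field)    # stops before the only closing quote
greedy_two = first_group(GREEDY, two_fields)   # runs on to the last quote in the string
lazy_two = first_group(LAZY, two_fields)       # stops at the first quote after File:
no_match = first_group(GREEDY, '{"other":"x"}')
```

If a payload can carry further quoted fields after the file name, the greedy pattern would capture them as well, so it is worth spot-checking `suggested_image` before joining on it.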

In [9]:
# store in global temp view
notification_sdf = spark_session.createDataFrame(notification_data)
notification_sdf.createGlobalTempView("notification_data_temp")

## Preference Data

In [21]:
pref_query = """

SELECT 
    up_property,
    up_value,
    up_user AS local_user_id
FROM user_properties
WHERE up_property like '%image_suggestions%'

"""

In [22]:
local_pref_frames = []

for wiki_db in wiki_db_list:

    print('getting data for %s' % wiki_db)

    pref_result = mariadb.run(
        pref_query, wiki_db
    )

    pref_result.insert(0, 'wiki_db', wiki_db)

    local_pref_frames.append(pref_result)

# DataFrame.append was removed in pandas 2.0; concatenate once instead
local_pref_data = pd.concat(local_pref_frames)

getting data for ptwiki
getting data for ruwiki
getting data for idwiki


In [23]:
## convert up_value to a string in order to store in GlobalTempView
local_pref_data['up_value'] = local_pref_data['up_value'].astype(str)

In [24]:
# store in global temp view
local_pref_sdf = spark_session.createDataFrame(local_pref_data)
local_pref_sdf.createGlobalTempView("local_pref_temp")

We also need global preference data so that it can be compared with local exceptions:

In [29]:
global_pref_query = """

SELECT
    gp_user,
    gp_property,
    gp_value,
    lu_wiki,
    lu_local_id
FROM global_preferences gp RIGHT JOIN localuser lu ON gp_user = lu_global_id
WHERE gp_property LIKE '%image-suggestions%'
  AND lu_wiki IN {wiki_db}
  
"""

In [30]:
global_pref_data = mariadb.run( 
        global_pref_query.format(
          wiki_db = wiki_db_str  # pre-formatted IN list; also valid for a single wiki
        ),'centralauth','pandas'
    )

In [31]:
## convert gp_value to a string in order to store in GlobalTempView
global_pref_data['gp_value'] = global_pref_data['gp_value'].astype(str)

In [32]:
global_pref_sdf = spark_session.createDataFrame(global_pref_data)
global_pref_sdf.createGlobalTempView("global_pref_temp")

## Image Edits Data

We use the method described in [T299712](https://phabricator.wikimedia.org/T299712) to find image edits and their corresponding images.

### Utils for detecting image edits

In [12]:
MEDIA_PREFIXES = ['File', 'Image', 'Media']

MEDIA_ALIASES= {
    "ar": ["ميديا", "صورة", "وسائط", "ملف"],
    "arc": ["ܠܦܦܐ", "ܡܝܕܝܐ"],
    "arz": ["ميديا", "صورة", "وسائط", "ملف"],
    "as": ["চিত্ৰ", "चित्र", "চিত্র", "মাধ্যম"],
    "ast": ["Imaxen", "Ficheru", "Imaxe", "Archivu", "Imagen", "Medios"],
    "atj": ["Tipatcimoctakewin", "Natisinahikaniwoc"],
    "av": ["Медиа", "Файл", "Изображение"],
    "ay": ["Medio", "Archivo", "Imagen"],
    "az": ["Mediya", "Şəkil", "Fayl"],
    "azb": ["رسانه", "تصویر", "مدیا", "فایل", "رسانه‌ای"],
    "ba": ["Медиа", "Рәсем", "Файл", "Изображение"],
    "bar": ["Medium", "Datei", "Bild"],
    "bat-smg": ["Vaizdas", "Medėjė", "Abruozdielis"],
    "bcl": ["Medio", "Ladawan"],
    "be": ["Мультымедыя", "Файл", "Выява"],
    "be-x-old": ["Мэдыя", "Файл", "Выява"],
    "bg": ["Медия", "Файл", "Картинка"],
    "bh": ["मीडिया", "चित्र"],
    "bjn": ["Barakas", "Gambar", "Berkas"],
    "bm": ["Média", "Fichier"],
    "bn": ["চিত্র", "মিডিয়া"],
    "bpy": ["ছবি", "মিডিয়া"],
    "br": ["Skeudenn", "Restr"],
    "bs": ["Mediji", "Slika", "Datoteka", "Medija"],
    "bug": ["Gambar", "Berkas"],
    "bxr": ["Файл", "Меди", "Изображение"],
    "ca": ["Fitxer", "Imatge"],
    "ce": ["Хlум", "Медиа", "Сурт", "Файл", "Медйа", "Изображение"],
    "ceb": ["Payl", "Medya", "Imahen"],
    "ch": ["Litratu"],
    "ckb": ["میدیا", "پەڕگە"],
    "co": ["Immagine"],
    "crh": ["Медиа", "Resim", "Файл", "Fayl", "Ресим"],
    "cs": ["Soubor", "Média", "Obrázok"],
    "csb": ["Òbrôzk", "Grafika"],
    "cu": ["Видъ", "Ви́дъ", "Дѣло", "Срѣдьства"],
    "cv": ["Медиа", "Ӳкерчĕк", "Изображение"],
    "cy": ["Delwedd"],
    "da": ["Billede", "Fil"],
    "de": ["Medium", "Datei", "Bild"],
    "din": ["Ciɛl", "Apamduööt"],
    "diq": ["Medya", "Dosya"],
    "dsb": ["Wobraz", "Dataja", "Bild", "Medija"],
    "dty": ["चित्र", "मिडिया"],
    "dv": ["ފައިލު", "މީޑިއާ", "ފައިލް"],
    "el": ["Εικόνα", "Αρχείο", "Μέσο", "Μέσον"],
    "eml": ["Immagine"],
    "eo": ["Dosiero", "Aŭdvidaĵo"],
    "es": ["Medio", "Archivo", "Imagen"],
    "et": ["Pilt", "Fail", "Meedia"],
    "eu": ["Irudi", "Fitxategi"],
    "ext": ["Archivu", "Imagen", "Mediu"],
    "fa": ["رسانه", "تصویر", "مدیا", "پرونده", "رسانه‌ای"],
    "ff": ["Média", "Fichier"],
    "fi": ["Kuva", "Tiedosto"],
    "fiu-vro": ["Pilt", "Meediä"],
    "fo": ["Miðil", "Mynd"],
    "fr": ["Média", "Fichier"],
    "frp": ["Émâge", "Fichiér", "Mèdia"],
    "frr": ["Medium", "Datei", "Bild"],
    "fur": ["Immagine", "Figure"],
    "fy": ["Ofbyld"],
    "ga": ["Íomhá", "Meán"],
    "gag": ["Mediya", "Medya", "Resim", "Dosya", "Dosye"],
    "gan": ["媒体文件", "文件", "文檔", "档案", "媒體", "图像", "圖像", "媒体", "檔案"],
    "gd": ["Faidhle", "Meadhan"],
    "gl": ["Imaxe", "Ficheiro", "Arquivo", "Imagem"],
    "glk": ["رسانه", "تصویر", "پرونده", "فاىل", "رسانه‌ای", "مديا"],
    "gn": ["Medio", "Imagen", "Ta'ãnga"],
    "gom": ["माध्यम", "मिडिया", "फायल"],
    "gor": ["Gambar", "Berkas"],
    "got": ["𐍆𐌴𐌹𐌻𐌰"],
    "gu": ["દ્રશ્ય-શ્રાવ્ય (મિડિયા)", "દ્રશ્ય-શ્રાવ્ય_(મિડિયા)", "ચિત્ર"],
    "gv": ["Coadan", "Meanyn"],
    "hak": ["文件", "媒體", "圖像", "檔案"],
    "haw": ["Kiʻi", "Waihona", "Pāpaho"],
    "he": ["תמונה", "קו", "מדיה", "קובץ"],
    "hi": ["मीडिया", "चित्र"],
    "hif": ["file", "saadhan"],
    "hr": ["Mediji", "DT", "Slika", "F", "Datoteka"],
    "hsb": ["Wobraz", "Dataja", "Bild"],
    "ht": ["Imaj", "Fichye", "Medya"],
    "hu": ["Kép", "Fájl", "Média"],
    "hy": ["Պատկեր", "Մեդիա"],
    "ia": ["Imagine", "Multimedia"],
    "id": ["Gambar", "Berkas"],
    "ig": ["Nká", "Midia", "Usòrò", "Ákwúkwó orünotu", "Ákwúkwó_orünotu"],
    "ii": ["媒体文件", "文件", "档案", "图像", "媒体"],
    "ilo": ["Midia", "Papeles"],
    "inh": ["Медиа", "Файл", "Изображение"],
    "io": ["Imajo", "Arkivo"],
    "is": ["Miðill", "Mynd"],
    "it": ["Immagine"],
    "ja": ["メディア", "ファイル", "画像"],
    "jbo": ["velsku", "datnyvei"],
    "jv": ["Barkas", "Medhia", "Gambar", "Médhia"],
    "ka": ["მედია", "სურათი", "ფაილი"],
    "kaa": ["Swret", "Таспа", "سۋرەت", "Taspa", "Su'wret", "Сурет", "تاسپا"],
    "kab": ["Tugna"],
    "kbd": ["Медиа", "Файл"],
    "kbp": ["Média", "Fichier"],
    "kg": ["Fisye"],
    "kk": ["Swret", "سۋرەت", "Таспа", "Taspa", "Сурет", "تاسپا"],
    "kl": ["Billede", "Fiileq", "Fil"],
    "km": ["ឯកសារ", "រូបភាព", "មេឌា", "មីឌា"],
    "kn": ["ಚಿತ್ರ", "ಮೀಡಿಯ"],
    "ko": ["미디어", "파일", "그림"],
    "koi": ["Медиа", "Файл", "Изображение"],
    "krc": ["Медиа", "Файл", "Изображение"],
    "ks": ["میڈیا", "فَیِل"],
    "ksh": ["Beld", "Meedije", "Medie", "Belld", "Medium", "Datei", "Meedijum", "Bild"],
    "ku": ["میدیا", "پەڕگە", "Medya", "Wêne"],
    "kv": ["Медиа", "Файл", "Изображение"],
    "kw": ["Restren"],
    "ky": ["Медиа", "Файл"],
    "la": ["Imago", "Fasciculus"],
    "lad": ["Dossia", "Medya", "Archivo", "Dosya", "Imagen", "Meddia"],
    "lb": ["Fichier", "Bild"],
    "lbe": ["Медиа", "Сурат", "Изображение"],
    "lez": ["Медиа", "Mediya", "Файл", "Şəkil", "Изображение"],
    "lfn": ["Fix"],
    "li": ["Afbeelding", "Plaetje", "Aafbeilding"],
    "lij": ["Immaggine", "Immagine"],
    "lmo": ["Immagine", "Imàjine", "Archivi"],
    "ln": ["Média", "Fichier"],
    "lo": ["ສື່ອ", "ສື່", "ຮູບ"],
    "lrc": ["رسانه", "تصویر", "رسانه‌ای", "جانیا", "أسگ", "ڤارئسگأر"],
    "lt": ["Vaizdas", "Medija"],
    "ltg": ["Medeja", "Fails"],
    "lv": ["Attēls"],
    "mai": ["मेडिया", "फाइल"],
    "map-bms": ["Barkas", "Medhia", "Gambar", "Médhia"],
    "mdf": ["Медиа", "Няйф", "Изображение"],
    "mg": ["Rakitra", "Sary", "Média"],
    "mhr": ["Медиа", "Файл", "Изображение"],
    "min": ["Gambar", "Berkas"],
    "mk": ["Податотека", "Медија", "Медиум", "Слика"],
    "ml": ["പ്രമാണം", "ചി", "മീഡിയ", "പ്ര", "ചിത്രം"],
    "mn": ["Медиа", "Файл", "Зураг"],
    "mr": ["चित्र", "मिडिया"],
    "mrj": ["Медиа", "Файл", "Изображение"],
    "ms": ["Fail", "Imej"],
    "mt": ["Midja", "Medja", "Stampa"],
    "mwl": ["Multimédia", "Fexeiro", "Ficheiro", "Arquivo", "Imagem"],
    "my": ["ဖိုင်", "မီဒီယာ"],
    "myv": ["Медия", "Артовкс", "Изображение"],
    "mzn": ["رسانه", "تصویر", "مه‌دیا", "مدیا", "پرونده", "رسانه‌ای"],
    "nah": ["Mēdiatl", "Īxiptli", "Imagen"],
    "nap": ["Fiùra", "Immagine"],
    "nds": ["Datei", "Bild"],
    "nds-nl": ["Ofbeelding", "Afbeelding", "Bestaand"],
    "ne": ["मीडिया", "चित्र"],
    "new": ["किपा", "माध्यम"],
    "nl": ["Bestand", "Afbeelding"],
    "nn": ["Fil", "Bilde", "Filpeikar"],
    "no": ["Fil", "Medium", "Bilde"],
    "nov": [],
    "nrm": ["Média", "Fichier"],
    "nso": ["Seswantšho"],
    "nv": ["Eʼelyaaígíí"],
    "oc": ["Imatge", "Fichièr", "Mèdia"],
    "olo": ["Kuva", "Medii", "Failu"],
    "or": ["ମାଧ୍ୟମ", "ଫାଇଲ"],
    "os": ["Ныв", "Медиа", "Файл", "Изображение"],
    "pa": ["ਤਸਵੀਰ", "ਮੀਡੀਆ"],
    "pcd": ["Média", "Fichier"],
    "pdc": ["Medium", "Datei", "Bild", "Feil"],
    "pfl": ["Dadai", "Medium", "Datei", "Bild"],
    "pi": ["मीडिया", "पटिमा"],
    "pl": ["Plik", "Grafika"],
    "pms": ["Figura", "Immagine"],
    "pnb": ["میڈیا", "تصویر", "فائل"],
    "pnt": ["Εικόνα", "Αρχείον", "Εικόναν", "Μέσον"],
    "ps": ["انځور", "رسنۍ", "دوتنه"],
    "pt": ["Multimédia", "Ficheiro", "Arquivo", "Imagem"],
    "qu": ["Midya", "Imagen", "Rikcha"],
    "rm": ["Multimedia", "Datoteca"],
    "rmy": ["Fişier", "Mediya", "Chitro", "Imagine"],
    "ro": ["Fişier", "Imagine", "Fișier"],
    "roa-rup": ["Fişier", "Imagine", "Fișier"],
    "roa-tara": ["Immagine"],
    "ru": ["Медиа", "Файл", "Изображение"],
    "rue": ["Медіа", "Медиа", "Файл", "Изображение", "Зображення"],
    "rw": ["Dosiye", "Itangazamakuru"],
    "sa": ["चित्रम्", "माध्यमम्", "सञ्चिका", "माध्यम", "चित्रं"],
    "sah": ["Миэдьийэ", "Ойуу", "Билэ", "Изображение"],
    "sat": ["ᱨᱮᱫ", "ᱢᱤᱰᱤᱭᱟ"],
    "sc": ["Immàgini"],
    "scn": ["Immagine", "Mmàggini", "Mèdia"],
    "sd": ["عڪس", "ذريعات", "فائل"],
    "se": ["Fiila"],
    "sg": ["Média", "Fichier"],
    "sh": ["Mediji", "Slika", "Медија", "Datoteka", "Medija", "Слика"],
    "si": ["රූපය", "මාධ්‍යය", "ගොනුව"],
    "sk": ["Súbor", "Obrázok", "Médiá"],
    "sl": ["Slika", "Datoteka"],
    "sq": ["Figura", "Skeda"],
    "sr": ["Датотека", "Medij", "Slika", "Медија", "Datoteka", "Медиј", "Medija", "Слика"],
    "srn": ["Afbeelding", "Gefre"],
    "stq": ["Bielde", "Bild"],
    "su": ["Média", "Gambar"],
    "sv": ["Fil", "Bild"],
    "sw": ["Faili", "Picha"],
    "szl": ["Plik", "Grafika"],
    "ta": ["படிமம்", "ஊடகம்"],
    "tcy": ["ಮಾದ್ಯಮೊ", "ಫೈಲ್"],
    "te": ["ఫైలు", "దస్త్రం", "బొమ్మ", "మీడియా"],
    "tet": ["Imajen", "Arquivo", "Imagem"],
    "tg": ["Акс", "Медиа"],
    "th": ["ไฟล์", "สื่อ", "ภาพ"],
    "ti": ["ፋይል", "ሜድያ"],
    "tk": ["Faýl"],
    "tl": ["Midya", "Talaksan"],
    "tpi": ["Fail"],
    "tr": ["Medya", "Resim", "Dosya", "Ortam"],
    "tt": ["Медиа", "Рәсем", "Файл", "Räsem", "Изображение"],
    "ty": ["Média", "Fichier"],
    "tyv": ["Медиа", "Файл", "Изображение"],
    "udm": ["Медиа", "Файл", "Суред", "Изображение"],
    "ug": ["ۋاسىتە", "ھۆججەت"],
    "uk": ["Медіа", "Медиа", "Файл", "Изображение", "Зображення"],
    "ur": ["میڈیا", "تصویر", "وسیط", "زریعہ", "فائل", "ملف"],
    "uz": ["Mediya", "Tasvir", "Fayl"],
    "vec": ["Immagine", "Imàjine", "Mèdia"],
    "vep": ["Pilt", "Fail"],
    "vi": ["Phương_tiện", "Tập_tin", "Hình", "Tập tin", "Phương tiện"],
    "vls": ["Afbeelding", "Ofbeeldienge"],
}

# https://commons.wikimedia.org/wiki/Commons:File_types
IMAGE_EXTENSIONS = ['.jpg', '.jpeg', '.png', '.svg', '.gif','.tif', '.bmp', '.webp', '.xcf','.djvu', '.pdf']
VIDEO_EXTENSIONS = ['.ogv', '.webm', '.mpg', '.mpeg']
AUDIO_EXTENSIONS = ['.ogg', '.mp3', '.mid', '.webm', '.flac', '.wav', '.oga']
MEDIA_EXTENSIONS = list(set(IMAGE_EXTENSIONS + VIDEO_EXTENSIONS + AUDIO_EXTENSIONS))

In [13]:
# raw strings avoid invalid-escape warnings; the compiled patterns are unchanged
exten_regex = ('(' + '|'.join([e + r'\b' for e in MEDIA_EXTENSIONS]) + ')').replace('.', r'\.')
extension_pattern = re.compile(rf'([\w ,\(\)\.-]+){exten_regex}', flags=re.UNICODE)
bracket_pattern = re.compile(r'(?<=\[\[)(.*?)(?=\]\])', flags=re.DOTALL)

# NOTE: I explored several approaches to this function and how they impacted speed:
# * mwparserfromhell parsing substantially increases processing time, even compared to many regexes
# * Reducing down the number of extensions considered has a very minimal impact on time
# * Removing the first regex that extracts links has a very minimal impact on time. In theory it should be mostly unnecessary but will catch some rare file extensions.
# * Ignoring upper-case file extensions (e.g., .JPG) by not lower-casing the wikitext and just doing .findall over the iterative .search has very little impact on time

def getMedia(wikitext, wiki_db='enwiki', max_link_length=240):
    """Gather counts of media files found directly in wikitext.
    
    See https://phabricator.wikimedia.org/T299712 for more details.
    Link length: https://commons.wikimedia.org/wiki/Commons:File_naming#Length
    """
    lang = wiki_db.replace('wiki', '')
    try:
        # find standard bracket-syntax links -- this likely could be dropped but adds minimal overhead
        med_prefixes = MEDIA_PREFIXES + MEDIA_ALIASES.get(lang, [])
        links = bracket_pattern.findall(wikitext)
        bracket_links = set([l.split(':', maxsplit=1)[1].split('|', maxsplit=1)[0].strip() for l in links if l.split(':', maxsplit=1)[0] in med_prefixes])
        
        # supplement with links outside brackets as determined via known file extensions
        # lower-case to handle e.g., .JPG instead of .jpg when searching for file extensions
        lc_wt = wikitext.lower()
        exten_links = []
        end = 0
        while True:
            m = extension_pattern.search(lc_wt, pos=end)
            if m is None:
                break
            start, end = m.span()
            exten_links.append(wikitext[start:end].strip())
        return [l.replace('\n', ' ') for l in bracket_links.union(exten_links) if len(l) <= max_link_length]
    except Exception:
        return None
    
spark_session.udf.register('getMedia', getMedia, 'ARRAY<String>')

<function __main__.getMedia(wikitext, wiki_db='enwiki', max_link_length=240)>
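
To make the extension-based search in `getMedia` concrete, the sketch below runs the same lower-case-then-slice loop on a small piece of wikitext. The extension list is shortened and the wikitext is made up for illustration; it assumes ASCII text, so that lower-casing does not shift character offsets.

```python
import re

# Shortened extension list, for illustration only.
EXTENSIONS = ['.jpg', '.png', '.svg']
exten_regex = ('(' + '|'.join(e + r'\b' for e in EXTENSIONS) + ')').replace('.', r'\.')
extension_pattern = re.compile(rf'([\w ,\(\)\.-]+){exten_regex}', flags=re.UNICODE)

# Made-up wikitext: one file in an infobox parameter, one in a gallery.
wikitext = (
    "{{Infobox\n| image = Red Square.JPG\n}}\n"
    "Text with <gallery>\nOld map.png|A map\n</gallery>"
)

# Search the lower-cased text so .JPG matches .jpg, but slice the original
# text so the file name keeps its original capitalisation.
lc_wt = wikitext.lower()
found = []
end = 0
while True:
    m = extension_pattern.search(lc_wt, pos=end)
    if m is None:
        break
    start, end = m.span()
    found.append(wikitext[start:end].strip())

print(found)
```

Neither file here uses `[[File:...]]` syntax, which is the case the extension search is meant to cover.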

In [14]:
def compareMediaLists(curr_media, prev_media):
    """Compare two media lists to determine what changed."""
    try:
        changes = []
        unaligned = set(curr_media) ^ set(prev_media)
        for m in unaligned:
            if m in curr_media:
                changes.append((m, 1))
            elif m in prev_media:
                changes.append((m, -1))
        return changes
    except Exception:
        return None
    
spark_session.udf.register('compareMediaLists', compareMediaLists, 'ARRAY<STRUCT<filename:STRING, action:INT>>')

<function __main__.compareMediaLists(curr_media, prev_media)>
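
The comparison rests on a set symmetric difference: anything present in exactly one of the two revisions is a change, and its direction depends on which revision holds it. A minimal sketch with made-up file names (`media_changes` is a hypothetical stand-in that sorts its output for readability, which the registered UDF does not do):

```python
def media_changes(curr_media, prev_media):
    """Return (filename, action) pairs: 1 for added, -1 for removed."""
    unaligned = set(curr_media) ^ set(prev_media)  # in exactly one of the lists
    return sorted((name, 1 if name in curr_media else -1) for name in unaligned)


parent_revision = ['Old map.png', 'Portrait.jpg']
current_revision = ['Portrait.jpg', 'New photo.jpg']

changes = media_changes(current_revision, parent_revision)
unchanged = media_changes(current_revision, current_revision)

print(changes)
```

Files that appear in both revisions, such as `Portrait.jpg` here, produce no row, so only edits that actually add or remove media survive the later `INLINE` expansion.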

### Generate Media List for Image Edits

In [8]:
create_table_query = f"""
    CREATE TABLE IF NOT EXISTS {media_list_table} (
        wiki_db                         STRING         COMMENT 'Wiki -- e.g., enwiki for English',
        event_timestamp                 STRING         COMMENT 'When the edits occurred',
        page_id                         INT            COMMENT 'Article page ID',
        user_id                         INT         COMMENT 'User id of who made edit',
        user_text                       STRING         COMMENT 'User name of who made edit',
        revision_id                     BIGINT         COMMENT 'Revision ID',
        revision_parent_id              BIGINT         COMMENT 'Revision ID of parent revision',
        revision_is_identity_reverted   BOOLEAN        COMMENT 'Was revision reverted?',
        revision_is_identity_revert     BOOLEAN        COMMENT 'Did revision restore a previous revision?',
        revision_seconds_to_identity_revert    BIGINT        COMMENT 'seconds elapsed between revision posting and its revert',
        revision_tags                   ARRAY<STRING>  COMMENT 'Edit tags associated with revision',
        cur_rev_media_array             ARRAY<STRING>  COMMENT 'List of images in current revision',
        par_rev_media_array             ARRAY<STRING>  COMMENT 'List of images in parent revision'
    )
    """

spark_session.sql(create_table_query)

DataFrame[]

In [None]:
"""
Explanation of CTEs:
* revisions: get all revisions + metadata from desired wikis / timeframe.
  * only main articles and filter out bots / anonymous users
* all_revision_ids: build deduplicated lists of all revision + parent revision IDs
* media_lists: for each revision ID, extract images from associated wikitext
* INSERT INTO...: join back in media lists with revisions + metadata (appends rows, so re-running duplicates them)

# TODO: are newlines in revision_text causing NULLs?
"""

query = f"""
WITH revisions AS (
    SELECT
      wiki_db,
      event_timestamp,
      page_id,
      event_user_id AS user_id,
      event_user_text AS user_text,
      revision_id,
      revision_parent_id,
      revision_is_identity_reverted,
      revision_is_identity_revert,
      revision_seconds_to_identity_revert,
      revision_tags
    FROM wmf.mediawiki_history
    WHERE
      snapshot = '{mw_snapshot}'
      AND wiki_db IN {wiki_db_str}
      AND page_namespace = 0
      AND event_type = 'create'
      AND event_entity = 'revision'
      AND event_timestamp >= '{start_timestamp}'
      AND event_timestamp < '{end_timestamp}'
      AND SIZE(event_user_is_bot_by) < 1
      AND SIZE(event_user_is_bot_by_historical) < 1
      AND NOT event_user_is_anonymous
      AND NOT page_is_redirect
),
all_revision_ids AS (
    SELECT DISTINCT
      wiki_db,
      rev_id
    FROM (
        SELECT
          wiki_db,
          revision_id AS rev_id
        FROM revisions
        UNION ALL
        SELECT
          wiki_db,
          revision_parent_id AS rev_id
        FROM revisions
    ) r
),
media_lists AS (
    SELECT
      r.wiki_db,
      r.rev_id,
      getMedia(revision_text, wt.wiki_db) AS media_array
    FROM wmf.mediawiki_wikitext_history wt
    INNER JOIN all_revision_ids r
      ON (wt.wiki_db = r.wiki_db
          AND wt.revision_id = r.rev_id)
    WHERE
      snapshot = '{mw_snapshot}'
      AND wt.wiki_db IN {wiki_db_str}
)

INSERT INTO TABLE {media_list_table}     
SELECT
  r.wiki_db,
  event_timestamp,
  page_id,
  user_id,
  user_text,
  revision_id,
  revision_parent_id,
  revision_is_identity_reverted,
  revision_is_identity_revert,
  revision_seconds_to_identity_revert,
  revision_tags,
  c.media_array AS cur_rev_media_array,
  p.media_array AS par_rev_media_array
FROM revisions r
LEFT JOIN media_lists c
  ON (r.wiki_db = c.wiki_db AND r.revision_id = c.rev_id)
LEFT JOIN media_lists p
  ON (r.wiki_db = p.wiki_db AND r.revision_parent_id = p.rev_id)
"""

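# do_execute is not defined earlier in the notebook; set it to True to (re)populate the table
do_execute = False

if do_execute:
    spark_session.sql(query)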


### Get image edits data and corresponding images from media list

In [15]:
image_edits_query = """

SELECT
      m.wiki_db,
      event_timestamp,
      m.user_text,
      m.user_id,
      revision_id,
      page_id,
      revision_is_identity_reverted,
      revision_is_identity_revert,
      revision_seconds_to_identity_revert,
      revision_tags,
      u.user_editcount,
      INLINE(compareMediaLists(cur_rev_media_array, par_rev_media_array)),
      SIZE(par_rev_media_array) AS illustrated
    FROM {media_list_table} m
    LEFT JOIN wmf_raw.mediawiki_user u 
      ON (m.user_text = u.user_name AND m.wiki_db = u.wiki_db)
    WHERE revision_id IS NOT NULL
      AND m.wiki_db IN {wiki_db_str}
      AND u.snapshot = '{mw_snapshot}'
      
"""

In [16]:
image_edits_data = spark.run( 
        image_edits_query.format(
          media_list_table = media_list_table,
          wiki_db_str = wiki_db_str,
          mw_snapshot = mw_snapshot
        )
    )

PySpark executors will use /usr/lib/anaconda-wmf/bin/python3.
                                                                                

In [17]:
image_edits_sdf = spark_session.createDataFrame(image_edits_data)
image_edits_sdf.createGlobalTempView("image_edit_temp")


# Number of Notifications

The `echo_notification` table has two timestamps: **notification_timestamp** and **notification_read_timestamp**. The first is the time the notification was created; the second is the time the user read it (null if unread). Some notifications may have been read by users and then marked as unread again, so the read counts may slightly understate how many notifications users actually saw.

We look at both timestamps here to get the number of notifications sent and the number of notifications read by users.
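
Both columns hold 14-digit `yyyyMMddHHmmss` values, like the `n_start_timestamp` and `n_end_timestamp` parameters, so the date is simply the first eight digits. A small sketch of that conversion in plain Python (`notification_date` is a hypothetical helper, mirroring the `SUBSTR` and `UNIX_TIMESTAMP` step of the query that follows):

```python
from datetime import datetime


def notification_date(ts):
    """Return the calendar date of a 14-digit timestamp (yyyyMMddHHmmss)."""
    return datetime.strptime(str(ts)[:8], '%Y%m%d').date()


start = notification_date(20220720000000)           # inclusive lower bound
end_exclusive = notification_date('20220901000000')  # exclusive upper bound
days_covered = (end_exclusive - start).days

print(start, end_exclusive, days_covered)
```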

In [16]:
daily_notification_query = """

SELECT 
   wiki_db,
   FROM_UNIXTIME(UNIX_TIMESTAMP(SUBSTR(notification_timestamp,0,8), 'yyyyMMdd')) AS date,
   COUNT(notification_event) AS notification_sent,
   COUNT(notification_read_timestamp) AS notification_read
FROM global_temp.notification_data_temp
GROUP BY wiki_db, FROM_UNIXTIME(UNIX_TIMESTAMP(SUBSTR(notification_timestamp,0,8), 'yyyyMMdd'))
   
"""

In [17]:
daily_notification = spark.run(daily_notification_query)

PySpark executors will use /usr/lib/anaconda-wmf/bin/python3.
                                                                                

In [18]:
notification_stats = daily_notification.groupby('wiki_db').agg(
     notification_sent = ('notification_sent', 'sum'),
     notification_read = ('notification_read', 'sum')
).reset_index()

In [19]:
notification_stats['read_pct'] = 100 * notification_stats['notification_read']/notification_stats['notification_sent']

The total notifications sent and read from 2022-07-20 to 2022-08-31 are shown below. On each wiki, fewer than 30% of the notifications sent were read.

In [20]:
notification_stats

| | wiki_db | notification_sent | notification_read | read_pct |
|---|---|---|---|---|
| 0 | idwiki | 5913 | 1674 | 28.310502 |
| 1 | ptwiki | 18988 | 4329 | 22.79861 |
| 2 | ruwiki | 45264 | 13376 | 29.551078 |
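
As a quick sanity check, the read percentages can be recomputed from the counts in the table. The sketch below uses plain Python with the counts copied from above; the pooled figure across the three wikis is an extra calculation added here for illustration and is not part of the original analysis.

```python
# Counts copied from the table above.
stats = {
    'idwiki': {'sent': 5913, 'read': 1674},
    'ptwiki': {'sent': 18988, 'read': 4329},
    'ruwiki': {'sent': 45264, 'read': 13376},
}

# Per-wiki read percentage, same formula as read_pct.
read_pct = {wiki: 100 * s['read'] / s['sent'] for wiki, s in stats.items()}

# Pooled percentage across the three wikis (illustrative extra).
total_sent = sum(s['sent'] for s in stats.values())
total_read = sum(s['read'] for s in stats.values())
overall_read_pct = 100 * total_read / total_sent

for wiki, pct in read_pct.items():
    print(f'{wiki}: {pct:.2f}%')
print(f'overall: {overall_read_pct:.2f}%')
```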


# Number of opt-outs

For image suggestion notifications, web and push notifications are on by default and email notifications are off. To find opt-outs, we look at how many users turned off `echo-subscriptions-<type>-image-suggestions` for web and push (app) in the `user_properties` table on each wiki.

We also need to check users' global preference settings and compare them with `echo-subscriptions-<type>-image-suggestions-local-exception`: if there is a local exception, it overrides the global preference.
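
The precedence can be sketched in plain Python. This mirrors the nesting of the `COALESCE` calls in the query further down, where `None` stands for "no row found" (the default applies). It treats the local-exception value as the effective setting, exactly as the query does; whether that matches how MediaWiki stores exceptions is an assumption not verified here. The function names are hypothetical.

```python
def coalesce(*values):
    """Return the first value that is not None, like SQL COALESCE."""
    for value in values:
        if value is not None:
            return value
    return None


def effective_preference(local_pref=None, global_pref=None, local_exception=None):
    """Resolve one user's setting: exception, then global, then local."""
    global_or_exception = coalesce(local_exception, global_pref)
    return coalesce(global_or_exception, local_pref)


no_settings = effective_preference()                                  # default applies
local_only = effective_preference(local_pref='0')                     # local opt-out
global_wins = effective_preference(local_pref='1', global_pref='0')   # global beats local
exception_wins = effective_preference(global_pref='1', local_exception='0')
```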

Opt-out Rate = Number of opt-outs / Number of users who received image suggestion notifications
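
As a worked example of this ratio, the sketch below plugs in two rows from the results table later in this section (9 push opt-outs among 802 notified users on idwiki, 112 web opt-outs among 6010 on ruwiki). `opt_out_rate` is a hypothetical helper name.

```python
def opt_out_rate(opt_outs, notified_users):
    """Opt-out rate in percent: opt-outs divided by users who were notified."""
    if notified_users == 0:
        raise ValueError('no users received notifications')
    return 100 * opt_outs / notified_users


idwiki_push = opt_out_rate(9, 802)
ruwiki_web = opt_out_rate(112, 6010)

print(f'idwiki push: {idwiki_push:.3f}%')
print(f'ruwiki web: {ruwiki_web:.3f}%')
```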

In [25]:
pref_type_list = ("push", "web")

In [26]:
pref_query = """

WITH noti_users AS ( --total notification users

SELECT 
    wiki_db,
    notification_user
FROM global_temp.notification_data_temp
GROUP BY wiki_db, notification_user

), local_pref AS ( -- local preference 

SELECT
    wiki_db,
    local_user_id,
    up_value AS local_pref
FROM global_temp.local_pref_temp
WHERE up_property = "echo-subscriptions-{type}-image-suggestions"

), global_pref AS ( -- global preference

SELECT
    lu_wiki,
    lu_local_id,
    gp_value AS global_pref
FROM global_temp.global_pref_temp
WHERE gp_property  = "echo-subscriptions-{type}-image-suggestions"

), local_ex AS ( -- local exceptions

SELECT
    wiki_db,
    local_user_id,
    up_value AS local_ex   
FROM global_temp.local_pref_temp
WHERE up_property = "echo-subscriptions-{type}-image-suggestions-local-exception"

), global_all_pref AS ( -- compare local exception and global preference

SELECT 
    COALESCE(lu_wiki,wiki_db) AS wiki,
    COALESCE(lu_local_id,local_user_id) AS user_id,
    COALESCE(local_ex,global_pref) AS all_pref
FROM global_pref gp FULL OUTER JOIN local_ex le ON (gp.lu_wiki = le.wiki_db AND gp.lu_local_id = le.local_user_id)

)

SELECT 
    nu.wiki_db,
    nu.notification_user,
    COALESCE(gp.all_pref, lp.local_pref) AS preference
FROM noti_users nu 
    LEFT JOIN local_pref lp ON (nu.wiki_db = lp.wiki_db AND nu.notification_user = lp.local_user_id)
    LEFT JOIN global_all_pref gp ON (nu.wiki_db = gp.wiki AND nu.notification_user = gp.user_id)

"""

In [33]:
pref_frames = []

for pref_type in pref_type_list:

    pref_result = spark.run(pref_query.format(
                           type = pref_type
                        ))

    pref_result.insert(0, 'type', pref_type)

    pref_frames.append(pref_result)

# DataFrame.append was removed in pandas 2.0; concatenate once instead
pref_stats = pd.concat(pref_frames)


PySpark executors will use /usr/lib/anaconda-wmf/bin/python3.
                                                                                

In [38]:
## total number of users who received notifications
total_users = pref_stats[(pref_stats['type'] == "web")].groupby(
    ['wiki_db']
).agg(
    total_users = ('notification_user', 'count')
).reset_index()

In [40]:
## opt-out for push notifications and web notifications
pref = pref_stats.groupby(
    ['wiki_db','type', 'preference']
).agg(
    optout_users = ('preference', 'count')
).reset_index()

pref = pref.merge(total_users, on='wiki_db', how='left')

In [41]:
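pref['opt_out_pct'] = 100 * pref['optout_users']/pref['total_users']
# preference values were bytes cast to str earlier, so a disabled preference appears as the literal string "b'0'"
pref[(pref['preference'] == "b'0'")]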

| | wiki_db | type | preference | optout_users | total_users | opt_out_pct |
|---|---|---|---|---|---|---|
| 0 | idwiki | push | b'0' | 9 | 802 | 1.122195 |
| 2 | idwiki | web | b'0' | 10 | 802 | 1.246883 |
| 4 | ptwiki | push | b'0' | 20 | 2516 | 0.794913 |
| 6 | ptwiki | web | b'0' | 23 | 2516 | 0.914149 |
| 8 | ruwiki | push | b'0' | 106 | 6010 | 1.763727 |
| 10 | ruwiki | web | b'0' | 112 | 6010 | 1.863561 |


# Number of Images Added by Users in August

In [14]:
#stats on all editors in August
#excluding all the edits reverted within 48 hours

do_execute = True

query = f"""
WITH num_edits AS (
    SELECT
      wiki_db,
      COUNT(1) AS total_num_edits,
      COUNT(DISTINCT(user_text)) AS total_num_editors
    FROM {media_list_table}
    WHERE
      revision_id IS NOT NULL
    AND event_timestamp >= '2022-08-01'
    AND event_timestamp < '2022-09-01'
    AND !(revision_is_identity_reverted AND revision_seconds_to_identity_revert <= 172800)
    AND ! revision_is_identity_revert
    GROUP BY
      wiki_db
),
changes AS (
    SELECT
      *
    FROM global_temp.image_edit_temp
    WHERE
      revision_id IS NOT NULL
    AND event_timestamp >= '2022-08-01'
    AND event_timestamp < '2022-09-01'
    AND !(revision_is_identity_reverted AND revision_seconds_to_identity_revert <= 172800)
    AND ! revision_is_identity_revert
),
change_counts AS (
    SELECT
      wiki_db,
      COUNT(DISTINCT(revision_id)) AS num_edits,
      COUNT(DISTINCT(user_text)) AS num_editors,
      SUM(IF(action = 1, 1, 0)) AS images_added
    FROM changes
    GROUP BY
      wiki_db
)

SELECT
  c.wiki_db,
  num_edits,
  total_num_edits,
  ROUND(100 * num_edits / total_num_edits, 3) AS pct_edits,
  num_editors,
  total_num_editors,
  ROUND(100 * num_editors / total_num_editors, 3) AS pct_editors,
  images_added,
  ROUND(images_added / total_num_editors,1) AS average_images
FROM change_counts c
INNER JOIN num_edits t
  ON (c.wiki_db = t.wiki_db)
"""

if do_execute:
    spark_session.sql(query).show(500, False)

                                                                                

+-------+---------+---------------+---------+-----------+-----------------+-----------+------------+--------------+
|wiki_db|num_edits|total_num_edits|pct_edits|num_editors|total_num_editors|pct_editors|images_added|average_images|
+-------+---------+---------------+---------+-----------+-----------------+-----------+------------+--------------+
|ptwiki |6891     |91356          |7.543    |1346       |6063             |22.2       |9264        |1.5           |
|idwiki |4853     |60324          |8.045    |737        |2510             |29.363     |7203        |2.9           |
|ruwiki |15035    |265604         |5.661    |2152       |8271             |26.019     |22005       |2.7           |
+-------+---------+---------------+---------+-----------+-----------------+-----------+------------+--------------+
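
Note that `average_images` divides by `total_num_editors` (all editors active in August), not by `num_editors` (editors with at least one image change). The sketch below, in plain Python with figures copied from the output above, shows how much the average depends on that choice. The second denominator is an illustrative alternative, not something the query computes.

```python
# Figures copied from the August output above.
august = {
    "ptwiki": {"images_added": 9264, "num_editors": 1346, "total_num_editors": 6063},
    "idwiki": {"images_added": 7203, "num_editors": 737, "total_num_editors": 2510},
    "ruwiki": {"images_added": 22005, "num_editors": 2152, "total_num_editors": 8271},
}


def per_editor_averages(row):
    """Return (average over all editors, average over editors with image changes)."""
    per_all = row["images_added"] / row["total_num_editors"]
    # Alternative denominator (assumption for illustration only).
    per_image_editor = row["images_added"] / row["num_editors"]
    return per_all, per_image_editor


averages = {wiki: per_editor_averages(row) for wiki, row in august.items()}

for wiki, (per_all, per_image_editor) in averages.items():
    print(
        f"{wiki}: {per_all:.1f} images per editor overall, "
        f"{per_image_editor:.1f} per editor with image changes"
    )
```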



In [None]:
# Stats on edits made by experienced editors (more than 500 edits).
# Excludes edits reverted within 48 hours (172800 seconds) and edits that are themselves reverts.

do_execute = True

query = f"""
WITH num_edits AS (
    SELECT
      a.wiki_db,
      COUNT(1) AS total_num_edits,
      COUNT(DISTINCT(user_text)) AS total_num_editors
    FROM {media_list_table} a
        LEFT JOIN wmf_raw.mediawiki_user u ON (a.user_text = u.user_name AND a.wiki_db = u.wiki_db)
    WHERE
      revision_id IS NOT NULL
    AND a.wiki_db in {wiki_db_str}
    AND u.snapshot = '{mw_snapshot}'
    AND event_timestamp >= '2022-08-01'
    AND event_timestamp < '2022-09-01'
    AND !(revision_is_identity_reverted AND revision_seconds_to_identity_revert <= 172800)
    AND ! revision_is_identity_revert
    AND user_editcount > 500
    GROUP BY
      a.wiki_db
),
changes AS (
    SELECT
      *
    FROM global_temp.image_edit_temp 
    WHERE revision_id IS NOT NULL
    AND event_timestamp >= '2022-08-01'
    AND event_timestamp < '2022-09-01'   
    AND !(revision_is_identity_reverted AND revision_seconds_to_identity_revert <= 172800)
    AND ! revision_is_identity_revert
    AND user_editcount > 500
),
change_counts AS (
    SELECT
      wiki_db,
      COUNT(DISTINCT(revision_id)) AS num_edits,
      COUNT(DISTINCT(user_text)) AS num_editors,
      SUM(IF(action = 1, 1, 0)) AS images_added
    FROM changes
    GROUP BY
      wiki_db
)

SELECT
  c.wiki_db,
  num_edits,
  total_num_edits,
  ROUND(100 * num_edits / total_num_edits, 3) AS pct_edits,
  num_editors,
  total_num_editors,
  ROUND(100 * num_editors / total_num_editors, 3) AS pct_editors,
  images_added,
  ROUND(images_added / total_num_editors,1) AS average_images
FROM change_counts c
INNER JOIN num_edits t
  ON (c.wiki_db = t.wiki_db)
"""

if do_execute:
    spark_session.sql(query).show(500, False)


+-------+---------+---------------+---------+-----------+-----------------+-----------+------------+--------------+
|wiki_db|num_edits|total_num_edits|pct_edits|num_editors|total_num_editors|pct_editors|images_added|average_images|
+-------+---------+---------------+---------+-----------+-----------------+-----------+------------+--------------+
|ptwiki |4572     |68128          |6.711    |473        |957              |49.425     |6318        |6.6           |
|idwiki |3083     |43252          |7.128    |230        |386              |59.585     |4619        |12.0          |
|ruwiki |11926    |231313         |5.156    |1099       |2392             |45.945     |17895       |7.5           |
+-------+---------+---------------+---------+-----------+-----------------+-----------+------------+--------------+



# Number of Images Added to Unillustrated Articles by Experienced Editors in August

In [21]:
# f-string so that {media_list_table} and {wiki_db_str} are interpolated
image_edits_query = f"""

WITH aug_image_edits AS (
    
    SELECT
      m.wiki_db,
      m.user_text,
      revision_id,
      revision_is_identity_reverted,
      revision_seconds_to_identity_revert,
      INLINE(compareMediaLists(cur_rev_media_array, par_rev_media_array)),
      size(par_rev_media_array) AS illustrated
    FROM {media_list_table} m
    LEFT JOIN wmf_raw.mediawiki_user u ON (m.user_text = u.user_name AND m.wiki_db = u.wiki_db)
    WHERE revision_id IS NOT NULL
    AND m.wiki_db in {wiki_db_str}
    AND u.snapshot = '2022-08'
    AND event_timestamp >= '2022-08-01'
    AND event_timestamp < '2022-09-01'
    AND ! revision_is_identity_revert
    AND user_editcount > 500
)


    SELECT
      wiki_db,
      COUNT(DISTINCT(revision_id)) AS num_edits,
      COUNT(DISTINCT(user_text)) AS num_editors,
      SUM(IF(action = 1, 1, 0)) AS images_added
    FROM aug_image_edits
    WHERE illustrated = 0
      AND  !(revision_is_identity_reverted AND revision_seconds_to_identity_revert <= 172800)
    GROUP BY
      wiki_db
    
"""

In [None]:
spark.run(image_edits_query)

# Number of Images Suggested that are Added to the Matched Article

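To find image edits to the matched articles, we join image additions to notifications on wiki, page, user, and file name, and keep only edits whose timestamp is later than the notification read timestamp. Notifications that were never read are therefore not counted. We also exclude image edits made through Newcomer Tasks.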
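
The matching rule can be sketched on toy data in plain Python. All rows below are hypothetical; the field names mirror the columns used in the query in the next cell, and the `is_match` helper is an illustrative assumption rather than part of the analysis code.

```python
from datetime import datetime


def ts(text):
    """Parse a 'YYYY-MM-DD HH:MM:SS' string into a datetime."""
    return datetime.strptime(text, "%Y-%m-%d %H:%M:%S")


# Hypothetical image additions (action = 1), one per scenario.
image_edits = [
    # Added after the notification was read: counts as a match.
    {"wiki_db": "ptwiki", "page_id": 1, "user_id": 10, "filename": "Foo bar.jpg",
     "event_timestamp": ts("2022-08-02 12:00:00"), "revision_tags": []},
    # Added before the notification was read: no match.
    {"wiki_db": "ptwiki", "page_id": 2, "user_id": 11, "filename": "Baz.jpg",
     "event_timestamp": ts("2022-08-01 09:00:00"), "revision_tags": []},
    # Newcomer Tasks edit: excluded.
    {"wiki_db": "ruwiki", "page_id": 3, "user_id": 12, "filename": "Qux.png",
     "event_timestamp": ts("2022-08-05 10:00:00"), "revision_tags": ["newcomer task"]},
    # Notification never read: no match.
    {"wiki_db": "idwiki", "page_id": 4, "user_id": 13, "filename": "Quux.png",
     "event_timestamp": ts("2022-08-06 10:00:00"), "revision_tags": []},
]

# Hypothetical notifications, one per edit above.
notifications = [
    {"wiki_db": "ptwiki", "event_page_id": 1, "notification_user": 10,
     "suggested_image": "Foo_bar.jpg", "notification_read_timestamp": ts("2022-08-02 08:00:00")},
    {"wiki_db": "ptwiki", "event_page_id": 2, "notification_user": 11,
     "suggested_image": "Baz.jpg", "notification_read_timestamp": ts("2022-08-01 10:00:00")},
    {"wiki_db": "ruwiki", "event_page_id": 3, "notification_user": 12,
     "suggested_image": "Qux.png", "notification_read_timestamp": ts("2022-08-05 09:00:00")},
    {"wiki_db": "idwiki", "event_page_id": 4, "notification_user": 13,
     "suggested_image": "Quux.png", "notification_read_timestamp": None},
]


def is_match(edit, note):
    """True if the edit adds the suggested image after the notification was read."""
    read_at = note["notification_read_timestamp"]
    edit_key = (edit["wiki_db"], edit["page_id"], edit["user_id"],
                edit["filename"].replace(" ", "_"))  # same normalisation as the query
    note_key = (note["wiki_db"], note["event_page_id"], note["notification_user"],
                note["suggested_image"])
    return (
        "newcomer task" not in edit["revision_tags"]
        and read_at is not None
        and edit_key == note_key
        and read_at < edit["event_timestamp"]
    )


matched = [e for e in image_edits if any(is_match(e, n) for n in notifications)]
print([e["page_id"] for e in matched])
```

Only the first toy edit satisfies all four conditions.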

In [47]:
match_image_query = """

WITH image_edits AS (
    SELECT
        wiki_db,
        event_timestamp,
        revision_id,
        page_id,
        user_id,
        revision_is_identity_reverted,
        revision_seconds_to_identity_revert,
        REPLACE(filename, ' ', '_') AS filename
    FROM global_temp.image_edit_temp
    WHERE action = 1
      AND ! revision_is_identity_revert
      AND ! ARRAY_CONTAINS(revision_tags, 'newcomer task')
), 
suggested_images AS (
    SELECT
        wiki_db,
        notification_event,
        FROM_UNIXTIME(UNIX_TIMESTAMP(notification_timestamp, 'yyyyMMddHHmmss')) AS notification_timestamp,
        FROM_UNIXTIME(UNIX_TIMESTAMP(notification_read_timestamp, 'yyyyMMddHHmmss')) AS notification_read_timestamp,
        event_page_id,
        notification_user,
        suggested_image
    FROM global_temp.notification_data_temp
)

SELECT 
    i.wiki_db,
    revision_id,
    page_id,
    user_id,
    filename,
    IF(revision_is_identity_reverted AND revision_seconds_to_identity_revert <= 172800, TRUE, FALSE) AS reverted
FROM image_edits i
LEFT JOIN suggested_images s 
    ON (i.wiki_db = s.wiki_db AND i.page_id = s.event_page_id AND i.user_id = s.notification_user AND i.filename = s.suggested_image)
WHERE notification_event IS NOT NULL
  AND notification_read_timestamp IS NOT NULL
  AND notification_read_timestamp < event_timestamp 

"""


In [None]:
matched_image = spark.run(match_image_query)

In [49]:
# Images Suggested that are Added to the Matched Article
matched_image.groupby(
    ['wiki_db']
).agg(
    images_added = ('filename', 'count'),
    related_pages = ('page_id', 'nunique'),
    related_users = ('user_id', 'nunique')
).reset_index()

Unnamed: 0,wiki_db,images_added,related_pages,related_users
0,idwiki,56,56,37
1,ptwiki,125,125,79
2,ruwiki,324,320,208


In [50]:
# Suggested Images not Reverted from Matched Article

matched_image[(matched_image['reverted'] == False)].groupby(
    ['wiki_db']
).agg(
    images_added = ('filename', 'count'),
    related_pages = ('page_id', 'nunique'),
    related_users = ('user_id', 'nunique')
).reset_index()

Unnamed: 0,wiki_db,images_added,related_pages,related_users
0,idwiki,56,56,37
1,ptwiki,125,125,79
2,ruwiki,324,320,208


In total, 505 suggested images (56 on idwiki, 125 on ptwiki, 324 on ruwiki) were added to their matched articles between 07/20/2022 and 08/31/2022. None of these image edits were reverted within 48 hours.
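
As a quick arithmetic check, the totals and the share of additions that survived 48 hours can be recomputed from the two tables above. This is a minimal sketch in plain Python; the variable names are illustrative and the counts are copied from the outputs.

```python
# Per-wiki counts copied from the two tables above:
# wiki_db -> (suggested images added, of which not reverted within 48 hours)
matched_counts = {
    "idwiki": (56, 56),
    "ptwiki": (125, 125),
    "ruwiki": (324, 324),
}

total_added = sum(added for added, _ in matched_counts.values())
total_kept = sum(kept for _, kept in matched_counts.values())
kept_pct = 100 * total_kept / total_added

print(f"added: {total_added}, not reverted within 48h: {total_kept} ({kept_pct:.1f}%)")
```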

# (to do) time to add images