# UK Wikipedia Unique Devices by Project Family

## Task Description:

- Create updated UK monthly user count estimates for Wikipedia - for period 20240901 - 20250228.
- If possible, we want help to demonstrate that features like Special:Nearby and Special:NewPagesFeed are not used by many (UK) users.
- [Asana Task](https://app.asana.com/1/3758245663860/task/1209810804894230?focus=true)
- [T390785](https://phabricator.wikimedia.org/T390785)

## Methodology and Data Sources: 

Unique devices per project family data comes from the `wmf.unique_devices_per_project_family_monthly` table, which is generated by [unique_devices_per_project_family_monthly.hql](https://github.com/wikimedia/analytics-refinery/blob/master/oozie/unique_devices/per_project_family/monthly/unique_devices_per_project_family_monthly.hql).

**Metric**: The average of `uniques_estimate` over the six months from Sep 2024 through Feb 2025. Data is limited to the United Kingdom (country code `GB`).

**Important note**: Unique devices by family metrics were inflated from July to Nov 2024; the issue was fixed and the data backfilled by Jan 2025.
For more info, read [2024-09-20 Unique Devices by Family Inflated Due to Miscategorized Traffic](https://wikitech.wikimedia.org/wiki/Data_Platform/Data_Lake/Data_Issues/2024-09-20_Unique_Devices_by_Family_Inflated_Due_to_Miscategorized_Traffic).


In [1]:
import pandas as pd
import wmfdata as wmf

## UK Wikipedia Unique Devices by Project Family

In [2]:
query = '''

SELECT
    project_family,
    country_code,
    ud.year,
    ud.month,
    SUM(ud.uniques_estimate) as unique_devices
  FROM wmf.unique_devices_per_project_family_monthly ud
  WHERE 
  -- UK unique devices only  
    ud.country_code = 'GB'
    
   -- Wikipedia UK unique devices only  
    AND project_family = 'wikipedia' 
    
  -- time period: September 1, 2024 thru Feb 28, 2025
    AND (
    (YEAR = 2024 AND MONTH > 08)
    OR
    (YEAR = 2025 AND MONTH < 03))
    
  GROUP BY 
    project_family,
    country_code,
    year,
    month

'''
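
As a sanity check on the time filter, the `WHERE` clause above can be mirrored in plain Python to confirm that it selects exactly the six months from September 2024 through February 2025. This is a minimal sketch; the helper name `in_window` is hypothetical and not part of the notebook.

```python
def in_window(year: int, month: int) -> bool:
    """Mirror the SQL filter: (YEAR = 2024 AND MONTH > 8) OR (YEAR = 2025 AND MONTH < 3)."""
    return (year == 2024 and month > 8) or (year == 2025 and month < 3)


# Enumerate every (year, month) pair in 2024-2025 and keep those the filter accepts.
selected = [(y, m) for y in (2024, 2025) for m in range(1, 13) if in_window(y, m)]
print(selected)
```

The list should contain six pairs, matching the six rows returned by the query.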

In [3]:
uk_wp_unique_devices = wmf.spark.run(query)


In [54]:
uk_wp_unique_devices.sort_values(by=['year','month'])

Unnamed: 0,project_family,country_code,year,month,unique_devices
3,wikipedia,GB,2024,9,58406575
4,wikipedia,GB,2024,10,65020642
0,wikipedia,GB,2024,11,63651189
2,wikipedia,GB,2024,12,62155538
5,wikipedia,GB,2025,1,65937400
1,wikipedia,GB,2025,2,63492529


In [67]:
uk_wp_unique_devices_avg['Monthly Avg UK Unique Users'] = (uk_wp_unique_devices['unique_devices']/2.4).mean().round(-3)
uk_wp_unique_devices_avg

year                               2024.0
month                                 8.0
unique_devices                 63110646.0
Monthly Avg UK Unique Users    26296000.0
dtype: object

In [55]:
# numeric_only=True skips the non-numeric country_code column when averaging
uk_wp_unique_devices_avg = (uk_wp_unique_devices.groupby(['project_family']).mean(numeric_only=True).round().reset_index())


In [56]:
uk_wp_unique_devices_avg

Unnamed: 0,project_family,year,month,unique_devices
0,wikipedia,2024.0,8.0,63110646.0


We estimate unique users by dividing unique devices by 2.4, an average devices-per-person factor taken from the
[Cisco Annual Internet Report (2018–2023)](https://www.cisco.com/c/en/us/solutions/collateral/executive-perspectives/annual-internet-report/white-paper-c11-741490.pdf).

In [57]:
uk_wp_unique_devices_avg['Monthly Avg UK Unique Users'] = (uk_wp_unique_devices_avg['unique_devices']/2.4).round(-3)

In [58]:
uk_wp_unique_devices_avg

Unnamed: 0,project_family,year,month,unique_devices,Monthly Avg UK Unique Users
0,wikipedia,2024.0,8.0,63110646.0,26296000.0
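
The arithmetic behind this estimate can be reproduced without Spark from the six monthly values shown in the query output earlier. The sketch below is a plain-Python cross-check rather than the notebook's own method, and the variable names are illustrative.

```python
from statistics import mean

# Monthly UK Wikipedia unique devices, Sep 2024 through Feb 2025 (from the query output).
monthly_unique_devices = [58406575, 65020642, 63651189, 62155538, 65937400, 63492529]

# Devices-per-person factor used throughout this notebook.
DEVICES_PER_USER = 2.4

avg_devices = mean(monthly_unique_devices)
# Round to the nearest thousand, as the notebook does with round(-3).
est_users = round(avg_devices / DEVICES_PER_USER, -3)
print(avg_devices, est_users)
```

The notebook rounds the average to a whole number before dividing, while this sketch divides the unrounded average; at thousand-level precision both give the same estimate.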


## Requests to Special Pages 

### UK users Special:Nearby and Special:NewPagesFeed 
We took a single sample day, March 25, 2025, and queried `wmf.webrequest` to count the requests made to these special pages.

In [68]:
# UK users Special:Nearby and Special:NewPagesFeed 

UK_special=wmf.spark.run("""

SELECT
COUNT(*),
CASE
WHEN LOWER(uri_path) like '%special:nearby%' THEN 'special_nearby' 
WHEN LOWER(uri_path) like '%special:newpagesfeed%' THEN 'special_newpagesfeed'
END AS special_page_UK

FROM wmf.webrequest
WHERE
  year = 2025
  AND month = 3
  --AND hour = 18
  AND day = 25
  AND geocoded_data['country_code'] = 'GB' 
  AND (LOWER(uri_path) like '%special:nearby%'
  OR
  LOWER(uri_path) like '%special:newpagesfeed%')
  AND agent_type = 'user'
  AND NOT is_pageview
GROUP BY
CASE
WHEN LOWER(uri_path) like '%special:nearby%' THEN 'special_nearby'
WHEN LOWER(uri_path) like '%special:newpagesfeed%' THEN 'special_newpagesfeed'
END


-- We don't need to sample since it's a small number of requests.
   --AND MOD(HASH(day || hour), 6) IN (0,1,2) -- Select ~16% of every hour 
  """)
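
The `CASE` expression in this query buckets requests by a case-insensitive substring match on `uri_path`. A minimal Python sketch of the same rule follows; the function name `classify_special_page` and the sample paths are hypothetical and only illustrate the matching logic.

```python
from typing import Optional


def classify_special_page(uri_path: str) -> Optional[str]:
    """Return the bucket label for a request path, or None if neither page matches."""
    path = uri_path.lower()
    # Same order as the SQL CASE: the first matching branch wins.
    if "special:nearby" in path:
        return "special_nearby"
    if "special:newpagesfeed" in path:
        return "special_newpagesfeed"
    return None


# Hypothetical sample paths, not taken from real traffic.
for sample in ("/wiki/Special:Nearby", "/wiki/special:newpagesfeed", "/wiki/Main_Page"):
    print(sample, "->", classify_special_page(sample))
```

Like the SQL pattern, this only matches the literal colon form; a percent-encoded path such as `Special%3ANearby` would fall through to `None`.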

                                                                                

In [None]:
UK_special

The UK counts are very small and are therefore not shared publicly, per our [Data publication guidelines](https://foundation.wikimedia.org/wiki/Legal:Data_publication_guidelines).

### Global users Special:Nearby and Special:NewPagesFeed 


In [70]:
# Global users Special:Nearby and Special:NewPagesFeed 

Global_special=wmf.spark.run("""

SELECT
COUNT(*),
CASE
WHEN LOWER(uri_path) like '%special:nearby%' THEN 'special_nearby' 
WHEN LOWER(uri_path) like '%special:newpagesfeed%' THEN 'special_newpagesfeed'
END AS special_page_Global

FROM wmf.webrequest
WHERE
  year = 2025
  AND month = 3
  --AND hour = 18
  AND day = 25
  AND (LOWER(uri_path) like '%special:nearby%'
  OR
  LOWER(uri_path) like '%special:newpagesfeed%')
  AND agent_type = 'user'
  AND NOT is_pageview
GROUP BY
CASE
WHEN LOWER(uri_path) like '%special:nearby%' THEN 'special_nearby'
WHEN LOWER(uri_path) like '%special:newpagesfeed%' THEN 'special_newpagesfeed'
END


-- We don't need to sample since it's a small number of requests.
   --AND MOD(HASH(day || hour), 6) IN (0,1,2) -- Select ~16% of every hour 
  """)

                                                                                

In [71]:
Global_special

Unnamed: 0,count(1),special_page_Global
0,8617,special_nearby
1,590,special_newpagesfeed


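Estimated unique users from the UK on the same day as the special page requests (March 25, 2025), summed across all project families and using the same 2.4 devices-per-user factor.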

In [72]:
UK_unique_users=wmf.spark.run("""

SELECT
day, 
CAST ((SUM (uniques_estimate)/2.4) AS INT) AS uk_unique_users
FROM wmf_readership.unique_devices_per_project_family_daily
WHERE
country_code = 'GB'
AND day IN
(
'2025-03-25'
)
GROUP BY day
""")

UK_unique_users

                                                                                

Unnamed: 0,day,uk_unique_users
0,2025-03-25,3651295


In [73]:
Absolute_proportion= (UK_special['count(1)'] / UK_unique_users['uk_unique_users'])*100

In [74]:
Absolute_proportion

0    0.008928
1         NaN
dtype: float64
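
The `NaN` in the second row comes from pandas index alignment: `UK_special['count(1)']` has rows 0 and 1, while `UK_unique_users['uk_unique_users']` has only row 0, so row 1 has nothing to divide by. A minimal sketch of the effect and one way around it is shown below. The counts are made-up placeholders, not the real UK figures, and the variable names are illustrative.

```python
import pandas as pd

# Placeholder data with the same shape as the two query results (values are invented).
requests = pd.DataFrame({
    "count(1)": [120, 30],
    "special_page_UK": ["special_nearby", "special_newpagesfeed"],
})
users = pd.DataFrame({"day": ["2025-03-25"], "uk_unique_users": [1_000_000]})

# Series / Series aligns on the index, so row 1 has no denominator and becomes NaN.
aligned = requests["count(1)"] / users["uk_unique_users"] * 100

# Pulling out the single daily value as a scalar applies it to every row.
daily_users = users["uk_unique_users"].iloc[0]
proportion = requests.set_index("special_page_UK")["count(1)"] / daily_users * 100

print(aligned)
print(proportion)
```

Indexing the result by page name also keeps each percentage attached to the page it describes.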

##### Requests to Special:Nearby amount to roughly 0.01% of the estimated unique users in the UK on a single day. Requests to Special:NewPagesFeed are even fewer.
#### Therefore, UK users rarely request these two pages.