This analysis was requested to help inform some discussions about article creation dynamics on the English Wikipedia. See [T149021](https://phabricator.wikimedia.org/T149021) and [T149049](https://phabricator.wikimedia.org/T149049).

In [19]:
import pandas as pd
import bokeh.plotting as bk
import bokeh
import datetime as dt
from IPython.display import display, HTML

# Survival of new articles over time

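The data comes from the following two queries. Surviving creations are held in the `revision` table, while deleted creations have been moved to the `archive` table. Both queries skip creations whose edit summary contains "redir" (probably redirects) and creations of 100 bytes or fewer.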

```
select left(rev_timestamp, 6) as month, count(*) as surviving_creations
from enwiki.revision
left join enwiki.page 
on rev_page = page_id
where 
page_namespace = 0 and
rev_parent_id = 0 and
convert(rev_comment using utf8) not like "%redir%" and
rev_len > 100
group by left(rev_timestamp, 6);
```

```
select left(a.ar_timestamp, 6) as month, count(*) as deleted_creations
from enwiki.archive a
inner join
(
select ar_title, min(ar_timestamp) as ar_timestamp
from enwiki.archive
where
ar_namespace = 0 and
convert(ar_comment using utf8) not like "%redir%" and
ar_len > 100
group by ar_title
) b
using (ar_title, ar_timestamp)
group by left(a.ar_timestamp, 6)
```
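Both queries bucket creations by month with `left(..., 6)`, which works because MediaWiki stores timestamps as `YYYYMMDDHHMMSS` strings. A minimal sketch of the same bucketing in Python, using one timestamp purely for illustration and the standard library rather than pandas:

```python
import datetime as dt

# Illustrative MediaWiki-style timestamp in YYYYMMDDHHMMSS form.
timestamp = "20161025000056"

# Equivalent of SQL left(rev_timestamp, 6): keep the year and month digits.
month_key = timestamp[:6]

# Turn the month key into a real date (first day of that month).
month_start = dt.datetime.strptime(month_key, "%Y%m").date()

print(month_key, month_start)
```

The same `"%Y%m"` format string is what the notebook passes to `pd.to_datetime` when it converts the `month` column.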

In [34]:
survived = pd.read_table("2016-10_enwiki_surviving_creations.tsv")
survived.head()

Unnamed: 0,month,surviving_creations
0,200101,156
1,200102,327
2,200103,531
3,200104,582
4,200105,1070


In [35]:
deleted = pd.read_table("2016-10_enwiki_deleted_creations.tsv")
deleted.head()

Unnamed: 0,month,deleted_creations
0,200101,15
1,200102,46
2,200103,35
3,200104,15
4,200105,53


In [36]:
survival = survived.merge(deleted, on = "month")
survival["pct_survival"] = \
    survival["surviving_creations"] / \
    (survival["surviving_creations"] + survival["deleted_creations"])

# Convert month column to real date
survival["month"] = pd.to_datetime(survival["month"], format = "%Y%m")
survival.set_index(keys = "month", inplace = True)

# Get rid of incomplete data for November
survival.drop(pd.to_datetime("2016-11-01"), inplace = True)

survival.tail()

month,surviving_creations,deleted_creations,pct_survival
2016-07-01,22971,7817,0.746102
2016-08-01,23873,7999,0.749027
2016-09-01,26488,7998,0.76808
2016-10-01,24203,6978,0.77621
2016-11-01,3711,1296,0.741162
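As a check on the calculation, the October 2016 row can be reproduced by hand: 24203 / (24203 + 6978) ≈ 0.776. The sketch below does that and also illustrates a detail of `merge` worth keeping in mind: it defaults to an inner join, so a month present in only one of the two tables would be dropped silently. The missing month in the toy `deleted` table is hypothetical and only there to show the effect.

```python
import pandas as pd

survived = pd.DataFrame({
    "month": [201610, 201611],
    "surviving_creations": [24203, 3711],
})
# Hypothetical case: the deleted table has no row for 201611.
deleted = pd.DataFrame({"month": [201610], "deleted_creations": [6978]})

inner = survived.merge(deleted, on="month")               # default how="inner"
outer = survived.merge(deleted, on="month", how="outer")  # keeps unmatched months

inner["pct_survival"] = inner["surviving_creations"] / (
    inner["surviving_creations"] + inner["deleted_creations"]
)

print(inner)
print("rows kept by inner join:", len(inner))
print("rows kept by outer join:", len(outer))
```

Comparing the two row counts is a cheap way to confirm that no month was lost in the merge.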


In [48]:
bokeh.io.output_notebook()

c = bk.figure(width = 800, height = 400, x_axis_type = "datetime", y_range = (0, 1))
c.line(survival.index, survival["pct_survival"], color = "navy", line_width = 2)
c.toolbar.active_drag = None
bk.show(c)

# Article creation by non-autoconfirmed editors

**Summary**: I estimate that about 87% of new articles on the English Wikipedia are created by autoconfirmed users. (In this case, an article is a main-namespace page which is not a redirect.)

The data I import below comes from the following SQL query. Note that, contrary to what I initially thought, this does *not* include edits to deleted pages, which are removed from the `recentchanges` table when they are deleted.
```
select 
    rc_title as page_title,
    rc_cur_id as page_id,
    rc_timestamp as creation_timestamp,
    rc_this_oldid as creation_rev_id,
    rc_new_len as length_at_creation,
    rc_user as user_id,
    rc_user_text as user_name,
    (select count(*)
        from enwiki.revision
        where 
            rev_user = rc_user and
            rev_timestamp < rc_timestamp
    ) as user_edit_count,
    user_registration
from enwiki.recentchanges
inner join enwiki.user
on rc_user = user_id
where 
    rc_type = 1 and
    rc_namespace = 0 and
    rc_comment not regexp "[Rr]edir" and
    rc_timestamp >= "20161025" and
    rc_timestamp < "20161101";
```

In [45]:
creations = pd.read_table(
    "2016-10_enwiki_article_creations.tsv",
    parse_dates = [2, 8])
creations.head()

Unnamed: 0,page_title,page_id,creation_timestamp,creation_rev_id,length_at_creation,user_id,user_name,user_edit_count,user_registration
0,Sundkler,52087925,2016-10-25 00:00:56,746047690,28,29493069,NorbayWarte,36,2016-10-24 21:14:31
1,Donald_Trump's_Wall,52087945,2016-10-25 00:04:02,746048100,1043,29133239,Christianhamby,1,2016-09-08 20:39:24
2,Bye_Bye_My_Blue,52088044,2016-10-25 00:21:22,746050319,1055,28768139,LuckyAries,690,2016-07-15 02:02:37
3,Ahmad_Danny_Ramadan,52088127,2016-10-25 00:31:26,746051791,1430,12755007,Danny3aw,0,2010-07-19 06:26:26
4,ACB_statistical_leaders,52088148,2016-10-25 00:35:56,746052392,7791,13174094,Bluesangrel,56083,2010-10-01 17:40:10


In [46]:
four_d = dt.timedelta(days = 4)
creations["creator_autoconfirmed"] = (
    (creations["user_edit_count"] >= 10) &
    (creations["creation_timestamp"] >= (creations["user_registration"] + four_d))
    )

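However, this leaves a few entries with null account creation dates, because those accounts were created before MediaWiki started recording registration dates. A comparison against a null date evaluates to false, so these creators are flagged as not autoconfirmed. I'll manually set them to be autoconfirmed.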

In [47]:
null_reg = creations[ creations["user_registration"].isnull() ]
null_reg

Unnamed: 0,page_title,page_id,creation_timestamp,creation_rev_id,length_at_creation,user_id,user_name,user_edit_count,user_registration,creator_autoconfirmed
1053,Stone_slab,52134207,2016-10-29 20:22:38,746827427,6979,44656,Mcapdevila,3787,NaT,False
1335,Niccolò_Lorini,52149965,2016-10-31 16:27:31,747128371,453,36571,Acrider,116,NaT,False
5305,Takashi_Nishimoto,52141654,2016-10-30 17:42:34,746968857,2149,602857,Muboshgu,173629,NaT,False


In [48]:
# .ix has been removed from pandas; .loc does the same label-based assignment
creations.loc[null_reg.index, "creator_autoconfirmed"] = True
creations[ creations["user_registration"].isnull() ]["creator_autoconfirmed"]

1053    True
1335    True
5305    True
Name: creator_autoconfirmed, dtype: bool
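The rule and the null-registration fix above could also be combined in one step. This is a sketch rather than the method used in this notebook: the function name `flag_autoconfirmed` and its parameters are made up here, and unlike the manual fix it still applies the 10-edit threshold to accounts without a registration date (which makes no difference for the three accounts listed). The sample rows are copied from the outputs above.

```python
import datetime as dt
import pandas as pd

def flag_autoconfirmed(df, min_edits=10, min_age=dt.timedelta(days=4)):
    """Return a boolean Series: True where the creator looks autoconfirmed."""
    old_enough = df["creation_timestamp"] >= (df["user_registration"] + min_age)
    # Assumption from the text: a missing registration date means the account
    # predates registration logging, so it counts as old enough.
    old_enough = old_enough | df["user_registration"].isna()
    return (df["user_edit_count"] >= min_edits) & old_enough

sample = pd.DataFrame({
    "user_name": ["NorbayWarte", "LuckyAries", "Danny3aw", "Mcapdevila"],
    "user_edit_count": [36, 690, 0, 3787],
    "creation_timestamp": pd.to_datetime([
        "2016-10-25 00:00:56", "2016-10-25 00:21:22",
        "2016-10-25 00:31:26", "2016-10-29 20:22:38"]),
    "user_registration": pd.to_datetime([
        "2016-10-24 21:14:31", "2016-07-15 02:02:37",
        "2010-07-19 06:26:26", None]),
})
sample["creator_autoconfirmed"] = flag_autoconfirmed(sample)
print(sample[["user_name", "creator_autoconfirmed"]])
```

Handling the null dates inside the rule avoids a separate patch-up cell, at the cost of hiding which rows were affected.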

In [49]:
creations.groupby("creator_autoconfirmed").size()

creator_autoconfirmed
False     735
True     5477
dtype: int64

So 88.2% of creations were by autoconfirmed users. However, I think there are still a good number of redirect creations here, even though I filtered out most of them using the edit summary. What if we remove everything where the initial size was less than 100 bytes and the user was autoconfirmed? From spot-checking, it looks like that should catch most of the remaining redirects without removing too many creations of real stubs.

In [50]:
to_remove = creations[
    (creations["length_at_creation"] < 100) & 
    (creations["creator_autoconfirmed"] == True)
]

creations = creations.drop(to_remove.index)

In [51]:
creations.groupby("creator_autoconfirmed").size()

creator_autoconfirmed
False     735
True     4876
dtype: int64

That gives 86.9% of creations by autoconfirmed users. That's likely a better estimate, though it's not a large difference in any case. 
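The shares before and after the size filter follow directly from the group counts printed above. A small sketch of the arithmetic, where `autoconfirmed_share` is a helper name invented for this example:

```python
def autoconfirmed_share(n_autoconfirmed, n_other):
    """Fraction of creations made by autoconfirmed users."""
    return n_autoconfirmed / (n_autoconfirmed + n_other)

# Counts taken from the two groupby outputs above.
before_filter = autoconfirmed_share(5477, 735)
after_filter = autoconfirmed_share(4876, 735)

print("before filter: {:.1%}".format(before_filter))
print("after filter:  {:.1%}".format(after_filter))
```

On the real DataFrame, `creations["creator_autoconfirmed"].mean()` gives the same share, since the mean of a boolean column is the fraction of `True` values.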

Ideally, I'd check the text of each revision to know for sure whether it was a redirect at the time of creation. But that would require a lot of API work, and this suggests that it wouldn't change the results much.
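If the text of each first revision were available, the redirect check itself would be simple. The sketch below assumes the wikitext has already been fetched by some other means (not shown) and relies on English Wikipedia redirects starting with `#REDIRECT` followed by a wikilink. The function name and the sample strings are invented for illustration.

```python
import re

# Assumption: a redirect page starts with "#REDIRECT" (any letter case),
# optionally preceded by whitespace, followed by a [[target]] link.
REDIRECT_RE = re.compile(r"^\s*#REDIRECT\s*:?\s*\[\[[^\]]+\]\]", re.IGNORECASE)

def looks_like_redirect(wikitext):
    """Return True if the wikitext looks like a redirect page."""
    return REDIRECT_RE.match(wikitext) is not None

# Made-up examples of first-revision text.
print(looks_like_redirect("#REDIRECT [[Almaty Central Mosque]]"))
print(looks_like_redirect("'''Stone slab''' is a flat piece of stone."))
```

A check like this would replace both the edit-summary filter and the 100-byte heuristic, but only once the revision text is in hand.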

## Example articles

I'll pull out a list, with links, of the creations from 31 October.

In [52]:
examples = creations[
    (creations["creation_timestamp"] >= "2016-10-31") &
    (creations["creation_timestamp"] < "2016-11-01")
]

ac_ex = examples[examples["creator_autoconfirmed"] == True]
non_ac_ex = examples[examples["creator_autoconfirmed"] == False]

def print_table(df):
    print("Printing {} rows".format(df.shape[0]))
    output = "<table><tr><th>Page</th><th>Initial version</th></tr>"
    # Double-quoted attributes so titles containing an apostrophe don't end the href early
    table_row = """
        <tr>
            <td><a href="https://en.wikipedia.org/wiki/{title}">{title}</a></td>
            <td><a href="https://en.wikipedia.org/wiki/Special:Diff/{rev_id}">Special:Diff/{rev_id}</a></td>
        </tr>
        """
    # Look columns up by name rather than by position
    for _, row in df.iterrows():
        output += table_row.format(title = row["page_title"], rev_id = row["creation_rev_id"])
    
    output += "</table>"

    display(HTML(output))
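Page titles are inserted into the HTML as-is, so a title containing an apostrophe or a non-ASCII character (such as `Donald_Trump's_Wall` or `Niccolò_Lorini` in this dataset) can produce a malformed or fragile link. A possible hardening, sketched with a hypothetical helper `article_link` that is not part of the notebook, is to percent-encode the URL and HTML-escape the visible text:

```python
import html
from urllib.parse import quote

def article_link(title):
    """Build an HTML link to an English Wikipedia page, escaping the title."""
    # Percent-encode characters that are unsafe in a URL path.
    url = "https://en.wikipedia.org/wiki/" + quote(title)
    # Escape characters that are special in HTML attributes and text.
    return '<a href="{url}">{text}</a>'.format(
        url=html.escape(url), text=html.escape(title))

print(article_link("Donald_Trump's_Wall"))
print(article_link("Niccolò_Lorini"))
```

The table builder could call a helper like this for the first column and leave the `Special:Diff` links unchanged, since revision IDs are plain integers.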

### Autoconfirmed creations

In [54]:
print_table(ac_ex)

Printing 695 rows


Page,Initial version
Naristae,Special:Diff/747021309
Linda_Aranaydo,Special:Diff/747044839
Almaty_Central_Mosque,Special:Diff/747046267
Central_Mosque_Almaty,Special:Diff/747047315
Princess_Agents,Special:Diff/747047374
Yong_Muhajil,Special:Diff/747052051
Western_Girls_(song),Special:Diff/747052847
Computer_Center_Corporation,Special:Diff/747054584
Soumahoro_Bangaly,Special:Diff/747055775
Ana-Patricia_Torea,Special:Diff/747057221


### Non-autoconfirmed creations


In [55]:
print_table(non_ac_ex)

Printing 104 rows


Page,Initial version
List_of_Hellevator_episodes,Special:Diff/747044382
Modern_Combat_6:_Versus,Special:Diff/747049425
Thai_fabrics_(Thai_woven_fabrics),Special:Diff/747054482
PEAT_(Progressive_Environmental_And_Agricultural_Technologies),Special:Diff/747054979
Rajindar_Nath_Rehbar,Special:Diff/747067637
Saman_Moghadam,Special:Diff/747067643
Prince_Dreambert,Special:Diff/747068052
Remberto_G._Sotto,Special:Diff/747068736
Products_from_Tanintharyi,Special:Diff/747070293
Rajiv_Krishna_Saxena,Special:Diff/747070407
