## English Wikipedia Heading Frequency

This notebook ranks English Wikipedia section headings by the number of articles they appear in, as part of this [research project](https://meta.wikimedia.org/wiki/Research:Investigate_frequency_of_section_titles_in_5_large_Wikipedias).

In [1]:
import numpy as np
import pandas as pd

In [2]:
# read the headings file in a single pass, using narrow dtypes to conserve memory
# (no chunksize is passed here; for chunked reading of large files see
# https://stackoverflow.com/questions/25962114/how-to-read-a-6-gb-csv-file-with-pandas)
en_DF = pd.read_csv('enwiki_20161101_headings.tsv', sep='\t', header=0, dtype={'page_id': np.int32, 'page_title': object, 'page_ns': np.int16, 'heading_level': np.int8, 'heading_text': object})
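
The Stack Overflow thread linked in this cell describes reading a large file in chunks. A minimal sketch of that approach is shown below, assuming a tiny in-memory TSV as a stand-in for the real file; `sample_tsv`, `read_in_chunks`, and `chunked_DF` are hypothetical names that do not appear in the original notebook. Chunking mostly pays off when each chunk is filtered or aggregated before the pieces are combined.

```python
import io

import numpy as np
import pandas as pd

# Hypothetical sample rows standing in for enwiki_20161101_headings.tsv.
sample_tsv = (
    "page_id\tpage_title\tpage_ns\theading_level\theading_text\n"
    "1\tAlpha\t0\t2\tHistory\n"
    "1\tAlpha\t0\t2\tReferences\n"
    "2\tBeta\t0\t2\tReferences\n"
    "3\tGamma\t0\t2\tSee also\n"
)

# Same column types as the notebook's read_csv call.
dtypes = {'page_id': np.int32, 'page_title': object, 'page_ns': np.int16,
          'heading_level': np.int8, 'heading_text': object}


def read_in_chunks(source, chunksize):
    """Read a TSV in pieces of `chunksize` rows and concatenate the pieces."""
    chunks = pd.read_csv(source, sep='\t', header=0, dtype=dtypes,
                         chunksize=chunksize)
    return pd.concat(chunks, ignore_index=True)


chunked_DF = read_in_chunks(io.StringIO(sample_tsv), chunksize=2)
```

With the real file, `source` would be the file path and `chunksize` something like 100000.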

In [3]:
en_DF.head()

Unnamed: 0,page_id,page_title,page_ns,heading_level,heading_text
0,3926881,Slab allocation,0,2,Basis
1,3926881,Slab allocation,0,2,Implementation
2,3926881,Slab allocation,0,2,Slabs
3,3926881,Slab allocation,0,3,Large slabs
4,3926881,Slab allocation,0,3,Small slabs


In [4]:
en_DF.page_ns.unique()

array([0])

In [5]:
# determine number of unique articles
len(en_DF.page_title.unique())

4947256

In [6]:
# remove leading and trailing whitespace from heading_text column
en_DF['heading_text'] = en_DF['heading_text'].str.strip()
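
Stripping matters because headings that differ only in surrounding whitespace would otherwise be counted as separate titles. The toy example below illustrates this with the public `Series.str.strip()` accessor; the sample values are made up for illustration.

```python
import pandas as pd

# Hypothetical headings: three spellings of "History" that differ only in whitespace.
raw_headings = pd.Series(['History', ' History', 'History ', 'Notes'])

# Before stripping, every whitespace variant counts as its own heading.
distinct_before = raw_headings.nunique()

# After stripping, the variants collapse into a single heading.
stripped_headings = raw_headings.str.strip()
distinct_after = stripped_headings.nunique()
```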

In [7]:
# groupby heading_text and count the number of unique page_titles each heading appears in
# sort in descending order
# this returns a pandas series object
article_count = en_DF.groupby('heading_text')['page_title'].apply(lambda x: len(x.unique())).sort_values(ascending=False)
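
To make the counting step concrete, here is a small sketch on made-up data. It uses `nunique()` as a shorthand for `len(x.unique())`; the two differ only in that `nunique()` ignores missing titles by default. The frame `toy_DF` and its rows are hypothetical.

```python
import pandas as pd

# Hypothetical data: article A repeats the heading "History" twice.
toy_DF = pd.DataFrame({
    'page_title':   ['A', 'A', 'A', 'B', 'C'],
    'heading_text': ['References', 'History', 'History', 'References', 'References'],
})

# Count distinct articles per heading, most common heading first.
toy_count = (toy_DF.groupby('heading_text')['page_title']
             .nunique()
             .sort_values(ascending=False))
```

The repeated "History" heading in article A is counted once, because the count is over distinct articles and not over heading occurrences.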

In [8]:
# turn pandas series object into pandas dataframe
en_article_count_DF = pd.DataFrame({'section_title':article_count.index, 'number_of_articles':article_count.values})

In [9]:
en_article_count_DF.head()

Unnamed: 0,number_of_articles,section_title
0,4124326,References
1,2338241,External links
2,1134485,See also
3,533348,History
4,283198,Notes


In [10]:
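# add a column for the percentage of articles each heading appears in
# note: the denominator 5275388 is hardcoded and is larger than the 4947256 unique titles
# counted above; presumably it is the total article count, including articles with no headings
en_article_count_DF['article_percentage'] = (en_article_count_DF['number_of_articles']/5275388)*100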
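
The percentage depends on which total is used as the denominator. The sketch below, on made-up numbers, names the total explicitly instead of embedding a literal; `total_articles` and `toy_counts_DF` are hypothetical, and the assumption is that the total should include articles that have no headings at all.

```python
import pandas as pd

# Hypothetical counts: 3 articles contain "References", 1 contains "History".
toy_counts_DF = pd.DataFrame({
    'section_title': ['References', 'History'],
    'number_of_articles': [3, 1],
})

# Assumed total of 4 articles, one of which has no headings and so
# never appears in the headings table.
total_articles = 4

toy_counts_DF['article_percentage'] = (
    toy_counts_DF['number_of_articles'] / total_articles * 100
)
```

Using the number of unique titles in the headings table as the denominator would instead give percentages relative to articles that have at least one heading.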

In [11]:
# set pandas options to display 100 rows
# round percentage to 2 decimal places and show top 100 results
pd.options.display.max_rows = 100
en_article_count_DF.round({'article_percentage': 2}).head(100)

Unnamed: 0,number_of_articles,section_title,article_percentage
0,4124326,References,78.18
1,2338241,External links,44.32
2,1134485,See also,21.51
3,533348,History,10.11
4,283198,Notes,5.37
5,176414,Career,3.34
6,152419,Biography,2.89
7,148147,Further reading,2.81
8,145080,Track listing,2.75
9,122409,Bibliography,2.32
