## Wikimedia Foundation - Outreachy Microtask

This notebook was created as a microtask for the [Gnome Outreachy internship program](https://wiki.gnome.org/Outreachy).


The Wikipedia dataset can be downloaded [here](https://lists.wikimedia.org/pipermail/wiki-research-l/2016-April/005129.html).

Please have at least 2 GB of free RAM to run this notebook.

#### Import pandas library

In [1]:
import pandas as pd

#### Create DataFrame

In [2]:
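#Note: you may need to change to an absolute path if the dataset is not in the same folder
#on_bad_lines='warn' skips malformed rows and reports them (pandas >= 1.3); older versions used error_bad_lines=False
enwiki_DF = pd.read_csv("enwiki_20160204_headings.tsv", header=0, sep='\t', on_bad_lines='warn')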

Skipping line 23763217: expected 4 fields, saw 6



#### Explore DataFrame

In [3]:
enwiki_DF.head()

Unnamed: 0,page_id,page_title,heading_level,heading_text
0,2336433,Helena Carroll,2,Death
1,3046517,Articles for deletion/Domotic maid,3,[[Domotic maid]]
2,2336433,Helena Carroll,2,References
3,3046518,2005Oct31 Hydnjo ns.png,2,Summary
4,2336433,Helena Carroll,2,External links


In [4]:
enwiki_DF.count()

page_id          27618485
page_title       27618475
heading_level    27618485
heading_text     27618477
dtype: int64

`count()` skips nulls, so the differing values show that `page_title` has 10 null values (27618485 - 27618475) and `heading_text` has 8 (27618485 - 27618477).
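
The null counts can also be read off directly instead of by subtraction. The sketch below uses a small made-up frame (`toy` is a stand-in, not the real dataset) to show `isnull().sum()` next to `count()`.

```python
import pandas as pd

# Toy frame standing in for enwiki_DF (values are made up for illustration)
toy = pd.DataFrame({
    "page_title": ["A", None, "C"],
    "heading_text": ["References", "History", None],
})

# count() reports non-null entries per column
non_null_counts = toy.count()

# isnull().sum() reports the number of nulls per column
null_counts = toy.isnull().sum()

print(non_null_counts)
print(null_counts)
```

On the full DataFrame the same call would be `enwiki_DF.isnull().sum()`.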

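The rows below all belong to the article titled "NaN". The title is the literal string "NaN", which pandas parsed as a null value, so it is not actually missing.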

In [5]:
enwiki_DF[enwiki_DF['page_title'].isnull()]

Unnamed: 0,page_id,page_title,heading_level,heading_text
7675588,49244,,2,Floating point
7675591,49244,,3,Operations generating NaN
7675599,49244,,3,Quiet NaN
7675603,49244,,3,Signaling NaN
7675605,49244,,2,Function definition
7675612,49244,,2,Integer NaN
7675633,49244,,2,Display
7675636,49244,,2,Encoding
7675641,49244,,2,References
7675646,49244,,2,External links
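
One possible way to keep such titles as strings is the `keep_default_na` parameter of `pd.read_csv`. The sketch below is not what this notebook does; it uses a made-up one-row TSV in the same column layout to compare the default parsing with `keep_default_na=False`.

```python
import io
import pandas as pd

# Tiny made-up TSV in the same column layout as the dataset
tsv = (
    "page_id\tpage_title\theading_level\theading_text\n"
    "49244\tNaN\t2\tFloating point\n"
)

# Default behaviour: the string "NaN" is treated as a missing value
default = pd.read_csv(io.StringIO(tsv), sep="\t")

# keep_default_na=False: no string is converted to a missing value
literal = pd.read_csv(io.StringIO(tsv), sep="\t", keep_default_na=False)

print(default["page_title"].isnull().iloc[0])
print(literal["page_title"].iloc[0])
```

The trade-off is that empty text fields would then be read as empty strings rather than nulls, so later null checks would need adjusting.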


In [6]:
enwiki_DF[enwiki_DF['heading_text'].isnull()]

Unnamed: 0,page_id,page_title,heading_level,heading_text
1515667,2679835,Cthulhu Mythos reference codes and bibliography,4,
5227156,11962867,Glossary of baseball (N),3,
5330301,8717998,MG N-type,2,
22533740,45286225,Chembox/testcases6,4,
24065464,964702,Mazda C engine,2,
24547466,1051265,Mazda N platform,2,
26257538,1657544,List of acronyms: N,2,
27603282,49265090,List of television programs: N,3,


In [7]:
enwiki_DF.dtypes

page_id           int64
page_title       object
heading_level     int64
heading_text     object
dtype: object

In [8]:
len(enwiki_DF['page_title'].unique())

6357824

In [9]:
len(enwiki_DF['page_id'].unique())

6364745

In April 2016, Wikipedia had ~5.1 million articles, so of the ~6.36 million unique page_ids, roughly 1.2-1.3 million must refer to non-articles (pages in other namespaces).

#### Generate list of 100 most frequent section titles

In [10]:
#commented out this line of code to reduce size of notebook

#enwiki_DF.groupby('heading_text').size()

That doesn't work as is.

Instead, create a new column holding `heading_text` with leading and trailing whitespace removed, and group on that.

In [11]:
enwiki_DF['heading_text_noWS'] = enwiki_DF['heading_text'].str.strip()
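
To illustrate what stripping changes, here is a minimal sketch on made-up headings. It assumes some raw headings differ only in surrounding whitespace, which is an assumption about the data rather than something shown above.

```python
import pandas as pd

# Made-up headings; three of them differ only in surrounding whitespace
headings = pd.Series(["References", " References", "References ", "History"])

# Counting the raw strings treats each whitespace variant as its own title
raw_counts = headings.value_counts()

# Stripping first merges the variants into one title
stripped_counts = headings.str.strip().value_counts()

print(raw_counts)
print(stripped_counts)
```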

In [12]:
enwiki_DF.head()

Unnamed: 0,page_id,page_title,heading_level,heading_text,heading_text_noWS
0,2336433,Helena Carroll,2,Death,Death
1,3046517,Articles for deletion/Domotic maid,3,[[Domotic maid]],[[Domotic maid]]
2,2336433,Helena Carroll,2,References,References
3,3046518,2005Oct31 Hydnjo ns.png,2,Summary,Summary
4,2336433,Helena Carroll,2,External links,External links


In [13]:
section_titles = enwiki_DF.groupby('heading_text_noWS').size().sort_values(ascending=False)

In [14]:
section_titles[:5]

heading_text_noWS
References        3933510
External links    2249554
See also          1114379
Summary            732295
Licensing          695887
dtype: int64

In [15]:
mostFrequentTitles_DF = pd.DataFrame({'section_title':section_titles.index, 'frequency':section_titles.values})

In [17]:
mostFrequentTitles_DF.head(100)

Unnamed: 0,frequency,section_title
0,3933510,References
1,2249554,External links
2,1114379,See also
3,732295,Summary
4,695887,Licensing
5,509498,History
6,276974,Notes
7,162768,Career
8,144400,Biography
9,141217,Track listing


In [16]:
#only run this cell if the dataframe did not display 100 rows
pd.options.display.max_rows = 100
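
As an alternative to changing the option globally, `pd.option_context` applies a display setting only inside a `with` block. This is a sketch on a made-up frame (`df` is hypothetical), not part of the original notebook.

```python
import pandas as pd

# Made-up frame with 100 rows
df = pd.DataFrame({"n": range(100)})

before = pd.get_option("display.max_rows")

# The setting applies only inside the with block
with pd.option_context("display.max_rows", 100):
    inside = pd.get_option("display.max_rows")
    print(df.head(100))

# Outside the block the previous value is restored
after = pd.get_option("display.max_rows")
```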

### Bonus Task

Generate the list of the 100 section titles that are used in the largest number of articles, each with the percentage of articles that contain such a section.

Assume page titles which contain "/" or "." are not articles and exclude them from analysis.

In [18]:
enwiki_articles_DF = enwiki_DF.copy()

In [19]:
enwiki_articles_DF.head()

Unnamed: 0,page_id,page_title,heading_level,heading_text,heading_text_noWS
0,2336433,Helena Carroll,2,Death,Death
1,3046517,Articles for deletion/Domotic maid,3,[[Domotic maid]],[[Domotic maid]]
2,2336433,Helena Carroll,2,References,References
3,3046518,2005Oct31 Hydnjo ns.png,2,Summary,Summary
4,2336433,Helena Carroll,2,External links,External links


In [20]:
enwiki_articles_DF = enwiki_articles_DF.drop(columns='heading_text')

In [21]:
enwiki_articles_DF.head()

Unnamed: 0,page_id,page_title,heading_level,heading_text_noWS
0,2336433,Helena Carroll,2,Death
1,3046517,Articles for deletion/Domotic maid,3,[[Domotic maid]]
2,2336433,Helena Carroll,2,References
3,3046518,2005Oct31 Hydnjo ns.png,2,Summary
4,2336433,Helena Carroll,2,External links


In [22]:
#dropna() drops rows with a null in any column; maybe change this to dropna(subset=['page_title']) to drop only null titles

enwiki_articles_DF.dropna(inplace=True)

In [23]:
enWikiArticlesOnly_DF = enwiki_articles_DF[~enwiki_articles_DF.page_title.str.contains(r"[./]")]

In [24]:
enWikiArticlesOnly_DF.count()

page_id              21937694
page_title           21937694
heading_level        21937694
heading_text_noWS    21937694
dtype: int64

In [25]:
len(enWikiArticlesOnly_DF['page_id'].unique())

4631002

In [26]:
len(enWikiArticlesOnly_DF['page_title'].unique())

4624341

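Only about 4.6 million unique page_ids are left, compared with the ~5.1 million articles Wikipedia had, so I'm losing roughly 400-500K articles, presumably ones whose titles legitimately contain "." or "/".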
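
The sketch below applies the same "." or "/" filter to a handful of titles to show how it can discard real articles. The first three titles come from the output above; the last two are chosen purely for illustration.

```python
import pandas as pd

titles = pd.Series([
    "Helena Carroll",
    "Articles for deletion/Domotic maid",
    "2005Oct31 Hydnjo ns.png",
    "St. Louis",  # illustrative title containing a period
    "AC/DC",      # illustrative title containing a slash
])

# True for any title containing "." or "/"
is_excluded = titles.str.contains(r"[./]")

# Titles that survive the filter
kept = titles[~is_excluded]

print(kept.tolist())
```

Only "Helena Carroll" survives, so the two illustrative titles are lost along with the non-article pages.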

In [27]:
#commented out this line of code to reduce size of notebook

#enwiki_articles_DF[~enwiki_articles_DF.page_title.str.contains(r"[./]")]

In [28]:
enWikiArticlesOnly_DF.head()

Unnamed: 0,page_id,page_title,heading_level,heading_text_noWS
0,2336433,Helena Carroll,2,Death
2,2336433,Helena Carroll,2,References
4,2336433,Helena Carroll,2,External links
5,2336437,Mario Sammarco,2,Biography
6,2336437,Mario Sammarco,2,Recordings


In [29]:
header_freq = enWikiArticlesOnly_DF.groupby('heading_text_noWS').size().sort_values(ascending=False)

In [30]:
header_freq[:10]

heading_text_noWS
References         3792066
External links     2168509
See also           1060642
History             491924
Notes               263258
Career              152605
Track listing       135144
Further reading     132480
Biography           130716
Bibliography        106534
dtype: int64

In [31]:
header_freq_percent_DF = pd.DataFrame({'header_title':header_freq.index, 'header_frequency':header_freq.values})

In [32]:
header_freq_percent_DF.head()

Unnamed: 0,header_frequency,header_title
0,3792066,References
1,2168509,External links
2,1060642,See also
3,491924,History
4,263258,Notes


In [33]:
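#denominator: number of unique article titles (4624341, from cell 26)
header_freq_percent_DF['header_percent'] = (header_freq_percent_DF['header_frequency'] / enWikiArticlesOnly_DF['page_title'].nunique()) * 100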

In [34]:
header_freq_percent_DF.head(100)

Unnamed: 0,header_frequency,header_title,header_percent
0,3792066,References,82.0023
1,2168509,External links,46.893363
2,1060642,See also,22.936068
3,491924,History,10.63771
4,263258,Notes,5.692876
5,152605,Career,3.300038
6,135144,Track listing,2.922449
7,132480,Further reading,2.864841
8,130716,Biography,2.826695
9,106534,Bibliography,2.303766
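
Note that `groupby(...).size()` counts heading occurrences, so an article that repeats a heading is counted more than once. A possible refinement, sketched here on made-up data (`toy` is hypothetical), is to count distinct `page_id` values per heading and divide by the number of distinct pages.

```python
import pandas as pd

# Made-up data: page 1 uses "References" twice
toy = pd.DataFrame({
    "page_id": [1, 1, 1, 2, 2, 3],
    "heading_text_noWS": ["References", "References", "History",
                          "References", "History", "See also"],
})

# Number of distinct pages in the data
n_articles = toy["page_id"].nunique()

# Occurrence count, as in the cells above (counts the repeat twice)
occurrences = toy.groupby("heading_text_noWS").size()

# Number of distinct pages that contain each heading
articles_per_heading = (
    toy.groupby("heading_text_noWS")["page_id"]
       .nunique()
       .sort_values(ascending=False)
)

# Share of pages containing each heading, in percent
percent = articles_per_heading / n_articles * 100

result = pd.DataFrame({
    "header_articles": articles_per_heading,
    "header_percent": percent,
})
print(result.head(100))
```

Whether this changes the real percentages much depends on how often articles repeat a heading, which is not measured in this notebook.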
