## Preparing Wiki Data 
- Reads the wiki dump from the Simple English Wikipedia: [Simple English, 2016-07-01](https://dumps.wikimedia.org/simplewiki/20160701/simplewiki-20160701-pages-articles-multistream.xml.bz2).
- Parses the XML tree and picks out the actual articles (no talk, user or template pages, and no redirects).
- Selects 10K of these articles at random.
- Builds a Pandas data frame with the titles and text of the selected articles.


In [2]:
NumberOfArticles = 10000
import pandas as pd
import numpy as np
import xml.etree.ElementTree as ET
#from lxml import objectify

In [3]:
# takes a while; expects the decompressed .xml file (the download is a .xml.bz2 archive)
tree = ET.parse('simplewiki-20160701-pages-articles-multistream.xml')
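
Parsing the whole dump with `ET.parse` keeps every page in memory at once. As an alternative, a minimal sketch of a streaming read with `ET.iterparse` is shown here. The helper name `iter_pages` and the tiny in-memory `sample` are illustrative assumptions, not part of the original notebook, and the `revision/text` layout is assumed from the MediaWiki export format.

```python
import io
import xml.etree.ElementTree as ET

# Namespace prefix used by the MediaWiki export-0.10 schema
MW_NS = '{http://www.mediawiki.org/xml/export-0.10/}'

def iter_pages(source):
    """Yield (title, ns, text) for one page at a time (hypothetical helper)."""
    for event, elem in ET.iterparse(source, events=('end',)):
        if elem.tag == MW_NS + 'page':
            title = elem.findtext(MW_NS + 'title')
            ns = elem.findtext(MW_NS + 'ns')
            text = elem.findtext(MW_NS + 'revision/' + MW_NS + 'text')
            yield title, ns, text
            elem.clear()  # release the children of the page just processed

# Tiny stand-in for the real dump, only for illustration
sample = b"""<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/">
  <page><title>Hexagon</title><ns>0</ns>
    <revision><text>A hexagon is a polygon.</text></revision></page>
</mediawiki>"""

pages = list(iter_pages(io.BytesIO(sample)))
print(pages)
```

For the real dump, any binary file object should work as `source`, for example one returned by `bz2.open(path, 'rb')` on the compressed archive; that has not been tried in this notebook.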

In [4]:
root = tree.getroot()
#print(root)
#print(root.attrib)
#print(root.tag)
for name, value in root.items():
    print('%s = %r' % (name, value))

{http://www.w3.org/2001/XMLSchema-instance}schemaLocation = 'http://www.mediawiki.org/xml/export-0.10/ http://www.mediawiki.org/xml/export-0.10.xsd'
version = '0.10'
{http://www.w3.org/XML/1998/namespace}lang = 'en'
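
Every element in the dump carries the export namespace in its tag, which is why the later cells spell out `{http://www.mediawiki.org/xml/export-0.10/}` in full. A small sketch of reading that namespace from the root tag instead of hard-coding it follows. The names `sample_root`, `ns_uri` and `title_tag` are made up for this illustration, and the XML string is a toy stand-in for the dump.

```python
import xml.etree.ElementTree as ET

# Toy document with the same default namespace as the dump
sample_xml = (
    '<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/" version="0.10">'
    '<page><title>Hexagon</title></page>'
    '</mediawiki>'
)
sample_root = ET.fromstring(sample_xml)

# A namespaced tag looks like '{namespace-uri}localname'
ns_uri = sample_root.tag[1:sample_root.tag.index('}')]

# Build the fully qualified tag name and look the element up by name
title_tag = '{%s}title' % ns_uri
first_title = sample_root[0].find(title_tag).text
print(ns_uri, first_title)
```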


In [5]:
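# list(root) replaces Element.getchildren(), which was removed in Python 3.9
children = list(root)
# the count includes the leading <siteinfo> element, which is not a page
print('total number of articles, uncleaned: %i.' % len(children))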

total number of articles, uncleaned: 220584.


In [6]:
# all the titles with their locations
# alltitles = [(i,root[i][0].text) for i in range(1,len(root)) if ":" not in root[i][0].text ]
alltitles = [(i,root[i][0].text) for i in range(1,len(root))]
titles = pd.DataFrame(data = alltitles,columns = ['ind','title' ])

remove = []
for i in range(len(titles)):
    page = root[titles.ind[i]]
    # the second child is the <ns> tag; namespace '0' holds the actual articles
    if page[1].text != '0':
        # drop non-article pages (talk, user, template, category, ...)
        remove.append(i)
    # drop redirect pages
    elif page.find('{http://www.mediawiki.org/xml/export-0.10/}redirect') is not None:
        remove.append(i)
titles = titles.drop(remove)
# renumber the index 0..n-1 so that labels and positions coincide
titles.index = range(len(titles))
print("%d titles dropped \n%d remaining titles" % (len(remove), len(titles)))

#column_names = []
#for i in range(0,len(root.getchildren()[1000].getchildren())):
#    column_names.append(root.getchildren()[1000].getchildren()[i].tag)
#colnames = [x[43:] for x in column_names]
#print('colnames ', colnames)
#frame = pd.DataFrame(columns=colnames)

100958 titles dropped 
119625 remaining titles
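
The filter above relies on `<title>` and `<ns>` being the first and second child of every page. A sketch of the same two checks written against tag names is given here. The function `is_article`, the prefix `mw` and the three-page sample are assumptions made for illustration only.

```python
import xml.etree.ElementTree as ET

# Prefix map passed to find/findtext; 'mw' is an arbitrary local prefix
NS = {'mw': 'http://www.mediawiki.org/xml/export-0.10/'}

def is_article(page):
    """True for namespace-0 pages that are not redirects (hypothetical helper)."""
    ns = page.findtext('mw:ns', namespaces=NS)
    # compare with None explicitly: an element without children is falsy
    is_redirect = page.find('mw:redirect', NS) is not None
    return ns == '0' and not is_redirect

# Toy data: one article, one talk page, one redirect
sample = """<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/">
  <page><title>Hexagon</title><ns>0</ns><id>1</id></page>
  <page><title>Talk:Hexagon</title><ns>1</ns><id>2</id></page>
  <page><title>Hexagons</title><ns>0</ns><id>3</id><redirect title="Hexagon" /></page>
</mediawiki>"""
sample_root = ET.fromstring(sample)

kept = [p.findtext('mw:title', namespaces=NS)
        for p in sample_root.findall('mw:page', NS)
        if is_article(p)]
print(kept)
```

Looking tags up by name costs a little more per page than positional indexing, but it does not break if the child order differs between pages.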


In [46]:
# selecting NumberOfArticles articles randomly
# note: randint samples with replacement, so the same article can be drawn more than once
np.random.seed(123)
randomindices = np.random.randint(low=0, high=len(titles), size=NumberOfArticles)
data = titles.iloc[randomindices]
data.index = range(NumberOfArticles)
data.head(3)

Unnamed: 0,ind,title
0,218066,Aérocentre
1,28184,Hexagon
2,51641,Khowar language


In [47]:
# adding text to the data frame
text = []
for i in data.index:
    for child in root[data.ind[i]]:
        for textnode in child.iter(tag ='{http://www.mediawiki.org/xml/export-0.10/}text'):
            text.append(textnode.text)
            
# assign returns a new frame, which avoids the SettingWithCopyWarning that
# pandas raises when a column is set on a slice (data comes from titles.iloc[...])
data = data.assign(text=text)
data.head(3) # how does it look?

Unnamed: 0,ind,title,text
0,218066,Aérocentre,'''Aérocentre''' is a [[France|French]] group ...
1,28184,Hexagon,A '''hexagon''' is a [[polygon]] with 6 sides ...
2,51641,Khowar language,"{{Infobox Language\n|name=Khowar, Arniya\n|fam..."


In [49]:
data.to_pickle('uncleaned-10k-articles.pkl') # for pickle
# data.to_csv('uncleaned-10k-articles.csv') # for csv
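
A quick way to confirm that a pickle written with `to_pickle` can be read back is a round trip through `pd.read_pickle`. The sketch below uses a toy frame and a temporary path rather than the real `uncleaned-10k-articles.pkl`; the names `toy`, `path` and `restored` are made up for this example.

```python
import os
import tempfile

import pandas as pd

# Toy frame with the same columns as the notebook's data frame
toy = pd.DataFrame({
    'ind': [28184, 51641],
    'title': ['Hexagon', 'Khowar language'],
    'text': ["A '''hexagon''' is a [[polygon]].", '{{Infobox Language}}'],
})

# Write to a temporary directory so nothing in the working folder is touched
path = os.path.join(tempfile.mkdtemp(), 'toy-articles.pkl')
toy.to_pickle(path)

restored = pd.read_pickle(path)
print(restored.equals(toy))
```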