<a href="https://colab.research.google.com/github/raj-vijay/ml/blob/master/12_Clustering_Wikipedia.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**Clustering Wikipedia Dataset**

**Wikipedia**

**Robots learning from robots**

Because of the breadth and availability of its content, Wikipedia has been widely used as a reference dataset for research in machine learning and for tech demos. However, Wikipedia has some serious problems that are not apparent from our familiarity with it as a resource for human beings.

Wikipedia has good coverage of popular topics and very irregular coverage of unpopular topics. Humans are unaware of this, since it is precisely the popular pages that are consumed: the most popular 12% of Wikipedia accounts for 90% of all traffic. The irregularity of coverage is poisonous to many models. 

A topic model trained on all of Wikipedia, for example, will associate “river” with “Romania” and “village” with “Turkey”. Why? Because there are 10k pages on villages in Turkey and too few pages on villages in other places to balance them.

To make things worse, unpopular pages are very often robot-generated. For example, rambot authored 98% of all the articles on US towns, and half of the Swedish Wikipedia is written by lsjbot! Robot-generated pages are built by inserting data into sentence templates. The sheer mass of these pages means that a huge proportion of the language examples a model learns from are just the same template used over and over. Robots learning from robots.
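
To see why templated pages skew a model, consider a minimal sketch. The template and the place records below are invented placeholders, not actual bot output; the point is only that two generated pages share most of their tokens.

```python
# Hypothetical sentence template of the kind a bot might fill in.
# The template text and the records are invented for illustration.
TEMPLATE = (
    "{name} is a town in {county} County. "
    "The population was {population} at the census."
)

records = [
    {"name": "Exampleton", "county": "Alpha", "population": 1200},
    {"name": "Sampleville", "county": "Beta", "population": 3400},
]

# One generated "page" per record.
pages = [TEMPLATE.format(**record) for record in records]


def jaccard(a: str, b: str) -> float:
    """Share of distinct lowercase tokens that two texts have in common."""
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


overlap = jaccard(pages[0], pages[1])
print(pages[0])
print(pages[1])
print(f"token overlap: {overlap:.2f}")
```

Only the filled-in slots differ between the two pages, so thousands of such pages repeat the same few co-occurrence patterns.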

**Source:** https://blog.lateral.io/2015/06/the-unknown-perils-of-mining-wikipedia/

On the day the dataset dump was downloaded, there were 4.3M pages, 1.3M of which had not been viewed once on that day, and 3.8M (i.e. 88%) were looked at fewer than 20 times.
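
The percentages follow directly from the quoted page counts. A quick check, using only the figures above:

```python
# Page counts quoted above, in millions of pages.
total_pages = 4.3
never_viewed = 1.3      # not viewed once on that day
under_20_views = 3.8    # viewed fewer than 20 times on that day

share_never_viewed = never_viewed / total_pages
share_under_20 = under_20_views / total_pages

print(f"never viewed: {share_never_viewed:.0%}")
print(f"under 20 views: {share_under_20:.0%}")
```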

The human experience of Wikipedia is restricted to a very small proportion of the pages.

As a result, the test models performed considerably better when they were trained on popular pages only.

In [0]:
import pandas as pd
import numpy as np

In [56]:
# Download the Wikipedia dataset (popular pages only, per the file name) using wget (Linux)
!wget https://storage.googleapis.com/lateral-datadumps/wikipedia_utf8_filtered_20pageviews.csv.gz

--2020-01-07 20:01:06--  https://storage.googleapis.com/lateral-datadumps/wikipedia_utf8_filtered_20pageviews.csv.gz
Resolving storage.googleapis.com (storage.googleapis.com)... 172.217.214.128, 2607:f8b0:4001:c07::80
Connecting to storage.googleapis.com (storage.googleapis.com)|172.217.214.128|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1215777113 (1.1G) [text/csv]
Saving to: ‘wikipedia_utf8_filtered_20pageviews.csv.gz.3’


2020-01-07 20:01:31 (48.9 MB/s) - ‘wikipedia_utf8_filtered_20pageviews.csv.gz.3’ saved [1215777113/1215777113]



In [0]:
# Read the wikipedia dataset into a DataFrame: df
df = pd.read_csv('wikipedia_utf8_filtered_20pageviews.csv.gz', compression='gzip', header=None, sep=',')

In [58]:
df.head()

   0                   1
0  wikipedia-23885690  Research Design and Standards Organization T...
1  wikipedia-23885928  The Death of Bunny Munro The Death of Bunny ...
2  wikipedia-23886057  Management of prostate cancer Treatment for ...
3  wikipedia-23886425  Cheetah reintroduction in India Reintroducti...
4  wikipedia-23886491  Langtang National Park The Langtang National...
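
The compressed dump is about 1.1 GB, so loading it in full can be slow or exhaust memory on a small machine. Below is a minimal sketch of reading only the first rows and giving the two columns readable names. The column names `id` and `text` and the tiny stand-in file are assumptions made for this example, not part of the dataset.

```python
import gzip
import os
import tempfile

import pandas as pd

# Build a tiny gzip-compressed CSV as a stand-in for the real dump,
# so the example runs without the 1.1 GB download.
sample_rows = [
    "wikipedia-1, First page  Body of the first page.",
    "wikipedia-2, Second page  Body of the second page.",
    "wikipedia-3, Third page  Body of the third page.",
]
sample_path = os.path.join(tempfile.mkdtemp(), "sample.csv.gz")
with gzip.open(sample_path, "wt", encoding="utf-8") as handle:
    handle.write("\n".join(sample_rows) + "\n")

# nrows limits how many rows are parsed; names labels the columns
# (the names 'id' and 'text' are chosen here for illustration).
sample_df = pd.read_csv(
    sample_path,
    compression="gzip",
    header=None,
    names=["id", "text"],
    nrows=2,
)
print(sample_df.shape)
```

With the real file, the same `nrows` and `names` arguments could be passed to the `pd.read_csv` call above.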


In [0]:
articles = df.values

In [60]:
print(articles)

[['wikipedia-23885690'
  ' Research Design and Standards Organization  The Research Design and Standards Organisation (RDSO) is an ISO 9001 research and development organisation under the Ministry of Railways of India, which functions as a technical adviser and consultant to the Railway Board, the Zonal Railways, the Railway Production Units, RITES and IRCON International in respect of design and standardisation of railway equipment and problems related to railway construction, operation and maintenance. History. To enforce standardisation and co-ordination between various railway systems in British India, the Indian Railway Conference Association (IRCA) was set up in 1903. It was followed by the establishment of the Central Standards Office (CSO) in 1930, for preparation of designs, standards and specifications. However, till independence in 1947, most of the designs and manufacture of railway equipments was entrusted to foreign consultants. After independence, a new organisation call

In [61]:
# Column 1 holds the full page text (each entry begins with the page title): titles
titles = articles[:, 1]
titles

array([' Research Design and Standards Organization  The Research Design and Standards Organisation (RDSO) is an ISO 9001 research and development organisation under the Ministry of Railways of India, which functions as a technical adviser and consultant to the Railway Board, the Zonal Railways, the Railway Production Units, RITES and IRCON International in respect of design and standardisation of railway equipment and problems related to railway construction, operation and maintenance. History. To enforce standardisation and co-ordination between various railway systems in British India, the Indian Railway Conference Association (IRCA) was set up in 1903. It was followed by the establishment of the Central Standards Office (CSO) in 1930, for preparation of designs, standards and specifications. However, till independence in 1947, most of the designs and manufacture of railway equipments was entrusted to foreign consultants. After independence, a new organisation called Railway Testing
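
Note that column 1 holds the full page text, not just the title. In the printed sample the text appears to start with the title followed by two spaces, so a title can be approximated by splitting on the first double space. This layout is an assumption read off the printed rows, not a documented property of the dump, and the helper `extract_title` is hypothetical.

```python
def extract_title(text: str) -> str:
    """Return the part of a page's text before the first double space.

    Assumes the layout ' <title>  <body>' seen in the printed sample.
    If no double space is found, the whole stripped text is returned.
    """
    return text.strip().split("  ", 1)[0]


# Start of the first row as printed above.
sample = (
    " Research Design and Standards Organization  The Research Design "
    "and Standards Organisation (RDSO) is an ISO 9001 research and "
    "development organisation under the Ministry of Railways of India"
)
print(extract_title(sample))
```

Applied to the notebook's array, this would look like `[extract_title(t) for t in articles[:, 1]]`.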

In [0]:
# Import TfidfVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer

In [0]:
# Create a TfidfVectorizer: tfidf
tfidf = TfidfVectorizer() 

In [0]:
# Apply fit_transform to the documents (column 1 of articles): csr_mat
csr_mat = tfidf.fit_transform(articles[:, 1])
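
To make the shape of `csr_mat` concrete, here is a small sketch on three made-up sentences. `fit_transform` returns a SciPy sparse matrix with one row per document and one column per distinct term. The toy corpus is an assumption for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Three toy documents standing in for Wikipedia pages.
toy_docs = [
    "the river flows through the village",
    "the village lies near the river",
    "the railway links two cities",
]

toy_tfidf = TfidfVectorizer()
toy_mat = toy_tfidf.fit_transform(toy_docs)

# Rows are documents, columns are vocabulary terms.
print(toy_mat.shape)
# vocabulary_ maps each term to its column index.
print(sorted(toy_tfidf.vocabulary_))
```

Most entries are zero because each document uses only a few of the terms, which is why the result is stored in sparse form.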

In [0]:
# Perform the necessary imports
from sklearn.decomposition import TruncatedSVD
from sklearn.cluster import KMeans
from sklearn.pipeline import make_pipeline

In [0]:
# Create a TruncatedSVD instance: svd
svd = TruncatedSVD(n_components=50)
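
How many components to keep is a judgment call; 50 is the value used here. One way to sanity-check a choice is to look at how much variance the kept components retain. Below is a sketch on random sparse data, which is only a stand-in for the TF-IDF matrix; the matrix size and the fixed seeds are assumptions for repeatability.

```python
from scipy.sparse import random as sparse_random
from sklearn.decomposition import TruncatedSVD

# Random sparse matrix standing in for a TF-IDF matrix:
# 200 "documents" by 500 "terms", 5% of entries non-zero.
toy_mat = sparse_random(200, 500, density=0.05, format="csr", random_state=0)

toy_svd = TruncatedSVD(n_components=50, random_state=0)
reduced = toy_svd.fit_transform(toy_mat)

# Each row is now a dense 50-dimensional vector.
print(reduced.shape)
# Fraction of the total variance retained by the 50 components.
print(toy_svd.explained_variance_ratio_.sum())
```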

In [0]:
# Create a KMeans instance: kmeans
kmeans = KMeans(n_clusters=6)
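
`KMeans` starts from randomly chosen centres, so the cluster numbering and even the grouping can change between runs. Passing `random_state` is one way to make a run repeatable. Below is a sketch on synthetic points; `n_init=10` and the seed values are assumptions for this example, not settings from the notebook.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic 2-D points in six groups, standing in for the reduced vectors.
points, _ = make_blobs(n_samples=300, centers=6, random_state=0)

# Same data and same seed should give the same assignments on each run.
labels_a = KMeans(n_clusters=6, n_init=10, random_state=42).fit_predict(points)
labels_b = KMeans(n_clusters=6, n_init=10, random_state=42).fit_predict(points)

print(np.array_equal(labels_a, labels_b))
```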

In [0]:
# Create a pipeline: pipeline
pipeline = make_pipeline(svd, kmeans)

In [0]:
# Fit the pipeline (SVD, then KMeans) to the TF-IDF matrix
pipeline.fit(csr_mat)

In [0]:
# Calculate the cluster labels: labels
labels = pipeline.predict(csr_mat)
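
Since the same matrix is used for fitting and for labelling, the two steps can usually be merged: a scikit-learn `Pipeline` exposes `fit_predict` when its final step does, as `KMeans` does. Below is a sketch on toy documents; the corpus and the small `n_components` and `n_clusters` values are assumptions chosen to keep the example tiny.

```python
from sklearn.cluster import KMeans
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline

toy_docs = [
    "river water flows to the sea",
    "the river floods the valley",
    "football team wins the match",
    "the team lost the football final",
]

# Vectorizer, SVD and KMeans chained into one estimator.
toy_pipeline = make_pipeline(
    TfidfVectorizer(),
    TruncatedSVD(n_components=2, random_state=0),
    KMeans(n_clusters=2, n_init=10, random_state=0),
)

# Fit all steps and return one cluster label per document.
toy_labels = toy_pipeline.fit_predict(toy_docs)
print(toy_labels)
```

Putting the vectorizer inside the pipeline also means raw text can be passed straight in, at the cost of re-running TF-IDF on every fit.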

In [0]:
# Create a DataFrame aligning labels and titles: df
df = pd.DataFrame({'label': labels, 'article': titles})

In [0]:
# Display df sorted by cluster label
print(df.sort_values('label'))