<a href="https://colab.research.google.com/github/raj-vijay/ml/blob/master/12_Clustering_Wikipedia.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**Clustering Wikipedia Dataset**

**Wikipedia**

**Robots learning from robots**

Because of the breadth and availability of its content, Wikipedia has been widely used as a reference dataset for research in machine learning and for tech demos. However, Wikipedia has some serious problems that are not apparent from our familiarity with it as a resource for human beings.

Wikipedia has good coverage of popular topics and very irregular coverage of unpopular topics. Humans are unaware of this, since it is precisely the popular pages that are consumed: the most popular 12% of Wikipedia accounts for 90% of all traffic. The irregularity of coverage is poisonous to many models. 

A topic model trained on all of Wikipedia, for example, will associate “river” with “Romania” and “village” with “Turkey”. Why? Because there are 10k pages on villages in Turkey, and far fewer pages on villages in other places.

To make things worse, unpopular pages are very often robot-generated. For example, rambot authored 98% of all the articles on US towns, and half of the Swedish Wikipedia is written by lsjbot! Robot-generated pages are built by inserting data into sentence templates. The sheer mass of these pages means that a huge proportion of the language examples a model learns from are just the same template used over and over. Robots learning from robots.

**Source:** https://blog.lateral.io/2015/06/the-unknown-perils-of-mining-wikipedia/
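The templating effect described above can be illustrated with a small sketch. The template, the place names, and the numbers below are all invented for illustration (they are not taken from rambot or lsjbot), and word-set Jaccard overlap is just one simple way to measure how much two pages share.

```python
# Hypothetical sentence template, loosely in the style of a bot-written town stub
TEMPLATE = (
    "{town} is a town in {county} County, {state}, United States. "
    "The population was {pop} at the 2000 census."
)

# Invented records; none of these places or figures are real
rows = [
    {"town": "Alphaville", "county": "Ash", "state": "Stateland", "pop": 100},
    {"town": "Betatown", "county": "Birch", "state": "Otherland", "pop": 250},
]
pages = [TEMPLATE.format(**row) for row in rows]


def jaccard(a, b):
    """Share of distinct whitespace-separated tokens that two texts have in common."""
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


similarity = jaccard(pages[0], pages[1])
print(f"token overlap between two templated pages: {similarity:.2f}")
```

Although the two pages describe different places, most of their tokens come from the template, which is the kind of repetition a model trained on such pages would pick up.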

On the day the dataset dump was downloaded, there were 4.3M pages, 1.3M of which had not been viewed once on that day, and 3.8M (i.e. 88%) were looked at fewer than 20 times.  

The human experience of Wikipedia is restricted to a very small proportion of the pages.  

Accordingly, the performance of test models improved considerably when they were trained only on popular pages.
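As a quick check of the figures quoted above, and a sketch of what filtering by popularity might look like, consider the following. The page-view table and its column names are hypothetical. The threshold of 20 views is an assumption based on the "fewer than 20 times" figure and the dataset's file name.

```python
import pandas as pd

# Figures quoted in the text, in millions of pages
total_pages = 4.3
never_viewed = 1.3
under_20_views = 3.8

share_never_viewed = never_viewed / total_pages
share_under_20 = under_20_views / total_pages
print(f"never viewed that day: {share_never_viewed:.0%}")
print(f"viewed fewer than 20 times: {share_under_20:.0%}")

# Hypothetical page-view table; column names and values are assumptions
pageviews = pd.DataFrame({
    "page": ["A", "B", "C", "D", "E"],
    "views": [5400, 310, 19, 3, 0],
})

# Keep only pages viewed at least 20 times (assumed threshold)
popular = pageviews[pageviews["views"] >= 20]
print(popular)
```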

In [0]:
import pandas as pd
import numpy as np

In [0]:
# Download the Wikipedia dataset using wget (Linux)
!wget https://storage.googleapis.com/lateral-datadumps/wikipedia_utf8_filtered_20pageviews.csv.gz

--2020-01-05 17:43:02--  https://storage.googleapis.com/lateral-datadumps/wikipedia_utf8_filtered_20pageviews.csv.gz
Resolving storage.googleapis.com (storage.googleapis.com)... 173.194.192.128, 2607:f8b0:4001:c0f::80
Connecting to storage.googleapis.com (storage.googleapis.com)|173.194.192.128|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1215777113 (1.1G) [text/csv]
Saving to: ‘wikipedia_utf8_filtered_20pageviews.csv.gz.1’


2020-01-05 17:43:27 (50.4 MB/s) - ‘wikipedia_utf8_filtered_20pageviews.csv.gz.1’ saved [1215777113/1215777113]



In [0]:
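# Read the Wikipedia dataset into a DataFrame: df
# Note: header=0 treats the first row of the file as column names. The file
# appears to have no header row, so the first article ends up as the header;
# passing header=None instead would keep it as data.
df = pd.read_csv('wikipedia_utf8_filtered_20pageviews.csv.gz', compression='gzip', header=0, sep=',')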

In [0]:
df.head()

   wikipedia-23885690  Research Design and Standards Organization The Research Design and Standards Organisation (RDSO) is ...
0  wikipedia-23885928  The Death of Bunny Munro The Death of Bunny ...
1  wikipedia-23886057  Management of prostate cancer Treatment for ...
2  wikipedia-23886425  Cheetah reintroduction in India Reintroducti...
3  wikipedia-23886491  Langtang National Park The Langtang National...
4  wikipedia-23886546  Shivapuri Nagarjun National Park Shivapuri N...
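In the output above, the column names are themselves an article ID and an article text, which suggests the file has no header row. The sketch below reproduces the effect on a tiny in-memory stand-in for the file. The two records and the column names `id` and `text` are assumptions for illustration only.

```python
import io

import pandas as pd

# Tiny stand-in for the real file: two columns, no header row (hypothetical data)
raw = io.StringIO(
    'wikipedia-1,"First article text"\n'
    'wikipedia-2,"Second article text"\n'
)

# header=0 promotes the first record to column names, so one article is lost
with_header = pd.read_csv(raw, header=0)

# header=None keeps every record; the names passed here are assumptions
raw.seek(0)
no_header = pd.read_csv(raw, header=None, names=["id", "text"])

print(len(with_header), list(with_header.columns))
print(len(no_header), list(no_header.columns))
```

If the real file is read with `header=None`, the text is still the second column, so `articles[:,1]` in the following cells would continue to select it.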


In [0]:
articles = df.values

In [0]:
print(articles)

[['wikipedia-23885928'
  ' The Death of Bunny Munro  The Death of Bunny Munro is the second novel written by Nick Cave, best known as the lead singer of Nick Cave and the Bad Seeds. His first novel, "And the Ass Saw the Angel", was published in 1989. The novel deals with Bunny Munro, a middle aged lothario whose constant womanising and alcohol abuse comes to a head after the suicide of his wife. A travelling door to door beauty product salesman, he and his son go on an increasingly out of control road trip around Brighton, over which looms the shadow of a serial killer making his way towards Brighton, as well as Bunny\'s own mortality. The novel is set in Brighton in 2003, around the time the West Pier was destroyed by fire. Many of the locations and street names used in the book relate to real places close to Cave\'s own home. The novel was also released as an audiobook, using a 3D audio effect produced and sound directed by British artists Iain Forsyth and Jane Pollard, with a soundt

In [0]:
# The second column holds the full article text (each entry starts with the title): titles
titles = articles[:,1]
titles

array([' The Death of Bunny Munro  The Death of Bunny Munro is the second novel written by Nick Cave, best known as the lead singer of Nick Cave and the Bad Seeds. His first novel, "And the Ass Saw the Angel", was published in 1989. The novel deals with Bunny Munro, a middle aged lothario whose constant womanising and alcohol abuse comes to a head after the suicide of his wife. A travelling door to door beauty product salesman, he and his son go on an increasingly out of control road trip around Brighton, over which looms the shadow of a serial killer making his way towards Brighton, as well as Bunny\'s own mortality. The novel is set in Brighton in 2003, around the time the West Pier was destroyed by fire. Many of the locations and street names used in the book relate to real places close to Cave\'s own home. The novel was also released as an audiobook, using a 3D audio effect produced and sound directed by British artists Iain Forsyth and Jane Pollard, with a soundtrack by Nick Cave 

In [0]:
# Import TfidfVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer

In [0]:
# Create a TfidfVectorizer: tfidf
tfidf = TfidfVectorizer() 

In [0]:
# Apply fit_transform to the article texts: csr_mat
csr_mat = tfidf.fit_transform(articles[:,1])
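To make the result of `fit_transform` more concrete, here is a minimal sketch on a three-sentence toy corpus (the sentences are invented, not taken from the dataset). It shows that the result is a sparse matrix with one row per document and one column per term, that rows are scaled to unit length under the default settings, and that a term found in fewer documents gets a larger weight than an equally frequent term found in more documents.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus standing in for the article texts (hypothetical data)
docs = [
    "the river flows through the village",
    "the village lies near the river",
    "the band released a new album",
]

toy_tfidf = TfidfVectorizer()
toy_mat = toy_tfidf.fit_transform(docs)  # sparse matrix, one row per document

# vocabulary_ maps each term to its column index
vocab = toy_tfidf.vocabulary_
print(toy_mat.shape)

# With the default norm='l2', every row has unit Euclidean length
row_norms = np.sqrt(np.asarray(toy_mat.multiply(toy_mat).sum(axis=1))).ravel()

# In the first document, "flows" (found in one document) outweighs
# "river" (found in two documents), although each occurs once there
first_row = toy_mat[0].toarray().ravel()
flows_weight = first_row[vocab["flows"]]
river_weight = first_row[vocab["river"]]
print(flows_weight > river_weight)
```

Note that the default tokenizer drops single-character tokens such as "a", so they never appear in the vocabulary.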

In [0]:
# Perform the necessary imports
from sklearn.decomposition import TruncatedSVD
from sklearn.cluster import KMeans
from sklearn.pipeline import make_pipeline

In [0]:
# Create a TruncatedSVD instance: svd
svd = TruncatedSVD(n_components=50)

In [0]:
# Create a KMeans instance: kmeans
kmeans = KMeans(n_clusters=6)

In [0]:
# Create a pipeline: pipeline
pipeline = make_pipeline(svd, kmeans)

In [0]:
# Fit the pipeline to the tf-idf matrix of the articles
pipeline.fit(csr_mat)

Pipeline(memory=None,
         steps=[('truncatedsvd',
                 TruncatedSVD(algorithm='randomized', n_components=50, n_iter=5,
                              random_state=None, tol=0.0)),
                ('kmeans',
                 KMeans(algorithm='auto', copy_x=True, init='k-means++',
                        max_iter=300, n_clusters=6, n_init=10, n_jobs=None,
                        precompute_distances='auto', random_state=None,
                        tol=0.0001, verbose=0))],
         verbose=False)
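The printed parameters show `random_state=None` for both steps, so cluster numbering and membership may differ from run to run. The sketch below shows the same SVD and k-means pipeline on a toy corpus with fixed seeds. The documents, the seed value, and the choice to put the vectorizer inside the pipeline are assumptions made for illustration, not what the notebook does.

```python
from sklearn.cluster import KMeans
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline

# Toy corpus with two obvious themes (hypothetical data)
docs = [
    "river village bridge river",
    "village river valley",
    "album band guitar album",
    "band album singer tour",
]

# Same idea as above (reduce dimensions, then cluster), with the
# vectorizer included and seeds fixed so repeated runs agree
toy_pipeline = make_pipeline(
    TfidfVectorizer(),
    TruncatedSVD(n_components=2, random_state=0),
    KMeans(n_clusters=2, n_init=10, random_state=0),
)

# fit_predict fits every step and returns the k-means labels in one call
toy_labels = toy_pipeline.fit_predict(docs)
print(toy_labels)
```

The label values themselves are arbitrary. What matters is which documents share a label.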

In [0]:
# Calculate the cluster labels: labels
labels = pipeline.predict(csr_mat)

In [0]:
# Create a DataFrame aligning labels and titles: df
df = pd.DataFrame({'label': labels, 'article': titles})

In [0]:
# Display df sorted by cluster label
print(df.sort_values('label'))

        label                                            article
122287      0   Reckless & Relentless  Reckless & Relentless ...
103072      0   Empty Glass  Empty Glass was released in 1980...
397282      0   Follow the Reaper  Follow the Reaper is the t...
23839       0   The Nile Song  "The Nile Song" is the second ...
397283      0   Hate Crew Deathroll  Hate Crew Deathroll is t...
...       ...                                                ...
368838      5   Shin Saimdang  Shin Saimdang (申師任堂, October 2...
12450       5   Clock Tower 3  Gameplay. Players assume the r...
12447       5   Jude Deveraux  Jude Deveraux (born September ...
115405      5   Michelle Ferrari  Michelle Ferrari, (born 22 ...
366369      5   Penelope Ann Miller  Penelope Ann Miller (bor...

[463818 rows x 2 columns]
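With several hundred thousand rows, the sorted printout shows only the first and last few entries. A possible next step, sketched here on a small made-up frame with the same `label` and `article` columns, is to count the articles per cluster and look at a couple of examples from each.

```python
import pandas as pd

# Small stand-in for the labelled DataFrame (hypothetical data)
toy = pd.DataFrame({
    "label": [1, 0, 1, 2, 0, 1],
    "article": [
        "Album A ...", "River B ...", "Album C ...",
        "Actor D ...", "River E ...", "Album F ...",
    ],
})

# Number of articles assigned to each cluster
cluster_sizes = toy["label"].value_counts().sort_index()
print(cluster_sizes)

# Up to two example articles from every cluster
examples = toy.groupby("label")["article"].head(2)
print(examples)
```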
