# Text clustering

**Questions:**
- What is the task of text clustering?
- What are the objects (samples), and what are the features of these objects?

In [1]:
import numpy as np
import pandas as pd
from sklearn.metrics import accuracy_score
from sklearn.model_selection import cross_val_score

from matplotlib import pyplot as plt
from matplotlib.colors import ListedColormap

from IPython.display import Image, SVG

%matplotlib inline

## Dataset

In [2]:
from sklearn.datasets import fetch_20newsgroups

In [3]:
train_all = fetch_20newsgroups(subset='train')
print(train_all.target_names)

['alt.atheism', 'comp.graphics', 'comp.os.ms-windows.misc', 'comp.sys.ibm.pc.hardware', 'comp.sys.mac.hardware', 'comp.windows.x', 'misc.forsale', 'rec.autos', 'rec.motorcycles', 'rec.sport.baseball', 'rec.sport.hockey', 'sci.crypt', 'sci.electronics', 'sci.med', 'sci.space', 'soc.religion.christian', 'talk.politics.guns', 'talk.politics.mideast', 'talk.politics.misc', 'talk.religion.misc']


In [4]:
simple_dataset = fetch_20newsgroups(
    subset='train', 
    categories=['comp.sys.mac.hardware', 'soc.religion.christian', 'rec.sport.hockey'])

In [5]:
print(simple_dataset.data[0])

From: erik@cheshire.oxy.edu (Erik Adams)
Subject: HELP!!  My Macintosh "luggable" has lines on its screen!
Organization: Occidental College, Los Angeles, CA 90041 USA.
Distribution: comp
Lines: 20

Okay, I don't use it very much, but I would like for it to keep working
correctly, at least as long as Apple continues to make System software
that will run on it, if slowly :-)

Here is the problem:  When the screen is tilted too far back, vertical
lines appear on the screen.  They are every 10 pixels or so, and seem
to be affected somewhat by opening windows and pulling down menus.
It looks to a semi-technical person like there is a loose connection
between the screen and the rest of the computer.

I am open to suggestions that do not involve buying a new computer,
or taking this one to the shop.  I would also like to not have
to buy one of Larry Pina's books.  I like Larry, but I'm not sure
I feel strongly enough about the computer to buy a service manual
for it.

On a related note:  what

In [6]:
print(simple_dataset.data[-1])

From: dlecoint@garnet.acns.fsu.edu (Darius_Lecointe)
Subject: Re: Sabbath Admissions 5of5
Organization: Florida State University
Lines: 21

I find it interesting that cls never answered any of the questions posed. 
Then he goes on the make statements which make me shudder.  He has
established a two-tiered God.  One set of rules for the Jews (his people)
and another set for the saved Gentiles (his people).  Why would God
discriminate?  Does the Jew who accepts Jesus now have to live under the
Gentile rules.

God has one set of rules for all his people.  Paul was never against the
law.  In fact he says repeatedly that faith establishes rather that annuls
the law.  Paul's point is germane to both Jews and Greeks.  The Law can
never be used as an instrument of salvation.  And please do not combine
the ceremonial and moral laws in one.

In Matt 5:14-19 Christ plainly says what He came to do and you say He was
only saying that for the Jews's benefit.  Your Christ must be a
politician, speaki

In [7]:
print(simple_dataset.data[-2])

From: scialdone@nssdca.gsfc.nasa.gov (John Scialdone)
Subject: CUT Vukota and Pilon!!!
News-Software: VAX/VMS VNEWS 1.41    
Organization: NASA - Goddard Space Flight Center
Lines: 32

I have been to all 3 Isles/Caps tilts at the Crap Centre this year, all Isles
wins and there is no justification for Vukota and Pilon to play for the Isles.
Vukota is absolutely the worst puck handler in the world!! He couldn't hit a
bull in the ass with a banjo!! Al must remember a few years back when Mick 
scored 3 goals in one period against the Caps in a 5-3 Isles win. I was there
and was astonished as was the rest of the crowd. Wake-up Al!!! Years later he's
gotten worse. He's a cheap shot artist and always ends up getting
stupid/senseless penalties. I think he would make a good police officier!!!

As for Pilon, he can't carry the puck out to center ice by himself. He either
makes a bad pass resulting in a turnover, or he attempts to bring the puck 
towards the neutral zone and skates right into an 

In [8]:
print(len(simple_dataset.data))

1777


### Features

In [9]:
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

vectorizer = TfidfVectorizer(max_df=500, min_df=10)
matrix = vectorizer.fit_transform(simple_dataset.data)
matrix.shape

(1777, 3767)
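
To make the objects and features concrete, here is a minimal sketch on a hypothetical four-document corpus (the texts and the names `toy_docs`, `toy_vectorizer`, `toy_terms` are illustrative and not part of the dataset above). Each row of the matrix is one document and each column is one term. When `min_df` and `max_df` are integers they are absolute document counts, so `max_df=500, min_df=10` above keeps the terms that occur in at least 10 and at most 500 documents.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus, used only to illustrate the filtering.
toy_docs = [
    "the mac screen has lines",
    "the hockey team won the game",
    "the mac keyboard is broken",
    "the hockey game was long",
]

# Integer min_df / max_df are absolute document counts:
# keep terms found in at least 2 and at most 3 of the 4 documents.
toy_vectorizer = TfidfVectorizer(min_df=2, max_df=3)
toy_matrix = toy_vectorizer.fit_transform(toy_docs)

# Terms ordered by their column index in the matrix.
toy_terms = sorted(toy_vectorizer.vocabulary_, key=toy_vectorizer.vocabulary_.get)
print(toy_matrix.shape)
print(toy_terms)
```

Here "the" is dropped because it occurs in all four documents, and words that occur in only one document are dropped by `min_df`.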

## Agglomerative clustering (neighbour joining)

In [10]:
from sklearn.cluster import AgglomerativeClustering

# `metric` replaced `affinity` in scikit-learn 1.2; on older versions pass affinity='cosine'.
model = AgglomerativeClustering(n_clusters=3, metric='cosine', linkage='complete')
preds = model.fit_predict(matrix.toarray())

In [11]:
print(list(preds)[:10])

[0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
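
The first ten predictions all fall into cluster 0, which may hint at one dominant cluster. A quick way to check is to count cluster sizes. The sketch below uses made-up labels (`toy_preds` is a stand-in for `preds`, not the real output).

```python
import numpy as np

# Made-up cluster labels standing in for `preds`.
toy_preds = np.array([0, 0, 0, 0, 0, 0, 0, 2, 0, 1])

# Entry i is the number of objects assigned to cluster i.
cluster_sizes = np.bincount(toy_preds, minlength=3)
print(cluster_sizes)
```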


In [12]:
print(matrix[0])

  (0, 1246)	0.3906389385132699
  (0, 239)	0.09501028877632492
  (0, 1659)	0.07226670936760915
  (0, 2082)	0.09656717763921
  (0, 1846)	0.06741426312477893
  (0, 2991)	0.3895610830528872
  (0, 810)	0.0809427637021726
  (0, 2056)	0.09616720743610424
  (0, 319)	0.10054935380579341
  (0, 625)	0.05207173909126517
  (0, 3553)	0.07346689959490565
  (0, 1121)	0.06442816229222373
  (0, 838)	0.10154477899829951
  (0, 69)	0.06858267052730141
  (0, 2413)	0.10258878033617379
  (0, 3555)	0.06229087987706688
  (0, 3600)	0.05737015279917228
  (0, 2273)	0.057960021837659234
  (0, 1905)	0.08173536478989506
  (0, 3722)	0.09501028877632492
  (0, 926)	0.11742059318548873
  (0, 1991)	0.0711302768586001
  (0, 2048)	0.07488448956033757
  (0, 354)	0.06947437383545055
  (0, 900)	0.12357635170024021
  :	:
  (0, 2868)	0.08596208970791505
  (0, 855)	0.20162554088271673
  (0, 301)	0.055632566960467825
  (0, 2422)	0.09427236106959615
  (0, 3285)	0.11264581593118286
  (0, 1831)	0.1250526461821779
  (0, 620)	0.1118065

In [13]:
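# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() returns an array.
vectorizer.get_feature_names_out()[::100].tolist()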

['00',
 '37',
 'ability',
 'always',
 'assistant',
 'believing',
 'btw',
 'cec1',
 'cmu',
 'continues',
 'data',
 'direct',
 'effect',
 'expect',
 'fired',
 'gballent',
 'hacker',
 'hopefully',
 'inspired',
 'justify',
 'leland',
 'man',
 'min',
 'nature',
 'offense',
 'path',
 'portable',
 'ps',
 'ref',
 'robbie',
 'seattle',
 'simply',
 'ssd',
 'supports',
 'though',
 'ubc',
 'very',
 'winnipeg']

In [14]:
simple_dataset.target

array([0, 0, 1, ..., 0, 1, 2])

In [15]:
preds

array([0, 0, 0, ..., 0, 2, 1])

In [16]:
# Assessment
mapping = {2 : 1, 1: 2, 0: 0}
mapped_preds = [mapping[pred] for pred in preds]
# print (float(sum(mapped_preds != simple_dataset.target)) / len(simple_dataset.target))
print(accuracy_score(mapped_preds, simple_dataset.target))

0.3590320765334834


In [17]:
import itertools
def validate_with_mappings(preds, target):
    # Cluster ids are arbitrary, so try every cluster-to-class mapping
    # and print the accuracy of each; the largest value is the one to read.
    permutations = itertools.permutations([0, 1, 2])
    for a, b, c in permutations:
        mapping = {2: a, 1: b, 0: c}
        mapped_preds = [mapping[pred] for pred in preds]
        print(accuracy_score(target, mapped_preds))
    
validate_with_mappings(preds, simple_dataset.target)

0.325267304445695
0.3275182892515476
0.34721440630275746
0.3590320765334834
0.3157006190208216
0.325267304445695
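
Trying every permutation works for 3 clusters, but the number of permutations grows factorially with the number of clusters. As a sketch of an alternative, the best one-to-one mapping of cluster ids to class ids can be found with the Hungarian algorithm (`scipy.optimize.linear_sum_assignment`) applied to the contingency table. `best_mapping_accuracy` is a hypothetical helper name and the label arrays are made up for illustration. `adjusted_rand_score` is a permutation-invariant score that needs no mapping at all.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.metrics import adjusted_rand_score, confusion_matrix


def best_mapping_accuracy(target, preds):
    """Accuracy under the best one-to-one mapping of cluster ids to class ids."""
    # Rows are true classes, columns are cluster ids.
    table = confusion_matrix(target, preds)
    # Pick one cell per row and per column so that the matched count is maximal.
    rows, cols = linear_sum_assignment(table, maximize=True)
    return table[rows, cols].sum() / table.sum()


# Made-up labels: the clustering is correct up to a renaming of cluster ids,
# except for the last object.
toy_target = np.array([0, 0, 1, 1, 2, 2])
toy_preds = np.array([2, 2, 0, 0, 1, 0])

toy_accuracy = best_mapping_accuracy(toy_target, toy_preds)
toy_ari = adjusted_rand_score(toy_target, toy_preds)
print(toy_accuracy, toy_ari)
```

This assumes that cluster ids and class ids come from the same label set (here 0, 1, 2), as they do in `validate_with_mappings`.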


## KMeans

In [18]:
from sklearn.cluster import KMeans

model = KMeans(n_clusters=3, random_state=1)
preds = model.fit_predict(matrix.toarray())
print(preds)
print(simple_dataset.target)
validate_with_mappings(preds, simple_dataset.target)

[0 0 2 ... 0 2 1]
[0 0 1 ... 0 1 2]
0.029262802476083285
0.32639279684862127
0.34946539110861
0.9527293190770962
0.018007878446820485
0.3241418120427687


In [19]:
# Compare with logistic regression (a supervised baseline)
from sklearn.linear_model import LogisticRegression
clf = LogisticRegression()
print(cross_val_score(clf, matrix, simple_dataset.target).mean())

0.9853603185880773


**Question:** KMeans reaches a very high accuracy on these texts (about 0.95), close to that of the supervised algorithm (about 0.99). Why?

## A harder dataset

In [20]:
noteasy_dataset = fetch_20newsgroups(
    subset='train', 
    categories=['comp.sys.mac.hardware', 'comp.os.ms-windows.misc', 'comp.graphics'])
matrix = vectorizer.fit_transform(noteasy_dataset.data)

In [21]:
model = KMeans(n_clusters=3, random_state=1)
preds = model.fit_predict(matrix.toarray())
print(preds)
print(noteasy_dataset.target)
validate_with_mappings(preds, noteasy_dataset.target)

[0 1 2 ... 0 2 0]
[2 1 1 ... 2 0 2]
0.753565316600114
0.2966343411294923
0.39361095265259555
0.1289218482601255
0.11751283513976041
0.3097547062179121


In [22]:
clf = LogisticRegression()
print(cross_val_score(clf, matrix, noteasy_dataset.target).mean())

0.917279226713189


## SVD + KMeans

In [23]:
from sklearn.decomposition import TruncatedSVD

model = KMeans(n_clusters=3, random_state=42)
svd = TruncatedSVD(n_components=1000, random_state=123)
features = svd.fit_transform(matrix)
preds = model.fit_predict(features)
validate_with_mappings(preds, noteasy_dataset.target)

0.4067313177410154
0.0889903023388477
0.793496862521392
0.29720479178551057
0.29606389047347403
0.11751283513976041


In [24]:
model = KMeans(n_clusters=3, random_state=42)
svd = TruncatedSVD(n_components=200, random_state=321)
features = svd.fit_transform(matrix)
preds = model.fit_predict(features)
validate_with_mappings(preds, noteasy_dataset.target)

0.286936679977182
0.15459212778094694
0.11066742726754136
0.2994865944095836
0.41357672561323444
0.7347404449515117
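
In the two cells above KMeans runs directly on the SVD output. A common variant (an assumption here, not something this notebook does) is to L2-normalize the reduced vectors, so that Euclidean KMeans behaves more like clustering by cosine similarity. `explained_variance_ratio_` also gives a rough idea of how much variance a given `n_components` keeps. A minimal sketch on a hypothetical toy corpus:

```python
from sklearn.cluster import KMeans
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import Normalizer

# Hypothetical toy corpus with two obvious topics.
toy_docs = [
    "mac screen broken", "mac keyboard broken", "mac screen keyboard",
    "hockey game goal", "hockey team goal", "hockey game team",
]
toy_matrix = TfidfVectorizer().fit_transform(toy_docs)

# Reduce the 8 TF-IDF features to 2 components, then rescale rows to unit length.
lsa = make_pipeline(TruncatedSVD(n_components=2, random_state=0), Normalizer(copy=False))
toy_features = lsa.fit_transform(toy_matrix)

# Share of the total variance kept by the 2 components.
kept = lsa[0].explained_variance_ratio_.sum()

toy_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(toy_features)
print(toy_features.shape, kept, toy_labels)
```

Whether normalization helps on the newsgroup data is not tested here; it is only a variant worth trying next to `n_components`.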



**Question:** even on the harder dataset we still reached a fairly high accuracy (about 0.73 to 0.79). What is the reason?

### Conclusion

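1. We obtained an interpretable result on both datasets.
2. On simple datasets KMeans works well as a first approximation (best-mapping accuracy of about 0.95 above). Agglomerative clustering, as configured above (complete linkage, cosine distance), reached only about 0.36.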
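
One way to check the interpretability mentioned in the conclusion is to look at the highest-weighted terms of each KMeans centroid. The sketch below does this on a hypothetical toy corpus; the names `toy_docs`, `toy_model` and `top_terms` are illustrative. The same loop could be applied to `model.cluster_centers_` and the fitted `vectorizer`, provided KMeans was fitted on the TF-IDF matrix itself and not on SVD features.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus with two obvious topics.
toy_docs = [
    "mac screen broken", "mac keyboard broken", "mac screen keyboard",
    "hockey game goal", "hockey team goal", "hockey game team",
]
toy_vectorizer = TfidfVectorizer()
toy_matrix = toy_vectorizer.fit_transform(toy_docs)

toy_model = KMeans(n_clusters=2, n_init=10, random_state=0)
toy_labels = toy_model.fit_predict(toy_matrix)

# Terms ordered by their column index in the TF-IDF matrix.
terms = sorted(toy_vectorizer.vocabulary_, key=toy_vectorizer.vocabulary_.get)

top_terms = {}
for cluster_id, centroid in enumerate(toy_model.cluster_centers_):
    # Indices of the 3 largest centroid weights, largest first.
    top_idx = np.argsort(centroid)[::-1][:3]
    top_terms[cluster_id] = [terms[i] for i in top_idx]
    print(cluster_id, top_terms[cluster_id])
```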