# KeplerMapper & NLP examples

## Newsgroups20

In [1]:
# from kmapper import jupyter
import kmapper as km
import numpy as np

from sklearn.datasets import fetch_20newsgroups
from sklearn.cluster import AgglomerativeClustering
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.manifold import Isomap
from sklearn.preprocessing import MinMaxScaler

### Data
We will use the training split of the Newsgroups20 dataset, a canonical NLP dataset. The split contains 11314 labeled postings from 20 different newsgroups.

In [2]:
newsgroups = fetch_20newsgroups(subset='train')
X, y, target_names = np.array(newsgroups.data), np.array(newsgroups.target), np.array(newsgroups.target_names)
print("SAMPLE",X[0])
print("SHAPE",X.shape)
print("TARGET",target_names[y[0]])

('SAMPLE', u"From: lerxst@wam.umd.edu (where's my thing)\nSubject: WHAT car is this!?\nNntp-Posting-Host: rac3.wam.umd.edu\nOrganization: University of Maryland, College Park\nLines: 15\n\n I was wondering if anyone out there could enlighten me on this car I saw\nthe other day. It was a 2-door sports car, looked to be from the late 60s/\nearly 70s. It was called a Bricklin. The doors were really small. In addition,\nthe front bumper was separate from the rest of the body. This is \nall I know. If anyone can tellme a model name, engine specs, years\nof production, where this car is made, history, or whatever info you\nhave on this funky looking car, please e-mail.\n\nThanks,\n- IL\n   ---- brought to you by your neighborhood Lerxst ----\n\n\n\n\n")
('SHAPE', (11314,))
('TARGET', 'rec.autos')


### Projection
To project the unstructured text dataset down to 2 fixed dimensions, we will set up a function pipeline. Every consecutive function will take as input the output from the previous function.
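As a rough illustration of this chaining idea, here is a minimal, library-free sketch. The classes `Lowercase`, `Lengths`, and `MinMax` and the helper `chain` are hypothetical stand-ins, not KeplerMapper internals. The sketch only assumes that each step exposes `fit_transform` and that each projection is paired with an optional scaler.

```python
# Hypothetical stand-ins for scikit-learn style transformers: each exposes
# fit_transform(data) and returns the transformed data.
class Lowercase:
    def fit_transform(self, docs):
        return [d.lower() for d in docs]


class Lengths:
    def fit_transform(self, docs):
        return [[float(len(d))] for d in docs]


class MinMax:
    """Rescale each column to [0, 1] (toy version of a min-max scaler)."""

    def fit_transform(self, rows):
        cols = list(zip(*rows))
        lo = [min(c) for c in cols]
        hi = [max(c) for c in cols]
        return [
            [(v - l) / (h - l) if h > l else 0.0 for v, l, h in zip(row, lo, hi)]
            for row in rows
        ]


def chain(data, projections, scalers):
    """Apply each projection in order, then its paired scaler (if any)."""
    for projection, scaler in zip(projections, scalers):
        data = projection.fit_transform(data)
        if scaler is not None:
            data = scaler.fit_transform(data)
    return data


docs = ["Hello", "A much longer posting", "Mid-size"]
projected = chain(docs, [Lowercase(), Lengths()], [None, MinMax()])
print(projected)
```

Pairing each projection with its own scaler slot (here `None` for the first step) mirrors the way the notebook passes one list of projections and one list of scalers of equal length.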

We will try out "Latent Semantic Char-Gram Analysis followed by Isometric Mapping".

- TFIDF-vectorize character n-grams of length 1 to 6, discarding chargrams that occur in more than 83% of the documents (`max_df=0.83`) or in fewer than 5% of them (`min_df=0.05`). Dimensionality = 13967.
- Run TruncatedSVD with 100 components on this representation. TFIDF followed by Singular Value Decomposition is called Latent Semantic Analysis. Dimensionality = 100.
- Run Isomap embedding on the output from the previous step to project down to 2 dimensions. Dimensionality = 2.
- MinMaxScale the output from the previous step. Dimensionality = 2.
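The document-frequency thresholds in the first step can be sketched without any library. The helpers `char_ngrams` and `filter_by_document_frequency` below are hypothetical names, and the toy corpus and thresholds are chosen only so the effect is visible. The sketch assumes the thresholds are fractions of the number of documents and skips the whitespace normalization a real vectorizer may apply.

```python
from collections import Counter


def char_ngrams(text, n_min=1, n_max=6):
    """Return the set of character n-grams of length n_min..n_max in text."""
    text = text.lower()
    return {
        text[i:i + n]
        for n in range(n_min, n_max + 1)
        for i in range(len(text) - n + 1)
    }


def filter_by_document_frequency(docs, min_df=0.05, max_df=0.83):
    """Keep n-grams whose document frequency lies within [min_df, max_df]."""
    n_docs = len(docs)
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(char_ngrams(doc))  # each doc counts an n-gram once
    return sorted(
        gram for gram, count in doc_freq.items()
        if min_df * n_docs <= count <= max_df * n_docs
    )


toy_docs = ["the car", "the engine", "the bumper", "a door"]
vocabulary = filter_by_document_frequency(toy_docs, min_df=0.5, max_df=0.8)
print(vocabulary)
```

With these toy settings, a chargram present in every document (such as the single space) is dropped as too common, and one present in a single document (such as "door") is dropped as too rare.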

In [3]:
mapper = km.KeplerMapper(verbose=2)

projected_X = mapper.fit_transform(X,
    projection=[TfidfVectorizer(analyzer="char",
                                ngram_range=(1,6),
                                max_df=0.83,
                                min_df=0.05),
                TruncatedSVD(n_components=100,
                             random_state=1729),
                Isomap(n_components=2,
                       n_jobs=-1)],
    scaler=[None, None, MinMaxScaler()])

print("SHAPE",projected_X.shape)

..Composing projection pipeline length 3:
Projections: TfidfVectorizer(analyzer='char', binary=False, decode_error=u'strict',
        dtype=<type 'numpy.int64'>, encoding=u'utf-8', input=u'content',
        lowercase=True, max_df=0.83, max_features=None, min_df=0.05,
        ngram_range=(1, 6), norm=u'l2', preprocessor=None, smooth_idf=True,
        stop_words=None, strip_accents=None, sublinear_tf=False,
        token_pattern=u'(?u)\\b\\w\\w+\\b', tokenizer=None, use_idf=True,
        vocabulary=None)
TruncatedSVD(algorithm='randomized', n_components=100, n_iter=5,
       random_state=1729, tol=0.0)
Isomap(eigen_solver='auto', max_iter=None, n_components=2, n_jobs=-1,
    n_neighbors=5, neighbors_algorithm='auto', path_method='auto', tol=0)


Distance matrices: False
False
False


Scalers: None
None
MinMaxScaler(copy=True, feature_range=(0, 1))


..Projecting on data shaped (11314,)

..Projecting data using: 
	TfidfVectorizer(analyzer='char', binary=False, decode_error=u'strict',
    

### Mapping
We cover the projection with 10 intervals per dimension, with 33% overlap between intervals (10\*10=100 cubes total).
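To make the cover concrete, here is a minimal sketch of one way to build such intervals. The function name `overlapping_intervals` is hypothetical, and the overlap convention (consecutive intervals share a fixed fraction of their length) is an assumption. KeplerMapper's own cover computation may differ in detail.

```python
def overlapping_intervals(lo, hi, n_intervals=10, overlap=0.33):
    """Split [lo, hi] into equal-length intervals that overlap their neighbours.

    Convention assumed here: consecutive intervals share a fraction `overlap`
    of their length, so n*L - (n-1)*overlap*L = hi - lo.
    """
    length = (hi - lo) / (n_intervals - (n_intervals - 1) * overlap)
    step = length * (1.0 - overlap)
    return [(lo + i * step, lo + i * step + length) for i in range(n_intervals)]


intervals = overlapping_intervals(0.0, 1.0)
for start, end in intervals[:3]:
    print(round(start, 4), round(end, 4))

# A 2-D cover is the Cartesian product of the per-dimension intervals.
cubes = [(ix, iy) for ix in intervals for iy in intervals]
print(len(cubes))
```

Because the intervals overlap, a point near an interval boundary falls into more than one cube, which is what later lets clusters from neighboring cubes share members.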

We cluster on the projection itself (note that we could also create an `inverse_X` to cluster on by vectorizing the original text data).

For clustering we use Agglomerative Clustering with the "cosine" distance and 3 clusters per cube. Note that the cell below sets `linkage="complete"`, i.e. complete rather than single linkage. Agglomerative clustering is a good clustering algorithm for TDA: it produces pleasing, informative networks, and single-linkage clustering in particular has strong theoretical guarantees (see [functor](https://en.wikipedia.org/wiki/Functor) and [functoriality](https://jeremykun.com/2013/07/14/functoriality/)).
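Single and complete linkage differ only in how they measure the distance between two clusters. The following library-free sketch shows that difference under cosine distance. The helper names `cosine_distance` and `linkage_distance` are hypothetical and not part of scikit-learn or KeplerMapper, and the points are made up.

```python
import math


def cosine_distance(u, v):
    """1 - cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def linkage_distance(cluster_a, cluster_b, linkage="single"):
    """Distance between two clusters under single or complete linkage."""
    pairwise = [cosine_distance(u, v) for u in cluster_a for v in cluster_b]
    if linkage == "single":
        return min(pairwise)   # closest pair of points
    if linkage == "complete":
        return max(pairwise)   # farthest pair of points
    raise ValueError("linkage must be 'single' or 'complete'")


cluster_a = [(1.0, 0.0), (1.0, 0.2)]
cluster_b = [(1.0, 1.0), (0.0, 1.0)]
print(linkage_distance(cluster_a, cluster_b, "single"))
print(linkage_distance(cluster_a, cluster_b, "complete"))
```

Agglomerative clustering repeatedly merges the two clusters with the smallest such distance, so the choice of linkage can change which clusters get merged first.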

In [4]:
# Cluster the points inside each cube of the cover.
# Note: recent scikit-learn releases rename `affinity` to `metric`.
graph = mapper.map(projected_X,
                   inverse_X=None,
                   clusterer=AgglomerativeClustering(n_clusters=3,
                                                     linkage="complete",
                                                     affinity="cosine"),
                   overlap_perc=0.33)

Mapping on data shaped (11314, 2) using lens shaped (11314, 2)

Minimal points in hypercube before clustering: 3
Creating 100 hypercubes.
There are 0 points in cube_0 / 100
Cube_0 is empty.

There are 0 points in cube_1 / 100
Cube_1 is empty.

There are 18 points in cube_2 / 100
Found 3 clusters in cube_2

There are 42 points in cube_3 / 100
Found 3 clusters in cube_3

There are 27 points in cube_4 / 100
Found 3 clusters in cube_4

There are 5 points in cube_5 / 100
Found 3 clusters in cube_5

There are 3 points in cube_6 / 100
Found 3 clusters in cube_6

There are 0 points in cube_7 / 100
Cube_7 is empty.

There are 0 points in cube_8 / 100
Cube_8 is empty.

There are 0 points in cube_9 / 100
Cube_9 is empty.

There are 7 points in cube_10 / 100
Found 3 clusters in cube_10

There are 351 points in cube_11 / 100
Found 3 clusters in cube_11

There are 818 points in cube_12 / 100
Found 3 clusters in cube_12

There are 28 points in cube_13 / 100
Found 3 clusters in cube_13

There are 41 p

### Interpretable inverse X
Here we show the flexibility of KeplerMapper by creating an `interpretable_inverse_X` that is easier to interpret by humans.

For text, this can be TFIDF (1-3)-wordgrams, like we do here. For structured data, this can be regulatory/protected variables of interest, or the output of another model used to select, say, the top 10% of features.
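To show what (1-3)-wordgrams look like, here is a small sketch using only the standard library. The function `word_ngrams` and the tiny stop-word set are hypothetical. The token pattern (words of two or more characters) and the removal of stop words before n-grams are formed are assumptions modeled on common vectorizer defaults.

```python
import re


def word_ngrams(text, n_min=1, n_max=3, stop_words=frozenset()):
    """Return space-joined word n-grams of length n_min..n_max."""
    tokens = [
        tok for tok in re.findall(r"\b\w\w+\b", text.lower())
        if tok not in stop_words
    ]
    return [
        " ".join(tokens[i:i + n])
        for n in range(n_min, n_max + 1)
        for i in range(len(tokens) - n + 1)
    ]


toy_stop_words = frozenset({"the", "was", "from", "of"})  # illustrative only
grams = word_ngrams("The front bumper was separate from the rest of the body",
                    stop_words=toy_stop_words)
print(grams)
```

Features such as "front bumper" are much easier to read in a tooltip than character fragments, which is the motivation for using wordgrams on the interpretation side only.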

In [5]:
vec = TfidfVectorizer(analyzer="word",
                      strip_accents="unicode",
                      stop_words="english",
                      ngram_range=(1,3),
                      max_df=0.97,
                      min_df=0.02)

interpretable_inverse_X = vec.fit_transform(X).toarray()
# Note: recent scikit-learn releases replace this with get_feature_names_out().
interpretable_inverse_X_names = vec.get_feature_names()

print("SHAPE", interpretable_inverse_X.shape)
print("FEATURE NAMES SAMPLE", interpretable_inverse_X_names[:400])

('SHAPE', (11314, 947))
('FEATURE NAMES SAMPLE', [u'00', u'000', u'10', u'100', u'11', u'12', u'13', u'14', u'15', u'16', u'17', u'18', u'19', u'1992', u'1993', u'1993apr15', u'20', u'200', u'21', u'22', u'23', u'24', u'25', u'26', u'27', u'28', u'29', u'30', u'31', u'32', u'33', u'34', u'35', u'36', u'37', u'38', u'39', u'40', u'408', u'41', u'42', u'43', u'44', u'45', u'49', u'50', u'500', u'60', u'70', u'80', u'90', u'92', u'93', u'able', u'ac', u'ac uk', u'accept', u'access', u'according', u'acs', u'act', u'action', u'actually', u'add', u'address', u'advance', u'advice', u'ago', u'agree', u'air', u'al', u'allow', u'allowed', u'america', u'american', u'andrew', u'answer', u'anti', u'anybody', u'apparently', u'appears', u'apple', u'application', u'apply', u'appreciate', u'appreciated', u'apr', u'apr 1993', u'apr 93', u'april', u'area', u'aren', u'argument', u'article', u'article 1993apr15', u'ask', u'asked', u'asking', u'assume', u'att', u'att com', u'au', u'available', u'average', u

### Visualization
We use `interpretable_inverse_X` as the `inverse_X` during visualization. This way we get cluster statistics that are more informative and interpretable to humans (wordgrams rather than chargrams).

We also pass the `projected_X` to get cluster statistics for the projection. For `custom_tooltips` we use a textual description of the label.

The color function is simply the multi-class ground truth represented as a non-negative integer.
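Because the class ids are nominal, it can help to summarize each node by its most common label when reading the graph. The sketch below assumes the graph stores, for each node id, the row indices of its members. The `nodes` dictionary, its ids, and the helper `majority_label` are hypothetical toy stand-ins rather than output from the run above.

```python
from collections import Counter

# Hypothetical toy inputs: `nodes` maps a node id to the row indices of its
# members, mimicking the assumed shape of graph["nodes"]; the ids are made up.
nodes = {
    "cube2_cluster0": [0, 1, 2],
    "cube3_cluster1": [2, 3, 4, 5],
}
y = [7, 7, 4, 4, 4, 7]                      # integer class per document
target_names = {4: "comp.sys.mac.hardware", 7: "rec.autos"}  # toy mapping

# Same construction as the custom tooltips: one label string per document.
tooltips = [target_names[label] for label in y]


def majority_label(member_indices, labels):
    """Most common label among a node's members, with its share."""
    counts = Counter(labels[i] for i in member_indices)
    label, count = counts.most_common(1)[0]
    return label, count / len(member_indices)


for node_id, members in nodes.items():
    label, share = majority_label(members, y)
    print(node_id, target_names[label], round(share, 2))
```

A majority label with its share is one possible summary. Any aggregate of integer class ids, such as a mean, should be read with care, since the numeric order of the ids carries no meaning.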

In [6]:
html = mapper.visualize(graph,
                        inverse_X=interpretable_inverse_X,
                        inverse_X_names=interpretable_inverse_X_names,
                        path_html="newsgroups20.html",
                        projected_X=projected_X,
                        projected_X_names=["ISOMAP1", "ISOMAP2"],
                        title="Newsgroups20: Latent Semantic Char-gram Analysis with Isometric Embedding",
                        custom_tooltips=np.array([target_names[ys] for ys in y]),
                        color_values=y)
# jupyter.display("newsgroups20.html")



Wrote visualization to: newsgroups20.html


<img src="https://i.imgur.com/3G4sm4Y.png">