As we cannot feed words directly into most models, we need a numeric representation of each text. In this notebook we discuss different techniques for representing a text document as a numeric vector.

## Table of Contents

* [Bag of Words representation of texts](#bow)
    * [Countvectorizer](#cv)
    * [Tf-idf](#tfidf)
* [Topic Modelling](#tm)
* [Matrix Factorization](#mf)
* [Visualizing Representations](#viz)

<a id='bow'></a>

### Bag of Words representation of texts

A bag-of-words model, or BoW for short, is a way of extracting features from text for use in modeling, such as with machine learning algorithms.

The approach is very simple and flexible, and can be used in a myriad of ways for extracting features from documents.

A bag-of-words is a representation of text that describes the occurrence of words within a document. It involves two things:

1. A vocabulary of known words.
2. A measure of the presence of known words.

It is called a “bag” of words, because any information about the order or structure of words in the document is discarded. The model is only concerned with whether known words occur in the document, not where in the document.
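The two ingredients listed above (a vocabulary and a measure of presence) can be sketched by hand in plain Python. This is only an illustration on two made-up sentences; the helper name `to_count_vector` and the whitespace tokenizer are assumptions for this sketch, not what scikit-learn does internally.

```python
from collections import Counter

# Two toy documents (made up for illustration).
docs = [
    "patient denies chest pain",
    "chest pain on exertion and chest tightness",
]

# 1. A vocabulary of known words, sorted so every document shares the same column order.
tokenized = [doc.lower().split() for doc in docs]
vocabulary = sorted({token for tokens in tokenized for token in tokens})

# 2. A measure of the presence of known words: here, the raw count per document.
def to_count_vector(tokens, vocabulary):
    counts = Counter(tokens)
    return [counts[word] for word in vocabulary]

bow_vectors = [to_count_vector(tokens, vocabulary) for tokens in tokenized]

print(vocabulary)
for vector in bow_vectors:
    print(vector)
```

Word order is lost: any permutation of the words in a document yields the same vector.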

In [1]:
import pandas as pd
import numpy as np

pd.options.display.max_colwidth = None #show full column contents (pandas >= 1.0; older versions used -1)

data = pd.read_csv('clinical_notes_cleaned_pos.csv') #for excel file use read_excel

In [2]:
from sklearn.feature_extraction.text import CountVectorizer

<a id='cv'></a>

#### Countvectorizer

In CountVectorizer, we use counts to represent each word in each of the texts.

In [57]:
vector = CountVectorizer(lowercase=True, #this will convert all the tokens into lower case
                         stop_words='english', #remove english stopwords from vocabulary. if we need the stopwords this value should be None
                         analyzer='word', #tokens should be words. we can also use char for character tokens
                         max_features=50000 #maximum vocabulary size to restrict too many features
                        )

In [58]:
sample_text = data.clean_text.iloc[:5]

In [59]:
vectorized_corpus = vector.fit_transform(sample_text)

The `get_feature_names` method returns all the words in the vocabulary. In scikit-learn 1.0 and later it is superseded by `get_feature_names_out`.

In [60]:
vector.get_feature_names()[:50]

['05',
 '10',
 '100',
 '1000',
 '12',
 '13',
 '14',
 '15',
 '150',
 '17',
 '18',
 '1800',
 '19',
 '1ventral',
 '20',
 '21',
 '22',
 '23',
 '25',
 '26',
 '28',
 '29',
 '2mild',
 '32',
 '37',
 '40',
 '400',
 '45',
 '4c',
 '4th',
 '4x4s',
 '57',
 '575',
 '60',
 '69',
 '73',
 '79',
 '8gpl',
 '93',
 '96',
 '97',
 'a1',
 'abd',
 'abducto',
 'ability',
 'able',
 'abort',
 'abortion',
 'acromioclavicular',
 'activate']
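Because the method name differs between scikit-learn releases, a small compatibility helper can be useful. The sketch below is an assumption-laden illustration: `get_vocabulary_terms` is a hypothetical helper name, and the two stand-in classes exist only so the example runs without scikit-learn installed.

```python
def get_vocabulary_terms(vectorizer):
    """Return the fitted vocabulary as a list, whichever method the vectorizer offers."""
    if hasattr(vectorizer, "get_feature_names_out"):
        # Newer scikit-learn releases return an array-like of terms.
        return list(vectorizer.get_feature_names_out())
    # Older releases only provide get_feature_names().
    return list(vectorizer.get_feature_names())


# Stand-ins that mimic the two API generations (hypothetical, for demonstration only).
class OldStyleVectorizer:
    def get_feature_names(self):
        return ["abd", "ability", "able"]


class NewStyleVectorizer:
    def get_feature_names_out(self):
        return ("abd", "ability", "able")


print(get_vocabulary_terms(OldStyleVectorizer())[:2])
print(get_vocabulary_terms(NewStyleVectorizer())[:2])
```

With a real fitted vectorizer the call would simply be `get_vocabulary_terms(vector)[:50]`.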

The vectorized corpus is a sparse SciPy matrix. We need to convert it into a NumPy array for display. However, if the vocabulary is large, a dense NumPy array can consume a lot of memory, so it is wise to keep the sparse matrix for computation on big datasets.

In [7]:
vectorized_corpus.toarray()

array([[0, 4, 0, ..., 0, 0, 2],
       [0, 1, 0, ..., 3, 1, 0],
       [0, 4, 1, ..., 0, 0, 1],
       [1, 1, 0, ..., 0, 0, 1],
       [0, 0, 0, ..., 0, 0, 3]], dtype=int64)

In [8]:
print (vectorized_corpus.shape)

(5, 841)
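To put the memory remark above into numbers, here is a rough back-of-the-envelope estimate. It assumes 8-byte values, 4-byte indices in a CSR-style layout, and a made-up density of 1% non-zero entries; the real density depends on the corpus, so treat the figures as an order-of-magnitude sketch only.

```python
def dense_bytes(n_rows, n_cols, itemsize=8):
    """Memory of a dense array: every cell is stored."""
    return n_rows * n_cols * itemsize


def csr_bytes(n_rows, nnz, itemsize=8, index_itemsize=4):
    """Approximate memory of a CSR matrix: one value and one column index
    per non-zero entry, plus one row pointer per row (and one extra)."""
    return nnz * (itemsize + index_itemsize) + (n_rows + 1) * index_itemsize


n_docs, vocab_size = 818, 50000          # corpus size and max_features used in this notebook
nnz = n_docs * vocab_size // 100         # ASSUMPTION: 1% of the cells are non-zero

print("dense :", dense_bytes(n_docs, vocab_size) / 1024 ** 2, "MiB")
print("sparse:", csr_bytes(n_docs, nnz) / 1024 ** 2, "MiB")
```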


If we want to remove words that are too frequent or too rare, we can use the max_df and min_df arguments.

In [9]:
vector = CountVectorizer(lowercase=True, #this will convert all the tokens into lower case
                         stop_words='english', #remove english stopwords from vocabulary. if we need the stopwords this value should be None
                         analyzer='word', #tokens should be words. we can also use char for character tokens
                         max_features=50000, #maximum vocabulary size to restrict too many features
                         max_df=.5, #if some word is in more than 50% of the documents, remove them
                         min_df=2 #words need to be in at least 2 documents
                        )

In [10]:
vectorized_corpus = vector.fit_transform(sample_text) #fit_transform fits and then transforms the data. We can also fit and transform separately.

In [11]:
print (vectorized_corpus.shape)

(5, 110)
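The effect of `min_df` and `max_df` can be reproduced by hand with document frequencies. The sketch below uses four made-up token lists and a hypothetical helper `filter_vocabulary`; it follows the same convention as the cell above (an integer `min_df` is an absolute document count, a float `max_df` is a fraction of the documents).

```python
from collections import Counter

# Toy tokenized corpus (made up for illustration).
tokenized_docs = [
    ["knee", "pain", "mri"],
    ["knee", "surgery", "pain"],
    ["chest", "pain"],
    ["chest", "xray"],
]


def filter_vocabulary(tokenized_docs, min_df=2, max_df=0.5):
    n_docs = len(tokenized_docs)
    # Document frequency: in how many documents does each term occur at least once?
    doc_freq = Counter(term for doc in tokenized_docs for term in set(doc))
    max_doc_count = max_df * n_docs
    return sorted(
        term for term, df in doc_freq.items()
        if min_df <= df <= max_doc_count
    )


kept = filter_vocabulary(tokenized_docs, min_df=2, max_df=0.5)
print(kept)
```

Here `pain` is dropped for being in more than 50% of the documents, while `mri`, `surgery` and `xray` are dropped for appearing in only one.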


Now let us add bi-grams and tri-grams to add more terms to our vocabulary.

In [12]:
vector = CountVectorizer(lowercase=True, #this will convert all the tokens into lower case
                         stop_words='english', #remove english stopwords from vocabulary. if we need the stopwords this value should be None
                         analyzer='word', #tokens should be words. we can also use char for character tokens
                         max_features=50000, #maximum vocabulary size to restrict too many features
                         max_df=.5, #if some word is in more than 50% of the documents, remove them
                         min_df=1, #keep terms that appear in at least 1 document (no rare-term filtering)
                         ngram_range=(1,3) #change ngram_range for n-grams
                        )

In [13]:
vectorized_corpus = vector.fit_transform(sample_text)

In [56]:
vector.get_feature_names()[:50]

['05',
 '05 mark',
 '05 mark supraspinatus',
 '10 8gpl',
 '10 8gpl unit',
 '10 jp',
 '10 jp drain',
 '10 mg',
 '10 mg iv',
 '10 minute',
 '10 minute pass',
 '10 rf',
 '10 rf 20',
 '10 series',
 '10 series image',
 '10 wound',
 '10 wound edge',
 '10 year',
 '10 year 2mild',
 '10 year consume',
 '100',
 '100 ml',
 '100 ml postanesthesia',
 '1000',
 '1000 ml',
 '1000 ml general',
 '12',
 '12 14',
 '12 14 series',
 '12 normal',
 '12 normal coracoacromial',
 '13',
 '13 series',
 '13 series image',
 '13 year',
 '13 year state',
 '14 8gpl',
 '14 8gpl normallt22',
 '14 complete',
 '14 complete tear',
 '14 mm',
 '14 mm screw',
 '14 series',
 '14 series image',
 '15',
 '15 degree',
 '15 degree intermetatarsal',
 '150',
 '150 pound',
 '150 pound palpable']

In [15]:
print (vectorized_corpus.shape)

(5, 3850)
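The jump in vocabulary size comes from every run of up to three consecutive tokens becoming its own feature. A minimal sketch of how word n-grams are generated for `ngram_range=(1,3)` follows; `word_ngrams` is a hypothetical helper written for this illustration.

```python
def word_ngrams(tokens, ngram_range=(1, 3)):
    """Return all word n-grams with min_n <= n <= max_n, joined by spaces."""
    min_n, max_n = ngram_range
    ngrams = []
    for n in range(min_n, max_n + 1):
        # Slide a window of length n over the token list.
        for start in range(len(tokens) - n + 1):
            ngrams.append(" ".join(tokens[start:start + n]))
    return ngrams


tokens = ["10", "mg", "iv"]
print(word_ngrams(tokens))
```

Three tokens already yield six features, which is why `max_features`, `min_df` and `max_df` matter more once n-grams are enabled.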


<a id='tfidf'></a>

#### Tf-idf 

CountVectorizer uses just the count of words in each document, which does not reflect the relative importance of the words. Tf-idf addresses this by weighting each count by how rare the word is across the corpus (the inverse document frequency), so that words appearing in many documents are down-weighted. Each document vector is then normalized so that document length does not dominate.
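As a worked illustration of this weighting, the sketch below computes tf-idf by hand on three tiny made-up documents. It assumes the smoothed formula that scikit-learn documents as its default, `idf = ln((1 + n) / (1 + df)) + 1`, followed by L2 normalization of each row; `tfidf_rows` is a hypothetical helper name.

```python
import math
from collections import Counter


def tfidf_rows(tokenized_docs):
    n_docs = len(tokenized_docs)
    vocabulary = sorted({term for doc in tokenized_docs for term in doc})
    # Document frequency of each term.
    doc_freq = Counter(term for doc in tokenized_docs for term in set(doc))
    # Smoothed inverse document frequency (assumed default formula).
    idf = {t: math.log((1 + n_docs) / (1 + doc_freq[t])) + 1 for t in vocabulary}

    rows = []
    for doc in tokenized_docs:
        counts = Counter(doc)
        weights = [counts[t] * idf[t] for t in vocabulary]
        # L2-normalize so long and short documents are comparable.
        norm = math.sqrt(sum(w * w for w in weights)) or 1.0
        rows.append([w / norm for w in weights])
    return vocabulary, rows


docs = [["pain", "knee"], ["pain", "chest"], ["pain", "bladder"]]
vocabulary, rows = tfidf_rows(docs)
first = dict(zip(vocabulary, rows[0]))
print(first["knee"], first["pain"])
```

Both words occur once in the first document, yet `knee` receives the larger weight because it appears in only one of the three documents.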

In [16]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [17]:
tfidf_vector = TfidfVectorizer(lowercase=True, #this will convert all the tokens into lower case
                         stop_words='english', #remove english stopwords from vocabulary. if we need the stopwords this value should be None
                         analyzer='word', #tokens should be words. we can also use char for character tokens
                         max_features=50000 #maximum vocabulary size to restrict too many features
                        )

In [18]:
tfidf_vectorized_corpus = tfidf_vector.fit_transform(sample_text)

In [19]:
tfidf_vectorized_corpus.toarray()

array([[0.        , 0.07223833, 0.        , ..., 0.        , 0.        ,
        0.03611916],
       [0.        , 0.04526348, 0.        , ..., 0.24102705, 0.08034235,
        0.        ],
       [0.        , 0.06708465, 0.02976869, ..., 0.        , 0.        ,
        0.01677116],
       [0.04106359, 0.02313451, 0.        , ..., 0.        , 0.        ,
        0.02313451],
       [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.05745765]])

Let us apply tf-idf on the whole dataset

In [20]:
data = data.dropna(subset=['clean_text'])

tfidf_vector = TfidfVectorizer(lowercase=True, #this will convert all the tokens into lower case
                         stop_words='english', #remove english stopwords from vocabulary. if we need the stopwords this value should be None
                         analyzer='word', #tokens should be words. we can also use char for character tokens
                         max_features=50000, #maximum vocabulary size to restrict too many features
                         min_df = 5,
                         max_df = .4
                        )

tfidf_vectorized_corpus = tfidf_vector.fit_transform(data.clean_text)

print (tfidf_vectorized_corpus.shape)

(818, 3831)


There are 818 medical texts in total, and the vocabulary size is 3831.

### Visualizing Top words from Tf-idf representation

Under the tf-idf representation, words with high weights are the important ones for each text. Let us visualize the top words by mean tf-idf score for each of the specialties.

In [21]:
top_words = tfidf_vector.get_feature_names()
top_words_df = pd.DataFrame(tfidf_vectorized_corpus.toarray(),columns=top_words)
top_words_df['speciality'] = data.label.values #assign by position; the index of data may have gaps after dropna

In [22]:
top_words_df.head(5)

Unnamed: 0,00,000,00am,01,02,03,04,045,05,06,...,yellow,yesterday,young,yr,yyyy,zantac,zero,zocor,zone,speciality
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,Neurology
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,Orthopedic
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,Orthopedic
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.039968,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,Radiology
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,Orthopedic


In [23]:
top_words_df_per_label = top_words_df.groupby(['speciality']).mean().reset_index(drop=False)
top_words_df_per_label = top_words_df_per_label.melt(id_vars=['speciality'],value_vars=top_words)
top_words_df_per_label.columns = ['speciality','top_word','word_weight']
top_words_df_per_label = top_words_df_per_label.sort_values(by=['speciality','word_weight'],ascending=[False,False]).reset_index(drop=True)
top_words_df_per_label = top_words_df_per_label.groupby(['speciality']).head(20).reset_index(drop=True)
top_words_df_per_label.head(5)

Unnamed: 0,speciality,top_word,word_weight
0,Urology,bladder,0.075027
1,Urology,prostate,0.062671
2,Urology,inguinal,0.047526
3,Urology,hernia,0.045706
4,Urology,penis,0.042437


In [24]:
from IPython.display import HTML, Image
from plotly import __version__
from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot
import plotly.graph_objs as go
import plotly.express as px
#init_notebook_mode(connected=True)

all_colors = px.colors.qualitative.Plotly
print (all_colors)

['#636EFA', '#EF553B', '#00CC96', '#AB63FA', '#FFA15A', '#19D3F3', '#FF6692', '#B6E880', '#FF97FF', '#FECB52']


In [55]:
traces = []

for i, s in enumerate(list(set(top_words_df_per_label.speciality))):
    traces.append(go.Bar(
        x = top_words_df_per_label[top_words_df_per_label.speciality == s]['top_word'],
        y = top_words_df_per_label[top_words_df_per_label.speciality == s]['word_weight'],
        name=s,
        marker_color=all_colors[i%len(all_colors)]
    ))
    
layout = dict(
        width=1050,
        height=600,
        title = 'Top 20 Words for each Speciality', 
        yaxis=dict(
        title='Word Weight',
        titlefont_size=16,
        tickfont_size=14,
        ),  
    xaxis=dict(
        title='Top Words',
        titlefont_size=16,
        tickfont_size=14,
        ),  
    )

buttons = []
visibility = [False]*len(list(set(top_words_df_per_label.speciality)))
for i, s in enumerate(list(set(top_words_df_per_label.speciality))):
    visibility_ = visibility.copy()
    visibility_[i] = True
    buttons.append(
        dict(
            args = [{'visible': visibility_}],
            label=s,
            method='update',
        ))
    
updatemenus = list([
    dict(buttons=buttons,
    direction="down",
        pad = {'r':10, "t":10},
        showactive=True,
        x=0.3,
        y=1.15,
        yanchor='top')
])

layout['updatemenus'] = updatemenus

fig = dict(data=traces,layout=layout)
iplot(fig)

<a id='tm'></a>

### Topic Modelling

Topic modeling is a type of statistical modeling for discovering the abstract “topics” that occur in a collection of documents. Latent Dirichlet Allocation (LDA) is one such model: it represents each document as a mixture of topics and each topic as a distribution over words, with Dirichlet priors on both. Each document can then be assigned to its most probable topic.

![image.png](attachment:image.png)
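The generative story behind LDA can be written compactly. This is the standard textbook formulation rather than anything specific to this notebook, and the symbols $\alpha$, $\beta$, $\theta_d$, $\varphi_k$, $z_{d,n}$ and $w_{d,n}$ are notation introduced here for illustration.

$$
\begin{aligned}
\varphi_k &\sim \mathrm{Dirichlet}(\beta) && \text{word distribution of topic } k \\
\theta_d &\sim \mathrm{Dirichlet}(\alpha) && \text{topic mixture of document } d \\
z_{d,n} &\sim \mathrm{Categorical}(\theta_d) && \text{topic of the } n\text{-th word of document } d \\
w_{d,n} &\sim \mathrm{Categorical}(\varphi_{z_{d,n}}) && \text{the observed word}
\end{aligned}
$$

Fitting the model estimates the $\theta_d$ (one row per document, returned by `fit_transform`) and the $\varphi_k$ (one row per topic, stored unnormalized in `components_`).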

In [26]:
from sklearn.decomposition import LatentDirichletAllocation

In [27]:
n_topics = 8

In [28]:
lda = LatentDirichletAllocation(n_components=n_topics)
lda_features = lda.fit_transform(tfidf_vectorized_corpus)

In [29]:
lda_features

array([[0.23328804, 0.00852919, 0.00852659, ..., 0.00852117, 0.00852117,
        0.7155546 ],
       [0.898703  , 0.0144714 , 0.01447022, ..., 0.01447022, 0.01447022,
        0.01447308],
       [0.91221622, 0.00838062, 0.00837347, ..., 0.00837285, 0.00837285,
        0.03752896],
       ...,
       [0.41376396, 0.01748452, 0.0174741 , ..., 0.01747111, 0.01747111,
        0.01809848],
       [0.29400295, 0.36319745, 0.0192552 , ..., 0.01925141, 0.01925141,
        0.24652803],
       [0.22449186, 0.01311163, 0.01310512, ..., 0.01310182, 0.01310182,
        0.69686485]])

In [30]:
lda_features.shape

(818, 8)

In [62]:
lda_features[0].sum()

0.9999999999999999

The numbers learned by LDA form a probability distribution over topics for each text, so each row sums to 1. Let us find the top topic for each text and interpret each topic.

In [31]:
text_topics = lda_features.argmax(axis=1)
data['topic'] = text_topics

In [32]:
def print_top_words(model, feature_names, n_top_words):
    topic_df = pd.DataFrame()
    topic_df['topic_id'] = np.arange(len(model.components_)) #one row per topic of the fitted model
    temp = []
    
    for topic_idx, topic in enumerate(model.components_):
        topic = topic / topic.sum() #normalize a copy so that the fitted model is not modified in place
        top_idx = topic.argsort()[:-n_top_words - 1:-1] #indices of the n_top_words largest weights, largest first
        message = "Topic #%d: " % (topic_idx)
        message += " + ".join(["{:.3f} * {}".format(topic[i],feature_names[i])
                             for i in top_idx])
        print(message)
        
        temp.append(",".join([feature_names[i] for i in top_idx]))

    topic_df['top_words'] = temp
    
    print()
    
    return topic_df

In [33]:
top_words = print_top_words(lda, tfidf_vector.get_feature_names(), n_top_words=5)

Topic #0: 0.003 * incision + 0.003 * skin + 0.003 * remove + 0.003 * knee + 0.003 * lateral
Topic #1: 0.009 * bladder + 0.007 * gallbladder + 0.006 * abdomen + 0.006 * prostate + 0.006 * pelvi
Topic #2: 0.005 * osteotomy + 0.004 * ct + 0.004 * axial + 0.003 * intracranial + 0.003 * shift
Topic #3: 0.012 * colon + 0.009 * scope + 0.009 * polyp + 0.007 * colonoscopy + 0.007 * cecum
Topic #4: 0.007 * valve + 0.005 * stress + 0.005 * perfusion + 0.005 * aortic + 0.005 * fraction
Topic #5: 0.000 * atenolol + 0.000 * real + 0.000 * handed + 0.000 * parenchyma + 0.000 * ventilate
Topic #6: 0.001 * angina + 0.001 * enema + 0.001 * 117 + 0.000 * afternoon + 0.000 * constipation
Topic #7: 0.005 * unremarkable + 0.004 * deny + 0.004 * history + 0.004 * mg + 0.004 * day
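The weights printed above come from two steps: normalizing each topic's word weights so they sum to 1, and picking the largest ones. The slice `argsort()[:-n_top_words - 1:-1]` is a compact way of saying "indices of the n largest values, largest first". A plain-Python sketch of the same idea, using made-up weights and hypothetical helper names, looks like this (tie-breaking may differ from NumPy's):

```python
def normalize(weights):
    """Return a new list that sums to 1; the input list is left untouched."""
    total = sum(weights)
    return [w / total for w in weights]


def top_n_indices(weights, n):
    """Indices of the n largest weights, largest first."""
    order = sorted(range(len(weights)), key=lambda i: weights[i], reverse=True)
    return order[:n]


feature_names = ["bladder", "colon", "knee", "valve"]   # toy vocabulary
topic_weights = [2.0, 6.0, 1.0, 1.0]                    # toy unnormalized topic-word weights

probs = normalize(topic_weights)
top = [(feature_names[i], probs[i]) for i in top_n_indices(probs, 2)]
print(" + ".join("{:.3f} * {}".format(p, name) for name, p in top))
```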



In [34]:
data_topic_proportions = data.groupby(['topic'])['label'].value_counts(normalize=True).reset_index(name='topic_proportion')
data_topic_proportions.columns = ['topic','speciality','topic_proportion']
data_topic_proportions = data_topic_proportions.sort_values(['topic'],ascending=[True]).reset_index(drop=True)
#data_topic_proportions

In [35]:
traces = []
for i in data_topic_proportions.speciality.unique():
    traces.append(go.Bar(name=i, x=data_topic_proportions[data_topic_proportions.speciality == i]['topic'], y=data_topic_proportions[data_topic_proportions.speciality == i]['topic_proportion']))

layout = dict(
        width=1200,
        height = 600,
        title = 'Top disease category for each topic', 
        yaxis=dict(
        title='Proportion',
        titlefont_size=16,
        tickfont_size=14,
        ),  
    xaxis=dict(
        title='Topic ID',
        titlefont_size=16,
        tickfont_size=14,
        ),  
    )

fig = go.Figure(data=traces,layout=layout)

fig.update_layout(barmode='stack')
fig.show()

In [36]:
import pyLDAvis
import pyLDAvis.gensim
import pyLDAvis.sklearn
from IPython.display import HTML

In [38]:
viz = pyLDAvis.sklearn.prepare(lda_model=lda,vectorizer=tfidf_vector,dtm=tfidf_vectorized_corpus)



In [39]:
pyLDAvis.save_html(viz,'vis.html')
HTML(filename='vis.html')

The numbers denote the relative weight of each word within each topic; they are estimated while fitting LDA. We can infer the hidden theme of each topic from its top words. In practice, we maintain a reverse dictionary that maps each topic to its top words in order to interpret it. However, because topics capture latent semantics, some topics represent abstract concepts that are hard for a human to interpret.

<a id='mf'></a>

### Matrix Factorization

Matrix decomposition/factorization is an essential statistical technique for

* Dimensionality reduction
* Feature extraction
* Semantic analysis

Matrix factorization can also be used to extract topical information.

In this notebook we use matrix decomposition to extract representations from texts.

<a id='nmf'></a>

#### Non-negative Matrix Factorization

In NMF we approximate the matrix as a product of two matrices with non-negative entries.

![image.png](attachment:image.png)

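Due to the non-negativity constraint, NMF cannot be solved like a typical matrix decomposition. Instead, it is solved by minimizing the above loss function with iterative optimization techniques such as gradient descent; scikit-learn's NMF uses coordinate descent or multiplicative updates.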
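For reference, one common form of the NMF objective is the squared Frobenius reconstruction error. The notation below ($X$, $W$, $H$, $n$, $m$, $k$) is introduced here for illustration and may differ from the figure; regularization terms are omitted.

$$
\min_{W \ge 0,\; H \ge 0} \; \frac{1}{2}\,\lVert X - WH \rVert_F^2,
\qquad X \in \mathbb{R}^{n \times m},\; W \in \mathbb{R}^{n \times k},\; H \in \mathbb{R}^{k \times m}
$$

In this notebook $X$ is the tf-idf matrix ($n$ documents, $m$ vocabulary terms) and $k$ is `n_topics`. The matrix $W$ is what `fit_transform` returns (one $k$-dimensional vector per document), and $H$ is stored in `components_` (one row of word weights per topic).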

In [40]:
from sklearn.decomposition import NMF

In [41]:
nmf_model = NMF(n_components=n_topics, l1_ratio=.5) #beta_loss='kullback-leibler'

In [42]:
nmf_features = nmf_model.fit_transform(tfidf_vectorized_corpus)

In [43]:
top_words_nmf = print_top_words(nmf_model, tfidf_vector.get_feature_names(), n_top_words=5)

Topic #0: 0.006 * history + 0.005 * state + 0.005 * knee + 0.004 * deny + 0.004 * mg
Topic #1: 0.011 * metatarsal + 0.010 * tendon + 0.010 * foot + 0.009 * medial + 0.007 * joint
Topic #2: 0.025 * colon + 0.017 * polyp + 0.017 * scope + 0.014 * cecum + 0.012 * colonoscopy
Topic #3: 0.017 * ct + 0.013 * contrast + 0.013 * pelvi + 0.009 * fluid + 0.009 * abdomen
Topic #4: 0.019 * c5 + 0.018 * c6 + 0.018 * cervical + 0.015 * c4 + 0.015 * disc
Topic #5: 0.009 * suture + 0.008 * hernia + 0.008 * incision + 0.007 * inguinal + 0.007 * skin
Topic #6: 0.010 * unremarkable + 0.006 * reveal + 0.006 * 92 + 0.006 * 93 + 0.005 * weakness
Topic #7: 0.022 * valve + 0.018 * stress + 0.016 * perfusion + 0.016 * aortic + 0.014 * myocardial



In [44]:
text_topics_nmf = nmf_features.argmax(axis=1)
data['topic_nmf'] = text_topics_nmf

data_topic_proportions = data.groupby(['topic_nmf'])['label'].value_counts(normalize=True).reset_index(name='topic_proportion')
data_topic_proportions.columns = ['topic_nmf','speciality','topic_proportion']
data_topic_proportions = data_topic_proportions.sort_values(['topic_nmf'],ascending=[True]).reset_index(drop=True)
#data_topic_proportions

In [45]:
traces = []
for i in data_topic_proportions.speciality.unique():
    traces.append(go.Bar(name=i, x=data_topic_proportions[data_topic_proportions.speciality == i]['topic_nmf'], y=data_topic_proportions[data_topic_proportions.speciality == i]['topic_proportion']))

layout = dict(
        width=1200,
        height=600,
        title = 'Top disease category for each topic', 
        yaxis=dict(
        title='Proportion',
        titlefont_size=16,
        tickfont_size=14,
        ),  
    xaxis=dict(
        title='Topic ID',
        titlefont_size=16,
        tickfont_size=14,
        ),  
    )

fig = go.Figure(data=traces,layout=layout)

fig.update_layout(barmode='stack')
fig.show()

In [46]:
viz_nmf = pyLDAvis.sklearn.prepare(lda_model=nmf_model,vectorizer=tfidf_vector,dtm=tfidf_vectorized_corpus)
pyLDAvis.save_html(viz_nmf,'vis_nmf.html')
HTML(filename='vis_nmf.html')



<a id='viz'></a>

### Visualizing Representations

We can use quantitative methods to evaluate the representations learned by different models. However, as the task is unsupervised, most evaluations rely on similarity-based metrics. As we see below, some representations preserve local properties, i.e. similar texts have similar representations and cluster together. Other methods look for global structure. Usually, representations are evaluated with metrics specific to the downstream task.
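One simple similarity-based check is the cosine similarity between document vectors: texts on the same subject should score close to 1, unrelated texts close to 0. The sketch below uses three made-up topic-proportion vectors, not values from this dataset, and a hand-written `cosine_similarity` helper.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical topic proportions for three documents.
doc_a = [0.90, 0.05, 0.05]   # mostly topic 0
doc_b = [0.80, 0.10, 0.10]   # also mostly topic 0
doc_c = [0.05, 0.05, 0.90]   # mostly topic 2

sim_ab = cosine_similarity(doc_a, doc_b)
sim_ac = cosine_similarity(doc_a, doc_c)
print(sim_ab, sim_ac)
```

The same function can be applied to rows of `lda_features` or `nmf_features` to compare how each representation ranks neighbouring documents.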

As we cannot visualize the topic vectors directly (they have 8 dimensions here, one per topic), we use t-SNE to reduce them to 2 dimensions. t-Distributed Stochastic Neighbor Embedding (t-SNE) is a technique for dimensionality reduction that is particularly well suited for the visualization of high-dimensional datasets.

In [47]:
from sklearn.manifold import TSNE

In [48]:
tsne_model = TSNE(n_components=2, verbose=1, random_state=0, angle=.99, init='pca', learning_rate=200)
tsne_lda = tsne_model.fit_transform(lda_features)

[t-SNE] Computing 91 nearest neighbors...
[t-SNE] Indexed 818 samples in 0.001s...
[t-SNE] Computed neighbors for 818 samples in 0.027s...
[t-SNE] Computed conditional probabilities for sample 818 / 818
[t-SNE] Mean sigma: 0.009508
[t-SNE] KL divergence after 250 iterations with early exaggeration: 49.795509
[t-SNE] KL divergence after 1000 iterations: 0.214581


In [49]:
tsne_model = TSNE(n_components=2, verbose=1, random_state=0, angle=.99, init='pca', learning_rate=200)
tsne_nmf = tsne_model.fit_transform(nmf_features)

[t-SNE] Computing 91 nearest neighbors...
[t-SNE] Indexed 818 samples in 0.002s...
[t-SNE] Computed neighbors for 818 samples in 0.031s...
[t-SNE] Computed conditional probabilities for sample 818 / 818
[t-SNE] Mean sigma: 0.033378
[t-SNE] KL divergence after 250 iterations with early exaggeration: 54.262352
[t-SNE] KL divergence after 1000 iterations: 0.373938


In [50]:
vis_data = pd.DataFrame()
vis_data['x'] = tsne_lda[:,0]
vis_data['y'] = tsne_lda[:,1]
vis_data['speciality'] = data.label.values #assign by position, independent of the index of data
vis_data['topic_id'] = data.topic.values
vis_data = pd.merge(vis_data,top_words,how='inner')
vis_data.topic_id = vis_data.topic_id.astype(str)

fig = px.scatter(vis_data, x="x", y="y", color=vis_data.speciality, hover_data=['topic_id'], title="Projection of Clinical Texts (based on LDA)", height=600, width=1200) #hover_data=['top_words']
fig.show()

In [51]:
fig = px.scatter(vis_data, x="x", y="y", color=vis_data.topic_id, hover_data=['top_words'], title="Projection of Clinical Texts (based on LDA)", height=600, width=1200) #hover_data=['top_words']
fig.show()

In [52]:
vis_data = pd.DataFrame()
vis_data['x'] = tsne_nmf[:,0]
vis_data['y'] = tsne_nmf[:,1]
vis_data['speciality'] = data.label.values #assign by position, independent of the index of data
vis_data['topic_id'] = data.topic_nmf.values
vis_data = pd.merge(vis_data,top_words_nmf,how='inner')
vis_data.topic_id = vis_data.topic_id.astype(str)

fig = px.scatter(vis_data, x="x", y="y", color=vis_data.speciality,  hover_data=['topic_id'], title="Projection of Clinical Texts (based on NMF)", height=600, width=1200) #hover_data=['top_words']
fig.show()

In [53]:
fig = px.scatter(vis_data, x="x", y="y", color=vis_data.topic_id,  hover_data=['top_words'], title="Projection of Clinical Texts (based on NMF)", height=600, width=1200) #hover_data=['top_words']
fig.show()

### References for further reading

<strong> Visualization </strong>

https://matplotlib.org/tutorials/index.html

https://plotly.com/python

https://www.kaggle.com/thebrownviking20/intermediate-visualization-tutorial-using-plotly

<strong> Representation Learning </strong>

https://www.oreilly.com/library/view/applied-text-analysis/9781491963036/ch04.html

https://scikit-learn.org/stable/modules/feature_extraction.html

<strong> Topic Modelling </strong>

https://scikit-learn.org/stable/auto_examples/applications/plot_topics_extraction_with_nmf_lda.html

https://machinelearningmastery.com/introduction-to-matrix-decompositions-for-machine-learning/

https://medium.com/nanonets/topic-modeling-with-lsa-psla-lda-and-lda2vec-555ff65b0b05

<strong> T-SNE </strong>

https://distill.pub/2016/misread-tsne/