# Feature Extraction

Text content -> Feature Extraction -> ?

### Textual information

It depends on the type of model used:<br>
* Graph based model to extract insights -> symbolic nodes with relationship (**WordNet**);<br>
* Statistical model -> numerical representation.<br>

If the task is at the document level (such as spam detection, sentiment analysis):<br>
* Document representation -> **bag-of-words** or **doc2vec**;<br>

If the task is at the word or phrase level (such as text generation, machine translation):<br>
* Word representation -> **GloVe** or **word2vec**.


### Bag of words

Bag-of-words: treat each document as an unordered collection, or bag, of words.

To obtain a bag-of-words from raw text: 
* cleaning;
* normalizing;
* splitting;
* stemming;
* lemmatization;
* ...

Keeping lots of separate unordered sets is inefficient (they have different sizes, contain different words, and repeat words).

Better representation: document -> vector of numbers (each number is how many times a word occurs in the document):
* collect all the unique words in corpus;
* form a vocabulary;
* form the vector element positions or columns of a table;
* count number of occurrences of each word in each document;
* enter value in the respective column.

This table can be considered a **Document-Term Matrix**, illustrating the relationship between documents (rows) and words or terms (columns).
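The steps above can be sketched in a few lines of plain Python. This is a minimal illustration, not a production pipeline: the toy corpus, the `tokenize` helper, and the `document_term_matrix` function are all made up for this example, and the "cleaning" is reduced to lowercasing and splitting on whitespace.

```python
from collections import Counter

# Hypothetical toy corpus, assumed to be already cleaned.
docs = [
    "little house on the prairie",
    "mary had a little lamb",
    "the silence of the lambs",
]

def tokenize(text):
    # Simplest possible split; a real pipeline would also stem or lemmatize.
    return text.lower().split()

tokenized = [tokenize(d) for d in docs]

# Collect all unique words in the corpus; the sorted list fixes the column order.
vocabulary = sorted({word for tokens in tokenized for word in tokens})

def document_term_matrix(tokenized_docs, vocab):
    # One row per document, one column per term, each cell an occurrence count.
    matrix = []
    for tokens in tokenized_docs:
        counts = Counter(tokens)
        matrix.append([counts[term] for term in vocab])
    return matrix

dtm = document_term_matrix(tokenized, vocabulary)

for doc, row in zip(docs, dtm):
    print(row, "<-", doc)
```

Each row has the same length regardless of how long the document is, which is what makes the rows comparable as vectors.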

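To evaluate the similarity between documents, use **cosine similarity** rather than the raw dot product: the dot product grows with document length, while cosine similarity normalizes by the vector magnitudes.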
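A small sketch of the difference between the two measures, assuming documents are already represented as count vectors over a shared vocabulary. The vectors and function names here are invented for illustration.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cosine_similarity(a, b):
    norm_a = math.sqrt(dot(a, a))
    norm_b = math.sqrt(dot(b, b))
    if norm_a == 0 or norm_b == 0:
        # Convention chosen for this sketch: an empty document is similar to nothing.
        return 0.0
    return dot(a, b) / (norm_a * norm_b)

# Hypothetical count vectors over a 4-term vocabulary.
doc_a = [1, 2, 0, 1]
doc_b = [2, 4, 0, 2]  # same word proportions as doc_a, but twice as long
doc_c = [0, 0, 3, 0]  # shares no words with doc_a

print("dot(a, a)    =", dot(doc_a, doc_a))
print("dot(a, b)    =", dot(doc_a, doc_b))  # larger only because doc_b is longer
print("cosine(a, b) =", cosine_similarity(doc_a, doc_b))
print("cosine(a, c) =", cosine_similarity(doc_a, doc_c))
```

The dot product rates `doc_b` as more similar to `doc_a` than `doc_a` is to itself, purely because of length; cosine similarity treats the two as pointing in the same direction.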


### TF-IDF

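Drawback of bag-of-words: it treats every word as being equally important.

TF-IDF takes the whole document collection into account (**assigning weights** to words that signify their relevance in documents): <br>
$\mathrm{tfidf}(t, d, D) = \mathrm{tf}(t, d) \cdot \mathrm{idf}(t, D)$<br>
* tf: term frequency, how often term $t$ occurs in document $d$;
* idf: inverse document frequency, which lowers the weight of terms that occur in many documents of the collection $D$.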
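A minimal sketch of the product above. The exact definitions of tf and idf vary between libraries; this example assumes one common choice (tf as count divided by document length, idf as the log of the number of documents over the number of documents containing the term). The corpus and function names are hypothetical.

```python
import math
from collections import Counter

# Hypothetical pre-tokenized corpus D.
corpus = [
    ["the", "cat", "sat"],
    ["the", "dog", "sat"],
    ["the", "cat", "ran", "away"],
]

def tf(term, doc):
    # Assumed variant: raw count normalized by document length.
    return Counter(doc)[term] / len(doc)

def idf(term, docs):
    # Assumed variant: log(N / df), with 0.0 for terms that never occur.
    df = sum(1 for doc in docs if term in doc)
    return math.log(len(docs) / df) if df else 0.0

def tfidf(term, doc, docs):
    return tf(term, doc) * idf(term, docs)

for term in ["the", "sat", "dog"]:
    print(term, round(tfidf(term, corpus[1], corpus), 4))
```

With these definitions a word that appears in every document ("the") gets weight zero, while a word unique to one document ("dog") gets the highest weight in that document.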

<br>
<br>
<br>
So far, we have looked at representations that try to characterize an **entire document** or **collection of words** as one unit. As a result, the inferences we can make are also typically at the document level:
* a document's mixture of topics;
* document similarity;
* document sentiment.

For deeper analysis, a numerical representation for each word is necessary.


### One-Hot Encoding

Treat each word like a class, and assign it a vector that has 1 in a single pre-determined position for that word and 0 everywhere else.

This is similar to the bag-of-words idea, except that each bag holds a single word and we build a vector for it.
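As a quick illustration, assuming a small fixed vocabulary (the words and the `one_hot` helper are made up for this sketch):

```python
# Hypothetical vocabulary; the position of each word is fixed in advance.
vocabulary = ["cat", "dog", "sat", "the"]
word_to_index = {word: i for i, word in enumerate(vocabulary)}

def one_hot(word):
    # A vector of zeros with a single 1 at the word's pre-determined position.
    vector = [0] * len(vocabulary)
    vector[word_to_index[word]] = 1
    return vector

for word in vocabulary:
    print(word, one_hot(word))
```

Note that the vector length equals the vocabulary size, so every new word adds a dimension to every vector.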


### Word Embeddings

One-hot encoding breaks down when we have a large vocabulary to deal with, because the vector length grows with the vocabulary. We need a word representation whose size is controlled by limiting it to a fixed-size vector.

We want to find an embedding for each word in some vector space that exhibits some desired properties. Such a representation is useful for a variety of purposes, like finding synonyms and analogies, identifying concepts around which words are clustered, classifying words as positive, negative, neutral...

By combining word vectors, we get another way of representing documents as well.
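One simple way to combine word vectors into a document vector is to average them. The notes do not say which combination is meant, so averaging is an assumption here, and the 3-dimensional embedding values below are invented purely for illustration.

```python
# Hypothetical lookup table of 3-dimensional word vectors (values are made up).
embeddings = {
    "good":  [0.9, 0.1, 0.3],
    "movie": [0.1, 0.8, 0.5],
    "bad":   [-0.8, 0.1, 0.3],
}

def document_vector(tokens, table):
    # Average the vectors of the tokens found in the table; skip unknown words.
    known = [table[t] for t in tokens if t in table]
    if not known:
        return None
    dim = len(known[0])
    return [sum(vec[i] for vec in known) / len(known) for i in range(dim)]

print(document_vector(["good", "movie"], embeddings))
print(document_vector(["bad", "movie"], embeddings))
```

The resulting document vector has the same fixed size as the word vectors, no matter how many words the document contains.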


# Modeling

* design a model;
* fit its parameters to training data using an optimization procedure;
* make predictions on unseen data.

# Word2Vec

Core idea: a model that can **1.** predict a given word from its neighboring words, or vice versa, **2.** predict the neighboring words for a given word, is likely to capture the contextual meanings of words very well.

Word2Vec comes in two variants:
1. predict the middle word given its neighboring words, called continuous bag of words (CBOW);
2. predict the neighboring words given the middle word, called skip-gram.

#### Skip-gram model:
1. Pick any word in a sentence; 
2. convert it into a one-hot encoded vector;
3. feed it into a neural network (probabilistic model);
4. predict a few surrounding words;
5. train & optimize

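Continuous Bag of Words uses the same strategy with the direction reversed: the surrounding words are the input and the middle word is the prediction target.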
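The data preparation behind the skip-gram steps can be sketched as generating (center word, context word) training pairs from a sliding window. This covers only the pair generation, not the neural network or its training; the function name and the `window` parameter are assumptions made for this example.

```python
def skipgram_pairs(tokens, window=2):
    # For every position, pair the center word with each neighbor
    # that lies at most `window` positions away.
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

sentence = "the quick brown fox jumps".split()
pairs = skipgram_pairs(sentence, window=1)

for center, context in pairs:
    print(center, "->", context)
```

For skip-gram, the first element of each pair is the network input and the second is the target. For CBOW, the context words within one window would instead be grouped together as the input, with the center word as the target.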

Properties:
* Robust, distributed representation (the meaning of each word is distributed throughout the vector);
* Vector size is independent of vocabulary size (unlike bag-of-words, where the size grows with the number of unique words);
* Once trained, the vectors can be stored in a lookup table;
* Deep learning ready (the vectors can be fed into RNNs; it is also possible to use RNNs to learn better word embeddings).

More: use Hierarchical Softmax to represent the output words, Sparse Cross Entropy to compute the loss...


# GloVe

Word2Vec is one type of word embedding. GloVe (Global Vectors for Word Representation) tries to directly optimize the vector representation of each word using only co-occurrence statistics, unlike Word2Vec, which sets up an ancillary prediction task.

1. compute the probability that word $j$ appears in the context of word $i$ (simply, that word $j$ is present in the vicinity of word $i$);
2. count all such occurrences of $i$ and $j$ in our text collection;
3. normalize the counts to get a probability;
4. initialize a random vector for each word (two sets of vectors: one for the word when it acts as a context, one for when it acts as the target);
5. for any pair of words $i, j$, we want the dot product of their word vectors to equal their co-occurrence probability;
6. using the co-occurrence probability as the goal, iteratively optimize these word vectors;
7. the result should be a set of vectors that capture the similarities and differences between individual words.

From another point of view, this amounts to factorizing the co-occurrence probability matrix into two smaller matrices.
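The counting and normalizing steps (1 to 3 above) can be sketched as follows. This is only the statistics-gathering part, not GloVe's actual optimization; the two-sentence corpus, the window size, and the function names are assumptions chosen to keep the example small.

```python
from collections import defaultdict

def cooccurrence_counts(sentences, window=2):
    # counts[i][j] = number of times word j appears within `window` positions of word i.
    counts = defaultdict(lambda: defaultdict(int))
    for tokens in sentences:
        for pos, word in enumerate(tokens):
            lo = max(0, pos - window)
            hi = min(len(tokens), pos + window + 1)
            for ctx_pos in range(lo, hi):
                if ctx_pos != pos:
                    counts[word][tokens[ctx_pos]] += 1
    return counts

def cooccurrence_probabilities(counts):
    # Normalize each row so that P(j | i) sums to 1 over all context words j.
    probs = {}
    for word, row in counts.items():
        total = sum(row.values())
        probs[word] = {ctx: c / total for ctx, c in row.items()}
    return probs

# Hypothetical toy corpus.
corpus = [
    "ice is solid water".split(),
    "steam is hot water".split(),
]

counts = cooccurrence_counts(corpus, window=3)
probs = cooccurrence_probabilities(counts)

print("P(solid | ice)   =", probs["ice"].get("solid", 0.0))
print("P(solid | steam) =", probs["steam"].get("solid", 0.0))
print("P(water | ice)   =", probs["ice"].get("water", 0.0))
print("P(water | steam) =", probs["steam"].get("water", 0.0))
```

A corpus this small cannot give meaningful statistics; it only shows the mechanics. Each row of `probs` corresponds to one row of the co-occurrence probability matrix that the word vectors are then fitted to.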

Why co-occurrence probabilities?<br>
Given a large corpus, we find that the ratio $P(solid|ice)/P(solid|steam)$ is greater than 1, while $P(water|ice)/P(water|steam)$ is close to 1: "solid" is related to ice but not to steam, whereas "water" is related to both. So the co-occurrence probability matrix already exhibits some of the properties we want to capture.<br>
In fact, one refinement over using raw probability values is to optimize for the ratio of probabilities.<br>
The co-occurrence probability matrix is huge and its values are typically very small, so it is common to work with their logarithms.


# Embeddings for Deep-learning

Distributional Hypothesis: words that occur in the same contexts tend to have similar meanings. If a large collection of sentences is used to learn an embedding, words with common context words tend to get pulled closer and closer together.

How to capture similarities and differences in the same embedding? Add another dimension. The more dimensions we can capture in our word vectors, the more expressive the representation will be.


# t-SNE
t-distributed Stochastic Neighbor Embedding (a dimensionality reduction technique).

It tries to maintain relative distances between objects. It's a great choice for visualizing word embeddings, and it effectively preserves the linear substructures and the relationships that have been learned by the embedding model.

t-SNE also works on other kinds of data, such as images.