# Text and Image Data

If we have text or images in addition to our tabular data, we need to extract useful features from them.

Sometimes we use the extracted features _together with_ the tabular data; other times we model them independently and ensemble the results at the end. This is discussed in the ensemble section.

# Text

There are two main ways to extract features from text:
1. Bag of Words
2. Embeddings (Word2Vec)

## Bag of Words

Create a new column for each unique word in the data, then count the number of occurrences of each word in each piece of text (can also go simpler and just use a binary one-hot encoding, OHE, instead of counting).

Use: `sklearn.feature_extraction.text.CountVectorizer`.

Remember that KNN, neural networks and linear models depend on feature scale (this is perhaps why Chollet just OHE'd the words in his basic example: no scaling to worry about). So we want to post-process the counts to avoid these scaling issues.

Aims:
- Make samples more comparable
- Boost more important features and decrease scale of useless ones

Methods:
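1. **Term Frequency** - Normalize the values so that each row (i.e. each piece of text) sums to 1. Each value then becomes the frequency of a word within its text sample instead of a raw count
```python
# x: dense (n_documents, n_words) count matrix; make each row sum to 1
tf = 1 / x.sum(axis=1)[:, None]
x = x * tf
```
2. **Inverse Document Frequency** - Normalize the data column-wise to boost the more important features. Each column is scaled by the log of the inverse fraction of documents that contain the word. Thus if a word is in every document, e.g. 'a', it will get a very low score
```python
import numpy as np
# Count, per column (axis=0), the documents that contain each word
idf = np.log(x.shape[0] / (x > 0).sum(axis=0))
x = x * idf
```
By taking the log, we decrease the significance of widespread words (as it pulls them closer to 0): a word that appears in every document gets log(1) = 0.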

Use: `sklearn.feature_extraction.text.TfidfVectorizer`.

There are lots of TF-IDF variants worth trying; no single one is best for every task (no free lunch).

### N-Grams

Add columns not just for single words but also for sequences of N consecutive words (or N consecutive chars).

Apparently for char N-grams with N=1 we will have 28 columns (not sure why though, since there are 26 letters in the alphabet). Perhaps they include ! and ?

Note that for N=2 we then have 28 * 28 = 784 possible combinations and thus columns.

Benefits of char N-grams:
- Sometimes better and more efficient to have every possible char N-gram as a feature instead of having a feature for each unique word from the dataset.
- Can handle unseen words

Use: `sklearn.feature_extraction.text.CountVectorizer(ngram_range, analyzer)`

- `ngram_range` - the `(min_n, max_n)` range of n-gram sizes to include
- `analyzer` - switch between word and char n-grams.

### Other preprocessing

1. Lowercase
2. Lemmatization
3. Stemming
4. Stopwords

Without **lowercasing** we would get multiple columns for the same word (e.g. 'Very' and 'very'), which is inefficient. `CountVectorizer` lowercases by default.

**Lemmatization and stemming**
- I had a car --> I have car
- We have cars --> We have car

Both unify very similar words into a single form.

*Stemming* - more heuristic, chops off word endings:
- democracy, democratic, democratization --> democr

*Lemmatization* - more precise, needs knowledge of the vocabulary and morphological analysis of words:
- democracy, democratic, democratization --> democracy

*The Difference*
- If we use stemming on 'saw' we might get just 's'
- If we use lemmatization on 'saw' we get 'see' or 'saw' depending on the context (verb or noun)
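
A small sketch of stemming, assuming NLTK is installed. It uses `PorterStemmer`, which is just one common stemming algorithm and not necessarily the one the course had in mind. Real stemmers often return non-words, and the exact stems differ between algorithms.

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

# A few words from the notes plus two easy cases
words = ["cars", "running", "democracy", "democratic", "democratization"]
stems = {word: stemmer.stem(word) for word in words}

for word, stem in stems.items():
    print(f"{word} --> {stem}")
```

For lemmatization, NLTK offers `nltk.stem.WordNetLemmatizer`, but it needs the WordNet corpus downloaded first (`nltk.download("wordnet")`) and a part-of-speech hint to tell the verb 'saw' from the noun.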

**Stopwords**

Unimportant words that occur a lot in a language:
- Articles/prepositions
- Very common words

Most languages have predefined lists of stopwords that you can find online or get from NLTK (the Natural Language Toolkit library for Python).

Use: `sklearn.feature_extraction.text.CountVectorizer(max_df)`

- `max_df` - ignore terms whose document frequency is strictly higher than the given threshold (e.g. if a word is in 90%+ of the documents, it's probably not very helpful for our task).
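
A minimal sketch of both options on a made-up corpus (`docs` is hypothetical): `max_df` drops words by how widespread they are in this particular dataset, while `stop_words="english"` uses scikit-learn's built-in English list.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus: "the" is in 3 of the 4 documents (75%)
docs = ["the cat sat", "the dog barked", "the bird sang", "fish swim"]

# Drop any term that appears in more than 70% of the documents
by_frequency = CountVectorizer(max_df=0.7).fit(docs)

# Drop terms from the built-in English stopword list instead
by_list = CountVectorizer(stop_words="english").fit(docs)

print(sorted(by_frequency.vocabulary_))
print(sorted(by_list.vocabulary_))
```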

This is the classical feature preprocessing pipeline for text.

## Embeddings (Word2Vec)

Vector representations of words and text, but more concise than bag of words.

Each word is converted to a vector in a learned space of several hundred dimensions.

Arithmetic on these vectors can capture relations between words:

king + woman - man ≈ queen

Super amazing.
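
To make the arithmetic concrete, here is a toy sketch. The 2-D vectors are hand-made for illustration (one axis loosely "royalty", the other "gender"); they are not real Word2Vec embeddings, which have hundreds of dimensions and are learned from data.

```python
import numpy as np

# Hand-made toy vectors (an assumption for illustration, NOT real embeddings)
toy_vectors = {
    "king":  np.array([1.0,  1.0]),
    "queen": np.array([1.0, -1.0]),
    "man":   np.array([0.0,  1.0]),
    "woman": np.array([0.0, -1.0]),
    "apple": np.array([-1.0, 0.0]),
}

def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def nearest(target, vectors, exclude=()):
    """Word whose vector is most similar to `target`, skipping `exclude`."""
    candidates = {w: v for w, v in vectors.items() if w not in exclude}
    return max(candidates, key=lambda w: cosine(target, candidates[w]))

target = toy_vectors["king"] + toy_vectors["woman"] - toy_vectors["man"]
answer = nearest(target, toy_vectors, exclude={"king", "woman", "man"})
print(answer)
```

The query words are excluded from the search, which is also how analogy lookups are usually done with real embeddings.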

There are various embeddings that already exist that we can use:
- **Words** - Word2Vec, GloVe (Global Vectors for Word Representation), FastText, etc.
- **Sentences** - Doc2Vec etc.

Note that embeddings for words and sentences are different. For a sentence we could take the mean/sum of its word vectors or use something else like Doc2Vec. Check both approaches and select the best.
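
A minimal sketch of the mean-of-word-vectors approach. The `toy_embeddings` lookup is hypothetical; in practice it would be a pretrained embedding table. Skipping unknown words and returning zeros for an all-unknown sentence are design choices made here, not something the notes prescribe.

```python
import numpy as np

# Hypothetical 3-D word vectors standing in for a pretrained embedding table
toy_embeddings = {
    "the": np.array([0.1, 0.0, 0.2]),
    "cat": np.array([0.9, 0.3, 0.1]),
    "sat": np.array([0.2, 0.8, 0.4]),
}

def sentence_vector(sentence, embeddings, dim):
    """Average the vectors of the known words; unknown words are skipped."""
    words = sentence.lower().split()
    known = [embeddings[w] for w in words if w in embeddings]
    if not known:
        return np.zeros(dim)  # no known words: fall back to a zero vector
    return np.mean(known, axis=0)

vec = sentence_vector("The cat sat quietly", toy_embeddings, dim=3)
print(vec)
```

The result has the same dimension as a single word vector, so it can be used directly as a fixed-length feature row.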

Training Word2Vec takes ages, so use pretrained embeddings where possible.

### BOW vs. W2V

1. BOW
   1. Very large vectors
   2. Meaning of each value in the vector is known
2. Word2Vec
   1. Relatively small vectors
   2. Values in the vector can be interpreted only in some cases
   3. Words with similar meanings often have similar embeddings

The two methods usually give quite different results, so we can benefit from trying both and seeing which works best.

# Images

CNNs give a compressed representation of the image. 

Each CNN has many layers. There is the output from the final layer but also output from the inner layers. We will call inner layer output: **descriptors**.

The descriptors from later layers are good at solving tasks similar to the one the (original) network was trained on.

Descriptors from earlier layers carry more task-independent information.

E.g. if using a model trained on ImageNet, you can use the last layers for some car model classification task. But they would struggle on specialised medical tasks, e.g. X-ray scans. It is better then to use earlier layers in the network (or even train a network from scratch).
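
One way to pull out descriptors from an inner layer is a PyTorch forward hook. The sketch below assumes PyTorch and uses a tiny untrained stand-in network (`tiny_cnn`, hypothetical) so that it runs without downloading weights; with a real pretrained model you would pass that model and one of its layers instead.

```python
import torch
from torch import nn

def extract_descriptor(model, layer, images):
    """Run `images` through `model` and return the flattened output of `layer`."""
    captured = {}

    def hook(module, inputs, output):
        captured["out"] = output.detach()

    handle = layer.register_forward_hook(hook)
    try:
        model.eval()
        with torch.no_grad():
            model(images)
    finally:
        handle.remove()  # always detach the hook again
    return captured["out"].flatten(start_dim=1)

# Tiny stand-in CNN (an assumption for illustration, not a pretrained network)
tiny_cnn = nn.Sequential(
    nn.Conv2d(3, 4, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.AdaptiveAvgPool2d(2),
    nn.Flatten(),
    nn.Linear(16, 10),
)
images = torch.randn(2, 3, 8, 8)  # batch of 2 random "images"

early = extract_descriptor(tiny_cnn, tiny_cnn[2], images)  # inner-layer descriptor
late = extract_descriptor(tiny_cnn, tiny_cnn[4], images)   # final-layer output
print(early.shape, late.shape)
```

The resulting 2-D arrays (one row per image) can then be fed to any standalone model, just like tabular features.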

We want to look for a pretrained model that was trained on data similar to the data in the competition at hand.

We can also adapt a pretrained model so that it suits our needs better. This is called **fine-tuning**.

Fine-tuning, especially on small datasets, is usually better than either training a standalone model on descriptors or training a network from scratch.

Fine-tuning is better than training a standalone model on descriptors because it lets us tune all of the network's parameters and thus extract more effective image representations.

Fine-tuning is better than training from scratch if we have too little data or if the task we are solving is similar to the task the network was originally trained on: we can use knowledge already encoded in the network's parameters, which gives better results and faster training.

One example:
Fine-tune VGG-16. It was originally trained on ImageNet, which has 1,000 classes, so its last layer has 1,000 outputs. We can replace that last layer with a 4-output layer (as it was a multiclass classification problem with 4 categories), then retrain with a learning rate 1,000x smaller than the original.
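
A rough PyTorch sketch of that recipe, under stated assumptions: the helper names are hypothetical, `base_lr=1e-2` is a made-up stand-in for the "original" learning rate, and SGD with momentum is just one reasonable optimizer choice. `load_vgg16_with_new_head` downloads ImageNet weights via torchvision when called, so the runnable part uses a tiny stand-in model instead.

```python
import torch
from torch import nn

def replace_classifier_head(model, num_classes):
    """Swap the last Linear layer of `model.classifier` for a fresh one."""
    old_head = model.classifier[-1]
    model.classifier[-1] = nn.Linear(old_head.in_features, num_classes)
    return model

def make_finetune_optimizer(model, base_lr=1e-2, lr_scale=1e-3):
    """SGD with a learning rate 1,000x smaller than the assumed original one."""
    return torch.optim.SGD(model.parameters(), lr=base_lr * lr_scale, momentum=0.9)

def load_vgg16_with_new_head(num_classes=4):
    """Pretrained VGG-16 with a new head (downloads weights when called)."""
    from torchvision import models
    vgg = models.vgg16(weights=models.VGG16_Weights.IMAGENET1K_V1)
    return replace_classifier_head(vgg, num_classes)

class TinyNet(nn.Module):
    """Hypothetical stand-in with the same `.classifier` layout idea as VGG."""
    def __init__(self):
        super().__init__()
        self.classifier = nn.Sequential(
            nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 1000)
        )

    def forward(self, x):
        return self.classifier(x)

model = replace_classifier_head(TinyNet(), num_classes=4)
optimizer = make_finetune_optimizer(model)
logits = model(torch.randn(5, 8))
print(logits.shape)
```

From here a normal training loop applies; the small learning rate keeps the pretrained weights from being overwritten too quickly.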

As before, it is better to use a model pretrained on a similar dataset.

Both TensorFlow and PyTorch have plenty of material on fine-tuning.

### Increase # Training Images

Use image augmentation.

For example, if you add a copy of each image rotated by 180 degrees, you double the number of training images you have.

Not as good as getting new data but way better than nothing. Can add:
- Crops
- Rotation
- Adding noise

Reduces overfitting and gives more robust models.
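
A minimal NumPy sketch of the three augmentations listed above. The random `image` is a hypothetical stand-in for a real H x W x C image scaled to [0, 1], and the crop size and noise level are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)
image = rng.random((32, 32, 3))  # hypothetical image, values in [0, 1]

def rotate_180(img):
    """Rotate by 180 degrees in the height/width plane."""
    return np.rot90(img, k=2, axes=(0, 1))

def random_crop(img, size, rng):
    """Cut a random size x size window out of the image."""
    h, w = img.shape[:2]
    top = rng.integers(0, h - size + 1)
    left = rng.integers(0, w - size + 1)
    return img[top:top + size, left:left + size]

def add_noise(img, sigma, rng):
    """Add Gaussian noise and clip back to the valid [0, 1] range."""
    noisy = img + rng.normal(0.0, sigma, size=img.shape)
    return np.clip(noisy, 0.0, 1.0)

augmented = [
    rotate_180(image),
    random_crop(image, 28, rng),
    add_noise(image, 0.05, rng),
]
print([a.shape for a in augmented])
```

In practice, the crop would usually be resized back to the network's input size before training.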

Can be used on both training and test data. For the latter, we predict on several augmented versions of the same sample and average the predictions to reduce variance.
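
A small sketch of averaging predictions at test time. `toy_predict` is a hypothetical stand-in for a trained model's predict function, chosen so that its output changes when the image is flipped; only the averaging logic is the point here.

```python
import numpy as np

def predict_with_tta(predict_fn, image, augmentations):
    """Average predictions over the original image and its augmented views."""
    views = [image] + [augment(image) for augment in augmentations]
    predictions = np.stack([predict_fn(view) for view in views])
    return predictions.mean(axis=0)

def toy_predict(img):
    """Hypothetical 'model': scores = mean brightness of top and bottom halves."""
    half = img.shape[0] // 2
    return np.array([img[:half].mean(), img[half:].mean()])

# Image that is bright on top and dark at the bottom
image = np.zeros((4, 4))
image[:2] = 1.0

single = toy_predict(image)
averaged = predict_with_tta(toy_predict, image, [np.flipud])
print(single, averaged)
```

Only augmentations that preserve the label should go into `augmentations`.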

Must be careful though. Can't just augment at random. For example, in roof classification one of the categories was a North-South orientation, so rotating by 90 degrees would turn this image into an East-West orientation image with the wrong label.

If fine-tuning (or training from scratch), you must use the labels of the images in the training set. Be careful with validation here, as overfitting can easily occur. Not sure what he means by this; presumably that, because the features are now learned from the training labels, the validation images must be kept out of the fine-tuning step for the validation score to stay honest.

### Summary

1. Use pretrained CNNs to extract features (depending on original training data, we will favour different layers e.g. earlier or later)
2. Careful choice of pretrained network can help (was it originally trained on a task similar to the competition?)
3. Fine-tuning lets us refine pretrained models
4. Data augmentation is massively helpful

