- This lecture is about text preprocessing with the Keras API.
- We're just using a library for something we've already been doing by hand.
- Keras simply makes the steps easier.
- In fact, the API is convenient enough that PyTorch users sometimes use Keras preprocessing too, because (according to the lecture) the PyTorch equivalent is much less pleasant to use.
- Easier but less flexible (ex: index 0 is reserved for padding, so you can't assign it to a word).
- This makes the embedding matrix slightly inefficient, because it contains one row that never represents a word.
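
To make the "useless row" point concrete, here is a minimal pure-Python sketch. The toy vocabulary, the embedding size, and the random initialisation are all assumptions chosen for illustration; a real model would use a framework embedding layer instead.

```python
import random

# Hypothetical toy vocabulary: indices start at 1, index 0 is reserved for padding.
word_index = {"i": 1, "like": 2, "cats": 3}
embedding_dim = 4  # assumed size, for illustration only

# The matrix needs one row per possible index, including index 0.
num_rows = max(word_index.values()) + 1

random.seed(0)
embedding_matrix = [
    [random.uniform(-1, 1) for _ in range(embedding_dim)]
    for _ in range(num_rows)
]

# Looking up a padded sequence: row 0 is only ever hit by padding positions.
padded = [0, 1, 2, 3]
vectors = [embedding_matrix[i] for i in padded]

print(len(word_index), num_rows)  # 3 4
```

With 3 words the matrix has 4 rows, so the overhead is a single row regardless of vocabulary size.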

# Text Preprocessing Keras API Code Preparation

### Remember the steps:
- Tokenize the text: convert each doc in the corpus into a list of tokens. Each token may be a whole word, part of a word, or a single character.
- Assign an integer ID to each unique token, because the integer represents a position in a feature vector.
- From now on we'll use word embeddings rather than tf-idf, but we still need to map tokens to integers.
- In most of DL we won't use bag of words; instead we'll work with full sequences.
- NEW STEP: after mapping each token to an integer, convert the document into a sequence of integers.
- ex: "I like cats" -> [407, 634, 27] (a list of integers)
- Why is this helpful? Integers take up less memory than strings.
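
The three steps above can be sketched by hand in a few lines of plain Python. This is only an illustration: the whitespace tokenizer and the names `tokenize` and `word2idx` are assumptions, not what Keras does internally.

```python
def tokenize(doc):
    """Lowercase and split on whitespace (a deliberately naive tokenizer)."""
    return doc.lower().split()


corpus = ["I like cats", "I like dogs"]

# Step 1: tokenize every document.
tokenized = [tokenize(doc) for doc in corpus]

# Step 2: assign an integer ID to each unique token, starting at 1.
word2idx = {}
for tokens in tokenized:
    for token in tokens:
        if token not in word2idx:
            word2idx[token] = len(word2idx) + 1

# Step 3: convert each document into a sequence of integers.
sequences = [[word2idx[token] for token in tokens] for tokens in tokenized]

print(word2idx)   # {'i': 1, 'like': 2, 'cats': 3, 'dogs': 4}
print(sequences)  # [[1, 2, 3], [1, 2, 4]]
```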


- As usual, you fit only on the train set and transform both train and test.
- You pass the documents to the tokenizer either as a list of strings or as a list of token lists:
- ["I am doc 1", "I am doc 2", ... ]
or
- [["I","am", "doc", "1"], ["I","am", "doc", "2"], ...]
- So although it is called a tokenizer, it does much more: it builds the token-to-integer mapping, converts each doc into a list of integers, and even works when the input is already tokenized. The name is a bit misleading.
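
A rough sketch of the fit-on-train, transform-both pattern is given below. `SimpleTokenizer` is a hypothetical stand-in written for these notes, not the Keras class; it accepts both input styles and silently skips tokens it never saw during fitting, which, as far as I know, is also what the Keras `Tokenizer` does when no `oov_token` is set.

```python
class SimpleTokenizer:
    """Hypothetical stand-in for the fit/transform pattern (not the Keras class)."""

    def __init__(self):
        self.word_index = {}

    @staticmethod
    def _tokens(doc):
        # Accept either a raw string or an already-tokenized list of strings.
        if isinstance(doc, str):
            return doc.lower().split()
        return [token.lower() for token in doc]

    def fit(self, docs):
        # Build the vocabulary; indices start at 1 so that 0 stays free for padding.
        for doc in docs:
            for token in self._tokens(doc):
                if token not in self.word_index:
                    self.word_index[token] = len(self.word_index) + 1
        return self

    def transform(self, docs):
        # Tokens that were never seen during fit are skipped.
        return [
            [self.word_index[t] for t in self._tokens(doc) if t in self.word_index]
            for doc in docs
        ]


train = ["I am doc 1", "I am doc 2"]   # list of strings
test = [["I", "am", "doc", "3"]]       # list of token lists

tok = SimpleTokenizer().fit(train)     # fit on train only
train_seqs = tok.transform(train)
test_seqs = tok.transform(test)

print(train_seqs)  # [[1, 2, 3, 4], [1, 2, 3, 5]]
print(test_seqs)   # [[1, 2, 3]]  ("3" was not in the train set)
```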

### Tokenizer output:
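- sentences = ["I like eggs and ham.", "I love chocolates and bunnies.", "I hate onions"] becomes
- sequences = [[1,2,3,4,5], [1,6,7,4,8], [1,9,10]] (illustrative numbering from the lecture; Keras assigns indices by descending word frequency, so the actual output in the exercise differs)
- How do you know which word goes with which integer?
- Answer: tokenizer.word_index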
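
The sketch below imitates how a frequency-ordered `word_index` can be built and then inverted to decode sequences back into words. It is plain Python, not the Keras implementation, and it assumes the sentences are already lowercased with punctuation removed (the Keras `Tokenizer` does both by default).

```python
from collections import Counter

# Assumed to be pre-cleaned: lowercase, no punctuation.
sentences = [
    "i like eggs and ham",
    "i love chocolates and bunnies",
    "i hate onions",
]

counts = Counter()
for sentence in sentences:
    counts.update(sentence.split())

# Most frequent words get the smallest indices; ties keep first-seen order.
ordered = [word for word, _ in counts.most_common()]
word_index = {word: i for i, word in enumerate(ordered, start=1)}

# Invert the mapping to go from integers back to words.
index_word = {i: word for word, i in word_index.items()}

sequences = [[word_index[w] for w in s.split()] for s in sentences]
decoded = [" ".join(index_word[i] for i in seq) for seq in sequences]

print(sequences)  # [[1, 3, 4, 2, 5], [1, 6, 7, 2, 8], [1, 9, 10]]
print(decoded[2])  # i hate onions
```

Because "i" appears three times and "and" twice, they take indices 1 and 2, which is why the real output is not simply numbered in reading order.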

### Constructor arguments:

### Stemming? Lemmatization? Stopwords?
- You normally don't want these in DL.
- ex: if you're trying to generate text, the output will be unnatural if the model can't generate stopwords.
- ex: in translation, "running" and "I ran" should be translated differently, so lemmatization isn't a good idea either.

### Text to Matrix
- This is another method the instructor covers as a bonus.
- It lets you convert your text to tf-idf, count vectors, or binary vectors, as we did earlier in the course.
- It won't be needed in DL, where we want to keep the text as a sequence, but it's good to know it exists.
- That's all he said about it :D
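
For reference, the Keras `Tokenizer` exposes this as `texts_to_matrix` with a `mode` argument. The sketch below is a hand-written approximation of the count and binary modes only, working from integer sequences; `sequences_to_matrix` here is a hypothetical helper written for these notes, and the example sequences are made up.

```python
def sequences_to_matrix(sequences, num_words, mode="count"):
    """Hypothetical helper: one row per document, one column per word index."""
    matrix = [[0] * num_words for _ in sequences]
    for row, seq in zip(matrix, sequences):
        for idx in seq:
            if idx >= num_words:
                continue  # ignore indices outside the vocabulary
            if mode == "binary":
                row[idx] = 1   # word present or not
            else:
                row[idx] += 1  # number of occurrences
    return matrix


sequences = [[1, 3, 4, 2, 5], [1, 1, 2]]
counts = sequences_to_matrix(sequences, num_words=6, mode="count")
binary = sequences_to_matrix(sequences, num_words=6, mode="binary")

print(counts)  # [[0, 1, 1, 1, 1, 1], [0, 2, 1, 0, 0, 0]]
print(binary)  # [[0, 1, 1, 1, 1, 1], [0, 1, 1, 0, 0, 0]]
```

Column 0 stays empty because no word is ever assigned index 0. Note that word order is lost in this representation, which is why it is not used for sequence models.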

## Looking Ahead: Padding
- unpadded sequences = [[1,2,3,4,5], [1,6,7,4,8], [1,9,10]]
- padded sequences = [[1,2,3,4,5], [1,6,7,4,8], [0,0,1,9,10]]
- Padding is required because libraries like TensorFlow need all sequences to have the same length.
- NumPy has no jagged arrays, but text comes in different lengths: not all sentences or documents are equally long.
- You won't see why this is true until CNNs, but trust it for now.

- Looking ahead a bit again: we typically add the zeros at the front. RNNs have trouble remembering things from too far in the past, so front padding keeps the most useful information near the end. We'll explore other options too.
- So it turns out to be a good thing that no word gets index 0: we can use 0 for padding.
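
Here is a small sketch of front ("pre") versus back ("post") padding in plain Python. The `pad` function is a hypothetical helper that only pads to the longest sequence; it is meant to show the idea, not to replace `pad_sequences`.

```python
def pad(sequences, padding="pre", value=0):
    """Pad every sequence to the length of the longest one (no truncation)."""
    max_len = max(len(seq) for seq in sequences)
    padded = []
    for seq in sequences:
        fill = [value] * (max_len - len(seq))
        if padding == "pre":
            padded.append(fill + list(seq))   # zeros at the front
        else:
            padded.append(list(seq) + fill)   # zeros at the end
    return padded


sequences = [[1, 2, 3, 4, 5], [1, 6, 7, 4, 8], [1, 9, 10]]

print(pad(sequences))
# [[1, 2, 3, 4, 5], [1, 6, 7, 4, 8], [0, 0, 1, 9, 10]]
print(pad(sequences, padding="post"))
# [[1, 2, 3, 4, 5], [1, 6, 7, 4, 8], [1, 9, 10, 0, 0]]
```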

### Shapes
- As usual, it's good to think in terms of shapes.
- A NumPy array or TensorFlow tensor containing padded sequences of integers that represent sentences has shape N x T.
- N = number of documents, T = document length (after padding).
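
As a quick sanity check on shapes, the sketch below computes (N, T) for a list of padded sequences and complains if the rows are ragged. The `shape` helper is hypothetical; with NumPy or TensorFlow you would simply read the array's `.shape` attribute.

```python
def shape(padded):
    """Return (N, T) for a rectangular list of lists; raise if rows are ragged."""
    n = len(padded)
    lengths = {len(row) for row in padded}
    if len(lengths) != 1:
        raise ValueError(f"ragged sequences, lengths found: {sorted(lengths)}")
    return n, lengths.pop()


padded = [[1, 3, 4, 2, 5], [1, 6, 7, 2, 8], [0, 0, 1, 9, 10]]
print(shape(padded))  # (3, 5) -> N = 3 documents, T = 5 tokens each
```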

## Exercise - Text Preprocessing with TensorFlow
- I couldn't find the notebook for this on the website.

In [1]:
import tensorflow as tf
print(tf.__version__)

2.12.0


In [3]:
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

# Note: shortening the imports above to `from tf.keras...` fails with "No module named 'tf'",
# because `tf` is only an alias inside this session, not an importable module name.

In [4]:
#just a simple text
sentences= ["I like eggs and ham.", 
            "I love chocolates and bunnies.", 
            "I hate onions."
           ] 

In [5]:
MAX_VOCAB_SIZE = 20000 # probably large enough; per the lecture, about 3000 words already cover roughly 95% of the text
tokenizer = Tokenizer(num_words = MAX_VOCAB_SIZE)
tokenizer.fit_on_texts(sentences)
sequences = tokenizer.texts_to_sequences(sentences)
print(sequences)

[[1, 3, 4, 2, 5], [1, 6, 7, 2, 8], [1, 9, 10]]


In [6]:
#get word to index mapping
tokenizer.word_index

# as you can see, indices start at 1; zero is reserved for padding

{'i': 1,
 'and': 2,
 'like': 3,
 'eggs': 4,
 'ham': 5,
 'love': 6,
 'chocolates': 7,
 'bunnies': 8,
 'hate': 9,
 'onions': 10}

In [7]:
# first, pad the sequences using all the default arguments
data = pad_sequences(sequences)
print(data)

# by default, the maximum sequence length is set to the length of the longest sentence

[[ 1  3  4  2  5]
 [ 1  6  7  2  8]
 [ 0  0  1  9 10]]


In [8]:
MAX_SEQUENCE_LENGTH = 5
data = pad_sequences(sequences, maxlen = MAX_SEQUENCE_LENGTH ) #we can pass explicitly the max len
print(data)

[[ 1  3  4  2  5]
 [ 1  6  7  2  8]
 [ 0  0  1  9 10]]


In [9]:
data = pad_sequences(sequences, maxlen = MAX_SEQUENCE_LENGTH, padding ='post') # pad at the end instead of the front
print(data)

[[ 1  3  4  2  5]
 [ 1  6  7  2  8]
 [ 1  9 10  0  0]]


In [10]:
data = pad_sequences(sequences, maxlen = 6 ) #too much padding
print(data)

[[ 0  1  3  4  2  5]
 [ 0  1  6  7  2  8]
 [ 0  0  0  1  9 10]]


In [11]:
data = pad_sequences(sequences, maxlen = 4 ) # maxlen too small == truncation: the beginning of longer sentences is cut off
# makes sense because NNs typically pay more attention to the final values of a sequence anyway
print(data)

[[ 3  4  2  5]
 [ 6  7  2  8]
 [ 0  1  9 10]]


In [13]:
data = pad_sequences(sequences, maxlen = 4, truncating ='post' ) # truncating='post' cuts from the end instead
print(data)

[[ 1  3  4  2]
 [ 1  6  7  2]
 [ 0  1  9 10]]
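
To summarise how `maxlen`, `padding`, and `truncating` interact, here is a pure-Python sketch that reproduces the outputs of the last three cells. `pad_and_truncate` is a hypothetical re-implementation written for these notes (it returns lists rather than a NumPy array and assumes `maxlen >= 1`); it is not the Keras source.

```python
def pad_and_truncate(sequences, maxlen, padding="pre", truncating="pre", value=0):
    """Hypothetical sketch of the pad/truncate logic; assumes maxlen >= 1."""
    result = []
    for seq in sequences:
        seq = list(seq)
        if len(seq) > maxlen:
            # 'pre' drops tokens from the start, 'post' drops them from the end.
            seq = seq[-maxlen:] if truncating == "pre" else seq[:maxlen]
        fill = [value] * (maxlen - len(seq))
        # 'pre' puts the zeros at the front, 'post' puts them at the end.
        result.append(fill + seq if padding == "pre" else seq + fill)
    return result


sequences = [[1, 3, 4, 2, 5], [1, 6, 7, 2, 8], [1, 9, 10]]

print(pad_and_truncate(sequences, maxlen=6))
# [[0, 1, 3, 4, 2, 5], [0, 1, 6, 7, 2, 8], [0, 0, 0, 1, 9, 10]]
print(pad_and_truncate(sequences, maxlen=4))
# [[3, 4, 2, 5], [6, 7, 2, 8], [0, 1, 9, 10]]
print(pad_and_truncate(sequences, maxlen=4, truncating="post"))
# [[1, 3, 4, 2], [1, 6, 7, 2], [0, 1, 9, 10]]
```

Padding and truncation are independent settings: a sequence is either too long (truncated) or too short (padded), never both, so the two arguments can be chosen separately.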
