# Representing text

If we want to solve Natural Language Processing (NLP) tasks with neural networks, we need some way to represent text as tensors. Computers already represent characters as numbers, using encodings such as ASCII or UTF-8 to map those numbers to the letters on your screen.

![Image showing diagram mapping a character to an ASCII and binary representation](notebooks/images/ascii-character-map.png)
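
As a small illustration of this character-to-number mapping, the sketch below uses Python's built-in `ord` and `str.encode`. The sample strings are arbitrary and chosen only for the example.

```python
# Each character maps to an integer code point; an encoding such as UTF-8
# then turns that code point into one or more bytes.
text = "Hi"  # arbitrary sample string

code_points = [ord(ch) for ch in text]            # Unicode code points
utf8_bytes = list(text.encode("utf-8"))           # byte values under UTF-8
binary = [format(b, "08b") for b in utf8_bytes]   # the same bytes written in binary

print(code_points)  # [72, 105]
print(utf8_bytes)   # [72, 105]
print(binary)       # ['01001000', '01101001']

# Characters outside ASCII take more than one byte in UTF-8.
accented = "\u00e9"  # the letter e with an acute accent
accented_bytes = list(accented.encode("utf-8"))
print(accented_bytes)  # [195, 169]
```

For plain ASCII text the code points and the UTF-8 bytes coincide, which is why the first two printed lists are identical.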

As humans, we understand what each letter **represents**, and how characters come together to form the words of a sentence. Computers have no such understanding, so a neural network has to learn the meaning of a sentence during training.

We can use different approaches when representing text:
* **Character-level representation**, where each character is treated as a number. If our text corpus contains $C$ distinct characters, the word *Hello* can be represented by a tensor of shape $C \times 5$, in which each letter corresponds to one one-hot encoded column.
* **Word-level representation**, in which we create a **vocabulary** of all words in our text and then represent each word using one-hot encoding. This approach is often preferable to character-level representation, because a single letter carries little meaning on its own. By using higher-level semantic units (words), we simplify the task for the neural network. However, with a large vocabulary we have to deal with high-dimensional sparse tensors.
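
To make the two representations concrete, here is a minimal pure-Python sketch of one-hot encoding at both levels. The toy corpus, the sample sentence, and the helper name `one_hot_columns` are assumptions made for illustration; a real pipeline would more likely use TensorFlow preprocessing utilities.

```python
def one_hot_columns(indices, size):
    """Return a size x len(indices) matrix with one one-hot column per index."""
    return [[1 if idx == row else 0 for idx in indices] for row in range(size)]

# Character level: every distinct character in the (toy) corpus gets an index.
corpus = "Hello world"
chars = sorted(set(corpus))
char_to_idx = {ch: i for i, ch in enumerate(chars)}
C = len(chars)

hello = one_hot_columns([char_to_idx[ch] for ch in "Hello"], C)
print(f"character level: {C} x {len(hello[0])}")  # character level: 8 x 5

# Word level: every distinct word in the (toy) text gets an index.
sentence = "the cat sat on the mat"
tokens = sentence.split()
vocab = sorted(set(tokens))
word_to_idx = {word: i for i, word in enumerate(vocab)}
V = len(vocab)

encoded = one_hot_columns([word_to_idx[word] for word in tokens], V)
print(f"word level: {V} x {len(encoded[0])}")  # word level: 5 x 6
```

With a realistic vocabulary of tens of thousands of words, each column would contain a single 1 among tens of thousands of zeros, which is the sparsity issue mentioned above.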

Let's start by installing some required Python packages we'll use in this module.

# Text classification task

In this module, we will start with a simple text classification task based on the **[AG_NEWS](http://www.di.unipi.it/~gulli/AG_corpus_of_news_articles.html)** dataset: we'll classify news headlines into one of 4 categories: World, Sports, Business and Sci/Tech. To load the dataset, we will use the **[TensorFlow Datasets](https://www.tensorflow.org/datasets)** API.

> In the sandbox environment, we need to pre-fetch the dataset from a known location before loading it with TensorFlow Datasets. If you're running in your local environment, you can skip the next cell, and the TensorFlow Datasets library will download the data automatically.

In [None]:
!cd ~ && wget -q -O - https://mslearntensorflowlp.blob.core.windows.net/data/tfds-ag-news.tgz | tar xz

In [4]:
pip uninstall -y tensorflow tensorflow-datasets array-record etils protobuf

Found existing installation: tensorflow-datasets 4.9.0
Uninstalling tensorflow-datasets-4.9.0:
  Successfully uninstalled tensorflow-datasets-4.9.0
Found existing installation: array-record 0.4.1
Uninstalling array-record-0.4.1:
  Successfully uninstalled array-record-0.4.1
Found existing installation: etils 1.13.0
Uninstalling etils-1.13.0:
  Successfully uninstalled etils-1.13.0
Found existing installation: protobuf 6.33.0
Uninstalling protobuf-6.33.0:
  Successfully uninstalled protobuf-6.33.0
Note: you may need to restart the kernel to use updated packages.


ERROR: Exception:
Traceback (most recent call last):
  File "c:\Users\waqkh\AppData\Local\Programs\Python\Python312\Lib\site-packages\pip\_internal\cli\base_command.py", line 180, in exc_logging_wrapper
    status = run_func(*args)
             ^^^^^^^^^^^^^^^
  File "c:\Users\waqkh\AppData\Local\Programs\Python\Python312\Lib\site-packages\pip\_internal\commands\uninstall.py", line 110, in run
    uninstall_pathset.commit()
  File "c:\Users\waqkh\AppData\Local\Programs\Python\Python312\Lib\site-packages\pip\_internal\req\req_uninstall.py", line 432, in commit
    self._moved_paths.commit()
  File "c:\Users\waqkh\AppData\Local\Programs\Python\Python312\Lib\site-packages\pip\_internal\req\req_uninstall.py", line 278, in commit
    save_dir.cleanup()
  File "c:\Users\waqkh\AppData\Local\Programs\Python\Python312\Lib\site-packages\pip\_internal\utils\temp_dir.py", line 173, in cleanup
    rmtree(self._path)
  File "c:\Users\waqkh\AppData\Local\Programs\Python\Python312\Lib\site-packages\pi

In [None]:
# Check whether array-record is uninstalled (findstr is Windows-specific)
pip list | findstr array-record

Note: you may need to restart the kernel to use updated packages.



[notice] A new release of pip is available: 23.2.1 -> 25.2
[notice] To update, run: python.exe -m pip install --upgrade pip


In [2]:
pip install tensorflow==2.16.1

Collecting tensorflow==2.16.1
  Obtaining dependency information for tensorflow==2.16.1 from https://files.pythonhosted.org/packages/76/4f/39ddae9fb07b8c039fa5a5f2b6623c6e0564199d82da33fcef62bcf93174/tensorflow-2.16.1-cp312-cp312-win_amd64.whl.metadata
  Downloading tensorflow-2.16.1-cp312-cp312-win_amd64.whl.metadata (3.5 kB)
Collecting tensorflow-intel==2.16.1 (from tensorflow==2.16.1)
  Obtaining dependency information for tensorflow-intel==2.16.1 from https://files.pythonhosted.org/packages/14/5a/0e2c734acb91d22fa67ccb7f0cc869e24c418486aaba3d7ca8cad158d5a0/tensorflow_intel-2.16.1-cp312-cp312-win_amd64.whl.metadata
  Downloading tensorflow_intel-2.16.1-cp312-cp312-win_amd64.whl.metadata (5.0 kB)
Collecting ml-dtypes~=0.3.1 (from tensorflow-intel==2.16.1->tensorflow==2.16.1)
  Obtaining dependency information for ml-dtypes~=0.3.1 from https://files.pythonhosted.org/packages/47/f3/847da54c3d243ff2aa778078ecf09da199194d282744718ef325dd8afd41/ml_dtypes-0.3.2-cp312-cp312-win_amd64.whl.me



In [14]:
pip install tensorflow-datasets==4.9.6

Collecting tensorflow-datasets==4.9.6
  Obtaining dependency information for tensorflow-datasets==4.9.6 from https://files.pythonhosted.org/packages/8f/50/52fa3d41d20c687d81f66338bc1b0e71a27a3390ecfa8f5bc212a10135e1/tensorflow_datasets-4.9.6-py3-none-any.whl.metadata
  Downloading tensorflow_datasets-4.9.6-py3-none-any.whl.metadata (9.5 kB)
Collecting immutabledict (from tensorflow-datasets==4.9.6)
  Obtaining dependency information for immutabledict from https://files.pythonhosted.org/packages/63/7b/04ab6afa1ff7eb9ccb09049918c0407b205f5009092c0416147d163e4e2b/immutabledict-4.2.2-py3-none-any.whl.metadata
  Downloading immutabledict-4.2.2-py3-none-any.whl.metadata (3.5 kB)
Collecting pyarrow (from tensorflow-datasets==4.9.6)
  Obtaining dependency information for pyarrow from https://files.pythonhosted.org/packages/71/30/f3795b6e192c3ab881325ffe172e526499eb3780e306a15103a2764916a2/pyarrow-21.0.0-cp312-cp312-win_amd64.whl.metadata
  Downloading pyarrow-21.0.0-cp312-cp312-win_amd64.whl.m



In [13]:
pip install protobuf==4.25.3

Collecting protobuf==4.25.3
  Obtaining dependency information for protobuf==4.25.3 from https://files.pythonhosted.org/packages/ad/6e/1bed3b7c904cc178cb8ee8dbaf72934964452b3de95b7a63412591edb93c/protobuf-4.25.3-cp310-abi3-win_amd64.whl.metadata
  Downloading protobuf-4.25.3-cp310-abi3-win_amd64.whl.metadata (541 bytes)
Downloading protobuf-4.25.3-cp310-abi3-win_amd64.whl (413 kB)
Installing collected packages: protobuf
  Attempting uninstall: protobuf
    Found existing installation: protobuf 3.20.3
    Uninstalling protob



In [12]:
pip install etils==1.7.0

Collecting etils==1.7.0
  Obtaining dependency information for etils==1.7.0 from https://files.pythonhosted.org/packages/37/10/dd5b124f037a636783e416a2fe839edd7ec63c0dce7ce4f3c1da029aeb80/etils-1.7.0-py3-none-any.whl.metadata
  Downloading etils-1.7.0-py3-none-any.whl.metadata (6.4 kB)
Downloading etils-1.7.0-py3-none-any.whl (152 kB)
Installing collected packages: etils
  Attempting uninstall: etils
    Found existing installation: etils 1.5.2
    Uninstalling etils-1.5.2:
      Successfully uninstalled etils-1.5.2
Successfully installed etils-1.7.0
Note: you may need to restart the kernel to use updated pack



In [1]:
import tensorflow as tf
import tensorflow_datasets as tfds

print(tf.__version__)
print(tfds.__version__)

2.16.1
4.9.6
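
The install cells pin specific versions, so it can help to confirm all of them in one place. The sketch below uses only the standard-library `importlib.metadata`. The helper names `installed_version` and `find_mismatches` are hypothetical, and the pins are copied from the install commands in this notebook.

```python
from importlib import metadata

# Versions pinned by the install cells in this notebook.
PINS = {
    "tensorflow": "2.16.1",
    "tensorflow-datasets": "4.9.6",
    "protobuf": "4.25.3",
    "etils": "1.7.0",
}

def installed_version(package):
    """Return the installed version string, or None if the package is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None

def find_mismatches(pins, lookup=installed_version):
    """Return {package: (expected, found)} for every pin that is not satisfied."""
    mismatches = {}
    for package, expected in pins.items():
        found = lookup(package)
        if found != expected:
            mismatches[package] = (expected, found)
    return mismatches

for package, (expected, found) in find_mismatches(PINS).items():
    print(f"{package}: expected {expected}, found {found}")
```

If nothing is printed, every pinned package is installed at the expected version. Remember to restart the kernel after installing, as pip suggests, before importing the packages.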


In [3]:
import tensorflow as tf
import sys
from tensorflow import keras
import tensorflow_datasets as tfds

# In this tutorial, we will be training a lot of models. To avoid reserving all GPU memory
# up front, we ask TensorFlow to grow its GPU memory allocation only when required.
physical_devices = tf.config.list_physical_devices('GPU')
if physical_devices:
    tf.config.experimental.set_memory_growth(physical_devices[0], True)

dataset = tfds.load('ag_news_subset')
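
Once the dataset is loaded, a quick look at a few records is a useful sanity check. The sketch below is a hedged illustration, not part of the original notebook. It assumes that each record is a dictionary with `title`, `description`, and `label` fields, that text fields arrive as UTF-8 bytes (as they do when iterating with `tfds.as_numpy`), and that labels follow the order World, Sports, Business, Sci/Tech. The helper `format_record` and the sample record are made up for the example.

```python
# Assumed label order for the four categories named above.
CLASS_NAMES = ["World", "Sports", "Business", "Sci/Tech"]

def format_record(record, class_names=CLASS_NAMES):
    """Render one record as '[Category] title' (hypothetical helper)."""
    title = record["title"].decode("utf-8")
    label = class_names[int(record["label"])]
    return f"[{label}] {title}"

# Placeholder record, NOT taken from the dataset; it only mimics the assumed schema.
sample = {
    "title": b"Example headline",
    "description": b"Example description.",
    "label": 1,
}
print(format_record(sample))  # [Sports] Example headline

# With the real dataset, the same helper could be applied like this:
# import tensorflow_datasets as tfds
# for record in tfds.as_numpy(dataset["train"].take(5)):
#     print(format_record(record))
```

If the printed categories do not match the headlines, the assumed label order is the first thing to check against the dataset metadata.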