# File Encoding (and decoding) with Python

**Week01, Section 02**

ISM6564 Fall 2023

&copy; 2023 Dr. Tim Smith

-------


## 1.2.0 Introduction

In this section, we will learn how to read and write text files with various encodings in Python. We will also learn how to specify the encoding during file reading and writing, address encoding errors effectively, and identify the encoding of a file.

**Section Objectives:**

* Utilize Python programming language for reading and writing text files with various encodings
* Implement encoding/decoding in Python
* Know how to specify the encoding during file reading and writing
* Address encoding errors effectively
* Identify the encoding of a file


## 1.2.1 ASCII Encoding

Let's look at an example of ASCII encoding. ASCII defines 128 characters (codes 0 to 127); the following code builds a list of the first 127 of them (codes 0 to 126).

In [1]:
# create a list of ascii characters
ascii_list = [chr(i) for i in range(127)] # range(127) yields the integers 0 to 126; chr() converts each one to its ascii character

print(ascii_list)

['\x00', '\x01', '\x02', '\x03', '\x04', '\x05', '\x06', '\x07', '\x08', '\t', '\n', '\x0b', '\x0c', '\r', '\x0e', '\x0f', '\x10', '\x11', '\x12', '\x13', '\x14', '\x15', '\x16', '\x17', '\x18', '\x19', '\x1a', '\x1b', '\x1c', '\x1d', '\x1e', '\x1f', ' ', '!', '"', '#', '$', '%', '&', "'", '(', ')', '*', '+', ',', '-', '.', '/', '0', '1', '2', '3', '4', '5', '6', '7', '8', '9', ':', ';', '<', '=', '>', '?', '@', 'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L', 'M', 'N', 'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W', 'X', 'Y', 'Z', '[', '\\', ']', '^', '_', '`', 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z', '{', '|', '}', '~']


Write the contents of our list of characters to a file using ascii encoding.

In [2]:
# write the list to a file
with open('./data/text-ascii.txt', 'w', encoding='ascii') as f:
    f.write(''.join(ascii_list))

Now, let's try reading these characters back from this file.

In [3]:
# read and display the ascii file text-ascii.txt as a string
with open('./data/text-ascii.txt', 'r', encoding='ascii') as f:
    print(f.read())

 	

 !"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~


Let's display the contents of the file as a list of characters.

In [4]:
# read and display the ascii file text-ascii.txt as a list of characters
with open('./data/text-ascii.txt', 'r', encoding='ascii') as f:
    print(list(f.read()))

['\x00', '\x01', '\x02', '\x03', '\x04', '\x05', '\x06', '\x07', '\x08', '\t', '\n', '\x0b', '\x0c', '\n', '\x0e', '\x0f', '\x10', '\x11', '\x12', '\x13', '\x14', '\x15', '\x16', '\x17', '\x18', '\x19', '\x1a', '\x1b', '\x1c', '\x1d', '\x1e', '\x1f', ' ', '!', '"', '#', '$', '%', '&', "'", '(', ')', '*', '+', ',', '-', '.', '/', '0', '1', '2', '3', '4', '5', '6', '7', '8', '9', ':', ';', '<', '=', '>', '?', '@', 'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L', 'M', 'N', 'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W', 'X', 'Y', 'Z', '[', '\\', ']', '^', '_', '`', 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z', '{', '|', '}', '~']
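
Notice that the character at index 13 came back as '\n' rather than the '\r' we wrote. This is the effect of Python's newline translation in text mode, which converts '\r' and '\r\n' to '\n' when reading unless told otherwise. The following is a minimal sketch, assuming a temporary scratch file (the file name is hypothetical), showing that passing newline='' turns the translation off:

```python
import os
import tempfile

# Hypothetical scratch file used only for this illustration
path = os.path.join(tempfile.mkdtemp(), "newline-demo.txt")

# newline='' writes the characters exactly as given (no translation)
with open(path, "w", encoding="ascii", newline="") as f:
    f.write("a\rb")

# Default text mode: '\r' is translated to '\n' when reading
with open(path, "r", encoding="ascii") as f:
    translated = f.read()

# newline='' keeps the original line-ending characters
with open(path, "r", encoding="ascii", newline="") as f:
    untranslated = f.read()

print(repr(translated))
print(repr(untranslated))
```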


We can also look at these characters as their original decimal values (which can also be represented as hexadecimal values, octal, or binary).

In [5]:
# read and display the ascii file text-ascii.txt as a list of decimal codes
with open('./data/text-ascii.txt', 'r') as f:
    print([ord(i) for i in list(f.read())])

[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 10, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100, 101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116, 117, 118, 119, 120, 121, 122, 123, 124, 125, 126]


In [6]:
# read and display the ascii file text-ascii.txt as a list of binary codes
with open('./data/text-ascii.txt', 'r') as f:
    print([bin(ord(i)) for i in list(f.read())])

['0b0', '0b1', '0b10', '0b11', '0b100', '0b101', '0b110', '0b111', '0b1000', '0b1001', '0b1010', '0b1011', '0b1100', '0b1010', '0b1110', '0b1111', '0b10000', '0b10001', '0b10010', '0b10011', '0b10100', '0b10101', '0b10110', '0b10111', '0b11000', '0b11001', '0b11010', '0b11011', '0b11100', '0b11101', '0b11110', '0b11111', '0b100000', '0b100001', '0b100010', '0b100011', '0b100100', '0b100101', '0b100110', '0b100111', '0b101000', '0b101001', '0b101010', '0b101011', '0b101100', '0b101101', '0b101110', '0b101111', '0b110000', '0b110001', '0b110010', '0b110011', '0b110100', '0b110101', '0b110110', '0b110111', '0b111000', '0b111001', '0b111010', '0b111011', '0b111100', '0b111101', '0b111110', '0b111111', '0b1000000', '0b1000001', '0b1000010', '0b1000011', '0b1000100', '0b1000101', '0b1000110', '0b1000111', '0b1001000', '0b1001001', '0b1001010', '0b1001011', '0b1001100', '0b1001101', '0b1001110', '0b1001111', '0b1010000', '0b1010001', '0b1010010', '0b1010011', '0b1010100', '0b1010101', '0b1010

In [7]:
# read and display the ascii file text-ascii.txt as a list of hex codes
with open('./data/text-ascii.txt', 'r') as f:
    print([hex(ord(i)) for i in list(f.read())])

['0x0', '0x1', '0x2', '0x3', '0x4', '0x5', '0x6', '0x7', '0x8', '0x9', '0xa', '0xb', '0xc', '0xa', '0xe', '0xf', '0x10', '0x11', '0x12', '0x13', '0x14', '0x15', '0x16', '0x17', '0x18', '0x19', '0x1a', '0x1b', '0x1c', '0x1d', '0x1e', '0x1f', '0x20', '0x21', '0x22', '0x23', '0x24', '0x25', '0x26', '0x27', '0x28', '0x29', '0x2a', '0x2b', '0x2c', '0x2d', '0x2e', '0x2f', '0x30', '0x31', '0x32', '0x33', '0x34', '0x35', '0x36', '0x37', '0x38', '0x39', '0x3a', '0x3b', '0x3c', '0x3d', '0x3e', '0x3f', '0x40', '0x41', '0x42', '0x43', '0x44', '0x45', '0x46', '0x47', '0x48', '0x49', '0x4a', '0x4b', '0x4c', '0x4d', '0x4e', '0x4f', '0x50', '0x51', '0x52', '0x53', '0x54', '0x55', '0x56', '0x57', '0x58', '0x59', '0x5a', '0x5b', '0x5c', '0x5d', '0x5e', '0x5f', '0x60', '0x61', '0x62', '0x63', '0x64', '0x65', '0x66', '0x67', '0x68', '0x69', '0x6a', '0x6b', '0x6c', '0x6d', '0x6e', '0x6f', '0x70', '0x71', '0x72', '0x73', '0x74', '0x75', '0x76', '0x77', '0x78', '0x79', '0x7a', '0x7b', '0x7c', '0x7d', '0x7e']
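
The decimal, binary, and hexadecimal lists above all describe the same numbers. As a small sketch (the choice of the character 'A' is only for illustration), format() can produce fixed-width versions of each base, including octal, and int() with an explicit base converts them back:

```python
# Show one character in several number bases
ch = "A"
code = ord(ch)                    # decimal code of the character

as_binary = format(code, "08b")   # zero-padded to 8 bits
as_octal = format(code, "o")
as_hex = format(code, "02x")      # zero-padded to 2 hex digits

print(code, as_binary, as_octal, as_hex)

# int() with an explicit base converts the text form back to a number
round_trip_ok = int(as_binary, 2) == int(as_octal, 8) == int(as_hex, 16) == code
print(round_trip_ok)
```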

### 1.2.1.1 What about the full range of 256 values?

Let's change our code to use the entire byte (256 values) instead of just 128 values. 

In [8]:
# create a list of the first 256 characters (not all of these are ascii)
ascii_list = [chr(i) for i in range(256)] # range(256) yields the integers 0 to 255; only 0 to 127 are ascii

print(ascii_list)

['\x00', '\x01', '\x02', '\x03', '\x04', '\x05', '\x06', '\x07', '\x08', '\t', '\n', '\x0b', '\x0c', '\r', '\x0e', '\x0f', '\x10', '\x11', '\x12', '\x13', '\x14', '\x15', '\x16', '\x17', '\x18', '\x19', '\x1a', '\x1b', '\x1c', '\x1d', '\x1e', '\x1f', ' ', '!', '"', '#', '$', '%', '&', "'", '(', ')', '*', '+', ',', '-', '.', '/', '0', '1', '2', '3', '4', '5', '6', '7', '8', '9', ':', ';', '<', '=', '>', '?', '@', 'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L', 'M', 'N', 'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W', 'X', 'Y', 'Z', '[', '\\', ']', '^', '_', '`', 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z', '{', '|', '}', '~', '\x7f', '\x80', '\x81', '\x82', '\x83', '\x84', '\x85', '\x86', '\x87', '\x88', '\x89', '\x8a', '\x8b', '\x8c', '\x8d', '\x8e', '\x8f', '\x90', '\x91', '\x92', '\x93', '\x94', '\x95', '\x96', '\x97', '\x98', '\x99', '\x9a', '\x9b', '\x9c', '\x9d', '\x9e', '\x9f', '\xa0', '

The following code will generate an error -- ascii encoding cannot handle values > 127.

In [9]:
# write the list to a file
#with open('./data/text-ascii.txt', 'w', encoding='ascii') as f:
#    f.write(''.join(ascii_list))
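
To see the error without writing a file, the sketch below encodes the same 256 characters in memory with str.encode(), which uses the same ascii codec, and catches the exception. The variable names are only illustrative.

```python
chars = "".join(chr(i) for i in range(256))

try:
    chars.encode("ascii")
except UnicodeEncodeError as err:
    # err.start is the index of the first character that could not be encoded
    first_bad_index = err.start
    print(f"Cannot encode {chars[first_bad_index]!r} at index {first_bad_index}: {err.reason}")
```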

To address this issue, we will need to look at encoding techniques other than ascii.

## 1.2.2 UTF (Unicode) Encoding

Python 3 uses UTF-8 as the default encoding for source code and for str.encode() and bytes.decode(). For files, open() in text mode uses a platform-dependent default when you do not pass an encoding: usually UTF-8 on Linux and macOS, but often a codepage such as cp1252 on Windows. If a file is not stored in that default encoding, you must specify the encoding when you read the file. If you do not, Python will either raise an error or silently read the file incorrectly.
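
Because the default used by open() can differ between machines, it can help to check what your own system uses. The following sketch prints both defaults using only the standard library; the values printed for the locale depend on your platform.

```python
import locale
import sys

# Default used by str.encode() and bytes.decode(); always 'utf-8' in Python 3
print(sys.getdefaultencoding())

# Default used by open() in text mode when no encoding is given;
# this depends on the platform and locale settings
print(locale.getpreferredencoding(False))

# Calling encode() with no argument uses UTF-8
text_bytes = "Café".encode()
print(text_bytes)
```

Passing encoding explicitly to open(), as the cells in this section do, avoids relying on the platform default.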

In [10]:
# create a list of the first 256 unicode characters
utf8_list = [chr(i) for i in range(256)] # range(256) yields the integers 0 to 255; chr() converts each one to its unicode character

print(utf8_list)

['\x00', '\x01', '\x02', '\x03', '\x04', '\x05', '\x06', '\x07', '\x08', '\t', '\n', '\x0b', '\x0c', '\r', '\x0e', '\x0f', '\x10', '\x11', '\x12', '\x13', '\x14', '\x15', '\x16', '\x17', '\x18', '\x19', '\x1a', '\x1b', '\x1c', '\x1d', '\x1e', '\x1f', ' ', '!', '"', '#', '$', '%', '&', "'", '(', ')', '*', '+', ',', '-', '.', '/', '0', '1', '2', '3', '4', '5', '6', '7', '8', '9', ':', ';', '<', '=', '>', '?', '@', 'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L', 'M', 'N', 'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W', 'X', 'Y', 'Z', '[', '\\', ']', '^', '_', '`', 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z', '{', '|', '}', '~', '\x7f', '\x80', '\x81', '\x82', '\x83', '\x84', '\x85', '\x86', '\x87', '\x88', '\x89', '\x8a', '\x8b', '\x8c', '\x8d', '\x8e', '\x8f', '\x90', '\x91', '\x92', '\x93', '\x94', '\x95', '\x96', '\x97', '\x98', '\x99', '\x9a', '\x9b', '\x9c', '\x9d', '\x9e', '\x9f', '\xa0', '

Now, when we save this file we specify utf-8 encoding, and we can save and read back the file without any error.

In [11]:
# write the list to a file
with open('./data/text-utf8.txt', 'w', encoding='utf-8') as f:
    f.write(''.join(utf8_list))

In [12]:
# read and display the utf-8 file text-utf8.txt as a string
with open('./data/text-utf8.txt', 'r', encoding='utf-8') as f:
    print(f.read())

 	

 !"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~ ¡¢£¤¥¦§¨©ª«¬­®¯°±²³´µ¶·¸¹º»¼½¾¿ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖ×ØÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõö÷øùúûüýþÿ


In fact, utf-8 can encode every character defined in Unicode. It is the most popular encoding scheme in use today, and it is Python 3's default for source code and for str.encode().

utf-8 can encode a total of 1,112,064 code points: the 1,114,112 values in the Unicode range minus the 2,048 surrogate values reserved for utf-16.
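
The figure of 1,112,064 can be checked with a little arithmetic. The sketch below also shows that utf-8 is a variable-length encoding, using four sample characters chosen only for illustration.

```python
# Unicode code points run from U+0000 to U+10FFFF (0x110000 values)
total_code_points = 0x110000

# U+D800..U+DFFF are surrogates, reserved for UTF-16, and are not valid in UTF-8
surrogates = 0xDFFF - 0xD800 + 1

encodable = total_code_points - surrogates
print(f"{encodable:,}")

# UTF-8 is variable length: 1 to 4 bytes per code point
byte_lengths = {ch: len(ch.encode("utf-8")) for ch in ["A", "é", "€", "\U0001F600"]}
print(byte_lengths)
```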


In [13]:
# create a list of the first 10,000 unicode characters
utf8_list = [chr(i) for i in range(10_000)] # range(10_000) yields the integers 0 to 9,999; chr() converts each one to its unicode character

for i in range(500):
    print(utf8_list[i*20:i*20+20])

['\x00', '\x01', '\x02', '\x03', '\x04', '\x05', '\x06', '\x07', '\x08', '\t', '\n', '\x0b', '\x0c', '\r', '\x0e', '\x0f', '\x10', '\x11', '\x12', '\x13']
['\x14', '\x15', '\x16', '\x17', '\x18', '\x19', '\x1a', '\x1b', '\x1c', '\x1d', '\x1e', '\x1f', ' ', '!', '"', '#', '$', '%', '&', "'"]
['(', ')', '*', '+', ',', '-', '.', '/', '0', '1', '2', '3', '4', '5', '6', '7', '8', '9', ':', ';']
['<', '=', '>', '?', '@', 'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L', 'M', 'N', 'O']
['P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W', 'X', 'Y', 'Z', '[', '\\', ']', '^', '_', '`', 'a', 'b', 'c']
['d', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w']
['x', 'y', 'z', '{', '|', '}', '~', '\x7f', '\x80', '\x81', '\x82', '\x83', '\x84', '\x85', '\x86', '\x87', '\x88', '\x89', '\x8a', '\x8b']
['\x8c', '\x8d', '\x8e', '\x8f', '\x90', '\x91', '\x92', '\x93', '\x94', '\x95', '\x96', '\x97', '\x98', '\x99', '\x9a', '\x9b', '\x9c', '\x9d', '\x9e', '\x9f']
['

**Problems loading a file as utf-8**

Sometimes you will receive a file that is not encoded in utf-8, or you may want or need to store your text using a different encoding. In these cases, you must specify the encoding when you open the file.

Let's create a new file, text-utf-16.txt, using utf-16 encoding:

In [14]:
with open('./data/text-utf-16.txt', 'w', encoding='utf-16') as f:
    f.write('Hello, \nworld!')

Notice what happens when we attempt to load this file using the default encoding:

In [15]:
###### On systems where the default encoding is UTF-8, this will generate an error. Uncomment to see the error generated
#with open('./data/text-utf-16.txt', 'r') as f: # if you do not specify the encoding, Python uses the platform default (often UTF-8)
#    print(f.read())

To correct this error, you need to specify the encoding when you open the file:


In [16]:
with open('./data/text-utf-16.txt', 'r', encoding='utf-16') as f:
    print(f.read())

Hello, 
world!
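
Specifying the correct encoding is the real fix. When the correct encoding is not known, the errors argument of bytes.decode() (also accepted by open()) controls what happens to bytes that cannot be decoded. The following is a minimal sketch that works on in-memory bytes rather than the file above; note that 'replace' and 'ignore' avoid the exception but do not recover the original text.

```python
# Bytes produced by UTF-16 encoding (they start with a byte-order mark)
data = "Hello, \nworld!".encode("utf-16")

# Strict decoding (the default) raises on invalid input
try:
    data.decode("utf-8")
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

# errors='replace' substitutes U+FFFD for each undecodable byte
replaced = data.decode("utf-8", errors="replace")

# errors='ignore' silently drops undecodable bytes
ignored = data.decode("utf-8", errors="ignore")

print(strict_failed)
print(repr(replaced))
print(repr(ignored))
```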


We can also read and write UTF-32 encoded files:

In [17]:
with open('./data/text-utf-32.txt', 'w', encoding='utf-32') as f:
    f.write('Hello, \nworld!')

In [18]:
with open('./data/text-utf-32.txt', 'r', encoding='utf-32') as f:
    print(f.read())

Hello, 
world!


> NOTE: VSCode (and other editors) may not be able to display the contents of the utf-32 file.  You can use the command line to view the contents of the file.

When the encoding of a file is unknown, the third-party chardet package can guess it from a sample of the raw bytes. The result is an estimate with a confidence score, and a small sample (such as the first 100 bytes used here) may miss non-ASCII characters that appear later in the file.

In [19]:
import chardet

# Open the file
with open("./data/MLK.txt", "rb") as f:
    # Read the first 100 bytes
    rawdata = f.read(100)
    
# Detect the encoding
result = chardet.detect(rawdata)

# Print the result
print(result)
    

{'encoding': 'ascii', 'confidence': 1.0, 'language': ''}
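
Some files announce their encoding with a byte-order mark (BOM) at the very start, which can be checked without any third-party package. The following is a minimal sketch using the BOM constants in the standard codecs module; sniff_bom is a hypothetical helper written for this example, and it returns None for the many files that have no BOM.

```python
import codecs

# Byte-order marks (BOMs) from the standard library. The UTF-32 marks come
# first so the UTF-32 LE mark is not mistaken for the UTF-16 LE mark it starts with.
BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]

def sniff_bom(raw: bytes):
    """Hypothetical helper: return a codec name if raw starts with a known BOM, else None."""
    for bom, name in BOMS:
        if raw.startswith(bom):
            return name
    return None

print(sniff_bom("Hello".encode("utf-16")))
print(sniff_bom("Hello".encode("utf-32")))
print(sniff_bom("Hello".encode("utf-8-sig")))
print(sniff_bom(b"Hello"))
```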


## 1.2.3 Codepages

Keep in mind that how the data is encoded (i.e. utf-8, utf-16, utf-32) is about how the data is represented in binary on disk. Character mapping is what determines what 'number' is associated with what character. 

Unicode is now the preferred standard for character mapping. Unicode assigns a number (a code point) to every character it defines. That number can then be stored and read using utf-8, utf-16, or utf-32.

Although unicode and utf-8 are now the primary standards, we often find text in which each full byte stored in the file is interpreted as one character (i.e., extended characters). UTF-8 works differently: byte values above 127 indicate that the byte is part of a multi-byte character. Because the same bytes mean different things under different encodings, we need to know the encoding of the file in order to properly interpret the characters.
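
The multi-byte behaviour of utf-8 is visible in the bits themselves. The sketch below encodes the euro sign (chosen only as an example) and prints each byte as 8 bits: the first byte announces the length of the sequence, and every following byte begins with 10.

```python
# Encode the euro sign (U+20AC) and show each byte as 8 bits
encoded = "€".encode("utf-8")
bits = [format(byte, "08b") for byte in encoded]
print(bits)

# A lead byte starting with 1110 announces a 3-byte sequence;
# each continuation byte starts with 10
is_lead = bits[0].startswith("1110")
are_continuations = all(b.startswith("10") for b in bits[1:])
print(is_lead, are_continuations)
```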

Single-byte codepages use one byte to represent each character. In the codepages shown in this section, the first 128 values are the same as ASCII, and the remaining 128 values are different for each codepage. Codepages were created to encode text in many different languages, both for storage in files and for display on the screen.

**Why does this matter?**

Sometimes, you may find yourself attempting to read a file that is encoded using a codepage, not unicode. If you do not know the codepage, you will not be able to properly interpret the text.

In this section, we will look at some of the most common codepages and at how each one interprets the same bytes as text.

Let's begin by creating a binary file that has all the values from 0 to 255. We will use this file to see how different codepages encode text.

In [20]:
# create a binary file with values from 0 to 255.

with open('./data/binary.bin', 'wb') as f:
    f.write(bytes(range(256)))

In [21]:
### The following code will generate an error. Uncomment to see the type of error
# load the binary data as if it were utf-8

#with open('./data/binary.bin', 'r', encoding='utf-8') as f:
#    print(f.read())

In [22]:
# load the binary data as if it were ISO-8859-1
with open('./data/binary.bin', 'r', encoding='ISO-8859-1') as f:
    print(f.read())

 	

 !"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~ ¡¢£¤¥¦§¨©ª«¬­®¯°±²³´µ¶·¸¹º»¼½¾¿ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖ×ØÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõö÷øùúûüýþÿ


In [23]:
# load the binary data as if it were ISO-8859-2
with open('./data/binary.bin', 'r', encoding='ISO-8859-2') as f:
    print(f.read())

 	

 !"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~ Ą˘Ł¤ĽŚ§¨ŠŞŤŹ­ŽŻ°ą˛ł´ľśˇ¸šşťź˝žżŔÁÂĂÄĹĆÇČÉĘËĚÍÎĎĐŃŇÓÔŐÖ×ŘŮÚŰÜÝŢßŕáâăäĺćçčéęëěíîďđńňóôőö÷řůúűüýţ˙


In [24]:
# load the binary data as if it were ISO-8859-4
with open('./data/binary.bin', 'r', encoding='ISO-8859-4') as f:
    print(f.read())

 	

 !"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~ ĄĸŖ¤ĨĻ§¨ŠĒĢŦ­Ž¯°ą˛ŗ´ĩļˇ¸šēģŧŊžŋĀÁÂÃÄÅÆĮČÉĘËĖÍÎĪĐŅŌĶÔÕÖ×ØŲÚÛÜŨŪßāáâãäåæįčéęëėíîīđņōķôõö÷øųúûüũū˙


In [25]:
# load the binary data as if it were ISO-8859-5
with open('./data/binary.bin', 'r', encoding='ISO-8859-5') as f:
    print(f.read())

 	

 !"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~ ЁЂЃЄЅІЇЈЉЊЋЌ­ЎЏАБВГДЕЖЗИЙКЛМНОПРСТУФХЦЧШЩЪЫЬЭЮЯабвгдежзийклмнопрстуфхцчшщъыьэюя№ёђѓєѕіїјљњћќ§ўџ


In [26]:
# load the binary data as if it were ISO-8859-9
with open('./data/binary.bin', 'r', encoding='ISO-8859-9') as f:
    print(f.read())

 	

 !"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~ ¡¢£¤¥¦§¨©ª«¬­®¯°±²³´µ¶·¸¹º»¼½¾¿ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏĞÑÒÓÔÕÖ×ØÙÚÛÜİŞßàáâãäåæçèéêëìíîïğñòóôõö÷øùúûüışÿ


In [27]:
# load the binary data as if it were ISO-8859-10
with open('./data/binary.bin', 'r', encoding='ISO-8859-10') as f:
    print(f.read())

 	

 !"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~ ĄĒĢĪĨĶ§ĻĐŠŦŽ­ŪŊ°ąēģīĩķ·ļđšŧž―ūŋĀÁÂÃÄÅÆĮČÉĘËĖÍÎÏÐŅŌÓÔÕÖŨØŲÚÛÜÝÞßāáâãäåæįčéęëėíîïðņōóôõöũøųúûüýþĸ


As the output above shows, the first 128 characters are the same as ASCII in every one of these codepages, while the remaining 128 characters differ from codepage to codepage.

Though unicode eliminates the need for codepages, they are still used in many places. For example, Windows still uses codepages for legacy (non-Unicode) applications. On US and Western European installations the default is codepage 1252, also known as Windows-1252 or CP1252, and often (inaccurately) labelled ANSI. Classic Mac OS used the Mac OS Roman codepage, while modern macOS and most Linux distributions default to UTF-8. Windows-1252 is closely related to ISO-8859-1 but is not identical to it: the two differ in the byte range 0x80 to 0x9F.
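
The differences between similar codepages are easy to miss. The following sketch decodes a single byte with Windows-1252 and with ISO-8859-1, then shows the garbled text ("mojibake") that results when utf-8 bytes are decoded with the wrong codepage. The sample byte and string are chosen only for illustration.

```python
raw = b"\x80"  # a single byte above the ASCII range

# Windows-1252 maps 0x80 to the euro sign
as_cp1252 = raw.decode("cp1252")

# ISO-8859-1 (latin-1) maps 0x80 to a C1 control character
as_latin1 = raw.decode("latin-1")

print(repr(as_cp1252), repr(as_latin1))

# Decoding UTF-8 bytes with the wrong codepage produces "mojibake"
mojibake = "Café".encode("utf-8").decode("cp1252")
print(mojibake)

# The damage is reversible here because no bytes were lost
repaired = mojibake.encode("cp1252").decode("utf-8")
print(repaired)
```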

### 1.2.3.1 Using string Encoding and Decoding

Python provides methods for encoding and decoding strings: str.encode() converts a string to bytes, and bytes.decode() converts bytes back to a string. The following cells show some common examples.

In [28]:
# Encoding a string using Windows-1252
string = "Café"
encoded_string = string.encode("windows-1252")
print(encoded_string)  # b'Caf\xe9'

# Decoding a Windows-1252 encoded string
decoded_string = encoded_string.decode("windows-1252")
print(decoded_string)  # Café

b'Caf\xe9'
Café


In [29]:
# Encoding a string using UTF-8
string = "Café"
encoded_string = string.encode("utf-8")
print(encoded_string)  # b'Caf\xc3\xa9'

# Decoding a UTF-8 encoded string
decoded_string = encoded_string.decode("UTF-8")
print(decoded_string)  # Café

b'Caf\xc3\xa9'
Café


In [30]:
# Encoding a string using utf-16
string = "Café"
encoded_string = string.encode("utf-16")
print(encoded_string)  # b'\xff\xfeC\x00a\x00f\x00\xe9\x00'

# Decoding a utf-16 encoded string
decoded_string = encoded_string.decode("utf-16")
print(decoded_string)  # Café

b'\xff\xfeC\x00a\x00f\x00\xe9\x00'
Café


In [31]:
# Will this work? Why or why not?

# Encoding a string using ascii
string = "Café"
#encoded_string = string.encode("ascii")
#print(encoded_string)

In [32]:
# Will this work? Why or why not?

# Encoding a string using utf-8
string = "Café"
#encoded_string = string.encode("utf-8")
#print(encoded_string)  # b'Caf\xc3\xa9'

# Decoding the utf-8 encoded bytes as utf-16
#decoded_string = encoded_string.decode("utf-16")
#print(decoded_string)

## 1.2.4 Emojis

NOTE: Install the VS Code extension emojisense (https://marketplace.visualstudio.com/items?itemName=bierner.emojisense) to easily insert emojis in VS Code (you can also copy and paste from https://emojipedia.org/).

Emojis are defined in the unicode standard, which is updated regularly. At the time of writing, the current version of unicode is 15.0. New emojis are added in most versions. You can see the entire set of supported emojis for unicode v15.0 [here](https://unicode.org/emoji/charts/full-emoji-list.html) (NOTE: This is a very large list, and will take time to load).

For demonstration, the following is a list of example emojis:
🕥
🌵
💩
👍
🇮🇳
🇺🇸

We can copy and paste these into any python string:

In [33]:
emoji_str = "🕥🌵💩👍🇮🇳🇺🇸"  # avoid naming a variable str, which would shadow the built-in type
print(emoji_str)

🕥🌵💩👍🇮🇳🇺🇸


You can also use the unicode hex representation of an emoji:

In [34]:
# grinning face
print("\U0001F600") # question: How many bytes does this character take?

# beaming face with smiling eyes
print("\U0001F601")

# grinning face with sweat
print("\U0001F605")

# rolling on the floor laughing
print("\U0001F923")

# face with tears of joy
print("\U0001F602")

# slightly smiling face
print("\U0001F642")

# smiling face with halo
print("\U0001F607")

# smiling face with heart-eyes
print("\U0001F60D")

# zipper-mouth face
print("\U0001F910")

# unamused face
print("\U0001F612")

😀
😁
😅
🤣
😂
🙂
😇
😍
🤐
😒
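
How much space an emoji takes depends on the encoding and on how many code points it is made of. The sketch below compares a single-code-point face with a flag, which is built from two regional-indicator code points. The encodings ending in -le are used so that no byte-order mark is included in the counts.

```python
face = "\U0001F600"              # grinning face: one code point
flag = "\U0001F1FA\U0001F1F8"    # US flag: two regional-indicator code points

for label, text in [("face", face), ("flag", flag)]:
    print(
        label,
        len(text),                       # number of code points
        len(text.encode("utf-8")),       # bytes in UTF-8
        len(text.encode("utf-16-le")),   # bytes in UTF-16 (no BOM)
        len(text.encode("utf-32-le")),   # bytes in UTF-32 (no BOM)
    )
```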


We can also write these emojis to a file and read them back:

In [35]:
with open('./data/emojis-utf-8.txt', 'w', encoding='utf-8') as f:
    f.write("\U0001F612\U0001F910\U0001F602")

In [36]:
with open('./data/emojis-utf-8.txt', 'r', encoding='utf-8') as f:
    print(f.read())

😒🤐😂


A very common way of representing the code point of an emoji is to display it as a hexadecimal value:

In [37]:
# read and display the utf-8 file emojis-utf-8.txt as a list of hex codes
with open('./data/emojis-utf-8.txt', 'r', encoding='utf-8') as f:
    print([hex(ord(i)) for i in list(f.read())])

['0x1f612', '0x1f910', '0x1f602']


> NOTE: In HTML files (something we will see later), an emoji can be written as a numeric character reference using its decimal value:
>
> <p>&#128512;</p>  
>
> <p>&#128169;</p>
>
> Click on this cell to display the code that generated these emojis.
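
The decimal values used in HTML are the same code points we printed in hexadecimal earlier. As a minimal sketch using the standard html module, the following builds the decimal and hexadecimal references for one emoji and converts them back to the character:

```python
import html

ch = "\U0001F600"  # grinning face

# Build decimal and hexadecimal numeric character references
decimal_ref = f"&#{ord(ch)};"
hex_ref = f"&#x{ord(ch):X};"
print(decimal_ref, hex_ref)

# html.unescape converts either form back to the character
decoded_decimal = html.unescape(decimal_ref)
decoded_hex = html.unescape(hex_ref)
print(decoded_decimal, decoded_hex)
```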

## 1.2.5 Conclusion

In this notebook we looked at how to encode and decode text in Python, how to read and write files using different encodings, how codepages interpret the same bytes differently, and how to work with emojis.

Though this introduction is thorough, this is a very complex topic. There are many different encodings, and many different ways to encode, decode, store, and display text and emojis.

Despite this, we have covered the fundamentals. Knowing these fundamentals will allow you to develop a deeper understanding of how text data is being represented and interpreted by your program(s). This understanding will help you avoid errors and troubleshoot more quickly. Without knowing these fundamentals, you may be 'lost' when faced with encoding errors, and troubleshooting will take much longer (or you may not be successful at all).