# Character Encodings

Sometimes you have an older dataset stored in a legacy, less commonly used encoding. Python makes it easy to detect that encoding and convert the file to UTF-8.



To list the character encoding of every file in a folder and its subfolders, you can combine the os.walk() function with the chardet library, which guesses the encoding of each file from its bytes. Here's an example:

In [3]:
import chardet
import os

In [3]:
folder_path = '../Data'

for root, dirs, files in os.walk(folder_path):
    for file_name in files:
        file_path = os.path.join(root, file_name)
        with open(file_path, 'rb') as f:
            result = chardet.detect(f.read())
            print(f'{file_path} is encoded in {result["encoding"]}')

../Data/breakfast.csv is encoded in ascii
../Data/L3P1.csv is encoded in UTF-8-SIG
../Data/pgaTourData.xlsx is encoded in None
../Data/cellPhone.csv is encoded in ascii
../Data/flowerShipment.csv is encoded in ascii
../Data/graduate_admissions.csv is encoded in ascii
../Data/All_GPUs.csv is encoded in utf-8
../Data/London_hotel_reviews.csv is encoded in Windows-1252
../Data/volcanoes.csv is encoded in utf-8
../Data/anime.csv is encoded in utf-8
../Data/.DS_Store is encoded in MacCyrillic
../Data/flowerShop.csv is encoded in utf-8
../Data/tea.csv is encoded in ascii
../Data/boats.csv is encoded in UTF-8-SIG
../Data/glassdoor.csv is encoded in utf-8
../Data/PlaneCandidates.xlsx is encoded in None
../Data/BorderCrossing.csv is encoded in ascii
../Data/YouTubeChannels.csv is encoded in utf-8
../Data/olympicEvents.xlsx is encoded in None
../Data/datasets-733390-1272569-individuals.csv is encoded in ascii
../Data/YouTubeChannels_Python.csv is encoded in utf-8
../Data/test.csv is encoded in a
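The listing above also contains guesses for files that are not text at all: the `.xlsx` workbooks come back as `None`, and `.DS_Store` gets a meaningless `MacCyrillic` label. One way to avoid this is to filter by file extension before running detection. The following is a minimal sketch; the helper name `find_text_files` and the set of suffixes are assumptions chosen for illustration and are not part of the original notebook.

```python
from pathlib import Path

# Hypothetical allow-list of extensions treated as text; adjust to your data.
TEXT_SUFFIXES = {".csv", ".txt"}


def find_text_files(folder, suffixes=TEXT_SUFFIXES):
    """Return sorted paths under `folder` whose extension is in `suffixes`."""
    root = Path(folder)
    return sorted(
        path
        for path in root.rglob("*")  # walks subfolders as well
        if path.is_file() and path.suffix.lower() in suffixes
    )
```

Each returned path can then be opened in binary mode and passed to `chardet.detect()` exactly as in the loop above.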

In [13]:
# import a library to detect encodings
import chardet
import glob

# for every file in the folder, print the file name & a guess of its encoding
print("File".ljust(45), "Encoding")
for filename in glob.glob('../Data/*.*'):
    with open(filename, 'rb') as rawdata:
        result = chardet.detect(rawdata.read())
    print(filename.ljust(45), result['encoding'])

File                                          Encoding
../Data/breakfast.csv                         ascii
../Data/L3P1.csv                              UTF-8-SIG
../Data/pgaTourData.xlsx                      None
../Data/cellPhone.csv                         ascii
../Data/flowerShipment.csv                    ascii
../Data/graduate_admissions.csv               ascii
../Data/All_GPUs.csv                          utf-8
../Data/London_hotel_reviews.csv              Windows-1252
../Data/volcanoes.csv                         utf-8
../Data/anime.csv                             utf-8
../Data/flowerShop.csv                        utf-8
../Data/tea.csv                               ascii
../Data/boats.csv                             UTF-8-SIG
../Data/glassdoor.csv                         utf-8
../Data/PlaneCandidates.xlsx                  None
../Data/BorderCrossing.csv                    ascii
../Data/YouTubeChannels.csv                   utf-8
../Data/olympicEvents.xlsx                    No
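chardet's result is a statistical guess. For the two most common cases, a stricter check is to try decoding the bytes and see whether Python raises an error. This is a different technique from what chardet does, sketched here with only the standard library; the function name `first_decodable` and the candidate list are illustrative assumptions.

```python
def first_decodable(data, candidates=("ascii", "utf-8")):
    """Return the first candidate encoding that decodes `data` without error."""
    for encoding in candidates:
        try:
            data.decode(encoding)  # strict mode: invalid bytes raise an error
        except UnicodeDecodeError:
            continue
        return encoding
    return None  # none of the candidates fit; fall back to chardet or inspection
```

Single-byte encodings such as Latin-1 accept any byte sequence, so they are left out of the candidates on purpose: a successful decode with them proves nothing.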

You can use the pandas library to read a file in its current encoding and then save it with UTF-8 encoding. Here's an example for a CSV file:


In [4]:
import pandas as pd

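# Read the CSV file. pandas does not detect the encoding: it assumes UTF-8
# unless you pass encoding=..., which also works for an ASCII file like this one
file_path1 = '../Data/titanic.csv'
df = pd.read_csv(file_path1)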

# Save the dataframe to a new CSV file with UTF-8 encoding
df.to_csv('../Data/titanic-utf-8.csv', index=False, encoding='utf-8')


In [5]:
# Open the file in binary mode
with open(file_path1, 'rb') as f:
    # Read the file content
    data = f.read()
    # Detect the file's encoding
    result = chardet.detect(data)
    print(f'Encoding of {file_path1} is {result["encoding"]}')

Encoding of ../Data/titanic.csv is ascii


In [8]:
file_path2 = '../Data/titanic-utf-8.csv'
# Open the file in binary mode
with open(file_path2, 'rb') as f:
    # Read the file content
    data2 = f.read()
    # Detect the file's encoding
    result2 = chardet.detect(data2)
    print(f'Encoding of {file_path2} is {result2["encoding"]}')

Encoding of ../Data/titanic-utf-8.csv is ascii


In [7]:
import codecs

# Open the file in ASCII mode
with open("../Data/titanic.csv", "r", encoding="ascii") as f:
    # Read the contents of the file
    contents = f.read()

# Open the file in UTF-8 mode
with codecs.open("../Data/titanic-utf.csv", "w", encoding="utf-8") as f:
    # Write the contents of the file in UTF-8 encoding
    f.write(contents)

In [10]:
file_path3 = '../Data/titanic-utf.csv'
# Open the file in binary mode
with open(file_path3, 'rb') as f:
    # Read the file content
    data3 = f.read()
    # Detect the file's encoding
    result3 = chardet.detect(data3)
    print(f'Encoding of {file_path3} is {result3["encoding"]}')

Encoding of ../Data/titanic-utf.csv is ascii
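The read-then-write pattern above generalizes to any source encoding. A minimal sketch, using only the built-in `open()`: the function name `convert_to_utf8` and its parameters are assumptions for illustration, and the source encoding has to be known or detected beforehand.

```python
def convert_to_utf8(src_path, dst_path, src_encoding):
    """Re-encode a text file from `src_encoding` to UTF-8."""
    # newline="" leaves the original line endings untouched on both sides.
    with open(src_path, "r", encoding=src_encoding, newline="") as src:
        contents = src.read()
    with open(dst_path, "w", encoding="utf-8", newline="") as dst:
        dst.write(contents)
    return len(contents)  # number of characters written
```

For a file that chardet reports as Windows-1252, such as `London_hotel_reviews.csv` in the listing above, the call might look like `convert_to_utf8(path, new_path, "cp1252")`, where `new_path` is whatever output name you choose. Unlike the titanic example, the result of such a conversion would differ byte for byte from the input wherever the text contains non-ASCII characters.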


## Why does chardet report a UTF-8 file as ASCII?
chardet is a Python library that guesses the encoding of a file. It does this by analyzing the byte patterns in the file and comparing them to known patterns for various encodings.

UTF-8 is a variable-length encoding: it uses one to four bytes to represent a single character. ASCII uses one byte per character, and UTF-8 was designed so that every ASCII character is encoded as that same single byte. If a file is saved as UTF-8 but contains only characters from the ASCII character set, its bytes are identical to those of an ASCII file, so chardet reports the narrower encoding, ASCII. That is why titanic-utf-8.csv and titanic-utf.csv above are still reported as ascii.

In other words, the file is valid UTF-8, but because it only uses the ASCII subset of UTF-8, there is nothing in the bytes that lets chardet tell the two encodings apart.
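This can be checked directly in Python without chardet. The sketch below uses two made-up strings (not taken from the datasets) to show that ASCII-only text produces the same bytes under both encodings, while a single non-ASCII character makes the UTF-8 bytes distinctive.

```python
ascii_only = "Name,Age\nAlice,30\n"
with_accent = "Name,Age\nZo\u00eb,30\n"  # \u00eb is the letter e with diaeresis

# ASCII-only text: both encodings yield exactly the same bytes.
same_bytes = ascii_only.encode("ascii") == ascii_only.encode("utf-8")
print(same_bytes)  # True

# One non-ASCII character becomes a multi-byte sequence in UTF-8.
utf8_bytes = with_accent.encode("utf-8")
print(utf8_bytes)            # b'Name,Age\nZo\xc3\xab,30\n'
print(utf8_bytes.isascii())  # False
```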

## Is US-ASCII the same as UTF-8?
For characters in the 7-bit ASCII range (code points 0 to 127), the UTF-8 representation is byte-for-byte identical to ASCII, which allows transparent round-trip migration. All other Unicode characters are represented in UTF-8 by sequences of two to four bytes, and most Western European accented letters require only two. (The original UTF-8 design allowed sequences of up to six bytes, but the current standard, RFC 3629, limits them to four.)
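To make the byte counts concrete, the following sketch encodes four sample characters, chosen here purely for illustration, and reports how many bytes each takes in UTF-8. Under the current UTF-8 definition (RFC 3629), no character takes more than four bytes.

```python
# Sample characters: Latin capital A, e with acute accent, euro sign, an emoji.
samples = ["A", "\u00e9", "\u20ac", "\U0001F600"]

# Map each character to the length of its UTF-8 encoding.
utf8_lengths = {ch: len(ch.encode("utf-8")) for ch in samples}

for ch, n in utf8_lengths.items():
    print(f"U+{ord(ch):04X} takes {n} byte(s) in UTF-8")
```

The four characters take 1, 2, 3, and 4 bytes respectively; only the first is also valid ASCII.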


In [12]:
file_path4 = '../Data/utf.txt'
# Open the file in binary mode
with open(file_path4, 'rb') as f:
    # Read the file content
    data4 = f.read()
    # Detect the file's encoding
    result4 = chardet.detect(data4)
    print(f'Encoding of {file_path4} is {result4["encoding"]}')

Encoding of ../Data/utf.txt is ascii


In [13]:
file_path5 = '../Data/utf2.txt'
# Open the file in binary mode
with open(file_path5, 'rb') as f:
    # Read the file content
    data5 = f.read()
    # Detect the file's encoding
    result5 = chardet.detect(data5)
    print(f'Encoding of {file_path5} is {result5["encoding"]}')

Encoding of ../Data/utf2.txt is UTF-32


In [15]:
# Open the file as UTF-32 text
with open("../Data/utf2.txt", "r", encoding="UTF-32") as f:
    # Read the contents of the file
    contents = f.read()

# Write the same text back out as UTF-8
with open("../Data/utf3.txt", "w", encoding="utf-8") as f:
    f.write(contents)

In [16]:
file_path6 = '../Data/utf3.txt'
# Open the file in binary mode
with open(file_path6, 'rb') as f:
    # Read the file content
    data6 = f.read()
    # Detect the file's encoding
    result6 = chardet.detect(data6)
    print(f'Encoding of {file_path6} is {result6["encoding"]}')

Encoding of ../Data/utf3.txt is ascii


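On macOS, you can use the `file` and `iconv` command-line tools to inspect and convert encodings (on Linux, the equivalent `file` flag is lowercase `-i`):

```bash
file -I utf.txt

iconv -f US-ASCII -t UTF-32 utf.txt > utf2.txt

file -I utf2.txt
```
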

In [18]:
!file -I ../Data/utf.txt

../Data/utf.txt: text/plain; charset=us-ascii


In [19]:
!file -I ../Data/utf2.txt

../Data/utf2.txt: text/plain; charset=utf-32be


In [20]:
!file -I ../Data/utf3.txt

../Data/utf3.txt: text/plain; charset=us-ascii


In [22]:
!iconv -f US-ASCII -t UTF-32 ../Data/utf3.txt > ../Data/utf4.txt

In [23]:
!file -I ../Data/utf4.txt

../Data/utf4.txt: text/plain; charset=utf-32be
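Many UTF-16 and UTF-32 files, and some UTF-8 files (the `UTF-8-SIG` entries in the earlier listing), begin with a byte order mark (BOM) that identifies the encoding. A rough Python sketch of checking for one uses the BOM constants from the standard `codecs` module. The function name `sniff_bom` is an assumption for illustration; files without a BOM, including plain ASCII and most UTF-8 files, return `None`, so this complements chardet and `file` rather than replacing them.

```python
import codecs

# Check longer BOMs first: the UTF-32-LE BOM begins with the UTF-16-LE BOM.
_BOMS = [
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
]


def sniff_bom(data):
    """Return an encoding name if `data` starts with a known BOM, else None."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name
    return None
```

Reading only the first four bytes of a file, for example `sniff_bom(open(path, "rb").read(4))`, is enough for this check.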
