# Character Encodings

Sometimes you have an old dataset stored in an older, less common character encoding. Python makes it easy to detect that encoding and convert the file to UTF-8.



To list the character encoding of every file in a folder and its subfolders, combine os.walk() with the chardet library, which guesses each file's encoding from its raw bytes. Here's an example:

In [3]:
import chardet
import os

folder_path = '../Data'

# Walk the folder tree and guess the encoding of every file it contains
for root, dirs, files in os.walk(folder_path):
    for file_name in files:
        file_path = os.path.join(root, file_name)
        # Open in binary mode: chardet works on undecoded bytes
        with open(file_path, 'rb') as f:
            result = chardet.detect(f.read())
        print(f'{file_path} is encoded in {result["encoding"]}')

../Data/breakfast.csv is encoded in ascii
../Data/L3P1.csv is encoded in UTF-8-SIG
../Data/pgaTourData.xlsx is encoded in None
../Data/cellPhone.csv is encoded in ascii
../Data/flowerShipment.csv is encoded in ascii
../Data/graduate_admissions.csv is encoded in ascii
../Data/All_GPUs.csv is encoded in utf-8
../Data/London_hotel_reviews.csv is encoded in Windows-1252
../Data/volcanoes.csv is encoded in utf-8
../Data/anime.csv is encoded in utf-8
../Data/.DS_Store is encoded in MacCyrillic
../Data/flowerShop.csv is encoded in utf-8
../Data/tea.csv is encoded in ascii
../Data/boats.csv is encoded in UTF-8-SIG
../Data/glassdoor.csv is encoded in utf-8
../Data/PlaneCandidates.xlsx is encoded in None
../Data/BorderCrossing.csv is encoded in ascii
../Data/YouTubeChannels.csv is encoded in utf-8
../Data/olympicEvents.xlsx is encoded in None
../Data/datasets-733390-1272569-individuals.csv is encoded in ascii
../Data/YouTubeChannels_Python.csv is encoded in utf-8
../Data/test.csv is encoded in a
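
In the listing above, the .xlsx workbooks come back as None and .DS_Store gets an implausible guess, because neither is a plain-text file. One way to avoid scanning such files is to filter by extension first. The sketch below is a minimal illustration using only the standard library; the helper name `iter_text_files` and the default suffix list are assumptions, not part of the original notebook.

```python
from pathlib import Path

# Hypothetical helper: yield only files whose extension suggests plain text,
# so binary files such as .xlsx workbooks are never passed to the detector.
def iter_text_files(folder, suffixes=(".csv", ".txt")):
    for path in sorted(Path(folder).rglob("*")):
        # Compare suffixes case-insensitively so "DATA.CSV" is also matched
        if path.is_file() and path.suffix.lower() in suffixes:
            yield path
```

Each yielded path can then be opened in binary mode and passed to chardet.detect() exactly as in the loop above.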

In [13]:
# import a library to detect encodings
import chardet
import glob

# for every file that has an extension, print the file name & a guess of its encoding
print("File".ljust(45), "Encoding")
for filename in glob.glob('../Data/*.*'):
    with open(filename, 'rb') as rawdata:
        result = chardet.detect(rawdata.read())
    print(filename.ljust(45), result['encoding'])

File                                          Encoding
../Data/breakfast.csv                         ascii
../Data/L3P1.csv                              UTF-8-SIG
../Data/pgaTourData.xlsx                      None
../Data/cellPhone.csv                         ascii
../Data/flowerShipment.csv                    ascii
../Data/graduate_admissions.csv               ascii
../Data/All_GPUs.csv                          utf-8
../Data/London_hotel_reviews.csv              Windows-1252
../Data/volcanoes.csv                         utf-8
../Data/anime.csv                             utf-8
../Data/flowerShop.csv                        utf-8
../Data/tea.csv                               ascii
../Data/boats.csv                             UTF-8-SIG
../Data/glassdoor.csv                         utf-8
../Data/PlaneCandidates.xlsx                  None
../Data/BorderCrossing.csv                    ascii
../Data/YouTubeChannels.csv                   utf-8
../Data/olympicEvents.xlsx                    No
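
The UTF-8-SIG entries in the table are UTF-8 files that begin with a byte order mark (BOM), the three bytes EF BB BF. As a rough sketch, the BOM can be checked directly with the standard library; the helper name `has_utf8_bom` is an assumption made for illustration.

```python
import codecs

# Hypothetical helper: True when the file begins with the UTF-8 byte order mark.
def has_utf8_bom(path):
    with open(path, "rb") as f:
        return f.read(len(codecs.BOM_UTF8)) == codecs.BOM_UTF8

# Decoding with 'utf-8' keeps the BOM as a leading U+FEFF character,
# while 'utf-8-sig' strips it.
raw = codecs.BOM_UTF8 + b"id,name"
print(repr(raw.decode("utf-8")))      # '\ufeffid,name'
print(repr(raw.decode("utf-8-sig")))  # 'id,name'
```

If the first column name of a file read as plain UTF-8 starts with an invisible extra character, a leftover BOM is a likely cause.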

You can use the pandas library to read a file and then save it with UTF-8 encoding. Note that pandas does not detect the source encoding for you: read_csv() assumes UTF-8 unless you pass an encoding argument. Here's an example of how to do this for a CSV file:


In [23]:
import pandas as pd

# Read the CSV file (pandas assumes UTF-8 unless an encoding argument is given)
file_path = '../Data/titanic.csv'
df = pd.read_csv(file_path)

# Save the dataframe to a new CSV file with UTF-8 encoding
df.to_csv('../Data/titanic-utf-8.csv', index=False, encoding='utf-8')
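
For a file whose bytes are not valid UTF-8, such as one chardet guesses to be Windows-1252, the source encoding has to be supplied explicitly; with pandas that is the `encoding` argument of `read_csv()`. When no dataframe is needed, the same conversion can be sketched with plain file I/O. The helper name `convert_to_utf8` and its parameters are assumptions for illustration, and the source encoding is taken as already known.

```python
# Hypothetical helper: re-encode a text file to UTF-8 when the source
# encoding is already known (for example from a chardet guess you trust).
def convert_to_utf8(src_path, dst_path, source_encoding):
    # newline="" disables newline translation, so line endings are preserved
    with open(src_path, "r", encoding=source_encoding, newline="") as src:
        text = src.read()
    with open(dst_path, "w", encoding="utf-8", newline="") as dst:
        dst.write(text)
```

If the stated source encoding is wrong, decoding may raise UnicodeDecodeError or silently produce the wrong characters, so it is worth spot-checking the converted file.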


In [24]:
# Open the file in binary mode
with open(file_path, 'rb') as f:
    # Read the file content
    data = f.read()
    # Detect the file's encoding
    result = chardet.detect(data)
    print(f'Encoding of {file_path} is {result["encoding"]}')

Encoding of ../Data/titanic.csv is ascii


In [25]:
file_path2 = '../Data/titanic-utf-8.csv'
# Open the file in binary mode
with open(file_path2, 'rb') as f:
    # Read the file content
    data = f.read()
    # Detect the file's encoding
    result = chardet.detect(data)
    print(f'Encoding of {file_path2} is {result["encoding"]}')

Encoding of ../Data/titanic-utf-8.csv is ascii
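
The converted file is still reported as ascii because ASCII is a subset of UTF-8: a file that contains only ASCII characters has exactly the same bytes in both encodings, so the detector has nothing to tell them apart. The short check below illustrates this with made-up sample strings (they are not taken from the dataset).

```python
# ASCII-only text produces identical bytes in ASCII and UTF-8
ascii_text = "id,name,age"
print(ascii_text.encode("ascii") == ascii_text.encode("utf-8"))  # True

# A non-ASCII character is where the encodings start to differ
accented = "café"
print(accented.encode("utf-8"))   # b'caf\xc3\xa9'
print(accented.encode("cp1252"))  # b'caf\xe9'
```

A detector can only distinguish UTF-8 from a legacy encoding such as Windows-1252 when the file contains at least one non-ASCII character.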
