# Character Encodings

In [1]:
# modules we'll use
import pandas as pd
import numpy as np

# helpful character encoding module
# Required for this environment: $ mamba install -n base -c conda-forge chardet
import chardet

# set seed for reproducibility
np.random.seed(0)

import os
from pathlib import Path

### What are encodings?
**Character encodings** are specific sets of rules for mapping from raw binary byte strings (that look like this: 0110100001101001) to characters that make up human-readable text (like "hi"). There are many different encodings, and if you try to read in text with a different encoding than the one it was originally written in, you end up with scrambled text called "mojibake" (said like mo-gee-bah-kay). Here's an example of mojibake:

æ–‡å—åŒ–ã??
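
To make this concrete, here is a minimal sketch of how text like the sample above can come about. It assumes the original text was the Japanese word 文字化け saved as UTF-8 and then read back as Windows-1252; that pairing is my assumption for illustration, not something the sample itself tells us.

```python
# The original text, written out as UTF-8 bytes
original = "文字化け"
utf8_bytes = original.encode("utf-8")

# Reading those bytes with the wrong decoder (Windows-1252, assumed here).
# errors="replace" is needed because a few byte values have no
# Windows-1252 character at all.
garbled = utf8_bytes.decode("cp1252", errors="replace")

print(garbled)
print(len(original), "characters became", len(garbled), "characters")
```

Each of the four characters takes three bytes in UTF-8, and Windows-1252 turns every byte into its own character, which is why the scrambled version is so much longer than the original.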

You might also end up with "unknown" characters. These are what gets printed when there's no mapping between a particular byte and a character in the encoding you're using to read your byte string, and they look like this:

����������
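
As a small sketch of where those placeholder characters come from: below, a euro sign is saved as Windows-1252 (a single byte) and then decoded as UTF-8 with `errors="replace"`. The byte is not valid UTF-8, so Python substitutes the Unicode replacement character U+FFFD. The choice of Windows-1252 is only an assumption for the example.

```python
# A euro sign stored as Windows-1252 is the single byte 0x80
raw = "€".encode("cp1252")

# 0x80 on its own is not valid UTF-8, so it is replaced with U+FFFD
text = raw.decode("utf-8", errors="replace")

print(raw)
print(repr(text))
```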

Character encoding mismatches are less common today than they used to be, but they're definitely still a problem. There are lots of different character encodings, but the main one you need to know is UTF-8.

UTF-8 is **the** standard text encoding. All Python code is in UTF-8 and, ideally, all your data should be as well. It's when things aren't in UTF-8 that you run into trouble.

It was pretty hard to deal with encodings in Python 2, but thankfully in Python 3 it's a lot simpler. (Kaggle Notebooks only use Python 3.) There are two main data types you'll encounter when working with text in Python 3. One is the string, which is what text is by default.

In [2]:
# start with a string
before = "This is the euro symbol: €"

# check to see what datatype it is
type(before)

str

The other is the bytes data type, which is a sequence of integers. You can convert a string into bytes by specifying which encoding to use:

In [3]:
# encode the string as UTF-8 bytes; errors="replace" substitutes any character that can't be encoded
after = before.encode("utf-8", errors="replace")

# check the type
type(after)

bytes

If you look at a bytes object, you'll see that it has a b in front of it, and then maybe some text after. That's because bytes are printed out as if they were characters encoded in ASCII. (ASCII is an older character encoding that doesn't really work for writing any language other than English.) Here you can see that our euro symbol shows up as the three escaped bytes "\xe2\x82\xac", because those bytes have no ASCII equivalent and so can't be printed as ASCII characters.

In [4]:
# take a look at what the bytes look like
after

b'This is the euro symbol: \xe2\x82\xac'
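
Since bytes really are a sequence of integers, it can help to look at them that way. This is a small sketch using the same sentence; the variable names are just for this example.

```python
text = "This is the euro symbol: €"
data = text.encode("utf-8")

# The string has 26 characters, but the euro sign needs 3 bytes in UTF-8
print(len(text), "characters")
print(len(data), "bytes")

# Indexing a bytes object gives an integer, not a character
print(data[0])          # the byte for "T"
print(list(data[-3:]))  # the three bytes of the euro sign
```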

When we convert our bytes back to a string with the correct encoding, we can see that our text is all there correctly, which is great! :)

In [5]:
# decode the bytes back into a string using UTF-8
print(after.decode("utf-8"))

This is the euro symbol: €


However, when we try to use a different encoding to map our bytes into a string, we get an error. This is because the encoding we're trying to use doesn't know what to do with the bytes we're trying to pass it. You need to tell Python the encoding that the byte string is actually supposed to be in.

>You can think of different encodings as different ways of recording music. You can record the same music on a CD, cassette tape or 8-track. While the music may sound more-or-less the same, you need to use the right equipment to play the music from each recording format. The correct decoder is like a cassette player or a CD player. If you try to play a cassette in a CD player, it just won't work.

In [6]:
# try to decode our bytes with the ascii encoding
try:
    print(after.decode("ascii"))
except Exception as e:
    print(f'{e.__class__}')

<class 'UnicodeDecodeError'>
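
The exception object carries more detail than its class name. As a sketch, the attributes below (`encoding`, `start`, `end`, `reason` and `object`) are standard on `UnicodeDecodeError` and tell you which byte the decoder choked on.

```python
data = "This is the euro symbol: €".encode("utf-8")

try:
    data.decode("ascii")
except UnicodeDecodeError as err:
    # Keep the details, since `err` is cleared when the except block ends
    details = (err.encoding, err.start, err.end)
    reason = err.reason
    bad_bytes = err.object[err.start:err.end]

print(details)    # which codec failed, and the offsets of the bad byte
print(reason)
print(bad_bytes)  # the first byte of the euro sign
```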


We can also run into trouble if we try to use the wrong encoding to map from a string to bytes. Python 3 strings are Unicode text, which UTF-8 can always represent, so if we try to encode them with a narrower encoding we'll create problems.

For example, if we try to convert a string to bytes for ASCII using encode(), we can ask for the bytes to be what they would be if the text was in ASCII. Since our text isn't in ASCII, though, there will be some characters it can't handle. We can automatically replace the characters that ASCII can't handle. If we do that, however, any characters not in ASCII will just be replaced with the unknown character. Then, when we convert the bytes back to a string, the character will be replaced with the unknown character. The dangerous part about this is that there's no way to tell which character it should have been. That means we may have just made our data unusable!

In [7]:
# start with a string
before = "This is the euro symbol: €"

# encode it to a different encoding, replacing characters that raise errors
after = before.encode("ascii", errors = "replace")

# decode the ASCII bytes back into a string
print(after.decode("ascii"))

# We've lost the original underlying byte string! It's been 
# replaced with the underlying byte string for the unknown character :(

This is the euro symbol: ?
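
For comparison, here is a short sketch of what a few of Python's built-in error handlers do when encoding to ASCII. The handler names are standard; the sample text is made up. Note that `"replace"` and `"ignore"` throw the original character away, while the other two keep enough information to recover it.

```python
text = "price: €5"

results = {}
for handler in ("replace", "ignore", "xmlcharrefreplace", "backslashreplace"):
    # Each handler decides what to emit for characters ASCII can't represent
    results[handler] = text.encode("ascii", errors=handler)
    print(f"{handler:>18}: {results[handler]}")
```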


This is bad and we want to avoid doing it! It's far better to convert all our text to UTF-8 as soon as we can and keep it in that encoding. The best time to convert non-UTF-8 input into UTF-8 is when you read in files, which we'll talk about next.

### Reading in files with encoding problems
Most files you'll encounter will probably be encoded with UTF-8. This is what Python expects by default, so most of the time you won't run into problems. However, sometimes you'll get an error like this:

In [8]:
# try to read in a file not in UTF-8
try:
    kickstarter_2016 = pd.read_csv("../input/kickstarter-projects/ks-projects-201612.csv")
except Exception as e:
    print(f'{e.__class__}')

<class 'UnicodeDecodeError'>


Notice that we get the same UnicodeDecodeError we got when we tried to decode UTF-8 bytes as if they were ASCII! This tells us that this file isn't actually UTF-8. We don't know what encoding it actually is though. One way to figure it out is to try and test a bunch of different character encodings and see if any of them work. A better way, though, is to use the chardet module to try and automatically guess what the right encoding is. It's not 100% guaranteed to be right, but it's usually faster than just trying to guess.
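
The "try a bunch of encodings" approach can be sketched in a few lines. The function name and the candidate list below are my own choices for illustration. Keep in mind that a decode that doesn't raise is not proof the encoding is right: `latin-1` accepts every possible byte, so it will never fail.

```python
def first_working_encoding(raw, candidates=("utf-8", "cp1252", "latin-1")):
    """Return the first candidate that decodes `raw` without an error."""
    for name in candidates:
        try:
            raw.decode(name)
        except UnicodeDecodeError:
            continue
        return name
    return None

# "café" saved as Windows-1252 is not valid UTF-8
sample = "café".encode("cp1252")
print(first_working_encoding(sample))
```

Because of the `latin-1` caveat, the order of the candidates matters: put the strictest encodings first and the catch-all ones last.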

I'm going to just look at the first ten thousand bytes of this file. This is usually enough for a good guess about what the encoding is and is much faster than trying to look at the whole file. (Especially with a large file this can be very slow.) Another reason to just look at the first part of the file is that we can see by looking at the error message that the first problem is the 11th character. So we probably only need to look at the first little bit of the file to figure out what's going on.
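
One way to check whether a sample is large enough is to find where the first non-ASCII byte sits. If it lies beyond the sample, a detector only ever sees plain ASCII. This is a minimal sketch with a hypothetical helper name and made-up sample bytes.

```python
def first_non_ascii_offset(raw):
    """Return the offset of the first byte above 127, or None if all ASCII."""
    for offset, byte in enumerate(raw):
        if byte > 127:
            return offset
    return None

print(first_non_ascii_offset(b"abc\xe9def"))
print(first_non_ascii_offset(b"only ascii here"))
```

With a real file you would pass in the bytes from `open(path, 'rb').read()` and make sure the sample you give to the detector reaches at least that offset.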

In [9]:
# look at the first ten thousand bytes to guess the character encoding
with open("../input/kickstarter-projects/ks-projects-201612.csv", 'rb') as rawdata:
    result = chardet.detect(rawdata.read(10000))

# check what the character encoding might be
print(result)

{'encoding': 'Windows-1252', 'confidence': 0.73, 'language': ''}


In [10]:
# look at the first ten thousand bytes to guess the character encoding
with open("../input/kickstarter-projects/ks-projects-201801.csv", 'rb') as rawdata:
    result = chardet.detect(rawdata.read(10000))

# check what the character encoding might be
print(result)

{'encoding': 'Windows-1252', 'confidence': 0.73, 'language': ''}


So chardet is 73% confident that the right encoding is "Windows-1252". Let's see if that's correct:

In [11]:
# read in the file with the encoding detected by chardet
kickstarter_2016 = pd.read_csv("../input/kickstarter-projects/ks-projects-201612.csv", encoding='Windows-1252')

# look at the first few lines
kickstarter_2016.head()

  exec(code_obj, self.user_global_ns, self.user_ns)


Unnamed: 0,ID,name,category,main_category,currency,deadline,goal,launched,pledged,state,backers,country,usd pledged,Unnamed: 13,Unnamed: 14,Unnamed: 15,Unnamed: 16
0,1000002330,The Songs of Adelaide & Abullah,Poetry,Publishing,GBP,2015-10-09 11:36:00,1000,2015-08-11 12:12:28,0,failed,0,GB,0,,,,
1,1000004038,Where is Hank?,Narrative Film,Film & Video,USD,2013-02-26 00:20:50,45000,2013-01-12 00:20:50,220,failed,3,US,220,,,,
2,1000007540,ToshiCapital Rekordz Needs Help to Complete Album,Music,Music,USD,2012-04-16 04:24:11,5000,2012-03-17 03:24:11,1,failed,1,US,1,,,,
3,1000011046,Community Film Project: The Art of Neighborhoo...,Film & Video,Film & Video,USD,2015-08-29 01:00:00,19500,2015-07-04 08:35:03,1283,canceled,14,US,1283,,,,
4,1000014025,Monarch Espresso Bar,Restaurants,Food,USD,2016-04-01 13:38:27,50000,2016-02-26 13:38:27,52375,successful,224,US,52375,,,,


Yep, looks like chardet was right! The file reads in with no problem (although we do get a warning about datatypes) and when we look at the first few rows it seems to be fine.

What if the encoding chardet guesses isn't right? Since chardet is basically just a fancy guesser, sometimes it will guess the wrong encoding. One thing you can try is looking at more or less of the file and seeing if you get a different result and then try that.

### Saving your files with UTF-8 encoding
Finally, once you've gone through all the trouble of getting your file into UTF-8, you'll probably want to keep it that way. The easiest way to do that is to save your files with UTF-8 encoding. The good news is, since UTF-8 is the standard encoding in Python, when you save a file it will be saved as UTF-8 by default:

In [12]:
# save our file (will be saved as UTF-8 by default!)

# kickstarter_2016.to_csv("ks-projects-201801-utf8.csv")
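
If you write files with Python's own file functions instead of pandas, it's safer to name the encoding yourself, because on many Python versions the default for text files depends on the platform. This is a small sketch that writes to a temporary directory; the file name is made up.

```python
import tempfile
from pathlib import Path

text = "This is the euro symbol: €"

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "example-utf8.txt"  # hypothetical file name

    # Name the encoding explicitly when writing...
    path.write_text(text, encoding="utf-8")

    # ...then check the raw bytes and read the text back the same way
    raw = path.read_bytes()
    round_trip = path.read_text(encoding="utf-8")

print(raw[-3:])
print(round_trip == text)
```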

# Exercise: Character Encodings

### Get our environment set up

The first thing we'll need to do is load in the libraries we'll be using.

In [13]:
# modules we'll use
import pandas as pd
import numpy as np

# helpful character encoding module
import chardet

# set seed for reproducibility
np.random.seed(0)

### 1) What are encodings?

You're working with a dataset composed of bytes.  Run the code cell below to print a sample entry.

In [14]:
sample_entry = b'\xa7A\xa6n'
print(sample_entry)
print('data type:', type(sample_entry))

b'\xa7A\xa6n'
data type: <class 'bytes'>


You notice that it doesn't use the standard UTF-8 encoding. 

Use the next code cell to create a variable `new_entry` that changes the encoding from `"big5-tw"` to `"utf-8"`.  `new_entry` should have the bytes datatype.

In [15]:
new_entry = sample_entry.decode('big5-tw').encode('utf-8')
new_entry

b'\xe4\xbd\xa0\xe5\xa5\xbd'
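
Splitting the conversion into two steps shows what is going on: decoding turns the Big5 bytes into a string, and encoding turns that string into UTF-8 bytes. This sketch uses the `"big5"` codec name, which I'm assuming behaves like `"big5-tw"` for this entry.

```python
sample_entry = b'\xa7A\xa6n'

# Step 1: bytes -> string, using the encoding the data was written in
as_text = sample_entry.decode("big5")

# Step 2: string -> bytes, using the encoding we want
new_entry = as_text.encode("utf-8")

print(as_text)
print(len(sample_entry), "bytes in Big5,", len(new_entry), "bytes in UTF-8")
```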

### 2) Reading in files with encoding problems

Use the code cell below to read in the file at the path `"../input/fatal-police-shootings-in-the-us/PoliceKillingsUS.csv"`.

Figure out what the correct encoding should be and read in the file to a DataFrame `police_killings`.

In [16]:
# Estimate the encoding with chardet:
# When the file was read without an encoding below, it failed around character 28000. The first 10000 were all ASCII.

with open('../input/fatal-police-shootings-in-the-us/PoliceKillingsUS.csv', 'rb') as rawdata:
    result1 = chardet.detect(rawdata.read(30000))
print(result1)

{'encoding': 'Windows-1252', 'confidence': 0.73, 'language': ''}


In [17]:
# Load in the DataFrame with the encoding estimated above (Windows-1252)
police_killings = pd.read_csv("../input/fatal-police-shootings-in-the-us/PoliceKillingsUS.csv", encoding='Windows-1252')


### 3) Saving your files with UTF-8 encoding
Save a version of the police killings dataset to CSV with UTF-8 encoding. Your answer will be marked correct after saving this file.

Note: When using the to_csv() method, supply only the name of the file (e.g., "my_file.csv"). This saves the file at the filepath "/kaggle/working/my_file.csv".

In [18]:
police_killings.to_csv('my_file.csv')

### (Optional) More practice

Check out [this dataset of files in different character encodings](https://www.kaggle.com/rtatman/character-encoding-examples). Can you read in all the files with their original encodings and then save them out as UTF-8 files?

If you have a file that's in UTF-8 but has just a couple of weird-looking characters in it, you can try out the [ftfy module](https://ftfy.readthedocs.io/en/latest/#) and see if it helps. 
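
To give a feel for the kind of repair involved, here is a hand-rolled sketch that undoes one specific mistake: UTF-8 text that was decoded as Windows-1252. This is not ftfy's API or its algorithm, just the simplest version of the idea, and the function name is hypothetical.

```python
def undo_cp1252_mojibake(text):
    """Reverse a UTF-8-read-as-Windows-1252 mix-up, if that is what happened."""
    try:
        # Turn the wrong characters back into the original bytes,
        # then decode those bytes the way they were meant to be read
        return text.encode("cp1252").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # The text wasn't this kind of mojibake, so leave it alone
        return text

print(undo_cp1252_mojibake("â‚¬"))
print(undo_cp1252_mojibake("café"))
```

Text that was never garbled fails the round trip and comes back unchanged, which keeps the helper from making things worse.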

In [19]:
def qCoding(filename, quantity=-1):
    # Read `quantity` bytes (-1 reads the whole file) and guess the encoding
    with open(filename, 'rb') as f:
        raw = f.read(quantity)
    v = chardet.detect(raw)['encoding']
    return raw, v

In [20]:
directory = Path.home() / 'notebooks' / 'input' / 'character_encoding_examples'
catalog = pd.read_csv(directory / 'file_guide.csv')
catalog

Unnamed: 0,File,Text,Author,Encoding,Language,Words
0,die_ISO-8859-1.txt,Die Fürstin,Kasimir Edschmid,ISO-8859-1,German,13314
1,harpers_ASCII.txt,"Harper's Round Table, October 8, 1895",Various,ASCII,English,29094
2,olaf_Windows-1251.txt,Olaf van Geldern,Pencho Slaveykov,Windows 1251,Bulgarian,2790
3,portugal_ISO-8859-1.txt,"Portugal enfermo por vicios, e abusos de ambos...",José Daniel Rodrigues da Costa,ISO-8859-1,Portuguese,14215
4,shisei_UTF-8.txt,Shisei,Junichiro Tanizaki,UTF-8,Japanese,4809
5,yan_BIG-5.txt,Yan shi jia xun,Yan Zhitui,BIG-5,Chinese,2538


In [46]:
# adaptation of https://www.kaggle.com/keithmurray/data-cleaning-character-encoding-practice
# where all the files are worked on iteratively
from pathlib import Path
import pandas as pd
from chardet import detect

# Make a catalog of the files to examine
directory = Path.home() / 'notebooks' / 'input' / 'character_encoding_examples'
catalog = pd.read_csv(directory / 'file_guide.csv')

result = {}

for i, fn in enumerate(catalog.File):
    
    # Construct next file to process
    filename = directory / fn

    # Estimate its encoding. Retain raw data for decoding.
    with open(filename, 'rb') as f:
        raw = f.read()
        estimated_encoding = detect(raw)['encoding']

    # Decode with the estimated encoding, re-encode as UTF-8, and check what chardet reports
    new_encoding = detect(raw.decode(estimated_encoding).encode())['encoding']
    
    
    result[fn] = {'Estimated Encoding': estimated_encoding, 'New Encoding': new_encoding}
        
print('Note: Because ascii is subset of UTF8, a file of pure ascii in UTF8 is reported as ASCII')
# pd.concat([catalog, pd.DataFrame(result).transpose()], axis=0)
# result is indexed by file name, so join it to the catalog's File column
# (a bare merge() finds no common columns and raises MergeError)
catalog.merge(pd.DataFrame(result).transpose(), left_on='File', right_index=True)


Note: Because ascii is subset of UTF8, a file of pure ascii in UTF8 is reported as ASCII

In [42]:
pd.DataFrame(result).transpose()

Unnamed: 0,Estimated Encoding,New Encoding
die_ISO-8859-1.txt,ISO-8859-1,utf-8
harpers_ASCII.txt,ascii,ascii
olaf_Windows-1251.txt,windows-1251,utf-8
portugal_ISO-8859-1.txt,ISO-8859-1,utf-8
shisei_UTF-8.txt,UTF-8-SIG,utf-8
yan_BIG-5.txt,Big5,utf-8
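
Two rows in this table are worth a quick sketch. Pure ASCII text produces exactly the same bytes in ASCII and UTF-8, so a detector has no way to tell them apart. And "UTF-8-SIG" is UTF-8 with a byte order mark (BOM) at the start of the file; presumably that is why the Japanese file is reported that way. The sample strings below are just stand-ins.

```python
import codecs

# ASCII is a subset of UTF-8: identical bytes either way
ascii_text = "Harper's Round Table"
same_bytes = ascii_text.encode("ascii") == ascii_text.encode("utf-8")
print(same_bytes)

# A UTF-8 file that starts with a BOM
with_bom = codecs.BOM_UTF8 + "Shisei".encode("utf-8")

# Plain "utf-8" keeps the BOM as an invisible first character;
# "utf-8-sig" strips it
kept = with_bom.decode("utf-8")
stripped = with_bom.decode("utf-8-sig")
print(repr(kept))
print(repr(stripped))
```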


In [54]:
# Adaptation of https://www.kaggle.com/keithmurray/data-cleaning-character-encoding-practice
# where all the files are worked on iteratively
from pathlib import Path
import pandas as pd
from chardet import detect

# Make a catalog of the files to examine
directory = Path.home() / 'notebooks' / 'input' / 'character_encoding_examples'
catalog = pd.read_csv(directory / 'file_guide.csv')


def encoding(x, sample=10000, directory=directory):
    # Find the encoding of the original file and the encoding if it is converted to UTF8
    
    filename = directory / x

    # Find original encoding from a sample. Save contents for decode-encode
    with open(filename, 'rb') as f:
        raw = f.read()
        estimated_encoding = detect(raw[:sample])['encoding']

    # Find the encoding after decoding as estimated encoding and then after re-encoding as UTF8
    new_encoding = detect(raw.decode(estimated_encoding).encode()[:sample])['encoding']
    
    # Return as a Series so it can be added as new columns in the catalog
    return pd.Series([estimated_encoding, new_encoding])

print('Explanation: Because ascii is subset of UTF8, a utf8 file of pure ascii is reported as ASCII')
catalog[['Estimated Encoding', 'New Encoding']] = catalog['File'].apply(encoding)
catalog


Explanation: Because ascii is subset of UTF8, a utf8 file of pure ascii is reported as ASCII


Unnamed: 0,File,Text,Author,Encoding,Language,Words,Estimated Encoding,New Encoding
0,die_ISO-8859-1.txt,Die Fürstin,Kasimir Edschmid,ISO-8859-1,German,13314,ISO-8859-1,utf-8
1,harpers_ASCII.txt,"Harper's Round Table, October 8, 1895",Various,ASCII,English,29094,ascii,ascii
2,olaf_Windows-1251.txt,Olaf van Geldern,Pencho Slaveykov,Windows 1251,Bulgarian,2790,windows-1251,utf-8
3,portugal_ISO-8859-1.txt,"Portugal enfermo por vicios, e abusos de ambos...",José Daniel Rodrigues da Costa,ISO-8859-1,Portuguese,14215,ISO-8859-1,utf-8
4,shisei_UTF-8.txt,Shisei,Junichiro Tanizaki,UTF-8,Japanese,4809,UTF-8-SIG,utf-8
5,yan_BIG-5.txt,Yan shi jia xun,Yan Zhitui,BIG-5,Chinese,2538,Big5,utf-8


In [100]:
# Adaptation of https://www.kaggle.com/keithmurray/data-cleaning-character-encoding-practice
# to use pandas apply

import pandas as pd
import numpy as np
from pathlib import Path
from chardet import detect

# Make a catalog of the files to examine
directory = Path.home() / 'notebooks' / 'input' / 'character_encoding_examples'
catalog = pd.read_csv(directory / 'file_guide.csv')


def find_encoding(x, sample=10001, directory=directory):
    # Find the encoding of the original file and the encoding if it is converted to UTF8
    
    filename = directory / x

    # Find original encoding from a sample. Save contents for decode-encode
    with open(filename, 'rb') as f:
        raw = f.read(sample)
        try:
            estimate = detect(raw)
        except Exception as e:
            print(f'Estimate: {e.__class__} in {filename}')
            return pd.Series([np.nan, np.nan, np.nan])

    # Find the encoding after decoding as estimated encoding and then after re-encoding as UTF8
    try:
        new = detect(raw.decode(estimate['encoding']).encode())
    except Exception as e:
        print(f'New: {e.__class__} in {filename}')
        return pd.Series([estimate['encoding'], estimate['confidence'], np.nan])
       
    # Return as a Series so it can be added as new columns in the catalog
    return pd.Series([estimate['encoding'], estimate['confidence'], new['encoding']])


print('Explanation: Because ascii is subset of UTF8, a utf8 file of pure ascii is reported as ASCII')

catalog[['Estimated Encoding', 'Confidence', 'New Encoding']] = catalog['File'].apply(find_encoding)
catalog


Explanation: Because ascii is subset of UTF8, a utf8 file of pure ascii is reported as ASCII


Unnamed: 0,File,Text,Author,Encoding,Language,Words,Estimated Encoding,Confidence,New Encoding
0,die_ISO-8859-1.txt,Die Fürstin,Kasimir Edschmid,ISO-8859-1,German,13314,ISO-8859-1,0.671095,utf-8
1,harpers_ASCII.txt,"Harper's Round Table, October 8, 1895",Various,ASCII,English,29094,ascii,1.0,ascii
2,olaf_Windows-1251.txt,Olaf van Geldern,Pencho Slaveykov,Windows 1251,Bulgarian,2790,windows-1251,0.99,utf-8
3,portugal_ISO-8859-1.txt,"Portugal enfermo por vicios, e abusos de ambos...",José Daniel Rodrigues da Costa,ISO-8859-1,Portuguese,14215,ISO-8859-1,0.73,utf-8
4,shisei_UTF-8.txt,Shisei,Junichiro Tanizaki,UTF-8,Japanese,4809,UTF-8-SIG,1.0,utf-8
5,yan_BIG-5.txt,Yan shi jia xun,Yan Zhitui,BIG-5,Chinese,2538,Big5,0.99,utf-8


In [93]:
# Adaptation of https://www.kaggle.com/keithmurray/data-cleaning-character-encoding-practice
# to use pandas apply

from pathlib import Path
import pandas as pd
from chardet import detect

# Make a catalog of the files to examine
directory = Path.home() / 'notebooks' / 'input' / 'character_encoding_examples'
catalog = pd.read_csv(directory / 'file_guide.csv')


def encoding(x, sample=10001, directory=directory):
    # Find the encoding of the original file and the encoding if it is converted to UTF8
    
    filename = directory / x

    # Find original encoding from a sample. Save contents for decode-encode
    with open(filename, 'rb') as f:
        raw = f.read(sample)
        estimated_encoding = detect(raw)['encoding']

    # Find the encoding after decoding as estimated encoding and then after re-encoding as UTF8
    new_encoding = detect(raw.decode(estimated_encoding).encode())['encoding']
    
    # Return as a Series so it can be added as new columns in the catalog
    return pd.Series([estimated_encoding, new_encoding])


print('Explanation: Because ascii is subset of UTF8, a utf8 file of pure ascii is reported as ASCII')

catalog[['Estimated Encoding', 'New Encoding']] = catalog['File'].apply(encoding)
catalog


Explanation: Because ascii is subset of UTF8, a utf8 file of pure ascii is reported as ASCII



# Extra SCRATCH

In [23]:
import os
# for dirname, _, filenames in os.walk('../input/character_encoding_examples'):
for dirname, _, filenames in os.walk('../input/'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

../input/character_encoding_examples/die_ISO-8859-1.txt
../input/character_encoding_examples/file_guide.csv
../input/character_encoding_examples/harpers_ASCII.txt
../input/character_encoding_examples/olaf_Windows-1251.txt
../input/character_encoding_examples/output.die_ISO-8859-1.txt
../input/character_encoding_examples/output.harpers_ASCII.txt
../input/character_encoding_examples/output.olaf_Windows-1251.txt
../input/character_encoding_examples/output.portugal_ISO-8859-1.txt
../input/character_encoding_examples/output.shisei_UTF-8.txt
../input/character_encoding_examples/output.yan_BIG-5.txt
../input/character_encoding_examples/portugal_ISO-8859-1.txt
../input/character_encoding_examples/shisei_UTF-8.txt
../input/character_encoding_examples/yan_BIG-5.txt
../input/earthquake-database/database.csv
../input/earthquake-database/database.csv.zip
../input/fatal-police-shootings-in-the-us/PoliceKillingsUS.csv
../input/kickstarter-projects/ks-projects-201612.csv
../input/kickstarter-projects/

In [24]:
dict = {'a':1, 'b':2}

In [25]:
for key,v in dict.items():
    print(key, v)

a 1
b 2


In [26]:
for key in dict.items():
    print(key)

('a', 1)
('b', 2)


In [27]:
for key in dict.keys():
    print(key)  

a
b


In [28]:
for v in dict.values():
    print(v)  

1
2


In [29]:
lst = [1, 2, 3]

In [30]:
for i,x in enumerate(lst):
    print(i, '-> ', x)

0 ->  1
1 ->  2
2 ->  3


In [31]:
for i,x in enumerate(lst):
    if i==1: break
    print(i, '-> ', x)

0 ->  1


In [32]:
for i,x in enumerate(lst):
    if i==1: continue
    print(i, '-> ', x)

0 ->  1
2 ->  3


In [33]:
def fn(x):
    return 'x+2=', x+2

In [34]:
fn(4)

('x+2=', 6)

In [66]:
pd.Series([dict['a'],dict['b']])

0    1
1    2
dtype: int64

In [68]:
dict['a'],dict['b']

(1, 2)

In [80]:
pd.Series({k:dict[k] for k in ['a', 'b'] if k in dict})

a    1
b    2
dtype: int64

In [82]:
pd.Series([1,2]+[3])

0    1
1    2
2    3
dtype: int64