# Use the right encoding (and how to find it)
- Determine the file's encoding first. Detection is heuristic and cannot be correct every time; see this [Stack Overflow post](https://stackoverflow.com/questions/436220/how-to-determine-the-encoding-of-text).
- Python libraries that can help: [chardet](https://pypi.org/project/chardet/), [python-magic](https://pypi.org/project/python-magic/), [UnicodeDammit](https://www.crummy.com/software/BeautifulSoup/bs4/doc/#unicode-dammit)
- One gotcha: python-magic depends on libmagic, an OS-level library that must be installed separately (on macOS, `brew install libmagic`).
- The `encoding` argument of `pandas.read_csv` accepts the names listed under Python's [standard encodings](https://docs.python.org/3/library/codecs.html#standard-encodings).
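
Because no detector is right every time, a low-tech alternative is to try a short list of likely encodings in order and keep the first one that decodes cleanly. The sketch below uses only the standard library; `first_working_encoding` and the default candidate list are illustrative names chosen here, not part of any library mentioned above. ISO-8859-1 maps every possible byte to a character, so it never fails and has to come last.

```python
def first_working_encoding(data, candidates=("utf-8", "iso-8859-1")):
    """Return the first candidate that decodes `data` without error, else None."""
    for name in candidates:
        try:
            data.decode(name)
        except UnicodeDecodeError:
            continue  # this candidate cannot represent the bytes; try the next
        return name
    return None


utf8_bytes = b"caf\xc3\xa9"   # "cafe" with an accented e, encoded as UTF-8
latin1_bytes = b"caf\xe9"     # the same word encoded as ISO-8859-1

print(first_working_encoding(utf8_bytes))
print(first_working_encoding(latin1_bytes))
```

A clean decode only shows that the bytes are valid in that encoding, not that the resulting text is what the author intended, so the result is still worth a visual check.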

In [1]:
import pandas as pd
# No encoding is given, so pandas assumes UTF-8.
df_clicks = pd.read_csv("./data/Clicks.csv", sep="|")
df_clicks

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xa0 in position 4: invalid start byte
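
The error message names the offending byte. As a quick standard-library check of why it fails: `0xa0` is a continuation byte in UTF-8 and cannot start a character, whereas in ISO-8859-1 it is a valid character on its own (the no-break space).

```python
import unicodedata

raw = b"\xa0"

# UTF-8 rejects the byte because it cannot begin a character.
try:
    raw.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError as exc:
    utf8_ok = False
    print(exc.reason)

# ISO-8859-1 maps the same byte to a single character.
latin1_char = raw.decode("iso-8859-1")
print(unicodedata.name(latin1_char))
```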

# Use python-magic to determine the encoding
- Install libmagic first; on macOS, run `brew install libmagic`.

In [2]:
!pip install python-magic==0.4.18



In [3]:
import magic

# Read the raw bytes; detection works on bytes, not on decoded text.
with open('./data/Clicks.csv', 'rb') as f:
    blob = f.read()
m = magic.Magic(mime_encoding=True)
encoding = m.from_buffer(blob)
encoding

'iso-8859-1'

In [4]:
import pandas as pd
# Pass the detected encoding explicitly.
df_clicks = pd.read_csv("./data/Clicks.csv", 
                        sep="|", encoding="ISO-8859-1")
df_clicks

Unnamed: 0,ClientID,SendID,SubscriberKey,EmailAddress,SubscriberID,ListID,EventDate,EventType,SendURLID,URLID,URL,Alias,BatchID,TriggeredSendExternalKey,Browser,EmailClient,OperatingSystem,Device
0,12345,98765,blah@blah.com (),blah@blah.com,372058613,371051,5/14/2020 6:42:32 AM,Click,54793413,4578751,http://www.techsparks.guru,Save $0.25,1,,Unspecified,Unspecified,Windows,PC
1,12345,98765,blah@blah.com (),blah@blah.com,372058613,371051,5/14/2020 6:42:43 AM,Click,54793413,4578751,http://www.techsparks.guru,Save $0.25,1,,Chrome,Unspecified,Windows 7,PC
2,12345,98765,blah@blah.com (),blah@blah.com,372058613,371051,5/14/2020 6:43:12 AM,Click,54793414,4578751,http://www.techsparks.guru,Save $0.25,1,,Unspecified,Unspecified,Windows,PC
3,12345,98765,blah@blah.com (),blah@blah.com,372058613,371051,5/14/2020 6:43:29 AM,Click,54793414,4578751,http://www.techsparks.guru,Save $0.25,1,,Unspecified,Unspecified,Windows,PC
4,12345,98765,blah@blah.com (),blah@blah.com,372058613,371051,5/14/2020 6:45:10 AM,Click,54793414,4578751,http://www.techsparks.guru,Save $0.25,1,,Unspecified,Unspecified,Windows,PC
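
Once the encoding is known, it can be convenient to re-save the file as UTF-8 so that later reads need no `encoding` argument. A minimal sketch using only the standard library; the helper name `transcode_to_utf8` and any output path passed to it are hypothetical, not something the notebook itself does.

```python
def transcode_to_utf8(src_path, dst_path, src_encoding):
    """Rewrite a text file as UTF-8, decoding it with the given source encoding."""
    # newline="" leaves the original line endings untouched on read and write.
    with open(src_path, "r", encoding=src_encoding, newline="") as src, \
         open(dst_path, "w", encoding="utf-8", newline="") as dst:
        for line in src:
            dst.write(line)
```

For the file above, this would be called as `transcode_to_utf8("./data/Clicks.csv", "./data/Clicks_utf8.csv", "iso-8859-1")`, where the second path is an assumed name for the converted copy.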


# Use chardet
- https://pypi.org/project/chardet/
- https://chardet.readthedocs.io/en/latest/usage.html#basic-usage

In [5]:
! pip install chardet==3.0.4



In [6]:
import chardet

# Read the raw bytes and let chardet guess the encoding.
with open("./data/Clicks.csv", 'rb') as f:
    rawdata = f.read()
result = chardet.detect(rawdata)
charenc = result['encoding']
charenc

'ISO-8859-1'
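
The dictionary returned by `chardet.detect` also carries a `confidence` score between 0 and 1, and `encoding` can be `None` when nothing matches. One way to act on that is sketched below as a plain function, so it runs without chardet installed. The name `pick_encoding`, the 0.5 threshold, and the UTF-8 fallback are assumptions made for illustration.

```python
def pick_encoding(result, min_confidence=0.5, fallback="utf-8"):
    """Choose an encoding from a chardet-style result dict, or fall back."""
    encoding = result.get("encoding")
    confidence = result.get("confidence") or 0.0
    if encoding is None or confidence < min_confidence:
        return fallback
    return encoding


# Hand-written dicts in the shape that chardet.detect returns.
print(pick_encoding({"encoding": "ISO-8859-1", "confidence": 0.73, "language": ""}))
print(pick_encoding({"encoding": None, "confidence": 0.0, "language": None}))
```

The right threshold depends on the data; short or mostly-ASCII files tend to give less reliable guesses.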

# Use UnicodeDammit
- https://www.crummy.com/software/BeautifulSoup/bs4/doc/#unicode-dammit

In [7]:
! pip install beautifulsoup4==4.9.1

Collecting beautifulsoup4==4.9.1
  Downloading https://files.pythonhosted.org/packages/66/25/ff030e2437265616a1e9b25ccc864e0371a0bc3adb7c5a404fd661c6f4f6/beautifulsoup4-4.9.1-py3-none-any.whl (115kB)
Installing collected packages: beautifulsoup4
  Found existing installation: beautifulsoup4 4.8.2
    Uninstalling beautifulsoup4-4.8.2:
      Successfully uninstalled beautifulsoup4-4.8.2
Successfully installed beautifulsoup4-4.9.1


In [8]:
from bs4 import UnicodeDammit
# Read the raw bytes; UnicodeDammit detects the encoding and decodes in one step.
with open("./data/Clicks.csv", 'rb') as f:
    rawdata = f.read()
dammit = UnicodeDammit(rawdata)
print(dammit.unicode_markup)
dammit.original_encoding

ClientID|SendID|SubscriberKey|EmailAddress|SubscriberID|ListID|EventDate|EventType|SendURLID|URLID|URL|Alias|BatchID|TriggeredSendExternalKey|Browser|EmailClient|OperatingSystem|Device
12345|98765|blah@blah.com ()|blah@blah.com|372058613|371051|5/14/2020 6:42:32 AM|Click|54793413|4578751|http://www.techsparks.guru|Save $0.25|1||Unspecified|Unspecified|Windows|PC
12345|98765|blah@blah.com ()|blah@blah.com|372058613|371051|5/14/2020 6:42:43 AM|Click|54793413|4578751|http://www.techsparks.guru|Save $0.25|1||Chrome|Unspecified|Windows 7|PC
12345|98765|blah@blah.com ()|blah@blah.com|372058613|371051|5/14/2020 6:43:12 AM|Click|54793414|4578751|http://www.techsparks.guru|Save $0.25|1||Unspecified|Unspecified|Windows|PC
12345|98765|blah@blah.com ()|blah@blah.com|372058613|371051|5/14/2020 6:43:29 AM|Click|54793414|4578751|http://www.techsparks.guru|Save $0.25|1||Unspecified|Unspecified|Windows|PC
12345|98765|blah@blah.com ()|blah@blah.com|372058613|371051|5/14/2020 6:45:10 AM|Click|54793414|45

'iso-8859-1'
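
The three libraries report the same encoding with different casing (`'iso-8859-1'` and `'ISO-8859-1'`). Python treats these as aliases of one codec. If detector results need to be compared in code, `codecs.lookup` from the standard library resolves each name to a canonical one. A small sketch, where `same_encoding` is a hypothetical helper:

```python
import codecs


def same_encoding(name_a, name_b):
    """True if both names resolve to the same Python codec."""
    return codecs.lookup(name_a).name == codecs.lookup(name_b).name


print(same_encoding("iso-8859-1", "ISO-8859-1"))
print(same_encoding("latin-1", "ISO-8859-1"))
print(same_encoding("utf-8", "ISO-8859-1"))
```

`codecs.lookup` raises `LookupError` for a name Python does not know, which is worth catching if the name comes straight from a detector.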