# Tutorial 

In this notebook we look at how to handle documents stored in different character encodings. This can be a tricky problem, especially when working with many documents for text analysis. Python works with UTF-8, which is one of the most widely used encodings today. This notebook is based on the Kaggle course [Character Encodings](https://www.kaggle.com/alexisbcook/character-encodings).

In [None]:
#Load main libraries
import pandas as pd #Work with dataframes
import numpy as np #Work with arrays
import chardet #Useful to auto-detect encodings

In [None]:
#Load file_guide, which lists the documents available in the dataset.
file_guide = pd.read_csv("../input/character-encoding-examples/file_guide.csv")
file_guide

#Here we can see the encodings of the different .txt documents: "ASCII", "Windows 1251", "UTF-8" and more.

The documents are in .txt format, so we can read them with the built-in "open" function. A concise way to use "open" is inside a "with" statement, which closes the file automatically when the block ends. If you don't know how this statement works, see [this link](https://www.pythonforbeginners.com/files/reading-and-writing-files-in-python).

In [None]:
#Load the first document and look at its first 200 bytes. The mode used here is "rb", which means read-only in binary format, so no decoding is applied.
with open("../input/character-encoding-examples/die_ISO-8859-1.txt", 'rb') as simple_doc:
    result = simple_doc.read(200)
    
#Print our result
print(result)    

#Here we can see several bytes shown as escape sequences rather than readable characters, which suggests that the document is not plain ASCII and may not be UTF-8.
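
To make the escaped bytes less mysterious, the small sketch below encodes the same text in two ways and then tries to decode the ISO-8859-1 bytes as UTF-8. The sample string is a made-up example and is not taken from the dataset.

```python
# Hypothetical sample text (an assumption, not read from the dataset).
sample = "Straße für Müller"

# The same text produces different bytes depending on the encoding.
latin1_bytes = sample.encode("ISO-8859-1")
utf8_bytes = sample.encode("utf-8")

print(latin1_bytes)  # b'Stra\xdfe f\xfcr M\xfcller'
print(utf8_bytes)    # b'Stra\xc3\x9fe f\xc3\xbcr M\xc3\xbcller'

# Decoding ISO-8859-1 bytes as UTF-8 fails, because a lone byte such as
# 0xfc is not a valid UTF-8 sequence.
try:
    latin1_bytes.decode("utf-8")
    decoded_as_utf8 = True
except UnicodeDecodeError:
    decoded_as_utf8 = False

print(decoded_as_utf8)  # False
```

Each non-ASCII character takes one byte in ISO-8859-1 and two bytes in UTF-8, which is why the two byte strings differ only around those characters.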

To find the correct encoding of this document, we could try reading it with different encodings. However, the chardet library provides the function "chardet.detect", which guesses the encoding of a sequence of bytes.

In [None]:
with open("../input/character-encoding-examples/die_ISO-8859-1.txt", 'rb') as simple_doc:
    result = chardet.detect(simple_doc.read(200))

#Print our result
print(result)

#Using the first 200 bytes, chardet suggests with a confidence of 0.73 that the encoding used in this file is "ISO-8859-1".
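
The manual alternative mentioned above, trying several encodings in turn, could look roughly like the sketch below. The function name `first_working_encoding` and the candidate list are assumptions made for illustration; they do not come from the course or from chardet.

```python
def first_working_encoding(raw, candidates=("ascii", "utf-8", "ISO-8859-1")):
    """Return the first candidate that decodes `raw` without an error.

    Hypothetical helper: the order matters, because ISO-8859-1 accepts
    any byte sequence and must therefore be tried last.
    """
    for name in candidates:
        try:
            raw.decode(name)
        except UnicodeDecodeError:
            continue  # This candidate cannot decode the bytes; try the next.
        return name
    return None  # No candidate worked.


# Made-up byte strings for illustration.
print(first_working_encoding(b"plain text"))
print(first_working_encoding("für".encode("utf-8")))
print(first_working_encoding("für".encode("ISO-8859-1")))
```

A successful decode only shows that the bytes are valid in that encoding, not that the resulting text is meaningful, so this approach is best treated as a rough filter. It should also be applied to the whole file rather than a slice, since cutting a multi-byte UTF-8 character in half would cause a false failure.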

In [None]:
#Now we can open the file using the most appropriate encoding.
with open("../input/character-encoding-examples/die_ISO-8859-1.txt", encoding='ISO-8859-1') as simple_doc:
    result = simple_doc.read(500)

#Print our result
print(result)
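
Choosing the encoding carefully matters because a wrong choice does not always raise an error. As a minimal sketch with a made-up string, the code below decodes Windows-1251 bytes as ISO-8859-1: the call succeeds, but the text is garbled.

```python
# Hypothetical sample text (an assumption, not read from the dataset).
original = "Привет"

# Encode as Windows-1251 (called "cp1251" in Python).
raw = original.encode("cp1251")

# ISO-8859-1 maps every byte to some character, so this never fails,
# but the result is not the text we started with.
garbled = raw.decode("ISO-8859-1")
print(garbled)

# No information is lost: re-encoding and decoding with the right
# encoding recovers the original text.
recovered = garbled.encode("ISO-8859-1").decode("cp1251")
print(recovered == original)  # True
```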

Now that we have identified the right encoding, we can read the entire document with that encoding and save it as UTF-8.

In [None]:
#Read the entire document using the identified encoding.
with open("../input/character-encoding-examples/die_ISO-8859-1.txt", encoding='ISO-8859-1') as simple_doc:
    result = simple_doc.read()

In [None]:
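#To save the file we use the write mode "w". The encoding is set explicitly because the default used by "open" depends on the platform.
with open("./first_document_utf8.txt", "w", encoding="utf-8") as output_doc:
    output_doc.write(result)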

In [None]:
#Now we can check that the file is read correctly as UTF-8. We pass the encoding explicitly rather than relying on the platform default.
with open("./first_document_utf8.txt", encoding="utf-8") as simple_doc_utf8:
    result = simple_doc_utf8.read(500)
#Print the result
print(result)

#As you can see, the document is read correctly.

We can repeat this process for each document in the dataset.
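
One possible way to package the read-and-rewrite steps for reuse is sketched below. The helper `convert_to_utf8` and the temporary sample file are assumptions for illustration; for the real dataset the source encoding of each file would come from file_guide or from chardet.

```python
import tempfile
from pathlib import Path


def convert_to_utf8(source_path, target_path, source_encoding):
    """Read a text file in `source_encoding` and rewrite it as UTF-8.

    Hypothetical helper; returns the number of characters converted.
    """
    text = Path(source_path).read_text(encoding=source_encoding)
    Path(target_path).write_text(text, encoding="utf-8")
    return len(text)


# Demonstration on a throwaway file, so the sketch runs without the dataset.
with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "sample_latin1.txt"
    dst = Path(tmp) / "sample_utf8.txt"
    src.write_bytes("Grüße".encode("ISO-8859-1"))

    n_chars = convert_to_utf8(src, dst, "ISO-8859-1")
    converted = dst.read_bytes()

print(n_chars)    # 5
print(converted)  # b'Gr\xc3\xbc\xc3\x9fe'
```

Calling the helper in a loop over the file names and encodings of interest would then convert several documents in one pass.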