# Files, Character Encoding and Table Data Files

* 4.6 Files, Page 82

* 2.3.2 A Digression About Character Encoding, Page 38

## 1 Files


Every computer system uses files to save things from one computation to the next.

Python provides many facilities for creating and accessing files.

The built-in function <b style="color:blue">open</b> 

```python
open(file, mode='r', buffering=-1, encoding=None, errors=None, newline=None, closefd=True, opener=None)
``` 

`open` opens `file` and returns a stream (a file object).

In [None]:
help(open)

<b style="color:blue">open()</b> is most commonly used with <b style="color:blue">two arguments</b>:

<b style="color:blue">open(filename, mode)</b>.

```python
f = open('workfile', 'w')
```    

* The first argument is a `string` containing the filename. 


* The second argument is another `string` containing a few characters describing the way in which the file will be used. 
 
   * **Writing**: <b style="color:blue">'w'</b> opens the file for writing only (an existing file with the same name is erased)

   * **Reading**: <b style="color:blue">'r'</b> opens the file for reading only (the default)

   * **Appending**: <b style="color:blue">'a'</b> opens the file for writing and appends to the end of the file if it exists


Normally, files are opened in <b style="color:blue">text</b> mode, which means that you read and write `strings` from and to the file.

Here we illustrate some of the basic operations in <b style="color:blue">text</b> mode:

### 1.1  Writing 


<b style="color:blue">open</b> creates a file named **kids.txt**, using the argument <b style="color:blue">'w'</b> to indicate that the file is opened for **writing**.

 * an existing file with the same name will be erased
 
We then use the file object's <b style="color:blue">write</b> method to write to the file.

Finally, we <b style="color:blue">close</b> the file.

In [None]:
names=['David','Andrea']
nameHandle = open('kids.txt', 'w')

for name in names:
    nameHandle.write(name + '\n') # the string '\n' indicates a new line character.

nameHandle.close()

**name + '\n'**: the string **'\n'** indicates a **new line** character.
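The same write can also be expressed with a `with` statement, which closes the file automatically. The following is only a minimal sketch: it writes to a temporary directory (an assumption made here so that the real **kids.txt** is not touched), and the name `name_handle` is illustrative.

```python
import os
import tempfile

names = ['David', 'Andrea']

# A temporary directory keeps this sketch from overwriting the real kids.txt.
with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, 'kids.txt')

    # The file is closed when the with-block ends,
    # even if write() raises an exception.
    with open(path, 'w') as name_handle:
        for name in names:
            name_handle.write(name + '\n')

    print(name_handle.closed)  # True: no explicit close() was needed

    with open(path, 'r') as name_handle:
        content = name_handle.read()

print(repr(content))
```

With this form there is no `close()` call to forget, which is why it is usually preferred in longer programs.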

In [None]:
!dir kids.txt

In [None]:
%load kids.txt

### 1.2 Reading

<b style="color:blue">open</b> the file for **reading**, using the argument <b style="color:blue">'r'</b>.

`'r'` is assumed if the mode is omitted.

For reading lines from a file, you can `loop` over `the file object`. This is memory efficient, fast, and leads to simple code:

In [None]:
nameHandle = open('kids.txt', 'r')
#nameHandle = open('kids.txt') #'r' will be assumed if it’s omitted.

for line in nameHandle:
    print(line)
    
nameHandle.close()

You see a **blank line** after each name:

```python
David

Andrea
```
This happens because two newlines are printed for every line:

* the `\n` that is already at the end of each line read from the file

* the newline that `print(line)` adds itself

```python
print(*objects, sep=' ', end='\n', file=sys.stdout, flush=False)
```
The default value is `end='\n'`, so `print(line)` appends a newline after the text.

We can avoid printing the extra **newline** by writing:
```python
  print(line[:-1])
```
which **slices** each line to drop the trailing **'\n'**.

In [None]:
nameHandle = open('kids.txt', 'r')
for line in nameHandle:
    print(line[:-1])  # same as line[:len(line)-1]: drops the trailing '\n'
nameHandle.close()
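Slicing with `line[:-1]` removes the last character whatever it is, so it silently damages a final line that has no trailing newline. A small sketch of the difference, using a hypothetical list of lines in place of a file:

```python
# The last "line" has no trailing newline, as can happen at the end of a file.
lines = ['David\n', 'Andrea']

sliced = [line[:-1] for line in lines]            # always drops one character
stripped = [line.rstrip('\n') for line in lines]  # drops only a trailing newline

print(sliced)    # ['David', 'Andre']
print(stripped)  # ['David', 'Andrea']
```

`rstrip('\n')` is therefore the safer choice when the file's last line may be unterminated.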

### 1.3 Appending

<b style="color:blue">open</b> the file for **appending** (instead of writing) by using the argument <b style="color:blue">'a'</b>.

In [None]:
names=['Zhang\n','Li\n']

# append
nameHandle = open('kids.txt', 'a') # argument 'a' -  appending
for name in names:
    nameHandle.write(name)
nameHandle.close()

# read
nameHandle = open('kids.txt', 'r')
for line in nameHandle:
    print(line[:-1])
nameHandle.close()
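To make the difference between `'a'` and `'w'` concrete, here is a minimal sketch. It works in a temporary directory (an assumption of this example) so the real **kids.txt** is left alone.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, 'kids.txt')

    with open(path, 'w') as f:   # create the file
        f.write('David\n')

    with open(path, 'a') as f:   # 'a' keeps the old content
        f.write('Zhang\n')
    with open(path) as f:
        after_append = f.read()

    with open(path, 'w') as f:   # 'w' erases the old content first
        f.write('Li\n')
    with open(path) as f:
        after_write = f.read()

print(repr(after_append))  # 'David\nZhang\n'
print(repr(after_write))   # 'Li\n'
```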

### 1.4 The common operations on files

Some of the common operations on files are summarized in the figure below.

![Figure412](./img/figure412.jpg)

#### Using the <b style="color:blue">readline</b> method

Using `print(line, end='')` keeps `print` from adding a second newline, because each line returned by `readline` already ends with one.

In [None]:
nameHandle = open('kids.txt', 'r')

while True:
    line = nameHandle.readline()
    # Zero length indicates EOF
    if len(line) == 0:
        break
   
    # The `line` already has a newline at the end of each line
    print(line, end='')

nameHandle.close()
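Besides `readline`, file objects offer `read`, `readlines` and `writelines`. The sketch below exercises them on a temporary file; it is a selection of standard methods, not a transcription of the figure above.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, 'kids.txt')

    with open(path, 'w') as f:
        # writelines() writes each string as-is; it adds no newlines itself.
        f.writelines(['David\n', 'Andrea\n'])

    with open(path) as f:
        first = f.readline()    # one line, including its '\n'
        rest = f.readlines()    # all remaining lines as a list
        at_eof = f.readline()   # empty string signals end of file

    with open(path) as f:
        whole = f.read()        # the entire file as one string

print(repr(first), rest, repr(at_eof))
print(repr(whole))
```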

## 2 Character Encoding and Plain Text Files in a Specific `encoding`

`Text encoding` is a sufficiently complex topic that there’s no one size fits all answer - the right answer for a given application will depend on factors like:

* how much control you have over the text encodings used

* whether avoiding program failure is more important than avoiding data corruption or vice-versa

* how common encoding errors are expected to be, and whether they need to be handled gracefully or can simply be rejected as invalid input


### 2.1 A Digression About Character Encoding (2.3.2)

**Character Encoding**

In computer memory, characters are "encoded" (or "represented") using a chosen "character encoding scheme" (also known as a "character set", "charset", "character map", or "code page").
   
For example, in **ASCII** (American Standard Code for Information Interchange), as well as in Latin-1 and Unicode:

* code numbers 65D (41H) to 90D (5AH) represent 'A' to 'Z', respectively.

* code numbers 97D (61H) to 122D (7AH) represent 'a' to 'z', respectively.

* code numbers 48D (30H) to 57D (39H) represent '0' to '9', respectively.

It is important to note that the **representation scheme must be known** before a binary pattern can be interpreted. E.g., the 8-bit pattern "0100 0010B" could be the character 'B' in ASCII, the unsigned integer 66, or something else entirely, depending on the scheme.
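These code numbers can be checked directly in Python with the built-in functions `ord` and `chr`. A short sketch, in which the variable name `pattern` is only illustrative:

```python
# The same 8-bit pattern, read in different ways.
pattern = 0b01000010

print(pattern)        # 66   (as an unsigned integer)
print(hex(pattern))   # 0x42 (the same value in hexadecimal)
print(chr(pattern))   # B    (as a character code)

# Decoding the single byte with the ASCII scheme gives the same character.
decoded = bytes([pattern]).decode('ascii')
print(decoded)        # B

# ord() goes the other way: from character to code number.
print(ord('A'), ord('Z'), ord('a'), ord('z'), ord('0'), ord('9'))
# 65 90 97 122 48 57
```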

**Unicode:** ISO/IEC 10646 Universal Character Set
 
The Unicode standard is designed to represent characters in **all languages**.

**Reference:** [Tutorial on Data Representation Integers, Floating-point Numbers, and Characters](http://www3.ntu.edu.sg/home/ehchua/programming/java/DataRepresentation.html)

---
For many years most programming languages used a standard called `ASCII` for the internal representation of characters. This standard included 128 characters, plenty for representing the usual set of characters appearing in `English-language` text—but `not enough to cover the characters and accents appearing in all the world’s languages.`

In recent years, there has been a shift to **Unicode**. The Unicode standard is a character coding system designed to support the digital processing and display of the written `texts of all languages`. The standard contains more than 120,000 different characters—covering 129 modern and historic scripts and multiple symbol sets. 

Since **Python 3.0**, the language features a **str** type that contains **Unicode** characters, meaning any string created using "unicode rocks!", 'unicode rocks!', or the triple-quoted string syntax is stored as Unicode.
 
https://docs.python.org/howto/unicode.html

The default encoding for Python source code  and **Jupyter Notebook** is UTF-8, so you can simply include a Unicode character in a string literal:

In [None]:
# Python 3 reads source code as UTF-8 by default.
print('Mluvíš anglicky?')
print('क्या आप अंग्रेज़ी बोलते हैं?')
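A `str` holds Unicode characters, while an encoding decides which bytes represent them. The sketch below encodes the same two-character string in UTF-8 and in GBK (both codecs ship with standard CPython) and compares the lengths; only lengths are printed so the output does not depend on the terminal's encoding.

```python
text = '中文'

utf8_bytes = text.encode('utf-8')  # 3 bytes per character here
gbk_bytes = text.encode('gbk')     # 2 bytes per character here

print(len(text), len(utf8_bytes), len(gbk_bytes))  # 2 6 4

# Decoding with the matching encoding restores the original string.
print(utf8_bytes.decode('utf-8') == text)  # True
print(gbk_bytes.decode('gbk') == text)     # True
```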

The Unicode standard can be implemented using different internal character encodings. You can tell Python which encoding to use by inserting a comment of the form
```python
# -*- coding: encoding name -*-
```
as the first or second line of your program. For example,
```python
# -*- coding: utf-8 -*-
```
instructs Python to use **UTF-8**, the most frequently used character encoding for World Wide Web pages. If you don’t have such a comment in your program, most Python implementations will `default to UTF-8.`

If we use the comment `# -*- coding: ascii -*-` to declare that the Python source is ASCII, but the file contains non-ASCII characters,

* Python cannot decode the source and reports a `SyntaxError` when the script is run.

In [None]:
%%file ./code/python/CharacterEncodingASCII.py
# -*- coding: ascii -*-
print('Mluvíš anglicky?')
print('क्या आप अंग्रेज़ी बोलते हैं?')

In [None]:
!python ./code/python/CharacterEncodingASCII.py

In **Visual Studio Code**, we can use `Reopen with Encoding` or `Save with Encoding` to convert the text to a suitable character encoding.

![vscode-encoding](./img/vscode-encoding.jpg)

We will give more examples of character encoding in the following section.

### 2.2 Plain text file in a specific `encoding`

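So far we have been reading and writing strings without passing `encoding`, so `open` used the platform's default text encoding. That default is reported by `locale.getpreferredencoding(False)`: it is UTF-8 on most Linux and macOS systems, but it can be a legacy code page (for example cp936, i.e. GBK, on Chinese-language Windows) unless Python's UTF-8 mode is enabled.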

Both English and non-English characters can be represented in UTF-8

In [None]:
fname="./code/python/default.txt"
f = open(fname,'w')
f.write('中文default')
f.close()

In [None]:
fname="./code/python/default.txt"
f = open(fname,'r')
line=f.readline()
print(line)
f.close()
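Because the default depends on the platform, it is often safer to name the encoding explicitly. The following sketch prints the platform default and then writes and reads a file with `encoding='utf-8'`; the temporary directory is an assumption of this example, used so that `./code/python/default.txt` is not required.

```python
import locale
import os
import tempfile

# The encoding open() uses when the encoding argument is omitted.
default_encoding = locale.getpreferredencoding(False)
print(default_encoding)

with tempfile.TemporaryDirectory() as tmpdir:
    fname = os.path.join(tmpdir, 'default.txt')

    # Naming the encoding makes the result identical on every platform.
    with open(fname, 'w', encoding='utf-8') as f:
        f.write('中文default')

    with open(fname, 'r', encoding='utf-8') as f:
        line = f.readline()

    with open(fname, 'rb') as f:
        raw = f.read()

print(line == '中文default')  # True
print(len(raw))               # 13 bytes: 2 * 3 for the Chinese text + 7 ASCII
```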

#### The specific encoding

We can read and write in a specific encoding  by using a keyword argument <b style="color:blue">encoding</b> in the `open` function.

Example: Writing  in GBK

* **GBK**: 汉字内码扩展规范 (Chinese Character Internal Code Extension Specification)

In [None]:
fname="./code/python/gbk.txt"
f = open(fname,'w',encoding="gbk")
f.write('中文-gbk')
f.close()

Open GBK with GBK encoding

In [None]:
f = open(fname,'r',encoding="gbk")
line=f.readline()
print(line)
f.close()

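Open the GBK file with UTF-8 encoding: this fails with a `UnicodeDecodeError`, because the GBK bytes are not a valid UTF-8 sequence.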

In [None]:
f = open(fname,'r',encoding="utf-8")
line=f.readline()
print(line)
f.close()
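A program that cannot be sure of a file's encoding can catch the error or choose an error handler. The sketch below works on bytes in memory rather than on `gbk.txt` (an assumption made to keep it self-contained); `open` accepts the same `errors` argument.

```python
# The bytes that open(fname, 'w', encoding='gbk') would have written.
gbk_bytes = '中文-gbk'.encode('gbk')

failed = False
try:
    gbk_bytes.decode('utf-8')
except UnicodeDecodeError as err:
    failed = True
    print('decode failed:', err.reason)

# errors='replace' substitutes U+FFFD for undecodable bytes instead of raising.
replaced = gbk_bytes.decode('utf-8', errors='replace')
print(replaced.endswith('-gbk'))  # True: the ASCII part survives

# Decoding with the encoding that was actually used recovers the text.
recovered = gbk_bytes.decode('gbk')
print(recovered == '中文-gbk')    # True
```

Which choice is right depends on the trade-off described at the start of this section: `errors='replace'` avoids a crash but corrupts the data, while the default strict mode rejects the input.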

### Further Reading: Text Files

* [Python HOWTOs: Unicode HOWTO](https://docs.python.org/howto/unicode.html)
 

* [Python Tutorial 7.2. Reading and Writing Files](https://docs.python.org/tutorial/inputoutput.html#reading-and-writing-files)
 
 
* [Processing Text Files in Python 3](http://python-notes.curiousefficiency.org/en/latest/python3/text_file_processing.html)
  



## 3 Binary  Files

`'b'` appended to the mode opens the file in `binary mode`: now the data is read and written in the form of `bytes` objects. 

This mode should be used for all files that don’t contain text.

In [None]:
f = open('binaryfile', 'wb')
f.write(b'0123456789abcdef')
f.close()

To change the file object’s position, use `f.seek(offset, from_what)`. The position is computed by adding `offset` to a reference point; the reference point is selected by the `from_what` argument.

A `from_what` value of 

* `0` measures from the beginning of the file, 

* `1` uses the current file position, and 

* `2` uses the end of the file as the reference point. 

`from_what` can be omitted and defaults to `0`, using the beginning of the file as the reference point.

In [None]:
f = open('binaryfile', 'rb')
f.seek(5) # Go to the 6th byte in the file

In [None]:
f.read(1)

In [None]:
f.seek(-3, 2) # Go to the 3rd byte before the end

In [None]:
f.read(1)

In [None]:
f.close()
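The seek steps above can be put together in one self-contained sketch. It uses a temporary file (an assumption of this example) and the named constant `os.SEEK_END`, which has the value `2`.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, 'binaryfile')

    with open(path, 'wb') as f:
        f.write(b'0123456789abcdef')   # 16 bytes

    with open(path, 'rb') as f:
        f.seek(5)                      # offset 5 from the beginning
        sixth = f.read(1)              # b'5'

        f.seek(-3, os.SEEK_END)        # 3 bytes before the end: offset 13
        third_from_end = f.read(1)     # b'd'

        position = f.tell()            # reading one byte moved us to 14

print(sixth, third_from_end, position)
```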

### Further Reading:  Binary Files

**J. M. Hughes. Real World Instrumentation with Python: CHAPTER 12 Reading and Writing Data Files**

  *  [UnitA-2: Reading-and-Writing-Data-Files-Binary-Data-Files](./UnitA-2-Reading-and-Writing-Data-Files-Binary-Data-Files.ipynb) 


## 4 Table Data (File), Dictionary and List

|name      |  age  |  city   |
|---------:|------:|--------:|
|Zhangsan  |  28   |  Nanjing|
|Lishi     |  18   |  Beijing|

```python
data table  -> dict: {key:value}
    column name   -> key (string)
    column values -> value (list, one item per row)
```
**In relational database terms**

```
data table  -> relational database table
    column -> field
     row  -> record
```

### 4.1 Creating data table dict from `sequences`

The data table is a `dict`; the values of each field are stored in a `list`.

```python
dict {field1:[],field2:[]....}
```


In [None]:
fields=['name','age','city']
rows=[['Zhangsan',28,'Nanjing'],
      ['Lishi',18,'Beijing']
     ]

datatable={}

# 1 create the dict of  data table
for key in fields:
    datatable[key] = []
print(datatable)  

# 2 set the value list of key
for r in rows:
    for i in range(len(fields)):
        datatable[fields[i]].append(r[i])
print(datatable)

print("\n",fields)
for r in range(len(rows)):
    currow=[]
    for i in range(len(fields)):
        currow.append(datatable[fields[i]][r])
    print(currow)
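The index-based loops above can also be written with `zip`, which pairs items from several sequences. This is an alternative sketch of the same construction, not a replacement for the step-by-step version; the name `rebuilt` is illustrative.

```python
fields = ['name', 'age', 'city']
rows = [['Zhangsan', 28, 'Nanjing'],
        ['Lishi', 18, 'Beijing']]

# zip(*rows) transposes the rows into columns;
# the outer zip pairs each field name with its column.
datatable = {field: list(column)
             for field, column in zip(fields, zip(*rows))}
print(datatable)

# Transposing the columns again gives the rows back.
rebuilt = [list(row) for row in zip(*(datatable[f] for f in fields))]
print(rebuilt)
```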


### 4.2  Creating dict from the `file` of  data table

**view data table in `Columns`** 

* **data table is a `dict`**, 

* **values of each field is `list`**

```python
dict {field1:[],field2:[]....}
```

In [None]:
%%file ./data/personrecords.txt
name        age
zhangsan    28
lishi       18 

In [None]:
fields=[]
datatable={}

personrecords=open('./data/personrecords.txt','r')

# 1 get string of field(column)
fields=personrecords.readline().split()
print(fields)

# 2 create the dict of  data table
for key in fields:
    datatable[key] = []
print("dict for datatable:{field1:[],field2:[]....}")
print(datatable)

# 3 read each record into the value list of its key 
for line in personrecords:
    currowrecord=line.split()
    for i in range(len(fields)):
         datatable[fields[i]].append(currowrecord[i])

personrecords.close()

print(datatable)

recordCount=len(datatable[fields[0]])
print("\n",fields)
for r in range(recordCount):
    currow=[]
    for i in range(len(fields)):
        currow.append(datatable[fields[i]][r])
    print(currow)    
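Note that every value read from a text file is a string, so `age` holds `'28'`, not `28`. The sketch below repeats the column-oriented read and then converts that column. It reads from an in-memory `io.StringIO` object (an assumption, standing in for `./data/personrecords.txt`) and skips blank lines.

```python
import io

# Stand-in for the contents of ./data/personrecords.txt
text = "name        age\nzhangsan    28\nlishi       18 \n"

with io.StringIO(text) as personrecords:
    fields = personrecords.readline().split()
    datatable = {field: [] for field in fields}

    for line in personrecords:
        values = line.split()
        if not values:          # skip blank lines
            continue
        for field, value in zip(fields, values):
            datatable[field].append(value)

# Values read from text are strings; convert the age column to int.
datatable['age'] = [int(age) for age in datatable['age']]
print(datatable)
```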

### 4.3 Creating list  from the `file` of  data table

**view data table in `Rows`** 

* **data table is a `list` of rows**

* **each `row` is a `dict`**

```python
list[dict]: [{field1:value,field2:value,...},{field1:value,field2:value,...},...]
```

In [None]:
%%file ./data/personrecords.txt
name        age      city
zhangsan    28      nanjing
lishi       18      shanghai

In [None]:
fields=[]

# data table is a list of row dicts
datatable=[] 

personrecordsfile=open('./data/personrecords.txt','r')

# 1 get string of field(column)
fields=personrecordsfile.readline().split()
print(fields)

# 2 read each record into a dict: each key is a field name
for line in personrecordsfile:
    currowrecord=line.split()
    # 2.1 init dict
    rowrecord={}
    for i in range(len(fields)):
        # 2.2 add key:value to dict
        rowrecord[fields[i]]=currowrecord[i]
    # 2.3 add dict to list: datatable
    datatable.append(rowrecord)

personrecordsfile.close()

for item in datatable:
    print(item)
    
for item in datatable:
    print(item['name'])    
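Each row dict can also be built in one step with `dict(zip(fields, values))`. A compact sketch of the row-oriented read, again using an in-memory `io.StringIO` object as a stand-in for the data file:

```python
import io

# Stand-in for the contents of ./data/personrecords.txt
text = ("name        age      city\n"
        "zhangsan    28      nanjing\n"
        "lishi       18      shanghai\n")

with io.StringIO(text) as personrecordsfile:
    fields = personrecordsfile.readline().split()
    # One dict per non-blank line, keyed by field name.
    datatable = [dict(zip(fields, line.split()))
                 for line in personrecordsfile if line.strip()]

names = [row['name'] for row in datatable]
print(datatable)
print(names)
```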

### 4.4 CSV and csv.DictReader

#### 4.4.1 CSV: Comma-separated values

https://en.wikipedia.org/wiki/Comma-separated_values
    
In computing, a comma-separated values (**CSV**) file stores **tabular** data (numbers and text) in **plain text**.

* Each **line** of the file is a data **record**.

* Each **record** consists of one or more **fields**, separated by **commas**.    

CSV is **a common data exchange format** that is widely supported by consumer, business, and scientific applications. 

For example, a user may need to transfer information from a **database** program that stores data in a proprietary format, to a **spreadsheet** that uses a completely different format. 

The database program can most likely export its data as "CSV"; the exported CSV file can then be imported by the spreadsheet program.

The `CSV` format is the most common import and export format for `spreadsheets and databases`.


#### 4.4.2 CSV module

The `csv` module implements classes to read and write tabular data in CSV format.

https://docs.python.org/3.7/library/csv.html

In [None]:
%%file ./data/personrecords.csv
name,age
zhangsan,28
lishi,18 

In [None]:
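import csv

filename="./data/personrecords.csv"
# newline='' lets the csv module handle line endings itself
with open(filename, 'r', newline='') as csvfile:
    csvdata = csv.DictReader(csvfile)
    for line in csvdata:
        print(line)
        name = line['name']
        age=line['age']   # every value is read as a string
        print(name,age)    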
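The `csv` module also writes CSV. The sketch below writes two records with `csv.DictWriter` and reads them back with `csv.DictReader`, converting `age` to `int` on the way in. It uses an in-memory `io.StringIO` buffer instead of a file on disk, which is an assumption made to keep the example self-contained.

```python
import csv
import io

people = [{'name': 'zhangsan', 'age': 28},
          {'name': 'lishi', 'age': 18}]

# newline='' leaves line-ending handling to the csv module.
buffer = io.StringIO(newline='')
writer = csv.DictWriter(buffer, fieldnames=['name', 'age'])
writer.writeheader()        # first row: the field names
writer.writerows(people)    # one row per dict

buffer.seek(0)              # rewind before reading back
records = [{'name': row['name'], 'age': int(row['age'])}
           for row in csv.DictReader(buffer)]
print(records)
```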

#### 4.4.3 Our DictReader for CSV

In [None]:
def ourDictReader(file):
    # the first row: fields
    fields=file.readline().rstrip('\n').split(',')
    print(fields)
    
    # the list of all rows
    records=[]
    for line in file:
        if not line.strip():
            continue  # skip blank lines
        # rstrip('\n') is safe even if the last line has no trailing newline
        currowrecord=line.rstrip('\n').split(',')
        print(currowrecord)
        # the dict of each row 
        rowrecord={}
        for i,item in enumerate(fields):
            if item=="age":
                # convert the age field from str to int
                rowrecord[item]=int(currowrecord[i])
            else:
                rowrecord[item]=currowrecord[i]
        records.append(rowrecord)
    return records

with open('./data/personrecords.csv','r') as filerecords:
    csvdata=ourDictReader(filerecords)
for line in csvdata:
    print(line)
    print(line['name'],line['age'])
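A reader based on `split(',')` is fine for simple data like this, but it breaks as soon as a field itself contains a comma inside quotes. The following sketch, with made-up sample data, compares it with `csv.reader`:

```python
import csv
import io

# Hypothetical data: the name field contains a comma, so it is quoted.
text = 'name,city\n"Zhang, San",Nanjing\n'

# Naive splitting cuts the quoted field in two.
naive = text.splitlines()[1].split(',')
print(naive)   # ['"Zhang', ' San"', 'Nanjing']

# csv.reader understands the quoting rules.
parsed = list(csv.reader(io.StringIO(text, newline='')))[1]
print(parsed)  # ['Zhang, San', 'Nanjing']
```

This is the main reason to prefer the `csv` module for files produced by other programs.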

### Further Reading 

#### Pandas http://pandas.pydata.org/

pandas is an open source, BSD-licensed library providing high-performance, easy-to-use **data structures and data analysis tools** for the Python programming language.

#### Reference

Wes McKinney. Python for Data Analysis: Data Wrangling with Pandas, NumPy, and IPython (2nd Edition). O'Reilly Media, 2017.
* https://github.com/wesm/pydata-book