# Files, Character Encoding and Table Data Files


## 1 Files


Every computer system uses files to save things from one computation to the next.

Python provides many facilities for creating and accessing files.

The built-in function <b style="color:blue">open</b> has the following signature:

```python
open(file, mode='r', buffering=-1, encoding=None, errors=None, newline=None, closefd=True, opener=None)
``` 

It opens `file` and returns a corresponding stream (a file object).

In [None]:
help(open)

<b style="color:blue">open()</b> is most commonly used with <b style="color:blue">two arguments</b>:

<b style="color:blue">open(filename, mode)</b>.

```python
f = open('workfile', 'w')
```    

* The first argument is a `string` containing the filename. 


* The second argument is another `string` containing a few characters describing the way in which the file will be used. 
 
   * **Writing**: <b style="color:blue">'w'</b> for writing only (an existing file with the same name will be erased)

   * **Reading**: <b style="color:blue">'r'</b> when the file will only be read (the default)

   * **Appending**: <b style="color:blue">'a'</b> opens the file for writing, appending to the end of the file if it exists


Normally, files are opened in <b style="color:blue">text</b> mode, which means that you read and write strings (`str`) from and to the file.

Here we illustrate some of the basic operations in <b style="color:blue">text</b> mode:

### 1.1  Writing 


<b style="color:blue">open</b> creates a file named **kids.txt**,

using the argument <b style="color:blue">'w'</b> to indicate that the file is to be opened for **writing**.

 * an existing file with the same name will be erased
 
The file object's <b style="color:blue">write</b> method then writes strings to the file.

Finally, we <b style="color:blue">close</b> the file.

In [None]:
names=['David','Andrea']


In [None]:
nameHandle = open('kids.txt', 'w')

for name in names:
    nameHandle.write(name + '\n') # the string '\n' indicates a new line character.

nameHandle.close()

**name + '\n'**: the string **'\n'** indicates a **new line** character.
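
The open/write/close sequence can also be written with a `with` statement, which closes the file automatically when the block ends, even if an error occurs inside it. The following is a minimal sketch, not part of the original notebook; the temporary directory and the file name `kids_with.txt` are assumptions chosen so that the sketch does not touch `kids.txt`.

```python
import os
import tempfile

names = ['David', 'Andrea']

# Assumption: a temporary directory and a hypothetical file name,
# so that the real kids.txt is left alone.
path = os.path.join(tempfile.mkdtemp(), 'kids_with.txt')

# The file is closed automatically when the with block ends.
with open(path, 'w') as nameHandle:
    for name in names:
        nameHandle.write(name + '\n')

print(nameHandle.closed)  # True: no explicit close() was needed

# Read the whole file back to check what was written.
with open(path) as nameHandle:
    content = nameHandle.read()
print(content, end='')
```

The explicit `close()` calls used in the rest of this section work as well; the `with` form simply makes it harder to forget them.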

In [None]:
!dir kids.txt

In [None]:
# %load kids.txt
David
Andrea


### 1.2 Reading

<b style="color:blue">open</b> the file for **reading**, using the argument <b style="color:blue">'r'</b>.

`'r'` is assumed if the mode is omitted.

To read lines from a file, you can loop over the file object. This is memory efficient, fast, and leads to simple code:

In [None]:
nameHandle = open('kids.txt', 'r')

#nameHandle = open('kids.txt') #'r' will be assumed if it’s omitted.

for line in nameHandle:
    print(line)
    
nameHandle.close()

You see a **blank line** after each name:

```python
David

Andrea
```
This happens because each name is followed by two newlines:

* the `\n` at the end of each line read from the file

* the newline that `print(line)` appends

```python
print(*objects, sep=' ', end='\n', file=sys.stdout, flush=False)
```
The default value is `end='\n'`, so `print(line)` adds a newline of its own.

We could have avoided printing that extra **new line** by writing:
```python
  print(line[:-1])
```
**Slicing** the line removes the trailing **'\n'** of each line in the file.

In [None]:
nameHandle = open('kids.txt', 'r')
for line in nameHandle:
    print(line[:-1])  # same as line[:len(line)-1]: drops the trailing '\n'
nameHandle.close()
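
Slicing with `line[:-1]` removes the last character whatever it is, so it also cuts a real character when the final line of a file has no trailing newline. A minimal sketch of an alternative uses `str.rstrip('\n')`; the in-memory `io.StringIO` object below stands in for a file and is an assumption made for illustration only.

```python
import io

# In-memory text stream; note that the last line has no trailing '\n'.
handle = io.StringIO('David\nAndrea')

sliced = []
stripped = []
for line in handle:
    sliced.append(line[:-1])            # always drops the last character
    stripped.append(line.rstrip('\n'))  # drops only trailing newlines

print(sliced)    # ['David', 'Andre']
print(stripped)  # ['David', 'Andrea']
```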

### 1.3 Appending

<b style="color:blue">open</b> the file for **appending** (instead of writing) by using the argument <b style="color:blue">'a'</b>.

In [None]:
names=['Zhang\n','Li\n']

# append
nameHandle = open('kids.txt', 'a') # argument 'a' -  appending
for name in names:
    nameHandle.write(name)
nameHandle.close()

# read
nameHandle = open('kids.txt', 'r')
for line in nameHandle:
    print(line[:-1])
nameHandle.close()
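
The difference between `'a'` and `'w'` can be checked directly. The sketch below is an illustration under assumptions: it writes to a hypothetical file `modes_demo.txt` in a temporary directory instead of `kids.txt`.

```python
import os
import tempfile

# Assumption: hypothetical scratch file in a temporary directory.
path = os.path.join(tempfile.mkdtemp(), 'modes_demo.txt')

with open(path, 'w') as f:   # creates the file
    f.write('first\n')

with open(path, 'a') as f:   # 'a' keeps the existing content
    f.write('second\n')

with open(path) as f:
    after_append = f.read()

with open(path, 'w') as f:   # 'w' erases the existing content
    f.write('third\n')

with open(path) as f:
    after_write = f.read()

print(repr(after_append))  # 'first\nsecond\n'
print(repr(after_write))   # 'third\n'
```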

### 1.4 The common operations on files

Some of the common operations on files are summarized in the figure below.

![Figure412](./img/figure412.jpg)

#### Using the <b style="color:blue">readline</b> method

`readline` keeps the trailing newline of each line, so we use `print(line, end='')` to avoid printing an extra one.

In [None]:
nameHandle = open('kids.txt', 'r')

while True:
    line = nameHandle.readline()
    # Zero length indicates EOF
    if len(line) == 0:
        break
   
    # The `line` already has a newline at the end of each line
    print(line, end='')

nameHandle.close()
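
Besides `readline`, file objects also offer `read` and `readlines`. The sketch below compares the three on an in-memory `io.StringIO` stream, which is an assumption used here so the example does not depend on `kids.txt`; real file objects opened in text mode behave the same way for these calls.

```python
import io

text = 'David\nAndrea\n'

# read() returns the whole content as one string.
whole = io.StringIO(text).read()

# readlines() returns a list of lines, each keeping its '\n'.
lines = io.StringIO(text).readlines()

# readline() returns one line per call and '' at end of file.
handle = io.StringIO(text)
first = handle.readline()
second = handle.readline()
at_end = handle.readline()

print(repr(whole))   # 'David\nAndrea\n'
print(lines)         # ['David\n', 'Andrea\n']
print(repr(at_end))  # ''
```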

## 2. Character Encoding and Plain text file in a specific `encoding`

Text encoding is a sufficiently complex topic that there is no one-size-fits-all answer. The right answer for a given application depends on factors such as:

* how much control you have over the text encodings used

* whether avoiding program failure is more important than avoiding data corruption or vice-versa

* how common encoding errors are expected to be, and whether they need to be handled gracefully or can simply be rejected as invalid input


### 2.1  Character Encoding


In computer memory, **characters** are "encoded" (or "represented") using a chosen **character encoding scheme** (also known as a "character set", "charset", "character map", or "code page").

**ASCII** 

For many years most programming languages used a standard called **ASCII** (American Standard Code for Information Interchange) for the internal representation of characters. 

This standard includes 128 characters, plenty for representing the usual set of characters appearing in English-language text, but not enough to cover the characters and accents appearing in all the world's languages.

For example, in **ASCII**, as well as in Latin-1 and Unicode, 

* code numbers 65D (41H) to 90D (5AH) represent 'A' to 'Z', respectively.

* code numbers 97D (61H) to 122D (7AH) represent 'a' to 'z', respectively.

* code numbers 48D (30H) to 57D (39H) represent '0' to '9', respectively.

It is important to note that the **representation scheme must be known** before a binary pattern can be interpreted. For example, the 8-bit pattern "0100 0010B" could be the integer 66, the ASCII character 'B', or something else entirely.
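
These code numbers can be checked in Python with the built-in functions `ord` (character to code number) and `chr` (code number to character). The short sketch below also interprets the bit pattern `0100 0010` both ways; the variable names are chosen for illustration.

```python
# Code numbers of some ASCII characters, in decimal and hexadecimal.
print(ord('A'), hex(ord('A')))   # 65 0x41
print(ord('z'), hex(ord('z')))   # 122 0x7a
print(chr(48), chr(57))          # 0 9

# The same 8-bit pattern read as an integer and as a character.
pattern = 0b01000010
as_integer = pattern
as_character = chr(pattern)
print(as_integer, as_character)  # 66 B
```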

**Unicode**

* **Unicode:** ISO/IEC 10646 Universal Character Set

In recent years, there has been a shift to **Unicode**. 

The Unicode standard is a character coding system designed to support the digital processing and display of the written texts of all languages. The standard contains more than 120,000 different characters, covering 129 modern and historic scripts and multiple symbol sets.

> **Reference** [Tutorial on Data Representation Integers, Floating-point Numbers, and Characters]( http://www3.ntu.edu.sg/home/ehchua/programming/java/DataRepresentation.html)

---

**The default source encoding in Python 3: UTF-8**

The **default encoding** for `Python source code` and `Jupyter Notebook` files is **UTF-8**.
 
Since **Python 3.0**, the language features a **str** type that contains **Unicode** characters,

> https://docs.python.org/howto/unicode.html

so a string literal can simply include Unicode characters.
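
As an illustration, the sketch below puts Chinese characters directly into a string literal and then encodes the same characters in two encodings. The sample text is an assumption chosen to match the examples later in this section.

```python
# A str holds Unicode characters; len() counts characters, not bytes.
text = '中文 and English'
print(len(text))  # 14

# The same two characters become different bytes in different encodings.
utf8_bytes = '中文'.encode('utf-8')
gbk_bytes = '中文'.encode('gbk')
print(utf8_bytes)  # b'\xe4\xb8\xad\xe6\x96\x87'
print(gbk_bytes)   # b'\xd6\xd0\xce\xc4'

# Decoding with the matching encoding recovers the original string.
print(utf8_bytes.decode('utf-8') == gbk_bytes.decode('gbk'))  # True
```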

**Visual Studio Code default encoding: utf8**

![](./img/vsc-encoding.jpg)



We can use `Reopen with Encoding` or `Save with Encoding` to convert the text to a suitable character encoding.

![vscode-encoding](./img/vscode-encoding.jpg)



We will give more examples of character encoding in the following section.

**The default character encoding on Chinese-language Windows is GBK**

* **GBK**: Chinese Internal Code Extension Specification (汉字内码扩展规范)



### 2.2 Plain text file in a specific `encoding`

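So far we have read and written strings without naming an encoding. In that case `open` uses a platform-dependent default (the locale's preferred encoding): typically UTF-8 on Linux and macOS, and often GBK on Chinese-language Windows.

Both English and non-English characters can be represented in UTF-8.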

In [None]:
fname="./code/python/default.txt"
f = open(fname,'w')
f.write('中文default')
f.close()

In [None]:
fname="./code/python/default.txt"
f = open(fname,'r')
line=f.readline()
print(line)
f.close()
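
To see which encoding `open` falls back to on a given machine, and to avoid depending on it, the following sketch may help. It assumes a hypothetical file `explicit_utf8.txt` in a temporary directory; `locale.getpreferredencoding(False)` reports the locale's preferred encoding, whose value differs between systems.

```python
import locale
import os
import tempfile

# The result depends on the platform and locale settings.
default_encoding = locale.getpreferredencoding(False)
print(default_encoding)

# Assumption: hypothetical scratch file in a temporary directory.
path = os.path.join(tempfile.mkdtemp(), 'explicit_utf8.txt')

# Naming the encoding makes the file content the same on every platform.
with open(path, 'w', encoding='utf-8') as f:
    f.write('中文default')

with open(path, 'r', encoding='utf-8') as f:
    line = f.readline()
print(line == '中文default')  # True

# In UTF-8 each of the two Chinese characters takes 3 bytes.
with open(path, 'rb') as f:
    raw = f.read()
print(len(raw))  # 13
```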

#### The specific encoding

We can read and write in a specific encoding by passing the keyword argument <b style="color:blue">encoding</b> to the `open` function.

Example: writing in GBK



In [None]:
fname="./code/python/gbk.txt"
f = open(fname,'w',encoding="gbk")
f.write('中文-gbk')
f.close()

Read the GBK file with GBK encoding

In [None]:
f = open(fname,'r',encoding="gbk")
line=f.readline()
print(line)
f.close()

Read the GBK file with UTF-8 encoding: the GBK bytes are not valid UTF-8, so this raises `UnicodeDecodeError`

In [None]:
f = open(fname,'r',encoding="utf-8")
line=f.readline()
print(line)
f.close()
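
The mismatch can be reproduced without a file by encoding and decoding bytes directly. The sketch below is an illustration rather than the notebook's own method: it catches the `UnicodeDecodeError` and then shows the `errors='replace'` option, which trades the exception for replacement characters (U+FFFD) and therefore loses data.

```python
# The same text as in gbk.txt, encoded to GBK bytes in memory.
data = '中文-gbk'.encode('gbk')
print(data)  # b'\xd6\xd0\xce\xc4-gbk'

# Decoding with the matching encoding works.
decoded = data.decode('gbk')

# Decoding with the wrong encoding fails.
failed = False
try:
    data.decode('utf-8')
except UnicodeDecodeError as err:
    failed = True
    print('cannot decode as UTF-8:', err.reason)

# errors='replace' avoids the exception but corrupts the Chinese characters.
lossy = data.decode('utf-8', errors='replace')
print(lossy.endswith('-gbk'), '\ufffd' in lossy)  # True True
```

The same `errors` keyword is accepted by `open`, so the choice between failing and silently replacing characters can be made when the file is opened.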

## 3. Data Table (File), Dictionary and List

The Data Table (File)


|name     |  age   |
|---------:|------:|
|zhangsan  |  28    |
|lishi  |  18    |



**Creating a dict from the data table `file`**


In [None]:
%%file ./data/personrecords.txt
name        age
zhangsan    28
lishi       18 


**In relational database terms**

```
data table -> relational database table
    column -> field: name, age
    row    -> record: zhangsan, 28
```

### 3.1 data table: `dict`

**View the data table by `columns`** 

* **the data table is a `dict`**

* **the values of each field are a `list`**

```python
data table  -> dict: {key:value}
    column -> key(string)
    all rows of each column -> value(list)
```

That is, the table has the form

```python
dict {field1:[],field2:[]....}
```

For the sample file this gives:

```python
{'name': ['zhangsan', 'lishi'],
 'age': [28.0, 18.0]}
```

In [1]:

datefile=open('./data/personrecords.txt')

# 1 get string of field(column)
fields=datefile.readline().split()
print(fields)

['name', 'age']


```python
{"name":[],"age":[]}
```

In [2]:
# 2 create the dict of the data table
datatable = {}
# {"name":[],"age":[]}
for key in fields:
    datatable[key] = []
    
# Dictionary comprehension (equivalent to the loop above)
# datatable = {key: [] for key in fields}

print("dict for datatable:\n{field1:[],field2:[]....}")
print(datatable)

dict for datatable:
{field1:[],field2:[]....}
{'name': [], 'age': []}


In [3]:
# 3 read each row into the value list of key 
for row in datefile:
    currow=row.split()
    for i in range(len(fields)):
        if fields[i]=="age":
            datatable[fields[i]].append(float(currow[i]))
        else:
            datatable[fields[i]].append(currow[i])

datefile.close()

print(datatable)


{'name': ['zhangsan', 'lishi'], 'age': [28.0, 18.0]}


**average age**

In [4]:
from statistics import mean
avgage=mean(datatable["age"])
print(avgage)

23.0
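
The three steps above hard-code the `age` field. As a sketch of how they might be generalized, the hypothetical helper `read_columns` below takes a `converters` dict that maps field names to conversion functions; both names are assumptions for illustration, and an in-memory `io.StringIO` stream stands in for `personrecords.txt`.

```python
import io


def read_columns(lines, converters=None):
    """Hypothetical helper: parse whitespace-separated rows into {field: [values]}."""
    converters = converters or {}
    rows = iter(lines)
    fields = next(rows).split()            # 1 the header line gives the keys
    table = {field: [] for field in fields}  # 2 one empty list per field
    for row in rows:                       # 3 append each value to its column
        values = row.split()
        if not values:                     # skip blank lines
            continue
        for field, value in zip(fields, values):
            convert = converters.get(field, str)
            table[field].append(convert(value))
    return table


# Assumption: same content as ./data/personrecords.txt, held in memory.
sample = io.StringIO('name        age\nzhangsan    28\nlishi       18 \n')
table = read_columns(sample, converters={'age': float})
print(table)  # {'name': ['zhangsan', 'lishi'], 'age': [28.0, 18.0]}
```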


### 3.2 data table `list` 

**view data table in `Rows`** 

* data table is a `list` of rows

* each `row` is a `dict`

```python
list[dict]: [{field1:value,field2:value,field*:value*},
         {field1:value,field2:value,field*:value*},
          ...]
```


|name     |  age   |
|---------:|------:|
|zhangsan  |  28    |
|lishi  |  18    |



```python
[{"name": "zhangsan", "age": 28.0},
 {"name": "lishi", "age": 18.0}
]
```

In [None]:
datatable=[] # data table is a list

datafile=open('./data/personrecords.txt')

# 1 get string of field(column)
fields=datafile.readline().split()
print(fields)


In [None]:
# 2 read each row into a dict: the keys are the field strings
for row in datafile:
    currow=row.split()
    rowdict={} # 2.1 init dict
    for i in range(len(fields)):
        # 2.2 add key:value to dict
        if fields[i]=="age":
            rowdict[fields[i]]=float(currow[i])
        else:
            rowdict[fields[i]]=currow[i]
    # 2.3 add dict to list:records
    datatable.append(rowdict)

datafile.close()
print(datatable)      

In [None]:
for item in datatable:
    print(item)
    
for item in datatable:
    print(item['name']) 

**average age**

In [None]:
from statistics import mean
# list Comprehension
ages=[item["age"] for item in datatable]
avgage=mean(ages)
print(avgage)
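
The column view (3.1) and the row view (3.2) hold the same data, so one can be derived from the other. The following sketch shows one possible conversion in each direction using `zip` and comprehensions; the variable names are chosen for illustration.

```python
# Column view: a dict of lists.
columns = {'name': ['zhangsan', 'lishi'], 'age': [28.0, 18.0]}
fields = list(columns)

# Columns -> rows: zip(*...) regroups the i-th value of every column.
rows = [dict(zip(fields, values)) for values in zip(*columns.values())]
print(rows)
# [{'name': 'zhangsan', 'age': 28.0}, {'name': 'lishi', 'age': 18.0}]

# Rows -> columns: collect each field across all rows.
back = {key: [row[key] for row in rows] for key in fields}
print(back == columns)  # True
```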

### What are programs?

<b style="color:blue;font-size:110%">Algorithms + Data Structures = Programs</b> is a 1976 book written by `Niklaus Wirth` covering some of the fundamental topics of computer programming, particularly that algorithms and data structures are inherently related (https://en.wikipedia.org/wiki/Algorithms_%2B_Data_Structures_%3D_Programs).

The Turbo Pascal compiler written by **Anders Hejlsberg** was largely inspired by the "Tiny Pascal" compiler in **Niklaus Wirth**'s book.



### 3.3 CSV  

#### 3.3.1 CSV: Comma-separated values

https://en.wikipedia.org/wiki/Comma-separated_values
    
In computing, a comma-separated values (**CSV**) file stores **tabular** data (numbers and text) in **plain text**.

* Each **line** of the file is a data **record**.

* Each **record** consists of one or more **fields**, separated by <b style="color:red">commas</b>.



In [None]:
%%file ./data/personrecords.csv
name,age
zhangsan,28
lishi,18 

CSV is **a common data exchange format** that is widely supported by consumer, business, and scientific applications. 

For example, a user may need to transfer information from a **database** program that stores data in a proprietary format, to a **spreadsheet** that uses a completely different format. 

The database program most likely can export its data as "CSV"; the exported CSV file can then be imported by the spreadsheet program.

The `CSV` format is the most common import and export format for `spreadsheets and databases`.
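
One reason to use a CSV library rather than `str.split(',')` is that a field may itself contain a comma, in which case it is quoted. The sketch below illustrates this with Python's `csv` module and in-memory `io.StringIO` buffers; the sample record is an assumption for illustration.

```python
import csv
import io

# Write two records to an in-memory buffer instead of a file.
buffer = io.StringIO(newline='')
writer = csv.writer(buffer)
writer.writerow(['name', 'city'])
writer.writerow(['Zhang, San', 'Nanjing'])  # the name contains a comma
text = buffer.getvalue()
print(text, end='')

# Splitting on commas breaks the quoted field into two pieces.
naive = text.splitlines()[1].split(',')
print(len(naive))  # 3

# csv.reader understands the quoting and recovers two fields.
parsed = list(csv.reader(io.StringIO(text, newline='')))
print(parsed[1])  # ['Zhang, San', 'Nanjing']
```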



#### 3.3.2 CSV module

The `csv` module implements classes to read and write tabular data in CSV format.

https://docs.python.org/3.8/library/csv.html

##### 3.3.2.1 csv.DictReader

In [None]:
import csv
filename="./data/personrecords.csv"
# newline='' lets the csv module handle line endings itself
csvfile = open(filename, 'r', newline='')
csvdata = csv.DictReader(csvfile)

# fields
print("fields =",csvdata.fieldnames)

# values
for line in csvdata:
    print(line)
    name = line['name']
    age=float(line['age'])
    print(name,age) 

csvfile.close()    

##### 3.3.2.2 csv.writer and writerow

In [None]:
import csv

fields=['name','age','city']
rows=[['Zhangsan',28,'Nanjing'],
      ['Lishi',18,'Beijing']
     ]


csvfile=open("./data/personrecords.csv","w", newline='')

csvwriter = csv.writer(csvfile, dialect="excel")
csvwriter.writerow(fields)

for record in rows:
    csvwriter.writerow(record)
    
csvfile.close()            

In [None]:
# %load "./data/personrecords.csv"
name,age,city
Zhangsan,28,Nanjing
Lishi,18,Beijing
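
When the table is held as a list of dicts (the row view of section 3.2), `csv.DictWriter` is the counterpart of `csv.DictReader`. The sketch below writes to an in-memory `io.StringIO` buffer rather than to `personrecords.csv`, which is an assumption made to keep the example self-contained.

```python
import csv
import io

records = [{'name': 'zhangsan', 'age': 28.0},
           {'name': 'lishi', 'age': 18.0}]

# Write a header row followed by one row per dict.
buffer = io.StringIO(newline='')
writer = csv.DictWriter(buffer, fieldnames=['name', 'age'])
writer.writeheader()
writer.writerows(records)
output = buffer.getvalue()
print(output, end='')

# Reading it back gives dicts again; every value is now a string.
restored = list(csv.DictReader(io.StringIO(output, newline='')))
print(restored[0]['name'], float(restored[0]['age']))  # zhangsan 28.0
```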


## Further Reading

###  Binary  Files

<b style="color:blue">b</b> appended to the mode opens the file in <b style="color:blue">binary mode</b>: the data is read and written in the form of `bytes` objects. 

This mode should be used for all files that don’t contain text.

In [None]:
f = open('binaryfile', 'wb')
f.write(b'0123456789abcdef')
f.close()

To change the file object's position, use `f.seek(offset, whence)`. The position is computed by adding `offset` to a reference point; the reference point is selected by the `whence` argument.

A `whence` value of 

* `0` measures from the beginning of the file, 

* `1` uses the current file position, and 

* `2` uses the end of the file as the reference point. 

`whence` can be omitted and defaults to `0`, using the beginning of the file as the reference point.

In [None]:
f = open('binaryfile', 'rb')
f.seek(5) # Go to the 6th byte in the file

In [None]:
f.read(1)

In [None]:
f.seek(-3, 2) # Go to the 3rd byte before the end

In [None]:
f.read(1)
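
The same positioning can be tried without creating a file. The sketch below uses an in-memory `io.BytesIO` stream as a stand-in for `binaryfile` (an assumption for illustration) and the named constants `io.SEEK_CUR` and `io.SEEK_END`, which equal `1` and `2`; `tell` reports the current position.

```python
import io

# In-memory binary stream with the same 16 bytes as binaryfile.
f = io.BytesIO(b'0123456789abcdef')

pos1 = f.seek(5)                # from the beginning: position 5
byte1 = f.read(1)               # b'5', position is now 6

pos2 = f.seek(2, io.SEEK_CUR)   # 2 bytes forward from position 6
byte2 = f.read(1)               # b'8'

pos3 = f.seek(-3, io.SEEK_END)  # 3 bytes before the end
byte3 = f.read(1)               # b'd'

print(pos1, byte1)  # 5 b'5'
print(pos2, byte2)  # 8 b'8'
print(pos3, byte3)  # 13 b'd'
print(f.tell())     # 14
f.close()
```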


## Reference


* **J. M. Hughes. Real World Instrumentation with Python: CHAPTER 12 Reading and Writing Data Files**



* [Python HOWTOs: Unicode HOWTO](https://docs.python.org/howto/unicode.html)
 

* [Python Tutorial 7.2. Reading and Writing Files](https://docs.python.org/tutorial/inputoutput.html#reading-and-writing-files)
 
 
* [Processing Text Files in Python 3](http://python-notes.curiousefficiency.org/en/latest/python3/text_file_processing.html)
  

* [Pandas](http://pandas.pydata.org/)  is an open source, BSD-licensed library providing high-performance, easy-to-use **data structures and data analysis tools** for the Python programming language.

