### How to import various file formats

#### The common spreadsheet file formats: data is stored in cells

- Comma Separated Values (CSV)


- Microsoft Excel Spreadsheet (xls) 


- Microsoft Excel Open XML Spreadsheet (xlsx)

#### 1. Comma-separated values (CSV)

- The fields within each record are separated by commas (,)

[Note] Tab Separated Values (TSV): the fields within each record are separated by tabs

Method 1:

In [None]:
import pandas as pd

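df = pd.read_csv("path/to/filename.csv") # optional: encoding="encoding name", index_col=0 (use the first column as the index instead of adding a new one)

# Save: df.to_csv("path/to/filename.csv", encoding="encoding name")

Method 2: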

In [None]:
import csv

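# Writing a csv file (1): csv.writer #
with open('filename.csv', 'w', newline='') as f: # newline='': let the csv module handle line endings itself
    # Create the csv writer object
    writer = csv.writer(f)
    
    # Write the header row
    writer.writerow(['field1', 'field2', ...])
    
    # Write a data row
    writer.writerow([value1, value2, ...])

# Writing a csv file (2): csv.DictWriter #
with open('filename.csv', 'w', newline='') as f: # newline='': let the csv module handle line endings itself
    # Define the field names
    fieldnames = ['field1', 'field2', ...]
    
    # Create a writer that writes dictionaries to the csv file
    writer = csv.DictWriter(f, fieldnames=fieldnames)
    
    # Write the header row
    writer.writeheader()
    
# Reading a csv file (1): csv.reader #
with open('filename.csv', newline='') as f: # newline='': let the csv module handle line endings itself
    # Each row is returned as a list of strings
    rows = csv.reader(f)
    for row in rows: # iterate while the file is still open
        print(row)

# Reading a csv file (2): csv.DictReader #
with open('filename.csv', newline='') as f: # newline='': let the csv module handle line endings itself
    # Each row is returned as a dictionary keyed by the header row
    rows = csv.DictReader(f)
    for row in rows: # iterate while the file is still open
        print(row)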
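
The `csv` templates above can be exercised end to end with a small round trip. This is a minimal sketch, assuming made-up field names (`name`, `score`) and a temporary file; it also shows the TSV case from the note by passing `delimiter="\t"`.

```python
import csv
import io
import os
import tempfile

# Hypothetical sample data; the field names are invented for this example.
fieldnames = ["name", "score"]
records = [
    {"name": "Alice", "score": 90},
    {"name": "Bob", "score": 85},
]

# Work in a temporary directory so the example leaves nothing behind.
with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = os.path.join(tmp_dir, "scores.csv")

    # Write the header row followed by one row per dictionary.
    with open(csv_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(records)

    # The reader is lazy, so consume it while the file is still open.
    with open(csv_path, newline="", encoding="utf-8") as f:
        rows_read = list(csv.DictReader(f))

# TSV uses the same reader with a tab delimiter.
tsv_text = "name\tscore\nAlice\t90\n"
tsv_rows = list(csv.reader(io.StringIO(tsv_text), delimiter="\t"))

print(rows_read)
print(tsv_rows)
```

Note that the `csv` module returns every value as a string, so numeric columns need an explicit conversion after reading.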

#### 2. XLSX

Method 1:

In [None]:
import pandas as pd

df = pd.read_excel("path/to/filename.xlsx", sheet_name="sheet name")

# Save: df.to_excel("path/to/filename.xlsx")

Method 2:

In [None]:
import openpyxl

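workbook = openpyxl.load_workbook("path/to/filename.xlsx")

# Get the first worksheet of the workbook
sheet1 = workbook.worksheets[0]

# Total number of columns: sheet1.max_column
# Total number of rows: sheet1.max_row
# Read a cell's content: sheet1.cell(row=i, column=j).value (row i, column j, both starting at 1)
# Write a single cell: sheet1['cell reference'] = content
# Append a list as a new row: sheet1.append(list_name)
# Save: workbook.save("path/to/filename.xlsx")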

#### The common archive file formats: they are used to collect multiple data files together into a single file.

- Zip


- RAR


- Tar

#### 3. ZIP

- A lossless compression format

In [None]:
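import zipfile
import pandas as pd

with zipfile.ZipFile('zip_filename.zip', 'r') as archive:
    # archive.read() would return raw bytes, so open the member and parse it as CSV instead
    with archive.open('csv_filename.csv') as f:
        df = pd.read_csv(f)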
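
To see what `zipfile` returns at each step, the following sketch builds a small archive in memory and reads it back using only the standard library. The member name `data.csv` and its contents are made up for illustration.

```python
import csv
import io
import zipfile

# Build a small archive in memory; "data.csv" is a hypothetical member name.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w", compression=zipfile.ZIP_DEFLATED) as archive:
    archive.writestr("data.csv", "id,value\n1,a\n2,b\n")

with zipfile.ZipFile(buffer, "r") as archive:
    # List the files stored in the archive.
    member_names = archive.namelist()

    # read() returns the member's raw bytes, not parsed rows.
    raw_bytes = archive.read("data.csv")

    # open() returns a binary stream; wrap it as text before parsing it as CSV.
    with archive.open("data.csv") as member:
        text_stream = io.TextIOWrapper(member, encoding="utf-8", newline="")
        zip_rows = list(csv.reader(text_stream))

print(member_names)
print(raw_bytes)
print(zip_rows)
```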

#### 4. Plain Text (txt)

- Unstructured form

In [None]:
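# The with statement closes the file automatically
with open("txt_filename.txt", "r") as text_file:
    text = text_file.read() # the whole file as a single string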

#### 5. JavaScript Object Notation (JSON)

- a text-based open standard designed for exchanging the structured data over web

In [None]:
import pandas as pd

df = pd.read_json("path/to/filename.json")

# Save: df.to_json("path/to/filename.json")
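
When plain Python objects are enough and a DataFrame is not needed, the standard-library `json` module is an alternative to pandas. A minimal sketch with made-up records:

```python
import json

# Hypothetical records; any mix of dicts, lists, strings and numbers works.
records = [
    {"name": "Alice", "score": 90},
    {"name": "王小明", "score": 85},
]

# ensure_ascii=False keeps non-ASCII characters readable instead of \uXXXX escapes.
json_text = json.dumps(records, ensure_ascii=False, indent=2)

# Parsing the text gives back equivalent Python objects.
restored = json.loads(json_text)

print(json_text)
```

For files, `json.dump(obj, f)` and `json.load(f)` do the same with an open file object.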

#### 6. XML

- an Extensible Markup Language


- a human-readable and machine-readable file format


- a self-descriptive language designed for sending information over the internet


- XML doesn't use predefined tags

In [None]:
import xml.etree.ElementTree as ET

tree = ET.parse("檔案路徑/檔名.xml")

root = tree.getroot()
print(root.tag)
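
Beyond the root tag, the tree can be walked to pull out attributes and text. The sketch below parses an inline document with `ET.fromstring`; the `catalog`/`book`/`title` tags are invented for the example, which is possible precisely because XML tags are not predefined.

```python
import xml.etree.ElementTree as ET

# A made-up document; the tag and attribute names are illustrative only.
xml_text = """<catalog>
    <book id="b1"><title>First</title></book>
    <book id="b2"><title>Second</title></book>
</catalog>"""

root = ET.fromstring(xml_text)
root_tag = root.tag

# iter() visits matching elements at any depth; get() reads an attribute,
# findtext() reads the text of a child element.
books = [(book.get("id"), book.findtext("title")) for book in root.iter("book")]

print(root_tag)
print(books)
```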

#### 7. Hyper Text Markup Language (HTML)

- HTML is the standard markup language for creating web pages and describing their structure


- HTML tags are predefined

See [Beginner’s guide to Web Scraping in Python (using BeautifulSoup)](https://www.analyticsvidhya.com/blog/2015/10/beginner-guide-web-scraping-beautiful-soup-python/)

In [None]:
import pandas as pd

tables = pd.read_html("path/to/filename.html") # returns a list of DataFrames, one per <table>
df = tables[0]

# Save: df.to_html("path/to/filename.html", encoding="encoding name")
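
`pd.read_html` relies on a third-party parser such as lxml. For a simple, well-formed table, the standard-library `html.parser` can do a rough version of the same job. This is only a sketch: the `TableCellParser` class is a hypothetical helper written for this example, and it does not handle nested tables or `rowspan`/`colspan`.

```python
from html.parser import HTMLParser


class TableCellParser(HTMLParser):
    """Collect the text of every <td>/<th> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])
        elif tag in ("td", "th"):
            if not self.rows:  # tolerate a cell that appears before any <tr>
                self.rows.append([])
            self._in_cell = True
            self.rows[-1].append("")

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._in_cell = False

    def handle_data(self, data):
        # Only text inside a cell is kept.
        if self._in_cell:
            self.rows[-1][-1] += data.strip()


# A made-up table for illustration.
html_text = (
    "<table>"
    "<tr><th>name</th><th>score</th></tr>"
    "<tr><td>Alice</td><td>90</td></tr>"
    "</table>"
)

parser = TableCellParser()
parser.feed(html_text)
parser.close()
table_rows = parser.rows

print(table_rows)
```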

#### 8. Images

See [Basics of Image Processing in Python](https://www.analyticsvidhya.com/blog/2014/12/image-processing-python-basics/)

#### 9. Hierarchical Data Format (HDF)

Advantages:

- It can be used in every size and type of system
- It has flexible, efficient storage and fast I/O
- Many formats support HDF

HDF5 file format: the latest HDF version which is designed to address some of the limitations of the older HDF file formats

In [None]:
import pandas as pd

df = pd.read_hdf("path/to/filename.h5") # pass key="..." if the file stores more than one object

#### 10. Portable Document Format (PDF)

- Download PDFMiner and install it through the [website](https://euske.github.io/pdfminer/)


- Extract PDF file by the following code:

      pdf2txt.py <pdf_file>.pdf

#### 11. DOCX

- Install the Python docx2txt library: pip install docx2txt

In [None]:
import docx2txt

text = docx2txt.process("path/to/filename.docx")

#### 12. MP3

- [mp3 File Format Structure](https://upload.wikimedia.org/wikipedia/commons/0/01/Mp3filestructure.svg)


- Reading or manipulating the multimedia files: [PyMedia](http://pymedia.org/tut/index.html)

#### 13. MP4

- Install [MoviePy](http://zulko.github.io/moviepy/)

In [None]:
from moviepy.editor import VideoFileClip, ipython_display

clip = VideoFileClip('<video_file>.mp4')
ipython_display(clip)

#### [Note]: Using the built-in function open()

#### 1. Opening and closing files

Method 1:

    f = open(filename [, mode] [, encoding=encoding])
    
    ......
    
    f.close()

Method 2:

    with open(filename [, mode] [, encoding=encoding]) as f:
    
        ......
        
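[Mode]

- r: read mode (the default)


- w: write mode; overwrites the contents of an existing file


- a: append mode; new content is added at the end of the existing file contents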
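
The difference between the modes is easiest to see by writing to the same file several times. A minimal sketch, assuming a temporary file named `demo.txt` (a made-up name):

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "demo.txt")

    # "w" creates the file (or empties an existing one) before writing.
    with open(path, "w", encoding="utf-8") as f:
        f.write("first\n")

    # Opening with "w" again discards "first".
    with open(path, "w", encoding="utf-8") as f:
        f.write("second\n")

    # "a" keeps the existing content and writes at the end.
    with open(path, "a", encoding="utf-8") as f:
        f.write("third\n")

    with open(path, "r", encoding="utf-8") as f:
        content = f.read()

    # Leaving the with block closes the file; no explicit f.close() is needed.
    closed_after_with = f.closed

print(repr(content))
print(closed_after_with)
```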


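[Encoding]

- Default: platform-dependent (the locale's preferred encoding, which is often but not always UTF-8); pass encoding="utf-8" to be explicit


- To strip the BOM when reading a file that starts with one: utf-8-sig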
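
The effect of `utf-8-sig` can be checked by writing a file that starts with a BOM and reading it back with both encodings. A minimal sketch using a temporary file (`bom.txt` is a made-up name):

```python
import codecs
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "bom.txt")

    # Write the UTF-8 byte order mark followed by the text, as raw bytes.
    with open(path, "wb") as f:
        f.write(codecs.BOM_UTF8 + "hello".encode("utf-8"))

    # Plain utf-8 keeps the BOM as an invisible leading character (U+FEFF).
    with open(path, "r", encoding="utf-8") as f:
        with_bom = f.read()

    # utf-8-sig drops the BOM if it is present.
    with open(path, "r", encoding="utf-8-sig") as f:
        without_bom = f.read()

print(repr(with_bom))
print(repr(without_bom))
```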


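#### 2. File operations

- read(size): read up to size characters; if size is omitted, read all remaining characters


- readable(): test whether the file can be read


- readline(size): read up to size characters of the line at the current position; if size is omitted, read the whole line (including '\n'), so displaying it with print() adds an extra blank line


- readlines(): read all remaining lines and return them as a list (each item keeps its newline character '\n' and any other hidden characters)


- next(f): return the next line and move to the line after it


- seek(0): move the file pointer to the beginning of the file


- tell(): return the current position in the file


- write(str): write the given string to the file (returns the number of characters written)


- writable(): test whether the file can be written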
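
The methods above can be tried together on a small file. This is a minimal sketch with a made-up three-line file; `newline=""` is passed so that line endings are written and read exactly as given.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "lines.txt")

    with open(path, "w", encoding="utf-8", newline="") as f:
        can_write = f.writable()
        # write() takes a string and returns how many characters it wrote.
        chars_written = f.write("alpha\nbeta\ngamma\n")

    with open(path, "r", encoding="utf-8", newline="") as f:
        can_read = f.readable()
        first_three = f.read(3)       # first 3 characters of the file
        rest_of_line = f.readline()   # rest of the current line, including "\n"
        position = f.tell()           # current position in the file
        remaining = f.readlines()     # all remaining lines as a list
        f.seek(0)                     # back to the start of the file
        first_line = next(f)          # next line from the current position

# print() adds its own newline, so strip the one kept by readline()/next().
print(first_line.rstrip("\n"))
print(first_three, repr(rest_of_line), remaining)
```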

#### Reference:

- [How to read most commonly used file formats in Data Science (using Python)?](https://www.analyticsvidhya.com/blog/2017/03/read-commonly-used-formats-using-python/?utm_content=buffer47ec4&utm_medium=social&utm_source=plus.google.com&utm_campaign=buffer)