## Part A: Retrieving Desired Lotto Numbers from Zip Files
### Step A-1: Unzip
- Issues
    - Garbled Chinese file names
- Refs
    - [zipfile — Work with ZIP archives](https://docs.python.org/3/library/zipfile.html)
    - [pathlib — Object-oriented filesystem paths](https://docs.python.org/3/library/pathlib.html)
    - [[Python] Using zipfile to extract zip files that contain Chinese file names](https://blog.csdn.net/u010099080/article/details/79829247)
    - [Python: How to unzip a file | Extract Single, multiple or all files from a ZIP archive](https://thispointer.com/python-how-to-unzip-a-file-extract-single-multiple-or-all-files-from-a-zip-archive/)
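
The garbled-name issue listed above can be reproduced without any zip file. The sketch below assumes the archive stored its names as Big5 bytes without the UTF-8 flag, in which case `zipfile` decodes them as cp437. The file name used here is made up for illustration.

```python
# A made-up file name, encoded the way a Big5 zip tool would store it
original = "大樂透_2020.csv"
raw_bytes = original.encode("big5")

# Without the UTF-8 flag, zipfile reads the name bytes as cp437 -> garbled text
garbled = raw_bytes.decode("cp437")

# Undoing the wrong decode gives back the raw bytes, which can then be read as Big5
restored = garbled.encode("cp437").decode("big5")

print(repr(garbled))
print(restored == original)
```

Because cp437 maps every byte value to a character, the round trip through `encode('cp437')` loses nothing, which is why the rename trick works.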

In [None]:
from pathlib import Path
from zipfile import ZipFile
for year in range(103, 110):  # archives are named by ROC year: 103-109 (2014-2020)
    print(f"Start unzipping win_nums_{year}.zip")
    with ZipFile(f'../output/win_nums_{year}.zip', 'r') as zf:
        for fn in zf.namelist():
            extracted_path = Path(zf.extract(fn, path='../output/win_nums_all/'))
            # zipfile decoded the name as cp437; re-encode and decode it as Big5
            extracted_path.rename(f"../output/win_nums_all/{fn.encode('cp437').decode('big5')}")

print("All unzip complete!")

In [None]:
# # With this approach, the Chinese file names come out garbled after extraction
# from zipfile import ZipFile
# for year in range(103, 110):
#     print(f"Unzipping year win_nums_{year}.zip...")
#     with ZipFile(f'../output/win_nums_{year}.zip', 'r') as zf:
#         zf.extractall("../output/")
# else:
#     print("All unzip complete!")

---

### Step A-2: Retrieve desired sheets from each unzipped folder
- Issues
    - Find the desired file in each folder
- Refs
    - [Python tutorial: converting file encoding between Big5 and UTF-8](https://officeguide.cc/python-big5-utf8-file-encoding-convertion-tutorial/)
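
One way to find the desired file in each folder is a `pathlib` glob over the year sub-folders. This is only a sketch: `find_lotto_files` is a hypothetical helper, and the demo builds a throwaway directory with placeholder files (the 2019 file name is invented) instead of touching `../output/`.

```python
import tempfile
from pathlib import Path


def find_lotto_files(base_dir, prefix="大樂透_"):
    """Return the CSV paths in base_dir/<subfolder>/ whose names start with prefix."""
    return sorted(Path(base_dir).glob(f"*/{prefix}*.csv"))


# Demo on a temporary layout that mimics <base_dir>/<year>/<file>
with tempfile.TemporaryDirectory() as tmp:
    samples = [
        ("2019", "大樂透_201901_201912.csv"),  # invented name
        ("2020", "大樂透_202001_202006.csv"),
        ("2020", "other.csv"),                 # should be ignored
    ]
    for year, name in samples:
        folder = Path(tmp) / year
        folder.mkdir(exist_ok=True)
        (folder / name).write_text("", encoding="utf-8")

    found = [p.name for p in find_lotto_files(tmp)]

print(found)
```

Globbing keeps the year loop and the prefix test in one expression, at the cost of matching any sub-folder rather than only the years 2014 to 2020.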

==**TODO: The files can already be found and converted to pandas DataFrames, but everything still needs to be chained together**==

In [13]:
import os
import numpy as np
import pandas as pd

In [61]:
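def win_nums_csv_to_df(path: str) -> pd.DataFrame:
    """Read one winning-numbers CSV and return its first 13 columns, relabelled."""
    df = pd.read_csv(path,
                     header=0,
                     parse_dates=True,
                     infer_datetime_format=True,
                     skipinitialspace=True,
                     )
    # Turn the index that read_csv built into ordinary columns
    df = df.reset_index()
    col_names = list(df.columns)
    # Keep the first 13 columns and label them with the names at positions 4-16
    df = df.iloc[:, 0:13]
    df.columns = col_names[4:17]
    return df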

In [71]:
# main()
base_dir = '../output/win_nums_all/'
win_nums_all_df = pd.DataFrame(data=None)
for year in range(2014, 2021):
    dir_path = f'{base_dir}{year}/'
    for file in os.listdir(dir_path):
        # Only the Lotto (大樂透) result files are wanted
        if file.startswith("大樂透_"):
            df = win_nums_csv_to_df(f'{dir_path}{file}')
            win_nums_all_df = pd.concat([win_nums_all_df, df], axis=0)

print(win_nums_all_df)
win_nums_all_df.to_csv(f'{base_dir}win_nums_all.csv', encoding="big5")

   遊戲名稱         期別       開獎日期       銷售總額     銷售注數        總獎金  獎號1  獎號2  獎號3  \
0   大樂透  103000001 2014-01-03  146792850  2935857   80736067   11   18   20   
1   大樂透  103000002 2014-01-07  132190750  2643815  106318850    1    7   21   
2   大樂透  103000003 2014-01-10  140617450  2812349  149849635    7   17   19   
3   大樂透  103000004 2014-01-14  128619400  2572388   70740670    2   11   21   
4   大樂透  103000005 2014-01-17  141157800  2823156  108354326   13   15   20   
..  ...        ...        ...        ...      ...        ...  ...  ...  ...   
55  大樂透  109000056 2020-06-16  142703600  2854072  160593624    6   13   20   
56  大樂透  109000057 2020-06-19  133453900  2669078   74734184    1   15   16   
57  大樂透  109000058 2020-06-23  139006850  2780137  109169790    1    2   16   
58  大樂透  109000059 2020-06-26  140647400  2812948  141275751   10   12   16   
59  大樂透  109000060 2020-06-30  134411400  2688228  169667082   26   29   31   

    獎號4  獎號5  獎號6  特別號  
0    21   35   37    8  
1
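
The printed frame above keeps each year's own row labels. If one continuous index is preferred, a possible variation is to collect the per-year frames in a list and concatenate once with `ignore_index=True`. The sketch below uses two tiny stand-in frames with made-up values rather than the real CSVs.

```python
import pandas as pd

# Stand-in frames (made-up values) in place of the per-year DataFrames
frames = [
    pd.DataFrame({"期別": [103000001, 103000002], "特別號": [8, 3]}),
    pd.DataFrame({"期別": [104000001], "特別號": [5]}),
]

# One concat call; ignore_index=True renumbers the rows 0..n-1
combined = pd.concat(frames, axis=0, ignore_index=True)
print(combined.index.tolist())
```

Concatenating once also avoids rebuilding the growing frame on every loop iteration.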

---
## Tests & Ref Codes

In [None]:
with open("big5_input.txt", "r", encoding = "Big5") as inFile, open("utf8_output.txt", "w", encoding = "UTF-8") as outFile:
    outFile.write(inFile.read())

In [4]:
files = os.listdir("../output/win_nums_all/2020")
for file in files:
    if file.startswith("大樂透_"):
        print(file)

大樂透_202001_202006.csv


In [16]:
win_nums_2020_df = pd.read_csv("../output/win_nums_all/2020/大樂透_202001_202006.csv", 
                               header=0 , 
                               parse_dates=True, 
                               infer_datetime_format=True,
                               skipinitialspace=True,
                               )
win_nums_2020_df = win_nums_2020_df.reset_index()
col_names = list(win_nums_2020_df.columns)
win_nums_2020_df = win_nums_2020_df.iloc[:, 0:13]
win_nums_2020_df.columns = col_names[4:17]
win_nums_2020_df.head()

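| | 遊戲名稱 | 期別 | 開獎日期 | 銷售總額 | 銷售注數 | 總獎金 | 獎號1 | 獎號2 | 獎號3 | 獎號4 | 獎號5 | 獎號6 | 特別號 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 大樂透 | 109000001 | 2020-01-03 | 112729500 | 2254590 | 217079550 | 16 | 18 | 19 | 29 | 30 | 47 | 10 |
| 1 | 大樂透 | 109000002 | 2020-01-07 | 117529800 | 2350596 | 248835825 | 1 | 5 | 8 | 15 | 16 | 38 | 7 |
| 2 | 大樂透 | 109000003 | 2020-01-10 | 121991550 | 2439831 | 266131509 | 4 | 7 | 12 | 17 | 33 | 38 | 11 |
| 3 | 大樂透 | 109000004 | 2020-01-14 | 125272850 | 2505457 | 290316408 | 7 | 20 | 34 | 37 | 38 | 47 | 18 |
| 4 | 大樂透 | 109000005 | 2020-01-17 | 101544800 | 2030896 | 56865088 | 7 | 9 | 20 | 42 | 44 | 45 | 8 |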
