# Preprocessing the Surface Weather Map

The daily weather briefing (天氣概況) is composed based on the surface weather map issued at 00Z (0800 LST). Here is an example:

<img src='images/20200119-0000-FI04.png' style="height: 200px" />

We collected the surface weather maps at 6-hour intervals from 2011 to 2020, and we want to process the dataset in this notebook. We aim to:

- Pick only the 00Z weather map of each day
- Find a proper way to encode the images

We hope to use the **encoded images** and related **weather briefing** to fine-tune an LLM.


## Look into the Data

In [1]:
import os
import numpy as np
import pandas as pd

DATAROOT = '../../data/weather_map/'

# Walk through the sub-directories
file_list = []
for root, dirs, files in os.walk(DATAROOT):
    for name in files:
        if name.endswith('-1.jpg'):
            date = name.replace('-1.jpg','')
            url = os.path.join(root, name)
            file_list.append({'date':date, 'furi':url})

file_list = pd.DataFrame(file_list)
print(file_list.shape)
print(file_list.head())

(3273, 2)
       date                                               furi
0  20110101  ../../data/weather_map/2011\20110101\20110101-...
1  20110102  ../../data/weather_map/2011\20110102\20110102-...
2  20110103  ../../data/weather_map/2011\20110103\20110103-...
3  20110104  ../../data/weather_map/2011\20110104\20110104-...
4  20110105  ../../data/weather_map/2011\20110105\20110105-...


In [2]:
!pip install Pillow



The raw 2020 data arrived as PNG files, and its naming convention differs from that of 2011-2019. Here we convert these files to JPG (the format of the other years) and rename them to match.
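The renaming rule can be sketched as a small helper. This is only an illustration, assuming every 2020 file follows the `YYYYMMDD-0000-FI04.png` pattern seen in the example image and that the target is the `-1.jpg` suffix the first cell filters on. The name `to_unified_name` and the `0600` file name are hypothetical and not part of the notebook.

```python
import re

# Assumed 2020 naming pattern for the 00Z map: <YYYYMMDD>-0000-FI04.png
PATTERN_2020 = re.compile(r'^(\d{8})-0000-FI04\.png$')

def to_unified_name(name):
    """Map a 2020 file name to the 2011-2019 convention, or return None."""
    match = PATTERN_2020.match(name)
    if match is None:
        return None  # not a 00Z map in the assumed 2020 naming scheme
    return match.group(1) + '-1.jpg'

print(to_unified_name('20200119-0000-FI04.png'))  # 20200119-1.jpg
print(to_unified_name('20200119-0600-FI04.png'))  # None (hypothetical 06Z name)
```

Anchoring the pattern at both ends avoids picking up files that merely contain the same substring.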

We use the [`Pillow`](https://pillow.readthedocs.io/en/stable/handbook/tutorial.html) package for image processing and apply the following steps to every image:

- Resize images to 1280x960. (The original sizes all deviate slightly from this, and the 2020 images are roughly 2560x1920.)
- Convert the color mode to 8-bit grayscale. (Early images are black-and-white only, while later images are RGB.)
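The two steps above can be tried on a synthetic image before touching the real files. This is a minimal sketch: `normalize_map` is a hypothetical helper name, and the white test image is only a stand-in for a 2020 map.

```python
from PIL import Image

TARGET_SIZE = (1280, 960)  # (width, height), as listed above

def normalize_map(img):
    """Resize to 1280x960, then convert to 8-bit grayscale (mode 'L')."""
    return img.resize(TARGET_SIZE).convert('L')

# Synthetic stand-in for a 2020 RGB map (~2560x1920); not real data.
fake_map = Image.new('RGB', (2560, 1920), color=(255, 255, 255))
out = normalize_map(fake_map)
print(out.size, out.mode)  # (1280, 960) L
```

Both 1280x960 and 2560x1920 have a 4:3 aspect ratio, so the resize should not distort the 2020 maps noticeably.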

### Correct the 2020 data

In [13]:
DATAROOT2020 = '../../data/weather_map/2020/'

# Walk through the sub-directories
file_list2020 = []
for root, dirs, files in os.walk(DATAROOT2020):
    for name in files:
        if name.endswith('-0000-FI04.png'):
            date = name.replace('-0000-FI04.png','')
            url = os.path.join(root, name)
            file_list2020.append({'date':date, 'furi':url})

file_list2020 = pd.DataFrame(file_list2020)
print(file_list2020.shape)
print(file_list2020.head())

(366, 2)
       date                                               furi
0  20200101  ../../data/weather_map/2020/20200101-0000-FI04...
1  20200102  ../../data/weather_map/2020/20200102-0000-FI04...
2  20200103  ../../data/weather_map/2020/20200103-0000-FI04...
3  20200104  ../../data/weather_map/2020/20200104-0000-FI04...
4  20200105  ../../data/weather_map/2020/20200105-0000-FI04...


In [20]:
from PIL import Image

OUTPUT_PATH = '../../data/weather_map/preproc/'
os.makedirs(OUTPUT_PATH, exist_ok=True)  # make sure the output folder exists

ff20 = file_list2020.copy()

for index, row in ff20.iterrows():
    # Convert the PNG to a 1280x960, 8-bit grayscale JPG
    out_path = OUTPUT_PATH + row['date'] + '-1.jpg'
    tmp = Image.open(row['furi'])
    tmp = tmp.resize((1280, 960))
    tmp = tmp.convert('L')
    tmp.save(out_path)
    # Update furi in the DataFrame itself (the row from iterrows may be a copy)
    ff20.at[index, 'furi'] = out_path

print(ff20.shape)
print(ff20.head())

(366, 2)
       date                                           furi
0  20200101  ../../data/weather_map/preproc/20200101-1.jpg
1  20200102  ../../data/weather_map/preproc/20200102-1.jpg
2  20200103  ../../data/weather_map/preproc/20200103-1.jpg
3  20200104  ../../data/weather_map/preproc/20200104-1.jpg
4  20200105  ../../data/weather_map/preproc/20200105-1.jpg


### Unify the format of 2011-2019

We noticed that earlier images are black-and-white, while more recent ones are in color. Also, the nominal image size is 1280x960, but each image deviates slightly from it. We need to unify the format for further analysis.
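One way to confirm this observation is to count the distinct (size, mode) combinations in the dataset. The sketch below uses synthetic images with made-up sizes; `survey_images` is a hypothetical helper, not something defined elsewhere in this notebook.

```python
from collections import Counter
from PIL import Image

def survey_images(images):
    """Count (size, mode) combinations over an iterable of PIL images."""
    return Counter((img.size, img.mode) for img in images)

# Synthetic stand-ins with made-up sizes; not the real maps.
samples = [
    Image.new('L', (1278, 958)),
    Image.new('RGB', (1281, 961)),
    Image.new('RGB', (1281, 961)),
]
counts = survey_images(samples)
for (size, mode), n in counts.most_common():
    print(size, mode, n)
```

On the real data, one would pass `Image.open(p)` for each path in `file_list['furi']`. Reading `size` and `mode` does not require decoding the pixel data, so the survey should be fairly cheap.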

In [21]:
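OUTPUT_PATH = '../../data/weather_map/preproc/'
os.makedirs(OUTPUT_PATH, exist_ok=True)  # make sure the output folder exists

for index, row in file_list.iterrows():
    # Resize the JPG to 1280x960 and convert it to 8-bit grayscale
    out_path = OUTPUT_PATH + row['date'] + '-1.jpg'
    tmp = Image.open(row['furi'])
    tmp = tmp.resize((1280, 960))
    tmp = tmp.convert('L')
    tmp.save(out_path)
    # Update furi in the DataFrame itself (the row from iterrows may be a copy)
    file_list.at[index, 'furi'] = out_path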

print(file_list.shape)
print(file_list.head())

(3273, 2)
       date                                           furi
0  20110101  ../../data/weather_map/preproc/20110101-1.jpg
1  20110102  ../../data/weather_map/preproc/20110102-1.jpg
2  20110103  ../../data/weather_map/preproc/20110103-1.jpg
3  20110104  ../../data/weather_map/preproc/20110104-1.jpg
4  20110105  ../../data/weather_map/preproc/20110105-1.jpg


In [22]:
df = pd.concat([file_list, ff20], ignore_index=True)  # rebuild a unique 0..N-1 index
print(df.shape)
print(df.head())

(3639, 2)
       date                                           furi
0  20110101  ../../data/weather_map/preproc/20110101-1.jpg
1  20110102  ../../data/weather_map/preproc/20110102-1.jpg
2  20110103  ../../data/weather_map/preproc/20110103-1.jpg
3  20110104  ../../data/weather_map/preproc/20110104-1.jpg
4  20110105  ../../data/weather_map/preproc/20110105-1.jpg
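The combined table has 3639 rows, while 2011-2020 spans 3653 calendar days, so some dates appear to have no 00Z map. A check along the following lines could list them. It is a sketch on toy input, and `missing_dates` is a hypothetical helper.

```python
import pandas as pd

def missing_dates(dates, start, end):
    """Return the days in [start, end] that do not appear in `dates`."""
    have = pd.to_datetime(pd.Series(dates), format='%Y%m%d')
    expected = pd.date_range(start, end, freq='D')
    return expected.difference(pd.DatetimeIndex(have))

# Toy input, not the real file list: 20110103 is deliberately left out.
toy = ['20110101', '20110102', '20110104']
gaps = missing_dates(toy, '2011-01-01', '2011-01-04')
print(list(gaps.strftime('%Y%m%d')))  # ['20110103']
```

With the real data, the call would be `missing_dates(df['date'], '2011-01-01', '2020-12-31')`, assuming every `date` value is a valid `YYYYMMDD` string.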


In [24]:
df.to_csv('../../data/flist_weather_map.csv',index=False)
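When this file list is read back later, `pd.read_csv` will by default parse the all-digit `date` column as integers. Passing `dtype={'date': str}` keeps it as text, which matters if downstream code concatenates it into file names. The sketch below uses an in-memory CSV with the same two columns rather than the real file.

```python
import io
import pandas as pd

# Toy CSV text in the same two-column layout; not the real file.
csv_text = (
    'date,furi\n'
    '20110101,../../data/weather_map/preproc/20110101-1.jpg\n'
)

df_int = pd.read_csv(io.StringIO(csv_text))                       # date parsed as a number
df_str = pd.read_csv(io.StringIO(csv_text), dtype={'date': str})  # date kept as text

print(type(df_int['date'].iloc[0]).__name__)
print(type(df_str['date'].iloc[0]).__name__)
```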