## 2.1 Profiling applications with both IO and computing workloads

I converted the book's sample code into a notebook.

In [1]:
import collections
import csv
import datetime
import sys

import requests

%load_ext line_profiler

In [2]:
stations = ["01044099999", "02293099999"]
start_year = 2021
end_year = 2021
# stations = sys.argv[1].split(",")
# years = [int(year) for year in sys.argv[2].split("-")]
# start_year = years[0]
# end_year = years[1]

TEMPLATE_URL = "https://www.ncei.noaa.gov/data/global-hourly/access/{year}/{station}.csv"
TEMPLATE_FILE = "station_{station}_{year}.csv"

In [3]:
def download_data(station, year):
    my_url = TEMPLATE_URL.format(station=station, year=year)
    req = requests.get(my_url)
    if req.status_code != 200:
        return  # not found
    # Use a context manager so the file is closed even if the write fails.
    with open(TEMPLATE_FILE.format(station=station, year=year), "wt") as w:
        w.write(req.text)

In [4]:
def download_all_data(stations, start_year, end_year):
    for station in stations:
        for year in range(start_year, end_year + 1):
            download_data(station, year)


In [5]:
def get_file_temperatures(file_name):
    with open(file_name, "rt") as f:
        reader = csv.reader(f)
        header = next(reader)
        for row in reader:
            station = row[header.index("STATION")]
            # date = datetime.datetime.fromisoformat(row[header.index('DATE')])
            tmp = row[header.index("TMP")]
            temperature, status = tmp.split(",")
            if status != "1":
                continue
            temperature = int(temperature) / 10
            yield temperature
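
As a rough illustration of the parsing above, the sketch below applies the same `TMP` handling to a small in-memory CSV. The sample rows are made up for illustration (they are not real NOAA data), and `parse_tmp_rows` is a hypothetical helper name. Like the function above, it assumes `TMP` holds the temperature in tenths of a degree followed by a status flag.

```python
import csv
import io

# Made-up sample rows for illustration; not real NOAA data.
SAMPLE_CSV = '''"STATION","DATE","TMP"
"01044099999","2021-01-01T00:00:00","-0100,1"
"01044099999","2021-01-01T01:00:00","+9999,9"
"01044099999","2021-01-01T02:00:00","+0025,1"
'''


def parse_tmp_rows(text):
    """Yield temperatures from CSV text, keeping only rows whose status is "1"."""
    reader = csv.reader(io.StringIO(text))
    header = next(reader)
    tmp_index = header.index("TMP")  # look the column up once, not per row
    for row in reader:
        temperature, status = row[tmp_index].split(",")
        if status != "1":
            continue  # skip readings whose status flag is not "1"
        yield int(temperature) / 10


sample_temperatures = list(parse_tmp_rows(SAMPLE_CSV))
print(sample_temperatures)
```

Unlike the notebook's version, this sketch looks up the `TMP` column index once before the loop. That is a design choice of the sketch, not something the book's code does.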

In [6]:
def get_all_temperatures(stations, start_year, end_year):
    temperatures = collections.defaultdict(list)
    for station in stations:
        for year in range(start_year, end_year + 1):
            for temperature in get_file_temperatures(TEMPLATE_FILE.format(station=station, year=year)):
                temperatures[station].append(temperature)
    return temperatures


In [7]:
def get_min_temperatures(all_temperatures):
    return {station: min(temperatures) for station, temperatures in all_temperatures.items()}


To measure execution time, add `%%time` at the top of the cell.

In [8]:
%%time
download_all_data(stations, start_year, end_year)
all_temperatures = get_all_temperatures(stations, start_year, end_year)
min_temperatures = get_min_temperatures(all_temperatures)
print(min_temperatures)


{'01044099999': -10.0, '02293099999': -27.6}
CPU times: user 327 ms, sys: 154 ms, total: 481 ms
Wall time: 12.9 s
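
The gap between CPU time (about 0.5 s) and wall time (12.9 s) in the output above suggests that the cell spends most of its time waiting rather than computing. Outside a notebook, a similar comparison can be sketched with the standard library. Here `measure` is a hypothetical helper, and `time.sleep` merely stands in for a blocking download.

```python
import time


def measure(func, *args, **kwargs):
    """Return (result, wall_seconds, cpu_seconds) for one call of func."""
    wall_start = time.perf_counter()
    cpu_start = time.process_time()
    result = func(*args, **kwargs)
    cpu_seconds = time.process_time() - cpu_start
    wall_seconds = time.perf_counter() - wall_start
    return result, wall_seconds, cpu_seconds


# time.sleep stands in for a blocking network request (an assumption for the demo).
_, wall, cpu = measure(time.sleep, 0.2)
print(f"wall: {wall:.3f} s, cpu: {cpu:.3f} s")
```

When wall time is much larger than CPU time, as in this sketch, the workload is probably waiting on something such as IO.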


To profile what happens inside the calls, use `%%prun`.

In [9]:
%%prun
download_all_data(stations, start_year, end_year)
all_temperatures = get_all_temperatures(stations, start_year, end_year)
min_temperatures = get_min_temperatures(all_temperatures)
print(min_temperatures)


{'01044099999': -10.0, '02293099999': -27.6}
 

```
         244928 function calls (244922 primitive calls) in 12.521 seconds

   Ordered by: internal time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
     2999   10.842    0.004   10.842    0.004 {method 'read' of '_ssl._SSLSocket' objects}
        2    0.881    0.440    0.881    0.440 {method 'do_handshake' of '_ssl._SSLSocket' objects}
        2    0.398    0.199    0.398    0.199 {method 'connect' of '_socket.socket' objects}
    33650    0.103    0.000    0.116    0.000 2613255434.py:1(get_file_temperatures)
        2    0.089    0.044    0.089    0.044 {method 'load_verify_locations' of '_ssl._SSLContext' objects}
     1169    0.020    0.000   10.513    0.009 response.py:535(read)
     1169    0.014    0.000   10.438    0.009 {method 'read' of '_io.BufferedReader' objects}
     2999    0.011    0.000   10.883    0.004 socket.py:691(readinto)
     2999    0.011    0.000   10.864    0.004 ssl.py:1263(recv_into)
     :
     :
```
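
Much the same kind of report can be produced outside a notebook with the standard library's `cProfile` and `pstats`. The sketch below is only a stand-in for the real workload: `slow_io`, `compute`, and `workload` are hypothetical functions replacing the download and parsing steps. Sorting by internal time mirrors the "Ordered by: internal time" line in the report above.

```python
import cProfile
import io
import pstats
import time


def slow_io():
    """Hypothetical stand-in for a blocking download."""
    time.sleep(0.1)


def compute():
    """Hypothetical stand-in for CPU-bound parsing."""
    return sum(i * i for i in range(100_000))


def workload():
    slow_io()
    return compute()


profiler = cProfile.Profile()
profiler.enable()
workload()
profiler.disable()

# Sort by internal time (tottime) and keep the ten most expensive entries.
buffer = io.StringIO()
stats = pstats.Stats(profiler, stream=buffer)
stats.sort_stats(pstats.SortKey.TIME).print_stats(10)
report = buffer.getvalue()
print(report)
```

As in the notebook output, the blocking call should appear near the top of the report, which is what points to IO rather than computation as the bottleneck.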

For line-by-line profiling, use `line_profiler`.

Run it in the form `%lprun -f {{function_name}} {{function_call}}`.

In [10]:
%lprun -f download_all_data download_all_data(stations, start_year, end_year)

```
Timer unit: 1e-09 s

Total time: 17.4423 s
File: /var/folders/tt/qt9zhxym8v5bhx0059bjbzx80000gp/T/ipykernel_3470/555202207.py
Function: download_all_data at line 1

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
     1                                           def download_all_data(stations, start_year, end_year):
     2         2       2000.0   1000.0      0.0      for station in stations:
     3         2      10000.0   5000.0      0.0          for year in range(start_year, end_year + 1):
     4         2 17442278000.0 8721139000.0    100.0              download_data(station, year)
```

In [11]:
%lprun -f get_all_temperatures all_temperatures = get_all_temperatures(stations, start_year, end_year)


```
Timer unit: 1e-09 s

Total time: 0.122668 s
File: /var/folders/tt/qt9zhxym8v5bhx0059bjbzx80000gp/T/ipykernel_3470/644307346.py
Function: get_all_temperatures at line 1

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
     1                                           def get_all_temperatures(stations, start_year, end_year):
     2         1       3000.0   3000.0      0.0      temperatures = collections.defaultdict(list)
     3         2       2000.0   1000.0      0.0      for station in stations:
     4         2       4000.0   2000.0      0.0          for year in range(start_year, end_year + 1):
     5     33648  116934000.0   3475.2     95.3              for temperature in get_file_temperatures(TEMPLATE_FILE.format(station=station, year=year)):
     6     33648    5725000.0    170.1      4.7                  temperatures[station].append(temperature)
     7         1          0.0      0.0      0.0      return temperatures
```

In [12]:
%lprun -f get_min_temperatures min_temperatures = get_min_temperatures(all_temperatures)

```
Timer unit: 1e-09 s

Total time: 0.000721 s
File: /var/folders/tt/qt9zhxym8v5bhx0059bjbzx80000gp/T/ipykernel_3470/2586325356.py
Function: get_min_temperatures at line 1

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
     1                                           def get_min_temperatures(all_temperatures):
     2         1     721000.0 721000.0    100.0      return {station: min(temperatures) for station, temperatures in all_temperatures.items()}
```

In [13]:
min_temperatures

{'01044099999': -10.0, '02293099999': -27.6}