## Final Goal
- Obtain information about the start of repairs for tunnels with risk levels 3 and 4 by scraping the official site.

## Actions
- Scrape tunnel data from a website, focusing on risk levels 3 and 4.
- Use the tunnel name as the key for data organization.
- Use Selenium, because the site loads new rows and discards the previously displayed ones on each scroll.
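The reason for the scroll-and-collect approach can be illustrated without a browser. The following is a minimal sketch, assuming the list behaves like a sliding window that exposes only a slice of rows at any scroll position and that the scroll step does not exceed the window size. `visible_rows`, `collect_all_rows`, and `index_by_name` are hypothetical helpers, the sample rows are invented placeholders, and the tunnel name is assumed to sit at tuple index 2.

```python
from collections import defaultdict


def visible_rows(all_rows, scroll_pos, window):
    """Simulate a virtualized list: only rows inside the window exist."""
    return all_rows[scroll_pos:scroll_pos + window]


def collect_all_rows(all_rows, window=3, step=2):
    """Scroll in steps no larger than the window and keep each row once."""
    seen = set()
    ordered = []
    scroll_pos = 0
    while True:
        for row in visible_rows(all_rows, scroll_pos, window):
            if row not in seen:  # overlapping windows repeat rows
                seen.add(row)
                ordered.append(row)
        if scroll_pos + window >= len(all_rows):
            break  # the last row has been displayed
        scroll_pos += step
    return ordered


def index_by_name(rows, name_index=2):
    """Group rows by tunnel name; a list is kept because names may repeat."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[row[name_index]].append(row)
    return dict(grouped)


# Placeholder rows (hypothetical): (position, kind, tunnel name, rating)
sample = [
    ("", "tunnel", "Tunnel A", "III"),
    ("", "tunnel", "Tunnel B", "III"),
    ("", "tunnel", "Tunnel C", "IV"),
    ("", "tunnel", "Tunnel A", "IV"),  # same name, different record
    ("", "tunnel", "Tunnel D", "III"),
]

rows = collect_all_rows(sample)
by_name = index_by_name(rows)
print(len(rows), sorted(by_name))
```

Because tunnel names are not guaranteed to be unique, the sketch maps each name to a list of rows rather than to a single row.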

In [9]:
!pip install selenium



In [10]:
import time
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support.ui import Select
from selenium.webdriver.common.keys import Keys
from selenium.common.exceptions import StaleElementReferenceException
import pandas as pd

In [11]:
driver = webdriver.Chrome() 
url = "https://road-structures-map.mlit.go.jp/Index.aspx" 
driver.get(url)
time.sleep(2) 

In [12]:
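from selenium.webdriver.support import expected_conditions as EC

# Tick the "CbOk" checkbox (presumably the terms-of-use agreement) if needed.
checkbox = driver.find_element(By.ID, "CbOk")

if not checkbox.is_selected():
    checkbox.click()

# Wait until the start button is clickable instead of clicking immediately.
wait = WebDriverWait(driver, 10)

start_btn = wait.until(EC.element_to_be_clickable((By.ID, "BTN_LOGIN")))
start_btn.click()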

In [13]:
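# Uncheck the soundness filters that are not needed; presumably this
# leaves only levels 3 and 4 selected.
ids = [
    "CbOtherSoundness2",
    "CbOtherSoundness1",
    "CbOtherSoundness8",
    "CbOtherSoundness9"
]

for cid in ids:
    checkbox = driver.find_element(By.ID, cid)
    if checkbox.is_selected():
        checkbox.click()

# Restrict the facility type; "TU0" is presumably the tunnel option.
select = Select(driver.find_element(By.ID, "DdlFacilityKind"))
select.select_by_value("TU0")

# Open the list view of the filtered results.
list_btn = driver.find_element(By.ID, "Btn_List")
list_btn.click()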

In [None]:
select_element = driver.find_element(By.ID, "DdlPasing")
select = Select(select_element)
time.sleep(1)

result_df = pd.DataFrame()

for i in range(0,19):
    page_num = i+1
    print(f"Processing: page {page_num}")
    select.select_by_value(str(i))
    time.sleep(5)  
    page_rows = set() 

    content_area = driver.find_element(By.ID, "contentArea")
    listpanel_content = driver.find_element(By.ID, "listpanel_content")

    all_rows = set()
    last_num_rows = -1

    scroll_height = driver.execute_script("return arguments[0].scrollHeight", listpanel_content)
    view_height = driver.execute_script("return arguments[0].clientHeight", listpanel_content)

    scroll_pos = 0
    step = int(view_height*1.7)
    unchanged_limit = 8
    unchanged = 0

    while True:
        try:
            trs = content_area.find_elements(By.TAG_NAME, "tr")
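        except StaleElementReferenceException:
            # Re-locate the container; retrying a stale handle would loop forever.
            content_area = driver.find_element(By.ID, "contentArea")
            continue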

        for tr in trs:
            try:
                tds = tr.find_elements(By.TAG_NAME, "td")
                if tds:
                    row_data = tuple(td.text.strip() for td in tds)
                    if row_data not in all_rows:
                        all_rows.add(row_data)
                        page_rows.add(row_data)
            except StaleElementReferenceException:
                continue

        prev_scroll_pos = scroll_pos
        scroll_pos = min(scroll_pos + step, scroll_height)
        driver.execute_script("arguments[0].scrollTop = arguments[1]", listpanel_content, scroll_pos)
        time.sleep(0.05)

        if len(page_rows) == last_num_rows:
            unchanged += 1
            if unchanged >= unchanged_limit:
                break
        else:
            unchanged = 0
            last_num_rows = len(page_rows)
        if scroll_pos == prev_scroll_pos:
            break

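    if page_rows:
        temp_df = pd.DataFrame(list(page_rows))
        result_df = pd.concat([result_df, temp_df], ignore_index=True)
        # Write the per-page file only when this page produced rows; otherwise
        # temp_df would be undefined or still hold the previous page's data.
        temp_df.to_csv(f"result_page{page_num}.csv", index=False, encoding="utf-8")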

result_df.to_csv("result.csv", index=False, encoding="utf-8")
print(result_df.head())
driver.quit()

In [None]:
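import os

# Re-read the per-page CSVs and print each page's row count as a sanity check.
concat_df = pd.DataFrame()
for page_idx in range(0, 19):
    p_num = page_idx + 1
    path = f'result_page{p_num}.csv'
    if not os.path.exists(path):
        continue  # skip pages whose CSV was not written
    df = pd.read_csv(path)
    print(p_num, len(df))
    concat_df = pd.concat([concat_df, df], ignore_index=True)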

In [None]:
columns_order = [
    "位置確認",
    "種類",
    "トンネル名",
    "ﾌﾘｶﾞﾅ",
    "路線名",
    "区分",
    "管理者名",
    "管理事務所名（地公体は任意）",
    "都道府県名",
    "市区町村名",
    "緯度",
    "経度",
    "建設年度（西暦 4桁）",
    "供用年度（西暦 4桁）",
    "延長（ｍ）",
    "幅員（ｍ）",
    "点検実施年度",
    "判定区分",
    "措置状況"
]


concat_df.columns= columns_order
concat_df.to_csv("result.csv", index=False, encoding="utf-8")


In [10]:

import pandas as pd

input_file = 'result_all.csv'
try:
    df = pd.read_csv(input_file, encoding='utf-8')
except UnicodeDecodeError:
    # Fall back to Shift JIS only when the file is not valid UTF-8.
    df = pd.read_csv(input_file, encoding='shift-jis')

print(f"Loaded file '{input_file}'")
print("Shape of loaded data:", df.shape)
print("Column names:", df.columns.tolist())

print("\nData preview:")
display(df.head())

Loaded file 'result_all2.csv'
Shape of loaded data: (3312, 19)
Column names: ['0', '種類', 'トンネル名', '(ﾌﾘｶﾞﾅ)', '路線名', '区分', '管理者名', '管理事務所名\n（地公体は任意）', '都道府県名', '市区町村名', '緯度', '経度', '建設年度\n（西暦 4桁）', '供用年度\n（西暦 4桁）', '延長（ｍ）', '幅員（ｍ）', '年度', '判定区分', '措置状況']

Data preview:


|  | 0 | 種類 | トンネル名 | (ﾌﾘｶﾞﾅ) | 路線名 | 区分 | 管理者名 | 管理事務所名\n（地公体は任意） | 都道府県名 | 市区町村名 | 緯度 | 経度 | 建設年度\n（西暦 4桁） | 供用年度\n（西暦 4桁） | 延長（ｍ） | 幅員（ｍ） | 年度 | 判定区分 | 措置状況 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 |  | トンネル | 京上トンネル | (ｷｮｳｼﾞｮｳﾄﾝﾈﾙ) | 国道439号 | 都道府県 | 徳島県 | 西部総合県民局(三好) | 徳島県 | 三好市 | 33.86192度 | 133.90099度 | 2001.0 | 2003.0 | 1023.0 | 8.5 | 2023年度 | Ⅲ | 措置未着手 |
| 1 |  | トンネル | 石塚トンネル | (ｲｼﾂﾞｶﾄﾝﾈﾙ) | 国道45号 | 国 | 東北地方整備局 | 南三陸沿岸国道事務所 | 岩手県 | 釜石市 | 39.22133度 | 141.87493度 | 1969.0 |  | 1351.0 | 8.6 | 2023年度 | Ⅲ | 措置着手済み |
| 2 |  | トンネル | ずみの窪トンネル | (ｽﾞﾐﾉｸﾎﾞﾄﾝﾈﾙ) | 国道158号 | 都道府県 | 長野県 | 松本建設事務所 | 長野県 | 松本市 | 36.13583度 | 137.72639度 | 1966.0 |  | 427.6 | 6.0 | 2019年度 | Ⅲ | 措置未着手 |
| 3 |  | トンネル | 滝川トンネル（１号） | (ﾀｷｶﾞﾜﾄﾝﾈﾙ) | 県道小野･富岡線 | 都道府県 | 福島県 | 富岡土木事務所 | 福島県 | 富岡町 | 37.35398度 | 140.9133度 | 2001.0 |  | 702.0 | 0.0 | 2022年度 | Ⅲ | 措置完了済み |
| 4 |  | トンネル | 大黒岩隧道 | (ｵｵｸﾞﾛｲﾜｽﾞｲﾄﾞｳ) | 村道羽沢桧沢線 | 市区町村 | 南牧村 | 振興整備課 | 群馬県 | 南牧村 | 36.14167度 | 138.67083度 | 1984.0 | 1984.0 | 50.0 | 6.6 | 2022年度 | Ⅲ | 措置未着手 |
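
The preview shows that coordinates carry a trailing 度 and that 判定区分 is written with Roman numerals, so some cleaning is needed before the repair status can be summarized. The following is a minimal sketch in plain Python, assuming the ratings of interest are Ⅲ and Ⅳ. `parse_degrees` and `summarize_status` are hypothetical helpers, and the three records copy values from the preview rows above.

```python
from collections import Counter


def parse_degrees(text):
    """Convert a string such as '33.86192度' to a float."""
    return float(text.rstrip("度"))


def summarize_status(records, ratings=("Ⅲ", "Ⅳ")):
    """Count 措置状況 values among records whose 判定区分 is in ratings."""
    return Counter(r["措置状況"] for r in records if r["判定区分"] in ratings)


# Values copied from the first three preview rows.
records = [
    {"トンネル名": "京上トンネル", "緯度": "33.86192度", "経度": "133.90099度",
     "判定区分": "Ⅲ", "措置状況": "措置未着手"},
    {"トンネル名": "石塚トンネル", "緯度": "39.22133度", "経度": "141.87493度",
     "判定区分": "Ⅲ", "措置状況": "措置着手済み"},
    {"トンネル名": "ずみの窪トンネル", "緯度": "36.13583度", "経度": "137.72639度",
     "判定区分": "Ⅲ", "措置状況": "措置未着手"},
]

# Map each tunnel name to numeric (latitude, longitude).
coords = {
    r["トンネル名"]: (parse_degrees(r["緯度"]), parse_degrees(r["経度"]))
    for r in records
}
status_counts = summarize_status(records)
print(coords["京上トンネル"], dict(status_counts))
```

The status labels seen in the preview are 措置未着手 (not started), 措置着手済み (started), and 措置完了済み (completed), so the counts give a rough view of how many rated tunnels have repairs under way.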
