### Current TODOs:

- The station data does not include the Shinkansen lines (`ekidata_id` 1002~1011); these need to be added manually.
- Stations lack romaji (English) names. A minority carry one, but only embedded in another field (e.g. "Iwafune" in `"code": "JR-East.Ryomo.Iwafune"`).
- Station names are inconsistent (e.g. "梅田" vs "大阪").
- In stations, `"ekidata_id"` encodes each station's position on its line, e.g. `"2500106"` is station `06` of line `25001`. Not every `ekidata_id` follows this pattern, so there are some inconsistencies.
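
The `ekidata_id` convention from the last item can be illustrated with a small helper. This is only a sketch: `split_ekidata_id` is a hypothetical name, and it assumes the last two digits are always the station order, which the notebook itself shows is not true for every entry.

```python
def split_ekidata_id(ekidata_id: str) -> tuple[str, int]:
    """Split a station-level ekidata_id into (line_id, station_order).

    Assumption: the last two digits are the station order and everything
    before them is the line ID.
    """
    if len(ekidata_id) < 3 or not ekidata_id.isdigit():
        raise ValueError(f"unexpected ekidata_id: {ekidata_id!r}")
    return ekidata_id[:-2], int(ekidata_id[-2:])


print(split_ekidata_id("2500106"))  # ('25001', 6)
```

The line ID stays a string because the JSON files store IDs as strings.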


### Problems:

- Station ordering
- Bridging airport stations to the actual railway stations
- Granularity: prefecture vs. city vs. specific station
- How many stations does a train actually stop at? Express vs. regular

In [1]:
import os
import sys
import json
import pprint

In [2]:
# Open with an explicit encoding and close the file once it has been read
with open("open-data-jp-railway-lines-master/lines.json", encoding="utf-8") as lineF:
    lineDataJson = json.load(lineF)

In [3]:
print("Example schema: ")
pprint.pprint(lineDataJson[0], indent=4)
print("Total number of lines: ")
print(len(lineDataJson))

Example schema: 
{   'alternative_names': ['トウカイドウシンカンセン'],
    'code': '',
    'ekidata_id': '1002',
    'logo': '',
    'name_kana': 'とうかいどうしんかんせん',
    'name_kanji': '東海道新幹線',
    'name_romaji': 'Tokaido Shinkansen',
    'prefectures': []}
Total number of lines: 
613


In [4]:
# Open with an explicit encoding and close the file once it has been read
with open("open-data-jp-prefectures-master/prefectures.json", encoding="utf-8") as prefF:
    prefDataJson = json.load(prefF)

In [5]:
print("Example schema: ")
pprint.pprint(prefDataJson[0], indent=4)
print("Total number of prefectures: ")
print(len(prefDataJson))

Example schema: 
{   'iso': '01',
    'prefecture_kana': 'ほっかいどう',
    'prefecture_kanji': '北海道',
    'prefecture_romaji': 'Hokkaido'}
Total number of prefectures: 
47


In [6]:
# Open with an explicit encoding and close the file once it has been read
with open("open-data-jp-railway-stations-master/stations.json", encoding="utf-8") as stationsF:
    stationsDataJson = json.load(stationsF)

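Normalize each station group's name: use the most common kanji name among its stations as the primary `name_kanji`, and add the remaining names to `alternative_names`.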

In [7]:
def check_alternative_names(data):
    for station_data in data:
        # Collect all station Kanji names
        kanji_names = [station.get("name_kanji") for station in station_data.get("stations", [])]
        
        # Determine the normalized name as the most common Kanji name
        if kanji_names:
            primary_name_kanji = max(set(kanji_names), key=kanji_names.count)
        else:
            primary_name_kanji = station_data["name_kanji"]
        
        # Set the primary name as the main name in the top-level "name_kanji"
        station_data["name_kanji"] = primary_name_kanji

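        # Add the other names as alternatives, skipping the primary name and any
        # name already listed (so re-running the cell does not add duplicates)
        existing_names = set(station_data["alternative_names"])
        unique_names = set(kanji_names) - {primary_name_kanji} - existing_names
        station_data["alternative_names"].extend(unique_names)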


check_alternative_names(stationsDataJson)
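
One caveat with `max(set(names), key=names.count)` is that, when two names are equally common, the winner depends on set iteration order. A possible alternative, sketched here with the hypothetical helper `pick_primary_name`, uses `collections.Counter`, where ties go to the name seen first:

```python
from collections import Counter


def pick_primary_name(names: list[str]) -> str:
    """Return the most frequent name; ties go to the name that appears first.

    Expects a non-empty list.
    """
    counts = Counter(names)
    # Counter keeps insertion order, and most_common() returns elements with
    # equal counts in the order they were first encountered.
    return counts.most_common(1)[0][0]


print(pick_primary_name(["大阪", "梅田", "梅田", "大阪", "梅田"]))  # 梅田
print(pick_primary_name(["東梅田", "西梅田"]))  # 東梅田
```

Whether "first seen" is the right tie-breaker for this dataset is a judgment call; the point is only that the result becomes reproducible between runs.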

An example where the primary name chosen for a group is not the name one would expect: the group that contains 大阪 station ends up with 梅田 as its primary name.

In [8]:
target_name_kanji = "梅田"
    
for station_data in stationsDataJson:
    # Check the main station's kanji name
    if station_data.get("name_kanji") == target_name_kanji:
        print(json.dumps(station_data, ensure_ascii=False, indent=4))

print("Total number of stations: ", len(stationsDataJson))

{
    "name_kanji": "梅田",
    "name_kana": "",
    "name_romaji": "",
    "alternative_names": [
        "大阪",
        "北新地",
        "東梅田",
        "西梅田"
    ],
    "group_code": "1160214",
    "ekidata_line_ids": [
        "11602",
        "11603",
        "11623",
        "11625",
        "11629",
        "34001",
        "34002",
        "34003",
        "35001",
        "99618",
        "99619",
        "99620"
    ],
    "line_codes": [],
    "stations": [
        {
            "code": "",
            "ekidata_id": "1160214",
            "ekidata_group_id": "1160214",
            "name_kanji": "大阪",
            "alternative_names": [],
            "ekidata_line_id": "11602",
            "line_code": "",
            "short_code": "",
            "prefecture": "27",
            "lat": 34.702398,
            "lon": 135.495188
        },
        {
            "code": "",
            "ekidata_id": "1160301",
            "ekidata_group_id": "1160214",
            "name_kanji": "大阪",
  

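Check for stations whose `ekidata_id` is inconsistent with their `ekidata_line_id`, and rewrite such IDs to the expected form (line ID followed by the two-digit station order).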

In [9]:
def correct_ekidata_ids(stations_data_json):
    for station_data in stations_data_json:
        for station in station_data.get("stations", []):
            ekidata_id = station.get("ekidata_id")
            ekidata_line_id = station.get("ekidata_line_id")

            # Only stations with a line ID can be checked
            if ekidata_line_id:
                # Extract the station order (last two digits of `ekidata_id`)
                station_order = ekidata_id[-2:]

                # Verify that the order is numeric
                if not station_order.isdigit():
                    print(f"Inconsistent station order in ekidata_id: {ekidata_id} for station {station['name_kanji']}")
                    continue
                
                # Expected `ekidata_id` based on the line ID and station order
                expected_ekidata_id = f"{ekidata_line_id}{station_order}"
                
                # Check if the existing ekidata_id matches the expected format
                if ekidata_id != expected_ekidata_id:
                    # Report and correct the inconsistency
                    print(f"Inconsistent `ekidata_id` for station {station['name_kanji']} (expected: {expected_ekidata_id}, found: {ekidata_id})")
                    station["ekidata_id"] = expected_ekidata_id

            else:
                # Report stations that have no `ekidata_line_id` at all
                print(f"For station {station['name_kanji']}, ekidata_line_id is not defined")

In [10]:
correct_ekidata_ids(stationsDataJson)

Inconsistent `ekidata_id` for station 函館 (expected: 9910822, found: 1111922)
Inconsistent `ekidata_id` for station 五稜郭 (expected: 9910821, found: 1111921)
Inconsistent `ekidata_id` for station 木古内 (expected: 9910810, found: 1111910)
Inconsistent `ekidata_id` for station 札苅 (expected: 9910811, found: 1111911)
Inconsistent `ekidata_id` for station 泉沢 (expected: 9910812, found: 1111912)
Inconsistent `ekidata_id` for station 釜谷 (expected: 9910813, found: 1111913)
Inconsistent `ekidata_id` for station 渡島当別 (expected: 9910814, found: 1111914)
Inconsistent `ekidata_id` for station 茂辺地 (expected: 9910815, found: 1111915)
Inconsistent `ekidata_id` for station 上磯 (expected: 9910816, found: 1111916)
Inconsistent `ekidata_id` for station 清川口 (expected: 9910817, found: 1111917)
Inconsistent `ekidata_id` for station 久根別 (expected: 9910818, found: 1111918)
Inconsistent `ekidata_id` for station 東久根別 (expected: 9910819, found: 1111919)
Inconsistent `ekidata_id` for station 七重浜 (expected: 9910820, found
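
Because the function above overwrites `ekidata_id` in place, the original values are lost once the cell has run. A read-only pass that only collects the mismatches can be useful for reviewing them first. The sketch below assumes the same nested structure as `stationsDataJson`; `find_id_mismatches` is a hypothetical helper, not part of the notebook.

```python
def find_id_mismatches(groups: list[dict]) -> list[tuple[str, str, str]]:
    """Return (name_kanji, ekidata_id, ekidata_line_id) for every station
    whose ekidata_id does not start with its ekidata_line_id."""
    mismatches = []
    for group in groups:
        for station in group.get("stations", []):
            station_id = station.get("ekidata_id") or ""
            line_id = station.get("ekidata_line_id") or ""
            # A missing line ID is reported as a mismatch as well
            if not line_id or not station_id.startswith(line_id):
                mismatches.append((station.get("name_kanji", ""), station_id, line_id))
    return mismatches


sample = [{"stations": [
    {"name_kanji": "大阪", "ekidata_id": "1160214", "ekidata_line_id": "11602"},
    {"name_kanji": "函館", "ekidata_id": "1111922", "ekidata_line_id": "99108"},
]}]
print(find_id_mismatches(sample))  # [('函館', '1111922', '99108')]
```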

Find the lines that do not appear in any station's `ekidata_line_ids`.

In [11]:
line_ids = [[line.get('ekidata_id'), line.get('name_kanji')] for line in lineDataJson]
missing_lines = []
for line, name in line_ids:
    found = False
    for station in stationsDataJson:
        if line in station['ekidata_line_ids']:
            found = True
            break
    if not found:
        missing_lines.append([line, name])
print(missing_lines)

[['1002', '東海道新幹線'], ['1003', '山陽新幹線'], ['1004', '東北新幹線'], ['1005', '上越新幹線'], ['1006', '上越新幹線(ガーラ湯沢支線)'], ['1007', '山形新幹線'], ['1008', '秋田新幹線'], ['1009', '北陸新幹線'], ['1010', '九州新幹線'], ['1011', '北海道新幹線']]
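
The same check can be written with a set, which avoids rescanning every station for every line. This is a sketch under the assumption that both files keep line IDs as strings; `find_unreferenced_lines` and the sample records are made up for illustration.

```python
def find_unreferenced_lines(lines: list[dict], groups: list[dict]) -> list[tuple[str, str]]:
    """Return (ekidata_id, name_kanji) for lines that no station group references."""
    # Collect every line ID that appears in any station group, once
    referenced = {
        line_id
        for group in groups
        for line_id in group.get("ekidata_line_ids", [])
    }
    return [
        (line["ekidata_id"], line["name_kanji"])
        for line in lines
        if line["ekidata_id"] not in referenced
    ]


# Illustrative sample data, not taken from the real files
sample_lines = [
    {"ekidata_id": "1002", "name_kanji": "東海道新幹線"},
    {"ekidata_id": "11602", "name_kanji": "(sample line)"},
]
sample_groups = [{"ekidata_line_ids": ["11602", "11629"]}]
print(find_unreferenced_lines(sample_lines, sample_groups))  # [('1002', '東海道新幹線')]
```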


In [12]:
# Missing lines: lineDataJson[0:10]
# Manually add stations for each missing line
stations_to_add = []
s1 = ["東京", "品川", "新横浜", "小田原", "熱海", "三島", "新富士", "静岡", "掛川", "浜松", "豊橋", "三河安城", "名古屋", "岐阜羽島", "米原", "京都", "新大阪"]
stations_to_add.append(s1)
s2 = ["新大阪", "新神戸", "西明石", "姫路", "相生", "岡山", "新倉敷", "福山", "新尾道", "三原", "東広島", "広島", "新岩国", "徳山", "新山口", "厚狭", "新下関", "小倉", "博多"]
stations_to_add.append(s2)
s3 = ["東京", "上野", "大宮", "小山", "宇都宮", "那須塩原", "新白河", "郡山", "福島", "白石蔵王", "仙台", "古川", "くりこま高原", "一ノ関", "水沢江刺", "北上", "新花巻", "盛岡", "いわて沼宮内", "二戸", "八戸", "七戸十和田", "新青森"]
stations_to_add.append(s3)
s4 = ["東京", "上野", "大宮", "熊谷", "本庄早稲田", "高崎", "上毛高原", "越後湯沢", "浦佐", "長岡", "燕三条", "新潟"]
stations_to_add.append(s4)
s5 = []
# s5 = ["越後湯沢", "ガーラ湯沢"]
stations_to_add.append(s5)
s6 = ["福島", "米沢", "高畠", "赤湯", "かみのやま温泉", "山形", "天童", "さくらんぼ東根", "村山", "大石田", "新庄"]
stations_to_add.append(s6)
s7 = ["東京", "上野", "大宮", "小山", "宇都宮", "那須塩原", "新白河", "郡山", "福島", "白石蔵王", "仙台", "古川", "くりこま高原", "一ノ関", "水沢江刺", "北上", "新花巻", "盛岡", "雫石", "田沢湖", "角館", "大曲", "秋田"]
stations_to_add.append(s7)
# Note: the last four entries of s8 (東小浜, 京都, 松井山手, 新大阪) lie on the planned extension
# beyond 敦賀; drop them if only stations currently in service are wanted
s8 = ["東京", "上野", "大宮", "熊谷", "本庄早稲田", "高崎", "安中榛名", "軽井沢", "佐久平", "上田", "長野", "飯山", "上越妙高", "糸魚川", "黒部宇奈月温泉", "富山", "新高岡", "金沢", "小松", "加賀温泉", "芦原温泉", "福井", "越前たけふ", "敦賀", "東小浜", "京都", "松井山手", "新大阪"]
stations_to_add.append(s8)
s9 = ["博多", "新鳥栖", "久留米", "筑後船小屋", "新大牟田", "新玉名", "熊本", "新八代", "新水俣", "出水", "川内", "鹿児島中央"]
stations_to_add.append(s9)
s10 = ["新青森", "奥津軽いまべつ", "木古内", "新函館北斗"]
stations_to_add.append(s10)

In [13]:
def add_shinkansen_data(stations: list[str], line: dict, stations_data_json: list[dict]):
    line_name = line["name_kanji"]
    line_id = line["ekidata_id"]

    # Nothing to do for lines without a station list (e.g. the empty s5)
    if not stations:
        return

    for i, st in enumerate(stations):
        found = False
        for station_data in stations_data_json:
            # Match on the primary name or any alternative name; only the first
            # matching entry is used, because the loop breaks after processing it
            if station_data.get("name_kanji") == st or st in station_data.get("alternative_names", []):
                found = True
                
                # Get details for the existing station entry
                cur_prefecture = station_data["prefecture"]
                cur_lat, cur_lon = station_data["stations"][0]["lat"], station_data["stations"][0]["lon"]

                # Add the line ID to the existing ekidata_line_ids if it's not already there
                if line_id not in station_data["ekidata_line_ids"]:
                    station_data["ekidata_line_ids"].append(line_id)
                
                # Prepare the new station data for insertion
                new_station = {
                    "code": "",
                    "ekidata_id": f"{line_id}{i + 1:02}",
                    "name_kanji": st,
                    "alternative_names": [],
                    "ekidata_line_id": line_id,
                    "line_code": line["name_romaji"],
                    "short_code": "",
                    "prefecture": cur_prefecture,
                    "lat": cur_lat,
                    "lon": cur_lon
                }
                
                # Append the new station entry to the station's stations list
                station_data["stations"].append(new_station)
                
                # Print in JSON format for viewing with original insertion order
                # print(json.dumps(new_station, ensure_ascii=False, indent=4))
                
                # Break after finding and processing the correct station
                break

        if not found:
            print(f"Station {st} is not found for line {line_name}")

In [14]:
for i in range(10):
    add_shinkansen_data(stations_to_add[i], lineDataJson[i], stationsDataJson)

Station 岐阜羽島 is not found for line 東海道新幹線
Station 新尾道 is not found for line 山陽新幹線
Station 東広島 is not found for line 山陽新幹線
Station 新岩国 is not found for line 山陽新幹線
Station 白石蔵王 is not found for line 東北新幹線
Station くりこま高原 is not found for line 東北新幹線
Station 水沢江刺 is not found for line 東北新幹線
Station 七戸十和田 is not found for line 東北新幹線
Station 本庄早稲田 is not found for line 上越新幹線
Station 上毛高原 is not found for line 上越新幹線
Station 白石蔵王 is not found for line 秋田新幹線
Station くりこま高原 is not found for line 秋田新幹線
Station 水沢江刺 is not found for line 秋田新幹線
Station 本庄早稲田 is not found for line 北陸新幹線
Station 安中榛名 is not found for line 北陸新幹線
Station 黒部宇奈月温泉 is not found for line 北陸新幹線
Station 越前たけふ is not found for line 北陸新幹線
Station 新大牟田 is not found for line 九州新幹線
Station 新玉名 is not found for line 九州新幹線
Station 奥津軽いまべつ is not found for line 北海道新幹線
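
The matching above takes the first group whose primary or alternative name equals the station name, so a name shared by several groups is resolved silently. One way to make both unmatched and ambiguous names visible is a name index built once up front. This is a sketch; `build_name_index` is a hypothetical helper and the sample groups are abbreviated.

```python
def build_name_index(groups: list[dict]) -> dict[str, list[dict]]:
    """Map each primary or alternative kanji name to the groups that use it."""
    index: dict[str, list[dict]] = {}
    for group in groups:
        names = {group.get("name_kanji", ""), *group.get("alternative_names", [])}
        for name in names:
            if name:
                index.setdefault(name, []).append(group)
    return index


sample_groups = [
    {"name_kanji": "梅田", "alternative_names": ["大阪", "東梅田"], "group_code": "1160214"},
    {"name_kanji": "新大阪", "alternative_names": [], "group_code": "1160213"},
]
index = build_name_index(sample_groups)
print([g["group_code"] for g in index.get("大阪", [])])  # ['1160214']
print(index.get("岐阜羽島", []))  # []
```

A lookup that returns more than one group signals an ambiguous name, and an empty result corresponds to the "is not found" messages above.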


In [15]:
target_name_kanji = "新大阪"
    
for station_data in stationsDataJson:
    # Check the main station's kanji name and its alternative names
    if station_data.get("name_kanji") == target_name_kanji or target_name_kanji in station_data.get("alternative_names", []):
        print(json.dumps(station_data, ensure_ascii=False, indent=4))

print("Total number of stations: ", len(stationsDataJson))

{
    "name_kanji": "新大阪",
    "name_kana": "",
    "name_romaji": "",
    "alternative_names": [],
    "group_code": "1160213",
    "ekidata_line_ids": [
        "11602",
        "11629",
        "99618",
        "1002",
        "1003",
        "1009"
    ],
    "line_codes": [],
    "stations": [
        {
            "code": "",
            "ekidata_id": "1160213",
            "ekidata_group_id": "1160213",
            "name_kanji": "新大阪",
            "alternative_names": [],
            "ekidata_line_id": "11602",
            "line_code": "",
            "short_code": "",
            "prefecture": "27",
            "lat": 34.734136,
            "lon": 135.501852
        },
        {
            "code": "",
            "ekidata_id": "1162901",
            "ekidata_group_id": "1160213",
            "name_kanji": "新大阪",
            "alternative_names": [],
            "ekidata_line_id": "11629",
            "line_code": "",
            "short_code": "",
            "prefecture": "27",

In [16]:
target_name_kanji = "函館"
    
for station_data in stationsDataJson:
    # Check the main station's kanji name and its alternative names
    if station_data.get("name_kanji") == target_name_kanji or target_name_kanji in station_data.get("alternative_names", []):
        print(json.dumps(station_data, ensure_ascii=False, indent=4))

print("Total number of stations: ", len(stationsDataJson))

{
    "name_kanji": "函館",
    "name_kana": "",
    "name_romaji": "",
    "alternative_names": [
        "函館駅前"
    ],
    "group_code": "1110101",
    "ekidata_line_ids": [
        "11101",
        "99108",
        "99105",
        "99106"
    ],
    "line_codes": [],
    "stations": [
        {
            "code": "",
            "ekidata_id": "1110101",
            "ekidata_group_id": "1110101",
            "name_kanji": "函館",
            "alternative_names": [],
            "ekidata_line_id": "11101",
            "line_code": "",
            "short_code": "",
            "prefecture": "01",
            "lat": 41.773709,
            "lon": 140.726413
        },
        {
            "code": "",
            "ekidata_id": "9910822",
            "ekidata_group_id": "1110101",
            "name_kanji": "函館",
            "alternative_names": [],
            "ekidata_line_id": "99108",
            "line_code": "",
            "short_code": "",
            "prefecture": "01",
            "

To shrink the dataset, first identify the lines to keep: the Shinkansen lines (IDs 1002~1011), every line in prefecture `47` (Okinawa), and lines with an ID below 20000 that either cross prefectures or run in prefecture `01` (Hokkaido).

In [38]:
def filter_lines(line_data_json):
    filtered_lines = {}
    
    for line in line_data_json:
        # Extract the line's ID and name
        line_id = int(line["ekidata_id"])
        line_name = line["name_kanji"]

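        # Keep Shinkansen lines (IDs 1002 to 1011), any line in prefecture 47,
        # and lines with ID < 20000 that cross prefectures or run in prefecture 01
        is_shinkansen = 1002 <= line_id <= 1011
        in_pref_47 = "47" in line["prefectures"]
        multi_pref_or_01 = len(line["prefectures"]) > 1 or "01" in line["prefectures"]
        if is_shinkansen or in_pref_47 or (line_id < 20000 and multi_pref_or_01):
            filtered_lines[line_id] = line_name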
    
    return filtered_lines


In [39]:
filtered_lines = filter_lines(lineDataJson)
print(filtered_lines)
print(len(filtered_lines))

{1002: '東海道新幹線', 1003: '山陽新幹線', 1004: '東北新幹線', 1005: '上越新幹線', 1006: '上越新幹線(ガーラ湯沢支線)', 1007: '山形新幹線', 1008: '秋田新幹線', 1009: '北陸新幹線', 1010: '九州新幹線', 1011: '北海道新幹線', 11101: 'JR函館本線(函館～長万部)', 11102: 'JR函館本線(長万部～小樽)', 11103: 'JR函館本線(小樽～旭川)', 11104: 'JR室蘭本線(長万部・室蘭～苫小牧)', 11105: 'JR室蘭本線(苫小牧～岩見沢)', 11106: 'JR根室本線(滝川～新得)', 11107: 'JR根室本線(新得～釧路)', 11108: '根室本線', 11109: '千歳線', 11110: '石勝線', 11111: '日高本線', 11112: '札沼線', 11113: '留萌本線', 11114: '富良野線', 11115: '宗谷本線', 11116: '石北本線', 11117: '釧網本線', 11118: '海峡線', 11202: 'JR奥羽本線(新庄～青森)', 11231: 'JR東北本線(黒磯～利府・盛岡)', 11204: '五能線', 11206: '八戸線', 11208: '大船渡線', 11210: '北上線', 11211: '田沢湖線', 11212: '花輪線', 11214: '羽越本線', 11216: '山形線', 11217: '仙山線', 11219: '米坂線', 11221: '陸羽東線', 11226: '磐越西線', 11227: '只見線', 11229: 'JR常磐線(取手～いわき)', 11230: 'JR常磐線(いわき～仙台)', 11301: 'JR東海道本線(東京～熱海)', 11303: '南武線', 11305: '武蔵野線', 11306: '横浜線', 11308: '横須賀線', 11311: 'JR中央本線(東京～塩尻)', 11313: 'JR中央・総武線', 11314: '総武本線', 11317: 'JR八高線(八王子～高麗川)', 11318: 'JR八高線(高麗川～高崎)', 11319: '宇都宮線', 11320: 'J

In [40]:
def extract_lines(filtered, line_data_json):
    extracted_lines = []
    visited = set()
    for line in line_data_json:
        line_id = int(line["ekidata_id"])

        # Check if the line_id is in the filtered dictionary
        if line_id in filtered and line_id not in visited:
            extracted_lines.append(line)
            visited.add(line_id)

    return extracted_lines

In [41]:
extracted_lines = extract_lines(filtered_lines, lineDataJson)
print(len(extracted_lines))

130


In [42]:
with open("processed_lines.json", "w", encoding='utf-8') as outfile: 
    json.dump(extracted_lines, outfile, ensure_ascii=False, indent=4)

With the lines prepared, the next step is to keep only the stations that are served by those lines.

In [47]:
def shrink_stations(station_data_json, filtered_lines):
    shrink_station_data = []
    for station in station_data_json:
        # Keep only the line IDs that are in filtered_lines
        filtered_line_ids = [line_id for line_id in station["ekidata_line_ids"] if int(line_id) in filtered_lines]

        if not filtered_line_ids:
            continue

        # Build a shallow copy with the filtered values instead of modifying the
        # entry in place, so station_data_json keeps the full, unfiltered data
        shrunk = dict(station)
        shrunk["ekidata_line_ids"] = filtered_line_ids

        # Keep only the inner station entries that belong to the remaining lines
        shrunk["stations"] = [
            st for st in station["stations"] if st["ekidata_line_id"] in filtered_line_ids
        ]

        shrink_station_data.append(shrunk)
    return shrink_station_data

In [48]:
shrinkDataJson = shrink_stations(stationsDataJson, filtered_lines)
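
Before writing the result to disk, a quick consistency check can confirm that the shrunk data only references the lines that were kept. The helper below is a sketch with a hypothetical name (`check_shrunk_groups`); it accepts any container of integer line IDs, so the `filtered_lines` dict would work as the second argument.

```python
from collections.abc import Container


def check_shrunk_groups(groups: list[dict], kept_line_ids: Container[int]) -> list[str]:
    """Return human-readable problems; an empty list means all checks passed."""
    problems = []
    for group in groups:
        name = group.get("name_kanji", "?")
        line_ids = group.get("ekidata_line_ids", [])
        if not line_ids:
            problems.append(f"{name}: no lines left")
        for line_id in line_ids:
            if int(line_id) not in kept_line_ids:
                problems.append(f"{name}: unexpected line {line_id}")
        for station in group.get("stations", []):
            if station.get("ekidata_line_id") not in line_ids:
                problems.append(f"{name}: station on dropped line {station.get('ekidata_line_id')}")
    return problems


sample = [{
    "name_kanji": "新大阪",
    "ekidata_line_ids": ["11602", "1002"],
    "stations": [{"ekidata_line_id": "11602"}, {"ekidata_line_id": "99618"}],
}]
print(check_shrunk_groups(sample, {11602, 1002}))  # ['新大阪: station on dropped line 99618']
```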

In [49]:
target_name_kanji = "新大阪"
    
for station_data in shrinkDataJson:
    # Check the main station's kanji name and its alternative names
    if station_data.get("name_kanji") == target_name_kanji or target_name_kanji in station_data.get("alternative_names", []):
        print(json.dumps(station_data, ensure_ascii=False, indent=4))

print("Total number of stations: ", len(shrinkDataJson))

{
    "name_kanji": "新大阪",
    "name_kana": "",
    "name_romaji": "",
    "alternative_names": [],
    "group_code": "1160213",
    "ekidata_line_ids": [
        "11602",
        "11629",
        "1002",
        "1003",
        "1009"
    ],
    "line_codes": [],
    "stations": [
        {
            "code": "",
            "ekidata_id": "1160213",
            "ekidata_group_id": "1160213",
            "name_kanji": "新大阪",
            "alternative_names": [],
            "ekidata_line_id": "11602",
            "line_code": "",
            "short_code": "",
            "prefecture": "27",
            "lat": 34.734136,
            "lon": 135.501852
        },
        {
            "code": "",
            "ekidata_id": "1162901",
            "ekidata_group_id": "1160213",
            "name_kanji": "新大阪",
            "alternative_names": [],
            "ekidata_line_id": "11629",
            "line_code": "",
            "short_code": "",
            "prefecture": "27",
            "lat

In [50]:
with open("processed_stations_shrinked.json", "w", encoding='utf-8') as outfile: 
    json.dump(shrinkDataJson, outfile, ensure_ascii=False, indent=4)

In [95]:
with open("processed_shinkansen_stations.json", "w", encoding='utf-8') as outfile: 
    json.dump(stationsDataJson, outfile, ensure_ascii=False, indent=4)

### Schema setup

#### Nodes:

- Station nodes
    - Attributes: station_id, name_kanji, name_romaji, alternative_names, latitude, longitude, prefecture_id
- Line nodes
    - Attributes: line_id, name_kanji, name_romaji
- Prefecture nodes
    - Attributes: prefecture_id, name

#### Relationships:

- BELONGS_TO: Between each Station and Prefecture
    - Attributes: None (or metadata if necessary)
- CONNECTED_BY: Between each pair of Station nodes if they are sequentially connected on a line
    - Attributes: line_id, distance (if available or calculable)
- SERVICED_BY: Between Station and Line nodes indicating which lines serve each station
    - Attributes: None (or optional metadata if required)
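
As a rough illustration of how the CONNECTED_BY relationships might be derived from the processed station data, the sketch below pairs neighbouring stations on each line and estimates `distance` with the haversine formula. It rests on two assumptions that are not guaranteed by the data: that sorting by `ekidata_id` gives the order along the line, and that straight-line distance is an acceptable stand-in for track length. The function names are hypothetical.

```python
import math
from collections import defaultdict


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres on a sphere of radius 6371 km."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * 6371.0 * math.asin(math.sqrt(a))


def build_connected_by(groups):
    """Return (from_id, to_id, line_id, distance_km) for neighbouring stations."""
    by_line = defaultdict(list)
    for group in groups:
        for st in group.get("stations", []):
            by_line[st["ekidata_line_id"]].append(st)
    edges = []
    for line_id, stations in by_line.items():
        # Assumption: numeric ekidata_id order is the order along the line
        ordered = sorted(stations, key=lambda st: int(st["ekidata_id"]))
        for a, b in zip(ordered, ordered[1:]):
            dist = haversine_km(a["lat"], a["lon"], b["lat"], b["lon"])
            edges.append((a["ekidata_id"], b["ekidata_id"], line_id, round(dist, 2)))
    return edges


sample = [
    {"stations": [{"ekidata_id": "1160214", "ekidata_line_id": "11602",
                   "lat": 34.702398, "lon": 135.495188}]},
    {"stations": [{"ekidata_id": "1160213", "ekidata_line_id": "11602",
                   "lat": 34.734136, "lon": 135.501852}]},
]
# One edge from 1160213 to 1160214 on line 11602, roughly 3.6 km apart
print(build_connected_by(sample))
```

Each tuple maps onto one CONNECTED_BY relationship with `line_id` and `distance` as attributes; stations whose IDs were found to be inconsistent would need their order checked separately.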