The languages.csv file in the database includes the latitude and longitude of each city. We can use the distance module from the geopy library to calculate the distance between each pair of cities. Its great_circle function computes the great-circle distance, also known as spherical distance, between two geographic coordinates: the shortest distance between the two points measured along the surface of a sphere.
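
For intuition, the great-circle distance can be sketched with the haversine formula using only the standard library. This is an illustration, not geopy's actual implementation; the function name `haversine_km` and the mean Earth radius of 6371.009 km are assumptions made for this sketch.

```python
import math

# Assumed mean Earth radius in kilometres (an assumption for this sketch).
EARTH_RADIUS_KM = 6371.009


def haversine_km(coord_1, coord_2):
    """Great-circle distance in km between two (latitude, longitude) pairs in degrees."""
    lat1, lon1 = map(math.radians, coord_1)
    lat2, lon2 = map(math.radians, coord_2)
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    # Haversine of the central angle between the two points
    a = math.sin(dlat / 2) ** 2 + math.cos(lat1) * math.cos(lat2) * math.sin(dlon / 2) ** 2
    # Clamp to 1.0 to guard against floating-point overshoot before asin
    return 2 * EARTH_RADIUS_KM * math.asin(min(1.0, math.sqrt(a)))


# Coordinates taken from languages.csv
beijing = (39.938547, 116.117277)
changsha = (28.85032, 112.943344)
print(round(haversine_km(beijing, changsha), 1))
```

The printed value should agree closely with the Beijing to Changsha entry of the distance matrix computed later in this notebook (about 1266.6 km).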

In [1]:
%pip install geopy



In [2]:
import pandas as pd
from google.colab import drive
from geopy.distance import great_circle

In [3]:
# Mount Google Drive
drive.mount('/content/gdrive')
data = pd.read_csv("/content/gdrive/My Drive/Data Science/languages.csv")
# Check the data
data

Mounted at /content/gdrive


Unnamed: 0,ID,Name,Glottocode,Glottolog_Name,Macroarea,Latitude,Longitude,Family,ChineseName,SubGroup,Source_ID,DialectGroup,Pinyin,AltName
0,Beijing,Beijing,beij1234,Beijing Mandarin,Eurasia,39.938547,116.117277,Sino-Tibetan,北京,Sinitic,1,Mandarin,Běijīng,Peking
1,Changsha,Changsha,chan1326,Changsha,Eurasia,28.85032,112.943344,Sino-Tibetan,长沙,Sinitic,12,Xiang,Chángshā,
2,Chengdu,Chengdu,chen1267,Chengdu Mandarin,Eurasia,30.605768,103.970947,Sino-Tibetan,成都,Sinitic,7,Mandarin,Chéngdū,
3,Fuzhou,Fuzhou,fuzh1239,Houguan,Eurasia,26.08904,119.294243,Sino-Tibetan,福州,Sinitic,18,Min,Fúzhōu,Foochow
4,Guangzhou,Guangzhou,guan1279,Guangzhou,Eurasia,23.12535,112.947655,Sino-Tibetan,广州,Sinitic,17,Yue,Guǎngzhōu,
5,Guilin,Guilin,guil1241,Guilin Pinghua,Eurasia,25.266667,110.283333,Sino-Tibetan,桂林,Sinitic,16,Pinghua,Guīlín,
6,Haerbin,Ha_erbin,haer1234,Ha'erbin Mandarin,Eurasia,45.846595,126.551056,Sino-Tibetan,哈尔滨,Sinitic,2,Mandarin,Hāěrbīn,"Harbin, Ha'erbin, Ha’erbin"
7,Jinan,Jinan,jina1245,Jinan Mandarin,Eurasia,36.690777,116.997299,Sino-Tibetan,济南,Sinitic,3,Mandarin,Jǐnán,
8,Jixi,Jixi,jixi1238,Jixi,Eurasia,30.071111,118.592222,Sino-Tibetan,绩溪,Sinitic,9,Hui,Jìxī,
9,Loudi,Loudi,loud1234,Loudi,Eurasia,27.733333,111.928188,Sino-Tibetan,娄底,Sinitic,13,Xiang,Lóudî,


In [8]:
# Extract the city, latitude, and longitude columns from the DataFrame
city_la_lo = data[['Name', 'Latitude', 'Longitude']]
city_la_lo

Unnamed: 0,Name,Latitude,Longitude
0,Beijing,39.938547,116.117277
1,Changsha,28.85032,112.943344
2,Chengdu,30.605768,103.970947
3,Fuzhou,26.08904,119.294243
4,Guangzhou,23.12535,112.947655
5,Guilin,25.266667,110.283333
6,Ha_erbin,45.846595,126.551056
7,Jinan,36.690777,116.997299
8,Jixi,30.071111,118.592222
9,Loudi,27.733333,111.928188


In [9]:
# Initialize an empty float DataFrame (cities x cities) to store the result
distance = pd.DataFrame(index=city_la_lo['Name'], columns=city_la_lo['Name'], dtype=float)
# Calculate the great-circle distance in kilometers between each pair of cities
for _, city1 in city_la_lo.iterrows():
    coords_1 = (city1['Latitude'], city1['Longitude'])
    for _, city2 in city_la_lo.iterrows():
        coords_2 = (city2['Latitude'], city2['Longitude'])
        dis = great_circle(coords_1, coords_2).kilometers
        distance.at[city1['Name'], city2['Name']] = dis
distance

Name,Beijing,Changsha,Chengdu,Fuzhou,Guangzhou,Guilin,Ha_erbin,Jinan,Jixi,Loudi,Meixian,Nanchang,Nanjing,Rongcheng,Suzhou,Taiyuan,Wenzhou,Xi_an,Xiamen
Beijing,0.0,1266.628201,1511.474249,1567.916447,1893.158891,1719.454965,1072.768366,369.201189,1119.986882,1410.755967,1733.244378,1249.162364,896.672829,631.736049,1031.156059,395.655855,1392.040046,897.285126,1727.191646
Changsha,1266.628201,0.0,887.818631,697.595988,636.588631,477.62868,2233.11245,950.377743,563.429342,159.076906,579.59291,284.455768,639.95292,1275.800796,765.053766,1003.441433,764.264969,706.977491,704.128618
Chengdu,1511.474249,887.818631,0.0,1579.77384,1217.542982,858.137122,2584.002326,1380.802271,1403.443375,835.73398,1364.764947,1167.24267,1384.544438,1848.529893,1563.643398,1117.928778,1646.168509,617.57906,1547.147358
Fuzhou,1567.916447,697.595988,1579.77384,0.0,721.148202,907.449977,2289.518411,1198.732722,448.107106,752.765729,393.093243,446.977568,673.191595,1263.032509,591.80972,1460.57379,254.602512,1342.433667,214.766896
Guangzhou,1893.158891,636.588631,1217.542982,721.148202,0.0,360.14346,2808.142047,1557.632292,954.329133,522.503493,329.121127,685.367712,1135.44467,1802.443498,1169.551508,1639.600624,947.925059,1294.047789,544.704783
Guilin,1719.454965,477.62868,858.137122,907.449977,360.14346,0.0,2708.864797,1421.410542,976.715781,319.397531,574.941767,671.685437,1100.607008,1750.605364,1194.208954,1414.867477,1078.88761,1004.589844,791.967187
Ha_erbin,1072.768366,2233.11245,2584.002326,2289.518411,2808.142047,2708.864797,0.0,1291.863006,1885.448193,2389.739848,2573.811483,2123.273618,1679.004461,1027.665222,1700.236041,1466.455982,2049.396129,1969.856071,2492.238383
Jinan,369.201189,950.377743,1380.802271,1198.732722,1557.632292,1421.410542,1291.863006,0.0,750.786002,1103.829936,1376.037381,894.16094,527.852409,485.792821,672.099689,426.882838,1026.270598,777.379717,1360.188166
Jixi,1119.986882,563.429342,1403.443375,448.107106,954.329133,976.715781,1885.448193,750.786002,0.0,698.697659,689.675819,305.308908,225.851159,861.645236,219.618525,1036.954055,307.866921,1017.132256,622.110633
Loudi,1410.755967,159.076906,835.73398,752.765729,522.503493,319.397531,2389.739848,1103.829936,698.697659,0.0,545.838861,399.881799,794.463225,1434.747453,908.400729,1127.259691,862.713859,775.510001,712.657043
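
A quick sanity check on a distance matrix is that it has zeros on the diagonal, is symmetric, and satisfies the triangle inequality. The sketch below checks these properties on three entries copied from the output above; the nested-dict layout and the helper name `check_distance_matrix` are assumptions for illustration, not part of the notebook.

```python
from itertools import permutations

# Distances in km copied from the matrix output above
d = {
    "Beijing": {"Beijing": 0.0, "Changsha": 1266.628201, "Chengdu": 1511.474249},
    "Changsha": {"Beijing": 1266.628201, "Changsha": 0.0, "Chengdu": 887.818631},
    "Chengdu": {"Beijing": 1511.474249, "Changsha": 887.818631, "Chengdu": 0.0},
}


def check_distance_matrix(matrix, tol=1e-6):
    """Return True if the matrix has a zero diagonal, is symmetric,
    and satisfies the triangle inequality (all within tolerance tol)."""
    cities = list(matrix)
    zero_diagonal = all(abs(matrix[c][c]) <= tol for c in cities)
    symmetric = all(
        abs(matrix[a][b] - matrix[b][a]) <= tol for a in cities for b in cities
    )
    triangle = all(
        matrix[a][c] <= matrix[a][b] + matrix[b][c] + tol
        for a, b, c in permutations(cities, 3)
    )
    return zero_diagonal and symmetric and triangle


print(check_distance_matrix(d))  # prints True
```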


The next step is to combine the distance matrix with the pairwise differences between dialects in the value, form, and segments of words that share the same meaning. This will make it possible to test the hypothesis that the closer two cities are to each other, the more similar their dialects tend to be. (TBC)
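
One simple way to begin such a comparison is to correlate the geographic distances with a matching set of pairwise dialect differences. The sketch below is only an illustration under stated assumptions: the geographic distances are copied from the matrix above, but the `dialect_diff` values are made-up placeholders (no dialect differences have been computed yet), and `pearson` is a hypothetical helper name. Because the entries of a distance matrix are not independent of each other, a permutation-based procedure such as the Mantel test would be a more appropriate way to assess significance.

```python
import math

# Geographic distances in km for six city pairs, copied from the matrix above
geo_km = {
    ("Beijing", "Changsha"): 1266.628201,
    ("Beijing", "Chengdu"): 1511.474249,
    ("Beijing", "Fuzhou"): 1567.916447,
    ("Changsha", "Chengdu"): 887.818631,
    ("Changsha", "Fuzhou"): 697.595988,
    ("Chengdu", "Fuzhou"): 1579.77384,
}

# PLACEHOLDER values, NOT real measurements: replace them with the actual
# pairwise dialect differences once they have been computed.
dialect_diff = {
    ("Beijing", "Changsha"): 0.5,
    ("Beijing", "Chengdu"): 0.3,
    ("Beijing", "Fuzhou"): 0.7,
    ("Changsha", "Chengdu"): 0.4,
    ("Changsha", "Fuzhou"): 0.6,
    ("Chengdu", "Fuzhou"): 0.6,
}


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Align both measures on the same ordered list of city pairs
pairs = sorted(geo_km)
r = pearson([geo_km[p] for p in pairs], [dialect_diff[p] for p in pairs])
print(f"Pearson r = {r:.3f}")
```

A positive correlation on real data would be consistent with the hypothesis; with the placeholder values here, the printed number carries no meaning.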