Class Name: Data Science for Linguistics. 

Project Name: Analysis of Correlations between Chinese Characters and Geographical Areas

Authors: Jiarong Tang, Weiting Wang

Description: Use Chinese word2vec embeddings and geospatial visualization to show the distribution of Chinese dialects.

In [44]:
import pandas as pd
import numpy as np
from gensim.models import KeyedVectors
from collections import Counter


In [2]:
data_df = pd.read_csv('forms.csv')
data_df

Unnamed: 0,ID,Local_ID,Language_ID,Parameter_ID,Value,Form,Segments,Comment,Source,Cognacy,Loan,Graphemes,Profile,Prosody,Morpheme_Glosses,Partial_Cognacy,Chinese_Characters
0,Beijing-91_vomit-1,,Beijing,91_vomit,tʰu⁵¹,tʰu⁵¹,tʰ u ⁵¹,,Liu2007,,,,,i n t,spit/吐,1,吐
1,Haerbin-91_vomit-1,,Haerbin,91_vomit,tʰu⁵³,tʰu⁵³,tʰ u ⁵³,,Liu2007,,,,,i n t,spit/吐,1,吐
2,Jinan-91_vomit-1,,Jinan,91_vomit,tʰu³¹,tʰu³¹,tʰ u ³¹,,Liu2007,,,,,i n t,spit/吐,1,吐
3,Rongcheng-91_vomit-1,,Rongcheng,91_vomit,ou²¹³⁻³⁵ tʰu²¹⁴,ou²¹³⁻³⁵ tʰu²¹⁴,ou ²¹³ + tʰ u ²¹⁴,copulative synonyme,Liu2007,,,,,n t + i n t,nausea/嘔 spit/吐,2 1,嘔 吐
4,Taiyuan-91_vomit-1,,Taiyuan,91_vomit,tʰu⁵³ lə⁰,tʰu⁵³ lə⁰,tʰ u ⁵³ + l ə ⁰,,Liu2007,,,,,i n t + i n t,nausea/嘔 _:PERFECTIVE/了,2 5,嘔 嘞
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4297,Guangzhou-90_woman-1,,Guangzhou,90_woman,nøy²³ iɐn²¹⁻²³,nøy²³ iɐn²¹⁻²³,n øy ²³ + j ɐ n ²¹,,Liu2007,,,,,i n t + i n c t,female/女 _person/人,39 38,女 人
4298,Fuzhou-90_woman-1,,Fuzhou,90_woman,i⁵⁵⁻⁵³ tsia³²,i⁵⁵⁻⁵³ tsia³²,i ⁵⁵ + ts j a ³²,,Liu2007,,,,,n t + i m n t,woman/伊 sister/姐,824 715,伊 姐
4299,Fuzhou-90_woman-2,,Fuzhou,90_woman,ny³²⁻⁵⁵ ɛ²¹²,ny³²⁻⁵⁵ ɛ²¹²,n y ³² + ɛ ²¹²,,Liu2007,,,,,i n t + n t,female/女 _world/界,39 825,女 界
4300,Fuzhou-90_woman-3,,Fuzhou,90_woman,tsy⁵⁵ nøyŋ⁵³⁻⁵⁵ nøyŋ⁵³,tsy⁵⁵ nøyŋ⁵³⁻⁵⁵ nøyŋ⁵³,ts y ⁵⁵ + n øy ŋ ⁵³ + n øy ŋ ⁵³,,Liu2007,,,,,i n t + i n c t + i n c t,woman/諸 female/娘 _person/人,749 31 38,諸 娘 儂


In [3]:
# list the column names of the DataFrame
data_df.columns

Index(['ID', 'Local_ID', 'Language_ID', 'Parameter_ID', 'Value', 'Form',
       'Segments', 'Comment', 'Source', 'Cognacy', 'Loan', 'Graphemes',
       'Profile', 'Prosody', 'Morpheme_Glosses', 'Partial_Cognacy',
       'Chinese_Characters'],
      dtype='object')

Step 1: Extract all Parameter_ID and Language_ID values to determine how many words and cities we are investigating.


In [4]:
# 19 cities, but a single city can have several variants of the same word
cities = data_df['Language_ID'].unique()
cities

array(['Beijing', 'Haerbin', 'Jinan', 'Rongcheng', 'Taiyuan', 'XiAn',
       'Chengdu', 'Nanjing', 'Jixi', 'Suzhou', 'Wenzhou', 'Changsha',
       'Loudi', 'Nanchang', 'Meixian', 'Guilin', 'Guangzhou', 'Fuzhou',
       'Xiamen'], dtype=object)

In [5]:
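# e.g. compare the similarity of one or two words among cities along a river
# or the similarity of one or two words among cities along a mountain range
# 7 cities
# 'Suzhou',   # east; 'Guangzhou', # south; 'Chengdu',   # west; 'Beijing',   # north; 'Haerbin',   # northeast (supplementary choice)
# 'Changsha',  # central (supplementary choice); 'XiAn'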

In [6]:
# 203 words 
words = data_df['Parameter_ID'].unique()
words

array(['91_vomit', '92_fear', '93_skin', '94_float', '95_smooth',
       '96_wife', '97_all', '98_hot', '99_person', '100_throw', '1_white',
       '2_back', '3_nose', '4_ice', '5_neck', '6_thin', '7_not', '8_rub',
       '9_grass', '10_long', '101_flesh', '102_if', '103_breasts',
       '104_three', '105_woods', '106_kill', '107_sand', '108_mountain',
       '109_burn', '110_few', '111_tongue', '112_snake', '113_rope',
       '114_louse', '115_wet', '116_what', '117_stone', '118_hand',
       '119_tree', '120_bark', '121a_countnoun', '121b_countverb',
       '122_who', '123_water', '124_fruit', '125_sleep', '126_suck',
       '127_say', '128_split', '129_die', '130_four', '131_he',
       '132_they', '133_sun', '134_lie', '135_day', '136_sky', '137_hear',
       '138_head', '139_hair', '140_earth', '141_spit', '142_push',
       '143_leg', '144_dig', '145_play', '146_night', '147_tail',
       '148_smell', '149_i', '150_we', '151_five', '152_fog', '153_knee',
       '154_wash', '155_t

In [7]:
len(words)

203

In [8]:
# check which words have more than one expression in one area
multis = []
ones = []
for s in data_df['ID']:
    # ID format: city-para-index
    if int(s.split('-')[2]) != 1:
        multis.append(s)
    else:
        ones.append(s)
multis
# two similarity comparisons follow later:
# first, the similarity across different areas
# second, the similarity among variants of the same parameter within one area

['Chengdu-91_vomit-2',
 'Chengdu-91_vomit-3',
 'Nanjing-91_vomit-2',
 'Loudi-91_vomit-2',
 'Nanchang-91_vomit-2',
 'Chengdu-92_fear-2',
 'Meixian-92_fear-2',
 'Meixian-92_fear-3',
 'Chengdu-93_skin-2',
 'Fuzhou-93_skin-2',
 'XiAn-94_float-2',
 'Nanjing-94_float-2',
 'Nanjing-94_float-3',
 'Jixi-94_float-2',
 'Wenzhou-94_float-2',
 'Nanchang-94_float-2',
 'Nanchang-94_float-3',
 'Fuzhou-94_float-2',
 'Haerbin-96_wife-2',
 'Haerbin-96_wife-3',
 'XiAn-96_wife-2',
 'XiAn-96_wife-3',
 'XiAn-96_wife-4',
 'Chengdu-96_wife-2',
 'Chengdu-96_wife-3',
 'Jixi-96_wife-2',
 'Changsha-96_wife-2',
 'Changsha-96_wife-3',
 'Changsha-96_wife-4',
 'Nanchang-96_wife-2',
 'Fuzhou-96_wife-2',
 'Fuzhou-96_wife-3',
 'Fuzhou-96_wife-4',
 'Chengdu-100_throw-2',
 'Changsha-100_throw-2',
 'Loudi-100_throw-2',
 'Nanchang-100_throw-2',
 'Fuzhou-100_throw-2',
 'Fuzhou-100_throw-3',
 'Taiyuan-1_white-2',
 'XiAn-2_back-2',
 'Taiyuan-3_nose-2',
 'Changsha-3_nose-2',
 'Xiamen-3_nose-2',
 'Changsha-4_ice-2',
 'Xiamen-4_ic

In [9]:
ones

['Beijing-91_vomit-1',
 'Haerbin-91_vomit-1',
 'Jinan-91_vomit-1',
 'Rongcheng-91_vomit-1',
 'Taiyuan-91_vomit-1',
 'XiAn-91_vomit-1',
 'Chengdu-91_vomit-1',
 'Nanjing-91_vomit-1',
 'Jixi-91_vomit-1',
 'Suzhou-91_vomit-1',
 'Wenzhou-91_vomit-1',
 'Changsha-91_vomit-1',
 'Loudi-91_vomit-1',
 'Nanchang-91_vomit-1',
 'Meixian-91_vomit-1',
 'Guilin-91_vomit-1',
 'Guangzhou-91_vomit-1',
 'Fuzhou-91_vomit-1',
 'Xiamen-91_vomit-1',
 'Beijing-92_fear-1',
 'Haerbin-92_fear-1',
 'Jinan-92_fear-1',
 'Rongcheng-92_fear-1',
 'Taiyuan-92_fear-1',
 'XiAn-92_fear-1',
 'Chengdu-92_fear-1',
 'Nanjing-92_fear-1',
 'Jixi-92_fear-1',
 'Suzhou-92_fear-1',
 'Wenzhou-92_fear-1',
 'Changsha-92_fear-1',
 'Loudi-92_fear-1',
 'Nanchang-92_fear-1',
 'Meixian-92_fear-1',
 'Guilin-92_fear-1',
 'Guangzhou-92_fear-1',
 'Fuzhou-92_fear-1',
 'Xiamen-92_fear-1',
 'Beijing-93_skin-1',
 'Haerbin-93_skin-1',
 'Jinan-93_skin-1',
 'Rongcheng-93_skin-1',
 'Taiyuan-93_skin-1',
 'XiAn-93_skin-1',
 'Chengdu-93_skin-1',
 'Nanjing-

In [10]:
len(multis)

446

In [11]:
len(ones)

3856

The length of ones should be 3857 (19 cities × 203 words), but it is 3856, so one entry is missing. The next step is to find it and exclude it from the similarity calculation.
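One way to locate such a gap directly is to compare the observed (city, word) pairs against every expected combination. The sketch below uses a small toy list in place of ones; the names toy_cities, toy_words, toy_ones and find_missing_pairs are illustrative and not part of the notebook.

```python
from itertools import product

# Toy stand-ins for cities, words and ones (illustrative only)
toy_cities = ["Beijing", "Fuzhou", "Xiamen"]
toy_words = ["4_ice", "190_grease"]
toy_ones = [
    "Beijing-4_ice-1", "Fuzhou-4_ice-1", "Xiamen-4_ice-1",
    "Beijing-190_grease-1", "Fuzhou-190_grease-1",
]

def find_missing_pairs(ids, cities, words):
    """Return every (city, word) pair that has no entry in ids."""
    # ID format: city-parameter-index, so the first two fields identify the pair
    seen = {tuple(s.split("-")[:2]) for s in ids}
    return [pair for pair in product(cities, words) if pair not in seen]

missing = find_missing_pairs(toy_ones, toy_cities, toy_words)
print(missing)
```

On the real data this would be called with ones, cities and words, and it reports the city and the word in a single pass.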


In [12]:

id_format_issues = [s for s in ones if len(s.split('-')) != 3]
# check id_format
if id_format_issues:
    print(f"Not well-formed ID: {id_format_issues}")

# count cities 
cities_in_ones = [s.split('-')[0] for s in ones]

city_counts = Counter(cities_in_ones)

print("Count of each city:")
for city in cities:
    print(f"{city}: {city_counts.get(city, 0)}")
# Xiamen has only 202 entries instead of 203

Count of each city:
Beijing: 203
Haerbin: 203
Jinan: 203
Rongcheng: 203
Taiyuan: 203
XiAn: 203
Chengdu: 203
Nanjing: 203
Jixi: 203
Suzhou: 203
Wenzhou: 203
Changsha: 203
Loudi: 203
Nanchang: 203
Meixian: 203
Guilin: 203
Guangzhou: 203
Fuzhou: 203
Xiamen: 202


In [13]:
# extract the list of words recorded for Xiamen
xiamen_ids = [s for s in ones if s.split('-')[0] == 'Xiamen']
xiamen_words_list = [s.split('-')[1] for s in xiamen_ids]


# find missing word 
missing_word = [word for word in words if word not in xiamen_words_list]
print(missing_word)

['190_grease']


In [14]:
# return two columns (info1, info2) for all rows that share the given Parameter_ID
def get_characters(para, info1, info2):
    rows = data_df[data_df['Parameter_ID'] == para]
    return (rows[info1], rows[info2])


In [15]:
# double-check: the Xiamen entry for 190_grease is missing
get_characters('190_grease','ID','Chinese_Characters')


(2336      Beijing-190_grease-1
 2337      Haerbin-190_grease-1
 2338        Jinan-190_grease-1
 2339    Rongcheng-190_grease-1
 2340      Taiyuan-190_grease-1
 2341      Taiyuan-190_grease-2
 2342         XiAn-190_grease-1
 2343         XiAn-190_grease-2
 2344      Chengdu-190_grease-1
 2345      Nanjing-190_grease-1
 2346      Nanjing-190_grease-2
 2347         Jixi-190_grease-1
 2348         Jixi-190_grease-2
 2349       Suzhou-190_grease-1
 2350      Wenzhou-190_grease-1
 2351     Changsha-190_grease-1
 2352        Loudi-190_grease-1
 2353     Nanchang-190_grease-1
 2354     Nanchang-190_grease-2
 2355      Meixian-190_grease-1
 2356       Guilin-190_grease-1
 2357    Guangzhou-190_grease-1
 2358       Fuzhou-190_grease-1
 Name: ID, dtype: object,
 2336      脂 肪
 2337      脂 肪
 2338      脂 肪
 2339      脂 肪
 2340      脂 肪
 2341        油
 2342      脂 肪
 2343        油
 2344        油
 2345        油
 2346      脂 肪
 2347        油
 2348      脂 肪
 2349      壯 肉
 2350      脂 肪
 2351      脂 

Since 190_grease has no Xiamen entry, it cannot be used to compare similarity across all areas, so we remove it from that comparison. It can still be used when comparing the variants found within a single area.


Q1. Do cities that are geographically close use similar characters for the same Parameter_ID?

In [17]:
pwd()


'/Users/weiting/Desktop/2024summer semester/data linguistics'

In [20]:
cd tencent-ailab-embedding-zh-d100-v0.2.0


/Users/weiting/Desktop/2024summer semester/data linguistics/tencent-ailab-embedding-zh-d100-v0.2.0


In [21]:
# load the pretrained word2vec model from its text-format file
wv_from_text = KeyedVectors.load_word2vec_format('tencent-ailab-embedding-zh-d100-v0.2.0.txt', binary=False)
    

In [37]:
# define a function that computes the similarity of two words (0.0 if either word is not in the model's vocabulary)
def get_similarity(word1, word2, model):
    word1 = word1.replace(' ', '')
    word2 = word2.replace(' ', '')
    if word1 == word2:
        return 1.0
    elif word1 in model.key_to_index and word2 in model.key_to_index:
        return round(model.similarity(word1, word2), 3)
    else:
        return 0.0
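For reference, model.similarity in gensim returns the cosine similarity of the two word vectors. The sketch below shows that computation on hypothetical 3-dimensional vectors (the real embeddings have 100 dimensions); cosine_similarity and the vec_* names are illustrative.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical vectors, not taken from the Tencent embeddings
vec_a = [1.0, 2.0, 0.0]
vec_b = [2.0, 4.0, 0.0]  # same direction as vec_a
vec_c = [0.0, 0.0, 3.0]  # orthogonal to vec_a

print(round(cosine_similarity(vec_a, vec_b), 3))
print(round(cosine_similarity(vec_a, vec_c), 3))
```

Note that a genuine similarity that rounds to 0.0 is indistinguishable from the 0.0 that get_similarity returns for out-of-vocabulary words; returning float('nan') for the latter would be one way to keep the two cases apart.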


In [23]:
# collect the characters of each city's first variant for every word, skipping 190_grease
word_items = []
for i in ones:
    if i.split('-')[1] != '190_grease':
        chars = data_df[data_df['ID'] == i]['Chinese_Characters']
        if not chars.empty:  # append the string itself, not the Series
            word_items.append(chars.iloc[0])

In [24]:
# 202 words (all except 190_grease) × 19 cities = 3838 items
len(word_items)


3838

In [26]:
word_groups = [word_items[i:i + 19] for i in range(0, len(word_items), 19)]
word_groups[0]


['吐',
 '吐',
 '吐',
 '嘔 吐',
 '嘔 嘞',
 '嘔 吐',
 '發 吐 了',
 '吐',
 '㽹 惡',
 '嘔',
 '吐',
 '吐',
 '囗',
 '嘔',
 '翻',
 '吐',
 '嘔 吐',
 '吐',
 '吐']
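Slicing word_items into chunks of 19 assumes that every word has exactly one first-variant row per city and that the rows appear in the same city order. A sketch of an alternative that groups by the Parameter_ID parsed from each ID, and so does not depend on row order, is shown below; toy_rows and group_by_parameter are illustrative names, with values copied from the table above.

```python
from collections import defaultdict

# (ID, Chinese_Characters) pairs taken from the rows shown earlier
toy_rows = [
    ("Beijing-91_vomit-1", "吐"),
    ("Rongcheng-91_vomit-1", "嘔 吐"),
    ("Guangzhou-90_woman-1", "女 人"),
    ("Fuzhou-90_woman-1", "伊 姐"),
]

def group_by_parameter(rows):
    """Map each Parameter_ID to a {city: characters} dict."""
    groups = defaultdict(dict)
    for row_id, chars in rows:
        # ID format: city-parameter-index
        city, para, _index = row_id.split("-")
        groups[para][city] = chars
    return dict(groups)

toy_groups = group_by_parameter(toy_rows)
print(toy_groups["91_vomit"])
```

With this layout each word's entries are looked up by city name, so a missing city shows up as a missing key instead of silently shifting every later chunk.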

In [46]:
# build row labels of the form "<city>_<variant>" for every word group
row_labels = [f"{cities[i % 19]}_{variant}" for word_l in word_groups for i, variant in enumerate(word_l)]
# create an empty DataFrame to hold the similarities (rows: city variants, columns: cities)
area_similarity = pd.DataFrame(index=row_labels, columns=cities)
# convert the NumPy array to a list so that .index() is available
cities_list = cities.tolist()
# fill in the DataFrame
for word_list in word_groups:
    for i, c in enumerate(cities):
        row_label = f"{cities[i % 19]}_{word_list[i]}"
        for city in cities:
            word_index = cities_list.index(city) % len(word_list)
            similar_word = word_list[word_index]
            area_similarity.at[row_label, city] = get_similarity(word_list[i], similar_word, wv_from_text)
area_similarity
# words the model does not recognise get a similarity of 0.0

Unnamed: 0,Beijing,Haerbin,Jinan,Rongcheng,Taiyuan,XiAn,Chengdu,Nanjing,Jixi,Suzhou,Wenzhou,Changsha,Loudi,Nanchang,Meixian,Guilin,Guangzhou,Fuzhou,Xiamen
Beijing_吐,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,0.462,0.462
Haerbin_吐,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,0.462,0.462
Jinan_吐,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,0.462,0.462
Rongcheng_嘔 吐,0.581,0.581,0.581,1.0,0.0,1.0,0.0,0.581,0.0,0.793,0.581,0.581,0.376,0.793,0.305,0.581,1.0,0.581,0.581
Taiyuan_嘔 嘞,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
Meixian_妹 ? 人,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0
Guilin_女 人,1.0,0.0,1.0,1.0,1.0,1.0,1.0,1.0,0.0,1.0,0.586,1.0,0.772,0.565,0.0,1.0,1.0,0.232,0.453
Guangzhou_女 人,1.0,0.0,1.0,1.0,1.0,1.0,1.0,1.0,0.0,1.0,0.586,1.0,0.772,0.565,0.0,1.0,1.0,0.232,0.453
Fuzhou_伊 姐,0.232,0.0,0.232,0.232,0.232,0.232,0.232,0.232,0.0,0.232,0.247,0.232,0.278,0.32,0.0,0.232,0.232,1.0,0.228
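Row labels of the form city_variant are not guaranteed to be unique: if a city uses the same characters for two different Parameter_IDs, both word groups map to the same label, and values written for one group can be overwritten by the other. The sketch below checks for such collisions and shows that including the Parameter_ID in the label avoids them. The triples are hypothetical (in particular the 141_spit entry is assumed for illustration), and duplicated is an illustrative helper.

```python
from collections import Counter

# Hypothetical (city, Parameter_ID, characters) triples
toy_entries = [
    ("Beijing", "91_vomit", "吐"),
    ("Beijing", "141_spit", "吐"),  # assumed for illustration
    ("Rongcheng", "91_vomit", "嘔 吐"),
]

def duplicated(labels):
    """Return the labels that occur more than once, sorted."""
    return sorted(label for label, n in Counter(labels).items() if n > 1)

# Labels as built in the notebook: city + characters
short_labels = [f"{city}_{chars}" for city, _para, chars in toy_entries]
# Labels that also carry the Parameter_ID
full_labels = [f"{city}_{para}_{chars}" for city, para, chars in toy_entries]

print(duplicated(short_labels))
print(duplicated(full_labels))
```

Running duplicated(row_labels) on the notebook's own labels would show whether any rows of area_similarity are affected.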


In [47]:
# replace 0 with NaN so that unrecognised words are treated as missing values
area_similarity = area_similarity.astype(float)
area_similarity.replace(0,np.nan, inplace=True)
area_similarity


Unnamed: 0,Beijing,Haerbin,Jinan,Rongcheng,Taiyuan,XiAn,Chengdu,Nanjing,Jixi,Suzhou,Wenzhou,Changsha,Loudi,Nanchang,Meixian,Guilin,Guangzhou,Fuzhou,Xiamen
Beijing_吐,1.000,1.000,1.000,1.000,1.000,1.000,1.000,1.000,1.0,1.000,1.000,1.000,1.000,1.000,1.000,1.000,1.000,0.462,0.462
Haerbin_吐,1.000,1.000,1.000,1.000,1.000,1.000,1.000,1.000,1.0,1.000,1.000,1.000,1.000,1.000,1.000,1.000,1.000,0.462,0.462
Jinan_吐,1.000,1.000,1.000,1.000,1.000,1.000,1.000,1.000,1.0,1.000,1.000,1.000,1.000,1.000,1.000,1.000,1.000,0.462,0.462
Rongcheng_嘔 吐,0.581,0.581,0.581,1.000,,1.000,,0.581,,0.793,0.581,0.581,0.376,0.793,0.305,0.581,1.000,0.581,0.581
Taiyuan_嘔 嘞,,,,,1.000,,,,,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
Meixian_妹 ? 人,,,,,,,,,,,,,,,1.000,,,,
Guilin_女 人,1.000,,1.000,1.000,1.000,1.000,1.000,1.000,,1.000,0.586,1.000,0.772,0.565,,1.000,1.000,0.232,0.453
Guangzhou_女 人,1.000,,1.000,1.000,1.000,1.000,1.000,1.000,,1.000,0.586,1.000,0.772,0.565,,1.000,1.000,0.232,0.453
Fuzhou_伊 姐,0.232,,0.232,0.232,0.232,0.232,0.232,0.232,,0.232,0.247,0.232,0.278,0.320,,0.232,0.232,1.000,0.228


In [53]:
# store the area_similarity dataframe as a csv file 
# for further visualization 
area_similarity.to_csv('area_similarity.csv',float_format='%.3f', na_rep='NaN', index=True)
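A possible next step toward Q1 is to collapse the variant-level rows into one average per city pair while skipping NaN cells. The sketch below does this on a small dictionary shaped like area_similarity; the toy values and the name mean_by_city are illustrative, and the city is assumed to be the part of the row label before the first underscore.

```python
import math
from collections import defaultdict

nan = float("nan")
# Toy rows shaped like area_similarity: row label -> {city: similarity}
toy_similarity = {
    "Beijing_吐": {"Beijing": 1.0, "Fuzhou": 0.462},
    "Beijing_女 人": {"Beijing": 1.0, "Fuzhou": 0.232},
    "Fuzhou_伊 姐": {"Beijing": 0.232, "Fuzhou": 1.0},
    "Fuzhou_囗": {"Beijing": nan, "Fuzhou": 1.0},
}

def mean_by_city(rows):
    """Average each column over all rows of one source city, skipping NaN."""
    collected = defaultdict(lambda: defaultdict(list))
    for label, sims in rows.items():
        source_city = label.split("_", 1)[0]
        for target_city, value in sims.items():
            if not math.isnan(value):
                collected[source_city][target_city].append(value)
    return {
        src: {tgt: round(sum(vals) / len(vals), 3) for tgt, vals in targets.items()}
        for src, targets in collected.items()
    }

city_means = mean_by_city(toy_similarity)
print(city_means["Beijing"]["Fuzhou"])
```

The result is a city-by-city table, which is easier to place on a map than the full variant-level matrix.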

In [None]:
# use geospatial visualization to show the similarity
# https://piktochart.com/blog/big-data-visualization/
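To relate similarity to geography, one option is to compute the great-circle distance between each pair of cities and compare it with their similarity scores. The sketch below uses the haversine formula. The coordinates are approximate values assumed for illustration and are not part of the dataset, and haversine_km is an illustrative name.

```python
import math

def haversine_km(lat1, lon1, lat2, lon2, radius_km=6371.0):
    """Great-circle distance in kilometres between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = math.radians(lon2 - lon1)
    a = (math.sin(d_phi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2)
    return 2 * radius_km * math.asin(math.sqrt(a))

# Approximate (latitude, longitude) pairs, assumed for illustration
coords = {
    "Beijing": (39.90, 116.40),
    "Guangzhou": (23.13, 113.26),
}

distance = haversine_km(*coords["Beijing"], *coords["Guangzhou"])
print(round(distance), "km")
```

Pairing these distances with the per-city similarity values would allow a simple scatter plot or correlation as a first check of Q1.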

Q2. For the same word, test how the Value/Form/Segments differ among areas.

Q3. Does embedding similarity correlate positively with the similarity of the Values?

Q4. Use a network to show the tree.

Q5. Build a model to show the dialect distribution.

Q6. N-gram analysis.

Q7. Some cities, such as Chengdu and Nanjing, have several variants for the same word. What is the reason for this variety?

Q8. Some words show no change across areas. What is the reason?

Step 3: Raise some questions related to the geography of the dialects and explore them.
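For the question about Value/Form/Segments differences, one simple measure is a normalized edit distance over the space-separated tokens of the Segments column. The sketch below is one possible approach, not a method used in the notebook; edit_distance and segment_similarity are illustrative names, and the three segment strings are copied from the 91_vomit rows shown at the top.

```python
def edit_distance(a, b):
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(b) + 1))
    for i, tok_a in enumerate(a, start=1):
        curr = [i]
        for j, tok_b in enumerate(b, start=1):
            cost = 0 if tok_a == tok_b else 1
            # deletion, insertion, substitution (or match)
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]

def segment_similarity(seg1, seg2):
    """1 minus the edit distance divided by the longer token count."""
    t1, t2 = seg1.split(), seg2.split()
    longest = max(len(t1), len(t2))
    if longest == 0:
        return 1.0
    return 1 - edit_distance(t1, t2) / longest

beijing = "tʰ u ⁵¹"
haerbin = "tʰ u ⁵³"
rongcheng = "ou ²¹³ + tʰ u ²¹⁴"

print(round(segment_similarity(beijing, haerbin), 3))
print(round(segment_similarity(beijing, rongcheng), 3))
```

Treating each tone mark as one token means that a tone difference counts the same as a consonant or vowel difference; weighting them differently is a design choice left open here.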