# Modal Organization in Chinese Folk Songs
-----------
## Digital Musicology - Milestone 3
-------

In [1]:
# Imports 

from music21 import *
import glob
from tqdm import tqdm
import seaborn as sns
from matplotlib import pyplot as plt
import pandas as pd
from collections import Counter, OrderedDict
from geopy.geocoders import Nominatim
import re
import folium
import networkx as nx

# Users should modify the MuseScore_PATH with the path to MuseScore on their own device

MuseScore_PATH = r'C:\Program Files\MuseScore3\bin\MuseScore3.exe'

environment.set('musescoreDirectPNGPath', MuseScore_PATH)

### Research Questions

#### Tonic

##### Definition

**TBD: Write cleaner definition**

For this project, we define the tonic as the central pitch of a piece: the pitch that recurs most often with long durations (weighted by beat strength), or the pitch that ends the piece.


##### Formula

**TBD: Rephrase the algorithm explanation and add a mathematical formula**

The algorithm computes a centrality score for each pitch class by summing, over all of its occurrences, the note duration multiplied by the beat strength. The last pitch of the piece additionally receives a bonus proportional to the number of notes in the song. The pitch class with the highest score is taken as the tonic.
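
A possible formalization, read off the code below, is sketched here. The symbols are introduced for illustration only: $n_1, \dots, n_N$ are the notes of the piece in the order they are traversed, $d_i$ is the duration of $n_i$ in quarter lengths, $b_i$ is its beat strength, and $\mathrm{pc}(n_i)$ is its pitch class. The constant $0.1$ is the last-note weight used in the code.

```latex
s(p) = \sum_{i=1}^{N} \mathbf{1}\left[\mathrm{pc}(n_i) = p\right] \, d_i \, b_i
       \; + \; 0.1 \, N \, \mathbf{1}\left[\mathrm{pc}(n_N) = p\right],
\qquad
\mathrm{tonic} = \arg\max_{p} \, s(p)
```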

In [2]:
def get_tonic_and_pitch_classes(score):
    """
    This function accepts a score and returns its tonic, pitch classes, and 5 most central pitches,
    using the method described above.
    """
    try:
        note_scores = dict()
        length = 0
        for part in score.parts:
            for measure in part:
                if type(measure) is not stream.Measure:
                    continue
                for note_ in measure:
                    if type(note_) is not note.Note:
                        continue
                    length += 1
                    if note_.name in note_scores.keys():
                        note_scores[note_.name] += note_.duration.quarterLength * note_.beatStrength
                    else:
                        note_scores[note_.name] = note_.duration.quarterLength * note_.beatStrength
                    last_note = note_.name
        note_scores[last_note] += 0.1 * length
        tonic = max(note_scores, key=note_scores.get)
    except Exception:  # e.g. a score without notes leaves last_note undefined
        tonic = None

    pitchclasses = list(note_scores.keys())
    top5 = sorted(note_scores, key=note_scores.get, reverse=True)[:5]
    return tonic, pitchclasses, top5
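
To make the scoring concrete, here is a minimal sketch of the same rule on toy data, without `music21`. The function name `toy_tonic`, the `(name, duration, beat strength)` tuples, and the example melody are hypothetical and only serve as an illustration.

```python
def toy_tonic(notes, last_note_weight=0.1):
    """Score pitch classes from (name, quarter_length, beat_strength) tuples."""
    scores = {}
    for name, duration, beat_strength in notes:
        # Each occurrence contributes duration * beat strength.
        scores[name] = scores.get(name, 0.0) + duration * beat_strength
    # The final note gets a bonus proportional to the number of notes.
    last_name = notes[-1][0]
    scores[last_name] += last_note_weight * len(notes)
    return max(scores, key=scores.get), scores


# Hypothetical five-note melody ending on D.
melody = [("G", 1.0, 1.0), ("A", 1.0, 0.5), ("G", 2.0, 1.0),
          ("D", 1.0, 0.5), ("D", 1.0, 0.25)]
tonic, scores = toy_tonic(melody)
print(tonic, scores)
```

In this toy melody, G scores 3.0 from its two strong, long occurrences, while the final D reaches only 1.25 even with the last-note bonus, so G is selected.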

### Mode

#### Definition

**TBD: Rephrase the definition**

In Chinese music, we distinguish the following five modes, listed here with their five notes when transposed to C [[4]](https://digitalcommons.lsu.edu/cgi/viewcontent.cgi?article=2761&context=gradschool_dissertations):
* 宮 (gong, C) mode \[C D E G A\]
* 商 (shang, D) mode \[C D F G B<sub>b</sub>\]
* 角 (jue, E) mode \[C E<sub>b</sub> F A<sub>b</sub> B<sub>b</sub>\]
* 徵 (zhi, G) mode \[C D F G A\]
* 羽 (yu, A) mode \[C E<sub>b</sub> F G B<sub>b</sub>\]

Using this information, we define the function below, which decides whether the 5 most central pitches of a piece belong to one of these modes.
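
As a sanity check on the five pitch sets listed above, each mode can be derived as a rotation of the gong pentatonic scale, transposed so that it starts on C. The sketch below does this with semitone arithmetic. The flat-based spelling table `NAMES` and the function name `mode_pitch_classes` are assumptions made for this illustration.

```python
# Semitone offsets of the gong pentatonic scale on C: C D E G A
GONG = [0, 2, 4, 7, 9]
# Flat-based spelling in music21 style ('-' means flat); an assumption for display.
NAMES = ['C', 'D-', 'D', 'E-', 'E', 'F', 'G-', 'G', 'A-', 'A', 'B-', 'B']
MODE_NAMES = ["gong", "shang", "jue", "zhi", "yu"]


def mode_pitch_classes():
    """Rotate the gong scale to each degree and transpose the result to C."""
    modes = {}
    for i, name in enumerate(MODE_NAMES):
        root = GONG[i]
        rotated = GONG[i:] + GONG[:i]
        modes[name] = [NAMES[(step - root) % 12] for step in rotated]
    return modes


for name, pitches in mode_pitch_classes().items():
    print(name, pitches)
```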

#### Formula

**TBD: Write the algorithm explanation and add a mathematical formula**


In [3]:
def get_mode(top5_pitches):
    if set(top5_pitches).issubset(['C', 'D', 'E', 'G', 'A']):
        return "gong"
    if set(top5_pitches).issubset(['C', 'D', 'F', 'G', 'B-']):
        return "shang"
    if set(top5_pitches).issubset(['C', 'E-', 'F', 'A-', 'B-']):
        return "jue"
    if set(top5_pitches).issubset(['C', 'D', 'F', 'G', 'A']):
        return "zhi"
    if set(top5_pitches).issubset(['C', 'E-', 'F', 'G', 'B-']):
        return "yu"
    return None
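
Because `get_mode` tests for subsets in a fixed order, a piece with fewer than five distinct central pitches may fit several modes and is then assigned the first match. The sketch below is a hypothetical variant, `candidate_modes`, that returns every matching mode so such ambiguous cases can be spotted. It is an illustration, not part of the pipeline above.

```python
# Pitch-class sets of the five modes transposed to C, as listed above.
MODES = {
    "gong": {'C', 'D', 'E', 'G', 'A'},
    "shang": {'C', 'D', 'F', 'G', 'B-'},
    "jue": {'C', 'E-', 'F', 'A-', 'B-'},
    "zhi": {'C', 'D', 'F', 'G', 'A'},
    "yu": {'C', 'E-', 'F', 'G', 'B-'},
}


def candidate_modes(top_pitches):
    """Return the names of all modes that contain every given pitch class."""
    pitches = set(top_pitches)
    return [name for name, scale in MODES.items() if pitches <= scale]


print(candidate_modes(['C', 'G', 'F', 'D', 'A']))   # one match
print(candidate_modes(['C', 'D', 'F', 'G']))        # ambiguous: two matches
print(candidate_modes(['C', 'G', 'F', 'E-', 'D']))  # no match
```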

### Score Loading

In [4]:
regions = ["han", "natmin", "shanxi", "xinhua"]

We then define the function below, which returns all *\*\*kern* scores in the given `path`, parsed with `music21`'s `converter.parse` function.

In [5]:
def load_scores(path):
    """
    Load every .krn file found in `path`.
    Pieces that music21 parses successfully are stored in the "scores" list.
    The paths of files that fail to parse are stored in "failed_scores".
    The "total" and "failed" counters hold the number of files seen and the number that failed.
    """

    pieces = {"scores": [], "failed_scores": [], "total": 0, "failed": 0}
    for file in tqdm(glob.glob(path + "/*.krn")):
        pieces["total"] += 1
        try:
            pieces["scores"].append(converter.parse(file))
        except Exception:
            pieces["failed"] += 1
            # Keep the path: parsing the file again here would raise the same error.
            pieces["failed_scores"].append(file)

    return pieces

We can now iterate over the four regions, which correspond to the four data folders, and use the `load_scores` function defined above to retrieve all scores of our dataset.

In [6]:
# A dictionary to store the music scores belonging to the 4 regions of the CFS as per the dataset.
# The key is the name of the region and the value is a dictionary with the keys:
# scores - the scores that were parsed successfully,
# failed_scores - the scores that have not loaded properly,
# total - the total number of scores of that region, and
# failed - the number of scores that failed to be parsed.

music_data = {}

for region in regions:
    music_data[region] = load_scores("./data/" + region)

100%|██████████| 1223/1223 [00:10<00:00, 114.74it/s]
100%|██████████| 206/206 [00:02<00:00, 69.38it/s] 
100%|██████████| 802/802 [00:05<00:00, 135.81it/s]
100%|██████████| 10/10 [00:00<00:00, 127.59it/s]


### Methods

We define our regional categories using the following map and refer to them as geographical divisions:

![map](img/map.JPG)

In [7]:
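from ast import literal_eval  # parses the stringified lists without executing arbitrary code
list_columns = ['pitches', 'pitchclasses', 'pitches_transposed', 'pitchclasses_transposed', 'top5_after_transpose']
CFS_full = pd.read_csv("./data/dataframes/cfs_full.csv", converters={col: literal_eval for col in list_columns})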
CFS_full.head()

Unnamed: 0,region,title,location,tonic,pitches,num_notes,pitchclasses,num_pitchclasses,tonic_transposed,pitches_transposed,pitchclasses_transposed,top5_after_transpose,mode,geo_division
0,han,Renmin gongshe shizai hao,"Asia, China, Shanxi, Zizhou",D,"[D5, A4, C5, D5, D5, A4, C5, D5, G5, C5, A4, G...",64,"[D, A, C, G, E, F#]",6,C,"[C5, G4, B-4, C5, C5, G4, B-4, C5, F5, B-4, G4...","[C, G, B-, F, D, E]","[C, B-, F, G, D]",shang,Northwest
1,han,Zanmen de ling xiu Mao Zedong,"Asia, China, Shanxi Nordwesten",C,"[C5, C5, F5, C5, B-4, G4, C5, E-4, F4, G4, C5,...",57,"[C, F, B-, G, E-, D, A]",7,C,"[C5, C5, F5, C5, B-4, G4, C5, E-4, F4, G4, C5,...","[C, F, B-, G, E-, D, A]","[C, G, F, E-, D]",,Northwest
2,han,Tian xin shun,"Asia, China, Shanxi Yanchang",D,"[D5, A4, D5, D5, A4, D5, A4, D5, G4, E4, D4, D...",24,"[D, A, G, E, B]",5,C,"[C5, G4, C5, C5, G4, C5, G4, C5, F4, D4, C4, C...","[C, G, F, D, A]","[C, G, F, D, A]",zhi,Northwest
3,han,Liu zhi dan,"Asia, China, Shanxi Shanbei",B-,"[E-5, C5, B-4, A-4, B-4, E-4, F4, E-5, C5, B-4...",41,"[E-, C, B-, A-, F]",5,C,"[F4, D4, C4, B-3, C4, F3, G3, F4, D4, C4, B-3,...","[F, D, C, B-, G]","[C, F, G, B-, D]",shang,Northwest
4,han,Zanmen de hongjun shi li zhong,"Asia, China, Shanxi Shanbei",E,"[E5, E4, A4, G4, A4, B4, E5, D5, E5, B4, E5, E...",24,"[E, A, G, B, D]",5,C,"[C5, C4, F4, E-4, F4, G4, C5, B-4, C5, G4, C5,...","[C, F, E-, G, B-]","[C, G, F, B-, E-]",yu,Northwest


## 2.2 Exploring regional differences

In this subsection, we explore the data to gain insight into our research questions. We hypothesized that we could find notable differences in the modal organization of the songs of different regions (divisions). Therefore, in the following sections, we group the songs by division, apply our methods to each group, and compare the results.

### 2.2.1 Pitch statistics

In this subsection, we plot the distributions of pitches and scale degrees over all the pieces of each division.

To do so, we first create empty dictionaries that store the pitches and scale degrees before and after transposition. Each dictionary uses the division name as key and the list of pitches or scale degrees of all pieces belonging to that division as value.
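
The pooling step can be sketched without pandas as follows. The helper `pool_by_division` and the toy records are hypothetical; only the column names `geo_division` and `pitches` are taken from the dataframe shown above.

```python
def pool_by_division(records, column):
    """Concatenate the lists stored in `column` for all records of each division."""
    pooled = {}
    for record in records:
        # extend() appends in place, avoiding repeated list copies.
        pooled.setdefault(record["geo_division"], []).extend(record[column])
    return pooled


# Toy records standing in for rows of the dataframe.
records = [
    {"geo_division": "Northwest", "pitches": ["D5", "A4"]},
    {"geo_division": "Central", "pitches": ["G4"]},
    {"geo_division": "Northwest", "pitches": ["C5"]},
]
print(pool_by_division(records, "pitches"))
```

Extending one list per division keeps the cost linear in the number of notes, whereas concatenating with `sum(list_of_lists, [])` copies the growing list at every step.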

In [8]:
# The manually annotated geographic divisions
geo_divisions = ["Northwest", "Central", "Southwest", "Northeast", "Jiangzhe", "Southeast", "Neimeng", "Jiang"]
pitch_classes_list = ['A', 'A#', 'A-', 'B', 'B#', 'B-', 'C', 'C#', 'C-', 'D', 'D#', 'D-', 'E', 'E#', 'E-', 'F', 'F#', 'G', 'G#', 'G-']

In [None]:
# Empty dictionaries to store the pitches, pitch classes, and tonics of each division
region_pitch_stat = {}
region_scaledegree_stat = {}
region_pitch_transposed_stat = {}
region_scaledegree_transposed_stat = {}
region_tonic_stat = {}

scaledegree_defcount = {pc:0 for pc in pitch_classes_list}
pitchclass_transitions_defcount = {pc1:{pc2:0 for pc2 in pitch_classes_list} for pc1 in pitch_classes_list}

for gloc in geo_divisions: 
    region_df = CFS_full[CFS_full["geo_division"]==gloc]
    region_pitch_stat[gloc] = sum(region_df["pitches"].tolist(), [])
    region_scaledegree_stat[gloc] = sum(region_df["pitchclasses"].tolist(), [])
    region_pitch_transposed_stat[gloc] = sum(region_df["pitches_transposed"].tolist(), [])
    region_scaledegree_transposed_stat[gloc] = sum(region_df["pitchclasses_transposed"].tolist(), [])
    region_tonic_stat[gloc] = region_df["tonic"].tolist()

In [None]:
plt.rcParams["figure.figsize"] = (20,5)

region_pitch_counts = {}
for gloc in region_pitch_stat:
    
    counter = Counter(region_pitch_stat[gloc])
#     pitch_fractions = {i:counter[i] / len(region_pitch_stat[gloc]) for i in counter}
    pitch_fractions = {i:counter[i] for i in counter}
    region_pitch_counts[gloc] = sorted(pitch_fractions.items(), key=lambda pair: pair[0])
    
    x, y = zip(*region_pitch_counts[gloc])
    
    plt.bar(x, y, color='black')
    plt.title("Pitch stats of {} region pieces".format(gloc))
    plt.show()

Comparing these plots gives a first impression of how the pieces of each region differ. The first striking observation is that the distributions look similar across regions: A4, B4, C5, D5, and G4 are dominant in most of them. However, there are also differences. For instance, B4 is barely used in the Neimeng region compared to all the others.

These remarks give a first insight into our hypothesis: on average, there is no striking difference in the use of pitches across the regions of China. However, further analysis is required to determine the implications of these fluctuations.

### 2.2.2 Pitch class statistics
In the previous subsection, we compared the use of pitches across regions. As an additional comparison, we now plot the distribution of pitch classes across regions.

For this visualization, we decided to sort the pitch classes by occurrences rather than alphabetically.
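
The counting and sorting step can be illustrated on toy data as follows. The helper `sorted_counts` and its `normalize` option are hypothetical. Normalizing to relative frequencies is only a suggestion for the case where divisions contain different numbers of notes, which would make raw counts hard to compare.

```python
from collections import Counter


def sorted_counts(observed, all_labels, normalize=False):
    """Count labels, keep unseen labels at zero, and sort by descending count."""
    counts = {label: 0 for label in all_labels}
    counts.update(Counter(observed))
    if normalize and observed:
        total = len(observed)
        counts = {label: value / total for label, value in counts.items()}
    return sorted(counts.items(), key=lambda pair: -pair[1])


print(sorted_counts(['A', 'D', 'A'], ['A', 'C', 'D']))
# [('A', 2), ('D', 1), ('C', 0)]
```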

In [None]:
plt.rcParams["figure.figsize"] = (20,20)
region_scaledegree_counts = {}
fig, axarr = plt.subplots(4,2)

for gloc,ax in zip(region_scaledegree_stat, axarr.flat):
    
    counter = Counter(region_scaledegree_stat[gloc])
#     scaledegree_fractions = {i:counter[i] / len(region_scaledegree_stat[gloc]) for i in counter}
    scaledegree_fractions = {i:counter[i] for i in counter}
    scaledegree_fractions = {**scaledegree_defcount, **scaledegree_fractions}
    region_scaledegree_counts[gloc] = sorted(scaledegree_fractions.items(), key=lambda pair: -pair[1])
    x, y = zip(*region_scaledegree_counts[gloc])
    ax.bar(x, y, color='black')
    ax.set_title("Pitch classes stats of {} region pieces".format(gloc))

A visual comparison of these plots again reveals a great similarity between all regions, as `A`, `D` and `E` most often compete closely for the three most used pitch classes. However, we again notice a significant difference in the Neimeng region, where the counts drop sharply after the 5 most used pitch classes.

These results again go against our hypothesis. However, the singular case of the Neimeng region should be kept in mind for the rest of the analysis.

### 2.2.3 Scale degrees statistics
In our research questions, we were particularly interested in the organization of a piece around its central pitch. For that reason, we transposed all pieces to C relative to their tonic. Analyzing the pitches of these transposed scores reveals information about the use of scale degrees.

As a first analysis, we plot the distribution of scale degrees by simply counting them. Note that, for simplicity, the plots are labeled with pitch classes relative to C rather than with scale degrees. Therefore G is V, D is II, etc.
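
The transposition code itself is not shown in this notebook. As a rough illustration of the idea, the sketch below shifts pitch-class names so that the tonic becomes C, using semitone arithmetic. The helpers `to_semitone` and `transpose_to_c` are hypothetical, and the result is always spelled with flats, which is a simplification compared with a proper interval-based transposition.

```python
BASE = {'C': 0, 'D': 2, 'E': 4, 'F': 5, 'G': 7, 'A': 9, 'B': 11}
# Flat-based spelling in music21 style ('-' means flat); an assumption for display.
FLAT_NAMES = ['C', 'D-', 'D', 'E-', 'E', 'F', 'G-', 'G', 'A-', 'A', 'B-', 'B']


def to_semitone(name):
    """Convert a name such as 'F#' or 'B-' to a semitone number in 0..11."""
    value = BASE[name[0]]
    for accidental in name[1:]:
        value += 1 if accidental == '#' else -1
    return value % 12


def transpose_to_c(pitch_classes, tonic):
    """Shift every pitch class so that `tonic` maps to C."""
    shift = to_semitone(tonic)
    return [FLAT_NAMES[(to_semitone(pc) - shift) % 12] for pc in pitch_classes]


# Pitch classes and tonic of the first row of the dataframe shown earlier.
print(transpose_to_c(['D', 'A', 'C', 'G', 'E', 'F#'], 'D'))
# ['C', 'G', 'B-', 'F', 'D', 'E']
```

For this row, the output agrees with the `pitchclasses_transposed` column of the dataframe.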

In [None]:
plt.rcParams["figure.figsize"] = (20, 20)
fig, axarr = plt.subplots(4,2)

region_scaledegree_transposed_counts = {}

for gloc, ax in zip(region_scaledegree_transposed_stat, axarr.flat):
    
    counter = Counter(region_scaledegree_transposed_stat[gloc])
#     scaledegree_transposed_fractions = {i:counter[i] / len(region_scaledegree_transposed_stat[gloc]) for i in counter}
    scaledegree_transposed_fractions = {i:counter[i] for i in counter}
    scaledegree_transposed_fractions = {**scaledegree_defcount, **scaledegree_transposed_fractions}
    region_scaledegree_transposed_counts[gloc] = sorted(scaledegree_transposed_fractions.items(), key=lambda pair: -pair[1])
    x, y = zip(*region_scaledegree_transposed_counts[gloc])
    ax.bar(x, y, color='black')
    ax.set_title("Pitchclass Stats of {} region pieces after transposing to C".format(gloc))

These plots reveal that the distributions of scale degrees are very similar across regions, which once again contradicts our initial hypothesis. However, it is interesting to note that, as in Western music, the fifth and the fourth are the most dominant scale degrees.

### 2.2.4 Pitch Classes Transitions

Another way to compare the organization of songs from different divisions is to look at the transitions between pitches. Comparing these transitions reveals whether certain pitch transitions are preferred over others and may show significant differences across regions.

For this purpose, we create a dictionary that stores the transition counts between pitch classes over all pieces of a division. Its keys are the division names and its values are nested dictionaries: the outer key is the starting pitch class, the inner key is the pitch class it moves to, and the value is the number of times that transition occurs.
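
The nested-dictionary layout can be illustrated on toy data. In the sketch below, `count_transitions` is a hypothetical helper and each measure is simply a list of pitch-class names. As in the cell that follows, the previous note is reset at every measure, so transitions across bar lines are not counted.

```python
from collections import defaultdict


def count_transitions(measures):
    """Count pitch-class transitions within each measure."""
    transitions = defaultdict(lambda: defaultdict(int))
    for measure in measures:
        # Pair every note with its successor inside the same measure.
        for previous, current in zip(measure, measure[1:]):
            transitions[previous][current] += 1
    # Convert to plain dictionaries for easier inspection.
    return {start: dict(targets) for start, targets in transitions.items()}


print(count_transitions([['G', 'A', 'A'], ['B', 'A']]))
# {'G': {'A': 1}, 'A': {'A': 1}, 'B': {'A': 1}}
```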

In [None]:
pitchclass_transition = {geo_region: {} for geo_region in geo_divisions}

for gloc in tqdm(music_data):
    
    for score in music_data[gloc]["scores"]:
        region = CFS_full[CFS_full["title"] == score.metadata.title].geo_division.values[0]
        if region != "Null":
            for part in score.parts:
                for measure in part:
                    if type(measure) is not stream.Measure:
                        continue
                    prev = None
                    for note_ in measure:
                        if type(note_) is not note.Note:
                            continue
                        if prev is not None:
                            if prev not in pitchclass_transition[region]:
                                pitchclass_transition[region][prev] = {}
                            if note_.name not in pitchclass_transition[region][prev]:
                                pitchclass_transition[region][prev][note_.name] = 1
                            else:
                                pitchclass_transition[region][prev][note_.name] += 1
                        prev = note_.name
                
for geo_region in pitchclass_transition:
    pitchclass_transition[geo_region] = {**pitchclass_transitions_defcount, **pitchclass_transition[geo_region]}

Using the pitch class transitions of each division, we now build a network whose nodes are pitch classes, with an edge wherever a transition occurs between two of them. The graph is directed: an edge exists only in the direction in which the transition occurs.

We create the `plot_network` function to plot this network. The size of a node depends on its weighted degree (the sum of the weights of the edges connected to it), and the thickness of an edge is proportional to the number of transitions between the two nodes. We also plot the transitions as heatmaps to aid interpretation. To create them, we first convert the transition counts from a dictionary to a `DataFrame` and use it to plot the heatmap.
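
The heatmaps are built from raw transition counts. If divisions contain different numbers of pieces, it may be easier to compare them after turning each row into transition probabilities. The sketch below shows one way to do this on the nested dictionary. The helper `to_probabilities` is hypothetical and is not used in the cells of this notebook.

```python
def to_probabilities(transitions):
    """Normalize each row of a nested count dictionary so that it sums to 1."""
    probabilities = {}
    for start, targets in transitions.items():
        total = sum(targets.values())
        probabilities[start] = {
            # Rows without any transition stay at zero.
            target: (count / total if total else 0.0)
            for target, count in targets.items()
        }
    return probabilities


print(to_probabilities({'G': {'A': 3, 'G': 1}, 'B': {'A': 0}}))
# {'G': {'A': 0.75, 'G': 0.25}, 'B': {'A': 0.0}}
```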

In [None]:
plt.rcParams["figure.figsize"] = (20,8)
def plot_network(graph, ax):
    weighted_degree = dict(graph.degree(weight='weight'))
    edges = graph.edges()
    weights = [0.002 * graph[u][v]['weight'] for u,v in edges]
    nx.draw_circular(graph, with_labels=True, alpha = 0.6, width=weights,
                  node_size=[v * 0.1 for v in weighted_degree.values()], ax=ax)
    ax.axis("off")

In [None]:
def sort_dict(item):
    # function to sort a nested dictionary based on key
    # credits: https://gist.github.com/gyli/f60f0374defc383aa098d44cfbd318eb
    return {k: sort_dict(v) if isinstance(v, dict) else v for k, v in sorted(item.items())}

colormap = sns.color_palette("light:b", as_cmap=True)
for gloc in geo_divisions:
    fig, axes = plt.subplots(1,2)
    gloc_graph = nx.DiGraph((k, v, {'weight': weight}) for k, vs in pitchclass_transition[gloc].items() for v, weight in vs.items())
    axes[0].set_title("Pitch Class transition of {} region".format(gloc))
    plot_network(gloc_graph, axes[0])
    
    sorted_probs = sort_dict(pitchclass_transition[gloc])
    gloc_matrix = pd.DataFrame.from_dict(sorted_probs).T.fillna(0)
    gloc_matrix = gloc_matrix.reindex(sorted(gloc_matrix.columns), axis=1)
    
    # Draw the heatmap explicitly on the right-hand axes; the values are raw counts.
    sns.heatmap(gloc_matrix, cmap=colormap, square=True, linecolor='black', linewidths=0.1, vmin=0, ax=axes[1])
    axes[1].set_title("Pitch Class transition counts of {} region".format(gloc))
    plt.show()
    
    print("--"*80)

The graphs and heatmaps above reveal both similarities and differences in the pitch transitions across regions. For instance, in all regions, Bs and Gs are most often followed by an A. However, while some regions make frequent use of consecutive As (e.g. Central and Southwest), others barely use that transition (e.g. Northwest and Neimeng). From the map provided in section 2.1, we can see that the regions sharing these similarities also share borders. The heatmaps also suggest that the intervals between consecutive pitches differ visibly from region to region.

This analysis of pitch transitions thus reveals a first difference in the organization of CFS and provides some support for our hypothesis.

## 3. Next Steps

As we have found limited evidence of differences in the organization of music between regions, we next plan to define these differences quantitatively. Additionally, the heatmaps indicate that the intervals between pitches could also characterize the region a piece belongs to. Thus, in the next step, we will look at the intervals of the pitch transitions in the pieces of each division to extract the preferred intervals of each division.

## 4. Conclusion
At the end of this exploratory data analysis, we observe that only a few of the comparison methods we used show significant differences across the eight regions of China we defined. Only the pitch transitions provide some insight on this matter. However, further analyses that we have not yet performed may reveal regional characteristics. For example, we have plotted the transitions between pitch classes but not between scale degrees. Moreover, while we have so far computed the distribution of scale degrees by simply counting them, we will also take their duration and beat strength into account in the next milestone.

Therefore, while our first results tend to refute our initial hypothesis, we hope that applying other methods and analyzing the results more precisely will bring more evidence of regional differences in the organization of CFS.

## 5. References


1. Essen Associative Code (EsAC) and folksong Database. http://www.esac-data.org/
2. Huron, D. (1994). UNIX tools for musical research: The humdrum toolkit reference manual. Stanford, CA: Center for Computer Assisted Research in Humanities.
3. Kern Scores, Folksongs from China. https://kern.humdrum.org/cgi-bin/browse?l=essen/asia/china 
4. Shi, J. (2016). East Meets West: A Musical Analysis of Chinese Sights and Sounds, by Yuankai Bao. LSU Doctoral Dissertations, 1762. https://digitalcommons.lsu.edu/gradschool_dissertations/1762
5. Cuthbert, M. S., & Ariza, C. (2010). music21: A toolkit for computer-aided musicology and symbolic music data. https://dspace.mit.edu/handle/1721.1/84963