# Analyzing Taiwanese rap lyrics: tone and "flow"
This Jupyter notebook was developed for a student in ["Digital Humanities Across Borders"](https://github.com/quinnanya/dlcl204) (DLCL 204, Stanford University, winter 2019) who was looking at how consecutive words that use the same tone contribute to the "flow" of the lyrics.

The notebook takes lyrics saved in Unicode (UTF-8) encoded plain text files (one file per song), converts them to pinyin, then counts the runs of consecutive identical tones in each line. It creates an output file for each song, and marks lines with 3 or more consecutive identical tones with an asterisk. At the moment, that's it! It'd be nice to also mark which tone is repeated, and how many times, but I haven't quite worked that out.

## Required modules
Before you can run this notebook successfully, you'll need to install a few modules. On a Mac, open the Terminal and run `pip install xpinyin` and `pip install regex`. If you get an error message about running _pip_, you can first run in Terminal `sudo easy_install pip`. Type the login password for your laptop when prompted (don't worry about nothing showing up when you type), and _pip_ should get installed, which will allow you to install the required modules.

## File structure
To work with this notebook, you should put all your lyrics in a single folder somewhere on your computer. You'll need to specify the location of that folder in the second code block.

**Note**: If your computer is localized for Chinese (i.e. if the Mac system interface is in Chinese), the path (location) information for your folder will be displayed in Chinese. When you set the path in this notebook, you need to use the default English names instead (e.g. use *Users* and *Documents* or *Downloads* in your path, even if those folders display with different names on your laptop).
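If you're not sure whether a path is right, one way to check before running the rest of the notebook is to ask Python which text files it can see there. This is only a sketch: the function name `list_lyrics_files` and the example folder are made up for illustration and aren't used elsewhere in this notebook.

```python
from pathlib import Path

def list_lyrics_files(folder):
    """Return the sorted names of the .txt files in folder, or raise if the folder is missing."""
    folder = Path(folder).expanduser()
    if not folder.is_dir():
        # A wrong or localized folder name ends up here
        raise FileNotFoundError(f"Folder not found: {folder}")
    return sorted(path.name for path in folder.glob("*.txt"))

# Example call; '~/Documents/twlyrics' is a placeholder for your own folder
# print(list_lyrics_files('~/Documents/twlyrics'))
```

If this raises `FileNotFoundError`, the path is probably using a displayed (localized) folder name rather than the English one.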

## Step 1: Setup
The code block below imports the modules the notebook needs to run.

Change the value of *file_directory* (the text between the ' ' marks) to the path to the folder where you've stored the text files with the lyrics.

In [99]:
#os is for doing things with the operating system, like changing to the right directory
import os
#regex lets you do complex searching within the text (e.g. looking for tone numbers)
import regex
#itertools lets you do various kinds of iterative processing, including grouping results
from itertools import groupby
from itertools import chain
#glob is used to find all the pathnames matching a specified pattern (here, all text files)
from glob import glob
#xpinyin converts text to pinyin
from xpinyin import Pinyin

p = Pinyin()

#Change the path below to the folder where you store the text files with the lyrics
file_directory = '/Users/qad/Documents/twlyrics'

#Change the working directory to the folder with your lyrics
os.chdir(file_directory)

## Step 2: Convert to pinyin

This step looks through the folder you specified, and for each .txt file, it rewrites that text file, replacing the original Chinese with pinyin.

Note that the original file is *replaced*, so if you want to get back to the original Chinese, make sure that you have a copy of the files saved in another folder as well.
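If you'd rather make that copy from inside the notebook, something like the following might work. It's a minimal sketch, not part of the original workflow: the function name `back_up_lyrics` and the backup folder name are assumptions, so change them to suit your setup.

```python
import shutil
from pathlib import Path

def back_up_lyrics(source_dir, backup_dir):
    """Copy every .txt file from source_dir into backup_dir and return the copied names."""
    source = Path(source_dir)
    backup = Path(backup_dir)
    # Create the backup folder if it doesn't exist yet
    backup.mkdir(parents=True, exist_ok=True)
    copied = []
    for path in sorted(source.glob("*.txt")):
        # copy2 keeps the file contents and timestamps
        shutil.copy2(path, backup / path.name)
        copied.append(path.name)
    return copied

# Example call (hypothetical backup location next to the lyrics folder):
# back_up_lyrics(file_directory, file_directory + '-backup')
```

Run it before the conversion step, and keep the backup folder outside the lyrics folder so the copies aren't converted too.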

In [100]:
#Looks through the specified folder, and for each .txt file...
for filename in os.listdir(file_directory):
    if filename.endswith(".txt"):
        #Opens and reads the .txt file (the lyrics are saved as UTF-8)
        with open(filename, 'r', encoding='utf-8') as f:
            text = f.read()
        #Converts the text to pinyin using numbers for tone marks
        lines = p.get_pinyin(text, tone_marks='numbers')
        #Overwrites the original file with the text converted to pinyin
        with open(filename, 'w', encoding='utf-8') as out:
            out.write(lines)

## Step 3: Count consecutive identical tones
This step extracts the numbers indicating the tone marks and counts how many consecutive identical ones there are in each line. It then writes every line to a new file whose name starts with "flagged-": lines that meet the criterion of 3+ consecutive identical tones are prefixed with an asterisk, and lines that don't meet that criterion are written unchanged.
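To see the counting on its own, here is a small sketch run on a single hand-written string in the numbered-pinyin format (the sample isn't taken from any song). It uses Python's built-in `re` module in place of `regex` so it runs without extra installs; the pattern is the same. The function name `count_runs` is just for illustration.

```python
import re
from itertools import groupby

def count_runs(line):
    """Return the length of each run of identical tone numbers in a line."""
    # Pull out the tone digits in order, e.g. ['3', '3', '3', '4']
    tones = re.findall(r"[1-5]", line)
    # groupby bundles neighbouring equal items; count the items in each bundle
    return [sum(1 for _ in group) for _, group in groupby(tones)]

sample = "ni3-hao3-ma3-shi4-de"  # hand-written example, not real lyrics
runs = count_runs(sample)
print(runs)                        # [3, 1]
print(max(runs, default=0) >= 3)   # True: this line would get an asterisk
```

Syllables without a tone number (like `de` here) are simply ignored by the pattern, so they neither extend nor break a run.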

In [104]:
#Looks through the specified folder, and for each .txt file...
for filename in os.listdir(file_directory):
    #Skips output files from an earlier run so they aren't flagged again
    if filename.endswith(".txt") and not filename.startswith("flagged-"):
        #The output file has the same name, but prepended with "flagged-"
        newfilename = 'flagged-' + filename
        #Opens the source file to read it and the new file to write to it
        with open(filename, encoding='utf-8') as lyricsfile, open(newfilename, 'w', encoding='utf-8') as f:
            #For each line in the source file
            for line in lyricsfile:
                #Removes the line break so print doesn't add an extra blank line
                line = line.rstrip('\n')
                #Find all the numbers 1-5 and put them in a list
                tones = regex.findall(r"[1-5]", line)
                #Counts consecutive duplicate numbers
                count_dups = [sum(1 for _ in group) for _, group in groupby(tones)]
                #If the longest run is 3 or more, print an * before the line
                if max(count_dups, default=0) >= 3:
                    print('* ' + line, file=f)
                #Otherwise, just print the line
                else:
                    print(line, file=f)
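
The introduction mentions wanting to mark which tone is repeated, and how many times. One possible approach is sketched below: keep the tone alongside each run length, then put a short summary after the asterisk. This is an untested idea rather than part of the notebook as used in class; the names `describe_runs` and `flag_line`, the `threshold` parameter, and the bracketed summary format are all assumptions. It uses the built-in `re` module so it runs on its own.

```python
import re
from itertools import groupby

def describe_runs(line, threshold=3):
    """Return (tone, length) pairs for runs of identical tones at least threshold long."""
    tones = re.findall(r"[1-5]", line)
    # groupby yields each tone together with its run of neighbours
    runs = [(tone, sum(1 for _ in group)) for tone, group in groupby(tones)]
    return [(tone, length) for tone, length in runs if length >= threshold]

def flag_line(line, threshold=3):
    """Prefix a line with '*' and a summary of its long runs, or return it unchanged."""
    long_runs = describe_runs(line, threshold)
    if not long_runs:
        return line
    summary = ", ".join(f"tone {tone} x{length}" for tone, length in long_runs)
    return f"* [{summary}] {line}"

# Hand-written example string, not real lyrics
print(flag_line("ni3-hao3-ma3-shi4-de"))  # * [tone 3 x3] ni3-hao3-ma3-shi4-de
```

To try it in Step 3, you could replace the if/else that prints each line with `print(flag_line(line), file=f)`, after stripping the line break from `line`.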