# Textual Encoding of Hindi-Urdu Poetry for Data-Rich Literary Analysis

A. Sean Pue, Michigan State University

pue@msu.edu

@seanpue

http://seanpue.com

Github: seanpue

Talk Repository: http://github.com/seanpue/dtsa2016


In [1]:
from IPython.display import IFrame

# Hindi/Urdu

## हिन्दी

* left-to-right devanagari script preferred
* more tatsam (from Sanskrit) words

## اردو

* right-to-left nastaliq script preferred
* more Perso-Arabic words

## "different *literary styles* based on the *same* linguistic subdialect" (Masica 1991)

# Research Question #1

## How to best analyze and encode texts in both scripts?

# Challenges



## Disambiguating between words

###  आम aam as عام or آم

### کیا as किया kiyaa or क्या kyaa
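
One way to see the problem is to treat each word as a row holding all three spellings and ask how many readings a given form has in another script. The sketch below does that for the two examples above; the `LEXICON` layout and the `readings` helper are hypothetical names for illustration, not the project's data model.

```python
# (roman, urdu, devanagari) rows, copied from the examples on this slide.
LEXICON = [
    ("aam", "عام", "आम"),
    ("aam", "آم", "आम"),
    ("kiyaa", "کیا", "किया"),
    ("kyaa", "کیا", "क्या"),
]
FIELDS = {"roman": 0, "urdu": 1, "devanagari": 2}


def readings(form, source, target):
    """Return every distinct `target` spelling recorded for `form` in `source`."""
    s, t = FIELDS[source], FIELDS[target]
    found = []
    for row in LEXICON:
        if row[s] == form and row[t] not in found:
            found.append(row[t])
    return found


# One Urdu spelling, two romanized readings: the conversion is one-to-many.
print(readings(LEXICON[2][1], "urdu", "roman"))
```

A form with more than one reading cannot be converted by lookup alone, which is why a disambiguating layer such as the transliteration is attached to each token.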


# Challenges

## Certain types of analysis require additional information:

* morphology
* grammatical markers, such as the iẓāfat (kitāb-e dil)
* compound-word boundaries
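
In the transliterated verses quoted in this talk, the iẓāfat is written as an `-e` attached to the first word (for example `paikar-e ta.sviir`). Assuming that convention, a regular expression is enough to recover the marker from the romanized text. The function name `find_izafat` is hypothetical, and this is only a sketch, not the project's parser.

```python
import re

# Assumption: the izafat appears as "-e" attached to the head word,
# followed by whitespace and then the modifying word.
IZAFAT = re.compile(r"(\S+)-e\s+(\S+)")


def find_izafat(line):
    """Return (head, modifier) pairs joined by an izafat marker."""
    return IZAFAT.findall(line)


line = "kaa;gazii hai pairahan har paikar-e ta.sviir kaa"
print(find_izafat(line))  # [('paikar', 'ta.sviir')]
```

The Urdu script usually leaves this marker unwritten, so it has to be recorded somewhere other than the Urdu text itself.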

# Background Project

## Desertful of Roses by Frances W. Pritchett

### http://www.columbia.edu/itc/mealac/pritchett/00ghalib/

# Hindi/Urdu Text and IPA from Transliteration

* roman *tokens* parsed into devanagari/nastaliq versions

* requires looking before and after for particular combinations

* involves both tokens and classes of tokens, e.g. consonants and vowels

* largely, but not entirely, accurate

* now using a lexer/parser
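
The points above can be illustrated with a toy converter that tokenizes a romanized word and then picks each output form by looking at the neighbouring tokens. This is a simplified sketch under stated assumptions, not the `graphparser` implementation: the symbol tables cover only a handful of letters, and every consonant followed by another consonant is joined with a virama, which is wrong for many real words.

```python
# Tiny, hypothetical symbol tables (a real scheme needs many more entries).
CONSONANTS = {"k": "क", "y": "य", "m": "म", "b": "ब", "s": "स"}
INDEPENDENT = {"a": "अ", "aa": "आ", "i": "इ"}      # vowel not after a consonant
MATRAS = {"a": "", "aa": "\u093e", "i": "\u093f"}  # vowel sign after a consonant
VIRAMA = "\u094d"                                  # joins a consonant cluster
TOKENS = sorted(set(CONSONANTS) | set(INDEPENDENT), key=len, reverse=True)


def tokenize(word):
    """Split a roman word into known tokens, longest match first."""
    tokens, pos = [], 0
    while pos < len(word):
        for token in TOKENS:
            if word.startswith(token, pos):
                tokens.append(token)
                pos += len(token)
                break
        else:
            raise ValueError("unknown symbol at %r" % word[pos:])
    return tokens


def to_devanagari(word):
    """Choose each output form from the token and the class of its neighbours."""
    tokens = tokenize(word)
    out = []
    for i, token in enumerate(tokens):
        before = tokens[i - 1] if i > 0 else None
        after = tokens[i + 1] if i + 1 < len(tokens) else None
        if token in CONSONANTS:
            out.append(CONSONANTS[token])
            if after in CONSONANTS:        # look ahead: consonant cluster
                out.append(VIRAMA)
        elif before in CONSONANTS:         # look behind: dependent vowel sign
            out.append(MATRAS[token])
        else:
            out.append(INDEPENDENT[token])
    return "".join(out)


for word in ["kyaa", "kiyaa", "aam", "bas"]:
    print(word, to_devanagari(word))
```

The same token (`aa`) produces a different character depending on the class of the token before it, which is why the rules are stated over classes rather than individual letters.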

# Workflow

* Have texts transcribed into Unicode
* Convert those files into spreadsheet *tables*
  * easy to manipulate by an editor or programmatically
  * very clean
* Attach transliteration and lemma information to the words
* Analyze as a DataFrame
* Reconstitute as TEI if necessary

In [2]:
import sys
sys.path.append('./graphparser/')
import graphparser as gp
import pandas as pd
import networkx as nx
import logging,sys,codecs,re,csv


![XKCD Python Cartoon](pres_files/python.png)

# Data File Structure

In [3]:
pd.set_option("display.max_rows",25)

In [4]:
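# DataFrame.from_csv has been removed from pandas; read_csv with index_col=0 is the equivalent call
pd.read_csv('data/miraji_nazmen.csv', encoding='utf-16', index_col=0)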


Unnamed: 0,type,transliteration,urdu,notes
0,TITLE,,چل چلاؤ,
1,LINE,,بس دیکھا اور پھر بھول گئے،,
2,TOKEN,bas,بس,
3,TOKEN,dekhaa,دیکھا,
4,TOKEN,aur,اور,
5,TOKEN,phir,پھر,
6,TOKEN,bhuul,بھول,
7,TOKEN,ga))e,گئے,
8,TOKEN,",",،,
9,LINE,,جب حُسن نگاہوں میں آیا,
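
Because every row carries a `type`, the flat table can be folded back into lines with their tokens. The sketch below does this with the standard `csv` module on the rows shown above, pasted inline so that it runs without the data file. The helper name `group_tokens` is hypothetical.

```python
import csv
import io

# The rows displayed above, inlined so the example is self-contained.
SAMPLE = """\
Unnamed: 0,type,transliteration,urdu,notes
0,TITLE,,چل چلاؤ,
1,LINE,,بس دیکھا اور پھر بھول گئے،,
2,TOKEN,bas,بس,
3,TOKEN,dekhaa,دیکھا,
4,TOKEN,aur,اور,
5,TOKEN,phir,پھر,
6,TOKEN,bhuul,بھول,
7,TOKEN,ga))e,گئے,
8,TOKEN,",",،,
9,LINE,,جب حُسن نگاہوں میں آیا,
"""


def group_tokens(rows):
    """Attach each TOKEN row to the most recent LINE row."""
    lines = []
    for row in rows:
        if row["type"] == "LINE":
            lines.append({"urdu": row["urdu"], "tokens": []})
        elif row["type"] == "TOKEN" and lines:
            lines[-1]["tokens"].append(row["transliteration"])
    return lines


rows = list(csv.DictReader(io.StringIO(SAMPLE)))
lines = group_tokens(rows)
print(lines[0]["tokens"])
```

The second `LINE` has no tokens yet because the displayed table is cut off after row 9.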


# Why digital analysis?

## Motivated by the strong and recurrent discourse about ‘sound’ in modern Hindi/Urdu poetry

## Hindi/Urdu as a language involves:
* Perso-Arabic vocabulary and forms (ghazal, masnavi, etc.)
* Indic (“Hindi”) vocabulary and forms
* Relation of meter and forms to literary community

## Possibilities of providing experiential or graphical “proof” to prose assertions



# Urdu Meters

* The meters are quantitative (not qualitative), based on length rather than stress
* Metrical units involve “short” and “long” vowels
* Metrical units are not necessarily syllables
  * E.g. raaj = - (raa j) [where = is long, - is short]

* Flexibilities
  * Long vowels can be shortened at the end of words
  * Metrical units can span words
  * There are particular word-based anomalies/flexibilities
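
A minimal sketch of the basic rule, before any of the flexibilities listed above, might look like the following. It assumes a simplified romanization (doubled letters for long vowels, a small set of consonant digraphs) and scans one word at a time. The names `kinds` and `scan` are hypothetical, and the sketch deliberately ignores final-vowel shortening, units that span words, and word-specific exceptions.

```python
LONG_VOWELS = ["aa", "ii", "uu", "ai", "au", "e", "o"]
SHORT_VOWELS = ["a", "i", "u"]
DIGRAPHS = ["chh", "sh", "kh", "gh", "bh", "ph", "th", "dh", "jh", "ch"]

KIND = {}
KIND.update({t: "C" for t in DIGRAPHS})
KIND.update({t: "v" for t in SHORT_VOWELS})
KIND.update({t: "V" for t in LONG_VOWELS})
ORDERED = sorted(KIND, key=len, reverse=True)  # try the longest symbols first


def kinds(word):
    """Classify a word as a list of C (consonant), v (short) and V (long vowel)."""
    out, pos = [], 0
    while pos < len(word):
        symbol = next((t for t in ORDERED if word.startswith(t, pos)), None)
        if symbol is None:
            out.append("C")  # any other letter is treated as a consonant
            pos += 1
        else:
            out.append(KIND[symbol])
            pos += len(symbol)
    return out


def scan(word):
    """Return the metrical units of one word: '=' is long, '-' is short."""
    k = kinds(word)
    units, i = [], 0

    def closes(j):
        # A consonant closes the unit if it is not followed by a vowel.
        return j < len(k) and k[j] == "C" and not (j + 1 < len(k) and k[j + 1] in "vV")

    while i < len(k):
        if k[i] == "C":
            i += 1  # consume the opening consonant
        if i < len(k) and k[i] == "V":
            units.append("=")
            i += 1
        elif i < len(k) and k[i] == "v":
            if closes(i + 1):
                units.append("=")  # short vowel + closing consonant
                i += 2
            else:
                units.append("-")
                i += 1
        else:
            units.append("-")  # consonant with no vowel of its own
    return " ".join(units)


print(scan("raaj"))  # = -
```

The final `j` of `raaj` becomes a unit of its own, which is the sense in which metrical units are not necessarily syllables.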






# Urdu Prosody

Urdu descriptions of meter come from Persian (Farsi) and, earlier, Arabic prosody, and follow a particular pattern (dating back to al-Khalil of Basra, born c. 718 CE)
* Describe metrical feet using text where certain vowels are “moving” or “silent,” e.g.
  * fāʿilātun = - = = فاعلاتن
  * fāʿilun = - = فاعلن
  * faʿūlun - = = فعولن
* Meters named using primary metrical “wheels” and different sorts of modifications to them
* Meter is referred to as a baḥr (“ocean”)
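
As a small illustration, a meter can be stored as a sequence of named feet and expanded into its long/short pattern. The dictionary and function names below are hypothetical; the foot patterns are the ones listed above.

```python
# "=" is a long unit and "-" a short one, as elsewhere in this talk.
FEET = {
    "fāʿilātun": "= - = =",
    "fāʿilun": "= - =",
    "faʿūlun": "- = =",
}


def meter_pattern(feet):
    """Expand a list of foot names into a pattern, feet separated by ' / '."""
    return " / ".join(FEET[name] for name in feet)


names = list(FEET)
# Three times the first foot followed by the second one.
print(meter_pattern([names[0]] * 3 + [names[1]]))
```

This prints `= - = = / = - = = / = - = = / = - =`, the pattern given for the Ghalib verse quoted in this talk.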



![Naqsh Faryaadii](pres_files/naqsh_faryaadii.png)
* Meter: = - = = / = - = = / = - = = / = - =
    
نقش فریادی ہے کس کی شوخی تحریر کا

کاغذی ہے پیرہن ہر پیکر تصویر کا


naqsh faryaadii hai kis kii sho;xii-e ta;hriir kaa

kaa;gazii hai pairahan har paikar-e ta.sviir kaa


नक़्श फ़रयादी है किस की शोख़ी-ए तहरीर का 

काग़ज़ी है पैरहन हर पैकर-ए तस्वीर का


# Computational Problem

How to computationally scan Hindi/Urdu poetry in a scalable and effective way?

What is topic modeling?

In [5]:
import pydot

dot_object = pydot.Dot(graph_name="main_graph",rankdir="LR", labelloc='b', 
                       labeljust='r', ranksep=1)

topic1 = pydot.Node(name='topic1', texlbl=r'topic1', label='Topic #1', shape='square')
dot_object.add_node(topic1)
topic2 = pydot.Node(name='topic2', texlbl=r'topic2', label='Topic #2', shape='square')
dot_object.add_node(topic2)
#topic3 = pydot.Node(name='topic3', texlbl=r'topic3', label='عاشق', shape='square', fontname="Jameel Noori Nastaleeq")
#dot_object.add_node(topic3)

plate_document = pydot.Cluster(graph_name='plate_document', label='Document', fontsize=24)

word1= pydot.Node(name='word', texlbl=r'\word', label='Word')
plate_document.add_node(word1)
word2= pydot.Node(name='word2', texlbl=r'\word', label='Word')
plate_document.add_node(word2)
word3= pydot.Node(name='word3', texlbl=r'\word', label='Word')
plate_document.add_node(word3)


# add plate k to graph
dot_object.add_subgraph(plate_document)


dot_object.add_edge(pydot.Edge(topic1, word1))
dot_object.add_edge(pydot.Edge(topic1, word2))
dot_object.add_edge(pydot.Edge(topic2, word3))
#dot_object.add_edge(pydot.Edge(node_theta, node_z))
#dot_object.add_edge(pydot.Edge(node_z, node_w))
#dot_object.add_edge(pydot.Edge(node_w, node_beta, dir='back'))
#dot_object.add_edge(pydot.Edge(node_beta, node_eta, dir='back'))
dot_object.write('graph.dotfile', format='raw', prog='dot')

True

In [6]:
dot_object.write_png('topic_model.png', prog='dot')
from IPython.display import Image
#Image('topic_model.png')

![Topic Model Diagram](topic_model.png)

In [7]:
from gensim import corpora, models, similarities
import collections,operator,sys,numpy,pandas
from jinja2 import Template


sys.path.append('graphparser/')
from graphparser import GraphParser
urdup = GraphParser('graphparser/settings/urdu.yaml')

with open('ghalib-concordance/output/lemma_documents.txt','r') as f:
    text = f.read()

verses = text.split('\n')
verses_orig=[urdup.parse(v).output for v in verses]
assert(len(verses)==1461)
tokens=[]

for v in verses:
    tokens+= v.split(' ')

# high-frequency function words and pen names to exclude from the model
stoplist=['honaa','','karnaa',
'kaa','se','me;n','nah','vuh','kih','ko','jaanaa','kii','nahii;n','mai;n','kyaa','meraa','jo','ham',
'bhii','to','kahnaa','yih','aanaa','ne','teraa','dekhnaa','aur','par','denaa',';gaalib','ko))ii','kyuu;n',
'hii','pah','bah','gar','rahnaa','tuu','phir','apnaa','har','ay','ik','kis','tum','kuchh',
'agar','ek','asad','ab','chaahiye','puuchhnaa','yuu;n','hamaaraa',
'mauj','yaa;n','nikalnaa','yaa','milnaa','liye','yak',"jaan'naa",'achchhaa','haa))e','vaa;n','tak','paanaa',
'magar','taa','pa;rnaa','khe;nchnaa','kabhii','lekin','u;thnaa','varnah','chalnaa',
'phir','lenaa','denaa','kahaa;n','sar','jab',"go","ban'naa","ya((nii","vuhii","aap","saknaa","kisii","yihii",
'jitnaa','saa','pahle','lagnaa','vale','mat','sahii','kam',
'bahut','aisaa','qadar','aage','abhii','az','ba;gair','kyuu;nkar','buraa',
'hanuuz','baar']

# treat every lemma ending in -naa as a verb infinitive, except the noun tamannaa
verbs=[w for w in set(tokens) if w.endswith('naa') and w not in ('tamanna','tamannaa')]

stoplist+=verbs

In [8]:
texts = [[word for word in verse.lower().split() if word not in stoplist] for verse in verses]

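# count every token in a single pass rather than calling list.count() once per word
all_tokens = [word for text in texts for word in text]
token_counts = collections.Counter(all_tokens)
tokens_once = set(word for word, count in token_counts.items() if count == 1)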

texts = [[word for word in text if word not in tokens_once] for text in texts]
texts = [[urdup.parse(word).output for word in text] for text in texts]

dictionary = corpora.Dictionary(texts)

corpus = [dictionary.doc2bow(text) for text in texts]

In [9]:
def gen_model(num_topics=15, passes=10,iterations=250,chunksize=10,workers=5):
    # train an LDA topic model on the bag-of-words corpus built above
    model = models.LdaMulticore(corpus, id2word=dictionary, num_topics=num_topics, chunksize=chunksize,
                                eval_every=10, passes=passes, iterations=iterations, workers=workers)
    return model
model=gen_model()


What is a topic?

usually a probability distribution over the words of the vocabulary
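
A minimal sketch of that idea: a topic is a mapping from words to probabilities that sum to one, and the "top words" of a topic are simply the most probable entries. The words and numbers below are invented for illustration and are not output from the model.

```python
# Hypothetical probabilities for illustration only; not taken from the model.
topic = {"dil": 0.40, "vafaa": 0.25, "gul": 0.20, "yaar": 0.15}


def top_words(topic, n):
    """Return the n most probable (word, probability) pairs of a topic."""
    return sorted(topic.items(), key=lambda pair: pair[1], reverse=True)[:n]


# A probability distribution must sum to one (up to rounding error).
assert abs(sum(topic.values()) - 1.0) < 1e-9

print(top_words(topic, 2))  # [('dil', 0.4), ('vafaa', 0.25)]
```

A trained model holds one such distribution per topic over the whole vocabulary, which is what the per-topic word lists display.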

Example: 15 topics from Ghalib's Divan

In [11]:


def get_verses():
    global model
    global corpus
    text_topics = [ model [x] for x in corpus ]
    da = numpy.zeros((len(text_topics),model.num_topics))
    for i, v in enumerate(text_topics):
        for topic, value in v:
            da[i,topic] = value
    df = pandas.DataFrame(da) # probably a way to compress the above
    verses_out = {}

    for i in range (model.num_topics):
        verses = []
        for x in df.sort_values(by=i, ascending=False)[i].index:
            v = df[i][x]
            if (v > 0):
                verses.append(verses_orig[x])

        verses_out['topic_'+str(i)]=verses
    return verses_out


num_words = 20
data = {'topic_words': [model.show_topic(i,topn=num_words) for i in range(model.num_topics)],
        'topic_verses': get_verses()}




In [12]:
for x in range(model.num_topics):
    print('Topic #',x+1)
    for w in data['topic_words'][x]:print(w)



Topic # 1
('دل', 0.022736092999611903)
('پانو', 0.021338669050970229)
('عشق', 0.012930060118660762)
('طرح', 0.010336583438885157)
('چشم', 0.0092438153010541979)
('سایہ', 0.0088866184981056893)
('حسرت', 0.008492627787419044)
('ناز', 0.0082843407414475139)
('جلوہ', 0.00798499554293787)
('لذّت', 0.0079605382647480027)
('برق', 0.0079605381751552632)
('وفا', 0.0073417809554842308)
('نالہ', 0.0066580176002507428)
('یار', 0.0065524500020972109)
('زنجیر', 0.0065242641033103135)
('ستم', 0.006475336314957207)
('ذوق', 0.0062236936802671054)
('یاد', 0.0061957259363934003)
('گریہ', 0.0057341318381815476)
('گرم', 0.0053361851137337249)
Topic # 2
('وفا', 0.027794057874284701)
('دل', 0.015887778292937818)
('گل', 0.015444968917307504)
('عشق', 0.014884167145019986)
('خیال', 0.012919727517173437)
('گویا', 0.010789902417500673)
('آنکھ', 0.010726919487730352)
('سلامت', 0.010460352269209879)
('نالہ', 0.0092749655295324904)
('عمر', 0.0087477229181823733)
('نفس', 0.0074443804013880899)
('دن', 0.00741654231409

Alternative Visualization as Interactive Word Clouds
using d3.js


In [13]:

clouds_template='''
<!DOCTYPE html>
<meta charset="utf-8">
<head>
<script type="text/javascript" src="d3/d3.js"></script>
<script type="text/javascript" src="d3-cloud/d3.layout.cloud.js"></script>
<script type="application/json" id="data">

{{topic_words_json}}

</script>


</head>

<body>
<div id="models" style="width:50%;float:left">
</div>
<div id="texts" style="width:50%;float:left">
</div>

<script>

var fill = d3.scale.category20();

var word_data;

function make_cloud(cloud,id){
    
    
    words = cloud.map(function(d){
        return {text:d[0],size:d[1]*2000}
      }).sort(function(a,b){
        return b.size - a.size;
      });
    
    word_data = words;
      
    d3.layout.cloud().size([800, 800])
      .words(words)
      .padding(1)
      .rotate(function() { return 0})//~~(Math.random() * 2) * 90; })
      .font("Impact")
      .fontSize(function(d) { return d.size; })
      .on("end", draw)
      .start();
    
    function show_text(id){
    
        d3.select("div#texts").selectAll('p').remove();
        for (i=0; i<10;i++){//topic_verses[id].length; i++){
            d3.select("div#texts").append("p").style("font-family", "Jameel Noori Nastaleeq").style("font-size","16").text(topic_verses[id][i]).append("br");
        }
        
 
    }
    
    
    function draw(words) {
      d3.select("div#models").append("svg")
          .attr("width", 400)
          .attr("height", 400)
        .attr("id",id)
        .on("click",function(d) {show_text(this.id) } )
        .append("g")
          .attr("transform", "translate(400,400)")
        .selectAll("text")
          .data(words)
        .enter().append("text")
          .style("font-size", function(d) { return d.size + "px"; })
          .style("font-family", "Jameel Noori Nastaleeq")
          .style("fill", function(d, i) { return 0;})//fill(i); })
          .attr("text-anchor", "middle")
          .attr("transform", function(d) {
            return "translate(" + [d.x, d.y] + ")rotate(" + d.rotate + ")";
          })
          .text(function(d) { return d.text; });
    }
    
    
}


var num_topics = {{num_topics}};

var json_data = JSON.parse(document.getElementById('data').innerHTML);
topic_words = json_data['topic_words'];
topic_verses = json_data['topic_verses'];
for (i=0;i<num_topics;i++) {
  id = "topic_"+i;
  make_cloud(topic_words[i], id);
  
}
</script>
</body>
</html>
'''

from IPython.display import IFrame
import os
import json
num_words = 100
    
count=0
last_fn = None
def serve_html(s,w,h):
    import os
    global count
    count+=1
    fn= '__tmp'+str(os.getpid())+'_'+str(count)+'.html'
    global last_fn
    last_fn = fn
    with open(fn,'w') as f:
        f.write(s)
    return IFrame('files/'+fn,w,h)

def gen_clouds():
    global model
    num_words = 100
    data = {'topic_words': [model.show_topic(i,topn=num_words) for i in range(model.num_topics)],
            'topic_verses': get_verses()}
    topic_words_json = json.dumps(data)
    s=Template(clouds_template).render(num_topics=model.num_topics,topic_words_json = topic_words_json)

    with open('word-cloud.html',"w") as f:
        f.write(s)
#    IFrame('word-cloud.html',width=1200,height=800)
    #return(serve_html(s,1200,800))

gen_clouds()
IFrame('word-cloud.html',width=1200,height=800)
#IFrame



# Why topic model?

## information extraction

## show changes over time or text


## compare one text/corpus with another


## get access to texts in different ways

# How does this model of the topic or theme align with Urdu-based rhetorical understandings?

## more specifically, how does it compare to the idea of the *maẓmūn* (theme)

The road of fresh themes is not closed

The gate of poetry is open until Doomsday

    -Valī Dakkanī (1667-1707)



*maẓmūn āfrīnī* Creation of themes

the beloved is a hunter

the beloved lies in wait for the prey

the hunter slaughters the prey

the hunter makes the prey into a kabob

the beloved is the prey




Perhaps as a Resource Description Framework (RDF) triple?

subject -> predicate -> object
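
As a rough sketch, such triples can be held as plain Python tuples and queried by pattern. The relations below are the ones drawn in this talk's example graph. The `match` helper is a hypothetical stand-in for a real RDF store and query language, not an RDF implementation.

```python
# (subject, predicate, object) tuples taken from the example graph in this talk.
TRIPLES = [
    ("Beloved", "hunts", "Lover"),
    ("Beloved", "exhibits", "Cruelty"),
    ("Lover", "loves", "Beloved"),
    ("Lover", "suffers", "Cruelty"),
]


def match(triples, subject=None, predicate=None, obj=None):
    """Return the triples that agree with every field that is not None."""
    return [
        t for t in triples
        if (subject is None or t[0] == subject)
        and (predicate is None or t[1] == predicate)
        and (obj is None or t[2] == obj)
    ]


# Everything asserted about the beloved as a subject.
print(match(TRIPLES, subject="Beloved"))
# Every theme in which cruelty is the object.
print(match(TRIPLES, obj="Cruelty"))
```

Unlike a bag of words, each entry here records who does what to whom, which is closer to how a *maẓmūn* is usually stated.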

In [14]:
import pydot

dot_object = pydot.Dot(graph_name="main_graph",rankdir="LR", labelloc='b', 
                       labeljust='r', ranksep=1)

node1 = pydot.Node(name='node1', texlbl=r'topic1', label='Subject', shape='square')
dot_object.add_node(node1)
node2 = pydot.Node(name='node2', texlbl=r'topic2', label='Object', shape='square')
dot_object.add_node(node2)
dot_object.add_edge(pydot.Edge(node1, node2,label="Predicate"))
#dot_object.write('graph.dotfile', format='raw', prog='dot')
dot_object.write_png('basic_triple.png', prog='dot')
from IPython.display import Image
#Image('basic_triple.png')

![Basic Triple](basic_triple.png)


In [15]:
#import pydot

dot_object = pydot.Dot(graph_name="main_graph",rankdir="LR", labelloc='b', 
                       labeljust='r', ranksep=1)

node1 = pydot.Node(name='node1', texlbl=r'topic1', label='Beloved', shape='square')
dot_object.add_node(node1)
node2 = pydot.Node(name='node2', texlbl=r'topic2', label='Lover', shape='square')
dot_object.add_node(node2)
node3 = pydot.Node(name='node3', texlbl=r'topic3', label='Cruelty', shape='square')
dot_object.add_node(node3)
dot_object.add_edge(pydot.Edge(node1, node2,label="hunts"))
dot_object.add_edge(pydot.Edge(node1, node3,label="exhibits"))
dot_object.add_edge(pydot.Edge(node2, node1,label="loves"))
dot_object.add_edge(pydot.Edge(node2, node3,label="suffers"))
#dot_object.write('graph.dotfile', format='raw', prog='dot')
dot_object.write_png('example_triple1.png', prog='dot')
from IPython.display import Image
#Image('example_triple1.png')

![Example Triple](example_triple1.png)

Thanks!

Sean Pue

pue@msu.edu

@seanpue