# <img style="float: left; padding-right: 10px; width: 45px" src="https://raw.githubusercontent.com/Harvard-IACS/2018-CS109A/master/content/styles/iacs.png"> CS109B Final Project Milestone 3 - LexRank Model

**Harvard University**<br/>
**Spring 2020**<br/>
**Group 32** 

<hr style="height:2pt">

In [1]:
#RUN THIS CELL 
import requests
from IPython.core.display import HTML
styles = requests.get("https://raw.githubusercontent.com/Harvard-IACS/2018-CS109A/master/content/styles/cs109.css").text
HTML(styles)

In [2]:
# same as EDA
# libraries
import json
import lzma
import pandas as pd
import numpy as np
from bs4 import BeautifulSoup
import re
from tqdm import tqdm
from IPython.core.display import display, HTML
import re
from nltk.tokenize import RegexpTokenizer
import datetime as dt
import math
from collections import Counter
from rouge import Rouge 
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()

## <a id='0'>Content</a>

- <a href='#1'>1. Introduction</a> 
- <a href='#2'>2. Citation Filter</a>
- <a href='#3'>3. Importing the legal documents</a>
- <a href='#4'>4. LexRank Implementation</a>
- <a href='#5'>5. Examples</a>
- <a href='#6'>6. Challenges to thematic segmentation</a>
- <a href='#7'>7. Conclusion</a>

## <a id='1'>1. Introduction</a>

In order to summarize the legal texts, I choose the LexRank model, which is one of the ways for extractive summarization.

There are two ways to summarize texts using NLP; extractive and abstractive. Since the extractive approach chooses sentences from the original text, it is less likely to create an incomprehensible summary and less likely to output a gramatically incorrect summary. However, in the extractive approach, you cannot use words that are not in the original text, so it is not possible to paraphrase or to use conjunctions to make summary more easy to read. In contrast, the abstractive allows you to summarize more flexibly because you are free to use words that are not in the original sentences. In addition, you can choose the length of summary. However, the disadvantage in this approach is that it is difficult to produce "natural" sentences. For this project, I decided to take the extractive way to ensure that important information is included in the summary of the legal documents.

The LexRank model works almost the same as Google Search -- it uses sentences as a node and similarity as edge/weight. For details, please refer to [Erkan and Radev (2004)](https://www.aaai.org/Papers/JAIR/Vol22/JAIR-2214.pdf).

## <a id='2'>2. Citation Filter</a>

According to [Farzindar and Lapalme (2004)](https://www.aclweb.org/anthology/W04-1006.pdf), the citations account for a large part of the text of the judgment, but they are not considered relevant for the summary. Therefore, I removed sentences that include specific abbreviation or words.

In [3]:
import re

# Reference: https://library.csustan.edu/apalegal

def citation_filter(text_list):
    new_list = []
    
    for i in range(len(text_list)):
        sentence = text_list[i]
        
        if re.search('\(\d\d\d\d\)', sentence) != None:
            pass
        elif re.search('v\.', sentence) != None:
            pass
        elif re.search('vs\.', sentence) != None:
            pass
        elif re.search('§', sentence) != None:
            pass
        elif re.search('R\.', sentence) != None:
            pass
        elif re.search('Rule', sentence) != None:
            pass
        # Ark. = Arkansas
        elif re.search('Ark\.', sentence) != None:
            pass
        else:
            new_list.append(sentence)
            
    return new_list

## <a id='3'>3. Importing the legal documents</a>

In [4]:
# same as EDA
# defining a fucnction to remove \n and HTML tags
def text_cleaner(text):
    text_divided = text.splitlines()
    text_divided_clean = " ".join(text_divided)
    return text_divided_clean

In [5]:
# same as EDA
# The file size for some states are too large to open into memory
# This function loads individual cases into memory, parses headnotes and 
# opinions, cleans the text, tokenizes the text, and returns counts of tokens
# for each case.

tokenizer = RegexpTokenizer('\s+', gaps=True)

def get_counts(state):
    cases = []
    with lzma.open("../" + state + '-text/data/data.jsonl.xz', 'r') as jsonl_file:
        for case in jsonl_file:
            c = json.loads(str(case, 'utf-8'))

            date = c['decision_date']
            
            headnotes = text_cleaner(c['casebody']['data']['head_matter'])
            headnotes_tokenized = tokenizer.tokenize(headnotes)
            num_headnotes = len(headnotes_tokenized)

            opinions = c['casebody']['data']['opinions']
            if opinions == []:
                num_opinions = 0
            else:
                opinions = text_cleaner(opinions[0]['text'])
                opinions_tokenized = tokenizer.tokenize(opinions)
                num_opinions = len(opinions_tokenized)
            cases.append({'date':date, 'num_headnotes':num_headnotes, 'headnotes': headnotes, 'num_opinions':num_opinions, 'opinions':opinions})
        return pd.DataFrame(cases)

In [6]:
%%time

# use Arkansas data as an example
states = ['Arkansas']
counts_ar = get_counts(states[0])

CPU times: user 43.3 s, sys: 522 ms, total: 43.9 s
Wall time: 43.9 s


In [7]:
counts_ar.head(5)

Unnamed: 0,date,num_headnotes,headnotes,num_opinions,opinions
0,1829-11,29,"Case No. 4,822a. FISHER v. REIDER. [Hempst. 82...",230,OPINION OF THE COIÍRT. This is an action of de...
1,1828-05,28,"Case No. 4,785a. FIKES v. BENTLEY. [Hempst. 61...",62,OPINION OP THE COURT. This is an appeal from t...
2,1836-02,27,"Case No. 4,863a. FLETCHER v. ELLIS. [Hempst. 3...",616,"CROSS, Judge. The record in this case shows th..."
3,1999-07-15,46,Michael NORRIS v. STATE of Arkansas CR 98-1429...,3936,"W. H.“Dub” Arnold, Chief Justice. This is a ca..."
4,1999-10-07,39,Roger Allen HAMMON v. STATE of Arkansas CR 98-...,1788,"Ray Thornton, Justice. Appellant brings this a..."


## <a id='4'>4. LaxRank Implementation</a>

In [8]:
# compute term frequency
# Reference: http://www.tfidf.com/
# sentence: dictionary
def termfreq(sentence):

    dict_counts = Counter(sentence)
    max_tf = max(dict_counts.values())

    tf_dict = {}
    
    # create a dictionary with word + term frequncy
    for word, tf in dict_counts.items():
        tf_dict[word] = tf / max_tf
    
    # Output Example: {'word': tf, 'word': tf, ...}
    return tf_dict

# Create a dictionary with words and their IDF: Inverse Document Frequncy
# Reference: http://www.tfidf.com/
# list_sentences: list
def compute_idf(list_sentences):
    
    idf_dict = {}
    sentences_len = len(list_sentences)

    for sentence in list_sentences:
        for word in sentence:
            if word not in idf_dict:
                
                # if not in idf_dict, calculate idf and append it to the dictionary
                number_appearance = 0
                for sen in list_sentences:
                    if word in sen:
                        number_appearance += 1
                idf_dict[word] = math.log(sentences_len / number_appearance)
                
    return idf_dict

In [9]:
# Reference: http://www.cs.cmu.edu/afs/cs/project/jair/pub/volume22/erkan04a-html/erkan04a.html
# Chapter 3 : Centrality-based Sentence Salience
def idf_modified_cosine(list_sentences, list_sentence_x, list_sentence_y):
    
    dict_x = termfreq(list_sentence_x)
    dict_y = termfreq(list_sentence_y)
    idf_dict = compute_idf(list_sentences)

    set_unique_words1 = set(list_sentence_x)
    set_unique_words2 = set(list_sentence_y)

    words_xy = set_unique_words1 & set_unique_words2
    
    numerator = 0
    for word in words_xy:
        numerator = numerator + dict_x[word] * dict_y[word] * ((idf_dict[word])**2)
        
    denominator_left_quad = 0
    for word in set_unique_words1:
        denominator_left_quad = denominator_left_quad + ((dict_x[word] * idf_dict[word]) ** 2)
        
    denominator_right_quad = 0
    for word in set_unique_words2:
        denominator_right_quad = denominator_right_quad + ((dict_y[word] * idf_dict[word]) ** 2)
    
    denominator_left = math.sqrt(denominator_left_quad)
    denominator_right = math.sqrt(denominator_right_quad)
    
    if denominator_left == 0 or denominator_right == 0:
        print("Error! 0 in denominator!")
    else:
        return numerator / (math.sqrt(denominator_left) * math.sqrt(denominator_right))

In [10]:
# Reference: http://www.cs.cmu.edu/afs/cs/project/jair/pub/volume22/erkan04a-html/erkan04a.html
# 3.2 Eigenvector Centrality and LexRank
# computing the stationary distribution of a Marcov chain
def power_method(M, N, eps):

    # initialization
    p = np.full((N,), 1/N)

    delta = 999

    while delta >= eps:
        p_t = np.dot(np.transpose(M), p)
        delta = np.linalg.norm(p_t - p)
        p = p_t

    return p

In [11]:
# Reference: http://www.cs.cmu.edu/afs/cs/project/jair/pub/volume22/erkan04a-html/erkan04a.html
# 3.2 Eigenvector Centrality and LexRank, Algorithm #3

# Lexrank: summarizing sentences
# computing Lexrank Scores
def lex_rank(list_sentences, n, t):

    cosinematrix = np.zeros((n, n))
    degree = np.zeros((n,))

    for i in range(n):
        for j in range(n):
            cosinematrix[i][j] = idf_modified_cosine(list_sentences, list_sentences[i], list_sentences[j])
            if cosinematrix[i][j] > t:
                cosinematrix[i][j] = 1
                degree[i] += 1
            else:
                cosinematrix[i][j] = 0

    for i in range(n):
        for j in range(n):
            cosinematrix[i][j] = cosinematrix[i][j] / degree[i]

    L = power_method(cosinematrix, n, t)

    return zip(list_sentences, L)

## <a id='5'>5. Examples</a>

### Example 1

In [12]:
test_opinions = counts_ar.iloc[2,4]

In [13]:
test_opinions

"CROSS, Judge. The record in this case shows that the plaintiff in error [Frederick Fletcher] brought an action of trespass on the case against the. defendant [William Ellis], in the Conway circuit court, and in his declaration alleged “that the said plaintiff and one Alexander Rogers, were indebted to Daniel Gilmore in a large sum of money, namely,' in the amount of fifty-five dollars, upon which said Gilmore had brought suit and obtained judgment, and sued out execution against the plaintiff and the said Rogers, and the plaintiff avers that he and Rogers had, in the county of Conway, sufficient goods and chattels to have satisfied the execution, and the plaintiff avers that the defendant being an evil disposed person, fond of encouraging litigation and fomenting strife, and wishing to harass, impoverish, and distress the plaintiff, did, on the first day of October, 1834, at the county of Conway, and within the jurisdiction of this court, maliciously persuade and procure the said Dani

In [14]:
# Reference: https://nlpforhackers.io/splitting-text-into-sentences/

from pprint import pprint
from nltk.tokenize.punkt import PunktSentenceTokenizer, PunktTrainer
 
trainer = PunktTrainer()
trainer.INCLUDE_ALL_COLLOCS = True
trainer.train(test_opinions)
 
tokenizer = PunktSentenceTokenizer(trainer.get_params())

In [15]:
test_opinions_list = tokenizer.tokenize(test_opinions)

In [16]:
def text_summarizer(text, n, t=1):
    """
    n: number of sentences
    t: error tolerance
    """
    words_list = []
    for i in range(len(text)):
        words = text[i].split()
        words_list.append(words)
    zipped = lex_rank(words_list, len(text), t)
    unzipped = list(zip(*zipped))
    scores = np.array(unzipped[1])
    highest_index = scores.argsort()[-n:][::-1]
    summarized = []
    high_scores = []
    for i in range(len(highest_index)):
        sentence = text[i]
        score = scores[i]
        high_scores.append(score)
        summarized.append(sentence)     
    print("\nOriginally", len(text), "sentences\n")
    print("Summarized in", n, "sentences\n")
    print("Summarized:  ", summarized,"\n")
    return summarized

In [17]:
%%time

# 616 words
result1 = text_summarizer(citation_filter(test_opinions_list), 4, 0.1)


Originally 14 sentences

Summarized in 4 sentences

Summarized:   ['CROSS, Judge.', 'The record in this case shows that the plaintiff in error [Frederick Fletcher] brought an action of trespass on the case against the.', "defendant [William Ellis], in the Conway circuit court, and in his declaration alleged “that the said plaintiff and one Alexander Rogers, were indebted to Daniel Gilmore in a large sum of money, namely,' in the amount of fifty-five dollars, upon which said Gilmore had brought suit and obtained judgment, and sued out execution against the plaintiff and the said Rogers, and the plaintiff avers that he and Rogers had, in the county of Conway, sufficient goods and chattels to have satisfied the execution, and the plaintiff avers that the defendant being an evil disposed person, fond of encouraging litigation and fomenting strife, and wishing to harass, impoverish, and distress the plaintiff, did, on the first day of October, 1834, at the county of Conway, and within the 

#### Rouge Scores

In [18]:
string1 = ''.join(result1)

rouge = Rouge()
scores = rouge.get_scores(string1, counts_ar.iloc[2,2])
print("Original Headnote: ", counts_ar.iloc[2,2], "\n\n")
print("Rouge F1 Score:", scores[0]['rouge-1']['f'])

Original Headnote:  Case No. 4,863a. FLETCHER v. ELLIS. [Hempst. 300.] Superior Court, Territory of Arkansas. Feb. 1836. Before CROSS and TELL, Judges. 1 [Reported by Samuel H. Hempstead, Esq.] 


Rouge F1 Score: 0.02205882162440544


#### Example 2

In [19]:
test2_opinions = counts_ar.iloc[182,4]
test2_opinions_list = tokenizer.tokenize(test2_opinions)

In [20]:
%%time

result2 = text_summarizer(citation_filter(test2_opinions_list), 15, 0.1)


Originally 44 sentences

Summarized in 15 sentences

Summarized:   ['Battle, J.', 'The Bank of Malvern sued J. W. Burton and William Kilpatrick, in the Hot Springs circuit court, upon a note executed by them to it for the sum of $349.50, bearing date the 12th day of May, 1896, and due ninety days after date.', 'The defendants answered, and pleaded usury.', 'The plaintiff filed a reply, which, on motion of defendants, was stricken from the files of the court.', 'As it was not restored to record by bill of exceptions, it is no longer in the ea.se.', 'The cause, both parties consenting, was submitted to the court sitting as a jury.', 'J. W. Burton testified as follows on his own behalf: “I am one of the defendants in the above-entitled cause, and I executed the note sued on herein.', 'The note was due ninety days from its date.', 'I paid $16.00 interest in advance for the extension of the note, which note was for $349.50, dated May 12, 1896, which interest was in excess of ten per cent, 

#### Rouge Scores

In [21]:
string2 = ''.join(result2)

rouge = Rouge()
scores = rouge.get_scores(string2, counts_ar.iloc[182,2])
print("Original Headnote: ", counts_ar.iloc[182,2], "\n\n")
print("Rouge F1 Score:", scores[0]['rouge-1']['f'])

Original Headnote:  Bank of Malvern v. Burton. Opinion delivered February 10, 1900. 1. Usury—Renewal Note.—Where the note sued on was the last of a series of usurious notes given in renewal of a note untainted with usury, plaintiff was entitled to amend the complaint so as to recover on the original note. (Page 429.) 2. Pleading—Amendment to Conform to Proof.—Where no objection was taken to the admission of evidence that the note sued on was given in renewal of a valid note, the complaint was properly treated by the trial court as amended to conform to the proof. (Page 429.) Appeal from Hot Springs Circuit Court. Alexander M. Duffie, Judge. JS. M. Vance, Jr., for appellant. It was error to strike appellant’s reply from the files. 44 S. W. 393; 99 N. Car. 107. The usury, if any, in the renewal notes did not affect the consideration, which was free from usury. Hence the pleadings should have been considered amended by the proof, and judgment given for the original debt. 29 Ark. 323; 42 A

#### Example 3

In [22]:
test3_opinions = counts_ar.iloc[175,4]
test3_opinions_list = tokenizer.tokenize(test3_opinions)

In [23]:
%%time

result3 = text_summarizer(citation_filter(test3_opinions_list), 20, 0.1)


Originally 32 sentences

Summarized in 20 sentences

Summarized:   ['Hughes, J., (after stating the facts.)', 'The question arises on the construction of several statutes relating to county convicts.', '• The act of March 10, 1877, provides as follows: “Sec.', '4.', 'When any person shall be convicted of any misdemeanor under the laws of this state by any court of competent jurisdiction, the court shall render judgment against the person so convicted, which judgment shall direct that the person convicted be put to labor in any manual labor workhouse, or on any bridge or other public improvement, or that the person be hired out to some person as hereinafter provided, until the fine and costs are paid, and which shall not exceed one day for each seventy-five cents of the fine and costs.” Acts 1877, p.', '74; Sand.', '& H.', 'The following is taken from the act of March 22, 1881: “Sec.', '5.', 'That whenever any prisoner shall be convicted of a misdemeanor by any court or justice of the 

#### Rouge Scores

In [24]:
string3 = ''.join(result3)

rouge = Rouge()
scores = rouge.get_scores(string3, counts_ar.iloc[175,2])
print("Original Headnote: ", counts_ar.iloc[175,2], "\n\n")
print("Rouge F1 Score:", scores[0]['rouge-1']['f'])

Original Headnote:  State v. McNally. Opinion delivered March 10, 1900. County Convict—Per Diem Allowance.—The act of April 12,1899 (? 3), relating to county convicts, which amended the act of March 13,1883, upon the same subject by increasing the per diem allowance of a county convict delivered to the county contractor from 50 cents to 75 cents per day was not intended to be retroactive, nor to apply to the case of one convicted prior to its passage. (Page 583.)" Appeal from Jefferson Circuit Court. Antonio B. Grace, Judge. STATEMENT BY THE COURT. Petition by appellee for habeas corpus, alleging as follows: On the 6th of April, 1899, appellee was convicted of an assault in the court below, and fined $50, making, with the costs, $109. On the 11th of the month she was committed to the custody of R. R. Adams, contractor for county prisoners of the county, where she has continuously served at hard labor; and she is entitled to credit in the month of April 19 days, May 31 days, June 30 day

## <a id='6'>6. Challenges to thematic segmentation</a>

In [Farzindar and Lapalme (2004)](https://www.aclweb.org/anthology/W04-1006.pdf), thematic segmentation by linguistic markars is introduced as a way to more easily summarize legal documents. This is because we can follow the structure of legal documets, and we are less likely to miss important information. I tried to implement the thematic segmentation, but I observed, for example, linguistic markars for conclusion in the beginning of a legal document. Therefore, I decided not to use thematic segmentation for summarization.

In [25]:
# Reference: https://www.aclweb.org/anthology/W04-1006.pdf

markars_introduction = ['application for judicial review', 'application to review a decision', 'motion filed by', 'Statement of Claim']

markars_context = ["advise","indicate","concern","request"]

markars_juridical_analysis = ['this court','In reviewing',
                            'Pursuant to section','As I have stated','In the present case']

markars_conclusion = ['note','accept','summarise','scrutinize','think','say','satisfy','discuss','conclude','find','believe','reach','persuade',
                      'agree','indicate','review']

In [26]:
markars = [markars_introduction, markars_context, markars_juridical_analysis, markars_conclusion]

def markar_detector(text_list):
    for i in range(len(text_list)):
        sentence = text_list[i]
        
        for markar in markars:
            for j in range(len(markar)):
                markar_word = markar[j]
                TF = markar_word in sentence
                if TF == True:
                    if markar == markars_context:
                        type_markar = "context"
                        print("Linguistic markar '"+markar_word+"' detected! Sentence #", i, "of "+str(len(text_list))+". This is",type_markar,"markar.")
                    elif markar == markars_juridical_analysis:
                        type_markar = "juridical analysis"
                        print("Linguistic markar '"+markar_word+"' detected! Sentence #", i, "of "+str(len(text_list))+". This is",type_markar,"markar.") 
                    else:
                        type_markar = "conclusion"
                        print("Linguistic markar '"+markar_word+"' detected! Sentence #", i, "of "+str(len(text_list))+". This is",type_markar,"markar.")   

In [27]:
test_text4 = counts_ar.iloc[5,4]
test_text_list4 = tokenizer.tokenize(test_text4)

markar_detector(test_text_list4)

Linguistic markar 'find' detected! Sentence # 9 of 607. This is conclusion markar.
Linguistic markar 'indicate' detected! Sentence # 28 of 607. This is context markar.
Linguistic markar 'indicate' detected! Sentence # 28 of 607. This is conclusion markar.
Linguistic markar 'indicate' detected! Sentence # 64 of 607. This is context markar.
Linguistic markar 'indicate' detected! Sentence # 64 of 607. This is conclusion markar.
Linguistic markar 'conclude' detected! Sentence # 92 of 607. This is conclusion markar.
Linguistic markar 'request' detected! Sentence # 97 of 607. This is context markar.
Linguistic markar 'request' detected! Sentence # 99 of 607. This is context markar.
Linguistic markar 'indicate' detected! Sentence # 111 of 607. This is context markar.
Linguistic markar 'indicate' detected! Sentence # 111 of 607. This is conclusion markar.
Linguistic markar 'agree' detected! Sentence # 141 of 607. This is conclusion markar.
Linguistic markar 'conclude' detected! Sentence # 145 

## <a id='7'>7. Conclusion</a>

Although the LexRank Model summarizes the legal documents with a reasonably long headnote well (Rouge F1 Score around 0.35), there is a couple of ways to improve the LexRank model.

- For example, we missed conclusion (juridical decision) in Example 2. This is because I do not implement thematic segmentation and the model cannot recognize the conclusion sentences as important.
- Regarding thematic segmentation, we should have better lists of linguistic markars. In addition, we could set a stop condition for thematic segmentation (e.g. if we observe conclusion markars N times in a row, we recognize all the texts after that as conclusion.)
- We should also think about the citation filter as headnotes include citations especially in conclusion. We should not the citation filter for summarizing conclusion. (But in order to achieve it, we need to implement thematic segmentation.)