# Supreme Court Oral Argument Analysis
#### This notebook uses NLP and machine learning to test whether a Supreme Court Justice's vote can be predicted from oral arguments, which take place months before a decision. Justices speak during two distinct parts of a case: first while the Petitioner (P) is presenting, and then while the Respondent (R) is presenting. The hypothesis is that a Justice who has already decided how to vote questions each side differently, so the language used in questioning may signal that vote.

Some notes:
1. This notebook is a small sample of research from the Columbia Law School Programming Lab.
2. It uses traditional machine learning algorithms (Naive Bayes, linear SVMs, Logistic Regression) rather than deep learning. It would be interesting to see how LSTMs would perform.
3. It does not use word embeddings such as Word2Vec. Embeddings would probably improve predictive performance.
4. The dataset contains 215760 utterances from 1128 different cases before data cleaning (which removes some bad rows).

Notebook author: Zack Nagler

In [2]:
# Must come first: makes "/" true division under Python 2 (used for the dummy scores)
from __future__ import division

import pandas as pd
import numpy as np
import sklearn
import statsmodels.formula.api as smf
import pymysql
from textblob import TextBlob

# modeling
from sklearn import grid_search
from sklearn import metrics
from sklearn.cross_validation import train_test_split
from sklearn.cross_validation import cross_val_score
from sklearn.grid_search import GridSearchCV
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import LogisticRegression
from sklearn.linear_model import SGDClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.dummy import DummyClassifier
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import accuracy_score
from sklearn.pipeline import Pipeline

# text features
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.feature_extraction.text import TfidfVectorizer

#viz
import matplotlib
import seaborn as sns
from matplotlib import pyplot as plt
%matplotlib inline

sns.set(color_codes=True)

#example
from sklearn.datasets import fetch_20newsgroups

In [3]:
# Each line of the file is one utterance; fields are separated by " +++$+++ ".
rows = []
with open('full_conversations.txt') as fp:
    for line in fp:
        row = line.split(" +++$+++ ")
        rows.append(row)
print(len(rows))
cols = ["docket",
        "id",
        "after_prev", 
        "speaker", 
        "is_justice", 
        "justice_vote", #want this (5)
        "presentation_side", #want this (6)
        "utterance", #want this (7)
        ]
pd.set_option('display.max_colwidth', 40)
df = pd.DataFrame(rows)
df.columns = cols
df.head()


215760


Unnamed: 0,docket,id,after_prev,speaker,is_justice,justice_vote,presentation_side,utterance
0,03-855,1,False,JUSTICE STEVENS,JUSTICE,RESPONDENT,PETITIONER,We will now hear argument in the cas...
1,03-855,2,True,MR. SACKS,NOT JUSTICE,,PETITIONER,"Justice Stevens, and may it please t..."
2,03-855,3,True,JUSTICE KENNEDY,JUSTICE,PETITIONER,PETITIONER,"Well, is it your position that whene..."
3,03-855,4,True,MR. SACKS,NOT JUSTICE,,PETITIONER,No. It is not our position that that...
4,03-855,5,True,JUSTICE BREYER,JUSTICE,PETITIONER,PETITIONER,Why does not having a possessory rig...
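
The record format parsed above can also be sketched with named fields. This is a minimal illustration rather than the notebook's own code: the `parse_line` helper is hypothetical, and the sample line reuses the metadata of the first table row with an invented utterance. Unlike the loop above, it strips the trailing newline from the last field and rejects lines with the wrong number of fields.

```python
# Hypothetical helper illustrating the " +++$+++ " record format used above.
FIELDS = ["docket", "id", "after_prev", "speaker", "is_justice",
          "justice_vote", "presentation_side", "utterance"]
SEP = " +++$+++ "

def parse_line(line):
    """Split one raw line into a dict keyed by field name."""
    parts = line.rstrip("\n").split(SEP)
    if len(parts) != len(FIELDS):
        raise ValueError("expected %d fields, got %d" % (len(FIELDS), len(parts)))
    return dict(zip(FIELDS, parts))

# Sample line: metadata mirrors the first row shown above; the utterance is made up.
sample = SEP.join(["03-855", "1", "False", "JUSTICE STEVENS", "JUSTICE",
                   "RESPONDENT", "PETITIONER", "We will now hear argument.\n"])
record = parse_line(sample)
print(record["speaker"] + " | " + record["utterance"])
```

Failing loudly on malformed lines is one way to surface the "bad data" mentioned in the notes before it reaches the DataFrame.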


In [4]:
counts = df[df.is_justice=="JUSTICE"].speaker.value_counts()
counts

JUSTICE SCALIA       17485
JUSTICE ROBERTS      13835
JUSTICE BREYER       13790
JUSTICE GINSBURG      9792
JUSTICE KENNEDY       8547
JUSTICE SOTOMAYOR     7573
JUSTICE STEVENS       5090
JUSTICE SOUTER        5052
JUSTICE ALITO         5048
JUSTICE KAGAN         3508
JUSTICE O'CONNOR       968
JUSTICE REHNQUIST      598
JUSTICE THOMAS          11
JUDGE SCALIA             5
JUDGE BREYER             3
JUDGE GINSBURG           3
JUDGE ALITO              3
JUDGE SOTOMAYOR          2
JUSTICE ROBERT           2
JUDGE SOUTER             1
JUTICE SCALIA            1
JUSTINE GINSBURG         1
JUTICE BREYER            1
JUSTICE KENNED           1
JUDGE STEVENS            1
JUST SCALIA              1
JUDGE ROBERTS            1
JUSTCIE BREYER           1
dtype: int64

In [5]:
speakers = counts[counts>100].index.values
print(speakers)
df = df[df.speaker.isin(speakers)]
len(df)

['JUSTICE SCALIA' 'JUSTICE ROBERTS' 'JUSTICE BREYER' 'JUSTICE GINSBURG'
 'JUSTICE KENNEDY' 'JUSTICE SOTOMAYOR' 'JUSTICE STEVENS' 'JUSTICE SOUTER'
 'JUSTICE ALITO' 'JUSTICE KAGAN' "JUSTICE O'CONNOR" 'JUSTICE REHNQUIST']


91286
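
The `counts > 100` filter above drops the misspelled labels (`JUTICE SCALIA`, `JUDGE BREYER`, and so on) along with their utterances. An alternative, sketched below under stated assumptions, is to map such labels onto the canonical ones before filtering. The `normalize_speaker` helper and the 0.8 cutoff are hypothetical choices, not part of the notebook; the sketch compares surnames only, so it should be applied only to rows already flagged `is_justice == "JUSTICE"`.

```python
import difflib

# Canonical labels: the twelve speakers kept by the filter above.
CANONICAL = ["JUSTICE SCALIA", "JUSTICE ROBERTS", "JUSTICE BREYER",
             "JUSTICE GINSBURG", "JUSTICE KENNEDY", "JUSTICE SOTOMAYOR",
             "JUSTICE STEVENS", "JUSTICE SOUTER", "JUSTICE ALITO",
             "JUSTICE KAGAN", "JUSTICE O'CONNOR", "JUSTICE REHNQUIST"]
SURNAMES = {label.split(" ", 1)[1]: label for label in CANONICAL}

def normalize_speaker(label, cutoff=0.8):
    """Map a possibly misspelled justice label to a canonical one, or None."""
    surname = label.strip().split(" ")[-1]
    match = difflib.get_close_matches(surname, list(SURNAMES), n=1, cutoff=cutoff)
    return SURNAMES[match[0]] if match else None

for raw in ["JUTICE SCALIA", "JUDGE BREYER", "JUSTICE KENNED", "JUSTICE ROBERT"]:
    print(raw + " -> " + str(normalize_speaker(raw)))
```

Labels with no close surname match (for example `JUSTICE THOMAS`, who is not in the canonical list) come back as `None` and would still be dropped.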

In [6]:
pd.set_option('display.max_colwidth', -1)
df[df.speaker=="JUSTICE ROBERTS"].utterance


204       Well hear argument first this morning in Case 09-497, Rent-A-Center West v. Jackson. Mr. Friedman. \n
206       But not to the question of which parties have agreed to arbitrate?\n

In [7]:
pd.set_option('display.max_colwidth', 40)
df.shape


(91286, 8)

In [8]:
df.presentation_side.value_counts()

PETITIONER    46705
RESPONDENT    30639
NA            13942
dtype: int64

In [9]:
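sides = ["PETITIONER","RESPONDENT"]
# Keep only justice utterances with a known presentation side and a known vote.
# .copy() makes an independent frame, so the column assignments in a later cell
# do not trigger pandas' SettingWithCopyWarning.
df = df[(df.presentation_side.isin(sides)) & (df.justice_vote.isin(sides)) & (df.is_justice=="JUSTICE") ].copy()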

In [10]:
df.shape

(76778, 8)

In [11]:
df.justice_vote.value_counts()

PETITIONER    48206
RESPONDENT    28572
dtype: int64

In [12]:
df.head()

Unnamed: 0,docket,id,after_prev,speaker,is_justice,justice_vote,presentation_side,utterance
0,03-855,1,False,JUSTICE STEVENS,JUSTICE,RESPONDENT,PETITIONER,We will now hear argument in the cas...
2,03-855,3,True,JUSTICE KENNEDY,JUSTICE,PETITIONER,PETITIONER,"Well, is it your position that whene..."
4,03-855,5,True,JUSTICE BREYER,JUSTICE,PETITIONER,PETITIONER,Why does not having a possessory rig...
6,03-855,7,True,JUSTICE BREYER,JUSTICE,PETITIONER,PETITIONER,"No, I'm just thinking, that suppose ..."
8,03-855,9,True,JUSTICE KENNEDY,JUSTICE,PETITIONER,PETITIONER,All right. So if you were to say the...


In [13]:
df.presentation_side = df.presentation_side.map({"PETITIONER": 1, "RESPONDENT": 0})
df.justice_vote = df.justice_vote.map({"PETITIONER": 1, "RESPONDENT": 0})

In [14]:
df.describe()

Unnamed: 0,justice_vote,presentation_side
count,76778.0,76778.0
mean,0.627862,0.60335
std,0.483378,0.489205
min,0.0,0.0
25%,0.0,0.0
50%,1.0,1.0
75%,1.0,1.0
max,1.0,1.0
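
Because both columns are now 0/1, each mean is simply the share of ones. As a quick check, the `justice_vote` mean can be recomputed from the counts printed by `value_counts()` earlier. Note that this share is at the utterance level; the "Dummy score" reported further down is computed per (case, justice) row and so differs.

```python
# Vote counts copied from the value_counts() output above.
vote_counts = {"PETITIONER": 48206, "RESPONDENT": 28572}

total = sum(vote_counts.values())
majority_label = max(vote_counts, key=vote_counts.get)
majority_share = vote_counts[majority_label] / float(total)

print("%s share: %.6f of %d utterances" % (majority_label, majority_share, total))
```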


In [15]:
df.head()
# def polarize(data):
#     return TextBlob(data).polarity

# df["polarity"] = df.utterance.apply(polarize)


Unnamed: 0,docket,id,after_prev,speaker,is_justice,justice_vote,presentation_side,utterance
0,03-855,1,False,JUSTICE STEVENS,JUSTICE,0,1,We will now hear argument in the cas...
2,03-855,3,True,JUSTICE KENNEDY,JUSTICE,1,1,"Well, is it your position that whene..."
4,03-855,5,True,JUSTICE BREYER,JUSTICE,1,1,Why does not having a possessory rig...
6,03-855,7,True,JUSTICE BREYER,JUSTICE,1,1,"No, I'm just thinking, that suppose ..."
8,03-855,9,True,JUSTICE KENNEDY,JUSTICE,1,1,All right. So if you were to say the...


In [16]:
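# Build one row per (case, justice): the justice's vote plus all of their
# utterances during each side's presentation, joined into one string per side.
rows = []
for docket in df.docket.unique():
    cond_a = (df.docket == docket)
    for speaker in speakers:
        cond_b = (df.speaker == speaker)
        # Skip justices who did not speak during both presentations in this case.
        if len(df[(cond_a)&(cond_b)].presentation_side.unique())!=2: continue
        # The vote is assumed constant within a (case, justice) pair; take the first value.
        justice_vote = df[(cond_a)&(cond_b)].justice_vote.head(1).values[0]
        row = [docket,speaker,justice_vote]
        # 0 = respondent presenting, 1 = petitioner presenting
        for presentation_side in [0,1]:
            cond_c = (df.presentation_side == presentation_side)
            temp_df = df[(cond_a) & (cond_b) & (cond_c)]
            utterances = temp_df.utterance
            # print(utterances.head(1).values)
            text = " ".join(utterances.tolist()).replace('\n', ' ').replace('--', '')
            row.append(text)
        rows.append(row)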



In [17]:
cols = ["docket",
        "speaker",
        "justice_vote",
        "pres0_text",
        "pres1_text", 
        ]

print(len(rows))
df2 = pd.DataFrame(rows)
df2.columns = cols
df2.head()




3353


Unnamed: 0,docket,speaker,justice_vote,pres0_text,pres1_text
0,09-497,JUSTICE SCALIA,1,Is that is that right? Is the arbit...,I guess you could argue that on its ...
1,09-497,JUSTICE ROBERTS,1,"Thank you, counsel. Mr. Silverberg. ...",Well hear argument first this mornin...
2,09-497,JUSTICE BREYER,0,What is what I'm not sure about wha...,"Yes, that's thats true. The thing I..."
3,09-497,JUSTICE GINSBURG,0,Why is that why is - Subject to Fin...,"But if if fraud in the inducement, ..."
4,09-497,JUSTICE KENNEDY,1,After this After this suit was fil...,"Why is it post-formation? Arguably, ..."
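
The aggregation that produces this table re-filters the whole DataFrame for every (case, justice, side) combination. The same grouping can be done in a single pass. The sketch below is a plain-Python illustration on invented toy data, with a hypothetical `aggregate` helper; it is not the notebook's code, and it returns rows in sorted order rather than in order of first appearance.

```python
from collections import defaultdict

def aggregate(utterance_rows):
    """Group utterances by (docket, speaker) and join them per presentation side.

    utterance_rows: iterable of (docket, speaker, vote, side, text) tuples,
    with side 0 = respondent presenting and 1 = petitioner presenting.
    """
    texts = defaultdict(lambda: {0: [], 1: []})
    votes = {}
    for docket, speaker, vote, side, text in utterance_rows:
        key = (docket, speaker)
        texts[key][side].append(text.replace("\n", " ").replace("--", ""))
        votes.setdefault(key, vote)  # keep the first vote seen for the pair
    out = []
    for key in sorted(texts):
        by_side = texts[key]
        if by_side[0] and by_side[1]:  # justice spoke during both presentations
            out.append([key[0], key[1], votes[key],
                        " ".join(by_side[0]), " ".join(by_side[1])])
    return out

# Invented toy data, for illustration only.
toy = [
    ("00-001", "JUSTICE A", 1, 1, "Is that right?\n"),
    ("00-001", "JUSTICE A", 1, 0, "Why--"),
    ("00-001", "JUSTICE A", 1, 0, "I see."),
    ("00-001", "JUSTICE B", 0, 1, "Only one side."),
]
for row in aggregate(toy):
    print(row)
```

Here `JUSTICE B` is dropped because there is no utterance during the respondent's presentation, mirroring the `!= 2` check in the loop.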


In [18]:
###### Naive Bayes ########

X0 = df2.pres0_text
y0 = df2.justice_vote

X1 = df2.pres1_text
y1 = df2.justice_vote

nb_pipeline = Pipeline([('vect', TfidfVectorizer()),
                     ('clf', MultinomialNB()),
])

nb_parameters = {'vect__ngram_range': [(1, 1),(1,2),(1,3),(1,4)],
                 'vect__stop_words': ["english",None],
                 'clf__alpha': (1e-2, 1e-3),
}


nb_gs = GridSearchCV(nb_pipeline, nb_parameters, n_jobs=-1)
nb0_gs = nb_gs.fit(X0,y0)


nb0_best_parameters, nb0_score, _ = max(nb0_gs.grid_scores_, key=lambda x: x[1])
for param_name in sorted(nb0_best_parameters.keys()):
    print("%s: %r" % (param_name, nb0_best_parameters[param_name]))
print("nb0 score: " + str(nb0_score))
    
nb1_gs = nb_gs.fit(X1,y1)
nb1_best_parameters, nb1_score, _ = max(nb1_gs.grid_scores_, key=lambda x: x[1])
for param_name in sorted(nb1_best_parameters.keys()):
    print("%s: %r" % (param_name, nb1_best_parameters[param_name]))
print("nb1 score: " + str(nb1_score))

# Majority-class baseline: share of rows carrying the most common vote
print("Dummy score: " + str(y0[y0==y0.mode().values[0]].size/y0.size))

clf__alpha: 0.01
vect__ngram_range: (1, 3)
vect__stop_words: None
nb0 score: 0.601550849985
clf__alpha: 0.01
vect__ngram_range: (1, 4)
vect__stop_words: None
nb1 score: 0.583954667462
Dummy score: 0.603638532657
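
The `vect__ngram_range` values searched above control which word n-grams become features: `(1, 3)` means unigrams, bigrams, and trigrams together. The sketch below shows how quickly the feature count grows on one short phrase. The `word_ngrams` helper is hypothetical and uses a plain whitespace split; the real `TfidfVectorizer` applies its own tokenization and lowercasing first.

```python
def word_ngrams(tokens, min_n, max_n):
    """Return all contiguous word n-grams with min_n <= n <= max_n."""
    grams = []
    for n in range(min_n, max_n + 1):
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i:i + n]))
    return grams

tokens = "is it your position".split()
for lo, hi in [(1, 1), (1, 2), (1, 3), (1, 4)]:
    count = len(word_ngrams(tokens, lo, hi))
    print("ngram_range=(%d, %d): %d features" % (lo, hi, count))
```

With only a few thousand rows to train on, wider ranges add many rare features, which may be one reason the longer n-grams do not clearly help here.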


In [19]:
###### Support Vector Machine ########

X0 = df2.pres0_text
y0 = df2.justice_vote

X1 = df2.pres1_text
y1 = df2.justice_vote

sv_pipeline = Pipeline([('vect', TfidfVectorizer()),
                     ('clf', SGDClassifier(loss='hinge', penalty='l2',
                                               alpha=1e-3, n_iter=5, random_state=42)),
])

sv_parameters = {'vect__ngram_range': [(1, 1),(1,2),(1,3),(1,4)],
              'vect__stop_words': ["english",None],
}



sv_gs = GridSearchCV(sv_pipeline, sv_parameters, n_jobs=-1)

sv0_gs = sv_gs.fit(X0,y0)


sv0_best_parameters, sv0_score, _ = max(sv0_gs.grid_scores_, key=lambda x: x[1])
for param_name in sorted(sv0_best_parameters.keys()):
    print("%s: %r" % (param_name, sv0_best_parameters[param_name]))
print("sv0 score: " + str(sv0_score))
    
sv1_gs = sv_gs.fit(X1,y1)
sv1_best_parameters, sv1_score, _ = max(sv1_gs.grid_scores_, key=lambda x: x[1])
for param_name in sorted(sv1_best_parameters.keys()):
    print("%s: %r" % (param_name, sv1_best_parameters[param_name]))
print("sv1 score: " + str(sv1_score))

# Majority-class baseline: share of rows carrying the most common vote
print("Dummy score: " + str(y0[y0==y0.mode().values[0]].size/y0.size))

vect__ngram_range: (1, 2)
vect__stop_words: 'english'
sv0 score: 0.60572621533
vect__ngram_range: (1, 1)
vect__stop_words: None
sv1 score: 0.605427974948
Dummy score: 0.603638532657
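
`SGDClassifier` with `loss='hinge'` and an L2 penalty fits a linear SVM by stochastic gradient descent. For reference, the hinge loss it minimizes penalizes any example whose signed margin is below 1. The sketch below is a standalone illustration with made-up scores, not output from the model above.

```python
def hinge_loss(label, score):
    """Hinge loss for one example; label is -1 or +1, score is w.x + b."""
    return max(0.0, 1.0 - label * score)

# Illustrative (label, score) pairs, not taken from the notebook's data.
examples = [(+1, 2.0), (+1, 0.3), (-1, 0.5)]
for label, score in examples:
    print("label=%+d score=%.1f loss=%.1f" % (label, score, hinge_loss(label, score)))
```

A correctly classified example still incurs loss when its margin is small (the second pair), which is what distinguishes the hinge loss from a plain error count.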


In [20]:
###### LOGISTIC REGRESSION ########

X0 = df2.pres0_text
y0 = df2.justice_vote

X1 = df2.pres1_text
y1 = df2.justice_vote

lr_pipeline = Pipeline([('vect', TfidfVectorizer()),
                     ('clf', LogisticRegression()),
])

lr_parameters = {'vect__ngram_range': [(1, 1),(1,2),(1,3),(1,4)],
              'vect__stop_words': ["english",None],
}

lr_gs = GridSearchCV(lr_pipeline, lr_parameters, n_jobs=-1)

lr0_gs = lr_gs.fit(X0,y0)
lr0_best_parameters, lr0_score, _ = max(lr0_gs.grid_scores_, key=lambda x: x[1])
for param_name in sorted(lr0_best_parameters.keys()):
    print("%s: %r" % (param_name, lr0_best_parameters[param_name]))
print("lr0 score: " + str(lr0_score))
    
lr1_gs = lr_gs.fit(X1,y1)
lr1_best_parameters, lr1_score, _ = max(lr1_gs.grid_scores_, key=lambda x: x[1])
for param_name in sorted(lr1_best_parameters.keys()):
    print("%s: %r" % (param_name, lr1_best_parameters[param_name]))
print("lr1 score: " + str(lr1_score))

# Majority-class baseline: share of rows carrying the most common vote
print("Dummy score: " + str(y0[y0==y0.mode().values[0]].size/y0.size))

vect__ngram_range: (1, 3)
vect__stop_words: None
lr0 score: 0.607217417238
vect__ngram_range: (1, 4)
vect__stop_words: None
lr1 score: 0.614076946018
Dummy score: 0.603638532657
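
All six pooled scores sit within about two percentage points of the dummy score, so it is worth asking how large a gap would be distinguishable from noise. The sketch below is a rough back-of-the-envelope check rather than a formal test. It treats accuracy as a binomial proportion over the 3353 rows and assumes the rows are independent, which is not strictly true since several justices share each case. It also ignores the optimism introduced by reporting the best score from each grid.

```python
import math

n_rows = 3353            # (case, justice) rows in df2, from the output above
dummy = 0.603638532657   # majority-class baseline
scores = {
    "nb0": 0.601550849985, "nb1": 0.583954667462,
    "sv0": 0.60572621533, "sv1": 0.605427974948,
    "lr0": 0.607217417238, "lr1": 0.614076946018,
}

# Standard error of a proportion near the baseline, assuming independent rows.
std_err = math.sqrt(dummy * (1.0 - dummy) / n_rows)

for name in sorted(scores):
    diff = scores[name] - dummy
    print("%s: diff=%+.4f (%.2f standard errors)" % (name, diff, diff / std_err))
```

Under these assumptions the largest gain (`lr1`) is roughly 1.2 standard errors above the baseline, so none of the pooled models shows a clear improvement over always predicting the majority vote.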


In [21]:
for speaker in speakers:
    subframe = df2[df2.speaker==speaker]
    
    if len(subframe) < 10: continue
    print(speaker+ ": " + str(len(subframe)))
    X = subframe.pres0_text
    y = subframe.justice_vote

    
    ## Naive Bayes
    nb_pipeline = Pipeline([('vect', TfidfVectorizer()),
                         ('clf', MultinomialNB()),
    ])

    nb_parameters = {'vect__ngram_range': [(1, 1),(1,2),(1,3),(1,4)],
                  'vect__stop_words': ["english",None],
                  'clf__alpha': (1e-2, 1e-3,1e-4),
    }

    # Tried ngrams up to (1,7) and they didn't beat (1,4)
    nb_gs = GridSearchCV(nb_pipeline, nb_parameters, n_jobs=-1)
    nb_gs = nb_gs.fit(X,y)
    nb_best_parameters, nb_score, _ = max(nb_gs.grid_scores_, key=lambda x: x[1])
    
    
    #### Support Vector
    sv_pipeline = Pipeline([('vect', TfidfVectorizer()),
                         ('clf', SGDClassifier(loss='hinge', penalty='l2',
                                               alpha=1e-3, n_iter=5, random_state=42)),
    ])
    
    sv_parameters = {'vect__ngram_range': [(1, 1),(1,2),(1,3),(1,4)],
                  'vect__stop_words': ["english",None],
    }
    

    sv_gs = GridSearchCV(sv_pipeline, sv_parameters, n_jobs=-1)
    sv_gs = sv_gs.fit(X,y)
    sv_best_parameters, sv_score, _ = max(sv_gs.grid_scores_, key=lambda x: x[1])

    
    #### Logistic Regression
    lr_pipeline = Pipeline([('vect', TfidfVectorizer()),
                         ('clf', LogisticRegression()),
    ])
    
    lr_parameters = {'vect__ngram_range': [(1, 1),(1,2),(1,3),(1,4)],
                  'vect__stop_words': ["english",None],
    }
    

    lr_gs = GridSearchCV(lr_pipeline, lr_parameters, n_jobs=-1)
    lr_gs = lr_gs.fit(X,y)
    lr_best_parameters, lr_score, _ = max(lr_gs.grid_scores_, key=lambda x: x[1])
    

    
#     for param_name in sorted(parameters.keys()):
#         print("%s: %r" % (param_name, nb_best_parameters[param_name]))


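    print("Naive Bayes score :" + str(nb_score))
    print("Support Vector score :" + str(sv_score))
    # Print lr_score here. The recorded output below came from a run that printed
    # sv_score on this line, so its Logistic Regression values repeat the SVM values.
    print("Logistic Regression score :" + str(lr_score))
    print("Dummy score: " + str(y[y==y.mode().values[0]].size/y.size))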


JUSTICE SCALIA: 462
Naive Bayes score :0.625541125541
Support Vector score :0.623376623377
Logistic Regression score :0.623376623377
Dummy score: 0.614718614719
JUSTICE ROBERTS: 555
Naive Bayes score :0.614414414414
Support Vector score :0.625225225225
Logistic Regression score :0.625225225225
Dummy score: 0.636036036036
JUSTICE BREYER: 404
Naive Bayes score :0.574257425743
Support Vector score :0.589108910891
Logistic Regression score :0.589108910891
Dummy score: 0.569306930693
JUSTICE GINSBURG: 460
Naive Bayes score :0.591304347826
Support Vector score :0.59347826087
Logistic Regression score :0.59347826087
Dummy score: 0.567391304348
JUSTICE KENNEDY: 393
Naive Bayes score :0.671755725191
Support Vector score :0.643765903308
Logistic Regression score :0.643765903308
Dummy score: 0.653944020356
JUSTICE SOTOMAYOR: 246
Naive Bayes score :0.605691056911
Support Vector score :0.569105691057
Logistic Regression score :0.569105691057
Dummy score: 0.59756097561
JUSTICE STEVENS: 207
Naive Bay

In [22]:
for speaker in speakers:
    subframe = df2[df2.speaker==speaker]
    
    if len(subframe) < 10: continue
    print(speaker+ ": " + str(len(subframe)))
    X = subframe.pres1_text
    y = subframe.justice_vote

    
    ## Naive Bayes
    nb_pipeline = Pipeline([('vect', TfidfVectorizer()),
                         ('clf', MultinomialNB()),
    ])

    nb_parameters = {'vect__ngram_range': [(1, 1),(1,2),(1,3),(1,4)],
                  'vect__stop_words': ["english",None],
                  'clf__alpha': (1e-2, 1e-3,1e-4),
    }

    # Tried ngrams up to (1,7) and they didn't beat (1,4)
    nb_gs = GridSearchCV(nb_pipeline, nb_parameters, n_jobs=-1)
    nb_gs = nb_gs.fit(X,y)
    nb_best_parameters, nb_score, _ = max(nb_gs.grid_scores_, key=lambda x: x[1])
    
    
    #### Support Vector
    sv_pipeline = Pipeline([('vect', TfidfVectorizer()),
                         ('clf', SGDClassifier(loss='hinge', penalty='l2',
                                               alpha=1e-3, n_iter=5, random_state=42)),
    ])
    
    sv_parameters = {'vect__ngram_range': [(1, 1),(1,2),(1,3),(1,4)],
                  'vect__stop_words': ["english",None],
    }
    

    sv_gs = GridSearchCV(sv_pipeline, sv_parameters, n_jobs=-1)
    sv_gs = sv_gs.fit(X,y)
    sv_best_parameters, sv_score, _ = max(sv_gs.grid_scores_, key=lambda x: x[1])

    
    #### Logistic Regression
    lr_pipeline = Pipeline([('vect', TfidfVectorizer()),
                         ('clf', LogisticRegression()),
    ])
    
    lr_parameters = {'vect__ngram_range': [(1, 1),(1,2),(1,3),(1,4)],
                  'vect__stop_words': ["english",None],
    }
    

    lr_gs = GridSearchCV(lr_pipeline, lr_parameters, n_jobs=-1)
    lr_gs = lr_gs.fit(X,y)
    lr_best_parameters, lr_score, _ = max(lr_gs.grid_scores_, key=lambda x: x[1])
    

    
#     for param_name in sorted(parameters.keys()):
#         print("%s: %r" % (param_name, nb_best_parameters[param_name]))


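    print("Naive Bayes score :" + str(nb_score))
    print("Support Vector score :" + str(sv_score))
    # Print lr_score here. The recorded output below came from a run that printed
    # sv_score on this line, so its Logistic Regression values repeat the SVM values.
    print("Logistic Regression score :" + str(lr_score))
    print("Dummy score: " + str(y[y==y.mode().values[0]].size/y.size))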


JUSTICE SCALIA: 462
Naive Bayes score :0.577922077922
Support Vector score :0.612554112554
Logistic Regression score :0.612554112554
Dummy score: 0.614718614719
JUSTICE ROBERTS: 555
Naive Bayes score :0.637837837838
Support Vector score :0.643243243243
Logistic Regression score :0.643243243243
Dummy score: 0.636036036036
JUSTICE BREYER: 404
Naive Bayes score :0.561881188119
Support Vector score :0.576732673267
Logistic Regression score :0.576732673267
Dummy score: 0.569306930693
JUSTICE GINSBURG: 460
Naive Bayes score :0.571739130435
Support Vector score :0.604347826087
Logistic Regression score :0.604347826087
Dummy score: 0.567391304348
JUSTICE KENNEDY: 393
Naive Bayes score :0.64631043257
Support Vector score :0.656488549618
Logistic Regression score :0.656488549618
Dummy score: 0.653944020356
JUSTICE SOTOMAYOR: 246
Naive Bayes score :0.593495934959
Support Vector score :0.565040650407
Logistic Regression score :0.565040650407
Dummy score: 0.59756097561
JUSTICE STEVENS: 207
Naive Ba

In [23]:
categories = ['alt.atheism', 'soc.religion.christian',
              'comp.graphics', 'sci.med']
twenty_train = fetch_20newsgroups(subset='train',
    categories=categories, shuffle=True, random_state=42)

twenty_test = fetch_20newsgroups(subset='test',
    categories=categories, shuffle=True, random_state=42)

# count_vect = CountVectorizer()
# X_train_counts = count_vect.fit_transform(twenty_train.data)
# X_train_counts.shape

# tf_transformer = TfidfTransformer(use_idf=False).fit(X_train_counts)
# X_train_tf = tf_transformer.transform(X_train_counts)

text_clf = Pipeline([('vect', CountVectorizer()),
                     ('tfidf', TfidfTransformer()),
                     ('clf', MultinomialNB()),
])
text_clf = text_clf.fit(twenty_train.data, twenty_train.target)

docs_test = twenty_test.data
predicted = text_clf.predict(docs_test)
np.mean(predicted == twenty_test.target)    


0.83488681757656458
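
The `CountVectorizer` plus `TfidfTransformer` pair in this example does the same job as the single `TfidfVectorizer` step used in the earlier pipelines. To make the weighting concrete, the sketch below computes TF-IDF by hand in plain Python. It assumes what I understand to be the scikit-learn defaults (smoothed idf and L2 normalization of each document vector); the `tfidf` helper and the two toy documents are invented for illustration.

```python
import math
from collections import Counter

def tfidf(docs):
    """TF-IDF with smoothed idf = ln((1 + n) / (1 + df)) + 1 and L2 normalization."""
    n_docs = len(docs)
    doc_freq = Counter(term for doc in docs for term in set(doc))
    idf = {t: math.log((1.0 + n_docs) / (1.0 + df)) + 1.0
           for t, df in doc_freq.items()}
    vectors = []
    for doc in docs:
        weights = {t: count * idf[t] for t, count in Counter(doc).items()}
        # Guard against empty documents, whose norm would be zero.
        norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
        vectors.append({t: w / norm for t, w in weights.items()})
    return vectors

# Two invented, pre-tokenized documents.
docs = [["court", "argument"], ["court", "statute"]]
for vec in tfidf(docs):
    print(sorted((t, round(w, 4)) for t, w in vec.items()))
```

The term shared by both documents ("court") ends up with a lower weight than the term unique to each, which is the effect the idf factor is meant to have.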


