# Introduction

The rulings and opinions of the Supreme Court of the United States (SCOTUS) have been of obvious importance to developments in US law, especially in the past half-century.  While the outcome of the majority vote of the nine justices is the sole determinant of the case at hand, the written majority opinion – typically a substantial, multi-page document – determines the scope of and the justification for the precedent that the immediate ruling establishes.  So while the case-deciding power of the court rests with the rulings, the broader precedent-setting power of the court lies in the written opinions.  These precedents can shape the direction of litigation, legislation, and in particular future rulings at all levels of courts.

Given this context, exploratory analysis of this dataset should be both interesting and useful.  In this notebook, I'll examine about 7,500 opinions written by eighteen different justices between 1970 and 2016.  I'll perform the following types of analysis:
- high-level statistical analysis of justices and opinion counts/lengths/types over the years
- topic modeling of opinions
- clustering opinions
- quantitative and qualitative comparisons between different justices' work
- correlations with known ideological tendencies

I'll also make an attempt at classifying opinions by author.  However, note that several factors make this a very difficult task:
- topics of opinions vary hugely based on the subject matter of the case
- the official authoring justice is not the sole author of any opinion: opinions are typically first drafted by clerks, and revised through discussion and collaboration with other justices
- the problem features a moderately high number of classes (18) with limited examples (minimum ~100)

That said, most of the interesting insights here will come from the unsupervised exploration of the data.

Finally, two quick notes on adapting this notebook into a Kaggle kernel.  First, this notebook originally included a section on moral and political ideology, but because it used outside data sources (Martin-Quinn scores, Segal-Cover scores, Moral Foundations Theory keywords dictionary), there's no straightforward way to run it in a Kaggle kernel.  Second, at time of posting, Kaggle's seaborn module is a version behind and doesn't include the `scatterplot` function used for the clustering plots here; it has to be manually updated per [Kaggle's instructions.](https://www.kaggle.com/docs/kernels#modifying-the-default-environment)

## Imports

In [None]:
import pandas as pd
import numpy as np
import scipy as sp
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
sns.set()
from time import time
import random
random.seed(11)
np.random.seed(11) # the train/test split below draws from NumPy's generator, so seed it as well

# suppress warnings: only necessary on Kaggle kernel (because of some spotty package updates)
# shouldn't be necessary on a properly updated local machine
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)
warnings.simplefilter(action='ignore', category=UserWarning)

## Load and Clean


In [None]:
import os
print(os.listdir("../input"))

In [None]:
opinions_df = pd.read_csv('../input/opinions_since_1970.csv')

# eliminate a few hundred mini-opinions that are basically short comments on the majority opinion
opinions_df = opinions_df[opinions_df.text.map(lambda x: len(x) > 3000)]
print(opinions_df.shape)

# eliminate lingering per curiam opinions; eliminate the second_dissenting / dissenting distinction
opinions_df = opinions_df[opinions_df.category != 'per_curiam']
opinions_df.category = opinions_df.category.map(lambda x: x if x != 'second_dissenting' else 'dissenting')

# drop least common authors (Roberts, Kagan, Black, Harlan)
authors_with_min_100 = opinions_df.author_name.value_counts()[opinions_df.author_name.value_counts() > 100].index
opinions_df = opinions_df[opinions_df.author_name.isin(authors_with_min_100)]

# drop justice douglas (213 opinions) for reasons detailed below
opinions_df = opinions_df[opinions_df.author_name!='justice douglas']

opinions_df.author_name.value_counts()

### A note on dropping Justice Douglas

Despite the fact that he meets the minimum opinion count, we're going to drop Justice Douglas from the dataset.  There are three core reasons:
1. He doesn't really belong to the era this report is examining.  We are examining 1970-2016; Douglas was incapacitated in 1974 and retired in 1975.
2. The meta-statistics of his opinions are quite anomalous. He wrote far shorter opinions, far more opinions per year, and nearly exclusively in dissent.
3. The opinions he wrote were themselves highly unusual. 

The last point requires some substantiation, and is in fact a major understatement.  Per Wikipedia: 

<blockquote>"In general, legal scholars have noted that Douglas's judicial style was unusual in that he did not attempt to elaborate justifications for his judicial positions on the basis of text, history, or precedent. Douglas was known for writing short, pithy opinions which relied on philosophical insights, observations about current politics, and literature, as much as more conventional "judicial" sources. Douglas wrote many of his opinions in twenty minutes, often publishing the first draft."</blockquote>

All of the plots below marked Douglas as a major outlier when included, which is good intuitive confirmation of their usefulness.  Even better, the LSA dataset formulated later in this notebook uniquely identified him as unusual.  His outliership is so extreme that it washes out differences and trends among other justices.  For these reasons, we'll drop Douglas from the dataset moving forward.

# Basic data exploration

Our opinions data now contains 7,578 opinions from 1970 through 2018.  Of these:
- 4388 are majority opinions
- 2427 are dissenting
- 955 are concurring

Note that these exclude per curiam opinions, which are issued in the name of the court with no individual author cited.

The average date of opinion written by each justice is as follows:

In [None]:
opinions_df.groupby('author_name').year_filed.mean().astype(int).sort_values()

In [None]:
import matplotlib.pylab as pylab
params = {'legend.fontsize': 12,
         'axes.labelsize': 12,
         'axes.titlesize':14}
pylab.rcParams.update(params)

In [None]:
yearly_counts = opinions_df.groupby('year_filed').agg({'federal_cite_one': pd.Series.nunique})
plt.figure(figsize=(10,6))
plt.plot(yearly_counts.iloc[:46]) # index omits 2017(36) and 2018(1) which have not-yet-catalogued cases
plt.title('Number of cases per year with at least one attributed opinion (i.e., not decided per curiam)', fontsize=14)
plt.ylim((0,180))
plt.xlabel('Year')
plt.ylabel('Number of cases')
plt.show()

These results show a drastic decrease in the size of the docket (i.e., cases heard per term) since the 1980s.  Caution is warranted here: the present data excludes cases decided per curiam (a small but significant number), a small number of cases featuring only opinions by non-prolific justices, and potentially some with irregular formatting.  But external sources (e.g., <a href=https://www.nytimes.com/2009/09/29/us/29bar.html>The New York Times</a>) confirm that the decrease in docket size is a real and widely noted phenomenon.  This decrease explains some of the trends visible in the individual justices' opinion statistics below.

We can also take a look at how long opinions tend to be by category (majority/dissenting/concurring):

In [None]:
for category in ['majority', 'concurring', 'dissenting']:
    plt.figure(figsize=(10,6))
    sns.distplot(opinions_df[opinions_df.category==category].text.map(lambda x: len(x)))
    plt.title('distribution of ' + category + ' opinion lengths')
    plt.xlim((0,150000))
    plt.xlabel('length (characters) of opinion')
    plt.show()

This shows about what we would expect: majority opinions are usually the longest, since they lay all the groundwork; concurring opinions are shortest, since they build mostly on the majority opinion; dissenting opinions fall in the middle, building somewhat on the majority opinion but having many different or opposing points to make.
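A numeric summary can back up the visual impression from the histograms. The sketch below uses a small made-up DataFrame (the `toy` frame and its texts are illustrative, not taken from the dataset) with the same `category` and `text` columns as `opinions_df`, and compares the median character length per category.

```python
import pandas as pd

# Toy stand-in for opinions_df: made-up texts, same column names as the real data.
toy = pd.DataFrame({
    'category': ['majority', 'majority', 'dissenting', 'dissenting', 'concurring', 'concurring'],
    'text': ['a' * 900, 'a' * 700, 'a' * 500, 'a' * 300, 'a' * 200, 'a' * 100],
})

# Character length of each opinion, then the median within each category.
toy['n_chars'] = toy.text.str.len()
median_length = toy.groupby('category').n_chars.median().sort_values(ascending=False)
print(median_length)
```

Swapping `toy` for `opinions_df` should give the corresponding medians for the real opinions.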

Now let's take a look at patterns by individual author.

In [None]:
temporal_sequence = opinions_df.groupby('author_name').year_filed.mean().sort_values().index

# bar graph of number of opinions by justice, arranged by year: absolute -- by category
plt.figure(figsize=(16,8))
sns.countplot(x='author_name', hue='category', order=temporal_sequence, data=opinions_df)
plt.xticks(rotation=75)
plt.xlabel('Justice (arranged chronologically by average year of opinion)')
plt.title('Number of opinions written by each recent justice, by category')
plt.show()

In [None]:
# bar graph of number of opinions by justice, arranged by year: per-year
yearly_counts = opinions_df.groupby('author_name').agg(
    {'year_filed': pd.Series.nunique,
     'category': 'count' })
yearly_counts['average'] = yearly_counts.category / yearly_counts.year_filed
plt.figure(figsize=(12,7))
sns.barplot(x='author_name', y='average', order=temporal_sequence, data=yearly_counts.reset_index())
plt.xticks(rotation=75)
plt.xlabel('Justice (arranged chronologically by average year of opinion)')
plt.title('Average opinions per year by each recent justice')
plt.show()

Collectively, the number of opinions authored - both total and per-year - shows the downward trend that we expected from the shrinking docket.  There are some notable individual discrepancies.  It is in particular worth noting some positive outliers:
- **Justices Burger and Rehnquist** both served (successively) as Chief Justice.  When in the majority, the Chief Justice decides who writes the majority opinion, and may take the assignment personally.
- **Justices Brennan and Stevens** both (successively) held the position of senior associate justice, and both often voted liberal while the Chief Justice at the time often voted conservative. This positioned them almost as associate chief justices / minority leaders (not formal titles).
- **Justice Powell**: the reasons for his outliership are unclear and may simply be a matter of personality.
- **Justice Scalia** was a vociferous and eloquent writer known for dissenting.

The composition of the opinions for each justice (majority opinions vs. dissents) largely indicates whether or not the majority of the court tended to align with a particular justice's views during their tenure.
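To put a number on that composition, one option is a row-normalized cross-tabulation of author against category. This is a minimal sketch on made-up rows (the `toy` frame and the justice names are placeholders), assuming the same `author_name` and `category` columns as `opinions_df`.

```python
import pandas as pd

# Made-up rows with the same column names as opinions_df.
toy = pd.DataFrame({
    'author_name': ['justice a', 'justice a', 'justice a', 'justice a', 'justice b', 'justice b'],
    'category': ['majority', 'majority', 'majority', 'dissenting', 'dissenting', 'concurring'],
})

# normalize='index' turns each row of counts into shares that sum to 1.
composition = pd.crosstab(toy.author_name, toy.category, normalize='index')
print(composition.round(2))
```

A justice whose row is dominated by the `majority` column was more often writing for the prevailing side.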

In [None]:
opinions_df['word_count'] = opinions_df.text.map(lambda x: len(x.split()))

plt.figure(figsize=(12,7))
sns.barplot(x='author_name', y='word_count', order=temporal_sequence, ci=95, data=opinions_df)
plt.xticks(rotation=75)
plt.title('Average length of opinion (words) by each recent justice')
plt.show()

There is a notable trend here toward longer opinions in recent years.  Justice Thomas, whose opinions would have been perfectly average before 1990, now stands out as unusually terse among justices in the second half of this chart.  This trend is probably due at least in part to the fact that SCOTUS justices write fewer opinions per year than they used to (again, the shrinking docket).

## Setting aside test set

I'll now set aside a quarter of our data for later use (primarily checking the stability of our unsupervised results).  This may be a little more rigor than is really necessary here, but it's one of the few ways to check the reliability of unsupervised methods.

In [None]:
# get selection index (draws from NumPy's global random state)
test_index = np.random.choice(opinions_df.index, size=opinions_df.shape[0]//4, replace=False)
train_index = opinions_df.index[~opinions_df.index.isin(test_index)]
toy_index = np.random.choice(opinions_df.index, size=opinions_df.shape[0]//200, replace=False)
# positional versions of the same split, for indexing lists built in opinions_df row order
re_train_index = np.flatnonzero(opinions_df.index.isin(train_index))
re_test_index = np.flatnonzero(opinions_df.index.isin(test_index))

# split data; train/test are selected by position so their row order matches re_train_index / re_test_index
op_df = opinions_df.iloc[re_train_index]
test_df = opinions_df.iloc[re_test_index]
toy_df = opinions_df.loc[toy_index]
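The distinction between index labels and row positions matters later, because the parsed opinions are stored in plain Python lists that can only be indexed by position. The following sketch illustrates this on a made-up frame with gaps in its index; the `toy` frame, the locally seeded `rng`, and the `parsed` list are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Toy frame whose index has gaps, as opinions_df does after filtering.
toy = pd.DataFrame({'text': list('abcdefgh')}, index=[3, 5, 8, 13, 21, 34, 55, 89])

rng = np.random.RandomState(11)  # local seeded generator (an assumption, for reproducibility)
test_labels = rng.choice(toy.index, size=len(toy) // 4, replace=False)

# Boolean mask over rows, then the integer positions of held-out and kept rows.
is_test = toy.index.isin(test_labels)
test_pos = np.flatnonzero(is_test)
train_pos = np.flatnonzero(~is_test)

# A list built in row order (like the parsed opinions) must be indexed by position.
parsed = [t.upper() for t in toy.text]
train_parsed = [parsed[i] for i in train_pos]
print(train_parsed == [t.upper() for t in toy.iloc[train_pos].text])
```

Selecting the DataFrame rows with `.iloc` on the same positions keeps the frame and the list in the same order.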

# Modeling and Analysis

To do anything with this data, we'll first need to vectorize it.  There are basically two canonical ways to do this:
1. Lemmatize/stem words and produce a bag-of-words or tf-idf vector for each document
2. Get word embeddings and average or otherwise combine the embeddings of all words in each document

For this case, we'll mostly use 1, which is the most useful for conventional topic modeling techniques.  There are also some technical obstacles for option 2: the corpus is big enough that producing our own word embeddings would take considerable time on this machine, but also small enough and lexically diverse enough that these embeddings may still be rather flawed.  And downloading pre-trained embeddings may be worth a try, but domain specificity is a major concern, given the distinctive nature of legal writing.  Still, if time allows, it would be worthwhile to give some version of option 2 a shot as well.
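As a rough illustration of the difference between the two options, here is a minimal sketch on three made-up "documents". The embeddings in option 2 are random placeholders rather than trained vectors, so only the shapes are meaningful.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Three tiny made-up documents, already lemmatized and space-joined.
docs = [
    'court affirm judgment',
    'court reverse judgment',
    'police search vehicle',
]

# Option 1: one tf-idf weight per vocabulary term per document (sparse matrix).
tfidf = TfidfVectorizer().fit_transform(docs)
print(tfidf.shape)

# Option 2: average one dense vector per word. These embeddings are random placeholders.
rng = np.random.RandomState(0)
vocab = sorted({word for doc in docs for word in doc.split()})
embeddings = {word: rng.normal(size=4) for word in vocab}
doc_vectors = np.array([
    np.mean([embeddings[word] for word in doc.split()], axis=0)
    for doc in docs
])
print(doc_vectors.shape)
```

Option 1 yields one column per vocabulary term, while option 2 yields one column per embedding dimension regardless of vocabulary size.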

### Parsing and TF-IDF

First we'll need to clean and lemmatize the opinions. This parsing will give us a version including named entities (people, organizations, etc.) and a version without them. 

Moving forward I'll work almost entirely with the version of the opinions with the named entities removed.  The main reason is that some of the topic analysis methods, especially LSA, end up catching a lot of names if they are included.  This leads to mostly uninterpretable topic clusters of people and organizations.  Names are also just not what we're interested in here: they tend to be restricted to the opinions belonging to just one case (sometimes two or three), and while they're frequent within that case, they're absent and useless for the rest of the dataset.

In [None]:
import spacy
parser = spacy.load('en')

from spacy.lang.en.stop_words import STOP_WORDS
stopword_additions = [
    "'s",
    '’s',
    '\r',
    'l.ed.2d',
    'l.ed'
]
for addition in stopword_additions:
    STOP_WORDS.add(addition)

In [None]:
import re

def clean(text):
    # recipe from https://stackoverflow.com/questions/6116978/how-to-replace-multiple-substrings-of-a-string
    # keys are literal substrings (they are re.escape'd below), so a regex such as \s+ cannot go in this dict
    rep = {
        '\'s':'', # removes possessives, and there are virtually no contractions in the texts
        '’s':'',
        '\r':'',
        'u. s. c.': 'u.s.c.', # federal statute citation
        'u. s.': ''
    }
    # try longer keys first, so 'u. s. c.' is matched before its prefix 'u. s.'
    pattern = re.compile("|".join(re.escape(k) for k in sorted(rep, key=len, reverse=True)))
    text = pattern.sub(lambda m: rep[m.group(0)], text)
    text = re.sub(r'\s+', ' ', text) # reduces any run of whitespace to a single space
    return text

def parse_and_lemmatize(opinions, get_full_ops=True, get_nameless_ops=False, get_named_ents=False):
    '''
    Returns all three lists; if any are marked False, the returned list will be empty:
        lemmatized_full_ops: a list of lists (one per opinion) of the lemmas in each opinion
        lemmatized_nameless_ops: a list of lists of the lemmas (excluding named entities) in each opinion
        named_entities: a list of lists of named entities identified in each opinion
    '''
    start = time()
    lemmatized_full_ops = []
    lemmatized_nameless_ops = []
    named_entities = []
    counter = 0
    parser = spacy.load('en')
    for opinion in opinions:
        counter += 1
        print('Parsing opinion {} of {}'.format(counter, len(opinions)), end='\r')
        parsed_opinion = parser(clean(opinion))
        if get_full_ops:
            lemmatized_opinion = \
                [token.lemma_ for token in parsed_opinion \
                 if not token.is_stop and not token.is_punct and not token.like_num]
            lemmatized_full_ops.append(lemmatized_opinion)
        if get_nameless_ops:
            # token.ent_iob == 2 marks tokens that lie outside any named entity
            lemmatized_opinion = \
                [token.lemma_ for token in parsed_opinion \
                 if not token.is_stop and not token.is_punct and not token.like_num and token.ent_iob == 2]
            lemmatized_nameless_ops.append(lemmatized_opinion)
        if get_named_ents:
            named_entities.append([(ent.text, ent.label_) for ent in parsed_opinion.ents])

    print('  Total parsing time:', round((time()-start)/60, 1), 'minutes    ')
    
    return lemmatized_full_ops, lemmatized_nameless_ops, named_entities


all_lemmatized_opinions, all_lemmatized_opinions_nonames, named_entity_tuples  = parse_and_lemmatize(
    opinions_df.text,
    get_nameless_ops=True,
    get_named_ents=True
)
lemmatized_opinions_with_names = [all_lemmatized_opinions[i] for i in re_train_index]
lemmatized_opinions_with_names_test = [all_lemmatized_opinions[i] for i in re_test_index]
lemmatized_opinions_nonames = [all_lemmatized_opinions_nonames[i] for i in re_train_index]
lemmatized_opinions_nonames_test = [all_lemmatized_opinions_nonames[i] for i in re_test_index]
named_entities = [tup[0] for opinion in named_entity_tuples for tup in opinion]
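The multi-substring replacement inside `clean` can be tried in isolation. Below is a minimal sketch with a made-up sentence and a hypothetical helper name, `multi_replace`. It escapes each key so that it matches literally, and it tries longer keys first so that a key is never cut short by its own prefix.

```python
import re

replacements = {
    'u. s. c.': 'u.s.c.',
    'u. s.': '',
    "'s": '',
}

# Escape each key so '.' matches a literal period, and try longer keys first
# so 'u. s. c.' is not cut short by its prefix 'u. s.'.
keys = sorted(replacements, key=len, reverse=True)
pattern = re.compile('|'.join(re.escape(key) for key in keys))

def multi_replace(text):
    # Each match is one of the literal keys, so it can be looked up directly.
    text = pattern.sub(lambda match: replacements[match.group(0)], text)
    # Whitespace runs need a real regex, so they are collapsed in a separate pass.
    return re.sub(r'\s+', ' ', text)

print(multi_replace("the court's  ruling under 18 u. s. c. 1001, 410 u. s. 113"))
```

Because every key is escaped, a regex-style key such as `\s+` would only ever match that literal character sequence, which is why whitespace is handled separately.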

In [None]:
from collections import Counter

all_words_with_names = [lemma for opinion in lemmatized_opinions_with_names for lemma in opinion]
n_words = len(set(all_words_with_names))
counter = Counter(all_words_with_names)
two_count = len([word for word in counter.keys() if counter[word] >= 2])
three_count = len([word for word in counter.keys() if counter[word] >= 3])

print("*** Stats with named entities: ***")
print("Number of unique words (likely to include typos, etc.):", n_words)
print("Number of words occurring at least twice:", two_count)
print("Number of words occurring at least thrice:", three_count)

all_words_nonames = [lemma for opinion in lemmatized_opinions_nonames for lemma in opinion]
n_words_2 = len(set(all_words_nonames))
counter_2 = Counter(all_words_nonames)
two_count_2 = len([word for word in counter_2.keys() if counter_2[word] >= 2])
three_count_2 = len([word for word in counter_2.keys() if counter_2[word] >= 3])

print("\n*** Stats without named entities: ***")
print("Number of unique words (likely to include typos, etc.):", n_words_2)
print("Number of words occurring at least twice:", two_count_2)
print("Number of words occurring at least thrice:", three_count_2)

In [None]:
# add back in two common general names, and all Xth Amendment references
for i, opinion in enumerate(named_entity_tuples):
    keepnames = [tup[0] for tup in opinion if tup[0] in ['Indian', 'Miranda'] or tup[0].endswith('Amendment')]
    all_lemmatized_opinions_nonames[i].extend(keepnames)

lemmatized_opinions_nonames = [all_lemmatized_opinions_nonames[i] for i in re_train_index]
lemmatized_opinions_nonames_test = [all_lemmatized_opinions_nonames[i] for i in re_test_index]

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

start = time()
all_words = [lemma for opinion in lemmatized_opinions_nonames for lemma in opinion]
common_words = [tup[0] for tup in Counter(all_words).most_common(12000)[100:]] # omit most common 100 words
joined_opinions = [
    ' '.join([lemma for lemma in opinion]) # could remove with `if lemma in common_words`, but vectorizer vocab does so
    for opinion in lemmatized_opinions_nonames]
print('Total common-word filtering time:', round((time()-start)/60, 1), 'minutes')

vectorizer_0 = TfidfVectorizer(
    lowercase=True,
    use_idf=True,
    vocabulary=common_words
)

tfidf_mat = vectorizer_0.fit_transform(joined_opinions)
print('Total vectorizing time:', round((time()-start)/60, 1), 'minutes')

## LSA

These basic tf-idf vectors will serve for some supervised learning processes and (when truncated) clustering.  But for more computationally expensive tasks, we'll need a reduced dataset.  Latent Semantic Analysis uses SVD to reduce the dataset down to a set of combined features that can be thought of as representing semantic clusters: our first piece of topic modeling in this notebook.

In [None]:
from collections import Counter
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import Normalizer

# remove_list = ["'s", '   ', '         ', '    ', '\r ', '      ','  ', '’s', ' ', '            ']

# make pipeline
vectorizer = TfidfVectorizer(
    lowercase=True,
    use_idf=True,
    vocabulary=common_words # could omit an additional most common 200 words here if desired
)
svd = TruncatedSVD(300)
lsa = make_pipeline(vectorizer, svd, Normalizer(copy=False))

# run pipeline
start=time()
lsa_mat_300 = lsa.fit_transform(joined_opinions)
lsa_mat_200 = lsa_mat_300[:,:200]
lsa_mat_100 = lsa_mat_300[:,:100]
print('Total LSA time:', round((time()-start)/60, 1), 'minutes')
print('lsa_mat shape:', lsa_mat_200.shape)
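One detail worth keeping in mind: the `Normalizer` step scales each 300-dimensional row to unit length, so column slices such as `lsa_mat_100` are no longer unit-length. Whether that matters depends on the downstream method. The sketch below uses random stand-in data, and assumes re-normalization is wanted.

```python
import numpy as np

rng = np.random.RandomState(11)
full = rng.normal(size=(5, 300))                     # random stand-in for the SVD output
full /= np.linalg.norm(full, axis=1, keepdims=True)  # unit-length rows, as after Normalizer

# Keeping only the first 100 columns shortens every row.
truncated = full[:, :100]
print(np.linalg.norm(truncated, axis=1))

# Re-normalizing restores unit length for cosine-style comparisons.
renormalized = truncated / np.linalg.norm(truncated, axis=1, keepdims=True)
print(np.linalg.norm(renormalized, axis=1))
```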

In [None]:
lsa_300_df = pd.concat(
    [
        op_df[['author_name','year_filed','case_name', 'category', 'word_count']].reset_index(drop=True), 
        pd.DataFrame(lsa_mat_300)
    ],
    axis=1)

lsa_100_df = pd.concat(
    [
        op_df[['author_name','year_filed','case_name', 'category', 'word_count']].reset_index(drop=True), 
        pd.DataFrame(lsa_mat_100)
    ],
    axis=1)

lsa_100_df.loc[:1]

### Examining the components (semantic clusters)

Since we've got these SVD-reduced LSA features, let's take a look at the weightings of the original features, which are lemmas, for each of the top ten combined features.  Below, I've plotted wordclouds with the word sizes determined by those feature weights.

In [None]:
from wordcloud import WordCloud

def visualize_components(components_df):
    '''Plot wordclouds of the 25 top-weighted terms per component, two components per row.'''
    n_components = components_df.shape[0]
    for j in range(n_components // 2): # with an odd count, the last component is not drawn
        plt.figure(figsize=(14,5))
        for k in range(2):
            i = 2*j + k
            plt.subplot(1, 2, k+1)
            weights_dict = components_df.loc[i].sort_values(ascending=False)[:25].to_dict()
            word_cloud = WordCloud(background_color='white', colormap='RdBu').generate_from_frequencies(weights_dict)
            plt.imshow(word_cloud, interpolation='bilinear')
            plt.title("FEATURE NUMBER " + str(i+1))
            plt.axis('off')
        plt.show()

In [None]:
components_df = pd.DataFrame(svd.components_[:20,:], columns=vectorizer.vocabulary)
visualize_components(components_df)

It's common with LSA for the first feature to serve as a filter for the most common and indiscriminate words across the board*, and we see some of that in Feature 1 here: a constitutional amendment, a counsel (attorney), a sentence, and an appeal are common to very many of these cases. Certainly these words don't clearly identify a distinct and coherent concept.

Some of the other features, though, have a more clearly articulable focus:
- Feature 1: [mixture/filter]
- Feature 2: criminal sentencing
- Feature 3: police conduct and the legality of searches
- Feature 4: religion in schools
- Feature 5: employer-employee arbitration
- Feature 6: Native American rights
- Feature 7: [mixed - ?]
- Feature 8: Native American cases
- Feature 9: elections and discrimination
- Feature 10: [mixed - ?]

Overall, about three of the top ten features here are intertwined and not robustly distinct.  The blends don't fit any single human-language concept or category.  This doesn't preclude the usefulness of our LSA-reduced dataset for ongoing modeling.  For insightful topic modeling, however, it would also make sense to give Latent Dirichlet Allocation a try.  One of its chief advantages is the ability to create more distinct or coherent topic clusters than LSA.

\*  *"Following the recommendation of Hu, Cai, Wiemer-Hastings, Graesser, & McNamara (2007), we discard the first dimension prior to computing similarity because this dimension is always positive and correlates with the underlying terms’ frequency in the corpus."  -- (PDF) Measuring Moral Rhetoric in Text. Available from: https://www.researchgate.net/publication/258698999_Measuring_Moral_Rhetoric_in_Text [accessed Oct 17 2018].*
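The recommendation quoted in the footnote can be sketched as follows. The helper name `cosine_similarity_without_first` and the two four-dimensional vectors are made up for illustration; they share a large first component but point in opposite directions in the rest.

```python
import numpy as np

def cosine_similarity_without_first(a, b):
    # Drop the first (frequency-like) dimension before comparing.
    a = np.asarray(a, dtype=float)[1:]
    b = np.asarray(b, dtype=float)[1:]
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Made-up LSA vectors: a large shared first component, opposite remainders.
doc_a = [0.9, 0.3, -0.2, 0.1]
doc_b = [0.9, -0.3, 0.2, -0.1]

full = float(np.dot(doc_a, doc_b) / (np.linalg.norm(doc_a) * np.linalg.norm(doc_b)))
reduced = cosine_similarity_without_first(doc_a, doc_b)
print(round(full, 3), round(reduced, 3))
```

With the first dimension included, the two vectors look fairly similar. Without it, they are exact opposites, which is the kind of distinction the shared first component can hide.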

### Latent Dirichlet Allocation

Before digging into the LDA results below, it's worth noting that tuning LDA is as much art as science.  For one thing, choosing the right number of components is highly empirical.  It's generally possible to use some heuristics based on the rate of change in the perplexity of the results (the RPC*); however, the perplexity metric built into sklearn's LDA is broken, and constructing my own isn't a priority for this capstone.  From empirical investigation of the resulting topical clusters over repeated runs, the topics seemed to be distinct and useful at low numbers (5-8), poor at small/medium numbers (10-20), and fairly good again at medium/large numbers (30-50).  I also tried tweaking the alpha and beta parameters (`doc_topic_prior` and `topic_word_prior` in sklearn), but the default values seemed to work best.

LDA is also not especially stable between runs.  I've set the random state for the sake of consistency, but different runs did produce somewhat different groupings and results, although this variance was curtailed by increasing the max_iter parameter.  

The wordclouds here, as before, scale the words by the weight allotted to each of the 25 top-weighted terms in the LDA vector.  Below these, I'll point out some interesting commonalities and differences between these topical clusters.

\* https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4597325/
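For reference, my reading of the RPC heuristic in the linked paper is the absolute change in perplexity divided by the change in topic count between consecutive candidates. Here is a minimal sketch under that assumption; the function name is hypothetical, and the perplexity values are placeholders chosen for easy arithmetic, not results from this notebook.

```python
def rate_of_perplexity_change(n_topics, perplexities):
    '''Return |delta perplexity / delta topics| for each consecutive pair of candidates.'''
    rpc = []
    for i in range(1, len(n_topics)):
        delta_p = perplexities[i] - perplexities[i - 1]
        delta_t = n_topics[i] - n_topics[i - 1]
        rpc.append(abs(delta_p / delta_t))
    return rpc

# Placeholder inputs only: candidate topic counts and made-up perplexity values.
candidate_topics = [5, 10, 20, 30]
placeholder_perplexity = [1000.0, 900.0, 850.0, 840.0]
print(rate_of_perplexity_change(candidate_topics, placeholder_perplexity))
```

In practice, the perplexities would have to come from a trustworthy held-out perplexity computation, which is exactly the piece missing here.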

#### LDA with 5 components

In [None]:
from sklearn.decomposition import LatentDirichletAllocation
from wordcloud import WordCloud

# tf-idf
vectorizer_2 = TfidfVectorizer(
    lowercase=True,
    use_idf=True,
    vocabulary=common_words
)
tfidf_mat = vectorizer_2.fit_transform(joined_opinions)

# LDA
start=time()
lda_5 = LatentDirichletAllocation(
    n_components=5,
    max_iter=20,
    random_state=11
)
lda_mat = lda_5.fit_transform(tfidf_mat)
print('Total LDA time:', round((time()-start)/60, 1), 'minutes')

# visual representation
components_df = pd.DataFrame(lda_5.components_, columns=vectorizer_2.vocabulary)
for i in range(5):
    weights_dict = components_df.loc[i].sort_values(ascending=False)[:25].to_dict()
    word_cloud = WordCloud(background_color='white', colormap='RdBu').generate_from_frequencies(weights_dict)
    plt.figure(figsize=(8,5))
    plt.imshow(word_cloud, interpolation='bilinear')
    plt.title("FEATURE NUMBER " + str(i+1))
    plt.axis('off')
    plt.show()

All five of these features are specific, coherent, and interpretable:
- Feature 1: separation of church and state
- Feature 2: nudity/obscenity laws
- Feature 3: employer / employee suits
- Feature 4: police searches and fourth amendment rights
- Feature 5: criminal sentencing

Some of these features overlap with the LSA features, but features 1 and 2 are almost entirely new.

#### LDA with 10 components

In [None]:
start=time()
lda_10 = LatentDirichletAllocation(
    n_components=10,
    max_iter=20,
    random_state=11
)
lda_mat = lda_10.fit_transform(tfidf_mat)
print('Total LDA time:', round((time()-start)/60, 1), 'minutes')

# visual representation
components_df = pd.DataFrame(lda_10.components_, columns=vectorizer_2.vocabulary)
visualize_components(components_df)

Of these ten features, seven are fairly clear:
- Feature 1: agriculture law
- Feature 2: ??
- Feature 3: elections and voting
- Feature 4: optometry (?!)
- Feature 5: pollution
- Feature 6: shipping and customs
- Feature 7: employment benefits
- Feature 8: criminal law and sentencing
- Feature 9: ??
- Feature 10: ??

As we increase n_components, the features are getting more specific (cf. optometry), but many are getting less comprehensible.

#### LDA with 30 components

In [None]:
start=time()
lda_30 = LatentDirichletAllocation(
    n_components=30,
    max_iter=50,
    random_state=11
)
lda_mat = lda_30.fit_transform(tfidf_mat)
print('Total LDA time:', round((time()-start)/60, 1), 'minutes')

# visual representation
components_df = pd.DataFrame(lda_30.components_, columns=vectorizer_2.vocabulary)
visualize_components(components_df)

I won't go through all 30 topics here, but note that some of them are totally indecipherable, while others are both coherent and highly specific.  Of particular note:
- Feature 6: television programming
- Feature 7: waterlands
- Feature 18: elections and voting
- Feature 19: wildlife resources (hunting & fishing)
- Feature 23: immigration
- Feature 24: employee abuse
- Feature 25: obscenity

At small numbers of topics (e.g., 5), LDA seems to work better here.  As the number of topics increases, the specificity increases but the consistency of coherence falls. Nevertheless, the 40-50% of the clusters that don't display any clear human-conceptual coherence may still represent significant semantic groupings not captured by our English vocabulary and concept-structures, or just not readily apparent.  

It is also possible as an optimization to filter out the top words selected by the first LSA feature, and then run LDA on the remaining dataset, as in the commented-out code snippet below.  In practice, I didn't find that this made any noticeable improvement in topic distinctness or coherence.

It's worth noting that different runs of LDA with different settings produced results that differed in interesting ways.  It would be far too space-consuming to print out the results of every variant, but I'll mention a few of the differences here.  

First, with 10-15 clusters, some runs grouped abortion with medical issues as in the groupings above, but others grouped it with fourth amendment rights and search legality.  That isn't the first connection most humans would think of, but it represents the fact that abortion has been treated in the courts as a right to privacy issue.  Such a grouping shows that the algorithm is drawing semantic clusters from how these topics are actually treated in the text, not just general language-use patterns.

Second, while almost every run of LDA included a topic cluster for criminal cases or sentencing and some version of finance and bankruptcy cases, the rest of the topic clusters varied widely.  Native American cases, religion, obscenity, segregation, immigration law, and maritime law all appeared in some runs and not others.  This demonstrates that the hyperparameters of LDA matter a great deal, and that it can be pretty unstable across different runs if max_iter is set fairly low.  But it also represents legitimate conceptual ambiguities.  Ten humans would almost certainly come up with ten different ways of enumerating and grouping the topics encompassed by these six thousand opinions. Is abortion a medical issue or a right to privacy issue, legally speaking?  Should trade be grouped with maritime law, or with immigration?  In a way, it's actually more reassuring than otherwise that runs of LDA can differ on some of the same issues that humans may reasonably differ on.
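One rough way to quantify this run-to-run variation is to compare the top terms of each topic across two runs. The sketch below uses Jaccard overlap on made-up term lists; the function names and the term lists are illustrative, and this is not the comparison used for the observations above, which were made by eye.

```python
def jaccard(a, b):
    # Share of terms common to both lists, out of all terms in either.
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

def best_matches(run_a, run_b):
    '''For each topic in run_a, return (index of the most similar topic in run_b, overlap).'''
    matches = []
    for terms_a in run_a:
        scores = [jaccard(terms_a, terms_b) for terms_b in run_b]
        best = max(range(len(scores)), key=scores.__getitem__)
        matches.append((best, scores[best]))
    return matches

# Made-up top terms for two topics in each of two runs.
run_1 = [['sentence', 'jury', 'death'], ['search', 'police', 'warrant']]
run_2 = [['police', 'search', 'privacy'], ['jury', 'sentence', 'death']]
print(best_matches(run_1, run_2))
```

Topics with a low best overlap are the ones that appeared in one run but had no clear counterpart in the other.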

In [None]:
# # LSA-filtered LDA:
# components_df = pd.DataFrame(svd.components_[:2,:], columns=vectorizer.vocabulary)
# filtered_words = components_df.loc[0].sort_values(ascending=False)[:25].index
# print(filtered_words)

# vectorizer_3 = TfidfVectorizer(
#     lowercase=True,
#     use_idf=True,
#     vocabulary=[word for word in common_words if word not in filtered_words]
# )
# tfidf_mat_3 = vectorizer_3.fit_transform(joined_opinions)

# lda_20 = LatentDirichletAllocation(
#     n_components=20,
#     max_iter=20,
#     random_state=11
# )
# lda_mat = lda_20.fit_transform(tfidf_mat_3)

### Plotting topics over time

In [None]:
lda_mat.shape

In [None]:
lda_df = pd.concat(
    [
        op_df[['author_name','year_filed','case_name', 'category', 'word_count']].reset_index(drop=True), 
        pd.DataFrame(lda_mat)
    ],
    axis=1)

lda_by_year = lda_df[['year_filed'] + list(range(30))].groupby('year_filed').agg('mean')
lda_by_year.head()

In [None]:
x = lda_by_year.index[1:-1]
plt.figure(figsize=(14,10))

for feature in [14,22,23]:
    y_vals = list(lda_by_year[feature-1])
    y = [(y_vals[j] + y_vals[j-1] + y_vals[j+1])/3 for j in range(1, len(y_vals)-1)] # rolling mean for smoothing
    plt.plot(x,y)
plt.legend(labels=[
    'religion in schools',
    'unions and employers',
    'immigration'
])  
plt.title("Topic prominence in SCOTUS opinions over time")
plt.show()

There's a lot of noise here because there are many factors affecting which cases reach the Supreme Court each year.  But some of the trends may be meaningful: union-related cases have declined with the decrease in unionization until the post-2008-crisis uptick in unionization.  Immigration seems to have peaked around 2000, just after the peak in Mexican immigrant numbers.  And religion in schools seems to have settled somewhat since the 80s and 00s.  (Other topics displayed less interpretable trends.)
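
The smoothing in the plot above is a centered three-point rolling mean, which is why the x-axis drops the first and last year. Here is a minimal standalone sketch of that calculation in plain Python; the function name and the sample values are hypothetical. (With pandas, `Series.rolling(3, center=True).mean()` should give the same interior values.)

```python
def centered_rolling_mean(values, window=3):
    """Average each point with its neighbors; endpoints without a full window are dropped."""
    if window < 1 or window % 2 == 0:
        raise ValueError("window must be a positive odd integer")
    half = window // 2
    return [
        sum(values[i - half:i + half + 1]) / window
        for i in range(half, len(values) - half)
    ]


# Hypothetical yearly topic weights (not taken from the notebook's data).
yearly = [0.02, 0.08, 0.05, 0.02, 0.06]
smoothed = centered_rolling_mean(yearly)
print(len(yearly), "raw points ->", len(smoothed), "smoothed points")
```

A wider window would smooth more aggressively, at the cost of losing more years at each end of the series.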

## Author Similarity
Since we've vectorized every opinion in our dataset, we can now average each author's opinion vectors and measure some differences.  The averages across the reduced featureset are all very similar overall, but there are also some interesting differences.

First, let's try sequencing the rows and columns of the correlation heatmap by chronology (specifically, the mean year for each author).

In [None]:
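# order authors by the mean filing year of their opinions
temporal_sequence = opinions_df.groupby('author_name').year_filed.mean().sort_values().index
corr_df = lsa_300_df.drop('category', axis=1).groupby('author_name').agg('mean').T.corr()
typicality = corr_df.sum(axis=1)  # higher total similarity = more typical author

# mask the upper triangle (the matrix is symmetric); np.bool was removed in NumPy 1.24, so use bool
mask = np.zeros_like(corr_df, dtype=bool)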
mask[np.triu_indices_from(mask)] = True
sns.set(style="white")

In [None]:
plt.figure(figsize=(10,10))
sns.heatmap(
    corr_df.loc[temporal_sequence, temporal_sequence],  # reuse the correlation matrix computed above
    cmap='Greens',
    square=True,
    mask=mask
)
plt.title("Similarity heatmap of averaged LSA opinion vector for each justice (by year)")
plt.show()

Here, with chronology, we see some patterning. As we would expect, the oldest justices are fairly similar to each other (top left corner is shaded dark) and the newer justices are fairly similar to each other (bottom right corner is shaded dark), while the differences are greater across a larger time difference (bottom left corner is lighter).  

One significant deviation from this trend is Justice Thomas.  His opinions are actually more similar to those of justices thirty years prior than to those of most of his contemporaries, with the exception of Justice Scalia, who displays a similar (though less pronounced) tendency.  

Now let's try sorting by distinctness, measured by the sum of each author's similarity scores: the lower the sum, the more distinct the author.

In [None]:
plt.figure(figsize=(10,10))
sns.heatmap(
    corr_df.loc[typicality.sort_values().index, typicality.sort_values().index],  # most distinct authors first
    cmap='Greens',
    square=True,
    mask=mask
)
plt.title("Similarity heatmap of averaged LSA opinion vector for each justice (by distinctness)")
plt.show()

This gives us essentially what we expect: the more distinct authors (at the top) are either very similar to or very unlike each other (very dark or very light), whereas the more typical authors (at the bottom) have a moderate difference from all other authors.  By this metric, it appears that Justice Scalia is the prototypical opinion author.  This makes sense given his tendency to write roughly as much in the fashion of his predecessors as in the fashion of his contemporaries.  It is surprising, however, given his reputation for stylistic flair.  (It may be that this reputation is owed to a small but vociferous minority of his opinions.)
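
To make the ordering concrete, here is a minimal sketch of the "typicality" calculation on a toy similarity matrix. The authors `A`, `B`, `C` and their similarity values are hypothetical, and `typicality_order` is an illustrative name; the notebook does the same thing with `corr_df.sum(axis=1)`.

```python
def typicality_order(similarity):
    """Sum each author's similarity scores and sort authors from most distinct to most typical.

    similarity: dict mapping author -> dict of author -> similarity score.
    Returns (ordered author names, dict of totals).
    """
    totals = {name: sum(row.values()) for name, row in similarity.items()}
    return sorted(totals, key=totals.get), totals


# Hypothetical symmetric similarity matrix for three authors.
similarity = {
    "A": {"A": 1.0, "B": 0.9, "C": 0.5},
    "B": {"A": 0.9, "B": 1.0, "C": 0.6},
    "C": {"A": 0.5, "B": 0.6, "C": 1.0},
}

order, totals = typicality_order(similarity)
print("most distinct -> most typical:", order)
```

Note that a high total only means an author sits close to the average of the group; it says nothing about whether that similarity is driven by style, era, or case mix.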

Finally, we can also try clustering the authors:

In [None]:
from sklearn.cluster import KMeans

justices_df = lsa_300_df.drop('category', axis=1).groupby('author_name').agg('mean')

kmeans = KMeans(n_clusters=2)
labels = kmeans.fit_predict(justices_df)

justices_df['cluster_labels'] = labels
justices_df.cluster_labels.sort_values()

It looks like the clustering tends to pick up mostly on era, with some weight toward ideology as well: older and/or more conservative justices are grouped in cluster 0, while more recent and more progressive justices are grouped in cluster 1.

# Clustering

Now we'll move on to performing clustering on the opinions.  This has several goals, including:
- revealing the innate structure and natural groups of the data
- yielding alternate semantic clusters
- finding cluster labels that may be likely to help with classification

First we'll need to get a two-feature version of our data for visualization, using t-SNE, a neighbor-embedding method that tries to keep similar points close together. This requires us to start with a smaller, more manageable set of features.  We'll take a quick look at the explained variance plot of our LSA features to decide how many features we should feed in, then proceed to performing the t-SNE and clustering.

In [None]:
# plot explained variance
sns.set()
plt.figure(figsize=(8,6))
plt.plot(svd.explained_variance_)
plt.title("Explained variance by each successive feature")
plt.show()

It looks like somewhere between 50 and 75 – toward the latter end of the elbow – would get us the best bang-for-buck.  For now we'll call it 65.
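
The choice above is made by eye from the elbow. As a complementary check (a different technique from the one used here), one could look at the cumulative share of explained variance. The sketch below is plain Python with made-up variance values, and `components_for_share` is a hypothetical helper. Note that the share is relative to the components that were kept, not to the full variance of the original TF-IDF matrix.

```python
def components_for_share(variances, share):
    """Smallest number of leading components whose variance reaches `share` of the total kept."""
    if not 0 < share <= 1:
        raise ValueError("share must be in (0, 1]")
    total = sum(variances)
    running = 0.0
    for k, v in enumerate(variances, start=1):
        running += v
        if running / total >= share:
            return k
    return len(variances)


# Hypothetical, decreasing per-component variances (not the notebook's values).
variances = [5.0, 3.0, 1.0, 0.5, 0.5]
print(components_for_share(variances, 0.85))
```

In the notebook, the input would be something like `list(svd.explained_variance_)`, and the threshold is a judgment call just as the elbow is.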

### t-SNE and plotting

In [None]:
from sklearn.manifold import TSNE
tsne = TSNE(random_state=12)
start = time()
tsne_mat = tsne.fit_transform(lsa_mat_100[:,:65])
tsne_df = pd.concat(
    [lsa_100_df[['author_name','category','case_name','year_filed']].reset_index(drop=True), 
     pd.DataFrame(tsne_mat)], 
    axis=1
) 
print("Elapsed time:", round((time()-start)/60, 1), "minutes")

plt.figure(figsize=(16,14))
sns.scatterplot(x=tsne_df[0], y=tsne_df[1], s=25, hue=tsne_df.author_name)  # keyword x/y: newer seaborn rejects positional data
plt.show()

This preliminary plot nicely demonstrates the difficulty of the author-classification task: while the data does have some natural clustering, this clustering is definitely not around author (as we'll see below, it does seem to be driven by the topic of the case).  This also means that including cluster labels in the data, which may allow an algorithm to account for topic, will be especially important for the classification task.

### K-means

To choose the number of clusters for our K-means, I'll plot the inertia, silhouette score, and adjusted Rand index (ARI, used here as a measure of consistency across different starting centroids) across a range of k values.  We're looking for inflection points in the inertia and silhouette score, plus a reasonably high ARI.
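
Since ARI does a lot of work in this comparison, here is a small plain-Python sketch of what it measures: agreement between two labelings, corrected for chance and insensitive to how the clusters happen to be numbered. This is an illustrative reimplementation of the standard pair-counting formula, not scikit-learn's code; the notebook itself uses `adjusted_rand_score`.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_a, labels_b):
    """Adjusted Rand index between two labelings of the same items (pair-counting form)."""
    if len(labels_a) != len(labels_b):
        raise ValueError("labelings must have the same length")
    total_pairs = comb(len(labels_a), 2)
    if total_pairs == 0:
        return 1.0
    # pairs of items that share a cluster in both labelings
    sum_cells = sum(comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values())
    # pairs that share a cluster within each labeling separately
    sum_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    sum_b = sum(comb(c, 2) for c in Counter(labels_b).values())
    expected = sum_a * sum_b / total_pairs
    maximum = (sum_a + sum_b) / 2
    if maximum == expected:  # degenerate case, e.g. everything in one cluster
        return 1.0
    return (sum_cells - expected) / (maximum - expected)


# Same grouping with the cluster numbers shuffled: perfect agreement.
print(adjusted_rand_index([0, 0, 1, 1, 2, 2], [2, 2, 0, 0, 1, 1]))
# Groupings that cut across each other: worse than chance.
print(adjusted_rand_index([0, 0, 1, 1], [0, 1, 0, 1]))
```

This is why ARI is a reasonable stability check for K-means: two runs with different seeds will number their clusters differently, but if they find the same groups the score stays near 1.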

In [None]:
from sklearn.cluster import KMeans, AgglomerativeClustering, SpectralClustering
from sklearn.metrics import adjusted_rand_score, silhouette_score, silhouette_samples

inertia_scores = []
ari_scores = []
silhouettes = []
kvals = range(6,37,3)
for k in kvals:
    print('Processing k =', k, end='\r')
    kmeans = KMeans(n_clusters=k, random_state=11)
    cluster_labels = kmeans.fit_predict(lsa_mat_300)
    kmeans2 = KMeans(n_clusters=k, random_state=13)
    cluster_labels2 = kmeans2.fit_predict(lsa_mat_300)
    ari = adjusted_rand_score(cluster_labels, cluster_labels2)
    ari_scores.append(ari)
    inertia_scores.append(kmeans.inertia_)
    silhouette = round(silhouette_score(lsa_mat_300, cluster_labels), 3)
    silhouettes.append(silhouette)

plt.figure(figsize=(10,7))
plt.plot(kvals, inertia_scores / np.mean(inertia_scores), label='inertia')
plt.plot(kvals, ari_scores, label='ARI')
plt.plot(kvals, silhouettes / np.mean(silhouettes), label='silhouette')
plt.title('Normalized metric scores vs k')
plt.xlabel("k")
plt.ylabel('Mean-normalized scores')
plt.legend()
plt.show()

This isn't very conclusive, unfortunately: there are no clear elbows in inertia or silhouette scores, nor dramatic peaks in ARI. Given this, we'll mostly go off the shape of the data and the number of clear topics that emerged from LDA.  Based on these, something in the range of 18 to 24 would probably work well.

In [None]:
tfidf_df = pd.DataFrame(tfidf_mat.todense(), columns=vectorizer_2.vocabulary) # tfidf_mat must be the 400: version

def labels_to_keywords(labels, tfidf_df, n=2):
    label_dict = {}
    for label in np.unique(labels):
        label_dict[label] = list(tfidf_df[labels==label].sum().sort_values(ascending=False).index[:n+3])
        label_dict[label] = ', '.join([word for word in label_dict[label] if word not in ['ct', 'court', 'appeal']][:n])
    return [label_dict[label] for label in labels]

In [None]:
kmeans = KMeans(n_clusters=18, random_state=11)
labels = kmeans.fit_predict(lsa_mat_300)
kmeans_15_labels = labels[:]  # note: despite the name, these labels come from the k=18 run above
plt.figure(figsize=(14,12))
sns.scatterplot(x=tsne_df[0], y=tsne_df[1], s=25, hue=labels_to_keywords(labels, tfidf_df))
plt.title('K-means clustering with k=18')
plt.show()

In [None]:
kmeans = KMeans(n_clusters=24, random_state=11)
labels = kmeans.fit_predict(lsa_mat_300)
kmeans_24_labels = labels[:]
plt.figure(figsize=(16,14))
sns.scatterplot(x=tsne_df[0], y=tsne_df[1], s=25, hue=labels_to_keywords(labels, tfidf_df))
plt.title('K-means clustering with k=24')
plt.show()

If we want a more detailed interpretation of the clusters, we can also look at the full wordcloud for each cluster (the labels above just use the top two words).

In [None]:
# components_df = pd.DataFrame(lda_12.components_, columns=vectorizer_2.vocabulary)
label_dict = {}
for label in np.unique(labels):
    label_dict[label] = list(tfidf_df[labels==label].sum().sort_values(ascending=False).index[:3])
    label_dict[label] = ', '.join([word for word in label_dict[label] if word !='ct'][:2])
    
def display_wordcloud(label):
    print("LABEL {} ({})".format(label, label_dict[label]))
    print("examples:")
    print(' ','\n  '.join(list(set(lsa_300_df.case_name[labels==label][:7].values))))
    weights_dict = tfidf_df[labels==label].sum().sort_values(ascending=False)[:25].to_dict()
    word_cloud = WordCloud(background_color='white', colormap='RdBu').generate_from_frequencies(weights_dict)
    plt.figure(figsize=(8,5))
    plt.imshow(word_cloud, interpolation='bilinear')
    plt.title("LABEL NUMBER " + str(label))  # use the function argument, not the global loop variable
    plt.axis('off')
    plt.show()

for i in range(24):
    display_wordcloud(i)

This is a very promising result: these semantic clusters seem to be even more consistently distinct and interpretable than those produced by LDA.  Nearly every one matches an easily named natural topic.  
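
The two-word names shown in the plots come from summing TF-IDF weights within each cluster and keeping the heaviest terms. Here is a minimal plain-Python sketch of that idea on toy data. The documents and weights are hypothetical, and unlike `labels_to_keywords` above, this version filters the ignored words before ranking rather than after.

```python
from collections import defaultdict


def cluster_keywords(doc_weights, labels, n=2, ignore=("ct", "court")):
    """Name each cluster by its n highest-weighted terms, summed over the cluster's documents.

    doc_weights: list of dicts mapping term -> TF-IDF weight, one dict per document.
    labels: cluster label for each document, in the same order.
    """
    totals = defaultdict(lambda: defaultdict(float))
    for weights, label in zip(doc_weights, labels):
        for word, weight in weights.items():
            if word not in ignore:
                totals[label][word] += weight
    return {
        label: ", ".join(sorted(words, key=words.get, reverse=True)[:n])
        for label, words in totals.items()
    }


# Hypothetical per-document term weights and cluster assignments.
docs = [
    {"court": 0.9, "sentence": 0.6, "offense": 0.4},
    {"court": 0.8, "sentence": 0.5, "jury": 0.2},
    {"court": 0.7, "water": 0.6, "land": 0.5},
]
labels_demo = [0, 0, 1]

print(cluster_keywords(docs, labels_demo))
```

Dropping near-universal terms like "court" matters here: without the ignore list, almost every cluster of judicial opinions would get the same name.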

### Agglomerative clustering

Agglomerative clustering works reasonably well here too, though it tends more toward one big central cluster with spotted outliers.

In [None]:
agglomerative = AgglomerativeClustering(n_clusters=24, linkage='ward')
labels = agglomerative.fit_predict(lsa_mat_300)
ward_labels = labels[:]
plt.figure(figsize=(14,12))
sns.scatterplot(x=tsne_df[0], y=tsne_df[1], s=25, hue=labels_to_keywords(labels, tfidf_df))
plt.title('Ward agglomerative clustering with n=24')
plt.show()

### Spectral clustering

Spectral clustering produces clusters that are very nearly as clean as the k-means clusters, and seem to be picking out some different dimensionality through the middle.  It will definitely be worth taking a look at the interpretability of these clusters as topical clusters, as well as including them as features for supervised learning.

In [None]:
spectral = SpectralClustering(n_clusters=22)
labels = spectral.fit_predict(lsa_mat_300)
spectral_labels = labels[:]
plt.figure(figsize=(14,12))
sns.scatterplot(x=tsne_df[0], y=tsne_df[1], s=25, hue=labels_to_keywords(labels, tfidf_df))
plt.title('Spectral clustering with k=22')
plt.show()

Overall, while there are some real differences, these clustering methods result in the same general outline of clusters: a central, hard-to-separate crowd of business-related cases (agency, income, commerce, &c), and many outlying clusters of more specific civil and criminal topics.  While these label sets aren't perfect, they do provide a pretty sound taxonomy of coherent topics.

The topic clusters are also appropriately proximate to each other.  For instance, in K-means with k=24 (probably the best of these clusterings), "sentence, sentencing" is right next to "offense, conviction," fairly close to "juror, peremptory", and very far away from the natural-resource-focused "water, land."

These labels can hypothetically contribute to a supervised learning task (which we'll try below), but they are more useful for document classification and for tracking topic prominence against time or other factors (as in the LDA section).

# Supervised learning: author classification

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier
from lightgbm import LGBMClassifier

def test_models(X_train, X_test, y_train, y_test, mnb=True):
    start = time()
    # logistic regression
    lr = LogisticRegression(solver='lbfgs', multi_class='multinomial', class_weight='balanced')
    lr.fit(X_train, y_train)
    print("**Logistic Regression**")
    print('Training set score:', round(lr.score(X_train, y_train), 4))
    print('Test set score:', round(lr.score(X_test, y_test), 4))
    
    if mnb:  # MultinomialNB requires non-negative features, so it can't take LSA components
        nb = MultinomialNB()  # separate name so the boolean `mnb` flag isn't overwritten
        nb.fit(X_train, y_train)
        print("\n**Multinomial Naive Bayes**")
        print('Training set score:', round(nb.score(X_train, y_train), 4))
        print('Test set score:', round(nb.score(X_test, y_test), 4))

    # linear SVM
    svc = LinearSVC(class_weight='balanced')
    svc.fit(X_train, y_train)
    print("\n**Linear SVM**")
    print('Training set score:', round(svc.score(X_train, y_train), 4))
    print('Test set score:', round(svc.score(X_test, y_test), 4))

    # random forest
    rfc = RandomForestClassifier(n_estimators=20)
    rfc.fit(X_train, y_train)
    print("\n**Random Forest**")
    print('Training set score:', round(rfc.score(X_train, y_train), 4))
    print('Test set score:', round(rfc.score(X_test, y_test), 4))
    print("\n Total elapsed time:", round((time()-start)/60, 1), "minutes")

In [None]:
# make train/test splits
test_index_2 = np.random.choice(np.arange(tfidf_mat.shape[0]), size=opinions_df.shape[0]//4, replace=False)
train_index_2 = np.setdiff1d(np.arange(tfidf_mat.shape[0]), test_index_2)

# basic training sets
X_train_tfidf = pd.DataFrame(tfidf_mat[train_index_2].todense()) # could remain csr_mat if I convert all the below to csrs
X_test_tfidf = pd.DataFrame(tfidf_mat[test_index_2].todense())
X_train_lsa = lsa_mat_200[train_index_2]
X_test_lsa = lsa_mat_200[test_index_2]
y_train = lsa_300_df['author_name'].loc[train_index_2]
y_test = lsa_300_df['author_name'].loc[test_index_2]

# advanced and cluster labels
X_train_kmeans_15 = pd.get_dummies(pd.Series(kmeans_15_labels).astype(str)).loc[train_index_2]
X_test_kmeans_15 = pd.get_dummies(pd.Series(kmeans_15_labels).astype(str)).loc[test_index_2]
X_train_kmeans_24 = pd.get_dummies(pd.Series(kmeans_24_labels).astype(str)).loc[train_index_2]
X_test_kmeans_24 = pd.get_dummies(pd.Series(kmeans_24_labels).astype(str)).loc[test_index_2]
X_train_spectral = pd.get_dummies(pd.Series(spectral_labels).astype(str)).loc[train_index_2]
X_test_spectral = pd.get_dummies(pd.Series(spectral_labels).astype(str)).loc[test_index_2]
X_train_ward = pd.get_dummies(pd.Series(ward_labels).astype(str)).loc[train_index_2]
X_test_ward = pd.get_dummies(pd.Series(ward_labels).astype(str)).loc[test_index_2]
X_train_all_clusters = pd.concat([X_train_kmeans_24, X_train_spectral, X_train_ward], axis=1)
X_test_all_clusters = pd.concat([X_test_kmeans_24, X_test_spectral, X_test_ward], axis=1)
X_train_lda = pd.DataFrame(lda_mat[train_index_2])
X_test_lda = pd.DataFrame(lda_mat[test_index_2])

### Preliminary tests

In [None]:
test_models(X_train_lsa, X_test_lsa, y_train, y_test, mnb=False)

In [None]:
start = time()
def minitest(X_train, X_test, y_train, y_test):
    lgbmc = LGBMClassifier(class_weight='balanced', min_data=1, min_data_in_bin=1)
    lgbmc.fit(X_train, y_train)
    print("\n**LightGBM**")
    print('Training set score:', lgbmc.score(X_train, y_train))
    print('Test set score:', lgbmc.score(X_test, y_test))

minitest(X_train_lsa, X_test_lsa, y_train, y_test)
print("Elapsed time:", round((time()-start)/60, 1), "minutes")

In [None]:
test_models(X_train_tfidf, X_test_tfidf, y_train, y_test, mnb=True)

In [None]:
# decision tree
from sklearn.tree import DecisionTreeClassifier
dtc = DecisionTreeClassifier(class_weight='balanced')
dtc.fit(X_train_tfidf, y_train)
print("\n**Decision Tree Classifier**")
print('Training set score:', round(dtc.score(X_train_tfidf, y_train), 4))
print('Test set score:', round(dtc.score(X_test_tfidf, y_test), 4))

### Tests with cluster labels

In [None]:
test_models(
    pd.concat([X_train_tfidf, X_train_lda], axis=1),
    pd.concat([X_test_tfidf, X_test_lda], axis=1),
    y_train,
    y_test,
    mnb=True)

In [None]:
test_models(
    pd.concat([X_train_tfidf.reset_index(drop=True), X_train_kmeans_15.reset_index(drop=True)], axis=1),
    pd.concat([X_test_tfidf.reset_index(drop=True), X_test_kmeans_15.reset_index(drop=True)], axis=1),
    y_train,
    y_test,
    mnb=False)

In [None]:
test_models(
    pd.concat([X_train_tfidf.reset_index(drop=True), X_train_spectral.reset_index(drop=True)], axis=1),
    pd.concat([X_test_tfidf.reset_index(drop=True), X_test_spectral.reset_index(drop=True)], axis=1),
    y_train,
    y_test,
    mnb=True)

### Tests with all available data

In [None]:
test_models(
    pd.concat([X_train_tfidf.reset_index(drop=True), 
               X_train_all_clusters.reset_index(drop=True),
              X_train_lda.reset_index(drop=True)], axis=1),
    pd.concat([X_test_tfidf.reset_index(drop=True), 
               X_test_all_clusters.reset_index(drop=True),
              X_test_lda.reset_index(drop=True)], axis=1),
    y_train,
    y_test,
    mnb=True)

### Optimal model

In [None]:
# random forest
start = time()
rfc = RandomForestClassifier(n_estimators=500)
rfc.fit(X_train_tfidf, y_train)
print("\n**Random Forest**")
print('Training set score:', round(rfc.score(X_train_tfidf, y_train), 4))
print('Test set score:', round(rfc.score(X_test_tfidf, y_test), 4))
print("\n Total elapsed time:", round((time()-start)/60, 1), "minutes")

Optimal model with all cluster values:

In [None]:
# random forest
X_train_temp = pd.concat(
    [X_train_tfidf.reset_index(drop=True), 
    X_train_all_clusters.reset_index(drop=True),
    X_train_lda.reset_index(drop=True)], 
    axis=1
)

X_test_temp = pd.concat(
    [X_test_tfidf.reset_index(drop=True),
     X_test_all_clusters.reset_index(drop=True),
    X_test_lda.reset_index(drop=True)], 
    axis=1
)

start = time()
rfc = RandomForestClassifier(n_estimators=500)
rfc.fit(X_train_temp, y_train)
print("\n**Random Forest**")
print('Training set score:', round(rfc.score(X_train_temp, y_train), 4))
print('Test set score:', round(rfc.score(X_test_temp, y_test), 4))
print("\n Total elapsed time:", round((time()-start)/60, 1), "minutes")

### Supervised learning conclusions

By and large, adding cluster labels doesn't actually help at all.  The Spectral clustering labels did help a little bit with the SVM, but they either didn't change or slightly lowered the other model scores.  It's a little bit surprising that none of the more advanced topical analysis seems to have helped at all.

The most accurate model here seems to be Random Forest, topping out at around 37%.  For text classification in general, it's surprising that RF would be better than the canonical linear SVM (our second most accurate model here, at 31%).  However, it's easy to see why it might be more effective given the particulars of this data.  The data has a latent hierarchical structure, with authorship falling well below subject/topic of the case.  Because decision trees split on one feature after another, a Random Forest can in principle condition on topic first and only then pick up on authorship signals.

That said, all of these models are massively overfitting.  There may be parameter tweaking that could improve this.  But it also means that our core problem here is probably the limited data.  If we had ten times the data, we could probably predict authorship pretty accurately – but there's a very finite number of opinions authored by each SCOTUS justice.

Overall, due to the various intrinsic difficulties of the dataset, our prediction rate will probably remain fairly low no matter what we try.
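
For context on what "fairly low" means here: with 18 authors, uniform random guessing would be right about 1 time in 18 (roughly 5.6%). A slightly stronger reference point is always predicting the most frequent author in the training set. The notebook doesn't compute that baseline, so the sketch below uses hypothetical labels and a made-up helper name purely to show the calculation.

```python
from collections import Counter


def majority_baseline(y_train, y_test):
    """Accuracy on y_test of always predicting the most common label in y_train."""
    if not y_train or not y_test:
        raise ValueError("both label lists must be non-empty")
    most_common_label, _ = Counter(y_train).most_common(1)[0]
    hits = sum(1 for y in y_test if y == most_common_label)
    return most_common_label, hits / len(y_test)


# Hypothetical author labels (not the notebook's data).
y_train_demo = ["A"] * 5 + ["B"] * 3 + ["C"] * 2
y_test_demo = ["A", "A", "B", "C"]

label, accuracy = majority_baseline(y_train_demo, y_test_demo)
print(f"always predicting {label!r} scores {accuracy:.0%}")
```

Running the same function on the real `y_train` and `y_test` would show how much of the reported accuracy is signal beyond the class imbalance.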

# Checking clustering on test set

Here we'll return to the test data set aside earlier and see if the methods produce comparable results.

### LSA

In [None]:
# join each opinion's lemmas into one string; the vectorizer's fixed vocabulary drops uncommon words
joined_opinions_test = [' '.join(opinion) for opinion in lemmatized_opinions_nonames_test]
tfidf_mat_test = vectorizer_2.transform(joined_opinions_test)

start=time()
lsa_mat_test_300 = svd.transform(tfidf_mat_test)
lsa_mat_test_200 = lsa_mat_test_300[:,:200]
lsa_mat_test_100 = lsa_mat_test_300[:,:100]
print('Total LSA time:', round((time()-start)/60, 1), 'minutes')
print('lsa_mat shape:', lsa_mat_test_200.shape)

In [None]:
# TSNE
tsne = TSNE(random_state=12)
start = time()
tsne_mat_test = tsne.fit_transform(lsa_mat_test_300[:,:65])
tsne_test_df = pd.concat(
    [test_df[['author_name','category','case_name','year_filed']].reset_index(drop=True), 
     pd.DataFrame(tsne_mat_test)], 
    axis=1
) 
print("Elapsed time:", round((time()-start)/60, 1), "minutes")

plt.figure(figsize=(14,12))
sns.scatterplot(x=tsne_test_df[0], y=tsne_test_df[1], s=25, hue=tsne_test_df.author_name)
plt.show()

In [None]:
tfidf_df_test = pd.DataFrame(tfidf_mat_test.todense(), columns=vectorizer_2.vocabulary) # tfidf_mat must be the 400: version

def labels_to_keywords(labels, tfidf_df, n=2):
    label_dict = {}
    for label in np.unique(labels):
        label_dict[label] = list(tfidf_df[labels==label].sum().sort_values(ascending=False).index[:n+1])
        label_dict[label] = ', '.join([word for word in label_dict[label] if word not in ['ct', 'court']][:n])
    return [label_dict[label] for label in labels]

In [None]:
kmeans = KMeans(n_clusters=24, random_state=11)
labels = kmeans.fit_predict(lsa_mat_test_300)
plt.figure(figsize=(14,12))
sns.scatterplot(x=tsne_test_df[0], y=tsne_test_df[1], s=25, hue=labels_to_keywords(labels, tfidf_df_test))
plt.title('K-means clustering with k=24')
plt.show()

In [None]:
spectral = SpectralClustering(n_clusters=22)
labels = spectral.fit_predict(lsa_mat_test_300)
spectral_labels_test = labels[:]  # separate name so the training-set spectral_labels aren't overwritten
plt.figure(figsize=(14,12))
sns.scatterplot(x=tsne_test_df[0], y=tsne_test_df[1], s=25, hue=labels_to_keywords(labels, tfidf_df_test))
plt.title('Spectral clustering with k=22')
plt.show()

### Clustering check conclusions:

All of these results are affirmative. The t-SNE layout of the test set is broadly similar to that of the training data, with some differences to be expected because t-SNE is fit from scratch on each dataset and preserves local neighborhoods rather than global geometry.  More importantly, the topic clusters are nearly identical.  In short, the rerun shows that the clustering above is stable and carries over to the withheld test data.

# Conclusions
In this examination of SCOTUS opinions, we've generated insight into author tendencies and differences between justices, the structure of opinion data, and latent semantic clusters in the opinions.  The topical clusters produced by clustering seem to be the most coherent (even more than the topics from LDA), probably because the topic/subject is the most prominent or determinative latent aspect of any given opinion.  The supervised task of predicting authorship remains extremely difficult, however, due to the diverse topics covered, the formulaic nature of the prose, and the collaborative methods by which the texts were generated.

In future work, topical clusters like these could easily be used for opinion classification.  Tracking political leanings within topical clusters could also prove more insightful than trying to do so across all topics.

### Further work
There remains much to be done with this data that could not fit into this capstone project.  In particular:
- A **more rigorous Moral Foundations Theory analysis** and certain types of **bias evaluation** using pre-trained word embeddings could lend useful insight.
- **Sentiment analysis** using pre-trained word embeddings may yield interesting results, especially over the course of single opinions from paragraph to paragraph.
- **Textual entailment** could be used to attempt a mapping of argumentative structures in the opinions.
- **Mapping centers of tension**: we could use custom-generated word embeddings (with word2vec, or an inverted tf-idf + LSA) to measure which words are used most differently by different authors – especially justices of different political leanings.  This could identify ideological and linguistic centers of tension, which we could map over time.

Overall, this project revealed a lot of information about the opinions themselves, but pointed even more toward interesting future work that could be built on this foundation.