<h1 style="text-align: center;">Topic Modeling of Supreme Court Cases</h1>
<h4 style="text-align: center;"> 
Max Liebeskind | Madhu Vijay | Yunhan Xu <br>
CS 109/Stat 121  <br>
December 9, 2015
</h4>

<img src="https://upload.wikimedia.org/wikipedia/commons/4/43/Supreme_Court_US_2010.jpg" align ="center" height="300" width="450">

Photo source: https://upload.wikimedia.org/wikipedia/commons/4/43/Supreme_Court_US_2010.jpg

<h3 style="text-align: center;">Overview and motivation</h3>

Every year, thousands of legal opinions are written in the United States. The authors of these opinions range from local judges to the Chief Justice of the Supreme Court, yet many of these opinions have one thing in common: they count as *precedent*. Because the American legal system is a common law system, judges rely on precedents (previous judicial decisions) to decide new cases [1]. Accordingly, lawyers research precedents to form legal arguments and write legal briefs. Because so many cases exist in the American legal canon, however, it is difficult for lawyers to find the cases that best fit their research needs. (To get an idea of how many cases are heard every year in the United States, consider that the Supreme Court receives about 10,000 appeals per year, yet these 10,000 cases represent only a subset of the cases heard in federal appeals courts and state supreme courts [2].)

Our project attempts to reduce the burden on lawyers by making it easier to classify court cases. The millions of cases in the American legal canon are, for the most part, not classified, which means that a tax lawyer who wants a comprehensive list of tax law cases cannot easily obtain one. Moreover, classifying cases manually is extremely costly in both time and money. If we could use computerized text analysis to classify cases by topic area (or "issue area," to use a more legal term), classification would become much easier. Our project takes a first step in this direction by classifying Supreme Court cases by issue area using text analysis methods. Because Supreme Court cases are widely studied, political scientists have already manually classified them into issue areas; this makes Supreme Court cases an ideal set on which to train text analysis models. In our project, we train a number of different models, both supervised and unsupervised, to classify cases by issue area. We also briefly examine how text analysis can be used to predict the partisanship of cases (i.e., whether the decision leans conservative or liberal).

The remainder of this notebook proceeds as follows. First, we describe the data we use. Second, we give a brief overview of the models we have trained, including how we prepared the data for these models. Finally, we provide a table of contents for the rest of our project. The table of contents gives the order in which our notebooks should be read.

[1] https://www.law.berkeley.edu/library/robbins/CommonLawCivilLawTraditions.html

[2] http://www.supremecourt.gov/faq.aspx#faqgi9

<h3 style="text-align: center;"> Data </h3>

We obtained our data from two sources: the Justia database [3] and the Supreme Court Database [4]. From the Justia database, we scraped the *syllabus* of every Supreme Court case since 1946 (about 11,000 cases). A syllabus is a summary of the Court's decision in a case. Syllabi range in length from a paragraph to a few pages, and they provide an excellent source of text on which to classify cases by issue area. Because syllabi summarize the basic facts of the case, they generally include key words indicating the issues (and issue areas) at stake. They also frequently reference precedents (i.e., past cases), which are likely to be good indicators of issue area, since cases in the same issue area tend to reference the same precedents. Another benefit of syllabi is that they are relatively short, so much less memory is needed to analyze syllabi than to analyze full Supreme Court opinions, which can be over eighty pages long.

From the Supreme Court Database (SCDB), we downloaded a CSV file that contains information on every case the Supreme Court has heard since 1946. (Technically, the level of observation is the "citation" or "dispute," which means that if the Supreme Court has heard the same case multiple times, all of these hearings are consolidated into one row. This makes sense, since each hearing would fall under the same issue area.) SCDB labels each case by issue area, splitting cases into 14 issue areas in total. (We describe these issue areas in more detail in `data_merging.ipynb`.) In addition, SCDB labels the outcome of each case as either conservative or liberal (with a simple dummy variable). By merging the SCDB data with the Justia syllabi, we are able to train models in which the output (dependent) variable is issue area or partisanship, and the input variable is the text of the case.
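To make the merging step concrete, here is a minimal, dependency-free sketch of joining SCDB rows with scraped syllabi on a shared citation key. It is only an illustration, not the code in `data_merging.ipynb`: the column names (`cite`, `issue_area`, `liberal`), the function name `merge_cases`, and all of the rows are invented for the example.

```python
import csv
import io


def merge_cases(scdb_rows, syllabi_by_cite):
    """Inner-join SCDB rows with syllabi on the citation key.

    Rows without a matching syllabus are dropped, so every merged record
    has both a label (issue area, partisanship) and a text to learn from.
    """
    merged = []
    for row in scdb_rows:
        text = syllabi_by_cite.get(row["cite"])
        if text is None:
            continue  # no scraped syllabus for this citation
        merged.append({
            "cite": row["cite"],
            "issue_area": row["issue_area"],
            "liberal": int(row["liberal"]),  # 1 = liberal, 0 = conservative
            "syllabus": text,
        })
    return merged


# Toy stand-in for the SCDB CSV file (invented rows and column names).
scdb_csv = io.StringIO(
    "cite,issue_area,liberal\n"
    "1 U.S. 1,taxation,0\n"
    "2 U.S. 2,criminal procedure,1\n"
    "3 U.S. 3,taxation,1\n"
)
scdb_rows = list(csv.DictReader(scdb_csv))

# Toy stand-in for the scraped syllabi, keyed by citation (invented text).
syllabi_by_cite = {
    "1 U.S. 1": "The income tax deduction was disallowed.",
    "2 U.S. 2": "The confession was obtained without a warning.",
    "9 U.S. 9": "A syllabus with no SCDB match.",
}

merged = merge_cases(scdb_rows, syllabi_by_cite)
for record in merged:
    print(record["cite"], "|", record["issue_area"], "|", record["liberal"])
```

Only citations present in both sources survive the join, which is the behavior we want when every training example needs both a text and a label.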

As we worked on the project, our approach to using the data changed slightly. We initially planned to analyze the text of Supreme Court *opinions*, rather than syllabi. However, the text of all Supreme Court opinions since 1946 takes an enormous amount of memory, and we do not expect opinions to give much more information than syllabi. We therefore chose to use syllabi, rather than opinions, as the basis of our text analysis.

[3] https://supreme.justia.com/cases/

[4] http://scdb.wustl.edu/data.php

<h3 style="text-align: center;"> Models </h3>

We ran seven different models to classify Supreme Court cases by issue area. Before running the models, we cleaned the syllabi text and converted it into bag-of-words form. The process of data cleaning is described in detail in `data_merging.ipynb` and `data_cleaning.ipynb`, but essentially it consists of the following steps: (1) cleaning the syllabi to remove unnecessary characters and words, such as HTML tags that remained from the scraping; (2) removing unusable cases from the dataframe; (3) tokenizing the syllabi and converting them into bag-of-words form. Step (3) is the most important: using a library for natural language processing, we split each syllabus into a list of words, and then divide the words by word type (noun, verb, etc.). We also separately extract the precedents that each syllabus cites, since precedents are unique to court cases (and are potentially very useful in classifying cases, as noted above). We then create bags of words: for each syllabus, we simply count the number of times each word is used, so that we end up with a vector of word frequencies. All of our models use bags of words as inputs. Our models therefore assume that we can classify syllabi based only on the *words* present in a case and their frequencies, without considering the order of these words or how they interact. This is a strong assumption, but we feel it is justified for an initial analysis, since legal terms and precedents tend to be issue area-specific and should therefore be fairly good predictors of issue area.
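The steps above can be sketched with the standard library alone. This is a simplified illustration under stated assumptions, not the pipeline in `data_cleaning.ipynb`: it uses regular expressions in place of a natural language processing library, skips the split by word type, and assumes that precedents appear as citations of the form `123 U.S. 456`. The function name `bag_of_words` and the sample text are invented.

```python
import re
from collections import Counter

TAG_RE = re.compile(r"<[^>]+>")                    # leftover HTML tags
CITE_RE = re.compile(r"\b\d+\s+U\.\s?S\.\s+\d+\b")  # assumed citation format
WORD_RE = re.compile(r"[a-z]+")                    # lowercase alphabetic tokens


def bag_of_words(syllabus):
    """Return (word counts, precedent counts) for one syllabus."""
    text = TAG_RE.sub(" ", syllabus)        # step 1: strip scraping residue
    precedents = CITE_RE.findall(text)      # extract cited precedents
    text = CITE_RE.sub(" ", text)           # keep citations out of the words
    words = WORD_RE.findall(text.lower())   # step 3: tokenize
    return Counter(words), Counter(precedents)


sample = "<p>The tax was upheld. See 1 U.S. 1. The tax applies.</p>"
words, precedents = bag_of_words(sample)

# A fixed, sorted vocabulary turns the counts into a frequency vector.
vocab = sorted(words)
vector = [words[w] for w in vocab]

print(vocab)
print(vector)
print(dict(precedents))
```

Word order is discarded entirely: two syllabi containing the same words with the same frequencies produce identical vectors, which is exactly the bag-of-words assumption described above.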

Using bags of words, we ran seven models. Some of the models are supervised: the syllabi (the input variable) are labeled by issue area (the output variable), and the model learns to match syllabi with issue areas. The others are unsupervised: the model groups syllabi based on "latent variables," without using the labels. The models are the following:
- Naive Bayes (supervised) 
- SVM (supervised)
- NMF (unsupervised)
- k-means (unsupervised)
- LDA (unsupervised)
- LSI (unsupervised)
- mean-shift (unsupervised)
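
To illustrate what a supervised model does with bags of words, here is a small from-scratch sketch of multinomial Naive Bayes with add-one (Laplace) smoothing. It is not the implementation in `naive_bayes_classification.ipynb`; the function names, the smoothing parameter `alpha`, and the toy documents and labels are all assumptions made for the example.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(docs, labels, alpha=1.0):
    """Fit multinomial Naive Bayes on bags of words (Counters)."""
    class_doc_counts = Counter(labels)
    word_counts = defaultdict(Counter)  # label -> word -> total count
    vocab = set()
    for bag, label in zip(docs, labels):
        word_counts[label].update(bag)
        vocab.update(bag)

    model = {}
    for label, n_docs_in_class in class_doc_counts.items():
        total_words = sum(word_counts[label].values())
        denom = total_words + alpha * len(vocab)
        model[label] = {
            "log_prior": math.log(n_docs_in_class / len(docs)),
            "log_likelihood": {
                w: math.log((word_counts[label][w] + alpha) / denom)
                for w in vocab
            },
        }
    return model


def predict(model, bag):
    """Return the label with the highest posterior log-score."""
    def score(label):
        params = model[label]
        return params["log_prior"] + sum(
            count * params["log_likelihood"][w]
            for w, count in bag.items()
            if w in params["log_likelihood"]  # skip words unseen in training
        )
    return max(model, key=score)


# Invented toy training set: bags of words with hand-assigned issue areas.
train_docs = [
    Counter("income tax deduction tax".split()),
    Counter("tax refund income".split()),
    Counter("search warrant arrest".split()),
    Counter("arrest confession warrant search".split()),
]
train_labels = ["taxation", "taxation",
                "criminal procedure", "criminal procedure"]

model = train_naive_bayes(train_docs, train_labels)
print(predict(model, Counter("tax deduction refund".split())))
print(predict(model, Counter("arrest warrant".split())))
```

The labels enter the computation directly through `train_labels`. An unsupervised model would receive only `train_docs` and would have to discover the groupings on its own.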

<h3 style="text-align: center;"> Table of contents </h3>

Please read the documents in our repository in the following order. The hyperlinks below all point to the corresponding pages in our GitHub repo:

1. [intro_overview.ipynb](https://github.com/yunhanxu/cs109-project/blob/project_data_collection/intro_overview.ipynb) (this document)
2. [data_scraping.ipynb](https://github.com/yunhanxu/cs109-project/blob/project_data_collection/data_scraping.ipynb)
3. [data_merging.ipynb](https://github.com/yunhanxu/cs109-project/blob/project_data_collection/data_merging.ipynb)
4. [data_cleaning.ipynb](https://github.com/yunhanxu/cs109-project/blob/project_data_collection/data_cleaning.ipynb)
5. [naive_bayes_classification.ipynb](https://github.com/yunhanxu/cs109-project/blob/project_data_collection/naive_bayes_classification.ipynb)
6. [SVM.ipynb](https://github.com/yunhanxu/cs109-project/blob/project_data_collection/SVM.ipynb)
7. [NMF.ipynb](https://github.com/yunhanxu/cs109-project/blob/project_data_collection/NMF.ipynb)
8. [k-means.ipynb](https://github.com/yunhanxu/cs109-project/blob/project_data_collection/k-means.ipynb)
9. [lda.ipynb](https://github.com/yunhanxu/cs109-project/blob/project_data_collection/lda.ipynb) (note that lda.ipynb includes both the LDA and the LSI models)
10. [mean-shift.ipynb](https://github.com/yunhanxu/cs109-project/blob/project_data_collection/mean_shift.ipynb)
11. [comparisons_and_conclusion.ipynb](https://github.com/yunhanxu/cs109-project/blob/project_data_collection/comparisons_and_conclusion.ipynb)