# Opinion Text Clean

**Remove references, special characters, and Thomson Reuters marks from the opinion text**

### Load data

In [1]:
import json
# Load the decision data; the context manager closes the file handle.
with open('decision_text.json', 'r') as f:
    full_data = json.load(f)

In [2]:
full_data.keys()

dict_keys(['author', 'decision', 'opinion', 'cleaned_text'])

In [5]:
op_text = full_data['opinion']

In [6]:
op_text[0][0:100]

'Apple appeals from the final decision of the International Trade Commission (ITC) that the asserted '

### Identify references

In [7]:
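import re
from reporters_db import EDITIONS

# Also register every reporter abbreviation with its spaces removed,
# so that citations written without spaces match as well.
for key in list(EDITIONS.keys()):
    EDITIONS[key.replace(" ", "")] = EDITIONS[key].replace(" ", "")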

In [8]:
# regular expression to describe citation pattern

CITATION_PTN = r"""
(?:[\s,:\(]|^)
(
(\d+)\s+
({reporters})(\s|[a-z])+
(\d+)
)
""".format(reporters='|'.join([re.escape(i) for i in EDITIONS]))
CITATION_PTN_RE = re.compile(CITATION_PTN, re.IGNORECASE | re.MULTILINE | re.DOTALL | re.VERBOSE)
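
To see what the citation pattern captures without installing `reporters_db`, the sketch below rebuilds it with a small hand-picked reporter list and runs it on a citation of the kind found in these opinions. `SAMPLE_REPORTERS`, `SAMPLE_CITATION_PTN`, and `SAMPLE_CITATION_RE` are illustrative names, and the three reporters are an assumption standing in for the keys of `EDITIONS`.

```python
import re

# Assumption: a tiny stand-in for the keys of reporters_db.EDITIONS.
SAMPLE_REPORTERS = ["F.3d", "F.2d", "U.S."]

SAMPLE_CITATION_PTN = r"""
(?:[\s,:\(]|^)             # preceded by whitespace, punctuation, or line start
(
(\d+)\s+                   # volume number
({reporters})(\s|[a-z])+   # reporter abbreviation, then spaces or letters
(\d+)                      # first page
)
""".format(reporters='|'.join(re.escape(r) for r in SAMPLE_REPORTERS))
SAMPLE_CITATION_RE = re.compile(
    SAMPLE_CITATION_PTN, re.IGNORECASE | re.MULTILINE | re.DOTALL | re.VERBOSE)

sample = "Philips Elecs. N. Am. Corp., 744 F.3d 1272,1276 (Fed.Cir.2014)"
match = SAMPLE_CITATION_RE.search(sample)
volume, reporter, page = match.group(2), match.group(3), match.group(5)
stripped = SAMPLE_CITATION_RE.sub('', sample)
```

Only the volume, reporter, and first page are removed; the pinpoint page and the court parenthetical are left for the later cleaning steps.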

In [9]:
# regular expression to describe regulation pattern
REGULATION_PTN = r"""
(
(\d+)\s*
(U\.?S\.?C\.?|C\.?F\.?R\.?)\s*
(Sec(?:tion|\.)?|§)?\s*
(\d+[\da-zA-Z\-]*)
)"""
REGULATION_PTN_RE = re.compile(REGULATION_PTN, 
                               re.IGNORECASE | re.MULTILINE | re.DOTALL | re.VERBOSE)

PUBLIC_LAW_PTN = r"""
(
Pub(?:lic|\.)\s+L(?:aw|\.)(?:\s+No\.?)?\s+\d+\-\d+
|
\d+\s+Stat\.\s+[\d-]+
)
"""
PUBLIC_LAW_PTN_RE = re.compile(PUBLIC_LAW_PTN, 
                               re.IGNORECASE | re.MULTILINE | re.DOTALL | re.VERBOSE)
PUBLIC_LAW_SUB_RE = re.compile(r'.+?(\d+\-\d+)',
                               re.MULTILINE | re.DOTALL)
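
As a quick check of the regulation pattern, the sketch below applies a copy of it to a sentence from one of the opinions. The pattern is repeated under the illustrative name `DEMO_REGULATION_RE` so the snippet runs on its own; the inline comments are my reading of each group.

```python
import re

# Copy of the regulation pattern, annotated group by group.
DEMO_REGULATION_RE = re.compile(r"""
(
(\d+)\s*                          # title number
(U\.?S\.?C\.?|C\.?F\.?R\.?)\s*    # U.S.C. or C.F.R., dots optional
(Sec(?:tion|\.)?|§)?\s*           # optional section marker
(\d+[\da-zA-Z\-]*)                # section number
)""", re.IGNORECASE | re.VERBOSE)

sentence = "We have jurisdiction under 28 U.S.C. § 1295(a)(1)."
match = DEMO_REGULATION_RE.search(sentence)
title, code, section = match.group(2), match.group(3), match.group(5)
remainder = DEMO_REGULATION_RE.sub('', sentence)
```

The subsection markers `(a)(1)` are not part of the match, so they stay in `remainder` until the later steps strip special characters, digits, and short words.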




In [10]:
FED_CIR_RE = re.compile(r'\(Fed\.Cir\.\d*\)', re.MULTILINE | re.DOTALL)

In [11]:
MARK_RE = re.compile(r'©\s\d+\sThomson Reuters\.', re.MULTILINE | re.DOTALL)
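
In a regular expression an unescaped `.` matches any character, so whether the dots in a pattern such as `\(Fed.Cir.\d*\)` are escaped changes what it accepts. A small sketch comparing the two spellings on made-up strings (the names `loose` and `strict` are illustrative only):

```python
import re

# Illustrative names only; neither pattern is defined elsewhere in this notebook.
loose = re.compile(r'\(Fed.Cir.\d*\)')     # '.' matches any character except a newline
strict = re.compile(r'\(Fed\.Cir\.\d*\)')  # '\.' matches only a literal period

samples = ['(Fed.Cir.2014)', '(FedxCiry2014)', '(Fed. Cir. 2014)']
loose_hits = [bool(loose.search(s)) for s in samples]
strict_hits = [bool(strict.search(s)) for s in samples]
```

Neither spelling matches the spaced form `(Fed. Cir. 2014)`; handling it would need an optional `\s*` after each period.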

### Write a function to clean text
Remove references, special characters, and Thomson Reuters marks from the opinion text.

In [50]:
# Applied in order; each entry is (pattern, replacement).
replacements = [
    (CITATION_PTN_RE, ''),  # remove citations to other cases
    (REGULATION_PTN_RE, ''),  # remove U.S.C. / C.F.R. references
    (PUBLIC_LAW_PTN_RE, ''),  # remove public law and Stat. references
    (FED_CIR_RE, ''),  # remove "(Fed.Cir.<year>)" markers
    (MARK_RE, ''),  # remove the Thomson Reuters copyright mark
    (r'[^\w]', ' '),  # replace special characters with spaces
    (r'\d+', ' '),  # remove numbers
    (r'\s+[A-Za-z]{1,2}\s+', ' '),  # remove one- and two-letter words
    (r'\s+', ' '),  # collapse consecutive whitespace
    (r'\s+[A-Za-z]{1,2}\s+', ' '),  # repeat: adjacent short words share whitespace, so one pass misses some
    (r'\s+col\s+', ' '),  # remove 'col' references
    (r'\s+[A-Za-z]{1,2}\s+', ' ')  # final pass for short words exposed by the previous step
]

def clean_citation(input_str):
    for old, new in replacements:
        input_str = re.sub(old, new, input_str)
    return input_str
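
The short-word pattern appears three times in `replacements` because each match consumes the whitespace on both sides of the word, so the next short word has no leading whitespace left to match in the same pass. The sketch below shows this on a made-up phrase. `drop_short_words` is a hypothetical alternative based on word boundaries, not part of the notebook; unlike the original pattern it also removes short words at the very start or end of the string.

```python
import re

SHORT_WORD_RE = re.compile(r'\s+[A-Za-z]{1,2}\s+')

phrase = "review of a de novo case"
one_pass = SHORT_WORD_RE.sub(' ', phrase)      # "a" survives between "of" and "de"
two_passes = SHORT_WORD_RE.sub(' ', one_pass)  # second pass removes it

def drop_short_words(text):
    """Hypothetical single-pass alternative using word boundaries."""
    without_short = re.sub(r'\b[A-Za-z]{1,2}\b', ' ', text)
    return re.sub(r'\s+', ' ', without_short).strip()

single_call = drop_short_words(phrase)
```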

#### Test the `clean_citation` function

In [61]:
# Original text
op_text[5][500:2100]

"the asserted patents. After claim construction, the partiesstipulated to noninfringement of the ′140 and ′771 patents on the grounds that AgiLight's products do not include an“IDC connector” as construed by the court. The district court entered partial summary judgment consistent with theparties' stipulation. GE Lighting Solutions, LLC. v. AgiLight, Inc., C.A. No. 12–cv–00354–JG (N.D.Ohio Jan. 8, 2013),ECF No. 38. The district court also granted AgiLight's motion for summary judgment of noninfringement of the ′896and ′ 055 patents. GE Lighting Solutions, LLC. v. AgiLight, Inc., C.A. No. 12–cv00354–JG (N.D.Ohio Mar. 18, 2013),ECF No. 43 (Summary Judgment Order ). GE appeals. We have jurisdiction under 28 U.S.C. § 1295(a)(1).DISCUSSION[1] We review claim construction de novo. Lighting Ballast Control LLC v. Philips Elecs. N. Am. Corp., 744 F.3d 1272,1276–77 (Fed.Cir.2014) (en banc). We review the grant of summary judgment under the law of the relevant regionalcircuit. The Sixth Circuit 

In [60]:
# Cleaned Text
clean_citation(op_text[5][500:2100])

'the asserted patents After claim construction the partiesstipulated noninfringement the and patents the grounds that AgiLight products not include IDC connector construed the court The district court entered partial summary judgment consistent with theparties stipulation Lighting Solutions LLC AgiLight Inc Ohio Jan ECF The district court also granted AgiLight motion for summary judgment noninfringement the and patents Lighting Solutions LLC AgiLight Inc Ohio Mar ECF Summary Judgment Order appeals have jurisdiction under DISCUSSION review claim construction novo Lighting Ballast Control LLC Philips Elecs Corp banc review the grant summary judgment under the law the relevant regionalcircuit The Sixth Circuit reviews grants summary judgment novo Moore Holbrook Cir Summary judgment appropriate when there genuine issue material fact and the moving party entitled tojudgment matter law The and patents are directed light emitting diode LED string lights that include LED insulated electrical c

### Use `clean_citation` to clean all opinion text

In [62]:
# Clean every opinion, keeping None for decisions with no opinion text.
cleaned_text = [clean_citation(text) if text is not None else None
                for text in op_text]

### Write to JSON file

In [66]:
full_data.update({'cleaned_text':cleaned_text})

In [67]:
full_data.keys()

dict_keys(['author', 'decision', 'opinion', 'cleaned_text'])

In [68]:
with open('decision_text.json', 'w') as f:
    json.dump(full_data, f)

### Write to CSV

In [69]:
import pandas as pd

In [72]:
text_import = pd.DataFrame({'cleaned_text': cleaned_text})


In [74]:
text_import.head()

| | cleaned_text |
|---|---|
| 0 | Apple appeals from the final decision the Inte... |
| 1 | |
| 2 | |
| 3 | Following bench trial damages the district cou... |
| 4 | claim original Government Works Radio Systems... |


In [75]:
text_import.to_csv('cleaned_text.csv', sep='\t')  # tab-delimited despite the .csv extension