## Catalog

<a href=#p0>0. Import Packages and Import Data</a>

<a href=#p1>1. EDA</a>

<a href=#p2>2. Data Processing (filtering and merging)</a>

<a href=#p3>3. Save the result and check the correctness</a>

<a name='p0' /></a>
## 0. Import Packages and Import Data

In [1]:
import pandas as pd
import numpy as np

In [3]:
# raw strings keep the backslashes literal ('\d' is not a valid escape sequence)
cases_df = pd.read_pickle(r'..\data\df_cases_200906.gzip')
label_df = pd.read_pickle(r'..\data\df_label_200906.gzip')

<a name='p1' /></a>
## 1. Exploratory Data Analysis (EDA)

In [4]:
print(f"There are {len(cases_df)} records in case document.")
print(f"There are {len(label_df)} records in label document.")
print(f"There are {cases_df['CaseId'].nunique()} unique caseid in case document.")
print(f"There are {label_df['CaseId'].nunique()} unique caseid in label document.")

There are 2069 records in case document.
There are 1098 records in label document.
There are 1098 unique caseid in case document.
There are 1098 unique caseid in label document.


The EDA shows that:
    1. The case document has more records (2069) than unique CaseIds (1098), so some CaseIds have multiple contracts.
    2. Each CaseId has exactly one record in the label document, i.e. one label_1 and one label_2 value, which confirms the statement in the readme file.
    3. The case and label documents contain the same number of unique CaseIds (1098). Provided both documents contain the same CaseIds, a "left join", "inner join", "outer join", and "right join" of cases with labels all return the same set of CaseIds.
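
Equal counts of unique CaseIds do not by themselves guarantee that both documents hold the same CaseIds, so the third point is worth checking directly. The following is a minimal sketch on toy data; `cases_toy` and `label_toy` are hypothetical stand-ins for `cases_df` and `label_df`, not the real files.

```python
import pandas as pd

# Toy stand-ins for cases_df and label_df (hypothetical data).
cases_toy = pd.DataFrame({"CaseId": [1, 1, 2, 3], "FileName": ["a", "b", "c", "d"]})
label_toy = pd.DataFrame({"CaseId": [1, 2, 3], "label_1": [True, False, True]})

# Each CaseId should appear exactly once in the label table.
labels_unique = label_toy["CaseId"].is_unique

# Both tables should cover exactly the same set of CaseIds.
same_ids = set(cases_toy["CaseId"]) == set(label_toy["CaseId"])

# CaseIds with more than one contract explain the record-count difference.
contracts_per_case = cases_toy.groupby("CaseId").size()
multi_contract_ids = contracts_per_case[contracts_per_case > 1].index.tolist()

print(labels_unique, same_ids, multi_contract_ids)
```

If `same_ids` is true for the real data, the choice of join type does not change which CaseIds end up in the result.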

<a name='p2' /></a>
## 2. Data Processing (filtering and merging)

In [5]:
# select the columns of the case document that will be used
# (.copy() creates an independent DataFrame, so adding columns does not raise SettingWithCopyWarning)
useful_cases_df = cases_df[['CaseId', 'FileName', 'IsExecuted', 'OcrText', 'QualityScore']].copy()

# flag a contract record as invalid if it is not executed or its quality score is below 0.81
useful_cases_df['invalid'] = (useful_cases_df['IsExecuted'] == False) | (useful_cases_df['QualityScore'] < 0.81)

# for valid contracts, the output contains the concatenated contract names and the concatenated OCR text

valid_df = useful_cases_df.loc[~useful_cases_df['invalid']].copy()  # keep only valid contracts
valid_df['ValidFileNames'] = valid_df.groupby('CaseId')['FileName'].transform(lambda x: ', '.join(x))  # concatenate contract names
valid_df['OcrText_concat'] = valid_df.groupby('CaseId')['OcrText'].transform(lambda x: ' '.join(x))  # concatenate OCR text

# for invalid contracts, the output contains the concatenated contract names

invalid_df = useful_cases_df.loc[useful_cases_df['invalid']].copy()  # keep only invalid contracts
invalid_df['InvalidFileNames'] = invalid_df.groupby('CaseId')['FileName'].transform(lambda x: ', '.join(x))  # concatenate contract names

# merge all the data together
output_df=useful_cases_df[['CaseId']].merge(invalid_df[['CaseId','InvalidFileNames']],on='CaseId',how='left')
output_df=output_df.merge(valid_df[['CaseId','ValidFileNames','OcrText_concat']],on='CaseId',how='left')
#output_df=output_df.merge(label_df[['CaseId','label_1','label_2']],on='CaseId',how='left')
# attach the labels; the sorted label table supplies the 0..n-1 index of the output
output_df = (
    label_df[['CaseId', 'label_1', 'label_2']]
    .sort_values(by=['CaseId'], ignore_index=True)
    .reset_index()
    .merge(output_df, on='CaseId', how='left')
    .set_index('index')
)

output_df.rename(columns={"OcrText_concat": "OcrText"},inplace=True) # rename the output name of ocrtext

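# the merged tables hold one row per contract, so each CaseId is repeated after joining;
# drop the duplicate rows (this must happen before the names are split into lists, which are unhashable)
output_df.drop_duplicates(inplace=True)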

# formatting: the screenshot of the sample output indicates that all the contract names are in a list
output_df['ValidFileNames'] = output_df['ValidFileNames'].str.split(", ") 
output_df['InvalidFileNames'] = output_df['InvalidFileNames'].str.split(", ") 

# formatting: the screenshot of the sample output shows missing values as "[]" in the file-name columns
# and as an empty string in OcrText
output_df['OcrText'] = output_df['OcrText'].fillna('')
for col in ['InvalidFileNames', 'ValidFileNames']:
    # fillna does not accept a list as fill value, so replace every non-list entry (NaN) with an empty list
    output_df[col] = output_df[col].apply(lambda v: v if isinstance(v, list) else [])


output_df = output_df[['CaseId','InvalidFileNames','ValidFileNames','OcrText','label_1','label_2']]


In [6]:
output_df.tail()

| index | CaseId | InvalidFileNames | ValidFileNames | OcrText | label_1 | label_2 |
|---|---|---|---|---|---|---|
| 1093 | 3061230659 | [003061230659_72651667_Order form_978-0-661-06... | [] | | True | False |
| 1094 | 3061230710 | [003061230710_80047544_other documents_978-1-0... | [] | | True | False |
| 1095 | 3061230728 | [003061230728_79408066_Master contract_978-0-1... | [003061230728_74076581_Amendments_978-0-14-763... | None attorney spend tend miss appear. | True | False |
| 1096 | 3061230748 | [003061230748_65193716_Contract Documents_978-... | [] | | True | False |
| 1097 | 3061230757 | [003061230757_84690982_other documents_978-0-1... | [003061230757_72990476_Contract Documents_978-... | Determine go network. | False | False |
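
The filtering and merging in this section can also be sketched with `groupby(...).agg(...)`, which returns one row per CaseId directly and so avoids both the join-then-split round trip on file names and the later `drop_duplicates`. This is an alternative sketch on hypothetical toy data, not the method used above; only the column names and the 0.81 threshold are taken from the notebook.

```python
import pandas as pd

# Toy contracts (hypothetical values); column names follow the notebook.
toy = pd.DataFrame({
    "CaseId": [1, 1, 2],
    "FileName": ["a.pdf", "b.pdf", "c.pdf"],
    "IsExecuted": [True, False, True],
    "QualityScore": [0.95, 0.99, 0.50],
    "OcrText": ["first text", "second text", "third text"],
})

# Same validity rule as above: not executed, or quality score below 0.81.
invalid = (toy["IsExecuted"] == False) | (toy["QualityScore"] < 0.81)

# Valid contracts: list of file names and space-joined OCR text per case.
valid_agg = toy[~invalid].groupby("CaseId").agg(
    ValidFileNames=("FileName", list),
    OcrText=("OcrText", lambda s: " ".join(s)),
)
# Invalid contracts: list of file names per case.
invalid_agg = toy[invalid].groupby("CaseId").agg(InvalidFileNames=("FileName", list))

# One row per CaseId; a case missing on one side gets NaN after the join.
ids = pd.Index(sorted(toy["CaseId"].unique()), name="CaseId")
result = pd.DataFrame(index=ids).join(invalid_agg).join(valid_agg).reset_index()

# Replace missing values with [] in the list columns and "" in the text column.
for col in ["InvalidFileNames", "ValidFileNames"]:
    result[col] = result[col].apply(lambda v: v if isinstance(v, list) else [])
result["OcrText"] = result["OcrText"].fillna("")

print(result)
```

Because the file names are collected as lists from the start, a file name that happens to contain ", " is not split apart.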


<a name='p3' /></a>
## 3. Save the result and check the correctness

In [7]:
output_df.to_pickle(r"df_final.gzip")

In [8]:
print(len(output_df))

1098


The output has 1098 rows, one per unique CaseId, as expected.
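
Beyond the row count, a few structural checks can catch formatting mistakes. The sketch below uses a hypothetical helper, `check_output`, and a toy frame shaped like `output_df`; the specific checks are assumptions about what the expected output should satisfy.

```python
import pandas as pd

# Toy output in the same shape as output_df (hypothetical values).
out = pd.DataFrame({
    "CaseId": [1, 2],
    "InvalidFileNames": [["b.pdf"], ["c.pdf"]],
    "ValidFileNames": [["a.pdf"], []],
    "OcrText": ["first text", ""],
    "label_1": [True, False],
    "label_2": [False, False],
})

def check_output(df: pd.DataFrame) -> None:
    """Raise AssertionError if the frame violates the expected output shape."""
    # Exactly one row per case.
    assert df["CaseId"].is_unique
    # File-name columns hold lists everywhere, never NaN.
    for col in ["InvalidFileNames", "ValidFileNames"]:
        assert df[col].map(lambda v: isinstance(v, list)).all()
    # OCR text is a string everywhere.
    assert df["OcrText"].map(lambda v: isinstance(v, str)).all()
    # Cases without valid contracts have an empty OCR text.
    no_valid = df["ValidFileNames"].map(len) == 0
    assert (df.loc[no_valid, "OcrText"] == "").all()

check_output(out)
```

Calling `check_output(output_df)` on the real result would apply the same checks there.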

In [9]:
# for SAP colleagues: the command to load the saved dataset
new_df=pd.read_pickle(r'df_final.gzip')

In [10]:
new_df.tail()

| index | CaseId | InvalidFileNames | ValidFileNames | OcrText | label_1 | label_2 |
|---|---|---|---|---|---|---|
| 1093 | 3061230659 | [003061230659_72651667_Order form_978-0-661-06... | [] | | True | False |
| 1094 | 3061230710 | [003061230710_80047544_other documents_978-1-0... | [] | | True | False |
| 1095 | 3061230728 | [003061230728_79408066_Master contract_978-0-1... | [003061230728_74076581_Amendments_978-0-14-763... | None attorney spend tend miss appear. | True | False |
| 1096 | 3061230748 | [003061230748_65193716_Contract Documents_978-... | [] | | True | False |
| 1097 | 3061230757 | [003061230757_84690982_other documents_978-0-1... | [003061230757_72990476_Contract Documents_978-... | Determine go network. | False | False |
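
Comparing the tails by eye only covers five rows. A whole-frame comparison is a stricter round-trip check; the following is a minimal sketch on toy data, where the temporary file name `df_toy.pkl` is hypothetical.

```python
import os
import tempfile

import pandas as pd

# Toy frame with list-valued and string columns, like the real output.
toy = pd.DataFrame({
    "CaseId": [1, 2],
    "ValidFileNames": [["a.pdf"], []],
    "OcrText": ["first text", ""],
})

# Write the frame to a temporary pickle file and read it back.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "df_toy.pkl")  # hypothetical file name
    toy.to_pickle(path)
    restored = pd.read_pickle(path)

# equals() compares shape, dtypes, and every element of the two frames.
round_trip_ok = restored.equals(toy)
print(round_trip_ok)
```

For the real data the equivalent check would be `new_df.equals(output_df)`.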
