Trace File Cleaning Program

# URS Parsing

This routine will parse a URS into four columns:

* UID
* Validation Allocation
* Relationship
* Link
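
As a rough illustration of the target shape, the sketch below splits one `Linked Work Items` cell into one row per link. The helper name `parse_links` is hypothetical. It assumes that entries are comma-separated and that the first colon separates the relationship from the linked ID, as in the export shown in this notebook.

```python
def parse_links(uid, allocation, linked_items):
    """Return one (UID, Validation Allocation, Relationship, Link) tuple per link."""
    rows = []
    for entry in linked_items.split(","):
        if ":" not in entry:
            continue  # skip fragments that lack the "relationship: link" shape
        relationship, link = entry.split(":", 1)  # split at the first colon only
        rows.append((uid, allocation, relationship.strip(), link.strip()))
    return rows


rows = parse_links(
    "MASTER-94271",
    "Product Management",
    "is branched from: TM-20888, has parent: MASTER-94270",
)
for row in rows:
    print(row)
```

The notebook cells that follow do the same thing column-wise with pandas.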

# Quality Checks

The following items need to be checked when processing an Excel export of a URS document:

* Filtering for Approved: <br>Are the requirements in an approved state?
* Filtering for the Correct ID Prefix: <br>The exports contain all the history. We are only interested in MASTER and the current program.
* Checking Validation Allocation: <br>Do all requirements have Validation Allocation filled in?
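
The checks above could be sketched as a small report function. This is only an illustration: the function name `quality_report`, the `prefixes` parameter, and the sample rows are hypothetical, and it assumes the export has `ID`, `Type`, `Status`, and `Validation Allocation` columns.

```python
import pandas as pd


def quality_report(df, prefixes=("MASTER",)):
    """List the requirement IDs that fail each check (column names are assumed)."""
    reqs = df[df["Type"] == "User Requirement"]
    # Python's str.startswith accepts a tuple of allowed prefixes
    has_prefix = reqs["ID"].astype(str).map(lambda s: s.startswith(tuple(prefixes)))
    not_approved = reqs["Status"] != "Approved"
    # Treat missing values and blank strings as "not filled in"
    missing_alloc = reqs["Validation Allocation"].fillna("").astype(str).str.strip() == ""
    return {
        "not_approved": reqs.loc[not_approved, "ID"].tolist(),
        "wrong_prefix": reqs.loc[~has_prefix, "ID"].tolist(),
        "missing_allocation": reqs.loc[missing_alloc, "ID"].tolist(),
    }


# Hypothetical sample rows, not taken from a real export
sample = pd.DataFrame({
    "ID": ["MASTER-1", "MASTER-2", "OLD-3"],
    "Type": ["User Requirement"] * 3,
    "Status": ["Approved", "Draft", "Approved"],
    "Validation Allocation": ["Product Management", None, "  "],
})
report = quality_report(sample)
print(report)
```

To accept the current program as well, its ID prefix would be added to `prefixes`.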


In [53]:
import pandas as pd

# Read the Excel file
file_path = "URS_Export.xlsx"      # Change to the desired file name
sheet_name = "Document Preflight"  # Change to the desired sheet name
df = pd.read_excel(file_path, sheet_name=sheet_name)

# Print the first few rows of the DataFrame
display(df[['ID','Type','Validation Allocation','Linked Work Items']].head(5))

| | ID | Type | Validation Allocation | Linked Work Items |
|---|---|---|---|---|
| 0 | MASTER-94271 | User Requirement | Product Management | is branched from: TM-20888, has parent: MASTER... |
| 1 | MASTER-94272 | User Requirement | Product Management | is branched from: TM-20890, has parent: MASTER... |
| 2 | MASTER-94274 | User Requirement | Clinical Research (CR) | is branched from: TM-21101, has parent: MASTER... |
| 3 | MASTER-94277 | User Requirement | Clinical Research (CR) | is branched from: TM-20891, has parent: MASTER... |
| 4 | MASTER-94278 | User Requirement | Product Management | is branched from: TM-21467, has parent: MASTER... |


In [54]:
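def fsplit(s):
    # Split each "relationship: link" entry at the first colon only
    s1 = s['Linked Work Items'].str.split(':', n=1, expand=True)
    s1.columns = ['Relationship', 'Link']
    # Remove the whitespace left over from the ", " and ": " separators
    s1.Relationship = s1.Relationship.str.strip()
    s1.Link = s1.Link.str.strip()
    # join() aligns the two frames on the (ID, Validation Allocation) index
    return s.join(s1)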

dfc=(
    df
    .loc[df.Type=="User Requirement"]
    #.loc[df.Status=="Approved"]                               # This should be filtered for 'Approved'
    .loc[:,["ID","Linked Work Items","Validation Allocation"]]
    .set_index(["ID","Validation Allocation"])
    .replace({'\xa0': ' '}, regex=True)                         # Normalize non-breaking spaces to plain spaces
    .apply(lambda x: x.str.split(',').explode())
    .pipe(fsplit)
    .drop(columns=['Linked Work Items'])
    .reset_index()
)
dfc.head(10)

| | ID | Validation Allocation | Relationship | Link |
|---|---|---|---|---|
| 0 | MASTER-94271 | Product Management | is branched from | TM-20888 |
| 1 | MASTER-94271 | Product Management | has parent | MASTER-94270 |
| 2 | MASTER-94271 | Product Management | is refined by | MASTER-60046 |
| 3 | MASTER-94271 | Product Management | is refined by | MASTER-59966 |
| 4 | MASTER-94271 | Product Management | is refined by | MASTER-84502 |
| 5 | MASTER-94271 | Product Management | is refined by | MASTER-60040 |
| 6 | MASTER-94271 | Product Management | is refined by | MASTER-60061 |
| 7 | MASTER-94271 | Product Management | is branched from | TM-20888 |
| 8 | MASTER-94271 | Product Management | has parent | MASTER-94270 |
| 9 | MASTER-94271 | Product Management | is refined by | MASTER-60046 |
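
In the preview above, rows 7 to 9 repeat rows 0 to 2. That pattern is consistent with `fsplit` joining two frames on an index whose labels repeat after `explode()`: an index-aligned join pairs every row of a label with every other row carrying the same label. The sketch below reproduces the effect on hypothetical data and shows one possible way to avoid it. It is an illustration under those assumptions, not a drop-in replacement for the cell above.

```python
import pandas as pd

# Hypothetical two-row export; the column names follow the notebook
raw = pd.DataFrame({
    "ID": ["MASTER-1", "MASTER-2"],
    "Validation Allocation": ["Product Management", "Clinical Research (CR)"],
    "Linked Work Items": [
        "has parent: MASTER-10, is refined by: MASTER-11, is refined by: MASTER-12",
        "is refined by: MASTER-13",
    ],
})

exploded = (
    raw.set_index(["ID", "Validation Allocation"])["Linked Work Items"]
    .str.split(",")
    .explode()  # one row per link; index labels now repeat
)
parts = exploded.str.split(":", n=1, expand=True)
parts.columns = ["Relationship", "Link"]
parts = parts.apply(lambda col: col.str.strip())

# Index-aligned join: 3 x 3 rows for MASTER-1 plus 1 x 1 for MASTER-2
joined = exploded.to_frame().join(parts)

# The split result already has one row per link, so no join is needed
clean = parts.reset_index()

print(len(joined), len(clean))
```

If the notebook's counts are meant to be per link, comparing the row count before and after the join is a quick way to confirm whether this inflation is present.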


In [55]:
cat_list = ['is refined by','is validated by']
dfc = dfc[dfc['Relationship'].isin(cat_list)]
display(dfc)

| | ID | Validation Allocation | Relationship | Link |
|---|---|---|---|---|
| 2 | MASTER-94271 | Product Management | is refined by | MASTER-60046 |
| 3 | MASTER-94271 | Product Management | is refined by | MASTER-59966 |
| 4 | MASTER-94271 | Product Management | is refined by | MASTER-84502 |
| 5 | MASTER-94271 | Product Management | is refined by | MASTER-60040 |
| 6 | MASTER-94271 | Product Management | is refined by | MASTER-60061 |
| ... | ... | ... | ... | ... |
| 68595 | MASTER-98305 | Product Management | is refined by | MASTER-42226 |
| 68596 | MASTER-98305 | Product Management | is refined by | MASTER-41895 |
| 68599 | MASTER-98305 | Product Management | is refined by | MASTER-41894 |
| 68600 | MASTER-98305 | Product Management | is refined by | MASTER-42226 |


In [56]:
dfc = dfc[dfc['Link'].str.contains('MASTER|Mozart', case=False)].reset_index(drop=True)
dfc = dfc.rename(columns={'ID': 'UID'})
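
Note that `str.contains` matches the pattern anywhere in the string, and that missing values produce `NaN` in the mask unless `na=` is set. If the intent is to keep only links whose ID starts with one of the prefixes, an anchored `str.match` is one alternative. The sample values below are hypothetical.

```python
import pandas as pd

# Hypothetical link values, including a missing one
links = pd.Series(["MASTER-60046", "Mozart-12", "TM-20888", "see MASTER-1", None])

# Substring match anywhere in the value; na=False treats missing links as non-matches
anywhere = links.str.contains("MASTER|Mozart", case=False, na=False)

# str.match anchors the pattern at the start of the value
prefix_only = links.str.match(r"(?:MASTER|Mozart)-", case=False, na=False)

print(links[anywhere].tolist())
print(links[prefix_only].tolist())
```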

In [57]:
display(dfc.head(10))

| | UID | Validation Allocation | Relationship | Link |
|---|---|---|---|---|
| 0 | MASTER-94271 | Product Management | is refined by | MASTER-60046 |
| 1 | MASTER-94271 | Product Management | is refined by | MASTER-59966 |
| 2 | MASTER-94271 | Product Management | is refined by | MASTER-84502 |
| 3 | MASTER-94271 | Product Management | is refined by | MASTER-60040 |
| 4 | MASTER-94271 | Product Management | is refined by | MASTER-60061 |
| 5 | MASTER-94271 | Product Management | is refined by | MASTER-60046 |
| 6 | MASTER-94271 | Product Management | is refined by | MASTER-59966 |
| 7 | MASTER-94271 | Product Management | is refined by | MASTER-84502 |
| 8 | MASTER-94271 | Product Management | is refined by | MASTER-60040 |
| 9 | MASTER-94271 | Product Management | is refined by | MASTER-60061 |


In [58]:
dfc.to_csv("URS_Cleaned.csv",index=False)

In [62]:
dfz = dfc.pivot_table(
    index='Validation Allocation',  # Rows: Validation Allocation
    columns='Relationship',         # Columns: Relationship
    values='UID',                   # Column whose entries are counted
    aggfunc='count'                 # Counts non-null rows, not unique UIDs
)

display(dfz)

| Validation Allocation | is refined by |
|---|---|
| Clinical Research (CR) | 38239 |
| Product Management | 13483 |
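
Note that `aggfunc='count'` counts non-null rows, so a UID that appears in several link rows is counted once per row. If the intent is the number of distinct requirements per allocation, `'nunique'` is one option. The sketch below contrasts the two on hypothetical data.

```python
import pandas as pd

# Hypothetical rows: MASTER-1 has two links, the others one each
toy = pd.DataFrame({
    "UID": ["MASTER-1", "MASTER-1", "MASTER-2", "MASTER-3"],
    "Validation Allocation": ["Product Management"] * 3 + ["Clinical Research (CR)"],
    "Relationship": ["is refined by"] * 4,
})

# One count per link row
row_counts = toy.pivot_table(
    index="Validation Allocation", columns="Relationship",
    values="UID", aggfunc="count",
)
# One count per distinct UID
unique_counts = toy.pivot_table(
    index="Validation Allocation", columns="Relationship",
    values="UID", aggfunc="nunique",
)

print(row_counts)
print(unique_counts)
```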


In [63]:
dfq = dfc.pivot_table(
    index='Relationship',  # Rows: Relationship
    values='UID',          # Column whose entries are counted
    aggfunc='count'        # Counts non-null rows, not unique UIDs
)

display(dfq)

| Relationship | UID |
|---|---|
| is refined by | 51722 |
