# Identifying reports that describe CRC and extracting the TNM stage


Andres Tamm

2025-07-21

<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Prepare-data" data-toc-modified-id="Prepare-data-1">Prepare data</a></span></li><li><span><a href="#Step-1.-Identify-reports-that-describe-current-colorectal-cancer" data-toc-modified-id="Step-1.-Identify-reports-that-describe-current-colorectal-cancer-2">Step 1. Identify reports that describe current colorectal cancer</a></span></li><li><span><a href="#Step-2.-Extract-TNM-phrases-from-reports" data-toc-modified-id="Step-2.-Extract-TNM-phrases-from-reports-3">Step 2. Extract TNM phrases from reports</a></span></li><li><span><a href="#Step-3.-Extract-TNM-values-from-phrases" data-toc-modified-id="Step-3.-Extract-TNM-values-from-phrases-4">Step 3. Extract TNM values from phrases</a></span></li></ul></div>

In [1]:
import pandas as pd
import numpy as np
from textmining.reports import get_crc_reports
from textmining.tnm.clean import add_tumour_tnm
from textmining.tnm.tnm import get_tnm_phrase, get_tnm_values
from pathlib import Path

## Prepare data

Reports should be in a Pandas DataFrame in which one column contains the report text.

I would usually load real data as follows:

```python
data_dir = Path("C:\\path\\to\\folder\\with\\data")
filename = "histopathology_OUH.csv"
df = pd.read_csv(data_dir / filename)
```
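
Before running the pipeline on real data, it can help to confirm that the report column exists and contains text. The sketch below is only an illustration: `check_report_column` is a hypothetical helper, not part of `textmining`, and the toy DataFrame stands in for a real CSV file.

```python
import pandas as pd


def check_report_column(df: pd.DataFrame, col: str) -> pd.DataFrame:
    """Return a copy of df without rows that have missing report text.

    Hypothetical helper for illustration; not part of the textmining package.
    """
    if col not in df.columns:
        raise KeyError(f"Column '{col}' not found; available: {list(df.columns)}")
    out = df.dropna(subset=[col]).copy()
    # Make sure every remaining report is a string
    out[col] = out[col].astype(str)
    # Renumber rows from 0 so that row positions are contiguous
    return out.reset_index(drop=True)


# Toy data (assumed): one valid report and one missing report
toy = pd.DataFrame({"report_text_anon": ["T1 N0 MX (colorectal cancer)", None]})
clean = check_report_column(toy, "report_text_anon")
print(clean.shape[0])
```

Resetting the index is a design choice here: the match tables shown later refer to reports by a `row` number counted from 0, so contiguous row positions make those numbers easier to trace back.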

In [2]:
# Get reports 
reports = ['Metastatic tumour from colorectal primary, T3 N0',
           'T1 N0 MX (colorectal cancer)',
           'pT3/2/1 N0 Mx. Malignant neoplasm ascending colon',
           'pT2a/b N0 Mx (sigmoid tumour)',
           'T4a & b N0 M1 invasive carcinoma, descending colon',
           'T1-weighted image, ... rectal tumour staged as ymrT2',
           'Colorectal tumour. Stage: T4b / T4a / T3 / T2 / T1',
           'Sigmoid adenocarcinoma, ... Summary: pT1 (sigmoid, txt txt txt txt), N3b M0',
           'Colorectal tumour in situ, Tis N0 M0',
           'Clinical information: T1 N1 (sigmoid tumour)'
           ]
df = pd.DataFrame(reports, columns=['report_text_anon'])
df['subject'] = '01'

pd.set_option('display.max_colwidth', 500, 'display.max_rows', 1000, 'display.min_rows', 1000)
display(df)

Unnamed: 0,report_text_anon,subject
0,"Metastatic tumour from colorectal primary, T3 N0",1
1,T1 N0 MX (colorectal cancer),1
2,pT3/2/1 N0 Mx. Malignant neoplasm ascending colon,1
3,pT2a/b N0 Mx (sigmoid tumour),1
4,"T4a & b N0 M1 invasive carcinoma, descending colon",1
5,"T1-weighted image, ... rectal tumour staged as ymrT2",1
6,Colorectal tumour. Stage: T4b / T4a / T3 / T2 / T1,1
7,"Sigmoid adenocarcinoma, ... Summary: pT1 (sigmoid, txt txt txt txt), N3b M0",1
8,"Colorectal tumour in situ, Tis N0 M0",1
9,Clinical information: T1 N1 (sigmoid tumour),1


## Step 1. Identify reports that describe current colorectal cancer

Main arguments
* `df`: Pandas DataFrame that contains reports (one report per row)
* `col`: name of column in `df` that contains reports

Outputs
* dataframe that contains reports that describe colorectal cancer (a subset of rows of `df`)
* dataframe that contains all matches for colorectal cancer - some of these matches are marked as excluded (`exclusion_indicator = 1`), because they do not correspond to current colorectal cancer


In [3]:
pd.set_option('display.max_colwidth', 2000, 'display.min_rows', 1000, 'display.max_rows', 1000)

In [4]:
# Run
df_crc, matches_crc = get_crc_reports(df=df, col='report_text_anon')


Sites included: ['caecum' 'right (ascending) colon' 'hepatic flexure' 'transverse colon'
 'splenic flexure' 'left (descending) colon' 'sigmoid colon' 'rectum'
 'colon' 'colon and rectum']
Sites excluded: ['liver' 'lung' 'pelvis' 'uterus' 'ovaries' 'bladder' 'spleen'
 'anastomosis' 'adrenal gland' 'kidney' 'bone' 'pleura' 'brain' 'head'
 'prostate']
Time elapsed: 0.01040496826171875 minutes


In [5]:
# Get included and excluded matches
matches_incl = matches_crc.loc[matches_crc.exclusion_indicator==0]
matches_excl = matches_crc.loc[matches_crc.exclusion_indicator==1]

In [6]:
# Included matches
#   'row' corresponds to the row number of input dataframe, NB counting from 0 (0 - first row, 1 - second row, ...)
print('{} matches for tumour keywords were included'.format(matches_incl.shape[0]))
matches_incl[['row', 'left', 'target', 'right']]

8 matches for tumour keywords were included


Unnamed: 0,row,left,target,right
7,1,T1 N0 MX (,colorectal cancer,)
1,2,pT3/2/1 N0 Mx.,Malignant neoplasm,ascending colon
2,3,pT2a/b N0 Mx (sigmoid,tumour,)
3,4,T4a & b N0 M1 invasive,carcinoma,", descending colon"
4,5,"T1-weighted image, ... rectal",tumour,staged as ymrT2
8,6,,Colorectal tumour,. Stage: T4b / T4a / T3 / T2 / T1
5,7,Sigmoid,adenocarcinoma,", ... Summary: pT1 (sigmoid, txt txt txt txt), N3b M0"
9,8,,Colorectal tumour,"in situ, Tis N0 M0"


In [7]:
# Excluded matches
print('{} matches for tumour keywords were excluded'.format(matches_excl.shape[0]))
matches_excl[['row', 'left', 'target', 'right', 'exclusion_indicator', 'exclusion_reason']]

2 matches for tumour keywords were excluded


Unnamed: 0,row,left,target,right,exclusion_indicator,exclusion_reason
0,0,Metastatic,tumour,"from colorectal primary, T3 N0",1,metastatic;
6,9,Clinical information: T1 N1 (sigmoid,tumour,),1,site historic or general;historic;


In [8]:
# If review shows that some included matches are incorrect, they can be excluded manually.
# The CRC reports can then be identified from the remaining matches as follows
df['row'] = np.arange(df.shape[0])
df['crc_nlp'] = 0
matches_incl = matches_crc.loc[matches_crc.exclusion_indicator==0]
matches_incl_processed = matches_incl # processed matches, add any processing steps
df.loc[df.row.isin(matches_incl_processed.row), 'crc_nlp'] = 1
df_crc = df.loc[df.crc_nlp == 1]
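
As an illustration of the "add any processing steps" placeholder above, the sketch below drops matches that a reviewer judged incorrect before the reports are flagged. The tables `df_toy` and `matches_toy` and the list `rows_rejected` are made-up stand-ins; only the column names `row` and `exclusion_indicator` are taken from the output shown earlier.

```python
import numpy as np
import pandas as pd

# Toy stand-ins for the report table and the match table (assumed data)
df_toy = pd.DataFrame({"report_text_anon": ["report a", "report b", "report c"]})
matches_toy = pd.DataFrame({
    "row": [0, 1, 2],
    "target": ["tumour", "carcinoma", "tumour"],
    "exclusion_indicator": [0, 0, 1],
})

# Rows that manual review found to be wrong (hypothetical reviewer decision)
rows_rejected = [1]

# Keep automatically included matches, then remove the manually rejected ones
incl = matches_toy.loc[matches_toy.exclusion_indicator == 0]
incl_processed = incl.loc[~incl.row.isin(rows_rejected)]

# Flag reports that still have at least one accepted match
df_toy["row"] = np.arange(df_toy.shape[0])
df_toy["crc_nlp"] = df_toy.row.isin(incl_processed.row).astype(int)
print(df_toy.crc_nlp.tolist())
```

Only the first toy report keeps its flag: the second was rejected by the reviewer and the third was already excluded automatically.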

## Step 2. Extract TNM phrases from reports

I first run `get_tnm_phrase` to get all TNM sequences (e.g. `T1 N0 M0`) and all phrases with single TNM values (e.g. `stage: T1`).

I then run `add_tumour_tnm` to identify tumour keywords that occur near the TNM phrases. This can help decide which tumour a TNM phrase refers to, but the step is optional.

Main arguments for `get_tnm_phrase`
* `df`  : DataFrame that contains reports
* `col` : column in `df` that contains the report text
* `remove_unusual` : remove unusual TNM phrases from output. For example, if 5 T-values are given in sequence, it is likely a multiple-choice option, not an actual TNM stage. True by default.
* `remove_historical`: remove TNM phrases that were marked as historical based on nearby words. False by default, because that part of the code may not be accurate at present.
* `remove_falsepos`: remove phrases with single TNM values if they lack inclusion keywords or contain exclusion keywords. For example, `T1-weighted` is removed, as it is not a T-stage. True by default.

Main arguments for `add_tumour_tnm`
* `df`         : dataframe that contains reports
* `matches`    : dataframe that contains matches for TNM phrases (the first output of `get_tnm_phrase`)
* `col_report` : column in `df` that contains reports
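
To make the false-positive problem concrete, here is a minimal sketch of why a phrase such as `T1-weighted` should not count as a T-stage. The regular expression is a simplified assumption written for this example; it is not the pattern used by `get_tnm_phrase`, which relies on inclusion and exclusion keywords.

```python
import re

# Simplified pattern for a single T value (an assumption, not the package's pattern):
# optional prefix letters, 'T', then 'is', 0-4 with an optional a-d suffix, or x/X.
# The negative lookahead rejects values followed by a word character or a hyphen,
# which is what rules out 'T1-weighted'.
T_VALUE = re.compile(r"\b([ypmrca]{0,4})T(is|[0-4][a-d]?|[xX])(?![\w-])")

texts = [
    "T1-weighted image, ... rectal tumour staged as ymrT2",
    "Colorectal tumour in situ, Tis N0 M0",
]
for text in texts:
    found = [m.group(0) for m in T_VALUE.finditer(text)]
    print(found)
```

In the first text only `ymrT2` is picked up, and in the second only `Tis`. A lookahead is a much cruder filter than keyword-based rules, so it serves only to show the idea.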


In [9]:
# Display opts
pd.set_option('display.max_colwidth', 500, 'display.max_rows', 1000, 'display.min_rows', 1000)

In [10]:
# Extract TNM phrases
#  remove_historical = False, as the detection of historical TNM phrases is likely not accurate at present
matches, check_phrases, check_cleaning, check_rm = get_tnm_phrase(df=df_crc, col='report_text_anon', 
                                                                  remove_unusual=True, 
                                                                  remove_historical=False, 
                                                                  remove_falsepos=True)


Extracting TNM sequences ...
7 matches for TNM sequences

Extracting single TNM values ...

Extracting individual TNM values for category: T
8 matches, 0 marked for exclusion

Extracting individual TNM values for category: N
6 matches, 1 marked for exclusion

Extracting individual TNM values for category: M
6 matches, 1 marked for exclusion

Extracting individual TNM values for category: L
0 matches, 0 marked for exclusion

Extracting individual TNM values for category: V
0 matches, 0 marked for exclusion

Extracting individual TNM values for category: R
0 matches, 0 marked for exclusion

Extracting individual TNM values for category: SM
0 matches, 0 marked for exclusion

Extracting individual TNM values for category: H
0 matches, 0 marked for exclusion

Extracting individual TNM values for category: G
0 matches, 0 marked for exclusion

Extracting individual TNM values for category: Pn
0 matches, 0 marked for exclusion
Time elapsed: 0.26 minutes
0 matches have at least 100 capital let

In [11]:
# Extract tumour keywords that occur near each TNM phrase
# This can help to later decide which tumour the TNM phrase refers to
matches = add_tumour_tnm(df=df_crc, matches=matches, col_report='report_text_anon', targetcol='target_before_clean')

Finding nearby tumour keywords for each TNM phrase

Sites included: ['caecum' 'right (ascending) colon' 'hepatic flexure' 'transverse colon'
 'splenic flexure' 'left (descending) colon' 'sigmoid colon' 'rectum'
 'colon' 'colon and rectum']
Sites excluded: ['liver' 'lung' 'pelvis' 'uterus' 'ovaries' 'bladder' 'spleen'
 'anastomosis' 'adrenal gland' 'kidney' 'bone' 'pleura' 'brain' 'head'
 'prostate']
Time elapsed: 0.004904635747273763 minutes
  Number of matches for tumour keywords that were included: 8
Time elapsed: 0.44 seconds


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df['row'] = np.arange(df.shape[0])


In [12]:
# View unique values for extracted phrases
check_phrases

Unnamed: 0,target,length
1,pT3/2/1 N0 Mx,13
2,pT2a/2b N0 Mx,13
3,T4a/4b N0 M1,12
4,pT1 N3b M0,10
5,Tis N0 M0,9
0,T1 N0 MX,8
6,ymrT2,5


In [13]:
# Check cleaning of TNM phrases
check_cleaning

Unnamed: 0,target_before_clean,target,length
1,pT3/2/1 N0 Mx,pT3/2/1 N0 Mx,13
2,pT2a/b N0 Mx,pT2a/2b N0 Mx,13
3,T4a & b N0 M1,T4a/4b N0 M1,12
4,"pT1 (sigmoid, txt txt txt txt), N3b M0",pT1 N3b M0,10
5,Tis N0 M0,Tis N0 M0,9
0,T1 N0 MX,T1 N0 MX,8
6,ymrT2,ymrT2,5
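
The cleaning table above shows shorthand such as `pT2a/b` and `T4a & b` being expanded to `pT2a/2b` and `T4a/4b`. The sketch below reproduces that one transformation with a regular expression. The function `expand_shared_digit` is hypothetical and written for illustration; the package's own cleaning code may work differently and handles more cases.

```python
import re


def expand_shared_digit(phrase: str) -> str:
    """Expand 'T2a/b' or 'T4a & b' to 'T2a/2b' and 'T4a/4b' (illustrative only)."""
    # Capture 'T', the stage digit and the first suffix letter, then a separator
    # ('/' or '&') followed by a bare suffix letter that shares the same digit
    pattern = re.compile(r"(T)(\d)([a-d])\s*(?:/|&)\s*([a-d])\b")
    # Rebuild as 'T<digit><suffix1>/<digit><suffix2>'
    return pattern.sub(r"\1\2\3/\2\4", phrase)


for raw in ["pT2a/b N0 Mx", "T4a & b N0 M1", "Tis N0 M0"]:
    print(expand_shared_digit(raw))
```

Phrases without the shorthand, such as `Tis N0 M0`, pass through unchanged.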


In [14]:
# View all included matches - detailed view
cols =  ['sentence', 'left', 'target_before_split', 'target_before_clean', 'target', 
         'right', 'exclusion_indicator', 'exclusion_reason', 'phrase_with_tumour']
matches[cols]

Unnamed: 0,sentence,left,target_before_split,target_before_clean,target,right,exclusion_indicator,exclusion_reason,phrase_with_tumour
0,T1 N0 MX (colorectal cancer),,T1 N0 MX,T1 N0 MX,T1 N0 MX,(colorectal cancer),0,,<<T1 N0 MX>> <<COLORECTAL CANCER>>)
1,pT3/2/1 N0 Mx,,pT3/2/1 N0 Mx,pT3/2/1 N0 Mx,pT3/2/1 N0 Mx,. Malignant neoplasm ascending colon,0,,<<PT3/2/1 N0 MX>>.<<MALIGNANT NEOPLASM>> ascending colon
2,pT2a/b N0 Mx (sigmoid tumour),,pT2a/b N0 Mx,pT2a/b N0 Mx,pT2a/2b N0 Mx,(sigmoid tumour),0,,<<PT2A/B N0 MX>> (sigmoid<<TUMOUR>>)
3,"T4a & b N0 M1 invasive carcinoma, descending colon",,T4a & b N0 M1,T4a & b N0 M1,T4a/4b N0 M1,"invasive carcinoma, descending colon",0,,"<<T4A & B N0 M1>> invasive<<CARCINOMA>>, descending colon"
4,rectal tumour staged as ymrT2,"T1-weighted image, ... rectal tumour staged as",ymrT2,ymrT2,ymrT2,,0,,"t1-weighted image, ... rectal <<TUMOUR>> staged as<<YMRT2>>"
5,"Summary: pT1 (sigmoid, txt txt txt txt), N3b M0","Sigmoid adenocarcinoma, ... Summary:","pT1 (sigmoid, txt txt txt txt), N3b M0","pT1 (sigmoid, txt txt txt txt), N3b M0",pT1 N3b M0,,0,,"sigmoid <<ADENOCARCINOMA>>, ... summary:<<PT1 (SIGMOID, TXT TXT TXT TXT), N3B M0>>"
6,"Colorectal tumour in situ, Tis N0 M0","Colorectal tumour in situ,",Tis N0 M0,Tis N0 M0,Tis N0 M0,,0,,"<<COLORECTAL TUMOUR>> in situ,<<TIS N0 M0>>"


In [15]:
# View all included matches - simpler view
cols =  ['left', 'target_before_clean', 'target', 'right']
matches[cols]

Unnamed: 0,left,target_before_clean,target,right
0,,T1 N0 MX,T1 N0 MX,(colorectal cancer)
1,,pT3/2/1 N0 Mx,pT3/2/1 N0 Mx,. Malignant neoplasm ascending colon
2,,pT2a/b N0 Mx,pT2a/2b N0 Mx,(sigmoid tumour)
3,,T4a & b N0 M1,T4a/4b N0 M1,"invasive carcinoma, descending colon"
4,"T1-weighted image, ... rectal tumour staged as",ymrT2,ymrT2,
5,"Sigmoid adenocarcinoma, ... Summary:","pT1 (sigmoid, txt txt txt txt), N3b M0",pT1 N3b M0,
6,"Colorectal tumour in situ,",Tis N0 M0,Tis N0 M0,


In [16]:
# View matches marked for exclusion
check_rm

Unnamed: 0,row,start,end,left,target,right,exclusion_indicator,exclusion_reason,solitary_indicator,sentence_left,sentence_right,sentence,rank
4,5,,,Colorectal tumour. Stage:,T4b / T4a / T3 / T2 / T1,,1,4 or more T-values;,,,,,


In [17]:
# See if any matches marked for exclusion are among included matches
cols =  ['left', 'target_before_clean', 'target', 'right']
matches.loc[matches.exclusion_indicator==1, cols]

Unnamed: 0,left,target_before_clean,target,right


## Step 3. Extract TNM values from phrases

Arguments for `get_tnm_values`:
* `df` : Pandas dataframe that contains reports
* `matches` : TNM phrases that were extracted for each report, the first output of `get_tnm_phrase`
* `col` : name of column in `df` that contains reports

In [18]:
# Get TNM values from phrases
df_crc, s = get_tnm_values(df_crc, matches=matches, col='report_text_anon')




Extracting values from the phrase ...
Extracting additional perineural invasion
Time elapsed: 0.02 seconds

Getting minimum and maximum values ...
Time elapsed: 0.01 minutes


In [19]:
# Column names in df after tnm values were added
print('Columns in df_crc:')
for i, c in enumerate(df_crc.columns):
    print('{}: {}'.format(i,c))

Columns in df_crc:
0: report_text_anon
1: subject
2: crc_nlp
3: T_pre
4: T
5: N
6: M
7: V
8: R
9: L
10: Pn
11: SM
12: H
13: G
14: T_pre_min
15: T_min
16: N_min
17: M_min
18: V_min
19: R_min
20: L_min
21: Pn_min
22: SM_min
23: H_min
24: G_min
25: sha


In [20]:
# View subset of output
df_crc[['report_text_anon', 'T_pre', 'T', 'N', 'M', 'T_pre_min', 'T_min', 'N_min', 'M_min']].fillna('')

Unnamed: 0,report_text_anon,T_pre,T,N,M,T_pre_min,T_min,N_min,M_min
0,T1 N0 MX (colorectal cancer),,1,0,X,,1,0,X
1,pT3/2/1 N0 Mx. Malignant neoplasm ascending colon,p,3,0,X,p,1,0,X
2,pT2a/b N0 Mx (sigmoid tumour),p,2b,0,X,p,2a,0,X
3,"T4a & b N0 M1 invasive carcinoma, descending colon",,4b,0,1,,4a,0,1
4,"T1-weighted image, ... rectal tumour staged as ymrT2",mry,2,,,mry,2,,
5,Colorectal tumour. Stage: T4b / T4a / T3 / T2 / T1,,,,,,,,
6,"Sigmoid adenocarcinoma, ... Summary: pT1 (sigmoid, txt txt txt txt), N3b M0",p,1,3b,0,p,1,3b,0
7,"Colorectal tumour in situ, Tis N0 M0",,is,0,0,,is,0,0
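
In the table above, a phrase with several T values (e.g. `pT3/2/1`) yields the largest value in `T` and the smallest in `T_min`. The sketch below shows one way such a minimum and maximum could be derived from a cleaned phrase. `t_min_max` is a hypothetical function for illustration and assumes values made of one digit plus an optional letter, separated by `/`; it is not the code used by `get_tnm_values`.

```python
import re


def t_min_max(phrase: str):
    """Return (smallest, largest) T value found in a cleaned TNM phrase.

    Illustrative only: handles values such as '3', '2a', '4b' separated by '/'.
    """
    m = re.search(r"T((?:\d[a-d]?)(?:/\d[a-d]?)*)", phrase)
    if m is None:
        return None, None
    # Plain string sorting is enough here, because every value is a single
    # digit followed by at most one letter
    values = sorted(m.group(1).split("/"))
    return values[0], values[-1]


print(t_min_max("pT3/2/1 N0 Mx"))
print(t_min_max("pT2a/2b N0 Mx"))
```

For the first phrase this gives `('1', '3')` and for the second `('2a', '2b')`, which agrees with the `T_min` and `T` columns in the table above.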


## Run the code on multiple cores to speed it up

In [21]:
from textmining.reports import get_crc_reports_par
from textmining.tnm.tnm import get_tnm_phrase_par


In [22]:
# Identify CRC reports, dividing the data into 10 chunks which are processed in parallel if there are at least 10 cores
df_crc, matches_crc = get_crc_reports_par(nchunks=10, njobs=-1, df=df, col='report_text_anon')


Sites included: ['caecum' 'right (ascending) colon' 'hepatic flexure' 'transverse colon'
 'splenic flexure' 'left (descending) colon' 'sigmoid colon' 'rectum'
 'colon' 'colon and rectum']
Sites excluded: ['liver' 'lung' 'pelvis' 'uterus' 'ovaries' 'bladder' 'spleen'
 'anastomosis' 'adrenal gland' 'kidney' 'bone' 'pleura' 'brain' 'head'
 'prostate']

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df['row'] = np.arange(df.shape[0])

Time elapsed: 0.017978366216023764 minutes
Time elapsed: 0.017381703853607176 minutes
Time elapsed: 0.02154443661371867 minutes
Time elapsed: 0.022423752148946128 minutes
Time elapsed: 0.02220884958902995 minutes
Time elapsed: 0.022043935457865396 minutes
Time elapsed: 0.021384286880493163 minutes
Time elapsed: 0.019969419638315836 minutes
Time elapsed: 0.02125316858291626 minutes
Time elapsed: 0.020901584625244142 minutes
Time elapsed: 0.08382823467254638 minutes


In [24]:
# As before, the matches can be checked manually and the CRC reports identified from the checked matches
df['row'] = np.arange(df.shape[0])
df['crc_nlp'] = 0
matches_incl = matches_crc.loc[matches_crc.exclusion_indicator==0]
matches_incl_processed = matches_incl # processed matches, add any processing steps
df.loc[df.row.isin(matches_incl_processed.row), 'crc_nlp'] = 1
df_crc = df.loc[df.crc_nlp == 1]

In [25]:
# Extract TNM phrases, dividing the data into 10 chunks which are processed in parallel if there are at least 10 cores
matches, check_phrases, check_cleaning, check_rm = get_tnm_phrase_par(nchunks=10, njobs=-1, 
                                                                      df=df_crc, col='report_text_anon', 
                                                                      remove_unusual=True, 
                                                                      remove_historical=False, 
                                                                      remove_falsepos=True)


Extracting TNM sequences ...

0 matches for TNM sequences

Extracting single TNM values ...

Extracting individual TNM values for category: T
0 matches for TNM sequences

Extracting single TNM values ...

Extracting individual TNM values for category: T
0 matches, 0 marked for exclusion

Extracting individual TNM values for category: N
0 matches, 0 marked for exclusion

Extracting individual TNM values for category: N
0 matches, 0 marked for exclusion

Extracting individual TNM values for category: M
0 matches, 0 marked for exclusion

Extracting individual TNM values for category: M
0 matches, 0 marked for exclusion

Extracting individual TNM values for category: L
0 matches, 0 marked for exclusion

Extracting individual

In [26]:
# Get TNM values from phrases
df_crc, s = get_tnm_values(df_crc, matches=matches, col='report_text_anon')


Extracting values from the phrase ...




Extracting additional perineural invasion
Time elapsed: 0.02 seconds

Getting minimum and maximum values ...
Time elapsed: 0.01 minutes
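
The `_par` functions above take `nchunks` and `njobs` arguments, which suggests a split, process and combine pattern. The sketch below illustrates that general pattern with the standard library only. It is an assumption about the overall idea, not the implementation in `textmining`: `split_into_chunks` and `count_n0` are hypothetical names, and threads are used here purely to keep the example self-contained.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_chunks(items, nchunks):
    """Split a list into at most nchunks contiguous, nearly equal parts."""
    nchunks = max(1, min(nchunks, len(items)))
    size, extra = divmod(len(items), nchunks)
    chunks, start = [], 0
    for i in range(nchunks):
        # The first 'extra' chunks get one additional item
        end = start + size + (1 if i < extra else 0)
        chunks.append(items[start:end])
        start = end
    return chunks


def count_n0(reports):
    """Stand-in for per-chunk work: count reports that mention 'N0'."""
    return sum("N0" in r for r in reports)


reports = ["T1 N0 MX", "pT2a/b N0 Mx", "ymrT2", "Tis N0 M0", "T4a & b N0 M1"]
chunks = split_into_chunks(reports, nchunks=3)

# map() returns results in chunk order, so they can be combined directly
with ThreadPoolExecutor(max_workers=3) as pool:
    per_chunk = list(pool.map(count_n0, chunks))

print([len(c) for c in chunks], sum(per_chunk))
```

Because chunks are processed concurrently, messages printed by individual workers can interleave, which may explain the repeated and interleaved log lines in the parallel output above. With very small inputs such as the ten example reports, the overhead of splitting is likely to outweigh any speed-up.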
