# The purpose of this notebook is to understand the data, the evaluation criteria, and the problem this competition is trying to solve.
* ## First, let's read all the files.
* ## Then we will go through each file and see how they are connected.

In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import os
base_path='/kaggle/input/nbme-score-clinical-patient-notes/'
train=pd.read_csv(os.path.join(base_path,'train.csv'))
test=pd.read_csv(os.path.join(base_path,'test.csv'))
features=pd.read_csv(os.path.join(base_path,'features.csv'))
patient_notes=pd.read_csv(os.path.join(base_path,'patient_notes.csv'))

### We have 4 files (train.csv, test.csv, features.csv and patient_notes.csv), and each one has been read into a variable of the same name.
## To understand how the files relate to each other, let us first understand the problem.

* #### Imagine you are a doctor who wants a license to practice in the USA. For that you need to take an exam called the USMLE (United States Medical Licensing Examination). In one part of the exam you ask patients about their history (past experiences, lifestyle, diseases common in the family, etc.). You then write this history on your exam sheet and submit it.
* #### A licensed, trained physician (the examiner) then checks your exam sheet and gives you marks. The question is: how does the examiner know whether you took the history correctly?
* #### The examiner has a rubric (answer set) against which your exam sheet is compared, and you are scored based on the similarity.
* #### The rubric is the file **features**.csv, which lists, for each case, the words or sentences that should be present in the sheet.


### Let's go through an example to understand how this works.


### Let us first view the dataframe **patient_notes**

In [None]:
patient_notes.head()

### It's simple: we have pn_num (patient number), case_num (case number), and pn_history, which is the patient history written by a doctor taking the licensing exam (USMLE).

### Let us view a sample patient note for patient number 16 and case number 0

In [None]:
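# Combine both conditions into one boolean mask; chaining two [] filters makes pandas warn about reindexing
note_for_pn16=patient_notes.loc[(patient_notes['pn_num']==16) & (patient_notes['case_num']==0), 'pn_history'].values[0]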
print (note_for_pn16)

### Now comes the next part: evaluating the written history

#### For that we have features.csv (the features dataframe in this notebook), which has 3 columns: feature_num, a unique identifier for each feature; case_num, an identifier for the case; and feature_text, which contains the sentences that should ideally be present in the patient note written by the test taker.

In [None]:
features.head()

### Remember that a few cells back we viewed the patient note for patient no: 16 and case no: 0 written by the test taker? To make things easier, let us filter the features for case_num 0 only (features are defined per case, not per patient, so there is no patient number to filter on).

In [None]:
for x in features[features['case_num']==0]['feature_text']:
    print (x)

### What this means is that for case number 0, the feature_text entries, or something similar to them, should be present in the patient note. This is the rubric, or the answer sheet.
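
To see the rubric for every case at once, the features table could be grouped by case_num. This is only a sketch on a made-up table standing in for features.csv: the two case 0 texts are the ones quoted in this notebook, while the feature_num values and the case 1 rows are invented placeholders.

```python
import pandas as pd

# Toy stand-in for features.csv; the rows are illustrative, not the real file.
toy_features = pd.DataFrame({
    "feature_num": [0, 1, 100, 101],          # placeholder identifiers
    "case_num": [0, 0, 1, 1],
    "feature_text": [
        "Family-history-of-MI-OR-Family-history-of-myocardial-infarction",
        "Family-history-of-thyroid-disorder",
        "Hypothetical-feature-A",             # invented for illustration
        "Hypothetical-feature-B",             # invented for illustration
    ],
})

# One rubric (list of expected features) per case.
rubric_by_case = (
    toy_features.groupby("case_num")["feature_text"].apply(list).to_dict()
)

for case_num, rubric in rubric_by_case.items():
    print(case_num, len(rubric), rubric)
```

On the real data the same groupby should work on the features dataframe loaded above, assuming the column names match those described in this notebook.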

### For comparison, let us view the patient note for patient no 16 and case number 0 again and judge for ourselves how the test taker performed

In [None]:
print (note_for_pn16)

### Here we can easily see that the 4th-last line reads "FHx: mom with "thyroid disease," dad with recent heart attcak" (quoted as written), which is similar to "Family-history-of-thyroid-disorder" and "Family-history-of-MI-OR-Family-history-of-myocardial-infarction" in feature_text.

### And this is what we need to predict: which parts of the patient note are similar to the entries in the features.csv file (the rubric, or our answer sheet).


### Now imagine we need to label what we just did. How would we do it? We can say that in the patient note of patient no: 16, characters 668 to 693 are similar to "Family-history-of-thyroid-disorder" and characters 696 to 724 are similar to "Family-history-of-MI-OR-Family-history-of-myocardial-infarction".

In [None]:
print (note_for_pn16[668:693])

In [None]:
print (note_for_pn16[696:724])
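
Python slices include the start index and exclude the end index, so a span given as a start and an end maps directly to `note[start:end]`. The sketch below shows this on a made-up one-line string (not the real note), finding the offsets with `str.find` instead of hard-coding them.

```python
# Made-up one-line note; the real pn_history is much longer.
toy_note = 'FHx: mom with "thyroid disease," dad with recent heart attack'

phrase = "dad with recent heart attack"
start = toy_note.find(phrase)   # index of the first character of the phrase
end = start + len(phrase)       # one past the last character (exclusive)

print(start, end)
print(toy_note[start:end])
```

Because the end index is exclusive, `end - start` is exactly the length of the extracted text.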

### The competition data does the same thing: for each feature, it gives the location of the similar text in the patient note.

### Let us view the train.csv (train) data to see how it is structured

In [None]:
train.head()

### The train data has id (a unique identifier for each training data point), case_num, pn_num, feature_num (which connects to the features table), annotation and location (annotation and location are explained later). For simplicity we will stick to pn_num 16 and case_num 0 and then work out what annotation and location mean.

In [None]:
train[(train['pn_num']==16) & (train['case_num']==0)]

### Why not just join the features table and the train table so that we can view everything in a much easier way?


In [None]:
train_features=train.merge(features, on=['feature_num','case_num'], how='left')
train_features[(train_features['case_num']==0) & (train_features['pn_num']==16)]
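
The same idea extends to patient_notes: merging on pn_num and case_num would put the note text next to each training row. Here is a sketch with made-up one-row tables. The column names follow this notebook, but the values are illustrative, and storing annotation and location as stringified lists is an assumption about the file format.

```python
import pandas as pd

# Made-up one-row tables; values are illustrative, not copied from the real files.
toy_train = pd.DataFrame({
    "id": ["00016_000"],
    "case_num": [0],
    "pn_num": [16],
    "feature_num": [0],
    "annotation": ["['dad with recent heart attack']"],  # assumed storage format
    "location": ["['33 61']"],                           # assumed storage format
})
toy_features = pd.DataFrame({
    "feature_num": [0],
    "case_num": [0],
    "feature_text": ["Family-history-of-MI-OR-Family-history-of-myocardial-infarction"],
})
toy_notes = pd.DataFrame({
    "pn_num": [16],
    "case_num": [0],
    "pn_history": ['FHx: mom with "thyroid disease," dad with recent heart attack'],
})

# Left joins keep every training row even if a lookup table has no match.
merged = (
    toy_train
    .merge(toy_features, on=["feature_num", "case_num"], how="left")
    .merge(toy_notes, on=["pn_num", "case_num"], how="left")
)
print(merged[["id", "feature_text", "annotation", "location", "pn_history"]])
```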

### Now we can understand annotation and location. Location says where in the patient note string (here for patient no 16) a particular feature is present. For example, in the train data for id: 00016_000 we have the feature_text "Family History of MI and ..." and that sentence, or something similar, is present in our patient note string from character 696 to 724.

### And annotation is simply what is written in that character span

In [None]:
print (note_for_pn16[696:724])
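
To check an annotation programmatically, the location value can be turned into integer offsets and used to slice the note. This is a hedged sketch: it assumes location is stored as a stringified list such as "['33 61']" with a space-separated start and end, `parse_location` is a hypothetical helper, and the note is made up. It handles only simple "start end" entries.

```python
import ast

def parse_location(location):
    """Turn a string like "['33 61']" into [(33, 61)]; "[]" gives []."""
    spans = []
    for item in ast.literal_eval(location):
        start, end = item.split()          # assumes exactly "start end"
        spans.append((int(start), int(end)))
    return spans

# Made-up note and labels, stored the way this sketch assumes the CSV stores them.
toy_note = 'FHx: mom with "thyroid disease," dad with recent heart attack'
toy_location = "['33 61']"
toy_annotation = "['dad with recent heart attack']"

spans = parse_location(toy_location)
extracted = [toy_note[s:e] for s, e in spans]

print(spans, extracted)
print(extracted == ast.literal_eval(toy_annotation))
```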

### Please also note that for some of the training rows the location is not present. These rows can be treated as missing data.
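
One way to see how common this is would be to count the rows whose location is empty. The sketch below uses a made-up table and assumes that an empty location is stored as the string "[]" (an assumption about the file format, not something shown above).

```python
import pandas as pd

# Made-up rows; only the middle one has no location.
toy_train = pd.DataFrame({
    "id": ["00016_000", "00016_001", "00016_002"],
    "location": ["['33 61']", "[]", "['5 12']"],
})

# True where no span is given for that row.
has_no_location = toy_train["location"].str.strip() == "[]"

print(has_no_location.sum(), "of", len(toy_train), "rows have no location")
print(toy_train[has_no_location])
```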

# To be Continued
