# Feature Extraction 
    
In this notebook, we extract the probabilities needed to construct a hidden Markov model (HMM) later on.


### The red box shows the current position in the pipeline

![PipelineStep2](images/Pipeline_step2.png "View of the current state of the Pipeline")

### Inside the red box, the following chart models the workflow

![FeatureExtractionPipeline](images/FeatureExtractionPipeline.png "View of the Preprocessing Pipeline")

The first step is to group the experiments.
The next step is to iterate over the pandas DataFrame and count occurrences, from which the actual probabilities are derived later.
Iterating may be time-intensive, but I currently see no other way.
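
The grouping and counting described above could look roughly like the following. This is only a sketch under assumptions: it treats `usubjid` as the column that identifies an experiment, assumes rows are already ordered in time within each group, and uses hypothetical helper names (`count_transitions`, `normalize`, `transition_probabilities`) and toy data. The actual logic in pvault.py may differ.

```python
from collections import Counter

import pandas as pd


def count_transitions(df, group_col, value_col):
    """Count first values and consecutive value pairs within each group."""
    initial_counts = Counter()
    transition_counts = Counter()
    for _, group in df.groupby(group_col, sort=False):
        values = group[value_col].tolist()
        # the first value of each group contributes to the initial state counts
        initial_counts[values[0]] += 1
        # every consecutive pair (previous, next) is one observed transition
        transition_counts.update(zip(values, values[1:]))
    return initial_counts, transition_counts


def normalize(counts):
    """Turn raw counts into relative frequencies that sum to 1."""
    total = sum(counts.values())
    return {key: n / total for key, n in counts.items()}


def transition_probabilities(transition_counts):
    """Normalize transition counts per source state."""
    totals = Counter()
    for (src, _), n in transition_counts.items():
        totals[src] += n
    return {(src, dst): n / totals[src]
            for (src, dst), n in transition_counts.items()}


# toy data (hypothetical): two experiments with three visits each
toy = pd.DataFrame({
    "usubjid": [0, 0, 0, 1, 1, 1],
    "subjstat": [2, 2, 1, 2, 1, 1],
})

initial_counts, transition_counts = count_transitions(toy, "usubjid", "subjstat")
initial_probs = normalize(initial_counts)
trans_probs = transition_probabilities(transition_counts)
```

Looping over groups rather than over single rows keeps the Python-level iteration count equal to the number of experiments, which may ease the runtime concern somewhat.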

The current idea is to store the probabilities in one **HUGE** dictionary, held by an object that exposes an interface for easy access.
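
To illustrate the idea of a dictionary wrapped in an object, here is a minimal sketch. The class `ProbabilityStore`, its method names, and the nesting scheme (kind, then column, then key) are assumptions made for illustration; this is not the `ProbabilityVault` from pvault.py.

```python
class ProbabilityStore:
    """Hypothetical container for nested probability tables."""

    def __init__(self):
        # layout: kind -> column -> key -> probability
        self._probs = {"initial": {}, "transition": {}, "observation": {}}

    def set_probability(self, kind, column, key, value):
        # create the per-column table on first use
        self._probs[kind].setdefault(column, {})[key] = value

    def get_probability(self, kind, column, key, default=0.0):
        # unseen columns or keys fall back to the default
        return self._probs[kind].get(column, {}).get(key, default)

    def get_dict(self, kind):
        # return copies so callers cannot modify the internal tables
        return {col: dict(table) for col, table in self._probs[kind].items()}


store = ProbabilityStore()
store.set_probability("initial", "subjstat", 2, 0.75)
store.set_probability("initial", "subjstat", 0, 0.25)

p_seen = store.get_probability("initial", "subjstat", 2)
p_unseen = store.get_probability("initial", "subjstat", 1)
```

Hiding the dictionary behind accessor methods means the nesting scheme can change later without touching the code that reads the probabilities.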

Please see the **pvault.py** file, in which all the actual extraction takes place.

In [1]:
import pandas as pd
import pickle
import configparser

path_to_config = './config.ini'

# load the preprocessed DataFrame
df = pd.read_pickle('preprocessed.pkl')

# read in the config file (note: ConfigParser.read() silently skips missing files)
cparser = configparser.ConfigParser()
cparser.read(path_to_config)

# load the pickled label encoders
with open('label_encoders.pkl', 'rb') as file:
    label_encoders = pickle.load(file)

In [2]:
df

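|     | usubjid | visdat | siteid | visnam | subjstat |
|-----|---------|--------|--------|--------|----------|
| 0   | 0       | 0      | 0      | 0      | 2        |
| 1   | 0       | 1      | 0      | 0      | 2        |
| 2   | 0       | 2      | 0      | 0      | 2        |
| 3   | 0       | 3      | 0      | 0      | 2        |
| 4   | 0       | 4      | 0      | 0      | 2        |
| ... | ...     | ...    | ...    | ...    | ...      |
| 145 | 3       | 120    | 0      | 0      | 2        |
| 146 | 3       | 121    | 0      | 0      | 1        |
| 147 | 3       | 122    | 0      | 0      | 1        |
| 148 | 3       | 123    | 0      | 0      | 1        |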


In [3]:
from pvault import ProbabilityVault

pv = ProbabilityVault(df, cparser, label_encoders)

In [4]:
pv.extract_probabilities()

Extracting counts. This may take some time.


100%|███████████████████████████████████████████████████████████████████████████████████| 4/4 [00:00<00:00, 142.86it/s]


Calculating initial state probabilities.


100%|████████████████████████████████████████████████████████████████████████████████████████████| 5/5 [00:00<?, ?it/s]


Calculating state transition probabilities.


100%|████████████████████████████████████████████████████████████████████████████████████████████| 5/5 [00:00<?, ?it/s]


Calculating observation probabilities.


100%|████████████████████████████████████████████████████████████████████████████████████████████| 5/5 [00:00<?, ?it/s]

Extraction of probabilities successful. Elapsed time: 0.0 hours, 0.0 minutes and 0.04400205612182617 seconds





In [5]:
pv.to_pickle('pv_testing1.pkl')

In [6]:
pv2 = ProbabilityVault.from_pickle('pv_testing1.pkl')
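
For readers curious how a `to_pickle` / `from_pickle` pair like the one used above might be built, here is one possible sketch. The class `PicklableVault` is a hypothetical stand-in, and pickling only the dictionary state (rather than the whole object) is an assumption; the real `ProbabilityVault` may do this differently.

```python
import os
import pickle
import tempfile


class PicklableVault:
    """Hypothetical stand-in showing a pickle round trip."""

    def __init__(self, probabilities):
        self.probabilities = probabilities

    def to_pickle(self, path):
        # store only the plain dictionary, not the object itself
        with open(path, "wb") as file:
            pickle.dump(self.probabilities, file)

    @classmethod
    def from_pickle(cls, path):
        # rebuild a new instance from the stored dictionary
        with open(path, "rb") as file:
            probabilities = pickle.load(file)
        return cls(probabilities)


# round trip through a temporary file
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "vault.pkl")
    PicklableVault({"siteid": {0: 1.0}}).to_pickle(path)
    restored = PicklableVault.from_pickle(path)
```

Storing only the dictionary keeps the file loadable even if the class is later renamed or moved to another module. As with any pickle file, it should only be loaded from a trusted source.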

In [7]:
pv2

<pvault.ProbabilityVault at 0x20f2b088550>

In [8]:
pv2.get_isp_dict()

{'usubjid': {0: 0.25, 1: 0.25, 2: 0.25, 3: 0.25},
 'visdat': {0: 0.5, 21: 0.25, 86: 0.25},
 'siteid': {0: 1.0},
 'visnam': {0: 1.0},
 'subjstat': {2: 0.75, 0: 0.25}}
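
Each inner dictionary above is a probability distribution over the initial values of one column, so its entries should be non-negative and sum to 1. A quick sanity check can be sketched as follows; the helper `find_invalid_distributions` is hypothetical, and the dictionary is copied from the output above.

```python
import math

# initial state probabilities, copied from the output of get_isp_dict()
isp = {
    "usubjid": {0: 0.25, 1: 0.25, 2: 0.25, 3: 0.25},
    "visdat": {0: 0.5, 21: 0.25, 86: 0.25},
    "siteid": {0: 1.0},
    "visnam": {0: 1.0},
    "subjstat": {2: 0.75, 0: 0.25},
}


def find_invalid_distributions(prob_dict, tol=1e-9):
    """Return the columns whose probabilities are negative or do not sum to 1."""
    return [
        column
        for column, dist in prob_dict.items()
        if any(p < 0 for p in dist.values())
        or not math.isclose(sum(dist.values()), 1.0, abs_tol=tol)
    ]


invalid = find_invalid_distributions(isp)
```

An empty `invalid` list means every column passed the check. The tolerance guards against floating-point rounding when many small probabilities are summed.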