# SUAN Pharma data preparation

## Problem statement:
We are analyzing the data generated by various sensors across the four steps of API (Active Pharmaceutical Ingredient) production.

![Production process](img/batch_process.png)

## Goal:
We are tasked with identifying the factors that increase the yield of the API at the end of the production process. The yield is said to have increased over the years through iterative improvements. This project aims to pinpoint those factors and quantify their effects on the yield.

## Data:
We are provided with data collected for each batch executed during the past 9-10 months. For each batch there is a time series sampled at one-minute intervals and a single static value for the percentage yield obtained at the end of the batch.
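To make the two levels of data concrete, the sketch below collapses a minute-level time series into per-batch summary features and attaches one static yield value per batch. It is only an illustration on made-up numbers: the column names `sensor_a` and `yield_pct`, and the choice of mean and max as features, are assumptions and do not come from the dataset.

```python
import pandas as pd

# Toy minute-level readings for two batches (all values are made up).
readings = pd.DataFrame({
    "PO": [1, 1, 1, 2, 2],
    "timeseries": pd.to_datetime([
        "2021-09-07 22:30", "2021-09-07 22:31", "2021-09-07 22:32",
        "2021-09-08 10:00", "2021-09-08 10:01",
    ]),
    "sensor_a": [5.0, 5.2, 5.4, 6.0, 6.2],
})

# One static yield value per batch; `yield_pct` is a hypothetical column name.
yields = pd.DataFrame({"PO": [1, 2], "yield_pct": [81.5, 84.0]})

# Collapse each batch's time series to summary features, then attach the yield.
features = readings.groupby("PO")["sensor_a"].agg(["mean", "max"]).reset_index()
model_table = features.merge(yields, on="PO", how="left")
print(model_table)
```

The result has one row per batch, which is the shape needed to relate sensor behaviour to the batch-level yield.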

## Challenges:

### Data preparation:
1. For each batch, an Excel file containing five sheets is shared. The sheets can be briefly described as follows:
    1. PO: Contains three values: the batch ID and the start and end times of the batch.
    2. BHV: Data collected from the Broth Harvest step; it contains 19 columns.
    3. CFF: Data collected from the Cross Flow Filtration step; it contains 114 columns.
    4. EXT: Data collected from the Extraction step; it contains 78 columns.
    5. NF: Data collected from the Nano Filtration step; it contains 53 columns.
2. First, the data from the various steps needs to be column-bound, and then the data from the various batches needs to be row-bound.
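The two binding operations named above can be sketched with `pd.concat`. This is a minimal illustration on made-up data, not the notebook's loader: the helper names `make_step` and `bind_batch`, the sensor column names, and the assumption that the step tables of a batch share the same timestamps are all hypothetical.

```python
import pandas as pd

def make_step(prefix, start, periods):
    """Build a toy one-minute sensor table for one production step."""
    idx = pd.date_range(start, periods=periods, freq="min", name="timeseries")
    return pd.DataFrame({f"{prefix}_sensor": range(periods)}, index=idx)

def bind_batch(batch_id, start):
    # Column-bind: place the step tables side by side, aligned on the timestamp index.
    steps = [make_step(p, start, 3) for p in ("BHV", "CFF")]
    wide = pd.concat(steps, axis=1)
    wide.insert(0, "PO", batch_id)
    return wide.reset_index()

# Row-bind: stack the per-batch tables on top of each other.
batches = pd.concat(
    [bind_batch(1, "2021-09-07 22:30"), bind_batch(2, "2021-09-08 10:00")],
    ignore_index=True,
)
print(batches.shape)
```

If the steps of a real batch run at different times, aligning on the timestamp leaves gaps (NaN) wherever a step has no reading at that minute.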

In [2]:
import pandas as pd
pd.set_option('display.float_format', lambda x: '%.5f' % x)

import numpy as np
import glob

file_names = glob.glob('../data/input/11_Dataset/ODP*')
sheet_names = ['PO', 'BHV', 'CFF', 'EXT', 'NF']

def udf_read_bind(fileNames, sheetNames):
    """Read every sheet of every batch file and stack the results into one frame."""
    foo = []
    col_names_list = []
    for i in fileNames:
        print(i)
        # Stack the sheets of one workbook (uses the sheetNames argument, not the global)
        data = pd.concat([pd.read_excel(i, sheet_name=name) for name in sheetNames], axis=0)
        # The PO sheet holds the batch ID and start/end times; propagate them to every row
        data['PO'] = data['PO'].replace(np.nan, data['PO'].unique()[0])
        data['Date Start'] = data['Date Start'].replace(np.nan, data['Date Start'].unique()[0])
        data['Date End'] = data['Date End'].replace(np.nan, data['Date End'].unique()[0])
        data['PO'] = data['PO'].convert_dtypes()
        # Drop rows without a timestamp, then give the timestamp column a proper name
        data.dropna(subset=['Unnamed: 0'], how='all', inplace=True)
        data.rename(columns = {'Unnamed: 0' : 'timeseries'}, inplace = True)
        data['Date Start'] = pd.to_datetime(data['Date Start'], format = '%m-%d-%Y %H:%M:%S')
        data['Date End'] = pd.to_datetime(data['Date End'], format = '%m-%d-%Y %H:%M:%S')
        data['timeseries'] = pd.to_datetime(data['timeseries'], format = '%m-%d-%Y %H:%M:%S')
        # Batch duration in minutes, inserted as the fourth column
        data.insert(3, 'processing_time_mins', (data['Date End'] - data['Date Start'])/pd.Timedelta(minutes = 1))
#         if set(['Unnamed: 79', 'Unnamed: 115']).issubset(data.columns):
#             print(udf_describe(data[['Unnamed: 79', 'Unnamed: 115']]))
        foo.append(data)
        col_names_list.append(data.columns.values)
    # Stack all batches and return the per-file column names for consistency checks
    appended_data = pd.concat(foo)
    col_names_list = [l.tolist() for l in col_names_list]
    return appended_data, col_names_list

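def udf_describe(df):
    """Extend describe() with each column's dtype, the row count and the share of missing values."""
    desc_df = df.describe(include = 'all')
    desc_df.loc['dtype'] = df.dtypes
    desc_df.loc['size'] = len(df)
    # Despite its label, '% count' holds the percentage of missing (null) values
    desc_df.loc['% count'] = df.isnull().mean()*100
    return desc_df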

# def udf_sanitary_check(df):
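To show what the describe helper returns, here is a small self-contained run on made-up data. The function `describe_with_missing` is a stand-in written for this example that follows the same steps as `udf_describe`; the toy columns `sensor_a` and `flag` are assumptions.

```python
import numpy as np
import pandas as pd

def describe_with_missing(df):
    """Same idea as udf_describe: describe() plus dtype, row count and % missing."""
    desc = df.describe(include="all")
    desc.loc["dtype"] = df.dtypes
    desc.loc["size"] = len(df)
    # The '% count' label is kept from the notebook; it is the share of nulls.
    desc.loc["% count"] = df.isnull().mean() * 100
    return desc

toy = pd.DataFrame({
    "sensor_a": [1.0, 2.0, np.nan, np.nan],
    "flag": [False, False, True, None],
})
summary = describe_with_missing(toy).T
print(summary[["count", "size", "% count"]])
```

In the transposed summary, `count` is the number of non-null values, `size` is the total number of rows, and `% count` is the percentage of rows that are null.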


In [3]:
df, col_names = udf_read_bind(file_names, sheet_names)
df.shape

../data/input/11_Dataset/ODP 100001700_dati_BHV_CFF_NF_EXT .xlsx
../data/input/11_Dataset/ODP 100001770_dati_BHV_CFF_NF_EXT .xlsx
../data/input/11_Dataset/ODP 100001777_dati_BHV_CFF_NF_EXT.xlsx
../data/input/11_Dataset/ODP 100001776_dati_BHV_CFF_NF_EXT.xlsx
../data/input/11_Dataset/ODP 100001702_dati_BHV_CFF_NF_EXT .xlsx
../data/input/11_Dataset/ODP 100001772_dati_BHV_CFF_NF_EXT .xlsx
../data/input/11_Dataset/ODP 100001768_dati_BHV_CFF_NF_EXT.xlsx
../data/input/11_Dataset/ODP 100001704_dati_BHV_CFF_NF_EXT .xlsx
../data/input/11_Dataset/ODP 100001773_dati_BHV_CFF_NF_EXT .xlsx
../data/input/11_Dataset/ODP 100001774_dati_BHV_CFF_NF_EXT .xlsx
../data/input/11_Dataset/ODP 100001703_dati_BHV_CFF_NF_EXT .xlsx
../data/input/11_Dataset/ODP 100001769_dati_BHV_CFF_NF_EXT .xlsx
../data/input/11_Dataset/ODP 100001705_dati_BHV_CFF_NF_EXT.xlsx
../data/input/11_Dataset/ODP 100001767_dati_BHV_CFF_NF_EXT.xlsx
../data/input/11_Dataset/ODP 100001771_dati_BHV_CFF_NF_EXT .xlsx
../data/input/11_Dataset/ODP 1

(88277, 268)

In [4]:
df.head(5)

| | PO | Date Start | Date End | processing_time_mins | timeseries | 101LI636 | 101WI610 | 306LI606 | 101AI635 | 101AI605 | ... | 108PI659 | 108PI662 | 108PI663 | 108FI653 | 108FI657 | 108FI665 | 108FI669 | 108FI673 | 108FI677 | 108FI681 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1000001700 | 2021-09-07 22:30:00 | 2021-09-08 19:20:00 | 1250.0 | 2021-09-07 22:30:00 | 95481.8125 | 98488.91406 | 0.0 | 5.1806 | 5.53759 | ... | | | | | | | | | | |
| 1 | 1000001700 | 2021-09-07 22:30:00 | 2021-09-08 19:20:00 | 1250.0 | 2021-09-07 22:31:00 | 95143.78906 | 98489.21875 | 0.0 | 5.18127 | 5.53838 | ... | | | | | | | | | | |
| 2 | 1000001700 | 2021-09-07 22:30:00 | 2021-09-08 19:20:00 | 1250.0 | 2021-09-07 22:32:00 | 94787.25 | 98495.00781 | 0.0 | 5.17957 | 5.53818 | ... | | | | | | | | | | |
| 3 | 1000001700 | 2021-09-07 22:30:00 | 2021-09-08 19:20:00 | 1250.0 | 2021-09-07 22:33:00 | 94559.07031 | 98491.13281 | 0.0 | 5.18621 | 5.53742 | ... | | | | | | | | | | |
| 4 | 1000001700 | 2021-09-07 22:30:00 | 2021-09-08 19:20:00 | 1250.0 | 2021-09-07 22:34:00 | 94037.35156 | 98488.625 | 0.0 | 5.18442 | 5.53737 | ... | | | | | | | | | | |


In [7]:
len(list(set.intersection(*map(set,col_names))))

268

In [8]:
len(col_names)

17
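The intersection check above confirms consistency only when its size equals the number of columns in each file. When it does not, it helps to see which columns are missing and from which file. The sketch below shows one way to do that; `col_names_demo` is a made-up stand-in for `col_names`, and the sensor names are used only as placeholders.

```python
# Hypothetical per-file column lists, standing in for `col_names`.
col_names_demo = [
    ["PO", "timeseries", "101LI636", "101WI610"],
    ["PO", "timeseries", "101LI636", "101WI610"],
    ["PO", "timeseries", "101LI636"],  # this file lacks one sensor column
]

common = set.intersection(*map(set, col_names_demo))
union = set.union(*map(set, col_names_demo))

# Columns that at least one file is missing.
inconsistent = sorted(union - common)

# For each file, list which of the union's columns it does not have.
missing_by_file = {
    i: sorted(union - set(cols)) for i, cols in enumerate(col_names_demo)
}
print(inconsistent)      # ['101WI610']
print(missing_by_file)   # {0: [], 1: [], 2: ['101WI610']}
```

An empty `inconsistent` list corresponds to the case where the intersection and the union have the same size, that is, every file exposes the same columns.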

In [11]:
col_irregularities = df.nunique()[df.nunique()<=1].keys()
for i in col_irregularities:
    print("Column {col_} has just these values: {list_}".format(col_=i, list_=df[i].unique()))

Column 158PIC678_823 has just these values: [ 0. nan]
Column 118LS960 has just these values: [nan False]
Column 118FI913 has just these values: [   nan 32000.]
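The check above relies on `nunique()` ignoring NaN by default, which is why the flagged columns print as one value plus `nan`. The sketch below reproduces that behaviour on made-up data; the column names are hypothetical.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "varies": [1.0, 2.0, np.nan],
    "constant": [32000.0, 32000.0, np.nan],
    "all_missing": [np.nan, np.nan, np.nan],
})

# nunique() ignores NaN by default, so a constant column counts 1 and an
# all-NaN column counts 0; both satisfy `<= 1`.
n_distinct = toy.nunique()
flagged = n_distinct[n_distinct <= 1].index.tolist()

# Counting NaN as a value separates "constant" from "constant plus gaps".
n_distinct_with_nan = toy.nunique(dropna=False)
print(flagged)
print(n_distinct_with_nan.to_dict())
```

Columns flagged this way carry no variation and are natural candidates to set aside before modelling, although whether to drop them is a judgment call for the analysis.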


In [91]:
udf_describe(df[col_irregularities]).T

| | count | unique | top | freq | mean | std | min | 25% | 50% | 75% | max | dtype | size | % count |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 158PIC678_823 | 12076.0 | | | | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | float64 | 47125 | 74.37454 |
| 118ZLH303 | 10897.0 | 2.0 | False | 10539.0 | | | | | | | | object | 47125 | 76.87639 |
| 118LS960 | 10897.0 | 1.0 | False | 10897.0 | | | | | | | | object | 47125 | 76.87639 |
| 118LS690 | 10897.0 | 2.0 | False | 10842.0 | | | | | | | | object | 47125 | 76.87639 |
| 118ZLL417 | 10897.0 | 2.0 | False | 9002.0 | | | | | | | | object | 47125 | 76.87639 |
| 118ZLL427 | 10897.0 | 2.0 | False | 8423.0 | | | | | | | | object | 47125 | 76.87639 |
| 118ZLL437 | 10897.0 | 2.0 | False | 8793.0 | | | | | | | | object | 47125 | 76.87639 |
| 118ZLL447 | 10897.0 | 2.0 | False | 9350.0 | | | | | | | | object | 47125 | 76.87639 |
| 118FI913 | 10897.0 | | | | 32000.0 | 0.0 | 32000.0 | 32000.0 | 32000.0 | 32000.0 | 32000.0 | float64 | 47125 | 76.87639 |


In [49]:
df.nunique().values


array([12076, 11958, 11167,  3324, 10961, 11891, 11678, 11836, 11778,
        3324,  9754, 10729, 11318,     1, 11891, 11980, 10629, 11904,
       11914, 10140,  9355, 12027, 12046, 12055, 12068, 12033, 11721,
       11988, 11980, 11413,  9324, 10669, 10267, 12067, 12067, 12054,
       12061, 11777, 11892, 11923, 11489,  9013, 10352, 10060, 12014,
       12059, 12052, 12049, 11958, 11995, 12016, 11680, 10124, 10488,
       10795, 12026, 12064, 12067, 12058, 11971, 11995, 12006, 11713,
       10518, 10749, 10111, 12052, 11978, 12062, 12043, 11990, 11987,
       12012, 11810,  9849, 10595,  9411, 12016, 12009, 12058, 12054,
       11997, 12006, 12028, 11854,  9994, 10582,  9222, 11956, 12006,
       12057, 12062, 12001, 12006, 12031, 11900,  9519, 10653, 10516,
       12039, 11974, 12055, 12034, 12021, 12004, 12036, 11871, 10035,
       10586, 12070, 12070, 12071, 12068, 12073, 12070, 12069, 12071,
       12071, 12072, 12075, 12067, 12074, 12072, 12073, 12073, 12054,
       11993, 12060,

In [92]:
udf_describe(df).T


| | count | unique | top | freq | first | last | mean | std | min | 25% | 50% | 75% | max | dtype | size | % count |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| PO | 47125.00000 | | | | NaT | NaT | 1000001773.02088 | 2.68614 | 1000001769.00000 | 1000001771.00000 | 1000001773.00000 | 1000001775.00000 | 1000001777.00000 | Int64 | 47125 | 0.00000 |
| Date Start | 47125 | 9 | 2021-09-16 02:45:00 | 6856 | 2021-09-16 02:45:00 | 2021-09-27 11:10:00 | | | | | | | | datetime64[ns] | 47125 | 0.00000 |
| Date End | 47125 | 9 | 2021-09-17 07:18:00 | 6856 | 2021-09-17 07:18:00 | 2021-09-28 10:50:00 | | | | | | | | datetime64[ns] | 47125 | 0.00000 |
| timeseries | 47125 | 12076 | 2021-09-25 07:34:00 | 5 | 2021-09-16 02:45:00 | 2021-09-28 10:50:00 | | | | | | | | datetime64[ns] | 47125 | 0.00000 |
| 101LI636 | 12076.00000 | | | | NaT | NaT | 63623.76519 | 30315.49857 | 2028.07519 | 39673.62793 | 77740.80859 | 88224.35157 | 98819.89844 | float64 | 47125 | 74.37454 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 108FI665 | 12076.00000 | | | | NaT | NaT | 1.22926 | 1.30249 | 0.06364 | 0.07837 | 0.08541 | 2.70275 | 4.22653 | float64 | 47125 | 74.37454 |
| 108FI669 | 12076.00000 | | | | NaT | NaT | 1.06588 | 1.16072 | 0.05680 | 0.07862 | 0.08939 | 2.42755 | 8.58963 | float64 | 47125 | 74.37454 |
| 108FI673 | 12076.00000 | | | | NaT | NaT | 0.74086 | 0.78806 | 0.08568 | 0.10192 | 0.10955 | 1.54553 | 6.81956 | float64 | 47125 | 74.37454 |
| 108FI677 | 12076.00000 | | | | NaT | NaT | 0.51567 | 0.57468 | 0.08457 | 0.10123 | 0.10962 | 0.91064 | 7.57622 | float64 | 47125 | 74.37454 |


In [56]:
len(col_names)

9

In [57]:
type(col_names)


list

In [58]:
df.to_csv('../data/intermediate/check_df.csv', index = False)

In [110]:
df.columns

Index(['PO', 'Date Start', 'Date End', 'processing_time_mins', 'timeseries',
       '101LI636', '101WI610', '306LI606', '101AI635', '101AI605',
       ...
       '108PI659', '108PI662', '108PI663', '108FI653', '108FI657', '108FI665',
       '108FI669', '108FI673', '108FI677', '108FI681'],
      dtype='object', length=269)

In [115]:
df.groupby(['PO', 'Date Start', 'Date End', 'processing_time_mins']).describe()

| PO | Date Start | Date End | processing_time_mins | 101LI636 count | 101LI636 mean | 101LI636 std | 101LI636 min | 101LI636 25% | 101LI636 50% | 101LI636 75% | 101LI636 max | 101WI610 count | 101WI610 mean | ... | 108FI677 75% | 108FI677 max | 108FI681 count | 108FI681 mean | 108FI681 std | 108FI681 min | 108FI681 25% | 108FI681 50% | 108FI681 75% | 108FI681 max |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1000001769 | 2021-09-16 02:45:00 | 2021-09-17 07:18:00 | 1713.0 | 1714.0 | 72753.87132 | 23914.62352 | 19266.83398 | 60901.4834 | 84123.5 | 89925.74805 | 98748.89063 | 1714.0 | 35196.8805 | ... | 0.76613 | 6.01674 | 1714.0 | 0.09813 | 0.00717 | 0.07455 | 0.09334 | 0.09798 | 0.10314 | 0.12221 |
| 1000001770 | 2021-09-17 17:40:00 | 2021-09-18 15:39:00 | 1319.0 | 1320.0 | 62132.28102 | 32836.66217 | 2028.07519 | 33402.11231 | 78852.78125 | 89834.86133 | 98319.78125 | 1320.0 | 22144.54903 | ... | 0.10479 | 0.12041 | 1320.0 | 0.09271 | 0.00779 | 0.07179 | 0.08739 | 0.0927 | 0.0979 | 0.12031 |
| 1000001771 | 2021-09-19 16:55:00 | 2021-09-20 15:57:00 | 1382.0 | 1383.0 | 56838.34567 | 33505.4954 | 2291.93018 | 22423.72754 | 69473.05469 | 86767.07812 | 98460.16406 | 1383.0 | 19052.40259 | ... | 1.11679 | 7.57622 | 1383.0 | 0.09192 | 0.00716 | 0.07077 | 0.0867 | 0.09185 | 0.0968 | 0.11438 |
| 1000001772 | 2021-09-20 20:20:00 | 2021-09-21 15:48:00 | 1168.0 | 1169.0 | 55719.99644 | 30427.9184 | 2126.03638 | 28914.41406 | 65307.61328 | 82324.10156 | 97140.01563 | 1169.0 | 18093.04809 | ... | 1.01855 | 2.89714 | 1169.0 | 0.09355 | 0.00698 | 0.07069 | 0.08886 | 0.09352 | 0.09843 | 0.11439 |
| 1000001773 | 2021-09-21 19:05:00 | 2021-09-22 10:09:00 | 904.0 | 905.0 | 46111.35901 | 33036.81783 | 2291.92334 | 11936.36621 | 46158.85938 | 80825.52344 | 96132.98438 | 905.0 | 3643.53477 | ... | 0.60542 | 3.2007 | 905.0 | 0.09038 | 0.00782 | 0.06507 | 0.085 | 0.09033 | 0.09548 | 0.11593 |
| 1000001774 | 2021-09-23 17:19:00 | 2021-09-24 16:12:00 | 1373.0 | 1374.0 | 62676.02285 | 30369.41052 | 2093.10425 | 38123.32422 | 77461.83203 | 86857.44922 | 98200.39063 | 1374.0 | 38390.33526 | ... | 0.10418 | 0.11517 | 1374.0 | 0.09215 | 0.00726 | 0.06965 | 0.08718 | 0.0921 | 0.09718 | 0.11405 |
| 1000001775 | 2021-09-24 20:26:00 | 2021-09-25 22:27:00 | 1561.0 | 1562.0 | 69324.75908 | 26636.70604 | 2235.26001 | 58080.5332 | 79560.58203 | 89294.19922 | 98819.89844 | 1562.0 | 69695.99495 | ... | 0.84997 | 5.40975 | 1562.0 | 0.09497 | 0.00722 | 0.0719 | 0.08998 | 0.09502 | 0.10009 | 0.1154 |
| 1000001776 | 2021-09-26 02:32:00 | 2021-09-26 22:59:00 | 1227.0 | 1228.0 | 74216.53564 | 20092.54706 | 30793.47852 | 61550.89844 | 81315.42968 | 89408.3457 | 98335.29688 | 1228.0 | 69095.61076 | ... | 0.10277 | 0.11479 | 1228.0 | 0.09107 | 0.00713 | 0.06967 | 0.08628 | 0.09109 | 0.09588 | 0.11422 |
| 1000001777 | 2021-09-27 11:10:00 | 2021-09-28 10:50:00 | 1420.0 | 1421.0 | 63751.52663 | 32029.63864 | 2347.51123 | 36871.83594 | 79934.1875 | 89671.89844 | 98199.19531 | 1421.0 | 44673.89637 | ... | 1.17008 | 3.27717 | 1421.0 | 0.08923 | 0.00763 | 0.06573 | 0.08402 | 0.0894 | 0.09442 | 0.11614 |
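In the grouped summary above, each batch's row count is exactly one more than its `processing_time_mins` (for example 1714 rows for 1713.0 minutes), which is what one-minute sampling with both endpoints included would produce. The sketch below turns that observation into a completeness check on made-up batches; the helper `make_batch` and the `rows`, `minutes` and `complete` columns are names chosen for this example.

```python
import pandas as pd

def make_batch(po, start, end):
    """Toy batch sampled once per minute from start to end, inclusive."""
    ts = pd.date_range(start, end, freq="min")
    return pd.DataFrame({
        "PO": po,
        "Date Start": pd.Timestamp(start),
        "Date End": pd.Timestamp(end),
        "timeseries": ts,
    })

toy = pd.concat([
    make_batch(1, "2021-09-16 02:45", "2021-09-16 02:50"),
    make_batch(2, "2021-09-17 17:40", "2021-09-17 17:42"),
], ignore_index=True)
toy["processing_time_mins"] = (toy["Date End"] - toy["Date Start"]) / pd.Timedelta(minutes=1)

per_batch = toy.groupby("PO").agg(
    rows=("timeseries", "size"),
    minutes=("processing_time_mins", "first"),
)
# Both endpoints are sampled, so a complete batch has minutes + 1 rows.
per_batch["complete"] = per_batch["rows"] == per_batch["minutes"] + 1
print(per_batch)
```

A batch where `complete` is False would point to missing or duplicated minutes and is worth inspecting before any aggregation.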
