## Introduction: Physician Conversion Model

**An HCP conversion model predicts the likelihood that a healthcare provider (HCP) becomes a writer (prescriber) for a particular company or brand for the first time. It helps pharmaceutical or medical device companies identify HCPs who have not yet written prescriptions for their products but may be receptive to doing so in the future. The model focuses on converting non-writing HCPs into active writers.**

Here's a detailed view of an HCP conversion model:

1. Target Variable: The target variable in an HCP conversion model is binary, indicating whether an HCP becomes a writer (1) or not (0) for a specific company or brand. This variable serves as the outcome to be predicted.

2. Predictive Features: Various data sources and types can be used to develop an HCP conversion model. Some common predictive features include:

	  - Demographic Data: Information about the HCP's age, gender, specialty, location, education level, and other relevant characteristics. Demographic data helps identify patterns and preferences specific to certain groups of HCPs.
	
	  - Claims Data: Historical claims data can provide insights into the HCP's prescribing behavior, utilization patterns, and therapeutic areas of interest. It helps identify HCPs with similar patient profiles or conditions that align with the company's products.
	
	  - Promotional Data: Data on past promotional activities directed at the HCP, such as detailing visits, samples provided, speaker programs attended, and engagement with educational materials. This helps assess the level of previous interaction and the impact of promotional efforts.
	
	  - Network Analysis: Examination of the HCP's professional network, affiliations, collaborations, and referral patterns. Understanding the relationships between HCPs can help identify opinion leaders and assess the impact of peer influence.

**Note: Other types of features can also be used, depending on data availability and granularity.**
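As a rough illustration of the binary target described above, the sketch below derives a 0/1 `TARGET` from a prescription log. The `rx_log` table, the `brand` column, and the brand names are hypothetical and are not part of the dataset used in this notebook.

```python
import pandas as pd

# Hypothetical prescription log: one row per (HCP, brand) prescription event
rx_log = pd.DataFrame({
    "HCP_ID": ["HCP_1", "HCP_2", "HCP_2", "HCP_4"],
    "brand": ["OtherBrand", "OurBrand", "OtherBrand", "OurBrand"],
})

# Hypothetical universe of HCPs who were non-writers at the start of the window
hcp_universe = pd.DataFrame({"HCP_ID": ["HCP_1", "HCP_2", "HCP_3", "HCP_4"]})

# HCPs with at least one prescription for the brand of interest
writers = set(rx_log.loc[rx_log["brand"] == "OurBrand", "HCP_ID"])

# TARGET = 1 if the HCP became a writer in the window, else 0
hcp_universe["TARGET"] = hcp_universe["HCP_ID"].isin(writers).astype(int)
print(hcp_universe)
```

In practice the observation window and the definition of "first-time writer" would need to be agreed with the business before building the label.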

### Loading Required Libraries and Dependencies

In [2]:
import pandas as pd
import numpy as np
import os

import warnings
warnings.filterwarnings("ignore")

# Visual libraries
import matplotlib.pyplot as plt

# Importing necessary libraries for encoding
from sklearn.preprocessing import OrdinalEncoder, OneHotEncoder

### Import the dataset and clean column names

In [6]:
import os
import pandas as pd

# Corrected file path
file_path = "/Users/sumanmukherjee_admin/Desktop/Arpa/DS/physician_conversion_mlops-main/data/input/Input_data.csv"

# Check if the file exists
if os.path.exists(file_path):
    # Import the dataset
    df_input = pd.read_csv(file_path)

    # Clean column names
    df_input.columns = df_input.columns.str.strip()
    df_input.columns = df_input.columns.str.replace(' ', '_')

    # Print shape and data format
    print('Shape and data format:')
    print('')
    print(df_input.shape)
else:
    print("File not found at the specified path:", file_path)


Shape and data format:

(5000, 59)
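The column-name cleaning step can be wrapped in a small helper so that it is easy to reuse and test. This is only a sketch on a made-up frame; `clean_column_names` and the messy column names are illustrative assumptions, not part of the original pipeline.

```python
import pandas as pd

# Toy frame with the kind of untidy headers the cleaning step targets
raw = pd.DataFrame({" NPI ID": [1], "Year of Experience ": [10], "TARGET": [0]})


def clean_column_names(df):
    """Return a copy whose column names are stripped and use underscores."""
    out = df.copy()
    # Strip leading/trailing whitespace, then replace inner spaces literally
    out.columns = out.columns.str.strip().str.replace(" ", "_", regex=False)
    return out


cleaned = clean_column_names(raw)
print(list(cleaned.columns))
```

Returning a copy rather than mutating in place keeps the raw frame available for comparison.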


In [7]:
df_input.head()

Unnamed: 0,NPI_ID,HCP_ID,TARGET,Age,Sex,Specialty,Year_of_Experience,HCO_Affiliation,HCO_Affiliation_Type,Number_of_Rx,...,F2F_visit,F2F_visit_last_1_month,F2F_visit_last_3_month,F2F_visit_last_6_month,F2F_visit_last_12_month,VRC_visit,VRC_visit_last_1_month,VRC_visit_last_3_month,VRC_visit_last_6_month,VRC_visit_last_12_month
0,9846255,HCP_1,0,64,M,Oncology,10,XYZ Medical Center,Referral,290,...,1,2,3,5,9,2,3,5,9,16
1,5217093,HCP_2,1,64,F,Neuro-oncology,47,ABC Medical Center,Collaboration,645,...,2,4,6,8,12,1,2,3,6,11
2,2659257,HCP_3,1,56,M,Hematology,41,STU Medical Center,Employment,879,...,1,2,3,4,6,2,3,6,8,11
3,9851763,HCP_4,1,74,F,Pediatric,35,XYZ Medical Center,Collaboration,392,...,1,2,4,8,15,1,2,4,6,7
4,6650853,HCP_5,0,46,M,Immunology,60,DEF Medical Center,Contract,1043,...,1,2,3,5,8,2,4,6,9,16


### General EDA and Data checks

In [8]:
#General EDA
pd.options.display.max_columns = 50
print()
df_input.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5000 entries, 0 to 4999
Data columns (total 59 columns):
 #   Column                              Non-Null Count  Dtype 
---  ------                              --------------  ----- 
 0   NPI_ID                              5000 non-null   int64 
 1   HCP_ID                              5000 non-null   object
 2   TARGET                              5000 non-null   int64 
 3   Age                                 5000 non-null   int64 
 4   Sex                                 5000 non-null   object
 5   Specialty                           5000 non-null   object
 6   Year_of_Experience                  5000 non-null   int64 
 7   HCO_Affiliation                     5000 non-null   object
 8   HCO_Affiliation_Type                5000 non-null   object
 9   Number_of_Rx                        5000 non-null   int64 
 10  Rx_last_1_Month                     5000 non-null   int64 
 11  Rx_last_3_Month                     5000 non-null   int

In [9]:
df_input.describe()

Unnamed: 0,NPI_ID,TARGET,Age,Year_of_Experience,Number_of_Rx,Rx_last_1_Month,Rx_last_3_Month,Rx_last_6_Month,Rx_last_12_Month,Number_of_Px,Px_last_1_Month,Px_last_3_Month,Px_last_6_Month,Px_last_12_Month,Claims_last_1_Month,Claims_last_3_Month,Claims_last_6_Month,Claims_last_12_Month,Procedures_chemo_last_1_month,Procedures_chemo_last_3_month,Procedures_chemo_last_6_month,Procedures_chemo_last_12_month,Procedures_radio_last_1_month,Procedures_radio_last_3_month,Procedures_radio_last_6_month,...,Procedures_Immuno_last_12_month,Procedures_Biopsy_last_1_month,Procedures_Biopsy_last_3_month,Procedures_Biopsy_last_6_month,Procedures_Biopsy_last_12_month,Promotional_doximity,Promotional_doximity_last_1_month,Promotional_doximity_last_3_month,Promotional_doximity_last_6_month,Promotional_doximity_last_12_month,Promotional_medscape,Promotional_medscape_last_1_month,Promotional_medscape_last_3_month,Promotional_medscape_last_6_month,Promotional_medscape_last_12_month,F2F_visit,F2F_visit_last_1_month,F2F_visit_last_3_month,F2F_visit_last_6_month,F2F_visit_last_12_month,VRC_visit,VRC_visit_last_1_month,VRC_visit_last_3_month,VRC_visit_last_6_month,VRC_visit_last_12_month
count,5000.0,5000.0,5000.0,5000.0,5000.0,5000.0,5000.0,5000.0,5000.0,5000.0,5000.0,5000.0,5000.0,5000.0,5000.0,5000.0,5000.0,5000.0,5000.0,5000.0,5000.0,5000.0,5000.0,5000.0,5000.0,...,5000.0,5000.0,5000.0,5000.0,5000.0,5000.0,5000.0,5000.0,5000.0,5000.0,5000.0,5000.0,5000.0,5000.0,5000.0,5000.0,5000.0,5000.0,5000.0,5000.0,5000.0,5000.0,5000.0,5000.0,5000.0
mean,5544598.0,0.2502,60.1982,36.4,794.5868,1186.8896,1773.6876,2668.6958,4001.1422,183.946,276.2154,414.7376,620.8114,929.1634,178.3296,268.6344,404.6736,608.417,49.0962,74.5234,112.546,168.1312,57.6208,86.9952,130.3638,...,197.6624,57.337,86.4912,130.5038,196.4128,7.5106,11.7814,18.2198,27.7362,42.0218,7.4578,11.6774,18.0564,27.51,41.669,1.502,2.749,4.6276,7.4366,11.7242,1.5,2.7484,4.6258,7.4488,11.6454
std,2589635.0,0.433171,18.770258,18.600312,327.511269,546.032458,899.699968,1464.578381,2357.893309,66.730701,114.531764,192.461137,316.462382,517.74319,98.029741,159.358798,256.020525,406.822833,15.36382,27.539425,47.387089,78.760197,13.913408,27.094509,47.897752,...,82.541647,13.84632,26.46427,48.327858,82.397703,1.715948,3.323947,6.164665,10.78209,18.106466,1.71266,3.363211,6.118718,10.584078,18.069044,0.500046,0.825915,1.459638,2.608091,4.587281,0.50005,0.827789,1.463763,2.621899,4.520105
min,1001849.0,0.0,28.0,10.0,100.0,102.0,106.0,112.0,126.0,70.0,73.0,79.0,91.0,116.0,8.0,9.0,12.0,15.0,23.0,24.0,26.0,32.0,34.0,35.0,38.0,...,48.0,34.0,35.0,36.0,46.0,5.0,6.0,7.0,8.0,9.0,5.0,6.0,7.0,8.0,9.0,1.0,2.0,3.0,4.0,5.0,1.0,2.0,3.0,4.0,5.0
25%,3326632.0,0.0,44.0,17.0,525.0,760.0,1078.0,1553.0,2227.75,126.0,182.0,259.0,384.0,545.0,95.0,136.0,197.0,287.75,35.0,52.0,75.0,109.0,45.0,66.0,95.0,...,136.0,45.0,66.0,95.0,135.0,6.0,9.0,14.0,20.0,29.0,6.0,9.0,13.0,20.0,28.0,1.0,2.0,4.0,5.0,8.0,1.0,2.0,4.0,5.0,8.0
50%,5601028.0,0.0,60.0,40.0,924.0,1224.0,1756.0,2549.5,3701.0,184.0,266.0,383.0,557.0,808.5,177.0,258.0,371.5,544.0,49.0,72.0,106.0,154.0,58.0,84.0,123.0,...,183.0,57.0,83.0,122.0,182.0,8.0,11.0,17.0,26.0,38.0,7.0,11.0,17.0,26.0,38.0,2.0,3.0,4.0,7.0,11.0,1.5,2.5,4.0,7.0,11.0
75%,7752466.0,1.0,76.0,54.0,1100.0,1608.0,2379.0,3596.25,5359.25,241.0,357.0,537.0,804.0,1208.0,262.0,382.0,573.0,860.0,62.0,94.0,142.0,212.0,70.0,105.0,159.0,...,246.0,69.0,104.0,160.0,242.25,9.0,14.0,22.0,34.0,52.0,9.0,14.0,22.0,33.0,52.0,2.0,3.0,6.0,9.0,14.0,2.0,3.0,6.0,9.0,14.0
max,9999327.0,1.0,92.0,60.0,1100.0,2198.0,4281.0,8083.0,14285.0,300.0,595.0,1116.0,2100.0,3552.0,350.0,695.0,1340.0,2483.0,75.0,150.0,286.0,533.0,81.0,162.0,312.0,...,585.0,81.0,162.0,314.0,589.0,10.0,20.0,40.0,74.0,137.0,10.0,20.0,40.0,76.0,130.0,2.0,4.0,8.0,16.0,32.0,2.0,4.0,8.0,16.0,32.0
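The `mean` of `TARGET` in the summary above is 0.2502, so roughly one HCP in four is a converter. A quick class-balance check makes this explicit before modelling. The sketch below uses a toy series with the same 1-in-4 rate rather than the real data.

```python
import pandas as pd

# Toy target with a 25% positive rate (illustrative values only)
target = pd.Series([0, 0, 0, 1, 0, 1, 0, 0], name="TARGET")

# Share of each class, ordered by label
balance = target.value_counts(normalize=True).sort_index()

# For a 0/1 target the mean equals the positive rate
positive_rate = target.mean()

print(balance)
print(positive_rate)
```

With an imbalance of this size, accuracy alone can be misleading, so metrics such as precision, recall, or ROC AUC are usually worth reporting as well.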


In [10]:
#Check for categorical columns
df_input.select_dtypes('object').nunique()

HCP_ID                  5000
Sex                        2
Specialty                  6
HCO_Affiliation           10
HCO_Affiliation_Type       4
dtype: int64

In [11]:
# Drop "HCO_Affiliation"; "HCO_Affiliation_Type" is the more useful column for this model
df_input.drop(['HCO_Affiliation'], axis= 1, inplace= True)
print()
print(df_input.select_dtypes('object').nunique())


HCP_ID                  5000
Sex                        2
Specialty                  6
HCO_Affiliation_Type       4
dtype: int64


### Dummy Encode Required Variables

In [13]:
onehot_cols = ['Sex', 'Specialty', 'HCO_Affiliation_Type']

In [14]:
df_input = pd.get_dummies(df_input, columns=onehot_cols, drop_first=True)
print(df_input.shape)
print('')
df_input.head()

(5000, 64)



Unnamed: 0,NPI_ID,HCP_ID,TARGET,Age,Year_of_Experience,Number_of_Rx,Rx_last_1_Month,Rx_last_3_Month,Rx_last_6_Month,Rx_last_12_Month,Number_of_Px,Px_last_1_Month,Px_last_3_Month,Px_last_6_Month,Px_last_12_Month,Claims_last_1_Month,Claims_last_3_Month,Claims_last_6_Month,Claims_last_12_Month,Procedures_chemo_last_1_month,Procedures_chemo_last_3_month,Procedures_chemo_last_6_month,Procedures_chemo_last_12_month,Procedures_radio_last_1_month,Procedures_radio_last_3_month,...,Promotional_doximity_last_12_month,Promotional_medscape,Promotional_medscape_last_1_month,Promotional_medscape_last_3_month,Promotional_medscape_last_6_month,Promotional_medscape_last_12_month,F2F_visit,F2F_visit_last_1_month,F2F_visit_last_3_month,F2F_visit_last_6_month,F2F_visit_last_12_month,VRC_visit,VRC_visit_last_1_month,VRC_visit_last_3_month,VRC_visit_last_6_month,VRC_visit_last_12_month,Sex_ M,Specialty_Immunology,Specialty_Neuro-oncology,Specialty_Oncology,Specialty_Pediatric,Specialty_Uro-oncology,HCO_Affiliation_Type_Contract,HCO_Affiliation_Type_Employment,HCO_Affiliation_Type_Referral
0,9846255,HCP_1,0,64,10,290,400,492,770,1373,234,464,738,1060,1949,241,395,699,1047,48,82,112,154,43,57,...,47,8,12,18,30,55,1,2,3,5,9,2,3,5,9,16,1,0,0,1,0,0,0,0,1
1,5217093,HCP_2,1,64,47,645,923,1628,2210,2449,80,139,264,318,534,197,266,420,667,66,75,98,124,59,99,...,23,6,11,14,21,40,2,4,6,8,12,1,2,3,6,11,0,0,1,0,0,0,0,0,0
2,2659257,HCP_3,1,56,41,879,1062,1918,3505,6169,197,274,386,642,826,319,532,655,849,37,55,96,129,78,141,...,37,10,17,20,29,48,1,2,3,4,6,2,3,6,8,11,1,0,0,0,0,0,0,1,0
3,9851763,HCP_4,1,74,35,392,704,1036,1078,1184,217,260,319,401,671,232,419,464,849,69,114,139,196,54,69,...,54,6,12,16,21,27,1,2,4,8,15,1,2,4,6,7,0,0,0,0,1,0,0,0,0
4,6650853,HCP_5,0,46,60,1043,1228,1305,1339,2483,209,269,456,591,642,232,261,519,930,75,83,110,163,55,90,...,26,8,12,15,30,52,1,2,3,5,8,2,4,6,9,16,1,1,0,0,0,0,1,0,0
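The dummy column `Sex_ M` in the output above suggests that the raw category values carry stray whitespace. A minimal sketch, assuming that is the cause, strips the string values before encoding so that the generated column names are clean. The toy frame is made up for illustration.

```python
import pandas as pd

# Toy frame; the padded " M " / " F " values are an assumption about the raw data
toy = pd.DataFrame({
    "Sex": [" M ", " F ", " M "],
    "Specialty": ["Oncology", "Hematology", "Pediatric"],
})

# Strip stray whitespace from the category values before encoding
for col in ["Sex", "Specialty"]:
    toy[col] = toy[col].str.strip()

# drop_first=True drops the first (alphabetical) level of each column
# so that the dummies are not perfectly collinear
encoded = pd.get_dummies(
    toy, columns=["Sex", "Specialty"], drop_first=True, dtype=int
)
print(encoded.columns.tolist())
```

Here `F` and `Hematology` become the reference levels, which mirrors the encoded output above, where there is no separate column for female or for Hematology.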


**Note: Save all the features in the feature store.**

### Feature Selection
- Variance Threshold check
- Select k best for top n features 

In [15]:
# Collect all columns except the ID and target columns (via Index.difference) as candidates for feature selection
id_target_col_list = ['NPI_ID', 'HCP_ID', 'TARGET']
col_for_feature_selection = df_input.columns.difference(id_target_col_list)

In [16]:
from sklearn.feature_selection import VarianceThreshold

# Flag constant and quasi-constant columns: any column with variance
# below the 0.1 threshold is marked for removal
var_thr = VarianceThreshold(threshold = 0.1)
var_thr.fit(df_input[col_for_feature_selection])

var_thr.get_support() # Boolean mask: True for columns that pass the threshold

array([ True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True])

In [17]:
df_input_subset = df_input[col_for_feature_selection]
df_input_subset.columns

Index(['Age', 'Claims_last_12_Month', 'Claims_last_1_Month',
       'Claims_last_3_Month', 'Claims_last_6_Month', 'F2F_visit',
       'F2F_visit_last_12_month', 'F2F_visit_last_1_month',
       'F2F_visit_last_3_month', 'F2F_visit_last_6_month',
       'HCO_Affiliation_Type_Contract', 'HCO_Affiliation_Type_Employment',
       'HCO_Affiliation_Type_Referral', 'Number_of_Px', 'Number_of_Rx',
       'Procedures_Biopsy_last_12_month', 'Procedures_Biopsy_last_1_month',
       'Procedures_Biopsy_last_3_month', 'Procedures_Biopsy_last_6_month',
       'Procedures_Immuno_last_12_month', 'Procedures_Immuno_last_1_month',
       'Procedures_Immuno_last_3_month', 'Procedures_Immuno_last_6_month',
       'Procedures_chemo_last_12_month', 'Procedures_chemo_last_1_month',
       'Procedures_chemo_last_3_month', 'Procedures_chemo_last_6_month',
       'Procedures_radio_last_12_month', 'Procedures_radio_last_1_month',
       'Procedures_radio_last_3_month', 'Procedures_radio_last_6_month',
       'Pro

In [18]:
remove_col_list = [col for col in df_input_subset.columns 
          if col not in df_input_subset.columns[var_thr.get_support()]]

remove_col_list

[]

**Note: No column falls below the 0.1 variance threshold, so all columns are kept.**
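To see what the variance check would catch, the sketch below runs `VarianceThreshold` on a small made-up frame. Variance depends on scale, so count columns in the tens or hundreds pass easily; the threshold mostly bites on constant columns and on rare binary flags, whose variance is `p * (1 - p)`. All column names here are hypothetical.

```python
import pandas as pd
from sklearn.feature_selection import VarianceThreshold

toy = pd.DataFrame({
    "constant": [1, 1, 1, 1, 1, 1, 1, 1, 1, 1],           # variance 0
    "rare_flag": [0, 0, 0, 0, 0, 0, 0, 0, 0, 1],          # variance 0.1 * 0.9 = 0.09
    "balanced_flag": [0, 1, 0, 1, 0, 1, 0, 1, 0, 1],      # variance 0.5 * 0.5 = 0.25
    "count": [3, 8, 1, 9, 4, 7, 2, 6, 5, 10],             # large variance from scale
})

selector = VarianceThreshold(threshold=0.1)
selector.fit(toy)

# get_support() is True for columns whose variance exceeds the threshold
mask = selector.get_support()
kept = toy.columns[mask].tolist()
removed = toy.columns[~mask].tolist()

print("kept:", kept)
print("removed:", removed)
```

If the features were scaled to a common range first, the same threshold would behave quite differently, so the threshold value should be chosen with the feature scales in mind.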

### Feature Selection Using Select K Best

In [19]:
#Select top n Features
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import f_classif

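def select_kbest_features(df, target_col, n):
  """
  Selects the top n features from the DataFrame using the SelectKBest algorithm.

  Args:
    df: The DataFrame to select features from.
    target_col: The target values (one per row of df) used to score each feature.
    n: The number of features to select.

  Returns:
    A pandas Index holding the names of the top n features.
  """

  # f_classif (ANOVA F-test) is SelectKBest's default score function; pass it explicitly
  selector = SelectKBest(score_func=f_classif, k=n)
  selector.fit(df, target_col)

  # Boolean mask of the selected columns
  mask = selector.get_support()
  top_n_features = df.columns[mask]

  return top_n_features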

In [20]:
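id_col_list = ['NPI_ID', 'HCP_ID']
target_col = df_input['TARGET']
# Note: only the ID columns are dropped below, so TARGET is still among the
# candidate columns and is itself returned as one of the 30 selected columns
top_n_col_list = select_kbest_features(df_input.drop(id_col_list,axis=1),target_col, 30)
print(len(top_n_col_list))
top_n_col_list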

30


Index(['TARGET', 'Age', 'Year_of_Experience', 'Number_of_Rx',
       'Rx_last_1_Month', 'Rx_last_3_Month', 'Rx_last_6_Month',
       'Rx_last_12_Month', 'Claims_last_1_Month', 'Claims_last_3_Month',
       'Claims_last_6_Month', 'Claims_last_12_Month',
       'Procedures_chemo_last_1_month', 'Procedures_chemo_last_3_month',
       'Procedures_chemo_last_6_month', 'Procedures_chemo_last_12_month',
       'Procedures_radio_last_6_month', 'Procedures_radio_last_12_month',
       'Procedures_Biopsy_last_6_month', 'Promotional_medscape',
       'Promotional_medscape_last_1_month',
       'Promotional_medscape_last_3_month',
       'Promotional_medscape_last_6_month',
       'Promotional_medscape_last_12_month', 'Sex_ M ', 'Specialty_Oncology',
       'Specialty_Pediatric', 'Specialty_Uro-oncology',
       'HCO_Affiliation_Type_Contract', 'HCO_Affiliation_Type_Referral'],
      dtype='object')
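`TARGET` appears in the selected list above because it was among the columns passed to the selector, so only 29 of the 30 entries are genuine predictors. One way to avoid this, sketched here on synthetic data, is to drop the target from the candidate features before scoring. The helper `select_kbest_excluding_target` and the toy columns are hypothetical and are not part of the original notebook.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, f_classif

# Synthetic data: one informative feature and two pure-noise features
rng = np.random.default_rng(0)
n = 200
y = rng.integers(0, 2, size=n)
toy = pd.DataFrame({
    "HCP_ID": [f"HCP_{i}" for i in range(n)],
    "TARGET": y,
    "signal": y * 2.0 + rng.normal(size=n),   # shifted by the target
    "noise_a": rng.normal(size=n),
    "noise_b": rng.normal(size=n),
})


def select_kbest_excluding_target(df, target_name, id_cols, k):
    """Score only predictor columns; IDs and the target are excluded."""
    X = df.drop(columns=id_cols + [target_name])
    y = df[target_name]
    selector = SelectKBest(score_func=f_classif, k=k).fit(X, y)
    return X.columns[selector.get_support()].tolist()


top_features = select_kbest_excluding_target(toy, "TARGET", ["HCP_ID"], k=1)
print(top_features)
```

With this approach the target has to be added back explicitly when assembling the modelling frame, for example `id_cols + ["TARGET"] + top_features`.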

In [21]:
#Convert to list
top_n_col_list = top_n_col_list.tolist()


type(top_n_col_list)

list

In [22]:
cols_for_model_df_list = id_col_list + top_n_col_list
print(len(cols_for_model_df_list))
print('')

32



In [23]:
df_feature_eng_output = df_input[cols_for_model_df_list]
df_feature_eng_output.head()

Unnamed: 0,NPI_ID,HCP_ID,TARGET,Age,Year_of_Experience,Number_of_Rx,Rx_last_1_Month,Rx_last_3_Month,Rx_last_6_Month,Rx_last_12_Month,Claims_last_1_Month,Claims_last_3_Month,Claims_last_6_Month,Claims_last_12_Month,Procedures_chemo_last_1_month,Procedures_chemo_last_3_month,Procedures_chemo_last_6_month,Procedures_chemo_last_12_month,Procedures_radio_last_6_month,Procedures_radio_last_12_month,Procedures_Biopsy_last_6_month,Promotional_medscape,Promotional_medscape_last_1_month,Promotional_medscape_last_3_month,Promotional_medscape_last_6_month,Promotional_medscape_last_12_month,Sex_ M,Specialty_Oncology,Specialty_Pediatric,Specialty_Uro-oncology,HCO_Affiliation_Type_Contract,HCO_Affiliation_Type_Referral
0,9846255,HCP_1,0,64,10,290,400,492,770,1373,241,395,699,1047,48,82,112,154,79,110,182,8,12,18,30,55,1,1,0,0,0,1
1,5217093,HCP_2,1,64,47,645,923,1628,2210,2449,197,266,420,667,66,75,98,124,111,128,73,6,11,14,21,40,0,0,0,0,0,0
2,2659257,HCP_3,1,56,41,879,1062,1918,3505,6169,319,532,655,849,37,55,96,129,238,355,107,10,17,20,29,48,1,0,0,0,0,0
3,9851763,HCP_4,1,74,35,392,704,1036,1078,1184,232,419,464,849,69,114,139,196,91,177,133,6,12,16,21,27,0,0,1,0,0,0
4,6650853,HCP_5,0,46,60,1043,1228,1305,1339,2483,232,261,519,930,75,83,110,163,122,158,234,8,12,15,30,52,1,0,0,0,1,0


In [24]:
df_model_input = df_feature_eng_output.copy()

In [25]:
#Export the model input dataset
# Write to data/output/model_input.csv so the raw input file is not overwritten
path1 = "/Users/sumanmukherjee_admin/Desktop/Arpa/DS/physician_conversion_mlops-main"
file_path = os.path.join(path1, "data", "output", "model_input.csv")
df_model_input.to_csv(file_path, index=False)
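When exporting, it helps to write to a dedicated output location and to pass `index=False`, since otherwise pandas writes the row index as an extra unnamed column that comes back as `Unnamed: 0` on reload. The sketch below demonstrates the round trip in a temporary directory; the `data/output` layout and the toy frame are assumptions for illustration.

```python
import os
import tempfile

import pandas as pd

# Toy stand-in for the model input frame
df_out = pd.DataFrame({"NPI_ID": [9846255, 5217093], "TARGET": [0, 1]})

with tempfile.TemporaryDirectory() as project_root:
    # Hypothetical project layout: <root>/data/output/model_input.csv
    output_dir = os.path.join(project_root, "data", "output")
    os.makedirs(output_dir, exist_ok=True)
    output_path = os.path.join(output_dir, "model_input.csv")

    # index=False keeps the row index out of the file
    df_out.to_csv(output_path, index=False)

    # Reload to confirm the columns survive the round trip unchanged
    reloaded = pd.read_csv(output_path)

print(reloaded.columns.tolist())
```

Keeping input and output paths separate also makes the notebook safe to rerun from the top.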