# Columbia Data Science Society - Hackathon February 2023

### Importing the relevant libraries

In [1]:
import json
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from preprocessing import *  # local module preprocessing.py, provides preprocess_data
import sklearn
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from imblearn.over_sampling import RandomOverSampler
from sklearn.ensemble import RandomForestClassifier
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from xgboost import XGBClassifier
from sklearn.ensemble import ExtraTreesClassifier

### Importing the data
After downloading the dataset from https://open.fda.gov/apis/downloads/, we loaded the data into the notebook and applied preprocessing.

The preprocessing function is located in the Python file preprocessing.py.

In [4]:
with open('sample_data.json', 'r') as f:
  data_ex = json.load(f)

In [13]:
df = preprocess_data(data_ex)

a = pd.DataFrame(df.sum()).reset_index()


### Modeling 

In this part, we applied several models to the data. The accuracies below are measured on the test set:

- **logistic regression** : accuracy : 0.80
- **logistic regression w/ pca** : accuracy : 0.71
- **decision tree w/ pca** : accuracy : 0.75
- **random forest w/ pca** : accuracy : 0.81
- **xgboost w/ pca** : accuracy : 0.79
- **extra trees w/ pca** : accuracy : 0.82


As the data was imbalanced (67% - 33%), we applied random oversampling to the training set in order to make the models relevant.

The target used for this study is the variable 'Serious'. It indicates how serious the issue related to the use of a drug is. We believe this variable is extremely important, as it is exactly what we want to prevent.
Moreover, it is one of the very few variables that did not have any missing values.
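
To make the oversampling step concrete, here is a minimal sketch of what random oversampling does conceptually: minority-class rows are duplicated at random until every class is as large as the largest one. The helper `random_oversample` and the toy data are hypothetical and only illustrate the idea; this is not the implementation used by `RandomOverSampler`.

```python
import random
from collections import Counter


def random_oversample(rows, labels, seed=0):
    """Duplicate randomly chosen minority rows until all classes have equal size.

    Hypothetical helper for illustration only.
    """
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())  # size of the largest class
    out_rows, out_labels = list(rows), list(labels)
    for cls, n in counts.items():
        # All original rows belonging to this class
        pool = [r for r, y in zip(rows, labels) if y == cls]
        # Draw with replacement until the class reaches the target size
        for _ in range(target - n):
            out_rows.append(rng.choice(pool))
            out_labels.append(cls)
    return out_rows, out_labels


# Toy data with roughly the 67% - 33% split described above
rows = [[i] for i in range(9)]
labels = [1, 1, 1, 1, 1, 1, 0, 0, 0]

res_rows, res_labels = random_oversample(rows, labels)
balanced_counts = Counter(res_labels)
print(balanced_counts)
```

In the cells that follow, the resampling is applied after the train/test split and only to the training portion, so duplicated rows never end up in the test set.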

In [9]:
X_train, X_test, y_train, y_test = train_test_split(df.drop('Serious',axis = 1),df['Serious'],test_size = 0.2)

In [56]:
OvSamp = RandomOverSampler()

X_train,y_train = OvSamp.fit_resample(X_train,y_train)



In [80]:
lr = make_pipeline(StandardScaler(),LogisticRegression())
lr.fit(X_train,y_train)
print("Training: ",lr.score(X_train,y_train))
print("Test : ",lr.score(X_test,y_test))

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


Training:  0.9206433907607867




Test :  0.8004166666666667


In [79]:
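# lr is a pipeline, so the coefficients live on its final LogisticRegression step
importance = lr.named_steps['logisticregression'].coef_[0]
# Rank features by the absolute value of their coefficient
lr_feature_imp = pd.DataFrame([[X_train.columns[i],abs(v)] for i,v in enumerate(importance)]).sort_values(1,ascending=False)
lr_feature_imp[:20]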

| index | feature | abs. coefficient |
|---|---|---|
| 54 | CYCLOPHOSPHAMIDEDosage | 0.029014 |
| 1345 | HEPARIN SODIUMDosage | 0.027526 |
| 4 | DrugChar1 | 0.020289 |
| 623 | .ALPHA.-TOCOPHEROLDosage | 0.019954 |
| 2753 | ERYTHROPOIETINDosage | 0.01866 |
| 67 | ADALIMUMABDosage | 0.017963 |
| 59 | RITUXIMABDosage | 0.017845 |
| 1100 | ENZALUTAMIDEDosage | 0.017812 |
| 337 | DUPILUMABDosage | 0.016869 |
| 8 | DrugChar2 | 0.016852 |


As the data we're using is very sparse and high-dimensional (over 3700 features), training the models is slow and computationally costly.

Therefore, we decided to apply PCA in order to make it possible to train other models (tree-based models for example, which are very effective). We do know that by applying PCA we lose the interpretability of our model: we can no longer use feature importance to identify which element might cause the serious hazard. However, we can still predict with good accuracy which patients are the most likely to suffer from a serious medical hazard.
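
The trade-off described above can be seen in a small sketch. It uses synthetic random data as a stand-in for our feature matrix (the shapes and the names `X_demo` and `reducer` are assumptions for illustration, not our actual data). Each of the 15 components is a weighted mix of all original features, which is why per-feature importance is no longer directly readable.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the real feature matrix (assumption: 200 rows, 50 features)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 50))

# Same preprocessing as in the model pipelines: scale, then keep 15 components
reducer = make_pipeline(StandardScaler(), PCA(n_components=15))
X_reduced = reducer.fit_transform(X_demo)

pca = reducer.named_steps["pca"]
print(X_reduced.shape)        # one row per sample, one column per component
print(pca.components_.shape)  # each component has a weight for every original feature
print(pca.explained_variance_ratio_.sum())  # share of variance kept by the 15 components
```

Checking the summed `explained_variance_ratio_` on the real data would be one way to judge whether 15 components keep enough information.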



In [69]:
pca_lr = make_pipeline(StandardScaler(),PCA(n_components=15),LogisticRegression())
pca_lr.fit(X_train,y_train)
print('Train : ',pca_lr.score(X_train,y_train))
print('Test : ',pca_lr.score(X_test,y_test))



Train :  0.7400518371703003




Test :  0.7095833333333333


In [70]:
pca_dt = make_pipeline(StandardScaler(),PCA(n_components=15),DecisionTreeClassifier())
pca_dt.fit(X_train,y_train)
print('Train : ',pca_dt.score(X_train,y_train))
print('Test : ',pca_dt.score(X_test,y_test))



Train :  0.805000762311328




Test :  0.74625


In [72]:
pca_rf = make_pipeline(StandardScaler(),PCA(n_components=15),RandomForestClassifier())
pca_rf.fit(X_train,y_train)
print('Train : ',pca_rf.score(X_train,y_train))
print('Test : ',pca_rf.score(X_test,y_test))



Train :  0.9437414239975606




Test :  0.8129166666666666


In [74]:
pca_xb = make_pipeline(StandardScaler(),PCA(n_components=15),XGBClassifier())
pca_xb.fit(X_train,y_train)
print('Train : ',pca_xb.score(X_train,y_train))
print('Test : ',pca_xb.score(X_test,y_test))



Train :  0.8133861869187377




Test :  0.7929166666666667


In [76]:
pca_et = make_pipeline(StandardScaler(),PCA(n_components=15),ExtraTreesClassifier())
pca_et.fit(X_train,y_train)
print('Train : ',pca_et.score(X_train,y_train))
print('Test : ',pca_et.score(X_test,y_test))



Train :  0.9580728769629516




Test :  0.8233333333333334
