    "# Compound Classification Challenge\n",
    "This is a notebook for the challenge. As a simple demo, we will use a Random Forest with the Morgan fingerprint as our feature vector."


   "source": [
    "import numpy as np\n",
    "import pandas as pd\n",
    "import rdkit.Chem as Chem\n",
    "import rdkit.Chem.AllChem as AllChem\n",
    "from sklearn.ensemble import RandomForestClassifier\n",
    "import sklearn.metrics as metrics"
   ]


    "## Data\n",
    "Let's load the compound data file."


   "source": [
    "cmpd_df = pd.read_csv('cmpd.csv')\n",
    "cmpd_df.head()"
   ],


   "source": [
    "cmpd_df.shape"
   ]


    "There are 5530 compound samples with the following columns:\n",
    "* SMILES - 2D compound structure,\n",
    "* InChIKey - a hashed identifier derived from the InChI,\n",
    "* group - a tag that splits the dataset into train and test,\n",
    "* activity - the y label"


   "source": [
    "cmpd_df['mol'] = cmpd_df.smiles.apply(Chem.MolFromSmiles)"
   ]


   "source": [
    "# Build 2048-bit Morgan fingerprints (radius 4) with RDKit, plus binary activity labels\n",
    "\n",
    "def get_Xy(df):\n",
    "    X = np.vstack(df.mol.apply(lambda m: list(AllChem.GetMorganFingerprintAsBitVect(m, 4, nBits=2048))))\n",
    "    y = df.activity.eq('active').astype(float).to_numpy()\n",
    "    return X, y"
   ]


   "source": [
    "X_train, y_train = get_Xy(cmpd_df[cmpd_df.group.eq('train')])\n",
    "X_test, y_test = get_Xy(cmpd_df[cmpd_df.group.eq('test')])"
   ]


    "## Model: Random Forest\n",
    "RF is probably the simplest classifier to apply to numerical feature vectors without much tuning, which gives us a starting point for model exploration."



     "data": {
      "text/plain": "0.8583386992916935"
     }
   "source": [
    "clf = RandomForestClassifier()\n",
    "clf.fit(X_train, y_train)\n",
    "clf.score(X_test, y_test)"
   ]


   "source": [
    "y_pred = clf.predict_proba(X_test)[:, 1]"
   ]


     "data": {
      "text/plain": "0.4228530832473805"
     }
   "source": [
    "# log loss\n",
    "metrics.log_loss(y_test, y_pred, labels=[0, 1])"
   ]
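
For reference, the log loss computed in the cell above is the mean negative log-likelihood of the true labels under the predicted probabilities. The sketch below spells that out in plain Python. The name `binary_log_loss`, the clipping constant `eps`, and the toy labels and scores are illustrative assumptions. They are not taken from `cmpd.csv`, and this is not scikit-learn's implementation.

```python
import math


def binary_log_loss(y_true, p_pred, eps=1e-15):
    """Mean negative log-likelihood of 0/1 labels under predicted P(y=1)."""
    total = 0.0
    for y, p in zip(y_true, p_pred):
        # Clip so that a prediction of exactly 0 or 1 never produces log(0).
        p = min(max(p, eps), 1.0 - eps)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(y_true)


# Hypothetical toy data: the last prediction is confidently wrong.
y_toy = [1, 0, 1, 0]
p_toy = [0.9, 0.2, 0.6, 1.0]
loss = binary_log_loss(y_toy, p_toy)
print(loss)
```

A single confidently wrong prediction dominates the average. This matters for a Random Forest, whose averaged tree votes can be exactly 0 or 1.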


     "data": {
      "text/plain": "0.8763044222851167"
     }
   "source": [
    "# AUC PRC\n",
    "precision, recall, _ = metrics.precision_recall_curve(y_test, y_pred, pos_label=1)\n",
    "metrics.auc(recall, precision)"
   ]
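
To make the AUC PRC number easier to interpret, here is a rough pure-Python sketch of the same idea: sweep the decision threshold from the highest score downward, record precision and recall at each distinct score, and integrate with the trapezoidal rule. The function name `pr_auc_trapezoid` and the toy data are assumptions for illustration. The starting point (recall 0, precision 1) is a convention assumed here.

```python
def pr_auc_trapezoid(y_true, scores):
    """Trapezoidal area under the precision-recall curve for 0/1 labels."""
    n_pos = sum(1 for y in y_true if y == 1)
    if n_pos == 0:
        raise ValueError("need at least one positive sample")

    # Sort samples by score, highest first.
    ranked = sorted(zip(scores, y_true), key=lambda t: -t[0])

    recalls, precisions = [0.0], [1.0]  # assumed starting point of the curve
    tp = fp = 0
    i = 0
    while i < len(ranked):
        threshold = ranked[i][0]
        # Consume every sample tied at this score before recording a point.
        while i < len(ranked) and ranked[i][0] == threshold:
            if ranked[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        recalls.append(tp / n_pos)
        precisions.append(tp / (tp + fp))

    area = 0.0
    for k in range(1, len(recalls)):
        width = recalls[k] - recalls[k - 1]
        area += width * (precisions[k] + precisions[k - 1]) / 2.0
    return area


# Hypothetical toy data, not taken from cmpd.csv.
pr_toy = pr_auc_trapezoid([1, 1, 0, 0], [0.9, 0.4, 0.5, 0.1])
print(pr_toy)
```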


     "data": {
      "text/plain": "0.8846382449883827"
     }
   "source": [
    "# AUC ROC\n",
    "fpr_roc, tpr_roc, _ = metrics.roc_curve(y_test, y_pred, pos_label=1)\n",
    "metrics.auc(fpr_roc, tpr_roc)"
   ]
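
One way to read the AUC ROC value: it equals the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one, with ties counted as one half. The sketch below computes it directly from that definition. The name `roc_auc_pairwise` and the toy data are illustrative assumptions, and the quadratic loop is only meant for small inputs.

```python
def roc_auc_pairwise(y_true, scores):
    """AUC ROC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative sample")

    wins = 0.0
    for sp in pos:
        for sn in neg:
            if sp > sn:
                wins += 1.0   # positive ranked above negative
            elif sp == sn:
                wins += 0.5   # ties count as half
    return wins / (len(pos) * len(neg))


# Hypothetical toy data: three of the four positive/negative pairs are ordered correctly.
auc_toy = roc_auc_pairwise([1, 1, 0, 0], [0.9, 0.4, 0.5, 0.1])
print(auc_toy)  # 0.75
```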


  {
   "cell_type": "markdown",
   "source": [
    "## Hints\n",
    "\n",
    "Although AUCPRC and AUCROC are already quite high, one may suspect overfitting, since the feature dimension is 2048 while there are only 3977 training samples.\n",
    "That is indeed the case, but simple regularization with some hyperparameter tuning of the RF and/or the Morgan fingerprint does not improve the result significantly.\n",
    "Note that some graph-based deep learning models with minimal tuning easily reach both AUCPRC and AUCROC > 0.93, and logloss < 0.35.\n",
    "\n",
    "Also, remember that you may freely use other open resources.\n",
    "For example, there are many compound samples in PubChem, ChEMBL, ChEBI, ..., and most compounds there are unlikely to be \"active\"."
   ],
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%% md\n"
    }
   }
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "outputs": [],
   "source": [],
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%%\n"
    }
   }
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 2
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython2",
   "version": "2.7.6"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 0
}
