## Important Python Libraries

In the last notebooks, we already saw how to import Python libraries. Now, we take a closer look at two Python libraries:
* `pandas` and 
* `numpy` 

Since they are not part of the Python standard library, we have to install them using pip or conda, which are package installers for Python. We will show you how to use conda later. To install the libraries using pip, enter the following commands in your terminal:
* `pip3 install pandas` and
* `pip3 install numpy`

Now, we can import them in the notebook. Note that we use the abbreviations `pd` and `np`, which are the established convention :)

In [1]:
import pandas as pd
import numpy as np
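
If the imports succeed, both libraries are installed. As an optional check, the following minimal sketch prints the installed versions. The exact version strings depend on your environment, so no expected output is shown.

```python
import numpy as np
import pandas as pd

# __version__ is a plain string; its value depends on your installation
print('numpy version:', np.__version__)
print('pandas version:', pd.__version__)
```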

### Numpy


Roughly speaking, numpy is a library that provides a data structure for n-dimensional arrays, along with fast operations on them. In principle, you could do everything that numpy does with lists of lists. However, this is really slow.

Since pure Python code is generally much slower than C, the most commonly used Python libraries, such as numpy, are sped up using C code :) Therefore, it is recommended to use numpy arrays instead of lists for numerical data.
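
To make the comparison with lists of lists concrete, here is a small sketch that doubles every entry of a 2-dimensional structure, once with nested lists and once with a numpy array. The variable names and values are made up for illustration.

```python
import numpy as np

# A 2x3 "matrix" stored as a list of lists (toy values)
nested = [[1, 2, 3], [4, 5, 6]]

# With plain lists we have to loop over rows and entries ourselves
doubled_list = [[2 * x for x in row] for row in nested]

# With numpy the operation is applied to all entries at once
arr = np.array(nested)
doubled_arr = 2 * arr

print(arr.shape)             # (2, 3)
print(doubled_list)          # [[2, 4, 6], [8, 10, 12]]
print(doubled_arr.tolist())  # [[2, 4, 6], [8, 10, 12]]
```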

In [2]:
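from time import time

# Note: an array of 10^9 64-bit integers needs roughly 8 GB of memory
my_array = np.arange(1000000000)
# Note: range() is a lazy sequence rather than an actual list; a real list of this size would need far more memory
my_list = range(1000000000)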


In [3]:
start = time()
my_array.sum()
end = time()
print(f'{end-start}s needed to calculate sum of array')

0.23447442054748535s needed to calculate sum of array


In [4]:
start = time()
sum(my_list)
end = time()
print(f'{end-start}s needed to calculate sum of list')

7.878138542175293s needed to calculate sum of list
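
If your machine does not have enough memory for one billion elements, a scaled-down variant of the comparison may be more practical. The sketch below assumes one million elements, builds an actual list with `list(range(...))`, and uses `perf_counter`, which is intended for measuring short durations. The measured times depend on your machine, so none are shown here.

```python
from time import perf_counter

import numpy as np

n = 1_000_000  # assumed size, much smaller than in the cells above

# int64 is requested explicitly so the sum cannot overflow on any platform
small_array = np.arange(n, dtype=np.int64)
small_list = list(range(n))  # a real list, not a lazy range

start = perf_counter()
array_sum = small_array.sum()
array_seconds = perf_counter() - start

start = perf_counter()
list_sum = sum(small_list)
list_seconds = perf_counter() - start

print(f'array: {array_seconds:.6f}s, list: {list_seconds:.6f}s')
print(int(array_sum) == list_sum)  # both approaches give the same sum
```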


### Pandas

Pandas is also a library for data structures. It uses numpy arrays internally, so it is also quite fast. In addition, pandas provides dataframes with row and column names, which makes it easy to work with data stored in CSV or Excel files.
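
As a minimal sketch of what row and column names add on top of a numpy array, the following example wraps a small array in a dataframe. The labels `row_a`, `col_x`, etc. and the values are hypothetical.

```python
import numpy as np
import pandas as pd

# Toy values; the row and column labels are made up for illustration
values = np.array([[1.0, 2.0], [3.0, 4.0]])
toy = pd.DataFrame(values, index=['row_a', 'row_b'], columns=['col_x', 'col_y'])

print(toy)

# Entries can be addressed by name instead of by position
print(toy.loc['row_b', 'col_y'])  # 4.0

# The underlying data can be converted back to a numpy array
print(type(toy.to_numpy()))
```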

In [5]:
table = pd.read_csv('../example_data/drugs.csv', sep = '\t')

In [6]:
table.head()

Unnamed: 0,drug,MaxAbsEStateIndex,MaxEStateIndex,MinAbsEStateIndex,MinEStateIndex,qed,MolWt,HeavyAtomMolWt,ExactMolWt,NumValenceElectrons,...,fr_piperzine,fr_priamide,fr_pyridine,fr_sulfonamd,fr_sulfone,fr_term_acetylene,fr_thiazole,fr_thiophene,fr_unbrch_alkane,fr_urea
0,Vinblastine,15.321994,15.321994,0.098623,-2.301098,0.179801,810.989,752.525,810.420379,316.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,Cisplatin,,,,,,,,,,...,,,,,,,,,,
2,Cytarabine,11.510871,11.510871,0.053698,-1.309931,0.44893,243.219,230.115,243.085521,94.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,Docetaxel___1007,14.945482,14.945482,0.063719,-2.353438,0.146809,807.89,754.466,807.346605,314.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,Docetaxel___1819,14.945482,14.945482,0.063719,-2.353438,0.146809,807.89,754.466,807.346605,314.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


We already see that there are rows with NaN values (e.g., Cisplatin), so we would like to delete these rows. By default, `dropna` removes every row that contains at least one NaN value.

In [7]:
table.dropna(inplace=True)
table.head()

Unnamed: 0,drug,MaxAbsEStateIndex,MaxEStateIndex,MinAbsEStateIndex,MinEStateIndex,qed,MolWt,HeavyAtomMolWt,ExactMolWt,NumValenceElectrons,...,fr_piperzine,fr_priamide,fr_pyridine,fr_sulfonamd,fr_sulfone,fr_term_acetylene,fr_thiazole,fr_thiophene,fr_unbrch_alkane,fr_urea
0,Vinblastine,15.321994,15.321994,0.098623,-2.301098,0.179801,810.989,752.525,810.420379,316.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,Cytarabine,11.510871,11.510871,0.053698,-1.309931,0.44893,243.219,230.115,243.085521,94.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,Docetaxel___1007,14.945482,14.945482,0.063719,-2.353438,0.146809,807.89,754.466,807.346605,314.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,Docetaxel___1819,14.945482,14.945482,0.063719,-2.353438,0.146809,807.89,754.466,807.346605,314.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,Gefitinib,13.468244,13.468244,0.036409,-0.474756,0.517855,446.91,422.718,446.152097,164.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
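
Note that `dropna` works on rows by default, and passing `axis=1` makes it drop columns instead. The following sketch shows the difference on a toy dataframe with made-up drug names and values. It also uses the non-in-place form, which returns a new dataframe and leaves the original untouched.

```python
import numpy as np
import pandas as pd

# Toy data: the second row has missing values (names and numbers are made up)
toy = pd.DataFrame({
    'drug': ['drug_a', 'drug_b', 'drug_c'],
    'qed': [0.2, np.nan, 0.4],
    'MolWt': [100.0, np.nan, 300.0],
})

rows_dropped = toy.dropna()        # default: drop rows containing NaN
cols_dropped = toy.dropna(axis=1)  # axis=1: drop columns containing NaN

print(rows_dropped.index.tolist())  # [0, 2]
print(list(cols_dropped.columns))   # ['drug']
print(toy.shape)                    # (3, 3), the original is unchanged
```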


Now, we want to identify the compound with the most valence electrons. Using `loc`, you can access the rows and columns of a dataframe by their labels.

In [8]:
table.loc[0, 'drug']

'Vinblastine'
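
After rows have been dropped, the remaining rows keep their original index labels, so labels and positions can differ. The sketch below, on made-up toy data, contrasts label-based access with `loc` and position-based access with `iloc`.

```python
import numpy as np
import pandas as pd

# Toy data; dropping the NaN row leaves the index labels 0 and 2
toy = pd.DataFrame({
    'drug': ['drug_a', 'drug_b', 'drug_c'],
    'electrons': [10.0, np.nan, 30.0],
}).dropna()

print(toy.loc[2, 'drug'])  # label 2        -> drug_c
print(toy.iloc[1, 0])      # second row, first column -> drug_c

# Label 1 no longer exists, so loc raises a KeyError
try:
    toy.loc[1, 'drug']
    label_found = True
except KeyError:
    label_found = False
print(label_found)  # False
```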

We can also pass a condition to `loc`. For example, the following code identifies the drug(s) with the most valence electrons.

In [9]:
max_electrons = table['NumValenceElectrons'].max()
table.loc[table['NumValenceElectrons'] == max_electrons, ['drug', 'NumValenceElectrons']]

Unnamed: 0,drug,NumValenceElectrons
40,Dactinomycin___1811,490.0
41,Dactinomycin___1911,490.0
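
The condition inside `loc` is a boolean mask with one entry per row. The sketch below shows this mask on made-up toy data with a tie. It also shows `idxmax` as an alternative, which returns only the label of the first maximum and would therefore miss ties like the one above.

```python
import pandas as pd

# Toy data with two rows sharing the maximum (names and values are made up)
toy = pd.DataFrame({
    'drug': ['drug_a', 'drug_b', 'drug_c'],
    'electrons': [10.0, 30.0, 30.0],
})

# Comparing a column with a value yields one True/False entry per row
mask = toy['electrons'] == toy['electrons'].max()
print(mask.tolist())  # [False, True, True]

# loc keeps the rows where the mask is True
ties = toy.loc[mask, 'drug'].tolist()
print(ties)  # ['drug_b', 'drug_c']

# idxmax returns the label of the first maximum only
first = toy.loc[toy['electrons'].idxmax(), 'drug']
print(first)  # drug_b
```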


We can also drop columns that we don't need. For example, here we drop the column `MolWt`.

In [10]:
table.drop(columns=['MolWt'], inplace = True)
table.head()

Unnamed: 0,drug,MaxAbsEStateIndex,MaxEStateIndex,MinAbsEStateIndex,MinEStateIndex,qed,HeavyAtomMolWt,ExactMolWt,NumValenceElectrons,MaxPartialCharge,...,fr_piperzine,fr_priamide,fr_pyridine,fr_sulfonamd,fr_sulfone,fr_term_acetylene,fr_thiazole,fr_thiophene,fr_unbrch_alkane,fr_urea
0,Vinblastine,15.321994,15.321994,0.098623,-2.301098,0.179801,752.525,810.420379,316.0,0.343611,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,Cytarabine,11.510871,11.510871,0.053698,-1.309931,0.44893,230.115,243.085521,94.0,0.351213,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,Docetaxel___1007,14.945482,14.945482,0.063719,-2.353438,0.146809,754.466,807.346605,314.0,0.407747,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,Docetaxel___1819,14.945482,14.945482,0.063719,-2.353438,0.146809,754.466,807.346605,314.0,0.407747,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,Gefitinib,13.468244,13.468244,0.036409,-0.474756,0.517855,422.718,446.152097,164.0,0.162433,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
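
With `inplace=True`, the dataframe itself is modified. As an alternative sketch on made-up toy data, `drop` without `inplace` returns a new dataframe and keeps the original intact, which can be handy if the dropped column is still needed later.

```python
import pandas as pd

# Toy data (names and values are made up)
toy = pd.DataFrame({
    'drug': ['drug_a', 'drug_b'],
    'MolWt': [100.0, 200.0],
    'qed': [0.2, 0.4],
})

# Without inplace=True, drop returns a new dataframe
reduced = toy.drop(columns=['MolWt'])

print(list(toy.columns))      # ['drug', 'MolWt', 'qed']
print(list(reduced.columns))  # ['drug', 'qed']
```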


Afterwards, we can also store the table in a new file.

In [11]:
table.to_csv('../example_data/drugs_without_molwt.csv', sep='\t', index=False)  # index=False keeps the row index out of the file
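
To illustrate what `index=False` changes, the following sketch writes a made-up toy dataframe to an in-memory buffer instead of a file, once with and once without the index. When the index is written, reading the data back produces an extra column called `Unnamed: 0`.

```python
import io

import pandas as pd

# Toy data (names and values are made up)
toy = pd.DataFrame({'drug': ['drug_a', 'drug_b'], 'qed': [0.2, 0.4]})

# Write to in-memory text buffers instead of files
with_index = io.StringIO()
toy.to_csv(with_index, sep='\t')  # index is written by default
without_index = io.StringIO()
toy.to_csv(without_index, sep='\t', index=False)

header_with = with_index.getvalue().splitlines()[0]
header_without = without_index.getvalue().splitlines()[0]
print(repr(header_with))     # '\tdrug\tqed' (empty name for the index column)
print(repr(header_without))  # 'drug\tqed'

# Reading the version with the index back in creates an unnamed column
with_index.seek(0)
reread = pd.read_csv(with_index, sep='\t')
print(list(reread.columns))  # ['Unnamed: 0', 'drug', 'qed']
```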