# Software Engineering for Data Scientists

## *Introduction to Python & Jupyter notebooks*

## Today's Objectives

#### 1. Opening & Navigating the Jupyter Notebook

#### 2. Simple Math in the Jupyter Notebook

#### 3. Loading data with ``pandas``

#### 4. Cleaning and Manipulating data with ``pandas``

#### 5. Visualizing data with ``pandas``

## 1. Opening and Navigating the Jupyter Notebook

We will start today with the interactive environment that we will be using throughout the course: the [Jupyter Notebook](http://jupyter.org).

We will walk through the following steps together:

1. Download [miniconda]() (be sure to get Version 3.5) and install it on your system (hopefully you have done this before coming to class)

2. Use the ``conda`` command-line tool to update your package listing and install the Jupyter notebook:

   Update ``conda``'s listing of packages for your system:
   ```
   $ conda update conda
   ```
   
   Install the Jupyter notebook and all its requirements:
   ```
   $ conda install jupyter
   ```
   
3. Navigate to the HCEPDB directory. For example:

   ```
   $ cd ~/Desktop/HCEPDB/
   ```
   
   Use curl to download the main lecture notebook and the simple breakout notebook:
   
   ```
   # you may skip this next step if you downloaded the file from your web browser
   $ curl -O http://uwdirect.github.io/SEDS_content/02.Python.ipynb
   
   ...
   $ curl -O http://uwdirect.github.io/SEDS_content/02.Simple_Breakout.ipynb
   ...
   
   $ ls
   ...
   02.Python.ipynb
   02.Simple_Breakout.ipynb
   ...
   ```

4. Type ``jupyter notebook`` in the terminal to start the notebook

   ```
   $ jupyter notebook
   ```
   
   If everything has worked correctly, it should automatically launch your default browser.
   
5. Click on ``02.Python.ipynb`` to open the notebook containing the content for this lecture.

With that, you're set up to use the Jupyter notebook!

## 2. Simple Math in the Jupyter Notebook

Now that we have the Jupyter notebook up and running, we're going to do a short breakout exploring some of the mathematical functionality that Python offers.

Please open [02.Simple_Breakout.ipynb](02.Simple_Breakout.ipynb), find a partner, and make your way through that notebook, typing and executing code along the way.

## 3. Loading data with ``pandas``

With this simple Python computation experience under our belt, we can now move to doing some more interesting analysis.

### Python's Data Science Ecosystem

In addition to Python's built-in modules like the ``math`` module we explored above, there are also many often-used third-party modules that are core tools for doing data science with Python.
Some of the most important ones are:

#### [``numpy``](http://numpy.org/): Numerical Python

Numpy is short for "Numerical Python", and contains tools for efficient manipulation of arrays of data.
If you have used other computational tools like IDL or MATLAB, Numpy should feel very familiar.

#### [``scipy``](http://scipy.org/): Scientific Python

Scipy is short for "Scientific Python", and contains a wide range of functionality for accomplishing common scientific tasks, such as optimization/minimization, numerical integration, interpolation, and much more.
We will not look closely at Scipy today, but we will use its functionality later in the course.

#### [``pandas``](http://pandas.pydata.org/): Labeled Data Manipulation in Python

Pandas is short for "Panel Data", and contains tools for doing more advanced manipulation of labeled data in Python, in particular with a columnar data structure called a *Data Frame*.
If you've used the [R](http://rstats.org) statistical language (and in particular the so-called "Hadley Stack"), much of the functionality in Pandas should feel very familiar.

#### [``matplotlib``](http://matplotlib.org): Visualization in Python

Matplotlib started out as a MATLAB plotting clone in Python, and has grown from there in the 15 years since its creation. It is currently the most popular data visualization tool in the Python data world (though other recent packages are starting to encroach on its monopoly).

### Installing Pandas & friends

Because the above packages are not included in Python itself, you need to install them separately. While it is possible to install these from source (compiling the C and/or Fortran code that does the heavy lifting under the hood), it is much easier to use a package manager like ``conda``. All it takes is to run

```
$ conda install numpy scipy pandas matplotlib
```

and (so long as your conda setup is working) the packages will be downloaded and installed on your system.

### Loading Data with Pandas

Because we'll use them so much, we often import these packages under shortened names using the ``import ... as ...`` pattern:

In [4]:
import numpy as np

In [5]:
import pandas as pd

Now we can use the ``read_csv`` command to read the comma-separated-value data:

In [6]:
data = pd.read_csv("HCEPDB_moldata.csv")

*Note: strings in Python can be defined either with double quotes or single quotes*
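If you do not have the full CSV file at hand, the following minimal sketch shows the same ``read_csv`` call on a small in-memory excerpt. The three rows are copied from the first rows of the HCEPDB data displayed in the next section (only a few columns are kept), and the names ``csv_text`` and ``sample`` are illustrative, not part of the lecture notebook.

```python
import io

import pandas as pd

# Three-row CSV excerpt; values copied from the HCEPDB rows shown in this notebook
csv_text = """id,stoich_str,mass,pce
655365,C18H9N3OSSe,394.3151,5.161953
1245190,C22H15NSeSi,400.4135,5.261398
21847,C24H17NOSi,363.4903,0.000000
"""

# read_csv accepts a file path or any file-like object, such as StringIO
sample = pd.read_csv(io.StringIO(csv_text))
print(sample)
```

The first line of the text is used as the column names, and each column gets a data type inferred from its values.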

### Viewing Pandas Dataframes

The ``head()`` and ``tail()`` methods show us the first and last rows of the data:

In [7]:
data.head(10)

|  | id | SMILES_str | stoich_str | mass | pce | voc | jsc | e_homo_alpha | e_gap_alpha | e_lumo_alpha | tmp_smiles_str |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 655365 | `C1C=CC=C1c1cc2[se]c3c4occc4c4nsnc4c3c2cn1` | C18H9N3OSSe | 394.3151 | 5.161953 | 0.867601 | 91.567575 | -5.467601 | 2.022944 | -3.444656 | `C1=CC=C(C1)c1cc2[se]c3c4occc4c4nsnc4c3c2cn1` |
| 1 | 1245190 | `C1C=CC=C1c1cc2[se]c3c(ncc4ccccc34)c2c2=C[SiH2]...` | C22H15NSeSi | 400.4135 | 5.261398 | 0.504824 | 160.401549 | -5.104824 | 1.630750 | -3.474074 | `C1=CC=C(C1)c1cc2[se]c3c(ncc4ccccc34)c2c2=C[SiH...` |
| 2 | 21847 | `C1C=c2ccc3c4c[nH]cc4c4c5[SiH2]C(=Cc5oc4c3c2=C1...` | C24H17NOSi | 363.4903 | 0.000000 | 0.000000 | 197.474780 | -4.539526 | 1.462158 | -3.077368 | `C1=CC=C(C1)C1=Cc2oc3c(c2[SiH2]1)c1c[nH]cc1c1cc...` |
| 3 | 65553 | `[SiH2]1C=CC2=C1C=C([SiH2]2)C1=Cc2[se]ccc2[SiH2]1` | C12H12SeSi3 | 319.4448 | 6.138294 | 0.630274 | 149.887545 | -5.230274 | 1.682250 | -3.548025 | `C1=CC2=C([SiH2]1)C=C([SiH2]2)C1=Cc2[se]ccc2[Si...` |
| 4 | 720918 | `C1C=c2c3ccsc3c3[se]c4cc(oc4c3c2=C1)C1=CC=CC1` | C20H12OSSe | 379.3398 | 1.991366 | 0.242119 | 126.581347 | -4.842119 | 1.809439 | -3.032680 | `C1=CC=C(C1)c1cc2[se]c3c4sccc4c4=CCC=c4c3c2o1` |
| 5 | 1310744 | `C1C=CC=C1c1cc2[se]c3c(c4nsnc4c4ccncc34)c2c2ccc...` | C24H13N3SSe | 454.4137 | 5.605135 | 0.951911 | 90.622776 | -5.551911 | 2.029717 | -3.522194 | `C1=CC=C(C1)c1cc2[se]c3c(c4nsnc4c4ccncc34)c2c2c...` |
| 6 | 196637 | `C1C=CC=C1c1cc2[se]c3cc4ccsc4cc3c2[se]1` | C17H10SSe2 | 404.2520 | 2.644436 | 0.587932 | 69.223461 | -5.187932 | 2.201106 | -2.986827 | `C1=CC=C(C1)c1cc2[se]c3cc4ccsc4cc3c2[se]1` |
| 7 | 262174 | `C1C=CC=C1c1cc2[se]c3c4occc4c4cscc4c3c2[se]1` | C19H10OSSe2 | 444.2730 | 2.523057 | 0.397670 | 97.645325 | -4.997670 | 1.982122 | -3.015548 | `C1=CC=C(C1)c1cc2[se]c3c4occc4c4cscc4c3c2[se]1` |
| 8 | 393249 | `C1C=CC=C1c1cc2[se]c3cc4cccnc4cc3c2c2ccccc12` | C24H15NSe | 396.3495 | 3.115895 | 0.869140 | 55.174815 | -5.469140 | 2.331815 | -3.137325 | `C1=CC=C(C1)c1cc2[se]c3cc4cccnc4cc3c2c2ccccc12` |
| 9 | 35 | `C1C2=C([SiH2]C=C2)C=C1c1cc2occc2c2cscc12` | C17H12OSSi | 292.4328 | 2.743214 | 0.387106 | 109.062905 | -4.987106 | 1.909966 | -3.077141 | `C1=CC2=C([SiH2]1)C=C(C2)c1cc2occc2c2cscc12` |


In [10]:
data.tail(2)

|  | id | SMILES_str | stoich_str | mass | pce | voc | jsc | e_homo_alpha | e_gap_alpha | e_lumo_alpha | tmp_smiles_str |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 2322847 | 1961981 | `C1ccc2c1c(sc2-c1scc2cc[SiH2]c12)-c1ccc(cc1)-c1...` | C25H16S3SeSi | 519.6454 | 2.679067 | 0.659243 | 62.544032 | -5.259243 | 2.258468 | -3.000775 | `c1sc(c2[SiH2]ccc12)-c1sc(c2Cccc12)-c1ccc(cc1)-...` |
| 2322848 | 2754558 | `[SiH2]1ccc2csc(c12)-c1sc(-c2sc(-c3scc4ccsc34)c...` | C24H13NOS5Si | 519.7887 | 1.2724 | 0.102802 | 190.489616 | -4.702802 | 1.49095 | -3.211851 | `c1sc(c2[SiH2]ccc12)-c1sc(-c2sc(-c3scc4ccsc34)c...` |


The ``shape`` attribute shows us the number of rows and columns:

In [11]:
data.shape

(2322849, 11)

The ``columns`` attribute gives us the column names:

In [14]:
data.columns

Index(['id', 'SMILES_str', 'stoich_str', 'mass', 'pce', 'voc', 'jsc',
       'e_homo_alpha', 'e_gap_alpha', 'e_lumo_alpha', 'tmp_smiles_str'],
      dtype='object')

The ``index`` attribute gives us the row labels (the index):
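As a minimal sketch, here is what the ``index`` attribute looks like on a small stand-in frame. The name ``sample`` is illustrative and its values are copied from the first rows of the table above; on the full dataset the equivalent call would be ``data.index``.

```python
import pandas as pd

# Three-row stand-in for ``data``; values copied from the table above
sample = pd.DataFrame({
    "id": [655365, 1245190, 21847],
    "mass": [394.3151, 400.4135, 363.4903],
})

# With no index specified, pandas numbers the rows 0, 1, 2, ...
print(sample.index)
```

A frame read with ``read_csv`` and no further options gets this same kind of default integer index.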

Let's make our ``id`` column the ``index``:

Now let's revisit ``data.index``:

View it with ``head()`` again:
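The three steps above can be sketched on a small stand-in frame as follows. The names ``sample`` and ``indexed`` are illustrative and the values are copied from the first rows of the table above; on the full dataset one option would be ``data = data.set_index('id')``.

```python
import pandas as pd

# Three-row stand-in for ``data``; values copied from the table above
sample = pd.DataFrame({
    "id": [655365, 1245190, 21847],
    "mass": [394.3151, 400.4135, 363.4903],
    "pce": [5.161953, 5.261398, 0.0],
})

# 1. Make the ``id`` column the index (returns a new frame)
indexed = sample.set_index("id")

# 2. The index now holds the id values instead of 0, 1, 2, ...
print(indexed.index)

# 3. ``id`` is shown as the row label rather than as a regular column
print(indexed.head())
```

Once ``id`` is the index, rows can be looked up by label, for example ``indexed.loc[21847]``.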

The ``dtypes`` attribute gives the data types of each column:

In [15]:
data.dtypes

id                  int64
SMILES_str         object
stoich_str         object
mass              float64
pce               float64
voc               float64
jsc               float64
e_homo_alpha      float64
e_gap_alpha       float64
e_lumo_alpha      float64
tmp_smiles_str     object
dtype: object

## 4. Manipulating data with ``pandas``

Here we'll cover some key features of manipulating data with pandas.

Access columns by name using square-bracket indexing:

In [16]:
data['mass']

0          394.3151
1          400.4135
2          363.4903
3          319.4448
4          379.3398
5          454.4137
6          404.2520
7          444.2730
8          396.3495
9          292.4328
10         290.4606
11         489.1948
12         335.2668
13         443.5024
14         393.5445
15         340.5204
16         429.6251
17         258.3186
18         336.4684
19         459.5774
20         394.6522
21         509.5136
22         412.5660
23         389.3786
24         434.5416
25         455.2999
26         385.3226
27         368.4770
28         402.3777
29         393.5445
             ...   
2322819    483.7863
2322820    520.7728
2322821    520.7728
2322822    485.7625
2322823    427.6983
2322824    467.7193
2322825    467.7113
2322826    472.7514
2322827    467.7193
2322828    504.7058
2322829    503.7217
2322830    471.7753
2322831    534.8756
2322832    508.4810
2322833    472.7634
2322834    535.8557
2322835    501.7495
2322836    504.7058
2322837    408.6768


Mathematical operations on columns happen *element-wise* (note that 18.01528 g/mol is the molar mass of H2O):
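A minimal sketch of element-wise arithmetic on a small stand-in frame (``sample`` and ``ratio`` are illustrative names, and the masses are copied from the first rows of the table above):

```python
import pandas as pd

# Three-row stand-in for ``data``; values copied from the table above
sample = pd.DataFrame({
    "id": [655365, 1245190, 21847],
    "mass": [394.3151, 400.4135, 363.4903],
})

# Dividing a column by a number divides every element of the column
ratio = sample["mass"] / 18.01528
print(ratio)
```

The result is a new Series with the same index as the original column.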

Columns can be created (or overwritten) with the assignment operator.
Let's create a *mass_ratio_H2O* column with the mass ratio of each molecule to H2O:
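One way this could look on a small stand-in frame (``sample`` and ``H2O_MASS`` are illustrative names; the masses are copied from the first rows of the table above):

```python
import pandas as pd

H2O_MASS = 18.01528  # molar mass of water in g/mol

# Three-row stand-in for ``data``; values copied from the table above
sample = pd.DataFrame({
    "id": [655365, 1245190, 21847],
    "mass": [394.3151, 400.4135, 363.4903],
})

# Assigning to a new column name adds the column to the frame
sample["mass_ratio_H2O"] = sample["mass"] / H2O_MASS
print(sample)
```

Assigning to an existing column name would overwrite that column instead.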

In preparation for grouping the data, let's bin the molecules by their molecular mass. For that, we'll use ``pd.cut``:
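A minimal sketch of binning with ``pd.cut`` follows. The bin edges (300, 350, 400, 450, 500) are an assumption chosen to fit the six sample masses, not the edges used in the lecture; ``sample`` is an illustrative stand-in with values copied from the first rows of the table above.

```python
import pandas as pd

# Six-row stand-in for ``data``; values copied from the table above
sample = pd.DataFrame({
    "id": [655365, 1245190, 21847, 65553, 720918, 1310744],
    "mass": [394.3151, 400.4135, 363.4903, 319.4448, 379.3398, 454.4137],
})

# Hypothetical bin edges; each mass falls into one interval (left, right]
bin_edges = [300, 350, 400, 450, 500]
sample["mass_group"] = pd.cut(sample["mass"], bins=bin_edges)
print(sample)
```

Passing an integer instead of a list (for example ``bins=10``) asks ``pd.cut`` to choose that many equal-width bins itself.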

### Simple Grouping of Data

The real power of Pandas comes in its tools for grouping and aggregating data. Here we'll look at *value counts* and the basics of *group-by* operations.

#### Value Counts

Pandas includes an array of useful functionality for manipulating and analyzing tabular data.
We'll take a look at two of these here.

The ``value_counts`` method returns the number of times each unique value occurs in a column.

We can use it, for example, to break down the molecules by their mass group that we just created:
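As a sketch, here is ``value_counts`` applied to a ``mass_group`` column on a small stand-in frame. The bin edges are hypothetical and ``sample`` is an illustrative name; the masses are copied from the first rows of the table above.

```python
import pandas as pd

# Six-row stand-in for ``data``; values copied from the table above
sample = pd.DataFrame({
    "mass": [394.3151, 400.4135, 363.4903, 319.4448, 379.3398, 454.4137],
})

# Hypothetical bin edges for the mass groups
sample["mass_group"] = pd.cut(sample["mass"], bins=[300, 350, 400, 450, 500])

# Number of molecules in each mass group, most common group first
counts = sample["mass_group"].value_counts()
print(counts)
```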

What happens if we try this on a continuous-valued variable?
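A small sketch suggests the answer: for a continuous column almost every value is distinct, so most counts are 1 and the summary is not very informative. Here ``sample`` is an illustrative stand-in with masses copied from the first rows of the table above.

```python
import pandas as pd

# Six-row stand-in for ``data``; values copied from the table above
sample = pd.DataFrame({
    "mass": [394.3151, 400.4135, 363.4903, 319.4448, 379.3398, 454.4137],
})

# Every mass in this small sample is distinct, so each count is 1
mass_counts = sample["mass"].value_counts()
print(mass_counts)
```

This is why binning the values first (as with ``pd.cut``) gives a more useful breakdown.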

We can do a little data exploration with this to look for 0s in columns.  Here, let's look at the power conversion efficiency (``pce``):
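One possible way to count the zeros, sketched on a small stand-in frame (``sample`` and ``n_zero`` are illustrative names; the ``pce`` values are copied from the first rows of the table above):

```python
import pandas as pd

# Six-row stand-in for ``data``; values copied from the table above
sample = pd.DataFrame({
    "id": [655365, 1245190, 21847, 65553, 720918, 1310744],
    "pce": [5.161953, 5.261398, 0.0, 6.138294, 1.991366, 5.605135],
})

# value_counts shows how often each pce value occurs, including 0.0
print(sample["pce"].value_counts())

# A comparison gives True/False per row; summing counts the True values
n_zero = int((sample["pce"] == 0).sum())
print(n_zero)
```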

### Group-by Operation

One of the killer features of the Pandas dataframe is the ability to do group-by operations.
You can visualize the group-by like this (image borrowed from the [Python Data Science Handbook](http://shop.oreilly.com/product/0636920034919.do))

In [None]:
from IPython.display import Image
Image('split_apply_combine.png')

Let's take this in smaller steps.
Recall our ``mass_group`` column.

``groupby`` followed by ``count`` lets us look at the number of non-null values in each column for each group.
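A minimal sketch of counting per group on a small stand-in frame follows. The bin edges are hypothetical, ``sample`` is an illustrative name with values copied from the first rows of the table above, and ``observed=True`` is an assumption made here so that only mass groups that actually occur are reported.

```python
import pandas as pd

# Six-row stand-in for ``data``; values copied from the table above
sample = pd.DataFrame({
    "id": [655365, 1245190, 21847, 65553, 720918, 1310744],
    "mass": [394.3151, 400.4135, 363.4903, 319.4448, 379.3398, 454.4137],
    "pce": [5.161953, 5.261398, 0.0, 6.138294, 1.991366, 5.605135],
})
sample["mass_group"] = pd.cut(sample["mass"], bins=[300, 350, 400, 450, 500])

# One row per mass group, one column per remaining column of the frame
counts = sample.groupby("mass_group", observed=True).count()
print(counts)
```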

Now, let's find the mean of each of the columns for each ``mass_group``.  *Notice* what happens to the non-numeric columns.
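As a sketch on a small stand-in frame: in recent pandas versions, calling ``mean()`` on groups that include text columns raises an error unless ``numeric_only=True`` is passed, while older versions silently dropped those columns. The bin edges are hypothetical and ``sample`` is an illustrative name with values copied from the first rows of the table above.

```python
import pandas as pd

# Six-row stand-in for ``data``; values copied from the table above
sample = pd.DataFrame({
    "stoich_str": ["C18H9N3OSSe", "C22H15NSeSi", "C24H17NOSi",
                   "C12H12SeSi3", "C20H12OSSe", "C24H13N3SSe"],
    "mass": [394.3151, 400.4135, 363.4903, 319.4448, 379.3398, 454.4137],
    "pce": [5.161953, 5.261398, 0.0, 6.138294, 1.991366, 5.605135],
})
sample["mass_group"] = pd.cut(sample["mass"], bins=[300, 350, 400, 450, 500])

# numeric_only=True skips the text column ``stoich_str``
means = sample.groupby("mass_group", observed=True).mean(numeric_only=True)
print(means)
```

The result keeps only ``mass`` and ``pce``, because a mean is not defined for text.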

You can specify a groupby using the names of table columns and compute other functions, such as ``sum``, ``count``, ``std``, and ``describe``.
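A short sketch of these other aggregates on a small stand-in frame (hypothetical bin edges; ``sample``, ``stats``, and ``summary`` are illustrative names; values copied from the first rows of the table above):

```python
import pandas as pd

# Six-row stand-in for ``data``; values copied from the table above
sample = pd.DataFrame({
    "mass": [394.3151, 400.4135, 363.4903, 319.4448, 379.3398, 454.4137],
    "pce": [5.161953, 5.261398, 0.0, 6.138294, 1.991366, 5.605135],
})
sample["mass_group"] = pd.cut(sample["mass"], bins=[300, 350, 400, 450, 500])

grouped = sample.groupby("mass_group", observed=True)["pce"]

# Several aggregates at once; std is NaN for groups with a single member
stats = grouped.agg(["sum", "count", "std"])
print(stats)

# describe() reports count, mean, std, min, quartiles, and max per group
summary = grouped.describe()
print(summary)
```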

The simplest version of a groupby looks like this, and you can use almost any aggregation function you wish (mean, median, sum, minimum, maximum, standard deviation, count, etc.)

```
<data object>.groupby(<grouping values>).<aggregate>()
```

You can even group by multiple values: for example, we can look at the LUMO-HOMO gap grouped by the ``mass_group`` and ``pce``.
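A minimal sketch of grouping by two columns, assuming the gap is stored in the ``e_gap_alpha`` column. The bin edges are hypothetical and ``sample`` is an illustrative stand-in with values copied from the first rows of the table above. Because ``pce`` is continuous, each group in this small sample contains a single molecule.

```python
import pandas as pd

# Six-row stand-in for ``data``; values copied from the table above
sample = pd.DataFrame({
    "mass": [394.3151, 400.4135, 363.4903, 319.4448, 379.3398, 454.4137],
    "pce": [5.161953, 5.261398, 0.0, 6.138294, 1.991366, 5.605135],
    "e_gap_alpha": [2.022944, 1.630750, 1.462158, 1.682250, 1.809439, 2.029717],
})
sample["mass_group"] = pd.cut(sample["mass"], bins=[300, 350, 400, 450, 500])

# Passing a list of column names groups by every observed combination
gap = sample.groupby(["mass_group", "pce"], observed=True)["e_gap_alpha"].mean()
print(gap)
```

The result is indexed by both grouping columns (a two-level index).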

## 5. Visualizing data with ``pandas``

Of course, looking at tables of data is not very intuitive.
Fortunately Pandas has many useful plotting functions built-in, all of which make use of the ``matplotlib`` library to generate plots.

Whenever you do plotting in the Jupyter notebook, you will want to first run this *magic command* which configures the notebook to work well with plots:

In [None]:
%matplotlib inline

Now we can simply call the ``plot()`` method of any series or dataframe to get a reasonable view of the data:
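As a minimal sketch on a small stand-in frame (``sample`` is an illustrative name with masses copied from the first rows of the table above; this assumes ``matplotlib`` is installed):

```python
import pandas as pd

# Six-row stand-in for ``data``; values copied from the table above
sample = pd.DataFrame({
    "mass": [394.3151, 400.4135, 363.4903, 319.4448, 379.3398, 454.4137],
})

# plot() draws a line plot of the values against the index by default
ax = sample["mass"].plot()
ax.set_xlabel("row")
ax.set_ylabel("mass")
```

The call returns a matplotlib ``Axes`` object, which can be used to adjust labels and titles afterwards.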

### Other plot types

Pandas supports a range of other plotting types; you can find these by using the ``<TAB>`` autocomplete on the ``plot`` method:
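Two of these plot types are sketched below on a small stand-in frame (``sample`` is an illustrative name with values copied from the first rows of the table above; the choice of four histogram bins is arbitrary, and ``matplotlib`` is assumed to be installed):

```python
import pandas as pd

# Six-row stand-in for ``data``; values copied from the table above
sample = pd.DataFrame({
    "mass": [394.3151, 400.4135, 363.4903, 319.4448, 379.3398, 454.4137],
    "pce": [5.161953, 5.261398, 0.0, 6.138294, 1.991366, 5.605135],
})

# Histogram of a single column
ax_hist = sample["mass"].plot.hist(bins=4)

# Scatter plot of one column against another
ax_scatter = sample.plot.scatter(x="mass", y="pce")
```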