# Basic Python

## Aim of this lab

To learn the basics of Python Programming.  

### Objectives

* Learn the fundamentals of Programming and Python
* Learn how to assign data values to variables.
* Learn how to select individual values and subsections from data.
* Learn how to perform operations on "arrays" of data.
* Learn how to display simple graphs from data.
* Learn what a `for` loop does.  
* Learn how to write `for` loops to repeat simple calculations.  
* Learn how to trace changes to a loop variable as the loop runs.
* Learn how to trace changes to other variables as they are updated by a `for` loop. 
* Learn what a library is and what you can use it for.
* Learn how to "import" a Python library and use the things it contains.
* Learn to read tabular data from a file into a Python program.


## The Python Programming Language

Python is one of the most popular programming languages.  It is also arguably the preferred language for data science, bioinformatics, and cheminformatics.  The other language in that argument, the R statistical programming language, is also quite popular and has many of the same benefits that make Python great for data science.  However, Python's growth over the last decade or so, coupled with its open-source community and its many open-source libraries, particularly for the life sciences, has potentially given it the lead.  

Python and R are interpreted languages, rather than compiled languages like C and C++.  In both cases, a programming language is used to tell the computer what to do, and the code is ultimately translated into machine instructions.  With a compiled language, the whole program must first be compiled in one step and is then run in another.  With an interpreted language, the translation happens at run time, so you can see the results immediately (at the cost of some performance, which is not necessarily a concern for data scientists, bioinformaticians and cheminformaticians).  This means you can quickly carry out computational tasks and inspect the results right away.  It also allows for tools like [Jupyter Notebooks](https://jupyter.org/), which is what we will use for this course.  

Jupyter Notebooks have been referred to as ["The Scientific Paper of the Future"](https://www.theatlantic.com/science/archive/2018/04/the-scientific-paper-is-obsolete/556676/).  They are widely used by data scientists and the like, and allow text, images and other media to be interspersed with programming code (the name Jupyter comes from Julia, Python and R, the three languages it was originally built to support).  They are great for tutorials, which is why we will use them in this course.  

One of Python's core strengths as a data science language is its strong set of libraries for data science, visualization and cheminformatics.  After exploring the basics of programming, we will give an overview of four libraries. 

### Pandas

[Pandas](https://pandas.pydata.org/) is a data analytics library that allows for fast computation on data matrices, a core component of statistics and data science.  

### Matplotlib

[Matplotlib](https://matplotlib.org/) is a data visualization library for creating complex, scientific paper-quality graphs, figures and charts.  

### RDKit
[RDKit](https://www.rdkit.org/) is an excellent cheminformatics and computational chemistry library.  It will be explored in depth in a later lab. 

### Scikit-learn
[Scikit-learn](https://scikit-learn.org/stable/) is a library for building machine learning models and will be used in the module on QSAR. 



#### The basics of Python programming

#### Variables

* Information stored in variables is not static 
* It can be changed, manipulated, or updated to contain new information.  
* A variable does not have to be explicitly declared.  Its value can be the result of a Python expression.

![Variables as Sticky Notes](https://github.com/russodanielp/intro_cheminformatics/blob/master/Lab%2003%20-%20Basic%20Python/img/python-sticky-note-variables-01.svg?raw=1)

![Variables as Sticky Notes](https://github.com/russodanielp/intro_cheminformatics/blob/master/Lab%2003%20-%20Basic%20Python/img/python-sticky-note-variables-02.svg?raw=1)

![Variables as Sticky Notes](https://github.com/russodanielp/intro_cheminformatics/blob/master/Lab%2003%20-%20Basic%20Python/img/python-sticky-note-variables-03.svg?raw=1)

* Changing a variable that was created from another variable doesn't change the variable it was created from
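
A minimal sketch of the sticky-note behaviour described above. The names `mass_kg` and `mass_g` are made up for illustration and are not taken from the lab's own code cells.

```python
# Assign a value to a variable (the "sticky note" is placed on the value 2).
mass_kg = 2

# Create a second variable from an expression that uses the first one.
mass_g = 1000 * mass_kg

# Updating the new variable leaves the variable it was created from untouched.
mass_g = mass_g + 500
print(mass_kg, mass_g)

# Likewise, re-assigning the old variable does not recompute the new one.
mass_kg = 5
print(mass_kg, mass_g)
```

Each assignment simply moves a name onto a new value; Python does not remember the expression that produced `mass_g`.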

## Who's who in memory?

* IPython **-->** Interactive + Python
* Has built-in commands that add more interactivity than a typical Python environment
* These commands start with `%`

Let's save our old data.

## Data Type ##

* variables can store different 'types' of data
* Strings, integers, floats are some common data types
* `type()`
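
As a small illustration of the three data types listed above, the built-in `type()` function reports the type of whatever value a variable currently holds. The variable names here are hypothetical.

```python
# Hypothetical variables holding a string, an integer and a float.
element = 'oxygen'
atomic_number = 8
atomic_mass = 15.999

print(type(element))        # string
print(type(atomic_number))  # integer
print(type(atomic_mass))    # float
```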

# Repeating actions with loops #
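
The code cell for this step is not reproduced here. One way to print every character of a string without a loop, assuming a variable `word` that holds `'oxygen'`, might look like this:

```python
word = 'oxygen'

# Print each character by its index, one line of code per character.
print(word[0])
print(word[1])
print(word[2])
print(word[3])
print(word[4])
print(word[5])
```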


## That worked, what's wrong with that? ##

1. It doesn't scale.  
2. We may not know how many characters are in our variable `word`.

## The `for` Loop ##

* Allows us to iterate over... iterables.
* Items that have elements that we can "loop" over are called iterables. 
  * Strings, Lists, and ranges of numbers are iterables.
  * Strings have characters to loop over
  
What if we wanted to print every letter in the word "oxygen" stored in a variable `word`?

<img src="https://github.com/russodanielp/intro_cheminformatics/blob/master/Lab%2003%20-%20Basic%20Python/img/loops_image.svg?raw=1">
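
A short sketch of the loop pictured above: the loop variable (called `letter` here, any name would do) takes each character of `word` in turn.

```python
word = 'oxygen'

# The indented body runs once for every character in `word`.
for letter in word:
    print(letter)
```

The same code works unchanged for a word of any length, which addresses both problems with the index-by-index approach.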

## `for` Loops Can Do More Than Just Print ##

* What if we wanted to count the length of the contents of a string?

* How many times does the `for` loop execute? 
* Why is the value of `length` `6`?
* What would be the value of `letter` after the code is executed?
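
The questions above refer to a counting loop whose code cell is not shown. A minimal sketch, assuming `word` still holds `'oxygen'` and using a counter named `length`:

```python
word = 'oxygen'
length = 0

# Add one to the counter for every character we visit.
for letter in word:
    length = length + 1

print('There are', length, 'letters in', word)
```

Try adding `print(letter, length)` inside the loop body to trace how both variables change on each pass.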

__Note__: Finding the length of an iterable is such a common task in Python that there is a built-in function, `len()`, that does just that.  

### From 1 to N ###

Python has a built-in function called `range` that creates a sequence of numbers. `range` can accept 1 to 3 parameters. If one parameter is given, `range` creates a sequence of that length, starting at zero and incrementing by 1. If two parameters are given, `range` starts at the first and ends just before the second, incrementing by 1. If three parameters are given, it starts at the first, ends just before the second, and increments by the third. For example, `range(3)` produces the numbers 0, 1, 2, while `range(2, 5)` produces 2, 3, 4, and `range(3, 10, 3)` produces 3, 6, 9. Using `range`, write a loop that prints the first 3 natural numbers:
>1  
>2  
>3  
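
Before attempting the exercise, the three call forms described above can be checked directly. Wrapping each `range` in `list()` is just a convenient way to see all of its numbers at once.

```python
# One parameter: start at 0, stop before 3.
one_arg = list(range(3))
print(one_arg)       # [0, 1, 2]

# Two parameters: start at 2, stop before 5.
two_args = list(range(2, 5))
print(two_args)      # [2, 3, 4]

# Three parameters: start at 3, stop before 10, step by 3.
three_args = list(range(3, 10, 3))
print(three_args)    # [3, 6, 9]
```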

## What is a Python Library? ##

* Code is reusable: many simple programming tasks have been done before
    - No need to reinvent the wheel!
    
* Python libraries are organized code for reusability
    
Code (words) **-->** Functions/Methods **-->** Objects **-->** Modules **-->** Libraries

## Pandas

In order to use libraries we need to import them.  This is done through an import statement using the library name.  

Pandas uses what are called data frames.  Data frames are a pretty simple concept: they are matrices consisting of rows and columns, and are a very convenient way to store tabular data (such as that found in Excel).  

For example, we can look at the famous iris dataset. 

```python
import pandas as pd
iris = pd.read_csv('https://raw.githubusercontent.com/mwaskom/seaborn-data/master/iris.csv')
iris.head()
```

|   | sepal_length | sepal_width | petal_length | petal_width | species |
|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | setosa |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | setosa |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | setosa |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | setosa |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | setosa |


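Voilà!  Once a library such as `numpy` has been imported, we have access to all the functions it has to offer.  A call like `numpy.loadtxt(fname='../data/inflammation-01.csv', delimiter=',')` has three parts: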

1) `.loadtxt(...)` -> function call  
2) `fname='../data/inflammation-01.csv'` -> parameter  
3) `delimiter=','` -> parameter

## Why did it print to the screen?  ##

* Because we didn't tell it what else to do with the result!
* We can store data in _variables_
* We do that by _assignment_
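
A minimal sketch of storing the loaded data by assignment. Because the file `../data/inflammation-01.csv` is not available here, the example reads a few made-up values from an in-memory text buffer instead. `numpy.loadtxt` accepts such a file-like object in place of a file name.

```python
import io
import numpy

# Made-up stand-in for the CSV file: 3 rows of 4 comma-separated values.
csv_text = "0,1,2,3\n0,2,4,6\n1,1,3,5\n"

# Assigning the result to `data` keeps it in memory instead of only displaying it.
data = numpy.loadtxt(fname=io.StringIO(csv_text), delimiter=',')

# Now we decide when (and whether) to show it.
print(data)
```

With the real file, only the `fname` argument would change.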

## Numpy arrays ##

* N-dimensional array.
* Rows are the individual patients.
* Columns are their daily inflammation measurements.
* Arrays can store information about our `data` variable through `attributes`. 
* We can see how many rows and columns we have using `shape`.
* We can see what type of data is stored in our rows and columns using `dtype`.
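
The attributes mentioned above can be inspected as follows. The small array here is made up (3 patients, 4 days) and stands in for the real inflammation data.

```python
import numpy

# Made-up stand-in data: each row is a patient, each column a day.
data = numpy.array([[0.0, 1.0, 2.0, 3.0],
                    [0.0, 2.0, 4.0, 6.0],
                    [1.0, 1.0, 3.0, 5.0]])

print(type(data))   # the kind of object `data` refers to
print(data.shape)   # (number of rows, number of columns)
print(data.dtype)   # the type of the values stored inside the array
```

Note that attributes such as `shape` and `dtype` are written without `()`, because they describe the array rather than perform an action on it.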

## Accessing data in arrays ##

* We can find the value of individual elements in our data by _indexing_.
* Any element in `data` is available by providing the information `[row, column]`.
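
A brief sketch of `[row, column]` indexing, again using a small made-up array in place of the real data. Remember that counting starts at 0.

```python
import numpy

# Made-up stand-in data: 3 rows, 4 columns.
data = numpy.array([[0.0, 1.0, 2.0, 3.0],
                    [0.0, 2.0, 4.0, 6.0],
                    [1.0, 1.0, 3.0, 5.0]])

first_value = data[0, 0]    # first row, first column
other_value = data[2, 3]    # third row, fourth column

print('first value in data:', first_value)
print('value at row 2, column 3:', other_value)
```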

## Retrieving entire rows and or columns ##

* Can be done by _slicing_.
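* A slice is written `start:end+1`: it includes the element at `start` and stops just before the second number, so the last element included is `end`.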
* The default values for `start` and `end` are the beginning and end of the array, respectively.
* We can store the results in a new variable.
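
The slicing rules above might be exercised like this, using made-up data. The variable names for the results are hypothetical.

```python
import numpy

# Made-up stand-in data: 3 rows, 4 columns.
data = numpy.array([[0.0, 1.0, 2.0, 3.0],
                    [0.0, 2.0, 4.0, 6.0],
                    [1.0, 1.0, 3.0, 5.0]])

# Rows 0 and 1 (the slice stops before 2), columns 1 and 2.
block = data[0:2, 1:3]

# Leaving out both ends of a slice selects everything along that dimension.
first_row = data[0, :]      # an entire row
last_column = data[:, 3]    # an entire column

print(block)
print(first_row)
print(last_column)
```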

## Mathematical operations on Numpy arrays ##

* Operators (+, -, \*, /) used on an array and a scalar are applied to each value of the array
* Operators used on arrays with other arrays of the same shape perform the operation element-wise
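
Both cases above in a short sketch with made-up values. The names `doubled` and `tripled` are chosen for illustration only.

```python
import numpy

# Made-up stand-in data: 2 rows, 3 columns.
data = numpy.array([[0.0, 1.0, 2.0],
                    [3.0, 4.0, 5.0]])

# Array and scalar: every value is multiplied by 2.
doubled = data * 2.0

# Array and array of the same shape: values are added position by position.
tripled = doubled + data

print(doubled)
print(tripled)
```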

## Numpy math operators ##

* Numpy offers extended math operations in the form of _functions_
* We provide input to functions through _arguments_
* Not all functions take arguments
* All function calls have `()` after the function name

Let's explore some common Numpy functions (`max()`, `min()`, `mean()`) using multiple assignment.  
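
A sketch of what that could look like with made-up data. Multiple assignment puts several names on the left of `=` and the same number of values on the right.

```python
import numpy

# Made-up stand-in data: 3 rows, 4 columns.
data = numpy.array([[0.0, 1.0, 2.0, 3.0],
                    [0.0, 2.0, 4.0, 6.0],
                    [1.0, 1.0, 3.0, 5.0]])

# Three values are computed and assigned to three variables in one statement.
maxval, minval, meanval = numpy.max(data), numpy.min(data), numpy.mean(data)

print('maximum:', maxval)
print('minimum:', minval)
print('average:', meanval)
```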


## Mystery functions in IPython ##

* Tab completion allows us to see which functions are available in a module
* The `?` gives us information about a particular function
* These features are only available in IPython (use the function `help(function)` in plain Python)

## Exploring our patient data ##

* Often times we are only concerned with subsets of data.
* We can store these data in a new variable, or, we could not.
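
Both options in a small sketch with made-up data: a subset can be stored under its own name first, or sliced directly inside the function call.

```python
import numpy

# Made-up stand-in data: each row is a patient, each column a day.
data = numpy.array([[0.0, 1.0, 2.0, 3.0],
                    [0.0, 2.0, 4.0, 6.0],
                    [1.0, 1.0, 3.0, 5.0]])

# Option 1: store the first patient's row in a new variable, then use it.
patient_0 = data[0, :]
max_patient_0 = numpy.max(patient_0)

# Option 2: skip the intermediate variable and slice inside the call.
max_patient_2 = numpy.max(data[2, :])

print('maximum for patient 0:', max_patient_0)
print('maximum for patient 2:', max_patient_2)
```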

## Commenting ##

* Sometimes it's useful to explain what we are doing in a piece of code, or why
* _Comments_ allow us to annotate our program 
* Everything after `#` is ignored by Python

## Performing row or column-wise operations ##

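* Specifying an axis lets us perform an operation along one dimension of the array: `axis=0` works down the rows and gives one result per column, while `axis=1` works across the columns and gives one result per row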

<img src="https://github.com/russodanielp/intro_cheminformatics/blob/master/Lab%2003%20-%20Basic%20Python/img/python-operations-across-axes.svg?raw=1">

What's the average across the rows?  

Are you sure?

The result `(40,)` tells us that this is a one-dimensional array with $N = 40$ elements. Similarly, we can perform operations on the other axis (columns).  Let's find the mean of the columns.
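
Checking the shape of the result is a quick way to confirm which axis an operation ran along. This sketch uses a small made-up array with 3 rows and 4 columns, so the shapes differ from those of the real data.

```python
import numpy

# Made-up stand-in data: 3 rows, 4 columns.
data = numpy.array([[0.0, 1.0, 2.0, 3.0],
                    [0.0, 2.0, 4.0, 6.0],
                    [1.0, 1.0, 3.0, 5.0]])

# axis=0 collapses the rows: one mean per column.
per_column = numpy.mean(data, axis=0)

# axis=1 collapses the columns: one mean per row.
per_row = numpy.mean(data, axis=1)

print(per_column.shape)   # one entry for each of the 4 columns
print(per_row.shape)      # one entry for each of the 3 rows
print(per_row)
```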

## Plotting with matplotlib ##

* Data visualization is the first step of data exploration
* matplotlib provides a lot of 'out-of-the-box' visualization tools
* It has support for rendering in the Jupyter Notebook

## Average Inflammation Over Time ##

## Max Inflammation Over Time ##

## Min Inflammation Over Time ##

## Bringing it all together ##

* Let's create a script from start to finish that will read our data then plot it
* We can group all of our plots into a single figure
    * `figure()` creates a space for our plots and the `figsize=` parameter sets how big to make it
    * We can add plots by using the `add_subplot()` function
    * We can set labels on different plots using the `set_ylabel()` or `set_xlabel()` functions
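
A sketch of such a script, under a few assumptions: the data values are made up rather than read from the inflammation file, the variable names (`fig`, `axes1`, and so on) are illustrative, and the `Agg` backend is selected only so the code runs without a display (drop that line inside a notebook).

```python
import matplotlib
matplotlib.use("Agg")  # assumption: no display available; remove this line in a notebook
import matplotlib.pyplot as plt
import numpy

# Made-up stand-in data: each row is a patient, each column a day.
data = numpy.array([[0.0, 1.0, 2.0, 3.0],
                    [0.0, 2.0, 4.0, 6.0],
                    [1.0, 1.0, 3.0, 5.0]])

# One figure, wide enough to hold three plots side by side.
fig = plt.figure(figsize=(10.0, 3.0))

# add_subplot(rows, columns, position) places each plot in a 1 x 3 grid.
axes1 = fig.add_subplot(1, 3, 1)
axes2 = fig.add_subplot(1, 3, 2)
axes3 = fig.add_subplot(1, 3, 3)

# axis=0 gives one value per day (column), computed over all patients.
axes1.set_ylabel('average')
axes1.plot(numpy.mean(data, axis=0))

axes2.set_ylabel('max')
axes2.plot(numpy.max(data, axis=0))

axes3.set_ylabel('min')
axes3.plot(numpy.min(data, axis=0))

# Adjust spacing so the axis labels do not overlap.
fig.tight_layout()
```

In a notebook the figure is rendered below the cell. In a standalone script, `fig.savefig(...)` or `plt.show()` would be needed to see it.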

Voilà!  There it is from start to finish, our plotted data.

__Note__: For long library names you can shorten what you have to type by using the following syntax:

```python
import numpy as np

np.loadtxt(...)
```

### Exercises ###

### Key Points ###

* Import a library into a program using `import libraryname`.
* Use the `numpy` library to work with arrays in Python.
* Use `variable = value` to assign a value to a variable in order to record it in memory.
* Variables are created on demand whenever a value is assigned to them.
* Use `print(something)` to display the value of `something`.
* The expression `array.shape` gives the shape of an array.
* Use `array[x, y]` to select a single element from an array.
* Array indices start at 0, not 1.
* Use `low:high` to specify a slice that includes the indices from `low` to `high-1`.
* All the indexing and slicing that works on arrays also works on strings.
* Use `# some kind of explanation` to add comments to programs.
* Use `numpy.mean(array)`, `numpy.max(array)`, and `numpy.min(array)` to calculate simple statistics.
* Use `numpy.mean(array, axis=0)` or `numpy.mean(array, axis=1)` to calculate statistics across the specified axis.
* Use the `pyplot` library from `matplotlib` for creating simple visualizations.