# Basic Python

## Aim of this lab

To learn the basics of Python Programming.  

### Objectives

* Learn the fundamentals of Programming and Python
* Learn basic programming fundamentals like variables and `for` loops
* Learn important data science packages like `pandas` and `matplotlib`


## The Python Programming Language

Python is one of the most popular programming languages.  Additionally, it is arguably the preferred language for doing data science, bioinformatics, and cheminformatics.  The other language in that argument, the R statistical programming language, is also quite popular and has many of the same benefits that make Python great for data science.  However, Python's growth in the last decade or so, coupled with its open-source community and many open-source libraries, particularly for the life sciences, has potentially given it the lead.  

Python and R are interpreted languages, rather than compiled languages (like C, C++, etc.).  In both cases, programming languages are used to instruct the computer what to do by getting converted to machine code.  With a compiled language, however, the code must first be compiled in one step and then run in another.  Interpreted languages, on the other hand, do the conversion at run time, allowing you to see the results immediately (at the cost of some performance, which is not necessarily a concern for data scientists, bioinformaticians and cheminformaticians).  This means you can quickly do computational tasks and inspect the output as you go.  It also allows for things like [Jupyter Notebooks](https://jupyter.org/), which is what we will use for this course.  

Jupyter Notebooks have been referred to as ["The Scientific Paper of the Future"](https://www.theatlantic.com/science/archive/2018/04/the-scientific-paper-is-obsolete/556676/).  They are widely used by data scientists and the like and allow for a combination of text, photos, plots, and other media with programming code (the name Jupyter refers to Julia, Python and R, the three core languages it was built to support).  They are great for tutorials, which is why we will use them in this course.  

One of Python's core strengths as a data science language is its strong libraries for data science, visualization and cheminformatics.  After exploring the basics of programming, we will overview four libraries. 

### Pandas

[Pandas](https://pandas.pydata.org/) is a data analytics library that allows for fast computation on data matrices, a core component of statistics and data science.  

### Matplotlib

[Matplotlib](https://matplotlib.org/) is a data visualization library for creating complex, scientific paper-quality graphs, figures and charts.  

### RDKit
[RDKit](https://www.rdkit.org/) is an excellent cheminformatics and computational chemistry library.  It will be explored in depth in a later lab. 

### Scikit-learn
[Scikit-learn](https://scikit-learn.org/stable/) is a library for building machine learning models and will be used in the module on QSAR. 



## The basics of Python programming

### Variables

* Information stored in variables is not static 
* It can be changed, manipulated, or updated to contain new information.  
* A variable does not have to be explicitly declared.  It can be the result of a Python expression.

#### Naming Variables
* Must start with a letter or the underscore character
* Cannot start with a number
* Can only contain alpha-numeric characters and underscores (A-z, 0-9, and _ )
* Variable names are case-sensitive (weight_kg, Weight_kg and WEIGHT_KG are three different variables)

* For simple values such as numbers and strings, changing a variable that was created from another variable doesn't change the variable it was created from
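
The points above can be sketched in a few lines.  The variable names and numbers here are illustrative assumptions, not part of the original lab.

```python
# Assign a value to a variable; no explicit declaration is needed
weight_kg = 60.0

# A variable can hold the result of a Python expression
weight_lb = 2.2 * weight_kg

# Updating the new variable does not change the one it was created from
weight_lb = weight_lb + 10

print(weight_kg)  # 60.0
print(weight_lb)
```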

### Data Types

#### Basic Data Types 

* variables can store different 'types' of data
* Strings, integers, floats are some common data types
* `type()` tells you which type a value or variable has

#### Strings

Characters, basically.  Use either ' or " to enclose.

#### Integers

Whole numbers: 1, 2, 3...535.

#### Floats

Real values.  Need to have a decimal place.
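
As a small sketch of the three basic types, the following uses `type()` on one value of each kind.  The variable names and values are illustrative assumptions.

```python
compound = "caffeine"   # string: characters enclosed in quotes
n_atoms = 24            # integer: a whole number
molar_mass = 194.19     # float: a number with a decimal place

print(type(compound))    # <class 'str'>
print(type(n_atoms))     # <class 'int'>
print(type(molar_mass))  # <class 'float'>
```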

### Operators

Which operators are available depends on the type.  Floats and integers can be mixed freely in arithmetic operations.  Strings can be combined with other strings, for example by joining them with `+`.

`+`, `-`, `*`, `/` behave as you would expect.
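
A short sketch of these operators on numbers and strings follows.  The names and values are illustrative assumptions.

```python
a = 7      # integer
b = 2.0    # float

# Integers and floats can be mixed; the result is a float
total = a + b        # 9.0
difference = a - b   # 5.0
product = a * b      # 14.0
quotient = a / b     # 3.5

# Strings can be joined to other strings with +
first = "chem"
second = "informatics"
joined = first + second

print(total, difference, product, quotient)
print(joined)  # cheminformatics
```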

#### Other Data Types

Two other data types are often used for storing information: lists and dictionaries. 

* Lists (also called arrays in some other languages) can store multiple items - they are ordered, so each item stays at the same position unless the list is changed.  
* Dictionaries store items through a key -> value structure.  Values are accessed via their respective keys rather than by position. 

### Lists

Lists are a helpful way to store information.  Lists can contain as many elements as we want and whatever data type we want, even multiple.  They can even store lists.

Elements in the list are ordered and can be accessed via their position or "index", which starts counting at zero. 
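
A minimal sketch of creating and indexing lists; the contents are illustrative assumptions.

```python
species = ["setosa", "versicolor", "virginica"]

# Indexing starts at zero
first = species[0]   # "setosa"
last = species[2]    # "virginica"

# A list can mix data types and even contain other lists
mixed = ["setosa", 5.1, 3, ["nested", "list"]]
inner = mixed[3][0]  # "nested"

print(first, last, inner)
```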

### Dictionaries 

Dictionaries store items as `key: value` pairs.  Items are accessed by key rather than by position.
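
A minimal sketch of a dictionary; the keys and values are illustrative assumptions.

```python
# Keys on the left of the colon, values on the right
petal_length = {"setosa": 1.4, "versicolor": 4.7}

# Look up a value by its key
setosa_length = petal_length["setosa"]

# Add a new key: value pair
petal_length["virginica"] = 6.0

print(setosa_length)       # 1.4
print(len(petal_length))   # 3
```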

### `for` Loops 

`for` loops allow for iterating over elements or repeating things.  
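
As a sketch, a `for` loop can visit each element of a list in turn and, for example, build up a running total.  The lists are illustrative assumptions.

```python
species = ["setosa", "versicolor", "virginica"]

# The indented block runs once for each element
for name in species:
    print(name)

# Accumulate a result across iterations
total = 0
for value in [1, 2, 3]:
    total = total + value

print(total)  # 6
```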

### Range 

The range function allows us to create ranges to iterate over.

__Note__: Finding the length of iterables is a common task in Python, so much so that there is a built-in function called `len()` that does just that.  
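
A short sketch combining `range()` and `len()`; the list is an illustrative assumption.

```python
# range(3) yields 0, 1, 2 (the stop value is not included)
numbers = list(range(3))
print(numbers)  # [0, 1, 2]

species = ["setosa", "versicolor", "virginica"]
n_species = len(species)
print(n_species)  # 3

# A common pattern: loop over the positions of a list
for i in range(len(species)):
    print(i, species[i])
```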

## Using External Libraries

* Code is reusable -- Many times simple programmatic tasks have been done before
    - No need to reinvent the wheel!
    
* Python libraries are organized code for reusability
    
Code (words) **-->** Functions/Methods **-->** Objects **-->** Modules **-->** Libraries

## Pandas

In order to use libraries we need to import them.  This is done through an import statement using the library name.  

Pandas uses what are called data frames.  Data frames are a pretty simple concept.  They are simply tables consisting of rows and columns, and are a very convenient way to store tabular data (such as that found in Excel).  

For example, we can look at the famous [iris dataset](https://en.wikipedia.org/wiki/Iris_flower_data_set).  The data set consists of 50 samples from each of three species of Iris (_Iris setosa_, _Iris virginica_ and _Iris versicolor_). Four features were measured from each sample: the length and the width of the sepals and petals, in centimeters. Based on the combination of these four features, Dr. Ronald Fisher developed a linear discriminant model to distinguish the species from each other, and the data has since become a widely used dataset for tutorials.  In this dataset, every row is a datapoint (the four measurements and the species). 

Pandas has a very helpful function called `read_csv`.  To access a function from a library, we write the name of the library after importing it, then a dot, then the name of the function we want to use.  We just need to tell it where the `csv` file we want to read is. 

__Note__: Instead of having to write `pandas` every time we need to use something from the `pandas` library, we can use the `as` keyword in the `import` statement to apply a shorthand.  Most people call `pandas` by the shortened `pd`. 
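
A minimal sketch of importing pandas and calling `read_csv`.  The lab's actual file location is not given here, so this example reads from an in-memory string as a stand-in for a file path.  The column names and the three rows are illustrative assumptions.

```python
import io

import pandas as pd

# Stand-in for a CSV file on disk; with a real file you would pass its
# path to pd.read_csv instead (column names here are assumed)
csv_text = """sepal_length,sepal_width,petal_length,petal_width,species
5.1,3.5,1.4,0.2,setosa
7.0,3.2,4.7,1.4,versicolor
6.3,3.3,6.0,2.5,virginica
"""

df = pd.read_csv(io.StringIO(csv_text))

# Show the first rows of the resulting dataframe
print(df.head())
```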

Pandas imports the text as a `DataFrame`.  Dataframes are like a new data type.  They are like a matrix but with labeled columns and rows (the row labels are called the `index`).  We can access both.  

You can access each row or column by name using the `loc` indexer.  Rows are first, columns are second.  The colon is used to get everything from that row/column.

There is label-based indexing using `.loc`, and there is positional indexing using `.iloc`.

Of course, we can give both a row and a column to get the value of a single cell.  For example, the second data point's species.

You can access the same information via positional indices using `iloc`, which is very useful for what's known as slicing.
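
The indexing described above can be sketched as follows.  The dataframe is built by hand from a few illustrative rows standing in for the iris data, and the column names are assumptions.

```python
import pandas as pd

# Illustrative stand-in for the iris dataframe (assumed column names)
df = pd.DataFrame({
    "sepal_length": [5.1, 4.9, 7.0, 6.4, 6.3, 5.8],
    "sepal_width": [3.5, 3.0, 3.2, 3.2, 3.3, 2.7],
    "petal_length": [1.4, 1.4, 4.7, 4.5, 6.0, 5.1],
    "petal_width": [0.2, 0.2, 1.4, 1.5, 2.5, 1.9],
    "species": ["setosa", "setosa", "versicolor",
                "versicolor", "virginica", "virginica"],
})

# Column labels and row labels (the index)
print(df.columns)
print(df.index)

# Label-based: row label first, column label second; ":" means everything
second_row = df.loc[1, :]
species_column = df.loc[:, "species"]
cell_by_label = df.loc[1, "species"]

# Position-based: the same cell via integer positions
cell_by_position = df.iloc[1, 4]

# Slicing by position: first two rows, first two columns
corner = df.iloc[0:2, 0:2]

print(cell_by_label, cell_by_position)
```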

Data frames have so many helpful methods that covering them all could require a full course.  Most of the time, whatever you're trying to do has already been done, and searching the internet is very helpful.  For example, we could average each column.  
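
As a sketch of averaging by column, the following uses `DataFrame.mean` on a hand-built stand-in for the iris data.  The rows and column names are illustrative assumptions.

```python
import pandas as pd

# Illustrative stand-in for the iris dataframe (assumed column names)
df = pd.DataFrame({
    "sepal_length": [5.1, 4.9, 7.0, 6.4, 6.3, 5.8],
    "petal_length": [1.4, 1.4, 4.7, 4.5, 6.0, 5.1],
    "species": ["setosa", "setosa", "versicolor",
                "versicolor", "virginica", "virginica"],
})

# Average each numeric column; numeric_only skips the text column
column_means = df.mean(numeric_only=True)
print(column_means)
```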

### Lists as indexers

You can also pass lists as indexers.
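
A minimal sketch of list indexers, again on a hand-built stand-in for the iris data (rows and column names are illustrative assumptions).

```python
import pandas as pd

# Illustrative stand-in for the iris dataframe (assumed column names)
df = pd.DataFrame({
    "sepal_length": [5.1, 4.9, 7.0, 6.4, 6.3, 5.8],
    "petal_length": [1.4, 1.4, 4.7, 4.5, 6.0, 5.1],
    "species": ["setosa", "setosa", "versicolor",
                "versicolor", "virginica", "virginica"],
})

# A list of row labels and a list of column labels
subset = df.loc[[0, 2], ["petal_length", "species"]]

# A list of column names on its own selects those columns
lengths = df[["sepal_length", "petal_length"]]

print(subset)
```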

### Masking

One of the most important and cool things about Pandas is the ability to mask.  

Let's say we wanted to get just the virginica data points from our dataframe.

The same would be the case if we wanted to get all flowers with a `petal_length` greater than some number.   
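
Both masks can be sketched as below.  The dataframe is a hand-built stand-in for the iris data, and the threshold of 4.6 is an arbitrary illustrative choice.

```python
import pandas as pd

# Illustrative stand-in for the iris dataframe (assumed column names)
df = pd.DataFrame({
    "petal_length": [1.4, 1.4, 4.7, 4.5, 6.0, 5.1],
    "petal_width": [0.2, 0.2, 1.4, 1.5, 2.5, 1.9],
    "species": ["setosa", "setosa", "versicolor",
                "versicolor", "virginica", "virginica"],
})

# A mask is a series of True/False values, one per row
is_virginica = df["species"] == "virginica"

# Indexing with the mask keeps only the rows where it is True
virginica = df[is_virginica]

# The same idea with a numeric comparison
long_petals = df[df["petal_length"] > 4.6]

print(virginica)
print(long_petals)
```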

### Calculating statistics

Dataframes have the ability to calculate statistics.  This can be done column-wise or row-wise, on the whole dataframe or just a subset.  

Sum by the row...

We can do things on individual columns

Or find the species counts...
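
The three operations above might look like the following sketch, using a hand-built stand-in for the iris data (rows and column names are illustrative assumptions).

```python
import pandas as pd

# Illustrative stand-in for the iris dataframe (assumed column names)
df = pd.DataFrame({
    "sepal_length": [5.1, 4.9, 7.0, 6.4, 6.3, 5.8],
    "sepal_width": [3.5, 3.0, 3.2, 3.2, 3.3, 2.7],
    "petal_length": [1.4, 1.4, 4.7, 4.5, 6.0, 5.1],
    "petal_width": [0.2, 0.2, 1.4, 1.5, 2.5, 1.9],
    "species": ["setosa", "setosa", "versicolor",
                "versicolor", "virginica", "virginica"],
})

# Keep only the numeric measurements
measurements = df.drop(columns="species")

# Sum across the columns of each row (axis=1 means row-wise)
row_sums = measurements.sum(axis=1)

# A statistic on an individual column
mean_petal_length = df["petal_length"].mean()

# How many rows belong to each species
species_counts = df["species"].value_counts()

print(row_sums)
print(mean_petal_length)
print(species_counts)
```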

### Math

You can do math operations on entire columns.  When the other operand is another dataframe column (of equal length), the operation is performed element-wise.  

Math using a scalar (a single number) applies the operation to every element of the column or dataframe. 

For example, we can convert the data from cm to inches by multiplying by 0.39.
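
A sketch of both cases: column-by-column math and scalar math.  The dataframe is a hand-built stand-in for the iris data, and dividing petal length by petal width is just an illustrative choice of operation.

```python
import pandas as pd

# Illustrative stand-in for the iris dataframe (assumed column names)
df = pd.DataFrame({
    "petal_length": [1.4, 1.4, 4.7, 4.5, 6.0, 5.1],
    "petal_width": [0.2, 0.2, 1.4, 1.5, 2.5, 1.9],
    "species": ["setosa", "setosa", "versicolor",
                "versicolor", "virginica", "virginica"],
})

# Column with column: computed element-wise, row by row
length_to_width = df["petal_length"] / df["petal_width"]

# Column with scalar: every element is multiplied by 0.39 (cm to inches)
petal_length_in = df["petal_length"] * 0.39

# The same scalar conversion on all numeric columns at once
measurements_in = df.drop(columns="species") * 0.39

print(length_to_width)
print(measurements_in)
```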

### Creating new columns
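
A new column is created by assigning to a column name that does not exist yet.  The sketch below uses a hand-built stand-in for the iris data; the column name `petal_area` and the length-times-width calculation are hypothetical, chosen only for illustration.

```python
import pandas as pd

# Illustrative stand-in for the iris dataframe (assumed column names)
df = pd.DataFrame({
    "petal_length": [1.4, 4.7, 6.0],
    "petal_width": [0.2, 1.4, 2.5],
    "species": ["setosa", "versicolor", "virginica"],
})

# Assigning to a new column name adds the column to the dataframe
df["petal_area"] = df["petal_length"] * df["petal_width"]

print(df)
```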

Pandas is a very useful tool for any scientist.  There are several tutorials available online [here](https://pandas.pydata.org/pandas-docs/stable/getting_started/tutorials.html).  

## Matplotlib 

Matplotlib is an extremely powerful plotting library in Python.  Another very useful library is [Seaborn](https://seaborn.pydata.org/), which is also very powerful.  It is built on top of matplotlib and offers high-level plotting for some common statistical plots. 

Matplotlib allows more control, and therefore we will demonstrate the basics here.  It has support for rendering in Jupyter Notebooks.  We need to turn this on by running a cell with `%matplotlib inline`.

Plotting is done by creating `fig` and `ax` objects.  The `fig` object controls more meta-level attributes of the final figure.  The `ax` is what we use to plot elements.  We do that by providing the data, what type of data point we want to plot, and on what axis (x is first, y is second).  

We can add multiple data and plot types to a single `ax`.  Once we are finished, we call `plt.show()` to visualize.  
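
A minimal sketch of the `fig`/`ax` workflow, using `plt.subplots()` to create both objects.  The data, figure size, and labels are illustrative assumptions.

```python
import matplotlib.pyplot as plt

# Create the figure and a single set of axes
fig, ax = plt.subplots(figsize=(6, 4))

x = [1, 2, 3, 4]
y = [1, 4, 9, 16]

# x is the first argument, y is the second
ax.plot(x, y)       # a line through the points
ax.scatter(x, y)    # the points themselves, on the same ax

ax.set_xlabel("x")
ax.set_ylabel("y")

# In a script, finish with plt.show() to display the figure;
# with %matplotlib inline the figure renders when the cell ends
```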

### Scatter

Colors can be set by providing the color that each data point should have.  We can choose the colors in a dictionary and "map" them onto a series.
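
A sketch of the dictionary-and-map approach, on a hand-built stand-in for the iris data.  The column names and the particular color choices are illustrative assumptions.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Illustrative stand-in for the iris dataframe (assumed column names)
df = pd.DataFrame({
    "petal_length": [1.4, 1.4, 4.7, 4.5, 6.0, 5.1],
    "petal_width": [0.2, 0.2, 1.4, 1.5, 2.5, 1.9],
    "species": ["setosa", "setosa", "versicolor",
                "versicolor", "virginica", "virginica"],
})

# One color per species
colors = {
    "setosa": "tab:blue",
    "versicolor": "tab:orange",
    "virginica": "tab:green",
}

# map() replaces each species name with its color, one entry per row
point_colors = df["species"].map(colors)

fig, ax = plt.subplots()
ax.scatter(df["petal_length"], df["petal_width"], c=point_colors)
ax.set_xlabel("petal length (cm)")
ax.set_ylabel("petal width (cm)")
```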

### Bar

Most of the time, making the right plot is just making sure the data is in the right format.  

We can plot a bar graph of petal length in descending order.  
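
One way to sketch this is to sort the dataframe first and then plot the sorted values.  The dataframe is a hand-built stand-in for the iris data with assumed column names.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Illustrative stand-in for the iris dataframe (assumed column names)
df = pd.DataFrame({
    "petal_length": [1.4, 1.4, 4.7, 4.5, 6.0, 5.1],
    "species": ["setosa", "setosa", "versicolor",
                "versicolor", "virginica", "virginica"],
})

# Put the data in the right format: longest petals first
sorted_df = df.sort_values("petal_length", ascending=False)
heights = sorted_df["petal_length"].to_numpy()

fig, ax = plt.subplots()

# One bar per flower, positioned 0, 1, 2, ... from left to right
ax.bar(range(len(heights)), heights)
ax.set_ylabel("petal length (cm)")
```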

### Line

Or the same plot as a line graph.
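
A sketch of the line version, swapping `ax.bar` for `ax.plot` on the same sorted values.  The dataframe is a hand-built stand-in for the iris data with assumed column names.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Illustrative stand-in for the iris dataframe (assumed column names)
df = pd.DataFrame({
    "petal_length": [1.4, 1.4, 4.7, 4.5, 6.0, 5.1],
})

# Sort so the longest petals come first
heights = df.sort_values("petal_length", ascending=False)["petal_length"].to_numpy()

fig, ax = plt.subplots()

# Connect the sorted values with a line
ax.plot(range(len(heights)), heights)
ax.set_ylabel("petal length (cm)")
```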

### Line+scatter

Matplotlib becomes really powerful when plots are layered on top of each other.  We can add a scatter plot to the line graph.  
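
Layering can be sketched by calling two plotting methods on the same `ax`.  The dataframe is a hand-built stand-in for the iris data with assumed column names, and the marker color is an arbitrary choice.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Illustrative stand-in for the iris dataframe (assumed column names)
df = pd.DataFrame({
    "petal_length": [1.4, 1.4, 4.7, 4.5, 6.0, 5.1],
})

heights = df.sort_values("petal_length", ascending=False)["petal_length"].to_numpy()
positions = range(len(heights))

fig, ax = plt.subplots()

# First layer: the line
ax.plot(positions, heights)

# Second layer: a marker at each data point, drawn on the same ax
ax.scatter(positions, heights, color="tab:red")
ax.set_ylabel("petal length (cm)")
```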

### Wrapping up

Matplotlib is complex.  Inspiration usually comes from exploring the [gallery](https://matplotlib.org/2.0.2/gallery.html) for ideas, or from having an idea of what you want to do and searching either the [documentation](https://matplotlib.org/2.0.2/index.html) or the internet for someone who has done what you are trying to do before.  

We only briefly introduced matplotlib here.  I suggest looking at the [tutorials](https://matplotlib.org/stable/tutorials/introductory/usage.html#sphx-glr-tutorials-introductory-usage-py) for its full capability.  