# Python basics continued
See [here](https://github.com/fomightez/Python_basics_4nanocourse) for information about this notebook.

------

<div class="alert alert-block alert-warning">
<p>If you haven't used one of these notebooks before, they're basically web pages in which you can write, edit, and run live code. They're meant to encourage experimentation, so don't feel nervous. Just try running a few cells and see what happens!</p>

<p>
    Some tips:
    <ul>
        <li>Code cells have boxes around them.</li>
        <li>To run a code cell either click the Play icon on the menu bar above, or click on the cell and then hit <b>Shift+Enter</b>. The <b>Shift+Enter</b> combo will also move you to the next cell, so it's a quick way to work through the notebook.</li>
        <li>While a cell is running a <b>*</b> appears in the square brackets next to the cell. Once the cell has finished running the asterisk will be replaced with a number.</li>
        <li>In most cases you'll want to start from the top of notebook and work your way down running each cell in turn. Later cells might depend on the results of earlier ones.</li>
        <li>To edit a code cell, just click on it and type stuff. Remember to run the cell once you've finished editing.</li>
    </ul>
</p>
</div>


These sessions are temporary and will time out after ten minutes of inactivity. However, a safety net is built in if you start doing serious work in these. Even if it times out, you can save the notebook and upload it again later. See the notebook [safety net demo](safety%20net%20demo.ipynb) so that you are prepared for when it happens. You won't be able to open the demo if it has already happened.

----



WHEN TO CODE


Don't forget regular expressions and basic shell commands can handle a lot of your text and file manipulation needs.
<br></br> 
<br></br>   
When you reach the limit of what you can easily do with those, then you need to code.


RELATED DEMOS:
    
- Regular expression use to edit a file

- As an example of using terminal / command line / bash / shell / unix, use JupyterLab

Regular Expression demonstration.
    
Work through the 'Pre-processing this PDB file to look more typical using regular expressions' section from session #3 (skipped then in the interest of time).

## Pre-processing this PDB file to look more typical using regular expressions

Use a file derived from [PDB entry for 3hyd](https://files.rcsb.org/view/3hyd.pdb) to see about columns.

We saw 3hyd in the electron density cloud to isomesh demo today.

[PDB entry for 3hyd](https://files.rcsb.org/view/3hyd.pdb) was modified to remove the anisotropic temperature factor lines. They begin with `ANISOU`. 

To remove those lines from the original [3hyd.pdb](3hyd.pdb) file, I used regular expressions in a text editor (not Microsoft Word). This is like a fancy 'Find and replace'.  
We could use Python for this data pre-processing step, but sometimes there is an easier way.

The specifics are spelled out in [How_made_without_ANISOU_lines.md](How_made_without_ANISOU_lines.md). 

Let's use a fragment of the PDB file to see how it is done. We'll use [REGEX101](https://regex101.com/) to demonstrate right now.

Performing that process on `3hyd.pdb` results in [3hydWITHOUTanisouLINES.pdb](3hydWITHOUTanisouLINES.pdb).
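
For comparison, the same 'find and replace' idea can be sketched in Python with the standard `re` module. This is only an illustration under assumptions, not the procedure recorded in `How_made_without_ANISOU_lines.md`: the pattern is an assumed equivalent, and the numbers on the `ANISOU` lines of the fragment are invented placeholders.

```python
import re

# A tiny stand-in for a PDB fragment; the ANISOU values are invented placeholders.
fragment = (
    "ATOM      1  N   LEU A   1       1.149   1.920   3.550  1.00  5.65           N\n"
    "ANISOU    1  N   LEU A   1      100    200    300     10     20     30       N\n"
    "ATOM      2  CA  LEU A   1       2.138   2.288   4.580  1.00  5.04           C\n"
    "ANISOU    2  CA  LEU A   1      110    210    310     11     21     31       C\n"
)

# ^ anchors at the start of each line (because of re.MULTILINE),
# .* takes the rest of that line, and \n removes the line break as well.
cleaned = re.sub(r"^ANISOU.*\n", "", fragment, flags=re.MULTILINE)

print(cleaned)
```

Only the two `ATOM` lines should remain. In a text editor with regular expressions switched on, you would put the pattern in the 'Find' box and leave 'Replace' empty.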

### ALSO DO THIS IN JUPYTERLAB editor today

RELATED DEMOS:
    
- Regular expression use to edit a file

- As an example of using **terminal / command line / bash / shell / unix**, use JupyterLab. 
    
    - How do you find the terminal in JupyterLab? 
    - How do you list the files? 
    - Example of writing text to file. 
    - You can loop, use wild-cards, and all sorts of stuff we don't have time to really show in this nanocourse.

    Use a loop to get multiple PDB files. We covered using the `curl` shell command last time. Here is an example that fetches multiple PDB files; paste it into a terminal and run it by hitting return:
    
    ```bash
    for id in 1avw 1d66 1trn 1f8a 1pin
    do
        curl -OL https://files.rcsb.org/download/$id.pdb
    done
    ```



Why Python?

- high-level language where code is similar to written English/pseudocode
- easy to learn and powerful
- large community & open source, with many features well developed
- no compiling needed

Drawbacks?

- zero-indexing (We're more used to referring to the item in the first position as `#1`)
- white space / formatting matters
- Python 2 vs 3 controversy (resolved now: Use 3!)
- despite being a high-level language, syntax still matters
- dynamic typing (which goes along with 'duck typing') (But you can add static type hints now!)

Other options?

R is great for statistics.  
Since a lot of high-throughput Illumina sequencing and single-cell sequencing analysis relies on statistics for interpretation, you'll see a lot of analysis packages built in R.

There are many other options as well, such as Julia or Rust.

Where?

For today, we're using remote machines with JupyterLab.  
(Note that the content is hosted on GitHub and I use git to keep it version controlled from my local computer.)  
Added some useful features you won't find in Jupyter sessions everywhere:

- safety net for timeouts

- visual debugger, see [debugger demo notebook](debugger_demo.ipynb) and [here](https://github.com/jupyterlab/debugger) for a sense of how to use it. It needs the special XeusPython kernel, so you cannot use it in every notebook immediately.

- Table of Contents extension

Safety net demonstration.

Open [safety net demo](safety%20net%20demo.ipynb) notebook.

<b>More on 'where?'</b>

You'll want to use a dedicated text editor. (Or an Integrated Development Environment (IDE).) 
This means it has special settings to recognize code based on extensions and helps format ('spaces' vs. 'tabs', anyone?) and auto-complete correctly.  
It also **usually means they have regular expression ability built right in**. (But some I'm showing below are just editors without that. Stick with first five if you want locally-installed software with regular expressions built in.)  
Nothing related to Microsoft unless it is [VSCode](https://code.visualstudio.com/) ; [try in your browser via Binder](https://github.com/betatim/vscode-binder/tree/master).

A few possibilities:
- [VSCode](https://code.visualstudio.com/) ; [try in your browser via Binder](https://github.com/betatim/vscode-binder/tree/master)
- Sublime Text (Note `.*` button toggled on in lower left [here](https://sheabunge.com/wp-content/uploads/sublime-text-comment-regex.png).)
- [Atom](https://atom.io/) (a regular expression button similar to the one in Sublime Text, see [here](http://2017.compciv.org/guide/topics/end-user-software/atom/how-to-use-regex-atom.html))
- [Theia](https://theia-ide.org/) ; [try in your browser via Binder](https://github.com/betatim/theia-binder)
- [Notepad++](https://notepad-plus-plus.org/downloads/)
- [PythonAnywhere](https://www.pythonanywhere.com/) (only online)
- [Gitpod](https://www.gitpod.io/) (only online and only for GitHub or GitLab, but same people make Theia)
- PyCharm (popular IDE)
- Jupyter (We'll use again today)


<b>Final(?): more on 'where?'</b>
    
Alaji suggested that in session #4 on April 23rd I'd help you install locally, but I actually more often install on remote systems that can be accessed from whatever local machine I happen to be on.

- [MyBinder](https://mybinder.org/)
- [Cyverse](https://cyverse.org/)

We'll soon see an example like that, which I made for Nathan to run large batches of analyses, typically 700 sequences.
    
If you do want to install locally, there are ways to use **conda** environments or **virtual Python environments (venvs)** on your local system so that you don't mess up your other directly installed software. Alternatively, you can install into a Docker container that you run locally. ([MyBinder](https://mybinder.org/), and a lot of other tech these days, actually includes Docker somewhere in the stack.)
    
The important point is that there are now a number of ways to safely attempt to install things with little chance of messing up your system. However, installing to what will be spun up as a remote session via Binder has the advantage of portability and reproducibility over anything local when you don't need to deal with massive amounts of data. If you need a step up from MyBinder, [Cyverse](https://cyverse.org/) is an NSF-backed, managed version of Jetstream/XSEDE-like systems that also lets you run software and notebooks that spin up on remote systems. You just have to be a scientist, and you can easily get an account that gives you a monthly allocation. You can apply to have the allocation increased further if you have substantial needs. 
    
I'd be happy to talk with anyone more about these options down the road.

**Final(?): more on 'why Python (or R)?'**
    
Open source will serve you better in the long run than proprietary/license-based/subscription-based software.
    
If you learn MATLAB, then when you move on to another lab or workplace you have to hope they have a license, or pay for one yourself. (The skills and concepts picked up by learning MATLAB would still serve you well, though.)
    
Open source is also more portable between different computers in the short term.


-------


## Starting simple 


### Assigned edited text to a variable

We'll use the ATOM section of `3hydWITHOUTanisouLINES.pdb` to define `t`.

In [None]:
t='''ATOM      1  N   LEU A   1       1.149   1.920   3.550  1.00  5.65           N  
ATOM      2  CA  LEU A   1       2.138   2.288   4.580  1.00  5.04           C  
ATOM      3  C   LEU A   1       3.461   1.638   4.282  1.00  3.88           C  
ATOM      4  O   LEU A   1       3.527   0.405   4.165  1.00  4.79           O  
ATOM      5  CB  LEU A   1       1.635   1.889   5.948  1.00  6.19           C  
ATOM      6  CG  LEU A   1       2.444   2.344   7.182  1.00 10.41           C  
ATOM      7  CD1 LEU A   1       1.603   2.227   8.438  1.00 18.81           C  
ATOM      8  CD2 LEU A   1       3.699   1.583   7.375  1.00 10.45           C  
ATOM      9  H1  LEU A   1       1.127   0.953   3.458  1.00  5.40           H  
ATOM     10  H2  LEU A   1       0.274   2.239   3.813  1.00  5.15           H  
ATOM     11  H3  LEU A   1       1.404   2.323   2.704  1.00  5.24           H  
ATOM     12  HA  LEU A   1       2.249   3.258   4.575  1.00  4.84           H  
ATOM     13  HB2 LEU A   1       0.742   2.251   6.048  1.00  6.49           H  
ATOM     14  HB3 LEU A   1       1.585   0.920   5.978  1.00  6.58           H  
ATOM     15  HG  LEU A   1       2.680   3.278   7.071  1.00 10.84           H  
ATOM     16 HD11 LEU A   1       2.094   2.588   9.181  1.00 14.99           H  
ATOM     17 HD12 LEU A   1       1.404   1.298   8.594  1.00 15.61           H  
ATOM     18 HD13 LEU A   1       0.787   2.722   8.316  1.00 15.30           H  
ATOM     19 HD21 LEU A   1       3.547   0.650   7.120  1.00 10.08           H  
ATOM     20 HD22 LEU A   1       3.964   1.634   8.317  1.00 10.06           H  
ATOM     21 HD23 LEU A   1       4.396   1.980   6.816  1.00  9.93           H  
ATOM     22  N   VAL A   2       4.521   2.434   4.230  1.00  3.54           N  
ATOM     23  CA  VAL A   2       5.866   1.930   4.058  1.00  3.13           C  
ATOM     24  C   VAL A   2       6.790   2.592   5.064  1.00  3.68           C  
ATOM     25  O   VAL A   2       6.806   3.837   5.179  1.00  4.26           O  
ATOM     26  CB  VAL A   2       6.425   2.168   2.650  1.00  3.78           C  
ATOM     27  CG1 VAL A   2       7.834   1.667   2.541  1.00  5.01           C  
ATOM     28  CG2 VAL A   2       5.522   1.554   1.584  1.00  5.05           C  
ATOM     29  H   VAL A   2       4.482   3.290   4.290  1.00  3.28           H  
ATOM     30  HA  VAL A   2       5.875   0.967   4.219  1.00  3.38           H  
ATOM     31  HB  VAL A   2       6.447   3.134   2.489  1.00  3.90           H  
ATOM     32 HG11 VAL A   2       8.026   1.465   1.622  1.00  4.53           H  
ATOM     33 HG12 VAL A   2       7.933   0.873   3.073  1.00  4.63           H  
ATOM     34 HG13 VAL A   2       8.436   2.348   2.851  1.00  4.60           H  
ATOM     35 HG21 VAL A   2       5.433   0.614   1.755  1.00  4.78           H  
ATOM     36 HG22 VAL A   2       5.916   1.693   0.720  1.00  4.65           H  
ATOM     37 HG23 VAL A   2       4.662   1.976   1.619  1.00  4.78           H  
ATOM     38  N   GLU A   3       7.567   1.777   5.771  1.00  2.92           N  
ATOM     39  CA AGLU A   3       8.674   2.248   6.587  0.50  3.42           C  
ATOM     40  CA BGLU A   3       8.670   2.254   6.575  0.50  3.37           C  
ATOM     41  C   GLU A   3       9.920   1.636   5.964  1.00  2.87           C  
ATOM     42  O   GLU A   3      10.041   0.405   5.901  1.00  3.28           O  
ATOM     43  CB AGLU A   3       8.541   1.813   8.049  0.50  3.31           C  
ATOM     44  CB BGLU A   3       8.513   1.848   8.037  0.50  3.20           C  
ATOM     45  CG AGLU A   3       9.625   2.367   8.970  0.50  3.86           C  
ATOM     46  CG BGLU A   3       9.663   2.264   8.926  0.50  4.43           C  
ATOM     47  CD AGLU A   3       9.581   1.798  10.378  0.50  4.30           C  
ATOM     48  CD BGLU A   3       9.372   2.079  10.393  0.50  7.27           C  
ATOM     49  OE1AGLU A   3       9.326   0.609  10.555  0.50  4.93           O  
ATOM     50  OE1BGLU A   3       8.211   1.826  10.769  0.50  8.58           O  
ATOM     51  OE2AGLU A   3       9.823   2.595  11.317  0.50  5.24           O  
ATOM     52  OE2BGLU A   3      10.301   2.252  11.187  0.50  5.38           O  
ATOM     53  H   GLU A   3       7.465   0.924   5.796  1.00  3.13           H  
ATOM     54  HA AGLU A   3       8.739   3.223   6.557  0.50  3.34           H  
ATOM     55  HA BGLU A   3       8.736   3.228   6.532  0.50  3.32           H  
ATOM     56  HB2AGLU A   3       7.685   2.119   8.386  0.50  3.51           H  
ATOM     57  HB2BGLU A   3       7.707   2.257   8.388  0.50  3.63           H  
ATOM     58  HB3AGLU A   3       8.581   0.845   8.093  0.50  3.50           H  
ATOM     59  HB3BGLU A   3       8.438   0.883   8.085  0.50  3.60           H  
ATOM     60  HG2AGLU A   3      10.496   2.159   8.600  0.50  3.92           H  
ATOM     61  HG2BGLU A   3      10.441   1.727   8.712  0.50  4.83           H  
ATOM     62  HG3AGLU A   3       9.518   3.329   9.035  0.50  4.00           H  
ATOM     63  HG3BGLU A   3       9.855   3.203   8.775  0.50  4.81           H  
ATOM     64  N   ALA A   4      10.842   2.465   5.490  1.00  2.75           N  
ATOM     65  CA  ALA A   4      12.006   2.016   4.750  1.00  3.09           C  
ATOM     66  C   ALA A   4      13.270   2.689   5.248  1.00  2.94           C  
ATOM     67  O   ALA A   4      13.284   3.899   5.529  1.00  3.58           O  
ATOM     68  CB  ALA A   4      11.833   2.220   3.258  1.00  4.15           C  
ATOM     69  H   ALA A   4      10.814   3.320   5.588  1.00  2.76           H  
ATOM     70  HA  ALA A   4      12.112   1.056   4.889  1.00  3.11           H  
ATOM     71  HB1 ALA A   4      11.060   1.734   2.964  1.00  3.79           H  
ATOM     72  HB2 ALA A   4      12.615   1.896   2.807  1.00  3.83           H  
ATOM     73  HB3 ALA A   4      11.717   3.157   3.082  1.00  3.98           H  
ATOM     74  N   LEU A   5      14.334   1.918   5.332  1.00  2.99           N  
ATOM     75  CA  LEU A   5      15.634   2.369   5.796  1.00  3.30           C  
ATOM     76  C   LEU A   5      16.689   1.821   4.849  1.00  3.28           C  
ATOM     77  O   LEU A   5      16.716   0.605   4.614  1.00  3.43           O  
ATOM     78  CB  LEU A   5      15.875   1.877   7.209  1.00  4.50           C  
ATOM     79  CG  LEU A   5      17.250   2.151   7.852  1.00  9.40           C  
ATOM     80  CD1 LEU A   5      17.789   3.495   7.677  1.00  9.16           C  
ATOM     81  CD2 LEU A   5      17.128   1.821   9.337  1.00 11.72           C  
ATOM     82  H   LEU A   5      14.330   1.085   5.118  1.00  2.86           H  
ATOM     83  HA  LEU A   5      15.679   3.347   5.792  1.00  3.38           H  
ATOM     84  HB2 LEU A   5      15.205   2.260   7.785  1.00  4.85           H  
ATOM     85  HB3 LEU A   5      15.765   0.916   7.196  1.00  4.98           H  
ATOM     86  HG  LEU A   5      17.893   1.532   7.474  1.00  8.68           H  
ATOM     87 HD11 LEU A   5      18.580   3.588   8.214  1.00  8.30           H  
ATOM     88 HD12 LEU A   5      17.128   4.128   7.958  1.00  8.71           H  
ATOM     89 HD13 LEU A   5      18.004   3.632   6.753  1.00  8.61           H  
ATOM     90 HD21 LEU A   5      16.648   2.528   9.774  1.00 10.72           H  
ATOM     91 HD22 LEU A   5      18.009   1.743   9.711  1.00 10.33           H  
ATOM     92 HD23 LEU A   5      16.655   0.991   9.440  1.00 10.20           H  
ATOM     93  N   TYR A   6      17.519   2.699   4.281  1.00  3.06           N  
ATOM     94  CA  TYR A   6      18.548   2.342   3.321  1.00  3.08           C  
ATOM     95  C   TYR A   6      19.876   2.878   3.804  1.00  3.48           C  
ATOM     96  O   TYR A   6      20.024   4.082   3.994  1.00  4.60           O  
ATOM     97  CB  TYR A   6      18.275   2.935   1.947  1.00  3.85           C  
ATOM     98  CG  TYR A   6      16.873   2.775   1.400  1.00  3.42           C  
ATOM     99  CD1 TYR A   6      15.855   3.614   1.819  1.00  3.67           C  
ATOM    100  CD2 TYR A   6      16.556   1.796   0.450  1.00  3.75           C  
ATOM    101  CE1 TYR A   6      14.558   3.509   1.303  1.00  3.54           C  
ATOM    102  CE2 TYR A   6      15.281   1.695  -0.071  1.00  3.30           C  
ATOM    103  CZ  TYR A   6      14.283   2.550   0.357  1.00  3.37           C  
ATOM    104  OH  TYR A   6      13.017   2.418  -0.175  1.00  3.64           O  
ATOM    105  H   TYR A   6      17.497   3.543   4.445  1.00  2.94           H  
ATOM    106  HA  TYR A   6      18.611   1.369   3.230  1.00  3.22           H  
ATOM    107  HB2 TYR A   6      18.454   3.887   1.978  1.00  3.50           H  
ATOM    108  HB3 TYR A   6      18.879   2.517   1.315  1.00  3.43           H  
ATOM    109  HD1 TYR A   6      16.041   4.275   2.443  1.00  3.43           H  
ATOM    110  HD2 TYR A   6      17.223   1.227   0.139  1.00  3.37           H  
ATOM    111  HE1 TYR A   6      13.891   4.088   1.592  1.00  3.38           H  
ATOM    112  HE2 TYR A   6      15.090   1.043  -0.705  1.00  3.24           H  
ATOM    113  HH  TYR A   6      12.808   3.103  -0.576  1.00  3.44           H  
ATOM    114  N   LEU A   7      20.857   2.006   3.973  1.00  4.38           N  
ATOM    115  CA  LEU A   7      22.208   2.473   4.312  1.00  6.08           C  
ATOM    116  C   LEU A   7      23.293   1.644   3.744  1.00  7.46           C  
ATOM    117  O   LEU A   7      23.018   0.697   3.000  1.00 13.45           O  
ATOM    118  CB  LEU A   7      22.356   2.753   5.793  1.00 11.09           C  
ATOM    119  CG  LEU A   7      22.263   1.578   6.717  1.00 11.34           C  
ATOM    120  CD1 LEU A   7      22.913   1.965   8.038  1.00 18.81           C  
ATOM    121  CD2 LEU A   7      20.853   1.118   7.009  1.00 11.10           C  
ATOM    122  OXT LEU A   7      24.470   1.913   3.995  1.00  9.60           O  
ATOM    123  H   LEU A   7      20.779   1.153   3.901  1.00  4.08           H  
ATOM    124  HA  LEU A   7      22.322   3.342   3.880  1.00  6.70           H  
ATOM    125  HB2 LEU A   7      23.219   3.171   5.933  1.00  9.43           H  
ATOM    126  HB3 LEU A   7      21.660   3.376   6.052  1.00  9.61           H  
ATOM    127  HG  LEU A   7      22.755   0.830   6.347  1.00 11.80           H  
ATOM    128 HD11 LEU A   7      23.845   2.140   7.890  1.00 14.43           H  
ATOM    129 HD12 LEU A   7      22.815   1.242   8.663  1.00 14.59           H  
ATOM    130 HD13 LEU A   7      22.483   2.753   8.379  1.00 14.87           H  
ATOM    131 HD21 LEU A   7      20.870   0.474   7.720  1.00 10.59           H  
ATOM    132 HD22 LEU A   7      20.484   0.716   6.220  1.00 10.51           H  
ATOM    133 HD23 LEU A   7      20.323   1.876   7.268  1.00 10.64           H  '''

## Indexing and `if` / conditional

Splitting on the newline character `\n` makes a list of the lines.

In [None]:
#Let's parse the text on line breaks
for line in t.split("\n"):
    print(line)
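
To see what `split("\n")` hands back without scrolling through all of `t`, here is a minimal sketch on a three-line stand-in string. The name `sample` and its contents are made up for illustration.

```python
# A short stand-in for the long string `t`
sample = "first line\nsecond line\nthird line"

lines = sample.split("\n")  # a list with one item per line
print(lines)
print(len(lines))

# splitlines() gives the same list here and also copes with a trailing newline
print(sample.splitlines() == lines)
```

The `for` loop above is stepping through a list like this one item at a time.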

In [None]:
#Let's examine the 17th column
for line in t.split("\n"):
    print(line[16])

In [None]:
#Let's only show those that have something in the 17th column
for line in t.split("\n"):
    if line[16] != ' ':
        print(line[16])

Note the term `line` in the code isn't magical. We could have used any variable in place of `line` because it is just iterating over the items in the list made by the `split()` applied to the variable `t` which contains the text listed above. 

Let's use `x` instead to show we see the same thing:

In [None]:
#Let's only show those that have something in the 17th column
for x in t.split("\n"):
    if x[16] != ' ':
        print(x[16])

## Zero Indexing of Python and Slicing

Python is simply assigning to `x` each individual element in the list made by `t.split("\n")`.  
By itself that isn't that shocking, but that pattern of Python automatically taking the individual elements from an iterable object is useful. We'll see it soon when we read a file object: there the lines are provided without us needing to split on the end-of-line representation `\n`.

But why are we using the number `16` when we talked about the 17th column?

Let's explore indexing in Python.

Let's look at what numbers print if we iterate over a `range` of integers. Note that `range` is green because it is a Python built-in, whereas `x` and `line`, the variable names we used above, weren't green.

In [None]:
for x in range(5):
    print(x)

Okay. We get five numbers. However, the first one is zero and not `1` as we saw at [PDB ATOMIC COORDINATE FILE FORMAT](https://zhanglab.ccmb.med.umich.edu/BindProfX/pdb_atom_format.html).

This is because Python is zero indexed. That may not seem like the normal way to count, but you may recall that number lines in grade school sometimes begin with zero. Branches of mathematics and computer science often use zero as the index of the first item in a list, and this is what Python uses. (The statistics language R uses 1-based indexing.) 

We'll see zero indexing has some advantages as we go along. It may take until session #5 to really see the benefits.
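
A small sketch of zero indexing on a list may help before going back to the PDB lines. The list `residues` is written out by hand here from the seven residues in `t`; the variable names are only for illustration.

```python
residues = ["LEU", "VAL", "GLU", "ALA", "LEU", "TYR", "LEU"]  # residues 1-7 in `t`

print(list(range(5)))  # the integers that range(5) produces
print(residues[0])     # the first item has index 0
print(residues[6])     # the seventh item has index 6
print(len(residues))   # 7 items, so the valid indexes are 0 through 6

# enumerate() pairs each item with a counter; start=1 gives human-style numbering
for position, name in enumerate(residues, start=1):
    print(position, name)
```

So the item we would call number `n` sits at index `n - 1`.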

Let's see if `line[16]` corresponding to the 17th column makes more sense now.

One way to look at it:

In [None]:
#Let's only show those that have something in the 17th column
for line in t.split("\n"):
    if line[17-1] != ' ':
        print(line[17-1]) # to account for zero indexing, we want one less than 17
        print(line[0:17])

`print(line[0:17])` is slicing the line and not printing a single character at a specified index.

Note the syntax is a little different than you might expect: it says start at index zero (the number on the left) and go up to **BUT NOT INCLUDING** the character at the index on the right. So `line[0:17]` covers indexes 0 through 16, which are columns 1 through 17. We'll cover slicing some more in Session #5, as it can be a little odd at first and there are shortcuts you can use. However, you should see it easily allows you to access items spanning particular numbered columns, as we saw in [the PDB specification](https://zhanglab.ccmb.med.umich.edu/BindProfX/pdb_atom_format.html).
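
As a sketch of how slicing maps onto the numbered columns of the PDB format, the example below pulls several fields out of one line copied from `t`. The variable names are made up for illustration; the column ranges follow the ATOM record layout in the PDB specification.

```python
# One line copied from `t` above (alternate location 'A' of GLU 3)
line = "ATOM     39  CA AGLU A   3       8.674   2.248   6.587  0.50  3.42           C  "

# PDB column numbers start at 1, so columns M-N become the slice [M-1:N]
record_name = line[0:6].strip()   # columns 1-6
atom_name = line[12:16].strip()   # columns 13-16
alt_loc = line[16]                # column 17
res_name = line[17:20]            # columns 18-20
chain_id = line[21]               # column 22
res_number = int(line[22:26])     # columns 23-26
x_coord = float(line[30:38])      # columns 31-38

print(record_name, atom_name, alt_loc, res_name, chain_id, res_number, x_coord)
```

Because the end of a slice is excluded, the number on the right of the colon is simply the last PDB column you want, and the number on the left is the first column minus one.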

## File reading and multi-conditionals

That was slightly tedious getting the PDB file and pasting it in above to make the text string `t`. Normally you'd read straight from the file.

First let's use the shell command line utility `curl` to get a file. (You may be more familiar with `wget`, which acts similarly.)

Note we put an `!` (exclamation point) at the beginning of the curl command to specify that it is special: instead of running it as Python (which is the kernel this notebook is based on, see upper left just above the notebook), we want to run it as a bash shell command. (Another way to think about it: If you were in a terminal, you'd leave off the exclamation point.)

Feel free to change the PDB id at the end to your favorite. In other words, replace the `1p3v` with whatever you'd like. If you do, use the matching file name in the later cells too.

In [None]:
!curl -OL https://files.rcsb.org/download/1p3v.pdb

Running that gets another PDB file.

Let's parse that straight from the file.

In [None]:
# read in the PDB file
with open("1p3v.pdb", 'r') as input:
    lines_read = 0 # prepare to give feedback later or allow skipping to a certain start
    for line in input:
        lines_read = lines_read + 1
    print(lines_read)

Note that to get that to work we didn't have to split the string on the newline character `\n` this time. Python knows to split a file object into lines. More generally, iterating over a file object yields one line at a time.

If we iterated on the line now, we'd get characters. This ability to drill down one unit at a time is one of the things that makes Python so useful. The other is that the code looks much like what you would write out in words to describe what you want to do. In other words, it is a higher-level language, closer to English than some other programming languages.
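
Here is a minimal sketch of that drill-down, using a tiny temporary file so it does not depend on the download. The file contents and the variable names are made up for illustration.

```python
import os
import tempfile

# Write a tiny two-line file so the example does not depend on a download
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as handle:
    handle.write("ATOM\nEND\n")
    path = handle.name

with open(path, "r") as small_file:
    lines = [line for line in small_file]  # iterating a file gives lines

print(lines)  # each line keeps its trailing newline character

chars = [c for c in lines[0]]  # iterating a line gives characters
print(chars)

os.remove(path)  # clean up the temporary file
```

Each line read from a file keeps its `\n` at the end, which is why `print(line)` shows a blank line after every line it prints. Using `print(line, end="")` or `line.rstrip("\n")` avoids that.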

If you click to open your PDB file and scroll down, the last line number should match the number reported above, although it may be off by one: the editor shows a line number for a final empty line, and Python doesn't count that as a line because it contains nothing.

We can confirm that by printing the last value assigned to `line` and seeing that it holds the last line with content rather than an empty one. (Line 2119 if you are using 1p3v.)

In [None]:
print(line)

Let's use that to do something more useful that will let us explore the PDB file more.

We'll remove the counting and try looking at how many cysteines this protein (or structure, if yours has more chains) has.

In [None]:
with open("1p3v.pdb", 'r') as input:
    for line in input:
        if 'CYS' in line:
            print(line)

Well, we got our answer. But it could be cleaner. The first two lines come from the header, the section of the PDB file above the ATOM coordinates. It would be worse if we had a lot of cysteines.

Let's see how much worse it could get by trying with `HIS`.

In [None]:
# read in the PDB file
with open("1p3v.pdb", 'r') as input:
    for line in input:
        if 'HIS' in line:
            print(line)

Now we are getting lines with `THIS` in the header, too. 

Let's skip past the header.

We didn't talk about it above but we can use another conditional like the `if something` we used above a few times.

In [None]:
with open("1p3v.pdb", 'r') as input:
    for line in input:
        if line.startswith("ATOM"):
            if 'CYS' in line:
                print(line)

That is cleaner. And it will work better with `HIS` now.

In [None]:
with open("1p3v.pdb", 'r') as input:
    for line in input:
        if line.startswith("ATOM"):
            if 'HIS' in line:
                print(line)

But it is hard to read off how many there are, since each histidine has several lines. Let's focus on the alpha-carbons in the PDB file.

In [None]:
with open("1p3v.pdb", 'r') as input:
    for line in input:
        if line.startswith("ATOM") and "CA" in line:
            if 'HIS' in line:
                print(line)

So there are nine histidines if you are using the 1p3v example.
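
A test like `"CA" in line` looks for those letters anywhere in the line. A stricter alternative, sketched below as an assumption about how you might tighten it, compares the atom name columns directly. It uses lines copied from `t` rather than 1p3v, because `t` shows a second thing to watch for: a residue with alternate locations has more than one alpha-carbon line. The variable names are made up for illustration.

```python
# Five lines copied from `t` above; GLU 3 has two alternate locations (A and B)
sample_lines = [
    "ATOM      1  N   LEU A   1       1.149   1.920   3.550  1.00  5.65           N  ",
    "ATOM      2  CA  LEU A   1       2.138   2.288   4.580  1.00  5.04           C  ",
    "ATOM     23  CA  VAL A   2       5.866   1.930   4.058  1.00  3.13           C  ",
    "ATOM     39  CA AGLU A   3       8.674   2.248   6.587  0.50  3.42           C  ",
    "ATOM     40  CA BGLU A   3       8.670   2.254   6.575  0.50  3.37           C  ",
]

alpha_carbon_lines = []
residues_seen = set()
for line in sample_lines:
    # columns 13-16 hold the atom name, so compare that field exactly
    if line.startswith("ATOM") and line[12:16].strip() == "CA":
        alpha_carbon_lines.append(line)
        # chain (column 22) plus residue number (columns 23-26) identify a residue
        residues_seen.add((line[21], int(line[22:26])))

print(len(alpha_carbon_lines))  # alpha-carbon lines, alternate locations included
print(len(residues_seen))       # distinct residues
```

Here there are four alpha-carbon lines but only three residues, so counting printed lines can overcount when a structure has alternate locations.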

So this was using Python interactively in Jupyter. And you could save this notebook to easily keep the result of the code above, the alpha-carbons of the histidines, for a report or something later. A lot of people prefer to work this way now. We'll explore more of the advantages in Session #5.


----

## Running a script

But you may have heard reference to running a Python script or running a script. Let's finish by running a script version of our histidine alpha-carbon listing code.

To do that, copy the code in the cell above to your clipboard. It will be similar to the code below but may be slightly different if you are using a different PDB file:

```python
with open("1p3v.pdb", 'r') as input:
    # loop over each line of the PDB file
    for line in input:
        if line.startswith("ATOM") and "CA" in line:
            if 'HIS' in line:
                print(line)
```

Next select `File` > `New` > `Text File` from the main menu at the top above the panels.

Paste the code into the text file, then right-click, choose `Rename File`, and name the file `script.py`.

**It is important you remove `.txt`.** The highlighting should indicate code now if you did it correctly.

Choose `File` > `Save Python File`.


Next, select `File` > `New` > `Terminal` from the main menu at the top above the panels.

A terminal will open. There type the following to run your script:

```shell
python script.py
```

You'll see the same result you saw in the notebook. However, it isn't as useful, because the notebook saves the result automatically. You'd have to add extra handling to save those results to a file. It is easy, but you can already see one advantage of using a Jupyter notebook from that. Of course, you could copy and paste the result out of the terminal, but doing that more than a few times is tedious and not good practice for working reproducibly. The most direct way is to add a shell redirect. Adding `> results.txt` at the end of the call to the script sends the output that would go to standard out in the terminal to a file named `results.txt`. Like so:

```shell
python script.py > results.txt
```

You won't see any output, but a file named `results.txt` will be made with the output as its contents.
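
The 'extra handling' to save results from inside Python could look something like the sketch below. The function name `write_matching_lines`, the output file name, and the two example lines are all made up for illustration; the filter is the same one used in the script.

```python
import os
import tempfile


def write_matching_lines(lines, out_path, residue="HIS"):
    """Write ATOM alpha-carbon lines for `residue` to out_path; return the count."""
    count = 0
    with open(out_path, "w") as output:
        for line in lines:
            if line.startswith("ATOM") and "CA" in line and residue in line:
                output.write(line)
                count += 1
    return count


# Two made-up lines stand in for an open PDB file object
example_lines = [
    "ATOM      2  CA  HIS A   1       0.000   0.000   0.000  1.00  0.00           C\n",
    "ATOM      3  C   HIS A   1       0.000   0.000   0.000  1.00  0.00           C\n",
]
out_path = os.path.join(tempfile.gettempdir(), "results_example.txt")
written = write_matching_lines(example_lines, out_path)
print(written)
```

In a real script you would pass the open file in place of the list, for example `with open("1p3v.pdb", 'r') as pdb_file:` followed by `write_matching_lines(pdb_file, "results.txt")`, since iterating over a file object yields its lines.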

-----

--------


## 'Continued' pt 2 is in another notebook

Click [here](basics_pt2.ipynb) to continue on beyond what we covered last time.