# Python session - 3.1

## Functions and modules

`Pandas` cheat sheet: https://github.com/pandas-dev/pandas/blob/master/doc/cheatsheet/Pandas_Cheat_Sheet.pdf

Software Carpentry reference files: http://tobyhodges.github.io/python-novice-gapminder/

## Functions

Functions are reusable blocks of code that you can name and execute any number of times from different parts of your script(s). Executing a function is known as "calling" it. Functions are important building blocks of software.

Python has several built-in functions, which can be called anywhere (and any number of times) in your program. You have already been using built-in functions, for example `len()`, `range()`, `sorted()`, `max()`, `min()`, `sum()`, etc.

#### Structure of writing a function:

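- `def` (keyword) + function name (you choose) + `()` + `:`.
- A new line indented by 4 spaces (or a tab) + the block of code that forms the function body. Note: code at indentation level 0 is outside the function and always runs when the script is executed, whereas the body runs only when the function is called.
- Call your function using its name followed by `()`.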

In [None]:
# Examples of functions that you already know, e.g. print(), len(), list(), max()

myvar = 'something'
print(myvar)
print(len(myvar))
print(list(myvar))
print(max([12,23,45,67]))

In [2]:
## Non parametric function
# Define a function that prints a sum of number1 and number2 defined inside the function

def get_sum():
    a = 98
    b = 56
    sum_this = a+b
    print(sum_this)
get_sum()

def get_sum2():
    c = 90
    d = 5690
    sum_this2 = c+d
    print(sum_this2)
get_sum2()
get_sum() #calling the first function again

154
5780
154


In [3]:
# Parametric function
# Define a function that prints a sum of number1 and number2 provided by the user
# Hint: get_sum_param(number1, number2)

def get_sum_param(number1, number2):
    sum_this = number1 + number2
    print(sum_this)
    
a = 98
b = 56
get_sum_param(a, b)

c = 90
d = 5690
get_sum_param(c, d)

154
5780


In [4]:
# Returning values
# Define a function that 'returns' a sum of number1 and number2 provided by the user
# Hint: print(get_sum_param(number1, number2))

def get_sum_param(number1, number2):
    sum_this = number1 + number2
    return sum_this

a = 98
b = 56
val_ab = get_sum_param(a, b)

c = 90
d = 5690
val_cd = get_sum_param(c, d)

print(val_ab, val_cd)
print(val_ab * val_cd/val_ab) # do more operations on the variables

154 5780
5780.0
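
The difference between printing and returning can be checked directly. In this minimal sketch, `print_sum` and `return_sum` are hypothetical names used only for illustration: a function without a `return` statement gives back `None`, so its result cannot be reused in later calculations.

```python
def print_sum(number1, number2):
    # Only displays the result; nothing is handed back to the caller
    print(number1 + number2)


def return_sum(number1, number2):
    # Hands the result back so the caller can store or reuse it
    return number1 + number2


printed = print_sum(98, 56)     # prints 154, but 'printed' is None
returned = return_sum(98, 56)   # prints nothing, 'returned' is 154

print(printed)        # None
print(returned * 2)   # 308
```

Returning a value is usually the more flexible choice, because the caller can still decide to print it.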


In [5]:
# Local Vs. global variable

# Define a function that returns a sum of number1 and number2 to a variable
# and print it after calling the function
# Hint: returned_value = get_sum_param(number1, number2)

def get_sum_param(number1, number2):
    sum_this = number1 + number2
    number3 = 6789
    sum_this_too = sum_this + number3
    return sum_this, sum_this_too
a = 98
b = 56
returned_values = get_sum_param(a, b)
print(returned_values)
print(a, b, number3) # number3 is a local variable for the function, so it will not be recognised outside the function

(154, 6943)


NameError: name 'number3' is not defined
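
The scoping rule behind this error can be sketched as follows. The names `offset`, `add_offset` and `total` are assumptions made for this illustration: a variable assigned at the top level is global and can be read inside a function, while a variable assigned inside a function is local and disappears when the function returns.

```python
offset = 6789  # global variable: defined at the top level of the script


def add_offset(number1, number2):
    # 'total' is local: it exists only while the function is running
    total = number1 + number2
    # 'offset' is not assigned inside the function, so Python looks it up globally
    return total + offset


result = add_offset(98, 56)
print(result)  # 6943

try:
    print(total)  # 'total' was never defined outside the function
except NameError as error:
    print('NameError:', error)
```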

### Exercises: write old code into a function

In [13]:
# Optional exercise
# Let’s take one of our older code blocks and write it in a function

def check_temp_range():
    temperature = 20 # define temperature inside the function
    if temperature > 25 and temperature < 50:
        print('Warm weather')
    elif temperature == 25:  # checked before '< 25'; after '<= 25' it could never be reached
        print('Pleasant weather')
    elif temperature < 25:
        print('Not very Warm')
    else:
        print('I do not care about the weather')
        
check_temp_range()

Not very Warm


In [14]:
def check_temp_range_with_param(temperature):
    if temperature > 25 and temperature < 50:
        print('Warm weather')
    elif temperature == 25:  # checked before '< 25'; after '<= 25' it could never be reached
        print('Pleasant weather')
    elif temperature < 25:
        print('Not very Warm')
    else:
        print('I do not care about the weather')

temperature_var = 20 # defined outside the function and passed in as an argument
temperature_var2 = 35 # a second value, to test the 'warm' branch
check_temp_range_with_param(temperature_var)
check_temp_range_with_param(temperature_var2)

Not very Warm
Warm weather


In [16]:
def check_temp_range_with_more_param(temperature, upper_limit, lower_limit):
    if temperature > lower_limit and temperature < upper_limit:
        print('Warm weather')
    elif temperature == lower_limit:  # checked before '< lower_limit' so that it can be reached
        print('Pleasant weather')
    elif temperature < lower_limit:
        print('Not very Warm')
    elif temperature >= upper_limit:
        print("It's too hot")
    else:
        print('I do not care about the weather')

temperature_var = 20 # temperature to check
lower_limit = 15 # lower bound of the 'warm' range
upper_limit = 45 # upper bound of the 'warm' range
check_temp_range_with_more_param(temperature_var, upper_limit, lower_limit)

Warm weather
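
With several parameters it is easy to pass the limits in the wrong order. One possible safeguard, sketched here with a hypothetical function `describe_temperature` that returns the label instead of printing it, is to call the function with keyword arguments, so that the order of the arguments no longer matters.

```python
def describe_temperature(temperature, upper_limit, lower_limit):
    # Return the label instead of printing it, so the caller can reuse it
    if lower_limit < temperature < upper_limit:
        return 'Warm weather'
    elif temperature == lower_limit:
        return 'Pleasant weather'
    elif temperature < lower_limit:
        return 'Not very Warm'
    else:
        return "It's too hot"


# Keyword arguments make the call independent of the parameter order
print(describe_temperature(temperature=20, lower_limit=15, upper_limit=45))
print(describe_temperature(lower_limit=15, upper_limit=45, temperature=15))
print(describe_temperature(50, upper_limit=45, lower_limit=15))
```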


### Libraries and Modules

One of the great things about Python is the free availability of a _huge_ number of libraries (also called packages) that can be imported into your code and (re)used. 

Modules contain functions for use by other programs and are developed with the aim of solving some particular problem or providing particular, often domain-specific, capabilities. A library is a collection of modules, but the terms are often used interchangeably, especially since many libraries only consist of a single module (so don’t worry if you mix them). 

To import a library, it must be available on your system; if it is not, it has to be installed first.  

A large number of libraries are already available for import in the standard distribution of Python: this is known as the standard library. If you installed the Anaconda distribution of Python, you have even more libraries already installed - mostly aimed at data science.
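
Whether a library is available on your system can be checked without importing it. The following is a minimal sketch using `importlib.util.find_spec` from the standard library; the helper name `is_installed` and the library name `not_a_real_library` are made up for this example.

```python
import importlib.util


def is_installed(library_name):
    # find_spec returns None when no top-level module of that name can be found
    return importlib.util.find_spec(library_name) is not None


print(is_installed('os'))                   # part of the standard library
print(is_installed('not_a_real_library'))   # hypothetical name, expected to be missing
```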

Importing a library is easy:

- `import` (keyword) + library name, for example: 
    - `import os    # contains functions for interacting with the operating system`
    - `import sys   # contains utilities to process command line arguments`

More libraries are listed on the Python Package Index (PyPI): https://pypi.python.org/pypi

In [20]:
import os

# Get current directory
cwd = os.getcwd()
print(cwd)

/Users/sharan/Documents/SWC-EMBL/git_dir/gitrepo_master/software-carpentry-embl-2017/Day2


In [None]:
# Make new directory
os.mkdir('test_dir')
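
Note that `os.mkdir` raises a `FileExistsError` if the directory already exists, for example when the cell is run a second time. A hedged sketch of one way around this uses `os.makedirs` with `exist_ok=True`; the temporary directory is only there so that the example leaves nothing behind.

```python
import os
import tempfile

# Work inside a temporary directory so the example leaves nothing behind
with tempfile.TemporaryDirectory() as tmp_dir:
    new_dir = os.path.join(tmp_dir, 'test_dir')

    os.mkdir(new_dir)                      # first call creates the directory
    try:
        os.mkdir(new_dir)                  # second call fails: it already exists
    except FileExistsError:
        print('test_dir already exists')

    os.makedirs(new_dir, exist_ok=True)    # no error if the directory is there
    created = os.path.isdir(new_dir)

print(created)
```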

In [19]:
help(os)            # manual page created from the module's docstrings

### Using loops to iterate through files in a directory

In [22]:
# define a function that lists all the files in the folder called data

import os

def read_each_filename(pathname):
    for files in os.listdir(pathname):
        print(files)
   
pathname = 'data' # name of path with multiple files
read_each_filename(pathname)

asia_gdp_per_capita.csv
gapminder_all.csv
gapminder_gdp_africa.csv
gapminder_gdp_americas.csv
gapminder_gdp_asia.csv
gapminder_gdp_europe.csv
gapminder_gdp_oceania.csv
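
`os.listdir` returns the names in arbitrary order, so the listing may differ between systems. A small sketch, assuming a hypothetical helper `list_csv_files` and a temporary folder standing in for `data`, sorts the names and keeps only the CSV files:

```python
import os
import tempfile


def list_csv_files(pathname):
    # os.listdir returns names in arbitrary order, so sort them for stable output
    return sorted(name for name in os.listdir(pathname) if name.endswith('.csv'))


# Temporary stand-in for the 'data' folder; 'notes.txt' is a made-up extra file
with tempfile.TemporaryDirectory() as tmp_dir:
    for name in ['gapminder_gdp_asia.csv', 'gapminder_all.csv', 'notes.txt']:
        open(os.path.join(tmp_dir, name), 'w').close()
    csv_files = list_csv_files(tmp_dir)

print(csv_files)
```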


In [None]:
# define a function that reads and prints each line of each file in the folder called data

import os

def read_each_line_of_each_file(pathname): # name of path with multiple files
    for files in os.listdir(pathname):
        with open(pathname+'/'+files, 'r') as in_fh:
            for lines in in_fh:
                print(lines)
                
pathname = 'data' # name of path with multiple files
read_each_line_of_each_file(pathname)

# Hints:
# Options for opening files
# option-1: with open("{}/{}".format(pathname, filename)) as in_fh:
# option-2: with open('%s/%s' % (pathname, filename)) as in_fh:
# option-3: with open(pathname + '/' + filename) as in_fh:
# option-4: with open(os.path.join(pathname, filename)) as in_fh:
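
Of the options above, `os.path.join` is the most portable because it picks the path separator of the operating system. The sketch below, which uses a hypothetical helper `read_lines` and a made-up file `example.csv`, also strips the trailing newline of each line; without that, `print` adds a second newline and the output contains blank lines.

```python
import os
import tempfile


def read_lines(pathname, filename):
    # os.path.join inserts the separator that is right for the operating system
    with open(os.path.join(pathname, filename)) as in_fh:
        # Each line still ends with '\n'; strip it to avoid blank lines when printing
        return [line.rstrip('\n') for line in in_fh]


# Temporary stand-in for the 'data' folder, with made-up content
with tempfile.TemporaryDirectory() as tmp_dir:
    with open(os.path.join(tmp_dir, 'example.csv'), 'w') as out_fh:
        out_fh.write('country,gdp\nA,1\nB,2\n')
    lines = read_lines(tmp_dir, 'example.csv')

for line in lines:
    print(line)
```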

In [24]:
# Exercise: Go through each filename in the directory 'data'
# Print the names of the files that contain the keyword 'asia'

def country_files(pathname, country):
    for files in os.listdir(pathname):
        if country in files:  # use the parameter rather than a hard-coded 'asia'
            print(files)

pathname = 'data'
country = 'asia'
country_files(pathname, country)

asia_gdp_per_capita.csv
gapminder_gdp_asia.csv


In [25]:
# Open each file containing the keyword 'asia' and print all the entries

def read_filename_with_asia(pathname, country):
    for files in os.listdir(pathname):
        if country in files:
            # Print the name of each file that contains the keyword
            print(files)
            # Open each file containing the keyword and go through all the entries
            with open(os.path.join(pathname, files), 'r') as in_fh:
                for lines in in_fh:
                    # Print the GDP entries for 'Japan', 'Korea', 'China' and 'Taiwan'
                    if lines.startswith('Japan') or lines.startswith('Korea') \
                    or lines.startswith('China') or lines.startswith('Taiwan'):
                        print(lines)
read_filename_with_asia('data', 'asia')
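
The chain of `or` conditions can be shortened, because `str.startswith` also accepts a tuple of prefixes. This is a sketch with a hypothetical helper `select_countries` and made-up rows rather than the real file contents:

```python
def select_countries(lines, countries):
    # str.startswith accepts a tuple and returns True if any prefix matches
    prefixes = tuple(countries)
    return [line for line in lines if line.startswith(prefixes)]


# Made-up rows standing in for the lines of a gapminder CSV file
rows = ['China,400.4', 'India,546.5', 'Japan,3216.9', 'Taiwan,1206.9']
selected = select_countries(rows, ['Japan', 'Korea', 'China', 'Taiwan'])
print(selected)
```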

### Examples of importing basic modules.

#### Questions
- How can I read tabular data?

#### Objectives
- Import the Pandas library.
- Use Pandas to load a simple CSV data set.
- Get some basic information about a Pandas DataFrame.

In [26]:
import pandas

In [36]:
# Use the Asia data here

df = pandas.read_csv("data/gapminder_gdp_asia.csv") #put data from a file in the dataframe

#read header
df.head()

#read header with 3 lines
df.head(n=3)

#read tail
df.tail()

#get the column with the data of 'gdpPercap_1952'
df['gdpPercap_1952']

#get multiple columns (Hint: pass list of the columns)
df[['gdpPercap_1952', 'gdpPercap_1957', 'gdpPercap_1962']]

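#this can be done by column position as well, using .iloc
#(plain df[[0]] looks for a column *labelled* 0 and raises a KeyError in current pandas)
df.iloc[:, [0]] #one column
df.iloc[:, [1,2,3]] #multiple columns

#the positions can be generated with range too
df.iloc[:, list(range(1,3))] #columns at positions 1 and 2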

#datatypes in the dataframe
df.dtypes

#dimensions of the dataframe (rows, columns)
df.shape

(33, 13)
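
Selecting columns by label and by position can be compared on a small example. The sketch below builds a dataframe from made-up numbers (the countries `A`, `B`, `C` and their values are not real data) so that it runs without the `data` folder.

```python
import io

import pandas

# Made-up numbers in the same layout as the gapminder files
csv_text = """country,gdpPercap_1952,gdpPercap_1957,gdpPercap_1962
A,100.0,110.0,120.0
B,200.0,210.0,220.0
C,300.0,310.0,320.0
"""
df = pandas.read_csv(io.StringIO(csv_text))

by_label = df[['gdpPercap_1952', 'gdpPercap_1957']]   # select columns by name
by_position = df.iloc[:, 1:3]                         # select columns by position

print(by_label.equals(by_position))   # both selections hold the same data
print(df.shape)                       # (number of rows, number of columns)
```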

#### Aside: Namespaces
Python relies heavily on namespaces to keep functions, attributes, methods, etc. separate between modules and objects. When you import an entire module, its functions and classes are loaded under the module's namespace (`pandas` in the example above).  
It is possible to customise the namespace at the point of import, allowing you to e.g. shorten/abbreviate the module name to save some typing:

In [None]:
import pandas as pd

Alternatively, if you need only a single function from a module, you can import it directly into your main namespace, so that you do not need to write the module name before the function name:

In [None]:
from pandas import read_csv
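
The three import styles can be compared side by side. This sketch uses the standard-library module `math` instead of `pandas` so that it runs anywhere; the alias `m` is an arbitrary choice for the example.

```python
import math                 # names live in the 'math' namespace
import math as m            # same module, shorter alias
from math import sqrt       # 'sqrt' is copied into the current namespace

print(math.sqrt(16))   # 4.0
print(m.sqrt(16))      # 4.0
print(sqrt(16))        # 4.0

# All three names refer to the same function object
print(math.sqrt is m.sqrt is sqrt)   # True
```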

#### Conventions
- You should perform all of your imports at the beginning of your program. This ensures that
  - users can easily identify the dependencies of a program, and 
  - any missing dependencies (causing fatal `ImportError` exceptions) are caught early in execution.
- Shortening `numpy` to `np` and `pandas` to `pd` is very common, and there are other such abbreviations too. Watch out for them when reading docs, guides, or Stack Overflow answers online.

### Exercises - Importing

Use this link to follow further exercises: http://tobyhodges.github.io/python-novice-gapminder/37-reading-tabular/

In [39]:
# Use index_col to specify that a column’s values should be used as row headings.

df1 = pandas.read_csv("data/gapminder_gdp_asia.csv", index_col = 'country')

In [43]:
# Use DataFrame.info to find out more about a dataframe.
df1.info()

<class 'pandas.core.frame.DataFrame'>
Index: 33 entries, Afghanistan to Yemen Rep.
Data columns (total 12 columns):
gdpPercap_1952    33 non-null float64
gdpPercap_1957    33 non-null float64
gdpPercap_1962    33 non-null float64
gdpPercap_1967    33 non-null float64
gdpPercap_1972    33 non-null float64
gdpPercap_1977    33 non-null float64
gdpPercap_1982    33 non-null float64
gdpPercap_1987    33 non-null float64
gdpPercap_1992    33 non-null float64
gdpPercap_1997    33 non-null float64
gdpPercap_2002    33 non-null float64
gdpPercap_2007    33 non-null float64
dtypes: float64(12)
memory usage: 3.4+ KB


In [44]:
# The DataFrame.columns variable stores information about the dataframe’s columns.

df1.columns

Index(['gdpPercap_1952', 'gdpPercap_1957', 'gdpPercap_1962', 'gdpPercap_1967',
       'gdpPercap_1972', 'gdpPercap_1977', 'gdpPercap_1982', 'gdpPercap_1987',
       'gdpPercap_1992', 'gdpPercap_1997', 'gdpPercap_2002', 'gdpPercap_2007'],
      dtype='object')

In [46]:
# Use DataFrame.T to transpose a dataframe.
df1.T

In [48]:
# Use DataFrame.describe to get summary statistics about data.
df1.describe()

,gdpPercap_1952,gdpPercap_1957,gdpPercap_1962,gdpPercap_1967,gdpPercap_1972,gdpPercap_1977,gdpPercap_1982,gdpPercap_1987,gdpPercap_1992,gdpPercap_1997,gdpPercap_2002,gdpPercap_2007
count,33.0,33.0,33.0,33.0,33.0,33.0,33.0,33.0,33.0,33.0,33.0,33.0
mean,5195.484004,5787.73294,5729.369625,5971.173374,8187.468699,7791.31402,7434.135157,7608.226508,8639.690248,9834.093295,10174.090397,12473.02687
std,18634.890865,19506.515959,16415.857196,14062.591362,19087.502918,11815.777923,8701.176499,8090.262765,9727.431088,11094.180481,11150.719203,14154.937343
min,331.0,350.0,388.0,349.0,357.0,371.0,424.0,385.0,347.0,415.0,611.0,944.0
25%,749.681655,793.577415,825.623201,836.197138,1049.938981,1175.921193,1443.429832,1704.686583,1785.402016,1902.2521,2092.712441,2452.210407
50%,1206.947913,1547.944844,1649.552153,2029.228142,2571.423014,3195.484582,4106.525293,4106.492315,3726.063507,3645.379572,4090.925331,4471.061906
75%,3035.326002,3290.257643,4187.329802,5906.731805,8597.756202,11210.08948,12954.79101,11643.57268,15215.6579,19702.05581,19233.98818,22316.19287
max,108382.3529,113523.1329,95458.11176,80894.88326,109347.867,59265.47714,33693.17525,28118.42998,34932.91959,40300.61996,36023.1054,47306.98978


### Reading other data

Read the data in `gapminder_gdp_americas.csv` (which should be in the same directory as `gapminder_gdp_oceania.csv`) into a variable called `americas` and display its summary statistics.

In [49]:
americas = pandas.read_csv("data/gapminder_gdp_americas.csv", index_col = 'country')
americas.describe()

,gdpPercap_1952,gdpPercap_1957,gdpPercap_1962,gdpPercap_1967,gdpPercap_1972,gdpPercap_1977,gdpPercap_1982,gdpPercap_1987,gdpPercap_1992,gdpPercap_1997,gdpPercap_2002,gdpPercap_2007
count,25.0,25.0,25.0,25.0,25.0,25.0,25.0,25.0,25.0,25.0,25.0,25.0
mean,4079.062552,4616.043733,4901.54187,5668.253496,6491.334139,7352.007126,7506.737088,7793.400261,8044.934406,8889.300863,9287.677107,11003.031625
std,3001.727522,3312.381083,3421.740569,4160.88556,4754.404329,5355.602518,5530.490471,6665.039509,7047.089191,7874.225145,8895.817785,9713.209302
min,1397.717137,1544.402995,1662.137359,1452.057666,1654.456946,1874.298931,2011.159549,1823.015995,1456.309517,1341.726931,1270.364932,1201.637154
25%,2428.237769,2487.365989,2750.364446,3242.531147,4031.408271,4756.763836,4258.503604,4140.442097,4439.45084,4684.313807,4858.347495,5728.353514
50%,3048.3029,3780.546651,4086.114078,4643.393534,5305.445256,6281.290855,6434.501797,6360.943444,6618.74305,7113.692252,6994.774861,8948.102923
75%,3939.978789,4756.525781,5180.75591,5788.09333,6809.40669,7674.929108,8997.897412,7807.095818,8137.004775,9767.29753,8797.640716,11977.57496
max,13990.48208,14847.12712,16173.14586,19530.36557,21806.03594,24072.63213,25009.55914,29884.35041,32003.93224,35767.43303,39097.09955,42951.65309


### Inspecting Data

After reading the data for the Americas, use `help(americas.head)` and `help(americas.tail)` to find out what `DataFrame.head` and `DataFrame.tail` do.

What method call will display the first three rows of this data?
What method call will display the last three columns of this data? (Hint: you may need to change your view of the data.)

In [None]:
help(americas.head)
help(americas.tail)

americas.head(n=3)      # first three rows
americas.T.tail(n=3)    # last three columns, shown as rows of the transposed dataframe
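
`tail` always works on rows, so reaching the last three columns needs either a transpose or position-based slicing. The following sketch shows both on made-up numbers (countries `A` and `B` are placeholders, not real data):

```python
import io

import pandas

# Made-up numbers in the same layout as the gapminder files
csv_text = """country,gdpPercap_1992,gdpPercap_1997,gdpPercap_2002,gdpPercap_2007
A,1.0,2.0,3.0,4.0
B,5.0,6.0,7.0,8.0
"""
data = pandas.read_csv(io.StringIO(csv_text), index_col='country')

via_transpose = data.T.tail(n=3).T   # flip, take the last three rows, flip back
via_position = data.iloc[:, -3:]     # slice the last three columns by position

print(list(via_transpose.columns))
print(list(via_position.columns))
```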

### Reading Files in Other Directories

The data for your current project is stored in a file called `microbes.csv`, which is located in a folder called `field_data`. You are doing analysis in a notebook called `analysis.ipynb` in a sibling folder called `thesis`:

```
your_home_directory
+-- field_data/
|   +-- microbes.csv
+-- thesis/
    +-- analysis.ipynb
```

What value(s) should you pass to `read_csv` to read `microbes.csv` in `analysis.ipynb`?
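
A relative path is interpreted from the folder the notebook runs in, and `..` means "one folder up". The sketch below resolves such a path by hand. `/home/user` is a hypothetical location for `your_home_directory`, and `posixpath` (the standard-library module behind `os.path` on Linux and macOS) is used so that the result has forward slashes on any operating system.

```python
import posixpath

# Hypothetical absolute location of the folder that holds analysis.ipynb
notebook_dir = '/home/user/thesis'
relative_path = '../field_data/microbes.csv'

# Join the two parts, then let normpath collapse the '..' step
resolved = posixpath.normpath(posixpath.join(notebook_dir, relative_path))
print(resolved)
```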

In [None]:
"../field_data/microbes.csv"

### Writing data

As well as the `read_csv` function for reading data from a file, Pandas provides a `to_csv` function to write dataframes to files. Applying what you’ve learned about reading from files, write one of your dataframes to a file called `processed.csv`. You can use help to get information on how to use `to_csv`.

In [51]:
americas.T.describe().to_csv("americas_data.csv")

In [53]:
%%bash
ls
cat americas_data.csv

1-variables-and-data-structures.ipynb
2-flow-control.ipynb
3_1-functions-and-modules.ipynb
3_2-pandas-dataframe.ipynb
3_3-plotting.ipynb
4-styling-and-flowchart-exercises.ipynb
americas_data.csv
data
images
,Argentina,Bolivia,Brazil,Canada,Chile,Colombia,Costa Rica,Cuba,Dominican Republic,Ecuador,El Salvador,Guatemala,Haiti,Honduras,Jamaica,Mexico,Nicaragua,Panama,Paraguay,Peru,Puerto Rico,Trinidad and Tobago,United States,Uruguay,Venezuela
count,13.0,13.0,13.0,13.0,13.0,13.0,13.0,13.0,13.0,13.0,13.0,13.0,13.0,13.0,13.0,13.0,13.0,13.0,13.0,13.0,13.0,13.0,13.0,13.0,13.0
unique,13.0,13.0,13.0,13.0,13.0,13.0,13.0,13.0,13.0,13.0,13.0,13.0,13.0,13.0,13.0,13.0,13.0,13.0,13.0,13.0,13.0,13.0,13.0,13.0,13.0
top,10079.026740000001,3326.143191,7807.095818000001,36319.235010000004,4315.6227229999995,7006.580419,5118.146939,5180.75591,3614.101285,7103.702595000001,3421.523218,4031.4082710000002,1823.015995,3548.3308460000003,5246.107524,3478.125529,2749.320965,7356.031934000001,3998.875695,4446.380
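
A quick way to check what `to_csv` wrote is to read the file back in. This is a minimal sketch on a made-up dataframe, written to a temporary folder so that no file is left behind; the countries `A` and `B` and their values are placeholders.

```python
import os
import tempfile

import pandas

# Made-up numbers standing in for one of the gapminder dataframes
original = pandas.DataFrame(
    {'gdpPercap_1952': [100.0, 200.0], 'gdpPercap_1957': [110.0, 210.0]},
    index=pandas.Index(['A', 'B'], name='country'),
)

with tempfile.TemporaryDirectory() as tmp_dir:
    out_path = os.path.join(tmp_dir, 'processed.csv')
    original.to_csv(out_path)   # the index is written as the first column
    reloaded = pandas.read_csv(out_path, index_col='country')

print(reloaded)
```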

#### Aside: Your Own Modules
Whenever you write some Python code and save it as a script with the `.py` file extension, you are creating your own module. If you define functions within that module, you can load them into other scripts and sessions with `import`.
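
The idea can be tried out in one go. The sketch below writes a tiny module to a temporary folder and imports it; the module name `my_tools` is hypothetical, and the temporary folder and `sys.path` handling are only needed to keep the example self-contained. For a script saved next to your notebook, a plain `import my_tools` would normally be enough.

```python
import importlib
import os
import sys
import tempfile

# Source of a tiny module; 'my_tools' is an illustrative name
module_source = (
    "def get_sum_param(number1, number2):\n"
    "    return number1 + number2\n"
)

with tempfile.TemporaryDirectory() as tmp_dir:
    with open(os.path.join(tmp_dir, 'my_tools.py'), 'w') as out_fh:
        out_fh.write(module_source)

    sys.path.insert(0, tmp_dir)       # let Python find modules in this folder
    importlib.invalidate_caches()     # make sure the new file is noticed
    try:
        my_tools = importlib.import_module('my_tools')   # same effect as 'import my_tools'
    finally:
        sys.path.remove(tmp_dir)

print(my_tools.get_sum_param(98, 56))   # 154
```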

### Some Interesting Modules and Libraries to Investigate
- os
- sys
- shutil
- random
- collections
- math
- argparse
- time
- datetime
- numpy
- scipy
- matplotlib
- pandas
- scikit-learn
- requests
- biopython
- openpyxl