# Introduction to data analysis in Python

Based on Software Carpentry's "Programming with Python" https://software-carpentry.org/lessons/ and Data Carpentry's "Data Analysis and Visualization in Python for Ecologists" https://datacarpentry.org/lessons/

Recommended setup: Anaconda / miniconda on Linux or Mac (Windows Subsystem for Linux if on Windows).


## Data types, variables, and methods

### Questions
* How do I program in Python?

* How can I represent my data in Python?

### Objectives

* Define the following data types in Python: strings, integers, and floats.

* Perform mathematical operations in Python using basic operators.

* Define the following as it relates to Python: lists, tuples, and dictionaries.

### Variables

Any Python interpreter can be used as a calculator:




In [1]:
2 + 2

4

In [3]:
6 * 8

48

This is great but not very interesting. To do anything useful with data, we need to assign its value to a variable. In Python, we can assign a value to a variable, using the equals sign `=`. For example, we can track the weight of a patient who weighs 60 kilograms by assigning the value 60 to a variable `weight_kg`:

In [2]:
weight_kg = 60

From now on, whenever we use `weight_kg`, Python will substitute the value we assigned to it. In layman’s terms, a variable is a name for a value.

In Python, variable names:

* can include letters, digits, and underscores
* cannot start with a digit
* are case sensitive.

This means that, for example:

* `weight0` is a valid variable name, whereas `0weight` is not
* `weight` and `Weight` are different variables
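
The naming rules above can be tried out directly. This is a minimal sketch; the values are only illustrative:

```python
# Letters, digits, and underscores are allowed in names.
weight0 = 60       # valid: the digit is not the first character
patient_weight = 60

# Names are case sensitive, so these are two separate variables.
weight = 60
Weight = 75
print(weight, Weight)

# A name may not start with a digit. Removing the leading '#'
# from the next line makes Python report a SyntaxError.
# 0weight = 60
```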

### Built-in Python functions

To carry out common tasks with data and variables in Python, the language provides us with several built-in functions. To display information to the screen, we use the print function:

In [None]:
print(weight_kg)

60


When we want to make use of a function, referred to as calling the function, we follow its name by parentheses. The parentheses are important: if you leave them off, the function doesn’t actually run! Sometimes you will include values or variables inside the parentheses for the function to use. In the case of `print`, we use the parentheses to tell the function what value we want to display. We will learn more about how functions work and how to create our own later in this lesson.

We can display multiple things at once using only one `print` call:

In [9]:
print('Weight in kilograms:', weight_kg)

Weight in kilograms: 60


Moreover, we can do arithmetic with variables right inside the print function:

In [None]:
print('weight in pounds:', 2.2 * weight_kg)

weight in pounds: 132.0


The above command, however, did not change the value of `weight_kg`:

In [None]:
print(weight_kg)

60


To change the value of the `weight_kg` variable, we have to assign `weight_kg` a new value using the equals sign `=`:

In [None]:
weight_kg = 65.0
print('weight in kilograms is now:', weight_kg)

weight in kilograms is now: 65.0
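
One consequence of assignment is worth seeing in code: a variable computed from another one keeps its value when the original is later reassigned. This is a small sketch; `weight_lb` is an illustrative name:

```python
weight_kg = 60
weight_lb = 2.2 * weight_kg   # computed once, from the current value of weight_kg

weight_kg = 100               # reassigning weight_kg ...
print('weight in kilograms is now:', weight_kg)
print('weight in pounds is still:', weight_lb)   # ... does not update weight_lb
```

To update `weight_lb` as well, the calculation has to be run again after the new assignment.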


### Built-in data types

**Strings, integers, and floats**

Python knows various types of data. Three common ones are:

* integer numbers
* floating point numbers, and
* strings.

When we first created `weight_kg`, we gave it the integer value 60. If we want to track the weight of our patient more precisely, we can use a floating point value by executing:

In [5]:
weight_kg = 60.3

To create a string, we add single or double quotes around some text. To identify and track a patient throughout our study, we can assign each person a unique identifier by storing it in a string:

In [6]:
patient_id = '001'

We can also call a function to check the type of a variable:

In [10]:
type(weight_kg)

float

In [11]:
type(patient_id)

str
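
The objectives also mention basic mathematical operators. The sketch below runs through the common ones on two made-up integer values and shows that the type of the result can differ from the type of the inputs:

```python
apples = 7
people = 2

print(apples + people)    # addition
print(apples - people)    # subtraction
print(apples * people)    # multiplication
print(apples / people)    # division: the result is a float, here 3.5
print(apples // people)   # floor division: the result is the integer 3
print(apples % people)    # remainder of the division: 1
print(apples ** people)   # power: 7 squared is 49

print(type(apples / people))
print(type(apples // people))
```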

**Sequences: Lists and Tuples**

Lists are a common data structure to hold an ordered sequence of elements. Each element can be accessed by an index. Note that Python indexes start with 0 instead of 1:


In [7]:
numbers = [1, 2, 3]
numbers[0]

1

A `for` loop can be used to access the elements in a list or other Python data structure one at a time:

In [8]:
for num in numbers:
    print(num)

1
2
3


**Indentation** is very important in Python. Note that the second line in the example above is indented.
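
To make the role of indentation concrete, here is a small sketch (the variable `total` is just for illustration). Indented lines belong to the loop and run once per element; the first line that is no longer indented runs once, after the loop has finished:

```python
numbers = [1, 2, 3]
total = 0

for num in numbers:
    total = total + num              # indented: part of the loop body
    print('running total:', total)   # indented: also runs for every element

print('final total:', total)         # not indented: runs once, after the loop
```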

To add elements to the end of a list, we can use the `append` method. Methods are a way to interact with an object (a list, for example). We can invoke a method using a dot `.` followed by the method name and a list of arguments in parentheses. Let’s look at an example using `append`:

In [12]:
numbers.append(4)
print(numbers)

[1, 2, 3, 4]


To find out what methods are available for an object, we can use the built-in `help` function (only the beginning of its output is shown here):

In [13]:
help(numbers)

Help on list object:

class list(object)
 |  list(iterable=(), /)
 |  
 |  Built-in mutable sequence.
 |  
 |  If no argument is given, the constructor creates a new empty list.
 |  The argument must be an iterable if specified.
 |  
 |  Methods defined here:
 |  
 |  __add__(self, value, /)
 |      Return self+value.
 |  
 |  __contains__(self, key, /)
 |      Return key in self.
 |  
 |  __delitem__(self, key, /)
 |      Delete self[key].
 |  
 |  __eq__(self, value, /)
 |      Return self==value.
 |  
 |  __ge__(self, value, /)
 |      Return self>=value.
 |  
 |  __getattribute__(self, name, /)
 |      Return getattr(self, name).
 |  
 |  __getitem__(...)
 |      x.__getitem__(y) <==> x[y]
 |  
 |  __gt__(self, value, /)
 |      Return self>value.
 |  
 |  __iadd__(self, value, /)
 |      Implement self+=value.
 |  
 |  __imul__(self, value, /)
 |      Implement self*=value.
 |  
 |  __init__(self, /, *args, **kwargs)
 |      Initialize self.  See help(type(self)) for accurate sign

A **tuple** is similar to a list in that it’s an ordered sequence of elements. However, tuples cannot be changed once created (they are “immutable”). Tuples are created by placing comma-separated values inside parentheses `()`.

In [14]:
# Tuples use parentheses
a_tuple = (1, 2, 3)
another_tuple = ('blue', 'green', 'red')

# Note: lists use square brackets
a_list = [1, 2, 3]
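
Reading from a tuple works the same way as reading from a list. When a changeable version is needed, one option is to copy the tuple into a new list with the built-in `list` function. A minimal sketch (`as_list` is an illustrative name):

```python
a_tuple = (1, 2, 3)

print(a_tuple[0])      # indexing works as it does for lists
print(len(a_tuple))    # so does asking for the number of elements

as_list = list(a_tuple)   # a new list holding the same values
as_list.append(4)         # the list can grow ...
print(as_list)
print(a_tuple)            # ... while the original tuple is unchanged
```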

A **dictionary** is a container that holds pairs of objects - keys and values.

In [15]:
translation = {'one': 'first', 'two': 'second'}
translation['one']

'first'

Dictionaries work a lot like lists - except that you index them with *keys*. You can think about a key as a name or unique identifier for the value it corresponds to.

In [16]:
rev = {'first': 'one', 'second': 'two'}
rev['first']

'one'

To add an item to the dictionary we assign a value to a new key:

In [17]:
rev = {'first': 'one', 'second': 'two'}
rev['third'] = 'three'
rev

{'first': 'one', 'second': 'two', 'third': 'three'}
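
Assigning to a key that already exists replaces its value instead of adding a new item, and the `in` operator tells us whether a key is present. A short sketch reusing the `rev` dictionary (the replacement value `'TWO'` is only for illustration):

```python
rev = {'first': 'one', 'second': 'two'}

rev['third'] = 'three'    # new key: an item is added
rev['second'] = 'TWO'     # existing key: the value is replaced

print('third' in rev)     # True
print('fourth' in rev)    # False
print(len(rev))           # 3 items in total
print(rev['second'])
```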

Using `for` loops with dictionaries is a little more complicated. We can do this in two ways:

In [18]:
for key, value in rev.items():
    print(key, '->', value)

first -> one
second -> two
third -> three


In [19]:
for key in rev.keys():
    print(key, '->', rev[key])

first -> one
second -> two
third -> three


### Functions

Defining a section of code as a **function** in Python is done using the `def` keyword. For example, a function that takes two arguments and returns their sum can be defined as:

In [1]:
def add_function(a, b):
    result = a + b
    return result

z = add_function(20, 22)
print(z)

42
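
Functions are most useful for steps we repeat. As a sketch, the kilogram-to-pound conversion used earlier could be wrapped in a function; the name `kg_to_lb` and the sample weights are illustrative, and 2.2 is the same approximate factor as above:

```python
def kg_to_lb(weight_kg):
    """Convert a weight in kilograms to pounds (approximate factor 2.2)."""
    return 2.2 * weight_kg

# The same function can be called as often as needed.
weights_kg = [60, 65.0, 100]
for weight in weights_kg:
    print(weight, 'kg is about', kg_to_lb(weight), 'lb')
```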


### Conditionals

We can ask Python to take different actions, depending on a condition, with an `if` statement:

In [2]:
num = 37
if num > 100:
    print('greater')
else:
    print('not greater')
print('done')

not greater
done


The second line of this code uses the keyword `if` to tell Python that we want to make a choice. If the test that follows the `if` keyword is true, the body of the `if` (i.e., the set of lines indented underneath it) is executed, and “greater” is printed. If the test is false, the body of the `else` is executed instead, and “not greater” is printed. Only one or the other is ever executed before continuing on with program execution to print “done”.

Conditional statements don’t have to include an `else`. If there isn’t one, Python simply does nothing if the test is false:

In [3]:
num = 53
print('before conditional...')
if num > 100:
    print(num, 'is greater than 100')
print('...after conditional')

before conditional...
...after conditional


We can also chain several tests together using `elif`, which is short for “else if”. The following Python code uses `elif` to print the sign of a number.

In [4]:
num = -3

if num > 0:
    print(num, 'is positive')
elif num == 0:
    print(num, 'is zero')
else:
    print(num, 'is negative')

-3 is negative


### Boolean statements

Along with the `>` and `==` operators we have already used for comparing values in our conditionals, there are a few more options to know about:

- `>`: greater than
- `<`: less than
- `==`: equal to
- `!=`: does not equal
- `>=`: greater than or equal to
- `<=`: less than or equal to

We can also combine tests using `and` and `or`. `and` is only true if both parts are true:

In [5]:
if (1 > 0) and (-1 >= 0):
    print('both parts are true')
else:
    print('at least one part is false')

at least one part is false


while `or` is true if at least one part is true:

In [6]:
if (1 < 0) or (1 >= 0):
    print('at least one test is true')

at least one test is true


Sometimes it is useful to check whether some condition is not true. The Boolean operator `not` can do this explicitly:

In [11]:
if not (1 < 0):
    print('1 is not smaller than 0')

1 is not smaller than 0
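
The result of a comparison is itself a value, of type `bool`, that can be stored in a variable and combined later. A minimal sketch with illustrative variable names:

```python
num = 37

is_positive = num > 0     # True
is_small = num < 10       # False

print(is_positive, is_small)
print(is_positive and is_small)   # False: both parts would have to be true
print(is_positive or is_small)    # True: one true part is enough
print(not is_small)               # True: `not` flips the value
print(type(is_positive))
```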


### Importing libraries

Importing a library is like getting a piece of lab equipment out of a storage locker and setting it up on the bench. Libraries provide additional functionality to the basic Python package, much like a new piece of equipment adds functionality to a lab space. Just like in the lab, importing too many libraries can sometimes complicate and slow down your programs - so we only import what we need for each program.

Once we’ve imported the library, we can ask the library to perform functions that are not built-in:

In [None]:
import numpy
weight_kg = 60.6
print(numpy.round(weight_kg))


61.0


It is common to rename libraries to abbreviated names:

In [None]:
import numpy as np
print(np.round(weight_kg))

61.0
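
The same `import` syntax works for the libraries that ship with Python itself. As a sketch, the standard-library module `math` is used here instead of `numpy`, and the short name `m` is only an illustration of renaming:

```python
import math
import math as m   # the same library under a shorter name

print(math.sqrt(16))    # square root, returned as a float
print(m.floor(60.6))    # largest whole number not above 60.6
```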


### Exercises

What values do the variables `mass` and `age` have after each of the following statements? Test your answer by executing the lines.

In [None]:
mass = 47.5
age = 122
mass = mass * 2.0
age = age - 20

Python allows you to assign multiple values to multiple variables in one line by separating the variables and values with commas. What does the following program print out?

In [None]:
first, second = 'Grace', 'Hopper'
third, fourth = second, first
print(third, fourth)

What are the data types of the following variables?

In [None]:
planet = 'Earth'
apples = 5
distance = 10.5

Change the value of the second element in both the list and the tuple below to 5:

In [8]:
a_list = [1, 2, 3]
a_tuple = (1, 2, 3)

Write a function `square` that takes a `list` or `tuple` as its argument and returns a respective `list` or `tuple` containing the squares of the values in the original list.

Write a function `is_number` that takes an argument and returns whether the argument is a number.

### Key points

* Basic data types in Python include integers, strings, and floating-point numbers.

* Use `variable = value` to assign a value to a variable in order to record it in memory.

* Variables are created on demand whenever a value is assigned to them.

* Built-in functions are always available to use.

* Lists and tuples are similar in that they are ordered sequences of elements; they differ in that a tuple is immutable (cannot be changed).

* Dictionaries are data structures that provide mappings between keys and values.

* Use `for variable in sequence` to process the elements of a sequence one at a time.

* Use `if condition` to start a conditional statement, `elif condition` to provide additional tests, and `else` to provide a default.



## Data analysis using Pandas

### Questions

- How can I import data in Python?
- What is Pandas?
- Why should I use Pandas to work with data?

### Objectives

- Navigate the workshop directory and download a dataset.
- Explain what a library is and what libraries are used for.
- Describe what the Python Data Analysis Library (Pandas) is.
- Load the Python Data Analysis Library (Pandas).
- Use read_csv to read tabular data into Python.
- Describe what a DataFrame is in Python.
- Access and summarize data stored in a DataFrame.
- Define indexing as it relates to data structures.
- Perform basic mathematical operations and summary statistics on data in a Pandas DataFrame.
- Create simple plots.

### Working With Pandas DataFrames in Python

We can automate the process of performing data manipulations in Python. It’s efficient to spend time building the code to perform these tasks because once it’s built, we can use it over and over on different datasets that use a similar format. This makes our methods easily reproducible. We can also easily share our code with colleagues and they can replicate the same analysis.

### Our Data

For this lesson, we will be using real data from traffic registration points in Norway.

We are studying (...). The raw dataset is stored as a semicolon-separated .csv file: each row holds information for a single registered vehicle, and the first line lists the column names:



```
_id;_index;_score;_type;county_id;created_at_timestamp;datalogger_type;event_emitted_timestamp;event_number;event_timestamp;firmware_version;klokketime;lane;length;qspeed;region_id;speed;time_gap;traffic_registration_point_id;valid_classification;valid_event;valid_length;valid_speed;vehicle_type;vehicle_type_quality;vehicle_type_raw;weight;with_traffic_registration_point_direction;wrong_direction							
1-27849732;traffic_event_vehicle_2020_10;;traffic_event_vehicle;18;2020-10-26	 02:16:08.090;EMU;2020-10-25	 23:59:53.564;27849732;2020-10-25	 23:59:53.564;1.04 EMU3/15606;23;4;4	5;0;1;73	2;123	9;1;true;true;true;true;2;;LMV2;0;false;false	
1-27849731;traffic_event_vehicle_2020_10;;traffic_event_vehicle;18;2020-10-26	 02:16:08.116;EMU;2020-10-25	 23:58:28.296;27849731;2020-10-25	 23:58:28.296;1.04 EMU3/15606;23;3;4	6;0;1;75	4;1	8;1;true;true;true;true;2;;LMV2;0;true;false	
1-27849730;traffic_event_vehicle_2020_10;;traffic_event_vehicle;18;2020-10-26	 02:16:08.066;EMU;2020-10-25	 23:58:25.527;27849730;2020-10-25	 23:58:25.527;1.04 EMU3/15606;23;3;4	6;0;1;80	1;6	8;1;true;true;true;true;2;;LMV2;0;true;false	
1-27849729;traffic_event_vehicle_2020_10;;traffic_event_vehicle;18;2020-10-26	 02:16:08.085;EMU;2020-10-25	 23:58:19.077;27849729;2020-10-25	 23:58:19.077;1.04 EMU3/15606;23;3;4	1;0;1;75	4;69	9;1;true;true;true;true;2;;LMV2;0;true;false	
```



In [None]:
import pandas as pd
import numpy as np

In Google Colaboratory, we add files using:

In [None]:
from google.colab import files
uploaded = files.upload()

Saving trafikkdata_hourly_54688V625212.csv to trafikkdata_hourly_54688V625212.csv


We can then ask Pandas to read our data file for us:

In [None]:
pd.read_csv('trafikkdata_hourly_54688V625212.csv')

                            time    trafikk_id  total.volumeNumbers.volume
0      2015-05-13 15:00:00+00:00  54688V625212                       526.0
1      2015-05-13 16:00:00+00:00  54688V625212                       434.0
2      2015-05-13 17:00:00+00:00  54688V625212                       503.0
3      2015-05-13 18:00:00+00:00  54688V625212                       466.0
4      2015-05-13 19:00:00+00:00  54688V625212                       328.0
...                          ...           ...                         ...
48674  2020-12-31 18:00:00+00:00  54688V625212                         5.0
48675  2020-12-31 19:00:00+00:00  54688V625212                         5.0
48676  2020-12-31 20:00:00+00:00  54688V625212                         3.0
48677  2020-12-31 21:00:00+00:00  54688V625212                         3.0


The expression `pd.read_csv(...)` is a function call that asks Python to run the function `read_csv` which belongs to the `pandas` library. This dotted notation is used everywhere in Python: the thing that appears before the dot contains the thing that appears after.

As an example, John Smith is the John that belongs to the Smith family. We could use the dot notation to write his name `smith.john`, just as `read_csv` is a function that belongs to the `pandas` library.

`pandas.read_csv` requires at least one argument: the name of the file we want to read. It also has optional parameters, which are listed in the documentation (https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html). For example, `sep` specifies the delimiter that separates values on a line. If we don't pass an optional parameter, the function uses its default value, in this case `sep=','`.
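
The effect of `sep` can be sketched on a small in-memory sample. The column names below are taken from the raw traffic listing shown earlier, but the two rows of values are invented for illustration, and `io.StringIO` stands in for a file on disk:

```python
import io

import pandas as pd

# Hypothetical semicolon-separated sample (values are made up)
raw_text = "lane;speed;vehicle_type\n4;73.2;2\n3;75.4;2\n"

# With the default sep=',' no delimiter is found, so each line
# ends up in a single column.
default_sep = pd.read_csv(io.StringIO(raw_text))
print(default_sep.shape)     # (2, 1)

# Passing sep=';' splits every line into three columns.
semicolon_sep = pd.read_csv(io.StringIO(raw_text), sep=';')
print(semicolon_sep.shape)   # (2, 3)
print(semicolon_sep)
```

A shape with a single column right after loading is often a hint that the delimiter does not match the file.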

Since we haven’t told Python to do anything else with the function’s output, the notebook displays it. In this case, that output is the data we just loaded. By default, only a few rows are shown (with `...` standing in for the omitted rows of a large table).

Our call to `pd.read_csv` read our file but didn’t save the data in memory. To do that, we need to assign the data frame to a variable. In a similar manner to how we assign a single value to a variable, we can also assign a whole table of values to a variable using the same syntax. Let’s re-run `pd.read_csv` and save the returned data:

In [None]:
data = pd.read_csv('trafikkdata_hourly_54688V625212.csv')

This statement doesn’t produce any output because we’ve assigned the output to the variable `data`. If we want to check that the data have been loaded, we can print the variable’s value:

In [None]:
print(data)

                            time    trafikk_id  total.volumeNumbers.volume
0      2015-05-13 15:00:00+00:00  54688V625212                       526.0
1      2015-05-13 16:00:00+00:00  54688V625212                       434.0
2      2015-05-13 17:00:00+00:00  54688V625212                       503.0
3      2015-05-13 18:00:00+00:00  54688V625212                       466.0
4      2015-05-13 19:00:00+00:00  54688V625212                       328.0
...                          ...           ...                         ...
48674  2020-12-31 18:00:00+00:00  54688V625212                         5.0
48675  2020-12-31 19:00:00+00:00  54688V625212                         5.0
48676  2020-12-31 20:00:00+00:00  54688V625212                         3.0
48677  2020-12-31 21:00:00+00:00  54688V625212                         3.0
48678  2020-12-31 22:00:00+00:00  54688V625212                         4.0

[48679 rows x 3 columns]


Now that the data are in memory, we can manipulate them. First, let’s ask what type of thing `data` refers to:

In [None]:
print(type(data))

<class 'pandas.core.frame.DataFrame'>


The output tells us that `data` currently refers to a data frame, the functionality for which is provided by the Pandas library. These data record the traffic volume passing a traffic registration point on the E6; each row represents one time point.

A Pandas data frame contains one or more columns, which can hold values of the same or different types. The `type` function will only tell you that a variable is a Pandas data frame; it won’t tell you the type of the values inside it. We can find out the type of the data in each column with the `dtypes` attribute:

In [None]:
data.dtypes

time                           object
trafikk_id                     object
total.volumeNumbers.volume    float64
dtype: object

This tells us that the columns of the data frame hold objects and floating-point numbers. Note that the values in `time` and `trafikk_id` are actually strings. Here Pandas stores strings under the general-purpose `object` dtype, a catch-all label for any kind of Python "thing", so that is how they are labelled.
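
If the `time` strings need to behave like timestamps, one option is `pd.to_datetime`. This is a minimal sketch, assuming a small hand-built frame (`sample`, a hypothetical name) that holds two rows copied from the output above:

```python
import pandas as pd

# Two rows copied from the printed traffic data
sample = pd.DataFrame({
    'time': ['2015-05-13 15:00:00+00:00', '2015-05-13 16:00:00+00:00'],
    'total.volumeNumbers.volume': [526.0, 434.0],
})
print(sample.dtypes)   # `time` holds plain strings at this point

# Convert the strings to timestamps and check the types again
sample['time'] = pd.to_datetime(sample['time'])
print(sample.dtypes)

# Timestamp columns give access to parts of the date via `.dt`
print(sample['time'].dt.hour.tolist())
```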

With the following command, we can see the data frame's shape:

In [None]:
print(data.shape)

(48679, 3)


The output tells us that the data frame variable contains 48679 rows and 3 columns. When we created the variable `data` to store our traffic data, we did not only create the data frame; we also created information about the data frame, called members or attributes. This extra information describes data in the same way an adjective describes a noun. `data.shape` is an attribute of `data` which describes the dimensions of `data`. We use the same dotted notation for the attributes of variables that we use for the functions in libraries because they have the same part-and-whole relationship.
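
Attributes can be explored on any data frame. The sketch below builds a small stand-in frame (`small`, with invented column names `hour` and `volume`) so that it runs without the traffic file:

```python
import pandas as pd

small = pd.DataFrame({
    'hour': [15, 16, 17],
    'volume': [526.0, 434.0, 503.0],
})

print(small.shape)              # (3, 2): rows first, then columns

# shape is a tuple, so it can be unpacked into two variables
n_rows, n_columns = small.shape
print(n_rows, 'rows and', n_columns, 'columns')

print(list(small.columns))      # another attribute: the column names
print(len(small))               # the built-in len() gives the number of rows
```

Note that attributes such as `shape` are written without parentheses, unlike function and method calls.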

If we want to get a single row from the data frame, we use `iloc`, a member of the Pandas data frame, together with an index in square brackets after the member name. The index notation is similar to the notation used in mathematics to refer to an element of a vector.

In [None]:
print(data.iloc[0])

time                          2015-05-13 15:00:00+00:00
trafikk_id                                 54688V625212
total.volumeNumbers.volume                          526
Name: 0, dtype: object


In [None]:
print(type(data.iloc[0]))

<class 'pandas.core.series.Series'>


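A single element from a row can be accessed by adding a second index, here the row position followed by the column position:

In [None]:
print('first value in data:', data.iloc[0, 0])

first value in data: 2015-05-13 15:00:00+00:00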
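
A few more `iloc` patterns, sketched on a small stand-in frame so the example runs on its own (`small` is a hypothetical name; its three rows are copied from the printed data above):

```python
import pandas as pd

small = pd.DataFrame({
    'time': ['2015-05-13 15:00:00+00:00',
             '2015-05-13 16:00:00+00:00',
             '2015-05-13 17:00:00+00:00'],
    'volume': [526.0, 434.0, 503.0],
})

print(small.iloc[0])       # first row, returned as a Series
print(small.iloc[0, 1])    # row 0, column 1: a single value
print(small.iloc[-1, 1])   # negative positions count from the end
print(small.iloc[0:2])     # a slice of rows is again a data frame
```

As with lists, positions start at 0 and the end of a slice is not included, so `0:2` selects the first two rows.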
