# <u> Tutorial 3: Data Structures </u>

<br> This series of tutorials is intended to teach the basics of Python for scientific programming. These tutorials were written by Sanjana Kulkarni, an intern in the High Throughput Analytics group during summer 2021. 

## Lists and Arrays

Lists, arrays, and dictionaries are structures that hold many elements of data. Lists are denoted by square brackets, and you can place different data types (integer, float, string) into the same list. To create a list, place objects inside the brackets and assign the list to a new variable. 

In [2]:
my_list = [1, 2, "apple", "banana", 4.5]

You can also make nested lists, in which an element of a list is another list.

In [3]:
nested_lst = [1, 2, 3, [4, 5, 6], 8]

An <b>array</b> is like a list, but has some extra versatility. Arrays are defined using a Python package called <a href="https://numpy.org/" target="_blank"><b>numpy</b></a>, which is used extensively in scientific computing. To import a package, use the statement `import desired_package as abbreviation`, where `desired_package` is the package name and `abbreviation` is a short name used to refer to the package. 

We can create an array just like we did the list. However, if the data types are different, then numpy will coerce all of the objects to be of the same type. 

In [4]:
# import numpy first with the abbreviation np
import numpy as np

# define the array. array is a function in numpy
my_array = np.array([1, 2, "apple", "banana", 4.5])

In [5]:
my_array

array(['1', '2', 'apple', 'banana', '4.5'], dtype='<U32')

Notice that all of the elements in `my_array` became strings, unlike in the list. Numpy arrays are optimized for efficient calculations: the calculations are implemented in C inside the package, so they are typically faster than the same calculations written in pure Python.

In [6]:
array_1 = np.array([1, 2, 3, 4])
array_2 = np.array([10, 9, 8, 7])

array_1 + array_2

array([11, 11, 11, 11])

Numpy performs the calculations element-wise and would do the same if we changed the operation between the two arrays. 

In [7]:
array_1 * array_2

array([10, 18, 24, 28])

Lists cannot be multiplied in this way. You can multiply a list by an integer, but this repeats the list that many times rather than multiplying each element. 

In [8]:
# copies the given list 3 times and makes a new list
[1, 3, 4] * 3

[1, 3, 4, 1, 3, 4, 1, 3, 4]
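If the goal is to multiply each element of a list, two common options are a list comprehension or converting the list to a numpy array first. The sketch below is only an illustration; the variable names are made up for this example.

```python
import numpy as np

values = [1, 3, 4]  # example list

# option 1: a list comprehension multiplies each element in plain Python
tripled_list = [v * 3 for v in values]

# option 2: convert to a numpy array, then multiply element-wise
tripled_array = np.array(values) * 3

print(tripled_list)            # [3, 9, 12]
print(tripled_array.tolist())  # [3, 9, 12]
```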

Numpy also has handy functions for creating arrays filled with only zeros or only ones. The argument to these functions is the number of zeros or ones that the array should contain. 

In [9]:
np.zeros(5)

array([0., 0., 0., 0., 0.])

In [10]:
np.ones(7)

array([1., 1., 1., 1., 1., 1., 1.])
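Both functions also accept a tuple for the shape and an optional `dtype` argument. The short sketch below assumes you want a two-dimensional array of zeros and integer ones; the variable names are illustrative only.

```python
import numpy as np

# a 2 x 3 array of zeros: the shape is passed as a tuple (rows, columns)
grid = np.zeros((2, 3))

# ones stored as integers instead of the default floats
int_ones = np.ones(4, dtype=int)

print(grid.shape)         # (2, 3)
print(int_ones.tolist())  # [1, 1, 1, 1]
```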

## Tuples

Tuples are another type of data structure. They are denoted by parentheses, and unlike lists and arrays, they are <b>immutable</b>. If you create a list or array, you can change its elements without a problem. Tuples, however, cannot be changed after they are created. You can only generate a new tuple.

In [11]:
my_tuple = (3, 5, 10, 34)

Like lists, tuples can contain elements of different types. You can store strings, integers, and floats in the same tuple. Tuples are useful for storing grouped data, such as the x, y, and z coordinates of a point. Often, you can pass a list of tuples into a plotting function. 
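To make immutability concrete, here is a small sketch using a hypothetical coordinate tuple (`point` is not part of the tutorial's data). Assigning to an element raises a `TypeError`, so the way to "change" a tuple is to build a new one.

```python
point = (1.0, 2.5, -3.0)  # hypothetical (x, y, z) coordinates

# unpack the tuple into separate variables
x, y, z = point

# trying to change an element raises a TypeError
try:
    point[0] = 10.0
    changed = True
except TypeError:
    changed = False

# build a new tuple instead of modifying the old one
new_point = (10.0, y, z)

print(changed)    # False
print(new_point)  # (10.0, 2.5, -3.0)
```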

## Slicing Data Structures

To access the individual elements of a list, tuple, or array, we use brackets and the index of the element. Python uses zero-indexing, so the first element has index 0, the second has index 1, and so on. We can return to the arrays and lists defined above. 

In [12]:
# the last element of array_1, which has length 4
array_1[3]

4

Indexing can also be done in the reverse direction. The last element of a list/tuple/array has index -1, the penultimate element has index -2, and so on and so forth. So the second element of `array_1`, which has length 4, can also be accessed using the index <b>-3</b>. See how in the following code, both positive and negative indexing extract the same value from the array. 

In [13]:
array_1[1], array_1[-3]

(2, 2)

A range of elements can be sliced from a list, tuple or array. The start and end indices of the desired elements are separated by a colon. NOTE: the first index is inclusive, but the last index is exclusive.

In [68]:
# extract values at indices 2 and 3
array_1[2:4]

array([3, 4])

If you don't provide a starting index, the default is 0, and if you don't provide an ending index, the default is the end.

In [69]:
# extract values at indices 0 and 1
array_1[:2]

array([1, 2])

In [70]:
# extract values from indices 2 to the end
array_1[2:]

array([3, 4])
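The same slicing syntax works on lists and tuples, and a slice can take an optional third value, the step. The sketch below uses a made-up list of letters and the same values as `my_tuple`.

```python
letters = ["a", "b", "c", "d", "e", "f"]  # example list
coords = (3, 5, 10, 34)                   # same values as my_tuple

first_three = letters[:3]   # indices 0, 1, 2
every_other = letters[::2]  # indices 0, 2, 4 (step of 2)
last_two = coords[-2:]      # the last two elements, still a tuple

print(first_three)  # ['a', 'b', 'c']
print(every_other)  # ['a', 'c', 'e']
print(last_two)     # (10, 34)
```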

## Dictionaries

Dictionaries are somewhat different because they store data with labels. If you input just the label, you will get the corresponding data. Labels must be unique: if the same label is given twice, the later entry overwrites the earlier one. Dictionaries are denoted by curly braces with the following syntax:

`{'Label 1': data_1, 'Label 2': data_2, 'Label 3': data_3}`

In [44]:
my_dict = {'2015': [12, 67, 34, 13, 95, 82], '2016': [23, 87, 93, 56, 70, 104], '2017': [349, 654, 349, 567, 186, 240]}

Elements within a dictionary are extracted not by indexing by number, but by name. To get the values associated with year 2016, we use the exact label. 

In [45]:
# get 2016 values
my_dict['2016']

[23, 87, 93, 56, 70, 104]

You can access all of the elements of a dictionary using a few <b>methods</b>. Methods are functions that follow a `.` symbol after the data structure you are accessing. For dictionaries, the most important methods are shown below:

1. `.keys()`: extract all labels (keys)
2. `.values()`: extract all values
3. `.items()`: extract all labels and values

These methods return view objects (`dict_keys`, `dict_values`, and `dict_items`) that can be looped over or converted to a list. Dictionaries are very good at storing data with associated labels, and these methods make it easy to access the data. 

In [46]:
my_dict.keys()

dict_keys(['2015', '2016', '2017'])

In [47]:
my_dict.values()

dict_values([[12, 67, 34, 13, 95, 82], [23, 87, 93, 56, 70, 104], [349, 654, 349, 567, 186, 240]])

In [48]:
my_dict.items()

dict_items([('2015', [12, 67, 34, 13, 95, 82]), ('2016', [23, 87, 93, 56, 70, 104]), ('2017', [349, 654, 349, 567, 186, 240])])
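A common use of `.items()` is looping over labels and values together. The sketch below recreates `my_dict` and sums the values for each year; `totals` and `years` are names introduced only for this example.

```python
my_dict = {'2015': [12, 67, 34, 13, 95, 82],
           '2016': [23, 87, 93, 56, 70, 104],
           '2017': [349, 654, 349, 567, 186, 240]}

# loop over label/value pairs and compute a total per year
totals = {}
for year, values in my_dict.items():
    totals[year] = sum(values)

# convert the keys view to a list if you need to index it by position
years = list(my_dict.keys())

print(totals)    # {'2015': 303, '2016': 433, '2017': 2345}
print(years[0])  # 2015
```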

## Dataframes

Dataframes are Python's equivalent of Excel tables, but they have a lot of built-in manipulations. Dataframes have an <b>index</b> column that isn't a true column, but is useful for enumerating a dataframe. This index is also zero-indexed, so the first row is assigned an index of 0, not 1. 

The easiest way to generate a dataframe is using the <a href="https://pandas.pydata.org/" target="_blank"><b>pandas</b></a> package. Like with `numpy`, we need an import statement to allow our computing environment to access the functions contained within the package. 

Pandas can generate dataframes from dictionaries, arrays, or other data structures using different functions. We will use the `DataFrame()` function, but you can also use the `DataFrame.from_dict()` function for dictionaries. The documentation for pandas dataframe methods can be found <a href="https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html" target="_blank">here</a>, with detailed instructions on using each of the functions. 

In [49]:
import pandas as pd

# use the following function to generate a dataframe from a dictionary
my_df = pd.DataFrame(my_dict)

In [50]:
my_df

| | 2015 | 2016 | 2017 |
|---|---|---|---|
| 0 | 12 | 23 | 349 |
| 1 | 67 | 87 | 654 |
| 2 | 34 | 93 | 349 |
| 3 | 13 | 56 | 567 |
| 4 | 95 | 70 | 186 |
| 5 | 82 | 104 | 240 |


The labels / keys of the dictionary become the columns of the dataframe, and the values in each entry fill the rows. For the function to work, all of the dictionary entries must have the same length. If not, pandas raises a `ValueError` saying so. 

To address this problem, there are a few <a href="https://stackoverflow.com/questions/19736080/creating-dataframe-from-a-dictionary-where-entries-have-different-lengths" target="_blank">solutions</a> on Stack Overflow, a collaboration and knowledge sharing platform for programmers and developers. You can choose the solution you like best or that fits your code structure. 
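One commonly suggested approach is to wrap each entry in a `pd.Series`, so that pandas aligns the entries by index and fills the gaps with NaN. This is a minimal sketch using a hypothetical dictionary (`uneven`); it may not be the best fit for every situation.

```python
import pandas as pd

# hypothetical dictionary whose entries have different lengths
uneven = {'2015': [12, 67, 34], '2016': [23, 87]}

# wrapping each list in a Series lets pandas align on the index
# and fill the missing positions with NaN
uneven_df = pd.DataFrame({key: pd.Series(values) for key, values in uneven.items()})

print(uneven_df.shape)                 # (3, 2)
print(uneven_df['2016'].isna().sum())  # 1
```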

There are innumerable functions and manipulations you can use with pandas dataframes, but I will highlight a few below:

1. `df.shape`: returns a tuple of the number of rows and columns (rows first)
2. `df.columns`: the column names (returned as a pandas `Index`, which can be converted to a list)
3. `df.head()`: show the first N rows of the dataframe. By default, N = 5, but this can be changed, e.g. `df.head(10)`.
4. `df.tail()`: show the last N rows of the dataframe. By default, N = 5, but this can be changed, e.g. `df.tail(10)`.
5. `df.dropna()`: removes rows with missing entries (called NaN = <b><u>N</u></b>ot <b><u>a</u></b> <b><u>N</u></b>umber)
6. `df.fillna(replace_value)`: fills missing entries with a user-supplied value
7. `df.groupby(col_name)`: groups entries by the values of user-supplied columns. You can compute mean, median, etc. for each group.
8. `df.replace(to_replace, new_value)`: replaces certain values with new values
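9. `df.append(df_2)`: attach df_2 to the bottom of df, joining along shared columns. Note that `append` was deprecated in pandas 1.4 and removed in pandas 2.0, so use `pd.concat` with newer versions.
10. `pd.concat(list_of_dataframes)`: similar to the append method, but with a different call signature. By default it stacks the dataframes on top of one another, aligning shared columns; pass `axis=1` to join them side by side.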
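Several of the functions listed above can be tried on a small example. The dataframes `df` and `df_2` below are hypothetical and exist only to illustrate the calls.

```python
import numpy as np
import pandas as pd

# small hypothetical dataframes; df has one missing entry
df = pd.DataFrame({'a': [1.0, np.nan, 3.0], 'b': [4.0, 5.0, 6.0]})
df_2 = pd.DataFrame({'a': [7.0], 'b': [8.0]})

n_rows, n_cols = df.shape        # number of rows, then number of columns
column_names = list(df.columns)  # column names as a plain list
first_two = df.head(2)           # first 2 rows
dropped = df.dropna()            # rows containing NaN are removed
filled = df.fillna(0)            # NaN is replaced by 0

# stack df_2 below df; ignore_index renumbers the rows from 0
stacked = pd.concat([df, df_2], ignore_index=True)

print(n_rows, n_cols)  # 3 2
print(column_names)    # ['a', 'b']
print(stacked.shape)   # (4, 2)
```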

In Python, there are many ways to skin a cat, so many functions perform similar tasks or can be modified with arguments to have the same effect. As you become more familiar with Python, you can make code more efficient or look more "<a href="https://docs.python-guide.org/writing/style/" target="_blank">Pythonic</a>" (language fluency).

New columns can be added to a dataframe using bracket syntax, with the name of the new column as a string inside the brackets. The list or array assigned to the new column must have the same length as the dataframe. 

In [58]:
# add another column to the dataframe to make a categorical variable
my_df['type'] = ['a', 'a', 'b', 'b', 'b', 'a']

my_df

| | 2015 | 2016 | 2017 | type |
|---|---|---|---|---|
| 0 | 12 | 23 | 349 | a |
| 1 | 67 | 87 | 654 | a |
| 2 | 34 | 93 | 349 | b |
| 3 | 13 | 56 | 567 | b |
| 4 | 95 | 70 | 186 | b |
| 5 | 82 | 104 | 240 | a |


Dataframes can be grouped by a categorical variable. From the grouped dataframe, you can compute summary statistics like the mean, median, standard deviation, and number of values in each group.

In [59]:
# count the number of values of each type
my_df.groupby('type').count()

| type | 2015 | 2016 | 2017 |
|---|---|---|---|
| a | 3 | 3 | 3 |
| b | 3 | 3 | 3 |


In [66]:
# get the mean for each year by type
my_df.groupby('type').mean()

| type | 2015 | 2016 | 2017 |
|---|---|---|---|
| a | 53.666667 | 71.333333 | 414.333333 |
| b | 47.333333 | 73.0 | 367.333333 |


In [61]:
# get the standard deviation for each year by type
my_df.groupby('type').std()

| type | 2015 | 2016 | 2017 |
|---|---|---|---|
| a | 36.855574 | 42.712215 | 214.593414 |
| b | 42.594992 | 18.681542 | 191.160491 |


## Series

A series is a data structure generated when extracting a single column from a dataframe. Individual columns can be extracted using bracket syntax or, if the column name has no spaces and is not a number, by writing the name after a period:

In [33]:
my_df['2015']

0    12
1    67
2    34
3    13
4    95
5    82
Name: 2015, dtype: int64

In [34]:
type(my_df['2015'])

pandas.core.series.Series
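The period syntax can be illustrated with the `type` column, whose name has no spaces and is not a number. A column such as `'2015'` cannot be accessed this way, because `my_df.2015` is not valid Python. The sketch below rebuilds a shortened version of `my_df` so that it runs on its own.

```python
import pandas as pd

# shortened stand-in for my_df, just for this example
my_df = pd.DataFrame({'2015': [12, 67, 34], 'type': ['a', 'a', 'b']})

by_bracket = my_df['type']  # bracket syntax works for any column name
by_dot = my_df.type         # period syntax works for simple names only

print(by_dot.equals(by_bracket))  # True
```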

Series are labeled arrays: each entry is labeled with an index. Operations on series are performed element-wise, as with numpy arrays. If you extract a series from a grouped dataframe, you can index by the groups.

In [35]:
# element-wise operations
my_df['2015'] + my_df['2016']

0     35
1    154
2    127
3     69
4    165
5    186
dtype: int64
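Indexing a series by group can be sketched as follows. The example rebuilds part of `my_df` so that it is self-contained; `mean_2015` and `mean_a` are names introduced here for illustration.

```python
import pandas as pd

my_df = pd.DataFrame({
    '2015': [12, 67, 34, 13, 95, 82],
    '2016': [23, 87, 93, 56, 70, 104],
    'type': ['a', 'a', 'b', 'b', 'b', 'a'],
})

# mean of the 2015 column within each type; the result is a Series
mean_2015 = my_df.groupby('type')['2015'].mean()

# the group names are the index labels, so we can index by group
mean_a = mean_2015['a']

print(list(mean_2015.index))  # ['a', 'b']
print(round(mean_a, 6))       # 53.666667
```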