<center><img src="http://i.imgur.com/sSaOozN.png" width="500"></center>

## Course: Computational Thinking for Governance Analytics

### Prof. José Manuel Magallanes, PhD 
* Visiting Professor of Computational Policy at Evans School of Public Policy and Governance, and eScience Institute Senior Data Science Fellow, University of Washington.
* Professor of Government and Political Methodology, Pontificia Universidad Católica del Perú. 

_____

# Session 1:  Programming Fundamentals

## Part A: Data Structures in Python

<a id='beginning'></a>
Programming languages use data structures to tell the computer how to organize the data we are working with. The data structures offered by one programming language are not necessarily the same as those offered by another. However, in most cases, a data structure that has a given name in one language keeps that name in others. It is also worth keeping in mind that a particular data structure may serve one purpose well, but not others.

In everyday life, a book can be considered a data structure: we use it to store some kind of information. It has some advantages: it has a table of contents; it has numbers on the pages; you can take it with you; read it as long as you can see the words; and read it again as many times as you want. It has some disadvantages: you can lose it, and need to buy it again; it can deteriorate; get eaten by an insect; and so on.

We are going to talk about four data structures in Python:


1. [List](#part1) 
2. [Tuple](#part2) 
3. [Dictionary](#part3) 
4. [Data Frame](#part4) 

**Lists** and **tuples** are basic containers, while **dictionaries** (a.k.a **dicts**) could be considered less simple and with a different 'philosophy'. **Data frames** are complex structures not directly supported by base Python, but easily managed with an additional package.


A tuple is an immutable sequence of Python objects. Tuples are sequences, just like lists. The differences are that tuples cannot be changed once created, and that tuples are written with parentheses, whereas lists use square brackets. Creating a tuple is as simple as writing several comma-separated values.

When we talk about data structures, we are talking about containers. Containers are the way to organize data.

____
<a id='part1'></a>

## List

Lists in Python are containers of values, as in **R**. The values can be of any kind (numbers or non-numbers), and even other containers (simple or complex). If we take a spreadsheet as a reference, a row is a 'natural' list. Unlike in R, you cannot give names to the list elements.

In [14]:
DetailStudent=["Fred Meyers",40,"False"]

The *object* 'DetailStudent' temporarily stores the list. To name a list, use combinations of letters and numbers (never start with a number) in a meaningful way. Typing the name of the object, now a list, will show you all the contents you saved in it:

In [2]:
DetailStudent

['Fred Meyers', 40, 'False']

Python's lists are similar to vectors in R, but Python does not coerce the values (40 is still a number). Lists in Python are so flexible and simple, that it is common to have nested lists:

In [3]:
DetailStudentb=['Michael Nelson',60,'True']
Classroom=[DetailStudent,DetailStudentb] # list of lists
Classroom

[['Fred Meyers', 40, 'False'], ['Michael Nelson', 60, 'True']]

You can access individual elements like this:

In [4]:
Classroom[1]

#Python starts counting positions from 0, while R starts from 1

['Michael Nelson', 60, 'True']

From the last result, you must always remember that Python positions start at **0**. See more examples of accessing elements:

In [5]:
DetailStudentb[0] # first element of the list

'Michael Nelson'

In [6]:
DetailStudentb[:2] # before the index 2, that is position 0 and 1 / In R: DetailStudentb[1:2] (both limits needed)

#asks to show everything before the 3rd element

['Michael Nelson', 60]

In [7]:
DetailStudent[-1] # the last element; R does not work like this: there, x[-1] would drop the first element instead

#a negative index makes Python count positions from the end of the list

'False'
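Because nested lists are just lists inside lists, indexes can be chained. The following is a minimal sketch that re-creates the two student lists from above so it can run on its own; the names `secondName`, `lastValue` and `fromSecond` are made up for illustration.

```python
# Re-create the example lists so this cell runs on its own
DetailStudent = ["Fred Meyers", 40, "False"]
DetailStudentb = ["Michael Nelson", 60, "True"]
Classroom = [DetailStudent, DetailStudentb]  # list of lists

# chained indexes: second student, then the first element of that student
secondName = Classroom[1][0]

# negative indexes: last student, then the last element of that student
lastValue = Classroom[-1][-1]

# slice: everything from position 1 (included) to the end
fromSecond = DetailStudent[1:]

print(secondName, lastValue, fromSecond)
```

Each pair of brackets is applied to the result of the previous one, so `Classroom[1][0]` first picks a list and then picks an element of that list.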

You can alter lists like in R (just remember positions start from 0 in Python):

In [8]:
DetailStudent[0]='Alfred Mayer'
DetailStudent

#changing a specific element; this is flexible, but beware, there is no warning of changes!

['Alfred Mayer', 40, 'False']

Deleting elements is easy, and we can do it:

* By position
* By value

Let's see. If we have these lists:

In [97]:
elementsA=[1,2,3,4] 
elementsB=[1,2,3,4]

Then:

In [96]:
## DELETING BY POSITION
del elementsA[2]  #delete third element
# then:
elementsA  # alternative (instead of del):  elementsA[:2]+elementsA[3:]

#word[0:2]  # characters from position 0 (included) to 2 (excluded)
#word[2:5]  # characters from position 2 (included) to 5 (excluded)

#Slicing:

#Note how the start is always included, and the end always excluded. This makes sure that s[:i] + s[i:] is always equal to s:

#word[:2] + word[2:]
#word[:4] + word[4:]

#word[:2]   # character from the beginning to position 2 (excluded)
#word[4:]   # characters from position 4 (included) to the end
#word[-2:]  # characters from the second-last (included) to the end

#+---+---+---+---+---+---+
#| P | y | t | h | o | n |
#+---+---+---+---+---+---+
#0   1   2   3   4   5   6
#-6  -5  -4  -3  -2  -1

[1, 2, 4]

In [13]:
#  DELETING BY VALUE
elementsB.remove(2)
elementsB 

[1, 3, 4]
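Two details are easy to miss when deleting: `remove()` only deletes the first match, and `del` does not give the deleted value back. The sketch below uses a hypothetical list `elementsC` (not part of the examples above) to show `pop()` and a list comprehension as alternatives.

```python
elementsC = [1, 2, 3, 2, 4]

# pop() deletes by position AND returns the deleted value
removed = elementsC.pop(0)   # elementsC is now [2, 3, 2, 4]

# remove() deletes only the FIRST element equal to the value
elementsC.remove(2)          # elementsC is now [3, 2, 4]

# a list comprehension builds a NEW list without any occurrence of the value
noTwos = [x for x in elementsC if x != 2]

print(removed, elementsC, noTwos)
```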

Getting rid of your list:

In [16]:
newList=['a','b']
del newList
newList # be careful!... it is gone! 

#Python reports the mistake: the name is not defined anymore, because the list was deleted

NameError: name 'newList' is not defined

It is important to know how to get **unique values**:

In [17]:
weekdays=['M','T','W','Th','S','Su','Su']
weekdays

['M', 'T', 'W', 'Th', 'S', 'Su', 'Su']

In [19]:
#then:
weekdays=list(set(weekdays))
weekdays

#how to get unique values: set() is efficient, but notice it does not keep the original order

['T', 'S', 'Su', 'M', 'W', 'Th']
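If the original order matters, one possible alternative (a sketch, not the only way) is to rely on the fact that dict keys are unique and, since Python 3.7, keep their insertion order. The name `uniqueOrdered` is made up for this example.

```python
weekdays = ['M', 'T', 'W', 'Th', 'S', 'Su', 'Su']

# dict.fromkeys() creates a dict whose keys are the list elements;
# repeated elements collapse into one key, and the order of first
# appearance is kept
uniqueOrdered = list(dict.fromkeys(weekdays))

print(uniqueOrdered)
```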

### Doesn't Python have vectors?

Vectors are NOT part of base Python; you need a mathematical module like **numpy**. When working with vectors, the comparison operations ('>', '<', etc.) will work **element by element** as in R:

In [26]:
# For Python to work as R with vectors, you need to use the 
# mathematical structure offered by numpy:

import numpy as np

# import loads a library that is already installed
# 'as np' gives the library a shorter alias

vector1=np.array(['b','c','d']) # np is the alias; array is a function from numpy
vector2=np.array(['a','b','d'])
vector1>vector2 # this line compares the vectors element by element

array([ True,  True, False])

If vectors have different sizes, comparison works if one has ONE element:

In [27]:
vector3=np.array(['a'])
vector1>vector3 # each element of vector1 compared to the only one in vector3

array([ True,  True,  True])

In [28]:
vector3

array(['a'], dtype='<U1')

But this fails, because the sizes differ and neither vector has a single element:

In [29]:
vector4=np.array(['a','b'])
vector1>vector4

#cannot compare 3 elements to 2: the shapes of the vectors do not match

ValueError: shape mismatch: objects cannot be broadcast to a single shape

This is also valid for numbers:

In [30]:
# If these are our vectors:
numbers1=np.array([1,2,3])
numbers2=np.array([1,2,3])
numbers3=np.array([1])
numbers4=np.array([10,12])

Then, these work well:

In [30]:
# adding element by element:
numbers1+numbers2

array([2, 4, 6])

In [31]:
# adding one value to all the elements of other vector:
numbers1+numbers3

array([2, 3, 4])

In [32]:
# multiplication (element by element)!
numbers1*numbers2

array([1, 4, 9])

In [33]:
# and this kind of multiplication:
numbers1*3

array([3, 6, 9])

This will not work (R would instead recycle the shorter vector and give a warning):

In [34]:
numbers1+numbers4

ValueError: operands could not be broadcast together with shapes (3,) (2,) 

When dealing with vectors, the elements must share the same type. Otherwise, elements will be coerced into the same type:

In [35]:
numbers5=np.array([1,2,'3'])
numbers5

array(['1', '2', '3'], dtype='<U21')

In [36]:
numbers6=np.array([1,2,3.0])
numbers6

array([1., 2., 3.])
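Coercion is silent, so it can help to check the type of an array and convert it explicitly when needed. The following is a small sketch under the assumption that every text element really represents an integer; the names `mixed` and `asNumbers` are made up for illustration.

```python
import numpy as np

mixed = np.array([1, 2, '3'])   # the numbers are coerced into text

# dtype.kind is a one-letter code: 'U' means unicode text, 'i' integer, 'f' float
print(mixed.dtype.kind)

# explicit conversion back to integers (fails if an element is not a number)
asNumbers = mixed.astype(int)

# now arithmetic works element by element again
print(asNumbers + 1)
```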

[Go to page beginning](#beginning)

_____
<a id='part2'></a>

## Tuples

Tuples are similar to lists. They can store any kind of value, and even other structures:

In [11]:
DetailStudentaTuple=("Fred Meyers",40,"False")

#the visible difference between a list and a tuple is the brackets: () instead of []


To create tuples, you can use '()', the command *tuple()* or nothing:

In [24]:
DetailStudentbTuple='Michael Nelson',60,'True'

#a tuple can be written with or without parentheses; the commas are what matter

So, **why do we need *tuples*?** When you do not want your object to be altered:

In [26]:
DetailStudentbTuple[1]=50

#in Python you are not allowed to change a tuple: altering or deleting its elements is not possible


TypeError: 'tuple' object does not support item assignment
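Even though a tuple cannot be altered, it can still be read, unpacked, and used to build a new tuple. Here is a minimal sketch; the variable names and the `grades` dict are hypothetical and only serve as illustration.

```python
DetailStudentbTuple = ('Michael Nelson', 60, 'True')

# unpacking: one variable per element
name, number, flag = DetailStudentbTuple

# "changing" a tuple really means building a new one
updatedTuple = (name, 50, flag)

# because tuples cannot change, they can be used as dict keys (lists cannot)
grades = {('Michael Nelson', 'Session 1'): 60}

print(updatedTuple, grades[('Michael Nelson', 'Session 1')])
```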

[Go to page beginning](#beginning)
____
<a id='part3'></a>
## Dicts

Dicts, on the surface, are very similar to lists in R:

In [1]:
# creating dict:
DetailStudentDict={'fullName':"Fred Meyers",
               'age':40,
               'female':False}
# seeing it:
DetailStudentDict

#a dictionary is created with {}
#dicts are structured as key-value pairs (like a keyword and its explanation)
#example: the languages a person speaks, as recorded in Excel: a plain list is not efficient if a person speaks 5-10 languages. 



{'fullName': 'Fred Meyers', 'age': 40, 'female': False}

But you soon notice a difference:

In [2]:
DetailStudentDict[0]

KeyError: 0

Dicts _only_ use their **keys** to access the elements:

In [3]:
DetailStudentDict['age']

40

Dicts do allow changing values:

In [4]:
DetailStudentDict['age']=41
# then:
DetailStudentDict['age']

41
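Besides reading and changing values, dicts allow adding and deleting keys, and looking up a key without risking a `KeyError`. The sketch below re-creates the dict from above; the key `'country'` and its value are made up for illustration.

```python
DetailStudentDict = {'fullName': "Fred Meyers", 'age': 40, 'female': False}

# adding a new key (hypothetical key and value)
DetailStudentDict['country'] = 'Peru'

# get() returns a default value instead of raising KeyError
missing = DetailStudentDict.get(0, 'not found')

# items() gives the key-value pairs, in insertion order
pairs = [(key, value) for key, value in DetailStudentDict.items()]

# deleting a key works like deleting a list element
del DetailStudentDict['country']

print(missing, pairs, DetailStudentDict)
```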

## Lists versus Tuples vs Dicts?

__A) Make sure what you have:__

You can easily know what structure you have like this:

In [16]:
type(DetailStudentDict)

#Jupyter only displays the result of the last call in a cell, so type() must be the last line to see its output. 

dict

In [17]:
type(DetailStudent)

list

In [12]:
type(DetailStudentaTuple)

tuple

__B) Make sure functions are shareable__

They share many basic functions:

In [18]:
listTest=[1,2,3,3]
tupleTest=(1,2,3,4,4)
dictTest={'a':1,'b':2,'c':2}
len(listTest), len(tupleTest), len(dictTest)

#len - function that returns the length of the data structure

(4, 5, 3)

Some may work slightly differently:

In [19]:
# using set to keep unique values:
set(listTest)

#set - function, keep unique values

{1, 2, 3}

In [20]:
set(tupleTest) # so far so good...

{1, 2, 3, 4}

In [23]:
set(dictTest) # this MAY not be what you expected.

# with a dict, the set function will show not the unique values, but the unique keys

{'a', 'b', 'c'}
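If the unique *values* of a dict are what you need, one option is to ask for the values explicitly before calling `set`. A short sketch with the same `dictTest` (the result names are made up):

```python
dictTest = {'a': 1, 'b': 2, 'c': 2}

uniqueKeys = set(dictTest)               # same as set(dictTest.keys())
uniqueValues = set(dictTest.values())    # the repeated 2 is kept only once

print(uniqueKeys, uniqueValues)
```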

Notice how comparisons work with vectors and with lists:

In [31]:
numbers4=np.array([2])
numbers1<numbers4

array([ True, False, False])

This will work the same for text:

In [32]:
list1=np.array(['b','c','d'])
list2=np.array(['a','b','d'])
list1>list2

array([ True,  True, False])

If we use lists, the comparison returns a single value instead (this behavior is not implemented in base R):

In [33]:
list1=['b','c','d']
list2=['a','b','d']
list1>list2

True

Python is doing a simple _lexicographical ordering_: it compares the lists element by element (from left to right), and at the first position where they differ it reports _True_ or _False_ according to '>' (or '<'). It is like comparing two words:

In [34]:
np.array([1,2,4]) > np.array([1,2,3]) # with vectors, each pair is compared: only the last one is true (4>3).

array([False, False,  True])

In [35]:
[1,2,4] > [1,2,3] # with lists, this is true because 4>3, and the previous elements are equal.

True

In [36]:
# this is true because 9>8, and the previous elements are equal; when a difference is detected, the comparison stops.
(1,2,9,1) > (1,2,8,9,9)

True

In [37]:
# while you cannot compare vectors if their sizes differ:
np.array([1,2,9,1]) > np.array([1,2,8,9,9]) 

ValueError: operands could not be broadcast together with shapes (4,) (5,) 

Math operations should be used with care: 

In [38]:
# This will CONCATENATE:
numbersL1=[1,2,3]
numbersL2=[1,2,3]
numbersL1+numbersL2

[1, 2, 3, 1, 2, 3]

In [39]:
# this won't work:
numbersL1 * numbersL2

TypeError: can't multiply sequence by non-int of type 'list'

In [40]:
# this will:
numbersL1 * 3

[1, 2, 3, 1, 2, 3, 1, 2, 3]
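If you want element-by-element arithmetic without numpy, one common approach is to pair the elements with `zip` inside a list comprehension. This is only a sketch; the names `sums` and `products` are made up for the example.

```python
numbersL1 = [1, 2, 3]
numbersL2 = [1, 2, 3]

# zip() pairs the elements by position: (1,1), (2,2), (3,3)
sums = [a + b for a, b in zip(numbersL1, numbersL2)]
products = [a * b for a, b in zip(numbersL1, numbersL2)]

print(sums, products)
```

Note that `zip` stops at the shortest list without any warning, so lists of different sizes will not raise an error as numpy does.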

Due to their flexibility, lists are used pervasively in simple Python code. 

[Go to page beginning](#beginning)
____
<a id='part4'></a>
## Data Frames

Data frames are containers of values. The most common analogy is a spreadsheet. To create a data frame, we need to call **pandas**:

In [41]:
import pandas

We can prepare the data frame now:

In [42]:
# columns of the data frame (as lists):
names=["Qing", "Françoise", "Raúl", "Bjork"]
ages=[32,33,28,30]
country=["China", "Senegal", "Spain", "Norway"]
education=["Bach", "Bach", "Master", "PhD"]

In [44]:
# now in a dict:
data={'names':names, 'ages':ages, 'country':country, 'education':education}
data

{'names': ['Qing', 'Françoise', 'Raúl', 'Bjork'],
 'ages': [32, 33, 28, 30],
 'country': ['China', 'Senegal', 'Spain', 'Norway'],
 'education': ['Bach', 'Bach', 'Master', 'PhD']}

...and from dict to DataFrame:

In [None]:
students=pandas.DataFrame.from_dict(data)
# seeing it:
students

Sometimes, Python users code like this:

In [45]:
import pandas as pd # renaming the library

students=pd.DataFrame.from_dict(data)
students

,names,ages,country,education
0,Qing,32,China,Bach
1,Françoise,33,Senegal,Bach
2,Raúl,28,Spain,Master
3,Bjork,30,Norway,PhD


Or like this:

In [52]:
from pandas import DataFrame as df # calling a function from the library and renaming the function name

students=df.from_dict(data)
students

,names,ages,country,education
0,Qing,32,China,Bach
1,Françoise,33,Senegal,Bach
2,Raúl,28,Spain,Master
3,Bjork,30,Norway,PhD


You can set a particular column as the **row names** (the index):

In [53]:
students.set_index('names') # students itself is not changed unless you use: students.set_index('names',inplace=True)

names,ages,country,education
Qing,32,China,Bach
Françoise,33,Senegal,Bach
Raúl,28,Spain,Master
Bjork,30,Norway,PhD
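Once a column is the index, rows can be selected by their label. The following sketch rebuilds the same data frame so it runs on its own; `byName` and `raulAge` are names made up for this example.

```python
import pandas as pd

data = {'names': ["Qing", "Françoise", "Raúl", "Bjork"],
        'ages': [32, 33, 28, 30],
        'country': ["China", "Senegal", "Spain", "Norway"],
        'education': ["Bach", "Bach", "Master", "PhD"]}
students = pd.DataFrame(data)

# set_index returns a NEW data frame; students keeps its four columns
byName = students.set_index('names')

# with names as the index, loc selects by row label and column label
raulAge = byName.loc['Raúl', 'ages']

print(byName.columns.tolist(), raulAge)
```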


The command *type()* still works here:

In [54]:
type(students)

pandas.core.frame.DataFrame

You can get more information on the data types like this (as _str()_ in R):

In [55]:
students.dtypes

names        object
ages          int64
country      object
education    object
dtype: object

The _info()_ function can get you more details:

In [56]:
students.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4 entries, 0 to 3
Data columns (total 4 columns):
names        4 non-null object
ages         4 non-null int64
country      4 non-null object
education    4 non-null object
dtypes: int64(1), object(3)
memory usage: 208.0+ bytes


The data frames in pandas behave much like in R:

In [57]:
#one particular column
students.names

0         Qing
1    Françoise
2         Raúl
3        Bjork
Name: names, dtype: object

In [58]:
# or
students['names'] # it is not the same as: students[['names']]

0         Qing
1    Françoise
2         Raúl
3        Bjork
Name: names, dtype: object

In [59]:
# it is not the same as: 
students[['names']] # a data frame, not a column (or series)

,names
0,Qing
1,Françoise
2,Raúl
3,Bjork


In [64]:
# two columns
students.iloc[:,[1,3]] 

#iloc - pandas indexer (integer location), used for selecting by position.
# data.iloc[<row selection>, <column selection>]
#students.iloc[:,[1,3]]  - ':' means select all rows, [1,3] means select only the columns in positions 1 and 3

# Single selections using iloc and DataFrame
# Rows:
#data.iloc[0] # first row of data frame 
#data.iloc[1] # second row of data frame
#data.iloc[-1] # last row of data frame 
# Columns:
#data.iloc[:,0] # first column of data frame
#data.iloc[:,1] # second column of data frame
#data.iloc[:,-1] # last column of data frame

,ages,education
0,32,Bach
1,33,Bach
2,28,Master
3,30,PhD


In [66]:
# this is also a DF
students[['country','names']]

,country,names
0,China,Qing
1,Senegal,Françoise
2,Spain,Raúl
3,Norway,Bjork


In [67]:
## Using a slice of positions is a convenient way to get several contiguous columns:
students.iloc[:,1:4]

,ages,country,education
0,32,China,Bach
1,33,Senegal,Bach
2,28,Spain,Master
3,30,Norway,PhD
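Positions are not the only option: `loc` selects by label. One difference worth keeping in mind is that a position slice excludes its end, while a label slice includes it. A minimal sketch, rebuilding the data frame so it runs on its own (`byPosition` and `byLabel` are made-up names):

```python
import pandas as pd

students = pd.DataFrame({'names': ["Qing", "Françoise", "Raúl", "Bjork"],
                         'ages': [32, 33, 28, 30],
                         'country': ["China", "Senegal", "Spain", "Norway"],
                         'education': ["Bach", "Bach", "Master", "PhD"]})

# iloc: positions 1 and 2 (position 3 is excluded)
byPosition = students.iloc[:, 1:3]

# loc: labels from 'ages' to 'country' (BOTH ends included)
byLabel = students.loc[:, 'ages':'country']

print(byPosition.columns.tolist(), byLabel.columns.tolist())
```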


Deleting a column:

In [68]:
# This is what you want to get rid of:
byeColumns=['education']

#this would change the original: students.drop(byeColumns,axis=1,inplace=True)
studentsNoEd=students.drop(byeColumns,axis=1)

# this is a new DF
studentsNoEd

,names,ages,country
0,Qing,32,China
1,Françoise,33,Senegal
2,Raúl,28,Spain
3,Bjork,30,Norway


You can modify any values in a data frame. Let me create a **deep** copy of this data frame to play with:

In [105]:
studentsCopy=students.copy()
studentsCopy

,names,ages,country,education
0,Qing,32,China,Bach
1,Françoise,33,Senegal,Bach
2,Raúl,28,Spain,Master
3,Bjork,30,Norway,PhD


Then,

In [107]:
# I can change the age of Qing to 23 replacing 32:
studentsCopy.iloc[0,1]=23 # change is immediate! (no warning)

In [108]:
studentsCopy

,names,ages,country,education
0,Qing,23,China,Bach
1,Françoise,33,Senegal,Bach
2,Raúl,28,Spain,Master
3,Bjork,30,Norway,PhD


In [109]:
# I can reset a column as **missing**:
studentsCopy.country=None

In [110]:
# And, delete a column by dropping it:
studentsCopy.drop(['ages'],axis=1,inplace=True) # axis=1 is column

In [111]:
# Then, our copy looks like this:
studentsCopy

,names,country,education
0,Qing,,Bach
1,Françoise,,Bach
2,Raúl,,Master
3,Bjork,,PhD


One important detail when erasing rows is to reset the index:

In [112]:
# another copy for you to see the difference:
studentsCopy2=students.copy()
studentsCopy2

,names,ages,country,education
0,Qing,32,China,Bach
1,Françoise,33,Senegal,Bach
2,Raúl,28,Spain,Master
3,Bjork,30,Norway,PhD


In [113]:
# drop third row (axis=0)
studentsCopy2.drop(2)

,names,ages,country,education
0,Qing,32,China,Bach
1,Françoise,33,Senegal,Bach
3,Bjork,30,Norway,PhD


In [114]:
# resetting index after dropping 
studentsCopy2.drop(2).reset_index()

,index,names,ages,country,education
0,0,Qing,32,China,Bach
1,1,Françoise,33,Senegal,Bach
2,3,Bjork,30,Norway,PhD


In [115]:
#better resetting index
studentsCopy2.drop(2).reset_index(drop=True)

,names,ages,country,education
0,Qing,32,China,Bach
1,Françoise,33,Senegal,Bach
2,Bjork,30,Norway,PhD
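Keep in mind that `drop` and `reset_index` return new data frames, so the results above are lost unless you save them. The sketch below rebuilds the data so it runs on its own; `withoutThird` and `renumbered` are names made up for the example.

```python
import pandas as pd

students = pd.DataFrame({'names': ["Qing", "Françoise", "Raúl", "Bjork"],
                         'ages': [32, 33, 28, 30],
                         'country': ["China", "Senegal", "Spain", "Norway"],
                         'education': ["Bach", "Bach", "Master", "PhD"]})
studentsCopy2 = students.copy()

# save the result of dropping the row whose index label is 2
withoutThird = studentsCopy2.drop(2)

# save the renumbered version (drop=True avoids keeping the old index as a column)
renumbered = withoutThird.reset_index(drop=True)

print(len(studentsCopy2), withoutThird.index.tolist(), renumbered.index.tolist())
```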


Pandas offers some practical functions:

In [116]:
# rows and columns
students.shape # dim(students) in R

(4, 4)

In [117]:
# length:
len(students) # length in R gives number of columns, here you get number of rows.

4

There is no specific function to get the number of rows/columns in pandas, but **len** is useful:

In [118]:
len(students.index) # or students.shape[0]

4

In [119]:
len(students.columns) # or students.shape[1]

4

Remember that you can use len with lists, tuples and data frames... and even dictionaries. Notice that it gives you the count at the top level only; it does not count the elements inside a composite element.

In [120]:
aDict={'name':'John', "language_spoken":['Spanish','English']}
len(aDict)

2

You also have _tail_ and _head_ functions in Pandas, to get some top or bottom rows:

In [121]:
students.head(2) #and students.tail(2)

,names,ages,country,education
0,Qing,32,China,Bach
1,Françoise,33,Senegal,Bach


You can also see the column names like this:

In [122]:
# similar to names() in R
students.columns

Index(['names', 'ages', 'country', 'education'], dtype='object')

It may look like a list, but it is not:

In [123]:
type(students.columns) # index type...but list functions work here!

pandas.core.indexes.base.Index

If you needed a list:

In [124]:
students.columns.values.tolist()

# or:
# students.columns.tolist()

# this is the easiest:
# list(students)

['names', 'ages', 'country', 'education']

### Querying Data Frames:

Once you have a data frame you can start writing interesting queries:

In [126]:
studentsCopy3=students.copy()
studentsCopy3

,names,ages,country,education
0,Qing,32,China,Bach
1,Françoise,33,Senegal,Bach
2,Raúl,28,Spain,Master
3,Bjork,30,Norway,PhD


In [125]:
# Who is the oldest in the group?
students[students.ages==max(students.ages)].names

1    Françoise
Name: names, dtype: object

In [127]:
# Who is above 30 and from China?
students[(students.ages>30) & (students.country=='China')] # parentheses are important with '&' in Pandas!!!

,names,ages,country,education
0,Qing,32,China,Bach


In [130]:
# Who is not from Norway?
students[students.country!="Norway"]

,names,ages,country,education
0,Qing,32,China,Bach
1,Françoise,33,Senegal,Bach
2,Raúl,28,Spain,Master


In [131]:
# Who is from one of these?

DangeourousPlaces=["Peru", "USA", "Spain"]
students[students.country.isin(DangeourousPlaces)]

#isin is an element-wise version of the Python keyword 'in'.

,names,ages,country,education
2,Raúl,28,Spain,Master


In [132]:
students[~students.country.isin(DangeourousPlaces)] # the opposite

,names,ages,country,education
0,Qing,32,China,Bach
1,Françoise,33,Senegal,Bach
3,Bjork,30,Norway,PhD


In [135]:
# The education level of who is above 30 and from China?
students[(students.ages>30) & (students.country=='China')].education

0    Bach
Name: education, dtype: object

In [138]:
# **Show me the data ordered by age (decreasing)?**
toSort=["ages"]
Order=[False]
students.sort_values(by=toSort,ascending=Order)

#Order False means from largest to smallest
#Order True means from smallest to largest

,names,ages,country,education
1,Françoise,33,Senegal,Bach
0,Qing,32,China,Bach
3,Bjork,30,Norway,PhD
2,Raúl,28,Spain,Master


In [139]:
# Show who is the oldest person with a Bachelor:
students[students.education=='Bach'].sort_values('ages',ascending=True).tail(1)

,names,ages,country,education
1,Françoise,33,Senegal,Bach
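The same questions can often be answered in more than one way. As a sketch of two alternatives, `idxmax()` returns the index label of the largest value, and `nlargest()` returns the rows with the largest values in a column. The data frame is rebuilt here so the code runs on its own; `oldestRow` and `oldestBach` are made-up names.

```python
import pandas as pd

students = pd.DataFrame({'names': ["Qing", "Françoise", "Raúl", "Bjork"],
                         'ages': [32, 33, 28, 30],
                         'country': ["China", "Senegal", "Spain", "Norway"],
                         'education': ["Bach", "Bach", "Master", "PhD"]})

# Who is the oldest in the group? idxmax gives the row label, loc gets the row
oldestRow = students.loc[students.ages.idxmax()]

# Who is the oldest person with a Bachelor? filter first, then keep the top row by age
oldestBach = students[students.education == 'Bach'].nlargest(1, 'ages')

print(oldestRow['names'], oldestBach.names.tolist())
```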


## Class exercises:

In a new Jupyter notebook, solve each exercise, and then upload it to GitHub. Name the notebook 'ex_data_structures':

A. Turn this into a Data Frame named "friends":

In [None]:
names=["Tomás", "Pauline", "Pablo", "Bjork","Alan","Juana"]
woman=[False,True,False,False,False,True]
ages=[32,33,28,30,32,27]
country=["Chile", "Senegal", "Spain", "Norway","Peru","Peru"]
education=["Bach", "Bach", "Master", "PhD","Bach","Master"]

B. Answer the following:

In [14]:
# Who is the oldest person in this group of friends?

In [None]:
# How many people are 32?

In [None]:
# How many are not Peruvian? (use two different codes)

In [None]:
# Who is the person with the highest level of education?

In [None]:
# what is the sex of the oldest person in the group?

### Homework

If you have the query:

In [None]:
# where is the youngest male in the group from?

a. Find the answer using *sort_values()*

b. Do some research and find the answer using *[where()](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.where.html)* and *[min()](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.min.html)*


**where()** works like an if-then function. Pay attention to the order. For example, with a data frame `data` that has the columns "Team" and "Age":

    # making a boolean series for a team name
    filter1 = data["Team"]=="Atlanta Hawks"

    # making a boolean series for age
    filter2 = data["Age"]>24

    # filtering data on the basis of both filters
    data.where(filter1 & filter2, inplace = True)

    # display
    data

The where method is an application of the if-then idiom. For each element in the calling DataFrame, if cond is True the element is kept; otherwise the corresponding element from `other` is used (a missing value by default).

c. Do some research and find the answer using *[query()](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.query.html)* and *[min()](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.min.html)*




Solve this in a new Jupyter notebook, and then upload it to GitHub. Name the notebook as 'hw_data_structures'.

____

* [Go to page beginning](#beginning)
* [Go to REPO in Github](https://github.com/EvansDataScience/ComputationalThinking_Gov_1)
* [Go to Course schedule](https://evansdatascience.github.io/GovernanceAnalytics/)