<center><img src="http://i.imgur.com/sSaOozN.png" width="500"></center>

## Course: Computational Thinking for Governance Analytics

### Prof. José Manuel Magallanes, PhD 
* Visiting Professor of Computational Policy at Evans School of Public Policy and Governance, and eScience Institute Senior Data Science Fellow, University of Washington.
* Professor of Government and Political Methodology, Pontificia Universidad Católica del Perú. 

_____

# Session 1:  Programming Fundamentals

## Part A: Data Structures in Python

<a id='beginning'></a>
Programming languages use data structures to tell the computer how to organize the data we are working with. The data structures one language provides are not necessarily available in another, although a structure with a given name usually behaves in a similar way across languages. Keep in mind that a particular data structure may serve one purpose well, but not others.

In everyday life, a book can be considered a data structure: we use it to store some kind of information. It has some advantages: it has a table of contents; it has numbers on the pages; you can take it with you; read it as long as you can see the words; and read it again as many times as you want. It has some disadvantages: you can lose it, and need to buy it again; it can deteriorate; get eaten by an insect; and so on.

We are going to talk about 4 data structures in Python:


1. [List](#part1) 
2. [Tuple](#part2) 
3. [Dictionary](#part3) 
4. [Data Frame](#part4) 

**Lists** and **tuples** are basic containers, while **dictionaries** (a.k.a **dicts**) could be considered less simple and with a different 'philosophy'. **Data frames** are complex structures not directly supported by base Python, but easily managed with an additional package.

____
<a id='part1'></a>

## List

Lists in Python are containers of values, as in **R**. The values can be of any kind (numbers or non-numbers), and even other containers (simple or complex). If we take a spreadsheet as a reference, a row is a 'natural' list. Unlike in R, you cannot give names to the list elements.

In [1]:
DetailStudent=["Fred Meyers",40,"False"]

The *object* 'DetailStudent' stores the list temporarily. To name an object, use combinations of letters and numbers (never start with a number) in a meaningful way. Typing the name of the object, now a list, will give you all the contents you saved in it:

In [2]:
DetailStudent

['Fred Meyers', 40, 'False']

Python's lists are similar to vectors in R, but Python does not coerce the values (40 is still a number). Lists in Python are so flexible and simple that it is common to have nested lists:

In [3]:
DetailStudentb=['Michael Nelson',60,'True']
Classroom=[DetailStudent,DetailStudentb] # list of lists
Classroom

[['Fred Meyers', 40, 'False'], ['Michael Nelson', 60, 'True']]

You can access individual elements like this:

In [4]:
Classroom[1]

# Python positions start at 0, while R positions start at 1

['Michael Nelson', 60, 'True']

From the last result, always remember that Python positions start at **0**. Here are more examples of accessing elements:

In [5]:
DetailStudentb[0] # first element of the list

'Michael Nelson'

In [6]:
DetailStudentb[:2] # everything before index 2, that is, positions 0 and 1 / In R: DetailStudentb[1:2] (both limits needed)

# this shows everything before the 3rd element

['Michael Nelson', 60]

In [7]:
DetailStudent[-1] # R does not work like this: in R, [-1] returns the list without its first element

# returns the last element; a negative index makes Python count from the end

'False'
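The access rules above also combine. The following is a minimal sketch, reusing the two student lists from this section, that chains indexes to reach inside a nested list; the result names (`second_student_name`, `name_and_age`, `last_of_last`) are made up for illustration.

```python
# Same lists as in the cells above
DetailStudent = ["Fred Meyers", 40, "False"]
DetailStudentb = ["Michael Nelson", 60, "True"]
Classroom = [DetailStudent, DetailStudentb]  # list of lists

# The first index picks the inner list, the second picks the element inside it
second_student_name = Classroom[1][0]

# A slice with both limits: the start is included, the stop is excluded
name_and_age = DetailStudentb[0:2]

# Negative positions count from the end, at every level
last_of_last = Classroom[-1][-1]

print(second_student_name, name_and_age, last_of_last)
```

Note that a slice always returns a new list, even when it contains a single element.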

You can alter lists like in R (just remember positions start from 0 in Python):

In [8]:
DetailStudent[0]='Alfred Mayer'
DetailStudent

# this changes a specific element; lists are flexible, but beware: there is no warning when you change them!

['Alfred Mayer', 40, 'False']

Deleting elements is easy, and we can do it:

* By position
* By value

Let's see. If we have these lists:

In [11]:
elementsA=[1,2,3,4] 
elementsB=[1,2,3,4]

Then:

In [12]:
## DELETING BY POSITION
del elementsA[2]  #delete third element
# then:
elementsA  # alternative (applied to the original list):  elementsA[:2]+elementsA[3:]

[1, 2, 4]

In [13]:
#  DELETING BY VALUE
elementsB.remove(2)
elementsB 

[1, 3, 4]
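Deleting by value has a detail worth seeing in action. The sketch below, with a made-up list `values`, suggests that *remove()* only deletes the first match and fails when the value is absent, while *pop()* deletes by position and hands the element back.

```python
values = [1, 2, 3, 2, 4]  # hypothetical list with a repeated value

values.remove(2)               # only the FIRST 2 is removed
after_remove = list(values)    # copy, so we can inspect it later

popped = values.pop(0)         # deletes position 0 and returns that element
after_pop = list(values)

# remove() raises a ValueError when the value is not in the list
try:
    values.remove(99)
    missing_raised = False
except ValueError:
    missing_raised = True

print(after_remove, popped, after_pop, missing_raised)
```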

Getting rid of your list:

In [16]:
newList=['a','b']
del newList
newList # be careful!... it is gone!

# Python tells you what went wrong: the name is no longer defined, because the list was deleted

NameError: name 'newList' is not defined

It is important to know how to get **unique values**:

In [17]:
weekdays=['M','T','W','Th','S','Su','Su']
weekdays

['M', 'T', 'W', 'Th', 'S', 'Su', 'Su']

In [19]:
#then:
weekdays=list(set(weekdays))
weekdays

# set() keeps only the unique values, but notice it does not preserve the original order

['T', 'S', 'Su', 'M', 'W', 'Th']
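If the original order matters, one possible alternative (not the approach used in the cell above) is to go through a dict, whose keys are unique and, assuming Python 3.7 or later, keep their insertion order. The name `unique_in_order` is only illustrative.

```python
weekdays = ['M', 'T', 'W', 'Th', 'S', 'Su', 'Su']

# dict keys are unique; from Python 3.7 on they keep insertion order
unique_in_order = list(dict.fromkeys(weekdays))

print(unique_in_order)
```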

### Doesn't Python have vectors?

Vectors are NOT part of base Python; you need a mathematical module like **numpy**. When working with numpy vectors (arrays), the comparison operators ('>', '<', etc.) work **element by element**, as in R:

In [20]:
# For Python to work as R with vectors, you need to use the 
# mathematical structure offered by numpy:

import numpy as np

# 'import' loads a library that is already installed
# 'as np' gives the library a shorter alias

vector1=np.array(['b','c','d']) # np is the alias; array is a function from numpy
vector2=np.array(['a','b','d'])
vector1>vector2 # this line compares the elements pairwise

array([ True,  True, False])

If the vectors have different sizes, the comparison still works when one of them has only ONE element:

In [21]:
vector3=np.array(['a'])
vector1>vector3 # each element of vector1 compared to the only one in vector3

array([ True,  True,  True])

In [None]:
vector3

But this fails, because the sizes (3 and 2) cannot be matched:

In [22]:
vector4=np.array(['a','b'])
vector1>vector4

# cannot compare 3 elements to 2: numpy cannot match the shapes of these vectors

ValueError: shape mismatch: objects cannot be broadcast to a single shape

This is also valid for numbers:

In [None]:
# If these are our vectors:
numbers1=np.array([1,2,3])
numbers2=np.array([1,2,3])
numbers3=np.array([1])
numbers4=np.array([10,12])

Then, these work well:

In [None]:
# adding element by element:
numbers1+numbers2

In [None]:
# adding one value to all the elements of other vector:
numbers1+numbers3

In [None]:
# multiplication (element by element)!
numbers1*numbers2

In [None]:
# and this kind of multiplication:
numbers1*3

This will not work (it does not work in R either):

In [None]:
numbers1+numbers4
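The size rule described above can be checked without stopping the notebook. This is a small sketch, assuming numpy is installed, that reuses the vectors from this section and catches the error instead of letting it interrupt the run; `stretched` and `error_message` are illustrative names.

```python
import numpy as np

numbers1 = np.array([1, 2, 3])
numbers3 = np.array([1])
numbers4 = np.array([10, 12])

# One element against three: the single value is reused for every position
stretched = numbers1 + numbers3

# Three elements against two: numpy cannot match the shapes
try:
    numbers1 + numbers4
    error_message = None
except ValueError as err:
    error_message = str(err)

print(stretched)
print(error_message)
```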

When dealing with vectors, the elements must share the same type. Otherwise, elements will be coerced into the same type:

In [None]:
numbers5=np.array([1,2,'3'])
numbers5

In [None]:
numbers6=np.array([1,2,3.0])
numbers6
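To see what the coercion did, you can inspect the *dtype* of each vector. The sketch below assumes numpy is installed; `kind5`, `kind6` and `as_int` are names introduced only for this example.

```python
import numpy as np

numbers5 = np.array([1, 2, '3'])   # a number mixed with text
numbers6 = np.array([1, 2, 3.0])   # integers mixed with a float

kind5 = numbers5.dtype.kind  # 'U' means unicode text: every value became a string
kind6 = numbers6.dtype.kind  # 'f' means floating point: every value became a float

# Going back to numbers is an explicit step
as_int = numbers5.astype(int)

print(kind5, kind6, as_int)
```

The coercion always moves toward the more general type, so text wins over numbers and floats win over integers.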

[Go to page beginning](#beginning)

_____
<a id='part2'></a>

## Tuples

Tuples are similar to lists. They can store any kind of value, and even other structures:

In [23]:
DetailStudentaTuple=("Fred Meyers",40,"False")

# the visible difference between a list and a tuple is the brackets: [] for lists, () for tuples


To create tuples, you can use '()', the command *tuple()*, or just commas with no brackets at all:

In [24]:
DetailStudentbTuple='Michael Nelson',60,'True'

# a tuple can be written with or without parentheses; the commas are what matter

So, **why do we need *tuples*?** Use them when you do not want your object to be altered:

In [26]:
DetailStudentbTuple[1]=50

# a tuple cannot be modified: changing or deleting its elements is not allowed


TypeError: 'tuple' object does not support item assignment
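The *tuple()* command mentioned above was not shown, so here is a minimal sketch of it. It also illustrates one common workaround when a value really must change: build a new tuple and leave the old one untouched. The names `student_list`, `student_tuple` and `updated` are made up for this example.

```python
student_list = ['Michael Nelson', 60, 'True']

# tuple() turns a list (or any other sequence) into a tuple
student_tuple = tuple(student_list)

# The tuple cannot be altered, but a NEW tuple can be built from its pieces
updated = student_tuple[:1] + (50,) + student_tuple[2:]

# Unpacking: one name per element, from left to right
name, age, flag = student_tuple

print(student_tuple, updated, age)
```

Notice that `(50,)` needs the trailing comma: without it, Python reads the parentheses as plain grouping around a number.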

[Go to page beginning](#beginning)
____
<a id='part3'></a>
## Dicts

Dicts, on the surface, are very similar to lists in R:

In [27]:
# creating dict:
DetailStudentDict={'fullName':"Fred Meyers",
               'age':40,
               'female':False}
# seeing it:
DetailStudentDict

# a dictionary is created with {}
# dicts are organized as pairs of key and value
# example: languages spoken, stored in Excel; a plain list is not efficient if a person speaks 5-10 languages



{'fullName': 'Fred Meyers', 'age': 40, 'female': False}

But you will soon notice a difference (this raises a KeyError):

In [None]:
DetailStudentDict[0]

Dicts _only_ use their **keys** to access the elements:

In [None]:
DetailStudentDict['age']

Dicts do allow changing values:

In [None]:
DetailStudentDict['age']=41
# then:
DetailStudentDict['age']
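Besides changing values, a dict lets you add new keys at any time. The sketch below reuses the dict from this section; the 'languages' key and the names `by_position`, `keys` and `n_languages` are assumptions added for illustration.

```python
DetailStudentDict = {'fullName': "Fred Meyers",
                     'age': 40,
                     'female': False}

# Assigning to a key that does not exist yet creates it
DetailStudentDict['languages'] = ['Spanish', 'English']

# get() returns a default value instead of raising a KeyError
by_position = DetailStudentDict.get(0, 'no such key')

keys = list(DetailStudentDict.keys())
n_languages = len(DetailStudentDict['languages'])

print(keys, by_position, n_languages)
```

A value can itself be a list, which is a simple way to store a varying number of items under one key.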

## Lists versus Tuples vs Dicts?

__A) Make sure you know what you have:__

You can easily find out which structure you have like this:

In [None]:
type(DetailStudentDict)

In [None]:
type(DetailStudent)

In [None]:
type(DetailStudentaTuple)

__B) Check which functions are shared:__

They share many basic functions:

In [None]:
listTest=[1,2,3,3]
tupleTest=(1,2,3,4,4)
dictTest={'a':1,'b':2,'c':2}
len(listTest), len(tupleTest), len(dictTest)

Some may work slightly differently:

In [None]:
# using set to keep unique values:
set(listTest)

In [None]:
set(tupleTest) # so far so good...

In [None]:
set(dictTest) # this MAY not be what you expected.
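The surprise in the last cell is that a dict, when treated as a collection, offers its keys. The following sketch shows how you might ask for the values or the pairs instead; `unique_keys`, `unique_values` and `unique_pairs` are illustrative names.

```python
dictTest = {'a': 1, 'b': 2, 'c': 2}

unique_keys = set(dictTest)             # a dict iterates over its KEYS
unique_values = set(dictTest.values())  # ask for the values explicitly
unique_pairs = set(dictTest.items())    # (key, value) tuples

print(unique_keys, unique_values, unique_pairs)
```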

Notice the use of comparisons with vectors and with lists. First, vectors:

In [None]:
numbers4=np.array([2])
numbers1<numbers4

This will work the same for text:

In [None]:
list1=np.array(['b','c','d'])
list2=np.array(['a','b','d'])
list1>list2

If we use lists instead, the comparison also works (it is not implemented in base R), but it returns a single value instead of one per element:

In [None]:
list1=['b','c','d']
list2=['a','b','d']
list1>list2

Python is doing a simple _lexicographical ordering_: it compares the lists element by element from left to right, and the first pair that differs decides whether '>' (or '<') returns _True_ or _False_. It is like comparing two words. A numpy array, in contrast, still compares element by element:

In [None]:
np.array([1,2,4]) > np.array([1,2,3]) # element by element: only the last comparison (4>3) is True

In [None]:
[1,2,4] > [1,2,3] # this is True because 4>3, and the previous elements are equal

In [None]:
# this is True because 9>8 and the previous elements are equal; once a difference is detected, the comparison stops.
(1,2,9,1) > (1,2,8,9,9) 

In [None]:
# while numpy vectors cannot be compared if their sizes differ:
np.array([1,2,9,1]) > np.array([1,2,8,9,9]) 

Math operations on lists should be used with care:

In [None]:
# This will CONCATENATE:
numbersL1=[1,2,3]
numbersL2=[1,2,3]
numbersL1+numbersL2

In [None]:
# this won't work:
numbersL1 * numbersL2

In [None]:
# this will:
numbersL1 * 3
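If you need element-by-element arithmetic but want to stay with plain lists, one option is to pair the elements with *zip()* inside a list comprehension. This is only a sketch of that idea; `products`, `sums` and `repeated` are illustrative names.

```python
numbersL1 = [1, 2, 3]
numbersL2 = [1, 2, 3]

# zip() pairs the elements by position; the comprehension operates on each pair
products = [a * b for a, b in zip(numbersL1, numbersL2)]
sums = [a + b for a, b in zip(numbersL1, numbersL2)]

# For comparison: '*' with an integer REPEATS the list, it does not multiply
repeated = numbersL1 * 3

print(products, sums, repeated)
```

Be aware that *zip()* silently stops at the shorter list, so it will not warn you about a size mismatch the way numpy does.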

Due to their flexibility, lists are used pervasively in simple Python code.

[Go to page beginning](#beginning)
____
<a id='part4'></a>
## Data Frames

Data frames are containers of values. The most common analogy is a spreadsheet. To create a data frame, we need to import **pandas**:

In [None]:
import pandas

We can prepare the data frame now:

In [None]:
# columns of the data frame (as lists):
names=["Qing", "Françoise", "Raúl", "Bjork"]
ages=[32,33,28,30]
country=["China", "Senegal", "Spain", "Norway"]
education=["Bach", "Bach", "Master", "PhD"]

In [None]:
# now in a dict:
data={'names':names, 'ages':ages, 'country':country, 'education':education}
data

...and from dict to DataFrame:

In [None]:
students=pandas.DataFrame.from_dict(data)
# seeing it:
students

Sometimes, Python users code like this:

In [None]:
import pandas as pd # renaming the library

students=pd.DataFrame.from_dict(data)
students

Or like this:

In [None]:
from pandas import DataFrame as df # importing one class from the library and giving it an alias

students=df.from_dict(data)
students

You can set a particular column as the **row names** (the index):

In [None]:
students.set_index('names') # students itself does not change until you use: students.set_index('names',inplace=True)
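The point about *set_index()* not changing the original can be verified directly. The sketch below assumes pandas is installed and rebuilds the same data frame so that it runs on its own; `indexed` and `raul_age` are names introduced for illustration.

```python
import pandas as pd

data = {'names': ["Qing", "Françoise", "Raúl", "Bjork"],
        'ages': [32, 33, 28, 30],
        'country': ["China", "Senegal", "Spain", "Norway"],
        'education': ["Bach", "Bach", "Master", "PhD"]}
students = pd.DataFrame.from_dict(data)

# set_index() returns a NEW data frame; students keeps its numeric index
indexed = students.set_index('names')

original_columns = list(students.columns)  # 'names' is still a column here
indexed_columns = list(indexed.columns)    # 'names' became the index here

# With names as the index, rows can be reached by label
raul_age = indexed.loc['Raúl', 'ages']

print(original_columns, indexed_columns, raul_age)
```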

The command *type()* still works here:

In [None]:
type(students)

You can get more information on the data types like this (as _str()_ in R):

In [None]:
students.dtypes

The _info()_ function can get you more details:

In [None]:
students.info()

The data frames in pandas behave much like in R:

In [None]:
#one particular column
students.names

In [None]:
# or
students['names'] # it is not the same as: students[['names']]

In [None]:
# it is not the same as: 
students[['names']] # a data frame, not a column (or series)

In [None]:
# two columns
students.iloc[:,[1,3]]  

In [None]:
# this is also a DF
students[['country','names']]

In [None]:
## Using positions is a convenient way to get several adjacent columns:
students.iloc[:,1:4]

Deleting a column:

In [None]:
# This is what you want to get rid of:
byeColumns=['education']

# this would change the original: students.drop(byeColumns,axis=1,inplace=True)
studentsNoEd=students.drop(byeColumns,axis=1)

# this is a new DF
studentsNoEd

You can modify any values in a data frame. Let me create a **deep** copy of this data frame to play with:

In [None]:
studentsCopy=students.copy()
studentsCopy

Then,

In [None]:
# I can change the age of Qing to 23, replacing 32:
studentsCopy.loc[0,'ages']=23 # change is immediate! (no warning); the column name avoids depending on column order

In [None]:
# I can set a whole column as missing:
studentsCopy.country=None

In [None]:
# And delete a column by dropping it:
studentsCopy.drop(['ages'],axis=1,inplace=True) # axis=1 means columns

In [None]:
# Then, our copy looks like this:
studentsCopy

One important detail when erasing rows is to reset the index:

In [None]:
# another copy for you to see the difference:
studentsCopy2=students.copy()
studentsCopy2

In [None]:
# drop third row (axis=0)
studentsCopy2.drop(2) 

In [None]:
# resetting index
studentsCopy2.drop(2).reset_index()

In [None]:
#better resetting index
studentsCopy2.drop(2).reset_index(drop=True)
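The three cells above can be compared side by side. This is a minimal sketch, assuming pandas is installed and using a shortened two-column version of the data frame; `dropped`, `kept_old` and `clean` are illustrative names.

```python
import pandas as pd

students = pd.DataFrame({'names': ["Qing", "Françoise", "Raúl", "Bjork"],
                         'ages': [32, 33, 28, 30]})

dropped = students.drop(2)             # removes the row labeled 2; students is untouched
index_after_drop = list(dropped.index)  # the labels now have a gap

kept_old = dropped.reset_index()         # old labels are saved in a new 'index' column
clean = dropped.reset_index(drop=True)   # old labels are discarded

print(index_after_drop)
print(list(kept_old.columns))
print(list(clean.index))
```

Using `drop=True` is usually what you want when the old labels carry no meaning, as with the default numeric index.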

Pandas offers some practical functions:

In [None]:
# rows and columns
students.shape # dim(students) in R

In [None]:
# length:
len(students) # length() in R gives the number of columns; here you get the number of rows.

Pandas has no dedicated function that returns only the number of rows or only the number of columns, but **len** (or one element of **shape**) is useful:

In [None]:
len(students.index) # or students.shape[0]

In [None]:
len(students.columns) # or students.shape[1]

Remember that you can use len with lists, tuples and data frames, and even with dictionaries. Notice that it gives you the count at the top level; it does not count what is inside a composite element.

In [None]:
aDict={'name':'John', "language_spoken":['Spanish','English']}
len(aDict)

You also have _tail_ and _head_ functions in Pandas, to get some top or bottom rows:

In [None]:
students.head(2) #and students.tail(2)

You can also see the column names like this:

In [None]:
# similar to names() in R
students.columns

It may look like a list, but it is not:

In [None]:
type(students.columns) # an Index type... but many list operations work here!

If you needed a list:

In [None]:
students.columns.values.tolist()

# or:
# students.columns.tolist()

# this is the easiest:
# list(students)

### Querying Data Frames:

Once you have a data frame you can start writing interesting queries:

In [None]:
# Who is the oldest in the group?
students[students.ages==max(students.ages)].names

In [None]:
# Who is above 30 and from China?
students[(students.ages>30) & (students.country=='China')] # parentheses are important with '&' in Pandas!!!

In [None]:
# Who is not from Norway?
students[students.country!="Norway"] 

In [None]:
# Who is from one of these?

DangerousPlaces=["Peru", "USA", "Spain"]
students[students.country.isin(DangerousPlaces)]

In [None]:
students[~students.country.isin(DangerousPlaces)] # the opposite

In [None]:
# What is the education level of whoever is above 30 and from China?
students[(students.ages>30) & (students.country=='China')].education 

In [None]:
# Show me the data ordered by age (decreasing):
toSort=["ages"]
Order=[False]
students.sort_values(by=toSort,ascending=Order)

In [None]:
# Show who is the oldest person with a Bachelor:
students[students.education=='Bach'].sort_values('ages',ascending=True).tail(1)
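Some of the queries above can also be written with *idxmax()*, which returns the index label of the largest value. This is just one possible alternative, sketched under the assumption that pandas is installed; `oldest_name` and `oldest_bach` are illustrative names.

```python
import pandas as pd

data = {'names': ["Qing", "Françoise", "Raúl", "Bjork"],
        'ages': [32, 33, 28, 30],
        'country': ["China", "Senegal", "Spain", "Norway"],
        'education': ["Bach", "Bach", "Master", "PhD"]}
students = pd.DataFrame.from_dict(data)

# idxmax() gives the row label of the maximum age; loc uses that label
oldest_name = students.loc[students.ages.idxmax(), 'names']

# The same idea after filtering: oldest person with a Bachelor
bach = students[students.education == 'Bach']
oldest_bach = bach.loc[bach.ages.idxmax(), 'names']

print(oldest_name, oldest_bach)
```

One design difference: *idxmax()* returns only the first row that reaches the maximum, while the boolean filter `students.ages==max(students.ages)` keeps every tied row.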

## Class exercises:

In a new Jupyter notebook, solve each exercise, and then upload the notebook to GitHub. Name the notebook 'ex_data_structures':

A. Turn this into a data frame named "friends":

In [None]:
names=["Tomás", "Pauline", "Pablo", "Bjork","Alan","Juana"]
woman=[False,True,False,False,False,True]
ages=[32,33,28,30,32,27]
country=["Chile", "Senegal", "Spain", "Norway","Peru","Peru"]
education=["Bach", "Bach", "Master", "PhD","Bach","Master"]

B. Answer the following:

In [None]:
# Who is the oldest person in this group of friends?

In [None]:
# How many people are 32?

In [None]:
# How many are not Peruvian? (use two different codes)

In [None]:
# Who is the person with the highest level of education?

In [None]:
# what is the sex of the oldest person in the group?

### Homework

If you have the query:

In [None]:
# where is the youngest male in the group from?

a. Find the answer using *sort_values()*

b. Do some research and find the answer using *[where()](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.where.html)* and *[min()](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.min.html)*

c. Do some research and find the answer using *[query()](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.query.html)* and *[min()](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.min.html)*

Solve this in a new Jupyter notebook, and then upload it to GitHub. Name the notebook as 'hw_data_structures'.

____

* [Go to page beginning](#beginning)
* [Go to REPO in Github](https://github.com/EvansDataScience/ComputationalThinking_Gov_1)
* [Go to Course schedule](https://evansdatascience.github.io/GovernanceAnalytics/)