<br> 
<center><img src="https://i.imgur.com/hkb7Bq7.png" width="500"></center>


### Prof. José Manuel Magallanes, PhD

* Associate Professor, Departamento de Ciencias Sociales, Pontificia Universidad Católica del Perú, [jmagallanes@pucp.edu.pe](mailto:jmagallanes@pucp.edu.pe)

* Visiting Associate Professor, Evans School of Public Policy and Governance / Senior Data Science Fellow, eScience Institute, University of Washington, [magajm@uw.edu](mailto:magajm@uw.edu)
_____



# Session 1: Introduction to Python

# 1.  Data Structures

Python has basic native structures, like lists, tuples and dictionaries.

## A.  **LISTS** 

Lists are the most flexible structure to save or contain data elements.

In [0]:
names=["Qing", "Françoise", "Raúl", "Bjork","Marie"]
ages=[32,33,28,30,29]
country=["China", "Senegal", "España", "Norway","Korea"]
education=["Bach", "Bach", "Master", "PhD","PhD"]

Above we have created some lists. Lists can contain any values. Lists support different operations:

* **Accessing**:

Keep in mind that positions in Python start at **0**.

In [9]:
# one element
ages[-1]

29

In [10]:
# several, using slices:
ages[1::] # from the second element to the last one

[33, 28, 30, 29]

In [6]:
# several, using slices:
ages[:-2] #all but two last ones

[32, 33, 28]

In [11]:
# non consecutive
from operator import itemgetter
list(itemgetter(0,2,3)(ages))

[32, 28, 30]

In [17]:
# difficult to understand?
ages[0:4:2] + [ages[3]]

[32, 28, 30]
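
If `itemgetter` or the slice trick feels obscure, a list comprehension is another way to pick non-consecutive positions. This is only a sketch; the name `positions` is my own choice for illustration.

```python
ages = [32, 33, 28, 30, 29]
positions = [0, 2, 3]  # hypothetical list with the positions we want

# visit each requested position and collect the value found there
selected = [ages[i] for i in positions]
print(selected)
```

The result is a new list, so `ages` itself is not altered.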

* **Modifying**:

In [18]:
# by position
country[2]="Spain"

# list changed:
country

['China', 'Senegal', 'Spain', 'Norway', 'Korea']

In [19]:
# by value
country=["PR China" if x == "China" else x for x in country]

# list changed:
country

['PR China', 'Senegal', 'Spain', 'Norway', 'Korea']

* **Deleting**

In [20]:
# by position
del country[-1] #last value

# list changed:
country

['PR China', 'Senegal', 'Spain', 'Norway']

In [21]:
# by position
names.pop() #last value by default

# list changed:
names

['Qing', 'Françoise', 'Raúl', 'Bjork']

In [22]:
# only 'del' works for several positions

lista=[1,2,3,4,5,6]
del lista[1:3]

#now:
lista

[1, 4, 5, 6]

In [23]:
# by value
ages.remove(29) 

# list changed:
ages # just the first occurrence of the value!!

[32, 33, 28, 30]

In [24]:
# by value
education.remove('PhD') 

# list changed:
education # just the first occurrence!!

['Bach', 'Bach', 'Master', 'PhD']

In [29]:
# deleting every occurrence of a value:

lista=[1,'a',45,'b','a']
lista=[x for x in lista if x!='a']

# you get:
lista

[1, 45, 'b']
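
The deleting tools differ in small ways that are easy to miss: `pop` gives back the element it removes, while `remove` raises an error when the value is not present. A minimal sketch, using a made-up list called `letters`:

```python
letters = ["a", "b", "c", "b"]  # hypothetical list for illustration

last = letters.pop()     # removes AND returns the last element
first = letters.pop(0)   # removes AND returns the element at position 0

missing = False
try:
    letters.remove("z")  # "z" is not in the list
except ValueError:
    missing = True       # remove() complains when the value is absent

print(last, first, letters, missing)
```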

* **Inserting values**

In [30]:
# at the end
lista.append("abc")
lista

[1, 45, 'b', 'abc']

In [31]:
# PART ONE:
# first delete a position
education.pop(2)
education

['Bach', 'Bach', 'PhD']

In [32]:
# PART TWO:
# now insert in that position
education.insert(2,"Master")
education

['Bach', 'Bach', 'Master', 'PhD']

## B.  **TUPLES**

Tuples are immutable structures in Python. They look like lists but do not share much of their functionality:

In [0]:
# new tuple (the parentheses are optional):
weekend="Friday", "Saturday", "Sunday"

You can access:

In [36]:
weekend[0]

'Friday'

But no operation that modifies the tuple (changing, adding or deleting elements) is allowed.
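
A short sketch of what immutability means in practice: trying to change an element fails, while reading operations such as slicing or unpacking still work because they do not alter the tuple. The variable names other than `weekend` are my own.

```python
weekend = ("Friday", "Saturday", "Sunday")

message = ""
try:
    weekend[0] = "Thursday"   # item assignment is not supported
except TypeError as error:
    message = str(error)

# slicing only reads, and returns a NEW tuple
last_two = weekend[1:]

# unpacking gives each element its own name
day1, day2, day3 = weekend

print(message)
print(last_two, day3)
```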

Python itself uses tuples as output of some important functions:

In [37]:
zip(names,ages)

<zip at 0x7f7206ee9048>

The **zip** function creates tuples by combining the elements of several structures in parallel. You can see it if you turn the result into a list:

In [38]:
list(zip(names,ages))  # a list of tuples

[('Qing', 32), ('Françoise', 33), ('Raúl', 28), ('Bjork', 30)]
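
Two related uses of **zip** may be worth knowing; this is a sketch with names of my own (`pairs`, `age_by_name`). The pairs can feed a dictionary directly, and `zip(*pairs)` undoes the pairing. Note also that **zip** stops at the shortest input.

```python
names = ["Qing", "Françoise", "Raúl", "Bjork"]
ages = [32, 33, 28, 30]

pairs = list(zip(names, ages))          # list of (name, age) tuples
age_by_name = dict(pairs)               # each tuple becomes key:value
names_again, ages_again = zip(*pairs)   # "unzip" back into two tuples

# zip stops when the shortest input runs out
short = list(zip([1, 2, 3], ["a", "b"]))

print(age_by_name["Raúl"], names_again, short)
```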

## C. **DICTIONARIES**  

*Dicts* work in a more sophisticated way, as they have a **'key'**:**'value'** structure:

In [39]:
classroom={'student':names,'age':ages,'edu':education}
# see it:

classroom

{'age': [32, 33, 28, 30],
 'edu': ['Bach', 'Bach', 'Master', 'PhD'],
 'student': ['Qing', 'Françoise', 'Raúl', 'Bjork']}

Dicts do not use indexes to access values:

In [48]:
classroom[0]

KeyError: ignored

Dicts use keys:

In [47]:
classroom['student']

['Qing', 'Françoise', 'Raúl', 'Bjork']

Notice I created a dictionary where the value is not ONE but a LIST of values.

Once you access a value, you can modify it. You can also use _pop_ or _del_ with the **keys**. But you cannot use _append_ to add a new key to the dict; you need **update** (or a direct assignment to the new key):

In [42]:
classroom.update({'country':country})
# now:
classroom

{'age': [32, 33, 28, 30],
 'country': ['PR China', 'Senegal', 'Spain', 'Norway'],
 'edu': ['Bach', 'Bach', 'Master', 'PhD'],
 'student': ['Qing', 'Françoise', 'Raúl', 'Bjork']}
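
The other dict operations mentioned above can be sketched like this, on a reduced, hypothetical version of `classroom` so the block runs on its own:

```python
# reduced, hypothetical version of the dict used in this session
classroom = {"student": ["Qing", "Raúl"], "age": [32, 28]}

# adding a key by direct assignment (same effect as update)
classroom["country"] = ["PR China", "Spain"]

# pop removes the key and returns its value
removed_ages = classroom.pop("age")

# del removes the key without returning anything
del classroom["country"]

# get avoids a KeyError when the key may not exist
edu = classroom.get("edu", "not recorded")

print(classroom, removed_ages, edu)
```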

## D. DATA FRAMES

**Data frames**  are more complex containers of values. The most common analogy is a spreadsheet. To create a data frame, we need to call **pandas**:

In [0]:
import pandas

We can prepare a data frame from a dictionary immediately, but ONLY if every list representing a column has the same number of elements.

In [50]:
# our data frame:
students=pandas.DataFrame(classroom)
## see it:
students

|   | student | age | edu | country |
|---|---|---|---|---|
| 0 | Qing | 32 | Bach | PR China |
| 1 | Françoise | 33 | Bach | Senegal |
| 2 | Raúl | 28 | Master | Spain |
| 3 | Bjork | 30 | PhD | Norway |


But let me update the dictionary with a longer list of names:

In [51]:
names=["Qing", "Françoise", "Raúl", "Bjork","Marie"]
#
classroom.update({'student':names})
#
classroom

{'age': [32, 33, 28, 30],
 'country': ['PR China', 'Senegal', 'Spain', 'Norway'],
 'edu': ['Bach', 'Bach', 'Master', 'PhD'],
 'student': ['Qing', 'Françoise', 'Raúl', 'Bjork', 'Marie']}

We have five students, but data for only four of them. Since the lists now differ in length, this does not work:

In [52]:
pandas.DataFrame(classroom)

ValueError: ignored

In that case, you need this:

In [53]:
#then
students=pandas.DataFrame({key:pandas.Series(value) for key, value in classroom.items()})

# seeing it:
students

|   | student | age | edu | country |
|---|---|---|---|---|
| 0 | Qing | 32.0 | Bach | PR China |
| 1 | Françoise | 33.0 | Bach | Senegal |
| 2 | Raúl | 28.0 | Master | Spain |
| 3 | Bjork | 30.0 | PhD | Norway |
| 4 | Marie | NaN | NaN | NaN |
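
Why does wrapping each list in a `Series` help? As far as I can tell, the reason is that every `Series` carries its own index, and the data frame lines the columns up by index, filling the gaps with missing values. A small sketch with made-up columns `short` and `long`:

```python
import pandas

short = pandas.Series([32, 33])         # index: 0, 1
long = pandas.Series(["a", "b", "c"])   # index: 0, 1, 2

# columns are aligned on the index; position 2 of 'short' has no value
frame = pandas.DataFrame({"short": short, "long": long})

print(frame)
print(frame.dtypes)
```

Notice the side effect: the integers in `short` become decimals, because the missing value forces the column into a float type.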


Python users often import pandas under a shorter alias and code like this:

In [54]:
import pandas as pd # renaming the library

students=pd.DataFrame({key:pd.Series(value) for key, value in classroom.items()})
students

|   | student | age | edu | country |
|---|---|---|---|---|
| 0 | Qing | 32.0 | Bach | PR China |
| 1 | Françoise | 33.0 | Bach | Senegal |
| 2 | Raúl | 28.0 | Master | Spain |
| 3 | Bjork | 30.0 | PhD | Norway |
| 4 | Marie | NaN | NaN | NaN |


### Data frame basic operations

In [55]:
# type of structure: list? tuple? dataframe?
type(students)

pandas.core.frame.DataFrame

In [56]:
# data type of each column in the data frame
students.dtypes

student     object
age        float64
edu         object
country     object
dtype: object
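
The `float64` type of `age` is a consequence of the missing value, not of the ages themselves. If whole numbers are preferred, recent pandas versions offer a nullable integer type named `"Int64"` (capital I). A sketch on a stand-alone series, under the assumption that your pandas version supports it:

```python
import pandas as pd

# a missing value forces a plain numeric column to be float64
ages = pd.Series([32.0, 33.0, None])

# "Int64" keeps integers and still allows a missing value
ages_int = ages.astype("Int64")

print(ages.dtype, ages_int.dtype)
print(ages_int)
```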

In [57]:
# details of data frame
students.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 4 columns):
student    5 non-null object
age        4 non-null float64
edu        4 non-null object
country    4 non-null object
dtypes: float64(1), object(3)
memory usage: 288.0+ bytes


In [58]:
# number of rows and columns
students.shape 

(5, 4)

In [59]:
# number of rows:
len(students) 

5

In [60]:
# first rows
students.head(2) # compare with: students.tail(2)

|   | student | age | edu | country |
|---|---|---|---|---|
| 0 | Qing | 32.0 | Bach | PR China |
| 1 | Françoise | 33.0 | Bach | Senegal |


In [61]:
# name of columns
students.columns

Index(['student', 'age', 'edu', 'country'], dtype='object')

If you needed the column names as a list:

In [62]:
students.columns.tolist()# or simply: list(students)

['student', 'age', 'edu', 'country']

If you needed the values of a column as a list:

In [65]:
students.age.tolist() # or: list(students.age)

[32.0, 33.0, 28.0, 30.0, nan]

### Accessing elements in DF:

The data frames in pandas behave much like in R:

In [66]:
#one particular column
students.student

0         Qing
1    Françoise
2         Raúl
3        Bjork
4        Marie
Name: student, dtype: object

In [67]:
# or
students['student'] 

0         Qing
1    Françoise
2         Raúl
3        Bjork
4        Marie
Name: student, dtype: object

In [68]:
# it is not the same as: 
students[['student']] # a data frame, not a column (or series)

|   | student |
|---|---|
| 0 | Qing |
| 1 | Françoise |
| 2 | Raúl |
| 3 | Bjork |
| 4 | Marie |


In [69]:
# this is also a DF
students[['country','student']]

|   | country | student |
|---|---|---|
| 0 | PR China | Qing |
| 1 | Senegal | Françoise |
| 2 | Spain | Raúl |
| 3 | Norway | Bjork |
| 4 | NaN | Marie |


In [70]:
# and this, using loc:
columnNames=['country','student']
students.loc[:,columnNames]

|   | country | student |
|---|---|---|
| 0 | PR China | Qing |
| 1 | Senegal | Françoise |
| 2 | Spain | Raúl |
| 3 | Norway | Bjork |
| 4 | NaN | Marie |


In [71]:
## Using positions is very common:
columnPositions=[1,3,0]
students.iloc[:,columnPositions] 

|   | age | country | student |
|---|---|---|---|
| 0 | 32.0 | PR China | Qing |
| 1 | 33.0 | Senegal | Françoise |
| 2 | 28.0 | Spain | Raúl |
| 3 | 30.0 | Norway | Bjork |
| 4 | NaN | NaN | Marie |
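
Both `loc` and `iloc` can select rows and columns at the same time, but they read slices differently: `loc` works with labels and includes the end of the slice, while `iloc` works with positions and excludes it. A sketch on a reduced, hypothetical copy of the data:

```python
import pandas as pd

# reduced, hypothetical version of the students data frame
students = pd.DataFrame({
    "student": ["Qing", "Françoise", "Raúl"],
    "age": [32, 33, 28],
    "country": ["PR China", "Senegal", "Spain"],
})

# loc: labels; the slice 0:1 INCLUDES label 1 (two rows)
by_label = students.loc[0:1, ["student", "age"]]

# iloc: positions; the slice 0:1 EXCLUDES position 1 (one row)
by_position = students.iloc[0:1, [0, 1]]

# loc also accepts a condition to choose the rows
older = students.loc[students["age"] > 30, "student"]

print(by_label)
print(by_position)
print(older.tolist())
```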


### Changing values

If you have a position, you can update values:

In [72]:
students.iloc[4,1]=23 # change is immediate! (no warning)
students

|   | student | age | edu | country |
|---|---|---|---|---|
| 0 | Qing | 32.0 | Bach | PR China |
| 1 | Françoise | 33.0 | Bach | Senegal |
| 2 | Raúl | 28.0 | Master | Spain |
| 3 | Bjork | 30.0 | PhD | Norway |
| 4 | Marie | 23.0 | NaN | NaN |
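
Often you know the value that identifies the row, not its position. In that case `loc` with a condition can do the update. A minimal sketch on a two-row, hypothetical data frame:

```python
import pandas as pd

# two-row, hypothetical version of the data
students = pd.DataFrame({"student": ["Bjork", "Marie"],
                         "age": [30.0, None]})

# rows where the condition is True, column 'age'
students.loc[students["student"] == "Marie", "age"] = 23

print(students)
```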


### Deleting columns

You can modify any values in a data frame, but let me create a **deep** copy of this data frame to play with:

In [74]:
studentsCopy=students.copy()
studentsCopy

|   | student | age | edu | country |
|---|---|---|---|---|
| 0 | Qing | 32.0 | Bach | PR China |
| 1 | Françoise | 33.0 | Bach | Senegal |
| 2 | Raúl | 28.0 | Master | Spain |
| 3 | Bjork | 30.0 | PhD | Norway |
| 4 | Marie | 23.0 | NaN | NaN |


In [75]:
# This is what you want to get rid of:
byeColumns=['edu'] # you can delete more than one

#this is the result
studentsCopy.drop(columns=byeColumns)

|   | student | age | country |
|---|---|---|---|
| 0 | Qing | 32.0 | PR China |
| 1 | Françoise | 33.0 | Senegal |
| 2 | Raúl | 28.0 | Spain |
| 3 | Bjork | 30.0 | Norway |
| 4 | Marie | 23.0 | NaN |


Notice that the previous result was not saved; the data frame is unchanged:

In [76]:
studentsCopy

|   | student | age | edu | country |
|---|---|---|---|---|
| 0 | Qing | 32.0 | Bach | PR China |
| 1 | Françoise | 33.0 | Bach | Senegal |
| 2 | Raúl | 28.0 | Master | Spain |
| 3 | Bjork | 30.0 | PhD | Norway |
| 4 | Marie | 23.0 | NaN | NaN |


In [0]:
#NOW we do
studentsCopy.drop(columns=byeColumns,inplace=True)

In [78]:
#then:
studentsCopy

|   | student | age | country |
|---|---|---|---|
| 0 | Qing | 32.0 | PR China |
| 1 | Françoise | 33.0 | Senegal |
| 2 | Raúl | 28.0 | Spain |
| 3 | Bjork | 30.0 | Norway |
| 4 | Marie | 23.0 | NaN |
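
An alternative to `inplace=True` is to store the result of `drop` under a name, which leaves the original available in case you need it again. A sketch with a hypothetical two-row data frame and a name of my own, `trimmed`:

```python
import pandas as pd

# two-row, hypothetical version of the data
students = pd.DataFrame({"student": ["Qing", "Raúl"],
                         "age": [32, 28],
                         "edu": ["Bach", "Master"]})

# drop returns a NEW data frame; keep it under another name
trimmed = students.drop(columns=["edu"])

print(list(students))  # original still has 'edu'
print(list(trimmed))   # the new one does not
```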


### Deleting a row

Let me delete a row:

In [79]:
# index=2 deletes the row labeled 2 (same as labels=2, axis=0)
studentsCopy.drop(index=2,inplace=True) 
studentsCopy

|   | student | age | country |
|---|---|---|---|
| 0 | Qing | 32.0 | PR China |
| 1 | Françoise | 33.0 | Senegal |
| 3 | Bjork | 30.0 | Norway |
| 4 | Marie | 23.0 | NaN |


As you see, index label 2 disappeared, leaving a gap in the numbering. You should then reset the index:

In [80]:
studentsCopy.reset_index(drop=True,inplace=True)
studentsCopy

|   | student | age | country |
|---|---|---|---|
| 0 | Qing | 32.0 | PR China |
| 1 | Françoise | 33.0 | Senegal |
| 2 | Bjork | 30.0 | Norway |
| 3 | Marie | 23.0 | NaN |
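
Deleting and renumbering can also be chained in one step. The sketch below, on a hypothetical three-row data frame, also shows what `drop=True` is for: without it, the old index is kept as a new column named `index`.

```python
import pandas as pd

# three-row, hypothetical version of the data
students = pd.DataFrame({"student": ["Qing", "Raúl", "Bjork"],
                         "age": [32, 28, 30]})

# delete row 1 and renumber, discarding the old index
renumbered = students.drop(index=1).reset_index(drop=True)

# without drop=True the old index survives as a column
kept_old = students.drop(index=1).reset_index()

print(renumbered)
print(kept_old)
```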


----
_____

<a id='part2'></a>


# 2.  Data Preprocessing

<a id='beginning'></a>

Preprocessing includes three stages:

1. **Cleaning**: Cleaning requires that every cell has the right value, and that the dataframe has only the contents needed. Having a clean data frame means:

    a. Verify that headers are well read, well written and are at the top of the data frame.

    b. Verify that the last lines of the data frame are just data (no footnotes or totals).

    c. Verify that every row corresponds to a unit of analysis.

    d. Verify that each cell has a well-written category or number.
    <br>
    
2. **Formatting**: Formatting requires that the clean values are coded in the right data type:

    a. Categorical: Ordinal or Nominal.
    
    b. Numerical.
    
    c. Text.
    
    d. Date.
    
    It also requires that the data frame has the right shape.
    
    <br>
    
3. **Integrating and Saving**: It is the process of combining several data frames into one, and saving it into a file that can be the input of future processes.
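
The first two stages can be sketched on a tiny, made-up data frame. Everything here (the column names, the values, the footer row) is hypothetical and only meant to show the kind of step each stage involves, not a recipe for any particular file.

```python
import pandas as pd

# made-up raw data: messy header, stray spaces, a footer row, numbers as text
raw = pd.DataFrame({
    "Country ": [" Peru", "Chile ", "Source: hypothetical"],
    "level": ["medium", "high", None],
    "score": ["0.75", "0.85", None],
})

# CLEANING: tidy the headers, drop the footer row, trim the cells
clean = raw.rename(columns=lambda name: name.strip().lower())
clean = clean.dropna(subset=["score"]).copy()
clean["country"] = clean["country"].str.strip()

# FORMATTING: text to number, text to ordinal category
clean["score"] = pd.to_numeric(clean["score"])
clean["level"] = pd.Categorical(clean["level"],
                                categories=["low", "medium", "high"],
                                ordered=True)

print(clean)
print(clean.dtypes)
```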


<a id='part2'></a>

## Cleaning

Cleaning requires that every cell has the right value, and that the dataframe has only the contents needed.

## Exercise 1:

* Go to this [website](http://hdr.undp.org/en/content/table-1-human-development-index-and-its-components-1) and download the CSV file about the Human Development Index and its components.
* Go to your gmail account and create a new GoogleSheet.
* Import the file into the GoogleSheet.
* Make sure commas do not appear in thousands.
* Create a CSV link to that data from Google.
* Make a plan to have a clean data frame.
* Execute the plan in Python.
* Give the data the right format.
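
As a hint for the thousands issue: if commas survive in the file, `read_csv` has a `thousands` argument that may help. The sketch below reads from a text string with made-up values instead of a real link, so it runs without internet; `read_csv` also accepts a URL in place of the string buffer.

```python
import io
import pandas as pd

# made-up CSV text; the quoted numbers use commas for thousands
csv_text = 'country,income\nCountry A,"1,234"\nCountry B,"56,789"\n'

# thousands="," tells pandas to read "1,234" as the number 1234
hdi = pd.read_csv(io.StringIO(csv_text), thousands=",")

print(hdi)
print(hdi.dtypes)
```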


## Exercise 2:

* Go to this [website](https://en.wikipedia.org/wiki/Democracy_Index) and scrape the table with all the countries.
* Verify the data type returned.
* Make a plan to clean the data collected.
* Execute the plan in Python.
* Give the data the right format.

## Exercise 3:
* Integrate those data frames into one, and save it for R.
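
One possible shape for the integration step, sketched with two made-up data frames (`hdi` and `demo`, with hypothetical columns and values): `merge` combines them on a shared column, and `to_csv` writes a plain file that R can open with `read.csv`. Here the output goes to a text buffer so nothing is written to disk; in practice you would pass a file name.

```python
import io
import pandas as pd

# made-up data frames sharing the column 'country'
hdi = pd.DataFrame({"country": ["A", "B", "C"],
                    "hdi": [0.9, 0.7, 0.5]})
demo = pd.DataFrame({"country": ["A", "B", "D"],
                     "demo_score": [9.0, 6.5, 3.0]})

# inner merge: keep only the countries present in BOTH data frames
merged = hdi.merge(demo, on="country", how="inner")

# save as CSV without the index column
buffer = io.StringIO()
merged.to_csv(buffer, index=False)

print(merged)
print(buffer.getvalue())
```

Before merging real data, it is worth checking that the country names are written the same way in both data frames; otherwise those rows are silently lost in an inner merge.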