# Introduction to Python

Authors : Cindee Madison, Thomas Kluyver (https://bids.github.io/2016-01-14-berkeley/python/00-python-intro.html)

Translations and extensions by: Santiago Alonso

We will work in the cloud, using Google Colab.


For local use, install Anaconda.



Some practice exercises:
* https://horstmann.com/codecheck/python-questions.html

## 1. VARIABLES 

The most basic components of any programming language are "things", also called variables or (in special cases) objects.

The most common basic "things" in Python are: 
- integer
- float 
- strings 
- booleans 
- Others (e.g. pandas dataframes, numpy arrays). 

We'll meet many of these as we go through the lesson.


__TIP:__ To run the code in a cell quickly, press Shift-Enter.

__TIP:__ To quickly create a new cell below an existing one, go to a cell and type b. 

In [None]:
# A "thing" (integer)
2

In [None]:
# Use print to show multiple things in the same cell
# Note that you can use single or double quotes for strings
print(2)
print("hola mundo")
print('hola mundo')

In [None]:
# Things can be stored as variables
a = 2
b = 'hello'
c = True  # This is case sensitive
print(a, b, c)

In [None]:
# The type function tells us the type of thing we have
print(type(a))
print(type(b))
print(type(c))

### <span style="color:purple">Now you</span>
Make three variables, with names and types of your preference. To do it, create a code cell below this one.

## 2. Commands that operate on things
Just storing data in variables isn't much use to us. Right away, we'd like to start performing operations and manipulations on data and variables.

There are three very common means of performing an operation on a thing.

### 2.1 Use an operator
All of the basic math operators work like you think they should for numbers. They can also do some useful operations on other things, like strings. There are also boolean operators that compare quantities and give back a bool variable as a result.


In [None]:
# Standard math operators work as expected on numbers
a = 2
b = 3
print(a + b)
print(a * b)
print(a ** b)  # a to the power of b (a^b does something completely different!)
print(a / b)   # Careful with dividing integers if you use Python 2
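
The two warnings in the comments above can be checked directly. In Python, `^` is the bitwise XOR operator rather than a power operator, and in Python 3 `/` always performs true division while `//` performs floor division. A minimal sketch, with variable names chosen only for illustration:

```python
a = 2
b = 3

power = a ** b        # exponentiation: 2 * 2 * 2
xor = a ^ b           # bitwise XOR of 0b10 and 0b11
true_div = a / b      # true division, always a float in Python 3
floor_div = a // b    # floor division, rounds down to an integer

print(power, xor, true_div, floor_div)
```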

In [None]:
# There are also operators for strings
print('hello' + 'world')
print('hello' * 3)
#print('hello' / 3)  # You can't do this!

In [None]:
# Boolean operators compare two things
a = (1 > 3)
b = (3 == 3)
c = 'vaca' == 'Vaca' 
d = 'vaca' == 'vaca'
print(a)
print(b)
print(c)
print(d)
print(a or b)
print(a and b)

### 2.2 Functions

A function is a group of instructions that transforms an input into an output.

In [None]:
# There are thousands of functions that operate on things
print(type(3))
print(len('hello'))
print(round(3.3))


__TIP:__ To find out what a function does, you can type its name and then a question mark to get a pop-up help window. Or, to see what arguments it takes, you can type its name, an open parenthesis, and hit tab.

In [None]:
round?
round(3.14159, 2)
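
The `?` syntax is a feature of Jupyter/IPython. Outside a notebook, the built-in `help` function and the `__doc__` attribute give access to the same documentation. A minimal sketch (the variable names are illustrative):

```python
# help() prints the documentation of any object
help(round)

# The same text is stored in the __doc__ attribute
doc_text = round.__doc__
print(doc_text)

# round accepts an optional number of decimal places
rounded = round(3.14159, 2)
print(rounded)
```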

__TIP:__ Many useful functions are not in the Python built-in library, but are in external scientific packages. These need to be imported into your Python notebook (or program) before they can be used. Probably the most important of these are numpy and matplotlib.

In [None]:
# Many useful functions are in external packages 
# Let's meet numpy and pandas
import numpy as np  # We give numpy the alias "np" to keep it short when we use it
import pandas as pd

In [None]:
# To see what's in a package, type the name, a period, then hit tab
# Also, go to forums like stackoverflow when you have doubts or see execution errors
# pd?
# pd.

In [None]:
# Some examples of numpy functions and "things" 
print(np.sqrt(4))
print(np.pi)  # Not a function, just a variable
print(np.sin(np.pi))

### 2.3 Methods
Before we get any farther into the Python language, we have to say a word about "objects". We will not be teaching object oriented programming in this workshop, but you will encounter objects throughout Python (in fact, even seemingly simple things like ints and strings are actually objects in Python).

In the simplest terms, you can think of an object as a small bundled "thing" that contains within itself both data and functions that operate on that data. For example, strings in Python are objects that contain a set of characters and also various functions that operate on the set of characters. When bundled in an object, these functions are called "methods".

Instead of the "normal" `function(arguments)` syntax, methods are called using the syntax `variable.method(arguments)`.

In [None]:
# A string is actually an object 
a = 'hola, mundo'
print(type(a))

In [None]:
# Objects have bundled methods. 
#a. 
print(a.capitalize())
print(a.replace('l', 'X'))

### EXERCISE 1 - Conversion

Throughout this lesson, we will successively build towards a program that will calculate the variance of some measurements, in this case Height in Metres. The first thing we want to do is convert from an antiquated measurement system.

1. Create a code cell below this one and do the following.
1. Create a variable called `inches_in_metre` that indicates how many inches there are in a metre.
1. Create a variable `inches` with any number of inches.
1. Convert inches to metres and save the result in a variable called `metres` 
1. Print `metres`

## 3. Object collections

While it is interesting to explore your own height, in science we work with larger, slightly more complex datasets. In this example, we are interested in the characteristics and distribution of heights. Python provides us with a number of objects to handle collections of things.

Probably 99% of your work in scientific Python will use one of five types of collections: `lists`, `tuples`, `dictionaries`, `numpy arrays`, `pandas dataframes`. We'll look quickly at each of these and what they can do for you.


### 3.1 Lists

Lists are probably the handiest and most flexible type of container.

Lists are declared with square brackets [].

Individual elements of a list can be selected using the syntax a[ind].

In [None]:
# Lists are created with square bracket syntax
a = ['blueberry', 'strawberry', 'pineapple']
print(a, type(a))

In [None]:
# Lists (and all collections) are also indexed with square brackets
# NOTE: The first index is zero, not one
print(a[0])
print(a[1])
print(a[2])

In [None]:
## You can also count from the end of the list
print('last item is:', a[-1])
print('second to last item is:', a[-2])

In [None]:
# you can access multiple items from a list by slicing, using a colon between indexes
# NOTE: The end value is not inclusive
print('a =', a)
print('get first two:', a[0:2])

In [None]:
# You can leave off the start or end if desired
print(a[:2])
print(a[2:])
print(a[:])
print(a[:-1])

In [None]:
# Lists are objects, like everything else, and have methods such as append
a.append('banana')
print(a)

a.append([1,2])
print(a)

a.pop()
print(a)

__CAUTION:__ A 'gotcha' for some new Python users is that variables holding collections, including lists, store references (pointers) to the data, not independent copies of the data.

Consider what happens when we set `b = a` and then change `a`.

Does the same thing happen when `a` is a list?

HINT: look into the `copy` module


In [None]:
#Ints, floats, and strings are immutable
a = 1
b = a
print('original b', b)
a = 2
print('What is b after changing a?', b)

#Lists are mutable
a = [1, 2, 3]
b = a
print('original b', b)
a[0] = 42
print('What is b after changing a?', b) # Changing a also changed b!

#How to copy mutables? see the module copy
import copy
a = [1,2,3]
b = copy.deepcopy(a)
print('original b', b)
a[0] = 42
print('What is b after changing a?', b) 
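
The difference between an alias, a shallow copy, and a deep copy only becomes visible with nested collections. The sketch below uses a hypothetical nested list to illustrate it; `list.copy()` duplicates only the outer list, whereas `copy.deepcopy` also duplicates the inner lists.

```python
import copy

nested = [[1, 2], [3, 4]]

alias = nested                  # same list object, no copy at all
shallow = nested.copy()         # new outer list, inner lists still shared
deep = copy.deepcopy(nested)    # new outer list and new inner lists

nested[0][0] = 42               # modify an inner list
nested.append([5, 6])           # modify the outer list

print('alias  :', alias)
print('shallow:', shallow)
print('deep   :', deep)
```

Only `deep` is fully isolated from changes to `nested`; `shallow` misses the appended item but still sees the change inside the shared inner list.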


### EXERCISE 2 - Store a bunch of heights (in metres) in a list
1. Ask five people around you for their heights (in metres).
1. Store these in a list called `heights`.
1. Append your own height, calculated above in the variable metres, to the list.
1. Get the first height from the list and print it.

**Bonus**

Extract the last value in two different ways: first, by using the index for the last item in the list, and second, presuming that you do not know how long the list is.

HINT: len() can be used to find the length of a collection


### 3.2 Tuples
We won't say a whole lot about tuples except to mention that they basically work just like lists, with two major exceptions:

1. You declare tuples using () instead of []
1. Once you make a tuple, you can't change what's in it (referred to as immutable)

You'll see tuples come up throughout the Python language, and over time you'll develop a feel for when to use them.

In general, they're often used instead of lists:

1. to group items when the position in the collection is critical, such as coord = (x,y)
1. when you want to prevent accidental modification of the items, e.g. shape = (12,23)

In [None]:
xy = (23, 45)
print(xy[0])
xy[0] = "this won't work with a tuple"

### Anatomy of a traceback error
A traceback is raised when you try to do something with code that it isn't meant to do. Tracebacks are meant to be informative but, like many things, they are not always as informative as we would like.

Looking at our error:

1. The command you tried to run raised a TypeError. 
1. This suggests you are using a variable in a way that its type doesn't support: tuples are immutable.
1. In Jupyter, the arrow ----> points to the line where the error occurred, in this case line 3 of the cell above.

Learning how to read a traceback is an important skill to develop, and it helps you know how to ask questions about what has gone wrong in your code (many errors and their solutions can be found on Google and Stack Overflow).
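
If you want the program to keep running after such an error, one option is to catch it with `try`/`except`. This is a small sketch rather than part of the original lesson; the names `message` and `new_xy` are illustrative.

```python
xy = (23, 45)

try:
    xy[0] = 99                  # tuples do not support item assignment
except TypeError as err:
    message = str(err)
    print('Caught a TypeError:', message)

# To "change" a tuple, build a new one instead
new_xy = (99,) + xy[1:]
print(new_xy)
```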


### 3.3 Dictionaries

Dictionaries are the collection to use when you want to store and retrieve things by their names (or some other kind of key) instead of by their position in the collection. A good example is a set of model parameters, each of which has a name and a value. 

Dictionaries are declared using {}.

In [None]:
# Make a dictionary of model parameters 
# key:value
convertors = {'inches_in_feet' : 12,
              'inches_in_metre' : 39}

print(convertors)
print(convertors['inches_in_feet'])

In [None]:
## Add a new key:value pair
convertors['metres_in_mile'] = 1609.34
print(convertors)

In [None]:
# Looking up a key that does not exist raises a KeyError
print(convertors['blueberry'])
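
There are a few common ways to deal with a key that may be missing. The sketch below shows three of them on a copy of the `convertors` dictionary: testing membership with `in`, using `.get` with a default value, and catching the `KeyError`.

```python
convertors = {'inches_in_feet': 12, 'inches_in_metre': 39}

# Check for a key before using it
has_key = 'blueberry' in convertors
print(has_key)

# .get returns a default value instead of raising a KeyError
value = convertors.get('blueberry', 'not found')
print(value)

# Or catch the error explicitly
try:
    convertors['blueberry']
except KeyError as err:
    missing = err.args[0]
    print('Missing key:', missing)
```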

### 3.4 Numpy arrays (ndarrays)

Even though numpy arrays (often written as ndarrays, for n-dimensional arrays) are not part of the core Python libraries, they are so useful in scientific Python that we'll include them here in the core lesson. Numpy arrays are collections of things, all of which must be the same type, that work similarly to lists (as we've described them so far), with some differences. The most important are:

1. You can easily perform elementwise operations (and matrix algebra) on arrays
1. Arrays can be n-dimensional
1. Arrays have no in-place equivalent to append, although arrays can be concatenated into a new array
1. Arrays can be created from existing collections such as lists, or instantiated "from scratch" in a few useful ways.

When getting started with scientific Python, you will probably want to try to use ndarrays whenever possible, saving the other types of collections for those cases when you have a specific reason to use them.


In [None]:
# We need to import the numpy library to have access to it 
# We can also create an alias for a library, this is something you will commonly see with numpy

import numpy as np

In [None]:
# Make an array from a list
alist = [2, 3, 4]
blist = [5, 6, 7]
a = np.array(alist)
b = np.array(blist)
print(a, type(a))
print(b, type(b))

In [None]:
# Do arithmetic on arrays
print(a**2)
print(np.sin(a))
print(a * b)
print(a.dot(b), np.dot(a, b))

In [None]:
# Boolean operators work on arrays too, and they return boolean arrays
print(a > 2)
print(b == 6)

c = a > 2
print(c)
print(type(c))
print(c.dtype)

In [None]:
# Indexing arrays
print(a[0:2])

c = np.random.rand(3,3)
print(c)
print('\n')
print(c[1,1])
print(c[1:3,0:2])

c[0,:] = a
print('\n')
print(c)

In [None]:
# Arrays can also be indexed with other boolean arrays
print(a)
print(b)
print(a > 2)
print(a[a > 2])
print(b[a > 2])

b[a == 3] = 77
print(b)

In [None]:
# ndarrays have attributes in addition to methods
#c.
print(c.shape)
print(c.prod())

In [None]:
# There are handy ways to make arrays full of ones and zeros
print(np.zeros(5), '\n')
print(np.ones(5), '\n')
print(np.identity(5), '\n')

In [None]:
# You can also easily make arrays of number sequences
print(np.arange(0, 10, 2))
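
As noted at the start of this section, arrays are concatenated rather than appended to. A minimal sketch using `np.concatenate` and `np.vstack`, both of which return a new array and leave the originals unchanged:

```python
import numpy as np

a = np.array([2, 3, 4])
b = np.array([5, 6, 7])

joined = np.concatenate([a, b])    # one longer 1-D array
stacked = np.vstack([a, b])        # a 2-D array with a and b as rows

print(joined, joined.shape)
print(stacked, stacked.shape)
```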

### 3.5 Pandas dataframes

Dataframes are similar to numpy arrays; they are actually built on numpy arrays, and they can also be thought of as matrices whose columns each gather a variable of a single type (e.g. one column can be names, another income, etc.).

At the same time, dataframes and numpy ndarray differ. For example, Python processes numpy arrays faster. However, dataframes have many useful functions and methods not present in numpy arrays. That's why we learn, and use, both.


In [None]:
# Let's import the module pandas with the alias pd.
# If you imported the module before, you don't need to import it twice.
import pandas as pd

In [None]:
# Building dataframes from dictionaries.
data = {
    'apples': [3, 2, 0, 1], 
    'oranges': [0, 3, 7, 2]
}
purchases = pd.DataFrame(data) #Each key becomes a column, and each item in its list becomes the value for one row.
purchases


In [None]:
# Building dataframes from numpy array
data_numpy_array = np.array([[3, 2, 0, 1], [0, 3, 7, 2]])
purchases = pd.DataFrame(np.transpose(data_numpy_array), 
                       columns = ['apples','oranges'])
purchases


In [None]:
# Building dataframes from a list
data_list = [[3, 0], [2, 3], [0, 7],[1,2]] #Each item of the list is a row of the dataframe
purchases = pd.DataFrame(data_list, 
                       columns = ['apples','oranges'])
purchases


In [None]:
# The constructor pd.DataFrame numbers the rows by default. We can label them with strings instead
# The row labels are called the index
purchases = pd.DataFrame(data, index=['June', 'Robert', 'Lily', 'David'])
purchases

In [None]:
# We can use an index label to select a row (the values of every column for that row)
purchases.loc['Robert',:] 

In [None]:
# We can use a column name to select a column (the values of every row for that column)
purchases.loc[:,'oranges'] 

In [None]:
# We can access individual values
purchases.loc['David','apples']

In [None]:
# With the method reset_index() we can number the rows again and drop the names
purchases = purchases.reset_index(drop = True)
print(purchases)

# This is useful to access cells by their coordinates
purchases.iloc[0,0]


### EXERCISE 3 - Simple analysis with numpy arrays

Revisit your list of heights

1. Create a new code cell below
1. turn the list of heights into a numpy array
1. calculate the mean and standard deviation
1. create a mask of all heights greater than a certain value (your choice)
1. find the mean of the masked heights



Numpy broadcasting
https://numpy.org/doc/stable/user/basics.broadcasting.html
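
Broadcasting is how numpy combines arrays of different but compatible shapes: the smaller array is conceptually stretched to match the larger one. The sketch below uses small hypothetical arrays to illustrate the rule; see the linked documentation for the full details.

```python
import numpy as np

matrix = np.array([[1, 2, 3],
                   [4, 5, 6]])      # shape (2, 3)
row = np.array([10, 20, 30])        # shape (3,)
column = np.array([[100], [200]])   # shape (2, 1)

# The row is stretched across both rows of the matrix
plus_row = matrix + row
print(plus_row)

# The column is stretched across all three columns
plus_column = matrix + column
print(plus_column)

# A scalar is broadcast to every element
print(matrix * 2)
```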

<center><img src="np_broadcast_visual.png" width = "400" height = '400'></center>

## 4. ITERATION: FOR LOOPS AND WHILE LOOPS

So far, everything that we've done could, in principle, be done by hand calculation. In this section and the next, we really start to take advantage of the power of programming languages to do things for us automatically.

We start here with ways to repeat yourself. The two most common ways of doing this are known as for loops and while loops. For loops in Python are useful when you want to cycle over all of the items in a collection (such as all of the elements of an array), and while loops are useful when you want to cycle for an indefinite amount of time until some condition is met.

The basic examples below will work for looping over lists, tuples, and arrays. Looping over dictionaries is a bit different, since there is a key and a value for each item in a dictionary. Have a look at the Python docs for more information.


In [None]:
# A basic for loop - don't forget the white space!
wordlist = ['hi', 'hello', 'bye']
for word in wordlist:
    print(word + '!')


**Note on indentation**: Notice the indentation once we enter the for loop. Every indented statement after the for loop declaration is part of the for loop. This rule holds true for while loops, if statements, functions, etc. Required indentation is one of the reasons Python is such a beautiful language to read.

If you do not have consistent indentation you will get an `IndentationError`. Fortunately, most code editors will ensure your indentation is correct.

NOTE: The Python convention is to use four (4) spaces for each level of indentation; most editors can be configured to follow this convention.


In [None]:
# Indentation error: Fix it!
for word in wordlist:
    new_word = word.capitalize()
   print(new_word + '!') # Bad indent

In [None]:
# Sum all of the values in a collection using a for loop
numlist = [1, 4, 77, 3]

total = 0
for num in numlist:
    total = total + num
    
print("Sum is", total)

In [None]:
# Often we want to loop over the indexes of a collection, not just the items
print(wordlist)
for i, word in enumerate(wordlist):
    print(i, word, wordlist[i])
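
Looping over dictionaries, mentioned at the start of this section, follows a similar pattern. Iterating over a dictionary directly yields its keys, while the `.items()` method yields key and value together, much like `enumerate` yields index and item. A short sketch using a copy of the `convertors` dictionary:

```python
convertors = {'inches_in_feet': 12, 'inches_in_metre': 39}

# Looping over a dictionary directly yields its keys
for key in convertors:
    print(key)

# .items() yields (key, value) pairs
pairs = []
for key, value in convertors.items():
    pairs.append((key, value))
    print(key, '->', value)
```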

In [None]:
# While loops are useful when you don't know how many steps you will need,
# and want to stop once a certain condition is met.
step = 0
prod = 1
while prod < 100:
    step = step + 1
    prod = prod * 2
    print(step, prod)
    
print('Reached a product of', prod, 'at step number', step)

### EXERCISE 4 - Variance

We can now calculate the variance of the heights we collected before.

As a reminder, **sample variance** is calculated from the sum of squared differences of each observation from the mean:

### $variance = \frac{\Sigma{(x-mean)^2}}{n-1}$

where mean is the mean of our observations, x is each individual observation, and n is the number of observations.

First, we need to calculate the mean:

1. Create a variable total for the sum of the heights.
2. Using a for loop, add each height to total.
3. Find the mean by dividing this by the number of measurements, and store it as mean.
__Note__: To get the number of things in a list, use len(the_list).

Now we'll use another loop to calculate the variance:

1. Create a variable sum_diffsq for the sum of squared differences.
1. Make a second for loop over heights.
    * At each step, subtract the mean from the height and call it diff.
    * Square this and call it diffsq.
    * Add diffsq on to sum_diffsq.
1. Divide sum_diffsq by n-1 to get the variance.
1. Display the variance.

__Note__: To square a number in Python, use **, eg. 5**2.

Bonus

Test whether variance is larger than 0.01, and print out a line that says "variance more than 0.01: " followed by the answer (either True or False).
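
Once your loops are written, one way to check the result is to compare it with the standard library's `statistics.variance`, which also divides by n-1. The measurements below are hypothetical and stand in for your own list of heights; with numpy, `np.var(values, ddof=1)` should give the same number.

```python
import statistics

# Hypothetical measurements, in metres
sample = [1.60, 1.70, 1.80]

sample_mean = statistics.mean(sample)
sample_variance = statistics.variance(sample)   # divides by n - 1

print(sample_mean, sample_variance)
```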


## 5. CONDITIONAL IF

Often we want to check if a condition is True and take one action if it is, and another action if the condition is False. We can achieve this in Python with an if statement.

__TIP__: You can use any expression that returns a boolean value (True or False) in an if statement. Common boolean operators are ==, !=, <, <=, >, >=. You can also use `is` and `is not` if you want to check if two variables are identical in the sense that they are stored in the same location in memory.



In [None]:
# A simple if statement
x = 3
if x > 0:
    print('x is positive')
elif x < 0:
    print('x is negative')
else:
    print('x is zero')

In [None]:
# If statements can rely on boolean variables
x = -1
test = (x > 0)
print(type(test)); print(test)

if test:
    print('Test was true')
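
The tip at the start of this section distinguishes `==` from `is`. The sketch below shows the difference with lists: `==` compares values, while `is` checks whether two names refer to the same object in memory. The variable names are illustrative.

```python
a = [1, 2, 3]
b = [1, 2, 3]   # a different list with the same contents
c = a           # a second name for the same list

print(a == b, a is b)   # equal values, different objects
print(a == c, a is c)   # equal values, same object

# `is` is the usual way to test for None
result = None
if result is None:
    print('no result yet')
```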

## 6. FUNCTIONS & MODULES

One way to write a program is to simply string together commands, like the ones described above, in a long file, and then to run that file to generate your results. This may work, but it can be cognitively difficult to follow the logic of programs written in this style. Also, it does not allow you to reuse your code easily - for example, what if we wanted to run our variance calculation on several different sets of measurements?

The most important way to "chunk" code into more manageable pieces is to create functions and then to gather these functions into modules, and eventually packages. Below we will discuss how to create functions and modules. A third common type of "chunk" in Python is classes, but we will not be covering object-oriented programming in this workshop.


In [None]:
# We've been using functions all day
x = 3.333333
print(round(x, 2))
print(np.sin(x))

In [None]:
# It's very easy to write your own functions
def multiply(x, y):
    return x*y

In [None]:
# Once a function is "run" and saved in memory, it's available just like any other function
print(type(multiply))
print(multiply(4, 3))

In [None]:
# It's useful to include docstrings to describe what your function does, inputs, and outputs
def say_hello(time, people):
    '''
    Function says a greeting. Useful for engendering goodwill
    '''
    return 'Good ' + time + ', ' + people

**Docstrings**: A docstring is a special type of comment that tells you what a function does. You can see them when you ask for help about a function
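
As a small illustration, the docstring of a function is stored in its `__doc__` attribute, and `help()` prints it together with the function signature. The sketch repeats the `say_hello` definition from above so that it runs on its own.

```python
def say_hello(time, people):
    '''
    Function says a greeting. Useful for engendering goodwill
    '''
    return 'Good ' + time + ', ' + people

# The docstring is stored on the function itself
docstring = say_hello.__doc__
print(docstring)

# help() shows the signature together with the docstring
help(say_hello)
```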

In [None]:
say_hello('afternoon', 'friends')

In [None]:
# All arguments must be present, or the call will raise a TypeError
say_hello('afternoon')

In [None]:
# Keyword arguments can be used to make some arguments optional by giving them a default value
# All mandatory arguments must come first, in order; parameters with default values at the end
def say_hello(time, people='friends'):
    return 'Good ' + time + ', ' + people

In [None]:
say_hello('afternoon')

In [None]:
say_hello('afternoon', 'students')

### EXERCISE 5 - Create a variance function

Finally, let's turn our variance calculation into a function that we can use over and over again. Copy your code from Exercise 4 into the box below, and do the following:

1. Turn your code into a function called `calculate_variance` that takes a list of values and returns their variance.
2. Write a nice docstring describing what your function does.
3. In a subsequent cell, call your function with different sets of numbers to make sure it works.

Create another code cell and do the following:

1. Refactor your function by pulling out the section that calculates the mean into another function called `calculate_mean`, and calling that inside your `calculate_variance` function.
1. Make sure it works properly when all the data are integers as well.
1. Give a better error message when it's passed an empty list. Use the web to find out how to raise exceptions in Python.

### EXERCISE 6 - Put `calculate_mean` & `calculate_variance` in a module

We can make our functions more easily reusable by placing them into modules that we can import, just like we have been doing with `numpy`. It's pretty simple to do this.

1. Copy your function(s) into a new text file, in the same directory as this notebook, called `stats.py`.
1. In the cell below, type `import stats` to import the module. 
1. Type `stats.` and hit tab to see the available functions in the module. 
1. Try calculating the variance of a number of samples of heights (or other random numbers) using your imported module.



## 7. LOAD, ARRANGE, USE, AND SAVE DATA 

Data scientists obtain and use information from many sources. Python can manage many formats (csv, dta, RData, mat, json, sql, etc). In this section we will use `pandas dataframes`. 

### 7.1 LOAD

In [1]:
import numpy as np #already loaded, here for pedagogical reasons
import json
import pandas as pd #already loaded, here for pedagogical reasons

In [None]:
# To load a .csv file
data_csv = pd.read_csv('GHE2016_AllAges.csv') #Death rate by country and disease type
print(type(data_csv))
data_csv.head() #It's in wide format; later we will see how to put it in long format

In [3]:
babynames = pd.read_csv('babynames.csv') 
print(babynames.dtypes) #name and data type of each column (i.e. variables)
babynames

Unnamed: 0      int64
year            int64
sex            object
name           object
n               int64
prop          float64
dtype: object


Unnamed: 0.1,Unnamed: 0,year,sex,name,n,prop
0,1,1880,F,Mary,7065,0.072384
1,2,1880,F,Anna,2604,0.026679
2,3,1880,F,Emma,2003,0.020521
3,4,1880,F,Elizabeth,1939,0.019866
4,5,1880,F,Minnie,1746,0.017888
...,...,...,...,...,...,...
1924660,1924661,2017,M,Zykai,5,0.000003
1924661,1924662,2017,M,Zykeem,5,0.000003
1924662,1924663,2017,M,Zylin,5,0.000003
1924663,1924664,2017,M,Zylis,5,0.000003


In [None]:
# To load a Stata .dta file
data_stata = pd.read_stata('heus_mepssample.dta')
itr = pd.read_stata('heus_mepssample.dta', iterator=True) #has the description of the variables
print(data_stata.dtypes) 
print('\n')
print("The variable ed_hs is: " + itr.variable_labels()['ed_hs'])

In [None]:
# To load a .json file
with open('black_mirror.json') as f: 
    data = json.load(f) #parses f (the json file in text form) and puts it in a dictionary
print(data.keys()) #Dictionary keys
data_json = pd.DataFrame(data['_embedded']['episodes']) #black mirror episodes in a pandas data frame
print(data_json.dtypes) #columns and data types. Each row is an episode
print('\n')
episode = 6
print(data_json.loc[episode,'name']) 
print(data_json.loc[episode,'season']) 
print(data_json.loc[episode,'summary'])



### 7.2 Arrange

Based on the RStudio Cloud primers

In [None]:
# Same data, different arrangement
tabla1 = pd.read_csv('For_Reshape_Tutorial.csv')
tabla2 = pd.read_csv('For_Reshape_Tutorial_2.csv')
tabla3 = pd.read_csv('For_Reshape_Tutorial_3.csv')
tabla4a = pd.read_csv('For_Reshape_Tutorial_4a.csv') #cases
tabla4b = pd.read_csv('For_Reshape_Tutorial_4b.csv') #population
tabla5 = pd.read_csv('For_Reshape_Tutorial_5.csv')


In [None]:
# Let's see table 1
print(tabla1.dtypes)
print("(rows, columns): " + str(tabla1.shape))
print('\n')
print(tabla1) 

Some definitions:
* **Variable:** Measurable/describable property.

* **Value**: value of a variable (e.g. the variable `age` can take different values).

* **Observation**: set of values from different variables (e.g. the age, SES, gender of an individual)
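
In a tidy pandas dataframe these three definitions map onto columns, cells, and rows. The sketch below uses a small hypothetical table (not one of the tutorial files) to illustrate the mapping.

```python
import pandas as pd

# Hypothetical data: three people, two variables
people = pd.DataFrame({
    'age': [34, 27, 51],
    'gender': ['F', 'M', 'F'],
})

variables = list(people.columns)     # each column is a variable
observation = people.loc[0]          # each row is an observation
value = people.loc[0, 'age']         # each cell is a value

print(variables)
print(observation)
print(value)
```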


### EXERCISE 7 - Identify variables, values, and observations

In tabla 1:
- What are variables? 
- What are values? 
- What are observations?

In [None]:
# Let's see other tables
tabla2

In [None]:
tabla3

In [None]:
tabla4a

In [None]:
tabla4b

In [None]:
tabla5

All tables contain the same information; however, sometimes we want them in a specific arrangement. For instance, let's imagine we want tabla 2 but arranged as tabla 1 (and assume we do not have tabla 1). How can we rearrange tabla 2?

In [None]:
# We want tabla 2 to look like tabla 1
# Problem: in tabla 2, cases and population are in the column "type"
# Objective: go from long to wide format: spread the column type into two new columns, cases & population
# Solution: pd.pivot_table()

tabla2_wide = pd.pivot_table(tabla2, index = ['country', 'year'], columns = ['type'], values = ['count']).reset_index()
tabla2_wide.columns = ['country','year', 'cases','population']
tabla2_wide #in wide format

In [None]:
# To compare with the table obtained in the previous cell
tabla1

In [None]:
# Now the opposite problem: there are some columns that I want to collapse into one
# That is, we want tabla 1 to look like tabla 2
# Problem: in tabla 1, cases and population are in different columns
# Objective: go from wide to long: collapse the columns cases & population into one
# Solution: pd.melt()

tabla1_long = pd.melt(tabla1, id_vars = ['country', 'year'], value_vars = ['cases','population'], 
                     var_name = 'type', value_name = 'count').reset_index(drop=True)
tabla1_long = tabla1_long.sort_values(by = ['country', 'year'], ascending = True)
tabla1_long #in long format

In [None]:
# To compare with the table obtained in the previous cell
tabla2
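
The cells above depend on the tutorial CSV files. For a version that runs without them, the sketch below does the same long-to-wide and wide-to-long round trip on a small hypothetical table. It assumes pandas 1.1 or later and uses `DataFrame.pivot`, which reshapes without aggregating, instead of `pd.pivot_table`.

```python
import pandas as pd

# Hypothetical long table: one row per (country, year, type)
long_df = pd.DataFrame({
    'country': ['A', 'A', 'B', 'B'],
    'year': [2000, 2000, 2000, 2000],
    'type': ['cases', 'population', 'cases', 'population'],
    'count': [10, 1000, 20, 2000],
})

# Long -> wide: spread the values of "type" into separate columns
wide_df = long_df.pivot(index=['country', 'year'], columns='type',
                        values='count').reset_index()
wide_df.columns.name = None    # drop the leftover "type" label
print(wide_df)

# Wide -> long: collapse the two columns back into one
back_to_long = pd.melt(wide_df, id_vars=['country', 'year'],
                       value_vars=['cases', 'population'],
                       var_name='type', value_name='count')
print(back_to_long)
```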


### EXERCISE 8 - Arrange a table

In the database GHE2016_AllAges.csv each country has a column. We are interested in using country as a variable, thus we need to change the arrangement of the database. This means that we are interested in using a long format i.e. collapse many columns (countries) into one. Do the following:

* Load the database into a python variable called `data_raw`
* Create a python variable called `columnas_data_raw` with the names of the columns of `data_raw`. TIP: use the method .columns
* Use pd.melt to create a new python variable called `data_long`. In `data_long` place all countries in a single column called `country`. TIP: the names of the countries are in `columnas_data_raw`, from the index 7 onward
* What are the columns temp3 and temp4? Rename these columns. TIP: use the method .rename.


### 7.3 Use
Python has many libraries to analyze data. In this section we will obtain descriptive statistics.

We will use an R database called `babynames`. It has the frequency (n) and proportion (prop) of different names (name) by gender (sex) during a long period (1880-2017)(year) in the United States.

This section is inspired by the RStudio Cloud primers. P.S. Some comparisons with R: https://pandas.pydata.org/docs/getting_started/comparison/comparison_with_r.html

In [None]:
#Do not forget to run previous code cells
print(babynames.describe())
print("Number of different names: " + str(len(babynames['name'].unique()))) 


In [None]:
# Let's count how many times your name appears
# .query() filters row values.
name = babynames.query("(name == 'Santiago') & (sex == 'M')") #Note how we use "" to be able to use '' inside
print(name)
name['n'].sum() 
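
The filter passed to `.query()` can also be written as a boolean mask, in the same way numpy arrays were indexed earlier. The sketch below uses a hypothetical miniature table with the same column names as `babynames`, so it runs without the CSV file.

```python
import pandas as pd

# Hypothetical miniature version of the babynames table
names = pd.DataFrame({
    'year': [1990, 1990, 1991, 1991],
    'sex': ['M', 'F', 'M', 'F'],
    'name': ['Santiago', 'Mary', 'Santiago', 'Mary'],
    'n': [120, 900, 150, 850],
})

# Filtering with .query()
with_query = names.query("(name == 'Santiago') & (sex == 'M')")

# The same filter written as a boolean mask
mask = (names['name'] == 'Santiago') & (names['sex'] == 'M')
with_mask = names[mask]

total = with_query['n'].sum()
print(with_query)
print(with_query.equals(with_mask))
print(total)
```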

In [None]:
# Now some descriptive stats like max, min, mean, std 
name = babynames.query("(name == 'Santiago') & (sex == 'M')")
print("On average, the name appears each year: " + str(round(name['n'].mean())))
print("The standard deviation across years: " + str(round(name['n'].std())))
print("The min. number of times the name appears in a year: " + str(round(name['n'].min())))
print("The max. number of times the name appears in a year: " + str(round(name['n'].max())))

In [None]:
# Let's compute descriptive stats for all names
# .groupby splits the dataframe into groups by the desired variables/columns

# men
baby_filter = babynames.query("(sex == 'M')")
names_mean = round(baby_filter.groupby(['name']).mean(numeric_only=True)) #sum, std, etc.; numeric_only skips text columns such as sex
print(names_mean['n'].sort_values(ascending = False).head(20)) #top 20
print('\n')

# women
baby_filter = babynames.query("(sex == 'F')")
names_mean = round(baby_filter.groupby(['name']).mean(numeric_only=True))
print(names_mean['n'].sort_values(ascending = False).head(20)) #top 20
print('\n')

# total by year and gender
names_mean = babynames.groupby(['year','sex']).sum(numeric_only=True).reset_index()
print(names_mean.head(20)) 
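
To see what `.groupby` does on a table small enough to check by hand, the sketch below uses a hypothetical miniature table with the same column names as `babynames`. Selecting the column `n` before aggregating keeps text columns out of the calculation.

```python
import pandas as pd

# Hypothetical miniature version of the babynames table
names = pd.DataFrame({
    'year': [1990, 1990, 1991, 1991],
    'sex': ['M', 'F', 'M', 'F'],
    'name': ['Santiago', 'Mary', 'Santiago', 'Mary'],
    'n': [120, 900, 150, 850],
})

# Mean of n for each name (select the column before aggregating)
mean_by_name = names.groupby('name')['n'].mean()
print(mean_by_name)

# Total of n for each year
total_by_year = names.groupby('year')['n'].sum().reset_index()
print(total_by_year)
```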


In [None]:
# Graphs with ggplot via plotnine
#!pip install plotnine
import plotnine as p9
from plotnine import ggplot, geom_line, aes, stat_smooth, facet_wrap, themes

names_mean = babynames.groupby(['year','sex']).sum(numeric_only=True).reset_index()
p1 = (ggplot(names_mean, aes('year', 'n', color='sex'))
 + geom_line() 
 + themes.theme_xkcd() #many themes e.g. theme_classic()
)

p1

In [None]:
# Graphs with matplotlib
import matplotlib
from matplotlib import pyplot as plt

fig1 = plt.figure(figsize = [9,6]) #initialize a blank canvas
idx = names_mean['sex'] == 'F'
x = names_mean.loc[idx,'year']
y = names_mean.loc[idx,'n']
plt.plot(x, y, color = 'red', label = 'F')

idx = names_mean['sex'] == 'M'
x = names_mean.loc[idx,'year']
y = names_mean.loc[idx,'n']
plt.plot(x, y, color = 'cyan', label = 'M')

plt.legend()
plt.xlabel('Year')
plt.ylabel('Number of names');

### EXERCISE 9 - Usage of databases

Using `data_long` (from Exercise 8), the pandas dataframe built from the database GHE2016_AllAges.csv, filter three diseases with the method .query().

Print the first 30 rows for each disease. Tip: use .head()

### 7.4 Save

Sometimes code can take hours or even days to run. In such cases, it is a good idea to save the output so that next time you can just load it.

In [None]:
# Save in csv
data_long.to_csv('GHE2016_AllAges_long.csv')
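
By default, `to_csv` also writes the row index as an unnamed first column, which shows up as `Unnamed: 0` when the file is read back (as in the `babynames` table above). Passing `index=False` avoids this. The sketch below uses a small hypothetical table and an in-memory buffer instead of a file on disk.

```python
import io
import pandas as pd

table = pd.DataFrame({'apples': [3, 2], 'oranges': [0, 3]})

# Default: the row index is written as an extra, unnamed first column
with_index = table.to_csv()
reloaded_with_index = pd.read_csv(io.StringIO(with_index))
print(list(reloaded_with_index.columns))

# index=False leaves the row index out of the file
without_index = table.to_csv(index=False)
reloaded = pd.read_csv(io.StringIO(without_index))
print(list(reloaded.columns))
```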

In [None]:
# Save in pickle (python format)
data_long.to_pickle('GHE2016_AllAges_long.pkl')

In [None]:
# Save in .dta (Stata)
# You can rename columns before saving:
# data_long = data_long.rename(columns={'People(x1000)':'People_x_1000'})
data_long.to_stata('GHE2016_AllAges_long.dta')

In [None]:
# Save in excel 
data_long.to_excel('GHE2016_AllAges_long.xlsx')

In [None]:
# Save images
p1.save(filename = 'p1_ggplot.pdf') # plotnine figure

fig1.savefig('p1_matplotlib.pdf')

### EXERCISE 10 - Make a graph and save it

Plot how the number of women named Mary evolves over the years in the data `babynames`.

Save the graph in `mi_primera_figura.png`. Use matplotlib.
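
As a hint, saving to PNG works the same way as saving to PDF: the file extension selects the format. The sketch below uses made-up numbers and a hypothetical file name, so it is not the solution to the exercise.

```python
import os
import tempfile
from matplotlib import pyplot as plt

# Made-up values, only to have something to draw
years = [2000, 2001, 2002, 2003]
counts = [12, 15, 11, 18]

fig, ax = plt.subplots(figsize=(6, 4))
ax.plot(years, counts, label="toy series")
ax.set_xlabel("Year")
ax.set_ylabel("Count")
ax.legend()

# The .png extension tells savefig which format to write
out_path = os.path.join(tempfile.mkdtemp(), "toy_figure.png")
fig.savefig(out_path, dpi=150, bbox_inches="tight")
plt.close(fig)
print(os.path.getsize(out_path) > 0)
```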


## 8. Polars

Based on the [Polars user guide](https://docs.pola.rs/user-guide/)

In [22]:
import polars as pl
import numpy as np
from datetime import datetime

### Build data frame

In [6]:
df = pl.DataFrame(
    {
        "integer": [1, 2, 3],
        "date": [
            datetime(2022, 1, 1),
            datetime(2022, 1, 2),
            datetime(2022, 1, 3),
        ],
        "float": [4.0, 5.0, 6.0],
    }
)

print(df)

shape: (3, 3)
┌─────────┬─────────────────────┬───────┐
│ integer ┆ date                ┆ float │
│ ---     ┆ ---                 ┆ ---   │
│ i64     ┆ datetime[μs]        ┆ f64   │
╞═════════╪═════════════════════╪═══════╡
│ 1       ┆ 2022-01-01 00:00:00 ┆ 4.0   │
│ 2       ┆ 2022-01-02 00:00:00 ┆ 5.0   │
│ 3       ┆ 2022-01-03 00:00:00 ┆ 6.0   │
└─────────┴─────────────────────┴───────┘


### Save and load data

In [8]:
df.write_csv("output.csv")
df_csv = pl.read_csv("output.csv")
print(df_csv)

shape: (3, 3)
┌─────────┬────────────────────────────┬───────┐
│ integer ┆ date                       ┆ float │
│ ---     ┆ ---                        ┆ ---   │
│ i64     ┆ str                        ┆ f64   │
╞═════════╪════════════════════════════╪═══════╡
│ 1       ┆ 2022-01-01T00:00:00.000000 ┆ 4.0   │
│ 2       ┆ 2022-01-02T00:00:00.000000 ┆ 5.0   │
│ 3       ┆ 2022-01-03T00:00:00.000000 ┆ 6.0   │
└─────────┴────────────────────────────┴───────┘


### Usage

In [9]:
df.select(pl.col("*")) # * is for all columns

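shape: (3, 3)
┌─────────┬─────────────────────┬───────┐
│ integer ┆ date                ┆ float │
│ ---     ┆ ---                 ┆ ---   │
│ i64     ┆ datetime[μs]        ┆ f64   │
╞═════════╪═════════════════════╪═══════╡
│ 1       ┆ 2022-01-01 00:00:00 ┆ 4.0   │
│ 2       ┆ 2022-01-02 00:00:00 ┆ 5.0   │
│ 3       ┆ 2022-01-03 00:00:00 ┆ 6.0   │
└─────────┴─────────────────────┴───────┘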


In [10]:
df.select(pl.col("integer", "date")) #Specific columns

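shape: (3, 2)
┌─────────┬─────────────────────┐
│ integer ┆ date                │
│ ---     ┆ ---                 │
│ i64     ┆ datetime[μs]        │
╞═════════╪═════════════════════╡
│ 1       ┆ 2022-01-01 00:00:00 │
│ 2       ┆ 2022-01-02 00:00:00 │
│ 3       ┆ 2022-01-03 00:00:00 │
└─────────┴─────────────────────┘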


In [11]:
df.select(pl.exclude("float"))

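shape: (3, 2)
┌─────────┬─────────────────────┐
│ integer ┆ date                │
│ ---     ┆ ---                 │
│ i64     ┆ datetime[μs]        │
╞═════════╪═════════════════════╡
│ 1       ┆ 2022-01-01 00:00:00 │
│ 2       ┆ 2022-01-02 00:00:00 │
│ 3       ┆ 2022-01-03 00:00:00 │
└─────────┴─────────────────────┘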


In [13]:
#Filter
df.filter((pl.col("integer") >= 2) & (pl.col("float").is_not_nan()))

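shape: (2, 3)
┌─────────┬─────────────────────┬───────┐
│ integer ┆ date                ┆ float │
│ ---     ┆ ---                 ┆ ---   │
│ i64     ┆ datetime[μs]        ┆ f64   │
╞═════════╪═════════════════════╪═══════╡
│ 2       ┆ 2022-01-02 00:00:00 ┆ 5.0   │
│ 3       ┆ 2022-01-03 00:00:00 ┆ 6.0   │
└─────────┴─────────────────────┴───────┘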


In [15]:
#Create columns
df.with_columns(
    pl.col("integer").sum().alias("Sum_Column_Integer"),
    (pl.col("integer") + pl.col("float")).alias("Integer_Plus_Float"),
)



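shape: (3, 5)
┌─────────┬─────────────────────┬───────┬────────────────────┬────────────────────┐
│ integer ┆ date                ┆ float ┆ Sum_Column_Integer ┆ Integer_Plus_Float │
│ ---     ┆ ---                 ┆ ---   ┆ ---                ┆ ---                │
│ i64     ┆ datetime[μs]        ┆ f64   ┆ i64                ┆ f64                │
╞═════════╪═════════════════════╪═══════╪════════════════════╪════════════════════╡
│ 1       ┆ 2022-01-01 00:00:00 ┆ 4.0   ┆ 6                  ┆ 5.0                │
│ 2       ┆ 2022-01-02 00:00:00 ┆ 5.0   ┆ 6                  ┆ 7.0                │
│ 3       ┆ 2022-01-03 00:00:00 ┆ 6.0   ┆ 6                  ┆ 9.0                │
└─────────┴─────────────────────┴───────┴────────────────────┴────────────────────┘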


In [20]:
#Group by
df2 = pl.DataFrame(
    {
        "x": range(8),
        "y": ["A", "A", "A", "B", "B", "C", "X", "X"],
    }
)
print(df2)
print(df2.group_by("y", maintain_order=True).count())
print(df2.group_by("y", maintain_order=True).agg(
    pl.col("x").count().alias("count"),
    pl.col("x").sum().alias("sum"),
))

shape: (8, 2)
┌─────┬─────┐
│ x   ┆ y   │
│ --- ┆ --- │
│ i64 ┆ str │
╞═════╪═════╡
│ 0   ┆ A   │
│ 1   ┆ A   │
│ 2   ┆ A   │
│ 3   ┆ B   │
│ 4   ┆ B   │
│ 5   ┆ C   │
│ 6   ┆ X   │
│ 7   ┆ X   │
└─────┴─────┘
shape: (4, 2)
┌─────┬───────┐
│ y   ┆ count │
│ --- ┆ ---   │
│ str ┆ u32   │
╞═════╪═══════╡
│ A   ┆ 3     │
│ B   ┆ 2     │
│ C   ┆ 1     │
│ X   ┆ 2     │
└─────┴───────┘
shape: (4, 3)
┌─────┬───────┬─────┐
│ y   ┆ count ┆ sum │
│ --- ┆ ---   ┆ --- │
│ str ┆ u32   ┆ i64 │
╞═════╪═══════╪═════╡
│ A   ┆ 3     ┆ 3   │
│ B   ┆ 2     ┆ 7   │
│ C   ┆ 1     ┆ 5   │
│ X   ┆ 2     ┆ 13  │
└─────┴───────┴─────┘


In [37]:
#Join dataframes
df = pl.DataFrame(
    {
        "a": range(8),
        "b": np.random.rand(8),
        "d": [1, 2.0, float("nan"), float("nan"), 0, -5, -42, None],
    }
)

df2 = pl.DataFrame(
    {
        "x": range(8),
        "y": ["A", "A", "A", "B", "B", "C", "X", "X"],
    }
)
# left_on and right_on name the key columns, which hold the same values here.
# The join matches rows on those keys and keeps a single key column ("a").
joined = df.join(df2, left_on="a", right_on="x")
print(joined)

shape: (8, 4)
┌─────┬──────────┬───────┬─────┐
│ a   ┆ b        ┆ d     ┆ y   │
│ --- ┆ ---      ┆ ---   ┆ --- │
│ i64 ┆ f64      ┆ f64   ┆ str │
╞═════╪══════════╪═══════╪═════╡
│ 0   ┆ 0.197025 ┆ 1.0   ┆ A   │
│ 1   ┆ 0.690501 ┆ 2.0   ┆ A   │
│ 2   ┆ 0.815058 ┆ NaN   ┆ A   │
│ 3   ┆ 0.580804 ┆ NaN   ┆ B   │
│ 4   ┆ 0.25513  ┆ 0.0   ┆ B   │
│ 5   ┆ 0.601945 ┆ -5.0  ┆ C   │
│ 6   ┆ 0.984781 ┆ -42.0 ┆ X   │
│ 7   ┆ 0.106888 ┆ null  ┆ X   │
└─────┴──────────┴───────┴─────┘


In [38]:
# Concatenate dataframes horizontally (column-wise); both must have the same number of rows
stacked = df.hstack(df2)
print(stacked)

shape: (8, 5)
┌─────┬──────────┬───────┬─────┬─────┐
│ a   ┆ b        ┆ d     ┆ x   ┆ y   │
│ --- ┆ ---      ┆ ---   ┆ --- ┆ --- │
│ i64 ┆ f64      ┆ f64   ┆ i64 ┆ str │
╞═════╪══════════╪═══════╪═════╪═════╡
│ 0   ┆ 0.197025 ┆ 1.0   ┆ 0   ┆ A   │
│ 1   ┆ 0.690501 ┆ 2.0   ┆ 1   ┆ A   │
│ 2   ┆ 0.815058 ┆ NaN   ┆ 2   ┆ A   │
│ 3   ┆ 0.580804 ┆ NaN   ┆ 3   ┆ B   │
│ 4   ┆ 0.25513  ┆ 0.0   ┆ 4   ┆ B   │
│ 5   ┆ 0.601945 ┆ -5.0  ┆ 5   ┆ C   │
│ 6   ┆ 0.984781 ┆ -42.0 ┆ 6   ┆ X   │
│ 7   ┆ 0.106888 ┆ null  ┆ 7   ┆ X   │
└─────┴──────────┴───────┴─────┴─────┘


In [39]:
#Supports numpy functions
df = pl.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})

out = df.select(np.log(pl.all()).name.suffix("_log"))
print(out)

shape: (3, 2)
┌──────────┬──────────┐
│ a_log    ┆ b_log    │
│ ---      ┆ ---      │
│ f64      ┆ f64      │
╞══════════╪══════════╡
│ 0.0      ┆ 1.386294 │
│ 0.693147 ┆ 1.609438 │
│ 1.098612 ┆ 1.791759 │
└──────────┴──────────┘


In [40]:
# Pivot (long to wide)
# Note: Polars 1.0 renamed the `columns` argument of pivot to `on`
df = pl.DataFrame(
    {
        "foo": ["A", "A", "B", "B", "C"],
        "N": [1, 2, 2, 4, 2],
        "bar": ["k", "l", "m", "n", "o"],
    }
)
print(df)

out = df.pivot(index="foo", columns="bar", values="N", aggregate_function="first")
print(out)

shape: (5, 3)
┌─────┬─────┬─────┐
│ foo ┆ N   ┆ bar │
│ --- ┆ --- ┆ --- │
│ str ┆ i64 ┆ str │
╞═════╪═════╪═════╡
│ A   ┆ 1   ┆ k   │
│ A   ┆ 2   ┆ l   │
│ B   ┆ 2   ┆ m   │
│ B   ┆ 4   ┆ n   │
│ C   ┆ 2   ┆ o   │
└─────┴─────┴─────┘
shape: (3, 6)
┌─────┬──────┬──────┬──────┬──────┬──────┐
│ foo ┆ k    ┆ l    ┆ m    ┆ n    ┆ o    │
│ --- ┆ ---  ┆ ---  ┆ ---  ┆ ---  ┆ ---  │
│ str ┆ i64  ┆ i64  ┆ i64  ┆ i64  ┆ i64  │
╞═════╪══════╪══════╪══════╪══════╪══════╡
│ A   ┆ 1    ┆ 2    ┆ null ┆ null ┆ null │
│ B   ┆ null ┆ null ┆ 2    ┆ 4    ┆ null │
│ C   ┆ null ┆ null ┆ null ┆ null ┆ 2    │
└─────┴──────┴──────┴──────┴──────┴──────┘


In [41]:
# Melt (wide to long)
# Note: Polars 1.0 renamed `melt` to `unpivot`
df = pl.DataFrame(
    {
        "A": ["a", "b", "a"],
        "B": [1, 3, 5],
        "C": [10, 11, 12],
        "D": [2, 4, 6],
    }
)
print(df)

out = df.melt(id_vars=["A", "B"], value_vars=["C", "D"])
print(out)

shape: (3, 4)
┌─────┬─────┬─────┬─────┐
│ A   ┆ B   ┆ C   ┆ D   │
│ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ i64 ┆ i64 ┆ i64 │
╞═════╪═════╪═════╪═════╡
│ a   ┆ 1   ┆ 10  ┆ 2   │
│ b   ┆ 3   ┆ 11  ┆ 4   │
│ a   ┆ 5   ┆ 12  ┆ 6   │
└─────┴─────┴─────┴─────┘
shape: (6, 4)
┌─────┬─────┬──────────┬───────┐
│ A   ┆ B   ┆ variable ┆ value │
│ --- ┆ --- ┆ ---      ┆ ---   │
│ str ┆ i64 ┆ str      ┆ i64   │
╞═════╪═════╪══════════╪═══════╡
│ a   ┆ 1   ┆ C        ┆ 10    │
│ b   ┆ 3   ┆ C        ┆ 11    │
│ a   ┆ 5   ┆ C        ┆ 12    │
│ a   ┆ 1   ┆ D        ┆ 2     │
│ b   ┆ 3   ┆ D        ┆ 4     │
│ a   ┆ 5   ┆ D        ┆ 6     │
└─────┴─────┴──────────┴───────┘


In [46]:
# Other functionality (time series)
df = pl.read_csv("apple_stock.csv", try_parse_dates=False)
df = df.with_columns(pl.col("Date").str.to_date("%Y-%m-%d"))  # parse strings into dates
df_with_year = df.with_columns(pl.col("Date").dt.year().alias("year"))  # .dt.year() extracts the year
print(df)
print(df_with_year)


shape: (100, 2)
┌────────────┬────────┐
│ Date       ┆ Close  │
│ ---        ┆ ---    │
│ date       ┆ f64    │
╞════════════╪════════╡
│ 1981-02-23 ┆ 24.62  │
│ 1981-05-06 ┆ 27.38  │
│ 1981-05-18 ┆ 28.0   │
│ 1981-09-25 ┆ 14.25  │
│ …          ┆ …      │
│ 2012-12-04 ┆ 575.85 │
│ 2013-07-05 ┆ 417.42 │
│ 2013-11-07 ┆ 512.49 │
│ 2014-02-25 ┆ 522.06 │
└────────────┴────────┘
shape: (100, 3)
┌────────────┬────────┬──────┐
│ Date       ┆ Close  ┆ year │
│ ---        ┆ ---    ┆ ---  │
│ date       ┆ f64    ┆ i32  │
╞════════════╪════════╪══════╡
│ 1981-02-23 ┆ 24.62  ┆ 1981 │
│ 1981-05-06 ┆ 27.38  ┆ 1981 │
│ 1981-05-18 ┆ 28.0   ┆ 1981 │
│ 1981-09-25 ┆ 14.25  ┆ 1981 │
│ …          ┆ …      ┆ …    │
│ 2012-12-04 ┆ 575.85 ┆ 2012 │
│ 2013-07-05 ┆ 417.42 ┆ 2013 │
│ 2013-11-07 ┆ 512.49 ┆ 2013 │
│ 2014-02-25 ┆ 522.06 ┆ 2014 │
└────────────┴────────┴──────┘


In [47]:
filtered_df = df.filter(
    pl.col("Date") == datetime(1995, 10, 16),
)
print(filtered_df)

shape: (1, 2)
┌────────────┬───────┐
│ Date       ┆ Close │
│ ---        ┆ ---   │
│ date       ┆ f64   │
╞════════════╪═══════╡
│ 1995-10-16 ┆ 36.13 │
└────────────┴───────┘


In [48]:
filtered_range_df = df.filter(
    pl.col("Date").is_between(datetime(1995, 7, 1), datetime(1995, 11, 1)),
)
print(filtered_range_df)

shape: (2, 2)
┌────────────┬───────┐
│ Date       ┆ Close │
│ ---        ┆ ---   │
│ date       ┆ f64   │
╞════════════╪═══════╡
│ 1995-07-06 ┆ 47.0  │
│ 1995-10-16 ┆ 36.13 │
└────────────┴───────┘


In [49]:
df_sort = df.sort("Date")
print(df_sort)

shape: (100, 2)
┌────────────┬────────┐
│ Date       ┆ Close  │
│ ---        ┆ ---    │
│ date       ┆ f64    │
╞════════════╪════════╡
│ 1981-02-23 ┆ 24.62  │
│ 1981-05-06 ┆ 27.38  │
│ 1981-05-18 ┆ 28.0   │
│ 1981-09-25 ┆ 14.25  │
│ …          ┆ …      │
│ 2012-12-04 ┆ 575.85 │
│ 2013-07-05 ┆ 417.42 │
│ 2013-11-07 ┆ 512.49 │
│ 2014-02-25 ┆ 522.06 │
└────────────┴────────┘


In [52]:
# Average per time period (the dataframe must be sorted by "Date" first)
annual_average_df = df_sort.group_by_dynamic("Date", every="1y").agg(pl.col("Close").mean())
df_with_year = annual_average_df.with_columns(pl.col("Date").dt.year().alias("year"))
print(df_with_year)

shape: (34, 3)
┌────────────┬───────────┬──────┐
│ Date       ┆ Close     ┆ year │
│ ---        ┆ ---       ┆ ---  │
│ date       ┆ f64       ┆ i32  │
╞════════════╪═══════════╪══════╡
│ 1981-01-01 ┆ 23.5625   ┆ 1981 │
│ 1982-01-01 ┆ 11.0      ┆ 1982 │
│ 1983-01-01 ┆ 30.543333 ┆ 1983 │
│ 1984-01-01 ┆ 27.583333 ┆ 1984 │
│ …          ┆ …         ┆ …    │
│ 2011-01-01 ┆ 368.225   ┆ 2011 │
│ 2012-01-01 ┆ 560.965   ┆ 2012 │
│ 2013-01-01 ┆ 464.955   ┆ 2013 │
│ 2014-01-01 ┆ 522.06    ┆ 2014 │
└────────────┴───────────┴──────┘
