In [1]:
# Configuration to automatically reload modules and libraries
%reload_ext autoreload
%autoreload 2

# MAT281

## Applications of Mathematics in Engineering

You can run this Jupyter notebook interactively:

[![Binder](../shared/images/jupyter_binder.png)](https://mybinder.org/v2/gh/sebastiandres/mat281_m01_introduccion/master?filepath=00_template/00_template.ipynb)

[![Colab](../shared/images/jupyter_colab.png)](https://colab.research.google.com/github/sebastiandres/mat281_m01_introduccion/blob/master//00_template/00_template.ipynb)

## What will we learn?
* Numerical operations with ```numpy```.
* Data manipulation with ```pandas```.

## Motivation

In recent years, interest in data has grown steadily; buzzwords such as *data science*, *machine learning*, *big data*, *artificial intelligence*, *deep learning*, etc. are clear evidence of this. As an example, the following image shows Google search interest in *__Data Science__* over the last five years.

[Source](https://trends.google.com/trends/explore?date=today%205-y&q=data%20science)

![Google search interest in data science over the last five years](images/dataScienceTrend.png)


Much has been said about this, including statements such as:

* _"The world’s most valuable resource is no longer oil, but data."_
* _"AI is the new electricity."_
* _"Data Scientist: The Sexiest Job of the 21st Century."_


Data on its own is not useful; its real value lies in the analysis and everything that comes with it, for example:

* Prediction
* Classification
* Optimization
* Visualization
* Learning

This is why it is worth remembering Uncle Ben: _"With great power comes great responsibility"_.

## NumPy

From the project's own website:

NumPy is the fundamental package for scientific computing with Python. It contains among other things:

* a powerful N-dimensional array object
* sophisticated (broadcasting) functions
* tools for integrating C/C++ and Fortran code
* useful linear algebra, Fourier transform, and random number capabilities

Besides its obvious scientific uses, NumPy can also be used as an efficient multi-dimensional container of generic data. Arbitrary data-types can be defined. This allows NumPy to seamlessly and speedily integrate with a wide variety of databases.


In [2]:
# Review
# some timeits
# operations
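The review topics listed in the cell above (operations and timings) could be sketched roughly as follows. The array `x` and the helper `python_sum` are hypothetical names chosen for illustration, and `timeit.timeit` is used as a plain-script stand-in for the `%timeit` magic.

```python
import timeit

import numpy as np

# Hypothetical data: 100,000 evenly spaced values between 0 and 1.
x = np.linspace(0.0, 1.0, 100_000)

# Vectorized operations act on the whole array at once.
y = 2.0 * x + 1.0              # element-wise arithmetic (scalars are broadcast)
total = x.sum()                # reduction over all elements
norm = np.sqrt(np.dot(x, x))   # linear algebra: Euclidean norm


def python_sum(values):
    """Sum with a plain Python loop, for comparison only (hypothetical helper)."""
    acc = 0.0
    for v in values:
        acc += v
    return acc


# Time one run of each approach; the exact numbers depend on the machine.
t_loop = timeit.timeit(lambda: python_sum(x), number=1)
t_numpy = timeit.timeit(lambda: x.sum(), number=1)
print(f"python loop: {t_loop:.5f} s, numpy sum: {t_numpy:.5f} s")
```

On most machines the vectorized sum is considerably faster than the loop, but the ratio varies, so no timings are quoted here.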

## Pandas


From the GitHub repository:

pandas is a Python package providing fast, flexible, and expressive data structures designed to make working with "relational" or "labeled" data both easy and intuitive. It aims to be the fundamental high-level building block for doing practical, real world data analysis in Python. Additionally, it has the broader goal of becoming the most powerful and flexible open source data analysis / manipulation tool available in any language. It is already well on its way toward this goal.

It currently has more than 1,200 contributors and nearly 18,000 commits!

In [3]:
import pandas as pd

In [4]:
pd.__version__

'0.23.4'

### Series

In [5]:
pd.Series?

Init signature: pd.Series(data=None, index=None, dtype=None, name=None, copy=False, fastpath=False)
Docstring:
One-dimensional ndarray with axis labels (including time series).

Labels need not be unique but must be a hashable type. The object
supports both integer- and label-based indexing and provides a host of
methods for performing operations involving the index. Statistical
methods from ndarray have been overridden to automatically exclude
missing data (currently represented as NaN).

Operations between Series (+, -, /, *, **) align values based on their
associated index values-- they need not be the same length. The result
index will be the sorted union of the

There are many ways to create a Series; the most common are:

* From a list.
* From a _numpy.array_.
* From a dictionary.
* From a file (for example, a csv).
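The four options listed above might look like this in practice. This is a minimal sketch: the variable names and the tiny CSV text are made up for illustration, and `io.StringIO` stands in for a real file so the example is self-contained.

```python
import io

import numpy as np
import pandas as pd

# From a list: pandas builds a default RangeIndex (0, 1, 2, ...).
s_list = pd.Series([3, 6, 9])

# From a numpy array, here with explicit labels.
s_array = pd.Series(np.array([3, 6, 9]), index=["a", "b", "c"])

# From a dictionary: keys become the index, values become the data.
s_dict = pd.Series({"a": 3, "b": 6, "c": 9})

# From CSV text (hypothetical content). A path to a real file works the same way.
csv_text = "key,value\na,3\nb,6\nc,9\n"
s_csv = pd.read_csv(io.StringIO(csv_text), index_col="key")["value"]

print(s_list, s_array, s_dict, s_csv, sep="\n\n")
```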

In [6]:
my_serie = pd.Series(range(3, 33, 3))
my_serie

0     3
1     6
2     9
3    12
4    15
5    18
6    21
7    24
8    27
9    30
dtype: int64

In [7]:
type(my_serie)

pandas.core.series.Series

In [8]:
# Type `my_serie.` and press TAB to be surprised by the number of methods!
# my_serie.<TAB>

A Series is a one-dimensional array made up of _data_ and an _index_.

In [10]:
my_serie.values

array([ 3,  6,  9, 12, 15, 18, 21, 24, 27, 30])

In [11]:
type(my_serie.values)

numpy.ndarray

In [12]:
my_serie.index

RangeIndex(start=0, stop=10, step=1)

In [13]:
type(my_serie.index)

pandas.core.indexes.range.RangeIndex

Unlike numpy, pandas offers more flexibility for both values and indices.

In [14]:
my_serie_2 = pd.Series(range(3, 33, 3), index=list('abcdefghij'))
my_serie_2

a     3
b     6
c     9
d    12
e    15
f    18
g    21
h    24
i    27
j    30
dtype: int64
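One consequence of labeled indices, mentioned in the `pd.Series` docstring, is that arithmetic between Series aligns values by label rather than by position. A small sketch with two hypothetical Series (`s1` and `s2` are illustrative names):

```python
import pandas as pd

s1 = pd.Series([1, 2, 3], index=list("abc"))
s2 = pd.Series([10, 20, 30], index=list("bcd"))

# Addition pairs values by label, not by position.
# Labels present in only one Series ('a' and 'd') produce NaN.
total = s1 + s2
print(total)
```

The result is indexed by the union of both indices, and its dtype becomes float because of the missing values.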

Accessing the values of a Series is very easy!

In [15]:
my_serie_2['b']

6

In [16]:
my_serie_2.loc['b']

6

In [17]:
my_serie_2.iloc[1]

6

```loc```?? ```iloc```??

In [18]:
pd.Series.loc?

Type:        property
String form: <property object at 0x7f5ec8ea7d68>
Docstring:
Access a group of rows and columns by label(s) or a boolean array.

``.loc[]`` is primarily label based, but may also be used with a
boolean array.

Allowed inputs are:

- A single label, e.g. ``5`` or ``'a'``, (note that ``5`` is
  interpreted as a *label* of the index, and **never** as an
  integer position along the index).
- A list or array of labels, e.g. ``['a', 'b', 'c']``.
- A slice object with labels, e.g. ``'a':'f'``.

      start and the stop are included

- A boolean array of the same length as the axis being sliced,
  e.g. ``[True, False, True]``.
- A ``callable`` function with one argument (the calling Series, DataFrame
  or Panel) and that returns valid output for indexing (one of the above)

See more at :ref:`Selection by Label <indexing.label>`

See Also
--------
DataFrame.at : Access a single value for a row/column label pair
DataFrame.iloc : Access gro

In summary:

* ```loc``` is an indexer (exposed as a property, not a method) that selects by the labels of the object.
* ```iloc``` is an indexer that selects by integer position within the object.
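The difference is easiest to see when the labels are integers that do not coincide with the positions. The following sketch uses a hypothetical Series `s` for illustration:

```python
import pandas as pd

# Hypothetical Series whose integer labels do not match the positions.
s = pd.Series(["x", "y", "z"], index=[10, 20, 30])

by_label = s.loc[20]       # the element labeled 20
by_position = s.iloc[0]    # the element at the first position

# Slices also differ: loc includes both endpoints, iloc excludes the stop.
label_slice = s.loc[10:20]      # labels 10 and 20
position_slice = s.iloc[0:1]    # position 0 only

print(by_label, by_position)
print(label_slice)
print(position_slice)
```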

pandas even allows the index to be made of dates! For example, the following cells create a Series with the Google search trend for *data science*.

In [19]:
import os

In [20]:
ds_trend = pd.read_csv(os.path.join('data', 'dataScienceTrend.csv'), index_col=0, squeeze=True, parse_dates=True)
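The cell above was written for pandas 0.23.4. To the best of our knowledge, the `squeeze` argument of `read_csv` was deprecated in later releases and removed in pandas 2.0, so on a recent installation a rough equivalent is to read a one-column DataFrame and call `.squeeze("columns")` on it. In this sketch, `io.StringIO` and the name `ds_trend_sketch` are assumptions made so the example runs without the data file; the three rows are copied from the output shown in this notebook.

```python
import io

import pandas as pd

# A few rows standing in for data/dataScienceTrend.csv.
csv_text = """week,trend
2013-09-29,15
2013-10-06,15
2013-10-13,14
"""

# Read as a one-column DataFrame, then squeeze that column into a Series.
ds_trend_sketch = pd.read_csv(
    io.StringIO(csv_text), index_col=0, parse_dates=True
).squeeze("columns")

print(type(ds_trend_sketch))
print(ds_trend_sketch.index)
```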

In [21]:
ds_trend.head(10)

week
2013-09-29    15
2013-10-06    15
2013-10-13    14
2013-10-20    14
2013-10-27    14
2013-11-03    14
2013-11-10    15
2013-11-17    16
2013-11-24    12
2013-12-01    17
Name: trend, dtype: int64

In [22]:
ds_trend.tail(10)

week
2018-07-22     84
2018-07-29     86
2018-08-05     82
2018-08-12     83
2018-08-19     91
2018-08-26     93
2018-09-02    100
2018-09-09     93
2018-09-16     98
2018-09-23     93
Name: trend, dtype: int64

In [23]:
ds_trend.dtype

dtype('int64')

In [24]:
ds_trend.index

DatetimeIndex(['2013-09-29', '2013-10-06', '2013-10-13', '2013-10-20',
               '2013-10-27', '2013-11-03', '2013-11-10', '2013-11-17',
               '2013-11-24', '2013-12-01',
               ...
               '2018-07-22', '2018-07-29', '2018-08-05', '2018-08-12',
               '2018-08-19', '2018-08-26', '2018-09-02', '2018-09-09',
               '2018-09-16', '2018-09-23'],
              dtype='datetime64[ns]', name='week', length=261, freq=None)

For example, we can quickly find the maximum value of the trend.

In [25]:
max_trend = ds_trend.max()
max_trend 

100

There are two common ways to find out when it occurred:

* Using a mask
* Using built-in methods

In [26]:
# Mask
ds_trend[ds_trend == max_trend]

week
2018-09-02    100
Name: trend, dtype: int64

In [27]:
# Built-in method
ds_trend.idxmax()

Timestamp('2018-09-02 00:00:00')
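The two approaches agree here because the maximum occurs only once. When there are ties they differ: the mask returns every matching row, while `idxmax` returns only the label of the first occurrence. A sketch with made-up values (not taken from the dataset):

```python
import pandas as pd

# Hypothetical weekly values in which the maximum (100) appears twice.
s = pd.Series(
    [93, 100, 98, 100],
    index=pd.to_datetime(["2018-09-02", "2018-09-09", "2018-09-16", "2018-09-23"]),
)

all_peaks = s[s == s.max()]   # mask: every row where the maximum occurs
first_peak = s.idxmax()       # built-in: label of the first occurrence only

print(all_peaks)
print(first_peak)
```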

### DataFrames

A DataFrame is a two-dimensional array and the natural extension of a Series.

The flexibility of pandas becomes much more visible with the NBA players dataset. The elements do not all need to be of the same type!

In [28]:
player_data = pd.read_csv(os.path.join('data', 'player_data.csv'), index_col='name')
player_data.head()

| name | year_start | year_end | position | height | weight | birth_date | college |
|---|---|---|---|---|---|---|---|
| Alaa Abdelnaby | 1991 | 1995 | F-C | 6-10 | 240.0 | June 24, 1968 | Duke University |
| Zaid Abdul-Aziz | 1969 | 1978 | C-F | 6-9 | 235.0 | April 7, 1946 | Iowa State University |
| Kareem Abdul-Jabbar | 1970 | 1989 | C | 7-2 | 225.0 | April 16, 1947 | University of California, Los Angeles |
| Mahmoud Abdul-Rauf | 1991 | 2001 | G | 6-1 | 162.0 | March 9, 1969 | Louisiana State University |
| Tariq Abdul-Wahad | 1998 | 2003 | F | 6-6 | 223.0 | November 3, 1974 | San Jose State University |


In [29]:
type(player_data)

pandas.core.frame.DataFrame

In [30]:
player_data.dtypes

year_start      int64
year_end        int64
position       object
height         object
weight        float64
birth_date     object
college        object
dtype: object

You can think of a DataFrame as a collection of Series.

In [31]:
player_data['birth_date'].head()

name
Alaa Abdelnaby            June 24, 1968
Zaid Abdul-Aziz           April 7, 1946
Kareem Abdul-Jabbar      April 16, 1947
Mahmoud Abdul-Rauf        March 9, 1969
Tariq Abdul-Wahad      November 3, 1974
Name: birth_date, dtype: object

In [32]:
type(player_data['birth_date'])

pandas.core.series.Series
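The same idea works in the other direction: a DataFrame can be assembled from several Series that share an index. A minimal sketch, using a few values copied from the first rows of the player table above (the variable names are illustrative):

```python
import pandas as pd

names = ["Alaa Abdelnaby", "Zaid Abdul-Aziz", "Kareem Abdul-Jabbar"]
year_start = pd.Series([1991, 1969, 1970], index=names)
position = pd.Series(["F-C", "C-F", "C"], index=names)

# Each Series becomes a column; the shared index becomes the row labels.
players = pd.DataFrame({"year_start": year_start, "position": position})

# Selecting a column gives back a Series; columns may have different dtypes.
col = players["position"]

print(players)
print(players.dtypes)
```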

### Exploration

In [34]:
player_data.describe()

|  | year_start | year_end | weight |
|---|---|---|---|
| count | 4550.0 | 4550.0 | 4544.0 |
| mean | 1985.076264 | 1989.272527 | 208.908011 |
| std | 20.974188 | 21.874761 | 26.268662 |
| min | 1947.0 | 1947.0 | 114.0 |
| 25% | 1969.0 | 1973.0 | 190.0 |
| 50% | 1986.0 | 1992.0 | 210.0 |
| 75% | 2003.0 | 2009.0 | 225.0 |
| max | 2018.0 | 2018.0 | 360.0 |


In [35]:
player_data.describe(include='all')

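|  | year_start | year_end | position | height | weight | birth_date | college |
|---|---|---|---|---|---|---|---|
| count | 4550.0 | 4550.0 | 4549 | 4549 | 4544.0 | 4519 | 4248 |
| unique | NaN | NaN | 7 | 28 | NaN | 4161 | 473 |
| top | NaN | NaN | G | 6-7 | NaN | October 25, 1948 | University of Kentucky |
| freq | NaN | NaN | 1574 | 473 | NaN | 3 | 99 |
| mean | 1985.076264 | 1989.272527 | NaN | NaN | 208.908011 | NaN | NaN |
| std | 20.974188 | 21.874761 | NaN | NaN | 26.268662 | NaN | NaN |
| min | 1947.0 | 1947.0 | NaN | NaN | 114.0 | NaN | NaN |
| 25% | 1969.0 | 1973.0 | NaN | NaN | 190.0 | NaN | NaN |
| 50% | 1986.0 | 1992.0 | NaN | NaN | 210.0 | NaN | NaN |
| 75% | 2003.0 | 2009.0 | NaN | NaN | 225.0 | NaN | NaN |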


In [33]:
player_data.max()

year_start    2018.0
year_end      2018.0
weight         360.0
dtype: float64

To extract elements, the recommended approach is the ```loc``` indexer.

In [None]:
player_data.loc['Zaid Abdul-Aziz', 'college']

Avoid chained access with double brackets.

In [39]:
player_data['college']['Zaid Abdul-Aziz']

'Iowa State University'

Although it sometimes works, there is no guarantee that it always will. [More info here.](https://pandas.pydata.org/pandas-docs/stable/indexing.html#why-does-assignment-fail-when-using-chained-indexing)
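The distinction matters most when assigning values. The sketch below uses a small hypothetical frame `df` built from two rows of the player table, and the replacement value `"Iowa State"` is made up for illustration. It shows the single-step ```loc``` form for both reading and writing; the chained form is deliberately left out of the executable code because its behavior depends on the pandas version.

```python
import pandas as pd

# Small hypothetical frame reusing two rows from the player table.
df = pd.DataFrame(
    {"college": ["Duke University", "Iowa State University"]},
    index=["Alaa Abdelnaby", "Zaid Abdul-Aziz"],
)

# Reading: a single .loc call with (row label, column label).
college = df.loc["Zaid Abdul-Aziz", "college"]

# Writing: .loc indexes the original object in one step, so the assignment
# is applied to df itself. A chained form such as df["college"][label] = value
# may instead modify a temporary copy, depending on the pandas version.
df.loc["Zaid Abdul-Aziz", "college"] = "Iowa State"

print(college)
print(df)
```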

### Manipulation

## Summary
* 1
* 2
* 3

## Lab Evaluation

* Name: 
* Student ID (Rol):

#### Instructions

Lorem ipsum