## Handling missing data

We often work with datasets that contain incomplete samples.
Typical cases are survey data with unanswered questions, or sensors that fail to report a usable value.

**We have to accept this and know how to handle it**

`Pandas` assigns the code NaN (Not a Number) to unknown values. More specifically, missing objects are represented as None and missing dates as NaT.

Any operation involving this kind of data has to handle the corresponding codes internally: NaN, None or NaT. How does a NaN affect an arithmetic mean?

In this chapter we will work with these kinds of values.


In [70]:
# Finally, we can assign and use NaNs
import numpy as np
datos = np.array([1,2,np.nan,4,5,6,np.nan,8])
print(datos)

print(datos.mean())


[ 1.  2. nan  4.  5.  6. nan  8.]
nan
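The `nan` result above appears because any arithmetic that involves NaN yields NaN. As a minimal sketch (the variable names are illustrative), NumPy's `np.nanmean` and the Pandas `Series.mean` method, which skips NaN by default, both ignore the missing entries:

```python
import numpy as np
import pandas as pd

datos = np.array([1, 2, np.nan, 4, 5, 6, np.nan, 8])

# A plain mean propagates NaN
media_simple = datos.mean()

# np.nanmean ignores the NaN entries: (1 + 2 + 4 + 5 + 6 + 8) / 6
media_numpy = np.nanmean(datos)

# Pandas skips NaN by default (skipna=True)
media_pandas = pd.Series(datos).mean()

print(media_simple, media_numpy, media_pandas)
```

Note that ignoring NaN changes the divisor: the mean is taken over the six valid values, not over the eight positions.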


In [1]:
import pandas as pd

In [2]:
# We start by loading the data: who.csv has 358 columns!
df = pd.read_csv("data/who.csv")
df = df[["Country",df.columns[-2]]]
print(df[:5])

       Country  Urban_population_growth
0  Afghanistan                     5.44
1      Albania                     2.21
2      Algeria                     2.61
3      Andorra                      NaN
4       Angola                     4.14


In [66]:
# As you already know, the API reference gives a more detailed description of what each Python method can do, and in particular
# the Pandas methods.
# When loading a large file, it is advisable to load only the attributes of interest from the start, using the usecols argument
# https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html
df = pd.read_csv("data/who.csv", usecols=["Country","Urban_population_growth"])
print(df[:5])

       Country  Urban_population_growth
0  Afghanistan                     5.44
1      Albania                     2.21
2      Algeria                     2.61
3      Andorra                      NaN
4       Angola                     4.14


In [6]:
# Which values of the dataframe are NA?
# https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.isna.html
df.isna()

     Country  Urban_population_growth
0      False                    False
1      False                    False
2      False                    False
3      False                     True
4      False                    False
..       ...                      ...
197    False                    False
198    False                    False
199    False                    False
200    False                    False


In [9]:
# Which columns contain missing values: NaN, NaT, None?
# https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.any.html

print(df.columns[df.isna().any()])

# This is equivalent to asking: is there any True value in each of these series?
print("-"*30)
print(df.isna().any())

Index(['Urban_population_growth'], dtype='object')
------------------------------
Country                    False
Urban_population_growth     True
dtype: bool


In [31]:
# Do not hesitate to run "parts" of an instruction (let us split it up to understand it)
print(df.isna()[:5])

   Country  Urban_population_growth
0    False                    False
1    False                    False
2    False                    False
3    False                     True
4    False                    False


In [11]:
# How many samples are valid?
df.notna().sum()
# https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.notna.html
# And out of how many samples?


Country                    202
Urban_population_growth    188
dtype: int64
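To answer "out of how many samples?", the valid counts can be compared against the total number of rows. The following is a minimal sketch on a hypothetical five-row miniature of the selection above (`df_demo` and the other variable names are illustrative):

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the who.csv selection used above
df_demo = pd.DataFrame({
    "Country": ["Afghanistan", "Albania", "Algeria", "Andorra", "Angola"],
    "Urban_population_growth": [5.44, 2.21, 2.61, np.nan, 4.14],
})

total = len(df_demo)                       # number of rows (samples)
faltantes = df_demo.isna().sum()           # missing values per column
validos = df_demo.notna().sum()            # valid values per column
porcentaje = df_demo.isna().mean() * 100   # share of missing values per column

print(total)
print(faltantes)
print(porcentaje)
```

Taking the mean of the boolean mask works because True counts as 1 and False as 0.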

In [13]:
df.notnull().sum() # notna and notnull are aliases in Pandas; NumPy offers neither (it provides np.isnan instead)

Country                    202
Urban_population_growth    188
dtype: int64

### Dealing with missing data
- Ignoring: "There are X valid samples out of N"
- Filling: replace unknown samples with other values: the mean, a neutral value, etc.

In [39]:
# The most convenient way to replace these values is the fillna method
print(df.fillna(0)[:5])


       Country  Urban_population_growth
0  Afghanistan                     5.44
1      Albania                     2.21
2      Algeria                     2.61
3      Andorra                     0.00
4       Angola                     4.14


In [14]:
# If we want our dataframe variable to keep these changes, remember to assign the result of the operation to that variable or to a new one
df = df.fillna(0) 
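A single fill value is not always appropriate for every column. As a hedged sketch, `fillna` also accepts a dictionary that maps column names to fill values; the data and the `"Unknown"` label below are hypothetical:

```python
import numpy as np
import pandas as pd

# Hypothetical data with gaps in both a text and a numeric column
df_demo = pd.DataFrame({
    "Country": ["Andorra", None, "Angola"],
    "Urban_population_growth": [np.nan, 2.21, 4.14],
})

# One fill value per column: a label for text, the column mean for numbers
relleno = {
    "Country": "Unknown",
    "Urban_population_growth": df_demo["Urban_population_growth"].mean(),
}
df_relleno = df_demo.fillna(relleno)

print(df_relleno)
```

As before, `fillna` returns a new dataframe, so `df_demo` itself still contains its missing values.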

### Ways to fill a series containing NA data

When dataframes contain numbers, missing values can be handled more efficiently. Let us look at an example:

In [19]:
import numpy as np

np.random.seed(20)

# Create a dataframe
df = pd.DataFrame(np.random.randn(5, 3), 
                     index=['a', 'b', 'c', 'd', 'e'],
                     columns=['one', 'two', 'three'])
print(df)

        one       two     three
a  0.883893  0.195865  0.357537
b -2.343262 -1.084833  0.559696
c  0.939469 -0.978481  0.503097
d  0.406414  0.323461 -0.493411
e -0.792017 -0.842368 -1.279503


In [20]:
# Create some NaN values for testing (.loc avoids chained assignment, which newer Pandas versions do not apply)
df.loc[df.two < 0, "two"] = np.nan
print(df)

        one       two     three
a  0.883893  0.195865  0.357537
b -2.343262       NaN  0.559696
c  0.939469       NaN  0.503097
d  0.406414  0.323461 -0.493411
e -0.792017       NaN -1.279503


We can use `fillna` to fill the series in several ways. For example, using an aggregation such as the mean:

In [21]:
print(df)
print("-"*33)
print(df.fillna(df.mean()))

        one       two     three
a  0.883893  0.195865  0.357537
b -2.343262       NaN  0.559696
c  0.939469       NaN  0.503097
d  0.406414  0.323461 -0.493411
e -0.792017       NaN -1.279503
---------------------------------
        one       two     three
a  0.883893  0.195865  0.357537
b -2.343262  0.259663  0.559696
c  0.939469  0.259663  0.503097
d  0.406414  0.323461 -0.493411
e -0.792017  0.259663 -1.279503


In [30]:
# With an arbitrary value, or with a specific value taken from the dataframe itself
print(df.fillna("HOLA"))
print("-"*33)
print(df.fillna(df.loc["a", "one"]))

        one       two     three
a  0.883893  0.195865  0.357537
b -2.343262      HOLA  0.559696
c  0.939469      HOLA  0.503097
d  0.406414  0.323461 -0.493411
e -0.792017      HOLA -1.279503
---------------------------------
        one       two     three
a  0.883893  0.195865  0.357537
b -2.343262  0.883893  0.559696
c  0.939469  0.883893  0.503097
d  0.406414  0.323461 -0.493411
e -0.792017  0.883893 -1.279503


#### We can fill with interpolated data

The documentation shows a number of examples: [Interpolate](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.interpolate.html)

In [38]:
print(df)
print("-"*35)
print(df.interpolate())

        one       two     three
a  0.883893  0.195865  0.357537
b -2.343262       NaN  0.559696
c  0.939469       NaN  0.503097
d  0.406414  0.323461 -0.493411
e -0.792017       NaN -1.279503
-----------------------------------
        one       two     three
a  0.883893  0.195865  0.357537
b -2.343262  0.238397  0.559696
c  0.939469  0.280929  0.503097
d  0.406414  0.323461 -0.493411
e -0.792017  0.323461 -1.279503


In [53]:
print(df.interpolate(axis=1)) # Take the NA value at (b, "two") as a reference
print("--"*35)
print(df.mean(axis=1).b)

        one       two     three
a  0.883893  0.195865  0.357537
b -2.343262 -0.891783  0.559696
c  0.939469  0.721283  0.503097
d  0.406414  0.323461 -0.493411
e -0.792017 -1.035760 -1.279503
----------------------------------------------------------------------
-0.8917828081181468


In [62]:
# For other interpolation methods a numeric index is advisable, since several of them use the index values as positions
df.index = range(len(df))
# "pad" (forward fill) repeats the last valid value; recent Pandas versions deprecate it inside interpolate in favour of ffill()
print(df.two.ffill())

0    0.195865
1    0.195865
2    0.195865
3    0.323461
4    0.323461
Name: two, dtype: float64


In [63]:
print(df.two.interpolate(method="nearest"))

0    0.195865
1    0.195865
2    0.323461
3    0.323461
4         NaN
Name: two, dtype: float64


In [64]:
print("Interpolated values:" + str(df.two.interpolate().count()-df.two.count()))

Interpolated values:3


### Removing NA values

The `dropna` method removes the rows (or columns) that contain NA values.

In [31]:
print(df)
print("-"*35)
print(df.dropna())

# https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.dropna.html

        one       two     three
a  0.883893  0.195865  0.357537
b -2.343262       NaN  0.559696
c  0.939469       NaN  0.503097
d  0.406414  0.323461 -0.493411
e -0.792017       NaN -1.279503
-----------------------------------
        one       two     three
a  0.883893  0.195865  0.357537
d  0.406414  0.323461 -0.493411


In [32]:
# Alternatively, we can drop columns instead of rows by changing the axis (axis=0 or axis=1)
df.dropna(axis=1)

        one     three
a  0.883893  0.357537
b -2.343262  0.559696
c  0.939469  0.503097
d  0.406414 -0.493411
e -0.792017 -1.279503


In [36]:
# the axis argument appears in a large number of Pandas methods
print(df.mean()) # by default it is usually axis=0 (aggregate down the rows: one result per column)
print("-"*35)
print(df.mean(axis=1))

one     -0.181100
two      0.259663
three   -0.070517
dtype: float64
-----------------------------------
a    0.479098
b   -0.891783
c    0.721283
d    0.078822
e   -1.035760
dtype: float64
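The direction of `axis` is easy to mix up, so here is a small sketch on a hypothetical two-by-two dataframe whose results can be checked by hand. The string aliases `"index"` and `"columns"` are equivalent to 0 and 1.

```python
import pandas as pd

df_demo = pd.DataFrame({"one": [1.0, 2.0], "two": [3.0, 5.0]}, index=["a", "b"])

por_columna = df_demo.mean(axis=0)   # same as axis="index": one value per column
por_fila = df_demo.mean(axis=1)      # same as axis="columns": one value per row

print(por_columna)
print(por_fila)
```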


### Exercises

**1) From the file who.csv, count how many countries have at least one NaN value.**

**1b) Sort the previous result to identify the country with the largest number of unknown fields.**

**2) From who.csv, select the first, third and tenth columns for the rows between 100 and 150.**

**2b) How many NaN values are present?**

**2c) Create a new dataframe where the NaN values are zero.**

**2d) Drop the rows of the previous selection that contain NaN.**

## Time series
Time series are samples of values taken over time, usually at regular intervals. Examples include economic, demographic and meteorological data, or security and activity logs.

Pandas handles time series through the index, which holds dates (`datetime`):
https://docs.python.org/es/3/library/datetime.html


The index of a _dataframe_ is the basic means of accessing its values, so using a date index simplifies filtering, selection, interpolation, etc.

Link to the documentation: [TimeSeries](https://pandas.pydata.org/docs/user_guide/timeseries.html)

In [1]:
import pandas as pd
df = pd.read_csv("data/rdu-weather-history.csv",sep=";")  
# What does the file contain?
print(df.head())

,date,temperaturemin,temperaturemax,precipitation,snowfall,snowdepth,avgwindspeed,fastest2minwinddir,fastest2minwindspeed,fastest5secwinddir,...,drizzle,snow,freezingrain,smokehaze,thunder,highwind,hail,blowingsnow,dust,freezingfog
0,2015-04-08,62.1,84.0,0.0,0.0,0.0,5.82,40.0,29.97,30.0,...,No,No,No,Yes,No,No,No,No,No,No
1,2015-04-20,63.0,78.1,0.28,0.0,0.0,11.86,180.0,21.92,170.0,...,No,No,No,No,Yes,No,No,No,No,No
2,2015-04-26,45.0,54.0,0.02,0.0,0.0,5.82,50.0,12.97,40.0,...,No,No,No,No,No,No,No,No,No,No
3,2015-04-28,39.0,69.1,0.0,0.0,0.0,2.68,40.0,12.08,40.0,...,No,No,No,No,No,No,No,No,No,No
4,2015-05-03,46.9,79.0,0.0,0.0,0.0,2.68,200.0,12.08,210.0,...,No,No,No,No,No,No,No,No,No,No


In [4]:
print(df.date.sort_values())

2509    2007-01-01
1065    2007-01-02
1066    2007-01-03
1067    2007-01-04
3251    2007-01-05
           ...    
2507    2019-06-19
2508    2019-06-20
488     2019-06-21
489     2019-06-22
3623    2019-06-23
Name: date, Length: 4557, dtype: object


We will only cover the basics of this kind of data; the goal is to be able to answer questions such as:
- How could I get the mean temperature of a given year?
- How could I get the highest temperature across all the months of July?

First, the index has to be converted into dates:

In [11]:
from pandas import DatetimeIndex

df = pd.read_csv("data/rdu-weather-history.csv",sep=";")  
df.index = DatetimeIndex(df["date"])
df.head()



date,date,temperaturemin,temperaturemax,precipitation,snowfall,snowdepth,avgwindspeed,fastest2minwinddir,fastest2minwindspeed,fastest5secwinddir,...,drizzle,snow,freezingrain,smokehaze,thunder,highwind,hail,blowingsnow,dust,freezingfog
2015-04-08,2015-04-08,62.1,84.0,0.0,0.0,0.0,5.82,40.0,29.97,30.0,...,No,No,No,Yes,No,No,No,No,No,No
2015-04-20,2015-04-20,63.0,78.1,0.28,0.0,0.0,11.86,180.0,21.92,170.0,...,No,No,No,No,Yes,No,No,No,No,No
2015-04-26,2015-04-26,45.0,54.0,0.02,0.0,0.0,5.82,50.0,12.97,40.0,...,No,No,No,No,No,No,No,No,No,No
2015-04-28,2015-04-28,39.0,69.1,0.0,0.0,0.0,2.68,40.0,12.08,40.0,...,No,No,No,No,No,No,No,No,No,No
2015-05-03,2015-05-03,46.9,79.0,0.0,0.0,0.0,2.68,200.0,12.08,210.0,...,No,No,No,No,No,No,No,No,No,No


In [12]:
df = df.drop(columns="date")
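The three steps above (read, set the index, drop the column) can also be done while reading the file. The sketch below uses the `parse_dates` and `index_col` arguments of `pd.read_csv` on an in-memory text that mimics the layout of the weather file; only three columns and three rows are included, as an illustration.

```python
import io
import pandas as pd

# A few rows in the same layout as the weather file (semicolon-separated)
csv_texto = """date;temperaturemin;temperaturemax
2015-04-08;62.1;84.0
2015-04-20;63.0;78.1
2015-05-03;46.9;79.0
"""

# parse_dates converts the column to datetime; index_col makes it the index
df_demo = pd.read_csv(io.StringIO(csv_texto), sep=";",
                      parse_dates=["date"], index_col="date")

print(type(df_demo.index).__name__)
print(df_demo.index.month.tolist())
```

With a real file, the first argument would be the path (for example the `data/rdu-weather-history.csv` file used above) instead of the `io.StringIO` object.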

In [13]:
df.index.year

Int64Index([2015, 2015, 2015, 2015, 2015, 2015, 2015, 2015, 2015, 2015,
            ...
            2011, 2011, 2011, 2011, 2019, 2019, 2019, 2019, 2019, 2019],
           dtype='int64', name='date', length=4557)

In [14]:
df.loc["2014"]

date,temperaturemin,temperaturemax,precipitation,snowfall,snowdepth,avgwindspeed,fastest2minwinddir,fastest2minwindspeed,fastest5secwinddir,fastest5secwindspeed,...,drizzle,snow,freezingrain,smokehaze,thunder,highwind,hail,blowingsnow,dust,freezingfog
2014-01-02,37.0,48.9,0.33,0.0,0.0,2.68,310.0,12.97,320.0,21.92,...,No,No,No,No,No,No,No,No,No,No
2014-01-11,45.0,69.1,0.28,0.0,0.0,10.29,230.0,59.95,220.0,86.12,...,No,No,No,No,No,No,No,No,No,No
2014-01-13,33.1,62.1,0.00,0.0,0.0,7.16,220.0,21.03,230.0,25.95,...,No,No,No,No,No,No,No,No,No,No
2014-01-15,33.1,59.0,0.00,0.0,0.0,4.47,180.0,17.00,210.0,23.94,...,No,No,No,No,No,No,No,No,No,No
2014-01-18,27.1,46.9,0.00,0.0,0.0,6.71,270.0,17.00,240.0,23.94,...,No,No,No,No,No,No,No,No,No,No
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2014-11-30,36.0,66.0,0.00,0.0,0.0,8.72,230.0,23.04,240.0,31.09,...,No,No,No,No,No,No,No,No,No,No
2014-12-02,46.0,55.0,0.00,0.0,0.0,6.49,50.0,18.12,50.0,25.05,...,No,No,No,No,No,No,No,No,No,No
2014-12-10,31.1,50.0,0.00,0.0,0.0,5.37,300.0,14.99,310.0,23.94,...,No,No,No,No,No,No,No,No,No,No
2014-12-23,37.0,46.9,0.90,0.0,0.0,2.46,90.0,8.95,100.0,12.97,...,No,No,No,No,Yes,No,No,No,No,No


In [15]:
df.loc["2014-01"]

date,temperaturemin,temperaturemax,precipitation,snowfall,snowdepth,avgwindspeed,fastest2minwinddir,fastest2minwindspeed,fastest5secwinddir,fastest5secwindspeed,...,drizzle,snow,freezingrain,smokehaze,thunder,highwind,hail,blowingsnow,dust,freezingfog
2014-01-02,37.0,48.9,0.33,0.0,0.0,2.68,310.0,12.97,320.0,21.92,...,No,No,No,No,No,No,No,No,No,No
2014-01-11,45.0,69.1,0.28,0.0,0.0,10.29,230.0,59.95,220.0,86.12,...,No,No,No,No,No,No,No,No,No,No
2014-01-13,33.1,62.1,0.0,0.0,0.0,7.16,220.0,21.03,230.0,25.95,...,No,No,No,No,No,No,No,No,No,No
2014-01-15,33.1,59.0,0.0,0.0,0.0,4.47,180.0,17.0,210.0,23.94,...,No,No,No,No,No,No,No,No,No,No
2014-01-18,27.1,46.9,0.0,0.0,0.0,6.71,270.0,17.0,240.0,23.94,...,No,No,No,No,No,No,No,No,No,No
2014-01-19,30.2,54.0,0.0,0.0,0.0,8.5,240.0,17.0,240.0,25.95,...,No,No,No,No,No,No,No,No,No,No
2014-01-27,33.1,64.9,0.0,0.0,0.0,9.62,230.0,18.12,30.0,23.94,...,No,No,No,No,No,No,No,No,No,No
2014-01-01,29.1,51.1,0.0,0.0,0.0,2.46,200.0,8.95,210.0,12.97,...,No,No,No,No,No,No,No,No,No,No
2014-01-07,9.1,25.2,0.0,0.0,0.0,7.61,300.0,16.11,320.0,25.05,...,No,No,No,No,No,No,No,No,No,No
2014-01-08,15.3,43.0,0.0,0.0,0.0,2.91,220.0,10.07,210.0,12.97,...,No,No,No,No,No,No,No,No,No,No


In [17]:
# Aggregations
df.loc["2015"].temperaturemin.mean()

51.70246575342466
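Selecting one year at a time with `.loc` works, but the same aggregation can be computed for every year at once by grouping on `index.year`. This is a minimal sketch with hypothetical readings (the dates and temperatures are made up for illustration):

```python
import pandas as pd

# Hypothetical readings, deliberately unsorted like the weather file
fechas = pd.DatetimeIndex(["2015-04-08", "2016-01-10", "2015-05-03", "2016-07-01"])
df_demo = pd.DataFrame({"temperaturemin": [62.0, 30.0, 46.0, 70.0]}, index=fechas)

# Group by the year extracted from the index and average each group
media_anual = df_demo.groupby(df_demo.index.year)["temperaturemin"].mean()

print(media_anual)
```

The result is a series indexed by year, so `media_anual.loc[2015]` plays the role of the single-year expression above.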

In [16]:
# Comparisons
df.loc["2015"].temperaturemin.min() > df.loc["2016"].temperaturemin.min()

False

In [19]:
# Slicing (recent Pandas versions may require a sorted index for this, e.g. df.sort_index())
df.loc["2015":"2019"] 

date,temperaturemin,temperaturemax,precipitation,snowfall,snowdepth,avgwindspeed,fastest2minwinddir,fastest2minwindspeed,fastest5secwinddir,fastest5secwindspeed,...,drizzle,snow,freezingrain,smokehaze,thunder,highwind,hail,blowingsnow,dust,freezingfog
2015-04-08,62.1,84.0,0.00,0.0,0.0,5.82,40.0,29.97,30.0,38.03,...,No,No,No,Yes,No,No,No,No,No,No
2015-04-20,63.0,78.1,0.28,0.0,0.0,11.86,180.0,21.92,170.0,29.08,...,No,No,No,No,Yes,No,No,No,No,No
2015-04-26,45.0,54.0,0.02,0.0,0.0,5.82,50.0,12.97,40.0,16.11,...,No,No,No,No,No,No,No,No,No,No
2015-04-28,39.0,69.1,0.00,0.0,0.0,2.68,40.0,12.08,40.0,17.00,...,No,No,No,No,No,No,No,No,No,No
2015-05-03,46.9,79.0,0.00,0.0,0.0,2.68,200.0,12.08,210.0,14.99,...,No,No,No,No,No,No,No,No,No,No
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2019-05-24,68.0,93.0,0.00,0.0,0.0,5.82,280.0,14.09,330.0,21.03,...,No,No,No,Yes,Yes,No,No,No,No,No
2019-05-26,72.0,93.9,0.05,0.0,0.0,5.82,240.0,17.00,240.0,23.04,...,No,No,No,No,Yes,No,No,No,No,No
2019-06-08,71.1,79.0,0.32,0.0,0.0,9.40,100.0,18.12,80.0,23.94,...,No,No,No,No,Yes,No,No,No,No,No
2019-05-06,60.1,77.0,0.00,0.0,0.0,5.82,40.0,14.09,50.0,18.12,...,No,No,No,No,No,No,No,No,No,No


## Activities

A. How many times has it snowed per year (`snowfall`)?

B. In which year was the most snow recorded (`snowdepth`)?

C. Create a dataframe containing the maximum July temperature for each year.

D. Build a grouping containing the maximum and minimum temperatures of each month of each year.

# Pivoting tables

Pivoting a table means rearranging columns into rows or rows into columns. The result is the data laid out *transposed* with respect to the original layout.

Link to the documentation:
- https://pandas.pydata.org/docs/reference/api/pandas.pivot_table.html
- https://pandas.pydata.org/docs/reference/api/pandas.pivot.html

In [41]:
import pandas as pd
import numpy as np

samples=5
df= pd.DataFrame(
    {
        "Municipio":np.repeat(["muni%i"%i for i in range(samples)],3) ,
        "Categoria"   :["Inscritos","Censo","Población"]*(samples),
        "Values"   : np.random.randint(1,10,samples*3)
    })
    
    
print(df)

   Municipio  Categoria  Values
0      muni0  Inscritos       5
1      muni0      Censo       8
2      muni0  Población       5
3      muni1  Inscritos       9
4      muni1      Censo       1
5      muni1  Población       5
6      muni2  Inscritos       2
7      muni2      Censo       8
8      muni2  Población       7
9      muni3  Inscritos       4
10     muni3      Censo       7
11     muni3  Población       9
12     muni4  Inscritos       4
13     muni4      Censo       5
14     muni4  Población       4


In [50]:
# index: column, Grouper, array, or list of the previous
# Keys to group by on the pivot table index. If a list is passed, it can contain any of the other types (except list).
# If an array is passed, it must be the same length as the data and will be used in the same manner as column values.
# values is set explicitly so that the text column Municipio is not passed to the default aggregation (mean)
pd.pivot_table(df, index=['Categoria'], values=['Values'])

           Values
Categoria
Censo         5.8
Inscritos     4.8
Población     6.0


In [52]:
# columns: column, Grouper, array, or list of the previous
# Keys to group by on the pivot table column. If a list is passed, it can contain any of the other types (except list).
# If an array is passed, it must be the same length as the data and will be used in the same manner as column values.
# values is set explicitly so that the text column Municipio is not passed to the default aggregation (mean)
pd.pivot_table(df, columns=['Categoria'], values=['Values'])

Categoria  Censo  Inscritos  Población
Values       5.8        4.8        6.0


In [56]:
pd.pivot_table(df, index=['Municipio'], columns=["Categoria"], values=["Values"])

           Values
Categoria   Censo  Inscritos  Población
Municipio
muni0           8          5          5
muni1           1          9          5
muni2           8          2          7
muni3           7          4          9
muni4           5          4          4
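In the table above each (Municipio, Categoria) pair occurs exactly once, so the default mean leaves the values unchanged. For that situation `DataFrame.pivot` performs the same reshaping without any aggregation. A minimal sketch on a hypothetical subset of the data:

```python
import pandas as pd

# Hypothetical subset: two municipalities, two categories
df_demo = pd.DataFrame({
    "Municipio": ["muni0", "muni0", "muni1", "muni1"],
    "Categoria": ["Censo", "Inscritos", "Censo", "Inscritos"],
    "Values": [8, 5, 1, 9],
})

# pivot only reshapes; it raises an error if an index/column pair is repeated
ancho = df_demo.pivot(index="Municipio", columns="Categoria", values="Values")

print(ancho)
```

When pairs can repeat, `pivot_table` is the appropriate choice because it aggregates the duplicates.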


In [57]:
# aggfunc: function, list of functions, dict, default "mean"
# If a list of functions is passed, the resulting pivot table will have hierarchical columns whose top level are the function names
#  (inferred from the function objects themselves).
# The function is given by name ("sum") and values is set explicitly so that only the numeric column is aggregated
pd.pivot_table(df, index=['Categoria'], values=['Values'], aggfunc="sum")


           Values
Categoria
Censo          29
Inscritos      24
Población      30
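The comment on `aggfunc` mentions that a list of functions produces hierarchical columns. The sketch below illustrates this on hypothetical data, passing the function names as strings:

```python
import pandas as pd

# Hypothetical data: two samples per category
df_demo = pd.DataFrame({
    "Categoria": ["Censo", "Censo", "Inscritos", "Inscritos"],
    "Values": [8, 1, 5, 9],
})

# Each function name becomes a top-level column above "Values"
resumen = pd.pivot_table(df_demo, index="Categoria", values="Values",
                         aggfunc=["sum", "mean"])

print(resumen)
```

Individual results are then addressed with a two-level key, for example `resumen[("sum", "Values")]`.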
