<div class="licence">
<span>Licence CC BY-NC-ND</span>
<span>Valérie Roy</span>
<span><img src="media/ensmp-25-alpha.png" /></span>
</div>

# *pandas.DataFrame*


   - **two-dimensional** array
   - where **rows** and **columns** are indexed
   - **missing** values are replaced by *numpy.nan*
   - **scalar** values are broadcast

   - can be built **several ways**, or most usually **read** from files (csv, json, …)
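
as a minimal sketch of the **read from files** case: the CSV content below is made up for illustration and inlined with *io.StringIO*, so that no file on disk is assumed; *pandas.read_csv* accepts a path as well as a file-like object

```python
import io
import pandas as pd

# hypothetical CSV content, inlined so the example needs no file on disk
csv_text = """name,distance
Mercury,0.387
Venus,0.723
Earth,1.0
"""

# index_col tells pandas which column holds the row labels
small = pd.read_csv(io.StringIO(csv_text), index_col='name')
print(small)
```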

In [None]:
import pandas as pd

## creating *pandas.DataFrame* from *pandas.Series*

   - we create three **series of data**
      - **distance**, **lowest_temp** and **highest_temp** related to the solar system
   - series are **indexed by** the **names** of the planets
   - some values are **missing**
      - the **lowest** and the **highest** temperature of **neptune**, **saturn** and **uranus**
   - all planets are from the **solar system**
   
   - the **indexes** need not be identical: pandas **aligns** the series on their labels

In [None]:
# distance to the Sun, relative to Earth's (i.e. in astronomical units)
distance = pd.Series([0.387, 0.723, 30, 1., 5.203, 1.523, 9.6, 19.19],
                     index=['Mercury', 'Venus', 'Neptune', 'Earth', 'Jupiter', 'Mars', 'Saturn', 'Uranus'])

lowest_temp = pd.Series([-200.0, 446.0,  -90.0, -125.0, -140.0],
                        index=['Mercury', 'Venus', 'Earth', 'Jupiter', 'Mars'])

highest_temp = pd.Series([430.0, 490.0, 60.0, 17.0, 20.0],
                         index=['Mercury', 'Venus', 'Earth', 'Jupiter', 'Mars'])


we **group** the series using a **python dict**
   - the **names** of the columns are the **keys** of the **dict**
   - the series themselves are the **values**

In [None]:
planets = pd.DataFrame({'distance': distance,
                        'lowest temperature': lowest_temp, 
                        'highest temperature': highest_temp, 
                        'origin': 'solar system'})

   - note that we give a single scalar value for the *'origin'* column
   - the **single** value is **broadcast** to the **entire column** (here *'solar system'*)
   - **missing** values are **replaced by** *numpy.nan* (lowest/highest temperature of Neptune, Saturn and Uranus)
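
to make these three behaviours (label alignment, broadcasting, NaN filling) easy to check, here is a small sketch on a reduced dataset; the name *demo* and the three-planet subset are chosen for illustration only

```python
import pandas as pd

# two series whose indexes differ: Neptune has no temperature
distance = pd.Series([1.0, 30.0, 5.203],
                     index=['Earth', 'Neptune', 'Jupiter'])
lowest = pd.Series([-90.0, -125.0],
                   index=['Earth', 'Jupiter'])

demo = pd.DataFrame({'distance': distance,
                     'lowest temperature': lowest,
                     'origin': 'solar system'})

# isna() flags the cells that were filled with NaN
print(demo.isna())
```

the row index is the union of the labels of both series, and the only *True* cell reported by *isna()* is the lowest temperature of Neptune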

## overview methods

### a glimpse with *head()*

In [None]:
# head() provides a nice way to have 
# a glimpse at the data
# there's a tail() method as well of course

planets.head()

### statistics with *describe()*

In [None]:
# another way to have an idea of the contents 
# is to see statistical info on the columns
planets.describe()
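
on a data frame that mixes numeric and text columns, *describe()* summarizes the numeric columns only by default; the sketch below (on a small frame named *demo* for illustration) shows how *include='all'* also reports on a text column such as *'origin'*

```python
import pandas as pd

demo = pd.DataFrame({'distance': [1.0, 5.203, 1.523],
                     'origin': 'solar system'},
                    index=['Earth', 'Jupiter', 'Mars'])

# default: numeric columns only
numeric_only = demo.describe()

# include='all' adds rows such as unique / top / freq for text columns
everything = demo.describe(include='all')
print(everything)
```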

### naming

In [None]:
# we can give a name to the index

planets.index.name = 'planets names'

In [None]:
planets.head()

### information on *pandas.DataFrame*

In [None]:
# the index holds the row labels
planets.index

In [None]:
# its type is Index (not Series)
type(planets.index)

In [None]:
# column names
planets.columns

In [None]:
# you can transpose a *pandas.DataFrame*
# like a numpy array
    
# in this view columns and rows are swapped
planets.T 

In [None]:
# miscellaneous information on missing data,
# types, memory usage, ...

planets.info()

## accessing elements 

### accessing columns

In [None]:
# to retrieve columns by name
# we use the [] operator
# NOTE: this returns the column as a Series;
#  historically it refers to the DataFrame's data and is NOT a copy
#  (with copy-on-write enabled it behaves like a copy)

planets['distance'] 

In [None]:
# we can also use the attribute operator
# when possible 
# this WON'T WORK for example
# with 'lowest temperature'
# because of the space in its name

planets.distance

In [None]:
# we can extract several columns at once 
# and get another DataFrame object

planets[['distance', 'lowest temperature', 'highest temperature']]

### indexing using **labels** with *pandas.DataFrame.loc*

the classical way
   - **standard** (python and numpy) **indexing operators** **[]** and attribute operator **.**
   - are **available** and **intuitive**

   
However
   - using **standard operators** has  **optimization** limits
   - for **production code** use the **optimized pandas data access methods** 
   
   
http://pandas.pydata.org/pandas-docs/stable/indexing.html

**several forms for the `loc` indexing scheme**
   - *df.loc[row_label]*
   - *df.loc[row_label, column_label]*

*row_label* and *column_label* can be:
   - **a single label**
   - **list of labels**
   - **slices** with labels
   - **masks** (**Boolean array**)

In [None]:
# loc[] returns a single value
# when denoting a single cell

planets.loc['Earth', 'distance']

In [None]:
# this denotes a row so it's a Series

type(planets.loc['Earth'])

In [None]:
# as we can see here

planets.loc['Earth']

In [None]:
# because we use a list for the row
# we receive a Series too

type(planets.loc[['Earth'], 'distance'])

In [None]:
# this is a Series

planets.loc[['Earth'], 'distance']

In [None]:
# ditto

type(planets.loc['Earth', ['distance']])

In [None]:

planets.loc['Earth', ['distance']]

when *row_label* is a **list of labels** (even with a single element) and *column_label* is a list or is omitted
   - it returns a *pandas.DataFrame*

In [None]:
type(planets.loc[['Earth']])

In [None]:
planets.loc[['Earth']]

In [None]:
planets.loc[['Earth', 'Mars']]

### slicing

- rows from 'Earth' to 'Mars', both **included**
- all columns *':'*

In [None]:
planets.loc['Earth':'Mars', :]

   - every other row: `::2`
   - columns from *distance* to *highest temperature* **included**

In [None]:
# every other row (positions 0, 2, 4, ...), 3 columns only

planets.loc[::2, 'distance':'highest temperature']

   - *planets* **farther than** earth from the sun

In [None]:
planets.loc[planets.loc[:, 'distance'] > 1]
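
masks can also be combined with *&* (and), *|* (or) and *~* (not); this is a sketch on a reduced frame named *demo*, and the threshold of -130 is an arbitrary value picked for the example

```python
import pandas as pd

demo = pd.DataFrame({'distance': [1.0, 5.203, 1.523, 0.387],
                     'lowest temperature': [-90.0, -125.0, -140.0, -200.0]},
                    index=['Earth', 'Jupiter', 'Mars', 'Mercury'])

# each comparison produces a boolean Series aligned on the row labels
far = demo['distance'] > 1
cold = demo['lowest temperature'] < -130

# rows where both conditions hold, restricted to one column
far_and_cold = demo.loc[far & cold, ['lowest temperature']]
print(far_and_cold)
```

when the comparisons are written inline, each one needs its own parentheses, as in *(demo['distance'] > 1) & (demo['lowest temperature'] < -130)*, because *&* binds tighter than *>* and *<*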

### indexing using a **position** with  *pandas.DataFrame.iloc*

#### accessing elements using *pandas.DataFrame.iloc*
   - *df.iloc[row_id]*
   - *df.iloc[row_id, column_id]*
   

*row_id* and *column_id* can be:
   - **integer**
   - **list of integers**
   - **slices**
   - **masks** (**Boolean array**)  

In [None]:
planets_1 = pd.DataFrame([[1.000, -90.0, 60.0],
                          [5.203, -125.0, 17.0],
                          [1.523, -140.0, 20.0],
                          [0.387, -200.0, 430.0]],
                         
                         index= ['Earth', 'Jupiter', 'Mars', 'Mercury'],
                         
                         columns=['distance', 'lowest temperature', 'highest temperature'])

In [None]:
# the first row

# returns a Series

planets.iloc[0] 

In [None]:
# the first and the third rows
    
# returns a DataFrame

planets.iloc[[0, 2]] 

In [None]:
# [1, 3] for the second and the fourth columns
# [0, 2] for the first and the third rows

# returns a DataFrame

planets.iloc[[0, 2], [1, 3]]

In [None]:
# first row, second column (as a float)

# returns a scalar  

planets.iloc[0, 1]

In [None]:
# same but because we use lists 
# we receive a dataframe instead of a scalar

planets.iloc[[0], [1]] # pandas.DataFrame

   - **rows** from **position** 0 to **position** 2 **excluded** (*python slicing rules*)
   - **columns** from **position** 1 to position 3 **excluded** (*python slicing rules*)

In [None]:
planets.iloc[0:2, 1:3] # pandas.DataFrame

   - all rows *':'*
   - columns from 1 to 3 excluded

In [None]:
planets.iloc[:, 1:3]

   - all columns *':'*
   - rows from 0 to 3 excluded

In [None]:
planets.iloc[0:3, :]
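
the main trap when moving between *loc* and *iloc* is the end of a slice: it is **included** with labels and **excluded** with positions; a short sketch on a four-row frame (named *demo* for illustration) makes the difference visible

```python
import pandas as pd

demo = pd.DataFrame({'distance': [1.0, 5.203, 1.523, 0.387]},
                    index=['Earth', 'Jupiter', 'Mars', 'Mercury'])

# label slice: 'Mars' is included, so 3 rows
by_label = demo.loc['Earth':'Mars']

# position slice: position 2 is excluded, so 2 rows
by_position = demo.iloc[0:2]

# to get the same 3 rows by position, the stop must be 3
same_rows = demo.iloc[0:3]
print(len(by_label), len(by_position), len(same_rows))
```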

### rows and columns are indexed

in a **data frame**
   - the **rows** and the **columns** are **indexed**
   - the type is *pandas.Index* (for short)

In [None]:
type(planets_1.index), type(planets_1.columns)

In [None]:
# this class is also exposed directly
# in the toplevel pandas namespace

pd.Index

   - you can create an object **Index**
   - and pass it to the data frame **constructor**

In [None]:
index_rows = pd.Index(['Earth', 'Jupiter', 'Mars','Mercury',
                       'Neptune', 'Saturn', 'Uranus', 'Venus'])

In [None]:
index_cols = pd.Index(['distance', 'lowest temperature',
                     'highest temperature'])

In [None]:
planets_3 = pd.DataFrame(
     [[1.000, -90.0, 60.0],
      [5.203, -125.0, 17.0],
      [1.523, -140.0, 20.0],
      [0.387, -200.0, 430.0],
      [30.0],
      [9.600],
      [ 19.190],
      [ 0.723, 446.0, 490.0]],                         
     index = index_rows,
     columns = index_cols)

In [None]:
planets_3

##  applying vectorized functions to *pandas.DataFrame*

   - *pandas.DataFrame* columns are stored in *numpy.ndarray*
   - numpy **ufuncs** can be **applied** to *pandas.Series* and *pandas.DataFrame*
   - **rows** and **columns** labels are preserved

In [None]:
import numpy as np

df = pd.DataFrame(np.linspace(0, 2*np.pi, 100), columns=['angle'])

In [None]:
df.head()

In [None]:
df['sinus'] = np.sin(df['angle'])
df.head(3)

In [None]:
df['cosinus'] = np.cos(df['angle'])
df.head(3)

we can combine series like numpy arrays;  
here we check on the first 3 rows that

$$
\sin^2 x + \cos^2 x = 1
$$

In [None]:
( np.power(df['sinus'], 2) + np.power(df['cosinus'], 2) )[0:3]
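
to check the identity on **all** rows rather than on the first 3, one option is *numpy.allclose*, which compares with a floating-point tolerance; the sketch below rebuilds the same columns in a frame named *angles* so that it runs on its own

```python
import numpy as np
import pandas as pd

angles = pd.DataFrame({'angle': np.linspace(0, 2*np.pi, 100)})
angles['sinus'] = np.sin(angles['angle'])
angles['cosinus'] = np.cos(angles['angle'])

# element-wise arithmetic on Series, labels are preserved
identity = angles['sinus']**2 + angles['cosinus']**2

# an exact == 1 comparison could fail because of rounding,
# allclose tolerates tiny floating-point errors
all_ones = np.allclose(identity, 1.0)
print(all_ones)
```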

### plotting

In [None]:
from matplotlib import pyplot as plt
%matplotlib inline

In [None]:
df[['sinus', 'cosinus']].plot();

***

**exercise**

for all planets farther from the sun than the earth (inclusive), compute a dataframe
with their names and minimal temperatures.

****
the rest of this notebook contains optional complements

## miscellaneous other features

### creating *pandas.DataFrame* by specifying parameters *data*, *columns* and *index*

In [None]:
planets_1 = pd.DataFrame(
     [[1.000, -90.0, 60.0],
      [5.203, -125.0, 17.0],
      [1.523, -140.0, 20.0],
      [0.387, -200.0, 430.0],
      [30.0],
      [9.600],
      [ 19.190],
      [ 0.723, 446.0, 490.0]],
     index=['Earth', 'Jupiter', 'Mars', 'Mercury',
            'Neptune', 'Saturn', 'Uranus', 'Venus'],
     columns=['distance', 'lowest temperature', 'highest temperature'])

In [None]:
planets_1.head(3)

### sorting *pandas.DataFrame* according to **columns**

In [None]:
df = pd.DataFrame({ 'col1':  [19, 3, 26, 46, 4, 19],
                    'col2': ['h', 'w', 'y', 'd', 'm', 'w'],
                    'col3':  [8.45, 19.23, 89.56, 17.5, 54.76, 89.56]})

In [None]:
df.sort_values(by='col1', ascending=False)

   - **first** *col1* is **sorted**
   - then, for **identical values**, *col2* is sorted 

In [None]:
df.sort_values(by=['col1', 'col2'], ascending=False)
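
the *ascending* parameter also accepts one boolean per column, which lets you sort *col1* in decreasing order and break ties on *col2* in increasing order; a sketch on the same data (the frame is rebuilt under the name *demo* so the example runs on its own)

```python
import pandas as pd

demo = pd.DataFrame({'col1': [19, 3, 26, 46, 4, 19],
                     'col2': ['h', 'w', 'y', 'd', 'm', 'w'],
                     'col3': [8.45, 19.23, 89.56, 17.5, 54.76, 89.56]})

# col1 decreasing, then col2 increasing among equal col1 values
mixed = demo.sort_values(by=['col1', 'col2'], ascending=[False, True])
print(mixed)
```

the two rows where *col1* is 19 now come out as *'h'* then *'w'*, whereas with *ascending=False* for both columns they come out as *'w'* then *'h'*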

   - you can sort only a few elements (*pandas.DataFrame.nlargest()*, *pandas.DataFrame.nsmallest()*)
   - (*it might be faster on large datasets*)

In [None]:
df.nlargest(2, 'col3')

In [None]:
df.nsmallest(3, 'col1')
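
note that *col3* contains the value 89.56 twice; by default *nlargest* returns exactly the number of rows asked for, and its *keep* parameter controls what happens with such ties; a small sketch with the same data, rebuilt under the name *demo*

```python
import pandas as pd

demo = pd.DataFrame({'col1': [19, 3, 26, 46, 4, 19],
                     'col2': ['h', 'w', 'y', 'd', 'm', 'w'],
                     'col3': [8.45, 19.23, 89.56, 17.5, 54.76, 89.56]})

# default keep='first': exactly one row is returned
top_one = demo.nlargest(1, 'col3')

# keep='all': every row tied for the largest value is returned
top_with_ties = demo.nlargest(1, 'col3', keep='all')
print(len(top_one), len(top_with_ties))
```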

### changing the *pandas.DataFrame* *index*

   - *pandas.DataFrame.set_index(new_column)*
   - *pandas.DataFrame.reset_index()*
   - direct assignment

In [None]:
planets = pd.DataFrame([[1.000, -90.0, 60.0],
                          [5.203, -125.0, 17.0],],                         
                         index= ['Earth', 'Jupiter'],                         
                         columns=['distance', 'lowest temperature', 'highest temperature'])

   - with *pandas.DataFrame.set_index* you **index** by **another** column (a new data frame is returned, *planets* itself is unchanged)

In [None]:
planets.set_index('distance')

   - with *pandas.DataFrame.reset_index* the **index** becomes a **normal** *pandas.DataFrame* column

In [None]:
planets.reset_index()

   - with direct assignment you create a new index

In [None]:
planets

In [None]:
planets.index = ['la terre', 'jupiter']
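
the three approaches differ in what they modify: *set_index* and *reset_index* return a **new** data frame, while direct assignment changes the data frame **in place**; a sketch on a two-row frame (named *demo* for illustration) that checks each case

```python
import pandas as pd

demo = pd.DataFrame([[1.000, -90.0, 60.0],
                     [5.203, -125.0, 17.0]],
                    index=['Earth', 'Jupiter'],
                    columns=['distance', 'lowest temperature', 'highest temperature'])

# returns a new frame indexed by distance; demo keeps its column
by_distance = demo.set_index('distance')

# returns a new frame; the unnamed index becomes a column called 'index'
flat = demo.reset_index()

# direct assignment modifies demo itself
demo.index = ['la terre', 'jupiter']
print(demo)
```

to keep the result of *set_index* or *reset_index*, assign it back, as in *demo = demo.set_index('distance')*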