# 102 - need to know functions and properties

When you work with pandas dataframes, it helps to know a handful of their functions and properties.

See the [documentation](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.html) for the full list of dataframe properties and functions. 

# 0 - setup notebook

In [1]:
import warnings
warnings.filterwarnings('ignore')
import pandas as pd

# 1 - get some data

In [2]:
cars = pd.read_csv("./dat/mtcars.csv")

# 2 - must know functions and properties

## head() and tail()

To get a quick impression of what the data look like, you can use the functions head() and tail().   
- head() returns the first rows of a dataframe (5 by default)   
- tail() returns the last rows (5 by default)  

If you want a different number of rows, pass it as an argument, e.g. head(3) or tail(3).

In [30]:
cars.head()

,model,mpg,cyl,disp,hp,drat,wt,qsec,vs,am,gear,carb
0,Mazda RX4,21.0,6,160.0,110,3.9,2.62,16.46,0,1,4,4
1,Mazda RX4 Wag,21.0,6,160.0,110,3.9,2.875,17.02,0,1,4,4
2,Datsun 710,22.8,4,108.0,93,3.85,2.32,18.61,1,1,4,1
3,Hornet 4 Drive,21.4,6,258.0,110,3.08,3.215,19.44,1,0,3,1
4,Hornet Sportabout,18.7,8,360.0,175,3.15,3.44,17.02,0,0,3,2


In [5]:
cars.tail(2)

,model,mpg,cyl,disp,hp,drat,wt,qsec,vs,am,gear,carb
30,Maserati Bora,15.0,8,301.0,335,3.54,3.57,14.6,0,1,5,8
31,Volvo 142E,21.4,4,121.0,109,4.11,2.78,18.6,1,1,4,2
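
As a small sketch of how the row-count argument behaves, the example below uses a tiny made-up frame (`demo` is hypothetical, not the cars data). It also shows that a negative argument to head() returns everything except the last rows.

```python
import pandas as pd

# A small hypothetical frame, used only for illustration
demo = pd.DataFrame({"n": range(10)})

first_three = demo.head(3)        # rows 0, 1, 2
last_three = demo.tail(3)         # rows 7, 8, 9
all_but_last_two = demo.head(-2)  # rows 0 up to and including 7

print(first_three)
print(last_three)
print(len(all_but_last_two))
```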


## shape

The shape property holds the dimensions (the number of rows and the number of columns) of a dataframe.  
Because shape is a property and not a function, you access it without parentheses. It returns a tuple (nrows, ncols).

In [6]:
cars.shape

(32, 12)

In [7]:
#--- get and print the nrows and ncols separately ---
(nrows, ncols) = cars.shape
print(nrows)
print(ncols)

32
12
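
If you only need one of the two numbers, there are a few shortcuts. The sketch below uses a small hypothetical frame (`demo`) to compare shape with len() and the size property.

```python
import pandas as pd

# A small hypothetical frame: 3 rows, 2 columns
demo = pd.DataFrame({"a": [1, 2, 3], "b": [4.0, 5.0, 6.0]})

nrows, ncols = demo.shape   # unpack the tuple directly
print(nrows, ncols)

print(len(demo))            # number of rows only
print(demo.shape[1])        # number of columns only
print(demo.size)            # total number of cells (rows * columns)
```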


## info()

The info() function prints a summary of the dataframe (probably more than you want to know).

In [8]:
cars.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 32 entries, 0 to 31
Data columns (total 12 columns):
model    32 non-null object
mpg      32 non-null float64
cyl      32 non-null int64
disp     32 non-null float64
hp       32 non-null int64
drat     32 non-null float64
wt       32 non-null float64
qsec     32 non-null float64
vs       32 non-null int64
am       32 non-null int64
gear     32 non-null int64
carb     32 non-null int64
dtypes: float64(5), int64(6), object(1)
memory usage: 3.1+ KB


Here we find 
- the index type and its range of values
- the number of rows (the index entries) and the number of columns

For each column it gives
- the column name
- the number of non-null values
- the type of the column series

Note: number of null values = total rows - number of non-nulls
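
The null-count arithmetic can also be done directly. This is a minimal sketch on a hypothetical frame (`demo`) with one missing value. It uses count() for the non-null counts and isna().sum() for the null counts.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one missing value in 'mpg'
demo = pd.DataFrame({"mpg": [21.0, np.nan, 22.8], "cyl": [6, 6, 4]})

non_null = demo.count()      # non-null values per column
nulls = demo.isna().sum()    # null values per column

print(non_null)
print(nulls)

# null values = total rows - non-null values, for every column
print((nulls == len(demo) - non_null).all())
```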

## index

The property index holds the index (the "names of the rows") of the dataframe.

In [10]:
cars.index

RangeIndex(start=0, stop=32, step=1)
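
The index does not have to be a range of numbers. As a sketch, the example below builds a two-row hypothetical frame (`demo`) and uses set_index() to make the model names the row labels, so that rows can be looked up by name with .loc.

```python
import pandas as pd

# Hypothetical two-row frame, used only for illustration
demo = pd.DataFrame({"model": ["Mazda RX4", "Datsun 710"],
                     "mpg": [21.0, 22.8]})
print(demo.index)                  # default: RangeIndex(start=0, stop=2, step=1)

# set_index() returns a new frame that uses the 'model' column as index
by_model = demo.set_index("model")
print(by_model.index)

# rows can now be selected by label
print(by_model.loc["Datsun 710", "mpg"])
```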

## columns

The property columns holds the column index (the names of the columns) of the dataframe.

In [11]:
cars.columns

Index(['model', 'mpg', 'cyl', 'disp', 'hp', 'drat', 'wt', 'qsec', 'vs', 'am',
       'gear', 'carb'],
      dtype='object')
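
A few things you might do with the column index are sketched below on a hypothetical one-row frame (`demo`): turn it into a plain list, test whether a name is present, and rename a column with rename().

```python
import pandas as pd

# Hypothetical one-row frame, used only for illustration
demo = pd.DataFrame({"mpg": [21.0], "cyl": [6], "hp": [110]})

names = list(demo.columns)          # plain Python list of column names
print(names)

print("cyl" in demo.columns)        # membership test on the column index

# rename() returns a new frame; demo itself is left unchanged
renamed = demo.rename(columns={"hp": "horsepower"})
print(list(renamed.columns))
```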

# 3 - nice to know - how to create a dataframe from scratch

In this cookbook we will read dataframes from external files into Python.  
The function reading the file creates the dataframe, so we need not worry about creating dataframes.

So you do not strictly need to know how to create one yourself, but in case you are curious, here are a few ways to create a dataframe.  
(If you fully understood 101 Dataframe fundamentals, this should be easy to follow.)

## DataFrame()

The function [DataFrame()](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.html) can create a dataframe in a number of ways.

In [29]:
#--- create a dataframe from a dictionary of dictionaries ---
data = { 
        'col1':{'row1':'a', 'row2':'b', 'row3':'c'},
        'col2':{'row1':1  , 'row2':2  , 'row3':3  },
        'col3':{'row1':1.1, 'row2':2.2, 'row3':3.3}
        }

df1 = pd.DataFrame(data)
df1.head()

,col1,col2,col3
row1,a,1,1.1
row2,b,2,2.2
row3,c,3,3.3


Note that you did not have to specify the index and the column index.   
The information needed to construct these indexes is already in the dictionaries.   
The outer dictionary keys become the column names and the inner dictionary keys become the row labels.

In [25]:
#--- check the types of the series --------- 
df1.info()

<class 'pandas.core.frame.DataFrame'>
Index: 3 entries, row1 to row3
Data columns (total 3 columns):
col1    3 non-null object
col2    3 non-null int64
col3    3 non-null float64
dtypes: float64(1), int64(1), object(1)
memory usage: 96.0+ bytes


Note that pandas inferred the type of each series from the input.
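
If the inferred types are not what you want, you can inspect them with the dtypes property and convert a column with astype(). The sketch below rebuilds a similar frame from a dictionary of lists (an assumption for brevity, so it gets a default numeric index) under the hypothetical name `df`.

```python
import pandas as pd

# Same values as above, but built from a dictionary of lists
df = pd.DataFrame({
    "col1": ["a", "b", "c"],
    "col2": [1, 2, 3],
    "col3": [1.1, 2.2, 3.3],
})

print(df.dtypes)                    # one inferred type per column

# astype() returns a new frame with 'col2' converted to float
as_float = df.astype({"col2": "float64"})
print(as_float.dtypes)
```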

In [20]:
#--- create a dataframe from a list of lists ---------

data = [
        ['a'  ,'b'  ,'c'  ],
        ['aa' ,'bb' ,'cc' ],
        ['aaa','bbb','ccc']
       ]
df2 = pd.DataFrame(data)
df2.head()

,0,1,2
0,a,b,c
1,aa,bb,cc
2,aaa,bbb,ccc


Note that we did not specify the index, so pandas uses the default (start at 0 and simply number the rows).  
We did not specify the column index either, so pandas uses the same default there.   

Now we have rather awkward column names.  
We can override the defaults by using index=[] and/or columns=[]  

In [21]:
data = [
        ['a'  ,'b'  ,'c'  ],
        ['aa' ,'bb' ,'cc' ],
        ['aaa','bbb','ccc']
       ]
df2 = pd.DataFrame(data, index=['row1','row2','row3'], columns=['col1','col2','col3'])
df2.head()

,col1,col2,col3
row1,a,b,c
row2,aa,bb,cc
row3,aaa,bbb,ccc
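
Another input shape that DataFrame() accepts is a list of dictionaries, one dictionary per row. The sketch below is only an illustration (the name `df_records` and the values are made up). The dictionary keys become the column names, and index=[] supplies the row labels.

```python
import pandas as pd

# One dictionary per row; the keys become the column names
records = [
    {"col1": "a", "col2": 1},
    {"col1": "b", "col2": 2},
    {"col1": "c", "col2": 3},
]

df_records = pd.DataFrame(records, index=["row1", "row2", "row3"])
print(df_records)
```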


In [28]:
#--- create a dataframe using a list comprehension ---

from math import pi, sin
# one [x, sin(x*pi/8)] row for each x, rounded to 3 decimals
data = [[x, round(sin(x * pi / 8), 3)] for x in range(10)]
df3 = pd.DataFrame(data, columns=['x', 'sinx'])
df3.head(10)

,x,sinx
0,0,0.0
1,1,0.383
2,2,0.707
3,3,0.924
4,4,1.0
5,5,0.924
6,6,0.707
7,7,0.383
8,8,0.0
9,9,-0.383
