# 101 - dataframe fundamentals

In pandas, data is stored in a table called a [dataframe](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.html).  
This notebook introduces the basic concepts underlying the dataframe, i.e.:
- the index (a label for each table row, used to identify one observed case)
- the columns index (a label for each table column, used to identify one variable)
- the series data structure (which holds the values of a variable)

# 0 - setup notebook

In [1]:
import warnings
warnings.filterwarnings('ignore')
import pandas as pd

# 1 - get some data
([source and documentation](https://vincentarelbundock.github.io/Rdatasets/datasets.html))

In [2]:
cars = pd.read_csv("./dat/mtcars.csv")

Let's check that cars is indeed a dataframe.

In [3]:
type(cars)

pandas.core.frame.DataFrame

# 2 - the parts of a dataframe

First let's have a look at the contents of the first six rows of the dataframe.

In [4]:
cars.head(6)

,model,mpg,cyl,disp,hp,drat,wt,qsec,vs,am,gear,carb
0,Mazda RX4,21.0,6,160.0,110,3.9,2.62,16.46,0,1,4,4
1,Mazda RX4 Wag,21.0,6,160.0,110,3.9,2.875,17.02,0,1,4,4
2,Datsun 710,22.8,4,108.0,93,3.85,2.32,18.61,1,1,4,1
3,Hornet 4 Drive,21.4,6,258.0,110,3.08,3.215,19.44,1,0,3,1
4,Hornet Sportabout,18.7,8,360.0,175,3.15,3.44,17.02,0,0,3,2
5,Valiant,18.1,6,225.0,105,2.76,3.46,20.22,1,0,3,1


The figure below names the important parts of this dataframe.

![the parts of a dataframe](./art/dataframe_parts.png)

# 3 - the index (the identifier of a row, i.e. the observed case)

First let's have a look at the current index of the cars dataframe.  
This index is used to identify a row in our dataframe.   
Since each row is one observed case (i.e. one car), we can also say that the index identifies one observed case.

In [5]:
#--- show the current index ----------
cars.index

RangeIndex(start=0, stop=32, step=1)

So our index starts with 0 for the first row; 1 is the index of the second row, 2 of the third row, and so on, up to (but not including) 32.  
This is the default index that pandas creates for a dataframe (i.e. just number the rows, starting at 0).  

Using this index it is possible to show the contents of a row.  
To locate the row we must use the [loc](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.loc.html) (locate) indexer of the dataframe, which is written with square brackets.

In [6]:
#--- use loc[5] to show the contents of the dataframe row with index label 5 ----
#--- (note that this is the sixth row in the dataframe) --------------------------
cars.loc[5]

model    Valiant
mpg         18.1
cyl            6
disp         225
hp           105
drat        2.76
wt          3.46
qsec       20.22
vs             1
am             0
gear           3
carb           1
Name: 5, dtype: object
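With the default index, the label of a row and its position happen to coincide, which can hide the fact that loc works with labels. The sketch below is a minimal illustration on a small frame built from the first three rows shown above; the index labels 10, 20 and 30 are made up for the example. It compares loc with iloc, the position-based counterpart of loc.

```python
import pandas as pd

# A small frame with values taken from the first three rows of the cars dataframe.
# The index labels 10, 20, 30 are hypothetical, chosen so that label != position.
cars_small = pd.DataFrame(
    {
        "model": ["Mazda RX4", "Mazda RX4 Wag", "Datsun 710"],
        "mpg": [21.0, 21.0, 22.8],
    },
    index=[10, 20, 30],
)

by_label = cars_small.loc[20]      # the row whose index label is 20
by_position = cars_small.iloc[1]   # the second row, whatever its label is

print(by_label["model"])
print(by_position["model"])
```

Both lookups return the same row here, because the row labeled 20 is also the row at position 1.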

It is possible to override the default index.  
We can specify that one of the existing columns of the dataframe should be used as the index.  
Use the method [set_index](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.set_index.html) to do that.

For the cars dataframe it is natural to use the column model as the index  
(the name of the model identifies the observed case on that row very naturally)

In [7]:
#--- make the column model the index ---------------
cars2 = cars.set_index('model')
cars2.head(6)

model,mpg,cyl,disp,hp,drat,wt,qsec,vs,am,gear,carb
Mazda RX4,21.0,6,160.0,110,3.9,2.62,16.46,0,1,4,4
Mazda RX4 Wag,21.0,6,160.0,110,3.9,2.875,17.02,0,1,4,4
Datsun 710,22.8,4,108.0,93,3.85,2.32,18.61,1,1,4,1
Hornet 4 Drive,21.4,6,258.0,110,3.08,3.215,19.44,1,0,3,1
Hornet Sportabout,18.7,8,360.0,175,3.15,3.44,17.02,0,0,3,2
Valiant,18.1,6,225.0,105,2.76,3.46,20.22,1,0,3,1


In [8]:
#--- show the new index -------------
cars2.index

Index(['Mazda RX4', 'Mazda RX4 Wag', 'Datsun 710', 'Hornet 4 Drive',
       'Hornet Sportabout', 'Valiant', 'Duster 360', 'Merc 240D', 'Merc 230',
       'Merc 280', 'Merc 280C', 'Merc 450SE', 'Merc 450SL', 'Merc 450SLC',
       'Cadillac Fleetwood', 'Lincoln Continental', 'Chrysler Imperial',
       'Fiat 128', 'Honda Civic', 'Toyota Corolla', 'Toyota Corona',
       'Dodge Challenger', 'AMC Javelin', 'Camaro Z28', 'Pontiac Firebird',
       'Fiat X1-9', 'Porsche 914-2', 'Lotus Europa', 'Ford Pantera L',
       'Ferrari Dino', 'Maserati Bora', 'Volvo 142E'],
      dtype='object', name='model')

In [9]:
#--- show the content of the third row -----
cars2.loc['Datsun 710']

mpg      22.80
cyl       4.00
disp    108.00
hp       93.00
drat      3.85
wt        2.32
qsec     18.61
vs        1.00
am        1.00
gear      4.00
carb      1.00
Name: Datsun 710, dtype: float64

We can switch back to the default index by using the method reset_index().

In [10]:
cars3 = cars2.reset_index()
cars3.head(6)

,model,mpg,cyl,disp,hp,drat,wt,qsec,vs,am,gear,carb
0,Mazda RX4,21.0,6,160.0,110,3.9,2.62,16.46,0,1,4,4
1,Mazda RX4 Wag,21.0,6,160.0,110,3.9,2.875,17.02,0,1,4,4
2,Datsun 710,22.8,4,108.0,93,3.85,2.32,18.61,1,1,4,1
3,Hornet 4 Drive,21.4,6,258.0,110,3.08,3.215,19.44,1,0,3,1
4,Hornet Sportabout,18.7,8,360.0,175,3.15,3.44,17.02,0,0,3,2
5,Valiant,18.1,6,225.0,105,2.76,3.46,20.22,1,0,3,1
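A note on using a column as the index: pandas does not require index labels to be unique, so it is worth checking that the chosen column really identifies one row per label. The sketch below uses a small frame with a deliberately duplicated model name (hypothetical data, for illustration only) and shows two ways to check: the is_unique attribute and the verify_integrity parameter of set_index.

```python
import pandas as pd

# Hypothetical frame in which the model name 'Mazda RX4' occurs twice.
df = pd.DataFrame(
    {
        "model": ["Mazda RX4", "Datsun 710", "Mazda RX4"],
        "mpg": [21.0, 22.8, 21.0],
    }
)

# Check 1: is every value in the candidate column unique?
print(df["model"].is_unique)

# Check 2: let set_index refuse duplicate labels.
try:
    df.set_index("model", verify_integrity=True)
    raised = False
except ValueError:
    raised = True

print(raised)
```

Without verify_integrity=True, set_index accepts the duplicates, and a lookup such as loc['Mazda RX4'] would then return more than one row.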


# 4 - the column-names (the variable identifiers)

Let's have a look at the names of the current columns.

In [11]:
cars.columns

Index(['model', 'mpg', 'cyl', 'disp', 'hp', 'drat', 'wt', 'qsec', 'vs', 'am',
       'gear', 'carb'],
      dtype='object')

We can use these names to show the values in a column.

In [12]:
cars['mpg'].head(6)

0    21.0
1    21.0
2    22.8
3    21.4
4    18.7
5    18.1
Name: mpg, dtype: float64

Note that in this output the left column shows the index label of each row and the right column shows the mpg value.

The elements in cars.columns are numbered: the first column name has number 0, the second has number 1, etc.  
We can use these numbers to select one of the column names.

In [13]:
cars.columns[2] # -- select the name of the third column

'cyl'

We can rename the columns.  
Assigning a list to the columns attribute replaces all the names at once, so the list must contain exactly one name per column.

In [14]:
cars2 = cars.copy()
cars2.columns = ['a','b','c','d','e','f','g','h','i','j','k','l']
cars2.head()

,a,b,c,d,e,f,g,h,i,j,k,l
0,Mazda RX4,21.0,6,160.0,110,3.9,2.62,16.46,0,1,4,4
1,Mazda RX4 Wag,21.0,6,160.0,110,3.9,2.875,17.02,0,1,4,4
2,Datsun 710,22.8,4,108.0,93,3.85,2.32,18.61,1,1,4,1
3,Hornet 4 Drive,21.4,6,258.0,110,3.08,3.215,19.44,1,0,3,1
4,Hornet Sportabout,18.7,8,360.0,175,3.15,3.44,17.02,0,0,3,2


If you want to change one column name you could do it like this.

In [15]:
list_of_column_names = list(cars.columns) # get the existing column names and put them in a list
list_of_column_names[2] = 'cylinder'      # change one element of that list
cars2.columns = list_of_column_names      # use the changed list to set the new column names
cars2.head(6)

,model,mpg,cylinder,disp,hp,drat,wt,qsec,vs,am,gear,carb
0,Mazda RX4,21.0,6,160.0,110,3.9,2.62,16.46,0,1,4,4
1,Mazda RX4 Wag,21.0,6,160.0,110,3.9,2.875,17.02,0,1,4,4
2,Datsun 710,22.8,4,108.0,93,3.85,2.32,18.61,1,1,4,1
3,Hornet 4 Drive,21.4,6,258.0,110,3.08,3.215,19.44,1,0,3,1
4,Hornet Sportabout,18.7,8,360.0,175,3.15,3.44,17.02,0,0,3,2
5,Valiant,18.1,6,225.0,105,2.76,3.46,20.22,1,0,3,1
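An alternative to editing the list of names is the rename method of the dataframe, which takes a mapping from old names to new names and leaves all other columns untouched. The sketch below is a minimal illustration on a small frame built from values shown above; the variable names cars_small and renamed are made up for the example.

```python
import pandas as pd

# A small frame with values taken from the cars dataframe.
cars_small = pd.DataFrame(
    {
        "model": ["Mazda RX4", "Datsun 710"],
        "mpg": [21.0, 22.8],
        "cyl": [6, 4],
    }
)

# rename returns a new dataframe; columns not in the mapping keep their name.
renamed = cars_small.rename(columns={"cyl": "cylinder"})

print(list(renamed.columns))
print(list(cars_small.columns))   # the original dataframe is unchanged
```

Because rename returns a new dataframe, the result has to be assigned to a variable, just as with set_index.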


# 5 - the series (container for variable values)

The columns of a dataframe hold values. The values in one column are all of the same kind, i.e.:  
- they are all of the **same data type** (integer, float, boolean, string etc.), and
- they all have the **same semantics**/meaning  
(in one column we find only the number of cylinders, in another column we find only the amount of horsepower; we do not mix both in one column.)

Let's have a look at the columns mpg and model.

In [16]:
#--- show the type of the mpg column ---------
type(cars['mpg'])

pandas.core.series.Series

In [17]:
#--- show the first 5 values ----------
cars['mpg'].head()

0    21.0
1    21.0
2    22.8
3    21.4
4    18.7
Name: mpg, dtype: float64

So cars['mpg'] is indeed a series and the values in this series are floats.

Let's do the same for the model column.

In [18]:
print(type(cars['model']))
cars['model'].head()

<class 'pandas.core.series.Series'>


0            Mazda RX4
1        Mazda RX4 Wag
2           Datsun 710
3       Hornet 4 Drive
4    Hornet Sportabout
Name: model, dtype: object

So the column cars['model'] is a series too; its values have dtype object, which is how pandas traditionally stores Python strings.

Note that the series have an index (the same index as the dataframe)

In [19]:

cars['mpg'].index

RangeIndex(start=0, stop=32, step=1)
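The same holds when the dataframe has a non-default index. The sketch below is a minimal illustration on a small frame built from the first three rows of the cars data: a column selected from a dataframe indexed by model is a series indexed by model, so its values can be looked up by model name.

```python
import pandas as pd

# A small frame with values taken from the cars dataframe, indexed by model.
cars_small = pd.DataFrame(
    {
        "model": ["Mazda RX4", "Mazda RX4 Wag", "Datsun 710"],
        "mpg": [21.0, 21.0, 22.8],
    }
).set_index("model")

mpg = cars_small["mpg"]

print(mpg.index.equals(cars_small.index))   # same labels, same order
print(mpg.loc["Datsun 710"])                # look up one value by its label
```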

In [20]:
#--- assign a column to a variable -------
mpgs = cars['mpg'].copy()
print(type(mpgs))  #-- show the type of mpgs----------
mpgs.head()        #-- show the first 5 values-------

<class 'pandas.core.series.Series'>


0    21.0
1    21.0
2    22.8
3    21.4
4    18.7
Name: mpg, dtype: float64

In [21]:
#-- give the element with index 1 a new value, set it to 'aaa'
mpgs[1]= 'aaa'

In [22]:
#--- now check the type of the values in the mpgs series--
mpgs.head()

0      21
1     aaa
2    22.8
3    21.4
4    18.7
Name: mpg, dtype: object

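The element with index 1 is indeed set to the string 'aaa'.  
But note that the dtype of the series mpgs has changed from float64 to object: the series now holds a mix of floats and a string, so pandas falls back to the generic object dtype. Recent pandas versions warn about, and may reject, an assignment like this because it silently changes the dtype.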
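The effect can be reproduced without assigning into an existing series. The sketch below is a minimal illustration with made-up variable names: it builds a series from mixed values, shows that its dtype is object, and then uses pd.to_numeric with errors='coerce' as one possible way to get back to a float series, at the cost of turning the unparsable value into NaN.

```python
import pandas as pd

# A series that mixes floats and a string gets the generic object dtype.
mixed = pd.Series([21.0, "aaa", 22.8], name="mpg")
print(mixed.dtype)

# errors="coerce" replaces every value that cannot be parsed as a number by NaN.
numeric = pd.to_numeric(mixed, errors="coerce")
print(numeric.dtype)
print(numeric.isna().sum())   # number of values that were replaced by NaN
```

Whether coercing to NaN is acceptable depends on the data; it is shown here only to make the difference between the two dtypes visible.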

### Series are built on numpy arrays

Pandas uses [numpy](https://en.wikipedia.org/wiki/NumPy) arrays to implement series; in other words, pandas is built on top of numpy.  

[Numpy](http://www.numpy.org/) is the fundamental package for scientific computing with Python.  
Numpy is designed for efficient processing of vectors and matrices (often one to two orders of magnitude faster than equivalent loops in standard python; the exact gain depends on the operation and the size of the data).  
Pandas series inherit this efficiency from numpy, at least when you use [vectorization](https://en.wikipedia.org/wiki/Array_programming).  

Vectorization is a style of computer programming where operations are applied to whole arrays instead of individual elements.  
Vectorization in itself is a big subject. Here we will only mention it (it would take far too much space and time to do it full justice).
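To give a first impression, the sketch below applies the same calculation twice to the first five mpg values: once element by element in a Python loop, and once vectorized on the whole series. The conversion from miles to kilometres is just an example calculation, and the variable names are made up for the illustration.

```python
import pandas as pd

mpg = pd.Series([21.0, 21.0, 22.8, 21.4, 18.7], name="mpg")
KM_PER_MILE = 1.609344

# Element by element: Python visits every value in turn.
looped = pd.Series([value * KM_PER_MILE for value in mpg], name="mpg")

# Vectorized: one expression on the whole series, executed inside numpy.
vectorized = mpg * KM_PER_MILE

print(vectorized.equals(looped))   # both approaches give the same values
```

On five values the difference in speed is not noticeable; it only starts to matter on large series.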


# 6 - series, dataframes, panels and axes

Pandas was designed around three data structures to store data in.
- the [series](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.html), a 1D structure with one axis (i.e. axis 0, labeled by the index)
- the [dataframe](https://pandas.pydata.org/pandas-docs/stable/api.html#dataframe), a 2D structure with two axes (i.e. axis 0, labeled by the index, and axis 1, labeled by the column index)
- the [panel](https://pandas.pydata.org/pandas-docs/stable/api.html#panel), a 3D structure with three axes.

Panels were very seldom used; the panel was deprecated and has been removed from recent pandas versions, so we will ignore it here.  
Series and dataframes have already been explained above.

However, you should be aware of the terms axis 0 and axis 1, which refer to the row index and the column index respectively.
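Many dataframe methods take an axis parameter that uses these numbers. The sketch below is a minimal illustration on a small frame built from values of the cars data (the variable names are made up for the example): with axis=0 the calculation runs down the rows and gives one result per column, with axis=1 it runs across the columns and gives one result per row.

```python
import pandas as pd

# A small frame with values taken from the cars dataframe, indexed by model.
cars_small = pd.DataFrame(
    {"cyl": [6, 6, 4], "gear": [4, 4, 4], "carb": [4, 4, 1]},
    index=["Mazda RX4", "Mazda RX4 Wag", "Datsun 710"],
)

per_column = cars_small.max(axis=0)   # runs along axis 0: one result per column
per_row = cars_small.max(axis=1)      # runs along axis 1: one result per row

print(per_column)
print(per_row)

# The axes attribute lists both indexes: first the row index, then the column index.
print(cars_small.axes)
```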