# DataFrame

- `DataFrame` is a 2D labelled data structure with columns of potentially different data types
- When a `DataFrame` is created from several `Series`, the resulting **index** is the union of the indexes of those `Series`
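
The index union is easiest to see with two `Series` whose labels only partly overlap. The following is a minimal sketch; the names `claims`, `payment` and `frame` and all the values are made up for illustration and are not taken from the insurance dataset.

```python
import pandas as pd

# Two Series with partially overlapping labels (made-up values)
claims = pd.Series([10.0, 108.0, 11.0], index=["a", "b", "c"])
payment = pd.Series([392.5, 21.3, 23.5], index=["b", "c", "d"])

# Building a DataFrame from a dict of Series aligns them on their labels
frame = pd.DataFrame({"Claims": claims, "Payment": payment})

print(frame.index.tolist())  # ['a', 'b', 'c', 'd']
print(frame)
# Labels missing from one Series are filled with NaN:
# 'a' has no Payment and 'd' has no Claims
```

Because of this alignment, the resulting frame can have more rows than any single input `Series`.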

In [1]:
import numpy as np
import pandas as pd

In [2]:
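def load_data(dataset_name):
    """Read a CSV file from the Dataset folder at the root of ml-notebook."""
    import os, re
    # Replace everything from 'ml-notebook' onward in the working directory
    # with the path to the dataset. A function is used as the replacement so
    # that backslashes in Windows paths are not parsed as regex escapes.
    dataset_path = re.sub('ml-notebook.*$',
                          lambda m: os.path.join('ml-notebook', 'Dataset', dataset_name),
                          os.getcwd())
    return pd.read_csv(dataset_path)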

In [3]:
df = load_data('insurance.csv')

In [4]:
df.head()

,Index,Claims,Payment(in thousands of Swedish Kronor)
0,1,0.0,0.0
1,2,10.0,65.3
2,3,108.0,392.5
3,4,11.0,21.3
4,5,11.0,23.5


In [5]:
df.drop('Index', axis='columns') # This returns a new dataframe

,Claims,Payment(in thousands of Swedish Kronor)
0,0.0,0.0
1,10.0,65.3
2,108.0,392.5
3,11.0,21.3
4,11.0,23.5
...,...,...
58,8.0,55.6
59,8.0,76.1
60,9.0,48.7
61,9.0,52.1


In [6]:
df.drop('Index', axis='columns', inplace=True)
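
The difference between the two `drop` calls above can be checked on a small frame. This is a sketch under assumptions: `toy` and `trimmed` are hypothetical names and the values are made up.

```python
import pandas as pd

# Small made-up frame with the same kind of redundant 'Index' column
toy = pd.DataFrame({"Index": [1, 2, 3], "Claims": [0.0, 10.0, 108.0]})

# Without inplace=True, drop() returns a new DataFrame
trimmed = toy.drop("Index", axis="columns")
print(list(toy.columns))      # ['Index', 'Claims']  (original untouched)
print(list(trimmed.columns))  # ['Claims']

# Reassigning the result is an alternative to inplace=True
toy = toy.drop("Index", axis="columns")
print(list(toy.columns))      # ['Claims']
```

With `inplace=True` the call modifies the frame and returns `None`, so its result should not be assigned to a variable.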

In [7]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 63 entries, 0 to 62
Data columns (total 2 columns):
 #   Column                                   Non-Null Count  Dtype  
---  ------                                   --------------  -----  
 0   Claims                                   63 non-null     float64
 1   Payment(in thousands of Swedish Kronor)  63 non-null     float64
dtypes: float64(2)
memory usage: 1.1 KB


In [8]:
df.describe() # descriptive stats

,Claims,Payment(in thousands of Swedish Kronor)
count,63.0,63.0
mean,22.904762,98.187302
std,23.351946,87.327553
min,0.0,0.0
25%,7.5,38.85
50%,14.0,73.4
75%,29.0,140.0
max,124.0,422.2
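
To see where the rows of `describe()` come from, they can be compared with the corresponding `Series` methods. A minimal sketch on made-up values; `s` and `summary` are hypothetical names.

```python
import pandas as pd

s = pd.Series([1.0, 2.0, 3.0, 4.0])  # made-up values
summary = s.describe()

# 'std' is the sample standard deviation (ddof=1), the same as Series.std()
print(summary["std"], s.std())

# The percentile rows match Series.quantile with its default linear interpolation
print(summary["25%"], s.quantile(0.25))  # 1.75 1.75
print(summary["50%"], s.median())        # 2.5 2.5
```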


In [9]:
df.columns

Index(['Claims', 'Payment(in thousands of Swedish Kronor)'], dtype='object')

In [10]:
df.rename({df.columns[1]: 'Payment'}, axis='columns') # returns a new dataframe

,Claims,Payment
0,0.0,0.0
1,10.0,65.3
2,108.0,392.5
3,11.0,21.3
4,11.0,23.5
...,...,...
58,8.0,55.6
59,8.0,76.1
60,9.0,48.7
61,9.0,52.1


In [11]:
df.rename({df.columns[1]: 'Payment'}, axis='columns', inplace=True)

In [12]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 63 entries, 0 to 62
Data columns (total 2 columns):
 #   Column   Non-Null Count  Dtype  
---  ------   --------------  -----  
 0   Claims   63 non-null     float64
 1   Payment  63 non-null     float64
dtypes: float64(2)
memory usage: 1.1 KB


In [13]:
df.index

RangeIndex(start=0, stop=63, step=1)

# Selection

- Use the `.loc[]` (label-based) and `.iloc[]` (position-based) indexers to select particular rows and columns of a `DataFrame`
- A boolean mask can also be used to select only the rows that satisfy a condition

In [14]:
df.loc[0:5, 'Claims'] # Selects the rows labelled 0 through 5 from the 'Claims' column
# NOTE: unlike a regular Python slice, a `.loc` label slice includes both endpoints!

0      0.0
1     10.0
2    108.0
3     11.0
4     11.0
5     11.0
Name: Claims, dtype: float64

In [15]:
df.iloc[1:10, [1, 0]] # Selection by position: rows 1-9 (end excluded), columns in the order 1, 0

,Payment,Claims
1,65.3,10.0
2,392.5,108.0
3,21.3,11.0
4,23.5,11.0
5,57.2,11.0
6,58.1,12.0
7,422.2,124.0
8,15.7,13.0
9,31.9,13.0
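
With the default `RangeIndex`, labels and positions coincide, which hides the difference between `.loc[]` and `.iloc[]`. The sketch below uses a made-up frame (`toy` is a hypothetical name) whose index labels are 10, 20, 30, 40 to make the distinction visible.

```python
import pandas as pd

# Made-up frame whose index labels are not 0..n-1
toy = pd.DataFrame(
    {"Claims": [0.0, 10.0, 108.0, 11.0], "Payment": [0.0, 65.3, 392.5, 21.3]},
    index=[10, 20, 30, 40],
)

by_label = toy.loc[10:30, "Claims"]  # labels 10, 20, 30 (end label included)
by_position = toy.iloc[0:3, 0]       # positions 0, 1, 2 (end position excluded)

print(by_label.tolist())     # [0.0, 10.0, 108.0]
print(by_position.tolist())  # [0.0, 10.0, 108.0]

# The same numbers mean different things to each indexer
print(toy.iloc[1:3].index.tolist())  # [20, 30]
```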


In [16]:
df[df['Payment'] > 200] # boolean mask; analogous to a `WHERE` clause in SQL

,Claims,Payment
2,108.0,392.5
7,124.0,422.2
35,31.0,209.8
43,45.0,214.0
44,48.0,248.1
47,53.0,244.6
53,60.0,202.4
54,61.0,217.6
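
Several conditions can be combined into one mask, much like `AND` / `OR` in a SQL `WHERE` clause. A minimal sketch on a made-up frame; `toy`, `mask` and `selected` are hypothetical names and the thresholds are arbitrary.

```python
import pandas as pd

toy = pd.DataFrame(
    {"Claims": [0.0, 10.0, 108.0, 124.0, 31.0],
     "Payment": [0.0, 65.3, 392.5, 422.2, 209.8]}
)

# Each comparison yields a boolean Series; combine with & (and), | (or), ~ (not).
# The parentheses are required because & and | bind tighter than comparisons.
mask = (toy["Payment"] > 200) & (toy["Claims"] < 120)
selected = toy[mask]
print(selected.index.tolist())  # [2, 4]

# .loc accepts the same mask and can pick columns at the same time
print(toy.loc[mask, "Payment"].tolist())  # [392.5, 209.8]
```

Python's `and` / `or` keywords do not work element-wise on a `Series`, which is why the bitwise operators are used here.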
