# Introduction

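A DataFrame is one of the core data structures provided by the pandas library, alongside Series. (Panel, the former 3-dimensional structure, has been removed from recent pandas versions.) A DataFrame is a 2-dimensional structure and can be compared to a table of rows and columns.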


Each row can be identified by an integer index (0 to N-1 for N rows) or by a label explicitly set when creating a
DataFrame object. Each column can be of a distinct type and is identified by a label.
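
As a minimal sketch of the two ideas above (row labels and per-column types), the following builds a small frame with an explicit index. The column names, values, and the variable name `df_labeled` are made up for illustration.

```python
import pandas as pd

# Hypothetical sample data: each column holds a different type.
df_labeled = pd.DataFrame(
    {
        "count": [1, 2, 3],
        "price": [0.5, 1.25, 2.0],
        "color": ["red", "white", "blue"],
    },
    index=["a", "b", "c"],  # explicit row labels instead of the default 0..N-1
)

print(df_labeled.dtypes)    # one dtype per column
print(df_labeled.loc["b"])  # select a row by its label
print(df_labeled.iloc[1])   # select the same row by integer position
```

`loc` looks rows up by label and `iloc` by position, so both calls above return the same row.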


This topic covers various ways to construct a DataFrame object, for example from NumPy arrays,
from a list of tuples, or from a dictionary.

### Creating a sample dataframe

In [3]:
import pandas as pd

df = pd.DataFrame({'numbers': [1, 2, 3], 'colors': ['red', 'white', 'blue']})

df

Unnamed: 0,numbers,colors
0,1,red
1,2,white
2,3,blue


In [4]:
df = pd.DataFrame({'numbers': [1, 2, 3], 'colors': ['red', 'white', 'blue']},
 columns=['numbers', 'colors'])
df

Unnamed: 0,numbers,colors
0,1,red
1,2,white
2,3,blue
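
The `columns` argument used above does more than fix the order. As a rough sketch: when the data is a dict, `columns` selects which keys to keep and in which order, and a name that is not a key produces an all-NaN column. The name `sizes` below is a hypothetical key chosen only to show that behaviour.

```python
import pandas as pd

data = {'numbers': [1, 2, 3], 'colors': ['red', 'white', 'blue']}

# Reorder the columns: 'colors' first, then 'numbers'
df_ordered = pd.DataFrame(data, columns=['colors', 'numbers'])

# 'sizes' is not a key in data (hypothetical name), so that column is all NaN;
# 'colors' is not listed, so it is dropped
df_extra = pd.DataFrame(data, columns=['numbers', 'sizes'])

print(df_ordered)
print(df_extra)
```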


#### Creating dataframe using random numbers

In [7]:
import numpy as np
import pandas as pd

# Set the seed for a reproducible sample
np.random.seed(0)
df = pd.DataFrame(np.random.randn(5, 3), columns=list('ABC'))

df

Unnamed: 0,A,B,C
0,1.764052,0.400157,0.978738
1,2.240893,1.867558,-0.977278
2,0.950088,-0.151357,-0.103219
3,0.410599,0.144044,1.454274
4,0.761038,0.121675,0.443863
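
An alternative sketch, assuming a NumPy version that provides `np.random.default_rng` (1.17 or later): a local generator object avoids changing the global random state. The values it produces differ from the `np.random.seed(0)` output shown above, and the names `gen` and `df_rng` are illustrative.

```python
import numpy as np
import pandas as pd

# A local, seeded generator; it does not touch the global np.random state
gen = np.random.default_rng(0)

# 5 rows x 3 columns of standard-normal samples
df_rng = pd.DataFrame(gen.standard_normal((5, 3)), columns=list('ABC'))

print(df_rng)
```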


#### Create a DataFrame with integers:

In [8]:
df = pd.DataFrame(np.arange(15).reshape(5,3),columns=list('ABC'))
df

Unnamed: 0,A,B,C
0,0,1,2
1,3,4,5
2,6,7,8
3,9,10,11
4,12,13,14


#### Create a DataFrame and include nans (NaT, NaN, 'nan', None) across columns and rows:


In [9]:
df = pd.DataFrame(np.arange(48).reshape(8,6),columns=list('ABCDEF'))
df

Unnamed: 0,A,B,C,D,E,F
0,0,1,2,3,4,5
1,6,7,8,9,10,11
2,12,13,14,15,16,17
3,18,19,20,21,22,23
4,24,25,26,27,28,29
5,30,31,32,33,34,35
6,36,37,38,39,40,41
7,42,43,44,45,46,47


In [11]:
df.iloc[::2,0] = np.nan # in column 0, set rows 0, 2, 4, ... to NaN
df.iloc[::4,1] = pd.NaT # in column 1, set rows 0, 4, ... to pd.NaT
df.iloc[:3,2] = 'nan' # in column 2, set rows 0 to 2 to the string 'nan'
df.iloc[:,5] = None # in column 5, set all elements to None
df.iloc[5,:] = None # in row 5, set all elements to None
df.iloc[7,:] = np.nan # in row 7, set all elements to NaN

In [12]:
df

Unnamed: 0,A,B,C,D,E,F
0,,NaT,,3.0,4.0,
1,6.0,7,,9.0,10.0,
2,,13,,15.0,16.0,
3,18.0,19,20.0,21.0,22.0,
4,,NaT,26.0,27.0,28.0,
5,,,,,,
6,,37,38.0,39.0,40.0,
7,,,,,,
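
The four markers used above are not all treated alike. The following sketch, on a small made-up frame (`df_missing` and its column names are illustrative), suggests that `isna()` flags `np.nan`, `pd.NaT` and `None`, while the string `'nan'` counts as an ordinary value.

```python
import numpy as np
import pandas as pd

# Hypothetical frame mixing the different "missing" markers
df_missing = pd.DataFrame({
    'num': [np.nan, 1.0, 2.0],
    'when': [pd.NaT, pd.Timestamp('2015-02-24'), pd.Timestamp('2015-02-25')],
    'text': ['nan', None, 'x'],
})

mask = df_missing.isna()         # True where a value is considered missing
missing_per_column = mask.sum()  # count of missing values per column

print(mask)
print(missing_per_column)
```

Here each column reports exactly one missing value: the string `'nan'` in row 0 of `text` is not counted, but the `None` in row 1 is.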


#### Create a sample DataFrame from multiple collections using Dictionary

In [13]:
import pandas as pd
import numpy as np
np.random.seed(123)
x = np.random.standard_normal(4)
y = range(4)
df = pd.DataFrame({'X':x, 'Y':y})

df

Unnamed: 0,X,Y
0,-1.085631,0
1,0.997345,1
2,0.282978,2
3,-1.506295,3


#### Create a DataFrame from a list of tuples

In [14]:
import pandas as pd

data = [
('p1', 't1', 1, 2),
('p1', 't2', 3, 4),
('p2', 't1', 5, 6),
('p2', 't2', 7, 8),
('p2', 't3', 2, 8)
]
df = pd.DataFrame(data)

df

Unnamed: 0,0,1,2,3
0,p1,t1,1,2
1,p1,t2,3,4
2,p2,t1,5,6
3,p2,t2,7,8
4,p2,t3,2,8
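
Without column names the columns are simply numbered 0 to 3, as the output above shows. A small sketch of passing `columns` to name them follows; the names `project`, `task`, `start` and `end` are invented for illustration and are not part of the original data.

```python
import pandas as pd

data = [
    ('p1', 't1', 1, 2),
    ('p1', 't2', 3, 4),
    ('p2', 't1', 5, 6),
    ('p2', 't2', 7, 8),
    ('p2', 't3', 2, 8),
]

# One label per tuple position (hypothetical names)
df_named = pd.DataFrame(data, columns=['project', 'task', 'start', 'end'])

print(df_named)
```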


#### Create a DataFrame from a dictionary of lists

Create a DataFrame from multiple lists by passing a dict whose values are lists. The keys of the
dictionary are used as column labels. The values can also be ndarrays. The lists/ndarrays must all be the same length.


In [15]:
import pandas as pd
# Create DF from dict of lists/ndarrays
df = pd.DataFrame({'A' : [1, 2, 3, 4],
 'B' : [4, 3, 2, 1]})

df

Unnamed: 0,A,B
0,1,4
1,2,3
2,3,2
3,4,1


If the arrays are not the same length, an error is raised:

In [17]:
df = pd.DataFrame({'A' : [1, 2, 3, 4], 'B' : [5, 5, 5]}) # a ValueError is raised

ValueError: All arrays must be of the same length
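
A sketch of handling this case: the first part catches the `ValueError`, and the second shows one possible workaround, wrapping the values in `pd.Series` so that pandas aligns them on the index and fills the gap with NaN. The exact wording of the error message may vary between pandas versions.

```python
import pandas as pd

# Plain lists of different lengths raise a ValueError
try:
    pd.DataFrame({'A': [1, 2, 3, 4], 'B': [5, 5, 5]})
    error_message = None
except ValueError as exc:
    error_message = str(exc)
    print('ValueError:', error_message)

# Series are aligned on their index, so the shorter one is padded with NaN
df_aligned = pd.DataFrame({'A': pd.Series([1, 2, 3, 4]),
                           'B': pd.Series([5, 5, 5])})

print(df_aligned)
```

Column `B` becomes floating point here, because NaN cannot be stored in a plain integer column.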

#### Using ndarrays

In [18]:
import pandas as pd
import numpy as np

np.random.seed(123)
x = np.random.standard_normal(4)
y = range(4)
df = pd.DataFrame({'X':x, 'Y':y})

df

Unnamed: 0,X,Y
0,-1.085631,0
1,0.997345,1
2,0.282978,2
3,-1.506295,3


#### Create a sample DataFrame with datetime


In [19]:
import pandas as pd
import numpy as np

np.random.seed(0)

# create an array of 5 dates starting at '2015-02-24', one per minute
rng = pd.date_range('2015-02-24', periods=5, freq='min')  # 'min' is the minutely alias; 'T' is deprecated in newer pandas
df = pd.DataFrame({ 'Date': rng, 'Val': np.random.randn(len(rng)) })

df

Unnamed: 0,Date,Val
0,2015-02-24 00:00:00,1.764052
1,2015-02-24 00:01:00,0.400157
2,2015-02-24 00:02:00,0.978738
3,2015-02-24 00:03:00,2.240893
4,2015-02-24 00:04:00,1.867558


In [20]:
# create an array of 5 dates starting at '2015-02-24', one per day
rng = pd.date_range('2015-02-24', periods=5, freq='D')
df = pd.DataFrame({ 'Date': rng, 'Val' : np.random.randn(len(rng))})

df

Unnamed: 0,Date,Val
0,2015-02-24,-0.977278
1,2015-02-25,0.950088
2,2015-02-26,-0.151357
3,2015-02-27,-0.103219
4,2015-02-28,0.410599


In [21]:
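# create an array of 5 year-end dates, 3 years apart, starting from the first year end on or after '2015-02-24'
# (newer pandas releases spell the year-end alias 'YE', e.g. freq='3YE', and deprecate 'A')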
rng = pd.date_range('2015-02-24', periods=5, freq='3A')
df = pd.DataFrame({ 'Date': rng, 'Val' : np.random.randn(len(rng))})

df

Unnamed: 0,Date,Val
0,2015-12-31,0.144044
1,2018-12-31,1.454274
2,2021-12-31,0.761038
3,2024-12-31,0.121675
4,2027-12-31,0.443863


#### DataFrame with a DatetimeIndex:

In [22]:
import pandas as pd
import numpy as np
np.random.seed(0)
rng = pd.date_range('2015-02-24', periods=5, freq='min')  # 'min' is the minutely alias; 'T' is deprecated in newer pandas
df = pd.DataFrame({ 'Val' : np.random.randn(len(rng)) }, index=rng)

df

Unnamed: 0,Val
2015-02-24 00:00:00,1.764052
2015-02-24 00:01:00,0.400157
2015-02-24 00:02:00,0.978738
2015-02-24 00:03:00,2.240893
2015-02-24 00:04:00,1.867558
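
One reason to put the dates in the index is label-based selection by time. The sketch below rebuilds the same frame (under the illustrative name `df_ts`) and selects a single row and a time window. Note that label slices on a DatetimeIndex include both endpoints.

```python
import numpy as np
import pandas as pd

np.random.seed(0)
idx = pd.date_range('2015-02-24', periods=5, freq='min')
df_ts = pd.DataFrame({'Val': np.random.randn(len(idx))}, index=idx)

# A single row, looked up by its timestamp label
one_row = df_ts.loc[pd.Timestamp('2015-02-24 00:02:00')]

# A window of rows; both endpoints are included in the result
window = df_ts.loc[pd.Timestamp('2015-02-24 00:01'):pd.Timestamp('2015-02-24 00:03')]

print(one_row)
print(window)
```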


#### Offset-aliases for parameter freq in date_range:

Alias     Description
B         business day frequency  
C         custom business day frequency (experimental)  
D         calendar day frequency  
W         weekly frequency  
M         month end frequency  
BM        business month end frequency  
CBM       custom business month end frequency  
MS        month start frequency  
BMS       business month start frequency  
CBMS      custom business month start frequency  
Q         quarter end frequency  
BQ        business quarter end frequency  
QS        quarter start frequency  
BQS       business quarter start frequency  
A         year end frequency  
BA        business year end frequency  
AS        year start frequency  
BAS       business year start frequency  
BH        business hour frequency  
H         hourly frequency  
T, min    minutely frequency  
S         secondly frequency  
L, ms     milliseconds  
U, us     microseconds  
N         nanoseconds 
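
A short sketch of a few of these aliases in use. It sticks to `B`, `MS` and `W`, which, as far as I am aware, keep the same spelling across pandas versions. Several others in the table (for example `M`, `Q`, `A`, `H`, `T`, `S`) are deprecated in pandas 2.2 and later in favour of `ME`, `QE`, `YE`, `h`, `min` and `s`.

```python
import pandas as pd

start = '2015-02-24'  # a Tuesday

# Three consecutive business days
business_days = pd.date_range(start, periods=3, freq='B')

# The next three month starts on or after the start date
month_starts = pd.date_range(start, periods=3, freq='MS')

# Weekly frequency; the default anchor for 'W' is Sunday
weekly = pd.date_range(start, periods=3, freq='W')

print(business_days)
print(month_starts)
print(weekly)
```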

#### Create a sample DataFrame with MultiIndex

In [23]:
np.random.seed(0)
tuples = list(zip(*[['bar', 'bar', 'baz', 'baz',
 'foo', 'foo', 'qux', 'qux'],
 ['one', 'two', 'one', 'two',
 'one', 'two', 'one', 'two']]))
idx = pd.MultiIndex.from_tuples(tuples, names=['first', 'second'])
idx

MultiIndex([('bar', 'one'),
            ('bar', 'two'),
            ('baz', 'one'),
            ('baz', 'two'),
            ('foo', 'one'),
            ('foo', 'two'),
            ('qux', 'one'),
            ('qux', 'two')],
           names=['first', 'second'])

#### Using from_product:

In [24]:
idx = pd.MultiIndex.from_product([['bar', 'baz', 'foo', 'qux'],['one','two']])

idx

MultiIndex([('bar', 'one'),
            ('bar', 'two'),
            ('baz', 'one'),
            ('baz', 'two'),
            ('foo', 'one'),
            ('foo', 'two'),
            ('qux', 'one'),
            ('qux', 'two')],
           )

#### Then, use this MultiIndex:

In [27]:
df = pd.DataFrame(np.random.randn(8, 2), index=idx, columns=['A', 'B'])

df

Unnamed: 0,Unnamed: 1,A,B
bar,one,-0.887786,-1.980796
bar,two,-0.347912,0.156349
baz,one,1.230291,1.20238
baz,two,-0.387327,-0.302303
foo,one,-1.048553,-1.420018
foo,two,-1.70627,1.950775
qux,one,-0.509652,-0.438074
qux,two,-1.252795,0.77749
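
Once a frame has a MultiIndex, rows can be selected by one or both levels. The sketch below uses integer data instead of random numbers so the results are easy to check. The variable names and the `names=` argument passed to `from_product` are illustrative choices.

```python
import numpy as np
import pandas as pd

idx = pd.MultiIndex.from_product([['bar', 'baz', 'foo', 'qux'], ['one', 'two']],
                                 names=['first', 'second'])
df_mi = pd.DataFrame(np.arange(16).reshape(8, 2), index=idx, columns=['A', 'B'])

# All rows whose first level is 'bar' (result is indexed by the second level)
bar_rows = df_mi.loc['bar']

# A single row, addressed by a full (first, second) tuple
single = df_mi.loc[('baz', 'two')]

# Cross-section: every row whose second level is 'one'
ones = df_mi.xs('one', level='second')

print(bar_rows)
print(single)
print(ones)
```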


##### Save and Load a DataFrame in pickle (.pkl) format

In [30]:
# Save dataframe to pickled pandas object
df.to_pickle('pwd') # path to save to, usually given a .pkl extension

# Load dataframe from pickled pandas object
df = pd.read_pickle('pwd')
df

Unnamed: 0,Unnamed: 1,A,B
bar,one,-0.887786,-1.980796
bar,two,-0.347912,0.156349
baz,one,1.230291,1.20238
baz,two,-0.387327,-0.302303
foo,one,-1.048553,-1.420018
foo,two,-1.70627,1.950775
qux,one,-0.509652,-0.438074
qux,two,-1.252795,0.77749
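
A self-contained sketch of the same round trip, written to a temporary directory so that no file is left behind. The file name `frame.pkl` and the sample data are assumptions made for illustration. `DataFrame.equals` is used to check that the loaded frame matches the saved one.

```python
import os
import tempfile

import pandas as pd

df_original = pd.DataFrame({'A': [1, 2, 3], 'B': ['x', 'y', 'z']})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'frame.pkl')  # hypothetical file name
    df_original.to_pickle(path)                # save
    df_restored = pd.read_pickle(path)         # load

# True when values, dtypes and labels all survived the round trip
round_trip_ok = df_original.equals(df_restored)
print(round_trip_ok)
```

Pickle files should only be loaded from trusted sources, since unpickling can execute arbitrary code.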


#### Create a DataFrame from a list of dictionaries

In [31]:
import pandas as pd
L = [{'Name': 'John', 'Last Name': 'Smith'},
 {'Name': 'Mary', 'Last Name': 'Wood'}]
pd.DataFrame(L)

Unnamed: 0,Name,Last Name
0,John,Smith
1,Mary,Wood


In [32]:
L = [{'Name': 'John', 'Last Name': 'Smith', 'Age': 37},
 {'Name': 'Mary', 'Last Name': 'Wood'}]


In [33]:
pd.DataFrame(L)

Unnamed: 0,Name,Last Name,Age
0,John,Smith,37.0
1,Mary,Wood,
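
In the output above `Age` is shown as `37.0`, because the missing value for Mary forces the column to floating point. The sketch below illustrates that, plus two optional follow-ups: selecting a subset of keys with `columns`, and converting to the nullable `Int64` dtype (assuming a pandas version that supports it). The variable names are illustrative.

```python
import pandas as pd

records = [{'Name': 'John', 'Last Name': 'Smith', 'Age': 37},
           {'Name': 'Mary', 'Last Name': 'Wood'}]

df_people = pd.DataFrame(records)
print(df_people.dtypes)  # 'Age' is float64 because of the missing value

# Keep only some of the keys, in a chosen order
df_subset = pd.DataFrame(records, columns=['Name', 'Age'])

# Nullable integer dtype keeps 37 as an integer and marks the gap as <NA>
df_nullable = df_people.astype({'Age': 'Int64'})
print(df_nullable)
```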
