# 3.4 DataFrame 
`DataFrame` is a data structure provided by pandas to store 2-dimensional labeled data that can be conceptually viewed as a table with rows and columns.

- **From `dict` of lists**

A `DataFrame` can be created from a `dict` of lists. The keys of the `dict` become the column labels of the `DataFrame`, and all lists must have the same length.

In [1]:
import pandas as pd

d = {
    "A": [1,2,3,4,5,6],
    "B": ['a','b','c','d','e','f']
}
df = pd.DataFrame(d)
print(df)

   A  B
0  1  a
1  2  b
2  3  c
3  4  d
4  5  e
5  6  f
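
The same-length requirement can be checked with a small sketch. The shortened lists below are made up for illustration; pandas is expected to raise a `ValueError` when the lists in the `dict` differ in length.

```python
import pandas as pd

# Hypothetical input: "A" has three items, "B" has only two.
d = {
    "A": [1, 2, 3],
    "B": ['a', 'b']
}

try:
    df = pd.DataFrame(d)
    error = None
except ValueError as exc:
    # pandas refuses to build the table because the columns do not line up.
    error = exc

print(type(error).__name__)
```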


By default, each row is labelled with its integer position. Custom row labels can be supplied through the `index` argument of `DataFrame`.

In [2]:
import pandas as pd

d = {
    "A": [1,2,3,4,5,6],
    "B": ['a','b','c','d','e','f']
}
df = pd.DataFrame(d, index=['I','II','III','IV','V','VI'])
print(df)

     A  B
I    1  a
II   2  b
III  3  c
IV   4  d
V    5  e
VI   6  f
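
As a quick check, the row and column labels can be read back from the `index` and `columns` attributes, and a single cell can be looked up by its labels with `loc`. The shortened data below is only for illustration.

```python
import pandas as pd

d = {"A": [1, 2, 3], "B": ['a', 'b', 'c']}

default_df = pd.DataFrame(d)                              # rows labelled 0, 1, 2
labelled_df = pd.DataFrame(d, index=['I', 'II', 'III'])   # rows labelled I, II, III

print(list(default_df.index))      # [0, 1, 2]
print(list(labelled_df.index))     # ['I', 'II', 'III']
print(list(labelled_df.columns))   # ['A', 'B']

# Look up one cell by row label and column label.
print(labelled_df.loc['II', 'B'])  # b
```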


A `dict` of NumPy `ndarray` objects can be used with the same syntax to create a `DataFrame`.

In [3]:
import pandas as pd
import numpy as np
d = {
    "A": np.array([1,2,3,4,5,6]),
    "B": np.array(['a','b','c','d','e','f'])
}
df = pd.DataFrame(d, index=['I','II','III','IV','V','VI'])
print(df)

     A  B
I    1  a
II   2  b
III  3  c
IV   4  d
V    5  e
VI   6  f


- **From `dict` of Series**
  
A `DataFrame` can also be created from a `dict` of `Series`. The main difference between a `Series` and a `list` is that a `Series` is labeled: each item of a `Series` carries an index label. Therefore, when we pass a `dict` of `Series` to create a `DataFrame`, items with the same label are placed in the same row. The row labels come from the index of the `Series`, whereas the column labels come from the keys of the `dict`.

In [4]:
import pandas as pd
d = {
    "A": pd.Series([1,2,3,4,5,6], index=['I','II','III','IV','V','VI']),
    "B": pd.Series(['a','b','c','d','e','f'], index=['I','II','III','IV','V','VII'])
}
df = pd.DataFrame(d)
print(df)

       A    B
I    1.0    a
II   2.0    b
III  3.0    c
IV   4.0    d
V    5.0    e
VI   6.0  NaN
VII  NaN    f


If one `Series` contains a label that does not exist in another `Series`, the missing entry is filled with `NaN` (not-a-number). Note that in this case we did not pass the `index` argument to `DataFrame`, so the labels of all the `Series` are combined to form the row labels of the `DataFrame`. This can be observed in the previous code snippet: `d["A"]` has the label `VI` but not `VII`, whereas `d["B"]` has `VII` but not `VI`. The resulting `df` has the row labels `I`, `II`, `III`, `IV`, `V`, `VI`, `VII`, with `NaN` in column `A` at row `VII` and in column `B` at row `VI`. Column `A` is also printed as floating-point numbers because `NaN` is a floating-point value.
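
The alignment described above can be inspected programmatically. The following is a minimal sketch with shortened, made-up `Series`; it uses `DataFrame.isna()` to locate the filled-in entries and prints the dtype of column `A` to show the switch to floating point.

```python
import pandas as pd

d = {
    "A": pd.Series([1, 2, 3], index=['I', 'II', 'III']),
    "B": pd.Series(['a', 'b', 'c'], index=['I', 'II', 'IV'])
}
df = pd.DataFrame(d)

# True wherever a label was missing from the original Series.
missing = df.isna()
print(missing)

# Column A holds a NaN for row IV, so it is stored as floating point.
print(df["A"].dtype)
```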

If the `index` argument is passed, the created `DataFrame` contains only the rows whose labels are listed in `index`.

In [5]:
import pandas as pd
d = {
    "A": pd.Series([1,2,3,4,5,6], index=['I','II','III','IV','V','VI']),
    "B": pd.Series(['a','b','c','d','e','f'], index=['I','II','III','IV','V','VII'])
}
df = pd.DataFrame(d, index=['I','II','IV'])
print(df)

    A  B
I   1  a
II  2  b
IV  4  d
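
A related case worth a quick sketch: when `index` lists a label that none of the `Series` contains, the row is still created and filled with `NaN`. The label `X` below is a made-up example.

```python
import pandas as pd

d = {
    "A": pd.Series([1, 2, 3], index=['I', 'II', 'III']),
    "B": pd.Series(['a', 'b', 'c'], index=['I', 'II', 'III'])
}

# 'X' does not appear in either Series, so its row contains only NaN.
df = pd.DataFrame(d, index=['I', 'III', 'X'])
print(df)
```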


- **From `list` of `dict`**

A `DataFrame` can be created from a `list` of `dict`. Each `dict` is treated as one row of the `DataFrame`, and the keys of each `dict` are the column labels. A key that is missing from a `dict` produces `NaN` in that row.

The row labels can be specified with the `index` argument of `DataFrame`; otherwise the rows are labelled with their integer positions.

In [6]:
import pandas as pd
d = [
    {'a':1, 'b':2, 'c':3},
    {'a':5, 'b':4}
]
df = pd.DataFrame(d)
print(df)

   a  b    c
0  1  2  3.0
1  5  4  NaN
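
The example above relies on the default integer row labels. A minimal sketch of the same data with explicit row labels passed through `index` could look like this; the labels `first` and `second` are arbitrary.

```python
import pandas as pd

d = [
    {'a': 1, 'b': 2, 'c': 3},
    {'a': 5, 'b': 4}
]

# One label per dict, in the same order as the list.
df = pd.DataFrame(d, index=['first', 'second'])
print(df)
```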


- **From `dict` of `dict`**

A `DataFrame` can also be created from a `dict` of `dict`. The keys of the outer `dict` become the column labels (just as with a `dict` of lists), and the keys of the inner `dict` become the row labels.

In [7]:
import pandas as pd
d = {
    'a': { 'x': 1, 'y': 2 },
    'b': { 'x': 3, 'y': 4 },
    'c': { 'x': 5, 'y': 6 },
    'd': { 'x': 7, 'y': 8 }
}
df = pd.DataFrame(d)
print(df)

   a  b  c  d
x  1  3  5  7
y  2  4  6  8
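
If the outer keys should label the rows instead of the columns, one option is `DataFrame.from_dict` with `orient='index'`. The sketch below uses a shortened version of the data and compares both orientations.

```python
import pandas as pd

d = {
    'a': {'x': 1, 'y': 2},
    'b': {'x': 3, 'y': 4}
}

# Default constructor: outer keys become columns, inner keys become rows.
by_column = pd.DataFrame(d)

# orient='index': outer keys become rows, inner keys become columns.
by_row = pd.DataFrame.from_dict(d, orient='index')

print(by_column)
print(by_row)
```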


- **MultiIndexed DataFrame**

A MultiIndexed `DataFrame` is a `DataFrame` whose row or column labels have multiple levels. For example, the column labels can consist of `a`, `b`, `c` under the group `A`, and `a`, `b`, `c` again under the group `B`. An example makes this clearer.

In [8]:
import pandas as pd
d = {
    ('A', 'a'): { ('X', 'x'): 1, ('X', 'y'): 2 },
    ('A', 'b'): { ('X', 'x'): 3, ('X', 'y'): 4 },
    ('A', 'c'): { ('X', 'x'): 5, ('X', 'y'): 6 },
    ('B', 'a'): { ('X', 'z'): 7, ('Y', 'x'): 8 }
}
df = pd.DataFrame(d)
print(df)

       A              B
       a    b    c    a
X x  1.0  3.0  5.0  NaN
  y  2.0  4.0  6.0  NaN
  z  NaN  NaN  NaN  7.0
Y x  NaN  NaN  NaN  8.0


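As the `DataFrame` is created from a `dict` of `dict`, the keys of the outer `dict` are used as the columns, whereas the keys of the inner `dict` are used as the rows. Because the keys are tuples, each element of a tuple forms one level of the labels: `A` and `B` are the first level of the columns, and `X` and `Y` are the first level of the rows. Combinations of row and column that do not appear in the input are filled with `NaN`.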
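
To see how the levels can be used, here is a minimal sketch with a smaller, made-up version of the data. Selecting by a first-level label returns everything grouped under it, while a full tuple addresses a single row or column.

```python
import pandas as pd

d = {
    ('A', 'a'): {('X', 'x'): 1, ('X', 'y'): 2},
    ('A', 'b'): {('X', 'x'): 3, ('X', 'y'): 4},
    ('B', 'a'): {('X', 'x'): 5, ('Y', 'x'): 6}
}
df = pd.DataFrame(d)

# Both the columns and the rows have two levels of labels.
print(df.columns.nlevels, df.index.nlevels)

# All columns grouped under 'A' (the second-level labels a and b remain).
print(df['A'])

# All rows grouped under 'X' (the second-level labels x and y remain).
print(df.loc['X'])

# A single cell, addressed by a full row tuple and a full column tuple.
print(df.loc[('X', 'x'), ('B', 'a')])
```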