# DataFrames

A `DataFrame` is a two-dimensional data structure composed of rows and columns, much like a simple spreadsheet or a SQL table. Each column of a DataFrame is a pandas Series. All columns have the same length, but they can hold different data types (float, int, bool, and so on). DataFrames are both value-mutable and size-mutable: the values held within a DataFrame can be altered, and columns can be added to or deleted from it.
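As a minimal sketch of what value and size mutability mean in practice, the snippet below edits one cell, adds a column, and deletes another. The `scores` frame and its column names are hypothetical and exist only for this illustration.

```python
import pandas as pd

# Hypothetical sample data, used only for illustration.
scores = pd.DataFrame({"math": [90, 75, 60], "art": [70, 85, 95]})

# Value mutability: overwrite a single cell in place.
scores.loc[0, "math"] = 95

# Size mutability: add a derived column, then delete an existing one.
scores["total"] = scores["math"] + scores["art"]
del scores["art"]

print(scores)
```

The same object is modified throughout; no new DataFrame is created by any of these steps.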

## `DataFrame` Creation

Similar to a `Series`, which has a `name` and `index` as attributes, a `DataFrame` has column names and a row index. The row index can be made of either numerical values or strings such as month names. Indexes are needed for fast lookups as well as for proper alignment and joining of data in pandas. Multilevel indexing is also possible in DataFrames. The following is a simple view of a `DataFrame` with five rows and three columns. In general, the `index` is not counted as a column:

|Index | Event type | Total attendees | Percentage of student participants|
|:--|:--|:--|:--|
|Monday | C | 42 | 23.56%| 
|Tuesday | B | 58 | 12.89%| 
|Wednesday | A | 27 | 45.90%| 
|Thursday | A | 78 | 47.89%| 
|Friday | B| 92 | 63.25%| 

A DataFrame is the most commonly used data structure in pandas. The constructor accepts many different types of arguments:

- Dictionary of 1D ndarrays, lists, dictionaries, or Series structures
- 2D NumPy array
- Structured or record ndarray
- Series
- Another DataFrame

Row label indexes and column labels can be specified along with the data. If they're not specified, they will be generated from the input data in an intuitive fashion, for example, from the `keys` of `dict` (in the case of column labels) or by using `range(n)` in the case of row labels, where `n` corresponds to the number of rows.
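The default labelling can be checked with a short sketch. The two inputs below (`from_dict` and `from_array`) are hypothetical names chosen for this example, not objects used elsewhere in this section.

```python
import numpy as np
import pandas as pd

# Hypothetical inputs: a dict of equal-length lists and a 2D NumPy array.
from_dict = pd.DataFrame({"x": [1, 2, 3], "y": [4.0, 5.0, 6.0]})
from_array = pd.DataFrame(np.arange(6).reshape(3, 2))

print(list(from_dict.columns))   # column labels come from the dict keys
print(list(from_dict.index))     # row labels are generated like range(3)
print(list(from_array.columns))  # with no keys available, columns are numbered too
```

Passing `index=` or `columns=` to the constructor replaces these generated labels with explicit ones.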

### Using a dictionary of `Series`

In [1]:
import pandas as pd

In [26]:
stockSummaries = {
    'AMZN': pd.Series(
        [346.15, 0.59, 459, 0.52, 589.8, 158.88],
        index=['Closing price', 'EPS', 'Shares Outstanding(M)', 'Beta', 'P/E', 'Market Cap(B)']
    ),
    'GOOG': pd.Series(
        [1133.43, 36.05, 335.83, 0.87, 31.44, 380.64],
        index=['Closing price', 'EPS', 'Shares Outstanding(M)', 'Beta', 'P/E', 'Market Cap(B)']
    ),
    'FB': pd.Series(
        [61.48, 0.59, 2450, 104.93, 150.92],
        index=['Closing price', 'EPS', 'Shares Outstanding(M)', 'P/E', 'Market Cap(B)']
    ),
    'YHOO': pd.Series(
        [34.90, 1.27, 1010, 27.48, 0.66, 35.36],
        index=['Closing price', 'EPS', 'Shares Outstanding(M)', 'P/E', 'Beta', 'Market Cap(B)']
    ),
    'TWTR':pd.Series(
        [65.25, -0.3, 555.2, 36.23],
        index=['Closing price', 'EPS', 'Shares Outstanding(M)', 'Market Cap(B)']
    ),
    'AAPL':pd.Series(
        [501.53, 40.32, 892.45, 12.44, 447.59, 0.84],
        index=['Closing price', 'EPS','Shares Outstanding(M)', 'P/E', 'Market Cap(B)', 'Beta']
    )
}

The preceding dictionary summarizes the performance of six different stocks, so the DataFrame will have six columns. Observe that the series do not all share the same index labels, label order, or length. The final `DataFrame` will use the union of the index labels across all the series as its row index. If a column has no value for a row label, `NaN` is filled into that cell automatically. Now, the following step wraps up this dictionary into a DataFrame:

In [27]:
pd.DataFrame(stockSummaries)

| | AMZN | GOOG | FB | YHOO | TWTR | AAPL |
|:--|:--|:--|:--|:--|:--|:--|
| Beta | 0.52 | 0.87 | NaN | 0.66 | NaN | 0.84 |
| Closing price | 346.15 | 1133.43 | 61.48 | 34.9 | 65.25 | 501.53 |
| EPS | 0.59 | 36.05 | 0.59 | 1.27 | -0.3 | 40.32 |
| Market Cap(B) | 158.88 | 380.64 | 150.92 | 35.36 | 36.23 | 447.59 |
| P/E | 589.8 | 31.44 | 104.93 | 27.48 | NaN | 12.44 |
| Shares Outstanding(M) | 459.0 | 335.83 | 2450.0 | 1010.0 | 555.2 | 892.45 |
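The alignment behaviour is easier to see on a much smaller input. The sketch below uses a hypothetical two-series dictionary (`small`) rather than the stock data, and counts the cells that pandas had to fill in.

```python
import pandas as pd

# Hypothetical, much smaller dictionary of Series with differing indexes.
small = {
    "A": pd.Series([1.0, 2.0], index=["EPS", "Beta"]),
    "B": pd.Series([3.0], index=["EPS"]),
}
small_df = pd.DataFrame(small)

print(small_df)
print(small_df.isna().sum())  # number of missing cells per column
```

Column `B` has no `Beta` entry, so that single cell is reported as missing.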


The DataFrame need not necessarily have all the row and column labels from the original dictionary. At times, only a subset of these rows and columns may be needed. In such cases, the row and column indices can be restricted as shown in the following code:

In [33]:
stock_df = pd.DataFrame(
    stockSummaries,
    index=['Closing price','EPS', 'Shares Outstanding(M)', 'P/E', 'Market Cap(B)', 'Beta'],
    columns=['FB', 'TWTR', 'SCNW']
)

In [34]:
stock_df

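| | FB | TWTR | SCNW |
|:--|:--|:--|:--|
| Closing price | 61.48 | 65.25 | NaN |
| EPS | 0.59 | -0.3 | NaN |
| Shares Outstanding(M) | 2450.0 | 555.2 | NaN |
| P/E | 104.93 | NaN | NaN |
| Market Cap(B) | 150.92 | 36.23 | NaN |
| Beta | NaN | NaN | NaN |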


Here, a column name, `SCNW`, which is not found in the original dictionary, has been requested. This results in a column named `SCNW` with `NaN` throughout. Similarly, passing an index label that is absent from the original data structure results in a row with `NaN` throughout.
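The row case can be sketched in the same way. In the snippet below, `Dividend` is a hypothetical label that does not appear in the data, and `data` is a stand-in dictionary rather than `stockSummaries`.

```python
import pandas as pd

# Stand-in data with a single column.
data = {"A": pd.Series([1.0, 2.0], index=["EPS", "Beta"])}

# "Dividend" is a hypothetical label that is absent from the data.
subset = pd.DataFrame(data, index=["EPS", "Dividend"])

print(subset)
```

The `Beta` row is dropped because it was not requested, and the `Dividend` row is present but empty.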

In [49]:
stock_df.columns

Index(['FB', 'TWTR', 'SCNW'], dtype='object')

In [50]:
stock_df.index

Index(['Closing price', 'EPS', 'Shares Outstanding(M)', 'P/E', 'Market Cap(B)',
       'Beta'],
      dtype='object')

### Using a dictionary of `ndarrays`/`list`s

In the preceding example, the dictionary consisted of Series as the values in the key-value pair. It is possible to construct a DataFrame with a dictionary of lists instead of a dictionary of Series. Unlike the previous case, the row index will not be defined anywhere in the dictionary. Hence, the row label indices are generated using `range(n)`. Therefore, **it is crucial in this case for all lists or arrays in the dictionary to be of equal length**. If this condition is not met, a `ValueError` occurs.

In [51]:
algos = {
    'search': ['DFS', 'BFS', 'Binary Search', 'Linear','ShortestPath (Djikstra)'],
    'sorting': ['Quicksort','Mergesort', 'Heapsort', 'Bubble Sort', 'Insertion Sort'],
    'machine learning': ['RandomForest', 'K Nearest Neighbor', 'Logistic Regression', 'K-Means Clustering', 'Linear Regression']
}

In [52]:
algo_df = pd.DataFrame(algos)

In [53]:
algo_df

| | search | sorting | machine learning |
|:--|:--|:--|:--|
| 0 | DFS | Quicksort | RandomForest |
| 1 | BFS | Mergesort | K Nearest Neighbor |
| 2 | Binary Search | Heapsort | Logistic Regression |
| 3 | Linear | Bubble Sort | K-Means Clustering |
| 4 | ShortestPath (Djikstra) | Insertion Sort | Linear Regression |


In [54]:
pd.DataFrame(algos, index = ['algo_1', 'algo_2', 'algo_3', 'algo_4', 'algo_5'])

| | search | sorting | machine learning |
|:--|:--|:--|:--|
| algo_1 | DFS | Quicksort | RandomForest |
| algo_2 | BFS | Mergesort | K Nearest Neighbor |
| algo_3 | Binary Search | Heapsort | Logistic Regression |
| algo_4 | Linear | Bubble Sort | K-Means Clustering |
| algo_5 | ShortestPath (Djikstra) | Insertion Sort | Linear Regression |


In [55]:
pd.DataFrame(algos, columns=['search'])

| | search |
|:--|:--|
| 0 | DFS |
| 1 | BFS |
| 2 | Binary Search |
| 3 | Linear |
| 4 | ShortestPath (Djikstra) |
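The equal-length requirement for a dictionary of lists can be demonstrated with a deliberately ragged input. The `ragged` dictionary below is hypothetical, and wrapping each list in a `Series` is only one possible workaround, not the only one.

```python
import pandas as pd

# Hypothetical dictionary whose lists have different lengths.
ragged = {"search": ["DFS", "BFS"], "sorting": ["Quicksort"]}

error_message = None
try:
    pd.DataFrame(ragged)
except ValueError as exc:
    error_message = str(exc)
    print("ValueError:", error_message)

# One possible workaround: wrap each list in a Series so pandas aligns
# on the generated index and pads the shorter column with NaN.
padded = pd.DataFrame({key: pd.Series(values) for key, values in ragged.items()})
print(padded)
```

Whether padding with `NaN` is appropriate depends on the data; often a length mismatch signals an upstream error that should be fixed instead.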


### Using a structured array

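Structured arrays are covered in the NumPy section. Each field in a structured array can be of a different data type. In the example below, the `Name` field is stored as a byte string, which is why its values appear with a `b` prefix in the output.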

In [93]:
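import numpy as np

# Each tuple is one record; the dtype names the fields and sets their types
member_data = np.array(
    [
        ('Sanjeev', 37, 162.4),
        ('Yingluck', 45, 137.8),
        ('Emeka', 28, 153.2),
        ('Amy', 67, 101.3)
    ],
    dtype=[
        ('Name', 'S15'),    # byte string of up to 15 bytes ('a15' is the older alias)
        ('Age', 'i4'),      # 32-bit integer
        ('Weight', 'f4')    # 32-bit float
    ]
)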

In [95]:
pd.DataFrame(member_data)

| | Name | Age | Weight |
|:--|:--|:--|:--|
| 0 | b'Sanjeev' | 37 | 162.399994 |
| 1 | b'Yingluck' | 45 | 137.800003 |
| 2 | b'Emeka' | 28 | 153.199997 |
| 3 | b'Amy' | 67 | 101.300003 |


In [97]:
pd.DataFrame(member_data, index=['a', 'b', 'c', 'd'])

| | Name | Age | Weight |
|:--|:--|:--|:--|
| a | b'Sanjeev' | 37 | 162.399994 |
| b | b'Yingluck' | 45 | 137.800003 |
| c | b'Emeka' | 28 | 153.199997 |
| d | b'Amy' | 67 | 101.300003 |


In [99]:
pd.DataFrame(member_data, columns=['Name'])

| | Name |
|:--|:--|
| 0 | b'Sanjeev' |
| 1 | b'Yingluck' |
| 2 | b'Emeka' |
| 3 | b'Amy' |
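If ordinary strings are preferred over byte strings in the resulting column, one option is to declare the field with a Unicode dtype. The sketch below is a variation on the example above, not part of it: `member_text` is a hypothetical two-record array, and `"U15"` is assumed to be wide enough for the names.

```python
import numpy as np
import pandas as pd

# Same field layout as above, but with a Unicode field ("U15") instead of bytes.
member_text = np.array(
    [("Sanjeev", 37, 162.4), ("Amy", 67, 101.3)],
    dtype=[("Name", "U15"), ("Age", "i4"), ("Weight", "f4")],
)
member_df = pd.DataFrame(member_text)

print(member_df)
print(member_df.dtypes)
```

An existing bytes column could alternatively be converted after the fact with `Series.str.decode("utf-8")`.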


### Using a list of dictionaries

When a list of dictionaries is converted to a `DataFrame`, each dictionary in the list corresponds to a row in the `DataFrame` and each key in each dictionary represents a column label.

In [102]:
demographic_data = [
    {"Age": 32, "Gender": "Male"},
    {"Race": "Hispanic", "Gender": "Female", "Age": 26}
]

In [103]:
pd.DataFrame(demographic_data)

| | Age | Gender | Race |
|:--|:--|:--|:--|
| 0 | 32 | Male | NaN |
| 1 | 26 | Female | Hispanic |
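As with the other inputs, `columns` and `index` can be combined with a list of dictionaries to select, order, and label the result. The sketch below reuses the same two records under a hypothetical name (`records`), with hypothetical row labels `p1` and `p2`.

```python
import pandas as pd

records = [
    {"Age": 32, "Gender": "Male"},
    {"Race": "Hispanic", "Gender": "Female", "Age": 26},
]

# Keys missing from a dictionary become NaN in that row.
full = pd.DataFrame(records)

# Keep only two columns, in a chosen order, with custom row labels.
people = pd.DataFrame(records, columns=["Gender", "Age"], index=["p1", "p2"])

print(full)
print(people)
```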


### Using a dictionary of tuples for multilevel indexing

A dictionary whose keys are tuples can create a DataFrame with hierarchically indexed rows and columns. The following is such a dictionary:

In [104]:
sales_data = {
    ("2012", "Q1"): {
        ("North", "Brand A"): 100,
        ("North", "Brand B"): 80,
        ("South", "Brand A"): 25,
        ("South", "Brand B"): 40,
    },
    ("2012", "Q2"): {("North", "Brand A"): 30, ("South", "Brand B"): 50},
    ("2013", "Q1"): {
        ("North", "Brand A"): 80,
        ("North", "Brand B"): 10,
        ("South", "Brand B"): 25,
    },
    ("2013", "Q2"): {
        ("North", "Brand A"): 70,
        ("North", "Brand B"): 50,
        ("South", "Brand A"): 35,
        ("South", "Brand B"): 40,
    },
}

In [105]:
pd.DataFrame(sales_data)

| | | 2012 | 2012 | 2013 | 2013 |
|:--|:--|:--|:--|:--|:--|
| | | Q1 | Q2 | Q1 | Q2 |
| North | Brand A | 100 | 30.0 | 80.0 | 70 |
| North | Brand B | 80 | NaN | 10.0 | 50 |
| South | Brand A | 25 | NaN | NaN | 35 |
| South | Brand B | 40 | 50.0 | 25.0 | 40 |


Instead of a plain label, each key of the outer dictionary is a `tuple` whose two values denote the two levels of the column index. Each value is itself a dictionary whose keys are also tuples, this time denoting the two levels of the row index. Combinations missing from an inner dictionary, such as `("South", "Brand A")` for 2012 Q2, show up as `NaN`.
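A short sketch can confirm which tuples end up on which axis and show how the levels are used for selection. `mini_sales` below is a hypothetical, cut-down dictionary with the same shape as `sales_data`.

```python
import pandas as pd

# Hypothetical, smaller dictionary with the same shape as sales_data.
mini_sales = {
    ("2012", "Q1"): {("North", "Brand A"): 100, ("South", "Brand A"): 25},
    ("2012", "Q2"): {("North", "Brand A"): 30},
}
mini_df = pd.DataFrame(mini_sales)

# Both axes carry two index levels.
print(mini_df.columns.nlevels, mini_df.index.nlevels)

# Selecting on the outer column level returns every quarter of that year.
print(mini_df["2012"])

# A full (row tuple, column tuple) pair addresses a single cell.
print(mini_df.loc[("North", "Brand A"), ("2012", "Q1")])
```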

### Using a Series

Consider the following series:

In [114]:
curr_dict = {
    "US": "dollar",
    "UK": "pound",
    "Germany": "euro",
    "Mexico": "peso",
    "Nigeria": "naira",
    "China": "yuan",
    "Japan": "yen",
}

In [118]:
curr_series = pd.Series(curr_dict, name='Currency')

In [119]:
curr_series

US         dollar
UK          pound
Germany      euro
Mexico       peso
Nigeria     naira
China        yuan
Japan         yen
Name: Currency, dtype: object

Here, the `Series` has a defined `index` and `name`. When being converted to a `DataFrame`, this `index` is retained and the `name` of the `Series` gets assigned as a column name:

In [125]:
df = pd.DataFrame(curr_series)
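The result can be inspected with a small sketch. It uses a hypothetical two-entry series (`currencies`) instead of `curr_series`, and also shows `Series.to_frame`, which gives an equivalent result for this input.

```python
import pandas as pd

currencies = pd.Series({"US": "dollar", "Japan": "yen"}, name="Currency")

via_constructor = pd.DataFrame(currencies)
via_to_frame = currencies.to_frame()

print(list(via_constructor.columns))         # the Series name becomes the column label
print(list(via_constructor.index))           # the Series index is retained
print(via_constructor.equals(via_to_frame))  # both routes give the same frame
```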

### Alternative Ways

There are also alternative constructors for DataFrames; they can be summarized as follows:

1. `DataFrame.from_dict`: It takes a dictionary of dictionaries or sequences and returns a DataFrame. It differs from the regular constructor in its `orient` argument. While the regular constructor always converts the dictionary keys to columns, `from_dict` can also convert the keys to row labels (`orient="index"`):

In [None]:
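# Signature sketch only; the real method is a classmethod on pd.DataFrame
class DataFrame:
    @classmethod
    def from_dict(cls, data, orient="columns", dtype=None, columns=None):
        ...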

In [139]:
pd.DataFrame.from_dict(algos, orient = "columns")

| | search | sorting | machine learning |
|:--|:--|:--|:--|
| 0 | DFS | Quicksort | RandomForest |
| 1 | BFS | Mergesort | K Nearest Neighbor |
| 2 | Binary Search | Heapsort | Logistic Regression |
| 3 | Linear | Bubble Sort | K-Means Clustering |
| 4 | ShortestPath (Djikstra) | Insertion Sort | Linear Regression |


In [140]:
pd.DataFrame.from_dict(algos, orient = "index")

| | 0 | 1 | 2 | 3 | 4 |
|:--|:--|:--|:--|:--|:--|
| search | DFS | BFS | Binary Search | Linear | ShortestPath (Djikstra) |
| sorting | Quicksort | Mergesort | Heapsort | Bubble Sort | Insertion Sort |
| machine learning | RandomForest | K Nearest Neighbor | Logistic Regression | K-Means Clustering | Linear Regression |


In [141]:
pd.DataFrame.from_dict(algos, orient = "index", columns = ["A", "B", "C", "D", "E"])

| | A | B | C | D | E |
|:--|:--|:--|:--|:--|:--|
| search | DFS | BFS | Binary Search | Linear | ShortestPath (Djikstra) |
| sorting | Quicksort | Mergesort | Heapsort | Bubble Sort | Insertion Sort |
| machine learning | RandomForest | K Nearest Neighbor | Logistic Regression | K-Means Clustering | Linear Regression |


2. `DataFrame.from_records`: It takes a list of tuples or a structured `ndarray` to construct a `DataFrame`. Unlike the regular constructor used earlier for structured arrays, this function allows you to set one of the fields of the array as the `index`:

In [148]:
pd.DataFrame.from_records(member_data, index="Weight")

| Weight | Name | Age |
|:--|:--|:--|
| 162.399994 | b'Sanjeev' | 37 |
| 137.800003 | b'Yingluck' | 45 |
| 153.199997 | b'Emeka' | 28 |
| 101.300003 | b'Amy' | 67 |
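Both alternative constructors also accept the other input shapes mentioned in their descriptions. The sketch below, using hypothetical data (`nested` and `rows`), passes a dictionary of dictionaries to `from_dict` and a plain list of tuples to `from_records`.

```python
import pandas as pd

# Hypothetical nested dictionary: outer keys "a"/"b", inner keys "x"/"y".
nested = {"a": {"x": 1, "y": 2}, "b": {"x": 3, "y": 4}}

as_columns = pd.DataFrame.from_dict(nested)               # outer keys become columns
as_rows = pd.DataFrame.from_dict(nested, orient="index")  # outer keys become row labels

# Hypothetical list of tuples; field names must be supplied through `columns`.
rows = [("Sanjeev", 37, 162.4), ("Amy", 67, 101.3)]
members = pd.DataFrame.from_records(
    rows, columns=["Name", "Age", "Weight"], index="Name"
)

print(as_columns)
print(as_rows)
print(members)
```

With a list of tuples there are no field names to draw on, so `index="Name"` only works because `columns` names the fields first.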
