# Pandas `DataFrame`
One of the most important Pandas data structures. A `DataFrame` holds an entire table of two-dimensional (2D) data.

There are many ways to **construct** a Pandas `DataFrame`. Here we'll use a dictionary in which each key is a column header and each value is a list holding that column's entries.

In [1]:
import pandas as pd

## Construct and examine a dataset with categorical data (baseball)
We'll construct it with a dictionary.

In [2]:
baseball_dict = {
    'City': ['Pittsburgh', 'Cincinnati', 'Chicago', 'St. Louis', 'Milwaukee'], 
    'Team': ['Pirates', 'Reds', 'Cubs', 'Cardinals', 'Brewers'], 
    'Division': 5 * ['Central'], 
    'League': 5 * ['NL'], 
}
baseball_dict

{'City': ['Pittsburgh', 'Cincinnati', 'Chicago', 'St. Louis', 'Milwaukee'],
 'Team': ['Pirates', 'Reds', 'Cubs', 'Cardinals', 'Brewers'],
 'Division': ['Central', 'Central', 'Central', 'Central', 'Central'],
 'League': ['NL', 'NL', 'NL', 'NL', 'NL']}

In [3]:
baseball_df = pd.DataFrame(baseball_dict)
baseball_df

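|   | City | Team | Division | League |
|---|------|------|----------|--------|
| 0 | Pittsburgh | Pirates | Central | NL |
| 1 | Cincinnati | Reds | Central | NL |
| 2 | Chicago | Cubs | Central | NL |
| 3 | St. Louis | Cardinals | Central | NL |
| 4 | Milwaukee | Brewers | Central | NL |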
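The dictionary-of-columns form used above is only one of the ways to build a `DataFrame`. As a minimal sketch of one alternative, the same kind of table can be built from a list of dictionaries, one per row. The names `records`, `records_df` and `columns_df` are made up for this illustration, and only two rows are used to keep it short.

```python
import pandas as pd

# Hypothetical alternative: one dictionary per ROW instead of one list per column.
records = [
    {'City': 'Pittsburgh', 'Team': 'Pirates', 'Division': 'Central', 'League': 'NL'},
    {'City': 'Cincinnati', 'Team': 'Reds', 'Division': 'Central', 'League': 'NL'},
]
records_df = pd.DataFrame(records)

# The same two rows built column-wise, in the style of the notebook cell above.
columns_df = pd.DataFrame({
    'City': ['Pittsburgh', 'Cincinnati'],
    'Team': ['Pirates', 'Reds'],
    'Division': 2 * ['Central'],
    'League': 2 * ['NL'],
})

# Both routes should yield the same table (same labels, values and dtypes).
print(records_df.equals(columns_df))
```

The row-wise form tends to be convenient when data arrives one record at a time; the column-wise form is more compact when whole columns are already available.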


Let's take a look at some key Pandas `DataFrame` attributes and methods to explore this data.

In [4]:
baseball_df.index

RangeIndex(start=0, stop=5, step=1)

In [5]:
baseball_df.columns

Index(['City', 'Team', 'Division', 'League'], dtype='object')

In [6]:
baseball_df.shape

(5, 4)

In [7]:
baseball_df.dtypes

City        object
Team        object
Division    object
League      object
dtype: object
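The attributes above are closely related to one another. The following sketch, which uses a small stand-in frame named `df` (an assumption for illustration, not the notebook's `baseball_df`), shows how `shape` lines up with `index`, `columns` and `dtypes`.

```python
import pandas as pd

# Small stand-in table; the name `df` is hypothetical.
df = pd.DataFrame({
    'City': ['Pittsburgh', 'Chicago'],
    'Team': ['Pirates', 'Cubs'],
})

n_rows, n_cols = df.shape  # shape is a (rows, columns) tuple

# The number of rows matches the length of the index (and len(df)).
print(n_rows == len(df.index) == len(df))

# The number of columns matches the length of the column labels.
print(n_cols == len(df.columns))

# dtypes is itself a Series, labelled by the column names.
print(list(df.dtypes.index) == list(df.columns))
```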

In [8]:
baseball_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 4 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   City      5 non-null      object
 1   Team      5 non-null      object
 2   Division  5 non-null      object
 3   League    5 non-null      object
dtypes: object(4)
memory usage: 292.0+ bytes


In [9]:
baseball_df.describe()

|   | City | Team | Division | League |
|---|------|------|----------|--------|
| count | 5 | 5 | 5 | 5 |
| unique | 5 | 5 | 1 | 1 |
| top | Pittsburgh | Pirates | Central | NL |
| freq | 1 | 1 | 5 | 5 |
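For text columns, `describe()` reports `count`, `unique`, `top` and `freq` rather than numeric statistics. As a rough sketch of what those rows mean, the same numbers can be recovered by hand for a single column. The variable names `division`, `counts` and `summary` are made up for this illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'Team': ['Pirates', 'Reds', 'Cubs', 'Cardinals', 'Brewers'],
    'Division': 5 * ['Central'],
})

division = df['Division']
counts = division.value_counts()  # occurrences of each distinct value, most frequent first

summary = {
    'count': division.count(),     # number of non-null entries
    'unique': division.nunique(),  # number of distinct values
    'top': counts.index[0],        # most frequent value
    'freq': counts.iloc[0],        # how often that value occurs
}
print(summary)
```

When every value appears exactly once, as in the `City` and `Team` columns, `freq` is 1 and `top` is simply one of the tied values.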


## Construct and examine a numeric dataset (air quality)
We'll construct it from a NumPy array.

In [10]:
import numpy as np

air_quality_arr = np.array([
    [12.5, 25.0, 0.4, 15, 22.5, 45],
    [35.2, 50.1, 0.8, 30, 24.0, 50],
    [55.1, 80.3, 1.2, 45, 26.5, 55],
    [22.3, 40.5, 0.6, 20, 21.0, 48]
])
air_quality_arr

array([[12.5, 25. ,  0.4, 15. , 22.5, 45. ],
       [35.2, 50.1,  0.8, 30. , 24. , 50. ],
       [55.1, 80.3,  1.2, 45. , 26.5, 55. ],
       [22.3, 40.5,  0.6, 20. , 21. , 48. ]])

In [11]:
column_headers = ["PM2.5", "PM10", "CO (ppm)", "NO2 (ppb)", "Temperature (C)", "Humidity (%)"]

In [12]:
air_quality_df = pd.DataFrame(air_quality_arr, columns=column_headers)
air_quality_df

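|   | PM2.5 | PM10 | CO (ppm) | NO2 (ppb) | Temperature (C) | Humidity (%) |
|---|-------|------|----------|-----------|-----------------|--------------|
| 0 | 12.5 | 25.0 | 0.4 | 15.0 | 22.5 | 45.0 |
| 1 | 35.2 | 50.1 | 0.8 | 30.0 | 24.0 | 50.0 |
| 2 | 55.1 | 80.3 | 1.2 | 45.0 | 26.5 | 55.0 |
| 3 | 22.3 | 40.5 | 0.6 | 20.0 | 21.0 | 48.0 |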
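Notice that values entered as integers (such as `15` and `45`) show up as `15.0` and `45.0`. A NumPy array has a single dtype, so mixing integers and floats upcasts everything to `float64` before Pandas ever sees the data. The sketch below illustrates this with a two-column stand-in; the names `mixed`, `from_array` and `from_dict` are assumptions for this example.

```python
import numpy as np
import pandas as pd

# A NumPy array holds one dtype, so the integers are upcast to float.
mixed = np.array([[12.5, 15], [35.2, 30]])
print(mixed.dtype)

# Built from the array, every column inherits that single float dtype.
from_array = pd.DataFrame(mixed, columns=['PM2.5', 'NO2 (ppb)'])
print(from_array.dtypes)

# Built from a dictionary, each column is inferred on its own,
# so the integer column can stay an integer column.
from_dict = pd.DataFrame({'PM2.5': [12.5, 35.2], 'NO2 (ppb)': [15, 30]})
print(from_dict.dtypes)
```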


In [14]:
air_quality_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4 entries, 0 to 3
Data columns (total 6 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   PM2.5            4 non-null      float64
 1   PM10             4 non-null      float64
 2   CO (ppm)         4 non-null      float64
 3   NO2 (ppb)        4 non-null      float64
 4   Temperature (C)  4 non-null      float64
 5   Humidity (%)     4 non-null      float64
dtypes: float64(6)
memory usage: 324.0 bytes


In [13]:
air_quality_df.describe()

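|   | PM2.5 | PM10 | CO (ppm) | NO2 (ppb) | Temperature (C) | Humidity (%) |
|---|-------|------|----------|-----------|-----------------|--------------|
| count | 4.0 | 4.0 | 4.0 | 4.0 | 4.0 | 4.0 |
| mean | 31.275 | 48.975 | 0.75 | 27.5 | 23.5 | 49.5 |
| std | 18.403691 | 23.303415 | 0.341565 | 13.228757 | 2.345208 | 4.203173 |
| min | 12.5 | 25.0 | 0.4 | 15.0 | 21.0 | 45.0 |
| 25% | 19.85 | 36.625 | 0.55 | 18.75 | 22.125 | 47.25 |
| 50% | 28.75 | 45.3 | 0.7 | 25.0 | 23.25 | 49.0 |
| 75% | 40.175 | 57.65 | 0.9 | 33.75 | 24.625 | 51.25 |
| max | 55.1 | 80.3 | 1.2 | 45.0 | 26.5 | 55.0 |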
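To see where the `describe()` numbers come from, the sketch below recomputes a few of them for the `PM2.5` column alone. It relies on two Pandas defaults: `std()` is the sample standard deviation (`ddof=1`), and `quantile()` interpolates linearly between neighbouring values. The names `pm25` and `manual_std` are made up for this illustration.

```python
import numpy as np
import pandas as pd

# The PM2.5 column from the air quality table above.
pm25 = pd.Series([12.5, 35.2, 55.1, 22.3], name='PM2.5')

mean = pm25.mean()
std = pm25.std()            # sample standard deviation: divides by n - 1
q25 = pm25.quantile(0.25)   # linear interpolation between sorted values
median = pm25.quantile(0.5)

# The same standard deviation written out by hand.
manual_std = np.sqrt(((pm25 - mean) ** 2).sum() / (len(pm25) - 1))

print(mean, std, q25, median)
print(np.isclose(std, manual_std))
```

Note that NumPy's own `np.std` defaults to `ddof=0` (dividing by n), so it would give a smaller value for the same data unless `ddof=1` is passed.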
