# Pandas (Part 1): Pandas Data Structures

In this notebook, you will learn how to create the following objects:
 - Series
 - DataFrame
 - Index
 
Read more:
 - Textbook: https://jakevdp.github.io/PythonDataScienceHandbook/03.01-introducing-pandas-objects.html
 - Pandas website: https://pandas.pydata.org/pandas-docs/stable/user_guide/dsintro.html

In [1]:
import pandas as pd
import numpy as np

## 1.  Series objects

 - creating series objects
 - series object attributes

### 1.1 Creating series objects

```python
>>> pd.Series(data, index=index)
```
 - ``index`` is optional. It defaults to the integer sequence ``0, 1, ..., n-1``, or to the dictionary keys when ``data`` is a dictionary.
 - ``data`` can be any of the following:
  - a list or NumPy array
  - a dictionary
  - a scalar value

#### 1.1.1 Python list

In [2]:
pd.Series([2, 4, 6])

0    2
1    4
2    6
dtype: int64

In [3]:
pd.Series([2, 4, 6], index= [1, 2, 3])

1    2
2    4
3    6
dtype: int64
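
The list of accepted inputs also includes NumPy arrays. The following is a minimal sketch of that case; the array values and the string labels are illustrative assumptions, not data from this notebook:

```python
import numpy as np
import pandas as pd

# Hypothetical data: a 1-D NumPy array with custom string labels
arr = np.array([0.25, 0.5, 0.75])
s = pd.Series(arr, index=['a', 'b', 'c'])

print(s.dtype)  # the Series keeps the array's float64 dtype
print(s['b'])   # values can be looked up by label
```

As with a list, leaving out ``index`` would give the default integer labels.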

#### 1.1.2 scalar value

In [4]:
pd.Series(5, index=[100, 200, 300])

100    5
200    5
300    5
dtype: int64

#### 1.1.3 dictionary

In [5]:
pd.Series({2:'a', 1:'b', 3:'c'})

2    a
1    b
3    c
dtype: object

In [6]:
pd.Series({2:'a', 1:'b', 3:'c'}, index=[3, 2])

3    c
2    a
dtype: object
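
When ``index`` is passed together with a dictionary, it selects and orders the keys. A label that is not a key of the dictionary is kept and filled with ``NaN``. A small sketch, where the extra label ``4`` is an assumption added for illustration:

```python
import pandas as pd

letters = {2: 'a', 1: 'b', 3: 'c'}

# 3 and 2 are keys of the dictionary; 4 is not, so its value is missing
s = pd.Series(letters, index=[3, 2, 4])

print(s)
print(s.isna())  # True only for the label that had no matching key
```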

### 1.2 Series attributes

#### 1.2.1 Values and indexes

In [7]:
population_dict = {'California': 38332521,
                   'Texas': 26448193,
                   'New York': 19651127,
                   'Florida': 19552860,
                   'Illinois': 12882135}
population = pd.Series(population_dict)
population

California    38332521
Texas         26448193
New York      19651127
Florida       19552860
Illinois      12882135
dtype: int64

In [8]:
population.values

array([38332521, 26448193, 19651127, 19552860, 12882135], dtype=int64)

In [9]:
population.index

Index(['California', 'Texas', 'New York', 'Florida', 'Illinois'], dtype='object')
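
Taken together, ``values`` and ``index`` make a ``Series`` behave like an ordered mapping from labels to values. The sketch below rebuilds a shortened version of ``population`` (only two states, to keep it self-contained) and shows label-based and position-based access:

```python
import pandas as pd

# Shortened copy of the population data used above
population = pd.Series({'California': 38332521, 'Texas': 26448193})

by_label = population['Texas']       # look up by index label
by_position = population.iloc[0]     # look up by integer position

# Pairing index and values recovers the original dictionary
as_dict = dict(zip(population.index, population.values))
print(by_label, by_position, as_dict)
```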

#### 1.2.2 Data type and size

In [10]:
population.dtype

dtype('int64')

In [11]:
population.size

5

In [12]:
population.shape

(5,)

In [13]:
population.ndim

1

## 2.  DataFrame objects

### 2.1 Creating dataframe objects

```python
>>> pd.DataFrame(data, index, columns, dtype)
```
 - ``index``, ``columns``, and ``dtype`` are optional. When the data carries no labels, ``index`` and ``columns`` default to integer sequences.
 - ``data`` can be any of the following:
  - Dict of 1D ndarrays, lists, dicts, or Series
  - 2-D numpy.ndarray
  - Structured or record ndarray
  - A Series
  - Another DataFrame

#### 2.1.1 a Series object

A ``DataFrame`` is a collection of ``Series`` objects, and a single-column ``DataFrame`` can be constructed from a single ``Series``:

In [14]:
pd.DataFrame(population, columns=['population'])

            population
California    38332521
Texas         26448193
New York      19651127
Florida       19552860
Illinois      12882135


#### 2.1.2 a list of dicts

Any list of dictionaries can be made into a ``DataFrame``.
We'll use a simple list comprehension to create some data:

In [15]:
data = [{'a': i, 'b': 2 * i}
        for i in range(3)]
pd.DataFrame(data)

   a  b
0  0  0
1  1  2
2  2  4


Even if some keys are missing from some of the dictionaries, Pandas will fill the gaps with ``NaN`` (i.e., "not a number") values:

In [16]:
pd.DataFrame([{'a': 1, 'b': 2}, {'b': 3, 'c': 4}])

     a  b    c
0  1.0  2  NaN
1  NaN  3  4.0
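
The output above shows ``1.0`` and ``4.0`` rather than ``1`` and ``4``. A short sketch of why: ``NaN`` is a floating-point value, so any column that receives one is stored as ``float64``, while the complete column ``b`` stays ``int64``:

```python
import pandas as pd

df = pd.DataFrame([{'a': 1, 'b': 2}, {'b': 3, 'c': 4}])

print(df.dtypes)        # 'a' and 'c' are float64, 'b' is int64
print(df.isna())        # True marks the cells that were filled with NaN
missing_per_column = df.isna().sum()
print(missing_per_column)
```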


#### 2.1.3 A dictionary of Series objects

A ``DataFrame`` can also be constructed from a dictionary of ``Series`` objects, where each key becomes a column name:

In [17]:
area_dict = {'California': 423967, 'Texas': 695662, 'New York': 141297,
             'Florida': 170312, 'Illinois': 149995}
area = pd.Series(area_dict)
area

California    423967
Texas         695662
New York      141297
Florida       170312
Illinois      149995
dtype: int64

In [18]:
states = pd.DataFrame({'population': population,
                       'area': area})
states

            population    area
California    38332521  423967
Texas         26448193  695662
New York      19651127  141297
Florida       19552860  170312
Illinois      12882135  149995
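
When a ``DataFrame`` is built from a dictionary of ``Series``, the values are matched by index label, not by position. The sketch below uses a shortened, deliberately reordered copy of the data (an assumption made for illustration) to show this, and also shows that each column comes back as a ``Series``:

```python
import pandas as pd

# Shortened copies of the data; note that 'area' lists the states in a different order
population = pd.Series({'California': 38332521, 'Texas': 26448193})
area = pd.Series({'Texas': 695662, 'California': 423967})

states = pd.DataFrame({'population': population, 'area': area})

# Each row still pairs the right population with the right area
print(states.loc['Texas'])

# Selecting a column returns a Series that shares the DataFrame's index
area_column = states['area']
print(type(area_column))
```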


#### 2.1.4 A two-dimensional NumPy array

Given a two-dimensional array of data, we can create a ``DataFrame`` with any specified column and index names.
If omitted, an integer index will be used for each:

In [19]:
pd.DataFrame(np.random.rand(3, 2),
             columns=['foo', 'bar'],
             index=['a', 'b', 'c'])

        foo       bar
a  0.880988  0.204377
b  0.105474  0.365815
c  0.561091  0.078543
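
For comparison, here is a minimal sketch of the case where ``columns`` and ``index`` are both omitted. It uses ``np.arange`` instead of random numbers (an assumption, chosen so the values are reproducible):

```python
import numpy as np
import pandas as pd

# 3x2 array holding the values 0..5, with no labels supplied
df = pd.DataFrame(np.arange(6).reshape(3, 2))

print(df.index)    # default integer row labels 0, 1, 2
print(df.columns)  # default integer column labels 0, 1
print(df.loc[2, 1])  # row label 2, column label 1
```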


#### 2.1.5 A NumPy structured array

In [20]:
A = np.zeros(3, dtype=[('A', 'i8'), ('B', 'f8')])
A

array([(0, 0.), (0, 0.), (0, 0.)], dtype=[('A', '<i8'), ('B', '<f8')])

In [21]:
pd.DataFrame(A)

   A    B
0  0  0.0
1  0  0.0
2  0  0.0


### 2.2 DataFrame attributes

In [31]:
states.index

Index(['California', 'Texas', 'New York', 'Florida', 'Illinois'], dtype='object')

In [32]:
states.columns

Index(['population', 'area'], dtype='object')

In [33]:
states.values

array([[38332521,   423967],
       [26448193,   695662],
       [19651127,   141297],
       [19552860,   170312],
       [12882135,   149995]], dtype=int64)

In [36]:
states.dtypes

population    int64
area          int64
dtype: object

In [35]:
# other attributes
print(states.size, states.shape, states.ndim)

10 (5, 2) 2


## 3.  Index object

The ``Index`` object can be thought of either as an *immutable array* or as an *ordered set* (strictly speaking a multi-set, since an ``Index`` may contain repeated values).

In [29]:
ind = pd.Index([2, 3, 5, 7, 11])
ind

Int64Index([2, 3, 5, 7, 11], dtype='int64')

In [30]:
# attributes
print(ind.size, ind.shape, ind.ndim, ind.dtype)

5 (5,) 1 int64


In [28]:
# Index objects are immutable: uncommenting the next line raises a TypeError
# ind[1] = 0
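
Both views of ``Index`` can be checked directly. The sketch below catches the error raised by item assignment and then uses the ``intersection`` and ``union`` methods for set-like behaviour. The second index, ``other``, is an assumption introduced for illustration:

```python
import pandas as pd

ind = pd.Index([2, 3, 5, 7, 11])

# Immutable array: item assignment is rejected with a TypeError
raised = False
try:
    ind[1] = 0
except TypeError as err:
    raised = True
    print("TypeError:", err)

# Ordered set: Index supports set operations against another Index
other = pd.Index([1, 3, 5, 7, 9])  # hypothetical second index
common = ind.intersection(other)
combined = ind.union(other)
print(common)
print(combined)
```

Immutability makes it safe to share one ``Index`` between several ``Series`` and ``DataFrame`` objects without one of them changing the labels of the others.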