# Pandas - Data Manipulation
> Pandas is built on top of NumPy.  
> **Series** is a one-dimensional array of indexed data.  
> - Convert **list** into **pd.Series**.  
> - Convert **dictionary** into **pd.Series**.  
> - Convert **NumPy** into **pd.Series**.  

> **DataFrames** are two-dimensional labeled arrays (row and column labels; columns can hold heterogeneous types and missing data).  
> - Convert **dictionary** into **pd.DataFrame**.  
> - Convert **pd.Series** into **pd.DataFrame**.  

> Indexing: ***explicitly defined index***  
> Reference: http://pandas.pydata.org

In [2]:
import numpy as np
import pandas as pd

In [3]:
pd.__version__

'1.0.5'
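
The introduction lists a NumPy array as one source for a Series, but the cells below only build Series from a list and a dictionary. A minimal sketch of the NumPy case, assuming illustrative values and the hypothetical names `arr`, `s_default` and `s_labeled`:

```python
import numpy as np
import pandas as pd

# Illustrative values only; any one-dimensional array works the same way
arr = np.array([0.25, 0.5, 0.75, 1.0])

# Without an index argument, pandas generates a RangeIndex (0, 1, 2, 3)
s_default = pd.Series(arr)

# The same values with explicit string labels
s_labeled = pd.Series(arr, index=['a', 'b', 'c', 'd'])

print(s_default)
print(s_labeled)
```

The Series keeps the dtype of the array (here `float64`).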

## Pandas Series
> **Python list** to **Pandas Series**.  
> Automatically indexed: the index and the values are available through the **.index** and **.values** attributes of the Series.   
> Slicing works the same way as for Python lists and NumPy arrays.   
> Pandas objects can also be indexed *explicitly* by passing ```index```.    

In [4]:
x = [0.25, 0.5, 0.75, 1.0]
data = pd.Series(x)

print(type(data))
print('---------------------')
print(data)
print('---------------------')
print(data.values)
print('---------------------')
print(data.index)

<class 'pandas.core.series.Series'>
---------------------
0    0.25
1    0.50
2    0.75
3    1.00
dtype: float64
---------------------
[0.25 0.5  0.75 1.  ]
---------------------
RangeIndex(start=0, stop=4, step=1)


In [7]:
# Slicing using the automatically-generated index
print(data[1])
print('---------------------')
print(data[3])
print('---------------------')
print(data[1:3])

0.5
---------------------
1.0
---------------------
1    0.50
2    0.75
dtype: float64


In [8]:
# Specify your own 'string' index with index

data = pd.Series([0.25, 0.5, 0.75, 1.0], index=['a', 'b', 'c', 'd'])

print(data)
print('---------------------')
print(data['b'])
print('---------------------')
print(data[['b','d']])

a    0.25
b    0.50
c    0.75
d    1.00
dtype: float64
---------------------
0.5
---------------------
b    0.5
d    1.0
dtype: float64


In [10]:
# Specify your own 'numeric' index with index

data = pd.Series([0.25, 0.5, 0.75, 1.0], index=[2, 5, 3, 7])

print(data)
print('---------------------')
print(data[5])
print('---------------------')
print(data[[5,7]])

2    0.25
5    0.50
3    0.75
7    1.00
dtype: float64
---------------------
0.5
---------------------
5    0.5
7    1.0
dtype: float64


## Pandas Series
> **Dictionary** to **Pandas Series**.  
> **Dictionary keys** become the **Series index**.      
> **Dictionary values** become the **Series values**.       

### Dictionary

In [11]:
population_dict = {'California': 38332521,
                   'Texas': 26448193,
                   'New York': 19651127,
                   'Florida': 19552860,
                   'Illinois': 12882135}

print(population_dict.keys())
print(population_dict.values())

dict_keys(['California', 'Texas', 'New York', 'Florida', 'Illinois'])
dict_values([38332521, 26448193, 19651127, 19552860, 12882135])


### Pandas series

In [12]:
population = pd.Series(population_dict)

print(population.index)
print('---------------')
print(population.values)
print('---------------')
print(population)

Index(['California', 'Texas', 'New York', 'Florida', 'Illinois'], dtype='object')
---------------
[38332521 26448193 19651127 19552860 12882135]
---------------
California    38332521
Texas         26448193
New York      19651127
Florida       19552860
Illinois      12882135
dtype: int64


In [13]:
print(population['California'])
print('---------------')
print(population['California':'New York'])
print('---------------')
print(population[['California','Illinois']])


38332521
---------------
California    38332521
Texas         26448193
New York      19651127
dtype: int64
---------------
California    38332521
Illinois      12882135
dtype: int64


## Activity 1
> 1. Construct x1 as a pandas Series with 3 elements 2,4,6.
> 2. Print the index of x1.  
> 3. Construct x2 with values ```[5,5,5]``` index = ```[100,200,300]```.   
> 4. Print x2 out.  
> 5. Construct x3 with values ```['a','b','c']``` index = ```[2,1,3]```.  

In [1]:
import numpy as np
import pandas as pd
x1 = pd.Series([2,4,6])
x1

0    2
1    4
2    6
dtype: int64

In [2]:
x2 = pd.Series(5, index=[100, 200, 300])
x2

100    5
200    5
300    5
dtype: int64

In [3]:
x3 = pd.Series({2:'a', 1:'b', 3:'c'})
x3

2    a
1    b
3    c
dtype: object
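
The answer above builds `x3` from a dictionary. Since the task states the values and the index separately, the same Series can presumably also be built by passing both directly; a small sketch (the names `x3_from_dict` and `x3_from_lists` are only for illustration):

```python
import pandas as pd

# Dictionary form, as in the cell above: keys become the index
x3_from_dict = pd.Series({2: 'a', 1: 'b', 3: 'c'})

# Values and index passed separately, in the order given by the task
x3_from_lists = pd.Series(['a', 'b', 'c'], index=[2, 1, 3])

# Both forms should hold the same labels and values
print(x3_from_dict.equals(x3_from_lists))
```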

## Pandas DataFrame
> From **pd.Series** to **pd.DataFrame** via **dictionary**.  
```python
# series1 and series2 stand for existing pd.Series objects
pd.DataFrame({'key1': series1, 'key2': series2})
```
> Dictionary **keys** become the *column* names.  
> The **index** of each **pd.Series** becomes the row *index*.  

> From **NumPy** to **pd.DataFrame**.  
```python
pd.DataFrame(mxn_NumPy_object,
             columns=['col1', 'col2', ..., 'coln'],
             index=['row1', 'row2', ..., 'rowm'])
```
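
When several Series are combined through a dictionary, pandas aligns them on their index labels. A minimal sketch with two made-up Series (`s1`, `s2`) whose labels only partly overlap; a label missing from one Series gives `NaN` in that column:

```python
import pandas as pd

# Two hypothetical Series that share only the label 'b'
s1 = pd.Series({'a': 1, 'b': 2})
s2 = pd.Series({'b': 10, 'c': 20})

# Dictionary keys become column names; rows are the union of both indexes
df = pd.DataFrame({'first': s1, 'second': s2})
print(df)

# 'a' exists only in s1, so the 'second' column has no value for it
print(df.loc['a', 'second'])
```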

### Single pd.Series to pd.DataFrame

In [16]:
pd.DataFrame(population, columns=['population'])

            population
California    38332521
Texas         26448193
New York      19651127
Florida       19552860
Illinois      12882135


### Multiple pd.Series to pd.DataFrame

In [16]:
# A new dictionary "area_dict"

area_dict = {'California': 423967, 
             'Texas': 695662, 
             'New York': 141297,
             'Florida': 170312, 
             'Illinois': 149995}

# A new pd.Series "area"

area = pd.Series(area_dict)
area

California    423967
Texas         695662
New York      141297
Florida       170312
Illinois      149995
dtype: int64

In [17]:
# Combine 2 pd.Series via dictionary

states = pd.DataFrame(  {'population': population,'area': area} )
states

            population    area
California    38332521  423967
Texas         26448193  695662
New York      19651127  141297
Florida       19552860  170312
Illinois      12882135  149995


In [18]:
print('Row names: \n', states.index)
print('---------------')
print('Column names: \n', states.columns)

Row names: 
 Index(['California', 'Texas', 'New York', 'Florida', 'Illinois'], dtype='object')
---------------
Column names: 
 Index(['population', 'area'], dtype='object')


In [37]:
print(states['population'])
print('---------------')
print(states.population)
print('---------------')
print(states['population'] is states.population)

California    38332521
Texas         26448193
New York      19651127
Florida       19552860
Illinois      12882135
Name: population, dtype: int64
---------------
California    38332521
Texas         26448193
New York      19651127
Florida       19552860
Illinois      12882135
Name: population, dtype: int64
---------------
True


### Add new variable to DataFrame

In [38]:
states['density'] = states['population'] / states['area']
states

            population    area     density
California    38332521  423967   90.413926
Texas         26448193  695662   38.018740
New York      19651127  141297  139.076746
Florida       19552860  170312  114.806121
Illinois      12882135  149995   85.883763


In [52]:
print(states.values.shape)
print('---------------')
print(states[states.density > 100][['population','density']])

(5, 3)
---------------
          population     density
New York    19651127  139.076746
Florida     19552860  114.806121


In [41]:
# Same keys 'a' and 'b' as column names

data = [{'a': i, 'b': 2 * i} for i in range(3)]
pd.DataFrame(data)

   a  b
0  0  0
1  1  2
2  2  4


In [24]:
# Different keys ('a', 'b' and 'b', 'c'): missing entries are filled with NaN

pd.DataFrame([{'a': 1, 'b': 2}, {'b': 3, 'c': 4}])

     a  b    c
0  1.0  2  NaN
1  NaN  3  4.0


## Activity 2
> 1. Construct a **DataFrame** of size (3,2).  
> 2. Name the columns by 'foo' and 'bar'.  
> 3. Name the index by 'a', 'b', 'c'.  

In [4]:
pd.DataFrame(np.random.rand(3, 2),
             columns=['foo', 'bar'],
             index=['a', 'b', 'c'])

        foo       bar
a  0.655405  0.882395
b  0.682925  0.043983
c  0.537839  0.758983


### Structured DataFrame

In [20]:
A = np.zeros(3, dtype=[('X1', 'i8'), ('X2', 'f8')])
A

array([(0, 0.), (0, 0.), (0, 0.)], dtype=[('X1', '<i8'), ('X2', '<f8')])

In [21]:
A['X1'] = np.ones(3)
A['X2'] = np.full(3,2)
pd.DataFrame(A)

   X1   X2
0   1  2.0
1   1  2.0
2   1  2.0


## Pandas Indexing
> Pandas indices are **immutable**.  
> Note: ```Index``` is a class, so its name is capitalized.  

In [22]:
ind = pd.Index([1, 3, 5, 7, 9])
print(ind)

Int64Index([1, 3, 5, 7, 9], dtype='int64')


In [23]:
print(ind[1])
print(ind[::2])

3
Int64Index([1, 5, 9], dtype='int64')


In [24]:
print(ind.size, ind.shape, ind.ndim, ind.dtype)

5 (5,) 1 int64


In [38]:
ind[1]=0 # TypeError: Index does not support mutable operations

TypeError: Index does not support mutable operations
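
Because an `Index` cannot be changed in place, a changed label has to go into a new `Index` object. A minimal sketch, with `new_ind` as a hypothetical name; it catches the error first and then rebuilds the index with label 3 replaced by 0:

```python
import pandas as pd

ind = pd.Index([1, 3, 5, 7, 9])

# Item assignment is rejected because Index objects are immutable
try:
    ind[1] = 0
except TypeError as exc:
    print('TypeError:', exc)

# Build a new Index instead; the original is left untouched
new_ind = pd.Index([0 if label == 3 else label for label in ind])
print(new_ind)
print(ind)
```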

In [39]:
indA = pd.Index([1, 3, 5, 7, 9])
indB = pd.Index([2, 3, 5, 7, 11])

In [40]:
print(indA & indB) # intersection
print(indA | indB) # union
print(indA ^ indB) # symmetric difference

Int64Index([3, 5, 7], dtype='int64')
Int64Index([1, 2, 3, 5, 7, 9, 11], dtype='int64')
Int64Index([1, 2, 9, 11], dtype='int64')
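
The operator forms above behave as set operations in the pandas version used here (1.0.5); in newer releases they no longer do. The named methods should give the same results across versions. A short sketch:

```python
import pandas as pd

indA = pd.Index([1, 3, 5, 7, 9])
indB = pd.Index([2, 3, 5, 7, 11])

# Method equivalents of the set operations
print(indA.intersection(indB))          # labels present in both
print(indA.union(indB))                 # labels present in either
print(indA.symmetric_difference(indB))  # labels present in exactly one
```

The printed class name (`Int64Index` or `Index`) depends on the installed pandas version.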


## Explicit versus Implicit Indexing
> **Explicit** indexing (**.loc[label]**, for example ```data.loc['a']```) uses the index labels you assign to the pandas object.  
> **Implicit** indexing (**.iloc[p]**) uses the integer position (0, 1, 2, ...), regardless of the labels.  

In [30]:
data = pd.Series([0.25, 0.5, 0.75, 1.0], index=['a', 'b', 'c', 'd'])
data

a    0.25
b    0.50
c    0.75
d    1.00
dtype: float64

In [49]:
# Masking 
data[(data > 0.3) & (data < 0.8)]

b    0.50
c    0.75
dtype: float64

In [50]:
print(data.keys())
print(data.index)

Index(['a', 'b', 'c', 'd'], dtype='object')
Index(['a', 'b', 'c', 'd'], dtype='object')


In [51]:
# Adding new index with value (like dictionary)

data['e'] = 1.25
data

a    0.25
b    0.50
c    0.75
d    1.00
e    1.25
dtype: float64

In [46]:
# Slicing by explicit index (it WORKS! The end label 'c' is included)

data['a':'c']

a    0.25
b    0.50
c    0.75
dtype: float64

In [47]:
# Slicing by implicit integer index (it WORKS! The end position 3 is excluded)

data[0:3]

a    0.25
b    0.50
c    0.75
dtype: float64

In [52]:
# Passing an index LIST

data[['a', 'e']]

a    0.25
e    1.25
dtype: float64

### Confusion: Explicit vs Implicit Indexing
> Confusion arises when the ```index``` labels are integers.  

In [32]:
data = pd.Series(['a', 'b', 'c'], index=[1, 3, 5])
data

1    a
3    b
5    c
dtype: object

In [33]:
# Indexing a single value uses the explicit index (label 1, not position 1)

print(data[1])

a


In [34]:
# Explicit index when indexing (DOES NOT WORK: there is no label 2!)

print(data[2])

KeyError: 2

In [35]:
# Slice :2 (CHANGES TO IMPLICIT INDEX: the first two positions)

data[:2]

1    a
3    b
dtype: object

### Use loc for explicit and iloc for implicit

In [64]:
data.loc[1]

'a'

In [65]:
data.loc[1:3]

1    a
3    b
dtype: object

In [67]:
data.iloc[0]

'a'

In [68]:
data.iloc[1:3]

3    b
5    c
dtype: object
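
The two slices above look alike but select different rows: `.loc[1:3]` goes by label and includes the end label, while `.iloc[1:3]` goes by position and excludes the end position. A minimal side-by-side sketch (the names `by_label` and `by_position` are only for illustration):

```python
import pandas as pd

data = pd.Series(['a', 'b', 'c'], index=[1, 3, 5])

# Label-based: from label 1 up to and including label 3
by_label = data.loc[1:3]

# Position-based: positions 1 and 2 (position 3 is excluded)
by_position = data.iloc[1:3]

print(list(by_label.index), list(by_label.values))
print(list(by_position.index), list(by_position.values))
```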