# Pandas

## The Pandas Package

    - Implements in Python an indexed data type modeled on R's data.frame
    - References
        - http://pandas.pydata.org/
        - http://pandas.pydata.org/pandas-docs/stable/10min.html
        - http://pandas.pydata.org/pandas-docs/stable/tutorials.html
            
## Pandas Data Types

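    - Series
        - Time-series data
        - A one-dimensional NumPy array with an Index
    - DataFrame
        - Multi-field time-series data or tabular data
        - Conceptually a two-dimensional NumPy array with an Index
    - Index
        - Label: the name of each individual row or column
        - Name: the name of the index itself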
    
    - https://docs.google.com/drawings/d/12FKb94RlpNp7hZNndpnLxmdMJn3FoLfGwkUAh33OmOw/pub?w=602&h=446
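
The difference between an index's labels and its name is easier to see on a small object. The sketch below uses made-up temperature data; the names `temps`, `frame`, and `"day"` are illustrative assumptions and do not appear elsewhere in this notebook.

```python
import pandas as pd

# Hypothetical temperature readings, used only to illustrate the terms.
temps = pd.Series([21.5, 19.0, 23.1], index=["mon", "tue", "wed"])

# Labels: one name per row.
labels = list(temps.index)

# Name: a single name for the index as a whole (None unless set).
temps.index.name = "day"

# A DataFrame carries two indexes: one for rows and one for columns.
frame = pd.DataFrame(
    {"temp": [21.5, 19.0, 23.1], "rain": [0.0, 1.2, 0.0]},
    index=["mon", "tue", "wed"],
)

print(labels)
print(temps.index.name)
print(list(frame.index), list(frame.columns))
```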

### Series

A Series created without an explicit index gets a default integer index starting at 0.

In [1]:
import pandas as pd

In [2]:
s = pd.Series([4, 7, -5, 4])
s

0    4
1    7
2   -5
3    4
dtype: int64

### Vectorized Operation

In [3]:
s * 2

0     8
1    14
2   -10
3     8
dtype: int64
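
As a small sketch of what "vectorized" means here: arithmetic and NumPy functions are applied to every element at once, so no explicit loop is needed. The variable names below are illustrative only.

```python
import numpy as np
import pandas as pd

s = pd.Series([4, 7, -5, 4])

doubled = s * 2        # element-wise multiplication by a scalar
total = s + s          # element-wise addition of two Series
absolute = np.abs(s)   # NumPy functions are also applied element-wise

# An explicit loop produces the same values as s * 2, only more verbosely.
looped = [value * 2 for value in s]

print(doubled.tolist())
print(total.tolist())
print(absolute.tolist())
```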

### Series with an Explicit Index

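    - Specify the index with the index argument at creation time
    - Each index element is a label that acts as the key for its value
    - Works much like a dict that maps labels to values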

In [4]:
s2 = pd.Series([4, 7, -5, 3], index=["d", "b", "a", "c"])

In [5]:
s2

d    4
b    7
a   -5
c    3
dtype: int64
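
The dict-like behavior mentioned above can be sketched as follows, reusing the same data as `s2`. This is a minimal illustration, not an exhaustive list of dict-style operations.

```python
import pandas as pd

s2 = pd.Series([4, 7, -5, 3], index=["d", "b", "a", "c"])

has_b = "b" in s2        # membership tests look at the labels, like dict keys
has_z = "z" in s2        # a label that does not exist
value_a = s2["a"]        # look up a value by its label
as_dict = s2.to_dict()   # convert back to a plain dict

print(has_b, has_z)
print(value_a)
print(as_dict)
```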

### Series Indexing 1: Label Indexing

    - Index by label with the loc indexer (an attribute used with square brackets, not a method)

In [6]:
s2.loc['a']

-5

In [7]:
s2.loc[['a', 'b']]

a   -5
b    7
dtype: int64

### Series Indexing 2: Integer Indexing

    - Index by integer position with the iloc indexer

In [8]:
s2.iloc[2]

-5

In [9]:
s2.iloc[1:4]

b    7
a   -5
c    3
dtype: int64

In [10]:
s2.iloc[[2, 1]]

a   -5
b    7
dtype: int64
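
One difference between the two indexers is worth a short sketch: a label slice with `loc` includes its end label, while a position slice with `iloc` excludes its stop position, as ordinary Python slices do.

```python
import pandas as pd

s2 = pd.Series([4, 7, -5, 3], index=["d", "b", "a", "c"])

# Label slice: both end points are included.
by_label = s2.loc["b":"a"]

# Position slice: the stop position (2) is excluded.
by_position = s2.iloc[1:2]

print(list(by_label.index))     # ['b', 'a']
print(list(by_position.index))  # ['b']
```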

### Series Indexing 3: Boolean Indexing

    - Select only the elements that satisfy a condition

In [11]:
s2 > 0  # conditional query

d     True
b     True
a    False
c     True
dtype: bool

In [12]:
s2[s2 > 0]

d    4
b    7
c    3
dtype: int64

In [13]:
s2==4

d     True
b    False
a    False
c    False
dtype: bool

In [14]:
s2[s2==4]

d    4
dtype: int64
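
Boolean masks can also be combined. The sketch below assumes the same `s2` as above and uses `&` (and), `|` (or), and `~` (not); each comparison is wrapped in parentheses because these operators bind more tightly than comparisons in Python.

```python
import pandas as pd

s2 = pd.Series([4, 7, -5, 3], index=["d", "b", "a", "c"])

positive_and_small = s2[(s2 > 0) & (s2 < 5)]   # both conditions hold
negative_or_large = s2[(s2 < 0) | (s2 > 5)]    # at least one condition holds
not_positive = s2[~(s2 > 0)]                   # the condition does not hold

print(positive_and_small.tolist())  # [4, 3]
print(negative_or_large.tolist())   # [7, -5]
print(not_positive.tolist())        # [-5]
```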

### Creating a Series from a dict

In [15]:
sdata = {'Ohio': 35000, 'Texas': 71000, 'Oregon': 16000, 'Utah': 5000}
sdata

{'Ohio': 35000, 'Texas': 71000, 'Oregon': 16000, 'Utah': 5000}

In [16]:
s3 = pd.Series(sdata)
s3

Ohio      35000
Texas     71000
Oregon    16000
Utah       5000
dtype: int64

In [17]:
states = ['California', 'Ohio', 'Oregon', 'Texas']
s4 = pd.Series(sdata, index=states)
s4

California        NaN
Ohio          35000.0
Oregon        16000.0
Texas         71000.0
dtype: float64

In [18]:
pd.isnull(s4)

California     True
Ohio          False
Oregon        False
Texas         False
dtype: bool
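
The result above has two consequences that a short sketch makes explicit: labels missing from the dict (California) become NaN, and keys missing from the index (Utah) are dropped. Because NaN is a floating-point value, the dtype changes from int64 to float64. The use of `notnull` to filter is an addition for illustration and is not part of the original cells.

```python
import pandas as pd

sdata = {"Ohio": 35000, "Texas": 71000, "Oregon": 16000, "Utah": 5000}
states = ["California", "Ohio", "Oregon", "Texas"]
s4 = pd.Series(sdata, index=states)

missing = s4.isnull()    # method form of pd.isnull(s4)
present = s4.notnull()   # the opposite mask
known = s4[present]      # boolean indexing keeps only non-missing values

print(s4.dtype)
print("Utah" in s4)
print(list(known.index))
```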

### DataFrame

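    - Multi-Series
        - Multiple Series that share the same row index
        - A dict whose values are Series
    - Two-dimensional matrix
        - If a DataFrame is viewed as a matrix, each Series acts as one column of the matrix
        - It has both a (row) index and a column index.
    - Difference from a NumPy array
        - Each column (Series) may have a different type.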
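
The "dict of Series" view can be sketched directly. The three Series and the row labels `"a"`, `"b"`, `"c"` below are made up for illustration; they share one row index and each keeps its own dtype once combined.

```python
import pandas as pd

rows = ["a", "b", "c"]  # hypothetical shared row labels
state = pd.Series(["Ohio", "Ohio", "Nevada"], index=rows)
year = pd.Series([2000, 2001, 2002], index=rows)
pop = pd.Series([1.5, 1.7, 2.4], index=rows)

# Each dict key becomes a column label; each Series becomes a column.
frame = pd.DataFrame({"state": state, "year": year, "pop": pop})

print(frame.shape)
print(frame.dtypes)          # one dtype per column
print(type(frame["year"]))   # a single column is a Series again
```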

In [19]:
data = {
    'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada'],
    'year': [2000, 2001, 2002, 2001, 2002],
    'pop': [1.5, 1.7, 3.6, 2.4, 2.9]
}
df = pd.DataFrame(data)
df

    state  year  pop
0    Ohio  2000  1.5
1    Ohio  2001  1.7
2    Ohio  2002  3.6
3  Nevada  2001  2.4
4  Nevada  2002  2.9


In [20]:
pd.DataFrame(data, columns = ['year','state','pop'])

   year   state  pop
0  2000    Ohio  1.5
1  2001    Ohio  1.7
2  2002    Ohio  3.6
3  2001  Nevada  2.4
4  2002  Nevada  2.9


In [21]:
df.dtypes

state     object
year       int64
pop      float64
dtype: object

### DataFrame with Explicit Column/Row Indexes

In [22]:
df2 = pd.DataFrame(data, 
                   columns=['year', 'state', 'pop', 'debt'],
                   index=['one', 'two', 'three', 'four', 'five'])
df2

       year   state  pop debt
one    2000    Ohio  1.5  NaN
two    2001    Ohio  1.7  NaN
three  2002    Ohio  3.6  NaN
four   2001  Nevada  2.4  NaN
five   2002  Nevada  2.9  NaN


### Row Indexing

In [23]:
df.loc[0]

state    Ohio
year     2000
pop       1.5
Name: 0, dtype: object

In [24]:
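# df2.loc[0] fails here because df2's row labels are strings, not integers;
# use iloc to select a row by position instead.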
df2.iloc[0]

year     2000
state    Ohio
pop       1.5
debt      NaN
Name: one, dtype: object
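
The contrast between label-based and position-based row access on `df2` can be sketched as below. The `try`/`except` is an illustrative addition; recent pandas versions raise a KeyError for the failed label lookup, and TypeError is caught as well on the assumption that some older versions report the problem that way.

```python
import pandas as pd

data = {
    "state": ["Ohio", "Ohio", "Ohio", "Nevada", "Nevada"],
    "year": [2000, 2001, 2002, 2001, 2002],
    "pop": [1.5, 1.7, 3.6, 2.4, 2.9],
}
df2 = pd.DataFrame(data,
                   columns=["year", "state", "pop"],
                   index=["one", "two", "three", "four", "five"])

first_by_label = df2.loc["one"]    # row whose label is "one"
first_by_position = df2.iloc[0]    # row at position 0 (the same row)

try:
    df2.loc[0]                     # 0 is not one of the row labels
    label_lookup_failed = False
except (KeyError, TypeError):
    label_lookup_failed = True

print(first_by_label.tolist())
print(label_lookup_failed)
```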

### Single Column Access

In [25]:
df["state"]

0      Ohio
1      Ohio
2      Ohio
3    Nevada
4    Nevada
Name: state, dtype: object

In [26]:
type(df["state"])

pandas.core.series.Series

In [27]:
df.state

0      Ohio
1      Ohio
2      Ohio
3    Nevada
4    Nevada
Name: state, dtype: object

### DataFrame Column Indexing

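    - Single label as a key, e.g. df2["year"] (returns a Series)
    - Single label as an attribute, e.g. df2.year (returns a Series)
    - List of labels (fancy indexing), e.g. df2[["state", "debt", "year"]] (returns a DataFrame)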

In [28]:
df2

       year   state  pop debt
one    2000    Ohio  1.5  NaN
two    2001    Ohio  1.7  NaN
three  2002    Ohio  3.6  NaN
four   2001  Nevada  2.4  NaN
five   2002  Nevada  2.9  NaN


In [29]:
df2["year"]

one      2000
two      2001
three    2002
four     2001
five     2002
Name: year, dtype: int64

In [30]:
df2.year

one      2000
two      2001
three    2002
four     2001
five     2002
Name: year, dtype: int64

In [31]:
df2[["state", "debt", "year"]]

        state debt  year
one      Ohio  NaN  2000
two      Ohio  NaN  2001
three    Ohio  NaN  2002
four   Nevada  NaN  2001
five   Nevada  NaN  2002


In [32]:
df2[["year"]]

       year
one    2000
two    2001
three  2002
four   2001
five   2002
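
Attribute access has a pitfall that this dataset happens to expose: a column whose name matches an existing DataFrame attribute or method is not reachable as an attribute. The column `pop` collides with the `DataFrame.pop` method, so the key form is the safer choice. A minimal sketch:

```python
import pandas as pd

data = {
    "state": ["Ohio", "Ohio", "Ohio", "Nevada", "Nevada"],
    "year": [2000, 2001, 2002, 2001, 2002],
    "pop": [1.5, 1.7, 3.6, 2.4, 2.9],
}
df = pd.DataFrame(data)

pop_column = df["pop"]     # key access always returns the column
pop_attribute = df.pop     # attribute access finds the DataFrame.pop method
state_column = df.state    # no name clash, so this is the column

print(type(pop_column).__name__)
print(callable(pop_attribute))
```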


### Boolean Indexing

In [33]:
df[df['state']=='Ohio']

  state  year  pop
0  Ohio  2000  1.5
1  Ohio  2001  1.7
2  Ohio  2002  3.6


In [34]:
df[df['year']<2002]

    state  year  pop
0    Ohio  2000  1.5
1    Ohio  2001  1.7
3  Nevada  2001  2.4


In [35]:
df[ (df['state']=='Ohio') & (df['year']<2002) ]

  state  year  pop
0  Ohio  2000  1.5
1  Ohio  2001  1.7


In [36]:
df[ (df['state']=='Ohio') | (df['year']<2002) ]

    state  year  pop
0    Ohio  2000  1.5
1    Ohio  2001  1.7
2    Ohio  2002  3.6
3  Nevada  2001  2.4
3,Nevada,2001,2.4


In [37]:
x=df[ (df['state']=='Ohio') | (df['year']<2002) ]

In [38]:
x.index

Int64Index([0, 1, 2, 3], dtype='int64')
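
Two follow-ups to the filtering above, sketched under the assumption that the same `df` is in use. First, a filtered frame keeps the original row labels, so `reset_index(drop=True)` can be used when a fresh 0-based index is wanted. Second, `loc` accepts a boolean mask together with a list of columns. The parentheses around each comparison are required because `&` and `|` bind more tightly than `==` and `<`.

```python
import pandas as pd

data = {
    "state": ["Ohio", "Ohio", "Ohio", "Nevada", "Nevada"],
    "year": [2000, 2001, 2002, 2001, 2002],
    "pop": [1.5, 1.7, 3.6, 2.4, 2.9],
}
df = pd.DataFrame(data)

nevada = df[df["state"] == "Nevada"]          # keeps original labels 3 and 4
renumbered = nevada.reset_index(drop=True)    # new labels 0 and 1

mask = (df["state"] == "Ohio") | (df["year"] < 2002)
subset = df.loc[mask, ["state", "pop"]]       # filter rows and pick columns

print(list(nevada.index))       # [3, 4]
print(list(renumbered.index))   # [0, 1]
print(subset.shape)             # (4, 2)
```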

### Column Data Update

In [39]:
df2['debt'] = 16.5
df2

       year   state  pop  debt
one    2000    Ohio  1.5  16.5
two    2001    Ohio  1.7  16.5
three  2002    Ohio  3.6  16.5
four   2001  Nevada  2.4  16.5
five   2002  Nevada  2.9  16.5


In [40]:
df2['debt'] = range(5)
df2

       year   state  pop  debt
one    2000    Ohio  1.5     0
two    2001    Ohio  1.7     1
three  2002    Ohio  3.6     2
four   2001  Nevada  2.4     3
five   2002  Nevada  2.9     4


In [41]:
df2['debt'] = pd.Series([-1.2, -1.5, -1.7], index=['two', 'four', 'five'])
df2

       year   state  pop  debt
one    2000    Ohio  1.5   NaN
two    2001    Ohio  1.7  -1.2
three  2002    Ohio  3.6   NaN
four   2001  Nevada  2.4  -1.5
five   2002  Nevada  2.9  -1.7
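
Assigning a Series to a column aligns on index labels rather than on position. The sketch below repeats the idea with one deliberately unmatched label, `"six"`, which is a made-up label not present in `df2`: rows without a matching label receive NaN, and Series entries without a matching row are ignored.

```python
import pandas as pd

data = {
    "state": ["Ohio", "Ohio", "Ohio", "Nevada", "Nevada"],
    "year": [2000, 2001, 2002, 2001, 2002],
    "pop": [1.5, 1.7, 3.6, 2.4, 2.9],
}
df2 = pd.DataFrame(data,
                   columns=["year", "state", "pop"],
                   index=["one", "two", "three", "four", "five"])

# "six" is hypothetical and matches no row in df2.
debt = pd.Series([-1.2, -1.5, 9.9], index=["two", "four", "six"])
df2["debt"] = debt

missing_count = int(df2["debt"].isnull().sum())

print(df2["debt"].to_dict())
print(missing_count)           # 3 rows had no matching label
print("six" in df2.index)      # False: the extra entry was not added
```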


### Adding a New Column

In [42]:
df2['new']=df2['pop']*2
df2

       year   state  pop  debt  new
one    2000    Ohio  1.5   NaN  3.0
two    2001    Ohio  1.7  -1.2  3.4
three  2002    Ohio  3.6   NaN  7.2
four   2001  Nevada  2.4  -1.5  4.8
five   2002  Nevada  2.9  -1.7  5.8


### Delete Column

In [43]:
del df2['new']
df2

       year   state  pop  debt
one    2000    Ohio  1.5   NaN
two    2001    Ohio  1.7  -1.2
three  2002    Ohio  3.6   NaN
four   2001  Nevada  2.4  -1.5
five   2002  Nevada  2.9  -1.7
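
`del` removes the column from the DataFrame in place. As an alternative not used in the cells above, `DataFrame.drop(columns=...)` returns a new DataFrame and leaves the original untouched. A minimal sketch comparing the two:

```python
import pandas as pd

data = {
    "state": ["Ohio", "Ohio", "Ohio", "Nevada", "Nevada"],
    "year": [2000, 2001, 2002, 2001, 2002],
    "pop": [1.5, 1.7, 3.6, 2.4, 2.9],
}
df2 = pd.DataFrame(data,
                   columns=["year", "state", "pop"],
                   index=["one", "two", "three", "four", "five"])
df2["new"] = df2["pop"] * 2

# drop returns a copy without the column; df2 itself is unchanged.
without_new = df2.drop(columns=["new"])
still_has_new = "new" in df2.columns

# del modifies df2 in place.
del df2["new"]

print(list(without_new.columns))
print(still_has_new)
print(list(df2.columns))
```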
