# Pandas

Pandas objects can be thought of as enhanced versions of
NumPy structured arrays in which the rows and columns are identified with labels
rather than simple integer indices.

The three fundamental Pandas data structures are the Series, the DataFrame, and the Index.



# Pandas Series Object
A Pandas Series is a one-dimensional array of indexed data. 
It can be created from a list or array as follows:

In [12]:
import pandas as pd
data = pd.Series([2,3,4.3,2],index=['a','c','b','e']) # a Series can carry an explicit index (unlike a NumPy array)
data

a    2.0
c    3.0
b    4.3
e    2.0
dtype: float64

In [5]:
data.values

array([2. , 3. , 4.3, 2. ])

In [6]:
data.index

Index(['a', 'c', 'b', 'e'], dtype='object')

In [7]:
data[1]  # same as data['c']; recent pandas versions deprecate this positional fallback, data.iloc[1] is the explicit form

3.0

In [8]:
data[1:3]

c    3.0
b    4.3
dtype: float64

In [9]:
# A Series can also be built from a dictionary

population_dict =  {'California': 38332521,
                    'Texas': 26448193,'New York': 19651127,
                    'Florida': 19552860,'Illinois': 12882135}  # define the dictionary

population = pd.Series(population_dict)
population

California    38332521
Texas         26448193
New York      19651127
Florida       19552860
Illinois      12882135
dtype: int64

In [8]:
population['Texas']  # index by key

26448193

In [10]:
# Unlike a dictionary, a Series also supports array-style operations such as slicing
population['California':'New York']

California    38332521
Texas         26448193
New York      19651127
dtype: int64

In [11]:
# The general form is pd.Series(data, index=index),
# where data can be an array or a list

pd.Series(5, index=[100,200,300])
# data can also be a single scalar, which is repeated to fill every index entry

100    5
200    5
300    5
dtype: int64

In [13]:
# data can also be a dictionary, in which case its keys become the index
pd.Series({2:'a',1:'b',3:'c'})

2    a
1    b
3    c
dtype: object

In [14]:
# Passing index selects only the requested entries from the dictionary
pd.Series({2:'a',1:'b',3:'c'}, index=[3,2])

3    c
2    a
dtype: object

In [11]:
pd.Series({2:'a',1:'b',3:'c',8:'d'}, index=[3,8])

3    c
8    d
dtype: object
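
When the `index` argument names a key that the dictionary does not contain, pandas does not raise an error; it inserts a missing value instead. A minimal sketch of that behavior (the key `4` is an illustrative choice, not taken from the cells above):

```python
import pandas as pd

# Key 4 is absent from the dictionary, so its entry becomes NaN.
s = pd.Series({2: 'a', 1: 'b', 3: 'c'}, index=[3, 4])
print(s)
print(s.isnull().tolist())
```

This is worth keeping in mind when the index list is typed by hand: a misspelled key silently produces a NaN entry rather than an exception.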

# The Pandas DataFrame Object
If a Series is an analog of a one-dimensional array with flexible indices, 
a DataFrame is an analog of a two-dimensional array with both flexible row indices and flexible column names.

In [17]:
population_dict =  {'California': 38332521,
                    'Texas': 26448193,
                    'New York': 19651127,
                    'Florida': 19552860,
                    'Illinois': 12882135}  # the dictionary defined earlier
population = pd.Series(population_dict)

area_dict = {'California': 423967, 
             'Texas': 695662, 
             'New York': 141297,
             'Florida': 170312, 
             'Illinois': 149995}  # a new dictionary with the same keys
area = pd.Series(area_dict)

# pd.DataFrame builds a two-dimensional object
# by combining two Series that share the same keys

states = pd.DataFrame({'population': population, 'area': area})
states

Unnamed: 0,population,area
California,38332521,423967
Texas,26448193,695662
New York,19651127,141297
Florida,19552860,170312
Illinois,12882135,149995


In [20]:
# Like a Series, a DataFrame has an index.
states.index

Index(['California', 'Texas', 'New York', 'Florida', 'Illinois'], dtype='object')

In [21]:
# It also has a columns attribute
states.columns

Index(['population', 'area'], dtype='object')

### DataFrame as specialized dictionary

Similarly, we can also think of a DataFrame as a specialization of a dictionary. Where a dictionary maps a key to a value, **a DataFrame maps a column name to a Series of column data**.
For example, asking for the 'area' column returns the Series object containing the areas we saw earlier:

In [22]:
states['area'] # indexing by column name returns the contents of that Series
# think of the column name as the key and the corresponding Series as the value

California    423967
Texas         695662
New York      141297
Florida       170312
Illinois      149995
Name: area, dtype: int64
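
The dictionary analogy also covers membership tests and iteration, which operate on column names rather than on row labels. A small sketch, using a two-state subset of the data above to keep it short:

```python
import pandas as pd

population = pd.Series({'California': 38332521, 'Texas': 26448193})
area = pd.Series({'California': 423967, 'Texas': 695662})
states = pd.DataFrame({'population': population, 'area': area})

# Membership tests look at column names, like dict keys.
print('area' in states)    # a column name
print('Texas' in states)   # a row label, which is not a key
print(list(states.keys()))

# items() yields (column name, Series) pairs.
for name, column in states.items():
    print(name, column.sum())
```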

### Ways to construct a pandas DataFrame

In [24]:
## From a single Series object. 

pd.DataFrame(population, columns=['population'])

Unnamed: 0,population
California,38332521
Texas,26448193
New York,19651127
Florida,19552860
Illinois,12882135


In [26]:
## From a list of dicts

data = [{'a': i, 'b': 2 * i}
        for i in range(3)]    # [{'a':0,'b':0}, {'a':1,'b':2}, {'a':2,'b':4}]
pd.DataFrame(data)

Unnamed: 0,a,b
0,0,0
1,1,2
2,2,4


In [27]:
## Even if some keys in the dictionary are missing, Pandas will fill them in with NaN 

pd.DataFrame([{'a': 1, 'b': 2}, {'b': 3, 'c': 4}])

Unnamed: 0,a,b,c
0,1.0,2,
1,,3,4.0
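
A side effect of this NaN filling is visible in the output above: columns that receive a NaN are shown as floats (`1.0`, `4.0`), while the complete column `b` keeps its integers. The sketch below checks the dtypes directly; it assumes the default NumPy-backed dtypes rather than pandas' nullable integer types.

```python
import pandas as pd

df = pd.DataFrame([{'a': 1, 'b': 2}, {'b': 3, 'c': 4}])

# Columns 'a' and 'c' each contain a NaN, so they are stored as float64;
# column 'b' is complete and stays an integer column.
print(df.dtypes)
```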


In [28]:
## From a dictionary of Series objects

pd.DataFrame({'population': population,
              'area': area})

Unnamed: 0,population,area
California,38332521,423967
Texas,26448193,695662
New York,19651127,141297
Florida,19552860,170312
Illinois,12882135,149995


In [14]:
## From a two-dimensional NumPy array
import numpy as np

pd.DataFrame(np.random.rand(3, 2),
             columns=['foo', 'bar'],
             index=['a','b','c']) 
# assign row and column names to a two-dimensional NumPy array

Unnamed: 0,foo,bar
a,0.718294,0.190204
b,0.304971,0.451198
c,0.66371,0.588608


In [31]:
# If index or columns are omitted, an integer index is used for each:

pd.DataFrame(np.random.rand(3, 2),
             columns=['foo', 'bar'])

Unnamed: 0,foo,bar
0,0.237443,0.10429
1,0.797158,0.207638
2,0.500714,0.388874


In [34]:
## From a NumPy structured array. 
A = np.zeros(3, dtype=[('A', 'i8'), ('B', 'f8')])
print(A)
pd.DataFrame(A)

[(0, 0.) (0, 0.) (0, 0.)]


Unnamed: 0,A,B
0,0,0.0
1,0,0.0
2,0,0.0


# Pandas Index Object

In [15]:
ind = pd.Index([2,3,5,7,11])

# can be indexed like an array
ind[1]

3

In [17]:
ind[::2] # every second element, from start to end

Int64Index([2, 5, 11], dtype='int64')

In [36]:
# has the same attributes as a NumPy array
print(ind.size, ind.shape, ind.ndim, ind.dtype) 

5 (5,) 1 int64


In [37]:
# One difference between Index objects and NumPy arrays is that
# indices are immutable—that is, they cannot be modified via the normal means:

ind[1] = 0

TypeError: Index does not support mutable operations
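
Since an Index cannot be changed in place, the usual workaround is to build a new one. The sketch below catches the error and then derives a modified copy with `Index.delete` and `Index.insert`; the variable name `new_ind` is illustrative.

```python
import pandas as pd

ind = pd.Index([2, 3, 5, 7, 11])

try:
    ind[1] = 0                      # in-place assignment is rejected
except TypeError as err:
    print('TypeError:', err)

# Build a new Index rather than modifying the old one.
new_ind = ind.delete(1).insert(1, 0)
print(list(new_ind))
print(list(ind))                    # the original is unchanged
```

Immutability is what makes it safe to share one Index between several Series and DataFrames.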

In [None]:
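# Index as ordered set
# The Index object follows many of the conventions used by Python’s built-in set data structure

indA = pd.Index([1,3,5,7,9])
indB = pd.Index([2,3,5,7,11])

# The &, | and ^ operators performed these set operations in older pandas;
# in recent versions (2.0 and later) they act element-wise, so the named methods are used here.
indA.intersection(indB)          # labels in both: 3, 5, 7
indA.union(indB)                 # labels in either: 1, 2, 3, 5, 7, 9, 11
indA.symmetric_difference(indB)  # labels in exactly one: 1, 2, 9, 11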

# Data Indexing and Selection
We look at selection in Series first, then in DataFrame.

### Data Selection in Series
A Series behaves in many ways like a one-dimensional NumPy array and in many ways like a Python dictionary.


In [18]:
## Series as dictionary

data = pd.Series([0.25, 0.5, 0.75, 1.0],
                 index=['a', 'b', 'c', 'd'])

data
# Like a dictionary, a Series maps keys to values;
# here the index plays the role of the keys

a    0.25
b    0.50
c    0.75
d    1.00
dtype: float64

In [19]:
'a' in data

True

In [39]:
data.keys()

Index(['a', 'b', 'c', 'd'], dtype='object')

In [42]:
list(data.items())

[('a', 0.25), ('b', 0.5), ('c', 0.75), ('d', 1.0)]

In [43]:
# Series objects can even be modified with a dictionary-like syntax.
# You can extend a Series by assigning to a new index value,
# i.e., new entries can be added.

data['e'] = 1.24 
data

a    0.25
b    0.50
c    0.75
d    1.00
e    1.24
dtype: float64

In [44]:
## Series as one-dimensional array
data['a':'c']

a    0.25
b    0.50
c    0.75
dtype: float64

In [45]:
data[0:2]

a    0.25
b    0.50
dtype: float64

#### Caution

Slicing with an explicit index, as in data['a':'c'], includes the final index in the slice.
Slicing with an implicit index, as in data[0:2], excludes the final index.
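
The difference in endpoint handling can be checked side by side. This sketch uses `loc` and `iloc` to make the intent of each slice explicit; `by_label` and `by_position` are illustrative names.

```python
import pandas as pd

data = pd.Series([0.25, 0.5, 0.75, 1.0], index=['a', 'b', 'c', 'd'])

by_label = data.loc['a':'c']     # label slice: 'c' is included
by_position = data.iloc[0:2]     # positional slice: position 2 is excluded

print(list(by_label.index))
print(list(by_position.index))
```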


In [46]:
data[(data > 0.3) & (data < 0.8)]

b    0.50
c    0.75
dtype: float64

In [47]:
data[['a', 'e']]  # to select specific entries, pass a list of labels (double brackets)

a    0.25
e    1.24
dtype: float64

In [48]:
data['a', 'e'] # with single brackets this raises an error

KeyError: ('a', 'e')

In [49]:
# Indexers: loc and iloc (the older ix indexer was removed in pandas 1.0)

data = pd.Series(['a', 'b', 'c'], index=[1, 3, 5]) ; data

1    a
3    b
5    c
dtype: object

In [50]:
data[1] # explicit index

'a'

In [51]:
data[1:3] # implicit index

# Confusing: data[1] uses the explicit index while data[1:3] uses the implicit one, which is why loc and iloc exist

3    b
5    c
dtype: object

In [52]:
## loc attribute always references the "explicit index"
data.loc[1]

'a'

In [54]:
data.loc[1:3]

1    a
3    b
dtype: object

In [55]:
## iloc attribute references the "implicit index" 
data.iloc[1]

'b'

In [56]:
data.iloc[1:3]

3    b
5    c
dtype: object

### Data Selection in DataFrame

A DataFrame behaves in many ways like a two-dimensional or structured array, and in many ways like a dictionary of Series sharing the same index.

In [21]:
## DataFrame as a dictionary of related Series objects
area = pd.Series({'California': 423967, 'Texas': 695662,
                  'New York': 141297, 'Florida': 170312,'Illinois': 149995})

pop = pd.Series({'California': 38332521, 'Texas': 26448193,
                 'New York': 19651127, 'Florida': 19552860,'Illinois': 12882135})

# combine two one-dimensional Series into a DataFrame
data = pd.DataFrame({'area_col':area, 'pop_col':pop}); data

Unnamed: 0,area_col,pop_col
California,423967,38332521
Texas,695662,26448193
New York,141297,19651127
Florida,170312,19552860
Illinois,149995,12882135


In [22]:
data['area_col']

California    423967
Texas         695662
New York      141297
Florida       170312
Illinois      149995
Name: area_col, dtype: int64

In [23]:
data.area_col

California    423967
Texas         695662
New York      141297
Florida       170312
Illinois      149995
Name: area_col, dtype: int64

In [24]:
data['density'] = data['pop_col'] / data['area_col'] # add a new column
data

Unnamed: 0,area_col,pop_col,density
California,423967,38332521,90.413926
Texas,695662,26448193,38.01874
New York,141297,19651127,139.076746
Florida,170312,19552860,114.806121
Illinois,149995,12882135,85.883763


In [25]:
## DataFrame as two-dimensional array 
# A DataFrame can also be viewed as a two-dimensional array

data.values

array([[4.23967000e+05, 3.83325210e+07, 9.04139261e+01],
       [6.95662000e+05, 2.64481930e+07, 3.80187404e+01],
       [1.41297000e+05, 1.96511270e+07, 1.39076746e+02],
       [1.70312000e+05, 1.95528600e+07, 1.14806121e+02],
       [1.49995000e+05, 1.28821350e+07, 8.58837628e+01]])

In [26]:
data.T

Unnamed: 0,California,Texas,New York,Florida,Illinois
area_col,423967.0,695662.0,141297.0,170312.0,149995.0
pop_col,38332520.0,26448190.0,19651130.0,19552860.0,12882140.0
density,90.41393,38.01874,139.0767,114.8061,85.88376


In [27]:
# Another point to watch when indexing a DataFrame:
# indexing rows and indexing columns work differently

data.values[0]  # selects a row (implicit, positional index)

array([4.23967000e+05, 3.83325210e+07, 9.04139261e+01])

In [28]:
data['area_col'] # selects a column (explicit index)

California    423967
Texas         695662
New York      141297
Florida       170312
Illinois      149995
Name: area_col, dtype: int64
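
Going through `data.values` returns a bare NumPy array and loses the column labels. As a sketch of a label-preserving alternative, `loc` and `iloc` return the row as a Series indexed by column name (a two-state subset is used here for brevity):

```python
import pandas as pd

area = pd.Series({'California': 423967, 'Texas': 695662})
pop = pd.Series({'California': 38332521, 'Texas': 26448193})
data = pd.DataFrame({'area_col': area, 'pop_col': pop})

row_by_label = data.loc['Texas']   # row selected by index label
row_by_position = data.iloc[1]     # the same row selected by position

print(row_by_label)
print(row_by_label.equals(row_by_position))
```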

In [29]:
# Pandas again uses the loc and iloc indexers mentioned earlier

data

Unnamed: 0,area_col,pop_col,density
California,423967,38332521,90.413926
Texas,695662,26448193,38.01874
New York,141297,19651127,139.076746
Florida,170312,19552860,114.806121
Illinois,149995,12882135,85.883763


In [30]:
data.iloc[:3,:2]  # implicit index (rows 0, 1, 2 / columns 0, 1)

Unnamed: 0,area_col,pop_col
California,423967,38332521
Texas,695662,26448193
New York,141297,19651127


In [32]:
data.loc[:'Illinois',:'pop_col'] # explicit index

Unnamed: 0,area_col,pop_col
California,423967,38332521
Texas,695662,26448193
New York,141297,19651127
Florida,170312,19552860
Illinois,149995,12882135


In [33]:
# The ix indexer allowed a hybrid of these two approaches, but it was deprecated
# and then removed in pandas 1.0. Chaining iloc and loc gives the same selection:

data.iloc[:3].loc[:, :'pop_col'] # (rows 0, 1, 2 / columns from the first through pop_col)


Unnamed: 0,area_col,pop_col
California,423967,38332521
Texas,695662,26448193
New York,141297,19651127
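
`.ix` is deprecated and no longer exists in pandas 1.0 and later, so a mixed positional and label-based selection has to be written with `loc` or `iloc` alone. The sketch below shows two ways that should produce the table above; the names `via_loc`, `via_iloc`, and `stop` are illustrative, and a four-state subset keeps it short.

```python
import pandas as pd

area = pd.Series({'California': 423967, 'Texas': 695662,
                  'New York': 141297, 'Florida': 170312})
pop = pd.Series({'California': 38332521, 'Texas': 26448193,
                 'New York': 19651127, 'Florida': 19552860})
data = pd.DataFrame({'area_col': area, 'pop_col': pop})
data['density'] = data['pop_col'] / data['area_col']

# Option 1: turn the row positions into labels, then use loc only.
via_loc = data.loc[data.index[:3], :'pop_col']

# Option 2: turn the column label into a position, then use iloc only.
stop = data.columns.get_loc('pop_col') + 1   # +1 because iloc excludes the stop
via_iloc = data.iloc[:3, :stop]

print(via_loc.equals(via_iloc))
```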


In [76]:
# With the loc indexer we can combine masking and fancy indexing

data.loc[data.density > 100, ['pop_col', 'density']]
# selects only the pop_col and density columns for rows whose density exceeds 100

Unnamed: 0,pop_col,density
New York,19651127,139.076746
Florida,19552860,114.806121


In [78]:
data.iloc[0, 2] = 90  # set the value at row 0, column 2 to 90
data

Unnamed: 0,area_col,pop_col,density
California,423967,38332521,90.0
Texas,695662,26448193,38.01874
New York,141297,19651127,139.076746
Florida,170312,19552860,114.806121
Illinois,149995,12882135,85.883763


In [79]:
# Additional indexing conventions

data['Florida':'Illinois']

Unnamed: 0,area_col,pop_col,density
Florida,170312,19552860,114.806121
Illinois,149995,12882135,85.883763


In [80]:
data[1:3]  # rows at positions 1 and 2

Unnamed: 0,area_col,pop_col,density
Texas,695662,26448193,38.01874
New York,141297,19651127,139.076746


In [81]:
data[data.density > 100]  # rows whose density exceeds 100

Unnamed: 0,area_col,pop_col,density
New York,141297,19651127,139.076746
Florida,170312,19552860,114.806121


# Handling Missing Data

In [34]:
# Detecting null values
data = pd.Series([1, np.nan, 'hello', None])
data

0        1
1      NaN
2    hello
3     None
dtype: object

In [35]:
data.isnull()

0    False
1     True
2    False
3     True
dtype: bool
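
Because the result of `isnull()` is a Boolean mask, summing it is a quick way to count the missing entries. A minimal sketch (`n_missing` is an illustrative name):

```python
import numpy as np
import pandas as pd

data = pd.Series([1, np.nan, 'hello', None])

# True counts as 1, so summing the mask counts the missing entries.
n_missing = data.isnull().sum()
print(n_missing)
print(data.notnull().sum())
```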

In [36]:
data[data.notnull()] # keep only the values that are not NaN/None

0        1
2    hello
dtype: object

In [37]:
data.dropna() # drop the NaN/None values

0        1
2    hello
dtype: object

In [38]:
# for a DataFrame

df = pd.DataFrame([[1, np.nan, 2],
                   [2, 3, 5],
                   [np.nan, 4, 6]])
df

Unnamed: 0,0,1,2
0,1.0,,2
1,2.0,3.0,5
2,,4.0,6


In [39]:
# We cannot drop single values from a DataFrame; 
# we can only drop full rows or full columns.

df.dropna() 
# on a DataFrame, dropna drops every row that contains at least one NaN/None value

Unnamed: 0,0,1,2
1,2.0,3.0,5


In [40]:
df.dropna(axis=1)
# drops every column that contains at least one NaN/None value

Unnamed: 0,2
0,2
1,5
2,6


In [41]:
df[3] = np.nan
df

Unnamed: 0,0,1,2,3
0,1.0,,2,
1,2.0,3.0,5,
2,,4.0,6,


In [43]:
df.dropna(axis = 1, how = 'all')
# drops only the columns whose values are all NaN

Unnamed: 0,0,1,2
0,1.0,,2
1,2.0,3.0,5
2,,4.0,6


In [45]:
df.dropna(thresh=3) # keep only rows with at least 3 non-missing values
                    # without an axis argument, the default is rows (axis=0)

Unnamed: 0,0,1,2,3
1,2.0,3.0,5,
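
To see why only row 1 survives, it helps to count the non-missing values per row. The sketch below rebuilds the same frame and reproduces `dropna(thresh=3)` by hand; `counts` and `manual` are illustrative names.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1, np.nan, 2],
                   [2, 3, 5],
                   [np.nan, 4, 6]])
df[3] = np.nan

# Number of non-missing values in each row.
counts = df.notnull().sum(axis=1)
print(counts.tolist())

# Keeping rows whose count is at least 3 reproduces dropna(thresh=3).
manual = df[counts >= 3]
print(manual.equals(df.dropna(thresh=3)))
```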


In [46]:
# Filling null values
data = pd.Series([1, np.nan, 2, None, 3], index=list('abcde'))
data

a    1.0
b    NaN
c    2.0
d    NaN
e    3.0
dtype: float64

In [47]:
data.fillna(0) # fill missing values with 0

a    1.0
b    0.0
c    2.0
d    0.0
e    3.0
dtype: float64

In [49]:
data.ffill() # forward fill; the older fillna(method='ffill') spelling is deprecated in recent pandas
# fills each NaN with the previous valid value

a    1.0
b    1.0
c    2.0
d    2.0
e    3.0
dtype: float64

In [50]:
data.bfill() # backward fill; the older fillna(method='bfill') spelling is deprecated in recent pandas
# fills each NaN with the next valid value

a    1.0
b    2.0
c    2.0
d    3.0
e    3.0
dtype: float64
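
The fill value does not have to be a constant or a neighbor; it can be computed from the data. As one illustrative option that the cells above do not use, the sketch below fills with the mean of the observed values (`filled` is an illustrative name):

```python
import numpy as np
import pandas as pd

data = pd.Series([1, np.nan, 2, None, 3], index=list('abcde'))

# mean() skips missing values by default: (1 + 2 + 3) / 3 = 2.0
filled = data.fillna(data.mean())
print(filled.tolist())
```

Whether a mean, a constant, or a neighboring value is appropriate depends on what the missing entries represent.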

In [51]:
# for a DataFrame, an axis argument selects the direction of the fill
df

Unnamed: 0,0,1,2,3
0,1.0,,2,
1,2.0,3.0,5,
2,,4.0,6,


In [59]:
df.ffill(axis=1)  # same as the older df.fillna(method='ffill', axis=1)
# fills along each row, using the value from the preceding column

Unnamed: 0,0,1,2,3
0,1.0,1.0,2.0,2.0
1,2.0,3.0,5.0,5.0
2,,4.0,6.0,6.0


In [57]:
df.bfill()  # same as the older df.fillna(method='bfill')
# fills along each column (the default axis), using the value from the following row

Unnamed: 0,0,1,2,3
0,1.0,3.0,2,
1,2.0,3.0,5,
2,,4.0,6,


# Combining Datasets: Concat and Append

In [5]:
# combine the contents of two or more arrays into a single array:
import numpy as np
import pandas as pd

x = [1,2,3]
y = [4,5,6]
z = [7,8,9]

np.concatenate([x,y,z])

array([1, 2, 3, 4, 5, 6, 7, 8, 9])

In [12]:
ser1 = pd.Series(['A', 'B', 'C'], index=[1, 2, 3])
ser2 = pd.Series(['D', 'E', 'F'], index=[4, 5, 6])

print(pd.concat([ser1,ser2]))
print('---------------')
pd.concat([ser1,ser2], axis = 1)

1    A
2    B
3    C
4    D
5    E
6    F
dtype: object
---------------


Unnamed: 0,0,1
1,A,
2,B,
3,C,
4,,D
5,,E
6,,F


In [22]:
# Try it with DataFrames
def make_df(cols, ind):
    """Quickly make a DataFrame"""
    data = {c: [str(c) + str(i) for i in ind]
            for c in cols}
    return pd.DataFrame(data, ind)

df1 = make_df('AB', [1, 2]) ; print(df1)
df2 = make_df('AB', [3, 4]) ; print(df2)
pd.concat([df1,df2]) # by default, rows are stacked vertically

    A   B
1  A1  B1
2  A2  B2
    A   B
3  A3  B3
4  A4  B4


Unnamed: 0,A,B
1,A1,B1
2,A2,B2
3,A3,B3
4,A4,B4


In [28]:
df3 = make_df('AB', [0, 1])
df4 = make_df('CD', [0, 1])
print(df3); print(df4)
pd.concat([df3, df4], axis = 1) # with axis=1, the frames are joined side by side

    A   B
0  A0  B0
1  A1  B1
    C   D
0  C0  D0
1  C1  D1


Unnamed: 0,A,B,C,D
0,A0,B0,C0,D0
1,A1,B1,C1,D1


In [31]:
x = make_df('AB', [0, 1])
y = make_df('AB', [2, 3])
y.index = x.index # make the indices identical

In [30]:
print(pd.concat([x,y]))
print(pd.concat([x,y],ignore_index = True))

    A   B
0  A0  B0
1  A1  B1
0  A2  B2
1  A3  B3
    A   B
0  A0  B0
1  A1  B1
2  A2  B2
3  A3  B3
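
Besides `ignore_index`, `pd.concat` has two other options for handling repeated index labels: `verify_integrity=True` raises an error instead of silently duplicating, and `keys` tags each source to produce a hierarchical index. A sketch, with the frames written out directly rather than through `make_df` so that it runs on its own:

```python
import pandas as pd

x = pd.DataFrame({'A': ['A0', 'A1'], 'B': ['B0', 'B1']}, index=[0, 1])
y = pd.DataFrame({'A': ['A2', 'A3'], 'B': ['B2', 'B3']}, index=[0, 1])

# Option 1: ask concat to raise if the result would repeat an index label.
raised = False
try:
    pd.concat([x, y], verify_integrity=True)
except ValueError as err:
    raised = True
    print('ValueError:', err)

# Option 2: label each source, which produces a two-level (hierarchical) index.
labeled = pd.concat([x, y], keys=['x', 'y'])
print(labeled)
print(labeled.loc['y'])   # rows that came from y
```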


In [37]:
df5 = make_df('ABC', [1, 2])
df6 = make_df('BCD', [3, 4])
print(df5); print(df6); 
pd.concat([df5, df6]) # by default, entries with no data are filled with NaN

    A   B   C
1  A1  B1  C1
2  A2  B2  C2
    B   C   D
3  B3  C3  D3
4  B4  C4  D4




Unnamed: 0,A,B,C,D
1,A1,B1,C1,
2,A2,B2,C2,
3,,B3,C3,D3
4,,B4,C4,D4


In [38]:
# To combine only the shared columns (those present in both frames),
# use join='inner': the common columns are kept and the rest are dropped
pd.concat([df5,df6], join = 'inner')

Unnamed: 0,B,C
1,B1,C1
2,B2,C2
3,B3,C3
4,B4,C4
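
A middle ground between the default outer join and `join='inner'` is to keep exactly the columns of one input. One possible way, sketched here with the frames written out directly, is to concatenate first and then select the first frame's columns (`result` is an illustrative name):

```python
import pandas as pd

df5 = pd.DataFrame({'A': ['A1', 'A2'], 'B': ['B1', 'B2'], 'C': ['C1', 'C2']},
                   index=[1, 2])
df6 = pd.DataFrame({'B': ['B3', 'B4'], 'C': ['C3', 'C4'], 'D': ['D3', 'D4']},
                   index=[3, 4])

# Keep exactly df5's columns: 'D' is dropped and 'A' is NaN for df6's rows.
result = pd.concat([df5, df6])[list(df5.columns)]
print(result)
```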
