## pandas

https://pandas.pydata.org/

**The big data era**
- For the analysis process of extracting useful information from data,
- pandas is a tool optimized for collecting and organizing that data

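## pandas Data Structures

- Data collected from various sources for analysis varies widely in form and attributes
- Data in different formats must be unified into a structure with one consistent format so that the computer can process it
- pandas provides two structured data types for this: **Series and DataFrame**
- Their purpose is to organize many different kinds of data into a common format
- DataFrame: a two-dimensional structure of rows and columns, used frequently in practical data analysis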

### 1. Series

- A one-dimensional array in which data is laid out sequentially
- Each index label corresponds one-to-one with a data value
- Similar in structure to a Python dictionary

### Dictionary => Series  
pandas.Series(dictionary)

In [5]:
import pandas as pd

In [3]:
dict_data = {'a':1,'b':2,'c':3}
sr = pd.Series(dict_data)
print(type(sr)); print()
print(sr)

<class 'pandas.core.series.Series'>

a    1
b    2
c    3
dtype: int64


In [4]:
obj = pd.Series([4,7,-5,3])
print(obj)

0    4
1    7
2   -5
3    3
dtype: int64


## Series index / values
- Series object `.index` : the array of index labels
- Series object `.values` : the array of data values

In [5]:
print(obj.values)
print(obj.index)

[ 4  7 -5  3]
RangeIndex(start=0, stop=4, step=1)


In [6]:
import pandas as pd
obj2 = pd.Series([4,7,-5,3], index=['d','b','a','c'])
print(obj2)
print(obj2.index)

d    4
b    7
a   -5
c    3
dtype: int64
Index(['d', 'b', 'a', 'c'], dtype='object')


In [7]:
import numpy as np
import pandas as pd
list_A = np.array(list('adcdef'))
list_B = np.arange(10, 70, 10)
dict_data = {key:value for key, value in zip(list_A, list_B)}
print(dict_data)
sr = pd.Series(dict_data)

{'a': 10, 'd': 40, 'c': 30, 'e': 50, 'f': 60}


In [8]:
import numpy as np
import pandas as pd
list_A = np.array(list('adcdef'))
list_B = np.arange(10, 70, 10)
sr = pd.Series(list_B, index=list_A)
for i in range(sr.size):
    key = sr.index[i]
    print("sr['{}'] : {} or sr[{}] : {}".format(key,sr[key],i, sr.values[i]))

sr['a'] : 10 or sr[0] : 10
sr['d'] : d    20
d    40
dtype: int32 or sr[1] : 20
sr['c'] : 30 or sr[2] : 30
sr['d'] : d    20
d    40
dtype: int32 or sr[3] : 40
sr['e'] : 50 or sr[4] : 50
sr['f'] : 60 or sr[5] : 60
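
The multi-line output for `sr['d']` above appears because the label `'d'` occurs twice in the index built from `'adcdef'`, so a label lookup returns a Series instead of a single value. A minimal sketch of that behavior (the name `dup` is introduced here only for illustration):

```python
import numpy as np
import pandas as pd

# Same data as above: the label 'd' occurs twice in the index.
dup = pd.Series(np.arange(10, 70, 10), index=list('adcdef'))

print(dup.index.is_unique)   # False: at least one label repeats
print(type(dup['a']))        # a unique label yields a scalar
print(type(dup['d']))        # a repeated label yields a Series

# Positional access with iloc always yields exactly one value.
print(dup.iloc[1], dup.iloc[3])
```

Checking `index.is_unique` before relying on label lookups is one way to avoid this surprise.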


In [9]:
print(sr.index[0])
print(sr['a'], sr[0], sr.values[0])

a
10 10 10


In [10]:
print(obj2); print()
print(obj2[obj2>0])

d    4
b    7
a   -5
c    3
dtype: int64

d    4
b    7
c    3
dtype: int64


In [11]:
print(obj2*2)

d     8
b    14
a   -10
c     6
dtype: int64


In [12]:
print(np.exp(obj2))

d      54.598150
b    1096.633158
a       0.006738
c      20.085537
dtype: float64


In [13]:
print('b' in obj2)
print('e' in obj2)

True
False


In [14]:
sdata = {'Ohio':35000, 'Texas':71000, 'Oregon':15000, 'Utah':5000}
obj3 = pd.Series(sdata)
print(obj3)

Ohio      35000
Texas     71000
Oregon    15000
Utah       5000
dtype: int64


In [15]:
states = ['California', 'Ohio', 'Texas', 'Oregon']
print(type(states))
obj4 = pd.Series(sdata, index = states)
print(obj4)

<class 'list'>
California        NaN
Ohio          35000.0
Texas         71000.0
Oregon        15000.0
dtype: float64


In [18]:
import pandas as pd
print(pd.isnull(obj4))
print(pd.notnull(obj4))

California     True
Ohio          False
Texas         False
Oregon        False
dtype: bool
California    False
Ohio           True
Texas          True
Oregon         True
dtype: bool


In [19]:
print(obj4.isnull())

California     True
Ohio          False
Texas         False
Oregon        False
dtype: bool


In [22]:
print(obj3); print()
print(obj4); print()
print(obj3+obj4)

Ohio      35000
Texas     71000
Oregon    15000
Utah       5000
dtype: int64

California        NaN
Ohio          35000.0
Texas         71000.0
Oregon        15000.0
dtype: float64

California         NaN
Ohio           70000.0
Oregon         30000.0
Texas         142000.0
Utah               NaN
dtype: float64
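
The `NaN` for Utah above comes from label alignment: `+` matches values by index label, and a label missing on either side produces `NaN`. The sketch below rebuilds the same two Series and shows `Series.add` with `fill_value=0` as one possible way to keep one-sided values (the names `total` and `filled` are illustrative):

```python
import pandas as pd

sdata = {'Ohio': 35000, 'Texas': 71000, 'Oregon': 15000, 'Utah': 5000}
obj3 = pd.Series(sdata)
obj4 = pd.Series(sdata, index=['California', 'Ohio', 'Texas', 'Oregon'])

# Plain + aligns on labels; a label missing on either side gives NaN.
total = obj3 + obj4

# add() with fill_value treats a value missing on ONE side as 0.
# California stays NaN because it is missing on both sides.
filled = obj3.add(obj4, fill_value=0)
print(filled)
```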


In [28]:
# print(obj4.name)
obj4.name = 'population'
obj4.index.name = 'state'
print(obj4)

state
California        NaN
Ohio          35000.0
Texas         71000.0
Oregon        15000.0
Name: population, dtype: float64


In [29]:
obj.index=['Bob','Steve', 'jeff','Ryan']
print(obj)

Bob      4
Steve    7
jeff    -5
Ryan     3
dtype: int64


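### 2. DataFrame

- A two-dimensional array
- Derived from the data frame in R
- The same tabular layout is used in Excel, relational databases, and similar tools
- Each column is a Series object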
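
As a small illustration of the last point, selecting a single column from a DataFrame returns a Series that shares the row index. The table `small` below is a made-up example, not data from this notebook:

```python
import pandas as pd

# Hypothetical two-column table for illustration.
small = pd.DataFrame({'state': ['Ohio', 'Nevada'], 'pop': [1.5, 2.4]})

col = small['pop']                     # selecting one column returns a Series
print(type(small), type(col))
print(small.shape, col.shape)          # 2-D shape versus 1-D shape
print(col.index.equals(small.index))   # the column shares the row index
```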

In [48]:
data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada', 'Nevada'],
        'year': [2000, 2001, 2002, 2001, 2002, 2003],
        'pop': [1.5, 1.7, 3.6, 2.4, 2.9, 3.2]}
frame = pd.DataFrame(data)

In [49]:
frame

Unnamed: 0,state,year,pop
0,Ohio,2000,1.5
1,Ohio,2001,1.7
2,Ohio,2002,3.6
3,Nevada,2001,2.4
4,Nevada,2002,2.9
5,Nevada,2003,3.2


In [50]:
frame.head()

Unnamed: 0,state,year,pop
0,Ohio,2000,1.5
1,Ohio,2001,1.7
2,Ohio,2002,3.6
3,Nevada,2001,2.4
4,Nevada,2002,2.9


In [51]:
pd.DataFrame(data, columns = ['year', 'state', 'pop'])

Unnamed: 0,year,state,pop
0,2000,Ohio,1.5
1,2001,Ohio,1.7
2,2002,Ohio,3.6
3,2001,Nevada,2.4
4,2002,Nevada,2.9
5,2003,Nevada,3.2


In [58]:
import pandas as pd
frame2 = pd.DataFrame(data, columns = ['year','state', 'pop', 'debt'], index=['one', 'two','three','four','five','six'])
frame2

Unnamed: 0,year,state,pop,debt
one,2000,Ohio,1.5,
two,2001,Ohio,1.7,
three,2002,Ohio,3.6,
four,2001,Nevada,2.4,
five,2002,Nevada,2.9,
six,2003,Nevada,3.2,


In [60]:
print(frame2.columns)

Index(['year', 'state', 'pop', 'debt'], dtype='object')


In [61]:
frame2.rename(columns={'year':'YEA', 'state':'STA', 'pop':'POP', 'debt':'DEB'}, inplace=True)
frame2.rename(index={'one':'01', 'two':'02'}, inplace=True)
frame2.head()

Unnamed: 0,YEA,STA,POP,DEB
01,2000,Ohio,1.5,
02,2001,Ohio,1.7,
three,2002,Ohio,3.6,
four,2001,Nevada,2.4,
five,2002,Nevada,2.9,


In [62]:
frame2['STA']

01         Ohio
02         Ohio
three      Ohio
four     Nevada
five     Nevada
six      Nevada
Name: STA, dtype: object

In [72]:
frame2.loc['three']

YEA    2002
STA    Ohio
POP     3.6
DEB     NaN
Name: three, dtype: object

In [73]:
frame2.iloc[2]

YEA    2002
STA    Ohio
POP     3.6
DEB     NaN
Name: three, dtype: object

In [77]:
frame2['DEB'] = np.arange(1,13,2)
frame2

Unnamed: 0,YEA,STA,POP,DEB
01,2000,Ohio,1.5,1
02,2001,Ohio,1.7,3
three,2002,Ohio,3.6,5
four,2001,Nevada,2.4,7
five,2002,Nevada,2.9,9
six,2003,Nevada,3.2,11


In [79]:
val = pd.Series([-1.2,-1.5,-1.7], index=['02','four','six'])
frame2['DEB']=val
frame2

Unnamed: 0,YEA,STA,POP,DEB
01,2000,Ohio,1.5,
02,2001,Ohio,1.7,-1.2
three,2002,Ohio,3.6,
four,2001,Nevada,2.4,-1.5
five,2002,Nevada,2.9,
six,2003,Nevada,3.2,-1.7
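
The two assignments above behave differently: an array is assigned by position, while a Series is matched against the row labels, and rows without a matching label become `NaN`. A minimal sketch on a reduced, hypothetical table (`table` and the label `'not_a_row'` are illustrative):

```python
import numpy as np
import pandas as pd

table = pd.DataFrame({'POP': [1.5, 1.7, 3.6]}, index=['01', '02', 'three'])

# An array is assigned by position and must match the number of rows.
table['DEB'] = np.arange(3)

# A Series is matched by index label; rows without a match become NaN,
# and labels that do not exist in the table are ignored.
val = pd.Series([-1.2, 9.9], index=['02', 'not_a_row'])
table['DEB'] = val
print(table)
```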


In [81]:
frame2['eastern']=frame2.STA == 'Ohio'
frame2

Unnamed: 0,YEA,STA,POP,DEB,eastern
01,2000,Ohio,1.5,,True
02,2001,Ohio,1.7,-1.2,True
three,2002,Ohio,3.6,,True
four,2001,Nevada,2.4,-1.5,False
five,2002,Nevada,2.9,,False
six,2003,Nevada,3.2,-1.7,False


In [86]:
frame2['Bic_State']=(frame2.STA =='Ohio') & (frame2.POP >2.0)
frame2

Unnamed: 0,YEA,STA,POP,DEB,eastern,Bic_State
01,2000,Ohio,1.5,,True,False
02,2001,Ohio,1.7,-1.2,True,False
three,2002,Ohio,3.6,,True,True
four,2001,Nevada,2.4,-1.5,False,False
five,2002,Nevada,2.9,,False,False
six,2003,Nevada,3.2,-1.7,False,False


In [91]:
del frame2['eastern']
del frame2['Bic_State']
frame2

Unnamed: 0,YEA,STA,POP,DEB
01,2000,Ohio,1.5,
02,2001,Ohio,1.7,-1.2
three,2002,Ohio,3.6,
four,2001,Nevada,2.4,-1.5
five,2002,Nevada,2.9,
six,2003,Nevada,3.2,-1.7


### Nested dictionaries

In [92]:
pop = {'Nevada': {2001: 2.4, 2002: 2.9},
       'Ohio': {2000: 1.5, 2001: 1.7, 2002: 3.6}}

In [95]:
frame3 = pd.DataFrame(pop)
frame3

Unnamed: 0,Nevada,Ohio
2001,2.4,1.7
2002,2.9,3.6
2000,,1.5


In [96]:
frame3.T

Unnamed: 0,2001,2002,2000
Nevada,2.4,2.9,
Ohio,1.7,3.6,1.5


In [97]:
pd.DataFrame(pop, index=[2001, 2002, 2003])

Unnamed: 0,Nevada,Ohio
2001,2.4,1.7
2002,2.9,3.6
2003,,
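
In the cells above, the outer dictionary keys become column labels and the inner keys become row labels; any (row, column) pair without a value is filled with `NaN`. A short sketch of that mapping (the name `nested` is illustrative, and the row order may differ between pandas versions, so it is sorted before printing):

```python
import pandas as pd

pop = {'Nevada': {2001: 2.4, 2002: 2.9},
       'Ohio': {2000: 1.5, 2001: 1.7, 2002: 3.6}}
nested = pd.DataFrame(pop)

print(list(nested.columns))        # outer keys become columns
print(sorted(nested.index))        # union of the inner keys becomes the index
print(nested.loc[2000, 'Nevada'])  # NaN: Nevada has no entry for 2000
```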


In [105]:
print(frame3.iloc[0,0])
print(frame3.iloc[0,1])
print(frame3.iloc[1,0])
print(frame3.iloc[1,1])

2.4
1.7
2.9
3.6


In [109]:
frame3.iloc[0,0:]

Nevada    2.4
Ohio      1.7
Name: 2001, dtype: float64

In [111]:
frame3['Ohio'][:-1]

2001    1.7
2002    3.6
Name: Ohio, dtype: float64

In [117]:
import pandas as pd
import seaborn as sns #conda install -c anaconda seaborn

titanic = sns.load_dataset('titanic')
titanic.head()

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone
0,0,3,male,22.0,1,0,7.25,S,Third,man,True,,Southampton,no,False
1,1,1,female,38.0,1,0,71.2833,C,First,woman,False,C,Cherbourg,yes,False
2,1,3,female,26.0,0,0,7.925,S,Third,woman,False,,Southampton,yes,True
3,1,1,female,35.0,1,0,53.1,S,First,woman,False,C,Southampton,yes,False
4,0,3,male,35.0,0,0,8.05,S,Third,man,True,,Southampton,no,True


In [118]:
titanic.tail()

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone
886,0,2,male,27.0,0,0,13.0,S,Second,man,True,,Southampton,no,True
887,1,1,female,19.0,0,0,30.0,S,First,woman,False,B,Southampton,yes,True
888,0,3,female,,1,2,23.45,S,Third,woman,False,,Southampton,no,False
889,1,1,male,26.0,0,0,30.0,C,First,man,True,C,Cherbourg,yes,True
890,0,3,male,32.0,0,0,7.75,Q,Third,man,True,,Queenstown,no,True


In [119]:
df = titanic.loc[:,['age','fare']]
df.head()

Unnamed: 0,age,fare
0,22.0,7.25
1,38.0,71.2833
2,26.0,7.925
3,35.0,53.1
4,35.0,8.05


In [120]:
df.tail()

Unnamed: 0,age,fare
886,27.0,13.0
887,19.0,30.0
888,,23.45
889,26.0,30.0
890,32.0,7.75


In [121]:
df_add10 = df +10

In [122]:
df_add10.head()

Unnamed: 0,age,fare
0,32.0,17.25
1,48.0,81.2833
2,36.0,17.925
3,45.0,63.1
4,45.0,18.05


In [123]:
print(type(df_add10))

<class 'pandas.core.frame.DataFrame'>


In [124]:
df_sub = df_add10 -df

In [125]:
df_sub

Unnamed: 0,age,fare
0,10.0,10.0
1,10.0,10.0
2,10.0,10.0
3,10.0,10.0
4,10.0,10.0
...,...,...
886,10.0,10.0
887,10.0,10.0
888,,10.0
889,10.0,10.0


In [126]:
obj = pd.Series(range(3), index=['a','b','c'])
index=obj.index
print(index)
index[1:]

Index(['a', 'b', 'c'], dtype='object')


Index(['b', 'c'], dtype='object')

In [129]:
obj = pd.Series([4.5,7.2,-5.3,3.6], index = ['d','b','a','c'])
obj

d    4.5
b    7.2
a   -5.3
c    3.6
dtype: float64

In [130]:
obj2 = obj.reindex(['a','b','c','d','e'])
obj2

a   -5.3
b    7.2
c    3.6
d    4.5
e    NaN
dtype: float64

In [133]:
obj3=pd.Series(['blue','purple','yellow'], index=[0,2,4])
obj3

0      blue
2    purple
4    yellow
dtype: object

In [134]:
obj3.reindex(range(6), method='ffill')

0      blue
1      blue
2    purple
3    purple
4    yellow
5    yellow
dtype: object
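
`reindex` by itself leaves new labels as `NaN`; the `method` argument controls how those gaps are filled. The sketch below compares no fill, forward fill, and backward fill on the same data (variable names are illustrative):

```python
import pandas as pd

colors = pd.Series(['blue', 'purple', 'yellow'], index=[0, 2, 4])

plain = colors.reindex(range(6))                     # new labels 1, 3, 5 are NaN
forward = colors.reindex(range(6), method='ffill')   # copy the previous label's value
backward = colors.reindex(range(6), method='bfill')  # copy the next label's value

print(forward.tolist())
print(backward.tolist())   # label 5 has no later value, so it stays NaN
```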

In [135]:
import numpy as np
import pandas as pd
frame = pd.DataFrame(np.arange(9).reshape((3,3)), index=['a','b','c'], columns=['Ohio','California','Texas'])
frame


Unnamed: 0,Ohio,California,Texas
a,0,1,2
b,3,4,5
c,6,7,8


In [136]:
states = ['Texas', 'Utah', 'California']
frame.reindex(columns=states)

Unnamed: 0,Texas,Utah,California
a,2,,1
b,5,,4
c,8,,7


In [142]:
obj= pd.Series(np.arange(5.), index=['a','b','c','d','e'])
obj

a    0.0
b    1.0
c    2.0
d    3.0
e    4.0
dtype: float64

In [143]:
new_obj=obj.drop('c')

In [144]:
new_obj

a    0.0
b    1.0
d    3.0
e    4.0
dtype: float64

In [147]:
new_obj = obj.drop(['d','c'])
new_obj

a    0.0
b    1.0
e    4.0
dtype: float64

In [148]:
data = pd.DataFrame(np.arange(16).reshape((4,4)), index=['Ohio', 'Colorado','Utan','New York'], columns=['one','two','three','four'])
data

Unnamed: 0,one,two,three,four
Ohio,0,1,2,3
Colorado,4,5,6,7
Utan,8,9,10,11
New York,12,13,14,15


In [149]:
data.drop(['Colorado','Ohio'])

Unnamed: 0,one,two,three,four
Utan,8,9,10,11
New York,12,13,14,15


In [152]:
data2 = data.drop('two', axis=1)
data2.drop('Utan',axis=0)

Unnamed: 0,one,three,four
Ohio,0,2,3
Colorado,4,6,7
New York,12,14,15


In [153]:
data.drop(['two','four'], axis=1)

Unnamed: 0,one,three
Ohio,0,2
Colorado,4,6
Utan,8,10
New York,12,14


In [156]:
data.drop('Ohio', axis='rows')

Unnamed: 0,one,two,three,four
Colorado,4,5,6,7
Utan,8,9,10,11
New York,12,13,14,15


In [157]:
data3=data.copy()
data3.drop('Ohio', inplace=True)
data  # the original DataFrame is unchanged; only data3 was modified in place

Unnamed: 0,one,two,three,four
Ohio,0,1,2,3
Colorado,4,5,6,7
Utan,8,9,10,11
New York,12,13,14,15
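
The cells above rely on the fact that `drop` returns a new object by default and only modifies the caller when `inplace=True` is passed (in which case it returns `None`). A compact sketch on a smaller, hypothetical table:

```python
import numpy as np
import pandas as pd

base = pd.DataFrame(np.arange(6).reshape((3, 2)),
                    index=['Ohio', 'Colorado', 'New York'],
                    columns=['one', 'two'])

dropped = base.drop('Ohio')                 # new DataFrame; base is untouched
work = base.copy()
result = work.drop('Ohio', inplace=True)    # modifies work and returns None

print(base.shape, dropped.shape, work.shape, result)
```

Assigning the return value (`dropped = base.drop(...)`) is often easier to follow than `inplace=True`, since the original stays available for comparison.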


## Indexing

In [159]:
obj = pd.Series(np.arange(4.), index=['a', 'b', 'c', 'd'])
obj

a    0.0
b    1.0
c    2.0
d    3.0
dtype: float64

In [161]:
print(obj['b'], obj[1]); print()
print(obj[2:4]); print()
print(obj[['b','a','d']]); print()
print(obj<2)

1.0 1.0

c    2.0
d    3.0
dtype: float64

b    1.0
a    0.0
d    3.0
dtype: float64

a     True
b     True
c    False
d    False
dtype: bool


In [162]:
data = pd.DataFrame(np.arange(16).reshape((4, 4)),
                    index=['Ohio', 'Colorado', 'Utah', 'New York'],
                    columns=['one', 'two', 'three', 'four'])

In [163]:
data[:2]

Unnamed: 0,one,two,three,four
Ohio,0,1,2,3
Colorado,4,5,6,7


In [164]:
data[data['three']>5]

Unnamed: 0,one,two,three,four
Colorado,4,5,6,7
Utah,8,9,10,11
New York,12,13,14,15


In [165]:
data<5

Unnamed: 0,one,two,three,four
Ohio,True,True,True,True
Colorado,True,False,False,False
Utah,False,False,False,False
New York,False,False,False,False


In [167]:
data[data<5]=0
data

Unnamed: 0,one,two,three,four
Ohio,0,0,0,0
Colorado,0,5,6,7
Utah,8,9,10,11
New York,12,13,14,15


In [170]:
data.loc['Colorado',['two','three']]

two      5
three    6
Name: Colorado, dtype: int32

In [171]:
data.iloc[2,[3,0,1]]

four    11
one      8
two      9
Name: Utah, dtype: int32

In [172]:
data.iloc[[1,2],[3,0,1]]

Unnamed: 0,four,one,two
Colorado,7,0,5
Utah,11,8,9


In [173]:
data.loc[:'Utah','two']

Ohio        0
Colorado    5
Utah        9
Name: two, dtype: int32

In [174]:
data.iloc[:,:3][data.three>5]

Unnamed: 0,one,two,three
Colorado,0,5,6
Utah,8,9,10
New York,12,13,14


In [175]:
ser = pd.Series(np.arange(3.))
ser

0    0.0
1    1.0
2    2.0
dtype: float64

In [177]:
print(ser[:1])
print(ser.loc[:1])
print(ser.iloc[:1])

0    0.0
dtype: float64
0    0.0
1    1.0
dtype: float64
0    0.0
dtype: float64
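
The different lengths above follow from the slicing rules: a `loc` slice is label-based and includes its end label, while an `iloc` slice is position-based and excludes its end position. A minimal sketch, with a second Series (`named`, an illustrative name) using string labels to make the difference easier to see:

```python
import numpy as np
import pandas as pd

ser = pd.Series(np.arange(3.))      # default integer labels 0, 1, 2

by_label = ser.loc[:1]              # label slice: the end label 1 is included
by_position = ser.iloc[:1]          # positional slice: position 1 is excluded
print(len(by_label), len(by_position))

named = pd.Series(np.arange(3.), index=['a', 'b', 'c'])
print(named.loc['a':'b'])           # rows 'a' and 'b'
print(named.iloc[0:1])              # row 'a' only
```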


In [179]:
frame = pd.DataFrame(np.random.randn(4, 3), columns=list('bde'),
                     index=['Utah', 'Ohio', 'Texas', 'Oregon'])
frame

Unnamed: 0,b,d,e
Utah,-2.110619,-0.156342,-0.047628
Ohio,-1.749993,1.02265,-0.562879
Texas,0.687543,0.86449,0.609548
Oregon,0.787336,-0.056655,0.563912


In [180]:
np.abs(frame)

Unnamed: 0,b,d,e
Utah,2.110619,0.156342,0.047628
Ohio,1.749993,1.02265,0.562879
Texas,0.687543,0.86449,0.609548
Oregon,0.787336,0.056655,0.563912


- np.random.randn: generates a matrix of random numbers drawn from a Gaussian (normal) distribution with mean 0 and standard deviation 1
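
Because the values are random, the numbers shown in this notebook will not reproduce on another run. One way to get repeatable draws from the same distribution is NumPy's `Generator` API; this is a sketch, and the seed `0` and the names `rng`, `sample`, and `big` are arbitrary choices for illustration:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)          # the seed 0 is an arbitrary choice
values = rng.standard_normal((4, 3))    # same distribution as np.random.randn(4, 3)

sample = pd.DataFrame(values, columns=list('bde'),
                      index=['Utah', 'Ohio', 'Texas', 'Oregon'])
print(sample)

# With many draws, the mean approaches 0 and the standard deviation approaches 1.
big = rng.standard_normal(100_000)
print(round(big.mean(), 2), round(big.std(), 2))
```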

In [181]:
f = lambda x:x.max()-x.min()
frame.apply(f)

b    2.897955
d    1.178991
e    1.172427
dtype: float64

In [182]:
frame.apply(f, axis='columns')

Utah      2.062991
Ohio      2.772643
Texas     0.254942
Oregon    0.843992
dtype: float64

In [183]:
def f(x):
    return pd.Series([x.min(), x.max()], index=['min', 'max'])
frame.apply(f)

Unnamed: 0,b,d,e
min,-2.110619,-0.156342,-0.562879
max,0.787336,1.02265,0.609548
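
The `apply` calls above pass each column to the function by default and each row when `axis='columns'` is given. A deterministic sketch on a small made-up table (`grid` and `spread` are illustrative names) makes the two directions easier to check by hand:

```python
import pandas as pd

grid = pd.DataFrame({'b': [1.0, 4.0], 'd': [2.0, 8.0]}, index=['Utah', 'Ohio'])

def spread(x):
    # Range of the values in one column or one row.
    return x.max() - x.min()

per_column = grid.apply(spread)                 # one value per column
per_row = grid.apply(spread, axis='columns')    # one value per row

# Built-in reductions give the same column-wise result without apply.
same = grid.max() - grid.min()
print(per_column, per_row, sep='\n')
```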


## Sort

In [185]:
obj = pd.Series(range(4), index=['d', 'a', 'b', 'c'])
obj

d    0
a    1
b    2
c    3
dtype: int64

In [187]:
obj.sort_index() # sort by index

a    1
b    2
c    3
d    0
dtype: int64

In [189]:
frame = pd.DataFrame(np.arange(8).reshape((2, 4)),
                     index=['three', 'one'],
                     columns=['d', 'a', 'b', 'c'])
frame

Unnamed: 0,d,a,b,c
three,0,1,2,3
one,4,5,6,7


In [190]:
frame.sort_index() # sort the rows (ascending)

Unnamed: 0,d,a,b,c
one,4,5,6,7
three,0,1,2,3


In [194]:
frame.sort_index(axis=1) # sort the columns

Unnamed: 0,a,b,c,d
three,1,2,3,0
one,5,6,7,4


In [195]:
frame.sort_index(axis=1, ascending=False)

Unnamed: 0,d,c,b,a
three,0,3,2,1
one,4,7,6,5


In [196]:
frame.sort_index(axis=1, ascending=True)

Unnamed: 0,a,b,c,d
three,1,2,3,0
one,5,6,7,4


In [197]:
obj = pd.Series([4, 7, -3, 2])

In [199]:
obj.sort_values()

2   -3
3    2
0    4
1    7
dtype: int64

In [200]:
frame = pd.DataFrame({'b': [4, 7, -3, 2], 'a': [0, 1, 0, 1]})

In [201]:
frame

Unnamed: 0,b,a
0,4,0
1,7,1
2,-3,0
3,2,1


In [202]:
frame.sort_values(by='b')

Unnamed: 0,b,a
2,-3,0
3,2,1
0,4,0
1,7,1


In [203]:
frame.sort_values(by=['a','b'])

Unnamed: 0,b,a
2,-3,0
0,4,0
3,2,1
1,7,1


In [204]:
obj = pd.Series([7, -5, 7, 4, 2, 0, 4])

In [205]:
obj.rank()

0    6.5
1    1.0
2    6.5
3    4.5
4    3.0
5    2.0
6    4.5
dtype: float64

In [210]:
obj.rank(method='first') # ties are ranked in order of appearance, so no rank repeats

0    6.0
1    1.0
2    7.0
3    4.0
4    3.0
5    2.0
6    5.0
dtype: float64

In [211]:
obj.rank(ascending=False, method='max')

0    2.0
1    7.0
2    2.0
3    4.0
4    5.0
5    6.0
6    4.0
dtype: float64
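
The `method` argument of `rank` only changes how ties are handled. The sketch below places several tie-breaking methods side by side for the same Series so the differences can be compared directly (the table name `ranks` is illustrative):

```python
import pandas as pd

obj = pd.Series([7, -5, 7, 4, 2, 0, 4])
methods = ['average', 'min', 'max', 'first', 'dense']

# One column of ranks per tie-breaking method.
ranks = pd.DataFrame({m: obj.rank(method=m) for m in methods})
ranks.insert(0, 'value', obj)
print(ranks)
```

Only the tied values (the two 7s and the two 4s) differ between columns; `dense` additionally leaves no gaps between rank values.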

In [212]:
frame = pd.DataFrame({'b': [4.3, 7, -3, 2], 'a': [0, 1, 0, 1],
                      'c': [-2, 5, 8, -2.5]})
frame

Unnamed: 0,b,a,c
0,4.3,0,-2.0
1,7.0,1,5.0
2,-3.0,0,8.0
3,2.0,1,-2.5


In [215]:
frame.rank(axis='columns') # rank the values across the columns within each row

Unnamed: 0,b,a,c
0,3.0,2.0,1.0
1,3.0,1.0,2.0
2,1.0,2.0,3.0
3,3.0,2.0,1.0


In [216]:
obj = pd.Series(range(5), index=['a', 'a', 'b', 'b', 'c'])
obj

a    0
a    1
b    2
b    3
c    4
dtype: int64

In [217]:
obj.index.is_unique

False

In [218]:
obj['a']

a    0
a    1
dtype: int64

In [219]:
obj['c']

4

In [220]:
df = pd.DataFrame(np.random.randn(4, 3), index=['a', 'a', 'b', 'b'])
df

Unnamed: 0,0,1,2
a,0.247367,1.24116,-2.295198
a,1.9745,0.80984,-0.165718
b,-0.035042,0.886739,0.638879
b,-0.311401,-2.019368,-0.193105


In [221]:
df.loc['b']

Unnamed: 0,0,1,2
b,-0.035042,0.886739,0.638879
b,-0.311401,-2.019368,-0.193105


In [222]:
df = pd.DataFrame([[1.4, np.nan], [7.1, -4.5],
                   [np.nan, np.nan], [0.75, -1.3]],
                  index=['a', 'b', 'c', 'd'],
                  columns=['one', 'two'])
df

Unnamed: 0,one,two
a,1.4,
b,7.1,-4.5
c,,
d,0.75,-1.3


In [223]:
df.sum()

one    9.25
two   -5.80
dtype: float64

In [224]:
df.sum(axis='columns')

a    1.40
b    2.60
c    0.00
d   -0.55
dtype: float64

In [226]:
df.mean(axis='columns', skipna=False)

a      NaN
b    1.300
c      NaN
d   -0.275
dtype: float64
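
The results above depend on how missing values are treated: reductions skip `NaN` by default, so an all-`NaN` row sums to 0, while `skipna=False` propagates `NaN`. The sketch below also shows `min_count`, one option for turning an all-`NaN` sum into `NaN` instead of 0 (the result names are illustrative):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1.4, np.nan], [7.1, -4.5], [np.nan, np.nan]],
                  index=['a', 'b', 'c'], columns=['one', 'two'])

row_sum = df.sum(axis='columns')                        # NaN skipped; all-NaN row gives 0.0
row_sum_strict = df.sum(axis='columns', skipna=False)   # any NaN makes the result NaN
row_sum_min1 = df.sum(axis='columns', min_count=1)      # all-NaN row gives NaN instead of 0.0

print(row_sum, row_sum_strict, row_sum_min1, sep='\n')
```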

In [227]:
df.idxmax()

one    b
two    d
dtype: object

In [228]:
df.cumsum()

Unnamed: 0,one,two
a,1.4,
b,8.5,-4.5
c,,
d,9.25,-5.8


In [229]:
df.describe()

Unnamed: 0,one,two
count,3.0,2.0
mean,3.083333,-2.9
std,3.493685,2.262742
min,0.75,-4.5
25%,1.075,-3.7
50%,1.4,-2.9
75%,4.25,-2.1
max,7.1,-1.3


## Unique Values, Value Counts, and Membership

In [230]:
obj = pd.Series(['c', 'a', 'd', 'a', 'a', 'b', 'b', 'c', 'c'])

In [233]:
uniques = obj.unique()
uniques

array(['c', 'a', 'd', 'b'], dtype=object)

In [234]:
obj.value_counts()

a    3
c    3
b    2
d    1
dtype: int64

In [235]:
pd.value_counts(obj.values, sort=False)

c    3
b    2
a    3
d    1
dtype: int64

In [236]:
obj

0    c
1    a
2    d
3    a
4    a
5    b
6    b
7    c
8    c
dtype: object

In [237]:
mask = obj.isin(['b','c'])
mask

0     True
1    False
2    False
3    False
4    False
5     True
6     True
7     True
8     True
dtype: bool

In [238]:
obj[mask]

0    c
5    b
6    b
7    c
8    c
dtype: object

In [243]:
to_match = pd.Series(['c', 'a', 'b', 'b', 'c', 'a'])
to_match 

0    c
1    a
2    b
3    b
4    c
5    a
dtype: object

In [244]:
unique_vals = pd.Series(['c', 'b', 'a'])
unique_vals 

0    c
1    b
2    a
dtype: object

In [245]:
pd.Index(unique_vals).get_indexer(to_match)

array([0, 2, 1, 1, 0, 2], dtype=int64)

In [247]:
data = pd.DataFrame({'Qu1': [1, 3, 4, 3, 4],
                     'Qu2': [2, 3, 1, 2, 3],
                     'Qu3': [1, 5, 2, 4, 4]})
data
                

Unnamed: 0,Qu1,Qu2,Qu3
0,1,2,1
1,3,3,5
2,4,1,2
3,3,2,4
4,4,3,4


In [248]:
result = data.apply(pd.value_counts).fillna(0)
result

Unnamed: 0,Qu1,Qu2,Qu3
1,1.0,1.0,1.0
2,0.0,2.0,1.0
3,2.0,2.0,0.0
4,2.0,0.0,2.0
5,0.0,0.0,1.0


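isin : returns a boolean array indicating whether each element of the Series is contained in the passed sequence of values  
match : computes, for each value, its integer index in an array of unique values (done above with `pd.Index(...).get_indexer(...)`)  
unique : returns an array of the unique values in the Series, with duplicates removed  
value_counts : returns a Series whose index holds the unique values and whose values are their frequencies (sorted by frequency in descending order)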
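
The four operations summarized above can be tried together on one Series. This is a minimal sketch; the variable name `codes` is illustrative, and `get_indexer` stands in for the "match" operation as in the earlier cell:

```python
import pandas as pd

obj = pd.Series(['c', 'a', 'd', 'a', 'a', 'b', 'b', 'c', 'c'])

mask = obj.isin(['b', 'c'])                   # membership test per element
uniques = obj.unique()                        # unique values in order of first appearance
counts = obj.value_counts()                   # frequency of each unique value
codes = pd.Index(uniques).get_indexer(obj)    # position of each value within uniques

print(mask.sum(), list(uniques), codes.tolist())
print(counts)
```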

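order_id : order number  
quantity : quantity of the item ordered  
item_name : item name  
choice_description : detailed options selected for the ordered item  
item_price : price of the ordered item  

1. What are the top 10 most-ordered items?
2. How many units of the most expensive item were sold in total?
3. How many times was the Veggie Salad Bowl ordered?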
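
One possible approach to the three questions is sketched below on a small made-up table with the same column names, so it runs without `chipotle.tsv`. The rows in `orders` are hypothetical, and two interpretations are assumptions: `item_price` is treated as the line total (so it is divided by `quantity` to get a unit price), and "ordered" in question 3 is counted as distinct orders.

```python
import pandas as pd

# Hypothetical rows using the same column layout as chipotle.tsv.
orders = pd.DataFrame({
    'order_id': [1, 1, 2, 3, 3, 4],
    'quantity': [1, 2, 1, 1, 1, 3],
    'item_name': ['Chips', 'Veggie Salad Bowl', 'Steak Bowl',
                  'Veggie Salad Bowl', 'Chips', 'Steak Bowl'],
    'item_price': ['$2.15', '$22.50', '$11.75', '$11.25', '$2.15', '$35.25'],
})

# Prices are strings such as '$2.39'; strip the sign and convert to float.
orders['price'] = orders['item_price'].str.lstrip('$').astype(float)
# Assumption: item_price is the line total, so divide by quantity.
orders['unit_price'] = orders['price'] / orders['quantity']

# 1. Most frequently ordered items (counted by order lines).
top_items = orders['item_name'].value_counts().head(10)

# 2. Total quantity sold of the item with the highest unit price.
priciest = orders.loc[orders['unit_price'].idxmax(), 'item_name']
priciest_qty = orders.loc[orders['item_name'] == priciest, 'quantity'].sum()

# 3. Number of distinct orders that contain a Veggie Salad Bowl.
is_veggie = orders['item_name'] == 'Veggie Salad Bowl'
veggie_orders = orders.loc[is_veggie, 'order_id'].nunique()

print(top_items, priciest, priciest_qty, veggie_orders, sep='\n')
```

The same steps should carry over to the real `chipo` DataFrame once it is loaded, provided the price column has the `$`-prefixed string format shown in its `head()` output.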

In [249]:
import pandas as pd
file_path = 'chipotle.tsv'
chipo = pd.read_csv(file_path, sep = '\t')
chipo.head()

Unnamed: 0,order_id,quantity,item_name,choice_description,item_price
0,1,1,Chips and Fresh Tomato Salsa,,$2.39
1,1,1,Izze,[Clementine],$3.39
2,1,1,Nantucket Nectar,[Apple],$3.39
3,1,1,Chips and Tomatillo-Green Chili Salsa,,$2.39
4,2,2,Chicken Bowl,"[Tomatillo-Red Chili Salsa (Hot), [Black Beans...",$16.98


In [254]:
chipo['item_name'].value_counts()[:10]

Chicken Bowl           726
Chicken Burrito        553
Chips and Guacamole    479
Steak Burrito          368
Canned Soft Drink      301
Steak Bowl             211
Chips                  211
Bottled Water          162
Chicken Soft Tacos     115
Chicken Salad Bowl     110
Name: item_name, dtype: int64