## Pandas
- Handle data in a way suited to analysis
- Similar to R

You can learn pandas from this tutorial:
https://bitbucket.org/hrojas/learn-pandas


In [39]:
import pandas as pd
import numpy as np

### 1. Pandas - Series

A Series is a one-dimensional object similar to an array, a list, or a column.
By default, each item is assigned an integer index from 0 to N-1.

data.values    : NumPy array

data.index    : pd.Index array-like object
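
As a minimal sketch of the two attributes above (the variable name `example` is only illustrative and not part of the original notebook), a Series with the default index can be inspected like this:

```python
import numpy as np
import pandas as pd

# Illustrative Series; no index is given, so pandas assigns 0..N-1
example = pd.Series([0.25, 0.5, 0.75, 1.0])

values = example.values   # the underlying NumPy array
index = example.index     # a pd.Index object (a RangeIndex in this case)

print(type(values))
print(list(index))
print(example[1])         # with the default index, label 1 is the second item
```
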

#### 1) Series as generalized NumPy array

From what we’ve seen so far, it may look like the Series object is basically interchangeable with a one-dimensional NumPy array.

The essential difference is **the presence of the index** : 

while the NumPy array has an implicitly defined integer index, 

the Pandas Series has an explicitly defined index associated with the values.


In [40]:
# The index need not be an integer; it can consist of values of any desired type.

data = pd.Series([0.25, 0.5, 0.75, 1], index=['a', 'b', 'c', 'd'] )
data['b']

0.5
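
One practical consequence of the explicit index is that slicing by label behaves differently from slicing by position. The sketch below rebuilds the same Series; the names `by_label` and `by_position` are illustrative only.

```python
import pandas as pd

data = pd.Series([0.25, 0.5, 0.75, 1.0], index=['a', 'b', 'c', 'd'])

# Slicing by label includes the end label
by_label = data['a':'c']

# Slicing by integer position (via .iloc) excludes the end position
by_position = data.iloc[0:2]

print(by_label)
print(by_position)
```
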

In [41]:
# Ex 2
series = pd.Series(['Dave', 'Cheng-Han', 359, 9001], 
                       index=['Instructor', 'Curriculum Manager',
                              'Course Number', 'Power Level'])
series

Instructor                 Dave
Curriculum Manager    Cheng-Han
Course Number               359
Power Level                9001
dtype: object

#### 2) Series as specialized dictionary

In this way, you can think of a Pandas Series a bit like a specialization of a Python dictionary.

A dictionary is a structure that maps arbitrary keys to a set of arbitrary values, 

and a Series is a structure that maps typed keys to a set of typed values.

We can make the Series-as-dictionary analogy even more clear by constructing a Series object directly from a Python dictionary :



In [42]:
# In the output below, the index is drawn from the sorted keys (older pandas behavior);
# recent pandas versions keep the dictionary's insertion order instead.
# From here, typical dictionary-style item access can be performed.


population_dict = {'California': 38332521,
'Texas': 26448193,
'New York': 19651127,
'Florida': 19552860,
'Illinois': 12882135}

population = pd.Series(population_dict)
population

California    38332521
Florida       19552860
Illinois      12882135
New York      19651127
Texas         26448193
dtype: int64
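
To make the dictionary analogy concrete, here is a small sketch of dictionary-style operations on the same data. It also shows one thing a plain dictionary cannot do, slicing by label; the call to `sort_index()` is added here so the slice result does not depend on the pandas version.

```python
import pandas as pd

population = pd.Series({'California': 38332521,
                        'Texas': 26448193,
                        'New York': 19651127,
                        'Florida': 19552860,
                        'Illinois': 12882135})

# Dictionary-style lookup and membership test (membership checks the labels)
print(population['California'])
print('Texas' in population)

# Dictionary-style iteration over (label, value) pairs
for state, count in population.items():
    print(state, count)

# Unlike a dict, a Series supports slicing by label
subset = population.sort_index()['California':'Illinois']
print(subset)
```
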

In [43]:
# Various ways to construct a Series

print(pd.Series(data))
print(pd.Series(5, index=[100,200,300] ))
print(pd.Series( {2:'a', 1:'b', 3:'c' }))
print(pd.Series( {2:'a', 1:'b', 3:'c'}, index=[3,2]))

a    0.25
b    0.50
c    0.75
d    1.00
dtype: float64
100    5
200    5
300    5
dtype: int64
1    b
2    a
3    c
dtype: object
3    c
2    a
dtype: object
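
The last call above keeps only the dictionary keys named in `index`. As a hedged sketch of the related edge case, a label that is requested in `index` but missing from the dictionary is filled with NaN (the label `4` below is an illustrative addition, not from the original notebook):

```python
import pandas as pd

# Label 4 is not a key of the dictionary, so its value becomes NaN
s = pd.Series({2: 'a', 1: 'b', 3: 'c'}, index=[3, 2, 4])

print(s)
print(s.isna())   # True only for the missing label
```
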


In [44]:
# Select items by index label (a single item, or several items)

series = pd.Series(['Dave', 'Cheng-Han', 359, 9001],
                      index=['Instructor', 'Curriculum Manager',
                            'Course Number', 'Power Level'])
print(series['Instructor'])
print("")
print(series[['Instructor', 'Curriculum Manager', 'Course Number']])

Dave

Instructor                 Dave
Curriculum Manager    Cheng-Han
Course Number               359
dtype: object


In [45]:
# Indexing with a boolean operator

cuteness = pd.Series([1,2,3,4,5], 
                        index=['Cockroach', 'Fish', 'Mini Pig', 'Puppy', 'Kitten'])

print(cuteness > 3)
print("")
print( cuteness[cuteness>3])


Cockroach    False
Fish         False
Mini Pig     False
Puppy         True
Kitten        True
dtype: bool

Puppy     4
Kitten    5
dtype: int64
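
Boolean masks can also be combined. The sketch below reuses the `cuteness` Series; the result names are illustrative. Note that pandas uses the element-wise operators `&`, `|`, and `~` (not `and`, `or`, `not`), and each condition needs its own parentheses.

```python
import pandas as pd

cuteness = pd.Series([1, 2, 3, 4, 5],
                     index=['Cockroach', 'Fish', 'Mini Pig', 'Puppy', 'Kitten'])

# Both conditions must hold (element-wise AND)
middle = cuteness[(cuteness > 1) & (cuteness < 5)]

# Either condition may hold (element-wise OR)
extremes = cuteness[(cuteness == 1) | (cuteness == 5)]

# Negate a mask (element-wise NOT)
not_cute = cuteness[~(cuteness > 3)]

print(middle)
print(extremes)
print(not_cute)
```
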








### 2. Pandas - DataFrame

Like the Series object discussed in the previous section, the DataFrame can be thought of either as a generalization of a NumPy array, or as a specialization of a Python dictionary.

To create a DataFrame, pass a "dictionary of lists" as the argument.

When you do so,
1. The dictionary keys become the column names.
2. The list stored under each key becomes the values of that column.
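
The two rules above can be sketched with a small dictionary of lists. The values are a subset of the football data used later in this notebook, and the variable name `small` is illustrative only.

```python
import pandas as pd

# Keys become column names; each list becomes that column's values
small = pd.DataFrame({'team': ['Bears', 'Packers', 'Lions'],
                      'wins': [11, 15, 6]})

print(small)
print(list(small.columns))
```
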



#### 1) DataFrame as a generalized NumPy array

If a Series is an analog of a one-dimensional array with flexible indices, 

a DataFrame is an analog of a two-dimensional array with both flexible row indices and flexible column names.


In [46]:
area_dict = {'California': 423967, 'Texas': 695662, 'New York': 141297,
                           'Florida': 170312, 'Illinois': 149995}
area = pd.Series(area_dict)

states = pd.DataFrame( {'population': population, 'area' : area } )

states

              area  population
California  423967    38332521
Florida     170312    19552860
Illinois    149995    12882135
New York    141297    19651127
Texas       695662    26448193


Like the Series object, the DataFrame has an index attribute that gives access to the row labels. It also has a columns attribute, an Index object holding the column labels.
- states.index
- states.columns
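
A minimal sketch of both attributes, rebuilding a two-state version of `states` so the example is self-contained:

```python
import pandas as pd

population = pd.Series({'California': 38332521, 'Texas': 26448193})
area = pd.Series({'California': 423967, 'Texas': 695662})
states = pd.DataFrame({'population': population, 'area': area})

print(states.index)     # row labels
print(states.columns)   # column labels
print(states.shape)     # (number of rows, number of columns)
```
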


#### 2) DataFrame as specialized dictionary

Where a dictionary maps a key to a value, a DataFrame maps a column name to a Series of column data.

In [47]:
pd.DataFrame(population, columns=['population'])
    # same as pd.DataFrame({'population': population})

            population
California    38332521
Florida       19552860
Illinois      12882135
New York      19651127
Texas         26448193
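
As a sketch of the dictionary view, indexing a DataFrame with a column name returns that column as a Series, and assigning to a new name adds a column. The `density` column is an illustrative addition that does not appear in the original notebook.

```python
import pandas as pd

population = pd.Series({'California': 38332521, 'Texas': 26448193})
area = pd.Series({'California': 423967, 'Texas': 695662})
states = pd.DataFrame({'population': population, 'area': area})

# Dictionary-style access: a column name maps to a Series
area_col = states['area']
print(type(area_col))

# Membership tests check column names, not row labels
print('area' in states)
print('California' in states)

# Dictionary-style assignment adds a new column
states['density'] = states['population'] / states['area']
print(states)
```
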


#### 3) Creating DataFrame Objects

In [48]:
# 1. From a single Series object

pd.DataFrame(population, columns=['population'] )

            population
California    38332521
Florida       19552860
Illinois      12882135
New York      19651127
Texas         26448193


In [49]:
# 2. From a list of dicts.
# Any list of dictionaries can be made into a DataFrame.

data = [ {'a' :i, 'b':2*i }  for i in range(3) ]
print(pd.DataFrame(data))

pd.DataFrame([{'a':1, 'b':2}, {'b':3, 'c':4}])

   a  b
0  0  0
1  1  2
2  2  4


     a  b    c
0  1.0  2  NaN
1  NaN  3  4.0


In [50]:
# 3. From a dictionary of Series objects

pd.DataFrame( {'population': population, 'area' :area} )

              area  population
California  423967    38332521
Florida     170312    19552860
Illinois    149995    12882135
New York    141297    19651127
Texas       695662    26448193


In [51]:
# 4. From a two-dimensional NumPy array

pd.DataFrame(np.random.rand(3,2), columns=['foo', 'bar'], index=['a', 'b', 'c'] )

        foo       bar
a  0.623475  0.651269
b  0.269232  0.285292
c  0.053331  0.823080


In [52]:
# 5. From a NumPy structured array

A = np.zeros(3, dtype=[ ('A', 'i8'), ('B', 'f8') ] )

pd.DataFrame(A)

   A    B
0  0  0.0
1  0  0.0
2  0  0.0


In [53]:
data = {'year' : [2010, 2011, 2012, 2011, 2012, 2010, 2011, 2012],
       'team' : ['Bears', 'Bears', 'Bears', 'Packers', 'Packers', 'Lions', 'Lions', 'Lions'],
       'wins' : [11,8,10,15,11,6,10,4],
       'losses': [5,8,6,1,5,10,6,12]}

football = pd.DataFrame(data)
print(football)

   losses     team  wins  year
0       5    Bears    11  2010
1       8    Bears     8  2011
2       6    Bears    10  2012
3       1  Packers    15  2011
4       5  Packers    11  2012
5      10    Lions     6  2010
6       6    Lions    10  2011
7      12    Lions     4  2012


In [54]:
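print(football.dtypes)     # print the data type of each column
print("")
print(football.describe()) # print summary statistics for the numeric columns
print("")
print(football.head())     # print the first N rows of the dataset (default: 5)
print("")
print(football.tail())     # print the last N rows of the dataset (default: 5)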


losses     int64
team      object
wins       int64
year       int64
dtype: object

          losses       wins         year
count   8.000000   8.000000     8.000000
mean    6.625000   9.375000  2011.125000
std     3.377975   3.377975     0.834523
min     1.000000   4.000000  2010.000000
25%     5.000000   7.500000  2010.750000
50%     6.000000  10.000000  2011.000000
75%     8.500000  11.000000  2012.000000
max    12.000000  15.000000  2012.000000

   losses     team  wins  year
0       5    Bears    11  2010
1       8    Bears     8  2011
2       6    Bears    10  2012
3       1  Packers    15  2011
4       5  Packers    11  2012

   losses     team  wins  year
3       1  Packers    15  2011
4       5  Packers    11  2012
5      10    Lions     6  2010
6       6    Lions    10  2011
7      12    Lions     4  2012
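
Both `head` and `tail` accept the number of rows as an argument. A short sketch on the same football data (the result names are illustrative):

```python
import pandas as pd

football = pd.DataFrame({
    'year': [2010, 2011, 2012, 2011, 2012, 2010, 2011, 2012],
    'team': ['Bears', 'Bears', 'Bears', 'Packers', 'Packers',
             'Lions', 'Lions', 'Lions'],
    'wins': [11, 8, 10, 15, 11, 6, 10, 4],
    'losses': [5, 8, 6, 1, 5, 10, 6, 12],
})

first_two = football.head(2)    # first 2 rows instead of the default 5
last_three = football.tail(3)   # last 3 rows instead of the default 5

print(first_two)
print(last_three)
```
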


### 3. Series & DataFrames

You can think of a DataFrame as a group of Series that share an index.

Also,
1. Selecting a single column returns a Series
2. Selecting multiple columns returns a DataFrame
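
The two selection rules above can be checked directly. This sketch uses the same names and ages as the cell below; note that a list containing a single column name still returns a DataFrame.

```python
import pandas as pd

df = pd.DataFrame({'name': ['Sarah', 'Mike', 'Chrisna'],
                   'age': [28, 32, 25]})

single = df['name']              # one column name -> Series
multiple = df[['name', 'age']]   # list of column names -> DataFrame
one_in_list = df[['name']]       # list with one name -> still a DataFrame

print(type(single))
print(type(multiple))
print(type(one_in_list))
```
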

Series and DataFrames can be used together as shown below.

In [55]:
people = ['Sarah', 'Mike', 'Chrisna']
ages = [28, 32, 25]

# 1. Using Series together with a DataFrame
df = pd.DataFrame({'name' : pd.Series(people),
               'age' : pd.Series(ages)})
print(df)


# 2. Using only a DataFrame
df = pd.DataFrame({'name' : people,
               'age' : ages})
print(df)

   age     name
0   28    Sarah
1   32     Mike
2   25  Chrisna
   age     name
0   28    Sarah
1   32     Mike
2   25  Chrisna


In [56]:
## Note: the DataFrame above sorted its column labels automatically, so reorder them manually by selecting the columns in the desired order.

df = df[['name', 'age']]
df

      name  age
0    Sarah   28
1     Mike   32
2  Chrisna   25
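
As an alternative sketch, the column order can also be fixed when the DataFrame is created by passing `columns=` to the constructor, or changed afterwards with `reindex`. The variable names `ordered` and `swapped` are illustrative.

```python
import pandas as pd

people = ['Sarah', 'Mike', 'Chrisna']
ages = [28, 32, 25]

# Fix the column order at construction time
ordered = pd.DataFrame({'name': people, 'age': ages},
                       columns=['name', 'age'])

# Reorder the columns of an existing DataFrame
swapped = ordered.reindex(columns=['age', 'name'])

print(ordered)
print(swapped)
```
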
