# I. NumPy

# NumPy (Numerical Python)

---
<img src = "http://www.numpy.org/_static/numpy_logo.png" width =300> 
<div class = "alert alert-block alert-success">
<font color = black> 

- A library for scientific and mathematical computing. <br>

- NumPy's array class is called **ndarray**. <br>

- Designed for efficiency on large arrays of data. <br>
> Array: a structure that stores multiple values in a single variable.<br>

- NumPy arrays have a fixed size at creation. <br>
> Changing the size of an *ndarray* creates a new array and deletes the original.<br>

- Less flexible than lists: every element of an ndarray must have the same data type. <br>
- Uses less memory than a list, and operations on it take fewer steps.<br>

</div>
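The points above about a single data type and a fixed size can be illustrated with a minimal sketch. The variable names (`mixed`, `original`, `grown`) are made up for this illustration and are not part of the original notebook.

```python
import numpy as np

# Mixing ints and a float: NumPy upcasts every element to one common dtype.
mixed = np.array([1, 2, 3.5])
print(mixed.dtype)          # float64

# np.append does not grow the array in place; it returns a new array.
original = np.array([1, 2, 3])
grown = np.append(original, 4)
print(original.shape, grown.shape)   # (3,) (4,)
```

The original array keeps its size, which is why repeated appends in a loop tend to be slower than building a list first and converting once.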

In [3]:
import numpy as np

## Creating arrays

### array dimension

<img src = "https://www.oreilly.com/library/view/elegant-scipy/9781491922927/assets/elsp_0105.png">

In [4]:
arr1d = np.array([1,2,3])  # 1-dimensional
arr1d

array([1, 2, 3])

In [6]:
arr2d = np.array([(1,2,3),(4,5,6),(7,8,9)])  # 2-dimensional
arr2d

array([[1, 2, 3],
       [4, 5, 6],
       [7, 8, 9]])

In [9]:
arr3d = np.array([[[1, 2, 3],[4, 5, 6]],[[7, 8, 9],[10, 11, 12]]])  # 3-dimensional
print(arr3d)

[[[ 1  2  3]
  [ 4  5  6]]

 [[ 7  8  9]
  [10 11 12]]]


**Caution**

a = np.array(1,2,3,4)<font color = red> #WRONG! <br>
<font color = black>
a = np.array([1,2,3,4])<font color = blue> #Correct!

### numpy.arange

[official NumPy doc](https://docs.scipy.org/doc/numpy/reference/generated/numpy.arange.html)

Works like Python's built-in range(), but returns an ndarray instead of a Python sequence.

Values are generated over a half-open interval, so the stop value is excluded.

In [10]:
np.arange(10)

array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
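As a small illustration of the half-open interval, the sketch below also passes a start and a step. The variable names are chosen for this example only.

```python
import numpy as np

# start=2, stop=10 (excluded), step=3
stepped = np.arange(2, 10, 3)
print(stepped.tolist())     # [2, 5, 8]

# The stop value never appears, even when it lies exactly on the step grid.
evens = np.arange(0, 10, 2)
print(evens.tolist())       # [0, 2, 4, 6, 8]
```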

In [14]:
np.zeros(10)  # array filled with zeros

array([0., 0., 0., 0., 0., 0., 0., 0., 0., 0.])

In [15]:
np.zeros((2,5))

array([[0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0.]])

In [17]:
np.ones((2,5))

array([[1., 1., 1., 1., 1.],
       [1., 1., 1., 1., 1.]])

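As an array gets larger, initializing it also takes time. 
To cut this time, use `empty`, which allocates the array without setting its elements to any particular value. 
An array created with `empty` holds whatever values were already in that memory, so its element values cannot be known in advance.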

In [18]:
np.empty(10)

array([1., 1., 1., 1., 1., 1., 1., 1., 1., 1.])
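The ones shown above are leftover memory contents, not a guarantee. A hedged sketch of the usual pattern, with an illustrative array name `buf`: allocate with `np.empty`, then overwrite every element before reading any of them.

```python
import numpy as np

# Allocate without initializing; the contents are arbitrary at this point.
buf = np.empty((2, 3))

# Overwrite every element before reading from the array.
buf[:] = 7.0
print(buf.tolist())   # [[7.0, 7.0, 7.0], [7.0, 7.0, 7.0]]
```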

### numpy.reshape

Gives a new shape to an array without changing its data.

In [16]:
np.zeros(10).reshape(2,5)

array([[0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0.]])
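A short sketch of two reshape details that follow from "without changing its data": one dimension can be given as `-1` so NumPy infers it, and the new shape must hold exactly the same number of elements. The names `flat` and `grid` are illustrative.

```python
import numpy as np

flat = np.arange(12)

# -1 asks NumPy to infer that dimension from the total size (12 / 3 = 4).
grid = flat.reshape(3, -1)
print(grid.shape)           # (3, 4)

# A shape whose product differs from the element count raises ValueError.
try:
    flat.reshape(5, 5)
except ValueError as err:
    print("cannot reshape:", err)
```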

### ndarray attributes

In [19]:
# "ndim": number of dimensions of the array
print(arr1d.ndim)

1


In [20]:
# "itemsize": size in bytes of one array element
arr2d.itemsize

8

In [21]:
# "dtype": data type of the array elements
arr2d.dtype

dtype('int64')

In [22]:
# "size": total number of elements in the array
arr3d.size

12

In [24]:
# "shape": length of the array along each dimension, as a tuple
arr3d.shape

(2, 2, 3)
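The attributes above are related to one another, which the following sketch makes explicit. It assumes an `int64` array named `cube`, created only for this illustration, and uses the `nbytes` attribute in addition to those shown above.

```python
import numpy as np

cube = np.arange(24, dtype=np.int64).reshape(2, 3, 4)

print(cube.ndim)      # 3
print(cube.shape)     # (2, 3, 4)
print(cube.size)      # 24, the product of the shape
print(cube.itemsize)  # 8 bytes per int64 element
print(cube.nbytes)    # 192 = size * itemsize
```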

### NumPy array operations

In [36]:
a2d = np.array([(1,2,3),(4,5,6),(7,8,9)])
a2d

array([[1, 2, 3],
       [4, 5, 6],
       [7, 8, 9]])

In [37]:
a2d + 1

array([[ 2,  3,  4],
       [ 5,  6,  7],
       [ 8,  9, 10]])

In [38]:
a2d * a2d

array([[ 1,  4,  9],
       [16, 25, 36],
       [49, 64, 81]])

In [39]:
a2d1 = a2d + 1

In [40]:
a2d1 < a2d

array([[False, False, False],
       [False, False, False],
       [False, False, False]])

In [42]:
a2d1 > a2d

array([[ True,  True,  True],
       [ True,  True,  True],
       [ True,  True,  True]])
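All of the operators above act element by element. As a hedged aside, the sketch below contrasts `*` with the `@` operator, which computes a matrix product instead. The matrix `m` is an illustrative example.

```python
import numpy as np

m = np.array([[1, 2], [3, 4]])

elementwise = m * m   # each element multiplied by itself
matrix_prod = m @ m   # row-by-column matrix product

print(elementwise.tolist())   # [[1, 4], [9, 16]]
print(matrix_prod.tolist())   # [[7, 10], [15, 22]]
```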

### NumPy array indexing and slicing

In [52]:
index = np.arange(0,20,2)
print(index)

[ 0  2  4  6  8 10 12 14 16 18]


In [53]:
index[5]

10

Basic slice syntax is "i:j:k"

i = Starting index <br>
j = Stopping index <br>
k = Step <br>

In [54]:
index[0:10:3]

array([ 0,  6, 12, 18])

In [55]:
index[5:9]

array([10, 12, 14, 16])

In [56]:
index[5:9] = 100
index

array([  0,   2,   4,   6,   8, 100, 100, 100, 100,  18])

In [57]:
index[:] = 100
index

array([100, 100, 100, 100, 100, 100, 100, 100, 100, 100])
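Assigning through a slice changes the original array because a basic slice is a view on the same data rather than a copy. A minimal sketch, with illustrative names `base`, `view`, and `safe`:

```python
import numpy as np

base = np.arange(5)

view = base[1:4]          # basic slicing returns a view on the same data
view[:] = 99              # writes through to base
print(base.tolist())      # [0, 99, 99, 99, 4]

safe = base[1:4].copy()   # an independent copy
safe[:] = -1              # base is left untouched
print(base.tolist())      # [0, 99, 99, 99, 4]
```

Calling `.copy()` is the usual way to keep a slice that can be modified without affecting the source array.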

When you index a higher-dimensional array with a single index, the result is no longer a single element but a whole sub-array. 

In [65]:
a2d = np.array([(1,2,3),(4,5,6),(7,8,9)])
print(a2d)

[[1 2 3]
 [4 5 6]
 [7 8 9]]


In [66]:
a2d[0]

array([1, 2, 3])

In [67]:
a2d[0,0]

1

In [68]:
a2d[1] = 10, 11, 12
a2d

array([[ 1,  2,  3],
       [10, 11, 12],
       [ 7,  8,  9]])

<img src = "https://www.oreilly.com/library/view/python-for-data/9781449323592/httpatomoreillycomsourceoreillyimages2172114.png" width = 300>

In [69]:
a2d[:2, 1:]

array([[ 2,  3],
       [11, 12]])

In [70]:
a3d = np.array([[[1, 2, 3],[4, 5, 6]],[[7, 8, 9],[10, 11, 12]]])

In [71]:
a3d[1]  # index a sub-array

array([[ 7,  8,  9],
       [10, 11, 12]])

In [76]:
a3d[1,1] = 13  # replace a whole sub-array
a3d

array([[[ 1,  2,  3],
        [ 4,  5,  6]],

       [[ 7,  8,  9],
        [13, 13, 13]]])

In [82]:
a3d[0,0,1] = 10  # replace a single element
a3d

array([[[ 1, 10,  3],
        [ 4,  5,  6]],

       [[ 7,  8,  9],
        [13, 13, 13]]])

For more details, see the reference link below. 
https://docs.scipy.org/doc/numpy-1.15.0/user/basics.indexing.html

# II. Pandas

<div class = "alert alert-block alert-info">
<font color = black>

## Pandas
***
<img src = "https://pandas.pydata.org/_static/pandas_logo.png">

_"Python has long been great for data munging and preparation, but less so for data analysis and modeling. Pandas helps fill this gap, enabling you to carry out your entire data analysis workflow in Python without having to switch to a more domain specific language like R."_ ("pandas.pydata.org") </font> <br>

- Convert a Python list, dict, or NumPy array into a Pandas data frame
- Open a local file using Pandas (e.g. CSV, Excel, or TXT files) 
- Open a remote file or database through a URL. 

<font color = black> <b> Two main data structures </b> </font>
1. Series
 - Similar to a one-dimensional array, and the data can be heterogeneous (unlike NumPy).<br>
 <br>
2. DataFrame
 - Can be seen as a table of data. It organizes data into rows and columns, making it a two-dimensional data structure. 
 - Has both a row and column index 
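A minimal sketch of the conversions listed above, building a DataFrame from a dict and from a NumPy array. The column names and values are invented for this illustration.

```python
import numpy as np
import pandas as pd

# From a dict: keys become column names.
from_dict = pd.DataFrame({"x": [1, 2, 3], "y": ["a", "b", "c"]})

# From a 2-D NumPy array: column names are supplied explicitly.
from_array = pd.DataFrame(np.arange(6).reshape(3, 2), columns=["p", "q"])

print(from_dict.shape, from_array.shape)   # (3, 2) (3, 2)
print(from_dict.dtypes)                    # one dtype per column
```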

In [3]:
import pandas as pd

### 1. Series

In [6]:
b_w = pd.Series([12, 13 ,14, 'number'])
b_w

0        12
1        13
2        14
3    number
dtype: object

In [7]:
# create a series with a label
bw2 = pd.Series([1,2,3,4], index = ['a','b','c','d'])

In [8]:
# Accessing elements through the labeled index
bw2['a']

1

In [9]:
# NumPy like functions 
bw2[bw2 > 2]

c    3
d    4
dtype: int64

In [10]:
bw2 * 2

a    2
b    4
c    6
d    8
dtype: int64

### 1. Loading data

Summary: 
- pd.read_csv
- pandas' own DataFrame = a table
- an automatic index that starts at 0
- the data type of each variable can be inspected

Notice a few things that happen when the file is read.

1. `pd.read_csv` has enough built-in smarts to read the first row of the file and take the variable names from it.
1. It then reads all rows in the file and uses them to create a pandas DataFrame, which is like a set of NumPy arrays we can treat as a table.
1. It creates an automatic unique index, beginning with zero.
1. It infers the type of each variable from the data. 

Let's explore this pandas DataFrame to learn some of its features.  Note that a pandas Series is like one column of this DataFrame, coupled with its own index column.  So the main difference between a Series and a DataFrame is that the latter has multiple columns. Columns can have different data types, but the values within one column must share a single type.

In [5]:
df = pd.read_csv('./rain.csv')
df

Unnamed: 0,month_2014,rainfall_inches
0,jan,5.3
1,feb,5.4
2,mar,4.8
3,apr,4.7
4,may,3.3
5,jun,1.2
6,jul,0.8
7,aug,0.7
8,sep,
9,oct,3.9
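The behaviour described in this section (header row, automatic index, inferred types, and an empty field read as a missing value) can be reproduced without the file. The sketch below is an assumption-laden stand-in: it reads a short inline sample with a two-column layout through `io.StringIO`, and it is not the full `rain.csv`.

```python
import io
import pandas as pd

# A small inline sample in an assumed two-column layout; not the full rain.csv.
csv_text = """month_2014,rainfall_inches
jul,0.8
aug,0.7
sep,
oct,3.9
"""

sample = pd.read_csv(io.StringIO(csv_text))

print(sample.columns.tolist())   # names taken from the header row
print(sample.index.tolist())     # [0, 1, 2, 3]
print(sample["rainfall_inches"].isna().tolist())  # [False, False, True, False]
```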


### 2. DataFrame

In [11]:
df.shape  # number of rows and columns in the DataFrame

(12, 2)

In [12]:
df.columns  # column labels

Index([u'month_2014', u'rainfall_inches'], dtype='object')

In [13]:
df.dtypes  # data type of each column

month_2014          object
rainfall_inches    float64
dtype: object

We can select subsets of the rows by indexing, and select specific columns by their name:

In [14]:
df['rainfall_inches'][:6]  # values of one column, first 6 rows only

0    5.3
1    5.4
2    4.8
3    4.7
4    3.3
5    1.2
Name: rainfall_inches, dtype: float64

We can get all of our statistics on the rainfall_inches column in one short command:

In [15]:
df['rainfall_inches'].describe()  # summary statistics via describe()

count    11.000000
mean      3.681818
std       1.923444
min       0.700000
25%       2.250000
50%       4.500000
75%       5.050000
max       5.900000
Name: rainfall_inches, dtype: float64

Notice how it silently handled the missing value for September and gave the correct statistical results?
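A hedged sketch of what "silently handled" means: pandas reductions skip NaN by default, which is why the count above is 11 rather than 12. The Series `s` is made up for this illustration.

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, 2.0, np.nan, 3.0])

print(s.count())             # 3, NaN is not counted
print(s.mean())              # 2.0, NaN is skipped by default
print(s.mean(skipna=False))  # nan
```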

You can also get these values 'a la carte'.  You might recognize that these are essentially NumPy functions, but in pandas we can now deal with multiple data types.  Notice that there is some flexibility in syntax (multiple ways to get some things done).

In [16]:
df['rainfall_inches'].min()  # minimum

0.7

In [17]:
df['rainfall_inches'].max()  # maximum

5.9

In [18]:
df['rainfall_inches'].median()  # median

4.5

### 2. Indexing & slicing

There are two indexing methods you can use to get subsets of rows and columns in a DataFrame.  **loc** uses the values of the indices and includes the last value, whereas **iloc** uses the index positions and, like usual Python indexing, does not include the last value. 

-  _loc_ : uses the values of the indices and includes the last value: **label-based selection**
-  _iloc_ : uses the index positions over a half-open interval: **integer-position-based selection**

Reference 1: https://pandas.pydata.org/pandas-docs/stable/indexing.html <br>
Reference 2: https://datascienceschool.net/view-notebook/704731b41f794b8ea00768f5b0904512/ <br>
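The difference between label-based and position-based slicing is easier to see when the index is not a run of integers. A minimal sketch with a hypothetical frame indexed by letters:

```python
import pandas as pd

frame = pd.DataFrame(
    {"val": [10, 20, 30, 40]},
    index=["a", "b", "c", "d"],
)

by_label = frame.loc["a":"c", "val"]    # label slice: "c" is included
by_position = frame.iloc[0:2, 0]        # position slice: position 2 is excluded

print(by_label.tolist())      # [10, 20, 30]
print(by_position.tolist())   # [10, 20]
```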



In [27]:
df.loc[0:6,:'rainfall_inches']  # rows labeled 0 through 6 (inclusive), columns up to and including rainfall_inches

Unnamed: 0,month_2014,rainfall_inches
0,jan,5.3
1,feb,5.4
2,mar,4.8
3,apr,4.7
4,may,3.3
5,jun,1.2
6,jul,0.8


In [30]:
df.iloc[0:6,0:2]  # first 6 rows (positions 0-5) and first 2 columns (positions 0-1)

Unnamed: 0,month_2014,rainfall_inches
0,jan,5.3
1,feb,5.4
2,mar,4.8
3,apr,4.7
4,may,3.3
5,jun,1.2


Note also that the 0th index position for columns is not the index column, but the first column of data in the dataframe, and the same applies for rows.

In [31]:
df.iloc[0:6,0:2]

Unnamed: 0,month_2014,rainfall_inches
0,jan,5.3
1,feb,5.4
2,mar,4.8
3,apr,4.7
4,may,3.3
5,jun,1.2


### 3. Filtering on values

You can easily filter a dataframe for one or more conditions based on the values in a column. Below we filter df to select only months with less than 4 inches of rainfall.  

In [32]:
df['rainfall_inches'] < 4 
## Without nesting inside df[...], the comparison returns a boolean Series: True/False for whether each value in the column is less than 4.

0     False
1     False
2     False
3     False
4      True
5      True
6      True
7      True
8     False
9      True
10    False
11    False
Name: rainfall_inches, dtype: bool

Notice the nested use of df.  What happens if you don't do that?

Let's use the nested frame of df

In [33]:
df[df['rainfall_inches'] < 6]  # rows whose rainfall_inches value is less than 6

Unnamed: 0,month_2014,rainfall_inches
0,jan,5.3
1,feb,5.4
2,mar,4.8
3,apr,4.7
4,may,3.3
5,jun,1.2
6,jul,0.8
7,aug,0.7
9,oct,3.9
10,nov,4.5


You can also select rows based on the values of more than one column.  

#### Just remember to nest the individual conditions within parentheses:

In [34]:
df[(df['month_2014']=='jan') & (df['rainfall_inches'] > 5)]  # rows that satisfy two or more conditions

Unnamed: 0,month_2014,rainfall_inches
0,jan,5.3
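The parentheses matter because Python's `&` and `|` bind more tightly than comparison operators such as `>` and `==`. The sketch below combines conditions with AND, OR, and NOT on a small hypothetical frame `t`.

```python
import pandas as pd

t = pd.DataFrame({"name": ["a", "b", "c"], "score": [1, 5, 9]})

both = t[(t["score"] > 2) & (t["score"] < 8)]      # AND
either = t[(t["name"] == "a") | (t["score"] > 8)]  # OR
negated = t[~(t["score"] > 2)]                     # NOT

print(both["name"].tolist())     # ['b']
print(either["name"].tolist())   # ['a', 'c']
print(negated["name"].tolist())  # ['a']
```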


### Using string functions to filter a dataframe
 Notice the **str** component of the syntax.

In [35]:
df[df['month_2014'].str.contains('j')]  # rows whose month_2014 value contains 'j'

Unnamed: 0,month_2014,rainfall_inches
0,jan,5.3
5,jun,1.2
6,jul,0.8


You can even do statistics on such a filtered set:

In [36]:
df[df['month_2014'].str.contains('j')]['rainfall_inches'].quantile(.5)

1.2

### 4. Missing data

This filtering approach is handy when you want to eliminate missing data also.


In [37]:
df[df['rainfall_inches'].notnull()] 

# Compared with the original data above, the row at index 8 is excluded because its value is missing

Unnamed: 0,month_2014,rainfall_inches
0,jan,5.3
1,feb,5.4
2,mar,4.8
3,apr,4.7
4,may,3.3
5,jun,1.2
6,jul,0.8
7,aug,0.7
9,oct,3.9
10,nov,4.5


Let's say we find the value for sep should be 2.5. How could we set that value using the loc indexing?

To change the value at a specific index: 

In [38]:
df.loc[8,'rainfall_inches'] = 2.5 
df

Unnamed: 0,month_2014,rainfall_inches
0,jan,5.3
1,feb,5.4
2,mar,4.8
3,apr,4.7
4,may,3.3
5,jun,1.2
6,jul,0.8
7,aug,0.7
8,sep,2.5
9,oct,3.9
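Besides `notnull()` filtering and `loc` assignment, pandas offers `dropna` and `fillna` for the same jobs. This is a sketch on a hypothetical three-row frame `r`, not on the notebook's `df`.

```python
import numpy as np
import pandas as pd

r = pd.DataFrame({"month": ["aug", "sep", "oct"],
                  "rain": [0.7, np.nan, 3.9]})

print(r["rain"].isnull().sum())        # 1 missing value
dropped = r.dropna(subset=["rain"])    # same rows as r[r["rain"].notnull()]
filled = r.fillna({"rain": 2.5})       # returns a new frame; r is unchanged

print(dropped["month"].tolist())   # ['aug', 'oct']
print(filled["rain"].tolist())     # [0.7, 2.5, 3.9]
```

Both calls return new objects, so the original frame still contains the missing value unless the result is assigned back.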


### 5. Sorting

In [39]:
df.sort_values(['rainfall_inches'], ascending = False)  # try setting ascending to True

Unnamed: 0,month_2014,rainfall_inches
11,dec,5.9
1,feb,5.4
0,jan,5.3
2,mar,4.8
3,apr,4.7
10,nov,4.5
9,oct,3.9
4,may,3.3
8,sep,2.5
5,jun,1.2


### 6. Getting unique value lists and counts

In [40]:
print(df['month_2014'].unique())  # list the unique values

['jan' 'feb' 'mar' 'apr' 'may' 'jun' 'jul' 'aug' 'sep' 'oct' 'nov' 'dec']


In [41]:
df['month_2014'].count()  # number of non-null values

12
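Note that `count()` returns the number of non-null entries, not the number of distinct ones. A small sketch on a hypothetical Series shows the difference, along with `nunique()` and `value_counts()`.

```python
import pandas as pd

states = pd.Series(["Ohio", "Ohio", "Nevada", None])

print(states.count())                    # 3 non-null values
print(states.nunique())                  # 2 distinct non-null values
print(states.value_counts().to_dict())   # {'Ohio': 2, 'Nevada': 1}
```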

## Exercises

In [None]:
import pandas as pd

data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada'], 'year': [2000, 2001, 2002, 2001, 2002],
'pop': [1.5, 1.7, 3.6, 2.4, 2.9]}

# state = US state
# year = year
# pop = population

sp = pd.DataFrame(data)


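1. Print the descriptive statistics all at once (describe()).

2. Print information only for the data that is not numeric. 

3. How can you check the data type of each column?

4. Extract only the 'state' column and print it (tolist()).

5. Print the elements of the 'state' column without duplicates.

6. Print the mean of the population (pop). 

7. Print the maximum of the population (pop).

8. Compute the 20th percentile of the population.

9. Compute a Boolean array indicating whether 'state' is 'Ohio'.

10. Select only the rows for 'Ohio' and print them.

11. Create a new DataFrame that contains only the Ohio data and print it. 

12. Select only the rows where the population is greater than 2 and print them. 

13. Compute the mean population for Ohio only. 

14. Using indexing, change Ohio's 2002 population to 3.4 (originally 3.6).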