<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Import-Library" data-toc-modified-id="Import-Library-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Import Library</a></span></li><li><span><a href="#Create-Sample-Data" data-toc-modified-id="Create-Sample-Data-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Create Sample Data</a></span></li><li><span><a href="#What-is-Vectorizing?" data-toc-modified-id="What-is-Vectorizing?-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>What is Vectorizing?</a></span></li><li><span><a href="#Example" data-toc-modified-id="Example-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>Example</a></span><ul class="toc-item"><li><span><a href="#map-+-lambda" data-toc-modified-id="map-+-lambda-4.1"><span class="toc-item-num">4.1&nbsp;&nbsp;</span>map + lambda</a></span></li><li><span><a href="#Apply" data-toc-modified-id="Apply-4.2"><span class="toc-item-num">4.2&nbsp;&nbsp;</span>Apply</a></span></li><li><span><a href="#Merge" data-toc-modified-id="Merge-4.3"><span class="toc-item-num">4.3&nbsp;&nbsp;</span>Merge</a></span></li><li><span><a href="#np.setdiff1d" data-toc-modified-id="np.setdiff1d-4.4"><span class="toc-item-num">4.4&nbsp;&nbsp;</span>np.setdiff1d</a></span></li><li><span><a href="#Column-Indexing-&amp;-Sum()" data-toc-modified-id="Column-Indexing-&amp;-Sum()-4.5"><span class="toc-item-num">4.5&nbsp;&nbsp;</span>Column Indexing &amp; Sum()</a></span></li><li><span><a href="#np.where()" data-toc-modified-id="np.where()-4.6"><span class="toc-item-num">4.6&nbsp;&nbsp;</span>np.where()</a></span></li></ul></li></ul></div>

# Import Library

In [1]:
import numpy as np
import pandas as pd

In [2]:
import warnings
warnings.filterwarnings(action='ignore')

---
<br> 
# Create Sample Data

In [3]:
data = pd.DataFrame({"x":[1,2,3,4], 
                     "y":[5,6,7,8], 
                     "z":[10,11,12,13]})

In [4]:
data

| | x | y | z |
|---|---|---|---|
| 0 | 1 | 5 | 10 |
| 1 | 2 | 6 | 11 |
| 2 | 3 | 7 | 12 |
| 3 | 4 | 8 | 13 |


---
<br> 
# What is Vectorizing?

To multiply the x column by 3, there is no need to compute  
<br >
data.x[0] * 3   
data.x[1] * 3  
data.x[2] * 3  
data.x[3] * 3 one element at a time.    
<br> 
**data.x * 3** computes all of them at once.

<br> 
Without vectorizing: four separate operations are needed

In [5]:
for i in range(4) :
    print(data.x[i] * 3)

3
6
9
12


<br> 
With vectorizing: a single operation on the whole column is enough

In [6]:
data.x * 3

0     3
1     6
2     9
3    12
Name: x, dtype: int64
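The comparison above can be checked on a longer column. This is a minimal sketch: the Series `values` and the other variable names are made up for illustration and are not part of the original notebook. It confirms that the explicit loop and the vectorized expression give the same numbers.

```python
import numpy as np
import pandas as pd

# Hypothetical sample column; any numeric Series behaves the same way.
values = pd.Series(np.arange(1, 1_001))

# Loop version: one Python-level multiplication per element.
looped = [v * 3 for v in values]

# Vectorized version: a single expression over the whole column.
vectorized = values * 3

print(vectorized.tolist() == looped)
```

On large columns the vectorized form is usually much faster, because the per-element work runs in compiled code instead of the Python interpreter.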

---
<br> 
# Example

<br> 
## map + lambda

Original code

Goal: a@b@c -> a@c  
e.g. BP@BP@ACTCOL EP-505S-D -> BP@ACTCOL EP-505S-D

In [7]:
# for i in soft_em_BP.index:
#     soft_em_BP.loc[i, 'MATERIAL'] = soft_em_BP.loc[i, 'MATERIAL'].split('@')[0] + '@' + soft_em_BP.loc[i, 'MATERIAL'].split('@')[2]

Revised code

In [8]:
# soft_em_BP['MATERIAL'] = list(map(lambda x: x.split('@')[0] + '@' + x.split('@')[2], soft_em_BP['MATERIAL'].fillna("@@")))
# soft_em_BP.loc[soft_em_BP['MATERIAL'] == "@", 'MATERIAL'] = np.nan  # "@@" placeholders become "@"; turn them back into NaN
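The same a@b@c -> a@c transformation can also be written with the pandas `.str` accessor, which leaves missing values missing and so needs no placeholder. This is only a sketch: the real `soft_em_BP` data is not shown in this notebook, so the frame below is a hypothetical stand-in with made-up rows.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for soft_em_BP; the real data is not shown here.
soft_em_BP = pd.DataFrame(
    {"MATERIAL": ["BP@BP@ACTCOL EP-505S-D", "AD@AD@DEG", np.nan]}
)

# Split each string on "@" once; rows with a missing value stay missing.
parts = soft_em_BP["MATERIAL"].str.split("@", expand=True)

# Keep the first and third pieces: a@b@c -> a@c.
soft_em_BP["MATERIAL"] = parts[0] + "@" + parts[2]

print(soft_em_BP["MATERIAL"].tolist())
```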

**lambda**  
Defines a small anonymous function in a single expression

In [9]:
def plus(x,y) :
    return(x+y)
plus(10,20)

30

In [10]:
(lambda x,y:x+y)(10,20) 

30

**map**  
Takes a function and a list (or any other iterable), pulls the elements out one at a time, and applies the function to each

In [11]:
list(map(lambda x: x ** 2, [1,2,3,4,5]))

[1, 4, 9, 16, 25]
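Two details are worth seeing in code: in Python 3 `map` returns a lazy iterator (which is why the cell above wraps it in `list`), and pandas offers the same idea as `Series.map`. The Series `x` below is a made-up sample column.

```python
import pandas as pd

# map() is lazy in Python 3: it returns an iterator, not a list.
squares = map(lambda v: v ** 2, [1, 2, 3, 4, 5])
squares_list = list(squares)   # the values are computed here
leftover = list(squares)       # the iterator is already exhausted

# pandas exposes the same idea as a Series method.
x = pd.Series([1, 2, 3, 4])    # hypothetical sample column
squared = x.map(lambda v: v ** 2)

print(squares_list, leftover, squared.tolist())
```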

<br> 
## Apply

Original code  
<br>
Apply LabelEncoder() to every column of object type with a for loop

In [12]:
# lbl = LabelEncoder()
# cols = list(df.dtypes[df.dtypes == 'object'].index)
# for c in cols:
#     lbl.fit(list(df[c].values))
#     df[c] = lbl.transform(list(df[c].values))

Revised code

In [13]:
# cols = df.columns[df.dtypes == 'object']
# df[cols] = df[cols].apply(LabelEncoder().fit_transform)
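The pattern "select columns, then apply one function per column" can be tried without scikit-learn. The sketch below swaps in pandas category codes for `LabelEncoder`, which is a different technique that gives comparable integer codes (both number the labels in sorted order). The frame `df` and its columns are hypothetical.

```python
import pandas as pd

# Hypothetical frame with two text columns and one numeric column.
df = pd.DataFrame(
    {"color": ["red", "blue", "red"], "size": ["S", "M", "L"], "n": [1, 2, 3]}
)

# Select the non-numeric columns (the text columns in this example).
cols = df.select_dtypes(exclude="number").columns

# apply() passes each selected column to the function, one column at a time.
df[cols] = df[cols].apply(lambda s: s.astype("category").cat.codes)

print(df)
```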

**apply**  

In [14]:
data

| | x | y | z |
|---|---|---|---|
| 0 | 1 | 5 | 10 |
| 1 | 2 | 6 | 11 |
| 2 | 3 | 7 | 12 |
| 3 | 4 | 8 | 13 |


In [15]:
data.apply(sum, axis=0) # col

x    10
y    26
z    46
dtype: int64

In [16]:
data.apply(sum, axis=1) # row

0    16
1    19
2    22
3    25
dtype: int64

In [17]:
# For each row:
# if x*2 >= y,
# store x-y in newX;
# otherwise store 0 in newX
data['newX'] = data.apply(lambda df: df['x']-df['y'] 
                          if (df['x']*2 >= df['y']) 
                          else 0, 
                          axis = 1) # row
data

| | x | y | z | newX |
|---|---|---|---|---|
| 0 | 1 | 5 | 10 | 0 |
| 1 | 2 | 6 | 11 | 0 |
| 2 | 3 | 7 | 12 | 0 |
| 3 | 4 | 8 | 13 | -4 |
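A row-wise `apply` with a lambda still calls a Python function once per row. For a simple if/else rule like this one, `np.where(condition, value_if_true, value_if_false)` computes the same column in one vectorized step. A minimal sketch that rebuilds the sample data and compares the two results:

```python
import numpy as np
import pandas as pd

data = pd.DataFrame({"x": [1, 2, 3, 4], "y": [5, 6, 7, 8], "z": [10, 11, 12, 13]})

# Row-wise apply: the lambda runs once per row.
by_apply = data.apply(
    lambda row: row["x"] - row["y"] if row["x"] * 2 >= row["y"] else 0, axis=1
)

# Vectorized: the condition and both branches are evaluated on whole columns.
by_where = np.where(data["x"] * 2 >= data["y"], data["x"] - data["y"], 0)

print(by_apply.tolist(), by_where.tolist())
```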


<br> 
## Merge 

Original code

In [18]:
# for i in slab_em_AD.index:
#   if i in soft_em_AD.index:
#     slab_em_AD.loc[i] = soft_em_AD.loc[i]

Revised code

In [19]:
# slab_em_AD = pd.merge(slab_em_AD, soft_em_AD, how = "left", on = "MATERIAL")

**merge**

In [20]:
data2 = pd.DataFrame({"x":[2,1,4,3], 
                     "y2":[100,101,102,103]})
data2

| | x | y2 |
|---|---|---|
| 0 | 2 | 100 |
| 1 | 1 | 101 |
| 2 | 4 | 102 |
| 3 | 3 | 103 |


In [21]:
pd.merge(data, data2, how = "left", on ="x")

| | x | y | z | newX | y2 |
|---|---|---|---|---|---|
| 0 | 1 | 5 | 10 | 0 | 101 |
| 1 | 2 | 6 | 11 | 0 | 100 |
| 2 | 3 | 7 | 12 | 0 | 103 |
| 3 | 4 | 8 | 13 | -4 | 102 |
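A left merge matches rows on the key column regardless of row order, and keeps every row of the left table even when the right table has no match. The sketch below uses a made-up lookup table `right` that has no entry for x == 3, to show what happens to the unmatched row.

```python
import pandas as pd

left = pd.DataFrame({"x": [1, 2, 3, 4], "y": [5, 6, 7, 8]})
# Hypothetical lookup table: different row order, and no entry for x == 3.
right = pd.DataFrame({"x": [2, 1, 4], "y2": [100, 101, 102]})

# how="left" keeps all rows of `left`; unmatched keys get NaN in y2.
merged = pd.merge(left, right, how="left", on="x")

print(merged)
```

Because of the NaN, the y2 column is stored as float in the merged result.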


## np.setdiff1d

Original code

In [22]:
# Head = ['ITEM', 'EXPERIMENT']
# AD = ['AD@1.4-BD',  'AD@A-1',     'AD@A-33',     'AD@B-8002',             'AD@B-8409',
#       'AD@B-8418',  'AD@B-8462',  'AD@B-8715LF', 'AD@Benzylalcohol',      'AD@CYCLO-PENTANE',
#       'AD@D/Water', 'AD@DC-5906', 'AD@DEG',      'AD@DEOA',               'AD@DPG',
#       'AD@ECOMATE', 'AD@EG',      'AD@GLY',      'AD@L417',               'AD@L-580',
#       'AD@L-618',   'AD@L-668',   'AD@L-6988',   'AD@METHYLENE CHLORIDE', 'AD@PUR500',
#       'AD@T-9']

In [23]:
# # Create the columns that are missing from the existing data
# for i in Head+AD+BP+IS+PI_ratio+Etc+Target:
#     if i not in slab_data.columns:
#         slab_data[i] = np.nan

Revised code

In [24]:
# slab_data[np.setdiff1d(Head+AD+BP+IS+PI_ratio+Etc+Target, slab_data.columns)] = np.nan 

**np.setdiff1d()**  
<br> 
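Problem :  
The goal is a DataFrame with the columns ['x', 'y', 'z', 'a', 'b'].  
The current data has the columns x, y and z (plus newX from the Apply section).  
Any column that does not exist yet must be created and filled with NaN.    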

In [25]:
data.columns

Index(['x', 'y', 'z', 'newX'], dtype='object')

Find the missing columns with np.setdiff1d()

In [26]:
np.setdiff1d(['x','y','z','a','b'], data.columns)

array(['a', 'b'], dtype='<U1')

In [27]:
data[np.setdiff1d(['x','y','z','a','b'], data.columns)] = np.nan

In [28]:
data

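| | x | y | z | newX | a | b |
|---|---|---|---|---|---|---|
| 0 | 1 | 5 | 10 | 0 | NaN | NaN |
| 1 | 2 | 6 | 11 | 0 | NaN | NaN |
| 2 | 3 | 7 | 12 | 0 | NaN | NaN |
| 3 | 4 | 8 | 13 | -4 | NaN | NaN |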
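One property of `np.setdiff1d` matters here: it returns the sorted unique values, so the new columns are appended in sorted order, not in the order of the requested list. If the final column order matters, `DataFrame.reindex(columns=...)` is an alternative (a different technique, not the one used above) that creates the missing columns and orders them in one step. The frame and the `wanted` list below are hypothetical.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"x": [1, 2], "y": [5, 6], "z": [10, 11]})
wanted = ["x", "y", "z", "b", "a"]   # hypothetical target column list

# Sorted unique values of `wanted` that are not in df.columns.
missing = np.setdiff1d(wanted, df.columns)

# Alternative: reindex creates the missing columns (filled with NaN)
# and puts all columns in the order given by `wanted`.
reindexed = df.reindex(columns=wanted)

print(list(missing), list(reindexed.columns))
```

Note that `reindex` also drops any column that is not listed in `wanted`, which the `np.setdiff1d` assignment does not do.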


<br> 
## Column Indexing & Sum()

Original code

Count the values greater than 0 in each column

In [29]:
# for i in Drop_ITEM:
#     print(i, slab_data.loc[slab_data[slab_data[i]>0].index, i].count(), 'rows')   

Revised code

In [30]:
# (slab_data[Drop_ITEM] > 0).sum()

**Col Indexing**

In [31]:
data[['x','a']]

| | x | a |
|---|---|---|
| 0 | 1 | NaN |
| 1 | 2 | NaN |
| 2 | 3 | NaN |
| 3 | 4 | NaN |


In [32]:
temp = ['x','a']
data[temp]

| | x | a |
|---|---|---|
| 0 | 1 | NaN |
| 1 | 2 | NaN |
| 2 | 3 | NaN |
| 3 | 4 | NaN |


**sum()**

In [33]:
data

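| | x | y | z | newX | a | b |
|---|---|---|---|---|---|---|
| 0 | 1 | 5 | 10 | 0 | NaN | NaN |
| 1 | 2 | 6 | 11 | 0 | NaN | NaN |
| 2 | 3 | 7 | 12 | 0 | NaN | NaN |
| 3 | 4 | 8 | 13 | -4 | NaN | NaN |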


In [34]:
data.sum(axis=0) # col sum

x       10.0
y       26.0
z       46.0
newX    -4.0
a        0.0
b        0.0
dtype: float64

In [35]:
data.sum(axis=1) # row sum

0    16.0
1    19.0
2    22.0
3    21.0
dtype: float64

In [36]:
data.sum().sum() # all sum

78.0

**Condition + Sum()**

In [37]:
temp = ['x','y']
data[temp] >= 3

| | x | y |
|---|---|---|
| 0 | False | True |
| 1 | False | True |
| 2 | True | True |
| 3 | True | True |


In [38]:
temp = ['x','y']
(data[temp] >= 3).sum()

x    2
y    4
dtype: int64
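Summing a boolean frame works as a count because True is treated as 1 and False as 0. The sketch below rebuilds two of the sample columns and checks that the loop-style count and the vectorized count agree. It also shows `mean()`, which by the same logic gives the fraction of rows that satisfy the condition.

```python
import pandas as pd

data = pd.DataFrame({"x": [1, 2, 3, 4], "y": [5, 6, 7, 8]})
cols = ["x", "y"]

# Loop version: filter each column, then count the surviving rows.
loop_counts = {c: int(data.loc[data[c] >= 3, c].count()) for c in cols}

# Vectorized version: True counts as 1 and False as 0 when summed.
vector_counts = (data[cols] >= 3).sum()

# The mean of a boolean column is the fraction of rows meeting the condition.
fractions = (data[cols] >= 3).mean()

print(loop_counts, vector_counts.to_dict(), fractions.to_dict())
```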

<br> 
## np.where()
Finds the positions where a condition holds

Original code

Drop every row that has a value greater than 0 in any of the listed columns

In [39]:
# for i in Drop_ITEM:
#     slab_data.drop(slab_data[slab_data[i]>0].index, inplace=True)

Revised code

In [40]:
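# Test each column with any(axis=1) (a row sum can hide negatives); np.where returns positions, so map them to index labels
# slab_data.drop(slab_data.index[np.where((slab_data[Drop_ITEM] > 0).any(axis=1))[0]], inplace = True)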

**np.where()**

In [41]:
data.a = [1,1,1, np.nan]
data.b = [2,2,2,2]

data

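| | x | y | z | newX | a | b |
|---|---|---|---|---|---|---|
| 0 | 1 | 5 | 10 | 0 | 1.0 | 2 |
| 1 | 2 | 6 | 11 | 0 | 1.0 | 2 |
| 2 | 3 | 7 | 12 | 0 | 1.0 | 2 |
| 3 | 4 | 8 | 13 | -4 | NaN | 2 |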


<br> 
Problem :  
Drop every row that contains at least one NaN

In [42]:
np.where(np.isnan(data))[0]

array([3])

In [43]:
data.drop(np.where(np.isnan(data))[0], inplace = True)

In [44]:
data

| | x | y | z | newX | a | b |
|---|---|---|---|---|---|---|
| 0 | 1 | 5 | 10 | 0 | 1.0 | 2 |
| 1 | 2 | 6 | 11 | 0 | 1.0 | 2 |
| 2 | 3 | 7 | 12 | 0 | 1.0 | 2 |
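`np.where` returns row positions, while `DataFrame.drop` expects index labels. The two coincide for `data` because its index is the default 0, 1, 2, ..., but not in general. The sketch below uses a hypothetical frame with index labels 10, 20, 30 to show the translation step, and compares the result with the built-in `dropna()`.

```python
import numpy as np
import pandas as pd

# Hypothetical frame whose index labels are not 0..n-1.
df = pd.DataFrame(
    {"a": [1.0, np.nan, 3.0], "b": [2.0, 2.0, 2.0]}, index=[10, 20, 30]
)

# np.where returns row *positions* (here position 1), not index labels.
positions = np.unique(np.where(df.isna().to_numpy())[0])

# Translate positions to labels before calling drop().
by_where = df.drop(df.index[positions])

# Built-in equivalent for this particular task.
by_dropna = df.dropna()

print(by_where.index.tolist(), by_dropna.index.tolist())
```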
