
### Indicator / Dummy Variables

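Another type of data transformation for statistical modeling or machine learning applications is converting a categorical variable into a dummy (indicator) matrix.

If a column in a DataFrame has k distinct values, you can derive a DataFrame or matrix with k columns filled with 1s and 0s.

pandas provides the get_dummies function for this.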

In [0]:
import numpy as np
import pandas as pd

In [0]:
df = pd.DataFrame({'key': ['b', 'b', 'a', 'c', 'a', 'b'],
                   'data1': range(6)
                  })

In [3]:
df

Unnamed: 0,data1,key
0,0,b
1,1,b
2,2,a
3,3,c
4,4,a
5,5,b


In [4]:
pd.get_dummies(df['key'])

Unnamed: 0,a,b,c
0,0,1,0
1,0,1,0
2,1,0,0
3,0,0,1
4,1,0,0
5,0,1,0


To add a prefix to the column names of the indicator DataFrame, pass the prefix argument:

In [5]:
dummies = pd.get_dummies(df['key'], prefix='key')
dummies

Unnamed: 0,key_a,key_b,key_c
0,0,1,0
1,0,1,0
2,1,0,0
3,0,0,1
4,1,0,0
5,0,1,0


To merge the result with other data, for example joining data1 with the newly created dummies:

In [6]:
df_with_dummies = df[['data1']].join(dummies)

df_with_dummies

Unnamed: 0,data1,key_a,key_b,key_c
0,0,0,1,0
1,1,0,1,0
2,2,1,0,0
3,3,0,0,1
4,4,1,0,0
5,5,0,1,0
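
As a possible shortcut for the prefix-then-join steps above, get_dummies can also take the whole DataFrame together with a columns argument. The sketch below rebuilds the same df. The dtype=int argument is an assumption added here so that the output shows 0/1 values on newer pandas versions, where the default indicator dtype is boolean.

```python
import pandas as pd

df = pd.DataFrame({'key': ['b', 'b', 'a', 'c', 'a', 'b'],
                   'data1': range(6)})

# Encode only the 'key' column; the other columns pass through unchanged.
# The column name is used as the prefix by default (key_a, key_b, key_c).
# dtype=int is an assumption to force 0/1 output instead of booleans.
df_with_dummies = pd.get_dummies(df, columns=['key'], dtype=int)
print(df_with_dummies)
```

Whether this is preferable to the explicit join is a matter of taste; the explicit version makes it easier to reuse dummies on its own.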


Things get a bit more complicated when a row in a DataFrame can belong to multiple categories.

Let's look at the MovieLens movie data we saw earlier as an example.

In [8]:
from google.colab import files

uploaded = files.upload()

for fn in uploaded.keys():
  print('User uploaded file "{name}" with length {length} bytes'.format(
      name=fn, length=len(uploaded[fn])))

Saving movies.dat to movies.dat
Saving ratings.dat to ratings.dat
Saving users.dat to users.dat
User uploaded file "movies.dat" with length 171308 bytes
User uploaded file "ratings.dat" with length 24594131 bytes
User uploaded file "users.dat" with length 134368 bytes


In [0]:
mnames = ['movie_id', 'title', 'genres']

In [10]:
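# A multi-character separator such as '::' requires the Python parsing engine;
# passing engine='python' explicitly avoids the ParserWarning.
movies = pd.read_table('movies.dat', sep='::', header=None, names=mnames,
                       engine='python')
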


In [11]:
movies[:10]

Unnamed: 0,movie_id,title,genres
0,1,Toy Story (1995),Animation|Children's|Comedy
1,2,Jumanji (1995),Adventure|Children's|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama
4,5,Father of the Bride Part II (1995),Comedy
5,6,Heat (1995),Action|Crime|Thriller
6,7,Sabrina (1995),Comedy|Romance
7,8,Tom and Huck (1995),Adventure|Children's
8,9,Sudden Death (1995),Action
9,10,GoldenEye (1995),Action|Adventure|Thriller


Adding an indicator variable for each genre takes a little work.
First, extract the list of unique genres in the dataset, using set.union.

In [0]:
genre_iter = (set(x.split('|')) for x in movies.genres)

In [0]:
genres = sorted(set.union(*genre_iter))

In [14]:
genres

['Action',
 'Adventure',
 'Animation',
 "Children's",
 'Comedy',
 'Crime',
 'Documentary',
 'Drama',
 'Fantasy',
 'Film-Noir',
 'Horror',
 'Musical',
 'Mystery',
 'Romance',
 'Sci-Fi',
 'Thriller',
 'War',
 'Western']

Next, build the indicator DataFrame by starting from a DataFrame initialized to all zeros.

In [16]:
dummies = pd.DataFrame(np.zeros((len(movies), len(genres))), columns=genres)

dummies

Unnamed: 0,Action,Adventure,Animation,Children's,Comedy,Crime,Documentary,Drama,Fantasy,Film-Noir,Horror,Musical,Mystery,Romance,Sci-Fi,Thriller,War,Western
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
6,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
7,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
8,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
9,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


Now iterate over each movie and set the matching entries in that movie's row of dummies to 1.

In [0]:
for i, gen in enumerate(movies.genres):
  dummies.loc[i, gen.split('|')] = 1

In [20]:
dummies

Unnamed: 0,Action,Adventure,Animation,Children's,Comedy,Crime,Documentary,Drama,Fantasy,Film-Noir,Horror,Musical,Mystery,Romance,Sci-Fi,Thriller,War,Western
0,0.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
6,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0
7,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
8,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
9,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0


Then, as before, combine the result with the movies DataFrame.

In [0]:
movies_windic = movies.join(dummies.add_prefix('Genre_'))

In [22]:
movies_windic[:1]

Unnamed: 0,movie_id,title,genres,Genre_Action,Genre_Adventure,Genre_Animation,Genre_Children's,Genre_Comedy,Genre_Crime,Genre_Documentary,...,Genre_Fantasy,Genre_Film-Noir,Genre_Horror,Genre_Musical,Genre_Mystery,Genre_Romance,Genre_Sci-Fi,Genre_Thriller,Genre_War,Genre_Western
0,1,Toy Story (1995),Animation|Children's|Comedy,0.0,0.0,1.0,1.0,1.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [24]:
movies_windic.columns

Index(['movie_id', 'title', 'genres', 'Genre_Action', 'Genre_Adventure',
       'Genre_Animation', 'Genre_Children's', 'Genre_Comedy', 'Genre_Crime',
       'Genre_Documentary', 'Genre_Drama', 'Genre_Fantasy', 'Genre_Film-Noir',
       'Genre_Horror', 'Genre_Musical', 'Genre_Mystery', 'Genre_Romance',
       'Genre_Sci-Fi', 'Genre_Thriller', 'Genre_War', 'Genre_Western'],
      dtype='object')
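
As an alternative to the zero-matrix-and-loop construction above, pandas also offers Series.str.get_dummies, which splits delimited strings and builds the indicator columns in one call. The sketch below uses a small hand-made stand-in for movies (movies_sample is a hypothetical name, with three rows copied from the output above) so that it runs without the data files.

```python
import pandas as pd

# A small stand-in for the movies DataFrame (rows taken from the output above).
movies_sample = pd.DataFrame({
    'title': ['Toy Story (1995)', 'Grumpier Old Men (1995)', 'Sudden Death (1995)'],
    'genres': ["Animation|Children's|Comedy", 'Comedy|Romance', 'Action'],
})

# str.get_dummies splits each string on the separator and returns
# one 0/1 indicator column per distinct token.
genre_dummies = movies_sample['genres'].str.get_dummies('|')
sample_windic = movies_sample.join(genre_dummies.add_prefix('Genre_'))
print(sample_windic.columns.tolist())
```

On large datasets this is likely to be faster than assigning row by row with loc, though the loop makes the mechanics easier to follow.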

Combining get_dummies with a discretization function such as cut is a useful recipe for statistical applications.

In [0]:
values = np.random.rand(10)

In [26]:
values

array([0.69054451, 0.14746498, 0.78415342, 0.77901441, 0.99648749,
       0.83770095, 0.53978632, 0.18184354, 0.94528051, 0.26356115])

In [0]:
bins = [0, 0.2, 0.4, 0.6, 0.8, 1]

In [30]:
pd.cut(values, bins)

[(0.6, 0.8], (0.0, 0.2], (0.6, 0.8], (0.6, 0.8], (0.8, 1.0], (0.8, 1.0], (0.4, 0.6], (0.0, 0.2], (0.8, 1.0], (0.2, 0.4]]
Categories (5, interval[float64]): [(0.0, 0.2] < (0.2, 0.4] < (0.4, 0.6] < (0.6, 0.8] < (0.8, 1.0]]

In [31]:
pd.get_dummies(pd.cut(values, bins))

Unnamed: 0,"(0.0, 0.2]","(0.2, 0.4]","(0.4, 0.6]","(0.6, 0.8]","(0.8, 1.0]"
0,0,0,0,1,0
1,1,0,0,0,0
2,0,0,0,1,0
3,0,0,0,1,0
4,0,0,0,0,1
5,0,0,0,0,1
6,0,0,1,0,0
7,1,0,0,0,0
8,0,0,0,0,1
9,0,1,0,0,0


Displayed this way, the bin membership of each value is much easier to read.
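
If the interval notation in the column headers is hard to work with, cut accepts a labels argument. The following is a minimal sketch with fixed input values (so the result is reproducible) and hypothetical label names chosen only for illustration; dtype=int is again an assumption to get 0/1 output.

```python
import pandas as pd

values = [0.69, 0.15, 0.78, 0.99, 0.26]   # fixed values instead of np.random.rand
bins = [0, 0.2, 0.4, 0.6, 0.8, 1]
labels = ['very_low', 'low', 'mid', 'high', 'very_high']  # hypothetical names

# Each value is assigned the label of the bin it falls into.
binned = pd.cut(values, bins, labels=labels)

# One indicator column per label, including labels no value fell into.
indicators = pd.get_dummies(binned, dtype=int)
print(indicators)
```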

## String Manipulation

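Python is well suited to string and text processing.

Most text operations can be handled simply with the string object's built-in methods.

More complex pattern matching and text manipulation, however, require regular expressions.

pandas adds the ability to apply string operations and regular expressions concisely across entire arrays of data, while also conveniently handling missing data.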
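
To illustrate the last point, here is a small sketch using the .str accessor on a Series that contains a missing value. The emails Series is made up for this example and is not part of the original notebook.

```python
import pandas as pd

emails = pd.Series(['dave@google.com', None, 'rob@gmail.com'])

# String methods under .str skip missing entries instead of raising.
upper = emails.str.upper()

# str.contains treats the pattern as a regular expression by default;
# na=False makes missing entries count as "no match".
is_gmail = emails.str.contains(r'@gmail\.com$', na=False)

print(upper)
print(is_gmail)
```

Calling the plain str.upper method on each element would fail on the None entry, which is the inconvenience the .str accessor is meant to remove.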

### String Object Methods

In [32]:
# Example of the split method

val = 'a, b,  guido'

val.split(',')

['a', ' b', '  guido']

split is often combined with strip, which trims whitespace (including line breaks).

In [33]:
pieces = [x.strip() for x in val.split(",")]
pieces

['a', 'b', 'guido']

The separated substrings can be concatenated with the '+' operator.

In [34]:
first, second, third = pieces

first + '::' + second + '::' + third

'a::b::guido'

This works, but it is not a very Pythonic approach.

The Pythonic way is to use the join method.

In [36]:
'::'.join(pieces)

'a::b::guido'

There are also ways to detect and locate a substring.

index and find work for this, but Python's in keyword is the simplest way to check whether a substring is present.

In [37]:
'guido' in val

True

In [38]:
val.index(',')

1

In [39]:
val.find(':')

-1

The difference between find and index is that index raises an exception if the substring is not found, whereas find returns -1, as seen above.

In [40]:
val.index(':')

ValueError: substring not found
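
Because index raises instead of returning a sentinel value, code that uses it usually needs a try/except. A minimal sketch follows; position_or_none is a hypothetical helper name introduced only for this example.

```python
val = 'a, b,  guido'

def position_or_none(text, sub):
    """Return the index of sub in text, or None when it is absent."""
    try:
        return text.index(sub)
    except ValueError:  # raised by str.index when sub is not found
        return None

print(position_or_none(val, ','))
print(position_or_none(val, ':'))
print(val.find(':'))
```

Returning None rather than -1 is a design choice: -1 is a valid index in Python (the last character), so using it as a "not found" marker can hide bugs.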

count returns the number of occurrences of a particular substring.

In [41]:
val.count(',')

2

replace substitutes occurrences of one pattern with another string.

It is also commonly used to delete a pattern by passing an empty string as the replacement.

In [42]:
val.replace(',', '::')

'a:: b::  guido'

In [43]:
val.replace(',', '') # delete ',' by replacing it with an empty string

'a b  guido'

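The most commonly used built-in string methods are listed below.

* count: Returns the number of non-overlapping occurrences of a substring in the string.
* endswith, startswith: Return True if the string ends with the given suffix (endswith) or starts with the given prefix (startswith).
* join: Uses the string as a delimiter to concatenate a sequence of other strings.
* index: Returns the position of the first character of the first occurrence of the substring; raises ValueError if it is not found.
* find: Like index, but returns -1 if the substring is not found.
* rfind: Returns the position of the first character of the last occurrence of the substring; returns -1 if it is not found.
* replace: Replaces occurrences of a substring with another string.
* strip, rstrip, lstrip: Trim whitespace, including newlines. strip trims both ends, lstrip only the start of the string, and rstrip only the end.
* split: Breaks the string into a list of substrings using the given delimiter.
* lower, upper: Convert alphabetic characters to lowercase or uppercase, respectively.
* ljust, rjust: Left-justify or right-justify the string, respectively, padding the opposite side with spaces (or another fill character) so that the returned string has at least the given width.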
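
Several of the methods in the list are not exercised in the cells above. The short sketch below tries them on the same val string; the raw variable is an extra example string introduced here.

```python
val = 'a, b,  guido'

print(val.startswith('a'), val.endswith('guido'))
print(val.rfind(','))                # position of the last comma
print(repr('guido'.ljust(8)))        # padded on the right with spaces
print(repr('guido'.rjust(8, '.')))   # padded on the left with a custom fill

raw = '  guido \n'
print(repr(raw.strip()), repr(raw.lstrip()), repr(raw.rstrip()))
print(val.upper(), val.lower())
```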

### Regular Expressions

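Regular expressions provide a flexible way to search for or match string patterns in text.

In Python they are handled by the built-in re module.

To split a string containing a variable mix of whitespace characters (tabs, spaces, and newlines), you can use \s+, which matches one or more whitespace characters.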

In [0]:
import re

In [0]:
text = "foo     bar\t  baz    \tqux"

In [3]:
re.split(r"\s+", text)

['foo', 'bar', 'baz', 'qux']
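
One detail worth knowing when splitting on \s+ is what happens with leading or trailing whitespace. The sketch below compares re.split with the built-in str.split called without arguments; padded is an example string introduced here.

```python
import re

padded = "  foo   bar\tbaz \n"

# re.split keeps empty strings when the text starts or ends with a match.
regex_pieces = re.split(r"\s+", padded)

# str.split() with no arguments also splits on runs of whitespace,
# but discards the leading and trailing empty strings.
plain_pieces = padded.split()

print(regex_pieces)
print(plain_pieces)
```

For pure whitespace splitting, str.split() is therefore often enough; the regular expression becomes useful once the delimiter pattern is more complex.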

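When you call re.split(r"\s+", text), the regular expression is first compiled, and then its split method is called on the passed text.

You can also compile the regular expression yourself with re.compile and reuse the resulting regex object, as follows.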

In [0]:
regex = re.compile(r'\s+')

In [5]:
regex.split(text)

['foo', 'bar', 'baz', 'qux']

In [6]:
# To get a list of all substrings matching the regex, use findall().

regex.findall(text)

['     ', '\t  ', '    \t']

Note: To avoid unwanted escaping of '\' in a regular expression, use raw string literals: r'C:\x' is equivalent to 'C:\\x'.
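
The equivalence stated in the note can be checked directly. This is a small self-verifying sketch; the assertions pass silently when the statements hold.

```python
import re

# In a normal literal, the backslash must be doubled to survive as one character.
assert r'C:\x' == 'C:\\x'
assert len(r'C:\x') == 4

# '\t' is a single tab character; r'\t' is a backslash followed by 't'.
assert len('\t') == 1
assert len(r'\t') == 2

# Both spellings hand the same pattern text to the re module.
assert re.split(r'\s+', 'foo  bar') == re.split('\\s+', 'foo  bar')
print('all checks passed')
```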

---

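If you intend to apply the same expression to many strings, creating a regex object with re.compile is recommended, since it saves CPU cycles.

findall returns all matches in a string, whereas search returns only the first match.

match is stricter: it only matches at the beginning of the string.

Let's look at an example in detail.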

In [0]:
text = """Dave dave@google.com
Steve steve@gmail.com
Rob rob@gmail.com
Ryan ryan@yahoo.com
"""

In [0]:
pattern = r"[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}"
# re.IGNORECASE makes the regex case-insensitive.
regex = re.compile(pattern, flags=re.IGNORECASE)

In [19]:
regex.findall(text)

['dave@google.com', 'steve@gmail.com', 'rob@gmail.com', 'ryan@yahoo.com']

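search finds only the first email address in the text and returns a match object for it.

For this regex, the match object can only tell us the start and end positions of the pattern in the string.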

In [0]:
m = regex.search(text)

In [21]:
m

<_sre.SRE_Match object; span=(5, 20), match='dave@google.com'>

In [22]:
text[m.start():m.end()]

'dave@google.com'

regex.match returns None, because it only matches if the pattern occurs at the start of the string.

In [23]:
print(regex.match(text))

None


sub returns a new string with occurrences of the pattern replaced by the given string.

In [24]:
print(regex.sub('REDACTED', text))

Dave REDACTED
Steve REDACTED
Rob REDACTED
Ryan REDACTED



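Suppose you want to find email addresses and simultaneously segment each one into its three components: username, domain name, and domain suffix. To do this, put parentheses around the parts of the pattern to segment.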

In [0]:
pattern = r"([A-Z0-9._%+-]+)@([A-Z0-9.-]+)\.([A-Z]{2,4})"

In [0]:
regex = re.compile(pattern, flags=re.IGNORECASE)

A match object produced by this modified regex returns a tuple of the pattern components through its groups method.

In [0]:
m = regex.match('wesm@bright.net')

In [28]:
m.groups()

('wesm', 'bright', 'net')

When the pattern has groups, findall returns a list of tuples.

In [29]:
regex.findall(text)

[('dave', 'google', 'com'),
 ('steve', 'gmail', 'com'),
 ('rob', 'gmail', 'com'),
 ('ryan', 'yahoo', 'com')]

sub also has access to the groups in each match through special symbols such as \1 and \2.

In [30]:
print(regex.sub(r'Username: \1, Domain: \2, Suffix: \3', text))

Dave Username: dave, Domain: google, Suffix: com
Steve Username: steve, Domain: gmail, Suffix: com
Rob Username: rob, Domain: gmail, Suffix: com
Ryan Username: ryan, Domain: yahoo, Suffix: com



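There is much more to regular expressions in Python, most of which is outside the scope of this book, so it is omitted here.

To give just one more example, the match groups of the email regex can be given names as follows.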

In [0]:
regex = re.compile(r"""
  (?P<username>[A-Z0-9._%+-]+)
  @
  (?P<domain>[A-Z0-9.-]+)
  \.
  (?P<suffix>[A-Z]{2,4})
""", flags=re.IGNORECASE | re.VERBOSE)

In [0]:
m = regex.match('wesm@bright.net')

In [33]:
m

<_sre.SRE_Match object; span=(0, 15), match='wesm@bright.net'>

In [34]:
m.groupdict()

{'domain': 'bright', 'suffix': 'net', 'username': 'wesm'}
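
Since match returns None when the pattern does not fit, calling groupdict directly on the result can raise AttributeError for bad input. The sketch below wraps the named-group regex from above in a guard; parse_email is a hypothetical helper name, not something from the original notebook.

```python
import re

regex = re.compile(r"""
  (?P<username>[A-Z0-9._%+-]+)
  @
  (?P<domain>[A-Z0-9.-]+)
  \.
  (?P<suffix>[A-Z]{2,4})
""", flags=re.IGNORECASE | re.VERBOSE)

def parse_email(address):
    """Return the named components as a dict, or None if there is no match."""
    m = regex.match(address)
    if m is None:  # match returns None when the pattern does not fit
        return None
    return m.groupdict()

parsed = parse_email('wesm@bright.net')
print(parsed['username'], parsed['domain'], parsed['suffix'])
print(parse_email('not an email'))
```

Individual groups can also be read by name with m.group('username') when the full dict is not needed.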

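Regular expression methods

* findall: Returns all non-overlapping matching patterns in a string as a list.
* finditer: Like findall, but returns the matches one at a time through an iterator.
* match: Matches the pattern at the start of the string and optionally segments pattern components into groups. Returns a match object if the pattern matches, otherwise None.
* search: Scans the string for a match to the pattern and returns a match object if found, otherwise None. Unlike match, the match can be anywhere in the string.
* split: Breaks the string into pieces at each occurrence of the pattern.
* sub, subn: Replace all occurrences of the pattern in the string (or at most count occurrences when the count argument is given) with the replacement expression; subn additionally returns the number of substitutions made. The replacement string can use symbols such as \1 and \2 to refer to match group elements.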
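
finditer and subn appear in the list but are not demonstrated in the cells above. The following sketch tries them on a shortened version of the email text (two lines instead of four, to keep the example small).

```python
import re

text = "Dave dave@google.com\nSteve steve@gmail.com\n"
pattern = r"([A-Z0-9._%+-]+)@([A-Z0-9.-]+)\.([A-Z]{2,4})"
regex = re.compile(pattern, flags=re.IGNORECASE)

# finditer yields match objects lazily, so groups and positions are both available.
found = [(m.group(1), m.span()) for m in regex.finditer(text)]

# The count argument limits how many matches sub replaces (here only the first).
first_only = regex.sub('REDACTED', text, count=1)

# subn returns the new string together with the number of replacements made.
redacted, n_subs = regex.subn('REDACTED', text)

print(found)
print(first_only)
print(n_subs)
```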