# Pandas
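: Provides a wide range of features for processing and manipulating two-dimensional data made up of rows and columns.
- Why use it: it makes common data-analysis tasks easy, such as querying data and computing means, variances, standard deviations, and distributions.
- DataFrame: a data structure that holds two-dimensional data made up of multiple rows and columns.
- Series: a data structure with only a single column, which is its biggest difference from a DataFrame.
- Index: a key value that identifies each record, similar to a PK in an RDBMS (although pandas does not enforce uniqueness). Both DataFrame and Series use an Index as their key.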
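
To make the relationship between the three structures concrete, here is a minimal sketch. The toy data and the names `df` and `ages` are made up for illustration and are not part of the Titanic data used below.

```python
import pandas as pd

# Hypothetical toy data, not taken from the Titanic file used below.
df = pd.DataFrame(
    {"name": ["Ann", "Bob", "Cho"], "age": [31, 25, 47]},
    index=["a", "b", "c"],
)

ages = df["age"]  # selecting one column yields a Series

print(type(df).__name__)            # DataFrame
print(type(ages).__name__)          # Series
print(df.index.equals(ages.index))  # the Series shares the DataFrame's Index
print(ages.loc["b"])                # look up a value by its index label
```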

In [3]:
import pandas as pd
import numpy as np

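## Pandas provides convenient APIs for loading files in various formats into a DataFrame
- read_csv(): comma-separated files
- read_table(): tab-separated files
- read_fwf(): fixed-width formatted files, and others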

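### Arguments of read_csv():
- sep: the delimiter character; the default is a comma (for example, sep='\t' for tab-separated files)
- filepath_or_buffer: the file path. If only a file name is given, the file is looked up relative to the current working directory, which is normally the directory the notebook runs in.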
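
The effect of `sep` can be sketched without any file on disk by reading from in-memory text. The sample strings below are hypothetical; `io.StringIO` simply stands in for a file path.

```python
import io
import pandas as pd

# Hypothetical in-memory "files" so the example runs without any file on disk.
csv_text = "id,name\n1,Ann\n2,Bob\n"
tsv_text = "id\tname\n1\tAnn\n2\tBob\n"

df_csv = pd.read_csv(io.StringIO(csv_text))            # sep defaults to ','
df_tsv = pd.read_csv(io.StringIO(tsv_text), sep="\t")  # tab-separated input

# Both calls parse the same two rows and two columns.
print(df_csv.equals(df_tsv))
```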

In [4]:
titanic_df = pd.read_csv('./titanic_train.csv')
print('titanic variable type:', type(titanic_df))
titanic_df

titanic variable type: <class 'pandas.core.frame.DataFrame'>


Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.2500,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.9250,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1000,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.0500,,S
...,...,...,...,...,...,...,...,...,...,...,...,...
886,887,0,2,"Montvila, Rev. Juozas",male,27.0,0,0,211536,13.0000,,S
887,888,1,1,"Graham, Miss. Margaret Edith",female,19.0,0,0,112053,30.0000,B42,S
888,889,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",female,,1,2,W./C. 6607,23.4500,,S
889,890,1,1,"Behr, Mr. Karl Howell",male,26.0,0,0,111369,30.0000,C148,C


In [5]:
books_df = pd.read_csv('./books.csv')
print('books variable type:', type(books_df))
books_df

books variable type: <class 'pandas.core.frame.DataFrame'>


Unnamed: 0,Fundamentals of Wavelets,"Goswami, Jaideva",signal_processing,228,Wiley
0,Data Smart,"Foreman, John",data_science,235,Wiley
1,God Created the Integers,"Hawking, Stephen",mathematics,197,Penguin
2,Superfreakonomics,"Dubner, Stephen",economics,179,HarperCollins
3,Orientalism,"Said, Edward",history,197,Penguin
4,"Nature of Statistical Learning Theory, The","Vapnik, Vladimir",data_science,230,Springer
...,...,...,...,...,...
205,Structure and Randomness,"Tao, Terence",mathematics,252,
206,Image Processing with MATLAB,"Eddins, Steve",signal_processing,241,
207,Animal Farm,"Orwell, George",fiction,180,
208,"Idiot, The","Dostoevsky, Fyodor",fiction,197,


- DataFrame.head(): returns the first few rows of the DataFrame; the default is 5 rows.

In [6]:
titanic_df.head(3)

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S


In [7]:
print('DataFrame shape:', titanic_df.shape)

DataFrame shape: (891, 12)


In [8]:
titanic_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB


In [9]:
books_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 210 entries, 0 to 209
Data columns (total 5 columns):
 #   Column                    Non-Null Count  Dtype 
---  ------                    --------------  ----- 
 0   Fundamentals of Wavelets  210 non-null    object
 1   Goswami, Jaideva          186 non-null    object
 2   signal_processing         210 non-null    object
 3   228                       210 non-null    int64 
 4   Wiley                     114 non-null    object
dtypes: int64(1), object(4)
memory usage: 8.3+ KB
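
The column names reported above ('Fundamentals of Wavelets', '228', and so on) look like values from the first record, which suggests that books.csv has no header line and that read_csv() used its first row as the header. A possible fix, sketched here on two sample rows, is to pass `header=None` together with explicit `names`. The column names in `cols` are hypothetical, since the file itself does not define any.

```python
import io
import pandas as pd

# Two rows in the same layout as the books data shown above (no header line).
raw = (
    'Fundamentals of Wavelets,"Goswami, Jaideva",signal_processing,228,Wiley\n'
    'Data Smart,"Foreman, John",data_science,235,Wiley\n'
)

# Hypothetical column names; the source file does not define any.
cols = ["Title", "Author", "Genre", "Num", "Publisher"]

# header=None tells pandas that the first line is data, not column names.
books = pd.read_csv(io.StringIO(raw), header=None, names=cols)

print(books.shape)            # both rows are kept as data
print(books.loc[0, "Title"])
```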


In [10]:
# Summary statistics
# count: number of non-null values, min: smallest value in the column
titanic_df.describe()

Unnamed: 0,PassengerId,Survived,Pclass,Age,SibSp,Parch,Fare
count,891.0,891.0,891.0,714.0,891.0,891.0,891.0
mean,446.0,0.383838,2.308642,29.699118,0.523008,0.381594,32.204208
std,257.353842,0.486592,0.836071,14.526497,1.102743,0.806057,49.693429
min,1.0,0.0,1.0,0.42,0.0,0.0,0.0
25%,223.5,0.0,2.0,20.125,0.0,0.0,7.9104
50%,446.0,0.0,3.0,28.0,0.0,0.0,14.4542
75%,668.5,1.0,3.0,38.0,1.0,0.0,31.0
max,891.0,1.0,3.0,80.0,8.0,6.0,512.3292


In [11]:
titanic_df['Pclass']

0      3
1      1
2      3
3      1
4      3
      ..
886    2
887    1
888    3
889    1
890    3
Name: Pclass, Length: 891, dtype: int64

In [12]:
# value_counts(): number of occurrences of each distinct value
value_counts = titanic_df['Pclass'].value_counts()
value_counts

3    491
1    216
2    184
Name: Pclass, dtype: int64

In [13]:
titanic_pclass = titanic_df['Pclass'] # a single column of the DataFrame
print(type(titanic_pclass))
print(type(titanic_df))

<class 'pandas.core.series.Series'>
<class 'pandas.core.frame.DataFrame'>


In [14]:
titanic_pclass.head()

0    3
1    1
2    3
3    1
4    3
Name: Pclass, dtype: int64

In [15]:
value_counts = titanic_df['Pclass'].value_counts()
print(type(value_counts))
print(value_counts)

<class 'pandas.core.series.Series'>
3    491
1    216
2    184
Name: Pclass, dtype: int64
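
value_counts() also takes a few options that are handy during exploration. The sketch below uses a small hypothetical Series with one missing entry to show `dropna` and `normalize`.

```python
import numpy as np
import pandas as pd

s = pd.Series([3, 1, 3, np.nan, 3, 1])  # hypothetical values with one missing entry

counts = s.value_counts()                       # NaN is excluded by default
counts_with_nan = s.value_counts(dropna=False)  # also count missing values
shares = s.value_counts(normalize=True)         # relative frequencies instead of counts

print(counts.loc[3.0], counts.loc[1.0])
print(counts_with_nan.sum())
print(shares.loc[3.0])
```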


## Converting between DataFrame and list, dictionary, and NumPy ndarray

In [16]:
# Convert an ndarray and a list to a DataFrame
col_name1 = ['col1']
list1 = [1, 2, 3]
array1 = np.array(list1)
print('array1 shape:', array1.shape)

df_list1 = pd.DataFrame(list1, columns=col_name1)
print('DataFrame created from a 1D list:\n', df_list1)
df_array1 = pd.DataFrame(array1, columns=col_name1)
print('DataFrame created from a 1D ndarray:\n', df_array1)

array1 shape: (3,)
DataFrame created from a 1D list:
    col1
0     1
1     2
2     3
DataFrame created from a 1D ndarray:
    col1
0     1
1     2
2     3


In [17]:
# Three column names are needed
col_name2 = ['col1', 'col2', 'col3']

# Create a 2x3 list and ndarray, then convert each to a DataFrame
list2 = [[1, 2, 3], [4, 5, 6]]
array2 = np.array(list2)
print('array2 shape:', array2.shape)

df_list2 = pd.DataFrame(list2, columns=col_name2)
print('DataFrame created from a 2D list:\n', df_list2)
df_array2 = pd.DataFrame(array2, columns=col_name2)
print('DataFrame created from a 2D ndarray:\n', df_array2)
df_array2

array2 shape: (2, 3)
DataFrame created from a 2D list:
    col1  col2  col3
0     1     2     3
1     4     5     6
DataFrame created from a 2D ndarray:
    col1  col2  col3
0     1     2     3
1     4     5     6


Unnamed: 0,col1,col2,col3
0,1,2,3
1,4,5,6


In [18]:
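# Convert a dictionary to a DataFrame
# Keys become column names; values are lists or ndarrays holding the column data
# (named dict1 rather than dict to avoid shadowing the built-in type)
dict1 = {
    'col1': [1, 11],
    'col2': [2, 22],
    'col3': [3, 33]
}
df_dict = pd.DataFrame(dict1)
print('DataFrame created from a dictionary:\n', df_dict)

DataFrame created from a dictionary:
    col1  col2  col3
0     1     2     3
1    11    22    33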




---

- Converting a DataFrame to a NumPy ndarray, list, and dict

In [19]:
df_dict.index

RangeIndex(start=0, stop=2, step=1)

In [20]:
df_dict.values

array([[ 1,  2,  3],
       [11, 22, 33]])

In [21]:
df_dict.columns

Index(['col1', 'col2', 'col3'], dtype='object')

In [22]:
# Convert the DataFrame to an ndarray
array3 = df_dict.values
print('df_dict.values type:', type(array3), 'df_dict.values shape:', array3.shape)
print(array3)

df_dict.values type: <class 'numpy.ndarray'> df_dict.values shape: (2, 3)
[[ 1  2  3]
 [11 22 33]]


In [23]:
# Convert the DataFrame to a list
list3 = df_dict.values.tolist()
print('df_dict.values.tolist() type:', type(list3))
print(list3)

# Convert the DataFrame to a dictionary
dict3 = df_dict.to_dict('list')
print('\n df_dict.to_dict() type:', type(dict3))
print(dict3)

df_dict.values.tolist() type: <class 'list'>
[[1, 2, 3], [11, 22, 33]]

 df_dict.to_dict() type: <class 'dict'>
{'col1': [1, 11], 'col2': [2, 22], 'col3': [3, 33]}
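
The `'list'` argument used above is one of several orientations that to_dict() accepts. As a small sketch on the same values, the block below compares it with `'records'` and also shows `to_numpy()`, which returns the same array as `.values`.

```python
import pandas as pd

df = pd.DataFrame({"col1": [1, 11], "col2": [2, 22], "col3": [3, 33]})

arr = df.to_numpy()                 # same data as df.values
as_lists = df.to_dict("list")       # column name -> list of values
as_records = df.to_dict("records")  # one dict per row

print(arr.shape)
print(as_lists)
print(as_records)
```

The `'records'` form is convenient when each row has to be handled as a separate item, for example when serializing rows one by one.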


## Accessing DataFrame column data

In [24]:
titanic_df.head(3)

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S


In [25]:
# Add new columns
titanic_df['Age_0'] = 0
titanic_df['구남석'] = None
titanic_df.head(3)

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,Age_0,구남석
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S,0,
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C,0,
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S,0,


In [26]:
titanic_df['Age_by_10'] = titanic_df['Age'] * 10
titanic_df['Family_No'] = titanic_df['SibSp'] + titanic_df['Parch'] + 1
titanic_df.head(3)

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,Age_0,구남석,Age_by_10,Family_No
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S,0,,220.0,2
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C,0,,380.0,2
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S,0,,260.0,1


In [27]:
titanic_df['Age_by_10'] = titanic_df['Age_by_10'] + 100
titanic_df.head(3)

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,Age_0,구남석,Age_by_10,Family_No
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S,0,,320.0,2
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C,0,,480.0,2
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S,0,,360.0,1
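
The column assignments above follow three patterns: a scalar is broadcast to every row, arithmetic between columns is applied row by row, and missing values stay missing. A minimal sketch on a hypothetical two-row frame that reuses the same column names:

```python
import numpy as np
import pandas as pd

# Hypothetical two-row frame; the second row has a missing Age.
df = pd.DataFrame({"Age": [22.0, np.nan], "SibSp": [1, 0], "Parch": [0, 2]})

df["Age_0"] = 0                                  # a scalar is broadcast to every row
df["Age_by_10"] = df["Age"] * 10                 # NaN stays NaN
df["Family_No"] = df["SibSp"] + df["Parch"] + 1  # element-wise, row by row

print(df)
```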


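## Deleting DataFrame data
- axis=0 (rows), axis=1 (columns)
- inplace=True: modifies the existing DataFrame in place and returns None. With the default inplace=False, the original is left unchanged and a new DataFrame is returned.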
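
The `axis` and `inplace` options can be sketched on a small hypothetical frame before they are applied to the Titanic data:

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6], "c": [7, 8, 9]})

without_b = df.drop("b", axis=1)             # returns a new DataFrame; df is unchanged
result = df.drop("c", axis=1, inplace=True)  # modifies df itself and returns None
without_row0 = df.drop(0, axis=0)            # axis=0 removes rows by index label

print(list(without_b.columns))
print(result)
print(list(df.columns))
print(list(without_row0.index))
```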

In [28]:
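# inplace=True modifies titanic_df directly and returns None, so the result is not assigned
titanic_df.drop('Age_0', axis=1, inplace=True)
# Without inplace, drop() returns a new DataFrame and leaves titanic_df unchanged
titanic_drop_df = titanic_df.drop('구남석', axis=1)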
titanic_drop_df

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,Age_by_10,Family_No
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.2500,,S,320.0,2
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C,480.0,2
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.9250,,S,360.0,1
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1000,C123,S,450.0,2
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.0500,,S,450.0,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
886,887,0,2,"Montvila, Rev. Juozas",male,27.0,0,0,211536,13.0000,,S,370.0,1
887,888,1,1,"Graham, Miss. Margaret Edith",female,19.0,0,0,112053,30.0000,B42,S,290.0,1
888,889,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",female,,1,2,W./C. 6607,23.4500,,S,,4
889,890,1,1,"Behr, Mr. Karl Howell",male,26.0,0,0,111369,30.0000,C148,C,360.0,1


In [29]:
# With inplace=True, drop() returns None
drop_result = titanic_df.drop(['Age_by_10', 'Family_No'], axis=1, inplace=True)
print('Return value of drop with inplace=True:', drop_result)
titanic_df.head(3)

Return value of drop with inplace=True: None


Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,구남석
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S,
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C,
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S,


In [30]:
pd.set_option('display.width', 1000)
pd.set_option('display.max_colwidth', 15)
print('#### before axis 0 drop ####')
print(titanic_df.head(3))

titanic_df.drop([0, 1, 2], axis=0, inplace=True) # index drop
print('### after axis 0 drop ###')
print(titanic_df.head(3))

#### before axis 0 drop ####
   PassengerId  Survived  Pclass            Name     Sex   Age  SibSp  Parch          Ticket     Fare Cabin Embarked   구남석
0            1         0       3  Braund, Mr....    male  22.0      1      0       A/5 21171   7.2500   NaN        S  None
1            2         1       1  Cumings, Mr...  female  38.0      1      0        PC 17599  71.2833   C85        C  None
2            3         1       3  Heikkinen, ...  female  26.0      0      0  STON/O2. 31...   7.9250   NaN        S  None
### after axis 0 drop ###
   PassengerId  Survived  Pclass            Name     Sex   Age  SibSp  Parch  Ticket     Fare Cabin Embarked   구남석
3            4         1       1  Futrelle, M...  female  35.0      1      0  113803  53.1000  C123        S  None
4            5         0       3  Allen, Mr. ...    male  35.0      0      0  373450   8.0500   NaN        S  None
5            6         0       3  Moran, Mr. ...    male   NaN      0      0  330877   8.4583   NaN        Q  None

## The Index object

In [31]:
# Reload the original file
books_df = pd.read_csv('./books.csv')
# Extract the Index object
indexes = books_df.index
print(indexes)
# Convert the Index object to an array of its actual values
print('Index object array values:\n', indexes.values)

RangeIndex(start=0, stop=210, step=1)
Index object array values:
 [  0   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17
  18  19  20  21  22  23  24  25  26  27  28  29  30  31  32  33  34  35
  36  37  38  39  40  41  42  43  44  45  46  47  48  49  50  51  52  53
  54  55  56  57  58  59  60  61  62  63  64  65  66  67  68  69  70  71
  72  73  74  75  76  77  78  79  80  81  82  83  84  85  86  87  88  89
  90  91  92  93  94  95  96  97  98  99 100 101 102 103 104 105 106 107
 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125
 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143
 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161
 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179
 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197
 198 199 200 201 202 203 204 205 206 207 208 209]


In [32]:
print(type(indexes.values))
print(indexes.values.shape)
print(indexes[:5].values)
print(indexes.values[:5])
print(indexes[6])

<class 'numpy.ndarray'>
(210,)
[0 1 2 3 4]
[0 1 2 3 4]
6


In [33]:
series_fare = titanic_df['Fare']
print('Fare Series max value:', series_fare.max())
print('Fare Series sum value:', series_fare.sum())
print('sum() Fare Series:', sum(series_fare))
print('Fare Series + 3:\n', (series_fare + 3).head(3))

Fare Series max value: 512.3292
Fare Series sum value: 28607.491
sum() Fare Series: 28607.49099999997
Fare Series + 3:
 3    56.1000
4    11.0500
5    11.4583
Name: Fare, dtype: float64
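
In the cell above, the Series method and Python's built-in sum() agree up to floating-point rounding because Fare has no missing values. With missing values the two behave differently, as this sketch on a hypothetical Series shows:

```python
import math
import numpy as np
import pandas as pd

s = pd.Series([1.5, np.nan, 2.5])  # hypothetical values with one missing entry

print(s.sum())           # 4.0: pandas skips NaN by default (skipna=True)
print(sum(s))            # nan: the built-in sum() propagates NaN
print((s + 3).tolist())  # arithmetic is element-wise; the NaN entry stays NaN
```

For a column such as Age, which has missing entries, the pandas method is therefore usually the safer choice.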


In [34]:
titanic_df.head(3)

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,구남석
3,4,1,1,"Futrelle, M...",female,35.0,1,0,113803,53.1,C123,S,
4,5,0,3,"Allen, Mr. ...",male,35.0,0,0,373450,8.05,,S,
5,6,0,3,"Moran, Mr. ...",male,,0,0,330877,8.4583,,Q,


In [35]:
# Reset the index so it starts again from 0; the old index is kept as a new 'index' column
titanic_reset_df = titanic_df.reset_index(inplace=False)
titanic_reset_df.head(3)

Unnamed: 0,index,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,구남석
0,3,4,1,1,"Futrelle, M...",female,35.0,1,0,113803,53.1,C123,S,
1,4,5,0,3,"Allen, Mr. ...",male,35.0,0,0,373450,8.05,,S,
2,5,6,0,3,"Moran, Mr. ...",male,,0,0,330877,8.4583,,Q,
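
If the old index values are not needed, reset_index() can discard them with `drop=True` instead of adding an 'index' column. A minimal sketch on a hypothetical frame whose index starts at 3, like titanic_df after the row drop:

```python
import pandas as pd

df = pd.DataFrame({"x": [10, 20, 30]}, index=[3, 4, 5])

kept = df.reset_index()              # old index becomes a column named 'index'
dropped = df.reset_index(drop=True)  # old index is discarded

print(list(kept.columns))
print(list(dropped.columns))
print(list(dropped.index))
```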


In [36]:
print('### before reset_index ###')
value_counts = titanic_df['Pclass'].value_counts()
print(value_counts)
print('value_counts variable type:', type(value_counts))
print(value_counts[1])  # label-based lookup: the count for Pclass == 1

new_value_counts = value_counts.reset_index(inplace=False)
print('### After reset_index ###')
print(new_value_counts)
print('new_value_counts variable type:', type(new_value_counts))
# Row access: loc[index label], iloc[integer position]
print(new_value_counts.iloc[0])

### before reset_index ###
3    489
1    215
2    184
Name: Pclass, dtype: int64
value_counts variable type: <class 'pandas.core.series.Series'>
215
### After reset_index ###
   index  Pclass
0      3     489
1      1     215
2      2     184
new_value_counts variable type: <class 'pandas.core.frame.DataFrame'>
index       3
Pclass    489
Name: 0, dtype: int64
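
The difference between label-based and position-based access is easy to miss when the labels are integers. This sketch rebuilds the counts shown above as a standalone Series so the two lookups can be compared side by side:

```python
import pandas as pd

# Same labels and counts as the value_counts output above.
s = pd.Series([489, 215, 184], index=[3, 1, 2])

print(s.loc[3])   # 489: the value whose index label is 3
print(s.iloc[0])  # 489: the value at integer position 0
print(s.loc[1])   # 215: label 1 happens to sit at position 1 as well
```

Here `s.iloc[3]` would raise an IndexError, because there are only three positions (0 to 2) even though 3 is a valid label. Also note that the column names after reset_index() depend on the pandas version: the output above shows 'index' and 'Pclass', whereas pandas 2.0 and later appear to name them 'Pclass' and 'count'.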
