## Python: Pandas

- Pandas is used for __panel data (big data)__ analysis and AI research with mixed data types. Panel data records how data changes over time.

- __Panel data__: the __same subjects__ (cross-sectional units) are observed over time.

- __Flexible indexing__, allowing users to use non-integer indexes.

- One-dimensional array: __Series__; two-dimensional array: __DataFrame__; higher-dimensional data: a Series or DataFrame with a __MultiIndex__ (hierarchical index).
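
To make the panel data idea concrete, here is a minimal sketch of a DataFrame with a two-level MultiIndex. The firm names, years, and `sales` values are hypothetical and only serve as an illustration.

```python
import pandas as pd

# Hypothetical panel: two subjects (firms "A" and "B"), each observed in two years.
index = pd.MultiIndex.from_product([["A", "B"], [2020, 2021]],
                                   names=["firm", "year"])
panel = pd.DataFrame({"sales": [10, 12, 7, 9]}, index=index)

print(panel)
# All observations of one subject over time
print(panel.loc["A"])
# A single (subject, year) observation
print(panel.loc[("B", 2021), "sales"])
```

The outer index level identifies the subject and the inner level identifies the time period, so the same subjects can be followed over time.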

__1. Pandas: Series creation and indexing__

In [2]:
import pandas as pd

In [4]:
grades = pd.Series(range(80, 100, 2))
print(grades)

0    80
1    82
2    84
3    86
4    88
5    90
6    92
7    94
8    96
9    98
dtype: int64


In [5]:
print(grades.describe()) # descriptive statistics
print(len(grades))

count    10.000
mean     89.000
std       6.055
min      80.000
25%      84.500
50%      89.000
75%      93.500
max      98.000
dtype: float64
10


In [6]:
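# Creating a Series with custom indices, using a list
height = pd.Series([175, 184, 170], index=['Kim','Kwon','Lee'])
print(height)
# Use .iloc for positional access; height[0] on a string index is deprecated in recent pandas
print(height.Kwon, height['Lee'], height.iloc[0])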

Kim     175
Kwon    184
Lee     170
dtype: int64
184 170 175
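
As a small sketch of the difference between label-based and position-based access on a Series, the same `height` data can be queried with the explicit `.loc` and `.iloc` accessors. The variable names `by_label`, `by_position`, and `subset` are just for illustration.

```python
import pandas as pd

height = pd.Series([175, 184, 170], index=["Kim", "Kwon", "Lee"])

by_label = height.loc["Kwon"]      # label-based lookup
by_position = height.iloc[0]       # position-based lookup
subset = height[["Kim", "Lee"]]    # a list of labels returns a Series

print(by_label, by_position)
print(subset)
```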


In [7]:
# Using a dictionary to customize indexes: the keys become the index labels
nations = pd.Series({'Korea':82,'Japan':81,'China':'cn'})
print(nations)

Korea    82
Japan    81
China    cn
dtype: object
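
The `object` dtype above comes from mixing integers with a string. A minimal sketch of the contrast, assuming a hypothetical all-integer Series `codes` next to the mixed one:

```python
import pandas as pd

codes = pd.Series({"Korea": 82, "Japan": 81})                 # integers only
mixed = pd.Series({"Korea": 82, "Japan": 81, "China": "cn"})  # integers and a string

print(codes.dtype)          # an integer dtype
print(mixed.dtype)          # object
print(mixed["Korea"] + 1)   # individual elements keep their Python type
```

With `object` dtype, numeric methods such as `mean()` generally cannot be applied to the whole Series, so it is usually worth keeping numeric columns homogeneous.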


__2. Pandas: DataFrames creation and indexing__

In [8]:
season_temps = pd.DataFrame({'Spring': [10, 14, 18], 'Summer': [24,27,30], 'Fall': [24,21,18], 'Winter': [8, 0, -5]})
print(season_temps) # the dictionary keys directly become the column labels

   Spring  Summer  Fall  Winter
0      10      24    24       8
1      14      27    21       0
2      18      30    18      -5


In [9]:
scores = {'Kim':[87, 96, 70], 'Park':[100, 87, 90], 'Sam':[94, 77, 90],
          'Kwon':[100, 90, 95], 'Lee':[83, 65, 85]}
scores_df = pd.DataFrame(scores)
print(scores_df)

   Kim  Park  Sam  Kwon  Lee
0   87   100   94   100   83
1   96    87   77    90   65
2   70    90   90    95   85


In [10]:
# Changing index
scores_ni = pd.DataFrame(scores, index = ['Math', 'Econ', 'Physics'])
print(scores_ni)
scores_df.index = ['Math', 'Econ', 'Physics']
print(scores_df)

         Kim  Park  Sam  Kwon  Lee
Math      87   100   94   100   83
Econ      96    87   77    90   65
Physics   70    90   90    95   85
         Kim  Park  Sam  Kwon  Lee
Math      87   100   94   100   83
Econ      96    87   77    90   65
Physics   70    90   90    95   85


__3. Pandas: DataFrame slicing__

- loc: selecting rows (and columns) by label

- iloc: selecting rows (and columns) by integer position

- at: getting a single element of a DataFrame by row and column label

- iat: getting a single element of a DataFrame by integer row and column position
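
One detail that is easy to miss: a label slice with `loc` includes its end label, while a position slice with `iloc` excludes its end position. A minimal sketch, using a reduced two-column version of the scores table:

```python
import pandas as pd

scores_df = pd.DataFrame(
    {"Kim": [87, 96, 70], "Park": [100, 87, 90]},
    index=["Math", "Econ", "Physics"],
)

label_slice = scores_df.loc["Math":"Econ"]   # includes the "Econ" row
position_slice = scores_df.iloc[0:1]         # excludes position 1

print(label_slice)
print(position_slice)
# at and iat address the same single cell, by label and by position
print(scores_df.at["Econ", "Park"], scores_df.iat[1, 1])
```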

In [11]:
print(f'{season_temps.loc[0]}\n{season_temps.iloc[0]}')

Spring    10
Summer    24
Fall      24
Winter     8
Name: 0, dtype: int64
Spring    10
Summer    24
Fall      24
Winter     8
Name: 0, dtype: int64


In [12]:
# slicing by setting row index range
print(f'{scores_ni.loc["Math":"Econ"]}\n{scores_df.iloc[:2,:3]}')

      Kim  Park  Sam  Kwon  Lee
Math   87   100   94   100   83
Econ   96    87   77    90   65
      Kim  Park  Sam
Math   87   100   94
Econ   96    87   77


In [13]:
# Slicing with specific row indexes
print(f'{scores_df.loc[["Math", "Physics"]]}\n{scores_df.iloc[[0,2],:3]}')

         Kim  Park  Sam  Kwon  Lee
Math      87   100   94   100   83
Physics   70    90   90    95   85
         Kim  Park  Sam
Math      87   100   94
Physics   70    90   90


In [14]:
# Slicing with a row index range and column names or indexes
print(f'{scores_ni.loc["Math":"Physics",["Kim","Sam"]]}\n{scores_df.iloc[[0,2], 0:3]}')

         Kim  Sam
Math      87   94
Econ      96   77
Physics   70   90
         Kim  Park  Sam
Math      87   100   94
Physics   70    90   90


In [15]:
# Selecting specific rows and columns, which are not consecutive
print(f'{scores_ni.loc[["Math", "Physics"],["Kim","Sam"]]}')

         Kim  Sam
Math      87   94
Physics   70   90


In [16]:
print(scores_df)
print(scores_df.at['Econ','Kwon'], scores_df.iat[0,4])

         Kim  Park  Sam  Kwon  Lee
Math      87   100   94   100   83
Econ      96    87   77    90   65
Physics   70    90   90    95   85
90 83


__4. Pandas: Boolean indexing__

In [17]:
scores_df[scores_df >= 90]

          Kim   Park   Sam  Kwon  Lee
Math      NaN  100.0  94.0   100  NaN
Econ     96.0    NaN   NaN    90  NaN
Physics   NaN   90.0  90.0    95  NaN


In [18]:
scores_df[(scores_df < 90) & (scores_df > 70)]

          Kim  Park   Sam  Kwon   Lee
Math     87.0   NaN   NaN   NaN  83.0
Econ      NaN  87.0  77.0   NaN   NaN
Physics   NaN   NaN   NaN   NaN  85.0
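
The examples above mask individual cells. A related pattern filters whole rows with a Boolean Series built from one or more columns. The sketch below uses a three-column subset of the scores, and the threshold of 80 is an arbitrary choice for illustration.

```python
import pandas as pd

scores_df = pd.DataFrame(
    {"Kim": [87, 96, 70], "Park": [100, 87, 90], "Lee": [83, 65, 85]},
    index=["Math", "Econ", "Physics"],
)

# Boolean Series with one True/False value per row
mask = (scores_df["Kim"] >= 80) & (scores_df["Lee"] >= 80)
passed = scores_df[mask]   # keeps only the rows where mask is True
print(passed)
```

The parentheses around each comparison are required because `&` binds more tightly than `>=` in Python.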


__5. Pandas: Descriptive statistics__

In [19]:
pd.set_option('display.precision', 3)
print(scores_df.describe())

          Kim     Park     Sam   Kwon     Lee
count   3.000    3.000   3.000    3.0   3.000
mean   84.333   92.333  87.000   95.0  77.667
std    13.204    6.807   8.888    5.0  11.015
min    70.000   87.000  77.000   90.0  65.000
25%    78.500   88.500  83.500   92.5  74.000
50%    87.000   90.000  90.000   95.0  83.000
75%    91.500   95.000  92.000   97.5  84.000
max    96.000  100.000  94.000  100.0  85.000


In [18]:
scores_df.mean()

Kim     84.333333
Park    92.333333
Sam     87.000000
Kwon    95.000000
Lee     77.666667
dtype: float64
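
`mean()` works column by column by default. As a small sketch, passing `axis=1` averages across the columns instead, giving one value per row (here, per subject). The name `subject_mean` is just for illustration.

```python
import pandas as pd

scores_df = pd.DataFrame(
    {"Kim": [87, 96, 70], "Park": [100, 87, 90], "Sam": [94, 77, 90],
     "Kwon": [100, 90, 95], "Lee": [83, 65, 85]},
    index=["Math", "Econ", "Physics"],
)

subject_mean = scores_df.mean(axis=1)   # one mean per row
print(subject_mean)
```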

__6. Pandas: Transposing__

In [20]:
scores_df.T

      Math  Econ  Physics
Kim     87    96       70
Park   100    87       90
Sam     94    77       90
Kwon   100    90       95
Lee     83    65       85


In [21]:
scores_df.T.describe()

            Math       Econ   Physics
count        5.0        5.0       5.0
mean        92.8       83.0      86.0
std     7.661593  12.186058  9.617692
min         83.0       65.0      70.0
25%         87.0       77.0      85.0
50%         94.0       87.0      90.0
75%        100.0       90.0      90.0
max        100.0       96.0      95.0


__7. Pandas: Sorting by index and values__

In [25]:
season_temps.sort_index(ascending=False)

   Spring  Summer  Fall  Winter
2      18      30    18      -5
1      14      27    21       0
0      10      24    24       8


In [37]:
scores_df.sort_index() # sort the rows by row label, ascending

         Kim  Park  Sam  Kwon  Lee
Econ      96    87   77    90   65
Math      87   100   94   100   83
Physics   70    90   90    95   85


In [38]:
scores_df.sort_index(axis=1) # sort the columns by column label, ascending

         Kim  Kwon  Lee  Park  Sam
Math      87   100   83   100   94
Econ      96    90   65    87   77
Physics   70    95   85    90   90


In [39]:
scores_df.sort_values(by = 'Econ', axis=1, ascending =True) # reorder the columns by the 'Econ' row, ascending

         Lee  Sam  Park  Kwon  Kim
Math      83   94   100   100   87
Econ      65   77    87    90   96
Physics   85   90    90    95   70


In [40]:
scores_df.T.sort_values(by = 'Econ', ascending = False) # sort the rows by the 'Econ' column, descending

      Math  Econ  Physics
Kim     87    96       70
Kwon   100    90       95
Park   100    87       90
Sam     94    77       90
Lee     83    65       85
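
`sort_values` also accepts a list of keys, which helps when the first key has ties. A minimal sketch on the transposed scores, sorting by Math and breaking ties with Econ (the choice of keys is only for illustration):

```python
import pandas as pd

scores_df = pd.DataFrame(
    {"Kim": [87, 96, 70], "Park": [100, 87, 90], "Sam": [94, 77, 90],
     "Kwon": [100, 90, 95], "Lee": [83, 65, 85]},
    index=["Math", "Econ", "Physics"],
)

by_student = scores_df.T   # one row per student
# Park and Kwon tie on Math (100), so Econ decides their order
ranked = by_student.sort_values(by=["Math", "Econ"], ascending=[False, False])
print(ranked)
```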


__8. Pandas: One hot vector__

- One-hot vector: a vector in which exactly one element is 1 and all others are 0. One-hot encoding assigns 1 to the element that matches a given category and 0 to the rest.

In [47]:
auto_firms = ['Hundai','Honda','Kia','Audi','Benz','Hundai','Benz','Audi','Hundai',
              'Kia','Honda','Kia','Audi','Hundai','Benz']
Year = list(range(1990, 2005))
Rank = list(range(15))
auto_df = pd.DataFrame({'Year':Year, 'Rank':Rank, 'Maker':auto_firms})
print(auto_df)

    Year  Rank   Maker
0   1990     0  Hundai
1   1991     1   Honda
2   1992     2     Kia
3   1993     3    Audi
4   1994     4    Benz
5   1995     5  Hundai
6   1996     6    Benz
7   1997     7    Audi
8   1998     8  Hundai
9   1999     9     Kia
10  2000    10   Honda
11  2001    11     Kia
12  2002    12    Audi
13  2003    13  Hundai
14  2004    14    Benz


In [48]:
am_onehot = pd.get_dummies(auto_df['Maker'], dtype=int) # dtype=int gives 0/1; recent pandas versions return True/False by default
print(am_onehot)

    Audi  Benz  Honda  Hundai  Kia
0      0     0      0       1    0
1      0     0      1       0    0
2      0     0      0       0    1
3      1     0      0       0    0
4      0     1      0       0    0
5      0     0      0       1    0
6      0     1      0       0    0
7      1     0      0       0    0
8      0     0      0       1    0
9      0     0      0       0    1
10     0     0      1       0    0
11     0     0      0       0    1
12     1     0      0       0    0
13     0     0      0       1    0
14     0     1      0       0    0
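
In practice the one-hot columns usually need to stay next to the other columns. One possible approach, sketched here on a shortened four-row version of the table, is to pass the whole DataFrame to `pd.get_dummies` and name the column to encode:

```python
import pandas as pd

# Shortened, illustrative version of auto_df
auto_df = pd.DataFrame({
    "Year": [1990, 1991, 1992, 1993],
    "Maker": ["Hundai", "Honda", "Kia", "Hundai"],
})

# Encode only 'Maker'; other columns are kept as they are
encoded = pd.get_dummies(auto_df, columns=["Maker"], dtype=int)
print(encoded)
```

The new columns are named with the original column name as a prefix (`Maker_Honda`, `Maker_Hundai`, `Maker_Kia`), and each row has exactly one 1 among them.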
