# Reference

[1.] [從 pandas 開始 Python 與資料科學之旅](https://medium.com/datainpoint/%E5%BE%9E-pandas-%E9%96%8B%E5%A7%8B-python-%E8%88%87%E8%B3%87%E6%96%99%E7%A7%91%E5%AD%B8%E4%B9%8B%E6%97%85-8dee36796d4a)

[2.] [python pandas 中 loc & iloc 用法區別](https://codertw.com/%E7%A8%8B%E5%BC%8F%E8%AA%9E%E8%A8%80/462517/)

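The name **pandas** comes from pan(el)-da(ta)-s, and it also echoes the three main data structures the package originally provided:
 - Panel (removed in later pandas releases)
 - DataFrame
 - Series

The [GitHub repository](https://github.com/pandas-dev/pandas) describes the library as: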
```
Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more.
```
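As a rough illustration of the two structures used throughout this notebook, the sketch below builds a Series and a DataFrame by hand. The names `life_exp` and `toy` are made up for this example, and the values are copied from the first rows of the gapminder output shown later.

```python
import pandas as pd

# A Series is a one-dimensional labeled array.
life_exp = pd.Series([28.801, 30.332, 31.997], name="lifeExp")

# A DataFrame is a two-dimensional table; each column is a Series.
toy = pd.DataFrame({
    "year": [1952, 1957, 1962],
    "lifeExp": [28.801, 30.332, 31.997],
})

print(type(life_exp).__name__)        # Series
print(type(toy).__name__)             # DataFrame
print(type(toy["lifeExp"]).__name__)  # Series
```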


In [1]:
# load the library
import numpy as np
import pandas as pd

# data source (.csv)
csv_htm = "https://storage.googleapis.com/learn_pd_like_tidyverse/gapminder.csv"

## Read the CSV file


```
code: pd.read_csv(data_from)
```

In [2]:
gapminder = pd.read_csv(csv_htm)
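`pd.read_csv` also accepts a file-like object, which makes it easy to try without a network connection. The sketch below is only an illustration: `csv_text` is a hypothetical in-memory sample holding three rows copied from the output in this notebook, and `usecols` is shown as one optional way to load a subset of columns.

```python
import io
import pandas as pd

# Hypothetical in-memory sample; the real file is read from a URL above.
csv_text = """country,continent,year,lifeExp,pop,gdpPercap
Afghanistan,Asia,1952,28.801,8425333,779.445314
Afghanistan,Asia,1957,30.332,9240934,820.85303
Afghanistan,Asia,1962,31.997,10267083,853.10071
"""

# Read every column.
sample = pd.read_csv(io.StringIO(csv_text))
print(sample.shape)  # (3, 6)

# Read only the columns named in usecols.
subset = pd.read_csv(io.StringIO(csv_text), usecols=["country", "year"])
print(subset.shape)  # (3, 2)
```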

### Check the type of gapminder

```
code: type(XXX)
```

In [3]:
print(type(gapminder))

<class 'pandas.core.frame.DataFrame'>


### Look at the first few rows (the first five by default)

```
code: XXX.head()
```

In [4]:
gapminder.head()

| | country | continent | year | lifeExp | pop | gdpPercap |
|---|---|---|---|---|---|---|
| 0 | Afghanistan | Asia | 1952 | 28.801 | 8425333 | 779.445314 |
| 1 | Afghanistan | Asia | 1957 | 30.332 | 9240934 | 820.85303 |
| 2 | Afghanistan | Asia | 1962 | 31.997 | 10267083 | 853.10071 |
| 3 | Afghanistan | Asia | 1967 | 34.02 | 11537966 | 836.197138 |
| 4 | Afghanistan | Asia | 1972 | 36.088 | 13079460 | 739.981106 |
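`head()` takes an optional row count, and `tail()` does the same from the end of the table. A small sketch on a made-up frame (`numbers` is a hypothetical name used only here):

```python
import pandas as pd

numbers = pd.DataFrame({"A": range(1, 21)})

print(len(numbers.head()))            # 5 (the default)
print(numbers.head(3)["A"].tolist())  # [1, 2, 3]
print(numbers.tail(2)["A"].tolist())  # [19, 20]
```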


***

## Read the Excel file
```
code: pd.read_excel()
```

In [5]:
# data source (.xlsx)
xlsx_file = "https://storage.googleapis.com/learn_pd_like_tidyverse/gapminder.xlsx"

In [6]:
gapminder = pd.read_excel(xlsx_file)
print(type(gapminder))
print("\n")
gapminder.head()

<class 'pandas.core.frame.DataFrame'>




| | country | continent | year | lifeExp | pop | gdpPercap |
|---|---|---|---|---|---|---|
| 0 | Afghanistan | Asia | 1952 | 28.801 | 8425333 | 779.445314 |
| 1 | Afghanistan | Asia | 1957 | 30.332 | 9240934 | 820.85303 |
| 2 | Afghanistan | Asia | 1962 | 31.997 | 10267083 | 853.10071 |
| 3 | Afghanistan | Asia | 1967 | 34.02 | 11537966 | 836.197138 |
| 4 | Afghanistan | Asia | 1972 | 36.088 | 13079460 | 739.981106 |


***

## shape of data

```
code: XXX.shape
```

In [7]:
print("gapminder.shape => {0}".format(gapminder.shape))

gapminder.shape => (1704, 6)


## columns of data

```
code: XXX.columns
```

In [8]:
print("gapminder.columns =>\n{0}".format(gapminder.columns))

gapminder.columns =>
Index(['country', 'continent', 'year', 'lifeExp', 'pop', 'gdpPercap'], dtype='object')


## index of data

```
code: XXX.index
```

In [9]:
print("gapminder.index =>\n{0}".format(gapminder.index))

gapminder.index =>
RangeIndex(start=0, stop=1704, step=1)
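The three attributes above (`shape`, `columns`, `index`) can be tried on any small frame. A minimal sketch, assuming a made-up two-row frame named `toy`:

```python
import pandas as pd

toy = pd.DataFrame({
    "country": ["Afghanistan", "Bahrain"],
    "year": [2007, 2007],
})

# shape is a (rows, columns) tuple, so it can be unpacked directly.
n_rows, n_cols = toy.shape
print(n_rows, n_cols)     # 2 2
print(list(toy.columns))  # ['country', 'year']
print(len(toy.index))     # 2
```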


### How does RangeIndex(start=0, stop=1704, step=1) work?

In [10]:
df = pd.DataFrame({
    'A': range(1,21)
})

print("The index on the left is a range with step 1:\n{0}".format(df))

The index on the left is a range with step 1:
     A
0    1
1    2
2    3
3    4
4    5
5    6
6    7
7    8
8    9
9   10
10  11
11  12
12  13
13  14
14  15
15  16
16  17
17  18
18  19
19  20


In [11]:
print(df.index)

RangeIndex(start=0, stop=20, step=1)


In [12]:
# change the range used by the RangeIndex
# stop is exclusive: 0, 5, ..., 95 gives the 20 labels that df needs
df.index = pd.RangeIndex(start=0, stop=99, step=5)
print("The index on the left now steps by 5:\n{0}".format(df))

The index on the left now steps by 5:
     A
0    1
5    2
10   3
15   4
20   5
25   6
30   7
35   8
40   9
45  10
50  11
55  12
60  13
65  14
70  15
75  16
80  17
85  18
90  19
95  20


In [13]:
step = 10
# stop is exclusive, so this yields len(df) labels: 0, 10, ..., 190
df.index = pd.RangeIndex(start=0, stop=len(df.index) * step - 1, step=step)
print("The index on the left now steps by {0}:\n{1}".format(step, df))

The index on the left now steps by 10:
      A
0     1
10    2
20    3
30    4
40    5
50    6
60    7
70    8
80    9
90   10
100  11
110  12
120  13
130  14
140  15
150  16
160  17
170  18
180  19
190  20
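The `stop` arithmetic in the cells above can be wrapped in a small helper. This is only a sketch: `stepped_index` is a hypothetical function, not part of pandas. Because `stop` is exclusive (as in Python's `range`), `start + n_rows * step` always produces exactly `n_rows` labels.

```python
import pandas as pd

def stepped_index(n_rows, step, start=0):
    """Return a RangeIndex with exactly n_rows labels spaced by step."""
    # stop is exclusive, so start + n_rows * step yields n_rows labels.
    return pd.RangeIndex(start=start, stop=start + n_rows * step, step=step)

df = pd.DataFrame({"A": range(1, 21)})
df.index = stepped_index(len(df), step=10)

print(df.index)           # RangeIndex(start=0, stop=200, step=10)
print(df.loc[190, "A"])   # 20
```

The new index must have the same length as the frame; assigning an index of a different length raises an error.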


## information of data

```
code: XXX.info()
```

In [14]:
gapminder.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1704 entries, 0 to 1703
Data columns (total 6 columns):
country      1704 non-null object
continent    1704 non-null object
year         1704 non-null int64
lifeExp      1704 non-null float64
pop          1704 non-null int64
gdpPercap    1704 non-null float64
dtypes: float64(2), int64(2), object(2)
memory usage: 80.0+ KB


## describe of data

 - Five-number summary (min, Q1: 25%, median, Q3: 75%, max)
 - count
 - mean
 - std

```
code: XXX.describe()
```

In [15]:
gapminder.describe()

| | year | lifeExp | pop | gdpPercap |
|---|---|---|---|---|
| count | 1704.0 | 1704.0 | 1704.0 | 1704.0 |
| mean | 1979.5 | 59.474439 | 29601210.0 | 7215.327081 |
| std | 17.26533 | 12.917107 | 106157900.0 | 9857.454543 |
| min | 1952.0 | 23.599 | 60011.0 | 241.165876 |
| 25% | 1965.75 | 48.198 | 2793664.0 | 1202.060309 |
| 50% | 1979.5 | 60.7125 | 7023596.0 | 3531.846988 |
| 75% | 1993.25 | 70.8455 | 19585220.0 | 9325.462346 |
| max | 2007.0 | 82.603 | 1318683000.0 | 113523.1329 |
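To see where the rows of `describe()` come from, the sketch below compares them with `quantile()`, `median()` and `std()` on a tiny made-up Series (`s` is a hypothetical example, not gapminder data).

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4, 5], name="x")
summary = s.describe()

# The 25% / 50% / 75% rows are the quartiles.
print(summary["25%"], summary["50%"], summary["75%"])  # 2.0 3.0 4.0
print(s.quantile([0.25, 0.5, 0.75]).tolist())          # [2.0, 3.0, 4.0]

# The 50% row is the median, and std is the sample standard deviation.
print(summary["50%"] == s.median())                    # True
```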


***

# Data Cleaning


 
1. Filtering
2. Selecting variables
3. Adding a new variable
4. sum()
5. mean()
6. groupby().sum()
7. groupby().mean()

***

### 1. Filtering 

In [16]:
gapminder[
    ### rows that satisfy the condition are selected
    ### when combining several conditions, wrap each one in parentheses ()
    (gapminder['year'] == 2007) & (gapminder['continent'] == 'Asia')
]

| | country | continent | year | lifeExp | pop | gdpPercap |
|---|---|---|---|---|---|---|
| 11 | Afghanistan | Asia | 2007 | 43.828 | 31889923 | 974.580338 |
| 95 | Bahrain | Asia | 2007 | 75.635 | 708573 | 29796.04834 |
| 107 | Bangladesh | Asia | 2007 | 64.062 | 150448339 | 1391.253792 |
| 227 | Cambodia | Asia | 2007 | 59.723 | 14131858 | 1713.778686 |
| 299 | China | Asia | 2007 | 72.961 | 1318683096 | 4959.114854 |
| 671 | Hong Kong, China | Asia | 2007 | 82.208 | 6980412 | 39724.97867 |
| 707 | India | Asia | 2007 | 64.698 | 1110396331 | 2452.210407 |
| 719 | Indonesia | Asia | 2007 | 70.65 | 223547000 | 3540.651564 |
| 731 | Iran | Asia | 2007 | 70.964 | 69453570 | 11605.71449 |
| 743 | Iraq | Asia | 2007 | 59.545 | 27499638 | 4471.061906 |
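Boolean filters can be combined in other ways too. The sketch below uses a made-up four-row frame (`toy` is hypothetical) to show `&` (and), `|` (or), `~` (not) and `isin`. Each comparison needs its own parentheses because `&` and `|` bind more tightly than `==`.

```python
import pandas as pd

# Hypothetical toy data, not the full gapminder table.
toy = pd.DataFrame({
    "country": ["Afghanistan", "Bahrain", "China", "Albania"],
    "continent": ["Asia", "Asia", "Asia", "Europe"],
    "year": [2007, 2007, 2002, 2007],
})

asia_2007 = toy[(toy["year"] == 2007) & (toy["continent"] == "Asia")]
asia_or_2007 = toy[(toy["year"] == 2007) | (toy["continent"] == "Asia")]
not_asia = toy[~(toy["continent"] == "Asia")]

# isin tests membership in a list of values.
picked = toy[toy["country"].isin(["China", "Albania"])]

print(asia_2007["country"].tolist())  # ['Afghanistan', 'Bahrain']
print(not_asia["country"].tolist())   # ['Albania']
print(picked["country"].tolist())     # ['China', 'Albania']
```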


***

### 2. Select variables

 - **Case 1**: with **double square brackets** `[[...]]`, the result is a DataFrame.
 - **Case 2**: with **single square brackets** `[...]`, the result is a Series.

In [17]:
# Case 1:
print(gapminder[['country','continent']].head())


print(type(gapminder[['country','continent']].head()))

       country continent
0  Afghanistan      Asia
1  Afghanistan      Asia
2  Afghanistan      Asia
3  Afghanistan      Asia
4  Afghanistan      Asia
<class 'pandas.core.frame.DataFrame'>


In [18]:
# Case 2:
print(gapminder['country'].head())

print(type(gapminder['country'].head()))

0    Afghanistan
1    Afghanistan
2    Afghanistan
3    Afghanistan
4    Afghanistan
Name: country, dtype: object
<class 'pandas.core.series.Series'>


### Store the selected data in a variable

In [19]:
country = gapminder[['country']]

# with double square brackets, the type is DataFrame
print(type(country))

country.head()


<class 'pandas.core.frame.DataFrame'>


| | country |
|---|---|
| 0 | Afghanistan |
| 1 | Afghanistan |
| 2 | Afghanistan |
| 3 | Afghanistan |
| 4 | Afghanistan |
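The difference between the two selection styles also shows up in the shape of the result. A minimal sketch on a made-up frame (`toy`, `as_series` and `as_frame` are names chosen for this example):

```python
import pandas as pd

toy = pd.DataFrame({
    "country": ["Afghanistan", "Bahrain"],
    "continent": ["Asia", "Asia"],
})

as_series = toy["country"]    # one label -> Series (one-dimensional)
as_frame = toy[["country"]]   # list of labels -> DataFrame, even with one column

print(type(as_series).__name__, as_series.shape)  # Series (2,)
print(type(as_frame).__name__, as_frame.shape)    # DataFrame (2, 1)
```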


***

### 3. Add a new variable to gapminder

### Steps:
1. Select a single variable: **gapminder['country']**.
2. A single selected variable is a **Series**.
3. Use **apply**, which calls a function on each element of the Series.
4. Slice the string:
```
# slice the string
string = "python"
print(string[:3])
# prints pyt
print(string[3:])
# prints hon
print(string[:])
# prints python
```
5. lambda x: x[:3]
  - This is equivalent to the function below, applied to each element in turn:
```
def f(x):
    output = x[:3]
    return output
```

In [20]:
# 1. use apply with a lambda function and x[:3] (string slicing)
# 2. create a new variable "country_abb" holding the first three characters of country
# 3. apply returns a Series, so the new column country_abb is stored as a Series
gapminder['country_abb'] = gapminder['country'].apply(lambda x: x[:3])

gapminder.head()

| | country | continent | year | lifeExp | pop | gdpPercap | country_abb |
|---|---|---|---|---|---|---|---|
| 0 | Afghanistan | Asia | 1952 | 28.801 | 8425333 | 779.445314 | Afg |
| 1 | Afghanistan | Asia | 1957 | 30.332 | 9240934 | 820.85303 | Afg |
| 2 | Afghanistan | Asia | 1962 | 31.997 | 10267083 | 853.10071 | Afg |
| 3 | Afghanistan | Asia | 1967 | 34.02 | 11537966 | 836.197138 | Afg |
| 4 | Afghanistan | Asia | 1972 | 36.088 | 13079460 | 739.981106 | Afg |
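For plain string slicing, the `.str` accessor is an alternative to `apply` with a lambda. The sketch below is not the approach used above, just a comparison on a made-up Series (`country` here is a hypothetical three-element example).

```python
import pandas as pd

country = pd.Series(["Afghanistan", "Bahrain", "China"], name="country")

via_apply = country.apply(lambda x: x[:3])  # element-by-element Python call
via_str = country.str[:3]                   # vectorized string slicing

print(via_str.tolist())                        # ['Afg', 'Bah', 'Chi']
print(via_apply.tolist() == via_str.tolist())  # True
```

One practical difference: `.str[:3]` passes missing values through as missing, while the lambda would fail on a missing value that is not a string.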


***

### 4. XXX.sum( )

In [21]:
# case 1:
print("\n=====Steps of case 1: =====")
print(" 1. select 'pop' with single brackets []")
print(" 2. sum 'pop'\n")
c1 = gapminder['pop'].sum()
print("gapminder['pop'].sum() => \n{0}".format(c1))



# case 2: 
print("\n\n\n=====Steps of case 2: =====")
print(" 1. filter year == 2007")
print(" 2. select 'pop' with single brackets []")
print(" 3. sum 'pop'\n")

c2 = gapminder[gapminder['year'] == 2007]['pop'].sum()
print("gapminder[gapminder['year'] == 2007]['pop'].sum() => \n{0}".format(c2))


=====Steps of case 1: =====
 1. select 'pop' with single brackets []
 2. sum 'pop'

gapminder['pop'].sum() => 
50440465801



=====Steps of case 2: =====
 1. filter year == 2007
 2. select 'pop' with single brackets []
 3. sum 'pop'

gapminder[gapminder['year'] == 2007]['pop'].sum() => 
6251013179


### 5. XXX.mean( )

In [22]:
print("\n\n\n=====Steps of mean: =====")
print(" 1. filter year == 2007")
print(" 2. select 'lifeExp' & 'gdpPercap' with double brackets [['', '']]")
print(" 3. mean of 'lifeExp' & 'gdpPercap'\n")
gapminder[gapminder['year'] == 2007][['lifeExp', 'gdpPercap']].mean()




=====Steps of mean: =====
 1. filter year == 2007
 2. select 'lifeExp' & 'gdpPercap' with double brackets [['', '']]
 3. mean of 'lifeExp' & 'gdpPercap'



lifeExp         67.007423
gdpPercap    11680.071820
dtype: float64
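The filter-then-select pattern used for `sum()` and `mean()` can also be written in one step with `.loc[rows, columns]`. This is an alternative sketch, assuming a made-up frame named `toy` with hypothetical values:

```python
import pandas as pd

# Hypothetical toy data.
toy = pd.DataFrame({
    "year": [2002, 2007, 2007],
    "lifeExp": [60.0, 70.0, 80.0],
    "pop": [10, 20, 30],
})

mask = toy["year"] == 2007

# One label -> Series, so sum() returns a single number.
total_pop = toy.loc[mask, "pop"].sum()

# A list of labels -> DataFrame, so mean() returns one value per column.
means = toy.loc[mask, ["lifeExp", "pop"]].mean()

print(total_pop)         # 50
print(means["lifeExp"])  # 75.0
```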

***

### 6. groupby().sum()

In [23]:
# case 1:
print("\n=====Steps of case 1: =====")
print(" 1. groupby('continent')")
print(" 2. sum of 'pop' using ['pop']\n")

c1 = gapminder.groupby('continent')['pop'].sum()
print("gapminder.groupby('continent')['pop'].sum() => \n{0}".format(c1))



# case 2: 
print("\n\n\n=====Steps of case 2: =====")
print(" 1. filter year == 2007")
print(" 2. groupby('continent')")
print(" 3. sum of 'pop' using ['pop']\n")


c2 = gapminder[gapminder['year'] == 2007].groupby('continent')['pop'].sum()
print("gapminder[gapminder['year'] == 2007].groupby('continent')['pop'].sum() => \n{0}".format(c2))


=====Steps of case 1: =====
 1. groupby('continent')
 2. sum of 'pop' using ['pop']

gapminder.groupby('continent')['pop'].sum() => 
continent
Africa       6187585961
Americas     7351438499
Asia        30507333901
Europe       6181115304
Oceania       212992136
Name: pop, dtype: int64



=====Steps of case 2: =====
 1. filter year == 2007
 2. groupby('continent')
 3. sum of 'pop' using ['pop']

gapminder[gapminder['year'] == 2007].groupby('continent')['pop'].sum() => 
continent
Africa       929539692
Americas     898871184
Asia        3811953827
Europe       586098529
Oceania       24549947
Name: pop, dtype: int64
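The grouped sum above returns a Series indexed by the group labels. The sketch below, on a made-up frame (`toy` is hypothetical), shows how to look up one group and how `reset_index()` turns the labels back into an ordinary column.

```python
import pandas as pd

# Hypothetical toy data.
toy = pd.DataFrame({
    "continent": ["Asia", "Asia", "Europe"],
    "year": [2007, 2007, 2007],
    "pop": [10, 20, 5],
})

by_continent = toy.groupby("continent")["pop"].sum()
print(by_continent["Asia"])    # 30

# reset_index moves the group labels from the index into a column.
table = by_continent.reset_index()
print(list(table.columns))     # ['continent', 'pop']
```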


***

### 7. groupby().mean()

In [24]:
# case 1:
print("\n=====Step of case 1: =====")
print(" 1. groupby('continent')")
print(" 2. mean of 'lifeExp' and 'gdpPercap' using [['lifeExp', 'gdpPercap']]\n")

c1 = gapminder.groupby('continent')[['lifeExp', 'gdpPercap']].mean()
print("gapminder.groupby('continent')[['lifeExp', 'gdpPercap']].mean() => \n{0}".format(c1))


# case 2: 
print("\n\n\n=====Step of case 2: =====")
print(" 1. filtering year == 2007  ")
print(" 2. groupby('continent').")
print(" 3. mean of 'lifeExp' and 'gdpPercap' using [['lifeExp', 'gdpPercap']]\n")


c2 = gapminder[ gapminder['year'] == 2007 ].groupby('continent')[['lifeExp', 'gdpPercap']].mean()
print("gapminder[ gapminder['year'] == 2007 ].groupby('continent')[['lifeExp', 'gdpPercap']].mean() => \n{0}".format(c2))


=====Step of case 1: =====
 1. groupby('continent')
 2. mean of 'lifeExp' and 'gdpPercap' using [['lifeExp', 'gdpPercap']]

gapminder.groupby('continent')[['lifeExp', 'gdpPercap']].mean() => 
             lifeExp     gdpPercap
continent                         
Africa     48.865330   2193.754578
Americas   64.658737   7136.110356
Asia       60.064903   7902.150428
Europe     71.903686  14469.475533
Oceania    74.326208  18621.609223



=====Step of case 2: =====
 1. filtering year == 2007  
 2. groupby('continent').
 3. mean of 'lifeExp' and 'gdpPercap' using [['lifeExp', 'gdpPercap']]

gapminder[ gapminder['year'] == 2007 ].groupby('continent')[['lifeExp', 'gdpPercap']].mean() => 
             lifeExp     gdpPercap
continent                         
Africa     54.806038   3089.032605
Americas   73.608120  11003.031625
Asia       70.728485  12473.026870
Europe     77.648600  25054.481636
Oceania    80.719500  29810.188275
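When different columns need different statistics, `groupby().agg()` with named aggregation can compute them in one call. This goes slightly beyond the cells above and is only a sketch: the frame `toy` and the output names `mean_lifeExp` and `total_pop` are made up for the example.

```python
import pandas as pd

# Hypothetical toy data.
toy = pd.DataFrame({
    "continent": ["Asia", "Asia", "Europe"],
    "lifeExp": [60.0, 70.0, 80.0],
    "pop": [10, 20, 5],
})

# Each keyword is an output column: (input column, aggregation function).
summary = toy.groupby("continent").agg(
    mean_lifeExp=("lifeExp", "mean"),
    total_pop=("pop", "sum"),
)

print(summary.loc["Asia", "mean_lifeExp"])  # 65.0
print(summary.loc["Asia", "total_pop"])     # 30
```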
