# Reference

[1.] [從 pandas 開始 Python 與資料科學之旅](https://medium.com/datainpoint/%E5%BE%9E-pandas-%E9%96%8B%E5%A7%8B-python-%E8%88%87%E8%B3%87%E6%96%99%E7%A7%91%E5%AD%B8%E4%B9%8B%E6%97%85-8dee36796d4a)

[2.] [pandas筆記2---reset_index函數drop與inplace參數的理解](https://www.twblogs.net/a/5c21eb8abd9eee16b3dae800)


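> Besides the DataFrame, pandas provides two other data structures: **Series** and **Panel**. (Panel was deprecated in pandas 0.20 and removed in 0.25, so the Panel cells in these notes only run on older versions.)

> Selecting a **single column** of a DataFrame, without wrapping the column name in a list, returns a Series.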

```
e.g.,
  DATA['v1'] -> This is a "Series".
```
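
To make the difference concrete, here is a minimal sketch. The frame `data` and its columns `v1` and `v2` are made up for illustration and are not part of the gapminder data.

```python
import pandas as pd

# Hypothetical two-column frame standing in for DATA
data = pd.DataFrame({"v1": [1, 2, 3], "v2": ["a", "b", "c"]})

as_series = data["v1"]    # a plain label returns a Series
as_frame = data[["v1"]]   # a list of labels returns a DataFrame

print(type(as_series).__name__)  # Series
print(type(as_frame).__name__)   # DataFrame
```

The double brackets matter: the inner pair is a Python list, and a list of labels always selects a sub-DataFrame, even when it names only one column.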

In [2]:
# load the library
import numpy as np
import pandas as pd

# data source (.csv)
csv_htm = "https://storage.googleapis.com/learn_pd_like_tidyverse/gapminder.csv"

# read the csv data from csv_htm
gapminder = pd.read_csv(csv_htm)

In [3]:
country = gapminder['country']
type(country)

pandas.core.series.Series

***

## A **Series** can be split further
 - into two parts: **index** and **values**
 
 
 - the **values** part is a NumPy **ndarray**

In [4]:
print("country.index => \n{0}".format(country.index))

print("\n----------------------\n")

print("Extract the ndarray from the Series:")
print("NumPy ndarray: \ncountry.values => \n{0}".format(country.values))

print("\n----------------------\n")

print("Check that the type is numpy.ndarray:")
print("type(country.values) => \n{0}".format(type(country.values)))

country.index => 
RangeIndex(start=0, stop=1704, step=1)

----------------------

Extract the ndarray from the Series:
NumPy ndarray: 
country.values => 
['Afghanistan' 'Afghanistan' 'Afghanistan' ... 'Zimbabwe' 'Zimbabwe'
 'Zimbabwe']

----------------------

Check that the type is numpy.ndarray:
type(country.values) => 
<class 'numpy.ndarray'>


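### This gives the following relationship between the data structures:
 - A DataFrame can be broken down into several Series

   - A Series can be broken down into an ndarray

     - An ndarray can be broken down into its individual numbers, booleans, or strings.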
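
The chain can be walked step by step. This is a minimal sketch on a made-up two-row frame; the `inAsia` column is hypothetical and only there to show a boolean value.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature frame with text, numeric and boolean columns
frame = pd.DataFrame({
    "country": ["Afghanistan", "Albania"],
    "lifeExp": [28.801, 55.230],
    "inAsia": [True, False],
})

series = frame["lifeExp"]   # DataFrame -> Series
array = series.values       # Series -> ndarray
scalar = array[0]           # ndarray -> a single number

print(type(series).__name__)  # Series
print(type(array).__name__)   # ndarray
print(scalar)                 # 28.801
```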

***

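## A Panel stores **multiple DataFrames**

For example:
 - Split the original gapminder by **year**
 
 
 - Store each year's data in its own DataFrame
 
 
 - Store all 12 DataFrames in a single Panel object
 
 ****
 
 From largest to smallest, the data structures relate like this:
 
 - A Panel can store several DataFrames
 
   - A DataFrame can be broken down into several Series

     - A Series can be broken down into an ndarray

       - An ndarray can be broken down into its individual numbers, booleans, or strings.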

***

## Example of Panel

In [28]:
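df_grouped = gapminder.groupby('year') # a DataFrameGroupBy object, one group per year
df_dict = {} # empty dict that will hold one DataFrame per year

for i in range(1952, 2011, 5):
    
    # reset_index: each group keeps its original row labels,
    # so renumber the rows from 0 to keep every year's index aligned
    df_dict[i] = df_grouped.get_group(i).reset_index(drop = True)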

    
print("df_dict[1952].head() => \n{0}\n".format(df_dict[1952].head()))
print("df_dict[1957].head() => \n{0}\n".format(df_dict[1957].head()))



df_dict[1952].head() => 
       country continent  year  lifeExp       pop    gdpPercap
0  Afghanistan      Asia  1952   28.801   8425333   779.445314
1      Albania    Europe  1952   55.230   1282697  1601.056136
2      Algeria    Africa  1952   43.077   9279525  2449.008185
3       Angola    Africa  1952   30.015   4232095  3520.610273
4    Argentina  Americas  1952   62.485  17876956  5911.315053

df_dict[1957].head() => 
       country continent  year  lifeExp       pop    gdpPercap
0  Afghanistan      Asia  1957   30.332   9240934   820.853030
1      Albania    Europe  1957   59.280   1476505  1942.284244
2      Algeria    Africa  1957   45.685  10270856  3013.976023
3       Angola    Africa  1957   31.999   4561361  3827.940465
4    Argentina  Americas  1957   64.399  19610538  6856.856212



In [32]:
df_dict

{1952:                       country continent  year  lifeExp        pop  \
 0                 Afghanistan      Asia  1952   28.801    8425333   
 1                     Albania    Europe  1952   55.230    1282697   
 2                     Algeria    Africa  1952   43.077    9279525   
 3                      Angola    Africa  1952   30.015    4232095   
 4                   Argentina  Americas  1952   62.485   17876956   
 5                   Australia   Oceania  1952   69.120    8691212   
 6                     Austria    Europe  1952   66.800    6927772   
 7                     Bahrain      Asia  1952   50.939     120447   
 8                  Bangladesh      Asia  1952   37.484   46886859   
 9                     Belgium    Europe  1952   68.000    8730405   
 10                      Benin    Africa  1952   38.223    1738315   
 11                    Bolivia  Americas  1952   40.414    2883315   
 12     Bosnia and Herzegovina    Europe  1952   53.820    2791000   
 13           

In [38]:
# pd.Panel only exists in older pandas (it was removed in 0.25)
gapminder_panel = pd.Panel(df_dict)

print(gapminder_panel)

print("\nA three-dimensional structure: 12 years x 142 (major_axis, rows) x 6 (minor_axis, columns)")

<class 'pandas.core.panel.Panel'>
Dimensions: 12 (items) x 142 (major_axis) x 6 (minor_axis)
Items axis: 1952 to 2007
Major_axis axis: 0 to 141
Minor_axis axis: country to gdpPercap

A three-dimensional structure: 12 years x 142 (major_axis, rows) x 6 (minor_axis, columns)
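
Because `pd.Panel` is no longer available in recent pandas releases, one possible substitute is to stack the per-year DataFrames under a two-level index with `pd.concat`. The sketch below uses a made-up four-row miniature of gapminder; `mini`, `frames_by_year` and `stacked` are illustrative names, and this is only one of several ways to replace a Panel.

```python
import pandas as pd

# Hypothetical miniature of gapminder: two years, two countries
mini = pd.DataFrame({
    "country": ["Afghanistan", "Albania", "Afghanistan", "Albania"],
    "year": [1952, 1952, 1957, 1957],
    "lifeExp": [28.801, 55.230, 30.332, 59.280],
})

# One DataFrame per year, each renumbered from 0;
# the year column is dropped because the year becomes the dict key
frames_by_year = {
    year: group.drop(columns="year").reset_index(drop=True)
    for year, group in mini.groupby("year")
}

# Stack the frames under a two-level index: (year, row number)
stacked = pd.concat(frames_by_year, names=["year", "row"])

# Selecting one year gives back that year's DataFrame
print(stacked.loc[1957])
```

Selecting with `stacked.loc[year]` plays the role that `gapminder_panel[year]` played for the Panel.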


***

## What is reset_index( drop = True )

### 1. reset_index:

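After operations on a DataFrame such as
 - concat (concatenation)
 - groupby (split into groups and aggregate)
 - agg 
 - transform
 
the row index may no longer be a clean 0, 1, 2, ... sequence: labels can repeat, have gaps, or appear out of order,
which can cause errors in later steps.
It is a good habit to call **reset_index** afterwards.

### 2. drop parameter:
 
 - drop = True: discard the original index.
 
 
 - drop = False (the default): keep the original index, which may be out of order, as a new column.

### 3. inplace parameter:
When modifying an object:
 
 - inplace = True: modify the original object directly without creating a new one (the call returns None);
 
 
 - inplace = False (the default): leave the original object untouched, and create and return a new object that holds the result.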
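
A small sketch of how the two parameters behave. The frame `df` is made up, with an index that has gaps as it would after filtering rows.

```python
import pandas as pd

# Hypothetical frame whose index has gaps
df = pd.DataFrame({"A": ["A1", "A3", "A4"]}, index=[1, 3, 4])

kept = df.reset_index()              # drop=False: old index becomes a column named "index"
dropped = df.reset_index(drop=True)  # drop=True: old index is discarded

print(list(kept.columns))     # ['index', 'A']
print(list(dropped.columns))  # ['A']
print(list(df.index))         # [1, 3, 4]  (df itself is unchanged so far)

# inplace=True modifies df itself and returns None
returned = df.reset_index(drop=True, inplace=True)
print(returned)               # None
print(list(df.index))         # [0, 1, 2]
```

Because `inplace = True` returns None, writing `df = df.reset_index(drop=True, inplace=True)` would overwrite `df` with None, so use either the assignment or `inplace`, not both.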
 
 

In [14]:
import pandas as pd
# import sys

df1 = pd.DataFrame({'A': ['A0', 'A1', 'A2', 'A3'],
                    'B': ['B0', 'B1', 'B2', 'B3'],
                    'C': ['C0', 'C1', 'C2', 'C3'],
                    'D': ['D0', 'D1', 'D2', 'D3']})

df2 = pd.DataFrame({'A': ['A4', 'A5', 'A6', 'A7'],
                    'B': ['B4', 'B5', 'B6', 'B7'],
                    'C': ['C4', 'C5', 'C6', 'C7'],
                    'D': ['D4', 'D5', 'D6', 'D7']})

df3 = pd.DataFrame({'A': ['A8', 'A9', 'A10', 'A11'],
                    'B': ['B8', 'B9', 'B10', 'B11'],
                    'C': ['C8', 'C9', 'C10', 'C11'],
                    'D': ['D8', 'D9', 'D10', 'D11']})

frames = [df1, df2, df3]
result = pd.concat(frames)

print("\n Case 1: the index repeats 0, 1, 2, 3 for each frame, which is messy")
print(result)

print("\n Case 2: the index is renumbered \n")
print(result.reset_index(drop = True))


 Case 1: the index repeats 0, 1, 2, 3 for each frame, which is messy
     A    B    C    D
0   A0   B0   C0   D0
1   A1   B1   C1   D1
2   A2   B2   C2   D2
3   A3   B3   C3   D3
0   A4   B4   C4   D4
1   A5   B5   C5   D5
2   A6   B6   C6   D6
3   A7   B7   C7   D7
0   A8   B8   C8   D8
1   A9   B9   C9   D9
2  A10  B10  C10  D10
3  A11  B11  C11  D11

 Case 2: the index is renumbered 

      A    B    C    D
0    A0   B0   C0   D0
1    A1   B1   C1   D1
2    A2   B2   C2   D2
3    A3   B3   C3   D3
4    A4   B4   C4   D4
5    A5   B5   C5   D5
6    A6   B6   C6   D6
7    A7   B7   C7   D7
8    A8   B8   C8   D8
9    A9   B9   C9   D9
10  A10  B10  C10  D10
11  A11  B11  C11  D11


***

# Practice

### Define a function that wraps read_excel

In [52]:
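def get_data(url_list):
    # Read the 'Data' sheet of each Excel file into its own DataFrame
    df_list = []
    for url in url_list:
        # sheet_name replaces the older sheetname keyword
        df_list.append(pd.read_excel(url, sheet_name = 'Data'))
    return df_list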

In [54]:
url_list = ['https://storage.googleapis.com/learn_pd_like_tidyverse/indicator_gapminder_population.xlsx', 
            'https://storage.googleapis.com/learn_pd_like_tidyverse/indicator_gapminder_gdp_per_capita_ppp.xlsx', 
            'https://storage.googleapis.com/learn_pd_like_tidyverse/indicator_life_expectancy_at_birth.xlsx']

# index 0 : population
# index 1 : gdp_per_capita
# index 2 : life_expectancy

wide_df_list = get_data(url_list)



In [62]:
wide_df_list

[                                 Total population        1800        1810  \
 0                                        Abkhazia         NaN         NaN   
 1                                     Afghanistan   3280000.0   3280000.0   
 2                           Akrotiri and Dhekelia         NaN         NaN   
 3                                         Albania    410445.0    423591.0   
 4                                         Algeria   2503218.0   2595056.0   
 5                                  American Samoa      8170.0      8156.0   
 6                                         Andorra      2654.0      2654.0   
 7                                          Angola   1567028.0   1567028.0   
 8                                        Anguilla      2025.0      2025.0   
 9                             Antigua and Barbuda     37000.0     37000.0   
 10                                      Argentina    534000.0    534000.0   
 11                                        Armenia    413326.0  

### Data from 1800 ~ 2015

 - Total Population


 - GDP per capita
 
 
 - Life expectancy

In [59]:
wide_df_list[0].head()

Unnamed: 0,Total population,1800,1810,1820,1830,1840,1850,1860,1870,1880,...,2006,2007,2008,2009,2010,2011,2012,2013,2014,2015
0,Abkhazia,,,,,,,,,,...,,,,,,,,,,
1,Afghanistan,3280000.0,3280000.0,3323519.0,3448982.0,3625022.0,3810047.0,3973968.0,4169690.0,4419695.0,...,25183615.0,25877544.0,26528741.0,27207291.0,27962207.0,28809167.0,29726803.0,30682500.0,31627506.0,32526562.0
2,Akrotiri and Dhekelia,,,,,,,,,,...,15700.0,15700.0,15700.0,,,,,,,
3,Albania,410445.0,423591.0,438671.0,457234.0,478227.0,506889.0,552800.0,610036.0,672544.0,...,3050741.0,3010849.0,2968026.0,2929886.0,2901883.0,2886010.0,2880667.0,2883281.0,2889676.0,2896679.0
4,Algeria,2503218.0,2595056.0,2713079.0,2880355.0,3082721.0,3299305.0,3536468.0,3811028.0,4143163.0,...,33749328.0,34261971.0,34811059.0,35401790.0,36036159.0,36717132.0,37439427.0,38186135.0,38934334.0,39666519.0


In [60]:
wide_df_list[1].head()

Unnamed: 0,GDP per capita,1800,1801,1802,1803,1804,1805,1806,1807,1808,...,2006,2007,2008,2009,2010,2011,2012,2013,2014,2015
0,Abkhazia,,,,,,,,,,...,,,,,,,,,,
1,Afghanistan,603.0,603.0,603.0,603.0,603.0,603.0,603.0,603.0,603.0,...,1173.0,1298.0,1311.0,1548.0,1637.0,1695.0,1893.0,1884.0,1877.0,1925.0
2,Akrotiri and Dhekelia,,,,,,,,,,...,,,,,,,,,,
3,Albania,667.0,667.0,668.0,668.0,668.0,668.0,668.0,668.0,668.0,...,7476.0,7977.0,8644.0,8994.0,9374.0,9640.0,9811.0,9961.0,10160.0,10620.0
4,Algeria,716.0,716.0,717.0,718.0,719.0,720.0,721.0,722.0,723.0,...,12088.0,12289.0,12314.0,12285.0,12494.0,12606.0,12779.0,12893.0,13179.0,13434.0


In [66]:
wide_df_list[2].head()

Unnamed: 0,Life expectancy,1800,1801,1802,1803,1804,1805,1806,1807,1808,...,2007,2008,2009,2010,2011,2012,2013,2014,2015,2016
0,Abkhazia,,,,,,,,,,...,,,,,,,,,,
1,Afghanistan,28.21,28.2,28.19,28.18,28.17,28.16,28.15,28.14,28.13,...,52.4,52.8,53.3,53.6,54.0,54.4,54.8,54.9,53.8,52.72
2,Akrotiri and Dhekelia,,,,,,,,,,...,,,,,,,,,,
3,Albania,35.4,35.4,35.4,35.4,35.4,35.4,35.4,35.4,35.4,...,76.6,76.8,77.0,77.2,77.4,77.5,77.7,77.9,78.0,78.1
4,Algeria,28.82,28.82,28.82,28.82,28.82,28.82,28.82,28.82,28.82,...,75.3,75.5,75.7,76.0,76.1,76.2,76.3,76.3,76.4,76.5


In [67]:
print(type(wide_df_list[2]))

<class 'pandas.core.frame.DataFrame'>


# Convert to long format

Use **pd.melt()** to convert the wide tables in the **list** into long tables:

In [82]:
## Define a function that builds the long tables

def get_long_df(wide_df_list):
    long_df_list = [] # empty list that will hold the long DataFrames
    source_var = ['Total population','GDP per capita','Life expectancy']
    rename_var = ['pop', 'gdpPercap','lifeExp']
    
    # zip lets the for loop walk several sequences in parallel
    for wide_df, old_var, new_var in zip(wide_df_list, source_var, rename_var):
        # keep the country column, stack every year column into rows
        df = pd.melt(wide_df, id_vars = [old_var])
        df.columns = ['country', 'year', new_var]
        long_df_list.append(df)
        
    return long_df_list

In [83]:
long_df_list = get_long_df(wide_df_list)

In [85]:
long_df_list[0].head()

Unnamed: 0,country,year,pop
0,Abkhazia,1800,
1,Afghanistan,1800,3280000.0
2,Akrotiri and Dhekelia,1800,
3,Albania,1800,410445.0
4,Algeria,1800,2503218.0


In [86]:
long_df_list[1].head()

Unnamed: 0,country,year,gdpPercap
0,Abkhazia,1800,
1,Afghanistan,1800,603.0
2,Akrotiri and Dhekelia,1800,
3,Albania,1800,667.0
4,Algeria,1800,716.0


In [88]:
long_df_list[2].head()

Unnamed: 0,country,year,lifeExp
0,Abkhazia,1800,
1,Afghanistan,1800,28.21
2,Akrotiri and Dhekelia,1800,
3,Albania,1800,35.4
4,Algeria,1800,28.82
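
To see what `pd.melt()` does on a small scale, here is a minimal sketch on a made-up two-country, two-year wide table. The values are copied from the population output above; the name `wide` is illustrative.

```python
import pandas as pd

# Hypothetical wide table: one row per country, one column per year
wide = pd.DataFrame({
    "Total population": ["Albania", "Algeria"],
    1800: [410445.0, 2503218.0],
    1810: [423591.0, 2595056.0],
})

# Keep the country column; every other column becomes a (variable, value) pair
long = pd.melt(wide, id_vars=["Total population"])
long.columns = ["country", "year", "pop"]

print(long.shape)  # (4, 3)
print(long)
```

Two countries times two year columns give four rows: melt stacks the 1800 column first, then the 1810 column.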


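# Merge columns side by side with pd.merge()

 - 1. Use **pd.merge()** to merge the three long tables in the list
 
 
 - 2. Use **dropna()** (a DataFrame method) to remove observations with missing values
 
 
 
 - 3. Sort by year and country (sort_values)
 
 
 
 - 4. Finally, reset the index (reset_index)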
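
The four steps can also be written as one method chain. The sketch below runs on two made-up three-row tables (`pop` and `gdp`, with placeholder country names "A", "B", "C"), not on the real gapminder data.

```python
import pandas as pd

# Hypothetical long tables; country "C" has a missing population
pop = pd.DataFrame({
    "country": ["B", "A", "C"],
    "year": [1800, 1800, 1800],
    "pop": [2.0, 1.0, None],
})
gdp = pd.DataFrame({
    "country": ["B", "A", "C"],
    "year": [1800, 1800, 1800],
    "gdpPercap": [20.0, 10.0, 30.0],
})

tidy = (
    pd.merge(pop, gdp, on=["country", "year"])  # 1. merge on the shared keys
      .dropna()                                  # 2. drop rows with missing values
      .sort_values(["year", "country"])          # 3. sort by year, then country
      .reset_index(drop=True)                    # 4. renumber the rows from 0
)

print(tidy)
```

Each method returns a new DataFrame, so the chain produces the same result as assigning `merged_df` at every step.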
 
 

## 1. pd.merge()

In [90]:
# Merge the first table with the second
merged_df = pd.merge(long_df_list[0], long_df_list[1], 
                     on = ['country','year'])
merged_df.head()

Unnamed: 0,country,year,pop,gdpPercap
0,Abkhazia,1800,,
1,Afghanistan,1800,3280000.0,603.0
2,Akrotiri and Dhekelia,1800,,
3,Albania,1800,410445.0,667.0
4,Algeria,1800,2503218.0,716.0


In [92]:
# Merge the already merged merged_df with the third table
merged_df = pd.merge(merged_df, long_df_list[2], 
                     on = ['country','year'])

merged_df.head()

Unnamed: 0,country,year,pop,gdpPercap,lifeExp
0,Abkhazia,1800,,,
1,Afghanistan,1800,3280000.0,603.0,28.21
2,Akrotiri and Dhekelia,1800,,,
3,Albania,1800,410445.0,667.0,35.4
4,Algeria,1800,2503218.0,716.0,28.82
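
When there are more than two tables, the repeated `pd.merge()` calls can be folded into one with `functools.reduce`. This is a sketch on made-up one-row tables (values taken from the Albania row above); `tables` and `keys` are illustrative names.

```python
from functools import reduce

import pandas as pd

keys = ["country", "year"]

# Hypothetical one-row stand-ins for the three long tables
tables = [
    pd.DataFrame({"country": ["Albania"], "year": [1800], "pop": [410445.0]}),
    pd.DataFrame({"country": ["Albania"], "year": [1800], "gdpPercap": [667.0]}),
    pd.DataFrame({"country": ["Albania"], "year": [1800], "lifeExp": [35.4]}),
]

# Merge the first two tables, then merge that result with the third, and so on
merged = reduce(lambda left, right: pd.merge(left, right, on=keys), tables)

print(list(merged.columns))  # ['country', 'year', 'pop', 'gdpPercap', 'lifeExp']
```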


## 2. Remove observations with missing values using dropna()

In [99]:
merged_df = merged_df.dropna()

merged_df.head()

Unnamed: 0,country,year,pop,gdpPercap,lifeExp
1,Afghanistan,1800,3280000.0,603.0,28.21
3,Albania,1800,410445.0,667.0,35.4
4,Algeria,1800,2503218.0,716.0,28.82
7,Angola,1800,1567028.0,618.0,26.98
9,Antigua and Barbuda,1800,37000.0,757.0,33.54


## 3. Sort by year and country with sort_values( ['v1','v2'] )


In [100]:
merged_df = merged_df.sort_values(['year','country']) 
# sort_values() takes a list of column names; rows are sorted by the first, then the second

merged_df.head()

Unnamed: 0,country,year,pop,gdpPercap,lifeExp
1,Afghanistan,1800,3280000.0,603.0,28.21
3,Albania,1800,410445.0,667.0,35.4
4,Algeria,1800,2503218.0,716.0,28.82
7,Angola,1800,1567028.0,618.0,26.98
9,Antigua and Barbuda,1800,37000.0,757.0,33.54


> The index of the table above has gaps (1, 3, 4, 7, 9) because the rows with missing values were dropped,
so we need to reset the index.

## 4. Finally, reset the index (reset_index)

In [101]:
merged_df = merged_df.reset_index(drop = True)

merged_df.head()

Unnamed: 0,country,year,pop,gdpPercap,lifeExp
0,Afghanistan,1800,3280000.0,603.0,28.21
1,Albania,1800,410445.0,667.0,35.4
2,Algeria,1800,2503218.0,716.0,28.82
3,Angola,1800,1567028.0,618.0,26.98
4,Antigua and Barbuda,1800,37000.0,757.0,33.54


# Output 
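 - pandas can write data out to many text formats, binary files, and databases



 - Common targets include txt, csv, Excel, MySQL, and PostgreSQL



 - For details, see the [pandas IO tools documentation](https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html)



 - Finally, write the cleaned and merged data out as a csv file and an Excel spreadsheet. **The index is usually dropped when writing.**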




In [103]:
## write the file

merged_df.to_csv('[1]-gapminder.csv', index = False)

merged_df.to_excel('[1]-gapminder.xlsx', index = False)
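
A quick sketch of why `index = False` is the usual choice. It writes a made-up two-row frame to a string instead of a file, so nothing is created on disk; `df` here is illustrative, not `merged_df`.

```python
import io

import pandas as pd

# Hypothetical two-row frame
df = pd.DataFrame({"country": ["Albania", "Algeria"], "year": [1800, 1800]})

with_index = df.to_csv()                # index written as an unnamed first column
without_index = df.to_csv(index=False)  # only the data columns are written

print(with_index.splitlines()[0])     # ,country,year
print(without_index.splitlines()[0])  # country,year

# Reading the version with the index back yields an extra "Unnamed: 0" column
reread = pd.read_csv(io.StringIO(with_index))
print(list(reread.columns))           # ['Unnamed: 0', 'country', 'year']
```

Since the index here is just a row counter, dropping it on write avoids the stray `Unnamed: 0` column on the next read.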