# Data Science Packages


# Introduction to Pandas

Pandas provides flexible, intuitive data structures for working with relational and labeled data.

---
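# Data Structures Provided by Pandas

| Name | Description |
|:--:|:------:|
| Series | A one-dimensional array with an index |
| DataFrame | A two-dimensional dataset with a row index and column labels |
| Panel | A three-dimensional dataset with an item index, a row index, and column labels (deprecated in pandas 0.20 and removed in 0.25) |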

---
# Using Pandas

First, bring in pandas with `import`. By convention it is aliased as `pd`:

```python
import pandas as pd
```


In [1]:
import pandas as pd

# Series

Put simply, a Series is a container that holds multiple values of one-dimensional data.

Create a Series:

```python
ser = pd.Series([1,2,3,4,5])
ser
```

In [2]:
ser = pd.Series([1,2,3,4,5])
ser

0    1
1    2
2    3
3    4
4    5
dtype: int64

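There is the result! The output shows two columns of numbers:

- The first column shows 0 to 4; this is the index
- The second column holds the value that corresponds to each index entry

So far a Series looks a lot like a List, just displayed vertically. Now let's look at where the two differ.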

# Creating a Series

Pass a List of strings to the `index` parameter:

```python
ser1 = pd.Series([1,2,3,4,5], index=['a', 'b', 'c', 'd', 'e'])
ser1
```

In [3]:
ser1 = pd.Series([1,2,3,4,5], index=['a', 'b', 'c', 'd', 'e'])
ser1

a    1
b    2
c    3
d    4
e    5
dtype: int64

Now we can see that the index of each entry in a Series can be given its own custom label.

# Series


A Series is made up of **labels (Index)** and **values (Values)**, so from an Excel point of view the Series above can be pictured as:

![](https://drive.google.com/uc?export=download&id=1G9waSQfNJ3UMrwzYBIUcfbjc6gA_b_fi)


# Creating a Series
The values of a Series are stored in a **NumPy array**

```python
ser1.values
# array([1, 2, 3, 4, 5], dtype=int64)
```

In [4]:
ser1.values

array([1, 2, 3, 4, 5], dtype=int64)

# Series
The labels of a Series are stored in a **pandas Index** object

```python
ser1.index
# Index(['a', 'b', 'c', 'd', 'e'], dtype='object')
```

In [5]:
ser1.index

Index(['a', 'b', 'c', 'd', 'e'], dtype='object')

# Reading Data from a Series

Use the `iloc` indexer:

```python
ser1.iloc[0]
```


In [6]:
ser1.iloc[0]

1

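# Slicing Data in a Series

Like a List, a Series also supports slicing:

```python
ser.iloc[start:stop]  # start position (inclusive) : stop position (exclusive)
```

This works the same way as slicing a list.



Try entering: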

```python
ser1.iloc[0:3]
```

In [7]:
ser1.iloc[0:3]

a    1
b    2
c    3
dtype: int64

# Another Way to Read Series Data

The `loc` indexer:

```python
ser1.loc["a"]
```

In [8]:
ser1.loc["a"]

1

# Another Way to Slice Series Data

The `loc` indexer:

```python
ser1.loc["a":"c"]
```

In [9]:
ser1.loc["a":"c"]

a    1
b    2
c    3
dtype: int64

# .loc vs .iloc

- .loc looks up by label

- .iloc looks up by integer position
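
The difference is easiest to see when the labels are themselves integers. The following is a minimal sketch using a made-up Series `s` (not part of the original notebook) whose labels 10, 20, 30 do not match the positions 0, 1, 2:

```python
import pandas as pd

# Hypothetical Series: integer labels that differ from the positions
s = pd.Series([100, 200, 300], index=[10, 20, 30])

by_label = s.loc[10]      # label 10 is the first entry
by_position = s.iloc[0]   # position 0 is the same entry

label_slice = s.loc[10:20]    # a label slice includes the end label
position_slice = s.iloc[0:1]  # a position slice excludes the end position

print(by_label, by_position)
print(len(label_slice), len(position_slice))
```

Here both lookups return the same entry, but the two slices differ in length because only the label-based slice keeps its end point.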

# Extracting Data from a Series

To extract several **non-contiguous** entries, specify multiple labels and put them in a List:
```python
ser1.loc[['a', 'c', 'e']]
```

In [10]:
ser1.loc[['a', 'c', 'e']]

a    1
c    3
e    5
dtype: int64

# Extracting Data from a Series

Several non-contiguous entries can also be extracted by specifying multiple **integer positions**:

```python
ser1.iloc[[0, 2, 4]]
```

In [11]:
ser1.iloc[[0, 2, 4]]

a    1
c    3
e    5
dtype: int64

# Common Series Functions

Sum:

```python
ser1.sum()
```
Maximum:
```python
ser1.max()
```
Minimum:
```python
ser1.min()
```
Mean:
```python
ser1.mean()
```
Standard deviation:

```python
ser1.std()
```

In [12]:
ser1.sum(), ser1.max(), ser1.min(), ser1.mean(), ser1.std()

(15, 5, 1, 3.0, 1.5811388300841898)
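
The standard deviation shown above (about 1.58) is the sample standard deviation, because `Series.std()` divides by N - 1 by default. The sketch below, which uses a throwaway Series with the same values, compares it with the population version that NumPy computes by default:

```python
import numpy as np
import pandas as pd

ser = pd.Series([1, 2, 3, 4, 5])

sample_std = ser.std()              # pandas default: ddof=1, divides by N - 1
population_std = ser.std(ddof=0)    # divides by N
numpy_std = np.std(ser.to_numpy())  # NumPy default: ddof=0

print(sample_std, population_std, numpy_std)
```

Pass `ddof=0` to `std()` when the result needs to match NumPy's default.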

# cumsum(): Cumulative Sum

The name is short for "cumulative summation"; it accumulates the data from top to bottom:

```python
ser1.cumsum()
```

Official documentation: [link](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.cumsum.html)

In [13]:
ser1.cumsum()

a     1
b     3
c     6
d    10
e    15
dtype: int64

# cumprod(): Cumulative Product

The name is short for "cumulative product"; it multiplies the data cumulatively from top to bottom:

```python
ser1.cumprod()
```

In [14]:
ser1.cumprod()

a      1
b      2
c      6
d     24
e    120
dtype: int64

# Updating an Existing Value in a Series
```python
ser1.loc['a'] = 8
ser1
```

In [15]:
ser1.loc['a'] = 8
ser1

a    8
b    2
c    3
d    4
e    5
dtype: int64

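# Updating Multiple Values in a Series
```python
ser1.loc["b":"d"] = [9, 10, 11]
ser1
```

Note that, unlike slicing by integer position, slicing by key **includes the end point itself (that is, the value the end key maps to)**.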

In [16]:
ser1.loc["b":"d"] = [9, 10, 11]
ser1

a     8
b     9
c    10
d    11
e     5
dtype: int64

# Deleting Data from a Series

`drop()` returns the result as a new Series; the original Series is left unchanged.
```python
ser2 = ser1.drop(labels='a')
```

In [17]:
ser2 = ser1.drop(labels='a')

In [18]:
ser2

b     9
c    10
d    11
e     5
dtype: int64

# Deleting Multiple Entries from a Series

```python
x = ser1.drop(labels=['b', 'c', 'd'])
```


In [19]:
ser1.drop(labels=['b', 'c', 'd'])

a    8
e    5
dtype: int64

# Creating a Series from a dict

Before:
```python
ser1 = pd.Series([1,2,3,4,5], index=['a', 'b', 'c', 'd', 'e'])
```
From a dict:
```python
# each index label in the Series maps to one value
data = {
    'a': 1,
    'b': 3,
    'c': 5,
    'd': 7,
    'e': 9
}

pd.Series(data)
```

In [20]:
data = {
    'a': 1,
    'b': 3,
    'c': 5,
    'd': 7,
    'e': 9
}

pd.Series(data)

a    1
b    3
c    5
d    7
e    9
dtype: int64

# Summary

- A Series is well suited to one-dimensional data
- The index of a Series can be given **custom labels (much like a Dictionary)**
- A Series can be thought of as an **ordered dictionary (dict)**
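
To illustrate the dictionary analogy, here is a small sketch with made-up data (the `prices` Series is hypothetical). It shows the dict-style operations a Series supports, plus positional access, which a plain dict does not offer:

```python
import pandas as pd

# Hypothetical data, for illustration only
prices = pd.Series({"apple": 30, "banana": 12, "cherry": 45})

has_apple = "apple" in prices  # membership tests the labels, as with a dict
labels = list(prices.keys())   # keys() returns the index
as_dict = prices.to_dict()     # convert back to a plain dict
second = prices.iloc[1]        # positional access relies on the ordering

print(has_apple, labels, as_dict, second)
```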


# Integration with NumPy

```python
import numpy as np

pd.Series(np.arange(5))
```

In [57]:
import numpy as np

# Integration with NumPy

```python
even_num = np.arange(2, 11, 2)

pd.Series(even_num)
```

In [58]:
even_num = np.arange(2, 11, 2)

pd.Series(even_num)

0     2
1     4
2     6
3     8
4    10
dtype: int32

# Generating a Random Series

```python
import numpy as np
rand_array = np.random.rand(10)
pd.Series(rand_array)
```
or
```python
pd.Series(np.random.rand(5))
```

# head(), tail(), take()

```python
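# head() returns the first five entries (or all of them if the Series is shorter)
ser2.head()

# tail(3) returns the last three entries
ser2.tail(3)

# take() returns the entries at integer positions 2, 3, and 0
# (ser2 has only four entries, so every position must be between 0 and 3)
ser2.take([2, 3, 0])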
```


# Series.isin

Check, entry by entry, whether each value in the Series appears in the given list:
```python
ser2.isin([3, 5])
```
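
Because `isin()` returns a Boolean Series, its result can be used directly as a filter. A minimal sketch, rebuilding a Series with the same contents as `ser2` so that the block runs on its own:

```python
import pandas as pd

# Same labels and values as ser2 at this point in the notebook
ser = pd.Series([9, 10, 11, 5], index=["b", "c", "d", "e"])

mask = ser.isin([5, 10])  # True where the value is 5 or 10
selected = ser[mask]      # keep only the matching entries

print(mask)
print(selected)
```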

Python's list comprehension:
```python
nums = [1,2,3,4,5]  # named nums so the built-in list is not shadowed
[i + 1 for i in nums]
```

In [22]:
nums = [1,2,3,4,5]
[i + 1 for i in nums]

[2, 3, 4, 5, 6]

# Element-wise Operations on a Series

The syntax is the same as NumPy's element-wise operations

```python
ser2 * 2
```

In [23]:
print(ser2)
ser2 * 2

b     9
c    10
d    11
e     5
dtype: int64


b    18
c    20
d    22
e    10
dtype: int64

# Element-wise Operations on a Series

Compare every entry against 3

```python
ser2 > 3
```

Note that an element-wise comparison produces a Boolean Series

In [24]:
print(ser1)
ser1 > 9

a     8
b     9
c    10
d    11
e     5
dtype: int64


a    False
b    False
c     True
d     True
e    False
dtype: bool

In [25]:
ser1[ser1 > 9]  # select the values in ser1 that are greater than 9; ser1 itself is unchanged

c    10
d    11
dtype: int64

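# Element-wise Operations on a Series

To get filtering similar to the filter() function, apply the Boolean Series produced above back to the original Series

```python
ser2[ser2 > 3]
```

Think of it as **slicing out the entries whose comparison result is True**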

# In-class Exercise:

Here are the names and ages of the Avengers:

```python
avengers = {
    "ironman": 46,
    "captainamerica": 99,
    "blackwidow": 37,
    "thor": 430,
    "hulk": 42,
    "spiderman": 15,
    "blackpanther": 39
}

ser_age = pd.Series(avengers, index = avengers.keys())
```

Compute the age of every Avenger five years ago

In [26]:
avengers = {
    "ironman": 46,
    "captainamerica": 99,
    "blackwidow": 37,
    "thor": 430,
    "hulk": 42,
    "spiderman": 15,
    "blackpanther": 39
}

ser_age = pd.Series(avengers)
ser_age - 5

ironman            41
captainamerica     94
blackwidow         32
thor              425
hulk               37
spiderman          10
blackpanther       34
dtype: int64

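# DataFrame
- A tabular data structure (think of it as a **virtual Excel spreadsheet**)
- In practice, a data structure assembled from multiple Series
- Suited to holding and processing two-dimensional data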

```python
import pandas as pd
```

# Creating a DataFrame

```python
avengers = {
    "name": ["ironman","captainamerica","blackwidow","thor","hulk","spiderman", "blackpanther"],
    "age": [48, 100, 33, 430, 48, 15, 39],
    "superpower": [False, True, False, True, True, True, False]
}

df = pd.DataFrame(avengers)
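print(type(df))
# Without parentheses, df.info is the bound method itself, so print() shows
# the method's repr (which embeds the table). Call df.info() for a column summary.
print(df.info)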
```

In [27]:
import pandas as pd
avengers = {
    "name": ["ironman","captainamerica","blackwidow","thor","hulk","spiderman", "blackpanther"],
    "age": [48, 100, 33, 430, 48, 15, 39],
    "superpower": [False, True, False, True, True, True, False]
}
df = pd.DataFrame(avengers)
print(type(df))
print(df.info)

<class 'pandas.core.frame.DataFrame'>
<bound method DataFrame.info of              name  age  superpower
0         ironman   48       False
1  captainamerica  100        True
2      blackwidow   33       False
3            thor  430        True
4            hulk   48        True
5       spiderman   15        True
6    blackpanther   39       False>


# describe()
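Computes summary statistics (count, mean, standard deviation, minimum, quartiles, maximum) for the numeric columns: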
```python
df.describe()
```


In [28]:
df.describe()

|  | age |
|:--:|:------:|
| count | 7.0 |
| mean | 101.857143 |
| std | 147.036762 |
| min | 15.0 |
| 25% | 36.0 |
| 50% | 48.0 |
| 75% | 74.0 |
| max | 430.0 |


# DataFrame

```python
df = pd.DataFrame(avengers, columns = ["name", "age", "superpower"]) # specify the order of the column labels
```

# head(), tail()
```python
# first 5 rows (the default is 5)
df.head()

# first 3 rows
df.head(3)

# last 5 rows
df.tail()

# last 3 rows
df.tail(3)
```

In [29]:
df.head(3)
df.tail(3)

|  | name | age | superpower |
|:--:|:--:|:--:|:--:|
| 4 | hulk | 48 | True |
| 5 | spiderman | 15 | True |
| 6 | blackpanther | 39 | False |


- What happens if we subtract 5 from every age?

In [30]:
df['age'] - 5

0     43
1     95
2     28
3    425
4     43
5     10
6     34
Name: age, dtype: int64

In [31]:
# the result was not stored anywhere, so df is unchanged
df

|  | name | age | superpower |
|:--:|:--:|:--:|:--:|
| 0 | ironman | 48 | False |
| 1 | captainamerica | 100 | True |
| 2 | blackwidow | 33 | False |
| 3 | thor | 430 | True |
| 4 | hulk | 48 | True |
| 5 | spiderman | 15 | True |
| 6 | blackpanther | 39 | False |


# Adding a Column

```python
# a new column can be computed from the data in existing columns
df['age_5yr_ago'] = df['age'] - 5
```

In [32]:
df['age_5yr_ago'] = df['age'] - 5
df

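|  | name | age | superpower | age_5yr_ago |
|:--:|:--:|:--:|:--:|:--:|
| 0 | ironman | 48 | False | 43 |
| 1 | captainamerica | 100 | True | 95 |
| 2 | blackwidow | 33 | False | 28 |
| 3 | thor | 430 | True | 425 |
| 4 | hulk | 48 | True | 43 |
| 5 | spiderman | 15 | True | 10 |
| 6 | blackpanther | 39 | False | 34 |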


# Adding a Column

```python
# assign a List directly
df['weapon'] = ["armor", "shield", "taser", "hammer", "himself", "web", "claws"]
```

In [33]:
# a List can also be assigned directly
df['weapon'] = ["armor", "shield", "taser", "hammer", "himself", "web", "claws"]
df

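|  | name | age | superpower | age_5yr_ago | weapon |
|:--:|:--:|:--:|:--:|:--:|:--:|
| 0 | ironman | 48 | False | 43 | armor |
| 1 | captainamerica | 100 | True | 95 | shield |
| 2 | blackwidow | 33 | False | 28 | taser |
| 3 | thor | 430 | True | 425 | hammer |
| 4 | hulk | 48 | True | 43 | himself |
| 5 | spiderman | 15 | True | 10 | web |
| 6 | blackpanther | 39 | False | 34 | claws |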


# Reading a Single Column

```python
df['age']
```
or
```python
df.age
```
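
The two forms are not always interchangeable. Dot access only works when the column name is a valid Python identifier (so a name with a space, such as `Adj Close`, needs brackets), and it silently loses to any existing DataFrame attribute of the same name. A small sketch with a made-up frame containing a column called `count`:

```python
import pandas as pd

# Hypothetical frame: "count" collides with the DataFrame.count() method
df = pd.DataFrame({"age": [48, 100], "count": [1, 2]})

same_column = df.age.equals(df["age"])  # dot and bracket access agree here
dot_is_method = callable(df.count)      # df.count is the method, not the column
count_values = df["count"].tolist()     # brackets always return the column

print(same_column, dot_is_method, count_values)
```

Bracket access is therefore the safer default, with dot access as a convenience for simple names.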


In [34]:
df.age

0     48
1    100
2     33
3    430
4     48
5     15
6     39
Name: age, dtype: int64

# Selecting Rows by Condition
Select the Avengers younger than 50

```python
# build a Series of Boolean values
age_filter = df['age'] < 50
print(age_filter)
# then apply that Series back to the DataFrame
df[age_filter]
```

In [35]:
# build a Series of Boolean values
age_filter = df['age'] < 50
print(age_filter)
# then apply that Series back to the DataFrame
df[age_filter]

0     True
1    False
2     True
3    False
4     True
5     True
6     True
Name: age, dtype: bool


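|  | name | age | superpower | age_5yr_ago | weapon |
|:--:|:--:|:--:|:--:|:--:|:--:|
| 0 | ironman | 48 | False | 43 | armor |
| 2 | blackwidow | 33 | False | 28 | taser |
| 4 | hulk | 48 | True | 43 | himself |
| 5 | spiderman | 15 | True | 10 | web |
| 6 | blackpanther | 39 | False | 34 | claws |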


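# Selecting Rows by Condition
Alternatively, the code can be shortened to a single expression. The result is returned as a new DataFrame; it is not written back to `df`:

```python
df[df['age'] < 50]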
```

In [36]:
df[df['age'] < 50]

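|  | name | age | superpower | age_5yr_ago | weapon |
|:--:|:--:|:--:|:--:|:--:|:--:|
| 0 | ironman | 48 | False | 43 | armor |
| 2 | blackwidow | 33 | False | 28 | taser |
| 4 | hulk | 48 | True | 43 | himself |
| 5 | spiderman | 15 | True | 10 | web |
| 6 | blackpanther | 39 | False | 34 | claws |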


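# Selecting Data

From the Avengers younger than 50, select those with superpowers. Done this way, it takes an extra variable (young_avengers_df), which is cumbersome: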
```python
age_filter = df['age'] < 50
young_avengers_df = df[age_filter]
super_filter = young_avengers_df['superpower'] == True
young_avengers_df[super_filter]
```

In [37]:
# two separate conditions
df[df['age'] < 50]
df[df['superpower'] == True]

# combine the two conditions
df[ (df['age'] < 50) & (df['superpower'] == True) ]

|  | name | age | superpower | age_5yr_ago | weapon |
|:--:|:--:|:--:|:--:|:--:|:--:|
| 4 | hulk | 48 | True | 43 | himself |
| 5 | spiderman | 15 | True | 10 | web |


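# Another Way to Write It

| Logical operation | Pandas syntax | 
|:--:|:------:|
| and | &     |
| or | \|      |
| not | ~     |

`or` is written with the vertical bar `|`.

So the `&` operator can combine two conditions:
```python 
(df['age'] <= 50) & (df['superpower'] == False)
```

Finally, this selects the rows that satisfy both conditions: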
```python
df[(df['age'] <= 50) & (df['superpower'] == False)]
```
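
The other two operators in the table work the same way. The sketch below uses a trimmed, three-row version of the Avengers data (chosen here only for brevity) to show `|` and `~`. Each comparison is wrapped in parentheses because `&`, `|`, and `~` bind more tightly than comparison operators such as `<`:

```python
import pandas as pd

# Trimmed-down version of the Avengers data, for illustration only
df = pd.DataFrame({
    "name": ["ironman", "thor", "spiderman"],
    "age": [48, 430, 15],
    "superpower": [False, True, True],
})

# or: younger than 20, or without superpowers
young_or_powerless = df[(df["age"] < 20) | (~df["superpower"])]

# not: everyone who is NOT younger than 50
not_young = df[~(df["age"] < 50)]

print(young_or_powerless["name"].tolist())
print(not_young["name"].tolist())
```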

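# Pandas Hands-on Project: Detecting Stock Price Rises and Falls

Use a DataFrame to determine rises and falls:

1. Compute the returns from S&P 500 historical data
2. Plot the price trend
3. Determine whether the price rose
4. Keep only the rows for days on which the price rose
5. Export to an Excel file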

---
# Sample CSV File

[Sample CSV file](https://www.dropbox.com/s/by2hfjhm07kkhbj/s%26p500.csv?dl=1)

---
# Loading the Data into a DataFrame

```python
import pandas as pd

df = pd.read_csv(r"path/to/your/s&p500.csv")
df
```

Further reading: [official documentation for the related `read_excel` function](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_excel.html)


In [38]:
import pandas as pd

pd.read_csv(r"s&p500.csv")

|  | Date | Open | High | Low | Close | Adj Close | Volume |
|:--:|:--:|:--:|:--:|:--:|:--:|:--:|:--:|
| 0 | 2018/7/2 | 2704.949951 | 2727.260010 | 2698.949951 | 2726.709961 | 2726.709961 | 3073650000 |
| 1 | 2018/7/3 | 2733.270020 | 2736.580078 | 2711.159912 | 2713.219971 | 2713.219971 | 1911470000 |
| 2 | 2018/7/5 | 2724.189941 | 2737.830078 | 2716.020020 | 2736.610107 | 2736.610107 | 2953420000 |
| 3 | 2018/7/6 | 2737.679932 | 2764.409912 | 2733.520020 | 2759.820068 | 2759.820068 | 2554780000 |
| 4 | 2018/7/9 | 2775.620117 | 2784.649902 | 2770.729980 | 2784.169922 | 2784.169922 | 3050040000 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 247 | 2019/6/26 | 2926.070068 | 2932.590088 | 2912.989990 | 2913.780029 | 2913.780029 | 3478130000 |
| 248 | 2019/6/27 | 2919.659912 | 2929.300049 | 2918.570068 | 2924.919922 | 2924.919922 | 3122920000 |
| 249 | 2019/6/28 | 2932.939941 | 2943.979980 | 2929.050049 | 2941.760010 | 2941.760010 | 5420700000 |
| 250 | 2019/7/1 | 2971.409912 | 2977.929932 | 2952.219971 | 2964.330078 | 2964.330078 | 3513270000 |


# Changing the DataFrame's Row Index

```python
# use the Date column of the CSV as the index
df = pd.read_csv(r"path/to/your/s&p500.csv", index_col="Date", parse_dates=True)
df
```
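
To see what `parse_dates=True` changes without needing the sample file, the sketch below reads a tiny in-memory CSV. The three rows and their rounded prices are illustrative stand-ins, and the dates are written in ISO format as an assumption to keep parsing unambiguous. With `parse_dates=True` the index becomes a `DatetimeIndex`, which allows lookups by `Timestamp`:

```python
import io

import pandas as pd

# Illustrative in-memory CSV; values are rounded stand-ins, not the real file
csv_text = """Date,Adj Close
2018-07-02,2726.71
2018-07-03,2713.22
2018-07-05,2736.61
"""

df = pd.read_csv(io.StringIO(csv_text), index_col="Date", parse_dates=True)

is_datetime = isinstance(df.index, pd.DatetimeIndex)
july_3 = df.loc[pd.Timestamp("2018-07-03"), "Adj Close"]

print(is_datetime, july_3)
```

Without `parse_dates=True` the index entries stay plain strings, as in the notebook cell output that follows.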

In [39]:
df = pd.read_csv(r"s&p500.csv", index_col="Date")
df

| Date | Open | High | Low | Close | Adj Close | Volume |
|:--:|:--:|:--:|:--:|:--:|:--:|:--:|
| 2018/7/2 | 2704.949951 | 2727.260010 | 2698.949951 | 2726.709961 | 2726.709961 | 3073650000 |
| 2018/7/3 | 2733.270020 | 2736.580078 | 2711.159912 | 2713.219971 | 2713.219971 | 1911470000 |
| 2018/7/5 | 2724.189941 | 2737.830078 | 2716.020020 | 2736.610107 | 2736.610107 | 2953420000 |
| 2018/7/6 | 2737.679932 | 2764.409912 | 2733.520020 | 2759.820068 | 2759.820068 | 2554780000 |
| 2018/7/9 | 2775.620117 | 2784.649902 | 2770.729980 | 2784.169922 | 2784.169922 | 3050040000 |
| ... | ... | ... | ... | ... | ... | ... |
| 2019/6/26 | 2926.070068 | 2932.590088 | 2912.989990 | 2913.780029 | 2913.780029 | 3478130000 |
| 2019/6/27 | 2919.659912 | 2929.300049 | 2918.570068 | 2924.919922 | 2924.919922 | 3122920000 |
| 2019/6/28 | 2932.939941 | 2943.979980 | 2929.050049 | 2941.760010 | 2941.760010 | 5420700000 |
| 2019/7/1 | 2971.409912 | 2977.929932 | 2952.219971 | 2964.330078 | 2964.330078 | 3513270000 |


# Plotting the Price Trend

Read the closing prices (daily prices over the past year)

```python
# read the Adj Close column; this returns a Series
df["Adj Close"]
```

In [40]:
df["Adj Close"]

Date
2018/7/2     2726.709961
2018/7/3     2713.219971
2018/7/5     2736.610107
2018/7/6     2759.820068
2018/7/9     2784.169922
                ...     
2019/6/26    2913.780029
2019/6/27    2924.919922
2019/6/28    2941.760010
2019/7/1     2964.330078
2019/7/2     2962.560059
Name: Adj Close, Length: 252, dtype: float64

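# Drawing the Price Chart

```python
# take the closing prices from the DataFrame and plot them
# plot() returns a Matplotlib Axes, named ax here so it is not mistaken for the usual pyplot alias plt
ax = df["Adj Close"].plot()
ax.set_xlabel("Time")
ax.set_ylabel("Price")
ax.set_title("S&P 500 Closing Prices")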
```

In [41]:
# take the closing prices from the DataFrame and plot them; ax is the chart (Axes) object
ax = df["Adj Close"].plot()
ax.set_xlabel("Time")
ax.set_ylabel("Price")
ax.set_title("S&P 500 Closing Prices")

Text(0.5, 1.0, 'S&P 500 Closing Prices')

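# Computing Returns
**pct_change()** (short for "percentage change") computes the relative change between each row and the row before it. The result is a fraction, so multiply it by 100 to express it as a percentage:

```python
df["Adj Close"].pct_change(1) * 100
```

Further reading: [official documentation](https://pandas.pydata.org/pandas-docs/version/0.23.4/generated/pandas.DataFrame.pct_change.html)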
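
As a sanity check on what `pct_change()` computes, the sketch below compares it with the formula written out by hand, current value divided by previous value minus 1, on three made-up prices. The first entry is `NaN` because it has no previous row:

```python
import pandas as pd

# Made-up prices: up 10%, then down 10%
prices = pd.Series([100.0, 110.0, 99.0])

built_in = prices.pct_change()
manual = prices / prices.shift(1) - 1  # shift(1) moves every value down one row

print(built_in)
print(manual)
```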

In [42]:
df["Adj Close"].pct_change(1) * 100

Date
2018/7/2          NaN
2018/7/3    -0.494735
2018/7/5     0.862080
2018/7/6     0.848128
2018/7/9     0.882299
               ...   
2019/6/26   -0.123393
2019/6/27    0.382318
2019/6/28    0.575745
2019/7/1     0.767230
2019/7/2    -0.059711
Name: Adj Close, Length: 252, dtype: float64

# Writing the Result into the DataFrame

```python
df["daily return"] = df["Adj Close"].pct_change(1) * 100
```


In [43]:
df["daily return"] = df["Adj Close"].pct_change(1) * 100

In [44]:
df

| Date | Open | High | Low | Close | Adj Close | Volume | daily return |
|:--:|:--:|:--:|:--:|:--:|:--:|:--:|:--:|
| 2018/7/2 | 2704.949951 | 2727.260010 | 2698.949951 | 2726.709961 | 2726.709961 | 3073650000 |  |
| 2018/7/3 | 2733.270020 | 2736.580078 | 2711.159912 | 2713.219971 | 2713.219971 | 1911470000 | -0.494735 |
| 2018/7/5 | 2724.189941 | 2737.830078 | 2716.020020 | 2736.610107 | 2736.610107 | 2953420000 | 0.862080 |
| 2018/7/6 | 2737.679932 | 2764.409912 | 2733.520020 | 2759.820068 | 2759.820068 | 2554780000 | 0.848128 |
| 2018/7/9 | 2775.620117 | 2784.649902 | 2770.729980 | 2784.169922 | 2784.169922 | 3050040000 | 0.882299 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 2019/6/26 | 2926.070068 | 2932.590088 | 2912.989990 | 2913.780029 | 2913.780029 | 3478130000 | -0.123393 |
| 2019/6/27 | 2919.659912 | 2929.300049 | 2918.570068 | 2924.919922 | 2924.919922 | 3122920000 | 0.382318 |
| 2019/6/28 | 2932.939941 | 2943.979980 | 2929.050049 | 2941.760010 | 2941.760010 | 5420700000 | 0.575745 |
| 2019/7/1 | 2971.409912 | 2977.929932 | 2952.219971 | 2964.330078 | 2964.330078 | 3513270000 | 0.767230 |


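# Finding All Days on Which the Price Rose

Create a column whose header is "是否上漲" ("did it rise?"). Yes, column headers can be written in Chinese.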

```python
df["是否上漲"] = df["daily return"] > 0
df
```

In [45]:
df["是否上漲"] = df["daily return"] > 0
df

| Date | Open | High | Low | Close | Adj Close | Volume | daily return | 是否上漲 |
|:--:|:--:|:--:|:--:|:--:|:--:|:--:|:--:|:--:|
| 2018/7/2 | 2704.949951 | 2727.260010 | 2698.949951 | 2726.709961 | 2726.709961 | 3073650000 |  | False |
| 2018/7/3 | 2733.270020 | 2736.580078 | 2711.159912 | 2713.219971 | 2713.219971 | 1911470000 | -0.494735 | False |
| 2018/7/5 | 2724.189941 | 2737.830078 | 2716.020020 | 2736.610107 | 2736.610107 | 2953420000 | 0.862080 | True |
| 2018/7/6 | 2737.679932 | 2764.409912 | 2733.520020 | 2759.820068 | 2759.820068 | 2554780000 | 0.848128 | True |
| 2018/7/9 | 2775.620117 | 2784.649902 | 2770.729980 | 2784.169922 | 2784.169922 | 3050040000 | 0.882299 | True |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 2019/6/26 | 2926.070068 | 2932.590088 | 2912.989990 | 2913.780029 | 2913.780029 | 3478130000 | -0.123393 | False |
| 2019/6/27 | 2919.659912 | 2929.300049 | 2918.570068 | 2924.919922 | 2924.919922 | 3122920000 | 0.382318 | True |
| 2019/6/28 | 2932.939941 | 2943.979980 | 2929.050049 | 2941.760010 | 2941.760010 | 5420700000 | 0.575745 | True |
| 2019/7/1 | 2971.409912 | 2977.929932 | 2952.219971 | 2964.330078 | 2964.330078 | 3513270000 | 0.767230 | True |


# Filtering for All Days on Which the Price Rose

```python
df[df["是否上漲"] == True]
```


In [46]:
df[df["是否上漲"] == True]

| Date | Open | High | Low | Close | Adj Close | Volume | daily return | 是否上漲 |
|:--:|:--:|:--:|:--:|:--:|:--:|:--:|:--:|:--:|
| 2018/7/5 | 2724.189941 | 2737.830078 | 2716.020020 | 2736.610107 | 2736.610107 | 2953420000 | 0.862080 | True |
| 2018/7/6 | 2737.679932 | 2764.409912 | 2733.520020 | 2759.820068 | 2759.820068 | 2554780000 | 0.848128 | True |
| 2018/7/9 | 2775.620117 | 2784.649902 | 2770.729980 | 2784.169922 | 2784.169922 | 3050040000 | 0.882299 | True |
| 2018/7/10 | 2788.560059 | 2795.580078 | 2786.239990 | 2793.840088 | 2793.840088 | 3063850000 | 0.347327 | True |
| 2018/7/12 | 2783.139893 | 2799.219971 | 2781.530029 | 2798.290039 | 2798.290039 | 2821690000 | 0.874904 | True |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 2019/6/19 | 2920.550049 | 2931.739990 | 2911.429932 | 2926.459961 | 2926.459961 | 3287890000 | 0.298516 | True |
| 2019/6/20 | 2949.600098 | 2958.060059 | 2931.500000 | 2954.179932 | 2954.179932 | 3905940000 | 0.947219 | True |
| 2019/6/27 | 2919.659912 | 2929.300049 | 2918.570068 | 2924.919922 | 2924.919922 | 3122920000 | 0.382318 | True |
| 2019/6/28 | 2932.939941 | 2943.979980 | 2929.050049 | 2941.760010 | 2941.760010 | 5420700000 | 0.575745 | True |


# Writing to Excel

```python
result_df = df[df["是否上漲"] == True]

result_df.to_excel("stock_report.xlsx")
```

Further reading: [official documentation](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.to_excel.html)

```python
result_df.to_excel(r"path/to/stock_report.xlsx", sheet_name="sheet name")
```

In [47]:
result_df = df[df["是否上漲"] == True]
result_df.to_excel("stock_report.xlsx")

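# Writing to Excel (the xlwings Way)

There is another way to get a DataFrame's data into Excel: xlwings.

Picture lining up the top-left corner of the DataFrame, treated as a virtual worksheet, with the target location in the Excel worksheet:

```python
import xlwings as xw
# create a new workbook
wb = xw.Book()
# select the first worksheet
sheet = wb.sheets[0]
# write the DataFrame into the range whose top-left corner is A1
sheet.range("A1").value = result_df
# set the worksheet name
sheet.name = "S&P 500 分析報告"
# save the file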
wb.save("stock_report_xw.xlsx")
```

In [48]:
import xlwings as xw
# create a new workbook
wb = xw.Book()

In [49]:
# select the first worksheet
sheet = wb.sheets[0]
# write the DataFrame into the range whose top-left corner is A1
sheet.range("A1").value = result_df
# set the worksheet name
sheet.name = "S&P 500 分析報告"
# save the file
wb.save("stock_report_xw.xlsx")

### or, xlwings has a view() function that will open a NEW excel page

In [50]:
xw.view(result_df)

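# pandas vs xlwings Output

The advantages of xlwings are:
1. The write location can be chosen freely
2. A DataFrame can be written into a workbook that already exists
3. The result is displayed live, which makes it easy to inspect the data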

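# rolling(): Rolling Window

`rolling(3)` means that, starting from the third row, each row is selected together with the two rows before it for a calculation:

```python
ser.rolling(3) 
```

From an Excel point of view:

![](https://drive.google.com/uc?export=download&id=1Dy2ikKKZkGfLKUlImVT1npcbXNhWN7zo)

After selecting, however, some calculation has to be applied to the selected data:

```python
ser.rolling(3).sum()
```

This adds up every three consecutive entries. From an Excel point of view: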

![](https://drive.google.com/uc?export=download&id=15waPZ_036b4mrkaq1RJuI2pgq037enAN)
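
The same idea on a five-element Series makes the windows easy to check by hand. This is a minimal sketch; the `min_periods=1` variant is an addition not covered above, shown only to illustrate how the leading `NaN` values can be avoided:

```python
import pandas as pd

ser = pd.Series([1, 2, 3, 4, 5])

# Windows: (1,2,3) -> 6, (2,3,4) -> 9, (3,4,5) -> 12.
# The first two rows have fewer than three values, so they are NaN.
rolled = ser.rolling(3).sum()

# Optional: accept shorter windows at the start instead of returning NaN
rolled_partial = ser.rolling(3, min_periods=1).sum()

print(rolled)
print(rolled_partial)
```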

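# Computing the 3-Day Moving Average

The following divides the sum of each selected group of three entries by 3

```python
df["Adj Close"].rolling(3).sum() / 3
```

Alternatively, use the mean() method to compute the average of each selected group of three entries directly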

```python
df["Adj Close"].rolling(3).mean()
```

In [51]:
df["Adj Close"].rolling(3).mean()

Date
2018/7/2             NaN
2018/7/3             NaN
2018/7/5     2725.513346
2018/7/6     2736.550049
2018/7/9     2760.200032
                ...     
2019/6/26    2925.503337
2019/6/27    2918.693278
2019/6/28    2926.819987
2019/7/1     2943.670003
2019/7/2     2956.216716
Name: Adj Close, Length: 252, dtype: float64

# Writing the 3-Day Moving Average into the DataFrame

```python
df["sma3d"] = df["Adj Close"].rolling(3).sum() / 3
df
```

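## apply()

**Applies** a calculation to the data of each row

With our current DataFrame as an example, **apply()** can be used to compute the 3-day moving average

1. First think through how the 3-day moving average is calculated, and write the calculation as a function:

```python
def sma_3d(prices):
    return sum(prices) / 3
```

2. Next, convert it into a lambda function:

```python
lambda prices : sum(prices) / 3
```

3. It can then be chained with apply():

```python
df["Adj Close"].rolling(3).apply(lambda prices : sum(prices) / 3)
```

This means that, starting from the third row, each row is selected together with the two rows before it and passed to the lambda function inside apply() for the calculation

The advantage of **apply()** is that it gives the user more freedom: however complex the calculation, as long as it can be wrapped in a function it can be run through apply(), and the syntax stays fairly concise  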
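
As an illustration of that flexibility, the sketch below computes something that has no dedicated rolling method: the range (maximum minus minimum) of each three-row window. The data is made up, and `raw=True` is used so that each window arrives as a NumPy array:

```python
import pandas as pd

# Made-up values, for illustration only
ser = pd.Series([3.0, 1.0, 4.0, 1.0, 5.0])

# Windows: (3,1,4) -> 3, (1,4,1) -> 3, (4,1,5) -> 4
window_range = ser.rolling(3).apply(lambda w: w.max() - w.min(), raw=True)

print(window_range)
```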

In [52]:
def sma_3d(prices):
    return sum(prices) / 3

In [53]:
lambda prices : sum(prices) / 3

<function __main__.<lambda>(prices)>

In [55]:
df["Adj Close"].rolling(3).apply(lambda prices : sum(prices) / 3, raw=True)

Date
2018/7/2             NaN
2018/7/3             NaN
2018/7/5     2725.513346
2018/7/6     2736.550049
2018/7/9     2760.200032
                ...     
2019/6/26    2925.503337
2019/6/27    2918.693278
2019/6/28    2926.819987
2019/7/1     2943.670003
2019/7/2     2956.216716
Name: Adj Close, Length: 252, dtype: float64

## Homework: Compute the 5-Day Weighted Moving Average

```python
import numpy as np

w5d = np.arange(1, 6)

ser1.rolling(5).apply(lambda _________ : _________ / _________, raw=True)
```

In [60]:
import numpy as np

w5d = np.arange(1, 6)
ser = df["Adj Close"]
ser.rolling(5).apply(lambda prices : sum(prices * w5d) / sum(w5d), raw=True) # 5-day weighted average: one value per five rows, the weighted sum divided by the sum of the weights

Date
2018/7/2             NaN
2018/7/3             NaN
2018/7/5             NaN
2018/7/6             NaN
2018/7/9     2754.874007
                ...     
2019/6/26    2928.637988
2019/6/27    2924.867969
2019/6/28    2928.661979
2019/7/1     2940.559342
2019/7/2     2950.601367
Name: Adj Close, Length: 252, dtype: float64
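
The weighted average can be verified by hand on a short series. In this sketch the prices are simply 1 to 6 (made up for easy arithmetic), and `np.dot` is used as an equivalent way to write the weighted sum. The most recent price in each window gets the largest weight, 5:

```python
import numpy as np
import pandas as pd

weights = np.arange(1, 6)  # 1, 2, 3, 4, 5; the weights sum to 15
prices = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])  # made-up prices

# First full window:  (1*1 + 2*2 + 3*3 + 4*4 + 5*5) / 15 = 55 / 15
# Second full window: (2*1 + 3*2 + 4*3 + 5*4 + 6*5) / 15 = 70 / 15
wma = prices.rolling(5).apply(
    lambda w: np.dot(w, weights) / weights.sum(), raw=True
)

print(wma)
```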