# 1112_DS_Lab3  Data Aggregation and Masking Applied to Stock Price Analysis

# Aggregations: Min, Max, and Everything In Between
Often when faced with a large amount of data, a first step is to compute summary statistics for the data in question. Perhaps the most common summary statistics are the mean and standard deviation, which allow you to summarize the "typical" values in a dataset, but other aggregates are useful as well (the sum, product, median, minimum and maximum, quantiles, etc.).

Most of the text is released under the [CC-BY-NC-ND license](https://creativecommons.org/licenses/by-nc-nd/3.0/us/legalcode), and the code is released under the [MIT license](https://opensource.org/licenses/MIT).
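
As a quick illustration of the summary statistics mentioned above, the following sketch computes several of them on a small made-up array. The values and the `summary` dictionary are assumptions for illustration only; they are not part of the lab data.

```python
import numpy as np

# Small hypothetical dataset (not from the lab data)
values = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

summary = {
    "mean": values.mean(),
    "std": values.std(),          # population standard deviation (ddof=0)
    "median": np.median(values),
    "min": values.min(),
    "max": values.max(),
    "sum": values.sum(),
}
for name, stat in summary.items():
    print(f"{name:>6}: {stat}")
```

Note that `std()` defaults to the population formula; pass `ddof=1` if the sample standard deviation is wanted.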

In [2]:
import numpy as np
A = np.random.random(100)
A

array([0.61481044, 0.58648878, 0.30336951, 0.50970139, 0.624682  ,
       0.01559369, 0.00809928, 0.98168359, 0.79731539, 0.32295485,
       0.98209752, 0.89777023, 0.51119627, 0.91235464, 0.94264963,
       0.24253224, 0.43855602, 0.48404386, 0.80220877, 0.15507692,
       0.5443272 , 0.27590314, 0.52983401, 0.43136337, 0.01974851,
       0.73006806, 0.46344401, 0.79706257, 0.76197812, 0.58693277,
       0.44104438, 0.194191  , 0.50619805, 0.14529923, 0.5462875 ,
       0.44037098, 0.84344508, 0.74416056, 0.37870655, 0.74442391,
       0.81059247, 0.98840715, 0.45661445, 0.40021283, 0.15475491,
       0.82952431, 0.81978391, 0.80559505, 0.93390455, 0.75846933,
       0.07635665, 0.22054207, 0.14172706, 0.48535513, 0.32078206,
       0.93043055, 0.88403272, 0.27624434, 0.08923067, 0.80262983,
       0.68439461, 0.14795202, 0.11314887, 0.62969212, 0.65610244,
       0.0447645 , 0.46264535, 0.85656107, 0.82132091, 0.37594946,
       0.31065618, 0.11386156, 0.06336796, 0.60141256, 0.99699

In [8]:
print(type(sum(A)))
print(type(np.sum(A)))
%timeit sum(A)
%timeit np.sum(A)

<class 'numpy.float64'>
<class 'numpy.float64'>
6.73 µs ± 165 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
3.08 µs ± 101 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)


In [9]:
big_array = np.random.rand(1000)
%timeit sum(big_array)
%timeit np.sum(big_array)

57.3 µs ± 1.79 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
3.33 µs ± 145 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
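
The `%timeit` magic only works inside IPython or Jupyter. A rough equivalent for a plain Python script, using the standard-library `timeit` module, might look like the sketch below. The seed and the repeat count of 1000 are arbitrary choices, and the absolute timings will differ from machine to machine.

```python
import timeit
import numpy as np

rng = np.random.default_rng(0)       # seeded so the data is repeatable
big_array = rng.random(1000)

# Time 1000 calls of each version; absolute numbers depend on the machine
t_builtin = timeit.timeit(lambda: sum(big_array), number=1000)
t_numpy = timeit.timeit(lambda: np.sum(big_array), number=1000)

print(f"built-in sum: {t_builtin / 1000 * 1e6:.2f} µs per call")
print(f"np.sum:       {t_numpy / 1000 * 1e6:.2f} µs per call")

# Both compute the same total, up to floating-point rounding
print(np.isclose(sum(big_array), np.sum(big_array)))
```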


In [72]:
np.min(big_array), np.max(big_array)  # faster than the built-in min and max

0.001281347965530033
0.9994998633803951


(0.001281347965530033, 0.9994998633803951)

In [73]:
M = np.random.random((3, 4))
print(type(M))
print(M)

<class 'numpy.ndarray'>
[[0.33261755 0.82963776 0.23498221 0.30203376]
 [0.32787547 0.37595042 0.40113254 0.32311476]
 [0.64659259 0.60622349 0.76266185 0.24823788]]


In [56]:
%timeit sum(sum(M))
%timeit np.sum(M)

3.58 µs ± 517 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
3.1 µs ± 144 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)


In [75]:
lst = np.array([1, 2, 3, 4, 5, 6])
lst = lst.reshape(2, 3)
print(lst)

[[1 2 3]
 [4 5 6]]


## Aggregating Along an Axis
Aggregation functions take an additional argument specifying the axis along which the aggregate is computed. For example, we can find the minimum value within each column by specifying ``axis=0``. The ``axis`` keyword specifies the dimension of the array that will be collapsed, rather than the dimension that will be returned. So specifying ``axis=0`` means that the first axis will be collapsed: for two-dimensional arrays, this means that values within each column will be aggregated.

![AXIS example](axis.jpg)

In [87]:
M.sum(axis=0)

array([1.30708561, 1.81181168, 1.3987766 , 0.8733864 ])

In [88]:
M.sum(axis=1)

array([1.69927128, 1.42807319, 2.26371581])
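
Because `M` is random, the sums above change on every run. The sketch below applies the same axis rule to fixed values (the same 2×3 numbers as the `lst` array earlier; the name `grid` is only illustrative) so the result can be checked by hand.

```python
import numpy as np

grid = np.array([[1, 2, 3],
                 [4, 5, 6]])

col_sums = grid.sum(axis=0)   # collapses the rows: one value per column (5, 7, 9)
row_sums = grid.sum(axis=1)   # collapses the columns: one value per row (6, 15)

print(col_sums)
print(row_sums)
print(grid.min(axis=0), grid.max(axis=1))
```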

The following table provides a list of useful aggregation functions available in NumPy:

|Function Name      |   NaN-safe Version  | Description                                   |
|-------------------|---------------------|-----------------------------------------------|
| ``np.sum``        | ``np.nansum``       | Compute sum of elements                       |
| ``np.prod``       | ``np.nanprod``      | Compute product of elements                   |
| ``np.mean``       | ``np.nanmean``      | Compute mean of elements                      |
| ``np.std``        | ``np.nanstd``       | Compute standard deviation                    |
| ``np.var``        | ``np.nanvar``       | Compute variance                              |
| ``np.min``        | ``np.nanmin``       | Find minimum value                            |
| ``np.max``        | ``np.nanmax``       | Find maximum value                            |
| ``np.argmin``     | ``np.nanargmin``    | Find index of minimum value                   |
| ``np.argmax``     | ``np.nanargmax``    | Find index of maximum value                   |
| ``np.median``     | ``np.nanmedian``    | Compute median of elements                    |
| ``np.percentile`` | ``np.nanpercentile``| Compute rank-based statistics of elements     |
| ``np.any``        | N/A                 | Evaluate whether any elements are true        |
| ``np.all``        | N/A                 | Evaluate whether all elements are true        |
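
To show what the NaN-safe column of the table means in practice, here is a minimal sketch on a made-up array containing one missing value. The array `prices` is hypothetical.

```python
import numpy as np

prices = np.array([10.0, np.nan, 30.0, 20.0])   # hypothetical values with one gap

print(np.sum(prices))        # nan: a single NaN propagates through the ordinary aggregate
print(np.nansum(prices))     # 60.0: NaN entries are ignored
print(np.nanmean(prices))    # 20.0: mean of the three valid entries
print(np.nanargmax(prices))  # 2: index of the largest valid entry
```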



In [90]:
import pandas as pd
import numpy as np
import requests

In [91]:
date = "20230320"
url = f'https://www.twse.com.tw/exchangeReport/MI_INDEX?response=json&date={date}&type=ALLBUT0999'
response = requests.get(url)
response_json = response.json()
stockdata = pd.DataFrame(response_json['data9'], columns=response_json['fields9'])
origin = stockdata.copy()
stockdata.head()

{'data4': [['寶島股價報酬指數', '24,837.62', "<p style ='color:green'>-</p>", '34.14', '-0.14', ''], ['發行量加權股價報酬指數', '32,114.54', "<p style ='color:green'>-</p>", '68.73', '-0.21', ''], ['臺灣公司治理100報酬指數', '12,108.39', "<p style ='color:green'>-</p>", '69.45', '-0.57', ''], ['臺灣50報酬指數', '24,732.34', "<p style ='color:green'>-</p>", '169.09', '-0.68', ''], ['臺灣50權重上限30%報酬指數', '23,574.17', "<p style ='color:green'>-</p>", '124.45', '-0.53', ''], ['臺灣中型100報酬指數', '26,809.80', "<p style ='color:red'>+</p>", '129.08', '0.48', ''], ['臺灣資訊科技股報酬指數', '40,803.73', "<p style ='color:green'>-</p>", '265.49', '-0.65', ''], ['臺灣發達報酬指數', '19,157.21', "<p style ='color:red'>+</p>", '10.17', '0.05', ''], ['臺灣高股息報酬指數', '17,888.84', "<p style ='color:green'>-</p>", '46.66', '-0.26', ''], ['臺灣就業99報酬指數', '14,762.67', "<p style ='color:green'>-</p>", '21.60', '-0.15', ''], ['臺灣高薪100報酬指數', '12,050.17', "<p style ='color:green'>-</p>", '41.69', '-0.34', ''], ['未含金融電子報酬指數', '31,542.57', "<p style ='color:red'>+</p>", '37

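| | 證券代號 | 證券名稱 | 成交股數 | 成交筆數 | 成交金額 | 開盤價 | 最高價 | 最低價 | 收盤價 | 漲跌(+/-) | 漲跌價差 | 最後揭示買價 | 最後揭示買量 | 最後揭示賣價 | 最後揭示賣量 | 本益比 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 50 | 元大台灣50 | 7527505 | 12258 | 889553151 | 118.95 | 119.05 | 117.85 | 118.1 | ``<p style= color:green>-</p>`` | 0.85 | 118.05 | 8 | 118.1 | 290 | 0.0 |
| 1 | 51 | 元大中型100 | 54091 | 161 | 3004569 | 55.3 | 55.75 | 55.3 | 55.6 | ``<p style= color:red>+</p>`` | 0.2 | 55.5 | 2 | 55.7 | 3 | 0.0 |
| 2 | 52 | 富邦科技 | 182010 | 301 | 19229478 | 106.1 | 106.15 | 105.4 | 105.65 | ``<p style= color:green>-</p>`` | 0.6 | 105.65 | 21 | 105.7 | 1 | 0.0 |
| 3 | 53 | 元大電子 | 14110 | 1010 | 831554 | 58.85 | 59.05 | 58.85 | 58.95 | ``<p> </p>`` | 0.0 | 58.6 | 36 | 59.05 | 2 | 0.0 |
| 4 | 55 | 元大MSCI金融 | 258280 | 479 | 5500399 | 21.44 | 21.44 | 21.21 | 21.26 | ``<p style= color:green>-</p>`` | 0.14 | 21.28 | 3 | 21.31 | 9 | 0.0 |

The columns used in the rest of this lab are ``開盤價`` (opening price) and ``收盤價`` (closing price).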


In [108]:
dayprice = np.array(stockdata['收盤價'])   # equivalently: dayprice = np.array(stockdata.收盤價)
print(dayprice)

['118.10' '55.60' '105.65' ... '11.90' '25.50' '99.50']


In [292]:
data = {1: [1, 2, 3, 4, 5, 6], 3: [3, 6, 14, 21, 8, 18]}
data_json = pd.DataFrame(data, index=[1, 3, 6, 21, 8, 18])
data_series = data_json[[1, 3]]    # select columns by label
print(data_json)
print(data_series.iloc[1:3])       # rows at positions 1 and 2, matching the output below

    1   3
1   1   3
3   2   6
6   3  14
21  4  21
8   5   8
18  6  18
   1   3
3  2   6
6  3  14


In [269]:
a = np.array(stockdata.收盤價[0:9])
print(a)

['118.10' '55.60' '105.65' '58.95' '21.26' '27.76' '85.70' '18.85' '56.90']


In [262]:
print(len(a))
print(type(a))

9
<class 'numpy.ndarray'>


In [27]:
a = a.astype(float)  # np.float was removed in NumPy 1.24; the built-in float is equivalent
print(a)

[118.1   55.6  105.65  58.95  21.26  27.76  85.7   18.85  56.9 ]


In [28]:
print(a.mean())

60.974444444444444


In [29]:
print("Mean closing price:   ", a.mean())
print("Standard deviation:   ", a.std())
print("Minimum closing price:", a.min())
print("Maximum closing price:", a.max())
print("25th percentile:      ", np.percentile(a, 25))
print("Median:               ", np.median(a))
print("75th percentile:      ", np.percentile(a, 75))

Mean closing price:    60.974444444444444
Standard deviation:    33.95270701402536
Minimum closing price: 18.85
Maximum closing price: 118.1
25th percentile:       27.76
Median:                56.9
75th percentile:       85.7
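
`np.percentile` also accepts a sequence of percentiles, so the three quartiles can be computed in one call. The sketch below re-enters the nine closing prices printed above so that it runs on its own.

```python
import numpy as np

# The nine closing prices printed above, re-entered so the block is self-contained
a = np.array([118.1, 55.6, 105.65, 58.95, 21.26, 27.76, 85.7, 18.85, 56.9])

# One call returns the 25th, 50th, and 75th percentiles together
q25, q50, q75 = np.percentile(a, [25, 50, 75])
print(q25, q50, q75)
```

With nine values, each of these percentiles falls exactly on a data point, so no interpolation between neighbours takes place.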


In [31]:
# dayprice still holds strings at this point, so the first aggregate already fails
print("Mean closing price:   ", dayprice.mean())
print("Standard deviation:   ", dayprice.std())
print("Minimum closing price:", dayprice.min())
print("Maximum closing price:", dayprice.max())
print("25th percentile:      ", np.percentile(dayprice, 25))
print("Median:               ", np.median(dayprice))
print("75th percentile:      ", np.percentile(dayprice, 75))

TypeError: unsupported operand type(s) for /: 'str' and 'int'
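
The error above comes from the element type rather than from the aggregation itself: the column holds Python strings, so summing concatenates them and the division by the element count fails. A minimal sketch with a hypothetical three-element stand-in (`raw` is not the real column) reproduces this and shows that converting to `float` first resolves it.

```python
import numpy as np

# Hypothetical stand-in for the raw closing-price column
raw = np.array(['118.10', '55.60', '105.65'], dtype=object)
print(raw.dtype)                 # object: each element is a Python str

try:
    raw.mean()                   # strings are concatenated, then divided by the count
except TypeError as err:
    print("TypeError:", err)

converted = raw.astype(float)    # parse every string as a number first
print(converted.mean())
```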

In [33]:
print(dayprice[0:40])

['118.10' '55.60' '105.65' '58.95' '21.26' '27.76' '85.70' '18.85' '56.90'
 '77.45' '31.55' '30.01' '24.66' '68.20' '--' '110.15' '5.38' '37.29'
 '4.53' '25.12' '21.15' '6.93' '14.68' '9.59' '12.64' '27.81' '9.65'
 '14.07' '14.66' '--' '25.75' '37.15' '50.55' '7.42' '12.16' '8.80'
 '27.80' '38.63' '8.47' '26.35']


In [34]:
b= np.array(stockdata.收盤價[0:40])

In [37]:
print(b)
print(b.size)

['118.10' '55.60' '105.65' '58.95' '21.26' '27.76' '85.70' '18.85' '56.90'
 '77.45' '31.55' '30.01' '24.66' '68.20' '--' '110.15' '5.38' '37.29'
 '4.53' '25.12' '21.15' '6.93' '14.68' '9.59' '12.64' '27.81' '9.65'
 '14.07' '14.66' '--' '25.75' '37.15' '50.55' '7.42' '12.16' '8.80'
 '27.80' '38.63' '8.47' '26.35']
40


In [38]:
b = b.astype(float)  # fails: '--' is a placeholder, not a number

ValueError: could not convert string to float: '--'

In [40]:
b_new=(b!='--')
print(b_new)

[ True  True  True  True  True  True  True  True  True  True  True  True
  True  True False  True  True  True  True  True  True  True  True  True
  True  True  True  True  True False  True  True  True  True  True  True
  True  True  True  True]


In [41]:
b1 = b[b_new].astype(float)  # keep only the entries where the mask is True, then convert

In [42]:
print(b1)
print(b1.size)

[118.1   55.6  105.65  58.95  21.26  27.76  85.7   18.85  56.9   77.45
  31.55  30.01  24.66  68.2  110.15   5.38  37.29   4.53  25.12  21.15
   6.93  14.68   9.59  12.64  27.81   9.65  14.07  14.66  25.75  37.15
  50.55   7.42  12.16   8.8   27.8   38.63   8.47  26.35]
38
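
Dropping the `'--'` entries shortens the array from 40 to 38 elements, so positions no longer line up with the original rows. One possible alternative, sketched here on a hypothetical four-element sample, is to turn the placeholder into NaN and use the NaN-safe aggregates from the table above. This is an illustration, not the method used in the rest of the lab.

```python
import numpy as np

# Hypothetical sample in the same form as the raw column
b = np.array(['118.10', '55.60', '--', '110.15'], dtype=object)

# Replace the placeholder with NaN instead of removing it, then convert
b_float = np.where(b == '--', np.nan, b).astype(float)

print(b_float)                    # same length as b, with NaN in position 2
print(np.isnan(b_float).sum())    # number of missing prices
print(np.nanmean(b_float))        # mean over the valid prices only
```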


# Comparisons, Masks, and Boolean Logic

## Comparison Operators as ufuncs
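NumPy implements comparison operators such as ``<`` (less than) and ``>`` (greater than) as element-wise ufuncs. The result of these comparison operators is always an array with a Boolean data type. All six of the standard comparison operations are available: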

In [43]:
x = np.array([1, 2, 3, 4, 5])

In [44]:
x < 3  # less than

array([ True,  True, False, False, False])

In [45]:
x > 3  # greater than

array([False, False, False,  True,  True])

In [46]:
x <= 3  # less than or equal

array([ True,  True,  True, False, False])

In [47]:
x >= 3  # greater than or equal

array([False, False,  True,  True,  True])

In [48]:
x != 3  # not equal

array([ True,  True, False,  True,  True])

In [50]:
x == 3  # equal

array([False, False,  True, False, False])

In [51]:
(2 * x) == (x ** 2)

array([False,  True, False, False, False])
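
Each comparison operator is shorthand for a named NumPy ufunc. The sketch below checks that correspondence for all six operators on the same array; the list name `pairs` is only illustrative.

```python
import numpy as np

x = np.array([1, 2, 3, 4, 5])

# Each tuple pairs an operator expression with its ufunc equivalent
pairs = [
    (x == 3, np.equal(x, 3)),
    (x != 3, np.not_equal(x, 3)),
    (x < 3, np.less(x, 3)),
    (x <= 3, np.less_equal(x, 3)),
    (x > 3, np.greater(x, 3)),
    (x >= 3, np.greater_equal(x, 3)),
]
for op_result, ufunc_result in pairs:
    # The two forms agree element by element and both produce Boolean arrays
    print(np.array_equal(op_result, ufunc_result), ufunc_result.dtype)
```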

## Working with Boolean Arrays

Given a Boolean array, there are a host of useful operations you can do.
We'll work with ``x``, redefined below as a two-dimensional array.

In [52]:
x=np.arange(12).reshape(3,4)
print(x)

[[ 0  1  2  3]
 [ 4  5  6  7]
 [ 8  9 10 11]]


In [53]:
# how many values less than 6?
np.count_nonzero(x < 6)

6

In [55]:
# how many values less than 6 in each row?
np.sum(x<6 , axis=1)

array([4, 2, 0])

In [56]:
# are there any values greater than 8?
np.any(x > 8)

True

In [57]:
# are all values less than 12?
np.all(x < 12)

True

In [58]:
np.where(x==7)

(array([1], dtype=int64), array([3], dtype=int64))
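
Boolean arrays can also be combined and used directly as masks to select values, which is the technique applied to the stock prices in the next section. A minimal sketch on the same 3×4 array:

```python
import numpy as np

x = np.arange(12).reshape(3, 4)

# Combine two conditions element-wise; each comparison needs its own parentheses
mask = (x > 2) & (x < 9)

print(mask.sum())    # count of values strictly between 2 and 9
print(x[mask])       # the selected values, returned as a 1-D array
print(x[~mask])      # the complement: everything the mask excludes
```

Use `&`, `|`, and `~` here rather than the Python keywords `and`, `or`, and `not`, which do not operate element-wise on arrays.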

## Daily Stock Price Analysis

In [59]:
dayprice = np.array(stockdata.收盤價)

In [60]:
dayprice_yes=(dayprice!='--')
print(dayprice_yes[0:40])

[ True  True  True  True  True  True  True  True  True  True  True  True
  True  True False  True  True  True  True  True  True  True  True  True
  True  True  True  True  True False  True  True  True  True  True  True
  True  True  True  True]


In [62]:
dayprice = dayprice[dayprice_yes].astype(float)  # np.float was removed in NumPy 1.24

ValueError: could not convert string to float: '1,145.00'

In [63]:
np.where(dayprice=='1,145.00')

(array([316], dtype=int64),)

In [65]:
dayprice[316]

'1,145.00'

In [66]:
dayprice[316]='1145'

In [71]:
dayprice = dayprice[dayprice_yes].astype(float)

ValueError: could not convert string to float: '2,230.00'

In [72]:
np.where(dayprice=='2,230.00')

(array([707], dtype=int64),)

In [73]:
dayprice[707]='2230'

In [76]:
dayprice = dayprice[dayprice_yes].astype(float)

ValueError: could not convert string to float: '1,170.00'

In [77]:
np.where(dayprice=='1,170.00')

(array([787], dtype=int64),)

In [78]:
dayprice[787]='1170'

In [86]:
np.where(dayprice=='1,680.00')

(array([1008], dtype=int64),)

In [87]:
dayprice[1008]='1680'

In [88]:
dayprice = dayprice[dayprice_yes].astype(float)  # succeeds once every comma-formatted price is patched
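
Patching each comma-formatted price by index works, but it has to be repeated for every offending value. A possible one-pass alternative, sketched here on a hypothetical five-element sample rather than the real column, strips the thousands separators with pandas string methods and lets `pd.to_numeric` turn the `'--'` placeholder into NaN. The names `raw`, `cleaned`, and `prices` are illustrative.

```python
import numpy as np
import pandas as pd

# Hypothetical sample mimicking the raw column: a placeholder and comma-formatted prices
raw = pd.Series(['118.10', '--', '1,145.00', '2,230.00', '25.50'])

cleaned = pd.to_numeric(
    raw.str.replace(',', '', regex=False),  # drop thousands separators
    errors='coerce',                        # anything unparseable, such as '--', becomes NaN
)
prices = cleaned.dropna().to_numpy()        # keep only the valid prices as a float array

print(prices)
print(prices.mean(), prices.std())
```

On the real data the same two calls could be applied to `stockdata['收盤價']`, assuming the column contains no other non-numeric markers.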

In [89]:
print("Mean daily closing price:   ", dayprice.mean())
print("Standard deviation:         ", dayprice.std())
print("Minimum daily closing price:", dayprice.min())
print("Maximum daily closing price:", dayprice.max())
print("25th percentile:            ", np.percentile(dayprice, 25))
print("Median:                     ", np.median(dayprice))
print("75th percentile:            ", np.percentile(dayprice, 75))

Mean daily closing price:    66.0662564102564
Standard deviation:          135.84737531510365
Minimum daily closing price: 1.2
Maximum daily closing price: 2230.0
25th percentile:             17.84
Median:                      33.0
75th percentile:             65.125


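# Assignment 2 (Due 3/28): Using the data for 2023/03/21 (that is, ``date = "20230321"`` in the format used above), compute the mean and standard deviation of the market's opening and closing prices.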