## Learn numpy

[Reference: NumPy quickstart](https://numpy.org/doc/stable/user/quickstart.html#prerequisites)


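NumPy stores multidimensional arrays in the `ndarray` type. Each dimension of the array is called an axis, and `ndarray.shape` is a tuple giving the length of every axis. The number of axes is `ndarray.ndim`, which equals `len(ndarray.shape)`. For a 2-D array the number of rows is `ndarray.shape[0]`; `ndarray.size` is the total number of elements (a single integer), so it cannot be indexed.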
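
A minimal sketch of these attributes on a small array (the name `arr` is only an illustration). Note that `size` is a plain integer, so the row count comes from `shape[0]`:

```python
import numpy as np

arr = np.arange(12).reshape(3, 4)  # 3 rows, 4 columns

ndim = arr.ndim        # number of axes
shape = arr.shape      # length of each axis, as a tuple
size = arr.size        # total number of elements (a plain int)
n_rows = arr.shape[0]  # length of axis 0, i.e. the row count

print(ndim, shape, size, n_rows)  # 2 (3, 4) 12 3
```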


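axis=0: operate along axis 0. For a 2-D array this means moving down the rows, so each column is reduced. `numpy.sum(arr, axis=0)` computes the sum of each column.
axis=1: operate along axis 1. For a 2-D array this means moving across the columns, so each row is reduced. `numpy.sum(arr, axis=1)` computes the sum of each row.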
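
As a small illustration (the array below is made up for the example), the `axis` argument names the axis that gets collapsed:

```python
import numpy as np

arr = np.array([[1, 2, 3],
                [4, 5, 6]])

col_sums = np.sum(arr, axis=0)  # collapses axis 0 (rows): one sum per column
row_sums = np.sum(arr, axis=1)  # collapses axis 1 (columns): one sum per row

print(col_sums)  # [5 7 9]
print(row_sums)  # [ 6 15]
```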


Row-wise operations, i.e. operating on each row of the array separately, are done with axis=1, or with `hsplit` when splitting.


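`ndarray.view()` returns a shallow copy of the original array: a new array object that shares the original's data. Modifying values through the view changes the original, but reshaping the view does not change the original's shape.

`ndarray.copy()` returns a deep copy of the original array. Modifying its values does not affect the original, and reshaping it does not affect the original either.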
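
A short sketch of the difference, assuming a small 1-D array named `original` (an illustrative name). `np.shares_memory` is used here only to make the sharing visible:

```python
import numpy as np

original = np.arange(6)

v = original.view()
v[0] = 100                  # writes through to the shared data
reshaped = v.reshape(2, 3)  # a new shape for the view; original keeps its own

c = original.copy()
c[1] = -1                   # touches only the copy's own data

print(original[0], original[1])        # 100 1
print(original.shape, reshaped.shape)  # (6,) (2, 3)
print(np.shares_memory(original, v))   # True
print(np.shares_memory(original, c))   # False
```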



 


In [None]:
import numpy as np

# c = np.arange(16)
# print(c)
# cc = c.reshape(4, 4)
# print(cc)
# # print(np.sum(cc, axis=0))
# # print(np.sum(cc, axis=1))

# print(np.hsplit(cc, 2))
# print(np.vsplit(cc, 2))
# # Split horizontally: within each row, column indices [0,1), [1,3), [3,) each form one part
# print(np.hsplit(cc, (1, 3)))

a= np.arange(8)
print('a:', a)
mask = [1, 3, -1]
print('a[1], a[3], a[-1]:', a[mask])

str_array = np.array([['a', 'aa', 'aaa'], ['b', 'bb', 'bbb'], ['c', 'cc', 'ccc']])
# Row indices
row = np.array([[0, 0], [2, 1]])
# Column indices
column = np.array([[0, 2], [0, 2]])
# Elements at the same position in row and column pair up to form one index,
# i.e. str_array[0, 0], str_array[0, 2] form the first row of the result
#      str_array[2, 0], str_array[1, 2] form the second row
print(str_array[row, column])
# Modify element [0, 0]
str_array[0, 0] = '000'
print(str_array)
# Replace every element selected by the index arrays
# str_array[row, column] = '111'
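
The pairing rule for index arrays is easier to check on numbers. In this sketch (with an illustrative array named `grid`), `picked[i, j]` equals `grid[rows[i, j], cols[i, j]]`, and assigning through the same index arrays overwrites exactly those elements:

```python
import numpy as np

grid = np.arange(9).reshape(3, 3)  # grid[i, j] == 3*i + j
rows = np.array([[0, 0], [2, 1]])
cols = np.array([[0, 2], [0, 2]])

picked = grid[rows, cols]  # result has the shape of the index arrays
print(picked)
# [[0 2]
#  [6 5]]

grid[rows, cols] = -1      # overwrite the four selected elements
print(grid)
# [[-1  1 -1]
#  [ 3  4 -1]
#  [-1  7  8]]
```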

### numpy array split

In [None]:
import matplotlib.pyplot as plt
from PIL import Image
image = Image.open('data/lenna.jpeg')
image_array = np.array(image)
print(image_array.shape)
# Split the image array along its width (axis 1) into column ranges [0, 180), [180, 360), [360, )
hs1, hs2, hs3 = np.hsplit(image_array, (180, 360) )
# 2x4 grid of subplots; draw the original image at row 1, column 1
plt.subplot(2,4,1), plt.imshow(image_array)
# 2x4 grid of subplots; draw the first horizontal piece at row 1, column 2
plt.subplot(2,4,2), plt.imshow(hs1), plt.title("H Left")
plt.subplot(2,4,3), plt.imshow(hs2),plt.title("H Middle")
plt.subplot(2,4,4), plt.imshow(hs3),plt.title("H Right")

# Split the image array along its height (axis 0) into row ranges [0, 160), [160, 320), [320, )
vs1, vs2, vs3 = np.vsplit(image_array, (160, 320) )
# 2x4 grid of subplots; draw the first vertical piece at row 2, column 2
plt.subplot(2,4,6), plt.imshow(vs1),plt.title("V Top")
plt.subplot(2,4,7), plt.imshow(vs2),plt.title("V Middle")
plt.subplot(2,4,8), plt.imshow(vs3),plt.title("V Bottom")

# Pieces produced by a vertical split share the same width, so they can be concatenated along axis=0 (vertically)
plt.subplot(2, 4, 5), plt.imshow(np.concatenate((vs3, vs1), axis=0)), plt.title(" vs3 & vs1")
# Concatenating them horizontally (axis=1) raises an error, because the pieces differ in height
# plt.subplot(2, 4, 5), plt.imshow(np.concatenate((vs1, vs3), axis=1))

plt.show()



In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

countries = pd.read_csv("data/countries_of_the_world.csv", decimal= ',')
countries.head()
birthrates = countries['Birthrate'].values
birthrates = birthrates[~np.isnan(birthrates)]


In [14]:
np.mean(birthrates)

22.114732142857147

In [15]:
np.count_nonzero(birthrates[birthrates< 10])

24

In [21]:
a = np.sort(birthrates)
a[(a > 10) & (a < 11)][:20]

array([10.02, 10.04, 10.06, 10.21, 10.22, 10.27, 10.38, 10.41, 10.45,
       10.65, 10.7 , 10.71, 10.72, 10.74, 10.78, 10.9 ])

In [30]:
ar = np.arange(12).reshape(3, 4)
print(ar[ar < 8])
print(np.sum(ar < 8, axis=1))

[0 1 2 3 4 5 6 7]
[4 4 0]
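
Summing a boolean mask counts its `True` entries, and the `axis` argument decides whether the count is per column or per row. A minimal sketch on the same 3x4 array (the names `mask`, `per_column`, `per_row` and `total` are illustrative):

```python
import numpy as np

ar = np.arange(12).reshape(3, 4)
mask = ar < 8                      # boolean array, same shape as ar

per_column = np.sum(mask, axis=0)  # number of True values in each column
per_row = np.sum(mask, axis=1)     # number of True values in each row
total = np.count_nonzero(mask)     # number of True values overall

print(per_column)  # [2 2 2 2]
print(per_row)     # [4 4 0]
print(total)       # 8
```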


In [39]:
import numpy as np

names = ['apple', 'bear', 'orange', 'melon']
ids = [1, 2, 3, 4]
scores = [78.2, 57.50, 70, 90]
# U16 : A unicode string of 16 characters
# i4 : An integer of 4 bytes (int32)
# f8 : A float of 8 bytes (float64)
data = np.zeros(4, dtype={'names': ('Name', 'ID', 'Score'), 'formats': ('U16', 'i4', 'f8')})
data['Name'] = names
data['ID'] = ids
data['Score'] = scores

data[1]

('bear', 2, 57.5)
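
Fields of a structured array can also drive boolean selection. The sketch below rebuilds the same records with an equivalent list-of-tuples dtype (an assumption made so the example is self-contained) and picks the names whose score is above 75:

```python
import numpy as np

# Same field layout as above, written as a list of (name, format) pairs.
dtype = np.dtype([('Name', 'U16'), ('ID', 'i4'), ('Score', 'f8')])
records = np.array([('apple', 1, 78.2), ('bear', 2, 57.5),
                    ('orange', 3, 70.0), ('melon', 4, 90.0)], dtype=dtype)

# Boolean mask built from one field, then a second field read from the result.
high = records[records['Score'] > 75]['Name']
print(high)  # ['apple' 'melon']
```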

In [6]:
import numpy as np

x = np.array([1, 2, 3, 4, 5])
y = np.array([1, 2, 3, 4, 5])
x*y
x*10

ones = np.ones((3, 4))
ones

array([[1., 1., 1., 1.],
       [1., 1., 1., 1.],
       [1., 1., 1., 1.]])

In [12]:
import numpy as np
ax = np.random.randint(100, 200, size=20)
print(np.hsplit(ax, (7, 17)))

[array([140, 190, 188, 172, 196, 147, 179]), array([148, 132, 182, 146, 106, 199, 153, 134, 129, 177]), array([109, 131, 163])]


In [16]:
import numpy as np
a = np.arange(10)
a[[[4, 5], [1, 2]]]


array([[4, 5],
       [1, 2]])

# AI related libraries
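TensorFlow, developed by Google, is used to build machine learning and deep learning models. Keras is a higher-level API for building models on top of TensorFlow.

Metaflow and MLflow are used to manage ML workflows, including retraining models.

TensorFlow Serving and Kubernetes services are used to deploy models.

Tools such as Tableau, Plotly, Dash, and Greenlight are used to visualize data. Plotly and Dash are Python libraries, while Tableau is a standalone product.

Scikit-learn provides many tools for machine learning and statistical modeling, such as classification, regression, clustering, dimensionality reduction, and feature selection.

NumPy is a scientific computing library for Python, mainly used for array and matrix operations.

Theano is a library for defining, optimizing, and evaluating mathematical expressions that involve multidimensional arrays.

NLTK is a toolkit for natural language processing.

SciPy is a Python library for scientific computing. Its features include optimization, linear algebra, statistics, computational geometry, signal processing, and image processing.

Pandas is a Python library for working with datasets, mainly for reading and writing data in many formats and for data processing and analysis.

OpenCV is a library for image processing.

ROS is an open-source robot operating system, mainly used for programming robots.

PyTorch is a deep learning framework developed by Facebook, built on the Torch library.

CNTK, developed by Microsoft, is used for building machine learning models, data science, IoT, pattern recognition, and so on.

MXNet is backed by Amazon.

Create ML and Turi are developed by Apple.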

# data analysis
Preparation before building a model:
handle missing values; understand how the data is distributed; check that the data is correct.
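
A rough sketch of these three checks with pandas, run on a made-up DataFrame (`toy` and its columns are hypothetical, not the dataset used below):

```python
import numpy as np
import pandas as pd

# Hypothetical data with a couple of missing entries.
toy = pd.DataFrame({
    'age': [25, 31, np.nan, 47],
    'score': [88.0, 92.5, 79.0, np.nan],
    'group': ['a', 'b', 'a', 'b'],
})

missing_per_column = toy.isnull().sum()       # missing values per column
summary = toy.describe()                      # distribution of numeric columns
dtypes = toy.dtypes                           # column types, a basic sanity check
group_counts = toy['group'].value_counts()    # distribution of a categorical column

print(missing_per_column)
print(summary)
print(dtypes)
print(group_counts)
```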

In [1]:
import pandas as pd
import plotly.express as px


df = pd.read_csv("data/StudentsPerformance.csv")
df.columns = ['gender', 'ethnicity', 'parental_level_of_education','lunch','test_preparation_course','math','reading','writing']
df.head()

| | gender | ethnicity | parental_level_of_education | lunch | test_preparation_course | math | reading | writing |
|---|---|---|---|---|---|---|---|---|
| 0 | female | group B | bachelor's degree | standard | none | 72 | 72 | 74 |
| 1 | female | group C | some college | standard | completed | 69 | 90 | 88 |
| 2 | female | group B | master's degree | standard | none | 90 | 95 | 93 |
| 3 | male | group A | associate's degree | free/reduced | none | 47 | 57 | 44 |
| 4 | male | group C | some college | standard | none | 76 | 78 | 75 |


In [18]:
df.dtypes

gender                         object
ethnicity                      object
parental_level_of_education    object
lunch                          object
test_preparation_course        object
math                            int64
reading                         int64
writing                         int64
dtype: object

In [34]:
print(df.gender.value_counts(),"\n\n",
      df.lunch.value_counts(),"\n\n",
      df.ethnicity.value_counts(),"\n\n",
      df.parental_level_of_education.value_counts(),"\n\n",
      df.test_preparation_course.value_counts(),
     sep='')

gender
female    518
male      482
Name: count, dtype: int64

lunch
standard        645
free/reduced    355
Name: count, dtype: int64

ethnicity
group C    319
group D    262
group B    190
group E    140
group A     89
Name: count, dtype: int64

parental_level_of_education
some college          226
associate's degree    222
high school           196
some high school      179
bachelor's degree     118
master's degree        59
Name: count, dtype: int64

test_preparation_course
none         642
completed    358
Name: count, dtype: int64


In [35]:
df.describe()

| | math | reading | writing |
|---|---|---|---|
| count | 1000.0 | 1000.0 | 1000.0 |
| mean | 66.089 | 69.169 | 68.054 |
| std | 15.16308 | 14.600192 | 15.195657 |
| min | 0.0 | 17.0 | 10.0 |
| 25% | 57.0 | 59.0 | 57.75 |
| 50% | 66.0 | 70.0 | 69.0 |
| 75% | 77.0 | 79.0 | 79.0 |
| max | 100.0 | 100.0 | 100.0 |


In [15]:
gender = pd.crosstab(index=df['gender'], columns='count').reset_index()
fig = px.bar(gender, x='gender', y='count')
fig.show()

In [3]:
ethnicity = pd.crosstab(index=df['ethnicity'], columns='count').reset_index()
fig = px.bar(ethnicity, x='ethnicity', y='count')
fig.show()

In [6]:
fig = px.bar(df, x='gender', color='ethnicity', barmode='group')
fig.show()

In [8]:
fig = px.bar(df,
             x= 'gender',
             color='ethnicity',
             template='plotly_dark', 
             barmode='group',
             category_orders={'ethnicity':["group A","group B","group C","group D","group E"]},
             title= "Ethnicity Distribution on Gender")
fig.show()

In [9]:
fig = px.scatter(df,
                 x='math',
                 y='reading', 
                 color ='gender',
                 template='plotly_white',
                 title="Does a certain gender excel in certain subjects?")
fig.show()

In [10]:
fig = px.scatter(df,
                 x='math',
                 y='reading', 
                 color ='gender',
                 marginal_x='histogram',
                 marginal_y='histogram',
                 template='plotly_white',
                 title="Does a certain gender excel in certain subjects?")
fig.show()

In [11]:
fig = px.box(df,
             x='gender', 
             y='math',
             template='plotly_white')
fig.show()

In [12]:
fig = px.box(df,
             x='ethnicity', 
             y='math',
             template='plotly_white', 
             category_orders={'ethnicity':["group A","group B","group C","group D","group E"]},
             title="Is there a specific ethnicity that is better at math?",
             notched=True)
fig.show()

In [13]:
fig = px.box(df,
             x='ethnicity', 
             y='math',
             color = 'gender',
             template='plotly_white', 
             notched=True,
             category_orders={'ethnicity':["group A","group B","group C","group D","group E"]},
             facet_col = 'gender',
             title="Is there a specific ethnicity and gender that is better at math?")
fig.show()

# data preprocessing

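If the data contains NaN values (missing data), either find the rows that contain them and drop those rows, or replace the NaN values with the mean or median.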
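
Both options can be sketched with plain pandas on a made-up frame (`toy` and its values are hypothetical). `fillna` is shown here as one possible way to substitute the mean or median:

```python
import numpy as np
import pandas as pd

# Hypothetical data: one NaN in each column.
toy = pd.DataFrame({'glucose': [148.0, np.nan, 183.0, 89.0],
                    'bmi': [33.6, 26.6, np.nan, 28.1]})

dropped = toy.dropna(axis=0)               # option 1: drop rows containing NaN
filled_mean = toy.fillna(toy.mean())       # option 2a: fill with column means
filled_median = toy.fillna(toy.median())   # option 2b: fill with column medians

print(len(dropped))                      # 2
print(filled_mean.loc[1, 'glucose'])     # 140.0
print(filled_median.loc[1, 'glucose'])   # 148.0
```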

In [16]:
import numpy as np
import pandas as pd
df = pd.read_csv('data/health_data.csv')
df.head()

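| | Unnamed: 0 | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | Outcome |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 6 | 148.0 | 72.0 | 35.0 | NaN | 33.6 | 0.627 | 5 | 1 |
| 1 | 1 | 1 | 85.0 | 66.0 | 29.0 | NaN | 26.6 | 0.351 | 31 | 0 |
| 2 | 2 | 8 | 183.0 | 64.0 | NaN | NaN | 23.3 | 0.672 | 32 | 1 |
| 3 | 3 | 1 | 89.0 | 66.0 | 23.0 | 94.0 | 28.1 | 0.167 | 21 | 0 |
| 4 | 4 | 0 | 137.0 | 4.0 | 35.0 | 168.0 | 43.1 | 2.288 | 33 | 1 |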


In [19]:
df.isnull().sum().sort_values(ascending=False)

Insulin                     374
SkinThickness               227
BloodPressure                35
BMI                          11
Glucose                       5
Unnamed: 0                    0
Pregnancies                   0
DiabetesPedigreeFunction      0
Age                           0
Outcome                       0
dtype: int64

In [22]:
# drop NA values
df_no_missing = df.dropna(axis=0)
print(df_no_missing.head(5))

    Unnamed: 0  Pregnancies  Glucose  BloodPressure  SkinThickness  Insulin  \
3            3            1     89.0           66.0           23.0     94.0   
4            4            0    137.0            4.0           35.0    168.0   
6            6            3     78.0            5.0           32.0     88.0   
8            8            2    197.0            7.0           45.0    543.0   
13          13            1    189.0            6.0           23.0    846.0   

     BMI  DiabetesPedigreeFunction  Age  Outcome  
3   28.1                     0.167   21        0  
4   43.1                     2.288   33        1  
6   31.0                     0.248   26        1  
8    3.5                     0.158   53        1  
13   3.1                     0.398   59        1  


## Fill NaN with mean

In [30]:
from sklearn.impute import SimpleImputer

df1 = pd.read_csv('data/health_data.csv', na_values=['#NAME?'])
# Imputer to replace missing values with the mean (np.NAN was removed in NumPy 2.0; use np.nan)
imputer = SimpleImputer(missing_values=np.nan, strategy='mean', fill_value=None, copy=True)

imputer.fit(df1)
df1 = pd.DataFrame(data=imputer.transform(df1), columns=df1.columns)

#print
print(df1.head(5))

   Unnamed: 0  Pregnancies  Glucose  BloodPressure  SkinThickness     Insulin  \
0         0.0          6.0    148.0           72.0      35.000000  105.659898   
1         1.0          1.0     85.0           66.0      29.000000  105.659898   
2         2.0          8.0    183.0           64.0      25.876155  105.659898   
3         3.0          1.0     89.0           66.0      23.000000   94.000000   
4         4.0          0.0    137.0            4.0      35.000000  168.000000   

    BMI  DiabetesPedigreeFunction   Age  Outcome  
0  33.6                     0.627   5.0      1.0  
1  26.6                     0.351  31.0      0.0  
2  23.3                     0.672  32.0      1.0  
3  28.1                     0.167  21.0      0.0  
4  43.1                     2.288  33.0      1.0  


## Fill NaN with median

In [33]:
from sklearn.impute import SimpleImputer

# Load the data from CSV
df1 = pd.read_csv('data/health_data.csv', na_values=['#NAME?'])

# Imputer to replace missing values with the median
imputer = SimpleImputer(missing_values=np.nan, strategy='median', fill_value=None, copy=True)

imputer.fit(df1)
df1 = pd.DataFrame(data=imputer.transform(df1), columns=df1.columns)

#print
print(df1.head(5))

   Unnamed: 0  Pregnancies  Glucose  BloodPressure  SkinThickness  Insulin  \
0         0.0          6.0    148.0           72.0           35.0     71.0   
1         1.0          1.0     85.0           66.0           29.0     71.0   
2         2.0          8.0    183.0           64.0           27.0     71.0   
3         3.0          1.0     89.0           66.0           23.0     94.0   
4         4.0          0.0    137.0            4.0           35.0    168.0   

    BMI  DiabetesPedigreeFunction   Age  Outcome  
0  33.6                     0.627   5.0      1.0  
1  26.6                     0.351  31.0      0.0  
2  23.3                     0.672  32.0      1.0  
3  28.1                     0.167  21.0      0.0  
4  43.1                     2.288  33.0      1.0  


## Function to find outliers

In [35]:
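def find_outliers_tukey(x):
    """Return the index labels and values of x outside Tukey's fences (1.5 * IQR)."""
    q1 = x.quantile(.25)
    q3 = x.quantile(.75)
    iqr = q3 - q1
    floor = q1 - 1.5*iqr
    ceiling = q3 + 1.5*iqr
    outlier_indices = list(x.index[(x < floor) | (x > ceiling)])
    # Look the values up by label so non-default indexes also work
    outlier_values = list(x.loc[outlier_indices])
    return outlier_indices, outlier_values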

glucose_indices, glucose_values = find_outliers_tukey(df1['Glucose'])
print("Outliers for Glucose")
print(np.sort(glucose_values))

print("Outliers for Pregnancies")
pr_indices, pr_values = find_outliers_tukey(df1['Pregnancies'])
print(np.sort(pr_values))

print("Outliers for BloodPressure")
bp_indices, bp_values = find_outliers_tukey(df1['BloodPressure'])
print(np.sort(bp_values))


print("Outliers for SkinThickness")
st_indices, st_values = find_outliers_tukey(df1['SkinThickness'])
print(np.sort(st_values))

print("Outliers for Insulin")
in_indices, in_values = find_outliers_tukey(df1['Insulin'])
print(np.sort(in_values))

print("Outliers for BMI")
bmi_indices, bmi_values = find_outliers_tukey(df1['BMI'])
print(np.sort(bmi_values))

print("Outliers for DiabetesPedigreeFunction")
dpf_indices, dpf_values = find_outliers_tukey(df1['DiabetesPedigreeFunction'])
print(np.sort(dpf_values))

print("Outliers for Age")
age_indices, age_values = find_outliers_tukey(df1['Age'])
print(np.sort(age_values))

Outliers for Glucose
[]
Outliers for Pregnancies
[14. 14. 15. 17.]
Outliers for BloodPressure
[122.]
Outliers for SkinThickness
[ 1.  1.  1.  1.  1.  2.  2.  2.  2.  2.  2.  2.  2.  2.  2.  2.  2.  2.
  3.  3.  3.  3.  3.  3.  3.  3.  3.  3.  3.  3.  3.  3.  3.  3.  3.  3.
  3.  3.  3.  3.  3.  3.  3.  3.  3.  4.  4.  4.  4.  4.  4.  4.  4.  4.
  4.  4.  4.  4.  4.  4.  4.  5.  5.  5.  6. 48. 48. 48. 48. 49. 49. 49.
 51. 52. 52. 54. 54. 56. 63. 99.]
Outliers for Insulin
[  1.   1.   1.   1.   1.   1.   1.   2.   2.   2.   2.   3.   4.   4.
   5.   5.   5.   6.   6.   6.   7.   7.   7.   9.   9.   9.   9.  11.
  11.  11.  11.  11.  11.  12.  12.  12.  12.  12.  12.  12.  12.  13.
  13.  13.  13.  13.  13.  13.  13.  13.  14.  14.  14.  14.  14.  14.
  14.  14.  14.  14.  15.  15.  15.  15.  15.  15.  15.  15.  15.  15.
  15.  15.  15.  15.  16.  16.  16.  16.  16.  16.  16.  16.  17.  17.
  18.  18.  18.  18.  18.  18.  18.  18.  18.  18.  19.  19.  19.  19.
  21.  21.  21.  21.  21.  2
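
To see the Tukey rule on numbers that are easy to check by hand, here is a small sketch on a made-up Series (`values` is hypothetical). The quartiles are 11 and 12.5, so the fences are 8.75 and 14.75 and only the value 100 falls outside:

```python
import pandas as pd

# Hypothetical data: six ordinary values and one extreme value.
values = pd.Series([10, 12, 11, 13, 12, 11, 100])

q1, q3 = values.quantile(0.25), values.quantile(0.75)
iqr = q3 - q1
floor, ceiling = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Keep only the entries outside the fences.
outliers = values[(values < floor) | (values > ceiling)]

print(floor, ceiling)      # 8.75 14.75
print(outliers.tolist())   # [100]
```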

In [36]:
df_del = df.drop(bp_indices)
print(df_del.head(5))

   Unnamed: 0  Pregnancies  Glucose  BloodPressure  SkinThickness  Insulin  \
0           0            6    148.0           72.0           35.0      NaN   
1           1            1     85.0           66.0           29.0      NaN   
2           2            8    183.0           64.0            NaN      NaN   
3           3            1     89.0           66.0           23.0     94.0   
4           4            0    137.0            4.0           35.0    168.0   

    BMI  DiabetesPedigreeFunction  Age  Outcome  
0  33.6                     0.627    5        1  
1  26.6                     0.351   31        0  
2  23.3                     0.672   32        1  
3  28.1                     0.167   21        0  
4  43.1                     2.288   33        1  


## Replace with min

In [37]:
min_in = np.min(df_del['Insulin'])
df_del['Insulin'] = np.where(df_del['Insulin'] > 321, min_in, df_del['Insulin'])
print(df_del.head(5))


   Unnamed: 0  Pregnancies  Glucose  BloodPressure  SkinThickness  Insulin  \
0           0            6    148.0           72.0           35.0      NaN   
1           1            1     85.0           66.0           29.0      NaN   
2           2            8    183.0           64.0            NaN      NaN   
3           3            1     89.0           66.0           23.0     94.0   
4           4            0    137.0            4.0           35.0    168.0   

    BMI  DiabetesPedigreeFunction  Age  Outcome  
0  33.6                     0.627    5        1  
1  26.6                     0.351   31        0  
2  23.3                     0.672   32        1  
3  28.1                     0.167   21        0  
4  43.1                     2.288   33        1  


## Normalization and Reduction

In [38]:
from sklearn.decomposition import PCA
pca = PCA(n_components = 2)
pca.fit(df_del)

PCA(copy=True, n_components=2, whiten=False)

# Use a new name so the original df is not overwritten
reduced = pca.transform(df_del)

df_2d = pd.DataFrame(reduced)

df_2d.index = df_del.index

df_2d.columns = ['PC1', 'PC2']

df_2d.head(5)

ValueError: Input X contains NaN.
PCA does not accept missing values encoded as NaN natively. For supervised learning, you might want to consider sklearn.ensemble.HistGradientBoostingClassifier and Regressor which accept missing values encoded as NaNs natively. Alternatively, it is possible to preprocess the data, for instance by using an imputer transformer in a pipeline or drop samples with missing values. See https://scikit-learn.org/stable/modules/impute.html You can find a list of all estimators that handle NaN values at the following page: https://scikit-learn.org/stable/modules/impute.html#estimators-that-handle-nan-values
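
One way to get past this error, as the message suggests, is to impute before running PCA. The sketch below is an assumption about how that could look, not the notebook's own fix: it uses a made-up frame `toy` in place of `df_del`, and chains `SimpleImputer`, `StandardScaler` and `PCA` with `make_pipeline`:

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-in for df_del: numeric columns with a few NaN entries.
rng = np.random.default_rng(0)
toy = pd.DataFrame(rng.normal(size=(20, 4)), columns=['a', 'b', 'c', 'd'])
toy.iloc[[1, 5], 0] = np.nan
toy.iloc[3, 2] = np.nan

pipeline = make_pipeline(
    SimpleImputer(strategy='median'),  # fill NaN so PCA accepts the input
    StandardScaler(),                  # normalize each column
    PCA(n_components=2),               # reduce to two components
)
components = pipeline.fit_transform(toy)

df_2d = pd.DataFrame(components, index=toy.index, columns=['PC1', 'PC2'])
print(df_2d.shape)  # (20, 2)
```

Scaling before PCA keeps columns with large ranges from dominating the components; whether median imputation is appropriate depends on the data.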