# Statistics Fundamentals

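## Statistics is information derived from data, including descriptions, quantities, and measures.
### Dataset: the full collection of gathered data. It consists of observations (also called cases), which are the rows of the dataset.
### Attributes or features: the columns of a dataset or observation, usually denoted X.
### The columns of a dataset may also include the prediction target, usually denoted Y.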
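
A minimal sketch of these terms, assuming a small made-up table (the column names `hours_studied`, `attended`, and `score` are hypothetical and do not come from any dataset used in this notebook). Each row is one observation, the feature columns form X, and the target column is y.

```python
import pandas as pd

# Hypothetical toy dataset: 4 observations (rows), 3 columns
data = pd.DataFrame({
    "hours_studied": [1.0, 2.5, 4.0, 5.5],   # feature
    "attended":      [0, 1, 1, 1],           # feature
    "score":         [52, 61, 75, 88],       # target
})

X = data.drop("score", axis=1)   # features: every column except the target
y = data["score"]                # target

n_observations, n_features = X.shape
print("observations:", n_observations)
print("features:", list(X.columns))
print("target:", y.name)
```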

In [102]:
import seaborn as sns
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Fix garbled Chinese characters in plot labels
from matplotlib.font_manager import FontProperties
plt.rcParams['font.sans-serif'] = ['Arial Unicode MS'] 

## Restaurant tip prediction

In [51]:
df = sns.load_dataset('tips')
df.head(10)

,total_bill,tip,sex,smoker,day,time,size
0,16.99,1.01,Female,No,Sun,Dinner,2
1,10.34,1.66,Male,No,Sun,Dinner,3
2,21.01,3.5,Male,No,Sun,Dinner,3
3,23.68,3.31,Male,No,Sun,Dinner,2
4,24.59,3.61,Female,No,Sun,Dinner,4
5,25.29,4.71,Male,No,Sun,Dinner,4
6,8.77,2.0,Male,No,Sun,Dinner,2
7,26.88,3.12,Male,No,Sun,Dinner,4
8,15.04,1.96,Male,No,Sun,Dinner,2
9,14.78,3.23,Male,No,Sun,Dinner,2


## Questions:
### Observations = ?
### Features = ?
### Target = ?

## Regression prediction

In [52]:
df.sex.unique(), df.smoker.unique(), df.day.unique(), df.time.unique()

(['Female', 'Male']
 Categories (2, object): ['Male', 'Female'],
 ['No', 'Yes']
 Categories (2, object): ['Yes', 'No'],
 ['Sun', 'Sat', 'Thur', 'Fri']
 Categories (4, object): ['Thur', 'Fri', 'Sat', 'Sun'],
 ['Dinner', 'Lunch']
 Categories (2, object): ['Lunch', 'Dinner'])

In [53]:
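from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Convert categorical columns to numeric codes
df.sex = df.sex.map({'Female':0, 'Male':1})
df.smoker = df.smoker.map({'No':0, 'Yes':1})
df.day = df.day.map({'Thur':0, 'Fri':1, 'Sat':2, 'Sun':3})
df.time = df.time.map({'Lunch':0, 'Dinner':1})

# Define X and y
X = df.drop('tip', axis=1)
y = df['tip']

# Train/test split
# Note: an integer test_size is an absolute row count, so test_size=2 holds out only 2 rows;
# a float such as test_size=0.2 would hold out 20% of the data instead.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=2)

# Train the model
model = LinearRegression()
model.fit(X_train, y_train)

# Show the weights (w) and the bias (b)
print("Coefficient: ", model.coef_)
print("Intercept: ", model.intercept_)

# Predict on the test data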
y_pred = model.predict(X_test)
#print("R2: ", model.score(X_test, y_test))
print("R2: ", r2_score(y_test, y_pred))
print("MSE: ", mean_squared_error(y_test, y_pred))

Coefficient:  [ 0.09435819 -0.03685387 -0.07486677  0.05317029 -0.11131973  0.17496606]
Intercept:  0.7219986517179193
R2:  0.33990388026587737
MSE:  0.0158588092766123
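
The two metrics printed above can be reproduced by hand. The sketch below uses made-up values (`y_true` and `y_hat` are hypothetical, not taken from the tips model) to show the arithmetic: MSE is the mean of the squared residuals, and R2 is one minus the ratio of the residual sum of squares to the total sum of squares.

```python
import numpy as np

# Hypothetical true values and predictions (not from the tips model)
y_true = np.array([3.0, 2.0, 4.0, 5.0])
y_hat = np.array([2.5, 2.0, 4.5, 4.0])

# MSE: mean of the squared residuals
mse = np.mean((y_true - y_hat) ** 2)

# R2 = 1 - SS_res / SS_tot
ss_res = np.sum((y_true - y_hat) ** 2)          # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares
r2 = 1 - ss_res / ss_tot

print("MSE:", mse)
print("R2:", r2)
```

With these numbers the residual sum of squares is 1.5 and the total sum of squares is 5.0, so MSE comes to 0.375 and R2 to 0.7. `mean_squared_error` and `r2_score` should return the same values for the same inputs.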


## Penguin species classification

In [17]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Load dataset
df = pd.read_csv('./data/penguins.csv')
df.head(10)

,species,island,bill_length_mm,bill_depth_mm,flipper_length_mm,body_mass_g,sex
0,Adelie,Torgersen,39.1,18.7,181.0,3750.0,MALE
1,Adelie,Torgersen,39.5,17.4,186.0,3800.0,FEMALE
2,Adelie,Torgersen,40.3,18.0,195.0,3250.0,FEMALE
3,Adelie,Torgersen,,,,,
4,Adelie,Torgersen,36.7,19.3,193.0,3450.0,FEMALE
5,Adelie,Torgersen,39.3,20.6,190.0,3650.0,MALE
6,Adelie,Torgersen,38.9,17.8,181.0,3625.0,FEMALE
7,Adelie,Torgersen,39.2,19.6,195.0,4675.0,MALE
8,Adelie,Torgersen,34.1,18.1,193.0,3475.0,
9,Adelie,Torgersen,42.0,20.2,190.0,4250.0,


In [18]:
df.island.unique(), df.sex.unique(), df.species.unique()

(array(['Torgersen', 'Biscoe', 'Dream'], dtype=object),
 array(['MALE', 'FEMALE', nan], dtype=object),
 array(['Adelie', 'Chinstrap', 'Gentoo'], dtype=object))

In [19]:
# Drop rows with missing data
df = df.dropna()

# Convert categorical columns to numeric codes
df.island = df.island.map({'Torgersen':0, 'Biscoe':1, 'Dream':2})
df.sex = df.sex.map({'FEMALE':0, 'MALE':1})
df.species = df.species.map({'Adelie':0, 'Chinstrap':1, 'Gentoo':2})

# Define X and y
X = df.drop('species', axis=1)
y = df.species

# Train/test split
# Note: test_size=2 holds out only 2 rows, so the accuracy below is based on 2 predictions.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=2)

# Train the model
clf = LogisticRegression(max_iter=500)
clf.fit(X_train, y_train)

# Predict on the test data
y_pred = clf.predict(X_test)

# Accuracy
print("Accuracy: ", accuracy_score(y_test, y_pred))

Accuracy:  1.0
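
The splits above pass `test_size=2`. In scikit-learn an integer `test_size` is an absolute number of test rows, while a float is a fraction of the data, so a score measured on 2 rows says little about the model. The sketch below, on made-up arrays (`X_demo` and `y_demo` are hypothetical), shows the difference.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 50 observations, 2 features
X_demo = np.arange(100).reshape(50, 2)
y_demo = np.arange(50)

# Integer test_size: absolute number of test rows
_, X_test_int, _, _ = train_test_split(X_demo, y_demo, test_size=2, random_state=0)

# Float test_size: fraction of all rows
_, X_test_frac, _, _ = train_test_split(X_demo, y_demo, test_size=0.2, random_state=0)

print("test rows with test_size=2:  ", len(X_test_int))
print("test rows with test_size=0.2:", len(X_test_frac))
```

With 50 rows, `test_size=2` holds out 2 rows and `test_size=0.2` holds out 10.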


In [20]:
# Correlation between variables
df.corr()

,species,island,bill_length_mm,bill_depth_mm,flipper_length_mm,body_mass_g,sex
species,1.0,-0.009176,0.730548,-0.740346,0.850737,0.750434,0.010964
island,-0.009176,1.0,0.212038,0.189636,-0.162739,-0.201966,0.005834
bill_length_mm,0.730548,0.212038,1.0,-0.228626,0.653096,0.589451,0.344078
bill_depth_mm,-0.740346,0.189636,-0.228626,1.0,-0.577792,-0.472016,0.372673
flipper_length_mm,0.850737,-0.162739,0.653096,-0.577792,1.0,0.872979,0.255169
body_mass_g,0.750434,-0.201966,0.589451,-0.472016,0.872979,1.0,0.424987
sex,0.010964,0.005834,0.344078,0.372673,0.255169,0.424987,1.0
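
`df.corr()` computes the Pearson correlation coefficient by default. As a rough illustration on made-up numbers (the columns `a` and `b` are hypothetical, not from penguins.csv), the coefficient can be computed by hand and compared with the pandas result. For integer-coded nominal columns such as `species` or `island`, the coefficient depends on the arbitrary code assignment, so those entries are best read with caution.

```python
import numpy as np
import pandas as pd

# Hypothetical measurements (not from penguins.csv)
demo = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0, 5.0],
    "b": [2.0, 4.1, 5.9, 8.2, 9.8],
})

# Pearson r = sum(da * db) / sqrt(sum(da^2) * sum(db^2)),
# where da and db are deviations from the column means
da = demo["a"] - demo["a"].mean()
db = demo["b"] - demo["b"].mean()
r_manual = (da * db).sum() / np.sqrt((da ** 2).sum() * (db ** 2).sum())

# Same quantity from pandas
r_pandas = demo.corr().loc["a", "b"]

print("manual:", r_manual)
print("pandas:", r_pandas)
```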


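## Columns fall into two types by nature:
### 1. Qualitative
### 2. Quantitative

## Qualitative data is further divided into:
### 1. Ordinal data: values carry an implied magnitude or order.
### 2. Nominal data: values carry no implied magnitude or order.

## Quantitative data is further divided into:
### 1. Discrete data: takes separate, countable values rather than a continuous range.
### 2. Continuous data: can take any value within a range.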
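
One way to express these distinctions in pandas is sketched below. All values and names (`ratings`, `cities`, `n_children`, `height_cm`) are hypothetical. An ordered `Categorical` models ordinal data, an unordered one models nominal data, and integer versus float columns loosely correspond to discrete versus continuous data.

```python
import pandas as pd

# Ordinal: the categories have a natural order (low < medium < high)
ratings = pd.Series(pd.Categorical(["low", "high", "medium"],
                                   categories=["low", "medium", "high"],
                                   ordered=True))
print(ratings.cat.codes.tolist())  # integer codes follow the declared order
print(ratings.max())               # max is defined because the data is ordered

# Nominal: city names have no order
cities = pd.Series(pd.Categorical(["Taipei", "Tainan", "Taichung"]))
print(cities.cat.ordered)

# Quantitative: discrete counts vs. continuous measurements
n_children = pd.Series([0, 2, 1])            # discrete
height_cm = pd.Series([171.5, 160.2, 182.0])  # continuous
print(n_children.dtype, height_cm.dtype)
```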


## Quiz 1. In the restaurant tips dataset, which columns are quantitative? Which are ordinal? Which are nominal? Is the target column discrete or continuous?

## Handling nominal data
### Two or fewer categories: a simple mapping (for example, 0 and 1).
### More than two categories: one-hot encoding.
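
The reason for this rule can be sketched with a hypothetical three-category feature (the animal names are made up). Integer codes imply that some categories are closer together than others, while one-hot vectors keep every pair of categories equally far apart.

```python
import numpy as np

# Integer codes for a hypothetical nominal feature with three categories
codes = {"cat": 0, "dog": 1, "bird": 2}

# The codes suggest "bird" is farther from "cat" than "dog" is,
# although the categories have no such order
d_cat_dog = abs(codes["cat"] - codes["dog"])
d_cat_bird = abs(codes["cat"] - codes["bird"])

# One-hot vectors: one column per category
one_hot = {
    "cat":  np.array([1, 0, 0]),
    "dog":  np.array([0, 1, 0]),
    "bird": np.array([0, 0, 1]),
}

# Every pair of categories is now the same Euclidean distance apart
dist_cat_dog = np.linalg.norm(one_hot["cat"] - one_hot["dog"])
dist_cat_bird = np.linalg.norm(one_hot["cat"] - one_hot["bird"])

print("integer codes:", d_cat_dog, d_cat_bird)
print("one-hot:", dist_cat_dog, dist_cat_bird)
```

With only two categories the problem does not arise, since a single 0/1 column has just one distance to express.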

## One-hot encoding

In [89]:
df = pd.DataFrame([['green', 'M', 10.1, 'class1'],
                   ['red', 'L', 13.5, 'class2'],
                   ['blue', 'XL', 15.3, 'class1']])

df.columns = ['color', 'size', 'price', 'classlabel']
df

,color,size,price,classlabel
0,green,M,10.1,class1
1,red,L,13.5,class2
2,blue,XL,15.3,class1


## Quiz 2. Which columns are quantitative? Which are ordinal? Which are nominal?

In [90]:
# One-hot encoding with pandas
pd.get_dummies(df.color)

,blue,green,red
0,0,1,0
1,0,0,1
2,1,0,0


In [91]:
# One-hot encoding with scikit-learn
from sklearn.preprocessing import OneHotEncoder

ohe = OneHotEncoder()
X2 = ohe.fit_transform(df[['color']].values).toarray()
X2

array([[0., 1., 0.],
       [0., 0., 1.],
       [1., 0., 0.]])

In [92]:
set(df['color'].unique())

{'blue', 'green', 'red'}

In [93]:
# Build column names (sorted, matching the category order OneHotEncoder uses)
color_list = np.sort('is_'+df['color'].unique())
color_list

array(['is_blue', 'is_green', 'is_red'], dtype=object)

In [94]:
df2 = pd.DataFrame(X2, columns=color_list)
df2

,is_blue,is_green,is_red
0,0.0,1.0,0.0
1,0.0,0.0,1.0
2,1.0,0.0,0.0


In [95]:
# Merge the encoded columns back into the DataFrame
df_new = pd.concat((df.drop('color', axis=1), df2), axis=1)
df_new

,size,price,classlabel,is_blue,is_green,is_red
0,M,10.1,class1,0.0,1.0,0.0
1,L,13.5,class2,0.0,0.0,1.0
2,XL,15.3,class1,1.0,0.0,0.0


In [96]:
# Inverse transform back to the original labels
df_inverse = pd.DataFrame(ohe.inverse_transform(X2), columns=['color'])
df_inverse

,color
0,green
1,red
2,blue
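
The encode, rename, and merge steps above can also be done in one call with `pd.get_dummies`. The sketch below rebuilds the same small table under the hypothetical name `df_demo` and uses the `columns`, `prefix`, and `dtype` parameters to produce the `is_*` columns directly.

```python
import pandas as pd

# Same toy table as above, rebuilt so this block runs on its own
df_demo = pd.DataFrame({
    "color": ["green", "red", "blue"],
    "size": ["M", "L", "XL"],
    "price": [10.1, 13.5, 15.3],
    "classlabel": ["class1", "class2", "class1"],
})

# Encode only 'color'; the default separator '_' gives is_blue, is_green, is_red
df_encoded = pd.get_dummies(df_demo, columns=["color"], prefix="is", dtype=int)

print(df_encoded.columns.tolist())
print(df_encoded)
```

This is shorter, but `get_dummies` does not remember the categories it saw. When the same encoding has to be applied to new data, or reversed with `inverse_transform`, the fitted `OneHotEncoder` is likely the better fit.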


## Sample and population
### Using the Taipei mayoral election as an example:
### Population: all citizens aged 20 or older
### Sample: a survey of 1,000 respondents
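
A small simulation can illustrate the idea. All numbers below are assumptions made for the sketch (one million eligible voters, 40% support for a hypothetical candidate), not real election data. Drawing 1,000 voters at random gives a sample rate close to the population rate, but not identical to it.

```python
import random

random.seed(42)  # fixed seed so the run is repeatable

# Hypothetical population: 1,000,000 eligible voters, 40% support candidate A
population = [1] * 400_000 + [0] * 600_000
population_rate = sum(population) / len(population)

# Sample: survey 1,000 voters drawn at random without replacement
sample = random.sample(population, 1000)
sample_rate = sum(sample) / len(sample)

print("Population support:", population_rate)
print("Sample support:    ", sample_rate)
```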

## Further reading: [Learning sampling theory through presidential polls](https://ithelp.ithome.com.tw/articles/10229457)