# Machine Learning_ML_train_test_split
###### tags: `ML` `model_selection` `train_test_split` `sklearn`
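In general, we do not feed all of our data into the model at once when training. We must hold out part of it for testing and validation, because what we want to predict is not the data we already know but the unseen data that arrives later. The data must also be shuffled rather than left in its original order. One reason is related to the data distribution: if the rows are sorted, the held-out portion may not follow the same distribution as the training portion.  
There is no absolute rule for the split ratio; it depends on how much data you have. If you have a very large dataset, on the order of tens or hundreds of millions of rows, you do not need to reserve 30% for testing. A 99:1 split may be enough, since that 1% is already sufficient to judge how the model is performing.  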
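
To make the ratio point concrete, here is a minimal sketch of a 99:1 split. The dataset is synthetic and the names `X_big`, `y_big`, and `n_samples` are assumptions made up for illustration; only `train_test_split` and its `test_size` parameter come from the note itself.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical "large" dataset: 100,000 rows with a single feature.
# The values are synthetic and only serve to show the resulting sizes.
n_samples = 100_000
X_big = np.arange(n_samples).reshape(-1, 1)
y_big = np.arange(n_samples)

# Hold out only 1% for testing instead of the more common 20-30%.
X_big_train, X_big_test, y_big_train, y_big_test = train_test_split(
    X_big, y_big, test_size=0.01, random_state=0
)

# Roughly 99,000 training rows and 1,000 test rows.
print(len(X_big_train), len(X_big_test))
```

Even at 1%, the test set here still holds about a thousand rows, which is the reason a small fraction can be enough once the dataset is large.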

## IMPORT
```
from sklearn.model_selection import train_test_split
```

## Example

In [1]:
"""
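x: feature dataset
y: target dataset
test_size=0.3: hold out 30% of the data as the test set
random_state: random seed; using the same value gives the same split every time
"""
#  Import the required packages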
import pandas as pd
from sklearn.model_selection import train_test_split
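
The effect of `random_state` described in the docstring can be checked with a small sketch. The arrays `X_demo` and `y_demo` are hypothetical toy data, not part of the housing example.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data, made up for illustration only.
X_demo = np.arange(20).reshape(-1, 1)
y_demo = np.arange(20)

# Two splits that share the same seed.
split_a = train_test_split(X_demo, y_demo, test_size=0.3, random_state=0)
split_b = train_test_split(X_demo, y_demo, test_size=0.3, random_state=0)

# Compare the four returned arrays pairwise.
same_seed_matches = all(
    np.array_equal(a, b) for a, b in zip(split_a, split_b)
)
print(same_seed_matches)  # True: the same seed reproduces the same split
```

Fixing the seed is mainly useful for reproducing results; leaving `random_state` unset gives a different shuffle on each run.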

In [2]:
#  Load the dataset (raw string so that '\s+' is not treated as an invalid escape sequence)
df = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/housing/housing.data', header=None, sep=r'\s+')

In [3]:
#  Set the column names
df.columns = ['CRIM', 'ZN', 'INDUS', 'CHAS', 
              'NOX', 'RM', 'AGE', 'DIS', 'RAD', 
              'TAX', 'PTRATIO', 'B', 'LSTAT', 'MEDV']

In [4]:
#  Check that the data was loaded
df.head()

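|   | CRIM | ZN | INDUS | CHAS | NOX | RM | AGE | DIS | RAD | TAX | PTRATIO | B | LSTAT | MEDV |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.00632 | 18.0 | 2.31 | 0 | 0.538 | 6.575 | 65.2 | 4.09 | 1 | 296.0 | 15.3 | 396.9 | 4.98 | 24.0 |
| 1 | 0.02731 | 0.0 | 7.07 | 0 | 0.469 | 6.421 | 78.9 | 4.9671 | 2 | 242.0 | 17.8 | 396.9 | 9.14 | 21.6 |
| 2 | 0.02729 | 0.0 | 7.07 | 0 | 0.469 | 7.185 | 61.1 | 4.9671 | 2 | 242.0 | 17.8 | 392.83 | 4.03 | 34.7 |
| 3 | 0.03237 | 0.0 | 2.18 | 0 | 0.458 | 6.998 | 45.8 | 6.0622 | 3 | 222.0 | 18.7 | 394.63 | 2.94 | 33.4 |
| 4 | 0.06905 | 0.0 | 2.18 | 0 | 0.458 | 7.147 | 54.2 | 6.0622 | 3 | 222.0 | 18.7 | 396.9 | 5.33 | 36.2 |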


In [5]:
#  Define the feature (x) and target (y) arrays
x = df[['RM']].values
y = df['MEDV'].values

In [7]:
#  Split the dataset
X_train, X_test, y_train, y_test = train_test_split(x, 
                                                    y, 
                                                    test_size=0.3, 
                                                    random_state=0)
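
`train_test_split` shuffles by default (`shuffle=True`), which covers the shuffling requirement from the introduction. The sketch below contrasts that with `shuffle=False` on sorted data; `X_sorted` and `y_sorted` are hypothetical toy arrays used only for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data sorted in ascending order, made up for illustration only.
X_sorted = np.arange(10).reshape(-1, 1)
y_sorted = np.arange(10)

# Without shuffling, the test set is simply the tail of the data,
# so it only contains the largest values.
_, _, _, y_tail = train_test_split(
    X_sorted, y_sorted, test_size=0.3, shuffle=False
)
print(y_tail)

# With the default shuffle=True, test rows are drawn from the whole range.
_, _, _, y_mixed = train_test_split(
    X_sorted, y_sorted, test_size=0.3, random_state=0
)
print(y_mixed)
```

On ordered data the unshuffled test set covers a different range than the training set, which is the distribution problem the introduction warns about.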

In [12]:
#  Check the number of samples in the training and test sets
print(len(X_train), len(X_test))

354 152
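
The counts above can be reproduced by hand. This sketch assumes that scikit-learn rounds the test-set size up and gives the remainder to the training set, which is consistent with the output shown; the variable names are chosen here for illustration.

```python
import math

n_samples = 506   # total rows in the housing dataset
test_size = 0.3   # fraction requested for the test set

# Assumption: the test count is rounded up, the rest goes to training.
n_test = math.ceil(test_size * n_samples)   # 0.3 * 506 = 151.8, rounded up
n_train = n_samples - n_test

print(n_train, n_test)  # 354 152
```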


In [9]:
#  Total number of samples in the dataset
len(x)

506

In [11]:
X_train

array([[ 5.019],
       [ 6.538],
       [ 6.335],
       [ 6.345],
       [ 5.961],
       [ 6.142],
       [ 5.594],
       [ 5.57 ],
       [ 6.152],
       [ 6.096],
       [ 6.968],
       [ 7.412],
       [ 7.104],
       [ 6.083],
       [ 6.041],
       [ 6.619],
       [ 6.398],
       [ 5.887],
       [ 6.604],
       [ 5.304],
       [ 6.376],
       [ 6.25 ],
       [ 5.871],
       [ 6.223],
       [ 5.569],
       [ 5.39 ],
       [ 6.98 ],
       [ 6.115],
       [ 6.301],
       [ 6.202],
       [ 5.869],
       [ 5.854],
       [ 5.836],
       [ 6.431],
       [ 6.762],
       [ 6.635],
       [ 7.831],
       [ 6.436],
       [ 6.459],
       [ 7.327],
       [ 7.079],
       [ 5.705],
       [ 7.47 ],
       [ 6.211],
       [ 6.59 ],
       [ 5.759],
       [ 6.122],
       [ 7.645],
       [ 6.442],
       [ 5.456],
       [ 7.52 ],
       [ 7.274],
       [ 6.13 ],
       [ 6.066],
       [ 6.122],
       [ 6.487],
       [ 6.312],
       [ 5.877],
       [ 6.63 