# Lecture 07: Data Repositories and Data Split in Machine Learning
By the end of this lecture, you will be able to:
1. Understand common machine learning data-repository jargon
2. Describe the training and testing steps in machine learning
3. Describe repeated random sampling in machine learning


# 7.1. Presentation

---



### 7.1.1. Presentation - Estimation Problems

> **Supervised machine learning techniques:**
> * You use known information (the input data) to estimate or predict other information (the target, or output, data).

> **Major types of estimation/supervised techniques:**
> *   Classifiers:
>     * Address problems with categorical outputs, such as class labels or integer codes.
>     * Example: Estimating crack types in pavements.
> *   Regressors:
>     * Address problems with real-valued outputs.
>     * Example: Estimating housing prices.
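
As a rough illustration of the two problem types, the sketch below uses scikit-learn's `type_of_target` helper to inspect two small target arrays. The crack-type labels and the price values are made up for this example and are not part of the lecture's data.

```python
from sklearn.utils.multiclass import type_of_target

# Hypothetical categorical target: crack types observed in pavement sections.
crack_types = ["longitudinal", "transverse", "alligator", "transverse"]

# Hypothetical real-valued target: housing prices.
house_prices = [250000.5, 310500.25, 189900.75, 402300.1]

crack_target_kind = type_of_target(crack_types)   # categorical output: a classifier problem
price_target_kind = type_of_target(house_prices)  # real-valued output: a regressor problem

print("Crack types :", crack_target_kind)
print("House prices:", price_target_kind)
```

A categorical target points to a classifier, and a continuous target points to a regressor.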



### 7.1.2. Presentation - Data Jargon in Machine Learning

> <img src="https://i.ibb.co/rt0TMVr/7-1.png" width="500"/>

### 7.1.3. Presentation - Training-Testing Split

> **Random division:**

> <img src="https://i.ibb.co/5rpv84n/7-2.png" width="500"/>

> **Ratio of testing (RTT = 20%)**

> **Accuracy/error measurements:**
> * Accuracy percentage
> * Mean squared error (MSE)
> * Mean absolute error (MAE)
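
The three measurements above can be sketched with plain NumPy. The arrays below are made-up values chosen only to keep the arithmetic easy to follow; in practice they would be the testing outputs and the model's predictions for them.

```python
import numpy as np

# Hypothetical classifier results: true and predicted class labels.
true_labels = np.array([0, 1, 1, 2, 0])
pred_labels = np.array([0, 1, 2, 2, 0])

# Accuracy percentage: share of predictions that match the true label.
accuracy_pct = 100.0 * np.mean(true_labels == pred_labels)

# Hypothetical regressor results: true and predicted real values.
true_values = np.array([3.0, 5.0, 2.0, 7.0])
pred_values = np.array([2.5, 5.0, 4.0, 8.0])
errors = pred_values - true_values

# Mean squared error: average of the squared errors.
mse = np.mean(errors ** 2)

# Mean absolute error: average of the absolute errors.
mae = np.mean(np.abs(errors))

print("Accuracy (%):", accuracy_pct)
print("MSE:", mse)
print("MAE:", mae)
```

Accuracy applies to classifiers, while MSE and MAE apply to regressors. MSE penalizes large errors more heavily than MAE because the errors are squared before averaging.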

### 7.1.4. Presentation - Statistical Bias

> **Repeated random sampling (RRS)**

> <img src="https://i.ibb.co/TrRGVF9/7-3.png" width="500"/>

> **Multiple RTTs and RRSs:**

> <img src="https://i.ibb.co/kHgp9Gs/7-4.png" width="500"/>
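
A single random split can make a model look better or worse than it really is, purely by chance. The sketch below illustrates why repeating the random split helps. It assumes synthetic data and a straight-line fit with `np.polyfit`; both are stand-ins chosen for illustration, not the lecture's dataset or model.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data (an assumption for this sketch): a noisy straight line.
rng = np.random.default_rng(0)
x = rng.uniform(0.0, 10.0, size=200)
y = 3.0 * x + 2.0 + rng.normal(0.0, 1.0, size=200)

def split_and_score(seed, test_size=0.2):
    """Split once with the given seed, fit a line on the training part,
    and return the mean squared error on the testing part."""
    x_tr, x_te, y_tr, y_te = train_test_split(
        x, y, test_size=test_size, random_state=seed
    )
    slope, intercept = np.polyfit(x_tr, y_tr, deg=1)
    y_hat = slope * x_te + intercept
    return float(np.mean((y_te - y_hat) ** 2))

# Repeated random sampling: 25 different random splits at the same RTT.
mse_per_split = [split_and_score(seed) for seed in range(25)]

print("Lowest  testing MSE:", min(mse_per_split))
print("Highest testing MSE:", max(mse_per_split))
print("Average testing MSE:", np.mean(mse_per_split))
```

The testing error changes from split to split, so the average over many splits is a steadier summary than the result of any single split.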

# 7.2. Import Some Necessary Libraries

---



In [None]:
#@title Import
import numpy  as np
import pandas as pd

import sklearn.model_selection as ms
import sklearn.preprocessing   as pg

# 7.3. Train-Test Split

---



In [None]:
#@title Load some data
data = pd.read_csv('/content/sample_data/california_housing_train.csv')
# Let's assume that this file holds the whole dataset, not just the training portion.

# Inputs: every column except the last. Outputs (targets): the last column only.
datain = data.iloc[:,:-1]
dataou = data.iloc[:,-1:]

In [None]:
#@title Split function

datain_tr, datain_te, dataou_tr, dataou_te = ms.train_test_split(datain, dataou, test_size = 0.2, random_state = 42)

print("Train | Inputs  | Top 5 Rows: \n\n{}\n".format( datain_tr.head() ) )
print('---------------')
print("Train | Outputs | Top 5 Rows: \n\n{}\n".format( dataou_tr.head() ) )
print('---------------')
print("Test  | Inputs  | Top 5 Rows: \n\n{}\n".format( datain_te.head() ) )
print('---------------')
print("Test  | Outputs | Top 5 Rows: \n\n{}\n".format( dataou_te.head() ) )
print('---------------')

In [None]:
#@title Example with multiple RTTs and RRSs

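rtt_list  = [0.1, 0.2, 0.3, 0.4, 0.5]   # ratios of testing (RTTs) to try

rtt_range = range( len(rtt_list) )
rrs_range = range(25)                   # 25 repeated random samplings (RRSs) per RTT

# Use a separate name so the DataFrame loaded above as `data` is not overwritten.
splits = np.empty( ( len(rtt_range) , len(rrs_range) ), dtype = object)

for rtt in rtt_range:
  for rrs in rrs_range:

    # Seed each repetition with its own index so that every RRS draws a different random split.
    datain_tr, datain_te, dataou_tr, dataou_te = ms.train_test_split(datain, dataou, test_size = rtt_list[rtt], random_state = rrs)

    data_dictionary = {'datain_tr':datain_tr, 'dataou_tr':dataou_tr, 'datain_te':datain_te, 'dataou_te':dataou_te}

    splits[rtt,rrs]   = data_dictionary

# Indices are zero-based: [2,6] is the third RTT (0.3) and the seventh RRS.
print("Dictionary corresponding to RTT index 2 and RRS index 6 \n\n{}\n".format( splits[2,6] ) )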

# Lecture 07: Data Repositories and Data Split in Machine Learning
In this lecture, you learned about:
1. Some machine learning data-repository jargon
2. Training and testing steps in machine learning
3. Repeated random sampling in machine learning

***In the next lecture, we will learn about Data Processing and Calibrations in Machine Learning***