## Train Test Data
How to split a dataset into training and test data.

In [49]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression


In [3]:
# import the data
df = pd.read_csv("carprices.csv")
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20 entries, 0 to 19
Data columns (total 3 columns):
 #   Column         Non-Null Count  Dtype
---  ------         --------------  -----
 0   Mileage        20 non-null     int64
 1   Age(yrs)       20 non-null     int64
 2   Sell Price($)  20 non-null     int64
dtypes: int64(3)
memory usage: 612.0 bytes


The dataset has 20 entries. We do not want to train on all of them, because a model scored on the same rows it was trained on looks better than it really is. So we split the data into a training set and a test set.

In [26]:
# Separate the features (X) from the target (y)
X = df.drop(["Sell Price($)"], axis="columns")
y = df[["Sell Price($)"]]

In [42]:
# Split into training data (80%) and test data (20%)
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.8)
print(f"Length of training dataset {len(X_train)} and test dataset {len(X_test)}")

Length of training dataset 16 and test dataset 4
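
As a small aside on the call above: the split size can be given either as the training fraction or as the test fraction. The sketch below uses a made-up array of 20 integers as a stand-in for the 20 rows (the array and the `random_state=0` value are assumptions chosen for illustration, not part of the original notebook).

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up stand-in data: 20 rows, the same count as the dataset above
rows = np.arange(20)

# Give the training fraction and let the test set take the remainder...
train_a, test_a = train_test_split(rows, train_size=0.8, random_state=0)

# ...or give the test fraction and let the training set take the remainder
train_b, test_b = train_test_split(rows, test_size=0.2, random_state=0)

print(len(train_a), len(test_a))
print(len(train_b), len(test_b))
```

Both calls should leave 16 rows for training and 4 for testing, so which parameter to use is mostly a matter of readability.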


In [43]:
# Let's look at X_train
X_train

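|    | Mileage | Age(yrs) |
|----|---------|----------|
| 17 | 69000 | 5 |
| 8 | 91000 | 8 |
| 9 | 67000 | 6 |
| 12 | 59000 | 5 |
| 13 | 58780 | 4 |
| 15 | 25400 | 3 |
| 0 | 69000 | 6 |
| 2 | 57000 | 5 |
| 5 | 59000 | 5 |
| 4 | 46000 | 4 |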


The output above changes each time the split is run, because `train_test_split` shuffles the rows randomly before splitting. To avoid this behaviour and get the same split on every run, set `random_state` to a fixed integer.

In [46]:
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.8, random_state=42)
print(
    f"Length of training dataset {len(X_train)} and test dataset {len(X_test)}")

Length of training dataset 16 and test dataset 4
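
A quick way to check the effect of `random_state` is to run the split twice with the same seed and compare the results. This is a minimal sketch on a made-up list of 20 integers (an assumption standing in for the real rows), not the notebook's data.

```python
from sklearn.model_selection import train_test_split

# Made-up stand-in data: 20 row labels
rows = list(range(20))

# Two independent calls with the same seed
first_train, first_test = train_test_split(rows, train_size=0.8, random_state=42)
second_train, second_test = train_test_split(rows, train_size=0.8, random_state=42)

# Both comparisons should print True: same seed, same shuffle
print(first_train == second_train)
print(first_test == second_test)
```

Any fixed integer works as the seed; 42 is just the value used in the cell above.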


In [47]:
X_train

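|    | Mileage | Age(yrs) |
|----|---------|----------|
| 8 | 91000 | 8 |
| 5 | 59000 | 5 |
| 11 | 79000 | 7 |
| 3 | 22500 | 2 |
| 18 | 87600 | 8 |
| 16 | 28000 | 2 |
| 13 | 58780 | 4 |
| 2 | 57000 | 5 |
| 9 | 67000 | 6 |
| 19 | 52000 | 5 |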


### Understanding list unpacking

In [19]:
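def test_train(data):
    length = len(data)
    # Return both halves in a single list
    return [data[:length//2], data[length//2:]]

# Use a separate name so the feature DataFrame X is not overwritten
values = [1, 2, 3, 4, 5, 6, 7, 8]
out = test_train(values)
print(out)

# Unpack the returned list into two variables
x_1, x_2 = test_train(values)
print(x_1, x_2)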

[[1, 2, 3, 4], [5, 6, 7, 8]]
[1, 2, 3, 4] [5, 6, 7, 8]
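
The same idea extends to four values, which is the shape of the `train_test_split` call used earlier. The helper below, `split_pairs`, is hypothetical and written only for illustration; unlike `train_test_split`, it does no shuffling and simply cuts each list in half.

```python
def split_pairs(features, targets):
    """Hypothetical helper: cut two equal-length lists in half, without shuffling."""
    middle = len(features) // 2
    # Same ordering as train_test_split: train/test for the first input,
    # then train/test for the second input
    return [features[:middle], features[middle:], targets[:middle], targets[middle:]]

# Made-up example values
features = [10, 20, 30, 40]
targets = [1, 2, 3, 4]

# Four names on the left unpack the four items in the returned list
f_train, f_test, t_train, t_test = split_pairs(features, targets)
print(f_train, f_test)
print(t_train, t_test)
```

The number of names on the left must match the number of items returned, otherwise Python raises a `ValueError`.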


### Training and evaluating the model

In [53]:
model = LinearRegression()

# Train the model with training data
model.fit(X_train, y_train)

In [54]:
# Predict on the test data so the results can be compared with the actual prices
model.predict(X_test)

array([[22262.48189206],
       [22571.64380185],
       [38560.99055662],
       [35176.5451397 ]])

In [52]:
y_test

|    | Sell Price($) |
|----|---------------|
| 0 | 18000 |
| 17 | 19700 |
| 15 | 35000 |
| 1 | 34000 |


In [55]:
# Score the model on the test data (for a regressor, score returns R²)
model.score(X_test, y_test)

0.8360253892678235
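
For a regressor such as `LinearRegression`, `score` returns the coefficient of determination, $R^2 = 1 - SS_{res} / SS_{tot}$. As a sanity check, the sketch below recomputes it in plain Python from the predictions and the `y_test` values printed above (the predictions are copied as displayed, so the last digits may differ slightly from the library's result).

```python
# Values copied from the outputs above, in the same row order
actual = [18000, 19700, 35000, 34000]
predicted = [22262.48189206, 22571.64380185, 38560.99055662, 35176.5451397]

mean_actual = sum(actual) / len(actual)

# Residual sum of squares: how far the predictions are from the actual prices
ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))

# Total sum of squares: how far the actual prices are from their own mean
ss_tot = sum((a - mean_actual) ** 2 for a in actual)

r_squared = 1 - ss_res / ss_tot
print(round(r_squared, 3))  # 0.836
```

`sklearn.metrics.r2_score(y_test, model.predict(X_test))` should give the same number without the manual arithmetic.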

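The model's R² score on the test data is about 0.84, meaning it explains roughly 84% of the variance in the test prices. This is not a classification accuracy, and with only 4 test rows it is a rough estimate. We learned how to split the data into training and test sets, train a model on the training set, and validate the results against the test set.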