# **5. Splitting Data into Train and Test Sets**

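We split the available dataset into training and testing sets using the **train_test_split()** function from the sklearn.model_selection module. The model is fit on the training set only, so its score on the held-out test set estimates how well it generalizes to unseen data.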

The steps are as follows:

1. Set up the $X$ data matrix and the $Y$ dependent-variable vector
2. Use **train_test_split()** to separate the dataset into train and test sets
3. Create a regression object using **model = linear_model.LinearRegression()**
4. Use **model.fit(X_train, y_train)** to train the model
5. Use **model.predict(X_test)** to generate predictions for the test set
6. Use **model.score(X_test, y_test)** to compute the model's $R^2$ score on the test set
7. Save the model using **joblib.dump()**

In [17]:
import pandas as pd
import numpy as np
from sklearn import linear_model
import matplotlib.pyplot as plt
import pickle
from sklearn.model_selection import train_test_split
import joblib

In [4]:
url = "https://raw.githubusercontent.com/codebasics/py/refs/heads/master/ML/6_train_test_split/carprices.csv"
df = pd.read_csv(url)
df.head()

| | Mileage | Age(yrs) | Sell Price($) |
|---|---|---|---|
| 0 | 69000 | 6 | 18000 |
| 1 | 35000 | 3 | 34000 |
| 2 | 57000 | 5 | 26100 |
| 3 | 22500 | 2 | 40000 |
| 4 | 46000 | 4 | 31500 |


In [7]:
# 1. Set up X matrix and Y vector
X = df[['Mileage', 'Age(yrs)']]
y = df['Sell Price($)']

In [8]:
# 2. Split the dataset into train and test portions with test_size=0.2 and random_state=100
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=100)
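
To see what the split produces, it can help to check the shapes of the four pieces and confirm that a fixed `random_state` reproduces the same split. The sketch below uses a made-up 20-row table (`toy`, with the same column names as the CSV) rather than the real data, so all values and names ending in `_toy`, `_tr`, `_te` are assumptions for illustration only.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up stand-in data (NOT the real carprices.csv), 20 rows
toy = pd.DataFrame({
    "Mileage": [10000 + 3000 * i for i in range(20)],
    "Age(yrs)": [1 + i // 3 for i in range(20)],
    "Sell Price($)": [45000 - 1500 * i for i in range(20)],
})
X_toy = toy[["Mileage", "Age(yrs)"]]
y_toy = toy["Sell Price($)"]

# test_size=0.2 holds out 20% of the rows (4 of 20) for testing
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=100
)
split_shapes = (X_tr.shape, X_te.shape, y_tr.shape, y_te.shape)
print(split_shapes)

# Repeating the call with the same random_state selects the same rows
_, X_te_again, _, _ = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=100
)
same_split = X_te.index.equals(X_te_again.index)
print(same_split)
```

Without `random_state`, the rows are shuffled differently on each run, so the score reported later would change from run to run.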

In [9]:
# 3. Create a regression object using model = linear_model.LinearRegression()
model = linear_model.LinearRegression()

In [10]:
# 4. Use model.fit(X_train, y_train) to train the model
model.fit(X_train, y_train)
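
After fitting, a `LinearRegression` object exposes the learned slope for each feature in `coef_` and the constant term in `intercept_`. The sketch below is a minimal illustration on made-up data (`toy`) whose prices are generated from a known linear formula, so the fitted values can be compared against the numbers that produced them. The formula and the data are assumptions, not taken from the car price dataset.

```python
import numpy as np
import pandas as pd
from sklearn import linear_model

# Made-up data: price follows an exact linear rule chosen for illustration
toy = pd.DataFrame({
    "Mileage": [20000, 35000, 50000, 65000, 80000, 30000],
    "Age(yrs)": [2, 3, 6, 5, 8, 4],
})
toy["Sell Price($)"] = 40000 - 0.1 * toy["Mileage"] - 2000 * toy["Age(yrs)"]

features = ["Mileage", "Age(yrs)"]
toy_model = linear_model.LinearRegression()
toy_model.fit(toy[features], toy["Sell Price($)"])

# One coefficient per feature column, in the same order as the columns
learned = dict(zip(features, toy_model.coef_))
print(learned)
print(toy_model.intercept_)
```

On real data the fit is not exact, but the attributes are read the same way: each coefficient is the predicted change in price for a one-unit change in that feature, holding the other fixed.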

In [14]:
# 5. Use model.predict(X_test) to test the model
model.predict(X_test)


array([23010.53958236, 27628.86300547, 18195.41396344, 14809.85325261])
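
The raw prediction array is easier to judge next to the actual values. One possible way to do that, sketched here on randomly generated stand-in data (the generating formula, the seed, and the names `comparison`, `preds`, `X_te`, `y_te` are all assumptions), is to put both in a small DataFrame:

```python
import numpy as np
import pandas as pd
from sklearn import linear_model
from sklearn.model_selection import train_test_split

# Made-up stand-in data with noise, generated from a fixed seed
rng = np.random.default_rng(0)
mileage = rng.integers(20000, 90000, size=20)
age = rng.integers(2, 9, size=20)
price = 45000 - 0.2 * mileage - 1500 * age + rng.normal(0, 1000, size=20)
toy = pd.DataFrame({"Mileage": mileage, "Age(yrs)": age, "Sell Price($)": price})

X_tr, X_te, y_tr, y_te = train_test_split(
    toy[["Mileage", "Age(yrs)"]], toy["Sell Price($)"],
    test_size=0.2, random_state=100,
)
toy_model = linear_model.LinearRegression().fit(X_tr, y_tr)
preds = toy_model.predict(X_te)

# Side-by-side view: actual price, predicted price, and signed error
comparison = pd.DataFrame({"actual": y_te, "predicted": preds})
comparison["error"] = comparison["predicted"] - comparison["actual"]
print(comparison)
```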

In [15]:
# 6. Use model.score(X_test, y_test) to compute the R^2 score (coefficient of determination) on the test set
model.score(X_test, y_test)

0.8574045338184415
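
For a regressor, `score()` returns the coefficient of determination $R^2 = 1 - SS_{res}/SS_{tot}$ rather than a classification-style accuracy. The sketch below recomputes it by hand on made-up data and compares the result with `score()` and `sklearn.metrics.r2_score`. The data, the seed, and the variable names are assumptions for illustration.

```python
import numpy as np
from sklearn import linear_model
from sklearn.metrics import r2_score

# Made-up data: two features, linear signal plus noise, fixed seed
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 10, size=(30, 2))
y_demo = 3.0 * X_demo[:, 0] - 2.0 * X_demo[:, 1] + rng.normal(0, 1, size=30)

# Fit on the first 24 rows, evaluate on the last 6
demo_model = linear_model.LinearRegression().fit(X_demo[:24], y_demo[:24])
y_true = y_demo[24:]
y_hat = demo_model.predict(X_demo[24:])

ss_res = np.sum((y_true - y_hat) ** 2)          # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares
r2_manual = 1 - ss_res / ss_tot

r2_from_score = demo_model.score(X_demo[24:], y_true)
r2_from_metric = r2_score(y_true, y_hat)
print(r2_manual, r2_from_score, r2_from_metric)
```

A value of 1.0 means the predictions match the test targets exactly, and the value can be negative when the model does worse than always predicting the mean of the test targets.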

In [18]:
# 7. Save the model
joblib.dump(model, 'car_price_model')

['car_price_model']
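
A saved model can be restored later with `joblib.load()` and used for prediction without refitting. The sketch below is a minimal round trip on made-up data, writing to a temporary directory so it does not touch the `car_price_model` file saved above. The data and the names `demo_model` and `restored` are assumptions.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn import linear_model

# Made-up training data: columns are mileage and age in years
X_demo = np.array(
    [[20000, 2], [35000, 3], [50000, 6], [65000, 5], [80000, 8]], dtype=float
)
y_demo = np.array([38000, 33000, 24000, 23500, 15000], dtype=float)
demo_model = linear_model.LinearRegression().fit(X_demo, y_demo)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "car_price_model")
    joblib.dump(demo_model, path)   # write the fitted model to disk
    restored = joblib.load(path)    # read it back as a new object

# The restored model should give the same predictions as the original
same_predictions = np.allclose(
    demo_model.predict(X_demo), restored.predict(X_demo)
)
print(same_predictions)
```

Because joblib files are pickle-based, they should only be loaded from sources you trust, and ideally with the same scikit-learn version that was used to save them.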