## Train a `LinearRegression`

In the heart of Silicon Valley, there is a ruthless software engineering company whose CEO is driven by a single goal: profit at any cost.

To maximize its software engineers' output, the company has hired you to develop a predictive model that estimates the likely productivity of a software engineer, given their level of sleep deprivation. The company will use this model to overwork its engineers more efficiently.

You have been provided with a dataset from an internal study in which a group of employees were all asked to perform the same 90-minute coding task. Some of them were sleep-deprived, having been forced to work all night beforehand, and some had been permitted to go home and sleep. The quality of their code was then evaluated and recorded. Each sample includes the following columns:

- `id` number of the sample
- `experience` level of the employee (up to 100)
- `had_sleep` (1 or 0, indicating whether the employee was permitted to sleep the night before)
- `passed_unit_tests` (number of unit tests passed by the code they produced)

In the attached workspace, you will read this data from a file and split it into training and test sets. Then, you will fit a `LinearRegression` on the training set and evaluate how well it predicts `passed_unit_tests` on the test set. Use the `sklearn` implementation; you may refer to its documentation [here](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html).

You'll need to specify this random state in your notebook:

> random_state = 20

The following items will be graded:

| Name | Type | Description |
| ---- | ---- | ---- |
| `Xtr` | pandas dataframe | Training data - features. |
| `Xts` | pandas dataframe | Test data - features. |
| `ytr` | pandas series OR pandas data frame OR 1d numpy array | Training data - target variable. |
| `yts` | pandas series OR pandas data frame OR 1d numpy array | Test data - target variable. |
| `yts_hat` | 1d numpy array | Model prediction for test data. |
| `rsq` | float | R² of model on test data. |

In [9]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

In this question, we will try to predict the number of unit tests that a software engineer's code will pass, given information about their experience level and their sleep status.

First, we'll load the dataset:

In [10]:
df = pd.read_csv('data.csv', names=['id', 'experience', 'had_sleep', 'passed_unit_tests'], header=None, index_col='id')

You can add some code here to inspect the data, see the names of features, and see the data types - the cell below will not be graded.

In [11]:
df.head()

| id | experience | had_sleep | passed_unit_tests |
| ---- | ---- | ---- | ---- |
| 0 | 85 | 1 | 4 |
| 1 | 86 | 1 | 7 |
| 2 | 83 | 1 | 8 |
| 3 | 67 | 1 | 4 |
| 4 | 80 | 1 | 8 |


Note that your code will be evaluated on *different* data organized in a data frame with the same columns, so your solution should not hard-code anything specific to this data.
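As a rough illustration of the kind of inspection the ungraded cell allows, the sketch below builds a small stand-in data frame and looks at its types, summary statistics, and missing values. The frame `demo` and its randomly generated values are assumptions made for this example; they are not the study data, and only the column names match the assignment.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in with the same columns as the assignment's data frame.
# The values are randomly generated, not taken from the real study.
rng = np.random.default_rng(0)
n = 8
demo = pd.DataFrame(
    {
        "experience": rng.integers(0, 101, size=n),
        "had_sleep": rng.integers(0, 2, size=n),
        "passed_unit_tests": rng.integers(0, 11, size=n),
    },
    index=pd.RangeIndex(n, name="id"),
)

print(demo.dtypes)                        # data type of each column
print(demo.describe())                    # count, mean, min/max, quartiles
print(demo.isna().sum())                  # missing values per column
print(demo["had_sleep"].value_counts())   # rested vs. sleep-deprived rows
```

The same calls can be applied to `df` in the notebook. Because they only read the frame, they do not affect anything that is graded.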

Now we will split the data into training and test sets, using `train_test_split`!

* Reserve 20% of the data for testing.
* Use the random state specified on the question page.

The following cell should create:

* `Xtr` and `Xts` as pandas data frames including only the `experience` and `had_sleep` features, and
* `ytr` and `yts` as either pandas series or 1d numpy arrays containing the target variable.
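The distinction between a series and a one-column data frame matters later, because it changes the shape of the model's predictions. The sketch below shows this on a tiny made-up frame (`demo` and its values are assumptions for illustration): selecting with a string gives a 1d series, while selecting with a list gives a 2d data frame, and `LinearRegression.predict` returns an array whose dimensionality follows the target it was fit on.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Made-up values, used only to illustrate shapes.
demo = pd.DataFrame(
    {
        "experience": [10, 40, 55, 70, 90, 25],
        "had_sleep": [0, 1, 0, 1, 1, 0],
        "passed_unit_tests": [1, 5, 3, 7, 9, 2],
    }
)
X = demo[["experience", "had_sleep"]]

y_series = demo["passed_unit_tests"]    # string selector: Series, shape (6,)
y_frame = demo[["passed_unit_tests"]]   # list selector: DataFrame, shape (6, 1)

pred_from_series = LinearRegression().fit(X, y_series).predict(X)
pred_from_frame = LinearRegression().fit(X, y_frame).predict(X)

print(pred_from_series.shape)  # (6,)
print(pred_from_frame.shape)   # (6, 1)

# The numbers are the same; only the shape differs.
print(np.allclose(pred_from_frame.ravel(), pred_from_series))  # True
```

If a 2d prediction does turn up, `np.ravel` flattens it to 1d without changing the values.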

In [12]:
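#grade (write your code in this cell and DO NOT DELETE THIS LINE)
features = ['experience', 'had_sleep']
target = 'passed_unit_tests'  # a plain string, so y is a 1d series and predictions come out 1d
X = df[features]
y = df[target]
random_state = 20  # value specified on the question page
# hold out 20% of the rows for testing
Xtr, Xts, ytr, yts = train_test_split(X, y, test_size=0.2, random_state=random_state)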
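A quick sanity check on a split like this is to confirm the sizes, that the two sets do not overlap, and that the same `random_state` reproduces the same partition. The sketch below does this on randomly generated stand-in data; the frame `demo` and its 50 rows are assumptions for illustration, not the assignment's data.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in data with the assignment's column names.
rng = np.random.default_rng(1)
n = 50
demo = pd.DataFrame(
    {
        "experience": rng.integers(0, 101, size=n),
        "had_sleep": rng.integers(0, 2, size=n),
        "passed_unit_tests": rng.integers(0, 11, size=n),
    }
)
X = demo[["experience", "had_sleep"]]
y = demo["passed_unit_tests"]

Xtr, Xts, ytr, yts = train_test_split(X, y, test_size=0.2, random_state=20)
# Repeat the split with the same seed to check reproducibility.
Xtr2, Xts2, ytr2, yts2 = train_test_split(X, y, test_size=0.2, random_state=20)

print(Xtr.shape, Xts.shape)                   # (40, 2) (10, 2)
print(set(Xtr.index).isdisjoint(Xts.index))   # True: no row is in both sets
print(Xts.index.equals(Xts2.index))           # True: same seed, same split
```

With a different `random_state`, the rows assigned to each set would generally differ, which is why the assignment fixes the value.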

Now we are ready to fit the `LinearRegression`. Using the default settings, fit the model on the training data. Then use it to make predictions for the test samples, and save these predictions in `yts_hat`. Finally, evaluate the R² score of the model on the test data, and save it in `rsq`.

In [13]:
#grade (write your code in this cell and DO NOT DELETE THIS LINE)
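# fit on the training set with default settings
model = LinearRegression().fit(Xtr, ytr)
# predict the held-out test samples
yts_hat = model.predict(Xts)
# R2 of the predictions against the true test targets
rsq = r2_score(yts, yts_hat)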

0.21836803234855162
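To make the meaning of this score concrete: R² is one minus the ratio of the residual sum of squares to the total sum of squares around the mean of the true targets, so a value near 0.22 would mean the model accounts for roughly 22% of the variance in the test targets. The sketch below computes R² by hand and compares it with `r2_score` and with the model's own `score` method. The data is synthetic, and the linear relationship used to generate it is an assumption made up for this example.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Synthetic data: a made-up linear relationship plus noise (illustration only).
rng = np.random.default_rng(2)
n = 200
experience = rng.integers(0, 101, size=n)
had_sleep = rng.integers(0, 2, size=n)
y = 0.05 * experience + 2.0 * had_sleep + rng.normal(0, 1.5, size=n)
X = np.column_stack([experience, had_sleep])

Xtr, Xts, ytr, yts = train_test_split(X, y, test_size=0.2, random_state=20)
model = LinearRegression().fit(Xtr, ytr)
yts_hat = model.predict(Xts)

# R^2 = 1 - SS_res / SS_tot, computed on the test set.
ss_res = np.sum((yts - yts_hat) ** 2)       # squared error of the model
ss_tot = np.sum((yts - yts.mean()) ** 2)    # squared error of predicting the mean
rsq_manual = 1.0 - ss_res / ss_tot

rsq_sklearn = r2_score(yts, yts_hat)        # argument order: true values, then predictions
rsq_score = model.score(Xts, yts)           # LinearRegression.score also returns R^2

print(rsq_manual, rsq_sklearn, rsq_score)   # three matching values
```

Note that `r2_score` is not symmetric in its arguments, so passing the predictions first would generally give a different number.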