# Fitting a LinearRegression - computing your own features


You are working for an online retailer of baby clothes, called Cozy Cubbies. They have hired you to help their customers understand what size clothing to buy.

Since babies grow very fast, parents buying clothing for future use (e.g. buying summer clothing in the spring) don't always know what size to buy, because they don't know how big their baby will be.

To address this, Cozy Cubbies has added a "Help Me Size" feature to their website, where parents can enter some basic information about their baby and get an estimate of their future size. They will enter the following information:

- baby's date of birth
- baby's current weight (kg)
- future date for which they want to estimate baby's weight

They will then get an estimate of their baby's weight at that future date, which will help them decide what size clothing to buy.

In the attached workspace, you will develop a linear regression model to realize this "Help Me Size" feature.

The grader will evaluate the following parts of your code:

| Name | Type | Description |
| --- | --- | --- |
| `y` | 1d numpy array | The target variable, computed as the difference in weight between two measurements. |
| `transform_df` | function | Function that accepts a pandas data frame and returns a 2d numpy array on which you will fit the model. |
| `rsq` | float | R2 value of your LinearRegression model on the test set in the workspace. |

In [1]:
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score
from datetime import datetime, timedelta
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

To help with this task, you have been given a dataset of baby weights, each baby measured at two points in time. For each sample, you have:

* the baby's date of birth (`date_birth`)
* the baby's weight in kg at the first measurement (`initial_weight`)
* the date of that first measurement (`date_weight`)
* the date of the second measurement (`date_future`)
* the baby's weight in kg at the second measurement (`future_weight`)

Read the data into a data frame called `df`:

In [2]:
df = pd.read_csv("data.csv")
df

Unnamed: 0,date_birth,date_weight,initial_weight,date_future,future_weight
0,2023-02-24,2023-09-08,8.611799,2023-12-29,10.423569
1,2023-04-03,2023-07-03,6.215212,2024-01-08,10.242544
2,2023-02-12,2023-10-15,8.279586,2024-06-09,10.416252
3,2023-10-23,2024-03-04,5.980661,2024-04-15,6.543232
4,2023-12-13,2024-08-21,7.661044,2024-12-25,9.033796
...,...,...,...,...,...
1995,2023-07-18,2024-03-12,9.386375,2024-07-23,11.287761
1996,2023-03-11,2023-08-19,7.032677,2024-02-17,9.391245
1997,2023-10-04,2024-05-08,8.605420,2024-10-09,10.440326
1998,2023-06-28,2023-11-29,8.914786,2024-05-29,11.767301


The task you must solve is: given a new sample with `'date_birth', 'date_weight', 'initial_weight', 'date_future'`, what is the `weight_diff`, i.e. how much weight will the baby gain between `date_weight` and `date_future`?

First, create the target variable `y` - the weight gain from the first measurement to the second measurement - as a 1d numpy array:

In [3]:
#grade (write your code in this cell and DO NOT DELETE THIS LINE)
# y = ...
df['weight_diff'] = df['future_weight'] - df['initial_weight'] 
y = df['weight_diff'].to_numpy()

(Note: if you have a pandas DataFrame or Series, you can use `.to_numpy()` to get the underlying numpy array from it.)
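
As a quick illustration of the note above, the following sketch subtracts two columns and converts the result to a 1d array. The two-row frame `toy` and its values are made up for illustration only; they are not taken from `data.csv`.

```python
import numpy as np
import pandas as pd

# Hypothetical two-row frame that mimics two columns of `df`
toy = pd.DataFrame({
    "initial_weight": [8.0, 6.5],
    "future_weight": [10.0, 9.0],
})

# Subtracting two columns gives a pandas Series ...
weight_diff = toy["future_weight"] - toy["initial_weight"]

# ... and .to_numpy() returns its values as a NumPy array
y_toy = weight_diff.to_numpy()

print(type(weight_diff).__name__)  # Series
print(y_toy.shape)                 # (2,)
```

A single column always yields a 1d array, which is the shape that scikit-learn expects for a regression target.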

Next, you will write a function to return `X` as a 2D numpy array of feature data. For now, we will just use `initial_weight` as the only feature:

In [4]:
#grade (write your code in this cell and DO NOT DELETE THIS LINE)
def transform_df(df_baby_weights):
    # Create a copy to avoid the SettingWithCopyWarning
    df_baby_weights = df_baby_weights.copy()

    # Convert to datetime
    df_baby_weights['date_birth'] = pd.to_datetime(df_baby_weights['date_birth'], errors='coerce')
    df_baby_weights['date_weight'] = pd.to_datetime(df_baby_weights['date_weight'], errors='coerce')
    df_baby_weights['date_future'] = pd.to_datetime(df_baby_weights['date_future'], errors='coerce')

    # Calculate new features
    age_at_weight = (df_baby_weights['date_weight'] - df_baby_weights['date_birth']).dt.days
    duration_between_weights = (df_baby_weights['date_future'] - df_baby_weights['date_weight']).dt.days
    initial_weight = df_baby_weights['initial_weight']

    # Combine features into a 2D numpy array
    X = np.column_stack((initial_weight, age_at_weight, duration_between_weights))
    return X
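
One thing to keep in mind about `errors='coerce'` in the function above: unparseable dates do not raise an error but turn into `NaT`, and any day count derived from them becomes `NaN`. The sketch below shows this on a made-up two-element Series (the invalid string and the reference date are assumptions chosen for illustration).

```python
import pandas as pd

# Second entry is deliberately invalid
raw = pd.Series(["2023-02-24", "not a date"])

# Invalid entries become NaT instead of raising an exception
parsed = pd.to_datetime(raw, errors="coerce")

# NaT propagates through the subtraction, so the day count is NaN
days = (parsed - pd.Timestamp("2023-01-01")).dt.days

print(parsed.isna().tolist())  # [False, True]
print(days.isna().tolist())    # [False, True]
```

If the input data might contain malformed dates, it may be worth checking for missing values before fitting, since `LinearRegression` does not accept `NaN` in its inputs.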


In [5]:
X = transform_df(df[['date_birth', 'date_weight', 'initial_weight', 'date_future']])
X

array([[  8.61179869, 196.        , 112.        ],
       [  6.21521219,  91.        , 189.        ],
       [  8.27958621, 245.        , 238.        ],
       ...,
       [  8.60542047, 217.        , 154.        ],
       [  8.91478585, 154.        , 182.        ],
       [  7.85847964, 231.        , 231.        ]])
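
To see where the second and third columns of `X` come from, the sketch below repeats the date arithmetic on the dates of the first row of the dataset (index 0). The names `row`, `dates`, and `duration` are only used for this illustration.

```python
import pandas as pd

# Dates from the first row of the dataset, as strings (the way read_csv loads them)
row = pd.DataFrame({
    "date_birth": ["2023-02-24"],
    "date_weight": ["2023-09-08"],
    "date_future": ["2023-12-29"],
})

# Parse every column of strings into datetime values
dates = row.apply(pd.to_datetime)

# Subtracting datetime columns gives timedeltas; .dt.days extracts whole days
age_at_weight = (dates["date_weight"] - dates["date_birth"]).dt.days
duration = (dates["date_future"] - dates["date_weight"]).dt.days

print(age_at_weight.iloc[0], duration.iloc[0])  # 196 112
```

These values match the first row of `X` printed above: the baby was 196 days old at the first measurement, and the second measurement came 112 days later.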

Then, divide `X` and `y` into training and test sets using `train_test_split`, reserving 500 samples for the test set. Shuffle the data when you split it, and pass `random_state = 42` so that your shuffle matches the autograder's.

In [6]:
#grade (write your code in this cell and DO NOT DELETE THIS LINE)
# Xtr = ...
# Xts = ...
# ytr = ...
# yts = ...

Xtr, Xts, ytr, yts = train_test_split(X, y, test_size=500, shuffle=True, random_state=42)
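
The following sketch illustrates two properties of `train_test_split` that the cell above relies on: an integer `test_size` is an absolute number of samples, and a fixed `random_state` makes the shuffle reproducible. The toy arrays and their sizes are assumptions chosen only for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 20 samples, 3 features (row i of X_toy starts with 3*i)
X_toy = np.arange(60).reshape(20, 3)
y_toy = np.arange(20)

# An integer test_size is an absolute number of test samples, not a fraction
X_a, X_b, y_a, y_b = train_test_split(
    X_toy, y_toy, test_size=5, shuffle=True, random_state=42
)
print(X_a.shape, X_b.shape)  # (15, 3) (5, 3)

# The same random_state gives the same split on a second call
_, X_b2, _, _ = train_test_split(
    X_toy, y_toy, test_size=5, shuffle=True, random_state=42
)
print(np.array_equal(X_b, X_b2))  # True
```

Rows of the feature array and entries of the target stay paired after shuffling, so each test sample keeps its own label.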

Now, fit a `LinearRegression` model on the training data:

In [7]:
#grade (write your code in this cell and DO NOT DELETE THIS LINE)
# reg = ...
reg = LinearRegression()
reg.fit(Xtr, ytr)

Use it to predict the baby's weight gain for samples in the test set, and compute the R2 score:

In [8]:
#grade (write your code in this cell and DO NOT DELETE THIS LINE)
# yhat_ts = ...
# rsq = ...

yhat_ts = reg.predict(Xts)
rsq = r2_score(yts, yhat_ts)

What R2 score have you achieved?

In [9]:
rsq

0.890894237590863
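
For intuition about what this number measures: the R2 score is one minus the ratio of the residual sum of squares to the total sum of squares around the mean of the true values. The sketch below computes it by hand on four made-up values and compares the result with `r2_score`.

```python
import numpy as np
from sklearn.metrics import r2_score

# Made-up true values and predictions, for illustration only
y_true = np.array([1.0, 2.0, 3.0, 4.0])
y_pred = np.array([1.1, 1.9, 3.2, 3.8])

# Residual sum of squares: error left over after prediction
ss_res = np.sum((y_true - y_pred) ** 2)

# Total sum of squares: error of always predicting the mean
ss_tot = np.sum((y_true - y_true.mean()) ** 2)

r2_manual = 1 - ss_res / ss_tot

print(np.isclose(r2_manual, r2_score(y_true, y_pred)))  # True
```

A score of 1 means perfect predictions, and a score of 0 means the model does no better than always predicting the mean of the test targets.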

With just `initial_weight` it is very difficult to predict a baby's weight! But you can modify `transform_df` to compute other features that will make your linear regression model much more effective. (Don't modify `df` directly, just the `transform_df` function.)

Use the pandas and numpy documentation to help:

* https://numpy.org/doc/stable/reference/
* https://pandas.pydata.org/docs/reference/

For full credit, a linear regression fitted using the feature data from your modified `transform_df` should achieve an R2 score above 0.85 for this problem.

Your design will be evaluated against a new, comparable test data set, not the test data set in your notebook.
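
As a rough sketch of how a fitted model could back the "Help Me Size" feature, the estimated future weight is the current weight plus the predicted gain. Everything below is an assumption for illustration: `toy_transform` stands in for your own `transform_df`, `estimate_future_weight` is a hypothetical helper, and the training rows are synthetic (the gain is set to exactly 0.01 kg per day), not real growth data.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

def toy_transform(df_in):
    """Stand-in for transform_df: initial weight, age in days, gap in days."""
    d = df_in.copy()
    for col in ["date_birth", "date_weight", "date_future"]:
        d[col] = pd.to_datetime(d[col])
    age = (d["date_weight"] - d["date_birth"]).dt.days
    gap = (d["date_future"] - d["date_weight"]).dt.days
    return np.column_stack((d["initial_weight"], age, gap))

# Synthetic training rows (NOT real growth data)
train = pd.DataFrame({
    "date_birth":     ["2023-01-01", "2023-01-01", "2023-02-01", "2023-03-01"],
    "date_weight":    ["2023-04-01", "2023-06-01", "2023-05-01", "2023-09-01"],
    "initial_weight": [6.0, 7.5, 6.2, 8.0],
    "date_future":    ["2023-07-10", "2023-07-01", "2023-11-17", "2023-10-21"],
})
X_train = toy_transform(train)
y_train = 0.01 * X_train[:, 2]  # synthetic target: 0.01 kg gained per day
reg_toy = LinearRegression().fit(X_train, y_train)

def estimate_future_weight(model, sample):
    """Hypothetical helper: current weight plus the predicted gain."""
    gain = model.predict(toy_transform(sample))
    return sample["initial_weight"].to_numpy() + gain

# A new baby: 6.8 kg on 2023-08-01, estimate wanted for 60 days later
new_baby = pd.DataFrame({
    "date_birth": ["2023-05-01"], "date_weight": ["2023-08-01"],
    "initial_weight": [6.8], "date_future": ["2023-09-30"],
})
estimate = estimate_future_weight(reg_toy, new_baby)
print(round(float(estimate[0]), 3))  # 7.4
```

Because the model predicts the gain rather than the weight itself, the helper has to add `initial_weight` back to obtain the figure a parent would actually see.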