# Step 2
> Steps after data cleaning
- toc: true
- categories: [jupyter]
- comments: true

## After data cleaning

This is when you start assigning variables and labels to every instance. Before that, we need to make sure every cell has a value, even if that value is not exactly right.

In [None]:
weather = weather.ffill()
# forward fill: replace each missing value with the last valid value above it in the same column
# the filled value may not be accurate, but we can't have missing values
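To see what the forward fill does, here is a minimal sketch on a made-up table. The column names `temp_max` and `precip` and all the values are assumptions for illustration, not the real weather data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: four days of readings with gaps (NaN)
toy = pd.DataFrame(
    {
        "temp_max": [60.0, np.nan, 58.0, np.nan],
        "precip": [0.0, 0.1, np.nan, np.nan],
    },
    index=pd.date_range("2020-01-01", periods=4, freq="D"),
)

# Each NaN takes the last valid value above it in the same column
filled = toy.ffill()

# Count the cells that are still missing after the fill
remaining = int(filled.isna().sum().sum())

print(filled)
print("missing values left:", remaining)
```

Note that a forward fill has nothing to copy from if the very first row of a column is missing, so it is worth checking the remaining count before moving on.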

Now we start the predictions. The predictors in the data (snow, rain, sleet, temperature) are likely correlated with one another, so the data suffers from multicollinearity. Since this is the case, we use ridge regression. Ridge is linear regression with an added penalty on the size of the coefficients, which keeps them from becoming unstable when the predictors are correlated. The `alpha` parameter controls how strong that penalty is.

In [None]:
from sklearn.linear_model import Ridge

rr = Ridge(alpha=.1)

# we apply a ridge regression model because the data suffers from multicollinearity
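As a rough illustration of why ridge helps here, the sketch below fits plain linear regression and ridge on synthetic data where one predictor is nearly a copy of the other. The data is invented for this example and is not the weather dataset.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.default_rng(0)

# Two synthetic predictors: x2 is almost identical to x1 (strong collinearity)
x1 = rng.normal(size=200)
x2 = x1 + rng.normal(scale=0.01, size=200)
X = np.column_stack([x1, x2])

# The target depends only on x1, plus noise
y = 3 * x1 + rng.normal(scale=0.5, size=200)

ols = LinearRegression().fit(X, y)
ridge = Ridge(alpha=0.1).fit(X, y)

# Size of each coefficient vector; the ridge penalty shrinks it
ols_norm = float(np.linalg.norm(ols.coef_))
ridge_norm = float(np.linalg.norm(ridge.coef_))

print("OLS coefficients:  ", ols.coef_)
print("Ridge coefficients:", ridge.coef_)
print("coefficient norms (OLS, ridge):", ols_norm, ridge_norm)
```

With nearly duplicate predictors, plain regression is free to split the effect between them in extreme ways, while the penalty pulls the ridge coefficients toward smaller values.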

In [None]:
import pandas as pd

def backtest(weather, model, predictors, start=3650, step=90):
    all_predictions = []
    
    for i in range(start, weather.shape[0], step):
        train = weather.iloc[:i,:]
        test = weather.iloc[i:(i+step),:]
        
        model.fit(train[predictors], train["target"])
        
        preds = model.predict(test[predictors])
        preds = pd.Series(preds, index=test.index)
        combined = pd.concat([test["target"], preds], axis=1)
        combined.columns = ["actual", "prediction"]
        combined["diff"] = (combined["prediction"] - combined["actual"]).abs()
        
        all_predictions.append(combined)
    return pd.concat(all_predictions)

This function backtests a machine learning model: it repeatedly trains on past data and predicts the period that comes right after it.

Here is a breakdown of what the code does:
1. The function takes five parameters as input: weather (a DataFrame containing weather data), model (a machine learning model object), predictors (a list of predictor variables/columns from the weather DataFrame), start (an optional parameter specifying the starting index for the backtesting, default is 3650), and step (an optional parameter specifying the step size for each iteration of the backtesting, default is 90).

2. It initializes an empty list called all_predictions to store the predictions made during the backtesting.

3. It starts a for loop that iterates over the range of indices from start to the total number of rows in the weather DataFrame with a step size of step. This loop lets the backtesting run in multiple iterations, each with its own consecutive, non-overlapping test set.

4. Within each iteration of the loop, it splits the weather DataFrame into a training set and a test set. The training set (train) includes all rows from the beginning up to the current index i, while the test set (test) includes the rows from i to i+step.

5. The machine learning model (model) is trained using the predictor variables (predictors) from the training set (train) and the corresponding target variable (column) named "target" from the training set.

6. The trained model is then used to make predictions on the predictor variables from the test set (test[predictors]). The predictions are stored in the preds variable.

7. The predicted values are converted to a pandas Series (preds) with the same index as the test set (test.index).

8. The actual target values and the predicted values are concatenated together into a DataFrame (combined) using pd.concat(). The columns of the DataFrame are renamed to "actual" and "prediction".

9. A new column named "diff" is added to the DataFrame combined, which represents the absolute difference between the predicted and actual values.

10. The DataFrame combined is appended to the all_predictions list.

11. After the loop completes, all the prediction results from each iteration are concatenated together using pd.concat() and returned as the final result of the function.
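The way the loop in steps 3 and 4 slices the data can be sketched without any model at all. The helper `backtest_windows` below is hypothetical (it is not part of the notebook) and only reports which row positions each iteration would train and test on.

```python
def backtest_windows(n_rows, start=3650, step=90):
    """Return (train_end, test_end) row positions for each backtest iteration.

    Training uses rows [0, train_end); testing uses rows [train_end, test_end).
    """
    windows = []
    for i in range(start, n_rows, step):
        # iloc slicing stops at the end of the data, so the last window may be shorter
        test_end = min(i + step, n_rows)
        windows.append((i, test_end))
    return windows


# Hypothetical dataset size chosen so that the last window is cut short
windows = backtest_windows(n_rows=3900)
for train_end, test_end in windows:
    print(f"train rows [0, {train_end})  test rows [{train_end}, {test_end})")
```

Each test window starts where the previous one ended, and the training set grows by one step every iteration.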

In summary, this code performs a backtesting analysis by repeatedly training a machine learning model on a growing window of the weather data and making predictions on the test window that follows it. It collects the predicted values, actual values, and the absolute difference between them in each iteration and returns the concatenated result.
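Here is a sketch of how the function might be called and summarized. Everything about the data is an assumption: the `demo` DataFrame is random, the column names are made up, and the target is assumed to be the next day's `temp_max`. The function body is repeated only so the cell runs on its own.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import Ridge


def backtest(weather, model, predictors, start=3650, step=90):
    # Same logic as the function above, repeated so this cell is self-contained
    all_predictions = []
    for i in range(start, weather.shape[0], step):
        train = weather.iloc[:i, :]
        test = weather.iloc[i:(i + step), :]
        model.fit(train[predictors], train["target"])
        preds = pd.Series(model.predict(test[predictors]), index=test.index)
        combined = pd.concat([test["target"], preds], axis=1)
        combined.columns = ["actual", "prediction"]
        combined["diff"] = (combined["prediction"] - combined["actual"]).abs()
        all_predictions.append(combined)
    return pd.concat(all_predictions)


# Hypothetical stand-in for the cleaned weather data: 500 days, two predictors
rng = np.random.default_rng(0)
idx = pd.date_range("2020-01-01", periods=500, freq="D")
demo = pd.DataFrame(
    {"temp_max": rng.normal(60, 10, 500), "precip": rng.random(500)},
    index=idx,
)
demo["target"] = demo["temp_max"].shift(-1)  # assumed target: next day's temp_max
demo = demo.ffill()  # the last row has no next day, so fill it forward

# Smaller start and step than the defaults because the toy data is short
results = backtest(demo, Ridge(alpha=0.1), ["temp_max", "precip"], start=400, step=50)

# The mean of the "diff" column is the mean absolute error of the backtest
mae = results["diff"].mean()
print(results.head())
print("mean absolute error:", round(mae, 2))
```

Averaging the `diff` column gives a single number for comparing different predictor lists or `alpha` values.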