# Welcome to the Linear Regression Example!

## Workshop Steps
Now that you have opened the MyBinder environment and are reading this, you are already on the right track! Inside this environment, you will also find:
* sample scripts: This is a folder containing the base of the scripts that you will be working with to finish the exercise. Please look for the triple exclamation points (!!!) as that means that you are being asked to write some code to get things to work!
* README.md: This is just the README file you saw on the GitHub page.
* requirements.txt: This is a list of the required libraries that were installed upon startup.
* setup.ipynb: The file you are reading right now! Think of this as your home page.
* VW ID. 3 Pro Max EV Consumption.csv: The raw .csv file from Kaggle.com. Please note that we will need to move this into the data job's folder once we create it for a neater environment.

### Step 1: Explore VDK's Functionalities
A simple command such as "!vdk --help", which you can run in the cell below, gives an overview of everything VDK can do.


In [None]:
!vdk --help

### Step 2: Create a Data Job
Now that we have explored VDK's capabilities, let's create our data job. 

Keep in mind that we would like to have a sub-folder for the data job, so that our Streamlit script sits outside of it, in the main directory.

Based on the information above, try creating a data job named "linear-reg-data-job". You can choose any team name you want, but please create the job in the home directory, which is /home/jovyan. This will create a sub-folder for the data job.

Here's an example:

In [None]:
!vdk create -n linear-reg-data-job -t team-awesome -p /home/jovyan

### Step 3: Work Out the Data Job Template

Now that you have created a data job, please go inside the subfolder and set up the structure of your data job. Here's the general idea.

We want the data job to have four scripts:
* Let's have one Python script that reads in the data and strips its special characters and re-saves it.
* Let's have another Python script that reads in the fixed data and performs exploratory data analysis.
* Let's have a third Python script that reads in the data from the first script, cleans up the data, and gets it ready for model building and testing.
* Lastly, let's have a Python script that reads the data from the third script, builds a simple Linear Regression model, tests it, and saves it.

Each of these four scripts is present in the sample scripts subfolder. However, we've added some coding challenges inside them to make things fun! Let's move those four scripts to the data job subfolder. Please run the code below:

In [None]:
! mv "sample scripts/10_read_in_data.py" ~/linear-reg-data-job
! mv "sample scripts/20_explore_data.py" ~/linear-reg-data-job
! mv "sample scripts/30_process_data.py" ~/linear-reg-data-job
! mv "sample scripts/40_build_model.py"  ~/linear-reg-data-job

Let's move the raw CSV file to the data job's subfolder. It's not usually necessary, but it will create a sense of a neater working environment here. As such, please execute the code below:

In [None]:
! mv "VW ID. 3 Pro Max EV Consumption.csv" ~/linear-reg-data-job

Let's also delete the other template files that we will not be needing:
* The SQL script: our example does not do anything with SQL.
* The sample Python script: we have already moved four sample Python scripts, so we won't be needing this.
* README.md: We already have a README for the entire example, so we can get rid of this.
* requirements.txt: Each data job would need this file if the data job relies on external libraries that VDK does not have. In our case, MyBinder installed those upon startup, so we won't be needing this either.

As such, please run the code below to delete them:

In [None]:
! rm "linear-reg-data-job/10_sql_step.sql"
! rm "linear-reg-data-job/20_python_step.py"
! rm "linear-reg-data-job/README.md"
! rm "linear-reg-data-job/requirements.txt"

Great! Now you're all set up with the data job:
* You have created a data job.
* You have deleted the template files that you do not need.
* You have moved the sample scripts we provided to the data job sub-folder.
* You have moved the raw CSV file to the data job sub-folder for a neater environment!

The next step is to begin working on each script in the data job! Let's do it!

### Step 4: Data Job - Read in the Data and Strip Special Characters (10_read_in_data.py)

Please open up 10_read_in_data.py. Inside it, you will see the code template already populated. Let's explore.

First, we import all of the necessary libraries.

Then, we initialize a logging variable and change the directory to the script's location.

Then, we open up VDK's "run" function. This is how VDK knows that the following code will be part of its execution path, if you will.

<font color='red'>**ATTENTION!**</font>

Once we request that the log print out the file name's execution status (line 11), we want YOU to make an edit! Please see line 14. Here, we want you to enter the filename of the raw CSV file, so that it is stored in the 'filename_to_import' variable.

<font color='green'>**GOOD JOB!**</font>

Now that you have done this, we read in the data using Pandas' read_csv functionality and use an encoding to allow for the special characters. However, we will later want to strip them from the column names. Lines 25-29 do that by first:
* Getting rid of non-alphanumeric characters and replacing them with blanks.
* Stripping any leading or trailing whitespace.
* Replacing any spacing in between characters with an underscore.
* Converting all of the column names to lowercase.
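
As a rough illustration of these four steps, here is a minimal sketch on a toy data frame. The column names are made up for the example and are not taken from the workshop's CSV, and the regular expressions are one reasonable reading of the steps (non-alphanumeric characters are simply removed), not necessarily the exact code in 10_read_in_data.py.

```python
import pandas as pd

# Toy data frame with messy, hypothetical column names.
df = pd.DataFrame({" Trip Distance (km) ": [12.5], "Temperature Start [°C]": [18]})

df.columns = (
    df.columns
    .str.replace(r"[^0-9a-zA-Z ]", "", regex=True)  # remove non-alphanumeric characters (keep spaces)
    .str.strip()                                     # strip leading and trailing whitespace
    .str.replace(r"\s+", "_", regex=True)            # replace inner spacing with an underscore
    .str.lower()                                     # convert to lowercase
)

print(list(df.columns))
```

Chaining the `.str` methods keeps each cleaning step on its own line, which makes it easy to add or remove a step later.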

<font color='red'>**ATTENTION!**</font>

Please note that line 29 has a bit for you to do! Please enter the appropriate method to turn the column names to lowercase.

<font color='green'>**GOOD JOB!**</font>

Congratulations! You've finished the first script! You can actually execute the data job now, if you want. **Please make sure you save the script first!** The job will fail, as you have not finished making edits to the other three scripts, but you will be able to:
* Check whether the first script ran successfully.
* Observe the results from the first script.
* Learn to read error messages, as the other scripts will throw them as of now.

In [None]:
! vdk run linear-reg-data-job

### Step 5: Data Job - Exploratory Data Analysis (20_explore_data.py)

Judging from the output above, the first script ran successfully! The others... not so much. But that's only because we haven't made the edits to them yet! Let's continue with our second script.

With this script, we want to explore the data. Let's open up 20_explore_data.py and take a look at what's inside.

Again, we import the various libraries, initialize the logger, and change the directory.

Inside the "run" function, let's create a sub-folder within our data job folder to store the exploratory graphics and tables. That's done in lines 17 and 18. 

<font color='red'>**ATTENTION!**</font>

Look at the filename_to_import variable. It is not defined yet. We want YOU to do that. Please go ahead and make the necessary edit, so that the variable is set to the filename of the fixed data set.

<font color='green'>**GOOD JOB!**</font>

Once that's done, we read in our data and explore the structure of it. We send a few exploratory commands to the log for printing, so that we can observe them later.

We then take a look at the numerics. We want to create histograms for each numeric variable using Seaborn. Lines 34 to 53 do exactly that. They also save each histogram in the exploratory folder we created earlier in the program. 

Let's then turn to the categorical variables. For them, we want to create value counts. For example, if a categorical variable "color" contains the value "blue" in 14 rows, "red" in 12 rows, and "green" in 4 rows, we want a new table that only has the value (i.e., "blue", "red", "green") and the count of how many times it occurs in the column.
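
The "color" example can be reproduced with a small sketch like the one below. It uses pandas' `value_counts` on a toy column; the output column name "count" is our own choice for illustration and may differ from what 20_explore_data.py writes out.

```python
import pandas as pd

# Toy column matching the example: 14 blue, 12 red, 4 green.
df = pd.DataFrame({"color": ["blue"] * 14 + ["red"] * 12 + ["green"] * 4})

# value_counts() returns the counts sorted from most to least frequent;
# rename_axis/reset_index turn the result into a two-column table.
counts = df["color"].value_counts().rename_axis("color").reset_index(name="count")

print(counts)
```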

<font color='red'>**ATTENTION!**</font>

Let's first define the cat_cols variable. We want it to be a list of the categorical variables. One super easy way to do this is to take all the column names that are not in the num_cols variable. We want YOU to do that! Please head over to line 56 and make the edit so that it looks something like:

```
cat_cols = [i for i in df.columns if i not in xyz]
```

<font color='green'>**GOOD JOB!**</font>

Congratulations! You can run the data job again, if you'd like, and look at the status of the various scripts. If all is well, the first two scripts should show a success message, while the last two fail. Let's go finish those off now!

In [None]:
! vdk run linear-reg-data-job

### Step 6: Data Job - Processing The Data (30_process_data.py)

Let's now turn our attention to the third script, which will process the data for model building.

Again, we import the various libraries, initialize the logger, and change the directory.

<font color='red'>**ATTENTION!**</font>

Again, we want YOU to make the edit to the filename_to_import variable. Please enter the name of the fixed table.

<font color='green'>**GOOD JOB!**</font>

We then read in the data.

Let's drop the rows with missing values. There are many ways to deal with missing values (estimating them through taking the mean or median, or even performing a linear regression just for them), but let's stick with the simple choice of dropping them, as the loss of data is not that great.
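
A minimal sketch of the "drop the rows" choice, on a toy data frame with hypothetical column names. Printing how many rows are lost is a quick way to confirm that the loss of data really is small.

```python
import numpy as np
import pandas as pd

# Toy data with two incomplete rows (column names are illustrative only).
df = pd.DataFrame({
    "distance_km": [12.0, np.nan, 30.5, 8.2],
    "tire_type": ["summer", "winter", None, "summer"],
})

rows_before = len(df)
# Drop every row that has at least one missing value.
df_clean = df.dropna().reset_index(drop=True)

print(f"Dropped {rows_before - len(df_clean)} of {rows_before} rows")
```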

Linear regression only works with numeric variables. As such, if we want to use the categorical variables, we will need to turn them into numerics. Lines 26 to 40 do exactly that.
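
One common way to turn categorical variables into numerics is one-hot encoding. The sketch below uses pandas' `get_dummies` on a hypothetical "tire_type" column; this is an assumption for illustration, and 30_process_data.py may encode the variables differently.

```python
import pandas as pd

# Toy data with one categorical column (names are illustrative only).
df = pd.DataFrame({
    "distance_km": [12.0, 30.5, 8.2],
    "tire_type": ["summer", "winter", "summer"],
})

# Each category becomes its own 0/1 column, e.g. tire_type_summer.
encoded = pd.get_dummies(df, columns=["tire_type"], dtype=int)

print(encoded)
```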

Lines 43 to 44 create our dependent variable - the variable we will be trying to estimate: battery drainage.

Lines 45 and 46 create one more variable: temperature change. Who knows - it may have explanatory power!

Let's now limit the data set to the variables we want. This helps declutter.

It is always good practice, however, to look at the variables you created. Who knows - maybe you didn't see something with regard to the relationship between some of the variables. For example, temperature start and temperature end might look perfectly fine on their own, but if we calculate the change we may find some data entry error if we see that the temperature change was 40 degrees Celsius, for example. Let's look at the data.

We are definitely seeing one weird result: a negative battery drainage. That can't be. Let's remove the data point and save the data.
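
The derive, inspect, and filter pattern described above might look like the sketch below. The "charge_level_end" and "temperature_end_c" column names and the way battery drainage is computed are assumptions made for this toy example, so check them against the actual script.

```python
import pandas as pd

# Toy trips; the third one ends with more charge than it started with.
df = pd.DataFrame({
    "charge_level_start": [90, 80, 60],
    "charge_level_end": [70, 65, 75],
    "temperature_start_c": [10, 15, 20],
    "temperature_end_c": [12, 14, 20],
})

# Derived variables (assumed definitions).
df["battery_drain"] = df["charge_level_start"] - df["charge_level_end"]
df["temperature_change_c"] = df["temperature_end_c"] - df["temperature_start_c"]

# Look at the new variables before trusting them.
print(df[["battery_drain", "temperature_change_c"]].describe())

# Remove the physically implausible rows with a negative drain.
df = df[df["battery_drain"] >= 0].reset_index(drop=True)
```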

Awesome! We have now processed the data and it's ready to be modeled. Let's run the job again, just to make sure that all of the scripts function as they should!

In [None]:
! vdk run linear-reg-data-job

### Step 7: Data Job - Build the Model, Test the Model, Save the Model (40_build_model.py)

Great! It looks like everything went through, except for the fourth script. Let's turn to that now!

Again, we import the various libraries, initialize the logger, and change the directory.

<font color='red'>**ATTENTION!**</font>

Again, we want YOU to make the edit to the filename_to_import variable. Please enter the name of the table that was output by the third script.

<font color='green'>**GOOD JOB!**</font>

We then create a sub-folder within the data job's folder to house the model and model-related outputs.

We read in the data.

<font color='red'>**ATTENTION!**</font>

Let's now split the data into four chunks: x_train, y_train, x_test, and y_test. 

x denotes the independent variables or the predictor variables and y denotes the variable we are trying to estimate - i.e., battery drainage. What's the idea here?

Suppose that you build a really good model. How do you know it's good? By setting aside some testing data, you can build a model purely on the training data and THEN test that model on data it has not seen before. That way, you'll know how good the model is, because you have the testing data's actual dependent variable (y_test) and can measure it against your model's prediction (y_pred).

We show one way to split the data, but there are many, and we encourage you to read up more about this. We will split the data based on a pre-defined random state, so that the numbers are reproducible. We will take 20 percent of the data and put it aside as testing data.

Please make sure you change the test size parameter to 20 percent!

<font color='green'>**GOOD JOB!**</font>

Let's check out what this looks like in lines 39 through 47.
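
For reference, a split like this is commonly done with scikit-learn's `train_test_split`. The sketch below runs on a toy data frame with hypothetical columns; the random state of 42 is an arbitrary choice and may not match the one used in 40_build_model.py.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data set with ten rows (column names are illustrative only).
df = pd.DataFrame({
    "distance_km": range(10, 110, 10),
    "temperature_start_c": range(10),
    "battery_drain": range(2, 22, 2),
})

x = df.drop(columns=["battery_drain"])  # predictors
y = df["battery_drain"]                 # variable we want to estimate

# Hold out 20 percent of the rows for testing; fixing random_state
# makes the split reproducible.
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.2, random_state=42
)

print(len(x_train), len(x_test))
```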

We will also want to check that none of the independent variables (or predictors) are heavily correlated with one another. In other words, we want to check for multicollinearity. We can do this through a correlation plot. We save the chart in the "explore_data" subfolder that we created in the second script.

We see that there are some columns that are pretty strongly correlated with others. Those that appear to be heavily correlated with other main variables can be dropped from both the training and testing sets.
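
Besides reading the plot by eye, the same check can be sketched in code. The example below flags any column whose absolute correlation with an earlier column exceeds 0.9; both the threshold and the synthetic columns "a", "b", and "c" are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
a = rng.normal(size=50)

# "b" is almost a copy of "a" (scaled), while "c" is unrelated noise.
x_train = pd.DataFrame({
    "a": a,
    "b": a * 2 + rng.normal(scale=0.01, size=50),
    "c": rng.normal(size=50),
})

corr = x_train.corr().abs()

# Keep only the upper triangle so each pair is considered once.
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

# Flag columns that are highly correlated with an earlier column.
to_drop = [col for col in upper.columns if (upper[col] > 0.9).any()]

print(to_drop)
```

Whatever is dropped from x_train should be dropped from x_test as well, so that both sets keep the same columns.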

<font color='red'>**ATTENTION!**</font>

It is often good practice to perform feature selection, that is, to narrow down the features/predictors you want to use in your model. There are many ways to do this; one is Lasso regression. It penalizes the size of the coefficients and shrinks those of the least important predictors to 0, so we can take those predictors out.

Please make sure to normalize the data first! We've left a little coding challenge in there for you!
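
To make the idea concrete, here is a small sketch on synthetic data in which only two of three predictors matter. It scales the predictors with scikit-learn's `StandardScaler` (one possible choice of normalization) and fits a `Lasso` with an alpha of 0.1, which is an arbitrary value picked for this example and not necessarily the one used in the workshop script.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic predictors: y depends on x1 and x2 but not on x3.
x_train = pd.DataFrame(rng.normal(size=(200, 3)), columns=["x1", "x2", "x3"])
y_train = 3 * x_train["x1"] + 2 * x_train["x2"] + rng.normal(scale=0.1, size=200)

# Put all predictors on the same scale so the penalty treats them equally.
x_scaled = StandardScaler().fit_transform(x_train)

lasso = Lasso(alpha=0.1)
lasso.fit(x_scaled, y_train)

# Predictors whose coefficient was shrunk to exactly 0 are left out.
selected = [col for col, coef in zip(x_train.columns, lasso.coef_) if coef != 0]

print(dict(zip(x_train.columns, lasso.coef_.round(3))))
print(selected)
```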

<font color='green'>**GOOD JOB!**</font>

OK, let's drop those less important features from our explanatory data sets (x_train and x_test) and finally fit our model to the training data! We can then create predictions (y_pred) from the predictors' values in x_test and see how those predictions compare against y_test. Neat, huh?!

Lines 83 to 84 fit the model while lines 87 to 93 predict battery drainage (y_pred) using x_test, compare the actual (y_test) and the predicted (y_pred), and save the output.

Now that we have the true values of battery drain from y_test and the predicted battery drain from y_pred, we can get
some measures of model quality, like the mean squared error, mean absolute error, and R2! Lines 95 to 101 do that!
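
These three measures are available in scikit-learn's `metrics` module. The sketch below computes them for a handful of made-up actual and predicted values.

```python
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Made-up actual and predicted battery drain values.
y_test = [10, 20, 30, 40]
y_pred = [12, 18, 33, 39]

mse = mean_squared_error(y_test, y_pred)   # average of the squared errors
mae = mean_absolute_error(y_test, y_pred)  # average of the absolute errors
r2 = r2_score(y_test, y_pred)              # share of variance explained

print(f"MSE={mse:.3f}  MAE={mae:.3f}  R2={r2:.3f}")
```

Here the errors are 2, 2, 3, and 1 in absolute terms, so the mean absolute error is 2.0 and the mean squared error is 4.5.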

We can also extract the coefficients in a neat table. This isn't super necessary in every single case, but it could prove useful in model manipulation and exploration. Plus, it doesn't hurt, right?
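
A coefficient table of this kind could be assembled as in the sketch below, which fits scikit-learn's `LinearRegression` to toy data and lists the intercept first. The column names and the table layout are assumptions for illustration and may differ from the file the workshop script saves.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Toy training data that follows an exact linear relationship.
x_train = pd.DataFrame({
    "distance_km": [10, 20, 30, 40],
    "temperature_start_c": [5, 0, 10, 20],
})
y_train = 1 + 0.5 * x_train["distance_km"] - 0.1 * x_train["temperature_start_c"]

model = LinearRegression().fit(x_train, y_train)

# One row per term: the intercept first, then one row per predictor.
coef_table = pd.DataFrame({
    "feature": ["intercept", *x_train.columns],
    "coefficient": [model.intercept_, *model.coef_],
})

print(coef_table)
```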

Lastly, we save the model and we're done! Well... not quite. We still have to run the data job and make sure that it doesn't fail!

In [None]:
! vdk run linear-reg-data-job

You should now get a success message for every single script, as well as one for the entire data job above them. If so, congratulations! You have built your first data job!

### Step 8: Let's Build a Streamlit Visualization!

Now that we have finished with the data job, let's use that hard-earned model to make a cool dashboard!

First, let's move two more files from the sample scripts folder to the main folder.

In [None]:
! mv "sample scripts/build_streamlit_dashboard.py" ~/
! mv "sample scripts/parameters.py" ~/

Second, let's take a look at the coefficients from our model and build a parameters file that will hold the user-supplied values for our predictors. They will be fed to the model, and we'll have a prediction for whatever values a user gives us for the predictors! Neat!

In [None]:
! cat linear-reg-data-job/model/model_coefficients.csv

From here, you can see the various features/predictors that your model uses to predict battery drainage. We want to let the user decide what value to set for each of those predictors.

<font color='red'>**ATTENTION!**</font>

We can create a parameters.py file that will de-clutter our workspace and make things a bit easier. We have actually created this for you already. Open up parameters.py, take a look at the parameter keys (i.e., 'charge_level_start', 'temperature_start_c', etc.), and see if these match up with the output above. Make sure they are in the same order (with the exception of the intercept, which will be first in the model output above).

<font color='green'>**GOOD JOB!**</font>

We can now turn our attention to showcasing our model using Streamlit!

Let's open up the build_streamlit_dashboard.py file.

Notice that we do not have a "run" function, here. This is because the Streamlit dashboard is outside of VDK. We used VDK for our data job and we will now use the results from that data job to build a Streamlit dashboard. Cool, huh?

As per usual, we begin by importing some libraries and the parameters file we spoke about earlier in this section.

<font color='red'>**ATTENTION!**</font>

We now want to give our dashboard a title. We will want to:
* Showcase the model's predictive ability against y_test.
* Allow the user to enter predictors' values and use them to give them a predicted battery drainage.

As such, please give it an appropriate name.

<font color='green'>**GOOD JOB!**</font>

Now that we have the title, let's focus on the first section of our Dashboard - model showcase and quality measures:
* We set up the sub-titles.
* We change the directory.
* We read in the actual versus predicted output from the model building script.

<font color='red'>**ATTENTION!**</font>

Please enter the location of the model in line 19. You can use the "actual_vs_pred_loc" variable as a hint.

<font color='green'>**GOOD JOB!**</font>

Now that we've read everything in, let's create an additional variable (absolute difference between y_test and y_pred) and plot that entire data frame to the dashboard! This is done in lines 23 to 26.

Lines 27 to 37 calculate and plot the mean squared error, mean absolute error, and R2 of our model! It's that easy!

Now let's add the second section, which will allow the user to select the predictors' values that will feed into the
model and give them an estimated battery drain!

Lines 39 to 41 set up the sub-titles.

Lines 43 to 48 accept the user's inputs against the parameters file and save those in a dictionary, which is then turned into a data frame ("results"). We are essentially creating a mini (one row, in fact) x_test data set, if you think about it! We will feed this data set to our model, which will give us a prediction (a one row y_pred, if you think about it). How cool is that?

We read in the model in line 51 and obtain the estimate in line 54. And that's it! We let the user decide the values and we gave them an estimate based on those values!
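
The "one-row x_test" idea can be sketched without Streamlit, as below. In the dashboard the values would come from Streamlit input widgets and the model would be loaded from disk; here the inputs are hard-coded and a toy model is fitted in place, so everything except the two parameter keys is an assumption made for the example.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Stand-in for the saved model: fitted on toy data with two predictors.
x_train = pd.DataFrame({
    "charge_level_start": [90, 80, 70, 60],
    "temperature_start_c": [5, 10, 15, 25],
})
y_train = 0.2 * x_train["charge_level_start"] + 0.1 * x_train["temperature_start_c"]
model = LinearRegression().fit(x_train, y_train)

# Stand-in for the values a user would enter in the dashboard.
user_inputs = {"charge_level_start": 85, "temperature_start_c": 12}

# One-row data frame with the columns in the order the model was trained on.
results = pd.DataFrame([user_inputs])[list(x_train.columns)]

estimate = float(model.predict(results)[0])
print(f"Estimated battery drain: {estimate:.1f}")
```

Reordering the columns to match the training data is the reason the parameter keys need to line up with the model's coefficients.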

However, our model is quite simple, as of now. There are many things that we can do to improve it:
* A larger data set with more variation in the dependent variable and in the independent variables.
* An exploration into more flexible predictors, interaction terms, etc.
* An exploration into more flexible models.

The list goes on and on. In reality, your model will never be perfect, but there are ways to improve it. We encourage you to read up on these methods.

Because our model is still quite simple, it may sometimes present the user a negative estimate. Well, how can that be? Your battery can't get charged up while you are drifting on the highway, can it? We hope not, at least...

<font color='red'>**ATTENTION!**</font>

Please edit the elif statement so that the estimate is set to 0 if it is below 0.
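
The underlying idea, shown here as a standalone helper, is to never report a negative estimate. The function name `clamp_estimate` is hypothetical, and the dashboard script handles this inside its own if/elif chain, so treat this only as a sketch of the logic.

```python
def clamp_estimate(estimate: float) -> float:
    """Return the estimate, or 0.0 when the model predicts a negative drain."""
    if estimate < 0:
        return 0.0
    return estimate


print(clamp_estimate(-3.2))
print(clamp_estimate(12.5))
```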

<font color='green'>**GOOD JOB!**</font>

Congratulations! You have built your first Streamlit dashboard that even allows for a user to enter inputs! How cool is that?

As a last step, run the following code:

In [None]:
! streamlit run build_streamlit_dashboard.py

You will get some output, but the kernel will appear stuck. That's okay! Just open a new tab in your browser, copy the link of the MyBinder environment, delete everything after the "user/your-environment-name" part, and append "/proxy/8501/" in its place. The result should look something like this:
```
https://hub.gke2.mybinder.org/user/alexanderavramo-n-example-empty-zkd8q00p/proxy/8501/
```

The Streamlit dashboard will now show up!
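
If you would rather not edit the address by hand, the same transformation can be sketched with Python's standard library. The helper name `streamlit_proxy_url` and the "/lab/tree/setup.ipynb" ending of the sample address are assumptions for illustration; only the part of the path up to the user segment is kept.

```python
from urllib.parse import urlsplit, urlunsplit


def streamlit_proxy_url(notebook_url: str, port: int = 8501) -> str:
    """Keep the '/user/<name>' part of the address and append the proxy path."""
    parts = urlsplit(notebook_url)
    segments = parts.path.strip("/").split("/")
    user_path = "/" + "/".join(segments[:2])  # e.g. /user/<name>
    return urlunsplit((parts.scheme, parts.netloc, f"{user_path}/proxy/{port}/", "", ""))


url = "https://hub.gke2.mybinder.org/user/alexanderavramo-n-example-empty-zkd8q00p/lab/tree/setup.ipynb"
print(streamlit_proxy_url(url))
```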