# Welcome to the Linear Regression Example!

## Workshop Steps
Now that you have opened up the MyBinder environment and are reading this, you are already on the right track! Inside this environment,
you will also find:
* sample scripts: This is a folder containing the base of the scripts that you will be working with to finish the exercise. Please look for the triple exclamation points (!!!) as that means that you are being asked to write some code to get things to work!
* README.md: This is just the README file you saw on the GitHub page.
* requirements.txt: This is a list of the required libraries that were installed upon startup.
* setup.ipynb: The file you are reading right now! Think of this as your home page.
* VW ID. 3 Pro Max EV Consumption.csv: The raw .csv file from Kaggle.com. Please note that we will need to move this into the data job's folder once we create it for a neater environment.

### Step 1: Explore VDK's Functionalities
A single command, "!vdk --help", lists VDK's commands and options, which is all the information you need to get started.


```
!vdk --help
```

### Step 2: Create a Data Job
Now that we have explored VDK's capabilities, let's create our data job. 

Keep in mind that we would like to have a sub-folder for the data job, so that our Streamlit script is outside of it and in the main directory.

Based on the information above, try creating a data job titled "linear-reg-data-job". You can choose any team name that you want, but please create the job in the home directory, which is /home/jovyan. This will create a sub-folder for the data job.

Here's an example:

```
!vdk create -n linear-reg-data-job -t team-awesome -p /home/jovyan
```

### Step 3: Work Out the Data Job Template

Now that you have created a data job, please go inside the subfolder and set up the structure of your data job. Here's the general idea.

We want the data job to have four scripts:
* Let's have one Python script that reads in the data and strips its special characters and re-saves it.
* Let's have another Python script that reads in the fixed data and performs exploratory data analysis.
* Let's have a third Python script that reads in the data from the first script, cleans up the data, and gets it ready for model building and testing.
* Lastly, let's have a Python script that reads the data from the third script, builds a simple Linear Regression model, tests it, and saves it.

Each of these four scripts is present in the sample scripts subfolder. However, we've added some coding challenges inside of them to make things fun! Let's move those four scripts to the data job subfolder. Please run the code below:

```
! mv "sample scripts/10_read_in_data.py" ~/linear-reg-data-job
! mv "sample scripts/20_explore_data.py" ~/linear-reg-data-job
! mv "sample scripts/30_process_data.py" ~/linear-reg-data-job
! mv "sample scripts/40_build_model.py"  ~/linear-reg-data-job
```

Let's move the raw CSV file to the data job's subfolder. It's not strictly necessary, but it keeps the working environment tidier. Please execute the code below:

```
! mv "VW ID. 3 Pro Max EV Consumption.csv" ~/linear-reg-data-job
```

Let's also delete the other template files that we will not be needing:
* the SQL script: our example does not do anything with SQL
* the sample Python script: we already have moved four sample Python scripts, so we won't be needing this
* README.md: We already have a README for the entire example, so we can get rid of this
* requirements.txt: Each data job would need this file if the data job relies on external libraries that VDK does not have. In our case, MyBinder installed those upon startup, so we won't be needing this either.

As such, please run the code below to delete them:

```
! rm "linear-reg-data-job/10_sql_step.sql"
! rm "linear-reg-data-job/20_python_step.py"
! rm "linear-reg-data-job/README.md"
! rm "linear-reg-data-job/requirements.txt"
```

Great! Now you're all set up with the data job:
* You have created a data job.
* You have deleted the template files that you do not need.
* You have moved the sample scripts we provided to the data job sub-folder.
* You have moved the raw CSV file to the data job sub-folder for a neater environment!

The next step is to begin working on each script in the data job! Let's do it!

### Step 4: Data Job - Read in the Data and Strip Special Characters (10_read_in_data.py)

Please open up 10_read_in_data.py. Inside it, you will see the code template already populated. Let's explore.

First, we import all of the necessary libraries.

Then, we initialize a logging variable and change the directory to the script's location.

Then, we open up VDK's "run" function. This is how VDK knows that the following code will be part of its execution path, if you will.

<font color='red'>**ATTENTION!**</font>

Once we request that the log print out the file name's execution status (line 11), we want YOU to make an edit! Please see line 14. Here, we want you to enter the filename of the raw CSV file, so that it is stored in the 'filename_to_import' variable.

<font color='green'>**GOOD JOB!**</font>

Now that you have done this, we read in the data using pandas' read_csv function, with an encoding that allows for the special characters. However, we will later want to strip them from the column names. Lines 25-29 do that by:
* Removing non-alphanumeric characters.
* Stripping any leading or trailing whitespace.
* Replacing any spacing in between characters with an underscore.
* Converting all of the column names to lowercase.

<font color='red'>**ATTENTION!**</font>

Please note that line 29 has a bit for you to do! Please enter the appropriate method to turn the column names to lowercase.

<font color='green'>**GOOD JOB!**</font>
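
To make the four cleaning steps concrete, here is a minimal sketch on a toy DataFrame. The column names and the clean_column_names helper are made up for illustration and are not taken from 10_read_in_data.py. The regular expression is one reasonable reading of "non-alphanumeric", so the script's own pattern may differ.

```python
import pandas as pd


def clean_column_names(df):
    """Return a copy of df with simplified column names (hypothetical helper)."""
    cleaned = (
        df.columns
        .str.replace(r'[^0-9a-zA-Z\s]', '', regex=True)  # drop special characters
        .str.strip()                                       # trim leading/trailing spaces
        .str.replace(r'\s+', '_', regex=True)              # inner whitespace -> underscore
        .str.lower()                                       # lowercase everything
    )
    result = df.copy()
    result.columns = cleaned
    return result


# Made-up column names, only for demonstration.
toy = pd.DataFrame({' Distance [km] ': [12.5], 'Temperature (Start) °C': [18]})
cleaned_toy = clean_column_names(toy)
print(list(cleaned_toy.columns))  # ['distance_km', 'temperature_start_c']
```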

### Step 5: Data Job - Exploratory Data Analysis
Let's go back to the linear-reg-data-job sub-folder and make a copy of the 10_read_in_data.py file. Let's call the copy 20_explore_data.py.
Open up the copy and let's make a few changes. 

Because we will be making some charts and tables, we will need to import some of the libraries that we installed from the requirements.txt file earlier.
As such, please delete all the code before the run function and paste the following:
```
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import os
import logging
import pathlib
from vdk.api.job_input import IJobInput

logger = logging.getLogger(__name__)
os.chdir(pathlib.Path(__file__).parent.absolute())
```

Now, let's turn to the run function. We will write code that creates a sub-folder within the linear-reg-data-job sub-folder, which
will store all the exploratory graphics and tables. As such, delete everything inside the run function and paste this:
```
logger.info('executing program: ' + os.path.basename(__file__))

if not os.path.exists('explore_data'):
    os.makedirs('explore_data')
```

Let's now read in the fixed .CSV file. Please paste the following code in the run function below the code that created the explore_data sub-folder.
Please note, however, that we've left the filename_to_import variable [BLANK]. Please change that to the name of the fixed .CSV file.
```
filename_to_import = [BLANK]
df = pd.read_csv(filepath_or_buffer=filename_to_import, parse_dates=['date'])
```

Let's also write some code below that to explore the data. Make sure to write it to the log using
```
logger.info(df.info())
```
as an example. Try logging the head and the tail of the data. Get creative with it!
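
One thing to watch for: DataFrame.info() prints its summary and returns None, so passing it straight to logger.info may only log "None". A minimal sketch of one workaround follows, using the buf argument of info() to capture the text first. The toy DataFrame and the frame_info_as_text helper are illustrative assumptions, not part of the workshop scripts.

```python
import io
import logging

import pandas as pd

logger = logging.getLogger(__name__)


def frame_info_as_text(df):
    """Capture the output of df.info() as a string (hypothetical helper)."""
    buffer = io.StringIO()
    df.info(buf=buffer)  # write the summary into the buffer instead of stdout
    return buffer.getvalue()


# Made-up data, only for demonstration.
toy = pd.DataFrame({'distance_km': [10.0, 25.5], 'mode': ['ECO', 'NORMAL']})

info_text = frame_info_as_text(toy)
logger.info(info_text)
# head() and tail() return DataFrames, so they can be logged as text directly.
logger.info("First rows:\n%s", toy.head().to_string())
logger.info("Last rows:\n%s", toy.tail().to_string())
```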

At some point, however, we will need to define the numeric variables. Let's write the following code to store the column names of the numeric
variables.
```
num_cols = df.select_dtypes(include=np.number).columns.tolist()
```

Ok, now that we have the numeric columns, let's create a histogram for each one. Here is one way to do it using seaborn and matplotlib.
Try out some different functions. For example, try changing distplot to something else, such as histplot or kdeplot (distplot has been deprecated since seaborn 0.11).
```
sns.set(
    style='ticks',
    rc={
        'axes.spines.right': False,
        'axes.spines.top': False,
        'figure.figsize': (18, 14)
        }
    )
for num_col in num_cols:
    sns.distplot(df[num_col],
        bins=100).set(
            xlabel=num_col,
            ylabel='Count'
            )
    plt.savefig('explore_data/' + num_col + '.png')
    # plt.show()
    plt.clf()
```

Ok, now that we have created a histogram for each numeric variable and saved that histogram in the explore_data subfolder, let's turn our attention
to the categorical variables. Let's follow the same process as that above. In other words, let's first create a list of the names of the columns
that are categorical! We've left a little challenge for you in it: change [BLANK] to something that works!
```
cat_cols = [i for i in df.columns if i not in [BLANK]]
```

Let's now run a loop that calculates the value counts for each categorical variable. What does that mean? We want to see the values that occur
for each categorical variable and how often those values occur. We also want to save the result in an Excel file, where each worksheet within the
Excel file is a value count for each categorical variable. Here's the code to do that:

```
cat_writer = pd.ExcelWriter('explore_data/explore_categoricals.xlsx', engine='xlsxwriter')
for cat_col in cat_cols:
    temp = pd.DataFrame(
        df[cat_col].value_counts(dropna=False)
    )
    temp.to_excel(cat_writer, sheet_name=cat_col)
cat_writer.close()  # writes the file; ExcelWriter.save() was removed in pandas 2.0
```

After all this, let's run the data job again by re-running the "!vdk run..." command in the setup.ipynb.
This time, notice that it will run both scripts, one after the other, in alphabetical order. Let's go and check out the
results! Please head over to the explore_data sub-folder within the linear-reg-data-job folder!
Congrats!

### Step 6: Data Job - Processing The Data
Now that we have explored the data, we know what we want to do and what we must do. Let's head over to the linear-reg-data-job sub-folder and create
a copy of the last Python script. Rename it 30_process_data.py.

Let's delete everything above the run function, import our libraries and initialize the log:
```
import pandas as pd
import numpy as np
import logging
import pathlib
import os
from vdk.api.job_input import IJobInput

logger = logging.getLogger(__name__)
os.chdir(pathlib.Path(__file__).parent.absolute())
```

Now, inside the run function, delete everything. Let's define the names of the file we will read in and the file we will
save later, and then read in the data:
```
logger.info('executing program: ' + os.path.basename(__file__))

# some definitions...
filename_to_import = 'VW ID. 3 Pro Max EV Consumption_Fixed_Columns.csv'
filename_to_export = 'VW ID. 3 Pro Max EV Consumption_Model_Data.csv'

# reading in the data...
df = pd.read_csv(filepath_or_buffer=filename_to_import, parse_dates=['date'])
```

If you remember from the exploratory data analysis section, we had some missing values in our data set. There are many ways to deal with missing values,
but the simplest is to just drop observations where any column has a missing value. In our case, that doesn't drop too many data points, so let's go
ahead and do that:
```
df_no_nulls = df.copy().dropna()
```
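
If you want to confirm how many rows the drop removes, a quick before-and-after count helps. Here is a small sketch on made-up data (the values below are not from the Kaggle file):

```python
import numpy as np
import pandas as pd

# Made-up data with one missing value in each of two different rows.
toy = pd.DataFrame({
    'distance_km': [10.0, np.nan, 32.0, 8.5],
    'mode': ['ECO', 'NORMAL', None, 'ECO'],
})

toy_no_nulls = toy.dropna()  # drops every row that has at least one missing value
rows_dropped = len(toy) - len(toy_no_nulls)

print(f"Dropped {rows_dropped} of {len(toy)} rows")  # Dropped 2 of 4 rows
print(toy.isna().sum())  # number of missing values per column
```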

Linear regression only works with numeric variables. As such, if we want to use the categorical variables, we will need to turn them into numerics:
```
df_no_nulls['ac_use'] = np.where(df_no_nulls['ac_c'] == 'OFF', 0, 1)

df_no_nulls['heated_seats'] = np.where(df_no_nulls['heated_front_seats_level'] == 0, 0, 1)

df_no_nulls['eco_mode'] = np.where(df_no_nulls['mode'] == 'ECO', 1, 0)

df_no_nulls["is_summer"] = np.where((df_no_nulls['date'] >= '2021-06-21') & (df_no_nulls['date'] <= '2021-09-22'), 1, 0)

df_no_nulls['is_bridgestone_tyre'] = np.where(df_no_nulls['tyres'] == 'Bridgestone 215/45 R20 LM32', 1, 0)
```
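
If np.where is new to you: it evaluates a condition for every row and picks the second argument where the condition is true and the third where it is false. A toy sketch with made-up values:

```python
import numpy as np
import pandas as pd

# Made-up rows, only for demonstration.
toy = pd.DataFrame({
    'mode': ['ECO', 'NORMAL', 'ECO'],
    'date': pd.to_datetime(['2021-05-01', '2021-07-15', '2021-09-22']),
})

# 1 where the mode is ECO, 0 everywhere else.
toy['eco_mode'] = np.where(toy['mode'] == 'ECO', 1, 0)

# Two conditions combined with &; each side needs its own parentheses.
in_summer = (toy['date'] >= '2021-06-21') & (toy['date'] <= '2021-09-22')
toy['is_summer'] = np.where(in_summer, 1, 0)

print(toy[['eco_mode', 'is_summer']].values.tolist())  # [[1, 0], [0, 1], [1, 1]]
```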

Let's now create our dependent variable, the one we will be trying to estimate: battery drainage.
```
df_no_nulls['battery_drain'] = -(df_no_nulls['charge_level_end'] - df_no_nulls['charge_level_start'])
```

Lastly, let's also create one more variable: temperature change. Who knows - it may have explanatory power!
```
df_no_nulls['temperature_increase'] = df_no_nulls['temperature_end_c'] - df_no_nulls['temperature_start_c']
```

Let's now limit the data set to the variables we want. This helps declutter.
```
df_no_nulls_limited = df_no_nulls.copy()[
    [
        'battery_drain',
        'charge_level_start',
        'is_bridgestone_tyre',
        'temperature_start_c',
        'temperature_increase',
        'distance_km',
        'average_speed_kmh',
        'average_consumption_kwhkm',
        'ac_use',
        'heated_seats',
        'eco_mode',
        'is_summer'
    ]
]
```

It is always good practice, however, to look at the variables you created. Who knows - maybe you didn't see something with regard to the relationship
between some of the variables. For example, temperature start and temperature end might look perfectly fine on their own, but if we calculate the change
we may find some data entry error if we see that the temperature change was 40 degrees Celsius, for example. Let's look at the data.
```
logger.info(df_no_nulls_limited.describe())
logger.info(df_no_nulls_limited.loc[df_no_nulls_limited['battery_drain'] < 0])
```

We are definitely seeing one weird result: a negative battery drainage. That can't be. Let's remove the data point and continue:
```
df_no_nulls_limited_final = df_no_nulls_limited.copy().loc[
        df_no_nulls_limited['battery_drain'] >= 0]
logger.info(df_no_nulls_limited_final.describe()) 

df_no_nulls_limited_final.to_csv(
    path_or_buf=filename_to_export,
    index=False
)
```

Awesome! We have now processed the data and it's ready to be modeled. Let's run the job again, just to make sure that all of
the scripts function as they should! Remember, just go back to setup.ipynb and re-run the "!vdk run..." command.


### Step 7: Data Job - Build the Model, Test the Model, Save the Model
Ok, so let's create a copy of 30_process_data.py and name it 40_build_model.py. Open it up.

Let's delete everything above the run function again and paste the following:
```

import pandas as pd
import os
import seaborn as sns
import matplotlib.pyplot as plt
import logging
import pickle
import pathlib
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression, LassoCV
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
from vdk.api.job_input import IJobInput

logger = logging.getLogger(__name__)
os.chdir(pathlib.Path(__file__).parent.absolute())
```

Let's now turn our attention to the run function. Let's get rid of everything in there and begin building our code.
First, we're going to want to set the file name to import, create a sub-folder to house our model, and read in the data.
```
logger.info('executing program: ' + os.path.basename(__file__))

filename_to_import = 'VW ID. 3 Pro Max EV Consumption_Model_Data.csv'

if not os.path.exists('model'):
    os.makedirs('model')

df = pd.read_csv(filepath_or_buffer=filename_to_import)
```

Let's now split the data into four chunks: x_train, y_train, x_test, and y_test. x denotes the independent variables or the
predictor variables and y denotes the variable we are trying to estimate - i.e., battery drainage. What's the idea here?
Well, suppose that you build a really good model. Well, how do you know it's good? By setting aside some testing data,
you can create a model purely on the training data and THEN test that model on data it has not seen before. That way,
you'll know how good the model is because you actually have the testing data's dependent variable (y test) and can measure it
against your model's prediction. The code below shows one way to split the data, but there are many, and we encourage 
you to read up more about this. We will split the data based on a pre-defined random state, so that the numbers are
reproducible. We will take 20 percent of the data and put it aside as testing data:
```
y = df.copy()[['battery_drain']]
x = df.copy().drop('battery_drain', axis=1)
x_train, x_test, y_train, y_test = train_test_split(
   x,
   y,
   test_size=0.2,
   random_state=22
 )
```

Let's check out what this looks like:
```
data_sets = {
    "x_train": x_train,
    "x_test": x_test,
    "y_train": y_train,
    "y_test": y_test
}
for name, data_set in data_sets.items():
    logger.info(f"The shape of the {name} dataset is: {data_set.shape}")
```

We will want to check that none of the independent variables (or predictors) are heavily correlated with one another.
We can do this through a correlation plot:
```
sns.set(
    style='ticks',
    rc={'figure.figsize': (18, 14)}
)
sns.heatmap(
    x_train.corr(),
    annot=True,
    cmap=sns.diverging_palette(10, 250, n=240)
)
plt.savefig('explore_data/features_correlation.png')
# plt.show()
# Based on the plot, the following features need to be dropped:
# is_bridgestone_tyre, is_summer, and ac_use
plt.clf()
```
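
Reading a large heatmap by eye can be error-prone, so it may help to list the strongly correlated pairs as well. This is a sketch under assumptions: the 0.8 cutoff, the toy data, and the highly_correlated_pairs helper are all illustrative choices, not part of the workshop scripts.

```python
import numpy as np
import pandas as pd


def highly_correlated_pairs(features, threshold=0.8):
    """List (col_a, col_b, corr) for pairs whose absolute correlation exceeds threshold."""
    corr = features.corr()
    cols = corr.columns
    pairs = []
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):  # upper triangle only, so each pair appears once
            value = corr.iloc[i, j]
            if abs(value) > threshold:
                pairs.append((cols[i], cols[j], round(float(value), 3)))
    return pairs


# Made-up predictors: 'b' is nearly a scaled copy of 'a', 'c' is independent noise.
rng = np.random.default_rng(22)
base = rng.normal(size=100)
toy = pd.DataFrame({
    'a': base,
    'b': base * 2 + rng.normal(scale=0.01, size=100),
    'c': rng.normal(size=100),
})

pairs = highly_correlated_pairs(toy)
print(pairs)
```

With the real data, you would pass x_train instead of the toy DataFrame and compare the result with what you see in the plot.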

For those that appear to be heavily correlated with other main variables, we can drop them from both the training and the test sets:
```
predictive_data_sets = [x_train, x_test]
for data_set in predictive_data_sets:
    data_set.drop(['is_bridgestone_tyre', 'is_summer', 'ac_use'], axis=1, inplace=True)
```

It is often good practice to perform feature selection. That means narrowing down the features/predictors you want to use
in your model. There are many ways to do this, but one is called Lasso regression. It penalizes the coefficients of the
least important predictors in your model and shrinks them to 0. We can then take those predictors out. Please make sure to
normalize the data first! We've left a little [BLANK] in there for you!
```
lasso = LassoCV(normalize=[BLANK], random_state=22)
lasso.fit(
    x_train,
    y_train
)
lasso = pd.Series(lasso.coef_, index=x_train.columns)

features_to_delete = list(lasso[lasso == 0].index)
for data_set in predictive_data_sets:
    data_set.drop(features_to_delete, axis=1, inplace=True)
```
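
Depending on your scikit-learn version, LassoCV may reject the normalize argument, since it was deprecated and then removed in scikit-learn 1.2. In that case, one possible substitute is to scale the predictors with StandardScaler before fitting. Note that this is a swapped-in technique rather than an exact equivalent of normalize, and the toy data below is made up:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LassoCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Made-up data: the target depends on 'useful' only.
rng = np.random.default_rng(22)
toy_x = pd.DataFrame({
    'useful': rng.normal(size=200),
    'noise': rng.normal(size=200),
})
toy_y = 3.0 * toy_x['useful'] + rng.normal(scale=0.1, size=200)

# Scale the predictors first, then fit the cross-validated Lasso.
pipeline = make_pipeline(StandardScaler(), LassoCV(random_state=22))
pipeline.fit(toy_x, toy_y)

# make_pipeline names each step after its lowercased class name.
coefs = pd.Series(pipeline.named_steps['lassocv'].coef_, index=toy_x.columns)
features_to_delete = list(coefs[coefs == 0].index)
print(coefs)
print(features_to_delete)
```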

OK, let's finally fit our model to the training data! We can then create a prediction based on the predictors' values
from x_test. Then, we can see how those predictions compare against y_test. Neat, huh?!
```
linreg = LinearRegression()
linreg.fit(x_train, y_train)

y_pred = linreg.predict(x_test)
y_pred = pd.DataFrame(y_pred, columns=['battery_drain_prediction'])
actual_vs_predicted = pd.concat(
    [y_test.copy().reset_index(drop=True), y_pred.copy().reset_index(drop=True)],
    axis=1
)
actual_vs_predicted.to_csv('model/actual_vs_model_predicted_battery_drain_test.csv')
```

Now that we have the true values of battery drain from y_test and the predicted battery drain from y_pred, we can get
some measures of model quality, like the mean squared error, mean absolute error, and r squared!
```
measurements = {
    'mean squared error': mean_squared_error,
    'mean absolute error': mean_absolute_error,
    'R2': r2_score}
for measure, func in measurements.items():
    # scikit-learn metrics expect the true values first, then the predictions
    logger.info(f"The {measure} is: {func(y_test, y_pred)}")
```
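
A note on argument order: scikit-learn's metric functions take the true values first and the predictions second. The mean squared error comes out the same either way, but R2 does not, as this small sketch with made-up numbers shows:

```python
from sklearn.metrics import mean_squared_error, r2_score

# Made-up values: the last prediction is far off.
y_true = [1.0, 2.0, 3.0, 4.0]
y_hat = [1.0, 2.0, 3.0, 8.0]

# The mean squared error is symmetric in its two arguments.
mse_true_first = mean_squared_error(y_true, y_hat)
mse_swapped = mean_squared_error(y_hat, y_true)

# R2 divides by the spread of its FIRST argument, so the order changes the result.
r2_true_first = r2_score(y_true, y_hat)
r2_swapped = r2_score(y_hat, y_true)

print(mse_true_first, mse_swapped)
print(r2_true_first, r2_swapped)
```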

Let's also extract the coefficients of the model - who knows! Maybe we will need them!
```
coeff = pd.DataFrame(linreg.coef_).transpose()
inter = pd.DataFrame(linreg.intercept_).transpose()
inter_and_coeff = pd.concat(
    [inter, coeff],
    ignore_index=True
)
inter_and_coeff.columns = ['coefficients']
intercept = ['intercept']
intercept.extend(x_train.columns.to_list())
feature_names = pd.DataFrame(
    intercept,
    columns=['feature']
)
model_coeffs = pd.concat(
    [inter_and_coeff, feature_names],
    axis=1,
    join='outer'
)
model_coeffs.to_csv('model/model_coefficients.csv')
```

Finally, let's save the model!
```
filename = 'model/amld_linear_regression_model.sav'
with open(filename, 'wb') as model_file:
    pickle.dump(linreg, model_file)
```
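
To check that a saved model can be read back, you can load it and compare predictions. Here is a minimal round-trip sketch; the toy model, the temporary folder, and the file name are assumptions for illustration:

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.linear_model import LinearRegression

# Toy model fitted on y = 2x + 1.
x = np.array([[0.0], [1.0], [2.0], [3.0]])
y = 2.0 * x.ravel() + 1.0
model = LinearRegression().fit(x, y)

with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, 'toy_model.sav')  # hypothetical file name
    with open(path, 'wb') as model_file:
        pickle.dump(model, model_file)
    with open(path, 'rb') as model_file:
        restored = pickle.load(model_file)

# The restored model should predict exactly like the original one.
same = bool(np.allclose(model.predict(x), restored.predict(x)))
print(same)  # True
```

Keep in mind that pickle can execute code while loading, so only load model files that you created yourself or otherwise trust.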

Let's run the data job again!
If it all goes through, congratulations! You have built and run your first data job. No better feeling!


### Step 8: Let's Build a Streamlit Visualization!
Now that we have finished with the data job, let's use that hard-earned model to make a cool dashboard!

First, let's take a look at the coefficients from our model and build a parameters file that defines the input widgets
for our predictors. The values a user enters will be fed to the model, and we'll have a prediction for whatever
values a user gives us for the predictors! Neat!

Open up the model sub-folder inside linear-reg-data-job and find the model_coefficients.csv file. Take note of the feature
column, as that will show you which predictors made it into the model and which predictors we will need to add to the
parameters.py file. As such, let's go back to the main folder and create a new script. Title it 'parameters.py' and
input the following code inside it:
```
parameters = {
    'charge_level_start': dict(
        label="Select Value for the Starting Charge Level",
        value=80,
        max_value=100,
        min_value=1
    ),
    'temperature_start_c': dict(
        label="What's the Temperature Outside? In Celsius Please!",
        value=20,
        max_value=50,
        min_value=-50
    ),
    'distance_km': dict(
        label="How Many Kilometers Are We Going to Drive?",
        value=50,
        max_value=1000,
        min_value=1
    ),
    'average_speed_kmh': dict(
        label="What's the Average Speed We Expect (in Kilometers per Hour)?",
        value=60,
        max_value=300,
        min_value=1
    ),
    'average_consumption_kwhkm': dict(
        label="Any Guesses on the Average Consumption (in kwhkm)?",
        value=15,
        max_value=50,
        min_value=1
    ),
    'heated_seats': dict(
        label="Do We Plan on Using the Heated Seats? (1 = Yes, 0 = No)",
        value=1,
        min_value=0,
        max_value=1
    ),
    'eco_mode': dict(
        label="Do We Plan on Using the Eco Mode? (1 = Yes, 0 = No)",
        value=1,
        min_value=0,
        max_value=1
    )
}
```
This will help set up the user input interface and will bound the user to a min and max value for each predictor.
In other words, a user won't be able to say that they expect to drive at 10,000 km per hour.
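
Because these bounds are typed by hand, a quick sanity check can catch a default that falls outside its own range. Here is a sketch with a hypothetical check_parameters helper and a shortened, deliberately broken copy of the dictionary:

```python
def check_parameters(parameters):
    """Return the names whose default value lies outside [min_value, max_value]."""
    bad = []
    for name, spec in parameters.items():
        if not spec['min_value'] <= spec['value'] <= spec['max_value']:
            bad.append(name)
    return bad


# Shortened sample; the second entry is wrong on purpose.
sample_parameters = {
    'charge_level_start': dict(label="Starting charge", value=80, max_value=100, min_value=1),
    'distance_km': dict(label="Distance", value=5000, max_value=1000, min_value=1),
}

problems = check_parameters(sample_parameters)
print(problems)  # ['distance_km']
```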

Close out of the file and create a new Python script in the main folder. This will be the main script for the Streamlit
visualization. Title it "build_streamlit_dashboard.py".

Inside it, let's start with importing the libraries and adding titles:
```
import os
import pandas as pd
import pickle
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
import pathlib
import streamlit as st
from parameters import parameters

st.title('Electric Cars and Battery Drain Linear Regression Example')
```

Now, let's start the first section of our visualization, the model showcase and quality measures, by pointing the script at the files our data job produced:
```
os.chdir(pathlib.Path(__file__).parent.absolute())
actual_vs_pred_loc = 'linear-reg-data-job/model/actual_vs_model_predicted_battery_drain_test.csv'
model_loc = 'linear-reg-data-job/model/amld_linear_regression_model.sav'
```

Now, we'll read in the actual versus predicted data set that we created in our data job and visualize it in the dashboard:
```
actual_vs_pred = pd.read_csv(actual_vs_pred_loc, usecols=range(1, 3))
actual_vs_pred['absolute_difference'] = abs(
    actual_vs_pred['battery_drain'] - actual_vs_pred['battery_drain_prediction']
)
st.dataframe(actual_vs_pred)
battery_drain = actual_vs_pred[['battery_drain']]
battery_drain_prediction = actual_vs_pred[['battery_drain_prediction']]
```

Let's now visualize the model's quality metrics!
```
mse = round(mean_squared_error(battery_drain, battery_drain_prediction), 2)
mae = round(mean_absolute_error(battery_drain, battery_drain_prediction), 2)
r2  = round(r2_score(battery_drain, battery_drain_prediction), 2)

st.metric("The Mean Squared Error of This Model On This Testing Data Is:", mse)
st.metric("The Mean Absolute Error of This Model On This Testing Data Is:", mae)
st.metric("The R2 is:", r2)
```

Now let's add the second section, which will allow the user to select the predictors' values that will feed into the
model and give them an estimated battery drain!
```
st.header('How Much Will Your Electric Car Battery Drain? You May Be Surprised!')
st.write("Enter Your Custom Values in the SideBar - Please Enter Sensible Values Only!")


results = {}
for measurement, params in parameters.items():
    output = st.sidebar.number_input(**params)
    results[measurement] = output
results_df = pd.DataFrame(results, index=[0])


with open(model_loc, 'rb') as model_file:
    pickled_model = pickle.load(model_file)


estimate = pickled_model.predict(results_df)
```

Let's handle two edge cases. If the user enters inputs that generate a drain prediction higher than the initial charge,
let's cap the estimate at the starting charge and tell them so! And if the model somehow predicts a negative drain, let's set that to 0.
I know, an easy way out! :)
```
if estimate > results['charge_level_start']:
    estimate = results['charge_level_start']
    st.metric("Your Estimated Battery Drainage (in Percent) Is:", estimate)
    st.write("Note: The Model's Estimate Exceeds the Starting Level Charge; Thus Estimate is Capped")
elif estimate < 0:
    estimate = 0
    st.metric("Your Estimated Battery Drainage (in Percent) Is:", estimate)
else:
    st.metric("Your Estimated Battery Drainage (in Percent) Is:", estimate)
```
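
The same capping logic can also be written as a small function, which makes it easy to try out without starting Streamlit. This is a sketch, assuming the prediction comes back as a one-element NumPy array (as it does from predict on a single row); the clamp_estimate name is made up:

```python
import numpy as np


def clamp_estimate(raw_estimate, charge_level_start):
    """Return (estimate, was_capped) with the estimate kept within [0, charge_level_start]."""
    value = float(np.ravel(raw_estimate)[0])  # unwrap the single prediction
    if value > charge_level_start:
        return float(charge_level_start), True
    if value < 0:
        return 0.0, False
    return value, False


print(clamp_estimate(np.array([[120.0]]), 80))  # (80.0, True)
print(clamp_estimate(np.array([[-3.0]]), 80))   # (0.0, False)
print(clamp_estimate(np.array([[12.5]]), 80))   # (12.5, False)
```

Returning a plain float also keeps the displayed metric free of array brackets.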

Congratulations! You have built your first Streamlit dashboard that even allows for a user to enter inputs! How cool is that?

As a last step, go back to the setup.ipynb and type the following code:
```
!streamlit run build_streamlit_dashboard.py
```

You will get an output, but the cell will keep running, so the kernel will look stuck. That's okay! Just open a new tab in your browser,
copy the link of the MyBinder environment, delete everything after the "user/your-server-name" part, and append "/proxy/8501/".
You should end up with something like this:
```
https://hub.gke2.mybinder.org/user/alexanderavramo-n-example-empty-zkd8q00p/proxy/8501/
```
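
If you would rather not edit the address by hand, the rewrite can be sketched in a few lines of Python. This assumes the address has the form shown above, with the host followed by "/user/" and your server name. The part after the server name in sample_url and the proxy_url helper are made up for illustration:

```python
from urllib.parse import urlsplit, urlunsplit


def proxy_url(binder_url, port=8501):
    """Keep the scheme, host and /user/<server-name>, then append /proxy/<port>/."""
    parts = urlsplit(binder_url)
    segments = [s for s in parts.path.split('/') if s]
    if len(segments) < 2 or segments[0] != 'user':
        raise ValueError('expected a path starting with /user/<server-name>')
    path = f'/user/{segments[1]}/proxy/{port}/'
    return urlunsplit((parts.scheme, parts.netloc, path, '', ''))


# The trailing "notebooks/setup.ipynb" is a hypothetical example of what may follow.
sample_url = ('https://hub.gke2.mybinder.org/user/'
              'alexanderavramo-n-example-empty-zkd8q00p/notebooks/setup.ipynb')
print(proxy_url(sample_url))
```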

The Streamlit dashboard will now show up!