#Advanced EDA with Azure Databricks

In order to run this notebook you should have previously run the <a href="$./03 Basic EDA with Azure Databricks">Basic EDA with Azure Databricks</a> notebook to have everything prepared for this step.

You are now done with exploring the dataset feature by feature, which is the main block in an EDA.   
The slightly more advanced section consists of four parts:

* Creating a simple baseline model (the parsimonious model)
* One hot encoding and feature scaling
* Dimensionality reduction
* Estimate feature importance by training a random forest regressor

Since this is a lot of new material that we have not covered in depth yet, we have done most of the coding for you. Your job is to evaluate and understand the results.

###Creating a simple baseline model (the parsimonious model)

Load the clean version of the data.

Be sure to update the table name  "usedcars\_clean\_#####" with the unique name created while running the <a href="$./02.03 Basic EDA with Azure Databricks">Basic EDA with Azure Databricks</a> notebook.

In [6]:
import numpy as np
import pandas as pd

df = spark.sql("SELECT * FROM usedcars_clean_#####")

In this section we will train a parsimonious model, a basic model to get a sense of the predictive capability of our data. 

We are going to try and build a model that can answer the question "Can I afford a car that is X months old and has Y kilometers on it, given I have $12,000 to spend?"

The model will respond with a 1 (Yes) or a 0 (No).

In order to train a classifier, we need labels that go along with our used car features. The only features our model will be trained with are Age and KM. 

We will engineer the label for Affordable. Our logic will be simple: if the car costs less than $12,000 (our stated budget), then we will label that row in our data with a 1, meaning Yes, it is affordable. Otherwise we will label it with a 0.

The following cell will create a new Spark DataFrame that has our two desired features and the engineered label.

In [8]:
df_affordability = df.selectExpr("Age","KM", "CASE WHEN Price < 12000 THEN 1 ELSE 0 END as Affordable")
display(df_affordability)

While we could use matplotlib or ggplot to create a scatter plot of our data, the Azure Databricks notebook has a built-in way for us to plot the data from the DataFrame without writing any significant code, just by calling `display()` and passing it the DataFrame.

We've already configured the plot, so you just need to run the next cell. If you are curious as to the settings we used, select the Plot Options button that appears underneath the chart.

In [10]:
display(df_affordability)

**Challenge #1**

Given the above chart, at approximately what age does it look like we can start to afford a car irrespective of its distance driven?

**Training the classifier**

In this particular case, we have chosen to train our classifier using the LogisticRegression module from SciKit-Learn, since it's a good starting point for a model, especially when our data is not too large.

The LogisticRegression module does not understand Spark DataFrames natively. Given our small dataset, one option is to collect the data onto the driver node and then represent it using arrays. The following converts our Spark DataFrame into a Pandas DataFrame. Then the features (Age and KM) are stored in the X array and the labels (Affordable) are stored in the y array.

In [13]:
X = df_affordability.select("Age", "KM").toPandas().values
# ravel() flattens the labels to a 1-D array, the shape scikit-learn expects for y
y = df_affordability.select("Affordable").toPandas().values.ravel()

Run the next two cells to get a quick look at the resulting arrays:

In [15]:
X

In [16]:
y

Now one challenge we will face with the LogisticRegression is that it expects the inputs to be normalized. To make a long story short, if we were just to train the model using KM and Age without normalizing them to a smaller range around 0, then the model would give undue importance to the KM values because they are simply so much larger than the age (e.g., consider 80 months and 100,000 KM). 

To normalize the values, we use the StandardScaler, again from SciKit-Learn.

In [18]:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
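
As a rough illustration of what `StandardScaler` does with its default settings, the sketch below standardizes a small array by hand: each column has its mean subtracted and is then divided by its (population) standard deviation. The sample values are made up for illustration and are not taken from the used car dataset.

```python
import numpy as np

# Made-up sample: column 0 is age in months, column 1 is distance in KM
X_demo = np.array([[23.0, 46986.0],
                   [40.0, 72937.0],
                   [80.0, 100000.0],
                   [60.0, 38500.0]])

# Column-wise mean and population standard deviation (ddof=0)
col_mean = X_demo.mean(axis=0)
col_std = X_demo.std(axis=0)

# z-score: subtract the mean, then divide by the standard deviation
X_demo_scaled = (X_demo - col_mean) / col_std

print(X_demo_scaled.round(2))
print(X_demo_scaled.mean(axis=0).round(2))
print(X_demo_scaled.std(axis=0).round(2))
```

After this transformation both columns have a mean of 0 and a standard deviation of 1, so neither feature dominates simply because of its units.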

In the next line we look at the result of scaling. The first table of output shows the statistics for the original values. The second table shows the stats for the scaled values. Column 0 is Age and column 1 is KM.

In [20]:
print(pd.DataFrame(X).describe().round(2))
print(pd.DataFrame(X_scaled).describe().round(2))

**Challenge #2**

After scaling, what is the range of values possible for the KM feature?

Next we will train the model.

In [23]:
from sklearn import linear_model
# Create a Logistic Regression classifier (C controls the inverse regularization strength)
clf = linear_model.LogisticRegression(C=1)

# Fit the classifier to the scaled features and the labels
clf.fit(X_scaled, y)
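
To make the prediction step less of a black box, the sketch below shows how a fitted logistic regression turns two scaled inputs into a 0/1 answer: a weighted sum of the inputs is passed through the sigmoid function and the result is compared against 0.5. The weights, the intercept, and the helper names are hypothetical and chosen only for illustration; the parameters a fitted scikit-learn model actually learned are available in its `coef_` and `intercept_` attributes.

```python
import math

def sigmoid(z):
    """Map any real number to a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_affordable(age_scaled, km_scaled, weights, intercept):
    """Return (label, probability) for one already-scaled input row."""
    z = weights[0] * age_scaled + weights[1] * km_scaled + intercept
    probability = sigmoid(z)
    return (1 if probability > 0.5 else 0), probability

# Hypothetical parameters, NOT the ones learned by the model in this notebook
demo_weights = [2.0, 0.5]
demo_intercept = -1.0

label, prob = predict_affordable(1.2, 0.3, demo_weights, demo_intercept)
print(label, round(prob, 3))
```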

Now that we have a trained model, let's examine a feature of Azure Databricks notebooks that can help us play with the inputs to our model: widgets.

When you run the following cell, two new text inputs will appear near the top of this notebook. When you edit a value and move out of the input field, any cells that depend on that widget's value will be automatically re-run.

For now, run the following cell and observe the Age and Distance Driven widgets that appear. Notice they have been defaulted to Age of 40 months and Distance Driven of 40000 KM.

In [25]:
dbutils.widgets.text("Age","40", "Age (months)")
dbutils.widgets.text("Distance Driven", "40000","Distance Driven (KM)")

Now run the following cell. It will take as input the values you specified in the widgets, scale the values and then use our classifier to predict the affordability.

In [27]:
age = int(dbutils.widgets.get("Age"))
km = int(dbutils.widgets.get("Distance Driven"))

scaled_input = scaler.transform([[age, km]])
  
prediction = clf.predict(scaled_input)

print("Can I afford a car that is {} month(s) old with {} KM's on it?".format(age,km))
print("Yes (1)" if prediction[0] == 1 else "No (0)")

Experiment with changing the values for Age and Distance Driven by editing the values in the widgets. Notice that every time you edit a value and exit the input field, the above cell is re-executed (HINT: Look at the timestamp output that appears at the bottom of the above cell).

The above approach lets us experiment with one prediction at a time. But what if we want to score a list of inputs at once? The following cell shows how we could score all of our original features to see what our model would predict.

In [30]:
scaled_inputs = scaler.transform(X)
predictions = clf.predict(scaled_inputs)
print(predictions)

Now we can "grade" our model's performance using the accuracy measure. To do this we are effectively comparing what the model predicted versus what the label actually was for each row in our data. 

An easy way to do this is by using the `accuracy_score` method from SciKit-Learn.

In [32]:
from sklearn.metrics import accuracy_score
score = accuracy_score(y, predictions)
# Use the built-in round() so this works whether score is a Python float or a NumPy float
print("Model Accuracy: {}".format(round(score, 3)))
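
Accuracy is simply the fraction of rows where the prediction equals the label. The sketch below computes it by hand on a handful of made-up labels and predictions (not the output of our model) to show what the score means.

```python
# Made-up labels and predictions for illustration only
y_true_demo = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred_demo = [1, 0, 0, 1, 0, 1, 1, 0]

# Count the rows where the prediction agrees with the label
matches = sum(1 for actual, predicted in zip(y_true_demo, y_pred_demo)
              if actual == predicted)

# Accuracy is the share of matching rows
accuracy_demo = matches / len(y_true_demo)
print("Accuracy: {}".format(round(accuracy_demo, 3)))
```

Keep in mind that accuracy alone can look flattering when one class is much more common than the other, so it is worth checking how many rows carry each label.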

**Challenge #3**

What grade would you give your model based on this score alone? Assume an A is 90% or better, a B is 80%-90% and so on.

###One hot encoding and feature scaling

Until now we have not encoded the FuelType feature, but before we can use it as input to a model or to dimensionality reduction we need to apply one hot encoding. In Machine Learning literature, one hot encoding is defined as an approach to encode categorical features using a one-hot (aka one-of-K) scheme. In a nutshell, every distinct value of the categorical feature becomes a new column that is 0 everywhere except in rows where that value is present, where it is 1. This transforms categorical values into a form that Machine Learning algorithms can use more efficiently.

Running the next cell will store an encoded version of the dataset in a new dataframe called `df_ohe`.

In [36]:
df_ohe = df.toPandas().copy(deep=True)
df_ohe['FuelType'] = df_ohe['FuelType'].astype('category')
df_ohe = pd.get_dummies(df_ohe)

df_ohe.head(15)
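
To see the one-of-K scheme on something small enough to read at a glance, the sketch below encodes four made-up rows. The values are invented for illustration; only the column names mirror the dataset.

```python
import pandas as pd

# Made-up rows for illustration only
demo = pd.DataFrame({
    "KM": [46986, 72937, 41711, 48000],
    "FuelType": ["diesel", "petrol", "cng", "petrol"],
})

# Each distinct FuelType value becomes its own 0/1 column;
# dtype=int requests integer 0/1 values instead of booleans
demo_encoded = pd.get_dummies(demo, columns=["FuelType"], dtype=int)
print(demo_encoded)
```

Every row has exactly one of the three `FuelType_*` columns set to 1, while the numeric `KM` column passes through unchanged.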

To be prepared for any model in the modelling phase, we also make a scaled dataset.    
The code below makes a new dataframe called `df_ohe_scaled`

In [38]:
from sklearn.preprocessing import StandardScaler
# Use a separate scaler so the two-feature `scaler` used by the widget cell above is not overwritten
scaler_ohe = StandardScaler()
columns_to_scale = ['Age', 'KM', 'HP', 'CC','Weight']
df_ohe_scaled = df_ohe.dropna().copy()
df_ohe_scaled[columns_to_scale] = scaler_ohe.fit_transform(df_ohe_scaled[columns_to_scale])

df_ohe_scaled.head(15)

###Dimensionality reduction

Dimensionality reduction is the operation that transforms data with n dimensions (in the pandas world, n columns in the dataframe) into a representation of the data in m dimensions, where m is less than n. For visualizations we set m to 2 or 3.

To reduce the dimensionality of our dataset we use a method called Principal Component Analysis (PCA). With this method we can reduce the dimensionality in a way that preserves as much variance as possible. 

You can play around with the selection of features to see which features affect the PCA.   
You can also try the PCA using the dataframe we didn't scale to see how scale affects the transformation. 

What makes PCA interesting in the context of an EDA is that we can use it to explore the relationship between higher dimensional data and a response variable.

Below we send all features (not price) to the PCA to transform it to two dimensions. When we plot the two dimensional data and color it with price we get a graphical representation of the relationship between Price and all the features combined.

In [40]:
from sklearn.decomposition import PCA 
import matplotlib.pyplot as plt
import seaborn as sns

fig, ax = plt.subplots()

features = ['Age', 'KM', 'HP', 'Weight', 'CC', 'Doors',  'Automatic', 'MetColor', 'FuelType_cng', 'FuelType_diesel', 'FuelType_petrol']

x_2d = PCA(n_components=2).fit_transform(df_ohe_scaled[features])
sc = plt.scatter(x_2d[:,0], x_2d[:,1], c=df_ohe_scaled['Price'], s=10, alpha=0.7)
plt.colorbar(sc) 

display(fig)
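
To get a feel for what "preserving as much variance as possible" means, the sketch below performs the same kind of projection by hand with NumPy: it centers the data, takes a singular value decomposition, projects onto the first two principal directions, and reports the share of variance each direction captures. The data is synthetic and generated only for illustration. With scikit-learn, the corresponding numbers are exposed by a fitted `PCA` object through its `explained_variance_ratio_` attribute.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: 200 rows and 5 features driven by 2 hidden factors plus small noise
factors = rng.normal(size=(200, 2))
mixing = rng.normal(size=(2, 5))
data = factors @ mixing + 0.1 * rng.normal(size=(200, 5))

# Center each column, then decompose with SVD
centered = data - data.mean(axis=0)
U, S, Vt = np.linalg.svd(centered, full_matrices=False)

# Project the centered data onto the first two principal directions
data_2d = centered @ Vt[:2].T

# Share of the total variance captured by each component
explained_ratio = (S ** 2) / np.sum(S ** 2)
print(data_2d.shape)
print(explained_ratio[:2].round(3), explained_ratio[:2].sum().round(3))
```

If the first two ratios add up to a value close to 1, a two dimensional plot loses little information; if they add up to much less, the scatter plot hides a good part of the structure in the data.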

###Estimate feature importance by training a random forest regressor

The Random Forest model has a very valuable by-product. After training, it can provide a list of all features ranked by importance (we will bump into this concept again later in the workshop). By running the cell below you get one of these feature importance rankings.

__Question:__ Does the output with feature importance match what you experienced when exploring the dataset?

In [43]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

fig, ax = plt.subplots()

features_RFR = ['Age', 'KM', 'HP', 'Weight', 'CC', 'Doors', 'Automatic', 'MetColor', 'FuelType_cng', 'FuelType_diesel', 'FuelType_petrol']

# Create train and test data (.values replaces as_matrix(), which was removed in pandas 1.0)
X = df_ohe[features_RFR].values
# Take the target from df_ohe so that its rows stay aligned with X
y = df_ohe['Price'].values
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state =0)

# Initialize a random forest regressor
RandomForestReg = RandomForestRegressor()
# Train the model
RandomForestReg.fit(X_train, y_train)

imp = pd.DataFrame(
        RandomForestReg.feature_importances_ ,
        columns = ['Importance'] ,
        index = features_RFR
    )
imp = imp.sort_values( [ 'Importance' ] , ascending = True )
imp['Importance'].plot(kind='barh')

display(fig)

This concludes the Exploratory Data Analysis lab.

In this lab you investigated a dataset with sale prices in $ for used (second-hand) Toyota Corollas.   
During the lab you used a lot of the techniques we introduced in the presentation about EDA (Exploratory data analysis).

# Answers to Challenges

1. Somewhere between 40 and 50 months in age.
2. The scaled range for KM is -1.83 to 4.65. 
3. The percentage score is 92.6%, so this would get an A.