# Practice Exercise: Scikit-Learn 1
## Basic Modeling

### Objectives

In line with the [SK1 Tutorial](https://www.featureranking.com/tutorials/machine-learning-tutorials/sk-part-1-basic-modeling/), the objective of this practice notebook is to familiarize you with working with a regression problem using a `holdout` approach. The dataset under consideration is the `diamonds` dataset that comes with the `ggplot2` library in R.

The `diamonds` dataset contains information on diamonds including carat (numeric), clarity (categorical), cut (categorical), and color (categorical). The dataset has 10 features and 53940 instances. The objective is to predict the price of a diamond in USD given its attributes. 

### Exercise 0: Data Preparation

Prepare the dataset for predictive modeling as follows:

0. Refer to our data prep practice solutions on Canvas as well as our data prep script on GitHub [here](https://github.com/vaksakalli/datasets/blob/master/prepare_dataset_for_modeling.py) for some inspiration on preparing this data for predictive modeling.
1. Set `pd.set_option('display.max_columns', None)`. Read in the raw data `diamonds.csv` on GitHub [here](https://github.com/vaksakalli/datasets). Have a look at the shape and data types of the features. Also have a look at the top 5 rows.
2. Generate descriptive statistics for categorical and numerical features separately.
3. Have a look at the unique values for each categorical feature and check whether everything is OK in the sense that there are no unusual values.
4. Make sure there are no missing values anywhere.
5. Separate the last column from the dataset and set it to "target". Make sure "target" is a `Pandas` series at this point, not a `NumPy` array, because the sampling below relies on it being a series. Set all the other columns to be the "Data" data frame, which will be the set of descriptive features.
6. Make sure all categorical descriptive features are encoded via one-hot-encoding. In this particular dataset, some categorical descriptive features are actually ordinal, but we will go ahead and encode them via one-hot-encoding for simplicity.
7. Make sure all descriptive features are scaled via min-max scaling and the output is a `Pandas` data frame with correct column names. Do **NOT** scale the target feature!
8. Finally, have a look at the top 5 rows of "target" and "Data" respectively.
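
Steps 4 to 7 above can be sketched roughly as follows. This is only an illustration under stated assumptions: the tiny data frame is made up and stands in for `diamonds.csv`, and `pd.get_dummies` is just one of several ways to do one-hot-encoding.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Made-up stand-in for diamonds.csv (values are illustrative only)
df = pd.DataFrame({
    "carat": [0.23, 0.31, 0.90, 1.20],
    "cut": ["Ideal", "Premium", "Good", "Ideal"],
    "color": ["E", "E", "J", "G"],
    "price": [326, 335, 2757, 5000],
})

# Step 4: confirm there are no missing values
assert df.isna().sum().sum() == 0

# Step 5: the last column becomes the target (a Series); the rest is Data
target = df.iloc[:, -1]
Data = df.iloc[:, :-1]

# Step 6: one-hot encode every categorical column
Data = pd.get_dummies(Data, dtype=float)

# Step 7: min-max scale the descriptive features only, keeping column names
scaler = MinMaxScaler()
Data = pd.DataFrame(scaler.fit_transform(Data), columns=Data.columns)

# Step 8: inspect the results
print(Data.head())
print(target.head())
```

Wrapping the scaler output in `pd.DataFrame(..., columns=Data.columns)` is what preserves the column names, since `fit_transform` returns a `NumPy` array by default.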

### Exercise 1: Modeling Preparation

- Randomly sample 5000 rows (using a random seed of 999), as the full dataset is too big for a short demo. Make sure to run `reset_index(drop=True)` on the sampled data to reset the indices.
> - **NOTE:** It's **extremely** important to use the same seed for both Data and target while sampling. Otherwise, you will happily mix and match different rows without getting any execution errors, and all your results will be garbage.
- Split the sampled data as 70% training set and the remaining 30% test set using a random seed of 999. 
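
A minimal sketch of the sampling and splitting steps is given below. The synthetic `Data` and `target` objects are assumptions standing in for the prepared diamonds data, and the names `D_train`, `D_test`, `t_train`, `t_test` are illustrative choices.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-ins for the prepared Data and target
rng = np.random.default_rng(0)
Data = pd.DataFrame(rng.random((8000, 3)), columns=["a", "b", "c"])
target = pd.Series(1000 * Data["a"] + rng.normal(0, 10, 8000), name="price")

n_samples = 5000
seed = 999

# The same n and random_state on objects of equal length select the same rows
Data_sample = Data.sample(n=n_samples, random_state=seed)
target_sample = target.sample(n=n_samples, random_state=seed)

# Sanity check before the indices are reset: rows must still line up
assert Data_sample.index.equals(target_sample.index)

Data_sample = Data_sample.reset_index(drop=True)
target_sample = target_sample.reset_index(drop=True)

# 70% training and 30% test split
D_train, D_test, t_train, t_test = train_test_split(
    Data_sample, target_sample, test_size=0.3, random_state=seed
)
print(D_train.shape, D_test.shape)
```

Comparing the two indices before calling `reset_index(drop=True)` is a cheap way to catch mismatched rows, because the mismatch is no longer visible once the indices are reset.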

### Exercise 2

- Define a nearest neighbor (NN) regressor with $k=3$ neighbors using the Euclidean distance.
- Fit the model on the train data and evaluate its $R^2$ performance (the default `score()` for regressors) on the test data.
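
As a rough illustration of the fit-and-score pattern, the sketch below uses synthetic data in place of the diamonds sample. It assumes the Minkowski metric with `p=2`, which corresponds to the Euclidean distance in `KNeighborsRegressor`.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor

# Synthetic regression data standing in for the sampled diamonds data
rng = np.random.default_rng(999)
X = rng.random((500, 4))
y = 3000 * X[:, 0] + 500 * X[:, 1] + rng.normal(0, 50, 500)

D_train, D_test, t_train, t_test = train_test_split(
    X, y, test_size=0.3, random_state=999
)

# p=2 with the default Minkowski metric gives the Euclidean distance
knn = KNeighborsRegressor(n_neighbors=3, p=2)
knn.fit(D_train, t_train)

# score() returns R^2 for regressors
r2 = knn.score(D_test, t_test)
print(f"R^2 on test data: {r2:.3f}")
```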

### Exercise 3

- Extend Exercise 2 by fitting $k=1,\ldots,10$ neighbors using the Manhattan and Euclidean distances respectively.
- What is the optimal $k$ value for each distance metric? That is, at which $k$ does the NN regressor return the highest $R^2$ score?
- Which distance metric seems to be better? 
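
One possible way to organize the search is a double loop that collects scores in a data frame, sketched below on synthetic data. The `scores` and `best` names are illustrative, and `p=1` and `p=2` are assumed to select the Manhattan and Euclidean distances under the default Minkowski metric.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor

# Synthetic regression data standing in for the sampled diamonds data
rng = np.random.default_rng(999)
X = rng.random((500, 4))
y = 3000 * X[:, 0] + 500 * X[:, 1] + rng.normal(0, 50, 500)
D_train, D_test, t_train, t_test = train_test_split(
    X, y, test_size=0.3, random_state=999
)

results = []
for p, metric_name in [(1, "manhattan"), (2, "euclidean")]:
    for k in range(1, 11):
        model = KNeighborsRegressor(n_neighbors=k, p=p).fit(D_train, t_train)
        results.append(
            {"metric": metric_name, "k": k, "r2": model.score(D_test, t_test)}
        )

scores = pd.DataFrame(results)

# Row with the highest R^2 for each distance metric
best = scores.loc[scores.groupby("metric")["r2"].idxmax()]
print(best)
```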

### Exercise 4

- Fit a decision tree regressor with default values on the train data, and then evaluate its performance on the test data. 
- Does it perform better than the best NN model from Exercise 3?

### Exercise 5

- Fit a simple linear regression model on train data, and then evaluate its performance on the test data. **Hint:** Use `LinearRegression()` in `sklearn.linear_model`. 
- How does it compare to the previous models?

### Exercise 6

- Fit a random forest regressor with `n_estimators=100` on train data, and then evaluate its performance on the test data. 
- How does it compare to the previous models?
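
The decision tree, linear regression, and random forest models from Exercises 4 to 6 all follow the same fit-and-score pattern, so they can be compared in a single loop. The sketch below uses synthetic data, and the `random_state=999` arguments are an added assumption to make the tree-based results repeatable.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression data standing in for the sampled diamonds data
rng = np.random.default_rng(999)
X = rng.random((500, 4))
y = 3000 * X[:, 0] + 500 * X[:, 1] + rng.normal(0, 50, 500)
D_train, D_test, t_train, t_test = train_test_split(
    X, y, test_size=0.3, random_state=999
)

models = {
    "decision_tree": DecisionTreeRegressor(random_state=999),
    "linear_regression": LinearRegression(),
    "random_forest": RandomForestRegressor(n_estimators=100, random_state=999),
}

# Fit each model on the train data and record its R^2 on the test data
scores = {}
for name, model in models.items():
    model.fit(D_train, t_train)
    scores[name] = model.score(D_test, t_test)
    print(f"{name}: R^2 = {scores[name]:.3f}")
```

The ranking on this synthetic data says nothing about the ranking on the diamonds data; only the pattern carries over.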

### Exercise 7

- Predict the first 5 observations of the **test** data using the linear regression model you built earlier. 
- Display your results as a data frame with three columns: 'target', 'prediction', 'absolute_diff'.
- How do the predictions look?
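
A small sketch of how such a results data frame might be assembled is shown below. The data is synthetic, and the feature names `f1`, `f2`, `f3` are made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-ins for the sampled Data and target
rng = np.random.default_rng(999)
Data = pd.DataFrame(rng.random((200, 3)), columns=["f1", "f2", "f3"])
target = pd.Series(3000 * Data["f1"] + rng.normal(0, 50, 200), name="price")
D_train, D_test, t_train, t_test = train_test_split(
    Data, target, test_size=0.3, random_state=999
)

lr = LinearRegression().fit(D_train, t_train)

# First 5 test observations; to_numpy() drops the shuffled index labels
first5 = D_test.iloc[:5]
result = pd.DataFrame({
    "target": t_test.iloc[:5].to_numpy(),
    "prediction": lr.predict(first5),
})
result["absolute_diff"] = (result["target"] - result["prediction"]).abs()
print(result)
```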

**Further exposition**

Let's create histograms to visualize the difference between the predicted values from the linear regression model and the target values on both the training and test sets. How are the difference values distributed? Are they centered around zero? What are the minimum and maximum difference values for the training and test sets?

## Optional: Diagnostics of Regressors

Relying on $R^{2}$ alone to evaluate regressor performance is not sufficient. Sometimes, we also need to check whether the regressors generate reasonable predictions. In this case, we have to check whether the predicted diamond prices are positive (a negative price would imply that you would get the diamond for free, plus some extra cash!).

Let's predict on the training set with each model developed in the previous exercises. Create a data frame named `pred_result` consisting of four columns, one for each model's predictions. Then, run `pred_result.describe()` to check whether any model has a negative minimum value. Does the result surprise you? Will you get a similar result if you predict on the test set?
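
The mechanics of building `pred_result` could look roughly like the sketch below. The training data is synthetic with strictly positive targets, the column names are illustrative, and the hyperparameters are assumptions rather than the tuned values from the earlier exercises.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor

# Synthetic training data with strictly positive "prices"
rng = np.random.default_rng(999)
D_train = rng.random((300, 3))
t_train = 300 + 4000 * D_train[:, 0] ** 2 + rng.uniform(0, 100, 300)

models = {
    "knn": KNeighborsRegressor(n_neighbors=3),
    "decision_tree": DecisionTreeRegressor(random_state=999),
    "linear_regression": LinearRegression(),
    "random_forest": RandomForestRegressor(n_estimators=100, random_state=999),
}

# One column of training-set predictions per model
pred_result = pd.DataFrame({
    name: model.fit(D_train, t_train).predict(D_train)
    for name, model in models.items()
})

# The "min" row shows whether any model predicts a negative price
print(pred_result.describe())
```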


***
www.featureranking.com