# Problem Session 4
## More Regression

The problems in this notebook cover content from our `Regression` lectures, including:
- `Simple Linear Regression`,
- `A First Predictive Modeling Project`,
- `Multiple Linear Regression` and
- `Categorical Variables and Interactions`.

In [1]:
## We first load in packages we will need
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style("whitegrid")

#### 1. Preparing the data

In this notebook you will continue to model the ultimate selling price of various vehicles. First we have to load the data and repeat some of the cleaning we did in `Problem Session 3`.

##### a.

- Load the `car_sales.csv` data set again
- Remove the missing values
- Create the `log_sell` and `log_km` columns
- Clean the `mileage`, `engine` and `max_power` columns with `clean_column` and
- Create the `age` column.

In [2]:
def clean_column(text):
    ## keep the leading number of a "<number> <unit>" string and return it as a float
    return float(text.split()[0])

In [3]:
cars = pd.read_csv("Car_sales.csv")

cars.dropna(inplace=True)

In [4]:
cars['mileage'] = cars['mileage'].apply(clean_column)
cars['engine'] = cars['engine'].apply(clean_column)
cars['max_power'] = cars['max_power'].apply(clean_column)
cars['age'] = 2020 - cars['year']
cars['log_sell'] = np.log10(cars['selling_price'])
cars['log_km'] = np.log10(cars['km_driven'])

##### b.

Make the train test split using `sklearn`'s `train_test_split`.

In [5]:
## Import train test split here

from sklearn.model_selection import train_test_split

In [6]:
## Make the train test split
## call the training set cars_train
## call the test set cars_test

cars_train, cars_test = train_test_split(cars.copy(),
                                            test_size=.2,
                                            random_state=440,
                                            shuffle=True)


##### c.

If you need to, take a moment to refresh yourself on these data.

##### d.

Here is a variable summary for your convenience.

<u>Outcome Variable</u>
- `selling_price` or `log_sell` (you will use `log_sell` in your models)

<u>Continuous Features</u>
- `km_driven` and thus `log_km`
- `mileage`
- `engine`
- `max_power`
- `seats`
- `age`

<u>Categorical Features</u>
- `fuel`
- `seller_type`
- `transmission`
- `owner`

You will ignore `torque` because it would require more cleaning than we will spend time on in these problem sessions.

#### 2. More EDA

In `Problem Session 3` you examined potential linear relationships with `log_sell` and:
- `log_km`,
- `mileage` and
- `age`.

In this notebook you will examine potential effects of the various categorical variables listed above.

##### a. 

One way to examine whether a categorical variable has an impact on an outcome variable is to compare the mean or median of the outcome variable among the different categories.

Use `pandas` `groupby`, <a href="https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.groupby.html">https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.groupby.html</a>, to examine the mean or median of `log_sell` by the four categorical features listed above.
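
As a rough illustration of the `groupby` pattern described above, here is a minimal sketch on a small made-up data frame. The data frame `toy` and its values are hypothetical stand-ins; only the column names `fuel` and `log_sell` come from this notebook.

```python
import pandas as pd

## Small made-up data frame; the column names mirror cars_train,
## but the values are illustrative only
toy = pd.DataFrame({
    "fuel": ["Diesel", "Petrol", "Diesel", "Petrol", "Diesel"],
    "log_sell": [5.8, 5.4, 6.0, 5.5, 5.9],
})

## Mean, median and group size of the outcome within each category
fuel_summary = toy.groupby("fuel")["log_sell"].agg(["mean", "median", "count"])

print(fuel_summary)
```

Including the group size alongside the mean and median can help flag categories with very few observations, whose summaries are less reliable.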

In [None]:
## code here


In [None]:
## code here


In [None]:
## code here


In [None]:
## code here


In [None]:
## code here


##### b.

Another way to investigate the potential impact of categorical variables is to make plots examining the distribution of the outcome variable for each category. Two common choices are box and whisker plots and violin plots. These can be made quickly using `seaborn`'s `boxplot`, <a href="https://seaborn.pydata.org/generated/seaborn.boxplot.html">https://seaborn.pydata.org/generated/seaborn.boxplot.html</a>, and `violinplot`, <a href="https://seaborn.pydata.org/generated/seaborn.violinplot.html">https://seaborn.pydata.org/generated/seaborn.violinplot.html</a>, functions.

Below you will see an example of both plot types. Each plots the training distribution of `log_sell` against `fuel`. After that, make either a box plot or a violin plot for each of the remaining three categorical variables.

In [None]:
## Boxplot for fuel
plt.figure(figsize=(6,5))

sns.boxplot(data = cars_train,
               y = 'log_sell',
               x = 'fuel')

plt.yticks(fontsize=10)
plt.xticks(fontsize=10)

plt.ylabel("log_sell", fontsize=12)
plt.xlabel("fuel", fontsize=12)

plt.show()

In [None]:
## violinplot for fuel
plt.figure(figsize=(6,5))

sns.violinplot(data = cars_train,
               y = 'log_sell',
               x = 'fuel')

plt.yticks(fontsize=10)
plt.xticks(fontsize=10)

plt.ylabel("log_sell", fontsize=12)
plt.xlabel("fuel", fontsize=12)

plt.show()

In [None]:
## code here



In [None]:
## code here



In [None]:
## code here



In [None]:
## code here



In [None]:
## code here



##### c.

Write down any thoughts you have about potentially important categorical variables from your investigations in <i>a.</i> and <i>b.</i> here.

##### Write here




##### d.

From your investigations above you likely noticed that there does seem to be a difference in selling price between vehicles sold by an individual and those sold by some kind of dealer, but the kind of dealer does not seem to matter.

Create a new column in the data set called `dealer` that is `1` if the `seller_type` is a kind of dealership and `0` otherwise.
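
One possible way to build such an indicator column is sketched below on made-up data. The labels `"Individual"`, `"Dealer"` and `"Trustmark Dealer"` are assumptions made for illustration; check the actual labels (for example with `cars_train['seller_type'].unique()`) before adapting this.

```python
import pandas as pd

## Made-up seller types; the labels here are assumed, not taken from the data
toy = pd.DataFrame({
    "seller_type": ["Individual", "Dealer", "Trustmark Dealer", "Individual"]
})

## Assumption: every label other than "Individual" is a kind of dealership
## The comparison gives True/False, which astype(int) turns into 1/0
toy["dealer"] = (toy["seller_type"] != "Individual").astype(int)

print(toy)
```

If you create the column after the train test split, remember to create it on both `cars_train` and `cars_test`.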

In [None]:
## Code here


In [None]:
## Code here


##### e.

It appears that different `owner` categories lead to different selling prices. However, it is difficult to tell if this variable is capturing a new signal, or is just reflecting a possible relationship between selling price and the age of the vehicle.

To see what is meant by this statement, make a box and whisker plot with `age` on the vertical axis and `owner` on the horizontal axis. Then make a column called `owner_number` that is `1` when `owner` is `"First Owner"`, `2` when `owner` is `"Second Owner"`, and so on. Calculate the Pearson correlation between `age` and `owner_number`.
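
The mapping and correlation steps might look like the following sketch on made-up data. The `"Third Owner"` label and the dictionary `owner_map` are assumptions for illustration; the real data may contain other labels, and any label missing from the dictionary becomes `NaN` under `map`.

```python
import numpy as np
import pandas as pd

## Made-up owners and ages, for illustration only
toy = pd.DataFrame({
    "owner": ["First Owner", "Second Owner", "First Owner",
              "Third Owner", "Second Owner"],
    "age": [2, 6, 3, 10, 7],
})

## Hypothetical mapping from owner label to owner count
owner_map = {"First Owner": 1, "Second Owner": 2, "Third Owner": 3}
toy["owner_number"] = toy["owner"].map(owner_map)

## Pearson correlation two ways: pandas (Pearson is the default) and numpy
r = toy["age"].corr(toy["owner_number"])
r_np = np.corrcoef(toy["age"], toy["owner_number"])[0, 1]

print(r, r_np)
```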

In [None]:
## Code here


In [None]:
## Code here


In [None]:
## Code here


The takeaway here is that there is a fair amount of correlation between the age of a vehicle and the number of owners the vehicle has had. From the previous notebook we already know that `log_sell` and `age` have a pretty strong correlation, so it makes sense that `owner` and `log_sell` would also be correlated.


From a predictive modeling standpoint, this means that including `owner` as a categorical feature in a model that also includes `age` may not improve the model as much as we originally thought.

#### 3. Selecting categorical variables to consider

##### a. 

Using your work in Problem 2, choose some combination of the four categorical variables to add to this model:

$$
\log \left( \text{Selling Price} \right) = \beta_0 + \beta_1 \text{Age} + \epsilon
$$

##### Write here




##### b.

Make any dummy variables you need given the categories you chose in 3. <i>a.</i>
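
As one possible approach, `pandas`' `get_dummies` can build the indicator columns. The sketch below uses made-up `fuel` values and keeps all but one category, so that the dropped category (here `"CNG"`, an arbitrary choice) acts as the baseline.

```python
import pandas as pd

## Made-up data; the fuel labels are illustrative only
toy = pd.DataFrame({
    "fuel": ["Diesel", "Petrol", "CNG", "Petrol"],
    "age": [3, 5, 8, 2],
})

## One 0/1 column per category
fuel_dummies = pd.get_dummies(toy["fuel"], dtype=int)

## Keep all but one category; the omitted one ("CNG") is the baseline
toy = toy.join(fuel_dummies[["Diesel", "Petrol"]])

print(toy)
```

Leaving one category out avoids perfect collinearity with the intercept in a linear regression.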

In [None]:
## code here



In [None]:
## code here



In [None]:
## code here



#### 4. Comparing models

##### a.

Write out the model(s) you will compare to:

$$
\log \left( \text{Selling Price} \right) = \beta_0 + \beta_1 \text{Age} + \epsilon
$$

using cross-validation below.
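
For instance, assuming you chose to add the `dealer` indicator from 2. <i>d.</i> (one possible choice among many), a candidate model could be written as:

$$
\log \left( \text{Selling Price} \right) = \beta_0 + \beta_1 \text{Age} + \beta_2 \text{Dealer} + \epsilon
$$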

##### Write here





##### b.

Fill in the missing code below to perform 5-fold cross-validation to compare all of the models from 4. <i>a.</i>

In [None]:
from sklearn.model_selection import 
from sklearn.linear_model import 
from sklearn.metrics import 

In [None]:
## Make a KFold object
## remember to set a random_state and set shuffle = True
kfold = KFold()




## This array will hold the mse for each model and split
mses = np.zeros()

## sets a split counter
i = 0

## loop through the kfold here
for train_index, test_index in :
    ## cv training set
    cars_tt = 
    
    ## cv holdout set
    cars_ho = 
    
    ## Make and fit your models here
    
    i = i + 1

In [None]:
## Calculate the average mse here
np.mean(mses, axis=1)
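
For reference, here is a minimal sketch of the same cross-validation pattern on synthetic data. Everything in it (the generated `age`, `dealer` and `y` arrays, the `feature_sets` list, and the two models compared) is an assumption made for illustration, not the intended solution for the car data.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold

## Synthetic stand-in data, not the car sales data
rng = np.random.default_rng(440)
n = 200
age = rng.uniform(0, 20, size=n)
dealer = rng.integers(0, 2, size=n)
y = 6 - 0.05 * age + 0.2 * dealer + rng.normal(0, 0.1, size=n)

X = np.column_stack([age, dealer])

## Column indices used by each model
## model 0: age only, model 1: age and dealer
feature_sets = [[0], [0, 1]]

kfold = KFold(n_splits=5, shuffle=True, random_state=440)

## One row per model, one column per split
mses = np.zeros((len(feature_sets), 5))

for i, (train_index, test_index) in enumerate(kfold.split(X)):
    for j, cols in enumerate(feature_sets):
        reg = LinearRegression()
        ## fit on the cv training rows, predict on the cv holdout rows
        reg.fit(X[train_index][:, cols], y[train_index])
        pred = reg.predict(X[test_index][:, cols])
        mses[j, i] = mean_squared_error(y[test_index], pred)

## Average over the splits (axis=1) to get one value per model
avg_mses = np.mean(mses, axis=1)
print(avg_mses)
```

With the array laid out as models by splits, `axis=1` averages across splits, which matches the `np.mean(mses, axis=1)` call above.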

##### c. 

Recall that we ultimately care about predicting the final selling price, not the logarithm of the final selling price. Copy and paste your cross-validation code and alter it slightly so that you compare the prediction of:

$$
10^{\log (\text{Selling Price})}
$$

to the actual selling price. Look at the root mean squared error which is in the original units of `selling_price`.
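
The back-transformation step can be sketched as follows, assuming the model was fit on a base 10 logarithm as in `log_sell`. The arrays `actual_price` and `pred_log` are made-up numbers for illustration only.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

## Made-up actual prices and made-up predictions on the log10 scale
actual_price = np.array([100000.0, 250000.0, 500000.0])
pred_log = np.array([5.0, 5.5, 5.6])

## Undo the base 10 logarithm to get predictions in the original units
pred_price = np.power(10, pred_log)

## Root mean squared error in the units of selling_price
rmse = np.sqrt(mean_squared_error(actual_price, pred_price))

print(rmse)
```

Taking the square root of the mean squared error by hand works regardless of the `sklearn` version installed.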


In [None]:
## code here





In [None]:
## code here





In [None]:
## code here





##### d.

What do you think about these model performances? Do you think these models are good?

##### Write your thoughts here




##### e.

What else do you think could be done to improve model performance?

##### Write your thoughts here




##### f.

If you have time you can use this space to try additional models.

--------------------------

This notebook was written for the Erdős Institute Cőde Data Science Boot Camp by Matthew Osborne, Ph.D., 2023.

Any potential redistributors must seek and receive permission from Matthew Tyler Osborne, Ph.D. prior to redistribution. Redistribution of the material contained in this repository is conditional on acknowledgement of Matthew Tyler Osborne, Ph.D.'s original authorship and sponsorship of the Erdős Institute as subject to the license (see License.md).