This is my first practice Kernel in Data Science. My main goal is to use the basic Statsmodels API to evaluate some simple regression models. Later, I build on what I've learned with the Scikit-Learn API, which I find friendlier and more flexible.

I hope this contributes to the Data Science community. Feel free to help make this Kernel better by commenting and suggesting improvements.

# Workflow
1. Load relevant libraries
2. Problem Definition
3. Data acquisition
4. Target variable inspection
5. Features inspection
6. Exploratory Data Analysis
    - 6.1. Univariate
    - 6.2. Bivariate
7. Modelling
    - 7.1. Statsmodels
    - 7.2. Scikit-Learn

# 1. Load relevant libraries

In [None]:
# basic libraries for data acquisition, handling and visualization
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import os

# libraries for modelling
import statsmodels.api as sm
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.preprocessing import PolynomialFeatures
from sklearn.preprocessing import FunctionTransformer
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from sklearn.model_selection import cross_val_score, cross_validate
from sklearn.linear_model import Ridge, Lasso, ElasticNet
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.metrics import explained_variance_score, r2_score
from sklearn.metrics import mean_absolute_error, mean_squared_error

# 2. Problem definition
We want to predict diamond price (continuous, numerical) from certain measurements (features), using prices that are already available. It's a supervised regression task.

Questions we may want to answer:
1. Is there a relationship between the predictors and price?
2. If so, how strong is this relationship?
3. Which predictors seem to have greater impact?
4. Can we predict price with the predictors available?

# 3. Data Acquisition

In [None]:
os.listdir('../input')

In [None]:
diam = pd.read_csv('../input/diamonds.csv', index_col=0)

In [None]:
diam.head()

I create a separate DataFrame, `data`, so that the changes I perform are recorded on it while one version of the data stays untouched.

In [None]:
data = diam.copy()
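
To see why the copy matters, here is a minimal sketch on a toy DataFrame (the frame and its values are made up for illustration, they are not taken from the dataset). Editing the copy leaves the original untouched:

```python
import pandas as pd

# toy frame standing in for `diam` (hypothetical values)
original = pd.DataFrame({'x': [0.0, 4.1], 'y': [3.9, 4.0]})
working = original.copy()  # deep copy by default

# edit the working copy only
working.loc[0, 'x'] = working.loc[0, 'y']

print(original.loc[0, 'x'])  # raw value is preserved
print(working.loc[0, 'x'])   # corrected value lives in the copy
```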

# 4. Target variable inspection

It's a continuous variable. A few points we may want to check:
1. Are there missing values?
2. Are there absurd values, i.e., negative, zero, strings, data type problems etc.?
3. Are there outliers?
4. How is it distributed over the range of values? Does it seem to follow a particular distribution?

We begin by taking a look at some 10 random observations:

In [None]:
diam.sample(10)

In [None]:
# check for null values on target variable
diam.price.isnull().any()

In [None]:
print("We have {:.0f} priced diamonds.".format(diam.price.count()))

Checking the data type and some statistics.

In [None]:
diam.price.describe()

`describe()` returns only numeric statistics (its output has `dtype` `float64`), so the column holds only numeric entries.

What about the values themselves? Some very low and some very high. We follow with a visual inspection on the distribution of prices.

In [None]:
f, ax = plt.subplots(ncols=2, figsize=(12,6))
sns.boxplot(y='price', data=diam, ax=ax[0])
sns.boxenplot(y='price', data=diam, ax=ax[1])
plt.show()

In [None]:
f, ax = plt.subplots(ncols=2, figsize=(10,5))
sns.kdeplot(diam.price, color='b', shade=True, ax=ax[0])
sns.kdeplot(diam.price, color='r', shade=True, bw=100, ax=ax[1])

ax[0].set_title('KDE')
ax[1].set_title('KDE, bandwidth = 100')

plt.show()

What we learn:
i. No missing values.

ii. No data type errors or typos.

iii. A seemingly large number of values lie beyond the upper boxplot whisker (third quartile + 1.5 * Inter-Quartile Range), but there are too many to be outliers: they are probably highly priced diamonds, so we can't afford to lose this information.

iv. About the distribution:
- highly skewed to the right
- 1/4 of the diamonds below 950
- 50% of the diamonds below 2,400
- 1/4 of the diamonds between 2,400 and 5,300
- 50% of the diamonds between 950 and 5,300
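
The boxplot whiskers mentioned in (iii) follow the usual "1.5 * IQR" rule. A minimal sketch of that rule, assuming a helper name (`tukey_fences`) and a handful of made-up prices that are not the real data:

```python
import pandas as pd

def tukey_fences(s, k=1.5):
    """Return (lower, upper) whisker limits: Q1 - k*IQR and Q3 + k*IQR."""
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# hypothetical prices, not the real data
prices = pd.Series([300, 500, 900, 1500, 2400, 4000, 5300, 9000, 18000])
low, high = tukey_fences(prices)

# values a boxplot would draw as individual points
flagged = prices[(prices < low) | (prices > high)]
print(low, high, flagged.tolist())
```

On the real data, `tukey_fences(diam.price)` would give the limits behind the boxplot above; counting the flagged rows is a quick way to judge whether "too many to be outliers" holds.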

Given the skewness of the distribution, I've seen a lot of kernel authors performing a log-transformation on the target variable. After this transformation, the distribution of values allows for better statistical analysis and seems to improve models' performance. We'll see with Statsmodels that this does in fact improve the model, and why, but for now I won't do that.
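
To illustrate what the log-transformation does to a right-skewed variable, here is a small sketch on synthetic log-normal "prices" (generated data, an assumption for illustration only, not the diamonds dataset):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# synthetic right-skewed "prices" (log-normal), NOT the diamonds data
prices = pd.Series(rng.lognormal(mean=8.0, sigma=1.0, size=10_000))

skew_raw = prices.skew()          # strongly positive for a long right tail
skew_log = np.log(prices).skew()  # close to 0 once the tail is compressed
print(f"skewness before: {skew_raw:.2f}, after log: {skew_log:.2f}")
```

The same two lines applied to `diam.price` would show how much of the skew the transformation removes on the real target.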

# 5. Feature inspection

## 5.1. Feature Explanation
As I know absolutely nothing about diamonds beforehand, checking some literature is always a good idea. The following quote from the Gemological Institute of America summarizes the __[diamond quality factors](https://www.gia.edu/diamond-quality-factor)__: "Diamonds with certain qualities are more rare—and more valuable—than diamonds that lack them. These are known as the 4Cs. When used together, they describe the quality of a finished diamond. The value of a finished diamond is based on this combination."

The American Gem Society offers a comprehensive explanation of the __[4 C's of diamonds](https://www.americangemsociety.org/page/4cs)__, which is briefly summarized below.

### 5.1.1. Cut
The __[cut](https://www.americangemsociety.org/page/diamondcut)__ of a diamond refers to how well the diamond’s facets interact with light, the proportions of the diamond, and the overall finish of the diamond.

It is not to be confused with the shape, (like emerald or round,) or facet arrangement, (like brilliant, or step cut), but is instead a reference to the craftsmanship of the diamond and how it factors into the diamond’s brilliance. AGS grades cut on a scale from 0 to 10, with 0 being “Ideal” and 10 being “Poor.” AGS has a proprietary numeric and verbal descriptors for cut. The numeric descriptors for the Diamond Cut Grade follow the American Gem Society's standards for how well a diamond is cut. The verbal descriptors are AGS Ideal, Excellent, Very Good, Good, Fair, and Poor.

![cut_grading](https://cdn.ymaws.com/www.americangemsociety.org/resource/resmgr/images/GemsJewelry/135781409756963.jpg)

### 5.1.2. Color
The __[color](https://www.americangemsociety.org/page/diamondcolor)__ of a diamond actually refers to the lack of color in a diamond, with perfectly colorless diamonds considered the highest quality with the highest value, and brown or yellow diamonds being the lowest quality. Using a master set of diamonds specifically chosen based on their range of color, a grader picks up the diamond and places it next to the individual diamonds in the master set. The diamond grader then decides the color grade based on the saturation of the color compared to the master set.

![color_grading](https://cdn.ymaws.com/www.americangemsociety.org/resource/resmgr/images/GemsJewelry/164901433526626.JPG)

### 5.1.3. Clarity
__[Clarity](https://www.americangemsociety.org/page/diamondclarity)__ is the state of being clear or transparent. Diamond clarity is the presence or absence of characteristics called inclusions in the diamond. In short, inclusions are the internal or external flaws of the diamond. The size and severity of these flaws determines the grade. Since many inclusions and blemishes are very small, and can be difficult to see with the naked eye, they are graded at 10x magnification. Clarity grade is determined on a scale of decreasing clarity from the highest clarity (Flawless or FL) to the lowest clarity (Included 3, or I3).

![clarity_grading](https://cdn.ymaws.com/www.americangemsociety.org/resource/resmgr/images/GemsJewelry/79661461782004.png)

**AGS 0 - Flawless or Internally Flawless:** no inclusions or blemishes visible under 10x; Internally Flawless diamonds have no inclusions visible under 10x, but can have very minor blemishes (marks and features confined to the surface only).

**AGS 1 or 2 - VVS:** has minute inclusions that are difficult for a skilled grader to see under 10x magnification.

**AGS 3 or 4 - VS:** have minor inclusions.

**AGS 5, 6, or 7 - SI:** have noticeable inclusions that are fairly easy to see under 10x magnification; sometimes, these inclusions can be visible to the unaided eye.

**AGS 7, 8, 9, or 10 - I:** have inclusions that are obvious at 10x magnification; sometimes, they can be seen with the naked eye. At the lower clarities, the inclusions may have an effect on the diamond’s durability.

The modern clarity scale was invented in the 1950s, by a former president of GIA, Richard T. Liddicoat, Jr. With minor modifications, it has been the universal standard ever since, using verbal descriptors most are now familiar with: Flawless, Internally Flawless, VVS1, VVS2, VS1, VS2, SI1, SI2, I1, I2, and I3.

### 5.1.4. Carat
__[Carat](https://www.americangemsociety.org/page/diamondcarat)__ is the unit of measurement for the physical weight of diamonds. One carat equals 0.200 grams or 1/5 gram and is subdivided into 100 points.

![carat_ags](https://cdn.ymaws.com/www.americangemsociety.org/resource/resmgr/images/GemsJewelry/116161409755629.jpg)

Large diamonds are rarer than smaller ones, and as the carat weight increases, the value of the diamond increases as well. However, the increase in value is not proportionate to the size increase. For example, a 1-carat diamond will cost more than twice that of a ½-carat diamond (assuming Color, Clarity and Cut grade are the same). Weight does not always enhance the value of a diamond, either. Two diamonds of equal weight may be unequal in value, depending upon other determining factors such as Cut, Color and Clarity.

### 5.1.5. Depth and Table
From a different __[reference](https://beyond4cs.com/grading/depth-and-table-values/)__ I got information about depth and table.

#### 5.1.5.1. Depth

The depth of a diamond is its height (in millimeters) measured from the culet to the table. On a grading report, there are normally two measurements of depth – the first is the actual depth measurement in millimeters, and the second is the depth percentage, which shows how deep the diamond is in relation to its width.

*As explained on the dataset page, `depth` here is the depth percentage, which can be approximated by $depth = [z / average(x, y)] * 100$.*
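
As a quick sketch of that formula (the function name `depth_percentage` and the dimensions are assumptions chosen for illustration):

```python
def depth_percentage(x, y, z):
    """Approximate depth % as z divided by the mean of x and y, times 100."""
    return z / ((x + y) / 2) * 100

# hypothetical dimensions in mm
example_depth = depth_percentage(x=4.00, y=4.10, z=2.50)
print(round(example_depth, 1))
```

The function also works element-wise on pandas columns, e.g. `depth_percentage(diam.x, diam.y, diam.z)`, although rows where both `x` and `y` are 0 would then produce `inf` or `NaN`.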

![Depth percentage](https://beyond4cs.com/wp-content/uploads/2013/02/depthpercentagesofdiamond.png)

The ideal depth percentage varies with the shape of the diamond. A depth percentage that may be too much for one shape might be essential for another. For instance, a princess cut with a 75 or 77 percent depth would still be considered acceptable and can yield an attractive diamond. On the other hand, a depth of 65 percent for a round diamond would be excessive and be detrimental to its beauty.

#### 5.1.5.2. Table
The table refers to the flat facet of the diamond which can be seen when the stone is face up. It also happens to be the largest facet on a diamond and plays a vital role on brilliance and light performance of a stone.

The main purpose of the table facet is to refract light rays entering the diamond and to allow reflected light rays from the pavilion facets back into the observer’s eye.

![Table_percentage](https://beyond4cs.com/wp-content/uploads/2013/02/tableandtablepercentagesofdiamond.png)

In a grading report, table percentage is calculated based on the size of the table divided by the average girdle diameter of the diamond. So, a 60 percent table means that the table is 60 percent as wide as the diamond’s outline.

### 5.1.6. x, y and z
- x: length, in mm
- y: width, in mm
- z: depth, in mm

Those are simply the dimensions of the diamonds. As most diamonds are approximately round-shaped, we expect `x` ~ `y`. `z` is the absolute depth, and should be coherent with the `depth`, `x` and `y` values.

Also, approximating a diamond as a prism, we should expect `carat` to be roughly proportional to $x * y * z$.
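
One way to probe that proportionality is to fit a single constant $k$ in $carat \approx k \cdot x \cdot y \cdot z$ by least squares through the origin. The sketch below assumes a helper name (`fit_proportionality`) and uses made-up stones, so the resulting constant is illustrative only:

```python
import numpy as np

def fit_proportionality(volume, carat):
    """Least-squares slope k for carat ~ k * volume (line through the origin)."""
    volume = np.asarray(volume, dtype=float)
    carat = np.asarray(carat, dtype=float)
    return float(volume @ carat / (volume @ volume))

# hypothetical stones: x, y, z in mm and carat weights (made-up numbers)
x = np.array([4.0, 5.0, 6.5])
y = np.array([4.0, 5.1, 6.4])
z = np.array([2.5, 3.1, 4.0])
carat = np.array([0.24, 0.48, 1.00])

k = fit_proportionality(x * y * z, carat)
print(f"carat ~ {k:.4f} * x*y*z")
```

On the dataset itself this would be applied after the dimensions are cleaned, since zero or absurd values of `x`, `y` or `z` would distort the fit.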

### What we learn:
From the basic research of the literature, we expect:
- carat, clarity, color and cut to play a big role in diamond price
- depth and table to be important as well, although it is not clear how
- x, y and z to matter because they help determine carat and depth, but to be of secondary importance

I won't assume an order of importance among the 4 Cs, since the previous explanations don't make clear to me what that order should be. On the other hand, I don't expect any of the dimensions to be very relevant on their own, as other variables already capture their importance.

## 5.2. Data cleaning
This step is crucial. Many Machine Learning algorithms don't work with NaNs (how missing values are encoded in `pandas` and `numpy`), and some are very sensitive to outliers and absurd values.

Moreover, Data Cleaning focuses on removing problematic data entries whenever possible, be it for computational and/or statistical reasons.

In [None]:
diam.columns

The `.info()` method provides a lot of information that is useful for data preparation.

In [None]:
diam.info()

### Nulls
No values missing or encoded as `NaN`.

### Data types
Three variables have the `object` data type, which is how pandas stores strings/text. It is relevant to dig a little deeper here and understand why these entries might be encoded as strings:
1. are they text data?
2. are they categories written as text?
3. are they numeric that due to errors got coerced into `object`?

It's important to differentiate case 2 from case 1, as pandas has a specific `category` data type that saves memory and allows for ordering, which helps in data analysis tasks. In case 3, we might need to perform some more cleaning steps.
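
A small sketch of both benefits, using a made-up Series of cut grades (the name `cut_order` and the sample values are assumptions for illustration):

```python
import pandas as pd

# hypothetical sample of cut grades stored as plain strings
cuts = pd.Series(['Ideal', 'Good', 'Premium', 'Fair', 'Ideal'] * 1000)

cut_order = pd.api.types.CategoricalDtype(
    categories=['Fair', 'Good', 'Very Good', 'Premium', 'Ideal'],
    ordered=True)
cuts_cat = cuts.astype(cut_order)

# memory: small integer codes plus a lookup table instead of one string per row
print(cuts.memory_usage(deep=True), cuts_cat.memory_usage(deep=True))

# ordering: comparisons and min/max follow the declared grade order
print((cuts_cat >= 'Premium').sum())
print(cuts_cat.min(), cuts_cat.max())
```

With an ordered category, plots and group-by tables also come out sorted by grade rather than alphabetically.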

Below, I summarize the entries on each column to see unique values.

In [None]:
for col in ['cut', 'color', 'clarity']:
    print("Column : {}".format(col))
    print(diam[col].value_counts())
    print()

As expected from the feature explanation section, these entries are categories written in the form of text. Changing the data type is good practice here for visualization and data analysis.

**Cut**
![cut_grading](https://cdn.ymaws.com/www.americangemsociety.org/resource/resmgr/images/GemsJewelry/135781409756963.jpg)

No null entries, as observed with the `.info()` method. Next, turn into `category` and set an order of importance.

In [None]:
# turn to 'categorical' data type and order
cut_dtype = pd.api.types.CategoricalDtype(
    categories=['Fair', 'Good', 'Very Good', 'Premium', 'Ideal'], 
    ordered=True)
data['cut'] = diam.cut.astype(cut_dtype)

In [None]:
data.cut.head()

**Color**

![color_grading](https://cdn.ymaws.com/www.americangemsociety.org/resource/resmgr/images/GemsJewelry/164901433526626.JPG)

No null entries, as observed with the `.info()` method. Next, turn into `category` and set an order of importance.

In [None]:
# turn to 'categorical' data type and order
color_dtype = pd.api.types.CategoricalDtype(
    categories=['J', 'I', 'H', 'G', 'F', 'E', 'D'], 
    ordered=True)
data['color'] = diam.color.astype(color_dtype)

In [None]:
data.color.head()

**Clarity**

![clarity_grading](https://cdn.ymaws.com/www.americangemsociety.org/resource/resmgr/images/GemsJewelry/79661461782004.png)

No null entries, as observed with the `.info()` method. Next, turn into `category` and set an order of importance.

In [None]:
clar_dtype = pd.api.types.CategoricalDtype(
    categories=['I1', 'SI2', 'SI1', 'VS2', 'VS1', 'VVS2', 'VVS1', 'IF'], 
    ordered=True)
data['clarity'] = diam.clarity.astype(clar_dtype)

In [None]:
data.clarity.head()

**Carat**

Continuous numerical variable, so it's good to check for: missing values, absurd values, scale and possible errors.

In [None]:
diam.carat.describe()

Taking a quick glance at the distribution:

In [None]:
ax = sns.boxplot(y='carat', data=diam)

In [None]:
print(diam[diam.carat < 3].sample(5))
print()
print(diam[diam.carat > 3].sample(5))

As expected, higher `carat` weights are linked to greater diamond dimensions. Apparently, the large values (above 4) are not absurd, just big diamonds.

**Depth %**

Continuous numerical variable: check for missing values, absurd values, scale and possible errors.

No null entries, as observed with the `.info()` method. In case we forgot that:

In [None]:
print("At least one null entry? {}".format(diam.depth.isnull().any()))

In [None]:
diam.depth.describe()

In [None]:
ax = sns.boxplot(y='depth', data=diam)

Values highly concentrated between ~58 and 65, but there are observations above and below. How many?

In [None]:
print("# diamonds, depth > 65:", diam[diam.depth > 65].depth.count())
print()
print('Sample')
print(diam[diam.depth > 65].sample(10))

In [None]:
print("# diamonds, depth < 58:", diam[diam.depth < 58].depth.count())
print()
print('Sample')
print(diam[diam.depth < 58].sample(10))

Depth% > 65:
- around 835 diamonds
- values of 'x', 'y', and 'z' don't particularly stand out: 'z' just seems to be close to them, probably due to the shape of the cut

Depth% < 58:
- around 581 diamonds
- values of 'x', 'y', and 'z' don't particularly stand out: 'z' just seems to be far from them, probably due to the shape of the cut

No reason to cut out points beyond the boxplot whiskers, but it's probably a good idea to check how cut and depth interact during Exploratory Data Analysis.

**Table %**

Continuous numerical variable: check for missing values, absurd values, scale and possible errors.

From `.info()` performed at the beginning of the cleaning stage, we know there are no missing entries, but if we didn't remember that:

In [None]:
print("At least one null entry? {}".format(diam.table.isnull().any()))

In [None]:
diam.table.describe()

In [None]:
ax = sns.boxplot(y='table', data=diam)

Values are highly concentrated between 50 and 65. How many fall outside this range?

In [None]:
print("# diamonds, table% > 65:", diam[diam.table > 65].table.count())
print()
print("Sample:")
print(diam[diam.table > 65].sample(10))

In [None]:
print("# diamonds, table% < 50:", diam[diam.table < 50].table.count())
print()
print("Sample:")
print(diam[diam.table < 50].sample(4))

Values of 'x', 'y' and 'z' don't particularly stand out. Table is high or low probably due to the shape of the cut.

**x, y and z**

Continuous numerical variable: check for missing values, absurd values, scale and possible errors.

In [None]:
print("Any null entry for \'x\'? {}".format(diam.x.isnull().any()))
print("Any null entry for \'y\'? {}".format(diam.y.isnull().any()))
print("Any null entry for \'z\'? {}".format(diam.z.isnull().any()))

Taking a look at some summary statistics:

In [None]:
diam.loc[:, ['x', 'y', 'z']].describe()

The `.describe()` method reveals at least three supposedly absurd situations:
- x = 0: no sense in a diamond with zero length
- y = 58.9: very weird when maximum x is about 10.7
- z = 31.8: very weird when maximum x is about 10.7

Visually inspecting:

In [None]:
f, ax = plt.subplots(ncols=2, figsize=(10,5))
sns.scatterplot(x=diam.x, y=diam.y, ax=ax[0])
ax[0].set_title("y vs x")

sns.scatterplot(x=diam.x, y=diam.y, ax=ax[1])
ax[1].set_xlim(0, 15)
ax[1].set_ylim(0, 15)
ax[1].set_title("y vs x - Zoomed in")

plt.show()

Setting the axes limits between 0 and 15 reveals that the hypothesis that x is approximately equal to y was reasonably accurate. Now looking at `y` vs `carat`.

In [None]:
ax = sns.scatterplot(x='carat', y='y', data=diam)

Let's check the potentially absurd values.

In [None]:
diam[(diam.y > 10) | (diam.z > 10)]

Only three rows present values above 12. For simplicity, I'll arbitrarily establish that above 15 is absurd. Domain and historical knowledge would be useful here, but in the absence of any, I'll simplify.

Next, let's look at the other end (low values).

In [None]:
diam[(diam.x < 1) | (diam.y < 1) | (diam.z < 1)].sample(10)

Although some preliminary analysis found no missing values (NaNs), it seems that 0.0 encodes missing values in this dataset.
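
If 0.0 really stands for "missing", one option is to make that explicit so the usual NaN tooling can count it. A minimal sketch on toy rows (hypothetical values standing in for the dataset):

```python
import numpy as np
import pandas as pd

# toy rows standing in for the dataset (hypothetical values)
toy = pd.DataFrame({'x': [3.95, 0.0, 6.55],
                    'y': [3.98, 0.0, 6.48],
                    'z': [2.43, 0.0, 0.0]})

# treat 0.0 as "missing" so that isnull() can see it
dims = toy[['x', 'y', 'z']].replace(0, np.nan)
missing_per_column = dims.isnull().sum()
print(missing_per_column)
```

The same `replace(0, np.nan)` applied to `diam[['x', 'y', 'z']]` would give the number of hidden missing values per dimension.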

Since we have confirmed that `x` ~ `y`, I propose a set of steps to deal with absurd dimension values, in the following order:
1. When 'y' is available but 'x' is absurd (> 15 or 0.0): do 'x' = 'y'
2. For the remaining observations, when 'x' is absurd (> 15 or 0.0): replace with the mean 'x' value over the entire dataset
3. Next, when 'y' is absurd (>15 or 0.0): do 'y' = 'x'
4. Finally, when 'z' is absurd (> 15 or 0.0): approximate it using $depth = z / average(x, y) * 100$ (shown in the Feature Explanation).

(1) When 'y' is available but 'x' is absurd:

In [None]:
absurdx_i = ((diam.x == 0) | (diam.x > 15)) & (diam.y != 0)
data.loc[absurdx_i, 'x'] = diam.loc[absurdx_i, 'y']

(2) When 'x' is absurd but 'y' is not available

For simplicity, as this only seems to happen when both are zero, I do:

In [None]:
# create Boolean mask to subset DataFrame
absurdxy_i = ((data.x == 0) | (data.y == 0))

# compute mean value of x
mean_x = np.mean(data.x)

# substitute on the dataFrame
data.loc[absurdxy_i, ['x', 'y']] = data.loc[absurdxy_i, ['x', 'y']].replace(0, mean_x)

(3) Where `x` is available but `y` is absurd:

In [None]:
absurdy_i = (((data.y > 15) | (data.y == 0)) & (data.x != 0))
data.loc[absurdy_i, 'y'] = data.loc[absurdy_i, 'x']

(4) When `z` is absurd:

In [None]:
data[(data.z == 0) | (data.z > 15)]

In [None]:
# find rows where z is absurd
absurd_z = (data.z == 0) | (data.z > 15)

# estimate z from depth % and the mean of x and y: z = depth/100 * mean(x, y)
data.loc[absurd_z, 'z'] = (data.loc[absurd_z, 'depth'] / 100
                           * data.loc[absurd_z, ['x', 'y']].mean(axis=1))

In [None]:
data[['x', 'y', 'z']].describe()

Now all values seem to be fine!

In [None]:
data.info()

In [None]:
data.describe()

## 5.3. Feature engineering

I am not very skilled in this aspect, but one idea comes to mind: since 'cut' is related to 'depth' and 'table', I combine depth and table into one variable.

In [None]:
data['depth_table_ratio'] = data['depth'] / data['table']

# 6. Exploratory Data Analysis - EDA

In [None]:
data.sample(10)

## 6.1. Univariate
How is each predictor alone related to price?

### Carat Weight

In [None]:
ax = sns.boxplot(y='carat', data=data)
print(data[['carat']].describe())

In [None]:
ax = sns.scatterplot(x='carat', y='price', data=data,
                     edgecolors='k', alpha=0.3)

In [None]:
ax = sns.violinplot(y='price', data=data)

In [None]:
f, ax = plt.subplots(ncols=2, figsize=(10, 5))
sns.regplot(x='carat', y='price', data=data, ax=ax[0],
            x_bins=10, x_estimator=np.mean, ci=None)
sns.regplot(x='carat', y='price', data=data, ax=ax[1],
            x_bins=10, x_estimator=np.mean, ci=None, order=3)

ax[0].set_title("Price vs Carat - 1st order linear")
ax[1].set_title("Price vs Carat - 3rd order polynomial")
plt.show()

Partitioning carat into multiple bins for visualization shows:
- 'carat' is positively related to 'price'
- relationship seems non-linear

Looking at how price changes with carat seems to hint yet again that log-transforming price might be a good idea (because of the shape of the relationship). Let's try.

In [None]:
data['log_price'] = np.log(data.price)

In [None]:
f, ax = plt.subplots(ncols=3, figsize=(15,5))
sns.regplot(x='carat', y='log_price', data=data, x_bins=10, ax=ax[0],
                 x_estimator=np.mean, ci=None)
sns.regplot(x='carat', y='log_price', data=data, x_bins=10, ax=ax[1],
                 x_estimator=np.mean, ci=None, order=2)
sns.regplot(x='carat', y='log_price', data=data, x_bins=10, ax=ax[2],
                 x_estimator=np.mean, ci=None, order=3)

ax[0].set_title("Log(Price) vs Carat - 1st order polynom.")
ax[0].set_ylabel("Log (price)")

ax[1].set_title("Log(Price) vs Carat - 2nd order polynom.")
ax[1].set_ylabel("Log (price)")

ax[2].set_title("Log(Price) vs Carat - 3rd order polynom.")
ax[2].set_ylabel("Log (price)")

plt.show()

Indeed, log-transforming seems to bring the data closer to a good fit (with the help of a third-order polynomial).
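
The visual comparison can be backed by numbers. The sketch below fits polynomials of order 1 to 3 with `np.polyfit` and reports the sum of squared residuals; the helper `poly_sse` and the synthetic data are assumptions for illustration, not the diamonds data:

```python
import numpy as np

def poly_sse(x, y, degree):
    """Sum of squared residuals of a least-squares polynomial fit."""
    coefs = np.polyfit(x, y, deg=degree)
    return float(np.sum((y - np.polyval(coefs, x)) ** 2))

rng = np.random.default_rng(1)
# synthetic stand-in: log(price) rising steeply, then flattening with carat
carat = rng.uniform(0.2, 3.0, size=500)
log_price = 6.0 + 2.5 * np.log1p(carat) + rng.normal(0, 0.1, size=500)

sse = {degree: poly_sse(carat, log_price, degree) for degree in (1, 2, 3)}
for degree, value in sse.items():
    print(degree, round(value, 2))
```

Keep in mind that the in-sample error can only go down as the order increases, so a lower number here does not by itself justify the higher order; comparing on held-out data is the safer check.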

Next, we partition carat weight into bins of width 1.0 carat.

In [None]:
# right-closed bins (0, 1], (1, 2], ..., (4, 5]; weights outside this range become NaN
data['carat_bin'] = pd.cut(data.carat, range(6))
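
A quick sketch of how `pd.cut` assigns values to these bins, using a few made-up carat weights. By default each bin includes its right edge, and anything outside the outermost edges is left as `NaN`:

```python
import pandas as pd

weights = pd.Series([0.5, 1.0, 1.01, 5.0, 5.01])  # hypothetical carat values
bins = pd.cut(weights, range(6))                 # edges 0, 1, 2, 3, 4, 5

print(bins)
print(bins.isna().sum(), "value(s) fell outside the bins")
```

Checking `data.carat_bin.isna().sum()` after binning is a cheap way to confirm that no diamond was silently dropped from the counts.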

In [None]:
ax = sns.countplot(x='carat_bin', data=data)

print(data.carat_bin.value_counts(normalize=True, sort=True, ascending=False)*100)

- 67.6% weigh 1 carat or less.
- 96.5% weigh 2 carats or less.
- approximately 0.06% weigh more than 3 carats.

Expect to see a majority of low prices, because the majority of diamonds are light.
Visualizing:

In [None]:
f, ax = plt.subplots(ncols=2, figsize=(10,5))
sns.boxplot(x='carat_bin', y='log_price', data=data, palette='husl', ax=ax[0])
sns.pointplot(x='carat_bin', y='log_price', data=data, ax=ax[1])

ax[0].set_title("Distribution - log (Price) vs Carat category")
ax[1].set_title("Mean log(price) vs Carat category")
plt.show()

print("Mean price per carat_bin:")
print(data.groupby('carat_bin').price.mean())

What we've learned so far:
- overall, the higher the weight, the higher the price
- relationship between mean price and carat bin is not linear:
    - first three bins: mean value increases far more than the proportional increase in carat
    - diamonds between 2 and 3 carats, and between 3 and 4 carats: approximately same mean price
    - diamonds heavier than 3 carats: the price still goes up with carat but the increase is more moderate
- there are lighter diamonds that were highly valued: why?

For the last observation, let's check highly priced diamonds with small carat weight.

In [None]:
data[(data.carat <= 1) & (data.price > 12000)]

Of highly priced diamonds:
- Three highest prices: maximum grades for 'clarity' and 'color'
- Fourth highest price: maximum grades for 'cut' and color, 2nd highest for 'clarity'

Between diamonds close in 'carat', exceptionally good quality drives price up. Seems to confirm that high quality grades increase price.

### Cut

In [None]:
f, ax = plt.subplots(ncols=3, figsize=(18, 5))
sns.countplot(x='cut', data=data, ax=ax[0])
sns.boxplot(x='cut', y='log_price', data=data, ax=ax[1])
sns.stripplot(x='cut', y='log_price', data=data, ax=ax[1],
              size=1, edgecolor='k', linewidth=.1)
sns.pointplot(x='cut', y='log_price', data=data, ax=ax[2])
sns.pointplot(x='cut', y='log_price', data=data, ax=ax[2],
              estimator=np.median, color='r')

ax[0].set_title("Count diamonds per cut grade")
ax[1].set_title("log(Price) vs Cut - Distribution")
ax[2].set_title("Mean log(price) vs Cut (blue)\nMedian log(price) vs Cut (red)")
plt.show()

In [None]:
print('% diamonds per cut grade:')
print(data.cut.value_counts(normalize=True, sort=True, ascending=False)*100)

In [None]:
print('Mean log(price) per cut grade:')
print(data.groupby('cut').log_price.mean())

About 'cut':
- 87.9% graded 'Very Good' or better
- seems to be negatively related to 'price'

Here, an unexpected behavior appears. Why would higher 'cut' grades have lower prices?

**Hypotheses:**

The majority of diamonds have a good 'cut' grade. Also, the majority of diamonds have low carat weight. 'cut' may be "getting a bad reputation" for something it is not responsible for.
- most of the diamonds ~ carat < 1 ~ on average, small price
- most of the diamonds ~ Ideal cut

By association: Ideal cut ~ small prices

It should be true, then, that 'cut' only drives up the price:
- when comparing same carat diamonds
- when alongside other distinctive quality factors

In the Bivariate/Multivariate analysis (relationships among the predictors themselves), we want to check that.
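
The hypothesis can be illustrated with a deliberately constructed toy example (all values are made up): most 'Ideal' stones are light and most 'Fair' stones are heavy, so the pooled means point one way while the comparison within each carat bin points the other way.

```python
import pandas as pd

# hypothetical toy data: most 'Ideal' stones are light, most 'Fair' stones heavy
toy = pd.DataFrame({
    'carat_bin': ['<=1'] * 5 + ['>1'] * 5,
    'cut':       ['Ideal', 'Ideal', 'Ideal', 'Ideal', 'Fair',
                  'Ideal', 'Fair', 'Fair', 'Fair', 'Fair'],
    'price':     [1000, 1100, 1200, 1100, 800,
                  9000, 7000, 7200, 6800, 7000],
})

# pooled comparison: ignores carat
pooled = toy.groupby('cut').price.mean()
print(pooled)

# stratified comparison: same carat bin, different cut
stratified = toy.groupby(['carat_bin', 'cut']).price.mean().unstack()
print(stratified)
```

Here 'Fair' has the higher pooled mean, yet 'Ideal' is more expensive inside every carat bin. The same `groupby(['carat_bin', 'cut'])` pattern on `data` is one way to test whether this is what happens in the real dataset.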

### Color

In [None]:
f, ax = plt.subplots(ncols=3, figsize=(15, 5))
sns.countplot(x='color', data=data, ax=ax[0])
sns.boxplot(x='color', y='log_price', data=data, ax=ax[1])
sns.pointplot(x='color', y='log_price', data=data, ax=ax[2])
sns.pointplot(x='color', y='log_price', data=data, ax=ax[2],
              estimator=np.median, color='r')

ax[0].set_title("Count diamonds per color grade")
ax[1].set_title("log(Price) vs Color - Distribution")
ax[2].set_title("Mean log(price) vs Color (blue)\nMedian log(price) vs Color (red)")
plt.show()

In [None]:
print("% diamonds per color grade")
print(data.color.value_counts(normalize=True, sort=True, ascending=True) * 100)

In [None]:
print("Mean log(price) of diamonds per color grade")
print(data.groupby('color').log_price.mean())

Based purely on color, higher grades seem to decrease the value. Unexpected behavior.

**Hypothesis:**

Like with 'cut', the overall behavior is possibly being adversely affected by the majority of diamonds being light.

Need to look at combination between 'color' and other quality factors and how they affect diamond prices.

### Clarity

In [None]:
f, ax = plt.subplots(ncols=3, figsize=(15, 5))
sns.countplot(x='clarity', data=data, ax=ax[0])
sns.boxplot(x='clarity', y='log_price', data=data, ax=ax[1])
sns.pointplot(x='clarity', y='log_price', data=data, ax=ax[2])
sns.pointplot(x='clarity', y='log_price', data=data, ax=ax[2],
              estimator=np.median, color='r')

ax[0].set_title("Count diamonds per clarity grade")
ax[1].set_title("log(Price) vs clarity - Distribution")
ax[2].set_title("Mean log(price) vs clarity (blue)\nMedian log(price) vs clarity (red)")
plt.show()

Again same overall pattern:
- majority of cases receiving low price
- lower grades with higher mean prices

### Depth

In [None]:
data[['depth']].describe()

In [None]:
f, ax = plt.subplots(ncols=2, figsize=(10,5))
sns.boxplot(y='depth', data=data, ax=ax[0])
ax[0].set_title("Depth distribution")

sns.boxplot(y='depth', data=data, ax=ax[1])
ax[1].set_ylim(55, 70)
ax[1].set_title("Depth distribution - Zoomed in")

plt.show()

In [None]:
ax = sns.scatterplot(x='depth', y='log_price', data=data,
                     alpha=0.3, edgecolor='k')

Alone, depth doesn't help predict price. Let's try turning it into bins ("discretizing"):

In [None]:
# create bins
depth_desc = data[['depth']].describe()
depth_bins = depth_desc['min':'max'].depth.tolist()

# create column for bins
data['depth_bin'] = pd.cut(data.depth, depth_bins)
data.depth_bin.value_counts()
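
One detail of binning on the `describe()` quartiles is worth checking: because `pd.cut` bins exclude their left edge by default, the row holding the minimum gets no bin. A small sketch on made-up depth values, comparing the default with `include_lowest=True` and with `pd.qcut`:

```python
import pandas as pd

depth = pd.Series([55.0, 60.0, 61.0, 61.5, 62.0, 62.5, 63.0, 70.0])  # hypothetical
edges = depth.describe()['min':'max'].tolist()  # min, 25%, 50%, 75%, max

# right-closed bins leave the minimum itself unassigned ...
default_bins = pd.cut(depth, edges)
# ... unless the lowest edge is explicitly included
closed_bins = pd.cut(depth, edges, include_lowest=True)
# qcut builds the quartile bins directly from the data
quartiles = pd.qcut(depth, q=4)

print(default_bins.isna().sum(), closed_bins.isna().sum(), quartiles.isna().sum())
```

For a dataset of this size a single unassigned row will not change the plots, but `data.depth_bin.isna().sum()` makes the effect visible.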

In [None]:
f, ax = plt.subplots(ncols=3, figsize=(15, 5))
sns.countplot(x='depth_bin', data=data, ax=ax[0])
sns.boxplot(x='depth_bin', y='log_price', data=data, ax=ax[1])
sns.pointplot(x='depth_bin', y='log_price', data=data, ax=ax[2])
sns.pointplot(x='depth_bin', y='log_price', data=data, ax=ax[2],
              estimator=np.median, color='r')

ax[0].set_title("Count diamonds per depth bin")
ax[1].set_title("log(Price) vs depth_bin - Distribution")
ax[2].set_title("Mean log(price) vs depth_bin (blue)\nMedian log(price) vs depth_bin (red)")
plt.show()

'Depth' values balanced across bins. No particular relationship. Alone, not a good predictor of price.

### Table

In [None]:
data[['table']].describe()

In [None]:
f, ax = plt.subplots(ncols=2, figsize=(10,5))
sns.boxplot(y='table', data=data, ax=ax[0])
ax[0].set_title("Table distribution")

sns.boxplot(y='table', data=data, ax=ax[1])
ax[1].set_ylim(50, 65)
ax[1].set_title("Table distribution - Zoomed in")

plt.show()

In [None]:
ax = sns.scatterplot(x='table', y='log_price', data=data,
                     alpha=0.3, edgecolor='k')

Analogous to depth, table doesn't seem to help predict price. As with depth, let's try "discretizing" and check for patterns.

In [None]:
# create bin list
table_desc = data[['table']].describe()
table_bins=table_desc['min':'max'].table.tolist()
table_bins.append(65)
table_bins.sort()

# create column for bins
data['table_bin'] = pd.cut(data.table, table_bins)
data.table_bin.value_counts()

In [None]:
f, ax = plt.subplots(ncols=3, figsize=(18, 5))
sns.countplot(x='table_bin', data=data, ax=ax[0])
sns.boxplot(x='table_bin', y='log_price', data=data, ax=ax[1])
sns.pointplot(x='table_bin', y='log_price', data=data, ax=ax[2])
sns.pointplot(x='table_bin', y='log_price', data=data, ax=ax[2],
              estimator=np.median, color='r')

ax[0].set_title("Count diamonds per table bin")
ax[1].set_title("log(Price) vs table_bin - Distribution")
ax[2].set_title("Mean log(price) vs table_bin (blue)\nMedian log(price) vs table_bin (red)")
plt.show()

When turned into bins, 'table' seems to display a positive correlation with price, except for table values greater than 65%.

### Depth/Table ratio

In [None]:
data['depth_table_ratio'].describe()

In [None]:
ax = sns.boxplot(y='depth_table_ratio', data=data)

In [None]:
ax = sns.scatterplot(x='depth_table_ratio', y='log_price', data=data,
                     edgecolor='k', alpha=.3, s=10)

Like depth and table, the ratio doesn't seem to correlate with price. Again we turn to "discretization" for better visualization.

In [None]:
# create bins list
dt_ratio_desc = data[['depth_table_ratio']].describe()
dt_ratio_bins = dt_ratio_desc['min':'max'].depth_table_ratio.tolist()

# create columns for bins
data['dt_ratio_bin'] = pd.cut(data.depth_table_ratio, dt_ratio_bins)
data.dt_ratio_bin.value_counts()

In [None]:
f, ax = plt.subplots(nrows=3, figsize=(6, 18))
sns.countplot(x='dt_ratio_bin', data=data, ax=ax[0])
sns.boxplot(x='dt_ratio_bin', y='log_price', data=data, ax=ax[1])
sns.pointplot(x='dt_ratio_bin', y='log_price', data=data, ax=ax[2])
sns.pointplot(x='dt_ratio_bin', y='log_price', data=data, ax=ax[2],
              estimator=np.median, color='r')

ax[0].set_title("Count diamonds per dt_ratio_bin")
ax[1].set_title("log(Price) vs dt_ratio_bin - Distribution")
ax[2].set_title("Mean log(price) vs dt_ratio_bin (blue)\nMedian log(price) vs dt_ratio_bin (red)")
plt.show()

Mean price seems to decrease as depth to table ratio increases.

### Valuable diamonds: are they distributed differently?
As all boxplots of price show a large number of diamonds beyond the whiskers, I investigate highly priced diamonds a little further. Is there any particular pattern in the other variables?

In [None]:
data.price.describe()

In [None]:
ax = sns.boxplot(y='price', data=data)

Let's arbitrarily look at diamonds priced at 10,000 or more.

In [None]:
data['high_price'] = data.price.apply(lambda x: 1 if x >= 10000 else 0)

**Carat**

In [None]:
ax = sns.boxplot(x='high_price', y='carat', data=data)

With larger diamonds come higher prices.

**Cut**

In [None]:
ax = sns.countplot(x='cut', data=data, hue='high_price')

This is hard to read, as there are far fewer highly priced diamonds. Let's try using a table, normalization and some colors:

In [None]:
pricebin_cut_ct = pd.crosstab(data.high_price, data.cut, values=data.price, 
                              aggfunc='count', normalize='index')
pricebin_cut_ct.style.background_gradient(cmap='autumn', axis=1)

In the case of the top 3 cut grades:

In [None]:
pricebin_cut_ct[['Very Good', 'Premium', 'Ideal']].sum(axis=1)

Highly priced diamonds don't seem to present a different distribution across 'cut' grades: both groups have the majority of diamonds in high grades (probably because overall there are more good grades than bad ones).
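
To make the normalization step concrete, here is a minimal sketch on a made-up frame (the rows are invented, only the column names follow the notebook). With `normalize='index'`, each row of the crosstab is divided by its own total, so the two price groups can be compared as shares even though one group is much smaller.

```python
import pandas as pd

# invented example: 6 regular diamonds and 2 highly priced ones
toy = pd.DataFrame({
    'high_price': [0, 0, 0, 0, 0, 0, 1, 1],
    'cut': ['Ideal', 'Ideal', 'Ideal', 'Premium', 'Premium', 'Good',
            'Ideal', 'Premium'],
})

# raw counts are dominated by the larger group
counts = pd.crosstab(toy.high_price, toy.cut)

# row-normalized shares: each row sums to 1
shares = pd.crosstab(toy.high_price, toy.cut, normalize='index')

print(counts)
print(shares)
```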

**Color**

In [None]:
ax = sns.countplot(x='color', data=data, hue='high_price')

In [None]:
pricebin_color_ct = pd.crosstab(data.high_price, data.color, values=data.price, 
                                aggfunc='count', normalize='index')
pricebin_color_ct.style.background_gradient(cmap='autumn', axis=1)

In the case of color, highly priced diamonds seem to appear more often alongside "bad" color grades than with "good" color grades.

**Clarity**

In [None]:
ax = sns.countplot(x='clarity', data=data, hue='high_price')

In [None]:
pricebin_clarity_ct = pd.crosstab(data.high_price, data.clarity, values=data.price, 
                                  aggfunc='count', normalize='index')
pricebin_clarity_ct.style.background_gradient(cmap='autumn', axis=1)

No clear difference in distributions on the basis of clarity.

**Depth**

In [None]:
f, ax = plt.subplots(ncols=2, figsize=(10,5))
sns.boxplot(x='high_price', y='depth', data=data, ax=ax[0])
sns.boxplot(x='high_price', y='depth', data=data, ax=ax[1])

ax[0].set_title("Depth distribution by high_price")
ax[1].set_ylim((58, 66))
ax[1].set_title("Depth distribution by high_price\nZoomed in")

plt.show()

Aside from having fewer points outside the box-and-whiskers, the distributions seem analogous.

**Table**

In [None]:
f, ax = plt.subplots(ncols=2, figsize=(10,5))
sns.boxplot(x='high_price', y='table', data=data, ax=ax[0])
sns.boxplot(x='high_price', y='table', data=data, ax=ax[1])

ax[0].set_title("Table distribution by high_price")
ax[1].set_ylim((50, 65))
ax[1].set_title("Table distribution by high_price\nZoomed in")

plt.show()

Very similar distributions as well.

### Conclusions: What we've learned
1. All predictors, when considered alone, include both high-priced and low-priced diamonds over the entire range of their values/categories.
2. Carat weight:
    - has a positive correlation with price, and the relationship seems highly non-linear
    - carat distribution differs between more valuable and less valuable diamonds.
3. Cut, Color, Clarity:
    - appear to have a negative correlation with mean price
    - apart from Color, frequency of diamonds is very similar across all grades.
4. Depth:
    - doesn't seem to have any clear relationship with price
5. Table:
    - appears to have a negative correlation with price

## 6.2. Multivariate

First, it's better to encode the categorical quality variables as numbers so that they show up in the correlation matrix.

In [None]:
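# NOTE: LabelEncoder assigns integer codes in alphabetical order of the labels,
# not in order of quality grade
data['cut_encod'] = LabelEncoder().fit_transform(np.asarray(data.cut))
data['color_encod'] = LabelEncoder().fit_transform(np.asarray(data.color))
data['clarity_encod'] = LabelEncoder().fit_transform(np.asarray(data.clarity))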
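
A caveat on this encoding: `LabelEncoder` sorts the labels alphabetically, so the resulting integers don't follow the quality scale. If ordinal codes are wanted, one option is an ordered `pd.Categorical`. The sketch below is only an illustration, and the grade order `Fair < Good < Very Good < Premium < Ideal` is my assumption about how 'cut' is graded.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

cut = pd.Series(['Ideal', 'Fair', 'Very Good', 'Premium', 'Good'])

# alphabetical codes: Fair=0, Good=1, Ideal=2, Premium=3, Very Good=4
alphabetical = LabelEncoder().fit_transform(cut)

# ordinal codes following an assumed worst-to-best grade order
cut_order = ['Fair', 'Good', 'Very Good', 'Premium', 'Ideal']
ordinal = pd.Categorical(cut, categories=cut_order, ordered=True).codes

print(list(alphabetical))
print(list(ordinal))
```

With alphabetical codes, a correlation between the encoded column and price is hard to interpret, since "larger code" doesn't mean "better grade".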

In [None]:
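# restrict to numeric columns: recent pandas versions raise an error
# when corr() meets text or categorical columns
cor_mat = data.select_dtypes(include='number').corr()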

In [None]:
f, ax = plt.subplots(figsize=(10,10))
ax = sns.heatmap(cor_mat, cmap='autumn', annot=True)

From the heatmap, we see that carat, x, y and z are highly correlated with price and between themselves. Drawing a pair plot in order to visualize as scatter plots:

In [None]:
g = sns.pairplot(data, vars=['log_price', 'price', 'carat', 'x', 'y', 'z'])

**The same kind of relationship found between 'price' and 'carat' is reproduced between 'price' and the dimensional features 'x', 'y', and 'z'.** This is largely due to these dimensions being highly correlated with 'carat' and between themselves.

For simplicity, and to prevent high collinearity from affecting the coefficient estimates of the model, I'll drop 'x', 'y' and 'z', since carat seems to carry the same information.

In [None]:
data.drop(['carat_bin', 'high_price', 'x', 'y', 'z'], axis=1, inplace=True)

In [None]:
g = sns.pairplot(data, vars=['log_price', 'price', 'depth', 'table', 'depth_table_ratio'])

Since 'depth_table_ratio' doesn't add information, I'll also drop it (alongside the depth and table bins).

In [None]:
data.drop(['depth_table_ratio', 'dt_ratio_bin', 'depth_bin', 'table_bin'], axis=1, inplace=True)

In [None]:
data.head()

Define cmap for bivariate visualization:

In [None]:
cmap = sns.cubehelix_palette(light=1, as_cmap=True)

## Cut vs Clarity
How does the interaction between 'cut' and 'clarity' affect the mean price?

In [None]:
ct = pd.crosstab(data.cut, data.clarity, data.log_price, aggfunc=np.mean)
print("Table 1. Mean log(price) map - cut vs clarity")
ct.style.background_gradient(cmap=cmap, axis=1)

In Table 1: for a given value of cut, higher mean prices are more associated with lower clarity grades.

However, when we look at Tables 2 and 3, we verify that the mean price is driven mainly by mean carat weight. In fact, the Table for price seems a combination of the effects of carat and count of diamonds.
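
The mechanism behind this can be sketched with a few invented rows (the numbers below are made up for illustration, not taken from the dataset): a lower clarity grade can show a higher overall mean price simply because its diamonds are heavier, while at any fixed carat the better grade is more expensive.

```python
import pandas as pd

# invented rows: SI2 (lower clarity) stones are heavier on average
toy = pd.DataFrame({
    'clarity': ['SI2', 'SI2', 'SI2', 'VVS1', 'VVS1', 'VVS1'],
    'carat':   [1.0, 1.2, 0.3, 0.3, 0.4, 1.0],
    'price':   [4000, 5000, 400, 700, 1000, 6500],
})

# overall means per clarity grade: carat differences drive the price means
overall = toy.groupby('clarity')[['carat', 'price']].mean()

# mean price per clarity at the carat values both grades share
same_carat = toy[toy.carat.isin([0.3, 1.0])]
at_fixed = same_carat.pivot_table(index='carat', columns='clarity',
                                  values='price')

print(overall)
print(at_fixed)
```

This is the reason the later part of the analysis fixes carat before comparing quality grades.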

In [None]:
ct = pd.crosstab(data.cut, data.clarity, data.carat, aggfunc=np.mean)
print("Table 2. Mean carat map - cut vs clarity")
ct.style.background_gradient(cmap=cmap, axis=1)

In [None]:
ct = pd.crosstab(data.cut, data.clarity)
print("Table 3. Count diamonds - cut vs clarity")
ct.style.background_gradient(cmap=cmap, axis=1)

## Cut vs Color

In [None]:
ct = pd.crosstab(data.cut, data.color, data.log_price, aggfunc=np.mean)
print("Table 4. Mean log(price) - cut vs color")
ct.style.background_gradient(cmap=cmap, axis=1)

In [None]:
ct = pd.crosstab(data.cut, data.color, data.carat, aggfunc=np.mean)
print("Table 5. Mean carat - cut vs color")
ct.style.background_gradient(cmap=cmap, axis=1)

In [None]:
ct = pd.crosstab(data.cut, data.color)
print("Table 6. Count of diamonds - cut vs color")
ct.style.background_gradient(cmap=cmap, axis=1)

Similar behavior to that observed for cut vs clarity: price is mainly affected by carat, but the count of diamonds shifts towards higher grades.

## Clarity vs Color

In [None]:
ct = pd.crosstab(data.clarity, data.color, data.log_price, aggfunc=np.mean)
print("Table 7. Mean log(price) - clarity vs color")
ct.style.background_gradient(cmap=cmap, axis=1)

'Color' seems to matter more for low 'clarity' grades.

In [None]:
ct = pd.crosstab(data.clarity, data.color, data.carat, aggfunc=np.mean)
print("Table 8. Mean carat - clarity vs color")
ct.style.background_gradient(cmap=cmap, axis=1)

In [None]:
ct = pd.crosstab(data.clarity, data.color)
print("Table 9. Count of diamonds - clarity vs color")
ct.style.background_gradient(cmap=cmap, axis=1)

Again, similar behavior.

## Depth vs Cut

In [None]:
g = sns.catplot(y='depth', kind='violin', hue='cut', data=data,
                col='cut', col_wrap=3)

Interesting result: it seems that high 'cut' grades tend to be found more often between 60 and 65% depth (more concentrated).

## Table vs Cut

In [None]:
g = sns.catplot(y='table', kind='violin', hue='cut', data=data,
                col='cut', col_wrap=3)

Here we have a very similar finding to that of depth: higher cut grades tend to be found around specific table values. Specifically, it seems to be between 55 and 63.

## Depth vs Table vs Cut

In [None]:
g = sns.relplot(x='depth', y='table', data=data, hue='cut',
                col='cut', col_wrap=3)

As observed, better cuts tend to be more restrictive in terms of the range of depth and table values. However, being inside this range does not guarantee a good cut.

From this, we learn that cut may help estimate depth and table, but the other way around is not true.

## Fixed carat: how is price related to other features?
How do the quality features fare for a fixed carat: does an increase in quality mean an increase in price?

In [None]:
data.carat.value_counts(sort=True, ascending=False).head()

So we select carat = 0.3 for the maximum number of diamonds with same carat.

In [None]:
data_fixcarat = data[data.carat == 0.3]
data_fixcarat.carat.describe()

### Cut vs Color

In [None]:
ct = pd.crosstab(data_fixcarat.cut, data_fixcarat.color, data_fixcarat.log_price, aggfunc=np.mean)
print("Table 10. Mean log(price) for carat = 0.3 - cut vs color")
ct.style.background_gradient(cmap=cmap, axis=1)

In [None]:
ct = pd.crosstab(data_fixcarat.cut, data_fixcarat.color, data_fixcarat.log_price, aggfunc=np.mean)
print("Table 11. Mean log(price) for carat = 0.3 - cut vs color")
ct.style.background_gradient(cmap=cmap, axis=0)

Interpretation:
- Table 10. for a given carat and cut, higher color grade related to increase in price.
- Table 11. for a given carat and color, higher cut grade doesn't necessarily mean increase in price.

Looking at the count of diamonds below.

In [None]:
ct = pd.crosstab(data_fixcarat.cut, data_fixcarat.color, data_fixcarat.log_price, aggfunc='count')
print("Table 12. Count of diamonds for carat = 0.3 - cut vs color")
ct.style.background_gradient(cmap=cmap, axis=1)

### Cut vs Clarity

In [None]:
ct = pd.crosstab(data_fixcarat.cut, data_fixcarat.clarity, data_fixcarat.log_price, aggfunc=np.mean)
print("Table 13. Mean log(price) for carat = 0.3 - cut vs clarity")
ct.style.background_gradient(cmap=cmap, axis=1)

In [None]:
ct = pd.crosstab(data_fixcarat.cut, data_fixcarat.clarity, data_fixcarat.log_price, aggfunc=np.mean)
print("Table 14. Mean log(price) for carat = 0.3 - cut vs clarity")
ct.style.background_gradient(cmap=cmap, axis=0)

Interpretation:
- Table 13. for a given carat and cut, higher clarity grade related to increase in price.
- Table 14. for a given carat and clarity, higher cut grade doesn't necessarily mean increase in price.

Looking at the count of diamonds below.

In [None]:
ct = pd.crosstab(data_fixcarat.cut, data_fixcarat.clarity, data_fixcarat.log_price, aggfunc='count')
print("Table 15. Count of diamonds for carat = 0.3 - cut vs clarity")
ct.style.background_gradient(cmap=cmap, axis=1)

### Color vs Clarity

In [None]:
ct = pd.crosstab(data_fixcarat.color, data_fixcarat.clarity, data_fixcarat.log_price, aggfunc=np.mean)
print("Table 16. Mean log(price) for carat = 0.3 - color vs clarity")
ct.style.background_gradient(cmap=cmap, axis=1)

In [None]:
ct = pd.crosstab(data_fixcarat.color, data_fixcarat.clarity, data_fixcarat.log_price, aggfunc=np.mean)
print("Table 17. Mean log(price) for carat = 0.3 - color vs clarity")
ct.style.background_gradient(cmap=cmap, axis=0)

Interpretation:
- Table 16. for a given carat and color, higher clarity grade is related to an increase in price.
- Table 17. for a given carat and clarity, higher color grade is also related to an increase in price.

Here, different than what has been seen previously, both color and clarity grades seem to increase price very clearly.

Below we look at the count of diamonds once again.

In [None]:
ct = pd.crosstab(data_fixcarat.color, data_fixcarat.clarity, data_fixcarat.log_price, aggfunc='count')
print("Table 18. Count of diamonds for carat = 0.3 - color vs clarity")
ct.style.background_gradient(cmap=cmap, axis=1)

### Conclusions: what to expect from the models
1. Carat is highly related to price in a non-linear fashion
    - log-transforming price + adding polynomial terms improved the fit visually
2. Cut, Color and Clarity: given a carat value, we expect them to increase price as the grade is higher, but:
    - Color and clarity clearly displayed that increase when combined for a fixed carat value
    - Cut, from previous literature, is expected to increase price, but we couldn't see it so clearly (other aspects may be affecting)
3. Table: from univariate analysis, increase in table is related to decrease in price - expect to see negative coefficient
4. Depth: no very clear relationship with price

# 7. Modelling

In [None]:
data.head()

## 7.1. Statsmodels

In [None]:
data.drop(['cut_encod', 'color_encod', 'clarity_encod'], axis=1, inplace=True)
data.head()

In [None]:
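# dtype=float keeps the dummy columns numeric (recent pandas versions default to bool)
data = pd.get_dummies(data, drop_first=True, dtype=float)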
data.head()

In [None]:
X = data.drop(['price', 'log_price'], axis=1).values
y = data.price.values

assert X.ndim == 2
assert y.ndim == 1

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.3, 
                                                    random_state=42)

### 7.1.1. First Model: price ~ carat
As a starting point, I use the simplest possible model.

In [None]:
train_X = sm.add_constant(X_train[:, 0])
train_y = y_train.copy()

lm1 = sm.OLS(train_y, train_X).fit()
lm1.summary()

- $R^2$ of 0.850 achieved, good for a starting value
- Coefficients are significant, as p-values are low
- F-test for the regression indicates there is a relationship between predictor and response
- Residuals are non-normal (from Omnibus and Jarque-Bera)

Proceeding to examine residuals:

In [None]:
fitted_y = lm1.fittedvalues
res = lm1.resid
res_student = lm1.get_influence().resid_studentized_internal

In [None]:
f = plt.figure(figsize=(8, 8))
plt.scatter(fitted_y, res, s=70, alpha=.2, edgecolors='k', linewidths=.1)
# x and y passed by keyword: newer seaborn versions no longer accept them positionally
sns.regplot(x=fitted_y, y=res, ci=None, scatter=False, lowess=True,
            line_kws=dict(linewidth=1, color='r'))
plt.plot(fitted_y, np.zeros_like(fitted_y), linestyle='--', color='k')
plt.xlabel("Fitted values", fontdict=dict(fontsize=16))
plt.ylabel("Residuals", fontdict=dict(fontsize=16))
plt.show()

1. LOWESS (Locally WEighted Scatterplot Smoothing) curve shows a U-shape, suggesting a non-linear relationship is present
2. Residuals should be equally scattered around the zero line for all fitted values (constant variance of residuals, or homoscedasticity): the plot displays unequal scatter, i.e., heteroscedasticity.

(1) was expected, since we saw visually that price and carat relationship was not linear. One way to solve it would be a non-linear transformation on carat.

(2) could be solved using a non-linear transformation on price, such as $\sqrt{price}$ or $log(price)$.

Both are suggested by James G. et al. in their book *An Introduction to Statistical Learning* (of which Hastie T. is a co-author). We've visually tested both, so we'll now model them to see the results.

### 7.1.2. Second model: log(price) ~ carat
First just the log-transformation of the response.

In [None]:
train_X = sm.add_constant(X_train[:, 0])
train_y = np.log(y_train)

lm2 = sm.OLS(train_y, train_X).fit()
lm2.summary()

In [None]:
fitted_y = lm2.fittedvalues
res = lm2.resid

In [None]:
f = plt.figure(figsize=(8, 8))
plt.scatter(fitted_y, res, s=70, alpha=.2, edgecolors='k', linewidths=.1)
# x and y passed by keyword: newer seaborn versions no longer accept them positionally
sns.regplot(x=fitted_y, y=res, ci=None, scatter=False, lowess=True,
            line_kws=dict(linewidth=1, color='r'))
plt.plot(fitted_y, np.zeros_like(fitted_y), linestyle='--', color='k')
plt.xlabel("Fitted values", fontdict=dict(fontsize=16))
plt.ylabel("Residuals", fontdict=dict(fontsize=16))
plt.show()

Residuals still exhibit a highly non-linear pattern, and heteroscedasticity wasn't really taken care of. Regarding statistical metrics, $R^2$ got smaller and no improvement is found. We then try the polynomial approach on carat.

### 7.1.3. 3rd model: price ~ carat + carat^2 + carat^3 + carat^4

In [None]:
train_X = X_train[:, 0].reshape(-1, 1)
train_X = PolynomialFeatures(degree=4).fit_transform(train_X)

train_X = sm.add_constant(train_X[:, 1:])
train_y = y_train.copy()

lm3 = sm.OLS(train_y, train_X).fit()
lm3.summary()

In [None]:
fitted_y = lm3.fittedvalues
res = lm3.resid
res_student = lm3.get_influence().resid_studentized_internal

In [None]:
f = plt.figure(figsize=(8, 8))
plt.scatter(fitted_y, res, s=70, alpha=.4)
# x and y passed by keyword: newer seaborn versions no longer accept them positionally
sns.regplot(x=fitted_y, y=res, ci=None, scatter=False, lowess=True,
            line_kws=dict(linewidth=3, color='r'))
plt.plot(fitted_y, np.zeros_like(fitted_y), linestyle='--', color='k')
plt.xlabel("Fitted values", fontdict=dict(fontsize=16))
plt.ylabel("Residuals", fontdict=dict(fontsize=16))
plt.show()

Now the U-shape is practically gone, suggesting a better fit, but heteroscedasticity is still present. Next, both approaches are put to work together.
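
A quick look at what `PolynomialFeatures` produces may help to follow the slicing used in these cells. In this sketch (toy carat values, chosen only for easy arithmetic) the default transformer adds a leading column of ones, which is why the models above drop the first column before calling `sm.add_constant`. Setting `include_bias=False` gives the same result directly.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

carat = np.array([[0.5], [1.0], [2.0]])   # toy values, one column

# default: columns are [1, x, x^2, x^3, x^4]
with_bias = PolynomialFeatures(degree=4).fit_transform(carat)

# without the bias column: [x, x^2, x^3, x^4]
no_bias = PolynomialFeatures(degree=4, include_bias=False).fit_transform(carat)

print(with_bias.shape, no_bias.shape)
print(no_bias[2])   # powers of 2.0
```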

### 7.1.4. 4th model: log(price) ~ carat + carat^2 + carat^3 + carat^4

In [None]:
train_X = X_train[:, 0].reshape(-1, 1)
train_X = PolynomialFeatures(degree=4).fit_transform(train_X)

train_X = sm.add_constant(train_X[:, 1:])
train_y = np.log(y_train)

lm4 = sm.OLS(train_y, train_X).fit()
lm4.summary()

- $R^2$ increased to 0.936, a very significant increase
- Coefficient estimates are still significant
- F statistic still indicates a strong relationship between predictors and response
- Omnibus and Jarque-Bera suggest residuals are still far from normal
- Skew is very close to zero, suggesting more symmetrical residuals
- Kurtosis is somewhat large, indicating more concentration of residuals around zero mean

In [None]:
fitted_y = lm4.fittedvalues
res = lm4.resid
res_student = lm4.get_influence().resid_studentized_internal

In [None]:
f = plt.figure(figsize=(8, 8))
plt.scatter(fitted_y, res, s=70, alpha=.4)
# x and y passed by keyword: newer seaborn versions no longer accept them positionally
sns.regplot(x=fitted_y, y=res, ci=None, scatter=False, lowess=True,
            line_kws=dict(linewidth=3, color='r'))
plt.plot(fitted_y, np.zeros_like(fitted_y), linestyle='--', color='k')
plt.xlabel("Fitted values", fontdict=dict(fontsize=16))
plt.ylabel("Residuals", fontdict=dict(fontsize=16))
plt.show()

In [None]:
f = plt.figure(figsize=(8, 8))
plt.scatter(fitted_y, res_student, s=70, alpha=.4)
plt.plot(fitted_y, np.zeros_like(fitted_y), linestyle='--', color='k')
plt.xlabel("Fitted values", fontdict=dict(fontsize=16))
plt.ylabel("Studentized Residuals", fontdict=dict(fontsize=16))
plt.show()

Up until now I've only looked at the residuals plot, but now we take a look at the Studentized Residuals as well (the residuals divided by an estimate of their standard error).

- Residuals plot shows that:
    - variance of error terms around 0 became more or less constant
    - LOWESS line is almost a straight line
- Studentized Residuals plot shows there are observations with large residuals

### 7.1.5 5th model: log(price) ~ ALL features
This time I'll use the data as a DataFrame for better interpretation of results.

In [None]:
data.head()

In [None]:
columns = data.drop(['carat', 'price', 'log_price'], axis=1).columns
carat_columns = ['carat', 'carat^2', 'carat^3', 'carat^4']

# build polynomial carat terms, excluding the constant (bias) column
carat_poly = X_train[:, 0].reshape(-1, 1)
carat_poly = PolynomialFeatures(degree=4, include_bias=False).fit_transform(carat_poly)
carat_poly_df = pd.DataFrame(data=carat_poly, columns=carat_columns)

# take quality features + concatenate carat and quality features 
train_X_df = pd.DataFrame(data=X_train[:, 1:], columns=columns)
train_X_df = pd.concat([carat_poly_df, train_X_df], axis=1)

# get response DataFrame (log of price)
train_y_df = pd.DataFrame(y_train, columns=['log_price']).apply(np.log)

In [None]:
# train model
train_X_df = sm.add_constant(train_X_df)
lm5 = sm.OLS(train_y_df, train_X_df).fit()
lm5.summary()

- $R^2$ got to 0.985, a very significant increase
- Coefficient estimates are all significant
- F statistic's p-value suggests there is a strong relationship between predictors and response
- Omnibus and Jarque-Bera suggest residuals are far from normal
- Skew is very close to zero, suggesting more symmetrical residuals
- Kurtosis is somewhat large, indicating more concentration of residuals around zero mean
- Condition Number is very high, indicating high collinearity between terms.
    - Although polynomial terms tend to be somewhat collinear, this value wasn't so high when only the carat terms were used
    - Looking back at the pairplot, we see that depth and table are fairly collinear.

To reduce collinearity and increase the accuracy of the coefficient estimates, we'll try next removing depth and table.

In [None]:
fitted_y = lm5.fittedvalues
res = lm5.resid
res_student = lm5.get_influence().resid_studentized_internal

In [None]:
f = plt.figure(figsize=(6, 6))
plt.scatter(fitted_y, res, s=70, alpha=.4)
plt.plot(fitted_y, np.zeros_like(fitted_y), linestyle='--', color='k')
plt.xlabel("Fitted values", fontdict=dict(fontsize=16))
plt.ylabel("Residuals", fontdict=dict(fontsize=16))
plt.show()

In [None]:
f = plt.figure(figsize=(6, 6))
plt.scatter(fitted_y, res_student, s=70, alpha=.4)
plt.plot(fitted_y, np.zeros_like(fitted_y), linestyle='--', color='k')
plt.xlabel("Fitted values", fontdict=dict(fontsize=16))
plt.ylabel("Studentized Residuals", fontdict=dict(fontsize=16))
plt.show()

- Residuals plot shows that variance of error terms around 0 became more or less constant
- Studentized Residuals plot shows there are still large residuals

### 7.1.6 6th model: log(price) ~ ALL features - depth - table
For this model I first tried removing depth, since it has a weaker relationship with price. In doing so, the coefficient estimate for table got a p-value > 0.5, meaning it became non-significant.

Therefore, the next model presents only the version where I removed both depth and table.

In [None]:
train_X_df.head()

In [None]:
# take quality features + concatenate carat and quality features 
train_X_df = train_X_df.drop(['depth', 'table'], axis=1)

# train_y_df remains as used in the previous example

In [None]:
# train model
train_X_df = sm.add_constant(train_X_df)
lm6 = sm.OLS(train_y_df, train_X_df).fit()
lm6.summary()

- $R^2$ stayed at 0.985, one indication that depth and table didn't help much
- Coefficient estimates are all significant
- F statistic's p-value suggests there is a strong relationship between predictors and response
- Omnibus and Jarque-Bera still high, but we'll see from residuals that they are "better behaved"
- Skew is very close to zero, suggesting more symmetrical residuals
- Kurtosis is somewhat large, indicating more concentration of residuals around zero mean
- Condition Number got reduced and the warning is now gone, indicating that collinearity is not a huge issue now (would need to see the VIF statistic to be sure, but for simplicity won't do that now).

In [None]:
fitted_y = lm6.fittedvalues
res = lm6.resid
res_student = lm6.get_influence().resid_studentized_internal

In [None]:
f = plt.figure(figsize=(6, 6))
plt.scatter(fitted_y, res, s=70, alpha=.4)
plt.plot(fitted_y, np.zeros_like(fitted_y), linestyle='--', color='k')
plt.xlabel("Fitted values", fontdict=dict(fontsize=16))
plt.ylabel("Residuals", fontdict=dict(fontsize=16))
plt.show()

In [None]:
f = plt.figure(figsize=(6, 6))
plt.scatter(fitted_y, res_student, s=70, alpha=.4)
plt.plot(fitted_y, np.zeros_like(fitted_y), linestyle='--', color='k')
plt.xlabel("Fitted values", fontdict=dict(fontsize=16))
plt.ylabel("Studentized Residuals", fontdict=dict(fontsize=16))
plt.show()

- Residuals plot shows that variance of error terms around 0 became more or less constant
- Studentized Residuals plot shows there are some very large residuals, even larger than before

### 7.1.7 7th model: log(price) ~ ALL features - depth - table - cut
One last model: now I will try removing cut. Since it seemed to have a low effect on price in the multivariate analysis (cross tables), I want to check.

In [None]:
columns_cut = [col for col in train_X_df.columns if 'cut' in col]

# remove cut columns from DataFrame
train_X_df = train_X_df.drop(columns_cut, axis=1)

# train_y_df is the same as used for the previous model

In [None]:
# train model
train_X_df = sm.add_constant(train_X_df)
lm7 = sm.OLS(train_y_df, train_X_df).fit()
lm7.summary()

Model didn't suffer much:
- $R^2$ dropped to 0.983, one indication that cut didn't help much
- Coefficient estimates are all significant
- F statistic's p-value suggests there is a strong relationship between predictors and response
- Omnibus and Jarque-Bera still high, but we'll see from residuals that they are "better behaved"
- Skew is very close to zero, suggesting more symmetrical residuals
- Kurtosis is somewhat large, indicating more concentration of residuals around zero mean

In [None]:
fitted_y = lm7.fittedvalues
res = lm7.resid
res_student = lm7.get_influence().resid_studentized_internal

In [None]:
f = plt.figure(figsize=(6, 6))
plt.scatter(fitted_y, res, s=70, alpha=.4)
plt.plot(fitted_y, np.zeros_like(fitted_y), linestyle='--', color='k')
plt.xlabel("Fitted values", fontdict=dict(fontsize=16))
plt.ylabel("Residuals", fontdict=dict(fontsize=16))
plt.show()

In [None]:
f = plt.figure(figsize=(6, 6))
plt.scatter(fitted_y, res_student, s=70, alpha=.4)
plt.plot(fitted_y, np.zeros_like(fitted_y), linestyle='--', color='k')
plt.xlabel("Fitted values", fontdict=dict(fontsize=16))
plt.ylabel("Studentized Residuals", fontdict=dict(fontsize=16))
plt.show()

Residuals plot behaves similarly but now we see a few more high studentized residuals appearing.

### Statsmodels: Conclusions

Well, that's pretty much as far as my knowledge in Linear Regression Statistics and Statsmodels goes.

**Regarding the best model**

Model 4 (log(price) ~ polynomial carat) presented the best behavior in terms of residuals. However, by including categorical features like clarity and color, model 7 seemed to outperform model 4 in terms of prediction (as measured by the $R^2$), although there are high studentized residuals for all fitted values.

**What we've done:**
- used Statsmodels to build some less complex models for the price regression task
- evaluated some statistical aspects of each model
- developed some knowledge about the importance of the features
- got an $R^2$ of about 0.985 on the training set
- a few statistics pointed out that the model used might not be the best one; however, the ease of interpretation is in favor of the model built.

**What we do next**
- use some powerful tools like Cross-Validation in Scikit-Learn to evaluate other models
- evaluate the generalization capacity of each model on the test set

## 7.2. Scikit-Learn

Through Statsmodels we've already tried out a few models and, using its statistical API, came to a few conclusions regarding important features.

Therefore, I won't start with a simple model here and build it up: rather, I'll use Ridge and Lasso regressors from Scikit-Learn (regularized regression) on the model with all features and see if I come to similar conclusions regarding feature importance and prediction capability.

Also, I'll use some of Scikit-Learn's functionalities to speed up the preprocessing steps.

### 7.2.1. Get the data

Instead of preparing a different, complete DataFrame as I did for Statsmodels, I'll use `FunctionTransformer()` and `FeatureUnion()` functionalities to transform the data "on the fly" for each model.

In [None]:
data.head()

In [None]:
X = data.drop(['price', 'log_price'], axis=1)    # as DataFrame
y = data[['log_price']]    # as DataFrame

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.3, random_state=42)

print(type(X_train), type(X_test))
print(type(y_train), type(y_test))

For better understanding it's good to go step by step.

In [None]:
# create identifiers for polynomial features and linear features
POLY_COLS = ['carat']
LIN_COLS = [col for col in data.columns if col not in ['carat', 'price', 'log_price']]

In [None]:
# create the functions to get each subset of the data
get_polyfeatures = FunctionTransformer(lambda x: x[POLY_COLS], validate=False)
get_linfeatures = FunctionTransformer(lambda x: x[LIN_COLS], validate=False)

Create pipelines to extract and treat differently each subset of the data.

Below, I extract the features of interest for polynomial transformation and, separately, the features to go untouched (in this case).

In [None]:
poly_pl = Pipeline([
    ('selector', get_polyfeatures),
    ('polynomial', PolynomialFeatures(degree=4, include_bias=False))
])

# display polynomial features after transformation
poly_pl.fit_transform(X_train)[:5, :]

In [None]:
lin_pl = Pipeline([('selector', get_linfeatures)])

# display first few lines of linear features (in this case, no other operation is performed)
lin_pl.fit_transform(X_train).head()

**Notice:** the output of the lin_pl pipeline is a DataFrame whereas the output of the poly_pl pipeline is an array. When they're put together using FeatureUnion, the final output is coerced to an array.
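
The same idea on a three-row toy frame makes the shapes easy to check by hand. In this sketch the values are invented, and the selectors are written as named functions (`select_carat` and `select_rest` are names made up for this example) rather than lambdas, since named functions can be pickled if the pipeline ever needs to be saved.

```python
import numpy as np
import pandas as pd
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.preprocessing import FunctionTransformer, PolynomialFeatures

toy = pd.DataFrame({'carat': [0.5, 1.0, 2.0],
                    'depth': [61.0, 62.5, 60.0],
                    'table': [55.0, 57.0, 58.0]})

def select_carat(df):
    # column(s) that get the polynomial expansion
    return df[['carat']]

def select_rest(df):
    # columns passed through unchanged
    return df[['depth', 'table']]

union = FeatureUnion([
    ('polynomial', Pipeline([
        ('selector', FunctionTransformer(select_carat, validate=False)),
        ('poly', PolynomialFeatures(degree=2, include_bias=False)),
    ])),
    ('linear', FunctionTransformer(select_rest, validate=False)),
])

out = union.fit_transform(toy)
print(type(out), out.shape)
print(out[2])   # carat, carat^2, depth, table for the last row
```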

In [None]:
# join both pipelines into one
prep_join = FeatureUnion([
    ('polynomial', poly_pl),
    ('linear', lin_pl)
])

# display final resulting array's first 5 rows
prep_join.fit_transform(X_train)[:5, :]

In order to leave the whole functionality in just one cell, I reproduce the "complete" pipeline below:

In [None]:
# create identifiers for polynomial features and linear features
POLY_COLS = ['carat']
LIN_COLS = [col for col in data.columns if col not in ['carat', 'price', 'log_price']]

# create the functions to get each subset of the data
get_polyfeatures = FunctionTransformer(lambda x: x[POLY_COLS], validate=False)
get_linfeatures = FunctionTransformer(lambda x: x[LIN_COLS], validate=False)

In [None]:
POLY_DEG=4

# join both pipelines into one
prep_join = FeatureUnion([
    ('polynomial', Pipeline([
        ('selector', get_polyfeatures),
        ('polynomial', PolynomialFeatures(degree=POLY_DEG, include_bias=False))
    ])),
    ('linear', Pipeline([('selector', get_linfeatures)]))
])

Now we use this 'union pipeline' alongside other functionalities. It just lazily treats the data on demand for the algorithms and makes it easier to change the polynomial features degree. 

**Procedure**
1. build pipeline to transform the data and apply a "grid search" for best parameters
2. set up and perform GridSearchCV
3. fit model with best parameter(s)
4. inspect coefficients
5. predict on test set
6. print and store metrics
    - $R^2$,
    - Explained Variance,
    - RMSE,
    - MAE
7. inspect residuals
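
The four metrics from step 6 can be computed as in the sketch below. The arrays are tiny made-up values so the results can be verified by hand, and RMSE is obtained by taking the square root of the MSE.

```python
import numpy as np
from sklearn.metrics import explained_variance_score, r2_score
from sklearn.metrics import mean_absolute_error, mean_squared_error

# made-up true values and predictions
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.5, 5.0, 7.5, 9.0])

r2 = r2_score(y_true, y_pred)
ev = explained_variance_score(y_true, y_pred)
rmse = np.sqrt(mean_squared_error(y_true, y_pred))
mae = mean_absolute_error(y_true, y_pred)

print("R^2: %.3f | Explained variance: %.3f | RMSE: %.4f | MAE: %.2f"
      % (r2, ev, rmse, mae))
```

$R^2$ and Explained Variance coincide here because the errors average to zero. They differ when predictions are biased.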

### 7.2.2. Ridge Regression: L2-norm penalization
On top of the Ordinary Least Squares, Ridge Regression adds a penalty term that is proportional to the square of the coefficients from the regression. The equation below is taken from Scikit-Learn's __[documentation](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Ridge.html)__

$||y - Xw||^2_2 + \alpha \, ||w||^2_2$

This additional term penalizes large coefficients. Since smaller coefficients produce smaller penalties, the fitted coefficients are shrunk towards zero.

The 'alpha' term controls the regularization strength: higher alpha ~ stronger regularization ~ smaller coefficients
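
A quick way to see this effect is to fit Ridge with increasing alphas and watch the size of the coefficient vector. The sketch below uses synthetic data that I made up for this purpose (not the diamonds dataset), so the numbers themselves mean nothing; only the trend matters.

```python
import numpy as np
from sklearn.linear_model import Ridge

# synthetic data, assumed only for illustration
rng = np.random.RandomState(0)
X = rng.normal(size=(200, 5))
true_w = np.array([3.0, -2.0, 1.5, 0.0, 0.5])
y = X @ true_w + rng.normal(scale=0.5, size=200)

alphas = [0.01, 1.0, 100.0, 10000.0]
norms = []
for alpha in alphas:
    coef = Ridge(alpha=alpha).fit(X, y).coef_
    # L2 norm of the coefficients: it should get smaller as alpha grows
    norms.append(np.linalg.norm(coef))
    print("alpha = %10.2f   ||w||_2 = %.4f" % (alpha, norms[-1]))
```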

**1. Build Pipeline**

In [None]:
POLY_DEG = 4

# create the regressor pipeline
ridge_pl = Pipeline([
    ('union', FeatureUnion([
        ('polynomial', Pipeline([
            ('selector', get_polyfeatures),
            ('polynomial', PolynomialFeatures(degree=POLY_DEG, include_bias=False))
        ])),
        ('linear', Pipeline([
            ('selector', get_linfeatures)
        ]))
    ])),
    ('regressor', Ridge(alpha=0.1))
])

In [None]:
# perform first fit and use as starting point
ridge_pl.fit(X_train, y_train)
ridge_pl.score(X_test, y_test)

**2. Set up GridSearchCV, find best parameter(s)**

In [None]:
# set up grid of alphas to search
alphas = np.logspace(-4, 4, 9)

# set up GridSearch object to select best alpha and fit to data
CV_FOLDS = 5
PARAM_GRID = {'regressor__alpha': alphas}
SCORE = 'neg_mean_squared_error'
gs = GridSearchCV(ridge_pl, cv=CV_FOLDS, param_grid=PARAM_GRID, scoring=SCORE)

In [None]:
# fit to data and print best scores and parameters
gs.fit(X_train, y_train)

print("Best hyperparameters:", gs.best_params_)
print("Best RMSE:           ", (-gs.best_score_) ** 0.5)

**3. Fit w/ best parameter(s)**

In [None]:
ridge_pl.set_params(regressor__alpha=1.0)

ridge_pl.fit(X_train, y_train)

**4. Inspect coefficients visually**

In [None]:
ridge_coef = np.squeeze(ridge_pl.named_steps['regressor'].coef_)

In [None]:
predictors = ['carat', 'carat^2', 'carat^3', 'carat^4']
predictors.extend(X_train.columns[1:].tolist())

plt.figure(figsize=(6, 8))
plt.barh(y=range(len(ridge_coef)), width=ridge_coef)
plt.yticks(range(len(ridge_coef)), predictors)
plt.xlabel("Coefficient estimate")
plt.ylabel("Predictors")
plt.title("Ridge Coefficient Estimates", fontdict=dict(fontsize=16))
plt.show()

And the result is very similar to that obtained with Statsmodels:
- carat has the biggest average weight on price, followed by the 2nd order term of carat
- depth, table and cut have very low weight on price
- the quality features have increasing weight with increasing quality grade, as expected.

**5. Predict and print metrics on test set**

In [None]:
# predictions and actual values as arrays
y_pred = ridge_pl.predict(X_test)
y_true = y_test.values

In [None]:
print('R^2: %.4f' % (r2_score(y_test, y_pred)))
print('Exp. Var.: %.4f' % (explained_variance_score(y_test, y_pred)))
print('RMSE: %.4f' % (mean_squared_error(y_test, y_pred) ** .5))
print('MAE: %.4f' % (mean_absolute_error(y_test, y_pred)))

**6. Inspect residuals**

In [None]:
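# flatten both arrays so the residuals are 1-D regardless of the target's shape
resid = np.ravel(y_true) - np.ravel(y_pred)
sns.jointplot(x=np.ravel(y_pred), y=resid, kind='reg',
              joint_kws=dict(fit_reg=False))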
plt.xlabel("Fitted Values")
plt.ylabel("Residuals")
plt.show()

As observed in the Statsmodels models, although there are some very high residuals, there are some good aspects to this residuals plot:
- Residuals resemble normality (kind of)
- Residuals seem to be equally dispersed around zero mean for all fitted values (homoscedasticity)
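
Besides eyeballing the plot, a rough numeric check is possible: compare the spread of the residuals across bins of fitted values. The sketch below is only an illustration on synthetic data with constant noise (my own assumption, not the diamonds data), and the choice of four equal-count bins is arbitrary.

```python
import numpy as np
from sklearn.linear_model import Ridge

# synthetic data with constant noise, assumed only for illustration
rng = np.random.RandomState(0)
X = rng.uniform(0, 10, size=(2000, 1))
y = 2.0 * X[:, 0] + rng.normal(scale=1.0, size=2000)

model = Ridge(alpha=1.0).fit(X, y)
fitted = model.predict(X)
resid = y - fitted

# sort residuals by fitted value and split them into 4 equal-count bins
order = np.argsort(fitted)
bins = np.array_split(resid[order], 4)
spreads = [b.std() for b in bins]

print("mean residual:", resid.mean())
print("std of residuals per bin:", np.round(spreads, 3))
```

Similar spreads across the bins are consistent with homoscedasticity; a spread that grows or shrinks steadily from bin to bin would suggest otherwise.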

### 7.2.3. Lasso Regression: L1-norm penalization
On top of the Ordinary Least Squares, Lasso Regression adds a penalty term that is proportional to the absolute magnitude of the coefficients of the regression. The equation below is taken from Scikit-Learn's __[documentation](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Lasso.html)__

$\frac{1}{2 n_{samples}} \lVert y - Xw \rVert^2_2 + \alpha \lVert w \rVert_1$

This additional term also penalizes large coefficients but, unlike Ridge, it can shrink the smaller coefficients all the way down to zero (sparsity). As a consequence, Lasso is sometimes used to perform feature selection (the "least important" features are left out).

As in Ridge, 'alpha' term controls the regularization strength: higher alpha ~ stronger regularization ~ smaller coefficients
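
The sparsity effect is easy to see on synthetic data where most of the true coefficients are zero. This is a sketch under my own assumptions (made-up data and an arbitrary alpha of 0.5), just to contrast the two penalties.

```python
import numpy as np
from sklearn.linear_model import Ridge, Lasso

# synthetic data: only the first 3 of 10 features actually matter
rng = np.random.RandomState(0)
X = rng.normal(size=(200, 10))
true_w = np.array([5.0, -3.0, 2.0] + [0.0] * 7)
y = X @ true_w + rng.normal(scale=1.0, size=200)

ridge = Ridge(alpha=0.5).fit(X, y)
lasso = Lasso(alpha=0.5).fit(X, y)

# count coefficients that are exactly zero in each model
n_zero_ridge = int(np.sum(ridge.coef_ == 0))
n_zero_lasso = int(np.sum(lasso.coef_ == 0))

print("Ridge coefficients:", np.round(ridge.coef_, 3))
print("Lasso coefficients:", np.round(lasso.coef_, 3))
print("exact zeros -> Ridge: %d, Lasso: %d" % (n_zero_ridge, n_zero_lasso))
```

Ridge keeps every coefficient small but non-zero, while Lasso drops some of them entirely.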

**1. Build Pipeline**

In [None]:
POLY_DEG = 4

# create the regressor pipeline
lasso_pl = Pipeline([
    ('union', FeatureUnion([
        ('polynomial', Pipeline([
            ('selector', get_polyfeatures),
            ('polynomial', PolynomialFeatures(degree=POLY_DEG, include_bias=False))
        ])),
        ('linear', Pipeline([
            ('selector', get_linfeatures)
        ]))
    ])),
    ('regressor', Lasso(alpha=0.1))
])

In [None]:
# perform first fit and use as starting point
lasso_pl.fit(X_train, y_train)
lasso_pl.score(X_test, y_test)

**2. Set up GridSearchCV, find best parameter(s)**

In [None]:
# set up grid of alphas to search
alphas = np.logspace(-4, 4, 9)

# set up GridSearch object to select best alpha and fit to data
CV_FOLDS = 5
PARAM_GRID = {'regressor__alpha': alphas}
SCORE = 'neg_mean_squared_error'
gs = GridSearchCV(lasso_pl, cv=CV_FOLDS, param_grid=PARAM_GRID, scoring=SCORE)

In [None]:
# fit to data and print best scores and parameters
gs.fit(X_train, y_train)

print("Best hyperparameters:", gs.best_params_)
print("Best RMSE:           ", (-gs.best_score_) ** 0.5)

**3. Fit w/ best parameter(s)**

In [None]:
lasso_pl.set_params(regressor__alpha=0.0001)

lasso_pl.fit(X_train, y_train)

**4. Inspect coefficients visually**

In [None]:
lasso_coef = np.squeeze(lasso_pl.named_steps['regressor'].coef_)

In [None]:
predictors = ['carat', 'carat^2', 'carat^3', 'carat^4']
predictors.extend(X_train.columns[1:].tolist())

plt.figure(figsize=(6, 8))
plt.barh(y=range(len(lasso_coef)), width=lasso_coef)
plt.yticks(range(len(lasso_coef)), predictors)
plt.xlabel("Coefficient estimate")
plt.ylabel("Predictors")
plt.title("Lasso Coefficient Estimates, alpha = %.4f" % 0.0001, fontdict=dict(fontsize=16))
plt.show()

As we have observed, the GridSearchCV actually found the best value for alpha to be very low. In practice, this means regularization is almost absent. As a result, the coefficients are not shrunk and feature selection is not performed.

From the plot, in fact, the coefficients resemble those of Ridge regression.

**5. Predict and print metrics on test set**

In [None]:
# get prediction and actual values as arrays
y_pred = lasso_pl.predict(X_test).reshape(-1, 1)
y_true = y_test.values

In [None]:
print('R^2: %.4f' % (r2_score(y_test, y_pred)))
print('Exp. Var.: %.4f' % (explained_variance_score(y_test, y_pred)))
print('RMSE: %.4f' % (mean_squared_error(y_test, y_pred) ** .5))
print('MAE: %.4f' % (mean_absolute_error(y_test, y_pred)))

**6. Inspect residuals**

In [None]:
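# flatten both arrays so the residuals are 1-D regardless of the target's shape
resid = np.ravel(y_true) - np.ravel(y_pred)
sns.jointplot(x=np.ravel(y_pred), y=resid, kind='reg',
              joint_kws=dict(fit_reg=False))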
plt.xlabel("Fitted Values")
plt.ylabel("Residuals")
plt.show()

Residuals plot resembles that of Ridge Regression as well.

### 7.2.4. Elastic Net: Ridge + Lasso

Elastic Net combines the concepts of L1 and L2 regularization by letting one choose the "weight" of each through the `l1_ratio` parameter. By default, it uses `alpha=1` and `l1_ratio=0.5`.

Minimizes:

$\frac{1}{2 n_{samples}} \lVert y - Xw \rVert^2_2 + \alpha \cdot l1_{ratio} \cdot \lVert w \rVert_1 + 0.5 \cdot \alpha \cdot (1 - l1_{ratio}) \cdot \lVert w \rVert^2_2$
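
To get a feel for this objective, the sketch below writes it out by hand and evaluates it at the fitted coefficients and at a shifted version of them. It also checks that `l1_ratio=1` falls back to Lasso. The `elastic_net_objective` helper and the synthetic data are my own assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import ElasticNet, Lasso

def elastic_net_objective(X, y, w, b, alpha, l1_ratio):
    """Evaluate the Elastic Net objective for coefficients w and intercept b."""
    n_samples = X.shape[0]
    resid = y - (X @ w + b)
    return (resid @ resid / (2 * n_samples)
            + alpha * l1_ratio * np.sum(np.abs(w))
            + 0.5 * alpha * (1 - l1_ratio) * (w @ w))

# synthetic data, assumed only for illustration
rng = np.random.RandomState(0)
X = rng.normal(size=(200, 4))
y = X @ np.array([2.0, -1.0, 0.0, 0.5]) + rng.normal(scale=0.3, size=200)

alpha, l1_ratio = 0.1, 0.5
en = ElasticNet(alpha=alpha, l1_ratio=l1_ratio).fit(X, y)

# the fitted coefficients should give a lower objective than shifted ones
obj_fitted = elastic_net_objective(X, y, en.coef_, en.intercept_, alpha, l1_ratio)
obj_shifted = elastic_net_objective(X, y, en.coef_ + 0.5, en.intercept_, alpha, l1_ratio)
print("objective at fitted coefficients: %.4f" % obj_fitted)
print("objective at shifted coefficients: %.4f" % obj_shifted)

# with l1_ratio=1 the L2 term vanishes and the model is a Lasso
lasso = Lasso(alpha=alpha).fit(X, y)
en_as_lasso = ElasticNet(alpha=alpha, l1_ratio=1.0).fit(X, y)
print("same as Lasso when l1_ratio=1:", np.allclose(lasso.coef_, en_as_lasso.coef_))
```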

In [None]:
en_pl = Pipeline([
    ('union', FeatureUnion([
        ('polynomial', Pipeline([
            ('selector', get_polyfeatures),
            ('polynomial', PolynomialFeatures(degree=POLY_DEG, include_bias=False))
        ])),
        ('linear', Pipeline([
            ('selector', get_linfeatures)
        ]))
    ])),
    ('regressor', ElasticNet())
])

In [None]:
en_pl.fit(X_train, y_train)
en_pl.score(X_test, y_test)

In [None]:
# create alphas space for search
alphas = np.logspace(-4, 4, 9)
l1_ratios = np.linspace(0, 1, 6)

Since two parameters are going to be tested, we now use `RandomizedSearchCV` to reduce the workload: instead of testing all 6 x 9 = 54 combinations, it samples a number of them (10 by default, set by `n_iter`) and returns the best result among those.
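
As a small illustration of the sampling, the sketch below runs a randomized search on synthetic data and counts how many combinations were actually tried. The data is made up, the estimator is used without the pipeline for brevity, and I start `l1_ratio` at 0.1 instead of 0 as an assumption of mine to keep the toy example quiet; `random_state` is set only to make the run repeatable.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNet
from sklearn.model_selection import RandomizedSearchCV

# synthetic data, assumed only for illustration
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

param_distributions = {
    'alpha': np.logspace(-4, 4, 9),      # 9 candidate values
    'l1_ratio': np.linspace(0.1, 1, 6),  # 6 candidate values
}

search = RandomizedSearchCV(
    ElasticNet(max_iter=10000),
    param_distributions=param_distributions,
    n_iter=10,                           # try 10 of the 54 combinations
    cv=5,
    scoring='neg_mean_squared_error',
    random_state=0,
)
search.fit(X, y)

n_tried = len(search.cv_results_['params'])
print("combinations tried:", n_tried)
print("best of those:     ", search.best_params_)
```

Because only a sample of the grid is evaluated, the best combination found is the best among those tried, not necessarily the best of all 54.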

In [None]:
# prepare RandomizedSearchCV arguments
CV_FOLDS = 5
PARAM_GRID = {'regressor__alpha': alphas, 'regressor__l1_ratio': l1_ratios}
SCORE = 'neg_mean_squared_error'

gs = RandomizedSearchCV(en_pl, cv=CV_FOLDS, param_distributions=PARAM_GRID, scoring=SCORE)

In [None]:
gs.fit(X_train, y_train)

print("Best hyperparameters:", gs.best_params_)
print("Best RMSE:           ", (-gs.best_score_) ** 0.5)

**3. Fit w/ best parameter(s)**

In [None]:
en_pl.set_params(regressor__alpha=0.0001, regressor__l1_ratio=0.8)

en_pl.fit(X_train, y_train)

**4. Inspect coefficients visually**

In [None]:
en_coef = np.squeeze(en_pl.named_steps['regressor'].coef_)
en_coef

In [None]:
predictors = ['carat', 'carat^2', 'carat^3', 'carat^4']
predictors.extend(X_train.columns[1:].tolist())

plt.figure(figsize=(6, 8))
plt.barh(y=range(len(en_coef)), width=en_coef)
plt.yticks(range(len(en_coef)), predictors)
plt.xlabel("Coefficient estimate")
plt.ylabel("Predictors")
plt.title("ElasticNet Coefficient Estimates\nalpha = %.4f, l1_ratio = %.1f" % (0.0001,0.8), 
          fontdict=dict(fontsize=16))
plt.show()

As observed with Lasso, whenever the L1 penalty term is involved the search selects a very low alpha for this particular dataset, almost "turning off" the regularization.

**5. Predict and print metrics on test set**

In [None]:
# get prediction and actual values as arrays
y_pred = en_pl.predict(X_test).reshape(-1, 1)
y_true = y_test.values

In [None]:
print('R^2: %.4f' % (r2_score(y_test, y_pred)))
print('Exp. Var.: %.4f' % (explained_variance_score(y_test, y_pred)))
print('RMSE: %.4f' % (mean_squared_error(y_test, y_pred) ** .5))
print('MAE: %.4f' % (mean_absolute_error(y_test, y_pred)))

**6. Inspect residuals**

In [None]:
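# flatten both arrays so the residuals are 1-D regardless of the target's shape
resid = np.ravel(y_true) - np.ravel(y_pred)
sns.jointplot(x=np.ravel(y_pred), y=resid, kind='reg',
              joint_kws=dict(fit_reg=False))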
plt.xlabel("Fitted Values")
plt.ylabel("Residuals")
plt.show()

Residuals plot resembles that of Ridge Regression as well.

#### Coefficient comparison: Ridge and Lasso

In [None]:
plt.figure(figsize=(10, 10))
plt.barh(y=range(len(ridge_coef)), width=ridge_coef, color='r')
plt.barh(y=range(len(lasso_coef)), width=lasso_coef, color='b')
plt.yticks(range(len(ridge_coef)), predictors)
plt.xlabel("Coefficient estimate")
plt.ylabel("Predictors")
plt.title("Comparison of Coefficient Estimates\nRidge in red, Lasso in blue", 
          fontdict=dict(fontsize=16))
plt.show()

Notice that, even with a very low alpha, Lasso shrinks the coefficients more than Ridge does.

### Scikit-Learn: Conclusions

After using Statsmodels to investigate linear models and non-linear transformations and to analyse some easily available statistics, we've used Scikit-Learn's API to:
- build Pipelines to create polynomial terms "on the fly" and "grid search" for the best parameters
- perform regularized regressions with Ridge, Lasso and Elastic Net
- investigate results on test sets

Both APIs are great tools to fit and analyse models, with some differences in outputs and capabilities. This is as far as my expertise with both APIs goes for now, so I stop here. I hope you've enjoyed it and that you feel like contributing, commenting, and upvoting!