# **Diamonds: Linear Regression Analysis**

## Background & Goal of the Project 
We will be working with the Diamonds Dataset, which contains information about 53,940 diamonds sold in the United States. More information about the dataset, including a description of its columns, is available [on Kaggle](https://www.kaggle.com/datasets/shivam2503/diamonds).

Our goal is to create and compare two linear regression models to estimate the label **ln_price**. The first model will use **ln_carat** as the only feature. The second will use **ln_carat**, **cut**, **color**, and **clarity** as features. To use the categorical variables in a model, we will need to encode them using one-hot encoding.

## Import Necessary Libraries 

In [1]:
# Import necessary packages
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder
from sklearn.linear_model import LinearRegression

## Loading the Data & Preliminary Analysis

In [2]:
# Load dataset to DataFrame
diamonds = pd.read_csv('/kaggle/input/diamonds/diamonds.csv', delimiter=',')

In [3]:
# Display the first five rows of the dataset
diamonds.head()

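| | Unnamed: 0 | carat | cut | color | clarity | depth | table | price | x | y | z |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0.23 | Ideal | E | SI2 | 61.5 | 55.0 | 326 | 3.95 | 3.98 | 2.43 |
| 1 | 2 | 0.21 | Premium | E | SI1 | 59.8 | 61.0 | 326 | 3.89 | 3.84 | 2.31 |
| 2 | 3 | 0.23 | Good | E | VS1 | 56.9 | 65.0 | 327 | 4.05 | 4.07 | 2.31 |
| 3 | 4 | 0.29 | Premium | I | VS2 | 62.4 | 58.0 | 334 | 4.2 | 4.23 | 2.63 |
| 4 | 5 | 0.31 | Good | J | SI2 | 63.3 | 58.0 | 335 | 4.34 | 4.35 | 2.75 |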


In [4]:
diamonds.shape

(53940, 11)

In [5]:
# View summary information about the dataset
diamonds.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 53940 entries, 0 to 53939
Data columns (total 11 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   Unnamed: 0  53940 non-null  int64  
 1   carat       53940 non-null  float64
 2   cut         53940 non-null  object 
 3   color       53940 non-null  object 
 4   clarity     53940 non-null  object 
 5   depth       53940 non-null  float64
 6   table       53940 non-null  float64
 7   price       53940 non-null  int64  
 8   x           53940 non-null  float64
 9   y           53940 non-null  float64
 10  z           53940 non-null  float64
dtypes: float64(6), int64(2), object(3)
memory usage: 4.5+ MB


In [6]:
diamonds.describe()

| | Unnamed: 0 | carat | depth | table | price | x | y | z |
|---|---|---|---|---|---|---|---|---|
| count | 53940.0 | 53940.0 | 53940.0 | 53940.0 | 53940.0 | 53940.0 | 53940.0 | 53940.0 |
| mean | 26970.5 | 0.79794 | 61.749405 | 57.457184 | 3932.799722 | 5.731157 | 5.734526 | 3.538734 |
| std | 15571.281097 | 0.474011 | 1.432621 | 2.234491 | 3989.439738 | 1.121761 | 1.142135 | 0.705699 |
| min | 1.0 | 0.2 | 43.0 | 43.0 | 326.0 | 0.0 | 0.0 | 0.0 |
| 25% | 13485.75 | 0.4 | 61.0 | 56.0 | 950.0 | 4.71 | 4.72 | 2.91 |
| 50% | 26970.5 | 0.7 | 61.8 | 57.0 | 2401.0 | 5.7 | 5.71 | 3.53 |
| 75% | 40455.25 | 1.04 | 62.5 | 59.0 | 5324.25 | 6.54 | 6.54 | 4.04 |
| max | 53940.0 | 5.01 | 79.0 | 95.0 | 18823.0 | 10.74 | 58.9 | 31.8 |


## Dataset Preprocessing

In [7]:
# Drop the redundant row-number column 'Unnamed: 0' from the DataFrame
diamonds.drop(columns='Unnamed: 0', inplace=True)


In [8]:
# Adding new columns ln_carat and ln_price to the diamonds dataset
diamonds['ln_carat'] = np.log(diamonds['carat'])
diamonds['ln_price'] = np.log(diamonds['price'])

# Display the first five rows of the dataframe
diamonds.head()

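| | carat | cut | color | clarity | depth | table | price | x | y | z | ln_carat | ln_price |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.23 | Ideal | E | SI2 | 61.5 | 55.0 | 326 | 3.95 | 3.98 | 2.43 | -1.469676 | 5.786897 |
| 1 | 0.21 | Premium | E | SI1 | 59.8 | 61.0 | 326 | 3.89 | 3.84 | 2.31 | -1.560648 | 5.786897 |
| 2 | 0.23 | Good | E | VS1 | 56.9 | 65.0 | 327 | 4.05 | 4.07 | 2.31 | -1.469676 | 5.78996 |
| 3 | 0.29 | Premium | I | VS2 | 62.4 | 58.0 | 334 | 4.2 | 4.23 | 2.63 | -1.237874 | 5.811141 |
| 4 | 0.31 | Good | J | SI2 | 63.3 | 58.0 | 335 | 4.34 | 4.35 | 2.75 | -1.171183 | 5.814131 |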


In [9]:
# Prepare numerical and categorical features
X_num = diamonds['ln_carat'].values.reshape(-1, 1)
X_cat = diamonds[['cut', 'color', 'clarity']].values
y = diamonds['ln_price'].values

# Print the shapes of the arrays
print(f'Numerical Feature Array Shape:   {X_num.shape}')
print(f'Categorical Feature Array Shape: {X_cat.shape}')
print(f'Label Array Shape:               {y.shape}')

Numerical Feature Array Shape:   (53940, 1)
Categorical Feature Array Shape: (53940, 3)
Label Array Shape:               (53940,)


In [10]:
# Create a OneHotEncoder object
encoder = OneHotEncoder(sparse_output=False)

# Fit the encoder to the categorical features
encoder.fit(X_cat)

# Encode the categorical features
X_enc = encoder.transform(X_cat)

# Print the shape of the encoded features array
print(f'Encoded Feature Array Shape: {X_enc.shape}')

Encoded Feature Array Shape: (53940, 20)


In [11]:
# Combine numerical and encoded categorical features
X = np.hstack((X_num, X_enc))

# Print the shape of the combined feature array
print(f'Feature Array Shape: {X.shape}')

Feature Array Shape: (53940, 21)


In [12]:
# Split the data into training and holdout sets using an 80/20 split
X_train, X_hold, y_train, y_hold = train_test_split(X, y, test_size=0.20, random_state=1)

# Split the holdout data into validation and test sets using a 50/50 split
X_valid, X_test, y_valid, y_test = train_test_split(X_hold, y_hold, test_size=0.50, random_state=1)

# Print the shapes of the training, validation, and test feature arrays
print(f'Training Features Shape:   {X_train.shape}')
print(f'Validation Features Shape: {X_valid.shape}')
print(f'Test Features Shape:       {X_test.shape}')

Training Features Shape:   (43152, 21)
Validation Features Shape: (5394, 21)
Test Features Shape:       (5394, 21)
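
The two calls to `train_test_split` above use `test_size=0.20` followed by `test_size=0.50`, which works out to roughly 80% training, 10% validation, and 10% test. The sketch below checks that arithmetic on a small stand-in array; the names `X_demo` and `y_demo` and the row count of 1000 are assumptions chosen for illustration only.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in data (hypothetical): 1000 rows with one feature each
n = 1000
X_demo = np.arange(n).reshape(-1, 1)
y_demo = np.arange(n)

# First split: 80% training, 20% holdout
X_tr, X_ho, y_tr, y_ho = train_test_split(X_demo, y_demo, test_size=0.20, random_state=1)

# Second split: divide the holdout set evenly into validation and test
X_va, X_te, y_va, y_te = train_test_split(X_ho, y_ho, test_size=0.50, random_state=1)

for name, part in [('train', X_tr), ('valid', X_va), ('test', X_te)]:
    print(f'{name}: {len(part)} rows ({len(part) / n:.0%})')
```

The same proportions applied to 53,940 rows give the 43,152 / 5,394 / 5,394 shapes printed above.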


## Linear Regression Model with One Feature

In [13]:
# Create a linear regression model
dia_mod_1 = LinearRegression()

# Fit the model to the training data using only the first column of X_train (ln_carat)
dia_mod_1.fit(X_train[:, 0].reshape(-1, 1), y_train)

# Calculate r-squared values for training and validation sets
train_r2 = dia_mod_1.score(X_train[:, 0].reshape(-1, 1), y_train)
val_r2 = dia_mod_1.score(X_valid[:, 0].reshape(-1, 1), y_valid)

# Print the results with formatted messages
print(f"Training r-Squared:   {train_r2:.4f}")
print(f"Validation r-Squared: {val_r2:.4f}")

Training r-Squared:   0.9330
Validation r-Squared: 0.9348
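
One way to read this single-feature model: if price follows a power law in carat, price = a * carat^b, then taking logs gives ln_price = ln(a) + b * ln_carat, which is linear in ln_carat. The sketch below illustrates this on synthetic, noiseless data. The constants 4000 and 1.7 are made-up values, not estimates from the diamonds dataset, and `loglog_mod` is a hypothetical name.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data (hypothetical): an exact power law, price = 4000 * carat ** 1.7
rng = np.random.default_rng(0)
carat = rng.uniform(0.2, 5.0, size=500)
price = 4000 * carat ** 1.7

# Fit a linear model in log-log space
loglog_mod = LinearRegression()
loglog_mod.fit(np.log(carat).reshape(-1, 1), np.log(price))

# The slope recovers the exponent b; exp(intercept) recovers the scale a
print(f'Slope:          {loglog_mod.coef_[0]:.4f}')
print(f'exp(intercept): {np.exp(loglog_mod.intercept_):.2f}')

# Predictions are in log space; np.exp converts them back to a price
pred_price = np.exp(loglog_mod.predict(np.log([[1.0]])))
print(f'Predicted price at 1 carat: {pred_price[0]:.2f}')
```

The same `np.exp` step could be applied to predictions from `dia_mod_1` to express them in the original price units.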


## Linear Regression with Several Features

In [14]:
# Create a linear regression model
dia_mod_2 = LinearRegression()

# Fit the model to the training data using all features in X_train
dia_mod_2.fit(X_train, y_train)

# Calculate r-squared values for training and validation sets
train2_r2 = dia_mod_2.score(X_train, y_train)
val2_r2 = dia_mod_2.score(X_valid, y_valid)

# Print the results with formatted messages
print(f"Training r-Squared:   {train2_r2:.4f}")
print(f"Validation r-Squared: {val2_r2:.4f}")

Training r-Squared:   0.9825
Validation r-Squared: 0.9834


## Findings

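Adding the categorical features raises the r-squared value by nearly 0.05 on both the training set (0.9330 to 0.9825) and the validation set (0.9348 to 0.9834). While ln_carat alone explains a large proportion of the variance in the target variable ln_price, the three categorical features cut, color, and clarity explain a bit more. For both models the training and validation scores are very close, which suggests that neither model is overfitting.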
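
The size of the improvement can be checked directly from the four scores reported above. The sketch below also expresses the gain as a share of the variance that the one-feature model left unexplained, which is one possible way to put the change in context. The dictionary names are hypothetical.

```python
# R-squared values copied from the outputs of the two models above
r2_one = {'train': 0.9330, 'valid': 0.9348}
r2_all = {'train': 0.9825, 'valid': 0.9834}

for split in ('train', 'valid'):
    # Absolute increase in r-squared from adding cut, color, and clarity
    gain = r2_all[split] - r2_one[split]
    # Fraction of the variance left unexplained by model 1 that model 2 accounts for
    share = gain / (1 - r2_one[split])
    print(f'{split}: gain = {gain:.4f}, share of remaining variance = {share:.1%}')
```

Seen this way, the categorical features account for most of what ln_carat alone could not explain, even though the absolute change in r-squared looks small.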

## Score the Model

In [15]:
# Score the model dia_mod_2 using the test set
test_r2 = dia_mod_2.score(X_test, y_test)
print("Testing r-Squared:", f"{test_r2:.4f}")

Testing r-Squared: 0.9825
