## Regression Analysis on Diamonds Dataset

Regression is a statistical method, used in finance, investing, and other disciplines, that estimates the strength and character of the relationship between one dependent variable (usually denoted Y) and one or more independent variables. In this notebook the dependent variable is the price of a diamond and the independent variables are its physical attributes.

![types of regression](https://miro.medium.com/max/688/0*Yfz5JM6W3a36FLD8.png)


About the data:
* `price`: price in US dollars (\$326 to \$18,823)
* `carat`: weight of the diamond (0.2 to 5.01)
* `cut`: quality of the cut (Fair, Good, Very Good, Premium, Ideal)
* `color`: diamond colour, from J (worst) to D (best)
* `clarity`: a measurement of how clear the diamond is (I1 (worst), SI2, SI1, VS2, VS1, VVS2, VVS1, IF (best))
* `x`: length in mm (0 to 10.74)
* `y`: width in mm (0 to 58.9)
* `z`: depth in mm (0 to 31.8)
* `depth`: total depth percentage = z / mean(x, y) = 2 * z / (x + y), expressed as a percentage (43 to 79)
* `table`: width of the top of the diamond relative to its widest point (43 to 95)
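
As a quick sanity check of the `depth` formula, the sketch below recomputes the depth percentage from the three dimensions. The measurements are made-up illustrative values, not rows from the dataset, and `depth_percentage` is a hypothetical helper name. The factor of 100 is there because the column is expressed as a percentage.

```python
def depth_percentage(x, y, z):
    """Total depth percentage: z relative to the mean of x and y, times 100."""
    return 100 * 2 * z / (x + y)

# Illustrative measurements in mm (assumed values, not taken from the data)
example_depth = depth_percentage(x=4.0, y=4.0, z=2.5)
print(example_depth)  # 62.5
```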

### Import required libraries

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

import seaborn as sns
sns.set(style='whitegrid')

import warnings
warnings.filterwarnings('ignore')

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score, mean_squared_error
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV


### Import the dataset

In [None]:
diamond_data = pd.read_csv('../input/diamonds/diamonds.csv')

## High Level Overview of data

In [None]:
diamond_data.shape

In [None]:
diamond_data.info()

In [None]:
diamond_data.describe()

In [None]:
diamond_data.head()

## Data Cleaning and Pre-processing

### Dropping the index column

In [None]:
diamond_data = diamond_data.drop(["Unnamed: 0"],axis=1)
diamond_data.head()

### Check 0 and Null values

In [None]:
print("Number of rows with x == 0: {} ".format((diamond_data.x==0).sum()))
print("Number of rows with y == 0: {} ".format((diamond_data.y==0).sum()))
print("Number of rows with z == 0: {} ".format((diamond_data.z==0).sum()))
print("Number of rows with depth == 0: {} ".format((diamond_data.depth==0).sum()))

In [None]:
# A dimension of 0 mm is physically impossible, so treat it as missing
diamond_data[['x','y','z']] = diamond_data[['x','y','z']].replace(0,np.nan)
diamond_data.isnull().sum()

In [None]:
diamond_data.dropna(inplace=True)
diamond_data.isnull().sum()

In [None]:
diamond_data['volume'] = diamond_data['x']*diamond_data['y']*diamond_data['z']
diamond_data.drop(['x', 'y', 'z'], axis=1, inplace=True)
diamond_data.head()

## Outlier Analysis

In [None]:
sns.pairplot(diamond_data)

In [None]:
diamond_data.plot(kind='box',figsize=(15,10),subplots=True,layout=(3,3))
plt.show()

### Function to remove outliers

In [None]:
def outliers(var):
    """Return the values of column `var` with IQR outliers replaced by NaN."""
    a = []
    q1 = diamond_data[var].quantile(.25)
    q3 = diamond_data[var].quantile(.75)
    iqr = q3-q1
    # Values more than 1.5 * IQR beyond the quartiles count as outliers
    ulim = float(q3+(1.5*iqr))
    llim = float(q1-(1.5*iqr))

    for i in diamond_data[var]:
        if i > ulim or i < llim:
            i = np.nan
        a.append(i)
    return a

for col in diamond_data.select_dtypes(exclude='object').columns:
    diamond_data[col] = outliers(col)
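
The same IQR rule can be written without an explicit loop. The sketch below is one possible vectorized form, shown on a small made-up series; `mask_outliers_iqr` is a hypothetical name and not part of the notebook.

```python
import pandas as pd

def mask_outliers_iqr(series, k=1.5):
    """Replace values outside [q1 - k*IQR, q3 + k*IQR] with NaN."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    # Series.where keeps values where the condition holds and sets the rest to NaN
    return series.where(series.between(lower, upper))

sample = pd.Series([1.0, 2.0, 3.0, 4.0, 100.0])  # made-up values
masked = mask_outliers_iqr(sample)
print(masked)
```

Here the quartiles are 2 and 4, so the limits are -1 and 7 and only the value 100 is masked.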

In [None]:
for i in diamond_data.select_dtypes(exclude='object').columns:
    diamond_data[i] = diamond_data[i].fillna(diamond_data[i].mean())

In [None]:
diamond_data.isnull().sum()

## Data Visualization

In [None]:
sns.scatterplot(data=diamond_data, x='volume', y='carat')

## Analysing Numerical and Categorical data

In [None]:
numerical_cols = diamond_data.select_dtypes(include=np.number).columns.to_list()
categorical_cols = diamond_data.select_dtypes(exclude=np.number).columns.to_list()

In [None]:
numerical_cols

In [None]:
categorical_cols

In [None]:
diamond_data['cut'].value_counts()

In [None]:
diamond_data['color'].value_counts()

In [None]:
diamond_data['clarity'].value_counts()

In [None]:
sns.catplot(x='clarity', data=diamond_data, kind='count',aspect=2.5)

### Label Encoding categorical data

In [None]:
from sklearn import preprocessing 
label_encoder = preprocessing.LabelEncoder() 

diamond_data['cut'] = label_encoder.fit_transform(diamond_data['cut'])
diamond_data['color'] = label_encoder.fit_transform(diamond_data['color'])
diamond_data['clarity'] = label_encoder.fit_transform(diamond_data['clarity'])
diamond_data.head()
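
`LabelEncoder` assigns integer codes in sorted (alphabetical) order of the labels, so the codes for `cut`, `color` and `clarity` do not follow the quality rankings listed in the data description. If the ordering matters, one alternative is an explicit mapping. The sketch below assumes the category spellings given in the data description and runs on a small made-up frame; `encode_ordinal` and the `*_order` names are illustrative.

```python
import pandas as pd

# Orderings from worst to best, as listed in the data description
cut_order = ['Fair', 'Good', 'Very Good', 'Premium', 'Ideal']
color_order = ['J', 'I', 'H', 'G', 'F', 'E', 'D']
clarity_order = ['I1', 'SI2', 'SI1', 'VS2', 'VS1', 'VVS2', 'VVS1', 'IF']

def encode_ordinal(series, order):
    """Map each label to its rank in `order` (0 = worst)."""
    return series.map({label: rank for rank, label in enumerate(order)})

# Made-up rows, used only to demonstrate the mapping
demo = pd.DataFrame({'cut': ['Ideal', 'Fair', 'Premium'],
                     'color': ['E', 'J', 'D'],
                     'clarity': ['SI2', 'IF', 'I1']})
demo['cut'] = encode_ordinal(demo['cut'], cut_order)
demo['color'] = encode_ordinal(demo['color'], color_order)
demo['clarity'] = encode_ordinal(demo['clarity'], clarity_order)
print(demo)
```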

## Segregating Features and Labels

In [None]:
X = diamond_data.drop(columns=['price'])
y = diamond_data['price']

### Train-Test split

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,random_state=142)

## Linear Regression Model

In [None]:
regr = LinearRegression()
model = regr.fit(X_train, y_train)
y_pred = model.predict(X_test)

print("RMSE: {}".format(np.sqrt(mean_squared_error((y_test),(y_pred)))))
print("R2  : {}".format(r2_score(y_test, y_pred)))
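
For reference, both metrics can be computed directly from their definitions: RMSE is the square root of the mean squared residual, and R2 is one minus the ratio of the residual sum of squares to the total sum of squares. The sketch below uses made-up numbers; `rmse` and `r2` are illustrative helper names, not scikit-learn functions.

```python
import numpy as np

def rmse(y_true, y_hat):
    """Root mean squared error, in the same units as the target."""
    y_true = np.asarray(y_true, dtype=float)
    y_hat = np.asarray(y_hat, dtype=float)
    return np.sqrt(np.mean((y_true - y_hat) ** 2))

def r2(y_true, y_hat):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_hat = np.asarray(y_hat, dtype=float)
    ss_res = np.sum((y_true - y_hat) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1 - ss_res / ss_tot

# Made-up targets and predictions
y_true_demo = [3.0, 5.0, 7.0, 9.0]
y_hat_demo = [2.0, 5.0, 8.0, 9.0]
print(rmse(y_true_demo, y_hat_demo))
print(r2(y_true_demo, y_hat_demo))
```

R2 is already a proportion of explained variance, so it is reported as is, without a square root.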

## Improving results by using RandomForestRegressor

In [None]:
rr  = RandomForestRegressor()
rr.fit(X_train,y_train)
y_pred = rr.predict(X_test)

print("RMSE: {}".format(np.sqrt(mean_squared_error((y_test),(y_pred)))))
print("R2  : {}".format(r2_score(y_test, y_pred)))

### Improving results using RandomizedSearchCV

In [None]:
n_estimators = [int(x) for x in np.linspace(10,200,10)]
max_depth = [int(x) for x in np.linspace(10,100,10)]
min_samples_split = [2,3,4,5,10]
min_samples_leaf = [1,2,4,10,15,20]
random_grid = {'n_estimators':n_estimators,'max_depth':max_depth,
               'min_samples_split':min_samples_split,'min_samples_leaf':min_samples_leaf}

random_grid

In [None]:
rf = RandomForestRegressor()
rf_random = RandomizedSearchCV(estimator=rf,
                               param_distributions=random_grid,
                               cv = 3)

rf_random.fit(X_train,y_train)
y_pred = rf_random.predict(X_test)

print("RMSE: {}".format(np.sqrt(mean_squared_error((y_test),(y_pred)))))
print("R2  : {}".format(r2_score(y_test, y_pred)))
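
`RandomizedSearchCV` does not try every combination in the grid: it samples `n_iter` of them (10 by default). The grid above defines 10 x 10 x 5 x 6 = 3,000 combinations, so only a small fraction is evaluated, and the chosen parameters can vary between runs unless `random_state` is fixed. Below is a minimal sketch with explicit `n_iter`, `scoring` and `random_state`, run on synthetic data from `make_regression` rather than the diamonds data. The grid values in `demo_grid` are arbitrary and kept small for speed.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV

# Synthetic regression problem (a stand-in for the real data)
X_demo, y_demo = make_regression(n_samples=200, n_features=5,
                                 noise=10.0, random_state=0)

# Small illustrative grid: 3 x 3 x 3 = 27 combinations
demo_grid = {'n_estimators': [10, 20, 30],
             'max_depth': [3, 5, None],
             'min_samples_leaf': [1, 2, 4]}

search = RandomizedSearchCV(estimator=RandomForestRegressor(random_state=0),
                            param_distributions=demo_grid,
                            n_iter=5,                              # sample 5 of the 27
                            cv=3,
                            scoring='neg_root_mean_squared_error',
                            random_state=0)                        # reproducible sampling
search.fit(X_demo, y_demo)

print(search.best_params_)
print(-search.best_score_)  # cross-validated RMSE of the best candidate
```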

In [None]:
rf_random.best_params_

In [None]:
rf = RandomForestRegressor(n_estimators=73,
                         min_samples_split=3,
                         min_samples_leaf=4,
                         max_depth=40)
rf.fit(X_train,y_train)
y_pred = rf.predict(X_test)

print("RMSE: {}".format(np.sqrt(mean_squared_error((y_test),(y_pred)))))
print("R2  : {}".format(r2_score(y_test, y_pred)))

### Save Final Output

In [None]:
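# Keep the test-set index so each prediction can be traced back to its row
output = pd.Series(y_pred, index=X_test.index, name='predicted_price')
output.to_csv('Final Output.csv')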