**Loading the dataset from the specified location**

In [None]:
import pandas as pd
data=pd.read_csv('../input/diamonds/diamonds.csv')

After loading the data, we will first look at a few rows to get an understanding of it.

In [None]:
data.head()

**Description of the variables**

**Independent Variables**
    1. carat - Weight of the diamond.
    2. cut - Quality of the cut. It is a categorical column with 5 different categories.
    3. color - Categorical column representing the color of the diamond. It is categorised into 7 types.
    4. clarity - Refers to the visual appearance of the internal characteristics of a diamond, called inclusions and blemishes. It has 8 different categories.
    5. depth - Height of the diamond, expressed as a percentage.
    6. table - Represents the flat surface on the top of the diamond.
    7. x - Length in mm.
    8. y - Width in mm.
    9. z - Depth in mm.

**Dependent Variable**
    1. price - Represents the price of the diamond.
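To make the `depth` percentage concrete, here is a minimal sketch. It assumes the usual definition for this dataset (z divided by the mean of x and y, times 100), which the notebook itself does not state. The measurements are made up for illustration.

```python
import pandas as pd

# Toy rows with made-up measurements (not taken from the real dataset).
toy = pd.DataFrame({
    "x": [4.0, 5.0],
    "y": [4.0, 5.2],
    "z": [2.4, 3.06],
})

# Assumed definition: total depth percentage = z / mean(x, y) * 100.
toy["depth_pct"] = 100 * 2 * toy["z"] / (toy["x"] + toy["y"])
print(toy)
```

If the assumption holds, comparing a column computed this way with the `depth` column of the real data is a quick sanity check on x, y and z.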

Now that we know what each variable represents, let's do a few basic checks on the data.

* First, let's check if there are any unwanted columns to be removed.
* Next, let's check for null values present in the data.

From the data we can see that the `Unnamed: 0` column is not required, as it is just an index column. Therefore we will remove it from the data.


In [None]:
data.drop('Unnamed: 0',axis=1,inplace=True)

In [None]:
data.isna().sum()

Super!! We do not have any null values. If null values had been present, we would have either deleted those rows or replaced the missing values with the column mean/median.
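For reference, the two options mentioned above could look like the sketch below. It runs on a small made-up frame, since the real data has no nulls. Using the most frequent value for the categorical column is an assumption added here, not something the notebook prescribes.

```python
import numpy as np
import pandas as pd

# Made-up frame with one missing value per column.
toy = pd.DataFrame({
    "carat": [0.3, np.nan, 0.5, 0.7],
    "cut": ["Ideal", "Good", None, "Ideal"],
})

# Option 1: drop every row that has at least one missing value.
dropped = toy.dropna()

# Option 2: impute. Median for the numeric column,
# most frequent value for the categorical column.
filled = toy.copy()
filled["carat"] = filled["carat"].fillna(filled["carat"].median())
filled["cut"] = filled["cut"].fillna(filled["cut"].mode()[0])

print(dropped)
print(filled)
```

Dropping is the simpler choice when only a few rows are affected. Imputing keeps the rows but changes the distribution of the filled column.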

Now let's get the summary statistics of the data (count, min, max, mean, std, etc.).

By default, the summary is shown only for numerical variables.

In [None]:
data.describe()
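The categorical columns can be summarised as well. The sketch below uses a small made-up frame: selecting the non-numeric columns first makes `describe()` report count, unique, top and freq instead.

```python
import pandas as pd

# Made-up rows that mimic the shape of the diamonds data.
toy = pd.DataFrame({
    "cut": ["Ideal", "Premium", "Ideal", "Good"],
    "color": ["E", "E", "G", "E"],
    "price": [326, 334, 410, 500],
})

# Keep only the non-numeric columns, then summarise them.
cat_summary = toy.select_dtypes(exclude="number").describe()
print(cat_summary)
```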

Wow!! Now we know the stats of the variables. You check them, everything seems fine, and you proceed further.  
Stop!!  
If you look closely, the min values for columns x, y and z are 0, which is impractical: a diamond cannot have a zero dimension.

In [None]:
# Checking for rows where x, y or z is 0
data.loc[(data['x']==0)|(data['y']==0)|(data['z']==0)]


We can see that there are 20 rows containing zero in x, y or z. Since this is impractical, we will exclude these rows from the data.

In [None]:
# Keeping only the rows where x, y and z are all non-zero
data=data[(data[['x','y','z']]!=0).all(axis=1)]
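The filter above can be hard to read at first glance. Here is an equivalent sketch on made-up rows that builds the boolean mask in a separate step, so the number of affected rows can be checked before dropping them.

```python
import pandas as pd

# Made-up measurements; three of the four rows contain a zero.
toy = pd.DataFrame({
    "x": [3.9, 0.0, 4.2, 5.1],
    "y": [4.0, 4.1, 0.0, 5.0],
    "z": [2.4, 2.5, 2.6, 0.0],
})

dims = ["x", "y", "z"]

# True for every row where at least one dimension is exactly 0.
has_zero = (toy[dims] == 0).any(axis=1)
print("rows with a zero dimension:", int(has_zero.sum()))

# Keep the complement of the mask.
cleaned = toy[~has_zero]
print(cleaned)
```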

In [None]:
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import OneHotEncoder
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error,r2_score,mean_absolute_error
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn import metrics
from sklearn.preprocessing import MinMaxScaler
import matplotlib.pyplot as plt
import seaborn as sns

X=data.drop('price',axis=1)
y=data['price']
   


In [None]:
plt.figure(figsize=(15,8))
sns.countplot(x='cut',data=X,order=X['cut'].value_counts().index)


In [None]:
plt.figure(figsize=(15,8))
sns.countplot(x='color',data=X,order=X['color'].value_counts().index)

In [None]:
plt.figure(figsize=(15,8))
sns.countplot(x='clarity',data=X,order=X['clarity'].value_counts().index)

In [None]:
object_col=[col for col in X.columns if X[col].dtype=="object"]
oh_encoder=pd.get_dummies(X[object_col])

num_X_train=X.drop(object_col,axis=1)
scale=MinMaxScaler()
# Note: the scaler is fitted on all rows here, i.e. before the train/test split
scale_X_train=pd.DataFrame(scale.fit_transform(num_X_train),index=num_X_train.index,columns=num_X_train.columns)


In [None]:
diamond_data=pd.concat([scale_X_train,oh_encoder],axis=1)
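The cells above scale and encode the full dataset before it is split, so the test rows influence the scaler's min and max. One possible alternative is sketched below on made-up data. It swaps in scikit-learn's `ColumnTransformer` and `Pipeline`, which this notebook does not use, so that the preprocessing is fitted on the training rows only.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder

# Made-up rows with one numeric and one categorical feature.
toy = pd.DataFrame({
    "carat": [0.2, 0.3, 0.5, 0.7, 0.9, 1.0, 1.2, 1.5],
    "cut": ["Ideal", "Good", "Ideal", "Premium",
            "Good", "Ideal", "Premium", "Good"],
    "price": [300, 400, 900, 1500, 2500, 3000, 4500, 7000],
})
X_toy = toy.drop("price", axis=1)
y_toy = toy["price"]

# Split first, so the test rows stay unseen during preprocessing.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.25, random_state=42
)

# Scale numeric columns and one-hot encode categorical ones.
preprocess = ColumnTransformer([
    ("num", MinMaxScaler(), ["carat"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["cut"]),
])

# fit() learns the scaling and encoding from the training rows only.
model = Pipeline([("prep", preprocess), ("reg", LinearRegression())])
model.fit(X_tr, y_tr)
preds = model.predict(X_te)
print(preds)
```

For min-max scaling on a dataset of this size the difference is likely small, but the pipeline form makes it harder to leak test information by accident.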

In [None]:
# Checking correlation
plt.figure(figsize=(20,20))  # set the figure size to 20 by 20 inches
p=sns.heatmap(diamond_data.corr(), annot=True,cmap='RdYlGn')

In [None]:
X_train,X_test,y_train,y_test=train_test_split(diamond_data,y,test_size=0.2,random_state=42)

#linear Regression 
lnr_model=LinearRegression()
lnr_model.fit(X_train,y_train)
y_pred=lnr_model.predict(X_test)


# score() on a regressor returns R squared, not a classification accuracy
print("Score (R Squared x 100) :- "+ str(lnr_model.score(X_test,y_test)*100) +' %')
print("R Squared :- "+ str(metrics.r2_score(y_test,y_pred)))
print("Mean Absolute Error :- {}".format(mean_absolute_error(y_test,y_pred)))
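`mean_squared_error` is imported above but never used. As a minimal sketch on made-up numbers, this is how RMSE could be reported next to MAE and R squared. RMSE is in the same unit as the price and penalises large errors more heavily than MAE.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Made-up true prices and predictions, each off by exactly 10.
y_true = np.array([100.0, 200.0, 300.0, 400.0])
y_hat = np.array([110.0, 190.0, 310.0, 390.0])

mae = mean_absolute_error(y_true, y_hat)
# Square root of the MSE gives the error in price units.
rmse = np.sqrt(mean_squared_error(y_true, y_hat))
r2 = r2_score(y_true, y_hat)

print("MAE  :- {}".format(mae))
print("RMSE :- {}".format(rmse))
print("R Squared :- {}".format(r2))
```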


In [None]:
#Decision tree
regressor = DecisionTreeRegressor(random_state = 0)  

regressor.fit(X_train, y_train)

y_pred_dt=regressor.predict(X_test)


# score() on a regressor returns R squared, not a classification accuracy
print("Score (R Squared x 100) :- " + str(regressor.score(X_test,y_test)*100) +' %')
print("R Squared :- " + str(metrics.r2_score(y_test,y_pred_dt)))
print("Mean absolute error :- {}".format(mean_absolute_error(y_test,y_pred_dt)))

In [None]:
# Random forest with default parameters
rf=RandomForestRegressor(random_state=42)
rf.fit(X_train,y_train)
rfy_pred=rf.predict(X_test)

# score() on a regressor returns R squared, not a classification accuracy
print("Score (R Squared x 100) :- "+ str(rf.score(X_test,y_test)*100) +' %')
print("R Squared :- "+ str(metrics.r2_score(y_test,rfy_pred)))
print("Mean Absolute Error :- {}".format(mean_absolute_error(y_test,rfy_pred)))

In [None]:
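# Random forest with n_estimators set explicitly
# (100 is also the default since scikit-learn 0.22, so on recent versions this matches the previous cell)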

rf = RandomForestRegressor(n_estimators=100,random_state = 42)

rf.fit(X_train,y_train)

y_pred_rf=rf.predict(X_test)

# score() on a regressor returns R squared, not a classification accuracy
print("Score (R Squared x 100) :- "+ str(rf.score(X_test,y_test)*100) +' %')
print("R Squared :- "+ str(metrics.r2_score(y_test,y_pred_rf)))
print("Mean Absolute Error :- {}".format(mean_absolute_error(y_test,y_pred_rf)))

Stay tuned!! I will improve the results by tuning the parameters of Random Forest, Decision Tree and XGBoost.
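As a rough preview of what such tuning could look like, here is a minimal sketch using scikit-learn's `GridSearchCV` on synthetic data. The grid values are arbitrary placeholders chosen to keep it fast, not recommended settings, and the tuning in the follow-up may well take a different approach.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic regression data standing in for the diamonds features.
X_demo, y_demo = make_regression(
    n_samples=200, n_features=5, noise=10.0, random_state=42
)

# Placeholder grid: every combination is evaluated with 3-fold CV.
param_grid = {"n_estimators": [20, 50], "max_depth": [3, None]}

search = GridSearchCV(
    RandomForestRegressor(random_state=42),
    param_grid,
    cv=3,
    scoring="r2",
)
search.fit(X_demo, y_demo)

print("Best parameters :- {}".format(search.best_params_))
print("Best CV R Squared :- {}".format(search.best_score_))
```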

**Thank You :)**