#Predict the Price of Diamonds
##Data cleaning

##Table of Contents

1. Setup
1. Summarize the dataset
1. Clean the data

###1. Setup

Check the versions of the libraries.

In [5]:
import sys
import numpy as np
import matplotlib 
import pandas as pd
import sklearn as sk
print('Python: {}'.format(sys.version))
print('numpy: {}'.format(np.__version__))
print('matplotlib: {}'.format(matplotlib.__version__))
print('pandas: {}'.format(pd.__version__))
print('sklearn: {}'.format(sk.__version__))

Import all of the modules, functions and objects we are going to use in this project.

Check that the files exist where we expect them.

In [9]:
%sh ls /dbfs/mnt/datalab-datasets/file-samples

Load the diamonds data from the CSV file into a pandas dataframe.

In [11]:
import_df = pd.read_csv('/dbfs/mnt/datalab-datasets/file-samples/diamonds.csv')

###2. Summarize the dataset

####2.1 Dimensions of dataset

Display the number of instances (rows) and attributes (columns) the data contains with the `shape` property.

In [15]:
import_df.shape

####2.2 Peek at the data

Look at the first few rows of data using the `head()` method of the `import_df` dataframe.

In [18]:
import_df.head()

Identify the features and the target variable:
- Qualitative Features (Categorical): Cut, Color, Clarity.
- Quantitative Features (Numerical): Carat, Depth, Table, X, Y, Z.
- Target Variable: Price.
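The split above can also be derived from the column dtypes. The following is a minimal sketch, assuming a small hand-made sample (`sample_df` and its values are hypothetical and only mimic the layout of the diamonds data); it is one possible way to separate the features, not the notebook's own method.

```python
import pandas as pd

# Hypothetical three-row sample that mimics the layout of the diamonds data.
sample_df = pd.DataFrame({
    'carat': [0.23, 0.21, 0.29],
    'cut': ['Ideal', 'Premium', 'Good'],
    'color': ['E', 'E', 'I'],
    'clarity': ['SI2', 'SI1', 'VS2'],
    'depth': [61.5, 59.8, 62.4],
    'table': [55.0, 61.0, 58.0],
    'price': [326, 326, 334],
    'x': [3.95, 3.89, 4.20],
    'y': [3.98, 3.84, 4.23],
    'z': [2.43, 2.31, 2.63],
})

target = 'price'

# Numeric columns, excluding the target, are the quantitative features.
numerical = [c for c in sample_df.select_dtypes(include='number').columns
             if c != target]

# All non-numeric columns are treated as qualitative features.
categorical = list(sample_df.select_dtypes(exclude='number').columns)

print('Categorical:', categorical)
print('Numerical:', numerical)
```

On the real data, the same two `select_dtypes()` calls could be applied to `import_df`, provided the categorical columns were read as non-numeric types.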

###3. Clean the data

####3.1 Drop unnecessary columns

Use the `drop()` method to drop the `Unnamed: 0` column. It only duplicates the row index, which the dataframe already provides.

In [23]:
import_df.drop(['Unnamed: 0'], axis=1, inplace=True)
import_df.head()
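An `Unnamed: 0` column typically appears when a CSV file stores a row index in its first column without a header. As an alternative to dropping it afterwards, the column can be used as the index at load time. The sketch below assumes a small in-memory CSV (`csv_text` is hypothetical sample data, not the real file) to show the difference.

```python
import io
import pandas as pd

# Hypothetical CSV text whose first column is an unnamed row index.
csv_text = ',carat,cut,price\n1,0.23,Ideal,326\n2,0.21,Premium,326\n'

# Default read: the unnamed first column is loaded as 'Unnamed: 0'.
with_extra = pd.read_csv(io.StringIO(csv_text))

# index_col=0 uses the first column as the dataframe index instead.
without_extra = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(list(with_extra.columns))
print(list(without_extra.columns))
```

Note that with `index_col=0` the index keeps the values stored in the file (here starting at 1) rather than the default zero-based range.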

####3.2 Check null values

__Exercise 1__ Use the `info()` method of the `import_df` dataframe to check for null values and to see the type of each feature.

In [27]:
import_df.info()

Confirm the number of null values in each column by chaining the `isnull()` and `sum()` methods.

In [29]:
import_df.isnull().sum()
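To see what this chain returns when nulls are present, here is a minimal sketch on a hypothetical toy dataframe (`toy_df` and its values are made up for illustration). `isnull()` produces a boolean dataframe, and `sum()` counts the `True` values per column.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing value in 'carat' and one in 'cut'.
toy_df = pd.DataFrame({
    'carat': [0.23, np.nan, 0.29],
    'cut': ['Ideal', 'Premium', None],
    'price': [326, 326, 334],
})

# Per-column count of null values.
null_counts = toy_df.isnull().sum()
print(null_counts)

# Total number of null values in the whole dataframe.
total_nulls = int(null_counts.sum())
print(total_nulls)
```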

####3.3 Check invalid values

__Exercise 2__ Look at the statistical summary of numerical features with the `describe()` method of the `import_df` dataframe.

In [32]:
import_df.describe()

The output shows that the minimum values of `x`, `y` and `z` are zero. This makes no sense, because the length, width and height of a diamond must all be positive (non-zero) values.

Look at the rows that have a zero value in any of the dimension variables. Use the `loc` indexer with a boolean condition to select those rows.

In [35]:
import_df.loc[(import_df['x']==0) | (import_df['y']==0) | (import_df['z']==0)]

__Exercise 3__ Count the number of rows with a zero value in any of the dimension variables using the `len()` function.

In [37]:
len(import_df[(import_df['x']==0) | (import_df['y']==0) | (import_df['z']==0)])

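Drop the rows where any dimension variable equals zero. The comparison `!= 0` gives a boolean dataframe, and the `all(axis=1)` method reduces it to one boolean per row that is `True` only when `x`, `y` and `z` are all non-zero. Indexing with that mask keeps only the valid rows.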

In [39]:
import_df = import_df[(import_df[['x','y','z']] != 0).all(axis=1)]
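The intermediate steps of this filter can be sketched on a hypothetical toy dataframe (`toy_df` and its values are made up for illustration) to make the row-wise reduction visible.

```python
import pandas as pd

# Hypothetical toy data: only the first row has all three dimensions non-zero.
toy_df = pd.DataFrame({
    'x': [3.95, 0.0, 4.20, 5.10],
    'y': [3.98, 3.84, 4.23, 0.0],
    'z': [2.43, 2.31, 0.0, 3.00],
})

# Step 1: element-wise comparison gives a boolean dataframe.
nonzero = toy_df[['x', 'y', 'z']] != 0

# Step 2: all(axis=1) is True only for rows where every column is True.
keep = nonzero.all(axis=1)

# Step 3: boolean indexing keeps the rows marked True.
cleaned = toy_df[keep]

print(keep.tolist())
print(len(cleaned))
```

The filtered dataframe keeps the original index labels; calling `reset_index(drop=True)` afterwards is an option if a contiguous index is needed.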

__Exercise 4__ Check again for rows with a zero value in the dimension variables (`x`, `y` and `z`) of the `import_df` dataframe. Use the `loc` indexer to select those rows; the result should now be empty.

In [41]:
import_df.loc[(import_df['x']==0) | (import_df['y']==0) | (import_df['z']==0)]

This section practiced checking for and cleaning up unnecessary columns, null values and invalid values.

__Exercise:__ The Pima Indians dataset also has some invalid or missing values. Use that dataset to practice the methods introduced in the notebook.

- Check the Pima Indians dataset (https://www.kaggle.com/uciml/pima-indians-diabetes-database).
The CSV file can be found under the `datalab-datasets` directory, as displayed in the cell below.

In [44]:
%sh ls /dbfs/mnt/datalab-datasets/file-samples/pima-indians-diabetes.csv