# FIT5196 Task 3 in Assessment 2
#### Student Name: Sandeep Kumar Kola
#### Student ID: 28976657

Date: 13/05/2018

Version: 1.0

Environment: Python 3.6 and Jupyter notebook

Libraries used:
* pandas (for dataframes, included in Anaconda Python 3.6)
* numpy (for arrays, included in Anaconda Python 3.6)
* sklearn (for predicting missing values; provided by the scikit-learn package, which may need to be installed)
* Use `pip install scikit-learn` if the package is not installed on your machine.

### Task 3
Find the missing values and fill in reasonable values.

#### 1) Import the libraries.

In [None]:
import pandas as pd
import numpy as np

#### 2) Read the data.

In [None]:
dataset3 = pd.read_csv("dataset3_with_missing.csv")
# Create a real copy of the dataframe (plain assignment would only alias it).
dataset3_solution = dataset3.copy()

In [None]:
dataset3_solution.head(5)

In [None]:
# Count the missing values in each column:
dataset3_solution.isnull().sum()

#### 3) Fix the sqft_living values
* Upon observing the data we see that sqft_living is the sum of sqft_above and sqft_basement.
* sqft_living = sqft_above + sqft_basement
* From this relation let's fill the missing values for these columns.
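
As a minimal sketch of this relation on a toy frame (the values below are made up for illustration and are not taken from `dataset3_with_missing.csv`), each missing column can be recovered from the other two with `fillna`, assuming every row is missing at most one of the three columns:

```python
import numpy as np
import pandas as pd

# Hypothetical toy rows; each row is missing at most one of the three columns.
toy = pd.DataFrame({
    "sqft_living":   [np.nan, 2000.0, 1500.0],
    "sqft_above":    [1200.0, np.nan, 1000.0],
    "sqft_basement": [300.0,  500.0,  np.nan],
})

# sqft_living = sqft_above + sqft_basement, rearranged for each column.
toy["sqft_living"] = toy["sqft_living"].fillna(toy["sqft_above"] + toy["sqft_basement"])
toy["sqft_above"] = toy["sqft_above"].fillna(toy["sqft_living"] - toy["sqft_basement"])
toy["sqft_basement"] = toy["sqft_basement"].fillna(toy["sqft_living"] - toy["sqft_above"])

# Every row should now satisfy the relation.
relation_holds = (toy["sqft_living"] == toy["sqft_above"] + toy["sqft_basement"]).all()
print(toy)
print(relation_holds)
```

A row missing two of the three columns cannot be recovered this way and would stay NaN.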

In [None]:
# Look at a few rows where sqft_living is missing.
dataset3_solution[np.isnan(dataset3_solution['sqft_living'])].head(5)

In [None]:
# The below code imputes the missing values of sqft_living by adding the sqft_above and sqft_basement.
dataset3_solution.loc[dataset3_solution['sqft_living'].isnull(), 
                      'sqft_living'] = dataset3_solution['sqft_above'] + dataset3_solution["sqft_basement"]

#### 4) Fix the sqft_above values.

In [None]:
# The code below imputes the missing values of sqft_above by subtracting sqft_basement from sqft_living.
dataset3_solution.loc[dataset3_solution['sqft_above'].isnull(), 
                      'sqft_above'] = dataset3_solution['sqft_living'] - dataset3_solution["sqft_basement"]

#### 5) Fix the sqft_basement values.

In [None]:
# The code below imputes the missing values of sqft_basement by subtracting sqft_above from sqft_living.
dataset3_solution.loc[dataset3_solution['sqft_basement'].isnull(), 
                      'sqft_basement'] = dataset3_solution['sqft_living'] - dataset3_solution["sqft_above"]

In [None]:
# Let's check if all the missing values are filled.
dataset3_solution.isnull().sum()

* The missing bathrooms values still need to be imputed.

#### 6) Imputing the bathroom missing values. 

In [None]:
# Let's have a look at the distinct values in bathrooms (including the missing value, NaN).
bathrooms = dataset3_solution["bathrooms"].unique()
bathrooms = pd.DataFrame(bathrooms)
bathrooms.columns = ['bathrooms']
bathrooms = bathrooms.sort_values(by='bathrooms')
bathrooms

In [None]:
# Let's look at the correlation matrix to see how bathrooms relates to the other columns.
corr = dataset3_solution.corr()
corr

* It seems that bathrooms is correlated with many other columns in the data.
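
To make that reading of the matrix easier, one option is to pull out only the bathrooms column and rank the other columns by absolute correlation. This is a minimal sketch on made-up numbers; the toy frame and its values are assumptions, not the assignment data:

```python
import pandas as pd

# Hypothetical toy data standing in for a few of the real columns.
toy = pd.DataFrame({
    "bathrooms":   [1.0, 1.5, 2.0, 2.5, 3.0],
    "sqft_living": [900, 1400, 1800, 2600, 3100],
    "bedrooms":    [2, 3, 3, 4, 4],
    "yr_built":    [1990, 1950, 2005, 1975, 1960],
})

# Absolute correlation of every other column with bathrooms, strongest first.
bath_corr = (
    toy.corr()["bathrooms"]
    .drop("bathrooms")
    .abs()
    .sort_values(ascending=False)
)
print(bath_corr)
```

On the real data, the same expression applied to `corr` would show which columns carry the most linear signal about bathrooms.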

#### Let's use a random forest algorithm to predict the missing bathrooms data.

In [None]:
# Select the rows where bathrooms is missing; these are the rows to predict later.
final_test_data = dataset3_solution[dataset3_solution['bathrooms'].isnull()]
final_test_data = final_test_data.drop(["bathrooms"],axis=1)

In [None]:
# Let's process training data.
train = dataset3_solution.dropna()
# X is the dataset without the bathrooms values.
X = train.drop(["bathrooms"],axis=1)
# Drop the columns which are not useful in predicting bathrooms.
# Drop the ID.
X = X.drop(["id"],axis=1)
# Drop the date, lat, long and zipcode values too.
X = X.drop(["date", "lat", "long", "zipcode"],axis=1)
# y holds the bathrooms values (the target).
y = train["bathrooms"]

In [None]:
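# Let's import the train_test_split function and split the data randomly: 75% training, 25% test.
# The held-out test set is used to estimate the accuracy of the model.
# Note: sklearn.cross_validation was removed in scikit-learn 0.20; use sklearn.model_selection.
from sklearn.model_selection import train_test_split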
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25, random_state = 0)

In [None]:
# Let's have a look at the training features.
X_train.head(5)

* Let's scale the values using the StandardScaler class.
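
For intuition, here is a minimal NumPy sketch of what standard scaling computes: each column is shifted by its mean and divided by its standard deviation, with both statistics learned from the training rows only. The arrays are hypothetical, and the sketch assumes the population standard deviation (`ddof=0`), which is what `StandardScaler` uses:

```python
import numpy as np

# Hypothetical feature matrices (rows = houses, columns = features).
X_train_toy = np.array([[1000.0, 2.0], [2000.0, 3.0], [3000.0, 4.0]])
X_test_toy = np.array([[2500.0, 3.0]])

# Learn the mean and standard deviation from the training rows only.
mean = X_train_toy.mean(axis=0)
std = X_train_toy.std(axis=0)  # population standard deviation (ddof=0)

# Apply the same statistics to both sets.
X_train_scaled = (X_train_toy - mean) / std
X_test_scaled = (X_test_toy - mean) / std
print(X_train_scaled)
print(X_test_scaled)
```

Reusing the training statistics on the test rows keeps information about the test set out of the model fit.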


In [None]:
# Feature Scaling train and test data.
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
# Fit the scaler on the training data only, then apply the same transform to the test data.
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)

In [None]:
# Fitting Random Forest Classification to the training set.
from sklearn.ensemble import RandomForestClassifier
# Build classifier.
classifier = RandomForestClassifier(n_estimators = 10, criterion = 'entropy', random_state = 0)
# Convert y_train to str so that each distinct bathrooms value is treated as a class label.
y_train = y_train.astype(str)
# Fit the classifier to training data.
classifier.fit(X_train, y_train)
# Predicting the test set results
y_pred = classifier.predict(X_test)

#### Let's calculate the accuracy of the model.
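
A minimal sketch, on made-up labels, of computing accuracy as the share of rows where the prediction equals the actual value. The toy frame is an assumption for illustration; it mirrors the case where actual values are floats and predictions are string labels. It also shows that `DataFrame.duplicated()` measures something different, namely repeated rows:

```python
import pandas as pd

# Hypothetical labels: actual values are floats, predictions are strings.
toy = pd.DataFrame({
    "Actual": [1.0, 2.25, 1.5, 2.0, 1.0],
    "Pred":   ["1.0", "2.25", "2.5", "2.0", "1.0"],
})

# A prediction is correct when it equals the actual value once both are strings.
is_correct = toy["Actual"].astype(str) == toy["Pred"]
accuracy_pct = is_correct.mean() * 100

# duplicated() counts repeated (Actual, Pred) rows, which is a different quantity.
n_duplicated = int(toy.duplicated().sum())

print(accuracy_pct)
print(n_duplicated)
```
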

In [None]:
# Convert the predicted value dataset to dataframe.
y_pred = pd.DataFrame(y_pred)
y_test = pd.DataFrame(y_test)
y_test.reset_index(drop=True, inplace=True)
y_pred.columns = ['pred']
y_test.columns = ['Actual']

# Convert to lists.
y_pred_list = list(y_pred["pred"])
y_test_list = list(y_test["Actual"])

# Make actual dataframe.
accuracy_df = pd.DataFrame(
    {'Actual': y_test_list,
     'Pred': y_pred_list })

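# Correctly predicted rows are those where the prediction equals the actual value.
# DataFrame.duplicated() flags repeated rows, not matches between the two columns,
# so compare the columns directly (Actual is numeric, Pred is a string label).
correct_values = accuracy_df[accuracy_df['Actual'].astype(str) == accuracy_df['Pred']]

In [None]:
len(accuracy_df)

In [None]:
len(correct_values)

In [None]:
# The accuracy of the model (in percent) is:
(len(correct_values) / len(accuracy_df)) * 100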

#### If the accuracy above is acceptable, let's predict the missing values in the final test set.
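
One way to write predictions back into a frame is to keep the index of the rows that were missing and let pandas align on it. This is a minimal sketch under stated assumptions: the toy frame is made up, and `predicted_labels` is a hypothetical stand-in for the output of `classifier.predict`, which returns string labels here:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with two missing bathrooms values.
toy = pd.DataFrame({
    "id": [11, 12, 13, 14],
    "bathrooms": [1.0, np.nan, 2.5, np.nan],
})

missing = toy["bathrooms"].isnull()

# Stand-in for classifier.predict(...): string labels, one per missing row.
predicted_labels = np.array(["1.75", "2.0"])

# Convert the labels back to float and attach the index of the rows they belong to.
filled = pd.Series(predicted_labels.astype(float), index=toy.loc[missing].index)
toy.loc[missing, "bathrooms"] = filled
print(toy)
```

Converting the labels back to float keeps the bathrooms column numeric after the fill.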

In [None]:
# Process the final test data (the rows with missing bathrooms).
# It is important that this data has the same columns and is transformed with the same fitted scaler.
# Drop the ID.
f_test_data = final_test_data.drop(["id"],axis=1)
# Drop the date, lat, long and zipcode values too.
f_test_data = f_test_data.drop(["date", "lat", "long", "zipcode"],axis=1)

# Apply the same transform function.
f_test_data = sc.transform(f_test_data)
# Using the classifier predict function predict the missing values of bathrooms.
f_y_pred = classifier.predict(f_test_data)

# Make a dataframe of the predicted values.
f_y_pred = pd.DataFrame(f_y_pred)
f_y_pred.columns = ['pred']
# Check the count of values.
f_y_pred.pred.value_counts()

In [None]:
# Building the final dataframe.
# Take id values into a list.
id_list = list(final_test_data["id"])
# Take predicted values into a list.
f_y_pred_list = list(f_y_pred["pred"])
# Create a temporary dataframe of the above two columns.
temp = pd.DataFrame({"id" : id_list, "bathrooms" : f_y_pred_list })

# Set id as index.
temp.set_index('id', inplace=True)
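# Set id as index on the solution dataframe too, so the assignment below aligns rows by id.
dataset3_solution.set_index('id', inplace=True)

# Replace the null values in the solution dataframe by the predicted values from the model.
# The classifier returns string labels, so convert them back to float first.
dataset3_solution.loc[dataset3_solution['bathrooms'].isnull(),
                      'bathrooms'] = temp['bathrooms'].astype(float)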

# Check to see if any values are missing.
dataset3_solution.isnull().sum()

In [None]:
# Reset the index of the dataframe.
dataset3_solution = dataset3_solution.reset_index()

In [None]:
dataset3_solution.head(5)

In [None]:
# Save the csv file.
dataset3_solution.to_csv('dataset3_solution.csv', sep=',',index=False)

#### End of Task 3.