# Final Project Submission

Florian Salihovic

## Understanding Business Requirements

### Technical details
R-squared and adjusted R-squared should both be between 0.7 and 0.9.
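
Since the requirement names both metrics, it may help to recall how they relate. The following is a minimal sketch, not part of the original analysis: the function names `adjusted_r_squared` and `meets_requirement` are hypothetical, and the formula assumes a linear model with an intercept, `n_observations` rows and `n_predictors` features.

```python
def adjusted_r_squared(r_squared: float, n_observations: int, n_predictors: int) -> float:
    """Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - p - 1)."""
    degrees_of_freedom = n_observations - n_predictors - 1
    if degrees_of_freedom <= 0:
        raise ValueError("need more observations than predictors + 1")
    return 1 - (1 - r_squared) * (n_observations - 1) / degrees_of_freedom


def meets_requirement(score: float, lower: float = 0.7, upper: float = 0.9) -> bool:
    """Check a score against the target range stated above (bounds inclusive)."""
    return lower <= score <= upper
```

The adjustment penalises every additional predictor, so adjusted R-squared is never larger than R-squared for the same model.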

## Data Mining

The data is provided as a CSV table containing a series of columns:



#### Column Names and Descriptions for the King County Data Set

| Key           | Description                                                                  | Expected Data Type          |
|:--------------|:-----------------------------------------------------------------------------|:----------------------------|
| id            | unique identifier for a house                                                | numeric, positive           |
| date          | date the house was sold                                                      | pandas.Timestamp            |
| price         | prediction target                                                            | numeric, positive           |
| bedrooms      | number of bedrooms/house                                                     | numeric, positive           |
| bathrooms     | number of bathrooms/bedrooms                                                 | numeric, positive           |
| sqft_living   | square footage of the home                                                   | numeric, positive           |
| sqft_lot      | square footage of the lot                                                    | numeric, positive           |
| floors        | total floors (levels) in house                                               | numeric, positive           |
| waterfront    | has a view to a waterfront                                                   | boolean, optional           |
| view          | has been viewed                                                              | numeric, positive           |
| condition     | how good the condition is (overall)                                          | numeric, positive           |
| grade         | overall grade given to the housing unit, based on King County grading system |                             |
| sqft_above    | square footage of house apart from basement                                  | numeric, positive, optional |
| sqft_basement | square footage of the basement                                               | numeric, positive           |
| yr_built      | year the house was built                                                     | numeric, positive           |
| yr_renovated  | year when house was renovated                                                | numeric, positive, optional |
| zipcode       | zip code                                                                     | numeric, positive           |
| lat           | latitude coordinate                                                          | numeric, optional           |
| long          | longitude coordinate                                                         | numeric, optional           |
| sqft_living15 | square footage of interior housing living space for the nearest 15 neighbors | numeric, optional           |
| sqft_lot15    | square footage of the land lots of the nearest 15 neighbors                  | numeric, optional           |


##### Additional Credits And Sources
- [Variable Explanation, kaggle.com](https://www.kaggle.com/harlfoxem/housesalesprediction/discussion/23194)
- [King County Home Sales: Analysis and the limitations of a multiple regression model, JuanPablo Murillo, February 23, 2016](https://rstudio-pubs-static.s3.amazonaws.com/155304_cc51f448116744069664b35e7762999f.html)

### Simple File Line Count

Running `wc -l King_County_House_prices_dataset.csv` gives direct feedback on the number of lines in the file. This is a useful reference value: if it does not match the number of rows read later, the formatting of the dataset may be broken.

In [None]:
!wc -l King_County_House_prices_dataset.csv
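
The same check can be done from Python, which keeps the notebook usable on systems without `wc`. This is a minimal sketch; the helper names `count_lines` and `expected_row_count` are hypothetical, and the assumption of exactly one header line is mine.

```python
from pathlib import Path


def count_lines(path: str) -> int:
    """Count newline characters in a file, which is what `wc -l` reports."""
    with Path(path).open("rb") as handle:
        # Read in 1 MiB chunks so large files are not loaded into memory at once.
        return sum(chunk.count(b"\n") for chunk in iter(lambda: handle.read(1 << 20), b""))


def expected_row_count(path: str, header_lines: int = 1) -> int:
    """Rows a parser should yield if every remaining line holds exactly one record."""
    return count_lines(path) - header_lines
```

Once the file has been loaded, comparing `expected_row_count('King_County_House_prices_dataset.csv')` with `len(df)` shows whether both agree. Like `wc -l`, the helper does not count a final line that lacks a trailing newline.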

### Exploring The Dataset With Pandas

In [None]:
import pandas as pd

In [None]:
df = pd.read_csv('King_County_House_prices_dataset.csv', delimiter=",")

#### DataFrame: Basic Information

Understanding the values is important as these define the operations we can perform (consistently) on the DataFrame. Basic information about the DataFrame can be obtained by calling [pandas.DataFrame.info](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.info.html) and [pandas.DataFrame.describe](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.describe.html).

In [None]:
df.info()

In [None]:
df.describe()
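
The output of `df.info()` can also be compared with the expected data types programmatically. The sketch below uses a small, made-up sample frame instead of the real dataset, and `non_numeric_columns` is a hypothetical helper name; `pandas.api.types.is_numeric_dtype` is the only pandas function it relies on.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype


def non_numeric_columns(frame: pd.DataFrame, expected_numeric: list) -> list:
    """Return the columns that are expected to be numeric but are not."""
    return [
        name
        for name in expected_numeric
        if name in frame.columns and not is_numeric_dtype(frame[name])
    ]


# Illustrative values only: a string placeholder forces a non-numeric column.
sample = pd.DataFrame({
    "price": [221900.0, 538000.0],
    "sqft_basement": ["0.0", "?"],
})
suspicious = non_numeric_columns(sample, ["price", "sqft_basement"])
print(suspicious)
```

A column that shows up in this list usually contains non-numeric placeholders and needs cleaning before it can be used in a regression.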

#### DataFrame: Basic Inspections

[pandas.DataFrame.head](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.head.html) and [pandas.DataFrame.tail](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.tail.html) provide a first glance at the data contained in the DataFrame.

In [None]:
df.head()

In [None]:
df.tail()

### Data Cleaning

`date` is a string following the pattern `month/day/year`. It may be useful to convert the values into [pandas.Timestamp](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Timestamp.html) objects, as these provide additional and more useful comparison mechanics.

In [None]:
# Convert the whole column at once instead of parsing element by element.
df['date_as_datetime'] = pd.to_datetime(df['date'])

Inspecting the DataFrame, we can see that the data in `date` and `date_as_datetime` is rendered differently. `DataFrame.head` gives a first glance at the modified DataFrame, while `DataFrame.info` shows the exact data types.

In [None]:
df.head()

In [None]:
df.info()
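
The conversion can also be made explicit and verified in code. The following sketch uses a few made-up date strings in the `month/day/year` pattern; passing `format` to `pd.to_datetime` is my assumption about how to pin the pattern down, not a step taken in the original notebook.

```python
import pandas as pd
from pandas.api.types import is_datetime64_any_dtype

# Illustrative values only, following the month/day/year pattern.
sample = pd.DataFrame({"date": ["10/13/2014", "12/9/2014", "2/25/2015"]})

# An explicit format avoids any ambiguity between month/day and day/month.
sample["date_as_datetime"] = pd.to_datetime(sample["date"], format="%m/%d/%Y")

# The new column is a datetime column, so date components become accessible.
is_converted = is_datetime64_any_dtype(sample["date_as_datetime"])
months = sample["date_as_datetime"].dt.month.tolist()
days = sample["date_as_datetime"].dt.day.tolist()
```

With an explicit format, a value that does not match the pattern raises an error instead of being parsed silently in an unexpected way.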

#### Searching For Null Values

In [None]:
df.isnull().any()
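
`any()` only reports whether a column contains null values at all. To see how many values are affected, the counts and shares can be listed per column. This is a minimal sketch on a made-up sample frame; the column names are taken from the dataset, the values are not.

```python
import numpy as np
import pandas as pd

# Illustrative values only.
sample = pd.DataFrame({
    "waterfront": [0.0, np.nan, 1.0, np.nan],
    "view": [0.0, 0.0, np.nan, 3.0],
    "price": [221900.0, 538000.0, 180000.0, 604000.0],
})

# Absolute number and relative share of null values per column.
missing = pd.DataFrame({
    "missing": sample.isnull().sum(),
    "share": sample.isnull().mean(),
})

# Keep only the columns that actually contain null values.
missing = missing[missing["missing"] > 0]
```

The share helps to decide whether dropping the affected rows is acceptable or whether the values should be filled in instead.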

#### Unspecified/Missing Value: `sqft_basement`

In [None]:
df[df['sqft_basement'] == '?'].describe()
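
Because of the `'?'` placeholder, `sqft_basement` is read as a string column. One possible way to handle this, shown here as a sketch on made-up values and not as the approach taken in this notebook, is to convert the column with `pd.to_numeric` and let the placeholder become `NaN`. The column name `sqft_basement_numeric` is hypothetical.

```python
import pandas as pd

# Illustrative values only; '?' marks an unspecified basement size.
sample = pd.DataFrame({"sqft_basement": ["0.0", "400.0", "?", "910.0"]})

# Count the placeholders before converting.
placeholder_count = int((sample["sqft_basement"] == "?").sum())

# errors="coerce" turns every value that cannot be parsed into NaN.
sample["sqft_basement_numeric"] = pd.to_numeric(sample["sqft_basement"], errors="coerce")
nan_count = int(sample["sqft_basement_numeric"].isna().sum())
```

Since `errors="coerce"` converts any unparseable value and not just `'?'`, comparing `placeholder_count` with `nan_count` confirms that nothing else was silently turned into `NaN`.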

#### Unspecified/Missing Value: `waterfront`

In [None]:
df[df['waterfront'].isna()].describe()

#### Unspecified/Missing Value: `view`

In [None]:
df[df['view'].isna()].describe()