# **STEP 1: Data Importing and Pre-processing**
## - Import dataset and describe characteristics such as dimensions, data types, file types, and import methods used
## - Clean, wrangle, and handle missing data
## - Transform data appropriately using techniques such as aggregation, normalization, and feature construction
## - Reduce redundant data and perform need-based discretization

In [1]:
# Import all packages used for the project in the first cell. Use code cells for code and comments,
# and use markdown cells for headings and descriptions.

In [2]:
import pandas as pd
import numpy as np

In [3]:
df = pd.read_csv("house_sales.csv")
df.head()

Unnamed: 0,id,date,price,bedrooms,bathrooms,sqft_living,sqft_lot,floors,waterfront,view,...,grade,sqft_above,sqft_basement,yr_built,yr_renovated,zipcode,lat,long,sqft_living15,sqft_lot15
0,7129300520,20141013T000000,221900.0,3.0,1.0,1180.0,5650.0,1.0,0,0,...,7,1180,0,1955,0,98178,47.5112,-122.257,1340,5650
1,6414100192,20141209T000000,538000.0,3.0,2.25,2570.0,7242.0,2.0,0,0,...,7,2170,400,1951,1991,98125,47.721,-122.319,1690,7639
2,5631500400,20150225T000000,180000.0,2.0,1.0,770.0,10000.0,1.0,0,0,...,6,770,0,1933,0,98028,47.7379,-122.233,2720,8062
3,2487200875,20141209T000000,604000.0,4.0,3.0,1960.0,5000.0,1.0,0,0,...,7,1050,910,1965,0,98136,47.5208,-122.393,1360,5000
4,1954400510,20150218T000000,510000.0,3.0,2.0,1680.0,8080.0,1.0,0,0,...,8,1680,0,1987,0,98074,47.6168,-122.045,1800,7503


Basic Characteristics

1. Shape of the data frame

In [4]:
print("Shape (rows, columns):", df.shape)

Shape (rows, columns): (21613, 21)


File type and import method: the dataset was provided as a CSV file, a plain-text tabular format commonly used for structured data, and was loaded with `pd.read_csv`.

2. Data types by column

In [5]:
print("Data types:")
print(df.dtypes)

Data types:
id                 int64
date              object
price            float64
bedrooms         float64
bathrooms        float64
sqft_living      float64
sqft_lot         float64
floors           float64
waterfront         int64
view               int64
condition          int64
grade              int64
sqft_above         int64
sqft_basement      int64
yr_built           int64
yr_renovated       int64
zipcode            int64
lat              float64
long             float64
sqft_living15      int64
sqft_lot15         int64
dtype: object
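
The `date` column is read as `object`, so it holds plain strings rather than timestamps. A minimal sketch of parsing it, assuming every value follows the `YYYYMMDDTHHMMSS` pattern seen in the first rows (the three sample values are copied from the `df.head()` output; the `date_parsed` column name is only illustrative):

```python
import pandas as pd

# Sample values copied from the first rows shown by df.head()
sample = pd.DataFrame(
    {"date": ["20141013T000000", "20141209T000000", "20150225T000000"]}
)

# Assumed format: year, month, day, a literal "T", then hours, minutes, seconds
sample["date_parsed"] = pd.to_datetime(sample["date"], format="%Y%m%dT%H%M%S")

print(sample.dtypes)
print(sample)
```

With an explicit `format`, `pd.to_datetime` raises an error on a value that does not match, which may be preferable to silent guessing when checking the full column.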


3. Missing values

In [6]:
df.isna().sum()

id                  0
date                0
price               0
bedrooms         1134
bathrooms        1068
sqft_living      1110
sqft_lot         1044
floors              0
waterfront          0
view                0
condition           0
grade               0
sqft_above          0
sqft_basement       0
yr_built            0
yr_renovated        0
zipcode             0
lat                 0
long                0
sqft_living15       0
sqft_lot15          0
dtype: int64
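
The four columns with missing values are also stored as `float64`, even though `bedrooms`, `sqft_living`, and `sqft_lot` look like whole-number quantities and the similar `sqft_above` is `int64`. A likely reason is that pandas' default integer dtype cannot hold `NaN`, so a column with gaps is read as float. A minimal sketch with an in-memory CSV (the rows are made up for illustration):

```python
import io
import pandas as pd

# Hypothetical two-row CSV; the second row has an empty bedrooms field
csv_text = (
    "id,price,bedrooms,sqft_above\n"
    "1,221900.0,3,1180\n"
    "2,538000.0,,2170\n"
)
toy = pd.read_csv(io.StringIO(csv_text))

print(toy.dtypes)        # bedrooms is read as float64 because of the gap
print(toy.isna().sum())  # one missing value, in bedrooms
```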

Cleaning the data

1. Separating missing value columns

In [10]:
cols_na = ["bedrooms", "bathrooms", "sqft_living", "sqft_lot"]

df_na = df[cols_na]
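
As an alternative to typing the column names by hand, the list can be derived from the data. The sketch below uses a small toy frame (values are illustrative only) and is one possible approach, not the method used in this notebook:

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the real data
toy = pd.DataFrame(
    {
        "price": [221900.0, 538000.0, 180000.0],
        "bedrooms": [3.0, np.nan, 2.0],
        "sqft_lot": [5650.0, 7242.0, np.nan],
    }
)

# Keep only the columns that contain at least one missing value
cols_na = toy.columns[toy.isna().any()].tolist()
toy_na = toy[cols_na]

print(cols_na)
```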


2. Missing percentage in each column

In [12]:
missing_percent = (df_na.isna().sum() / len(df_na)) * 100
print("Missing percent: \n", missing_percent)

Missing percent: 
 bedrooms       5.246842
bathrooms      4.941470
sqft_living    5.135798
sqft_lot       4.830426
dtype: float64
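
As a sanity check, the percentages can be reproduced by hand from the missing counts and the row count reported earlier. This sketch hard-codes those numbers rather than reading the file:

```python
import pandas as pd

n_rows = 21613  # row count reported by df.shape

# Missing counts reported by df.isna().sum()
missing_counts = pd.Series(
    {"bedrooms": 1134, "bathrooms": 1068, "sqft_living": 1110, "sqft_lot": 1044}
)

# Share of missing values per column, as a percentage
missing_percent = missing_counts / n_rows * 100
print(missing_percent.round(6))
```

On the real frame, `df_na.isna().mean() * 100` should give the same result, since the mean of a boolean mask is the count of `True` values divided by the number of rows.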


4. Distribution in missing value columns

In [13]:
df["bedrooms"].describe()

count    20479.000000
mean         3.372821
std          0.930711
min          0.000000
25%          3.000000
50%          3.000000
75%          4.000000
max         33.000000
Name: bedrooms, dtype: float64
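
A maximum of 33 against a 75th percentile of 4 points to outliers. One common way to flag them is the 1.5 × IQR rule. That rule is an assumption here and not something the notebook applies, and the values below are a toy sample, not the real column:

```python
import pandas as pd

# Toy sample with two extreme values (0 and 33)
bedrooms = pd.Series([3, 3, 4, 2, 3, 4, 5, 3, 33, 0], dtype="float64")

# Interquartile range and the usual 1.5 * IQR fences
q1, q3 = bedrooms.quantile([0.25, 0.75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Values falling outside the fences
outliers = bedrooms[(bedrooms < lower) | (bedrooms > upper)]
print(outliers.tolist())
```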

5. Filling in missing values

    a. bedrooms
        About 5.2% of this column's values are missing, well under 10%, and the variable is discrete with a clear central tendency: the 25th percentile and the median are both 3 bedrooms. Because of outliers (the maximum is 33), the mean would not be a reliable choice. The median is more robust to those outliers and better represents a typical value. For these reasons, the median was used to fill the missing bedroom values.

In [15]:
df["bedrooms"] = df["bedrooms"].fillna(df["bedrooms"].median())
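
The effect of the outlier on the two candidate fill values can be seen on a small example. The series below is a toy sample chosen to include one extreme value, not data from the file:

```python
import numpy as np
import pandas as pd

# Toy sample: five observed values (one extreme) and two gaps
toy = pd.Series([3.0, 3.0, 4.0, np.nan, 2.0, 33.0, np.nan], name="bedrooms")

print("mean:  ", toy.mean())    # pulled upward by the 33
print("median:", toy.median())  # unaffected by the size of the extreme value

# Same fill pattern as the cell above, applied to the toy series
filled = toy.fillna(toy.median())
print("missing after fill:", filled.isna().sum())
```

Both `mean()` and `median()` skip `NaN` by default, so the fill value is computed from the observed values only.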

# **STEP 2: Data Analysis and Visualization**
## - Identify categorical, ordinal, and numerical variables within the data
## - Provide measures of centrality and distribution with visualizations
## - Diagnose for correlations between variables and determine independent and dependent variables
## - Perform exploratory analysis in combination with visualization techniques to discover patterns and features of interest

# **STEP 3: Data Analytics**
## - Determine the need for a supervised or unsupervised learning method and identify dependent and independent variables
## - Train, test, and provide accuracy and evaluation metrics for model results
