This file contains some exploratory data analysis of the King County data. This data set was provided to students of the Flatiron School in order to implement a linear regression model.

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import OrdinalEncoder
from itertools  import tee

In [None]:
df = pd.read_csv("../data/kc_house_data.csv")
df.head() # inspecting our dataframe and seeing our columns and some basic values

From the first few rows, 'waterfront' and 'yr_renovated' appear to be the only columns with null values. We inspect every column further to see whether this is in fact the case.


In [None]:
df.info() # notice that 'waterfront', 'view', and 'yr_renovated' are the only columns with null values
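
A per-column null count makes the same check explicit. The sketch below runs on a small made-up frame: only the column names mirror the dataset, and the values are illustrative assumptions rather than rows from the King County data.

```python
import numpy as np
import pandas as pd

# Toy frame with made-up values; only the column names mirror the dataset
toy = pd.DataFrame({
    "price": [221900.0, 538000.0, 180000.0],
    "waterfront": [np.nan, "NO", "YES"],
    "view": ["NONE", np.nan, "NONE"],
    "yr_renovated": [0.0, 1991.0, np.nan],
})

# Count missing values per column and keep only the columns that have any
null_counts = toy.isna().sum()
cols_with_nulls = null_counts[null_counts > 0]
print(cols_with_nulls)
```

Running the same two lines on `df` should list every column that needs filling, which is easier to scan than the full `df.info()` output.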

In [None]:
df[["date", "yr_built", "yr_renovated", "waterfront", "view", "condition", "grade", "sqft_basement"]] # a subset of our dataframe with only
# the columns that need cleaning

In [None]:
df["date"] = pd.to_datetime(df["date"]) #changing date column from string to datetime format

df["date"].min(), df["date"].max() # notice our dataset only covers about one year of sales

In [None]:
df["yr_renovated"] = df["yr_renovated"].fillna(0.0) # replacing NaN values with 0.0 in yr_renovated column

df.yr_renovated.isna().sum() # checking to see if there are any NaN values

In [None]:
df["sqft_basement"] = df["sqft_basement"].replace("?", 0.0) # replacing ? with 0.0 in our sqft_basement column

df["sqft_basement"] = df["sqft_basement"].astype("float") #converting sqft_basement to type float
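
As a possible alternative, `pd.to_numeric` with `errors="coerce"` converts the column in one step and makes it easy to count how many entries were unknown before deciding how to fill them. This is a minimal sketch on made-up values, not the approach used above.

```python
import pandas as pd

# Made-up values in the same shape as the raw sqft_basement column
raw = pd.Series(["0.0", "400.0", "?", "910.0"])

# errors="coerce" turns anything non-numeric (here "?") into NaN
as_float = pd.to_numeric(raw, errors="coerce")
n_unknown = int(as_float.isna().sum())

# Filling with 0.0 reproduces the choice made in the cell above
filled = as_float.fillna(0.0)
print(n_unknown, filled.tolist())
```

Keeping the NaN count around shows how many basement sizes are being assumed to be 0.0.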

In [None]:
df["waterfront"] = df["waterfront"].fillna("NO") # filling in NaN values with NO for waterfront

df["waterfront"] = df["waterfront"].eq("YES").astype(int) # converting YES and NO to 1 and 0 respectively; this will help with our model fitting later

df.sample()

In [None]:
df["view"].unique() # inspecting view column to see if there are any NaN values

df["view"] = df["view"].fillna("NONE") # filling in NaN values with string NONE

With the null values filled in and the data types corrected, it is time to flag renovated homes and then deal with our ordinal columns.

In [None]:
df["renovated"] = df["yr_renovated"].apply(lambda x: 0 if x==0.0 else 1) # adding a new column where 1 means home was renovated and 0 is never renovated

In [None]:
# NaN values in yr_renovated were already replaced with 0.0 earlier, so filling with the
# corresponding yr_built value leaves the column unchanged here
df["yr_renovated"] = df["yr_renovated"].fillna(df["yr_built"])
df["yr_renovated"].isna().sum() # checking to see if there are any NaN values

In [None]:
df["grade"] = df["grade"].apply(lambda x: int(x.split()[0])) # replacing each grade label with its leading integer
# Only using the integer grading (3-13)

In [None]:
ord_cat_selector = ['view', 'condition', 'grade'] # these three columns all have ordinal data and must be dealt with accordingly

cat_subset = df[ord_cat_selector] # a subset of our dataframe with only ordinal data

cat_subset

In [None]:
for col in ord_cat_selector: # inspecting the unique values of each column so they can be arranged in order
    print(col, cat_subset[col].unique())

In [None]:
view_list = ['NONE', 'FAIR', 'AVERAGE', 'GOOD', 'EXCELLENT'] # order for each column (least to greatest)
condition_list = ['Poor', 'Fair', 'Average', 'Good', 'Very Good']
grade_list = [3,4,5,6,7,8,9,10,11,12,13]

In [None]:
o_enc = OrdinalEncoder(categories = [view_list, condition_list, grade_list])
o_enc.fit(cat_subset)

In [None]:
X_subset = pd.DataFrame(o_enc.transform(cat_subset), columns = cat_subset.columns, index = cat_subset.index) # create a new ordinal encoded dataframe

# merge with our original dataframe
transformed_df = df.join(X_subset, rsuffix= "_ord")

# dropping redundant columns that were used to derive ordinal columns
transformed_df.drop(columns = ["view", "condition", "grade"], inplace= True)
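
To see what the encoding does, note that with an explicit category order each label is mapped to its position in that list. The sketch below shows the same mapping with an ordered pandas categorical instead of `OrdinalEncoder`. The sample ratings are made up, and the name `view_order` is only for illustration.

```python
import pandas as pd

view_order = ["NONE", "FAIR", "AVERAGE", "GOOD", "EXCELLENT"]  # least to greatest

# Made-up sample of view ratings
views = pd.Series(["GOOD", "NONE", "EXCELLENT", "FAIR"])

# An ordered categorical assigns each label its position in view_order
codes = pd.Categorical(views, categories=view_order, ordered=True).codes
print(codes.tolist())
```

`OrdinalEncoder` returns the same positions as floats, so 'NONE' becomes 0.0 and 'EXCELLENT' becomes 4.0 in the `view_ord` column.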

In [None]:
transformed_df["age"] = 2015 - transformed_df["yr_built"] # adding an age column that gives us the age of the home as of 2015
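
Since the sales span two calendar years, one possible refinement is to measure age at the time of sale rather than against a fixed year. The sketch below is an assumption about how that could look, using made-up rows and a hypothetical column name `age_at_sale`.

```python
import pandas as pd

# Made-up rows; 'date' is already in datetime format, as in the notebook
toy = pd.DataFrame({
    "date": pd.to_datetime(["2014-10-13", "2015-02-25"]),
    "yr_built": [1955, 1933],
})

# Age of each home in the year it was sold
toy["age_at_sale"] = toy["date"].dt.year - toy["yr_built"]
print(toy)
```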

We noticed an outlier in the number of bedrooms and decided to correct it manually, using data found online.

In [None]:
transformed_df.loc[transformed_df.bedrooms == transformed_df.bedrooms.max()]

In [None]:
transformed_df.loc[15856, 'bedrooms'] = 3 # setting the correct number of bedrooms for this entry, found via Zillow

In [None]:
transformed_df.loc[transformed_df.bedrooms == transformed_df.bedrooms.max()]

In [None]:
lat_long = transformed_df[['lat', 'long']]

Now it is time to add a special feature to our dataframe: the distance from each home to downtown Seattle.

In [None]:
# store each home's (latitude, longitude) pair as a tuple in a new 'point' column,
# keeping every point on the same row as the home it belongs to
transformed_df['point'] = list(zip(lat_long['lat'], lat_long['long']))

In [None]:
transformed_df.head(3)

In [None]:
downtown_seattle = (47.6050, -122.3344)

import geopy.distance

transformed_df['distance_to_downtown_seattle_miles'] = transformed_df['point'].apply(lambda point: geopy.distance.geodesic(downtown_seattle, point).miles)
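
For intuition about what the distance column measures, here is a rough sketch of a great-circle distance using the haversine formula. It is not what `geopy.distance.geodesic` computes (geopy uses an ellipsoidal model, so the numbers differ slightly). The function name `haversine_miles`, the Earth radius constant, and the example home coordinates are assumptions for illustration.

```python
import math

EARTH_RADIUS_MILES = 3958.8  # mean Earth radius; a spherical approximation

def haversine_miles(origin, destination):
    """Great-circle distance in miles between two (lat, long) points in degrees."""
    lat1, lon1 = map(math.radians, origin)
    lat2, lon2 = map(math.radians, destination)
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    a = math.sin(dlat / 2) ** 2 + math.cos(lat1) * math.cos(lat2) * math.sin(dlon / 2) ** 2
    return 2 * EARTH_RADIUS_MILES * math.asin(math.sqrt(a))

downtown_seattle = (47.6050, -122.3344)
home = (47.5112, -122.257)  # example coordinates for a single home
print(round(haversine_miles(downtown_seattle, home), 1))
```

A quick check like this against a few rows is a reasonable way to confirm that the values in the new column are plausible.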

In [None]:
transformed_df['distance_to_downtown_seattle_miles'].describe()

In [None]:
amazon_hq = (47.622620, -122.336739)

import geopy.distance

transformed_df['distance_to_amazon_hq'] = transformed_df['point'].apply(lambda point: geopy.distance.geodesic(amazon_hq, point).miles)

Here are two different ways to visualize the same thing: how each of our features relates to our target variable, price. The heatmap shows the correlation between every pair of columns, and the regression plots show the best-fit line of price against each feature. This will give us a good understanding of which features we should use to train our linear regression model.

In [None]:
# plotting a heatmap for the correlation of each column to each other

fig, ax = plt.subplots(figsize=(15,10))

sns.heatmap(transformed_df.corr(numeric_only=True), center = 0, cmap = "coolwarm", annot=True, linewidths=.5, ax=ax) # numeric_only skips non-numeric columns such as 'date' and 'point'
plt.tight_layout()
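
The heatmap can be hard to read with this many columns, so it may help to pull out only the correlations with price and sort them by strength. This is a minimal sketch on made-up numbers, assuming pandas 1.5 or later for the `numeric_only` argument.

```python
import pandas as pd

# Made-up values; the column names mirror the notebook's dataframe
toy = pd.DataFrame({
    "price": [200000, 350000, 500000, 650000, 800000],
    "sqft_living": [900, 1500, 2100, 2600, 3400],
    "distance_to_downtown_seattle_miles": [20.0, 14.0, 9.0, 6.0, 2.0],
    "point": [(0, 0)] * 5,  # non-numeric column, ignored by numeric_only=True
})

# Correlation of every numeric feature with price, strongest first
price_corr = (
    toy.corr(numeric_only=True)["price"]
    .drop("price")
    .sort_values(key=abs, ascending=False)
)
print(price_corr)
```

Sorting by absolute value keeps strongly negative features, such as distance to downtown in this toy data, near the top of the list.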

In [None]:
desired_columns = ['bedrooms', 'bathrooms', 'sqft_living',
'sqft_lot', 'floors', 'waterfront', 'sqft_above', 'sqft_basement',
'zipcode', 'lat', 'long', 'sqft_living15', 'sqft_lot15', 'view_ord',
'distance_to_downtown_seattle_miles']

In [None]:
for col in desired_columns:
    sns.lmplot(data=transformed_df, x = col,  y="price", fit_reg =True)

In [None]:
#transformed_df.to_excel("king_county_home_sales.xlsx") #only run this line if this is your first time running this notebook

Finally, we are ready to move on to the model fitting portion of our project. Note that this is just a preliminary EDA with only the data that was provided to us. After fitting some basic linear regression models with this data, we will return to this phase and try to incorporate additional data. For example, all of our home sales data is from 2014 to 2015, specifically from the King County region. In order to better analyze the price of a home in this region, we should try to find more recent data, such as data from 2020-2022. Also, in order to better estimate the price of a home, we would like to calculate features such as distance to the nearest park, walking score, distance to public transit, neighborhood score, and demographic data.