# COGS  - Final Project

https://docs.google.com/document/d/1IMQ9_9TBsWFXKovrPqFjuonHIWj6UqlheJpqwwFu0hw/edit?usp=sharing

# Names

- Shenova Davis
- Lauren Lui
- Vincent Sgherzi

# Introduction

A house is one of the most expensive purchases a family makes in its lifetime. With housing prices climbing, it is often difficult to determine which physical characteristics of a house most directly influence its sale price. Our project aims to identify the physical attributes of a house that most influence sale price.

# Question

What physical characteristics of a house most influence the sale price in a neighborhood?

# Hypothesis

Location in a high-population-density area will have the greatest effect on sale price, with ocean proximity being a secondary factor.

# Setup

In [1]:
# import
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

# Data

We use a dataset of California housing prices from Kaggle, available at https://www.kaggle.com/datasets/shibumohapatra/house-price. The features in the dataset are described below:
- Longitude: Longitude value for the block in California, USA
- Latitude: Latitude value for the block in California, USA
- Housing_median_age: Median age of the houses in the block
- Total_rooms: Total number of rooms (excluding bedrooms) in all houses in the block
- Total_bedrooms: Total number of bedrooms in all houses in the block
- Population: Total number of people living in the block
- Households: Total number of households in the block
- Median_income: Median household income of the households in the block
- Ocean_proximity: Type of the landscape of the block [ Unique Values : 'NEAR BAY', '<1H OCEAN', 'INLAND', 'NEAR OCEAN', 'ISLAND' ]
- Median_house_value: Median house price of all the houses in the block



Additional information relevant to our analysis:
- Number of observations: 20,600
- Relevant predictors: 7
    - population
    - ocean proximity (longitude, latitude)
    - median house value
    - number of households
    - total rooms
    - house median age

To clean the data, we first drop the variables we do not need and then remove the observations with missing data.
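
The order of these two steps can matter. The sketch below uses a small made-up DataFrame (the values and the `toy` name are illustrative assumptions, not rows from `house.csv`) to show that dropping an unneeded column before dropping incomplete rows keeps any row whose only missing value was in that column.

```python
import numpy as np
import pandas as pd

# Toy data, made up for illustration; column names mirror the dataset's.
toy = pd.DataFrame({
    'total_rooms': [880.0, 7099.0, 1467.0, 1274.0],
    'total_bedrooms': [129.0, np.nan, 190.0, np.nan],
    'population': [322.0, 2401.0, np.nan, 558.0],
})

# Count missing values per column before dropping anything.
missing_per_column = toy.isna().sum()
print(missing_per_column)

# Dropping rows first removes every row with a gap in any column.
rows_first = toy.dropna(axis='index').drop(columns=['total_bedrooms'])

# Dropping the unneeded column first keeps rows whose only gap was in it.
column_first = toy.drop(columns=['total_bedrooms']).dropna(axis='index')

print(len(rows_first), len(column_first))
```

Running `df.isna().sum()` on the real data before cleaning would show which columns actually contain missing values.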

In [None]:
# read data
df = pd.read_csv('house.csv')

In [None]:
df.head()

In [None]:
# clean data: rename columns, drop the unneeded column, and remove rows with missing values
df = df.rename(columns={'housing_median_age': 'median_age',
                        'median_house_value': 'median_price'})
df = df.drop(columns=['total_bedrooms'])
df = df.dropna(axis='index')

In [None]:
# inspect the categories in the ocean_proximity column
print(df['ocean_proximity'].unique())

# one-hot encode ocean_proximity
one_hot = pd.get_dummies(df['ocean_proximity'])
# drop the original ocean_proximity column now that it is encoded
df = df.drop(columns=['ocean_proximity'])
# join the encoded columns back onto df
df = df.join(one_hot)

# rename the encoded columns to lowercase names
df = df.rename(columns={'<1H OCEAN': '<1h_ocean',
                        'INLAND': 'inland',
                        'ISLAND': 'island',
                        'NEAR BAY': 'near_bay',
                        'NEAR OCEAN': 'near_ocean'})

df
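
As a possible alternative to the encode, drop, join, and rename steps above, the sketch below does the encoding in a single `pd.get_dummies` call on a small made-up DataFrame (the `toy` and `encoded` names and the values are illustrative assumptions). It is not the approach used in this notebook, only a compact equivalent on toy data.

```python
import pandas as pd

# Toy data, made up for illustration, using the dataset's category labels.
toy = pd.DataFrame({
    'median_price': [452600.0, 358500.0, 120000.0],
    'ocean_proximity': ['NEAR BAY', '<1H OCEAN', 'INLAND'],
})

# Encode in one call: the original column is replaced by one column per
# category. Empty prefix/prefix_sep keep the bare category labels as names,
# and dtype=int gives 0/1 columns instead of booleans.
encoded = pd.get_dummies(toy, columns=['ocean_proximity'],
                         prefix='', prefix_sep='', dtype=int)

# Lowercase the labels and replace spaces with underscores.
encoded.columns = [c.lower().replace(' ', '_') for c in encoded.columns]
print(encoded.columns.tolist())
```

Recent pandas versions return boolean dummy columns by default, so passing `dtype=int` is one way to get numeric 0/1 columns if that is preferred for later modeling.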