# Cleaning house listings for sale

Data cleaning is an important step: it improves data quality and overall productivity. Removing outdated or incorrect information leaves the data scientist with higher-quality data to work with.

After creating the data frames, we cleaned each table by addressing:

    • Unnecessary features - checked which features are important for the research and dropped the rest
    • Missing data (NaNs) - dropped some rows with missing values and filled the remaining NaNs, mostly with the median
    • Fixing data types - converted columns to the appropriate numeric types
    • Outliers - detected outliers with the z-score method, then dropped them or replaced them with the median
    • Duplicates - checked for duplicate rows and removed them
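
The z-score check that the outlier steps rely on can be sketched as a small helper. The function name `zscore_outliers` and the toy prices are illustrative assumptions, not part of the notebook; the threshold of 3 matches the cells further down.

```python
import pandas as pd

def zscore_outliers(series: pd.Series, threshold: float = 3.0) -> pd.Series:
    """Return a boolean mask that is True where |z-score| exceeds the threshold."""
    # pandas' std() is the sample standard deviation (ddof=1), as in the notebook cells
    z = (series - series.mean()) / series.std()
    return z.abs() > threshold

# Toy data (made up): twenty ordinary prices and one extreme value
prices = pd.Series([300_000.0] * 10 + [500_000.0] * 10 + [50_000_000.0])
mask = zscore_outliers(prices)
print(prices[mask])
```

Only the extreme value is flagged here. With very few rows a single extreme value may never exceed 3 standard deviations, so the rule is only meaningful on reasonably large samples.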

We also did *feature engineering*: using domain knowledge to select and transform the most relevant raw variables into features that better represent the underlying problem, with the aim of improving the model's accuracy.

    • For that, we created the "price_per_sqft" column by dividing the price by sqft
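
As a minimal sketch of that feature, assuming a frame with `price` and `sqft` columns like the one used in this notebook (the `listings` frame and its values are made up for illustration):

```python
import pandas as pd

# Toy listings (made-up values) with the same column names as the notebook
listings = pd.DataFrame({
    "price": [300_000.0, 450_000.0, 1_200_000.0],
    "sqft": [1_500.0, 1_800.0, 2_000.0],
})

# Column-wise division is vectorized, so no row-by-row apply is needed
listings["price_per_sqft"] = listings["price"] / listings["sqft"]
print(listings)
```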

In [1]:
import pandas as pd
import numpy as np
import re
import warnings
warnings.filterwarnings('ignore')
#!pip install pydotplus

from IPython.display import Image, display #for tree plot 
import pydotplus 
from scipy import misc

import plotly.express as px

import seaborn as sns
import matplotlib as mpl
from matplotlib import pyplot as plt
%matplotlib inline
plt.rcParams['figure.figsize'] = (10, 6)


## Import the data

In [2]:
df=pd.read_csv('RealEstateNewYork.csv',sep=',',low_memory=False)

In [3]:
df.shape

(8652, 15)

In [4]:
df_clean=df.copy()

## Getting acquainted with the data

In [5]:
df_clean.describe(include='all')

Unnamed: 0,price,beds,baths,garage,stories,house_type,lot_sqft,sqft,year_built,address,state,city,lat,lon,county
count,8650.0,8411.0,8559.0,4602.0,5703.0,8652,6968.0,6740.0,8277.0,8604,8652,8625,8346.0,8346.0,8594
unique,,,,,,11,,,,8531,1,1129,,,62
top,,,,,,single_family,,,,378 Ohayo Mountain Rd,NY,New York City,,,Suffolk
freq,,,,,,5510,,,,4,8652,408,,,949
mean,855162.3,3.333611,2.356116,1.797045,3.408031,,648467.9,2228.476706,1954.867827,,,,41.665412,-74.679838,
std,2085814.0,2.013743,1.772085,3.252687,6.5145,,26028810.0,8106.074704,39.261691,,,,1.071204,1.724791,
min,1.0,0.0,0.0,1.0,1.0,,105.0,228.0,1700.0,,,,40.499465,-79.756718,
25%,249900.0,2.0,1.0,1.0,2.0,,5000.0,1258.0,1930.0,,,,40.753867,-75.085748,
50%,499000.0,3.0,2.0,2.0,2.0,,10602.0,1738.5,1957.0,,,,41.06465,-73.947446,
75%,837750.0,4.0,3.0,2.0,3.0,,31824.0,2400.0,1984.0,,,,42.74769,-73.730989,


In [6]:
df_clean.isnull().sum().sum()

12051

In [7]:
df_clean.isnull().sum()

price            2
beds           241
baths           93
garage        4050
stories       2949
house_type       0
lot_sqft      1684
sqft          1912
year_built     375
address         48
state            0
city            27
lat            306
lon            306
county          58
dtype: int64

## Deal with missing data

### 1. Delete missing values:

In [8]:
df_clean = df_clean.dropna(subset=['city','county']).reset_index(drop=True)
df_clean

Unnamed: 0,price,beds,baths,garage,stories,house_type,lot_sqft,sqft,year_built,address,state,city,lat,lon,county
0,159900.0,4.0,2.0,1.0,,mobile,17424.0,1788.0,1973.0,90 E Main St,NY,Granville,43.405985,-73.251559,Washington
1,294900.0,3.0,2.0,2.0,2.0,single_family,74052.0,996.0,2011.0,16326 Ontario Shores Dr,NY,Sterling,43.404835,-76.635019,Cayuga
2,225000.0,3.0,2.0,1.0,,single_family,30056.0,1224.0,1973.0,38 Pine Cir,NY,Newfield,42.357008,-76.607137,Tompkins
3,149000.0,4.0,2.0,2.0,,single_family,223898.0,1608.0,1900.0,8 Gridleyville Rd,NY,Spencer,42.223019,-76.430742,Tioga
4,599999.0,4.0,2.0,,2.0,single_family,7307.0,1827.0,1858.0,59 Hamilton Ave,NY,Oyster Bay,40.874207,-73.531903,Nassau
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
8588,975000.0,1.0,2.0,,,condos,,826.0,1987.0,75 Wall St Apt 24K,NY,New York,40.705042,-74.008117,New York
8589,1195000.0,2.0,2.0,,29.0,condos,,,1984.0,311 E 38th St Apt 10E,NY,New York City,40.747242,-73.973217,New York
8590,689000.0,3.0,3.0,1.0,2.0,single_family,7475.0,2100.0,2022.0,223 Endicott Ave,NY,Elmsford,41.063712,-73.809473,Westchester
8591,1862500.0,5.0,4.0,1.0,,single_family,4920.0,2750.0,1955.0,147-40 8th Ave,NY,Whitestone,40.792845,-73.819437,Queens


In [9]:
df_clean.isnull().sum()

price            2
beds           239
baths           93
garage        3992
stories       2918
house_type       0
lot_sqft      1635
sqft          1900
year_built     344
address         30
state            0
city             0
lat            248
lon            248
county           0
dtype: int64

### 2. Fill NaNs :

In [10]:
df_clean.year_built = df_clean.year_built.fillna(df_clean.year_built.median())
df_clean.beds = df_clean.beds.fillna(df_clean.beds.median())
df_clean.baths = df_clean.baths.fillna(df_clean.baths.median())
df_clean.garage = df_clean.garage.fillna(0)
df_clean.stories = df_clean.stories.fillna(df_clean.stories.median())
df_clean.sqft = df_clean.sqft.fillna(df_clean.sqft.median())
df_clean.lot_sqft = df_clean.lot_sqft.fillna(df_clean.lot_sqft.median())
df_clean.address = df_clean.address.fillna('Not Specified')
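
The per-column median fills above can also be written as one call, since `DataFrame.fillna` accepts a Series keyed by column name. This is only a sketch on a made-up `toy` frame; the column list `median_cols` is an assumption for illustration.

```python
import numpy as np
import pandas as pd

# Toy frame (made-up values) with a few missing entries
toy = pd.DataFrame({
    "beds": [3.0, np.nan, 4.0, 2.0],
    "baths": [2.0, 1.0, np.nan, 3.0],
    "garage": [np.nan, 1.0, 2.0, np.nan],
})

median_cols = ["beds", "baths"]
# median() returns one value per column; fillna matches them by column name
toy[median_cols] = toy[median_cols].fillna(toy[median_cols].median())
# A missing garage value is treated as "no garage", as in the cell above
toy["garage"] = toy["garage"].fillna(0)
print(toy)
```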

In [11]:
df_clean.isnull().sum()

price           2
beds            0
baths           0
garage          0
stories         0
house_type      0
lot_sqft        0
sqft            0
year_built      0
address         0
state           0
city            0
lat           248
lon           248
county          0
dtype: int64

## Drop the unnecessary columns

In [12]:
df_clean = df_clean.drop(columns=(['lon', 'lat']))

## Drop the unnecessary rows

In [13]:
# Remove listing types outside the scope of the research (substring match, as before)
excluded_types = "mobile|condo_townhome_rowhome_coop|condop|farm"
df_clean = df_clean[~df_clean["house_type"].str.contains(excluded_types)]

## Remove duplicates

In [14]:
df_clean= df_clean.drop_duplicates().reset_index(drop=True)
df_clean

Unnamed: 0,price,beds,baths,garage,stories,house_type,lot_sqft,sqft,year_built,address,state,city,county
0,294900.0,3.0,2.0,2.0,2.0,single_family,74052.0,996.0,2011.0,16326 Ontario Shores Dr,NY,Sterling,Cayuga
1,225000.0,3.0,2.0,1.0,2.0,single_family,30056.0,1224.0,1973.0,38 Pine Cir,NY,Newfield,Tompkins
2,149000.0,4.0,2.0,2.0,2.0,single_family,223898.0,1608.0,1900.0,8 Gridleyville Rd,NY,Spencer,Tioga
3,599999.0,4.0,2.0,0.0,2.0,single_family,7307.0,1827.0,1858.0,59 Hamilton Ave,NY,Oyster Bay,Nassau
4,299900.0,3.0,2.0,1.0,2.0,single_family,8712.0,1589.0,1960.0,41 Bender Ln,NY,Bethlehem,Albany
...,...,...,...,...,...,...,...,...,...,...,...,...,...
8412,975000.0,1.0,2.0,0.0,2.0,condos,10550.5,826.0,1987.0,75 Wall St Apt 24K,NY,New York,New York
8413,1195000.0,2.0,2.0,0.0,29.0,condos,10550.5,1740.0,1984.0,311 E 38th St Apt 10E,NY,New York City,New York
8414,689000.0,3.0,3.0,1.0,2.0,single_family,7475.0,2100.0,2022.0,223 Endicott Ave,NY,Elmsford,Westchester
8415,1862500.0,5.0,4.0,1.0,2.0,single_family,4920.0,2750.0,1955.0,147-40 8th Ave,NY,Whitestone,Queens
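
To see how many rows a deduplication step removes, the duplicates can be counted first. A minimal sketch on made-up data (the `toy` frame is an assumption, not the notebook's table):

```python
import pandas as pd

# Toy frame (made-up values) where the third row repeats the first
toy = pd.DataFrame({
    "address": ["1 Main St", "2 Oak Ave", "1 Main St"],
    "price": [250_000.0, 400_000.0, 250_000.0],
})

# duplicated() marks rows that are identical to an earlier row
n_duplicates = int(toy.duplicated().sum())
print(n_duplicates)

deduped = toy.drop_duplicates().reset_index(drop=True)
print(len(deduped))
```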


## Changing data types

In [15]:
df_clean['year_built']=df_clean['year_built'].astype(np.int64)

df_clean['beds']=df_clean['beds'].astype(np.float64)

df_clean['baths']=df_clean['baths'].astype(np.float64)

df_clean['stories']=df_clean['stories'].astype(np.int64)

df_clean['lot_sqft']=df_clean['lot_sqft'].astype(np.int64)

df_clean['sqft']=df_clean['sqft'].astype(np.int64)

df_clean['garage']=df_clean['garage'].astype(np.int64)
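
The same casts can be expressed as a single `astype` call with a column-to-dtype mapping. The sketch below uses a made-up `toy` frame; note that casting to an integer type only works once the NaNs have been filled, as was done earlier.

```python
import numpy as np
import pandas as pd

# Toy frame (made-up values) with no missing entries
toy = pd.DataFrame({
    "year_built": [1973.0, 2011.0],
    "stories": [2.0, 1.0],
    "beds": [3.0, 4.0],
})

# One astype call with a column -> dtype mapping
toy = toy.astype({"year_built": np.int64, "stories": np.int64, "beds": np.float64})
print(toy.dtypes)
```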

In [16]:
df_clean.dtypes

price         float64
beds          float64
baths         float64
garage          int64
stories         int64
house_type     object
lot_sqft        int64
sqft            int64
year_built      int64
address        object
state          object
city           object
county         object
dtype: object

## Using pandas describe() to find outliers


In [17]:
df_clean.describe(include='all')

Unnamed: 0,price,beds,baths,garage,stories,house_type,lot_sqft,sqft,year_built,address,state,city,county
count,8415.0,8417.0,8417.0,8417.0,8417.0,8417,8417.0,8417.0,8417.0,8417,8417,8417,8417
unique,,,,,,7,,,,8355,1,1107,62
top,,,,,,single_family,,,,Not Specified,NY,Brooklyn,Suffolk
freq,,,,,,5465,,,,30,8417,361,939
mean,843375.5,3.338719,2.359154,0.970655,2.855768,,533711.9,2124.796959,1954.314601,,,,
std,2068201.0,1.99659,1.773179,2.565533,5.107273,,23683220.0,7239.399081,38.330203,,,,
min,1.0,0.0,0.0,0.0,1.0,,105.0,228.0,1700.0,,,,
25%,249999.0,2.0,2.0,0.0,2.0,,6000.0,1398.0,1930.0,,,,
50%,499000.0,3.0,2.0,1.0,2.0,,10550.0,1740.0,1957.0,,,,
75%,829000.0,4.0,3.0,2.0,2.0,,22041.0,2190.0,1980.0,,,,


## Detecting & handling outliers

### 1. Price outliers : 

In [18]:
df_clean['price'].describe()

count    8.415000e+03
mean     8.433755e+05
std      2.068201e+06
min      1.000000e+00
25%      2.499990e+05
50%      4.990000e+05
75%      8.290000e+05
max      1.000000e+08
Name: price, dtype: float64

In [19]:
z_score = (df_clean['price'] - df_clean['price'].mean())/df_clean['price'].std()

In [20]:
price_outliers = abs(z_score)>3
sum(price_outliers)

72

In [21]:
min(df_clean.price[price_outliers])

7095000.0

In [22]:
max(df_clean.price[price_outliers])

100000000.0
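
A rule of |z| > 3 is equivalent to flagging values outside mean ± 3·std. Plugging in the rounded statistics from the `describe()` output above gives a rough sketch of where those cutoffs fall:

```python
# Summary statistics copied (rounded) from the describe() output above
price_mean = 8.433755e+05
price_std = 2.068201e+06

upper_cutoff = price_mean + 3 * price_std
lower_cutoff = price_mean - 3 * price_std

print(f"upper cutoff: {upper_cutoff:,.0f}")
print(f"lower cutoff: {lower_cutoff:,.0f}")
```

The upper cutoff comes out at about 7.05 million, in line with the smallest flagged price of 7,095,000. The lower cutoff is negative, so the z-score alone cannot flag very cheap listings such as the 1-dollar entry; a low-price limit has to be chosen separately.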

In [23]:
df_clean[(df_clean['price'] < 100000) | (df_clean['price'] > 6890000)]

Unnamed: 0,price,beds,baths,garage,stories,house_type,lot_sqft,sqft,year_built,address,state,city,county
42,1.0,3.0,1.0,2,2,single_family,370260,1330,1972,141 Fulton Rd,NY,Caroga Lake,Fulton
44,84999.0,3.0,3.0,0,2,multi_family,871,2280,1930,66 Robin St,NY,Albany,Albany
51,50000.0,2.0,2.0,0,2,multi_family,6970,1097,1910,29 Clark St,NY,Malone,Franklin
56,49900.0,4.0,2.0,0,2,multi_family,2614,1740,1890,801 Livingston Ave,NY,Albany,Albany
87,89900.0,3.0,2.0,0,2,single_family,4792,1280,1990,18 Thomas St,NY,Rochester,Monroe
...,...,...,...,...,...,...,...,...,...,...,...,...,...
8259,99500.0,2.0,1.0,0,2,single_family,17860,1200,1930,4419 Union Rd,NY,Cheektowaga,Erie
8262,93500.0,3.0,2.0,0,2,multi_family,6534,2329,1930,424 W Court St,NY,Rome,Oneida
8274,65000.0,4.0,2.0,0,2,multi_family,4400,1418,1904,479 Bay St,NY,Rochester,Monroe
8296,80000.0,3.0,1.0,0,2,single_family,348480,1586,1900,869 McIntyre Rd,NY,Caledonia,Livingston


### 1.1 Handling price outliers :

In [24]:
df_clean['price'] = np.where((df_clean.price<100000),np.nan,df_clean.price)
df_clean['price'] = np.where((df_clean.price>6890000),np.nan,df_clean.price)
df_clean.isnull().sum()

price         472
beds            0
baths           0
garage          0
stories         0
house_type      0
lot_sqft        0
sqft            0
year_built      0
address         0
state           0
city            0
county          0
dtype: int64

In [25]:
df_clean.price = df_clean.price.fillna(df_clean.price.median())
df_clean.isnull().sum()

price         0
beds          0
baths         0
garage        0
stories       0
house_type    0
lot_sqft      0
sqft          0
year_built    0
address       0
state         0
city          0
county        0
dtype: int64
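
The pattern of the two price cells above (mark out-of-range values as NaN, then fill with the median of what remains) can be sketched as a helper. The name `replace_outside_with_median` and the toy prices are assumptions for illustration; the limits are the ones used above.

```python
import pandas as pd

def replace_outside_with_median(series: pd.Series, low: float, high: float) -> pd.Series:
    """Replace values outside [low, high] with the median of the remaining values."""
    # where() keeps in-range values and turns everything else into NaN
    kept = series.where(series.between(low, high))
    # median() skips NaN, so it is computed on the in-range values only
    return kept.fillna(kept.median())

# Toy prices (made up): one implausibly low and one implausibly high value
prices = pd.Series([1.0, 250_000.0, 500_000.0, 800_000.0, 100_000_000.0])
cleaned = replace_outside_with_median(prices, low=100_000, high=6_890_000)
print(cleaned.tolist())
```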

### 2. Bed outliers : 

In [26]:
df_clean['beds'].describe()

count    8417.000000
mean        3.338719
std         1.996590
min         0.000000
25%         2.000000
50%         3.000000
75%         4.000000
max       104.000000
Name: beds, dtype: float64

In [27]:
z_score = (df_clean['beds'] - df_clean['beds'].mean())/df_clean['beds'].std()

In [28]:
beds_outliers = abs(z_score)>3
sum(beds_outliers)

51

In [29]:
min(df_clean.beds[beds_outliers])

10.0

In [30]:
max(df_clean.beds[beds_outliers])

104.0

In [31]:
df_clean[df_clean['beds']>=9].reset_index(drop=True)

Unnamed: 0,price,beds,baths,garage,stories,house_type,lot_sqft,sqft,year_built,address,state,city,county
0,1600000.0,9.0,5.0,0,3,single_family,22216,3210,1956,16 Hilltop Ln,NY,Monsey,Rockland
1,525000.0,13.0,19.0,3,2,single_family,200376,22000,2012,335 Little Noyac Path,NY,Water Mill,Suffolk
2,2999999.0,11.0,8.0,0,4,multi_family,10550,1740,2020,1333 E 14th St Unit Townhouse,NY,Brooklyn,Kings
3,2799000.0,10.0,8.0,0,2,multi_family,2000,1740,2009,35-27 109th St,NY,Corona,Queens
4,2799000.0,10.0,8.0,0,2,multi_family,2000,2000,2009,35-25 109th St,NY,Corona,Queens
...,...,...,...,...,...,...,...,...,...,...,...,...,...
74,525000.0,9.0,13.0,2,3,single_family,18731,7600,2004,40 Herrick Rd,NY,Southampton,Suffolk
75,1268000.0,12.0,6.0,0,2,multi_family,2250,1740,1931,101 Saint Nicholas Ave,NY,Brooklyn,Kings
76,4999000.0,12.0,6.0,0,2,multi_family,3267,2520,1925,7416 Bay Pkwy,NY,Brooklyn,Kings
77,1350000.0,10.0,3.0,0,2,single_family,10550,1740,1925,123-12 82nd Ave Unit Townhouse,NY,Queens,Queens


### 2.1 Handling bed outliers :

In [32]:
df_clean['beds'] = np.where((df_clean.beds>=9),np.nan,df_clean.beds)
df_clean.isnull().sum()


price          0
beds          79
baths          0
garage         0
stories        0
house_type     0
lot_sqft       0
sqft           0
year_built     0
address        0
state          0
city           0
county         0
dtype: int64

In [33]:
df_clean=df_clean.dropna(subset=['beds'])
df_clean.isnull().sum()

price         0
beds          0
baths         0
garage        0
stories       0
house_type    0
lot_sqft      0
sqft          0
year_built    0
address       0
state         0
city          0
county        0
dtype: int64
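
When the outlier rows are dropped rather than imputed, a boolean mask gives the same result as marking NaN and calling `dropna`, provided the column has no other missing values. A sketch on a made-up `toy` frame:

```python
import pandas as pd

# Toy frame (made-up values) with two listings at or above the cutoff
toy = pd.DataFrame({
    "beds": [3.0, 104.0, 2.0, 9.0],
    "price": [300_000.0, 2_000_000.0, 250_000.0, 1_600_000.0],
})

# Keep only rows below the cutoff used above (rows with beds >= 9 are dropped)
toy = toy[toy["beds"] < 9].reset_index(drop=True)
print(toy)
```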

### 3. Bath outliers : 

In [34]:
df_clean['baths'].describe()

count    8338.000000
mean        2.308227
std         1.269486
min         0.000000
25%         1.000000
50%         2.000000
75%         3.000000
max        15.000000
Name: baths, dtype: float64

In [35]:
z_score = (df_clean['baths'] - df_clean['baths'].mean())/df_clean['baths'].std()

In [36]:
baths_outliers = abs(z_score)>3
sum(baths_outliers)

77

In [37]:
min(df_clean.baths[baths_outliers])

7.0

In [38]:
max(df_clean.baths[baths_outliers])


15.0

In [39]:
df_clean[df_clean['baths']>=7].reset_index(drop=True)

Unnamed: 0,price,beds,baths,garage,stories,house_type,lot_sqft,sqft,year_built,address,state,city,county
0,5180000.0,7.0,7.0,3,3,single_family,91476,1740,2015,22 Rolling Hill Rd,NY,Old Westbury,Nassau
1,1250000.0,7.0,7.0,3,2,single_family,1263240,5396,1995,35 Creek Rd,NY,Brunswick,Rensselaer
2,525000.0,8.0,10.0,3,2,single_family,43560,9221,2023,156 Summerfield Ln,NY,Water Mill,Suffolk
3,5945000.0,7.0,7.0,0,2,single_family,39204,4512,1996,95 Mill Creek Close,NY,Water Mill,Suffolk
4,4000000.0,5.0,8.0,2,2,single_family,29621,5615,2021,151 Newtown Ln,NY,East Hampton,Suffolk
...,...,...,...,...,...,...,...,...,...,...,...,...,...
72,2999999.0,6.0,7.0,2,2,single_family,29185,6200,2022,17 Carwin Ln,NY,Westhampton Beach,Suffolk
73,525000.0,5.0,7.0,0,14,coop,25995,1740,1925,1 Sutton Pl S Unit Ph,NY,New York City,New York
74,525000.0,5.0,7.0,0,2,single_family,54014,5000,2016,40 Ocean Ave,NY,Quogue,Suffolk
75,4300000.0,8.0,7.0,0,4,townhomes,10550,5400,1957,41 Claver Pl,NY,New York City,Kings


### 3.1 Handling bath outliers :

In [40]:
df_clean['baths'] = np.where((df_clean.baths>=7),np.nan,df_clean.baths)
df_clean.isnull().sum()

price          0
beds           0
baths         77
garage         0
stories        0
house_type     0
lot_sqft       0
sqft           0
year_built     0
address        0
state          0
city           0
county         0
dtype: int64

In [41]:
df_clean=df_clean.dropna(subset=['baths'])
df_clean.isnull().sum()

price         0
beds          0
baths         0
garage        0
stories       0
house_type    0
lot_sqft      0
sqft          0
year_built    0
address       0
state         0
city          0
county        0
dtype: int64

### 4. Garage outliers : 

In [42]:
df_clean['garage'].describe()

count    8261.000000
mean        0.967679
std         2.581888
min         0.000000
25%         0.000000
50%         1.000000
75%         2.000000
max       200.000000
Name: garage, dtype: float64

In [43]:
z_score = (df_clean['garage'] - df_clean['garage'].mean())/df_clean['garage'].std()

In [44]:
garage_outliers = abs(z_score)>3
sum(garage_outliers)

8

In [45]:
min(df_clean.garage[garage_outliers])

9

In [46]:
max(df_clean.garage[garage_outliers])

200

In [47]:
df_clean[df_clean['garage']>=7].reset_index(drop=True)

Unnamed: 0,price,beds,baths,garage,stories,house_type,lot_sqft,sqft,year_built,address,state,city,county
0,289900.0,3.0,1.0,8,2,single_family,64469,1726,1900,2642 County Route 23B,NY,Cairo,Greene
1,514900.0,3.0,3.0,10,2,multi_family,5118300,1740,1870,4454 NY 85,NY,Westerlo,Albany
2,675000.0,4.0,5.0,11,2,single_family,175982,3963,2005,1105 Lovers Ln,NY,Camden,Oneida
3,499900.0,4.0,3.0,8,2,single_family,57935,2706,1988,56 Smith Rd,NY,Amherst,Erie
4,469500.0,5.0,4.0,12,2,multi_family,106286,2657,1979,7050 Michael Rd,NY,Orchard Park,Erie
5,279000.0,0.0,1.0,8,1,single_family,819364,4800,1982,28483 County Route 32,NY,Evans Mills,Jefferson
6,1925000.0,4.0,6.0,9,2,single_family,1176556,4182,1890,4569 Route 199,NY,Millerton,Dutchess
7,679000.0,4.0,3.0,8,2,single_family,18295,3158,2005,13 Ironwood Dr,NY,Colonie,Albany
8,350000.0,6.0,2.0,200,2,multi_family,2033,1740,1925,246 Quincy Ave,NY,Bronx,Bronx
9,424900.0,4.0,2.0,10,2,single_family,220849,2210,1900,109 Crane St,NY,Charlton,Saratoga


### 4.1 Handling garage outliers :

In [48]:
df_clean['garage'] = np.where((df_clean.garage>=7),np.nan,df_clean.garage)
df_clean.isnull().sum()

price          0
beds           0
baths          0
garage        15
stories        0
house_type     0
lot_sqft       0
sqft           0
year_built     0
address        0
state          0
city           0
county         0
dtype: int64

In [49]:
df_clean.garage = df_clean.garage.fillna(df_clean.garage.median())
df_clean.isnull().sum()

price         0
beds          0
baths         0
garage        0
stories       0
house_type    0
lot_sqft      0
sqft          0
year_built    0
address       0
state         0
city          0
county        0
dtype: int64

### 5. Sqft outliers : 

In [50]:
df_clean['sqft'].describe()

count      8261.000000
mean       2064.250091
std        7277.225428
min         228.000000
25%        1385.000000
50%        1740.000000
75%        2152.000000
max      564196.000000
Name: sqft, dtype: float64

In [51]:
z_score = (df_clean['sqft'] - df_clean['sqft'].mean())/df_clean['sqft'].std()

In [52]:
sqft_outliers = abs(z_score)>3
sum(sqft_outliers)

8

In [53]:
min(df_clean.sqft[sqft_outliers])

29328

In [54]:
max(df_clean.sqft[sqft_outliers])

564196

In [55]:
df_clean[df_clean['sqft']<320].reset_index(drop=True)

Unnamed: 0,price,beds,baths,garage,stories,house_type,lot_sqft,sqft,year_built,address,state,city,county
0,279000.0,1.0,1.0,0.0,2,coop,191664,260,1987,231 Dune Rd Unit 603,NY,Westhampton Beach,Suffolk
1,159000.0,2.0,2.0,0.0,1,single_family,24829,228,1970,1390 Tupper Rd,NY,Long Lake,Hamilton
2,525000.0,1.0,1.0,0.0,2,single_family,4792,286,1930,556 Alyssa Way,NY,Cambridge,Washington


In [56]:
df_clean[df_clean['sqft']>=36590].reset_index(drop=True)

Unnamed: 0,price,beds,baths,garage,stories,house_type,lot_sqft,sqft,year_built,address,state,city,county
0,274900.0,4.0,2.0,3.0,2,single_family,65776,65776,1920,10866 State Route 126,NY,Carthage,Lewis
1,1199000.0,1.0,1.0,0.0,2,coop,10550,564196,1963,200 Central Park S Apt 3F,NY,New York,New York
2,225000.0,1.0,1.0,0.0,2,coop,50163,195726,1928,446 Kingston Ave Apt B31,NY,Brooklyn,Kings
3,400000.0,1.0,1.0,0.0,2,coop,10550,202841,1928,575 Park Ave # 603,NY,New York,New York
4,1100000.0,2.0,1.0,0.0,2,coop,10550,37782,1930,253 W 16th St # 1B,NY,New York,New York
5,5250000.0,3.0,5.0,0.0,2,coop,10550,170106,1959,900 Fifth Ave Apt 7B,NY,New York,New York


### 5.1 Handling sqft outliers :

In [57]:
df_clean['sqft'] = np.where((df_clean.sqft>=36590),np.nan,df_clean.sqft)
df_clean['sqft'] = np.where((df_clean.sqft<=300),np.nan,df_clean.sqft)
df_clean.isnull().sum()

price         0
beds          0
baths         0
garage        0
stories       0
house_type    0
lot_sqft      0
sqft          9
year_built    0
address       0
state         0
city          0
county        0
dtype: int64

In [58]:
df_clean.sqft = df_clean.sqft.fillna(df_clean.sqft.median())
df_clean.isnull().sum()

price         0
beds          0
baths         0
garage        0
stories       0
house_type    0
lot_sqft      0
sqft          0
year_built    0
address       0
state         0
city          0
county        0
dtype: int64

### 6. Lot Sqft outliers :

### 6.1 Removing rows where "Lot Sqft" is smaller than "Sqft" :

In [59]:
df_clean

Unnamed: 0,price,beds,baths,garage,stories,house_type,lot_sqft,sqft,year_built,address,state,city,county
0,294900.0,3.0,2.0,2.0,2,single_family,74052,996.0,2011,16326 Ontario Shores Dr,NY,Sterling,Cayuga
1,225000.0,3.0,2.0,1.0,2,single_family,30056,1224.0,1973,38 Pine Cir,NY,Newfield,Tompkins
2,149000.0,4.0,2.0,2.0,2,single_family,223898,1608.0,1900,8 Gridleyville Rd,NY,Spencer,Tioga
3,599999.0,4.0,2.0,0.0,2,single_family,7307,1827.0,1858,59 Hamilton Ave,NY,Oyster Bay,Nassau
4,299900.0,3.0,2.0,1.0,2,single_family,8712,1589.0,1960,41 Bender Ln,NY,Bethlehem,Albany
...,...,...,...,...,...,...,...,...,...,...,...,...,...
8412,975000.0,1.0,2.0,0.0,2,condos,10550,826.0,1987,75 Wall St Apt 24K,NY,New York,New York
8413,1195000.0,2.0,2.0,0.0,29,condos,10550,1740.0,1984,311 E 38th St Apt 10E,NY,New York City,New York
8414,689000.0,3.0,3.0,1.0,2,single_family,7475,2100.0,2022,223 Endicott Ave,NY,Elmsford,Westchester
8415,1862500.0,5.0,4.0,1.0,2,single_family,4920,2750.0,1955,147-40 8th Ave,NY,Whitestone,Queens


In [60]:
# Flag rows whose lot is smaller than the living area, then drop them in one step
smaller_lot = df_clean['lot_sqft'] < df_clean['sqft']
count = int(smaller_lot.sum())
df_clean = df_clean[~smaller_lot]

print(count)
print()

df_clean = df_clean.reset_index(drop=True)
df_clean

311



Unnamed: 0,price,beds,baths,garage,stories,house_type,lot_sqft,sqft,year_built,address,state,city,county
0,294900.0,3.0,2.0,2.0,2,single_family,74052,996.0,2011,16326 Ontario Shores Dr,NY,Sterling,Cayuga
1,225000.0,3.0,2.0,1.0,2,single_family,30056,1224.0,1973,38 Pine Cir,NY,Newfield,Tompkins
2,149000.0,4.0,2.0,2.0,2,single_family,223898,1608.0,1900,8 Gridleyville Rd,NY,Spencer,Tioga
3,599999.0,4.0,2.0,0.0,2,single_family,7307,1827.0,1858,59 Hamilton Ave,NY,Oyster Bay,Nassau
4,299900.0,3.0,2.0,1.0,2,single_family,8712,1589.0,1960,41 Bender Ln,NY,Bethlehem,Albany
...,...,...,...,...,...,...,...,...,...,...,...,...,...
7945,975000.0,1.0,2.0,0.0,2,condos,10550,826.0,1987,75 Wall St Apt 24K,NY,New York,New York
7946,1195000.0,2.0,2.0,0.0,29,condos,10550,1740.0,1984,311 E 38th St Apt 10E,NY,New York City,New York
7947,689000.0,3.0,3.0,1.0,2,single_family,7475,2100.0,2022,223 Endicott Ave,NY,Elmsford,Westchester
7948,1862500.0,5.0,4.0,1.0,2,single_family,4920,2750.0,1955,147-40 8th Ave,NY,Whitestone,Queens


### 6.2 The Outliers : 

In [61]:
df_clean['lot_sqft'].describe()

count    7.950000e+03
mean     5.571632e+05
std      2.436590e+07
min      4.350000e+02
25%      6.500000e+03
50%      1.055000e+04
75%      2.265100e+04
max      1.660856e+09
Name: lot_sqft, dtype: float64

In [62]:
z_score = (df_clean['lot_sqft'] - df_clean['lot_sqft'].mean())/df_clean['lot_sqft'].std()

In [63]:
lot_sqft_outliers = abs(z_score)>3
sum(lot_sqft_outliers)

4

In [64]:
min(df_clean.lot_sqft[lot_sqft_outliers])

213444000

In [65]:
max(df_clean.lot_sqft[lot_sqft_outliers])

1660855680

In [66]:
df_clean[df_clean['lot_sqft']>=18360540].reset_index(drop=True)

Unnamed: 0,price,beds,baths,garage,stories,house_type,lot_sqft,sqft,year_built,address,state,city,county
0,225000.0,4.0,2.0,2.0,2,multi_family,1660855680,1327.0,1908,1143 Regent St,NY,Niskayuna,Schenectady
1,159900.0,6.0,2.0,0.0,2,multi_family,1313334000,2128.0,1910,1821 Broadway,NY,Schenectady,Schenectady
2,2300000.0,3.0,4.0,3.0,3,single_family,18360540,5312.0,2008,1435 Webster St,NY,Malone,Franklin
3,165000.0,5.0,2.0,2.0,2,multi_family,213444000,1740.0,1900,613 Orchard St,NY,Schenectady,Schenectady
4,999000.0,5.0,4.0,2.0,2,multi_family,435556440,2500.0,2011,260 Zerega Ave,NY,Bronx,Bronx


### 6.2.1 Handling Lot Sqft outliers 

In [67]:
df_clean['lot_sqft'] = np.where((df_clean.lot_sqft>=18360540),np.nan,df_clean.lot_sqft)
df_clean.isnull().sum()

price         0
beds          0
baths         0
garage        0
stories       0
house_type    0
lot_sqft      5
sqft          0
year_built    0
address       0
state         0
city          0
county        0
dtype: int64

In [68]:
df_clean=df_clean.dropna(subset=['lot_sqft'])
df_clean.isnull().sum()


price         0
beds          0
baths         0
garage        0
stories       0
house_type    0
lot_sqft      0
sqft          0
year_built    0
address       0
state         0
city          0
county        0
dtype: int64

In [69]:
df_clean

Unnamed: 0,price,beds,baths,garage,stories,house_type,lot_sqft,sqft,year_built,address,state,city,county
0,294900.0,3.0,2.0,2.0,2,single_family,74052.0,996.0,2011,16326 Ontario Shores Dr,NY,Sterling,Cayuga
1,225000.0,3.0,2.0,1.0,2,single_family,30056.0,1224.0,1973,38 Pine Cir,NY,Newfield,Tompkins
2,149000.0,4.0,2.0,2.0,2,single_family,223898.0,1608.0,1900,8 Gridleyville Rd,NY,Spencer,Tioga
3,599999.0,4.0,2.0,0.0,2,single_family,7307.0,1827.0,1858,59 Hamilton Ave,NY,Oyster Bay,Nassau
4,299900.0,3.0,2.0,1.0,2,single_family,8712.0,1589.0,1960,41 Bender Ln,NY,Bethlehem,Albany
...,...,...,...,...,...,...,...,...,...,...,...,...,...
7945,975000.0,1.0,2.0,0.0,2,condos,10550.0,826.0,1987,75 Wall St Apt 24K,NY,New York,New York
7946,1195000.0,2.0,2.0,0.0,29,condos,10550.0,1740.0,1984,311 E 38th St Apt 10E,NY,New York City,New York
7947,689000.0,3.0,3.0,1.0,2,single_family,7475.0,2100.0,2022,223 Endicott Ave,NY,Elmsford,Westchester
7948,1862500.0,5.0,4.0,1.0,2,single_family,4920.0,2750.0,1955,147-40 8th Ave,NY,Whitestone,Queens


### 7. Stories outliers :

In [70]:
df_clean['stories'].describe()

count    7945.000000
mean        2.881435
std         5.219916
min         1.000000
25%         2.000000
50%         2.000000
75%         2.000000
max        90.000000
Name: stories, dtype: float64

In [71]:
z_score = (df_clean['stories'] - df_clean['stories'].mean())/df_clean['stories'].std()

In [72]:
stories_outliers = abs(z_score)>3
sum(stories_outliers)

155

In [73]:
min(df_clean.stories[stories_outliers])

19

In [74]:
max(df_clean.stories[stories_outliers])

90

In [75]:
df_clean[df_clean['stories']>=20].reset_index(drop=True)

Unnamed: 0,price,beds,baths,garage,stories,house_type,lot_sqft,sqft,year_built,address,state,city,county
0,525000.0,5.0,6.0,0.0,52,condos,10550.0,3429.0,2021,200 Amsterdam Ave Unit 3A,NY,New York,New York
1,6780000.0,3.0,4.0,0.0,52,condos,10550.0,2221.0,2021,200 Amsterdam Ave Unit 26B,NY,New York,New York
2,525000.0,3.0,4.0,0.0,52,condos,10550.0,2221.0,2021,200 Amsterdam Ave Unit 25B,NY,New York,New York
3,525000.0,4.0,6.0,0.0,33,condos,10550.0,3976.0,1996,1965 Broadway Ph 2A,NY,New York,New York
4,2950000.0,4.0,3.0,0.0,21,coop,10550.0,2200.0,1929,410 W 24th St Unit 3IJM,NY,New York,New York
...,...,...,...,...,...,...,...,...,...,...,...,...,...
140,575000.0,1.0,1.0,0.0,21,coop,10550.0,1740.0,1971,345 E 86th St Apt 3A,NY,New York City,New York
141,1775000.0,2.0,2.0,0.0,23,condos,10550.0,968.0,1985,225 Rector Pl Apt 20C,NY,New York,New York
142,5995000.0,3.0,3.0,0.0,40,condos,10550.0,2572.0,2008,101 W 24th St Unit 20ED,NY,New York City,New York
143,1198000.0,1.0,2.0,0.0,38,condos,10550.0,927.0,1987,75 Wall St Apt 35P,NY,New York City,New York


### 7.1 Handling stories outliers :

In [76]:
df_clean['stories'] = np.where((df_clean.stories>=20),np.nan,df_clean.stories)
df_clean.isnull().sum()

price           0
beds            0
baths           0
garage          0
stories       145
house_type      0
lot_sqft        0
sqft            0
year_built      0
address         0
state           0
city            0
county          0
dtype: int64

In [77]:
df_clean=df_clean.dropna(subset=['stories'])
df_clean.isnull().sum()

price         0
beds          0
baths         0
garage        0
stories       0
house_type    0
lot_sqft      0
sqft          0
year_built    0
address       0
state         0
city          0
county        0
dtype: int64

In [78]:
df_clean = df_clean.reset_index(drop=True)
df_clean

Unnamed: 0,price,beds,baths,garage,stories,house_type,lot_sqft,sqft,year_built,address,state,city,county
0,294900.0,3.0,2.0,2.0,2.0,single_family,74052.0,996.0,2011,16326 Ontario Shores Dr,NY,Sterling,Cayuga
1,225000.0,3.0,2.0,1.0,2.0,single_family,30056.0,1224.0,1973,38 Pine Cir,NY,Newfield,Tompkins
2,149000.0,4.0,2.0,2.0,2.0,single_family,223898.0,1608.0,1900,8 Gridleyville Rd,NY,Spencer,Tioga
3,599999.0,4.0,2.0,0.0,2.0,single_family,7307.0,1827.0,1858,59 Hamilton Ave,NY,Oyster Bay,Nassau
4,299900.0,3.0,2.0,1.0,2.0,single_family,8712.0,1589.0,1960,41 Bender Ln,NY,Bethlehem,Albany
...,...,...,...,...,...,...,...,...,...,...,...,...,...
7795,449000.0,2.0,1.0,0.0,2.0,coop,10550.0,1000.0,1963,2483 W 16th St Apt 12J,NY,Brooklyn,Kings
7796,975000.0,1.0,2.0,0.0,2.0,condos,10550.0,826.0,1987,75 Wall St Apt 24K,NY,New York,New York
7797,689000.0,3.0,3.0,1.0,2.0,single_family,7475.0,2100.0,2022,223 Endicott Ave,NY,Elmsford,Westchester
7798,1862500.0,5.0,4.0,1.0,2.0,single_family,4920.0,2750.0,1955,147-40 8th Ave,NY,Whitestone,Queens


## Feature Engineering

In [79]:
df_clean['price_per_sqft'] = df_clean['price'] / df_clean['sqft']
df_clean


Unnamed: 0,price,beds,baths,garage,stories,house_type,lot_sqft,sqft,year_built,address,state,city,county,price_per_sqft
0,294900.0,3.0,2.0,2.0,2.0,single_family,74052.0,996.0,2011,16326 Ontario Shores Dr,NY,Sterling,Cayuga,296.084337
1,225000.0,3.0,2.0,1.0,2.0,single_family,30056.0,1224.0,1973,38 Pine Cir,NY,Newfield,Tompkins,183.823529
2,149000.0,4.0,2.0,2.0,2.0,single_family,223898.0,1608.0,1900,8 Gridleyville Rd,NY,Spencer,Tioga,92.661692
3,599999.0,4.0,2.0,0.0,2.0,single_family,7307.0,1827.0,1858,59 Hamilton Ave,NY,Oyster Bay,Nassau,328.406678
4,299900.0,3.0,2.0,1.0,2.0,single_family,8712.0,1589.0,1960,41 Bender Ln,NY,Bethlehem,Albany,188.735053
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
7795,449000.0,2.0,1.0,0.0,2.0,coop,10550.0,1000.0,1963,2483 W 16th St Apt 12J,NY,Brooklyn,Kings,449.000000
7796,975000.0,1.0,2.0,0.0,2.0,condos,10550.0,826.0,1987,75 Wall St Apt 24K,NY,New York,New York,1180.387409
7797,689000.0,3.0,3.0,1.0,2.0,single_family,7475.0,2100.0,2022,223 Endicott Ave,NY,Elmsford,Westchester,328.095238
7798,1862500.0,5.0,4.0,1.0,2.0,single_family,4920.0,2750.0,1955,147-40 8th Ave,NY,Whitestone,Queens,677.272727


### price_per_sqft outliers :

In [80]:
df_clean['price_per_sqft'].describe()

count    7800.000000
mean      383.383502
std       363.503814
min        27.733074
25%       170.067756
50%       292.968750
75%       451.150390
max      5486.641221
Name: price_per_sqft, dtype: float64

In [81]:
z_score = (df_clean['price_per_sqft'] - df_clean['price_per_sqft'].mean())/df_clean['price_per_sqft'].std()

In [82]:
price_per_sqft_outliers = abs(z_score)>3
sum(price_per_sqft_outliers)

174

In [83]:
min(df_clean.price_per_sqft[price_per_sqft_outliers])

1476.1904761904761

In [84]:
max(df_clean.price_per_sqft[price_per_sqft_outliers])

5486.641221374046

In [85]:
df_clean[df_clean['price_per_sqft']<50].reset_index(drop=True)

Unnamed: 0,price,beds,baths,garage,stories,house_type,lot_sqft,sqft,year_built,address,state,city,county,price_per_sqft
0,230000.0,3.0,4.0,2.0,2.0,single_family,2831400.0,7563.0,1906,421 W Church St,NY,Elmira,Chemung,30.411212
1,179900.0,4.0,3.0,3.0,2.0,single_family,17424.0,3658.0,1962,4 Lancaster Dr,NY,Endicott,Tioga,49.17988
2,100000.0,3.0,2.0,0.0,3.0,multi_family,3202.0,2860.0,1920,1112 W Belden Ave Unit 1114,NY,Syracuse,Onondaga,34.965035
3,129900.0,6.0,3.0,0.0,2.0,multi_family,6395.0,2758.0,1860,56 and 60 Colt St,NY,Geneva,Ontario,47.099347
4,134900.0,3.0,4.0,0.0,1.0,single_family,7231.0,3486.0,1950,9 Elm St,NY,Peru,Clinton,38.697648
5,199900.0,3.0,2.0,3.0,1.0,multi_family,129809.0,7208.0,1988,4328 County Route 4,NY,Oswego,Oswego,27.733074
6,119900.0,5.0,2.0,0.0,2.0,multi_family,6578.0,2678.0,1875,196 Bridge St,NY,Corning,Steuben,44.772218
7,115000.0,5.0,2.0,2.0,2.0,multi_family,3572.0,2493.0,1837,339 Fargo Ave,NY,Buffalo,Erie,46.129162
8,225000.0,7.0,3.0,1.0,2.0,single_family,65340.0,5734.0,1813,20 Main St,NY,Deposit,Delaware,39.239623
9,129000.0,3.0,2.0,2.0,2.0,single_family,21780.0,3680.0,1940,1141 Cassadaga Rd,NY,South Dayton,Chautauqua,35.054348


In [86]:
df_clean[df_clean['price_per_sqft']>2500].reset_index(drop=True)

Unnamed: 0,price,beds,baths,garage,stories,house_type,lot_sqft,sqft,year_built,address,state,city,county,price_per_sqft
0,4785876.0,6.0,5.0,0.0,2.0,single_family,8580.0,1740.0,2003,51 Bay St,NY,East Atlantic Beach,Nassau,2750.503448
1,5750000.0,2.0,3.0,0.0,15.0,condos,10550.0,1048.0,2020,27 E 79th St Unit 14FLOOR,NY,New York City,New York,5486.641221
2,4900000.0,4.0,4.0,0.0,15.0,coop,10550.0,1740.0,1926,1165 Park Ave Unit 12B,NY,New York City,New York,2816.091954
3,4800000.0,2.0,3.0,0.0,13.0,condos,10550.0,1762.0,1928,108 Leonard St Apt 5A,NY,New York,New York,2724.177072
4,4995000.0,2.0,3.0,0.0,17.0,coop,10550.0,1740.0,1959,35 E 75th St Unit 14BC,NY,New York,New York,2870.689655
5,5250000.0,4.0,3.0,0.0,3.0,coop,10550.0,1740.0,1930,1100 Park Ave Unit 3B,NY,New York,New York,3017.241379
6,4999000.0,4.0,4.0,0.0,12.0,coop,10550.0,1740.0,1907,131 E 66th /7f St Units 8EF & 9EF,NY,New York,New York,2872.988506
7,3600000.0,2.0,2.0,0.0,2.0,condos,10550.0,1230.0,1910,455 W 22nd St Unit Ph,NY,Manhattan,New York,2926.829268
8,4150000.0,2.0,3.0,0.0,14.0,condos,10550.0,1611.0,1920,225 W 86th St Unit 412,NY,New York,New York,2576.039727
9,5950000.0,3.0,4.0,0.0,14.0,condos,10550.0,2263.0,1920,225 W 86th St Apt 1107,NY,New York,New York,2629.253204


### Handling price_per_sqft outliers

In [87]:
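# Mark very low values (below 50) as missing; fixed cutoffs are used here rather than the |z| > 3 flag
df_clean['price_per_sqft'] = np.where(df_clean['price_per_sqft'] < 50, np.nan, df_clean['price_per_sqft'])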


In [88]:
# Mark very high values (above 2500) as missing, then count missing values per column
df_clean['price_per_sqft'] = np.where(df_clean['price_per_sqft'] > 2500, np.nan, df_clean['price_per_sqft'])
df_clean.isnull().sum()

price              0
beds               0
baths              0
garage             0
stories            0
house_type         0
lot_sqft           0
sqft               0
year_built         0
address            0
state              0
city               0
county             0
price_per_sqft    72
dtype: int64

In [89]:
# Drop the listings whose price_per_sqft was marked as missing, then confirm no NaNs remain
df_clean = df_clean.dropna(subset=['price_per_sqft'])
df_clean.isnull().sum()

price             0
beds              0
baths             0
garage            0
stories           0
house_type        0
lot_sqft          0
sqft              0
year_built        0
address           0
state             0
city              0
county            0
price_per_sqft    0
dtype: int64
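
Marking values as NaN and then calling `dropna` can also be written as a single boolean mask. The following is a sketch on a made-up frame, assuming the same 50 and 2500 cutoffs; `toy`, `LOW` and `HIGH` are hypothetical names that do not appear in the notebook.

```python
import pandas as pd

# Toy listings (invented values, not rows from the dataset)
toy = pd.DataFrame({
    "price": [100000.0, 300000.0, 5750000.0, 450000.0],
    "sqft": [2860.0, 1000.0, 1048.0, 1000.0],
})
toy["price_per_sqft"] = toy["price"] / toy["sqft"]

LOW, HIGH = 50, 2500  # same cutoffs as in the cells above

# between() is inclusive on both ends, so only values strictly below LOW
# or strictly above HIGH are removed
keep = toy["price_per_sqft"].between(LOW, HIGH)
filtered = toy[keep].reset_index(drop=True)
print(filtered)
```

One difference from the two-step version: `between` returns False for NaN, so rows that were already missing would be dropped as well.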

In [90]:
df_clean = df_clean.reset_index(drop=True)
df_clean

Unnamed: 0,price,beds,baths,garage,stories,house_type,lot_sqft,sqft,year_built,address,state,city,county,price_per_sqft
0,294900.0,3.0,2.0,2.0,2.0,single_family,74052.0,996.0,2011,16326 Ontario Shores Dr,NY,Sterling,Cayuga,296.084337
1,225000.0,3.0,2.0,1.0,2.0,single_family,30056.0,1224.0,1973,38 Pine Cir,NY,Newfield,Tompkins,183.823529
2,149000.0,4.0,2.0,2.0,2.0,single_family,223898.0,1608.0,1900,8 Gridleyville Rd,NY,Spencer,Tioga,92.661692
3,599999.0,4.0,2.0,0.0,2.0,single_family,7307.0,1827.0,1858,59 Hamilton Ave,NY,Oyster Bay,Nassau,328.406678
4,299900.0,3.0,2.0,1.0,2.0,single_family,8712.0,1589.0,1960,41 Bender Ln,NY,Bethlehem,Albany,188.735053
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
7723,449000.0,2.0,1.0,0.0,2.0,coop,10550.0,1000.0,1963,2483 W 16th St Apt 12J,NY,Brooklyn,Kings,449.000000
7724,975000.0,1.0,2.0,0.0,2.0,condos,10550.0,826.0,1987,75 Wall St Apt 24K,NY,New York,New York,1180.387409
7725,689000.0,3.0,3.0,1.0,2.0,single_family,7475.0,2100.0,2022,223 Endicott Ave,NY,Elmsford,Westchester,328.095238
7726,1862500.0,5.0,4.0,1.0,2.0,single_family,4920.0,2750.0,1955,147-40 8th Ave,NY,Whitestone,Queens,677.272727


## Counting categorical features

In [91]:
df_clean['house_type'].value_counts()

single_family    5312
coop              754
multi_family      742
condos            582
townhomes         197
land              139
apartment           2
Name: house_type, dtype: int64

In [92]:
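# Merge the two 'apartment' listings into 'condos'; assign back instead of using inplace on a column selection
df_clean["house_type"] = df_clean["house_type"].replace({"apartment": "condos"})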

In [93]:
df_clean['house_type'].value_counts()

single_family    5312
coop              754
multi_family      742
condos            584
townhomes         197
land              139
Name: house_type, dtype: int64
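
Merging a rare category into a larger one comes down to replacing a label and recounting. This is a minimal sketch on a made-up series (`house_type` here is a toy stand-in, not the notebook's column):

```python
import pandas as pd

# Toy category labels (invented for illustration)
house_type = pd.Series(
    ["single_family", "apartment", "condos", "coop", "apartment"],
    name="house_type",
)

# replace() returns a new Series; assigning the result keeps the update explicit
merged = house_type.replace({"apartment": "condos"})

counts = merged.value_counts()
print(counts)
```

Assigning the result back, rather than calling `replace(..., inplace=True)` on a selected column, avoids depending on whether the selection is a view or a copy of the frame.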

In [94]:
df_clean['city'].value_counts()

Brooklyn          286
New York City     229
New York          219
Staten Island     212
Rochester         187
                 ... 
Harpursville        1
East Chatham        1
Vernon              1
North Salem         1
Fremont Center      1
Name: city, Length: 1089, dtype: int64

In [95]:
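# Treat 'New York' and 'New York City' as the same city; assign back instead of using inplace on a column selection
df_clean["city"] = df_clean["city"].replace({"New York": "New York City"})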

In [96]:
df_clean['city'].value_counts()

New York City     448
Brooklyn          286
Staten Island     212
Rochester         187
Bronx             160
                 ... 
Harpursville        1
East Chatham        1
Vernon              1
North Salem         1
Fremont Center      1
Name: city, Length: 1088, dtype: int64

In [97]:
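# Save the cleaned data; with the default index=True the row index is written as an unnamed first column
df_clean.to_csv('RealEstateNewYork_Clean.csv')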
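
When the saved file is read back, the unnamed index column needs to be handled. This sketch assumes the default `to_csv` behaviour and uses an in-memory buffer with a made-up two-row frame in place of the real file:

```python
import io

import pandas as pd

# Toy frame standing in for the cleaned data (values are illustrative)
toy = pd.DataFrame({
    "price": [449000.0, 975000.0],
    "city": ["Brooklyn", "New York City"],
})

buffer = io.StringIO()
toy.to_csv(buffer)  # default index=True: the index becomes an unnamed first column
header = buffer.getvalue().splitlines()[0]

# Reading back with index_col=0 restores the index instead of creating an 'Unnamed: 0' column
buffer.seek(0)
restored = pd.read_csv(buffer, index_col=0)
print(header)
print(restored)
```

Passing `index=False` to `to_csv` is the alternative when the row index carries no information worth keeping.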