# Data Preprocessing

Some rows do not contain a proper numeric sale price (SALE PRICE), so those rows are removed first.

In [79]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

df = pd.read_csv('nyc-rolling-sales.csv')

df.head()

| Unnamed: 0.1 | Unnamed: 0 | BOROUGH | NEIGHBORHOOD | BUILDING CLASS CATEGORY | TAX CLASS AT PRESENT | BLOCK | LOT | EASE-MENT | BUILDING CLASS AT PRESENT | ADDRESS | ... | RESIDENTIAL UNITS | COMMERCIAL UNITS | TOTAL UNITS | LAND SQUARE FEET | GROSS SQUARE FEET | YEAR BUILT | TAX CLASS AT TIME OF SALE | BUILDING CLASS AT TIME OF SALE | SALE PRICE | SALE DATE |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 4 | 1 | ALPHABET CITY | 07 RENTALS - WALKUP APARTMENTS | 2A | 392 | 6 | | C2 | 153 AVENUE B | ... | 5 | 0 | 5 | 1633 | 6440 | 1900 | 2 | C2 | 6625000 | 2017-07-19 00:00:00 |
| 1 | 5 | 1 | ALPHABET CITY | 07 RENTALS - WALKUP APARTMENTS | 2 | 399 | 26 | | C7 | 234 EAST 4TH STREET | ... | 28 | 3 | 31 | 4616 | 18690 | 1900 | 2 | C7 | - | 2016-12-14 00:00:00 |
| 2 | 6 | 1 | ALPHABET CITY | 07 RENTALS - WALKUP APARTMENTS | 2 | 399 | 39 | | C7 | 197 EAST 3RD STREET | ... | 16 | 1 | 17 | 2212 | 7803 | 1900 | 2 | C7 | - | 2016-12-09 00:00:00 |
| 3 | 7 | 1 | ALPHABET CITY | 07 RENTALS - WALKUP APARTMENTS | 2B | 402 | 21 | | C4 | 154 EAST 7TH STREET | ... | 10 | 0 | 10 | 2272 | 6794 | 1913 | 2 | C4 | 3936272 | 2016-09-23 00:00:00 |
| 4 | 8 | 1 | ALPHABET CITY | 07 RENTALS - WALKUP APARTMENTS | 2A | 404 | 55 | | C2 | 301 EAST 10TH STREET | ... | 6 | 0 | 6 | 2369 | 4615 | 1900 | 2 | C2 | 8000000 | 2016-11-17 00:00:00 |


In [80]:
# Drop rows whose SALE PRICE is the padded dash placeholder (no recorded price)
df.drop(df[df['SALE PRICE'] == ' -  '].index, inplace=True)

In [81]:
df['SALE PRICE'] = df['SALE PRICE'].astype(float)
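The exact-match drop above depends on the placeholder being spelled with precisely the same padding in every row. As a hedged alternative, the sketch below uses `pd.to_numeric` with `errors="coerce"` so that any non-numeric value becomes `NaN` and can be dropped. The small `raw` frame is made up for illustration and only mimics the column; it is not the real file.

```python
import pandas as pd

# Made-up frame mimicking the raw column: prices stored as text,
# missing prices shown as a padded dash.
raw = pd.DataFrame({"SALE PRICE": ["6625000", " -  ", " -  ", "3936272", "8000000"]})

# Strip padding and coerce anything non-numeric to NaN.
price = pd.to_numeric(raw["SALE PRICE"].str.strip(), errors="coerce")

# Replace the text column with the numeric one and drop rows without a price.
clean = raw.copy()
clean["SALE PRICE"] = price
clean = clean.dropna(subset=["SALE PRICE"])

print(clean)
print(clean["SALE PRICE"].dtype)
```

This variant does the filtering and the float conversion in one pass, at the cost of silently discarding any other malformed value, which may or may not be what is wanted.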

# 1. Descriptive Statistics

## 1.1 Measure of Central Tendency

Calculate the mean, median, and mode (modus) of the sale price of buildings of type "07 RENTALS - WALKUP APARTMENTS" in Manhattan (BOROUGH = 1) whose building class at the time of sale is C2!

First, we need to create a new data frame that contains only the rows matching this description.

In [59]:
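# Note: the category values in the raw data are padded with trailing spaces,
# so the comparison string must include that padding to match.
df_asked = df[(df['BOROUGH'] == 1) & (df['BUILDING CLASS CATEGORY'] == "07 RENTALS - WALKUP APARTMENTS             ")
              & (df['BUILDING CLASS AT TIME OF SALE'] == "C2")]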

print(df_asked)

       Unnamed: 0  BOROUGH               NEIGHBORHOOD  \
0               4        1              ALPHABET CITY   
4               8        1              ALPHABET CITY   
1456         1460        1                    CLINTON   
3554         3558        1  GREENWICH VILLAGE-CENTRAL   
4199         4203        1     GREENWICH VILLAGE-WEST   
4888         4892        1             HARLEM-CENTRAL   
4919         4923        1             HARLEM-CENTRAL   
4920         4924        1             HARLEM-CENTRAL   
4939         4943        1             HARLEM-CENTRAL   
4954         4958        1             HARLEM-CENTRAL   
5697         5701        1                HARLEM-EAST   
5698         5702        1                HARLEM-EAST   
6027         6031        1               HARLEM-UPPER   
6029         6033        1               HARLEM-UPPER   
6030         6034        1               HARLEM-UPPER   
6132         6136        1                HARLEM-WEST   
13463       13467        1    U
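The filter above only matches because the comparison string carries the same trailing padding as the raw data. A minimal sketch of a padding-insensitive version follows, assuming the same column names; the `toy` frame and its values are made up for illustration.

```python
import pandas as pd

# Made-up rows; only the column names are taken from the dataset.
toy = pd.DataFrame({
    "BOROUGH": [1, 1, 2],
    "BUILDING CLASS CATEGORY": [
        "07 RENTALS - WALKUP APARTMENTS             ",
        "01 ONE FAMILY DWELLINGS                    ",
        "07 RENTALS - WALKUP APARTMENTS             ",
    ],
    "BUILDING CLASS AT TIME OF SALE": ["C2", "A1", "C2"],
    "SALE PRICE": [6625000.0, 500000.0, 700000.0],
})

# Strip the padding once, then compare against the clean label.
category = toy["BUILDING CLASS CATEGORY"].str.strip()
mask = (
    (toy["BOROUGH"] == 1)
    & (category == "07 RENTALS - WALKUP APARTMENTS")
    & (toy["BUILDING CLASS AT TIME OF SALE"] == "C2")
)
subset = toy[mask]

print(subset)
```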

### 1.1.1 Mean

In [61]:
mean = df_asked['SALE PRICE'].mean()
print(mean)

3682050.0


### 1.1.2 Median

In [62]:
median = df_asked['SALE PRICE'].median()
print(median)

2407500.0


### 1.1.3 Mode (Modus)

In [64]:
mode = df_asked['SALE PRICE'].mode()
print(mode)

0    1300000.0
1    2000000.0
Name: SALE PRICE, dtype: float64


There are two modes in the dataset: 1,300,000 and 2,000,000.

The median (2,407,500) lies fairly close to the two modes, but the mean (3,682,050) is about 1.5 times the median, which suggests that a few high-priced sales pull the mean upward. Before we can conclude anything, we need to check the measures of spread to judge whether the mean represents the population well.
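One way to put a number on how far apart these values are is the gap between mean and median relative to the median. The sketch below uses only the figures printed above; the `relative_gap` name is introduced here for illustration, and whether a given gap counts as "far" remains a judgment call.

```python
# Values printed in the cells above.
mean = 3682050.0
median = 2407500.0
modes = [1300000.0, 2000000.0]

# Gap between mean and median, as a fraction of the median.
relative_gap = (mean - median) / median

# A mean well above the median is a common sign of right skew.
mean_above_median = mean > median

print(f"relative gap: {relative_gap:.3f}")
print(f"mean above median: {mean_above_median}")
```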

## 1.2 Measure of Spread

Calculate the range, variance, and standard deviation (std) of the sale price of buildings of type "07 RENTALS - WALKUP APARTMENTS" in Manhattan (BOROUGH = 1) whose building class at the time of sale is C2!

### 1.2.1 Range

In [76]:
range_ = np.ptp(df_asked['SALE PRICE'])
print(range_)

9937000.0


The difference between the maximum and minimum sale price in the data set is 9,937,000. This is large: roughly 2.7 times the mean of 3,682,050.

### 1.2.2 Variance

In [74]:
variance = np.var(df_asked['SALE PRICE'], ddof=1)
print(variance)

7579940471052.632


### 1.2.3 Standard Deviation

In [77]:
std = np.std(df_asked['SALE PRICE'], ddof=1)
print(std)

2753169.168622341
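Both `np.var` and `np.std` are called with `ddof=1` above, which gives the sample estimates (dividing by n - 1 rather than n). The sketch below shows the difference on a small made-up array; it also illustrates that NumPy defaults to `ddof=0` while pandas defaults to `ddof=1`, so the two libraries disagree unless `ddof` is set explicitly.

```python
import numpy as np
import pandas as pd

# Made-up values, chosen so the arithmetic is easy to follow (mean = 5).
data = np.array([2, 4, 4, 4, 5, 5, 7, 9], dtype=float)

pop_var = np.var(data)             # NumPy default: ddof=0, divides by n
sample_var = np.var(data, ddof=1)  # divides by n - 1
pandas_var = pd.Series(data).var() # pandas default: ddof=1

print(pop_var, sample_var, pandas_var)
```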


# 2. Inferential Statistics

## 2.1 Probability Distribution

We need to check whether the SALE PRICE column of df_asked is normally distributed, using the Shapiro-Wilk test.

In [89]:
from scipy.stats import shapiro

stat, p = shapiro(df_asked['SALE PRICE'])
print('stat=%.3f, p=%.3f' % (stat, p))
if p > 0.05:
    print('Probably Gaussian')
else:
    print('Probably Not Gaussian')

stat=0.870, p=0.012
Probably Not Gaussian


## 2.2 Confidence Interval

For a 95% confidence interval, we use a z-score of 1.96.

In [85]:
z_score = 1.96
n = len(df_asked.index)  # number of sales in the filtered subset
print(n)

20


### Calculating Standard Error

In [86]:
se = std / np.sqrt(n)

### Calculating Confidence Interval

In [88]:
lcb = mean - z_score * se  # lower limit of CI
ucb = mean + z_score * se  # upper limit of CI

lcb, ucb

(2475420.4107391573, 4888679.589260843)

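The 95% confidence interval for the mean sale price is (2475420.41, 4888679.59). This means that if we repeatedly drew samples of this size and built an interval the same way each time, about 95% of those intervals would contain the true mean. It does not mean that 95% of individual sale prices fall within this range.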
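With only n = 20 observations and a standard deviation estimated from the sample, a Student's t critical value is a common alternative to the z-score of 1.96. The sketch below recomputes the interval both ways from the figures printed above, using `scipy.stats.t.ppf`. It is an illustration under that assumption and not a replacement for the calculation in this section; the t interval also relies on approximate normality, which the Shapiro-Wilk result above calls into question.

```python
import numpy as np
from scipy import stats

# Values reported earlier in this section.
mean = 3682050.0
std = 2753169.168622341  # sample standard deviation (ddof=1)
n = 20

se = std / np.sqrt(n)

z_crit = 1.96                          # z-score used in this section
t_crit = stats.t.ppf(0.975, df=n - 1)  # two-sided 95%, n - 1 degrees of freedom

ci_z = (mean - z_crit * se, mean + z_crit * se)
ci_t = (mean - t_crit * se, mean + t_crit * se)

print("z-based:", z_crit, ci_z)
print("t-based:", t_crit, ci_t)
```

The t critical value is larger than 1.96 for 19 degrees of freedom, so the t-based interval comes out somewhat wider.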

## 2.3 Hypothesis Testing

Hypothesis that will be tested:

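Is the average sale price in Alphabet City higher than the average sale price in Clinton? The test used here is two-sided, so it checks whether the two means differ in either direction.

Null hypothesis: mu1 = mu2, where mu1 and mu2 are the mean sale prices in Alphabet City and Clinton

Alternative hypothesis: mu1 != mu2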

In [96]:
import statsmodels.api as sm

a_city = df[df["NEIGHBORHOOD"] == "ALPHABET CITY"]
clinton = df[df["NEIGHBORHOOD"] == "CLINTON"]

In [94]:
n1 = len(a_city)
mu1 = a_city["SALE PRICE"].mean()
sd1 = a_city["SALE PRICE"].std()

(n1, mu1, sd1)

(148, 2600240.9256756757, 6710061.340155619)

In [95]:
n2 = len(clinton)
mu2 = clinton["SALE PRICE"].mean()
sd2 = clinton["SALE PRICE"].std()

(n2, mu2, sd2)

(296, 1876482.1385135136, 4626385.81681132)

In [97]:
sm.stats.ztest(a_city["SALE PRICE"], clinton["SALE PRICE"], alternative='two-sided')

(1.3290641488841186, 0.18382680692299525)
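The statistic above can be reproduced by hand from the summary statistics printed earlier. The sketch assumes the pooled-variance form of the two-sample z statistic, which is the default (`usevar='pooled'`) in statsmodels' `ztest`; the variable names `pooled_var`, `se_diff`, and `p_two_sided` are introduced here for illustration.

```python
import numpy as np
from scipy import stats

# Summary statistics printed above: (n, mean, sample std).
n1, mu1, sd1 = 148, 2600240.9256756757, 6710061.340155619  # Alphabet City
n2, mu2, sd2 = 296, 1876482.1385135136, 4626385.81681132   # Clinton

# Pooled variance: weighted average of the two sample variances.
pooled_var = ((n1 - 1) * sd1**2 + (n2 - 1) * sd2**2) / (n1 + n2 - 2)

# Standard error of the difference in means.
se_diff = np.sqrt(pooled_var * (1 / n1 + 1 / n2))

z = (mu1 - mu2) / se_diff
p_two_sided = 2 * stats.norm.sf(abs(z))  # both tails of the standard normal

print(z, p_two_sided)
```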

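Because the p-value is 0.18, which is higher than alpha = 0.05, the null hypothesis is *not rejected*. The data do not provide enough evidence that the average sale prices in the two neighborhoods differ. Failing to reject the null hypothesis does not prove that it is true, so further testing would be needed before concluding that the averages are equal.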
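As one possible follow-up, the two samples have quite different standard deviations (about 6.7 million versus 4.6 million), so a test that does not assume equal variances may be worth a look. The sketch below runs Welch's t-test from the summary statistics printed above via `scipy.stats.ttest_ind_from_stats` with `equal_var=False`. This is a swapped-in technique for comparison, not the test used in this section, and it still assumes the sample means are approximately normal.

```python
from scipy import stats

# Summary statistics printed above: Alphabet City (1) and Clinton (2).
result = stats.ttest_ind_from_stats(
    mean1=2600240.9256756757, std1=6710061.340155619, nobs1=148,
    mean2=1876482.1385135136, std2=4626385.81681132, nobs2=296,
    equal_var=False,  # Welch's t-test: no equal-variance assumption
)

t_stat, p_value = result.statistic, result.pvalue
reject_null = p_value < 0.05

print(t_stat, p_value, reject_null)
```

Under these inputs the Welch statistic is smaller than the pooled z statistic, because the smaller sample has the larger variance, and the conclusion at alpha = 0.05 stays the same.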