### Exploring the variables:

1. **BOROUGH**: A digit code for the borough the property is located in; in order these are Manhattan (1), Bronx (2), Brooklyn (3), Queens (4), and Staten Island (5). **This variable should be categorical.**

2. **NEIGHBORHOOD**: Department of Finance assessors determine the neighborhood name in the course of valuing properties. The common name of the neighborhood is generally the same as the name Finance designates. However, there may be slight differences in neighborhood boundary lines and some sub-neighborhoods may not be included. **This variable should be categorical.**

3. **BUILDING CLASS CATEGORY**: This is a field that we are including so that users of the Rolling Sales Files can easily identify similar properties by broad usage (e.g. One Family Homes) without looking up individual Building Classes. Files are sorted by Borough, Neighborhood, Building Class Category, Block and Lot. **This variable should be categorical.**

4. **TAX CLASS AT PRESENT**: Every property in the city is assigned to one of four tax classes (Classes 1, 2, 3, and 4), based on the use of the property. **This variable should be categorical.**

- Class 1: Includes most residential property of up to three units (such as one-, two-, and three-family homes and small stores or offices with one or two attached apartments), vacant land that is zoned for residential use, and most condominiums that are not more than three stories.
- Class 2: Includes all other property that is primarily residential, such as cooperatives and condominiums.
- Class 3: Includes property with equipment owned by a gas, telephone or electric company.
- Class 4: Includes all other properties not included in class 1,2, and 3, such as offices, factories, warehouses, garage buildings, etc.


5. **BLOCK** and **LOT**: A Tax Block is a subdivision of the borough on which real properties are located. The Department of Finance uses a Borough-Block-Lot classification to label all real property in the City. Whereas addresses describe the street location of a property, the block and lot distinguish one unit of real property from another, such as the different condominiums in a single building. Also, blocks and lots are not subject to name changes based on which side of the parcel the building puts its entrance on. A Tax Lot is a subdivision of a Tax Block and represents the property's unique location. **Because there are more than 11k unique blocks in the dataset, it doesn't make sense to define BLOCK as a categorical variable, so we will leave it as numerical. The same goes for LOT.**

6. **BUILDING CLASS AT PRESENT**: The Building Classification is used to describe a property’s constructive use. The first position of the Building Class is a letter that is used to describe a general class of properties (for example, “A” signifies one-family homes, “O” signifies office buildings, and “R” signifies condominiums). The second position, a number, adds more specific information about the property’s use or construction style (using our previous examples, “A0” is a Cape Cod style one-family home, “O4” is a tower type office building, and “R5” is a commercial condominium unit). The term Building Class used by the Department of Finance is interchangeable with the term Building Code used by the Department of Buildings. **This variable should be categorical.**


7. **ADDRESS**: The street address of the property as listed on the Sales File. Coop sales include the apartment number in the address field. **We are not going to extract any information from the address in this course!**

8. **ZIP CODE**: The property’s postal code. **This variable should be categorical.**
9. **RESIDENTIAL UNITS**: The number of residential units at the listed property. **This variable should be numeric.**

10. **COMMERCIAL UNITS**: The number of commercial units at the listed property. **This variable should be numeric.**

11. **TOTAL UNITS**: The total number of units at the listed property. **This variable should be numeric.**
12. **LAND SQUARE FEET**: The land area of the property listed in square feet. **This variable should be numeric.**
13. **GROSS SQUARE FEET**: The total area of all the floors of a building as measured from the exterior surfaces of the outside walls of the building, including the land area and space within any building or structure on the property. **This variable should be numeric.**


14. **YEAR BUILT** :  Year the structure on the property was built.  **This variable should be categorical** 
15. **TAX CLASS AT TIME OF SALE**: **This variable should be categorical.**
16. **BUILDING CLASS AT TIME OF SALE**: **This variable should be categorical.**
17. **SALE PRICE**: **This variable should be numeric.**
18. **SALE DATE**: **This variable should be a datetime.** However, we can save the "year" or "month" part as a new categorical variable.

19. **EASEMENT**: An easement is a right, such as a right of way, which allows an entity to make limited use of another’s real property. For example: MTA railroad tracks that run across a portion of another property.
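
The categorical-versus-numeric decisions above depend partly on how many distinct values a column has (BLOCK, with more than 11k unique values, is left numeric). A minimal sketch of that check on a toy frame follows; the values and the `max_levels` cut-off are assumptions made up for illustration, not taken from the real file.

```python
import pandas as pd

# Toy data with hypothetical values, only to illustrate the idea.
toy = pd.DataFrame({
    "BOROUGH": [1, 1, 3, 4, 3, 1],
    "BLOCK": [392, 399, 5495, 7654, 5496, 401],
})

# Number of distinct values per column.
n_unique = toy.nunique()

# Hypothetical cut-off: columns with at most this many levels are
# treated as candidates for the "category" dtype.
max_levels = 4
low_card = n_unique[n_unique <= max_levels].index.tolist()

print(n_unique)
print("Candidates for categorical:", low_card)
```

On the real data the same `nunique()` call can be run on `df`; the cut-off is a judgment call rather than a fixed rule.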


Note: **0 Dollar Sales Price**:
A 0 dollar sale indicates that there was a transfer of ownership without a cash consideration. There can be a number of reasons for a 0 dollar sale including transfers of ownership from parents to children. **We will remove all of these observations**
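
As a rough sketch of what removing these observations can look like, the toy example below coerces a price column to numeric and keeps only strictly positive prices. The values, including the `' -  '` placeholder for a missing price, are assumptions for illustration only.

```python
import pandas as pd

# Toy data with hypothetical values, not taken from the real file.
toy = pd.DataFrame({"SALE PRICE": ["450000", "0", " -  ", "1200000"]})

# Coerce to numeric: entries that cannot be parsed become NaN.
toy["SALE PRICE"] = pd.to_numeric(toy["SALE PRICE"], errors="coerce")

# Keep only rows with a strictly positive price. This drops the
# 0 dollar transfers and also the rows whose price is missing,
# because comparisons with NaN evaluate to False.
sold = toy[toy["SALE PRICE"] > 0].copy()
print(sold)
```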



In [None]:
import numpy as np
import pandas as pd 
import matplotlib.pyplot as plt
import seaborn as sns
import statsmodels.api as sm

sns.set()
rand_state=1000

In [None]:
df = pd.read_csv('../input/nyc-property-sales/nyc-rolling-sales.csv')
df_raw = df.copy()  # keep an untouched copy; plain assignment would only alias df
df.head()

In [None]:
df.info()

In [None]:
# First, let's remove the irrelevant index column:
df.drop(["Unnamed: 0"], axis=1, inplace=True)

In [None]:
# constructing the date time variable

df['SALE DATE']= pd.to_datetime(df['SALE DATE'], errors='coerce')

In [None]:
df['sale_year'] = df['SALE DATE'].dt.year.astype("category")
df['sale_month'] = df['SALE DATE'].dt.month.astype("category")
pd.crosstab(df['sale_month'],df['sale_year'])

In [None]:
# constructing the numerical variables:
numeric = ["RESIDENTIAL UNITS","COMMERCIAL UNITS","TOTAL UNITS", "LAND SQUARE FEET" , "GROSS SQUARE FEET","SALE PRICE" ]

for col in numeric: 
    df[col] = pd.to_numeric(df[col], errors='coerce') # coercing errors to NAs

In [None]:
# constructing the categorical variables:
categorical = ["BOROUGH","NEIGHBORHOOD",'BUILDING CLASS CATEGORY', 'TAX CLASS AT PRESENT', 'BUILDING CLASS AT PRESENT','ZIP CODE', 'YEAR BUILT', 'BUILDING CLASS AT TIME OF SALE', 'TAX CLASS AT TIME OF SALE']

for col in categorical: 
    df[col] = df[col].astype("category")

In [None]:
df.info()

In [None]:
df.isna().sum()

So far, we have identified the NA values. But what about the blank cells?

Let's check whether there are any blank cells.
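
An exact-match replacement of `' '` only catches cells that contain a single space. As a hedged alternative, the sketch below uses a regular expression to treat any empty or whitespace-only string as missing; the toy values are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Toy data with hypothetical values: blanks of different lengths.
toy = pd.DataFrame({
    "TAX CLASS AT PRESENT": ["1", " ", "2A", "   "],
    "APARTMENT NUMBER": [" ", "", "3B", " "],
})

# '^\s*$' matches strings that are empty or contain only whitespace.
cleaned = toy.replace(r"^\s*$", np.nan, regex=True)

# Percentage of missing values per column after the replacement.
missing_pct = cleaned.isna().sum() / len(cleaned) * 100
print(missing_pct)
```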

In [None]:
df.replace(' ',np.nan, inplace=True)
df.isna().sum() /len(df) *100

In [None]:
plt.figure(figsize=(10,7))
sns.heatmap(df.isnull(),yticklabels=False,cbar=False,cmap='viridis')

### Now let's get rid of the columns with many NAs
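
One possible way to decide which columns have "many" NAs is to compute the missing share per column and compare it with a threshold. This is only a sketch on toy data; the values and the 0.5 threshold are assumptions, and the notebook itself simply names the two columns to drop.

```python
import numpy as np
import pandas as pd

# Toy data with hypothetical values.
toy = pd.DataFrame({
    "EASE-MENT": [np.nan, np.nan, np.nan, np.nan],
    "APARTMENT NUMBER": [np.nan, "3B", np.nan, np.nan],
    "SALE PRICE": [450000.0, np.nan, 1200000.0, 300000.0],
})

# Share of missing values in each column (between 0 and 1).
missing_share = toy.isna().mean()

# Hypothetical threshold: drop columns that are more than half empty.
threshold = 0.5
to_drop = missing_share[missing_share > threshold].index.tolist()

reduced = toy.drop(columns=to_drop)
print(to_drop)
print(reduced.columns.tolist())
```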

In [None]:
df.drop(["EASE-MENT","APARTMENT NUMBER"], axis=1, inplace=True)

In [None]:
# What should we do with LAND SQUARE FEET and GROSS SQUARE FEET? More than 30% of their values are missing!
# One way is to drop every row that contains an NA, but this discards a lot of data and is not the best solution.
df = df.dropna()

In [None]:
# Finally, check whether there are any duplicated rows:
df.duplicated().sum()

In [None]:
df.drop_duplicates(inplace=True)

### Looking for strange observations: 

Trick: First, let's try to convert all the variables to numeric and look at the description.


In [None]:
temp = df.copy()
for cols in temp.columns:
    temp[cols]=pd.to_numeric(temp[cols], errors='coerce') 
    
temp.info()


In [None]:
temp.describe().T

It seems that some of the numerical variables have been assigned nonsensical values. For example, the minimum of **sale price, year built, total units** is 0! Does this make sense?


Let's explore some of these variables in more details.
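
Before plotting, a quick count of zeros per column gives a sense of how widespread the problem is. The sketch below does this on a toy frame; the values are assumptions for illustration only.

```python
import pandas as pd

# Toy data with hypothetical values.
toy = pd.DataFrame({
    "SALE PRICE": [0, 450000, 0, 1200000],
    "YEAR BUILT": [1925, 0, 1988, 2005],
    "TOTAL UNITS": [1, 2, 0, 0],
})

# Compare every cell with 0, then count the True values per column.
zero_counts = (toy == 0).sum()
print(zero_counts)
```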

### Data visualization

### Sale price:
It is a good idea to start with the target variable, because cleaning it can end up dropping far more observations than you might expect!

In [None]:
df[(df['SALE PRICE']<10000) | (df['SALE PRICE']>10000000)]['SALE PRICE'].count() /len(df)

Ouch! 25% of the sale prices are either less than $10,000 or greater than $10,000,000. We have to drop all of these observations from the data.
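
The share computed above can be wrapped in a small helper so that different cut-offs are easy to compare. This is a sketch only: `share_outside` is a hypothetical helper name and the toy prices are made up.

```python
import pandas as pd

def share_outside(prices, low, high):
    """Fraction of prices strictly below `low` or strictly above `high`."""
    outside = (prices < low) | (prices > high)
    return outside.mean()

# Toy prices with hypothetical values.
toy_prices = pd.Series([0, 5000, 250000, 900000, 15000000, 600000, 10, 3200000])

share = share_outside(toy_prices, low=10_000, high=10_000_000)
print(f"Share outside the bounds: {share:.0%}")
```

Trying a few values of `low` and `high` this way shows how sensitive the number of dropped rows is to the chosen bounds.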

In [None]:
df2= df[(df['SALE PRICE']>10000) & (df['SALE PRICE']<10000000)].copy()
df2['SALE PRICE'].describe()

In [None]:
plt.figure(figsize=(12,6))
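# sns.distplot is deprecated in recent seaborn versions; histplot + rugplot give the same picture
sns.histplot(x=df2['SALE PRICE'], kde=True, bins=50, stat='density')
sns.rugplot(x=df2['SALE PRICE'])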
plt.show()

In [None]:
df2= df2[(df2['SALE PRICE']<4000000)]
plt.figure(figsize=(12,6))
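# sns.distplot is deprecated in recent seaborn versions; histplot + rugplot give the same picture
sns.histplot(x=df2['SALE PRICE'], kde=True, bins=50, stat='density')
sns.rugplot(x=df2['SALE PRICE'])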
plt.show()


### Year built:

In [None]:
df2[df2['YEAR BUILT']==0]['YEAR BUILT'].count()

In [None]:
df3=df2[df2['YEAR BUILT']!=0].copy()
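# YEAR BUILT is stored as a category, so cast it back to integers for the histogram
year_built = df3['YEAR BUILT'].astype(int)
sns.histplot(x=year_built, kde=True, bins=50, stat='density')
sns.rugplot(x=year_built)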
plt.show()

### Total units:

In [None]:
df3[df3['TOTAL UNITS']==0]['TOTAL UNITS'].count()

In [None]:
df4=df3[df3['TOTAL UNITS']!=0].copy()
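# sns.distplot is deprecated in recent seaborn versions; histplot + rugplot give the same picture
sns.histplot(x=df4['TOTAL UNITS'], kde=True, bins=50, stat='density')
sns.rugplot(x=df4['TOTAL UNITS'])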
plt.show()

In [None]:
df4.describe().T

## Finalizing the data set:

In [None]:
df4.info()

In [None]:
df4.drop(['BLOCK','LOT','ADDRESS'], axis=1, inplace=True)

In [None]:
# Map the borough digit codes to the borough names
df4['BOROUGH']= df4['BOROUGH'].map({1:'Manhattan', 2:'Bronx', 3: 'Brooklyn', 4:'Queens',5:'Staten Island'})
df4.head()

In [None]:
# some other visualizations: 
df_bar =df4[['BOROUGH', 'SALE PRICE']].groupby(by='BOROUGH').mean().sort_values(by='SALE PRICE', ascending=True).reset_index()
df_bar

In [None]:
plt.figure(figsize=(12,6))
sns.barplot(y = 'BOROUGH', x = 'SALE PRICE', data = df_bar )
plt.title('Average SALE PRICE in each BOROUGH')
plt.show()

In [None]:
plt.figure(figsize=(12,6))
sns.boxplot(y = 'BOROUGH', x = 'SALE PRICE', data = df4 )
plt.title('Box plots of SALE PRICE in each BOROUGH')
plt.show()

In [None]:
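# Count the sales in each month (groupby already returns the months in order)
df_bar = df4[['sale_month', 'SALE PRICE']].groupby(by='sale_month').count().sort_index().reset_index()
# Rename through the DataFrame API instead of mutating columns.values in place
df_bar = df_bar.rename(columns={'SALE PRICE': 'Sales_count'})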
df_bar

In [None]:
plt.figure(figsize=(12,6))
sns.barplot(y = 'sale_month', x = 'Sales_count', data = df_bar )
plt.title('Number of sales in each month')
plt.show()