# `NYC Property Sales` Dataset Exploratory Data Analysis


## Important information about the chosen dataset:

#### The dataset is from Kaggle
##### Dataset context:
This dataset is a record of every building or building unit (apartment, etc.) sold in the New York City property market over a 12-month period.
##### Dataset content:
This dataset contains the location, address, type, sale price, and sale date of building units sold. A reference on the trickier fields:

- BOROUGH: A digit code for the borough the property is located in; in order these are Manhattan (1), Bronx (2), Brooklyn (3), Queens (4), and Staten Island (5).
- BLOCK; LOT: The combination of borough, block, and lot forms a unique key for property in New York City. Commonly called a BBL.
- BUILDING CLASS AT PRESENT and BUILDING CLASS AT TIME OF SALE: The type of building at various points in time. See the glossary linked to below.

For further reference on individual fields, see the Glossary of Terms. For the building classification codes, see the Building Classifications Glossary.

Note that because this is a financial transaction dataset, there are some points that need to be kept in mind:

- Many sales occur with a nonsensically small dollar amount, most commonly $0. These sales are actually transfers of deeds between parties: for example, parents transferring ownership of their home to a child after moving out for retirement.
- This dataset uses the financial definition of a building/building unit, for tax purposes. If a single entity owns the building in question, a sale covers the value of the entire building. If a building is owned piecemeal by its residents (a condominium), a sale refers to a single apartment (or group of apartments) owned by some individual.

<ul>
    <li>Dataset source: <a href="https://www.kaggle.com/new-york-city/nyc-property-sales">nyc_property_sales</a> </li>
    <li>Glossary of terms for the dataset (PDF): <a href="https://www1.nyc.gov/assets/finance/downloads/pdf/07pdf/glossary_rsf071607.pdf">Glossary</a> </li>
</ul>
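The BBL key described above can be sketched in a few lines. This is a minimal illustration on made-up rows, not part of the original analysis; it assumes the common 10-digit layout (1 borough digit, 5-digit block, 4-digit lot), and the `sample` frame and the `BBL` column name are introduced here only for the example.

```python
import pandas as pd

# Made-up rows standing in for the real file; the column names match the dataset.
sample = pd.DataFrame({
    "BOROUGH": [1, 3, 4],
    "BLOCK": [392, 5495, 16],
    "LOT": [6, 801, 1001],
})

# Assumed layout: 1 borough digit + zero-padded 5-digit block + zero-padded 4-digit lot.
sample["BBL"] = (
    sample["BOROUGH"].astype(str)
    + sample["BLOCK"].astype(str).str.zfill(5)
    + sample["LOT"].astype(str).str.zfill(4)
)

print(sample["BBL"].tolist())
```

A combined key like this makes it easy to check whether the same property appears more than once in the sales records.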

### Importing the required libraries for analysis

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

### Load the nyc-rolling-sales dataset

In [None]:
df = pd.read_csv('../input/nyc-property-sales/nyc-rolling-sales.csv')

In [None]:
df.info()

In [None]:
# Drop the 'Unnamed: 0' column, which only holds a row index from the CSV export
df.drop(columns='Unnamed: 0', inplace=True)

In [None]:
df.head()

In [None]:
df['NEIGHBORHOOD'].nunique()

In [None]:
df['BOROUGH'].nunique()

> BOROUGH is a digit code for the borough the property is located in: Manhattan (1), Bronx (2), Brooklyn (3), Queens (4), and Staten Island (5), so it has 5 distinct values.

> NEIGHBORHOOD has 254 distinct values.

## Some questions to lead the analysis:
#### 1. Do borough and neighborhood, which represent location, affect NYC property sales, and how does each affect them?
#### 2. What is the mean sale price in each borough?
#### 3. Does the size (area in square feet) affect the sales?
#### 4. Does the property usage or classification influence sales?
#### 5. Which areas should we focus on to increase our sales?
#### 6. How can we describe property transactions in each borough?




---
## Univariate Analysis:



#### Some data cleaning:
- Convert the borough column to string and replace each code with its corresponding borough name
- Convert sale price to float
- Convert sale date to a datetime object
- Add sale month and sale year to the dataframe


In [None]:
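# Map the numeric borough codes to borough names (assign back instead of chained inplace)
df['BOROUGH'] = df['BOROUGH'].astype(str)
df['BOROUGH'] = df['BOROUGH'].replace({'1':'Manhattan','2':'Bronx','3':'Brooklyn','4':'Queens','5':'Staten Island'})
# Prices recorded as ' -  ' are treated as 0 before casting to float
df['SALE PRICE'] = df['SALE PRICE'].replace({' -  ':'0'})
df['SALE PRICE'] = df['SALE PRICE'].astype(float)
# Parse the sale date, then derive the month name and the year
df['SALE DATE'] = pd.to_datetime(df['SALE DATE'])
df['sale_month'] = df['SALE DATE'].dt.month_name()
df['sale_year'] = df['SALE DATE'].dt.year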
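As an alternative to replacing the `' -  '` placeholder with `'0'`, the price column could keep those rows as missing values so that they are not confused with genuine $0 transfers. The sketch below shows that idea on made-up rows; the `raw` frame is hypothetical and this is not the approach the notebook itself takes.

```python
import pandas as pd

# Made-up rows that mimic the raw string columns of the dataset.
raw = pd.DataFrame({
    "SALE PRICE": ["250000", " -  ", "0", "1200000"],
    "SALE DATE": ["2017-05-03 00:00:00", "2016-12-14 00:00:00",
                  "2017-02-01 00:00:00", "2016-09-30 00:00:00"],
})

# errors="coerce" turns anything non-numeric (such as ' -  ') into NaN.
raw["SALE PRICE"] = pd.to_numeric(raw["SALE PRICE"], errors="coerce")

# Same date handling as in the cleaning cell above.
raw["SALE DATE"] = pd.to_datetime(raw["SALE DATE"])
raw["sale_month"] = raw["SALE DATE"].dt.month_name()
raw["sale_year"] = raw["SALE DATE"].dt.year

print(raw)
```

With this variant, aggregations such as `mean()` skip the missing prices by default, while true zero-price rows stay in the data.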


In [None]:
df.info()

In [None]:
df.head()

#### Categorical variables:

In [None]:
df['BOROUGH'].value_counts().plot.bar()
plt.title("Number of sale records per borough");


In [None]:
df['SALE PRICE'].groupby(df['BOROUGH']).mean().plot.bar()
plt.title("Average sale price per borough");


> Queens has the most sale records, but Manhattan has the largest average sale price, so we need more information to explain why.
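Since many records have a price of 0, the per-borough mean is pulled down by transfers and pulled up by a few very large sales. A minimal sketch of a count/mean/median summary on priced sales only, using made-up numbers (the `sales` frame here is hypothetical):

```python
import pandas as pd

# Made-up rows; column names follow the cleaned dataframe.
sales = pd.DataFrame({
    "BOROUGH": ["Manhattan", "Manhattan", "Manhattan", "Manhattan",
                "Queens", "Queens", "Queens"],
    "SALE PRICE": [0.0, 1_000_000.0, 2_000_000.0, 9_000_000.0,
                   0.0, 400_000.0, 600_000.0],
})

# Keep only rows with a real price, then summarise each borough.
priced = sales[sales["SALE PRICE"] > 0]
summary = priced.groupby("BOROUGH")["SALE PRICE"].agg(["count", "mean", "median"])

print(summary)
```

Comparing the mean with the median gives a quick sense of how much a handful of expensive sales drives a borough's average.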

<img src="https://upload.wikimedia.org/wikipedia/commons/thumb/3/34/5_Boroughs_Labels_New_York_City_Map.svg/300px-5_Boroughs_Labels_New_York_City_Map.svg.png"/>

<table class="wikitable sortable jquery-tablesorter" border="1" style="float:center; text-align:right; font-size:85%; margin:1em;">

<thead><tr>
<th colspan="9" style="background-color:tan;"><div style="text-align:center; position:relative; white-space:nowrap;">New York City's five boroughs</div>
</th></tr><tr style="background:#dedebb;">
<th colspan="2" style="background: transparent">Jurisdiction
</th>
<th style="background: transparent">Population
</th>
<th style="background: transparent">GDP
</th>
<th colspan="2" style="background: transparent">Land area
</th>
<th colspan="2" style="background: transparent">Density
</th></tr><tr style="background:#efefcc; font-style: italic">
<th style="font-weight:normal; background: transparent" class="headerSort" tabindex="0" role="columnheader button" title="Sort ascending">Borough</th>
<th style="font-weight:normal; background: transparent" class="headerSort" tabindex="0" role="columnheader button" title="Sort ascending">County</th>
<th style="font-weight: normal; background: transparent" class="headerSort" tabindex="0" role="columnheader button" title="Sort ascending">Estimate<br> (2019)</th>
<th style="font-weight:normal; background: transparent" class="headerSort" tabindex="0" role="columnheader button" title="Sort ascending">billions<br>(2012 US$)</th>
<th style="font-weight: normal; background: transparent" class="headerSort" tabindex="0" role="columnheader button" title="Sort ascending">square<br> miles</th>
<th style="font-weight: normal; background: transparent" class="headerSort" tabindex="0" role="columnheader button" title="Sort ascending">square<br>km</th>
<th style="font-weight: normal; background: transparent" class="headerSort" tabindex="0" role="columnheader button" title="Sort ascending">persons /<br>mi<sup>2</sup></th>
<th style="font-weight: normal; background: transparent" class="headerSort" tabindex="0" role="columnheader button" title="Sort ascending">persons /<br>km<sup>2</sup>
</th></tr></thead><tbody>


<tr style="background:#f9f9f9;">
<td bgcolor="ee5555"><div class="center" style="width:auto; margin-left:auto; margin-right:auto;"><b><a href="/wiki/The_Bronx" title="The Bronx">The Bronx</a></b></div>
</td>
<td bgcolor="f9f9f9"><div class="center" style="width:auto; margin-left:auto; margin-right:auto;">
  Bronx</div>
</td>
<td>1,418,207
</td>
<td>42.695
</td>
<td>42.10
</td>
<td>109.04
</td>
<td>33,867
</td>
<td>13,006
</td></tr>
<tr style="background:#f9f9f9;">
<td bgcolor="ffee77"><div class="center" style="width:auto; margin-left:auto; margin-right:auto;"><b><a href="/wiki/Brooklyn" title="Brooklyn">Brooklyn</a></b></div>
</td>
<td bgcolor="f9f9f9"><div class="center" style="width:auto; margin-left:auto; margin-right:auto;">
  Kings</div>
</td>
<td>2,559,903
</td>
<td>91.559
</td>
<td>70.82
</td>
<td>183.42
</td>
<td>36,147
</td>
<td>13,957
</td></tr>
<tr style="background:#f9f9f9;">
<td bgcolor="lightgreen"><div class="center" style="width:auto; margin-left:auto; margin-right:auto;"><b><a href="/wiki/Manhattan" title="Manhattan">Manhattan</a></b></div>
</td>
<td bgcolor="f9f9f9"><div class="center" style="width:auto; margin-left:auto; margin-right:auto;">
  New York</div>
</td>
<td>1,628,706
</td>
<td>600.244
</td>
<td>22.83
</td>
<td>59.13
</td>
<td>71,341
</td>
<td>27,544
</td></tr>
<tr style="background:#f9f9f9;">
<td bgcolor="orange"><div class="center" style="width:auto; margin-left:auto; margin-right:auto;"><b><a href="/wiki/Queens" title="Queens">Queens</a></b></div>
</td>
<td bgcolor="f9f9f9"><div class="center" style="width:auto; margin-left:auto; margin-right:auto;">
   Queens</div>
</td>
<td>2,253,858
</td>
<td>93.310
</td>
<td>108.53
</td>
<td>281.09
</td>
<td>20,767
</td>
<td>8,018
</td></tr>
<tr style="background:#f9f9f9;">
<td bgcolor="plum"><div class="center" style="width:auto; margin-left:auto; margin-right:auto;"><b><a href="/wiki/Staten_Island" title="Staten Island">Staten Island</a></b></div>
</td>
<td bgcolor="f9f9f9"><div class="center" style="width:auto; margin-left:auto; margin-right:auto;">
   Richmond</div>
</td>
<td>476,143
</td>
<td>14.514
</td>
<td>58.37
</td>
<td>151.18
</td>
<td>8,157
</td>
<td>3,150
</td></tr>
<tr style="background:#ddd;" class="sortbottom">
<td colspan="2"><div class="center" style="width:auto; margin-left:auto; margin-right:auto;"><b><a href="/wiki/New_York_City" title="New York City">City of New York</a></b></div></td>
<td><b>8,336,817</b></td>
<td><b>842.343</b></td>
<td><b>302.64</b></td>
<td><b>783.83</b></td>
<td><b>27,547</b></td>
<td><b>10,636</b>
</td></tr>
<tr style="background:#ccc;" class="sortbottom">
<td colspan="2"><div class="center" style="width:auto; margin-left:auto; margin-right:auto;"><a href="/wiki/New_York_(state)" title="New York (state)">State of New York</a></div></td>
<td>19,453,561</td>
<td>1,731.910</td>
<td>47,126.40</td>
<td>122,056.82</td>
<td>412</td>
<td>159
</td></tr>
<tr class="sortbottom">
<td colspan="9"><div class="center" style="width:auto; margin-left:auto; margin-right:auto;"><i>Sources:<sup id="cite_ref-3" class="reference"><a href="#cite_note-3">[3]</a></sup><sup id="cite_ref-4" class="reference"><a href="#cite_note-4">[4]</a></sup><sup id="cite_ref-5" class="reference"><a href="#cite_note-5">[5]</a></sup> and see individual borough articles</i></div>
</td></tr></tbody><tfoot></tfoot></table>

> ##### According to this map and table from <a href="https://en.wikipedia.org/wiki/Boroughs_of_New_York_City">Boroughs_of_New_York_City</a>, Manhattan has a favorable geographical location, and it is the most densely populated and geographically smallest of the five boroughs of New York City

> ##### Manhattan's GDP is the largest of the five, so its properties can be expected to be more expensive than the others. This is consistent with the data: Queens has the largest land area but not the most expensive properties

In [None]:
df[(df['BOROUGH'] =='Queens') & (df['SALE PRICE'] != 0)].count()

In [None]:
df[df['SALE PRICE'] == 0]['BOROUGH'].value_counts().sort_values().plot.bar(color=['r', 'b', 'k', 'm', 'c'])
plt.title("Number of zero-price transfers per borough");


> ##### Since a sale price of 0 indicates a transfer of ownership rather than a market sale, this graph shows the number of such transfers in each borough
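Raw counts of zero-price rows depend on how many records each borough has, so a share per borough can be easier to compare. The following is a small sketch on made-up rows; the `is_transfer` column is a name introduced here for illustration. Note that in this notebook the `' -  '` placeholder was also mapped to 0, so zero-price rows may include missing prices as well as transfers.

```python
import pandas as pd

# Made-up rows; column names follow the cleaned dataframe.
sales = pd.DataFrame({
    "BOROUGH": ["Queens", "Queens", "Queens", "Bronx", "Bronx"],
    "SALE PRICE": [0.0, 0.0, 500_000.0, 0.0, 300_000.0],
})

# Flag zero-price rows, then take the mean of the flag to get a share.
sales["is_transfer"] = sales["SALE PRICE"] == 0
transfer_share = sales.groupby("BOROUGH")["is_transfer"].mean()

print(transfer_share)
```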

In [None]:
sns.histplot(df[df['YEAR BUILT']!=0]['YEAR BUILT'],bins=100)
plt.ylabel("Number of properties")
plt.xlabel("Year built");


In [None]:
df[df['YEAR BUILT']!=0]['YEAR BUILT'].value_counts()

> This means that most of the properties that were sold or transferred were built between 1910 and 1950, with the peak in 1920, perhaps because buildings from that period have good structures.
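Grouping the build years into decades is one way to check a range such as 1910 to 1950 without reading individual year counts. A minimal sketch on made-up values (the `years` series is hypothetical):

```python
import pandas as pd

# Made-up build years; 0 marks an unknown year, as in the dataset.
years = pd.Series([1920, 1925, 1931, 1920, 0, 2005, 1948], name="YEAR BUILT")

# Drop unknown years, floor each year to its decade, and count per decade.
built = years[years != 0]
decade = (built // 10) * 10
per_decade = decade.value_counts().sort_index()

print(per_decade)
```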

In [None]:
plt.figure(figsize=(20,20))

df['SALE PRICE'].groupby(df['BUILDING CLASS CATEGORY']).mean().sort_values().plot.barh()
# In a horizontal bar chart the x-axis carries the values
plt.xlabel("Average sale price")
plt.ylabel("Building class category")

plt.title('Average sale price per building class category');


> Luxury hotels and office buildings appear to have the highest average sale price of all building categories.

> Utility properties and Warehouses/Factory/Indus have the lowest average sale price.

In [None]:
plt.figure(figsize=(10,6))
# tail(10) of the ascending sort keeps the 10 highest averages, whatever the number of neighborhoods
df['SALE PRICE'].groupby(df['NEIGHBORHOOD']).mean().sort_values().tail(10).plot.barh()
plt.title('Top 10 average sale price per neighborhood')
plt.xlabel("Sale Price");


> Bloomfield and Midtown CBD have the highest average sale price among all neighborhoods
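A neighborhood with only one or two very large sales can top a ranking by mean. One possible check, sketched here on made-up data, is to require a minimum number of sales before ranking; the neighborhood labels and the `min_sales` threshold are hypothetical and chosen only for the example.

```python
import pandas as pd

# Made-up rows with placeholder neighborhood names.
sales = pd.DataFrame({
    "NEIGHBORHOOD": ["A", "A", "A", "B", "C", "C"],
    "SALE PRICE": [500_000.0, 700_000.0, 600_000.0,
                   9_000_000.0, 800_000.0, 1_000_000.0],
})

min_sales = 2  # hypothetical threshold, tune to the real data

# Mean price and number of sales per neighborhood.
stats = sales.groupby("NEIGHBORHOOD")["SALE PRICE"].agg(["mean", "count"])

# Rank only the neighborhoods with enough sales behind their average.
top = stats[stats["count"] >= min_sales].nlargest(10, "mean")

print(top)
```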

---
## Bivariate Analysis:

In [None]:
plt.figure(figsize=(15,6))

sns.lineplot(x='sale_month',y='SALE PRICE',data=df)
plt.title('Sales Trend per month from 2016 to 2017')
plt.ylabel('Sale Price')
plt.xlabel('Month')
plt.show();

#### This plot aggregates all sales that fall in the same month and shows their mean with a 95% confidence interval
> In May, the mean sale price is the highest and the 95% confidence interval is the widest. This points to a few very expensive sales in May across 2016 and 2017. A wide interval reflects high variability in price, so it does not by itself show that May had the largest number of sales or transfers.
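To separate "how many sales" from "how expensive on average", the monthly count and mean can be tabulated side by side. The sketch below uses made-up rows and also puts the month names in calendar order, since plain strings are otherwise ordered by appearance; the `month_order` list is introduced here for the example.

```python
import pandas as pd

# Made-up rows; column names follow the cleaned dataframe.
sales = pd.DataFrame({
    "sale_month": ["May", "January", "May", "March", "January", "May"],
    "SALE PRICE": [2_000_000.0, 300_000.0, 400_000.0,
                   500_000.0, 700_000.0, 600_000.0],
})

month_order = ["January", "February", "March", "April", "May", "June",
               "July", "August", "September", "October", "November", "December"]

# An ordered categorical makes groupby and plots follow the calendar.
sales["sale_month"] = pd.Categorical(
    sales["sale_month"], categories=month_order, ordered=True
)

# observed=True keeps only the months that actually occur in the data.
monthly = sales.groupby("sale_month", observed=True)["SALE PRICE"].agg(["count", "mean"])

print(monthly)
```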


In [None]:
plt.figure(figsize=(10,6))

sns.lineplot(x='BOROUGH',y='SALE PRICE',data=df)
plt.title('Sales Trend per BOROUGH')
plt.ylabel('Sale Price')
plt.show();

> As we saw before, Manhattan has the highest average sale price

> We need to work on marketing for properties in Queens and Staten Island because they have the lowest average sale prices of all

In [None]:
plt.figure(figsize=(20,6))
sns.barplot(x='BOROUGH', y='SALE PRICE', hue='sale_month', data=df, palette='rainbow');
plt.title('Sales Trend per BOROUGH by month')
plt.ylabel('Sale Price')
plt.show()

> ##### May and December have the highest average sale price in Manhattan across 2016 and 2017
> ##### February has the lowest average sale price in Manhattan across 2016 and 2017

> ##### December has the lowest average sale price in the Bronx across 2016 and 2017

In [None]:
plt.figure(figsize=(10,6))

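# numeric_only=True skips text columns, which recent pandas versions no longer drop silently
sns.heatmap(df.corr(numeric_only=True));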


> There is no strong correlation between the numerical variables, except among residential units, commercial units, and total units. That is reasonable but not useful.

> It is reasonable because residential and commercial units are summed into total units. Residential units have a correlation of about 0.6 with the total and commercial units about 0.4, which suggests that residential units make up the larger part of the total.
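The link between the unit columns can be reproduced on a toy frame where the total is built as the sum of the other two. This is only a sketch with made-up values; `numeric_only=True` assumes pandas 1.5 or later.

```python
import pandas as pd

# Made-up rows; the text column is there to show that it is skipped.
units = pd.DataFrame({
    "NEIGHBORHOOD": ["A", "B", "C", "D", "E"],
    "RESIDENTIAL UNITS": [1, 2, 6, 0, 10],
    "COMMERCIAL UNITS": [0, 1, 0, 3, 1],
})
units["TOTAL UNITS"] = units["RESIDENTIAL UNITS"] + units["COMMERCIAL UNITS"]

# Pearson correlation over the numeric columns only.
corr = units.corr(numeric_only=True)

# How strongly each column tracks the total.
print(corr["TOTAL UNITS"].sort_values(ascending=False))
```

In this toy data the residential column varies much more than the commercial one, so it dominates the total and correlates with it more strongly.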


In [None]:
df['RESIDENTIAL UNITS'].sum(),df['COMMERCIAL UNITS'].sum(),df['TOTAL UNITS'].sum()

In [None]:
plt.figure(figsize=(12,8))
plt.subplot(1, 2, 1)

sns.barplot(x="BOROUGH", y="RESIDENTIAL UNITS", data=df, estimator=sum, ci=None)
plt.ylabel('Residential Units')

plt.subplot(1, 2, 2)

sns.barplot(x="BOROUGH", y="COMMERCIAL UNITS", data=df, estimator=sum, ci=None)
plt.ylabel('Commercial Units');



> ##### Residential and commercial units differ hugely in scale: far more residential units than commercial units were sold

## `Year Over Year Analysis`:

In [None]:
plt.figure(figsize=(20,6))

sns.lineplot(x="BOROUGH",y="SALE PRICE",hue='sale_year',data=df, palette='rainbow',ci=None)
plt.title('Sales Trend per BOROUGH by year');
plt.ylabel('Sale Price');



> ##### This compares the average sale price of each borough across the two years

> ##### Manhattan and Queens had higher average sale prices in 2016 than in 2017, while the Bronx and Staten Island had higher average sale prices in 2017 than in 2016
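The same year-over-year comparison can be read off a small table instead of a chart. A minimal sketch on made-up rows, using a pivot table of mean prices and a `change` column introduced here for illustration:

```python
import pandas as pd

# Made-up rows; column names follow the cleaned dataframe.
sales = pd.DataFrame({
    "BOROUGH": ["Manhattan", "Manhattan", "Manhattan", "Bronx", "Bronx", "Bronx"],
    "sale_year": [2016, 2016, 2017, 2016, 2017, 2017],
    "SALE PRICE": [3_000_000.0, 5_000_000.0, 2_000_000.0,
                   400_000.0, 500_000.0, 700_000.0],
})

# One row per borough, one column per year, mean price in each cell.
yearly = sales.pivot_table(
    index="BOROUGH", columns="sale_year", values="SALE PRICE", aggfunc="mean"
)

# Positive values mean the average price rose from 2016 to 2017.
yearly["change"] = yearly[2017] - yearly[2016]

print(yearly)
```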

In [None]:
plt.figure(figsize=(15,6))
plt.subplot(1, 2, 1)
plt.title('Sales Trend per Commercial Units by year')
sns.lineplot(x="COMMERCIAL UNITS",y="SALE PRICE",hue='sale_year',data=df, palette='rainbow')
plt.ylabel('Sale Price')
plt.xlabel('Commercial Units')

plt.subplot(1, 2, 2)
plt.title('Sales Trend per Residential Units by year')
sns.lineplot(x="RESIDENTIAL UNITS",y="SALE PRICE",hue='sale_year',data=df, palette='rainbow')
plt.ylabel('Sale Price')
plt.xlabel('Residential Units');


> ##### This shows that residential units brought higher sale prices in 2016 than in 2017
> ##### It also suggests that most of the sales revenue comes from residential units rather than commercial units


In [None]:
plt.figure(figsize=(12,8))
plt.title('Sales Trend per TAX CLASS by year')

sns.barplot(x="sale_year",y="SALE PRICE",hue='TAX CLASS AT PRESENT',data=df[df['TAX CLASS AT PRESENT']!= ' '], palette='rainbow');
plt.xlabel('Sale Year')
plt.ylabel('Sale Price');


> Tax class 4, which includes all properties not covered by classes 1, 2, and 3 (such as offices, factories, warehouses, and garage buildings), has the highest average sale price in both years

In [None]:
plt.subplots(figsize=(12,8))
sns.barplot(x='sale_year', y='SALE PRICE', hue='BOROUGH', data=df, palette='rainbow', ci=None)
plt.title('Sales per Borough from 2016-2017')
plt.ylabel('Sale Price')
plt.xlabel('Sale Year');


> We already know that Manhattan has the highest sale prices. This chart also shows how the five boroughs rank by average sale price in 2016 and 2017, in descending order: Manhattan, Brooklyn, Bronx, Queens, Staten Island
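The bar chart above plots the mean price per borough, since that is the default estimator of `sns.barplot`. A borough's share of the total sales value needs sums instead. The sketch below shows one way to compute that share per year on made-up rows; the `totals` and `share` names are introduced here for the example.

```python
import pandas as pd

# Made-up rows; column names follow the cleaned dataframe.
sales = pd.DataFrame({
    "BOROUGH": ["Manhattan", "Manhattan", "Queens", "Queens", "Queens", "Queens"],
    "sale_year": [2016, 2017, 2016, 2016, 2017, 2017],
    "SALE PRICE": [6_000_000.0, 8_000_000.0, 1_000_000.0,
                   1_000_000.0, 1_000_000.0, 1_000_000.0],
})

# Total sales value per year (rows) and borough (columns).
totals = sales.groupby(["sale_year", "BOROUGH"])["SALE PRICE"].sum().unstack("BOROUGH")

# Divide each row by its yearly total to get shares that sum to 1.
share = totals.div(totals.sum(axis=1), axis=0)

print(share)
```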

---
## References:
 - https://pandas.pydata.org/docs/reference/
 - https://seaborn.pydata.org/generated/seaborn.lineplot.html
 - https://seaborn.pydata.org/api.html