# 01 Generating Statistics

In this activity, you will work with the **Boston Housing Price dataset**. The Boston house-price data has been used in many machine learning papers that address regression problems. You will read the data from a CSV file into a Pandas DataFrame and do some basic data wrangling with it.

The attributes of this dataset are listed below for your reference. You may need to refer to them while answering the questions in this activity.

* **CRIM**: per capita crime rate by town
* **ZN**: proportion of residential land zoned for lots over 25,000 sq.ft.
* **INDUS**: proportion of non-retail business acres per town
* **CHAS**: Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
* **NOX**: nitric oxides concentration (parts per 10 million)
* **RM**: average number of rooms per dwelling
* **AGE**: proportion of owner-occupied units built prior to 1940
* **DIS**: weighted distances to five Boston employment centres
* **RAD**: index of accessibility to radial highways
* **TAX**: full-value property-tax rate per 10,000 dollars
* **PTRATIO**: pupil-teacher ratio by town
* **B**: 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town
* **LSTAT**: % of lower status of the population
* **PRICE**: Median value of owner-occupied homes in $1000's

In [0]:
"CRIM","ZN","INDUS","CHAS","NOX","RM","AGE","DIS","RAD","TAX","PTRATIO","B","LSTAT","PRICE"

### Load necessary libraries

In [0]:
# Write your code here
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

### Read in the Boston housing data set (given as a .csv file) from the local directory

In [0]:
# Hint: The Pandas function for reading a CSV file is 'read_csv'.
# Don't forget that all functions in Pandas can be accessed by syntax like pd.{function_name}
# write your code here
df=pd.read_csv("Boston_housing.csv")

### Check first 10 records

In [0]:
# Write your code here
df.head(10)

### In total, how many records are there?

In [0]:
# Write your code here to answer the question above
df.shape
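`df.shape` returns a `(rows, columns)` tuple, so the record count is its first element. A minimal sketch on a made-up three-row frame (the name `toy` and its values are illustrative assumptions, not the Boston data):

```python
import pandas as pd

# Hypothetical toy data standing in for the real CSV
toy = pd.DataFrame({"RM": [6.5, 6.4, 7.1], "PRICE": [24.0, 21.6, 34.7]})

n_rows, n_cols = toy.shape   # shape is a (rows, columns) tuple
print(n_rows)                # number of records
print(len(toy))              # len() of a DataFrame also counts rows
```

Either `df.shape[0]` or `len(df)` should give the same answer on the real DataFrame.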

### Create a smaller DataFrame with columns which do not include 'CHAS', 'NOX', 'B', and 'LSTAT'

In [0]:
# Write your code here
# Alternative: df1 = df.drop(columns=['CHAS', 'NOX', 'B', 'LSTAT'])
df1=df[['CRIM','ZN','INDUS','RM','AGE','DIS','RAD','TAX','PTRATIO','PRICE']]
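Instead of listing every column to keep, the unwanted columns can be removed with `DataFrame.drop(columns=...)`. The sketch below uses a hypothetical one-row frame (`toy`, filled with a placeholder value) that only borrows the dataset's column names:

```python
import pandas as pd

# Hypothetical one-row frame with the same column names as the dataset
toy = pd.DataFrame([[0.1] * 14], columns=[
    "CRIM", "ZN", "INDUS", "CHAS", "NOX", "RM", "AGE",
    "DIS", "RAD", "TAX", "PTRATIO", "B", "LSTAT", "PRICE",
])

# drop() returns a new DataFrame; the original is left unchanged
smaller = toy.drop(columns=["CHAS", "NOX", "B", "LSTAT"])
print(list(smaller.columns))
```

Dropping is usually shorter when only a few columns have to go, while selecting with `df[[...]]` makes the kept columns explicit.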

### Check the last 7 records of the new DataFrame you just created

In [0]:
# Write your code here
df1.tail(7)

### Can you plot histograms of all the variables (columns) in the new DataFrame?
You can of course plot them one by one, but try to write a short piece of code that plots all of them at once.
<br>***Hint***: 'For loop'!
<br>***Bonus problem***: Can you also give each plot a unique title, i.e. the name of the variable it shows?

In [0]:
10 **0.5

In [0]:
np.log10(2)

In [0]:
plt.hist(df1["CRIM"],bins=250)
plt.show()

In [0]:
# log10(100) = 2, because 10**2 = 100
# log10(0.01) = -2, because 10**-2 = 0.01
# Other base-10 log scales:
# pH = -log10(H+ concentration)
# Richter scale
# Decibels
df_onecurve = df1.query('CRIM < 1.16')
plt.hist(np.log10(df_onecurve["CRIM"]),bins=50)
plt.show()
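The comments in the cell above describe `log10` as the exponent of 10. As a small illustration with made-up numbers (not taken from the dataset), values spanning four orders of magnitude map onto evenly spaced logs, which is why a log transform spreads out a heavily skewed column such as `CRIM`:

```python
import numpy as np

# Illustrative values only, each 10 times larger than the previous one
values = np.array([0.01, 0.1, 1.0, 10.0, 100.0])

# log10 returns the exponent e for which 10**e equals the value
logs = np.log10(values)
print(logs)
```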

In [0]:
df_twocurve = df1.query('CRIM >= 1.16')
plt.hist(np.log10(df_twocurve["CRIM"]),bins=50)
plt.show()

In [0]:
plt.hist(df1["AGE"],bins=20)
plt.show()

In [0]:
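# Note: np.exp is not the inverse of np.log10, so this plots AGE ** (1 / ln 10), roughly AGE ** 0.434
plt.hist(np.exp(np.log10(df1["AGE"])),bins=20)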
plt.show()
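The cell above combines `np.exp` (base e) with `np.log10` (base 10). A quick check on illustrative numbers, assuming only NumPy, shows that the two do not cancel and that the result is a power transform instead:

```python
import numpy as np

x = np.array([10.0, 50.0, 100.0])   # illustrative values only

round_trip = 10.0 ** np.log10(x)    # a base-10 power undoes log10
mixed = np.exp(np.log10(x))         # exp is base e, so it does not undo log10
as_power = x ** (1.0 / np.log(10.0))  # the same values written as a power of x

print(round_trip)
print(mixed)
print(as_power)
```

So the second `AGE` histogram shows a compressed version of `AGE`, not the original values.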

In [0]:
for c in df1.columns:
    plt.title("Plot of "+c,fontsize=15)
    plt.hist(df1[c],bins=20)
    plt.show()

### Crime rate could be an indicator of house price (people don't want to live in high-crime areas). Create a scatter plot of crime rate vs. Price.

In [0]:
# Write your code here
plt.scatter(df1['CRIM'],df1['PRICE'])
plt.show()

### We can understand the relationship better if we plot _log10(crime)_ vs. Price. Create that plot and make it nice: give it a proper title and x- and y-axis labels, pick a color of your choice for the data points, etc.
***Hint***: Try the `np.log10` function

In [0]:
# Write your code here
plt.scatter(np.log10(df1['CRIM']),df1['PRICE'],c='red')
plt.title("Crime rate (Log) vs. Price plot", fontsize=18)
plt.xlabel("Log of Crime rate",fontsize=15)
plt.ylabel("Price",fontsize=15)
plt.grid(True)
plt.show()

### Can you calculate the mean rooms per dwelling?

In [0]:
# Write your code here
df1['RM'].mean()

### Can you calculate median Age?

In [0]:
# Write your code here
df1['AGE'].median()

### Can you calculate average (mean) distances to five Boston employment centres?

In [0]:
# Write your code here
df1['DIS'].mean()
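The three summary questions above can also be answered in one call. A minimal sketch using `DataFrame.agg` on a hypothetical three-row frame (the values are invented for illustration):

```python
import pandas as pd

# Hypothetical toy values, not the real dataset
toy = pd.DataFrame({
    "RM": [6.0, 7.0, 8.0],
    "AGE": [10.0, 50.0, 90.0],
    "DIS": [2.0, 4.0, 9.0],
})

# One row per statistic, one column per variable
summary = toy.agg(["mean", "median"])
print(summary)
```

On the real data, `df1[['RM', 'AGE', 'DIS']].agg(['mean', 'median'])` should produce the same kind of table.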

### Tricky question: Can you calculate the percentage of houses with low price (< $20,000)?

In [0]:
# Create a Pandas Series and compare it directly with 20 (PRICE is in $1000s)
# This works because a Pandas Series is built on a NumPy array, and you have seen how to filter NumPy arrays
low_price=df1['PRICE']<20
# This creates a Boolean Series of True/False values
print(low_price)
# True counts as 1 and False as 0, so the mean of this Series is the fraction of 1's.
# That fraction is the share of houses priced below $20,000.
# Multiply by 100 to convert it into a percentage
pcnt=low_price.mean()*100
print("\nPercentage of houses with <20,000 price is: ",pcnt)

In [0]:
len(df1[df1.PRICE<20]) / len(df1) * 100
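Both cells above rely on the same idea: a comparison produces a Boolean mask, and its mean is the fraction of `True` values. A small sketch with five invented prices (in $1000s, not from the dataset) makes the arithmetic easy to follow:

```python
import pandas as pd

# Hypothetical prices in $1000s
prices = pd.Series([15.0, 18.5, 20.0, 24.0, 31.2])

is_low = prices < 20            # Boolean Series; 20.0 itself is not "low"
count_low = is_low.sum()        # number of True values
fraction_low = is_low.mean()    # share of True values
pct_low = fraction_low * 100    # the same share as a percentage

print(count_low, pct_low)
```

Here two of the five prices are below 20, so the mask mean is 2/5 and the percentage is 40.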