# Regression analysis of Boston Housing dataset

## Import libraries

In [8]:
import pandas as pd
import pandas_profiling
import sklearn
from sklearn import datasets

## Import data

In [9]:
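# Note: load_boston was deprecated in scikit-learn 1.0 and removed in 1.2,
# so this cell needs scikit-learn < 1.2.
boston = datasets.load_boston()
pred = pd.DataFrame(boston.data, columns=boston.feature_names)
targ = pd.DataFrame(boston.target, columns=['MEDV'])
# Both frames share the same default index, so a column-wise concat gives the
# same result as the index merge without creating a 'key_0' helper column.
df = pd.concat([pred, targ], axis=1)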

## Explore data

In [10]:
df.head()

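| | CRIM | ZN | INDUS | CHAS | NOX | RM | AGE | DIS | RAD | TAX | PTRATIO | B | LSTAT | MEDV |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.00632 | 18.0 | 2.31 | 0.0 | 0.538 | 6.575 | 65.2 | 4.09 | 1.0 | 296.0 | 15.3 | 396.9 | 4.98 | 24.0 |
| 1 | 0.02731 | 0.0 | 7.07 | 0.0 | 0.469 | 6.421 | 78.9 | 4.9671 | 2.0 | 242.0 | 17.8 | 396.9 | 9.14 | 21.6 |
| 2 | 0.02729 | 0.0 | 7.07 | 0.0 | 0.469 | 7.185 | 61.1 | 4.9671 | 2.0 | 242.0 | 17.8 | 392.83 | 4.03 | 34.7 |
| 3 | 0.03237 | 0.0 | 2.18 | 0.0 | 0.458 | 6.998 | 45.8 | 6.0622 | 3.0 | 222.0 | 18.7 | 394.63 | 2.94 | 33.4 |
| 4 | 0.06905 | 0.0 | 2.18 | 0.0 | 0.458 | 7.147 | 54.2 | 6.0622 | 3.0 | 222.0 | 18.7 | 396.9 | 5.33 | 36.2 |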


Attribute information:

     - CRIM     (Per capita crime rate by town)   
                Since CRIM gauges the threat to well-being that households perceive in various neighborhoods of the 
                Boston metropolitan area, assuming that crime rates are generally proportional to people’s perceptions 
                of danger, it should have a negative effect on housing values.
     - ZN       (Proportion of residential land zoned for lots over 25,000 sq.ft.)
                Since such zoning restricts construction of small lot houses, we expect ZN to be positively related 
                to housing values. A positive coefficient may also arise because zoning proxies the exclusivity,
                social class, and outdoor amenities of a community.
     - INDUS    (Proportion of non-retail business acres per town)
                INDUS serves as a proxy for the externalities associated with industry noise, heavy traffic, 
                and unpleasant visual effects, and thus should affect housing values negatively.
     - CHAS     (Charles River dummy variable = 1 if tract bounds river; 0 otherwise)
                CHAS captures the amenities of a riverside location and thus the coefficient should be positive.
     - NOX      (Nitric oxides concentration parts per 10 million)
                Nitrogen oxide concentrations in pphm (annual average concentration in parts per hundred million)
     - RM       (Average number of rooms per dwelling)
                RM represents spaciousness and, in a certain sense, quantity of housing. It should be positively 
                related to housing value. The RM^2 form was found to provide a better fit than either the linear or 
                logarithmic forms.
     - AGE      (Proportion of owner-occupied units built prior to 1940)
                Age is generally related to structure quality.
     - DIS      (Weighted distances to five employment centres in Boston region)
                According to traditional theories of urban land rent gradients, housing values should be higher near
                employment centers. DIS is entered in logarithmic form; the expected sign is negative.
     - RAD      (Index of accessibility to radial highways)
                The highway access index was calculated on a town basis. Good road access variables are needed so that 
                auto pollution variables do not capture the locational advantages of roadways. RAD captures other sorts 
                of locational advantages besides nearness to workplace. It is entered in logarithmic form; the expected 
                sign is positive.
     - TAX      (Full value property tax rate per USD10k)
                Measures the cost of public services in each community. Nominal tax rates were corrected by local 
                assessment ratios to yield the full value tax rate for each town. Intra-town differences in the 
                assessment ratio were difficult to obtain and thus not used. The coefficient of this variable 
                should be negative.
     - PTRATIO  (Pupil-teacher ratio by town school district)
                Measures public sector benefits in each town. The relation of the education pupil-teacher ratio to 
                school quality is not entirely clear, although a low ratio should imply each student receives more 
                individual attention. We expect the sign on PTRATIO to be negative.
     - B        (Black proportion of population)
                1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town. At low to moderate levels of B, an 
                increase in B should have a negative influence on housing value if Blacks are regarded as undesirable
                neighbors by Whites. However, market discrimination means that housing values are higher at very high 
                levels of B. One expects, therefore, a parabolic relationship between proportion Black in a neighborhood 
                and housing values. 
     - LSTAT    (Proportion of population that is of lower status)
                Proportion of adults without some high school education and proportion of male workers classified as
                laborers. The logarithmic specification implies that socioeconomic status distinctions mean more in the
                upper brackets of society than in the lower classes.
     - MEDV     (Median value of owner-occupied homes in USD1000)
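
The B column is easier to read with the formula from the attribute notes written out. The following is a minimal sketch, assuming Bk is a proportion between 0 and 1; the helper name `b_transform` is hypothetical and not part of the dataset or of any library.

```python
import math

def b_transform(bk):
    """Return B = 1000 * (Bk - 0.63)**2 for a proportion Bk in [0, 1]."""
    return 1000 * (bk - 0.63) ** 2

# Bk = 0 reproduces the B value seen in several rows of df.head().
print(round(b_transform(0.0), 2))  # 396.9

# Two different proportions, equally far from 0.63, map to the same B,
# so B alone cannot be inverted back to a unique Bk.
low, high = b_transform(0.53), b_transform(0.73)
print(math.isclose(low, high))  # True
```

Because the transform is symmetric around 0.63, a given B value is consistent with two different proportions, which is worth keeping in mind when interpreting this column.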


Source: https://www.researchgate.net/publication/4974606_Hedonic_housing_prices_and_the_demand_for_clean_air
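
The attribute notes mention several functional forms (RM squared; logarithms of DIS, RAD and LSTAT). A rough sketch of how those transformed columns could be built with pandas is shown here. The sample rows are copied from `df.head()` above; the function name and the new column names (`RM_SQ`, `LOG_DIS`, `LOG_RAD`, `LOG_LSTAT`) are assumptions made for illustration only.

```python
import numpy as np
import pandas as pd

# First three rows of df.head(), restricted to the columns used here.
sample = pd.DataFrame({
    "RM": [6.575, 6.421, 7.185],
    "DIS": [4.09, 4.9671, 4.9671],
    "RAD": [1.0, 2.0, 2.0],
    "LSTAT": [4.98, 9.14, 4.03],
})

def add_functional_forms(frame):
    """Return a copy of frame with the transformed predictors appended."""
    out = frame.copy()
    out["RM_SQ"] = out["RM"] ** 2            # squared rooms term
    out["LOG_DIS"] = np.log(out["DIS"])      # natural log of distance
    out["LOG_RAD"] = np.log(out["RAD"])      # natural log of highway index
    out["LOG_LSTAT"] = np.log(out["LSTAT"])  # natural log of lower-status share
    return out

transformed = add_functional_forms(sample)
print(transformed.round(3))
```

The same function could be applied to the full `df`, since it only relies on the four column names and leaves the input frame unchanged.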

In [12]:
pandas_profiling.ProfileReport(df, minimal=True, explorative=True).to_widgets()
