## Machine Learning Model Building Pipeline: Data Analysis

In this notebook, we will go through the Data Analysis step in the Machine Learning model building pipeline. There is a notebook for each of the Machine Learning Pipeline steps:

1. Data Analysis
2. [Feature Engineering](https://www.kaggle.com/rkb0023/feature-engineering-house-rent-prediction)
3. [Model Building](https://www.kaggle.com/rkb0023/model-building-house-rent-prediction)

**This is the notebook for step 1: Data Analysis**

The dataset can be found in [iNeuron](https://challenge-ineuron.in/mlchallenge.php#) ML Challenge 2.


## Predicting Rent Price of Houses

The aim of the project is to build a machine learning model to predict the rent price of homes based on different explanatory variables describing aspects of residential houses. 


### What is the objective of the machine learning model?

We aim to minimise the difference between the real rent and the rent estimated by our model. We will evaluate model performance using the mean squared error (MSE) and its square root, the root mean squared error (RMSE).
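
As a minimal sketch of these two metrics, the cell below computes them with NumPy on a handful of made-up rents. The helper names `mse` and `rmse` and the example numbers are assumptions for illustration only; they are not results from this dataset.

```python
import numpy as np


def mse(y_true, y_pred):
    """Mean of the squared differences between actual and predicted rents."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean((y_true - y_pred) ** 2))


def rmse(y_true, y_pred):
    """Square root of the MSE, expressed in the same units as the rent."""
    return float(np.sqrt(mse(y_true, y_pred)))


# hypothetical rents and predictions, for illustration only
actual_rent = [1000, 1500, 2000]
predicted_rent = [1100, 1400, 2300]

example_mse = mse(actual_rent, predicted_rent)
example_rmse = rmse(actual_rent, predicted_rent)
print(example_mse, example_rmse)
```

The RMSE is usually easier to read than the MSE because it is on the same scale as the rent itself.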

<br>
<hr>

We will analyse the dataset to identify:

1. Data Description
2. Missing values
3. Numerical variables
4. Distribution of the numerical variables
5. Outliers
6. Categorical variables
7. Cardinality of the categorical variables
8. Potential relationship between the variables and the target: price

## House Rent dataset: Data Analysis

In the following cells, we will analyse the variables of the House Rent Dataset from iNeuron. We will go through the different aspects of the analysis of the variables.

Let's go ahead and load the dataset.

In [None]:
# to handle datasets
import pandas as pd
import numpy as np

# for plotting
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
%matplotlib inline

# to display all the columns of the dataframe in the notebook
pd.set_option('display.max_columns', None)

### Data Import

In [None]:
data = pd.read_csv('../input/houseRent/housing_train.csv')
data.shape

The house rent dataset contains 265,190 rows, i.e., houses, and 22 columns, i.e., variables. 

## Data Description

#### *Show data header*

In [None]:
data.head()

In [None]:
data.tail()

#### *Data Information*

In [None]:
data.info()

There are 13 numerical features and 9 categorical features.
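
The counts above can also be obtained programmatically instead of reading them off `data.info()`. The sketch below uses a small made-up frame named `toy` (an assumption; it stands in for `data`) and splits the columns with `select_dtypes`.

```python
import pandas as pd

# toy frame standing in for the real dataset (illustrative values only)
toy = pd.DataFrame({
    "price": [900, 1200, 1500],
    "beds": [1, 2, 3],
    "type": ["apartment", "house", "condo"],
    "state": ["ca", "ny", "fl"],
})

# numeric columns are selected explicitly; everything else is treated as categorical
numerical_cols = toy.select_dtypes(include="number").columns.tolist()
categorical_cols = [col for col in toy.columns if col not in numerical_cols]

print(len(numerical_cols), "numerical:", numerical_cols)
print(len(categorical_cols), "categorical:", categorical_cols)
```

Selecting the numeric columns and taking the complement avoids depending on how a given pandas version stores text columns.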

#### *Data columns*

In [None]:
data.columns

#### *Show statistical analysis of our dataset*

Let's show the min, max, std, and count of each numerical variable in the dataset.

In [None]:
data.describe()

## Missing values

Let's go ahead and find out which variables of the dataset contain missing values.

#### *Show if there are missing datapoints*

In [None]:
data.isna().mean().sort_values(ascending=False)

Variables containing missing values:
- parking_options (36%)
- laundry_options (20%)
- lat (0.5%)
- long (0.5%)
- description & state (nominal percentage)

**Heat Map**

In [None]:
fig, ax = plt.subplots(figsize=(12,12))
sns.heatmap(data.isnull(), ax=ax, cmap="YlGnBu", center=0).set(
            title = 'Missing Data', 
            xlabel = 'Columns', 
            ylabel = 'Data Points');

In [None]:
# make a list of the variables that contain missing values
vars_with_na = [var for var in data.columns if data[var].isnull().sum() > 0]
data[vars_with_na].isnull().mean()

Our dataset contains a few variables with missing values. We need to account for this in our following notebook, where we will engineer the variables for use in Machine Learning Models.

#### Relationship between values being missing and price

Let's evaluate the price of the house in those observations where the information is missing, for each variable.

**Bar Plot**

In [None]:
def analyse_na_value(df, var):
    df = df.copy()
    # let's make a variable that indicates 1 if the observation was missing or zero otherwise
    df[var] = np.where(df[var].isna(), 1, 0)
    grs = df.groupby(var)['price'].median().reset_index()
    plt.figure(figsize=(10,6))
    sns.barplot(x=grs[var], y=grs['price'])
    plt.title(var)
    plt.show()


# let's run the function on each variable with missing data
for var in vars_with_na:
    analyse_na_value(data, var)

The median rent price in houses where the information is missing differs from the median rent price in houses where the information exists. 

We will capture this information when we engineer the variables in our next notebook.
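
One common way to capture this information is a binary "missing" indicator column. The cell below is only a sketch on made-up data: the helper `add_missing_indicator` and the `_na` suffix are hypothetical names, and the actual feature engineering notebook may do this differently.

```python
import numpy as np
import pandas as pd

# toy data, for illustration only
toy = pd.DataFrame({
    "parking_options": ["carport", np.nan, "street parking", np.nan],
    "price": [1000, 1400, 900, 1600],
})


def add_missing_indicator(df, var):
    """Return a copy of df with a 0/1 column flagging rows where var is missing."""
    df = df.copy()
    df[var + "_na"] = df[var].isna().astype(int)
    return df


flagged = add_missing_indicator(toy, "parking_options")

# median price for rows with (1) and without (0) a missing value
median_by_flag = flagged.groupby("parking_options_na")["price"].median()
print(median_by_flag)
```

The indicator keeps the signal carried by missingness even after the original column has been imputed.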

### Categorical variables

In [None]:
# make a list of the categorical variables that contain missing values

vars_with_na = [
    var for var in data.columns
    if data[var].isnull().sum() > 0 and data[var].dtypes == 'O'
]
print(vars_with_na)
data[vars_with_na].isna().mean()

In [None]:
data[vars_with_na].head()

***Description***

In [None]:
data.description[0]

While going through the descriptions of the house records, we found some information that could be used as features for predicting the target. For example, mentions of a grill, pool, or fireplace could be useful features. We will look into this further in the data cleaning pipeline.

***State***

In [None]:
data[['region','state']].head(15)

In [None]:
data.groupby('region')['state'].value_counts()

The state is highly related to the region, so we will fill the missing values of state with the mode value of state for that region.

***laundry_options***

In [None]:
data.groupby('type')['laundry_options'].value_counts()

We will fill in the missing laundry_options with the mode value of laundry_options for the type of the house.

***parking_options***

In [None]:
data.groupby('type')['parking_options'].value_counts()

As with laundry_options, we will fill in the missing parking_options with the mode value of parking_options for the type of the house.

We will fill in the missing values in the notebook for the Feature Engineering Pipeline.
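
The group-wise mode imputation described above (state by region, laundry_options and parking_options by type) could look roughly like the sketch below. `fill_with_group_mode` is a hypothetical helper and the data is made up; the feature engineering notebook may implement it differently.

```python
import numpy as np
import pandas as pd


def fill_with_group_mode(df, var, group_var):
    """Fill missing values of var with the most frequent value of var
    within the same group_var. Groups with no observed value stay missing."""
    df = df.copy()
    # mode() returns the most frequent values sorted, so ties resolve to the first one
    modes = (
        df.dropna(subset=[var])
        .groupby(group_var)[var]
        .agg(lambda s: s.mode().iloc[0])
    )
    df[var] = df[var].fillna(df[group_var].map(modes))
    return df


# toy data, for illustration only
toy = pd.DataFrame({
    "type": ["apartment", "apartment", "apartment", "house", "house", "house"],
    "laundry_options": [
        "laundry on site", "laundry on site", np.nan,
        "w/d in unit", np.nan, "w/d in unit",
    ],
})

filled = fill_with_group_mode(toy, "laundry_options", "type")
print(filled)
```

The same call with `var="state"` and `group_var="region"` would cover the state column.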

### Numerical variables


In [None]:
# make a list with the numerical variables that contain missing values
vars_with_na = [
    var for var in data.columns
    if data[var].isnull().sum() > 0 and data[var].dtypes != 'O'
]
print(vars_with_na)
# print percentage of missing values per variable
data[vars_with_na].isnull().mean()

***lat***

In [None]:
data.groupby('region')['lat'].value_counts()

***long***

In [None]:
data.groupby('region')['long'].value_counts()

Latitudes and longitudes tend to correspond to the region, so it is appropriate to fill missing latitudes and longitudes with the mode value for the region.
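
Because coordinates are continuous, the mode of a region can be sensitive to a few repeated values. As an alternative to the mode used in this notebook, the sketch below fills missing coordinates with the per-region median instead. This is a swapped-in technique shown for comparison only, on made-up data.

```python
import numpy as np
import pandas as pd

# toy data, for illustration only
toy = pd.DataFrame({
    "region": ["denver", "denver", "denver", "omaha", "omaha"],
    "lat": [39.7, 39.9, np.nan, 41.2, np.nan],
    "long": [-105.0, -104.8, np.nan, -96.0, np.nan],
})

filled = toy.copy()
for coord in ["lat", "long"]:
    # median of the coordinate within each region, aligned to the original rows
    region_median = filled.groupby("region")[coord].transform("median")
    filled[coord] = filled[coord].fillna(region_median)

print(filled)
```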

#### Boolean Variables

Extracting the boolean variables

In [None]:
bool_vars = [var for var in data if data[var].nunique() == 2]

data[bool_vars].head()

In [None]:
# make list of numerical variables
num_vars = [var for var in data.columns if data[var].dtypes != 'O' and var not in bool_vars]

print('Number of numerical variables: ', len(num_vars))

# visualise the numerical variables
data[num_vars].head()

From the above view of the dataset, we notice the variable id, which is an identifier of the house. We will not use this variable to make our predictions, as it takes a different value for each row, i.e., each house in the dataset. See below:

In [None]:
print('Number of House Id labels: ', len(data.id.unique()))
print('Number of Houses in the Dataset: ', len(data))

The same goes for url and image_url: each house has a different value for these features.
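
A quick way to spot such identifier-like columns is to compare the number of distinct values with the number of rows. The helper `find_identifier_columns` below is a hypothetical name and the frame is made up; it is a sketch, not part of the original analysis.

```python
import pandas as pd


def find_identifier_columns(df):
    """Return columns whose number of distinct values equals the number of rows."""
    return [col for col in df.columns if df[col].nunique() == len(df)]


# toy data, for illustration only
toy = pd.DataFrame({
    "id": [101, 102, 103, 104],
    "url": ["u1", "u2", "u3", "u4"],
    "beds": [1, 2, 2, 3],
    "price": [900, 1200, 1200, 1500],
})

id_like = find_identifier_columns(toy)
print(id_like)
```

A genuinely continuous variable can also be unique in every row, so the result is a list of candidates to review rather than a list to drop blindly.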

## Geographical variables

Let's plot latitude and longitude to get more insights.

**Scatter Plot**

In [None]:
plt.scatter(x=data['long'], y=data['lat'],alpha=0.01)
plt.xlim(right=-50)
plt.ylim(bottom=20,top=60)
plt.show()

**Shapely geometry**

In [None]:
from shapely.geometry import Point
import geopandas as gpd
from geopandas import GeoDataFrame


geometry = [Point(xy) for xy in zip(data['long'], data['lat'])]
gdf = GeoDataFrame(data, geometry=geometry)   

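# this is a simple map that ships with geopandas
# note: gpd.datasets was deprecated in GeoPandas 0.14 and removed in 1.0;
# on newer versions, read a local copy of the Natural Earth data instead
world = gpd.read_file(gpd.datasets.get_path('naturalearth_lowres'))
gdf.plot(ax=world.plot(figsize=(10, 6)), marker='o', color='red', markersize=15);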

From the above map, it is clear that most of the houses are in the United States, 
between longitudes -130 and -50 and latitudes 20 and 50.

## Numerical Features

### Discrete variables

Let's go ahead and find which variables are discrete, i.e., show a finite number of values

In [None]:
#  let's make a list of discrete variables
discrete_vars = [var for var in num_vars if len(
    data[var].unique()) < 20 and var not in ['id', 'price']]


print('Number of discrete variables: ', len(discrete_vars))

In [None]:
# let's visualise the discrete variables

data[discrete_vars].head()

These discrete variables refer to the number of rooms and bathrooms.
Let's go ahead and analyse their contribution to the house price.

In [None]:
def analyse_discrete(df, var):
    df = df.copy()
    grs = df.groupby(var)['price'].median().reset_index()
    plt.figure(figsize=(10,6))
    sns.barplot(x=grs[var], y=grs['price'])
    plt.title(var.upper())
    plt.show()
    
    
for var in discrete_vars:
    analyse_discrete(data, var)

There tends to be a relationship between the values of these variables and the price, but this relationship is not always monotonic. 

For example, for beds, there is a monotonic relationship: the higher the quantity, the higher the price.  

However, for baths, the relationship is not monotonic. Some numbers of baths, like 8.5, correlate with higher rent prices, but higher values do not necessarily do so. We need to be careful about how we engineer these variables to extract maximum value for a linear model.
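
Monotonicity can also be checked numerically rather than by eye. The sketch below groups a made-up frame by a discrete variable and tests whether the median price only rises or only falls across its values. `median_price_is_monotonic` is a hypothetical helper name.

```python
import pandas as pd


def median_price_is_monotonic(df, var, target="price"):
    """True if the median target only increases or only decreases
    as the values of var increase."""
    medians = df.groupby(var)[target].median().sort_index()
    return bool(medians.is_monotonic_increasing or medians.is_monotonic_decreasing)


# toy data, for illustration only
toy = pd.DataFrame({
    "beds": [1, 1, 2, 2, 3, 3],
    "baths": [1.0, 2.0, 1.5, 1.0, 2.0, 1.5],
    "price": [800, 900, 1200, 1300, 1500, 1700],
})

for var in ["beds", "baths"]:
    print(var, median_price_is_monotonic(toy, var))
```

On real data with many sparsely populated values, a rank correlation may be a more forgiving summary than this strict check.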

### Continuous variables

Let's go ahead and find the distribution of the continuous variables. We will consider as continuous all the numerical variables that are neither discrete nor the id.

In [None]:
# make list of continuous variables
cont_vars = [
    var for var in num_vars if var not in discrete_vars+['id']]

print('Number of continuous variables: ', len(cont_vars))

In [None]:
# let's visualise the continuous variables

data[cont_vars].head()

**Dist Plot before log transformation**

In [None]:
# Let's go ahead and analyse the distributions of these variables
def analyse_continuous(df, var):
    # drop missing values of this variable only, not every row with any NaN
    values = df[var].dropna()
    plt.figure(figsize=(10,6))
    sns.set_style("darkgrid")
    # histplot with a KDE overlay replaces the deprecated distplot
    sns.histplot(values, kde=True, stat="density")
    # report skewness and kurtosis of the plotted values in the title
    plt.title('{} (Skewness={:.2f} Kurtosis={:.2f})'.format(
        var, values.skew(), values.kurt()))
    plt.show()

for var in cont_vars:
    analyse_continuous(data, var)

The variables are not normally distributed, including the target variable 'price'. 

To maximise performance of linear models, we need to account for non-Gaussian distributions. We will transform our variables in the next notebook, during our feature engineering step.

Let's evaluate if a logarithmic transformation of the variables returns values that follow a normal distribution:

**Dist Plot after log transformation**

In [None]:
# Let's go ahead and analyse the distributions of these variables
# after applying a logarithmic transformation
def analyse_transformed_continuous(df, var):
    # drop missing values of this variable only
    values = df[var].dropna()

    # lat and long are coordinates and can be negative,
    # so we leave them untransformed
    if var not in ('lat', 'long'):
        # log1p computes log(1 + x), so zero values are handled
        values = np.log1p(values)
    plt.figure(figsize=(10,6))
    sns.set_style("darkgrid")
    sns.histplot(values, kde=True, stat="density")
    # skewness and kurtosis of the transformed values, not the raw ones
    plt.title('{} (Skewness={:.2f} Kurtosis={:.2f})'.format(
        var, values.skew(), values.kurt()))
    plt.show()


for var in cont_vars:
    analyse_transformed_continuous(data, var)

We get a better spread of the values for most variables when we use the logarithmic transformation. This engineering step will most likely improve the performance of our final model.

From the previous plots, we observe some monotonic associations between price and the variables to which we applied the log transformation, for example 'sqfeet'.
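
The effect of the transformation can be summarised with the skewness before and after. The sketch below uses a synthetic right-skewed sample (drawn from a log-normal distribution with a fixed seed) as a stand-in for a variable like price. The sample and its parameters are assumptions, not values taken from the dataset.

```python
import numpy as np
import pandas as pd

# synthetic right-skewed sample standing in for a variable such as price
rng = np.random.default_rng(0)
toy_price = pd.Series(rng.lognormal(mean=7.0, sigma=0.8, size=5000), name="price")

skew_raw = toy_price.skew()
# log1p(x) = log(1 + x), which is also defined for zero values
skew_log = np.log1p(toy_price).skew()

print("skewness before: {:.2f}".format(skew_raw))
print("skewness after:  {:.2f}".format(skew_log))
```

A skewness close to zero after the transformation indicates a more symmetric distribution, which is what the plots above show visually.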

## Outliers

Extreme values may affect the performance of a linear model. Let's find out if we have any in our variables.

**Box Plot**

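Box plots summarise each variable through its quartiles and draw the points beyond the whiskers individually, which makes extreme values easy to spot. Violin plots are similar to box plots, except that they also show the probability density of the data at different values, usually smoothed by a kernel density estimator.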

In [None]:
# let's make boxplots to visualise outliers in the continuous variables


def find_outliers(df, var):
    df = df.copy()

    # lat and long can be negative, so we skip the
    # log transformation for those variables
    if var not in ('lat', 'long'):
        # log transform the variable
        df[var] = np.log1p(df[var])
    # plot the (transformed) copy, not the global dataframe
    sns.boxplot(y=df[var])
    plt.title(var)
    plt.ylabel(var)
    plt.show()


for var in cont_vars:
    find_outliers(data, var)

The majority of the continuous variables seem to contain outliers. Outliers tend to affect the performance of linear models, so it is worth spending some time understanding whether removing outliers will improve the performance of our final machine learning model.

### Let's explore these outliers


Methods for exploring outliers

### IQR

In [None]:
def out_iqr(df , column):
    # the bounds are stored as globals so they can be inspected after the call
    global lower,upper
    # Series.quantile skips missing values
    q25, q75 = df[column].quantile(0.25), df[column].quantile(0.75)
    # calculate the IQR
    iqr = q75 - q25
    # calculate the outlier cutoff
    cut_off = iqr * 1.5
    # calculate the lower and upper bound value
    lower, upper = q25 - cut_off, q75 + cut_off
    print('The IQR is',iqr)
    print('The lower bound value is', lower)
    print('The upper bound value is', upper)
    # count the records above the upper bound and below the lower bound
    df1 = df[df[column] > upper]
    df2 = df[df[column] < lower]
    print('Total number of outliers are', df1.shape[0]+ df2.shape[0])

### Standard Deviation

In [None]:
def out_std(df, column):
    # the bounds are stored as globals so they can be inspected after the call
    global lower,upper
    # calculate the mean and standard deviation of the column
    data_mean, data_std = df[column].mean(), df[column].std()
    # calculate the cutoff value (three standard deviations)
    cut_off = data_std * 3
    # calculate the lower and upper bound value
    lower, upper = data_mean - cut_off, data_mean + cut_off
    print('The lower bound value is', lower)
    print('The upper bound value is', upper)
    # count the records above the upper bound and below the lower bound
    df1 = df[df[column] > upper]
    df2 = df[df[column] < lower]
    print('Total number of outliers are', df1.shape[0]+ df2.shape[0])

Exploring outliers in the variables

## price

In [None]:

fig, ax = plt.subplots()
ax.scatter(x = data['sqfeet'], y = data['price'])
plt.ylabel('price', fontsize=13)
plt.xlabel('sqfeet', fontsize=13)
plt.show()

At the bottom right, we can see two points with extremely large sqfeet and a low price. There is also one point at the top left with extremely small sqfeet and a high price. These values are extreme outliers.

#### IQR

In [None]:
out_iqr(data, 'price')

#### STD

In [None]:
out_std(data,'price')

## sqfeet

In [None]:

fig, ax = plt.subplots()
ax.scatter(x = data['sqfeet'], y = data['price'])
plt.ylabel('price', fontsize=13)
plt.xlabel('sqfeet', fontsize=13)
plt.show()

At the bottom right, we can see two points with extremely large sqfeet and a low price. There is also one point at the top left with extremely small sqfeet and a high price. These values are extreme outliers.

#### IQR

In [None]:
out_iqr(data, 'sqfeet')

#### STD

In [None]:
out_std(data,'sqfeet')

##   beds

In [None]:

fig, ax = plt.subplots()
ax.scatter(x = data['sqfeet'], y = data['beds'])
plt.ylabel('beds', fontsize=13)
plt.xlabel('sqfeet', fontsize=13)
plt.show()

#### IQR

In [None]:
out_iqr(data, 'beds')

#### STD

In [None]:
out_std(data,'beds')

## baths

#### IQR

In [None]:
out_iqr(data, 'baths')

#### STD

In [None]:
out_std(data,'baths')

## lat

#### IQR

In [None]:
out_iqr(data.dropna(axis=0), 'lat')

## long

#### IQR

In [None]:
out_iqr(data.dropna(axis=0), 'long')

We will use the interquartile range to remove the outliers, because the upper and lower bounds given by the other method (standard deviation) were not meaningful for this data.
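
As a minimal sketch of how IQR-based removal could be applied, the cell below computes the bounds and keeps only the rows inside them. The helpers `iqr_bounds` and `drop_iqr_outliers` are hypothetical names, and the data is made up for illustration.

```python
import pandas as pd


def iqr_bounds(series, factor=1.5):
    """Lower and upper outlier bounds based on the interquartile range."""
    q25, q75 = series.quantile(0.25), series.quantile(0.75)
    iqr = q75 - q25
    return q25 - factor * iqr, q75 + factor * iqr


def drop_iqr_outliers(df, column, factor=1.5):
    """Keep only the rows whose value of column lies within the IQR bounds.
    Rows where column is missing are dropped as well."""
    lower, upper = iqr_bounds(df[column], factor)
    return df[df[column].between(lower, upper)]


# toy data with one extreme rent, for illustration only
toy = pd.DataFrame({"price": [900, 1000, 1100, 1200, 1300, 50000]})

lower, upper = iqr_bounds(toy["price"])
trimmed = drop_iqr_outliers(toy, "price")
print(lower, upper, len(trimmed))
```

Returning the bounds from a function, rather than storing them in globals, makes it easier to compute them on the training data and reuse them on new data.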

## Categorical variables

Let's go ahead and analyse the categorical variables present in the dataset.

In [None]:
# capture categorical variables in a list
cat_vars = [var for var in data.columns if data[var].dtypes == 'O']

print('Number of categorical variables: ', len(cat_vars))

In [None]:
# let's visualise the values of the categorical variables
data[cat_vars].head()

#### Number of labels: cardinality

Let's evaluate how many different categories are present in each of the variables.

In [None]:
data[cat_vars].nunique().sort_values(ascending=False)

In [None]:
data[cat_vars].nunique() / len(data)

Variables like url, image_url, and description have high cardinality, because each house may have a different value for these variables. So it is okay to remove them. Also, region_url duplicates the information in region, so dropping region_url will not affect our model.

In [None]:
# recapture categorical variables in a list
cat_vars = [var for var in cat_vars if var not in ['url', 'image_url', 'description', 'region_url']]

In [None]:
data[cat_vars].nunique()

All the categorical variables show low cardinality (except region), which means that they have only a few different labels. That is good, as we won't need to tackle cardinality in our feature engineering notebook.

#### Rare labels:

Let's go ahead and investigate now if there are labels that are present only in a small number of houses:

In [None]:
def analyse_rare_labels(df, var, rare_perc):
    df = df.copy()

    # determine the % of observations per category
    tmp = df.groupby(var)['price'].count() / len(df)

    # return categories that are rare
    return tmp[tmp < rare_perc]

# print categories that are present in less than
# 1 % of the observations


for var in cat_vars:
    print(analyse_rare_labels(data, var, 0.01))
    print()

Some of the categorical variables show multiple labels that are present in less than 1% of the houses. We will engineer these variables in our next notebook. Labels that are under-represented in the dataset tend to cause over-fitting of machine learning models. That is why we want to remove them.

### Frequent labels

In [None]:
def find_frequent_labels(df, var, rare_perc):
    # function finds the labels that are shared by more than
    # a certain % of the houses in the dataset
    df = df.copy()
    tmp = df.groupby(var)['price'].count() / len(df)
    return tmp[tmp > rare_perc].index.values

frequent_ls = {}
for var in cat_vars:
    frequent_ls[var] = find_frequent_labels(data, var, 0.01)
    
frequent_ls

These are the frequent labels for the categorical variables that we are going to keep.
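
One possible way to use such a list later is to keep the frequent labels and collapse everything else into a single placeholder category. This is a sketch of a common alternative to dropping rows, not necessarily what the next notebook does. `group_rare_labels` and the label `"Rare"` are hypothetical names, and the 20% threshold is chosen only so that the tiny example has a rare label.

```python
import pandas as pd


def group_rare_labels(df, var, frequent_labels, rare_name="Rare"):
    """Replace every label of var that is not in frequent_labels with rare_name.
    Missing values are left untouched."""
    df = df.copy()
    keep = df[var].isin(frequent_labels) | df[var].isna()
    df[var] = df[var].where(keep, rare_name)
    return df


# toy data, for illustration only
toy = pd.DataFrame({"type": ["apartment"] * 6 + ["house"] * 3 + ["loft"]})

# share of rows per label, and the labels above the threshold
freq = toy["type"].value_counts(normalize=True)
frequent = freq[freq > 0.2].index.tolist()

grouped = group_rare_labels(toy, "type", frequent)
print(grouped["type"].value_counts())
```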

## type

***Pie Plot***

In [None]:
grdsp = data.groupby(["type"])[["price"]].mean().reset_index()

fig = px.pie(grdsp,
             values="price",
             names="type",
             template="seaborn")
fig.update_traces(rotation=90, pull=0.05, textinfo="percent+label")
fig.show()

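The apartment slice is the largest, at around 50.2% of the pie. Note that the pie is built from the mean price per type, so this percentage is the apartment share of the summed mean prices, not the share of records.<br>
The mean price for the type apartment is around 14,544.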

## state

In [None]:
data['state'].value_counts().sort_values(ascending=False)

***Scatter Plot***

In [None]:
df = data[((data['long']>-125) & (data['long']<-45)) & ((data['lat']>30) & (data['lat']<45))]
df = df[df.price<2400]
# marker size per row: number of houses in that row's state, scaled down
state_counts = df["state"].map(df["state"].value_counts())
# longitude on the x-axis and latitude on the y-axis, as on a map
df.plot(kind="scatter", x="long", y="lat", alpha=0.4, 
        s=state_counts/100, label="no_of_houses", 
        c="price", cmap=plt.get_cmap("jet"), colorbar=True,
       figsize=(12,12))
plt.title('House Rent Across State')
plt.legend()

The size of each circle represents the state’s house count (option s), and the color represents the price (option c). The color range runs from blue (low prices) to red (high prices).

## region

In [None]:
data['region'].value_counts().sort_values(ascending=False)

***Scatter Plot***

In [None]:
df = data[((data['long']>-125) & (data['long']<-45)) & ((data['lat']>30) & (data['lat']<45))]
df = df[df.price<2400]
# marker size per row: number of houses in that row's region, scaled down
region_counts = df["region"].map(df["region"].value_counts())
# longitude on the x-axis and latitude on the y-axis, as on a map
df.plot(kind="scatter", x="long", y="lat", alpha=0.4, 
        s=region_counts/100, label="no_of_houses", 
        c="price", cmap=plt.get_cmap("jet"), colorbar=True,
        figsize=(12,12))
plt.title('House Rent Across Region')
plt.legend()

The size of each circle represents the region’s house count (option s), and the color represents the price (option c). The color range runs from blue (low prices) to red (high prices).

## Correlation

***Correlation Heatmap***

In [None]:
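# restrict to numeric and boolean columns; text columns cannot be correlated
corr_matrix = data.select_dtypes(include=['number', 'bool']).corr()
# np.bool was removed from NumPy; the builtin bool is the replacement
mask = np.zeros_like(corr_matrix, dtype=bool)
mask[np.triu_indices_from(mask)]= True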

fig, ax = plt.subplots(figsize=(12,12)) 

sns.heatmap(corr_matrix, 
            annot=True, 
            mask=mask,
            ax=ax, 
            cmap='BrBG').set(
    title = 'Feature Correlation', xlabel = 'Columns', ylabel = 'Columns')

ax.set_yticklabels(corr_matrix.columns, rotation = 0)
ax.set_xticklabels(corr_matrix.columns)
sns.set_style({'xtick.bottom': True}, {'ytick.left': True})

There is hardly any correlation between the independent features and the dependent variable.

Some insights from the above correlation heatmap:

1. expected strong correlation between beds and baths
2. unexpected correlation between smoking_allowed and lat
3. unexpected correlation between smoking_allowed and infant_mortality
4. expected strong correlation between cats_allowed and dogs_allowed
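
Pairs like these can also be pulled out of the correlation matrix programmatically instead of being read off the heatmap. The sketch below ranks the feature pairs of a made-up frame by absolute correlation. `top_correlated_pairs` is a hypothetical helper name.

```python
from itertools import combinations

import pandas as pd


def top_correlated_pairs(df, n=3):
    """Return the n column pairs with the largest absolute Pearson correlation."""
    corr = df.corr()
    # each unordered pair once, which mirrors one triangle of the heatmap
    pairs = {(a, b): corr.loc[a, b] for a, b in combinations(corr.columns, 2)}
    return pd.Series(pairs).sort_values(key=abs, ascending=False).head(n)


# toy data, for illustration only
toy = pd.DataFrame({
    "beds": [1, 2, 2, 3, 4, 4],
    "baths": [1.0, 1.0, 2.0, 2.0, 3.0, 3.5],
    "cats_allowed": [1, 0, 1, 0, 1, 0],
})

top = top_correlated_pairs(toy, n=2)
print(top)
```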

# Conclusion

In this notebook, we did exploratory data analysis on the dataset from iNeuron ML Challenge-2.

The tasks of the EDA were:
1. Data Description
2. Missing values
3. Numerical variables
4. Distribution of the numerical variables
5. Outliers
6. Categorical variables
7. Potential relationship between the variables and the target: price

## 1. Data Description:

The house rent dataset contains 265,190 rows, i.e., houses, and 22 columns, i.e., variables. <br>
There are 13 numerical features and 9 categorical features.

## 2. Missing Values:
The following variables contain missing values:
- parking_options (36%)
- laundry_options (20%)
- lat (0.5%)
- long (0.5%)
- description & state (nominal percentage)

**Imputation of missing values:**

variables | imputation 
--- | ---
parking_options | mode value of the parking_options for the respective house type
laundry_options | mode value of the laundry_options for the respective house type
lat | mode value of the latitude for the respective house region
long | mode value of the longitude for the respective house region
state | drop records
description | drop records


***Note***: Description column can be explored more to get intriguing new features. For example: having pool, fireplace, grilling place, gym nearby etc.

## 3. Numerical variables:

Numerical variables are: <br>
    
    ['price', 'sqfeet', 'beds', 'baths', 'lat', 'long']

Boolean variables are: <br>
    
    ['cats_allowed', 'dogs_allowed', 'smoking_allowed', 'wheelchair_access', 'electric_vehicle_charge', 
    'comes_furnished']

Discrete variables are: <br>
    
    ['beds', 'baths']

Continuous variables are: <br>
    
    ['price', 'sqfeet', 'lat', 'long']

## 4. Distribution of Numerical Variables

All the numerical variables, except lat and long, are skewed. The log transformation was found to be useful while analysing the data.

## 5. Outliers

All the numerical variables contain huge outliers. While doing EDA, we decided to remove the outliers with the help of interquartile range.

variables | no of outliers | upper bound | lower bound 
--- | --- | --- | ---
price | 13423 | 2400 | 1
sqfeet | 11212 | 1762 | 146
beds | 10017 | 3 | 1
baths | 1459 | 3 | 1
lat | 3148 | 53 | 23
long | 3110 | -45 | -142

## 6. Categorical Variables

The categorical features are: <br>

    ['region', 'type', 'laundry_options', 'parking_options', 'state']
       
These variables contain free text: <br>
    
    ['url', 'region_url', 'image_url', 'description']

The frequent labels for the categorical variables are: <br>

    {
    'region': 
        ['denver', 'fayetteville', 'jacksonville', 'omaha / council bluffs', 'rochester'],
     'type': 
        ['apartment', 'condo', 'duplex', 'house', 'manufactured', 'townhouse'],
     'laundry_options': 
        ['laundry in bldg', 'laundry on site', 'w/d hookups', 'w/d in unit'],  
     'parking_options': 
        ['attached garage', 'carport', 'detached garage', 'off-street parking', 'street parking'],
     'state': 
        ['al', 'ar', 'az', 'ca', 'co', 'ct', 'fl', 'ga', 'ia', 'id', 'il', 'in', 'ks', 'ky', 'la', 
        'ma', 'md', 'mi', 'mn', 'ms', 'nc', 'nd', 'ne', 'nj', 'nm', 'nv', 'ny', 'oh']
    }

## 7. Relationship between independent and dependent variable

We did not notice much of a relationship among the features. We will look into this topic in our next notebook, i.e., the Feature Engineering Pipeline.

In the next notebook, we will transform these strings / labels into numbers, so that we capture this information and transform it into a monotonic relationship between the category and the house price.
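
One way to obtain such a monotonic encoding is to order the labels of a category by their median price and replace each label with its rank. The sketch below shows the idea on made-up data. `ordinal_mapping_by_median` is a hypothetical helper, and the next notebook may use a different encoding.

```python
import pandas as pd


def ordinal_mapping_by_median(df, var, target="price"):
    """Map each label of var to an integer rank, ordered by median target."""
    ordered = df.groupby(var)[target].median().sort_values().index
    return {label: rank for rank, label in enumerate(ordered)}


# toy data, for illustration only
toy = pd.DataFrame({
    "type": ["apartment", "apartment", "house", "house", "condo", "condo"],
    "price": [1000, 1200, 2000, 2200, 1500, 1700],
})

mapping = ordinal_mapping_by_median(toy, "type")
encoded = toy["type"].map(mapping)
print(mapping)
```

Because the mapping is derived from the target, it should be learned from the training set only and then applied unchanged to validation and test data.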