# Task

As a data analyst, there is plenty of opportunity to improve processes or suggest new ways of doing things. When doing so, it is often smart and efficient (time is a scarce resource) to create a POC (Proof of Concept): a small demo that checks whether something is worth pursuing further. It is also something concrete that facilitates discussion; do not underestimate the power of that.

In this example, you are working at a company that sells houses, where prices are currently set manually by humans. As a Data Scientist, you can improve this process by using Machine Learning. Your task is to create a POC that you will present to your team colleagues and use as a basis for discussing whether or not you should continue with more detailed modelling.

Two quotes to facilitate your reflection on the value of creating a POC:

"*Premature optimization is the root of all evil*". 

"*Fail fast*".

**More specifically, do the following:**

1. A short EDA (Exploratory Data Analysis) of the housing data set.
2. Drop the column "ocean_proximity", then you only have numeric columns which will simplify your analysis. Remember, this is a POC!
3. Split your data into train and test set. 
4. Create a pipeline containing a SimpleImputer [ SimpleImputer(strategy="median") ] and a std_scaler (and fit-transform your train set). 

5. Use GridSearchCV when choosing your model. You will look at a RandomForestRegressor with 2, 5, 10 or 100 estimators. More specifically, use the following code: 

```python
param_grid = [{'n_estimators': [2, 5, 10, 100]}]

forest_reg = RandomForestRegressor(random_state=42)

grid_search = GridSearchCV(forest_reg, param_grid, cv=3,
                           scoring='neg_mean_squared_error',
                           return_train_score=True)

grid_search.fit(train_feature, train_label)
```

6. Evaluate your model on the test set using the mean squared error as the metric. Conclusions? (Remember, you have fitted your pipeline above so now you just transform your test set without fitting your pipeline on it, else it is "cheating".)

7. Do a short presentation (~ 2-5 min) on your POC that you present to your colleagues (no need to prepare anything particular, just talk from the code). Think of:
- What do you want to highlight/present?
- What is your conclusion?
- What could be the next step? Is the POC convincing enough, or is it not worthwhile continuing? Do we need to dig deeper into this before making any decisions?


**(8. If you have time, try to build a better model than the one presented in the POC.)**

# POC

In [2]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Data splitting
from sklearn.model_selection import train_test_split

# Creating a pipeline
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer

# GridSearch
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestRegressor

from sklearn.metrics import mean_squared_error

In [3]:
# Below, set your own path where you have stored the data file. 
housing = pd.read_csv('../Chapter 2 - End to end ML/housing.csv')

## 1. EDA

In [4]:
print('------------------ Snippet of the housing data --------------------')
print(housing.head(3))
print()

print('---------------------- Info about each column -------------------------')
print(housing.info())
print()

print('----------------------- Description of data ---------------------------')
housing.describe()

------------------ Snippet of the housing data --------------------
   longitude  latitude  housing_median_age  total_rooms  total_bedrooms  \
0    -122.23     37.88                41.0        880.0           129.0   
1    -122.22     37.86                21.0       7099.0          1106.0   
2    -122.24     37.85                52.0       1467.0           190.0   

   population  households  median_income  median_house_value ocean_proximity  
0       322.0       126.0         8.3252            452600.0        NEAR BAY  
1      2401.0      1138.0         8.3014            358500.0        NEAR BAY  
2       496.0       177.0         7.2574            352100.0        NEAR BAY  

---------------------- Info about each column -------------------------
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20640 entries, 0 to 20639
Data columns (total 10 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   longitude           20640 

| | longitude | latitude | housing_median_age | total_rooms | total_bedrooms | population | households | median_income | median_house_value |
|---|---|---|---|---|---|---|---|---|---|
| count | 20640.0 | 20640.0 | 20640.0 | 20640.0 | 20433.0 | 20640.0 | 20640.0 | 20640.0 | 20640.0 |
| mean | -119.569704 | 35.631861 | 28.639486 | 2635.763081 | 537.870553 | 1425.476744 | 499.53968 | 3.870671 | 206855.816909 |
| std | 2.003532 | 2.135952 | 12.585558 | 2181.615252 | 421.38507 | 1132.462122 | 382.329753 | 1.899822 | 115395.615874 |
| min | -124.35 | 32.54 | 1.0 | 2.0 | 1.0 | 3.0 | 1.0 | 0.4999 | 14999.0 |
| 25% | -121.8 | 33.93 | 18.0 | 1447.75 | 296.0 | 787.0 | 280.0 | 2.5634 | 119600.0 |
| 50% | -118.49 | 34.26 | 29.0 | 2127.0 | 435.0 | 1166.0 | 409.0 | 3.5348 | 179700.0 |
| 75% | -118.01 | 37.71 | 37.0 | 3148.0 | 647.0 | 1725.0 | 605.0 | 4.74325 | 264725.0 |
| max | -114.31 | 41.95 | 52.0 | 39320.0 | 6445.0 | 35682.0 | 6082.0 | 15.0001 | 500001.0 |


## 2. Dropping the ocean_proximity column

In [5]:
# Dropping the column with non-numeric values
housing.drop('ocean_proximity', inplace=True, axis=1)
print(housing.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20640 entries, 0 to 20639
Data columns (total 9 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   longitude           20640 non-null  float64
 1   latitude            20640 non-null  float64
 2   housing_median_age  20640 non-null  float64
 3   total_rooms         20640 non-null  float64
 4   total_bedrooms      20433 non-null  float64
 5   population          20640 non-null  float64
 6   households          20640 non-null  float64
 7   median_income       20640 non-null  float64
 8   median_house_value  20640 non-null  float64
dtypes: float64(9)
memory usage: 1.4 MB
None
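
The `info()` output lists 20433 non-null values for `total_bedrooms` against 20640 rows, i.e. 207 missing entries, which is why an imputer is needed later. A minimal sketch of how such gaps can be counted per column, using a small made-up frame (`toy` and its values are illustrative, not taken from housing.csv):

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the housing data (values are made up).
toy = pd.DataFrame({
    "total_rooms": [880.0, 7099.0, 1467.0, 1274.0],
    "total_bedrooms": [129.0, np.nan, 190.0, np.nan],
})

# Number of missing values per column.
missing_per_column = toy.isna().sum()
print(missing_per_column)
```

On the real data, the same call would be `housing.isna().sum()`.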


## 3. Splitting the data into a training and test set

In [6]:
"""
Splitting the dataset
    test_size = the proportion of the dataset to include in the test split
    shuffle = shuffle data before splitting
    random_state = result reproducibility; fixes the seed of the random number generator
"""

train_set, test_set = train_test_split(housing, test_size=0.2, shuffle=True, random_state=42)

In [7]:
# Separating target and predictor data

X_train = train_set.drop('median_house_value', axis=1).values
y_train = train_set.median_house_value.values

X_test = test_set.drop('median_house_value', axis=1).values
y_test = test_set.median_house_value.values

## 4. Pipeline with SimpleImputer and std_scaler

In [8]:
# Create the pipeline object
pipeline = Pipeline([
        ('imputer', SimpleImputer(strategy="median")),
        ('std_scaler', StandardScaler()),
    ])


#  Fit and transform the train set
train_set_transformed = pipeline.fit_transform(X_train)
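
To illustrate why the pipeline is fitted on the training set only, here is a minimal sketch on a made-up single-feature array (`toy_train`, `toy_test` and `toy_pipeline` are hypothetical names). The median and the scaling statistics are learned from the training rows and then reused unchanged on the test rows:

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Toy single-feature data (made up); np.nan marks a missing value.
toy_train = np.array([[1.0], [2.0], [np.nan], [9.0]])
toy_test = np.array([[np.nan], [2.0]])

toy_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("std_scaler", StandardScaler()),
])

# fit_transform learns the median and the mean/std from the training data only.
toy_train_scaled = toy_pipeline.fit_transform(toy_train)

# transform reuses those training statistics; nothing is re-estimated.
toy_test_scaled = toy_pipeline.transform(toy_test)

# The imputer stored the training median (median of 1, 2 and 9).
learned_median = toy_pipeline.named_steps["imputer"].statistics_[0]
print(learned_median)
print(toy_test_scaled)
```

The missing test value is filled with the training median, so both test rows end up with the same scaled value.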

## 5. GridSearchCV to choose a model

In [9]:
param_grid = [{'n_estimators': [2, 5, 10, 100]}]

forest_reg = RandomForestRegressor(random_state=42)
grid_search = GridSearchCV(
                        forest_reg, 
                        param_grid, 
                        cv=3,
                        scoring='neg_mean_squared_error',
                        return_train_score=True
)

grid_search.fit(train_set_transformed, y_train)

In [10]:
grid_search.best_params_

{'n_estimators': 100}
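
Beyond `best_params_`, the fitted search object also stores the scores of every candidate in `cv_results_`. The sketch below shows one way to compare them. It runs on synthetic data from `make_regression` rather than the housing data, and `demo_search`, `X_demo`, `y_demo` and `summary` are hypothetical names:

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic regression data as a stand-in for the housing features.
X_demo, y_demo = make_regression(
    n_samples=200, n_features=5, noise=10.0, random_state=42
)

demo_search = GridSearchCV(
    RandomForestRegressor(random_state=42),
    [{"n_estimators": [2, 5, 10, 100]}],
    cv=3,
    scoring="neg_mean_squared_error",
    return_train_score=True,
)
demo_search.fit(X_demo, y_demo)

# One row per candidate; scores are negated MSE, so flip the sign and take the root.
results = pd.DataFrame(demo_search.cv_results_)
summary = pd.DataFrame({
    "n_estimators": results["param_n_estimators"],
    "cv_rmse": np.sqrt(-results["mean_test_score"]),
    "train_rmse": np.sqrt(-results["mean_train_score"]),
})
print(summary)
```

A large gap between `train_rmse` and `cv_rmse` for a candidate is a hint that it fits the training folds much better than unseen data.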

In [11]:
grid_search.best_estimator_

In [17]:
# best_score_ is the negated mean cross-validated MSE; its root is an RMSE in dollars
best_train_cv_score = np.sqrt(-grid_search.best_score_)
print(f' The prediction of the median house value differs on average by {best_train_cv_score} dollars from the true value')

 The prediction of the median house value differs on average by 50617.67022788325 dollars from the true value


## 6. Model evaluation

- Conclusion?
- What could be the next step? 
- Is the POC convincing enough, or is it not worthwhile continuing? 
- Do we need to dig deeper into this before taking some decisions?

In [13]:
# Transforming the test data before use
test_set_transformed = pipeline.transform(X_test)

# The best and selected model. Used to make a prediction on the test data
selected_model = grid_search.best_estimator_
y_pred = selected_model.predict(test_set_transformed)

In [14]:
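# Root of the MSE; avoids the `squared=False` argument, which newer scikit-learn versions no longer accept
rmse_score = np.sqrt(mean_squared_error(y_test, y_pred))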
print(rmse_score)

49875.648686594046
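
As a reminder of what this number measures, the following sketch writes the RMSE out by hand on three made-up predictions (all values are illustrative) and compares it with the mean absolute error. Because the errors are squared before averaging, large misses weigh more heavily, so the RMSE is not the same as the plain average difference:

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Made-up true and predicted house values, for illustration only.
y_true_demo = np.array([200_000.0, 150_000.0, 300_000.0])
y_pred_demo = np.array([210_000.0, 140_000.0, 330_000.0])

# RMSE written out: square the errors, average them, take the root.
errors = y_pred_demo - y_true_demo
rmse_manual = np.sqrt(np.mean(errors ** 2))

# Same value via scikit-learn's MSE.
rmse_sklearn = np.sqrt(mean_squared_error(y_true_demo, y_pred_demo))

# Mean absolute error, for comparison.
mae = np.mean(np.abs(errors))

print(rmse_manual, rmse_sklearn, mae)
```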


In [41]:
mean_mhv = housing['median_house_value'].mean()
mean_age = housing['housing_median_age'].mean()
mean_income = housing['median_income'].mean()
diff_per = (rmse_score/mean_mhv) * 100

In [47]:
data = {
    'Training RMSE': [best_train_cv_score], 
    'Test RMSE': [rmse_score], 
    'Mean Median House Value (MHV)': [mean_mhv],
    'Avg. difference predicted vs. actual MHV': [diff_per],
    'Avg. age': [mean_age],
}

df = pd.DataFrame(data)

print('----------------------- Data Summary --------------------------')
print()
print(df)

----------------------- Data Summary --------------------------

   Training RMSE     Test RMSE  Mean Median House Value (MHV)  \
0   50617.670228  49875.648687                  206855.816909   

   Avg. difference predicted vs. actual MHV   Avg. age  
0                                 24.111311  28.639486  


### Conclusion

Based on the Root Mean Square Error (RMSE) from cross-validation on the training data (50 618) and on the test data (49 876), the model does not yet perform well for these median house values. The test RMSE corresponds to roughly 24% of the mean median house value (206 856 dollars), so a typical prediction error is on the order of 50 000 dollars. Because RMSE weights large errors more heavily, this is not the same as every prediction being within 24% of the true value. Note that `housing_median_age` (mean of about 28.6) is the median age of the houses in a district, not the age of their owners, so it says nothing about the buyers. An error of this size is nonetheless substantial relative to the prices involved: if the model underestimates a price, the seller could lose tens of thousands of dollars, which also hurts the real estate company.

It is worth noting that the test RMSE is slightly lower than the cross-validated RMSE (labelled "Training RMSE" in the summary table): about 740 dollars, or roughly 1.5%. This does not indicate overfitting, which would instead show up as a test error clearly higher than the training error. A small gap in this direction is to some extent expected, because each cross-validation model is trained on only two thirds of the training data (cv=3), while the final model is refit on all of it. It may also be partly a coincidence, given that there is only a single train/test split.
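
The percentages discussed in this conclusion can be re-derived from the numbers printed earlier in the notebook. A short sketch of that arithmetic, with the three values copied from the outputs above (the variable names are only for illustration):

```python
cv_rmse = 50617.67022788325      # cross-validated RMSE reported above
test_rmse = 49875.648686594046   # test-set RMSE reported above
mean_mhv = 206855.816909         # mean of median_house_value from describe()

# Test RMSE expressed as a percentage of the mean median house value.
relative_rmse = test_rmse / mean_mhv * 100

# Absolute and relative gap between cross-validated and test RMSE.
gap = cv_rmse - test_rmse
gap_pct = gap / cv_rmse * 100

print(f"Test RMSE as share of mean MHV: {relative_rmse:.1f}%")
print(f"CV RMSE minus test RMSE: {gap:.0f} dollars ({gap_pct:.1f}%)")
```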

Currently, the POC is not convincing enough, although it can be improved by further analysis and refinement. Additional work, such as hyperparameter tuning, trying different models, or cleaning up the data, may be necessary to improve the model's performance.