# CPSC 4300/6300-001 Applied Data Science (Fall 2020)

Before you turn this problem in, make sure everything runs as expected. First, **restart the kernel** (in the menubar, select Kernel$\rightarrow$Restart) and then **run all cells** (in the menubar, select Cell$\rightarrow$Run All).

Make sure you fill in any place that says `YOUR CODE HERE` or "YOUR ANSWER HERE", as well as your name and collaborators below:

In [1]:
NAME = "Wenkang Wei"
COLLABORATORS = "Wenkang Wei"

# CPSC4300/6300-001 Problem Set #3

In this problem set, you work on an end-to-end data science project using the California Housing Data from the StatLib repository (http://lib.stat.cmu.edu/datasets/).

The original data appeared in Pace, R. Kelley and Ronald Barry, "Sparse Spatial Autoregressions," Statistics and Probability Letters, 33 (1997) 291-297 (https://www.sciencedirect.com/science/article/pii/S016771529600140X). 

Aurélien Géron uses this dataset in his book "Hands-on Machine Learning with Scikit-Learn, Keras, and TensorFlow: Concepts, Tools, and Techniques to Build Intelligent Systems, 2nd Edition", O'Reilly Media (2019) (https://www.oreilly.com/library/view/hands-on-machine-learning/9781492032632/). You can access the book through the Clemson library's online subscription (https://learning-oreilly-com.libproxy.clemson.edu/home/). This problem set is adapted from Chapter 2 of Géron's book (a supplemental notebook is available at https://github.com/ageron/handson-ml/blob/master/02_end_to_end_machine_learning_project.ipynb).

## Context

You are a newly hired data science consultant at the Housing Investment Corporation. Your manager asks you to build a housing price prediction model for the California housing market using California census data. An investment analysis team will use the price model to decide whether a given area is worth investing in.

This California Housing dataset includes the population, median income, median housing price, and so on for each block group in California. A block group is the smallest geographical unit for which the US Census Bureau publishes census data. Typically, a block group has a population of 600 to 3,000 people. For short, we call each block group a "district".

Your manager also tells you that, under the company's current practice, a team of experts estimates district housing prices manually: they gather up-to-date information about a district, and when they cannot get the median housing price, they estimate it using complex rules. This practice is costly and time-consuming, and the estimates are not very accurate: the experts often find that their price estimates were off by more than 20%.

Now you come on board. Your manager thinks it would be useful to build a model to predict a district’s median housing price given other data about that district. So, let's start.

# Part A. Basic Data Examination

In [2]:
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"
%matplotlib inline

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

## 1. Frame the Problem (5 points)

Which of the following types of machine learning tasks describe the problem above? (Note that there can be multiple correct choices.)

1. Supervised learning
2. Unsupervised learning
3. Reinforcement learning
4. Classification
5. Regression
6. Univariate regression
7. Multivariate regression
8. Clustering

Save your answers into a list of integers named __task_types__. 

If you are not sure about a choice because you need further information, put a minus sign before that choice and explain why you cannot decide. Save your explanations in a dict named __reasons__ whose keys are the negative ids of the task types. For example, if you are certain that 1, 2, and 3 are right but think 4 might also be right, you should write:
```
task_types = [1, 2, 3, -4]
reasons = {-4: "The investment analysis team might want a price categories instead of price"}
```
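The convention above (a negative id marks an uncertain choice, and `reasons` explains it) can be checked mechanically. The following is a minimal sketch; the helper name `uncertain_without_reason` and the `example_` variables are hypothetical and are not part of the assignment's grader.

```python
def uncertain_without_reason(task_types, reasons):
    """Return the uncertain (negative) choices that have no non-empty explanation."""
    return [t for t in task_types if t < 0 and not reasons.get(t)]

# Illustrative values that mirror the example above (hypothetical names).
example_task_types = [1, 2, 3, -4]
example_reasons = {-4: "The team might want price categories instead of a price"}

# Every uncertain choice is explained, so nothing is reported.
print(uncertain_without_reason(example_task_types, example_reasons))

# Dropping the explanation makes -4 show up as unexplained.
print(uncertain_without_reason(example_task_types, {}))
```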

In [1]:
# YOUR CODE HERE
# raise NotImplementedError()
task_types = [1, -3, -4, 5, 6, 7]
# Adjacent string literals are concatenated, which avoids embedding indentation in the text.
reasons = {
    -3: ("Reinforcement learning lets an agent learn by interacting with an environment, online or offline. "
         "Here, predicting the median housing price could be treated as an action, with a reward function "
         "computed from the given labels and the predictions. "
         "Whether to frame the task this way depends on the strategy the team chooses."),
    -4: ("The analysis team might want to discretize the median housing price into ranges and do "
         "multiclass classification instead. It may depend on the strategy they want."),
}
task_types

[1, -3, -4, 5, 6, 7]

In [2]:
assert task_types is not None and type(task_types).__name__ == "list"

In [3]:
assert 1 in task_types or -1 in task_types

In [4]:
assert 5 in task_types or -5 in task_types

In [5]:
assert 6 in task_types or -6 in task_types

In [6]:
assert 7 in task_types or -7 in task_types

## 2. Organize Project Files (6 points)

When you work on a project, a good practice is to arrange your project files into multiple folders such as input, lib, src, models, notebooks, doc, etc. The function below can be used to create a project template.

In [7]:
import os
import subprocess
def create_project_temple(basedir=os.getcwd(), subdirs=["input"], mode=0o755):
    for d in subdirs:
        absolute_path = os.path.join(basedir, d)
        if not os.path.exists(absolute_path):
            os.makedirs(absolute_path, mode=mode)
    for f in ["README.md", "LICENSE"]:
        subprocess.call(["touch", os.path.join(basedir, f)])
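The function above relies on the external `touch` command, which is not available on every platform. As a rough sketch of the same idea using only the standard library, the variant below uses `pathlib`; the name `create_project_template_portable` is hypothetical, and the temporary directory is used only so the example does not write into your project.

```python
import tempfile
from pathlib import Path

def create_project_template_portable(basedir, subdirs=("input",), mode=0o755):
    """Create the subfolders plus empty README.md and LICENSE files under basedir."""
    base = Path(basedir)
    for d in subdirs:
        # exist_ok=True makes the call safe to repeat on an existing folder
        (base / d).mkdir(mode=mode, parents=True, exist_ok=True)
    for f in ("README.md", "LICENSE"):
        # Path.touch creates the file if it is missing, like the `touch` command
        (base / f).touch(exist_ok=True)
    return base

# Try it in a throwaway directory and list what was created.
with tempfile.TemporaryDirectory() as tmp:
    create_project_template_portable(tmp, subdirs=("input", "models"))
    created = sorted(p.name for p in Path(tmp).iterdir())
print(created)
```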

__Question 2(a)__. __Use the function `create_project_temple` to create a project template that includes subfolders: `input`, `models`, `src`, `figs`, and `report`__. (2 points)

In [8]:
# YOUR CODE HERE
# raise NotImplementedError()
create_project_temple(subdirs=["input", "models","src","figs","report"])

In [9]:
# list the project file in a tree structure
!tree -a

[01;34m.[0m
├── [01;34mfigs[0m
├── [01;34minput[0m
│   ├── housing.csv
│   ├── housing_prediction_scaled.csv
│   ├── housing_prediction_scaled_y.csv
│   ├── housing_test_cleaned.csv
│   ├── housing_test_scaled.csv
│   ├── housing_train_cleaned.csv
│   └── housing_train_scaled.csv
├── [01;34m.ipynb_checkpoints[0m
│   ├── part_a-checkpoint.ipynb
│   ├── part_b-checkpoint.ipynb
│   ├── part_c-checkpoint.ipynb
│   └── part_d-checkpoint.ipynb
├── LICENSE
├── .mapbox_token
├── [01;34mmodels[0m
├── part_a.ipynb
├── part_b.ipynb
├── part_c.ipynb
├── part_d.ipynb
├── README.md
├── [01;34mreport[0m
└── [01;34msrc[0m
    ├── datautil.py
    └── [01;34m__pycache__[0m
        └── datautil.cpython-38.pyc

7 directories, 20 files


In [10]:
# Check that all required folders exist
import os
basedir = os.getcwd()
assert all([os.path.exists(os.path.join(basedir, d)) for d in ['input', 'models', 'src', 'figs', 'report']])

__Question 2(b)__. __To make the created subfolders accessible only to yourself, which value should you assign to the argument `mode` in `create_project_temple`?__ (2 points)

```
A. 0o700
B. 0o644
C. 700
D. 0o600
```

Assign your answer to a string variable `answer`.

In [11]:
# YOUR CODE HERE
# raise NotImplementedError()
answer= 'A'
import hashlib
hashlib.md5(answer.encode()).hexdigest()

'7fc56270e7a70fa81a5935b72eacbe29'

In [12]:
# There is a hidden test in this cell.
import hashlib
assert hashlib.md5(answer.encode()).hexdigest() == '7fc56270e7a70fa81a5935b72eacbe29'
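To see why the `0o` prefix and the individual permission bits matter, the sketch below prints each candidate value in octal together with the permission string that `ls -l` would show for a directory with those bits. The `candidates` dict is only an illustration of the four options listed in the question.

```python
import stat

# The four candidate values from the question; note that 700 has no 0o prefix,
# so Python reads it as a decimal number rather than an octal one.
candidates = {"A": 0o700, "B": 0o644, "C": 700, "D": 0o600}

for label, mode in candidates.items():
    # stat.filemode renders the bits the way `ls -l` does; S_IFDIR marks a directory.
    print(label, oct(mode), stat.filemode(stat.S_IFDIR | mode))
```

Keep in mind that a directory needs the execute bit to be entered, and that the permissions actually applied by `os.makedirs` are also subject to the process umask.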

__Question 2(c)__. Assume you want to put your project subfolders on the parallel file system under the folder `/scratch1/<your-user-name>/cpsc6300/housing` on Palmetto so that your program gets better I/O performance. __What value would you provide for the `basedir` argument?__ (2 points)

In [13]:
import getpass
my_user_name= getpass.getuser()
basedir = '/scratch1/'+my_user_name+'/'+'cpsc6300/housing'
# YOUR CODE HERE
# raise NotImplementedError()
basedir

'/scratch1/wenkanw/cpsc6300/housing'

In [14]:
import getpass
assert len(basedir.split("/")) == 5
assert basedir.split("/")[1] == 'scratch1' and basedir.split("/")[2] == getpass.getuser()


## 3. Get the Data (6 points)

Although you can always download and extract a dataset manually, a better practice is to automate the download and cache the data locally. In practice, it is important to keep a local copy of a massive dataset on a fast and reliable storage system.

Below is the source code that loads the housing data. Because the code is required by your notebook but is not essential to your model development, you can put it in the `src` subfolder to keep your notebook clean.

In [15]:
%%writefile src/datautil.py
import os
import requests
import pandas as pd

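def download_data(data_url, file_path):
    # Note: verify=False disables TLS certificate verification for this request.
    r = requests.get(data_url, verify=False)
    # Fail loudly on an HTTP error so an error page is never cached as the dataset.
    r.raise_for_status()
    with open(file_path, "wb") as f:
        f.write(r.content)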

def load_data(data_url, local_cached_datafile):
    if not os.path.exists(local_cached_datafile):
        if not os.path.exists(os.path.dirname(local_cached_datafile)):
            os.makedirs(os.path.dirname(local_cached_datafile))
        download_data(data_url, local_cached_datafile)
    return pd.read_csv(local_cached_datafile)

def load_housing_data():
    data_url = 'https://webapp02.palmetto.clemson.edu/dsci/datasets/housing/housing.csv'
    input_dir = os.path.join(os.getcwd(), 'input', 'housing.csv')
    return load_data(data_url, input_dir)

Overwriting src/datautil.py


### Python module search path

When working with Python, it is necessary to understand how the Python interpreter finds the module specified in the `import` statement. Below is an excerpt from the Python documentation https://docs.python.org/3/tutorial/modules.html:

```
When a module named spam is imported, the interpreter first searches for a built-in module with that name. If not found, it then searches for a file named spam.py in a list of directories given by the variable sys.path. sys.path is initialized from these locations:

1. The directory containing the input script (or the current directory when no file is specified).
2. PYTHONPATH (a list of directory names, with the same syntax as the shell variable PATH).
3. The installation-dependent default.

After initialization, Python programs can modify sys.path. The directory containing the script being run is placed at the beginning of the search path, ahead of the standard library path. This means that scripts in that directory will be loaded instead of modules of the same name in the library directory.
```

To import Python modules that are not on the current search path, you can add the directory that contains them to `sys.path`. For example, to ensure that the path `/zfs/courses/CPSC6300/anaconda3/lib/python3.8/site-packages` is included in `sys.path`, and only once, you can use the following code:

In [16]:
import sys
course_python_packages_dir = '/zfs/courses/CPSC6300/anaconda3/lib/python3.8/site-packages'
if course_python_packages_dir not in sys.path:
    sys.path.append(course_python_packages_dir)
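Where a directory lands in `sys.path` also matters, because the interpreter searches the list in order. The sketch below wraps the "add only once" pattern in a helper with an optional `prepend` flag; the helper name `add_to_sys_path` and the directory `/tmp/hypothetical_project/src` are illustrative assumptions.

```python
import sys

def add_to_sys_path(path, prepend=False):
    """Add `path` to sys.path once; return True if it was added, False if already present."""
    if path in sys.path:
        return False
    if prepend:
        sys.path.insert(0, path)   # searched first, so it can shadow same-named modules
    else:
        sys.path.append(path)      # searched last
    return True

hypothetical_dir = "/tmp/hypothetical_project/src"  # illustrative path; it need not exist
first = add_to_sys_path(hypothetical_dir)
second = add_to_sys_path(hypothetical_dir)   # second call is a no-op
print(first, second, sys.path.count(hypothetical_dir))
```

Appending is usually the safer default, since a prepended directory takes precedence over the standard library for any module with the same name.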

__Question 3(a)__. __Modify the Python search path to ensure that your project's `src` subfolder (which contains `datautil.py`) is included in `sys.path`, and only once.__ (3 points)

In [17]:
import sys

# YOUR CODE HERE
# raise NotImplementedError()
module_path = os.getcwd()+'/src'
if module_path not in sys.path:
    sys.path.append(module_path)
    print("Add module path:",module_path)
sys.path

Add module path: /home/wenkanw/cpsc6300-001/ps03/src


['/home/wenkanw/cpsc6300-001/ps03',
 '',
 '/software/spackages/linux-centos8-x86_64/gcc-8.3.1/opencv-4.2.0-ox4iebcjsf4q6r2m2jxlw7hdcpvtvgqu/lib/python3.7/site-packages',
 '/software/spackages/linux-centos8-x86_64/gcc-8.3.1/vtk-9.0.0-gh57nkhkfx5vrrhvsn7eqap7j5lp3ipd/lib/python3.7/site-packages',
 '/zfs/courses/CPSC6300/anaconda3/lib/python38.zip',
 '/zfs/courses/CPSC6300/anaconda3/lib/python3.8',
 '/zfs/courses/CPSC6300/anaconda3/lib/python3.8/lib-dynload',
 '/home/wenkanw/.local/lib/python3.8/site-packages',
 '/zfs/courses/CPSC6300/anaconda3/lib/python3.8/site-packages',
 '/zfs/courses/CPSC6300/anaconda3/lib/python3.8/site-packages/IPython/extensions',
 '/home/wenkanw/.ipython',
 '/home/wenkanw/cpsc6300-001/ps03/src']

In [18]:
basedir = os.getcwd()
script_dir = os.path.join(basedir, "src")
assert script_dir in sys.path

In [19]:
assert len([p for p in sys.path if p == script_dir]) == 1

__Question 3(b)__. __Write some code to load the housing data using the `load_housing_data()` function in the `datautil` module. The data needs to be in your project's `input` subfolder.__ (3 points)

In [20]:
# YOUR CODE HERE
# raise NotImplementedError()
from datautil import load_housing_data  # import only the function that is used
housing = load_housing_data()
housing.head(3)

| | longitude | latitude | housing_median_age | total_rooms | total_bedrooms | population | households | median_income | median_house_value | ocean_proximity |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -122.23 | 37.88 | 41.0 | 880.0 | 129.0 | 322.0 | 126.0 | 8.3252 | 452600.0 | NEAR BAY |
| 1 | -122.22 | 37.86 | 21.0 | 7099.0 | 1106.0 | 2401.0 | 1138.0 | 8.3014 | 358500.0 | NEAR BAY |
| 2 | -122.24 | 37.85 | 52.0 | 1467.0 | 190.0 | 496.0 | 177.0 | 7.2574 | 352100.0 | NEAR BAY |


In [21]:
import os
assert housing is not None and type(housing).__name__ == 'DataFrame'
assert housing.shape == (20640, 10)
assert os.path.exists('input/housing.csv')

## 4. Examine the Data (10 points)

Answer each of the following questions (2 points per question)

__Question 4(a)__. Use `housing.info()` to get a quick overview of the housing data set. According to the output, which column has missing data and which column holds categorical data? Save your answers to two string variables, `column_with_missing_data` and `column_with_categorical_data`.

In [22]:
housing.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20640 entries, 0 to 20639
Data columns (total 10 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   longitude           20640 non-null  float64
 1   latitude            20640 non-null  float64
 2   housing_median_age  20640 non-null  float64
 3   total_rooms         20640 non-null  float64
 4   total_bedrooms      20433 non-null  float64
 5   population          20640 non-null  float64
 6   households          20640 non-null  float64
 7   median_income       20640 non-null  float64
 8   median_house_value  20640 non-null  float64
 9   ocean_proximity     20640 non-null  object 
dtypes: float64(9), object(1)
memory usage: 1.6+ MB


In [23]:
# YOUR CODE HERE
# raise NotImplementedError()
column_with_missing_data = "total_bedrooms"
column_with_categorical_data = "ocean_proximity"
column_with_missing_data, column_with_categorical_data

('total_bedrooms', 'ocean_proximity')

__Question 4(b)__. Write some Python code to find and print the column(s) with missing data and the total number of missing values in that column. 

Hint: housing.count() returns a Series whose indices are the column names and values are the counts of non-missing values. 

In [24]:
# YOUR CODE HERE
# raise NotImplementedError()
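# Count non-missing values once, then report how many values each incomplete column is missing.
non_null_counts = housing.count()
col_missing_data = [(c, int(len(housing) - non_null_counts[c]))
                    for c in housing.columns if non_null_counts[c] < len(housing)]
print(col_missing_data)

[('total_bedrooms', 207)]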
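Besides the `count()` approach from the hint, missing values can be counted directly with `isna().sum()`. The sketch below shows this on a small made-up frame; the helper name `missing_value_counts` and the `toy` data are illustrative and not part of the housing data.

```python
import numpy as np
import pandas as pd

def missing_value_counts(df):
    """Return the number of missing values per column, keeping only columns that have any."""
    counts = df.isna().sum()
    return counts[counts > 0]

# Toy data with one missing value in total_bedrooms (illustrative only).
toy = pd.DataFrame({
    "total_rooms": [880.0, 7099.0, 1467.0],
    "total_bedrooms": [129.0, np.nan, 190.0],
})
toy_missing = missing_value_counts(toy)
print(toy_missing)
```

Applied to the real data, the same helper would be called as `missing_value_counts(housing)`.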

__Question 4(c)__. __Write some Python code to find which categories exist in the `ocean_proximity` column and which category has the most districts.__ Save all categories to a Series variable `cats` and the name of the category with the most districts to a variable `cat_most_district`.

Hints: you may need a Series' methods like `value_counts()` and `sort_values()`.

In [25]:
housing['ocean_proximity'].value_counts().sort_values()

ISLAND           5
NEAR BAY      2290
NEAR OCEAN    2658
INLAND        6551
<1H OCEAN     9136
Name: ocean_proximity, dtype: int64
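Sorting is one way to find the largest category; `value_counts()` also returns its result in descending order by default, and `idxmax()` returns the label of the largest count without relying on the order at all. A small sketch on made-up data (the `toy` Series is illustrative):

```python
import pandas as pd

# Toy category column (illustrative values only).
toy = pd.Series(
    ["INLAND", "<1H OCEAN", "NEAR BAY", "<1H OCEAN", "INLAND", "<1H OCEAN"],
    name="ocean_proximity",
)

toy_counts = toy.value_counts()          # descending by count by default
toy_most_common = toy_counts.idxmax()    # label of the largest count
print(toy_counts)
print(toy_most_common)
```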

In [26]:
# YOUR CODE HERE
# raise NotImplementedError()
# sorting in ascending order
cats = housing['ocean_proximity'].value_counts().sort_values()
cat_most_district = cats.index[-1]
print(cats)
print(cat_most_district)

ISLAND           5
NEAR BAY      2290
NEAR OCEAN    2658
INLAND        6551
<1H OCEAN     9136
Name: ocean_proximity, dtype: int64
<1H OCEAN


__Question 4(d)__. Use the `housing.describe()` method to get a quick summary of the numerical attributes. __Which two columns are most likely to have their maximum value capped?__ Save your answer to a list of strings named `capper_columns`.

In [27]:
housing.describe()

| | longitude | latitude | housing_median_age | total_rooms | total_bedrooms | population | households | median_income | median_house_value |
|---|---|---|---|---|---|---|---|---|---|
| count | 20640.0 | 20640.0 | 20640.0 | 20640.0 | 20433.0 | 20640.0 | 20640.0 | 20640.0 | 20640.0 |
| mean | -119.569704 | 35.631861 | 28.639486 | 2635.763081 | 537.870553 | 1425.476744 | 499.53968 | 3.870671 | 206855.816909 |
| std | 2.003532 | 2.135952 | 12.585558 | 2181.615252 | 421.38507 | 1132.462122 | 382.329753 | 1.899822 | 115395.615874 |
| min | -124.35 | 32.54 | 1.0 | 2.0 | 1.0 | 3.0 | 1.0 | 0.4999 | 14999.0 |
| 25% | -121.8 | 33.93 | 18.0 | 1447.75 | 296.0 | 787.0 | 280.0 | 2.5634 | 119600.0 |
| 50% | -118.49 | 34.26 | 29.0 | 2127.0 | 435.0 | 1166.0 | 409.0 | 3.5348 | 179700.0 |
| 75% | -118.01 | 37.71 | 37.0 | 3148.0 | 647.0 | 1725.0 | 605.0 | 4.74325 | 264725.0 |
| max | -114.31 | 41.95 | 52.0 | 39320.0 | 6445.0 | 35682.0 | 6082.0 | 15.0001 | 500001.0 |


In [29]:
housing.describe()
# YOUR CODE HERE
# raise NotImplementedError()
capper_columns = ["median_house_value","housing_median_age"]
capper_columns

['median_house_value', 'housing_median_age']

__Question 4(e)__. __Write some Python code to find out how many districts have a median_house_value greater than or equal to (>=) the maximum median_house_value.__

In [34]:
max_median_house_value = housing.describe().loc['max', 'median_house_value']
# YOUR CODE HERE
# raise NotImplementedError()
num_expensive_districts = len(housing[housing['median_house_value']>=max_median_house_value])
num_expensive_districts


965
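A count like this is one signal of capping: when many rows sit exactly at a column's maximum, the maximum is probably an artificial ceiling. The sketch below computes, for each numeric column, the share of rows equal to that column's maximum; the helper name `share_at_max` and the `toy` frame are illustrative assumptions.

```python
import pandas as pd

def share_at_max(df):
    """For each numeric column, return the fraction of rows equal to the column maximum."""
    numeric = df.select_dtypes(include="number")
    # eq() compares every value with its own column's maximum; mean() of booleans is a fraction
    return numeric.eq(numeric.max()).mean().sort_values(ascending=False)

# Toy data: median_house_value repeats its maximum, median_income does not.
toy = pd.DataFrame({
    "median_house_value": [500001.0, 500001.0, 500001.0, 179700.0, 119600.0],
    "median_income": [15.0001, 3.5348, 2.5634, 4.7433, 0.4999],
})
toy_share = share_at_max(toy)
print(toy_share)
```

A large share at the maximum is only a hint, not proof, so it is worth confirming against how the data was collected.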

## 5. Data Query Question (5 points)

Write one of your own data query questions that you want to know about the data and provide a solution.

Write your question in this box.

Query Question: 
1. Query all districts whose median_house_value lies between Q1 - 1.5*(Q3-Q1) and Q3 + 1.5*(Q3-Q1), inclusive, where Q1 and Q3 are the 25th and 75th percentiles, respectively.

In [32]:
# Write your answer in this box.
# YOUR CODE HERE
# raise NotImplementedError()
# Compute the summary once and read both quartiles from it.
house_value_summary = housing["median_house_value"].describe()
Q1_median_house_value = house_value_summary["25%"]
Q3_median_house_value = house_value_summary["75%"]
IQR = Q3_median_house_value - Q1_median_house_value
lower_bound = Q1_median_house_value - 1.5 * IQR
upper_bound = Q3_median_house_value + 1.5 * IQR

inter_range_districts = housing.loc[(housing["median_house_value"] >=lower_bound) & 
                                    (housing["median_house_value"] <=upper_bound)]
inter_range_districts

| | longitude | latitude | housing_median_age | total_rooms | total_bedrooms | population | households | median_income | median_house_value | ocean_proximity |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -122.23 | 37.88 | 41.0 | 880.0 | 129.0 | 322.0 | 126.0 | 8.3252 | 452600.0 | NEAR BAY |
| 1 | -122.22 | 37.86 | 21.0 | 7099.0 | 1106.0 | 2401.0 | 1138.0 | 8.3014 | 358500.0 | NEAR BAY |
| 2 | -122.24 | 37.85 | 52.0 | 1467.0 | 190.0 | 496.0 | 177.0 | 7.2574 | 352100.0 | NEAR BAY |
| 3 | -122.25 | 37.85 | 52.0 | 1274.0 | 235.0 | 558.0 | 219.0 | 5.6431 | 341300.0 | NEAR BAY |
| 4 | -122.25 | 37.85 | 52.0 | 1627.0 | 280.0 | 565.0 | 259.0 | 3.8462 | 342200.0 | NEAR BAY |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 20635 | -121.09 | 39.48 | 25.0 | 1665.0 | 374.0 | 845.0 | 330.0 | 1.5603 | 78100.0 | INLAND |
| 20636 | -121.21 | 39.49 | 18.0 | 697.0 | 150.0 | 356.0 | 114.0 | 2.5568 | 77100.0 | INLAND |
| 20637 | -121.22 | 39.43 | 17.0 | 2254.0 | 485.0 | 1007.0 | 433.0 | 1.7000 | 92300.0 | INLAND |
| 20638 | -121.32 | 39.43 | 18.0 | 1860.0 | 409.0 | 741.0 | 349.0 | 1.8672 | 84700.0 | INLAND |
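The same 1.5 × IQR rule can be packaged as a small reusable helper built on `Series.quantile` and `Series.between` (which is inclusive on both ends by default). This is a sketch on made-up numbers; the function name `iqr_bounds` and the `toy` Series are illustrative.

```python
import pandas as pd

def iqr_bounds(values, k=1.5):
    """Return (lower, upper) fences at Q1 - k*IQR and Q3 + k*IQR for a numeric Series."""
    q1, q3 = values.quantile(0.25), values.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Toy values with one obvious outlier (illustrative only).
toy = pd.Series([100.0, 110.0, 120.0, 130.0, 140.0, 1000.0])
low, high = iqr_bounds(toy)

# Keep only the values inside the fences.
inliers = toy[toy.between(low, high)]
print(low, high)
print(inliers.tolist())
```

On the housing data, the equivalent call would pass `housing["median_house_value"]` instead of `toy`.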


__End of Part A__.