# Amazon SageMaker Autopilot Data Exploration

This report provides insights about the dataset you provided as input to the AutoML job.
It was automatically generated by the AutoML training job: **housing-automl**.

As part of the AutoML job, the input dataset was randomly split into two pieces, one for **training** and one for
**validation**. The training dataset was randomly sampled, and metrics were computed for each of the columns.
This notebook provides these metrics so that you can:

1. Understand how the job analyzed features to select the candidate pipelines.
2. Modify and improve the generated AutoML pipelines using knowledge that you have about the dataset.

We read **`16555`** rows from the training dataset.
The dataset has **`10`** columns and the column named **`median_house_value`** is used as the target column.
This is identified as a **`Regression`** problem.
The labels were found to be within the range `[14999.0, 500001.0]`.

<div class="alert alert-info"> 💡 <strong> Suggested Action Items</strong>

- Look for sections like this for recommended actions that you can take.
</div>


---

## Contents
1. [Dataset Sample](#Dataset-Sample)
1. [Column Analysis](#Column-Analysis)
---


## Dataset Sample
The following table is a random sample of **10** rows from the training dataset.

<div class="alert alert-info"> 💡 <strong> Suggested Action Items</strong>

- Verify that the input headers align correctly with the columns of the dataset sample.
    If they are incorrect, update the header names of your input dataset in Amazon Simple Storage Service (Amazon S3).
</div>


<div>
<style scoped>
    .dataframe tbody tr th:only-of-type {
        vertical-align: middle;
    }

    .dataframe tbody tr th {
        vertical-align: top;
    }

    .dataframe thead th {
        text-align: right;
    }
</style>
<table border="1" class="dataframe">
  <thead>
    <tr style="text-align: right;">
      <th></th>
      <th>longitude</th>
      <th>latitude</th>
      <th>housing_median_age</th>
      <th>total_rooms</th>
      <th>total_bedrooms</th>
      <th>population</th>
      <th>households</th>
      <th>median_income</th>
      <th>ocean_proximity</th>
      <th>median_house_value</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <th>0</th>
      <td>-117.82</td>
      <td>33.64</td>
      <td>18</td>
      <td>1974</td>
      <td>260</td>
      <td>808</td>
      <td>278</td>
      <td>9.8589</td>
      <td>&lt;1H OCEAN</td>
      <td>500001</td>
    </tr>
    <tr>
      <th>1</th>
      <td>-116.38</td>
      <td>33.73</td>
      <td>10</td>
      <td>11836</td>
      <td>2405</td>
      <td>3811</td>
      <td>1570</td>
      <td>4.0079</td>
      <td>INLAND</td>
      <td>134500</td>
    </tr>
    <tr>
      <th>2</th>
      <td>-117.16</td>
      <td>34.08</td>
      <td>9</td>
      <td>5306</td>
      <td>993</td>
      <td>2630</td>
      <td>925</td>
      <td>4.51</td>
      <td>INLAND</td>
      <td>135800</td>
    </tr>
    <tr>
      <th>3</th>
      <td>-117.85</td>
      <td>33.83</td>
      <td>26</td>
      <td>1904</td>
      <td>292</td>
      <td>945</td>
      <td>303</td>
      <td>5.6784</td>
      <td>&lt;1H OCEAN</td>
      <td>232400</td>
    </tr>
    <tr>
      <th>4</th>
      <td>-121.34</td>
      <td>37.96</td>
      <td>27</td>
      <td>1839</td>
      <td>442</td>
      <td>2010</td>
      <td>416</td>
      <td>2.1284</td>
      <td>INLAND</td>
      <td>59400</td>
    </tr>
    <tr>
      <th>5</th>
      <td>-118.1</td>
      <td>33.84</td>
      <td>36</td>
      <td>1915</td>
      <td>316</td>
      <td>850</td>
      <td>319</td>
      <td>4.7222</td>
      <td>&lt;1H OCEAN</td>
      <td>225800</td>
    </tr>
    <tr>
      <th>6</th>
      <td>-117.99</td>
      <td>33.78</td>
      <td>15</td>
      <td>4273</td>
      <td>993</td>
      <td>2300</td>
      <td>946</td>
      <td>3.5313</td>
      <td>&lt;1H OCEAN</td>
      <td>213000</td>
    </tr>
    <tr>
      <th>7</th>
      <td>-118.06</td>
      <td>33.89</td>
      <td>26</td>
      <td>2483</td>
      <td>412</td>
      <td>1538</td>
      <td>449</td>
      <td>5.1104</td>
      <td>&lt;1H OCEAN</td>
      <td>220500</td>
    </tr>
    <tr>
      <th>8</th>
      <td>-118.41</td>
      <td>33.92</td>
      <td>32</td>
      <td>2590</td>
      <td>607</td>
      <td>1132</td>
      <td>555</td>
      <td>4.2333</td>
      <td>&lt;1H OCEAN</td>
      <td>358000</td>
    </tr>
    <tr>
      <th>9</th>
      <td>-117.99</td>
      <td>33.69</td>
      <td>16</td>
      <td>1476</td>
      <td>294</td>
      <td>886</td>
      <td>270</td>
      <td>5.3259</td>
      <td>&lt;1H OCEAN</td>
      <td>216400</td>
    </tr>
  </tbody>
</table>
</div>



## Column Analysis
The AutoML job analyzed the **`10`** input columns to infer each data type and select
the feature processing pipelines for each training algorithm.
For more details on the specific AutoML pipeline candidates, see [Amazon SageMaker Autopilot Candidate Definition Notebook.ipynb](./SageMakerAutopilotCandidateDefinitionNotebook.ipynb).

### Percent of Missing Values
Within the data sample, the following columns contained missing values, such as `nan`, whitespace, or empty fields.

SageMaker Autopilot will attempt to fill in missing values using various techniques. For example,
missing values can be replaced with a new 'unknown' category for `Categorical` features
and missing `Numerical` values can be replaced with the **mean** or **median** of the column.

We found that **1 of the 10** columns contained missing values.
The following table shows the **1** column with the highest percentage of missing values.

<div class="alert alert-info"> 💡 <strong> Suggested Action Items</strong>

- Investigate the governance of the training dataset. Do you expect this many missing values?
    Are you able to fill in the missing values with real data?
- Use domain knowledge to define an appropriate default value for the feature. Either:
    - Replace all missing values with the new default value in your dataset in Amazon S3.
    - Add a step to the feature pre-processing pipeline to fill missing values, for example with a
    [sklearn.impute.SimpleImputer](https://scikit-learn.org/stable/modules/generated/sklearn.impute.SimpleImputer.html).
</div>

<div>
<style scoped>
    .dataframe tbody tr th:only-of-type {
        vertical-align: middle;
    }

    .dataframe tbody tr th {
        vertical-align: top;
    }

    .dataframe thead th {
        text-align: right;
    }
</style>
<table border="1" class="dataframe">
  <thead>
    <tr style="text-align: right;">
      <th></th>
      <th>% of Missing Values</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <th>total_bedrooms</th>
      <td>0.98%</td>
    </tr>
  </tbody>
</table>
</div>
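
The fill step suggested in the action items above can be sketched in plain Python. This is a minimal illustration, not the pipeline that SageMaker Autopilot generates: the helper names (`is_missing`, `percent_missing`, `impute_median`) and the example values are hypothetical, and the rule for what counts as missing is an assumption based on the description above (`nan`, whitespace, or empty fields).

```python
import math
from statistics import median


def is_missing(value):
    """Assumed rule: None, NaN, 'nan', and empty or whitespace-only strings are missing."""
    if value is None:
        return True
    if isinstance(value, float) and math.isnan(value):
        return True
    if isinstance(value, str) and value.strip().lower() in ("", "nan"):
        return True
    return False


def percent_missing(values):
    """Percentage of entries in the column that are missing."""
    missing = sum(1 for v in values if is_missing(v))
    return 100.0 * missing / len(values)


def impute_median(values):
    """Replace every missing entry with the median of the observed entries."""
    observed = [float(v) for v in values if not is_missing(v)]
    fill = median(observed)
    return [fill if is_missing(v) else float(v) for v in values]


# Hypothetical column with three missing entries (not taken from the dataset).
total_bedrooms = [260, 2405, None, 292, float("nan"), 316, 993, 412, " ", 294]

print(f"{percent_missing(total_bedrooms):.2f}% missing")
print(impute_median(total_bedrooms))
```

In a scikit-learn pipeline, `SimpleImputer(strategy="median")` plays the role of `impute_median`.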



### Count Statistics
For `String` features, it is important to count the number of unique values to determine whether to treat a feature as `Categorical` or `Text`,
and then process the feature according to its type.

For example, SageMaker Autopilot counts the number of unique entries and the number of unique words.
The following string column would have **3** total entries, **2** unique entries, and **3** unique words.

|       | String Column     |
|-------|-------------------|
| **0** | "red blue"        |
| **1** | "red blue"        |
| **2** | "red blue yellow" |
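
The counts for this example column can be reproduced with a short sketch. The function name `count_statistics` is hypothetical, and splitting on whitespace is an assumption; the tokenizer that SageMaker Autopilot uses is not described in this report.

```python
def count_statistics(column):
    """Count total entries, unique entries, and unique whitespace-separated words."""
    unique_entries = set(column)
    unique_words = {word for entry in column for word in entry.split()}
    return {
        "total_entries": len(column),
        "unique_entries": len(unique_entries),
        "unique_words": len(unique_words),
    }


column = ["red blue", "red blue", "red blue yellow"]
stats = count_statistics(column)
print(stats)
# {'total_entries': 3, 'unique_entries': 2, 'unique_words': 3}
```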

If the feature is `Categorical`, SageMaker Autopilot can look at the total number of unique entries and transform it using techniques such as one-hot encoding.
If the field contains a `Text` string, we look at the number of unique words, or the vocabulary size, in the string.
We can use the unique words to then compute text-based features, such as Term Frequency-Inverse Document Frequency (tf-idf).
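
As an illustration of one-hot encoding, the sketch below maps each distinct category to its own 0/1 column. It is a simplified stand-in, not the transformer used in the generated pipelines; the function name `one_hot_encode` is hypothetical, and the input uses only `ocean_proximity` values that appear in the dataset sample.

```python
def one_hot_encode(values):
    """Return the sorted category list and one 0/1 row per input value."""
    categories = sorted(set(values))
    index = {category: i for i, category in enumerate(categories)}
    rows = []
    for value in values:
        row = [0] * len(categories)
        row[index[value]] = 1  # exactly one column is set per row
        rows.append(row)
    return categories, rows


ocean_proximity = ["<1H OCEAN", "INLAND", "INLAND", "<1H OCEAN"]
categories, rows = one_hot_encode(ocean_proximity)
print(categories)
for row in rows:
    print(row)
```

The number of output columns equals the number of unique entries, which is why the unique-entry count matters when deciding how to encode a feature. In scikit-learn, `sklearn.preprocessing.OneHotEncoder` provides this transformation.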

**Note:** If the number of unique values is too high, we risk data transformations expanding the dataset to too many features.
In that case, SageMaker Autopilot will attempt to reduce the dimensionality of the post-processed data,
such as by capping the number of vocabulary words for tf-idf, applying Principal Component Analysis (PCA), or using other dimensionality reduction techniques.
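
The effect of capping the vocabulary can be sketched with a small tf-idf implementation. This is an illustration under stated assumptions: the function `tfidf` and its `max_vocabulary` parameter are hypothetical, tokens are split on whitespace, and the weighting used here (term frequency times `log(n / df) + 1`) is only one of several common variants. Libraries differ in smoothing and normalization, and this report does not say which variant SageMaker Autopilot applies.

```python
import math
from collections import Counter


def tfidf(documents, max_vocabulary=None):
    """Compute tf-idf rows, optionally keeping only the most common words."""
    tokenized = [doc.split() for doc in documents]
    # Document frequency: in how many documents each word appears.
    df = Counter(word for tokens in tokenized for word in set(tokens))
    # Rank words by document frequency, breaking ties alphabetically.
    vocabulary = sorted(df, key=lambda word: (-df[word], word))
    if max_vocabulary is not None:
        vocabulary = vocabulary[:max_vocabulary]  # cap the feature count
    n = len(documents)
    idf = {word: math.log(n / df[word]) + 1.0 for word in vocabulary}
    rows = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens) or 1  # avoid dividing by zero for empty documents
        rows.append([counts[word] / total * idf[word] for word in vocabulary])
    return vocabulary, rows


documents = ["red blue", "red blue", "red blue yellow"]
vocabulary, rows = tfidf(documents, max_vocabulary=2)
print(vocabulary)  # the rarest word, "yellow", is dropped by the cap
print(rows)
```

Each document becomes one row with at most `max_vocabulary` columns, so the cap bounds the width of the transformed dataset regardless of how many distinct words the column contains.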

The table below shows **10 of the 10** columns ranked by the number of unique entries.

<div class="alert alert-info"> 💡 <strong> Suggested Action Items</strong>

- Verify that the number of unique values of a feature matches what you expect from domain knowledge.
    If it differs, one explanation could be multiple encodings of a value.
    For example, `US` and `U.S.` will count as two different words.
    You could correct the error at the data source or pre-process your dataset in your S3 bucket.
- If the number of unique values seems too high for Categorical variables,
    investigate whether using domain knowledge to group its values
    into a new feature with a smaller set of possible values improves performance.
</div>

<div>
<style scoped>
    .dataframe tbody tr th:only-of-type {
        vertical-align: middle;
    }

    .dataframe tbody tr th {
        vertical-align: top;
    }

    .dataframe thead th {
        text-align: right;
    }
</style>
<table border="1" class="dataframe">
  <thead>
    <tr style="text-align: right;">
      <th></th>
      <th>Number of Unique Entries</th>
      <th>Number of Unique Words (if Text)</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <th>ocean_proximity</th>
      <td>5</td>
      <td>n/a</td>
    </tr>
    <tr>
      <th>housing_median_age</th>
      <td>52</td>
      <td>n/a</td>
    </tr>
    <tr>
      <th>longitude</th>
      <td>824</td>
      <td>n/a</td>
    </tr>
    <tr>
      <th>latitude</th>
      <td>843</td>
      <td>n/a</td>
    </tr>
    <tr>
      <th>households</th>
      <td>1732</td>
      <td>n/a</td>
    </tr>
    <tr>
      <th>total_bedrooms</th>
      <td>1827</td>
      <td>n/a</td>
    </tr>
    <tr>
      <th>population</th>
      <td>3657</td>
      <td>n/a</td>
    </tr>
    <tr>
      <th>median_house_value</th>
      <td>3693</td>
      <td>n/a</td>
    </tr>
    <tr>
      <th>total_rooms</th>
      <td>5495</td>
      <td>n/a</td>
    </tr>
    <tr>
      <th>median_income</th>
      <td>11031</td>
      <td>n/a</td>
    </tr>
  </tbody>
</table>
</div>

### Descriptive Statistics
For each of the numerical input features, several descriptive statistics are computed from the data sample.

SageMaker Autopilot may treat numerical features as `Categorical` if the number of unique entries is sufficiently low.
For `Numerical` features, we may apply numerical transformations such as normalization, log and quantile transforms,
and binning to manage outlier values and differences in feature scales.
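
The transformations named above can be sketched with the Python standard library. These are simplified illustrations, not the transformers in the generated pipelines: the function names are hypothetical, "normalization" is interpreted here as standardization to zero mean and unit variance, and the binning is a rank-based equal-frequency scheme that does not handle ties specially. The input is the `total_rooms` column of the dataset sample.

```python
import math
from statistics import mean, pstdev


def standardize(values):
    """Shift to zero mean and scale to unit (population) standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]  # constant column: nothing to scale
    return [(v - mu) / sigma for v in values]


def log_transform(values):
    """Compress large values with log(1 + x); requires x > -1."""
    return [math.log1p(v) for v in values]


def quantile_bins(values, n_bins):
    """Assign each value to one of n_bins equal-frequency bins by rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    bins = [0] * len(values)
    for rank, i in enumerate(order):
        bins[i] = rank * n_bins // len(values)
    return bins


total_rooms = [1974, 11836, 5306, 1904, 1839, 1915, 4273, 2483, 2590, 1476]

print(standardize(total_rooms))
print(log_transform(total_rooms))
print(quantile_bins(total_rooms, 2))
```

The log transform and binning both reduce the influence of the outlier `11836`, while standardization puts features with very different ranges, such as `total_rooms` and `median_income`, on a comparable scale.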

We found that **9 of the 10** columns contained at least one numerical value.
The table below shows the **9** columns that have the largest percentage of numerical values.

<div class="alert alert-info"> 💡 <strong> Suggested Action Items</strong>

- Investigate the origin of the data field. Are some values non-finite (e.g. infinity, nan)?
    Are they missing or is it an error in data input?
- Missing and extreme values may indicate a bug in the data collection process.
    Verify the numerical descriptions align with expectations.
    For example, use domain knowledge to check that the range of values for a feature meets expectations.
</div>


<div>
<style scoped>
    .dataframe tbody tr th:only-of-type {
        vertical-align: middle;
    }

    .dataframe tbody tr th {
        vertical-align: top;
    }

    .dataframe thead th {
        text-align: right;
    }
</style>
<table border="1" class="dataframe">
  <thead>
    <tr style="text-align: right;">
      <th></th>
      <th>% of Numerical Values</th>
      <th>Mean</th>
      <th>Median</th>
      <th>Min</th>
      <th>Max</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <th>longitude</th>
      <td>100.0%</td>
      <td>-119.562</td>
      <td>-118.49</td>
      <td>-124.35</td>
      <td>-114.31</td>
    </tr>
    <tr>
      <th>latitude</th>
      <td>100.0%</td>
      <td>35.6287</td>
      <td>34.255</td>
      <td>32.54</td>
      <td>41.95</td>
    </tr>
    <tr>
      <th>housing_median_age</th>
      <td>100.0%</td>
      <td>28.5632</td>
      <td>29.0</td>
      <td>1.0</td>
      <td>52.0</td>
    </tr>
    <tr>
      <th>total_rooms</th>
      <td>100.0%</td>
      <td>2650.8</td>
      <td>2144.5</td>
      <td>2.0</td>
      <td>39320.0</td>
    </tr>
    <tr>
      <th>population</th>
      <td>100.0%</td>
      <td>1431.91</td>
      <td>1166.5</td>
      <td>3.0</td>
      <td>35682.0</td>
    </tr>
    <tr>
      <th>households</th>
      <td>100.0%</td>
      <td>502.169</td>
      <td>410.0</td>
      <td>1.0</td>
      <td>6082.0</td>
    </tr>
    <tr>
      <th>median_income</th>
      <td>100.0%</td>
      <td>3.88434</td>
      <td>3.54025</td>
      <td>0.4999</td>
      <td>15.0001</td>
    </tr>
    <tr>
      <th>median_house_value</th>
      <td>100.0%</td>
      <td>2.0707e+05</td>
      <td>1.807e+05</td>
      <td>14999.0</td>
      <td>5.00001e+05</td>
    </tr>
    <tr>
      <th>total_bedrooms</th>
      <td>99.02%</td>
      <td>540.489</td>
      <td>435.0</td>
      <td>1.0</td>
      <td>6445.0</td>
    </tr>
  </tbody>
</table>
</div>
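
The statistics in the table above can be recomputed for your own data with a sketch like the following. The helper names `to_number` and `describe` are hypothetical, and treating non-finite values (infinity, `nan`) as non-numerical is an assumption; this report does not state how SageMaker Autopilot counts them. The input is the `housing_median_age` column of the dataset sample, so the results differ from the full-sample values in the table.

```python
import math
from statistics import mean, median


def to_number(value):
    """Return value as a finite float, or None if it cannot be parsed."""
    try:
        number = float(value)
    except (TypeError, ValueError):
        return None
    return number if math.isfinite(number) else None


def describe(values):
    """Percentage of numerical entries plus mean, median, min, and max."""
    numbers = [n for n in map(to_number, values) if n is not None]
    return {
        "pct_numerical": 100.0 * len(numbers) / len(values),
        "mean": mean(numbers),
        "median": median(numbers),
        "min": min(numbers),
        "max": max(numbers),
    }


housing_median_age = [18, 10, 9, 26, 27, 36, 15, 26, 32, 16]
summary = describe(housing_median_age)
print(summary)
```

Comparing `min` and `max` against the range you expect from domain knowledge is a quick way to spot data entry errors.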
