## Introduction

In this homework, you will explore the California Housing Dataset. You will begin by loading and accessing the dataset, then proceed to explore its distributions and correlations. Next, you will apply feature transformation techniques, such as log transformation and one-hot encoding, to prepare the data for analysis.

By the end of this assignment, you will have gained hands-on experience in data preprocessing, an essential step in any data mining project. Complete the required tasks using Python code in a Jupyter Notebook. For textual responses, insert a text block and write your answers using Markdown.

## California Housing Dataset

### Dataset Columns:

1. **longitude:** A measure of how far west a house is located; California longitudes are negative, so lower (more negative) values indicate a location farther west.
2. **latitude:** A measure of how far north a house is located; higher values indicate a location farther north.
3. **housingMedianAge:** The median age of houses within a block; lower numbers represent newer buildings.
4. **totalRooms:** The total number of rooms within a block.
5. **totalBedrooms:** The total number of bedrooms within a block.
6. **population:** The total number of people residing within a block.
7. **households:** The total number of households within a block, where a household is a group of people residing in a single home unit.
8. **medianIncome:** The median income for households within a block (measured in tens of thousands of US dollars).
9. **medianHouseValue:** The median house value for households within a block (measured in US dollars).
10. **oceanProximity:** The location of the house in relation to the ocean or sea.

### References:

- Pace, R. Kelley, and Ronald Barry. "Sparse Spatial Autoregressions." *Statistics and Probability Letters*, 33 (1997): 291-297.


### Question 1: Access and Explore the Dataset

1. **Load the Dataset:**
   - Use the `pandas.read_csv()` function to load the California Housing Dataset. The dataset will be provided to you as a CSV file.
   
    Follow this link for more details on how to use `read_csv`: [pandas.read_csv() Documentation](https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html).

2. **View the Dataset:**
   - Display the first few rows of the dataset using the `head()` method to get an initial understanding of its structure.
   
    Refer to this link for details: [pandas.DataFrame.head() Documentation](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.head.html).
   - Generate a statistical summary of the dataset using the `describe()` method. This will give you insights into the distribution and basic statistics of each column.
   
    For more information, visit: [pandas.DataFrame.describe() Documentation](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.describe.html).
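
The three calls above can be sketched as follows. This is a minimal illustration on a tiny made-up CSV held in memory; the three sample rows and the `csv_text` variable are assumptions for demonstration, not the assignment data. With the provided file, you would pass its path to `read_csv` instead.

```python
import io

import pandas as pd

# Hypothetical three-row sample standing in for the provided CSV file.
csv_text = """longitude,latitude,housingMedianAge,medianIncome,oceanProximity
-122.23,37.88,41,8.3252,NEAR BAY
-122.22,37.86,21,8.3014,NEAR BAY
-118.30,34.26,43,1.7000,INLAND
"""

# read_csv accepts a file path or any file-like object.
df = pd.read_csv(io.StringIO(csv_text))

# First rows (head() returns up to 5 rows by default).
print(df.head())

# Count, mean, std, min, quartiles, and max for each numeric column.
summary = df.describe()
print(summary)
```

Note that `describe()` summarizes only the numeric columns by default, so a text column such as `oceanProximity` does not appear in its output.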

### Question 2: Examine Column Distributions

1. **Plot Box-and-Whisker Plots:**
   - Use the pandas `DataFrame.boxplot()` method to create a box-and-whisker plot for each of the columns listed below. This will help you visualize the distribution, identify outliers, and understand the spread of the data.
   [pandas.DataFrame.boxplot() Documentation](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.boxplot.html)
   
   - Columns to plot:
     - `housingMedianAge`
     - `medianIncome`
     - `medianHouseValue`

2. **Plot Histograms:**
   - Use the pandas `DataFrame.hist()` method to create a histogram for each of the columns listed below. This will allow you to examine the distribution and identify patterns such as skewness or bimodality.
   [pandas.DataFrame.hist() Documentation](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.hist.html)

   - Columns to plot:
     - `housingMedianAge`
     - `medianIncome`
     - `medianHouseValue`
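
One possible way to draw both kinds of plot is sketched below. The data here is synthetic (randomly generated under assumed distributions) so that the snippet runs on its own; it is not the assignment data, and the bin count and figure sizes are arbitrary choices.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Synthetic stand-in data; the real values come from the provided CSV.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "housingMedianAge": rng.integers(1, 53, size=200),
    "medianIncome": rng.lognormal(mean=1.2, sigma=0.5, size=200),
    "medianHouseValue": rng.lognormal(mean=12.0, sigma=0.6, size=200),
})
columns = ["housingMedianAge", "medianIncome", "medianHouseValue"]

# One box plot per column: the scales differ too much to share an axis.
fig_box, axes_box = plt.subplots(1, len(columns), figsize=(12, 4))
for ax, col in zip(axes_box, columns):
    df.boxplot(column=col, ax=ax)

# DataFrame.hist draws one histogram per listed column.
axes_hist = df.hist(column=columns, bins=30, figsize=(12, 4), layout=(1, 3))
```

In a Jupyter Notebook the figures render automatically at the end of the cell; in a plain script you would call `plt.show()` afterwards.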


### Question 3: Examine Correlations Among Numerical Columns

1. **Calculate the Correlation Matrix:**
   - Use the `.corr()` method in `pandas` to calculate the correlation matrix for all numerical columns in the dataset. This will help you understand the relationships between different numerical features.

    [pandas.DataFrame.corr() Documentation](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.corr.html)

   - Columns to include in the correlation matrix:
     - `longitude`
     - `latitude`
     - `housingMedianAge`
     - `totalRooms`
     - `totalBedrooms`
     - `population`
     - `households`
     - `medianIncome`
     - `medianHouseValue`

2. **Display the Correlation Matrix:**
   - Visualize the correlation matrix using the `heatmap` function from `seaborn`. This will allow you to see the strength and direction of correlations between the numerical columns.
   
    [seaborn.heatmap() Documentation](https://seaborn.pydata.org/generated/seaborn.heatmap.html)

   - After visualizing, identify and note the highest and lowest correlations between pairs of columns.
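
The steps above might look like the following sketch. It uses three synthetic columns (an assumption made so the snippet is self-contained), with `population` deliberately built from `households` so that one strong correlation is visible. The ranking of pairs at the end is one optional way to read off the extremes, not a required part of the task.

```python
import numpy as np
import pandas as pd
import seaborn as sns

# Synthetic stand-in data with a built-in relationship between two columns.
rng = np.random.default_rng(0)
households = rng.integers(100, 1000, size=200)
df = pd.DataFrame({
    "households": households,
    "population": households * 3 + rng.normal(0, 50, size=200),
    "medianIncome": rng.lognormal(mean=1.2, sigma=0.5, size=200),
})

# Select the numeric columns explicitly, then compute Pearson correlations.
numeric_cols = ["households", "population", "medianIncome"]
corr = df[numeric_cols].corr()

# annot=True prints each coefficient in its cell; fixing vmin/vmax to the
# full [-1, 1] range keeps the color scale symmetric around zero.
ax = sns.heatmap(corr, annot=True, fmt=".2f", cmap="coolwarm", vmin=-1, vmax=1)

# Rank the distinct pairs: keep the upper triangle (no diagonal), then sort.
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
pairs = upper.stack().dropna().sort_values()
print(pairs)
```

Selecting the columns by name before calling `.corr()` avoids errors from non-numeric columns such as `oceanProximity`.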



### Authentication: Write Down Your Information

In the following code block, print your Student ID, Name, and Homework number in the specified format:

```python
# Replace the placeholders with your actual information
info = [('your_id', 'your_name', 'homework_number')]
for student_id, name, homework in info:
    print(f'ID: {student_id}\nName: {name}\nHomework: {homework}')
```

For example:

```python
info = [('1001', 'Jon Doe', '001')]
for student_id, name, homework in info:
    print(f'ID: {student_id}\nName: {name}\nHomework: {homework}')
```

Output:

```
ID: 1001
Name: Jon Doe
Homework: 001
```


### Question 4: Feature Transformation

1. **Apply One-Hot Encoding:**
   - Use the `OneHotEncoder` from `sklearn.preprocessing` to transform the `oceanProximity` column (it may be spelled `ocean_proximity` in the CSV header) into multiple binary (0/1) columns.
   
    [OneHotEncoder Documentation](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html)
   - Ensure that the new columns accurately represent the different categories within `oceanProximity`.
   - Display the first 5 rows of the transformed DataFrame to verify the changes.

2. **Apply Log Transformation:**
   - Use the `log` function from either the `math` or `numpy` library to transform the `households` column.
   
    [numpy.log() Documentation](https://numpy.org/doc/stable/reference/generated/numpy.log.html)
   - Plot histograms of the `households` column both before and after the log transformation to compare the distributions.
