#### Introduction

In this homework, you will build and evaluate classification models using the California Housing Dataset to predict whether a house is located inland. Your goal is to train two types of classifiers, the Linear Probability Model (LPM) and Logistic Regression, and explore how Lasso (L1 regularization) can be used for feature selection. Additionally, you will apply a train-test split to evaluate your models on unseen data and report key performance metrics.

By the end of this assignment, you will have gained hands-on experience in training classifiers, performing feature selection with Lasso, and comparing the performance of different models, which are crucial steps in many data mining and machine learning projects. All tasks should be completed using Python in a Jupyter Notebook. For questions requiring textual explanations, please use Markdown to format your responses within text blocks.

## California Housing Dataset

### Data Set Columns:

1. **longitude:** A measure of how far west a house is located; California longitudes are negative, so lower (more negative) values indicate a location farther west.
2. **latitude:** A measure of how far north a house is located; higher values indicate a location farther north.
3. **housingMedianAge:** The median age of houses within a block; lower numbers represent newer buildings.
4. **totalRooms:** The total number of rooms within a block.
5. **totalBedrooms:** The total number of bedrooms within a block.
6. **population:** The total number of people residing within a block.
7. **households:** The total number of households, where a household is defined as a group of people residing in a single home unit, within a block.
8. **medianIncome:** The median income for households within a block (measured in tens of thousands of US dollars).
9. **medianHouseValue:** The median house value for households within a block (measured in US dollars).
10. **oceanProximity:** The location of the house in relation to the ocean or sea.

### References:

- Pace, R. Kelley, and Ronald Barry. "Sparse Spatial Autoregressions." *Statistics and Probability Letters*, 33 (1997): 291-297.


### Question 1: Load and Preprocess the Dataset

1. **Load the Dataset:**
   - Use the `pandas.read_csv()` function to load the California Housing Dataset. The dataset will be provided to you as a CSV file.
   
   Refer to this link for more details on how to use `read_csv`: [pandas.read_csv() Documentation](https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html).

2. **Convert `oceanProximity` to a Boolean Variable:**
   - Convert the `oceanProximity` column into a binary variable:
     - Assign a value of `1` if the location is "INLAND".
     - Assign a value of `0` for all other values.
     
   **Hint:** You can use the `.apply()` or `.map()` function in `pandas` to map the "INLAND" values to 1 and other values to 0.

3. **Train-Test Split:**
   - Split the dataset into training and testing sets, where 80% of the data will be used for training and 20% for testing.
   - Assign the feature columns (including the old target variable, the `medianHouseValue` column) to `X` and the target column (`oceanProximity`) to `y`.
   - Use the `train_test_split` function from `scikit-learn` to perform the split.

   Refer to this link for more details on how to use `train_test_split`: [scikit-learn train_test_split Documentation](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html).
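
The binary encoding in step 2 can be sketched on a toy column. This is a minimal illustration, not the required solution: the `toy` DataFrame and its values are made up for the example, and the vectorized comparison is shown only as an alternative to the `.apply()` approach named in the hint.

```python
import pandas as pd

# Toy stand-in for the ocean proximity column (illustrative values only).
toy = pd.DataFrame(
    {"ocean_proximity": ["INLAND", "NEAR BAY", "INLAND", "<1H OCEAN"]}
)

# Option 1: element-wise mapping with .apply(), as suggested in the hint.
via_apply = toy["ocean_proximity"].apply(lambda v: 1 if v == "INLAND" else 0)

# Option 2: vectorized comparison, then cast the booleans to integers.
via_compare = (toy["ocean_proximity"] == "INLAND").astype(int)

print(via_apply.tolist())    # [1, 0, 1, 0]
print(via_compare.tolist())  # [1, 0, 1, 0]
```

Both options give the same 0/1 column; the comparison form avoids a Python-level function call per row.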

In [37]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression, LogisticRegression, Lasso
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import StandardScaler

In [38]:
data = pd.read_csv('/content/sample_data/housing-2.csv')

if data.isnull().values.any():
    data = data.dropna()

data.head()

Unnamed: 0,longitude,latitude,housing_median_age,total_rooms,total_bedrooms,population,households,median_income,median_house_value,ocean_proximity
0,-122.23,37.88,41.0,880.0,129.0,322.0,126.0,8.3252,452600.0,NEAR BAY
1,-122.22,37.86,21.0,7099.0,1106.0,2401.0,1138.0,8.3014,358500.0,NEAR BAY
2,-122.24,37.85,52.0,1467.0,190.0,496.0,177.0,7.2574,352100.0,NEAR BAY
3,-122.25,37.85,52.0,1274.0,235.0,558.0,219.0,5.6431,341300.0,NEAR BAY
4,-122.25,37.85,52.0,1627.0,280.0,565.0,259.0,3.8462,342200.0,NEAR BAY


In [39]:
data['INLAND'] = data['ocean_proximity'].apply(lambda x: 1 if x == 'INLAND' else 0)
data = data.drop('ocean_proximity', axis=1)

data.head()

Unnamed: 0,longitude,latitude,housing_median_age,total_rooms,total_bedrooms,population,households,median_income,median_house_value,INLAND
0,-122.23,37.88,41.0,880.0,129.0,322.0,126.0,8.3252,452600.0,0
1,-122.22,37.86,21.0,7099.0,1106.0,2401.0,1138.0,8.3014,358500.0,0
2,-122.24,37.85,52.0,1467.0,190.0,496.0,177.0,7.2574,352100.0,0
3,-122.25,37.85,52.0,1274.0,235.0,558.0,219.0,5.6431,341300.0,0
4,-122.25,37.85,52.0,1627.0,280.0,565.0,259.0,3.8462,342200.0,0


In [40]:
X = data.drop('INLAND', axis=1)
y = data['INLAND']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

print("X_train shape:", X_train.shape)
print("X_test shape:", X_test.shape)
print("y_train shape:", y_train.shape)
print("y_test shape:", y_test.shape)

X_train shape: (16346, 9)
X_test shape: (4087, 9)
y_train shape: (16346,)
y_test shape: (4087,)


In [41]:
scaler = StandardScaler()

X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

### Question 2: Train Linear Probability Model and Report Performance

1. **Train the Linear Probability Model (LPM):**
   - Train a Linear Regression model on the training data (`X_train` and `y_train`) using the `LinearRegression` class from `scikit-learn`.
   
   **Hint:** Since this is a regression model fitted to a binary outcome, its predictions are interpreted as probabilities but are not constrained to lie between 0 and 1. Make sure you later convert the predicted values into binary labels using a threshold (e.g., 0.5).

   Refer to this link for more details on `LinearRegression`: [scikit-learn LinearRegression Documentation](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html).

2. **Train Lasso (L1 Regularization):**
   - Train a Lasso regression model using the `Lasso` class from `scikit-learn` to apply L1 regularization on the linear probability model. The Lasso model will help you perform feature selection by shrinking some coefficients to zero.
   
   **Hint:** Set the `alpha` parameter to a small value (e.g., `alpha=0.1`) to control the strength of the regularization.

   Refer to this link for more details on `Lasso`: [scikit-learn Lasso Documentation](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Lasso.html).

3. **Predict on the Test Set:**
   - Use both the trained Linear Probability Model and the Lasso model to predict the outcomes on `X_test`. Since these are continuous probability estimates (which can fall below 0 or above 1), convert them into binary outcomes using a threshold (e.g., 0.5).
   
   **Hint:** For LPM, use the following logic to threshold the probabilities:
   ```python
   y_pred_class = (y_pred_prob > 0.5).astype(int)
   ```

4. **Report Accuracy:**
   - Calculate and report the accuracy of both models (Linear Probability Model and Lasso) on the test set using the `accuracy_score` function from `scikit-learn`.

   Refer to this link for more details on `accuracy_score`: [scikit-learn accuracy_score Documentation](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.accuracy_score.html).

5. **Report Coefficients:**
   - Extract and report the coefficients from both the Linear Probability Model and Lasso regression.


In [42]:
lpm_model = LinearRegression()
lpm_model.fit(X_train_scaled, y_train)

In [43]:
lasso_model = Lasso(alpha=0.1)
lasso_model.fit(X_train_scaled, y_train)


In [44]:
y_pred_prob_lpm = lpm_model.predict(X_test_scaled)
y_pred_class_lpm = (y_pred_prob_lpm > 0.5).astype(int)

y_pred_prob_lasso = lasso_model.predict(X_test_scaled)
y_pred_class_lasso = (y_pred_prob_lasso > 0.5).astype(int)


In [45]:
accuracy_lpm = accuracy_score(y_test, y_pred_class_lpm)
print(f"Accuracy of Linear Probability Model: {accuracy_lpm:.4f}")

accuracy_lasso = accuracy_score(y_test, y_pred_class_lasso)
print(f"Accuracy of Lasso Model: {accuracy_lasso:.4f}")

Accuracy of Linear Probability Model: 0.9281
Accuracy of Lasso Model: 0.7123


In [46]:
coefficients_lpm = lpm_model.coef_
print("Coefficients of Linear Probability Model:", coefficients_lpm)

coefficients_lasso = lasso_model.coef_
print("Coefficients of Lasso Model:", coefficients_lasso)

Coefficients of Linear Probability Model: [ 0.75114094  0.84121956 -0.02565401  0.11060483 -0.10607063 -0.01509402
  0.01341117 -0.00753263 -0.07160812]
Coefficients of Lasso Model: [ 0.          0.04463313 -0.          0.          0.         -0.
 -0.         -0.         -0.11785062]
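
In the output above, `alpha=0.1` set seven of the nine Lasso coefficients to zero and test accuracy fell from 0.9281 to 0.7123. One likely reason is that `alpha=0.1` is a fairly strong penalty for a 0/1 target: scikit-learn's `Lasso` minimizes `(1 / (2 * n_samples)) * ||y - Xw||^2 + alpha * ||w||_1`, and the covariance between a standardized feature and a binary target cannot exceed 0.5 in magnitude. The sketch below sweeps `alpha` on synthetic data to show how the number of surviving coefficients changes; the data and the `alpha` grid are assumptions made for the illustration.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

# Synthetic data (assumption): only the first two of five features matter.
rng = np.random.default_rng(0)
X_demo = StandardScaler().fit_transform(rng.normal(size=(1000, 5)))
noise = 0.5 * rng.normal(size=1000)
y_demo = (X_demo[:, 0] + 0.3 * X_demo[:, 1] + noise > 0).astype(int)

# Count how many coefficients stay nonzero at each penalty strength.
nonzero_counts = {}
for alpha in [0.001, 0.01, 0.1, 1.0]:
    model = Lasso(alpha=alpha).fit(X_demo, y_demo)
    nonzero_counts[alpha] = int(np.sum(model.coef_ != 0))
    print(f"alpha={alpha}: {nonzero_counts[alpha]} nonzero coefficients")
```

Repeating the same kind of sweep on the scaled housing features would show which `alpha` values keep enough features for the thresholded predictions to remain accurate.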


### Question 3: Train Logistic Regression Model and Report Performance

1. **Train the Logistic Regression Model:**
   - Train a Logistic Regression model on the training data (`X_train` and `y_train`) using the `LogisticRegression` class from `scikit-learn`.
   
   **Hint:** Use the `LogisticRegression` class with `penalty=None` for the basic Logistic Regression model (no regularization). scikit-learn versions before 1.2 use the string `penalty='none'` instead.

   Refer to this link for more details on `LogisticRegression`: [scikit-learn LogisticRegression Documentation](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html).

2. **Train Logistic Regression with Lasso (L1 Regularization):**
   - Train a Logistic Regression model with L1 regularization by setting `penalty='l1'` and `solver='saga'` in the `LogisticRegression` class.
   
   **Hint:** The `C` parameter controls the strength of regularization (smaller `C` means stronger regularization). Set `solver='saga'`, as it supports L1 regularization.

   Refer to this link for more details on `Lasso Logistic Regression`: [scikit-learn Logistic Regression L1 Documentation](https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression).

3. **Predict on the Test Set:**
   - Use both the trained Logistic Regression model and the Lasso Logistic Regression model to predict outcomes on `X_test`. Since Logistic Regression directly predicts probabilities, convert these into binary outcomes using a threshold of 0.5.
   
   **Hint:** You can use the following logic to threshold the probabilities:
   ```python
   y_pred_class = (y_pred_prob > 0.5).astype(int)
   ```

4. **Report Accuracy:**
   - Calculate and report the accuracy of both models (basic Logistic Regression and Lasso Logistic Regression) on the test set using the `accuracy_score` function from `scikit-learn`.

   Refer to this link for more details on `accuracy_score`: [scikit-learn accuracy_score Documentation](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.accuracy_score.html).

5. **Report Coefficients:**
   - Extract and report the coefficients from both the basic Logistic Regression and the Lasso Logistic Regression models. Compare how Lasso impacts feature selection by shrinking some coefficients to zero.

In [47]:
logistic_model = LogisticRegression(penalty=None, max_iter=200, random_state=42)
logistic_model.fit(X_train_scaled, y_train)

In [48]:
lasso_logistic_model = LogisticRegression(penalty='l1', solver='saga', C=0.1, max_iter=500, random_state=42)
lasso_logistic_model.fit(X_train_scaled, y_train)

In [49]:
y_pred_prob_logistic = logistic_model.predict_proba(X_test_scaled)[:, 1]
y_pred_class_logistic = (y_pred_prob_logistic > 0.5).astype(int)

y_pred_prob_lasso_logistic = lasso_logistic_model.predict_proba(X_test_scaled)[:, 1]
y_pred_class_lasso_logistic = (y_pred_prob_lasso_logistic > 0.5).astype(int)

In [50]:
accuracy_logistic = accuracy_score(y_test, y_pred_class_logistic)
accuracy_lasso_logistic = accuracy_score(y_test, y_pred_class_lasso_logistic)

print("Accuracy of Logistic Regression Model:", accuracy_logistic)
print("Accuracy of Lasso Logistic Regression Model:", accuracy_lasso_logistic)

Accuracy of Logistic Regression Model: 0.9610961585515048
Accuracy of Lasso Logistic Regression Model: 0.9632982627844384


In [51]:
print("Coefficients of Logistic Regression Model:", logistic_model.coef_)
print("Coefficients of Lasso Logistic Regression Model:", lasso_logistic_model.coef_)

Coefficients of Logistic Regression Model: [[14.89970803 16.19504919 -0.10588432  0.51125031 -0.72035156  0.23002697
  -0.14151927  0.01964717 -0.18433293]]
Coefficients of Lasso Logistic Regression Model: [[13.02783767 14.19000611 -0.12088906  0.31068432 -0.3987077   0.11665126
  -0.10911086  0.06230749 -0.28833161]]


In [54]:
coefficients_logistic = logistic_model.coef_[0]
coefficients_lasso = lasso_logistic_model.coef_[0]

# Compare coefficients
coefficients_df = pd.DataFrame({
    'Feature': X.columns,
    'Logistic Coefficients': coefficients_logistic,
    'Lasso Coefficients': coefficients_lasso
})

# Count number of features shrunk to zero by Lasso
num_zeroed_features = (coefficients_lasso == 0).sum()
print("Number of features shrunk to zero by Lasso:", num_zeroed_features)

print(coefficients_df)


Number of features shrunk to zero by Lasso: 0
              Feature  Logistic Coefficients  Lasso Coefficients
0           longitude              14.899708           13.027838
1            latitude              16.195049           14.190006
2  housing_median_age              -0.105884           -0.120889
3         total_rooms               0.511250            0.310684
4      total_bedrooms              -0.720352           -0.398708
5          population               0.230027            0.116651
6          households              -0.141519           -0.109111
7       median_income               0.019647            0.062307
8  median_house_value              -0.184333           -0.288332


### Comparison of Coefficients: Logistic Regression vs. Lasso Logistic Regression

- **Logistic Regression Coefficients**:
  \[
  [14.8997, 16.1950, -0.1059, 0.5113, -0.7204, 0.2300, -0.1415, 0.0196, -0.1843]
  \]
  
- **Lasso Logistic Regression Coefficients**:
  \[
  [13.0278, 14.1900, -0.1209, 0.3107, -0.3987, 0.1167, -0.1091, 0.0623, -0.2883]
  \]

#### Key Points:
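1. **Magnitude Reduction**: Lasso shrank six of the nine coefficients, including the two largest (longitude from 14.8997 to 13.0278 and latitude from 16.1950 to 14.1900). The magnitudes for `housing_median_age`, `median_income`, and `median_house_value` increased slightly instead.
  
2. **No Zero Coefficients**: With `C=0.1`, no coefficient was shrunk exactly to zero, so the L1 penalty performed no feature selection at this strength.
  
3. **Feature Impact**: The largest relative reductions were for `total_rooms` (0.5113 to 0.3107), `total_bedrooms` (-0.7204 to -0.3987), and `population` (0.2300 to 0.1167), indicating moderated effects on the target variable.

4. **Regularization Benefits**: Lasso discourages large weights, which can help prevent overfitting. Here the effect on test accuracy was marginal (0.9611 without regularization versus 0.9633 with it).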

5. **Future Tuning**: Lowering the `C` parameter may lead to some coefficients being shrunk to zero for stricter feature selection.
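
The effect of `C` on sparsity can be sketched with a small sweep on synthetic data. This is an illustration under stated assumptions, not a result for the housing data: the features and labels are invented, and the sketch uses `solver='liblinear'` (which also supports the L1 penalty) rather than the `saga` solver used in the notebook.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Synthetic data (assumption): only the first two of five features matter.
rng = np.random.default_rng(0)
X_demo = StandardScaler().fit_transform(rng.normal(size=(1000, 5)))
noise = 0.5 * rng.normal(size=1000)
y_demo = (X_demo[:, 0] + 0.3 * X_demo[:, 1] + noise > 0).astype(int)

# Smaller C means a stronger L1 penalty, so more coefficients hit zero.
zero_counts = {}
for C in [1.0, 0.1, 0.01, 0.0001]:
    clf = LogisticRegression(penalty="l1", solver="liblinear", C=C)
    clf.fit(X_demo, y_demo)
    zero_counts[C] = int(np.sum(clf.coef_[0] == 0))
    print(f"C={C}: {zero_counts[C]} coefficients shrunk to zero")
```

Running a similar loop over `C` on `X_train_scaled` would show the value at which the housing features start to drop out.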

### Conclusion:
While Lasso with `C=0.1` did not eliminate any features, it shrank most coefficient magnitudes and left test accuracy essentially unchanged (0.9611 without regularization versus 0.9633 with it).

### Authentication: Write Down Your Information

In the following code block, print your Student ID, Name, and Homework number in the specified format:

```python
# Replace the placeholders with your actual information
info = [('your_id', 'your_name', 'homework_number')]
for id, name, homework in info:
    print(f'ID: {id}\nName: {name}\nHomework: {homework}')
```


In [None]:
info = [('1002162937', 'Swathi Manjunatha', '003')]
for id, name, homework in info:
    print(f'ID: {id}\nName: {name}\nHomework: {homework}')

ID: 1002162937
Name: Swathi Manjunatha
Homework: 003
