## Homework 01: First Steps with Linear Regression

This homework will introduce you to foundational techniques in downloading and setting up datasets, running `sklearn` models, and examining their results. The primary goal is to establish the workflow for accessing and submitting assignments, writing basic code, and interpreting outputs. This assignment will also help us test the autograder and ensure that feedback mechanisms work seamlessly.

### Submission Instructions via Gradescope

We will use Gradescope for homework submissions this term. Please follow these instructions carefully:

1. **Do not rename the file**  
   - Renaming the notebook (**`Homework_01.ipynb`**) will cause the autograder to fail.

2. **Include the `requirements.txt` file**  
   - Download both **`Homework_01.ipynb`** and `requirements.txt` from the GitHub repository.  
   - The `requirements.txt` file specifies the dependencies (imported libraries) required for grading.  
   - **Do not modify this file** unless explicitly instructed by the LFs.

3. **Verify before submission**  
   - Before submitting, run `Restart Kernel and Run All` to ensure that all cells execute without errors.

4. **Submitting to Gradescope**  
   - You should have received an email inviting you to join Gradescope. If not, please contact us immediately.  
   - Log in to [Gradescope](https://gradescope.com), navigate to your dashboard, and locate **Homework 01**.  
   - Drag and drop the following files into the upload section:
     - **`Homework_01.ipynb`**
     - `requirements.txt`
   - Or, you may zip up a folder containing these two files and submit that.
   - Click `Upload` to submit your files.


5. **Review your submission**  
   - You will receive the autograder results and a confirmation email after submission.  
   - For the first two homeworks, **multiple resubmissions** will be allowed. Use this opportunity to get your debugging and submission workflow established.   
   - Starting from Homework 03, **only one submission will be permitted.** You will need to ensure that your work is complete and carefully verified before uploading.

In [None]:
# Initialize Otter
import otter
grader = otter.Notebook("Homework_01.ipynb")

In [7]:
# Useful imports and utilities

import numpy as np
import pandas as pd
import time
import matplotlib.pyplot as plt
import os

from sklearn.datasets import fetch_california_housing
from sklearn.ensemble import BaggingRegressor, RandomForestRegressor, GradientBoostingRegressor
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.tree import DecisionTreeRegressor
from sklearn.linear_model import LinearRegression
from tqdm import tqdm


## Problem: Linear Regression on the Kaggle Salary Dataset

This is a great dataset to start with: it is a univariate regression dataset predicting salary from years of experience. It is probably the smallest dataset on Kaggle!

### (A) Install `kagglehub` if you don't already have it

The first thing to do is to install `kagglehub` if you don't have it already. If you *do*, make sure you have the most recent version.

In [None]:
# Since you only need to do this once, uncomment the following line, run the cell, and then recomment or delete this cell.
# Or do this the usual way you do installs (e.g., in Terminal on a Mac). 

# !pip install kagglehub


In [None]:
import kagglehub
print("Kagglehub version:", kagglehub.__version__)
  

In [1]:
# If you need to upgrade, uncomment and run this cell, then delete or recomment.
# But do not worry excessively about upgrading to the most recent version at this point, 
# even if you get "Warning: Looks like you're using an outdated...." when you download the dataset.

# !pip install --upgrade kagglehub


### (B) Download the dataset and prepare it for modeling.

Continue running cells as shown, following the instructions in text cells and comments in code cells (usually "Your code here"), and then answer the questions below.

#### B.1 Download

In [None]:
# Download latest version, which will be installed on your local machine
# After running this cell once, you could comment this out.  

salary_dataset_path = kagglehub.dataset_download("abhishek14398/salary-dataset-simple-linear-regression")

print("Path to dataset files:", salary_dataset_path)

In [None]:
# Assuming the dataset is named "Salary_dataset.csv" inside the path
salary_dataset_path_to_file = os.path.join(salary_dataset_path, "Salary_dataset.csv")
salary_data_raw = pd.read_csv(salary_dataset_path_to_file)

#### B.2  Print out the head and info about the dataset

In [None]:
# Your code here

#### B.3  Using Pandas `hist()`, display histograms of the columns.  Set the `bins` parameter to make the visualization as useful as possible (YMMV, so don't stress about it).

In [None]:
# Your code here

#### B.4  Print out some simple stats about the data

**Set the variable b4_a to the maximum salary in the dataset as an integer.**

In [None]:
# Your code here
b4_a = ...

**Set the variable b4_b to the average years of experience in the dataset as a floating-point number rounded to 2 decimal places.**
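Pandas aggregations return NumPy scalar types rather than plain Python numbers, so it can help to convert explicitly when a task asks for "an integer" or "a float". The following is a minimal sketch on made-up toy data; the column names `years` and `pay` are hypothetical and are not the homework dataset's column names.

```python
import pandas as pd

# Toy data for illustration only (hypothetical column names).
toy = pd.DataFrame({
    "years": [1.0, 2.5, 4.0],
    "pay": [40000.0, 52000.0, 61000.0],
})

# Series.max() returns a NumPy scalar; int() converts it to a Python int.
max_pay = int(toy["pay"].max())

# Series.mean() returns a NumPy float; float() and round() give a
# plain Python float rounded to 2 decimal places.
mean_years = round(float(toy["years"].mean()), 2)

print(type(max_pay).__name__, max_pay)
print(type(mean_years).__name__, mean_years)
```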

In [None]:
# Your code here
b4_b = ...

### (C) Clean the data

There seems to be a problem, namely an extra column that is completely unnecessary!

1. Remove that column using appropriate Pandas code and assign the result to a new variable `salary_data`;
2. Check to see all is well, by setting the variable c1 to a **Python list** of the feature names in the new dataframe and then printing it (you might want to do this before and after, just to get the precise name of the column to remove). (Hint: if your value is in the form `Index(...)` then it is not a Python list.)
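As a rough illustration of the two steps above, the sketch below drops a column from a toy dataframe and converts the column names to a Python list. The dataframe and its column names (`row_id`, `years`, `pay`) are invented for this example; the name of the column to remove in the real dataset should be read from the dataframe itself.

```python
import pandas as pd

# Toy data for illustration only (hypothetical column names).
toy_raw = pd.DataFrame({
    "row_id": [0, 1, 2],
    "years": [1.0, 2.5, 4.0],
    "pay": [40000.0, 52000.0, 61000.0],
})

# drop() returns a new DataFrame and leaves toy_raw unchanged.
toy = toy_raw.drop(columns=["row_id"])

as_index = toy.columns          # a pandas Index object, not a list
as_list = toy.columns.tolist()  # a plain Python list of strings

print(as_index)
print(as_list)
```

Calling `list(toy.columns)` would give the same list as `tolist()` here.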

In [None]:
# Your code here
c1 = ...

**Set the variable c2 to the shape of the dataset, i.e., a tuple of the form (rows, columns).**

In [None]:
# Your code here
c2 = ...

### (D) Convert the dataframe to (X,y) form for processing. 

Create a numpy array `X` from the first column and array `y` from the second column. Create `X` by deleting the second column from a copy of the dataframe, **not** by just selecting the first column (which won't work when there is more than one feature). 

Confirm by setting the variable `d1` to the shape of `X` and `d2` to the shape of `y` and printing them.
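The reason for dropping the target column rather than selecting the feature column is that `sklearn` estimators expect `X` to be two-dimensional, with shape `(n_samples, n_features)`. A minimal sketch of the difference, again on a toy dataframe with hypothetical column names:

```python
import pandas as pd

# Toy data for illustration only (hypothetical column names).
toy = pd.DataFrame({
    "years": [1.0, 2.5, 4.0],
    "pay": [40000.0, 52000.0, 61000.0],
})

# Dropping the target keeps a DataFrame, so the array stays 2-D.
X_toy = toy.drop(columns=["pay"]).to_numpy()

# Selecting a single column gives a Series, so the array is 1-D.
y_toy = toy["pay"].to_numpy()
x_flat = toy["years"].to_numpy()

print(X_toy.shape)   # 2-D: one row per sample, one column per feature
print(y_toy.shape)   # 1-D target vector
print(x_flat.shape)  # 1-D: would need reshaping before fitting
```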

In [None]:
# Your code here
d1 = ...
d2 = ...

### (E) Display the data as a scatterplot

Display a scatterplot of the data using appropriate title and axis labels.  YMMV, but make it attractive!

In [None]:
# Your code here

### (F) Linear Regression in Sklearn

Now we will run linear regression on the dataset, plot the regression line, and print out the intercept and slope of the
least-squares line with some evaluation metrics.

Train your model on the whole dataset, and set `f1` to the intercept (a float) rounded to 4 decimal places. (Note: `sklearn` stores the intercept/bias separately from the coefficients.)
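To make the note about the separate intercept concrete, here is a small sketch that fits `LinearRegression` to synthetic, noise-free data generated from the line y = 3 + 2x, so the fitted values are known in advance. The data and variable names are invented for this example.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data lying exactly on y = 3 + 2x (not the homework dataset).
X_toy = np.array([[0.0], [1.0], [2.0], [3.0]])  # shape (4, 1)
y_toy = 3.0 + 2.0 * X_toy[:, 0]                 # shape (4,)

model = LinearRegression().fit(X_toy, y_toy)

# The intercept is stored in intercept_; the slopes are stored in
# coef_, an array with one entry per feature.
intercept = round(float(model.intercept_), 4)
slope = round(float(model.coef_[0]), 4)

print("intercept:", intercept)
print("slope:", slope)
```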

In [2]:
# Your code here
f1 = ...

Set `f2` to the slope (a float), rounded to 4 decimal places.

In [3]:
# Your code here
f2 = ...

Set `f3` to the mean squared error (a float) of the model on the whole dataset, rounded to 4 decimal places.
Hint: predict the results on the whole dataset to get `y_pred`, then calculate MSE from `y` and `y_pred`.
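For reference, the mean squared error is the average of the squared differences between the true and predicted values. The sketch below computes it by hand with NumPy on small invented arrays and compares the result with `sklearn.metrics.mean_squared_error`.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Invented values for illustration only.
y_true = np.array([1.0, 2.0, 3.0, 4.0])
y_hat = np.array([1.5, 2.0, 2.0, 4.5])

# Residuals are -0.5, 0, 1, -0.5; their squares sum to 1.5.
mse_manual = float(np.mean((y_true - y_hat) ** 2))
mse_sklearn = float(mean_squared_error(y_true, y_hat))

print(mse_manual)
print(mse_sklearn)
```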

In [4]:
# Your code here
f3 = ...

Set `f4` to the mean absolute error (a float) of the model on the whole dataset, rounded to 4 decimal places.

In [5]:
# Your code here
f4 = ...

Set `f5` to the R2 score (a float) for the model on the whole dataset, rounded to 4 decimal places. 
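The R2 score compares the model's squared error with that of a baseline that always predicts the mean of `y`: it equals 1 minus the ratio of the residual sum of squares to the total sum of squares. A minimal NumPy sketch on invented values:

```python
import numpy as np

# Invented values for illustration only.
y_true = np.array([1.0, 2.0, 3.0, 4.0])
y_hat = np.array([1.5, 2.0, 2.0, 4.5])

# Residual sum of squares: squared error of the predictions.
ss_res = np.sum((y_true - y_hat) ** 2)

# Total sum of squares: squared error of always predicting the mean.
ss_tot = np.sum((y_true - y_true.mean()) ** 2)

r2_manual = float(1.0 - ss_res / ss_tot)
print(r2_manual)
```

For a single target like this one, `sklearn.metrics.r2_score(y_true, y_hat)` computes the same quantity.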

In [6]:
# Your code here
f5 = ...