In [None]:
NAME = "Mayank Kumar Pokhriyal"
COLLABORATORS = ""

# Instructions

1. Make sure you have filled out your "NAME" and "COLLABORATORS" (if any) in the previous cell.

2. You should complete all code/markdown cells that state "YOUR CODE HERE" or "YOUR ANSWER HERE". 
   
3. Before you turn this problem in, make sure everything runs as expected. First, **restart the kernel** (in the menubar, select Kernel$\rightarrow$Restart) and then **run all cells** (in the menubar, select Cell$\rightarrow$Run All).

4. Partial credit can be obtained if your solution approach is clear and documented within comments in the implementation.

5. You should follow good coding practices. Your code should use type hints and be robust against invalid inputs, and you should also write a few test cases to check for correctness, particularly for edge cases.


# Supervised Learning

In this homework, we will build a few models from scratch and then use them to explore a real world dataset.  We are hoping to get some insight into the following topics:

1. How does the training process work?
2. What is the flow for the prediction process?
3. What does a model even look like?

Most machine learning libraries or packages will make some assumptions about the input and output format of the data.  Here we will also standardize on a format which is inspired by one such library.

We will assume that our data is structured as a dataframe with a special column called "label" for the classification target.  So for a problem with two features, it might look like:

|Feature 1|Feature 2|label|
|---|---|---|
|$x_{0,0}$|$x_{0,1}$|$y_{0}$|
|$x_{1,0}$|$x_{1,1}$|$y_{1}$|
|$\vdots$|$\vdots$|$\vdots$|
|$x_{n,0}$|$x_{n,1}$|$y_{n}$|

Please make sure that your code passes all the `assert` statements before you move on to the next part. Do not assume, however, that the `assert` statements will catch all cases; you will need to do your own testing as well.

# Problem 1 - Train a Decision Tree

We start with building a decision tree classifier and implement both the training and inference.  As this is our first machine learning algorithm, we will take it slow and build it in many small steps and then put it all together.

Our goal will be to build up a decision tree model using a variant of the C4.5 algorithm. This algorithm is an extension of the ID3 algorithm and solves many of the limitations of that algorithm. 

For more materials on this algorithm, please peruse

- https://en.wikipedia.org/wiki/C4.5_algorithm


The first few parts of this problem build up a decision tree training algorithm; the later parts explore what to do with it.

The algorithm proceeds as follows:

1. Start with a dataset.  
2. Check the base case for terminating recursion
3. Choose the splitting criteria which maximizes gain (i.e. which feature for which you want to split)
4. Split the dataset on this criterion and recurse on each split

The root node of the decision tree sees the entire training dataset, whereas the child nodes will only see that subset of the entire dataset that results from the splitting of the dataset at each parent level in the decision tree. 

For simplicity, we will consider the following base cases for the dataset at a given node, either of which will lead to the termination of the recursion:

1. All the examples in the dataset have the same label
2. The number of examples in the dataset is less than or equal to MAX_EXAMPLES_PER_NODE, whose value can be set to some small fixed number (say 3). 
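
The two base cases above can be sketched as a small predicate. This is only an illustration under stated assumptions, not the required implementation: the helper name `should_stop` and its `max_examples` parameter are hypothetical and introduced here for the example.

```python
import pandas as pd


def should_stop(labels: pd.Series, max_examples: int = 3) -> bool:
    """Return True when recursion should terminate at this node."""
    # Base case 1: every example carries the same label
    # (an empty node has nothing to split either).
    if labels.nunique() <= 1:
        return True
    # Base case 2: the node holds too few examples to be worth splitting.
    return len(labels) <= max_examples
```

A training routine could call such a check first and return a leaf holding the most frequent label whenever it is true.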

Let's start by building up some of the helper functions which we will need as part of the algorithm.

## Part a)

First we will need to define a function to compute the entropy of a pandas series comprising the label column in the dataset. We define entropy as

$$H(D) = -\sum_i p_i \log_2(p_i)$$

where the summation index $i$ runs over the unique labels in the label column, and $p_i$ is the proportion of the $i$'th label in the target column of the dataset. For any dataset the entropy can be calculated from the training data quite easily. Be careful with log evaluations in this expression, and note that $p_i \log_2(p_i) = 0$ for $p_i = 0$ (which can be verified by evaluating this expression in the limit $p_i \rightarrow 0$ via L'Hôpital's rule).
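
As a quick numerical check of the definition, the sketch below evaluates the entropy directly from label counts using only the standard library. The helper name `entropy_from_counts` is an assumption made for this example and is separate from the `entropy` function you are asked to write.

```python
import math
from typing import Sequence


def entropy_from_counts(counts: Sequence[int]) -> float:
    """Entropy in bits of a label distribution given as raw counts."""
    total = sum(counts)
    if total == 0:
        return 0.0
    # Zero counts are skipped because p * log2(p) tends to 0 as p tends to 0.
    return sum(-(c / total) * math.log2(c / total) for c in counts if c > 0)


print(entropy_from_counts([3, 2]))  # roughly 0.971 bits
print(entropy_from_counts([4, 0]))  # 0.0, a pure node
print(entropy_from_counts([1, 1]))  # 1.0, an even two-way split
```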


For the sake of checking your initial answers please use the following dataset and define assertions for testing the code.

In [103]:
import pandas as pd
trial_df = pd.DataFrame({
    'a': [1, 2, 3, 4, 2],
    'b': ['cat', 'dog', 'cat', 'dog', 'lion'],
    'label': [True, False, True, False, True]
})
trial_df

| |a|b|label|
|---|---|---|---|
|0|1|cat|True|
|1|2|dog|False|
|2|3|cat|True|
|3|4|dog|False|
|4|2|lion|True|


In [104]:
# import numpy so we can use the almost equal method
import numpy as np
def assert_almost_equal(a, b):
    """assert almost equal with readable error message"""
    assert np.isclose(a, b), f"{a} != {b}"

In [84]:
def entropy(s: pd.Series) -> float:
    """Compute the entropy (in bits) of the labels in a pandas Series."""
    if s.empty:
        return 0.0

    counts = s.value_counts()
    proportions = counts / len(s)

    entropy_value = 0.0
    for p in proportions:
        # Skip p == 0, where p * log2(p) is taken to be 0.
        if p > 0:
            entropy_value -= p * np.log2(p)

    return float(entropy_value)

#Test cases for entropy function
assert_almost_equal(entropy(pd.Series([])), 0.0)
assert_almost_equal(entropy(pd.Series([1, 1, 1, 1])), 0.0)
assert_almost_equal(entropy(pd.Series([True, False])), 1.0)


## Part b)
Now we consider the information gain, which you can equivalently think of as the reduction in entropy from making a binary split on a particular feature. Intuitively, we want to choose the feature which gives us the largest information gain (equivalently, the largest entropy reduction) once we use it to split the dataset. To do this, we first evaluate the information gain between the full dataset and the two subsets obtained by splitting the dataset using all possible conditions for a given feature column $ftr$. We can then loop over all feature columns in this way to find the best choice of feature column (and the corresponding split condition using that column).

Mathematically, if the feature $ftr$ is used to split the dataset $D$ (with $N$ examples) into two datasets $D_1$ and $D_2$ (with $N_1$ and $N_2$ examples respectively), then the information gain from this split is

$$ G(D, ftr) = H(D) - \left[\frac{N_1}{N}H(D_1) + \frac{N_2}{N}H(D_2)\right] $$
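
To make the formula concrete, here is a minimal standalone sketch that evaluates the gain of one given split from plain lists of labels. The names `label_entropy` and `information_gain` are assumptions for this illustration; the function requested in this part additionally has to search over split conditions.

```python
import math
from collections import Counter
from typing import Hashable, Sequence


def label_entropy(labels: Sequence[Hashable]) -> float:
    """Entropy in bits of a sequence of labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return sum(-(c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(parent: Sequence[Hashable],
                     left: Sequence[Hashable],
                     right: Sequence[Hashable]) -> float:
    """G = H(D) - [N1/N * H(D1) + N2/N * H(D2)] for one binary split."""
    n = len(parent)
    weighted = (len(left) / n) * label_entropy(left) \
        + (len(right) / n) * label_entropy(right)
    return label_entropy(parent) - weighted


labels = [True, False, True, False]
print(information_gain(labels, [True, True], [False, False]))  # 1.0, a perfect split
print(information_gain(labels, [True, False], [True, False]))  # 0.0, an uninformative split
```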

Please define a function `max_gain_for_feature(df, ftr, label)` which computes the largest gain from splitting the dataset using the values in the column `ftr` of the dataframe `df`, where the name of the label column is given as `label`.

Note that the potential split conditions you evaluate for each feature depend on whether $ftr$ is a numerical or a categorical feature, as follows:

- For a numerical feature, consider a set of potential split conditions as follows. Let $k_1 < k_2 < \ldots < k_J$ be the $J$ unique values in the column in ascending order, and consider all splits of the form $(x \leq k_i, x > k_i)$ for $i = 1, 2, \ldots, J-1$.
- For a categorical feature, consider a set of potential split conditions as follows. Let (a, b, c, d) be the unique values of the feature column in some initial arbitrary order, and consider the 3 splits of the form [(a), (b,c,d)], [(a,b), (c,d)], [(a,b,c), (d)]. In general, if there are $k$ unique values in some initial arbitrary order, then there will be $k-1$ split conditions.
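
The candidate split conditions described in the two bullets above can be enumerated as in the following sketch. The helper names `numeric_thresholds` and `categorical_prefix_splits` are hypothetical, and "initial arbitrary order" is assumed here to mean order of first appearance.

```python
from typing import Hashable, List, Sequence


def numeric_thresholds(values: Sequence[float]) -> List[float]:
    """Thresholds k_1 .. k_{J-1}: every unique value except the largest."""
    return sorted(set(values))[:-1]


def categorical_prefix_splits(values: Sequence[Hashable]) -> List[List[Hashable]]:
    """First set of each of the k-1 prefix splits, in order of first appearance."""
    unique = list(dict.fromkeys(values))  # de-duplicate while keeping order
    return [unique[:i] for i in range(1, len(unique))]


print(numeric_thresholds([1, 2, 3, 4, 2]))
# [1, 2, 3]
print(categorical_prefix_splits(['cat', 'dog', 'cat', 'dog', 'lion']))
# [['cat'], ['cat', 'dog']]
```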

The nature of the feature (either numerical or categorical) can be obtained from the dtype of the feature in the dataset.

The return value of this function is a tuple whose first element is the maximum gain over all the splits of the feature, and whose second element is a list containing either the split value (for a numerical feature) or the values in the first set of the split (for a categorical feature).

In [105]:
from itertools import combinations
from typing import List, Optional, Tuple

def max_gain_for_feature(df: pd.DataFrame,
                         ftr: str,
                         label: str) -> Tuple[float, Optional[List]]:
    """Return (max_gain, split) over the binary splits of column `ftr`.

    `split` is [pivot] for a numeric feature, the list of values in the first
    subset for a categorical feature, or None if no split has positive gain.
    """
    if ftr not in df.columns or label not in df.columns:
        raise ValueError(f"{ftr!r} and {label!r} must both be columns of df")

    initial_entropy = entropy(df[label])
    if initial_entropy == 0:
        # Pure (or empty) node: nothing to gain from splitting.
        return 0.0, None

    def gain_for(mask: pd.Series) -> float:
        """Information gain of splitting df into df[mask] and df[~mask]."""
        left, right = df.loc[mask, label], df.loc[~mask, label]
        if left.empty or right.empty:
            return 0.0
        return initial_entropy - (
            (len(left) / len(df)) * entropy(left) +
            (len(right) / len(df)) * entropy(right)
        )

    max_gain = 0.0
    best_split_value = None
    unique_values = list(df[ftr].unique())

    if pd.api.types.is_numeric_dtype(df[ftr]):
        # Numeric feature: try the midpoint of every pair of unique values as a pivot.
        for low, high in combinations(unique_values, 2):
            pivot = (low + high) / 2
            gain = gain_for(df[ftr] <= pivot)
            if gain > max_gain:
                max_gain, best_split_value = gain, [pivot]
    else:
        # Categorical feature: try every subset of up to half the unique values
        # (this covers all binary partitions, including the prefix splits).
        for size in range(1, len(unique_values) // 2 + 1):
            for combo in combinations(unique_values, size):
                gain = gain_for(df[ftr].isin(combo))
                if gain > max_gain:
                    max_gain, best_split_value = gain, list(combo)

    return max_gain, best_split_value

## Part c)

We now loop over all the features, calling `max_gain_for_feature`, and find the feature which leads to the maximum gain. 

In [109]:
from typing import List, Optional, Tuple

def find_best_split_condition(df: pd.DataFrame,
                              label: str) -> Tuple[Optional[str], Optional[List]]:
    """Return (feature, split) with the largest positive gain, or (None, None)."""
    max_gain = 0.0
    best_feature = None
    best_split_value = None

    for ftr in df.columns:
        if ftr == label:
            continue
        gain, split_value = max_gain_for_feature(df, ftr, label)
        # Require a strictly positive gain so that a feature with no usable
        # split (split_value is None) is never selected.
        if split_value is not None and gain > max_gain:
            max_gain = gain
            best_feature = ftr
            best_split_value = split_value

    return best_feature, best_split_value

## Part d)

Let's start by implementing the simple ID3 algorithm, which will give us a good sense of how to do the more complex algorithm.

We will make a fake dataset of 10 observations with a True/False label for how well someone sleeps.

In [106]:
id3_data = pd.DataFrame({
    'day_type': ['long', 'long', 'short', 'medium', 'short', 'short', 'medium', 'long', 'short', 'short'],
    'weekend': [True, False, False, False, False, False, True, True, True, False],
    'good_night_before': [True, True, False, True, True, False, False, False, False, True],
    'label': [True, False, False, False, True, False, True, False, True, False]
})
id3_data

| |day_type|weekend|good_night_before|label|
|---|---|---|---|---|
|0|long|True|True|True|
|1|long|False|True|False|
|2|short|False|False|False|
|3|medium|False|True|False|
|4|short|False|True|True|
|5|short|False|False|False|
|6|medium|True|False|True|
|7|long|True|False|False|
|8|short|True|False|True|
|9|short|False|True|False|


Next we will define an `Id3Node` class which will be used to hold the data for our algorithm. Each instance will be a node in the tree.

In [110]:
from dataclasses import dataclass
from typing import Dict, Optional
@dataclass
class Id3Node:
    key: str
    children: Optional[Dict[str, 'Id3Node']] = None

Now we can implement the `train_id3` method. This method should take a dataframe and return an `Id3Node` which represents the trained tree.

It is quite likely (although not strictly required) that your algorithm will be recursive in nature. As with any recursive algorithm, pay close attention to the base cases defined above.

In [111]:
import ast

MAX_EXAMPLES_PER_NODE = 3

def train_id3(df: pd.DataFrame) -> Id3Node:
    """Recursively train a binary-split decision tree and return its root node."""
    label_key = 'label'
    if df.empty:
        raise ValueError("cannot train on an empty dataframe")

    majority_label = str(df[label_key].mode()[0])

    # Base case 1: all labels are the same.
    # Base case 2: too few examples to split; predict the most frequent label.
    if df[label_key].nunique() == 1 or len(df) <= MAX_EXAMPLES_PER_NODE:
        return Id3Node(key=majority_label)

    best_feature, best_split_value = find_best_split_condition(df, label_key)

    # No usable split condition: stop and predict the most frequent label.
    if best_feature is None or best_split_value is None:
        return Id3Node(key=majority_label)

    def grow(subset: pd.DataFrame) -> Id3Node:
        """Train a subtree, or fall back to the majority label for an empty subset."""
        return train_id3(subset) if not subset.empty else Id3Node(key=majority_label)

    node = Id3Node(key=best_feature, children={})
    if pd.api.types.is_numeric_dtype(df[best_feature]):
        # Numeric split: children are keyed "<= pivot" and "> pivot".
        pivot = best_split_value[0]
        node.children[f"<= {pivot}"] = grow(df[df[best_feature] <= pivot])
        node.children[f"> {pivot}"] = grow(df[df[best_feature] > pivot])
    else:
        # Categorical split: every value in the chosen subset is a key pointing at
        # the same subtree, so prediction can recover the split set from the keys.
        in_mask = df[best_feature].isin(best_split_value)
        in_child = grow(df[in_mask])
        for value in best_split_value:
            node.children[str(value)] = in_child
        node.children['not_in_set'] = grow(df[~in_mask])

    return node

Now we can create a `predict_id3` method which takes in a single row of a `DataFrame` (which is a `Series`) and a trained `Id3Node` (the root node) to make a prediction.

The prediction should take the row and walk down the tree, at each step choosing the proper child given the row information, until it gets to a leaf.

In [112]:
def predict_id3(row: pd.Series, node: Id3Node):
    """Walk the tree from `node` to a leaf and return the predicted label."""
    if not node.children:
        # Leaf: the key holds the label as a string such as "True" or "2".
        try:
            return ast.literal_eval(node.key)
        except (ValueError, SyntaxError):
            return node.key  # labels that are not Python literals stay as strings

    feature_value = row[node.key]
    first_key = next(iter(node.children))
    if first_key.startswith('<= '):
        # Numeric split: reuse the stored key text so the lookup always matches.
        pivot_text = first_key[len('<= '):]
        branch = first_key if feature_value <= float(pivot_text) else f"> {pivot_text}"
    else:
        # Categorical split: values in the split set have their own key;
        # every other value (including unseen ones) goes to 'not_in_set'.
        branch = str(feature_value)
        if branch not in node.children:
            branch = 'not_in_set'

    return predict_id3(row, node.children[branch])


Now put it all together, fill in the `fit` and `predict` methods for the `Id3Model` class.

In [113]:
class ModelNotTrainedError(ValueError):
    """model is not trained yet"""
        
class Id3Model:
    def __init__(self):
        self.tree = None
        self.label_key = 'label'
    
    def fit(self, df):
        if self.label_key not in df:
            raise ValueError(f"{self.label_key} is not in df")
        self.tree = train_id3(df.copy())
        
    def predict(self, df):
        if self.tree is None:
            raise ModelNotTrainedError()
        
        return df.apply(lambda row: predict_id3(row, self.tree), axis=1)
    
    
        

## Part e) (Bonus)

This is a bonus part to the problem.  It is not trivial to implement, but I encourage you to try!  Partial or correct solutions to this part will be worth some extra credit.

We have already implemented the ID3 algorithm; now you can use the functions from the earlier parts to implement the same for numeric features.

Now we can put it all together, implement a training algorithm which produces a trained tree. To help you, we will define a class `Node` as well as a visualize function to produce an image of the node, you can choose to use your own data structure if you so wish.

In [114]:
from dataclasses import dataclass
from typing import Dict
@dataclass
class Node:
    key: str
    numeric: bool
    pivot: float
    children: Dict[str, 'Node']

In [115]:
model = Id3Model()
model.fit(id3_data)
predictions = model.predict(id3_data)

# NOTE: The predictions will depend on the implementation details and the exact splits.
# A robust assertion would require a known tree structure and its predictions.
# For now, we will simply check if predictions are produced.
assert len(predictions) == len(id3_data)
assert all(isinstance(p, bool) for p in predictions)


## Part f)

Now use this algorithm to fit the wine dataset below.  Give an overview of your results and an explanation of your findings.  If you did not do the bonus problem in part e), you may use the class below or the `scikit-learn` model directly.  The class is a very simple wrapper to conform to the dataframe format we have been using.


Some questions to consider:

1. Do your results make sense?

**Note**: It may take some time to train your algorithm

In [120]:
%pip install scikit-learn
from sklearn.tree import DecisionTreeClassifier
class DecisionTreeModel(DecisionTreeClassifier):
    
    def fit(self, df):
        if 'label' not in df:
            raise ValueError("Label is not in df")
        X = df.drop('label', axis=1)
        self.columns_ = X.columns.tolist()        
        super().fit(X.values, df['label'].values)
        return self
    
    def predict(self, df):
        X = df[self.columns_].values if isinstance(df, pd.DataFrame) else df
        return super().predict(X)


In [121]:
from sklearn.datasets import load_wine
import pandas as pd
dataset = load_wine()
df = pd.DataFrame(dataset['data'], columns=dataset['feature_names'])
df['label'] = dataset['target']
df.head()

| |alcohol|malic_acid|ash|alcalinity_of_ash|magnesium|total_phenols|flavanoids|nonflavanoid_phenols|proanthocyanins|color_intensity|hue|od280/od315_of_diluted_wines|proline|label|
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
|0|14.23|1.71|2.43|15.6|127.0|2.8|3.06|0.28|2.29|5.64|1.04|3.92|1065.0|0|
|1|13.2|1.78|2.14|11.2|100.0|2.65|2.76|0.26|1.28|4.38|1.05|3.4|1050.0|0|
|2|13.16|2.36|2.67|18.6|101.0|2.8|3.24|0.3|2.81|5.68|1.03|3.17|1185.0|0|
|3|14.37|1.95|2.5|16.8|113.0|3.85|3.49|0.24|2.18|7.8|0.86|3.45|1480.0|0|
|4|13.24|2.59|2.87|21.0|118.0|2.8|2.69|0.39|1.82|4.32|1.04|2.93|735.0|0|
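
One possible way to approach the wine fit is sketched below using `scikit-learn` directly, as the problem statement allows. The 70/30 stratified split, the `entropy` criterion and the fixed `random_state` are assumptions chosen for illustration, not requirements of the assignment.

```python
from sklearn.datasets import load_wine
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

wine = load_wine()

# Hold out part of the data so the reported accuracy is not measured on the training set.
X_train, X_test, y_train, y_test = train_test_split(
    wine.data, wine.target, test_size=0.3, random_state=0, stratify=wine.target
)

clf = DecisionTreeClassifier(criterion="entropy", random_state=0)
clf.fit(X_train, y_train)

test_accuracy = accuracy_score(y_test, clf.predict(X_test))

# Rank features by how much each one contributed to the fitted tree's splits.
ranked = sorted(
    zip(wine.feature_names, clf.feature_importances_),
    key=lambda pair: pair[1],
    reverse=True,
)
print(f"held-out accuracy: {test_accuracy:.3f}")
print(ranked[:3])
```

Comparing the held-out accuracy with the training accuracy, and checking whether the top-ranked features are plausible, is one way to answer whether the results make sense.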


# Problem 2

In this problem, we will implement the perceptron algorithm as defined in the book. Below is the skeleton class we will want to implement.

Your algorithm should take a hyperparameter `max_iters`, which is the number of times it will iterate through the entire training set.

Please include a `get_params` method which returns a `PerceptronParams` object with the current parameters of the model. It's your choice how you store these, but this method must return this object type as it will be used to validate your results.

For consistency with the results, please initialize all weights and biases to zero.
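
The core of the training loop can be sketched on plain NumPy arrays as follows. This is a minimal sketch assuming the classic mistake-driven update with labels encoded as -1 and +1; the book's exact formulation may differ, the function name `perceptron_epoch` is hypothetical, and shuffling is left out for brevity.

```python
import numpy as np


def perceptron_epoch(X: np.ndarray, y: np.ndarray, weights: np.ndarray, bias: float):
    """Run one pass over the data and return (weights, bias, n_mistakes)."""
    weights = weights.astype(float).copy()
    mistakes = 0
    for x_i, y_i in zip(X, y):
        activation = float(np.dot(weights, x_i) + bias)
        # Update only when the example is misclassified (or on the boundary).
        if y_i * activation <= 0:
            weights += y_i * x_i
            bias += y_i
            mistakes += 1
    return weights, bias, mistakes


X = np.array([[2.0, 1.0], [1.0, 2.0], [-1.0, -2.0], [-2.0, -1.0]])
y = np.array([1, 1, -1, -1])

weights, bias = np.zeros(X.shape[1]), 0.0  # zero initialization, as requested
for _ in range(20):  # plays the role of max_iters
    weights, bias, mistakes = perceptron_epoch(X, y, weights, bias)
    if mistakes == 0:
        break

predictions = np.where(X @ weights + bias > 0, 1, -1)
```

In the `Perceptron` class, the same loop would live in `fit`, with the final `weights` and `bias` returned by `get_params`.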

In [None]:
from dataclasses import dataclass
from typing import List
@dataclass
class PerceptronParams:
    weights: List[float]
    bias: float

class Perceptron:
    def __init__(self, max_iters=20):
        # YOUR CODE HERE
        raise NotImplementedError()
        
    def get_params(self) -> PerceptronParams:
        # YOUR CODE HERE
        raise NotImplementedError()
    def fit(self, df, callback=None):
        # YOUR CODE HERE
        raise NotImplementedError()
    
    def predict(self, df):
        # YOUR CODE HERE
        raise NotImplementedError()

## Part 2

Now let's explore linear separability with the same Iris dataset that was used in Lecture 4. Check that lecture to see how the data was preprocessed for binary classification (setosa versus non-setosa) with just 2 features (petal width and petal length), and replicate that dataset in the next code block.
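
A possible starting point for building this dataset is sketched below. The lecture's exact preprocessing is not reproduced here, so the column names `petal_length` and `petal_width` and the +1 / -1 label encoding are assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import load_iris

iris = load_iris(as_frame=True)

# Keep only the two petal features.
iris_df = iris.data[["petal length (cm)", "petal width (cm)"]].copy()
iris_df.columns = ["petal_length", "petal_width"]

# Class 0 is setosa in scikit-learn's copy of the dataset.
iris_df["label"] = np.where(iris.target == 0, 1, -1)
```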



In [None]:
# YOUR CODE HERE

Now train a perceptron and plot the best fit line on top of the scatter plot from above.

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

Now let's examine how the line evolves as the model is trained. Make a plot showing how the line changes per iteration.

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

## Part b) (Bonus - extra credit)

For extra credit, implement averaging in the perceptron model.
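
One simple form of averaging keeps a running sum of the weights and bias after every example and returns their mean. The sketch below assumes labels in {-1, +1} and uses the naive running-sum formulation; the function name `averaged_perceptron` is hypothetical and the book's variant may be organized differently.

```python
import numpy as np


def averaged_perceptron(X: np.ndarray, y: np.ndarray, max_iters: int = 20):
    """Return the average of the weights and bias seen over all update steps."""
    n_features = X.shape[1]
    weights, bias = np.zeros(n_features), 0.0
    weight_sum, bias_sum, steps = np.zeros(n_features), 0.0, 0

    for _ in range(max_iters):
        for x_i, y_i in zip(X, y):
            if y_i * (np.dot(weights, x_i) + bias) <= 0:
                weights = weights + y_i * x_i
                bias = bias + y_i
            # Accumulate after every example, whether or not an update happened.
            weight_sum += weights
            bias_sum += bias
            steps += 1

    steps = max(steps, 1)  # avoid dividing by zero when max_iters is 0
    return weight_sum / steps, bias_sum / steps


X = np.array([[2.0, 1.0], [1.0, 2.0], [-1.0, -2.0], [-2.0, -1.0]])
y = np.array([1, 1, -1, -1])
avg_weights, avg_bias = averaged_perceptron(X, y)
```

Because weights that survive many examples contribute more to the average, the averaged parameters tend to change less between iterations than the raw ones.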

In [None]:
class AveragePerceptron(Perceptron):
    # YOUR CODE HERE
    raise NotImplementedError()

Now make the same plot as in the previous section.

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

## Part c)

Analyze the stability of the weights as a function of iteration. If you did part b, please include it in your analysis. Some points to consider:

- How stable were they over time?
- How many iterations were appropriate?

YOUR ANSWER HERE

# Problem 3 - Conceptual

For this problem, write each answer in the cell immediately following the question which will be marked.

## Part a)

The algorithm we have used to train the decision tree is a greedy algorithm.  Explain why we use a greedy algorithm and what consequences this has on the resulting model.

YOUR ANSWER HERE

## Part b)

In our perceptron algorithm, *IN YOUR OWN WORDS*, why is it important to shuffle the datasets?  

YOUR ANSWER HERE

## Part c)

Is a decision tree guaranteed to find a globally optimal solution? If not, what are the barriers to creating an algorithm that finds the globally optimal solution?

YOUR ANSWER HERE

# Problem 4 - Data Problem

Problem 4 is the real-world simulation problem. Here I will simply give you a dataset and a problem; your goal is to solve the problem and list all of your assumptions as well as your results. This is meant to simulate many of the types of problems you may see in a future interview.

You will be evaluated on how well you use the techniques we have learned so far in the course. You do not need to have the best model; you will be evaluated more on how you think and explain your solution.

Here we are going to use the very common California housing dataset.


## Part a)

First of all do some exploratory data analysis and report your findings.  The following function will get the data for you, please take it from there!

In [None]:
from sklearn.datasets import fetch_california_housing

blob = fetch_california_housing()
df = pd.DataFrame(blob['data'], columns=blob['feature_names'])
df[blob['target_names'][0]] = blob['target']

## Part b)

Before modeling, it's often good to start with a baseline model: something simple which will give us some indication of whether the model we are building is actually learning something. Let's start by building a model which predicts that the median housing price of a block is the median of the entire dataset. Please compute the mean squared error of this model.
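
The baseline computation can be illustrated on a toy target vector as in the sketch below. The function name `median_baseline_mse` is an assumption for this example; on the real data the input would be the target column of `df`.

```python
import numpy as np


def median_baseline_mse(y) -> float:
    """MSE of a model that always predicts the median of `y`."""
    y = np.asarray(y, dtype=float)
    prediction = np.median(y)
    return float(np.mean((y - prediction) ** 2))


print(median_baseline_mse([1.0, 2.0, 3.0, 10.0]))  # 14.75
```

Any model fitted later should achieve a lower mean squared error than this constant predictor to count as having learned something.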

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

## Part c)

In this problem we are going to use a linear regression to explore this dataset.  We will learn more about scikit-learn in the future, here you will be provided with an object to fit a dataframe to a linear model.  You are welcome to use the `scikit-learn` object directly if you so choose.

In [None]:
from typing import List, Optional
from sklearn.linear_model import LinearRegression

class LinearModel:
    def __init__(self, fit_intercept: bool = True):
        self.fit_intercept = fit_intercept
        self.target_variable: Optional[str] = None
        self.columns: Optional[List[str]] = None
        self.model: Optional[LinearRegression] = None
        
    def _check_fit(self):
        if self.model is None:
            raise RuntimeError("model is not yet fit")
        
    def fit(self, df: pd.DataFrame, target_variable: str) -> 'LinearModel':
        """fit a dataframe and return the coefficients and intercept
        
        Parameters
        ----------
        df: pd.DataFrame
            Input dataframe for fitting, should contain target variable
        target_variable: str
            Target variable for fitting, must be in the dataframe
        """
        self.target_variable = target_variable
        X = df.drop(target_variable, axis=1)
        y = df[target_variable].values
        self.columns = X.columns.tolist()
        self.model = LinearRegression(fit_intercept=self.fit_intercept)
        self.model.fit(X.values, y)
        return self
    
    @property
    def coef(self):
        self._check_fit()
        return dict(zip(self.columns, self.model.coef_))
    
    @property
    def intercept(self) -> float:
        self._check_fit()
        return self.model.intercept_

    
    def predict(self, df: pd.DataFrame):
        """
        Make predictions using the fitted model
        
        Parameters
        ----------
        df: pd.DataFrame
            Input dataframe for prediction
        """
        self._check_fit()
        X = df[self.columns].values
        return self.model.predict(X)

Start off by fitting the linear regression directly to the dataset and then interpret your results.  Remove the Latitude and Longitude information for now (more on this in the next part).

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

## Part d)

Latitude and Longitude are different sorts of features than the rest of the features in this dataset, please explain why.

**Hint**: Try making a plot

## Part e) (Bonus)

Now we are going to use the latitude and longitude to generate some new features. Often in an ML problem, bringing in more information is how we can best improve our predictive accuracy.

Instead of using them directly, let's create a feature which is the minimum distance from a "major" city, defined as having more than 0.5 million people.

I have looked up the following coordinates on Wikipedia:

| City | Latitude | Longitude |
| --- | ---| ---|
|Los Angeles| 34.03| -118.15|
|San Diego|32.4254| -117.0945|
|San Francisco|37.4639| -122.2459|
|San Jose |37.2010| -121.5326|
|Sacramento|38.3454| -121.2940|

You are welcome to use any distance metric you like, however, one good one would be from the geopy library. 

In [None]:
lat_longs = [
    (34.03, -118.15),
    (32.4254, -117.0945),
    (37.4639, -122.2459),
    (37.2010, -121.5326),
    (38.3454, -121.2940)
]
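
As one option for the distance metric, the sketch below uses a plain haversine (great-circle) formula from the standard library in place of `geopy`. The function names `haversine_km` and `min_city_distance_km` and the Earth radius of 6371 km are assumptions for this illustration. One caveat, also stated as an assumption: the tabulated coordinates look like degrees, minutes and seconds written with a decimal point (for example 32.4254 for 32°42′54″), in which case they would need converting to decimal degrees before use.

```python
import math
from typing import Iterable, Tuple

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, an approximation


def haversine_km(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance in kilometres between two points in decimal degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = math.radians(lon2 - lon1)
    a = (math.sin(d_phi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2)
    # Clamp to 1 to guard against rounding just above the valid range of asin.
    return 2 * EARTH_RADIUS_KM * math.asin(min(1.0, math.sqrt(a)))


def min_city_distance_km(lat: float, lon: float,
                         cities: Iterable[Tuple[float, float]]) -> float:
    """Distance from (lat, lon) to the nearest city in `cities`."""
    return min(haversine_km(lat, lon, c_lat, c_lon) for c_lat, c_lon in cities)
```

With the housing dataframe, the new feature could then be built with something like `df.apply(lambda r: min_city_distance_km(r['Latitude'], r['Longitude'], lat_longs), axis=1)`.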

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

## Part f)

Now please create the best linear model that you can, and produce an explanation of the decisions you made and why this is a "good" model.