# Assignment 6: Feature selection and regularization

# Total: /100

## Instructions

* Complete the assignment

* Once the notebook is complete, **restart** your kernel and **rerun** your cells

* Submit your completed notebook to OWL by the deadline

* You may use any Python library functions you wish to complete the assignment

In [2]:
# You may need these
import pandas as pd
import numpy as np
import seaborn as sns
import sklearn as sk
import sklearn.linear_model as skl
from sklearn import preprocessing
from sklearn import metrics
from sklearn.preprocessing import PolynomialFeatures
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.feature_selection import SelectFromModel
import matplotlib.pyplot as plt
from IPython.display import display

# To get geo info of IP addresses:
#!pip install maxminddb-geolite2
from geolite2 import geolite2

seed = 2023
np.random.seed(seed)

## Question 1: /20 pts

The dataset `customer_data.csv` records customer behavior and demographics through the following attributes:

- **full.name**: Customer's full name
- **ip.address**: Customer's IP address
- **region**: Customer's geographical region
- **age**: Customer's age
- **items**: Number of items purchased by the customer
- **amount**: The total amount spent by the customer

Businesses can leverage this dataset to make data-driven decisions, understand customer preferences, and tailor their strategies to meet customer needs and interests.


### 1.1 Load the dataset and display the first 5 rows.

In [3]:
#your code here
customer_data = pd.read_csv('customer_data.csv')
display(customer_data.head())

Unnamed: 0,full.name,ip.address,region,in.store,age,items,amount
0,Carter Stokes,,2,0,37,4,281.03
1,Jacob Jerde,,2,0,35,2,219.51
2,Tressa Ratke,192.90.208.202,4,1,45,3,1525.7
3,Rudolf Abshire,251.55.128.164,3,1,46,3,715.25
4,Theresa Davis,182.19.192.186,1,1,33,4,1937.5


### 1.2 First, remove any rows where the `age` column is below 18 or above 80. Then extract two new features from `ip.address`: one called `latitude` and the other `longitude`. Use the package `geolite2` to convert the IP addresses to latitude and longitude, and use [pandas.DataFrame.apply](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.apply.html) to do the conversion in one go for each new feature. Avoid using `for` loops. At the end, drop the column `ip.address` as well as any rows with a missing value. Display the first 5 rows of the new dataframe and report its shape.

In [6]:
#your code here
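# Keep ages 18 to 80 inclusive; .copy() avoids a SettingWithCopyWarning when new columns are added below
customer_data = customer_data[(customer_data['age'] >= 18) & (customer_data['age'] <= 80)].copy()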
def get_lat_lon(ip):
    try:
        location = geolite2.reader().get(ip)
        if location and 'location' in location:
            return pd.Series([location['location']['latitude'], location['location']['longitude']])
        else:
            return pd.Series([None, None])
    except Exception:
        return pd.Series([None, None])


customer_data[['latitude', 'longitude']] = customer_data['ip.address'].apply(get_lat_lon)
df = customer_data.drop(columns=['ip.address'])
df = df.dropna()

df.head()
df.shape

(34303, 8)

### 1.3 Perform one-hot encoding on the `region` column using `pd.get_dummies()`. Display the first 5 rows of the encoded dataframe.

In [7]:
#your code here
customer_data_encoded = pd.get_dummies(customer_data, columns=['region'], drop_first=True)
display(customer_data_encoded.head())

Unnamed: 0,full.name,ip.address,in.store,age,items,amount,latitude,longitude,region_2,region_3,region_4
0,Carter Stokes,,0,37,4,281.03,,,True,False,False
1,Jacob Jerde,,0,35,2,219.51,,,True,False,False
2,Tressa Ratke,192.90.208.202,1,45,3,1525.7,42.5879,-71.3498,False,False,True
3,Rudolf Abshire,251.55.128.164,1,46,3,715.25,,,False,True,False
4,Theresa Davis,182.19.192.186,1,33,4,1937.5,1.2931,103.8558,False,False,False


### 1.4 Calculate the natural logarithm of the column reporting clients' total amount spent and store it as a new column `log_amount`. Create your design matrix `X` and target vector `y` with `log_amount` as target (No training/test splitting yet).



In [13]:
#your code here
customer_data_encoded['log_amount'] = np.log(customer_data_encoded['amount'])
X = customer_data_encoded.drop(columns=['full.name', 'amount', 'log_amount', 'ip.address'])
y = customer_data_encoded['log_amount']
X

Unnamed: 0,in.store,age,items,latitude,longitude,region_2,region_3,region_4
0,0,37,4,,,True,False,False
1,0,35,2,,,True,False,False
2,1,45,3,42.5879,-71.3498,False,False,True
3,1,46,3,,,False,True,False
4,1,33,4,1.2931,103.8558,False,False,False
...,...,...,...,...,...,...,...,...
79995,1,71,3,37.3042,-122.0946,False,False,False
79996,0,59,7,,,False,True,False
79997,0,54,1,,,True,False,False
79998,1,49,4,45.3548,-75.5773,False,False,False


### 1.5 Build a new design matrix by applying polynomial expansion using `PolynomialFeatures()` on `X` with degree=2. Do not include the column with power 0 (*i.e.*, the column with all elements being 1) and make sure to not set the argument `interaction_only` to `True`.


In [12]:
#your code here
# PolynomialFeatures rejects NaN, so drop rows with missing values first (keeping y aligned with X)
complete_rows = X.notna().all(axis=1)
X, y = X[complete_rows], y[complete_rows]
poly = PolynomialFeatures(degree=2, include_bias=False)
X_poly = poly.fit_transform(X)


### 1.6 Standardize your design matrix from Question 1.5 using `StandardScaler()`, and store the result as a Pandas dataframe.

In [None]:
#your code here
scaler = StandardScaler()
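# get_feature_names() was removed from recent scikit-learn releases; get_feature_names_out() replaces it.
# Reusing X.index keeps the rows of X_scaled aligned with y.
X_scaled = pd.DataFrame(scaler.fit_transform(X_poly), columns=poly.get_feature_names_out(X.columns), index=X.index)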


## Question 2: /7 pts



### 2.1 Split the data into training and test sets. Hold out 30% of observations as the test set. How many observations are in your training dataset? What is the average value of the target variable in the training dataset (rounded to 2 decimal places)?
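
One possible shape for this step is sketched below on synthetic data. The names `X_demo` and `y_demo` are stand-ins for the standardized design matrix and the log target, and the choice `random_state=2023` is an assumption that simply mirrors the notebook's `seed`.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(2023)

# Synthetic stand-ins for the standardized design matrix and the log target
X_demo = pd.DataFrame(rng.normal(size=(100, 3)), columns=["a", "b", "c"])
y_demo = pd.Series(rng.normal(loc=6.0, size=100), name="log_amount")

# Hold out 30% of the observations as the test set
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=2023
)

n_train = X_train.shape[0]
mean_target = round(float(y_train.mean()), 2)
print(n_train, mean_target)
```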

In [None]:
#your code here


## Question 3: /23 pts



### 3.1 Create a scikit-learn `Ridge` regression object. Train it on the training data using an `alpha` of $4.0$ and fit the intercept (`fit_intercept=True`).

In [None]:
#your code here


### 3.2 Now use `RidgeCV` to find the best `alpha` for the penalty term through a 5-fold cross-validation. As input for the `alphas` argument, your code must try integer values from 30 to 50 inclusive. Report the `alpha` that yields the smallest loss.
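
A minimal sketch of the cross-validated search, run on synthetic data from `make_regression` rather than the assignment data (an assumption made so the snippet is self-contained). The grid is stored as floats holding the integer values 30 through 50. With `cv=5` and the default `scoring=None`, `RidgeCV` ranks candidates by the estimator's own score, so a different `scoring` may be needed if a specific loss is required.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import RidgeCV

# Synthetic regression problem standing in for the training data
X_demo, y_demo = make_regression(
    n_samples=200, n_features=10, noise=10.0, random_state=2023
)

# Integer-valued candidates 30, 31, ..., 50 (the upper bound 51 is excluded by arange)
alphas = np.arange(30, 51, dtype=float)

ridge_cv = RidgeCV(alphas=alphas, cv=5, fit_intercept=True)
ridge_cv.fit(X_demo, y_demo)

best_alpha = ridge_cv.alpha_
print(best_alpha)
```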

In [None]:
#your code here


### 3.3 Fit a `Ridge` regression on the training data with the best `alpha` found in the previous question.

In [None]:
#your code here


### 3.4 Fit a simple `LinearRegression` without any penalty using the training data (again, `fit_intercept=True`). Compare the regression coefficients obtained in questions 3.1, 3.3 and 3.4. How do they compare?
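
One way to make the comparison concrete is to look at the overall size of each coefficient vector. The sketch below uses synthetic data and an arbitrary larger penalty of 40 (both assumptions for illustration); the L2 norm of the ridge coefficients does not increase as `alpha` grows, and the unpenalized fit has the largest norm.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression, Ridge

X_demo, y_demo = make_regression(
    n_samples=100, n_features=5, noise=5.0, random_state=2023
)

# Three fits that differ only in the strength of the L2 penalty
fitted = {
    "ols": LinearRegression(fit_intercept=True).fit(X_demo, y_demo),
    "ridge_small": Ridge(alpha=4.0, fit_intercept=True).fit(X_demo, y_demo),
    "ridge_large": Ridge(alpha=40.0, fit_intercept=True).fit(X_demo, y_demo),
}

# L2 norm of each coefficient vector (the intercept is not penalized and is excluded)
norms = {name: float(np.linalg.norm(model.coef_)) for name, model in fitted.items()}
print(norms)
```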

In [None]:
#your code here


#### YOUR ANSWER HERE

Here


### 3.5 Use your trained linear regression models in Q3.3 and Q3.4 to predict over the test set and print the median of their predictions.

In [None]:
#your code here


## Question 4: /25 pts



### 4.1 Fit a Lasso regression to the training set using `lasso_path()`. Show the full path of the first 20 coefficients of the Lasso regression. Include `eps=8e-3` and `n_alphas=50`. Describe the trends you see in the figure.
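
The sketch below shows one way to compute and plot such a path on synthetic data (the data, the centering of the target, and the log-scaled x-axis are all assumptions made for illustration). `lasso_path` returns the penalties in decreasing order, starting at the smallest `alpha` for which every coefficient is zero and ending at `eps` times that value.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_regression
from sklearn.linear_model import lasso_path

X_demo, y_demo = make_regression(
    n_samples=200, n_features=25, n_informative=8, noise=5.0, random_state=2023
)
# lasso_path does not fit an intercept, so center the target first
y_centered = y_demo - y_demo.mean()

alphas, coefs, _ = lasso_path(X_demo, y_centered, eps=8e-3, n_alphas=50)

# coefs has shape (n_features, n_alphas); plot one line per feature for the first 20
fig, ax = plt.subplots()
for coef_path in coefs[:20]:
    ax.plot(alphas, coef_path)
ax.set_xscale("log")
ax.set_xlabel("alpha")
ax.set_ylabel("coefficient")
ax.set_title("Lasso path, first 20 coefficients")
```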

In [None]:
#your code here


### 4.2 Use scikit-learn's cross-validated Lasso (`LassoCV`) to automatically search for the best `alpha` of the Lasso regression on the training set, with intercept. Include the arguments `eps=8e-3`, `n_alphas=30`, `tol=0.001`, `cv=5`, and `random_state=seed`. Report the best tuning parameter and the number of coefficients that the model shrinks to zero.
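
As a hedged illustration on synthetic data (only 5 of the 20 generated features are informative, an assumption chosen so that some coefficients are likely to be zeroed), the selected penalty is available as `alpha_` and the zeroed coefficients can be counted directly from `coef_`.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LassoCV

X_demo, y_demo = make_regression(
    n_samples=200, n_features=20, n_informative=5, noise=5.0, random_state=2023
)

lasso_cv = LassoCV(
    eps=8e-3, n_alphas=30, tol=0.001, cv=5, random_state=2023, fit_intercept=True
)
lasso_cv.fit(X_demo, y_demo)

# Penalty chosen by cross-validation and number of coefficients set exactly to zero
best_alpha = lasso_cv.alpha_
n_zero = int(np.sum(lasso_cv.coef_ == 0))
print(best_alpha, n_zero)
```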

In [None]:
#your code here



### 4.3 Use scikit-learn's cross-validated ElasticNet (`ElasticNetCV`) to automatically search for the best tuning parameters of the ElasticNet regression (with intercept) on the training set. Include the same arguments as in question 4.2 as well as `l1_ratio=[0.7, 0.9, 0.95, 0.99, 1]`. Report the best tuning parameters. Is the ElasticNet regression model equivalent to the Lasso regression? Briefly describe how they differ and under what circumstances they become the same.
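
The sketch below, on synthetic data, runs the search and then checks the limiting case numerically. scikit-learn's ElasticNet penalty is `alpha * l1_ratio * ||w||_1 + 0.5 * alpha * (1 - l1_ratio) * ||w||_2^2`, so the L2 term vanishes when `l1_ratio` is 1. The fixed penalty `alpha=0.5` used for the check is an arbitrary value chosen for illustration.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNet, ElasticNetCV, Lasso

X_demo, y_demo = make_regression(
    n_samples=200, n_features=20, n_informative=5, noise=5.0, random_state=2023
)

l1_grid = [0.7, 0.9, 0.95, 0.99, 1]
enet_cv = ElasticNetCV(
    l1_ratio=l1_grid, eps=8e-3, n_alphas=30, tol=0.001, cv=5,
    random_state=2023, fit_intercept=True,
)
enet_cv.fit(X_demo, y_demo)
print(enet_cv.alpha_, enet_cv.l1_ratio_)

# With l1_ratio = 1 the L2 term drops out, so ElasticNet and Lasso give the same fit
enet = ElasticNet(alpha=0.5, l1_ratio=1.0).fit(X_demo, y_demo)
lasso = Lasso(alpha=0.5).fit(X_demo, y_demo)
same_fit = bool(np.allclose(enet.coef_, lasso.coef_))
print(same_fit)
```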

In [None]:
#your code here


#### YOUR ANSWER HERE

Here

## Question 5: /16 pts



### 5.1 Use `SequentialFeatureSelector()` to conduct forward selection for the features of the Ridge model tuned in Q3.3. Include the argument `n_features_to_select=20`. Report the indices of the selected features.
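
A small-scale sketch of forward selection follows. It uses synthetic data, a `Ridge` with `alpha=4.0`, and only 3 features to select so that it runs quickly; all of these are assumptions for illustration and differ from the values the question asks for.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import Ridge

X_demo, y_demo = make_regression(
    n_samples=100, n_features=8, n_informative=3, noise=5.0, random_state=2023
)

# Forward selection: start from no features and greedily add the one that
# most improves the cross-validated score, until 3 features are chosen
sfs = SequentialFeatureSelector(
    Ridge(alpha=4.0), n_features_to_select=3, direction="forward", cv=5
)
sfs.fit(X_demo, y_demo)

selected_idx = sfs.get_support(indices=True)
X_selected = sfs.transform(X_demo)
print(selected_idx)
```

`sfs.transform` returns only the selected columns, which can then be passed to another estimator.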

In [None]:
#your code here


### 5.2 Fit a regular `LinearRegression` (with `fit_intercept=True`) on the training set using the selected features from the previous question. Print the first 3 coefficients of your model.

In [None]:
#your code here


## Question 6: /9 pts



### 6.1 Make predictions on the test set using the models from questions 3.3, 4.2, 4.3, and 5.2, respectively. Create a DataFrame with the predicted values obtained from the different models. Name the columns of the dataframe after the models or their question numbers. Display the first 5 rows of this dataframe.

In [None]:
#your code here


### 6.2 Use `mean_squared_error` as your scorer to assess the performance of the different models (those reported in the previous question) based on all the predicted values over the test set. Based on this scorer, which model is the best?
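
The comparison could be organized roughly as follows. The three models, their penalties, and the synthetic data are placeholders chosen for illustration, not the tuned models from the earlier questions. Each column of the prediction frame is scored against the held-out target, and the lowest mean squared error identifies the best model under this scorer.

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, LinearRegression, Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

X_demo, y_demo = make_regression(
    n_samples=200, n_features=10, noise=10.0, random_state=2023
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=2023
)

# Placeholder models; one column of test-set predictions per model
models = {"ridge": Ridge(alpha=4.0), "lasso": Lasso(alpha=0.1), "ols": LinearRegression()}
preds = pd.DataFrame(
    {name: model.fit(X_tr, y_tr).predict(X_te) for name, model in models.items()}
)

# Test-set MSE per model; lower is better
mse = preds.apply(lambda col: mean_squared_error(y_te, col))
best_model = mse.idxmin()
print(mse.sort_values())
print(best_model)
```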

In [None]:
#your code here


#### YOUR ANSWER HERE

Here