# Regression
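This analysis is divided into several parts.
1) The first section imports the necessary libraries.
2) The next section loads the dataset.
3) The data is then preprocessed and split into training and test sets.
4) Finally, several regression models are fitted and compared by their mean squared error on the test set.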

# Abalone Dataset
The task is to predict the age of abalone from physical measurements. The age of an abalone is determined by cutting the shell through the cone, staining it, and counting the number of rings through a microscope, a boring and time-consuming task. Other measurements, which are easier to obtain, are used to predict the age. Further information, such as weather patterns and location (hence food availability), may be required to solve the problem.

Examples with missing values were removed from the original data (the majority had the predicted value missing), and the ranges of the continuous values were scaled for use with an artificial neural network (ANN) by dividing by 200.

[Get Dataset](https://archive.ics.uci.edu/dataset/1/abalone)

| Variable Name | Role | Type | Description | Units | Missing Values |
| ------------- |:-------------:|:-------------:|:-------------:|:-------------:| -----:|
| Sex | Feature | Categorical | M, F, and I (infant) | | no |
| Length | Feature | Continuous | Longest shell measurement | mm | no |
| Diameter | Feature | Continuous | Perpendicular to length | mm | no |
| Height | Feature | Continuous | With meat in shell | mm | no |
| Whole_weight | Feature | Continuous | Whole abalone | grams | no |
| Shucked_weight | Feature | Continuous | Weight of meat | grams | no |
| Viscera_weight | Feature | Continuous | Gut weight (after bleeding) | grams | no |
| Shell_weight | Feature | Continuous | After being dried | grams | no |
| Rings | Target | Integer | +1.5 gives the age in years | | no |


## Import libraries

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sn
from sklearn import preprocessing
from sklearn.model_selection import train_test_split
from sklearn import linear_model
from sklearn import tree
from sklearn import ensemble 
from sklearn.metrics import mean_squared_error

## Load Dataset 

In [2]:
df = pd.read_csv("./data/abalone/abalone.csv")
df.head()

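|   | Sex | Length | Diameter | Height | Whole_weight | Shucked_weight | Viscera_weight | Shell_weight | Rings |
| - |:---:| ------:| --------:| ------:| ------------:| --------------:| --------------:| ------------:| -----:|
| 0 | M | 0.455 | 0.365 | 0.095 | 0.514 | 0.2245 | 0.101 | 0.15 | 15 |
| 1 | M | 0.35 | 0.265 | 0.09 | 0.2255 | 0.0995 | 0.0485 | 0.07 | 7 |
| 2 | F | 0.53 | 0.42 | 0.135 | 0.677 | 0.2565 | 0.1415 | 0.21 | 9 |
| 3 | M | 0.44 | 0.365 | 0.125 | 0.516 | 0.2155 | 0.114 | 0.155 | 10 |
| 4 | I | 0.33 | 0.255 | 0.08 | 0.205 | 0.0895 | 0.0395 | 0.055 | 7 |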


## Data Preprocessing

In [4]:
df.dtypes

Sex                object
Length            float64
Diameter          float64
Height            float64
Whole_weight      float64
Shucked_weight    float64
Viscera_weight    float64
Shell_weight      float64
Rings               int64
dtype: object

In [5]:
df.isna().any()

Sex               False
Length            False
Diameter          False
Height            False
Whole_weight      False
Shucked_weight    False
Viscera_weight    False
Shell_weight      False
Rings             False
dtype: bool

In [6]:
sex = df[["Sex"]].values  # (n_samples, 1) array of sex labels
sex

array([['M'],
       ['M'],
       ['F'],
       ...,
       ['M'],
       ['F'],
       ['M']], shape=(4177, 1), dtype=object)

In [7]:
sex[0]

array(['M'], dtype=object)

In [8]:
np.where(sex == "M")  # row and column indices of the male samples

(array([   0,    1,    3, ..., 4173, 4174, 4176], shape=(1528,)),
 array([0, 0, 0, ..., 0, 0, 0], shape=(1528,)))

In [9]:

col_name = [
    "Sex",
    "Length",
    "Diameter",
    "Height",
    "Whole_weight",
    "Shucked_weight",
    "Viscera_weight",
    "Shell_weight",
    "Rings",
]

In [10]:
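# Keep the seven continuous measurements; the categorical Sex column and the Rings target are dropped.
X = df.drop(["Sex", "Rings"], axis=1).values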

In [11]:
X

array([[0.455 , 0.365 , 0.095 , ..., 0.2245, 0.101 , 0.15  ],
       [0.35  , 0.265 , 0.09  , ..., 0.0995, 0.0485, 0.07  ],
       [0.53  , 0.42  , 0.135 , ..., 0.2565, 0.1415, 0.21  ],
       ...,
       [0.6   , 0.475 , 0.205 , ..., 0.5255, 0.2875, 0.308 ],
       [0.625 , 0.485 , 0.15  , ..., 0.531 , 0.261 , 0.296 ],
       [0.71  , 0.555 , 0.195 , ..., 0.9455, 0.3765, 0.495 ]],
      shape=(4177, 7))
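
The feature matrix above leaves out `Sex`. If one wanted to keep it, a common option is one-hot encoding. The following is only a minimal sketch on a hand-made three-row frame (values copied from the `df.head()` rows; the names `demo`, `encoded`, and `X_demo` are hypothetical), and it is not what produced the results reported in this notebook.

```python
import pandas as pd

# Tiny stand-in for the abalone frame (three rows copied from df.head()).
demo = pd.DataFrame({
    "Sex": ["M", "F", "I"],
    "Length": [0.455, 0.53, 0.33],
    "Rings": [15, 9, 7],
})

# One indicator column per category: Sex_F, Sex_I, Sex_M.
encoded = pd.get_dummies(demo, columns=["Sex"], dtype=int)

# Features are everything except the target.
X_demo = encoded.drop(columns=["Rings"]).values

print(encoded.columns.tolist())
print(X_demo.shape)
```

For linear models with an intercept, one of the three indicator columns is redundant; `pd.get_dummies(..., drop_first=True)` would remove it.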

In [12]:
y = df["Rings"].values

In [13]:
y = y.reshape(-1, 1)  # column vector of shape (n_samples, 1)

In [14]:
y

array([[15],
       [ 7],
       [ 9],
       ...,
       [ 9],
       [10],
       [12]], shape=(4177, 1))
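
According to the variable table, adding 1.5 to the ring count gives the age in years. A small sketch of that conversion, using the ring counts of the five rows shown by `df.head()` (the array names are illustrative):

```python
import numpy as np

# Ring counts of the first five samples shown in df.head().
rings = np.array([15, 7, 9, 10, 7])

# Per the variable table: age in years = rings + 1.5.
age_years = rings + 1.5
print(age_years)
```

Because the shift is a constant, a model's mean squared error is the same whether it is measured in rings or in years.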

### Splitting Data

In [15]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [16]:
X_train

array([[0.55  , 0.445 , 0.125 , ..., 0.288 , 0.1365, 0.21  ],
       [0.475 , 0.355 , 0.1   , ..., 0.2535, 0.091 , 0.14  ],
       [0.305 , 0.225 , 0.07  , ..., 0.0585, 0.0335, 0.045 ],
       ...,
       [0.51  , 0.395 , 0.125 , ..., 0.244 , 0.1335, 0.188 ],
       [0.575 , 0.465 , 0.12  , ..., 0.516 , 0.2185, 0.235 ],
       [0.595 , 0.475 , 0.16  , ..., 0.547 , 0.231 , 0.271 ]],
      shape=(3341, 7))
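
As a quick sanity check on the split sizes, the sketch below applies the same `test_size=0.2` to placeholder arrays with the notebook's shapes (the zero-filled arrays are stand-ins, not the real data):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder arrays with the same shapes as X and y in this notebook.
X_demo = np.zeros((4177, 7))
y_demo = np.zeros((4177, 1))

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# 20% of 4177 samples is rounded up for the test set; the rest is used for training.
print(X_tr.shape, X_te.shape)
```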

## Models

### Linear Regression

$$ MSE(\beta)  = ||y - X\beta||^2_2$$
$$ \nabla MSE(\beta) = 0 $$ 
$$ \beta = (X^TX)^{-1}X^Ty $$

In [17]:
lr_model = linear_model.LinearRegression()
lr_model.fit(X_train, y_train)

y_pred = lr_model.predict(X_test)

mean_squared_error(y_test, y_pred)

5.055541144299382
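
The closed-form solution above can be checked numerically. The sketch below uses synthetic data (not the abalone set) and appends a column of ones so that the intercept, which `LinearRegression` fits by default, is part of $\beta$. All variable names are illustrative.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X_demo = rng.uniform(0.0, 1.0, size=(200, 3))  # synthetic features
y_demo = X_demo @ np.array([2.0, -1.0, 0.5]) + 3.0 + 0.1 * rng.normal(size=200)

# Append a column of ones so the intercept becomes the first entry of beta.
X_aug = np.column_stack([np.ones(len(X_demo)), X_demo])

# Solve (X^T X) beta = X^T y instead of forming the inverse explicitly.
beta = np.linalg.solve(X_aug.T @ X_aug, X_aug.T @ y_demo)

model = LinearRegression().fit(X_demo, y_demo)
print(np.allclose(beta[0], model.intercept_), np.allclose(beta[1:], model.coef_))
```

Solving the linear system is numerically preferable to computing $(X^TX)^{-1}$ directly, although both express the same formula.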

### Ridge Regression

$$ Cost(\beta)  = ||y - X\beta||^2_2 + \lambda ||\beta||^2_2$$
$$ \nabla Cost(\beta) = 0 $$ 
$$ \beta = (X^TX + \lambda I)^{-1}X^Ty $$

In [18]:
rg_model = linear_model.Ridge(alpha=0.1)
rg_model.fit(X_train,y_train)

y_pred = rg_model.predict(X_test)

mean_squared_error(y_test, y_pred)

5.057961921231911
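
The ridge closed form can be verified the same way. This sketch uses synthetic data and sets `fit_intercept=False` so that scikit-learn solves exactly the penalized problem written above; the notebook's own `Ridge(alpha=0.1)` additionally fits an unpenalized intercept. The names `lam` and `beta_ridge` are illustrative.

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))  # synthetic features
y_demo = X_demo @ np.array([2.0, -1.0, 0.5]) + 0.1 * rng.normal(size=200)

lam = 0.1  # plays the role of lambda; scikit-learn calls it alpha

# beta = (X^T X + lambda I)^{-1} X^T y, computed via a linear solve.
beta_ridge = np.linalg.solve(
    X_demo.T @ X_demo + lam * np.eye(X_demo.shape[1]),
    X_demo.T @ y_demo,
)

model = Ridge(alpha=lam, fit_intercept=False).fit(X_demo, y_demo)
print(np.allclose(beta_ridge, model.coef_))
```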

### Lasso Regression

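$$ Cost(\beta)  = ||y - X\beta||^2_2 + \lambda||\beta||_1$$ 
$$ 0 \in \partial Cost(\beta) $$

Because the $\ell_1$ penalty is not differentiable at zero, the optimality condition is stated with the subgradient $\partial$ rather than the gradient, and there is no closed-form solution in general.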

In [19]:
la_model = linear_model.Lasso(alpha=.01)
la_model.fit(X_train,y_train)

y_pred = la_model.predict(X_test)

mean_squared_error(y_test, y_pred)

5.317912405795542
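
Lasso has no closed-form solution in general; iterative solvers such as coordinate descent, which scikit-learn's `Lasso` uses, rely on the soft-thresholding operator. The sketch below shows only that operator, not a full solver, and the function name `soft_threshold` is an illustrative choice.

```python
import numpy as np

def soft_threshold(z, t):
    """Shrink z toward zero by t and clip at zero: sign(z) * max(|z| - t, 0)."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

z = np.array([-3.0, -0.5, 0.2, 2.0])
shrunk = soft_threshold(z, 1.0)
print(shrunk)
```

Entries whose magnitude is below the threshold become exactly zero, which is why lasso can remove features from the model entirely.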

### Elastic Net Regression

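$$ Cost(\beta)  = ||y - X\beta||^2_2 + \lambda_1||\beta||_1 + \lambda_2||\beta||^2_2$$ 
$$ 0 \in \partial Cost(\beta) $$

As with lasso, the $\ell_1$ term is not differentiable at zero, so the optimality condition uses the subgradient $\partial$.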

In [20]:
elastic_model = linear_model.ElasticNet(random_state=0)
elastic_model.fit(X_train, y_train)

y_pred = elastic_model.predict(X_test)
mean_squared_error(y_test, y_pred)

10.050325006346077
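
The elastic net cell relies on scikit-learn's defaults, `alpha=1.0` and `l1_ratio=0.5`, a much stronger penalty than the `alpha=0.1` and `alpha=0.01` used for ridge and lasso above. With features scaled to small ranges, this plausibly explains the much higher error. The sketch below illustrates the effect on synthetic data with features in [0, 1]; it is not a rerun on the abalone data, and the value `alpha=0.01` is an arbitrary illustrative choice.

```python
import numpy as np
from sklearn.linear_model import ElasticNet
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
X_demo = rng.uniform(0.0, 1.0, size=(500, 5))  # small-range synthetic features
y_demo = X_demo @ np.array([3.0, -2.0, 1.0, 0.0, 4.0]) + 0.1 * rng.normal(size=500)

# Defaults: alpha=1.0, l1_ratio=0.5.
default_model = ElasticNet(random_state=0).fit(X_demo, y_demo)
# Same mix of penalties, but 100 times weaker overall.
weak_model = ElasticNet(alpha=0.01, l1_ratio=0.5, random_state=0).fit(X_demo, y_demo)

mse_default = mean_squared_error(y_demo, default_model.predict(X_demo))
mse_weak = mean_squared_error(y_demo, weak_model.predict(X_demo))
print(mse_default, mse_weak)
print(default_model.coef_)
```

In practice the penalty strength would be chosen by cross-validation, for example with `linear_model.ElasticNetCV`.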

### Decision Tree Regression

In [21]:
dt_model = tree.DecisionTreeRegressor(max_depth=6)
dt_model.fit(X_train, y_train)

y_pred = dt_model.predict(X_test)

mean_squared_error(y_test, y_pred)

5.389993233280394

### Random Forest Regression

In [22]:
rf_model = ensemble.RandomForestRegressor(max_depth=6, random_state=64)
rf_model.fit(X_train, y_train.ravel())  # 1-D target avoids the column-vector shape warning

y_pred = rf_model.predict(X_test)

mean_squared_error(y_test, y_pred)


5.065186030207752

### Gradient Boosting Regression

In [23]:
params = {
    "n_estimators": 500,
    "max_depth": 4,
    "min_samples_split": 5,
    "learning_rate": 0.01,
    "loss": "squared_error",
}

gb_model = ensemble.GradientBoostingRegressor(**params)
gb_model.fit(X_train, y_train.ravel())  # 1-D target avoids the column-vector shape warning

# Arguments follow the (y_true, y_pred) order used in the other cells.
mean_squared_error(y_test, gb_model.predict(X_test))


5.097569454795717
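
Every model cell above repeats the same fit, predict, and score steps. One possible way to compare several models side by side is a simple loop, sketched here on synthetic data; the `models` dictionary, the data, and the chosen subset of estimators are illustrative assumptions, not part of the original analysis.

```python
import numpy as np
from sklearn import ensemble, linear_model, tree
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X_demo = rng.uniform(0.0, 1.0, size=(300, 4))  # synthetic features
y_demo = 10.0 * X_demo[:, 0] + 5.0 * X_demo[:, 1] ** 2 + rng.normal(size=300)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

models = {
    "linear": linear_model.LinearRegression(),
    "ridge": linear_model.Ridge(alpha=0.1),
    "tree": tree.DecisionTreeRegressor(max_depth=6, random_state=0),
    "forest": ensemble.RandomForestRegressor(max_depth=6, random_state=64),
}

scores = {}
for name, model in models.items():
    model.fit(X_tr, y_tr)  # y is 1-D here, so no shape warning is raised
    scores[name] = mean_squared_error(y_te, model.predict(X_te))

# Report models from lowest to highest test error.
for name, mse in sorted(scores.items(), key=lambda kv: kv[1]):
    print(f"{name:>8s}  MSE={mse:.3f}  RMSE={np.sqrt(mse):.3f}")
```

The root mean squared error is in the same units as the target, which for this notebook would be rings.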