## 7324 Assignment A2 : Regression
##### Name: Thang Nguyen
##### SMU ID: 48689334

### Compare the accuracy of different ML Regression algorithms in predicting the price of housing given a set of features and a final sales price. The models are:
- LinearRegression
- DecisionTreeRegressor
- RandomForestRegressor   [using: RandomForestRegressor(n_estimators = 300 ,  random_state = 0) ]
- Lasso Regression
- Ridge Regression
### Measure: R2, RMSE and Time (before and after the call to fit)

In [635]:
# import libraries 
from sklearn.metrics import mean_squared_error, r2_score, explained_variance_score
from sklearn.linear_model import LinearRegression, Ridge, Lasso, RidgeCV, LassoCV
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

from sklearn import preprocessing
import pandas as pd
import numpy as np
import time

# other imports ..

In [636]:
### Set up Results Table as a DataFrame to summarizing  results

d = {
    'Model': ['Linear Regression',  'Decision Tree Regressor','Random Forest Regressor','Lasso Regression',  'Ridge Regression'],
    'Details': ['', '', '', 'alpha=', 'alpha=' ],
    'R2': ['-', '-', '-', '-', '-', ],
    'RMSE': ['-', '-', '-', '-', '-'],
    'Time' : ['-', '-', '-', '-', '-'],
   
}
df_results = pd.DataFrame(data=d)

In [637]:
# How to add values to the  results dataframe -- remove in final submission

some_val = 0.84848484

# add value with constrained decimals
df_results.at[1, 'R2'] = "%.4f" % some_val

df_results

Unnamed: 0,Model,Details,R2,RMSE,Time
0,Linear Regression,,-,-,-
1,Decision Tree Regressor,,0.8485,-,-
2,Random Forest Regressor,,-,-,-
3,Lasso Regression,alpha=,-,-,-
4,Ridge Regression,alpha=,-,-,-


##  Part A
Data Wrangling
- Load the data in a2.data.csv into the Jupyter notebook as a Dataframe.
- Remove any rows that have missing data.


In [638]:
# loading data and initial scan
car_df = pd.read_csv('../data/7324.a2.cardata.csv')
car_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 303 entries, 0 to 302
Data columns (total 7 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   Year          303 non-null    int64  
 1   price         303 non-null    float64
 2   Kms_Driven    302 non-null    float64
 3   fuel          302 non-null    object 
 4   seller        303 non-null    object 
 5   Transmission  303 non-null    object 
 6   Owner         303 non-null    int64  
dtypes: float64(2), int64(2), object(3)
memory usage: 16.7+ KB


In [639]:
# removing missing data and verification
car_df.dropna(inplace = True)
car_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 301 entries, 0 to 302
Data columns (total 7 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   Year          301 non-null    int64  
 1   price         301 non-null    float64
 2   Kms_Driven    301 non-null    float64
 3   fuel          301 non-null    object 
 4   seller        301 non-null    object 
 5   Transmission  301 non-null    object 
 6   Owner         301 non-null    int64  
dtypes: float64(2), int64(2), object(3)
memory usage: 18.8+ KB


## Part B
Convert text fields to numeric fields
- Convert text fields to numeric values using with one-hot-encoding or creating an ordinal sequence depending on the data
- Move price to the last column of the dataframe
- Display the first two rows of data using head(2)




In [640]:
# inital scan of data
car_df.head()

Unnamed: 0,Year,price,Kms_Driven,fuel,seller,Transmission,Owner
0,2014,3.35,27000.0,Petrol,Dealer,Manual,0
1,2013,4.75,43000.0,Diesel,Dealer,Manual,0
3,2017,7.25,6900.0,Petrol,Dealer,Manual,0
4,2011,2.85,5200.0,Petrol,Dealer,Manual,0
5,2014,4.6,42450.0,Diesel,Dealer,Manual,0


In [641]:
# one-hot fuel
car_df = pd.concat([car_df, pd.get_dummies(car_df['fuel'], prefix = 'fuel_type')], axis = 1)
car_df.drop(['fuel'], axis = 1, inplace = True)

In [642]:
# one-hot seller
car_df = pd.concat([car_df, pd.get_dummies(car_df['seller'], prefix = 'source')], axis = 1)
car_df.drop(['seller'], axis = 1, inplace = True)

In [643]:
# one-hot transmission
car_df = pd.concat([car_df, pd.get_dummies(car_df['Transmission'], prefix = 'transmission')], axis = 1)
car_df.drop(['Transmission'], axis = 1, inplace = True)

In [644]:
# move price feature to end of dataframe
price_col = car_df.pop("price")
car_df.insert(len(car_df.columns), "price", price_col)

In [645]:
# displaying first 2 of data
car_df.head(2)

Unnamed: 0,Year,Kms_Driven,Owner,fuel_type_CNG,fuel_type_Diesel,fuel_type_Petrol,source_Dealer,source_Individual,transmission_Automatic,transmission_Manual,price
0,2014,27000.0,0,0,0,1,1,0,0,1,3.35
1,2013,43000.0,0,0,1,0,1,0,0,1,4.75


## Part C
Measure the accuracy of three different regression models : LinearRegression, DecisionTreeRegression and RandomForestRegression
- Use test_train_split to create training and testing datasets
- Use your results to populate the results dataframe which has been set up for you.
- Use the guide below to adjust the number of decimal places to display the table in your notebook.
- R2 : 4 decimal place accuracy
- RMSE : 0 decimals
- Time : 2 decimal place accuracy


In [646]:
# split the dataframe 
## split the variables 
X = car_df.drop('price', axis=1)
y = car_df['price']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, train_size=0.8, random_state=0)

In [647]:
# linear regression 
lin_reg = LinearRegression()
## capture time start
lin_reg_time = time.time()
lin_reg.fit(X_train, y_train)
## capture time stop
lin_reg_fit_time = time.time() - lin_reg_time
lin_reg_y_predict = lin_reg.predict(X_test)
lin_reg_mse = mean_squared_error(y_test, lin_reg_y_predict)
lin_reg_r2 = r2_score(y_test, lin_reg_y_predict)

In [648]:
# peeking at scores and placing them into the results dataframe
print(f'LinRegMSE: {lin_reg_mse}, LinRegR2: {lin_reg_r2}, LinRegFitTime: {lin_reg_fit_time}')
df_results.at[0, 'R2'] = "%.4f" % lin_reg_r2
df_results.at[0, 'RMSE'] = "%.0f" % lin_reg_mse
df_results.at[0, 'Time'] = "%.3f" % lin_reg_fit_time

LinRegMSE: 7.731351822738985, LinRegR2: 0.6941387670830657, LinRegFitTime: 0.0013301372528076172


In [649]:
# decision tree regression 
dec_tree = DecisionTreeRegressor()
## capture time start
dec_tree_time = time.time()
dec_tree.fit(X_train, y_train)
## capture time stop
dec_tree_fit_time = time.time() - dec_tree_time
dec_tree_y_predict = lin_reg.predict(X_test)
dec_tree_mse = mean_squared_error(y_test, dec_tree_y_predict)
dec_tree_r2 = r2_score(y_test, dec_tree_y_predict)

In [650]:
# peeking at scores and placing them into the results dataframe
print(f'DecTreeMSE: {dec_tree_mse}, DecTreeR2: {dec_tree_r2}, DecTreeFitTime: {dec_tree_fit_time}')
df_results.at[1, 'R2'] = "%.4f" % dec_tree_r2
df_results.at[1, 'RMSE'] = "%.0f" % dec_tree_mse
df_results.at[1, 'Time'] = "%.3f" % dec_tree_fit_time

DecTreeMSE: 7.731351822738985, DecTreeR2: 0.6941387670830657, DecTreeFitTime: 0.0017647743225097656


In [651]:
# random forest regression
ran_for = RandomForestRegressor(n_estimators = 300, random_state = 0)
## capture time start
ran_for_time = time.time()
ran_for.fit(X_train, y_train)
## capture time stop
ran_for_fit_time = time.time() - ran_for_time
ran_for_y_predict = lin_reg.predict(X_test)
ran_for_mse = mean_squared_error(y_test, ran_for_y_predict)
ran_for_r2 = r2_score(y_test, ran_for_y_predict)

In [652]:
# peeking at scores and placing them into the results dataframe
print(f'RanForMSE: {ran_for_mse}, RanForR2: {ran_for_r2}, RanForFitTime: {ran_for_fit_time}')
df_results.at[2, 'R2'] = "%.4f" % ran_for_r2
df_results.at[2, 'RMSE'] = "%.0f" % ran_for_mse
df_results.at[2, 'Time'] = "%.3f" % ran_for_fit_time

RanForMSE: 7.731351822738985, RanForR2: 0.6941387670830657, RanForFitTime: 0.17662477493286133


In [653]:
df_results

Unnamed: 0,Model,Details,R2,RMSE,Time
0,Linear Regression,,0.6941,8,0.001
1,Decision Tree Regressor,,0.6941,8,0.002
2,Random Forest Regressor,,0.6941,8,0.177
3,Lasso Regression,alpha=,-,-,-
4,Ridge Regression,alpha=,-,-,-


## Part D. Best Alpha Paramters for Ridge and Lasso Regression
### Both Ridge and Lasso regression have alpha parameters. Your job is to find the best alpha parameter for LassoCV and RidgeCV which run the models with the alpha parameters you specify and return the best one as model.alpha_ 
Do a bit of research to find appropriate alpha values to try
There are two ways to specify the list of alpha values:
1.	create a list of values, as in: my_alphas = [0.1, 0.2, 0.3]
2.	use the numpy arrange to generate a list of values, as in arrange(start, end, increment) – an easy way to test multiple values easily

When you determine the best alpha parameters for Lasso and Ridge Regression use these values in part E


In [654]:
# setting up alphas and scaling features 
my_alphas = np.arange(1, 6, 0.1)
scaler = preprocessing.StandardScaler()
X_train_standardized = scaler.fit_transform(X_train)

In [655]:
# ridge regression alpha tuning
regr_cv = RidgeCV(alphas=my_alphas)
regr_model_cv = regr_cv.fit(X_train_standardized, y_train)
print(regr_model_cv.coef_)
print(regr_model_cv.alpha_)

[ 0.86010098 -0.18143223  0.05709366 -0.3253458   1.00007348 -0.90909294
  1.04124991 -1.04124991  0.70742374 -0.70742374]
5.900000000000004


In [656]:
# lasso regression alpha tuning
lass_cv = LassoCV(alphas=my_alphas)
lass_model_cv = lass_cv.fit(X_train_standardized, y_train)
print(lass_model_cv.coef_)
print(lass_model_cv.alpha_)

[ 0.00000000e+00 -0.00000000e+00 -0.00000000e+00 -0.00000000e+00
  1.27971725e+00 -0.00000000e+00  1.29951483e+00 -4.73695157e-16
  4.58633713e-01 -0.00000000e+00]
1.0


In [657]:
# adding alpha results to dataframe
df_results.at[4, 'Details'] = "%.4f" % regr_model_cv.alpha_

In [658]:
# adding alpha results to dataframe
df_results.at[3, 'Details'] = "%.4f" % lass_model_cv.alpha_

## Part E. Use best alpha parameters and compute results for Lasso & Ridge Regression 

In [659]:

# ridge regression with best alpha parameter
ridr_reg = Ridge(alpha=regr_model_cv.alpha_)
## capture time start
ridr_for_time = time.time()
ridr_model = ridr_reg.fit(X_train, y_train)
## capture time stop
ridr_fit_time = time.time() - ridr_for_time
ridr_y_predict = ridr_model.predict(X_test)
ridr_mse = mean_squared_error(y_test, ridr_y_predict)
ridr_r2 = r2_score(y_test, ridr_y_predict)


In [660]:
# peeking at scores and placing them into the results dataframe
print(f'RidrMSE: {ridr_mse}, RidrR2: {ridr_r2}, RidrFitTime: {ridr_fit_time}')
df_results.at[3, 'R2'] = "%.4f" % ridr_r2
df_results.at[3, 'RMSE'] = "%.0f" % ridr_mse
df_results.at[3, 'Time'] = "%.3f" % ridr_fit_time

RidrMSE: 7.7318837827923526, RidrR2: 0.6941177221272137, RidrFitTime: 0.0011110305786132812


In [661]:
# lasso regression with best alpha parameter
lass_reg = Ridge(alpha=lass_model_cv.alpha_)

## capture time start
lass_for_time = time.time()
lass_model = lass_reg.fit(X_train, y_train)
## capture time stop
lass_fit_time = time.time() - lass_for_time
lass_y_predict = lass_model.predict(X_test)
lass_mse = mean_squared_error(y_test, lass_y_predict)
lass_r2 = r2_score(y_test, lass_y_predict)

In [662]:
# peeking at scores and placing them into the results dataframe
print(f'LassMSE: {lass_mse}, LassR2: {lass_r2}, LassFitTime: {lass_fit_time}')
df_results.at[4, 'R2'] = "%.4f" % lass_r2
df_results.at[4, 'RMSE'] = "%.0f" % lass_mse
df_results.at[4, 'Time'] = "%.3f" % lass_fit_time

LassMSE: 7.713634926975682, LassR2: 0.6948396679999903, LassFitTime: 0.0010371208190917969


## Display Final Results Table

In [663]:
df_results

Unnamed: 0,Model,Details,R2,RMSE,Time
0,Linear Regression,,0.6941,8,0.001
1,Decision Tree Regressor,,0.6941,8,0.002
2,Random Forest Regressor,,0.6941,8,0.177
3,Lasso Regression,1.0,0.6941,8,0.001
4,Ridge Regression,5.9,0.6948,8,0.001


### Answer the following questions:
#### 1. Which model takes the most time and why?


#### 2. Did Ridge or Lasso regression show any improvement over Linear Regression?  Why?





####  3.Which technique took the most time? Why?




#### 4. It was recommended that if you received the "Data Conversion" warning listed in the assignment handout, that it can be eliminated by: "changing your y_train parameter to: y_train.values.ravel() "   
#####  What does the ravel() function actually do?



1) Random forest ended up taking the most time, likely because of the n_estimators parameter, which determines how many decision trees to construct, and the default max_features parameter, which is sqrt(num_of_features). Since there are 300 trees and at each node, about 2 features, this complexity is likely contributing to more compute time.
2) In this run, Ridge regression showed a nominal improvement over Linear Regression. Since Ridge regression progressively "shrinks" the model in terms ignoring independent variables when predicting the dependent variable, we are observing that each feature being evaluated has a slightly acceptable relation to the dependent variable, price. 
3) In the unformated times, Ridge ended up taking nominally more time. To preface, the dataset is relatively small in both features and observations. Since Ridge takes the square of the coefficients or features, the model gets infinitely closer to zero as it approaches a coefficient, taking proportionally more time to compute.
4) The ravel() function returns a continguous flattened array, or a 1D array of the same type as the input array. This helps to avoid conversions due to differences in dimensions between features and observations.