# Car Features and MSRP Dataset

- `Number of Instances`: **11914**
- `Number of Attributes`: **16**
- `Attributes`<u>(descriptions compiled by me from discussions and Google searches, because the [Kaggle](https://www.kaggle.com/CooperUnion/cardataset) page does not provide them)</u>: 
  - `Make`: Make of the car (BMW, Volkswagen and so on)
  - `Model`: Model of the car
  - `Year`: Year when the car was manufactured
  - `Engine Fuel Type`: Type of fuel the engine needs (diesel and so on)
  - `Engine HP`: Horsepower of the engine
  - `Engine Cylinders`: Number of cylinders in the engine
  - `Transmission Type`: Type of transmission (automatic or manual)
  - `Driven Wheels`: front, rear, all
  - `Number of Doors`: Number of doors the car has
  - `Market Category`: luxury, crossover and so on
  - `Vehicle Size`: compact, midsize, large
  - `Vehicle Style`: Style of the vehicle (sedan, convertible and so on)
  - `Highway MPG`: Miles per gallon (MPG) on the highway
  - `City MPG`: Miles per gallon (MPG) in the city
  - `Popularity`: Number of times the car was mentioned in a Twitter stream
  - `MSRP`: Manufacturer's Suggested Retail Price

## Understand the Business Requirements

**Problem statement:**

`Cars dataset with features including make, model, year, engine, and other properties of the car used to predict its price.`

## Exploratory Data Analysis (EDA):

In [None]:
# Python libraries
import pandas as pd  # data processing and CSV file I/O
import numpy as np  # numeric operations
import matplotlib.pyplot as plt
import seaborn as sns

# render plots inline in the Jupyter notebook
%matplotlib inline

# to make this notebook's output stable across runs
np.random.seed(42)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

In [None]:
car_df = pd.read_csv('/kaggle/input/cardataset/data.csv')  # read the .csv file shipped in archive.zip

In [None]:
car_df.head(8) #top 8 rows

In [None]:
#lowercasing all the column names and replacing space with underscores
car_df.columns = car_df.columns.str.lower().str.replace(' ', '_')

In [None]:
car_df.columns #columns name

In [None]:
car_df.dtypes #data type of every column

In [None]:
# similarly, lowercasing the values of every string column and replacing spaces with underscores
string_columns = list(car_df.dtypes[car_df.dtypes == 'object'].index)
for col in string_columns:
    car_df[col] = car_df[col].str.lower().str.replace(' ', '_')

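The two normalization steps above can be tried on a tiny frame first. The following is only a sketch: `toy_df` and its values are made up for illustration, and the string columns are detected with `pd.api.types.is_string_dtype` rather than by comparing against `'object'`, which is an assumption that differs slightly from the notebook's own check.

```python
import pandas as pd

# Hypothetical toy frame; the values are invented for this illustration.
toy_df = pd.DataFrame({
    "Engine Fuel Type": ["Regular Unleaded", "Diesel"],
    "Engine HP": [200.0, 150.0],
})

# Step 1: normalize the column names.
toy_df.columns = toy_df.columns.str.lower().str.replace(" ", "_", regex=False)

# Step 2: normalize the values of the string columns only.
string_cols = [c for c in toy_df.columns if pd.api.types.is_string_dtype(toy_df[c])]
for col in string_cols:
    toy_df[col] = toy_df[col].str.lower().str.replace(" ", "_", regex=False)

print(toy_df)
```

Numeric columns such as `engine_hp` are left untouched, because the `.str` accessor only applies to string data.
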
In [None]:
car_df.sample(4)

In [None]:
print(f"This dataset has {car_df.shape[0]} rows and {car_df.shape[1]} columns.")

In [None]:
#Concise Summary of the DataFrame
car_df.info()

In [None]:
#Statistical Summary of DataFrame
car_df.describe().T

In [None]:
#Missing Values
car_df.isnull().sum()

In [None]:
# in my opinion, the first step should always be to check the distribution of the target variable
plt.figure(figsize=(5,4))
sns.histplot(car_df['msrp'], bins=30)
plt.title("Distribution of Prices")
plt.ylabel("Counts")
plt.xlabel("Price")
plt.show(); 
# the maximum price is 2065902, so the 1e6 on the x-axis means the tick labels are in units of 10^6
# this distribution has a long tail (important)

In [None]:
# zooming in on the graph above by keeping only prices below 100000
plt.figure(figsize=(5,4))
sns.histplot(car_df['msrp'][car_df['msrp'] < 100000], bins=30)
plt.title("Distribution of Prices")
plt.ylabel("Counts")
plt.xlabel("Price")
plt.show(); 
# in the full graph the long tail makes the distribution hard to see, so this plot keeps only the cheaper cars
# to remove the effect of the tail on the model, we apply a log transformation to the price

In [None]:
log_price_plus1 = np.log1p(car_df['msrp'])  # log(1 + x)

plt.figure(figsize=(5,4))
sns.histplot(log_price_plus1, bins=30)
plt.title("Distribution of Prices after Log Transformation")
plt.ylabel("Counts")
plt.xlabel("log(1 + price)")
plt.show(); 
# the +1 matters when the data contains zeros, because log(0) is undefined
# the long tail is gone and the distribution now resembles a bell-shaped curve

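Because the model is trained on `log(1 + price)`, its predictions have to be mapped back to prices with the inverse function. A minimal sketch of that round trip with `np.log1p` and `np.expm1` follows; the array `prices` is made up for illustration, apart from the maximum mentioned in the comments above.

```python
import numpy as np

# Illustrative prices; the zero shows why the "+1" matters.
prices = np.array([0.0, 2000.0, 30000.0, 2065902.0])

log_prices = np.log1p(prices)      # log(1 + x), defined even for x = 0
recovered = np.expm1(log_prices)   # exp(x) - 1, the inverse of log1p

print(log_prices)
print(recovered)
```

The same `np.expm1` call would turn a prediction made on the log scale back into a price in dollars.
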
## Splitting data into Train, Validation and Test Sets

`The full dataset is divided as follows`:
- `20% of the data goes to validation`
- `20% of the data goes to test`
- `the remaining 60% goes to train`

In [None]:
rows = len(car_df) # No. of Rows in car_df

# calculating how many rows should go to train, validation and test
val_rows = int(0.2*rows)
test_rows = int(0.2*rows)
train_rows = rows - (val_rows+test_rows)

In [None]:
#creating a numpy array with indices from 0 to n-1 and shuffle it.
index = np.arange(rows)
np.random.shuffle(index)

In [None]:
#using above array with indices to get a shuffled dataframe
car_shuffled_df = car_df.iloc[index]

# split the shuffled dataframe into non-overlapping train, validation and test parts
car_train_df = car_shuffled_df.iloc[:train_rows].copy()
car_val_df = car_shuffled_df.iloc[train_rows:train_rows + val_rows].copy()
car_test_df = car_shuffled_df.iloc[train_rows + val_rows:].copy()

In [None]:
print(f"Training DataSet: \n ~> Rows: {car_train_df.shape[0]}\n ~> Columns: {car_train_df.shape[1]}")
print(f"Validation DataSet: \n ~> Rows: {car_val_df.shape[0]}\n ~> Columns: {car_val_df.shape[1]}")
print(f"Testing DataSet: \n ~> Rows: {car_test_df.shape[0]}\n ~> Columns: {car_test_df.shape[1]}")

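A 60/20/20 split is only useful if the three parts share no rows, since any overlap leaks training data into validation and test. The sketch below works on indices alone, so it needs no data file; `split_sizes` is a hypothetical helper, and `np.random.default_rng` is used here instead of the notebook's `np.random.seed`, which is an assumption made for the example.

```python
import numpy as np

def split_sizes(n_rows, val_frac=0.2, test_frac=0.2):
    """Hypothetical helper: row counts for train, validation and test."""
    n_val = int(val_frac * n_rows)
    n_test = int(test_frac * n_rows)
    n_train = n_rows - n_val - n_test
    return n_train, n_val, n_test

n_rows = 11914  # number of instances stated at the top of the notebook
n_train, n_val, n_test = split_sizes(n_rows)

rng = np.random.default_rng(42)
index = rng.permutation(n_rows)  # shuffled row positions 0..n_rows-1

# Consecutive, non-overlapping slices of the shuffled index.
train_idx = index[:n_train]
val_idx = index[n_train:n_train + n_val]
test_idx = index[n_train + n_val:]

# Sanity checks: no shared rows, and every row is used exactly once.
assert not set(train_idx) & set(val_idx)
assert not set(train_idx) & set(test_idx)
assert not set(val_idx) & set(test_idx)
assert len(train_idx) + len(val_idx) + len(test_idx) == n_rows

print(n_train, n_val, n_test)
```

Each index array can then be passed to `DataFrame.iloc` to build the corresponding dataframe.
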
In [None]:
# the analysis above showed a long tail in the price distribution; the log transformation removes its effect
y_train = np.log1p(car_train_df['msrp'].values)
y_val = np.log1p(car_val_df['msrp'].values)
y_test = np.log1p(car_test_df['msrp'].values)

In [None]:
car_train_df.drop(['msrp'], axis=1, inplace=True)
car_val_df.drop(['msrp'], axis=1, inplace=True)
car_test_df.drop(['msrp'], axis=1, inplace=True)

In [None]:
car_train_df.head()

In [None]:
car_val_df.head()

## Linear Regression:

In [None]:
#linear regression implemented with Numpy
def linear_regression(X, y):
    """
    Fit linear regression with the normal equation.
    X = feature matrix.
    y = target vector.
    Returns the bias term and the vector of feature weights.
    """
    ones = np.ones(X.shape[0]) #creating an array that contains only 1s.
    X = np.column_stack([ones, X]) #adding the array of 1's as the column of X
    #normal equation formula
    XTX = X.T.dot(X) 
    XTX_inv = np.linalg.inv(XTX) #inverse of XTX
    w = XTX_inv.dot(X.T).dot(y) #computing the rest of the normal equation
    
    return w[0], w[1:] #splitting the weight vector into the bias and the rest of the weights 

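One way to gain confidence in a normal-equation implementation is to run it on synthetic data whose weights are known and compare the result with `np.linalg.lstsq`. This is a sketch under that assumption; `X_demo`, `y_demo` and the chosen weights are invented for the check and are not part of the car dataset.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic, noise-free data with known bias and weights (made up for the check).
X_demo = rng.normal(size=(200, 3))
true_w0, true_w = 2.0, np.array([1.5, -0.5, 3.0])
y_demo = true_w0 + X_demo @ true_w

# Normal equation: w = (X^T X)^-1 X^T y, with a column of ones for the bias.
X_aug = np.column_stack([np.ones(X_demo.shape[0]), X_demo])
w_normal = np.linalg.inv(X_aug.T @ X_aug) @ X_aug.T @ y_demo

# Reference solution from NumPy's least-squares solver.
w_lstsq, *_ = np.linalg.lstsq(X_aug, y_demo, rcond=None)

print(w_normal)
print(w_lstsq)
```

Both vectors should agree with the bias and weights used to generate the data, up to floating-point error. `np.linalg.lstsq` also copes with rank-deficient matrices, where the explicit inverse does not exist.
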
### Naive solution

In [None]:
naive_features = ['engine_hp', 'engine_cylinders', 'highway_mpg', 'city_mpg', 'popularity']

In [None]:
def preparing_X(df):
    """
    Select the naive features, replace every NaN with 0 and return the values as the matrix X.
    """
    df_num = df[naive_features]
    df_num = df_num.fillna(0)
    X = df_num.values
    return X

In [None]:
naive_X_train = preparing_X(car_train_df)

w_0, w = linear_regression(naive_X_train, y_train) #training the model
y_pred = w_0 + naive_X_train.dot(w) #predicting 

In [None]:
# let's see how good the predictions are
plt.figure(figsize=(5,4))

sns.histplot(y_train, label='target', color='black',bins=30)
sns.histplot(y_pred, label='prediction',color='red', bins=30)
plt.legend()
plt.xlabel('log(1+price)')
plt.ylabel('Count')
plt.title('Predictions vs Actual Distribution')

plt.show();
# the graph makes it clear that the predictions aren't good enough yet

In [None]:
# performance metric: RMSE (root mean squared error)
def rmse(y, y_pred):
    error = y_pred - y
    mse = (error**2).mean()
    return np.sqrt(mse)

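A small worked example makes the metric concrete. The sketch below repeats the same formula on three made-up values whose errors are 2, -2 and 2, so the mean squared error is 4 and the RMSE is its square root.

```python
import numpy as np

def rmse(y, y_pred):
    """Root mean squared error, same formula as in the cell above."""
    error = y_pred - y
    mse = (error ** 2).mean()
    return np.sqrt(mse)

# Made-up targets and predictions for illustration.
y_true_demo = np.array([10.0, 11.0, 12.0])
y_pred_demo = np.array([12.0, 9.0, 14.0])

demo_rmse = rmse(y_true_demo, y_pred_demo)
print(demo_rmse)
```

Because the targets in this notebook are `log(1 + price)`, the RMSE reported later is measured on the log scale, not in dollars.
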
In [None]:
print(f"RMSE for training is: {round(rmse(y_train, y_pred), 4)}")

# Validating the model
X_val = preparing_X(car_val_df)
y_val_pred = w_0 + X_val.dot(w)

print(f"RMSE for validation is: {round(rmse(y_val, y_val_pred), 4)}")

In [None]:
def preparing_X(df):
    """
    Trying some feature engineering: add an age column, computed as
    age = 2017 - year (from the main dataframe),
    and append it to the features.
    """
    df = df.copy()
    features = naive_features.copy()
    
    df['age'] = 2017 - df['year']
    features.append('age')
    
    df_num = df[features]
    df_num = df_num.fillna(0)
    X = df_num.values
    return X

In [None]:
X_train = preparing_X(car_train_df)
w_0, w = linear_regression(X_train, y_train) #training the model
y_pred = w_0 + X_train.dot(w) #predicting 

print(f"RMSE for training is: {round(rmse(y_train, y_pred), 4)}")

# Validating the model
X_val = preparing_X(car_val_df)
y_val_pred = w_0 + X_val.dot(w)
print(f"RMSE for validation is: {round(rmse(y_val, y_val_pred), 4)}")

In [None]:
# let's see how good the predictions are
plt.figure(figsize=(5,4))

sns.histplot(y_train, label='target', color='black',bins=30)
sns.histplot(y_pred, label='prediction',color='red', bins=30)
plt.legend()
plt.xlabel('log(1+price)')
plt.ylabel('Count')
plt.title('Predictions vs Actual Distribution')

plt.show();
# with the new feature, the model follows the original distribution more closely than before

In [None]:
car_df['number_of_doors'].value_counts()

In [None]:
car_df['make'].value_counts().head()

In [None]:
car_df['engine_fuel_type'].value_counts().head()

In [None]:
def preparing_X(df):
    """
    Trying some more simple feature engineering.
    """
    df = df.copy()
    features = naive_features.copy()
    
    df['age'] = 2017 - df['year']
    features.append('age')
    
    for index in ['chevrolet', 'ford', 'volkswagen', 'toyota', 'dodge']:
        feature = 'is_make_%s' % index # giving the feature a meaningful name
        # creating the one-hot encoded feature and adding it back to the dataframe
        df[feature] = (df['make'] == index).astype(int)
        features.append(feature)
        
    for index in ['regular_unleaded', 'premium_unleaded_(required)', 
              'premium_unleaded_(recommended)', 'flex-fuel_(unleaded/e85)', 'diesel']:
        feature = 'is_type_%s' % index
        df[feature] = (df['engine_fuel_type'] == index).astype(int)
        features.append(feature)
    
    for index in [2, 3, 4]: 
        feature = 'num_doors_%s' % index 
        df[feature] = (df['number_of_doors'] == index).astype(int) 
        features.append(feature)

    df_num = df[features]
    df_num = df_num.fillna(0)
    X = df_num.values
    return X

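The loops above build one binary column per category value, which is a manual form of one-hot encoding. The sketch below shows the effect on a toy frame; `toy` and its values are made up for illustration.

```python
import pandas as pd

# Hypothetical toy frame with one categorical column.
toy = pd.DataFrame({"make": ["ford", "toyota", "bmw", "ford"]})

# One 0/1 column per selected category, as in preparing_X.
for value in ["ford", "toyota"]:
    feature = "is_make_%s" % value
    toy[feature] = (toy["make"] == value).astype(int)

print(toy)
```

The `bmw` row gets 0 in both new columns, so every category that is not listed is implicitly grouped into one "other" bucket.
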
In [None]:
X_train = preparing_X(car_train_df)
w_0, w = linear_regression(X_train, y_train) #training the model
y_pred = w_0 + X_train.dot(w) #predicting 

print(f"RMSE for training is: {round(rmse(y_train, y_pred), 4)}")

# Validating the model
X_val = preparing_X(car_val_df)
y_val_pred = w_0 + X_val.dot(w)
print(f"RMSE for validation is: {round(rmse(y_val, y_val_pred), 4)}")

In [None]:
# let's see how good the predictions are
plt.figure(figsize=(5,4))

sns.histplot(y_train, label='target', color='black',bins=30)
sns.histplot(y_pred, label='prediction',color='red', bins=30)
plt.legend()
plt.xlabel('log(1+price)')
plt.ylabel('Count')
plt.title('Predictions vs Actual Distribution')

plt.show();
# with the new features, the model follows the original distribution more closely than before

In [None]:
car_df.columns

In [None]:
naive_features

In [None]:
car_df['transmission_type'].value_counts()

In [None]:
car_df['driven_wheels'].value_counts()

In [None]:
car_df['market_category'].value_counts().head(4)

In [None]:
car_df['vehicle_size'].value_counts()

In [None]:
car_df['vehicle_style'].value_counts().head(4)

In [None]:
def preparing_X(df):
    """
    Trying some more simple feature engineering.
    """
    df = df.copy()
    features = naive_features.copy()
    
    df['age'] = 2017 - df['year']
    features.append('age')
    
    for index in ['chevrolet', 'ford', 'volkswagen', 'toyota']:
        feature = 'is_make_%s' % index # giving the feature a meaningful name
        # creating the one-hot encoded feature and adding it back to the dataframe
        df[feature] = (df['make'] == index).astype(int)
        features.append(feature)
        
    for index in ['regular_unleaded', 'premium_unleaded_(required)',
                  'premium_unleaded_(recommended)', 'flex-fuel_(unleaded/e85)']:
        feature = 'is_type_%s' % index
        df[feature] = (df['engine_fuel_type'] == index).astype(int)
        features.append(feature)
    
    for index in ['automatic', 'manual', 'automated_manual', 'direct_drive']:
        feature = 'is_transmission_%s' % index
        df[feature] = (df['transmission_type'] == index).astype(int)
        features.append(feature)
    
#     for index in ['front_wheel_drive', 'rear_wheel_drive', 'all_wheel_drive', 'four_wheel_drive']:
#         feature = 'is_driven_wheel_%s' % index
#         df[feature] = (df['driven_wheels'] == index).astype(int)
#         features.append(feature)
    
    for index in ['crossover', 'flex_fuel', 'luxury', 'luxury,performance']:
        feature = 'is_market_category_%s' % index
        df[feature] = (df['market_category'] == index).astype(int)
        features.append(feature)
    
#     for index in ['compact', 'midsize', 'large']:
#         feature = 'is_vehicle_size_%s' % index
#         df[feature] = (df['vehicle_size'] == index).astype(int)
#         features.append(feature)
    
    
#     these features raise LinAlgError, which means the matrix XTX cannot be inverted.
#     When we try to invert a singular matrix, NumPy raises LinAlgError: Singular matrix.
#     This happens when one feature is a constant multiple of another (important), when a feature is a
#     perfect linear combination of other features, or when a dummy column is zero for every row.
#     for index in ['sedan', 'dr_suv', 'coupe', 'convertible']:
#         feature = 'is_vehicle_style_%s' % index
#         df[feature] = (df['vehicle_style'] == index).astype(int)
#         features.append(feature)

    for index in [2, 3, 4]: 
        feature = 'num_doors_%s' % index 
        df[feature] = (df['number_of_doors'] == index).astype(int) 
        features.append(feature)

    df_num = df[features]
    df_num = df_num.fillna(0)
    X = df_num.values
    return X

# none of the commented-out features turned out to be useful

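The singular-matrix problem described in the comments can be reproduced on a tiny matrix. In this sketch the third column is exactly twice the second, so `XTX` loses rank; the values are made up, and `np.linalg.matrix_rank` is used to detect the problem instead of relying on `np.linalg.inv` raising an error.

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0, 4.0])

# Columns: bias, a feature, and the same feature multiplied by a constant.
X_dep = np.column_stack([np.ones(4), a, 2 * a])

XTX = X_dep.T @ X_dep
rank = np.linalg.matrix_rank(XTX)

# A 3x3 matrix needs rank 3 to be invertible.
print(XTX.shape, rank)
```

Checking the rank before inverting is a cheap way to find out whether a new set of dummy features introduced a redundant column.
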
In [None]:
X_train = preparing_X(car_train_df)
w_0, w = linear_regression(X_train, y_train) #training the model
y_pred = w_0 + X_train.dot(w) #predicting 

print(f"RMSE for training is: {round(rmse(y_train, y_pred), 4)}")

# Validating the model
X_val = preparing_X(car_val_df)
y_val_pred = w_0 + X_val.dot(w)
print(f"RMSE for validation is: {round(rmse(y_val, y_val_pred), 4)}")

In [None]:
# let's see how good the predictions are
plt.figure(figsize=(5,4))

sns.histplot(y_train, label='target', color='black',bins=30)
sns.histplot(y_pred, label='prediction',color='red', bins=30)
plt.legend()
plt.xlabel('log(1+price)')
plt.ylabel('Count')
plt.title('Predictions vs Actual Distribution')

plt.show();
# with the new features, the model follows the original distribution more closely than before

## Regularization

- `Regularized Linear Regression is often called Ridge Regression`.

In [None]:
def linear_regression_reg(X, y, r=0.0):
    """
    This function is for implementation of Regularized Linear regression.
    """
    ones = np.ones(X.shape[0])
    X = np.column_stack([ones, X])
    
    XTX = X.T.dot(X)
    reg = r*np.eye(XTX.shape[0]) 
    #adding r to the main diagonal of XTX
    XTX = XTX + reg
    
    XTX_inv = np.linalg.inv(XTX)
    w = XTX_inv.dot(X.T).dot(y)
    
    return w[0], w[1:]

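Adding `r` to the diagonal is what makes a rank-deficient `XTX` solvable again. The sketch below illustrates this on a tiny made-up matrix with a duplicated feature; it uses `np.linalg.solve` instead of an explicit inverse, which is a substitution chosen for the example rather than the notebook's approach.

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0, 4.0])
X_dep = np.column_stack([np.ones(4), a, 2 * a])  # third column = 2 * second
y_dep = 1.0 + 2.0 * a                            # made-up target

XTX = X_dep.T @ X_dep
r = 0.01
XTX_reg = XTX + r * np.eye(XTX.shape[0])  # add r to the main diagonal

rank_before = np.linalg.matrix_rank(XTX)
rank_after = np.linalg.matrix_rank(XTX_reg)

# With full rank restored, the regularized system has a unique solution.
w_reg = np.linalg.solve(XTX_reg, X_dep.T @ y_dep)
y_hat = X_dep @ w_reg

print(rank_before, rank_after)
print(y_hat)
```

The fitted values are close to the targets but not identical, because the penalty shrinks the weights slightly. As in `linear_regression_reg`, the bias term is penalized here as well.
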
In [None]:
X_train = preparing_X(car_train_df)

In [None]:
X_val = preparing_X(car_val_df)  # the validation features do not depend on r

for r in [0, 0.001, 0.01, 0.1, 1, 10]:
    w_0, w = linear_regression_reg(X_train, y_train, r=r)
    y_pred = w_0 + X_train.dot(w) #predicting 
    print(f"RMSE for training when r = {r} is: {round(rmse(y_train, y_pred), 6)}")
    # Validating the model
    y_val_pred = w_0 + X_val.dot(w)
    print(f"RMSE for validation when r = {r} is: {round(rmse(y_val, y_val_pred), 6)}")
    print('-'*15)

In [None]:
X_val = preparing_X(car_val_df)  # the validation features do not depend on r

for r in [0.000001, 0.0001, 0.001, 0.01, 0.1, 1, 5, 10]:
    w_0, w = linear_regression_reg(X_train, y_train, r=r)
    y_pred = w_0 + X_train.dot(w) #predicting 
    print(f"RMSE for training when r = {r} is: {round(rmse(y_train, y_pred), 6)}")
    # Validating the model
    y_val_pred = w_0 + X_val.dot(w)
    print(f"RMSE for validation when r = {r} is: {round(rmse(y_val, y_val_pred), 6)}")
    print('-'*15)
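
Instead of reading the best `r` off the printed output, the search can record the validation RMSE for each candidate and pick the smallest one. The sketch below runs on synthetic data so that it is self-contained; `fit_ridge`, `pick_r` and the generated arrays are hypothetical names introduced for this example, and `fit_ridge` mirrors `linear_regression_reg` but solves the system with `np.linalg.solve`.

```python
import numpy as np

def fit_ridge(X, y, r):
    """Regularized normal equation; penalizes the bias too, like the cell above."""
    X_aug = np.column_stack([np.ones(X.shape[0]), X])
    XTX = X_aug.T @ X_aug + r * np.eye(X_aug.shape[1])
    w = np.linalg.solve(XTX, X_aug.T @ y)
    return w[0], w[1:]

def rmse(y, y_pred):
    return np.sqrt(np.mean((y_pred - y) ** 2))

def pick_r(X_tr, y_tr, X_va, y_va, candidates):
    """Return the candidate with the lowest validation RMSE and all scores."""
    scores = {}
    for r in candidates:
        w_0, w = fit_ridge(X_tr, y_tr, r)
        scores[r] = rmse(y_va, w_0 + X_va @ w)
    best_r = min(scores, key=scores.get)
    return best_r, scores

# Synthetic data, made up for the demonstration.
rng = np.random.default_rng(1)
X_all = rng.normal(size=(300, 4))
y_all = 1.0 + X_all @ np.array([0.5, -1.0, 2.0, 0.0]) + rng.normal(scale=0.1, size=300)
X_tr, y_tr = X_all[:200], y_all[:200]
X_va, y_va = X_all[200:], y_all[200:]

candidates = [0.0, 0.001, 0.01, 0.1, 1, 10]
best_r, scores = pick_r(X_tr, y_tr, X_va, y_va, candidates)
print(best_r, scores[best_r])
```

Once `r` is chosen on the validation set, a final check on the untouched test set gives an estimate of how the model behaves on unseen data.
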