# CODING TASK #1: UNDERSTAND THE PROBLEM STATEMENT AND IMPORT LIBRARIES/DATASETS

- In this hands-on project, we will train a multiple linear regression model to predict the price of used cars.
- Car dealerships can use this project to predict used car prices and to understand the key factors that drive them.
- Features (inputs): 
    - Make 
    - Model
    - Type
    - Origin 
    - DriveTrain
    - Invoice
    - EngineSize
    - Cylinders
    - Horsepower
    - MPG_City
    - MPG_Highway
    - Weight
    - Wheelbase
    - Length
- Output: MSRP (Price)
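
For reference, a multiple linear regression model expresses the output as a weighted sum of the inputs. The notation below is generic and not taken from the original notebook: $x_1, \dots, x_k$ stand for the input columns, $\beta_0$ for the intercept, $\beta_1, \dots, \beta_k$ for the learned coefficients, and $\epsilon$ for the error term.

$$
\text{MSRP} = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_k x_k + \epsilon
$$

Training chooses the coefficients that minimize the sum of squared differences between the predicted and actual MSRP on the training data. Categorical inputs such as Make have to be converted to numeric columns before they can enter this sum.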



In [None]:
# Import Numpy and check the version
import numpy as np
print(np.__version__)

In [None]:
# Import Pandas and check the version
import pandas as pd
print(pd.__version__)

In [None]:
# Upgrade NumPy to the latest version
!pip3 install numpy --upgrade

In [None]:
# Upgrade Pandas to the latest version
!pip3 install pandas --upgrade

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt
import plotly.express as px # Interactive Data Visualization

In [None]:
# Read the CSV file 
car_df = pd.read_csv("used_car_price.csv")

In [None]:
# Display the first 6 rows
car_df.head(6)

In [None]:
# Display the last 6 rows
car_df.tail(6)

In [None]:
# Display the column names (input features and the MSRP target)
car_df.columns

In [None]:
# Check the shape of the dataframe
car_df.shape

In [None]:
# Check if any missing values are present in the dataframe
car_df.isnull().sum()

In [None]:
# Drop every row that contains at least one missing value
car_df = car_df.dropna()
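
To make the effect of `isnull().sum()` and `dropna()` concrete, here is a minimal sketch on a small made-up frame. The name `toy_df` and all of its values are hypothetical and only stand in for `car_df`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for car_df (values are made up)
toy_df = pd.DataFrame({
    "Make": ["Acura", "Audi", "BMW", "Buick"],
    "Cylinders": [6.0, np.nan, 6.0, 8.0],
    "MSRP": [36945, 25940, 30795, 26470],
})

missing_per_column = toy_df.isnull().sum()  # number of NaN values per column
cleaned_df = toy_df.dropna()                # keep only rows without any NaN

print(missing_per_column)
print("Rows before:", len(toy_df), "| rows after:", len(cleaned_df))
```

Dropping rows is the simplest option. When many rows are affected, filling in the missing values may be worth considering instead.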

In [None]:
car_df.dtypes

In [None]:
car_df.info()

**PRACTICE OPPORTUNITY #1 [OPTIONAL]:**
- **What is the maximum price of the used car?**
- **What is the minimum price of the used car?**

# CODING TASK #2: PERFORM EXPLORATORY DATA ANALYSIS AND VISUALIZATION - PART #1

In [None]:
# Visualize missing values; the heatmap is a single uniform color when none remain
sns.heatmap(car_df.isnull(), yticklabels = False, cbar = False, cmap="Blues")


In [None]:
car_df.dtypes

In [None]:
sns.scatterplot(x = 'Horsepower', y = 'MSRP', data = car_df)

In [None]:
# scatterplots for joint relationships and histograms for univariate distributions
sns.pairplot(car_df) 


In [None]:
# Let's view the various types of cars
car_df.Type.unique()

In [None]:
plt.figure(figsize = (16, 8))
sns.countplot(x = car_df['Type'])
plt.xticks(rotation = 45);

In [None]:
plt.figure(figsize = (16, 8))
sns.countplot(x = car_df['Origin'])
plt.xticks(rotation = 45);

In [None]:
plt.figure(figsize = (16, 8))
sns.countplot(x = car_df['DriveTrain'])
plt.xticks(rotation = 45);

**PRACTICE OPPORTUNITY #2 [OPTIONAL]:**
- **List all unique car makes in the dataset.**
- **Using Seaborn, plot a countplot of the vehicle Make.**
- **What are the top 3 brands?**

# CODING TASK #3: PERFORM EXPLORATORY DATA ANALYSIS AND VISUALIZATION - PART #2

In [None]:
# Install the wordcloud package
!pip install wordcloud
# Let's view the models of all used cars using the WordCloud generator
from wordcloud import WordCloud, STOPWORDS

In [None]:
car_df

In [None]:
# Join all model names into one string so the word cloud sees plain text
text = " ".join(car_df["Model"].astype(str))

In [None]:
stopwords = set(STOPWORDS)

In [None]:
wc = WordCloud(background_color = "black", max_words = 2000, max_font_size = 100, random_state = 3, 
              stopwords = stopwords, contour_width = 3).generate(str(text))  

In [None]:
fig = plt.figure(figsize = (25, 15))
plt.imshow(wc, interpolation = "bilinear")
plt.axis("off")
plt.show()

**PRACTICE OPPORTUNITY #3 [OPTIONAL]:**
- **Plot the correlation matrix**
- **Comment on the correlation matrix. Which feature has the highest positive correlation with MSRP?**

# CODING TASK #4: PREPARE THE DATA BEFORE MODEL TRAINING

In [None]:
car_df.head()

In [None]:
# Perform One-Hot Encoding for "Make", "Model", "Type", "Origin", and "DriveTrain"
car_df = pd.get_dummies(car_df, columns=["Make", "Model", "Type", "Origin", "DriveTrain"])
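
As an illustration of what one-hot encoding produces, the sketch below applies `pd.get_dummies` to a small made-up frame. `toy_df` and its values are hypothetical; only the column names mirror the notebook.

```python
import pandas as pd

# Hypothetical toy frame; column names mirror the notebook, values are made up
toy_df = pd.DataFrame({
    "Origin": ["Asia", "Europe", "USA", "Asia"],
    "Horsepower": [265, 170, 225, 200],
})

# Each category becomes its own indicator column named "<column>_<category>"
encoded_df = pd.get_dummies(toy_df, columns=["Origin"])

print(encoded_df.columns.tolist())
print(encoded_df)
```

Every distinct value adds one column, so a column with many distinct values (such as Model) can make the encoded frame very wide.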

In [None]:
car_df.head()

In [None]:
# Assign the input features to X and the output (MSRP) to y
X = car_df.drop("MSRP", axis = 1)
y = car_df["MSRP"]

In [None]:
X = np.array(X)

In [None]:
y = np.array(y)

In [None]:
from sklearn.model_selection import train_test_split
# Hold out 25% of the rows for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25)
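
The split is shuffled randomly, so the rows that end up in the test set change from run to run. If reproducible results are wanted, `train_test_split` accepts a `random_state` argument. The sketch below uses made-up arrays (`X_toy`, `y_toy`) and an arbitrary seed of 42, neither of which comes from the notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 20 samples with 3 features each
X_toy = np.arange(60).reshape(20, 3)
y_toy = np.arange(20)

# The same random_state (arbitrary value) yields the same split on every call
X_tr1, X_te1, y_tr1, y_te1 = train_test_split(X_toy, y_toy, test_size=0.25, random_state=42)
X_tr2, X_te2, y_tr2, y_te2 = train_test_split(X_toy, y_toy, test_size=0.25, random_state=42)

print(X_tr1.shape, X_te1.shape)
print("Identical test sets:", np.array_equal(X_te1, X_te2))
```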

In [None]:
X_train.shape

In [None]:
X_test.shape

**PRACTICE OPPORTUNITY #4 [OPTIONAL]:**
- **Perform the train/test split without specifying a test_size. What do you conclude?**

# CODING TASK #5: TRAIN AND TEST A LINEAR REGRESSION MODEL IN SK-LEARN (NOTE THAT SAGEMAKER BUILT-IN ALGORITHMS ARE NOT USED HERE)

In [None]:
# Train a linear regression model with scikit-learn
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

regresssion_model_sklearn = LinearRegression()
regresssion_model_sklearn.fit(X_train, y_train)
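
`LinearRegression` fits an ordinary least squares model. To show the idea outside of scikit-learn, here is a minimal NumPy sketch that recovers known coefficients from made-up, noise-free data. The arrays and the true coefficients (3, 2, -1) are hypothetical, and this is an illustration of least squares rather than a description of scikit-learn's internals.

```python
import numpy as np

# Hypothetical noise-free toy data generated from y = 3 + 2*x1 - 1*x2
X_toy = np.array([[1.0, 2.0], [2.0, 0.0], [3.0, 1.0], [4.0, 3.0], [5.0, 5.0]])
y_toy = 3.0 + 2.0 * X_toy[:, 0] - 1.0 * X_toy[:, 1]

# Prepend a column of ones so the first coefficient acts as the intercept
X_design = np.column_stack([np.ones(len(X_toy)), X_toy])

# Find beta that minimizes the squared error ||X_design @ beta - y_toy||^2
beta, *_ = np.linalg.lstsq(X_design, y_toy, rcond=None)

print("Intercept and coefficients:", np.round(beta, 6))
```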


In [None]:
# For regressors, score() returns R2 (coefficient of determination), not a classification accuracy
regresssion_model_sklearn_accuracy = regresssion_model_sklearn.score(X_test, y_test)
regresssion_model_sklearn_accuracy
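
For a regressor, `score()` reports the coefficient of determination (R2), not a classification accuracy. The sketch below checks this on made-up data by comparing `score()` with `r2_score`. All names and values in it are hypothetical.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# Hypothetical toy data with a small amount of noise
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(50, 2))
y_toy = 1.0 + 2.0 * X_toy[:, 0] - 3.0 * X_toy[:, 1] + rng.normal(scale=0.1, size=50)

model = LinearRegression().fit(X_toy, y_toy)

score_value = model.score(X_toy, y_toy)           # value reported by score()
r2_value = r2_score(y_toy, model.predict(X_toy))  # R2 computed explicitly

print("score() equals r2_score:", np.isclose(score_value, r2_value))
```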

In [None]:
y_predict = regresssion_model_sklearn.predict(X_test)

In [None]:
y_predict

In [None]:
k = 13 # Number of independent variables
k

In [None]:
n = len(X_test)
n

In [None]:
from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error

MSE = mean_squared_error(y_test, y_predict)
RMSE = round(float(np.sqrt(MSE)), 3)  # square root of MSE, rounded to 3 decimals
MAE = mean_absolute_error(y_test, y_predict)
r2 = r2_score(y_test, y_predict)
# Adjusted R2 penalizes R2 for the number of predictors k, given n test samples
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - k - 1)

print('RMSE =', RMSE, '\nMSE =', MSE, '\nMAE =', MAE, '\nR2 =', r2, '\nAdjusted R2 =', adj_r2)
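
The adjusted R2 formula above is only meaningful when `n - k - 1` is positive, that is, when there are more samples than predictors plus one. A minimal sketch of the same formula as a helper function follows. The function name `adjusted_r2`, the choice to return `None` when the formula does not apply, and the example numbers are all assumptions made for illustration.

```python
def adjusted_r2(r2, n, k):
    """Adjusted R2 for n samples and k predictors, or None when n - k - 1 <= 0."""
    if n - k - 1 <= 0:
        return None  # not enough samples relative to the number of predictors
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

# Hypothetical numbers chosen only for illustration
example = adjusted_r2(r2=0.90, n=107, k=13)
undefined_case = adjusted_r2(r2=0.90, n=107, k=400)

print("Adjusted R2:", example)
print("Too many predictors:", undefined_case)
```

This is worth keeping in mind after one-hot encoding, because the number of columns in `X` can be far larger than the number of original features.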

In [None]:
# Visualize how close the predictions are to the true test values
plt.figure(figsize = (12, 6))
plt.scatter(y_test, y_predict)
plt.xlabel('True MSRP')
plt.ylabel('Predicted MSRP')

**PRACTICE OPPORTUNITY #5 [OPTIONAL]:**
- **Train a Random Forest Regressor**
- **Evaluate trained model performance**

# EXCELLENT JOB!

# PRACTICE OPPORTUNITIES SOLUTIONS

**PRACTICE OPPORTUNITY #1 SOLUTION:**
- **What is the maximum price of the used car?**
- **What is the minimum price of the used car?**

In [None]:
car_df['MSRP'].max()

In [None]:
car_df['MSRP'].min()

**PRACTICE OPPORTUNITY #2 SOLUTION:**
- **List all unique car makes in the dataset.**
- **Using Seaborn, plot a countplot of the vehicle Make.**
- **What are the top 3 brands?**

In [None]:
car_df

In [None]:
# Let's view various makes of the cars
car_df['Make'].unique()

In [None]:
plt.figure(figsize = (16, 8))
sns.countplot(x = car_df['Make'])
plt.xticks(rotation = 45);
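
The top 3 brands can be read off the countplot, or computed directly. One possible approach uses `value_counts()`, shown here on a small made-up frame. `toy_df` and its values are hypothetical; on the real data the same call would be applied to `car_df['Make']`.

```python
import pandas as pd

# Hypothetical toy frame standing in for car_df (values are made up)
toy_df = pd.DataFrame({
    "Make": ["Toyota", "Ford", "Toyota", "Chevrolet", "Ford",
             "Toyota", "Honda", "Ford", "Chevrolet", "Toyota"],
})

# value_counts() sorts by frequency in descending order, so head(3) gives the top 3
top_3_makes = toy_df["Make"].value_counts().head(3)

print(top_3_makes)
```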

**PRACTICE OPPORTUNITY #3 SOLUTION:**
- **Plot the correlation matrix**
- **Comment on the correlation matrix. Which feature has the highest positive correlation with MSRP?**

In [None]:
# Obtain the correlation matrix
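# numeric_only = True skips text columns, which recent Pandas versions would otherwise reject
car_df.corr(numeric_only = True)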

In [None]:
plt.figure(figsize = (10,10))
sns.heatmap(car_df.corr(numeric_only = True), cmap="YlGnBu", annot = True);

In [None]:
# Positive correlation between engine size and number of cylinders
# Positive correlation between horsepower and number of cylinders
# Highest positive correlation with MSRP: Horsepower
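
Instead of reading the heatmap by eye, the correlations with MSRP can be ranked programmatically. The sketch below does this on a made-up frame. `toy_df` and its values are hypothetical, and `numeric_only=True` assumes Pandas 1.5 or newer.

```python
import pandas as pd

# Hypothetical toy frame; column names mirror the notebook, values are made up
toy_df = pd.DataFrame({
    "Make": ["A", "B", "C", "D", "E"],
    "Horsepower": [100, 150, 200, 250, 300],
    "MPG_City": [30, 28, 22, 21, 15],
    "MSRP": [15000, 21000, 30000, 36000, 50000],
})

# Correlation of every numeric column with MSRP, excluding MSRP itself
msrp_corr = toy_df.corr(numeric_only=True)["MSRP"].drop("MSRP")
ranked = msrp_corr.sort_values(ascending=False)

print(ranked)
print("Most positively correlated feature:", ranked.index[0])
```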


**PRACTICE OPPORTUNITY #4 SOLUTION:**
- **Perform the train/test split without specifying a test_size. What do you conclude?**

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test= train_test_split(X, y)

In [None]:
X_train.shape

In [None]:
X_test.shape
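
Comparing the two shapes shows the conclusion: when neither `test_size` nor `train_size` is given, `train_test_split` holds out 25% of the rows for testing. A minimal sketch on made-up arrays (`X_toy` and `y_toy` are hypothetical) makes the ratio explicit.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 100 samples with 2 features each
X_toy = np.arange(200).reshape(100, 2)
y_toy = np.arange(100)

# No test_size given, so the default split is used
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy)

test_fraction = len(X_te) / len(X_toy)
print(X_tr.shape, X_te.shape, "test fraction =", test_fraction)
```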

**PRACTICE OPPORTUNITY #5 SOLUTION:**
- **Train a Random Forest Regressor**
- **Evaluate trained model performance**

In [None]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error

RandomForest_model = RandomForestRegressor(n_estimators = 20, max_depth = 10)
RandomForest_model.fit(X_train, y_train)

# For regressors, score() returns R2 on the given data
accuracy_RandomForest = RandomForest_model.score(X_test, y_test)
print('Random forest R2 =', accuracy_RandomForest)

y_predict = RandomForest_model.predict(X_test)
n = len(X_test)  # recompute in case the data was split again

MSE = mean_squared_error(y_test, y_predict)
RMSE = round(float(np.sqrt(MSE)), 3)  # square root of MSE, rounded to 3 decimals
MAE = mean_absolute_error(y_test, y_predict)
r2 = r2_score(y_test, y_predict)
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - k - 1)

print('RMSE =', RMSE, '\nMSE =', MSE, '\nMAE =', MAE, '\nR2 =', r2, '\nAdjusted R2 =', adj_r2)