ASSIGNMENT NUMBER 1
In this assignment, you will solve a practical problem: Chaky’s company makes cars but has
difficulty setting prices for them. Please build a simple web-based car price prediction system.
Note: You are ENCOURAGED to work with your friends, but DISCOURAGED from blindly copying
others’ work. If copying is found, both parties will be given 0.
Note: Provide sufficient comments so that we know you understand your code. Failure to do so can
raise suspicion of possible copying/plagiarism.
Note: You will be graded on (1) documentation, (2) experiment, (3) implementation.
Note: This is a two-week assignment, but start early.
Deliverables: A link to a GitHub repository containing the Jupyter notebook, a README.md, and
the folder of your web application called ‘app’.

***************************************************************************************************

Task 1. Preparing the datasets
Download the Car Price dataset from Google Classroom.
Perform loading, EDA, preprocessing, model selection, ..., inference.

There are some important coding considerations:
• For the feature owner, map First owner to 1, ..., Test Drive Car to 5
• For the feature fuel, remove all rows with CNG and LPG because CNG and LPG use a different
mileage system, i.e., km/kg, which is different from kmpl for Diesel and Petrol
• For the feature mileage, remove “kmpl” and convert the column to numerical type (e.g., float). Hint: use
df.mileage.str.split
• For the feature engine, remove “CC” and convert the column to numerical type (e.g., float)
• Do the same for max power
• For the feature brand, take only the first word and remove the rest
• Drop the feature torque, simply because Chaky’s company does not understand it well
• You will find out that Test Drive Cars are ridiculously expensive. Since we do not want to include
  them, we will simply delete all samples related to them.
• Since selling price is a big number, it can cause your prediction to be very unstable. One trick is
  to first transform the label using log transform, i.e., y = np.log(df['selling_price'])
• During inference/testing, you have to transform your predicted y back before comparing with y
test, i.e., pred_y = np.exp(pred_y)
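
The last two considerations can be sketched on toy numbers as follows. This is only an illustration: the prices are made up (not taken from Cars.csv), and pred_log is a stand-in for the output of a model's predict call.

```python
import numpy as np

# Toy selling prices (made-up values, not from Cars.csv).
prices = np.array([450000.0, 370000.0, 158000.0, 225000.0])

# Train on the log of the label so very large prices do not dominate.
y_log = np.log(prices)

# Stand-in for model.predict(X_test): pretend the model is slightly off.
pred_log = y_log + np.array([0.05, -0.02, 0.00, 0.10])

# Map predictions back to the original price scale before comparing with y_test.
pred_prices = np.exp(pred_log)

# Error measured in the original currency units.
mae = np.mean(np.abs(pred_prices - prices))
print(f"MAE on the original scale: {mae:.2f}")
```

Comparing np.exp(pred_y) with the untransformed test prices keeps the reported error in currency units, which is easier to interpret than an error in log units.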

**********************************************************************************************

Task 2. Report - At the end of the notebook, please write a 2-3 paragraph summary discussing
and analysing the results in depth. Possible points of discussion:
• Which features are important? Which are not? Why?
• Which algorithm performs well? Which does not? Why? (here, you haven’t learned about any
algorithms yet, but you can search online a bit and start building an intuition)

**********************************************************************************************

Task 3. Deployment - Develop a web-based application that contains the model. In this task you will
self-study how to deploy a model into production. You have multiple options. Veteran web
developers may prefer their own web app stack, which is welcome. Those who are new to this
realm may consider a simpler, one-stop solution rather than learning the traditional, more
flexible approach.
The goal of this task is to expose/deploy our model for public use via a web interface. The main
scenario is the following:
1) Users enter the domain on their browser. They land on your page.
2) (optional) Users may need to navigate to a prediction page.
3) Users read the instructions given on the page, which explain how the prediction works.
4) Users find the input form, put in the appropriate data, and click submit.
5) Note that if users do not have information for a certain field, you have to allow them to skip that
field. In that case, we recommend filling the missing field using an imputation technique you
have learned in class.
6) A moment later (depending on your model and hardware performance), the result is returned
and printed below the form.
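
Step 5 can be sketched as a small, framework-independent helper that a form callback could call before prediction. Everything here is an assumption for illustration: the function name fill_skipped_fields, the field names, and the default values are hypothetical placeholders, not statistics from Cars.csv. In practice the defaults would come from the training split (for example, medians).

```python
from typing import Dict, Optional

# Hypothetical defaults, e.g. medians computed on the training split.
# The numbers below are placeholders, not statistics from Cars.csv.
TRAINING_DEFAULTS: Dict[str, float] = {
    "year": 2015.0,
    "mileage": 19.0,
    "max_power": 82.0,
}


def fill_skipped_fields(
    form_values: Dict[str, Optional[float]],
    defaults: Dict[str, float] = TRAINING_DEFAULTS,
) -> Dict[str, float]:
    """Return one value per expected field, imputing any the user skipped."""
    filled = {}
    for field, default in defaults.items():
        value = form_values.get(field)
        # A skipped input typically arrives as None or an empty string.
        if value is None or value == "":
            filled[field] = default
        else:
            filled[field] = float(value)
    return filled


# The user filled in 'year', left 'mileage' blank, and never sent 'max_power'.
example = fill_skipped_fields({"year": 2018, "mileage": None})
print(example)  # {'year': 2018.0, 'mileage': 19.0, 'max_power': 82.0}
```

Using statistics from the training data, rather than recomputing them at request time, keeps inference consistent with how the model was trained.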
Deployment aside, the app should first work in your local environment (your machine). I suggest
using ‘Dash’ by ‘Plotly’ https://dash.plotly.com/ as a one-stop solution. Spend time studying the
‘Quick Start’ tutorial on the site and also ‘Dash Fundamentals’. They are essential for understanding
how ‘Dash’ works.
The deliverable for the app is a folder ‘app’ in your GitHub repository containing the ‘.Dockerfile’
and ‘docker-compose.yaml’ files and a ‘code’ folder.
Bootstrap: I know Dockerizing the app can be difficult for newcomers. You will get confused when
searching for stuff online, especially when you just trust ChatGPT to give you the right answer. So, for
those who want to postpone the process of learning “Docker”, here is the Dockerized Dash project
link. Don’t worry, you will eventually need to do this yourself in the upcoming weeks. You cannot
escape this.

In [None]:
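import pandas as pd  # conventional alias for pandas

# Load the raw Car Price dataset (Cars.csv must be in the working directory)
df_cars = pd.read_csv('Cars.csv')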

In [None]:
df_cars.columns

In [None]:
df_cars.shape

In [None]:
df_cars.head()

# Encode the feature "owner": First Owner --> 1, Second Owner --> 2, Third Owner --> 3, Fourth & Above Owner --> 4, Test Drive Car --> 5

In [None]:
owner_coding = {
    'First Owner': 1,
    'Second Owner': 2,
    'Third Owner': 3,
    'Fourth & Above Owner': 4,
    'Test Drive Car': 5
}

df_cars['owner'] = df_cars['owner'].map(owner_coding)
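
One thing worth checking after a dictionary-based mapping is whether any category was left unmapped, because Series.map returns NaN for values that are not keys of the dictionary. A minimal sketch on a toy column follows; the values are made up, and the misspelled last entry is there only to trigger the check.

```python
import pandas as pd

# Toy column; the last value is deliberately misspelled to simulate
# an unexpected category (this is not real data from Cars.csv).
owner = pd.Series(["First Owner", "Second Owner", "Test Drive Car", "first owner"])

owner_coding = {
    "First Owner": 1,
    "Second Owner": 2,
    "Third Owner": 3,
    "Fourth & Above Owner": 4,
    "Test Drive Car": 5,
}

coded = owner.map(owner_coding)

# .map() yields NaN for values missing from the dictionary, so list them.
unmapped = owner[coded.isna()].unique().tolist()
print(unmapped)  # ['first owner']
```

If the list is empty on the real column, every row received a code.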

# Remove rows with fuel values 'CNG' or 'LPG'

In [None]:
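# Keep only Petrol and Diesel rows; .copy() avoids SettingWithCopyWarning
# when columns are reassigned in later cells.
df_cars = df_cars[df_cars['fuel'].isin(['Petrol', 'Diesel'])].copy()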

# For the feature mileage, remove “kmpl” and convert the column to numerical type (e.g., float). 
# For the feature engine, remove “CC” and convert the column to numerical type (e.g., float)
# Do the same for max power

In [None]:
df_cars.mileage=df_cars.mileage.str.split(expand=True)[0].astype(float)

In [None]:
df_cars.engine=df_cars.engine.str.split(expand=True)[0].astype(float)

In [None]:

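# Keep the numeric part and drop the ' bhp' unit; entries with no number
# (for example a bare 'bhp', if any exist) become NaN instead of raising.
df_cars['max_power'] = df_cars['max_power'].str.extract(r'(\d+\.?\d*)', expand=False).astype(float)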

For the feature brand (stored in the name column), take only the first word and remove the rest

In [None]:
df_cars.name=df_cars.name.str.split(expand=True)[0]

Drop the feature torque, simply because Chaky’s company does not understand well about it

In [None]:
df_cars = df_cars.drop(columns=['torque'])

You will find out that Test Drive Cars are ridiculously expensive. Since we do not want to include
  them, we will simply delete all samples related to them.

In [None]:
df_cars = df_cars[df_cars['owner'] != 5]

Since selling price is a big number, it can cause your prediction to be very unstable. One trick is
  to first transform the label using log transform, i.e., y = np.log(df['selling_price'])

In [None]:
import numpy as np
df_cars['selling_price'] = np.log(df_cars['selling_price'])

In [None]:
df_cars

In [None]:
df_cars.info()

In [None]:

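# engine was already converted to float when "CC" was stripped above,
# so this cast is a no-op kept only as a sanity check.
df_cars.engine=df_cars.engine.astype(float)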
