# 1. Assignment
## Background & Context

Link: https://olympus.greatlearning.in/courses/40800/assignments/166698?module_item_id=1427496

There is a huge demand for used cars in the Indian market today. As sales of new cars have slowed down in the recent past, the pre-owned car market has continued to grow over the past years and is now larger than the new car market. Cars4U is a budding tech start-up that aims to gain a foothold in this market.

In 2018-19, while new car sales were recorded at 3.6 million units, around 4 million second-hand cars were bought and sold. The slowdown in new car sales could mean that demand is shifting towards the pre-owned market. In fact, some car sellers replace their old cars with pre-owned cars instead of buying new ones. For new cars, price and supply are fairly deterministic and managed by OEMs (Original Equipment Manufacturers), apart from dealership-level discounts that come into play only in the last stage of the customer journey. Used cars are very different beasts, with huge uncertainty in both pricing and supply. With this in mind, the pricing scheme for used cars becomes important for growing in this market.

As a senior data scientist at Cars4U, you have to come up with a pricing model that can effectively predict the price of used cars and can help the business in devising profitable strategies using differential pricing. For example, if the business knows the market price, it will never sell anything below it. 

## Objective

- Explore and visualize the dataset.
- Build a linear regression model to predict the prices of used cars.
- Generate a set of insights and recommendations that will help the business.

Data Dictionary:

S.No. : Serial Number

Name : Name of the car which includes Brand name and Model name

Location : The city in which the car is being sold or is available for purchase

Year : Manufacturing year of the car

Kilometers_Driven : The total kilometers driven in the car by the previous owner(s), in km.

Fuel_Type : The type of fuel used by the car. (Petrol, Diesel, Electric, CNG, LPG)

Transmission : The type of transmission used by the car. (Automatic / Manual)

Owner_Type : Type of ownership

Mileage : The standard mileage offered by the car company in kmpl or km/kg

Engine : The displacement volume of the engine in CC.

Power : The maximum power of the engine in bhp.

Seats : The number of seats in the car.

New_Price : The price of a new car of the same model in INR Lakhs (1 Lakh = 100,000)

Price : The price of the used car in INR Lakhs (1 Lakh = 100,000)
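Since `Name` combines the brand and the model, it can be useful to split it into separate fields. The snippet below is a minimal sketch, assuming the brand is the first word and the model is the second word; the `Brand` and `Model` column names are hypothetical, and the three names are copied from the sample rows shown later in this notebook.

```python
import pandas as pd

# Three names copied from the sample output later in this notebook
names = pd.Series([
    "Maruti Swift VXi BSIV",
    "Honda Amaze VX i-DTEC",
    "Ford Ecosport 1.5 DV5 MT Trend",
])

# Assumption: first token is the brand, second token is the model.
# n=2 keeps the rest of the name together in a third column.
parts = names.str.split(" ", n=2, expand=True)
cars = pd.DataFrame({"Brand": parts[0], "Model": parts[1]})
print(cars)
```

This rule would mislabel any brand whose name has more than one word, so the resulting values should be checked before they are used as features.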

In [29]:
%%html
<!--- ###This is a trick I found  
from https://stackoverflow.com/questions/21892570/ipython-notebook-align-table-to-the-left-of-cell 
to align tables to left -->
    
<style>
table {float:left}
</style>

# Marking criteria
I am creating this section to facilitate the review of this document.

|Criteria|Points|Sections where I cover those areas
|:---|:---|:---
|Define the problem and perform an Exploratory Data Analysis|10|Section 2
|Illustrate the insights based on EDA|5|TBD
|Data pre-processing|15|TBD
|Model building - Linear Regression|12|TBD
|Model performance evaluation|6|TBD
|Actionable Insights & Recommendations|6|TBD
|Notebook - Overall Quality|6|TBD


# 2. Define the problem

In 2018-19, new car sales in India were recorded at 3.6 million units, while around 4 million second-hand cars were bought and sold. Anecdotally, some car sellers replace their old cars with pre-owned cars.
Cars4U's business goal is to capture the second-hand car market in this region.

**The goal of this project is to find a pricing model that effectively predicts the prices of used cars.** This will enable the business to devise a strategy for differential pricing.


# 3. Loading libraries

In [30]:
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
import seaborn as sns

# Removes the limit from the number of displayed columns and rows.
# This is so I can see the entire dataframe when I print it
pd.set_option('display.max_columns', None)
# pd.set_option('display.max_rows', None)
pd.set_option('display.max_rows', 200)

# 4. Loading and exploring the data

In this section the goals are to load the data into Python and then check its basic properties: the dimensions, the column names and types, and the missing-value counts.

In [31]:
df = pd.read_csv("used_cars_data.csv", index_col=0)
print(f'There are {df.shape[0]} rows and {df.shape[1]} columns.')  # f-string

There are 7253 rows and 13 columns.
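The data dictionary lists 14 fields, while the load reports 13 columns. The difference comes from `index_col=0`, which turns the first field (`S.No.`) into the row index instead of a regular column. The snippet below is a small sketch of that behaviour, using a hypothetical three-column CSV held in memory rather than the real file.

```python
import io
import pandas as pd

# Hypothetical miniature CSV standing in for used_cars_data.csv
csv_text = "S.No.,Name,Year\n0,Car A,2015\n1,Car B,2017\n"

with_index = pd.read_csv(io.StringIO(csv_text), index_col=0)
without_index = pd.read_csv(io.StringIO(csv_text))

print(with_index.shape)       # S.No. is the index, so it is not counted
print(without_index.shape)    # S.No. is counted as an ordinary column
print(with_index.index.name)  # the index keeps the S.No. label
```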


In [18]:
# Look at 10 random rows.
# The seed line below is commented out, so the sample changes on every run;
# uncomment it to get the same random rows each time.
# np.random.seed(1)
df.sample(n=10)

|S.No.|Name|Location|Year|Kilometers_Driven|Fuel_Type|Transmission|Owner_Type|Mileage|Engine|Power|Seats|Price
|:---|:---|:---|:---|:---|:---|:---|:---|:---|:---|:---|:---|:---
|7011|Toyota Innova Crysta 2.4 VX MT 8S|Coimbatore|2017|103189|Diesel|Manual|First|13.68 kmpl|2393 CC|147.8 bhp|8.0|NaN
|5343|Honda Civic 2006-2010 1.8 S AT|Pune|2008|152633|Petrol|Automatic|First|12.9 kmpl|1799 CC|130 bhp|5.0|3.0
|1249|Maruti Swift VXi BSIV|Chennai|2014|54000|Petrol|Manual|Second|16.1 kmpl|1197 CC|85 bhp|5.0|4.5
|4640|Honda Amaze VX i-DTEC|Delhi|2013|75348|Diesel|Manual|First|25.8 kmpl|1498 CC|98.6 bhp|5.0|3.75
|3390|BMW X5 2014-2019 xDrive 30d Design Pure Experi...|Mumbai|2015|29500|Diesel|Automatic|First|15.97 kmpl|2993 CC|258 bhp|7.0|43.0
|4030|Ford Ecosport 1.5 DV5 MT Trend|Hyderabad|2014|55000|Diesel|Manual|First|22.7 kmpl|1498 CC|89.84 bhp|5.0|6.0
|4275|Maruti Celerio VXI AT|Pune|2015|26725|Petrol|Automatic|First|23.1 kmpl|998 CC|67.04 bhp|5.0|4.25
|3142|Honda Brio 1.2 S MT|Mumbai|2015|22000|Petrol|Manual|First|18.5 kmpl|1198 CC|86.8 bhp|5.0|3.95
|4979|Mahindra XUV500 W6 2WD|Delhi|2015|37000|Diesel|Manual|First|15.1 kmpl|2179 CC|140 bhp|7.0|9.3
|2099|Honda Mobilio S i VTEC|Mumbai|2014|43081|Petrol|Manual|First|17.3 kmpl|1497 CC|117.3 bhp|7.0|5.25
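The comments in the sampling cell mention fixing the random seed. One alternative, sketched here on a hypothetical toy frame, is to pass `random_state` to `DataFrame.sample`, which pins the result for that call without changing NumPy's global seed.

```python
import pandas as pd

# Hypothetical toy frame; the real notebook would use df instead
toy = pd.DataFrame({"Year": range(2000, 2020)})

# The same random_state gives the same rows in the same order
first = toy.sample(n=5, random_state=1)
second = toy.sample(n=5, random_state=1)

print(first.equals(second))
```

On the real data this would read `df.sample(n=10, random_state=1)`; the value 1 is arbitrary.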


## 4.1 Observations from initial exploration
- Dataset (DS) has 7253 rows and 13 columns.
- DS has 13 features: name, location, year, kilometers_driven, fuel_type, transmission, owner_type, mileage, engine, power, seats, new_price, price.
- New_Price is NaN in the rows inspected, so it is a candidate to be dropped.
- Mileage, Engine and Power are stored as text with units and need to be converted to numeric before they can be processed.
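The conversion of Mileage, Engine and Power can be sketched as follows. This is one possible approach, not necessarily the one used later in this notebook: it keeps the leading number and discards the unit. The helper name `leading_number` is hypothetical, the non-missing values are copied from the sample rows above, and the missing Mileage entry is made up to show how gaps are handled.

```python
import pandas as pd

toy = pd.DataFrame({
    "Mileage": ["13.68 kmpl", "12.9 kmpl", None],  # None is a made-up gap
    "Engine": ["2393 CC", "1799 CC", "1197 CC"],
    "Power": ["147.8 bhp", "130 bhp", "85 bhp"],
})

def leading_number(col: pd.Series) -> pd.Series:
    """Keep the first whitespace-separated token and parse it as a number.

    Anything that cannot be parsed becomes NaN instead of raising.
    """
    token = col.str.split().str[0]
    return pd.to_numeric(token, errors="coerce")

for name in ["Mileage", "Engine", "Power"]:
    toy[name] = leading_number(toy[name])

print(toy.dtypes)
```

Note that Mileage is reported in either kmpl or km/kg, so dropping the unit puts two different scales into one column; whether that is acceptable is a modelling decision.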

In [17]:
# Let us drop New_Price
df.drop(['New_Price'], axis=1, inplace=True)
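Running the drop cell a second time raises a `KeyError`, because the column is already gone. A rerun-safe variant is sketched below on a hypothetical toy frame; it uses the `columns=` keyword and `errors="ignore"` instead of `inplace=True`.

```python
import pandas as pd

# Hypothetical toy frame with the same two column names
toy = pd.DataFrame({"Price": [3.0, 4.5], "New_Price": [None, None]})

# errors="ignore" skips labels that are not present
toy = toy.drop(columns=["New_Price"], errors="ignore")
toy = toy.drop(columns=["New_Price"], errors="ignore")  # rerun: no KeyError

print(list(toy.columns))
```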

# 5. Exploratory Data Analysis (EDA)

In [19]:
df.info() 

<class 'pandas.core.frame.DataFrame'>
Int64Index: 7253 entries, 0 to 7252
Data columns (total 12 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   Name               7253 non-null   object 
 1   Location           7253 non-null   object 
 2   Year               7253 non-null   int64  
 3   Kilometers_Driven  7253 non-null   int64  
 4   Fuel_Type          7253 non-null   object 
 5   Transmission       7253 non-null   object 
 6   Owner_Type         7253 non-null   object 
 7   Mileage            7251 non-null   object 
 8   Engine             7207 non-null   object 
 9   Power              7207 non-null   object 
 10  Seats              7200 non-null   float64
 11  Price              6019 non-null   float64
dtypes: float64(2), int64(2), object(8)
memory usage: 736.6+ KB
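Reading the non-null counts against the 7253 entries gives the missing values per column: Price lacks 1234 values, Seats 53, Engine and Power 46 each, and Mileage 2. The same figures can be computed directly. The helper below is a minimal sketch with a hypothetical name, `missing_summary`, demonstrated on a made-up toy frame.

```python
import pandas as pd

def missing_summary(frame: pd.DataFrame) -> pd.DataFrame:
    """Missing count and percentage per column, largest first.

    Columns without missing values are left out.
    """
    counts = frame.isna().sum()
    summary = pd.DataFrame({
        "missing": counts,
        "percent": (100 * counts / len(frame)).round(2),
    })
    summary = summary[summary["missing"] > 0]
    return summary.sort_values("missing", ascending=False)

# Made-up values, only to exercise the helper
toy = pd.DataFrame({
    "Seats": [5.0, None, 7.0, 5.0],
    "Price": [3.0, None, None, 4.5],
    "Year": [2008, 2014, 2015, 2013],
})
result = missing_summary(toy)
print(result)
```

Calling `missing_summary(df)` on the real data should reproduce the counts listed above.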
