# Car Price Prediction

---

## Introduction
In this notebook, we explore how specific details of a car correlate with its price.

## Setting up the notebook
Run these cells only once to set up the notebook.

***(Make sure you have a Kaggle API key, `kaggle.json`, and place it inside the notebooks folder.)***
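If the token is missing, the copy step in the setup cells fails with a `cannot stat` error. The following is a minimal sketch of a pre-flight check, assuming the token sits next to the notebook and the Kaggle config lives in `~/.kaggle`. The helper name `install_kaggle_token` is hypothetical and not part of the Kaggle tooling.

```python
from pathlib import Path
import shutil


def install_kaggle_token(src: Path, dest_dir: Path) -> bool:
    """Copy a Kaggle API token into dest_dir; return False if src is missing."""
    if not src.is_file():
        print(f"Token not found at {src.resolve()}")
        return False
    dest_dir.mkdir(parents=True, exist_ok=True)
    dest = dest_dir / "kaggle.json"
    shutil.copy(src, dest)
    dest.chmod(0o600)  # owner read/write only, same effect as chmod 600
    return True
```

A call such as `install_kaggle_token(Path("kaggle.json"), Path.home() / ".kaggle")` would cover the same steps as the shell commands in the setup cells, with an explicit message when the file is absent.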

In [3]:
from pathlib import Path
import os

# Create Folder (if not already made)
DATA_DIR = Path("../data/")
DATA_DIR.mkdir(parents=True, exist_ok=True)
print("Data folder is ready:", DATA_DIR.resolve())




Data folder is ready: /home/dev/SHARED_DIRECTORY/DS/Projects/DS_Projects/Proj_2_Car_Price_Scikit/data


In [4]:
!pip install kaggle




In [5]:
# Create the Kaggle config folder and copy the API token into it from the notebooks folder
!mkdir -p ~/.kaggle
!cp kaggle.json ~/.kaggle/
!chmod 600 ~/.kaggle/kaggle.json

cp: cannot stat 'kaggle.json': No such file or directory


In [None]:
# Download and unzip the dataset into the data folder (requires your own Kaggle API token in ~/.kaggle)
os.chdir(DATA_DIR)

!kaggle datasets download -d hellbuoy/car-price-prediction
!unzip car-price-prediction.zip

os.chdir("../notebooks")

Dataset URL: https://www.kaggle.com/datasets/hellbuoy/car-price-prediction
License(s): unknown
car-price-prediction.zip: Skipping, found more recently modified local copy (use --force to force download)
Archive:  car-price-prediction.zip
replace CarPrice_Assignment.csv? [y]es, [n]o, [A]ll, [N]one, [r]ename: 
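
When the CSV already exists, `unzip` stops and waits for input, which can stall a notebook run. One possible way around this, sketched here under the assumption that the archive has already been downloaded, is to extract it with Python's standard `zipfile` module, which overwrites existing files without prompting. The helper name `extract_archive` is hypothetical.

```python
import zipfile
from pathlib import Path


def extract_archive(zip_path: Path, target_dir: Path) -> list:
    """Extract every member of zip_path into target_dir, overwriting silently."""
    target_dir.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(target_dir)
        return archive.namelist()
```

For this notebook, the call would look like `extract_archive(DATA_DIR / "car-price-prediction.zip", DATA_DIR)`, and it needs no change of working directory.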

---

## Data Exploration
Load the dataset and get a general overview of the data.

In [1]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

df = pd.read_csv("../data/CarPrice_Assignment.csv")

In [2]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 205 entries, 0 to 204
Data columns (total 26 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   car_ID            205 non-null    int64  
 1   symboling         205 non-null    int64  
 2   CarName           205 non-null    object 
 3   fueltype          205 non-null    object 
 4   aspiration        205 non-null    object 
 5   doornumber        205 non-null    object 
 6   carbody           205 non-null    object 
 7   drivewheel        205 non-null    object 
 8   enginelocation    205 non-null    object 
 9   wheelbase         205 non-null    float64
 10  carlength         205 non-null    float64
 11  carwidth          205 non-null    float64
 12  carheight         205 non-null    float64
 13  curbweight        205 non-null    int64  
 14  enginetype        205 non-null    object 
 15  cylindernumber    205 non-null    object 
 16  enginesize        205 non-null    int64  
 1

### Conclusion
Every column appears to be fully populated (no null values), and our target column, price, is the last one in the DataFrame. We are ready to move forward.
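
The non-null counts from `df.info()` can also be confirmed with an explicit check. The sketch below uses a small toy frame with made-up values (not rows from the dataset) to show what the check reports when values are missing; on the real data, the same two expressions would be applied to `df`.

```python
import numpy as np
import pandas as pd

# Toy frame with hypothetical values, standing in for df
toy = pd.DataFrame({
    "enginesize": [130, 152, np.nan],
    "fueltype": ["gas", None, "diesel"],
    "price": [13495.0, 16500.0, 17450.0],
})

# Count of missing values per column
missing_per_column = toy.isna().sum()

# Single flag: is anything missing anywhere in the frame?
has_missing = bool(toy.isna().any().any())

print(missing_per_column)
print("Any missing values:", has_missing)
```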

---

## Data Cleaning

Let's check for outliers. We don't have to handle null values, since the overview above showed that every column is fully populated.

We can start by creating two new dataframes, one for qualitative (*cat_df*) and one for quantitative (*numeric_df*) data.

In [3]:
numeric_df = df.select_dtypes(include="number")

In [4]:
cat_df = df.select_dtypes(include="object")

In [5]:
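# First and third quartiles of every numeric column
Q1 = numeric_df.quantile(.25)
Q3 = numeric_df.quantile(.75)

# Interquartile range (IQR) per column
IQR = Q3 - Q1

# Drop every row with at least one value outside [Q1 - 1.5 * IQR, Q3 + 1.5 * IQR]
df = df[~((numeric_df < (Q1 - 1.5 * IQR)) | (numeric_df > (Q3 + 1.5 * IQR))).any(axis=1)]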
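
To make the IQR rule easier to follow, here is a small worked sketch on toy data. The values are hypothetical and chosen so that a single row is clearly out of range; the column names are borrowed from the dataset for readability only.

```python
import pandas as pd

# Toy data (hypothetical values); 400 is an obvious outlier in "enginesize"
toy = pd.DataFrame({
    "enginesize": [100, 110, 105, 95, 400],
    "curbweight": [2500, 2550, 2480, 2600, 2520],
})

q1 = toy.quantile(0.25)
q3 = toy.quantile(0.75)
iqr = q3 - q1

# Per-column fences: anything outside them counts as an outlier
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

# A row is flagged if any of its columns falls outside the fences
is_outlier = ((toy < lower) | (toy > upper)).any(axis=1)
filtered = toy[~is_outlier]

print(filtered)
```

Here the fences for `enginesize` are 85 and 125, so only the row containing 400 is dropped. Note that in this notebook `numeric_df` and `cat_df` are created before the filter, so they still hold all 205 rows; recreating them from the filtered `df` would be needed if later steps should exclude the outliers.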

In [25]:
from sklearn.preprocessing import OneHotEncoder

ohe = OneHotEncoder()

(numeric_df > 1).any(axis=1)

0      True
1      True
2      True
3      True
4      True
       ... 
200    True
201    True
202    True
203    True
204    True
Length: 205, dtype: bool
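
As an alternative sketch to `OneHotEncoder` (a swapped-in technique, not the approach this notebook uses), `pandas.get_dummies` can one-hot encode several categorical columns in a single call and keeps readable column names. The toy frame and its values are hypothetical.

```python
import pandas as pd

# Toy categorical frame with hypothetical values
toy_cat = pd.DataFrame({
    "fueltype": ["gas", "diesel", "gas"],
    "carbody": ["sedan", "hatchback", "wagon"],
})

# One indicator column per (column, category) pair, stored as 0/1 integers
encoded = pd.get_dummies(toy_cat, dtype=int)

print(encoded.columns.tolist())
print(encoded.shape)
```

The result is a dense DataFrame, which is convenient for inspection. For high-cardinality columns such as `CarName`, the sparse output of `OneHotEncoder` may be the more memory-friendly choice.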

In [26]:
# One-hot encode each categorical column separately and print the resulting sparse matrix
for col in cat_df:
    print(ohe.fit_transform(cat_df[col].to_numpy().reshape(-1, 1)))

<Compressed Sparse Row sparse matrix of dtype 'float64'
	with 205 stored elements and shape (205, 147)>
  Coords	Values
  (0, 2)	1.0
  (1, 3)	1.0
  (2, 1)	1.0
  (3, 4)	1.0
  (4, 5)	1.0
  (5, 9)	1.0
  (6, 5)	1.0
  (7, 7)	1.0
  (8, 6)	1.0
  (9, 8)	1.0
  (10, 10)	1.0
  (11, 10)	1.0
  (12, 11)	1.0
  (13, 12)	1.0
  (14, 15)	1.0
  (15, 13)	1.0
  (16, 14)	1.0
  (17, 12)	1.0
  (18, 24)	1.0
  (19, 25)	1.0
  (20, 26)	1.0
  (21, 35)	1.0
  (22, 27)	1.0
  (23, 32)	1.0
  (24, 34)	1.0
  :	:
  (180, 126)	1.0
  (181, 128)	1.0
  (182, 129)	1.0
  (183, 130)	1.0
  (184, 133)	1.0
  (185, 137)	1.0
  (186, 131)	1.0
  (187, 136)	1.0
  (188, 132)	1.0
  (189, 145)	1.0
  (190, 146)	1.0
  (191, 134)	1.0
  (192, 135)	1.0
  (193, 132)	1.0
  (194, 139)	1.0
  (195, 138)	1.0
  (196, 140)	1.0
  (197, 141)	1.0
  (198, 143)	1.0
  (199, 144)	1.0
  (200, 139)	1.0
  (201, 138)	1.0
  (202, 140)	1.0
  (203, 142)	1.0
  (204, 143)	1.0
<Compressed Sparse Row sparse matrix of dtype 'float64'
	with 205 stored elements and shape (2