<div style="text-align: center;">
    <h1> Project: A Web Application for Smart Car Selection <br>
 Based on Customizable Parameters </h1>
</div>

# Introduction
This project involved the development and deployment of a public-facing web application designed to assist users in selecting a car that best fits their needs based on a dataset of car sales advertisements. Hosted on a cloud platform, the application provides a user-friendly interface to filter cars by price range, condition, and distance driven, allowing users to tailor their search to find their ideal vehicle.

The application offers the following features:
- **Customizable Search**: Users can filter cars by setting a price range with a <span style="color:blue">slider</span> and by ticking a <span style="color:blue">checkbox</span> to show only listings in "New," "Like New," or "Excellent" condition.
- **Dynamic Data Insights**: Filtered search results are shown as a <span style="color:blue">scatter plot</span> of price by car condition and a <span style="color:blue">histogram</span> of the distance driven for the selected cars.
- **Recommendations List**: Users receive a final list of recommended cars based on their specified criteria.

This tool streamlines the car selection process, providing potential buyers with a curated list of vehicles to help them make informed purchasing decisions.

# Contents 

* [Data Overview](#data-overview)
* [Checking for Duplicates](#duplicates): apparent and implicit duplicates
* [Missing values](#missing-values):
  - [column 'paint_color'](#paint-color)
  - [column 'model_year'](#model-year)
  - [column 'cylinders'](#cylinders)
  - [column 'odometer'](#odometer)
  - [column 'is_4wd'](#is-4wd)
* [Removing Outliers](#outliers)
  - [outliers in 'price'](#outliers-price)
  - [outliers in 'model_year'](#outliers-model-year)
* [Creating Web App](#web-app)

<a id='data-overview'></a>
<div style="background-color: #d0e6a5; padding: 10px; border-radius: 5px;">
    <h2>Data Overview</h2>
</div>

In [1]:
import pandas as pd
data = pd.read_csv('vehicles_us.csv')
#from ydata_profiling import ProfileReport

In [2]:
# Check characteristics of the table: name and amount of columns and type of values
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 51525 entries, 0 to 51524
Data columns (total 13 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   price         51525 non-null  int64  
 1   model_year    47906 non-null  float64
 2   model         51525 non-null  object 
 3   condition     51525 non-null  object 
 4   cylinders     46265 non-null  float64
 5   fuel          51525 non-null  object 
 6   odometer      43633 non-null  float64
 7   transmission  51525 non-null  object 
 8   type          51525 non-null  object 
 9   paint_color   42258 non-null  object 
 10  is_4wd        25572 non-null  float64
 11  date_posted   51525 non-null  object 
 12  days_listed   51525 non-null  int64  
dtypes: float64(4), int64(2), object(7)
memory usage: 5.1+ MB


The column names are already in lowercase snake_case, so no renaming is needed.

In [3]:
# Check the quality of data in the columns
data.head(10)  

| | price | model_year | model | condition | cylinders | fuel | odometer | transmission | type | paint_color | is_4wd | date_posted | days_listed |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 9400 | 2011.0 | bmw x5 | good | 6.0 | gas | 145000.0 | automatic | SUV | NaN | 1.0 | 2018-06-23 | 19 |
| 1 | 25500 | NaN | ford f-150 | good | 6.0 | gas | 88705.0 | automatic | pickup | white | 1.0 | 2018-10-19 | 50 |
| 2 | 5500 | 2013.0 | hyundai sonata | like new | 4.0 | gas | 110000.0 | automatic | sedan | red | NaN | 2019-02-07 | 79 |
| 3 | 1500 | 2003.0 | ford f-150 | fair | 8.0 | gas | NaN | automatic | pickup | NaN | NaN | 2019-03-22 | 9 |
| 4 | 14900 | 2017.0 | chrysler 200 | excellent | 4.0 | gas | 80903.0 | automatic | sedan | black | NaN | 2019-04-02 | 28 |
| 5 | 14990 | 2014.0 | chrysler 300 | excellent | 6.0 | gas | 57954.0 | automatic | sedan | black | 1.0 | 2018-06-20 | 15 |
| 6 | 12990 | 2015.0 | toyota camry | excellent | 4.0 | gas | 79212.0 | automatic | sedan | white | NaN | 2018-12-27 | 73 |
| 7 | 15990 | 2013.0 | honda pilot | excellent | 6.0 | gas | 109473.0 | automatic | SUV | black | 1.0 | 2019-01-07 | 68 |
| 8 | 11500 | 2012.0 | kia sorento | excellent | 4.0 | gas | 104174.0 | automatic | SUV | NaN | 1.0 | 2018-07-16 | 19 |
| 9 | 9200 | 2008.0 | honda pilot | excellent | NaN | gas | 147191.0 | automatic | SUV | blue | 1.0 | 2019-02-15 | 17 |


<div style="background-color: #d0e6a5; padding: 10px; border-radius: 5px;">
    <h2>Data Overview: Conclusion</h2>
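    <p>The data is structured in 13 columns and 51,525 rows. Five columns (model_year, cylinders, odometer, paint_color, is_4wd) contain missing values.</p>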
</div>

<a id='duplicates'></a>

<div style="background-color: #d0e6a5; padding: 10px; border-radius: 5px;">
    <h2>Checking For Duplicates</h2>
</div>

In [68]:
# Check apparent duplicates in the table

print(data.duplicated().sum())
duplicates=data[data.duplicated()]
print(duplicates)

0
Empty DataFrame
Columns: [price, model_year, model, condition, cylinders, fuel, odometer, transmission, type, paint_color, is_4wd, date_posted, days_listed]
Index: []


In [5]:
# No apparent duplicates found.

In [6]:
# Check implicit duplicates in the table - column "model"

print(data['model'].sort_values().unique())

['acura tl' 'bmw x5' 'buick enclave' 'cadillac escalade'
 'chevrolet camaro' 'chevrolet camaro lt coupe 2d' 'chevrolet colorado'
 'chevrolet corvette' 'chevrolet cruze' 'chevrolet equinox'
 'chevrolet impala' 'chevrolet malibu' 'chevrolet silverado'
 'chevrolet silverado 1500' 'chevrolet silverado 1500 crew'
 'chevrolet silverado 2500hd' 'chevrolet silverado 3500hd'
 'chevrolet suburban' 'chevrolet tahoe' 'chevrolet trailblazer'
 'chevrolet traverse' 'chrysler 200' 'chrysler 300'
 'chrysler town & country' 'dodge charger' 'dodge dakota'
 'dodge grand caravan' 'ford econoline' 'ford edge' 'ford escape'
 'ford expedition' 'ford explorer' 'ford f-150' 'ford f-250'
 'ford f-250 sd' 'ford f-250 super duty' 'ford f-350 sd' 'ford f150'
 'ford f150 supercrew cab xlt' 'ford f250' 'ford f250 super duty'
 'ford f350' 'ford f350 super duty' 'ford focus' 'ford focus se'
 'ford fusion' 'ford fusion se' 'ford mustang' 'ford mustang gt coupe 2d'
 'ford ranger' 'ford taurus' 'gmc acadia' 'gmc sierra' '

In [7]:
# Loop through unique models to find those containing "ford f"

for model in data['model'].unique():
    if "ford f" in model.lower():
        print(model)

ford f-150
ford fusion se
ford focus
ford f150 supercrew cab xlt
ford f-250 sd
ford f250 super duty
ford f-350 sd
ford f-250
ford f150
ford f350 super duty
ford fusion
ford f-250 super duty
ford focus se
ford f250
ford f350


In [8]:
# Loop through unique models to find those containing "ford" and "150"

for model in data['model'].unique():
    if "ford" in model.lower() and "150" in model.lower():
        print(model)

ford f-150
ford f150 supercrew cab xlt
ford f150


In [9]:
# function for replacing implicit duplicates
wrong_ford_150 = ['ford f150','ford f150 supercrew cab xlt'] #list of duplicates
correct_ford_150 = 'ford f-150' #string - correct value

# Define the function with a condition for replacement
def replace_wrong_ford_150(wrong_ford_150, correct_ford_150):
    for pattern in wrong_ford_150:
        # Use regex replacement to target specific parts of the model names
        data['model'] = data['model'].str.replace(rf'\b{pattern}\b', correct_ford_150, regex=True)
        
replace_wrong_ford_150(wrong_ford_150, correct_ford_150)

In [10]:
# Check unique models, containing "ford" and "150"

for model in data['model'].unique():
    if "ford" in model.lower() and "150" in model.lower():
        print(model)

ford f-150
ford f-150 supercrew cab xlt


In [11]:
# Loop through unique models to find those containing "ford" and "250"

for model in data['model'].unique():
    if "ford" in model.lower() and "250" in model.lower():
        print(model)

ford f-250 sd
ford f250 super duty
ford f-250
ford f-250 super duty
ford f250


In [12]:
# function for replacing implicit duplicates
wrong_ford_250 = ['ford f250']  # misspelled prefix (hyphen missing)
correct_ford_250 = 'ford f-250' 

# Define the function with a condition for replacement
def replace_wrong_ford_250(wrong_ford_250, correct_ford_250):
    for pattern in wrong_ford_250:
        # Replace only the misspelled prefix so that a suffix such as 'super duty' is preserved
        data['model'] = data['model'].str.replace(rf'\b{pattern}\b', correct_ford_250, regex=True)
    # Replace 'super duty' with 'sd' (case-insensitive)
    data['model'] = data['model'].str.replace(r'(?i)super duty', 'sd', regex=True)

replace_wrong_ford_250(wrong_ford_250, correct_ford_250)

In [13]:
# Check unique models, containing "ford" and "250"

for model in data['model'].unique():
    if "ford" in model.lower() and "250" in model.lower():
        print(model)

ford f-250 sd
ford f-250


In [14]:
# Loop through unique models to find those containing "ford" and "350"

for model in data['model'].unique():
    if "ford" in model.lower() and "350" in model.lower():
        print(model)

ford f-350 sd
ford f350 sd
ford f350


In [15]:
# function for replacing implicit duplicates
wrong_ford_350 = ['ford f350']  # misspelled prefix (hyphen missing)
correct_ford_350 = 'ford f-350' 

# Define the function with a condition for replacement
def replace_wrong_ford_350(wrong_ford_350, correct_ford_350):
    for pattern in wrong_ford_350:
        # Replace only the misspelled prefix so that a suffix such as 'sd' is preserved
        data['model'] = data['model'].str.replace(rf'\b{pattern}\b', correct_ford_350, regex=True)

replace_wrong_ford_350(wrong_ford_350, correct_ford_350)

In [16]:
# Check unique models, containing "ford" and "350"

for model in data['model'].unique():
    if "ford" in model.lower() and "350" in model.lower():
        print(model)

ford f-350 sd
ford f-350


In [17]:
print(data['model'].sort_values().unique())

['acura tl' 'bmw x5' 'buick enclave' 'cadillac escalade'
 'chevrolet camaro' 'chevrolet camaro lt coupe 2d' 'chevrolet colorado'
 'chevrolet corvette' 'chevrolet cruze' 'chevrolet equinox'
 'chevrolet impala' 'chevrolet malibu' 'chevrolet silverado'
 'chevrolet silverado 1500' 'chevrolet silverado 1500 crew'
 'chevrolet silverado 2500hd' 'chevrolet silverado 3500hd'
 'chevrolet suburban' 'chevrolet tahoe' 'chevrolet trailblazer'
 'chevrolet traverse' 'chrysler 200' 'chrysler 300'
 'chrysler town & country' 'dodge charger' 'dodge dakota'
 'dodge grand caravan' 'ford econoline' 'ford edge' 'ford escape'
 'ford expedition' 'ford explorer' 'ford f-150'
 'ford f-150 supercrew cab xlt' 'ford f-250' 'ford f-250 sd' 'ford f-350'
 'ford f-350 sd' 'ford focus' 'ford focus se' 'ford fusion'
 'ford fusion se' 'ford mustang' 'ford mustang gt coupe 2d' 'ford ranger'
 'ford taurus' 'gmc acadia' 'gmc sierra' 'gmc sierra 1500'
 'gmc sierra 2500hd' 'gmc yukon' 'honda accord' 'honda civic'
 'honda civic 

<div style="background-color: #d0e6a5; padding: 10px; border-radius: 5px;">
    <h3>Duplicates: Conclusion</h3>
    <p>No apparent duplicates were found. <br> 
Implicit duplicates in "Ford" model names were identified and standardized by unifying "f" models with a hyphen between "f" and the model number (e.g., "f-150").</p>
</div>
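
The three near-identical replacement functions above could be collapsed into one helper. The following is a minimal sketch, not the notebook's actual method: the name `normalize_ford_models` and the sample values are hypothetical, and it assumes every F-series model number has exactly three digits.

```python
import pandas as pd

def normalize_ford_models(models: pd.Series) -> pd.Series:
    """Hyphenate Ford F-series names and abbreviate 'super duty' to 'sd'."""
    # "ford f150" -> "ford f-150"; names that already contain the hyphen do not match
    out = models.str.replace(r'\bford f(\d{3})\b', r'ford f-\1', regex=True)
    # "super duty" -> "sd", matched case-insensitively
    return out.str.replace(r'(?i)\bsuper duty\b', 'sd', regex=True)

# Hypothetical sample covering the spelling variants seen in the 'model' column
sample = pd.Series(['ford f150', 'ford f-250 super duty', 'ford f350 sd', 'ford focus'])
normalized = normalize_ford_models(sample)
print(normalized.tolist())
```

Because only the `f<number>` token is rewritten, any trim suffix (such as `sd` or `supercrew cab xlt`) stays attached to the model name.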

## YDATA REPORT

In [18]:
#Auxiliary Data Analysis
#data.profile_report()

<a id='missing-values'></a>
<div style="background-color: #d0e6a5; padding: 10px; border-radius: 5px;">
    <h2>Missing Values</h2>
</div>

In [19]:
# Check missing values

data.isna().sum()

price               0
model_year       3619
model               0
condition           0
cylinders        5260
fuel                0
odometer         7892
transmission        0
type                0
paint_color      9267
is_4wd          25953
date_posted         0
days_listed         0
dtype: int64

### Columns that contain missing values:
- model_year
- cylinders
- odometer
- paint_color
- is_4wd

We note a high correlation between:
- model year and odometer (distance the car has driven)
- model year and price

<a id='paint-color'></a>
<div style="background-color: pink; padding: 10px; border-radius: 5px;">
    <h3>Column 'paint_color'</h3>
</div>

In [20]:
# Number of missing values in column 'paint_color'

print(data['paint_color'].isna().sum())

9267


In [21]:
# Replace missing values in column 'paint_color' by 'unknown'

data['paint_color']=data['paint_color'].fillna('unknown')
print(data['paint_color'].unique())

['unknown' 'white' 'red' 'black' 'blue' 'grey' 'silver' 'custom' 'orange'
 'yellow' 'brown' 'green' 'purple']


<a id='model-year'></a>
<div style="background-color: pink; padding: 10px; border-radius: 5px;">
    <h3>Column 'model_year'</h3>
</div>

In [22]:
# Check number of missing values in the column 'model_year'

print(data['model_year'].isna().sum())

3619


In [23]:
# Calculate mean 'model_year' for each 'model'

mean_model_year = data.groupby('model')['model_year'].mean().round(1)
print(mean_model_year)

model
acura tl             2007.5
bmw x5               2009.0
buick enclave        2012.2
cadillac escalade    2008.5
chevrolet camaro     2008.4
                      ...  
toyota sienna        2008.6
toyota tacoma        2009.3
toyota tundra        2009.4
volkswagen jetta     2010.8
volkswagen passat    2011.3
Name: model_year, Length: 95, dtype: float64


In [24]:
# Create custom function to replace missing 'model_year' by the mean value of the corresponding 'model'

def fill_missing_model_year(row):
    if pd.isna(row['model_year']):
        return mean_model_year.get(row['model'],row['model_year'])
    else:
        return row['model_year']

data['model_year']=data.apply(fill_missing_model_year, axis=1)
print(data['model_year'].isna().sum())

0
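
The row-wise `apply` above works, but the same per-model fill can be written without a Python-level loop. The sketch below is one possible vectorized variant under stated assumptions: the helper name `fill_by_group_mean` and the toy frame are hypothetical, and the fallback to the overall mean for models with no known values is an addition, not something this cell does.

```python
import pandas as pd

def fill_by_group_mean(df, group_col, value_col):
    """Fill NaNs in value_col with the rounded mean of the same group_col."""
    # transform() returns one group mean per row, aligned with the original index
    group_mean = df.groupby(group_col)[value_col].transform('mean').round(1)
    # Assumed fallback: use the overall mean if a whole group has no known values
    return df[value_col].fillna(group_mean).fillna(df[value_col].mean())

# Hypothetical toy data: two models, each with one missing year
toy = pd.DataFrame({
    'model': ['a', 'a', 'a', 'b', 'b'],
    'model_year': [2010.0, 2012.0, None, 2005.0, None],
})
toy['model_year'] = fill_by_group_mean(toy, 'model', 'model_year')
print(toy)
```

The same helper would apply unchanged to the 'odometer' column, since both fills follow the per-model pattern.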


<a id='cylinders'></a>
<div style="background-color: pink; padding: 10px; border-radius: 5px;">
    <h3>Column 'cylinders'</h3>
</div>

In [25]:
print(data['cylinders'].isna().sum())
print(data['cylinders'].unique())

5260
[ 6.  4.  8. nan  5. 10.  3. 12.]


In [66]:
# Replace missing values in column 'cylinders' by '0.0'

data['cylinders']=data['cylinders'].fillna(0.0)
print(data['cylinders'].isna().sum())
print(data['cylinders'].unique())

0
[ 6.  4.  8.  0.  5. 10.  3. 12.]


<a id='odometer'></a>
<div style="background-color: pink; padding: 10px; border-radius: 5px;">
    <h3>Column 'odometer'</h3>
</div>

In [27]:
# Calculate number of missing values in 'odometer' column

print(data['odometer'].isna().sum())

7892


In [28]:
# Calculate mean distance driven by car by 'model'

mean_model_distance = data.groupby('model')['odometer'].mean().round(1)
print(mean_model_distance)

model
acura tl             142760.4
bmw x5               113210.1
buick enclave        113459.5
cadillac escalade    123616.6
chevrolet camaro      71068.0
                       ...   
toyota sienna        136911.1
toyota tacoma        126521.3
toyota tundra        123271.2
volkswagen jetta     107870.0
volkswagen passat     90764.4
Name: odometer, Length: 95, dtype: float64


In [29]:
# Create custom function to replace missing 'odometer' by the mean value of the corresponding 'model'

def fill_missing_odometer(row):
    if pd.isna(row['odometer']):
        return mean_model_distance.get(row['model'],row['odometer'])
    else:
        return row['odometer']

data['odometer']=data.apply(fill_missing_odometer, axis=1)
still_missing_odometer=data['odometer'].isna()
print(still_missing_odometer)

0        False
1        False
2        False
3        False
4        False
         ...  
51520    False
51521    False
51522    False
51523    False
51524    False
Name: odometer, Length: 51525, dtype: bool


In [30]:
odometer_mean=data['odometer'].mean().round(1)
print(odometer_mean)

115518.0


In [31]:
data.loc[still_missing_odometer, 'odometer'] = odometer_mean
print(data['odometer'].isna().sum())

0


<a id='is-4wd'></a>
<div style="background-color: pink; padding: 10px; border-radius: 5px;">
    <h3>Column 'is_4wd'</h3>
</div>

In [32]:
print(data['is_4wd'].isna().sum())
print(data['is_4wd'].isnull().sum())
print(data['is_4wd'].unique())

25953
25953
[ 1. nan]


In [33]:
print(data.groupby('type')['is_4wd'].sum())

type
SUV            8853.0
bus               0.0
convertible      53.0
coupe            76.0
hatchback       160.0
mini-van         39.0
offroad         206.0
other           126.0
pickup         5026.0
sedan           563.0
truck          9357.0
van              40.0
wagon          1073.0
Name: is_4wd, dtype: float64


In [34]:
# Create a variable for the missing values in column 'is_4wd'

missing_4wd = data['is_4wd'].isna()
print(missing_4wd)

0        False
1        False
2         True
3         True
4         True
         ...  
51520     True
51521     True
51522     True
51523     True
51524     True
Name: is_4wd, Length: 51525, dtype: bool


In [35]:
# List all unique 'model' types

print(data['model'].sort_values().unique())

['acura tl' 'bmw x5' 'buick enclave' 'cadillac escalade'
 'chevrolet camaro' 'chevrolet camaro lt coupe 2d' 'chevrolet colorado'
 'chevrolet corvette' 'chevrolet cruze' 'chevrolet equinox'
 'chevrolet impala' 'chevrolet malibu' 'chevrolet silverado'
 'chevrolet silverado 1500' 'chevrolet silverado 1500 crew'
 'chevrolet silverado 2500hd' 'chevrolet silverado 3500hd'
 'chevrolet suburban' 'chevrolet tahoe' 'chevrolet trailblazer'
 'chevrolet traverse' 'chrysler 200' 'chrysler 300'
 'chrysler town & country' 'dodge charger' 'dodge dakota'
 'dodge grand caravan' 'ford econoline' 'ford edge' 'ford escape'
 'ford expedition' 'ford explorer' 'ford f-150'
 'ford f-150 supercrew cab xlt' 'ford f-250' 'ford f-250 sd' 'ford f-350'
 'ford f-350 sd' 'ford focus' 'ford focus se' 'ford fusion'
 'ford fusion se' 'ford mustang' 'ford mustang gt coupe 2d' 'ford ranger'
 'ford taurus' 'gmc acadia' 'gmc sierra' 'gmc sierra 1500'
 'gmc sierra 2500hd' 'gmc yukon' 'honda accord' 'honda civic'
 'honda civic 

In [36]:
# Create a list of models that are available in a 4x4 version (i.e. 'is_4wd' = 1)

model_4x4=['bmw x5', 'buick enclave', 'cadillac escalade', 'chevrolet colorado', 'chevrolet equinox', 'chevrolet silverado', 
           'chevrolet silverado 1500', 'chevrolet silverado 2500hd', 'chevrolet silverado 3500hd', 'chevrolet suburban', 
           'chevrolet tahoe', 'chevrolet trailblazer', 'chevrolet traverse', 'dodge dakota', 'ford expedition', 'ford explorer', 
           'ford f-150', 'ford f-250', 'ford f-250 sd', 'ford f-350', 'ford f-350 sd', 'gmc acadia', 'gmc sierra', 'gmc sierra 1500', 
           'gmc sierra 2500hd', 'gmc yukon', 'honda cr-v', 'honda pilot', 'hyundai santa fe', 'jeep cherokee', 'jeep grand cherokee', 
           'jeep grand cherokee laredo', 'jeep liberty', 'jeep wrangler', 'jeep wrangler unlimited', 'kia sorento', 'nissan frontier', 
           'nissan frontier crew cab sv', 'nissan murano', 'nissan rogue', 'ram 1500', 'ram 2500', 'ram 3500', 'subaru forester', 
           'subaru impreza', 'subaru outback', 'toyota 4runner', 'toyota highlander', 'toyota rav4', 'toyota tacoma', 'toyota tundra']


In [37]:
# Create a loop through the object with the missing values of 'is_4wd' column to check, if the model is in the list of 'model_4x4'.
# If yes, then replace the missing value by 1, else 0

for index, row in data[missing_4wd].iterrows():
    if row['model'] in model_4x4:
        data.at[index, 'is_4wd'] = 1 
    else:
        data.at[index, 'is_4wd'] = 0

print(data[['model', 'is_4wd']].head())

            model  is_4wd
0          bmw x5     1.0
1      ford f-150     1.0
2  hyundai sonata     0.0
3      ford f-150     1.0
4    chrysler 200     0.0
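
The `iterrows` loop above visits each missing row one at a time. As a sketch of an equivalent vectorized form, the membership test can be done once with `isin`. The toy frame and the shortened list `models_with_4wd` are hypothetical stand-ins for `data` and `model_4x4`.

```python
import pandas as pd

# Hypothetical toy data; in the notebook this would be `data`
toy = pd.DataFrame({
    'model': ['bmw x5', 'hyundai sonata', 'ford f-150', 'chrysler 200'],
    'is_4wd': [1.0, None, None, None],
})
models_with_4wd = ['bmw x5', 'ford f-150']  # shortened stand-in for model_4x4

# Boolean mask of the rows that need filling
missing = toy['is_4wd'].isna()
# isin() gives True/False per row; astype(float) turns that into 1.0/0.0
toy.loc[missing, 'is_4wd'] = toy.loc[missing, 'model'].isin(models_with_4wd).astype(float)
print(toy)
```

Existing values are left untouched because only the rows selected by `missing` are assigned.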


In [38]:
# Check number of missing values in column 'is_4wd'

print(data['is_4wd'].isna().sum())

0


In [39]:
data.isna().sum()

price           0
model_year      0
model           0
condition       0
cylinders       0
fuel            0
odometer        0
transmission    0
type            0
paint_color     0
is_4wd          0
date_posted     0
days_listed     0
dtype: int64

<div style="background-color: #d0e6a5; padding: 10px; border-radius: 5px;">
    <h3>Missing Values: Conclusion</h3>
    <p>
Missing values in column 'paint_color' were replaced by "unknown".<br>
Missing values in column 'model_year' were replaced by the mean model year of the corresponding model.<br>
Missing values in column 'cylinders' were replaced by "0.0".<br>
Missing values in column 'odometer' were replaced by the mean distance of the corresponding model, or by the overall mean where no model mean was available.<br>
Missing values in column 'is_4wd' were filled in using a manually compiled list of models that are available in a 4x4 version: 1 for models on the list, 0 otherwise.<br>
</p>
</div>

<a id='outliers'></a>
<div style="background-color: #d0e6a5; padding: 10px; border-radius: 5px;">
    <h2>Removing Outliers</h2>
</div>

<a id='outliers-price'></a>
<div style="background-color: pink; padding: 10px; border-radius: 5px;">
    <h3>Outliers in 'price'</h3>
</div>

In [40]:
# Find price borders for the outliers

q1 = data['price'].quantile(0.01)
q99 = data['price'].quantile(0.99)
print(q1)
print(q99)

1.0
43995.0


In [41]:
# Check the number of rows with an extremely high price (>= 100000)

outliers_max=data[(data['price'] >= 100000)] 
outliers_max.shape

(17, 13)

In [42]:
# Check the number of rows with an extremely low price (<= 1)

outliers_min=data[(data['price'] <= 1)] 
outliers_min.shape

(798, 13)

In [43]:
print(data['price'].isna().sum())

0


In [44]:
# Remove outliers in price: keep only the rows with a price between 1 and 100000 (inclusive)

data = data[(data['price'] >= 1) & (data['price'] <= 100000)]
print(data)

       price  model_year           model  condition  cylinders fuel  odometer  \
0       9400      2011.0          bmw x5       good        6.0  gas  145000.0   
1      25500      2009.3      ford f-150       good        6.0  gas   88705.0   
2       5500      2013.0  hyundai sonata   like new        4.0  gas  110000.0   
3       1500      2003.0      ford f-150       fair        8.0  gas  124194.1   
4      14900      2017.0    chrysler 200  excellent        4.0  gas   80903.0   
...      ...         ...             ...        ...        ...  ...       ...   
51520   9249      2013.0   nissan maxima   like new        6.0  gas   88136.0   
51521   2700      2002.0     honda civic    salvage        4.0  gas  181500.0   
51522   3950      2009.0  hyundai sonata  excellent        4.0  gas  128000.0   
51523   7455      2013.0  toyota corolla       good        4.0  gas  139573.0   
51524   6300      2014.0   nissan altima       good        4.0  gas  105780.8   

      transmission    type 

In [45]:
# Check if the outliers are removed

outliers_1=data['price'] < 1
outliers_100000=data['price'] > 100000

print(outliers_1.sum())
print(outliers_100000.sum())

0
0


<a id='outliers-model-year'></a>
<div style="background-color: pink; padding: 10px; border-radius: 5px;">
    <h3>Outliers in 'model_year'</h3>
</div> 

In [46]:
# Find model_year borders for the outliers

q1 = data['model_year'].quantile(0.01)
q99 = data['model_year'].quantile(0.99)
print(q1)
print(q99)

1991.0
2018.0


In [47]:
# Check the amount of rows with extremely old cars (<= 1970)

outliers_old_cars=data[(data['model_year'] <= 1970)] 
outliers_old_cars.shape

(103, 13)

In [48]:
# Remove outliers in model_year: keep only the rows with a model year of 1970 or later

data=data[(data['model_year'] >= 1970)] 
print(data)

       price  model_year           model  condition  cylinders fuel  odometer  \
0       9400      2011.0          bmw x5       good        6.0  gas  145000.0   
1      25500      2009.3      ford f-150       good        6.0  gas   88705.0   
2       5500      2013.0  hyundai sonata   like new        4.0  gas  110000.0   
3       1500      2003.0      ford f-150       fair        8.0  gas  124194.1   
4      14900      2017.0    chrysler 200  excellent        4.0  gas   80903.0   
...      ...         ...             ...        ...        ...  ...       ...   
51520   9249      2013.0   nissan maxima   like new        6.0  gas   88136.0   
51521   2700      2002.0     honda civic    salvage        4.0  gas  181500.0   
51522   3950      2009.0  hyundai sonata  excellent        4.0  gas  128000.0   
51523   7455      2013.0  toyota corolla       good        4.0  gas  139573.0   
51524   6300      2014.0   nissan altima       good        4.0  gas  105780.8   

      transmission    type 

In [49]:
# Check if the rows with extremely old cars (< 1970) were removed from data

outliers_1970=data['model_year'] < 1970
print(outliers_1970.sum())

0


<div style="background-color: #d0e6a5; padding: 10px; border-radius: 5px;">
    <h2>Removing Outliers: Conclusion</h2>
<p> Outliers in 'price' outside the range of 1 to 100,000 were identified and removed. <br>
Outliers in 'model_year' before 1970 were also removed.
</p>
</div>
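
Both removal steps follow the same pattern: keep the rows inside an inclusive range, then check what was dropped. A minimal sketch of that pattern as a reusable helper is shown below. The name `keep_in_range` and the toy prices are hypothetical; the bounds 1 and 100000 are the ones used above.

```python
import pandas as pd

def keep_in_range(df, column, lower, upper):
    """Return the rows with lower <= column <= upper and the number of rows removed."""
    mask = df[column].between(lower, upper)  # inclusive on both ends by default
    return df[mask], int((~mask).sum())

# Hypothetical toy prices, including values on and outside the bounds
toy = pd.DataFrame({'price': [0, 1, 9400, 100000, 375000]})
kept, removed = keep_in_range(toy, 'price', 1, 100000)
print(kept['price'].tolist(), removed)
```

Returning the removed count alongside the filtered frame makes it easy to report how much data each step discards.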

In [50]:
data.isna().sum()

price           0
model_year      0
model           0
condition       0
cylinders       0
fuel            0
odometer        0
transmission    0
type            0
paint_color     0
is_4wd          0
date_posted     0
days_listed     0
dtype: int64

<a id='web-app'></a>
<div style="background-color: #d0e6a5; padding: 10px; border-radius: 5px;">
    <h2>Creating Web App</h2>
</div>

In [51]:
import urllib.request
import pandas as pd
import streamlit as st
import plotly.express as px
from PIL import Image

In [52]:
st.subheader('Use this app to select a car based on the preferred parameters.')


DeltaGenerator()

In [54]:
urllib.request.urlretrieve(
    'https://www.linearity.io/blog/content/images/2023/06/how-to-create-a-car-NewBlogCover.png',
    "gfg.png")

('gfg.png', <http.client.HTTPMessage at 0x12c013b90>)

In [55]:
img = Image.open("gfg.png")
st.image(img)
img.show()



In [56]:
st.caption(':red[Choose your parameters here]')



DeltaGenerator()

In [60]:
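# Set the slider bounds explicitly to cover the cleaned price range (1 to 100000)
price_range = st.slider(
     "What is your price range?",
     min_value=1,
     max_value=100000,
     value=(2, 100000))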



In [59]:
# Inclusive price bounds chosen with the slider
min_price, max_price = price_range

In [61]:
excellent_condition = st.checkbox('Only excellent condition, like new or new')

# Keep the cars within the selected price range
filtered_data = data[data['price'].between(min_price, max_price)]

# Optionally keep only the cars in the best conditions
if excellent_condition:
    filtered_data = filtered_data[filtered_data['condition'].isin(['excellent', 'new', 'like new'])]



In [62]:
st.write('Here are your options with a split by price and condition')



In [63]:
fig = px.scatter(filtered_data, x="price", y="condition")           
st.plotly_chart(fig)



DeltaGenerator()

In [64]:
st.write('Distribution of distance driven for the filtered cars')
fig2 = px.histogram(filtered_data, x="odometer")
st.plotly_chart(fig2)



DeltaGenerator()

In [65]:
st.write('Here is the list of recommended cars')
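# Show up to 40 random cars; sample() raises an error if asked for more rows than exist
st.dataframe(filtered_data.sample(min(40, len(filtered_data))))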



DeltaGenerator()
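
The filtering and sampling logic of the app can be kept separate from the Streamlit widgets so that it can be checked without running the app. The following is a sketch under stated assumptions, not the deployed code: the helper names `filter_cars` and `sample_up_to`, the constant `TOP_CONDITIONS`, and the toy listings are all hypothetical.

```python
import pandas as pd

TOP_CONDITIONS = ['excellent', 'new', 'like new']

def filter_cars(df, price_range, only_top_condition=False):
    """Keep cars priced within price_range (inclusive), optionally only top conditions."""
    low, high = price_range
    result = df[df['price'].between(low, high)]
    if only_top_condition:
        result = result[result['condition'].isin(TOP_CONDITIONS)]
    return result

def sample_up_to(df, n=40, seed=None):
    """Return at most n random rows; never ask sample() for more rows than exist."""
    return df.sample(min(n, len(df)), random_state=seed)

# Hypothetical toy listings
toy = pd.DataFrame({
    'price': [1500, 9400, 14900, 25500],
    'condition': ['fair', 'good', 'excellent', 'good'],
})
selected = filter_cars(toy, (5000, 20000), only_top_condition=True)
recommended = sample_up_to(selected, n=40)
print(recommended)
```

In the app, the tuple returned by `st.slider` and the boolean returned by `st.checkbox` would be passed straight into `filter_cars`.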