# Exploratory Data Analysis

## Goal
- Check data integrity (NaN, duplicates)
- Experiment with grouping and filtering data
- Experiment with different charts

In [1]:
pip install streamlit

Defaulting to user installation because normal site-packages is not writeable
Note: you may need to restart the kernel to use updated packages.


In [2]:
# Import libraries
import streamlit as st
import pandas as pd
import plotly.express as px  # the standalone plotly_express package is a legacy wrapper

st.write("""
### Sergey Medvedev Sprint 6 Project
""")

st.write("""
#### Table of data
""")

df = pd.read_csv('/Users/sergeymedvedev/Downloads/vehicles_us.csv')

2023-10-04 12:54:47.705 
  command:

    streamlit run /Users/sergeymedvedev/Library/Python/3.9/lib/python/site-packages/ipykernel_launcher.py [ARGUMENTS]


In [3]:
print(df.head(5))

   price  model_year           model  condition  cylinders fuel  odometer  \
0   9400      2011.0          bmw x5       good        6.0  gas  145000.0   
1  25500         NaN      ford f-150       good        6.0  gas   88705.0   
2   5500      2013.0  hyundai sonata   like new        4.0  gas  110000.0   
3   1500      2003.0      ford f-150       fair        8.0  gas       NaN   
4  14900      2017.0    chrysler 200  excellent        4.0  gas   80903.0   

  transmission    type paint_color  is_4wd date_posted  days_listed  
0    automatic     SUV         NaN     1.0  2018-06-23           19  
1    automatic  pickup       white     1.0  2018-10-19           50  
2    automatic   sedan         red     NaN  2019-02-07           79  
3    automatic  pickup         NaN     NaN  2019-03-22            9  
4    automatic   sedan       black     NaN  2019-04-02           28  


In [4]:
# obtaining general information about the data in df
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 51525 entries, 0 to 51524
Data columns (total 13 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   price         51525 non-null  int64  
 1   model_year    47906 non-null  float64
 2   model         51525 non-null  object 
 3   condition     51525 non-null  object 
 4   cylinders     46265 non-null  float64
 5   fuel          51525 non-null  object 
 6   odometer      43633 non-null  float64
 7   transmission  51525 non-null  object 
 8   type          51525 non-null  object 
 9   paint_color   42258 non-null  object 
 10  is_4wd        25572 non-null  float64
 11  date_posted   51525 non-null  object 
 12  days_listed   51525 non-null  int64  
dtypes: float64(4), int64(2), object(7)
memory usage: 5.1+ MB


In [5]:
#Calculating missing values
print(df.isna().sum())

price               0
model_year       3619
model               0
condition           0
cylinders        5260
fuel                0
odometer         7892
transmission        0
type                0
paint_color      9267
is_4wd          25953
date_posted         0
days_listed         0
dtype: int64


Five columns contain missing values: 'model_year', 'cylinders', 'odometer', 'paint_color' and 'is_4wd'. Not all of them matter equally for this analysis, but 'model_year' is important in my opinion, so its missing values are estimated rather than replaced with a placeholder.

In [6]:
# Estimate missing 'model_year' values with the median year of vehicles that share the same 'model'
df['model_year'] = df['model_year'].fillna(df.groupby(['model'])['model_year'].transform('median'))
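
To make the group-wise median fill easier to follow, here is a minimal sketch on a toy frame. The values are hypothetical and are not taken from vehicles_us.csv. One caveat is worth keeping in mind: a model whose years are all missing would stay NaN after this step.

```python
import pandas as pd

# Toy data (hypothetical values, for illustration only)
toy = pd.DataFrame({
    "model": ["ford f-150", "ford f-150", "ford f-150", "bmw x5", "bmw x5"],
    "model_year": [2003.0, None, 2013.0, 2011.0, None],
})

# transform('median') returns a Series aligned with toy's index that holds
# the median year of each row's model (missing values are skipped)
median_by_model = toy.groupby("model")["model_year"].transform("median")

# fillna only changes the rows where model_year is missing
toy["model_year"] = toy["model_year"].fillna(median_by_model)
print(toy)
```

Because `transform` keeps the original row order and length, its result can be passed straight to `fillna` without a merge.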

In [7]:
# Identify the columns whose missing values need to be replaced
columns_to_replace = ['cylinders' , 'odometer', 'is_4wd']
columns_to_replace

['cylinders', 'odometer', 'is_4wd']

In [8]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 51525 entries, 0 to 51524
Data columns (total 13 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   price         51525 non-null  int64  
 1   model_year    51525 non-null  float64
 2   model         51525 non-null  object 
 3   condition     51525 non-null  object 
 4   cylinders     46265 non-null  float64
 5   fuel          51525 non-null  object 
 6   odometer      43633 non-null  float64
 7   transmission  51525 non-null  object 
 8   type          51525 non-null  object 
 9   paint_color   42258 non-null  object 
 10  is_4wd        25572 non-null  float64
 11  date_posted   51525 non-null  object 
 12  days_listed   51525 non-null  int64  
dtypes: float64(4), int64(2), object(7)
memory usage: 5.1+ MB


In [9]:
#Looping over columns replacing missing values with 0
for column in columns_to_replace:
    print(column)
    df[column] = df[column].fillna(0)
    print('missing values in ', column, 'are replaced')

cylinders
missing values in  cylinders are replaced
odometer
missing values in  odometer are replaced
is_4wd
missing values in  is_4wd are replaced


In [10]:
#Identifying one extra qualitative column to replace 
another_column_to_replace = ['paint_color']
another_column_to_replace

['paint_color']

In [11]:
#Making another loop to replace that column 
for column in another_column_to_replace:
    print(column)
    df[column] = df[column].fillna('unknown')
    print('missing values in ', column, 'are replaced')

paint_color
missing values in  paint_color are replaced
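
The two loops above could also be written as a single `fillna` call that takes a column-to-value mapping. This is only a sketch on a hypothetical toy frame, not a change to the notebook's approach. Note that a 0 in 'odometer' or 'cylinders' is a placeholder and will pull down any averages computed on those columns later.

```python
import pandas as pd

# Toy data (hypothetical values, for illustration only)
toy = pd.DataFrame({
    "cylinders": [6.0, None, 4.0],
    "odometer": [145000.0, 88705.0, None],
    "is_4wd": [1.0, None, None],
    "paint_color": [None, "white", "red"],
})

# One replacement value per column; columns not listed are left untouched
fill_values = {"cylinders": 0, "odometer": 0, "is_4wd": 0, "paint_color": "unknown"}
toy = toy.fillna(value=fill_values)
print(toy)
```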


In [12]:
# Check that no NaNs remain after the replacements
print(df.isna().sum())

price           0
model_year      0
model           0
condition       0
cylinders       0
fuel            0
odometer        0
transmission    0
type            0
paint_color     0
is_4wd          0
date_posted     0
days_listed     0
dtype: int64


In [13]:
df

Unnamed: 0,price,model_year,model,condition,cylinders,fuel,odometer,transmission,type,paint_color,is_4wd,date_posted,days_listed
0,9400,2011.0,bmw x5,good,6.0,gas,145000.0,automatic,SUV,unknown,1.0,2018-06-23,19
1,25500,2011.0,ford f-150,good,6.0,gas,88705.0,automatic,pickup,white,1.0,2018-10-19,50
2,5500,2013.0,hyundai sonata,like new,4.0,gas,110000.0,automatic,sedan,red,0.0,2019-02-07,79
3,1500,2003.0,ford f-150,fair,8.0,gas,0.0,automatic,pickup,unknown,0.0,2019-03-22,9
4,14900,2017.0,chrysler 200,excellent,4.0,gas,80903.0,automatic,sedan,black,0.0,2019-04-02,28
...,...,...,...,...,...,...,...,...,...,...,...,...,...
51520,9249,2013.0,nissan maxima,like new,6.0,gas,88136.0,automatic,sedan,black,0.0,2018-10-03,37
51521,2700,2002.0,honda civic,salvage,4.0,gas,181500.0,automatic,sedan,white,0.0,2018-11-14,22
51522,3950,2009.0,hyundai sonata,excellent,4.0,gas,128000.0,automatic,sedan,blue,0.0,2018-11-15,32
51523,7455,2013.0,toyota corolla,good,4.0,gas,139573.0,automatic,sedan,black,0.0,2018-07-02,71


In [14]:
# counting duplicates
df.duplicated().sum()

0
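
A result of 0 here only rules out rows that are identical in every column. As a hedged sketch on hypothetical toy rows, `duplicated` also accepts a `subset` argument to look for repeats on selected columns. The check is run before the 'id' column is added, which matters because a unique id would make every row distinct.

```python
import pandas as pd

# Toy data (hypothetical values, for illustration only)
toy = pd.DataFrame({
    "model": ["bmw x5", "bmw x5", "ford f-150"],
    "price": [9400, 9400, 25500],
    "date_posted": ["2018-06-23", "2018-07-01", "2018-10-19"],
})

# Rows identical in every column
full_duplicates = int(toy.duplicated().sum())

# Rows that repeat an earlier row on the chosen columns only
model_price_duplicates = int(toy.duplicated(subset=["model", "price"]).sum())

print(full_duplicates, model_price_duplicates)
```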

In [15]:
# Insert an 'id' column (0 .. len(df) - 1) to count listings when grouping and filtering
df.insert(0, 'id', range(len(df)))
df

Unnamed: 0,id,price,model_year,model,condition,cylinders,fuel,odometer,transmission,type,paint_color,is_4wd,date_posted,days_listed
0,0,9400,2011.0,bmw x5,good,6.0,gas,145000.0,automatic,SUV,unknown,1.0,2018-06-23,19
1,1,25500,2011.0,ford f-150,good,6.0,gas,88705.0,automatic,pickup,white,1.0,2018-10-19,50
2,2,5500,2013.0,hyundai sonata,like new,4.0,gas,110000.0,automatic,sedan,red,0.0,2019-02-07,79
3,3,1500,2003.0,ford f-150,fair,8.0,gas,0.0,automatic,pickup,unknown,0.0,2019-03-22,9
4,4,14900,2017.0,chrysler 200,excellent,4.0,gas,80903.0,automatic,sedan,black,0.0,2019-04-02,28
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
51520,51520,9249,2013.0,nissan maxima,like new,6.0,gas,88136.0,automatic,sedan,black,0.0,2018-10-03,37
51521,51521,2700,2002.0,honda civic,salvage,4.0,gas,181500.0,automatic,sedan,white,0.0,2018-11-14,22
51522,51522,3950,2009.0,hyundai sonata,excellent,4.0,gas,128000.0,automatic,sedan,blue,0.0,2018-11-15,32
51523,51523,7455,2013.0,toyota corolla,good,4.0,gas,139573.0,automatic,sedan,black,0.0,2018-07-02,71


In [16]:
# Table that shows the number of listings for each vehicle type
st.write("""
### Proportion of different types of vehicles
""")
grouped_cars = df.groupby('type')['id'].nunique().reset_index()
st.table(grouped_cars)
grouped_cars


Unnamed: 0,type,id
0,SUV,12405
1,bus,24
2,convertible,446
3,coupe,2303
4,hatchback,1047
5,mini-van,1161
6,offroad,214
7,other,256
8,pickup,6988
9,sedan,12154
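
Counting unique ids per type is the same as counting rows per type, so the count can also be obtained without a helper column. A minimal sketch on a hypothetical toy frame, using `groupby(...).size()`:

```python
import pandas as pd

# Toy data (hypothetical values, for illustration only)
toy = pd.DataFrame({"type": ["SUV", "sedan", "SUV", "pickup", "sedan", "SUV"]})

# size() counts the rows in each group; reset_index(name=...) names the count column
counts = toy.groupby("type").size().reset_index(name="listings")
print(counts)
```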


In [17]:
# Number of distinct models per body type, sorted in descending order
grouped_bodystyle = df.groupby('type')['model'].nunique().reset_index()
grouped_bodystyle_sorted = grouped_bodystyle.sort_values(by='model', ascending=False)
grouped_bodystyle_sorted

Unnamed: 0,type,model
0,SUV,67
9,sedan,64
10,truck,59
7,other,57
12,wagon,54
8,pickup,41
4,hatchback,32
3,coupe,31
6,offroad,23
11,van,23


In [18]:
# Pie chart: proportions of car types
pie = px.pie(grouped_cars, values='id', names='type')
pie.update_layout(title="<b>Proportions of car types</b>")
st.plotly_chart(pie)
pie

As the pie chart shows, the top 3 vehicle types are:
- SUV (24.1%)
- Truck (24.0%)
- Sedan (23.6%)
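
The percentages above can be checked by hand from the listing counts reported in this notebook (SUV 12405, truck 12353, sedan 12154) and the 51525 rows shown by `df.info()`. A small sketch in plain Python:

```python
# Listing counts and total row count as reported elsewhere in this notebook
counts = {"SUV": 12405, "truck": 12353, "sedan": 12154}
total_listings = 51525

# Share of all listings, in percent, rounded to one decimal place
shares = {name: round(100 * n / total_listings, 1) for name, n in counts.items()}
print(shares)
```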

In [19]:
# Average price of a vehicle by type
average_cost_by_type=df.groupby(['type'])['price'].mean().sort_values(ascending=False)
average_cost_by_type

type
bus            17135.666667
truck          16734.894924
pickup         16057.410418
convertible    14575.881166
coupe          14353.442901
offroad        14292.294393
SUV            11149.400000
other          10989.714844
van            10546.941548
wagon           9088.134328
mini-van        8193.177433
sedan           6965.358647
hatchback       6868.513849
Name: price, dtype: float64

In [20]:
# Bar chart of grouped_bodystyle_sorted, i.e. the number of distinct models per body type
histogram = px.bar(grouped_bodystyle_sorted, x='type', y='model')
histogram.update_layout(title="<b>Number of distinct models per body type</b>")
st.plotly_chart(histogram)
histogram

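The bar chart plots the number of distinct models per type, where SUV (67), sedan (64) and truck (59) lead. Measured by the number of listings in the data, the top 3 vehicle types are:
- SUV with 12405 vehicles
- Truck with 12353 vehicles
- Sedan with 12154 vehicles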
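
The number of listings and the number of distinct models are two different measures, and they can rank the types differently. A hedged sketch on a hypothetical toy frame computes both side by side with named aggregation. The column names `listings` and `distinct_models` are made up for this example.

```python
import pandas as pd

# Toy data (hypothetical rows, for illustration only)
toy = pd.DataFrame({
    "type": ["SUV", "SUV", "SUV", "sedan", "sedan"],
    "model": ["bmw x5", "bmw x5", "bmw x5", "hyundai sonata", "honda civic"],
})

# 'size' counts rows per type, 'nunique' counts distinct model names per type
summary = toy.groupby("type").agg(
    listings=("model", "size"),
    distinct_models=("model", "nunique"),
)
print(summary)
```

In this toy case SUV leads on listings while sedan leads on distinct models, which is why the two measures should be labeled separately in charts.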

In [21]:
# Scatter plot
fig = px.scatter(df, x='model', y='price', color='type',
                 labels={
                     'model': 'Model',
                     'price': 'Price',
                     'type': 'Type'
                 },
                 title='Prices of vehicles based on model')
fig.show()

The graph shows that a single vehicle model can have a wide range of prices. There are also clear price outliers, such as 375k for a Nissan Frontier, whose prices otherwise fall within 0-17k.
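
One possible way to flag such outliers programmatically is the 1.5 × IQR rule applied within each model. This is a common convention and an assumption here, not something the notebook itself does. The toy prices below are hypothetical apart from the 375k value mentioned above.

```python
import pandas as pd

# Toy data (hypothetical prices except the 375000 outlier noted in the text)
toy = pd.DataFrame({
    "model": ["nissan frontier"] * 6,
    "price": [3000, 7000, 9000, 12000, 17000, 375000],
})

# Per-model quartiles, broadcast back to every row of that model
grouped = toy.groupby("model")["price"]
q1 = grouped.transform(lambda s: s.quantile(0.25))
q3 = grouped.transform(lambda s: s.quantile(0.75))
iqr = q3 - q1

# A price is flagged when it lies more than 1.5 * IQR outside the quartiles
toy["is_outlier"] = (toy["price"] < q1 - 1.5 * iqr) | (toy["price"] > q3 + 1.5 * iqr)
print(toy)
```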

In [23]:
# Scatter plot 2
fig = px.scatter(df, x='model_year', y='price', color='condition',
                 labels={
                     'model_year' : 'Year',
                     'price' : 'Price',
                     'condition' : 'Condition'
                 },
                 title ='Prices of vehicles based on their year')
fig

This scatter plot shows that sale prices of vehicles made between 1975 and around 2000 vary little. For vehicles produced after 2000, and especially after 2005, the range of sale prices widens dramatically: from 0 to around 20k for a car made in 1986, compared with 1 to 85k for a car made in 2019. More recently made vehicles sell across a much wider price range than older cars.
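
The widening price range could also be summarized numerically by taking the minimum and maximum price per model year. A minimal sketch, with hypothetical toy prices chosen only to mirror the ranges described above:

```python
import pandas as pd

# Toy data (hypothetical prices, for illustration only)
toy = pd.DataFrame({
    "model_year": [1986, 1986, 1986, 2019, 2019, 2019],
    "price": [500, 8000, 20000, 1, 30000, 85000],
})

# Minimum and maximum price per year, plus the spread between them
price_range = toy.groupby("model_year")["price"].agg(["min", "max"])
price_range["spread"] = price_range["max"] - price_range["min"]
print(price_range)
```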