# Vehicles Analytics <a id='back'></a>

# Contents <a id='back'></a>

* [Introduction](#intro)
* [Stage 1. Data overview](#data_review)
    * [Conclusions](#data_review_conclusions)
* [Stage 2. Data preprocessing](#data_preprocessing)
    * [2.1 Header style](#header_style)
    * [2.2 Missing values](#missing_values)
    * [2.3 Duplicates](#duplicates)
    * [2.4 Conclusions](#data_preprocessing_conclusions)
* [Stage 3. Testing the hypotheses](#hypotheses)
    * [3.1 Hypothesis 1: influence of odometer readings on price](#odometer)
    * [3.2 Hypothesis 2: popularity of car types](#type)
    * [3.3 Hypothesis 3: popularity of transmission types and their influence on price](#transmission)
* [Findings](#end)

## Introduction <a id='intro'></a>
In this project, we compare prices of second hand cars in the US and study how a car's characteristics affect its price.

### Goal: 
Test three hypotheses:
1. Second hand cars with high odometer readings are cheaper than the same cars with lower odometer readings. "Older" cars have higher odometer readings.
2. SUVs are more popular than other car types.
3. Cars with manual transmission are cheaper than cars with automatic transmission, and manual transmission cars are not popular.

### Stages  
The project consists of three stages:
 1. Data overview
 2. Data preprocessing
 3. Testing the hypotheses
 
[Back to Contents](#back)

## Stage 1. Data overview <a id='data_review'></a>

In [38]:
# plotly install
!pip install plotly





In [39]:
# importing pandas
import pandas as pd

# importing plotly.express
import plotly.express as px

In [40]:
# reading the file and storing it to df
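try:
    # local copy of the dataset
    df = pd.read_csv('C:/Users/count/Project_sprint_6/vehicles_us.csv')
except FileNotFoundError:
    # fall back to the hosted copy when the local file is not available
    df = pd.read_csv('https://practicum-content.s3.us-west-1.amazonaws.com/datasets/vehicles_us.csv')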

In [41]:
# obtaining the first 10 rows from the df table
df.head(10) 

Unnamed: 0,price,model_year,model,condition,cylinders,fuel,odometer,transmission,type,paint_color,is_4wd,date_posted,days_listed
0,9400,2011.0,bmw x5,good,6.0,gas,145000.0,automatic,SUV,,1.0,2018-06-23,19
1,25500,,ford f-150,good,6.0,gas,88705.0,automatic,pickup,white,1.0,2018-10-19,50
2,5500,2013.0,hyundai sonata,like new,4.0,gas,110000.0,automatic,sedan,red,,2019-02-07,79
3,1500,2003.0,ford f-150,fair,8.0,gas,,automatic,pickup,,,2019-03-22,9
4,14900,2017.0,chrysler 200,excellent,4.0,gas,80903.0,automatic,sedan,black,,2019-04-02,28
5,14990,2014.0,chrysler 300,excellent,6.0,gas,57954.0,automatic,sedan,black,1.0,2018-06-20,15
6,12990,2015.0,toyota camry,excellent,4.0,gas,79212.0,automatic,sedan,white,,2018-12-27,73
7,15990,2013.0,honda pilot,excellent,6.0,gas,109473.0,automatic,SUV,black,1.0,2019-01-07,68
8,11500,2012.0,kia sorento,excellent,4.0,gas,104174.0,automatic,SUV,,1.0,2018-07-16,19
9,9200,2008.0,honda pilot,excellent,,gas,147191.0,automatic,SUV,blue,1.0,2019-02-15,17


In [42]:
# obtaining general information about the data in df
df.info() 

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 51525 entries, 0 to 51524
Data columns (total 13 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   price         51525 non-null  int64  
 1   model_year    47906 non-null  float64
 2   model         51525 non-null  object 
 3   condition     51525 non-null  object 
 4   cylinders     46265 non-null  float64
 5   fuel          51525 non-null  object 
 6   odometer      43633 non-null  float64
 7   transmission  51525 non-null  object 
 8   type          51525 non-null  object 
 9   paint_color   42258 non-null  object 
 10  is_4wd        25572 non-null  float64
 11  date_posted   51525 non-null  object 
 12  days_listed   51525 non-null  int64  
dtypes: float64(4), int64(2), object(7)
memory usage: 5.1+ MB


The table contains thirteen columns. They store three data types: `object`, `int64` and `float64`.

According to the documentation:
- `'price'` - car price
- `'model_year'` - car model year
- `'model'` - car model
- `'condition'` - car condition
- `'cylinders'` - number of cylinders
- `'fuel'` - fuel type
- `'odometer'` - total distance travelled by the car
- `'transmission'` - car transmission type
- `'type'` - car type
- `'paint_color'` - paint color
- `'is_4wd'` - whether the car has four-wheel drive
- `'date_posted'` - date the ad was posted
- `'days_listed'` - number of days the ad was listed

There are no issues with style in the column names.

The non-null counts differ between columns. This means the data contains missing values.

### Conclusions <a id='data_review_conclusions'></a> 

Each row in the table stores data from one car advertisement. The columns describe car characteristics: basic (model and model year, car type, number of cylinders, transmission type, paint color, 4wd car or not) and custom (condition, price, odometer). There are also columns for the date posted and the number of days listed.

It's clear that the data is sufficient to test the hypotheses. However, there are missing values.

To move forward, we need to preprocess the data.

[Back to Contents](#back)

## Stage 2. Data preprocessing <a id='data_preprocessing'></a>
The column headers already follow a consistent style, so no renaming is needed. We deal with the missing values and then check whether there are duplicates in the data.

### Missing values <a id='missing_values'></a>

In [43]:
# calculating missing values
df.isnull().sum() 

price               0
model_year       3619
model               0
condition           0
cylinders        5260
fuel                0
odometer         7892
transmission        0
type                0
paint_color      9267
is_4wd          25953
date_posted         0
days_listed         0
dtype: int64

Not all missing values affect the research. For instance, the missing values in `'cylinders'` and `'is_4wd'` are not critical. They can simply be replaced with clear markers.

But missing values in `'odometer'` can affect the comparison of car advertisements. In real life, it would be useful to learn why the data is missing and try to recover it. But we do not have that opportunity in this project. So we have to:
* Fill in these missing values with markers
* Evaluate how much the missing values may affect our computations
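One way to evaluate the impact is to look at the share of missing values per column and to compare a key statistic, such as the median price, between rows with and without an odometer reading. The sketch below uses a small made-up frame (`toy`) rather than the project data, and the helper names (`missing_share`, `odometer_status`, `price_by_status`) are illustrative assumptions:

```python
import pandas as pd

# Toy data standing in for the real table (values are made up)
toy = pd.DataFrame({
    'price': [9400, 25500, 5500, 1500, 14900],
    'odometer': [145000.0, 88705.0, 110000.0, None, 80903.0],
    'paint_color': [None, 'white', 'red', None, 'black'],
})

# Share of missing values per column, in percent
missing_share = (toy.isna().mean() * 100).round(1)
print(missing_share)

# Label each row by whether its odometer reading is present
odometer_status = toy['odometer'].isna().map({True: 'missing', False: 'present'})

# Median price for rows with and without an odometer reading
price_by_status = toy.groupby(odometer_status)['price'].median()
print(price_by_status)
```

If the two medians differ a lot on the real data, dropping rows with a missing odometer could bias the price comparison.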

In [44]:
# Replace the missing values in 'model_year', 'cylinders', 'paint_color' and 'is_4wd' with the string 'unknown'.
# 'odometer' is not filled here, since rows with NaN there should be removed.
# Looping over column names and replacing missing values with 'unknown'
columns_to_replace = ['model_year', 'cylinders', 'paint_color', 'is_4wd']
for column in columns_to_replace:
    df[column] = df[column].fillna('unknown')   

In [45]:
# counting missing values again
df.isna().sum() 

price           0
model_year      0
model           0
condition       0
cylinders       0
fuel            0
odometer        0
transmission    0
type            0
paint_color     0
is_4wd          0
date_posted     0
days_listed     0
dtype: int64

### Duplicates <a id='duplicates'></a>

There are no obvious duplicates:

In [46]:
# counting clear duplicates
df.duplicated().sum()

0

### Implicit duplicates <a id='duplicates'></a>

In [47]:
# viewing unique model names
df['model'].sort_values().unique() 

array(['acura tl', 'bmw x5', 'buick enclave', 'cadillac escalade',
       'chevrolet camaro', 'chevrolet camaro lt coupe 2d',
       'chevrolet colorado', 'chevrolet corvette', 'chevrolet cruze',
       'chevrolet equinox', 'chevrolet impala', 'chevrolet malibu',
       'chevrolet silverado', 'chevrolet silverado 1500',
       'chevrolet silverado 1500 crew', 'chevrolet silverado 2500hd',
       'chevrolet silverado 3500hd', 'chevrolet suburban',
       'chevrolet tahoe', 'chevrolet trailblazer', 'chevrolet traverse',
       'chrysler 200', 'chrysler 300', 'chrysler town & country',
       'dodge charger', 'dodge dakota', 'dodge grand caravan',
       'ford econoline', 'ford edge', 'ford escape', 'ford expedition',
       'ford explorer', 'ford f-150', 'ford f-250', 'ford f-250 sd',
       'ford f-250 super duty', 'ford f-350 sd', 'ford f150',
       'ford f150 supercrew cab xlt', 'ford f250', 'ford f250 super duty',
       'ford f350', 'ford f350 super duty', 'ford focus', 'ford focus

There are several implicit duplicates to be merged: the same model is written in different ways. For example, 'ford f-150' and 'ford f150'.
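As an alternative illustration of the same idea, variant spellings can be collected in one dictionary and applied in a single `replace` call. This is a minimal sketch on a toy series; the `canonical` mapping is a hypothetical example, not the full list used in this project:

```python
import pandas as pd

# Toy sample of model names (the real column has many more values)
models = pd.Series(['ford f-150', 'ford f150', 'ford f150 supercrew cab xlt',
                    'toyota camry le', 'toyota camry'])

# Hypothetical mapping from variant spellings to one canonical name
canonical = {
    'ford f150': 'ford f-150',
    'ford f150 supercrew cab xlt': 'ford f-150',
    'toyota camry le': 'toyota camry',
}

# replace() leaves values that are not in the mapping unchanged
cleaned = models.replace(canonical)
print(sorted(cleaned.unique()))
```

Keeping the mapping in one place makes it easier to review which names were merged.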

In [48]:
# function for replacing implicit duplicates
def replace_wrong_models(wrong_models, correct_model):
    # Replace every name in wrong_models with correct_model in df['model']
    for wrong_model in wrong_models:
        df['model'] = df['model'].replace(wrong_model, correct_model)

In [49]:
# removing implicit duplicates. 
# removing specific details (e.g., trim level) and keeping only the base model name. 
wrong_models = ['chevrolet camaro', 'chevrolet camaro lt coupe 2d']
correct_model = 'chevrolet camaro' 
replace_wrong_models (wrong_models, correct_model) 

wrong_models = ['ford f150', 'ford f150 supercrew cab xlt']
correct_model = 'ford f-150'
replace_wrong_models (wrong_models, correct_model) 

wrong_models = ['chevrolet silverado 1500', 'chevrolet silverado 1500 crew', 'chevrolet silverado 2500hd', 
                'chevrolet silverado 3500hd']
correct_model = 'chevrolet silverado' 
replace_wrong_models (wrong_models, correct_model)

wrong_models = ['ford f-250 sd', 'ford f-250 super duty', 'ford f250', 'ford f250 super duty']
correct_model = 'ford f-250'
replace_wrong_models (wrong_models, correct_model) 

wrong_models = ['ford f-350 sd', 'ford f350', 'ford f350 super duty']
correct_model = 'ford f-350'
replace_wrong_models (wrong_models, correct_model) 

wrong_models = ['ford focus', 'ford focus se']
correct_model = 'ford focus'
replace_wrong_models (wrong_models, correct_model) 

wrong_models = ['ford fusion', 'ford fusion se']
correct_model = 'ford fusion'
replace_wrong_models (wrong_models, correct_model) 

wrong_models = ['ford mustang', 'ford mustang gt coupe 2d']
correct_model = 'ford mustang'
replace_wrong_models (wrong_models, correct_model) 

wrong_models = ['gmc sierra 1500', 'gmc sierra 2500hd']
correct_model = 'gmc sierra'
replace_wrong_models (wrong_models, correct_model) 

wrong_models = ['honda civic', 'honda civic lx']
correct_model = 'honda civic'
replace_wrong_models (wrong_models, correct_model) 

wrong_models = ['jeep grand cherokee', 'jeep grand cherokee laredo']
correct_model = 'jeep grand cherokee'
replace_wrong_models (wrong_models, correct_model) 

wrong_models = ['jeep wrangler', 'jeep wrangler unlimited']
correct_model = 'jeep wrangler'
replace_wrong_models (wrong_models, correct_model) 

wrong_models = ['nissan frontier', 'nissan frontier crew cab sv']
correct_model = 'nissan frontier'
replace_wrong_models (wrong_models, correct_model) 

wrong_models = ['toyota camry', 'toyota camry le']
correct_model = 'toyota camry'
replace_wrong_models (wrong_models, correct_model)    


In [50]:
# checking for implicit duplicates again
sorted(df['model'].unique()) 

['acura tl',
 'bmw x5',
 'buick enclave',
 'cadillac escalade',
 'chevrolet camaro',
 'chevrolet colorado',
 'chevrolet corvette',
 'chevrolet cruze',
 'chevrolet equinox',
 'chevrolet impala',
 'chevrolet malibu',
 'chevrolet silverado',
 'chevrolet suburban',
 'chevrolet tahoe',
 'chevrolet trailblazer',
 'chevrolet traverse',
 'chrysler 200',
 'chrysler 300',
 'chrysler town & country',
 'dodge charger',
 'dodge dakota',
 'dodge grand caravan',
 'ford econoline',
 'ford edge',
 'ford escape',
 'ford expedition',
 'ford explorer',
 'ford f-150',
 'ford f-250',
 'ford f-350',
 'ford focus',
 'ford fusion',
 'ford mustang',
 'ford ranger',
 'ford taurus',
 'gmc acadia',
 'gmc sierra',
 'gmc yukon',
 'honda accord',
 'honda civic',
 'honda cr-v',
 'honda odyssey',
 'honda pilot',
 'hyundai elantra',
 'hyundai santa fe',
 'hyundai sonata',
 'jeep cherokee',
 'jeep grand cherokee',
 'jeep liberty',
 'jeep wrangler',
 'kia sorento',
 'kia soul',
 'mercedes-benz benze sprinter 2500',
 'niss

### Conclusions <a id='data_preprocessing_conclusions'></a>
We detected two issues with the data:

- Missing values
- Implicit duplicates

All missing values have been replaced with `'unknown'`. But we still have to see whether the missing values in `'odometer'` will affect our analysis.

The absence of duplicates will make the results more precise and easier to understand.

Now we can move on to testing hypotheses. 

## Stage 3. Testing hypotheses <a id='hypotheses'></a>

### Hypothesis 1: influence of odometer readings on price <a id='odometer'></a>

According to the first hypothesis, second hand cars with high odometer readings are cheaper than the same cars with lower odometer readings.

* Divide the cars into groups by odometer readings: low - medium - high.
* Check and compare the car price range in the groups.
* Check if "older" cars have higher odometer readings.

Divide the cars into groups by odometer readings: low - medium - high.
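The binning step can be sketched on toy data first, to see how values that fall exactly on a bin edge are assigned. The `readings` series is made up. The bin edges and labels match the ones used in this notebook:

```python
import pandas as pd

# Toy odometer readings, including values exactly on the bin edges
readings = pd.Series([12000, 50000, 99999, 100000, 250000])

bins = [-float('inf'), 50000, 100000, float('inf')]
labels = ['low', 'medium', 'high']

# right=False makes each bin closed on the left: [50000, 100000) is 'medium'
groups = pd.cut(readings, bins=bins, labels=labels, right=False)
print(groups.tolist())
```

With `right=False`, a reading of exactly 50000 falls into `'medium'` and a reading of exactly 100000 falls into `'high'`.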

In [51]:
# Convert 'odometer' column to numeric values with errors='coerce' to handle non-numeric strings
df['odometer'] = pd.to_numeric(df['odometer'], errors='coerce')

In [63]:
# Replace 'unknown' with NaN in 'odometer' only
df['odometer'] = df['odometer'].replace('unknown', float('nan'))

# Drop rows with NaN values only in the 'odometer' column
# (.copy() avoids SettingWithCopyWarning when adding columns later)
df = df.dropna(subset=['odometer']).copy()

# Define bins
bins = [-float("inf"), 50000, 100000, float("inf")]

# Define labels for each group
labels = ['low', 'medium', 'high']

# Create a new column 'group' with bin labels based on the 'odometer' values
# (right=False: each bin includes its left edge and excludes its right edge)
df['group'] = pd.cut(df['odometer'], bins=bins, labels=labels, right=False)

# Output the DataFrame
print(df.head(10))

In [None]:
# Group by group column (low/medium/high)
df.groupby('group')['model'].count()





group
low        7299
medium    10831
high      25503
Name: model, dtype: int64

In [None]:
# Add percentage share for each group

# Group by 'group' and count occurrences
group_counts = df.groupby('group')['model'].size().reset_index(name='count')

# Calculate percentage and add it as a new column
total_count = df.shape[0]  # Total count of rows in the DataFrame
group_counts['percentage'] = (group_counts['count'] / total_count) * 100
group_counts['percentage'] = group_counts['percentage'].round(2)

print(group_counts)

    group  count  percentage
0     low   7299       16.73
1  medium  10831       24.82
2    high  25503       58.45






Most cars in the listings data have high odometer readings. We need to check whether the price depends on the odometer group.
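A simple way to check this is to summarize the price within each odometer group. The sketch below assumes a `group` column like the one created above, but runs on a made-up frame (`toy`), so the numbers say nothing about the real data:

```python
import pandas as pd

# Toy listings with an odometer group and a price (values are made up)
toy = pd.DataFrame({
    'group': ['low', 'low', 'medium', 'medium', 'high', 'high'],
    'price': [21000, 19000, 14000, 12000, 6000, 4000],
})

# Count, median and range of prices per group, in low -> high order
price_summary = (
    toy.groupby('group')['price']
       .agg(['count', 'median', 'min', 'max'])
       .reindex(['low', 'medium', 'high'])
)
print(price_summary)
```

The median is used rather than the mean because a few very expensive cars would otherwise dominate the comparison.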

In [None]:
# Convert 'date_posted' column to datetime format
df['date_posted'] = pd.to_datetime(df['date_posted'])

# Extract publication year
df['year_posted'] = df['date_posted'].dt.year

# Convert 'year_posted' and 'model_year' columns to numeric type
df['year_posted'] = pd.to_numeric(df['year_posted'], errors='coerce')
df['model_year'] = pd.to_numeric(df['model_year'], errors='coerce')

# Calculate the age of the car and adjust for days listed
df['car_age'] = df['year_posted'] - df['model_year']
df['car_age'] -= df['days_listed'] / 365
df['car_age'] = df['car_age'].round(2)

print(df.head(10))

    price  model_year                model  condition cylinders fuel  \
0    9400      2011.0               bmw x5       good       6.0  gas   
1   25500         NaN           ford f-150       good       6.0  gas   
2    5500      2013.0       hyundai sonata   like new       4.0  gas   
4   14900      2017.0         chrysler 200  excellent       4.0  gas   
5   14990      2014.0         chrysler 300  excellent       6.0  gas   
6   12990      2015.0         toyota camry  excellent       4.0  gas   
7   15990      2013.0          honda pilot  excellent       6.0  gas   
8   11500      2012.0          kia sorento  excellent       4.0  gas   
9    9200      2008.0          honda pilot  excellent   unknown  gas   
10  19500      2011.0  chevrolet silverado  excellent       8.0  gas   

    odometer transmission    type paint_color   is_4wd date_posted  \
0   145000.0    automatic     SUV     unknown      1.0  2018-06-23   
1    88705.0    automatic  pickup       white      1.0  2018-10-19 

Rows with a missing car age should be deleted, because car age is an important characteristic for price analysis.

In [None]:
# calculating missing values
df.isnull().sum() 

price              0
model_year      3070
model              0
condition          0
cylinders          0
fuel               0
odometer           0
transmission       0
type               0
paint_color        0
is_4wd             0
date_posted        0
days_listed        0
group              0
year_posted        0
car_age         3070
dtype: int64

In [None]:
# Drop rows with NaN values only in the 'car_age' column
df = df.dropna(subset=['car_age'])

In [None]:
# Sort by car age and view the ten oldest cars
# (a separate variable is used, since pandas does not allow new attributes on a DataFrame)
df_sorted = df.sort_values(by='car_age', ascending=True)
df_sorted.tail(10)



Unnamed: 0,price,model_year,model,condition,cylinders,fuel,odometer,transmission,type,paint_color,is_4wd,date_posted,days_listed,group,year_posted,car_age
40089,20000,1960.0,chevrolet impala,excellent,8.0,gas,1000.0,automatic,sedan,custom,unknown,2018-06-30,75,low,2018,57.79
33007,17500,1960.0,chevrolet impala,excellent,8.0,gas,31000.0,automatic,sedan,white,unknown,2019-02-01,11,low,2019,58.97
39580,35000,1958.0,chevrolet impala,excellent,8.0,gas,3184.0,automatic,coupe,black,unknown,2018-05-19,33,low,2018,59.91
48414,37900,1958.0,chevrolet impala,good,8.0,gas,62799.0,automatic,coupe,unknown,unknown,2018-08-11,10,medium,2018,59.97
10018,23900,1955.0,ford f-250,excellent,6.0,gas,47180.0,manual,truck,blue,unknown,2018-12-22,61,low,2018,62.83
14752,15000,1954.0,ford f-150,excellent,unknown,gas,3565.0,manual,pickup,black,unknown,2019-02-16,13,low,2019,64.96
36582,44900,1949.0,chevrolet suburban,good,unknown,gas,1800.0,automatic,wagon,orange,unknown,2018-08-19,10,low,2018,68.97
22595,21000,1948.0,chevrolet impala,like new,8.0,gas,4000.0,automatic,sedan,red,unknown,2019-01-18,24,low,2019,70.93
34713,5000,1936.0,ford f-150,excellent,6.0,gas,30000.0,manual,pickup,purple,unknown,2018-11-22,10,low,2018,81.97
33906,12995,1908.0,gmc yukon,good,8.0,gas,169328.0,automatic,SUV,black,unknown,2018-07-06,34,high,2018,109.91


In [None]:
# Plot scatter plot using Plotly Express
fig = px.scatter(df, x='car_age', y='price', color='group', title='Car price dependence on odometer readings and car age')
fig.show()

It is necessary to calculate each car's age and build a graph based on it.
According to the first hypothesis, second hand cars with high odometer readings are cheaper than the same cars with lower odometer readings.
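The related claim that "older" cars have higher odometer readings can also be checked numerically, for example with a correlation coefficient between car age and odometer. This is a minimal sketch on made-up values; the column names follow the notebook, while `toy` and `age_odometer_corr` are illustrative:

```python
import pandas as pd

# Toy data: car age in years and odometer reading (values are made up)
toy = pd.DataFrame({
    'car_age': [1, 3, 5, 8, 12],
    'odometer': [15000, 42000, 70000, 110000, 160000],
})

# Pearson correlation (the pandas default); values near 1 mean that
# odometer readings grow almost linearly with car age
age_odometer_corr = toy['car_age'].corr(toy['odometer'])
print(round(age_odometer_corr, 3))
```

On the real data the relationship is likely to be noisier, so the scatter plot remains useful alongside the single number.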

In [None]:
# Plot scatter plot using Plotly Express
fig = px.scatter(df, x='price', y='odometer', color='group', title='Scatter Plot by group')
fig.show()

Finally, we need to write up the results and the code for an app that shows these graphs for the data in the file.

In [None]:

for row in df['odometer']:
    if df['odometer'] <= 50

if start_button:
    st.write(f'Running the experiment of {number_of_trials} trials.')
    st.session_state['experiment_no'] += 1
    mean = toss_coin(number_of_trials)
    st.session_state['df_experiment_results'] = pd.concat([
        st.session_state['df_experiment_results'],
        pd.DataFrame(data=[[st.session_state['experiment_no'],
                            number_of_trials,
                            mean]],
                     columns=['no', 'iterations', 'mean'])
        ],
        axis=0)
    st.session_state['df_experiment_results'] = \
        st.session_state['df_experiment_results'].reset_index(drop=True)


SyntaxError: expected ':' (1395266599.py, line 2)

### Hypothesis 2: popularity of car types <a id='type'></a>