In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

<h1>USED CARS DATASET

<h3>Overview

This dataset contains information about used cars from different regions, including region, price, year, manufacturer, model, condition, type, etc. The data has been cleaned as needed for the analysis. After analyzing the relationships between different variables in the dataset, conclusions are drawn to help a potential buyer make a profitable choice.

In [None]:
data=pd.read_csv('../input/craigslist-carstrucks-data/vehicles.csv')

<h3>Data Exploration

In [None]:
#Checking the dataset
data.head()

In [None]:
#Checking the columns
data.columns

In [None]:
data.info()

In [None]:
data.describe()


<h3>Data Cleaning

In this step, we will perform the following tasks:

- Pick those columns that are relevant for the analysis and drop those that are not.

- Check for any null values

- Fill or drop the records with null values

- Find the percentage of the null records

- Select the year from which data has to be analyzed

In [None]:
data.columns

***

For our analysis, the columns:

'image_url','description','county','VIN','drive','odometer','title_status','region_url','id','posting_date','url','cylinders','fuel','transmission','size' are not very relevant, so we will drop them.

***

In [None]:
#Dropping columns
data.drop(columns=['image_url','description','county','VIN','drive','odometer','title_status','region_url','id','posting_date','url','cylinders','fuel','transmission', 'size'],inplace=True)

In [None]:
data.head()

In [None]:
#Checking For Null Records
data.isnull()

In [None]:
data.isnull().sum()

In [None]:
#Getting the percentage of the null records
null_values=pd.DataFrame(data.isnull().sum(),columns=['null_sum'])
null_values=null_values[null_values.null_sum>0]
null_values['percentage']=(null_values.null_sum/len(data))*100
null_values=null_values.sort_values(by='percentage',ascending=False)
null_values

It seems that the columns size, condition and paint_color have a relatively high number of missing values.
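
The same percentages can also be computed more directly, because the mean of a boolean null mask is the fraction of missing values in each column. A minimal sketch on a toy frame (the `toy` DataFrame and its values are made up for illustration and are not taken from the dataset):

```python
import numpy as np
import pandas as pd

# Toy data, made up for illustration only
toy = pd.DataFrame({
    'condition': ['good', np.nan, np.nan, 'fair'],
    'paint_color': ['white', 'black', np.nan, 'red'],
    'price': [1000, 2000, 3000, 4000],
})

# Mean of the boolean null mask = fraction of nulls per column
null_pct = toy.isnull().mean() * 100

# Keep only columns that actually have nulls, largest share first
null_pct = null_pct[null_pct > 0].sort_values(ascending=False)
print(null_pct)
```

On the real data, `toy` would be replaced by `data`.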

<h4> Although these values might be available online, there are more than 3 million missing values, so we will drop the records that have null values.

In [None]:
#Dropping Null Records
data.dropna(axis=0,inplace=True)

In [None]:
data.isnull().sum()

There are no null values

In [None]:
data.model.unique()

Let us treat f-150 and f150 as the same model and replace one spelling with the other.
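
Other spelling variants may exist beyond this pair. One possible way to surface them is to compare model names after stripping hyphens and spaces. This is only a sketch: the model names below are made up, and `normalize_model` is a hypothetical helper that is not part of the notebook.

```python
import pandas as pd

def normalize_model(name: str) -> str:
    """Lowercase and remove hyphens/spaces so 'F-150' and 'f150' compare equal."""
    return name.lower().replace('-', '').replace(' ', '')

# Made-up model names for illustration
models = pd.Series(['f-150', 'f150', 'F 150', 'silverado 1500', 'civic'])
keys = models.map(normalize_model)

# Group the original spellings by normalized key;
# keys with more than one spelling are candidate variants
variants = models.groupby(keys).unique()
variants = variants[variants.map(len) > 1]
print(variants)
```

Each candidate group still needs a manual check before merging, since stripping characters can also collide genuinely different models.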

In [None]:
data['model'] = data['model'].replace({'f150': 'f-150'})

In [None]:
data.year.unique()

In [None]:
data.year.nunique()

Since the data goes back to 1944, we will analyze only the records after the year 2000.

In [None]:
data=data[data.year>2000]

In [None]:
data.year.nunique()

In [None]:
data.head()

In [None]:
data.manufacturer.value_counts()

We will keep only those manufacturers that have more than 500 listings.
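
The same kind of frequency filter can be applied to any categorical column. A minimal sketch, assuming a hypothetical helper `keep_frequent` (not part of the notebook) and a made-up toy frame:

```python
import pandas as pd

def keep_frequent(df: pd.DataFrame, column: str, min_count: int) -> pd.DataFrame:
    """Keep rows whose value in `column` occurs more than `min_count` times."""
    counts = df[column].value_counts()
    frequent = counts[counts > min_count].index
    return df[df[column].isin(frequent)]

# Toy data: 'fiat' appears only once and is filtered out
toy = pd.DataFrame({'manufacturer': ['ford'] * 3 + ['ram'] * 2 + ['fiat']})
filtered = keep_frequent(toy, 'manufacturer', min_count=1)
print(filtered.manufacturer.value_counts())
```

On the real data this would correspond to a call such as `keep_frequent(data, 'manufacturer', 500)`.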

In [None]:
#rename_axis(None) clears the index name so this also works on newer pandas versions
manufacturer = data.manufacturer.value_counts().rename_axis(None).to_frame(name='manufacturer')
manufacturer = manufacturer.sort_values(by=['manufacturer'], ascending=False)
manufacturer = manufacturer.loc[manufacturer.manufacturer>500,:]
manufacturer

In [None]:
manufacturer.index

In [None]:
data = data.loc[data['manufacturer'].isin(manufacturer.index)]
data.shape

In [None]:
data.model.value_counts()

We will keep only those models that occur more than 500 times in the dataset.

In [None]:
#rename_axis(None) clears the index name so this also works on newer pandas versions
model = data.model.value_counts().rename_axis(None).to_frame(name='model')
model = model.sort_values(by=['model'], ascending=False)
model = model.loc[model.model>500]
model

In [None]:
model.index

In [None]:
data = data.loc[data['model'].isin(model.index)]
data.shape

In [None]:
data.region.unique()

We will keep only those regions that have more than 50 listings.

In [None]:
data.region.value_counts()

In [None]:
#rename_axis(None) clears the index name so this also works on newer pandas versions
region = data.region.value_counts().rename_axis(None).to_frame(name='region')
region = region.sort_values(by=['region'], ascending=False)
region = region.loc[region.region>50]
region

In [None]:
data = data.loc[data['region'].isin(region.index)]
data.shape

***

<h3>Data Visualization

In [None]:
#Importing Libraries
import seaborn as sns
import matplotlib
import matplotlib.pyplot as plt
%matplotlib inline

sns.set_style('darkgrid')
matplotlib.rcParams['font.size'] = 14
matplotlib.rcParams['figure.figsize'] = (9, 5)
matplotlib.rcParams['figure.facecolor'] = '#00000000'

<h4>Which manufacturers' vehicles are listed the most?

In [None]:
sns.set_style('darkgrid')
data.manufacturer.value_counts().plot(kind='bar',figsize=(12, 9),color=['#008000','#008080'])
plt.ticklabel_format( axis='y',style='plain')

Clearly, Chevrolet and Ford are the most listed manufacturers, followed by Ram.

- It seems that Ford owners resell their cars the most. This may suggest that Chevrolet and Ford cars are not long lasting, but it may equally reflect that most cars sold come from these two companies. The latter hypothesis is supported by:
https://www.statista.com/statistics/264362/leading-car-brands-in-the-us-based-on-vehicle-sales/


In [None]:
data.manufacturer.unique()

<h3>Which regions have the most listings?

In [None]:
plt.figure(figsize=(17,6))
plt.xticks(rotation=90)
data.region.value_counts().head(20).plot(kind='bar',figsize=(12, 9),color=['#008000','#008080'])

Long Island has the most listings, followed by New Hampshire, Louisville and Tulsa, which are listed almost the same number of times.

It is concerning that, after cleaning, most of the car listings come from less populated areas. Since we do not have data for the most populated places such as NYC, California and Texas, the results can be misleading.
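
One way to check how uneven the geographic coverage is would be to look at each state's share of the remaining listings. The sketch below uses made-up values and assumes the dataset has a `state` column:

```python
import pandas as pd

# Toy listings; the 'state' column name is assumed to match the dataset
toy = pd.DataFrame({'state': ['ny', 'ny', 'ok', 'nh', 'ok', 'ny', 'ca', 'ny']})

# Share of listings per state, as a percentage
share = toy['state'].value_counts(normalize=True) * 100
print(share.round(1))
```

States with a very small share after cleaning are the ones for which the conclusions are least reliable.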

<h3> Which year has the most listings?

In [None]:
data.head()

In [None]:
sns.set_style('darkgrid')
data.year.value_counts().plot(kind='bar',figsize=(12, 9),color=['#8C09BD','#6EE90C'])
plt.ticklabel_format( axis='y',style='plain')

2014 has the most listings, followed by 2013, 2015, etc.

- For 2020, the decline in listings is probably due to the pandemic, shipping problems, the recession, and other problems.

- For 2021, the data is still being updated, but the trend is declining as well.

<h3> Find the mean price for each year.

In [None]:
mean_price=data.groupby(by=data.year)[['price']].mean()

In [None]:
mean_price

In [None]:
plt.figure(figsize=(17,6))
plt.xticks(rotation=90)
sns.set_style('darkgrid')
plt.plot(mean_price.index,mean_price.price)
plt.ticklabel_format( axis='y',style='plain')



Remarkably, 2020 saw a steep increase in car prices even though it was the year of the pandemic. However, prices dropped from 2018 to 2019.
For more information: https://www.vox.com/the-goods/21507739/coronavirus-car-market-used-expensive

<h3>Compare the condition of the cars w.r.t. color.

In [None]:
plt.figure(figsize=(10,7))
var=sns.countplot(data=data,x='paint_color',hue='condition')
var.set_xticklabels(var.get_xticklabels(), rotation=90)
plt.legend(loc='upper right')
plt.show()


It seems that white cars account for the most listings in excellent condition, followed by black cars. It will be interesting to see when these cars were listed according to their color.

In [None]:
sns.set_style('darkgrid')
data.paint_color.value_counts().plot(kind='bar',figsize=(12, 9),color=['#C39BD3','#A9CCE3','#76D7C4','#B2BABB','#E5E7E9'])
plt.ticklabel_format( axis='y',style='plain')

It is evident from the graph that white cars are the most common, followed by black cars.

This holds not only for this dataset but also in general:
https://www.germaincars.com/most-popular-car-colors/

<h3>Find the mean price for each manufacturer

In [None]:
mean_price_manufacturer=data.groupby(by=data.manufacturer)[['price']].mean()


In [None]:
mean_price_manufacturer

In [None]:

plt.xticks(rotation=90)
sns.set_style('darkgrid')
sns.barplot(x=mean_price_manufacturer.index,y=mean_price_manufacturer.price)


This shows that Ram vehicles have the highest mean price, which suggests they bring the most on resale. The mean can be a bit misleading, however, so it is advisable to look at current trends as well. Anyone who plans to buy a car and later resell it for profit may want to consider a Ram.
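
Because a few extreme listings can pull the mean far from a typical price, it may help to look at the median and the group size next to it. A minimal sketch on made-up prices (the values below are illustrative and not taken from the dataset):

```python
import pandas as pd

# Made-up prices: one extreme listing inflates the mean for 'ram'
toy = pd.DataFrame({
    'manufacturer': ['ram', 'ram', 'ram', 'ford', 'ford', 'ford'],
    'price': [20000, 22000, 500000, 25000, 26000, 27000],
})

# Mean, median and number of listings per manufacturer
summary = toy.groupby('manufacturer')['price'].agg(['mean', 'median', 'count'])
print(summary)
```

In this toy example 'ram' has the higher mean but the lower median, which is the kind of disagreement worth checking for on the real data.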

***

<h2>Remarks

- Records whose null values couldn't be filled were dropped.

- Columns that were not needed for the analysis were dropped.

- Only records after the year 2000 were taken into account.

- Only manufacturers and models listed more than 500 times, and regions listed more than 50 times, were considered.

- Most of the car listings come from less populated areas. Since we do not have data for the most populated places such as NYC, California and Texas, the results can be misleading.

- Not enough points remained after data cleaning to plot a map.

<h2>Conclusions

- Ford owners resell their cars the most. This may suggest that Chevrolet and Ford cars are not long lasting, but it may equally reflect that most cars sold come from these two companies.

- Most of the cars for resale are from Tulsa, followed by Orlando, Louisville and Dayton, which are listed almost the same number of times.

- 2014 has the most listings, followed by 2013, 2015, etc.

- For 2020, the decline in listings is probably due to the pandemic, shipping problems, the recession, and other problems.

- For 2021, the data is still being updated, but the trend is declining as well.

- Remarkably, 2020 saw a steep increase in car prices even though it was the year of the pandemic. This unexpected trend is explained here: https://www.vox.com/the-goods/21507739/coronavirus-car-market-used-expensive.

- However, prices dropped from 2018 to 2019.

- Most of the white cars are in excellent condition, while very few cars are new.

- It is evident from the graph that white cars are the most common, followed by black cars. This holds not only for this dataset but also in general, as supported by the research at: https://www.germaincars.com/most-popular-car-colors/.


