# Exploratory Data Analysis: Carsale Advertisement Dataset

## Table of Contents

1. [Problem Statement](#section1)<br>
2. [Data Loading and Description](#section2)
3. [Data Profiling](#section3)
    - * [Understanding the Dataset](#section301)<br/>
    - * [Profiling_1](#section302)<br/>
    - * [Preprocessing_1](#section303)<br/>
    - * [Profiling_2](#section304)<br/>
    - * [Preprocessing_2](#section305)<br/>
    - * [Post Profiling](#section306)<br/>
4. [Conclusions](#section4)<br/>  

<a id=section1></a>

### 1. Problem Statement

This notebook explores the basic use of __Pandas__ and covers the basic commands of __Exploratory Data Analysis (EDA)__, which includes __cleaning__, __munging__, __combining__, __reshaping__, __slicing__, __dicing__, and __transforming data__ for analysis purposes.

* __Exploratory Data Analysis__ <br/>
Understand the data through EDA and derive simple baseline models with Pandas.
EDA is a critical first step in analyzing data, and we do it for the reasons below:
    - Finding patterns in Data
    - Determining relationships in Data
    - Checking of assumptions
    - Preliminary selection of appropriate models
    - Detection of mistakes 


<a id=section2></a>

### 2. Data Loading and Description

<a id=section201></a>

- The dataset consists of information collected from car sale advertisements for study/practice purposes; most of the advertised cars are used cars.
- The dataset comprises __9576 observations of 10 columns__. The table below lists all column names with their descriptions.


<table>
<thead>
    <style>
td {
  text-align: center;
}
</style>
<tr>
<th>Column Name</th>
<th>Description</th>
</tr>
</thead>
<tbody><tr>
<td>car</td>
<td>Manufacturer brand</td>
</tr>
<tr>
<td>price</td>
<td>Seller’s price in advertisement (in USD)</td>
</tr>
<tr>
<td>body</td>
<td>Car body type</td>
</tr>
<tr>
<td>mileage</td>
<td>as mentioned in advertisement (‘000 Km)</td>
</tr>
<tr>
<td>engV</td>
<td>rounded engine volume (‘000 cubic cm)</td>
</tr>
<tr>
<td>engType</td>
<td>type of fuel (“Other” in this case should be treated as NA)</td>
</tr>
<tr>
<td>registration</td>
<td>whether car registered in Ukraine or not</td>
</tr>
<tr>
<td>year</td>
<td>year of production</td>
</tr>
<tr>
<td>model</td>
<td>specific model name</td>
</tr>
<tr>
<td>drive</td>
<td>drive type</td>
</tr>
</tbody></table>

#### Some Background Information
 - This data was collected from private car sale advertisements in Ukraine and provided by the INSAID team for Exploratory Data Analysis.
 - This dataset contains real, raw data with all its inconveniences (missing values such as NA's, for example).
 - This dataset covers more than 9.5K cars on sale in Ukraine. Most of them are used cars, which opens the possibility of analyzing features related to car operation.

#### Importing packages                                          

In [None]:
import numpy as np                                                 # Implements multi-dimensional arrays and matrices
import pandas as pd                                                # For data manipulation and analysis
import pandas_profiling                                            # Generates HTML profiling reports from a DataFrame
import matplotlib.pyplot as plt                                    # Plotting library for Python and its numerical mathematics extension NumPy
import seaborn as sns                                              # Provides a high-level interface for drawing attractive and informative statistical graphics
%matplotlib inline
sns.set()

from subprocess import check_output



#### Importing the Dataset

In [None]:
#If you get UnicodeDecodeError - 'utf8' codec can't decode, invalid continuation byte error, either use engine='python'
#or encoding='latin-1' options
carsale = pd.read_csv("../input/car-sale-advertisements/car_ad.csv",engine="python")     # Importing car_sale dataset using pd.read_csv
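
The encoding advice in the comment above can be sketched as a small fallback loop. This is only an illustration under stated assumptions: `read_csv_with_fallback` is a hypothetical helper (not part of pandas or of this notebook), and the toy file is written on the fly so the example does not depend on the real `car_ad.csv`.

```python
import os
import tempfile

import pandas as pd


def read_csv_with_fallback(path, encodings=("utf-8", "latin-1")):
    """Hypothetical helper: try each encoding in turn, return the first successful parse."""
    last_error = None
    for enc in encodings:
        try:
            return pd.read_csv(path, encoding=enc)
        except UnicodeDecodeError as err:
            last_error = err  # remember the failure and try the next encoding
    raise last_error


# Toy file (made-up content) whose bytes are valid latin-1 but not valid UTF-8
with tempfile.TemporaryDirectory() as tmp:
    toy_path = os.path.join(tmp, "toy.csv")
    with open(toy_path, "wb") as fh:
        fh.write("car,price\nCitroën,1000\n".encode("latin-1"))
    toy = read_csv_with_fallback(toy_path)

print(toy)
```

Trying `utf-8` first and falling back to `latin-1` keeps correctly encoded files untouched; `latin-1` accepts any byte sequence, so it works as a last resort but may mis-render characters from other encodings.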

<a id=section3></a>

## 3. Data Profiling

<a id=section301></a>

### 3.1 Understanding the Carsale Dataset

In [None]:
carsale.shape    # This will print the number of rows and columns of the DataFrame

__Carsale Dataset__  has __9576 rows__ and __10 columns.__

In [None]:
carsale.columns  # This will print the names of all columns.

In [None]:
carsale.head()   # Will give you first 5 records

In [None]:
carsale.tail()   # This will print the last 5 rows of the DataFrame

In [None]:
carsale.info() # This will give Index, Datatype and Memory information

In [None]:
# describe() summarizes only the numeric columns by default; pass include='all' to cover every column
# The count row gives an idea of which columns have missing values
carsale.describe()

- All numeric columns have a count of __9576__ except __engV__, so this column appears to have some missing values.
- __price__ and __mileage__ have a __min__ value of __zero__, which is not realistic here. These zeros need to be replaced with __NaN__ so that they are treated as missing values.
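
As a minimal sketch of how the `count` and `min` rows expose these issues, the toy frame below (made-up values, not the real dataset) compares the default `describe()` with `describe(include='all')`.

```python
import numpy as np
import pandas as pd

# Toy data: one zero price, two missing engV values, one missing drive value
toy = pd.DataFrame({
    "car": ["Ford", "BMW", "Ford", "Kia"],
    "price": [5000.0, 0.0, 7000.0, 9000.0],
    "engV": [1.6, np.nan, 2.0, np.nan],
    "drive": ["front", "full", None, "front"],
})

numeric_only = toy.describe()              # numeric columns only (the default)
all_columns = toy.describe(include="all")  # adds count/unique/top/freq for text columns

print(all_columns.loc[["count", "min"]])
```

A `count` below the number of rows signals missing values, while a `min` of zero in a column such as price hints at placeholder values that `count` alone does not reveal.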

In [None]:
carsale.isnull().sum() # Will show you null count for each column, but will not count Zeros(0) as null

- The __engV__ and __drive__ columns contain the __most null values__. However, `isnull()` does not treat __0__ as a missing value, so the zeros noted above still have to be handled separately.
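
One way to see both kinds of gaps side by side is to count zeros next to nulls. The sketch below uses a small made-up frame rather than the real data.

```python
import numpy as np
import pandas as pd

# Toy frame (made-up values) with one real NaN and several zero placeholders
toy = pd.DataFrame({
    "price": [15500.0, 0.0, 35000.0, 0.0],
    "mileage": [68, 0, 135, 162],
    "engV": [2.5, 1.8, np.nan, 2.0],
})

summary = pd.DataFrame({
    "nulls": toy.isnull().sum(),   # true missing values only
    "zeros": (toy == 0).sum(),     # zeros that isnull() does not flag
})
print(summary)
```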

<a id=section302></a>


### * Profiling_1

In [None]:
profile = pandas_profiling.ProfileReport(carsale)
profile.to_file(outputfile="carsale_before_preprocessing_1.html")

- Pandas Profiling was run before preprocessing to get initial observations from the dataset in a more visual form, including the correlation matrix and sample data. The report was saved as the HTML file __carsale_before_preprocessing_1.html__.

- Take a look at the file and see what useful insights you can develop from it. <br/>


- The initial observations from profiling the __Carsale Dataset__ can be seen in __carsale_before_preprocessing_1.html__.

<a id=section303></a>

### * Preprocessing_1

- The __engType__ column contains __"Other"__ values. As mentioned in the data description above, these should be treated as NA, so we replace them with __NaN__.
- __price__ has __267__ zeros, which should be treated as missing values, so we replace them with __NaN__.
- __mileage__ has __348__ zeros, which should be treated as missing values, so we replace them with __NaN__.

In [None]:
carsale.replace({'engType': 'Other', 'price': 0, 'mileage': 0}, np.nan, inplace=True)
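
The dictionary form of `replace` used above is column-specific: each key names a column and each value is the entry to replace in that column only. A minimal sketch on made-up data, where the zero in `year` is deliberately left out of the mapping:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "engType": ["Gas", "Other", "Petrol"],
    "price": [0.0, 9000.0, 12000.0],
    "mileage": [0, 100, 95],
    "year": [2010, 2011, 0],   # a zero here is deliberately NOT a target
})

# {column: value_to_replace}; every match is replaced by the second argument
cleaned = toy.replace({"engType": "Other", "price": 0, "mileage": 0}, np.nan)
print(cleaned)
```

Columns that are not listed in the dictionary, such as `year` here, keep their zeros.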

<a id=section304></a>

### * Profiling_2

- Let's see what has changed in the profiling observations after __Preprocessing_1__.

In [None]:
profile = pandas_profiling.ProfileReport(carsale)
profile.to_file(outputfile="carsale_before_preprocessing_2.html")

- The observations from profiling the Carsale Dataset after __Preprocessing_1__ can be seen in __carsale_before_preprocessing_2.html__.

<a id=section305></a>

### * Preprocessing_2
<br>
Now we will handle duplicates and missing values.

- __duplicates__: There are __113__ duplicate rows in the dataset, so we remove those first.

In [None]:
carsale.drop_duplicates(inplace=True) #inplace used to modify the dataset with applied command
carsale.shape
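
Before dropping anything, the number of duplicate rows can be checked with `duplicated()`. A small sketch on made-up rows:

```python
import pandas as pd

toy = pd.DataFrame({
    "car": ["Ford", "Ford", "BMW", "Ford"],
    "model": ["Focus", "Focus", "X5", "Focus"],
    "price": [5000, 5000, 20000, 5200],
})

n_duplicates = toy.duplicated().sum()   # rows identical to an earlier row
deduplicated = toy.drop_duplicates()    # keeps the first occurrence by default

print(n_duplicates, deduplicated.shape)
```

Note that `drop_duplicates` keeps the original index labels, so the index of the result has gaps unless it is reset afterwards.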

### *  Handling numerical data
- __price__: Missing values in the __price__ column are replaced with the __median__ __price__ of the matching __[car, model]__ group. If that group has no known price, the brand (__car__) median is used, and otherwise the median of the whole dataset.


In [None]:
def get_median_price(x):
    brand = x.name[0]
    if x.count() > 0:
        return x.median() # Return median for a brand/model if the median exists.
    elif carsale.groupby(['car'])['price'].count()[brand] > 0:
        brand_median = carsale.groupby(['car'])['price'].apply(lambda x: x.median())[brand]
        return brand_median # Return median of brand if particular brand/model combo has no median,
    else:                 # but brand itself has a median for the 'price' feature. 
        return carsale['price'].median() # Otherwise return dataset's median for the 'price' feature.
    
price_median = carsale.groupby(['car','model'])['price'].apply(get_median_price).reset_index()
price_median.rename(columns={'price': 'price_med'}, inplace=True)
price_median.head()

In [None]:
def fill_with_median(x):
    if pd.isnull(x['price']):
        return price_median[(price_median['car'] == x['car']) & (price_median['model'] == x['model'])]['price_med'].values[0]
    else:
        return x['price']
    
carsale['price'] = carsale.apply(fill_with_median, axis=1)
carsale.head()
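
The same model, then brand, then dataset fallback can also be written without row-wise `apply`. The sketch below is an alternative formulation on made-up data, not the code used in this notebook; it relies on `groupby(...).transform('median')`, which returns one value per original row.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "car":   ["Ford", "Ford", "Ford", "Ford", "BMW", "Kia"],
    "model": ["Focus", "Focus", "Focus", "Kuga", "X5", "Rio"],
    "price": [5000.0, 7000.0, np.nan, np.nan, 20000.0, np.nan],
})

model_median = toy.groupby(["car", "model"])["price"].transform("median")
brand_median = toy.groupby("car")["price"].transform("median")
overall_median = toy["price"].median()

# Each fillna only touches values that are still missing after the previous step
toy["price_filled"] = (
    toy["price"]
    .fillna(model_median)     # 1) median of the same car/model
    .fillna(brand_median)     # 2) median of the same brand
    .fillna(overall_median)   # 3) median of the whole column
)
print(toy)
```

On a dataset of this size the vectorized form is usually noticeably faster than calling a Python function per row, though the results should match only as long as the fallback order is the same.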

- __engV__: Missing values in the __engV__ column are replaced with the __median__ __engV__ of the matching __[car, model]__ group, with the same fallback to the brand median and then to the median of the whole dataset.


In [None]:
def get_median_engV(x):
    brand = x.name[0]
    if x.count() > 0:
        return x.median() # Return median for a brand/model if the median exists.
    elif carsale.groupby(['car'])['engV'].count()[brand] > 0:
        brand_median = carsale.groupby(['car'])['engV'].apply(lambda x: x.median())[brand]
        return brand_median # Return median of brand if particular brand/model combo has no median,
    else:                 # but brand itself has a median for the 'engV' feature. 
        return carsale['engV'].median() # Otherwise return dataset's median for the 'engV' feature.
    
engV_median = carsale.groupby(['car','model'])['engV'].apply(get_median_engV).reset_index()
engV_median.rename(columns={'engV': 'engV_med'}, inplace=True)
engV_median.head()

In [None]:
def fill_with_median(x):
    if pd.isnull(x['engV']):
        return engV_median[(engV_median['car'] == x['car']) & (engV_median['model'] == x['model'])]['engV_med'].values[0]
    else:
        return x['engV']
    
carsale['engV'] = carsale.apply(fill_with_median, axis=1)
carsale.head()

- __mileage__: Missing values in the __mileage__ column are replaced with the __median__ __mileage__ of the matching group. Observations of the data show that mileage falls from year to year for the same car/model combination, so __year__ is a natural additional grouping key; note that the code below groups by __[car, model]__ only.

In [None]:
def get_median_mileage(x):
    brand = x.name[0]
    if x.count() > 0:
        return x.median() # Return median for a brand/model if the median exists.
    elif carsale.groupby(['car'])['mileage'].count()[brand] > 0:
        brand_median = carsale.groupby(['car'])['mileage'].apply(lambda x: x.median())[brand]
        return brand_median # Return median of brand if particular brand/model combo has no median,
    else:                 # but brand itself has a median for the 'mileage' feature. 
        return carsale['mileage'].median() # Otherwise return dataset's median for the 'mileage' feature.
    
mileage_median = carsale.groupby(['car','model'])['mileage'].apply(get_median_mileage).reset_index()
mileage_median.rename(columns={'mileage': 'mileage_med'}, inplace=True)
mileage_median.head()

In [None]:
def fill_with_median(x):
    if pd.isnull(x['mileage']):
        return mileage_median[(mileage_median['car'] == x['car']) & (mileage_median['model'] == x['model'])]['mileage_med'].values[0]
    else:
        return x['mileage']
    
carsale['mileage'] = carsale.apply(fill_with_median, axis=1)
carsale.head()
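
If year should take part in the imputation, it can be added as the most specific grouping level with coarser levels as fallbacks. The following is a sketch under that assumption, on made-up data; it is not the code this notebook ran.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "car":     ["Ford"] * 5,
    "model":   ["Focus"] * 5,
    "year":    [2008, 2008, 2008, 2014, 2014],
    "mileage": [150.0, 170.0, np.nan, 60.0, np.nan],
})

filled = toy["mileage"]
# Go from the most specific grouping to the least specific one
for keys in (["car", "model", "year"], ["car", "model"], ["car"]):
    filled = filled.fillna(toy.groupby(keys)["mileage"].transform("median"))
filled = filled.fillna(toy["mileage"].median())   # last resort: overall median

print(filled.tolist())
```

With year included, the 2008 gap receives 160 and the 2014 gap receives 60; grouping by car and model alone would give both gaps the same value of 150.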

### *  Handling categorical data
- __drive__: Missing values in the __drive__ column are replaced with the __mode__ of __drive__ in the matching __[car, model]__ group, falling back to the brand mode and then to the mode of the whole dataset.

In [None]:
def get_drive_mode(x):
    brand = x.name[0]
    if x.count() > 0:
        return x.mode() # Return mode for a brand/model if the mode exists.
    elif carsale.groupby(['car'])['drive'].count()[brand] > 0:
        brand_mode = carsale.groupby(['car'])['drive'].apply(lambda x: x.mode())[brand]
        return brand_mode # Return mode of brand if particular brand/model combo has no mode,
    else:                 # but brand itself has a mode for the 'drive' feature. 
        return carsale['drive'].mode() # Otherwise return dataset's mode for the 'drive' feature.
    
drive_modes = carsale.groupby(['car','model'])['drive'].apply(get_drive_mode).reset_index().drop('level_2', axis=1)
drive_modes.rename(columns={'drive': 'drive_mode'}, inplace=True)
drive_modes.head()

In [None]:
def fill_with_mode(x):
    if pd.isnull(x['drive']):
        return drive_modes[(drive_modes['car'] == x['car']) & (drive_modes['model'] == x['model'])]['drive_mode'].values[0]
    else:
        return x['drive']
    
carsale['drive'] = carsale.apply(fill_with_mode, axis=1)
carsale.head()

- __engType__: Missing values in the __engType__ column are replaced with the __mode__ of __engType__ in the matching __[car, model]__ group, falling back to the brand mode and then to the mode of the whole dataset.

In [None]:
def get_engType_mode(x):
    brand = x.name[0]
    if x.count() > 0:
        return x.mode() # Return mode for a brand/model if the mode exists.
    elif carsale.groupby(['car'])['engType'].count()[brand] > 0:
        brand_mode = carsale.groupby(['car'])['engType'].apply(lambda x: x.mode())[brand]
        return brand_mode # Return mode of brand if particular brand/model combo has no mode,
    else:                 # but brand itself has a mode for the 'engType' feature. 
        return carsale['engType'].mode() # Otherwise return dataset's mode for the 'engType' feature.
    
engType_modes = carsale.groupby(['car','model'])['engType'].apply(get_engType_mode).reset_index().drop('level_2', axis=1)
engType_modes.rename(columns={'engType': 'engType_mode'}, inplace=True)
engType_modes.head()

In [None]:
def fill_with_mode(x):
    if pd.isnull(x['engType']):
        return engType_modes[(engType_modes['car'] == x['car']) & (engType_modes['model'] == x['model'])]['engType_mode'].values[0]
    else:
        return x['engType']
    
carsale['engType'] = carsale.apply(fill_with_mode, axis=1)
carsale.head()
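
For categorical columns, the mode lookup can be sketched as one aggregation per grouping level followed by a join. This is an illustrative alternative on made-up data; `first_mode` is a hypothetical helper that picks a single value when a group has several modes.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "car":   ["Ford", "Ford", "Ford", "Ford", "BMW"],
    "model": ["Focus", "Focus", "Focus", "Kuga", "X5"],
    "drive": ["front", "front", np.nan, np.nan, "full"],
})


def first_mode(values):
    """Return the first mode of a group, or NaN if the group has no observed value."""
    modes = values.mode()   # mode() skips NaN and returns the modes in sorted order
    return modes.iloc[0] if not modes.empty else np.nan


model_mode = toy.groupby(["car", "model"])["drive"].agg(first_mode).rename("model_mode")
brand_mode = toy.groupby("car")["drive"].agg(first_mode).rename("brand_mode")

# Attach both lookups to every row, then fill from the most specific to the least
lookup = toy.join(model_mode, on=["car", "model"]).join(brand_mode, on="car")
toy["drive_filled"] = toy["drive"].fillna(lookup["model_mode"]).fillna(lookup["brand_mode"])
print(toy)
```

Taking a single value per group avoids the extra index level that appears when `mode()` returns more than one value for a group.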

- Now we check whether the dataset still has missing data. If not, we are good to go with plotting.

In [None]:
carsale.isnull().sum()

 - No missing data remains in our dataset, so we can move on to finding observations/patterns through plotting/charting. First, though, we save a post-profiling report (the state of the data after preprocessing).

<a id=section306></a>

### * Post Pandas Profiling

In [None]:
import pandas_profiling
profile = pandas_profiling.ProfileReport(carsale)
profile.to_file(outputfile="carsale_after_preprocessing.html")

The data has now been preprocessed and no longer contains missing values, so the pandas profiling report generated after preprocessing gives more useful insights. You can compare the two reports, i.e. __carsale_before_preprocessing_2.html__ and __carsale_after_preprocessing.html__.<br/>

Observations from the __carsale_after_preprocessing.html__ report:
- In the Dataset info, Total __Missing(%)__ = __0.0%__
- Number of __variables__ = __11__
- To inspect the updated details, click on Toggle details for more information about each feature.

- Let's look at the features available in the __carsale dataset__ in detail and __visualize them__.

In [None]:
carsale.car.value_counts().head(10).plot.bar()
plt.title("Top 10 car brands on sale")

 - This shows that __Volkswagen__ and __Mercedes-Benz__ are the most advertised brands on sale, which suggests they are popular choices among buyers.

In [None]:
carsale[carsale.price.isin(carsale.price.nlargest())].sort_values(['car','model','body','mileage','price'])

 - This shows the __details of the 5 highest-priced cars and their models__, which __can be used for email marketing aimed at the high-income group__ to achieve sales goals.

In [None]:
carsale[carsale.price.isin(carsale.price.nsmallest())].sort_values(['car','model','body','mileage','price'])

 - This shows the __details of the 5 lowest-priced cars and their models__, which __can be used for email marketing aimed at the low- to middle-income group__ to achieve sales goals.
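
One detail of the `isin(nlargest())` pattern used above is worth keeping in mind: when several cars share a price, it can return more than five rows. The toy example below (made-up prices) contrasts it with `DataFrame.nlargest`, which returns exactly the requested number of rows.

```python
import pandas as pd

toy = pd.DataFrame({
    "car": ["A", "B", "C", "D", "E", "F", "G"],
    "price": [900, 800, 700, 600, 500, 500, 100],
})

via_isin = toy[toy.price.isin(toy.price.nlargest())]   # pattern used above; n defaults to 5
via_nlargest = toy.nlargest(5, "price")                 # exactly five rows, first occurrence wins ties

print(len(via_isin), len(via_nlargest))
```

Which behaviour is preferable depends on the goal: the `isin` form keeps every car tied at a top price, while `nlargest` on the DataFrame gives a fixed-size list.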

In [None]:
sns.countplot(y='body', data=carsale, orient='h', hue='registration')
plt.title("Most preferred body type used in 1953-2016")

 - This shows that __cars with a "sedan" body have the most registrations/advertisements__ over the years. People mostly prefer the __sedan__ body type, so this information can be used to maximize sales and to plan production volumes.

In [None]:
sns.countplot(x='engType', data=carsale)
plt.title("Most preferred engType used over the years")

In [None]:
carsale.sort_values(['car','model','body','mileage','year'])

# Count advertisements per (year, registration) pair; size() keeps the column naming stable across pandas versions
df = (carsale.groupby(['year', 'registration']).size()
             .sort_values(ascending=False)
             .reset_index(name='RegCounts'))
display(df.head())
sns.lineplot(data=df, x='year', y='RegCounts', hue='registration')
#sns.scatterplot(data=df, x='year', y='RegCounts', hue='registration')
plt.title("Years group having max sale/registration")


 - This graph shows which year had the most registrations, and hence that the maximum sale was in the year __2008__.
    <br>This information can be used to start researching why sales peaked in this year.
 - Which factors affected these sales/registrations?
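
The peak year can also be read off numerically rather than from the plot. A minimal sketch on made-up rows, counting advertisements per year and registration status and then locating the maximum with `idxmax`:

```python
import pandas as pd

toy = pd.DataFrame({
    "year": [2007, 2008, 2008, 2008, 2012, 2012],
    "registration": ["yes", "yes", "yes", "no", "yes", "no"],
})

counts = (
    toy.groupby(["year", "registration"]).size()
       .reset_index(name="RegCounts")
)
registered = counts[counts["registration"] == "yes"]
peak_year = registered.loc[registered["RegCounts"].idxmax(), "year"]

print(peak_year)
```

If several years tie for the maximum, `idxmax` returns the first of them.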

In [None]:
sns.lineplot(data=carsale, y='price', x='year', hue='drive')
plt.title("year - price lineplot (1953 - 2016)")

In [None]:
sns.lineplot(data=carsale[carsale.year >= 2010], y='price', x='year', hue='drive')
plt.title("year - price lineplot (2010 - 2016)")

 - The above graphs show the __price__ distribution over the years (1953-2016). The trend is not uniform across every year, but in general __there has been an increase in price in recent years.__

In [None]:
sns.lineplot(x='mileage',y='price',data=carsale, hue='engType')
plt.title("mileage - price line Plot")

 - The above graph shows the relation between __mileage__ and __price__ as a line plot. No clear overall increase or decrease of price with mileage can be stated, but the price does change with the mileage value. So the __price varies with mileage__ too, and mileage should be considered as a factor in the calculation.

In [None]:
sns.heatmap(carsale.select_dtypes(include='number').corr(), annot=True, linewidths=.5)
plt.title("Heatmap for Highest correlated features for Carsale dataset")

 - The above graph shows __which features are most correlated with and dependent on each other__. It looks like __price__ and __year__ are highly correlated with each other, and the price may change (increase/decrease) over time.
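
The heatmap can be complemented by a ranked list of correlations with price. The sketch below uses made-up values chosen so that price rises with year and falls with mileage; the real coefficients have to be read from the actual data.

```python
import pandas as pd

toy = pd.DataFrame({
    "price":   [3000.0, 5000.0, 8000.0, 12000.0, 20000.0],
    "year":    [2000, 2004, 2008, 2012, 2016],
    "mileage": [250, 200, 140, 90, 20],
    "engType": ["Gas", "Petrol", "Diesel", "Petrol", "Petrol"],
})

numeric = toy.select_dtypes(include="number")   # drop text columns before correlating
price_corr = numeric.corr()["price"].drop("price").sort_values()

print(price_corr)
```

Sorting puts the most negatively correlated features first and the most positively correlated ones last.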

In [None]:
sns.lmplot(x='year', y='price', data=carsale, fit_reg=False, hue='engType')
plt.title("Price distribution over the years w.r.t. engType")

 - The above __multivariate graph__ shows the __price__ distribution over the years w.r.t. __engType__. As the years increase, there is a __significant increase in the prices of car models with engine type = "Petrol"__ compared to __"Gas" and "Diesel"__.

In [None]:
sns.pairplot(carsale, hue='engType', palette="viridis", height=3)

 - This pairplot confirms the observations already drawn from the graphs above:
    - Price varies with __Year__ and __Mileage__.
    - As __Year__ increases, the prices of __Petrol__ engine vehicles increase, which also depends on __mileage__.

<a id=section4></a>

## 4. Conclusions

- With the help of this notebook I learnt how exploratory data analysis can be carried out using Pandas plotting.
- I have also seen how packages like __matplotlib and seaborn__ can be used to develop better insights about the data.<br/>
- I have also seen how __preprocessing__ helps in dealing with _missing_ values and irregularities present in the data. I also learnt how to _create new features_, which can in turn help to better predict the price.
- I also made use of the __pandas profiling__ feature to generate an HTML report containing information about the various features present in the dataset.
- I have seen the impact of columns like __mileage, year and engType on the price increase/decrease rate__.
- The most important inference drawn from all this analysis is that I now know which __features price is highly positively and negatively correlated with.__
- This analysis will help me choose which __machine learning model__ to apply to predict the price on a test dataset in later projects.