## eBay Car Sales
Hello there! <br>
I worked on a Kaggle data set named 'Ebay Car Sales', using only the pandas library. This project is about cleaning the dataframe so that it can be analysed more easily. I have not included any visualizations; instead, I tried to get an organized data set by removing unnecessary data and doing a few other cleanup tasks.

<b> Data Set Columns </b> <br>
- name: Name of the car.
- seller: Whether the seller is private or a dealer.
- offerType: The type of listing.
- price: The price on the ad to sell the car.
- abtest: Whether the listing is included in an A/B test.
- vehicleType: The vehicle type.
- yearOfRegistration: The year in which the car was first registered.
- gearbox: The transmission type.
- powerPS: The power of the car in PS.
- model: The car model name.
- kilometer: How many kilometers the car has driven.
- monthOfRegistration: The month in which the car was first registered.
- fuelType: What type of fuel the car uses.
- brand: The brand of the car.
- notRepairedDamage: Whether the car has damage that has not yet been repaired.
- dateCreated: The date on which the eBay listing was created.
- nrOfPictures: The number of pictures in the ad.
- postalCode: The postal code for the location of the vehicle.
- lastSeenOnline: When the crawler saw this ad last online.



In [None]:
# Importing library
import numpy as np
import pandas as pd

In [None]:
# Reading the csv file
filename = "/kaggle/input/used-cars-database-50000-data-points/autos.csv"
df = pd.read_csv(filename, encoding= 'Windows-1252')

In [None]:
df.head()

In [None]:
print(df.columns)

-  **The column names are in camelCase, which is not very readable. Converting them to snake_case would make them easier to read.**

### Changing the column names to make it more readable 

In [None]:
df.columns = ['date_crawled', 'name', 'seller', 'offer_type', 'price', 'abtest', 'vehicle_type', 'registration_year', 
              'gear_box', 'power_PS', 'model','odometer', 'registration_month', 'fuel_type', 'brand','unrepaired_damage',
              'ad_created', 'Num_of_Pictures', 'postal_code', 'lastSeen']

In [None]:
df.columns

-   One of the column names (`lastSeen`) was not changed to snake_case.

In [None]:
# rename method to make specific changes
df.rename({'lastSeen':'last_seen'}, axis= 1, inplace= True)

In [None]:
df.columns
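The renaming above was done by hand. As a rough sketch, a mechanical conversion from camelCase to snake_case could look like the cell below. The helper `camel_to_snake` and the small `demo` frame are hypothetical names made up for this illustration, and the result is a literal translation (for example `year_of_registration`), not the hand-picked names used in this notebook.

```python
import re

import pandas as pd


def camel_to_snake(name):
    """Convert a camelCase name such as 'yearOfRegistration' to snake_case."""
    # Insert an underscore wherever an uppercase letter follows a lowercase
    # letter or a digit, then lowercase the whole string.
    return re.sub(r"(?<=[a-z0-9])(?=[A-Z])", "_", name).lower()


# Empty made-up frame that reuses a few of the original column names
demo = pd.DataFrame(columns=["dateCrawled", "offerType", "yearOfRegistration", "powerPS"])
demo.columns = [camel_to_snake(col) for col in demo.columns]
print(list(demo.columns))
```

A hand-written list still gives more control over the final names, which is probably why it is the more common choice for a small number of columns.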

# Data Cleaning

In [None]:
df.info()

-  **Several columns contain null values.**

### Let's compute the percentage of null values for every column.

In [None]:
df.isnull().sum() * 100 / df.shape[0]

-  **vehicle_type and unrepaired_damage have the highest percentage of null values.**
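The same percentages can also be obtained by taking the mean of the boolean null mask, since the mean of True/False values is the fraction of True values. The cell below is a small sketch on a made-up frame (`demo` and its values are invented for illustration only).

```python
import numpy as np
import pandas as pd

# Made-up sample with a few missing values
demo = pd.DataFrame({
    "vehicle_type": ["bus", np.nan, "coupe", np.nan],
    "gear_box": ["manual", "automatic", np.nan, "manual"],
    "brand": ["opel", "bmw", "audi", "ford"],
})

# The mean of a boolean mask is the fraction of True values per column
null_pct = demo.isnull().mean() * 100
print(null_pct.sort_values(ascending=False))
```

Sorting the result puts the columns with the most missing values at the top, which makes them easier to spot.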

In [None]:
# To see the column data types
df.dtypes

-  **The data types of some columns are not appropriate. For example, the price and odometer columns should not be of object type, so we are going to change them. Before changing them, we should check their format to see whether anything needs to be cleaned first. I have also noticed that the registration_month column would be easier to read if it showed the month name.**

In [None]:
df.head()

-   **There are unwanted symbols in the price and odometer columns.**

### Removing the dollar and km signs from the price and odometer columns, then changing their names and data types

In [None]:
# for the price column
# defining a function
def replace(x):
    x = x.replace('$','')
    x = x.replace(',','')
    return x

# applying that function
df['price'] = df['price'].apply(replace)

In [None]:
df['price'].head()

In [None]:
# For the odometer column 
def replace_odo(x):
    x = x.replace(',','')
    x = x.replace('km','')
    return x

df['odometer'] = df['odometer'].apply(replace_odo) 

In [None]:
# Assigning the data type in a dictionary
dic = {'price': 'float', 'odometer': 'float'}

# Changing the data type
df = df.astype(dic)

In [None]:
df.dtypes
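The cleanup and the type conversion can also be chained with pandas' vectorized string methods instead of `apply`. This is only a sketch on a made-up frame; the sample strings assume values shaped like `$5,000` and `150,000km`. Passing `regex=False` matters for the dollar sign, because `$` is otherwise a regular-expression anchor.

```python
import pandas as pd

# Made-up sample in the assumed raw format
demo = pd.DataFrame({
    "price": ["$5,000", "$8,500", "$0"],
    "odometer": ["150,000km", "70,000km", "5,000km"],
})

# regex=False treats '$' as a literal character rather than a regex anchor
demo["price"] = (
    demo["price"]
    .str.replace("$", "", regex=False)
    .str.replace(",", "", regex=False)
    .astype(float)
)
demo["odometer"] = (
    demo["odometer"]
    .str.replace("km", "", regex=False)
    .str.replace(",", "", regex=False)
    .astype(float)
)
print(demo.dtypes)
```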

In [None]:
# Changing the name to make it more readable
df.rename({'price':'price_in_dollar', 'odometer':'odometer_in_km'}, axis= 1, inplace= True)

In [None]:
df.columns

-  **As mentioned before, I will create a new column from registration_month that holds the month name, to make the dataframe more readable.**

In [None]:
# Let's check the unique value in the registration_month column
df['registration_month'].unique()

-  **The value 0 does not correspond to any month. So, we will use 'unknown' for the value 0 and the matching month names for the others.**

In [None]:
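# Month number -> month name; 0 has no matching month
d = {0:'unknown',1:'Jan',2:'Feb',3:'Mar',4:'Apr',5:'May',6:'Jun',7:'Jul',8:'Aug',9:'Sep',10:'Oct',11:'Nov',12:'Dec'}

# map() keeps the result the same length as the dataframe even if a value
# is missing from the dictionary; such values are also labelled 'unknown'
df['reg_month_in_words'] = df['registration_month'].map(d).fillna('unknown')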

In [None]:
df.head(3)

-  **It would be better if registration_year, registration_month and reg_month_in_words were positioned side by side, but they are not. Let's fix this first.**

In [None]:
# We could assign a rearranged list of column names to the dataframe, but I am going to try the drop and insert methods here.
# Assigning the columns in a variable
year = df['registration_year']
month = df['registration_month']
month_words = df['reg_month_in_words']

# Dropping all the columns
df.drop(labels=['registration_year'], axis=1, inplace= True)
df.drop(labels=['registration_month'], axis=1, inplace= True)
df.drop(labels=['reg_month_in_words'], axis=1, inplace= True)

# Inserting to our expected position
df.insert(6, 'registration_year', year)
df.insert(7, 'registration_month', month)
df.insert(8, 'reg_month_in_words', month_words)

In [None]:
df.head(3)

- **Now it's easier to look at the year and month columns together.**
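For comparison, here is a sketch of the other approach: building the new column order as a list and selecting the columns in that order. The `demo` frame and the target `position` are made up for illustration.

```python
import pandas as pd

# Made-up frame where the three registration columns are scattered
demo = pd.DataFrame({
    "name": ["car_a", "car_b"],
    "registration_year": [2004, 1997],
    "model": ["golf", "astra"],
    "registration_month": [3, 6],
    "reg_month_in_words": ["Mar", "Jun"],
})

# Columns that should sit next to each other
grouped = ["registration_year", "registration_month", "reg_month_in_words"]

# Keep every other column in its current order, then splice the group in
others = [col for col in demo.columns if col not in grouped]
position = 1  # hypothetical target position for the first grouped column
new_order = others[:position] + grouped + others[position:]

demo = demo[new_order]
print(list(demo.columns))
```

Selecting with a list returns a new dataframe, so nothing has to be dropped and re-inserted.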

-  **The date_crawled, ad_created and last_seen columns include a time component, which is not needed here. So, we are going to remove the time part and also change the data type if needed.**

### Removing the time and also changing the format

In [None]:
df[['date_crawled','ad_created','last_seen']].head()

In [None]:
# Since the first 10 characters are the date, we will just apply slicing
for item in ['date_crawled','ad_created','last_seen']:
    df[item] = df[item].str[:10]

In [None]:
df.head(3)

In [None]:
# Changing the type of these columns to datetime
# The explicit format already fixes the year-month-day order
df['date_crawled'] = pd.to_datetime(df['date_crawled'], format= "%Y-%m-%d")
df['ad_created'] = pd.to_datetime(df['ad_created'], format= "%Y-%m-%d")
df['last_seen'] = pd.to_datetime(df['last_seen'], format= "%Y-%m-%d")

In [None]:
df.dtypes

-   **These three columns are now of datetime type.**
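Another way to reach the same result is to parse the full timestamp first and then reset the time part to midnight with `.dt.normalize()`. This is a sketch that assumes the raw strings look like `YYYY-MM-DD HH:MM:SS`; the two sample values are made up.

```python
import pandas as pd

# Made-up timestamps in the assumed raw format
demo = pd.DataFrame({
    "date_crawled": ["2016-03-26 17:47:46", "2016-04-04 13:38:56"],
})

# Parse the complete timestamp, then reset the time part to midnight
parsed = pd.to_datetime(demo["date_crawled"], format="%Y-%m-%d %H:%M:%S")
demo["date_crawled"] = parsed.dt.normalize()
print(demo["date_crawled"])
```

Parsing before truncating has the small advantage that a malformed timestamp raises an error instead of being silently sliced.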

In [None]:
# if we want we can just extract month, year or day from these
df['date_crawled'][0].year
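The line above reads the year from a single row. For a whole column, the `.dt` accessor does the same thing for every row at once. A minimal sketch with made-up dates:

```python
import pandas as pd

# Made-up dates standing in for one of the datetime columns
dates = pd.Series(pd.to_datetime(["2016-03-26", "2016-04-04", "2016-03-12"]))

# The .dt accessor exposes date components for the whole column at once
years = dates.dt.year
months = dates.dt.month
print(months.value_counts().sort_index())
```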

-   **Now I will try to reduce the dataframe by removing wrong and unnecessary data. Let's start with the price column and look at the price range. If some cars are extremely expensive and there are only a few rows for them, we will get rid of those rows.**

In [None]:
df.head(3)

In [None]:
df['price_in_dollar'].value_counts().sort_index(ascending= False).head(20)

-   **Here, we found some outliers. Prices such as 10 million and 99 million are far higher than the others, so we can remove these rows.**

In [None]:
# Keeping prices from 0 up to (but not including) 10 million
df = df[(df['price_in_dollar'] >= 0) & (df['price_in_dollar'] < 10000000)]

In [None]:
df.shape
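One detail worth keeping in mind when filtering a range: `Series.between` includes both endpoints by default, so whether a listing priced at exactly 10 million survives depends on how the filter is written. The sketch below compares the two options on a few made-up prices.

```python
import pandas as pd

# Made-up prices, including one exactly on the boundary
prices = pd.Series([0.0, 4500.0, 10_000_000.0, 99_999_999.0])

# between() includes both endpoints by default
inclusive_mask = prices.between(0, 10_000_000)

# An explicit comparison makes the upper bound exclusive
exclusive_mask = (prices >= 0) & (prices < 10_000_000)

print(prices[inclusive_mask].tolist())
print(prices[exclusive_mask].tolist())
```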

-   **Let's look at the registration_year column now.**

In [None]:
df['registration_year'].value_counts().sort_index(ascending= False).head(20)

In [None]:
df['registration_year'].value_counts().sort_index(ascending= True).head(20)

-   **We can easily spot the anomalies here. Some of these years cannot be real, for example 1000, 9996 and 9999. So, we are also going to remove these rows.**

In [None]:
# Taking the year from 1900 to 2018
df = df[df['registration_year'].between(1900,2018)]

In [None]:
df.shape

### We have cleaned the data as much as possible. Now, let's try to get some insight. We will work only on the top 20 brands and try to find the relation between their price and mileage.

In [None]:
df.head(3)

In [None]:
# Determining the top car brands in terms of their number of listings
top_20 = df['brand'].value_counts(ascending= False).head(20)
top_20

-   **Volkswagen has the most cars listed, followed by opel, bmw, mercedes and then the others. Let's check out their mean prices.**

In [None]:
# making a list of top 20 brand to run the loop
brand_list = list(top_20.index )
brand_list

In [None]:
# Empty dictionary
mean_dic = {}

# Running a loop to store the mean price of each brand
for item in brand_list:
    mean = df[df['brand'] == item]['price_in_dollar'].mean()
    mean_dic[item] = mean
        
mean_dic

-   **The dictionary keeps the brands in the order of the top 20 list, not sorted by mean price, and it is hard to read. What we can do is make another dataframe for these top 20 car brands so that it becomes easier to understand the insights.**
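As a side note, the per-brand mean can also be computed without a loop by grouping on the brand column. The cell below is a sketch on a made-up frame; the brands and prices are invented for illustration and are not results from the data set.

```python
import pandas as pd

# Made-up listings
demo = pd.DataFrame({
    "brand": ["opel", "bmw", "opel", "bmw", "audi"],
    "price_in_dollar": [2000.0, 9000.0, 4000.0, 7000.0, 10000.0],
})

# One pass over the data: mean price per brand, highest first
mean_price = (
    demo.groupby("brand")["price_in_dollar"]
    .mean()
    .sort_values(ascending=False)
)
print(mean_price)
```

The result is already a Series indexed by brand, so it can be sorted or turned into a dataframe directly.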

In [None]:
# Making a series to create the dataframe
mean_series = pd.Series(mean_dic)
mean_series

In [None]:
brand_mean = pd.DataFrame(mean_series, columns=['mean_price'])
brand_mean

In [None]:
# Now calculating the average mileage of those cars
dic_mileage = {}
for item in brand_list:
    mean = df[df['brand'] == item]['odometer_in_km'].mean()
    dic_mileage[item] = mean
        
print(dic_mileage)

In [None]:
# Creating series
mileage_series = pd.Series(dic_mileage)

# Dataframe
brand_mileage = pd.DataFrame(mileage_series, columns= ['mileage'])
brand_mileage

In [None]:
brand_count = dict(top_20)
brand_count

In [None]:
# Creating series
count_series = pd.Series(dict(top_20))

# Creating the Dataframe
brand_count = pd.DataFrame(count_series, columns= ['count'])
brand_count

In [None]:
brand_price_mileage_count = pd.concat([brand_mean, brand_mileage, brand_count], axis=1)
brand_price_mileage_count

-  **Now it will be super comfy to play with this dataframe**
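A table with the same three columns could also be built in a single step with named aggregation in `groupby(...).agg(...)`. This is only a sketch: the `demo` frame is made up, and the output column names simply mirror the ones used in this notebook.

```python
import pandas as pd

# Made-up listings
demo = pd.DataFrame({
    "brand": ["opel", "bmw", "opel", "bmw", "opel"],
    "price_in_dollar": [2000.0, 9000.0, 4000.0, 7000.0, 3000.0],
    "odometer_in_km": [150000.0, 125000.0, 100000.0, 150000.0, 125000.0],
})

# Each keyword is an output column: (input column, aggregation)
summary = (
    demo.groupby("brand")
    .agg(
        mean_price=("price_in_dollar", "mean"),
        mileage=("odometer_in_km", "mean"),
        count=("price_in_dollar", "size"),
    )
    .sort_values(by="count", ascending=False)
)
print(summary)
```

To limit the table to the top 20 brands, the frame could be filtered with `isin(brand_list)` before grouping.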

In [None]:
# Sorting by number of listings
brand_price_mileage_count.sort_values(by= 'count', ascending= False)

-   **Here, we see that the price of a car appears to be related to its mileage and also to its brand value. The mean mileage is nearly the same for the first five car brands, but their mean prices are different.**
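One way to put a number on such a relation is the Pearson correlation between the two columns, available through `Series.corr`. The sketch below uses a made-up summary table; the brand names and values are purely illustrative and say nothing about the real data.

```python
import pandas as pd

# Made-up brand summary; the numbers are illustrative, not results
summary = pd.DataFrame(
    {
        "mean_price": [3000.0, 5000.0, 8000.0, 9000.0],
        "mileage": [140000.0, 135000.0, 130000.0, 125000.0],
    },
    index=["brand_a", "brand_b", "brand_c", "brand_d"],
)

# Pearson correlation between the two columns, ranging from -1 to 1
corr = summary["mean_price"].corr(summary["mileage"])
print(round(corr, 3))
```

A value close to -1 would mean that brands with higher mean mileage tend to have lower mean prices; with only a handful of brands, such a number should be read with caution.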

In [None]:
def green(val):
    color = 'green'
    return 'color: %s' % color

In [None]:
# Coloring certain cells
# Note: pandas 2.1+ deprecates Styler.applymap in favour of Styler.map
brand_price_mileage_count[:5].style.applymap(green, subset=pd.IndexSlice[['volkswagen','opel'], ['mean_price','mileage']])

-   **Volkswagen's mean price is about 5k, while opel's is almost half of that, even though their mean mileage is nearly the same.**

In [None]:
# Coloring certain cells
brand_price_mileage_count[:5].style.applymap(green, subset=pd.IndexSlice['bmw':'audi', ['mean_price','mileage']])

-   **For bmw, mercedes and audi, the mean prices are quite similar relative to their mileage. This suggests that their customer demand and brand value might be similar.**

## In the end, I would like to say that this project was all about using different methods and attributes to clean the data set.

## Thank you so much for spending your time on this project. 