### Columns:

**General Columns**
* url: URL of the car listing
* short_description, description: Description of the car (in English and German) written by users

**Categorical Columns**
* make_model, make, model: Make and model of the car. Ex: Audi A1
* body_type, body: Body type of the car. Ex: van, sedan
* vat: VAT deductible, price negotiable
* registration, first_registration: First registration date and year of the car
* prev_owner, previous_owners: Number of previous owners
* type: new or used
* next_inspection, inspection_new: Information about inspection (inspection date, ...)
* body_color, body_color_original: Color of the car. Ex: Black, red
* paint_type: Paint type of the car. Ex: Metallic, Uni/basic
* upholstery: Upholstery information (texture, color)
* gearing_type: Type of gear. Ex: automatic, manual
* fuel: Fuel type. Ex: diesel, benzine
* co2_emission, emission_class, emission_label: Emission information
* drive_chain: Drive chain. Ex: front, rear, 4WD
* consumption: Fuel consumption of the car in city, country and combined driving (l/100 km)
* country_version
* entertainment_media
* safety_security
* comfort_convenience
* extras

**Quantitative Columns**
* price: Price of the car
* km: Mileage of the car (km)
* hp: Engine power of the car (kW)
* displacement: Engine displacement of the car (cc)
* warranty: Warranty period (months)
* weight: Weight of the car (kg)
* nr_of_doors: Number of doors
* nr_of_seats: Number of seats
* cylinders: Number of cylinders
* gears: Number of gears



# PART 1 `(Data Cleaning)`

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import math
from scipy import stats
from scipy import special

In [None]:
df_org = pd.read_json('./data/scout_car.json', lines=True)

In [None]:
df = df_org.copy()

In [None]:
df.head(3).T

In [None]:
df.info()

## change column names

In [None]:
df.columns.str.lower().str.replace(' ', '_').str.replace('\n','').str.replace('_&','')

In [None]:
# regex=False makes '.' a literal dot regardless of the pandas version
df.columns = df.columns.str.lower().str.replace(' ','_').str.replace('.','', regex=False).str.replace('\n','').str.replace('_&','')

df.columns
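
The renaming chain above can be wrapped in a small helper so that the same rules are applied wherever the raw file is loaded. This is only a sketch: the function name `normalize_columns` and the sample column names are made up for illustration and are not taken from the dataset.

```python
import pandas as pd

def normalize_columns(columns):
    """Lower-case names, turn spaces into underscores, drop '.', newlines and '_&'."""
    return (
        columns.str.lower()
        .str.replace(' ', '_', regex=False)
        .str.replace('.', '', regex=False)   # literal dot, not "any character"
        .str.replace('\n', '', regex=False)
        .str.replace('_&', '', regex=False)
    )

# Hypothetical raw headers, similar in shape to scraped column names.
demo = pd.DataFrame(columns=['Body Color', 'Nr. of Doors', '\nComfort & Convenience\n'])
demo.columns = normalize_columns(demo.columns)
print(list(demo.columns))
```

Passing `regex=False` on every step keeps the behavior identical across pandas versions, since the default of `regex` changed in pandas 2.0.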

## url column

In [None]:
df.drop('url', axis=1, inplace=True)

## make_model column

In [None]:
df['make_model'].value_counts(dropna=False)

## displacement column

In [None]:
df['displacement'].value_counts(dropna=False)

In [None]:
df['displacement'].str[0].str.replace(',','').str.findall(r'\d+').str[0].astype('float')

In [None]:
#df['displacement'].str[0].str.replace(',','').str.replace('\n','').str.replace(' cc','').value_counts().index

In [None]:
df['displacement'].str[0].str.replace(',','').str.findall(r'\d+').str[0].astype('float')

In [None]:
df['displacement'].str[0].str.replace('\n','').str.replace(' cc','').str.replace(',','').astype(float)

In [None]:
df['displacement'] = df['displacement'].str[0].str.replace(',','').str.findall(r'\d+').str[0].astype('float')

In [None]:
df['displacement'].value_counts(dropna=False).index.sort_values()

## short_description column

In [None]:
df['short_description']

In [None]:
df['short_description'].str.findall(r'\d\.\d').str[0].astype(float)

In [None]:
sd_disp = df['short_description'].str.findall(r'\d\.\d').str[0].astype(float)

In [None]:
df['sd_disp'] = sd_disp*1000

In [None]:
df['sd_disp'] = df['sd_disp'].replace(1600,1598).replace(1800,1798)

In [None]:
df.loc[df['sd_disp']<800,'sd_disp'] = np.nan

In [None]:
df.displacement.value_counts()

In [None]:
def disp(d1,d2):
    if (d1>4000) | (d1<700) | np.isnan(d1):
        if np.isnan(d2):
            return d1
        else:
            return d2
    else:
        return d1

In [None]:
df.apply(lambda x: disp(x['displacement'],x['sd_disp']),axis=1)

In [None]:
df['displacement'] = df.apply(lambda x: disp(x['displacement'],x['sd_disp']),axis=1)
    

In [None]:
df[['displacement', 'sd_disp']].value_counts()

* The displacement values in the short_description column were extracted and compared with the displacement column. Displacement values that were null, below 700 cc or above 4000 cc were replaced with the value from the short description where one was available.
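
The same fallback rule can also be written without a row-wise `apply`. The sketch below uses a small made-up frame; the 700 and 4000 cc limits come from the `disp` function above, while the column name `displacement_filled` is introduced here only for illustration.

```python
import numpy as np
import pandas as pd

# Toy data: listed displacement next to the value parsed from the short description.
demo = pd.DataFrame({
    'displacement': [1598.0, np.nan, 16000.0, 2.0, np.nan],
    'sd_disp':      [1600.0, 1400.0, 1600.0, np.nan, np.nan],
})

# A listed value is implausible when it is missing or outside 700-4000 cc.
implausible = (
    demo['displacement'].isna()
    | (demo['displacement'] > 4000)
    | (demo['displacement'] < 700)
)

# Use the short-description value only where it exists and the listed one is implausible.
use_fallback = implausible & demo['sd_disp'].notna()
demo['displacement_filled'] = demo['displacement'].where(~use_fallback, demo['sd_disp'])
print(demo)
```

As in `disp`, an implausible value with no fallback is left unchanged rather than set to null.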

In [None]:
df.drop(['short_description', 'sd_disp'], axis=1, inplace=True)

In [None]:
df['displacement'].value_counts(dropna=False)

## body_type column

In [None]:
df['body_type'].value_counts(dropna=False)

## Price Column

In [None]:
df['price'].sort_values()

In [None]:
df[df['price']<1000].T

In [None]:
df[df['price']<4000].index

In [None]:
price_outlier = df[df['price']<4000].index

In [None]:
df.drop(price_outlier, inplace=True)
# We drop these 4 rows (price below 4000) since their price and km values are meaningless

In [None]:
df.price.sort_values()

## vat column 

In [None]:
df.vat.value_counts(dropna=False)

## km column

In [None]:
df.km.value_counts(dropna=False)

In [None]:
df[df['km'] == '- km'].registration.value_counts()

In [None]:
df.km.str.replace(',','').str.findall(r'\d+').str[0].astype(float)

In [None]:
df.km = df.km.str.replace(',','').str.findall(r'\d+').str[0].astype(float)

In [None]:
df[df.km<100].km.value_counts()

In [None]:
df[df.km<100].km.count()

* The km column was cleaned and converted to float. '- km' entries became np.nan.
* There are 2362 cars with less than 100 km. These cars will be checked after cleaning; if a car's registration year is inconsistent with its km, the value can be filled in another way.
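
A compact alternative for this kind of parsing is `str.extract` followed by `pd.to_numeric`. The sample strings below are invented to resemble the raw km values, including the '- km' placeholder.

```python
import pandas as pd

# Invented examples in the same shape as the raw km column.
raw_km = pd.Series(['56,013 km', '80,000 km', '- km', '10 km'])

km = pd.to_numeric(
    raw_km.str.replace(',', '', regex=False)   # drop thousands separators
          .str.extract(r'(\d+)', expand=False), # first run of digits, NaN if none
    errors='coerce',
)
print(km)
```

Entries without any digits end up as NaN, which matches how '- km' is handled in the cells above.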

## registration column

In [None]:
df.registration.value_counts(dropna=False)

In [None]:
df.registration = df.registration.replace('-/-',np.nan)

In [None]:
df.registration.value_counts(dropna=False)

In [None]:
pd.to_datetime(df.registration)

In [None]:
pd.DatetimeIndex(df['registration']).year

In [None]:
df['year'] = pd.DatetimeIndex(df['registration']).year 

In [None]:
df['month'] = pd.DatetimeIndex(df['registration']).month 

In [None]:
df[['registration','year','month']]

In [None]:
df.registration.isnull().sum()

In [None]:
df.drop('registration', axis=1, inplace=True)

* The registration column was parsed as a datetime.
* The year and month columns were created from it, and the registration column was then dropped.
* All '-/-' placeholders were converted to np.nan.
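
Parsing with an explicit format avoids relying on date inference. The sketch below assumes the raw values look like `MM/YYYY`, which is a guess based on the '-/-' placeholder; the sample values are made up.

```python
import pandas as pd

# Assumed layout: month/year strings, with '-/-' marking a missing date.
raw = pd.Series(['01/2016', '03/2017', '-/-', None])

# errors='coerce' turns anything that does not match the format into NaT.
parsed = pd.to_datetime(raw, format='%m/%Y', errors='coerce')
year = parsed.dt.year
month = parsed.dt.month
print(pd.DataFrame({'raw': raw, 'year': year, 'month': month}))
```

With `errors='coerce'` the separate replacement of '-/-' by np.nan is not strictly needed, because unparseable entries become NaT anyway.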

## prev_owner and Previous Owners columns

In [None]:
df.prev_owner.value_counts(dropna=False)

In [None]:
df.prev_owner.str[0]

In [None]:
df.prev_owner = df.prev_owner.str[0].astype('float')

In [None]:
df.prev_owner.value_counts(dropna=False)

### previous_owners column

In [None]:
df['previous_owners'].value_counts(dropna=False)

In [None]:
df['previous_owners'].str.findall(r'\d+').str[0].value_counts(dropna=False)

In [None]:
df['previous_owners'] = df['previous_owners'].str.findall(r'\d+').str[0].astype('float')

In [None]:
df['previous_owners'].value_counts(dropna=False)

In [None]:
(df['previous_owners']-df['prev_owner']).value_counts(dropna=False)

### Combine these two columns by apply method

In [None]:
def prev_owner_combine(p1,p2):
    # Both columns agree: keep the shared value
    if p1 == p2:
        return p1
    # p1 is missing: fall back to p2 (which may itself be NaN)
    elif np.isnan(p1):
        return p2
    # p2 is missing: keep p1
    elif np.isnan(p2):
        return p1
    # Both present but different
    else:
        return 'conflict'

In [None]:
df.apply(lambda x: prev_owner_combine(x['prev_owner'],x['previous_owners']), axis=1)

In [None]:
df['prev_owner'] = df.apply(lambda x: prev_owner_combine(x['prev_owner'],x['previous_owners']), axis=1)

In [None]:
df['prev_owner'].value_counts(dropna=False)

In [None]:
df.drop(['previous_owners'],axis=1, inplace=True)

* The prev_owner and previous_owners columns were combined, and the previous_owners column was dropped.
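
For comparison, the same merge can be sketched with `combine_first`. Unlike `prev_owner_combine`, this variant does not write the string 'conflict' into the column; it keeps a separate boolean flag so the result stays numeric. The data below is made up.

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame({
    'prev_owner':      [1.0, np.nan, 2.0, np.nan, 1.0],
    'previous_owners': [1.0, 2.0, np.nan, np.nan, 3.0],
})

# Rows where both values are present but disagree.
conflict = (
    demo['prev_owner'].notna()
    & demo['previous_owners'].notna()
    & (demo['prev_owner'] != demo['previous_owners'])
)

# Take prev_owner where available, otherwise previous_owners; blank out conflicts.
combined = demo['prev_owner'].combine_first(demo['previous_owners']).mask(conflict)
print(pd.DataFrame({'combined': combined, 'conflict': conflict}))
```

Keeping the conflict flag separate means the combined column can be used in numeric operations without first filtering out strings.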

## kW column

In [None]:
df.drop('kw',axis=1,inplace=True)

## hp column

In [None]:
df.hp.str[:-3]

In [None]:
df.hp = df.hp.str.findall(r'\d+').str[0].astype('float')

In [None]:
#df.hp = df.hp.str[:-3].replace('-',np.nan).astype('float')

In [None]:
df.hp.value_counts(dropna=False)

In [None]:
df.hp.isnull().sum()

## type column

In [None]:
df['new_used'] = df.type.str[1]

In [None]:
df.new_used.value_counts(dropna=False)

* The new_used column was created. It records whether the car is new, used, pre-registered, a demonstration car or an employee's car.

In [None]:
df.type.str[0].value_counts(dropna=False)

In [None]:
df.type.str[2].value_counts(dropna=False)

In [None]:
df.type.str[3].value_counts(dropna=False)

In [None]:
df['fuel_type'] = df.type.str[3]

In [None]:
benzine = df.type.str[3].str.contains('Benzine', na=False, regex=True)

In [None]:
df['fuel_type'][benzine].value_counts()

In [None]:
particulate = df.type.str[3].str.contains('Particulate', na=False, regex=True)

In [None]:
df['particulate']='unparticulate'

In [None]:
df.loc[particulate,'particulate']='particulate'

In [None]:
df['particulate'].value_counts()

In [None]:
df.loc[benzine,'fuel_type'] = 'benzine'

In [None]:
df['fuel_type'][df['fuel_type'] == 'benzine'].value_counts()

In [None]:
# Named super_mask to avoid shadowing the built-in super
super_mask = df.type.str[3].str.contains('Super', na=False, regex=True)

In [None]:
gasoline = df.type.str[3].str.contains('Gasoline', na=False, regex=True)

In [None]:
df.loc[super_mask,'fuel_type'] = 'benzine'

In [None]:
df.loc[gasoline,'fuel_type'] = 'benzine'

In [None]:
df['fuel_type'].value_counts()

In [None]:
gas = df['fuel_type'].isin(['LPG','Liquid petroleum gas (LPG)',\
                              'CNG','CNG (Particulate Filter)',\
                              'Biogas','Domestic gas H'])
          

In [None]:
df.loc[gas,'fuel_type'] = 'gas'

In [None]:
df['fuel_type'].value_counts()

In [None]:
diesel = df['fuel_type'].isin(['Diesel (Particulate Filter)', 'Diesel'])

In [None]:
df.loc[diesel,'fuel_type'] = 'diesel'

In [None]:
others = df['fuel_type'].isin(['Others', 'Others (Particulate Filter)', 'Electric'])

In [None]:
df.loc[others,'fuel_type'] = 'others'

In [None]:
df['fuel_type'].value_counts(dropna=False)

In [None]:
df.drop('type',axis=1, inplace=True)

* The fuel_type column was cleaned. Another column, 'particulate', was created to flag cars with a particulate filter.
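
The step-by-step masks above can be condensed into one mapping function. This is a simplified sketch, not the exact logic of the cells: it matches on substrings and sends every unrecognized label to 'others', and the raw labels in the example are partly invented.

```python
import numpy as np
import pandas as pd

def normalize_fuel(label):
    """Collapse a raw fuel label into benzine / diesel / gas / others."""
    if not isinstance(label, str):
        return np.nan
    if any(key in label for key in ('Benzine', 'Super', 'Gasoline')):
        return 'benzine'
    if 'Diesel' in label:
        return 'diesel'
    if any(key in label for key in ('LPG', 'CNG', 'Biogas', 'Domestic gas')):
        return 'gas'
    return 'others'

# Hypothetical raw labels.
raw = pd.Series(['Super 95', 'Diesel (Particulate Filter)', 'CNG', 'Electric', None])
fuel_type = raw.map(normalize_fuel)
has_filter = raw.str.contains('Particulate', na=False)
print(pd.DataFrame({'raw': raw, 'fuel_type': fuel_type, 'has_filter': has_filter}))
```

Keeping the rules in one function makes the precedence explicit: benzine keywords are checked first, then diesel, then gas.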

## 'Next Inspection', 'Inspection new' columns

In [None]:
df["next_inspection"].str[0].str.replace("\n","").value_counts(dropna=False)

In [None]:
listINS=[]
for i in df["next_inspection"]:
    if type(i)==float:
        listINS.append(i)
    elif type(i)==list:
        listINS.append(i[0].strip())
    else:
        listINS.append(i.replace("\n",""))

In [None]:
df['next_inspection_date'] = listINS

In [None]:
df['next_inspection_date'] = pd.to_datetime(df['next_inspection_date'])

* The next_inspection_date column was created.

In [None]:
df['next_inspection'].str[1].str.replace('\n','')

In [None]:
df['car_emission'] = df['next_inspection'].str[1].str.replace('\n','').str[:-16]

In [None]:
df['car_emission'] = df['car_emission'].replace('',np.nan)

In [None]:
df['car_emission'][df['car_emission'].str.isdecimal()==False]

In [None]:
df['car_emission'] = df['car_emission'].replace(' 0 k',0)

In [None]:
df['car_emission'] = df['car_emission'].replace('0 k',0)

In [None]:
df['car_emission'] = df['car_emission'].astype('float')

In [None]:
df['car_emission'].isnull().sum()

In [None]:
df.drop('next_inspection', axis=1, inplace=True)

* Car emission column was created. It can be dropped.

In [None]:
df.drop('car_emission', axis=1, inplace=True)

## 'Inspection new' column

In [None]:
def inspection(a):
    if type(a)== list:
        return a[0].replace('\n', '')
    elif type(a)== str:
        return a.replace('\n', '')
    else:
        return a

In [None]:
df['inspection_new'] = df['inspection_new'].apply(inspection)

* Inspection new: The other parts of this column show fuel consumption. Since the same data appears in the consumption column below, this part was not extracted.

In [None]:
df['inspection_new'].str[2].value_counts()

In [None]:
#df['fuel_cons_comb'] = df['inspection_new'].str[2].str[:-16]

#df['fuel_cons_comb'] = df['fuel_cons_comb'].replace('',np.nan).astype('float')

In [None]:
#df['fuel_cons_city'] = df['inspection_new'].str[4].str[:-16]

In [None]:
#df['fuel_cons_city'] = df['fuel_cons_city'].replace('',np.nan).astype('float')

In [None]:
#df['fuel_cons_country'] = df['inspection_new'].str[6].str[:-16]

#df['fuel_cons_country'] = df['fuel_cons_country'].replace('',np.nan).astype('float')

## Warranty column

In [None]:
import re
def clean_warranty(a):
    if type(a) == list:
        b = re.findall(r'\d+', a[0])
        if len(b)== 0:
            return np.nan
        else:
            return b[0]
    elif type(a) ==str:
        b = re.findall(r'\d+', a)
        if len(b)== 0:
            return np.nan
        else:
            return b[0]
    else:
        return a

In [None]:
df['warranty'] = df['warranty'].apply(clean_warranty)

In [None]:
df['warranty'] = df['warranty'].astype('float')

* Only the number of months was extracted from the warranty column; the other parts of each entry are meaningless.
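
The two branches of `clean_warranty` share the same regex step, so they can be folded into one. The helper below is a sketch with a made-up name (`first_int`) and made-up inputs; it returns a float directly instead of a digit string.

```python
import re
import numpy as np

def first_int(value):
    """Return the first run of digits in a string (or in the first list item), else NaN."""
    if isinstance(value, list):
        value = value[0] if value else ''
    if not isinstance(value, str):
        return np.nan
    match = re.search(r'\d+', value)
    return float(match.group()) if match else np.nan

print(first_int(['\n12 months\n', 'other text']))
print(first_int('24 months'))
print(first_int(['\n']))
```

Used as `df['warranty'].apply(first_int)`, this would make the separate `astype('float')` step unnecessary.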

## Full Service

In [None]:
df['full_service'].str[2].value_counts(dropna=False)

In [None]:
df.drop('full_service', axis=1, inplace=True)

### In this column there is no meaningful data about service. It can be dropped.

## 'Non-smoking Vehicle' column

In [None]:
df['non-smoking_vehicle'].str[0].value_counts()

In [None]:
df.drop('non-smoking_vehicle', axis=1, inplace=True)

### This column holds no meaningful data. It can be dropped.

## 'null' column

In [None]:
df.drop('null', axis=1, inplace=True)

## 'Make' column

In [None]:
df['make'] = df['make'].str.replace('\n','')

In [None]:
df.drop('make', axis=1, inplace=True)

### This column contains only the make (brand), which matches the make_model column. It can be dropped.

## 'Model' column

In [None]:
df['model'] = df['model'].str[1]

In [None]:
df.drop('model', axis=1, inplace=True)

### This column can also be dropped. It contains only the model names.

## Offer Number column

In [None]:
df['offer_number'].str[0].value_counts()

In [None]:
df.drop('offer_number',axis=1,inplace=True)

### The offer_number column contains only listing IDs. It can be dropped.

## First Registration column

In [None]:
df['first_registration'] = df['first_registration'].str[1].astype('float')

In [None]:
### Compare the results with year.

In [None]:
df['first_registration'].value_counts(dropna=False)

In [None]:
df['year'].value_counts(dropna=False)

In [None]:
df.apply(lambda x: prev_owner_combine(x['first_registration'],x['year']), axis=1).value_counts(dropna=False)

In [None]:
df.drop('first_registration',axis=1, inplace=True)

* The year and first_registration columns are the same, so first_registration was dropped.

## 'Body Color' column

In [None]:
df['body_color'] = df['body_color'].str[1]

In [None]:
df['body_color'].value_counts(dropna=False)

## 'Paint Type' column

In [None]:
df['paint_type'] = df['paint_type'].str[0].str[1:-1]

In [None]:
df['paint_type'].value_counts(dropna=False)

## body_color_original column

In [None]:
df['body_color_original'] = df['body_color_original'].str[0].str[1:-1]

In [None]:
df['body_color_original'].value_counts(dropna=False)

In [None]:
df['body_color_original'].isnull().sum()

In [None]:
#import statsmodels.api as sm
#from statsmodels.formula.api import ols
#model = ols('price ~ C(body_color_original)', data=df).fit()
#anova_table = sm.stats.anova_lm(model, typ=2)
#anova_table

### This column also can be dropped

In [None]:
df.drop('body_color_original',axis=1,inplace=True)

## upholstery column

In [None]:
df['upholstery_material'] = df['upholstery'].str[0].str.replace('\n','').str.split(', ').str[0]

In [None]:
list_color = ['Black','Grey','Brown','Beige', 'Blue', 'White']
for i in list_color:
    df['upholstery_material'] = df['upholstery_material'].replace(i,np.nan)

In [None]:
df['upholstery_material'].value_counts(dropna=False)

In [None]:
df['upholstery_color'] = df['upholstery'].str[0].str.replace('\n','').str.replace(', ','')

In [None]:
list_uph_mat = ['Cloth', 'Part leather', 'Full leather', 'Other', 'Velour', 'alcantara']
for i in list_uph_mat:
    df['upholstery_color'] = df['upholstery_color'].str.replace(i,'')

In [None]:
df['upholstery_color'] = df['upholstery_color'].replace('',np.nan)

In [None]:
df['upholstery_color'].value_counts(dropna=False)

### The upholstery column was cleaned and split into two new columns, upholstery_material and upholstery_color. The upholstery column can be dropped.
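
The split can also be done in a single pass by checking each comma-separated part against the known materials and colors. The two sets below reuse the values from `list_uph_mat` and `list_color`; the function name and the sample strings are hypothetical.

```python
import numpy as np
import pandas as pd

MATERIALS = {'Cloth', 'Part leather', 'Full leather', 'Velour', 'alcantara', 'Other'}
COLORS = {'Black', 'Grey', 'Brown', 'Beige', 'Blue', 'White'}

def split_upholstery(text):
    """Return (material, color) from a string such as 'Cloth, Black'."""
    if not isinstance(text, str):
        return (np.nan, np.nan)
    parts = [p.strip() for p in text.split(',')]
    material = next((p for p in parts if p in MATERIALS), np.nan)
    color = next((p for p in parts if p in COLORS), np.nan)
    return (material, color)

# Made-up entries: both parts, color only, material only, missing.
raw = pd.Series(['\nCloth, Black\n', '\nGrey\n', '\nFull leather\n', None])
pairs = raw.map(split_upholstery)
demo = pd.DataFrame(pairs.tolist(), columns=['upholstery_material', 'upholstery_color'])
print(demo)
```

Matching against explicit sets means an entry that holds only a color is never mistaken for a material, which is what the replace loop above works around.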

In [None]:
df.drop('upholstery', axis=1,inplace=True)

## body column 

In [None]:
df['body'] = df['body'].str[1]

In [None]:
df['body'].value_counts(dropna=False)

### This column matches the body_type column

In [None]:
df['body_type'].value_counts(dropna=False)

In [None]:
df[(~(df['body']==df['body_type']))][['body','body_type']]

In [None]:
df.drop('body', axis=1,inplace=True)

### Since body and body_type columns exactly match, body column was dropped.
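
One detail worth keeping in mind for this kind of check: `==` treats two missing values as different, so rows where both columns are null show up as mismatches. A small sketch with made-up values that counts null pairs as matching:

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame({
    'body':      ['Sedans', 'Van', np.nan, 'Compact'],
    'body_type': ['Sedans', 'Van', np.nan, 'Coupe'],
})

# Treat rows where both sides are missing as agreeing.
both_missing = demo['body'].isna() & demo['body_type'].isna()
mismatch = ~((demo['body'] == demo['body_type']) | both_missing)
print(demo.loc[mismatch])
```

Only the last row is reported, since it is the only one where two non-null values differ.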

## nr_of_doors column

In [None]:
df['nr_of_doors'] = df['nr_of_doors'].str[0].str.replace('\n','').astype(float)

In [None]:
df['nr_of_doors'].value_counts(dropna=False)

## nr_of_seats column

In [None]:
df['nr_of_seats'] = df['nr_of_seats'].str[0].str.replace('\n','').astype(float)

In [None]:
df['nr_of_seats'].value_counts(dropna=False)

## model_code column

In [None]:
df['model_code'] = df['model_code'].str[0].str.replace('\n','')

In [None]:
df.drop('model_code',axis=1,inplace=True)

### This column can be dropped.

## gearing_type column

In [None]:
df['gearing_type'] = df['gearing_type'].str[1]

In [None]:
df['gearing_type'].value_counts(dropna=False)

## cylinders column

In [None]:
df['cylinders'] = df['cylinders'].str[0].str.replace('\n','')

In [None]:
df['cylinders'].value_counts(dropna=False)

## weight column

In [None]:
df['weight'] = df['weight'].str[0].str.replace(',','').str.findall(r'\d+').str[0].astype('float')

In [None]:
df['weight'][df['weight']<700]

## drive_chain column

In [None]:
df['drive_chain'] = df['drive_chain'].str[0].str.replace('\n','')

In [None]:
df['drive_chain'].value_counts(dropna=False)

## fuel column

In [None]:
fuel = df['fuel'].str[1]

In [None]:
particulate = fuel.str.contains('Particulate', na=False)  # na=False keeps the mask boolean

In [None]:
df.particulate[~particulate].value_counts()

In [None]:
benzine = fuel.str.contains('Benzine', na=False)

In [None]:
fuel[benzine] = 'benzine'

In [None]:
df.fuel_type[benzine].value_counts()

In [None]:
# Named super_mask to avoid shadowing the built-in super
super_mask = fuel.str.contains('Super', na=False)

In [None]:
fuel[super_mask] = 'benzine'

In [None]:
gasoline = fuel.str.contains('Gasoline', na=False)

In [None]:
fuel[gasoline] = 'benzine'

In [None]:
diesel = fuel.str.contains('Diesel', na=False)
df.fuel_type[diesel].value_counts()

In [None]:
fuel[diesel] = 'diesel'

In [None]:
fuel.value_counts(dropna=False)

In [None]:
gas = fuel.isin(['LPG','Liquid petroleum gas (LPG)',\
                              'CNG','CNG (Particulate Filter)',\
                              'Biogas','Domestic gas H'])

In [None]:
fuel[gas] = 'gas'

In [None]:
fuel.value_counts(dropna=False)

In [None]:
df.drop('fuel', axis=1, inplace=True)

* The fuel column matches the fuel_type column exactly. This was checked, and the fuel column was then dropped.

## Consumption

In [None]:
def consume_combined(a):
    if type(a) == list:
        if len(a) >3:
            for i in a:
                if 'comb' in i:
                    return i
        else:
            return a[0]            
    
    else:
        return a
    
df['consumption_comb'] = df['consumption'].apply(consume_combined)

In [None]:
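def cleaning_consumption(a):
    # Take the first item of a list entry (NaN for an empty list)
    if type(a) == list:
        a = a[0] if len(a) > 0 else np.nan
    # Pull the first number (e.g. "5.6") out of the text; NaN if none is found
    if type(a) == str:
        b = re.findall(r"\d\.?\d?", a)
        return b[0] if len(b) > 0 else np.nan
    else:
        return a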

In [None]:
def consume_city(a):
    if type(a) == list:
        if len(a) >3:
            for i in a:
                if 'city' in i:
                    return i
        else:
            return a[1]           
    
    else:
        return a
    
df['consumption_city'] = df['consumption'].apply(consume_city)


In [None]:
def consume_country(a):
    if type(a)== list:
        if len(a) >3:
            for i in a:
                if 'country' in i:
                    return i
        else:
            return a[2]            
    
    else:
        return a
    
df['consumption_country'] = df['consumption'].apply(consume_country)

In [None]:

df['consumption_comb'] = df['consumption_comb'].apply(cleaning_consumption).astype('float')
df['consumption_city'] = df['consumption_city'].apply(cleaning_consumption).astype('float')
df['consumption_country'] = df['consumption_country'].apply(cleaning_consumption).astype('float')

df.drop("consumption",axis=1,inplace=True)

## co2_emission column

In [None]:
df['co2_emission'] = df['co2_emission'].str[0].str.findall(r"\d+").str[0].astype('float')

In [None]:
df['co2_emission'].value_counts(dropna=False)

In [None]:
# df.drop('co2_emission',axis=1,inplace=True)

## emission class column

In [None]:
df['emission_class'] = df['emission_class'].str[0].str.replace('\n','')

In [None]:
df['emission_class'].value_counts(dropna=False)

In [None]:
# Assign the result back instead of using inplace=True on a column selection
df['emission_class'] = df['emission_class'].replace(['Euro 6','Euro 6d-TEMP','Euro 6d', 'Euro 6c'], 'Euro 6')

## comfort_convenience column

In [None]:
df['comfort_convenience'] = df['comfort_convenience'].astype('str').str.replace('[','').str.replace("]",'')

In [None]:
df['comfort_convenience'].astype('str').str.replace('[','').str.replace("]",'').str.get_dummies(sep=",")

### Only the list brackets were stripped from this column; it will be expanded with get_dummies later
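
As a sketch of the later get_dummies step, assuming each cell holds a Python list of feature names: joining the list with a separator and calling `str.get_dummies` gives one 0/1 column per feature. The feature names below are made up.

```python
import pandas as pd

# Hypothetical cells, each a list of feature names.
raw = pd.Series([
    ['Air conditioning', 'Cruise control'],
    ['Cruise control'],
    ['Air conditioning', 'Parking assist system sensors rear'],
])

# Join each list with a separator that cannot appear inside an item,
# then expand into one indicator column per distinct item.
dummies = raw.str.join('|').str.get_dummies(sep='|')
print(dummies)
```

Joining the lists directly avoids two side effects of `astype('str')`: the quote characters that stay around each item, and missing cells turning into the literal string 'nan'.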

## entertainment_media column

In [None]:
df['entertainment_media']

In [None]:
df['entertainment_media'] = df['entertainment_media'].astype('str').str.replace('[','').str.replace("]",'')

### Only the list brackets were stripped from this column; it will be expanded with get_dummies later

## extras column

In [None]:
df['extras'].astype('str').str.replace('[','').str.replace("]",'').str.get_dummies(sep=', ').sum()

In [None]:
df['extras'] = df['extras'].astype('str').str.replace('[','').str.replace("]",'')

### Only the list brackets were stripped from this column; it will be expanded with get_dummies later

## safety_security column

In [None]:
df['safety_security'] = df['safety_security'].astype('str').str.replace('[','').str.replace("]",'')

### Only the list brackets were stripped from this column; it will be expanded with get_dummies later

## description column

In [None]:
df['description']

In [None]:
df.drop('description',axis=1,inplace=True)

### This column was dropped since it contains the German description of the car written by users

## emission_label column

In [None]:
df['emission_label'].value_counts(dropna=False)


In [None]:
df['emission_label'] = df['emission_label'].str[0].str.findall(r'\((.*?)\)').str[0]

In [None]:
df['emission_label'].value_counts(dropna=False)

In [None]:
df.drop('emission_label',axis=1,inplace=True)

## gears column

In [None]:
df['gears'] = df['gears'].str[0].str.findall(r"\d+").str[0].astype('float')

In [None]:
df['gears'].value_counts(dropna=False)

## country_version column

In [None]:
df['country_version'] = df['country_version'].str[0].str.replace('\n','')

In [None]:
df['country_version'].value_counts(dropna=False)

In [None]:
### This column can be dropped.

In [None]:
df.drop('country_version',axis=1,inplace=True)

## electricity_consumption column

In [None]:
df.loc[df['electricity_consumption'].notnull(), 'electricity_consumption'] = 1

In [None]:
df.loc[df['electricity_consumption'].isnull(), 'electricity_consumption'] = 0

In [None]:
df['electricity_consumption'].value_counts(dropna=False)

## last_service_date column

In [None]:
df['last_service_date'] = pd.to_datetime(df['last_service_date'].str[0].str.replace('\n','').replace('',np.nan))

In [None]:
df['last_service_date'].value_counts(dropna=False, normalize=True)

### This column can be dropped since 96.8% is null.
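
The share of missing values can be computed for all columns at once, which makes this kind of drop decision easy to repeat. In the sketch below the data and the 0.7 threshold are made up for illustration; the notebook itself does not fix a threshold.

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame({
    'price': [14500, 16790, 15999, 21900],
    'last_service_date': [np.nan, np.nan, np.nan, '2019-01-01'],
    'gears': [6.0, np.nan, 5.0, 6.0],
})

# Fraction of null values per column, largest first.
null_share = demo.isnull().mean().sort_values(ascending=False)

# Hypothetical cut-off: flag columns that are more than 70% empty.
mostly_empty = null_share[null_share > 0.7].index.tolist()
print(null_share)
print(mostly_empty)
```

`isnull().mean()` works because True counts as 1 and False as 0, so the mean is the null fraction.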

In [None]:
df.drop('last_service_date',axis=1,inplace=True)

## other_fuel_types column

In [None]:
df['other_fuel_types'].value_counts(dropna=False)

In [None]:
df.drop('other_fuel_types',axis=1,inplace=True)

## availability column

In [None]:
df['availability'] = df['availability'].str.findall(r'\d+')

In [None]:
df.drop('availability',axis=1,inplace=True)

## last_timing_belt_service_date column

In [None]:
df['last_timing_belt_service_date'].str[0].value_counts(dropna=False)

In [None]:
df.drop('last_timing_belt_service_date',axis=1,inplace=True)

## available_from column

In [None]:
df['available_from'].value_counts(dropna=False)

In [None]:
df.drop('available_from',axis=1,inplace=True)

In [None]:
df.columns

In [None]:
# Also vat, next_inspection_date, month, particulate columns can be deleted. 
# After filling null values consumption_city, consumption_country columns can be deleted. 

In [None]:
df.isnull().sum()

In [None]:
## Month, particulate, upholstery color columns dropped.

In [None]:
df.drop(['month','particulate', 'upholstery_color'], axis=1, inplace=True)

In [None]:
df.to_csv("df_scout_clean.csv", index=False)